tapas.datasets.dataset.TabularDataset

class tapas.datasets.dataset.TabularDataset(data, description)

Bases: tapas.datasets.dataset.Dataset

Class to represent tabular data as a Dataset. Internally, the tabular data is stored as a pandas DataFrame and the schema is an array of types.

__init__(data, description)

Parameters

data (pandas.DataFrame (or a valid argument for pd.DataFrame).) –
description (tapas.datasets.data_description.DataDescription) –
label (str (optional)) –

Methods

`__init__`(data, description)	Initialise a TabularDataset from data and a description.
`add_records`(records[, in_place])	Add record(s) to dataset and return modified dataset.
`copy`()	Create a TabularDataset that is a deep copy of this one.
`create_subsets`(n, sample_size[, drop_records])	Create n subsets of this dataset, each of size sample_size, sampled without replacement.
`drop_records`([record_ids, n, in_place])	Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped.
`empty`()	Create an empty TabularDataset with the same description as the current one.
`get_records`(record_ids)	Get record(s) from the TabularDataset object.
`read`(filepath[, label])	Read csv and json files for dataframe and schema respectively.
`read_from_string`(data, description)	Read a TabularDataset from a csv string and a description.
`replace`(records_in[, records_out, in_place])	Replace a record with another one in the dataset. If records_out is empty, a random record is removed.
`sample`([n_samples, frac, random_state])	Sample a set of records from a TabularDataset object.
`view`([columns, exclude_columns])	Create a TabularDataset object that contains a subset of the columns of this TabularDataset.
`write`(filepath)	Write data and description to file.
`write_to_string`()	Return a string holding the dataset (as a csv).

Attributes

`as_numeric`	Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded.
`label`

add_records(records, in_place=False)

Add record(s) to dataset and return modified dataset.

Parameters

records (TabularDataset) – A TabularDataset object with the record(s) to add.
in_place (bool) – Whether to change the dataset in place (True) or return a modified copy (False). The default is False.

Returns

A new TabularDataset object with the record(s) added, or None if in_place=True.

Return type

TabularDataset or None

property as_numeric

Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded. For efficiency, the encoding is computed only once, so take care when modifying the TabularDataset after accessing this property: the encoded array is not recomputed.

The columns are kept in the order of the description, with categorical variables encoded over several contiguous columns.

Return type: np.array
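
To make the column layout concrete, here is a minimal pure-Python sketch of the encoding described above. It does not use tapas: the `schema` list and the `encode_row` helper are hypothetical stand-ins for the description and the encoder, and the real property returns a np.array rather than nested lists.

```python
# Hypothetical schema, in description order: (column name, kind, categories).
schema = [
    ("age", "numeric", None),
    ("colour", "categorical", ["red", "green", "blue"]),
    ("income", "numeric", None),
]


def encode_row(row, schema):
    """Encode one record: numeric values kept as is, categorical values 1-hot."""
    encoded = []
    for name, kind, categories in schema:
        value = row[name]
        if kind == "numeric":
            encoded.append(float(value))
        else:
            # Each categorical variable occupies one contiguous block of columns.
            encoded.extend(1.0 if value == c else 0.0 for c in categories)
    return encoded


rows = [
    {"age": 31, "colour": "green", "income": 40000},
    {"age": 58, "colour": "blue", "income": 52000},
]
matrix = [encode_row(r, schema) for r in rows]
```

Each encoded row has five columns here: one for `age`, three contiguous columns for `colour`, and one for `income`.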

copy()

Create a TabularDataset that is a deep copy of this one. In particular, the underlying data is copied and can thus be modified freely.

Returns: A copy of this TabularDataset.
Return type: TabularDataset

create_subsets(n, sample_size, drop_records=False)

Create n subsets of this dataset, each of size sample_size, sampled without replacement. Optionally, the sampled records can be dropped from this dataset.

Parameters

n (int) – Number of datasets to create.
sample_size (int) – Size of the subset datasets to be created.
drop_records (bool) – Whether to remove the records sampled from this dataset (in place).

Returns

A list of n subsets of this dataset, each of size sample_size.

Return type

list(TabularDataset)
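
The sampling behaviour can be sketched on a plain list. This is an illustration under assumptions, not the tapas implementation: the standalone `create_subsets` function and its `seed` argument are hypothetical, and it assumes that "without replacement" applies within each subset, with subsets becoming disjoint only when drop_records is True.

```python
import random


def create_subsets(records, n, sample_size, drop_records=False, seed=None):
    """Return n subsets of `records`, each drawn without replacement."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n):
        # Pick distinct positions, so no record appears twice in one subset.
        picked = rng.sample(range(len(records)), sample_size)
        subsets.append([records[i] for i in picked])
        if drop_records:
            # Remove the sampled records from the source list, in place.
            for i in sorted(picked, reverse=True):
                del records[i]
    return subsets


records = list(range(10))
subsets = create_subsets(records, n=2, sample_size=3, drop_records=True, seed=0)
```

With drop_records=True, the six sampled records are removed from `records`, leaving four behind.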

drop_records(record_ids=[], n=1, in_place=False)

Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped.

Parameters

record_ids (list[int]) – List of indexes of records to drop.
n (int) – Number of random records to drop if record_ids is empty.
in_place (bool) – Whether to change the dataset in place (True) or return a modified copy (False). The default is False.

Returns

A new TabularDataset object without the record(s) or None if in_place=True.

Return type

TabularDataset or None
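
The in_place convention (return a new object by default, or modify the dataset and return None) can be sketched with a small stand-in class. `TinyDataset` is hypothetical and is not part of tapas; it also assumes that record ids are list positions, which may differ from how tapas indexes records.

```python
import random


class TinyDataset:
    """Hypothetical stand-in that holds its records in a plain list."""

    def __init__(self, records):
        self.records = list(records)

    def drop_records(self, record_ids=None, n=1, in_place=False):
        if not record_ids:
            # No ids given: drop n randomly chosen positions instead.
            record_ids = random.sample(range(len(self.records)), n)
        to_drop = set(record_ids)
        kept = [r for i, r in enumerate(self.records) if i not in to_drop]
        if in_place:
            self.records = kept
            return None
        return TinyDataset(kept)


original = TinyDataset(["a", "b", "c", "d"])
smaller = original.drop_records(record_ids=[1])  # new object, original untouched
result = original.drop_records(record_ids=[0], in_place=True)  # returns None
```

A common pitfall with this convention is writing `ds = ds.drop_records(..., in_place=True)`, which rebinds `ds` to None.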

empty()

Create an empty TabularDataset with the same description as the current one. Shorthand for TabularDataset.get_records([]).

Returns: Empty tabular dataset.
Return type: TabularDataset

get_records(record_ids)

Get record(s) from the TabularDataset object.

Parameters: record_ids (list[int]) – List of indexes of records to retrieve.
Returns: A TabularDataset object with the record(s).
Return type: TabularDataset

classmethod read(filepath, label=None)

Read csv and json files for dataframe and schema respectively.

Parameters

filepath (str) – Full path to the csv and json, excluding the .csv or .json extension. Both files should have the same root name.
label (str or None) – An optional string to represent this dataset.

Returns

A TabularDataset.

Return type

TabularDataset
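
The shared-root naming convention for filepath can be illustrated with the standard library alone. This sketch does not call tapas: `read_pair` is a hypothetical helper, and the csv header row and the JSON layout used here are invented for the example rather than taken from the tapas schema format.

```python
import csv
import json
import os
import tempfile


def read_pair(filepath):
    """Load rows from filepath + '.csv' and a schema from filepath + '.json'."""
    with open(filepath + ".csv", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(filepath + ".json") as f:
        schema = json.load(f)
    return rows, schema


# Write a tiny example pair that shares the root name "census".
root = os.path.join(tempfile.mkdtemp(), "census")
with open(root + ".csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["age", "colour"])
    writer.writerow([31, "green"])
with open(root + ".json", "w") as f:
    json.dump([{"name": "age"}, {"name": "colour"}], f)

# The argument is the shared root: no ".csv" or ".json" extension.
rows, schema = read_pair(root)
```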

classmethod read_from_string(data, description)

Parameters

data (str) – The data in csv format.
description (DataDescription) –

Return type

TabularDataset

replace(records_in, records_out=[], in_place=False)

Replace a record with another one in the dataset. If records_out is empty, a random record is removed.

Parameters

records_in (TabularDataset) – A TabularDataset object with the record(s) to add.
records_out (list(int)) – List of indexes of records to drop.
in_place (bool) – Whether to change the dataset in place (True) or return a modified copy (False). The default is False.

Returns

A modified TabularDataset object with the replaced record(s), or None if in_place=True.

Return type

TabularDataset or None

sample(n_samples=1, frac=None, random_state=None)

Sample a set of records from a TabularDataset object.

Parameters

n_samples (int) – Number of records to sample. If frac is not None, this parameter is ignored.
frac (float) – Fraction of records to sample.
random_state (optional) – Passed to pandas.DataFrame.sample()

Returns

A TabularDataset object with a sample of the records of the original object.

Return type

TabularDataset
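
The precedence of frac over n_samples can be sketched on a plain list. This is a hypothetical stand-in, not the tapas code: it seeds Python's `random.Random` with random_state instead of passing it to pandas, and it assumes the fraction is turned into a count by rounding.

```python
import random


def sample(records, n_samples=1, frac=None, random_state=None):
    """Return a random sample of `records`, drawn without replacement."""
    rng = random.Random(random_state)
    if frac is not None:
        # frac takes precedence: any n_samples value is ignored.
        n_samples = round(frac * len(records))
    return rng.sample(records, n_samples)


records = list(range(10))
by_count = sample(records, n_samples=3, random_state=0)
by_frac = sample(records, n_samples=3, frac=0.5, random_state=0)
```

Here `by_count` holds three records, while `by_frac` holds five because frac overrides n_samples.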

view(columns=None, exclude_columns=None)

Create a TabularDataset object that contains a subset of the columns of this TabularDataset. The resulting object holds a copy of the data, and can thus be modified without affecting the original data.

Parameters

Exactly one of columns and exclude_columns must be defined.
columns (list, or None) – The columns to include in the view.
exclude_columns (list, or None) – The columns to exclude from the view, with all other columns included.

Returns

A subset of this data, restricted to some columns.

Return type

TabularDataset
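
The "exactly one of columns and exclude_columns" rule can be sketched as a small column resolver. `resolve_columns` is a hypothetical helper, not a tapas function; raising ValueError and preserving the original column order are assumptions made for the illustration.

```python
def resolve_columns(all_columns, columns=None, exclude_columns=None):
    """Return the column names a view would keep, in their original order."""
    if (columns is None) == (exclude_columns is None):
        # Both given or both missing: the request is ambiguous or empty.
        raise ValueError("Define exactly one of columns and exclude_columns.")
    if columns is not None:
        wanted = set(columns)
        return [c for c in all_columns if c in wanted]
    excluded = set(exclude_columns)
    return [c for c in all_columns if c not in excluded]


all_columns = ["age", "colour", "income"]
included = resolve_columns(all_columns, columns=["income", "age"])
remaining = resolve_columns(all_columns, exclude_columns=["colour"])
```

Both calls select the same two columns, once by naming them and once by excluding the third.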

write(filepath)

Write data and description to file.

Parameters: filepath (str) – Path where the csv and json files are saved.

write_to_string(): Return a string holding the dataset (as a csv).