tapas.datasets.dataset.TabularDataset

class tapas.datasets.dataset.TabularDataset(data, description)

Bases: tapas.datasets.dataset.Dataset

Class to represent tabular data as a Dataset. Internally, the tabular data is stored as a pandas DataFrame and the schema is an array of types.

__init__(data, description)

Methods

__init__(data, description)

add_records(records[, in_place])

Add record(s) to dataset and return modified dataset.

copy()

Create a TabularDataset that is a deep copy of this one.

create_subsets(n, sample_size[, drop_records])

Create a number n of subsets of this dataset of size sample_size without replacement.

drop_records([record_ids, n, in_place])

Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped.

empty()

Create an empty TabularDataset with the same description as the current one.

get_records(record_ids)

Get record(s) from the TabularDataset object.

read(filepath[, label])

Read csv and json files for dataframe and schema respectively.

read_from_string(data, description)

Create a TabularDataset from a csv string and a data description.

replace(records_in[, records_out, in_place])

Replace record(s) in the dataset with others. If records_out is empty, a random record is removed instead.

sample([n_samples, frac, random_state])

Sample a set of records from a TabularDataset object.

view([columns, exclude_columns])

Create a TabularDataset object that contains a subset of the columns of this TabularDataset.

write(filepath)

Write data and description to file.

write_to_string()

Return a string holding the dataset (as a csv).

Attributes

as_numeric

Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are one-hot encoded.

label

add_records(records, in_place=False)

Add record(s) to dataset and return modified dataset.

Parameters
  • records (TabularDataset) – A TabularDataset object with the record(s) to add.

  • in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A new TabularDataset object with the record(s) or None if in_place=True.

Return type

TabularDataset or None

property as_numeric

Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are one-hot encoded. For efficiency, the encoding is computed only once, so beware of modifying the TabularDataset after using this property: later changes may not be reflected in the array.

The columns are kept in the order of the description, with categorical variables encoded over several contiguous columns.

Return type

np.array
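
The encoding described above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the helper name encode_numeric and the (kind, categories) schema layout are assumptions made for illustration, and the result is a list of lists rather than an np.array.

```python
def encode_numeric(rows, schema):
    """Keep numeric columns as is and one-hot encode categorical columns.

    rows:   list of records, each a list of values in schema order.
    schema: list of (kind, categories) pairs; categories is None for
            numeric columns (hypothetical layout, for illustration only).
    """
    encoded = []
    for row in rows:
        out = []
        for value, (kind, categories) in zip(row, schema):
            if kind == "numeric":
                out.append(float(value))
            else:
                # One contiguous block of 0/1 columns per categorical variable.
                out.extend(1.0 if value == c else 0.0 for c in categories)
        encoded.append(out)
    return encoded


schema = [("numeric", None), ("categorical", ["red", "green", "blue"])]
rows = [[1.5, "green"], [3.0, "blue"]]
matrix = encode_numeric(rows, schema)
```

Each record becomes one row with four columns here: one for the numeric variable followed by three contiguous columns for the categorical one, in schema order.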

copy()

Create a TabularDataset that is a deep copy of this one. In particular, the underlying data is copied and can thus be modified freely.

Returns

A copy of this TabularDataset.

Return type

TabularDataset

create_subsets(n, sample_size, drop_records=False)

Create a number n of subsets of this dataset of size sample_size without replacement. If needed, the records can be dropped from this dataset.

Parameters
  • n (int) – Number of datasets to create.

  • sample_size (int) – Size of the subset datasets to be created.

  • drop_records (bool) – Whether to remove the records sampled from this dataset (in place).

Returns

A list of n subsets of this dataset, each containing sample_size records.

Return type

list(TabularDataset)
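
As a rough illustration of the sampling behaviour, the sketch below draws n subsets from a plain list of records. It assumes that "without replacement" applies within each subset and that drop_records removes the sampled records from the source before the next draw; the function name sketch_create_subsets and the seed parameter are hypothetical and not part of the library.

```python
import random


def sketch_create_subsets(records, n, sample_size, drop_records=False, seed=None):
    """Draw n subsets of sample_size records, each without replacement."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n):
        # Positions are unique within a subset (sampling without replacement).
        positions = rng.sample(range(len(records)), sample_size)
        subsets.append([records[i] for i in positions])
        if drop_records:
            # Remove the sampled records from the source list in place,
            # deleting from the end so earlier positions stay valid.
            for i in sorted(positions, reverse=True):
                del records[i]
    return subsets


records = list(range(20))
subsets = sketch_create_subsets(records, n=3, sample_size=5, drop_records=True, seed=0)
```

Under these assumptions, dropping records makes the subsets disjoint and shrinks the source, whereas with drop_records=False the same record may appear in several subsets.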

drop_records(record_ids=[], n=1, in_place=False)

Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped.

Parameters
  • record_ids (list[int]) – List of indexes of records to drop.

  • n (int) – Number of random records to drop if record_ids is empty.

  • in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A new TabularDataset object without the record(s) or None if in_place=True.

Return type

TabularDataset or None
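
The in_place convention shared by add_records, drop_records and replace can be sketched on a plain list. This is an illustration only: sketch_drop and its seed parameter are hypothetical names, and record_ids are assumed to be positional indexes.

```python
import random


def sketch_drop(records, record_ids=(), n=1, in_place=False, seed=None):
    """Drop the given positions, or n random ones if none are given."""
    ids = list(record_ids)
    if not ids:
        # No ids supplied: pick n distinct random positions instead.
        ids = random.Random(seed).sample(range(len(records)), n)
    to_drop = set(ids)
    kept = [r for i, r in enumerate(records) if i not in to_drop]
    if in_place:
        records[:] = kept  # mutate the caller's list and return None
        return None
    return kept  # the original list is left untouched


records = ["a", "b", "c", "d"]
copy_result = sketch_drop(records, record_ids=[1, 3])
in_place_result = sketch_drop(records, n=2, in_place=True, seed=0)
```

With in_place=False the caller gets a new object and the original is unchanged; with in_place=True the original is modified and the call returns None.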

empty()

Create an empty TabularDataset with the same description as the current one. Shorthand for TabularDataset.get_records([]).

Returns

Empty tabular dataset.

Return type

TabularDataset

get_records(record_ids)

Get record(s) from the TabularDataset object.

Parameters

record_ids (list[int]) – List of indexes of records to retrieve.

Returns

A TabularDataset object with the record(s).

Return type

TabularDataset

classmethod read(filepath, label=None)

Read csv and json files for dataframe and schema respectively.

Parameters
  • filepath (str) – Full path to the csv and json, excluding the .csv or .json extension. Both files should have the same root name.

  • label (str or None) – An optional string to represent this dataset.

Returns

A TabularDataset.

Return type

TabularDataset
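
The filepath convention (one root name, two extensions) can be shown with a standard-library sketch. It only mimics how the two file names are derived from the root; read_pair is a hypothetical helper, and the JSON layout written here is a placeholder, not the library's actual description format.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_pair(filepath):
    """Load <filepath>.csv as rows and <filepath>.json as the schema."""
    with open(f"{filepath}.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(f"{filepath}.json") as f:
        schema = json.load(f)
    return rows, schema


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "census"  # root name, without any extension
    Path(f"{root}.csv").write_text("age,city\n23,A\n35,B\n")
    # Placeholder schema content, for illustration only.
    Path(f"{root}.json").write_text(json.dumps([{"name": "age"}, {"name": "city"}]))
    rows, schema = read_pair(root)
```

With the real class, the corresponding call would pass the same extension-free root to TabularDataset.read.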

classmethod read_from_string(data, description)

Create a TabularDataset from a csv string and a data description.

Parameters
  • data (str) – The csv version of the data

  • description (DataDescription) –

Return type

TabularDataset

replace(records_in, records_out=[], in_place=False)

Replace record(s) in the dataset with others. If records_out is empty, a random record is removed instead.

Parameters
  • records_in (TabularDataset) – A TabularDataset object with the record(s) to add.

  • records_out (list(int)) – List of indexes of records to drop.

  • in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A modified TabularDataset object with the replaced record(s) or None if in_place=True.

Return type

TabularDataset or None

sample(n_samples=1, frac=None, random_state=None)

Sample a set of records from a TabularDataset object.

Parameters
  • n_samples (int) – Number of records to sample. If frac is not None, this parameter is ignored.

  • frac (float) – Fraction of records to sample.

  • random_state (optional) – Passed to pandas.DataFrame.sample()

Returns

A TabularDataset object with a sample of the records of the original object.

Return type

TabularDataset
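
Since random_state is forwarded to pandas.DataFrame.sample(), the precedence of frac over n_samples can be sketched directly on a DataFrame. The wrapper name sample_frame is hypothetical and this is not the library's code; it only shows one way to obtain the documented behaviour, given that pandas rejects n and frac being passed together.

```python
import pandas as pd


def sample_frame(df, n_samples=1, frac=None, random_state=None):
    """Sample rows by fraction if frac is given, otherwise by count."""
    if frac is not None:
        # frac takes precedence; n_samples is ignored in this branch.
        return df.sample(frac=frac, random_state=random_state)
    return df.sample(n=n_samples, random_state=random_state)


df = pd.DataFrame({"age": [23, 35, 41, 58], "city": ["A", "B", "A", "C"]})
by_count = sample_frame(df, n_samples=2, random_state=0)
by_frac = sample_frame(df, n_samples=2, frac=0.75, random_state=0)
```

Passing the same random_state twice yields the same rows, which is useful when experiments need to be reproducible.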

view(columns=None, exclude_columns=None)

Create a TabularDataset object that contains a subset of the columns of this TabularDataset. The resulting object holds a copy of the data, so it can be modified without affecting the original data.

Exactly one of columns and exclude_columns must be defined.

Parameters
  • columns (list, or None) – The columns to include in the view.

  • exclude_columns (list, or None) – The columns to exclude from the view, with all other columns included.

Returns

A subset of this data, restricted to some columns.

Return type

TabularDataset
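
The "exactly one of" rule for columns and exclude_columns can be sketched as a small column-selection helper. The name select_columns, the ValueError, and the choice to keep the original column order are all assumptions made for illustration; the library's behaviour when the rule is violated is not documented here.

```python
def select_columns(all_columns, columns=None, exclude_columns=None):
    """Return the column names a view would keep."""
    # Both None or both set: the selection is ambiguous, so reject it.
    if (columns is None) == (exclude_columns is None):
        raise ValueError("Exactly one of columns and exclude_columns must be defined.")
    if columns is not None:
        return [c for c in all_columns if c in columns]
    return [c for c in all_columns if c not in exclude_columns]


all_columns = ["age", "city", "income"]
included = select_columns(all_columns, columns=["age", "income"])
excluded = select_columns(all_columns, exclude_columns=["city"])
```

Both calls select the same two columns, once by listing what to keep and once by listing what to leave out.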

write(filepath)

Write data and description to file.

Parameters

filepath (str) – Path where the csv and json files are saved.

write_to_string()

Return a string holding the dataset (as a csv).