tapas.datasets.dataset.TabularRecord

class tapas.datasets.dataset.TabularRecord(data, description, identifier)

Bases: tapas.datasets.dataset.TabularDataset

Class for a tabular record object. The tabular data is a pandas DataFrame with one row, and the data description is a dictionary.

__init__(data, description, identifier)

Parameters

data (pandas.DataFrame (or a valid argument for pd.DataFrame).) –
description (tapas.datasets.data_description.DataDescription) –
identifier (int or str) –

Methods

`__init__`(data, description, identifier)	Create a TabularRecord from data, a description, and an identifier.
`add_records`(records[, in_place])	Add record(s) to dataset and return modified dataset.
`copy`()	Create a TabularRecord that is a deep copy of this one.
`create_subsets`(n, sample_size[, drop_records])	Create a number n of subsets of this dataset of size sample_size without replacement.
`drop_records`([record_ids, n, in_place])	Drop records from the TabularDataset object, if record_ids is empty it will drop a random record.
`empty`()	Create an empty TabularDataset with the same description as the current one.
`from_dataset`(tabular_row)	Create a TabularRecord object from a TabularDataset object containing 1 record.
`get_id`(tabular_dataset)	Check if the record is found on a given TabularDataset and return the object id (index) on that dataset.
`get_records`(record_ids)	Get a record from the TabularDataset object
`read`(filepath[, label])	Read csv and json files for dataframe and schema respectively.
`read_from_string`(data, description)	Read a dataset from a csv string (data) and its description.
`replace`(records_in[, records_out, in_place])	Replace a record with another one in the dataset, if records_out is empty it will remove a random record.
`sample`([n_samples, frac, random_state])	Sample a set of records from a TabularDataset object.
`set_id`(identifier)	Overwrite the id attribute on the TabularRecord object.
`set_value`(column, value)	Overwrite the value of attribute column of the TabularRecord object.
`view`([columns, exclude_columns])	Create a TabularDataset object that contains a subset of the columns of this TabularDataset.
`write`(filepath)	Write data and description to file
`write_to_string`()	Return a string holding the dataset (as a csv).

Attributes

`as_numeric`	Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded.
`label`	The label for records is their identifier.

add_records(records, in_place=False)

Add record(s) to dataset and return modified dataset.

Parameters

records (TabularDataset) – A TabularDataset object with the record(s) to add.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A new TabularDataset object with the record(s), or None if in_place=True.

Return type

TabularDataset or None

property as_numeric

Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded. This is only computed once (for efficiency reasons), so beware of modifying TabularDataset after using this property.

The columns are kept in the order of the description, with categorical variables encoded over several contiguous columns.

Return type: np.array
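
The encoding described above can be illustrated with a small, library-free sketch. It is not the TAPAS implementation: the function name `encode_numeric`, the schema format (a list of `(column, categories)` pairs, with `None` marking a numeric column), and the sample rows are all assumptions made for illustration, and plain lists stand in for the np.array.

```python
def encode_numeric(rows, schema):
    """One-hot encode categorical columns and keep numeric columns as is.

    rows   -- list of dicts, one per record
    schema -- list of (column_name, categories) pairs in description order;
              categories is None for a numeric column (hypothetical format)
    """
    encoded = []
    for row in rows:
        vector = []
        for name, categories in schema:
            if categories is None:
                # Numeric column: the value is kept unchanged.
                vector.append(float(row[name]))
            else:
                # Categorical column: one contiguous block of 0/1 entries.
                vector.extend(1.0 if row[name] == c else 0.0 for c in categories)
        encoded.append(vector)
    return encoded


# Hypothetical schema and data, used only to show the column layout.
schema = [("age", None), ("colour", ["red", "green", "blue"]), ("height", None)]
rows = [
    {"age": 31, "colour": "green", "height": 1.7},
    {"age": 45, "colour": "blue", "height": 1.6},
]
matrix = encode_numeric(rows, schema)
```

Each row becomes a vector of length 5: one entry for `age`, three contiguous entries for `colour`, and one for `height`, in the order given by the schema.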

copy()

Create a TabularRecord that is a deep copy of this one. In particular, the underlying data is copied and can thus be modified freely.

Returns: A copy of this TabularRecord.
Return type: TabularRecord

create_subsets(n, sample_size, drop_records=False)

Create a number n of subsets of this dataset of size sample_size without replacement. If needed, the records can be dropped from this dataset.

Parameters

n (int) – Number of datasets to create.
sample_size (int) – Size of the subset datasets to be created.
drop_records (bool) – Whether to remove the records sampled from this dataset (in place).

Returns

A list containing subsets of the data with and without the target record(s).

Return type

list(TabularDataset)
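
As a rough illustration of sampling subsets without replacement, the sketch below works on a plain Python list. It is not the TAPAS code: the `seed` argument is hypothetical, and the choice to drop sampled records after each subset (so that subsets are disjoint when `drop_records=True`) is an assumption about the intended behaviour.

```python
import random


def create_subsets(records, n, sample_size, drop_records=False, seed=None):
    """Draw n subsets of size sample_size from records (a list)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n):
        # random.sample draws without replacement within one subset.
        chosen = rng.sample(range(len(records)), sample_size)
        subsets.append([records[i] for i in chosen])
        if drop_records:
            # Remove the sampled records from the source list, in place.
            # Deleting from the highest index down keeps indexes valid.
            for i in sorted(chosen, reverse=True):
                del records[i]
    return subsets


records = list(range(10))
subsets = create_subsets(records, n=2, sample_size=3, drop_records=True, seed=0)
```

With `drop_records=True`, the source list shrinks by `n * sample_size` records, so the call fails if the list runs out of records before all subsets are drawn.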

drop_records(record_ids=[], n=1, in_place=False)

Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped.

Parameters

record_ids (list[int]) – List of indexes of records to drop.
n (int) – Number of random records to drop if record_ids is empty.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A new TabularDataset object without the record(s) or None if in_place=True.

Return type

TabularDataset or None
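
The in_place convention shared by add_records, drop_records, and replace can be sketched with a toy class. `ToyDataset` is a hypothetical stand-in, not the TAPAS class; it only shows the difference between returning a modified copy and mutating the object while returning None.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset; not the TAPAS class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def drop_records(self, record_ids, in_place=False):
        to_drop = set(record_ids)
        kept = [r for i, r in enumerate(self.rows) if i not in to_drop]
        if in_place:
            # Mutate this object and return nothing.
            self.rows = kept
            return None
        # Leave this object untouched and return a modified copy.
        return ToyDataset(kept)


original = ToyDataset(["a", "b", "c"])
smaller = original.drop_records([0])                 # new object, original unchanged
result = original.drop_records([1], in_place=True)   # original modified, returns None
```

Because the in-place call returns None, chaining such as `ds = ds.drop_records(..., in_place=True)` would discard the dataset; assign the result only when in_place is False.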

empty()

Create an empty TabularDataset with the same description as the current one. Short-hand for TabularDataset.get_records([]).

Returns: Empty tabular dataset.
Return type: TabularDataset

classmethod from_dataset(tabular_row)

Create a TabularRecord object from a TabularDataset object containing 1 record.

Parameters: tabular_row (TabularDataset) – A TabularDataset object containing one record.
Returns: A TabularRecord object
Return type: TabularRecord

get_id(tabular_dataset)

Check if the record is found on a given TabularDataset and return the object id (index) on that dataset.

Parameters: tabular_dataset (TabularDataset) – A TabularDataset object.
Returns: The id of the object based on the index in the original dataset.
Return type: int

get_records(record_ids)

Get record(s) from the TabularDataset object.

Parameters: record_ids (list[int]) – List of indexes of records to retrieve.
Returns: A TabularDataset object with the record(s).
Return type: TabularDataset

property label: The label for records is their identifier. We assume here that the label of the rest of the dataset is obvious from context. If not, it can be retrived as self.description.label.

classmethod read(filepath, label=None)

Read csv and json files for dataframe and schema respectively.

Parameters

filepath (str) – Full path to the csv and json, excluding the .csv or .json extension. Both files should have the same root name.
label (str or None) – An optional string to represent this dataset.

Returns

A TabularDataset.

Return type

TabularDataset
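
The filepath convention (one root name, two extensions) can be shown with the standard library alone. The helper `read_pair` and the JSON content below are hypothetical; the real schema format expected by TAPAS is not described in this section.

```python
import csv
import json
import os
import tempfile


def read_pair(filepath):
    """Load <filepath>.csv and <filepath>.json (hypothetical helper)."""
    with open(filepath + ".csv", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(filepath + ".json") as f:
        schema = json.load(f)
    return rows, schema


# Write a tiny csv/json pair sharing the root name "example", then read it back.
tmp_dir = tempfile.mkdtemp()
root = os.path.join(tmp_dir, "example")  # note: no extension
with open(root + ".csv", "w", newline="") as f:
    csv.writer(f).writerows([["age", "colour"], [31, "green"]])
with open(root + ".json", "w") as f:
    json.dump([{"name": "age"}, {"name": "colour"}], f)  # made-up schema

rows, schema = read_pair(root)
```

The same root path, without an extension, is what the filepath parameter above expects.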

classmethod read_from_string(data, description)

Parameters

data (str) – The csv version of the data
description (DataDescription) –

Return type

TabularDataset

replace(records_in, records_out=[], in_place=False)

Replace a record with another one in the dataset. If records_out is empty, a random record is removed.

Parameters

records_in (TabularDataset) – A TabularDataset object with the record(s) to add.
records_out (list(int)) – List of indexes of records to drop.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.

Returns

A modified TabularDataset object with the replaced record(s), or None if in_place=True.

Return type

TabularDataset or None

sample(n_samples=1, frac=None, random_state=None)

Sample a set of records from a TabularDataset object.

Parameters

n_samples (int) – Number of records to sample. If frac is not None, this parameter is ignored.
frac (float) – Fraction of records to sample.
random_state (optional) – Passed to pandas.DataFrame.sample()

Returns

A TabularDataset object with a sample of the records of the original object.

Return type

TabularDataset

set_id(identifier)

Overwrite the id attribute on the TabularRecord object.

Parameters: identifier (int or str) – An id value to be assigned to the TabularRecord id attribute
Return type: None

set_value(column, value)

Overwrite the value of attribute column of the TabularRecord object.

Parameters

column (str) – The identifier of the attribute to be replaced.
value ((value set of column)) – The value to set the column of the record.

Return type

None

view(columns=None, exclude_columns=None)

Create a TabularDataset object that contains a subset of the columns of this TabularDataset. The resulting object only has a copy of the data, and can thus be modified without affecting the original data.

Parameters

Exactly one of columns and exclude_columns must be defined.
columns (list, or None) – The columns to include in the view.
exclude_columns (list, or None) – The columns to exclude from the view, with all other columns included.

Returns

A subset of this data, restricted to some columns.

Return type

TabularDataset
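
The "exactly one of columns and exclude_columns" rule can be sketched as a small validation helper. `select_columns` is a hypothetical function operating on column names only; raising ValueError and preserving the original column order are assumptions, not documented TAPAS behaviour.

```python
def select_columns(all_columns, columns=None, exclude_columns=None):
    """Return the column names a view would keep (illustrative only)."""
    # Exactly one of the two arguments must be provided.
    if (columns is None) == (exclude_columns is None):
        raise ValueError("Define exactly one of columns and exclude_columns.")
    if columns is not None:
        return [c for c in all_columns if c in columns]
    return [c for c in all_columns if c not in exclude_columns]


included = select_columns(["a", "b", "c"], columns=["a", "c"])
excluded = select_columns(["a", "b", "c"], exclude_columns=["b"])
```

Both calls select the same two columns, one by naming what to keep and the other by naming what to leave out.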

write(filepath)

Write data and description to file.

Parameters: filepath (str) – Path where the csv and json files are saved.

write_to_string(): Return a string holding the dataset (as a csv).