tapas.datasets.dataset.TabularRecord
- class tapas.datasets.dataset.TabularRecord(data, description, identifier)
Bases: tapas.datasets.dataset.TabularDataset
Class for a tabular record object. The tabular data is a pandas DataFrame with 1 row and the data description is a dictionary.
- __init__(data, description, identifier)
- Parameters
data (pandas.DataFrame (or a valid argument for pd.DataFrame).) –
description (tapas.datasets.data_description.DataDescription) –
label (str (optional)) –
Methods
__init__(data, description, identifier)
add_records(records[, in_place]) – Add record(s) to dataset and return modified dataset.
copy() – Create a TabularRecord that is a deep copy of this one.
create_subsets(n, sample_size[, drop_records]) – Create a number n of subsets of this dataset of size sample_size without replacement.
drop_records([record_ids, n, in_place]) – Drop records from the TabularDataset object; if record_ids is empty it will drop a random record.
empty() – Create an empty TabularDataset with the same description as the current one.
from_dataset(tabular_row) – Create a TabularRecord object from a TabularDataset object containing 1 record.
get_id(tabular_dataset) – Check if the record is found on a given TabularDataset and return the object id (index) on that dataset.
get_records(record_ids) – Get a record from the TabularDataset object.
read(filepath[, label]) – Read csv and json files for dataframe and schema respectively.
read_from_string(data, description)
replace(records_in[, records_out, in_place]) – Replace a record with another one in the dataset; if records_out is empty it will remove a random record.
sample([n_samples, frac, random_state]) – Sample a set of records from a TabularDataset object.
set_id(identifier) – Overwrite the id attribute on the TabularRecord object.
set_value(column, value) – Overwrite the value of attribute column of the TabularRecord object.
view([columns, exclude_columns]) – Create a TabularDataset object that contains a subset of the columns of this TabularDataset.
write(filepath) – Write data and description to file.
write_to_string() – Return a string holding the dataset (as a csv).
Attributes
as_numeric – Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded.
label – The label for records is their identifier.
- add_records(records, in_place=False)
Add record(s) to dataset and return modified dataset.
- Parameters
records (TabularDataset) – A TabularDataset object with the record(s) to add.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A new TabularDataset object with the record(s) or None if in_place=True.
- Return type
TabularDataset or None
- property as_numeric
Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded. This is only computed once (for efficiency reasons), so beware of modifying TabularDataset after using this property.
The columns are kept in the order of the description, with categorical variables encoded over several contiguous columns.
- Return type
np.array
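The encoding described above can be illustrated with a small standard-library sketch. It does not call tapas: the `description` format (a list of column specs with a `type` and, for categorical columns, a `categories` list) and the helper name `encode_as_numeric` are assumptions made for this example only.

```python
def encode_as_numeric(rows, description):
    """One-hot encode categorical columns and keep numeric columns as is.

    `description` is a hypothetical list of column specs, in output order:
    {"name": ..., "type": "numeric"} or
    {"name": ..., "type": "categorical", "categories": [...]}.
    """
    encoded = []
    for row in rows:
        out = []
        for col in description:
            value = row[col["name"]]
            if col["type"] == "numeric":
                out.append(float(value))
            else:
                # One contiguous block of 0/1 columns per categorical variable.
                out.extend(1.0 if value == c else 0.0 for c in col["categories"])
        encoded.append(out)
    return encoded


description = [
    {"name": "age", "type": "numeric"},
    {"name": "colour", "type": "categorical", "categories": ["red", "green", "blue"]},
    {"name": "income", "type": "numeric"},
]
rows = [
    {"age": 31, "colour": "green", "income": 1200.0},
    {"age": 47, "colour": "blue", "income": 950.5},
]
matrix = encode_as_numeric(rows, description)
```

Here the three `colour` columns sit between `age` and `income`, mirroring the "order of the description" rule. Because the real property is computed only once, any change made to the dataset afterwards would presumably not be reflected in the cached array.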
- copy()
Create a TabularRecord that is a deep copy of this one. In particular, the underlying data is copied and can thus be modified freely.
- Returns
A copy of this TabularRecord.
- Return type
TabularRecord
- create_subsets(n, sample_size, drop_records=False)
Create a number n of subsets of this dataset of size sample_size without replacement. If needed, the records can be dropped from this dataset.
- Parameters
n (int) – Number of datasets to create.
sample_size (int) – Size of the subset datasets to be created.
drop_records (bool) – Whether to remove the records sampled from this dataset (in place).
- Returns
A list containing subsets of the data with and without the target record(s).
- Return type
list(TabularDataset)
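One plausible reading of this behaviour, sketched on a plain Python list rather than a TabularDataset: each subset is drawn without replacement, and with `drop_records=True` the drawn records are removed from the source so later subsets cannot reuse them. The function below and its `seed` argument are illustrative assumptions, not the tapas implementation.

```python
import random


def create_subsets(records, n, sample_size, drop_records=False, seed=None):
    """Draw n subsets of sample_size records, each without replacement."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n):
        # Indices are unique within one draw, so a subset has no duplicates.
        chosen = rng.sample(range(len(records)), sample_size)
        subsets.append([records[i] for i in chosen])
        if drop_records:
            # Remove the sampled records in place, highest index first
            # so that earlier deletions do not shift later indices.
            for i in sorted(chosen, reverse=True):
                del records[i]
    return subsets


records = list(range(20))
subsets = create_subsets(records, n=3, sample_size=5, drop_records=True, seed=0)
```

With `drop_records=True` the three subsets are disjoint and `records` shrinks by 15 entries; with the default `False`, different subsets may overlap.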
- drop_records(record_ids=[], n=1, in_place=False)
Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped instead.
- Parameters
record_ids (list[int]) – List of indexes of records to drop.
n (int) – Number of random records to drop if record_ids is empty.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A new TabularDataset object without the record(s) or None if in_place=True.
- Return type
TabularDataset or None
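The `in_place` convention shared by add_records, drop_records and replace (return a modified copy by default, or mutate and return None) can be sketched with a toy class. `ToyDataset`, its dict storage and the `seed` argument are hypothetical stand-ins for illustration only.

```python
import random


class ToyDataset:
    """Toy stand-in that stores records as an id -> value mapping."""

    def __init__(self, records):
        self.records = dict(records)

    def drop_records(self, record_ids=None, n=1, in_place=False, seed=None):
        ids = list(record_ids or [])
        if not ids:
            # No ids given: pick n random records to drop.
            ids = random.Random(seed).sample(sorted(self.records), n)
        remaining = {k: v for k, v in self.records.items() if k not in ids}
        if in_place:
            self.records = remaining
            return None
        return ToyDataset(remaining)


ds = ToyDataset({0: "a", 1: "b", 2: "c"})
smaller = ds.drop_records([1])                  # copy; ds is untouched
result = ds.drop_records(n=2, in_place=True)    # mutates ds, returns None
```

The sketch uses `None` rather than `[]` as the default for `record_ids`, a common way in Python to avoid sharing one mutable default list between calls.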
- empty()
Create an empty TabularDataset with the same description as the current one. Short-hand for TabularDataset.get_records([]).
- Returns
Empty tabular dataset.
- Return type
TabularDataset
- classmethod from_dataset(tabular_row)
Create a TabularRecord object from a TabularDataset object containing 1 record.
- Parameters
tabular_row (TabularDataset) – A TabularDataset object containing one record.
- Returns
A TabularRecord object
- Return type
TabularRecord
- get_id(tabular_dataset)
Check if the record is found on a given TabularDataset and return the object id (index) on that dataset.
- Parameters
tabular_dataset (TabularDataset) – A TabularDataset object.
- Returns
The id of the object based on the index in the original dataset.
- Return type
int
- get_records(record_ids)
Get record(s) from the TabularDataset object.
- Parameters
record_ids (list[int]) – List of indexes of records to retrieve.
- Returns
A TabularDataset object with the record(s).
- Return type
TabularDataset
- property label
The label for records is their identifier. We assume here that the label of the rest of the dataset is obvious from context. If not, it can be retrieved as self.description.label.
- classmethod read(filepath, label=None)
Read csv and json files for dataframe and schema respectively.
- Parameters
filepath (str) – Full path to the csv and json, excluding the .csv or .json extension. Both files should have the same root name.
label (str or None) – An optional string to represent this dataset.
- Returns
A TabularDataset.
- Return type
TabularDataset
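The shared-root-name convention can be illustrated with the standard library alone. This sketch does not reproduce how tapas parses the files: the helper `read_pair`, the file name `census` and the JSON schema layout are all assumptions chosen for the example.

```python
import csv
import json
import tempfile
from pathlib import Path


def read_pair(filepath):
    """Load <filepath>.csv and <filepath>.json, which share a root name."""
    with open(f"{filepath}.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    with open(f"{filepath}.json") as f:
        schema = json.load(f)
    return rows, schema


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "census"  # note: no extension on the root path
    Path(f"{root}.csv").write_text("age,colour\n31,green\n")
    Path(f"{root}.json").write_text(
        json.dumps([{"name": "age"}, {"name": "colour"}])  # hypothetical schema
    )
    rows, schema = read_pair(root)
```

The caller passes only the root path, and the reader appends both extensions itself, which is why the two files must sit side by side under the same name.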
- classmethod read_from_string(data, description)
- Parameters
data (str) – The csv version of the data
description (DataDescription) –
- Return type
- replace(records_in, records_out=[], in_place=False)
Replace a record with another one in the dataset. If records_out is empty, a random record is removed instead.
- Parameters
records_in (TabularDataset) – A TabularDataset object with the record(s) to add.
records_out (list(int)) – List of indexes of records to drop.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A modified TabularDataset object with the replaced record(s) or None if in_place=True.
- Return type
TabularDataset or None
- sample(n_samples=1, frac=None, random_state=None)
Sample a set of records from a TabularDataset object.
- Parameters
n_samples (int) – Number of records to sample. If frac is not None, this parameter is ignored.
frac (float) – Fraction of records to sample.
random_state (optional) – Passed to pandas.DataFrame.sample()
- Returns
A TabularDataset object with a sample of the records of the original object.
- Return type
TabularDataset
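The precedence of `frac` over `n_samples` can be sketched on a plain list. The rounding rule used to turn a fraction into a count is an assumption of this example, as is the helper name `sample_records`; the real method delegates to pandas.DataFrame.sample().

```python
import random


def sample_records(records, n_samples=1, frac=None, random_state=None):
    """Sample without replacement; frac, when given, overrides n_samples."""
    if frac is not None:
        # Assumed rounding rule for converting a fraction into a count.
        n_samples = int(round(frac * len(records)))
    return random.Random(random_state).sample(records, n_samples)


population = list(range(10))
by_count = sample_records(population, n_samples=3, random_state=0)
by_frac = sample_records(population, n_samples=3, frac=0.5, random_state=0)
```

In the second call `n_samples=3` is ignored and half of the ten records are returned.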
- set_id(identifier)
Overwrite the id attribute on the TabularRecord object.
- Parameters
identifier (int or str) – An id value to be assigned to the TabularRecord id attribute
- Return type
None
- set_value(column, value)
Overwrite the value of attribute column of the TabularRecord object.
- Parameters
column (str) – The identifier of the attribute to be replaced.
value ((value set of column)) – The value to set the column of the record.
- Return type
None
- view(columns=None, exclude_columns=None)
Create a TabularDataset object that contains a subset of the columns of this TabularDataset. The resulting object only has a copy of the data, and can thus be modified without affecting the original data.
- Parameters
Exactly one of columns and exclude_columns must be defined.
columns (list, or None) – The columns to include in the view.
exclude_columns (list, or None) – The columns to exclude from the view, with all other columns included.
- Returns
A subset of this data, restricted to some columns.
- Return type
TabularDataset
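A minimal sketch of the column-selection rule, using a list of dicts in place of the underlying DataFrame. The helper `select_columns` and its `all_columns` argument are hypothetical, and raising `ValueError` when the "exactly one" rule is violated is an assumption about error handling, not documented behaviour.

```python
import copy


def select_columns(rows, all_columns, columns=None, exclude_columns=None):
    """Return copies of rows restricted to the requested columns."""
    # Exactly one of the two arguments must be provided.
    if (columns is None) == (exclude_columns is None):
        raise ValueError("Exactly one of columns and exclude_columns must be defined.")
    if columns is None:
        columns = [c for c in all_columns if c not in exclude_columns]
    # Deep-copy values so that edits to the view never reach the original.
    return [{c: copy.deepcopy(row[c]) for c in columns} for row in rows]


all_columns = ["age", "colour", "income"]
rows = [{"age": 31, "colour": "green", "income": 1200.0}]
included = select_columns(rows, all_columns, columns=["age", "income"])
excluded = select_columns(rows, all_columns, exclude_columns=["colour"])
included[0]["age"] = 99  # modifies the view only
```

Both calls select the same two columns, and the assignment on the last line leaves `rows` unchanged because the view holds its own copy of the data.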
- write(filepath)
Write data and description to file.
- Parameters
filepath (str) – Path where the csv and json files are saved.
- write_to_string()
Return a string holding the dataset (as a csv).