tapas.datasets.dataset.TabularDataset
- class tapas.datasets.dataset.TabularDataset(data, description)
Bases: tapas.datasets.dataset.Dataset
Class to represent tabular data as a Dataset. Internally, the tabular data is stored as a pandas DataFrame and the schema is an array of types.
- __init__(data, description)
- Parameters
data (pandas.DataFrame (or a valid argument for pd.DataFrame).) –
description (tapas.datasets.data_description.DataDescription) –
label (str (optional)) –
Methods
__init__(data, description)
add_records(records[, in_place]) – Add record(s) to dataset and return modified dataset.
copy() – Create a TabularDataset that is a deep copy of this one.
create_subsets(n, sample_size[, drop_records]) – Create a number n of subsets of this dataset of size sample_size without replacement.
drop_records([record_ids, n, in_place]) – Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped instead.
empty() – Create an empty TabularDataset with the same description as the current one.
get_records(record_ids) – Get record(s) from the TabularDataset object.
read(filepath[, label]) – Read csv and json files for dataframe and schema respectively.
read_from_string(data, description)
replace(records_in[, records_out, in_place]) – Replace a record with another one in the dataset. If records_out is empty, a random record is removed instead.
sample([n_samples, frac, random_state]) – Sample a set of records from a TabularDataset object.
view([columns, exclude_columns]) – Create a TabularDataset object that contains a subset of the columns of this TabularDataset.
write(filepath) – Write data and description to file.
write_to_string() – Return a string holding the dataset (as a csv).
Attributes
as_numeric – Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded.
label
- add_records(records, in_place=False)
Add record(s) to dataset and return modified dataset.
- Parameters
records (TabularDataset) – A TabularDataset object with the record(s) to add.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A new TabularDataset object with the record(s) or None if in_place=True.
- Return type
TabularDataset or None
- property as_numeric
Encodes this dataset as a np.array, where numeric values are kept as is and categorical values are 1-hot encoded. This is only computed once (for efficiency reasons), so beware of modifying TabularDataset after using this property.
The columns are kept in the order of the description, with categorical variables encoded over several contiguous columns.
- Return type
np.array
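The encoding described above can be illustrated with plain pandas and NumPy. This is a minimal sketch, not the library's implementation: the toy data and the `schema` list (column order plus categories) are hypothetical stand-ins for a dataset and its description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
df = pd.DataFrame({"age": [31, 45, 27], "city": ["Leeds", "York", "Leeds"]})

# Hypothetical schema: column order plus the categories of each
# categorical column (None marks a numeric column).
schema = [("age", None), ("city", ["Leeds", "York"])]

blocks = []
for name, categories in schema:
    if categories is None:
        # Numeric column: kept as is, one output column.
        blocks.append(df[name].to_numpy(dtype=float).reshape(-1, 1))
    else:
        # Categorical column: one 0/1 indicator column per category,
        # placed contiguously in the output.
        onehot = np.array([[float(v == c) for c in categories] for v in df[name]])
        blocks.append(onehot)

# Columns follow the schema order: age, city=Leeds, city=York.
encoded = np.hstack(blocks)
```

Since the real property is computed only once, it is safest to read it after all modifications to the dataset have been made.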
- copy()
Create a TabularDataset that is a deep copy of this one. In particular, the underlying data is copied and can thus be modified freely.
- Returns
A copy of this TabularDataset.
- Return type
TabularDataset
- create_subsets(n, sample_size, drop_records=False)
Create a number n of subsets of this dataset of size sample_size without replacement. If needed, the records can be dropped from this dataset.
- Parameters
n (int) – Number of datasets to create.
sample_size (int) – Size of the subset datasets to be created.
drop_records (bool) – Whether to remove the records sampled from this dataset (in place).
- Returns
A list containing subsets of the data with and without the target record(s).
- Return type
list(TabularDataset)
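As a rough illustration of drawing several fixed-size subsets, the sketch below uses pandas directly. It assumes that each subset is sampled without replacement internally while different subsets may overlap; the text above does not settle this point. The standalone function and its `seed` parameter are hypothetical, not part of the library.

```python
import pandas as pd

df = pd.DataFrame({"age": range(20, 30)})  # hypothetical 10-record dataset


def create_subsets(data, n, sample_size, seed=0):
    # Assumption: each subset is drawn independently, without replacement
    # within the subset; different subsets may share records.
    return [
        data.sample(n=sample_size, replace=False, random_state=seed + i)
        for i in range(n)
    ]


subsets = create_subsets(df, n=3, sample_size=4)
```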
- drop_records(record_ids=[], n=1, in_place=False)
Drop records from the TabularDataset object. If record_ids is empty, n random records are dropped instead.
- Parameters
record_ids (list[int]) – List of indexes of records to drop.
n (int) – Number of random records to drop if record_ids is empty.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A new TabularDataset object without the record(s) or None if in_place=True.
- Return type
TabularDataset or None
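The interplay of record_ids, n and in_place can be sketched on a bare pandas DataFrame. This is an illustration of the documented behaviour under stated assumptions, not the library's code; the `seed` parameter is hypothetical, and the sketch uses `None` rather than `[]` as the default to avoid a mutable default argument.

```python
import pandas as pd

df = pd.DataFrame({"age": [31, 45, 27, 52]})  # hypothetical data


def drop_records(data, record_ids=None, n=1, in_place=False, seed=0):
    # Use the given indexes, or pick n random ones when none are given.
    if not record_ids:
        record_ids = data.sample(n=n, random_state=seed).index.tolist()
    if in_place:
        # Mutate the caller's data and return None.
        data.drop(index=record_ids, inplace=True)
        return None
    # Otherwise leave the input untouched and return a reduced copy.
    return data.drop(index=record_ids)


smaller = drop_records(df, record_ids=[0, 2])              # df is unchanged
random_drop = drop_records(df, n=2)                        # two random records removed
outcome = drop_records(df, record_ids=[1], in_place=True)  # df is mutated
```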
- empty()
Create an empty TabularDataset with the same description as the current one. Short-hand for TabularDataset.get_records([]).
- Returns
Empty tabular dataset.
- Return type
TabularDataset
- get_records(record_ids)
Get record(s) from the TabularDataset object.
- Parameters
record_ids (list[int]) – List of indexes of records to retrieve.
- Returns
A TabularDataset object with the record(s).
- Return type
TabularDataset
- classmethod read(filepath, label=None)
Read csv and json files for dataframe and schema respectively.
- Parameters
filepath (str) – Full path to the csv and json, excluding the .csv or .json extension. Both files should have the same root name.
label (str or None) – An optional string to represent this dataset.
- Returns
A TabularDataset.
- Return type
TabularDataset
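The shared-root naming convention can be sketched as follows. The file contents are hypothetical: this section does not describe the csv layout or the JSON schema format, so both are placeholders chosen only to show how the two paths derive from one root.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "census"  # shared root name, no extension

    # Hypothetical file contents; the real csv layout and JSON schema
    # format are not specified in this section.
    Path(f"{root}.csv").write_text("age,city\n31,Leeds\n45,York\n")
    Path(f"{root}.json").write_text(json.dumps([{"name": "age"}, {"name": "city"}]))

    # A reader following this convention derives both paths from the root.
    data = pd.read_csv(f"{root}.csv")
    description = json.loads(Path(f"{root}.json").read_text())
```

With the library itself, the equivalent call would pass the root only, for example `TabularDataset.read("path/to/census")`.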
- classmethod read_from_string(data, description)
- Parameters
data (str) – The csv version of the data
description (DataDescription) –
- Return type
TabularDataset
- replace(records_in, records_out=[], in_place=False)
Replace a record with another one in the dataset. If records_out is empty, a random record is removed instead.
- Parameters
records_in (TabularDataset) – A TabularDataset object with the record(s) to add.
records_out (list(int)) – List of indexes of records to drop.
in_place (bool) – Bool indicating whether or not to change the dataset in-place or return a copy. If True, the dataset is changed in-place. The default is False.
- Returns
A modified TabularDataset object with the replaced record(s) or None if in_place=True.
- Return type
TabularDataset or None
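Replacement amounts to a drop followed by an add. The sketch below shows this on plain DataFrames; it is not the library's implementation. It assumes that, when records_out is empty, as many random records are removed as are inserted, and that the index is renumbered afterwards. The `seed` parameter is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"age": [31, 45, 27]})   # hypothetical dataset
records_in = pd.DataFrame({"age": [60]})   # record to insert


def replace(data, records_in, records_out=None, seed=0):
    # Assumption: when records_out is empty, as many random records are
    # removed as are inserted, so the dataset size is preserved.
    if not records_out:
        records_out = data.sample(n=len(records_in), random_state=seed).index.tolist()
    kept = data.drop(index=records_out)
    # Assumption: the index is renumbered after the swap.
    return pd.concat([kept, records_in], ignore_index=True)


swapped = replace(df, records_in, records_out=[1])  # record 1 makes way for the new one
random_swap = replace(df, records_in)               # a random record is removed
```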
- sample(n_samples=1, frac=None, random_state=None)
Sample a set of records from a TabularDataset object.
- Parameters
n_samples (int) – Number of records to sample. If frac is not None, this parameter is ignored.
frac (float) – Fraction of records to sample.
random_state (optional) – Passed to pandas.DataFrame.sample()
- Returns
A TabularDataset object with a sample of the records of the original object.
- Return type
TabularDataset
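The precedence of frac over n_samples can be sketched with pandas.DataFrame.sample(), which the method is documented to delegate to. The standalone function below is an illustration on a bare DataFrame, not the library's code.

```python
import pandas as pd

df = pd.DataFrame({"age": range(20, 30)})  # hypothetical 10-record dataset


def sample(data, n_samples=1, frac=None, random_state=None):
    # frac takes precedence: n_samples is ignored when frac is given.
    if frac is not None:
        return data.sample(frac=frac, random_state=random_state)
    return data.sample(n=n_samples, random_state=random_state)


by_count = sample(df, n_samples=3, random_state=0)
by_frac = sample(df, n_samples=3, frac=0.5, random_state=0)  # n_samples ignored
```

Passing a fixed random_state makes the draw reproducible across runs.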
- view(columns=None, exclude_columns=None)
Create a TabularDataset object that contains a subset of the columns of this TabularDataset. The resulting object holds its own copy of the data, and can thus be modified without affecting the original data.
- Parameters
Exactly one of columns and exclude_columns must be defined.
columns (list, or None) – The columns to include in the view.
exclude_columns (list, or None) – The columns to exclude from the view, with all other columns included.
- Returns
A subset of this data, restricted to some columns.
- Return type
TabularDataset
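The column-selection rule and the copy semantics can be sketched on a bare DataFrame. This is an illustration only; in particular, raising ValueError when the "exactly one" rule is violated is an assumption, since this section does not say how the library reports it.

```python
import pandas as pd

df = pd.DataFrame({"age": [31, 45], "city": ["Leeds", "York"], "income": [30, 52]})


def view(data, columns=None, exclude_columns=None):
    # Exactly one of the two arguments must be provided
    # (the error type is an assumption of this sketch).
    if (columns is None) == (exclude_columns is None):
        raise ValueError("Provide exactly one of columns and exclude_columns.")
    if columns is None:
        columns = [c for c in data.columns if c not in exclude_columns]
    # Return a copy so that edits do not reach the original data.
    return data[columns].copy()


only_age = view(df, columns=["age"])
no_income = view(df, exclude_columns=["income"])
only_age.loc[0, "age"] = 99  # df is unaffected
```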
- write(filepath)
Write data and description to file.
- Parameters
filepath (str) – Path where the csv and json files are saved.
- write_to_string()
Return a string holding the dataset (as a csv).