tempor.data.samples module

Data handling for the different data sample modalities supported by TemporAI.

class tempor.data.samples.DataSamples(data: DataFrame | ndarray, **kwargs: Any)[source]

Bases: Plugin, ABC

The abstract base class for all data samples classes.

Parameters:
data : data_typing.DataContainer

The data container.

**kwargs : Any

Any additional keyword arguments.

abstract property modality : DataModality

Return the data modality enum corresponding to the class.

Returns:

The data modality enum.

Return type:

data_typing.DataModality

validate() None[source]

Validate the data contained.

Raises:

tempor.exc.DataValidationException – Raised if data validation fails.

abstract static from_numpy(array: ndarray, *, sample_index: list[int] | list[str] | None = None, feature_index: list[str] | None = None, **kwargs: Any) DataSamples[source]

Create DataSamples from numpy.ndarray.

Parameters:
array : np.ndarray

The array that represents the data.

sample_index : Optional[data_typing.SampleIndex], optional

List with sample (row) index for each sample. Optional, if None, will be of form [0, 1, ...]. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

List with feature (column) index for each feature. Optional, if None, will be of form ["feat_0", "feat_1", ...]. Defaults to None.

**kwargs : Any

Any additional keyword arguments.

Returns:

DataSamples object from array.

Return type:

DataSamples

abstract static from_dataframe(dataframe: DataFrame, **kwargs: Any) DataSamples[source]

Create DataSamples from pandas.DataFrame.

abstract numpy(**kwargs: Any) ndarray[source]

Return numpy.ndarray representation of the data.

abstract dataframe(**kwargs: Any) DataFrame[source]

Return pandas.DataFrame representation of the data.

abstract property num_samples : int

Return number of samples.

abstract sample_index() list[int] | list[str][source]

Return a list representing sample indexes.

abstract property num_features : int

Return number of features.

abstract short_repr() str[source]

A short string representation of the object.

Returns:

The short string representation of the object.

Return type:

str

class tempor.data.samples.StaticSamplesBase(data: DataFrame | ndarray, **kwargs: Any)[source]

Bases: DataSamples

The abstract base class for all static data samples classes.

Parameters:
data : data_typing.DataContainer

The data container.

**kwargs : Any

Any additional keyword arguments.

property modality : DataModality

Return the data modality enum corresponding to the class. Here, STATIC.

Returns:

The data modality enum. Here, STATIC.

Return type:

data_typing.DataModality

class tempor.data.samples.TimeSeriesSamplesBase(data: DataFrame | ndarray, **kwargs: Any)[source]

Bases: DataSamples

The abstract base class for all time series data samples classes.

Parameters:
data : data_typing.DataContainer

The data container.

**kwargs : Any

Any additional keyword arguments.

property modality : DataModality

Return the data modality enum corresponding to the class. Here, TIME_SERIES.

Returns:

The data modality enum. Here, TIME_SERIES.

Return type:

data_typing.DataModality

abstract time_indexes() list[list[float]] | list[list[int]] | list[list[Timestamp]][source]

Get a list containing time indexes for each sample. Each time index is represented as a list of time step elements.

Returns:

A list containing time indexes for each sample.

Return type:

data_typing.TimeIndexList

abstract time_indexes_as_dict() dict[int, list[float] | list[int] | list[Timestamp]] | dict[str, list[float] | list[int] | list[Timestamp]][source]

Get a dictionary mapping each sample index to its time index. Time index is represented as a list of time step elements.

Returns:

The dictionary mapping each sample index to its time index.

Return type:

data_typing.SampleToTimeIndexDict

abstract time_indexes_float() list[ndarray][source]

Return the time indexes with their elements converted to float values.

A date-time time index is converted using datetime_time_index_to_float.

Returns:

List of 1D numpy.ndarray objects of float values, corresponding to the time indexes.

Return type:

List[np.ndarray]

abstract num_timesteps() list[int][source]

Get the number of timesteps for each sample.

Returns:

List containing the number of timesteps for each sample.

Return type:

List[int]

abstract num_timesteps_as_dict() dict[int, int] | dict[str, int][source]

Get a dictionary mapping each sample index to its number of timesteps.

Returns:

Dictionary mapping each sample index to its number of timesteps.

Return type:

data_typing.SampleToNumTimestepsDict

abstract num_timesteps_equal() bool[source]

Returns True if all samples share the same number of timesteps, False otherwise.

Returns:

Whether all samples share the same number of timesteps.

Return type:

bool

abstract list_of_dataframes() list[DataFrame][source]

Return a list of dataframes, one per sample, where each dataframe holds the data for that sample.

Returns:

List of dataframes for each sample.

Return type:

List[pd.DataFrame]

class tempor.data.samples.EventSamplesBase(data: DataFrame | ndarray, **kwargs: Any)[source]

Bases: DataSamples

The abstract base class for all event data samples classes.

Parameters:
data : data_typing.DataContainer

The data container.

**kwargs : Any

Any additional keyword arguments.

property modality : DataModality

Return the data modality enum corresponding to the class. Here, EVENT.

Returns:

The data modality enum. Here, EVENT.

Return type:

data_typing.DataModality

abstract split(time_feature_suffix: str = '_time') DataFrame[source]

Return a pandas.DataFrame where the time component of each event feature has been split off to its own column. The new columns that contain the times will be named "<original column name><time_feature_suffix>" and will be inserted before each corresponding <original column name> column. The <original column name> columns will contain only the event value.

Parameters:
time_feature_suffix : str, optional

A column name suffix string to identify the time columns that will be split off. Defaults to "_time".

Returns:

The output dataframe.

Return type:

pd.DataFrame

abstract split_as_two_dataframes(time_feature_suffix: str = '_time') tuple[DataFrame, DataFrame][source]
Analogous to split() but returns two pandas.DataFrame objects:
  • The first dataframe contains the event times of each feature.
  • The second dataframe contains the event values (True/False) of each feature.

Parameters:
time_feature_suffix : str, optional

A column name suffix string to identify the time columns that will be split off. Defaults to "_time".

Returns:

Two pandas.DataFrame objects containing event times and values, respectively.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

class tempor.data.samples.StaticSamples(data: DataFrame | ndarray, *, sample_index: list[int] | list[str] | None = None, feature_index: list[str] | None = None, **kwargs: Any)[source]

Bases: StaticSamplesBase

Create a StaticSamples object from the data.

Parameters:
data : data_typing.DataContainer

A container with the data.

sample_index : Optional[data_typing.SampleIndex], optional

Used only if data is a numpy.ndarray. List with sample (row) index for each sample. Optional, if None, will be of form [0, 1, ...]. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Used only if data is a numpy.ndarray. List with feature (column) index for each feature. Optional, if None, will be of form ["feat_0", "feat_1", ...]. Defaults to None.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.
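
The default index forms described above can be illustrated without TemporAI itself. The sketch below uses only numpy and pandas. The helper static_array_to_dataframe is hypothetical: it mimics the documented defaults for a 2D (sample, feature) array and is not the library's implementation.

```python
from typing import List, Optional, Union

import numpy as np
import pandas as pd


def static_array_to_dataframe(
    array: np.ndarray,
    sample_index: Optional[Union[List[int], List[str]]] = None,
    feature_index: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Hypothetical helper: lay out a 2D (sample, feature) array as a dataframe."""
    if array.ndim != 2:
        raise ValueError("Expected a 2D array with dimensions (sample, feature).")
    n_samples, n_features = array.shape
    if sample_index is None:
        # Documented default form: [0, 1, ...]
        sample_index = list(range(n_samples))
    if feature_index is None:
        # Documented default form: ["feat_0", "feat_1", ...]
        feature_index = [f"feat_{i}" for i in range(n_features)]
    return pd.DataFrame(array, index=sample_index, columns=feature_index)


static_df = static_array_to_dataframe(np.arange(6, dtype=float).reshape(3, 2))
print(static_df)
```

A dataframe laid out this way (rows as samples, columns as features) matches the shape that from_dataframe is documented to expect.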

static from_dataframe(dataframe: DataFrame, **kwargs: Any) StaticSamples[source]

Create StaticSamples from pandas.DataFrame. The rows represent samples, the columns represent features.

Parameters:
dataframe : pd.DataFrame

The dataframe that represents the data.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

Returns:

StaticSamples object from dataframe.

Return type:

StaticSamples

static from_numpy(array: ndarray, *, sample_index: list[int] | list[str] | None = None, feature_index: list[str] | None = None, **kwargs: Any) StaticSamples[source]

Create StaticSamples from numpy.ndarray. The 0th dimension represents samples, the 1st dimension represents features.

Parameters:
array : np.ndarray

The array with the data.

sample_index : Optional[data_typing.SampleIndex], optional

Sample indices to assign. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Feature indices to assign. Defaults to None.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

Returns:

StaticSamples object created from the array.

Return type:

StaticSamples

numpy(**kwargs: Any) ndarray[source]

Return the data as a numpy.ndarray.

Parameters:
**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The numpy.ndarray.

Return type:

np.ndarray

dataframe(**kwargs: Any) DataFrame[source]

Return the data as a pandas.DataFrame.

Parameters:
**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The dataframe.

Return type:

pd.DataFrame

sample_index() list[int] | list[str][source]

Return a list representing sample indexes.

Returns:

Sample indexes.

Return type:

data_typing.SampleIndex

property num_samples : int

Return number of samples.

Returns:

Number of samples.

Return type:

int

property num_features : int

Return number of features.

Returns:

Number of features.

Return type:

int

short_repr() str[source]

A short string representation of the object.

Returns:

The short representation.

Return type:

str

category : ClassVar[plugin_typing.PluginCategory] = 'static_samples'

Plugin category, such as 'prediction.one_off.classification'. Must be set by the plugin class using @register_plugin.

name : ClassVar[plugin_typing.PluginName] = 'static_samples_df'

Plugin name, such as 'my_nn_classifier'. Must be set by the plugin class using @register_plugin.

plugin_type : ClassVar[plugin_typing.PluginTypeArg] = 'dataformat'

Plugin type, such as 'method'. May be optionally set by the plugin class using @register_plugin, else will set the default plugin type.

tempor.data.samples.workaround_pandera_pd2_1_0_multiindex_compatibility(schema: DataFrameSchema, data: DataFrame) Generator[source]

A version compatibility issue exists between pandera and pandas 2.1.0, as reported here: https://github.com/unionai-oss/pandera/issues/1328

The issue is that multiindex uniqueness validation raises an unexpected error.

This workaround manually raises the error that would normally be expected from pandera.

class tempor.data.samples.TimeSeriesSamples(data: DataFrame | ndarray, *, padding_indicator: Any = None, sample_index: list[int] | list[str] | None = None, time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]] | None = None, feature_index: list[str] | None = None, **kwargs: Any)[source]

Bases: TimeSeriesSamplesBase

Create a TimeSeriesSamples object from the data.

If data is a pandas.DataFrame, it should have a 2-level multiindex (sample, timestep) on the rows.

If data is a numpy.ndarray, it should be a 3D array with dimensions (sample, timestep, feature). If samples differ in length, the shorter ones can be padded out with the padding_indicator value. Padding must go at the end of the timesteps (dim 1) and must be the same across the feature dimension (dim 2) for each sample.

Parameters:
data : data_typing.DataContainer

A container with the data.

padding_indicator : Any, optional

Padding indicator used in data to indicate padding. Defaults to None.

sample_index : Optional[data_typing.SampleIndex], optional

Used only if data is a numpy.ndarray. List with sample (row) index for each sample. Optional, if None, will be of form [0, 1, ...]. Defaults to None.

time_indexes : Optional[data_typing.TimeIndexList], optional

Used only if data is a numpy.ndarray. List of lists containing timesteps for each sample (outer list should be the same length as dim 0 of data, inner list should contain as many elements as each sample has timesteps). Optional, if None, will be of form [[0, 1, ...], [0, 1, ...], ...]. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Used only if data is a numpy.ndarray. List with feature (column) index for each feature. Optional, if None, will be of form ["feat_0", "feat_1", ...]. Defaults to None.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

static from_dataframe(dataframe: DataFrame, **kwargs: Any) TimeSeriesSamples[source]

Create TimeSeriesSamples from pandas.DataFrame. The row index of the dataframe should be a 2-level multiindex (sample, timestep). The columns should be the features.

Parameters:
dataframe : pd.DataFrame

The dataframe that contains the data.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

Returns:

The TimeSeriesSamples object created from the dataframe.

Return type:

TimeSeriesSamples
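
As an illustration of the expected input layout, the following sketch builds a dataframe with a 2-level (sample, timestep) row multiindex using pandas only. The level names "sample_idx" and "time_idx" and the feature names are illustrative assumptions, not requirements stated in this reference.

```python
import pandas as pd

# Two samples: "a" has 3 timesteps, "b" has 2 (lengths may differ per sample).
index = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 1), ("a", 2), ("b", 0), ("b", 1)],
    names=["sample_idx", "time_idx"],  # illustrative level names
)
ts_df = pd.DataFrame(
    {
        "feat_0": [1.0, 2.0, 3.0, 4.0, 5.0],
        "feat_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    },
    index=index,
)
print(ts_df)
```

A dataframe of this shape is what would be passed as the dataframe argument.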

static from_numpy(array: ndarray, *, padding_indicator: Any | None = None, sample_index: list[int] | list[str] | None = None, time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]] | None = None, feature_index: list[str] | None = None, **kwargs: Any) TimeSeriesSamples[source]

Create TimeSeriesSamples from numpy.ndarray.

The array should be 3D, with dimensions (sample, timestep, feature).

If samples differ in length, the shorter ones can be padded out with the padding_indicator value. Padding must go at the end of the timesteps (dim 1) and must be the same across the feature dimension (dim 2) for each sample.

Parameters:
array : np.ndarray

The array that contains the data.

padding_indicator : Any, optional

The padding indicator value. Defaults to None.

sample_index : Optional[data_typing.SampleIndex], optional

Sample indexes as a list. Defaults to None.

time_indexes : Optional[data_typing.TimeIndexList], optional

Time indexes as a list of lists (that is, time indexes per sample). Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Feature indexes as a list. Defaults to None.

**kwargs : Any

Any additional keyword arguments.

Returns:

The TimeSeriesSamples object created from the array.

Return type:

TimeSeriesSamples
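
The padding rules above can be made concrete with a small numpy-only sketch. It pads two samples of different lengths into one 3D array and then recovers the true number of timesteps from the padding. The value -1.0 used as the padding indicator is an arbitrary choice for illustration; any value that does not occur in the real data would do.

```python
import numpy as np

PAD = -1.0  # illustrative padding indicator (assumption, not a library default)

# Sample 0 has 3 timesteps, sample 1 has 2; both have 2 features.
sequences = [
    np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    np.array([[4.0, 40.0], [5.0, 50.0]]),
]
max_len = max(len(seq) for seq in sequences)

# Start from an array full of padding, then copy the real data in.
padded = np.full((len(sequences), max_len, 2), PAD)
for i, seq in enumerate(sequences):
    padded[i, : len(seq), :] = seq  # real data first, padding stays at the end

# A timestep counts as padding only if every feature equals the indicator.
is_padding = (padded == PAD).all(axis=2)
num_timesteps = (~is_padding).sum(axis=1).tolist()
print(padded.shape, num_timesteps)
```

An array such as padded, together with padding_indicator=PAD, has the layout this method describes.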

numpy(*, padding_indicator: Any = 999.0, **kwargs: Any) ndarray[source]

Return the data as a numpy.ndarray.

Parameters:
padding_indicator : Any, optional

Padding indicator value. Defaults to DATA_SETTINGS.default_padding_indicator (999.0 in the signature shown above).

**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The numpy.ndarray.

Return type:

np.ndarray
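
To show what a padded array representation of variable-length time series looks like, here is a minimal sketch in pandas and numpy. The function to_padded_array is hypothetical and is not the library's implementation; it assumes a 2-level (sample, timestep) multiindex dataframe and reuses 999.0 as the padding value from the signature above.

```python
import numpy as np
import pandas as pd


def to_padded_array(df: pd.DataFrame, padding_indicator: float = 999.0) -> np.ndarray:
    """Hypothetical sketch: multiindex dataframe -> (sample, timestep, feature) array."""
    # One 2D (timestep, feature) block per sample, in order of first appearance.
    blocks = [group.to_numpy() for _, group in df.groupby(level=0, sort=False)]
    max_len = max(len(block) for block in blocks)
    out = np.full((len(blocks), max_len, df.shape[1]), padding_indicator, dtype=float)
    for i, block in enumerate(blocks):
        out[i, : len(block), :] = block  # padding remains at the end of dim 1
    return out


index = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 1), ("a", 2), ("b", 0), ("b", 1)]
)
df = pd.DataFrame({"feat_0": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)
arr = to_padded_array(df)
print(arr.shape)
```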

dataframe(**kwargs: Any) DataFrame[source]

Return the data as a pandas.DataFrame.

Parameters:
**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The pandas.DataFrame.

Return type:

pd.DataFrame

sample_index() list[int] | list[str][source]

Get a list containing sample indexes.

Returns:

A list containing sample indexes.

Return type:

data_typing.SampleIndex

time_indexes() list[list[float]] | list[list[int]] | list[list[Timestamp]][source]

Get a list containing time indexes for each sample. Each time index is represented as a list of time step elements.

Returns:

A list containing time indexes for each sample.

Return type:

data_typing.TimeIndexList

time_indexes_as_dict() dict[int, list[float] | list[int] | list[Timestamp]] | dict[str, list[float] | list[int] | list[Timestamp]][source]

Get a dictionary mapping each sample index to its time index. Time index is represented as a list of time step elements.

Returns:

The dictionary mapping each sample index to its time index.

Return type:

data_typing.SampleToTimeIndexDict
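
The relationship between the list form and the dictionary form of the time indexes can be sketched with plain pandas. This assumes a 2-level (sample, timestep) multiindex dataframe; the variable names are illustrative and the code does not call TemporAI.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 1), ("a", 2), ("b", 0), ("b", 1)]
)
df = pd.DataFrame({"feat_0": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# Dictionary form: sample index -> list of that sample's time steps.
time_indexes_dict = {
    sample: group.index.get_level_values(1).tolist()
    for sample, group in df.groupby(level=0, sort=False)
}

# List form: the same time indexes, ordered by sample.
time_indexes_list = list(time_indexes_dict.values())
print(time_indexes_dict)
```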

time_indexes_float() list[ndarray][source]

Return the time indexes with their elements converted to float values.

A date-time time index is converted using datetime_time_index_to_float.

Returns:

List of 1D numpy.ndarray objects of float values, corresponding to the time indexes.

Return type:

List[np.ndarray]

num_timesteps() list[int][source]

Get the number of timesteps for each sample.

Returns:

List containing the number of timesteps for each sample.

Return type:

List[int]

num_timesteps_as_dict() dict[int, int] | dict[str, int][source]

Get a dictionary mapping each sample index to its number of timesteps.

Returns:

Dictionary mapping each sample index to its number of timesteps.

Return type:

data_typing.SampleToNumTimestepsDict

num_timesteps_equal() bool[source]

Returns True if all samples share the same number of timesteps, False otherwise.

Returns:

Whether all samples share the same number of timesteps.

Return type:

bool
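
The three timestep-count accessors above return related views of the same information. A pandas-only sketch of how such values could be derived from a 2-level (sample, timestep) multiindex dataframe follows. It is an illustration under that assumption, not the library's code.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 0), ("a", 1), ("a", 2), ("b", 0), ("b", 1)]
)
df = pd.DataFrame({"feat_0": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# Group rows by the first index level (the sample) and count the timesteps.
counts = df.groupby(level=0, sort=False).size()

num_timesteps_dict = {sample: int(n) for sample, n in counts.items()}
num_timesteps_list = list(num_timesteps_dict.values())
all_equal = len(set(num_timesteps_list)) <= 1
print(num_timesteps_dict, num_timesteps_list, all_equal)
```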

list_of_dataframes() list[DataFrame][source]

Return a list of dataframes, one per sample, where each dataframe holds the data for that sample.

Returns:

List of dataframes for each sample.

Return type:

List[pd.DataFrame]

property num_samples : int

Return number of samples.

Returns:

Number of samples.

Return type:

int

property num_features : int

Return number of features.

Returns:

Number of features.

Return type:

int

short_repr() str[source]

A short string representation of the object.

Returns:

The short representation.

Return type:

str

category : ClassVar[plugin_typing.PluginCategory] = 'time_series_samples'

Plugin category, such as 'prediction.one_off.classification'. Must be set by the plugin class using @register_plugin.

name : ClassVar[plugin_typing.PluginName] = 'time_series_samples_df'

Plugin name, such as 'my_nn_classifier'. Must be set by the plugin class using @register_plugin.

plugin_type : ClassVar[plugin_typing.PluginTypeArg] = 'dataformat'

Plugin type, such as 'method'. May be optionally set by the plugin class using @register_plugin, else will set the default plugin type.

class tempor.data.samples.EventSamples(data: DataFrame | ndarray, *, sample_index: list[int] | list[str] | None = None, feature_index: list[str] | None = None, **kwargs: Any)[source]

Bases: EventSamplesBase

Create an EventSamples object from the data.

Parameters:
data : data_typing.DataContainer

A container with the data.

sample_index : Optional[data_typing.SampleIndex], optional

Used only if data is a numpy.ndarray. List with sample (row) index for each sample. Optional, if None, will be of form [0, 1, ...]. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Used only if data is a numpy.ndarray. List with feature (column) index for each feature. Optional, if None, will be of form ["feat_0", "feat_1", ...]. Defaults to None.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

static from_dataframe(dataframe: DataFrame, **kwargs: Any) EventSamples[source]

Create EventSamples from pandas.DataFrame. The row index of the dataframe should be the sample indexes. The columns should be the features. Each feature should contain a tuple of (time, value) representing the event.

Parameters:
dataframe : pd.DataFrame

The dataframe that contains the data.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

Returns:

The EventSamples object created from the dataframe.

Return type:

EventSamples
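
A dataframe in the layout described above can be built with pandas alone. In this sketch the sample identifiers, the feature names, and the event values are made up for illustration; each cell holds a (time, value) tuple.

```python
import pandas as pd

# Rows are samples, columns are event features, cells are (time, value) tuples.
event_df = pd.DataFrame(
    {
        "death": [(5.0, True), (12.5, False), (3.2, True)],
        "relapse": [(2.0, True), (12.5, False), (3.2, False)],
    },
    index=["p1", "p2", "p3"],
)
print(event_df)
```

A dataframe of this shape is what would be passed as the dataframe argument.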

static from_numpy(array: ndarray, *, sample_index: list[int] | list[str] | None = None, feature_index: list[str] | None = None, **kwargs: Any) EventSamples[source]

Create EventSamples from numpy.ndarray. The array should be a 2D array, with dimensions (sample, feature). Each element should contain a tuple of (time, value) representing the event.

Parameters:
array : np.ndarray

The array that contains the data.

sample_index : Optional[data_typing.SampleIndex], optional

Sample indexes. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Feature index. Defaults to None.

**kwargs : Any

Any additional keyword arguments to pass to the constructor.

Returns:

The EventSamples object created from the array.

Return type:

EventSamples

numpy(**kwargs: Any) ndarray[source]

Return the data as a numpy.ndarray.

Parameters:
**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The numpy.ndarray.

Return type:

np.ndarray

dataframe(**kwargs: Any) DataFrame[source]

Return the data as a pandas.DataFrame.

Parameters:
**kwargs : Any

Any additional keyword arguments. Currently unused.

Returns:

The pandas.DataFrame.

Return type:

pd.DataFrame

sample_index() list[int] | list[str][source]

Return a list representing sample indexes.

Returns:

Sample indexes.

Return type:

data_typing.SampleIndex

category : ClassVar[plugin_typing.PluginCategory] = 'event_samples'

Plugin category, such as 'prediction.one_off.classification'. Must be set by the plugin class using @register_plugin.

name : ClassVar[plugin_typing.PluginName] = 'event_samples_df'

Plugin name, such as 'my_nn_classifier'. Must be set by the plugin class using @register_plugin.

property num_samples : int

Return number of samples.

Returns:

Number of samples.

Return type:

int

plugin_type : ClassVar[plugin_typing.PluginTypeArg] = 'dataformat'

Plugin type, such as 'method'. May be optionally set by the plugin class using @register_plugin, else will set the default plugin type.

property num_features : int

Return number of features.

Returns:

Number of features.

Return type:

int

split(time_feature_suffix: str = '_time') DataFrame[source]

Return a pandas.DataFrame where the time component of each event feature has been split off to its own column. The new columns that contain the times will be named "<original column name><time_feature_suffix>" and will be inserted before each corresponding <original column name> column. The <original column name> columns will contain only the event value.

Parameters:
time_feature_suffix : str, optional

A column name suffix string to identify the time columns that will be split off. Defaults to "_time".

Returns:

The output dataframe.

Return type:

pd.DataFrame
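
The column layout that split() describes can be reproduced in a few lines of pandas. The function split_events below is a hypothetical re-implementation for illustration only; it assumes every cell is a (time, value) tuple and follows the documented naming and ordering of the time columns.

```python
import pandas as pd


def split_events(df: pd.DataFrame, time_feature_suffix: str = "_time") -> pd.DataFrame:
    """Hypothetical sketch: split (time, value) cells into two columns per feature."""
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        # The time column comes first, immediately before the value column.
        out[f"{col}{time_feature_suffix}"] = [time for time, _ in df[col]]
        out[col] = [value for _, value in df[col]]
    return out


event_df = pd.DataFrame(
    {"death": [(5.0, True), (12.5, False)], "relapse": [(2.0, True), (12.5, False)]},
    index=["p1", "p2"],
)
split_df = split_events(event_df)
print(split_df)
```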

split_as_two_dataframes(time_feature_suffix: str = '_time') tuple[DataFrame, DataFrame][source]
Analogous to split() but returns two pandas.DataFrame objects:
  • The first dataframe contains the event times of each feature.
  • The second dataframe contains the event values (True/False) of each feature.

Parameters:
time_feature_suffix : str, optional

A column name suffix string to identify the time columns that will be split off. Defaults to "_time".

Returns:

Two pandas.DataFrame objects containing event times and values, respectively.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]
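
The two-dataframe form can be sketched in the same spirit. The code below is a pandas-only illustration, not the library's implementation. This reference does not state how the columns of the times dataframe are named, so applying the suffix to them here is an assumption.

```python
import pandas as pd

event_df = pd.DataFrame(
    {"death": [(5.0, True), (12.5, False)], "relapse": [(2.0, True), (12.5, False)]},
    index=["p1", "p2"],
)
time_feature_suffix = "_time"

# First dataframe: event times (column naming with the suffix is an assumption).
times_df = pd.DataFrame(
    {f"{col}{time_feature_suffix}": [t for t, _ in event_df[col]] for col in event_df.columns},
    index=event_df.index,
)

# Second dataframe: event values (True/False), under the original column names.
values_df = pd.DataFrame(
    {col: [v for _, v in event_df[col]] for col in event_df.columns},
    index=event_df.index,
)
print(times_df)
print(values_df)
```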

short_repr() str[source]

A short string representation of the object.

Returns:

The short representation.

Return type:

str