tempor.data.utils module

Module containing utility functions for data format management.

tempor.data.utils.EXCEPTION_MESSAGES = _ExceptionMessages()

Reusable error messages for the module.

tempor.data.utils.value_in_df(df: DataFrame, *, value: Any) → bool

Check if value exists in dataframe df, accounting for the case where value is numpy.nan.
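The NaN case needs separate handling because numpy.nan does not compare equal to itself, so a plain equality test never finds it. A minimal sketch of such a check follows; value_in_df_sketch is a hypothetical name and this is not the library's actual implementation:

```python
import numpy as np
import pandas as pd


def value_in_df_sketch(df: pd.DataFrame, *, value) -> bool:
    """Hypothetical illustration of a NaN-aware membership check."""
    # NaN != NaN, so equality cannot detect it; use isna() instead.
    if isinstance(value, float) and np.isnan(value):
        return bool(df.isna().any().any())
    return bool((df == value).any().any())


df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
found_nan = value_in_df_sketch(df, value=np.nan)
found_three = value_in_df_sketch(df, value=3.0)
found_missing = value_in_df_sketch(df, value=99.0)
```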

tempor.data.utils.set_df_column_names_inplace(df: DataFrame, names: Sequence) → DataFrame

Set the column names of df to names in place. This helper exists to handle the differing behaviour of set_axis across pandas versions.

Parameters:
df : pd.DataFrame

Dataframe.

names : Sequence

Column names.

Returns:

Dataframe with column names set.

Return type:

pd.DataFrame
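One way to rename columns in place without relying on set_axis is direct assignment to df.columns. The sketch below only illustrates that idea; set_columns_sketch is a hypothetical name, and the library may implement this differently:

```python
import pandas as pd


def set_columns_sketch(df: pd.DataFrame, names) -> pd.DataFrame:
    """Hypothetical illustration: rename columns on the same object."""
    # Assigning to df.columns mutates df itself rather than returning a copy.
    df.columns = list(names)
    return df


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = set_columns_sketch(df, ["x", "y"])
```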

tempor.data.utils.get_df_index_level0_unique(df: DataFrame) → Index

Return the unique values of the level 0 index of df.

Parameters:
df : pd.DataFrame

The dataframe.

Returns:

The unique values of the level 0 index of df.

Return type:

pd.Index
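For a (sample, timestep) multiindex, the unique level 0 values are the sample IDs. The equivalent plain pandas expression is shown below as an illustration; the index names "sample" and "time" are assumptions made for this example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 2), ("s2", 1)], names=["sample", "time"]
)
df = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=index)

# Take the level 0 value of every row, then drop the repeats.
level0 = df.index.get_level_values(0).unique()
```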

tempor.data.utils.multiindex_timeseries_dataframe_to_array3d(df: DataFrame, *, padding_indicator: Any, max_timesteps: int | None = None) → ndarray

Convert timeseries dataframe df with a 2-level multiindex (sample, timestep) to a 3D numpy array with dimensions (sample, timestep, feature).

Parameters:
df : pd.DataFrame

Input dataframe.

padding_indicator : Any

Padding indicator value used to pad the output array when samples have unequal numbers of timesteps.

max_timesteps : Optional[int], optional

Maximum number of timesteps to use. This becomes the size of dim 1 of the output array. If set to None, this dimension is set to the highest number of timesteps among the samples. Defaults to None.

Raises:

ValueError – raised if padding_indicator is found among the data values in df.

Returns:

Output 3D numpy array.

Return type:

np.ndarray
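The conversion can be pictured as filling a pre-padded array sample by sample. The following is a rough sketch under stated assumptions, not the library's code: to_array3d_sketch is a hypothetical name, the data is assumed numeric, and truncating samples longer than max_timesteps is an assumption about the behaviour.

```python
import numpy as np
import pandas as pd


def to_array3d_sketch(df, *, padding_indicator, max_timesteps=None):
    """Hypothetical illustration of (sample, timestep) frame -> 3D array."""
    if (df.to_numpy() == padding_indicator).any():
        raise ValueError("padding_indicator found among the data values")
    sample_ids = df.index.get_level_values(0).unique()
    per_sample = [df.loc[s].to_numpy() for s in sample_ids]
    n_steps = max_timesteps
    if n_steps is None:
        n_steps = max(len(values) for values in per_sample)
    # Start from an array that is entirely padding, then write the data in.
    out = np.full((len(per_sample), n_steps, df.shape[1]), padding_indicator, dtype=float)
    for i, values in enumerate(per_sample):
        values = values[:n_steps]  # assumption: longer samples are truncated
        out[i, : len(values), :] = values
    return out


index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 2), ("s1", 3), ("s2", 1), ("s2", 2)]
)
df = pd.DataFrame(
    {"f1": [11.0, 12.0, 13.0, 21.0, 22.0], "f2": [1.1, 1.2, 1.3, 2.1, 2.2]},
    index=index,
)
array = to_array3d_sketch(df, padding_indicator=999.0)
```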

tempor.data.utils.check_bool_array1d_trues_consecutive(array: ndarray, at_beginning: bool = False, at_end: bool = False) → bool

Check whether all True elements of a 1D array (containing bool values) are consecutive. If at_{beginning,end} is set, also check that a True element is present as the first or last element of the array, respectively. Raises ValueError if the input array format is unexpected.

Examples

>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, True, True, False]))
True
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, False, True, False]))
False
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, True, True]), at_end=True)
True

Parameters:
array : np.ndarray

Input array.

at_beginning : bool, optional

Check if first element is True. Defaults to False.

at_end : bool, optional

Check if last element is True. Defaults to False.

Returns:

The result of the check.

Return type:

bool
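The consecutiveness test can be expressed through the positions of the True elements: they are consecutive exactly when the span from the first to the last True position equals their count. A sketch of that logic is given below; trues_consecutive_sketch is a hypothetical name, and the handling of an array with no True elements is an assumption, not documented behaviour.

```python
import numpy as np


def trues_consecutive_sketch(array, at_beginning=False, at_end=False):
    """Hypothetical illustration of the consecutive-True check."""
    array = np.asarray(array)
    if array.ndim != 1 or array.dtype != bool:
        raise ValueError("Expected a 1D array of bool values")
    true_positions = np.flatnonzero(array)
    if true_positions.size == 0:
        # Assumption: no True elements passes only if no edge check is requested.
        return not (at_beginning or at_end)
    if at_beginning and not array[0]:
        return False
    if at_end and not array[-1]:
        return False
    span = true_positions[-1] - true_positions[0] + 1
    return bool(span == true_positions.size)


middle = trues_consecutive_sketch([False, True, True, True, False])
gapped = trues_consecutive_sketch([False, True, False, True, False])
starts = trues_consecutive_sketch([False, True, True], at_beginning=True)
```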

tempor.data.utils.check_bool_array2d_identical_along_dim1(array: ndarray) → bool

Check if 2D array (containing bool values) has the same values along dimension 1.

Examples

>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> check_bool_array2d_identical_along_dim1(np.asarray([[True, True, False], [True, True, False]]).T)
True
>>> check_bool_array2d_identical_along_dim1(np.asarray([[True, True, False], [False, True, False]]).T)
False

Parameters:
array : np.ndarray

Input array.

Returns:

The result of the check.

Return type:

bool
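"Identical along dimension 1" means every row holds a single repeated value. One way to test this is to compare each row against its own first column using broadcasting, as in this sketch (identical_along_dim1_sketch is a hypothetical name, not the library's implementation):

```python
import numpy as np


def identical_along_dim1_sketch(array) -> bool:
    """Hypothetical illustration: does each row hold one repeated value?"""
    array = np.asarray(array)
    if array.ndim != 2:
        raise ValueError("Expected a 2D array")
    # array[:, :1] keeps the first column as shape (n, 1) so it broadcasts per row.
    return bool((array == array[:, :1]).all())


same = identical_along_dim1_sketch([[True, True], [True, True], [False, False]])
differs = identical_along_dim1_sketch([[True, False], [True, True], [False, False]])
```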

tempor.data.utils.get_array1d_length_until_padding(array: ndarray, padding_indicator: Any | None = None) → int

Get the length of a 1D array up to the first padding element, as indicated by padding_indicator. Raises ValueError if the input array format is unexpected.

Examples

>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> pad = 999.0
>>> get_array1d_length_until_padding(np.asarray([1, 8, -3, 9, pad]), padding_indicator=pad)
4
>>> get_array1d_length_until_padding(np.asarray([1, 8, -3, 9, 5]), padding_indicator=pad)
5

Parameters:
array : np.ndarray

Input array.

padding_indicator : Any, optional

Padding indicator. Defaults to None.

Returns:

Length of array up to first padding.

Return type:

int
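The length up to the first padding element is simply the index of the first match, or the full length when nothing matches. A minimal sketch, assuming that a padding_indicator of None means "no padding" (length_until_padding_sketch is a hypothetical name):

```python
import numpy as np


def length_until_padding_sketch(array, padding_indicator=None) -> int:
    """Hypothetical illustration: index of the first padding element."""
    array = np.asarray(array)
    if array.ndim != 1:
        raise ValueError("Expected a 1D array")
    if padding_indicator is None:
        # Assumption: without an indicator the whole array counts as data.
        return len(array)
    hits = np.flatnonzero(array == padding_indicator)
    return int(hits[0]) if hits.size else len(array)


pad = 999.0
padded = length_until_padding_sketch(np.asarray([1, 8, -3, 9, pad]), pad)
unpadded = length_until_padding_sketch(np.asarray([1, 8, -3, 9, 5]), pad)
```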

tempor.data.utils.validate_timeseries_array3d(array: ndarray, padding_indicator: Any | None = None) → None

Check if a 3D array representing timeseries satisfies the criteria below, otherwise raise ValueError:

- The array has 3 dimensions.
- Dimension 2 is not of size 0.
- If padding_indicator is provided, it is not np.nan, as this is not supported.
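The three criteria translate directly into three guard clauses. The sketch below mirrors them for illustration only; validate_array3d_sketch and raises_value_error are hypothetical names, and the error messages are invented for this example.

```python
import numpy as np


def validate_array3d_sketch(array, padding_indicator=None) -> None:
    """Hypothetical illustration of the three documented checks."""
    if array.ndim != 3:
        raise ValueError("Expected 3 dimensions")
    if array.shape[2] == 0:
        raise ValueError("Dimension 2 must not be of size 0")
    if isinstance(padding_indicator, float) and np.isnan(padding_indicator):
        raise ValueError("np.nan is not supported as padding_indicator")


def raises_value_error(*args, **kwargs) -> bool:
    """Helper for the example: report whether validation fails."""
    try:
        validate_array3d_sketch(*args, **kwargs)
    except ValueError:
        return True
    return False


valid = not raises_value_error(np.zeros((2, 3, 4)), padding_indicator=999.0)
wrong_ndim = raises_value_error(np.zeros((2, 3)))
no_features = raises_value_error(np.zeros((2, 3, 0)))
nan_padding = raises_value_error(np.zeros((2, 3, 4)), padding_indicator=np.nan)
```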

tempor.data.utils.get_seq_lengths_timeseries_array3d(array: ndarray, padding_indicator: Any | None = None) → list[int]

Given a 3D numpy array that represents timeseries like (sample, timestep, feature), and optionally a padding_indicator to indicate padding, get the length (number of [non-padding] timesteps) for each sample.

Example

>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> pad = 999.0
>>> array = np.asarray(  # Array with two samples, with two timeseries features.
...     [
...         # Sample 1:
...         [
...             [11, 12, 13, 14, pad],
...             [1.1, 1.2, 1.3, 1.4, pad],
...         ],
...         # Sample 2:
...         [
...             [21, 22, pad, pad, pad],
...             [2.1, 2.2, pad, pad, pad],
...         ],
...     ]
... )
>>> array = np.transpose(array, (0, 2, 1))
>>> get_seq_lengths_timeseries_array3d(array, padding_indicator=pad)
[4, 2]

Parameters:
array : np.ndarray

3D numpy array that represents timeseries like (sample, timestep, feature).

padding_indicator : Any, optional

Padding indicator used in array to indicate padding. Defaults to None.

Returns:

List of lengths (number of [non-padding] timesteps) for each sample.

Return type:

List[int]

tempor.data.utils.unpad_timeseries_array3d(array: ndarray, padding_indicator: Any) → list[ndarray]

Given a 3D numpy array that represents timeseries like (sample, timestep, feature), and a padding_indicator to indicate padding, return a list of length num_samples, which contains one array per sample like (timestep, feature), with the padding removed.
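Removing the padding amounts to cutting each sample at its first padded timestep. A sketch of that idea is shown here, assuming a timestep counts as padding when every feature equals padding_indicator and that padding only appears at the end; unpad_sketch is a hypothetical name.

```python
import numpy as np


def unpad_sketch(array, padding_indicator):
    """Hypothetical illustration: one (timestep, feature) array per sample."""
    unpadded = []
    for sample in array:
        # Assumption: a padded timestep has the indicator in every feature.
        is_padding = (sample == padding_indicator).all(axis=1)
        hits = np.flatnonzero(is_padding)
        length = int(hits[0]) if hits.size else sample.shape[0]
        unpadded.append(sample[:length])
    return unpadded


pad = 999.0
array = np.asarray(
    [
        [[11, 12, 13, 14, pad], [1.1, 1.2, 1.3, 1.4, pad]],
        [[21, 22, pad, pad, pad], [2.1, 2.2, pad, pad, pad]],
    ]
)
array = np.transpose(array, (0, 2, 1))  # to (sample, timestep, feature)
samples = unpad_sketch(array, pad)
```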

tempor.data.utils.make_sample_time_index_tuples(sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]]) → list[tuple[int, float] | tuple[int, int] | tuple[int, Timestamp] | tuple[str, float] | tuple[str, int] | tuple[str, Timestamp]]

Given a list of elements sample_index representing sample IDs and a list (of same length) of lists each representing the timesteps for the corresponding sample, return a list of tuples like [(<sample ID>, <timestep>), ...].

Example

>>> from tempor.data.utils import *
>>>
>>> sample_index = ["s1", "s2"]
>>> time_indexes = [[1, 2, 3], [1, 5, 9, 10]]
>>> make_sample_time_index_tuples(sample_index, time_indexes)
[('s1', 1), ('s1', 2), ('s1', 3), ('s2', 1), ('s2', 5), ('s2', 9), ('s2', 10)]

Parameters:
sample_index : data_typing.SampleIndex

List of sample IDs.

time_indexes : data_typing.TimeIndexList

List of lists of timesteps for each sample.

Returns:

List of tuples like [(<sample ID>, <timestep>), ...].

Return type:

data_typing.SampleTimeIndexTuples
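The pairing described here is a flattening of two nested sequences, which plain Python can express as a nested comprehension. The sketch below illustrates the idea only; make_tuples_sketch is a hypothetical name, not the library function.

```python
def make_tuples_sketch(sample_index, time_indexes):
    """Hypothetical illustration: flatten to (sample ID, timestep) pairs."""
    # zip pairs each sample ID with its own list of timesteps.
    return [
        (sample_id, timestep)
        for sample_id, timesteps in zip(sample_index, time_indexes)
        for timestep in timesteps
    ]


tuples = make_tuples_sketch(["s1", "s2"], [[1, 2, 3], [1, 5]])
```

Tuples in this form are what pandas.MultiIndex.from_tuples accepts when building a (sample, timestep) index.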

tempor.data.utils.array3d_to_multiindex_timeseries_dataframe(array: ndarray, *, sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]], feature_index: list[str], padding_indicator: Any = None) → DataFrame

Given a 3D timeseries array, sample_index, time_indexes, feature_index, and a padding_indicator, build a 2-level multiindex (sample, timestep) pandas.DataFrame.

Padding values of padding_indicator can be set inside the array to pad out the length of arrays of different samples in case they differ. Padding needs to go at the end of the timesteps (dim 1). Padding must be the same across the feature dimension (dim 2) for each sample.

Raises:

ValueError – if data or padding format is unexpected.

Parameters:
array : np.ndarray

3D numpy array that represents timeseries like (sample, timestep, feature).

sample_index : data_typing.SampleIndex

List of sample IDs (should be the same length as dim 0 of array).

time_indexes : data_typing.TimeIndexList

List of lists containing timesteps for each sample (outer list should be the same length as dim 0 of array, inner list should contain as many elements as each sample has timesteps).

feature_index : data_typing.FeatureIndex

List of feature names.

padding_indicator : Any, optional

Padding indicator used in array to indicate padding. Defaults to None.

Returns:

Resultant dataframe.

Return type:

pd.DataFrame
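The construction can be sketched as building one small frame per sample and concatenating them. This is an illustration under assumptions, not the library's method: array3d_to_df_sketch is a hypothetical name, the index level names are invented, and padding is dropped here by trusting the length of each time_indexes entry rather than by inspecting padding_indicator.

```python
import numpy as np
import pandas as pd


def array3d_to_df_sketch(array, *, sample_index, time_indexes, feature_index):
    """Hypothetical illustration of 3D array -> (sample, timestep) frame."""
    frames = []
    for sample_id, timesteps, sample in zip(sample_index, time_indexes, array):
        index = pd.MultiIndex.from_product(
            [[sample_id], timesteps], names=["sample_idx", "time_idx"]
        )
        # Assumption: rows beyond len(timesteps) are trailing padding.
        values = sample[: len(timesteps)]
        frames.append(pd.DataFrame(values, index=index, columns=feature_index))
    return pd.concat(frames)


pad = 999.0
array = np.full((2, 4, 2), pad)
array[0, :3] = [[11, 1.1], [12, 1.2], [13, 1.3]]
array[1, :4] = [[21, 2.1], [22, 2.2], [23, 2.3], [24, 2.4]]
df = array3d_to_df_sketch(
    array,
    sample_index=["s1", "s2"],
    time_indexes=[[1, 2, 3], [1, 5, 9, 10]],
    feature_index=["f1", "f2"],
)
```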

tempor.data.utils.list_of_dataframes_to_multiindex_timeseries_dataframe(list_of_dataframes: list[DataFrame], *, sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]] | None = None, feature_index: list[str] | None = None) → DataFrame

Given a list of dataframes list_of_dataframes, sample_index, and optionally time_indexes and feature_index, build a 2-level multiindex (sample, timestep) pandas.DataFrame.

Parameters:
list_of_dataframes : List[pd.DataFrame]

List of dataframes.

sample_index : data_typing.SampleIndex

List of sample IDs.

time_indexes : Optional[data_typing.TimeIndexList], optional

List of lists of time indexes. Defaults to None.

feature_index : Optional[data_typing.FeatureIndex], optional

Feature index. Defaults to None.

Returns:

Resultant dataframe.

Return type:

pd.DataFrame
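In plain pandas, stacking per-sample frames under an outer sample level is what pandas.concat does with its keys argument. The example below illustrates that building block only and assumes each input frame is already indexed by timestep; it is not taken from the library's source.

```python
import pandas as pd

frames = [
    pd.DataFrame({"f1": [11.0, 12.0, 13.0]}, index=[1, 2, 3]),
    pd.DataFrame({"f1": [21.0, 22.0]}, index=[1, 5]),
]
sample_index = ["s1", "s2"]

# keys adds the sample IDs as a new outermost index level.
df = pd.concat(frames, keys=sample_index)
```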

tempor.data.utils.multiindex_timeseries_dataframe_to_list_of_dataframes(df: DataFrame) → list[DataFrame]

Return a list of dataframes, one per sample. That is, each dataframe in the list has a single unique level 0 index value.

Parameters:
df : pd.DataFrame

Input multiindex dataframe.

Returns:

Output list of dataframes.

Return type:

List[pd.DataFrame]
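Splitting by sample corresponds to grouping on index level 0. A sketch with plain pandas follows; whether the library keeps the full multiindex on each piece, as this example does, is an assumption.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 2), ("s1", 3), ("s2", 1), ("s2", 5)]
)
df = pd.DataFrame({"f1": [11.0, 12.0, 13.0, 21.0, 22.0]}, index=index)

# One group per level 0 value; sort=False keeps the order of first appearance.
per_sample = [group for _, group in df.groupby(level=0, sort=False)]
```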

tempor.data.utils.event_time_value_pairs_to_event_dataframe(event_time_value_pairs: Sequence[tuple[list[float] | list[int] | list[Timestamp], list[bool]]], sample_index: list[int] | list[str], feature_index: list[str] | None = None) → DataFrame

Convert a sequence like [(event_times, event_values), ...] to a pandas.DataFrame whose columns contain elements like tuples (event_time, event_value).

Parameters:
event_time_value_pairs : Sequence[Tuple[data_typing.TimeIndex, List[bool]]]

A sequence where each item corresponds to an event feature and is a tuple of form (event_times, event_values) (e.g. ([1.1, 1.2, 1.3], [True, True, False])).

sample_index : data_typing.SampleIndex

List of sample IDs, to be set as dataframe row index.

feature_index : Optional[data_typing.FeatureIndex]

List of feature names, to be set as dataframe column names.

Example

>>> from tempor.data.utils import *
>>>
>>> sample_index = ["s1", "s2", "s3"]
>>> feature_names = ["feature_1", "feature_2"]
>>> event_feature_1 = ([1.1, 1.2, 1.3], [True, True, False])
>>> event_feature_2 = ([2.1, 2.2, 2.3], [False, True, False])
>>> event_time_value_pairs = [event_feature_1, event_feature_2]
>>> df = event_time_value_pairs_to_event_dataframe(
...     event_time_value_pairs, sample_index=sample_index, feature_index=feature_names
... )
>>> df.shape
(3, 2)

Returns:

pandas.DataFrame compatible with EventSamples.

Return type:

pd.DataFrame
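Each cell of the resulting dataframe holds an (event_time, event_value) tuple, with one row per sample and one column per event feature. The sketch below shows one way such a frame could be assembled; event_pairs_to_df_sketch is a hypothetical name and the library's implementation may differ.

```python
import pandas as pd


def event_pairs_to_df_sketch(event_time_value_pairs, sample_index, feature_index):
    """Hypothetical illustration: columns of (event_time, event_value) tuples."""
    columns = {}
    for name, (times, values) in zip(feature_index, event_time_value_pairs):
        # zip pairs the i-th time with the i-th value, one pair per sample.
        columns[name] = pd.Series(list(zip(times, values)), index=sample_index)
    return pd.DataFrame(columns)


df = event_pairs_to_df_sketch(
    [([1.1, 1.2, 1.3], [True, True, False]), ([2.1, 2.2, 2.3], [False, True, False])],
    sample_index=["s1", "s2", "s3"],
    feature_index=["feature_1", "feature_2"],
)
```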

tempor.data.utils.datetime_time_index_to_float(time_index: list[float] | list[int] | list[Timestamp] | Index | Series) → ndarray

Convert a date-time time_index to floats. The conversion is done by calling <time_index as a numpy array>.astype(float).

Parameters:
time_index : Union[data_typing.TimeIndex, pd.Index, pd.Series]

The input time index.

Returns:

NumPy array containing the time index converted to floats.

Return type:

np.ndarray
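As an illustration of the astype(float) conversion described above, the snippet below casts a datetime64 array directly. The resulting numbers count time units since the Unix epoch, and the unit depends on the resolution of the datetime64 dtype, so the example avoids stating exact values.

```python
import numpy as np
import pandas as pd

time_index = pd.DatetimeIndex(["2000-01-01", "2000-01-02"])

# datetime64 values cast to float become counts of the dtype's time unit.
as_float = np.asarray(time_index).astype(float)
```

Because the conversion is monotonic, the ordering of the timestamps is preserved in the float values.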

tempor.data.utils.ensure_pd_iloc_key_returns_df(key: Any) → Iterable | slice

Modify key such that when this is passed to pd.DataFrame.iloc, the result is always a dataframe.

Parameters:
key : Any

Input key.

Returns:

Modified key.

Return type:

Union[Iterable, slice]
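The need for this helper comes from pandas returning a Series when iloc receives a single integer, but a DataFrame when it receives a list or a slice. A minimal sketch of the idea, assuming that wrapping a scalar integer in a list is the only adjustment needed (ensure_key_sketch is a hypothetical name):

```python
import numpy as np
import pandas as pd


def ensure_key_sketch(key):
    """Hypothetical illustration: make an iloc key always yield a DataFrame."""
    if isinstance(key, (int, np.integer)):
        # df.iloc[0] is a Series, whereas df.iloc[[0]] is a one-row DataFrame.
        return [key]
    return key


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
plain = df.iloc[0]
wrapped = df.iloc[ensure_key_sketch(0)]
sliced = df.iloc[ensure_key_sketch(slice(0, 2))]
```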