tempor.data.utils module
Module containing utility functions for data format management.
- tempor.data.utils.EXCEPTION_MESSAGES = _ExceptionMessages()
Reusable error messages for the module.
- tempor.data.utils.value_in_df(df: DataFrame, *, value: Any) -> bool
Check if `value` exists in dataframe `df`, accounting for the case where `value` is `numpy.nan`.
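The NaN special case matters because NaN never compares equal to itself, so a plain equality test misses it. The following is a minimal sketch of that idea, assuming a hypothetical stand-in named `value_in_df_sketch`; it is not the library's implementation.

```python
import numpy as np
import pandas as pd


def value_in_df_sketch(df: pd.DataFrame, *, value) -> bool:
    """Hypothetical stand-in for value_in_df; not the library's code."""
    # NaN never compares equal to itself, so test for it with isna().
    if isinstance(value, float) and np.isnan(value):
        return bool(df.isna().to_numpy().any())
    return bool((df == value).to_numpy().any())


df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

# A plain equality test does not find the NaN in column "a".
naive_nan_check = bool((df == np.nan).to_numpy().any())

has_nan = value_in_df_sketch(df, value=np.nan)
has_four = value_in_df_sketch(df, value=4.0)
has_nine = value_in_df_sketch(df, value=9.0)
```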
- tempor.data.utils.set_df_column_names_inplace(df: DataFrame, names: Sequence) -> DataFrame
Set the column names of `df` to `names` in place. Used to handle the differing behaviour of `set_axis` across pandas versions.
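One version-independent way to rename columns in place is to assign to `df.columns` directly. The sketch below illustrates that approach with a hypothetical helper, `set_column_names_sketch`; whether the library does the same internally is an assumption.

```python
import pandas as pd


def set_column_names_sketch(df: pd.DataFrame, names) -> pd.DataFrame:
    """Hypothetical helper; not the library's code."""
    # Direct assignment mutates the dataframe and does not depend on
    # the signature of DataFrame.set_axis.
    df.columns = list(names)
    return df


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
out = set_column_names_sketch(df, ["x", "y"])
```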
- tempor.data.utils.get_df_index_level0_unique(df: DataFrame) -> Index
Return the unique values of the level 0 index of `df`.
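For reference, the same result can be obtained with plain pandas calls. This is a sketch of equivalent behaviour rather than the library's code, and the index level names `sample` and `time` are assumptions made for the example.

```python
import pandas as pd

# Level names are illustrative only.
index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 2), ("s2", 1)], names=["sample", "time"]
)
df = pd.DataFrame({"f1": [0.1, 0.2, 0.3]}, index=index)

# Take the level 0 values of every row, then drop the repeats.
level0_unique = df.index.get_level_values(0).unique()
```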
- tempor.data.utils.multiindex_timeseries_dataframe_to_array3d(df: DataFrame, *, padding_indicator: Any, max_timesteps: int | None = None) -> ndarray
Convert timeseries dataframe `df` with a 2-level multiindex (sample, timestep) to a 3D numpy array with dimensions `(sample, timestep, feature)`.
- Parameters:
- df : pd.DataFrame
Input dataframe.
- padding_indicator : Any
Padding indicator value used to pad the output array when samples have unequal numbers of timesteps.
- max_timesteps : Optional[int], optional
Maximum number of timesteps to use. This becomes the size of dim 1 of the output array. If set to `None`, this dimension is set to the highest number of timesteps among the samples. Defaults to `None`.
- Raises:
ValueError – raised if the `padding_indicator` is found as one of the data values in `df`.
- Returns:
Output 3D numpy array.
- Return type:
np.ndarray
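The conversion can be pictured as grouping rows by sample and copying each group into a pre-padded array. The sketch below is a simplified illustration, not the library's implementation: `to_array3d_sketch` is a hypothetical name, a numeric padding indicator is assumed, and truncating samples longer than `max_timesteps` is an assumption about behaviour the text above does not spell out.

```python
import numpy as np
import pandas as pd


def to_array3d_sketch(df, *, padding_indicator, max_timesteps=None):
    """Hypothetical illustration; assumes numeric data and padding value."""
    # The padding value must not collide with real data.
    if (df == padding_indicator).to_numpy().any():
        raise ValueError("padding_indicator found among the data values")
    # One (timestep, feature) array per sample, in order of appearance.
    groups = [g.to_numpy() for _, g in df.groupby(level=0, sort=False)]
    n_steps = max_timesteps
    if n_steps is None:
        n_steps = max(len(g) for g in groups)
    # Start from an array filled with padding, then copy the data in.
    out = np.full((len(groups), n_steps, df.shape[1]), padding_indicator, dtype=float)
    for i, g in enumerate(groups):
        length = min(len(g), n_steps)  # assumption: longer samples are truncated
        out[i, :length, :] = g[:length]
    return out


index = pd.MultiIndex.from_tuples([("s1", 1), ("s1", 2), ("s1", 3), ("s2", 1)])
df = pd.DataFrame({"f1": [1, 2, 3, 4], "f2": [10, 20, 30, 40]}, index=index)
arr = to_array3d_sketch(df, padding_indicator=999.0)

# A padding value that also occurs in the data is rejected.
try:
    to_array3d_sketch(df, padding_indicator=3.0)
    collision_detected = False
except ValueError:
    collision_detected = True
```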
- tempor.data.utils.check_bool_array1d_trues_consecutive(array: ndarray, at_beginning: bool = False, at_end: bool = False) -> bool
Check if 1D `array` (containing `bool` values) has all `True` elements consecutively. If `at_{beginning,end}` is set, also check that a `True` element is present as the first or last element of the `array`, respectively. Raises `ValueError` if the input `array` format is unexpected.
Examples
>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, True, True, False]))
True
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, False, True, False]))
False
>>> check_bool_array1d_trues_consecutive(np.asarray([False, True, True, True]), at_end=True)
True
- tempor.data.utils.check_bool_array2d_identical_along_dim1(array: ndarray) -> bool
Check if 2D `array` (containing `bool` values) has the same values along dimension 1.
Examples
>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> check_bool_array2d_identical_along_dim1(np.asarray([[True, True, False], [True, True, False]]).T)
True
>>> check_bool_array2d_identical_along_dim1(np.asarray([[True, True, False], [False, True, False]]).T)
False
- tempor.data.utils.get_array1d_length_until_padding(array: ndarray, padding_indicator: Any | None = None) -> int
Get the length of 1D `array` up to the first padding indicated by `padding_indicator`. Raises `ValueError` if the input `array` format is unexpected.
Examples
>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> pad = 999.0
>>> get_array1d_length_until_padding(np.asarray([1, 8, -3, 9, pad]), padding_indicator=pad)
4
>>> get_array1d_length_until_padding(np.asarray([1, 8, -3, 9, 5]), padding_indicator=pad)
5
- tempor.data.utils.validate_timeseries_array3d(array: ndarray, padding_indicator: Any | None = None) -> None
Check if 3D `array` representing timeseries satisfies the criteria below, otherwise raise `ValueError`:
- 3 dimensions,
- Dimension 2 not of size 0,
- If `padding_indicator` is provided, also check it is not `np.nan`, as this is not supported.
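The three criteria above translate into straightforward checks. The sketch below uses a hypothetical `validate_timeseries_array3d_sketch` to illustrate them; the error messages are invented for the example and the library's real checks may differ in detail.

```python
import numpy as np


def validate_timeseries_array3d_sketch(array, padding_indicator=None):
    """Hypothetical illustration of the listed criteria."""
    if array.ndim != 3:
        raise ValueError(f"expected 3 dimensions, got {array.ndim}")
    if array.shape[2] == 0:
        raise ValueError("dimension 2 (features) must not have size 0")
    # NaN cannot be located by equality, so it is rejected as padding.
    if isinstance(padding_indicator, float) and np.isnan(padding_indicator):
        raise ValueError("np.nan is not supported as a padding indicator")


def is_rejected(array, padding_indicator=None):
    """Return True if the sketch validator raises ValueError."""
    try:
        validate_timeseries_array3d_sketch(array, padding_indicator)
    except ValueError:
        return True
    return False


ok = np.zeros((2, 5, 3))
results = {
    "valid": is_rejected(ok, 999.0),
    "two_dims": is_rejected(np.zeros((2, 5))),
    "no_features": is_rejected(np.zeros((2, 5, 0))),
    "nan_padding": is_rejected(ok, np.nan),
}
```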
- tempor.data.utils.get_seq_lengths_timeseries_array3d(array: ndarray, padding_indicator: Any | None = None) -> list[int]
Given a 3D numpy `array` that represents timeseries like `(sample, timestep, feature)`, and optionally a `padding_indicator` to indicate padding, get the length (number of [non-padding] timesteps) for each sample.
Example
>>> import numpy as np
>>> from tempor.data.utils import *
>>>
>>> pad = 999.0
>>> array = np.asarray(  # Array with two samples, with two timeseries features.
...     [
...         # Sample 1:
...         [
...             [11, 12, 13, 14, pad],
...             [1.1, 1.2, 1.3, 1.4, pad],
...         ],
...         # Sample 2:
...         [
...             [21, 22, pad, pad, pad],
...             [2.1, 2.2, pad, pad, pad],
...         ],
...     ]
... )
>>> array = np.transpose(array, (0, 2, 1))
>>> get_seq_lengths_timeseries_array3d(array, padding_indicator=pad)
[4, 2]
- tempor.data.utils.unpad_timeseries_array3d(array: ndarray, padding_indicator: Any) -> list[ndarray]
Given a 3D numpy `array` that represents timeseries like `(sample, timestep, feature)`, and a `padding_indicator` that marks padding, return a list of length `num_samples`, which contains arrays for each sample like `(timestep, feature)`, with the padding removed.
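Unpadding amounts to finding, for each sample, the first timestep that consists entirely of the padding value and slicing there. The sketch below shows this with a hypothetical `unpad_sketch`, assuming padding only appears at the end of the timestep dimension and is identical across features; it is not the library's implementation.

```python
import numpy as np


def unpad_sketch(array, padding_indicator):
    """Hypothetical illustration; assumes trailing, feature-wide padding."""
    unpadded = []
    for sample in array:  # each sample has shape (timestep, feature)
        # A timestep counts as padding when every feature holds the indicator.
        is_padding = (sample == padding_indicator).all(axis=1)
        # argmax on a boolean array returns the index of the first True.
        length = int(np.argmax(is_padding)) if is_padding.any() else len(sample)
        unpadded.append(sample[:length])
    return unpadded


pad = 999.0
array = np.asarray(
    [
        [[11, 1.1], [12, 1.2], [pad, pad]],   # sample 1: two real timesteps
        [[21, 2.1], [pad, pad], [pad, pad]],  # sample 2: one real timestep
    ]
)
unpadded = unpad_sketch(array, pad)
shapes = [a.shape for a in unpadded]
```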
- tempor.data.utils.make_sample_time_index_tuples(sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]]) -> list[tuple[int, float] | tuple[int, int] | tuple[int, Timestamp] | tuple[str, float] | tuple[str, int] | tuple[str, Timestamp]]
Given a list of elements `sample_index` representing sample IDs and a list (of the same length) of lists, each representing the timesteps for the corresponding sample, return a list of tuples like `[(<sample ID>, <timestep>), ...]`.
Example
>>> from tempor.data.utils import *
>>>
>>> sample_index = ["s1", "s2"]
>>> time_indexes = [[1, 2, 3], [1, 5, 9, 10]]
>>> make_sample_time_index_tuples(sample_index, time_indexes)
[('s1', 1), ('s1', 2), ('s1', 3), ('s2', 1), ('s2', 5), ('s2', 9), ('s2', 10)]
- tempor.data.utils.array3d_to_multiindex_timeseries_dataframe(array: ndarray, *, sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]], feature_index: list[str], padding_indicator: Any = None) -> DataFrame
Given a 3D timeseries `array`, `sample_index`, `time_indexes`, `feature_index`, and a `padding_indicator`, build a 2-level multiindex (sample, timestep) `pandas.DataFrame`.
Padding values of `padding_indicator` can be set inside the array to pad out the length of arrays of different samples in case they differ. Padding needs to go at the end of the timesteps (dim 1). Padding must be the same across the feature dimension (dim 2) for each sample.
- Raises:
ValueError – if data or padding format is unexpected.
- Parameters:
- array : np.ndarray
3D numpy `array` that represents timeseries like `(sample, timestep, feature)`.
- sample_index : data_typing.SampleIndex
List of sample IDs (should be the same length as dim 0 of `array`).
- time_indexes : data_typing.TimeIndexList
List of lists containing timesteps for each sample (outer list should be the same length as dim 0 of `array`, inner list should contain as many elements as each sample has timesteps).
- feature_index : data_typing.FeatureIndex
List of feature names.
- padding_indicator : Any, optional
Padding indicator used in `array` to indicate padding. Defaults to `None`.
- Returns:
Resultant dataframe.
- Return type:
pd.DataFrame
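To make the relationship between the inputs concrete, the sketch below builds such a dataframe by hand. It is a simplified illustration with a hypothetical `array3d_to_df_sketch`: it relies on the length of each inner list in `time_indexes` to skip trailing padding rather than inspecting a padding indicator, performs no validation, and the index level names `sample` and `time` are assumptions.

```python
import numpy as np
import pandas as pd


def array3d_to_df_sketch(array, *, sample_index, time_indexes, feature_index):
    """Hypothetical illustration; no padding or format validation."""
    tuples, rows = [], []
    for sample_id, times, sample in zip(sample_index, time_indexes, array):
        # Padding is trailing, so the first len(times) rows are real data.
        for timestep, row in zip(times, sample[: len(times)]):
            tuples.append((sample_id, timestep))
            rows.append(row)
    # Level names are illustrative only.
    index = pd.MultiIndex.from_tuples(tuples, names=["sample", "time"])
    return pd.DataFrame(np.asarray(rows), index=index, columns=feature_index)


pad = 999.0
array = np.asarray(
    [
        [[11, 1.1], [12, 1.2], [13, 1.3]],
        [[21, 2.1], [pad, pad], [pad, pad]],
    ]
)
df = array3d_to_df_sketch(
    array,
    sample_index=["s1", "s2"],
    time_indexes=[[1, 2, 3], [1]],
    feature_index=["f1", "f2"],
)
```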
- tempor.data.utils.list_of_dataframes_to_multiindex_timeseries_dataframe(list_of_dataframes: list[DataFrame], *, sample_index: list[int] | list[str], time_indexes: list[list[float]] | list[list[int]] | list[list[Timestamp]] | None = None, feature_index: list[str] | None = None) -> DataFrame
Given a list of dataframes `list_of_dataframes`, `sample_index`, [`time_indexes`, `feature_index`,] build a 2-level multiindex (sample, timestep) `pandas.DataFrame`.
- Parameters:
- list_of_dataframes : List[pd.DataFrame]
List of dataframes.
- sample_index : data_typing.SampleIndex
List of sample IDs.
- time_indexes : Optional[data_typing.TimeIndexList], optional
List of lists of time indexes. Defaults to `None`.
- feature_index : Optional[data_typing.FeatureIndex], optional
Feature index. Defaults to `None`.
- Returns:
Resultant dataframe.
- Return type:
pd.DataFrame
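A comparable result can be produced in plain pandas with `pd.concat` and its `keys` argument, which prepends the sample ID as a new outer index level. This is a sketch of the general idea, not the library's implementation; it assumes each dataframe already carries its time index, and the level names are illustrative.

```python
import pandas as pd

# One dataframe per sample, each indexed by its own timesteps.
frames = [
    pd.DataFrame({"f1": [11, 12], "f2": [1.1, 1.2]}, index=[1, 2]),
    pd.DataFrame({"f1": [21], "f2": [2.1]}, index=[1]),
]
sample_index = ["s1", "s2"]

# keys= adds the sample ID as level 0 of the resulting multiindex.
df = pd.concat(frames, keys=sample_index, names=["sample", "time"])
```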
- tempor.data.utils.multiindex_timeseries_dataframe_to_list_of_dataframes(df: DataFrame) -> list[DataFrame]
Return a list of dataframes, one per sample. That is, each of the dataframes has a unique level 0 index value.
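In plain pandas, a split of this kind can be sketched with `groupby(level=0)`. The snippet keeps the full multiindex on each piece, which matches the description above; the library's exact output may differ, and the level names are assumptions.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("s1", 1), ("s1", 2), ("s2", 1)], names=["sample", "time"]
)
df = pd.DataFrame({"f1": [11, 12, 21]}, index=index)

# One dataframe per level 0 value, in order of first appearance.
per_sample = [group for _, group in df.groupby(level=0, sort=False)]

# Each piece contains exactly one distinct level 0 value.
level0_counts = [g.index.get_level_values(0).nunique() for g in per_sample]
lengths = [len(g) for g in per_sample]
```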
- tempor.data.utils.event_time_value_pairs_to_event_dataframe(event_time_value_pairs: Sequence[tuple[list[float] | list[int] | list[Timestamp], list[bool]]], sample_index: list[int] | list[str], feature_index: list[str] | None = None) -> DataFrame
Convert a sequence like `[(event_times, event_values), ...]` to a `pandas.DataFrame` whose columns contain elements like tuples `(event_time, event_value)`.
- Parameters:
- event_time_value_pairs : Sequence[Tuple[data_typing.TimeIndex, List[bool]]]
A sequence where each item corresponds to an event feature and is a tuple of form `(event_times, event_values)` (e.g. `([1.1, 1.2, 1.3], [True, True, False])`).
- sample_index : data_typing.SampleIndex
List of sample IDs, to be set as dataframe row index.
- feature_index : Optional[data_typing.FeatureIndex]
List of feature names, to be set as dataframe column names.
Example
>>> from tempor.data.utils import *
>>>
>>> sample_index = ["s1", "s2", "s3"]
>>> feature_names = ["feature_1", "feature_2"]
>>> event_feature_1 = ([1.1, 1.2, 1.3], [True, True, False])
>>> event_feature_2 = ([2.1, 2.2, 2.3], [False, True, False])
>>> event_time_value_pairs = [event_feature_1, event_feature_2]
>>> df = event_time_value_pairs_to_event_dataframe(
...     event_time_value_pairs, sample_index=sample_index, feature_index=feature_names
... )
>>> df.shape
(3, 2)
- Returns:
`pandas.DataFrame` compatible with `EventSamples`.
- Return type:
pd.DataFrame