Data Tutorial 05: Other Data Formats
This tutorial shows additional data formats supported by TemporAI.
⚠️ This feature is experimental and may not yet work as expected.
Data formats
You can view the supported data formats by running:
[ ]:
from tempor import plugin_loader
from tempor.data import samples_experimental # Load experimental.
import rich.pretty
dataformat_plugins = plugin_loader.list("dataformat")
rich.pretty.pprint(dataformat_plugins)
{
    'static_samples': ['static_samples_df', 'static_samples_dask'],
    'time_series_samples': ['time_series_samples_df', 'time_series_samples_dask'],
    'event_samples': ['event_samples_df', 'event_samples_dask']
}
Dask data format
Dask is a Python library for parallel computing. TemporAI provides an interface to Dask dataframes (https://docs.dask.org/en/stable/dataframe.html), which support parallel computation over partitioned data.
The examples below show how to load data samples from Dask dataframes.
[ ]:
# Static samples example.
import pandas as pd
import numpy as np
import dask.dataframe as dd
from tempor.data import samples_experimental
categories = ["A", "B", "C"]
np.random.seed(12345)
size = 10
df_s = pd.DataFrame(
    {
        "sample_idx": [f"sample_{x}" for x in range(1, size + 1)],
        "cat_feat_1": pd.Categorical(np.random.choice(categories, size=size)),
        "cat_feat_2": pd.Categorical(np.random.choice(categories, size=size)),
        "num_feat_1": np.random.uniform(0, 10, size=size),
        "num_feat_2": np.random.uniform(20, 30, size=size),
    }
)
df_s.set_index("sample_idx", drop=True, inplace=True)
# Create a dask dataframe:
ddf_s = dd.from_pandas(df_s, npartitions=2) # type: ignore
# Initialize the static samples object:
samples_experimental.StaticSamplesDask(ddf_s) # type: ignore
2023-12-07 20:29:54 | INFO | tempor.data.samples_experimental:_validate:69 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.
StaticSamplesDask with data:
| sample_idx | cat_feat_1 | cat_feat_2 | num_feat_1 | num_feat_2 |
|---|---|---|---|---|
| sample_1 | C | B | 0.267897 | 29.308157 |
| sample_10 | C | B | 8.062348 | 27.949706 |
| sample_2 | B | B | 2.915024 | 23.640296 |
| sample_3 | B | C | 3.987440 | 26.909479 |
| sample_4 | B | B | 8.072887 | 21.293146 |
| sample_5 | A | C | 6.270943 | 28.326864 |
| sample_6 | B | B | 9.079249 | 23.183537 |
| sample_7 | C | C | 5.563973 | 27.372023 |
| sample_8 | C | A | 8.399193 | 25.967696 |
| sample_9 | B | C | 0.504880 | 23.637068 |
[ ]:
# Time series samples example.
df_t = pd.DataFrame(
    {
        "sample_idx": ["a", "a", "a", "a", "b", "b", "c"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "feat_1": [11, 12, 13, 14, 21, 22, 31],
        "feat_2": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
    }
)
df_t.set_index(keys=["sample_idx", "time_idx"], drop=True, inplace=True)
# Create a dask dataframe:
ddf_t = samples_experimental.multiindex_df_to_compatible_ddf(df_t, npartitions=2)
samples_experimental.TimeSeriesSamplesDask(ddf_t) # type: ignore
2023-12-07 20:30:30 | INFO | tempor.data.samples_experimental:_validate:223 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.
TimeSeriesSamplesDask with data:
| | | feat_1 | feat_2 |
|---|---|---|---|
| a | 1 | 11.0 | 1.1 |
| | 2 | 12.0 | 1.2 |
| | 3 | 13.0 | 1.3 |
| | 4 | 14.0 | 1.4 |
| b | 2 | 21.0 | 2.1 |
| | 4 | 22.0 | 2.2 |
| c | 9 | 31.0 | 3.1 |
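Dask dataframes do not support a pandas `MultiIndex`, which is why the time series example goes through a conversion helper rather than calling `dd.from_pandas` directly. How `multiindex_df_to_compatible_ddf` represents the two index levels internally is not covered here. The pandas-only sketch below shows just the general idea of flattening a `(sample_idx, time_idx)` index into ordinary columns and restoring it afterwards; it is an illustration under that assumption, not the library's implementation.

```python
import pandas as pd

# Hypothetical toy time series with a (sample_idx, time_idx) MultiIndex.
df = pd.DataFrame(
    {
        "sample_idx": ["a", "a", "b"],
        "time_idx": [1, 2, 2],
        "feat_1": [11, 12, 21],
    }
).set_index(["sample_idx", "time_idx"])

# Flatten: both index levels become ordinary columns, leaving a plain
# single-level integer index that a Dask dataframe could hold.
flat = df.reset_index()

# Restore: rebuild the MultiIndex from the two columns.
restored = flat.set_index(["sample_idx", "time_idx"])

print(list(flat.columns))
print(restored.equals(df))
```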
[ ]:
# Event samples example.
df_e = pd.DataFrame(
    {
        "sample_idx": [f"sample_{x}" for x in range(1, 3 + 1)],
        "feat_1": [(5, True), (6, False), (3, True)],
        "feat_2": [(1, False), (8, False), (8, True)],
        "feat_3": [
            (pd.to_datetime("2000-01-02"), False),
            (pd.to_datetime("2000-01-03"), True),
            (pd.to_datetime("2000-01-01"), True),
        ],
    },
)
df_e.set_index("sample_idx", drop=True, inplace=True)
# Create a dask dataframe:
ddf_e = dd.from_pandas(df_e, npartitions=2) # type: ignore
# Initialize the event samples object:
samples_experimental.EventSamplesDask(ddf_e) # type: ignore
2023-12-07 20:31:53 | INFO | tempor.data.samples_experimental:_validate:434 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.
EventSamplesDask with data:
| sample_idx | feat_1 | feat_2 | feat_3 |
|---|---|---|---|
| sample_1 | (5, True) | (1, False) | (2000-01-02 00:00:00, False) |
| sample_2 | (6, False) | (8, False) | (2000-01-03 00:00:00, True) |
| sample_3 | (3, True) | (8, True) | (2000-01-01 00:00:00, True) |
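Since the log messages above state that validation is not yet implemented for the Dask data format, it may be worth sanity-checking event data yourself before wrapping it. The sketch below assumes that each event cell is a 2-tuple whose second element is a boolean flag, as in the example data; the helper names `is_event_cell` and `invalid_event_cells` are hypothetical and not part of TemporAI.

```python
from datetime import datetime
from typing import Any, Iterable, List


def is_event_cell(cell: Any) -> bool:
    """Return True if `cell` is a 2-tuple whose second element is a bool."""
    return isinstance(cell, tuple) and len(cell) == 2 and isinstance(cell[1], bool)


def invalid_event_cells(cells: Iterable[Any]) -> List[int]:
    """Return the positions of cells that do not look like event tuples."""
    return [i for i, cell in enumerate(cells) if not is_event_cell(cell)]


# Columns mirroring the tutorial data, plus one deliberately malformed column.
events = {
    "feat_1": [(5, True), (6, False), (3, True)],
    "feat_3": [
        (datetime(2000, 1, 2), False),
        (datetime(2000, 1, 3), True),
        (datetime(2000, 1, 1), True),
    ],
    "broken": [(5, True), (6,), (3, "yes")],
}

report = {name: invalid_event_cells(cells) for name, cells in events.items()}
print(report)
```

The same helper can be applied to a pandas column, for example `invalid_event_cells(df_e["feat_1"])`, before the dataframe is converted with `dd.from_pandas`.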