Data Tutorial 05: Other Data Formats¶

This tutorial shows additional data formats supported by TemporAI.

⚠️ This feature is experimental and may not yet work as expected.

Data formats¶

You can view the supported data formats by running:

[ ]:

from tempor import plugin_loader
from tempor.data import samples_experimental  # Load experimental.

import rich.pretty

dataformat_plugins = plugin_loader.list("dataformat")
rich.pretty.pprint(dataformat_plugins)

{
│   'static_samples': ['static_samples_df', 'static_samples_dask'],
│   'time_series_samples': ['time_series_samples_df', 'time_series_samples_dask'],
│   'event_samples': ['event_samples_df', 'event_samples_dask']
}

`Dask` data format¶

Dask is a Python library for parallel computing. We provide an interface to `Dask dataframes <https://docs.dask.org/en/stable/dataframe.html>`__, which supports parallel computation.

Below example shows how to load the data samples from Dask dataframes.

[ ]:

# Static samples example.

import pandas as pd
import numpy as np
import dask.dataframe as dd

from tempor.data import samples_experimental

categories = ["A", "B", "C"]
np.random.seed(12345)
size = 10
df_s = pd.DataFrame(
    {
        "sample_idx": [f"sample_{x}" for x in range(1, size + 1)],
        "cat_feat_1": pd.Categorical(np.random.choice(categories, size=size)),
        "cat_feat_2": pd.Categorical(np.random.choice(categories, size=size)),
        "num_feat_1": np.random.uniform(0, 10, size=size),
        "num_feat_2": np.random.uniform(20, 30, size=size),
    }
)
df_s.set_index("sample_idx", drop=True, inplace=True)

# Create a dask dataframe:
ddf_s = dd.from_pandas(df_s, npartitions=2)  # type: ignore

# Initialize the static samples object:
samples_experimental.StaticSamplesDask(ddf_s)  # type: ignore

2023-12-07 20:29:54 | INFO     | tempor.data.samples_experimental:_validate:69 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.

StaticSamplesDask with data:

	cat_feat_1	cat_feat_2	num_feat_1	num_feat_2
sample_idx
sample_1	C	B	0.267897	29.308157
sample_10	C	B	8.062348	27.949706
sample_2	B	B	2.915024	23.640296
sample_3	B	C	3.987440	26.909479
sample_4	B	B	8.072887	21.293146
sample_5	A	C	6.270943	28.326864
sample_6	B	B	9.079249	23.183537
sample_7	C	C	5.563973	27.372023
sample_8	C	A	8.399193	25.967696
sample_9	B	C	0.504880	23.637068

[ ]:

# Time series samples example.

df_t = pd.DataFrame(
    {
        "sample_idx": ["a", "a", "a", "a", "b", "b", "c"],
        "time_idx": [1, 2, 3, 4, 2, 4, 9],
        "feat_1": [11, 12, 13, 14, 21, 22, 31],
        "feat_2": [1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 3.1],
    }
)
df_t.set_index(keys=["sample_idx", "time_idx"], drop=True, inplace=True)

# Create a dask dataframe:
ddf_t = samples_experimental.multiindex_df_to_compatible_ddf(df_t, npartitions=2)

samples_experimental.TimeSeriesSamplesDask(ddf_t)  # type: ignore

2023-12-07 20:30:30 | INFO     | tempor.data.samples_experimental:_validate:223 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.

TimeSeriesSamplesDask with data:

		feat_1	feat_2
a	1	11.0	1.1
	2	12.0	1.2
	3	13.0	1.3
	4	14.0	1.4
b	2	21.0	2.1
b	4	22.0	2.2
c	9	31.0	3.1

[ ]:

# Event samples example.

df_e = pd.DataFrame(
    {
        "sample_idx": [f"sample_{x}" for x in range(1, 3 + 1)],
        "feat_1": [(5, True), (6, False), (3, True)],
        "feat_2": [(1, False), (8, False), (8, True)],
        "feat_3": [
            (pd.to_datetime("2000-01-02"), False),
            (pd.to_datetime("2000-01-03"), True),
            (pd.to_datetime("2000-01-01"), True),
        ],
    },
)
df_e.set_index("sample_idx", drop=True, inplace=True)

# Create a dask dataframe:
ddf_e = dd.from_pandas(df_e, npartitions=2)  # type: ignore

# Initialize the event samples object:
samples_experimental.EventSamplesDask(ddf_e)  # type: ignore

2023-12-07 20:31:53 | INFO     | tempor.data.samples_experimental:_validate:434 | Validation not yet implemented for Dask data format. Data format consistency is not guaranteed.

EventSamplesDask with data:

	feat_1	feat_2	feat_3
sample_idx
sample_1	(5, True)	(1, False)	(2000-01-02 00:00:00, False)
sample_2	(6, False)	(8, False)	(2000-01-03 00:00:00, True)
sample_3	(3, True)	(8, True)	(2000-01-01 00:00:00, True)

Data Tutorial 05: Other Data Formats¶

Data formats¶

Dask data format¶

`Dask` data format¶