Data Tutorial 03: Data sources

This tutorial introduces TemporAI data sources: what a DataSource is, how to load one, and which data sources ship with TemporAI.

DataSource class

A TemporAI DataSource implements a load() method which returns a TemporAI dataset.

A DataSource is useful for loading a custom dataset with the necessary preprocessing already applied; that preprocessing may be user-configurable.
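To make the pattern concrete, here is a minimal sketch of the idea in plain Python. It is not TemporAI's actual base class: `ToyDataset`, `ToySineDataSource`, the `random_state` parameter, and the labelling rule are all hypothetical stand-ins, chosen only to show "configure in the constructor, build the dataset in load()".

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a TemporAI dataset (not the real class)."""

    time_series: List[List[float]]  # one list of values per sample
    targets: List[int]  # one label per sample


class ToySineDataSource:
    """Hypothetical stand-in for a DataSource."""

    def __init__(self, no: int = 100, seq_len: int = 10, random_state: int = 0) -> None:
        # Keyword arguments configure how the data is generated or preprocessed.
        self.no = no
        self.seq_len = seq_len
        self.random_state = random_state

    def load(self) -> ToyDataset:
        # All generation and preprocessing happens here, then a dataset is returned.
        rng = random.Random(self.random_state)
        series, targets = [], []
        for _ in range(self.no):
            freq = rng.uniform(0.1, 1.0)
            phase = rng.uniform(0.0, 2 * math.pi)
            series.append([math.sin(freq * t + phase) for t in range(self.seq_len)])
            targets.append(int(freq > 0.55))  # arbitrary illustrative label
        return ToyDataset(time_series=series, targets=targets)


toy_data = ToySineDataSource(no=80, seq_len=5).load()
print(len(toy_data.time_series), len(toy_data.time_series[0]))
```

The caller only ever sees the constructor arguments and the object returned by load(), which is the property the real DataSource plugins share.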

Data sources, like methods, are TemporAI plugins and can be loaded with plugin_loader, but plugin_type="datasource" must be specified.

Below is an example of SineDataSource.

[ ]:
from tempor import plugin_loader

# Get the DataSource class:
SineDataSource = plugin_loader.get_class("prediction.one_off.sine", plugin_type="datasource")

The constructor of the DataSource can take various keyword arguments; this is where the user can customize the data preprocessing and similar settings.

[ ]:
# Initialize.

sine_datasource = SineDataSource(
    no=80,  # Here, number of samples.
    seq_len=5,  # Here, time series sequence length.
    # ...
)

sine_datasource
<tempor.datasources.prediction.one_off.plugin_sine.SineDataSource at 0x7ff8f84f6070>
[ ]:
# Load the Dataset:
data = sine_datasource.load()

print(type(data))

data
<class 'tempor.data.dataset.OneOffPredictionDataset'>
OneOffPredictionDataset(
    time_series=TimeSeriesSamples([80, *, 5]),
    static=StaticSamples([80, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([80, 1]))
)
[ ]:
data.time_series

TimeSeriesSamples with data:

                            0         1         2         3         4
sample_idx time_idx
0          0        -0.151203  0.206110  0.783078  0.768667  0.957344
           1         0.679518  0.785370  0.913243  0.999923  0.973799
           2         0.997174  0.999603  0.985913  0.784278  0.730349
           3         0.561921  0.749235  0.996514  0.218111  0.291970
           4        -0.297606  0.150635  0.944377 -0.445537 -0.224335
...        ...            ...       ...       ...       ...       ...
79         0         0.999730  0.101680 -0.976039 -0.999547 -0.715265
           1         0.803590  0.577241 -0.696389 -0.897416 -0.312411
           2         0.269220  0.903914 -0.188132 -0.586595  0.160840
           3        -0.378464  0.997443  0.381883 -0.139366  0.597849
           4        -0.866853  0.833703  0.826536  0.340273  0.900138

400 rows × 5 columns
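The row count above follows from the long layout of the table: one row per (sample_idx, time_idx) pair. The sketch below rebuilds that index with the standard library only, assuming every sample has exactly seq_len time steps, as is the case for the sine data loaded here.

```python
from itertools import product

no, seq_len = 80, 5  # the values passed to SineDataSource above

# One (sample_idx, time_idx) pair per row of the long-format table.
index = list(product(range(no), range(seq_len)))

print(len(index))  # number of rows: no * seq_len
print(index[:3])   # first rows all belong to sample 0
print(index[-1])   # last row is the final time step of the last sample
```

For data sources with variable-length series the total row count is the sum of the per-sample lengths rather than a simple product.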

[ ]:
data.static

StaticSamples with data:

                   0         1         2         3
sample_idx
0           0.374540  0.950714  0.731994  0.598658
1           0.156019  0.155995  0.058084  0.866176
2           0.601115  0.708073  0.020584  0.969910
3           0.832443  0.212339  0.181825  0.183405
4           0.304242  0.524756  0.431945  0.291229
...              ...       ...       ...       ...
75          0.051682  0.531355  0.540635  0.637430
76          0.726091  0.975852  0.516300  0.322956
77          0.795186  0.270832  0.438971  0.078456
78          0.025351  0.962648  0.835980  0.695974
79          0.408953  0.173294  0.156437  0.250243

80 rows × 4 columns

[ ]:
data.predictive.targets

StaticSamples with data:

            0
sample_idx
0           1
1           1
2           1
3           0
4           1
...       ...
75          1
76          0
77          1
78          1
79          1

80 rows × 1 columns

Alternatively, you can look up and initialize the datasource instance in a single step, with the same effect:

[ ]:
sine_datasource = plugin_loader.get("prediction.one_off.sine", "datasource", no=80, seq_len=5)

sine_datasource
<tempor.datasources.prediction.one_off.plugin_sine.SineDataSource at 0x7ff8f8552d90>
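The relationship between the two-step and one-step forms can be illustrated with a toy registry. This is only a sketch of the general plugin pattern, not TemporAI's implementation: `_REGISTRY`, `register`, `ToySine`, and the toy `get_class` / `get` functions below are hypothetical and merely mirror the call shapes used in this tutorial.

```python
from typing import Any, Dict, Tuple, Type

# Toy registry keyed by (plugin_type, full_name); purely illustrative.
_REGISTRY: Dict[Tuple[str, str], Type[Any]] = {}


def register(name: str, plugin_type: str):
    """Class decorator that records a plugin class under (plugin_type, name)."""

    def decorator(cls: Type[Any]) -> Type[Any]:
        _REGISTRY[(plugin_type, name)] = cls
        return cls

    return decorator


def get_class(name: str, plugin_type: str) -> Type[Any]:
    """Look up the plugin class without instantiating it."""
    return _REGISTRY[(plugin_type, name)]


def get(name: str, plugin_type: str, **kwargs: Any) -> Any:
    """Look up the class and instantiate it with the given keyword arguments."""
    return get_class(name, plugin_type)(**kwargs)


@register("prediction.one_off.sine", plugin_type="datasource")
class ToySine:
    def __init__(self, no: int = 100, seq_len: int = 10) -> None:
        self.no = no
        self.seq_len = seq_len


two_step = get_class("prediction.one_off.sine", plugin_type="datasource")(no=80, seq_len=5)
one_step = get("prediction.one_off.sine", "datasource", no=80, seq_len=5)
print(type(two_step) is type(one_step), one_step.no, one_step.seq_len)
```

In this toy version the one-step call is just the two-step call folded into a single function; keying the registry on the plugin type is what lets a data source and a method share a category path without colliding.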

Provided DataSources

TemporAI comes with a number of data sources.

To list them all by category:

[ ]:
from rich.pretty import pprint

pprint(
    plugin_loader.list(plugin_type="datasource"),
    indent_guides=False,
)
{
    'prediction': {'one_off': ['sine', 'google_stocks'], 'temporal': ['uci_diabetes', 'dummy_prediction']},
    'time_to_event': ['pbc'],
    'treatments': {'one_off': ['pkpd'], 'temporal': ['dummy_treatments']}
}
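The nested listing above and the dotted names used earlier (such as "prediction.one_off.sine") describe the same hierarchy. As a rough sketch of how one maps to the other, the hypothetical helper `flatten_listing` below joins category keys with dots; it is an illustration, not the library's own code.

```python
from typing import Any, List

# The listing printed above, copied verbatim.
listing = {
    "prediction": {
        "one_off": ["sine", "google_stocks"],
        "temporal": ["uci_diabetes", "dummy_prediction"],
    },
    "time_to_event": ["pbc"],
    "treatments": {"one_off": ["pkpd"], "temporal": ["dummy_treatments"]},
}


def flatten_listing(node: Any, prefix: str = "") -> List[str]:
    """Hypothetical helper: turn the nested listing into dotted full names."""
    if isinstance(node, list):
        # Leaf level: plugin names within one category.
        return [f"{prefix}{name}" for name in node]
    names: List[str] = []
    for category, sub in node.items():
        names.extend(flatten_listing(sub, prefix=f"{prefix}{category}."))
    return names


full_names = flatten_listing(listing)
for name in full_names:
    print(name)
```

Each resulting string has the same form as the name passed to plugin_loader.get_class earlier in this tutorial.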

Below you can see more information about each available data source, with an example of the TemporAI Dataset it loads.

[ ]:
# Display information about each datasource's default loaded dataset.

from IPython.display import display

all_datasources = plugin_loader.list_full_names(plugin_type="datasource")

for datasource_name in all_datasources:
    print(f"\n{'-' * 80}\n")

    datasource_cls = plugin_loader.get_class(datasource_name, plugin_type="datasource")

    print(f"{datasource_cls.__name__} loads the following dataset:\n")
    data = datasource_cls().load()
    print(data)

    print("This contains:", end="\n\n")

    print("time_series:")
    display(data.time_series)
    if data.static is not None:
        print("static:")
        display(data.static)
    if data.predictive.targets is not None:
        print("predictive.targets:")
        display(data.predictive.targets)
    if data.predictive.treatments is not None:
        print("predictive.treatments:")
        display(data.predictive.treatments)

--------------------------------------------------------------------------------

SineDataSource loads the following dataset:

OneOffPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
This contains:

time_series:

TimeSeriesSamples with data:

0 1 2 3 4
sample_idx time_idx
0 0 -0.019015 -0.048177 -0.108546 0.441865 0.024508
1 0.300030 0.364550 0.576590 0.890053 0.534722
2 0.587904 0.713509 0.972993 0.986179 0.892946
3 0.814697 0.937661 0.882158 0.692221 0.997357
4 0.956846 0.997797 0.349572 0.124454 0.818278
... ... ... ... ... ... ...
99 5 0.967121 0.126890 0.926979 0.982022 0.963113
6 0.706533 0.413329 0.569034 0.656214 0.989224
7 0.252748 0.663121 0.024381 0.050668 0.999771
8 -0.270150 0.854119 -0.528273 -0.576478 0.994586
9 -0.719177 0.969389 -0.907592 -0.957876 0.973752

1000 rows × 5 columns

static:

StaticSamples with data:

0 1 2 3
sample_idx
0 0.374540 0.950714 0.731994 0.598658
1 0.156019 0.155995 0.058084 0.866176
2 0.601115 0.708073 0.020584 0.969910
3 0.832443 0.212339 0.181825 0.183405
4 0.304242 0.524756 0.431945 0.291229
... ... ... ... ...
95 0.118165 0.696737 0.628943 0.877472
96 0.735071 0.803481 0.282035 0.177440
97 0.750615 0.806835 0.990505 0.412618
98 0.372018 0.776413 0.340804 0.930757
99 0.858413 0.428994 0.750871 0.754543

100 rows × 4 columns

predictive.targets:

StaticSamples with data:

0
sample_idx
0 0
1 1
2 0
3 0
4 1
... ...
95 1
96 1
97 1
98 0
99 0

100 rows × 1 columns


--------------------------------------------------------------------------------

GoogleStocksDataSource loads the following dataset:

OneOffPredictionDataset(
    time_series=TimeSeriesSamples([50, *, 5]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([50, 1]))
)
This contains:

time_series:

TimeSeriesSamples with data:

                         Open      High       Low     Close    Volume
sample_idx time_idx
0          0.875000  0.661264  0.652789  0.677836  0.696887  0.185147
           0.886364  0.667446  0.716935  0.731552  0.748318  0.150912
           0.897727  0.751374  0.784055  0.800261  0.791407  0.140203
           0.909091  0.785577  0.838572  0.831813  0.832628  0.244291
           0.920455  0.885578  0.879778  0.900782  0.889539  0.413625
...        ...            ...       ...       ...       ...       ...
9          0.806818  0.642857  0.647974  0.649153  0.639975  0.625178
           0.818182  0.687362  0.757221  0.741200  0.789788  0.333141
           0.829545  0.756044  0.732512  0.772230  0.732379  0.120629
           0.840909  0.710852  0.687907  0.721525  0.713076  0.101900
           0.875000  0.661264  0.652789  0.677836  0.696887  0.185147

500 rows × 5 columns

predictive.targets:

StaticSamples with data:

out
sample_idx
0 0.710852
1 0.756044
10 0.564835
11 0.557005
12 0.552061
13 0.510852
14 0.451786
15 0.421704
16 0.387225
17 0.345879
18 0.286951
19 0.332143
2 0.687362
20 0.205906
21 0.286676
22 0.247939
23 0.492445
24 0.767858
25 0.810440
26 0.697940
27 0.597390
28 0.390659
29 0.385989
3 0.642857
30 0.361401
31 0.370879
32 0.388325
33 0.393819
34 0.389149
35 0.359753
36 0.399038
37 0.378984
38 0.225962
39 0.099863
4 0.628297
40 0.131181
41 0.000000
42 0.054121
43 0.062088
44 0.204533
45 0.163049
46 0.166072
47 0.186126
48 0.233929
49 0.246566
5 0.671978
6 0.704808
7 0.684753
8 0.684753
9 0.607281

--------------------------------------------------------------------------------

UCIDiabetesDataSource loads the following dataset:

TemporalPredictionDataset(
    time_series=TimeSeriesSamples([70, *, 18]),
    predictive=TemporalPredictionTaskData(
        targets=TimeSeriesSamples([70, *, 1])
    )
)
This contains:

time_series:

TimeSeriesSamples with data:

post-lunch_blood_glucose_measurement more-than-usual_meal_ingestion unspecified_special_event typical_exercise_activity less-than-usual_exercise_activity post-supper_blood_glucose_measurement more-than-usual_exercise_activity unspecified_blood_glucose_measurement pre-snack_blood_glucose_measurement ultralente_insulin_dose less-than-usual_meal_ingestion pre-supper_blood_glucose_measurement regular_insulin_dose pre-breakfast_blood_glucose_measurement nph_insulin_dose post-breakfast_blood_glucose_measurement pre-lunch_blood_glucose_measurement typical_meal_ingestion
sample_idx time_idx
0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 9.0 100.0 13.0 NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 119.0 7.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN 123.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10.0 216.0 13.0 NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
69 146 NaN NaN NaN NaN NaN NaN NaN 145.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
147 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 259.0 7.0 NaN NaN NaN
148 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN NaN
149 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN NaN
150 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN NaN

18199 rows × 18 columns

predictive.targets:

TimeSeriesSamples with data:

hypoglycemic_symptoms
sample_idx time_idx
0 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ... ...
69 146 NaN
147 NaN
148 NaN
149 NaN
150 NaN

18199 rows × 1 columns


--------------------------------------------------------------------------------

DummyTemporalPredictionDataSource loads the following dataset:

TemporalPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 3]),
    predictive=TemporalPredictionTaskData(
        targets=TimeSeriesSamples([100, *, 2])
    )
)
This contains:

time_series:

TimeSeriesSamples with data:

0 1 2 3 4
sample_idx time_idx
0 0 NaN 0.893763 NaN NaN 1.047522
1 1.257931 2.172271 2.226089 2.360713 1.981578
2 2.247657 0.853397 2.525946 3.213647 2.897191
3 3.396456 5.386071 3.721545 2.503248 3.517212
4 4.387812 3.365264 5.612532 5.573375 4.767746
... ... ... ... ... ... ...
99 12 12.654769 14.810888 12.914859 NaN 12.818675
13 13.418815 12.135655 12.481295 13.336797 13.696168
14 13.785503 14.431228 15.193174 17.551818 14.464249
15 15.344934 15.916966 14.368132 15.965113 15.419334
16 16.033907 15.162631 17.338485 17.007235 17.034645

1547 rows × 5 columns

static:

StaticSamples with data:

0 1 2
sample_idx
0 0.753423 3.239284 0.995587
1 0.829240 3.175298 0.770566
2 0.674581 3.229741 1.302317
3 0.584040 3.234011 1.594861
4 0.501552 3.211027 0.639503
... ... ... ...
95 0.680235 3.287749 0.705369
96 0.788814 3.313229 1.318394
97 0.589116 3.268607 1.646737
98 0.551060 3.268599 0.998024
99 0.716501 3.254501 1.047537

100 rows × 3 columns

predictive.targets:

TimeSeriesSamples with data:

0 1
sample_idx time_idx
0 0 -1.433570 0.714861
1 -0.600733 2.744446
2 0.622874 1.816995
3 1.879785 4.981217
4 2.477957 5.932101
... ... ... ...
99 12 10.736462 13.415872
13 11.617465 15.103293
14 12.858327 16.105966
15 13.652358 16.148926
16 14.442286 17.567963

1547 rows × 2 columns


--------------------------------------------------------------------------------

PBCDataSource loads the following dataset:

TimeToEventAnalysisDataset(
    time_series=TimeSeriesSamples([312, *, 14]),
    static=StaticSamples([312, 1]),
    predictive=TimeToEventAnalysisTaskData(targets=EventSamples([312, 1]))
)
This contains:

time_series:

TimeSeriesSamples with data:

                     drug  ascites  hepatomegaly  spiders  edema  histologic   serBilir    serChol    albumin   alkaline       SGOT  platelets  prothrombin        age
sample_idx time_idx
1          0.569489   0.0      1.0           1.0      1.0    1.0         3.0   3.281890   0.000000  -0.894575   0.195532  -1.485263  -0.529101     0.136768   0.248058
           1.095170   0.0      1.0           1.0      1.0    1.0         3.0   2.015877  -0.469461  -1.570646   0.285613   0.195488  -0.456022     0.813132   0.248058
2          5.319790   0.0      1.0           1.0      1.0    1.0         2.0   0.172710  -0.658914  -1.431455  -0.605844  -0.442126  -1.395605     0.339677   1.292856
           6.261636   0.0      1.0           1.0      1.0    1.0         2.0  -0.013468  -0.603657  -1.172958  -0.512364  -0.046806  -1.259888     0.339677   1.292856
           7.266455   0.0      1.0           1.0      1.0    1.0         2.0   0.098239   0.000000  -1.312149  -0.443529   0.293680  -1.364286     0.339677   1.292856
...        ...        ...      ...           ...      ...    ...         ...        ...        ...        ...        ...        ...        ...          ...        ...
312        1.045888   1.0      0.0           0.0      1.0    2.0         2.0   3.672865   3.319599   0.059878   1.385274   0.986129  -1.103291     1.624769  -1.962482
           1.867265   1.0      0.0           0.0      1.0    2.0         1.0   2.350998   2.901224  -0.099197   0.916176   0.641817  -0.998892     1.354223  -1.962482
           2.921367   1.0      0.0           0.0      0.0    0.0         1.0   0.694010  -0.066873   0.338261   0.327254   0.552551  -0.894494     0.474950  -1.962482
           3.425145   1.0      0.0           0.0      0.0    0.0         1.0   0.340271   0.000000  -0.377580   0.251620   0.016956  -0.466462    -0.066141  -1.962482
           3.989158   1.0      0.0           0.0      1.0    0.0         1.0   0.507832   2.017110   0.795603   0.622990   0.169983  -0.351624    -0.133778  -1.962482

1945 rows × 14 columns

static:

StaticSamples with data:

sex
sample_idx
1 0.0
2 0.0
3 1.0
4 0.0
5 0.0
... ...
308 0.0
309 0.0
310 0.0
311 0.0
312 0.0

312 rows × 1 columns

predictive.targets:

EventSamples with data:

status
sample_idx
1 (0.569488555470374, True)
2 (14.1523381885883, False)
3 (0.7365020260650499, True)
4 (0.27653050049282957, True)
5 (4.12057824991786, False)
... ...
308 (4.98850071186069, False)
309 (4.55317051801555, False)
310 (4.4025846019056, False)
311 (4.12879202716022, False)
312 (3.98915781404008, False)

312 rows × 1 columns
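Each entry in the EventSamples output above is a pair. The sketch below reads those pairs as (time, event observed), where False would mean the sample was censored at that time; this interpretation is an assumption based on the usual time-to-event convention, so check it against the TemporAI documentation before relying on it. The values are the first five rows printed above.

```python
# First five (time, flag) pairs copied from the PBC targets shown above.
events = [
    (0.569488555470374, True),
    (14.1523381885883, False),
    (0.7365020260650499, True),
    (0.27653050049282957, True),
    (4.12057824991786, False),
]

# Assumption: flag True means the event was observed, False means censored.
observed_times = [time for time, occurred in events if occurred]
censored_times = [time for time, occurred in events if not occurred]

print(len(observed_times), len(censored_times))
print(max(observed_times))
```

Splitting the pairs this way is typically the first step before computing survival statistics, since censored samples only give a lower bound on the event time.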


--------------------------------------------------------------------------------

PKPDDataSource loads the following dataset:

Generating simple PKPD dataset with random seed 100...
OneOffTreatmentEffectsDataset(
    time_series=TimeSeriesSamples([40, *, 2]),
    predictive=OneOffTreatmentEffectsTaskData(
        targets=TimeSeriesSamples([40, *, 1]),
        treatments=EventSamples([40, 1])
    )
)
This contains:

time_series:

TimeSeriesSamples with data:

k_in p
sample_idx time_idx
0 0 -0.781441 -0.245827
1 -1.001889 -0.541523
2 -1.070862 -0.589325
3 -1.425115 -1.065485
4 -1.841006 -1.542429
... ... ... ...
39 5 0.959902 -0.690056
6 1.683426 -0.128967
7 2.233045 0.637905
8 1.645018 1.056957
9 0.333051 1.048721

400 rows × 2 columns

predictive.targets:

TimeSeriesSamples with data:

y
sample_idx time_idx
0 0 -0.197049
1 0.020346
2 -0.281120
3 -0.483934
4 -0.947253
... ... ...
39 5 -1.418583
6 -1.495843
7 -1.193632
8 -0.850845
9 -0.431990

400 rows × 1 columns

predictive.treatments:

EventSamples with data:

a
sample_idx
0 (7, False)
1 (7, False)
2 (7, False)
3 (7, False)
4 (7, False)
5 (7, False)
6 (7, False)
7 (7, False)
8 (7, False)
9 (7, False)
10 (7, False)
11 (7, False)
12 (7, False)
13 (7, False)
14 (7, False)
15 (7, False)
16 (7, False)
17 (7, False)
18 (7, False)
19 (7, False)
20 (7, True)
21 (7, True)
22 (7, True)
23 (7, True)
24 (7, True)
25 (7, True)
26 (7, True)
27 (7, True)
28 (7, True)
29 (7, True)
30 (7, True)
31 (7, True)
32 (7, True)
33 (7, True)
34 (7, True)
35 (7, True)
36 (7, True)
37 (7, True)
38 (7, True)
39 (7, True)

--------------------------------------------------------------------------------

DummyTemporalTreatmentEffectsDataSource loads the following dataset:

TemporalTreatmentEffectsDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 3]),
    predictive=TemporalTreatmentEffectsTaskData(
        targets=TimeSeriesSamples([100, *, 2]),
        treatments=TimeSeriesSamples([100, *, 2])
    )
)
This contains:

time_series:

TimeSeriesSamples with data:

0 1 2 3 4
sample_idx time_idx
0 0 NaN 0.893763 NaN NaN 1.047522
1 1.257931 2.172271 2.226089 2.360713 1.981578
2 2.247657 0.853397 2.525946 3.213647 2.897191
3 3.396456 5.386071 3.721545 2.503248 3.517212
4 4.387812 3.365264 5.612532 5.573375 4.767746
... ... ... ... ... ... ...
99 12 12.654769 14.810888 12.914859 NaN 12.818675
13 13.418815 12.135655 12.481295 13.336797 13.696168
14 13.785503 14.431228 15.193174 17.551818 14.464249
15 15.344934 15.916966 14.368132 15.965113 15.419334
16 16.033907 15.162631 17.338485 17.007235 17.034645

1547 rows × 5 columns

static:

StaticSamples with data:

0 1 2
sample_idx
0 0.753423 3.239284 0.995587
1 0.829240 3.175298 0.770566
2 0.674581 3.229741 1.302317
3 0.584040 3.234011 1.594861
4 0.501552 3.211027 0.639503
... ... ... ...
95 0.680235 3.287749 0.705369
96 0.788814 3.313229 1.318394
97 0.589116 3.268607 1.646737
98 0.551060 3.268599 0.998024
99 0.716501 3.254501 1.047537

100 rows × 3 columns

predictive.targets:

TimeSeriesSamples with data:

0 1
sample_idx time_idx
0 0 -1.433570 0.714861
1 -0.600733 2.744446
2 0.622874 1.816995
3 1.879785 4.981217
4 2.477957 5.932101
... ... ... ...
99 12 10.736462 13.415872
13 11.617465 15.103293
14 12.858327 16.105966
15 13.652358 16.148926
16 14.442286 17.567963

1547 rows × 2 columns

predictive.treatments:

TimeSeriesSamples with data:

0 1
sample_idx time_idx
0 0 -1.433570 0.714861
1 -0.600733 2.744446
2 0.622874 1.816995
3 1.879785 4.981217
4 2.477957 5.932101
... ... ... ...
99 12 10.736462 13.415872
13 11.617465 15.103293
14 12.858327 16.105966
15 13.652358 16.148926
16 14.442286 17.567963

1547 rows × 2 columns