User Guide Tutorial 02: Preprocessing › Imputation

This tutorial shows how to use the TemporAI `preprocessing.imputation` plugins.

All `preprocessing.imputation` plugins

To list all available imputation plugins:

[ ]:

from tempor import plugin_loader

plugin_loader.list()["preprocessing"]["imputation"]

{'static': ['static_tabular_imputer'],
 'temporal': ['ffill', 'ts_tabular_imputer', 'bfill']}
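
The nested dictionary mirrors the dotted names passed to `plugin_loader.get` later in this tutorial (for example `preprocessing.imputation.temporal.bfill`). As a minimal sketch, the listing can be flattened into such names. The helper `flatten_plugin_names` is hypothetical and not part of TemporAI, and the dictionary is copied from the output above rather than fetched from the library.

```python
from typing import Any, Dict, List


def flatten_plugin_names(tree: Dict[str, Any], prefix: str = "") -> List[str]:
    """Turn a nested {category: {...: [plugin, ...]}} listing into dotted names."""
    names: List[str] = []
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into sub-categories, carrying the dotted path along.
            names.extend(flatten_plugin_names(value, path))
        else:
            # Leaf: a list of plugin names within this category.
            names.extend(f"{path}.{plugin}" for plugin in value)
    return names


# Listing copied from the output above (a stand-in for plugin_loader.list()).
listing = {
    "preprocessing": {
        "imputation": {
            "static": ["static_tabular_imputer"],
            "temporal": ["ffill", "ts_tabular_imputer", "bfill"],
        }
    }
}

for name in flatten_plugin_names(listing):
    print(name)
```

Each printed name has the same shape as the strings passed to `plugin_loader.get` in the sections that follow.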

Next, load the data source used in the examples below:

[ ]:

SineDataSource = plugin_loader.get_class("prediction.one_off.sine", plugin_type="datasource")

Using a static data imputation plugin

[ ]:

from tempor import plugin_loader

dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)

# Use the plugin's default parameters (the printed params below show the 'ice' imputer).
model = plugin_loader.get("preprocessing.imputation.static.static_tabular_imputer")
print(model)

2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 | Iteration imputation: select_model_by_column: True, select_model_by_iteration: True

OneOffPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
StaticTabularImputer(
    name='static_tabular_imputer',
    category='preprocessing.imputation.static',
    plugin_type='method',
    params={
        'imputer': 'ice',
        'random_state': 0,
        'imputer_params': {'random_state': 0}
    }
)

[ ]:

# Note missingness in static data.

print("Missing value count:", dataset.static.dataframe().isnull().sum().sum())  # type: ignore

dataset.static

Missing value count: 40

StaticSamples with data:

	0	1	2	3
sample_idx
0	0.374540	0.950714	0.731994	0.598658
1	0.156019	0.155995	0.058084	0.866176
2	0.601115	0.708073	0.020584	0.969910
3	0.832443	NaN	0.181825	0.183405
4	0.304242	0.524756	0.431945	0.291229
...	...	...	...	...
95	NaN	0.696737	0.628943	NaN
96	0.735071	0.803481	0.282035	NaN
97	0.750615	0.806835	0.990505	0.412618
98	0.372018	0.776413	0.340804	0.930757
99	0.858413	0.428994	0.750871	0.754543

100 rows × 4 columns

[ ]:

# After imputation, no missing values remain in the static data.

dataset = model.fit_transform(dataset)  # Or call fit() then transform().

print("Missing value count:", dataset.static.dataframe().isnull().sum().sum())  # type: ignore

dataset.static

2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > HyperImpute using inner optimization
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 0
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.22005514968324613 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 1
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21670750510884584 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 2
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.2166465658117811 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 3
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.2166456843686172 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 4
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 5
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 6
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 7
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 8
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 9
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 10
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>>> Early stopping on objective diff iteration

Missing value count: 0

StaticSamples with data:

	0	1	2	3
sample_idx
0	0.374540	0.950714	0.731994	0.598658
1	0.156019	0.155995	0.058084	0.866176
2	0.601115	0.708073	0.020584	0.969910
3	0.832443	0.450438	0.181825	0.183405
4	0.304242	0.524756	0.431945	0.291229
...	...	...	...	...
95	0.498806	0.696737	0.628943	0.509994
96	0.735071	0.803481	0.282035	0.503886
97	0.750615	0.806835	0.990505	0.412618
98	0.372018	0.776413	0.340804	0.930757
99	0.858413	0.428994	0.750871	0.754543

100 rows × 4 columns
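
The comment `# Or call fit() then transform().` reflects a two-step pattern: `fit()` learns what is needed from the data and `transform()` applies it. The sketch below illustrates that split with a toy column-mean imputer in plain Python. `ToyMeanImputer` is a hypothetical class written for this illustration. It is not TemporAI's `StaticTabularImputer`, whose printed parameters show the `ice` imputer rather than mean imputation, and it assumes every column has at least one observed value.

```python
import math
from typing import List, Optional

Row = List[float]


class ToyMeanImputer:
    """Hypothetical illustration only: fills NaNs with per-column means."""

    def __init__(self) -> None:
        self.means_: Optional[List[float]] = None

    def fit(self, rows: List[Row]) -> "ToyMeanImputer":
        # Learn one mean per column, ignoring missing entries.
        # Assumption: each column contains at least one observed value.
        means = []
        for col in range(len(rows[0])):
            observed = [r[col] for r in rows if not math.isnan(r[col])]
            means.append(sum(observed) / len(observed))
        self.means_ = means
        return self

    def transform(self, rows: List[Row]) -> List[Row]:
        if self.means_ is None:
            raise RuntimeError("Call fit() before transform().")
        # Replace each missing entry with the mean learned for its column.
        return [
            [self.means_[c] if math.isnan(v) else v for c, v in enumerate(r)]
            for r in rows
        ]

    def fit_transform(self, rows: List[Row]) -> List[Row]:
        return self.fit(rows).transform(rows)


nan = float("nan")
data = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]

imputer = ToyMeanImputer()
imputer.fit(data)                  # step 1: learn the column means
filled = imputer.transform(data)   # step 2: apply them

missing = sum(math.isnan(v) for row in filled for v in row)
print("Missing value count:", missing)
```

Keeping the two steps separate makes it possible to fit on one dataset and reuse the fitted object on another, while `fit_transform()` is a shorthand for doing both on the same data.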

Using a temporal data imputation plugin

[ ]:

from tempor import plugin_loader

dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)

model = plugin_loader.get("preprocessing.imputation.temporal.bfill")
print(model)

OneOffPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
BFillImputer(
    name='bfill',
    category='preprocessing.imputation.temporal',
    plugin_type='method',
    params={}
)

[ ]:

# Note missingness in temporal data.

print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())

dataset.time_series

Missing value count: 500

TimeSeriesSamples with data:

		0	1	2	3	4
sample_idx	time_idx
0	0	-0.955338	0.016053	-0.995752	0.948138	0.738158
	1	-0.896718	0.717189	-0.497625	0.962001	0.968258
	2	-0.346466	0.999920	0.423104	0.639780	0.972469
	3	0.393737	0.699299	0.984517	0.094046	0.749807
	4	0.918072	-0.009290	NaN	NaN	NaN
...	...	...	...	...	...	...
99	5	0.904284	-0.939985	0.994099	-0.984349	0.688521
	6	0.990911	-0.518593	0.908681	-0.801263	0.813486
	7	0.757745	0.131791	NaN	-0.110629	0.908965
	8	NaN	0.723981	0.476023	0.650082	0.971498
	9	-0.288052	0.996486	0.173255	0.999008	0.998817

1000 rows × 5 columns

[ ]:

# After imputation, no missing values remain in the temporal data.

dataset = model.fit_transform(dataset)  # Or call fit() then transform().

print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())

dataset.time_series

Missing value count: 0
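
Backward fill replaces each missing value with the next observed value of the same feature. A minimal plain-Python sketch of that idea for a single feature of a single sample follows. The function `backward_fill` is hypothetical and is not TemporAI's `BFillImputer`. How the plugin treats gaps at the end of a series is not shown in this tutorial, so the sketch simply leaves them missing.

```python
import math
from typing import List


def backward_fill(values: List[float]) -> List[float]:
    """Fill each NaN with the next non-NaN value; trailing NaNs stay NaN."""
    filled = list(values)
    next_valid = math.nan
    # Walk from the end so the nearest later observation is always known.
    for i in range(len(filled) - 1, -1, -1):
        if math.isnan(filled[i]):
            filled[i] = next_valid
        else:
            next_valid = filled[i]
    return filled


nan = math.nan
series = [0.5, nan, nan, 0.9, nan]
result = backward_fill(series)
print(result)
```

In this toy series the two interior gaps take the later value `0.9`, while the final gap has no later observation to copy from and remains missing.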