User Guide Tutorial 02: Preprocessing › Imputation

This tutorial shows how to use TemporAI preprocessing.imputation plugins.

All preprocessing.imputation plugins

To see all the relevant plugins:

[ ]:
from tempor import plugin_loader

plugin_loader.list()["preprocessing"]["imputation"]
{'static': ['static_tabular_imputer'],
 'temporal': ['ffill', 'ts_tabular_imputer', 'bfill']}
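
The nested keys of this listing line up with the dotted names passed to plugin_loader.get() later in this tutorial (for example "preprocessing.imputation.temporal.bfill"). As a minimal sketch, the listing can be flattened into such names with plain Python. The helper fully_qualified_names and the hard-coded copy of the listing are illustrative assumptions, not part of TemporAI:

```python
# Copy of the listing printed above, hard-coded so the sketch runs without TemporAI.
listing = {
    "static": ["static_tabular_imputer"],
    "temporal": ["ffill", "ts_tabular_imputer", "bfill"],
}


def fully_qualified_names(listing, prefix="preprocessing.imputation"):
    """Hypothetical helper: join category prefix, subcategory and plugin name with dots."""
    return [
        f"{prefix}.{subcategory}.{name}"
        for subcategory, names in listing.items()
        for name in names
    ]


for fqn in fully_qualified_names(listing):
    print(fqn)
```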

Next, load the data source used in the examples below:

[ ]:
SineDataSource = plugin_loader.get_class("prediction.one_off.sine", plugin_type="datasource")

Using a static data imputation plugin

[ ]:
from tempor import plugin_loader

dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)

model = plugin_loader.get("preprocessing.imputation.static.static_tabular_imputer", static_imputer="mean")
print(model)
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 | Iteration imputation: select_model_by_column: True, select_model_by_iteration: True
OneOffPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
StaticTabularImputer(
    name='static_tabular_imputer',
    category='preprocessing.imputation.static',
    plugin_type='method',
    params={
        'imputer': 'ice',
        'random_state': 0,
        'imputer_params': {'random_state': 0}
    }
)
[ ]:
# Note missingness in static data.

print("Missing value count:", dataset.static.dataframe().isnull().sum().sum())  # type: ignore

dataset.static
Missing value count: 40

StaticSamples with data:

                   0         1         2         3
sample_idx
0           0.374540  0.950714  0.731994  0.598658
1           0.156019  0.155995  0.058084  0.866176
2           0.601115  0.708073  0.020584  0.969910
3           0.832443       NaN  0.181825  0.183405
4           0.304242  0.524756  0.431945  0.291229
...              ...       ...       ...       ...
95               NaN  0.696737  0.628943       NaN
96          0.735071  0.803481  0.282035       NaN
97          0.750615  0.806835  0.990505  0.412618
98          0.372018  0.776413  0.340804  0.930757
99          0.858413  0.428994  0.750871  0.754543

100 rows × 4 columns
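
The missing value count above comes from chaining isnull().sum().sum() on the underlying pandas DataFrame. As a small sketch on a made-up frame (the values and column names below are illustrative, not the tutorial's data), the first sum() counts missing cells per column and the second adds those counts into a single total:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing cell.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
    }
)

per_column = df.isnull().sum()  # Series of counts: a has 1 missing cell, b has 2
total = int(per_column.sum())   # Sum over columns: 3 missing cells in total

print(per_column)
print("Missing value count:", total)
```

Dropping the second sum() is a quick way to see which columns carry the missingness.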

[ ]:
# Note no more missingness in static data.

dataset = model.fit_transform(dataset)  # Or call fit() then transform().

print("Missing value count:", dataset.static.dataframe().isnull().sum().sum())  # type: ignore

dataset.static
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > HyperImpute using inner optimization
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 0
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.22005514968324613 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 1
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21670750510884584 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 2
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.2166465658117811 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 3
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.2166456843686172 <-- Model linear_regression
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 4
2023-11-16 22:18:42 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 5
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 6
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 7
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 8
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 9
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |   > Imputation iter 10
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO     | hyperimpute.logger:log_and_print:65 |      >>>> Early stopping on objective diff iteration
Missing value count: 0

StaticSamples with data:

                   0         1         2         3
sample_idx
0           0.374540  0.950714  0.731994  0.598658
1           0.156019  0.155995  0.058084  0.866176
2           0.601115  0.708073  0.020584  0.969910
3           0.832443  0.450438  0.181825  0.183405
4           0.304242  0.524756  0.431945  0.291229
...              ...       ...       ...       ...
95          0.498806  0.696737  0.628943  0.509994
96          0.735071  0.803481  0.282035  0.503886
97          0.750615  0.806835  0.990505  0.412618
98          0.372018  0.776413  0.340804  0.930757
99          0.858413  0.428994  0.750871  0.754543

100 rows × 4 columns
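
The comment in the cell above notes that fit_transform() can be replaced by fit() followed by transform(). The sketch below illustrates that pattern with a toy mean imputer on a pandas DataFrame. ToyMeanImputer is a hypothetical class written only for this illustration; it is not TemporAI's implementation and does not reproduce the iterative imputation shown in the log output:

```python
import numpy as np
import pandas as pd


class ToyMeanImputer:
    """Hypothetical illustration class, not part of TemporAI."""

    def fit(self, df):
        # Learn one mean per column; pandas skips NaN when averaging.
        self.means_ = df.mean()
        return self

    def transform(self, df):
        # Fill each column's missing cells with the mean learned in fit().
        return df.fillna(self.means_)

    def fit_transform(self, df):
        return self.fit(df).transform(df)


# Made-up data, not the tutorial's dataset.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

two_step = ToyMeanImputer().fit(df).transform(df)
one_step = ToyMeanImputer().fit_transform(df)

print(two_step)
print("Same result:", two_step.equals(one_step))
```

Keeping fit() and transform() separate is useful when the statistics should be learned on one dataset and then applied to another.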

Using a temporal data imputation plugin

[ ]:
from tempor import plugin_loader

dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)

model = plugin_loader.get("preprocessing.imputation.temporal.bfill")
print(model)
OneOffPredictionDataset(
    time_series=TimeSeriesSamples([100, *, 5]),
    static=StaticSamples([100, 4]),
    predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
BFillImputer(
    name='bfill',
    category='preprocessing.imputation.temporal',
    plugin_type='method',
    params={}
)
[ ]:
# Note missingness in temporal data.

print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())

dataset.time_series
Missing value count: 500

TimeSeriesSamples with data:

                            0         1         2         3         4
sample_idx time_idx
0          0        -0.955338  0.016053 -0.995752  0.948138  0.738158
           1        -0.896718  0.717189 -0.497625  0.962001  0.968258
           2        -0.346466  0.999920  0.423104  0.639780  0.972469
           3         0.393737  0.699299  0.984517  0.094046  0.749807
           4         0.918072 -0.009290       NaN       NaN       NaN
...        ...            ...       ...       ...       ...       ...
99         5         0.904284 -0.939985  0.994099 -0.984349  0.688521
           6         0.990911 -0.518593  0.908681 -0.801263  0.813486
           7         0.757745  0.131791       NaN -0.110629  0.908965
           8              NaN  0.723981  0.476023  0.650082  0.971498
           9        -0.288052  0.996486  0.173255  0.999008  0.998817

1000 rows × 5 columns

[ ]:
# Note no more missingness in temporal data.

dataset = model.fit_transform(dataset)  # Or call fit() then transform().

print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())

dataset.time_series
Missing value count: 0

TimeSeriesSamples with data:

                            0         1         2         3         4
sample_idx time_idx
0          0        -0.955338  0.016053 -0.995752  0.948138  0.738158
           1        -0.896718  0.717189 -0.497625  0.962001  0.968258
           2        -0.346466  0.999920  0.423104  0.639780  0.972469
           3         0.393737  0.699299  0.984517  0.094046  0.749807
           4         0.918072 -0.009290 -0.167662 -0.893854 -0.127538
...        ...            ...       ...       ...       ...       ...
99         5         0.904284 -0.939985  0.994099 -0.984349  0.688521
           6         0.990911 -0.518593  0.908681 -0.801263  0.813486
           7         0.757745  0.131791  0.476023 -0.110629  0.908965
           8        -0.288052  0.723981  0.476023  0.650082  0.971498
           9        -0.288052  0.996486  0.173255  0.999008  0.998817

1000 rows × 5 columns
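
In the output above, each gap takes the next observed value from the same sample: for sample 99, column 0 at time_idx 8 now holds the value from time_idx 9. The sketch below reproduces that behaviour with plain pandas on made-up data. Grouping by sample_idx is an assumption about how to express a per-sample backward fill and is not necessarily how the bfill plugin is implemented:

```python
import numpy as np
import pandas as pd

# Toy long-format time series indexed by (sample_idx, time_idx); values are made up.
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1, 2]], names=["sample_idx", "time_idx"]
)
toy = pd.DataFrame({"x": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0]}, index=index)

# Backward fill inside each sample, so values never leak from one sample into another.
filled = toy.groupby(level="sample_idx").bfill()

print(filled)
```

Note that in this sketch a gap at the end of a sample stays NaN, because there is no later value within that sample to copy from. The plugin may treat such trailing gaps differently.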