User Guide Tutorial 02: Preprocessing › Imputation
This tutorial shows how to use the TemporAI preprocessing.imputation plugins to fill in missing values in static and time series data.
All preprocessing.imputation plugins
To list all available imputation plugins:
from tempor import plugin_loader
plugin_loader.list()["preprocessing"]["imputation"]
{'static': ['static_tabular_imputer'],
'temporal': ['ffill', 'ts_tabular_imputer', 'bfill']}
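The listing above maps each data modality to its plugin names. As a minimal sketch, assuming the fully-qualified naming pattern that the `plugin_loader.get(...)` calls in this tutorial use (for example `preprocessing.imputation.temporal.bfill`), the full names can be assembled as shown here. The `listing` dictionary is copied from the output above rather than obtained from `plugin_loader`, and `prefix` and `full_names` are illustrative names only.

```python
# Copied from the tutorial output above; not fetched from TemporAI here.
listing = {
    "static": ["static_tabular_imputer"],
    "temporal": ["ffill", "ts_tabular_imputer", "bfill"],
}

# Assumed naming pattern: <category>.<modality>.<plugin name>.
prefix = "preprocessing.imputation"
full_names = [
    f"{prefix}.{modality}.{name}"
    for modality, plugin_names in listing.items()
    for name in plugin_names
]

for full_name in full_names:
    print(full_name)
```

Each printed string has the same shape as the names passed to `plugin_loader.get(...)` in the sections that follow.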
Next, load the data source class used in the rest of this tutorial:
SineDataSource = plugin_loader.get_class("prediction.one_off.sine", plugin_type="datasource")
Using a static data imputation plugin
from tempor import plugin_loader
dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)
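# Note: the parameters printed for this model list 'imputer': 'ice', so the
# static_imputer="mean" keyword does not appear to select the imputer used in this run.
model = plugin_loader.get("preprocessing.imputation.static.static_tabular_imputer", static_imputer="mean")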
print(model)
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | Iteration imputation: select_model_by_column: True, select_model_by_iteration: True
OneOffPredictionDataset(
time_series=TimeSeriesSamples([100, *, 5]),
static=StaticSamples([100, 4]),
predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
StaticTabularImputer(
name='static_tabular_imputer',
category='preprocessing.imputation.static',
plugin_type='method',
params={
'imputer': 'ice',
'random_state': 0,
'imputer_params': {'random_state': 0}
}
)
# Count the missing values in the static data before imputation.
print("Missing value count:", dataset.static.dataframe().isnull().sum().sum()) # type: ignore
dataset.static
Missing value count: 40
StaticSamples with data:
| sample_idx | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.374540 | 0.950714 | 0.731994 | 0.598658 |
| 1 | 0.156019 | 0.155995 | 0.058084 | 0.866176 |
| 2 | 0.601115 | 0.708073 | 0.020584 | 0.969910 |
| 3 | 0.832443 | NaN | 0.181825 | 0.183405 |
| 4 | 0.304242 | 0.524756 | 0.431945 | 0.291229 |
| ... | ... | ... | ... | ... |
| 95 | NaN | 0.696737 | 0.628943 | NaN |
| 96 | 0.735071 | 0.803481 | 0.282035 | NaN |
| 97 | 0.750615 | 0.806835 | 0.990505 | 0.412618 |
| 98 | 0.372018 | 0.776413 | 0.340804 | 0.930757 |
| 99 | 0.858413 | 0.428994 | 0.750871 | 0.754543 |
100 rows × 4 columns
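The missing-value count above chains two sums: the first counts missing entries per column, and the second adds those column counts into one total. The sketch below shows this on a small hypothetical frame, not the tutorial dataset, and assumes that `dataframe()` returns a pandas DataFrame, as the `isnull()` call suggests. The names `df`, `per_column`, `total`, and `fraction` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing entries.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

per_column = df.isnull().sum()   # missing entries in each column
total = int(per_column.sum())    # missing entries in the whole frame
fraction = total / df.size       # share of all entries that are missing

print(per_column.to_dict())
print(total, fraction)
```

By the same arithmetic, the 40 missing values reported above out of 100 × 4 = 400 static entries correspond to 10% missingness.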
# After imputation, the static data contains no missing values.
dataset = model.fit_transform(dataset) # Or call fit() then transform().
print("Missing value count:", dataset.static.dataframe().isnull().sum().sum()) # type: ignore
dataset.static
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > HyperImpute using inner optimization
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 0
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.22005514968324613 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 1
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21670750510884584 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 2
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.2166465658117811 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 3
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.2166456843686172 <-- Model linear_regression
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 4
2023-11-16 22:18:42 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 5
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 6
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 7
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 8
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 9
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | > Imputation iter 10
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 0 <-- score -0.23109777030617995 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>> Column 1 <-- score -0.21664567863448503 <-- Model linear_regression
2023-11-16 22:18:43 | INFO | hyperimpute.logger:log_and_print:65 | >>>> Early stopping on objective diff iteration
Missing value count: 0
StaticSamples with data:
| sample_idx | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.374540 | 0.950714 | 0.731994 | 0.598658 |
| 1 | 0.156019 | 0.155995 | 0.058084 | 0.866176 |
| 2 | 0.601115 | 0.708073 | 0.020584 | 0.969910 |
| 3 | 0.832443 | 0.450438 | 0.181825 | 0.183405 |
| 4 | 0.304242 | 0.524756 | 0.431945 | 0.291229 |
| ... | ... | ... | ... | ... |
| 95 | 0.498806 | 0.696737 | 0.628943 | 0.509994 |
| 96 | 0.735071 | 0.803481 | 0.282035 | 0.503886 |
| 97 | 0.750615 | 0.806835 | 0.990505 | 0.412618 |
| 98 | 0.372018 | 0.776413 | 0.340804 | 0.930757 |
| 99 | 0.858413 | 0.428994 | 0.750871 | 0.754543 |
100 rows × 4 columns
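The code comment above notes that `fit_transform()` can be replaced by `fit()` followed by `transform()`. The sketch below illustrates that relationship with a hypothetical `ToyMeanImputer` written in plain Python. It is not TemporAI's implementation, and it uses simple column-mean imputation rather than the 'ice' imputer shown in the output above: `fit()` learns one mean per column from the observed values, and `transform()` substitutes that mean for every missing entry.

```python
import math


class ToyMeanImputer:
    """Hypothetical column-mean imputer, for illustration only."""

    def fit(self, rows):
        # Learn the mean of the observed (non-NaN) values in each column.
        # Assumes every column has at least one observed value.
        self.means_ = []
        for col in range(len(rows[0])):
            observed = [row[col] for row in rows if not math.isnan(row[col])]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        # Replace each NaN with the mean learned for its column.
        return [
            [self.means_[col] if math.isnan(value) else value
             for col, value in enumerate(row)]
            for row in rows
        ]

    def fit_transform(self, rows):
        # Convenience wrapper: fit, then transform the same data.
        return self.fit(rows).transform(rows)


rows = [[1.0, 10.0], [math.nan, 20.0], [3.0, math.nan]]

one_step = ToyMeanImputer().fit_transform(rows)
fitted = ToyMeanImputer().fit(rows)
two_step = fitted.transform(rows)

print(one_step)  # [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
print(one_step == two_step)  # True
```

Keeping the two steps separate allows a fitted imputer to be reused on other data, for example `fitted.transform([[math.nan, math.nan]])`, which fills both entries with the column means learned during `fit()`.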
Using a temporal data imputation plugin
from tempor import plugin_loader
dataset = SineDataSource(with_missing=True, random_state=42).load()
print(dataset)
model = plugin_loader.get("preprocessing.imputation.temporal.bfill")
print(model)
OneOffPredictionDataset(
time_series=TimeSeriesSamples([100, *, 5]),
static=StaticSamples([100, 4]),
predictive=OneOffPredictionTaskData(targets=StaticSamples([100, 1]))
)
BFillImputer(
name='bfill',
category='preprocessing.imputation.temporal',
plugin_type='method',
params={}
)
# Count the missing values in the temporal data before imputation.
print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())
dataset.time_series
Missing value count: 500
TimeSeriesSamples with data:
| sample_idx | time_idx | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| 0 | 0 | -0.955338 | 0.016053 | -0.995752 | 0.948138 | 0.738158 |
| | 1 | -0.896718 | 0.717189 | -0.497625 | 0.962001 | 0.968258 |
| | 2 | -0.346466 | 0.999920 | 0.423104 | 0.639780 | 0.972469 |
| | 3 | 0.393737 | 0.699299 | 0.984517 | 0.094046 | 0.749807 |
| | 4 | 0.918072 | -0.009290 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | 5 | 0.904284 | -0.939985 | 0.994099 | -0.984349 | 0.688521 |
| | 6 | 0.990911 | -0.518593 | 0.908681 | -0.801263 | 0.813486 |
| | 7 | 0.757745 | 0.131791 | NaN | -0.110629 | 0.908965 |
| | 8 | NaN | 0.723981 | 0.476023 | 0.650082 | 0.971498 |
| | 9 | -0.288052 | 0.996486 | 0.173255 | 0.999008 | 0.998817 |
1000 rows × 5 columns
# After imputation, the temporal data contains no missing values.
dataset = model.fit_transform(dataset) # Or call fit() then transform().
print("Missing value count:", dataset.time_series.dataframe().isnull().sum().sum())
dataset.time_series
Missing value count: 0
TimeSeriesSamples with data:
| sample_idx | time_idx | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| 0 | 0 | -0.955338 | 0.016053 | -0.995752 | 0.948138 | 0.738158 |
| | 1 | -0.896718 | 0.717189 | -0.497625 | 0.962001 | 0.968258 |
| | 2 | -0.346466 | 0.999920 | 0.423104 | 0.639780 | 0.972469 |
| | 3 | 0.393737 | 0.699299 | 0.984517 | 0.094046 | 0.749807 |
| | 4 | 0.918072 | -0.009290 | -0.167662 | -0.893854 | -0.127538 |
| ... | ... | ... | ... | ... | ... | ... |
| 99 | 5 | 0.904284 | -0.939985 | 0.994099 | -0.984349 | 0.688521 |
| | 6 | 0.990911 | -0.518593 | 0.908681 | -0.801263 | 0.813486 |
| | 7 | 0.757745 | 0.131791 | 0.476023 | -0.110629 | 0.908965 |
| | 8 | -0.288052 | 0.723981 | 0.476023 | 0.650082 | 0.971498 |
| | 9 | -0.288052 | 0.996486 | 0.173255 | 0.999008 | 0.998817 |
1000 rows × 5 columns
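Backward filling replaces a missing value with the next observed value at a later time step. In the output above, for example, the gap in feature 2 at time_idx 7 of sample 99 takes the value 0.476023 from time_idx 8. The sketch below reproduces the idea with plain pandas on a small hypothetical frame; it is not TemporAI's implementation. Grouping by `sample_idx` keeps values from leaking across sample boundaries. A gap at the end of a sample has no later value to copy, so the final forward-fill step here is only one possible fallback, added as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two samples with three time steps each.
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1, 2]], names=["sample_idx", "time_idx"]
)
df = pd.DataFrame({"feature": [1.0, 2.0, np.nan, 4.0, np.nan, 6.0]}, index=index)

# Naive backward fill ignores sample boundaries: the gap at the end of
# sample 0 is filled with the first value of sample 1.
naive = df.bfill()

# Backward fill within each sample: the trailing gap in sample 0 stays NaN.
per_sample = df.groupby(level="sample_idx").bfill()

# Assumed fallback for trailing gaps: forward fill within each sample.
complete = per_sample.groupby(level="sample_idx").ffill()

print(naive["feature"].tolist())       # [1.0, 2.0, 4.0, 4.0, 6.0, 6.0]
print(per_sample["feature"].tolist())  # [1.0, 2.0, nan, 4.0, 6.0, 6.0]
print(complete["feature"].tolist())    # [1.0, 2.0, 2.0, 4.0, 6.0, 6.0]
```

The difference between `naive` and `per_sample` at the last time step of sample 0 is why the fill is applied per sample rather than over the whole stacked frame.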