User Guide Tutorial 09: Benchmarks
TemporAI provides benchmarking tools in the `tempor.benchmarks` module; this tutorial demonstrates them.
Using `tempor.benchmarks.benchmark_models`
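The `tempor.benchmarks.benchmark_models` function provides a quick way to benchmark several models (plugins) on a particular task.
It takes a list of test cases, each a `(name, model)` tuple where the model may be a plugin or a `Pipeline`, together with a dataset. It then runs cross-validation with `n_splits` folds and reports the mean and standard deviation of each metric across the folds.
It returns a tuple `(results_readable, results)`: `results_readable` is a single table with one `mean +/- stddev` column per test case, and `results` is a dictionary that maps each test case name to a table of `mean` and `stddev` values, as shown below.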
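Conceptually, the aggregation step can be sketched in plain Python: each test case is scored once per fold, and the per-fold scores for each metric are reduced to a mean and a standard deviation. This is a minimal sketch, not TemporAI's implementation. The helper name `aggregate_fold_scores` and its input layout (a list of per-fold `{metric: score}` dictionaries) are hypothetical, and the use of the population standard deviation (`statistics.pstdev`, the same convention as NumPy's default `ddof=0`) is an assumption.

```python
from statistics import mean, pstdev


def aggregate_fold_scores(fold_scores: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Reduce per-fold metric scores to a (mean, stddev) pair per metric.

    Hypothetical helper: `fold_scores` holds one {metric_name: score} dict per
    cross-validation fold, and every dict has the same metric names.
    """
    summary = {}
    for metric in fold_scores[0]:
        values = [fold[metric] for fold in fold_scores]
        # Population standard deviation (ddof=0) is an assumption of this sketch.
        summary[metric] = (mean(values), pstdev(values))
    return summary


# Hypothetical scores for one model over three folds.
folds = [
    {"accuracy": 0.5, "aucroc": 0.6},
    {"accuracy": 0.75, "aucroc": 0.7},
    {"accuracy": 0.625, "aucroc": 0.8},
]
print(aggregate_fold_scores(folds))
```

Keeping the aggregation separate from training makes it easy to swap in a different spread measure (for example the sample standard deviation, `statistics.stdev`) if that is what your reporting requires.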
```python
from IPython.display import display

from tempor import plugin_loader
from tempor.benchmarks import benchmark_models

# Load a small synthetic dataset (no=25 samples) for one-off prediction.
dataset = plugin_loader.get("prediction.one_off.sine", plugin_type="datasource", random_state=42, no=25).load()

# Each test case is a (name, model) pair; models may be plugins or pipelines.
results_readable, results = benchmark_models(
    task_type="prediction.one_off.classification",
    tests=[
        ("model_1", plugin_loader.get("prediction.one_off.classification.nn_classifier", n_iter=10)),
        ("model_2", plugin_loader.get("prediction.one_off.classification.ode_classifier", n_iter=100)),
    ],
    data=dataset,
    n_splits=3,  # Number of cross-validation folds.
)

print("Results in easily-readable format:")
display(results_readable)

print("Full results:\n")
for model, value in results.items():
    print(f"{model}:")
    display(value)
```
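With `n_splits=3`, each model is trained once per fold, which is why the log below shows three training runs per test case. The sketch below shows a plain, unshuffled k-fold partition of 25 samples in the style of scikit-learn's `KFold`, where the first `n_samples % n_splits` folds receive one extra sample. It is illustrative only: whether `benchmark_models` shuffles or stratifies its folds is not stated in this tutorial, so treat the exact fold assignment as an assumption. `kfold_indices` is a hypothetical name.

```python
def kfold_indices(n_samples: int, n_splits: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs for an unshuffled k-fold split."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # Spread the remainder over the first folds, as scikit-learn's KFold does.
    base, extra = divmod(n_samples, n_splits)
    sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    splits = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        # Every sample outside the current test fold is used for training.
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


for train_idx, test_idx in kfold_indices(25, 3):
    print(len(train_idx), len(test_idx))  # prints 16 9, then 17 8 twice
```

Each sample lands in exactly one test fold, so every metric is computed on held-out data, and the three per-fold values are what get summarized as a mean and standard deviation.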
2023-11-16 22:18:41 | INFO | tempor.benchmarks.benchmark:benchmark_models:104 | Test case: model_1
2023-11-16 22:18:45 | INFO | tempor.models.ts_model:_train:388 | Epoch:0| train loss: 0.68637615442276, validation loss: 0.6680578589439392
2023-11-16 22:18:46 | INFO | tempor.models.ts_model:_train:388 | Epoch:0| train loss: 0.6887814402580261, validation loss: 0.6939960718154907
2023-11-16 22:18:47 | INFO | tempor.models.ts_model:_train:388 | Epoch:0| train loss: 0.6906945109367371, validation loss: 0.6966598033905029
2023-11-16 22:18:47 | INFO | tempor.benchmarks.benchmark:benchmark_models:104 | Test case: model_2
2023-11-16 22:18:52 | INFO | tempor.models.ts_ode:_train:617 | Epoch:99| train loss: 0.6465951204299927, validation loss: 0.5632617473602295
2023-11-16 22:18:58 | INFO | tempor.models.ts_ode:_train:617 | Epoch:99| train loss: 0.913261890411377, validation loss: 0.8132617473602295
2023-11-16 22:19:04 | INFO | tempor.models.ts_ode:_train:617 | Epoch:99| train loss: 0.7132617831230164, validation loss: 0.8132617473602295
Results in easily-readable format:
| Metric | model_1 | model_2 |
|---|---|---|
| accuracy | 0.602 +/- 0.033 | 0.519 +/- 0.105 |
| f1_score_micro | 0.602 +/- 0.033 | 0.519 +/- 0.105 |
| f1_score_macro | 0.375 +/- 0.013 | 0.338 +/- 0.048 |
| f1_score_weighted | 0.453 +/- 0.04 | 0.361 +/- 0.116 |
| kappa | 0.0 +/- 0.0 | 0.0 +/- 0.0 |
| kappa_quadratic | 0.0 +/- 0.0 | 0.0 +/- 0.0 |
| recall_micro | 0.602 +/- 0.033 | 0.519 +/- 0.105 |
| recall_macro | 0.5 +/- 0.0 | 0.5 +/- 0.0 |
| recall_weighted | 0.602 +/- 0.033 | 0.519 +/- 0.105 |
| precision_micro | 0.602 +/- 0.033 | 0.519 +/- 0.105 |
| precision_macro | 0.301 +/- 0.016 | 0.259 +/- 0.053 |
| precision_weighted | 0.363 +/- 0.039 | 0.28 +/- 0.104 |
| mcc | 0.0 +/- 0.0 | 0.0 +/- 0.0 |
| aucprc | 0.385 +/- 0.028 | 0.588 +/- 0.292 |
| aucroc | 0.311 +/- 0.083 | 0.583 +/- 0.312 |
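Each cell of the readable table combines the corresponding `mean` and `stddev` values from the full results printed further down, rounded to three decimal places. The entries are consistent with plain `round(value, 3)` followed by string conversion, which is why trailing zeros are dropped (for example `0.453 +/- 0.04`). The helper below is a hypothetical reconstruction of that formatting, checked against values from the full results tables; it is not TemporAI's own code.

```python
def format_mean_std(mean: float, stddev: float, decimals: int = 3) -> str:
    """Render a mean/stddev pair as 'mean +/- stddev' (hypothetical helper)."""
    return f"{round(mean, decimals)} +/- {round(stddev, decimals)}"


# Values taken from the "Full results" tables of this tutorial.
print(format_mean_std(0.452788, 0.039572))  # model_1, f1_score_weighted: 0.453 +/- 0.04
print(format_mean_std(0.279964, 0.104057))  # model_2, precision_weighted: 0.28 +/- 0.104
```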
Full results:
model_1:
| Metric | mean | stddev |
|---|---|---|
| accuracy | 0.601852 | 0.032736 |
| f1_score_micro | 0.601852 | 0.032736 |
| f1_score_macro | 0.375458 | 0.012951 |
| f1_score_weighted | 0.452788 | 0.039572 |
| kappa | 0.000000 | 0.000000 |
| kappa_quadratic | 0.000000 | 0.000000 |
| recall_micro | 0.601852 | 0.032736 |
| recall_macro | 0.500000 | 0.000000 |
| recall_weighted | 0.601852 | 0.032736 |
| precision_micro | 0.601852 | 0.032736 |
| precision_macro | 0.300926 | 0.016368 |
| precision_weighted | 0.363297 | 0.038647 |
| mcc | 0.000000 | 0.000000 |
| aucprc | 0.385185 | 0.028154 |
| aucroc | 0.311111 | 0.083148 |
model_2:
| Metric | mean | stddev |
|---|---|---|
| accuracy | 0.518519 | 0.105369 |
| f1_score_micro | 0.518519 | 0.105369 |
| f1_score_macro | 0.338162 | 0.047609 |
| f1_score_weighted | 0.360713 | 0.115623 |
| kappa | 0.000000 | 0.000000 |
| kappa_quadratic | 0.000000 | 0.000000 |
| recall_micro | 0.518519 | 0.105369 |
| recall_macro | 0.500000 | 0.000000 |
| recall_weighted | 0.518519 | 0.105369 |
| precision_micro | 0.518519 | 0.105369 |
| precision_macro | 0.259259 | 0.052684 |
| precision_weighted | 0.279964 | 0.104057 |
| mcc | 0.000000 | 0.000000 |
| aucprc | 0.587731 | 0.291568 |
| aucroc | 0.583333 | 0.311805 |
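In both full-results tables, `kappa` and `mcc` are exactly 0, `recall_macro` is exactly 0.5 with zero spread, and `precision_macro` is half of `accuracy`. With two classes, that combination is consistent with a classifier that predicts the same class for every test sample in each fold. The pure-Python sketch below computes these metrics for such a constant predictor on hypothetical labels to show the pattern. It follows the standard definitions and reports undefined ratios as 0 (the convention scikit-learn uses for MCC and for precision with `zero_division=0`); whether TemporAI's metrics follow exactly these conventions is an assumption, and `binary_metrics` is a hypothetical helper.

```python
from math import sqrt


def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, macro recall/precision, Cohen's kappa and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    n = len(pairs)

    def safe_div(num: float, den: float) -> float:
        # Report undefined ratios (division by zero) as 0.0.
        return num / den if den else 0.0

    accuracy = (tp + tn) / n
    recall_macro = (safe_div(tp, tp + fn) + safe_div(tn, tn + fp)) / 2
    precision_macro = (safe_div(tp, tp + fp) + safe_div(tn, tn + fn)) / 2
    # Expected chance agreement from the label and prediction class frequencies.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = safe_div(accuracy - p_e, 1 - p_e)
    mcc = safe_div(tp * tn - fp * fn, sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "accuracy": accuracy,
        "recall_macro": recall_macro,
        "precision_macro": precision_macro,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical fold: 5 of 8 labels are positive, and the model always predicts 1.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
print(binary_metrics(y_true, [1] * len(y_true)))
```

For the constant predictor, observed agreement equals chance agreement, so kappa is 0; the never-predicted class contributes a recall and precision of 0, which pins `recall_macro` at 0.5 and halves `precision_macro` relative to accuracy.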
Supported tasks
⚠️ Not all task types are supported by `benchmark_models` yet.
Supported values of the `task_type` argument:
* `task_type="prediction.one_off.classification"`
* `task_type="prediction.one_off.regression"`
* `task_type="time_to_event"`
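If you call `benchmark_models` from your own scripts, a small pre-check against the task types listed above can surface an unsupported `task_type` before any model is trained. The set below is copied from that list and may fall out of date as support is added. `SUPPORTED_TASK_TYPES` and `check_task_type` are hypothetical names for illustration, not part of `tempor.benchmarks`.

```python
# Copied from the "Supported tasks" list in this tutorial; may become outdated.
SUPPORTED_TASK_TYPES = frozenset(
    {
        "prediction.one_off.classification",
        "prediction.one_off.regression",
        "time_to_event",
    }
)


def check_task_type(task_type: str) -> str:
    """Return task_type unchanged if it is listed as supported, else raise ValueError."""
    if task_type not in SUPPORTED_TASK_TYPES:
        supported = ", ".join(sorted(SUPPORTED_TASK_TYPES))
        raise ValueError(f"Unsupported task_type {task_type!r}; expected one of: {supported}")
    return task_type


# Passes for a listed task type; an unlisted string raises ValueError.
task_type = check_task_type("prediction.one_off.classification")
```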