tempor.benchmarks.benchmark module

The main benchmarking module.

tempor.benchmarks.benchmark.print_score(mean: Series, std: Series) → Series: Return the mean and standard deviation of a metric in a human-readable format.
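The "MEAN +/- STDDEV" format mentioned in this module can be illustrated with a small standard-library sketch. This is not tempor's implementation: `format_score` is a hypothetical helper, the three-decimal rounding is an assumption, and plain floats stand in for the pandas Series that print_score accepts.

```python
def format_score(mean: float, std: float, ndigits: int = 3) -> str:
    """Render one metric as 'MEAN +/- STDDEV' (hypothetical helper, not part of tempor)."""
    return f"{mean:.{ndigits}f} +/- {std:.{ndigits}f}"


# Illustrative values only; these are not real benchmark results.
metrics = {"accuracy": (0.8131, 0.0312), "f1": (0.79, 0.05)}

# One readable string per metric name.
readable = {name: format_score(m, s) for name, (m, s) in metrics.items()}

for name, text in readable.items():
    print(f"{name}: {text}")
```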

tempor.benchmarks.benchmark.benchmark_models(task_type: Literal[prediction.one_off.classification] | Literal[prediction.one_off.regression] | Literal[prediction.temporal.classification] | Literal[prediction.temporal.regression] | Literal[time_to_event] | Literal[treatments.one_off.classification] | Literal[treatments.one_off.regression] | Literal[treatments.temporal.classification] | Literal[treatments.temporal.regression], tests: list[tuple[str, Any]], data: PredictiveDataset, n_splits: int = 3, random_state: int = 0, horizons: list[float] | list[int] | list[Timestamp] | None = None, raise_exceptions: bool = False, silence_warnings: bool = True) → tuple[DataFrame, dict[str, DataFrame]]

Benchmark the performance of several algorithms.

Parameters:

task_type : PredictiveTaskType: The type of task, which determines the metrics used to evaluate the models. Must be one of the PredictiveTaskType values.
tests : List[Tuple[str, Any]]: Tuples of the form (test_name: str, plugin: BasePredictor/Pipeline).
data : dataset.PredictiveDataset: The evaluation dataset to use for cross-validation.
n_splits : int, optional: Number of splits used for cross-validation. Defaults to 3.
random_state : int, optional: Random seed. Defaults to 0.
horizons : Optional[data_typing.TimeIndex], optional: Time horizons for making predictions, if applicable to the task. Defaults to None.
raise_exceptions : bool, optional: Whether to raise exceptions during evaluation. If False, exceptions are swallowed and the evaluation continues; the exception count is reported in the "errors" column of the resulting dataframe. Defaults to False.
silence_warnings : bool, optional: Whether to silence warnings. Some dependencies (e.g. xgbse) may circumvent this and raise warnings regardless. Defaults to True.
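The raise_exceptions behaviour described above can be pictured with a standard-library sketch. This is an illustration under stated assumptions, not tempor's code: `evaluate_folds`, `score_fold`, and `flaky_score` are hypothetical names, and the way the summary is aggregated here is an assumption made for the example.

```python
from statistics import mean, stdev
from typing import Callable


def evaluate_folds(
    score_fold: Callable[[int], float],
    n_splits: int = 3,
    raise_exceptions: bool = False,
) -> dict:
    """Score each fold; either re-raise failures or count them (hypothetical helper)."""
    scores = []
    errors = 0
    for fold in range(n_splits):
        try:
            scores.append(score_fold(fold))
        except Exception:
            if raise_exceptions:
                raise  # surface the failure immediately
            errors += 1  # swallow the failure and keep evaluating
    return {
        "mean": mean(scores) if scores else float("nan"),
        "stddev": stdev(scores) if len(scores) > 1 else 0.0,
        "errors": errors,
    }


def flaky_score(fold: int) -> float:
    """Stand-in for a model evaluation that fails on one fold."""
    if fold == 1:
        raise ValueError("simulated failure on fold 1")
    return 0.8 + 0.1 * fold


summary = evaluate_folds(flaky_score, n_splits=3)
print(summary)
```

With raise_exceptions left at False, the failing fold is only counted, so a single broken model does not abort a long benchmark run; passing True is more useful while debugging a plugin.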

Returns:

The benchmarking results given as (readable_dataframe: pd.DataFrame, results: Dict[str, pd.DataFrame]) where:

readable_dataframe: a dataframe with metric name as index and test names as columns, where the values are readable string representations of the evaluation metric, like: MEAN +/- STDDEV.
results: a dictionary mapping the test name to a dataframe with metric names as index and ["mean", "stddev"] columns, where the values are the float mean and standard deviation for each metric.

Return type:

Tuple[pd.DataFrame, Dict[str, pd.DataFrame]]
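A call to benchmark_models and the unpacking of its two return values might look like the sketch below. It relies only on the signature documented above; `run_classification_benchmark` is a hypothetical wrapper, and the caller is assumed to supply `tests` (name and plugin/pipeline tuples) and `data` (a PredictiveDataset) built elsewhere. The import is deferred so the block can be loaded without tempor installed.

```python
from typing import Any


def run_classification_benchmark(tests: list[tuple[str, Any]], data: Any):
    """Run a one-off classification benchmark (hypothetical wrapper)."""
    # Deferred import: tempor is only required when the function is called.
    from tempor.benchmarks.benchmark import benchmark_models

    readable_dataframe, results = benchmark_models(
        task_type="prediction.one_off.classification",
        tests=tests,
        data=data,
        n_splits=3,
        random_state=0,
        raise_exceptions=False,
        silence_warnings=True,
    )

    # Metric names as index, test names as columns, "MEAN +/- STDDEV" strings as values.
    print(readable_dataframe)

    # One dataframe per test, with float "mean" and "stddev" columns.
    for test_name, frame in results.items():
        print(test_name)
        print(frame[["mean", "stddev"]])

    return readable_dataframe, results
```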

tempor.benchmarks.benchmark.visualize_benchmark(results: dict[str, DataFrame], palette: str = 'viridis', plot_block: bool = True) → Any

Visualize the benchmarking results.

Parameters:

results : Dict[str, pd.DataFrame]: The results dictionary returned by benchmark_models.
palette : str, optional: The seaborn color palette to use for the visualization. Defaults to "viridis".
plot_block : bool, optional: Whether the generated matplotlib chart blocks the execution flow. Defaults to True.

Returns:

The list of matplotlib axes objects with the generated plots.

Return type:

Any
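A minimal usage sketch for visualize_benchmark, based only on the signature documented above. `plot_benchmark` is a hypothetical wrapper; `results` is assumed to be the dictionary returned by benchmark_models, and plot_block=False is chosen here on the assumption that the caller does not want the chart to block execution (for example in a script or test).

```python
from typing import Any


def plot_benchmark(results: dict[str, Any], palette: str = "viridis") -> Any:
    """Plot benchmark results without blocking execution (hypothetical wrapper)."""
    # Deferred import: tempor is only required when the function is called.
    from tempor.benchmarks.benchmark import visualize_benchmark

    # Returns the matplotlib axes objects for the generated plots.
    return visualize_benchmark(results, palette=palette, plot_block=False)
```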