tempor.benchmarks.benchmark module

The main benchmarking module.

tempor.benchmarks.benchmark.print_score(mean: Series, std: Series) → Series

Print the mean and standard deviation of a metric in a human-readable format.
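The exact formatting rules are not stated on this page. As a rough illustration of a "MEAN +/- STDDEV" rendering, the sketch below joins two pandas Series element-wise. The helper name format_score, the rounding to three decimals, and the metric names and values are all assumptions made for the example; this is not the library's implementation.

```python
import pandas as pd


def format_score(mean: pd.Series, std: pd.Series, digits: int = 3) -> pd.Series:
    """Hypothetical stand-in for print_score: join mean and std as strings."""
    mean_str = mean.round(digits).astype(str)
    std_str = std.round(digits).astype(str)
    # Element-wise string concatenation; the index of `mean` is preserved.
    return mean_str + " +/- " + std_str


# Invented metric names and values, for illustration only.
mean = pd.Series({"accuracy": 0.91234, "f1_score": 0.8})
std = pd.Series({"accuracy": 0.01567, "f1_score": 0.025})

formatted = format_score(mean, std)
print(formatted)
```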

tempor.benchmarks.benchmark.benchmark_models(task_type: Literal[prediction.one_off.classification] | Literal[prediction.one_off.regression] | Literal[prediction.temporal.classification] | Literal[prediction.temporal.regression] | Literal[time_to_event] | Literal[treatments.one_off.classification] | Literal[treatments.one_off.regression] | Literal[treatments.temporal.classification] | Literal[treatments.temporal.regression], tests: list[tuple[str, Any]], data: PredictiveDataset, n_splits: int = 3, random_state: int = 0, horizons: list[float] | list[int] | list[Timestamp] | None = None, raise_exceptions: bool = False, silence_warnings: bool = True) → tuple[DataFrame, dict[str, DataFrame]]

Benchmark the performance of several algorithms.

Parameters:
task_type : PredictiveTaskType

The type of problem. Relevant for evaluating the downstream models with the correct metrics. The options are any of PredictiveTaskType.

tests : List[Tuple[str, Any]]

Tuples of the form (test_name: str, plugin: BasePredictor/Pipeline), one per model to benchmark.

data : dataset.PredictiveDataset

The evaluation dataset to use for cross-validation.

n_splits : int, optional

Number of splits used for cross-validation. Defaults to 3.

random_state : int, optional

Random seed. Defaults to 0.

horizons : Optional[data_typing.TimeIndex], optional

Time horizons for making predictions, if applicable to the task. Defaults to None.

raise_exceptions : bool, optional

Whether to raise exceptions during evaluation. If False, exceptions are swallowed and the evaluation continues; the exception count is reported in the "errors" column of the resulting dataframe. Defaults to False.

silence_warnings : bool, optional

Whether to silence warnings raised. Some dependencies (e.g. xgbse) may circumvent this and raise warnings regardless. Defaults to True.
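To make the raise_exceptions semantics concrete, here is a minimal, library-independent sketch of the "swallow and count" pattern described above. The function evaluate_folds and the fold callables are hypothetical and only illustrate the idea; they do not reproduce how tempor evaluates models internally.

```python
from typing import Callable, List, Sequence, Tuple


def evaluate_folds(
    folds: Sequence[Callable[[], float]],
    raise_exceptions: bool = False,
) -> Tuple[List[float], int]:
    """Run each fold; either re-raise failures or count them and carry on."""
    scores: List[float] = []
    errors = 0
    for run_fold in folds:
        try:
            scores.append(run_fold())
        except Exception:
            if raise_exceptions:
                raise  # Surface the first failure to the caller.
            errors += 1  # Swallow the failure and keep evaluating.
    return scores, errors


def failing_fold() -> float:
    raise ValueError("fold failed")


# One of the three hypothetical folds fails; the other two still produce scores.
scores, errors = evaluate_folds([lambda: 0.9, failing_fold, lambda: 0.8])
print(scores, errors)
```

With raise_exceptions=True the same call would instead propagate the ValueError from the failing fold.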

Returns:

The benchmarking results, given as (readable_dataframe: pd.DataFrame, results: Dict[str, pd.DataFrame]), where:
  • readable_dataframe: a dataframe with metric names as index and test names as columns, where the values are readable string representations of the evaluation metric, in the form MEAN +/- STDDEV.

  • results: a dictionary mapping the test name to a dataframe with metric names as index and ["mean", "stddev"] columns, where the values are the float mean and standard deviation for each metric.

Return type:

Tuple[pd.DataFrame, Dict[str, pd.DataFrame]]
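The results dictionary described above can be reshaped with ordinary pandas operations. The sketch below uses a hand-built dictionary of the documented shape (test name mapped to a dataframe with metric names as index and "mean" and "stddev" columns). The test names, metric names, and numbers are invented, and picking the highest mean assumes that higher is better for every metric, which will not hold for error-type metrics.

```python
import pandas as pd

# Hand-made stand-in for the `results` dict: test name -> dataframe with
# metric names as index and ["mean", "stddev"] columns. All values are invented.
results = {
    "model_a": pd.DataFrame(
        {"mean": [0.90, 0.85], "stddev": [0.02, 0.03]},
        index=["metric_1", "metric_2"],
    ),
    "model_b": pd.DataFrame(
        {"mean": [0.88, 0.91], "stddev": [0.01, 0.02]},
        index=["metric_1", "metric_2"],
    ),
}

# Collect the mean of each metric into one table: metrics as rows, tests as columns.
means = pd.DataFrame({name: df["mean"] for name, df in results.items()})

# For each metric, the test with the highest mean (assumes higher is better).
best = means.idxmax(axis=1)
print(means)
print(best)
```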

tempor.benchmarks.benchmark.visualize_benchmark(results: dict[str, DataFrame], palette: str = 'viridis', plot_block: bool = True) → Any

Visualize the benchmarking results.

Parameters:
results : Dict[str, pd.DataFrame]

The results dictionary returned by benchmark_models.

palette : str, optional

Seaborn color palette to use for the visualization. Defaults to "viridis".

plot_block : bool, optional

Whether displaying the generated matplotlib chart blocks the execution flow. Defaults to True.

Returns:

The list of matplotlib axes objects with the generated plots.

Return type:

Any
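As a usage sketch, the two functions on this page can be chained as follows. The wrapper benchmark_and_plot is hypothetical; the keyword arguments are taken from the signatures above, the task type string is one of the listed options, and the tests and data arguments are left to the caller because constructing plugins and datasets is outside the scope of this page. The import is done lazily so the sketch can be defined without tempor installed.

```python
from typing import Any, List, Tuple


def benchmark_and_plot(tests: List[Tuple[str, Any]], data: Any) -> Any:
    """Hypothetical helper chaining the two functions documented on this page."""
    # Lazy import: tempor is only needed when the helper is actually called.
    from tempor.benchmarks.benchmark import benchmark_models, visualize_benchmark

    readable, results = benchmark_models(
        task_type="prediction.one_off.classification",
        tests=tests,
        data=data,
        n_splits=3,
        random_state=0,
    )
    # Human-readable summary: metrics as rows, test names as columns.
    print(readable)
    # plot_block=False is passed so the chart should not hold up execution.
    return visualize_benchmark(results, palette="viridis", plot_block=False)
```

Passing plot_block=False is a choice suited to scripts and notebooks that continue after plotting; the default of True is kept by simply omitting the argument.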