Balance model complexity and cross-validated score#
This example demonstrates how to balance model complexity and cross-validated score by
finding an accuracy within 1 standard deviation of the best accuracy score while
minimising the number of PCA components [1]. It uses GridSearchCV with a custom refit
callable to select the optimal model.
The figure shows the trade-off between cross-validated score and the number
of PCA components. The balanced case is when n_components=10 and accuracy=0.88,
which falls within 1 standard deviation of the best accuracy score.
[1] Hastie, T., Tibshirani, R., & Friedman, J. (2001). Model Assessment and Selection. In The Elements of Statistical Learning (pp. 219-260). New York, NY, USA: Springer New York Inc.
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
import matplotlib.pyplot as plt
import numpy as np
import polars as pl
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import Pipeline
Introduction#
When tuning hyperparameters, we often want to balance model complexity and performance. The “one-standard-error” rule is a common approach: select the simplest model whose performance is within one standard error of the best model’s performance. This helps to avoid overfitting by preferring simpler models when their performance is statistically comparable to more complex ones.
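To make the rule concrete, here is a minimal sketch on made-up numbers (the model names, scores, and standard error below are purely illustrative, not from this example):

```python
# Hypothetical cross-validated accuracies for models of increasing
# complexity, plus an assumed standard error for the best model's score.
scores = [("simple", 0.88), ("medium", 0.92), ("complex", 0.93)]
best_score = max(s for _, s in scores)  # 0.93
threshold = best_score - 0.02  # one standard error below the best

# Pick the first (i.e. simplest) model that clears the threshold.
chosen = next(name for name, s in scores if s >= threshold)
print(chosen)  # "medium": statistically comparable to "complex", but simpler
```

The helper functions defined below apply the same logic to the `cv_results_` produced by a grid search.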
Helper functions#
We define two helper functions:
1. lower_bound: calculates the threshold for acceptable performance
(best score - 1 std)
2. best_low_complexity: selects the model with the fewest PCA components
whose score still exceeds this threshold
def lower_bound(cv_results):
    """
    Calculate the lower bound within 1 standard deviation
    of the best `mean_test_scores`.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`

    Returns
    -------
    float
        Lower bound within 1 standard deviation of the
        best `mean_test_score`.
    """
    best_score_idx = np.argmax(cv_results["mean_test_score"])

    return (
        cv_results["mean_test_score"][best_score_idx]
        - cv_results["std_test_score"][best_score_idx]
    )
def best_low_complexity(cv_results):
    """
    Balance model complexity with cross-validated score.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        See attribute cv_results_ of `GridSearchCV`.

    Returns
    -------
    int
        Index of the model that has the fewest PCA components
        while its test score is within 1 standard deviation of the best
        `mean_test_score`.
    """
    threshold = lower_bound(cv_results)
    candidate_idx = np.flatnonzero(cv_results["mean_test_score"] >= threshold)
    best_idx = candidate_idx[
        cv_results["param_reduce_dim__n_components"][candidate_idx].argmin()
    ]
    return best_idx
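To see the selection logic in isolation, we can run the same computations on a toy stand-in for `cv_results_` (the keys mirror those used by the grid search; the numbers are made up for illustration):

```python
import numpy as np

# A toy stand-in for GridSearchCV's cv_results_ (illustrative numbers only).
cv_results = {
    "mean_test_score": np.array([0.80, 0.87, 0.90, 0.91]),
    "std_test_score": np.array([0.03, 0.02, 0.02, 0.02]),
    "param_reduce_dim__n_components": np.array([6, 10, 20, 45]),
}

# lower_bound: best mean score minus its standard deviation
best_idx = np.argmax(cv_results["mean_test_score"])
threshold = (
    cv_results["mean_test_score"][best_idx]
    - cv_results["std_test_score"][best_idx]
)  # 0.91 - 0.02 = 0.89

# best_low_complexity: fewest components among models above the threshold
candidates = np.flatnonzero(cv_results["mean_test_score"] >= threshold)
chosen = candidates[
    cv_results["param_reduce_dim__n_components"][candidates].argmin()
]
print(chosen)  # 2 -> 20 components are chosen over the best model's 45
```

Here the model with 45 components has the best mean score (0.91), but the one with 20 components scores 0.90, within one standard deviation, so the simpler model wins.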
Set up the pipeline and parameter grid#
We create a pipeline with two steps:
1. Dimensionality reduction using PCA
2. Classification using LogisticRegression
We’ll search over different numbers of PCA components to find the optimal complexity.
pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LogisticRegression(random_state=42, C=0.01, max_iter=1000)),
    ]
)
param_grid = {"reduce_dim__n_components": [6, 8, 10, 15, 20, 25, 35, 45, 55]}
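As a quick sanity check, the same two-step pipeline can be fit on synthetic data (the toy dataset, its sizes, and the fixed n_components=10 below are illustrative choices, not part of this example):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data just to exercise the PCA -> LogisticRegression steps.
X_toy, y_toy = make_classification(n_samples=200, n_features=30, random_state=0)

toy_pipe = Pipeline(
    [
        ("reduce_dim", PCA(n_components=10, random_state=42)),
        ("classify", LogisticRegression(random_state=42, C=0.01, max_iter=1000)),
    ]
)
toy_pipe.fit(X_toy, y_toy)
print(toy_pipe.score(X_toy, y_toy))  # training accuracy on the toy data
```

The grid search below varies only the `reduce_dim__n_components` step of this pipeline.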
Perform the search with GridSearchCV#
We use GridSearchCV
with our custom best_low_complexity
function as the refit
parameter. This function will select the model with the fewest PCA components that
still performs within one standard deviation of the best model.
grid = GridSearchCV(
    pipe,
    # Use a non-stratified CV strategy to make sure that the inter-fold
    # standard deviation of the test scores is informative.
    cv=ShuffleSplit(n_splits=30, random_state=0),
    n_jobs=1,  # increase this on your machine to use more physical cores
    param_grid=param_grid,
    scoring="accuracy",
    refit=best_low_complexity,
    return_train_score=True,
)
Load the digits dataset and fit the model#
X, y = load_digits(return_X_y=True)
grid.fit(X, y)