Validation Curve using Scikit-learn

Validation curves are essential tools in machine learning for diagnosing model performance and understanding the impact of hyperparameters on model accuracy. This article will delve into the concept of validation curves, their importance, and how to implement them using Scikit-learn in Python.

What is a Validation Curve?

A validation curve is a graphical representation that shows the relationship between a model's performance and a specific hyperparameter. It helps in understanding how changes in hyperparameters affect the training and validation scores of a model. The curve typically plots the model performance metric (such as accuracy, F1-score, or mean squared error) on the y-axis and a range of hyperparameter values on the x-axis.

The table below summarizes how different combinations of training and validation scores characterize the estimator:

Training Score | Validation Score | Estimator is
Low            | Low              | Underfitting
High           | Low              | Overfitting
Low            | High             | Not possible

Importance of Validation Curves

Validation curves are crucial for several reasons:

  1. Hyperparameter Tuning: They help in selecting the optimal hyperparameter values that balance bias and variance.
  2. Diagnosing Overfitting and Underfitting: By analyzing the training and validation scores, one can identify whether the model is overfitting or underfitting.
  3. Model Improvement: They provide insights into how to improve the model by adjusting hyperparameters.

Understanding Bias and Variance

Before diving into the implementation, it's essential to understand the concepts of bias and variance:

  • Bias: Error due to overly simplistic models that do not capture the underlying patterns in the data (underfitting).
  • Variance: Error due to overly complex models that capture noise in the training data (overfitting).
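
To make this concrete, the short sketch below (not part of the original example) contrasts a high-bias and a high-variance model by comparing cross-validated training and validation scores for a shallow and an unrestricted decision tree. The dataset, the max_depth values, and the use of cross_validate are illustrative assumptions.

Python
# Illustrative sketch: a depth-1 tree tends toward high bias (underfitting),
# while an unrestricted tree tends toward high variance (overfitting).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=10, random_state=0)

for depth in (1, None):  # 1 -> very simple model, None -> fully grown tree
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_demo, y_demo, cv=5, return_train_score=True
    )
    print(f"max_depth={depth}: "
          f"train={scores['train_score'].mean():.2f}, "
          f"validation={scores['test_score'].mean():.2f}")

A large gap between the training and validation scores points to variance, while low scores on both point to bias.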

Implementing Validation Curves with Scikit-learn

Step 1: Import Required Libraries

First, we need to import the necessary libraries. We'll use Scikit-learn for model building and Matplotlib for plotting.

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import validation_curve

Step 2: Load the Dataset

For this example, we'll use the digits dataset from Scikit-learn.

Python
# Load the digits dataset
dataset = load_digits()
X, y = dataset.data, dataset.target

Step 3: Define the Hyperparameter Range

We'll define the range of hyperparameter values we want to evaluate. In this case, we'll vary the number of neighbors (n_neighbors) for the K-Nearest Neighbors (KNN) classifier.

Python
# Define the range for the hyperparameter
param_range = np.arange(1, 10, 1)

Step 4: Calculate Training and Validation Scores

We'll use the validation_curve function from Scikit-learn to calculate the training and validation scores for each value of the hyperparameter.

Python
# Calculate accuracy on training and test set using the validation curve
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    X, y,
    param_name="n_neighbors",
    param_range=param_range,
    cv=5,
    scoring="accuracy"
)
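
The returned arrays have one row per hyperparameter value and one column per cross-validation fold, so with nine values of n_neighbors and cv=5 both arrays have shape (9, 5). An optional sanity check:

Python
# Each row corresponds to one value in param_range, each column to one CV fold
print(train_scores.shape)  # (9, 5)
print(test_scores.shape)   # (9, 5)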

Step 5: Calculate Mean and Standard Deviation

Next, we'll calculate the mean and standard deviation of the training and validation scores.

Python
# Calculate mean and standard deviation of training scores
mean_train_score = np.mean(train_scores, axis=1)
std_train_score = np.std(train_scores, axis=1)

# Calculate mean and standard deviation of validation scores
mean_test_score = np.mean(test_scores, axis=1)
std_test_score = np.std(test_scores, axis=1)

Step 6: Plot the Validation Curve

Finally, we'll plot the validation curve using Matplotlib.

Python
# Plot mean accuracy scores for training and testing scores
plt.plot(param_range, mean_train_score, label="Training Score", color='b')
plt.plot(param_range, mean_test_score, label="Cross Validation Score", color='g')

# Plot the accuracy bands
plt.fill_between(param_range, mean_train_score - std_train_score, mean_train_score + std_train_score, alpha=0.2, color='blue')
plt.fill_between(param_range, mean_test_score - std_test_score, mean_test_score + std_test_score, alpha=0.2, color='green')

# Create the plot
plt.title("Validation Curve with KNN")
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()

Output:

[Figure: Validation Curve with KNN]
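
If you are on scikit-learn 1.3 or newer, the ValidationCurveDisplay class can compute and plot the same curve in a single call. The sketch below reuses X, y, and param_range from the steps above and assumes a sufficiently recent scikit-learn version.

Python
# Alternative plotting with ValidationCurveDisplay (scikit-learn >= 1.3)
from sklearn.model_selection import ValidationCurveDisplay

ValidationCurveDisplay.from_estimator(
    KNeighborsClassifier(),
    X, y,
    param_name="n_neighbors",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
)
plt.title("Validation Curve with KNN")
plt.show()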

Interpreting the Validation Curve

Interpreting the results of a validation curve can sometimes be tricky. Here are some key points to keep in mind:

  1. Underfitting: If both the training and validation scores are low, the model is likely underfitting. This means the model is too simple to capture the underlying patterns, or it is trained on too few informative features.
  2. Overfitting: If the training score is high and the validation score is low, the model is overfitting. This means the model is too complex and is capturing noise in the training data.
  3. Optimal Hyperparameter: The optimal hyperparameter value is typically where the validation score is highest and the gap between the training and validation scores is small; a quick way to locate this point programmatically is shown below.
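
As a rough, illustrative shortcut (reusing param_range and mean_test_score from the KNN example above), the hyperparameter value with the highest mean cross-validation score can be found with np.argmax; visual inspection of the curve should still confirm the choice.

Python
# Hyperparameter value with the highest mean cross-validation accuracy
best_idx = np.argmax(mean_test_score)
print("Best n_neighbors:", param_range[best_idx])
print("Mean CV accuracy:", mean_test_score[best_idx])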

Validation Curves with Machine Learning Models

Example 1: Validation Curve with Random Forest

Let's consider another example using the RandomForestClassifier and varying the number of trees (n_estimators).

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_classes=2, random_state=42)

param_range = np.arange(1, 250, 2)

# Calculate accuracy on training and test set using the validation curve
train_scores, test_scores = validation_curve(
    RandomForestClassifier(),
    X, y,
    param_name="n_estimators",
    param_range=param_range,
    cv=4,
    scoring="accuracy",
    n_jobs=-1
)

# Calculate mean and standard deviation of training scores across folds
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation of test scores across folds
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot mean accuracy scores for training and test scores
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

# Plot the accuracy bands for training and test scores
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.2, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.2, color="gainsboro")

plt.title("Validation Curve with Random Forest")
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.tight_layout()
plt.show()

Output:

[Figure: Validation Curve with Random Forest]

Example 2: Validation Curve with Ridge Regression

Python
from sklearn.model_selection import validation_curve
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt

X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Define the parameter range for alpha
param_range = np.logspace(-7, 3, 3)

# Compute the validation curve
train_scores, valid_scores = validation_curve(Ridge(), X, y, param_name="alpha",
                                              param_range=param_range, cv=5)

# Calculate mean and standard deviation of training and validation scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)

# Plotting the validation curve
plt.figure()
plt.title("Validation Curve with Ridge Regression")
plt.xlabel("Alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="darkorange", lw=2)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=2)
plt.semilogx(param_range, valid_scores_mean, label="Cross-validation score",
             color="navy", lw=2)
plt.fill_between(param_range, valid_scores_mean - valid_scores_std,
                 valid_scores_mean + valid_scores_std, alpha=0.2,
                 color="navy", lw=2)
plt.legend(loc="best")
plt.show()

Output:

[Figure: Validation Curve with Ridge Regression]
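
Because no scoring argument is passed, validation_curve falls back to the estimator's default score, which for Ridge is R², so the y-axis above shows R². If you prefer an error metric, the call can be repeated with an explicit scorer; the sketch below assumes scoring="neg_mean_squared_error" (scores are negated MSE, so values closer to zero are better).

Python
# Optional: recompute the curve with negated mean squared error as the metric
train_scores_mse, valid_scores_mse = validation_curve(
    Ridge(), X, y,
    param_name="alpha",
    param_range=param_range,
    cv=5,
    scoring="neg_mean_squared_error",
)
# Plot -train_scores_mse and -valid_scores_mse to display MSE directly.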

Conclusion

Validation curves are powerful tools for diagnosing model performance and understanding the impact of hyperparameters. By using Scikit-learn together with a plotting library such as Matplotlib, you can create and interpret validation curves to improve your machine learning models. Understanding and utilizing validation curves will help you build models that generalize well to unseen data, ultimately leading to more robust and accurate predictions.

