What is Cross-Validation?
The main purpose of cross validation is to prevent overfitting, which occurs when a
model is trained too well on the training data and performs poorly on new, unseen
data. By evaluating the model on multiple validation sets, cross validation provides a
more realistic estimate of the model’s generalization performance, i.e., its ability to
perform well on new, unseen data.
Types of Cross-Validation
There are several types of cross-validation techniques, including holdout validation, leave-one-out cross-validation (LOOCV), k-fold cross-validation, and stratified cross-validation. The choice of technique depends on the size and nature of the data, as well as the specific requirements of the modeling problem.
1. Holdout Validation
In holdout validation, we train the model on 50% of the given dataset and use the remaining 50% for testing. It is a simple and quick way to evaluate a model. The major drawback is that training uses only half of the dataset; the other half may contain important information that the model never sees, which can lead to higher bias.
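A minimal plain-Python sketch of the 50/50 holdout split described above (the function name and defaults are illustrative, not from any particular library):

```python
import random

def holdout_split(data, test_fraction=0.5, seed=0):
    # Shuffle indices so the split is random but reproducible via the seed
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = int(len(data) * (1 - test_fraction))
    train = [data[i] for i in indices[:cut]]
    test = [data[i] for i in indices[cut:]]
    return train, test

# 50/50 split of ten points, as in the description above
train, test = holdout_split(list(range(10)))
print(len(train), len(test))  # 5 5
```

Every point lands in exactly one of the two sets, which is why any signal in the held-out half is unavailable during training.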
2. Leave-One-Out Cross-Validation (LOOCV)
In this method, we train on the whole dataset but leave out a single data point for testing, iterating so that every data point is held out once. In LOOCV, the model is trained on n−1 samples and tested on the one omitted sample, repeating this process for each of the n data points.
An advantage of this method is that it makes use of all data points, so it has low bias.
The major drawback is that it leads to higher variance in the test estimate, since each test set is a single data point; if that point is an outlier, the estimate can swing widely. Another drawback is execution time, as the procedure iterates once per data point in the dataset.
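The iteration pattern can be sketched in a few lines of plain Python (the generator below is illustrative; it yields index splits rather than fitting an actual model):

```python
def leave_one_out(n):
    # Yield (train_indices, test_index): each of the n samples is the test set exactly once
    for i in range(n):
        yield [j for j in range(n) if j != i], i

# With n = 4 data points we get 4 iterations, each training on n - 1 = 3 samples
splits = list(leave_one_out(4))
for train, test in splits:
    print(train, test)
```

The loop makes the cost visible: a dataset of n points requires n separate model fits.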
3. Stratified Cross-Validation
Stratified cross-validation is a variation of k-fold cross-validation in which each fold preserves the class proportions of the full dataset, which is particularly useful for imbalanced data. It works as follows:
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one fold is used for testing, and the remaining folds
are used for training.
3. The process is repeated k times, with each fold serving as the test set exactly
once.
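The steps above can be sketched in plain Python by grouping indices per class and dealing them round-robin into folds (a simplified stand-in for what a library routine such as scikit-learn's StratifiedKFold does):

```python
from collections import defaultdict

def stratified_kfold(labels, k):
    # Assign sample indices to k folds, preserving class proportions per fold
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        # Deal each class's indices round-robin so every fold gets its share
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

labels = [0] * 6 + [1] * 4   # 60/40 class balance
folds = stratified_kfold(labels, k=2)
# Each fold holds 3 samples of class 0 and 2 of class 1, matching the 60/40 ratio
```

A plain k-fold split of these ten samples could easily put all four class-1 samples in one fold; the stratified version cannot.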
4. K-Fold Cross-Validation
In k-fold cross-validation, we split the dataset into k subsets (known as folds), then train on k−1 of the subsets and leave the remaining one out for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Note: A value of k = 10 is commonly suggested, because a low value of k makes the procedure resemble simple holdout validation, while a high value of k approaches LOOCV.
The listing below shows an example of the training and evaluation subsets generated in k-fold cross-validation. Here we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [0-4] for testing, [5-24] for training), while in the second iteration we use the second 20-percent subset for evaluation and the rest for training ([5-9] testing, [0-4] and [10-24] training), and so on.
Total instances: 25
Value of k: 5

Iteration 1: Train = [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], Test = [ 0  1  2  3  4]
Iteration 2: Train = [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24], Test = [ 5  6  7  8  9]
Iteration 3: Train = [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24], Test = [10 11 12 13 14]
Iteration 4: Train = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24], Test = [15 16 17 18 19]
Iteration 5: Train = [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19], Test = [20 21 22 23 24]
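These splits can be generated with a short plain-Python sketch using contiguous folds and no shuffling (the same splits scikit-learn's KFold produces with shuffle=False):

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds and yield (train, test) pairs
    fold_size = n // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n) if j not in test]
        yield train, test

# Reproduce the 25-instance, k = 5 example above
for i, (train, test) in enumerate(kfold_indices(25, 5), start=1):
    print(f"Iteration {i}: test fold = {test}")
```

In practice the data is usually shuffled before folding so that any ordering in the dataset does not bias the folds.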
Advantages of k-fold cross-validation over LOOCV:
1. It runs much faster, because k-fold fits the model only k times while LOOCV fits it once per data point (n times, with n typically much larger than k).
2. It is simpler to examine the detailed results of the testing process.
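The speed difference is easy to quantify with some hypothetical numbers:

```python
n = 1000  # number of data points (hypothetical)
k = 10    # number of folds
loocv_fits = n  # LOOCV fits one model per held-out point
kfold_fits = k  # k-fold fits one model per fold
print(loocv_fits // kfold_fits)  # 100: k-fold needs 100x fewer model fits here
```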
Advantages of cross-validation:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by
providing a more robust estimate of the model’s performance on unseen
data.
2. Model Selection: Cross validation can be used to compare different models
and select the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the
hyperparameters of a model, such as the regularization parameter, by
selecting the values that result in the best performance on the validation set.
4. Data Efficiency: Cross validation allows the use of all the available data for
both training and validation, making it more data-efficient than a single
train/test split.
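To make the model-selection and tuning points concrete, here is a dependency-free sketch of a cross-validated scoring loop; the `fit`, `predict`, and `metric` callables are illustrative stand-ins, not a real library API (scikit-learn's cross_val_score plays this role in practice):

```python
def cv_score(xs, ys, k, fit, predict, metric):
    # Average a metric over k contiguous folds (illustrative: no shuffling)
    n = len(xs)
    fold = n // k
    scores = []
    for i in range(k):
        test_idx = set(range(i * fold, (i + 1) * fold))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        train_y = [y for j, y in enumerate(ys) if j not in test_idx]
        model = fit(train_x, train_y)
        preds = [predict(model, xs[j]) for j in sorted(test_idx)]
        truth = [ys[j] for j in sorted(test_idx)]
        scores.append(metric(preds, truth))
    return sum(scores) / k

# Toy check: a mean predictor on constant targets should score a perfect 0 MSE
fit = lambda xs, ys: sum(ys) / len(ys)   # "model" is just the mean of y
predict = lambda model, x: model         # always predict that mean
mse = lambda p, t: sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
score = cv_score(list(range(10)), [1.0] * 10, k=5, fit=fit, predict=predict, metric=mse)
print(score)  # 0.0
```

Comparing such averaged scores across candidate models, or across hyperparameter values, is exactly how cross-validation supports model selection and tuning.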
Disadvantages: