K Fold Cross Validation

Topics covered: under-fitting, over-fitting, and K fold cross validation.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, which refers to the number of groups that a given data sample is to be split into.
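A minimal sketch of how k maps to fold splitting, assuming scikit-learn and a toy array standing in for the 100 math questions used in the figures below:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the "100 math questions": indices 0..99.
X = np.arange(100).reshape(-1, 1)

# k = 5 splits the 100 samples into 5 folds of 20.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: train={len(train_idx)} samples, test={len(test_idx)} samples")
```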
[Figure: 100 math questions split into five groups of 20 for training.]
Option-1: Re-Substitution
[Figure: the model is trained on all 100 math questions and then tested on a few questions drawn from those same 100, so every test item was already seen during training.]
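A minimal sketch of re-substitution, assuming scikit-learn and synthetic data; because the test items come from the training set itself, the score is optimistically biased:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Re-substitution: evaluate on the very data used for training.
# A flexible model can score perfectly here while generalizing poorly.
print("Re-substitution accuracy:", model.score(X, y))
```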
Option-2: Holdout
[Figure: 80 of the 100 math questions are used for training; the remaining 20 questions, unseen during training, are used for testing.]
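A minimal holdout sketch under the same assumptions (scikit-learn, synthetic data), reserving 20 of the 100 samples for testing:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)

# Holdout: 80 samples for training, 20 held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```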
Option-3: K Fold Cross Validation
Here, K=5.
Underfitting
[Figure: at test time, an under-fitted model answers "It is not ball" even for a clear example of a ball — it fails on easy cases.]

Overfitting
[Figure: at test time, an over-fitted model also answers "It is not ball" for a ball that differs slightly from its training examples — it has memorized the training data rather than the concept.]
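To make the cartoon concrete, here is an illustrative sketch (scikit-learn, synthetic data; the depth settings are arbitrary choices for illustration): a shallow tree tends to underfit, while an unconstrained tree tends to overfit, which shows up as a gap between training and cross-validation accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_informative=10, random_state=0)

# A depth-1 tree tends to underfit; an unconstrained tree tends to overfit.
for name, model in [("underfit (depth=1)", DecisionTreeClassifier(max_depth=1, random_state=0)),
                    ("overfit (no limit)", DecisionTreeClassifier(random_state=0))]:
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: train={train_acc:.2f}, cv={cv_acc:.2f}")
```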
Let's take a generalized K value. If K=5, the given dataset is split into 5 folds and we run Train and Test 5 times. During each run, one fold is held out for testing and the rest are used for training; the pictorial representation below shows the flow for the defined fold size.

[Figure: 5 folds; in each of the 5 runs, 4 folds are used for training and 1 fold for testing.]

In K=5, Training uses K-1 = 4 folds and Test uses the remaining 1 fold.
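A minimal sketch of this K=5 flow, assuming scikit-learn and 100 synthetic samples; each iteration trains on 4 folds (80 samples) and tests on the held-out fold (20 samples):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# K = 5: each iteration trains on 4 folds and tests on the remaining 1 fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```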
Thumb Rules Associated with K Fold
Now, we will discuss a few thumb rules to keep in mind while playing with K fold:
• K should always be >= 2 and <= the number of records; the upper extreme is Leave-One-Out Cross Validation (LOOCV), sketched after this list.
• If K=2, there are just 2 iterations.
• If K = the number of records in the dataset, then each iteration uses 1 record for testing and n-1 for training.
• K=10 is a commonly used, well-optimized value for data of a good size.
• If the K value is too large, there will be less variance across the training sets, which limits the difference in model performance across the iterations.
• The number of folds is inversely related to the size of the dataset: if the dataset is too small, the number of folds can be increased so that each training set remains large.
• Larger values of K increase the running time of the cross-validation process.
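As a sketch of the LOOCV extreme noted above (assuming scikit-learn), LeaveOneOut behaves like KFold with n_splits equal to the number of records:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=30, random_state=0)  # kept small so LOOCV stays cheap

# LeaveOneOut == KFold(n_splits=len(X)): 30 iterations,
# each testing on exactly 1 record and training on the other n-1 = 29.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Number of iterations:", len(scores))
print("Mean accuracy:", scores.mean())
```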
Please remember K fold cross validation for the following purposes in the ML stream (a parameter-tuning sketch follows this list):
1. Model selection
2. Parameter tuning
3. Feature selection
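For instance, parameter tuning commonly wraps K fold inside a grid search. A minimal sketch, assuming scikit-learn and an arbitrary max_depth grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each max_depth candidate is scored with 5-fold cross validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 3, 5, None]},
                      cv=5)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV score:", search.best_score_)
```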
So far, we have discussed K fold and how it is implemented; let's do some hands-on work now.