
PYTHON PROGRAMMING & DATA SCIENCE

CROSS VALIDATION
Cross_Validation
If our algorithm works well with the training dataset but not well with the test dataset, then such a problem is called Overfitting.
To overcome overfitting, we use a technique called Cross-Validation.
Cross_Validation
What is Cross Validation?
 Cross-validation is a statistical method used to estimate the performance
(or accuracy) of machine learning models.
 It is used to protect against overfitting in a predictive model, particularly in a
case where the amount of data may be limited.
 In cross-validation, we make a fixed number of folds (or partitions) of the
data, run the analysis on each fold, and then average the overall error
estimate.
(or)
Cross-validation is a technique in which we train our model using the subset of
the data-set and then evaluate using the complementary subset of the data-
set.
Cross_Validation
Basic Steps of Cross Validation
 The basic steps of cross-validation are:
1. Reserve a subset of the dataset as a validation set.
2. Train the model using the training dataset.
3. Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.
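A minimal sketch of these three steps, assuming scikit-learn with a LogisticRegression model and a synthetic dataset (both are illustrative assumptions, not part of the slides):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Illustrative synthetic data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
# Step 1: reserve a subset of the dataset as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
# Step 2: train the model using the training dataset
model = LogisticRegression()
model.fit(X_train, y_train)
# Step 3: evaluate model performance using the validation set
print("Validation accuracy:", model.score(X_val, y_val))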
Cross_Validation
Types of Cross Validation
There are 2 categories of cross validation methods. Those are:
1. Non-exhaustive and
2. Exhaustive Methods
Non-exhaustive Methods:
Non-exhaustive cross validation methods do not compute all ways of
splitting the original data.
Some of the important Non-exhaustive Methods are:
1. Holdout Method
2. K-Fold Cross-Validation
3. Stratified K-Fold Cross-Validation
Cross_Validation
1. Hold Out method
This is the simplest evaluation method and is widely used in Machine
Learning projects.
Here the entire dataset (population) is divided into 2 sets – a train set and a test set. The data can be split 70-30, 60-40, 75-25, 80-20, or even 50-50 depending on the use case.
 As a rule, the proportion of training data should be larger than that of the test data.
The data split happens randomly, and we can't be sure which data ends up in the train and test buckets unless we specify random_state. This can lead to extremely high variance: every time the split changes, the accuracy also changes.
Cross_Validation
1. Hold Out method
One of the major advantages of this method is that it is computationally
inexpensive compared to other cross-validation techniques.
There are some drawbacks to this method:
1. In the Hold Out method, the test error rates are highly variable (high variance) and depend entirely on which observations end up in the training set and the test set.
2. Only a part of the data is used to train the model (high bias), which is not a good idea when the data is not huge, and this will lead to overestimation of the test error.
Cross_Validation
Implementation of the Hold Out method in Python
from sklearn.model_selection import train_test_split
X = [10,20,30,40,50,60,70,80,90,100]
train, test = train_test_split(X, test_size=0.3, random_state=1)
print("Train:", train, "Test:", test)
Output
Train: [50, 10, 40, 20, 80, 90, 60]
Test: [30, 100, 70]
Here, random_state is the seed used for reproducibility.
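To see the variance issue described on the previous slide, the split can be repeated with different seeds and the resulting scores compared. A small sketch, again assuming scikit-learn with a LogisticRegression model and synthetic data (illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
# Different random splits can give noticeably different hold-out scores
for seed in (1, 2, 3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    score = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print("random_state =", seed, "test accuracy =", round(score, 3))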
Cross_Validation
2. K-Fold Cross-Validation
In this resampling technique, the whole data is divided into k sets of almost
equal sizes.
The first set is selected as the test set and the model is trained on the remaining k-1 sets. The test error rate is then calculated by evaluating the fitted model on the test set.
In the second iteration, the 2nd set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated again. This process continues for all the k sets.
The mean of the errors from all the iterations is taken as the CV test error estimate.
Cross_Validation
2. K-Fold Cross-Validation
Typically, K-fold Cross Validation is performed using k=5 or k=10 as these
values have been empirically shown to yield test error estimates that neither
have high bias nor high variance.
The major disadvantage of this method is that the model has to be trained from scratch k times, making it more computationally expensive than the Hold Out method, though still cheaper than the Leave One Out method.
Cross_Validation
Implementation of K-Fold Cross-Validation in Python
from sklearn.model_selection import KFold
X = ['a','b','c','d','e','f']
kf = KFold(n_splits=3, shuffle=False, random_state=None)
for train, test in kf.split(X):
    print("Train:", train, "Test:", test)
Output
Train: [2 3 4 5]
Test: [0 1]
Train: [0 1 4 5]
Test: [2 3]
Train: [0 1 2 3]
Test: [4 5]
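The splitter above only yields the train and test indices. To obtain the CV test error estimate described earlier (the mean score over all k folds), the model is fitted and scored once per fold; a sketch using cross_val_score, assuming scikit-learn with a LogisticRegression model and a synthetic dataset (illustrative assumptions):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# One accuracy score per fold; their mean is the CV estimate
scores = cross_val_score(LogisticRegression(), X, y, cv=kf)
print("Fold scores:", np.round(scores, 3))
print("CV accuracy estimate:", scores.mean())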
Cross_Validation
3. Stratified K-Fold Cross-Validation
This is a slight variation of K-Fold Cross Validation that uses ‘stratified sampling’ instead of ‘random sampling.’
Suppose our data contains reviews for a cosmetic product used by both the
male and female population.
When we perform random sampling to split the data into train and test sets,
there is a possibility that most of the data representing males is not
represented in training data but might end up in test data. When we train the
model on sample training data that is not a correct representation of the
actual population, the model will not predict the test data with good accuracy.
This is where Stratified Sampling comes to the rescue. Here the data is split
in such a way that it represents all the classes from the population.
Cross_Validation
3. Stratified K-Fold Cross-Validation
Let’s consider the above example which has a cosmetic product review of
1000 customers out of which 60% is female and 40% is male. I want to split the
data into train and test data in proportion (80:20). 80% of 1000 customers will
be 800 which will be chosen in such a way that there are 480 reviews
associated with the female population and 320 representing the male
population. In a similar fashion, 20% of 1000 customers will be chosen for the
test data ( with the same female and male representation).

This is exactly what Stratified K-Fold CV does: it creates K folds while preserving the percentage of samples for each class. This solves the problem of random sampling described above.
Cross_Validation
Implementation of Stratified K-Fold Cross-Validation in Python
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y = np.array([0,0,1,0,1,1])
skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output
Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]
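A short comparison sketch (reusing the same X and y as above) shows why stratification matters here: plain K-Fold can produce test folds containing only one class, while Stratified K-Fold keeps both classes in every fold. The comparison itself is an illustrative addition; np.bincount simply counts how many samples of each class fall into the test fold:
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y = np.array([0,0,1,0,1,1])
for name, splitter in [("KFold", KFold(n_splits=3)), ("StratifiedKFold", StratifiedKFold(n_splits=3))]:
    print(name)
    for train_index, test_index in splitter.split(X, y):
        # Counts of class 0 and class 1 in the test fold
        print("  test fold class counts:", np.bincount(y[test_index], minlength=2))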
Cross_Validation
Exhaustive Methods
Exhaustive cross validation methods learn and test on all possible ways to divide the original sample into a training and a validation set.
Some of the important Exhaustive Methods are:
1. Leave-P-Out Cross-Validation
2. Leave-One-Out Cross-Validation
Leave One Out Cross-Validation (LOOCV)
In this method, we divide the data into train and test sets – but with a twist. Instead of dividing the data into 2 subsets, we select a single observation as test data, everything else is labeled as training data, and the model is trained. Then the 2nd observation is selected as test data and the model is trained on the remaining data. This process is repeated for every observation in the dataset, and the resulting errors are averaged to get the final estimate.
Cross_Validation
Implementation of Leave One Out Cross-Validation in Python
from sklearn.model_selection import LeaveOneOut
X = [10,20,30,40,50,60,70,80,90,100]
l = LeaveOneOut()
for train, test in l.split(X):
    print("%s %s" % (train, test))
Output
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
This output clearly shows how LOOCV keeps one observation aside as test
data and all the other observations go to train data.
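As with K-Fold, the per-observation scores are averaged to get the final estimate; with n observations the model is fitted n times. A sketch using cross_val_score with LeaveOneOut, assuming scikit-learn with a LogisticRegression model and a synthetic dataset (illustrative assumptions):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
X, y = make_classification(n_samples=30, n_features=4, random_state=0)
# One fit and one score per observation: 30 fits in total
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("Number of fits:", len(scores))
print("LOOCV accuracy estimate:", scores.mean())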
