UNIT4 Cross Validation
This document discusses different types of cross-validation techniques used in machine learning to prevent overfitting. It describes holdout validation, k-fold cross-validation, stratified k-fold cross-validation, and leave-one-out cross-validation. Examples of implementing each technique in Python are also provided.
PYTHON PROGRAMMING & DATA SCIENCE
CROSS VALIDATION

If our algorithm works well with the training dataset but performs poorly on the test dataset, the problem is called overfitting.
To overcome overfitting, we use a technique called Cross-Validation.

What is Cross Validation?
Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models. It protects against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, we make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate. In other words, cross-validation is a technique in which we train our model on a subset of the dataset and then evaluate it on the complementary subset.

Basic Steps of Cross Validation
The basic steps of cross-validation are:
1. Reserve a subset of the dataset as a validation set.
2. Train the model using the training dataset.
3. Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.

Types of Cross Validation
There are two categories of cross-validation methods:
1. Non-exhaustive Methods
2. Exhaustive Methods

Non-exhaustive Methods: Non-exhaustive cross-validation methods do not compute all possible ways of splitting the original data. Some of the important non-exhaustive methods are:
1. Holdout Method
2. K-Fold Cross-Validation
3. Stratified K-Fold Cross-Validation

1. Hold Out Method
This is the simplest evaluation method and is widely used in Machine Learning projects. Here the entire dataset (population) is divided into two sets: a train set and a test set. The data can be split 70-30, 60-40, 75-25, 80-20, or even 50-50 depending on the use case. As a rule, the proportion of training data should be larger than the test data.
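As an illustration of the hold-out workflow, here is a minimal sketch that trains a model on the train set and evaluates it on the held-out test set (the iris dataset, LogisticRegression, and the 80-20 split are illustrative assumptions, not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# illustrative data and model choices
X, y = load_iris(return_X_y=True)

# 80-20 hold-out split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                                 # train on the training set
print("Hold-out accuracy:", model.score(X_test, y_test))    # evaluate on the held-out test set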
The data split happens randomly, and we can't be sure which data ends up in the train and test buckets unless we specify random_state. This can lead to extremely high variance: every time the split changes, the accuracy also changes.

One of the major advantages of this method is that it is computationally inexpensive compared to other cross-validation techniques. There are some drawbacks to this method:
1. In the Hold Out method, the test error rates are highly variable (high variance); they depend entirely on which observations end up in the training set and which end up in the test set.
2. Only a part of the data is used to train the model (high bias), which is not a good idea when data is scarce, and this leads to an overestimation of the test error.

Implementation of Hold Out method in Python

from sklearn.model_selection import train_test_split

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
train, test = train_test_split(X, test_size=0.3, random_state=1)
print("Train:", train, "Test:", test)

Output
Train: [50, 10, 40, 20, 80, 90, 60] Test: [30, 100, 70]

Here, random_state is the seed used for reproducibility.

2. K-Fold Cross-Validation
In this resampling technique, the whole data is divided into k sets of almost equal size. The first set is selected as the test set and the model is trained on the remaining k-1 sets. The test error rate is then calculated by evaluating the fitted model on the test set. In the second iteration, the 2nd set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated again. This process continues for all k sets. The mean of the errors from all iterations is reported as the CV test error estimate.

Typically, K-Fold Cross-Validation is performed with k=5 or k=10, as these values have been shown empirically to yield test error estimates that suffer neither from high bias nor from high variance. The major disadvantage of this method is that the model has to be trained from scratch k times, so it is computationally more expensive than the Hold Out method, though cheaper than the Leave One Out method.

Implementation of K-Fold Cross-Validation in Python

from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f"]
kf = KFold(n_splits=3, shuffle=False, random_state=None)
for train, test in kf.split(X):
    print("Train:", train, "Test:", test)

Output
Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]

3. Stratified K-Fold Cross-Validation
This is a slight variation of K-Fold Cross-Validation that uses 'stratified sampling' instead of 'random sampling'. Suppose our data contains reviews for a cosmetic product used by both the male and female population. When we perform random sampling to split the data into train and test sets, there is a possibility that most of the data representing males is not represented in the training data but ends up in the test data. When we train the model on training data that is not a correct representation of the actual population, the model will not predict the test data with good accuracy. This is where stratified sampling comes to the rescue: the data is split in such a way that it represents all the classes in the population.

Let's consider the above example, with cosmetic product reviews from 1000 customers, of which 60% are female and 40% are male.
We want to split the data into train and test sets in an 80:20 proportion. 80% of 1000 customers is 800, chosen so that 480 reviews come from the female population and 320 from the male population. In a similar fashion, the remaining 20% (200 customers) are chosen for the test data, with the same female and male representation (120 female and 80 male).
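A minimal sketch of such a proportional split, assuming scikit-learn's train_test_split with its stratify option (the 600/400 labels below simply mirror the worked example above):

import numpy as np
from sklearn.model_selection import train_test_split

# 1000 customers: 600 female (label 1) and 400 male (label 0), as in the example above
y = np.array([1] * 600 + [0] * 400)
X = np.arange(1000).reshape(-1, 1)     # dummy feature matrix, one row per customer

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# stratify=y preserves the 60:40 ratio: expect 480/320 in train and 120/80 in test
print("Train female/male:", (y_train == 1).sum(), "/", (y_train == 0).sum())
print("Test  female/male:", (y_test == 1).sum(), "/", (y_test == 0).sum())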
This is exactly what stratified K-Fold cross-validation does: it creates K folds while preserving the percentage of samples from each class, which solves the problem with random sampling described above.
Implementation of Stratified K-Fold Cross-Validation in Python

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 0, 1, 1])
skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output
Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]

Note that each test fold contains one sample from class 0 and one from class 1, preserving the class proportions of y.

Exhaustive Methods
Exhaustive cross-validation methods train and test on all possible ways of dividing the original sample into a training set and a validation set.
1. Leave-P-Out Cross-Validation
2. Leave One Out Cross-Validation

Leave One Out Cross-Validation
In this method, we divide the data into train and test sets, but with a twist. Instead of dividing the data into two subsets, we select a single observation as the test data, label everything else as training data, and train the model. Then the 2nd observation is selected as the test data and the model is trained on the remaining data. This process is repeated for every observation in the dataset.

Implementation of Leave One Out Cross-Validation in Python

from sklearn.model_selection import LeaveOneOut

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

Output
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]

This output clearly shows how LOOCV keeps one observation aside as test data while all the other observations go into the training data.
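The exhaustive category above also lists Leave-P-Out cross-validation, for which no example is given; a minimal sketch using scikit-learn's LeavePOut (the choice of p=2 and the tiny dataset are illustrative assumptions) could look like this:

from sklearn.model_selection import LeavePOut

X = [10, 20, 30, 40]          # small dataset so the number of splits stays readable
lpo = LeavePOut(p=2)          # every possible pair of observations becomes the test set
for train, test in lpo.split(X):
    print("Train:", train, "Test:", test)
# with n=4 observations and p=2 there are C(4,2) = 6 train/test splits

With p=1, Leave-P-Out reduces to the Leave One Out method shown above.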