UNIT4 Cross Validation
This document discusses different types of cross-validation techniques used in machine learning to prevent overfitting. It describes holdout validation, k-fold cross-validation, stratified k-fold cross-validation, and leave-one-out cross-validation. Examples of implementing each technique in Python are also provided.
PYTHON PROGRAMMING & DATA SCIENCE
CROSS VALIDATION

If our algorithm works well with the training dataset but performs poorly on the test dataset, the problem is called overfitting.
To overcome overfitting, we use a technique called Cross-Validation.

What is Cross Validation?
Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models. It protects against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, we make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate. In other words, cross-validation is a technique in which we train our model on a subset of the dataset and then evaluate it on the complementary subset.

Basic Steps of Cross Validation
The basic steps of cross-validation are:
1. Reserve a subset of the dataset as a validation set.
2. Train the model using the training dataset.
3. Evaluate model performance using the validation set. If the model performs well on the validation set, proceed to the next step; otherwise, check for issues.

Types of Cross Validation
There are two categories of cross-validation methods:
1. Non-exhaustive Methods
2. Exhaustive Methods

Non-exhaustive Methods: Non-exhaustive cross-validation methods do not compute all possible ways of splitting the original data. Some of the important non-exhaustive methods are:
1. Holdout Method
2. K-Fold Cross-Validation
3. Stratified K-Fold Cross-Validation

1. Hold Out Method
This is the simplest evaluation method and is widely used in Machine Learning projects. Here the entire dataset (population) is divided into two sets: a train set and a test set. The data can be split 70-30, 60-40, 75-25, 80-20, or even 50-50 depending on the use case. As a rule, the proportion of training data should be larger than the test data.
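As an illustration of the hold-out workflow, here is a minimal sketch that trains a model on the train set and evaluates it on the held-out test set (the iris dataset, LogisticRegression, and the 80-20 split are illustrative assumptions, not part of the original example):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# illustrative data and model choices
X, y = load_iris(return_X_y=True)

# 80-20 hold-out split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                                 # train on the training set
print("Hold-out accuracy:", model.score(X_test, y_test))    # evaluate on the held-out test set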
The data split happens randomly, and we can't be sure which data ends up in the train and test buckets unless we specify random_state. This can lead to extremely high variance: every time the split changes, the accuracy also changes.

One of the major advantages of this method is that it is computationally inexpensive compared to other cross-validation techniques. There are some drawbacks to this method:
1. In the Hold Out method, the test error rates are highly variable (high variance); they depend entirely on which observations end up in the training set and which end up in the test set.
2. Only a part of the data is used to train the model (high bias), which is not a good idea when data is scarce, and this leads to an overestimation of the test error.

Implementation of Hold Out method in Python

from sklearn.model_selection import train_test_split

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
train, test = train_test_split(X, test_size=0.3, random_state=1)
print("Train:", train, "Test:", test)

Output
Train: [50, 10, 40, 20, 80, 90, 60] Test: [30, 100, 70]

Here, random_state is the seed used for reproducibility.

2. K-Fold Cross-Validation
In this resampling technique, the whole data is divided into k sets of almost equal size. The first set is selected as the test set and the model is trained on the remaining k-1 sets. The test error rate is then calculated by evaluating the fitted model on the test set. In the second iteration, the 2nd set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated again. This process continues for all k sets. The mean of the errors from all iterations is reported as the CV test error estimate.

Typically, K-Fold Cross-Validation is performed with k=5 or k=10, as these values have been shown empirically to yield test error estimates that suffer neither from high bias nor from high variance. The major disadvantage of this method is that the model has to be trained from scratch k times, so it is computationally more expensive than the Hold Out method, though cheaper than the Leave One Out method.

Implementation of K-Fold Cross-Validation in Python

from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f"]
kf = KFold(n_splits=3, shuffle=False, random_state=None)
for train, test in kf.split(X):
    print("Train:", train, "Test:", test)

Output
Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]

3. Stratified K-Fold Cross-Validation
This is a slight variation of K-Fold Cross-Validation that uses 'stratified sampling' instead of 'random sampling'. Suppose our data contains reviews for a cosmetic product used by both the male and female population. When we perform random sampling to split the data into train and test sets, there is a possibility that most of the data representing males is not represented in the training data but ends up in the test data. When we train the model on training data that is not a correct representation of the actual population, the model will not predict the test data with good accuracy. This is where stratified sampling comes to the rescue: the data is split in such a way that it represents all the classes in the population.

Let's consider the above example, with cosmetic product reviews from 1000 customers, of which 60% are female and 40% are male.
We want to split the data into train and test sets in an 80:20 proportion. 80% of 1000 customers is 800, chosen so that 480 reviews come from the female population and 320 from the male population. In a similar fashion, the remaining 20% (200 customers) are chosen for the test data, with the same female and male representation (120 female and 80 male).
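A minimal sketch of such a proportional split, assuming scikit-learn's train_test_split with its stratify option (the 600/400 labels below simply mirror the worked example above):

import numpy as np
from sklearn.model_selection import train_test_split

# 1000 customers: 600 female (label 1) and 400 male (label 0), as in the example above
y = np.array([1] * 600 + [0] * 400)
X = np.arange(1000).reshape(-1, 1)     # dummy feature matrix, one row per customer

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# stratify=y preserves the 60:40 ratio: expect 480/320 in train and 120/80 in test
print("Train female/male:", (y_train == 1).sum(), "/", (y_train == 0).sum())
print("Test  female/male:", (y_test == 1).sum(), "/", (y_test == 0).sum())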
This is exactly what stratified K-Fold cross-validation does: it creates K folds while preserving the percentage of samples from each class, which solves the problem with random sampling described above.
Implementation of Stratified K-Fold Cross-Validation in Python

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 0, 1, 1])
skf = StratifiedKFold(n_splits=3, random_state=None, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Output
Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]

Note that each test fold contains one sample from class 0 and one from class 1, preserving the class proportions of y.

Exhaustive Methods
Exhaustive cross-validation methods train and test on all possible ways of dividing the original sample into a training set and a validation set.
1. Leave-P-Out Cross-Validation
2. Leave One Out Cross-Validation

Leave One Out Cross-Validation
In this method, we divide the data into train and test sets, but with a twist. Instead of dividing the data into two subsets, we select a single observation as the test data, label everything else as training data, and train the model. Then the 2nd observation is selected as the test data and the model is trained on the remaining data. This process is repeated for every observation in the dataset.

Implementation of Leave One Out Cross-Validation in Python

from sklearn.model_selection import LeaveOneOut

X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

Output
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]

This output clearly shows how LOOCV keeps one observation aside as test data while all the other observations go into the training data.
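The exhaustive category above also lists Leave-P-Out cross-validation, for which no example is given; a minimal sketch using scikit-learn's LeavePOut (the choice of p=2 and the tiny dataset are illustrative assumptions) could look like this:

from sklearn.model_selection import LeavePOut

X = [10, 20, 30, 40]          # small dataset so the number of splits stays readable
lpo = LeavePOut(p=2)          # every possible pair of observations becomes the test set
for train, test in lpo.split(X):
    print("Train:", train, "Test:", test)
# with n=4 observations and p=2 there are C(4,2) = 6 train/test splits

With p=1, Leave-P-Out reduces to the Leave One Out method shown above.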