
CROSS VALIDATION

Model verification LN 12

INSTITUTE OF SPACE SCIENCE & TECHNOLOGY


Amity University, NOIDA
Cross Validation: LN 12
Cross validation is a model evaluation method that is better than simply examining residuals. The problem with residual evaluations is that they do not give an indication of how well the learner will do when it is asked to make new predictions for data it has not already seen.

One way to overcome this problem is to not use the entire data set when training
a learner. Some of the data is removed before training begins. Then when training
is done, the data that was removed can be used to test the performance of the
learned model on "new" data. This is the basic idea for a whole class of model
evaluation methods called cross validation.

Cross validation checks how well a model generalizes to new data.

The holdout method is the simplest kind of cross validation. The data set is
separated into two sets, called the training set and the testing set. The function
approximator fits a function using the training set only. Then the function
approximator is asked to predict the output values for the data in the testing set
(it has never seen these output values before). The errors it makes are
accumulated as before to give the mean absolute test set error, which is used to
evaluate the model. The advantage of this method is that it is usually preferable
to the residual method and takes no longer to compute. However, its evaluation
can have a high variance. The evaluation may depend heavily on which data
points end up in the training set and which end up in the test set, and thus the
evaluation may be significantly different depending on how the division is made.
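As a rough illustration, the holdout method might look as follows in Python with scikit-learn; the synthetic regression data, the linear model, and the 30% test size are illustrative assumptions, not something prescribed by these notes.

# Minimal holdout evaluation: remove part of the data before training,
# then score the fitted model on that held-out portion.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 30% of the data is held out and never shown to the learner during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Mean absolute error on "new" data, used to evaluate the model.
print("Holdout MAE:", mean_absolute_error(y_test, model.predict(X_test)))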

K-fold cross validation is one way to improve over the holdout method. The data
set is divided into k subsets, and the holdout method is repeated k times. Each
time, one of the k subsets is used as the test set and the other k-1 subsets are
put together to form a training set. Then the average error across all k trials is
computed. The advantage of this method is that it matters less how the data gets
divided. Every data point gets to be in a test set exactly once, and gets to be in a
training set k-1 times. The variance of the resulting estimate is reduced as k is
increased. The disadvantage of this method is that the training algorithm has to
be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. A variant of this method is to randomly
divide the data into a test and training set k different times. The advantage of
doing this is that you can independently choose how large each test set is and
how many trials you average over.
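A minimal sketch of both ideas, assuming Python with scikit-learn (the dataset, model, and error metric are illustrative choices): KFold implements standard k-fold cross validation, while ShuffleSplit implements the variant that draws k independent random train/test divisions.

# k-fold cross-validation: every point is tested exactly once; scores are averaged.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# Standard k-fold with k=5.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_absolute_error")
print("5-fold mean MAE:", -scores.mean())

# The repeated random-split variant: test-set size and number of trials
# are chosen independently of each other.
shuffle = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(model, X, y, cv=shuffle, scoring="neg_mean_absolute_error")
print("Repeated random splits mean MAE:", -scores.mean())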

Leave-one-out cross validation is k-fold cross validation taken to its logical extreme, with k equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by the leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute. Fortunately, locally weighted learners can make LOO predictions just as easily as they make regular predictions, so computing the LOO-XVE takes no more time than computing the residual error, and it is a much better way to evaluate models. The Vizier system, for example, relies heavily on LOO-XVE to choose its meta codes.
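For completeness, here is a leave-one-out sketch under the same illustrative scikit-learn assumptions; the generic LeaveOneOut splitter refits the model N times, so the shortcut available to locally weighted learners mentioned above is not shown.

# Leave-one-out: k equals N, so the model is fit N times, each time leaving out one point.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print("LOO-XVE (mean absolute error):", -scores.mean())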

k-Fold Cross-Validation

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups
that a given data sample is to be split into. As such, the procedure is often called
k-fold cross-validation. When a specific value for k is chosen, it may be used in
place of k in the reference to the model, such as k=10 becoming 10-fold cross-
validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows (a minimal code sketch follows the list):

1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   1. Take the group as a hold-out or test data set.
   2. Take the remaining groups as a training data set.
   3. Fit a model on the training set and evaluate it on the test set.
   4. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
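A minimal sketch of this loop, assuming Python with scikit-learn; the synthetic classification data, logistic regression model, and accuracy metric are illustrative assumptions.

# The general procedure written out explicitly: shuffle, split into k groups,
# hold each group out once, fit on the rest, and keep the score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)   # steps 1 and 2
scores = []
for train_idx, test_idx in kfold.split(X):                 # step 3
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # fit on the k-1 training folds
    scores.append(accuracy_score(y[test_idx],
                                 model.predict(X[test_idx])))  # evaluate on the hold-out fold

# Step 4: summarize with the mean and a measure of spread.
print(f"Accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")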

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold-out set 1 time and used to train the model k-1 times.

This approach involves randomly dividing the set of observations into k groups,
or folds, of approximately equal size. The first fold is treated as a validation set,
and the method is fit on the remaining k − 1 folds.

It is also important that any preparation of the data prior to fitting the model
occur on the CV-assigned training dataset within the loop rather than on the
broader data set. This also applies to any tuning of hyperparameters. A failure to
perform these operations within the loop may result in data leakage and an
optimistic estimate of the model skill.
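One common way to keep data preparation inside the loop is to wrap it in a scikit-learn Pipeline, as in the hedged sketch below; the scaler and model are illustrative choices, and any hyperparameter tuning would likewise be nested (for example with GridSearchCV) rather than performed on the full dataset.

# Data preparation (here, standardization) is wrapped in a Pipeline so that it is
# re-fit on each CV training fold only, never on the full dataset, avoiding leakage.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

pipeline = Pipeline([
    ("scale", StandardScaler()),                 # fitted on each fold's training data only
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=10)
print("Leakage-free 10-fold accuracy:", scores.mean())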

Despite the best efforts of statistical methodologists, users frequently invalidate their results by inadvertently peeking at the test data.

The results of a k-fold cross-validation run are often summarized with the mean of the model skill scores. It is also good practice to include a measure of the variance of the skill scores, such as the standard deviation or standard error.



Configuration of k

The value of k must be chosen carefully for the data sample at hand.

A poorly chosen value for k may result in a misrepresentative idea of the skill of the model, such as a score with a high variance (one that may change a lot based on the data used to fit the model) or a high bias (such as an overestimate of the skill of the model).

Three common tactics for choosing a value for k are as follows:

• Representative: The value for k is chosen such that each train/test group
of data samples is large enough to be statistically representative of the
broader dataset.
• k=10: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.
• k=n: The value for k is fixed to n, where n is the size of the dataset, so that each sample is used exactly once in the hold-out dataset. This approach is called leave-one-out cross-validation.

The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.

A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

If a value for k is chosen that does not evenly split the data sample, then one group will contain a remainder of the examples. It is preferable to split the data sample into k groups with the same number of samples, so that the resulting model skill scores are all comparable.

Variations on Cross-Validation



There are a number of variations on the k-fold cross validation procedure.

Commonly used variations are as follows (a short sketch of the stratified and repeated variants follows the list):

• Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that
a single train/test split is created to evaluate the model.
• LOOCV: Taken to another extreme, k may be set to the total number of observations in the dataset such that each observation is given a chance to be held out of the dataset. This is called leave-one-out cross-validation, or LOOCV for short.
• Stratified: The splitting of data into folds may be governed by criteria such
as ensuring that each fold has the same proportion of observations with a
given categorical value, such as the class outcome value. This is called
stratified cross-validation.
• Repeated: This is where the k-fold cross-validation procedure is repeated
n times, where importantly, the data sample is shuffled prior to each
repetition, which results in a different split of the sample.
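A short sketch of the stratified and repeated variants, again assuming scikit-learn with illustrative data and model choices.

# Stratified folds preserve the class proportions in each fold; repeated k-fold
# reshuffles and re-splits the data n times for a more stable estimate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=1)
model = LogisticRegression(max_iter=1000)

stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
print("Stratified 5-fold accuracy:", cross_val_score(model, X, y, cv=stratified).mean())

repeated = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
print("Repeated stratified 5-fold accuracy:", cross_val_score(model, X, y, cv=repeated).mean())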

Data Sampling
There are perhaps three main types of data sampling techniques; they are:

• Data Oversampling
• Data Undersampling
• Combined Oversampling and Undersampling
Data oversampling involves duplicating examples of the minority class or synthesizing new minority-class examples from existing ones. Perhaps the most popular method is the Synthetic Minority Over-sampling Technique (SMOTE) and its variations, such as Borderline SMOTE. Perhaps the most important hyperparameter to tune is the amount of oversampling to perform (see the sketch after the list below).

Examples of data oversampling methods include:

• Random Oversampling
• SMOTE
• Borderline SMOTE
• SVM SMOTE
• k-Means SMOTE
• ADASYN
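A small sketch using the imbalanced-learn package (an assumption; these notes do not name a library). The sampling_strategy argument controls the amount of oversampling. Note that, consistent with the earlier discussion of data leakage, resampling should be applied only to the training folds when combined with cross validation (imbalanced-learn provides its own Pipeline for that purpose).

# Oversample the minority class with SMOTE (requires the imbalanced-learn package).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=1)
print("Before:", Counter(y))

# sampling_strategy controls how much oversampling is done; here the minority
# class is raised to 50% of the majority class size.
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=1).fit_resample(X, y)
print("After: ", Counter(y_res))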
Undersampling involves deleting examples from the majority class, either randomly or using an algorithm to carefully choose which examples to delete. Popular editing algorithms include Edited Nearest Neighbours and Tomek Links.

Examples of data undersampling methods include:

• Random Undersampling
• Condensed Nearest Neighbour
• Tomek Links
• Edited Nearest Neighbours
• Neighbourhood Cleaning Rule
• One-Sided Selection
Almost any oversampling method can be combined with almost any undersampling technique. Therefore, it may be beneficial to test a suite of different combinations of oversampling and undersampling techniques.

Examples of popular combinations of oversampling and undersampling include (a short sketch follows the list):

• SMOTE and Random Undersampling
• SMOTE and Tomek Links
• SMOTE and Edited Nearest Neighbours
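A hedged sketch of two of these combinations using imbalanced-learn's combine module; the synthetic imbalanced dataset is an illustrative assumption.

# Combined resampling: SMOTE oversampling followed by Tomek-link undersampling,
# and SMOTE followed by Edited Nearest Neighbours (imbalanced-learn).
from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=1)

X_st, y_st = SMOTETomek(random_state=1).fit_resample(X, y)
X_se, y_se = SMOTEENN(random_state=1).fit_resample(X, y)
print("SMOTE + Tomek Links:", Counter(y_st))
print("SMOTE + ENN:        ", Counter(y_se))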
Data sampling algorithms may perform differently depending on the choice of
machine learning algorithm.

As such, it may be beneficial to test a suite of standard machine learning algorithms, such as all or a subset of those typically used when spot-checking models.

Additionally, most data sampling algorithms make use of the k-nearest neighbour algorithm internally. This algorithm is very sensitive to the data types and scale of the input variables. As such, it may be important to at least normalize input variables that have differing scales prior to testing the methods, and perhaps to use specialized methods if some input variables are categorical instead of numerical.
One-Class Algorithms
Algorithms used for outlier detection and anomaly detection can be used for classification tasks. Although unusual, when used in this way, they are often referred to as one-class classification algorithms.



In some cases, one-class classification algorithms can be very effective, such as
when there is a severe class imbalance with very few examples of the positive
class.

Examples of one-class classification algorithms to try include:

• One-Class Support Vector Machines
• Isolation Forests
• Minimum Covariance Determinant
• Local Outlier Factor
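An illustrative sketch of two of these algorithms in scikit-learn; the synthetic "normal" and anomalous points are assumptions, and both estimators label inliers as +1 and outliers as -1.

# One-class classification: fit on (mostly) normal data, then flag outliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(200, 2))            # the "normal" (majority) class
X_test = np.vstack([rng.normal(0, 1, size=(5, 2)),   # a few normal points
                    rng.normal(6, 1, size=(5, 2))])  # a few anomalies

print("Isolation Forest:", IsolationForest(random_state=0).fit(X_train).predict(X_test))
print("One-Class SVM:   ", OneClassSVM(nu=0.05).fit(X_train).predict(X_test))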

Probability Tuning Algorithms

Predicted probabilities can be improved in two ways; they are:

• Calibrating Probabilities.
• Tuning the Classification Threshold.
Calibrating Probabilities
Some algorithms are fit using a probabilistic framework and, in turn, have
calibrated probabilities.

This means that when 100 examples are predicted to have the positive class label with a probability of 80 percent, the algorithm will predict the correct class label for roughly 80 of them.

Calibrated probabilities are required for a model to be considered skilful on a binary classification task when probabilities are either required as the output or used to evaluate the model (e.g. ROC AUC or PR AUC).

Some examples of machine learning algorithms that predict calibrated probabilities are as follows:

• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
Most nonlinear algorithms do not predict calibrated probabilities; therefore, calibration algorithms can be used to post-process the predicted probabilities in order to calibrate them.



Therefore, when probabilities are required directly or are used to evaluate a
model, and nonlinear algorithms are being used, it is important to calibrate the
predicted probabilities.
Some examples of probability calibration algorithms are listed below, followed by a short sketch:

• Platt Scaling
• Isotonic Regression
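A sketch of post-hoc calibration with scikit-learn's CalibratedClassifierCV, where method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression; the SVM base model, synthetic data, and Brier score metric are illustrative assumptions.

# Post-process the probabilities of an uncalibrated model with Platt scaling
# and with isotonic regression, then compare calibration via the Brier score.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(SVC(), method=method, cv=5)
    calibrated.fit(X_train, y_train)
    probs = calibrated.predict_proba(X_test)[:, 1]
    print(method, "Brier score:", brier_score_loss(y_test, probs))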
Tuning the Classification Threshold
Some algorithms are designed to natively predict probabilities that must later be mapped to crisp class labels.

This is the case if class labels are required as output for the problem, or the model
is evaluated using class labels.

Examples of probabilistic machine learning algorithms that predict a probability by default include:

• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
Probabilities are mapped to class labels using a threshold probability value. All
probabilities below the threshold are mapped to class 0, and all probabilities
equal-to or above the threshold are mapped to class 1.

The default threshold is 0.5, although different thresholds can be used that will
dramatically impact the class labels and, in turn, the performance of a machine
learning model that natively predicts probabilities.

As such, if probabilistic algorithms are used that natively predict a probability and
class labels are required as output or used to evaluate models, it is a good idea
to try tuning the classification threshold.
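As an illustration, a simple threshold sweep might look as follows (Python with scikit-learn; the F1 metric, imbalanced synthetic data, and grid of thresholds are assumptions, and in practice the threshold should be tuned on a validation set rather than the final test set).

# Tuning the classification threshold: predicted probabilities are mapped to
# class labels with a threshold other than the default 0.5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Evaluate a grid of thresholds and keep the one with the best F1 score.
thresholds = np.arange(0.1, 0.9, 0.05)
scores = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best:.2f}, F1 = {max(scores):.3f}")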

Framework for Spot-Checking Imbalanced Algorithms

We can summarize these suggestions into a framework for testing imbalanced machine learning algorithms on a dataset.

1. Data Sampling Algorithms
   • Data Oversampling
      • Random Oversampling
      • SMOTE
      • Borderline SMOTE
      • SVM SMOTE
      • k-Means SMOTE
      • ADASYN
   • Data Undersampling
      • Random Undersampling
      • Condensed Nearest Neighbour
      • Tomek Links
      • Edited Nearest Neighbours
      • Neighbourhood Cleaning Rule
      • One-Sided Selection
   • Combined Oversampling and Undersampling
      • SMOTE and Random Undersampling
      • SMOTE and Tomek Links
      • SMOTE and Edited Nearest Neighbours
2. Cost-Sensitive Algorithms
   • Logistic Regression
   • Decision Trees
   • Support Vector Machines
   • Artificial Neural Networks
   • Bagged Decision Trees
   • Random Forest
   • Stochastic Gradient Boosting
3. One-Class Algorithms
   • One-Class Support Vector Machines
   • Isolation Forests
   • Minimum Covariance Determinant
   • Local Outlier Factor
4. Probability Tuning Algorithms
   • Calibrating Probabilities
      • Platt Scaling
      • Isotonic Regression
   • Tuning the Classification Threshold

The order of the steps is flexible, the order of algorithms within each step is also flexible, and the list of algorithms is not complete.



The structure is designed to get you thinking systematically about what algorithm
to evaluate.
