Cross Validation
Model Verification
One way to overcome this problem is to not use the entire data set when training
a learner. Some of the data is removed before training begins. Then when training
is done, the data that was removed can be used to test the performance of the
learned model on ``new'' data. This is the basic idea for a whole class of model
evaluation methods called cross validation.
The holdout method is the simplest kind of cross validation. The data set is
separated into two sets, called the training set and the testing set. The function
approximator fits a function using the training set only. Then the function
approximator is asked to predict the output values for the data in the testing set
(it has never seen these output values before). The errors it makes are
accumulated as before to give the mean absolute test set error, which is used to
evaluate the model. The advantage of this method is that it is usually preferable
to the residual method and takes no longer to compute. However, its evaluation
can have a high variance. The evaluation may depend heavily on which data
points end up in the training set and which end up in the test set, and thus the
evaluation may be significantly different depending on how the division is made.
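As a concrete illustration of the holdout method, here is a minimal sketch assuming scikit-learn (the dataset, model, and 70/30 split are arbitrary choices for the example, not part of the notes):

```python
# Holdout evaluation: remove part of the data before training, test on it afterwards.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# Hold out 30% of the data; the learner never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Mean absolute error on the unseen test set evaluates the learned model.
print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```

Note that a different random split can give a noticeably different error estimate, which is exactly the high-variance issue described above.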
k-Fold Cross-Validation
The procedure has a single parameter called k that refers to the number of groups
that a given data sample is to be split into. As such, the procedure is often called
k-fold cross-validation. When a specific value for k is chosen, it may be used in
place of k in the name of the procedure, such as k=10 becoming 10-fold cross-
validation.
This approach involves randomly dividing the set of observations into k groups,
or folds, of approximately equal size. The first fold is treated as a validation set,
and the method is fit on the remaining k − 1 folds; the process is then repeated k
times, with a different fold held out for validation each time.
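To make the procedure concrete, here is a minimal sketch of 5-fold cross-validation using scikit-learn's KFold splitter (an assumed tooling choice); each fold takes a turn as the validation set while the model is fit on the other k − 1 folds:

```python
# k-fold cross-validation: each fold takes a turn as the validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=7)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = []
for train_idx, val_idx in kfold.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                 # fit on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # evaluate on the held-out fold

print("Fold accuracies:", [round(s, 3) for s in scores])
```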
It is also important that any preparation of the data prior to fitting the model
occur on the CV-assigned training dataset within the loop rather than on the
broader data set. This also applies to any tuning of hyperparameters. A failure to
perform these operations within the loop may result in data leakage and an
optimistic estimate of the model skill.
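One way to keep data preparation inside the loop, sketched below under the assumption that scikit-learn is available, is to wrap the preparation and the model in a Pipeline so that the scaler is fit only on each CV-assigned training split:

```python
# Keep data preparation inside the cross-validation loop via a Pipeline,
# so the scaler never sees the validation fold while it is being fit.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=7)

pipeline = Pipeline([
    ("scale", StandardScaler()),          # fit on the training folds only
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
# Report the mean skill score plus a measure of its spread.
print("Accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```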
The results of a k-fold cross-validation run are often summarized with the mean of
the model skill scores. It is also good practice to include a measure of the variance
of the skill scores, such as the standard deviation or standard error.
A poorly chosen value for k may give a misrepresentative idea of the skill of the
model, such as a score with high variance (one that changes a lot depending on
the data used to fit the model) or high bias (such as an overestimate of the skill
of the model).
• Representative: The value for k is chosen such that each train/test group
of data samples is large enough to be statistically representative of the
broader dataset.
• k=10: The value for k is fixed to 10, a value that has been found through
experimentation to generally result in a model skill estimate with low bias
and modest variance.
• k=n: The value for k is fixed to n, where n is the size of the dataset, giving
each example an opportunity to be used as the held-out test sample. This
approach is called leave-one-out cross-validation.
The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the
difference in size between the training set and the resampling subsets gets
smaller. As this difference decreases, the bias of the technique becomes smaller.
A value of k=10 is very common in the field of applied machine learning, and is
recommended if you are struggling to choose a value for your dataset.
If a value for k is chosen that does not evenly split the data sample, then one
group will contain a remainder of the examples. It is preferable to split the data
sample into k groups with the same number of samples, so that the model skill
scores are all computed on groups of the same size and are directly comparable.
Variations on Cross-Validation
• Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that
a single train/test split is created to evaluate the model.
• LOOCV: Taken to another extreme, k may be set to the total number of
observations in the dataset so that each observation is given a chance
to be held out of the dataset. This is called leave-one-out cross-
validation, or LOOCV for short.
• Stratified: The splitting of data into folds may be governed by criteria such
as ensuring that each fold has the same proportion of observations with a
given categorical value, such as the class outcome value. This is called
stratified cross-validation.
• Repeated: This is where the k-fold cross-validation procedure is repeated
n times, where importantly, the data sample is shuffled prior to each
repetition, which results in a different split of the sample.
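As one illustration of these variations (a sketch assuming scikit-learn), RepeatedStratifiedKFold combines stratification with repetition, and LeaveOneOut implements LOOCV:

```python
# Stratified, repeated, and leave-one-out variations of cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, weights=[0.9, 0.1], random_state=3)
model = LogisticRegression(max_iter=1000)

# 10-fold stratified CV, repeated 3 times with a different shuffle each time.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=3)
scores = cross_val_score(model, X, y, cv=cv)
print("Repeated stratified 10-fold: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# LOOCV: k equals the number of observations.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV: %.3f" % loo_scores.mean())
```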
Data Sampling
There are perhaps three main types of data sampling techniques; they are:
• Data Oversampling.
• Data Undersampling.
• Combined Oversampling and Undersampling.
Data oversampling involves duplicating examples of the minority class or
synthesizing new examples of the minority class from existing examples. Perhaps
the most popular method is the Synthetic Minority Oversampling Technique
(SMOTE), along with variations such as Borderline-SMOTE. Perhaps the most
important hyperparameter to tune is the amount of oversampling to perform.
Common oversampling methods include the following (a SMOTE sketch follows
the list):
• Random Oversampling
• SMOTE
• Borderline SMOTE
• SVM SMOTE
• k-Means SMOTE
• ADASYN
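A minimal oversampling sketch, assuming the imbalanced-learn library provides the SMOTE implementation; sampling_strategy controls the amount of oversampling mentioned above:

```python
# Oversample the minority class with SMOTE (imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced synthetic data: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=5)
print("Before:", Counter(y))

# sampling_strategy=0.5 synthesizes minority examples until the class ratio is 1:2;
# k_neighbors controls how many nearest neighbours are used to interpolate new examples.
X_res, y_res = SMOTE(sampling_strategy=0.5, k_neighbors=5, random_state=5).fit_resample(X, y)
print("After: ", Counter(y_res))
```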
Undersampling involves deleting examples from the majority class, either
randomly or using an algorithm to carefully choose which examples to delete.
Additionally, most data sampling algorithms make use of the k-nearest neighbor
algorithm internally. This algorithm is very sensitive to the data types and scale
of the input variables. As such, it may be important to at least normalize input
variables that have differing scales prior to testing the methods, and perhaps to
use specialized methods if some input variables are categorical instead of
numerical.
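Because the sampling methods rely on k-nearest neighbours internally, the sketch below (assuming imbalanced-learn's Pipeline) scales the inputs first and then combines oversampling of the minority class with random undersampling of the majority class:

```python
# Scale inputs before SMOTE (it relies on k-nearest neighbours), then combine
# oversampling of the minority class with undersampling of the majority class.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=5)

pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("over", SMOTE(sampling_strategy=0.3, random_state=5)),                # grow minority class
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=5)),  # shrink majority class
    ("model", DecisionTreeClassifier(random_state=5)),
])

# The sampling steps are applied only to the training folds inside cross-validation.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="f1")
print("F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```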
One-Class Algorithms
Algorithms used for outlier detection and anomaly detection can be used for
classification tasks. Although unusual, when used in this way they are often
referred to as one-class classification algorithms.
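As a hedged sketch of the idea (assuming scikit-learn's OneClassSVM), the outlier detector is fit on majority-class examples only, and its outlier predictions are mapped to the minority class:

```python
# Using an outlier-detection algorithm as a one-class classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import OneClassSVM

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=2)

# Fit only on the majority class; the minority class is treated as "outliers".
detector = OneClassSVM(nu=0.05, gamma="scale")
detector.fit(X[y == 0])

# predict() returns +1 for inliers and -1 for outliers; map -1 to the minority class (1).
pred = np.where(detector.predict(X) == -1, 1, 0)
print("Predicted minority fraction:", pred.mean())
```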
Two further techniques concern the probabilities predicted by a model:
• Calibrating Probabilities.
• Tuning the Classification Threshold.
Calibrating Probabilities
Some algorithms are fit using a probabilistic framework and, in turn, have
calibrated probabilities.
This means that if 100 examples are predicted to have the positive class label
with a probability of 80 percent, then the predicted class label will be correct for
roughly 80 of them.
• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
Most nonlinear algorithms do not predict calibrated probabilities; therefore,
dedicated calibration methods can be used to post-process the predicted
probabilities in order to calibrate them. Two common methods are listed below,
followed by a short sketch.
• Platt Scaling
• Isotonic Regression
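A minimal calibration sketch, assuming scikit-learn's CalibratedClassifierCV, which implements Platt scaling (method="sigmoid") and isotonic regression (method="isotonic"); the SVM and split sizes are illustrative choices:

```python
# Post-process an SVM's scores into calibrated probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# method="sigmoid" is Platt scaling; method="isotonic" is isotonic regression.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

# Calibrated class probabilities for a few held-out examples.
print(calibrated.predict_proba(X_test)[:5])
```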
Tuning the Classification Threshold
Some algorithms are designed to natively predict probabilities that must later be
mapped to crisp class labels.
This is the case if class labels are required as output for the problem, or the model
is evaluated using class labels.
• Logistic Regression
• Linear Discriminant Analysis
• Naive Bayes
• Artificial Neural Networks
Probabilities are mapped to class labels using a threshold probability value. All
probabilities below the threshold are mapped to class 0, and all probabilities
equal-to or above the threshold are mapped to class 1.
The default threshold is 0.5, although different thresholds can be used that will
dramatically impact the class labels and, in turn, the performance of a machine
learning model that natively predicts probabilities.
As such, if probabilistic algorithms are used that natively predict a probability and
class labels are required as output or used to evaluate models, it is a good idea
to try tuning the classification threshold.
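A minimal sketch of threshold tuning (assuming scikit-learn): predicted positive-class probabilities are mapped to labels at a range of candidate thresholds, and the threshold with the best F-measure on held-out data is kept:

```python
# Tune the probability threshold that maps predicted probabilities to class labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=6)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=6, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class

# Evaluate F1 at a grid of candidate thresholds instead of the default 0.5.
thresholds = np.arange(0.1, 0.9, 0.01)
scores = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print("Best threshold: %.2f, F1: %.3f" % (best, max(scores)))
```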