Lecture - 5 - Validation

Generalization refers to how well a model can apply concepts learned from its training data to examples it has not seen before. A good model should make accurate predictions on new, out-of-sample data. Overfitting and underfitting are the two main causes of poor generalization. Validation and regularization techniques are used to address overfitting. The validation set approach involves randomly splitting data into training, validation, and test sets. Models are fit on the training set and evaluated on the validation set, which provides an estimate of the test error. This helps select the best model before evaluating on the held-out test set.


Validation

Dr. Mehmet Yasin Ulukuş

Mindset Institute - Mehmet Yasin Ulukuş


Generalization
• Generalization refers to how well the concepts learned by the model apply to examples not seen by the model when it was learning.
• Note that this is the real purpose of building a model (learn from the data and use it in real life).
• A good model makes accurate predictions on new data that it has never seen; the error on such data is called the out-of-sample error.
• Hence we want a model whose out-of-sample error is as small as possible.
• Overfitting and underfitting are the two biggest causes of poor performance of learning algorithms.



Overfitting
• Overfitting refers to a model that fits the training data too well (more than it should).
• Overfitting is the phenomenon where fitting the observed facts (data) well no longer indicates that we will get a decent error outside of the training set, and may actually lead to the opposite effect.
• Overfitting occurs when the learning model is more complex than is necessary to
represent the target function.
• The model uses its additional degrees of freedom to fit idiosyncrasies in the data
(for example, noise), yielding a final hypothesis that is inferior.
• The ability to deal with overfitting is what separates professionals from amateurs
in the field of learning from data.


Overfitting
• Consider a simple one-dimensional regression
problem with five data points
• The target function is a 2nd order polynomial
• We added a little noise to create the data
points.
• We use 5 data points to fit a 4th order
polynomial
• Though the target is simple, the learning
algorithm used the full power of the 4th order
polynomial to fit the data exactly, but the
result does not look anything like the target
function
The data has been overfit



Overfitting
• The model has zero training error
• However the model does a very bad job
at generalization
• There are different ways of dealing with
this problem
• We will be covering two main approaches
(1) validation and (2) regularization to
deal with overfitting

The data has been overfit
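A minimal sketch of this experiment in Python (the target coefficients, noise level, and sample locations are illustrative assumptions, not the values used for the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 2nd-order target with a little noise, observed at 5 points
def target(x):
    return 1.0 - 2.0 * x + 3.0 * x ** 2

x_train = np.linspace(-1, 1, 5)
y_train = target(x_train) + rng.normal(scale=0.1, size=x_train.shape)

# A 4th-order polynomial has 5 coefficients, so it interpolates the 5 points exactly
coef = np.polyfit(x_train, y_train, deg=4)

# Training error is (numerically) zero ...
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)

# ... but the error against the true target on new points is much larger
x_new = np.linspace(-1, 1, 200)
out_mse = np.mean((np.polyval(coef, x_new) - target(x_new)) ** 2)
print(f"training MSE: {train_mse:.2e}, error vs target on new points: {out_mse:.2e}")
```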



Overfitting
• We introduced the idea of a test set in
the first class where a data set that is not
involved in the learning process is used
to evaluate the final hypothesis.
• The test error, unlike the training error, is an unbiased estimate of the out-of-sample error.
• Should we then use the test error to pick the model that does best on the test set?
• The answer is NO!



Validation
• The idea of a validation set is almost identical to that of test set.
• We remove a subset from the data; this subset is not used in training.
• We then use this held-out subset to estimate the out-of-sample error.
• The held-out set is effectively out-of-sample, because it has not been used during
the learning.
• However, there is a difference between a validation set and a test set.
• Although the validation set will not be directly used for training, it will be used in
making certain choices in the learning process.
• For example, tuning the parameters of the model (choosing k in KNN, or choosing the order of the polynomial in regression) or selecting the set of features to be used in the model.
• The minute a set affects the learning process in any way, it is no longer a test set.
Validation
• The best approach for both problems is to randomly divide the dataset into three
parts: training set, a validation set, and a test set.

• The training set is used to fit the models; the validation set is used to estimate
prediction error for model selection; the test set is used for assessment of the
generalization error of the final chosen model.
• Ideally, the test set should be kept in a “vault,” and be brought out only at the
end of the data analysis
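As a concrete sketch of such a three-way split with scikit-learn (the data here is a synthetic placeholder, and the split fractions are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; in the lecture's example this would be the Auto data set
rng = np.random.default_rng(0)
X = rng.normal(size=(392, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=392)

# First hold out 20% of the data as the final test set, kept "in the vault"
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder 50/50 into a training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit candidate models on (X_train, y_train), compare them on (X_val, y_val),
# and touch (X_test, y_test) only once, for the final chosen model
print(len(X_train), len(X_val), len(X_test))
```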



Validation Set Approach
• The validation set approach, displayed in the figure below, is a very simple strategy for this task.
[Figure: the data are randomly divided into a Training Set, a Validation Set, and a Test Set]

• Model is fit on the training set, and the fitted model is used to predict the responses for
the observations in the validation set
• The resulting validation set error rate—typically assessed using MSE in the case of a
quantitative response—provides an estimate of the test error rate
• Model choices, such as tuning the parameters of the model, should be made using the validation set.



Validation Set Approach
• Consider the car example, in which mpg (gas mileage in miles per gallon) versus horsepower is shown for a number of cars in the Auto data set
• The data suggest a curved relationship
• A simple approach for incorporating
non-linear associations in a linear
model is to include transformed
versions of the predictors in the
model



Validation Set Approach
• The R^2 of the quadratic fit is 0.688, compared to 0.606 for the linear fit, and the p-value for the quadratic term is highly significant
• If including horsepower^2 led to such
a big improvement in the model, why
not include horsepower^3,
horsepower^4, or even
horsepower^5?
• Overfitting 



Validation Set Approach
• We randomly split the 392 observations: first, 20% of the data is set aside as the final test set.
• We then split the remaining 312 observations into a training set containing 156 of the data points and a validation set containing the remaining 156 observations.
• The model is fit using the training set, and the MSE is computed on the validation set.
• The validation set error rates that result
from fitting various regression models on
the training sample and evaluating their
performance on the validation sample are
shown in the figure
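A sketch of the same degree-selection procedure on synthetic data standing in for the Auto data set (the curved relationship and noise level are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Auto data: a curved mpg-vs-horsepower relationship
rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 230, size=(392, 1))
mpg = (55 - 0.35 * horsepower[:, 0] + 0.0006 * horsepower[:, 0] ** 2
       + rng.normal(scale=3, size=392))

# Half of the observations for training, half for validation
hp_train, hp_val, mpg_train, mpg_val = train_test_split(
    horsepower, mpg, test_size=0.5, random_state=0)

# Fit polynomial regressions of increasing degree on the training half
# and use the validation half to estimate the test MSE of each
for degree in range(1, 6):
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(hp_train, mpg_train)
    val_mse = mean_squared_error(mpg_val, model.predict(hp_val))
    print(f"degree {degree}: validation MSE = {val_mse:.2f}")
```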



Validation Set Approach
• The validation set MSE for the quadratic
fit is considerably smaller than for the
linear fit.
• However, the validation set MSE for the
cubic fit is actually slightly larger than for
the quadratic fit.
• This implies that including a cubic term in
the regression does not lead to better
prediction than simply using a quadratic
term



Validation Set Approach
• Recall that in order to create the
figure, we randomly divided the data
set into two parts, a training set and
a validation set
• If we repeat the process of randomly
splitting the sample set into two
parts, we will get a somewhat
different estimate for the test MSE



Validation Set Approach
• All ten curves indicate that the model
with a quadratic term has a dramatically
smaller validation set MSE than the
model with only a linear term
• Furthermore, all ten curves indicate that
there is not much benefit in including
cubic or higher-order polynomial terms
in the model
• But it is worth noting that each of the
ten curves results in a different test MSE
estimate (be careful these are not real
test MSE scores, validation MSE is an
estimate of test MSE) for each of the ten
regression models considered



Validation Set Approach
• The validation set approach is conceptually simple and is easy to
implement.
• But it has two potential drawbacks:
• The validation estimate of the test error rate can be highly variable,
depending on precisely which observations are included in the training set
and which observations are included in the validation set (as seen in the
previous figure)
• In the validation approach, only a subset of the observations are used to fit
the model.
• Statistical methods tend to perform worse when trained on fewer observations; this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
Cross-Validation
• To make the algorithm learn better, we would like to make the training set as big as possible.
• However, if we make this choice, we lose the reliability of the validation estimate, since the validation error is then computed using a small sample.
• We present cross-validation, a refinement of the validation set
approach that addresses these two problems
• We will cover two basic cross-validation techniques: (1) leave-one-out
cross validation (LOOCV), and (2) k-fold cross validation



LOOCV
• Like the validation set approach, LOOCV involves splitting the set of observations into two
parts.
• However, instead of creating two subsets of comparable size, a single observation is used
for the validation set, and the remaining observations make up the training set.
• Since (x_1, y_1) was not used in the fitting process, MSE_1 = (y_1 − ŷ_1)^2 is an approximately unbiased estimator of the test error
• It is a poor estimate because it is highly variable, since it is based upon a single observation
• But we can repeat the procedure by placing each observation (x_i, y_i) in the validation set one at a time, leaving the remaining observations in the training set, and then computing MSE_i = (y_i − ŷ_i)^2



LOOCV
• Schematically: [Figure: in each of the n rounds, a single observation is held out as the validation point and the remaining n − 1 observations form the training set]

• The LOOCV estimate for the test MSE is the average of these n error estimates:
CV_(n) = (1/n) (MSE_1 + MSE_2 + ... + MSE_n)
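A sketch of LOOCV with scikit-learn, reusing the assumed synthetic horsepower/mpg data from the earlier sketch (still a stand-in for the Auto data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic horsepower/mpg data as in the earlier sketch
rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 230, size=(392, 1))
mpg = (55 - 0.35 * horsepower[:, 0] + 0.0006 * horsepower[:, 0] ** 2
       + rng.normal(scale=3, size=392))

# Quadratic model; LeaveOneOut fits it n times, each time holding out one observation
model = make_pipeline(StandardScaler(), PolynomialFeatures(2), LinearRegression())
scores = cross_val_score(model, horsepower, mpg,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")

# CV_(n) is the average of the n held-out squared errors
print(f"LOOCV estimate of the test MSE: {-scores.mean():.2f}")
```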



LOOCV
• LOOCV has a couple of major advantages over the validation set
approach.
• In LOOCV, we repeatedly fit the statistical learning method using
training sets that contain n − 1 observations, almost as many as are in
the entire data set.
• Hence each model is fit on almost the entire data set, so LOOCV tends not to overestimate the test error as much as the validation set approach does.
• Also, there is no randomness in the method, since there are no random splits.
• In other words, performing LOOCV multiple times gives the same result, unlike the validation set approach.



LOOCV
• LOOCV has the potential to be expensive to implement, since the
model has to be fit n times.
• This can be very time consuming if n is large, and if each individual
model is slow to fit
• LOOCV is a very general method, and can be used with any kind of
predictive modeling.
• For example we could use it with logistic regression or linear
discriminant analysis, or any of the methods discussed in later classes
• Note that we need to replace MSE with other types of error measures
depending on the method



LOOCV
• We used LOOCV on the Auto data set in order to obtain an estimate
of the test set MSE



K-fold Cross Validation
• An alternative to LOOCV is k-fold CV.
• This approach involves randomly dividing the set of observations into k groups, or folds,
of approximately equal size.
• The first fold is treated as a validation set, and the method is fit on the remaining k − 1
folds
• The mean squared error, MSE1, is then computed on the observations in the held-out
fold
• This procedure is repeated k times; each time, a different group of observations is
treated as a validation set.
• This process results in k estimates of the test error, MSE_1, MSE_2, ..., MSE_k.
• The k-fold CV estimate is computed by averaging these values: CV_(k) = (1/k) (MSE_1 + MSE_2 + ... + MSE_k)



K-fold Cross Validation
• The figure illustrates the k-fold CV approach: [Figure: the observations are split into k folds; each fold in turn is held out as the validation set while the remaining k − 1 folds form the training set]

• In practice, one typically performs k-fold CV using k = 5 or k = 10
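A sketch of 10-fold CV for the same candidate polynomial degrees (again using the assumed synthetic stand-in for the Auto data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic horsepower/mpg data as in the earlier sketches
rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 230, size=(392, 1))
mpg = (55 - 0.35 * horsepower[:, 0] + 0.0006 * horsepower[:, 0] ** 2
       + rng.normal(scale=3, size=392))

# 10-fold CV: shuffle, split into 10 folds, and average the 10 held-out-fold MSEs
cv10 = KFold(n_splits=10, shuffle=True, random_state=0)
for degree in range(1, 6):
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, horsepower, mpg,
                             cv=cv10, scoring="neg_mean_squared_error")
    print(f"degree {degree}: 10-fold CV MSE = {-scores.mean():.2f}")
```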



K-fold Cross Validation
• K-fold is computationally less expensive than LOOCV
• Some statistical learning methods have computationally intensive fitting procedures, and
so performing LOOCV may pose computational problems, especially if n is extremely
large.
• The figures show the error estimates obtained with LOOCV and with 10-fold CV
• The results are similar



K-fold Cross Validation
• As we can see from the figure, there is some variability in the CV estimates as a result of
the variability in how the observations are divided into ten folds.
• But this variability is typically much lower than the variability in the test error estimates
that results from the validation set approach



Cross-Validation on Classification Problems
• Cross-validation can also be a very useful approach in the classification setting when Y is
qualitative
• In this setting, cross-validation works just as described earlier in this chapter, except that
rather than using MSE to quantify test error, we instead use the number of misclassified
observations.
• For instance, in the classification setting, the LOOCV error rate takes the form
CV_(n) = (1/n) (Err_1 + Err_2 + ... + Err_n)
where Err_i = I(y_i ≠ ŷ_i) indicates whether observation i was misclassified
• The k-fold CV error rate and validation set error rates are defined analogously
• Accuracy or other measures can also be used similarly (see Jupyter notebook
KNN_Validation)
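A sketch in the spirit of the KNN_Validation notebook (the notebook itself is not reproduced here; the dataset and the grid of k values are assumptions), choosing k by 10-fold cross-validated accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Any labelled data set works; iris is used here purely as a placeholder
X, y = load_iris(return_X_y=True)

# Estimate out-of-sample accuracy for several values of k with 10-fold CV;
# scoring="accuracy" counts correct classifications instead of using MSE
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()
    print(f"k = {k}: 10-fold CV accuracy = {acc:.3f}")
```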
