Week 7 Lecture 1 ML SPR25
with R
Instructor: Babu Adhimoolam
Resampling Methods

Learning objectives:
• Cross-validation methods
• Bootstrapping method
Estimating the test error rate!
• The accuracy of a machine learning model depends on how well it predicts the response y given x on a completely novel dataset (not on the training dataset).
• In real-life situations we don’t have access to such a novel dataset, so we must devise methods to estimate this error rate in order to know whether the model we developed is dependable.
• We overcome this issue by holding out some data (often termed test data) from model fitting during training, and finally estimating the model’s error on this hold-out dataset.
A simple Validation Set approach
• It involves randomly dividing the observations into a training set and a validation set (or hold-out set).
• The model is fit on the training set, and the fitted model is used to predict the responses in the validation set. The resulting validation set error rate provides an estimate of the test error rate.
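The validation set approach can be sketched as follows. This is a minimal Python illustration (the course uses R, but the logic is identical); the data here are simulated for the example, not from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): a noisy quadratic relationship.
n = 100
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.5, n)

# Randomly divide the observations into a training set and a validation set.
idx = rng.permutation(n)
train, val = idx[: n // 2], idx[n // 2:]

# Fit the model (a degree-2 polynomial) on the training set only.
coef = np.polyfit(x[train], y[train], deg=2)

# Predict the responses in the validation set; the resulting MSE
# is the validation-set estimate of the test error rate.
pred = np.polyval(coef, x[val])
val_mse = np.mean((y[val] - pred) ** 2)
print(val_mse)
```

Because the split is random, rerunning with a different seed gives a different estimate, which is exactly the variability discussed later in this lecture.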
Leave-One-Out Cross-Validation (LOOCV)
• Unlike the validation set approach, we leave out only one observation for testing, and all the remaining (n − 1) observations are used for training. The process is repeated until every observation has been used for testing, one by one.
• If x1, x2, …, xn are the observations, then in the first round (x1, y1) is the test data, the remaining (x2, y2), …, (xn, yn) are the training data, and the test mean squared error is MSE1 = (y1 − ŷ1)^2. Similarly, in the second round, with (x2, y2) as the test data and the remainder as the training data, MSE2 = (y2 − ŷ2)^2.
• The LOOCV estimate of the test MSE is the average of these n errors: CV(n) = (1/n)(MSE1 + MSE2 + … + MSEn).
[Figure: schematic of the LOOCV splits of n observations]
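The LOOCV loop above can be sketched in a few lines. Again this is a Python illustration with simulated (hypothetical) data, fitting a simple linear model at each round.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical): a noisy linear relationship.
n = 30
x = rng.uniform(0, 10, n)
y = 3 + 0.7 * x + rng.normal(0, 1, n)

# Leave-one-out: each observation i serves as the test set exactly once,
# and the model is fit on the remaining n - 1 observations.
mses = []
for i in range(n):
    mask = np.ones(n, dtype=bool)
    mask[i] = False                       # hold out observation i
    coef = np.polyfit(x[mask], y[mask], deg=1)
    pred_i = np.polyval(coef, x[i])
    mses.append((y[i] - pred_i) ** 2)     # MSE_i = (y_i - yhat_i)^2

# The LOOCV estimate CV(n) is the average of the n squared errors.
cv_n = np.mean(mses)
print(cv_n)
```

Note that the model is refit n times, which is why LOOCV can be expensive for large datasets or slow-to-fit models.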
k-Fold Cross-Validation
• The observations are randomly divided into k groups, or folds, of approximately equal size. First, the first fold is used as the validation set and the model is fit on the remaining k − 1 folds. On the next iteration, the second fold is used as the validation set and the procedure is repeated, and so on.
• The procedure results in k estimates of the test MSE (MSE1, MSE2, …, MSEk), and the average of these k estimates is the k-fold cross-validation estimate: CV(k) = (1/k)(MSE1 + MSE2 + … + MSEk).
[Figure: schematic of the k-fold splits of n observations]
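The k-fold procedure can be sketched the same way. This Python illustration uses simulated (hypothetical) data and k = 10 folds.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (hypothetical).
n, k = 100, 10
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.5, n)

# Randomly assign each observation to one of k equally sized folds.
folds = rng.permutation(np.repeat(np.arange(k), n // k))

fold_mses = []
for j in range(k):
    val = folds == j                              # fold j is the validation set
    coef = np.polyfit(x[~val], y[~val], deg=2)    # fit on the other k - 1 folds
    pred = np.polyval(coef, x[val])
    fold_mses.append(np.mean((y[val] - pred) ** 2))

# The k-fold CV estimate CV(k) is the average of the k fold MSEs.
cv_k = np.mean(fold_mses)
print(cv_k)
```

With k = n this reduces to LOOCV, so k-fold can be viewed as the general procedure, with k = 5 or 10 being the common practical choices.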
[Figure: left, the validation set method repeated 10 times with different random splits; right, the 10-fold method repeated with different random splits each time]
Comparison of true MSE and cross-validated MSE
[Figure: linear regression (orange) and two smoothing splines (green and blue); training MSE in gray, test MSE in red; true test MSE in blue, LOOCV estimate as a black dashed line, 10-fold estimate in orange]
The true MSE is closely approximated by the cross-validated MSE
True test MSE in blue; LOOCV estimate as a black dashed line; 10-fold estimate in orange.
• Despite underestimating the true MSE, all the CV estimates come close to identifying the degree of flexibility associated with the lowest test error.
Bias-Variance Trade-off with Cross-Validation Methods
• The validation set procedure fits a model to only part of the data, so its estimate of the test error is biased.
• Since the LOOCV procedure fits the model to almost all of the data, it has the lowest bias in test error estimation. The k-fold method has an intermediate level of bias.
• In terms of variance, LOOCV estimates suffer from high variance in comparison to k-fold estimates.
Parvandeh, S., Yeh, H. W., Paulus, M. P., & McKinney, B. A. (2020). Consensus features nested cross-validation. Bioinformatics, 36(10), 3093–3098.
Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7(1), 1–8.
Bootstrapping
• In bootstrapping, we repeatedly sample data with replacement from the original dataset to create many bootstrap datasets.
• The model is then fit on each of the datasets so created, and the statistic of interest is calculated across all these datasets.
Graphical illustration of sampling with replacement
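Sampling with replacement can be sketched as follows. This Python illustration (hypothetical simulated data; the course uses R) bootstraps the sample mean and uses the spread across bootstrap datasets to estimate its standard error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Original sample (hypothetical simulated data).
data = rng.normal(loc=5.0, scale=2.0, size=200)

# Draw B bootstrap datasets: each samples n observations WITH replacement
# from the original data, then computes the statistic of interest (here,
# the mean) on that dataset.
B = 1000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])

# The spread of the statistic across the bootstrap datasets estimates
# its standard error.
se_hat = boot_means.std(ddof=1)
print(se_hat)
```

Because each draw is with replacement, some observations appear multiple times in a bootstrap dataset while others are left out entirely, which is what the graphical illustration above depicts.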