Machine Learning
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly
drawing samples from a training set and refitting a model of interest on each sample in order
to obtain additional information about the fitted model.
Resampling approaches can be computationally expensive, because they involve fitting the
same statistical method multiple times using different subsets of the training data.
Cross-validation can be used to estimate the test error associated with a given statistical
learning method in order to evaluate its performance.
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy
of a parameter estimate or of a given statistical learning method.
Suppose that we would like to estimate the test error associated with fitting a particular
statistical learning method on a set of samples. The validation set approach involves randomly
dividing the available set of samples into two parts, a training set and a validation set or
hold-out set. The model is fit on the training set, and the fitted model is used to predict the
responses for the observations in the validation set.
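The validation set approach can be sketched in a few lines. The data-generating process, the degree-2 polynomial model, and NumPy's polyfit are illustrative assumptions standing in for "a particular statistical learning method":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y is a noisy quadratic function of x.
n = 200
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(scale=0.5, size=n)

# Randomly divide the observations into a training set and a validation set.
idx = rng.permutation(n)
train, val = idx[: n // 2], idx[n // 2 :]

# Fit the model on the training set only.
coefs = np.polyfit(x[train], y[train], deg=2)

# Use the fitted model to predict the held-out responses;
# the validation MSE is our estimate of the test error.
y_hat = np.polyval(coefs, x[val])
val_mse = np.mean((y[val] - y_hat) ** 2)
print(val_mse)
```

Re-running this with a different random split generally produces a noticeably different estimate, which is exactly the variability the approach suffers from.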
1. The validation estimate of the test error rate can be highly variable, depending on precisely
which observations are included in the training set and which observations are included in the
validation set.
2. In the validation approach, only a subset of the observations (those included in the
training set rather than in the validation set) is used to fit the model. Since statistical
methods tend to perform worse when trained on fewer observations, this suggests that the
validation set error rate may tend to overestimate the test error rate for the model fit on the
entire data set.
Leave-One-Out Cross-Validation (LOOCV)
Like the validation set approach, LOOCV involves splitting the set of observations into two
parts. However, instead of creating two subsets of comparable size, a single observation (x1, y1)
is used for the validation set, and the remaining observations {(x2, y2), . . . , (xn, yn)} make up
the training set. The statistical learning method is fit on the n − 1 training observations, and a
prediction ŷ1 is made for the excluded observation, using its value x1. Since (x1, y1) was not
used in the fitting process, MSE1 = (y1 − ŷ1)² provides an approximately unbiased estimate
of the test error. But even though MSE1 is unbiased for the test error, it is a poor estimate
because it is highly variable, since it is based upon a single observation (x1, y1).
First, in LOOCV, we repeatedly fit the statistical learning method using training sets that contain
n − 1 observations, almost as many as are in the entire data set. This is in contrast to the
validation set approach, in which the training set is typically around half the size of the original
data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as
much as the validation set approach does.
Second, in contrast to the validation approach which will yield different results when applied
repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple
times will always yield the same results: there is no randomness in the training/validation set
splits.
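The procedure above can be sketched directly; the hypothetical quadratic data and polynomial fit are again illustrative stand-ins for the statistical learning method:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(scale=0.5, size=n)

# LOOCV: fit on the n - 1 remaining observations, predict the one left out.
errors = []
for i in range(n):
    mask = np.arange(n) != i
    coefs = np.polyfit(x[mask], y[mask], deg=2)
    y_hat_i = np.polyval(coefs, x[i])
    errors.append((y[i] - y_hat_i) ** 2)

# The LOOCV estimate is the average of the n squared errors.
cv_n = np.mean(errors)
print(cv_n)
```

Note that the loop involves no random splitting, so repeating LOOCV on the same data always returns the identical estimate.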
An alternative to LOOCV is k-fold CV. This approach involves randomly dividing the set of
observations into k groups, or folds, of approximately equal size. The first fold is treated as a
validation set, and the method is fit on the remaining k − 1 folds. The mean squared error,
MSE1 , is then computed on the observations in the held-out fold. This procedure is repeated
k times; each time, a different group of observations is treated as a validation set. This
process results in k estimates of the test error, MSE1 , MSE2 , . . . , MSEk . The k-fold CV
estimate is computed by averaging these values,
CV(k) = (1/k) (MSE1 + MSE2 + · · · + MSEk).
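A minimal k-fold CV sketch under the same hypothetical setup (the data, the degree-2 polynomial model, and k = 5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 5
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(scale=0.5, size=n)

# Randomly divide the observations into k folds of approximately equal size.
folds = np.array_split(rng.permutation(n), k)

mses = []
for fold in folds:
    # Fit on the remaining k - 1 folds, evaluate on the held-out fold.
    train = np.setdiff1d(np.arange(n), fold)
    coefs = np.polyfit(x[train], y[train], deg=2)
    y_hat = np.polyval(coefs, x[fold])
    mses.append(np.mean((y[fold] - y_hat) ** 2))

# CV(k) is the average of the k fold-wise MSE estimates.
cv_k = np.mean(mses)
print(cv_k)
```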
The expected test MSE, for a given value x0 , can always be decomposed into the sum of three
fundamental quantities: the variance of fˆ (x0 ), the squared bias of fˆ (x0 ) and the variance of
the error terms ϵ. That is,
E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ϵ).

Here the notation E(y0 − f̂(x0))² defines the expected test MSE at x0, and refers to the average
test MSE that we would obtain if we repeatedly estimated f using a large number of training
sets, and tested each at x0. The overall expected test MSE can be computed by averaging
E(y0 − f̂(x0))² over all possible values of x0 in the test set.
The relationship between bias, variance, and test set MSE is referred to as the bias-variance
trade-off.
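This decomposition can be checked numerically. The sketch below assumes a true function f(x) = x² with Gaussian noise (illustrative choices), deliberately underfits with a linear model, and compares Var(f̂(x0)) + Bias² + Var(ϵ) against a direct simulation of the expected test MSE at x0:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2   # assumed true regression function
sigma = 0.5          # noise standard deviation, so Var(eps) = 0.25
x0 = 1.0             # the test point at which we decompose the error
reps, n = 2000, 50

# Repeatedly draw a fresh training set and record the prediction at x0.
preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coefs = np.polyfit(x, y, deg=1)   # deliberately underfit: linear model
    preds[r] = np.polyval(coefs, x0)

variance = preds.var()                     # Var(f-hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2      # squared bias at x0

# Direct estimate of the expected test MSE: a fresh y0 for every fit.
y0 = f(x0) + rng.normal(scale=sigma, size=reps)
emse = np.mean((y0 - preds) ** 2)

# The two quantities should agree up to Monte Carlo error.
print(variance + bias_sq + sigma**2, emse)
```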
We have illustrated the use of cross-validation in the regression setting where the outcome Y
is quantitative, and so have used MSE to quantify test error. But cross-validation can also be
a very useful approach in the classification setting when Y is qualitative. In this setting,
cross-validation works just as described earlier, except that rather than using MSE to quantify
test error, we instead use the number of misclassified observations. For instance, in the
classification setting, the LOOCV error rate takes the form
CV(n) = (1/n) (Err1 + Err2 + · · · + Errn),

where Erri = I(yi ≠ ŷi). The k-fold CV error rate and validation set error rates are defined
analogously.
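A sketch of the LOOCV misclassification rate, using a 1-nearest-neighbour classifier on hypothetical one-dimensional two-class data (both the classifier and the data are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class data: class 1 observations tend to have larger x.
n = 80
x = rng.normal(size=n) + np.repeat([0.0, 2.0], n // 2)
y = np.repeat([0, 1], n // 2)

# LOOCV error rate: Err_i = I(y_i != yhat_i), averaged over all n splits.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    # 1-NN prediction: the label of the closest remaining observation.
    nearest = np.argmin(np.abs(x[mask] - x[i]))
    y_hat_i = y[mask][nearest]
    errs.append(int(y_hat_i != y[i]))

cv_n = np.mean(errs)
print(cv_n)
```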
Dr. A V Prajeesh August 21, 2022 13 / 23
Bootstrap
The bootstrap is a widely applicable and extremely powerful statistical tool that can
be used to quantify the uncertainty associated with a given estimator or statistical learning
method. As a simple example, the bootstrap can be used to estimate the standard errors of
the coefficients from a linear regression fit. However, the power of the bootstrap lies in the
fact that it can be easily applied to a wide range of statistical learning methods, including
some for which a measure of variability is otherwise difficult to obtain and is not automatically
output by statistical software.
For example, suppose we wish to invest a fraction α of our money in a quantity X and the
remaining 1 − α in a quantity Y, choosing α to minimize the variance of the combined
investment. The minimizing value is

α = (σY² − σXY) / (σX² + σY² − 2σXY),

where σX² = Var(X), σY² = Var(Y), and σXY = Cov(X, Y). Since these quantities are
unknown, we estimate α by plugging in the sample variances and covariance:

α̂ = (σ̂Y² − σ̂XY) / (σ̂X² + σ̂Y² − 2σ̂XY).
Ideally, to quantify the variability of α̂ we would generate many data sets, say 100 or 1,000,
compute α̂ on each, and take the standard deviation of these estimates as the standard error
of α̂. In a real-world situation we cannot draw new samples from the population, so the
bootstrap emulates this process by generating such data sets from the single observed data
set, sampling from it with replacement.
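This idea can be sketched for α̂: resample the observed data with replacement B times, recompute α̂ on each bootstrap sample, and take the standard deviation of the B estimates as the bootstrap standard error. The simulated (X, Y) distribution and B = 1000 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_hat(X, Y):
    # Plug-in estimate of alpha from the sample variances and covariance.
    cov = np.cov(X, Y)
    sx2, sy2, sxy = cov[0, 0], cov[1, 1], cov[0, 1]
    return (sy2 - sxy) / (sx2 + sy2 - 2 * sxy)

# One observed data set of n = 100 (X, Y) pairs (hypothetical returns).
n = 100
mean, covm = [0, 0], [[1.0, 0.5], [0.5, 1.25]]
X, Y = rng.multivariate_normal(mean, covm, size=n).T

# Bootstrap: resample the n observations WITH replacement, B times,
# recomputing alpha-hat on each bootstrap data set.
B = 1000
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = alpha_hat(X[idx], Y[idx])

# The standard deviation of the B estimates is the bootstrap SE of alpha-hat.
se_alpha = boot.std(ddof=1)
print(alpha_hat(X, Y), se_alpha)
```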
Figure 2: Validation set approach