Resampling Methods - ML
Module-3 (part-A)
Dr. Debanjali Bhattacharya
Sample: a subset of a population that seeks to accurately reflect the
characteristics of the larger group.
Resampling methods
• Resampling: This method involves repeatedly drawing samples
from a training set and refitting a model on each sample in order
to obtain additional information about the fitted model.
• Test error rate: The test error is the average error that results
from using a statistical learning method to predict the response
on a new observation, i.e. a measurement that was not used to
train the method.
• Training error rate often is quite different from the test error
rate.
• The model is fit on the training set, and the fitted model is used
to predict the responses for the observations in the validation
set.
• The validation set error rate is estimated by the MSE that results from
fitting various regression models on the training sample and evaluating
their performance on the validation sample.
• Say the validation set MSE for the quadratic fit is considerably smaller
than that for the linear fit. However, the validation set MSE for the
cubic fit is actually slightly larger than for the quadratic fit.
• This implies that including a cubic term in the regression does not lead
to better prediction than simply using a quadratic term.
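• As an illustrative sketch of the validation set approach (using synthetic data and scikit-learn, not the dataset referred to above), the snippet below splits the data once into training and validation halves, fits linear, quadratic and cubic regressions on the training half, and compares their validation MSE:

```python
# Validation set approach: a minimal sketch on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, 200)  # quadratic ground truth

# Split once into a training set and a validation set (roughly half and half).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

for degree in (1, 2, 3):  # linear, quadratic and cubic fits
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {val_mse:.3f}")
```
With a quadratic ground truth, the validation MSE typically drops sharply from the linear to the quadratic fit and changes little (or worsens slightly) for the cubic fit, mirroring the observation above.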
Validation set approach
Drawbacks:
1) The validation estimate of the test error rate can be highly variable,
depending on precisely which observations are included in the
training set and which observations are included in the validation set.
• Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − ŷ1)²
provides an approximately unbiased estimate for the test error.
• But even though MSE1 is unbiased for the test error, it is a poor
estimate because it is highly variable, since it is based upon a
single observation (x1, y1).
1. It has less bias. In LOOCV, we repeatedly fit the statistical learning method
using training sets that contain n − 1 observations, almost as many as are in
the entire data set. This is in contrast to the validation set approach, in which
the training set is typically around half the size of the original data set.
2. The LOOCV approach does not over-estimate the test error rate as much as
the validation set approach does.
3. In contrast to the validation approach which will yield different results when
applied repeatedly due to randomness in the training/validation set splits,
performing LOOCV multiple times will always yield the same results: there is
no randomness in the training/validation set splits.
4. LOOCV is a very general method, and can be used with any kind of predictive
modeling.
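• A minimal LOOCV sketch (synthetic data, scikit-learn assumed): each of the n models is trained on n − 1 observations and tested on the single held-out observation, and the LOOCV estimate is the average of the n squared errors.

```python
# LOOCV: a minimal sketch on synthetic data.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 50)

loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")
print("number of fits:", len(scores))          # equals n
print("LOOCV MSE estimate:", -scores.mean())   # average of the n held-out errors
# Re-running this gives exactly the same estimate: there is no random split.
```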
Leave-one-out cross-validation
• Disadvantages of LOOCV: it is computationally expensive, since the model must
be refit n times, and its test error estimate tends to have high variance
(see the bias/variance trade-off below).
• On the other hand, LOOCV gives approximately unbiased estimates of the test
error, since each training set contains n − 1 observations, which is almost as
many as the number of observations in the full data set.
• Why?
Bias/Variance trade-off in K-fold CV
• When we perform LOOCV, we are in effect averaging the outputs of n fitted
models, each trained on an almost identical set of observations, so these
outputs are highly correlated with each other; in k-fold CV with k < n, we
average the outputs of k models that are less correlated.
• Since the mean of many highly correlated quantities has higher variance than
the mean of many quantities that are not as highly correlated, the test error
estimate resulting from LOOCV tends to have higher variance than the test
error estimate resulting from k-fold CV.
Bias/Variance trade-off in K-fold CV
• In the classification setting, the LOOCV error rate takes the form
CV(n) = (1/n) Σ_{i=1..n} Err_i , where Err_i = I(y_i ≠ ŷ_i) equals 1 if the
i-th held-out observation is misclassified and 0 otherwise.
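• A minimal k-fold CV sketch in the classification setting (scikit-learn and its built-in iris data assumed): the CV error rate is one minus the average held-out accuracy, i.e. the average of the misclassification indicators Err_i.

```python
# k-fold CV error rate for a classifier: a minimal sketch.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

for k in (5, 10):
    acc = cross_val_score(clf, X, y, cv=k, scoring="accuracy")
    print(f"{k}-fold CV error rate = {1 - acc.mean():.3f}")  # average of Err_i over held-out points
```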
The Bootstrap
• We train the model on the bootstrap sample and evaluate the skill of the
model on those samples not included in the bootstrap sample.
• These samples not included in a given bootstrap sample are called the
out-of-bag (OOB) samples.
The Bootstrap
“The samples not selected are usually referred to as the ‘out-of-bag’
samples. For a given iteration of bootstrap resampling, a model is built
on the selected samples and is used to predict the out-of-bag samples.”
— Kuhn and Johnson, Applied Predictive Modeling, p. 72, 2013.
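• A minimal bootstrap sketch (scikit-learn's resample utility assumed): draw a resample of the same size with replacement; the observations never selected form the out-of-bag (OOB) set on which the model fit to the resample could be evaluated.

```python
# Bootstrap resampling and out-of-bag (OOB) samples: a minimal sketch.
import numpy as np
from sklearn.utils import resample

data = np.arange(10)                      # a toy "dataset" of 10 observations
boot = resample(data, replace=True, n_samples=len(data), random_state=1)
oob = np.setdiff1d(data, boot)            # observations not drawn into the resample

print("bootstrap sample  :", boot)
print("out-of-bag samples:", oob)
# On average about 63.2% of the observations appear in a given bootstrap
# sample, so roughly 36.8% are left out-of-bag.
```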
The Bootstrap example
Ref: https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
Decision Tree
Some Characteristics
• All nodes drawn as circles (ellipses) are called internal nodes.
• All nodes drawn as rectangular boxes are called terminal nodes or
leaf nodes.
• Decision trees are typically drawn upside down, in the sense that
the leaves are at the bottom of the tree.
• The points along the tree where the predictor space is split are
referred to as internal nodes. In the figure, the two internal nodes
are indicated by the text Years<4.5 and Hits<117.5.
• Easier to interpret
• It has a nice graphical representation.
Prediction via Stratification of the Feature Space
• 1. We divide the predictor space into J distinct and non-overlapping
regions, R1, . . . , RJ.
• 2. For every observation that falls into the region Rj, we make
the same prediction, which is simply the mean of the response
values for the training observations in Rj.
How do we construct the regions R1, . . . , RJ?
• For any j and s, we define the pair of half-planes R1(j, s) = {X | Xj < s}
and R2(j, s) = {X | Xj ≥ s}, and we seek the values of j and s that minimize
Σ_{i: xi ∈ R1(j,s)} (yi − ŷR1)² + Σ_{i: xi ∈ R2(j,s)} (yi − ŷR2)²,
where ŷR1 is the mean response for the training observations in R1(j, s),
and ŷR2 is the mean response for the training observations in R2(j, s).
• Finding the values of j and s that minimize the above expression can be
done quickly if the number of features p is not too large.
3. We repeat the process, looking for the best predictor (j) and best cut-point
(s) in order to split the data further so as to minimize the RSS within each of
the resulting regions.
• Again, we look to split one of these three regions further (say R3, next slide)
into R4 and R5, based on some other attribute so as to minimize the RSS.
• Once the regions R1, . . . , RJ have been created, we predict the response for a
given test observation using the mean of the training observations in the
region to which that test observation belongs.
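• A minimal sketch of one greedy step of recursive binary splitting (plain NumPy, synthetic data): scan every predictor j and candidate cut-point s, and keep the (j, s) pair that minimizes the combined RSS of the two resulting regions; the same procedure is then applied recursively inside each region.

```python
# One greedy step of recursive binary splitting: a minimal sketch.
import numpy as np

def best_split(X, y):
    """Return (rss, j, s) for the split that minimizes the combined RSS."""
    best = None
    for j in range(X.shape[1]):                      # every predictor
        for s in np.unique(X[:, j]):                 # every candidate cut-point
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or rss < best[0]:
                best = (rss, j, s)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = np.where(X[:, 0] < 4.5, 5.0, 7.0) + rng.normal(0, 0.5, 100)  # true split on feature 0 at 4.5

rss, j, s = best_split(X, y)
print(f"best split: feature {j} at cut-point {s:.2f} (RSS = {rss:.2f})")
```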
The output of recursive binary splitting
on a two-dimensional example.
• A smaller tree with fewer splits (that is, fewer regions R1, . . .,RJ )
might lead to lower variance and better interpretation at the
cost of a little bias.
Averaging the f̂ prediction models, each fit on a separate bootstrapped
training set, gives f̂avg(x) = (1/B) Σ_{b=1..B} f̂*b(x), a single low-variance model.
Bagging
• These trees are grown deep and are not pruned; hence each individual tree has high variance, but low bias.
• We can predict the response for the i’th observation using each
of the trees in which that observation was OOB.
• In order to obtain a single prediction for the i’th observation,
we can average these predicted responses (if regression is the
goal) or can take a majority vote (if classification is the goal).
• The resulting OOB error is a valid estimate of the test error for
the bagged model, since the response for each observation is
predicted using only the trees that were not fit using that
observation.
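• A minimal bagging sketch (scikit-learn assumed; its BaggingRegressor uses regression trees as the default base learner): B bootstrapped trees are averaged, and the OOB predictions give a test-error estimate without a separate test set.

```python
# Bagged regression trees with an out-of-bag (OOB) estimate: a minimal sketch.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

bag = BaggingRegressor(n_estimators=100,   # B bootstrapped trees (default base learner: a decision tree)
                       oob_score=True,     # score each point using only trees it was OOB for
                       random_state=0).fit(X, y)

print("OOB R^2 (test-error estimate without a held-out set):", round(bag.oob_score_, 3))
```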
Bagging
• The nodes in the decision tree represent attributes that are used
for predicting the outcome.
Decision tree… a quick recap!
Random Forest
• A random forest combines several decision trees.
• Random forests are bagged decision tree models that use a subset of
features on each split. In every random forest tree, a subset of
features is selected randomly at the node’s splitting point.
• This means that at each split of the tree, the model considers only a
small subset of features rather than all of the features of the model.
• That is, from the set of available features ‘n’, a subset of ‘m’ features
(m < n) is selected at random. In this way, variance can be averaged away.
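• A minimal random forest sketch (scikit-learn assumed): the key difference from plain bagging is max_features, the number m of randomly chosen features considered at each split (the 'sqrt' setting uses roughly the square root of the total number of features).

```python
# Random forest with per-split feature subsampling: a minimal sketch.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=16, n_informative=6, random_state=0)

rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",   # m ≈ sqrt(n) features considered per split
                            random_state=0)
print("5-fold CV accuracy:", round(cross_val_score(rf, X, y, cv=5).mean(), 3))
```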
Feature Randomness in random forest
• The leaf node of each tree is the final output produced by that
specific decision tree.
• In boosting, a second model is then built which tries to correct the errors
(misclassifications) made by the first model.
1. Initialize the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results have been obtained, go to step 5; otherwise, go to step 2.
5. End.
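• A minimal sketch of this loop using AdaBoost (scikit-learn assumed, synthetic data): each round re-weights the wrongly classified points so that the next weak learner focuses on them, and the weak learners are combined into one strong classifier.

```python
# Boosting via AdaBoost: a minimal sketch of the re-weighting loop above.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default base learner is a depth-1 decision tree (a "stump").
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print("5-fold CV accuracy:", round(cross_val_score(boost, X, y, cv=5).mean(), 3))
```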
Boosting Algorithm
Explanation: The above diagram explains the AdaBoost algorithm.
• B1 consists of 10 data points of two classes, plus (+) and minus (−): 5 are plus
and 5 are minus, and each point is initially assigned equal weight. The first model
tries to classify the data points and generates a vertical separator line, but it
wrongly classifies 3 plus (+) points as minus (−).
• B2 consists of the same 10 data points, in which the 3 wrongly classified plus (+)
points are weighted more, so that the current model tries harder to classify these
pluses correctly. This model generates a vertical separator line that correctly
classifies the previously misclassified pluses, but in this attempt it wrongly
classifies three minus (−) points.
• B3 consists of the same 10 data points, in which the 3 wrongly classified minus (−)
points are weighted more, so that the current model tries harder to classify these
minuses correctly. This model generates a horizontal separator line that correctly
classifies the previously misclassified minuses.
• B4 combines B1, B2, and B3 in order to build a strong prediction model, which
performs much better than any individual model used.
How the sample weights affect the decision boundary: AdaBoost trains a
sequence of models with augmented sample weights, generating ‘confidence’
coefficients (alpha) for the individual classifiers based on their errors.
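• A minimal sketch of one AdaBoost round on a toy version of the B1 example above (plain NumPy), showing how the confidence coefficient alpha is computed from the weighted error and how the weights of misclassified points are increased:

```python
# One AdaBoost round: computing alpha and updating the sample weights.
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])     # 5 plus, 5 minus
y_pred = np.array([1, 1, -1, -1, -1, -1, -1, -1, -1, -1])  # weak learner misclassifies 3 pluses
w = np.full(len(y_true), 1 / len(y_true))                  # equal initial weights

err = w[y_pred != y_true].sum()           # weighted error of this weak learner
alpha = 0.5 * np.log((1 - err) / err)     # 'confidence' coefficient for this classifier

w = w * np.exp(-alpha * y_true * y_pred)  # up-weight mistakes, down-weight correct points
w = w / w.sum()                           # renormalize so the weights sum to 1

print(f"error = {err:.2f}, alpha = {alpha:.3f}")
print("updated weights:", np.round(w, 3))
```
The three misclassified points end up with larger weights than the rest, which is exactly what makes the next weak learner concentrate on them (as in B2 above).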
• Prediction power: usually boosting > bagging (random forest) > a single decision
tree.