
Resampling methods

Module-3 (part-A)
Dr. Debanjali Bhattacharya
Sample: a subset of a population that seeks to accurately reflect the
characteristics of the larger group.
Resampling methods
• Resampling: This method involves repeatedly drawing samples
from a training set and refitting a model on each sample in order
to obtain additional information about the fitted model.

• For example, in order to estimate the variability of a linear


regression fit, we can repeatedly draw different samples from
the training data, fit a linear regression to each new sample, and
then examine the extent to which the resulting fits differ.

• Such an approach may allow us to obtain information that would


not be available from fitting the model only once, using one
training sample.
Simulated data set. Left: The red line represents the true relationship, f(X) = 3X+2,
which is known as the population regression line. The blue line is the least squares
line; it is the least squares estimate for f(X) based on the observed data, shown in
black.
Right: The population regression line is again shown in red, and the least squares line
in dark blue. In light blue, ten least squares lines are shown, each computed on the
basis of a separate random set of observations. Each least squares line is different,
but on average, the least squares lines are quite close to the population regression
line.
Resampling methods
• Resampling approaches can be computationally expensive,
because they involve fitting the same statistical method multiple
times using different subsets of the training data.

• However, due to recent advances in computing power, the


computational requirements of resampling methods generally
are not prohibitive.

• Two of the most commonly used resampling methods are


 cross-validation
 bootstrap.
Cross-validation
• Cross-validation can be used to estimate the test error
associated with a given statistical learning method in order to
evaluate its performance, or to select the appropriate level of
flexibility.

• The process of selecting the proper level of flexibility for a model


is known as model selection (d = ?).

• The process of evaluating a model’s performance is known as


model assessment (MSE = ? for selected model).
Validation set approach

• Test error rate: The test error is the average error that results
from using a statistical learning method to predict the response
on a new observation; that is, an observation that was not used in
training the method.

• Training error rate: It is calculated by applying the statistical


learning method to the observations used for training.

• Can training error rate and test error rate be similar


(approximately)?
Validation set approach

• Training error rate often is quite different from the test error
rate.

• To solve this problem, generally we estimate the test error rate


by holding out a subset of the training observations from the
fitting process, and then applying the statistical learning method
to those held out observations.

• This approach is called “validation-set approach”.


Validation set approach
Validation-set approach:
• It involves randomly dividing the available set of observations
into two parts, a training set and a validation set or hold-out set.

• The model is fit on the training set, and the fitted model is used
to predict the responses for the observations in the validation
set.

• The resulting validation set error rate, typically assessed using
the MSE in the case of a quantitative response, provides an
estimate of the test error rate.
Validation set approach
Example:
• We randomly split the 392 observations into two sets, a training set
containing 196 of the data points, and a validation set containing the
remaining 196 observations.

• The validation set error rates are the MSEs that result from
fitting various regression models on the training sample and evaluating
their performance on the validation sample.

• Say the validation set MSE for the quadratic fit is considerably smaller
than that for the linear fit. However, the validation set MSE for the
cubic fit is actually slightly larger than for the quadratic fit.

• This implies that including a cubic term in the regression does not lead
to better prediction than simply using a quadratic term.
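
As an illustration, here is a minimal sketch of the validation-set approach with scikit-learn. It assumes a pandas DataFrame named auto with columns "horsepower" and "mpg" (illustrative names, not defined in these slides) and compares the validation MSE of the linear, quadratic, and cubic fits.

```python
# A sketch only: `auto` is an assumed pandas DataFrame (e.g. the ISLR Auto
# data) with a "horsepower" predictor and an "mpg" response.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

X = auto[["horsepower"]].values
y = auto["mpg"].values

# Randomly split the observations into a training half and a validation half.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 2, 3):  # linear, quadratic, cubic fits
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: validation MSE = {val_mse:.2f}")
```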
Validation set approach
Drawbacks:
1) The validation estimate of the test error rate can be highly variable,
depending on precisely which observations are included in the
training set and which observations are included in the validation set.

2) In the validation approach, only a subset of the observations (those
included in the training set) are used to fit the model. Thus the
performance can be poor, since the model is trained on fewer
observations.

• How to overcome these limitations?


Cross-validation

• Cross-validation- a refinement of the validation


set approach that addresses these two issues.

1. Leave-one-out cross-validation (LOOCV)


2. K-fold cross-validation (K-fold CV)
Leave-one-out cross-validation

• Similar to the validation set approach, LOOCV involves splitting


the set of observations into two parts.

• However, instead of creating two subsets of comparable size, a


single observation (x1, y1) is used for the validation set, and the
remaining observations {(x2, y2), . . . , (xn, yn)} make up the
training set.

• The statistical learning method is fit on the n − 1 training


observations, and a prediction is made for the excluded
observation (x1, y1).
Leave-one-out cross-validation

• Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − ŷ1)²
provides an approximately unbiased estimate for the test error.

• But even though MSE is unbiased for the test error, it is a poor
estimate because it is highly variable, since it is based upon a
single observation (x1, y1).

• We can repeat the procedure by selecting (x2, y2) for the
validation data, training the statistical learning procedure on the
n − 1 observations {(x1, y1), (x3, y3), . . . , (xn, yn)}, and
computing MSE2 = (y2 − ŷ2)².
Leave-one-out cross-validation

• Repeating this approach n times produces n squared errors,
MSE1, . . . , MSEn. The LOOCV estimate for the test MSE is the
average of these n test error estimates:

CV(n) = (1/n) Σ_{i=1}^{n} MSEi
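
A minimal LOOCV sketch with scikit-learn, assuming X and y hold the predictors and response; the LeaveOneOut splitter fits the model n times, and the mean of the n squared errors gives CV(n).

```python
# LOOCV sketch: X, y are assumed to hold the predictors and response.
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Each of the n folds fits on n - 1 observations and predicts the single
# held-out one; scikit-learn reports negative MSE, so negate it.
mse_i = -cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
cv_n = mse_i.mean()   # CV(n): the average of the n squared errors
print(f"LOOCV estimate of test MSE: {cv_n:.2f}")
```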
Leave-one-out cross-validation
• Advantages of LOOCV:

1. It has less bias. In LOOCV, we repeatedly fit the statistical learning method
using training sets that contain n − 1 observations, almost as many as are in
the entire data set. This is in contrast to the validation set approach, in which
the training set is typically around half the size of the original data set.

2. The LOOCV approach does not over-estimate the test error rate as much as
the validation set approach does.

3. In contrast to the validation approach which will yield different results when
applied repeatedly due to randomness in the training/validation set splits,
performing LOOCV multiple times will always yield the same results: there is
no randomness in the training/validation set splits.

4. LOOCV is a very general method, and can be used with any kind of predictive
modeling.
Leave-one-out cross-validation

• Disadvantages of LOOCV:

1. Computationally expensive to implement, since the model has


to be fit n times. This can be very time consuming if n is large,
and if each individual model is slow to fit.

Q> What to do if ‘n’ is too large?


K-fold cross-validation
• An alternative to LOOCV is k-fold CV.

• This approach involves randomly dividing the dataset of


observations into k groups, or folds. The first fold is treated as a
validation set, and the method is fit on the remaining k − 1
folds.

• The mean squared error, MSE1, is then computed on the


observations in the held-out fold.

• This procedure is repeated k times; each time, a different group
of observations is treated as a validation set.
K-fold cross-validation
• This process results in k estimates of the test error, MSE1, MSE2,
. . . , MSEk.

• The k-fold CV estimate is computed by averaging these values:

CV(k) = (1/k) Σ_{i=1}^{k} MSEi

• It is not hard to see that LOOCV is a special case of k-fold CV in


which k is set to equal n. In practice, one typically performs k-
fold CV using k = 5 or k = 10.
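
The same computation with k = 10 folds, again assuming X and y are available:

```python
# 10-fold CV sketch for the same X, y.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=10, shuffle=True, random_state=0)
mse_folds = -cross_val_score(LinearRegression(), X, y,
                             cv=kf, scoring="neg_mean_squared_error")
cv_k = mse_folds.mean()   # CV(k): average of MSE1, ..., MSEk
print(f"10-fold CV estimate of test MSE: {cv_k:.2f}")
```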
K-fold cross-validation
Q. What is the advantage of using k = 5 or k = 10 rather than k = n?

1. The most obvious advantage is computational. LOOCV requires
fitting the statistical learning method n times, which has the
potential to be computationally expensive. k-fold CV, like LOOCV, is a
very general approach that can be applied to almost any
statistical learning method.

2. Performing LOOCV may pose computational problems,
especially if n is extremely large. In contrast, performing 10-fold
CV requires fitting the learning procedure only ten times, regardless
of n, which may be much more feasible.
Bias/Variance trade-off in K-fold CV
• It is seen that the validation set approach can lead to overestimates of the test
error rate, since in this approach the training set (that is used to fit the statistical
learning method) contains only half the observations of the entire data set.

• Contrary to this LOOCV will give approximately unbiased estimates of the test
error, since each training set contains n − 1 observations, which is almost as
many as the number of observations in the full data set.

• Performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of


bias, since each training set contains (k − 1)n/k observations—fewer than in the
LOOCV approach, but substantially more than in the validation set approach.

• Therefore, from the perspective of bias reduction, it is clear that LOOCV is to


be preferred to k-fold CV.
Bias/Variance trade-off in K-fold CV

• However, bias is not the only source of concern in an
estimating procedure; we must also consider the procedure's
variance. It is seen that LOOCV has higher variance than
k-fold CV with k < n.

• Why ?
Bias/Variance trade-off in K-fold CV

• When we perform LOOCV, we are in effect averaging the outputs of n fitted


models, each of which is trained on an almost identical set of observations;
therefore, these outputs are highly (positively) correlated with each other.

• In contrast, when we perform k-fold CV with k < n, we are averaging the


outputs of k fitted models that are somewhat less correlated with each
other, since the overlap between the training sets in each model is smaller.

• Since the mean of many highly correlated quantities has higher variance as
compared to the mean of many quantities that are not as highly
correlated, the test error estimate resulting from LOOCV tends to have
higher variance as compared to the test error estimate resulting from k-
fold CV.
Bias/Variance trade-off in K-fold CV

• To summarize, there is a bias-variance trade-off associated with


the choice of k in k-fold cross-validation.

• Typically, given these considerations, one performs k-fold cross-


validation using k = 5 or k = 10, as these values have been shown
empirically to yield test error rate estimates that suffer neither
from excessively high bias nor from very high variance.
Cross-validation on classification problem

• For Cross-validation in the regression setting (where the outcome Y


is quantitative), we use MSE to quantify test error.

• Cross-validation can also be a very useful approach in the


classification setting when Y is qualitative/categorical.

• In this setting cross-validation works just as described earlier in this


chapter, except that rather than using MSE to quantify test error,
we instead use the number of misclassified observations.
Cross-validation on classification problem

• In the classification setting, the LOOCV error rate takes the form

CV(n) = (1/n) Σ_{i=1}^{n} Erri,

where Erri = I(yi ≠ ŷi) indicates whether the i'th observation is misclassified.
The Bootstrap

• The Bootstrap Sampling Method is a very simple concept and is a


building block for some of the more advanced machine learning
algorithms like AdaBoost and XGBoost.

• The bootstrap sampling method is a resampling method that uses


random sampling with replacement, in order to estimate statistics on a
population.

• It can be used to estimate summary statistics such as the mean or


standard deviation.
The Bootstrap
• How it works?
• Suppose you have an initial sample with 3 observations. Using the
bootstrap sampling method, you’ll create a new sample with 3
observations as well.
• Each observation has an equal chance of being chosen (1/3).
• In this case, the second observation was chosen randomly and will be
the first observation in our new sample.
The Bootstrap
• How it works?
• After choosing another observation at random, you chose the green
observation.
The Bootstrap
• How it works?
• Lastly, the yellow observation is chosen again at random. Remember
that bootstrap sampling uses random sampling with replacement.
• This means that it is very much possible for an already chosen
observation to be chosen again.
• This is the essence of bootstrap sampling!
The Bootstrap
• How it works?

The process for building one sample can be summarized as follows:


– Randomly select an observation from the dataset
– Add it to the sample
– Repeat until the new sample contains the desired number of observations
The Bootstrap
• The bootstrap method can be used to estimate a quantity of a
population. This is done by repeatedly taking small samples, calculating
the statistic, and taking the average of the calculated statistics.

• We can summarize this procedure as follows:


1. Choose a number of bootstrap samples to perform
2. Choose a sample size (same size as the original dataset)
3. For each bootstrap sample
– Draw a sample with replacement with the chosen size
– Calculate the statistic on the sample
4. Calculate the mean of the calculated sample statistics
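
A small sketch of this procedure in plain NumPy, using a synthetic sample and the mean as the statistic (the data and the number of bootstrap samples are illustrative choices):

```python
# Bootstrap sketch: estimate the mean of a sample and the standard error
# of that estimate. The data and the number of resamples are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)   # the "original" sample

boot_means = []
for _ in range(1000):                           # number of bootstrap samples
    # Draw with replacement, same size as the original dataset.
    sample = rng.choice(data, size=data.shape[0], replace=True)
    boot_means.append(sample.mean())            # statistic on this sample

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error of the mean:", np.std(boot_means, ddof=1))
```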
The Bootstrap
• The procedure can also be used to estimate the skill of a machine learning
model.

• We train the model on a bootstrap sample and evaluate its skill on
those observations not included in that sample.

• The observations not included in a given bootstrap sample are called
the out-of-bag (OOB) samples.
The Bootstrap
“The samples not selected are usually referred to as the “out-of-
bag” samples. For a given iteration of bootstrap resampling, a
model is built on the selected samples and is used to predict the
out-of-bag samples.”
— Kuhn & Johnson, Applied Predictive Modeling, page 72, 2013.
The Bootstrap example

Ref: https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
Decision Tree
Some Characteristics

• It starts with root node.

• All nodes drawn with circle (ellipse) are called internal nodes.

• All nodes drawn with rectangle boxes are called terminal nodes or
leaf nodes.

• Edges of a node represent the outcome for a value of the node.


Tree based method for regression and
classification

• Decision trees involve stratifying or segmenting the predictor


space into a number of simple regions.

• In order to make a prediction for a given observation, we


typically use the mean or the mode of the training
observations in the region to which it belongs.

• Since the set of splitting rules used to segment the predictor
space can be summarized in a tree-like structure, these types of
approaches are known as decision tree methods.
Tree based method for regression and
classification

• Basics of Decision Trees:


Decision trees can be applied to both regression and classification
problems.
Tree based method for regression and
classification

• Decision tree for regression analysis:

Objective: To predict Baseball Players’ Salaries Using Regression


Trees.
We use the ‘Hitters data set’ to predict a baseball player’s
Salary based on Years (the number of years that he has played in
the major leagues) and Hits (the number of hits that he made in
the previous year).
How to construct the regression tree?

• It consists of a series of splitting rules, starting at the top of


the tree.

• The top split assigns observations (players) to the left branch,


who are having less than 4.5 years of experience (Years < 4.5).

• The predicted salary for these players is given by the mean
response value for the players in the data set with Years < 4.5.
For such players, the mean log salary is 5.107, and so we
make a prediction of e^5.107 thousands of dollars, i.e. $165,174, for
these players.
How to construct the regression tree?

• Players having more than 4.5 years of experience (Years>=4.5)


are assigned to the right branch, and then that group is further
subdivided by Hits.

• Overall, the tree segments (stratifies) the players into three


regions of predictor space:
(1) players who have played for 4.5 or fewer years,
(2) players who have played for 4.5 or more years and
who made fewer than 117.5 hits last year, and
(3) players who have played for 4.5 or more years and who
made at least 117.5 hits last year.
How to construct the regression tree?

• These 3 regions can be written as:


R1 ={X | Years<4.5},
R2 =(X | Years>=4.5, Hits<117.5}, and
R3 ={X | Years>=4.5, Hits>=117.5}.

The predicted salaries for these 3 groups are $1,000×e^5.107 = $165,174,
$1,000×e^5.999 = $402,834, and $1,000×e^6.740 = $845,346, respectively.
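
The fitted tree is just a set of nested if/else rules; here is a sketch of the prediction rule implied by the three regions above (thresholds and mean log salaries taken from the example):

```python
# The regression tree above, written as an explicit prediction rule.
import math

def predict_log_salary(years: float, hits: float) -> float:
    if years < 4.5:        # region R1
        return 5.107
    if hits < 117.5:       # region R2
        return 5.999
    return 6.740           # region R3

# Predicted salary in dollars for a player with 6 years and 150 hits
# (about $845,000 -- region R3's prediction, up to rounding).
print(round(1000 * math.exp(predict_log_salary(6, 150))))
```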
The three-region partition for the Hitters data set from the
regression tree. The figure illustrates the regions as a function of
Years and Hits.
Terminal nodes (leaves), internal nodes,
branches
• In keeping with the tree analogy, the regions R1, R2, and R3 are
known as terminal nodes or leaves of the tree.

• Decision trees are typically drawn upside down, in the sense that
the leaves are at the bottom of the tree.

• The points along the tree where the predictor space is split are
referred to as internal nodes. In Figure, the two internal nodes
are indicated by the text Years<4.5 and Hits<117.5.

• We refer to the segments of the trees that connect the nodes as


branches.
Interpretation of regression tree

We might interpret the regression tree displayed in


Figure as follows:
• Years is the most important factor in determining
Salary.

• Players with less experience earn lower salaries


than more experienced players.

• But among players who have been in the major


leagues for 4.5 years or more, the number of hits
made in the previous year will affect salary.

• Players who made more hits last year tend to have


higher salaries.
Advantages of decision tree

• Easier to interpret
• It has a nice graphical representation.
Prediction via Stratification of the Feature Space

We now discuss the process of building a regression tree. There


are two steps.

• 1. We divide the predictor space—that is, the set of possible


values for X1, X2, . . ., Xp—into J distinct and non-overlapping
regions, R1, R2, . . . , RJ .

• 2. For every observation that falls into the region Rj, we make
the same prediction, which is simply the mean of the response
values for the training observations in Rj .
# How do we construct the regions R1, . . .,RJ?

• The regions could have any shape. However, we choose to


divide the predictor space into high-dimensional rectangles,
or boxes, for simplicity and for ease of interpretation of the
resulting predictive model.

• The goal is to find boxes R1, . . . , RJ that minimize the RSS, given by

RSS = Σ_{j=1}^{J} Σ_{i ∈ Rj} (yi − ŷ_{Rj})²,

where ŷ_{Rj} is the mean response for the training
observations within the j'th box.
• However, it is computationally infeasible to consider every possible
partition of the feature space into J boxes.

• For this reason, we take a “top-down, greedy approach” that is


known as recursive binary splitting.

• The recursive binary splitting approach is top-down because it


begins at the top of the tree (at which point all observations
belong to a single region) and then successively splits the
predictor space; each split is indicated via two new branches
further down on the tree.

• It is greedy because at each step of the tree-building process, the


best split is made at that particular step, rather than looking
ahead and picking a split that will lead to a better tree in some
future step.
Recursive binary splitting

1. First select the predictor Xj and the cut-point s such that


splitting the predictor space into the regions {X|Xj < s} and
{X|Xj ≥ s} leads to the greatest possible reduction in RSS.

2. Consider all predictors X1, . . .,Xp, and all possible values of


the cut-point s for each of the predictors, and then choose
the predictor and cut-point such that the resulting tree has
the lowest RSS.

(Here, the notation {X|Xj < s} means the region of predictor


space in which Xj takes on a value less than s)
Recursive binary splitting

• In greater detail, for any j and s, we define the pair of half-planes
R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}, and we seek the
values of j and s that minimize

Σ_{i: xi ∈ R1(j,s)} (yi − ŷ_{R1})² + Σ_{i: xi ∈ R2(j,s)} (yi − ŷ_{R2})²,

where ŷ_{R1} is the mean response for the training observations in R1(j, s),
and ŷ_{R2} is the mean response for the training observations in R2(j, s).

• Finding the values of j and s that minimize the above expression can be
done quickly if the number of features p is not too large.
3. We repeat the process, looking for the best predictor (j) and best cut-point
(s) in order to split the data further so as to minimize the RSS within each of
the resulting regions.

• However in this example, instead of splitting the entire predictor space, we


split one of the two previously identified regions. We now have three regions:
R1, R2 and R3.

• Again, we look to split one of these three regions further (say R3, next slide)
into R4 and R5, based on some other attribute so as to minimize the RSS.

• The process continues until a stopping criterion is reached; for instance, we


may continue until no region contains more than five observations.

• Once the regions R1, . . . , RJ have been created, we predict the response for a
given test observation using the mean of the training observations in the
region to which that test observation belongs.
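
A from-scratch sketch of a single step of recursive binary splitting, assuming X is an (n, p) NumPy array and y a length-n array; it scans every predictor and cut-point and returns the pair (j, s) with the lowest RSS:

```python
# One step of recursive binary splitting: scan every predictor j and
# cut-point s, keep the pair that minimizes the RSS of the two regions.
import numpy as np

def best_split(X, y):
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of the split: squared deviations from each region's mean.
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Growing the full tree means applying best_split recursively to each new
# region until a stopping rule is met (e.g. fewer than five observations).
```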
The output of recursive binary splitting
on a two-dimensional example.

A tree corresponding to the partition.


Tree Pruning
• Decision tree might be too complex when all attributes are used
and thus is likely to overfit the data, leading to poor test set
performance.

• A smaller tree with fewer splits (that is, fewer regions R1, . . .,RJ )
might lead to lower variance and better interpretation at the
cost of a little bias.

• How to overcome this overfitting problem?


… by tree pruning
Tree Pruning
 How tree pruning is performed?
First grow a very large tree T0, and then prune it back in order to
obtain a subtree.

 How do we determine the best way to prune the tree?


• Our goal is to select a subtree that leads to the lowest test error
rate.
• Given a subtree, we can estimate its test error using cross-validation or
the validation set approach.
• However, estimating the cross-validation error for every possible
subtree would be too cumbersome, since there is an extremely large
number of possible subtrees. Instead, we need a way to select a small
set of subtrees for consideration out of all possible subtrees.
Tree Pruning
 Cost complexity pruning:

Here, rather than considering every possible subtree, we consider a
sequence of trees indexed by a nonnegative tuning parameter α. For each
value of α there corresponds a subtree T ⊂ T0 that minimizes

Σ_{m=1}^{|T|} Σ_{i: xi ∈ Rm} (yi − ŷ_{Rm})² + α|T|,

where |T| is the number of terminal nodes of tree T, Rm is the region
corresponding to the m'th terminal node, and ŷ_{Rm} is the mean of the
training responses in Rm.
Tree Pruning

• The tuning parameter α controls a trade-off between the subtree’s


complexity and its fit to the training data. When α = 0, then the subtree
T will simply equal T0 and the above equation just measures the
training error.

• However, as α increases, there is a price to pay for having a tree with


many terminal nodes, and so the quantity (in the eq.) will tend to be
minimized for a smaller subtree.
Tree Pruning

• It turns out that as we increase α from zero in the equation,


branches get pruned from the tree in a nested and predictable
fashion.

• We obtain the whole sequence of subtrees as a function of α.

• We can select a value of α using a validation set or using cross-


validation. We then return to the full data set and obtain the
subtree corresponding to α. This process is summarized in
Algorithm
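
A sketch of cost-complexity pruning with scikit-learn's DecisionTreeRegressor (its ccp_alpha parameter plays the role of α); X_train and y_train are assumed to hold the training predictors and response:

```python
# Cost-complexity pruning sketch: grow a large tree, get the sequence of
# subtrees indexed by alpha, and choose alpha by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

big_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
alphas = big_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

cv_mse = []
for a in alphas:
    scores = cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                             X_train, y_train, cv=6,
                             scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

best_alpha = alphas[int(np.argmin(cv_mse))]
# Refit on the full training data: the subtree corresponding to best_alpha.
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
```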
Although CV error is computed as a function of α, it is convenient to display the result as a
function of |T|, the number of leaves.
• Figures display the results of fitting and pruning a regression tree
on the Hitters data, using nine of the features.

• First, we randomly divided the data set in half, yielding 132


observations in the training set and 131 observations in the test
set.

• We then built a large regression tree on the training data and


varied α in order to create subtrees with different numbers of
terminal nodes.

• Finally, we performed six-fold cross-validation in order to estimate


the cross-validated MSE of the trees as a function of α.
• The CV error takes on its minimum, for a three-node tree, while
the test error also dips down at the three-node tree (though it
takes on its lowest value at the ten-node tree).
• The pruned tree containing three terminal nodes is shown here.
Classification tree

• A classification tree is very similar to a regression tree,


except that it is used to predict a qualitative
response rather than a quantitative one.

• Just as in the regression setting, we use recursive binary


splitting to grow a classification tree.

• However, in the classification setting, RSS cannot be used as
a criterion for making the binary splits. Instead we use the
classification error rate: the fraction of the training observations
in a region that do not belong to the most common class.
Trees Versus Linear Models
Which model is better?

• If the relationship between the features and the response is


well approximated by a linear model, then an approach
such as linear regression will likely work well, and will
outperform decision tree that does not exploit this linear
structure.

• If instead there is a highly non-linear and complex


relationship between the features and the response then
decision trees may outperform classical approaches.
Advantages and Disadvantages of Trees

• Trees are easier to explain and interpret than linear regression


since decision trees are displayed graphically and more closely
mirror human decision-making than do the regression and
classification approaches.

• Trees can be non-robust: a small change in the data can cause a


large change in the final estimated tree: suffer from high
variance.

• However, by aggregating many decision trees, using methods like


bagging, random forests, and boosting, the predictive
performance of trees can be substantially improved.
Bagging

• Use trees as building blocks to construct more powerful


prediction models.

• Decision trees suffer from high variance: splitting the


training data into two parts at random, and fitting a decision
tree to both halves, will give quite different results.

• Bootstrap aggregation, also called bagging, is a general-


purpose procedure for reducing the variance of a statistical
learning method.
Bagging

• Given a set of n independent observations Z1, . . . , Zn, each
with variance σ², the variance of the mean Z̄ of the observations
is given by σ²/n.

• Thus averaging a set of observations reduces variance.

• Hence a natural way to reduce the variance and increase the


prediction accuracy of a statistical learning method is to take
many training sets from the population, build a separate
prediction model using each training set, and average the
resulting predictions.
Bagging

• Calculate f̂¹(x), f̂²(x), . . . , f̂ᴮ(x) using B separate
training sets, and average them in order to
obtain a single low-variance statistical learning
model, given by

f̂_avg(x) = (1/B) Σ_{b=1}^{B} f̂ᵇ(x)
Bagging

• Since we generally do not have access to multiple


training sets, we can bootstrap, by taking repeated
samples from the single training data set.

• In this approach we generate B different


bootstrapped training data sets.

• We then train our method on the b'th
bootstrapped training set in order to get f̂*ᵇ(x).
Bagging

• Finally, average all the predictions to obtain

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂*ᵇ(x)

• This is called bagging.


Bagging
Bagging, or bootstrap aggregating,
is where we create bagged trees by
building X decision trees, each
trained on one of X bootstrapped
training sets.

The final predicted value is the


average value of all our X decision
trees.

One single decision tree has high


variance (tends to overfit), so
by bagging or combining many
weak learners into strong learners,
we are averaging away the
variance. It’s a majority vote!
Bagging
• To apply bagging to regression trees, we simply construct B regression
trees using B bootstrapped training sets, and average the resulting
predictions.

• These trees are grown deep, and are not pruned.

• Hence each individual tree has high variance, but low bias.

• Averaging these B trees reduces the variance.

• Bagging has been demonstrated to give impressive improvements in


accuracy by combining together hundreds or even thousands of trees
into a single procedure.
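
A hand-rolled bagging sketch, assuming X_train and y_train are NumPy arrays; scikit-learn's BaggingRegressor wraps the same idea. The bootstrap indices are kept so that the OOB error can be computed later.

```python
# Bagging B regression trees by hand.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
B, n = 500, len(y_train)
trees, bootstrap_indices = [], []
for _ in range(B):
    idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor()            # grown deep, not pruned
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)
    bootstrap_indices.append(idx)             # kept for the OOB error later

def bagged_predict(X_new):
    # Average the B trees' predictions to get a low-variance prediction.
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```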
Bagging
 How can bagging be extended to a classification problem where
Y is qualitative?

• In that situation, for a given test observation, we can record the


class predicted by each of the B trees, and take a majority vote:
i.e. the overall prediction is the most commonly occurring class
among the B predictions.
Bagging
• The key to bagging is that trees are repeatedly fit to
bootstrapped subsets of observations.

• It is seen that on average, each bagged tree makes use of


around two-thirds of the observations.

• The remaining one-third of the observations not used to fit a


given bagged tree are referred to as the out-of-bag (OOB)
observations.

• We can predict the response for the i’th observation using each
of the trees in which that observation was OOB.
• In order to obtain a single prediction for the i’th observation,
we can average these predicted responses (if regression is the
goal) or can take a majority vote (if classification is the goal).

• This leads to a single OOB prediction for the i’th observation.

• An OOB prediction can be obtained in this way for each of the


‘n’ observations, from which the overall OOB MSE (for a
regression problem) or classification error (for a classification
problem) can be computed.

• The resulting OOB error is a valid estimate of the test error for
the bagged model, since the response for each observation is
predicted using only the trees that were not fit using that
observation.
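
Continuing the bagging sketch above, the OOB error can be computed by averaging, for each observation, only the trees whose bootstrap sample did not contain it; in scikit-learn the same estimate is available by passing oob_score=True to BaggingRegressor or RandomForestRegressor.

```python
# OOB error from the bagged trees built above: each observation is
# predicted only by trees whose bootstrap sample did not contain it.
import numpy as np

oob_sum = np.zeros(n)
oob_count = np.zeros(n)
for tree, idx in zip(trees, bootstrap_indices):
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False                     # mark in-bag observations
    oob_sum[oob_mask] += tree.predict(X_train[oob_mask])
    oob_count[oob_mask] += 1

# With B large, every observation is OOB for roughly B/3 trees.
oob_pred = oob_sum / np.maximum(oob_count, 1)
oob_mse = np.mean((y_train - oob_pred) ** 2)  # valid estimate of the test MSE
```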
Bagging

• It can be shown that with B sufficiently large,


OOB error is virtually equivalent to LOOCV
error.

• One single decision tree has high variance


(tends to overfit), so by bagging or combining
many weak learners into strong learners, we
are averaging away the variance. It’s a
majority vote!
Disadvantage of Bagging
• Bagging typically results in improved accuracy over prediction
using a single decision tree.

• Unfortunately, however, it can be difficult to interpret the resulting


model

• Recall that one of the advantages of decision trees is that they are
attractive and easily interpretable. However, when we bag a large number of
trees, it is no longer possible to represent the resulting statistical
learning procedure using a single tree, and it is no longer clear
which variables are most important to the procedure.
Bagging

• Although the collection of bagged trees is much more difficult to


interpret than a single decision tree, one can obtain an overall
summary of the importance of each predictor using the RSS (for
bagging regression trees) or the Gini index (for bagging
classification trees).
Gini Index
The Gini index is defined by G = Σ_{k=1}^{K} p̂_{mk}(1 − p̂_{mk}), where p̂_{mk} is the
proportion of training observations in the m'th region that belong to the k'th class.
A small value indicates that a node contains predominantly observations from a
single class (high node purity).
Bagging

• In the case of bagging regression trees, we can record the total


amount that the RSS is decreased due to splits over a given
predictor, averaged over all B trees.

• Similarly, in the context of bagging classification trees, we can add
up the total amount that the Gini index is decreased by splits over a
given predictor, averaged over all B trees.
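
In scikit-learn these impurity-decrease importances are exposed as feature_importances_; a brief sketch (feature_names is an assumed list of column names, and max_features=None makes the forest behave like plain bagged trees):

```python
# Variable importance from averaged impurity decreases.
from sklearn.ensemble import RandomForestRegressor

bagged = RandomForestRegressor(n_estimators=500, max_features=None,
                               random_state=0).fit(X_train, y_train)
for name, imp in sorted(zip(feature_names, bagged.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")   # mean decrease in node impurity over all trees
```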
Decision tree… a quick recap!

• A decision tree consists of three components: decision nodes,


leaf nodes, and a root node.

• A decision tree algorithm divides the entire training dataset into


branches, which further segregate into other branches, and so
on..

• This sequence continues until a leaf node is attained. The leaf


node cannot be segregated further.

• The nodes in the decision tree represent attributes that are used
for predicting the outcome.
Decision tree… a quick recap!
Random Forest
• A random forest combines several decision trees.

• It utilizes ensemble learning, which is a technique that combines


many classifiers (weak classifiers) into a single strong classifier to
provide solutions to complex problems.

• The random forest algorithm establishes the outcome based on the


predictions of the decision trees.

• It predicts by taking the average or mean of the output from various


trees. Thus increasing the number of trees increases the precision
of the outcome.
Random Forest algorithm

• Random forests are bagged decision tree models that use a subset of
features on each split. In every random forest tree, a subset of
features is selected randomly at the node’s splitting point.

• Random forest improves on bagging because it decorrelates the trees


with the introduction of splitting on a random subset of features.

• This means that at each split of the tree, the model considers only a
small subset of features rather than all of the features of the model.

• That is, from the set of available features ‘n’, a subset of ‘m’ features
(typically m ≈ √n) is selected at random. In this way, variance can be averaged away.
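
A random-forest sketch in scikit-learn, where max_features controls the size m of the random feature subset tried at each split; X_train and y_train are assumed given:

```python
# Random forest: bagged trees that also restrict each split to a random
# subset of the features (max_features controls m).
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,      # number of trees B
    max_features="sqrt",   # m is roughly the square root of the feature count
    oob_score=True,        # also compute the out-of-bag estimate (R^2)
    random_state=0,
)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)   # OOB estimate of generalization performance
```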
Feature Randomness in random forest

• In a normal decision tree, when it is time to split a node, we


consider every possible feature and pick the one that
produces the most separation between the observations in
the left node vs. those in the right node.

• In contrast, each tree in a random forest can pick only from a


random subset of features. This forces even more variation
amongst the trees in the model and ultimately results in lower
correlation across trees and more diversification
In our random forest, we end up with trees that are not only
trained on different sets of data (thanks to bagging) but also
use different features to make decisions.
Random Forest algorithm
• A random forest relies on various decision trees.

• Every decision tree consists of decision nodes, leaf nodes, and a


root node.

• The leaf node of each tree is the final output produced by that
specific decision tree.

• The selection of the final output follows the majority-voting
system. In this case, the output chosen by the majority of the
decision trees becomes the final output of the random forest
system.
Example

• Let’s take an example of a training dataset consisting of various


fruits such as bananas, apples, pineapples, and mangoes.

• The random forest classifier divides this dataset into subsets.

• These subsets are given to every decision tree in the random


forest system.

• Each decision tree produces its specific output.

• Final prediction is taken by ‘majority vote’


Features of Random Forest algorithm

• It’s more accurate than the decision tree algorithm.

• It can produce a reasonable prediction without hyper-


parameter tuning.

• It solves the issue of overfitting in decision trees: Each


decision tree has a high variance, but low bias. But
because we average all the trees in random forest, we
are averaging the variance as well so that we have a low
bias and moderate variance model.
Features of Random Forest algorithm

• In every random forest tree, a subset of features is


selected randomly at the node’s splitting point.

• Random forests work well with high-dimensional data,
since we are working with subsets of the data.

• It is faster to train than decision trees because we are


working only on a subset of features in this model, so
we can easily work with hundreds of features.
Summary: Random Forest algorithm

The random forest is a classification algorithm consisting


of many decision trees. It uses bagging and feature
randomness when building each individual tree to
create an uncorrelated forest of trees whose prediction
is more accurate than that of any individual tree.

Essentially, Random Forest is a good model if you want


high performance with less need for interpretation.
Boosting
• Boosting is another approach for improving the predictions resulting
from a decision tree.

• Difference between bagging and boosting:


Bagging involves creating multiple copies of the original training
data set using the bootstrap, fitting a separate decision tree to each copy,
and then combining all of the trees in order to create a single predictive
model. Here, each tree is built on a bootstrap data set, independent of
the other trees.
Boosting works in a similar way, except that the trees are grown
sequentially: each tree is grown using information from previously
grown trees. Boosting does not involve bootstrap sampling.
Boosting
• Boosting algorithms seek to improve the prediction power by
training a sequence of weak models (e.g. regression, shallow
decision trees, etc.), each compensating for the weaknesses of its
predecessors.

• The weakness is identified by the weak estimator’s error rate.

• Examples of boosting algorithm:


AdaBoost, GradientBoost, XGBoost..
Boosting
• Boosting is an ensemble modeling technique that attempts to
build a strong classifier from the number of weak classifiers.

• It is done by building a model by using weak models in series.

• Firstly, a model is built from the training data.

• Then the second model is built which tries to correct the errors
(misclassification) present in the first model.

• This procedure is continued and models are added until either


the complete training data set is predicted correctly or the
maximum number of models are added.
Boosting

• Turning a weak model to a stronger one to fix its weaknesses.

• AdaBoost was the first really successful boosting algorithm


developed for the purpose of binary classification.

• AdaBoost is short for Adaptive Boosting and is a very popular


boosting technique that combines multiple “weak classifiers”
into a single “strong classifier”.
Boosting Algorithm

1. Initialize the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results are obtained, go to step 5; otherwise, go to step 2.
5. End
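
A simplified sketch of this loop for binary labels y in {−1, +1}, using depth-1 trees (stumps) as the weak classifiers; scikit-learn's AdaBoostClassifier implements the full algorithm:

```python
# Simplified AdaBoost sketch for labels y in {-1, +1}, with decision
# stumps as weak classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 1: equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # step 2: fit a weak classifier
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # classifier confidence
        w *= np.exp(-alpha * y * pred)       # step 3: up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted (confidence-based) majority vote of the weak classifiers.
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```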
Boosting Algorithm
Explanation: The above diagram explains the AdaBoost algorithm.

• B1 consists of 10 data points which consist of two types namely plus(+) and
minus(-) and 5 of which are plus(+) and the other 5 are minus(-) and each one
has been assigned equal weight initially. The first model tries to classify the data
points and generates a vertical separator line but it wrongly classifies 3 plus(+) as
minus(-).
• B2 consists of the 10 data points from the previous model in which the 3 wrongly
classified plus(+) are weighted more so that the current model tries more to
classify these pluses(+) correctly. This model generates a vertical separator line
that correctly classifies the previously wrongly classified pluses(+) but in this
attempt, it wrongly classifies three minuses(-).
• B3 consists of the 10 data points from the previous model in which the 3 wrongly
classified minus(-) are weighted more so that the current model tries more to
classify these minuses(-) correctly. This model generates a horizontal separator
line that correctly classifies the previously wrongly classified minuses(-).
• B4 combines together B1, B2, and B3 in order to build a strong prediction model
which is much better than any individual model used.
How the sample weights
affect the decision boundary

• In each iteration, AdaBoost identifies mis-classified data points, increasing their


weights (and decreasing the weights of correct points, in a sense) so that the next
classifier will pay extra attention to get previously misclassified datapoints right.
How to combine the sequence of models to make the ensemble stronger over time.

AdaBoost trains a
sequence of models with
augmented sample
weights, generating
‘confidence’ coefficients
Alpha for individual
classifiers based on errors.

A low error leads to a large
Alpha, which means
higher importance in the
voting.
Pros and Cons of Boosting
• Easy to interpret: boosting is essentially an ensemble model, hence it is easy to
interpret its predictions.

• Strong prediction power: usually boosting > bagging (random forest) > decision
tree.

• Sensitive to outliers: since each weak classifier is dedicated to fix its


predecessors’ shortcomings, the model may pay too much attention to outliers.

• Hard to scale up (sequential): since each estimator is built on its predecessors,


the process is hard to parallelize.

• Slower to train as compared to Bagging
