
Lecture 16.
Bagging, Random Forests and Boosting
CS 109A/AC 209A/STAT 121A Data Science:
Harvard University
Fall 2016

Instructors: P. Protopapas, K. Rader, W. Pan


Announcements
• HW5 solutions are out
• HW6 is due tonight
• HW7 will be released today

More thinking, less implementation


Quiz
Code: deepdarkwoods
Outline
Bagging
Random Forests
Boosting
Outline
Bagging
Random Forests
Boosting
Power of the crowds

• Wisdom of the crowds

http://www.scaasymposium.org/portfolio/part-v-the-power-of-innovation-and-the-market/
Ensemble methods
• A single decision tree does not perform well
• But, it is super fast
• What if we learn multiple trees?

We need to make sure they do not all just learn the same thing
Bagging
If we split the data in different random ways, decision
trees give different results: high variance.

Bagging (bootstrap aggregating) is a method that results in
low variance.

If we had multiple realizations of the data (or multiple
samples), we could calculate the predictions multiple times
and take the average; averaging multiple noisy estimates
produces a less uncertain result.
Bagging
Say for each bootstrap sample b we calculate f_b(x); then average:

f_avg(x) = (1/B) * sum_{b=1..B} f_b(x)

How?

Bootstrap:
Construct B (hundreds of) trees (no pruning).
Learn a classifier for each bootstrap sample and average them.
Very effective
Bagging for classification: Majority vote

[Figures: bagged decision boundaries in the (X1, X2) plane, and test error vs. number of trees; no overfitting as more trees are added.]
Bagging decision trees

Hastie et al.,”The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer (2009)
Out-of-Bag Error Estimation
• No cross validation?
• Remember, in bootstrapping we sample with
replacement, and therefore not all observations are
used for each bootstrap sample. On average 1/3 of them
are not used!
• We call them out-of-bag samples (OOB)
• We can predict the response for the i-th observation
using each of the trees in which that observation was
OOB and do this for n observations
• Calculate overall OOB MSE or classification error
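A minimal sketch of OOB error estimation (not from the lecture), using scikit-learn's BaggingClassifier on a synthetic dataset; the data and all parameter values are illustrative assumptions:

# Estimate test error from the out-of-bag samples, with no cross-validation loop.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data stands in for a real training set (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bag 200 deep decision trees (the default base estimator) and keep
# track of the out-of-bag predictions.
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0)
bag.fit(X, y)

# oob_score_ is the accuracy on observations left out of each bootstrap sample.
print("OOB error:", 1 - bag.oob_score_)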
Bagging
• Reduces overfitting (variance)
• Normally uses one type of classifier
• Decision trees are popular
• Easy to parallelize
Variable Importance Measures
• Bagging results in improved accuracy over prediction
using a single tree
• Unfortunately, the resulting model is difficult to interpret:
bagging improves prediction accuracy at the expense of
interpretability.

Calculate the total amount that the RSS (regression) or the
Gini index (classification) is decreased due to splits over a
given predictor, averaged over all B trees.

Example: variable importance using the Gini index on the Heart data.
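A sketch of this impurity-based importance measure (my own synthetic example, not the lecture's Heart data), using a scikit-learn random forest whose feature_importances_ attribute reports the averaged impurity decrease, normalized to sum to one:

# Mean decrease in Gini impurity per predictor, averaged over the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Heart data: 8 predictors, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Larger values mean more total impurity reduction attributed to that predictor.
for j, imp in enumerate(forest.feature_importances_):
    print(f"X{j}: {imp:.3f}")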
RF: Variable Importance Measures
Record the prediction accuracy on the OOB samples for
each tree.

Randomly permute the data in column j of the OOB
samples, then record the accuracy again.

The decrease in accuracy as a result of this permutation is
averaged over all trees, and is used as a measure of the
importance of variable j in the random forest.
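A sketch of permutation importance using scikit-learn's helper; note it permutes columns on a held-out set rather than on the OOB samples described above, and the dataset here is synthetic:

# Permute one column at a time and measure the drop in accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Mean accuracy drop over 20 random permutations of each column.
result = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for j, drop in enumerate(result.importances_mean):
    print(f"X{j}: accuracy drop {drop:.3f}")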
Bagging - issues
Each tree is identically distributed (i.d.)
→ the expectation of the average of B such trees is the same
as the expectation of any one of them
→ the bias of bagged trees is the same as that of the
individual trees

i.d. and not i.i.d.


Bagging - issues
An average of B i.i.d. random variables, each with variance
σ², has variance σ²/B.
If the trees are only i.d. (identically distributed but not
independent) with pairwise correlation ρ, then the variance
of the average is:

ρσ² + ((1 − ρ)/B) σ²

As B increases the second term disappears, which is why we
want to increase B, but the first term ρσ² remains.
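A quick numerical illustration of this formula (my own sketch, with σ² = 1): the correlated term ρσ² does not shrink as B grows.

# Variance of the average of B identically distributed trees with
# pairwise correlation rho: rho*sigma2 + ((1 - rho)/B)*sigma2
def bagged_variance(rho, B, sigma2=1.0):
    return rho * sigma2 + (1 - rho) / B * sigma2

for rho in (0.0, 0.5, 0.9):
    print(rho, [round(bagged_variance(rho, B), 4) for B in (10, 100, 1000)])
# rho=0.0: the variance keeps falling toward 0; rho=0.9: it stalls near 0.9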

Why does bagging generate correlated trees?


Bagging - issues
Suppose that there is one very strong predictor in the
data set, along with a number of other moderately
strong predictors.

Then all bagged trees will select the strong predictor at


the top of the tree and therefore all trees will look
similar.

How do we avoid this?


Bagging - issues
We can penalize the splitting (like in pruning) with a penalty
term that depends on the number of times a predictor is
selected at a given length

We can restrict how many times a predictor can be used

We only allow a certain number of predictors
Bagging - issues
We can penalize the splitting (like in pruning) with a penalty
term that depends on the number of times a predictor is
selected at a given length → NOT THE SAME BIAS

We can restrict how many times a predictor can be used
→ NOT THE SAME BIAS

We only allow a certain number of predictors
→ NOT THE SAME BIAS
Bagging - issues
Remember, we want i.i.d. trees so that the bias stays the
same and the variance is reduced.
Other ideas?

What if we consider only a subset of the predictors at each split?

We will still get correlated trees unless …

we randomly select the subset!
Random Forests
Outline
Bagging
Random Forests
Boosting
Random Forests
As in bagging, we build a number of decision trees on
bootstrapped training samples. Each time a split in a
tree is considered, a random sample of m predictors is
chosen as split candidates from the full set of p
predictors.

Note that if m = p, then this is bagging.


Random Forests
Random forests are popular. Leo Breiman and Adele
Cutler maintain a random forest website where the
software is freely available, and of course it is included
in every ML/STAT package.

http://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests Algorithm
For b = 1 to B:
(a) Draw a bootstrap sample Z∗ of size N from the training data.
(b) Grow a random-forest tree to the bootstrapped data, by
recursively repeating the following steps for each terminal node of the
tree, until the minimum node size n_min is reached.
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
Output the ensemble of trees.

To make a prediction at a new point x we do:


For regression: average the results
For classification: majority vote
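A minimal from-scratch sketch of this algorithm for binary (0/1) labels, assuming NumPy arrays and scikit-learn's DecisionTreeClassifier with max_features=m to pick the m random candidate variables at each split; in practice one would simply use RandomForestClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, B=100, m="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)               # (a) bootstrap sample Z* of size N
        tree = DecisionTreeClassifier(max_features=m,  # (b) i. m variables at random per split
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])                       # ii./iii. best split, grow the tree deep
        trees.append(tree)
    return trees                                       # output the ensemble of trees

def random_forest_predict(trees, X):
    # Classification: majority vote over the B trees (labels assumed 0/1).
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)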
Random Forests Tuning
The inventors make the following recommendations:
• For classification, the default value for m is √p and the minimum
node size is one.
• For regression, the default value for m is p/3 and the minimum
node size is five.

In practice the best values for these parameters will depend on the
problem, and they should be treated as tuning parameters.

Like with Bagging, we can use OOB and therefore RF can be fit in one
sequence, with cross-validation being performed along the way. Once
the OOB error stabilizes, the training can be terminated.
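A sketch of OOB-based tuning of m (max_features in scikit-learn) on synthetic data; the candidate values are arbitrary, and max_features=None corresponds to m = p, i.e. bagging.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           random_state=0)

# Fit once per candidate m and compare OOB errors -- no separate CV loop needed.
for m in (2, 5, "sqrt", 15, None):
    forest = RandomForestClassifier(n_estimators=300, max_features=m,
                                    oob_score=True, random_state=0).fit(X, y)
    print(f"max_features={m!s:>4}: OOB error = {1 - forest.oob_score_:.3f}")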
Example
• 4,718 genes measured on tissue samples from 349 patients.
• Each gene has different expression
• Each of the patient samples has a qualitative label with 15
different levels: either normal or 1 of 14 different types of
cancer.

Use random forests to predict cancer type based on the 500


genes that have the largest variance in the training set.
Null choice (Normal)
Random Forests Issues
When the number of variables is large, but the fraction of relevant
variables is small, random forests are likely to perform poorly when m is
small

Why?

Because:
At each split the chance can be small that the relevant variables will be
selected

For example, with 3 relevant and 100 not-so-relevant variables
(so m ≈ √p ≈ 10 candidates per split), the probability that any of the
relevant variables is selected at a given split is only ~0.25.
Probability of being selected
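That ~0.25 figure can be checked directly (a quick sketch, assuming m = √p ≈ 10 candidates per split among p = 103 variables):

from math import comb, isqrt

p, relevant = 103, 3
m = isqrt(p)                                   # m = sqrt(p) ~ 10
p_none = comb(p - relevant, m) / comb(p, m)    # no relevant variable among the m candidates
print(f"P(at least one relevant variable in a split) = {1 - p_none:.2f}")  # ~0.27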
Can RF overfit?
Random forests “cannot overfit” the data with respect to
the number of trees.

Why?

Increasing the number of trees B does not increase
the flexibility of the model.

I have seen discussions about gains in performance by
controlling the depths of the individual trees grown in
random forests. I usually use full-grown trees; it seldom
costs much (in the classification error) and results in
one less tuning parameter.
Outline
Bagging
Random Forests
Boosting
Boosting
Boosting is a general approach that can be applied to
many statistical learning methods for regression or
classification.

Bagging: Generate multiple trees from bootstrapped


data and average the trees.
Recall that bagging results in i.d. trees, not i.i.d.

RF produces i.i.d. (or at least less correlated) trees by
randomly selecting a subset of predictors at each split.
Boosting
Boosting works very differently.
1. Boosting does not involve bootstrap sampling
2. Trees are grown sequentially: each tree is
grown using information from previously
grown trees
3. Like bagging, boosting involves combining a
large number of decision trees, f1, . . . , fB
Sequential fitting
Given the current model,
• we fit a decision tree to the residuals from the
model. The response variable is now the residuals,
not Y
• We then add this new decision tree into the fitted
function in order to update the residuals
• The learning rate has to be controlled
Boosting for regression
1. Set f(x) = 0 and r_i = y_i for all i in the training set.
2. For b = 1, 2, ..., B, repeat:
a. Fit a tree f_b with d splits (d + 1 terminal nodes) to the training data (X, r).
b. Update f by adding in a shrunken version of the new tree:
f(x) ← f(x) + λ f_b(x)
c. Update the residuals:
r_i ← r_i − λ f_b(x_i)
3. Output the boosted model:
f(x) = Σ_{b=1..B} λ f_b(x)
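A minimal sketch of this algorithm with scikit-learn regression trees; max_depth=d stands in for "d splits" (exact for stumps, d = 1), and the values of B, d and λ are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, B=1000, d=1, lam=0.01):
    r = y.astype(float).copy()                           # 1. f(x) = 0, r_i = y_i
    trees = []
    for b in range(B):                                   # 2. for b = 1, ..., B
        tree = DecisionTreeRegressor(max_depth=d).fit(X, r)   # a. fit a tree to (X, r)
        trees.append(tree)                               # b. f(x) <- f(x) + lam * f_b(x)
        r -= lam * tree.predict(X)                       # c. r_i <- r_i - lam * f_b(x_i)
    return trees, lam

def boost_predict(trees, lam, X):
    # 3. f(x) = sum_b lam * f_b(x)
    return lam * np.sum([t.predict(X) for t in trees], axis=0)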


Boosting tuning parameters
• The number of trees B. RF and bagging do not
overfit as B increases, but boosting can overfit!
Choose B by cross-validation.
• The shrinkage parameter λ, a small positive
number. Typical values are 0.01 or 0.001, but it
depends on the problem. λ only controls the
learning rate.
• The number d of splits in each tree, which controls
the complexity of the boosted ensemble. Stumps
(d = 1) often work well.
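The same three knobs in scikit-learn's GradientBoostingRegressor (a sketch, with an arbitrary synthetic dataset and grid): n_estimators ≈ B, learning_rate ≈ λ, max_depth ≈ d.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Fix lambda and d (stumps); choose B by cross-validation, since boosting can overfit in B.
gbr = GradientBoostingRegressor(learning_rate=0.01, max_depth=1, random_state=0)
grid = GridSearchCV(gbr, {"n_estimators": [100, 500, 2000]}, cv=5).fit(X, y)
print(grid.best_params_)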
Boosting for classification

Challenge question for HW7


Different flavors
• ID3, or Iterative Dichotomiser, was the first of three Decision Tree
implementations developed by Ross Quinlan (Quinlan, J. R. 1986.
Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)
Only categorical predictors and no pruning.
• C4.5, Quinlan's next iteration. The new features (versus ID3) are: (i)
accepts both continuous and discrete features; (ii) handles
incomplete data points; (iii) solves over-fitting problem by (very
clever) bottom-up technique usually known as "pruning"; and (iv)
different weights can be applied to the features that comprise the
training data.
Used in Orange: http://orange.biolab.si/
Different flavors
• C5.0, The most significant feature unique to C5.0 is a scheme for
deriving rule sets. After a tree is grown, the splitting rules that
define the terminal nodes can sometimes be simplified: that is, one
or more conditions can be dropped without changing the subset of
observations that fall in the node.

• CART or Classification And Regression Trees is often used as a


generic acronym for the term Decision Tree, though it apparently
has a more specific meaning. In sum, the CART implementation is
very similar to C4.5. Used in sklearn
Missing data
• What if predictor values are missing?
– Remove those examples => depletion of the
training set
– Impute the values with the mean, with kNN, or from the
marginal or joint distributions
• Trees have a nicer way of doing this
– Categorical
Further reading
• Pattern Recognition and Machine Learning,
Christopher M. Bishop
• The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani, Jerome Friedman
http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
