CSC 3304 Lecture 08 Boosting Ensemble Methods


Boosting

Cynthia Rudin | MIT Sloan School of Management


Boosting Motivation

• Question of Kearns: Can you turn a “weak” learning algorithm (one that is barely better than random guessing) into a “strong” learning algorithm (one whose error rate is arbitrarily close to 0)?

• We could ask the algorithm to create many classifiers and figure out how to combine them… but if we give the algorithm the same input each time, it will produce the same answer, not many different classifiers.
Boosting Motivation

• Schapire and Freund’s answer:
  • Reweight the data in many ways.
  • Use the weak learning algorithm to create a weak classifier for each (reweighted) dataset.
  • Compute a weighted average of the weak classifiers.
AdaBoost

• There are now several boosting algorithms.
• One of the most popular boosting algorithms is AdaBoost.
• AdaBoost was famously used in computer vision by Viola and Jones (2001) for face detection.
Weak classifiers used by Viola and Jones

• Viola and Jones worked on face detection.
• They came up with very simple classifiers that subtract the white areas of a rectangular template from the black ones.
Weak classifiers used by Viola and Jones

• Subtract the white areas from the black ones.
• [Figure: the template slides across the face image. Over regions where the black and white areas have very similar pixel values, it doesn’t detect anything; once it reaches the eyes, it detects.]
Weak classifiers used by Viola and Jones

• Subtract the white areas from the black ones.
• [Figure: two weak classifiers acting as eye detectors.]
  1. The first detector looks for the difference between the dark eyes and the part of the face just below them.
  2. The second detector looks for two black areas separated by a white area, which is the person’s nose.
Weak classifiers used by Viola and Jones

• A single detector by itself is not useful, but many of them put together are very useful.
• Viola and Jones used hundreds of thousands of these weak classifiers at all different scales.
AdaBoost Pseudocode

Assign observation i the weight d_{1,i} = 1/n (equal weights).
For t = 1 to T:
  • Train the weak learning algorithm using the data weighted by d_{t,i}. This produces weak classifier h_t.
  • Choose coefficient α_t.  (y_i h_t(x_i) is 1 if h_t classifies point i correctly, -1 if incorrectly.)
  • Update the weights:
      $$d_{t+1,i} = \frac{d_{t,i}\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t},$$
    where Z_t is a normalization factor chosen so that the weights add up to 1.
End

Output the final classifier, a weighted sum of the weak classifiers:
      $$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right).$$
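To make the pseudocode concrete, here is a minimal sketch of AdaBoost in Python, assuming scikit-learn decision stumps as the weak learners, labels in {-1, +1}, and the standard choice α_t = ½ ln((1-ε_t)/ε_t) for the coefficient (the slides leave the choice of α_t open). The function names and T value are illustrative, not from the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Train AdaBoost with depth-1 trees (stumps) as weak learners; y must be in {-1, +1}."""
    n = len(y)
    d = np.full(n, 1.0 / n)                      # d_{1,i} = 1/n: equal initial weights
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # weak learner trained on the weighted data
        h.fit(X, y, sample_weight=d)
        pred = h.predict(X)
        eps = np.clip(np.sum(d[pred != y]), 1e-10, 1 - 1e-10)  # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)                   # assumed choice of alpha_t
        d = d * np.exp(-alpha * y * pred)        # d_{t+1,i} = d_{t,i} exp(-alpha_t y_i h_t(x_i)) / Z_t
        d /= d.sum()                             # dividing by Z_t makes the weights sum to 1
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H(x) = sign( sum_t alpha_t h_t(x) )."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(stumps, alphas)))
```

Misclassified points get their weight multiplied by exp(α_t) and correctly classified points by exp(-α_t), which is exactly the "reweight the data" step from the motivation.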
Boosting Example

In this example, the weak classifiers are only allowed to be horizontal or vertical lines. All observations start with equal weights.

(Credit: Example adapted from Freund and Schapire)
Boosting Example

• Run the weak learning algorithm to get a weak classifier h₁.
• Choose coefficient α₁ = 0.42.
• h₁ classified two points as positive and everything else as negative. Which points were misclassified?

(Credit: Example adapted from Freund and Schapire)
Boosting Example

• Increase the weights on the misclassified points and decrease the weights on the correctly classified points.
• Because the weights on those points are now so high, the weak learning algorithm basically has to get those three points right on the next round.

(Credit: Example adapted from Freund and Schapire)
Boosting Example

• Run the weak learning algorithm to get a weak classifier h₂ for the weighted data.
• Choose coefficient α₂ = 0.66.

(Credit: Example adapted from Freund and Schapire)
Boosting Example

• Increase the weights on the misclassified points and decrease the weights on the correctly classified points.

(Credit: Example adapted from Freund and Schapire)
Boosting Example

• Run the weak learning algorithm on the reweighted data to get a weak classifier h₃.
• Choose coefficient α₃ = 0.93.

(Credit: Example adapted from Freund and Schapire)
Boosting Example

After three rounds of boosting we stop. The final combined classifier is

    H(x) = sign( 0.42·h₁(x) + 0.66·h₂(x) + 0.93·h₃(x) ),

and it classifies all the points correctly.

(Credit: Example adapted from Freund and Schapire)
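As a quick sanity check on the weighted combination (this arithmetic is not on the slides): a point that h₁ and h₂ classify as positive but h₃ classifies as negative still comes out positive, since sign(0.42 + 0.66 − 0.93) = sign(0.15) = +1. In fact, any two of the three classifiers outvote the third here, because each pair of coefficients sums to more than the remaining one.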


Decision Forests

Cynthia Rudin | MIT Sloan School of Management


What’s the difference between Boosted Decision Trees and Decision Forests?

• Decision Forests
  – compute many trees from different subsets of the data and features, and
  – average them (bagging).
• Boosted Decision Trees
  – reweight the data to generate different trees, and
  – combine them so as to minimize training error (coordinate descent on the exponential loss).
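As an illustration of the two families (not part of the slides), scikit-learn provides off-the-shelf versions of both; the hyperparameter values below are arbitrary choices, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Decision forest: bag the data, consider a random subset of features at each
# split, and take a majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")

# Boosted decision trees: reweight the data each round and combine the
# resulting trees with coefficients (AdaBoost uses decision stumps by default).
boosted = AdaBoostClassifier(n_estimators=100)

# Both expose the usual interface:
# forest.fit(X_train, y_train); boosted.fit(X_train, y_train)
```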
Decision Forests

• A complex and powerful prediction tool
• Black-box
• Uses a similar idea to boosted decision trees: averaging many uncorrelated yet accurate models reduces variance.
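One standard way to make the variance-reduction claim precise (this formula is not on the slides, but follows from basic properties of variance): if each of T identically distributed models has prediction variance σ² and each pair of models has correlation ρ, the variance of their average is

$$\mathrm{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} f_t(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2,$$

which shrinks toward ρσ² as T grows. Making the trees less correlated (smaller ρ) is what drives the variance down.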
Decision Forests

• Example: Will the customer wait for a table at a restaurant?

Features:
• OthOptions: Other options; true if there are other restaurants nearby.
• Weekend: True if it is Friday, Saturday, or Sunday.
• Area: Does the restaurant have a bar or other nice area to wait in?
• Plans: Does the customer have plans right after dinner?
• Price: Either $, $$, $$$, or $$$$.
• Precip: Is it raining or snowing?
• Genre: French, Mexican, Thai, or Pizza.
• Wait: Wait time estimate: 0-5 min, 5-15 min, or 15+ min.
• Crowded: Whether there are other customers (none, some, or full).

(Credit: Adapted from Russell and Norvig)


[Figure: several decision trees from the forest, each splitting on features such as Crowded?, OthOptions?, Plans?, Genre?, Price?, and Wait time, with Yes/No leaves.]

New observation: Mexican, $$, Crowded = Full, Wait = 5-15 min, no plans, no other options.
Each tree classifies the new observation; the majority vote is Yes, the customer will wait.
Decision Forests

• A bootstrap sample of size n: draw n points at random, with replacement, from the training data.
• (So you have some repeated points, and that’s OK.)
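A minimal sketch of drawing a bootstrap sample with NumPy (the toy data here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.arange(10).reshape(5, 2)            # toy training data: 5 points, 2 features
n = len(X_train)

boot_idx = rng.integers(0, n, size=n)            # n draws with replacement: some indices repeat
X_boot = X_train[boot_idx]                       # the bootstrap sample (repeated rows are fine)
oob_idx = np.setdiff1d(np.arange(n), boot_idx)   # points left out ("out-of-bag"), used later
```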


Decision Forests

For t = 1 to T:
• Draw a bootstrap sample of size n from the training data.
• Grow a tree (tree_t) using this splitting and stopping procedure:
  – Choose m features at random (out of p).
  – Evaluate the splitting criterion on all of them.
  – Split on the best feature.
  – If the node has fewer than n_min points, stop splitting.

Output all the trees.

To predict on a new observation x, use the majority vote of the trees on x.
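A minimal sketch of this procedure in Python, assuming scikit-learn decision trees as the base learners and labels in {-1, +1} for the sign-based vote; the function names and defaults (T, m = √p, n_min) are illustrative choices, not part of the lecture.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, T=100, m=None, n_min=5, rng=np.random.default_rng(0)):
    """Grow T trees, each on a bootstrap sample, splitting on m random features."""
    n, p = X.shape
    m = m or max(1, int(np.sqrt(p)))                  # common default: m = sqrt(p)
    trees, oob_indices = [], []
    for _ in range(T):
        boot = rng.integers(0, n, size=n)             # bootstrap sample of size n (with replacement)
        tree = DecisionTreeClassifier(
            max_features=m,                           # consider m random features per split
            min_samples_split=n_min)                  # stop splitting nodes with fewer than n_min points
        tree.fit(X[boot], y[boot])
        trees.append(tree)
        oob_indices.append(np.setdiff1d(np.arange(n), boot))  # out-of-bag rows for this tree
    return trees, oob_indices

def predict_forest(trees, X):
    """Majority vote of the trees (labels assumed to be in {-1, +1})."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.sign(votes.sum(axis=0))
```

The out-of-bag index lists are returned because they are exactly what the variable-importance procedure below needs.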
Decision Forests

Comparison with ordinary decision trees:
• Bootstrap resamples (makes the trees diverse, which reduces variance).
• Splitting considers only m possible (randomly chosen) features (also makes the trees diverse, reducing variance).
• No pruning (makes the trees fit more tightly, which reduces bias).
• The majority vote of several trees is used to make predictions.

Why does this work? Diverse, unpruned trees have low bias, and averaging many of them brings the variance down.
Decision Forests: Measuring Variable Importance

• Let us measure the “importance” of variable j.
• Take the data not used to construct tree_t. Call it the “out-of-bag” data, OOB_t.
• Compute error_t, using model tree_t on data OOB_t.
• Now randomly permute only the jth feature values (reorder that column of OOB_t). Call the result OOB_{t,permuted}. For example, permuting the second column:

  $$\begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{pmatrix}
  \;\longrightarrow\;
  \begin{pmatrix} x_{11} & x_{32} & x_{13} \\ x_{21} & x_{12} & x_{23} \\ x_{31} & x_{22} & x_{33} \end{pmatrix}$$

• Compute error_{t,permuted}, using model tree_t on data OOB_{t,permuted}.
• The “raw importance” of variable j is then the average over trees of the difference:

  $$\frac{1}{T}\sum_{\text{trees } t}\left(\text{error}_{t,\text{permuted}} - \text{error}_t\right).$$
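A minimal sketch of this permutation-importance computation in Python, assuming a list of fitted trees with their out-of-bag index arrays (as produced by the forest sketch above) and 0-1 classification error; the names are illustrative.

```python
import numpy as np

def raw_importance(trees, oob_indices, X, y, j, rng=np.random.default_rng(0)):
    """Average increase in out-of-bag error when feature j is randomly permuted.

    trees       : list of fitted classifiers, one per bootstrap sample
    oob_indices : oob_indices[t] are the rows NOT used to fit trees[t]
    """
    diffs = []
    for tree, oob in zip(trees, oob_indices):
        X_oob, y_oob = X[oob], y[oob]
        err = np.mean(tree.predict(X_oob) != y_oob)       # error_t on OOB_t
        X_perm = X_oob.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])       # permute only column j
        err_perm = np.mean(tree.predict(X_perm) != y_oob)  # error_{t,permuted}
        diffs.append(err_perm - err)
    return np.mean(diffs)                                  # raw importance of variable j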
Decision Forests for Regression

For t = 1 to T:
• Draw a bootstrap sample of size n from the training data.
• Grow a tree (tree_t) using this splitting and stopping procedure:
  – Choose m features at random (out of p).
  – Evaluate the splitting criterion on all of them.
  – Split on the best feature.
  – If the node has fewer than n_min points, stop splitting.

Output all the trees.

To predict on a new observation x, use the average of the trees’ predictions on x (instead of the majority vote).
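For the regression case, only the prediction step changes relative to the classification sketch above; a minimal (assumed) version:

```python
import numpy as np

def predict_forest_regression(trees, X):
    """Average the trees' predictions instead of taking a majority vote."""
    # trees: list of fitted regression trees (e.g. sklearn DecisionTreeRegressor)
    return np.mean(np.stack([t.predict(X) for t in trees]), axis=0)
```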
Decision Forests
Advantages
• Complex and powerful prediction tool, highly nonlinear

Disadvantages
• Black-box
• Tends to overfit unless tuned carefully (not always intuitive with
the R package)
• Slow