
Machine Learning (CE 40717)


Fall 2024

Ali Sharifi-Zarchi

CE Department
Sharif University of Technology

October 13, 2024


1 Introduction

2 Bagging

3 Boosting

4 AdaBoost

5 Comparison

6 References


1 Introduction
Condorcet’s jury theorem
Ensemble learning
Ensemble Methods

2 Bagging

3 Boosting

4 AdaBoost

5 Comparison

6 References


Condorcet’s jury theorem

• N voters wish to reach a decision by majority vote.

• Each voter has an independent probability p of voting for the correct decision.

• Let M be the probability of the majority voting for the correct decision.

• If p > 0.5 and N → ∞, then M → 1.

• How?

Adopted from Wikipedia
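A quick numeric illustration of the theorem (a sketch added here, not from the slides): for odd N, M is simply the binomial probability that more than half of the voters are correct.

```python
from math import comb

def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a majority of n independent voters (odd n),
    each correct with probability p, reaches the correct decision."""
    assert n % 2 == 1, "use an odd number of voters to avoid ties"
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# With p = 0.6, the majority becomes almost surely correct as N grows:
for n in [1, 11, 101, 1001]:
    print(n, round(majority_correct_prob(0.6, n), 4))
# expected trend: roughly 0.6 -> 0.75 -> 0.98 -> 1.0
```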


Strong vs. weak learners

• Strong learner: we seek to produce one classifier for which the classification error can be made arbitrarily small.
  • So far, we have been looking for such methods.

• Weak learner: a classifier that is just better than random guessing (for now, this will be our only expectation).


Basic idea

• Certain weak learners do well in modeling one aspect of the data, while others do well in modeling another.

• Learn several simple models and combine their outputs to produce the final decision.

• A composite prediction where the final accuracy is better than the accuracy of the individual models.

Adopted from [4]


Ensemble Methods

• Parallel ensemble methods: weak learners are generated in parallel. The basic motivation is to use the independence between the learners.

• Sequential ensemble methods: weak learners are generated consecutively. The basic motivation is to use the dependence between the base learners.


What we talk about

• Weak or simple learners
  • Low variance: they don’t usually overfit
  • High bias: they can’t learn complex functions

• Bagging (parallel): to decrease the variance
  • Random Forest

• Boosting (sequential): to decrease the bias (enhance their capabilities)
  • AdaBoost


1 Introduction

2 Bagging
Basic idea & algorithm
Decision tree (quick review)
Random Forest

3 Boosting

4 AdaBoost

5 Comparison

6 References


Basic idea

• Bagging = Bootstrap aggregating

• It uses bootstrap resampling to generate different training datasets from the original training dataset.
  • It samples the training data uniformly at random, with replacement.

• On the training datasets, it trains different weak learners.

• During testing, it aggregates the weak learners by uniform averaging or majority voting.

• Works best with unstable models (high-variance models). Why?


Basic idea, Cont.

Adopted from GeeksForGeeks


Algorithm

Algorithm 1 Bagging

1: Input: M (required ensemble size), D = {(x^(1), y^(1)), . . . , (x^(N), y^(N))} (training set)
2: for t = 1 to M do
3:   Build a dataset D_t by sampling N items randomly with replacement from D
       ▷ Bootstrap resampling: like rolling an N-faced die N times
4:   Train a model h_t using D_t and add it to the ensemble
5: end for
6: H(x) = sign( Σ_{t=1}^{M} h_t(x) )
       ▷ Aggregate the models by voting for classification or by averaging for regression
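A minimal runnable sketch of this procedure (an illustration, not the slides' code), assuming scikit-learn decision trees as the weak learners and ±1 labels so that majority voting can be written as a sign:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, random_state=0):
    """Train M trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(random_state)
    N = len(X)
    ensemble = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)              # sample N items with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Aggregate by majority vote (labels assumed to be -1/+1)."""
    votes = np.sum([h.predict(X) for h in ensemble], axis=0)
    return np.sign(votes)
```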


Structure

• Terminal nodes (leaves) represent the target variable.

• Each internal node denotes a test on an attribute.

Adopted from Medium


Learning

• Learning an optimal decision tree is NP-complete.
  • Instead, we use a greedy search based on a heuristic.
  • We can’t guarantee to return the globally optimal decision tree.

• The most common strategy for DT learning is a greedy top-down approach.

• The tree is constructed by splitting samples into subsets based on an attribute value test, in a recursive manner.

Adopted from G. E. Naumov, "NP-completeness of problems of construction of optimal decision trees", 1991


Algorithm

Algorithm 2 Constructing DT

1: procedure FINDTREE(S, A)                ▷ Input: S (samples), A (attributes)
2:   if A is empty or all labels in S are the same then
3:     status ← leaf
4:     class ← most common class in S
5:   else
6:     status ← internal
7:     a ← bestAttribute(S, A)             ▷ The attribute value test
8:     LeftNode ← FindTree(S(a = 1), A − {a})
9:     RightNode ← FindTree(S(a = 0), A − {a})
10:  end if
11: end procedure


Which attribute is the best?

• Entropy measures the uncertainty in a specific distribution:

  H(X) = − Σ_{x_i ∈ X} P(x_i) log P(x_i)

• Information Gain (IG):

  Gain(S, A) = H_S(Y) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · H_{S_v}(Y)

  A: the variable used to split the samples
  Y: the target variable
  S: the samples; S_v: the subset of S where A = v
  H_S(Y): the entropy of Y over S

Adopted from Wikipedia


Example

Adopted from [5]

Gain(S, Humidity) = 0.940 − (7/14)0.985 − (7/14)0.592 = 0.151

Gain(S, Wind) = 0.940 − (8/14)0.811 − (6/14)1.0 = 0.048
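A quick check of these numbers (an illustrative sketch, assuming the class counts behind the figures from [5]: S contains 9 positive and 5 negative samples, Humidity splits them into 3+/4− and 6+/1−, and Wind into 6+/2− and 3+/3−):

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * log2(p)
    return h

H_S = entropy(9, 5)                                                   # ≈ 0.940
gain_humidity = H_S - 7/14 * entropy(3, 4) - 7/14 * entropy(6, 1)     # ≈ 0.151
gain_wind     = H_S - 8/14 * entropy(6, 2) - 6/14 * entropy(3, 3)     # ≈ 0.048
print(round(gain_humidity, 3), round(gain_wind, 3))
```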



Bagging on decision trees?

Why decision trees?


• Interpretable
• Robust to outliers
• Low bias
• High variance

Adopted from [4]


Perfect candidates

• Why are DTs good candidates for ensembles?
  • Consider averaging many (nearly) unbiased tree estimators.
  • Bias remains similar, but variance is reduced.

• Remember Bagging?
  • Train many trees on bootstrapped data, then aggregate (average/majority) the outputs.


Algorithm

Algorithm 3 Random Forest

1: Input: T (number of trees), m (number of variables used to split each node)
2: for t = 1 to T do
3:   Draw a bootstrap dataset
4:   Select m features randomly out of the d features as candidates for splitting
5:   Learn a tree on this dataset
6: end for
7: Output:                                  ▷ Usually: m ≤ √d
8:   Regression: average of the outputs
9:   Classification: majority voting
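In practice this is rarely coded by hand; a minimal sketch (an illustration, assuming scikit-learn, whose max_features="sqrt" corresponds to choosing m ≈ √d features per split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic dataset just for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# T = 100 trees; each split considers only sqrt(d) randomly chosen features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```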

Example

Figures adopted from [4]

1 Introduction

2 Bagging

3 Boosting
Motivation & basic idea
Algorithm

4 AdaBoost

5 Comparison

6 References


Problems with bagging

• Bagging created a diversity of weak learners by creating random datasets.
  • Examples: decision stumps (shallow decision trees), logistic regression, . . .

• Did we have full control over the usefulness of the weak learners?
  • The diversity or complementarity of the weak learners is not controlled in any way; it is left to chance and to the instability (variance) of the models.


Basic idea

• We would expect a better performance if the weak learners also complemented each other.
  • They would have "expertise" on different subsets of the dataset.
  • So they would work better on different subsets.

• The basic idea of boosting is to generate a series of weak learners which complement each other.
  • For this, we will force each learner to focus on the mistakes of the previous learner.


Basic idea, Cont.

Adopted from GeeksForGeeks


Algorithm

• Try to combine many simple weak learners (in sequence) to find a single strong learner (for simplicity, suppose that we have a classification problem from now on).

• Each component is a simple binary ±1 classifier.

• Voted combination of the component classifiers:

  H_M(x) = α_1 h(x; θ_1) + · · · + α_M h(x; θ_M)

• To simplify notation: h(x; θ_i) = h_i(x)

  H_M(x) = α_1 h_1(x) + · · · + α_M h_M(x)

• Prediction: ŷ = sign(H_M(x))


Candidate for hi (x)

• Decision stumps
  • Each classifier is based on only a single feature of x (e.g., x_k):

    h(x; θ) = sign(w_1 x_k − w_0)
    θ = {k, w_1, w_0}

Adopted from [4]
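A minimal decision stump sketch (illustrative only; the feature index k, threshold, and sign are found by brute force, and labels are assumed to be −1/+1):

```python
import numpy as np

def fit_stump(X, y, sample_weight=None):
    """Pick the feature k, threshold t and sign s minimizing the weighted error
    of the rule h(x) = s * sign(x_k - t). Labels are assumed to be -1/+1."""
    n, d = X.shape
    w = np.ones(n) / n if sample_weight is None else sample_weight
    best = (np.inf, 0, 0.0, 1)                    # (error, k, threshold, sign)
    for k in range(d):
        for t in np.unique(X[:, k]):
            for s in (+1, -1):
                pred = s * np.where(X[:, k] > t, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, k, t, s)
    return best

def stump_predict(stump, X):
    _, k, t, s = stump
    return s * np.where(X[:, k] > t, 1, -1)
```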


1 Introduction

2 Bagging

3 Boosting

4 AdaBoost
Basic idea & algorithm
Loss function & proof
Properties (extra-reading)

5 Comparison

6 References


Basic idea

• Sequential production of classifiers
  • Iteratively add the classifier whose addition will be most helpful.

• Represent the importance of each sample by assigning weights to the samples.
  • Correct classification =⇒ smaller weights
  • Misclassified samples =⇒ larger weights

• Each classifier is dependent on the previous ones.
  • It focuses on the previous ones’ errors.

Example

Figures adopted from [4]

Algorithm

• H_M(x) = (1/2)[α_1 h_1(x) + · · · + α_M h_M(x)]  →  the complete model, y^(i) ∈ {−1, 1}

• h_m(x): the m-th weak learner

• α_m = ?  →  votes of the m-th weak learner

• w_m^(i): weight of sample i in iteration m

• w_{m+1}^(i) = ?

• J_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i)))  →  loss of the m-th weak learner

• ϵ_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i))) / Σ_{i=1}^{N} w_m^(i)  →  weighted error of the m-th weak learner


Algorithm, Cont.

• H_M(x) = (1/2)[α_1 h_1(x) + · · · + α_M h_M(x)]  →  the complete model, y^(i) ∈ {−1, 1}

• h_m(x): the m-th weak learner

• α_m = ln((1 − ϵ_m) / ϵ_m)  →  votes of the m-th weak learner

• w_m^(i): weight of sample i in iteration m

• w_{m+1}^(i) = w_m^(i) e^{α_m I(y^(i) ≠ h_m(x^(i)))}

• J_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i)))  →  loss of the m-th weak learner

• ϵ_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i))) / Σ_{i=1}^{N} w_m^(i)  →  weighted error of the m-th weak learner

Algorithm, Cont.

Algorithm 4 AdaBoost

1: Initialize the data weights w_1^(i) = 1/N for all N samples
2: for m = 1 to M do
3:   Find h_m(x) by minimizing the loss: J_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i)))
4:   Find the weighted error of h_m(x): ϵ_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i))) / Σ_{i=1}^{N} w_m^(i)
5:   Assign the votes: α_m = ln((1 − ϵ_m) / ϵ_m)
6:   Update the weights: w_{m+1}^(i) = w_m^(i) e^{α_m I(y^(i) ≠ h_m(x^(i)))}
7: end for
8: Combined classifier: ŷ = sign(H_M(x)), where H_M(x) = (1/2) Σ_{m=1}^{M} α_m h_m(x)
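A minimal runnable sketch of Algorithm 4 (an illustration, not the slides' code), assuming ±1 labels and scikit-learn depth-1 trees as the weak learners; it also assumes 0 < ϵ_m < 0.5 so that α_m is well defined and positive:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    """AdaBoost with decision stumps; y is assumed to take values in {-1, +1}."""
    N = len(X)
    w = np.ones(N) / N                       # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(M):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (h.predict(X) != y)           # indicator I(y != h(x))
        eps = np.sum(w * miss) / np.sum(w)   # weighted error (assumed in (0, 0.5))
        alpha = np.log((1 - eps) / eps)      # votes of this weak learner
        w = w * np.exp(alpha * miss)         # up-weight misclassified samples
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    H = 0.5 * sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(H)
```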


Loss function

• There are many options for the loss function.

• AdaBoost is equivalent to using the following exponential loss:

  L(y, H_M(x)) = e^{−y · H_M(x)}

  ŷ = sign(H_M(x))


Why the exponential loss?

• A differentiable approximation (upper bound) of the 0/1 loss
  • Easy to optimize
  • Minimizing it minimizes an upper bound on the classification error.

Adopted from [2]


Step 1: Calculating the exponential loss

• We need to calculate the exponential loss for

  H_m(x) = (1/2)[α_1 h_1(x) + · · · + α_m h_m(x)]

  (the factor 1/2 gives a cleaner form later).

• Idea: consider adding the m-th component:

  L_m = Σ_{i=1}^{N} e^{−y^(i) H_m(x^(i))}
      = Σ_{i=1}^{N} e^{−y^(i) [H_{m−1}(x^(i)) + (1/2) α_m h_m(x^(i))]}
      = Σ_{i=1}^{N} e^{−y^(i) H_{m−1}(x^(i))} × e^{−(1/2) α_m y^(i) h_m(x^(i))}
      = Σ_{i=1}^{N} w_m^(i) e^{−(1/2) α_m y^(i) h_m(x^(i))}

  where w_m^(i) = e^{−y^(i) H_{m−1}(x^(i))} is fixed at stage m, while h_m(x) and α_m should be optimized at stage m.


Step 2: Deriving the weighted error function

• We need to derive the weighted error function, J_m:

  L_m = Σ_{i=1}^{N} w_m^(i) e^{−(1/2) α_m y^(i) h_m(x^(i))}

      = e^{−α_m/2} ( Σ_{y^(i) = h_m(x^(i))} w_m^(i) ) + e^{α_m/2} ( Σ_{y^(i) ≠ h_m(x^(i))} w_m^(i) )

      = (e^{α_m/2} − e^{−α_m/2}) ( Σ_{y^(i) ≠ h_m(x^(i))} w_m^(i) ) + e^{−α_m/2} Σ_{i=1}^{N} w_m^(i)

  The first sum is J_m = Σ_{i=1}^{N} w_m^(i) × I(y^(i) ≠ h_m(x^(i))); we find the h_m(x) that minimizes J_m.


Step 3: Deriving ϵm and αm

• We need to derive ϵ_m and α_m by setting the derivative equal to zero:

  ∂L_m / ∂α_m = 0

• Idea: separate the derivative into misclassified and correctly classified samples.

  =⇒ (1/2)(e^{α_m/2} + e^{−α_m/2}) ( Σ_{y^(i) ≠ h_m(x^(i))} w_m^(i) ) = (1/2) e^{−α_m/2} Σ_{i=1}^{N} w_m^(i)

  =⇒ e^{−α_m/2} / (e^{α_m/2} + e^{−α_m/2}) = Σ_{y^(i) ≠ h_m(x^(i))} w_m^(i) / Σ_{i=1}^{N} w_m^(i)

• Set ϵ_m = Σ_{i=1}^{N} w_m^(i) I(y^(i) ≠ h_m(x^(i))) / Σ_{i=1}^{N} w_m^(i)  =⇒  α_m = ln((1 − ϵ_m) / ϵ_m)

Step 4: Justifying the weight update mechanism

• We need to justify the weight update mechanism.

• Idea: from the first step we have w_{m+1}^(i) = e^{−y^(i) H_m(x^(i))}

  Separating h_m(x^(i)):
  =⇒ w_{m+1}^(i) = w_m^(i) e^{−(1/2) α_m y^(i) h_m(x^(i))}

  Using y^(i) h_m(x^(i)) = 1 − 2 I(y^(i) ≠ h_m(x^(i))):
  =⇒ w_{m+1}^(i) = w_m^(i) e^{−α_m/2} e^{α_m I(y^(i) ≠ h_m(x^(i)))}

  The factor e^{−α_m/2} is independent of i and can be ignored:
  =⇒ w_{m+1}^(i) = w_m^(i) e^{α_m I(y^(i) ≠ h_m(x^(i)))}


Exponential loss properties

• In each boosting iteration, assume we can find h(x; θ_m) whose weighted error is better than chance, where

  H_m(x) = (1/2)[α_1 h(x; θ_1) + · · · + α_m h(x; θ_m)]

• Thus, a lower exponential loss over the training data is guaranteed.

Adopted from [6]


Training error properties

• Boosting iterations typically decrease the training error of H_M(x) over the training examples.

Adopted from [6]


Training error properties, Cont.

• The training error has to go down exponentially fast if the weighted error of each h_m is strictly better than chance (i.e., ϵ_m < 0.5):

  E_train(H_M) ≤ Π_{m=1}^{M} 2 √(ϵ_m (1 − ϵ_m))

Adopted from [6]
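For instance (a hypothetical constant error rate, just to illustrate the decay): if every weak learner achieved ϵ_m = 0.4, each factor would be 2√(0.4 × 0.6) ≈ 0.98, so the bound shrinks roughly like 0.98^M and drops below 0.02 after about M = 200 rounds.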


Weighted error properties

• The weighted error of each new component classifier tends to increase as a function of the boosting iterations.

  ϵ_m = Σ_{i=1}^{N} w_m^(i) I(y^(i) ≠ h_m(x^(i))) / Σ_{i=1}^{N} w_m^(i)

Adopted from [6]


Test error properties

• Test error can still decrease after the training error is flat (even zero).

• But is it robust to overfitting?
  • It may easily overfit in the presence of labeling noise or overlapping classes.

Adopted from [6] and [3]


Typical behavior

• The exponential loss goes strictly down.

• The training error of H goes down.

• The weighted error ϵ_m goes up =⇒ the share of votes α_m goes down.

• The test error can keep decreasing even after the training error is flat.


Bagging vs. Boosting

• Training strategy: Bagging uses parallel training; Boosting uses sequential training.

• Data sampling: Bagging uses bootstrapping (random subsets); Boosting uses weighted sampling (by instance importance).

• Learner dependency: Bagging’s learners are independent; Boosting’s learners are dependent (on the previous models).

• Learner weighting: Bagging uses equal weights; Boosting uses varying weights (based on importance).

• Tolerance to noise: Bagging is more robust (due to aggregation); Boosting is more sensitive (may overfit to noise).

• Properties: Bagging reduces variance; Boosting reduces bias and variance (with a focus on bias).


Contributions

• These slides have been prepared thanks to:
  • Nikan Vasei
  • Mahan Bayhaghi


[1] C. M. Bishop, Pattern Recognition and Machine Learning. Information Science and Statistics, New York, NY: Springer, 1st ed., Aug. 2006.

[2] M. Soleymani Baghshah, “Machine learning.” Lecture slides.

[3] R. E. Schapire, “The boosting approach to machine learning: An overview,” Nonlinear Estimation and Classification, pp. 149–171, 2003.

[4] L. Serrano, Grokking Machine Learning. New York, NY: Manning Publications, Jan. 2022.

[5] T. Mitchell, Machine Learning. McGraw-Hill Series in Computer Science, New York, NY: McGraw-Hill Professional, Mar. 1997.

[6] T. Jaakkola, “Machine learning course slides.” Lecture slides.
