Lecture-10-boosting

The document discusses the concept of boosting, particularly the Adaboost algorithm, which combines weak learners to create a strong classifier by iteratively adjusting weights based on classification errors. It explains the process of initializing weights, fitting predictors, computing error rates, and updating weights to improve classification accuracy. Additionally, it addresses empirical risk control and generalization error bounds, highlighting the trade-off between approximation and estimation errors based on the number of iterations.

Boosting

Son P. Nguyen

University of Economics and Law


Vietnam National University-HCMC

July 30, 2024


Introduction

▶ First algorithm of “boosting”: Tukey in 1972!


▶ Build a set of rules (predictors) that are then aggregated.
▶ The process is recursive: the rule built at step m depends on the
one built at step m − 1.
Introduction
Weak Learner

▶ The term Boosting refers to general methods for producing
precise decisions from weak learner rules.
▶ A rule g that is slightly better than chance is called a weak
learner:

∃γ > 0 s.t. P(g(X) ≠ Y) = 1/2 − γ.
▶ Some examples of weak learners include:
▶ 1-nearest neighbor (1-nn)
▶ Decision trees with 2 terminal nodes (stumps)
The Adaboost Algorithm
Input: a weak learner g and a number of iterations M.
1. Initialize: Set the weights of each data point:

∀i ∈ {1, . . . , n} : wi,1 = 1/n.
2. For m = 1 to M:
2.1 Fit the predictor gm to the sample Dn weighted by
w1,m , . . . , wn,m .
2.2 Compute the error rate:

em = ( Σ_{i=1}^{n} wi,m 1{gm(xi) ≠ yi} ) / ( Σ_{i=1}^{n} wi,m ).

2.3 Compute:

αm = ln( (1 − em) / em ).

2.4 Update weights:

∀i ∈ {1, . . . , n} : wi,m+1 = wi,m exp( αm 1{gm(xi) ≠ yi} ).


The Adaboost Algorithm

Final hypothesis:

G(x) = sign( Σ_{m=1}^{M} αm gm(x) ).
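The algorithm above can be sketched in Python using decision stumps as the weak learner. This is a minimal illustrative implementation, not the lecture's own code; the helper names (fit_stump, stump_predict, adaboost, predict) are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    """Search all (feature, threshold, polarity) stumps for the one
    minimizing the weighted error (hypothetical helper)."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)  # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w @ (pred != y)
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(X, j, thr, pol):
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, M=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # step 1: wi,1 = 1/n
    rules = []
    for m in range(M):
        err, j, thr, pol = fit_stump(X, y, w)  # step 2.1: fit weighted
        em = err / w.sum()                     # step 2.2: error rate
        em = np.clip(em, 1e-10, 1 - 1e-10)     # avoid log(0) / division by 0
        alpha = np.log((1 - em) / em)          # step 2.3
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(alpha * (pred != y))    # step 2.4: boost mistakes
        rules.append((alpha, j, thr, pol))
    return rules

def predict(rules, X):
    """Final hypothesis: sign of the alpha-weighted vote."""
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in rules)
    return np.sign(score)
```

On a toy linearly separable sample, a few rounds suffice to classify the training data correctly.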
Comments on the Adaboost Algorithm

▶ Handling Weights in Weak Learners:


▶ If the weak learner cannot incorporate weights directly, the
predictor can be trained on a subsample of Dn where
observations are randomly drawn according to weights
w1,m , . . . , wn,m .
▶ Updating Weights:
▶ The weights w1,m , . . . , wn,m are updated after each iteration:
▶ If the i-th individual is correctly classified, its weight remains
unchanged.
▶ If the i-th individual is misclassified, its weight is increased.
▶ Weight of the Rule αm :
▶ The weight αm of the rule gm increases with its performance
on Dn :
▶ αm increases as the error rate em decreases.
▶ The rule must not be “too weak”: if em > 0.5, then αm < 0.
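The subsampling workaround for weight-agnostic weak learners can be sketched as follows; the weight values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights w1,m , . . . , wn,m after some round
# (illustrative values, not from the lecture).
w = np.array([0.1, 0.1, 0.5, 0.2, 0.1])
n = len(w)

# Draw a subsample of indices with probability proportional to the
# weights; a weak learner that cannot handle weights directly can
# then be fit on X[idx], y[idx].
idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
print(idx)
```

Heavily weighted (i.e. previously misclassified) observations tend to appear several times in the subsample, mimicking the weighted fit.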
Classification

Goal: create a good classifier by combining several weak classifiers


▶ a weak classifier is a classifier which is able to produce results
only slightly better than a random guess
▶ Idea: apply a weak classifier repeatedly (iteratively) to
modifications of the data
▶ at each iteration, give more weight to the misclassified
observations.
An example

Initially all examples are equally important.

h1 = The best classifier on this data.
Clearly there are mistakes. Error ϵ1 = 0.3.

For the next round, increase the importance of the examples with
mistakes and down-weight the examples that h1 got correctly.
An example

Dt = Set of weights at round t, one for each example. Think


“How much should the weak learner care about this example in its
choice of the classifier?”
h2 = A classifier learned on this data. Has an error ϵ2 = 0.21.
Why not 0.3? Because while computing the error, we weight each
example xi by its Dt(i).
An example

ϵt = 1/2 − (1/2) Σ_{i=1}^{m} Dt(i) yi h(xi)

Why is this a reasonable definition?

Consider two cases:
1. When yi ≠ h(xi), we have yi h(xi) = −1.
2. When yi = h(xi), we have yi h(xi) = 1.
Therefore, ϵt is in fact

ϵt = Σ_{i : yi ≠ h(xi)} Dt(i)

This represents the total error, but each example only contributes
to the extent that it is important.
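A quick numeric check confirms that the two expressions for ϵt agree; the toy weights, labels, and predictions below are assumed for illustration.

```python
import numpy as np

# Assumed toy round: 5 examples with uniform weights Dt summing to 1,
# labels y and predictions h(x) in {-1, +1}.
D = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
y = np.array([ 1, -1,  1,  1, -1])
h = np.array([ 1,  1,  1, -1, -1])   # mistakes on examples 2 and 4

# Form 1: 1/2 - (1/2) * sum_i Dt(i) yi h(xi)
eps_margin = 0.5 - 0.5 * np.sum(D * y * h)

# Form 2: sum of Dt(i) over misclassified examples
eps_direct = np.sum(D[y != h])

print(eps_margin, eps_direct)  # both equal 0.4
```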
An example

h2 = A classifier learned on this data. Has an error ϵ2 = 0.21.
For the next round, increase the importance of the mistakes and
down-weight the examples that h2 got correctly
An example

ϵt = 1/2 − (1/2) Σ_{i=1}^{m} Dt(i) yi h(xi)

h3 = A classifier learned on this data. Has an error ϵ3 = 0.14.


An example

The final hypothesis is a combination of all the hi’s we have seen
so far.

Think of the α values as the vote for each weak classifier; the
boosting algorithm has to somehow specify them.
An outline of boosting

Given a training set (x1 , y1 ), . . . , (xm , ym ).


▶ Instances xi ∈ X labeled with yi ∈ {−1, 1}
For t = 1, 2, . . . , T :
▶ Construct a distribution Dt on {1, 2, . . . , m}.
▶ Find a weak hypothesis (rule of thumb) ht such that it has a
small weighted error ϵt.
Construct a final output Hfinal.
Empirical Risk Control (Empirical Error)
▶ Error Rate em:
▶ em refers to the error rate of the weak learner ĝm on the
weighted dataset Dn:

em = ( Σ_{i=1}^{n} wi 1{gm(xi) ≠ yi} ) / ( Σ_{i=1}^{n} wi )

▶ Gain over Pure Chance (γm):
▶ γm measures the improvement of gm over random guessing:

em = 1/2 − γm
▶ Empirical Risk Bound:
▶ The empirical risk Rn(ĝ) decreases with more iterations:

Rn(ĝ) ≤ exp( −2 Σ_{m=1}^{M} γm² )

The empirical risk tends to 0 as the number of iterations
increases.
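A small sketch of how fast this bound shrinks, assuming a constant edge γm = 0.1 at every round (an illustrative value, not from the lecture):

```python
import math

# With gamma_m = 0.1 fixed, the bound exp(-2 * sum_m gamma_m^2)
# becomes exp(-2 * M * 0.01) and decays exponentially in M.
gamma = 0.1
for M in (10, 50, 100, 200):
    bound = math.exp(-2 * M * gamma**2)
    print(f"M = {M:4d}  bound = {bound:.4f}")
```

Even a small but constant edge over chance drives the empirical risk bound toward 0 as M grows.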
Risk Control (Generalization Error)

▶ Generalization Error Bound:


▶ The generalization error R(ĝ) is bounded as follows:

R(ĝ) ≤ Rn(ĝ) + O( √(MV/n) )

▶ Here, V denotes the Vapnik-Chervonenkis (VC) dimension,


and n is the number of samples.
▶ The bias/variance (approximation/estimation error) trade-off
is regulated by the number of iterations M:
▶ Small M: The first term (approximation error) dominates.
▶ Large M: The second term (estimation error) dominates.
▶ When M is very large, Adaboost overfits.
