
MA4270 Lecture Notes #7:

Boosting

October 7, 2024

1 Introduction
In this lecture we introduce boosting. The main paradigm of boosting is to combine several
weak classifiers to produce a substantially more powerful classifier. By weak, we mean that the classifier
performs only slightly better than random guessing; the technical definition is that the classifier needs
to be correct more than 1/2 of the time. This is not hard to achieve: if a classifier is correct less than 1/2
of the time, simply invert its answers (swap −1 and 1). In contrast, by strong, we mean that the classifier
classifies correctly with error very close to zero.
(This is in the same spirit as kernels, where, using kernel evaluations, we could perform non-linear
classification with the ideas of linear classification.)
In this lecture, we examine the AdaBoost algorithm, which is used extensively in practical scenarios.
AdaBoost, which stands for Adaptive Boosting, was proposed by Robert Schapire and Yoav Freund.
The boosting algorithm applies to any collection of weak classifiers. For concreteness, in this lecture, we
will focus on a specific family known as decision stumps.

2 Decision Stumps
A decision stump is a classifier of the form

$$h(x; \{s, k, \theta_0\}) := \mathrm{sign}\big(s(x_k - \theta_0)\big).$$

In words: take the value of the k-th coordinate x_k, subtract θ_0, and output the sign. [Figure omitted: visual illustration of a decision stump.] Here:

• k = index of the (only) feature on which the decision is based

• s ∈ {±1} = sign (to allow both combinations of + on one side and − on the other)

• θ_0 = offset/threshold (classify as + if above this value and − if below, or vice versa)

Note that if a specific decision stump has an error > 1/2, the corresponding classifier obtained by flipping
the sign s has error < 1/2.
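As a concrete illustration, a decision stump takes only a few lines of code. Below is a minimal NumPy sketch (the function name stump_predict and the tie-breaking convention sign(0) := +1 are our own choices; the notes leave ties unspecified):

```python
import numpy as np

def stump_predict(X, s, k, theta0):
    """Evaluate h(x; {s, k, theta0}) = sign(s(x_k - theta0)) row-wise.

    X: (n, d) array of inputs; s in {+1, -1}; k: feature index;
    theta0: threshold. Convention: sign(0) is taken as +1.
    """
    return np.where(s * (X[:, k] - theta0) >= 0, 1, -1)
```

For example, stump_predict(X, s=+1, k=0, theta0=0.5) labels a point +1 exactly when its first coordinate is at least 0.5.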
Notation: In what follows, we let θ := {s, k, θ0 } denote all the parameters of the decision stump.
A single decision stump h(x; θ) is extremely simple and unlikely to classify well.
But it is possible to combine these simple classifiers to build much more complex ones. For example, a
weighted decision function of the form

$$f_M(x) = \sum_{m=1}^{M} \alpha_m h(x; \theta_m), \qquad (2.1)$$

where θ_m = (s_m, k_m, θ_{0,m}) for each m = 1, . . . , M, can implement much more complex decision rules.

• The individual h(x; θ_m) are called weak learners or base learners.

• We can interpret α_m as the “vote” of the m-th weak learner.

• The AdaBoost algorithm provides a means for finding good {(θ_m, α_m)}_{m=1}^{M}.

Example. Even combining just three stumps (with equal weights), we get classifier regions that start to
look quite different from a single stump's. [Figure omitted.]

2.1 Decision Trees


Side note: Decision stumps are often used to construct decision trees (not covered in this course).
Sometimes boosting is used with decision trees playing the role of the weak learner. Decision trees are
also a key building block in another popular technique called random forests (and the associated concept of
bagging), which we will not cover.

3 The AdaBoost Algorithm


The algorithm selects a sequence of weak classifiers h(·; θ) together with a corresponding sequence of weights
α_m, choosing one of each in each iteration.
To identify the best weak classifier to include at each iteration, we maintain a set of weights
w(t), 1 ≤ t ≤ n, one for each data-point. These weights guide the choice of weak classifier in the next iteration.

The algorithm and the analysis make use of the exponential loss:

$$\mathrm{Loss}_{\exp}(y, f(x)) = \exp(-y f(x)). \qquad (3.1)$$

This is an upper bound on the 0-1 loss

$$\mathrm{Loss}_{0\text{-}1}(y, f(x)) = \mathbf{1}\{y \neq \mathrm{sign}(f(x))\}.$$

(Here f(x) can be thought of as representing θ^T x from earlier lectures, but now we will use more general
choices.) The bound can be seen visually by interpreting the two losses as functions of z = y f(x): the
exponential loss e^{−z} sits above the 0-1 loss for every z. [Figure omitted.]
3.1 The algorithm
Input: Data set D = {(x_t, y_t)}_{t=1}^{n} with data x_t ∈ R^d and labels y_t ∈ {−1, 1}. Specify a number of
iterations M (or use some other stopping criterion, e.g., stop when the error on D is small).
Steps:
1. Initialize a set of weights w_0(t) = 1/n for t = 1, . . . , n.
2. For m = 1, . . . , M, do the following:
(a) Choose the next base learner h(·; θ̂_m) according to the criterion

$$\hat{\theta}_m = \arg\min_{\theta} \sum_{t \,:\, y_t \neq h(x_t; \theta)} w_{m-1}(t). \qquad (3.2)$$

(b) Set $\hat{\alpha}_m = \frac{1}{2} \log \frac{1 - \hat{\varepsilon}_m}{\hat{\varepsilon}_m}$, where $\hat{\varepsilon}_m = \sum_{t : y_t \neq h(x_t; \hat{\theta}_m)} w_{m-1}(t)$ is the minimal value attained in (3.2).
(c) Update the weights:

$$w_m(t) = \frac{1}{Z_m}\, w_{m-1}(t)\, e^{-y_t h(x_t; \hat{\theta}_m)\hat{\alpha}_m} \qquad (3.3)$$

for each t = 1, . . . , n, where Z_m is defined so that the weights sum to one:

$$Z_m = \sum_{t=1}^{n} w_{m-1}(t)\, e^{-y_t h(x_t; \hat{\theta}_m)\hat{\alpha}_m}. \qquad (3.4)$$

3. Output: $f_M(x) = \sum_{m=1}^{M} \hat{\alpha}_m h(x; \hat{\theta}_m)$, corresponding to the classifier ŷ = sign(f_M(x)).
Note: The outputs of the weak classifiers h are ±1.
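To make the steps concrete, here is a minimal NumPy sketch of the loop above with decision stumps as base learners (the helper names and the exhaustive threshold search are our own choices, and we assume every round attains 0 < ε̂_m < 1/2 so that α̂_m is finite):

```python
import numpy as np

def stump_predict(X, s, k, theta0):
    # As in the Section 2 sketch; convention sign(0) := +1.
    return np.where(s * (X[:, k] - theta0) >= 0, 1, -1)

def fit_stump(X, y, w):
    """Step 2(a): exhaustive search for the stump minimizing (3.2)."""
    best_err, best_params = np.inf, None
    for k in range(X.shape[1]):
        vals = np.unique(X[:, k])
        # Candidate thresholds: below the minimum, and midway between sorted values.
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2))
        for theta0 in thresholds:
            for s in (+1, -1):
                err = w[stump_predict(X, s, k, theta0) != y].sum()
                if err < best_err:
                    best_err, best_params = err, (s, k, theta0)
    return best_params, best_err

def adaboost(X, y, M):
    """Run M rounds; returns the stumps and votes defining f_M in (2.1)."""
    w = np.full(X.shape[0], 1.0 / X.shape[0])         # step 1: w_0(t) = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        (s, k, theta0), eps = fit_stump(X, y, w)      # step 2(a)
        alpha = 0.5 * np.log((1 - eps) / eps)         # step 2(b)
        w = w * np.exp(-y * stump_predict(X, s, k, theta0) * alpha)  # step 2(c)
        w = w / w.sum()                               # dividing by Z_m
        stumps.append((s, k, theta0))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """The final classifier: y-hat = sign(f_M(x))."""
    f = sum(a * stump_predict(X, s, k, t0)
            for (s, k, t0), a in zip(stumps, alphas))
    return np.sign(f)
```

The exhaustive search over (s, k, θ_0) is affordable because, for n samples and d features, only O(nd) distinct stumps need to be checked.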

3.2 Equivalent update step


It is possible to write step (c) in a different way. Specifically, in step (c), the update of the weights can be
expressed as

$$\tilde{w}_m(t) = w_{m-1}(t)\, e^{-y_t h(x_t; \hat{\theta}_m)\hat{\alpha}_m}$$

and subsequently

$$w_m(t) = \frac{\tilde{w}_m(t)}{\sum_{t'=1}^{n} \tilde{w}_m(t')}$$

for each t = 1, . . . , n.

4 What is AdaBoost doing?


We now describe the intuition behind the AdaBoost algorithm. This gives us an idea of why the algorithm is
effective.

4.1 Intuition behind the base learner (3.2)


We begin by explaining the intuition behind step (a).
First, note that the quantity in (3.2) is a weighted training error

$$\varepsilon_m = \sum_{t \,:\, y_t \neq h(x_t; \theta)} w_{m-1}(t).$$

This is because we sum the weights over the data-points that h(·; θ) classifies incorrectly. As such, in
step (a), we are selecting the classifier that performs best according to the weighted training error defined
by the set of weights {w(t)}. The weights have the effect of emphasizing certain data-points
more than others: they tell us which data-points to focus on. As we will see, the weights are updated in
such a way that we emphasize data-points that are currently classified incorrectly.
Sometimes you might see the selection rule (3.2) written as

$$\hat{\theta}_m = \arg\min_{\theta} \sum_{t=1}^{n} w_{m-1}(t) \cdot \big({-y_t h(x_t; \theta)}\big). \qquad (4.1)$$

To see that the two are equivalent, first note that since both y_t and h(·) take values in ±1, we have

$$-y_t h(x_t; \theta) = 2 \cdot \mathbf{1}\{y_t \neq h(x_t; \theta)\} - 1 \qquad (4.2)$$

(just check the cases y_t = h and y_t ≠ h separately). Summing over the samples t = 1, . . . , n gives

$$\sum_{t=1}^{n} w_{m-1}(t)\big({-y_t h(x_t; \theta)}\big) = 2\varepsilon_m - 1,$$

so the two rules minimize the same quantity (the factor of 2 and the shift by −1 do not affect the minimizer).
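A quick numerical sanity check of (4.2) and the sum identity, using made-up labels and stump outputs:

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1])      # labels
h = np.array([1, 1, 1, -1, -1])      # base-learner outputs (2 mistakes)
w = np.full(5, 0.2)                  # uniform weights

eps = w[h != y].sum()                # weighted error: 0.4
lhs = (w * (-y * h)).sum()           # correlation form in (4.1)
assert np.isclose(lhs, 2 * eps - 1)  # 2(0.4) - 1 = -0.2
```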

4.2 Intuition behind the choice of α̂m


Next, we discuss the choice of $\hat{\alpha}_m = \frac{1}{2} \log \frac{1 - \hat{\varepsilon}_m}{\hat{\varepsilon}_m}$. First, let's examine the value α̂_m as a function of the
training error ε̂_m. [Figure omitted: plot of α̂_m against ε̂_m.] In particular:
• It is a decreasing function of ε̂_m. This means that we assign the classifier θ̂_m more weight if it has a
smaller error ε̂_m.
• If ε̂_m = 0, then α̂_m = ∞. In other words, if the classifier θ̂_m classifies the entire data set perfectly, we
set θ̂_m to be our final classifier.
• If ε̂_m = 1/2, then α̂_m = 0. That is to say, if the classifier θ̂_m does nothing useful (and is only as good as
pure guessing), we assign no weight to the classifier.
In the next section, we will see more formally how this choice of α̂_m arises.
Note that the regime ε̂_m > 1/2 (which would give α̂_m < 0) is not relevant, because such a base learner
cannot minimize ε̂_m. (Proof: just consider the same base learner with s replaced by −s.)
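To get a feel for the scale of the votes, here is a tiny computation of α̂ for a few hypothetical error values:

```python
import numpy as np

for eps in (0.05, 0.2, 0.4, 0.5):
    alpha = 0.5 * np.log((1 - eps) / eps)
    print(f"eps = {eps:.2f}  ->  alpha = {alpha:.3f}")
# eps = 0.05 -> 1.472;  eps = 0.20 -> 0.693;  eps = 0.40 -> 0.203;  eps = 0.50 -> 0.000
```

Nearly-perfect learners receive large votes, while barely-better-than-random learners contribute almost nothing.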

4.3 Intuition behind the weights update step
Next, we explain how the weights are updated. Since y and h take values in ±1, the term $y_t h(x_t; \hat{\theta}_m)$ must
be ±1. Therefore the update step (3.3) has only two possible forms:

$$w_m(t) = \frac{w_{m-1}(t)}{Z_m} \times \begin{cases} e^{\hat{\alpha}_m} & y_t \neq h(x_t; \hat{\theta}_m) \\ e^{-\hat{\alpha}_m} & y_t = h(x_t; \hat{\theta}_m). \end{cases} \qquad (4.3)$$

In particular, the update depends on whether θ̂_m classifies the data-point x_t correctly or incorrectly:
• If the newly chosen base learner h(·; θ̂_m) classifies x_t incorrectly, we increase the weight w(t) by a
factor of e^{α̂_m}.
• If the newly chosen base learner h(·; θ̂_m) classifies x_t correctly, we decrease the weight w(t) by a factor
of e^{−α̂_m}.

This update step places more importance on inputs that were previously classified wrongly. (Analogy:
Teacher places more teaching emphasis on topics that students scored poorly on)
The update rule is a form of multiplicative weights update, which is a widespread tool that was developed
independently in multiple research communities.

4.4 Illustration of the algorithm


• A simple example with n = 7 samples. [Figure omitted.]
• Initially w_0 = (1/7, . . . , 1/7).
• Suppose the first chosen stump misclassifies two of the seven samples. Then ε̂_1 = 2/7 and hence
α̂_1 = (1/2) log(5/2).
• The updated weights are

$$w_1(t) = \frac{1}{Z_1} \times \begin{cases} \frac{1}{7} e^{-\frac{1}{2}\log\frac{5}{2}} & \text{correct samples} \\ \frac{1}{7} e^{\frac{1}{2}\log\frac{5}{2}} & \text{incorrect samples,} \end{cases}$$

and a bit of analysis shows that after choosing Z_1 to satisfy $\sum_t w_1(t) = 1$, the weights simplify to
$w_1(t) \in \{\frac{1}{10}, \frac{1}{4}\}$ (five values of 1/10, two values of 1/4).
• On the next iteration, a base classifier is chosen that classifies those with weight 1/4 correctly.
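These numbers are easy to verify numerically (a quick check with a hypothetical mistake pattern):

```python
import numpy as np

n = 7
w0 = np.full(n, 1 / n)
mistakes = np.array([1, 1, 0, 0, 0, 0, 0], dtype=bool)  # 2 of the 7 samples wrong

eps1 = w0[mistakes].sum()                         # 2/7
alpha1 = 0.5 * np.log((1 - eps1) / eps1)          # (1/2) log(5/2)
w1 = w0 * np.exp(np.where(mistakes, alpha1, -alpha1))
w1 = w1 / w1.sum()                                # normalize by Z_1
print(np.round(w1, 3))                            # [0.25 0.25 0.1 0.1 0.1 0.1 0.1]
# Also: w1[mistakes].sum() == 0.5, consistent with Claim 6.1 below.
```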

4.5 (Optional) Application


See Section 10.4 of the “Understanding Machine Learning” book for an example of a real-world application
of AdaBoost, in the context of face recognition (Viola-Jones method).

5 Analysis of Training Error
In the following, we define the training error as the proportion of mis-classified samples in D.
Theorem 5.1. After M iterations, the training error of AdaBoost satisfies

$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{y_t f_M(x_t) \le 0\} \;\le\; \exp\left(-2 \sum_{m=1}^{M} \left(\frac{1}{2} - \hat{\varepsilon}_m\right)^2\right).$$

There are some consequences. First, because the LHS can only take values in {0, 1/n, 2/n, . . . }, the training
error is guaranteed to be zero if the RHS is < 1/n.
This can happen, for instance, if we have some form of guarantee that ε̂_m ≤ 1/2 − γ for all m and some
γ > 0. Suppose this is true; then one has

$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{y_t f_M(x_t) \le 0\} \le e^{-2M\gamma^2}.$$

Then if $M > \frac{\log n}{2\gamma^2}$, the RHS is < 1/n, in which case the LHS must be equal to zero.
The condition that ε̂_m ≤ 1/2 − γ can be interpreted as being slightly better than random guessing, since
randomly guessing a label gives a 50% chance of success. Also, note that there is no assumption here of
linear separability.
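For instance, with n = 1000 samples and an edge of γ = 0.1 (hypothetical numbers), zero training error is guaranteed once M exceeds log(1000)/(2 · 0.01) ≈ 346 rounds:

```python
import numpy as np

n, gamma = 1000, 0.1
M_needed = np.log(n) / (2 * gamma**2)
print(int(np.ceil(M_needed)))   # 346
```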
The result sounds like overfitting, but surprisingly it is not – we discuss this later.
Proof of Theorem 5.1. We break the proof into a number of steps.
[Step 1]: Our first step is to upper bound the 0-1 loss with the exponential loss. As we noted in (3.1), the
exponential loss upper bounds the 0-1 loss, and hence

$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{y_t \neq \mathrm{sign}(f_M(x_t))\} \le \frac{1}{n} \sum_{t=1}^{n} e^{-y_t f_M(x_t)}.$$

[Step 2]: Our second step is to apply an identity; namely,

$$\frac{1}{n} \sum_{t=1}^{n} e^{-y_t f_M(x_t)} = \prod_{m=1}^{M} Z_m,$$

where Z_m is the normalizing constant in (3.4).
To see this, write out the updates recursively:

$$w_0(t) = \frac{1}{n}$$
$$w_1(t) = \frac{1}{n} \cdot \frac{\exp(-\hat{\alpha}_1 y_t h(x_t; \theta_1))}{Z_1}$$
$$w_2(t) = \frac{1}{n} \cdot \frac{\exp(-\hat{\alpha}_1 y_t h(x_t; \theta_1))\exp(-\hat{\alpha}_2 y_t h(x_t; \theta_2))}{Z_1 Z_2}$$
$$\vdots$$
$$w_M(t) = \frac{1}{n} \cdot \frac{\exp\left(-\sum_{m=1}^{M} \hat{\alpha}_m y_t h(x_t; \theta_m)\right)}{\prod_{m=1}^{M} Z_m} = \frac{1}{n} \cdot \frac{\exp(-y_t f_M(x_t))}{\prod_{m=1}^{M} Z_m},$$

where in the last step we substituted (2.1).
Recall that $\sum_{t=1}^{n} w_M(t) = 1$ (by design); summing the last display over t and rearranging gives the claim.
[Step 3]: Our third step is to re-write Z_m in a way that is easy to bound. Recall the definition of Z_m
in (3.4). We can split the sum over t into two cases: if $y_t = h(x_t; \hat{\theta}_m)$ (the classifier classifies correctly)
then $y_t h(x_t; \hat{\theta}_m) = 1$, whereas if $y_t \neq h(x_t; \hat{\theta}_m)$ (the classifier classifies incorrectly) then $y_t h(x_t; \hat{\theta}_m) = -1$.
Therefore, (3.4) simplifies to

$$Z_m = e^{\hat{\alpha}_m} \sum_{t \,:\, y_t \neq h(x_t; \hat{\theta}_m)} w_{m-1}(t) + e^{-\hat{\alpha}_m} \sum_{t \,:\, y_t = h(x_t; \hat{\theta}_m)} w_{m-1}(t) = e^{\hat{\alpha}_m} \hat{\varepsilon}_m + e^{-\hat{\alpha}_m}(1 - \hat{\varepsilon}_m). \qquad (5.1)$$

In the second equality, we used the fact that $\sum_{t : y_t \neq h(x_t; \hat{\theta}_m)} w_{m-1}(t) = \hat{\varepsilon}_m$ (by definition of ε̂_m), and that
$\sum_{t : y_t = h(x_t; \hat{\theta}_m)} w_{m-1}(t) = 1 - \hat{\varepsilon}_m$ (since $\sum_t w_{m-1}(t) = 1$ by design).
Note that $\frac{\partial Z_m}{\partial \hat{\alpha}_m} = \hat{\varepsilon}_m e^{\hat{\alpha}_m} - e^{-\hat{\alpha}_m}(1 - \hat{\varepsilon}_m)$. In particular, if we set the derivative to zero, we obtain
$\hat{\alpha}_m = \frac{1}{2} \log \frac{1 - \hat{\varepsilon}_m}{\hat{\varepsilon}_m}$. (One can check that this choice is a global minimum.) The choice of α̂_m is
now better motivated: it minimizes the upper bound on the training error from Steps 1 and 2.
[Step 4]: We substitute the choice of α̂_m into (5.1) to obtain

$$Z_m = \sqrt{\frac{1 - \hat{\varepsilon}_m}{\hat{\varepsilon}_m}}\, \hat{\varepsilon}_m + \sqrt{\frac{\hat{\varepsilon}_m}{1 - \hat{\varepsilon}_m}}\, (1 - \hat{\varepsilon}_m) = 2\sqrt{\hat{\varepsilon}_m(1 - \hat{\varepsilon}_m)} = \sqrt{1 - (1 - 2\hat{\varepsilon}_m)^2},$$

where the last step can be verified by expanding the square.
Note the following inequality: $\sqrt{1 - c^2} = \exp(\frac{1}{2}\log(1 - c^2)) \le \exp(-\frac{1}{2}c^2)$, where the last step uses $\log(1 + a) \le a$.
With that, we can bound $\sqrt{1 - (1 - 2\hat{\varepsilon}_m)^2} \le \exp(-\frac{1}{2}(1 - 2\hat{\varepsilon}_m)^2)$. This is useful because it allows us to upper
bound a product of square roots by a product of exponentials, which is easier to work with.
In short, we have $Z_m \le \exp\left(-\frac{1}{2}(1 - 2\hat{\varepsilon}_m)^2\right)$. By combining with Steps 1 and 2, one has

$$\frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{y_t \neq \mathrm{sign}(f_M(x_t))\} \le \prod_{m=1}^{M} Z_m \le \prod_{m=1}^{M} \exp\left(-\frac{1}{2}(1 - 2\hat{\varepsilon}_m)^2\right) = \exp\left(-2\sum_{m=1}^{M}\left(\frac{1}{2} - \hat{\varepsilon}_m\right)^2\right),$$

which proves the theorem.

6 Weighted error relative to updated weights


Claim 6.1. The updated weights w_m(t) are such that

$$\sum_{t=1}^{n} w_m(t)\, \mathbf{1}\{y_t \neq h(x_t; \hat{\theta}_m)\} = \frac{1}{2}.$$

Hence, the base learner chosen at iteration m has the “worst possible” weighted training error with
respect to the updated weights (a value ε > 1/2 is technically worse, but it could be replaced by 1 − ε < 1/2
by just flipping the sign, i.e., changing positive labels to negative and vice versa).
• Consequence: The same base learner will never be chosen on two consecutive rounds.

Proof. Since

$$\mathbf{1}\{y_t \neq h(x_t; \hat{\theta}_m)\} = \frac{1}{2}\left(1 - y_t h(x_t; \hat{\theta}_m)\right)$$

(see (4.2)) and $\sum_{t=1}^{n} w_m(t) = 1$ (by definition), it suffices to prove that

$$\sum_{t=1}^{n} w_m(t)\, y_t h(x_t; \hat{\theta}_m) = 0.$$

To show this, we do the usual trick of splitting the summation into two:

$$\sum_{t=1}^{n} w_m(t)\, y_t h(x_t; \hat{\theta}_m) = \sum_{t \,:\, y_t = h(x_t; \hat{\theta}_m)} w_m(t) - \sum_{t \,:\, y_t \neq h(x_t; \hat{\theta}_m)} w_m(t)$$
$$\stackrel{(a)}{=} \frac{1}{Z_m} \sum_{t \,:\, y_t = h(x_t; \hat{\theta}_m)} w_{m-1}(t)\, e^{-\hat{\alpha}_m} - \frac{1}{Z_m} \sum_{t \,:\, y_t \neq h(x_t; \hat{\theta}_m)} w_{m-1}(t)\, e^{\hat{\alpha}_m}$$
$$\stackrel{(b)}{=} \frac{1}{Z_m}(1 - \hat{\varepsilon}_m)e^{-\hat{\alpha}_m} - \frac{1}{Z_m}\hat{\varepsilon}_m e^{\hat{\alpha}_m} \stackrel{(c)}{=} 0.$$

Here,
(a) substitutes the update equation (4.3),
(b) applies the definition $\hat{\varepsilon}_m = \sum_{t : y_t \neq h(x_t; \hat{\theta}_m)} w_{m-1}(t)$, and
(c) uses the fact that α̂_m was chosen by equating $\hat{\varepsilon}_m e^{\hat{\alpha}_m} - e^{-\hat{\alpha}_m}(1 - \hat{\varepsilon}_m)$ with zero (see Step 3 above).
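A quick numerical check of the claim using the quantities from the proof (the ε̂ values are hypothetical; any 0 < ε̂ < 1/2 works):

```python
import numpy as np

for eps in (0.1, 0.25, 0.4):
    alpha = 0.5 * np.log((1 - eps) / eps)
    Z = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)       # (5.1)
    corr = ((1 - eps) * np.exp(-alpha) - eps * np.exp(alpha)) / Z
    print(f"eps = {eps}: weighted error under w_m = {0.5 * (1 - corr):.3f}")
# prints 0.500 for every eps, as Claim 6.1 asserts
```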

7 Test Error
For typical learning algorithms, making the training error too small makes the test error (i.e., the error rate
on unseen data) large.
• This is known as overfitting, and will be explored more in the coming lectures.
For boosting, we typically observe the following surprising phenomenon:
• Test error continues decreasing even after the training error reaches zero!
Let's try to get an intuitive understanding of this. Define

$$\tilde{f}_M(x) = \frac{\sum_{m=1}^{M} \hat{\alpha}_m h(x; \hat{\theta}_m)}{\sum_{m=1}^{M} \hat{\alpha}_m} \in [-1, 1], \qquad \mathrm{margin}(t) = y_t \tilde{f}_M(x_t) \in [-1, 1].$$

We have margin(t) > 0 if and only if x_t is classified correctly. But the closer margin(t) is to one, the “further
away” the classifier is from incorrectly classifying it.

Now define the following stricter notion of error:

$$\mathrm{Err}_n(f_M; \rho) = \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{\mathrm{margin}(t) \le \rho\}.$$

Note that the normal training error is simply Err_n(f_M; 0).
Implicitly, AdaBoost is minimizing $e^{-\mathrm{margin}(t)}$, summed over the samples. This means that even after achieving Err_n(f_M; 0) = 0,
the algorithm continues to decrease Err_n(f_M; ρ) for ρ > 0:

• A higher margin leads to better generalization, and boosting naturally increases the margin.
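A sketch of how one might compute these margin-based errors from the output of the adaboost sketch in Section 3.1 (the helper names are ours, and stump_predict is reused from the earlier sketches):

```python
import numpy as np

def margins(X, y, stumps, alphas):
    """margin(t) = y_t * f~_M(x_t), with the votes normalized to sum to 1."""
    f = sum(a * stump_predict(X, s, k, t0)
            for (s, k, t0), a in zip(stumps, alphas))
    return y * f / np.sum(alphas)

def err_margin(X, y, stumps, alphas, rho):
    """Err_n(f_M; rho): fraction of samples with margin at most rho."""
    return np.mean(margins(X, y, stumps, alphas) <= rho)
```

Plotting err_margin against the number of rounds M, for several ρ > 0, is one way to visualize the phenomenon above.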

A formal statement on the test error of AdaBoost is given in the 1998 paper by Schapire, Freund, Bartlett,
and Lee – the final author is a professor at NUS.

8 Additional References
• Blog posts by Jeremy Kun on boosting (http://jeremykun.com/2015/05/18/boosting-census/) and on why
it doesn't overfit (http://jeremykun.com/2015/09/21/the-boosting-margin-or-why-boosting-doesnt-overfit/)
• MIT lecture notes, lectures 12 and 13 (http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/)
• Section 14.3 of Bishop's “Pattern Recognition and Machine Learning” book
• Chapter 10 of the “Understanding Machine Learning” book
• The original paper on boosting: “A decision-theoretic generalization of on-line learning and an application
to boosting” (Freund and Schapire, 1997)
