07 Boosting Notes
Boosting
October 7, 2024
1 Introduction
In this lecture we introduce the Boosting algorithm. The main paradigm of boosting is to combine several
weak classifiers to produce a substantially more powerful classifier. By weak, we mean that the classifier
performs only slightly better than random guessing. The technical definition of weak is that the classifier needs to be correct > 1/2 of the time. In fact, this is not hard to achieve: if a classifier is correct < 1/2 of the time, simply invert its answers (−1 to 1 and 1 to −1). In contrast, by strong, we mean that the classifier classifies correctly with error very close to zero.
(This is in the same spirit as kernels, where, using kernel evaluations, we could perform non-linear classification using the ideas of linear classification.)
In this lecture, we examine the AdaBoost algorithm, which is used extensively in practical scenarios.
AdaBoost, which stands for Adaptive Boosting, was proposed by Robert Schapire and Yoav Freund.
The boosting algorithm applies to any collection of weak classifiers. For concreteness, in this lecture, we
will focus on a specific family known as decision stumps.
2 Decision Stumps
A decision stump is a classifier of the form

h(x; θ) = s · sign(x_k − θ0).

(Figure: a visual illustration of a decision stump.) In words: get the value of the k-th coordinate entry x_k, subtract θ0, and take the sign; the sign s then decides which side is labeled +1. Here,
k = index of the (only) feature that the decision is based on,
s ∈ {±1} = sign (to allow both combinations of + on one side and − on the other),
θ0 = the threshold at which the decision flips.
Note that if a specific decision stump has an error > 1/2, the corresponding classifier obtained by flipping
the sign s has error < 1/2.
Notation: In what follows, we let θ := {s, k, θ0 } denote all the parameters of the decision stump.
A single decision stump h(x; θ) is extremely simple and unlikely to classify well.
But it is possible to combine these simple classifiers to build much more complex classifiers. For example, a weighted decision function of the form

f_M(x) = Σ_{m=1}^M α_m h(x; θ_m),    (2.1)

where θ_m = (s_m, k_m, θ_{0,m}) for each m = 1, . . . , M, can perform more complex decision rules.
The AdaBoost algorithm provides a means for finding good {(θ_m, α_m)}_{m=1}^M.
Example. Even combining just three stumps (with equal weights), we get classifier regions that start to look quite different from those of a single stump.
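To make this concrete, here is a minimal NumPy sketch of a decision stump and the combination (2.1). The function names (stump_predict, f_M) and the tie-breaking at the threshold are choices made here for illustration, not notation from the notes:

```python
import numpy as np

def stump_predict(X, s, k, theta0):
    """Decision stump h(x; theta) = s * sign(x_k - theta0), with outputs in {-1, +1}."""
    out = np.sign(X[:, k] - theta0)
    out[out == 0] = 1.0              # break ties at the threshold (arbitrary choice)
    return s * out

def f_M(X, stumps, alphas):
    """Weighted vote f_M(x) = sum_m alpha_m * h(x; theta_m), as in (2.1)."""
    return sum(a * stump_predict(X, *th) for a, th in zip(alphas, stumps))

# Three equally-weighted stumps, as in the example above:
X = np.array([[0.2, 1.5], [0.9, 0.3], [1.7, 2.1]])
stumps = [(+1, 0, 0.5), (-1, 1, 1.0), (+1, 0, 1.2)]   # (s, k, theta0) triples
print(np.sign(f_M(X, stumps, [1.0, 1.0, 1.0])))       # combined labels in {-1, +1}
```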
3 AdaBoost

The algorithm and the analysis make use of the exponential loss

Loss(z) = e^{−z},   z = y f(x),

which upper-bounds the zero-one loss 1{z ≤ 0}. (Here f(x) can be thought of as representing θ^T x from earlier lectures, but now we will use more general choices.) The bound can be seen visually by interpreting the two losses as functions of z = y f(x). (Figure: the zero-one and exponential losses plotted as functions of z.)
3.1 The algorithm
Input: Data set D = {(x_t, y_t)}_{t=1}^n with data x_t ∈ R^d, labels y_t ∈ {−1, 1}. Specify a certain number of iterations M (or use some other stopping criterion, e.g., stop when the error on D is small).
Steps:
1. Initialize a set of weights w_0(t) = 1/n for t = 1, . . . , n.
2. For m = 1, . . . , M , do the following:
(a) Choose the next base learner h(·; θ̂_m) according to the following criterion:

θ̂_m = arg min_θ Σ_{t : y_t ≠ h(x_t; θ)} w_{m−1}(t).    (3.2)
(b) Set α̂_m = (1/2) log((1 − ε̂_m)/ε̂_m), where ε̂_m = Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_{m−1}(t) is the minimal value attained in (3.2).
(c) Update the weights:

w_m(t) = (1/Z_m) w_{m−1}(t) e^{−y_t h(x_t; θ̂_m) α̂_m}    (3.3)

for each t = 1, . . . , n, where Z_m is defined so that the weights sum to one:

Z_m = Σ_{t=1}^n w_{m−1}(t) e^{−y_t h(x_t; θ̂_m) α̂_m}.    (3.4)
3. Output: f_M(x) = Σ_{m=1}^M α̂_m h(x; θ̂_m), corresponding to the classifier ŷ = sign(f_M(x)).
Note: The outputs of the weak classifiers h are ±1.
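The steps above translate directly into code. Below is a minimal sketch, reusing stump_predict and f_M from the Section 2 sketch; the exhaustive threshold search in best_stump is one simple way to implement the arg min in (3.2), not the only one:

```python
import numpy as np

def best_stump(X, y, w):
    """Return (eps_hat, theta_hat): the stump minimizing the weighted error (3.2),
    searching all features, candidate thresholds, and both signs."""
    n, d = X.shape
    best_err, best_theta = np.inf, None
    for k in range(d):
        v = np.sort(X[:, k])
        cands = np.concatenate(([v[0] - 1.0], (v[:-1] + v[1:]) / 2.0, [v[-1] + 1.0]))
        for theta0 in cands:
            for s in (+1, -1):
                err = w[stump_predict(X, s, k, theta0) != y].sum()
                if err < best_err:
                    best_err, best_theta = err, (s, k, theta0)
    return best_err, best_theta

def adaboost(X, y, M):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                      # Step 1: uniform initial weights
    stumps, alphas = [], []
    for m in range(M):
        eps, theta = best_stump(X, y, w)         # Step 2(a): selection rule (3.2)
        if eps == 0:                             # perfect stump: alpha would be infinite,
            return [theta], [1.0]                # so use it alone as the final classifier
        alpha = 0.5 * np.log((1 - eps) / eps)    # Step 2(b)
        w = w * np.exp(-y * stump_predict(X, *theta) * alpha)   # Step 2(c), eq. (3.3)
        w = w / w.sum()                          # normalize by Z_m, as in (3.4)
        stumps.append(theta)
        alphas.append(alpha)
    return stumps, alphas
```

For instance, stumps, alphas = adaboost(X, y, M=20) followed by np.sign(f_M(X, stumps, alphas)) reproduces the output of Step 3.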
4 Understanding the Algorithm

4.1 Intuition behind the selection rule

The quantity minimized in (3.2) is a weighted training error: we sum the weights over the data-points for which h(·; θ) classifies incorrectly. As such, in Step (a), we are selecting a classifier that performs best according to the weighted training error score defined by the set of weights {w(t) : t}. The weights {w(t) : t} have the effect of emphasizing certain data-points more than others – they tell us which data-points to focus on. As we later see, the weights are updated in such a way that we emphasize data-points that we classify incorrectly.
Sometimes you might see the selection rule (3.2) written as
θ̂_m = arg min_θ Σ_{t=1}^n w_{m−1}(t) · (−y_t h(x_t; θ)).    (4.1)
To see that the two are equivalent, first note that since both y_t and h(·) take values in ±1, we have

1{y_t ≠ h(x_t; θ)} = (1 − y_t h(x_t; θ))/2,   i.e.,   −y_t h(x_t; θ) = 2 · 1{y_t ≠ h(x_t; θ)} − 1    (4.2)

(just check the cases y_t = h and y_t ≠ h separately). Summing over the samples t = 1, . . . , n gives

Σ_{t=1}^n w_{m−1}(t) · (−y_t h(x_t; θ)) = 2 ε_m(θ) − 1,

where ε_m(θ) = Σ_{t : y_t ≠ h(x_t; θ)} w_{m−1}(t) denotes the weighted error, so that the two rules are minimizing the same thing (the ×2 and the −1 have no effect on the minimizer).
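As a quick numeric sanity check of (4.2) and its summed form, on made-up labels, predictions, and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=10)      # labels
h = rng.choice([-1, 1], size=10)      # base-learner outputs
w = rng.random(10); w /= w.sum()      # weights summing to one

# (4.2): -y*h equals 2 * 1{y != h} - 1, entrywise
assert np.array_equal(-y * h, 2 * (y != h).astype(int) - 1)

# Summed form: sum_t w(t) * (-y_t h_t) = 2 * (weighted error) - 1
eps = w[y != h].sum()
assert np.isclose(np.sum(w * (-y * h)), 2 * eps - 1)
```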
4.2 Intuition behind the choice of α̂_m

Recall that α̂_m = (1/2) log((1 − ε̂_m)/ε̂_m). In particular:

- It is a decreasing function of ε̂_m. This means that we assign the classifier θ̂_m more weight if it has a smaller error ε̂_m.
- If ε̂_m = 0, then α̂_m = ∞. In other words, if the classifier θ̂_m classifies the entire data set perfectly, we set θ̂_m to be our final classifier.
- If ε̂_m = 1/2, then α̂_m = 0. That is to say, if the classifier θ̂_m does nothing useful (and is only as good as pure guessing), we assign no weight to the classifier.

In the next section, we will see more formally how this choice arises. Note that the regime ε̂_m > 1/2 (which would give α̂_m < 0) is not relevant, because such a base learner cannot be the minimizer in (3.2). (Proof: just consider the same base learner with s replaced by −s.)
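These properties can be seen by tabulating α̂_m for a few (made-up) error values:

```python
import numpy as np

for eps in [0.01, 0.10, 0.25, 0.40, 0.50]:
    alpha = 0.5 * np.log((1 - eps) / eps)
    print(f"eps_hat = {eps:.2f}  ->  alpha_hat = {alpha:.3f}")
# Decreasing in eps_hat; blows up as eps_hat -> 0; exactly 0 at eps_hat = 0.5.
```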
4.3 Intuition behind the weights update step
Next, we explain how the weights are updated. Since y and h take values in ±1, the term y_t h(x_t; θ̂_m) must be ±1. Therefore the update step (3.3) has only two possible updates:
w_m(t) = (1/Z_m) w_{m−1}(t) × { e^{α̂_m} if y_t ≠ h(x_t; θ̂_m);  e^{−α̂_m} if y_t = h(x_t; θ̂_m) }.    (4.3)
In particular, the update depends on whether θ̂_m classifies the data-point x_t correctly or incorrectly:
- If the newly chosen base learner h(·; θ̂_m) classifies x_t incorrectly, we increase the weight w(t) by a factor of e^{α̂_m}.
- If the newly chosen base learner h(·; θ̂_m) classifies x_t correctly, we decrease the weight w(t) by a factor of e^{−α̂_m}.
This update step places more importance on inputs that were previously classified wrongly. (Analogy: a teacher places more teaching emphasis on topics that students scored poorly on.)
The update rule is a form of multiplicative weights update, which is a widespread tool that was developed
independently in multiple research communities.
Example. Suppose n = 7 and that the first chosen stump misclassifies two of the seven points. Initially w_0 = (1/7, . . . , 1/7). Then ε̂_1 = 2/7, and hence α̂_1 = (1/2) log((1 − 2/7)/(2/7)) = (1/2) log(5/2).
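We can verify these numbers with a short computation (which two points are misclassified is arbitrary here):

```python
import numpy as np

n = 7
w0 = np.full(n, 1.0 / n)
wrong = np.zeros(n, dtype=bool); wrong[:2] = True   # the stump misclassifies 2 of 7 points

eps1 = w0[wrong].sum()                              # 2/7
alpha1 = 0.5 * np.log((1 - eps1) / eps1)            # (1/2) * log(5/2) ~ 0.458

w1 = w0 * np.where(wrong, np.exp(alpha1), np.exp(-alpha1))
w1 /= w1.sum()                                      # normalize by Z_1
print(w1)              # 1/4 on each misclassified point, 1/10 on each correct one
print(w1[wrong].sum()) # 0.5: the new weights put half their mass on the mistakes
```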
5 Analysis of Training Error
In the following, we define the training error as the proportion of mis-classified samples in D.
Theorem 5.1. After M iterations, the training error of AdaBoost satisfies

(1/n) Σ_{t=1}^n 1{y_t f_M(x_t) ≤ 0} ≤ exp(−2 Σ_{m=1}^M (1/2 − ε̂_m)²).
There are some consequences. First, because the LHS can only take values in {0, 1/n, 2/n, . . .}, the training error is guaranteed to be zero if the RHS is < 1/n.
This can happen, for instance, if we have some form of guarantee that ε̂_m ≤ 1/2 − γ for all m and some γ > 0. If this is true, then one has

(1/n) Σ_{t=1}^n 1{y_t f_M(x_t) ≤ 0} ≤ e^{−2Mγ²}.

In particular, the RHS drops below 1/n, forcing the training error to zero, once M > log(n)/(2γ²): a number of iterations only logarithmic in n.
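As a concrete illustration, here is a quick helper (just rearranging the bound; the function name is made up for illustration) that computes how many rounds suffice:

```python
import math

def rounds_for_zero_training_error(n, gamma):
    """Smallest integer M with exp(-2*M*gamma**2) < 1/n, i.e. M > log(n) / (2*gamma**2)."""
    return math.floor(math.log(n) / (2 * gamma ** 2)) + 1

print(rounds_for_zero_training_error(n=1000, gamma=0.1))   # 346 rounds suffice
```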
Proof. [Step 1] By the exponential-loss bound 1{z ≤ 0} ≤ e^{−z} applied with z = y_t f_M(x_t),

(1/n) Σ_{t=1}^n 1{y_t f_M(x_t) ≤ 0} ≤ (1/n) Σ_{t=1}^n e^{−y_t f_M(x_t)}.

[Step 2] Unrolling the update (3.3) gives w_M(t) = w_0(t) e^{−y_t f_M(x_t)} / Π_{m=1}^M Z_m; since the weights w_M(t) sum to one, the right-hand side above equals Π_{m=1}^M Z_m.

[Step 3] Next, we compute Z_m. If y_t = h(x_t; θ̂_m) (the classifier classifies correctly), then y_t h(x_t; θ̂_m) = 1, whereas if y_t ≠ h(x_t; θ̂_m) (the classifier classifies incorrectly), then y_t h(x_t; θ̂_m) = −1. Therefore, (3.4) simplifies to
Z_m = e^{α̂_m} Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_{m−1}(t) + e^{−α̂_m} Σ_{t : y_t = h(x_t; θ̂_m)} w_{m−1}(t)
    = e^{α̂_m} ε̂_m + e^{−α̂_m} (1 − ε̂_m).    (5.1)

In the second line, we used the fact that Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_{m−1}(t) = ε̂_m (by definition of ε̂_m), and that Σ_{t : y_t = h(x_t; θ̂_m)} w_{m−1}(t) = 1 − ε̂_m (since Σ_t w_{m−1}(t) = 1 by design).
Note that ∂Z_m/∂α̂_m = ε̂_m e^{α̂_m} − e^{−α̂_m} (1 − ε̂_m). In particular, if we set the derivative to zero, we obtain α̂_m = (1/2) log((1 − ε̂_m)/ε̂_m). (One can check that this choice is a global minimum.) In particular, the choice of α̂_m is now well-motivated: it minimizes the upper bound on the training error from Steps 1 and 2.
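A quick grid check that this stationary point is indeed where Z_m is smallest (ε̂_m = 0.3 is an arbitrary test value):

```python
import numpy as np

eps = 0.3
alphas = np.linspace(-2.0, 2.0, 40001)
Z = eps * np.exp(alphas) + (1 - eps) * np.exp(-alphas)    # (5.1) as a function of alpha
alpha_star = 0.5 * np.log((1 - eps) / eps)                # ~ 0.4236
print(alphas[np.argmin(Z)], alpha_star)                   # grid minimizer matches
```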
[Step 4] We substitute the choice of α̂_m into (5.1) to obtain

Z_m = ε̂_m √((1 − ε̂_m)/ε̂_m) + (1 − ε̂_m) √(ε̂_m/(1 − ε̂_m))
    = 2 √(ε̂_m (1 − ε̂_m))
    = √(1 − (1 − 2ε̂_m)²).

Combining Steps 1, 2, and 4, and using the inequality √(1 − x) ≤ e^{−x/2}, we conclude that

(1/n) Σ_{t=1}^n 1{y_t ≠ sign(f_M(x_t))} ≤ Π_{m=1}^M Z_m
    ≤ Π_{m=1}^M exp(−(1/2)(1 − 2ε̂_m)²)
    = exp(−2 Σ_{m=1}^M (1/2 − ε̂_m)²),

which completes the proof.
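The chain of equalities and the final inequality can be checked numerically over a grid of error values:

```python
import numpy as np

eps = np.linspace(0.01, 0.99, 99)
alpha = 0.5 * np.log((1 - eps) / eps)
Z_a = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)    # (5.1) at the chosen alpha
Z_b = 2 * np.sqrt(eps * (1 - eps))
Z_c = np.sqrt(1 - (1 - 2 * eps) ** 2)
assert np.allclose(Z_a, Z_b) and np.allclose(Z_b, Z_c)

x = (1 - 2 * eps) ** 2                                    # and sqrt(1-x) <= exp(-x/2)
assert np.all(np.sqrt(1 - x) <= np.exp(-x / 2) + 1e-12)
```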
6 The Weighted Error After Updating

It turns out that, after the weight update at iteration m, the newly chosen base learner has weighted training error exactly 1/2 under the new weights:

Σ_{t=1}^n w_m(t) 1{y_t ≠ h(x_t; θ̂_m)} = 1/2.

Hence, the base learner chosen at iteration m has the “worst possible” weighted training error with respect to the updated weights (a value ε > 0.5 is technically worse, but it could be replaced by 1 − ε < 0.5 by just flipping the sign, i.e., changing positive labels to negative and vice versa).

Consequence: The same base learner will never be chosen on two consecutive rounds.
Proof. Since

1{y_t ≠ h(x_t; θ̂_m)} = (1 − y_t h(x_t; θ̂_m))/2
(see (4.2)) and Σ_{t=1}^n w_m(t) = 1 (by definition), it suffices to prove that

Σ_{t=1}^n w_m(t) y_t h(x_t; θ̂_m) = 0.
To show this, we do the usual trick of splitting the summation into two:
Σ_{t=1}^n w_m(t) y_t h(x_t; θ̂_m)
  = Σ_{t : y_t = h(x_t; θ̂_m)} w_m(t) − Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_m(t)
  (a) = (1/Z_m) Σ_{t : y_t = h(x_t; θ̂_m)} w_{m−1}(t) e^{−α̂_m} − (1/Z_m) Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_{m−1}(t) e^{α̂_m}
  (b) = (1/Z_m) (1 − ε̂_m) e^{−α̂_m} − (1/Z_m) ε̂_m e^{α̂_m}
  (c) = 0.
Here,
(a) substitutes the update equation (4.3),
(b) applies the definition ε̂_m = Σ_{t : y_t ≠ h(x_t; θ̂_m)} w_{m−1}(t), and
(c) uses the fact that α̂_m was chosen by equating ε̂_m e^{α̂_m} − e^{−α̂_m}(1 − ε̂_m) with zero (see Step 3 above).
7 Test Error
For typical learning algorithms, making the training error too small can make the test error (i.e., the error rate on unseen data) large.
This is known as overfitting, and will be explored more in the coming lectures.
For boosting, we typically observe the following surprising phenomenon:
Test error continues decreasing even after getting zero training error!
Let’s try to get an intuitive understanding of this. Define
f̃_M(x) = (Σ_{m=1}^M α̂_m h(x; θ̂_m)) / (Σ_{m=1}^M α̂_m) ∈ [−1, 1],

margin(t) = y_t f̃_M(x_t) ∈ [−1, 1].
We have margin(t) > 0 if and only if xt is classified correctly. But the closer margin(t) is to one, the “further
away” the classifier is from incorrectly classifying it.
Now define the following stricter notion of error:
Err_n(f_M; ρ) = (1/n) Σ_{t=1}^n 1{margin(t) ≤ ρ}.

Note that the normal training error is simply Err_n(f_M; 0).
Implicitly, AdaBoost is minimizing e^{−margin(t)}. This means that even after achieving Err_n(f_M; 0) = 0, the algorithm continues to decrease Err_n(f_M; ρ) for ρ > 0. A higher margin leads to better generalization, and boosting naturally increases the margin.
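Sticking with the earlier sketches (adaboost, f_M), the margins and the stricter error Err_n(f_M; ρ) are a few lines of NumPy; the function names here are again just for illustration:

```python
import numpy as np

def margins(X, y, stumps, alphas):
    """Normalized margins y_t * f~_M(x_t), each in [-1, 1]."""
    return y * f_M(X, stumps, alphas) / np.sum(alphas)

def err_at(marg, rho):
    """Err_n(f_M; rho): fraction of points with margin at most rho."""
    return np.mean(marg <= rho)

# Typical usage: train, then watch err_at(m, rho) for rho > 0 keep shrinking
# as M grows, even after err_at(m, 0.0) has already hit zero.
```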
A formal statement on the test error of AdaBoost is given in the 1998 paper by Schapire, Freund, Bartlett,
and Lee – the final author is a professor at NUS.
8 Additional References
- Blog posts by Jeremy Kun on boosting² and why it doesn’t overfit³
- The original paper on boosting: “A decision-theoretic generalization of on-line learning and an application to boosting” (Freund and Schapire, 1997)
- Lecture notes for MIT’s 6.867 Machine Learning course (Fall 2006)⁴
² http://jeremykun.com/2015/05/18/boosting-census/
³ http://jeremykun.com/2015/09/21/the-boosting-margin-or-why-boosting-doesnt-overfit/
⁴ http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/