Boosting
Andrew Ng
1 Boosting
We have seen so far how to solve classification (and other) problems when we
have a data representation already chosen. We now talk about a procedure,
known as boosting, which was originally discovered by Rob Schapire, and
further developed by Schapire and Yoav Freund, that automatically chooses
feature representations. We take an optimization-based perspective, which
is somewhat different from the original interpretation and justification of
Freund and Schapire, but which lends itself to our approach of (1) choose a
representation, (2) choose a loss, and (3) minimize the loss.
Before formulating the problem, we give a little intuition for what we are going to do. Roughly, the idea of boosting is to take a weak learning algorithm—any learning algorithm that gives a classifier that is slightly better than random—and transform it into a strong classifier, which does much, much better than random. To build a bit of intuition for what this means, consider a hypothetical digit recognition experiment, where we wish to distinguish 0s from 1s, and we receive images we must classify. Then a natural weak learner might be to take the middle pixel of the image; if it is colored, call the image a 1, and if it is blank, call the image a 0. This classifier may be far from perfect, but it is likely better than random. Boosting procedures proceed by taking a collection of such weak classifiers and reweighting their contributions to form a classifier with much better accuracy than any individual classifier.
With that in mind, let us formulate the problem. Our interpretation of boosting is as a coordinate descent method in an infinite dimensional space, which—while it sounds complex—is not so bad as it seems. First, we assume we have raw input examples x ∈ Rⁿ with labels y ∈ {−1, 1}, as is usual in binary classification. We also assume we have an infinite collection of feature functions φ_j : Rⁿ → {−1, 1} and an infinite vector θ = [θ₁ θ₂ · · ·]ᵀ, which we assume always has only a finite number of non-zero entries. For our classifier we use
$$h_\theta(x) = \operatorname{sign}\Big(\sum_{j=1}^{\infty} \theta_j \phi_j(x)\Big).$$
Then we say that there is a weak learner with margin γ > 0 if for any distribution p on the m training examples there exists one weak hypothesis φ_j such that
$$\sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)} \neq \phi_j(x^{(i)})\big\} \le \frac{1}{2} - \gamma. \tag{1}$$
That is, we assume that there is some classifier that does slightly better than
random guessing on the dataset. The existence of a weak learning algorithm
is an assumption, but the surprising thing is that we can transform any weak
learning algorithm into one with perfect accuracy.
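To make the weak-learning condition (1) concrete, the following sketch checks it numerically for a finite pool of candidate hypotheses. This is illustrative only: the function names, the pool `phis`, and the tiny dataset are hypothetical, not from the notes.

```python
import numpy as np

def weighted_error(phi, X, y, p):
    """p-weighted misclassification error of a weak hypothesis phi."""
    preds = np.array([phi(x) for x in X])
    return float(np.sum(p * (preds != y)))

def has_weak_learner(phis, X, y, p, gamma):
    """Check the margin-gamma condition (1): some phi in the pool has
    p-weighted error at most 1/2 - gamma under distribution p."""
    return any(weighted_error(phi, X, y, p) <= 0.5 - gamma for phi in phis)
```

In a full boosting implementation this check would be replaced by an actual search over hypotheses, as in the decision stump procedure of Section 3.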
In more generality, we assume we have access to a weak learner, which is an algorithm that takes as input a training set {(x^{(i)}, y^{(i)})}_{i=1}^m and a distribution (weights) p^{(1)}, . . . , p^{(m)} on the examples, with ∑_{i=1}^m p^{(i)} = 1 and p^{(i)} ≥ 0, and returns a classifier doing slightly better than random under that distribution. We will
show how, given access to a weak learning algorithm, boosting can return a
classifier with perfect accuracy on the training data. (Admittedly, we would
like the classifier to generalize well to unseen data, but for now, we ignore
this issue.)
We first show how to compute the exact form of the coordinate descent
update for the risk J(θ). Coordinate descent iterates as follows:
(i) Choose a coordinate j ∈ N
(ii) Update θ_j to
$$\theta_j = \arg\min_{\theta_j} J(\theta)$$
Fixing the coordinate j being updated, define
$$w^{(i)} := \exp\Big(-y^{(i)} \sum_{k \neq j} \theta_k \phi_k(x^{(i)})\Big)$$
to be a weight, and let α = θ_j. We can then express
$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} w^{(i)} \exp\big(-y^{(i)} \phi_j(x^{(i)})\, \alpha\big).$$
Now, define
$$W^{+} := \sum_{i \,:\, y^{(i)} \phi_j(x^{(i)}) = 1} w^{(i)} \qquad \text{and} \qquad W^{-} := \sum_{i \,:\, y^{(i)} \phi_j(x^{(i)}) = -1} w^{(i)}.$$
Then we have the following progress guarantee (Lemma 1):
$$J(\theta^{(t)}) \le \sqrt{1 - 4\gamma^2}\; J(\theta^{(t-1)}).$$
For each iteration t = 1, 2, . . .:
(iii) Compute
$$W_t^{+} = \sum_{i \,:\, y^{(i)} \phi_t(x^{(i)}) = 1} w^{(i)} \qquad \text{and} \qquad W_t^{-} = \sum_{i \,:\, y^{(i)} \phi_t(x^{(i)}) = -1} w^{(i)}$$
and set
$$\theta_t = \frac{1}{2} \log \frac{W_t^{+}}{W_t^{-}}.$$
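The boosting loop above (reweight the examples, query the weak learner, apply the closed-form update for θ_t) can be sketched as follows. This is an illustrative implementation: the `weak_learner` callable and the small numerical guard against a perfect weak hypothesis (which would make W_t^- zero) are assumptions, not part of the notes.

```python
import numpy as np

def boost(X, y, weak_learner, T):
    """Sketch of the boosting loop: maintain weights w^(i), call the weak
    learner on the normalized distribution p, and set theta_t by the
    closed-form coordinate update theta_t = 0.5 * log(W+/W-)."""
    m = len(y)
    w = np.ones(m)            # w^(i) = exp(-y^(i) sum_tau theta_tau phi_tau(x^(i)))
    hypotheses, thetas = [], []
    for _ in range(T):
        p = w / w.sum()                 # distribution over training examples
        phi = weak_learner(X, y, p)     # returns a function x -> {-1, +1}
        preds = np.array([phi(x) for x in X])
        W_plus = w[y * preds == 1].sum()
        W_minus = w[y * preds == -1].sum()
        eps = 1e-12                     # guard: a perfect hypothesis gives W_minus = 0
        theta = 0.5 * np.log(max(W_plus, eps) / max(W_minus, eps))
        w = w * np.exp(-y * preds * theta)   # multiplicative weight update
        hypotheses.append(phi)
        thetas.append(theta)
    def h(x):                 # final classifier sign(sum_t theta_t phi_t(x))
        return int(np.sign(sum(t * phi(x) for t, phi in zip(thetas, hypotheses))))
    return h
```

Note how each round's weight update concentrates the distribution p on the examples the current combined classifier gets wrong, which is what forces subsequent weak hypotheses to attend to them.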
As the proof of the lemma is somewhat involved and not the central focus of
these notes—though it is important to know one’s algorithm will converge!—
we defer the proof to Appendix A.1. Let us describe how it guarantees
convergence of the boosting procedure to a classifier with zero training error.
We initialize the procedure at θ^{(0)} = 0, so that the initial empirical risk J(θ^{(0)}) = 1. Now, we note that for any θ, the misclassification error satisfies
$$1\big\{\operatorname{sign}(\theta^{\top} \phi(x)) \neq y\big\} = 1\big\{y\, \theta^{\top} \phi(x) \le 0\big\} \le \exp\big(-y\, \theta^{\top} \phi(x)\big),$$
and so if J(θ) < 1/m then the vector θ makes no mistakes on the training data.
After t iterations of boosting, we find that the empirical risk satisfies
$$J(\theta^{(t)}) \le \big(1 - 4\gamma^2\big)^{t/2} J(\theta^{(0)}) = \big(1 - 4\gamma^2\big)^{t/2}.$$
To find how many iterations are required to guarantee J(θ^{(t)}) < 1/m, we take logarithms to find that J(θ^{(t)}) < 1/m if
$$\frac{t}{2} \log\big(1 - 4\gamma^2\big) < \log \frac{1}{m}, \qquad \text{or} \qquad t > \frac{2 \log m}{-\log(1 - 4\gamma^2)}.$$
Using a first order Taylor expansion, that is, that log(1 − 4γ²) ≤ −4γ², we see that if the number of rounds of boosting—the number of weak classifiers we use—satisfies
$$t > \frac{\log m}{2\gamma^2} \ge \frac{2 \log m}{-\log(1 - 4\gamma^2)},$$
then J(θ^{(t)}) < 1/m.
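As a quick sanity check on this bound (not part of the notes; the function name is illustrative), the required number of rounds can be evaluated directly:

```python
import math

def rounds_needed(m, gamma):
    """Smallest integer t satisfying t > log(m) / (2 * gamma**2), which by
    the bound above guarantees J(theta^(t)) < 1/m, hence zero training error."""
    return math.floor(math.log(m) / (2 * gamma ** 2)) + 1
```

For instance, with m = 1000 training examples and margin γ = 0.1, the bound asks for 346 rounds, and one can verify that (1 − 4γ²)^{t/2} is indeed below 1/m at that t.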
3 Implementing weak-learners
One of the major advantages of boosting algorithms is that they automatically generate features from raw data for us. Moreover, because the weak hypotheses always return values in {−1, 1}, there is no need to normalize features to have similar scales when using learning algorithms, which in practice can make a large difference. Additionally, and while this is not theoretically well-understood, many types of weak-learning procedures introduce non-linearities intelligently into our classifiers, which can yield much more expressive models than the simpler linear models of the form θᵀx that we have seen so far.
3.1 Decision stumps
The prototypical weak learner is a decision stump, a classifier that thresholds a single feature of the input:
$$\phi_{j,s}(x) = \operatorname{sign}(x_j - s). \tag{2}$$
These classifiers are simple enough that we can fit them efficiently even to a weighted dataset, as we now describe.
Indeed, a decision stump weak learner proceeds as follows. We begin with a distribution—a set of weights p^{(1)}, . . . , p^{(m)} summing to 1—on the training set, and we wish to choose a decision stump of the form (2) to minimize the error on the training set. That is, we wish to find a threshold s ∈ R and index j such that
$$\widehat{\mathrm{Err}}(\phi_{j,s}, p) = \sum_{i=1}^{m} p^{(i)}\, 1\big\{\phi_{j,s}(x^{(i)}) \neq y^{(i)}\big\} = \sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)}\big(x_j^{(i)} - s\big) \le 0\big\} \tag{3}$$
is minimized.
As the only values s for which the error of the decision stump can change are the values x_j^{(i)}, a bit of clever book-keeping allows us to compute the error of every candidate threshold efficiently: sorting the values of feature j so that the indices i_1, . . . , i_m satisfy x_j^{(i_1)} ≤ x_j^{(i_2)} ≤ · · · ≤ x_j^{(i_m)}, we have
$$\sum_{i=1}^{m} p^{(i)}\, 1\big\{y^{(i)}\big(x_j^{(i)} - s\big) \le 0\big\} = \sum_{k=1}^{m} p^{(i_k)}\, 1\big\{y^{(i_k)}\big(x_j^{(i_k)} - s\big) \le 0\big\},$$
and a single pass over the sorted values yields the error at all m thresholds.
(You should convince yourself that this is true.) Thus, it is important to also track the smallest value of $1 - \widehat{\mathrm{Err}}(\phi_{j,s}, p)$ over all thresholds, because this may be smaller than $\widehat{\mathrm{Err}}(\phi_{j,s}, p)$; in that case the sign-flipped stump $-\phi_{j,s}$ is the better weak learner. Using this procedure for our weak learner (Fig. 1) gives the basic, but extremely useful, boosting classifier.
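The book-keeping above might be implemented as follows. This is a sketch, assuming numeric features in a NumPy array; the function name is hypothetical, and the sort-and-sweep costs O(m log m) per feature.

```python
import numpy as np

def fit_stump(X, y, p):
    """Fit a weighted decision stump phi(x) = sign * (1 if x_j > s else -1).
    The weighted error can only change at the observed values x_j^(i), so
    for each feature we sort those values and sweep once, updating the error
    incrementally. We also track 1 - error, since flipping the stump's sign
    may give a better weak learner."""
    m, n = X.shape
    best_err, best = np.inf, None
    for j in range(n):
        order = np.argsort(X[:, j])
        xs, ys, ps = X[order, j], y[order], p[order]
        # threshold below all points: every example is predicted +1
        err = ps[ys == -1].sum()
        for k in range(m):
            # raise the threshold to xs[k]: example k is now predicted -1
            err += ps[k] if ys[k] == 1 else -ps[k]
            if k + 1 < m and xs[k + 1] == xs[k]:
                continue  # a threshold must separate distinct values
            for sign, e in ((1, err), (-1, 1.0 - err)):
                if e < best_err:
                    best_err, best = e, (j, xs[k], sign)
    j, s, sign = best
    return lambda x: sign * (1 if x[j] > s else -1)
```

Plugging this into a boosting loop as the weak learner gives the boosted-decision-stumps classifier discussed in this section.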
[Figure 1: plot on axes ranging from 0 to 1.]
3.2 Example
We now give an example showing the behavior of boosting on a simple
dataset. In particular, we consider a problem with data points x ∈ R2 ,
where the optimal classifier is
$$y = \begin{cases} 1 & \text{if } x_1 < .6 \text{ and } x_2 < .6 \\ -1 & \text{otherwise.} \end{cases} \tag{4}$$
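To reproduce such an experiment, one might sample a dataset from the concept (4) as follows. The sampling scheme, seed, and function name are assumptions; the notes do not specify how the points were generated.

```python
import numpy as np

def toy_dataset(m, seed=0):
    """Sample m points uniformly on [0,1]^2 and label them with the
    axis-aligned concept of Eq. (4): y = 1 iff x1 < .6 and x2 < .6."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(m, 2))
    y = np.where((X[:, 0] < 0.6) & (X[:, 1] < 0.6), 1, -1)
    return X, y
```

Because the positive region is an axis-aligned box, each of its two edges can be matched by a decision stump, which is why boosted stumps approximate this concept quickly.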
[Figure: the boosted classifier on the example dataset after 2, 4, 5, and 10 iterations; each panel plots the unit square with axes from 0 to 1.]
3.3 Other strategies
There are a huge number of variations on the basic boosted decision stumps
idea. First, we do not require that the input features xj be real-valued. Some
of them may be categorical, meaning that xj ∈ {1, 2, . . . , k} for some k, in
which case natural decision stumps are of the form
$$\phi_j(x) = \begin{cases} 1 & \text{if } x_j = l \\ -1 & \text{otherwise,} \end{cases}$$
for each choice of l ∈ {1, 2, . . . , k}.
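Such a categorical stump is easy to express directly. A sketch, assuming categories are encoded as integers (the function name is hypothetical):

```python
def categorical_stump(j, l):
    """Weak hypothesis for a categorical feature: phi_j(x) = 1 if x_j == l,
    and -1 otherwise."""
    return lambda x: 1 if x[j] == l else -1
```

Fitting such a stump to a weighted dataset only requires comparing the weighted error of each (feature, category) pair, so no sorting is needed.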
A Appendices
A.1 Proof of Lemma 1
We now return to prove the progress lemma. We prove this result by directly
showing the relationship of the weights at time t to those at time t − 1. In
particular, we note by inspection that
1 2
q
(t)
J(θ ) = + −α − α
min{Wt e + Wt e } = Wt+ Wt−
m α m
while
$$J(\theta^{(t-1)}) = \frac{1}{m} \sum_{i=1}^{m} \exp\Big(-y^{(i)} \sum_{\tau=1}^{t-1} \theta_\tau \phi_\tau(x^{(i)})\Big) = \frac{1}{m}\big(W_t^{+} + W_t^{-}\big).
$$
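The closed-form minimum over α used in the expression for J(θ^{(t)}) is easy to verify numerically. This is a sanity check on the identity min_α {W⁺e^{−α} + W⁻e^{α}} = 2√(W⁺W⁻), not part of the proof; the function names are illustrative.

```python
import math

def inner_objective(alpha, w_plus, w_minus):
    """The quantity minimized in the coordinate update: W+ e^{-a} + W- e^{a}."""
    return w_plus * math.exp(-alpha) + w_minus * math.exp(alpha)

def alpha_star(w_plus, w_minus):
    """Closed-form minimizer alpha* = 0.5 * log(W+/W-); the attained
    minimum value is 2 * sqrt(W+ * W-)."""
    return 0.5 * math.log(w_plus / w_minus)
```

Setting the derivative −W⁺e^{−α} + W⁻e^{α} to zero gives α* directly, and substituting back yields the stated minimum.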
Rewriting this expression by noting that the sum on the right is W_t^{-}, we have
$$W_t^{-} \le \Big(\frac{1}{2} - \gamma\Big)\big(W_t^{+} + W_t^{-}\big), \qquad \text{or} \qquad W_t^{+} \ge \frac{1 + 2\gamma}{1 - 2\gamma}\, W_t^{-}.$$
By substituting $\alpha = \frac{1}{2} \log \frac{1+2\gamma}{1-2\gamma}$ in the minimum defining J(θ^{(t)}), we obtain
$$\begin{aligned}
J(\theta^{(t)}) &\le \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}} \right) \\
&= \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}}\,(1 - 2\gamma + 2\gamma) \right) \\
&\le \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}} + W_t^{-} \sqrt{\frac{1+2\gamma}{1-2\gamma}}\,(1 - 2\gamma) + 2\gamma \sqrt{\frac{1-2\gamma}{1+2\gamma}}\, W_t^{+} \right) \\
&= \frac{1}{m}\left( W_t^{+} \sqrt{\frac{1-2\gamma}{1+2\gamma}}\,(1 + 2\gamma) + W_t^{-} \sqrt{1 - 4\gamma^2} \right) \\
&= \frac{\sqrt{1 - 4\gamma^2}}{m}\big(W_t^{+} + W_t^{-}\big) = \sqrt{1 - 4\gamma^2}\; J(\theta^{(t-1)}),
\end{aligned}$$
where the third step uses $W_t^{-} \le \frac{1-2\gamma}{1+2\gamma} W_t^{+}$. This completes the proof of the lemma.