Lecture Notes 7
AdaBoost
Prof. Alan Yuille
Spring 2014
Outline
1. Introduction to AdaBoost
2. Mathematical Description
3. AdaBoost Algorithm
5. Further Discussion
1 Introduction to AdaBoost
1.1 Basics
AdaBoost is a method for combining many weak classifiers to make a strong classifier.
Input: a set of weak classifiers $\{\phi_\mu(x) : \mu = 1, ..., M\}$, and labelled data $\mathcal{X} = \{(x^i, y^i) : i = 1, ..., N\}$ with $y^i \in \{\pm 1\}$.
Output: a strong classifier:
$$S(x) = \text{sign}\Big(\sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x)\Big).$$
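To make the notation concrete, here is a minimal sketch (not from the lecture) of evaluating such a strong classifier, assuming the weak classifiers are supplied as functions returning ±1; the toy classifiers and names are purely illustrative.

```python
import numpy as np

def strong_classifier(x, weak_classifiers, lambdas):
    """Evaluate S(x) = sign(sum_mu lambda_mu * phi_mu(x)).

    weak_classifiers: list of functions phi_mu mapping x to +1 or -1.
    lambdas: weights lambda_mu, one per weak classifier.
    """
    score = sum(lam * phi(x) for lam, phi in zip(lambdas, weak_classifiers))
    return 1 if score >= 0 else -1

# Toy weak classifiers: threshold the coordinates of a 2-d point.
weak = [lambda x: 1 if x[0] > 0 else -1,
        lambda x: 1 if x[1] > 0 else -1]
print(strong_classifier(np.array([0.5, -2.0]), weak, [0.3, 1.2]))   # -> -1
```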
Figure 1: AdaBoost learns a strong classifier which is a plane in feature space.
1.2 Motivation
The task of AdaBoost is to select the weights $\{\lambda_\mu\}$ to make a strong classifier whose performance is as good as possible.
The motivation is that it is often possible to specify weak classifiers for a problem, e.g., to find a "weak classifier" that is effective sixty percent of the time. Or, more generally, to specify a large dictionary of weak classifiers from which AdaBoost can select a limited number to build a strong classifier which is correct ninety nine point nine percent of the time.
Why combine the weak classifiers by linear weights? Because there is an efficient algorithm to do this, and because this algorithm appears to give a classifier which generalizes well even if there is only a limited amount of data (as always, cross-validation is used to check that AdaBoost does not over-learn).
Note that weak classifiers which are correct only forty percent of the time are still useful: by changing their sign, we can make a weak classifier that is right sixty percent of the time. (Maybe you have a friend whose advice is usually bad – if so, ask for their advice, but then do the opposite.)
The "best" way to combine weak classifiers, if we have enough data, is to learn the distribution $P(y|\{\phi_\mu(x)\})$ – the distribution of y conditioned on the data. But this requires far too much data to be practical (most of the time).
For example, consider the face detection problem (Figure 2). The forehead region of a face is usually brighter than the region directly below – the eyebrows and eyes – which gives us a weak classifier by putting a threshold on the magnitude of this difference. Similarly, faces are often symmetric, so the average intensity on one side of the face is usually roughly equal to the intensity on the other side – so the difference between these averages is often smaller than a threshold.
Figure 2: The face detection problem (left panel). Weak classifiers for detecting faces (right
panels).
Viola and Jones specified a large dictionary of weak classifiers of this type (M = 20,000).
Note: AdaBoost will only give you good results if you have a good dictionary of weak
classifiers.
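As an illustration of such intensity-difference weak classifiers, the following sketch compares the mean brightness of an upper image region with the region directly below it and thresholds the difference. It is only a schematic, not the Viola-Jones implementation; the region bounds and the threshold are assumptions for illustration.

```python
import numpy as np

# A schematic intensity-difference weak classifier (not the Viola-Jones code):
# compare the mean brightness of an upper region with the region directly below
# it and threshold the difference. Region bounds and threshold are illustrative.
def intensity_diff_classifier(image, top_rows, bottom_rows, threshold=10.0):
    top = image[top_rows[0]:top_rows[1], :].mean()
    bottom = image[bottom_rows[0]:bottom_rows[1], :].mean()
    return 1 if (top - bottom) > threshold else -1   # +1: "face-like"

img = np.vstack([np.full((8, 16), 200.0),   # bright upper block ("forehead")
                 np.full((8, 16), 120.0)])  # darker lower block ("eyes")
print(intensity_diff_classifier(img, (0, 8), (8, 16)))   # -> 1
```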
2 Mathematical Description
2.1 The convex upper bound of the empirical risk
Let
$$Z[\lambda_1, ..., \lambda_M] = \sum_{i=1}^{N} \exp\Big\{-y^i \sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x^i)\Big\}.$$
This is a convex upper bound of the empirical risk of the strong classifier S(.). Convexity can be shown by calculating the Hessian and showing it is positive definite (Cauchy-Schwarz inequality).
To prove it is an upper bound of the empirical risk (penalizing all errors the same),
recall that the empirical risk of a strong classifier S(.) is:
$$R_{emp}(\mathcal{X}) = \sum_{i=1}^{N} \{1 - I(S(x^i) = y^i)\},$$
where $I(\cdot)$ is the indicator function. The bound follows because each term $\exp\{-y^i \sum_{\mu} \lambda_\mu \phi_\mu(x^i)\}$ is at least 1 whenever $S(x^i) \neq y^i$ and is non-negative otherwise, so $Z \geq R_{emp}(\mathcal{X})$.
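A quick numerical check of this bound on synthetic data, assuming the weak classifier outputs are stored as an N × M matrix `phi` with entries in {±1}; the data and weights here are illustrative.

```python
import numpy as np

# A quick numerical check (synthetic, illustrative data) that Z upper-bounds the
# empirical risk, assuming the weak classifier outputs are stored as an N x M
# matrix `phi` with entries phi_mu(x^i) in {+1, -1}.
rng = np.random.default_rng(0)
N, M = 100, 5
phi = rng.choice([-1, 1], size=(N, M))   # weak classifier outputs phi_mu(x^i)
y = rng.choice([-1, 1], size=N)          # labels y^i
lam = rng.uniform(0, 1, size=M)          # some weights lambda_mu

margins = y * (phi @ lam)                # y^i * sum_mu lambda_mu phi_mu(x^i)
Z = np.exp(-margins).sum()               # the convex upper bound
R_emp = (np.sign(phi @ lam) != y).sum()  # number of errors of the strong classifier
print(Z >= R_emp)                        # -> True
```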
AdaBoost minimizes Z by coordinate descent (see Figure 3). At each time step t it selects the coordinate $\hat{\mu}$ and the value $\hat{\lambda}_{\hat{\mu}}$ which give the biggest decrease in Z, keeping the other coordinates fixed, and then sets:
$$\lambda^{t+1}_{\hat{\mu}} = \hat{\lambda}_{\hat{\mu}}, \qquad \lambda^{t+1}_{\mu} = \lambda^{t}_{\mu} \quad \text{for all } \mu \neq \hat{\mu}.$$
Figure 3: AdaBoost performs coordinate descent in Z. At each time step it calculates the direction along which Z decreases the most – i.e. it selects $\hat{\mu}$ – and then moves in that direction.
Repeat until Z stops decreasing.
Note: this may over-learn the data. In practice, stop earlier and check by cross-
validation.
Intuition. At each time step calculate how much you can decrease Z by changing only
one of the λ's, and choose the λ which gives the biggest decrease. Each step of the algorithm
decreases Z and so the algorithm converges to the (unique) global minimum of Z.
Note: this algorithm is only practical because we can solve for λ̂µ and for µ̂ efficiently,
see the next section. Note: once a weak classifier is selected, it can be selected again in later
steps.
3 AdaBoost Algorithm
For each weak classifier $\phi_\mu(.)$ divide the data into two sets:
(i) $W^+_\mu = \{i : y^i \phi_\mu(x^i) = 1\}$
(ii) $W^-_\mu = \{i : y^i \phi_\mu(x^i) = -1\}$
I.e. $W^+_\mu$ is the set of data which $\phi_\mu(.)$ classifies correctly, and $W^-_\mu$ is the set which it gets wrong.
Then, at each time step t, define a set of "weights" for the training data:
$$D^t_i = \frac{\exp\{-y^i \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^i)\}}{\sum_{j=1}^{N} \exp\{-y^j \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^j)\}}.$$
These weights are all positive and sum to 1, i.e. $\sum_i D^t_i = 1$. At t = 0 all the weights take value 1/N. Otherwise, the weights are largest for the data which is incorrectly classified by the classifier $\text{sign}(\sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^i))$ (our current estimate of the strong classifier) and smallest for those which are correctly classified (to see this, look at the sign of the exponent). These weights are a way to take into account the weak classifiers we have already selected, and the weights we have assigned them, when we try to add a new classifier.
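A minimal sketch of computing these weights, again under the assumption that the weak classifier outputs are precomputed as an N × M matrix `phi`:

```python
import numpy as np

# A minimal sketch of the weights D_i^t, assuming the weak classifier outputs
# are precomputed as an N x M matrix `phi` with entries in {+1, -1}.
def data_weights(phi, y, lam):
    w = np.exp(-y * (phi @ lam))   # exp{-y^i sum_mu lambda^t_mu phi_mu(x^i)}
    return w / w.sum()             # normalize so that the D_i^t sum to 1

# At t = 0 (all lambdas zero) every weight equals 1/N.
phi = np.array([[1, -1], [-1, 1], [1, 1]])
y = np.array([1, -1, 1])
print(data_weights(phi, y, np.zeros(2)))   # -> [0.333..., 0.333..., 0.333...]
```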
Now we describe the AdaBoost algorithm, explicitly showing how to compute the steps which were summarized in the previous section.
Initialize $\lambda_\mu = 0$, for $\mu = 1, ..., M$.
At time step t, let the weights be $\{\lambda^t_\mu : \mu = 1, ..., M\}$. Then, for each $\mu$ compute:
$$\Delta^t_\mu = \frac{1}{2} \log \frac{\sum_{i \in W^+_\mu} D^t_i}{\sum_{i \in W^-_\mu} D^t_i}.$$
(This is the optimal step along coordinate $\mu$; then select $\hat{\mu}$ to be the coordinate which gives the biggest decrease in Z – by the calculation below, this is the $\mu$ which minimizes $\sqrt{\sum_{i \in W^+_\mu} D^t_i}\,\sqrt{\sum_{i \in W^-_\mu} D^t_i}$.)
Then set:
$$\lambda^{t+1}_{\hat{\mu}} = \lambda^{t}_{\hat{\mu}} + \Delta^t_{\hat{\mu}}, \qquad \lambda^{t+1}_{\mu} = \lambda^{t}_{\mu} \quad \text{for all } \mu \neq \hat{\mu}.$$
Repeat until convergence. Convergence occurs when $\sqrt{\sum_{i \in W^+_\mu} D^t_i}\,\sqrt{\sum_{i \in W^-_\mu} D^t_i}$ takes its maximal value of 1/2 for all $\mu$. (To see this is the maximal possible value, observe that $\sum_{i \in W^+_\mu} D^t_i + \sum_{i \in W^-_\mu} D^t_i = 1$.)
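Putting the steps together, here is a sketch of the full coordinate-descent loop described above, once more assuming the weak classifier outputs are precomputed as an N × M matrix; the stopping rule (a fixed number of steps), the small eps guard, and the synthetic data are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def adaboost(phi, y, T=50):
    """Sketch of the coordinate-descent updates described above.

    phi : N x M array of weak classifier outputs phi_mu(x^i) in {+1, -1}
    y   : length-N array of labels y^i in {+1, -1}
    T   : number of time steps (in practice, stop early / cross-validate)
    """
    N, M = phi.shape
    lam = np.zeros(M)
    correct = (phi * y[:, None] == 1)       # membership of i in W_mu^+ for each mu
    eps = 1e-12                             # guard against division by zero / log(0)
    for t in range(T):
        w = np.exp(-y * (phi @ lam))
        D = w / w.sum()                     # weights D_i^t
        W_plus = np.where(correct, D[:, None], 0.0).sum(axis=0)
        W_minus = 1.0 - W_plus              # the D_i^t sum to 1
        delta = 0.5 * np.log((W_plus + eps) / (W_minus + eps))
        # Stepping coordinate mu changes Z to 2K sqrt(W_plus * W_minus), so the
        # biggest decrease comes from the mu with the smallest such product.
        mu_hat = np.argmin(W_plus * W_minus)
        lam[mu_hat] += delta[mu_hat]
    return lam

# Usage on tiny synthetic data: column 0 is a noisy copy of the label.
rng = np.random.default_rng(1)
phi = rng.choice([-1, 1], size=(200, 10))
y = phi[:, 0].copy()
y[rng.random(200) < 0.15] *= -1             # flip 15% of the labels
lam = adaboost(phi, y, T=20)
print((np.sign(phi @ lam) == y).mean())     # training accuracy of the strong classifier
```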
Now compute $Z[\lambda_1, ..., \lambda_\mu + \Delta_\mu, ..., \lambda_M]$ by:
$$Z[\lambda_1, ..., \lambda_\mu + \Delta_\mu, ..., \lambda_M] = \sum_{i=1}^{N} e^{-y^i \{\sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) + \Delta_\mu \phi_\mu(x^i)\}} = K \sum_{i=1}^{N} D_i e^{-y^i \Delta_\mu \phi_\mu(x^i)},$$
where $K = \sum_{i=1}^{N} e^{-y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i)}$ is independent of $\mu$. Hence:
$$Z[\lambda_1, ..., \lambda_\mu + \Delta_\mu, ..., \lambda_M] = K \Big\{\sum_{i \in W^+_\mu} D_i e^{-\Delta_\mu} + \sum_{i \in W^-_\mu} D_i e^{\Delta_\mu}\Big\} = 2K \sqrt{\sum_{i \in W^+_\mu} D_i}\,\sqrt{\sum_{i \in W^-_\mu} D_i},$$
where the last equality holds at the minimizing step $\Delta_\mu = \frac{1}{2} \log \big(\sum_{i \in W^+_\mu} D_i / \sum_{i \in W^-_\mu} D_i\big)$.
This shows the equivalence between the mathematical and the algorithmic descriptions
of AdaBoost.
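A small numerical check of this equivalence on synthetic data: stepping a single coordinate $\mu$ and comparing the three expressions for Z above (the direct sum, the factored form with K, and the closed form at the minimizing $\Delta_\mu$). The data and weights are illustrative assumptions.

```python
import numpy as np

# A small numerical check (synthetic data) of the identity above: the direct sum,
# the factored form with K, and the closed form 2K sqrt(...) sqrt(...) at the
# minimizing Delta_mu all agree.
rng = np.random.default_rng(2)
N, M, mu = 50, 4, 2
phi = rng.choice([-1, 1], size=(N, M))
y = rng.choice([-1, 1], size=N)
lam = rng.uniform(0, 1, size=M)

K = np.exp(-y * (phi @ lam)).sum()
D = np.exp(-y * (phi @ lam)) / K
W_plus = D[y * phi[:, mu] == 1].sum()
W_minus = D[y * phi[:, mu] == -1].sum()
delta = 0.5 * np.log(W_plus / W_minus)

lam_new = lam.copy()
lam_new[mu] += delta
Z_direct = np.exp(-y * (phi @ lam_new)).sum()
Z_factored = K * (W_plus * np.exp(-delta) + W_minus * np.exp(delta))
Z_closed = 2 * K * np.sqrt(W_plus) * np.sqrt(W_minus)
print(np.allclose([Z_direct, Z_factored], Z_closed))   # -> True
```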
5 Further Discussion
5.1 AdaBoost and Regression
It has been shown (Friedman, Hastie, Tibshirani) that AdaBoost converges asymptotically to a solution of the regression problem:
$$P(y|x) = \frac{e^{y \sum_\mu \lambda_\mu \phi_\mu(x)}}{e^{\sum_\mu \lambda_\mu \phi_\mu(x)} + e^{-\sum_\mu \lambda_\mu \phi_\mu(x)}}.$$
This holds only in the limit as the amount of data tends to infinity (plus other technical
conditions).
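For illustration, a minimal sketch of evaluating this conditional distribution from a learned set of weights; the function name and its inputs are assumptions, not notation from the lecture.

```python
import numpy as np

# A minimal sketch (illustrative names) of this conditional distribution: the
# learned score f(x) = sum_mu lambda_mu phi_mu(x) is turned into a probability.
def posterior(y, phi_x, lam):
    """P(y|x) = e^{y f(x)} / (e^{f(x)} + e^{-f(x)}), with phi_x the weak
    classifier outputs phi_mu(x) for this x and lam the learned weights."""
    f = np.dot(lam, phi_x)
    return np.exp(y * f) / (np.exp(f) + np.exp(-f))

print(posterior(+1, np.array([1, -1, 1]), np.array([0.8, 0.2, 0.5])))  # ~0.90
```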
It has been argued (Miller, Lebanon, and others) that doing regression gives better results than AdaBoost. It does require more computation. It also requires enforcing a sparsity constraint that most of the λ's are zero (see lecture on sparsity).
5.2 Comparison with Other Methods
The main advantage of AdaBoost is that you can specify a large set of weak classifiers, and the algorithm decides which weak classifiers to use by assigning them non-zero weights. Standard logistic regression only uses a small set of features.
SVM uses the kernel trick (discussed in the next lecture) to simplify the dependence on φ(x), but does not say how to select the kernel or φ.
A multilayer perceptron can be interpreted as selecting weak classifiers, but in a non-optimal manner.