
Lecture 7.

AdaBoost
Prof. Alan Yuille

Spring 2014

Outline
1. Introduction to AdaBoost

2. Mathematical Description

3. AdaBoost Algorithm

4. Equivalence between Mathematical and Algorithmic Description

5. Further Discussion

1 Introduction to AdaBoost
1.1 Basics
AdaBoost is a method for combining many weak classifiers to make a strong classifier.
Input: set of weak classifiers {φ_µ(x) : µ = 1, ..., M}. Labelled data X = {(x^i, y^i) : i = 1, ..., N} with y^i ∈ {±1}.
Output: strong classifier:

S(x) = \mathrm{sign}\Big( \sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x) \Big),

where the {λµ } are weights to be learned.


We generally want most of the λµ = 0, which means that the corresponding weak
classifier φµ (.) is not selected.
Note: the decision boundary of the strong classifier is a plane (hyperplane) in feature space, see figure 1.

Figure 1: AdaBoost learns a strong classifier whose decision boundary is a plane in feature space.
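
For concreteness, evaluating such a strong classifier once the weights are known might look like the following sketch in Python; the function names and the convention of labelling a zero score as +1 are our own assumptions.

def strong_classifier(x, weak_classifiers, lambdas):
    """Evaluate S(x) = sign(sum_mu lambda_mu * phi_mu(x)).

    weak_classifiers: list of functions phi_mu, each mapping x to +1 or -1.
    lambdas: list/array of learned weights (most are expected to be zero).
    """
    score = sum(lam * phi(x) for lam, phi in zip(lambdas, weak_classifiers))
    return 1 if score >= 0 else -1  # ties at score 0 labelled +1 by convention (our choice)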

1.2 Motivation
The task of AdaBoost is to select weights {λ_µ} to make a strong classifier whose performance is as good as possible.
The motivation is that it is often possible to specify weak classifiers for a problem, e.g., to find a "weak classifier" that is effective sixty percent of the time. Or, more generally, to specify a large dictionary of weak classifiers from which AdaBoost can select a limited number to build a strong classifier which is correct ninety-nine point nine percent of the time.
Why combine the weak classifiers by linear weights? Because there is an efficient algorithm to do this. And because this algorithm appears to give a classifier which generalizes well even if there is only a limited amount of data (as always, cross-validation is used to check that AdaBoost does not overfit).
Note that weak classifiers which are correct only forty percent of the time are still useful. By changing their sign, we can make a weak classifier that is right sixty percent of the time. (Maybe you have a friend whose advice is usually bad – if so, ask for their advice, but then do the opposite.)
The "best" way to combine weak classifiers, if we have enough data, is to learn the distribution P(y|{φ_µ(x)}) – the distribution of y conditioned on the weak classifier responses. But this requires far too much data to be practical (most of the time).

1.3 Example: Face Detection


The training set is a set of images I (we don't use x because that is used to denote position in the image) which are labelled as face or non-face: y = 1 means face, y = −1 means non-face, see figure 2.
Viola and Jones calculated image features f(I). For example, they computed the average intensity within different subregions of the images and subtracted these averages to obtain the feature f(I). They obtained a weak classifier by thresholding the value of the feature (e.g., set y = 1 if f(I) > T). For example, the forehead region is typically a lot brighter than the region directly below – the eyebrows and eyes – which gives us a weak classifier by putting a threshold on the magnitude of this difference. Similarly, faces are often symmetric, so the average intensity on one side of the face is usually approximately equal to the intensity on the other side – so the difference between these averages is often smaller than a threshold.

Figure 2: The face detection problem (left panel). Weak classifiers for detecting faces (right
panels).

Viola and Jones specified a large dictionary of weak classifiers of this type (M = 20,000).
Note: AdaBoost will only give you good results if you have a good dictionary of weak classifiers.
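
As an illustration of the thresholded region-difference features described above, a weak classifier might be sketched as follows; the region parameterization, names, and threshold direction are assumptions, not the exact Viola-Jones construction.

import numpy as np

def region_difference_feature(image, region_a, region_b):
    """f(I): difference of mean intensities of two rectangular subregions.

    Each region is (row_start, row_end, col_start, col_end); this
    parameterization is an illustrative assumption.
    """
    r0, r1, c0, c1 = region_a
    s0, s1, d0, d1 = region_b
    return image[r0:r1, c0:c1].mean() - image[s0:s1, d0:d1].mean()

def weak_classifier(image, region_a, region_b, threshold):
    """Thresholded feature: y = +1 ('face') if f(I) > T, else y = -1."""
    f = region_difference_feature(image, region_a, region_b)
    return 1 if f > threshold else -1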

2 Mathematical Description
2.1 The convex upper bound of the empirical risk
Let

Z[\lambda_1, \ldots, \lambda_M] = \sum_{i=1}^{N} \exp\Big\{ -y^i \sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x^i) \Big\}.

This is a convex upper bound of the empirical risk of the strong classifier S(.). Convexity can be shown by calculating the Hessian and showing it is positive semi-definite (using the Cauchy-Schwarz inequality).
To prove it is an upper bound of the empirical risk (penalizing all errors the same), recall that the empirical risk of a strong classifier S(.) is:

R_{\mathrm{emp}}(\mathcal{X}) = \sum_{i=1}^{N} \{ 1 - I(S(x^i) = y^i) \},

where I(.) is the indicator function.


To prove the result, we compare the above two equations term by term. Terms in the empirical risk take value 1 if S(x^i) ≠ y^i (if the strong classifier is wrong) and value 0 if S(x^i) = y^i (if the strong classifier gets the right answer).
Mathematical Fact: y^i S(x^i) = 1 if the strong classifier is correct and y^i S(x^i) = −1 if the strong classifier is wrong. This mathematical fact is useful when we discuss AdaBoost and also Max-Margin methods (e.g., Support Vector Machines) in the next lecture.
Now consider the terms exp{−y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i)} in Z[λ_1, ..., λ_M]. If the strong classifier is wrong (S(x^i) ≠ y^i), then y^i and Σ_{µ=1}^{M} λ_µ φ_µ(x^i) must have different signs. This implies that y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i) < 0, hence exp{−y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i)} > 1. If the strong classifier is right, then the term is still greater than 0 since it is an exponential. Hence each term in Z is bigger than the corresponding term in the empirical risk.
Note: this is a standard technique in machine learning. The empirical risk is a non-
convex function of the parameters λ of the strong classifier. But we can bound it by a
convex function Z. This allows us to specify an algorithm that is guaranteed to converge
to the global minimum of Z, which hence gives an upper bound on the empirical risk.
(This doesn't mean that we have minimized the empirical risk – but it does mean that we know that it is below the minimum of Z.)
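
To make the bound concrete, here is a small numeric check in Python on toy data with random ±1 weak-classifier outputs; every name and number below is illustrative, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 5
phi = rng.choice([-1, 1], size=(N, M))      # phi[i, mu] = phi_mu(x^i)
y = rng.choice([-1, 1], size=N)             # labels y^i
lam = rng.normal(size=M)                    # arbitrary weights lambda_mu

scores = phi @ lam                          # sum_mu lambda_mu phi_mu(x^i)
Z = np.sum(np.exp(-y * scores))             # convex surrogate Z[lambda]
emp_risk = np.sum(np.sign(scores) != y)     # number of strong-classifier errors

assert emp_risk <= Z                        # each exp term dominates each 0/1 term
print(emp_risk, Z)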

2.2 Coordinate Descent


The AdaBoost algorithm can be expressed as coordinate descent in Z (this was not how it was originally described), see figure 3.
Initialize λ_µ = 0, for µ = 1, ..., M.
At time step t, suppose the weights are {λ^t_µ : µ = 1, ..., M}.
For each µ, minimize Z with respect to λ_µ keeping the remaining λ's fixed. I.e. solve ∂Z/∂λ_µ = 0 (because Z is convex there will only be one solution) to obtain λ̂_µ. Then compute Z[λ^t_1, ..., λ̂_µ, ..., λ^t_M]. This gives the decrease in Z due to minimizing it with respect to the µth λ. (See next section for how to do this.)
Then compute:

\hat{\mu} = \arg\min_{\mu} Z[\lambda^t_1, \ldots, \hat{\lambda}_\mu, \ldots, \lambda^t_M].

(I.e. find the µ which gives the maximal decrease in Z.)
Then update the weights:

\lambda^{t+1}_{\hat{\mu}} = \hat{\lambda}_{\hat{\mu}}, \qquad \lambda^{t+1}_\mu = \lambda^t_\mu \ \text{for all}\ \mu \neq \hat{\mu}.

Figure 3: AdaBoost performs coordinate descent in Z. At each time step it calculates the best direction in which to maximally decrease Z – i.e. by selecting µ̂ – and then moves in that direction.

Repeat until Z stops decreasing.
Note: this may over-learn the data. In practice, stop earlier and check by cross-validation.
Intuition. At each time step calculate how much you can decrease Z by changing only one of the λ's, and choose the λ which gives the biggest decrease. Each step of the algorithm decreases Z and so the algorithm converges to the (unique) global minimum of Z.
Note: this algorithm is only practical because we can solve for λ̂µ and for µ̂ efficiently,
see next section. Note: Once one weak classifier is selected, it can be selected again in later
steps.
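
As a sketch of this coordinate-descent scheme (before using the closed-form update of the next section), one could minimize Z over each coordinate numerically; the function names, array layout, and the use of scipy's one-dimensional minimizer are assumptions of this sketch, not part of the lecture.

import numpy as np
from scipy.optimize import minimize_scalar

def Z(lam, phi, y):
    """Z[lambda] = sum_i exp(-y^i * sum_mu lambda_mu phi_mu(x^i))."""
    return np.sum(np.exp(-y * (phi @ lam)))

def coordinate_descent_step(lam, phi, y):
    """One step of the scheme above: for each mu, minimize Z over lambda_mu
    alone (here by a numerical 1-D search), then keep only the single
    coordinate update that lowers Z the most.

    lam: numpy array of current weights; phi: (N, M) array of +/-1 outputs.
    """
    best = (None, np.inf, None)          # (mu_hat, Z value, new lambda_mu)
    for mu in range(phi.shape[1]):
        def Z_mu(t, mu=mu):
            trial = lam.copy()
            trial[mu] = t
            return Z(trial, phi, y)
        res = minimize_scalar(Z_mu)      # Z is convex in each coordinate
        if res.fun < best[1]:
            best = (mu, res.fun, res.x)
    mu_hat, _, lam_mu_hat = best
    new_lam = lam.copy()
    new_lam[mu_hat] = lam_mu_hat
    return new_lam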

3 AdaBoost Algorithm
For each weak classifier φ_µ(.) divide the data into two sets:
(i) W_µ^+ = {i : y^i φ_µ(x^i) = 1}
(ii) W_µ^- = {i : y^i φ_µ(x^i) = −1}
I.e. W_µ^+ is the set of the data which φ_µ(.) classifies correctly, and W_µ^- is the set which it gets wrong.
Then, at each time step t, define a set of "weights" for the training data:

D_i^t = \frac{\exp\{ -y^i \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^i) \}}{\sum_{j=1}^{N} \exp\{ -y^j \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^j) \}}.

These weights are all positive and sum to 1 – i.e. Σ_i D_i^t = 1. At t = 0 all the weights take value 1/N. Otherwise, the weights are largest for the data which is incorrectly classified by the classifier sign(Σ_{µ=1}^{M} λ^t_µ φ_µ(x^i)) (our current estimate of the strong classifier) and smallest for those which are correctly classified (to see this, look at the sign of the exponent).
These weights are a way to take into account the weak classifiers we have already selected
and the weights we have assigned them when we try to add a new classifier.
Now we describe the AdaBoost algorithm, explicitly showing how to compute the steps which were summarized in the previous section.
Initialize λµ = 0, for µ = 1, ..., M .
At time step t, let the weights be {λ^t_µ : µ = 1, ..., M}. Then, for each µ compute:

\Delta^t_\mu = \frac{1}{2} \log \frac{\sum_{i \in W_\mu^+} D_i^t}{\sum_{i \in W_\mu^-} D_i^t}.

(this corresponds to solving ∂Z/∂λµ = 0, see next section).


Then solve for:

\hat{\mu} = \arg\min_{\mu} \sqrt{\sum_{i \in W_\mu^+} D_i^t} \, \sqrt{\sum_{i \in W_\mu^-} D_i^t}.

(This computes the change in Z, and selects the µ for which this change is biggest.)
Then set:
\lambda^{t+1}_{\hat{\mu}} = \lambda^t_{\hat{\mu}} + \Delta^t_{\hat{\mu}}, \qquad \lambda^{t+1}_\mu = \lambda^t_\mu \ \text{for all}\ \mu \neq \hat{\mu}.
Repeat until convergence.
Convergence occurs when √(Σ_{i∈W_µ^+} D_i^t) √(Σ_{i∈W_µ^-} D_i^t) takes its maximal value of 1/2 for all µ. (To see this is the maximal possible value, observe that Σ_{i∈W_µ^+} D_i^t + Σ_{i∈W_µ^-} D_i^t = 1.) This maximal value occurs when the weak classifier gets exactly half the data correct (allowing for the data weights) – i.e. when Σ_{i∈W_µ^+} D_i^t = Σ_{i∈W_µ^-} D_i^t = 1/2. (Note, if the weights take value 1/N – at the start of the algorithm – then this is the condition that the weak classifier is fifty percent correct.)
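
Putting the pieces together, a minimal sketch of the algorithm described above might look as follows; the array layout, variable names, fixed step count, and the small epsilon guard against an empty W_µ^+ or W_µ^- are our own choices, not part of the lecture.

import numpy as np

def adaboost(phi, y, num_steps):
    """AdaBoost as coordinate descent, following the steps above.

    phi: (N, M) array with phi[i, mu] = phi_mu(x^i) in {+1, -1}.
    y:   (N,) array of labels in {+1, -1}.
    Returns the weight vector lambda.
    """
    N, M = phi.shape
    lam = np.zeros(M)
    correct = (y[:, None] * phi == 1)        # correct[i, mu] iff i is in W_mu^+
    eps = 1e-12                              # guard against empty W_mu^+ or W_mu^-
    for _ in range(num_steps):
        margins = y * (phi @ lam)
        D = np.exp(-margins)
        D /= D.sum()                         # weights D_i^t, positive and summing to 1
        W_plus = np.where(correct, D[:, None], 0.0).sum(axis=0)
        W_minus = 1.0 - W_plus               # since W_plus + W_minus = 1 for each mu
        delta = 0.5 * np.log((W_plus + eps) / (W_minus + eps))
        # pick mu minimizing sqrt(W_plus)*sqrt(W_minus), i.e. the biggest drop in Z
        mu_hat = np.argmin(np.sqrt(W_plus * W_minus))
        lam[mu_hat] += delta[mu_hat]
    return lam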

4 Equivalence between Mathematical and Algorithmic Description
We omit superscript t since the following argument applies to any step.
\frac{\partial Z}{\partial \lambda_\mu} = \sum_{i=1}^{N} \left( -y^i \phi_\mu(x^i) \right) \exp\Big\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \Big\} = 0.

Set λµ = λµ + ∆µ and solve for ∆µ .


\sum_{i=1}^{N} \left( y^i \phi_\mu(x^i) \right) \exp\Big\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \Big\} \exp\Big\{ -y^i \Delta_\mu \phi_\mu(x^i) \Big\} = 0.

Which implies (using the definition of D_i and dividing both sides by Σ_i exp{−y^i Σ_{ν=1}^{M} λ_ν φ_ν(x^i)}):
\sum_{i=1}^{N} \left( y^i \phi_\mu(x^i) \right) D_i \, e^{-y^i \Delta_\mu \phi_\mu(x^i)} = 0.

This can be expressed as (using the definitions of W_µ^+ and W_µ^-):

\sum_{i \in W_\mu^+} D_i e^{-\Delta_\mu} - \sum_{i \in W_\mu^-} D_i e^{\Delta_\mu} = 0.

This can be solved to give the result:


\Delta_\mu = \frac{1}{2} \log \frac{\sum_{i \in W_\mu^+} D_i}{\sum_{i \in W_\mu^-} D_i}.

Now compute Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M]:

Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M] = \sum_{i=1}^{N} \exp\Big\{ -y^i \big( \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) + \Delta_\mu \phi_\mu(x^i) \big) \Big\}
= K \sum_{i=1}^{N} D_i \, e^{-y^i \Delta_\mu \phi_\mu(x^i)},

where K = \sum_{i=1}^{N} \exp\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \} is independent of µ. Hence:

Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M] = K \Big\{ \sum_{i \in W_\mu^+} D_i e^{-\Delta_\mu} + \sum_{i \in W_\mu^-} D_i e^{\Delta_\mu} \Big\}
= 2K \sqrt{\sum_{i \in W_\mu^+} D_i} \, \sqrt{\sum_{i \in W_\mu^-} D_i}.
This shows the equivalence between the mathematical and the algorithmic descriptions
of AdaBoost.
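
As a sanity check, the identity Z[λ_1, ..., λ_µ + ∆_µ, ..., λ_M] = 2K √(Σ_{i∈W_µ^+} D_i) √(Σ_{i∈W_µ^-} D_i) can be verified numerically on toy data; all quantities below are illustrative, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)
N, M, mu = 30, 6, 2
phi = rng.choice([-1, 1], size=(N, M))
y = rng.choice([-1, 1], size=N)
lam = rng.normal(size=M)

K = np.sum(np.exp(-y * (phi @ lam)))             # Z at the current lambda
D = np.exp(-y * (phi @ lam)) / K                 # weights D_i
Wp = D[y * phi[:, mu] == 1].sum()                # sum of D_i over W_mu^+
Wm = D[y * phi[:, mu] == -1].sum()               # sum of D_i over W_mu^-
delta = 0.5 * np.log(Wp / Wm)                    # closed-form Delta_mu

lam_new = lam.copy()
lam_new[mu] += delta
Z_direct = np.sum(np.exp(-y * (phi @ lam_new)))  # Z[lambda_mu + Delta_mu] computed directly
Z_formula = 2 * K * np.sqrt(Wp) * np.sqrt(Wm)    # the closed form derived above
assert np.isclose(Z_direct, Z_formula)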

5 Further Discussion
5.1 AdaBoost and Regression
It has been shown (Friedman, Hastie, Tibshirani) that AdaBoost converges asymptotically to a solution of the regression problem:

P(y|x) = \frac{e^{y \sum_\mu \lambda_\mu \phi_\mu(x)}}{e^{\sum_\mu \lambda_\mu \phi_\mu(x)} + e^{-\sum_\mu \lambda_\mu \phi_\mu(x)}}.
This holds only in the limit as the amount of data tends to infinity (plus other technical
conditions).
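
For illustration, once weights λ are available this conditional probability can be evaluated directly; a minimal sketch, with function and argument names that are our own.

import numpy as np

def p_y_given_x(y, phi_x, lambdas):
    """P(y|x) = exp(y F(x)) / (exp(F(x)) + exp(-F(x))), with F(x) = sum_mu lambda_mu phi_mu(x).

    phi_x: array of weak-classifier outputs phi_mu(x) for this x.
    """
    F = np.dot(lambdas, phi_x)
    return np.exp(y * F) / (np.exp(F) + np.exp(-F))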
It has been argued (Miller, Lebanon, and others) that doing regression gives better results than AdaBoost. It does require more computation. It also requires enforcing a sparsity constraint that most of the λ's are zero (see lecture on sparsity).

5.2 Advantages of AdaBoost over Other Methods
The main advantage of AdaBoost is that you can specify a large set of weak classifiers and the algorithm decides which weak classifiers to use by assigning them non-zero weights.
Standard logistic regression only uses a small set of features.
SVM uses the kernel trick (discussed in the next lecture) to simplify the dependence on φ(x), but does not say how to select the kernel or φ.
A multilayer perceptron can be interpreted as selecting weak classifiers, but in a non-optimal manner.
