
Lecture 7.

AdaBoost
Prof. Alan Yuille

Spring 2014

Outline
1. Introduction to AdaBoost

2. Mathematical Description

3. AdaBoost Algorithm

4. Equivalence between Mathematical and Algorithmic Description

5. Further Discussion

1 Introduction to AdaBoost
1.1 Basics
AdaBoost is a method for combining many weak classifiers to make a strong classifier.
Input: set of weak classifiers {φ_µ(x) : µ = 1, ..., M}. Labelled data X = {(x^i, y^i) : i = 1, ..., N} with y^i ∈ {±1}.
Output: strong classifier:

S(x) = \mathrm{sign}\Big( \sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x) \Big),

where the {λµ } are weights to be learned.


We generally want most of the λµ = 0, which means that the corresponding weak
classifier φµ (.) is not selected.
Note: the decision boundary of the strong classifier is a plane (hyperplane) in feature space, see figure 1.

Figure 1: AdaBoost learns a strong classifier whose decision boundary is a plane in feature space.
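
For concreteness, evaluating such a strong classifier once the weights are known might look like the following sketch in Python; the function names and the convention of labelling a zero score as +1 are our own assumptions.

def strong_classifier(x, weak_classifiers, lambdas):
    """Evaluate S(x) = sign(sum_mu lambda_mu * phi_mu(x)).

    weak_classifiers: list of functions phi_mu, each mapping x to +1 or -1.
    lambdas: list/array of learned weights (most are expected to be zero).
    """
    score = sum(lam * phi(x) for lam, phi in zip(lambdas, weak_classifiers))
    return 1 if score >= 0 else -1  # ties at score 0 labelled +1 by convention (our choice)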

1.2 Motivation
The task of AdaBoost is to select weights {λ_µ} to make a strong classifier whose performance is as good as possible.
The motivation is that it is often possible to specify weak classifiers for a problem, e.g., to find a "weak classifier" that is effective sixty percent of the time. Or, more generally, to specify a large dictionary of weak classifiers from which AdaBoost can select a limited number to build a strong classifier which is correct ninety-nine point nine percent of the time.
Why combine the weak classifiers by linear weights? Because there is an efficient algorithm to do this. And because this algorithm appears to give a classifier which generalizes well even if there is only a limited amount of data (as always, cross-validation is used to check that AdaBoost does not overfit).
Note that weak classifiers which are correct only forty percent of the time are still useful. By changing their sign, we can make a weak classifier that is right sixty percent of the time. (Maybe you have a friend whose advice is usually bad – if so, ask for their advice, but then do the opposite.)
The "best" way to combine weak classifiers, if we have enough data, is to learn the distribution P(y|{φ_µ(x)}) – the distribution of y conditioned on the weak classifier responses. But this requires far too much data to be practical (most of the time).

1.3 Example: Face Detection


The training set is a set of images I (we don't use x because that is used to denote position in the image) which are labelled as face or non-face: y = 1 means face, y = −1 means non-face, see figure 2.
Viola and Jones calculated image features f(I). For example, they computed the average intensity within different subregions of the images and subtracted these averages to obtain the feature f(I). They obtained a weak classifier by thresholding the value of the feature (e.g., set y = 1 if f(I) > T). For example, the forehead region is typically a lot brighter than the region directly below – the eyebrows and eyes – which gives us a weak classifier by putting a threshold on the magnitude of this difference. Similarly, faces are often symmetric, so the average intensity on one side of the face is usually approximately equal to the intensity on the other side – so the difference between these averages is often smaller than a threshold.

Figure 2: The face detection problem (left panel). Weak classifiers for detecting faces (right
panels).

Viola and Jones specified a large dictionary of weak classifiers of this type (M = 20,000).
Note: AdaBoost will only give you good results if you have a good dictionary of weak classifiers.
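
As an illustration of the thresholded region-difference features described above, a weak classifier might be sketched as follows; the region parameterization, names, and threshold direction are assumptions, not the exact Viola-Jones construction.

import numpy as np

def region_difference_feature(image, region_a, region_b):
    """f(I): difference of mean intensities of two rectangular subregions.

    Each region is (row_start, row_end, col_start, col_end); this
    parameterization is an illustrative assumption.
    """
    r0, r1, c0, c1 = region_a
    s0, s1, d0, d1 = region_b
    return image[r0:r1, c0:c1].mean() - image[s0:s1, d0:d1].mean()

def weak_classifier(image, region_a, region_b, threshold):
    """Thresholded feature: y = +1 ('face') if f(I) > T, else y = -1."""
    f = region_difference_feature(image, region_a, region_b)
    return 1 if f > threshold else -1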

2 Mathematical Description
2.1 The convex upper bound of the empirical risk
Let

Z[\lambda_1, \ldots, \lambda_M] = \sum_{i=1}^{N} \exp\Big\{ -y^i \sum_{\mu=1}^{M} \lambda_\mu \phi_\mu(x^i) \Big\}.

This is a convex upper bound of the empirical risk of the strong classifier S(.). Convexity can be shown by calculating the Hessian and showing it is positive semi-definite (using the Cauchy-Schwarz inequality).
To prove it is an upper bound of the empirical risk (penalizing all errors the same), recall that the empirical risk of a strong classifier S(.) is:

R_{\mathrm{emp}}(\mathcal{X}) = \sum_{i=1}^{N} \{ 1 - I(S(x^i) = y^i) \},

where I(.) is the indicator function.


To prove the result, we compare the above two equations term by term. Terms in the empirical risk take value 1 if S(x^i) ≠ y^i (if the strong classifier is wrong) and value 0 if S(x^i) = y^i (if the strong classifier gets the right answer).
Mathematical Fact: y^i S(x^i) = 1 if the strong classifier is correct and y^i S(x^i) = −1 if the strong classifier is wrong. This mathematical fact is useful when we discuss AdaBoost and also Max-Margin methods (e.g., Support Vector Machines) in the next lecture.
Now consider the terms exp{−y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i)} in Z[λ_1, ..., λ_M]. If the strong classifier is wrong (S(x^i) ≠ y^i), then y^i and Σ_{µ=1}^{M} λ_µ φ_µ(x^i) must have different signs. This implies that y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i) < 0, hence exp{−y^i Σ_{µ=1}^{M} λ_µ φ_µ(x^i)} > 1. If the strong classifier is right, then the term is still greater than 0 since it is an exponential. Hence each term in Z is bigger than the corresponding term in the empirical risk.
Note: this is a standard technique in machine learning. The empirical risk is a non-
convex function of the parameters λ of the strong classifier. But we can bound it by a
convex function Z. This allows us to specify an algorithm that is guaranteed to converge
to the global minimum of Z, which hence gives an upper bound on the empirical risk.
(This doesn't mean that we have minimized the empirical risk – but it does mean that we know that it is below the minimum of Z.)
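
To make the bound concrete, here is a small numeric check in Python on toy data with random ±1 weak-classifier outputs; every name and number below is illustrative, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 5
phi = rng.choice([-1, 1], size=(N, M))      # phi[i, mu] = phi_mu(x^i)
y = rng.choice([-1, 1], size=N)             # labels y^i
lam = rng.normal(size=M)                    # arbitrary weights lambda_mu

scores = phi @ lam                          # sum_mu lambda_mu phi_mu(x^i)
Z = np.sum(np.exp(-y * scores))             # convex surrogate Z[lambda]
emp_risk = np.sum(np.sign(scores) != y)     # number of strong-classifier errors

assert emp_risk <= Z                        # each exp term dominates each 0/1 term
print(emp_risk, Z)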

2.2 Coordinate Descent


The AdaBoost algorithm can be expressed as coordinate descent in Z (this was not how it was originally described), see figure 3.
Initialize λ_µ = 0, for µ = 1, ..., M.
At time step t, suppose the weights are {λ^t_µ : µ = 1, ..., M}.
For each µ, minimize Z with respect to λ_µ keeping the remaining λ's fixed. I.e. solve ∂Z/∂λ_µ = 0 (because Z is convex there will only be one solution) to obtain λ̂_µ. Then compute Z[λ^t_1, ..., λ̂_µ, ..., λ^t_M]. This gives the decrease in Z due to minimizing it with respect to the µth λ. (See next section for how to do this.)
Then compute:

\hat{\mu} = \arg\min_{\mu} Z[\lambda^t_1, \ldots, \hat{\lambda}_\mu, \ldots, \lambda^t_M].

(I.e. find the µ which gives the maximal decrease in Z.)
Then update the weights:

\lambda^{t+1}_{\hat{\mu}} = \hat{\lambda}_{\hat{\mu}}, \qquad \lambda^{t+1}_\mu = \lambda^t_\mu \ \text{for all}\ \mu \neq \hat{\mu}.

Figure 3: AdaBoost performs coordinate descent in Z. At each time step it calculates the best direction in which to maximally decrease Z – i.e. by selecting µ̂ – and then moves in that direction.

Repeat until Z stops decreasing.
Note: this may over-learn the data. In practice, stop earlier and check by cross-validation.
Intuition. At each time step calculate how much you can decrease Z by changing only one of the λ's, and choose the λ which gives the biggest decrease. Each step of the algorithm decreases Z and so the algorithm converges to the (unique) global minimum of Z.
Note: this algorithm is only practical because we can solve for λ̂µ and for µ̂ efficiently,
see next section. Note: Once one weak classifier is selected, it can be selected again in later
steps.
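
As a sketch of this coordinate-descent scheme (before using the closed-form update of the next section), one could minimize Z over each coordinate numerically; the function names, array layout, and the use of scipy's one-dimensional minimizer are assumptions of this sketch, not part of the lecture.

import numpy as np
from scipy.optimize import minimize_scalar

def Z(lam, phi, y):
    """Z[lambda] = sum_i exp(-y^i * sum_mu lambda_mu phi_mu(x^i))."""
    return np.sum(np.exp(-y * (phi @ lam)))

def coordinate_descent_step(lam, phi, y):
    """One step of the scheme above: for each mu, minimize Z over lambda_mu
    alone (here by a numerical 1-D search), then keep only the single
    coordinate update that lowers Z the most.

    lam: numpy array of current weights; phi: (N, M) array of +/-1 outputs.
    """
    best = (None, np.inf, None)          # (mu_hat, Z value, new lambda_mu)
    for mu in range(phi.shape[1]):
        def Z_mu(t, mu=mu):
            trial = lam.copy()
            trial[mu] = t
            return Z(trial, phi, y)
        res = minimize_scalar(Z_mu)      # Z is convex in each coordinate
        if res.fun < best[1]:
            best = (mu, res.fun, res.x)
    mu_hat, _, lam_mu_hat = best
    new_lam = lam.copy()
    new_lam[mu_hat] = lam_mu_hat
    return new_lam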

3 AdaBoost Algorithm
For each weak classifier φ_µ(.) divide the data into two sets:
(i) W_µ^+ = {i : y^i φ_µ(x^i) = 1}
(ii) W_µ^- = {i : y^i φ_µ(x^i) = −1}
I.e. W_µ^+ is the set of the data which φ_µ(.) classifies correctly, and W_µ^- is the set which it gets wrong.
Then, at each time step t, define a set of "weights" for the training data:

D_i^t = \frac{\exp\{ -y^i \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^i) \}}{\sum_{j=1}^{N} \exp\{ -y^j \sum_{\mu=1}^{M} \lambda^t_\mu \phi_\mu(x^j) \}}.

These weights are all positive and sum to 1 – i.e. Σ_i D_i^t = 1. At t = 0 all the weights take value 1/N. Otherwise, the weights are largest for the data which is incorrectly classified by the classifier sign(Σ_{µ=1}^{M} λ^t_µ φ_µ(x^i)) (our current estimate of the strong classifier) and smallest for those which are correctly classified (to see this, look at the sign of the exponent).
These weights are a way to take into account the weak classifiers we have already selected
and the weights we have assigned them when we try to add a new classifier.
Now we describe the AdaBoost algorithm, explicitly showing how to compute the steps which were summarized in the previous section.
Initialize λµ = 0, for µ = 1, ..., M .
At time step t, let the weights be {λ^t_µ : µ = 1, ..., M}. Then, for each µ compute:

\Delta^t_\mu = \frac{1}{2} \log \frac{\sum_{i \in W_\mu^+} D_i^t}{\sum_{i \in W_\mu^-} D_i^t}.

(this corresponds to solving ∂Z/∂λµ = 0, see next section).


Then solve for:

\hat{\mu} = \arg\min_{\mu} \sqrt{\sum_{i \in W_\mu^+} D_i^t} \, \sqrt{\sum_{i \in W_\mu^-} D_i^t}.

(This computes the change in Z, and selects the µ for which this change is biggest.)
Then set:
\lambda^{t+1}_{\hat{\mu}} = \lambda^t_{\hat{\mu}} + \Delta^t_{\hat{\mu}}, \qquad \lambda^{t+1}_\mu = \lambda^t_\mu \ \text{for all}\ \mu \neq \hat{\mu}.
Repeat until convergence.
Convergence occurs when √(Σ_{i∈W_µ^+} D_i^t) √(Σ_{i∈W_µ^-} D_i^t) takes its maximal value of 1/2 for all µ. (To see this is the maximal possible value, observe that Σ_{i∈W_µ^+} D_i^t + Σ_{i∈W_µ^-} D_i^t = 1.) This maximal value occurs when the weak classifier gets exactly half the data correct (allowing for the data weights) – i.e. when Σ_{i∈W_µ^+} D_i^t = Σ_{i∈W_µ^-} D_i^t = 1/2. (Note, if the weights take value 1/N – at the start of the algorithm – then this is the condition that the weak classifier is fifty percent correct.)
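
Putting the pieces together, a minimal sketch of the algorithm described above might look as follows; the array layout, variable names, fixed step count, and the small epsilon guard against an empty W_µ^+ or W_µ^- are our own choices, not part of the lecture.

import numpy as np

def adaboost(phi, y, num_steps):
    """AdaBoost as coordinate descent, following the steps above.

    phi: (N, M) array with phi[i, mu] = phi_mu(x^i) in {+1, -1}.
    y:   (N,) array of labels in {+1, -1}.
    Returns the weight vector lambda.
    """
    N, M = phi.shape
    lam = np.zeros(M)
    correct = (y[:, None] * phi == 1)        # correct[i, mu] iff i is in W_mu^+
    eps = 1e-12                              # guard against empty W_mu^+ or W_mu^-
    for _ in range(num_steps):
        margins = y * (phi @ lam)
        D = np.exp(-margins)
        D /= D.sum()                         # weights D_i^t, positive and summing to 1
        W_plus = np.where(correct, D[:, None], 0.0).sum(axis=0)
        W_minus = 1.0 - W_plus               # since W_plus + W_minus = 1 for each mu
        delta = 0.5 * np.log((W_plus + eps) / (W_minus + eps))
        # pick mu minimizing sqrt(W_plus)*sqrt(W_minus), i.e. the biggest drop in Z
        mu_hat = np.argmin(np.sqrt(W_plus * W_minus))
        lam[mu_hat] += delta[mu_hat]
    return lam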

4 Equivalence between Mathematical and Algorithmic Description
We omit superscript t since the following argument applies to any step.
\frac{\partial Z}{\partial \lambda_\mu} = \sum_{i=1}^{N} \left( -y^i \phi_\mu(x^i) \right) \exp\Big\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \Big\} = 0.

Set λµ = λµ + ∆µ and solve for ∆µ .


\sum_{i=1}^{N} \left( y^i \phi_\mu(x^i) \right) \exp\Big\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \Big\} \exp\Big\{ -y^i \Delta_\mu \phi_\mu(x^i) \Big\} = 0.

Which implies (using the definition of D_i and dividing both sides by Σ_i exp{−y^i Σ_{ν=1}^{M} λ_ν φ_ν(x^i)}):
\sum_{i=1}^{N} \left( y^i \phi_\mu(x^i) \right) D_i \, e^{-y^i \Delta_\mu \phi_\mu(x^i)} = 0.

This can be expressed as (using the definitions of W_µ^+ and W_µ^-):

\sum_{i \in W_\mu^+} D_i e^{-\Delta_\mu} - \sum_{i \in W_\mu^-} D_i e^{\Delta_\mu} = 0.

This can be solved to give the result:


\Delta_\mu = \frac{1}{2} \log \frac{\sum_{i \in W_\mu^+} D_i}{\sum_{i \in W_\mu^-} D_i}.

Now compute Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M]:

Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M] = \sum_{i=1}^{N} \exp\Big\{ -y^i \big( \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) + \Delta_\mu \phi_\mu(x^i) \big) \Big\}
= K \sum_{i=1}^{N} D_i \, e^{-y^i \Delta_\mu \phi_\mu(x^i)},

where K = \sum_{i=1}^{N} \exp\{ -y^i \sum_{\nu=1}^{M} \lambda_\nu \phi_\nu(x^i) \} is independent of µ. Hence:

Z[\lambda_1, \ldots, \lambda_\mu + \Delta_\mu, \ldots, \lambda_M] = K \Big\{ \sum_{i \in W_\mu^+} D_i e^{-\Delta_\mu} + \sum_{i \in W_\mu^-} D_i e^{\Delta_\mu} \Big\}
= 2K \sqrt{\sum_{i \in W_\mu^+} D_i} \, \sqrt{\sum_{i \in W_\mu^-} D_i}.
This shows the equivalence between the mathematical and the algorithmic descriptions
of AdaBoost.
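
As a sanity check, the identity Z[λ_1, ..., λ_µ + ∆_µ, ..., λ_M] = 2K √(Σ_{i∈W_µ^+} D_i) √(Σ_{i∈W_µ^-} D_i) can be verified numerically on toy data; all quantities below are illustrative, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)
N, M, mu = 30, 6, 2
phi = rng.choice([-1, 1], size=(N, M))
y = rng.choice([-1, 1], size=N)
lam = rng.normal(size=M)

K = np.sum(np.exp(-y * (phi @ lam)))             # Z at the current lambda
D = np.exp(-y * (phi @ lam)) / K                 # weights D_i
Wp = D[y * phi[:, mu] == 1].sum()                # sum of D_i over W_mu^+
Wm = D[y * phi[:, mu] == -1].sum()               # sum of D_i over W_mu^-
delta = 0.5 * np.log(Wp / Wm)                    # closed-form Delta_mu

lam_new = lam.copy()
lam_new[mu] += delta
Z_direct = np.sum(np.exp(-y * (phi @ lam_new)))  # Z[lambda_mu + Delta_mu] computed directly
Z_formula = 2 * K * np.sqrt(Wp) * np.sqrt(Wm)    # the closed form derived above
assert np.isclose(Z_direct, Z_formula)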

5 Further Discussion
5.1 AdaBoost and Regression
It has been shown (Friedman, Hastie, Tibshirani) that AdaBoost converges asymptotically to a solution of the regression problem:

P(y|x) = \frac{e^{y \sum_\mu \lambda_\mu \phi_\mu(x)}}{e^{\sum_\mu \lambda_\mu \phi_\mu(x)} + e^{-\sum_\mu \lambda_\mu \phi_\mu(x)}}.
This holds only in the limit as the amount of data tends to infinity (plus other technical
conditions).
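
For illustration, once weights λ are available this conditional probability can be evaluated directly; a minimal sketch, with function and argument names that are our own.

import numpy as np

def p_y_given_x(y, phi_x, lambdas):
    """P(y|x) = exp(y F(x)) / (exp(F(x)) + exp(-F(x))), with F(x) = sum_mu lambda_mu phi_mu(x).

    phi_x: array of weak-classifier outputs phi_mu(x) for this x.
    """
    F = np.dot(lambdas, phi_x)
    return np.exp(y * F) / (np.exp(F) + np.exp(-F))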
It has been argued (Miller, Lebanon, and others) that doing regression gives better results than AdaBoost. It does require more computation. It also requires enforcing a sparsity constraint that most of the λ's are zero (see lecture on sparsity).

5.2 Advantages of AdaBoost over Other Methods
The main advantage of AdaBoost is that you can specify a large set of weak classifiers and the algorithm decides which weak classifiers to use by assigning them non-zero weights.
Standard logistic regression only uses a small set of features.
SVM uses the kernel trick (discussed in the next lecture) to simplify the dependence on φ(x), but does not say how to select the kernel or φ.
A multilayer perceptron can be interpreted as selecting weak classifiers, but in a non-optimal manner.
