
Pattern Recognition

(Pattern Classification)
AdaBoost (Adaptive Boosting)
Hypothesis set and Algorithm

Second Edition
Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree

This chapter is mostly based on:


Foundations of Machine Learning, 2nd Ed., by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, MIT Press, 2018

1- Boosting
Boosting
• In 1988, Kearns and Valiant posed the theoretical question of whether a "weak (non-complex)" learner that performs just slightly better than random guessing can be "boosted" into an arbitrarily accurate "strong (complex)" learning algorithm
• Schapire came up with the first provable polynomial-time boosting algorithm in 1989
• Freund developed a much more efficient boosting learner in 1990. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard on an OCR task.
• The AdaBoost algorithm, introduced in 1995 by Freund and Schapire, is a very practical machine learning tool.
Recall from Chapter 1
Other Model selection methods
• Boosting (Ensemble methods)
• Minimum description length (MDL)
• Akaike’s information criterion (AIC)*
• Bayesian information criterion (BIC)
• Focused information criterion (FIC)
•…
Note: MDL "gives a selection criterion formally identical to BIC approach" for large number of
SAMPLEs
* Akaike, H. (1974), "A new look at the statistical model identification", IEEE Transactions on Automatic Control, 19 (6): 716–723, Bibcode: 1974ITAC...19..716A, doi:10.1109/TAC.1974.1100705, MR 0423716.

From Chapter 1: Model Selection Problem
• A key problem in the design of learning algorithms is the choice of the hypothesis set H. This is known as the model selection problem
• How should H be chosen?
• A rich or complex enough H could contain the ideal Bayes classifier
• On the other hand, learning with such a complex family becomes a very difficult task
• The choice of H is subject to a trade-off that can be analyzed in terms of:
• estimation error (empirical error)
• approximation error (hypothesis set complexity)
• We focus on the case of binary classification, but much of what is discussed can be straightforwardly extended to different tasks and loss functions
Boosting is a Model selection strategy
• Generalization is done by boosting (amplifying) the accuracy of a weak hypothesis set to address the complexity (bias) trade-off issue:
• The error of a learner can be decomposed into a sum of approximation error and estimation error: a smaller approximation error (bias) generally means a larger estimation error
• A learner is thus faced with the problem of picking a good trade-off between these two considerations.

$$\underbrace{R(h) - R^*}_{\text{excess error: how good } h \text{ is}} \;=\; \underbrace{\big(R(h) - R(h^*)\big)}_{\text{estimation error: how good } h \text{ is in } H} \;+\; \underbrace{\big(R(h^*) - R^*\big)}_{\text{approximation error (bias): how good } H \text{ is}}$$

where $h^*$ is the best hypothesis in $H$ and $R^*$ is the Bayes error.
Boosting the complexity of H
• The boosting paradigm allows the learner to have smooth control over this trade-off
• Learning starts with a basic, simple (weak) hypothesis set (which might have a large approximation error), and as it progresses, the set grows richer (more complex)
• This is just the opposite of regularization-based model selection, which begins with a very complex H and whose learning process tries to reduce the complexity up to the best trade-off

Boosting the computational complexity
• A boosting algorithm amplifies the accuracy of weak learners (a simple hypothesis set)
• Intuitively, one can think of a weak learner as an algorithm that uses a simple learner to output a hypothesis that comes from an easy-to-learn hypothesis set H and performs just slightly better than a random guess
• When a weak learner can be implemented efficiently, boosting provides a tool for aggregating such weak hypotheses to gradually approximate good predictors for larger, and harder to learn, concepts.

Ensemble methods
• Ensemble methods are general techniques in machine learning for combining several hypotheses (predictors) to create a more accurate one.
• For a non-trivial learning task, it is often difficult to directly devise an accurate algorithm satisfying the strong PAC-learning requirements.
• A large ensemble of diverse weak classifiers can have exceptional performance.
• Boosting stemmed from the theoretical question of whether an efficient weak learner can be "boosted" into an efficient strong learner.

From Chapter 1: Definition 2.3: PAC-learning definition
• A concept class $C$ is said to be PAC-learnable if there exists an algorithm $A$ and a polynomial function $\mathrm{poly}(\cdot,\cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ on $X$ and for any target concept $c \in C$, the following holds for any sample size $m \ge \mathrm{poly}(1/\epsilon, 1/\delta)$:

$$\mathbb{P}_{S \sim D^m}\big[ R(h_S) \le \epsilon \big] \ge 1 - \delta \qquad (2.4)$$

($1-\delta$: confidence; $\epsilon$: error; $1-\epsilon$: accuracy)

• When such an algorithm $A$ exists, it is called a PAC-learning algorithm for $C$
• (2.4): $C$ is PAC-learnable if $h_S$ is approximately correct (error at most $\epsilon$) with high probability (at least $1-\delta$)

Definition 7.1 (Weak learning)
• The following gives a formal definition of weak learners.
• Let $n$ be a number such that the computational cost of representing any element $x \in X$ is at most $O(n)$, and denote by $\mathrm{size}(c)$ the maximal cost of the computational representation of $c \in C$.
• Definition 7.1 (Weak learning): A concept class $C$ is said to be weakly PAC-learnable if there exists an algorithm $A$, $\gamma > 0$, and a polynomial function $\mathrm{poly}(\cdot,\cdot,\cdot)$ such that for any $\delta > 0$, for all distributions $\mathcal{D}$ on $X$ and for any target concept $c \in C$, the following holds for any sample size $m \ge \mathrm{poly}(1/\delta, n, \mathrm{size}(c))$:

$$\mathbb{P}_{S \sim \mathcal{D}^m}\left[ R(h_S) \le \frac{1}{2} - \gamma \right] \ge 1 - \delta \qquad (7.1)$$

• where $h_S$ is the hypothesis returned by algorithm $A$ when trained on sample $S$.


Base classifier
• When such an algorithm exists, it is called a weak learning algorithm for $C$, or a weak learner.
• The hypotheses returned by a weak learning algorithm are called base classifiers.

Boosting algorithms
• Key idea behind boosting algorithms is to use a weak learning
algorithm to build a strong learner, that is, an accurate PAC-learning
algorithm.
• Boosting techniques use an ensemble method: they combine
different base classifiers returned by a weak learner to create a more
accurate predictor.

Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree

2- AdaBoost
AdaBoost algorithm
• $H$ is the hypothesis set out of which the base classifiers are selected (the base classifier set).
• Pseudocode: each base classifier $h_t$ is a function mapping from $X$ to $\{-1, +1\}$.
• The algorithm takes as input a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, with $y_i \in \{-1, +1\}$ for all $i \in [m]$.
• $\alpha_t$ is the mixture (ensemble) weight of $h_t$; $Z_t$ is a normalization factor that ensures the weights $\mathcal{D}_{t+1}(i)$ sum to one.
• Each base classifier's error should be less than random: $\epsilon_t < \frac{1}{2}$.
Ensemble of base classifiers
• At each iteration $t$ of the loop (lines 3-8 of the pseudocode), a new base classifier $h_t$ is selected that minimizes the error on the training sample weighted by the distribution $\mathcal{D}_t$:

$$h_t \in \operatorname*{argmin}_{h \in H} \; \mathbb{P}_{i \sim \mathcal{D}_t}\big[ h(x_i) \neq y_i \big] = \operatorname*{argmin}_{h \in H} \; \sum_{i=1}^{m} \mathcal{D}_t(i)\, 1_{h(x_i) \neq y_i}$$

• $\epsilon_t = \mathbb{P}_{i \sim \mathcal{D}_t}\big[ h_t(x_i) \neq y_i \big]$ (error of base classifier $h_t$ at round $t$),
• then $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, where $\frac{1-\epsilon_t}{\epsilon_t}$ is the ratio of accuracy to error.

Higher weight for misclassified examples
• The distribution $\mathcal{D}_{t+1}$ substantially increases the weight on $x_i$ if it is incorrectly classified by $h_t$ ($y_i h_t(x_i) < 0$), and, on the contrary, decreases it if $x_i$ is correctly classified.
• This has the effect of focusing more, at the next round of boosting, on the examples incorrectly labeled by $h_t$, and less on those correctly classified:

$$\mathcal{D}_{t+1}(i) \leftarrow \frac{\mathcal{D}_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$

• where $\alpha_t$ is the mixture weight and $Z_t = \sum_{i=1}^{m} \mathcal{D}_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$ is the normalization factor.
• A simple mathematical derivation of the algorithm: Rojas, R. (2009). AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting. Freie Universität Berlin, Tech. Report.

$$Z_t = 2\big[\epsilon_t(1-\epsilon_t)\big]^{1/2}, \qquad \alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}, \qquad \mathcal{D}_{t+1}(i) \leftarrow \frac{\mathcal{D}_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$

[Plots of $\alpha_t$ and $Z_t$ as functions of $\epsilon_t$: for $\epsilon_t < 0.5$, $\alpha_t > 0$, and $\alpha_t$ grows as $\epsilon_t$ decreases]
$h$: a linear mixture (combination) of base classifiers
• After $T$ rounds of boosting, the hypothesis (classifier) returned by AdaBoost is based on the sign of the function $h = \sum_{t=1}^{T} \alpha_t h_t$, which is a non-negative linear combination of the base classifiers $h_t$.
• The weight $\alpha_t$ assigned to $h_t$ in $h$ is a logarithmic function of the ratio $\frac{1-\epsilon_t}{\epsilon_t}$.
• A more accurate $h_t$ has a higher $\frac{1-\epsilon_t}{\epsilon_t}$ and a higher $\alpha_t$.
• Thus, a more accurate $h_t$ is assigned a larger weight in $h$.
Linear mixture of base classifiers
• $h$ is a linear function in the space of base classifier outputs:
• $h(x) = \boldsymbol{\alpha} \cdot \mathbf{h}(x)$, with $\mathbf{h}(x) = (h_1(x), \ldots, h_T(x))^{\top}$ and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_T)^{\top}$.
• The vector of base hypothesis values $\mathbf{h}(x)$ can be viewed as a feature vector associated to $x$, similar to $\Phi(x)$ in kernel SVM, and $\boldsymbol{\alpha}$ is the weight vector that was denoted by $w$.
• AdaBoost: $h(x) = \boldsymbol{\alpha} \cdot \mathbf{h}(x)$; SVM: $h(x) = w \cdot \Phi(x) + b$.
• SVM: a linear mixture of features.

Base classifier in ensemble methods
• A combination of several diverse suboptimal learners may lead to a better overall learner.
• Hypotheses of an unstable hypothesis set are diverse
• Unstable means: small changes in the training sample can cause substantial changes in the hypothesis trained on the sample
• Unstable hypotheses are versatile models which react to small changes in the training sample
• Unstable classifiers play a major role in ensemble methods
• Examples of unstable hypothesis sets:
• decision tree classifiers
• some neural networks
• SVM is a stable learning machine

H in AdaBoost
• The family of base classifiers H typically used with AdaBoost in practice is that of decision trees of depth one, known as stumps (a root and 2 leaves).
• Boosting stumps are threshold functions associated to a single feature
• If the data is in $\mathbb{R}^N$ ($N$ features), we can associate a stump to each of the $N$ components
• To determine the stump with the minimal weighted error at each round of boosting, the best feature and its best threshold must be computed.
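To make the stump search concrete, the following is a hedged brute-force sketch of how the best (feature, threshold, polarity) triple could be found at one round; the exhaustive scan over all observed feature values and the function name are illustrative choices, not the implementation used in the book.

```python
import numpy as np

def best_stump(X, y, D):
    """Return (feature index, threshold, polarity) minimizing the D-weighted error.

    X: (m, n) data matrix, y: (m,) labels in {-1, +1}, D: (m,) distribution.
    Brute-force sketch: every observed value of every feature is tried as a threshold.
    """
    m, n = X.shape
    best = (None, None, None, np.inf)
    for j in range(n):                                   # scan features
        for theta in np.unique(X[:, j]):                 # candidate thresholds
            for polarity in (+1, -1):                    # which side is labeled +1
                pred = polarity * np.where(X[:, j] > theta, 1, -1)
                err = np.sum(D[pred != y])               # weighted error of this stump
                if err < best[3]:
                    best = (j, theta, polarity, err)
    return best[:3]
```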

Example 1 – Ensemble of linear classifiers

Example 1 – Ensemble of linear classifiers

[Figure: base classifiers $h_1(x)$, $h_2(x)$ and the ensemble $h = \boldsymbol{\alpha} \cdot \mathbf{h}$; the training error curve shows values around 0.35, 0.10, and 0.05, with snapshots at $t = 5$ and $t = 40$ rounds]
Example 2 – Ensemble of stumps
• Best thresholds (decision boundaries) at each boosting round: $h_1$, $h_2$, $h_3$.
• Visualization of the final classifier $h$, constructed as a linear combination of the base classifiers:

$$h = \sum_{t=1}^{3} \alpha_t h_t$$

• The weights $\mathcal{D}_t(i)$ are updated at each round.
Example 3 (10 points with initial weight 0.1 each; 3 misclassified at round 1, so $\epsilon_1 = 0.3$ and $\alpha_1 = \frac{1}{2}\log\frac{1-\epsilon_1}{\epsilon_1} = 0.424$)

$$Z_t = 2\big[\epsilon_t(1-\epsilon_t)\big]^{0.5}$$

• correctly classified points: $\mathcal{D}_2(i) = \dfrac{0.1\, e^{-0.424}}{2(0.3 \times 0.7)^{0.5}} = 0.071$
• incorrectly classified points: $\mathcal{D}_2(i) = \dfrac{0.1\, e^{+0.424}}{2(0.3 \times 0.7)^{0.5}} = 0.167$
• $\epsilon_2 = 3 \times 0.071 = 0.213$, $\alpha_2 = 0.653$
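These numbers can be verified with a few lines of arithmetic. The snippet below reproduces them under the assumption, consistent with the example, that the 10 points start with equal weight 0.1, that 3 are misclassified at round 1, and that the 3 points misclassified at round 2 were correctly classified at round 1.

```python
import numpy as np

eps1 = 0.3                                   # 3 of 10 equally weighted points misclassified
alpha1 = 0.5 * np.log((1 - eps1) / eps1)     # ~= 0.424
Z1 = 2 * np.sqrt(eps1 * (1 - eps1))          # normalization factor

D2_correct = 0.1 * np.exp(-alpha1) / Z1      # ~= 0.071 (weight of a correctly classified point)
D2_incorrect = 0.1 * np.exp(+alpha1) / Z1    # ~= 0.167 (weight of a misclassified point)
eps2 = 3 * D2_correct                        # ~= 0.213 (3 previously correct points now misclassified)
alpha2 = 0.5 * np.log((1 - eps2) / eps2)     # ~= 0.653
print(alpha1, D2_correct, D2_incorrect, eps2, alpha2)
```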

Example 3 (continued)

$$\boldsymbol{\alpha} = [\alpha_1\ \ \alpha_2\ \ \alpha_3]^{\top} = [0.424\ \ 0.653\ \ 0.908]^{\top}, \qquad \mathbf{h} = [h_1\ \ h_2\ \ h_3]^{\top}$$

• Range of $h(x) = \boldsymbol{\alpha} \cdot \mathbf{h}(x)$: $|h(x)| \le 0.424 + 0.653 + 0.908$

Normalized version of $h$
• We will denote by $\bar{h}$ the normalized version of the function returned by AdaBoost:

$$\bar{h}(x) = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} = \frac{\boldsymbol{\alpha} \cdot \mathbf{h}(x)}{\|\boldsymbol{\alpha}\|_1} = \bar{\boldsymbol{\alpha}} \cdot \mathbf{h}(x), \qquad \bar{\boldsymbol{\alpha}} = \frac{\boldsymbol{\alpha}}{\|\boldsymbol{\alpha}\|_1}$$

Recall from Chapter 1: Definition 5.1 - margins
• The definition of the SVM solution is based on the notion of margin. If we omit outliers, the training data is correctly separated by the hyperplane $h(\Phi(x)) = w \cdot \Phi(x) + b = 0$ with a margin $\rho$:
• geometric margin at a point $x_i$ = distance from $x_i$ to the hyperplane $w \cdot \Phi(x) + b = 0$:

$$\rho_h(x_i) = \frac{y_i\,\big(w \cdot \Phi(x_i) + b\big)}{\|w\|_2}$$

• geometric margin of the classifier: $\rho_h = \min_i \rho_h(x_i)$; the width of the separating band is $2\rho_h$
• $\|w\|_2 = \big(w_1^2 + w_2^2 + \cdots + w_N^2\big)^{0.5}$

Recall from Chapter 1: Soft margin - Hard margin
• For the soft-margin SVM, a vector $x_i$ with slack $\xi_i > 0$ can be viewed as an outlier
• $x_i$ with $0 < \xi_i \le 1$ is correctly classified by the hyperplane but is still considered an outlier, that is, $\xi_i > 0$
• If we omit outliers, the training data is correctly separated by $h$ with a margin that we refer to as the soft margin, as opposed to the hard margin in the separable case

Definition 7.3 (L1-geometric margin of $h$)
• $\rho_h(x)$ = L1-geometric margin at a point $x$ with label $y$, for a linear function $h = \sum_{t=1}^{T} \alpha_t h_t$ with $\boldsymbol{\alpha} \neq 0$, defined as:

$$\rho_h(x) = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\boldsymbol{\alpha}\|_1} = \frac{y\, \boldsymbol{\alpha} \cdot \mathbf{h}(x)}{\|\boldsymbol{\alpha}\|_1} = y\, \bar{\boldsymbol{\alpha}} \cdot \mathbf{h}(x) = y\, \bar{h}(x)$$

(compare with the SVM margin $y\,(w \cdot \Phi(x))/\|w\|_2$)

• $\rho_h$ = L1-geometric margin of $h$: if we omit misclassified examples, the training data is correctly separated by $h$ with a margin that we refer to as the soft margin. The L1-margin of $h$ over a sample is its minimum margin at the points in that sample:

$$\rho_h = \min_{i \in [m]} \rho_h(x_i) = \min_{i \in [m]} \frac{|\boldsymbol{\alpha} \cdot \mathbf{h}(x_i)|}{\|\boldsymbol{\alpha}\|_1} = \min_{i \in [m]} \big|\bar{\boldsymbol{\alpha}} \cdot \mathbf{h}(x_i)\big|$$
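As a small illustration, the L1-geometric margin of an ensemble over a sample could be computed as follows; the array layout (a precomputed matrix of base-classifier outputs) is an assumption made for this sketch.

```python
import numpy as np

def l1_margin(alpha, H, y):
    """L1-geometric margin of the ensemble over a labeled sample.

    alpha: (T,) non-negative mixture weights.
    H:     (m, T) matrix with H[i, t] = h_t(x_i) in {-1, +1}.
    y:     (m,) labels in {-1, +1}.
    """
    scores = H @ alpha                               # alpha . h(x_i) for each example
    margins = y * scores / np.abs(alpha).sum()       # rho_h(x_i) = y_i alpha.h(x_i) / ||alpha||_1
    return margins.min()                             # rho_h = min_i rho_h(x_i)
```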

Analogy with margin in SVM
• Consider $h(x) = \boldsymbol{\alpha} \cdot \mathbf{h}(x)$: the vector of base hypothesis values $\mathbf{h}(x)$ can be viewed as a feature vector associated to $x$, similar to $\Phi(x)$ in kernel SVM, and $\boldsymbol{\alpha}$ is the weight vector that was denoted by $w$.
• For ensemble linear combinations such as those returned by AdaBoost, additionally, the weight vector is non-negative: $\boldsymbol{\alpha} \ge 0$.
• This gives a notion of geometric margin for such ensemble functions which differs from the one introduced for SVMs only by the use of the L1 norm instead of the L2 norm.

Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree

3- AdaBoost and margin maximization
AdaBoost and margin maximization
• The maximum margin for a linearly separable sample is given by

$$\operatorname*{argmax}_{\boldsymbol{\alpha}} \; \min_{i \in [m]} \rho_h(x_i) = \operatorname*{argmax}_{\boldsymbol{\alpha}} \; \min_{i \in [m]} \frac{y_i\, \boldsymbol{\alpha} \cdot \mathbf{h}(x_i)}{\|\boldsymbol{\alpha}\|_1}$$

• By definition, using the confidence margin, the optimization problem can be written as:

$$\max_{\boldsymbol{\alpha},\, \rho} \; \rho \quad \text{subject to: } \; y_i\, \boldsymbol{\alpha} \cdot \mathbf{h}(x_i) \ge \rho \;\; (i \in [m]), \quad \sum_{t=1}^{T} \alpha_t = 1, \quad \alpha_t \ge 0$$

• This is a linear program (LP), that is, a convex optimization problem with a linear objective function and linear constraints. There are several different methods for solving relatively large LPs in practice: the simplex method, interior-point methods, or a variety of special-purpose solutions.
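As an illustration, this LP can be handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog with variables $(\alpha_1, \ldots, \alpha_T, \rho)$ under the formulation written above (maximize $\rho$ subject to $y_i\, \boldsymbol{\alpha} \cdot \mathbf{h}(x_i) \ge \rho$, $\sum_t \alpha_t = 1$, $\alpha_t \ge 0$); it is a schematic, not code from the book.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_weights(H, y):
    """Maximize the L1-margin: decision variables are (alpha_1..alpha_T, rho).

    H: (m, T) matrix of base-classifier outputs, y: (m,) labels in {-1, +1}.
    """
    m, T = H.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                            # minimize -rho <=> maximize rho
    # margin constraints: rho - y_i * (H[i] @ alpha) <= 0 for every example i
    A_ub = np.hstack([-(y[:, None] * H), np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(T), [0.0]])[None, :]     # sum_t alpha_t = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * T + [(None, None)]               # alpha_t >= 0, rho unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:T], res.x[-1]                             # (alpha, achieved margin rho)
```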

Theorem 7.7: Bound on the empirical margin loss (training error)
• Let $\bar{h}$ denote the normalized function returned by AdaBoost after $T$ rounds of boosting and assume for all $t$ that $\epsilon_t \le \frac{1}{2}$, which implies $\alpha_t \ge 0$. Then, for any confidence margin $\rho > 0$, the following holds:

$$\hat{R}_{S,\rho}(\bar{h}) = \frac{1}{m}\sum_{i=1}^{m} 1_{y_i \bar{h}(x_i) \le \rho} \;\le\; 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\rho}\,(1-\epsilon_t)^{\,1+\rho}}$$

• Note that $y_i \bar{h}(x_i)/\rho = \rho_h(x_i)/\rho$, and that the margin-loss indicator $1_{y_i \bar{h}(x_i) \le \rho}$ is upper bounded by the exponential $e^{-\left(y_i \bar{h}(x_i)/\rho - 1\right)}$, a convex surrogate.

Theorem 7.7: Bound on the empirical loss
• Furthermore, assume that for all $t \in [T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$.
• $\gamma$ is known as the edge.
• Note: $\epsilon_t < \tfrac{1}{2}$ means $\gamma > 0$.
• For $\rho < \gamma$, the empirical margin loss is upper bounded:

$$\hat{R}_{S,\rho}(\bar{h}) \le \left[(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}\right]^{T/2}$$

• Note that for $\rho < \gamma$ we have $(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho} < 1$; therefore the empirical margin loss decreases exponentially fast when $T$ increases and becomes zero for sufficiently large $T$.
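A quick numeric check of this exponential decrease, using illustrative values of the edge $\gamma$ and the margin $\rho$ (with $\rho < \gamma$):

```python
import numpy as np

gamma, rho = 0.1, 0.05                    # illustrative edge and confidence margin, rho < gamma
base = (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
for T in (10, 50, 100):
    print(T, base ** (T / 2))             # bound shrinks exponentially in T since base < 1
```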

Note on the bound on the empirical loss
• The value of $\gamma$ and the accuracy of the base classifiers do not need to be known to the AdaBoost algorithm.
• The algorithm adapts to the accuracy of the base classifiers and defines a solution based on the $\epsilon_t$'s.
• In practice, the error $\epsilon_t$ may increase as a function of $t$. This is because boosting presses the weak learner to concentrate on instances that are harder and harder to classify, for which even the best base classifier could not achieve an error significantly better than random.

Note on the bound on the empirical loss
• If $\epsilon_t$ becomes close to 0.5 relatively fast as a function of $t$, then the bound of Theorem 7.7 becomes uninformative:

$$\epsilon_t \to 0.5 \;\Rightarrow\; 0 < \gamma \le (0.5 - \epsilon_t) \;\Rightarrow\; \gamma \cong 0 \;\Rightarrow\; \hat{R}_{S,\rho}(\bar{h}) \le \left[(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}\right]^{T/2} \approx 1$$

VC-dimension-based analysis of AdaBoost
• The family of functions out of which AdaBoost selects its output after $T$ rounds of boosting is

$$\mathcal{F}_T = \left\{ \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t\right) : \alpha_t \ge 0,\; h_t \in H,\; t \in [T] \right\}$$

VC-dimension-based analysis of AdaBoost
• The VC-dimension of $\mathcal{F}_T$ can be bounded as follows in terms of the VC-dimension $d$ of the family of base hypotheses $H$:

$$\mathrm{VCdim}(\mathcal{F}_T) \le 2(d+1)(T+1)\log_2\big((T+1)e\big)$$

• The bound suggests that AdaBoost could overfit for large values of $T$, and indeed this can occur.
• However, in many cases, it has been observed empirically that the generalization error of AdaBoost decreases as a function of the number of rounds of boosting $T$, as illustrated in Figure 7.5.

Test error of AdaBoost
• Test error as a function of the number of rounds of boosting

Figure 7.5
An empirical result using AdaBoost with C4.5 decision trees as base learners. In this example, the training error goes to zero after about 5 rounds of boosting, yet the test error continues to decrease for larger values of $T$ (a reduction of bias due to increasing hypothesis set complexity).

Generalization bound, using margin bound and VC-dimension complexity
• Corollary 7.6: Let $H$ be a family of functions taking values in $\{+1, -1\}$ with VC-dimension $d$.
• Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:

$$R(h) \le \hat{R}_{S,\rho}(h) + \frac{2}{\rho}\sqrt{\frac{2d\log\frac{em}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \qquad (7.15)$$

• $\mathrm{conv}(H)$ stands for "convex combinations of base hypotheses"
• The bound holds for binary classification
• $\rho$ can be chosen as a larger quantity for which $\hat{R}_{S,\rho}(h)$ still vanishes, while the complexity term becomes more favorable since it decreases as $\rho$ increases
• $\rho$ is a free parameter that is typically determined via cross-validation

Empirical observation - Linearly separable case
• The empirical margin loss becomes zero for sufficiently large $T$
• In some tasks, the generalization error decreases as a function of $T$ even after the error on the training sample is zero.
• This means that the geometric margin continues to increase
• The margin-based analysis (7.15) supports the theoretical explanation for these empirical observations

AdaBoost in practice
• AdaBoost may admit a negative edge $\gamma$, in which case the weak learning condition ($\epsilon_t < \frac{1}{2}$ for all $t$) does not hold.
• AdaBoost may result in a few base classifiers $h_t$ with large total mixture weights ($\alpha_t$)
• This can happen because the algorithm increasingly concentrates on a few examples that are hard to classify and whose weights keep growing. Only a few base classifiers might achieve the best performance for hard examples. These base classifiers, with relatively large total mixture weights, dominate the ensemble and therefore solely dictate the classification decision.
• The performance of the resulting ensemble is typically poor since it almost entirely hinges on that of a few base classifiers.

L1-regularized AdaBoost
Two common remedies:
1. Limiting the number of rounds of boosting $T$, which is also known as early stopping.
2. Controlling the magnitude of the $\alpha_t$'s. This can be done by augmenting the objective function of AdaBoost with a regularization term based on a norm of the vector of mixture weights. It is referred to as L1-regularized AdaBoost:

$$\min_{\boldsymbol{\alpha}} \; \underbrace{\frac{1}{m}\sum_{i=1}^{m} e^{-y_i h(x_i)}}_{\text{convex, differentiable upper bound on the zero-one loss (see Theorem 7.7)}} \; + \; \underbrace{\lambda\,\|\boldsymbol{\alpha}\|_1}_{\text{regularization term}}, \qquad h(x_i) = \sum_{t=1}^{T} \alpha_t h_t(x_i) \qquad (7.31)$$
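A minimal sketch of evaluating the objective (7.31) for a given weight vector; its minimization (e.g. by coordinate descent) is not shown, and the matrix of base-classifier outputs is assumed precomputed.

```python
import numpy as np

def l1_regularized_objective(alpha, H, y, lam):
    """Objective (7.31): exponential loss of the ensemble plus an L1 penalty.

    alpha: (T,) mixture weights, H: (m, T) base-classifier outputs in {-1, +1},
    y: (m,) labels in {-1, +1}, lam: regularization strength lambda.
    """
    scores = H @ alpha                              # h(x_i) = sum_t alpha_t h_t(x_i)
    exp_loss = np.mean(np.exp(-y * scores))         # convex upper bound on the 0-1 loss
    return exp_loss + lam * np.abs(alpha).sum()     # add lambda * ||alpha||_1
```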

A practical application: face detector
• Viola and Jones’s algorithm: cascade of AdaBoost ensembles
• A square window is slid over the image and each window is classified as positive (face) or negative (non-face).

[Diagram: a cascade $AdaBoost_1 \to AdaBoost_2 \to AdaBoost_3$; a window is passed to the next stage only if the current stage classifies it as positive (P)]
[Figure: an example of the output of the Viola–Jones face detector]
Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree

4- Multiclass boosting algorithms
Recall from Chapter 1: Multiclass classification
• Let $X$ denote the input space and $Y$ the output space of $k$ classes (labels), and let $D$ be an unknown distribution over $X$ according to which input points are drawn. We will distinguish between two cases:
• the mono-label case, where $Y$ is a finite set of classes that we mark with numbers for convenience, $Y = \{1, \ldots, k\}$, and
• the multi-label case, where $Y = \{-1, +1\}^k$.

Recall from Chapter 1: Multiclass classification
• In the mono-label case, each example is labeled with a single class, while in the multi-label case it can be labeled with several. The multi-label case can be illustrated with text documents, which can be labeled with several different relevant topics
• Multi-label example with $k = 3$ classes: 1-sport, 2-business, 3-society. The positive components of a vector $y_i \in \{-1, +1\}^3$ indicate the classes associated with example $x_i$:
• $y_i = (+1, -1, +1)$, that is, $x_i$ is labeled as sport and society.

$$h(x_i, y_i[1]) \quad h(x_i, y_i[2]) \quad h(x_i, y_i[3])$$
(sport) (business) (society)

Multiclass boosting algorithms
• We describe a boosting algorithm for multi-class classification called
AdaBoost.MH
• The multi-label setting is:
• Training set: $S = ((x_1, y_1), \ldots, (x_m, y_m))$, with $y_i \in \{-1, +1\}^k$
• Consider base classifiers $h_t : X \times Y \to \{-1, +1\}$ and the ensemble $f(x, l) = \sum_{t=1}^{T} \alpha_t h_t(x, l)$, where $l \in [k]$
• Empirical loss:

$$F(\boldsymbol{\alpha}) = \frac{1}{m}\sum_{i=1}^{m}\sum_{l=1}^{k} e^{-y_i[l]\, f(x_i, l)} \qquad (9.13)$$

• $F$ is convex and differentiable
AdaBoost.MH exactly coincides with AdaBoost
• AdaBoost.MH exactly coincides with AdaBoost applied to the training sample derived from $S$ by splitting each labeled point $(x_i, y_i)$ into $k$ labeled examples $((x_i, l), y_i[l])$, with each example $(x_i, l)$ in $X \times Y$ and its label $y_i[l]$ in $\{-1, +1\}$; $y_i[l]$ denotes the $l$-th coordinate of $y_i$:

$$(x_i, y_i) \;\longrightarrow\; \big((x_i, 1), y_i[1]\big), \ldots, \big((x_i, k), y_i[k]\big), \qquad i \in [m]$$

• Let $S'$ denote the resulting sample (with $x_i \in \mathbb{R}^n$, each derived example $(x_i, l)$ can be encoded as a point in $\mathbb{R}^{n+1}$); then

$$S' = \Big(\big((x_1, 1), y_1[1]\big), \big((x_1, 2), y_1[2]\big), \ldots, \big((x_1, k), y_1[k]\big), \ldots, \big((x_m, 1), y_m[1]\big), \ldots, \big((x_m, k), y_m[k]\big)\Big) \quad (mk \text{ examples})$$
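A minimal sketch of this sample derivation, assuming the inputs are given as a NumPy matrix and the labels as an m × k matrix with entries in {−1, +1}:

```python
import numpy as np

def derive_binary_sample(X, Y):
    """Split a multi-label sample into the binary sample S' used by AdaBoost.MH.

    X: (m, n) inputs, Y: (m, k) label matrix with entries in {-1, +1}.
    Returns X_prime of shape (m*k, n+1) and y_prime of shape (m*k,).
    """
    m, n = X.shape
    k = Y.shape[1]
    X_rep = np.repeat(X, k, axis=0)                      # each x_i repeated k times
    labels = np.tile(np.arange(1, k + 1), m)[:, None]    # class index l as the (n+1)-th feature
    X_prime = np.hstack([X_rep, labels])                 # (x_i, l) encoded in R^{n+1}
    y_prime = Y.reshape(-1)                              # binary label y_i[l]
    return X_prime, y_prime
```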
Example
• The class index $l$ is appended to $x_i$ as the $(n+1)$-th feature, so each derived example $((x_i, l), y_i[l])$ is a point in $\mathbb{R}^{n+1}$ with a binary label.

AdaBoost.MH algorithm
[Pseudocode: AdaBoost run on the derived sample $S'$, with $(x, l) \in \mathbb{R}^{n+1}$; at each round $t$ a base classifier $h_t : X \times Y \to \{-1, +1\}$ and its mixture weight $\alpha_t$ are computed, and the output is $f = \sum_{t=1}^{T} \alpha_t h_t$]
AdaBoost.MH exactly coincides with AdaBoost
• $S'$ contains $mk$ examples, and the expression of the objective function in (9.13) coincides exactly with that of the objective function of AdaBoost for the sample $S'$
• The theoretical analysis, along with the other observations presented for AdaBoost so far, also applies here
• Now we will focus on aspects related to computational efficiency and to the weak learning condition that are specific to the multi-class scenario

Complexity of the AdaBoost.MH algorithm
• The complexity of the algorithm is that of AdaBoost applied to a sample of size $mk$.
• For $X \subseteq \mathbb{R}^n$, using boosting stumps as base classifiers, the per-round cost therefore grows with the derived sample size $mk$ and the dimension $n+1$.
• Thus, for a large number of classes $k$, the algorithm may become impractical using a single processor.
• Weak learning condition: at each round there exists a base classifier $h_t : X \times Y \to \{-1, +1\}$ such that $\epsilon_t < \frac{1}{2}$.
• This may be hard to achieve if some classes are difficult to distinguish.

Boosting Algorithms
• Boosting algorithms can differ in how they create and aggregate weak learners during the sequential process
• Three popular types of boosting methods include:
1. Adaptive boosting or AdaBoost: Yoav Freund and Robert Schapire are credited with the creation of the AdaBoost algorithm. This method operates iteratively, identifying misclassified data points and adjusting their weights to minimize the training error. The model continues to optimize in a sequential fashion until it yields the strongest predictor.

Boosting Algorithms
2. Gradient boosting: It works by sequentially adding predictors to an ensemble, with each one correcting the errors of its predecessor. However, instead of changing the weights of data points like AdaBoost, gradient boosting trains each new predictor on the residual errors of the previous one. The name gradient boosting is used because it combines the gradient descent algorithm with the boosting method (a minimal sketch of this idea is given below).

Extreme gradient boosting, or XGBoost: XGBoost is an implementation of gradient boosting that is designed for computational speed and scale. XGBoost leverages multiple cores on the CPU, allowing learning to occur in parallel during training.
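A minimal sketch of the "fit the residuals of the predecessor" idea for squared-loss regression; `fit_regressor(X, r)` is a hypothetical callback returning a weak regressor with a `.predict` method, and the shrinkage value is an illustrative choice.

```python
import numpy as np

def gradient_boost(X, y, fit_regressor, T=100, lr=0.1):
    """Gradient boosting sketch for squared loss: each new learner fits the residuals."""
    base = y.mean()                          # start from the constant predictor
    pred = np.full(len(y), base)
    learners = []
    for _ in range(T):
        residuals = y - pred                 # negative gradient of the squared loss
        g = fit_regressor(X, residuals)      # weak regressor fit to the residuals
        pred += lr * g.predict(X)            # shrunken additive update
        learners.append(g)

    def predict(X_new):
        out = np.full(len(X_new), base)
        for g in learners:
            out += lr * g.predict(X_new)
        return out
    return predict
```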

AdaBoost offers several advantages
• It is simple, its implementation is straightforward, and the time complexity of each round of boosting as a function of the sample size is rather favorable
• When using decision stumps, the cost of each round of boosting grows only with the sample size and the number of features. Of course, if the dimension of the feature space is very large, then the algorithm could in fact become quite slow.
• When using decision stumps, the algorithm selects only the features that increase its predictive power during training, which can help to reduce dimensionality (feature selection)
• It benefits from a rich theoretical analysis
Drawbacks of AdaBoost
• The parameter $T$ (the stopping criterion) and the base classifier set must be selected, and this choice is crucial to the performance of the algorithm.
• The VC-dimension analysis shows that larger values of $T$ can lead to overfitting. In practice, $T$ is typically determined via a validation set.
• The complexity $d$ of the family of base classifiers $H$ appears in the VC-dimension bound. It is important to control $d$ in order to guarantee generalization: a higher $d$ may lead to overfitting.

Drawbacks of AdaBoost
• A serious disadvantage of AdaBoost is its performance in the presence of noise. The distribution weight assigned to examples that are harder to classify substantially increases with the number of rounds. Noisy samples may end up dominating the $\mathcal{D}_t$'s.
• Sequential training in boosting is hard to scale up. Since each $h_t$ is built on its predecessors, boosting models can be computationally expensive
• XGBoost has been introduced to address scalability issues seen in other types of boosting methods

How to benefit from the noise drawback
• The behavior of AdaBoost in the presence of noise can in fact be used as a useful feature for detecting outliers, that is, examples that are incorrectly labeled or that are hard to classify.
• Examples with large weights after a certain number of rounds of boosting can be identified as outliers.
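A hedged sketch of this heuristic: after running the boosting loop while keeping the final (or a late-round) distribution, flag the examples carrying the largest weights. The cutoff fraction is an illustrative choice.

```python
import numpy as np

def flag_outliers(D, top_fraction=0.05):
    """Return indices of the examples with the largest boosting weights.

    D: final (or late-round) AdaBoost distribution over the m training examples.
    top_fraction: illustrative cutoff; examples above it are treated as suspect.
    """
    k = max(1, int(top_fraction * len(D)))
    return np.argsort(D)[-k:]              # indices of the k largest weights
```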

Applications of boosting
• Boosting algorithms are well suited for artificial intelligence projects
across a broad range of industries, including:
• Healthcare: Boosting is used to lower errors in medical data
predictions, such as predicting cardiovascular risk factors and cancer
patient survival rates
• Finance: Boosting is used with deep learning models to automate
critical tasks, including fraud detection, pricing analysis, and more
•…

Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree

5- Appendix: Decision Tree
Decision Tree
• Decision trees can be used as weak learners with boosting to define
effective learning algorithms
• Decision trees are typically fast to train and evaluate and relatively
easy to interpret

[Figure: a binary decision tree with root question $x_1 > a_1$, internal questions $x_1 > a_2$, $x_2 > a_3$, $x_2 > a_4$, and leaves 1-5]

Decision Tree
• Figure 9.2 shows a simple example in the case of a space based on features $x_1$ and $x_2$, as well as the partition it represents
• A leaf defines a region of $X$ formed by the set of sample points corresponding to the same traversal of the tree
• A label is assigned to each leaf: the class with majority representation among the training points falling in a leaf region defines the label of that leaf.

[Figure 9.2: the decision tree of the previous slide and the corresponding partition of the $(x_1, x_2)$ space; the majority label of the training examples in a region (e.g. the leaf 1 region) is the label of that leaf]
Binary decision tree
• Definition 9.5 - A binary decision tree is a representation of a partition of the feature space
• As in Figure 9.2, each interior node of a decision tree corresponds to a question related to a feature (attribute)
• It can be:
• a numerical question of the form $x_j \le a$ for a feature variable $x_j$, $j \in [N]$, and some threshold $a \in \mathbb{R}$, as in the example of Figure 9.2, or
• a categorical question such as $x_j \in \{\text{blue, red, green}\}$, when feature $x_j$ takes a categorical value such as a color

More complex node questions
• More complex node questions result in partitions based on more complex decision surfaces
• Example: binary space partition (BSP) trees partition the space into convex polyhedral regions, based on questions of the form $w \cdot x \le a$

Prediction/partitioning
• To predict the label of any point $x$, we start at the root node of the decision tree and go down the tree until a leaf is found, by moving to the right child of a node when the response to the node question is positive, and to the left child otherwise. When we reach a leaf, we associate $x$ with the label of this leaf
• A leaf defines a region of $X$ formed by the set of points corresponding to the same traversal of the tree
• By definition, no two regions intersect and all points belong to exactly one region

Learning
• The label of a leaf is determined using the training sample: the class with majority representation among the labeled training examples falling in a leaf region defines the label of that leaf, with ties broken arbitrarily
• There are different training algorithms; we mention two here
• Greedy: this is motivated by the fact that the general problem of finding a decision tree with the smallest error is NP-hard
• Grow-then-prune: first a very large tree is grown until it fully fits the training sample. Then, the resulting tree is pruned back to minimize an objective function defined (based on generalization bounds) as the sum of an empirical error and a complexity term

1- Greedy algorithm

GreedyDecisionTrees(S):
1. tree ← root node
2. for t ← 1 to T do
3.   SPLIT(tree, n_t, q_t)
4. return tree

[Figure: the tree grown greedily, with the node split at each round $t = 1, 2, \ldots, T = 9$ labeled by its question $q_1, \ldots, q_9$]

• The SPLIT procedure splits node $n_t$ by making it an internal node with question $q_t$ and two leaf children, each labeled with the dominating class of the region it defines, with ties broken arbitrarily. The root node is initially a leaf whose label is the class that has majority over the entire sample. The pair $(n_t, q_t)$ selected at round $t$ is defined on the next slide.

Node impurity (error in a node)
• The pair $(n, q)$ is chosen so that the node impurity is maximally decreased, according to some measure of impurity $F(n)$
• The decrease in impurity (information gain) of the split $(n, q)$ is given by:

$$\tilde{F}(n, q) = F(n) - \big[\eta(n, q)\, F(n_-(n, q)) + (1 - \eta(n, q))\, F(n_+(n, q))\big]$$

• $\eta(n, q)$ is the fraction of the examples in the region defined by $n$ that are moved to $n_-(n, q)$; $m_n$ is the number of examples at $n$; $p_l(n)$ denotes the fraction of examples at $n$ that belong to class $l$
• $m_{n_-(n,q)} = \eta(n, q) \times m_n$ and $m_{n_+(n,q)} = (1 - \eta(n, q)) \times m_n$

Node impurity definitions
For the mono-label multiclass case ($k$ labels), the impurity of a node $n$ can be defined in 3 ways:

$$F(n) = \begin{cases} -\sum_{l=1}^{k} p_l(n)\, \log_2 p_l(n) & \text{Entropy} \\[4pt] \sum_{l=1}^{k} p_l(n)\,\big(1 - p_l(n)\big) & \text{Gini index} \\[4pt] 1 - \max_{l \in [k]} p_l(n) & \text{Misclassification} \end{cases}$$

• For any node $n$ and class $l \in [k]$, $p_l(n)$ denotes the fraction of points at $n$ that belong to class $l$.
• All three functions are concave, which ensures that the impurity decrease $\tilde{F}(n, q)$ is non-negative.

Figure 9.4: binary case ($k = 2$); the three node impurity definitions plotted as a function of the fraction of positive examples at $n$.
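A minimal computational sketch of the three impurity measures and of the resulting impurity decrease (information gain) for a candidate split; class labels are assumed to be given as an integer array, and the split is described by a boolean mask sending each point to one of the two children.

```python
import numpy as np

def impurity(labels, kind="gini"):
    """Entropy, Gini index, or misclassification impurity of a set of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                        # p_l(n): class fractions at the node
    if kind == "entropy":
        return -np.sum(p * np.log2(p))
    if kind == "gini":
        return np.sum(p * (1 - p))
    return 1 - p.max()                               # misclassification impurity

def impurity_decrease(labels, mask, kind="gini"):
    """Information gain F~(n, q) = F(n) - [eta F(n_minus) + (1 - eta) F(n_plus)].

    mask[i] is True if example i is sent to the child n_minus by question q.
    """
    eta = mask.mean()                                # fraction of points sent to n_minus
    return (impurity(labels, kind)
            - eta * impurity(labels[mask], kind)
            - (1 - eta) * impurity(labels[~mask], kind))
```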

2- Grow-then-prune
• First a very large tree is grown until it fully fits the training sample, or until no more than a very small number of points are left at each leaf
• Then the resulting tree is pruned back to minimize an objective function defined (based on generalization bounds) as the sum of an empirical error and a complexity term. The complexity can be expressed in terms of the size of $\widetilde{\mathrm{tree}}$, the set of leaves of the tree. The resulting objective is

$$G_\lambda(\mathrm{tree}) = \underbrace{\sum_{n \in \widetilde{\mathrm{tree}}} |n|\, F(n)}_{\text{empirical error (impurity)}} \; + \; \underbrace{\lambda\, |\widetilde{\mathrm{tree}}|}_{\text{complexity (number of leaves)}} \qquad (9.15)$$

• where $|n|$ is the number of sample points at leaf $n$ and $\lambda \ge 0$ is a regularization parameter determining the trade-off between misclassification, or more generally impurity, and tree complexity

