3. Pattern Recognition (Pattern Classification) - AdaBoost
AdaBoost (Adaptive Boosting)
Hypothesis set and Algorithm
Second Edition
Contents
1. Boosting
2. AdaBoost
3. AdaBoost and margin maximization
4. Multiclass boosting algorithms
5. Appendix: Decision Tree
• PAC learning:

$$\mathbb{P}_{S\sim\mathcal{D}^m}\big[R(h_S)\le \epsilon\big]\ \ge\ 1-\delta \qquad (2.4)$$

where $1-\delta$ is the confidence and $\epsilon$ the error ($1-\epsilon$: accuracy).
• When such an algorithm exists, it is called a PAC-learning algorithm for the concept class.
• (2.4): the concept class is PAC-learnable if $h_S$ is approximately correct (error at most $\epsilon$) with high probability (at least $1-\delta$).
• Weak learning:

$$\mathbb{P}_{S\sim\mathcal{D}^m}\Big[R(h_S)\le \tfrac{1}{2}-\gamma\Big]\ \ge\ 1-\delta \qquad (7.1)$$
• Distribution update (a short code sketch follows the reference below):

$$\mathcal{D}_{t+1}(i) \leftarrow \frac{\mathcal{D}_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$

• where $\alpha_t$ is the mixture weight of $h_t$ and $Z_t$ is the normalization factor.
• A simple mathematical derivation of the algorithm: Rojas, R. (2009). AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting. Freie Universität Berlin, Tech. Rep.
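The update rule above, together with the standard choice $\alpha_t=\frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, can be made concrete with a short sketch. The following is a minimal NumPy implementation of the AdaBoost loop with decision stumps as base learners; all function and variable names are illustrative, not the lecture's notation.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Decision stump: +1 on one side of the threshold, -1 on the other."""
    return polarity * np.where(X[:, feature] > threshold, 1.0, -1.0)

def fit_stump(X, y, D):
    """Return the stump (feature, threshold, polarity) with the smallest weighted error under D."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (+1.0, -1.0):
                pred = stump_predict(X, feature, threshold, polarity)
                err = np.sum(D[pred != y])            # weighted error eps_t
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err

def adaboost(X, y, T):
    """Run T rounds of boosting; return the stumps and their mixture weights alpha_t."""
    m = len(y)
    D = np.full(m, 1.0 / m)                           # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        stump, eps = fit_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)          # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t = (1/2) log((1 - eps_t) / eps_t)
        pred = stump_predict(X, *stump)
        D = D * np.exp(-alpha * y * pred)             # D_{t+1}(i) proportional to D_t(i) e^{-alpha_t y_i h_t(x_i)}
        D /= D.sum()                                  # dividing by Z_t (the normalization factor)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(X, stumps, alphas):
    """Final hypothesis: sign of the non-negative linear combination of base classifiers."""
    scores = sum(a * stump_predict(X, *s) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```

Labels are assumed to be in $\{-1,+1\}$; the normalization by `D.sum()` plays the role of $Z_t$.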
• $f=\sum_{t=1}^{T}\alpha_t h_t$: a linear mixture (combination) of the base classifiers.
• After $T$ rounds of boosting, the hypothesis (classifier) returned by AdaBoost is based on the sign of the function $f$, which is a non-negative linear combination of the base classifiers $h_t$.
• AdaBoost: $\sum_{t=1}^{T}\alpha_t h_t(x)$, a linear mixture of base classifiers; SVM: $w\cdot\Phi(x)$, a linear mixture of features.
[Figure: training error of the ensemble $h=\boldsymbol{\alpha}\cdot\mathbf{h}$ and of the base classifiers $h_1(x)$, $h_2(x)$; values shown include 0.35, 0.10, and 0.05 at rounds $t=5$ and $t=40$]
Example 2 – Ensemble of Stumps
[Figure: the best thresholds (decision boundaries) selected at each boosting round by the stumps $h_1$, $h_2$, $h_3$ and the resulting ensemble $h$]
$$Z_t = 2\big[\epsilon_t(1-\epsilon_t)\big]^{0.5}$$

$$\text{correct: } \mathcal{D}_2(i)=\frac{0.1\,e^{-0.424}}{2(0.3\times 0.7)^{0.5}}=0.071, \qquad \text{incorrect: } \mathcal{D}_2(i)=\frac{0.1\,e^{+0.424}}{2(0.3\times 0.7)^{0.5}}=0.167$$

$$\epsilon_2 = 3\times 0.071 = 0.213, \qquad \alpha_2 = 0.653$$
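A quick numeric check of these values, assuming (as the numbers suggest) ten training points of weight 0.1 each, three of which are misclassified by the first stump, so $\epsilon_1=0.3$:

```python
import numpy as np

eps1 = 0.3                                   # first stump misclassifies 3 of 10 points
alpha1 = 0.5 * np.log((1 - eps1) / eps1)     # about 0.424
Z1 = 2 * np.sqrt(eps1 * (1 - eps1))          # Z_1 = 2 [eps_1 (1 - eps_1)]^0.5, about 0.917

D2_correct = 0.1 * np.exp(-alpha1) / Z1      # about 0.071 (weight of a correctly classified point)
D2_incorrect = 0.1 * np.exp(+alpha1) / Z1    # about 0.167 (weight of a misclassified point)

eps2 = 3 * D2_correct                        # second stump misclassifies 3 of the re-weighted points
alpha2 = 0.5 * np.log((1 - eps2) / eps2)     # about 0.65 (0.653 with the rounded eps_2 = 0.213)

print(alpha1, Z1, D2_correct, D2_incorrect, eps2, alpha2)
```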
$$h(x)=\frac{\sum_{t=1}^{T}\alpha_t h_t(x)}{\sum_{t=1}^{T}\alpha_t}=\frac{\boldsymbol{\alpha}\cdot\mathbf{h}_T(x)}{\|\boldsymbol{\alpha}\|_1}$$

(equal to $\boldsymbol{\alpha}\cdot\mathbf{h}_T(x)$ when the weights are normalized so that $\|\boldsymbol{\alpha}\|_1=1$)
(for comparison, the SVM score is $w\cdot\Phi(x)$)

$$\rho_h(x)=\frac{y\sum_{t=1}^{T}\alpha_t h_t(x)}{\sum_{t=1}^{T}\alpha_t}=\frac{y\,\boldsymbol{\alpha}\cdot\mathbf{h}_T(x)}{\|\boldsymbol{\alpha}\|_1}=y\,h(x)$$

$$\rho_h=\min_{i\in[m]}\rho_h(x_i)=\min_{i\in[m]}\frac{|\boldsymbol{\alpha}\cdot\mathbf{h}_T(x_i)|}{\|\boldsymbol{\alpha}\|_1}$$
• This is a notion of geometric margin for such ensemble functions, which differs from the one introduced for SVMs only in that the $L_1$-norm is used instead of the $L_2$-norm.
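As a small illustration, the $L_1$-geometric margin can be computed directly from the base-classifier predictions; the array layout below is an assumption for the sake of the sketch.

```python
import numpy as np

def l1_margin(base_preds, alpha, y):
    """L1-geometric margin of a boosted ensemble on a sample.

    base_preds: (T, m) array with base_preds[t, i] = h_t(x_i) in {-1, +1}
    alpha:      (T,)  array of non-negative mixture weights
    y:          (m,)  array of labels in {-1, +1}
    """
    scores = alpha @ base_preds                    # alpha . h_T(x_i) for each point i
    margins = y * scores / np.sum(np.abs(alpha))   # rho_h(x_i) = y_i (alpha . h_T(x_i)) / ||alpha||_1
    return margins.min()                           # rho_h = min_i rho_h(x_i)
```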
• Margin maximization:

$$\max_{\boldsymbol{\alpha}}\ \rho \qquad \text{subject to: } y_i\,\big(\boldsymbol{\alpha}\cdot\mathbf{h}_T(x_i)\big)\ \ge\ \rho,\ \ i\in[m], \qquad \|\boldsymbol{\alpha}\|_1=1,\ \alpha_t\ge 0$$

• This is a linear program (LP), that is, a convex optimization problem with a linear objective function and linear constraints. There are several different methods for solving relatively large LPs in practice, using the simplex method, interior-point methods, or a variety of special-purpose solutions.
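As a hedged sketch (not part of the lecture) of how this LP could be solved with a generic solver, the program above can be passed to `scipy.optimize.linprog`; the variable names and problem encoding below are my own.

```python
import numpy as np
from scipy.optimize import linprog

def max_margin_weights(base_preds, y):
    """Maximize rho subject to y_i (alpha . h_T(x_i)) >= rho, sum_t alpha_t = 1, alpha_t >= 0.

    base_preds: (T, m) array with base_preds[t, i] = h_t(x_i); y: (m,) labels in {-1, +1}.
    Decision variables: x = (alpha_1, ..., alpha_T, rho).
    """
    T, m = base_preds.shape
    c = np.zeros(T + 1)
    c[-1] = -1.0                                           # minimize -rho, i.e. maximize rho
    # constraint rows: rho - y_i sum_t alpha_t h_t(x_i) <= 0 for every training point i
    A_ub = np.hstack([-(y * base_preds).T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])  # ||alpha||_1 = 1 (with alpha_t >= 0)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * T + [(None, None)]              # alpha_t >= 0, rho free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:T], res.x[-1]                            # (alpha, rho)
```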
• Empirical margin loss and its bounds:

$$\hat{R}_{S,\rho}(h)\ \le\ \frac{1}{m}\sum_{i=1}^{m} e^{-\left(\frac{y_i h(x_i)}{\rho}-1\right)}, \qquad \text{where } \frac{y_i h(x_i)}{\rho}=\frac{\rho_h(x_i)}{\rho}$$

$$\hat{R}_{S,\rho}(h)\ \le\ 2^T\prod_{t=1}^{T}\sqrt{\epsilon_t^{\,1-\rho}\,(1-\epsilon_t)^{\,1+\rho}}$$

• If $\epsilon_t\le \tfrac{1}{2}-\gamma$ for all $t\in[T]$:

$$\hat{R}_{S,\rho}(h)\ \le\ \big[(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}\big]^{T/2}$$
• Note that $(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}<1$ when $\rho<\gamma$; therefore the empirical margin loss decreases exponentially fast as $T$ increases and becomes zero for sufficiently large $T$.
• However, if $\epsilon_t\rightarrow 0.5$ then, since $0<\gamma\le(0.5-\epsilon_t)$, we have $\gamma\cong 0$ and $\big[(1-2\gamma)^{1-\rho}(1+2\gamma)^{1+\rho}\big]^{T/2}=1$, so the bound becomes vacuous.
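A minimal numeric illustration of the last bound, with illustrative values of the edge $\gamma$ and margin $\rho$: when $\rho<\gamma$ the base is smaller than 1 and the bound decays exponentially in $T$; as $\gamma\to 0$ it degenerates to 1.

```python
import numpy as np

def margin_loss_bound(gamma, rho, T):
    """Evaluate [(1 - 2 gamma)^(1 - rho) * (1 + 2 gamma)^(1 + rho)]^(T / 2)."""
    base = (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
    return base ** (T / 2)

for T in (10, 50, 200):
    print(T, margin_loss_bound(gamma=0.1, rho=0.05, T=T))  # rho < gamma: bound shrinks with T
print(margin_loss_bound(gamma=1e-6, rho=0.05, T=200))      # gamma close to 0: bound stays near 1
```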
$$\mathcal{F}_T=\left\{\operatorname{sgn}\Big(\sum_{t=1}^{T}\alpha_t h_t\Big):\ \alpha_t\ge 0,\ h_t\in H,\ t\in[T]\right\}$$
• The bound suggests that AdaBoost could overfit for large values of $T$, and indeed this can occur.
• However, in many cases it has been observed empirically that the generalization error of AdaBoost decreases as a function of the number of rounds of boosting $T$, as illustrated in Figure 7.5.
Figure 7.5
An empirical result using AdaBoost with C4.5 decision trees as base learners. In this example, the training error goes to zero after about five rounds of boosting, yet the test error continues to decrease for larger values of $T$ (a reduction of bias due to the increasing complexity of the hypothesis set).
$$R(h)\ \le\ \hat{R}_{S,\rho}(h)\ +\ \frac{2}{\rho}\sqrt{\frac{2d\log\frac{em}{d}}{m}}\ +\ \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \qquad (7.15)$$
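To get a feel for the complexity terms of (7.15), they can be evaluated numerically; here $d$ is taken to be the VC-dimension of the base hypothesis set, and the values of $d$, $m$, $\rho$, $\delta$ below are purely illustrative.

```python
import numpy as np

def margin_bound_slack(d, m, rho, delta):
    """Complexity terms of (7.15): (2/rho) sqrt(2 d log(e m / d) / m) + sqrt(log(1/delta) / (2 m))."""
    complexity = (2.0 / rho) * np.sqrt(2 * d * np.log(np.e * m / d) / m)
    confidence = np.sqrt(np.log(1.0 / delta) / (2 * m))
    return complexity + confidence

print(margin_bound_slack(d=3, m=100_000, rho=0.2, delta=0.05))  # a larger margin rho gives a smaller bound
```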
[Figure: a cascade of boosted detectors AdaBoost 1 → AdaBoost 2 → AdaBoost 3; an example of the output from the Viola–Jones face detector]
[Figure: multi-class scores $h(x_i, y_i[1])$, $h(x_i, y_i[2])$, $h(x_i, y_i[3])$ for the classes sport, business, and society; examples represented in $\mathbb{R}^{n+1}$]
Example
[Figure: an example involving the $(n+1)^{\text{th}}$ feature and a weighted combination $\sum_{j=t}^{T}\alpha_j h_j$ of base hypotheses]
AdaBoost.MH exactly coincides with AdaBoost
• The derived sample contains $mk$ examples, and the expression of the objective function in (9.13) coincides exactly with that of the objective function of AdaBoost for that sample.
• The theoretical analysis, along with the other observations presented for AdaBoost so far, also applies here.
• Now we will focus on aspects related to computational efficiency and to the weak learning condition that are specific to the multi-class scenario.
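The reduction behind AdaBoost.MH can be sketched as follows: each multi-class example $(x_i, y_i)$ with $k$ possible classes is mapped to $k$ binary examples $((x_i, l), y_i[l])$, where $y_i[l]=+1$ if $l$ is the correct class and $-1$ otherwise, and plain AdaBoost is run on the derived sample. The encoding and names below are assumptions for the sketch, not the lecture's notation.

```python
def to_binary_sample(X, y, k):
    """AdaBoost.MH-style reduction: one multi-class example (x, y) becomes k binary
    examples ((x, l), y[l]) with y[l] = +1 if l == y else -1."""
    derived = []
    for x, label in zip(X, y):
        for l in range(k):
            derived.append(((x, l), +1 if l == label else -1))
    return derived  # m * k binary examples; run binary AdaBoost on this sample

# Example: 2 examples, 3 classes (e.g. sport, business, society encoded as 0, 1, 2)
sample = to_binary_sample(X=[[1.0, 2.0], [0.5, -1.0]], y=[0, 2], k=3)
print(len(sample))  # 6 derived binary examples
```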
Decision Tree
[Figure: a binary decision tree over features $x_1$, $x_2$ with internal-node questions $x_1>a_1$, $x_1>a_2$, $x_2>a_3$, $x_2>a_4$ and leaves leaf1–leaf5, together with the corresponding partition of the $(x_1,x_2)$ feature space]
• The majority label of the training examples in a region is the label of the corresponding leaf (e.g., leaf1).
Binary decision tree
• Definition 9.5 – A binary decision tree is a representation of a partition of the feature space.
• As in Figure 9.2, each interior node of a decision tree corresponds to a question related to a feature (attribute).
• It can be:
• a numerical question of the form $x_j > a$ for a feature variable $x_j$ and some threshold $a$, as in the example of Figure 9.2, or
• a categorical question such as $x_j\in\{\text{blue},\text{white},\text{red}\}$, when the feature $x_j$ takes a categorical value such as a color.
2. for $t \leftarrow 1$ to $T$ do
3.   SPLIT(tree, $n$, $q$)
4. return tree
[Figure: the tree grown after $t=T=9$ rounds of splitting, with question nodes $q_4,\dots,q_9$ (e.g., $x_2>a_4$) and leaves leaf1–leaf5]
• The SPLIT procedure splits node $n$ by making it an internal node with question $q$ and two leaf children, each labeled with the dominating class of the region it defines, with ties broken arbitrarily. The root node starts as a leaf whose label is the class that has the majority over the entire sample.
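A minimal sketch of the SPLIT step described above, assuming a simple dict-based tree representation, threshold questions of the form $x_j > a$, and majority-vote leaf labels (all names are illustrative):

```python
from collections import Counter

def majority_label(labels):
    """Dominating class of a region, ties broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0] if labels else None

def split(node, question, data):
    """Turn a leaf node into an internal node with the given question (feature, threshold)
    and two leaf children, each labeled with the majority class of the region it defines."""
    feature, threshold = question
    left = [(x, y) for x, y in data if x[feature] <= threshold]
    right = [(x, y) for x, y in data if x[feature] > threshold]
    node["question"] = question
    node["left"] = {"label": majority_label([y for _, y in left]), "data": left}
    node["right"] = {"label": majority_label([y for _, y in right]), "data": right}

# The root starts as a single leaf labeled with the overall majority class.
data = [([0.2, 1.0], "A"), ([0.8, 0.3], "B"), ([0.9, 0.9], "B")]
tree = {"label": majority_label([y for _, y in data]), "data": data}
split(tree, question=(0, 0.5), data=data)   # split on x_1 > 0.5
```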
• For any node $n$ and class $l\in[k]$, let $p_l(n)$ denote the fraction of points at $n$ that belong to class $l$. The node impurity $F(n)$ can then be measured in several ways, for example:

$$F(n)=\begin{cases} 1-\max_{l\in[k]} p_l(n) & \text{misclassification error}\\[4pt] -\sum_{l=1}^{k} p_l(n)\log_2\big(p_l(n)\big) & \text{entropy}\\[4pt] \sum_{l=1}^{k} p_l(n)\big(1-p_l(n)\big) & \text{Gini index} \end{cases}$$

• Figure 9.4: binary case; the three node impurity definitions plotted as a function of the fraction of positive examples at $n$. All three functions are concave, which ensures that the weighted impurity of the two regions created by a split is never larger than the impurity of the node being split.
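A small sketch of the three impurity measures as functions of the class fractions $p_l(n)$, matching the binary-case comparison in Figure 9.4 (function names are mine):

```python
import numpy as np

def misclassification(p):
    """F(n) = 1 - max_l p_l(n)."""
    return 1.0 - float(np.max(p))

def entropy(p):
    """F(n) = - sum_l p_l(n) log2 p_l(n), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def gini(p):
    """F(n) = sum_l p_l(n) (1 - p_l(n))."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

# Binary case: impurity as a function of the fraction q of positive examples at the node.
for q in (0.0, 0.1, 0.3, 0.5):
    p = np.array([q, 1.0 - q])
    print(q, misclassification(p), entropy(p), gini(p))
```

All three measures are largest at $q=0.5$ and vanish at $q\in\{0,1\}$, which is consistent with the concavity remark above.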