Foundations of Machine Learning: Courant Institute and Google Research
Boosting
Weak Learning
(Kearns and Valiant, 1994)
Definition: concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that for all $c \in C$, all $\delta > 0$, and all distributions $D$,

$$
\Pr_{S \sim D^m}\Big[ R(h_S) \le \frac{1}{2} - \gamma \Big] \ge 1 - \delta,
$$

for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.
Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard: this is the weak learner. Main ideas:
- use the weak learner to create a strong learner. How? Combine the base classifiers returned by the weak learner (ensemble method).
- But how should the base classifiers be combined?
AdaBoost - Preliminaries
Family of base classifiers: $H \subseteq \{-1, +1\}^X$.

Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, it maintains
- a distribution $D_t$ over the training sample;
- a convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$.

Weighted error of hypothesis $h$ at round $t$:

$$
\Pr_{i \sim D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.
$$
AdaBoost
$H \subseteq \{-1, +1\}^X$.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1   for i ← 1 to m do
 2       D_1(i) ← 1/m
 3   for t ← 1 to T do
 4       h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
 5       α_t ← (1/2) log((1 − ε_t)/ε_t)
 6       Z_t ← 2 [ε_t (1 − ε_t)]^{1/2}        ▷ normalization factor
 7       for i ← 1 to m do
 8           D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9   f ← Σ_{t=1}^T α_t h_t
10   return h = sgn(f)
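A minimal NumPy sketch of this loop, for illustration only; the weak_learner(X, y, D) callback and its interface (returning a callable base classifier with predictions in {-1, +1}) are assumptions of the sketch, not part of the lecture:

import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost sketch: labels y in {-1, +1}; weak_learner(X, y, D) returns a callable
    base classifier h with h(X) in {-1, +1}^m and small D-weighted error."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)               # D_1: uniform distribution over the sample
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)         # base classifier for round t
        pred = h(X)
        eps = np.sum(D * (pred != y))     # weighted error epsilon_t
        if eps == 0.0 or eps >= 0.5:      # perfect or no-edge base classifier: stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                      # dividing by the sum plays the role of Z_t
        hs.append(h)
        alphas.append(alpha)
    f = lambda Xnew: sum(a * h(Xnew) for a, h in zip(alphas, hs))
    return lambda Xnew: np.sign(f(Xnew)), f

A stump learner such as the one sketched later in these notes can be plugged in as weak_learner.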
Notes
Distributions $D_t$ over the training sample:
- originally uniform;
- at each round, the weight of a misclassified example is increased.

Observation: $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since

$$
D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}
= \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t}
= \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.
$$
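A quick self-contained numerical check of this identity on synthetic data (illustrative only; the random +/-1 predictions simply stand in for base classifier outputs):

import numpy as np

rng = np.random.default_rng(0)
m, T = 8, 5
y = rng.choice([-1, 1], size=m)
H = rng.choice([-1, 1], size=(T, m))          # H[t, i] plays the role of h_t(x_i)

D = np.full(m, 1.0 / m)
alphas, Zs = [], []
for t in range(T):
    eps = np.sum(D * (H[t] != y))
    eps = np.clip(eps, 1e-12, 1 - 1e-12)      # guard against degenerate errors in this toy run
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = D * np.exp(-alpha * y * H[t])
    Z = w.sum()                               # normalization factor Z_t
    D = w / Z                                 # D_{t+1}
    alphas.append(alpha)
    Zs.append(Z)

f = np.dot(alphas, H)                         # f(x_i) = sum_t alpha_t h_t(x_i)
print(np.allclose(D, np.exp(-y * f) / (m * np.prod(Zs))))   # True: the identity above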
Illustration
[Figure: toy example showing the distribution over the sample and the base classifiers at rounds t = 1, t = 2, t = 3, ..., and their weighted combination.]
Bound on Empirical Error
(Freund and Schapire, 1997)

Theorem: the empirical error of the classifier returned by AdaBoost verifies

$$
\widehat{R}(h) \le \exp\Big( -2 \sum_{t=1}^{T} \Big( \frac{1}{2} - \epsilon_t \Big)^2 \Big).
$$

If, further, for all $t \in [1, T]$, $\gamma \le \big(\tfrac{1}{2} - \epsilon_t\big)$, then $\widehat{R}(h) \le \exp(-2 \gamma^2 T)$.

Proof: since, as shown in the previous notes, $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,

$$
\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0}
\le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)}
= \frac{1}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= \prod_{t=1}^{T} Z_t.
$$

Now, since $Z_t$ is a normalization factor,

$$
Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
= \sum_{i:\, y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t}
= (1 - \epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t}
= (1 - \epsilon_t) \sqrt{\tfrac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\tfrac{1 - \epsilon_t}{\epsilon_t}}
= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}.
$$

Thus,

$$
\prod_{t=1}^{T} Z_t
= \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
= \prod_{t=1}^{T} \sqrt{1 - 4 \big( \tfrac{1}{2} - \epsilon_t \big)^2}
\le \prod_{t=1}^{T} \exp\Big( -2 \big( \tfrac{1}{2} - \epsilon_t \big)^2 \Big)
= \exp\Big( -2 \sum_{t=1}^{T} \big( \tfrac{1}{2} - \epsilon_t \big)^2 \Big).
$$

Notes: the choice of $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t) e^{-\alpha} + \epsilon_t e^{\alpha}$, hence of $Z_t$; with that choice, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances. For base classifiers $x \mapsto h_t(x) \in [-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.
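For intuition, a small numerical check (mine, not from the lecture) of the last chain for a constant edge, i.e. epsilon_t = 1/2 - gamma at every round:

import numpy as np

T, gamma = 100, 0.1
eps = np.full(T, 0.5 - gamma)

prod_Z = np.prod(2 * np.sqrt(eps * (1 - eps)))        # product of normalization factors
bound = np.exp(-2 * np.sum((0.5 - eps) ** 2))         # exp(-2 gamma^2 T)

print(prod_Z <= bound)      # True
print(prod_Z, bound)        # ~0.130 vs ~0.135 for these values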
AdaBoost = Coordinate Descent

Objective function:

$$
F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}.
$$

[Figure: the exponential loss $x \mapsto e^{-x}$ upper-bounds the 0-1 loss.]
Direction: the coordinate (base classifier) selected at round $t$ is the one minimizing the directional derivative,

$$
e_t = \operatorname*{argmin}_{k} \left. \frac{dF(\alpha_{t-1} + \eta\, e_k)}{d\eta} \right|_{\eta = 0}.
$$

Since $F(\alpha_{t-1} + \eta\, e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta\, y_i h_t(x_i)}$,

$$
\left. \frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} \right|_{\eta = 0}
= -\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}
= -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big]
= -\big[ (1 - \epsilon_t) - \epsilon_t \big]\, m \prod_{s=1}^{t-1} Z_s
= \big[ 2\epsilon_t - 1 \big]\, m \prod_{s=1}^{t-1} Z_s.
$$

Thus, the selected coordinate corresponds to the base classifier $h_t$ with the smallest weighted error $\epsilon_t$.

Step size: obtained by setting the derivative to zero,

$$
\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} = 0
\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta\, y_i h_t(x_i)} = 0
\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i) \Big[ m \prod_{s=1}^{t-1} Z_s \Big] e^{-\eta\, y_i h_t(x_i)} = 0
$$
$$
\iff -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta\, y_i h_t(x_i)} = 0
\iff -\big[ (1 - \epsilon_t) e^{-\eta} - \epsilon_t e^{\eta} \big] = 0
\iff \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.
$$

Thus, the step size coincides with the AdaBoost base classifier weight $\alpha_t$.
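As a sanity check (my own sketch), the closed-form step can be compared to a numerical line search on eta -> (1 - eps) e^{-eta} + eps e^{eta}, which equals F(alpha_{t-1} + eta e_t) up to the positive factor m * prod_s Z_s:

import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                       # weighted error of the selected base classifier
restricted_F = lambda eta: (1 - eps) * np.exp(-eta) + eps * np.exp(eta)

numerical = minimize_scalar(restricted_F).x     # numerical minimizer of the restricted objective
closed_form = 0.5 * np.log((1 - eps) / eps)     # AdaBoost step alpha_t

print(numerical, closed_form)                   # both ~0.4236
print(np.isclose(numerical, closed_form, atol=1e-6))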
[Figure: other convex surrogate losses, including the square loss $x \mapsto (1 - x)^2\, 1_{x \le 1}$ and the hinge loss $x \mapsto \max(1 - x, 0)$.]
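A short illustrative snippet (mine; the exponential and logistic losses are added for context) checking that each of these surrogates upper-bounds the zero-one loss as a function of the margin value:

import numpy as np

zero_one = lambda x: (x <= 0).astype(float)          # 0-1 loss on the margin value x = y f(x)
square   = lambda x: (1 - x) ** 2 * (x <= 1)         # truncated square loss
hinge    = lambda x: np.maximum(1 - x, 0.0)          # hinge loss
exp_loss = lambda x: np.exp(-x)                      # boosting (exponential) loss
logistic = lambda x: np.log2(1 + np.exp(-x))         # logistic loss

grid = np.linspace(-3, 3, 601)
for name, loss in [("square", square), ("hinge", hinge), ("exp", exp_loss), ("logistic", logistic)]:
    print(name, bool(np.all(loss(grid) >= zero_one(grid))))   # True for each surrogate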
Relationship with logistic regression (Lebanon and Lafferty, 2001): the logistic regression objective is

$$
F(\alpha) = \sum_{i=1}^{m} \log\big( 1 + e^{-2 y_i f(x_i)} \big).
$$

Boosting and logistic regression solve exactly the same optimization problem, except for a normalization constraint required for logistic regression:

$$
\min_{p}\; D(p, 1)
$$
subject to:
$$
p \ge 0, \qquad
\forall j,\;\; \sum_{x} \hat{p}(x) \sum_{y} p(x, y)\big[ \phi_j(x, y) - \mathbb{E}_{\hat{p}}[\phi_j \mid x] \big] = 0, \qquad
\forall x \in X,\;\; \sum_{y \in Y} p(x, y) = 1,
$$
with
$$
D(p, q) = \sum_{x} \hat{p}(x) \sum_{y} \Big[ p(x, y) \log\frac{p(x, y)}{q(x, y)} - p(x, y) + q(x, y) \Big].
$$

(The last, per-$x$ normalization constraint is the one specific to logistic regression.)
Boosting Stumps

- Data in $\mathbb{R}^N$, e.g., $N = 2$: $(\text{height}(x), \text{weight}(x))$.
- Associate a stump to each component.
- Pre-sort each component: $O(N m \log m)$.
- At each round, find the best component and threshold (see the sketch after this list).
- Total complexity: $O((m \log m) N + m N T)$.
- Stumps are not always weak learners: think of the XOR example!
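A minimal sketch of the per-round stump search, assuming pre-sorted column indices are passed in; the name best_stump and the tie handling are mine, and degenerate thresholds (repeated feature values) are handled only loosely:

import numpy as np

def best_stump(X, y, D, sort_idx):
    """Return the stump x -> s * sign(x_j - theta) minimizing the D-weighted error,
    scanning each pre-sorted column once (D is assumed to sum to 1)."""
    m, N = X.shape
    best = (0, -np.inf, 1, np.inf)                  # (feature j, threshold theta, sign s, error)
    for j in range(N):
        order = sort_idx[:, j]                      # indices sorting column j, precomputed once
        xs, ys, ds = X[order, j], y[order], D[order]
        err = np.sum(ds * (ys != 1))                # error of "predict +1 everywhere" (theta = -inf)
        for k in range(m):
            # moving the threshold just past xs[k] flips that point's prediction to -1
            err += ds[k] if ys[k] == 1 else -ds[k]
            for s, e in ((1, err), (-1, 1.0 - err)):    # the flipped stump errs on the complement
                if e < best[3]:
                    best = (j, xs[k], s, e)
    j, theta, s, _ = best
    return lambda Xnew: s * np.where(Xnew[:, j] > theta, 1, -1)

Pre-sorting (sort_idx = np.argsort(X, axis=0)) is done once up front, which is the O(Nm log m) term; each call to best_stump then costs O(mN), matching the per-round cost above.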
Overfitting?

Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define

$$
\mathcal{F}_T = \Big\{ \mathrm{sgn}\Big( \sum_{t=1}^{T} \alpha_t h_t + b \Big) : \alpha_t, b \in \mathbb{R},\; h_t \in H \Big\}.
$$

The VC dimension of $\mathcal{F}_T$ grows with $T$ (roughly as $O(dT \log T)$). This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!
Empirical Observations

Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error often keeps decreasing even after the training error reaches zero.

[Figure: error curves and margin distribution graph for the letter dataset, boosting C4.5 decision trees, as reported by Schapire et al. (1998). Left: training error (lower curve) and test error (upper curve) of the combined classifier as a function of the number of rounds (10, 100, 1000). Right: cumulative margin distributions.]
page 21
L1-Margin Definitions

Definition: the margin of a point $x$ with label $y$ is the algebraic distance of $\mathbf{x} = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot \mathbf{x} = 0$:

$$
\rho(x) = \frac{y f(x)}{\|\alpha\|_1}
= \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1}
= y\, \frac{\alpha \cdot \mathbf{x}}{\|\alpha\|_1}.
$$

Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:

$$
\rho = \min_{i \in [1, m]} \frac{y_i\, \alpha \cdot \mathbf{x}_i}{\|\alpha\|_1},
\qquad \mathbf{x}_i = [h_1(x_i), \ldots, h_T(x_i)]^\top.
$$

Distances: the distance of a point $x$ to the hyperplane $w \cdot x + b = 0$ measured in the norm $\|\cdot\|_q$ is

$$
\frac{|w \cdot x + b|}{\|w\|_p}, \qquad \text{with } \frac{1}{p} + \frac{1}{q} = 1.
$$
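A tiny illustrative computation of these quantities on synthetic base classifier outputs (nothing here comes from the slides):

import numpy as np

rng = np.random.default_rng(1)
m, T = 6, 4
y = rng.choice([-1, 1], size=m)
Hx = rng.choice([-1, 1], size=(m, T))    # Hx[i, t] = h_t(x_i)
alpha = np.array([0.9, 0.4, 0.2, 0.1])   # nonnegative base classifier weights

point_margins = y * (Hx @ alpha) / np.abs(alpha).sum()   # rho(x_i) = y_i f(x_i) / ||alpha||_1
sample_margin = point_margins.min()                      # rho = min_i rho(x_i)
print(point_margins, sample_margin)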
Rademacher Complexity of Convex Hulls

Theorem: let $H$ be a set of real-valued functions. Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$, where

$$
\mathrm{conv}(H) = \Big\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\; \mu_k \ge 0,\; \sum_{k=1}^{p} \mu_k \le 1,\; h_k \in H \Big\}.
$$

Proof:

$$
\widehat{\mathfrak{R}}_S(\mathrm{conv}(H))
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Big]
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H}\; \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1} \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]
$$
$$
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[ \sup_{h_k \in H}\; \max_{k \in [1, p]} \sum_{i=1}^{m} \sigma_i h_k(x_i) \Big]
= \frac{1}{m}\, \mathbb{E}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big]
= \widehat{\mathfrak{R}}_S(H).
$$
Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in \mathrm{conv}(H)$:

$$
R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},
\qquad
R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
$$
Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:

$$
R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.
$$

Proof: follows directly from the previous corollary and the VC dimension bound on the Rademacher complexity (see lecture 3).
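For intuition only, a throwaway snippet (mine) evaluating the right-hand side of this bound for made-up values of the empirical margin loss, d, m, rho, and delta:

import numpy as np

def margin_bound(emp_margin_loss, d, m, rho, delta):
    """Right-hand side of the VC-based ensemble margin bound above."""
    complexity = (2.0 / rho) * np.sqrt(2 * d * np.log(np.e * m / d) / m)
    confidence = np.sqrt(np.log(1.0 / delta) / (2 * m))
    return emp_margin_loss + complexity + confidence

print(margin_bound(emp_margin_loss=0.05, d=3, m=100000, rho=0.5, delta=0.01))   # ~0.16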
Notes
All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor changes (Koltchinskii and Panchenko, 2002).

For AdaBoost, the bound applies to the functions

$$
x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).
$$
Margin Distribution
Theorem: for any $\rho > 0$, the following holds (the probability is over the empirical distribution defined by the training sample):

$$
\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \le \rho \Big]
\le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.
$$

Proof:

$$
\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) - \rho \|\alpha\|_1 \le 0}
\le \frac{1}{m} \sum_{i=1}^{m} \exp\big( -y_i f(x_i) + \rho \|\alpha\|_1 \big)
= \frac{e^{\rho \|\alpha\|_1}}{m} \sum_{i=1}^{m} \Big[ m \prod_{t=1}^{T} Z_t \Big] D_{T+1}(i)
= e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t.
$$

Since $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$ and $e^{\rho \alpha_t} = \big( \frac{1 - \epsilon_t}{\epsilon_t} \big)^{\rho/2}$,

$$
e^{\rho \|\alpha\|_1} \prod_{t=1}^{T} Z_t
= \prod_{t=1}^{T} e^{\rho \alpha_t} Z_t
= \prod_{t=1}^{T} 2 \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}
= 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\rho} (1 - \epsilon_t)^{1+\rho}}.
$$
Notes
If for all $t \in [1, T]$, $\gamma \le \big( \frac{1}{2} - \epsilon_t \big)$, then the right-hand side can be bounded by

$$
\Pr\Big[ \frac{y f(x)}{\|\alpha\|_1} \le \rho \Big]
\le \Big[ (1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} \Big]^{T/2}.
$$

For $\rho < \gamma$, $(1 - 2\gamma)^{1-\rho} (1 + 2\gamma)^{1+\rho} < 1$ and the bound decreases exponentially in $T$.

For the bound to be convergent, $\rho$ must be at least of order $1/\sqrt{m}$; thus, an edge value $\gamma$ of order at least $1/\sqrt{m}$ is roughly the required condition.
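An illustrative evaluation (mine) of this bound, showing the exponential decrease in T when rho < gamma:

import numpy as np

def margin_dist_bound(rho, gamma, T):
    """[(1 - 2 gamma)^(1 - rho) (1 + 2 gamma)^(1 + rho)]^(T/2)."""
    base = (1 - 2 * gamma) ** (1 - rho) * (1 + 2 * gamma) ** (1 + rho)
    return base ** (T / 2.0)

for T in (100, 500, 1000):
    print(T, margin_dist_bound(rho=0.05, gamma=0.1, T=T))   # roughly 0.36, 5.9e-3, 3.4e-5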
Outliers
AdaBoost assigns larger weights to harder examples.
Application: detecting mislabeled examples.

Dealing with noisy data: regularization based on the average weight assigned to a point (soft margin idea for boosting) (Meir and Rätsch, 2003).
Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is

$$
\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).
$$

Condition (weak learning): there exists $\gamma > 0$ such that, for any distribution $D$ over the training sample and any base classifier $h_t$, $\gamma \le \gamma(t)$.
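A small numerical check (mine) of the identity gamma(t) = 1/2 - epsilon_t on random data:

import numpy as np

rng = np.random.default_rng(2)
m = 10
y = rng.choice([-1, 1], size=m)
h = rng.choice([-1, 1], size=m)          # predictions of one base classifier on the sample
D = rng.random(m)
D /= D.sum()                             # an arbitrary distribution over the sample

eps = np.sum(D * (h != y))               # weighted error
edge = 0.5 * np.sum(y * h * D)           # (1/2) sum_i y_i h_t(x_i) D(i)
print(np.isclose(edge, 0.5 - eps))       # True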
Zero-Sum Games
Definition: a zero-sum game is defined by a payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
- $m$ possible actions (pure strategies) for the row player;
- $n$ possible actions for the column player;
- $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.

Example (rock-paper-scissors):

             | rock | paper | scissors
    rock     |   0  |  -1   |    1
    paper    |   1  |   0   |   -1
    scissors |  -1  |   1   |    0
Mixed Strategies
(von Neumann, 1928)
Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is

$$
\mathbb{E}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.
$$

von Neumann's minimax theorem:

$$
\max_{p} \min_{q} p^\top M q = \min_{q} \max_{p} p^\top M q,
$$

or, in the equivalent form,

$$
\max_{p} \min_{j \in [1, n]} p^\top M e_j = \min_{q} \max_{i \in [1, m]} e_i^\top M q.
$$
AdaBoost and Game Theory

Game:
- Player A: selects a point $x_i$, $i \in [1, m]$.
- Player B: selects a base learner $h_t$, $t \in [1, T]$.
- Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem (assuming a finite $H$): the best possible edge and the best possible L1-margin coincide up to a factor of 2,

$$
2\gamma^* = \min_{D} \max_{h \in H} \sum_{i=1}^{m} D(i)\, y_i h(x_i)
= \max_{\alpha} \min_{i \in [1, m]} \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1}
= \rho^*.
$$
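To make the minimax value concrete, here is a self-contained sketch (mine, using scipy.optimize.linprog) computing an optimal mixed strategy and the game value for the rock-paper-scissors payoff matrix above:

import numpy as np
from scipy.optimize import linprog

M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)    # rock-paper-scissors payoffs for the row player
m, n = M.shape

# Variables z = (p_1, ..., p_m, v): maximize v subject to (p^T M)_j >= v for all j, sum(p) = 1, p >= 0.
c = np.zeros(m + 1)
c[-1] = -1.0                                             # linprog minimizes, so minimize -v
A_ub = np.hstack([-M.T, np.ones((n, 1))])                # v - (p^T M)_j <= 0 for each column j
b_ub = np.zeros(n)
A_eq = np.zeros((1, m + 1))
A_eq[0, :m] = 1.0                                        # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]                # p >= 0, v unconstrained

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:m], res.x[-1]
print(p, v)     # ~[1/3, 1/3, 1/3] and game value 0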
Consequences
- Weak learning condition = non-zero margin.
- Thus, it is possible to search for a non-zero margin. AdaBoost: a (suboptimal) search for the corresponding $\alpha$.
- Weak learning is a strong condition: it implies linear separability of the training sample with margin $2\gamma > 0$.
Note that

$$
\frac{|\alpha \cdot \mathbf{x}|}{\|\alpha\|_1} = d_\infty(\mathbf{x}, \mathcal{H}),
\qquad \text{with } \mathcal{H} = \{\mathbf{x} : \alpha \cdot \mathbf{x} = 0\}.
$$

[Figure: maximum-margin hyperplanes for the norm $\|\cdot\|_2$ and the norm $\|\cdot\|_\infty$.]
Advantages of AdaBoost
- Simple: straightforward implementation.
- Efficient: complexity $O(mNT)$ for stumps; when $N$ and $T$ are not too large, the algorithm is quite fast.
- Theoretical guarantees: but still many questions.
  - AdaBoost is not designed to maximize the margin.
  - Regularized versions of AdaBoost.
Weaker Aspects
- Parameters:
  - need to determine $T$, the number of rounds of boosting (stopping criterion);
  - need to determine the base learners: risk of overfitting or low margins.
- Noise:
  - severely damages the accuracy of AdaBoost (Dietterich, 2000);
  - boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with L1 regularization or early stopping (Long and Servedio, 2010).
References
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.

Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.

J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100:295-320, 1928.

Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.
References
Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.

Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.