
Boosting

(Following Mohri, Rostamizadeh and Talwalkar.)

Let Z_i = (X_i, Y_i) where Y_i ∈ {−1, +1}. Boosting is a way to combine weak classifiers into
a better classifier. We make the weak learning assumption: for some γ > 0 and every δ > 0, we have an
algorithm that, for every distribution P, returns h ∈ H such that
    P(R(h) ≤ 1/2 − γ) ≥ 1 − δ,
where γ > 0 is called the edge.

Let us recall the AdaBoost algorithm:

1. Set D_1(i) = 1/n for i = 1, . . . , n.

2. Repeat for t = 1, . . . , T:

   (a) Let h_t = argmin_{h ∈ H} P_{D_t}(Y_i ≠ h(X_i)).
   (b) Set ε_t = P_{D_t}(Y_i ≠ h_t(X_i)).
   (c) Set α_t = (1/2) log((1 − ε_t)/ε_t).
   (d) Set
           D_{t+1}(i) = D_t(i) e^{−α_t Y_i h_t(X_i)} / Z_t,
       where Z_t is a normalizing constant.

3. Set g(x) = Σ_t α_t h_t(x).

4. Return h(x) = sign(g(x)).
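As an illustration of the loop above, here is a minimal sketch in Python. It assumes the weak learners are brute-force decision stumps; the function names best_stump and adaboost, and the stump class itself, are choices made for this sketch rather than part of the notes.

# Minimal AdaBoost sketch with decision stumps as the weak learners H.
# The data conventions (X is an n-by-p array, Y in {-1, +1}) and the stump
# search are illustrative assumptions, not part of the notes.
import numpy as np

def best_stump(X, Y, D):
    """Return (error, feature j, threshold s, sign b) minimizing the
    D-weighted error P_D(Y != h(X)) over all stumps."""
    n, p = X.shape
    best = None
    for j in range(p):
        for s in np.unique(X[:, j]):
            for b in (+1, -1):
                pred = b * np.where(X[:, j] <= s, 1, -1)
                err = np.sum(D[pred != Y])
                if best is None or err < best[0]:
                    best = (err, j, s, b)
    return best

def adaboost(X, Y, T=50):
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                    # step 1: D_1(i) = 1/n
    stumps, alphas = [], []
    for t in range(T):
        eps, j, s, b = best_stump(X, Y, D)     # steps (a), (b)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # step (c)
        pred = b * np.where(X[:, j] <= s, 1, -1)
        D = D * np.exp(-alpha * Y * pred)      # step (d): reweight the examples ...
        D /= D.sum()                           # ... and normalize by Z_t
        stumps.append((j, s, b))
        alphas.append(alpha)

    def h(Xnew):
        g = sum(a * b * np.where(Xnew[:, j] <= s, 1, -1)
                for a, (j, s, b) in zip(alphas, stumps))
        return np.sign(g)                      # h(x) = sign(g(x))
    return h

Each round reweights the examples so that the next stump concentrates on the points the current combination still misclassifies, exactly as in step (d).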

Training Error. Now we show that the training error decreases exponentially fast.

Lemma 1 We have
    Z_t = 2 √(ε_t (1 − ε_t)).

Proof. Since Σ_i D_t(i) = 1 we have

    Z_t = Σ_i D_t(i) e^{−α_t Y_i h_t(X_i)}
        = Σ_{i: Y_i h_t(X_i) = 1} D_t(i) e^{−α_t} + Σ_{i: Y_i h_t(X_i) = −1} D_t(i) e^{α_t}
        = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2 √(ε_t (1 − ε_t)),

since α_t = (1/2) log((1 − ε_t)/ε_t). □
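For example, a weak learner with weighted error ε_t = 0.1 gives Z_t = 2 √(0.1 · 0.9) = 0.6, while ε_t = 0.4 gives Z_t ≈ 0.98; as the next result shows, the training error is bounded by the product of these factors, so rounds with small ε_t shrink it much faster.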

Theorem 2 Suppose that γ ≤ (1/2) − ε_t for all t. Then
    R̂(h) ≤ e^{−2γ² T}.

Hence, the training error goes to 0 quickly.

Proof. Recall that D_1(i) = 1/n. So

    D_{T+1}(i) = D_T(i) e^{−α_T Y_i h_T(X_i)} / Z_T
               = D_{T−1}(i) e^{−α_{T−1} Y_i h_{T−1}(X_i)} e^{−α_T Y_i h_T(X_i)} / (Z_{T−1} Z_T)
               = · · · = e^{−Y_i Σ_t α_t h_t(X_i)} / (n Π_t Z_t) = e^{−Y_i g(X_i)} / (n Π_t Z_t),

which implies that
    e^{−Y_i g(X_i)} = n D_{T+1}(i) Π_t Z_t.                                (1)

Since I(u ≤ 0) ≤ e^{−u} we have

    R̂(h) = (1/n) Σ_i I(Y_i g(X_i) ≤ 0) ≤ (1/n) Σ_i e^{−Y_i g(X_i)}
          = (1/n) Σ_i n (Π_t Z_t) D_{T+1}(i) = Π_{t=1}^T Z_t
          = Π_t 2 √(ε_t (1 − ε_t)) = Π_t √(1 − 4(1/2 − ε_t)²)
          ≤ Π_t e^{−2(1/2 − ε_t)²}          (since 1 − x ≤ e^{−x})
          = e^{−2 Σ_t (1/2 − ε_t)²} ≤ e^{−2γ² T}.  □
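To see the bound concretely, here is a small numerical check (an illustrative script with made-up errors ε_t, not part of the notes): the product of the Z_t from Lemma 1 bounds the training error and is itself at most e^{−2γ²T}.

import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.1, 200
eps = rng.uniform(0.25, 0.5 - gamma, size=T)   # hypothetical errors, each <= 1/2 - gamma

Z = 2 * np.sqrt(eps * (1 - eps))               # Lemma 1: Z_t = 2 sqrt(eps_t (1 - eps_t))
print(np.prod(Z))                              # upper bound on the training error
print(np.exp(-2 * gamma**2 * T))               # Theorem 2 bound e^{-2 gamma^2 T}
# The first number is always at most the second, since each eps_t <= 1/2 - gamma.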


Generalization Error. The training error gets small very quickly. But how well do we do
in terms of prediction error?

Let
    F = { sign(Σ_t α_t h_t) : α_t ∈ R, h_t ∈ H }.

For fixed h = (h_1, . . . , h_T) this is just a set of linear classifiers, which has VC dimension T.
So the shattering number is
    (en/T)^T.
If H is finite then the shattering number is
    (en/T)^T · |H|^T.
If H is infinite but has VC dimension d then the shattering number is bounded by
    (en/T)^T (en/d)^{dT} ≈ n^{Td}.
By the VC theorem, with probability at least 1 − δ,
    R(ĥ) ≤ R̂(ĥ) + √(T d log n / n).
Unfortunately this depends on T. We can fix this using margin theory.
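For instance (with purely illustrative numbers), with n = 1000 samples, a weak-learner class of VC dimension d = 1, and T = 100 rounds, the second term is √(100 · 1 · log 1000 / 1000) ≈ 0.83, so the bound is vacuous even though Theorem 2 may already guarantee a tiny training error.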

Margins. Consider the classifier h(x) = sign(g(x)) where g(x) = Σ_t α_t h_t(x). The classifier
is unchanged if we multiply g by a scalar. In particular, we can replace g with g̃ = g/||α||_1.
This form of the classifier is a convex combination of the h_t's.

We define the margin at x of g = Σ_t α_t h_t by
    ρ(x) = y g(x)/||α||_1 = y g̃(x).
Think of |ρ(x)| as our confidence in classifying x. The margin of g is defined to be
    ρ = min_i ρ(X_i) = min_i Y_i g(X_i)/||α||_1.
Note that ρ ∈ [−1, 1].
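As a small illustration (the arrays below are made up; in practice α and the weak-learner predictions would come from a run of AdaBoost), the normalized margins Y_i g̃(X_i) are easy to compute:

# Normalized margins Y_i g(X_i)/||alpha||_1 for a boosted classifier.
# H_pred[t, i] = h_t(X_i) in {-1, +1}; alpha[t] = alpha_t; Y[i] in {-1, +1}.
# (Array names and values are illustrative, not from the notes.)
import numpy as np

alpha = np.array([0.8, 0.5, 0.3])
H_pred = np.array([[+1, -1, +1, +1],
                   [+1, +1, -1, +1],
                   [-1, +1, +1, +1]])
Y = np.array([+1, -1, +1, +1])

g = alpha @ H_pred                     # g(X_i) = sum_t alpha_t h_t(X_i)
margins = Y * g / np.abs(alpha).sum()  # rho(X_i) = Y_i g(X_i)/||alpha||_1
rho = margins.min()                    # the margin of g
print(margins, rho)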

To proceed we need to review Rademacher complexity. Given a class of functions F with
−1 ≤ f(x) ≤ 1 we define
    R_n(F) = E_σ [ sup_{f ∈ F} (1/n) Σ_i σ_i f(Z_i) ],
where P(σ_i = 1) = P(σ_i = −1) = 1/2. If H is finite then
    R_n(H) ≤ √(2 log |H| / n).
If H has VC dimension d then
    R_n(H) ≤ √(2d log(en/d) / n).
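To make the definition concrete, here is a small Monte Carlo estimate of R_n(H) for a finite class (an illustrative script; the class of threshold functions and all names are assumptions of this sketch):

# Monte Carlo estimate of R_n(H) for a finite class H, directly from the
# definition: average over random signs sigma of sup_h (1/n) sum_i sigma_i h(Z_i).
import numpy as np

rng = np.random.default_rng(0)
n = 200
Z = rng.uniform(0, 1, size=n)

# H = a small finite class: h_s(z) = sign(z - s) for a grid of thresholds s
thresholds = np.linspace(0.1, 0.9, 9)
H = np.sign(Z[None, :] - thresholds[:, None])   # shape (|H|, n), entries in {-1, +1}

draws = 2000
total = 0.0
for _ in range(draws):
    sigma = rng.choice([-1.0, 1.0], size=n)
    total += np.max(H @ sigma) / n              # sup over h of (1/n) sum_i sigma_i h(Z_i)
R_hat = total / draws
print(R_hat, np.sqrt(2 * np.log(len(thresholds)) / n))  # estimate vs. the finite-class bound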
We will need the following two facts. First,
    R_n(conv(H)) = R_n(H),
where conv(H) is the convex hull of H. Second, if
    |φ(x) − φ(y)| ≤ L ||x − y||
for all x, y, then
    R_n(φ ◦ F) ≤ L R_n(F).
The set of margin functions is
    M = {y f(x) : f ∈ conv(H)}.
We then have
    R_n(M) = R_n(conv(H)) = R_n(H).
A key result is that, with probability at least 1 − δ, for all f ∈ F,
    E[f(Z)] ≤ (1/n) Σ_i f(Z_i) + 2 R_n(F) + √(2 log(1/δ) / n).        (2)

Now fix a number ρ > 0 and define the margin-sensitive loss function

    φ(u) = 1           if u ≤ 0
           1 − u/ρ     if 0 ≤ u ≤ ρ
           0           if u ≥ ρ.

Note that
    I(u ≤ 0) ≤ φ(u) ≤ I(u ≤ ρ),
and that φ is Lipschitz with constant L = 1/ρ. Assume that H has VC dimension d. Then
    R_n(φ ◦ M) ≤ L R_n(M) ≤ L R_n(H) ≤ (1/ρ) √(2d log(en/d) / n).
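In code, φ is just a clipped linear ramp (a tiny illustrative helper; the name margin_loss is mine):

# The margin-sensitive (ramp) loss above: 1 for u <= 0, linear on [0, rho],
# and 0 for u >= rho. It is (1/rho)-Lipschitz and sandwiched between the
# indicators I(u <= 0) and I(u <= rho).
import numpy as np

def margin_loss(u, rho):
    return np.clip(1.0 - u / rho, 0.0, 1.0)

u = np.linspace(-0.5, 0.5, 5)
print(margin_loss(u, rho=0.25))   # [1, 1, 1, 0, 0]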

Now define the empirical margin-sensitive loss of a classifier f by
    R̂_ρ(f) = (1/n) Σ_i I(Y_i f(X_i) ≤ ρ).

Theorem 3 With probability at least 1 − δ,
    R(g) ≤ R̂_ρ(g/||α||_1) + (1/ρ) √(2d log(en/d) / n) + √(2 log(2/δ) / n).

Proof. Recall that I(u ≤ 0) ≤ φ(u) ≤ I(u ≤ ρ). Also recall that g and g̃ = g/||α||_1 are
equivalent classifiers. Then using (2) we have

    R(g) = R(g̃) = P(Y g̃(X) ≤ 0) ≤ (1/n) Σ_i φ(Y_i g̃(X_i)) + 2 R_n(φ ◦ M) + √(2 log(2/δ) / n)
         ≤ (1/n) Σ_i φ(Y_i g̃(X_i)) + (1/ρ) √(2d log(en/d) / n) + √(2 log(2/δ) / n)
         ≤ R̂_ρ(g/||α||_1) + (1/ρ) √(2d log(en/d) / n) + √(2 log(2/δ) / n).  □


Next we bound R̂_ρ(g/||α||_1).

Theorem 4 We have
    R̂_ρ(g/||α||_1) ≤ Π_{t=1}^T √(4 ε_t^{1−ρ} (1 − ε_t)^{1+ρ}).

Proof. Since φ(u) ≤ I(u ≤ ρ) we have

    R̂_ρ(g/||α||_1) ≤ (1/n) Σ_i I(Y_i g(X_i) − ρ ||α||_1 ≤ 0)
                    ≤ (1/n) Σ_i e^{ρ ||α||_1} e^{−Y_i g(X_i)}
                    = e^{ρ ||α||_1} (1/n) Σ_i n D_{T+1}(i) Π_t Z_t = e^{ρ ||α||_1} Π_t Z_t
                    = Π_{t=1}^T √(4 ε_t^{1−ρ} (1 − ε_t)^{1+ρ}),

since Z_t = 2 √(ε_t (1 − ε_t)) and α_t = (1/2) log((1 − ε_t)/ε_t). □
Assuming γ ≤ (1/2 − ε_t) for all t and ρ < γ, it can be shown that √(4 ε_t^{1−ρ} (1 − ε_t)^{1+ρ}) ≤ b for some b < 1.
So R̂_ρ(g/||α||_1) ≤ b^T. Combining with the previous result we have, with probability at least
1 − δ,
    R(g) ≤ b^T + (1/ρ) √(2d log(en/d) / n) + √(2 log(2/δ) / n).
This shows that we get small error even with T large (unlike the earlier bound based only
on VC theory).
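For a sense of scale (with made-up values of γ, ρ and ε_t), the per-round factor b and the resulting bound b^T can be computed directly:

# Numerical illustration: the per-round factor b = sqrt(4 eps^{1-rho} (1-eps)^{1+rho})
# is strictly below 1 when eps <= 1/2 - gamma and rho < gamma, so the bound b^T
# on the empirical margin loss vanishes as T grows. Values are illustrative only.
import numpy as np

gamma, rho = 0.1, 0.05
eps = 0.5 - gamma                      # worst case allowed by the edge condition
b = np.sqrt(4 * eps**(1 - rho) * (1 - eps)**(1 + rho))
print(b)                               # roughly 0.99, which is < 1
print(b ** np.array([100, 1000]))      # b^T for T = 100 and T = 1000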
