Boosting
Let $Z_i = (X_i, Y_i)$ where $Y_i \in \{-1, +1\}$. Boosting is a way to combine weak classifiers into
a better classifier. We make the weak learning assumption: for some $\gamma > 0$ we have an
algorithm that returns $h \in \mathcal{H}$ such that, for all $P$,
$$P\bigl(R(h) \le 1/2 - \gamma\bigr) \ge 1 - \delta$$
where $\gamma > 0$ is the edge.
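To fix notation for what follows (the weights $D_t(i)$, errors $\epsilon_t$, step sizes $\alpha_t$ and normalizers $Z_t$), here is a minimal AdaBoost sketch in Python; `weak_learner` is a hypothetical routine standing in for the assumed weak learning algorithm.

```python
import numpy as np

def adaboost(X, Y, weak_learner, T):
    """Minimal AdaBoost sketch using the notation of these notes (D_t, eps_t, alpha_t, Z_t).

    `weak_learner(X, Y, D)` is a hypothetical routine standing in for the assumed weak
    learning algorithm: it returns a classifier h with h(X) in {-1, +1} whose weighted
    error under D is at most 1/2 - gamma (so 0 < eps_t < 1/2 below).
    """
    n = len(Y)
    D = np.full(n, 1.0 / n)                        # D_1(i) = 1/n
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, Y, D)
        margins = Y * h(X)                         # Y_i h_t(X_i), each +1 or -1
        eps = D[margins == -1].sum()               # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = (1/2) log((1 - eps_t)/eps_t)
        Z = np.sum(D * np.exp(-alpha * margins))   # normalizer Z_t
        D = D * np.exp(-alpha * margins) / Z       # update: D_{t+1}(i)
        hs.append(h)
        alphas.append(alpha)
    # final classifier: h(x) = sign(g(x)) with g = sum_t alpha_t h_t
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, hs)))
```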
Training Error. Now we show that the training error decreases exponentially fast.
Lemma 1 We have
$$Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}.$$

Proof. Since $\sum_i D_t(i) = 1$ we have
$$Z_t = \sum_i D_t(i)\, e^{-\alpha_t Y_i h_t(X_i)}
= \sum_{i:\, Y_i h_t(X_i) = 1} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, Y_i h_t(X_i) = -1} D_t(i)\, e^{\alpha_t}$$
$$= (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$
since $\alpha_t = (1/2)\log((1-\epsilon_t)/\epsilon_t)$.
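As a quick numerical sanity check of Lemma 1 (with a hypothetical distribution $D_t$ and hypothetical agreements $Y_i h_t(X_i)$, not data from the notes), the normalizer computed directly matches $2\sqrt{\epsilon_t(1-\epsilon_t)}$:

```python
import numpy as np

# Hypothetical distribution D_t and agreements Y_i h_t(X_i) for a single round.
D = np.array([0.10, 0.05, 0.20, 0.15, 0.10, 0.05, 0.10, 0.05, 0.10, 0.10])  # sums to 1
m = np.array([   1,   -1,    1,    1,   -1,    1,    1,    1,   -1,    1])  # Y_i h_t(X_i)

eps = D[m == -1].sum()                    # weighted error eps_t = 0.25
alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
print(np.sum(D * np.exp(-alpha * m)))     # Z_t computed directly: ~0.8660
print(2 * np.sqrt(eps * (1 - eps)))       # Lemma 1 formula:       ~0.8660
```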
Theorem 2 Suppose that $\gamma \le (1/2) - \epsilon_t$ for all $t$. Then
$$\widehat{R}(h) \le e^{-2\gamma^2 T}.$$

Proof. Recall that $\widehat{R}(h) \le \prod_t Z_t$. Hence
$$\widehat{R}(h) \le \prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}
= \prod_t \sqrt{1 - 4(1/2-\epsilon_t)^2}
\le \prod_t e^{-2(1/2-\epsilon_t)^2} \quad \text{since } 1 - x \le e^{-x}$$
$$= e^{-2\sum_t (1/2-\epsilon_t)^2} \le e^{-2\gamma^2 T}$$
because $1/2 - \epsilon_t \ge \gamma$ for all $t$.
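For example, if each weak learner has edge $\gamma = 0.1$, then after $T = 500$ rounds the bound gives $\widehat{R}(h) \le e^{-2(0.1)^2(500)} = e^{-10} \approx 4.5 \times 10^{-5}$, so the training error is essentially zero.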
Generalization Error. The training error gets small very quickly. But how well do we do
in terms of prediction error?
Let
$$\mathcal{F} = \Bigl\{ \mathrm{sign}\Bigl(\sum_t \alpha_t h_t\Bigr) :\ \alpha_t \in \mathbb{R},\ h_t \in \mathcal{H} \Bigr\}.$$
For fixed $h = (h_1, \ldots, h_T)$ this is just a set of linear classifiers, which has VC dimension $T$.
So the shattering number is
$$\Bigl(\frac{en}{T}\Bigr)^T.$$
If $\mathcal{H}$ is finite then the shattering number is
$$\Bigl(\frac{en}{T}\Bigr)^T |\mathcal{H}|^T.$$
If $\mathcal{H}$ is infinite but has VC dimension $d$ then the shattering number is bounded by
$$\Bigl(\frac{en}{T}\Bigr)^T \Bigl(\frac{en}{d}\Bigr)^{dT} \preceq n^{Td}.$$
By the VC theorem, with probability at least $1 - \delta$,
$$R(\hat h) \le \widehat{R}(\hat h) + \sqrt{\frac{Td \log n}{n}}.$$
Unfortunately this depends on T . We can fix this using margin theory.
Margins. Consider the classifier $h(x) = \mathrm{sign}(g(x))$ where $g(x) = \sum_t \alpha_t h_t(x)$. The classifier
is unchanged if we multiply $g$ by a positive scalar. In particular, we can replace $g$ with $\tilde g = g/\|\alpha\|_1$.
In this form the classifier is a convex combination of the $h_t$'s.
We define the margin at $x$ of $g = \sum_t \alpha_t h_t$ by
$$\rho(x) = \frac{y\, g(x)}{\|\alpha\|_1} = y\, \tilde g(x).$$
Think of |ρ(x)| as our confidence in classifying x. The margin of g is defined to be
$$\rho = \min_i \rho(X_i) = \min_i \frac{Y_i g(X_i)}{\|\alpha\|_1}.$$
Note that ρ ∈ [−1, 1].
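As an illustrative sketch (with hypothetical weak-classifier outputs, weights, and labels, not data from the notes), the margins $\rho(X_i)$ and the margin $\rho$ can be computed directly:

```python
import numpy as np

# Hypothetical outputs of T = 3 weak classifiers on n = 4 points: H[t, i] = h_t(X_i).
H = np.array([[ 1,  1, -1,  1],
              [ 1, -1,  1,  1],
              [ 1,  1, -1, -1]])
alpha = np.array([0.8, 0.5, 0.3])        # the alpha_t's
Y = np.array([1, 1, -1, 1])              # labels Y_i

g = alpha @ H                            # g(X_i) = sum_t alpha_t h_t(X_i)
rho_i = Y * g / np.abs(alpha).sum()      # rho(X_i) = Y_i g(X_i) / ||alpha||_1, in [-1, 1]
rho = rho_i.min()                        # the margin of g
print(rho_i, rho)                        # [1.0, 0.375, 0.375, 0.625], margin 0.375
```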
We will use two facts about Rademacher complexity. First,
$$R_n(\mathrm{conv}(\mathcal{H})) = R_n(\mathcal{H})$$
where $\mathrm{conv}(\mathcal{H})$ is the convex hull of $\mathcal{H}$. Second, if $\phi$ is $L$-Lipschitz then
$R_n(\phi \circ \mathcal{F}) \le L\, R_n(\mathcal{F})$. Let $\mathcal{M}$ denote the set of convex combinations of elements of
$\mathcal{H}$, so that $\tilde g \in \mathcal{M}$. We then have
$$R_n(\mathcal{M}) = R_n(\mathrm{conv}(\mathcal{H})) = R_n(\mathcal{H}).$$
A key result is that, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$E[f(Z)] \le \frac{1}{n}\sum_i f(Z_i) + 2 R_n(\mathcal{F}) + \sqrt{\frac{2\log(1/\delta)}{n}}. \qquad (2)$$
Let $\phi$ be the ramp function: $\phi(u) = 1$ for $u \le 0$, $\phi(u) = 1 - u/\rho$ for $0 \le u \le \rho$, and $\phi(u) = 0$
for $u \ge \rho$. Then $\phi$ is Lipschitz with constant $L = 1/\rho$ and
$$I(u \le 0) \le \phi(u) \le I(u \le \rho).$$
Assume that $\mathcal{H}$ has VC dimension $d$. Then
$$R_n(\phi \circ \mathcal{M}) \le L\, R_n(\mathcal{M}) \le L\, R_n(\mathcal{H}) \le \frac{1}{\rho}\sqrt{\frac{2d\log(en/d)}{n}}.$$
Define the empirical margin error
$$\widehat{R}_\rho(f) = \frac{1}{n}\sum_i I(Y_i f(X_i) \le \rho).$$

Theorem 3 With probability at least $1 - \delta$,
$$R(g) \le \widehat{R}_\rho(g/\|\alpha\|_1) + \frac{1}{\rho}\sqrt{\frac{2d\log(en/d)}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}.$$
Proof. Recall that $I(u \le 0) \le \phi(u) \le I(u \le \rho)$. Also recall that $g$ and $\tilde g = g/\|\alpha\|_1$ are
equivalent classifiers. Then using (2) we have
$$R(g) = R(\tilde g) = P(Y \tilde g(X) \le 0) \le \frac{1}{n}\sum_i \phi(Y_i \tilde g(X_i)) + 2 R_n(\phi \circ \mathcal{M}) + \sqrt{\frac{2\log(2/\delta)}{n}}$$
$$\le \frac{1}{n}\sum_i \phi(Y_i \tilde g(X_i)) + \frac{1}{\rho}\sqrt{\frac{2d\log(en/d)}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}$$
$$\le \widehat{R}_\rho(g/\|\alpha\|_1) + \frac{1}{\rho}\sqrt{\frac{2d\log(en/d)}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}.$$
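As a small illustration (hypothetical normalized scores $\tilde g(X_i)$ and labels, not from the notes), $\widehat{R}_\rho$ is just the fraction of training points whose normalized margin is at most $\rho$:

```python
import numpy as np

# Hypothetical normalized scores g~(X_i) = g(X_i)/||alpha||_1 and labels Y_i.
Y = np.array([1, 1, -1, 1, -1])
g_tilde = np.array([0.9, 0.2, -0.4, 0.6, 0.1])

rho = 0.3
R_hat_rho = np.mean(Y * g_tilde <= rho)   # fraction of points with margin <= rho
print(R_hat_rho)                          # 0.4: two of the five margins are below 0.3
```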
Next we bound $\widehat{R}_\rho(g/\|\alpha\|_1)$.
Theorem 4 We have
$$\widehat{R}_\rho(g/\|\alpha\|_1) \le \prod_{t=1}^T \sqrt{4\, \epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}}.$$
Proof. We have
$$\widehat{R}_\rho(g/\|\alpha\|_1) \le \frac{1}{n}\sum_i I(Y_i g(X_i) - \rho\|\alpha\|_1 \le 0)
\le e^{\rho\|\alpha\|_1}\, \frac{1}{n}\sum_i e^{-Y_i g(X_i)}$$
$$= e^{\rho\|\alpha\|_1}\, \frac{1}{n}\sum_i n D_{T+1}(i) \prod_t Z_t = e^{\rho\|\alpha\|_1} \prod_t Z_t$$
$$= \prod_{t=1}^T \sqrt{4\, \epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}}$$
since $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$ and $\alpha_t = (1/2)\log((1-\epsilon_t)/\epsilon_t)$.
Assuming $\gamma \le 1/2 - \epsilon_t$ for all $t$ and $\rho < \gamma$, it can be shown that
$\sqrt{4\, \epsilon_t^{1-\rho}(1-\epsilon_t)^{1+\rho}} \le b$ for some $b < 1$.
So $\widehat{R}_\rho(g/\|\alpha\|_1) \le b^T$. Combining with the previous result we have, with probability at least
$1 - \delta$,
$$R(g) \le b^T + \frac{1}{\rho}\sqrt{\frac{2d\log(en/d)}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}.$$
This shows that we get small error even with T large (unlike the earlier bound based only
on VC theory).