
AdaBoost

Karl Stratos

1 Empirical Loss
Let pop denote a population distribution over input-label pairs in X ×{±1}. Let (x1 , y1 ) . . . (xN , yN ) ∼ pop denote
iid samples. Given f : X → R, define
$$\hat{\epsilon}(f) := \frac{1}{N}\sum_{i=1}^N [[y_i f(x_i) \le 0]] \qquad \text{(empirical zero-one loss)}$$

$$\hat{l}(f) := \frac{1}{N}\sum_{i=1}^N \exp(-y_i f(x_i)) \qquad \text{(empirical exponential loss)}$$

$\hat{l}(f)$ is a convex upper bound on $\hat{\epsilon}(f)$. More generally, given a distribution $D$ over $\{1 \ldots N\}$, define

$$\hat{\epsilon}_D(f) := \sum_{i=1}^N D(i)\, [[y_i f(x_i) \le 0]] \qquad \text{(expected empirical zero-one loss)}$$

$$\hat{l}_D(f) := \sum_{i=1}^N D(i) \exp(-y_i f(x_i)) \qquad \text{(expected empirical exponential loss)}$$

$$\hat{\gamma}_D(f) := \sum_{i=1}^N D(i)\, y_i f(x_i) \qquad \text{(expected empirical margin, aka the edge)}$$

1.1 For Classifiers


Let h : X → {±1}.
Lemma 1.1. $\hat{\gamma}_D(h) = 1 - 2\hat{\epsilon}_D(h)$

Lemma 1.2. $\hat{l}_D(\alpha h) \ge \hat{\epsilon}_D(h)$ for all $\alpha \ge 0$

Lemma 1.3. Assume $\hat{\epsilon}_D(h) \in (0, 1/2]$. Then

$$\min_{\alpha \ge 0} \hat{l}_D(\alpha h) = 2\sqrt{\hat{\epsilon}_D(h)(1 - \hat{\epsilon}_D(h))} = \sqrt{1 - \hat{\gamma}_D(h)^2}$$

where the unique minimizer is

$$\alpha = \log\sqrt{\frac{1 - \hat{\epsilon}_D(h)}{\hat{\epsilon}_D(h)}} = \log\sqrt{\frac{1 + \hat{\gamma}_D(h)}{1 - \hat{\gamma}_D(h)}} \ge 0$$

In words: we can approximate $\hat{\epsilon}_D(h) > 0$ by tightening the upper bound $\hat{l}_D(\alpha h)$, where $\alpha \ge 0$ controls the margin. The better $h$ is, the larger $\alpha$. This is useful because $\hat{l}_D$ is easier to optimize than $\hat{\epsilon}_D$.
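As a quick numerical sanity check of Lemma 1.3 (a sketch with a hypothetical value of $\hat{\epsilon}_D(h)$, assuming NumPy and SciPy are available), the closed-form minimizer and minimum agree with a direct numerical minimization of $\hat{l}_D(\alpha h) = \hat{\epsilon}_D(h) e^{\alpha} + (1 - \hat{\epsilon}_D(h)) e^{-\alpha}$:

import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3  # hypothetical weighted zero-one loss in (0, 1/2]

# Weighted exponential loss of alpha * h as a function of alpha (Lemma 1.3).
loss = lambda a: eps * np.exp(a) + (1 - eps) * np.exp(-a)

# Closed-form minimizer and minimum from Lemma 1.3.
alpha_star = np.log(np.sqrt((1 - eps) / eps))
min_val = 2 * np.sqrt(eps * (1 - eps))

# Numerical check.
res = minimize_scalar(loss, bounds=(0, 10), method="bounded")
print(alpha_star, res.x)        # both ~0.4236
print(min_val, res.fun)         # both ~0.9165
gamma = 1 - 2 * eps             # edge (Lemma 1.1)
print(np.sqrt(1 - gamma ** 2))  # same minimum, written in the edge form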

2 Ensemble
Let h1 . . . hT : X → {±1}. An ensemble with weights α1 . . . αT ∈ R is the function gT : X → R defined by
$$g_T(x) := \sum_{t=1}^T \alpha_t h_t(x)$$

For $t = 1 \ldots T$, define a distribution $D_t$ over $\{1 \ldots N\}$ by

$$D_t(i) := \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_t(x_i))}{N \hat{l}(g_t)}$$

A softmax over negative margins is simply normalized exponential losses. We can pull $(\alpha_1, h_1) \ldots (\alpha_{t-1}, h_{t-1})$ out of $D_t$ to express it as a function of $D_{t-1}$ and $(\alpha_t, h_t)$, thanks to the linearity of $g_t = \sum_{s=1}^t \alpha_s h_s$. For convenience we define $g_0(x) = 0$ so that $D_0(i) = 1/N$.
Lemma 2.1. For t ≥ 1,
$$D_t(i) = \frac{D_{t-1}(i) \exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j) \exp(-y_j \alpha_t h_t(x_j))} = \frac{D_{t-1}(i) \exp(-y_i \alpha_t h_t(x_i))}{\hat{l}_{D_{t-1}}(\alpha_t h_t)} = \frac{\exp(-y_i g_t(x_i))}{N \prod_{s=1}^t \hat{l}_{D_{s-1}}(\alpha_s h_s)}$$

By equating the two expressions of the normalizer, we obtain

$$\hat{l}(g_t) = \prod_{s=1}^t \hat{l}_{D_{s-1}}(\alpha_s h_s) = \left(\prod_{s=1}^{t-1} \hat{l}_{D_{s-1}}(\alpha_s h_s)\right) \hat{l}_{D_{t-1}}(\alpha_t h_t) \tag{1}$$

This suggests a greedy strategy: minimize $\hat{l}(g_t)$ over $\alpha_t \in \mathbb{R}$ while holding $\alpha_1 \ldots \alpha_{t-1}$ fixed. This reduces to minimizing $\hat{l}_{D_{t-1}}(\alpha_t h_t)$ where $D_{t-1}$ is constant. From Lemma 1.3 we know the optimal solution

$$\alpha_t = \log\sqrt{\frac{1 - \hat{\epsilon}_{D_{t-1}}(h_t)}{\hat{\epsilon}_{D_{t-1}}(h_t)}} = \log\sqrt{\frac{1 + \hat{\gamma}_{D_{t-1}}(h_t)}{1 - \hat{\gamma}_{D_{t-1}}(h_t)}} \ge 0 \tag{2}$$

assuming $\hat{\epsilon}_{D_{t-1}}(h_t) \in (0, 1/2]$. The greedy selection of $\alpha_1 \ldots \alpha_T$ gives


$$\hat{l}(g_T) = \prod_{t=1}^T 2\sqrt{\hat{\epsilon}_{D_{t-1}}(h_t)(1 - \hat{\epsilon}_{D_{t-1}}(h_t))} = \prod_{t=1}^T \sqrt{1 - \hat{\gamma}_{D_{t-1}}(h_t)^2} \tag{3}$$

Plugging Eq. (2) in the recursive expression of Dt in Lemma 2.1, we can verify the following update:
$$D_t(i) = \begin{cases} \dfrac{1}{2(1 - \hat{\epsilon}_{D_{t-1}}(h_t))} D_{t-1}(i) = \dfrac{1}{1 + \hat{\gamma}_{D_{t-1}}(h_t)} D_{t-1}(i) & \text{if } h_t(x_i) = y_i \\[2ex] \dfrac{1}{2\hat{\epsilon}_{D_{t-1}}(h_t)} D_{t-1}(i) = \dfrac{1}{1 - \hat{\gamma}_{D_{t-1}}(h_t)} D_{t-1}(i) & \text{otherwise} \end{cases}$$

The edge formulation, where $\hat{\gamma}_{D_{t-1}}(h_t) \in [0, 1)$, makes it clear that the probability of an example is downweighted if correctly classified and upweighted if misclassified. This is the AdaBoost algorithm. We give it in the edge form; the zero-one loss form is found in Appendix B.

AdaBoost (Freund and Schapire, 1997)


Input:
• S = {(x1 , y1 ) . . . (xN , yN )} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers
• Number of boosting rounds T
Output: $g_T : \mathcal{X} \to \mathbb{R}$ with $\hat{\epsilon}(g_T) \le \exp(-\frac{1}{2}\gamma^2 T)$ for some $\gamma \in [0, 1)$ (Theorem 2.2).
1. Set the initial distribution D0 (i) ← 1/N for i = 1 . . . N .
2. For t = 1 . . . T ,

(a) Let $h_t$ be a maximizer of $\hat{\gamma}_{D_{t-1}}(h) \in [-1, 1]$ over $\{h \in \mathcal{H} : \hat{\gamma}_{D_{t-1}}(h) < 1\}$.
(b) Compute the weight $\alpha_t = \log\sqrt{(1 + \hat{\gamma}_{D_{t-1}}(h_t)) / (1 - \hat{\gamma}_{D_{t-1}}(h_t))} \ge 0$.
(c) Set the next distribution: for $i = 1 \ldots N$,

$$D_t(i) = \begin{cases} \dfrac{1}{1 + \hat{\gamma}_{D_{t-1}}(h_t)} D_{t-1}(i) & \text{if } h_t(x_i) = y_i \\[2ex] \dfrac{1}{1 - \hat{\gamma}_{D_{t-1}}(h_t)} D_{t-1}(i) & \text{otherwise} \end{cases}$$

3. Return $g_T(x) \leftarrow \sum_{t=1}^T \alpha_t h_t(x)$.
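The following is a minimal Python sketch of the edge-form algorithm above. It assumes a user-supplied routine argmax_edge(D) that returns the training-set predictions of a base classifier (approximately) maximizing the edge under D; the function and variable names are illustrative, not a reference implementation.

import numpy as np

def adaboost(y, argmax_edge, T):
    """y: (N,) labels in {-1, +1}; argmax_edge: D -> (N,) base predictions on S.
    Returns the weights alpha_t and the stacked base predictions on S."""
    N = len(y)
    D = np.full(N, 1.0 / N)              # D_0: uniform distribution
    alphas, preds = [], []
    for t in range(T):
        h = argmax_edge(D)               # step 2a: h_t(x_1) ... h_t(x_N)
        gamma = np.sum(D * y * h)        # edge under D_{t-1}
        if gamma <= 0.0 or gamma >= 1.0:
            break                        # degenerate cases (Section 2.1)
        alpha = np.log(np.sqrt((1 + gamma) / (1 - gamma)))  # step 2b
        # Step 2c: downweight correct examples, upweight misclassified ones.
        D = np.where(h == y, D / (1 + gamma), D / (1 - gamma))
        alphas.append(alpha)
        preds.append(h)
    return np.array(alphas), np.array(preds)

The training-set scores of the ensemble are then alphas @ preds, and y * (alphas @ preds) gives the margins that appear in Theorem 2.2 and Section 4.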

Theorem 2.2. gT ← AdaBoost(S, H, T ) satisfies
 
$$\hat{\epsilon}(g_T) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2 T\right)$$

where $\hat{\gamma}_{\min} := \min_{t=1}^T \hat{\gamma}_{D_{t-1}}(h_t)$ is the minimum edge among the $T$ base classifiers trained in AdaBoost.
Proof.

$$\hat{\epsilon}(g_T) \le \hat{l}(g_T) = \prod_{t=1}^T \sqrt{1 - \hat{\gamma}_{D_{t-1}}(h_t)^2} \le \prod_{t=1}^T \exp\left(-\frac{1}{2}\hat{\gamma}_{D_{t-1}}(h_t)^2\right) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2 T\right)$$

where the first equality is from Eq. (3) and the second inequality uses $\sqrt{1 - z} \le \exp(-z/2)$ (which follows from $1 - z \le \exp(-z)$) with $z = \hat{\gamma}_{D_{t-1}}(h_t)^2$.


AdaBoost is agnostic to the choice of $h_t$ (Step 2a): any base classifier satisfying $\hat{\gamma}_{D_{t-1}}(h_t) \in [0, 1)$ can be used, and Theorem 2.2 holds. But it is natural to explicitly optimize the edge in each iteration, hence the algorithm is given in this form. Below, we give an interpretation of this optimization as taking a gradient step in coordinate descent.

2.1 Degenerate cases


AdaBoost typically assumes that $h_t$ is a "weak" classifier: better than random ($\hat{\gamma}_{D_{t-1}}(h_t) > 0$) but imperfect ($\hat{\gamma}_{D_{t-1}}(h_t) < 1$). However, if
• γ̂Dt−1 (ht ) = 0: This implies αt = 0, and furthermore Dt = Dt−1 , thus more iterations will not change the
final ensemble and the algorithm should terminate. Intuitively, we have exhausted H.
• γ̂Dt−1 (ht ) = 1: If this happens, the algorithm is undefined. But this implies that ht is perfect and there is no
need for more ensembling. We can either stop and return the current ensemble, or continue ensembling by
introducing some noise (e.g., subsample data).

2.2 Interpretations
We motivated AdaBoost as a greedy sequential minimization of $\hat{l}(g_T) = \prod_{t=1}^T \hat{l}_{D_{t-1}}(\alpha_t h_t)$ (Eq. (1)), which upper bounds $\hat{\epsilon}(g_T)$. AdaBoost can also be derived as an adversarial step-wise calibration of $D_t$: select $\alpha_t$ so that

$$\hat{\epsilon}_{D_t}(h_t) = \sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_t(i) = \frac{\sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_{t-1}(i) \exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j) \exp(-y_j \alpha_t h_t(x_j))} = \frac{\hat{\epsilon}_{D_{t-1}}(h_t) \exp(\alpha_t)}{\hat{\epsilon}_{D_{t-1}}(h_t) \exp(\alpha_t) + (1 - \hat{\epsilon}_{D_{t-1}}(h_t)) \exp(-\alpha_t)}$$

is equal to $1/2$. Solving for $\alpha_t$ yields the same solution as Eq. (2). More generally, AdaBoost can be seen as coordinate descent on $\hat{\epsilon}(g)$ over all ensembles $g$. For simplicity assume a finite hypothesis class $\mathcal{H} = \{h_1 \ldots h_H\}$ and write $H(x) := (h_1(x) \ldots h_H(x)) \in \{\pm 1\}^H$. Then any ensemble can be written as

$$\langle \alpha, H(x) \rangle = \sum_{k=1}^H \alpha_k h_k(x)$$

where $\alpha \in \mathbb{R}^H$. Thus the goal is simplified to finding $\alpha$, which implicitly finds base classifiers. Let $U$ be a convex upper bound on the zero-one loss: $U(z) \ge [[z \le 0]]$ where $z$ is the margin. This gives a convex loss of $\langle \alpha, H \rangle$

$$\hat{l}_U(\alpha) := \frac{1}{N}\sum_{i=1}^N U(y_i \langle \alpha, H(x_i) \rangle)$$

which upper bounds $\hat{\epsilon}(\langle \alpha, H \rangle)$. By minimizing $\hat{l}_U(\alpha)$ over $\alpha \in \mathbb{R}^H$, we implicitly minimize $\hat{\epsilon}(g)$ over all ensembles $g$. We do coordinate descent: at each step, find the coordinate $k \in \{1 \ldots H\}$ with the largest decrease in the loss (equivalently, steepest descent under an $l_1$-norm constraint) and take a step in that coordinate, $\alpha + \eta e_k$, where $\eta \in \mathbb{R}$ is an optimal step size.

AdaBoost is a special case with the exponential loss $U(z) = \exp(-z)$. The main idea is that by initializing $\alpha = 0_H$, each step corresponds to adding a single classifier in $\mathcal{H}$. The coordinate and step size coincide with the classifier and its weight selected in AdaBoost. This is due to the recursive property of the exponential derivative,

$$\frac{\partial}{\partial \alpha_k} \hat{l}_U(\alpha) = \frac{1}{N}\sum_{i=1}^N (-y_i h_k(x_i)) \exp(-y_i \langle \alpha, H(x_i) \rangle)$$

The partial derivative "selects" $h_k$ while keeping the exponential loss, which can be normalized to give an expression in expected zero-one loss. The partial derivative of $\hat{l}_U(\alpha + \eta e_k)$ (which is convex in $\eta$) with respect to $\eta$ is similar.

CoordinateDescent
Input:
• S = {(x1 , y1 ) . . . (xN , yN )} where xi ∈ X and yi ∈ {±1}
• Finite hypothesis class H = {h1 . . . hH }, where we write H(x) := (h1 (x) . . . hH (x)) ∈ {±1}H
• Convex upper bound $U(z) \ge [[z \le 0]]$, which approximates $\hat{\epsilon}(g)$ over all possible ensembles $g$ by

$$\hat{l}_U(\alpha) := \frac{1}{N}\sum_{i=1}^N U(y_i \langle \alpha, H(x_i) \rangle) \ge \hat{\epsilon}(\langle \alpha, H \rangle)$$

• Number of gradient steps T


Output: Estimation of $\arg\min_{\alpha \in \mathbb{R}^H} \hat{l}_U(\alpha)$
1. $\alpha^{(0)} \leftarrow 0_H$
2. For $t = 1 \ldots T$, compute

$$k_t \in \arg\min_{k \in \{1 \ldots H\}} [\nabla \hat{l}_U(\alpha^{(t-1)})]_k \qquad \eta_t \in \arg\min_{\eta \in \mathbb{R}} \hat{l}_U(\alpha^{(t-1)} + \eta e_{k_t})$$

and set $\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \eta_t e_{k_t}$ where $e_{k_t}$ is the $k_t$-th basis vector.
3. Return $\alpha^{(T)}$.
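Below is a small Python sketch of CoordinateDescent specialized to the exponential surrogate $U(z) = \exp(-z)$. It assumes (for illustration only) that the finite hypothesis class is represented by a precomputed matrix Hmat of shape (N, H) with entries $h_k(x_i) \in \{\pm 1\}$, and it uses the closed-form step size from Lemma A.2.

import numpy as np

def coordinate_descent_exp(y, Hmat, T):
    """y: (N,) labels in {-1, +1}; Hmat: (N, H) with Hmat[i, k] = h_k(x_i)."""
    N, H = Hmat.shape
    alpha = np.zeros(H)                          # alpha^(0) = 0_H
    for t in range(T):
        w = np.exp(-y * (Hmat @ alpha))          # unnormalized example weights
        D = w / w.sum()                          # distribution D-tilde_{t-1}
        eps = ((Hmat * y[:, None]) < 0).T @ D    # weighted zero-one loss of each h_k
        k = int(np.argmin(eps))                  # steepest coordinate (Lemma A.1)
        if eps[k] == 0.0 or eps[k] >= 0.5:
            break                                # guard against degenerate steps
        eta = np.log(np.sqrt((1 - eps[k]) / eps[k]))  # optimal step size (Lemma A.2)
        alpha[k] += eta
    return alpha

By Theorem 2.3, running this with Hmat holding the predictions of all base classifiers recovers (up to ties) the same ensemble weights that AdaBoost selects.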

Theorem 2.3. Let $\mathcal{H}$ be finite and $g_T \leftarrow$ AdaBoost$(S, \mathcal{H}, T)$. There exists an output

$$\alpha^{(T)} \leftarrow \text{CoordinateDescent}(S, \mathcal{H}, \exp(-z), T)$$

(accounting for ties in optimization) such that $g_T = \langle \alpha^{(T)}, H \rangle$.

This interpretation generalizes AdaBoost to other convex surrogates of the zero-one loss (e.g., hinge or logistic). It
also hints at a deeper connection between gradient descent and ensemble learning. Namely, taking gradient steps in
a function space is equivalent to taking an ensemble of functions. This motivates gradient boosting (Appendix E);
CoordinateDescent can be seen as a special case of gradient boosting.

3 Decision Stumps
A popular version of AdaBoost assumes the input space X = Rd and the hypothesis class of decision stumps:

$$\mathcal{H}_{\text{stumps}} := \{x \mapsto b \times \text{Ind}(x_r > \tau) : r \in \{1 \ldots d\},\, b \in \{\pm 1\},\, \tau \in \mathbb{R}\}$$

where Ind(A) is 1 if A is true and −1 otherwise. Without loss of generality we assume that each dimension is
sorted, so that [x1 ]r ≤ · · · ≤ [xN ]r for r = 1 . . . d (we can implement this assumption by iterating through examples
in a presorted list for each r). Under this assumption, we define for each r = 1 . . . d and i = 2 . . . N ,

$$\tau^{(r)}_i := \frac{[x_{i-1}]_r + [x_i]_r}{2}$$

If $[x_{i-1}]_r < [x_i]_r$ then $\tau^{(r)}_i$ is a threshold in dimension $r$ that partitions the examples into $x_1 \ldots x_{i-1}$ and $x_i \ldots x_N$.

Expressiveness. A decision stump is a (restricted) linear classifier and cannot realize nonlinear labelings. This works in our favor since we are unlikely to run into the degenerate case of perfect edge $\hat{\gamma}_D(h) = 1$. Specifically, given $N$ samples with $2^N$ possible labelings, $\mathcal{H}_{\text{stumps}}$ can realize at most $2dN$ labelings. For instance, if $d = 1$ and $N = 3$ so that the inputs are scalars $x_1 \le x_2 \le x_3$, there exists no stump that can realize $(1, -1, 1)$ or $(-1, 1, -1)$, so only 6 out of 8 possible labelings are realized. Note that even fewer labelings would be realized if there are duplicate input values (e.g., $x_1 = x_2$) since we cannot find a threshold. If we have another dimension in which the inputs are ordered differently (e.g., $x_3 \le x_1 \le x_2$), the missing labelings may be realized. But since each dimension realizes at most $2N$ labelings and many labelings are duplicates across dimensions, $\mathcal{H}_{\text{stumps}}$ can realize at most $2dN$ labelings.

Learning. There are at most $dN$ dimension-wise linear separations of $N$ examples. Since $\mathcal{H}_{\text{stumps}}$ can only induce labelings based on these separations, we can consider any stumps corresponding to these separations to do an exhaustive search over $\mathcal{H}_{\text{stumps}}$. In particular, we use stumps with the $\tau^{(r)}_i$ as thresholds. If we calculated the edge value for each threshold separately, it would take $O(N^2 d)$ time. Below we give a single-sweep approach with a linear runtime $O(Nd)$ as described by Kégl (2009).

DecisionStump
Input:
• $S = \{(x_1, y_1) \ldots (x_N, y_N)\}$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{\pm 1\}$; $[x_1]_r \le \cdots \le [x_N]_r$ for $r = 1 \ldots d$; $\tau^{(r)}_i := ([x_{i-1}]_r + [x_i]_r)/2$
• Distribution $D$ over $\{1 \ldots N\}$
Output: $h^* \in \arg\max_{h \in \mathcal{H}_{\text{stumps}}} \hat{\gamma}_D(h)$
1. $\gamma^* \leftarrow \sum_{i=1}^N D(i) y_i$
2. For $r = 1 \ldots d$:
   (a) $\gamma \leftarrow \sum_{i=1}^N D(i) y_i$
   (b) For $j = 2 \ldots N$ such that $[x_{j-1}]_r < [x_j]_r$:  # Can we use r for thresholding?
      i. $\gamma \leftarrow \gamma - 2 D(j-1) y_{j-1}$
      ii. If $|\gamma| > |\gamma^*|$, set $\gamma^* \leftarrow \gamma$, $r^* \leftarrow r$, and $\tau^* \leftarrow \tau^{(r)}_j$.
3. If $\gamma^* = \sum_{i=1}^N D(i) y_i$, return the constant classifier $x \mapsto \text{sign}(\gamma^*)$. Otherwise, return

$$h(x) \leftarrow \text{sign}(\gamma^*) \times \text{Ind}(x_{r^*} > \tau^*)$$


We maintain the best edge value $\gamma^*$ over all $dN$ linear separations. In each dimension, we start from the edge $\sum_{i=1}^N D(i) y_i$ of the constant classifier $x \mapsto 1$ and subtract $2D(j-1) y_{j-1}$ for $j = 2 \ldots N$. After $j$ subtractions the running value is $\sum_{i=j+1}^N D(i) y_i - \sum_{i=1}^j D(i) y_i$, i.e., the edge of a stump that labels $(-1, \ldots, -1, 1, \ldots, 1)$ with $j$ negative ones. Multiplying the value by its sign makes it nonnegative, implying that the sign is the parameter $b$ of the underlying decision stump. We examine the absolute value of the edge to pick whichever side gives the higher value.
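The sweep can be transcribed into Python as below (a sketch with illustrative names; sorting is done inside each call, so this version costs $O(Nd \log N)$ rather than $O(Nd)$, and the running edge is updated for every $j$ while a threshold is recorded only where consecutive values differ).

import numpy as np

def decision_stump(X, y, D):
    """X: (N, d) inputs; y: (N,) labels in {-1, +1}; D: (N,) distribution.
    Returns (b, r, tau) describing x -> b * Ind(x[r] > tau); tau = -inf encodes
    the constant classifier x -> b."""
    N, d = X.shape
    best_gamma = np.sum(D * y)              # edge of the constant classifier x -> +1
    best = (int(np.sign(best_gamma)) or 1, 0, -np.inf)
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")   # presorting assumption
        xs, ys, Ds = X[order, r], y[order], D[order]
        gamma = np.sum(D * y)
        for j in range(1, N):
            gamma -= 2.0 * Ds[j - 1] * ys[j - 1]     # move example j-1 to the -1 side
            if xs[j - 1] < xs[j] and abs(gamma) > abs(best_gamma):
                best_gamma = gamma
                best = (int(np.sign(gamma)), r, (xs[j - 1] + xs[j]) / 2.0)
    return best

The edge of the returned stump is abs(best_gamma); in AdaBoost (Step 2a), this value plays the role of $\hat{\gamma}_{D_{t-1}}(h_t)$.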

Decision trees. A decision stump is a special case of a decision tree with 2 leaves: $\mathcal{H}_{\text{stumps}} = \mathcal{H}_{\text{trees}(2)}$ (Appendix D). Unlike a decision stump, a decision tree (with more leaves) is nonlinear and can induce $O(2^N)$ labelings on $S$. However, exact learning is intractable and requires heuristics.

4 Generalization
The VC dimension of $\mathcal{H}_{\text{stumps}}$ is 2 (i.e., the maximum number of points that can be shattered by a decision stump is two). Let $\mathcal{H}_{\text{stumps}}^T = \{\sum_{t=1}^T \alpha_t h_t : \alpha_t \ge 0,\, h_t \in \mathcal{H}_{\text{stumps}}\}$. It is intuitively clear that the VC dimension of $\mathcal{H}_{\text{stumps}}^T$ is $2T$. A standard application of Hoeffding's inequality gives the following.
Theorem 4.1. Draw S = {(x1 , y1 ) . . . (xN , yN )} ∼ popN and gT ← AdaBoost(S, Hstumps , T ). Then with high
probability,
$$\Pr_{(x,y) \sim \text{pop}}(y\, g_T(x) \le 0) \le \hat{\epsilon}(g_T) + O\left(\sqrt{\frac{T}{N}}\right)$$

$\hat{\epsilon}(g_T)$ can be further bounded using Theorem 2.2. The bound becomes looser as $T$ increases (due to the increased complexity of $\mathcal{H}_{\text{stumps}}^T$ and the danger of overfitting). In contrast, it is observed empirically that $g_T$ generalizes better as $T$ goes up, even after $\hat{\epsilon}(g_T) = 0$. This motivated researchers to find a better generalization statement based on the margin (Theorem 1, Schapire et al. (1998)).
Theorem 4.2. Draw S = {(x1 , y1 ) . . . (xN , yN )} ∼ popN and gT ← AdaBoost(S, Hstumps , T ). Then with high
probability,
$$\Pr_{(x,y) \sim \text{pop}}(y\, g_T(x) \le 0) \le \frac{1}{N}\sum_{i=1}^N [[y_i g_T(x_i) \le 0.1]] + O\left(\sqrt{\frac{1}{N}}\right)$$

The first term on the RHS is the empirical probability that the ensemble has a small margin on $S$ (the value 0.1 is arbitrarily picked). Intuitively, this becomes smaller as $T$ goes up because AdaBoost focuses on hard examples (≈ support vectors) to increase the margin, even after the training error becomes zero. Since the second term is free of $T$, the bound becomes tighter with more rounds of boosting.

References
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

Kégl, B. (2009). Introduction to AdaBoost.

Schapire, R. E., Freund, Y., Bartlett, P., Lee, W. S., et al. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651–1686.

A Proofs
Proof of Lemma 1.1. We have $y_i h(x_i) = 1 - 2[[y_i h(x_i) \le 0]]$. Thus

$$\hat{\gamma}_D(h) = \sum_{i=1}^N D(i)(1 - 2[[y_i h(x_i) \le 0]]) = 1 - 2\hat{\epsilon}_D(h)$$

Proof of Lemma 1.2. If $\alpha = 0$ the statement holds trivially since $\hat{l}_D(\alpha h) = 1 \ge \hat{\epsilon}_D(h)$. If $\alpha > 0$,

$$\hat{l}_D(\alpha h) = \sum_{i=1}^N D(i) \exp(-y_i \alpha h(x_i)) \ge \sum_{i=1}^N D(i) [[y_i \alpha h(x_i) \le 0]] = \sum_{i=1}^N D(i) [[y_i h(x_i) \le 0]] = \hat{\epsilon}_D(h)$$

Proof of Lemma 1.3. For any $\epsilon \in (0, 1/2]$, consider the objective $J_\epsilon : \mathbb{R} \to \mathbb{R}$ defined by

$$J_\epsilon(\alpha) = \epsilon \exp(\alpha) + (1 - \epsilon)\exp(-\alpha)$$

The first and second derivatives are

$$J_\epsilon'(\alpha) = \epsilon \exp(\alpha) - (1 - \epsilon)\exp(-\alpha)$$
$$J_\epsilon''(\alpha) = \epsilon \exp(\alpha) + (1 - \epsilon)\exp(-\alpha)$$

$J_\epsilon''(\alpha) > 0$ for all $\alpha \in \mathbb{R}$, thus it is sufficient to find a stationary point to find the unique optimal solution. Setting $J_\epsilon'(\alpha) = 0$ gives

$$\epsilon \exp(\alpha) = (1 - \epsilon)\exp(-\alpha) \;\Leftrightarrow\; \log\epsilon + \alpha = \log(1 - \epsilon) - \alpha \;\Leftrightarrow\; \alpha = \log\sqrt{\frac{1 - \epsilon}{\epsilon}}$$

This is nonnegative because $(1 - \epsilon)/\epsilon \ge 1$. The optimal value is

$$J_\epsilon\left(\log\sqrt{\frac{1 - \epsilon}{\epsilon}}\right) = \epsilon \exp\left(\log\sqrt{\frac{1 - \epsilon}{\epsilon}}\right) + (1 - \epsilon)\exp\left(\log\sqrt{\frac{\epsilon}{1 - \epsilon}}\right) = 2\sqrt{\epsilon(1 - \epsilon)}$$

Now we view $\hat{l}_D(\alpha h)$ as a function of $\alpha \in \mathbb{R}$, where

$$\hat{l}_D(\alpha h) = \sum_{i=1}^N D(i)\exp(-y_i \alpha h(x_i)) = \left(\sum_{i=1}^N D(i)[[h(x_i) \ne y_i]]\right)\exp(\alpha) + \left(\sum_{i=1}^N D(i)[[h(x_i) = y_i]]\right)\exp(-\alpha) = \hat{\epsilon}_D(h)\exp(\alpha) + (1 - \hat{\epsilon}_D(h))\exp(-\alpha)$$

Thus $\alpha = \log\sqrt{(1 - \hat{\epsilon}_D(h))/\hat{\epsilon}_D(h)} \ge 0$ is the unique minimizer and $2\sqrt{\hat{\epsilon}_D(h)(1 - \hat{\epsilon}_D(h))}$ is the minimum. Plugging $\hat{\epsilon}_D(h) = \frac{1}{2}(1 - \hat{\gamma}_D(h))$ (Lemma 1.1) into the minimizer we also have $\alpha = \log\sqrt{(1 + \hat{\gamma}_D(h))/(1 - \hat{\gamma}_D(h))}$. To get the expression of the minimum in terms of the edge, we use the algebraic fact $4z(1 - z) = 1 - (1 - 4z + 4z^2) = 1 - (1 - 2z)^2$ for any $z \in \mathbb{R}$. Then

$$2\sqrt{\hat{\epsilon}_D(h)(1 - \hat{\epsilon}_D(h))} = \sqrt{4\hat{\epsilon}_D(h)(1 - \hat{\epsilon}_D(h))} = \sqrt{1 - (1 - 2\hat{\epsilon}_D(h))^2} = \sqrt{1 - \hat{\gamma}_D(h)^2}$$

Proof of Lemma 2.1.

$$D_t(i) = \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_{t-1}(x_i))\exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_{t-1}(x_j))\exp(-y_j \alpha_t h_t(x_j))} = \frac{(\exp(-y_i g_{t-1}(x_i))/C)\exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N (\exp(-y_j g_{t-1}(x_j))/C)\exp(-y_j \alpha_t h_t(x_j))} = \frac{D_{t-1}(i)\exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j \alpha_t h_t(x_j))}$$

where we define the constant $C = \sum_{k=1}^N \exp(-y_k g_{t-1}(x_k))$. This proves the first equality. The second equality holds by the definition of expected weighted exponential loss. The third equality then holds inductively.

Proof of Theorem 2.3. Define $\tilde{D}_t(i) := \exp(-y_i \langle \alpha^{(t)}, H(x_i) \rangle)/M_t$ where $M_t := \sum_{j=1}^N \exp(-y_j \langle \alpha^{(t)}, H(x_j) \rangle)$. We can assume by the technical Lemmas A.1 and A.2 that

$$k_t \in \arg\min_{k \in \{1 \ldots H\}} \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) \qquad \eta_t = \log\sqrt{\frac{1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_{k_t})}{\hat{\epsilon}_{\tilde{D}_{t-1}}(h_{k_t})}}$$

$\tilde{D}_0$ is uniform since $\alpha^{(0)} = 0_H$, hence $\tilde{D}_0 = D_0$. Then $h_{k_1} \in \arg\min_{h \in \mathcal{H}} \hat{\epsilon}_{D_0}(h)$ and $\langle \alpha^{(1)}, H(x) \rangle = \langle \eta_1 e_{k_1}, H(x) \rangle = \eta_1 h_{k_1}(x)$ is the same as the step-1 ensemble in AdaBoost. At step $t > 1$, assume $\langle \alpha^{(t-1)}, H(x) \rangle$ is the same as the step-$(t-1)$ ensemble in AdaBoost. Then $\tilde{D}_{t-1} = D_{t-1}$, so $h_{k_t} \in \arg\min_{h \in \mathcal{H}} \hat{\epsilon}_{D_{t-1}}(h)$ and $\langle \alpha^{(t)}, H(x) \rangle = \langle \alpha^{(t-1)}, H(x) \rangle + \eta_t h_{k_t}(x)$ is the same as the step-$t$ ensemble in AdaBoost.

Lemma A.1.

$$[\nabla \hat{l}_U(\alpha^{(t-1)})]_k = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$

Proof.

$$[\nabla \hat{l}_U(\alpha^{(t-1)})]_k = \frac{\partial}{\partial \alpha_k} \frac{1}{N}\sum_{i=1}^N \exp\left(-y_i \langle \alpha^{(t-1)}, H(x_i) \rangle\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i \langle \alpha^{(t-1)}, H(x_i) \rangle\right) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\tilde{D}_{t-1}(i) = \frac{M_{t-1}}{N}\left(\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\right) = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$

Lemma A.2.

$$\log\sqrt{\frac{1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}{\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}} \in \arg\min_{\eta \in \mathbb{R}} \hat{l}_U(\alpha^{(t-1)} + \eta e_k)$$
Proof.

$$\frac{\partial \hat{l}_U(\alpha^{(t-1)} + \eta e_k)}{\partial \eta} = \frac{\partial}{\partial \eta} \frac{1}{N}\sum_{i=1}^N \exp\left(-y_i \langle \alpha^{(t-1)}, H(x_i) \rangle - \eta y_i h_k(x_i)\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i \langle \alpha^{(t-1)}, H(x_i) \rangle\right)\exp(-\eta y_i h_k(x_i)) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\tilde{D}_{t-1}(i)\exp(-\eta y_i h_k(x_i))$$

$\hat{l}_U(\alpha^{(t-1)} + \eta e_k)$ is a composition of a convex (by premise) and a linear function of $\eta$, thus convex. We set the derivative to zero to find a minimizer:

$$-\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\tilde{D}_{t-1}(i)\exp(-\eta y_i h_k(x_i)) = 0 \;\Leftrightarrow\; \sum_{i=1}^N y_i h_k(x_i)\tilde{D}_{t-1}(i)\exp(-\eta y_i h_k(x_i)) = 0 \;\Leftrightarrow\; -\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)\exp(\eta) + (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\exp(-\eta) = 0$$

Solving for $\eta$ yields $\eta = \log\sqrt{(1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))/\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}$, assuming $\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) \ne 0$.

B Zero-One Loss Form of AdaBoost


AdaBoost (zero-one loss)
Input:
• S = {(x1 , y1 ) . . . (xN , yN )} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers
• Number of boosting rounds T
Output: $g_T : \mathcal{X} \to \mathbb{R}$ with $\hat{\epsilon}(g_T) \le \exp(-\frac{1}{2}\gamma^2 T)$ for some $\gamma \in [0, 1)$ (Theorem 2.2).
1. Set the initial distribution D0 (i) ← 1/N for i = 1 . . . N .
2. For t = 1 . . . T ,

(a) Let $h_t$ be a minimizer of $\hat{\epsilon}_{D_{t-1}}(h) \in [0, 1]$ over $\{h \in \mathcal{H} : \hat{\epsilon}_{D_{t-1}}(h) > 0\}$.
(b) Compute the weight $\alpha_t = \log\sqrt{(1 - \hat{\epsilon}_{D_{t-1}}(h_t))/\hat{\epsilon}_{D_{t-1}}(h_t)} \ge 0$.
(c) Set the next distribution: for $i = 1 \ldots N$,

$$D_t(i) = \begin{cases} \dfrac{1}{2(1 - \hat{\epsilon}_{D_{t-1}}(h_t))} D_{t-1}(i) & \text{if } h_t(x_i) = y_i \\[2ex] \dfrac{1}{2\hat{\epsilon}_{D_{t-1}}(h_t)} D_{t-1}(i) & \text{otherwise} \end{cases}$$

3. Return $g_T(x) \leftarrow \sum_{t=1}^T \alpha_t h_t(x)$.

C Gini Impurity
Given any set of $N$ labeled examples $S$ and a full-support distribution $D$ over their indices, the Gini impurity is defined as $2p(1-p)$ where $p = \sum_{i : y_i = 1} D(i)$ is the probability of drawing label 1 from $S$ under $D$. This is minimized at 0 if all examples are labeled as either 1 or $-1$, and maximized at $1/2$ if $p = 1/2$. A split is optimal if the expected Gini impurity of the resulting partition is the smallest. We can derive an $O(Nd)$-time algorithm to find an optimal split, again assuming that each dimension is sorted:

MinimizeGiniImpurity
Input:
• $S = \{(x_1, y_1) \ldots (x_N, y_N)\}$ where $N \ge 2$, $x_i \in \mathbb{R}^d$, and $y_i \in \{\pm 1\}$; $[x_1]_r \le \cdots \le [x_N]_r$ for $r = 1 \ldots d$; $\tau^{(r)}_i := ([x_{i-1}]_r + [x_i]_r)/2$
• Distribution $D$ over $\{1 \ldots N\}$
Output: split $(r, \tau)$ with minimum Gini impurity, or fail if there is no split
1. $\gamma^* \leftarrow \infty$
2. $D_{+1} \leftarrow \sum_{j=1}^N [[y_j = 1]] D(j)$
3. For $r = 1 \ldots d$:
   (a) $p \leftarrow 0$  # Left label-1 probability
   (b) $p' \leftarrow D_{+1}$  # Right label-1 probability
   (c) $\beta \leftarrow 0$  # Left probability
   (d) For $j = 2 \ldots N$ such that $[x_{j-1}]_r < [x_j]_r$:  # Can we use r for thresholding?
      i. $p \leftarrow p + [[y_{j-1} = 1]] D(j-1)$
      ii. $p' \leftarrow p' - [[y_{j-1} = 1]] D(j-1)$
      iii. $\beta \leftarrow \beta + D(j-1)$
      iv. $\gamma \leftarrow \beta \cdot 2p(1-p) + (1-\beta) \cdot 2p'(1-p')$
      v. If $|\gamma| < |\gamma^*|$, set $\gamma^* \leftarrow \gamma$, $r^* \leftarrow r$, and $\tau^* \leftarrow \tau^{(r)}_j$.
4. If γ ∗ = ∞, fail. Otherwise, return (r∗ , τ ∗ ).
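A direct Python transcription of the sweep is sketched below (illustrative names; the running masses are updated for every $j$, and the impurity is evaluated only where a valid threshold exists).

import numpy as np

def minimize_gini_impurity(X, y, D):
    """X: (N, d); y: (N,) labels in {-1, +1}; D: (N,) distribution.
    Returns (r, tau) of the best split, or None if no split exists."""
    N, d = X.shape
    best, best_score = None, np.inf
    d_plus = np.sum((y == 1) * D)                 # total mass of label-1 examples
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")
        xs, ys, Ds = X[order, r], y[order], D[order]
        p, p_right, beta = 0.0, d_plus, 0.0       # left/right label-1 mass, left mass
        for j in range(1, N):
            p += (ys[j - 1] == 1) * Ds[j - 1]
            p_right -= (ys[j - 1] == 1) * Ds[j - 1]
            beta += Ds[j - 1]
            if xs[j - 1] < xs[j]:
                score = beta * 2 * p * (1 - p) + (1 - beta) * 2 * p_right * (1 - p_right)
                if score < best_score:
                    best_score, best = score, (r, (xs[j - 1] + xs[j]) / 2.0)
    return best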

D Decision Trees

Let $\mathcal{X} = \mathbb{R}^d$. We say $R \subseteq \mathbb{R}^d$ is a hyperrectangle if there exist $a_R, b_R \in (\mathbb{R} \cup \{\pm\infty\})^d$ such that $R = \{x \in \mathbb{R}^d : a_R < x \le b_R\}$ where the inequalities are element-wise. We say $\mathcal{R} = \{R_1 \ldots R_M\}$ is a hyperrectangle partition if $R_1 \ldots R_M$ are hyperrectangles such that $\cup_{R \in \mathcal{R}} R = \mathbb{R}^d$ and $R_i \cap R_j = \emptyset$ for all $i \ne j$. For $x \in \mathbb{R}^d$, we write $\mathcal{R}(x) \in \{1 \ldots M\}$ to denote the unique hyperrectangle in $\mathcal{R}$ that $x$ belongs to. A decision tree is a mapping $x \mapsto \pi(\mathcal{R}(x))$ where $\mathcal{R}$ is a hyperrectangle partition and $\pi : \{1 \ldots M\} \to \{\pm 1\}$ is a region-labeling. It is called a decision tree because it can be expressed as a binary tree with $M$ leaves. Let $\nu$ denote a node object. If internal, it is equipped with a dimension $r \in \{1 \ldots d\}$ and a threshold $\tau \in \mathbb{R}$ as well as left and right child nodes; if a leaf, it is equipped with $y \in \{\pm 1\}$. Given any $(\mathcal{R}, \pi)$, by definition there is a binary tree with a root node $\nu_{\text{root}}$ such that $\pi(\mathcal{R}(x)) = \text{Traverse}(x, \nu_{\text{root}})$ where

Traverse(x, ν)
1. If ν is a leaf node, return ν.y.
2. Otherwise,
(a) If [x]ν.r > ν.τ , return Traverse(x, ν.left)
(b) If [x]ν.r ≤ ν.τ , return Traverse(x, ν.right)

Let Htrees(M ) denote the hypothesis class of decision trees with M leaves. We would like to minimize a loss function
over Htrees(M ) but an exhaustive search is intractable unless M is small (e.g., M = 2 in DecisionStump): the
number of labelings is exponential in N since a decision tree is highly nonlinear (e.g., it can fit the XOR mapping
with M = 4 regions). Thus we adopt a top-down greedy heuristic. We assume a single-dimension splitting algorithm
A that maps any data-distribution pair (S 0 , D0 ) to a best dimension r ∈ {1 . . . d} and threshold τ ∈ R according to
some metric.

10
BuildTree
Input: $S = \{(x_1, y_1) \ldots (x_N, y_N)\}$, distribution $D$ over $\{1 \ldots N\}$, number of splits $P \le \lfloor \log N \rfloor$, single-dimension splitting algorithm $A$
Output: root node $\nu_{\text{root}}$ of a binary tree with at most $2^P$ leaves
1. Initialize νroot and q ← queue([(νroot , S, D)])
2. For P times or until q is empty,
(a) (ν, S 0 , D0 ) ← q.pop(); if the labels in S 0 are pure, go to Step 2.
(b) (ν.r, ν.τ ) ← A(S 0 , D0 )
(c) Partition S 0 into S10 , S20 by thresholding dimension ν.r using ν.τ .
(d) Compute distributions D10 , D20 over S10 , S20 by renormalizing D0 .
(e) Initialize ν.left and push (ν.left, S10 , D10 ) onto q.
(f) Initialize ν.right and push (ν.right, S20 , D20 ) onto q.
3. Return νroot .
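A compact sketch of the greedy construction is given below. It assumes a splitter A(X, y, D) with the same contract as DecisionStump or MinimizeGiniImpurity, returning a dimension and threshold (or None when no split exists); representing nodes as dictionaries is purely illustrative.

from collections import deque
import numpy as np

def build_tree(X, y, D, P, A):
    """Greedy top-down construction with at most P splits (BuildTree sketch)."""
    root = {"leaf": True, "y": int(np.sign(np.sum(D * y))) or 1}
    queue = deque([(root, np.arange(len(y)))])    # node and the indices it covers
    splits = 0
    while queue and splits < P:
        node, idx = queue.popleft()
        if len(np.unique(y[idx])) == 1:           # labels are pure: keep as a leaf
            continue
        Dn = D[idx] / D[idx].sum()                # renormalized distribution
        split = A(X[idx], y[idx], Dn)
        if split is None:
            continue
        r, tau = split[-2], split[-1]             # splitter may also return a sign b
        if not np.isfinite(tau):
            continue                              # constant split: nothing to do
        node.update(leaf=False, r=r, tau=tau)
        above = idx[X[idx, r] > tau]              # Traverse goes left iff x[r] > tau
        below = idx[X[idx, r] <= tau]
        for side, sub in (("left", above), ("right", below)):
            child = {"leaf": True, "y": int(np.sign(np.sum(D[sub] * y[sub]))) or 1}
            node[side] = child
            queue.append((child, sub))
        splits += 1
    return root

def traverse(x, node):
    while not node["leaf"]:
        node = node["left"] if x[node["r"]] > node["tau"] else node["right"]
    return node["y"]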

This is a heuristic with no optimality guarantee. For example, if we use DecisionStump (which does find an optimal stump) as our choice of $A$, the resulting tree is generally suboptimal for $P \ge 2$: that is, there may be a different tree $h \in \mathcal{H}_{\text{trees}(2^P)}$ that achieves a larger edge. One popular approach is to grow a full tree (i.e., $P = \lfloor \log N \rfloor$) using a "label purity" splitting metric (e.g., Gini impurity, Appendix C) and then prune the full tree to minimize the misclassification rate of the leaf nodes. Decision trees can be used as base classifiers of AdaBoost, regardless of the specifics of how they are learned.

E Gradient Boosting
Let $\mathcal{F}$ denote the set of all functions $f : \mathcal{X} \to \mathbb{R}$. This is a vector space because functions are closed under (element-wise) addition and scalar multiplication. We may assume an inner product $\langle \cdot, \cdot \rangle : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$, with the norm $||f|| := \sqrt{\langle f, f \rangle}$, for certain subspaces like square-integrable functions or an RKHS.¹ We will assume that $\mathcal{F}$ is an RKHS.

A functional is a mapping $E : \mathcal{F} \to \mathbb{R}$. We say $E$ is differentiable at $f \in \mathcal{F}$ if there is a function $\nabla E(f) \in \mathcal{F}$ such that

$$\lim_{||g|| \to 0} \frac{|E(f + g) - (E(f) + \langle \nabla E(f), g \rangle)|}{||g||} = 0$$

Thus $E(f) + \langle \nabla E(f), g \rangle$ is a linear approximation of $E$ around $f$. We call $\nabla E(f) \in \mathcal{F}$ the (functional) gradient of $E$ at $f$. We have the following results from functional analysis:

$$\nabla(aE_1 + bE_2)(f) = a\nabla E_1(f) + b\nabla E_2(f) \qquad \text{(linearity of functional differentiation)}$$
$$\nabla(g \circ E)(f) = g'(E(f))\nabla E(f) \qquad \text{(chain rule for any differentiable } g : \mathbb{R} \to \mathbb{R})$$
Some examples:
• $f(x)$, the evaluation functional for $x \in \mathcal{X}$. The gradient is $K(\cdot, x)$ since $(f + g)(x) = f(x) + g(x) = f(x) + \langle K(\cdot, x), g \rangle$.
• $||f||^2$, the squared-norm functional. The gradient is $2f$ since $||f + g||^2 = ||f||^2 + \langle 2f, g \rangle + ||g||^2$.
• $\hat{L}_{\text{sq}}(f) := \frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2$, the squared-error functional for $(x_1, y_1) \ldots (x_N, y_N) \in \mathcal{X} \times \mathbb{R}$. By the linearity of differentiation and the chain rule,

$$\nabla\hat{L}_{\text{sq}}(f) = \frac{1}{2}\sum_{i=1}^N \nabla(y_i - f(x_i))^2 = -\sum_{i=1}^N (y_i - f(x_i)) K(\cdot, x_i)$$
¹ If $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS), it means there is some associated kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that every $f \in \mathcal{F}$ can be written as $f = \sum_{x \in C_f} \alpha^f_x K(\cdot, x)$ where $C_f \subset \mathcal{X}$ is a finite set of "center points" and $\alpha^f_x$ is a coefficient for each center point. Note that $K(\cdot, x) \in \mathcal{F}$ for every $x \in \mathcal{X}$. The inner product is defined as $\langle f, g \rangle := (\alpha^f)^\top K^{f,g} \alpha^g$, in particular

$$\langle f, K(\cdot, x) \rangle = \sum_{x' \in C_f} \alpha^f_{x'} K(x', x) = f(x)$$

hence the name "reproducing". By the Moore-Aronszajn theorem, any kernel induces a unique associated RKHS.

Functional gradient descent. Let $\hat{L} : \mathcal{F} \to \mathbb{R}$ be a differentiable loss over $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \subset \mathcal{X} \times \mathbb{R}$ (e.g., $\hat{L}_{\text{sq}}$). We may do steepest descent on $\hat{L}$ over $\mathcal{F}$. This means we pick an initial $f_0 \in \mathcal{F}$ and for $t = 1 \ldots T$ find $g_t \in \mathcal{F}$ with a bounded norm such that $\hat{L}(f_{t-1} + g_t)$ is small. Using the linear approximation $\hat{L}(f_{t-1} + g_t) \approx \hat{L}(f_{t-1}) + \langle \nabla\hat{L}(f_{t-1}), g_t \rangle$, the steepest descent direction is given by $g_t = \eta_t(-\nabla\hat{L}(f_{t-1}))$ where $\eta_t \in \mathbb{R}$. Then we set

$$f_t = f_{t-1} + g_t$$

Note that $f_T = f_0 + \sum_{t=1}^T g_t$ can be viewed as an ensemble; at test time, given $x \in \mathcal{X}$ the model computes $f_0(x) \in \mathbb{R}$ and $g_1(x) \ldots g_T(x) \in \mathbb{R}$, then returns $f_T(x) = f_0(x) + g_1(x) + \cdots + g_T(x)$.

While conceptually nice, functional gradient descent is not implementable due to its nonparametric nature. For instance, the descent direction $-\nabla\hat{L}_{\text{sq}}(f_{t-1}) = \sum_{i=1}^N (y_i - f_{t-1}(x_i)) K(\cdot, x_i)$ is an abstract real-valued mapping over $\mathcal{X}$. However, we can easily evaluate it on $S$. Define

$$S_t = \left\{(x_i, y_i') : y_i' = -\nabla\hat{L}(f_{t-1})(x_i),\; i \in \{1 \ldots N\}\right\}$$

We can treat $S_t$ as labeled data and fit some parametric model to approximate $h_t \approx -\nabla\hat{L}(f_{t-1})$. We decide $\eta_t$ by some learning rate schedule or line search (i.e., $\min_{\eta \in \mathbb{R}} \hat{L}(f_{t-1} + \eta h_t)$), and set $g_t = \eta_t h_t$.

E.1 Application to Regression


Let $K(x, x') = [[x = x']]$ be the identity kernel and $\mathcal{F}$ be the associated RKHS. Given $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \subset \mathcal{X} \times \mathbb{R}$, we optimize the squared-error functional $\hat{L}_{\text{sq}}(f) = \frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2$. For any $i \in \{1 \ldots N\}$,

$$-\nabla\hat{L}_{\text{sq}}(f)(x_i) = y_i - f(x_i)$$

namely the $i$-th residual of $f$. We will use the squared loss to fit the residuals. We assume a parametric hypothesis class that is easy to optimize over (e.g., decision trees, linear regressors). With the constant step size $\eta_t = 1$, the gradient boosting algorithm is
1. Initialize $h_0 \in \arg\min_{h : \mathcal{X} \to \mathbb{R}} \sum_{i=1}^N (y_i - h(x_i))^2$.
2. For $t = 1 \ldots T$, find

$$h_t \in \arg\min_{h : \mathcal{X} \to \mathbb{R}} \sum_{i=1}^N \left(\left(y_i - \sum_{s=0}^{t-1} h_s(x_i)\right) - h(x_i)\right)^2$$

3. Given $x \in \mathcal{X}$, predict $\sum_{t=0}^T h_t(x)$.
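A minimal Python sketch of this procedure (with sklearn's DecisionTreeRegressor as the residual fitter, a constant mean as the initializer $h_0$, and unit step size; these choices are simplifications for illustration, not part of the recipe above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # any squared-loss regressor works

def gradient_boost_regression(X, y, T, max_depth=1):
    """Least-squares gradient boosting with unit step size (Appendix E.1 sketch)."""
    intercept = float(np.mean(y))                # h_0: constant least-squares fit
    pred = np.full(len(y), intercept)
    models = []
    for t in range(T):
        residual = y - pred                      # negative gradient evaluated on S
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        models.append(h)
        pred += h.predict(X)                     # f_t = f_{t-1} + h_t (eta_t = 1)
    return intercept, models

def predict(X_new, intercept, models):
    return intercept + sum(h.predict(X_new) for h in models)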

E.2 Application to Classification


Consider $K$-way classification. A classifier can be viewed as an array of $K$ regressors $f(x) = (f^{(1)}(x) \ldots f^{(K)}(x)) \in \mathbb{R}^K$ where $f^{(k)} \in \mathcal{F}$ calculates the $k$-th logit for input $x \in \mathcal{X}$. Given labeled data $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \subset \mathcal{X} \times \{1 \ldots K\}$, the cross-entropy functional $\hat{L}_{\text{CE}} : \mathcal{F}^K \to \mathbb{R}$ is defined as

$$\hat{L}_{\text{CE}}(f) := \sum_{i=1}^N \left(\log\sum_{k=1}^K \exp\left(f^{(k)}(x_i)\right) - f^{(y_i)}(x_i)\right)$$

For $k \in \{1 \ldots K\}$, the negative functional gradient of $\hat{L}_{\text{CE}}(f)$ with respect to $f^{(k)}$ is

$$-\nabla_k \hat{L}_{\text{CE}}(f) = \sum_{i=1}^N \left([[y_i = k]] - p_f(k|x_i)\right) K(\cdot, x_i)$$

where $p_f(y|x) := \exp(f^{(y)}(x)) / \sum_{k=1}^K \exp(f^{(k)}(x))$. Assuming the identity kernel, upon receiving $x_i$ it evaluates to

$$-\nabla_k \hat{L}_{\text{CE}}(f)(x_i) = [[y_i = k]] - p_f(k|x_i)$$

namely the i-th residual of f for the k-th class. We will again use the squared loss to fit the residuals, an easy-to-
optimize-over hypothesis class, and ηt = 1. Then the gradient boosting algorithm is

1. Initialize $h_0 \in \arg\min_{h : \mathcal{X} \to \mathbb{R}^K} \sum_{i=1}^N \left(\log\sum_{k=1}^K \exp h^{(k)}(x_i) - h^{(y_i)}(x_i)\right)$.
2. For $t = 1 \ldots T$, for $k = 1 \ldots K$ find

$$h_t^{(k)} \in \arg\min_{h : \mathcal{X} \to \mathbb{R}} \sum_{i=1}^N \left([[y_i = k]] - \frac{\exp\left(\sum_{s=0}^{t-1} h_s^{(k)}(x_i)\right)}{\sum_{k'=1}^K \exp\left(\sum_{s=0}^{t-1} h_s^{(k')}(x_i)\right)} - h(x_i)\right)^2$$

3. Given $x \in \mathcal{X}$, predict $y^* \in \arg\max_{y \in \{1 \ldots K\}} \sum_{t=0}^T h_t^{(y)}(x)$.
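The following Python sketch mirrors this procedure, again using sklearn's DecisionTreeRegressor as the squared-loss fitter and initializing the logits to zero rather than solving the cross-entropy minimization in step 1; both are simplifications for illustration. Labels are assumed to be encoded as integers in {0, ..., K-1}.

import numpy as np
from sklearn.tree import DecisionTreeRegressor  # any squared-loss regressor works

def gradient_boost_classification(X, y, K, T, max_depth=1):
    """Cross-entropy gradient boosting (Appendix E.2 sketch)."""
    N = len(y)
    F = np.zeros((N, K))            # running logits on the training set (h_0 = 0)
    onehot = np.eye(K)[y]           # [[y_i = k]]
    models = []                     # models[t][k] fits the class-k residuals at round t
    for t in range(T):
        p = np.exp(F - F.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities p_f(k | x_i)
        round_models = []
        for k in range(K):
            residual = onehot[:, k] - p[:, k]    # [[y_i = k]] - p_f(k | x_i)
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            F[:, k] += h.predict(X)              # eta_t = 1
            round_models.append(h)
        models.append(round_models)
    return models

def predict(X_new, models, K):
    logits = np.zeros((len(X_new), K))
    for round_models in models:
        for k, h in enumerate(round_models):
            logits[:, k] += h.predict(X_new)
    return logits.argmax(axis=1)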

