AdaBoost
Karl Stratos
1 Empirical Loss
Let pop denote a population distribution over input-label pairs in X ×{±1}. Let (x1 , y1 ) . . . (xN , yN ) ∼ pop denote
iid samples. Given f : X → R, define
$$\hat{\epsilon}(f) := \frac{1}{N}\sum_{i=1}^N [[y_i f(x_i) \le 0]] \qquad \text{(empirical zero-one loss)}$$
$$\hat{\ell}(f) := \frac{1}{N}\sum_{i=1}^N \exp(-y_i f(x_i)) \qquad \text{(empirical exponential loss)}$$
$\hat{\ell}(f)$ is a convex upper bound on $\hat{\epsilon}(f)$. More generally, given a distribution $D$ over $\{1 \ldots N\}$, define
$$\hat{\epsilon}_D(f) := \sum_{i=1}^N D(i)\,[[y_i f(x_i) \le 0]] \qquad \text{(expected empirical zero-one loss)}$$
$$\hat{\ell}_D(f) := \sum_{i=1}^N D(i)\exp(-y_i f(x_i)) \qquad \text{(expected empirical exponential loss)}$$
$$\hat{\gamma}_D(f) := \sum_{i=1}^N D(i)\,y_i f(x_i) \qquad \text{(expected empirical margin, aka edge)}$$
Lemma 1.1. For any $h : X \to \{\pm 1\}$ and distribution $D$ over $\{1 \ldots N\}$, $\hat{\gamma}_D(h) = 1 - 2\hat{\epsilon}_D(h)$.

Lemma 1.2. For any $h : X \to \{\pm 1\}$, distribution $D$, and $\alpha \ge 0$, $\hat{\ell}_D(\alpha h) \ge \hat{\epsilon}_D(h)$.

Lemma 1.3. For any $h : X \to \{\pm 1\}$ and distribution $D$ with $\hat{\epsilon}_D(h) \in (0, 1/2]$,
$$\min_{\alpha\in\mathbb{R}} \hat{\ell}_D(\alpha h) = \sqrt{1 - \hat{\gamma}_D(h)^2}, \qquad \text{attained uniquely at } \alpha = \log\sqrt{\frac{1-\hat{\epsilon}_D(h)}{\hat{\epsilon}_D(h)}} = \log\sqrt{\frac{1+\hat{\gamma}_D(h)}{1-\hat{\gamma}_D(h)}} \ge 0.$$
In words: we can approximately minimize $\hat{\epsilon}_D(h)$ by tightening the upper bound $\hat{\ell}_D(\alpha h)$, where $\alpha \ge 0$ controls the margin. The better $h$ is, the larger the optimal $\alpha$. This is useful because $\hat{\ell}_D$ is easier to optimize than $\hat{\epsilon}_D$.
2 Ensemble
Let $h_1 \ldots h_T : X \to \{\pm 1\}$. An ensemble with weights $\alpha_1 \ldots \alpha_T \in \mathbb{R}$ is the function $g_T : X \to \mathbb{R}$ defined by
$$g_T(x) := \sum_{t=1}^T \alpha_t h_t(x)$$
For $t = 1 \ldots T$, define a distribution $D_t$ over $\{1 \ldots N\}$ by
$$D_t(i) := \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_t(x_i))}{N\,\hat{\ell}(g_t)}$$
A softmax over negative margins is simply normalized exponential losses. We can pull $(\alpha_1, h_1) \ldots (\alpha_{t-1}, h_{t-1})$ out of $D_t$ to express it as a function of $D_{t-1}$ and $(\alpha_t, h_t)$, thanks to the linearity of $g_t = \sum_{s=1}^t \alpha_s h_s$. For convenience we define $g_0(x) = 0$ so that $D_0(i) = 1/N$.
Lemma 2.1. For $t \ge 1$,
$$D_t(i) = \frac{D_{t-1}(i)\exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j \alpha_t h_t(x_j))} = \frac{D_{t-1}(i)\exp(-y_i \alpha_t h_t(x_i))}{\hat{\ell}_{D_{t-1}}(\alpha_t h_t)} = \frac{\exp(-y_i g_t(x_i))}{N\prod_{s=1}^{t} \hat{\ell}_{D_{s-1}}(\alpha_s h_s)}$$
This suggests a greedy strategy: minimize $\hat{\ell}(g_t)$ over $\alpha_t \in \mathbb{R}$ while holding $\alpha_1 \ldots \alpha_{t-1}$ fixed. This reduces to minimizing $\hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ where $D_{t-1}$ is constant. From Lemma 1.3 we know the optimal solution
$$\alpha_t = \log\sqrt{\frac{1-\hat{\epsilon}_{D_{t-1}}(h_t)}{\hat{\epsilon}_{D_{t-1}}(h_t)}} = \log\sqrt{\frac{1+\hat{\gamma}_{D_{t-1}}(h_t)}{1-\hat{\gamma}_{D_{t-1}}(h_t)}} \ge 0 \qquad (2)$$
Plugging Eq. (2) in the recursive expression of $D_t$ in Lemma 2.1, we can verify the following update:
$$D_t(i) = \begin{cases} \dfrac{1}{2(1-\hat{\epsilon}_{D_{t-1}}(h_t))}\,D_{t-1}(i) = \dfrac{1}{1+\hat{\gamma}_{D_{t-1}}(h_t)}\,D_{t-1}(i) & \text{if } h_t(x_i) = y_i \\[2ex] \dfrac{1}{2\hat{\epsilon}_{D_{t-1}}(h_t)}\,D_{t-1}(i) = \dfrac{1}{1-\hat{\gamma}_{D_{t-1}}(h_t)}\,D_{t-1}(i) & \text{otherwise} \end{cases}$$
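As a quick numerical check (values chosen purely for illustration): if the base classifier has edge $\hat{\gamma}_{D_{t-1}}(h_t) = 0.2$, equivalently weighted error $\hat{\epsilon}_{D_{t-1}}(h_t) = 0.4$, then
$$\alpha_t = \log\sqrt{\frac{1.2}{0.8}} \approx 0.203, \qquad D_t(i) = \frac{D_{t-1}(i)}{1.2} \text{ if } h_t(x_i) = y_i, \qquad D_t(i) = \frac{D_{t-1}(i)}{0.8} \text{ otherwise},$$
and the new weights remain normalized since $0.6/1.2 + 0.4/0.8 = 1$.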
The edge formulation, where $\hat{\gamma}_{D_{t-1}}(h_t) \in [0, 1)$, makes it clear that the probability of an example is downweighted if correctly classified but upweighted if misclassified. This is the AdaBoost algorithm. We state it below in the edge form; the zero-one loss form is found in Appendix B.
AdaBoost
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers h : X → {±1}, number of rounds T
Output: ensemble gT : X → R
1. Set D0(i) ← 1/N for i = 1 . . . N.
2. For t = 1 . . . T:
(a) Train a base classifier ht ∈ H with a large edge γ̂_{D_{t−1}}(ht) > 0.
(b) Set αt ← log √((1 + γ̂_{D_{t−1}}(ht)) / (1 − γ̂_{D_{t−1}}(ht))).
(c) For i = 1 . . . N: set Dt(i) ← Dt−1(i)/(1 + γ̂_{D_{t−1}}(ht)) if ht(xi) = yi, and Dt(i) ← Dt−1(i)/(1 − γ̂_{D_{t−1}}(ht)) otherwise.
3. Return gT(x) ← Σ_{t=1}^T αt ht(x).
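A minimal Python/NumPy sketch of the edge-form algorithm above. The weak_learner argument is an assumed placeholder for any routine that returns a ±1-valued classifier trained on the weighted sample (e.g., DecisionStump from Section 3); it is not part of the notes.

import numpy as np

def adaboost(X, y, weak_learner, T):
    """Edge-form AdaBoost sketch. X: (N, d) array, y: (N,) labels in {-1, +1}.
    weak_learner(X, y, D) is assumed to return a callable h with h(X) in {-1, +1}^N."""
    N = len(y)
    D = np.full(N, 1.0 / N)                       # D_0(i) = 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)                 # base classifier, ideally with positive edge
        pred = h(X)
        gamma = float(np.sum(D * y * pred))       # edge = expected empirical margin under D
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))   # Eq. (2); assumes gamma in (-1, 1)
        # Edge-form reweighting: shrink correct examples, grow misclassified ones.
        D = np.where(pred == y, D / (1 + gamma), D / (1 - gamma))
        classifiers.append(h)
        alphas.append(alpha)
    def g(Xnew):                                  # ensemble g_T(x) = sum_t alpha_t h_t(x)
        return sum(a * h(Xnew) for a, h in zip(alphas, classifiers))
    return g

A degenerate perfect edge (gamma = 1) would make the update divide by zero; as noted in Section 3, restricted base classes such as decision stumps make this case unlikely.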
Theorem 2.2. $g_T \leftarrow \text{AdaBoost}(S, H, T)$ satisfies
$$\hat{\epsilon}(g_T) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2\, T\right)$$
where $\hat{\gamma}_{\min} := \min_{t=1}^{T} \hat{\gamma}_{D_{t-1}}(h_t)$ is the minimum edge among the $T$ base classifiers trained in AdaBoost.
Proof.
$$\hat{\epsilon}(g_T) \le \hat{\ell}(g_T) = \prod_{t=1}^T \sqrt{1 - \hat{\gamma}_{D_{t-1}}(h_t)^2} \le \prod_{t=1}^T \exp\left(-\frac{1}{2}\hat{\gamma}_{D_{t-1}}(h_t)^2\right) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2\, T\right)$$
where the equality uses the factorization $\hat{\ell}(g_T) = \prod_{t=1}^T \hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ (Lemma 2.1) together with the minimum value in Lemma 1.3, and the middle inequality uses $1 - z \le \exp(-z)$.
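A standard consequence worth spelling out: since $\hat{\epsilon}(g_T)$ only takes values in $\{0, 1/N, 2/N, \ldots\}$, the bound forces the training error to be exactly zero once the right-hand side drops below $1/N$, i.e.,
$$T > \frac{2\ln N}{\hat{\gamma}_{\min}^2} \;\Longrightarrow\; \exp\left(-\tfrac{1}{2}\hat{\gamma}_{\min}^2\, T\right) < \frac{1}{N} \;\Longrightarrow\; \hat{\epsilon}(g_T) = 0.$$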
2.2 Interpretations
We motivated AdaBoost as a greedy sequential minimization of $\hat{\ell}(g_T) = \prod_{t=1}^T \hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ (Eq. (1)), which upper bounds $\hat{\epsilon}(g_T)$. AdaBoost can also be derived as an adversarial step-wise calibration of $D_t$: select $\alpha_t$ so that
$$\hat{\epsilon}_{D_t}(h_t) = \sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_t(i) = \frac{\sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_{t-1}(i)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j\alpha_t h_t(x_j))} = \frac{\hat{\epsilon}_{D_{t-1}}(h_t)\exp(\alpha_t)}{\hat{\epsilon}_{D_{t-1}}(h_t)\exp(\alpha_t) + (1-\hat{\epsilon}_{D_{t-1}}(h_t))\exp(-\alpha_t)}$$
is equal to $1/2$. Solving for $\alpha_t$ yields the same solution as Eq. (2). More generally, AdaBoost can be seen as coordinate descent on $\hat{\epsilon}(g)$ over all ensembles $g$. For simplicity assume a finite hypothesis class $H = \{h_1 \ldots h_H\}$ and write $H(x) := (h_1(x) \ldots h_H(x)) \in \{\pm 1\}^H$. Then any ensemble can be written as
$$\langle \alpha, H(x)\rangle = \sum_{k=1}^H \alpha_k h_k(x)$$
where $\alpha \in \mathbb{R}^H$. Thus the goal is simplified to finding $\alpha$, which implicitly finds base classifiers. Let $U$ be a convex upper bound on the zero-one loss: $U(z) \ge [[z \le 0]]$ where $z$ is the margin. This gives a convex loss of $\langle\alpha, H\rangle$,
$$\hat{\ell}_U(\alpha) := \frac{1}{N}\sum_{i=1}^N U\left(y_i\langle\alpha, H(x_i)\rangle\right)$$
which upper bounds $\hat{\epsilon}(\langle\alpha, H\rangle)$. By minimizing $\hat{\ell}_U(\alpha)$ over $\alpha \in \mathbb{R}^H$, we implicitly minimize $\hat{\epsilon}(g)$ over all ensembles $g$. We do coordinate descent: at each step, find the coordinate $k \in \{1 \ldots H\}$ with the largest decrease in the loss (equivalently, steepest descent with an $\ell_1$-norm constraint) and take a step in that coordinate to $\alpha + \eta e_k$, where $\eta \in \mathbb{R}$ is an optimal step size.
AdaBoost is a special case with the exponential loss $U(z) = \exp(-z)$. The main idea is that by initializing $\alpha = 0_H$, each step corresponds to adding a single classifier in $H$. The coordinate and step size coincide with the classifier and its weight selected in AdaBoost. This is due to the recursive property of the exponential derivative,
$$\frac{\partial}{\partial\alpha_k}\hat{\ell}_U(\alpha) = \frac{1}{N}\sum_{i=1}^N \left(-y_i h_k(x_i)\right)\exp\left(-y_i\langle\alpha, H(x_i)\rangle\right)$$
The partial derivative "selects" $h_k$ while keeping the exponential loss, which can be normalized to give an expression in expected zero-one loss. The partial derivative of $\hat{\ell}_U(\alpha + \eta e_k)$ (which is convex in $\eta$) with respect to $\eta$ is similar.
CoordinateDescent
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Finite hypothesis class H = {h1 . . . hH}, where we write H(x) := (h1(x) . . . hH(x)) ∈ {±1}^H
• Convex upper bound U(z) ≥ [[z ≤ 0]], which approximates ε̂(g) over all possible ensembles g by ℓ̂_U(α) := (1/N) Σ_{i=1}^N U(yi ⟨α, H(xi)⟩) ≥ ε̂(⟨α, H⟩)
• Number of steps T
1. Initialize α^(0) ← 0_H.
2. For t = 1 . . . T: find
(kt, ηt) ∈ argmin_{k ∈ {1...H}, η ∈ R} ℓ̂_U(α^(t−1) + η e_k)
and set α^(t) ← α^(t−1) + ηt e_{kt} where e_{kt} is the kt-th basis vector.
3. Return α^(T).
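A minimal Python sketch of CoordinateDescent specialized to the exponential surrogate U(z) = exp(−z). The prediction matrix Hx is an illustration device (not notation from the notes) holding the values h_k(x_i) of the finite class; with this U, the selected coordinate and step size coincide with the base classifier and weight chosen by AdaBoost.

import numpy as np

def coordinate_descent_exp(Hx, y, T):
    """Hx: (N, H) matrix with Hx[i, k] = h_k(x_i) in {-1, +1}; y: (N,) labels in {-1, +1}.
    Runs T steps of coordinate descent on the exponential surrogate loss."""
    N, H = Hx.shape
    alpha = np.zeros(H)                                 # alpha^(0) = 0_H
    for _ in range(T):
        w = np.exp(-y * (Hx @ alpha))                   # unnormalized exponential losses
        D = w / w.sum()                                 # the current distribution D~_{t-1}
        mistakes = ((Hx * y[:, None]) < 0).astype(float)
        eps = mistakes.T @ D                            # weighted zero-one loss of each h_k under D
        k = int(np.argmin(eps))                         # best coordinate (assumes some eps[k] in (0, 1/2))
        eta = 0.5 * np.log((1 - eps[k]) / eps[k])       # optimal step size (Lemma A.2)
        alpha[k] += eta
    return alpha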
This interpretation generalizes AdaBoost to other convex surrogates of the zero-one loss (e.g., hinge or logistic). It
also hints at a deeper connection between gradient descent and ensemble learning. Namely, taking gradient steps in
a function space is equivalent to taking an ensemble of functions. This motivates gradient boosting (Appendix E);
CoordinateDescent can be seen as a special case of gradient boosting.
3 Decision Stumps
A popular version of AdaBoost assumes the input space $X = \mathbb{R}^d$ and the hypothesis class of decision stumps
$$H_{\text{stumps}} := \left\{ x \mapsto b\,\mathrm{Ind}([x]_r > \tau) \;:\; r \in \{1 \ldots d\},\; \tau \in \mathbb{R},\; b \in \{\pm 1\}\right\}$$
where $\mathrm{Ind}(A)$ is $1$ if $A$ is true and $-1$ otherwise. Without loss of generality we assume that each dimension is sorted, so that $[x_1]_r \le \cdots \le [x_N]_r$ for $r = 1 \ldots d$ (we can implement this assumption by iterating through examples in a presorted list for each $r$). Under this assumption, we define for each $r = 1 \ldots d$ and $i = 2 \ldots N$ the candidate threshold
$$\tau_i^{(r)} := \frac{[x_{i-1}]_r + [x_i]_r}{2}$$
Expressiveness. A decision stump is a (restricted) linear classifier and cannot realize nonlinear labelings. This
works in our favor since we are unlikely to run into the degenerate case of perfect edge γ̂D (h) = 1. Specifically, given
N samples with $2^N$ possible labelings, Hstumps can realize at most 2dN labelings. For instance, if d = 1 and N = 3
so that the inputs are scalars x1 ≤ x2 ≤ x3 , there exists no stump that can realize (1, −1, 1) or (−1, 1, −1), so only
6 out of 8 possible labelings are realized. Note that even fewer labelings would be realized if there are duplicate
input values (e.g., x1 = x2 ) since we cannot find a threshold. If we have another dimension in which the inputs are
ordered differently (e.g., x3 ≤ x1 ≤ x2 ), the missing labelings may be realized. But since each dimension realizes at
most 2N labelings and many labelings are duplicates across dimensions, Hstumps can realize at most 2dN labelings.
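The counting claim is easy to verify directly for the d = 1, N = 3 example; the following small Python snippet (with arbitrary illustrative inputs) enumerates the labelings realizable by stumps and confirms that only 6 of the 8 labelings appear.

import itertools
import numpy as np

# Enumerate labelings of three sorted scalar inputs realizable by stumps x -> b * Ind(x > tau).
x = np.array([1.0, 2.0, 3.0])              # x1 <= x2 <= x3 (illustrative values)
thresholds = [0.5, 1.5, 2.5, 3.5]          # below all points, between consecutive points, above all
labelings = set()
for b, tau in itertools.product([+1, -1], thresholds):
    labelings.add(tuple(b * np.where(x > tau, 1, -1)))
print(len(labelings))                      # 6
print((1, -1, 1) in labelings)             # False: not realizable by any stump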
Learning. There are at most $dN$ dimension-wise linear separations of $N$ examples. Since $H_{\text{stumps}}$ can only induce labelings based on these separations, it suffices to consider one stump per separation to do an exhaustive search over $H_{\text{stumps}}$. In particular, we use stumps with the thresholds $\tau_i^{(r)}$ defined above. Naively calculating the edge value for each threshold would take $O(N^2 d)$ time. Below we give a single-sweep approach with a linear runtime $O(Nd)$, as described by Kégl (2009).
DecisionStump
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ R^d and yi ∈ {±1}; [x1]_r ≤ · · · ≤ [xN]_r for r = 1 . . . d; τ_i^(r) := ([x_{i−1}]_r + [x_i]_r)/2
• Distribution D over {1 . . . N}
Output: h* ∈ argmax_{h ∈ Hstumps} γ̂_D(h)
1. γ* ← Σ_{i=1}^N D(i) yi
2. For r = 1 . . . d:
(a) γ ← Σ_{i=1}^N D(i) yi
(b) For j = 2 . . . N such that [x_{j−1}]_r < [x_j]_r: # Can we use r for thresholding?
i. γ ← γ − 2 D(j − 1) y_{j−1}
ii. If |γ| > |γ*|, set γ* ← γ, r* ← r, and τ* ← τ_j^(r).
3. If γ* = Σ_{i=1}^N D(i) yi, return the constant classifier x ↦ sign(γ*). Otherwise, return x ↦ sign(γ*) Ind([x]_{r*} > τ*).
We maintain the best edge value $\gamma^*$ over all $dN$ linear separations. In each dimension $r$, we start from the edge $\sum_{i=1}^N D(i)y_i$ of the constant classifier $x \mapsto 1$ and subtract $2D(j-1)y_{j-1}$ for $j = 2 \ldots N$. After processing $j$, the running value is $\sum_{i=j}^N D(i)y_i - \sum_{i=1}^{j-1} D(i)y_i$, i.e., the edge of the stump $x \mapsto \mathrm{Ind}([x]_r > \tau_j^{(r)})$ that labels the sorted examples $(-1, \ldots, -1, 1, \ldots, 1)$ with $j-1$ negative ones. Multiplying this value by its sign makes it nonnegative, implying that the sign is the parameter $b$ of the underlying decision stump. We thus examine the absolute value of the edge to pick whichever side gives the larger value.
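A Python sketch of the single sweep; the argsort-based bookkeeping is an implementation choice of this sketch rather than part of the pseudocode, and the running edge is always updated so that it stays valid across duplicate input values (only the threshold recording is skipped at duplicates).

import numpy as np

def decision_stump(X, y, D):
    """Single-sweep search for a maximum-edge decision stump.
    X: (N, d) array, y: (N,) labels in {-1, +1}, D: (N,) distribution over examples.
    Returns (b, r, tau) describing x -> b * Ind(x[r] > tau); r is None for a constant classifier."""
    N, d = X.shape
    base = float(np.sum(D * y))                      # edge of the constant classifier x -> +1
    best_gamma, best_r, best_tau = base, None, None
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")   # realize the presorted-dimension assumption
        xs, ys, ds = X[order, r], y[order], D[order]
        gamma = base
        for j in range(1, N):
            gamma -= 2.0 * ds[j - 1] * ys[j - 1]     # example j-1 now falls on the -1 side
            # A threshold between positions j-1 and j exists only if the values differ.
            if xs[j - 1] < xs[j] and abs(gamma) > abs(best_gamma):
                best_gamma, best_r, best_tau = gamma, r, 0.5 * (xs[j - 1] + xs[j])
    b = 1 if best_gamma >= 0 else -1
    return b, best_r, best_tau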
Decision trees. A decision stump is a special case of a decision tree with 2 leaves: $H_{\text{stumps}} = H_{\text{trees}(2)}$ (Appendix D). Unlike a decision stump, a decision tree (with more leaves) is nonlinear and can induce $O(2^N)$ labelings on $S$. However, exact learning is intractable and requires heuristics.
4 Generalization
The VC dimension of $H_{\text{stumps}}$ is 2 (i.e., the maximum number of points that can be shattered by a decision stump is two). Let $H_{\text{stumps}}^T := \{\sum_{t=1}^T \alpha_t h_t : \alpha_t \ge 0,\ h_t \in H_{\text{stumps}}\}$. It is intuitively clear that the VC dimension of $H_{\text{stumps}}^T$ is $2T$. A standard application of Hoeffding's inequality gives the following.
Theorem 4.1. Draw $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \sim \text{pop}^N$ and $g_T \leftarrow \text{AdaBoost}(S, H_{\text{stumps}}, T)$. Then with high probability,
$$\Pr_{(x,y)\sim\text{pop}}\left(y\,g_T(x) \le 0\right) \le \hat{\epsilon}(g_T) + O\left(\sqrt{\frac{T}{N}}\right)$$
$\hat{\epsilon}(g_T)$ can be further bounded using Theorem 2.2. The bound becomes looser as $T$ increases (due to the increased complexity of $H_{\text{stumps}}^T$ and the danger of overfitting). In contrast, it is observed empirically that $g_T$ generalizes better as $T$ goes up, even after $\hat{\epsilon}(g_T) = 0$. This motivated researchers to find a better generalization statement
based on the margin (Theorem 1, Schapire et al. (1998)).
Theorem 4.2. Draw $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \sim \text{pop}^N$ and $g_T \leftarrow \text{AdaBoost}(S, H_{\text{stumps}}, T)$. Then with high probability,
$$\Pr_{(x,y)\sim\text{pop}}\left(y\,g_T(x) \le 0\right) \le \frac{1}{N}\sum_{i=1}^N [[y_i g_T(x_i) \le 0.1]] + O\left(\sqrt{\frac{1}{N}}\right)$$
The first term on the RHS is the empirical probability that the ensemble has a small margin on $S$ (the threshold 0.1 is picked arbitrarily). Intuitively, this becomes smaller as $T$ goes up because AdaBoost focuses on hard examples (≈ support vectors) to increase the margin, even after the training error becomes zero. Since the second term is free of $T$, the bound becomes tighter with more rounds of boosting.
References
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to
boosting. Journal of computer and system sciences, 55(1), 119–139.
Kégl, B. (2009). Introduction to adaboost.
Schapire, R. E., Freund, Y., Bartlett, P., Lee, W. S., et al. (1998). Boosting the margin: A new explanation for the
effectiveness of voting methods. The annals of statistics, 26(5), 1651–1686.
A Proofs
Proof of Lemma 1.1. We have $y_i h(x_i) = 1 - 2[[y_i h(x_i) \le 0]]$. Thus
$$\hat{\gamma}_D(h) = \sum_{i=1}^N D(i)\left(1 - 2[[y_i h(x_i) \le 0]]\right) = 1 - 2\hat{\epsilon}_D(h)$$
Proof of Lemma 1.2. If $\alpha = 0$ the statement holds trivially since $\hat{\ell}_D(\alpha h) = 1 \ge \hat{\epsilon}_D(h)$. If $\alpha > 0$,
$$\hat{\ell}_D(\alpha h) = \sum_{i=1}^N D(i)\exp(-y_i\alpha h(x_i)) \ge \sum_{i=1}^N D(i)\,[[y_i\alpha h(x_i) \le 0]] = \sum_{i=1}^N D(i)\,[[y_i h(x_i) \le 0]] = \hat{\epsilon}_D(h)$$
Proof of Lemma 1.3. Expanding the expected exponential loss,
$$\hat{\ell}_D(\alpha h) = \sum_{i=1}^N D(i)\exp(-y_i\alpha h(x_i)) = \sum_{i=1}^N D(i)\,[[h(x_i) \ne y_i]]\exp(\alpha) + \sum_{i=1}^N D(i)\,[[h(x_i) = y_i]]\exp(-\alpha) = \hat{\epsilon}_D(h)\exp(\alpha) + (1-\hat{\epsilon}_D(h))\exp(-\alpha)$$
For any $\epsilon \in (0, 1/2]$, consider the objective $J_\epsilon : \mathbb{R} \to \mathbb{R}$ defined by $J_\epsilon(\alpha) := \epsilon\exp(\alpha) + (1-\epsilon)\exp(-\alpha)$. Since $J_\epsilon''(\alpha) > 0$ for all $\alpha \in \mathbb{R}$, it is sufficient to find a stationary point to find the unique optimal solution. Setting $J_\epsilon'(\alpha) = 0$ gives $\alpha = \log\sqrt{(1-\epsilon)/\epsilon}$ with minimum value $J_\epsilon(\alpha) = 2\sqrt{\epsilon(1-\epsilon)}$.
Thus $\alpha = \log\sqrt{(1-\hat{\epsilon}_D(h))/\hat{\epsilon}_D(h)} \ge 0$ is the unique minimizer and $2\sqrt{\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))}$ is the minimum. Plugging $\hat{\epsilon}_D(h) = \frac{1}{2}(1 - \hat{\gamma}_D(h))$ (Lemma 1.1) into the minimizer, we also have $\alpha = \log\sqrt{(1+\hat{\gamma}_D(h))/(1-\hat{\gamma}_D(h))}$. To get the expression of the minimum in terms of the edge, we use the algebraic fact $4z(1-z) = 1 - (1 - 4z + 4z^2) = 1 - (1-2z)^2$ for any $z \in \mathbb{R}$. Then
$$2\sqrt{\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))} = \sqrt{4\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))} = \sqrt{1 - (1 - 2\hat{\epsilon}_D(h))^2} = \sqrt{1 - \hat{\gamma}_D(h)^2}$$
Proof of Lemma 2.1.
$$D_t(i) = \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_{t-1}(x_i))\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_{t-1}(x_j))\exp(-y_j\alpha_t h_t(x_j))} = \frac{\left(\exp(-y_i g_{t-1}(x_i))/C\right)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N \left(\exp(-y_j g_{t-1}(x_j))/C\right)\exp(-y_j\alpha_t h_t(x_j))} = \frac{D_{t-1}(i)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j\alpha_t h_t(x_j))}$$
where we define the constant $C := \sum_{k=1}^N \exp(-y_k g_{t-1}(x_k))$. This proves the first equality. The second equality holds by the definition of the expected weighted exponential loss. The third equality then holds inductively.
$\langle\alpha^{(1)}, H(x)\rangle = \eta_1 h_{k_1}(x)$ is the same as the step-1 ensemble in AdaBoost. At step $t > 1$, assume $\langle\alpha^{(t-1)}, H(x)\rangle$ is the same as the step-$(t-1)$ ensemble in AdaBoost. Then $\tilde{D}_{t-1} = D_{t-1}$, so $h_{k_t} \in \arg\min_{h\in H}\hat{\epsilon}_{D_{t-1}}(h)$ and $\langle\alpha^{(t)}, H(x)\rangle = \langle\alpha^{(t-1)}, H(x)\rangle + \eta_t h_{k_t}(x)$ is the same as the step-$t$ ensemble in AdaBoost by Lemmas A.1 and A.2.
Lemma A.1.
$$[\nabla\hat{\ell}_U(\alpha^{(t-1)})]_k = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$
where $M_{t-1} := \sum_{j=1}^N \exp(-y_j\langle\alpha^{(t-1)}, H(x_j)\rangle)$ and $\tilde{D}_{t-1}(i) := \exp(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle)/M_{t-1}$.
Proof.
$$[\nabla\hat{\ell}_U(\alpha^{(t-1)})]_k = \frac{\partial}{\partial\alpha_k}\,\frac{1}{N}\sum_{i=1}^N \exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i) = \frac{M_{t-1}}{N}\left(\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\right) = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$
Lemma A.2.
$$\log\sqrt{\frac{1-\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}{\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}} \in \arg\min_{\eta\in\mathbb{R}}\ \hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)$$
Proof.
$$\frac{\partial\,\hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)}{\partial\eta} = \frac{\partial}{\partial\eta}\,\frac{1}{N}\sum_{i=1}^N \exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle - \eta y_i h_k(x_i)\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right)\exp\left(-\eta y_i h_k(x_i)\right) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right)$$
$\hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)$ is a composition of a convex (by premise) and a linear function in $\eta$, thus convex. We set the derivative to zero to find a minimizer:
$$-\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right) = 0 \;\Leftrightarrow\; \sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right) = 0 \;\Leftrightarrow\; -\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)\exp(\eta) + (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\exp(-\eta) = 0$$
Solving for $\eta$ yields $\eta = \log\sqrt{(1-\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))/\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}$, assuming $\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) \ne 0$.
B AdaBoost in the Zero-One Loss Form

AdaBoost
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers h : X → {±1}, number of rounds T
Output: ensemble gT : X → R
1. Set D0(i) ← 1/N for i = 1 . . . N.
2. For t = 1 . . . T:
(a) Train a base classifier ht ∈ H with a small weighted zero-one loss ε̂_{D_{t−1}}(ht) < 1/2.
(b) Set αt ← log √((1 − ε̂_{D_{t−1}}(ht)) / ε̂_{D_{t−1}}(ht)).
(c) For i = 1 . . . N: set Dt(i) ← Dt−1(i)/(2(1 − ε̂_{D_{t−1}}(ht))) if ht(xi) = yi, and Dt(i) ← Dt−1(i)/(2 ε̂_{D_{t−1}}(ht)) otherwise.
3. Return gT(x) ← Σ_{t=1}^T αt ht(x).
C Gini Impurity
Given any set of $N$ labeled examples $S$ and a full-support distribution $D$ over their indices, the Gini impurity is defined as $2p(1-p)$ where $p = \sum_{i : y_i = 1} D(i)$ is the probability of drawing label 1 from $S$ under $D$. This is minimized at 0 if all examples are labeled as either 1 or $-1$, and maximized at $1/2$ if $p = 1/2$. A split is optimal if the expected Gini impurity of the resulting partition is the smallest. We can derive an $O(Nd)$-time algorithm to find an optimal split, again assuming that each dimension is sorted:
MinimizeGiniImpurity
Input:
• S = {(x1, y1) . . . (xN, yN)} where N ≥ 2, xi ∈ R^d, and yi ∈ {±1}; [x1]_r ≤ · · · ≤ [xN]_r for r = 1 . . . d; τ_i^(r) := ([x_{i−1}]_r + [x_i]_r)/2
• Distribution D over {1 . . . N}
Output: split (r*, τ*) with minimum expected Gini impurity, or fail if there is no split
1. γ* ← ∞
2. D+1 ← Σ_{j=1}^N [[yj = 1]] D(j)
3. For r = 1 . . . d:
(a) p ← 0 # Left label-1 mass
(b) p′ ← D+1 # Right label-1 mass
(c) β ← 0 # Left mass
(d) For j = 2 . . . N such that [x_{j−1}]_r < [x_j]_r: # Can we use r for thresholding?
i. p ← p + [[y_{j−1} = 1]] D(j − 1)
ii. p′ ← p′ − [[y_{j−1} = 1]] D(j − 1)
iii. β ← β + D(j − 1)
iv. γ ← β · 2(p/β)(1 − p/β) + (1 − β) · 2(p′/(1 − β))(1 − p′/(1 − β)) # expected Gini impurity of the split at τ_j^(r)
v. If γ < γ*, set γ* ← γ, r* ← r, and τ* ← τ_j^(r).
4. If γ* = ∞, fail. Otherwise, return (r*, τ*).
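A Python sketch mirroring MinimizeGiniImpurity, assuming (as the notes do) that D has full support; as in the stump sketch, the running masses are always accumulated and only the threshold evaluation is skipped at duplicate values. For example (numbers chosen for illustration), a split that puts mass 0.4 on the left with label-1 mass 0.3 (conditional label-1 probability 0.75) and label-1 mass 0.2 on the right (conditional probability 1/3) has expected impurity 0.4 · 2(0.75)(0.25) + 0.6 · 2(1/3)(2/3) ≈ 0.42, versus 2(0.5)(0.5) = 0.5 before splitting.

import numpy as np

def minimize_gini_impurity(X, y, D):
    """Single sweep over midpoint thresholds minimizing expected (weighted) Gini impurity.
    X: (N, d) array, y: (N,) labels in {-1, +1}, D: (N,) full-support distribution.
    Returns (r, tau), or None if no dimension admits a threshold."""
    N, d = X.shape
    best = (np.inf, None, None)
    total_pos = float(np.sum(D * (y == 1)))          # D_{+1}
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")
        xs, ys, ds = X[order, r], y[order], D[order]
        p_left, p_right, beta = 0.0, total_pos, 0.0  # left/right label-1 mass, left mass
        for j in range(1, N):
            m = ds[j - 1] * float(ys[j - 1] == 1)
            p_left, p_right, beta = p_left + m, p_right - m, beta + ds[j - 1]
            if xs[j - 1] < xs[j]:                    # a threshold exists only between distinct values
                ql, qr = p_left / beta, p_right / (1.0 - beta)   # conditional label-1 probabilities
                gini = beta * 2 * ql * (1 - ql) + (1 - beta) * 2 * qr * (1 - qr)
                if gini < best[0]:
                    best = (gini, r, 0.5 * (xs[j - 1] + xs[j]))
    return None if best[1] is None else (best[1], best[2])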
D Decision Trees
Let $X = \mathbb{R}^d$. We say $R \subseteq \mathbb{R}^d$ is a hyperrectangle if there exist $a_R, b_R \in (\mathbb{R}\cup\{\pm\infty\})^d$ such that $R = \{x \in \mathbb{R}^d : a_R < x \le b_R\}$, where the inequalities are element-wise. We say $\mathcal{R} = \{R_1 \ldots R_M\}$ is a hyperrectangle partition if $R_1 \ldots R_M$ are hyperrectangles such that $\cup_{R\in\mathcal{R}} R = \mathbb{R}^d$ and $R_i \cap R_j = \emptyset$ for all $i \ne j$. For $x \in \mathbb{R}^d$, we write $\mathcal{R}(x) \in \{1 \ldots M\}$ to denote the unique hyperrectangle in $\mathcal{R}$ that $x$ belongs to. A decision tree is a mapping $x \mapsto \pi(\mathcal{R}(x))$ where $\mathcal{R}$ is a hyperrectangle partition and $\pi : \{1 \ldots M\} \to \{\pm 1\}$ is a region-labeling. It is called a decision tree because it can be expressed as a binary tree with $M$ leaves. Let $\nu$ denote a node object. If internal, it is equipped with a dimension $r \in \{1 \ldots d\}$ and a threshold $\tau \in \mathbb{R}$ as well as left and right child nodes; if a leaf, it is equipped with a label $y \in \{\pm 1\}$. Given any $(\mathcal{R}, \pi)$, by definition there is a binary tree with a root node $\nu_{\text{root}}$ such that $\pi(\mathcal{R}(x)) = \text{Traverse}(x, \nu_{\text{root}})$ where
Traverse(x, ν)
1. If ν is a leaf node, return ν.y.
2. Otherwise,
(a) If [x]ν.r > ν.τ , return Traverse(x, ν.left)
(b) If [x]ν.r ≤ ν.τ , return Traverse(x, ν.right)
Let Htrees(M ) denote the hypothesis class of decision trees with M leaves. We would like to minimize a loss function
over Htrees(M ) but an exhaustive search is intractable unless M is small (e.g., M = 2 in DecisionStump): the
number of labelings is exponential in N since a decision tree is highly nonlinear (e.g., it can fit the XOR mapping
with M = 4 regions). Thus we adopt a top-down greedy heuristic. We assume a single-dimension splitting algorithm
A that maps any data-distribution pair (S 0 , D0 ) to a best dimension r ∈ {1 . . . d} and threshold τ ∈ R according to
some metric.
BuildTree
Input: S = {(x1, y1) . . . (xN, yN)}, distribution D over {1 . . . N}, number of splits P ≤ ⌊log N⌋, single-dimension splitting algorithm A
Output: root node νroot of a binary tree with at most 2^P leaves
1. Initialize νroot and q ← queue([(νroot, S, D)])
2. For P times or until q is empty,
(a) (ν, S′, D′) ← q.pop(); if the labels in S′ are pure, go to Step 2.
(b) (ν.r, ν.τ) ← A(S′, D′)
(c) Partition S′ into S′1, S′2 by thresholding dimension ν.r using ν.τ.
(d) Compute distributions D′1, D′2 over S′1, S′2 by renormalizing D′.
(e) Initialize ν.left and push (ν.left, S′1, D′1) onto q.
(f) Initialize ν.right and push (ν.right, S′2, D′2) onto q.
3. Return νroot.
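A Python sketch of BuildTree, using the minimize_gini_impurity sketch above (or any routine returning a dimension-threshold pair) as the splitting algorithm A; carrying index sets instead of data subsets is an implementation choice of this sketch.

from collections import deque
import numpy as np

class Node:
    """Internal nodes carry (r, tau, left, right); leaves carry a label y."""
    def __init__(self):
        self.r = self.tau = self.left = self.right = self.y = None

def build_tree(X, y, D, P, splitter):
    """Greedy top-down tree growing; splitter(X, y, D) returns (r, tau) or None."""
    root = Node()
    q = deque([(root, np.arange(len(y)))])
    for _ in range(P):
        if not q:
            break
        node, idx = q.popleft()
        node.y = 1 if float(np.sum(D[idx] * y[idx])) >= 0 else -1   # weighted majority label
        if len(set(y[idx])) == 1:                                   # pure labels: leave as a leaf
            continue
        split = splitter(X[idx], y[idx], D[idx] / D[idx].sum())     # renormalize D on the subset
        if split is None:                                           # no valid threshold
            continue
        node.r, node.tau = split
        gt = idx[X[idx, node.r] > node.tau]    # "> tau" side goes to the left child (Traverse convention)
        le = idx[X[idx, node.r] <= node.tau]
        node.left, node.right = Node(), Node()
        q.append((node.left, gt))
        q.append((node.right, le))
    for node, idx in q:                        # any nodes still queued become leaves
        node.y = 1 if float(np.sum(D[idx] * y[idx])) >= 0 else -1
    return root

def traverse(x, node):
    if node.r is None:
        return node.y
    return traverse(x, node.left) if x[node.r] > node.tau else traverse(x, node.right)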
This is a heuristic with no optimality guarantee. For example, if we use DecisionStump (which does find an optimal stump) as our choice of A, the resulting tree is generally suboptimal for $P \ge 2$: that is, there may be a different tree $h \in H_{\text{trees}(2^P)}$ that achieves a larger edge. One popular approach is to grow a full tree (i.e., $P = \lfloor\log N\rfloor$) using a "label purity" splitting metric (e.g., Gini impurity, Appendix C) and then prune the full tree to minimize the misclassification rate of the leaf nodes. Decision trees can be used as base classifiers of AdaBoost, regardless of the specifics of how they are learned.
E Gradient Boosting
Let $\mathcal{F}$ denote the set of all functions $f : X \to \mathbb{R}$. This is a vector space because functions are closed under (element-wise) addition and scalar multiplication. We may assume an inner product $\langle\cdot,\cdot\rangle : \mathcal{F}\times\mathcal{F} \to \mathbb{R}$, with the norm $\|f\| := \sqrt{\langle f, f\rangle}$, for certain subspaces such as square-integrable functions or a reproducing kernel Hilbert space (RKHS). We will assume that $\mathcal{F}$ is an RKHS induced by a kernel $K : X \times X \to \mathbb{R}$, whose elements are (limits of) finite combinations $f = \sum_{x'\in C_f}\alpha^f_{x'} K(\cdot, x')$ with $C_f \subset X$ finite and coefficients $\alpha^f \in \mathbb{R}^{C_f}$. Note that $K(\cdot, x) \in \mathcal{F}$ for every $x \in X$. The inner product is defined as $\langle f, g\rangle := (\alpha^f)^\top K^{f,g}\alpha^g$ where $K^{f,g} := [K(x, x')]_{x\in C_f,\, x'\in C_g}$; in particular
$$\langle f, K(\cdot, x)\rangle = \sum_{x'\in C_f}\alpha^f_{x'}K(x', x) = f(x)$$
hence the name "reproducing". By the Moore-Aronszajn theorem, any kernel induces a unique associated RKHS.
Functional gradient descent. Let $\hat{L} : \mathcal{F} \to \mathbb{R}$ be a differentiable loss over $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \subset X \times \mathbb{R}$ (e.g., the squared loss $\hat{L}_{\text{sq}}(f) := \frac{1}{2}\sum_{i=1}^N (f(x_i) - y_i)^2$). We may do steepest descent on $\hat{L}$ over $\mathcal{F}$. This means we pick an initial $f_0 \in \mathcal{F}$ and for $t = 1 \ldots T$ find $g_t \in \mathcal{F}$ with a bounded norm such that $\hat{L}(f_{t-1} + g_t)$ is small. Using the linear approximation $\hat{L}(f_{t-1} + g_t) \approx \hat{L}(f_{t-1}) + \langle\nabla\hat{L}(f_{t-1}), g_t\rangle$, the steepest descent direction is given by $g_t = \eta_t(-\nabla\hat{L}(f_{t-1}))$ where $\eta_t \in \mathbb{R}$. Then we set
$$f_t = f_{t-1} + g_t$$
Note that $f_T = f_0 + \sum_{t=1}^T g_t$ can be viewed as an ensemble; at test time, given $x \in X$ the model computes $f_0(x) \in \mathbb{R}$ and $g_1(x) \ldots g_T(x) \in \mathbb{R}$, then returns $f_T(x) = f_0(x) + g_1(x) + \cdots + g_T(x)$.
While conceptually nice, functional gradient descent is not implementable due to its nonparametric nature. For instance, the descent direction $-\nabla\hat{L}_{\text{sq}}(f_{t-1}) = \sum_{i=1}^N (y_i - f_{t-1}(x_i))\,K(\cdot, x_i)$ is an abstract real-valued mapping over $X$. However, we can easily evaluate it on $S$. Define
$$S_t = \left\{(x_i, y_i') \;:\; y_i' = -\nabla\hat{L}(f_{t-1})(x_i),\ i \in \{1 \ldots N\}\right\}$$
We can treat $S_t$ as labeled data and fit some parametric model to approximate $h_t \approx -\nabla\hat{L}(f_{t-1})$. We decide $\eta_t$ by some learning rate schedule or line search (i.e., $\min_{\eta\in\mathbb{R}}\hat{L}(f_{t-1} + \eta h_t)$), and set $g_t = \eta_t h_t$.
Regression. For the squared loss, the label $y_i'$ of $x_i$ in $S_t$ is
$$-\nabla\hat{L}_{\text{sq}}(f)(x_i) = y_i - f(x_i)$$
namely the $i$-th residual of $f$. We will use the squared loss to fit the residuals. We assume a parametric hypothesis class that is easy to optimize over (e.g., decision trees, linear regressors). With the constant step size $\eta_t = 1$, the gradient boosting algorithm is
1. Initialize $h_0 \in \arg\min_{h:X\to\mathbb{R}} \sum_{i=1}^N (y_i - h(x_i))^2$.
2. For $t = 1 \ldots T$, find
$$h_t \in \arg\min_{h:X\to\mathbb{R}} \sum_{i=1}^N \left(\left(y_i - \sum_{s=0}^{t-1} h_s(x_i)\right) - h(x_i)\right)^2$$
3. Given $x \in X$, predict $\sum_{t=0}^T h_t(x)$.
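A minimal Python sketch of squared-loss gradient boosting with step size 1. The regression-stump base learner and the constant initial predictor h_0 are simple choices made for this sketch; the notes leave the parametric class open (decision trees, linear regressors, etc.).

import numpy as np

def fit_regression_stump(X, target):
    """Least-squares regression stump x -> a if x[r] > tau else b."""
    N, d = X.shape
    best = (np.inf, None)
    for r in range(d):
        for tau in np.unique(X[:, r])[:-1]:
            right = X[:, r] > tau
            a, b = target[right].mean(), target[~right].mean()
            sse = ((target[right] - a) ** 2).sum() + ((target[~right] - b) ** 2).sum()
            if sse < best[0]:
                best = (sse, (r, tau, a, b))
    if best[1] is None:                       # all features constant: fall back to the mean
        c = float(target.mean())
        return lambda Z: np.full(len(Z), c)
    r, tau, a, b = best[1]
    return lambda Z: np.where(Z[:, r] > tau, a, b)

def gradient_boost_regression(X, y, T):
    """Repeatedly fit the current residuals y_i - sum_{s<t} h_s(x_i) with squared loss."""
    ensemble = [lambda Z: np.full(len(Z), float(y.mean()))]   # h_0: best constant predictor
    pred = ensemble[0](X)
    for _ in range(T):
        h = fit_regression_stump(X, y - pred)                 # fit the residuals
        ensemble.append(h)
        pred = pred + h(X)
    return lambda Z: sum(h(Z) for h in ensemble)              # predict sum_{t=0}^T h_t(x)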
Classification. Now suppose $y_i \in \{1 \ldots K\}$ and $f : X \to \mathbb{R}^K$, with $p_f(k|x) := \exp(f^{(k)}(x)) / \sum_{k'=1}^K \exp(f^{(k')}(x))$ and the cross-entropy loss $\hat{L}_{\text{CE}}(f) := -\sum_{i=1}^N \log p_f(y_i|x_i)$. The $k$-th component of the negative gradient evaluated at a training input is
$$-\nabla_k\hat{L}_{\text{CE}}(f)(x_i) = [[y_i = k]] - p_f(k|x_i)$$
namely the $i$-th residual of $f$ for the $k$-th class. We will again use the squared loss to fit the residuals, an easy-to-optimize-over hypothesis class, and $\eta_t = 1$. Then the gradient boosting algorithm is
1. Initialize $h_0 \in \arg\min_{h:X\to\mathbb{R}^K} \sum_{i=1}^N \left(\log\sum_{k=1}^K \exp h^{(k)}(x_i) - h^{(y_i)}(x_i)\right)$.
2. For $t = 1 \ldots T$, find
$$h_t \in \arg\min_{h:X\to\mathbb{R}^K} \sum_{i=1}^N \sum_{k=1}^K \left([[y_i = k]] - p_{f_{t-1}}(k|x_i) - h^{(k)}(x_i)\right)^2 \qquad \text{where } f_{t-1} := \sum_{s=0}^{t-1} h_s$$
3. Given $x \in X$, predict $y^* \in \arg\max_{y \in \{1\ldots K\}} \sum_{t=0}^T h_t^{(y)}(x)$.
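A sketch of the multiclass variant, reusing fit_regression_stump from the previous sketch to fit each class's residuals (one regressor per class per round is one simple way to parametrize h_t : X → R^K; the notes leave this choice open). Classes are 0-indexed for convenience.

import numpy as np

def softmax(F):
    Z = np.exp(F - F.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def gradient_boost_classification(X, y, K, T):
    """X: (N, d) array, y: (N,) integer labels in {0, ..., K-1}."""
    N = len(y)
    Y = np.eye(K)[y]                          # one-hot targets [[y_i = k]]
    h0 = np.log(Y.mean(axis=0) + 1e-12)       # constant scores minimizing cross-entropy
    F = np.tile(h0, (N, 1))                   # F[i, k] = f_{t-1}^{(k)}(x_i)
    rounds = []
    for _ in range(T):
        R = Y - softmax(F)                    # residuals [[y_i = k]] - p_f(k | x_i)
        hs = [fit_regression_stump(X, R[:, k]) for k in range(K)]
        rounds.append(hs)
        F = F + np.column_stack([h(X) for h in hs])
    def predict(Z):
        S = np.tile(h0, (len(Z), 1))
        for hs in rounds:
            S = S + np.column_stack([h(Z) for h in hs])
        return S.argmax(axis=1)               # argmax_k sum_t h_t^{(k)}(x)
    return predict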