AdaBoost
Karl Stratos
1 Empirical Loss
Let pop denote a population distribution over input-label pairs in X ×{±1}. Let (x1 , y1 ) . . . (xN , yN ) ∼ pop denote
iid samples. Given f : X → R, define
$$\hat{\epsilon}(f) := \frac{1}{N}\sum_{i=1}^N [[y_i f(x_i) \le 0]] \qquad \text{(empirical zero-one loss)}$$
$$\hat{\ell}(f) := \frac{1}{N}\sum_{i=1}^N \exp(-y_i f(x_i)) \qquad \text{(empirical exponential loss)}$$
$\hat{\ell}(f)$ is a convex upper bound on $\hat{\epsilon}(f)$. More generally, given a distribution $D$ over $\{1 \ldots N\}$, define
$$\hat{\epsilon}_D(f) := \sum_{i=1}^N D(i)\,[[y_i f(x_i) \le 0]] \qquad \text{(expected empirical zero-one loss)}$$
$$\hat{\ell}_D(f) := \sum_{i=1}^N D(i)\exp(-y_i f(x_i)) \qquad \text{(expected empirical exponential loss)}$$
$$\hat{\gamma}_D(f) := \sum_{i=1}^N D(i)\,y_i f(x_i) \qquad \text{(expected empirical margin, aka edge)}$$
Lemma 1.1. For any $h : X \to \{\pm 1\}$ and distribution $D$ over $\{1 \ldots N\}$, $\hat{\gamma}_D(h) = 1 - 2\hat{\epsilon}_D(h)$.

Lemma 1.2. For any $h : X \to \{\pm 1\}$, distribution $D$, and $\alpha \ge 0$, $\hat{\ell}_D(\alpha h) \ge \hat{\epsilon}_D(h)$.

Lemma 1.3. For any $h : X \to \{\pm 1\}$ and distribution $D$ with $\hat{\epsilon}_D(h) \in (0, 1/2]$,
$$\min_{\alpha\in\mathbb{R}} \hat{\ell}_D(\alpha h) = \sqrt{1 - \hat{\gamma}_D(h)^2}, \qquad \text{attained uniquely at } \alpha = \log\sqrt{\frac{1-\hat{\epsilon}_D(h)}{\hat{\epsilon}_D(h)}} = \log\sqrt{\frac{1+\hat{\gamma}_D(h)}{1-\hat{\gamma}_D(h)}} \ge 0.$$
In words: we can approximately minimize $\hat{\epsilon}_D(h)$ by tightening the upper bound $\hat{\ell}_D(\alpha h)$, where $\alpha \ge 0$ controls the margin. The better $h$ is, the larger the optimal $\alpha$. This is useful because $\hat{\ell}_D$ is easier to optimize than $\hat{\epsilon}_D$.
2 Ensemble
Let $h_1 \ldots h_T : X \to \{\pm 1\}$. An ensemble with weights $\alpha_1 \ldots \alpha_T \in \mathbb{R}$ is the function $g_T : X \to \mathbb{R}$ defined by
$$g_T(x) := \sum_{t=1}^T \alpha_t h_t(x)$$
For $t = 1 \ldots T$, define a distribution $D_t$ over $\{1 \ldots N\}$ by
$$D_t(i) := \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_t(x_i))}{N\,\hat{\ell}(g_t)}$$
A softmax over negative margins is simply normalized exponential losses. We can pull $(\alpha_1, h_1) \ldots (\alpha_{t-1}, h_{t-1})$ out of $D_t$ to express it as a function of $D_{t-1}$ and $(\alpha_t, h_t)$, thanks to the linearity of $g_t = \sum_{s=1}^t \alpha_s h_s$. For convenience we define $g_0(x) = 0$ so that $D_0(i) = 1/N$.
Lemma 2.1. For $t \ge 1$,
$$D_t(i) = \frac{D_{t-1}(i)\exp(-y_i \alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j \alpha_t h_t(x_j))} = \frac{D_{t-1}(i)\exp(-y_i \alpha_t h_t(x_i))}{\hat{\ell}_{D_{t-1}}(\alpha_t h_t)} = \frac{\exp(-y_i g_t(x_i))}{N\prod_{s=1}^{t} \hat{\ell}_{D_{s-1}}(\alpha_s h_s)}$$
This suggests a greedy strategy: minimize $\hat{\ell}(g_t)$ over $\alpha_t \in \mathbb{R}$ while holding $\alpha_1 \ldots \alpha_{t-1}$ fixed. This reduces to minimizing $\hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ where $D_{t-1}$ is constant. From Lemma 1.3 we know the optimal solution
$$\alpha_t = \log\sqrt{\frac{1-\hat{\epsilon}_{D_{t-1}}(h_t)}{\hat{\epsilon}_{D_{t-1}}(h_t)}} = \log\sqrt{\frac{1+\hat{\gamma}_{D_{t-1}}(h_t)}{1-\hat{\gamma}_{D_{t-1}}(h_t)}} \ge 0 \qquad (2)$$
Plugging Eq. (2) in the recursive expression of $D_t$ in Lemma 2.1, we can verify the following update:
$$D_t(i) = \begin{cases} \dfrac{1}{2(1-\hat{\epsilon}_{D_{t-1}}(h_t))}\,D_{t-1}(i) = \dfrac{1}{1+\hat{\gamma}_{D_{t-1}}(h_t)}\,D_{t-1}(i) & \text{if } h_t(x_i) = y_i \\[2ex] \dfrac{1}{2\hat{\epsilon}_{D_{t-1}}(h_t)}\,D_{t-1}(i) = \dfrac{1}{1-\hat{\gamma}_{D_{t-1}}(h_t)}\,D_{t-1}(i) & \text{otherwise} \end{cases}$$
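As a quick numerical check (values chosen purely for illustration): if the base classifier has edge $\hat{\gamma}_{D_{t-1}}(h_t) = 0.2$, equivalently weighted error $\hat{\epsilon}_{D_{t-1}}(h_t) = 0.4$, then
$$\alpha_t = \log\sqrt{\frac{1.2}{0.8}} \approx 0.203, \qquad D_t(i) = \frac{D_{t-1}(i)}{1.2} \text{ if } h_t(x_i) = y_i, \qquad D_t(i) = \frac{D_{t-1}(i)}{0.8} \text{ otherwise},$$
and the new weights remain normalized since $0.6/1.2 + 0.4/0.8 = 1$.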
The edge formulation, where $\hat{\gamma}_{D_{t-1}}(h_t) \in [0, 1)$, makes it clear that the probability of an example is downweighted if correctly classified but upweighted if misclassified. This is the AdaBoost algorithm. We state it below in the edge form; the zero-one loss form is found in Appendix B.
AdaBoost
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers h : X → {±1}, number of rounds T
Output: ensemble gT : X → R
1. Set D0(i) ← 1/N for i = 1 . . . N.
2. For t = 1 . . . T:
(a) Train a base classifier ht ∈ H with a large edge γ̂_{D_{t−1}}(ht) > 0.
(b) Set αt ← log √((1 + γ̂_{D_{t−1}}(ht)) / (1 − γ̂_{D_{t−1}}(ht))).
(c) For i = 1 . . . N: set Dt(i) ← Dt−1(i)/(1 + γ̂_{D_{t−1}}(ht)) if ht(xi) = yi, and Dt(i) ← Dt−1(i)/(1 − γ̂_{D_{t−1}}(ht)) otherwise.
3. Return gT(x) ← Σ_{t=1}^T αt ht(x).
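A minimal Python/NumPy sketch of the edge-form algorithm above. The weak_learner argument is an assumed placeholder for any routine that returns a ±1-valued classifier trained on the weighted sample (e.g., DecisionStump from Section 3); it is not part of the notes.

import numpy as np

def adaboost(X, y, weak_learner, T):
    """Edge-form AdaBoost sketch. X: (N, d) array, y: (N,) labels in {-1, +1}.
    weak_learner(X, y, D) is assumed to return a callable h with h(X) in {-1, +1}^N."""
    N = len(y)
    D = np.full(N, 1.0 / N)                       # D_0(i) = 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)                 # base classifier, ideally with positive edge
        pred = h(X)
        gamma = float(np.sum(D * y * pred))       # edge = expected empirical margin under D
        alpha = 0.5 * np.log((1 + gamma) / (1 - gamma))   # Eq. (2); assumes gamma in (-1, 1)
        # Edge-form reweighting: shrink correct examples, grow misclassified ones.
        D = np.where(pred == y, D / (1 + gamma), D / (1 - gamma))
        classifiers.append(h)
        alphas.append(alpha)
    def g(Xnew):                                  # ensemble g_T(x) = sum_t alpha_t h_t(x)
        return sum(a * h(Xnew) for a, h in zip(alphas, classifiers))
    return g

A degenerate perfect edge (gamma = 1) would make the update divide by zero; as noted in Section 3, restricted base classes such as decision stumps make this case unlikely.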
Theorem 2.2. $g_T \leftarrow \text{AdaBoost}(S, H, T)$ satisfies
$$\hat{\epsilon}(g_T) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2\, T\right)$$
where $\hat{\gamma}_{\min} := \min_{t=1}^{T} \hat{\gamma}_{D_{t-1}}(h_t)$ is the minimum edge among the $T$ base classifiers trained in AdaBoost.
Proof.
$$\hat{\epsilon}(g_T) \le \hat{\ell}(g_T) = \prod_{t=1}^T \sqrt{1 - \hat{\gamma}_{D_{t-1}}(h_t)^2} \le \prod_{t=1}^T \exp\left(-\frac{1}{2}\hat{\gamma}_{D_{t-1}}(h_t)^2\right) \le \exp\left(-\frac{1}{2}\hat{\gamma}_{\min}^2\, T\right)$$
where the equality uses the factorization $\hat{\ell}(g_T) = \prod_{t=1}^T \hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ (Lemma 2.1) together with the minimum value in Lemma 1.3, and the middle inequality uses $1 - z \le \exp(-z)$.
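A standard consequence worth spelling out: since $\hat{\epsilon}(g_T)$ only takes values in $\{0, 1/N, 2/N, \ldots\}$, the bound forces the training error to be exactly zero once the right-hand side drops below $1/N$, i.e.,
$$T > \frac{2\ln N}{\hat{\gamma}_{\min}^2} \;\Longrightarrow\; \exp\left(-\tfrac{1}{2}\hat{\gamma}_{\min}^2\, T\right) < \frac{1}{N} \;\Longrightarrow\; \hat{\epsilon}(g_T) = 0.$$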
2.2 Interpretations
We motivated AdaBoost as a greedy sequential minimization of $\hat{\ell}(g_T) = \prod_{t=1}^T \hat{\ell}_{D_{t-1}}(\alpha_t h_t)$ (Eq. (1)), which upper bounds $\hat{\epsilon}(g_T)$. AdaBoost can also be derived as an adversarial step-wise calibration of $D_t$: select $\alpha_t$ so that
$$\hat{\epsilon}_{D_t}(h_t) = \sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_t(i) = \frac{\sum_{i=1}^N [[h_t(x_i) \ne y_i]]\, D_{t-1}(i)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j\alpha_t h_t(x_j))} = \frac{\hat{\epsilon}_{D_{t-1}}(h_t)\exp(\alpha_t)}{\hat{\epsilon}_{D_{t-1}}(h_t)\exp(\alpha_t) + (1-\hat{\epsilon}_{D_{t-1}}(h_t))\exp(-\alpha_t)}$$
is equal to $1/2$. Solving for $\alpha_t$ yields the same solution as Eq. (2). More generally, AdaBoost can be seen as coordinate descent on $\hat{\epsilon}(g)$ over all ensembles $g$. For simplicity assume a finite hypothesis class $H = \{h_1 \ldots h_H\}$ and write $H(x) := (h_1(x) \ldots h_H(x)) \in \{\pm 1\}^H$. Then any ensemble can be written as
$$\langle \alpha, H(x)\rangle = \sum_{k=1}^H \alpha_k h_k(x)$$
where $\alpha \in \mathbb{R}^H$. Thus the goal is simplified to finding $\alpha$, which implicitly finds base classifiers. Let $U$ be a convex upper bound on the zero-one loss: $U(z) \ge [[z \le 0]]$ where $z$ is the margin. This gives a convex loss of $\langle\alpha, H\rangle$,
$$\hat{\ell}_U(\alpha) := \frac{1}{N}\sum_{i=1}^N U\left(y_i\langle\alpha, H(x_i)\rangle\right)$$
which upper bounds $\hat{\epsilon}(\langle\alpha, H\rangle)$. By minimizing $\hat{\ell}_U(\alpha)$ over $\alpha \in \mathbb{R}^H$, we implicitly minimize $\hat{\epsilon}(g)$ over all ensembles $g$. We do coordinate descent: at each step, find the coordinate $k \in \{1 \ldots H\}$ with the largest decrease in the loss (equivalently, steepest descent with an $\ell_1$-norm constraint) and take a step in that coordinate to $\alpha + \eta e_k$, where $\eta \in \mathbb{R}$ is an optimal step size.
AdaBoost is a special case with the exponential loss $U(z) = \exp(-z)$. The main idea is that by initializing $\alpha = 0_H$, each step corresponds to adding a single classifier in $H$. The coordinate and step size coincide with the classifier and its weight selected in AdaBoost. This is due to the recursive property of the exponential derivative,
$$\frac{\partial}{\partial\alpha_k}\hat{\ell}_U(\alpha) = \frac{1}{N}\sum_{i=1}^N \left(-y_i h_k(x_i)\right)\exp\left(-y_i\langle\alpha, H(x_i)\rangle\right)$$
The partial derivative "selects" $h_k$ while keeping the exponential loss, which can be normalized to give an expression in expected zero-one loss. The partial derivative of $\hat{\ell}_U(\alpha + \eta e_k)$ (which is convex in $\eta$) with respect to $\eta$ is similar.
CoordinateDescent
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Finite hypothesis class H = {h1 . . . hH}, where we write H(x) := (h1(x) . . . hH(x)) ∈ {±1}^H
• Convex upper bound U(z) ≥ [[z ≤ 0]], which approximates ε̂(g) over all possible ensembles g by ℓ̂_U(α) := (1/N) Σ_{i=1}^N U(yi ⟨α, H(xi)⟩) ≥ ε̂(⟨α, H⟩)
• Number of steps T
1. Initialize α^(0) ← 0_H.
2. For t = 1 . . . T: find
(kt, ηt) ∈ argmin_{k ∈ {1...H}, η ∈ R} ℓ̂_U(α^(t−1) + η e_k)
and set α^(t) ← α^(t−1) + ηt e_{kt} where e_{kt} is the kt-th basis vector.
3. Return α^(T).
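A minimal Python sketch of CoordinateDescent specialized to the exponential surrogate U(z) = exp(−z). The prediction matrix Hx is an illustration device (not notation from the notes) holding the values h_k(x_i) of the finite class; with this U, the selected coordinate and step size coincide with the base classifier and weight chosen by AdaBoost.

import numpy as np

def coordinate_descent_exp(Hx, y, T):
    """Hx: (N, H) matrix with Hx[i, k] = h_k(x_i) in {-1, +1}; y: (N,) labels in {-1, +1}.
    Runs T steps of coordinate descent on the exponential surrogate loss."""
    N, H = Hx.shape
    alpha = np.zeros(H)                                 # alpha^(0) = 0_H
    for _ in range(T):
        w = np.exp(-y * (Hx @ alpha))                   # unnormalized exponential losses
        D = w / w.sum()                                 # the current distribution D~_{t-1}
        mistakes = ((Hx * y[:, None]) < 0).astype(float)
        eps = mistakes.T @ D                            # weighted zero-one loss of each h_k under D
        k = int(np.argmin(eps))                         # best coordinate (assumes some eps[k] in (0, 1/2))
        eta = 0.5 * np.log((1 - eps[k]) / eps[k])       # optimal step size (Lemma A.2)
        alpha[k] += eta
    return alpha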
This interpretation generalizes AdaBoost to other convex surrogates of the zero-one loss (e.g., hinge or logistic). It
also hints at a deeper connection between gradient descent and ensemble learning. Namely, taking gradient steps in
a function space is equivalent to taking an ensemble of functions. This motivates gradient boosting (Appendix E);
CoordinateDescent can be seen as a special case of gradient boosting.
3 Decision Stumps
A popular version of AdaBoost assumes the input space $X = \mathbb{R}^d$ and the hypothesis class of decision stumps
$$H_{\text{stumps}} := \left\{ x \mapsto b\,\mathrm{Ind}([x]_r > \tau) \;:\; r \in \{1 \ldots d\},\; \tau \in \mathbb{R},\; b \in \{\pm 1\}\right\}$$
where $\mathrm{Ind}(A)$ is $1$ if $A$ is true and $-1$ otherwise. Without loss of generality we assume that each dimension is sorted, so that $[x_1]_r \le \cdots \le [x_N]_r$ for $r = 1 \ldots d$ (we can implement this assumption by iterating through examples in a presorted list for each $r$). Under this assumption, we define for each $r = 1 \ldots d$ and $i = 2 \ldots N$ the candidate threshold
$$\tau_i^{(r)} := \frac{[x_{i-1}]_r + [x_i]_r}{2}$$
Expressiveness. A decision stump is a (restricted) linear classifier and cannot realize nonlinear labelings. This
works in our favor since we are unlikely to run into the degenerate case of perfect edge γ̂D (h) = 1. Specifically, given
N samples with $2^N$ possible labelings, Hstumps can realize at most 2dN labelings. For instance, if d = 1 and N = 3
so that the inputs are scalars x1 ≤ x2 ≤ x3 , there exists no stump that can realize (1, −1, 1) or (−1, 1, −1), so only
6 out of 8 possible labelings are realized. Note that even fewer labelings would be realized if there are duplicate
input values (e.g., x1 = x2 ) since we cannot find a threshold. If we have another dimension in which the inputs are
ordered differently (e.g., x3 ≤ x1 ≤ x2 ), the missing labelings may be realized. But since each dimension realizes at
most 2N labelings and many labelings are duplicates across dimensions, Hstumps can realize at most 2dN labelings.
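The counting claim is easy to verify directly for the d = 1, N = 3 example; the following small Python snippet (with arbitrary illustrative inputs) enumerates the labelings realizable by stumps and confirms that only 6 of the 8 labelings appear.

import itertools
import numpy as np

# Enumerate labelings of three sorted scalar inputs realizable by stumps x -> b * Ind(x > tau).
x = np.array([1.0, 2.0, 3.0])              # x1 <= x2 <= x3 (illustrative values)
thresholds = [0.5, 1.5, 2.5, 3.5]          # below all points, between consecutive points, above all
labelings = set()
for b, tau in itertools.product([+1, -1], thresholds):
    labelings.add(tuple(b * np.where(x > tau, 1, -1)))
print(len(labelings))                      # 6
print((1, -1, 1) in labelings)             # False: not realizable by any stump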
Learning. There are at most $dN$ dimension-wise linear separations of $N$ examples. Since $H_{\text{stumps}}$ can only induce labelings based on these separations, it suffices to consider one stump per separation to do an exhaustive search over $H_{\text{stumps}}$. In particular, we use stumps with the thresholds $\tau_i^{(r)}$ defined above. Naively calculating the edge value for each threshold would take $O(N^2 d)$ time. Below we give a single-sweep approach with a linear runtime $O(Nd)$, as described by Kégl (2009).
DecisionStump
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ R^d and yi ∈ {±1}; [x1]_r ≤ · · · ≤ [xN]_r for r = 1 . . . d; τ_i^(r) := ([x_{i−1}]_r + [x_i]_r)/2
• Distribution D over {1 . . . N}
Output: h* ∈ argmax_{h ∈ Hstumps} γ̂_D(h)
1. γ* ← Σ_{i=1}^N D(i) yi
2. For r = 1 . . . d:
(a) γ ← Σ_{i=1}^N D(i) yi
(b) For j = 2 . . . N such that [x_{j−1}]_r < [x_j]_r: # Can we use r for thresholding?
i. γ ← γ − 2 D(j − 1) y_{j−1}
ii. If |γ| > |γ*|, set γ* ← γ, r* ← r, and τ* ← τ_j^(r).
3. If γ* = Σ_{i=1}^N D(i) yi, return the constant classifier x ↦ sign(γ*). Otherwise, return x ↦ sign(γ*) Ind([x]_{r*} > τ*).
We maintain the best edge value $\gamma^*$ over all $dN$ linear separations. In each dimension $r$, we start from the edge $\sum_{i=1}^N D(i)y_i$ of the constant classifier $x \mapsto 1$ and subtract $2D(j-1)y_{j-1}$ for $j = 2 \ldots N$. After processing $j$, the running value is $\sum_{i=j}^N D(i)y_i - \sum_{i=1}^{j-1} D(i)y_i$, i.e., the edge of the stump $x \mapsto \mathrm{Ind}([x]_r > \tau_j^{(r)})$ that labels the sorted examples $(-1, \ldots, -1, 1, \ldots, 1)$ with $j-1$ negative ones. Multiplying this value by its sign makes it nonnegative, implying that the sign is the parameter $b$ of the underlying decision stump. We thus examine the absolute value of the edge to pick whichever side gives the larger value.
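A Python sketch of the single sweep; the argsort-based bookkeeping is an implementation choice of this sketch rather than part of the pseudocode, and the running edge is always updated so that it stays valid across duplicate input values (only the threshold recording is skipped at duplicates).

import numpy as np

def decision_stump(X, y, D):
    """Single-sweep search for a maximum-edge decision stump.
    X: (N, d) array, y: (N,) labels in {-1, +1}, D: (N,) distribution over examples.
    Returns (b, r, tau) describing x -> b * Ind(x[r] > tau); r is None for a constant classifier."""
    N, d = X.shape
    base = float(np.sum(D * y))                      # edge of the constant classifier x -> +1
    best_gamma, best_r, best_tau = base, None, None
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")   # realize the presorted-dimension assumption
        xs, ys, ds = X[order, r], y[order], D[order]
        gamma = base
        for j in range(1, N):
            gamma -= 2.0 * ds[j - 1] * ys[j - 1]     # example j-1 now falls on the -1 side
            # A threshold between positions j-1 and j exists only if the values differ.
            if xs[j - 1] < xs[j] and abs(gamma) > abs(best_gamma):
                best_gamma, best_r, best_tau = gamma, r, 0.5 * (xs[j - 1] + xs[j])
    b = 1 if best_gamma >= 0 else -1
    return b, best_r, best_tau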
Decision trees. A decision stump is a special case of a decision tree with 2 leaves: $H_{\text{stumps}} = H_{\text{trees}(2)}$ (Appendix D). Unlike a decision stump, a decision tree (with more leaves) is nonlinear and can induce $O(2^N)$ labelings on $S$. However, exact learning is intractable and requires heuristics.
4 Generalization
The VC dimension of $H_{\text{stumps}}$ is 2 (i.e., the maximum number of points that can be shattered by a decision stump is two). Let $H_{\text{stumps}}^T := \{\sum_{t=1}^T \alpha_t h_t : \alpha_t \ge 0,\ h_t \in H_{\text{stumps}}\}$. It is intuitively clear that the VC dimension of $H_{\text{stumps}}^T$ is $2T$. A standard application of Hoeffding's inequality gives the following.
Theorem 4.1. Draw $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \sim \text{pop}^N$ and $g_T \leftarrow \text{AdaBoost}(S, H_{\text{stumps}}, T)$. Then with high probability,
$$\Pr_{(x,y)\sim\text{pop}}\left(y\,g_T(x) \le 0\right) \le \hat{\epsilon}(g_T) + O\left(\sqrt{\frac{T}{N}}\right)$$
$\hat{\epsilon}(g_T)$ can be further bounded using Theorem 2.2. The bound becomes looser as $T$ increases (due to the increased complexity of $H_{\text{stumps}}^T$ and the danger of overfitting). In contrast, it is observed empirically that $g_T$ generalizes better as $T$ goes up, even after $\hat{\epsilon}(g_T) = 0$. This motivated researchers to find a better generalization statement
based on the margin (Theorem 1, Schapire et al. (1998)).
Theorem 4.2. Draw $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \sim \text{pop}^N$ and $g_T \leftarrow \text{AdaBoost}(S, H_{\text{stumps}}, T)$. Then with high probability,
$$\Pr_{(x,y)\sim\text{pop}}\left(y\,g_T(x) \le 0\right) \le \frac{1}{N}\sum_{i=1}^N [[y_i g_T(x_i) \le 0.1]] + O\left(\sqrt{\frac{1}{N}}\right)$$
The first term on the RHS is the empirical probability that the ensemble has a small margin on $S$ (the threshold 0.1 is picked arbitrarily). Intuitively, this becomes smaller as $T$ goes up because AdaBoost focuses on hard examples (≈ support vectors) to increase the margin, even after the training error becomes zero. Since the second term is free of $T$, the bound becomes tighter with more rounds of boosting.
References
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to
boosting. Journal of computer and system sciences, 55(1), 119–139.
Kégl, B. (2009). Introduction to adaboost.
Schapire, R. E., Freund, Y., Bartlett, P., Lee, W. S., et al. (1998). Boosting the margin: A new explanation for the
effectiveness of voting methods. The annals of statistics, 26(5), 1651–1686.
A Proofs
Proof of Lemma 1.1. We have $y_i h(x_i) = 1 - 2[[y_i h(x_i) \le 0]]$. Thus
$$\hat{\gamma}_D(h) = \sum_{i=1}^N D(i)\left(1 - 2[[y_i h(x_i) \le 0]]\right) = 1 - 2\hat{\epsilon}_D(h)$$
Proof of Lemma 1.2. If $\alpha = 0$ the statement holds trivially since $\hat{\ell}_D(\alpha h) = 1 \ge \hat{\epsilon}_D(h)$. If $\alpha > 0$,
$$\hat{\ell}_D(\alpha h) = \sum_{i=1}^N D(i)\exp(-y_i\alpha h(x_i)) \ge \sum_{i=1}^N D(i)\,[[y_i\alpha h(x_i) \le 0]] = \sum_{i=1}^N D(i)\,[[y_i h(x_i) \le 0]] = \hat{\epsilon}_D(h)$$
Proof of Lemma 1.3. Expanding the expected exponential loss,
$$\hat{\ell}_D(\alpha h) = \sum_{i=1}^N D(i)\exp(-y_i\alpha h(x_i)) = \sum_{i=1}^N D(i)\,[[h(x_i) \ne y_i]]\exp(\alpha) + \sum_{i=1}^N D(i)\,[[h(x_i) = y_i]]\exp(-\alpha) = \hat{\epsilon}_D(h)\exp(\alpha) + (1-\hat{\epsilon}_D(h))\exp(-\alpha)$$
For any $\epsilon \in (0, 1/2]$, consider the objective $J_\epsilon : \mathbb{R} \to \mathbb{R}$ defined by $J_\epsilon(\alpha) := \epsilon\exp(\alpha) + (1-\epsilon)\exp(-\alpha)$. Since $J_\epsilon''(\alpha) > 0$ for all $\alpha \in \mathbb{R}$, it is sufficient to find a stationary point to find the unique optimal solution. Setting $J_\epsilon'(\alpha) = 0$ gives $\alpha = \log\sqrt{(1-\epsilon)/\epsilon}$ with minimum value $J_\epsilon(\alpha) = 2\sqrt{\epsilon(1-\epsilon)}$.
Thus $\alpha = \log\sqrt{(1-\hat{\epsilon}_D(h))/\hat{\epsilon}_D(h)} \ge 0$ is the unique minimizer and $2\sqrt{\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))}$ is the minimum. Plugging $\hat{\epsilon}_D(h) = \frac{1}{2}(1 - \hat{\gamma}_D(h))$ (Lemma 1.1) into the minimizer, we also have $\alpha = \log\sqrt{(1+\hat{\gamma}_D(h))/(1-\hat{\gamma}_D(h))}$. To get the expression of the minimum in terms of the edge, we use the algebraic fact $4z(1-z) = 1 - (1 - 4z + 4z^2) = 1 - (1-2z)^2$ for any $z \in \mathbb{R}$. Then
$$2\sqrt{\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))} = \sqrt{4\hat{\epsilon}_D(h)(1-\hat{\epsilon}_D(h))} = \sqrt{1 - (1 - 2\hat{\epsilon}_D(h))^2} = \sqrt{1 - \hat{\gamma}_D(h)^2}$$
Proof of Lemma 2.1.
$$D_t(i) = \frac{\exp(-y_i g_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_t(x_j))} = \frac{\exp(-y_i g_{t-1}(x_i))\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N \exp(-y_j g_{t-1}(x_j))\exp(-y_j\alpha_t h_t(x_j))} = \frac{\left(\exp(-y_i g_{t-1}(x_i))/C\right)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N \left(\exp(-y_j g_{t-1}(x_j))/C\right)\exp(-y_j\alpha_t h_t(x_j))} = \frac{D_{t-1}(i)\exp(-y_i\alpha_t h_t(x_i))}{\sum_{j=1}^N D_{t-1}(j)\exp(-y_j\alpha_t h_t(x_j))}$$
where we define the constant $C := \sum_{k=1}^N \exp(-y_k g_{t-1}(x_k))$. This proves the first equality. The second equality holds by the definition of the expected weighted exponential loss. The third equality then holds inductively.
$\langle\alpha^{(1)}, H(x)\rangle = \eta_1 h_{k_1}(x)$ is the same as the step-1 ensemble in AdaBoost. At step $t > 1$, assume $\langle\alpha^{(t-1)}, H(x)\rangle$ is the same as the step-$(t-1)$ ensemble in AdaBoost. Then $\tilde{D}_{t-1} = D_{t-1}$, so $h_{k_t} \in \arg\min_{h\in H}\hat{\epsilon}_{D_{t-1}}(h)$ and $\langle\alpha^{(t)}, H(x)\rangle = \langle\alpha^{(t-1)}, H(x)\rangle + \eta_t h_{k_t}(x)$ is the same as the step-$t$ ensemble in AdaBoost by Lemmas A.1 and A.2.
Lemma A.1.
$$[\nabla\hat{\ell}_U(\alpha^{(t-1)})]_k = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$
where $M_{t-1} := \sum_{j=1}^N \exp(-y_j\langle\alpha^{(t-1)}, H(x_j)\rangle)$ and $\tilde{D}_{t-1}(i) := \exp(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle)/M_{t-1}$.
Proof.
$$[\nabla\hat{\ell}_U(\alpha^{(t-1)})]_k = \frac{\partial}{\partial\alpha_k}\,\frac{1}{N}\sum_{i=1}^N \exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i) = \frac{M_{t-1}}{N}\left(\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\right) = \frac{M_{t-1}}{N}\left(2\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) - 1\right)$$
Lemma A.2.
$$\log\sqrt{\frac{1-\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}{\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}} \in \arg\min_{\eta\in\mathbb{R}}\ \hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)$$
Proof.
$$\frac{\partial\,\hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)}{\partial\eta} = \frac{\partial}{\partial\eta}\,\frac{1}{N}\sum_{i=1}^N \exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle - \eta y_i h_k(x_i)\right) = -\frac{1}{N}\sum_{i=1}^N y_i h_k(x_i)\exp\left(-y_i\langle\alpha^{(t-1)}, H(x_i)\rangle\right)\exp\left(-\eta y_i h_k(x_i)\right) = -\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right)$$
$\hat{\ell}_U(\alpha^{(t-1)} + \eta e_k)$ is a composition of a convex (by premise) and a linear function in $\eta$, thus convex. We set the derivative to zero to find a minimizer:
$$-\frac{M_{t-1}}{N}\sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right) = 0 \;\Leftrightarrow\; \sum_{i=1}^N y_i h_k(x_i)\,\tilde{D}_{t-1}(i)\exp\left(-\eta y_i h_k(x_i)\right) = 0 \;\Leftrightarrow\; -\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)\exp(\eta) + (1 - \hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))\exp(-\eta) = 0$$
Solving for $\eta$ yields $\eta = \log\sqrt{(1-\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k))/\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k)}$, assuming $\hat{\epsilon}_{\tilde{D}_{t-1}}(h_k) \ne 0$.
B AdaBoost in the Zero-One Loss Form

AdaBoost
Input:
• S = {(x1, y1) . . . (xN, yN)} where xi ∈ X and yi ∈ {±1}
• Hypothesis class H of base classifiers h : X → {±1}, number of rounds T
Output: ensemble gT : X → R
1. Set D0(i) ← 1/N for i = 1 . . . N.
2. For t = 1 . . . T:
(a) Train a base classifier ht ∈ H with a small weighted zero-one loss ε̂_{D_{t−1}}(ht) < 1/2.
(b) Set αt ← log √((1 − ε̂_{D_{t−1}}(ht)) / ε̂_{D_{t−1}}(ht)).
(c) For i = 1 . . . N: set Dt(i) ← Dt−1(i)/(2(1 − ε̂_{D_{t−1}}(ht))) if ht(xi) = yi, and Dt(i) ← Dt−1(i)/(2 ε̂_{D_{t−1}}(ht)) otherwise.
3. Return gT(x) ← Σ_{t=1}^T αt ht(x).
C Gini Impurity
Given any set of $N$ labeled examples $S$ and a full-support distribution $D$ over their indices, the Gini impurity is defined as $2p(1-p)$ where $p = \sum_{i : y_i = 1} D(i)$ is the probability of drawing label 1 from $S$ under $D$. This is minimized at 0 if all examples are labeled as either 1 or $-1$, and maximized at $1/2$ if $p = 1/2$. A split is optimal if the expected Gini impurity of the resulting partition is the smallest. We can derive an $O(Nd)$-time algorithm to find an optimal split, again assuming that each dimension is sorted:
MinimizeGiniImpurity
Input:
• S = {(x1, y1) . . . (xN, yN)} where N ≥ 2, xi ∈ R^d, and yi ∈ {±1}; [x1]_r ≤ · · · ≤ [xN]_r for r = 1 . . . d; τ_i^(r) := ([x_{i−1}]_r + [x_i]_r)/2
• Distribution D over {1 . . . N}
Output: split (r*, τ*) with minimum expected Gini impurity, or fail if there is no split
1. γ* ← ∞
2. D+1 ← Σ_{j=1}^N [[yj = 1]] D(j)
3. For r = 1 . . . d:
(a) p ← 0 # Left label-1 mass
(b) p′ ← D+1 # Right label-1 mass
(c) β ← 0 # Left mass
(d) For j = 2 . . . N such that [x_{j−1}]_r < [x_j]_r: # Can we use r for thresholding?
i. p ← p + [[y_{j−1} = 1]] D(j − 1)
ii. p′ ← p′ − [[y_{j−1} = 1]] D(j − 1)
iii. β ← β + D(j − 1)
iv. γ ← β · 2(p/β)(1 − p/β) + (1 − β) · 2(p′/(1 − β))(1 − p′/(1 − β)) # expected Gini impurity of the split at τ_j^(r)
v. If γ < γ*, set γ* ← γ, r* ← r, and τ* ← τ_j^(r).
4. If γ* = ∞, fail. Otherwise, return (r*, τ*).
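A Python sketch mirroring MinimizeGiniImpurity, assuming (as the notes do) that D has full support; as in the stump sketch, the running masses are always accumulated and only the threshold evaluation is skipped at duplicate values. For example (numbers chosen for illustration), a split that puts mass 0.4 on the left with label-1 mass 0.3 (conditional label-1 probability 0.75) and label-1 mass 0.2 on the right (conditional probability 1/3) has expected impurity 0.4 · 2(0.75)(0.25) + 0.6 · 2(1/3)(2/3) ≈ 0.42, versus 2(0.5)(0.5) = 0.5 before splitting.

import numpy as np

def minimize_gini_impurity(X, y, D):
    """Single sweep over midpoint thresholds minimizing expected (weighted) Gini impurity.
    X: (N, d) array, y: (N,) labels in {-1, +1}, D: (N,) full-support distribution.
    Returns (r, tau), or None if no dimension admits a threshold."""
    N, d = X.shape
    best = (np.inf, None, None)
    total_pos = float(np.sum(D * (y == 1)))          # D_{+1}
    for r in range(d):
        order = np.argsort(X[:, r], kind="stable")
        xs, ys, ds = X[order, r], y[order], D[order]
        p_left, p_right, beta = 0.0, total_pos, 0.0  # left/right label-1 mass, left mass
        for j in range(1, N):
            m = ds[j - 1] * float(ys[j - 1] == 1)
            p_left, p_right, beta = p_left + m, p_right - m, beta + ds[j - 1]
            if xs[j - 1] < xs[j]:                    # a threshold exists only between distinct values
                ql, qr = p_left / beta, p_right / (1.0 - beta)   # conditional label-1 probabilities
                gini = beta * 2 * ql * (1 - ql) + (1 - beta) * 2 * qr * (1 - qr)
                if gini < best[0]:
                    best = (gini, r, 0.5 * (xs[j - 1] + xs[j]))
    return None if best[1] is None else (best[1], best[2])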
D Decision Trees
Let $X = \mathbb{R}^d$. We say $R \subseteq \mathbb{R}^d$ is a hyperrectangle if there exist $a_R, b_R \in (\mathbb{R}\cup\{\pm\infty\})^d$ such that $R = \{x \in \mathbb{R}^d : a_R < x \le b_R\}$, where the inequalities are element-wise. We say $\mathcal{R} = \{R_1 \ldots R_M\}$ is a hyperrectangle partition if $R_1 \ldots R_M$ are hyperrectangles such that $\cup_{R\in\mathcal{R}} R = \mathbb{R}^d$ and $R_i \cap R_j = \emptyset$ for all $i \ne j$. For $x \in \mathbb{R}^d$, we write $\mathcal{R}(x) \in \{1 \ldots M\}$ to denote the unique hyperrectangle in $\mathcal{R}$ that $x$ belongs to. A decision tree is a mapping $x \mapsto \pi(\mathcal{R}(x))$ where $\mathcal{R}$ is a hyperrectangle partition and $\pi : \{1 \ldots M\} \to \{\pm 1\}$ is a region-labeling. It is called a decision tree because it can be expressed as a binary tree with $M$ leaves. Let $\nu$ denote a node object. If internal, it is equipped with a dimension $r \in \{1 \ldots d\}$ and a threshold $\tau \in \mathbb{R}$ as well as left and right child nodes; if a leaf, it is equipped with a label $y \in \{\pm 1\}$. Given any $(\mathcal{R}, \pi)$, by definition there is a binary tree with a root node $\nu_{\text{root}}$ such that $\pi(\mathcal{R}(x)) = \text{Traverse}(x, \nu_{\text{root}})$ where
Traverse(x, ν)
1. If ν is a leaf node, return ν.y.
2. Otherwise,
(a) If [x]ν.r > ν.τ , return Traverse(x, ν.left)
(b) If [x]ν.r ≤ ν.τ , return Traverse(x, ν.right)
Let Htrees(M ) denote the hypothesis class of decision trees with M leaves. We would like to minimize a loss function
over Htrees(M ) but an exhaustive search is intractable unless M is small (e.g., M = 2 in DecisionStump): the
number of labelings is exponential in N since a decision tree is highly nonlinear (e.g., it can fit the XOR mapping
with M = 4 regions). Thus we adopt a top-down greedy heuristic. We assume a single-dimension splitting algorithm
A that maps any data-distribution pair (S 0 , D0 ) to a best dimension r ∈ {1 . . . d} and threshold τ ∈ R according to
some metric.
BuildTree
Input: S = {(x1, y1) . . . (xN, yN)}, distribution D over {1 . . . N}, number of splits P ≤ ⌊log N⌋, single-dimension splitting algorithm A
Output: root node νroot of a binary tree with at most 2^P leaves
1. Initialize νroot and q ← queue([(νroot, S, D)])
2. For P times or until q is empty,
(a) (ν, S′, D′) ← q.pop(); if the labels in S′ are pure, go to Step 2.
(b) (ν.r, ν.τ) ← A(S′, D′)
(c) Partition S′ into S′1, S′2 by thresholding dimension ν.r using ν.τ.
(d) Compute distributions D′1, D′2 over S′1, S′2 by renormalizing D′.
(e) Initialize ν.left and push (ν.left, S′1, D′1) onto q.
(f) Initialize ν.right and push (ν.right, S′2, D′2) onto q.
3. Return νroot.
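A Python sketch of BuildTree, using the minimize_gini_impurity sketch above (or any routine returning a dimension-threshold pair) as the splitting algorithm A; carrying index sets instead of data subsets is an implementation choice of this sketch.

from collections import deque
import numpy as np

class Node:
    """Internal nodes carry (r, tau, left, right); leaves carry a label y."""
    def __init__(self):
        self.r = self.tau = self.left = self.right = self.y = None

def build_tree(X, y, D, P, splitter):
    """Greedy top-down tree growing; splitter(X, y, D) returns (r, tau) or None."""
    root = Node()
    q = deque([(root, np.arange(len(y)))])
    for _ in range(P):
        if not q:
            break
        node, idx = q.popleft()
        node.y = 1 if float(np.sum(D[idx] * y[idx])) >= 0 else -1   # weighted majority label
        if len(set(y[idx])) == 1:                                   # pure labels: leave as a leaf
            continue
        split = splitter(X[idx], y[idx], D[idx] / D[idx].sum())     # renormalize D on the subset
        if split is None:                                           # no valid threshold
            continue
        node.r, node.tau = split
        gt = idx[X[idx, node.r] > node.tau]    # "> tau" side goes to the left child (Traverse convention)
        le = idx[X[idx, node.r] <= node.tau]
        node.left, node.right = Node(), Node()
        q.append((node.left, gt))
        q.append((node.right, le))
    for node, idx in q:                        # any nodes still queued become leaves
        node.y = 1 if float(np.sum(D[idx] * y[idx])) >= 0 else -1
    return root

def traverse(x, node):
    if node.r is None:
        return node.y
    return traverse(x, node.left) if x[node.r] > node.tau else traverse(x, node.right)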
This is a heuristic with no optimality guarantee. For example, if we use DecisionStump (which does find an optimal stump) as our choice of A, the resulting tree is generally suboptimal for $P \ge 2$: that is, there may be a different tree $h \in H_{\text{trees}(2^P)}$ that achieves a larger edge. One popular approach is to grow a full tree (i.e., $P = \lfloor\log N\rfloor$) using a "label purity" splitting metric (e.g., Gini impurity, Appendix C) and then prune the full tree to minimize the misclassification rate of the leaf nodes. Decision trees can be used as base classifiers of AdaBoost, regardless of the specifics of how they are learned.
E Gradient Boosting
Let $\mathcal{F}$ denote the set of all functions $f : X \to \mathbb{R}$. This is a vector space because functions are closed under (element-wise) addition and scalar multiplication. We may assume an inner product $\langle\cdot,\cdot\rangle : \mathcal{F}\times\mathcal{F} \to \mathbb{R}$, with the norm $\|f\| := \sqrt{\langle f, f\rangle}$, for certain subspaces such as square-integrable functions or a reproducing kernel Hilbert space (RKHS). We will assume that $\mathcal{F}$ is an RKHS induced by a kernel $K : X \times X \to \mathbb{R}$, whose elements are (limits of) finite combinations $f = \sum_{x'\in C_f}\alpha^f_{x'} K(\cdot, x')$ with $C_f \subset X$ finite and coefficients $\alpha^f \in \mathbb{R}^{C_f}$. Note that $K(\cdot, x) \in \mathcal{F}$ for every $x \in X$. The inner product is defined as $\langle f, g\rangle := (\alpha^f)^\top K^{f,g}\alpha^g$ where $K^{f,g} := [K(x, x')]_{x\in C_f,\, x'\in C_g}$; in particular
$$\langle f, K(\cdot, x)\rangle = \sum_{x'\in C_f}\alpha^f_{x'}K(x', x) = f(x)$$
hence the name "reproducing". By the Moore-Aronszajn theorem, any kernel induces a unique associated RKHS.
Functional gradient descent. Let $\hat{L} : \mathcal{F} \to \mathbb{R}$ be a differentiable loss over $S = \{(x_1, y_1) \ldots (x_N, y_N)\} \subset X \times \mathbb{R}$ (e.g., the squared loss $\hat{L}_{\text{sq}}(f) := \frac{1}{2}\sum_{i=1}^N (f(x_i) - y_i)^2$). We may do steepest descent on $\hat{L}$ over $\mathcal{F}$. This means we pick an initial $f_0 \in \mathcal{F}$ and for $t = 1 \ldots T$ find $g_t \in \mathcal{F}$ with a bounded norm such that $\hat{L}(f_{t-1} + g_t)$ is small. Using the linear approximation $\hat{L}(f_{t-1} + g_t) \approx \hat{L}(f_{t-1}) + \langle\nabla\hat{L}(f_{t-1}), g_t\rangle$, the steepest descent direction is given by $g_t = \eta_t(-\nabla\hat{L}(f_{t-1}))$ where $\eta_t \in \mathbb{R}$. Then we set
$$f_t = f_{t-1} + g_t$$
Note that $f_T = f_0 + \sum_{t=1}^T g_t$ can be viewed as an ensemble; at test time, given $x \in X$ the model computes $f_0(x) \in \mathbb{R}$ and $g_1(x) \ldots g_T(x) \in \mathbb{R}$, then returns $f_T(x) = f_0(x) + g_1(x) + \cdots + g_T(x)$.
While conceptually nice, functional gradient descent is not implementable due to its nonparametric nature. For instance, the descent direction $-\nabla\hat{L}_{\text{sq}}(f_{t-1}) = \sum_{i=1}^N (y_i - f_{t-1}(x_i))\,K(\cdot, x_i)$ is an abstract real-valued mapping over $X$. However, we can easily evaluate it on $S$. Define
$$S_t = \left\{(x_i, y_i') \;:\; y_i' = -\nabla\hat{L}(f_{t-1})(x_i),\ i \in \{1 \ldots N\}\right\}$$
We can treat $S_t$ as labeled data and fit some parametric model to approximate $h_t \approx -\nabla\hat{L}(f_{t-1})$. We decide $\eta_t$ by some learning rate schedule or line search (i.e., $\min_{\eta\in\mathbb{R}}\hat{L}(f_{t-1} + \eta h_t)$), and set $g_t = \eta_t h_t$.
Regression. For the squared loss, the label $y_i'$ of $x_i$ in $S_t$ is
$$-\nabla\hat{L}_{\text{sq}}(f)(x_i) = y_i - f(x_i)$$
namely the $i$-th residual of $f$. We will use the squared loss to fit the residuals. We assume a parametric hypothesis class that is easy to optimize over (e.g., decision trees, linear regressors). With the constant step size $\eta_t = 1$, the gradient boosting algorithm is
1. Initialize $h_0 \in \arg\min_{h:X\to\mathbb{R}} \sum_{i=1}^N (y_i - h(x_i))^2$.
2. For $t = 1 \ldots T$, find
$$h_t \in \arg\min_{h:X\to\mathbb{R}} \sum_{i=1}^N \left(\left(y_i - \sum_{s=0}^{t-1} h_s(x_i)\right) - h(x_i)\right)^2$$
3. Given $x \in X$, predict $\sum_{t=0}^T h_t(x)$.
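A minimal Python sketch of squared-loss gradient boosting with step size 1. The regression-stump base learner and the constant initial predictor h_0 are simple choices made for this sketch; the notes leave the parametric class open (decision trees, linear regressors, etc.).

import numpy as np

def fit_regression_stump(X, target):
    """Least-squares regression stump x -> a if x[r] > tau else b."""
    N, d = X.shape
    best = (np.inf, None)
    for r in range(d):
        for tau in np.unique(X[:, r])[:-1]:
            right = X[:, r] > tau
            a, b = target[right].mean(), target[~right].mean()
            sse = ((target[right] - a) ** 2).sum() + ((target[~right] - b) ** 2).sum()
            if sse < best[0]:
                best = (sse, (r, tau, a, b))
    if best[1] is None:                       # all features constant: fall back to the mean
        c = float(target.mean())
        return lambda Z: np.full(len(Z), c)
    r, tau, a, b = best[1]
    return lambda Z: np.where(Z[:, r] > tau, a, b)

def gradient_boost_regression(X, y, T):
    """Repeatedly fit the current residuals y_i - sum_{s<t} h_s(x_i) with squared loss."""
    ensemble = [lambda Z: np.full(len(Z), float(y.mean()))]   # h_0: best constant predictor
    pred = ensemble[0](X)
    for _ in range(T):
        h = fit_regression_stump(X, y - pred)                 # fit the residuals
        ensemble.append(h)
        pred = pred + h(X)
    return lambda Z: sum(h(Z) for h in ensemble)              # predict sum_{t=0}^T h_t(x)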
Classification. Now suppose $y_i \in \{1 \ldots K\}$ and $f : X \to \mathbb{R}^K$, with $p_f(k|x) := \exp(f^{(k)}(x)) / \sum_{k'=1}^K \exp(f^{(k')}(x))$ and the cross-entropy loss $\hat{L}_{\text{CE}}(f) := -\sum_{i=1}^N \log p_f(y_i|x_i)$. The $k$-th component of the negative gradient evaluated at a training input is
$$-\nabla_k\hat{L}_{\text{CE}}(f)(x_i) = [[y_i = k]] - p_f(k|x_i)$$
namely the $i$-th residual of $f$ for the $k$-th class. We will again use the squared loss to fit the residuals, an easy-to-optimize-over hypothesis class, and $\eta_t = 1$. Then the gradient boosting algorithm is
1. Initialize $h_0 \in \arg\min_{h:X\to\mathbb{R}^K} \sum_{i=1}^N \left(\log\sum_{k=1}^K \exp h^{(k)}(x_i) - h^{(y_i)}(x_i)\right)$.
2. For $t = 1 \ldots T$, find
$$h_t \in \arg\min_{h:X\to\mathbb{R}^K} \sum_{i=1}^N \sum_{k=1}^K \left([[y_i = k]] - p_{f_{t-1}}(k|x_i) - h^{(k)}(x_i)\right)^2 \qquad \text{where } f_{t-1} := \sum_{s=0}^{t-1} h_s$$
3. Given $x \in X$, predict $y^* \in \arg\max_{y \in \{1\ldots K\}} \sum_{t=0}^T h_t^{(y)}(x)$.
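A sketch of the multiclass variant, reusing fit_regression_stump from the previous sketch to fit each class's residuals (one regressor per class per round is one simple way to parametrize h_t : X → R^K; the notes leave this choice open). Classes are 0-indexed for convenience.

import numpy as np

def softmax(F):
    Z = np.exp(F - F.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def gradient_boost_classification(X, y, K, T):
    """X: (N, d) array, y: (N,) integer labels in {0, ..., K-1}."""
    N = len(y)
    Y = np.eye(K)[y]                          # one-hot targets [[y_i = k]]
    h0 = np.log(Y.mean(axis=0) + 1e-12)       # constant scores minimizing cross-entropy
    F = np.tile(h0, (N, 1))                   # F[i, k] = f_{t-1}^{(k)}(x_i)
    rounds = []
    for _ in range(T):
        R = Y - softmax(F)                    # residuals [[y_i = k]] - p_f(k | x_i)
        hs = [fit_regression_stump(X, R[:, k]) for k in range(K)]
        rounds.append(hs)
        F = F + np.column_stack([h(X) for h in hs])
    def predict(Z):
        S = np.tile(h0, (len(Z), 1))
        for hs in rounds:
            S = S + np.column_stack([h(Z) for h in hs])
        return S.argmax(axis=1)               # argmax_k sum_t h_t^{(k)}(x)
    return predict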