Boosting Algorithms as Gradient Descent
Abstract
Much recent attention, both experimental and theoretical, has been focused on classification algorithms which produce voted combinations of classifiers. Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data.
We present an abstract algorithm for finding linear combinations of functions that minimize arbitrary cost functionals (i.e., functionals that do not necessarily depend on the margin). Many existing voting methods can be shown to be special cases of this abstract algorithm. Then, following previous theoretical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions.
Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations. Margin distribution plots verify that DOOM II is willing to `give up' on examples that are too hard in order to avoid overfitting. We also show that the overfitting behavior exhibited by AdaBoost can be quantified in terms of our proposed cost function.
1 Introduction
There has been considerable interest recently in voting methods for pattern classification, which predict the label of a particular example using a weighted vote over a set of base classifiers. For example, Freund and Schapire's AdaBoost algorithm [10] and Breiman's Bagging algorithm [2] have been found to give significant performance improvements over algorithms for the corresponding base classifiers [6, 9, 16, 5], and have led to the study of many related algorithms [3, 19, 12, 17, 7, 11, 8]. Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers. The margin of an example is defined as the difference between the total weight assigned to the correct label and the largest weight assigned to an incorrect label. We can interpret the value of the margin as an indication of the confidence of correct classification: an example is classified correctly if and only if it has a positive margin, and a larger margin can be viewed as a more confident correct classification. Results in [1] and [18] show that, loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small.
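For a two-class problem with ±1-valued base classifiers and a normalized weighted vote F, this margin reduces to y·F(x). A minimal illustration in Python (the helper below is our own, not from the paper):

def margin(x, y, hypotheses, weights):
    # Normalized weighted vote F(x); each hypothesis returns +1 or -1, y is the true +1/-1 label.
    total = sum(weights)
    F_x = sum(w * h(x) for h, w in zip(hypotheses, weights)) / total
    return y * F_x  # positive if and only if the example is classified correctly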
In [14], Mason, Bartlett and Baxter have presented improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins. That paper also describes experiments with an algorithm that directly minimizes this cost function through the choice of weights associated with each base classifier. This algorithm exhibits performance improvements over AdaBoost, which suggests that these margin cost functions are appropriate quantities to optimize.
In this paper, we present a general class of algorithms (called AnyBoost) which are gradient descent algorithms for choosing linear combinations of elements of an inner product space so as to minimize some cost functional. Each component of the linear combination is chosen to maximize a certain inner product. (In the specific case of choosing a combination of classifiers to optimize the sample average of a cost function of the margin, the choice of the base classifier corresponds to a minimization problem involving weighted classification error. That is, for a certain weighting of the training data, the base classifier learning algorithm attempts to return a classifier that minimizes the weight of misclassified training examples.) In Section 4, we give convergence results for this class of algorithms.
In Section 3, we show that this general class of algorithms includes as special cases a number of popular and successful voting methods, including Freund and Schapire's AdaBoost [10], Schapire and Singer's extension of AdaBoost to combinations of real-valued functions [19], and Friedman, Hastie and Tibshirani's LogitBoost [12]. That is, all of these algorithms implicitly minimize some margin cost function by gradient descent.
In Section 5, we present experimental results for a particular implementation of the AnyBoost algorithm using cost functions of the margin that are motivated by the theoretical results presented in [14]. The cost functions suggested by these results are significantly different from the cost functions that are implicitly minimized by the methods described in Section 3. The experiments show that the new algorithm typically outperforms AdaBoost, and that this is especially true with label noise. In addition, the theoretically-motivated cost functions provide good estimates of the error of AdaBoost, in the sense that they can be used to predict its overfitting behaviour.
⟨F, G⟩ := (1/m) Σ_{i=1}^m F(x_i) G(x_i)   (2)

for all F, G ∈ lin(F). However, the AnyBoost algorithm defined in this section and its convergence properties studied in Section 4 are valid for any cost function and inner product.
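As a concrete reading of (2), assuming each hypothesis can be evaluated on the m training points, the sample inner product can be computed as follows (an illustrative sketch, not the authors' code):

def inner_product(F, G, X):
    # <F, G> := (1/m) * sum_i F(x_i) * G(x_i) over the training sample X.
    m = len(X)
    return sum(F(x) * G(x) for x in X) / m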
Now suppose we have a function F ∈ lin(F) and we wish to find a new f ∈ F to add to F so that the cost C(F + εf) decreases, for some small value of ε. Viewed in function space terms, we are asking for the "direction" f such that C(F + εf) most rapidly decreases. The desired direction is simply the negative of the functional derivative of C at F, −∇C(F)(x), where

∇C(F)(x) := ∂C(F + α 1_x)/∂α |_{α=0},

where 1_x is the indicator function of x. Since we are restricted to choosing our new function f from F, in general it will not be possible to choose f = −∇C(F), so instead we search for an f with greatest inner product with −∇C(F). That is, we should choose f to maximize −⟨∇C(F), f⟩. This can be motivated by observing that, to first order in ε, C(F + εf) = C(F) + ε⟨∇C(F), f⟩, and hence the greatest reduction in cost will occur for the f maximizing −⟨∇C(F), f⟩.
The preceding discussion motivates Algorithm 1, an iterative algorithm for finding linear combinations F of base hypotheses in F that minimize the cost C(F). Note that we have allowed the base hypotheses to take values in an arbitrary set Y, we have not restricted the form of the cost or the inner product, and we have not specified what the step-sizes should be. Appropriate choices for these things will be made when we apply the algorithm to more concrete situations. Note also that the algorithm terminates when −⟨∇C(F_t), f_{t+1}⟩ ≤ 0, i.e., when the weak learner L returns a base hypothesis f_{t+1} which no longer points in the downhill direction of the cost function C(F). Thus, the algorithm terminates when, to first order, a step in function space in the direction of the base hypothesis returned by L would increase the cost.
Algorithm 1 : AnyBoost

Require:
An inner product space (X, ⟨·,·⟩) containing functions mapping from X to some set Y.
A class of base classifiers F ⊆ X.
A differentiable cost functional C : lin(F) → ℝ.
A weak learner L(F) that accepts F ∈ lin(F) and returns f ∈ F with a large value of −⟨∇C(F), f⟩.

Let F_0(x) := 0.
for t := 0 to T do
    Let f_{t+1} := L(F_t).
    if −⟨∇C(F_t), f_{t+1}⟩ ≤ 0 then
        return F_t.
    end if
    Choose w_{t+1}.
    Let F_{t+1} := F_t + w_{t+1} f_{t+1}.
end for
return F_{T+1}.
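The loop above translates directly into code. The sketch below is our own illustration of Algorithm 1, not the authors' implementation; weak_learner, gradient, inner_product and step_size stand for L, ∇C, ⟨·,·⟩ (assumed to close over the training sample) and an unspecified step-size rule.

def anyboost(weak_learner, gradient, inner_product, step_size, T):
    # The ensemble is a list of (weight, hypothesis) pairs representing
    # F = sum_t w_t * f_t; the empty list is the zero function F_0.
    ensemble = []
    F = lambda x: sum(w * f(x) for w, f in ensemble)
    for t in range(T + 1):
        f_next = weak_learner(F)                 # f_{t+1} := L(F_t)
        grad = gradient(F)                       # grad C(F_t), itself a function of x
        if -inner_product(grad, f_next) <= 0:    # no downhill direction remains
            return ensemble                      # return F_t
        w_next = step_size(F, f_next)            # choose w_{t+1}
        ensemble.append((w_next, f_next))        # F_{t+1} := F_t + w_{t+1} f_{t+1}
    return ensemble                              # return F_{T+1}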
For margin cost functionals of the form C(F) = (1/m) Σ_{i=1}^m c(y_i F(x_i)), the quantity the weak learner should maximize becomes

−⟨∇C(F), f⟩ = −(1/m²) Σ_{i=1}^m y_i f(x_i) c′(y_i F(x_i)),

so that choosing f ∈ F to maximize −⟨∇C(F), f⟩ is equivalent to choosing f to minimize the weighted error Σ_{i : f(x_i) ≠ y_i} D(i), where the weights D(i) are proportional to |c′(y_i F(x_i))| and normalized so that Σ_{i=1}^m D(i) = 1.
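Concretely, the weak learner in this setting only needs per-example weights. A sketch under the reconstruction above (c_prime denotes the derivative c′ of the margin cost; names are ours):

def example_weights(F, X, y, c_prime):
    # D(i) proportional to |c'(y_i F(x_i))|, normalized to sum to one.
    raw = [abs(c_prime(y_i * F(x_i))) for x_i, y_i in zip(X, y)]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_error(f, X, y, D):
    # Total weight of training examples misclassified by f.
    return sum(D_i for x_i, y_i, D_i in zip(X, y, D) if f(x_i) != y_i)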
Many of the most successful voting methods are, for the appropriate choice of cost function and step-size, specific cases of the AnyBoost algorithm. Table 1 summarizes the AnyBoost cost function and step-size settings needed to obtain the AdaBoost [10], confidence-rated AdaBoost [19], ARC-X4 [3] and LogitBoost [12] algorithms. A more detailed analysis of these algorithms as specific cases of AnyBoost can be found in the full version of this paper [15].
Table 1: Existing voting methods viewed as gradient descent optimizers of margin cost functions.

Algorithm              Cost function         Step size
AdaBoost [10]          e^(−yF(x))            Line search
ARC-X4 [3]             (1 − yF(x))^5         1/t
ConfidenceBoost [19]   e^(−yF(x))            Line search
LogitBoost [12]        ln(1 + e^(−yF(x)))    Newton-Raphson

4 Convergence of AnyBoost

In this section we provide convergence results for the abstract AnyBoost algorithm, under quite weak conditions on the cost functional C. The prescriptions given for the step-sizes w_t in these results are for convergence guarantees only: in practice they will almost always be smaller than necessary, hence fixed small steps or some form of line search should be used.
The following theorem (proof omitted, see [15]) supplies a specific step-size for AnyBoost and characterizes the limiting behaviour with this step-size.

Theorem 1. Let C : lin(F) → ℝ be any lower bounded, Lipschitz differentiable cost functional (that is, there exists L > 0 such that ‖∇C(F) − ∇C(F′)‖ ≤ L‖F − F′‖ for all F, F′ ∈ lin(F)). Let F_0, F_1, ... be the sequence of combined hypotheses generated by the AnyBoost algorithm, using step-sizes

w_{t+1} := −⟨∇C(F_t), f_{t+1}⟩ / (L ‖f_{t+1}‖²).   (3)

Then AnyBoost either halts on round T with −⟨∇C(F_T), f_{T+1}⟩ ≤ 0, or C(F_t) converges to some finite value, in which case lim_{t→∞} ⟨∇C(F_t), f_{t+1}⟩ = 0.
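For illustration, step-size (3) can be computed from sample averages, assuming the inner product of (2) and a known (or estimated) Lipschitz constant L (our sketch, not part of the paper):

def theorem1_step_size(grad_F, f_next, X, L):
    # w_{t+1} = -<grad C(F_t), f_{t+1}> / (L * ||f_{t+1}||^2), with the sample inner product.
    m = len(X)
    ip = sum(grad_F(x) * f_next(x) for x in X) / m      # <grad C(F_t), f_{t+1}>
    norm_sq = sum(f_next(x) ** 2 for x in X) / m        # ||f_{t+1}||^2 = <f_{t+1}, f_{t+1}>
    return -ip / (L * norm_sq)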
The next theorem (proof omitted, see [15]) shows that if the weak learner can always find the best weak hypothesis f_t ∈ F on each round of AnyBoost, and if the cost functional C is convex, then any accumulation point F of the sequence (F_t) generated by AnyBoost with the step-sizes (3) is a global minimum of the cost. For ease of exposition, we have assumed that rather than terminating when −⟨∇C(F_T), f_{T+1}⟩ ≤ 0, AnyBoost simply continues to return F_T for all subsequent time steps t.

Theorem 2. Let C : lin(F) → ℝ be a convex cost functional with the properties in Theorem 1, and let (F_t) be the sequence of combined hypotheses generated by the AnyBoost algorithm with step-sizes given by (3). Assume that the weak hypothesis class F is negation closed (f ∈ F ⟹ −f ∈ F) and that on each round the AnyBoost algorithm finds a function f_{t+1} maximizing −⟨∇C(F_t), f_{t+1}⟩. Then any accumulation point F of the sequence (F_t) satisfies

sup_{f ∈ F} −⟨∇C(F), f⟩ = 0,   and   C(F) = inf_{G ∈ lin(F)} C(G).
5 Experiments
AdaBoost had been perceived to be resistant to overfitting despite the fact that it can produce combinations involving very large numbers of classifiers. However, recent studies have shown that this is not the case, even for base classifiers as simple as decision stumps. Grove and Schuurmans [13] demonstrated that running AdaBoost for hundreds of thousands of rounds can lead to significant overfitting, while a number of authors (e.g., [5, 17]) showed that, by adding label noise, overfitting can be induced in AdaBoost even with relatively few classifiers in the combination.
The main theoretical result from [14] provides bounds on the generalization performance of a convex combination of classifiers in terms of training sample averages of certain sigmoid-like cost functions of the margin. Given this theoretical motivation we propose a new algorithm (DOOM II) which is a specific case of AnyBoost using the cost functional

C(F) = (1/m) Σ_{i=1}^m (1 − tanh(λ y_i F(x_i))),   (4)

where F is restricted to be a convex combination of classifiers from some base class F and λ is an adjustable parameter of the cost function. Henceforth we will refer to (4) as the normalized sigmoid cost function (normalized because the weights are normalized so F is a convex combination). This family of cost functions (parameterized by λ) is qualitatively similar to the theoretically motivated family of cost functions used in [14]. Using the family from [14] in practice may cause difficulties for a gradient descent procedure because the functions are very flat for negative margins and for margins close to 1. Using the normalized sigmoid cost function alleviates this problem.
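As an illustration (our own sketch, with lam standing for the parameter λ), the cost (4) and the per-example weights it induces through its derivative c′(z) = −λ(1 − tanh²(λz)) can be written as:

import math

def normalized_sigmoid_cost(F, X, y, lam):
    # C(F) = (1/m) * sum_i (1 - tanh(lam * y_i * F(x_i))).
    m = len(X)
    return sum(1.0 - math.tanh(lam * y_i * F(x_i)) for x_i, y_i in zip(X, y)) / m

def doom2_weights(F, X, y, lam):
    # Weights proportional to |c'(y_i F(x_i))| for c(z) = 1 - tanh(lam * z).
    raw = [lam * (1.0 - math.tanh(lam * y_i * F(x_i)) ** 2) for x_i, y_i in zip(X, y)]
    total = sum(raw)
    return [r / total for r in raw]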
Following the theoretical analysis in [14], λ can be viewed as a data-dependent complexity parameter which measures the resolution at which we examine the margins. A large value of λ corresponds to a high resolution and hence high effective complexity of the convex combination. Thus, choosing a large value of λ amounts to a belief that a high-complexity classifier can be used without overfitting.
In our implementation of DOOM II we use a fixed small step-size ε (for all of the experiments ε = 0.05). In practice the use of a fixed ε could be replaced by a line search for the optimal step-size at each round. For full details of the algorithm the reader is referred to the full version of this paper [15].
Given that the normalized sigmoid cost function is non-convex, the DOOM II algorithm will suffer from problems with local minima. In fact, the following result shows that for cost functions satisfying C(−α) = 1 − C(α), the algorithm will strike a local minimum at the first step.

Lemma 3. Let C : ℝ → ℝ be any cost function satisfying C(−α) = 1 − C(α). If DOOM II can find the optimal weak hypothesis f_1 at the first time step, it will terminate at the next time step, returning f_1.
One way of avoiding this local minimum is to remove f_1 from F after the first round and then continue the algorithm, returning f_1 to F only when the cost goes below that of the first round. Since f_1 is a local minimum the cost is guaranteed to increase after the first round. However, if we continue to step in the best available direction (the flattest uphill direction) we should eventually `crest the hill' defined by the basin of attraction of the first classifier and then start to decrease the cost. Once the cost decreases below that of the first classifier we can safely return the first classifier to the class of available base classifiers. Of course, we have no guarantee that the cost will decrease below that of the first classifier at any round after the first. Practically however, this does not seem to be a problem except for very small values of λ where the cost function is almost linear over [−1, 1] (in which case the first classifier corresponds to a global minimum anyway).
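In outline, this heuristic might be implemented as follows. The sketch is our reading of the text, not the authors' code; step is assumed to append one weighted hypothesis chosen from pool to ensemble, and cost to evaluate the current combination.

def boost_with_f1_removal(step, cost, base_class, T):
    # Round one: pick f_1 as usual, record the cost, then remove f_1 from the pool.
    pool = list(base_class)
    ensemble = []                      # list of (weight, hypothesis) pairs
    step(ensemble, pool)               # appends (w_1, f_1)
    f1 = ensemble[0][1]
    round_one_cost = cost(ensemble)
    pool.remove(f1)
    for _ in range(1, T):
        step(ensemble, pool)           # keep stepping in the best available direction
        if f1 not in pool and cost(ensemble) < round_one_cost:
            pool.append(f1)            # cost has dropped below round one; f_1 is safe again
    return ensemble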
In order to compare the performance of DOOM II and AdaBoost, a series of experiments was carried out on a selection of data sets taken from the UCI machine learning repository [4]. To simplify matters, only binary classification problems were considered. All of the experiments were repeated 100 times with 80%, 10% and 10% of the examples randomly selected for training, validation and test purposes respectively. The results were then averaged over the 100 repeats. For all of the experiments axis-orthogonal hyperplanes (also known as decision stumps) were used as the base classifiers. This fixed the complexity of the weak learner and thus avoided any problems with the complexity of the combined classifier being dependent on the actual classifiers produced by the weak learner.
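For reference, an axis-orthogonal decision stump minimizing weighted error can be fit by exhaustive search over features and thresholds; the sketch below is ours, not the code used in the experiments.

def fit_stump(X, y, D):
    # X: list of feature vectors, y: +1/-1 labels, D: example weights summing to one.
    best = None
    n_features = len(X[0])
    for j in range(n_features):
        for theta in sorted(set(row[j] for row in X)):
            for sign in (+1, -1):
                err = sum(D_i for row, y_i, D_i in zip(X, y, D)
                          if sign * (1 if row[j] > theta else -1) != y_i)
                if best is None or err < best[0]:
                    best = (err, j, theta, sign)
    _, j, theta, sign = best
    # Return the stump as a function of a single feature vector x.
    return lambda x, j=j, theta=theta, sign=sign: sign * (1 if x[j] > theta else -1)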
For AdaBoost, the validation set was used to perform early stopping. AdaBoost was run for 2000 rounds and then the combined classifier from the round corresponding to minimum error on the validation set was chosen. For DOOM II, the validation set was used to set the data-dependent complexity parameter λ. DOOM II was run for 2000 rounds with λ = 2, 4, 6, 10, 15 and 20, and the optimal λ was chosen to correspond to minimum error on the validation set after 2000 rounds.
AdaBoost and DOOM II were run on nine data sets to which varying levels of label noise had been
applied. A summary of the experimental results is shown in Figure 1. The improvement in test error
exhibited by DOOM II over AdaBoost (with standard error bars) is shown for each data set and noise
level. These results show that DOOM II generally outperforms AdaBoost and that the improvement is
more pronounced in the presence of label noise.
Figure 1: Summary of test error advantage (%, with standard error bars) of DOOM II over AdaBoost at 0%, 5% and 15% label noise on nine UCI data sets (sonar, cleve, ionosphere, vote1, credit, breast-cancer, pima-indians, hypo1 and splice).
The effect of using the normalized sigmoid cost function rather than the exponential cost function is best illustrated by comparing the cumulative margin distributions generated by AdaBoost and DOOM II. Figure 2 shows comparisons for two data sets with 0% and 15% label noise applied. For a given margin, the value on the curve corresponds to the proportion of training examples with margin less than or equal to this value. These curves show that, in trying to increase the margins of examples with negative margins, AdaBoost is willing to significantly sacrifice the margins of examples with positive margins. In contrast, DOOM II `gives up' on examples with large negative margins in order to reduce the value of the cost function.
Figure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast-cancer and splice data sets.
Given that AdaBoost suffers from overfitting yet minimizes an exponential cost function of the margins, this cost function clearly does not relate well to test error. How does the value of our proposed cost function correlate with AdaBoost's test error? The theoretical bound suggests that for the `right' value of the data-dependent complexity parameter λ our cost function and the test error should be closely correlated. Figure 3 shows the variation in the normalized sigmoid cost function, the exponential cost function and the test error for AdaBoost for two UCI data sets over 10000 rounds. As before, the values of these curves were averaged over 100 random train/validation/test splits. The value of λ used in each case was chosen by running DOOM II for various values of λ and choosing the λ corresponding to minimum error on the validation set. These curves show that there is a strong correlation between the normalized sigmoid cost (for the right value of λ) and AdaBoost's test error. In both data sets the minimum of AdaBoost's test error and the minimum of the normalized sigmoid cost very nearly coincide. In the labor data set AdaBoost's test error converges and overfitting does not occur. For this data set both the normalized sigmoid cost and the exponential cost converge. In the vote1 data set AdaBoost's test error initially decreases and then increases (as overfitting sets in). For this data set the normalized sigmoid cost mirrors this behaviour, while the exponential cost converges to 0.
Figure 3: AdaBoost test error, exponential cost and normalized sigmoid cost over 10000 rounds of AdaBoost for the labor and vote1 data sets. Both costs have been scaled in each case for easier comparison with the test error.
6 Conclusions
We have shown that many existing "boosting-type" algorithms for combining classifiers can be viewed as gradient descent on an appropriate cost functional in a suitable inner product space. We presented "AnyBoost", an abstract algorithm of this type for generating linear combinations of functions from some base hypothesis class. A prescription for the step-sizes in this algorithm that guarantees convergence to the optimal cost was given.
Motivated by the main theoretical result from [14], we derived DOOM II, a specialization of AnyBoost that uses 1 − tanh(λz) as its cost function of the margin z. Experimental results on the UCI data sets verified that DOOM II generally outperformed AdaBoost when boosting decision stumps, particularly in the presence of label noise. We also found that DOOM II's cost on the training data was a very reliable predictor of test error, while AdaBoost's exponential cost was not.
Acknowledgments
This research was supported by the Australian Research Council. Llew Mason was supported by an
Australian Postgraduate Research Award. Jonathan Baxter was supported by an Australian Postdoctoral Fellowship. Peter Bartlett and Marcus Frean were supported by an Institute of Advanced Studies/Australian Universities Collaborative grant. Thanks to Shai Ben-David for a stimulating discussion.
References
[1] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, March 1998.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Department of Statistics, University of California, Berkeley, 1998.
[4] C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html.
[5] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Technical report, Computer Science Department, Oregon State University, 1998.
[6] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479–485, 1996.
[7] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. In Computational Learning Theory: 4th European Conference, 1999. (to appear).
[8] Y. Freund. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999. (to appear).
[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.
[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[11] J. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Stanford University, 1999.
[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Stanford University, 1998.
[13] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 692–699, 1998.
[14] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 1999. (to appear; extended abstract in NIPS 98).
[15] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. Technical report, RSISE, Australian National University, 1999. (http://wwwsyseng.anu.edu.au/jon/papers/doom2.ps.gz).
[16] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, 1996.
[17] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.
[18] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, October 1998.
[19] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80–91, 1998.