Boosting Algorithms as Gradient Descent
Abstract
Much recent attention, both experimental and theoretical, has been focused on classification algorithms which produce voted combinations of classifiers. Recent theoretical work has shown that the impressive generalization performance of algorithms like AdaBoost can be attributed to the classifier having large margins on the training data.
We present an abstract algorithm for finding linear combinations of functions that minimize arbitrary cost functionals (i.e., functionals that do not necessarily depend on the margin). Many existing voting methods can be shown to be special cases of this abstract algorithm. Then, following previous theoretical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions.
Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations. Margin distribution plots verify that DOOM II is willing to `give up' on examples that are too hard in order to avoid overfitting. We also show that the overfitting behavior exhibited by AdaBoost can be quantified in terms of our proposed cost function.
1 Introduction
There has been considerable interest recently in voting methods for pattern classification, which predict the label of a particular example using a weighted vote over a set of base classifiers. For example, Freund and Schapire's AdaBoost algorithm [10] and Breiman's Bagging algorithm [2] have been found to give significant performance improvements over algorithms for the corresponding base classifiers [6, 9, 16, 5], and have led to the study of many related algorithms [3, 19, 12, 17, 7, 11, 8]. Recent theoretical results suggest that the effectiveness of these algorithms is due to their tendency to produce large margin classifiers. The margin of an example is defined as the difference between the total weight assigned to the correct label and the largest weight assigned to an incorrect label. We can interpret the value of the margin as an indication of the confidence of correct classification: an example is classified correctly if and only if it has a positive margin, and a larger margin can be viewed as a more confident correct classification. Results in [1] and [18] show that, loosely speaking, if a combination of classifiers correctly classifies most of the training data with a large margin, then its error probability is small.
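For a two-class problem with ±1-valued base classifiers and a normalized weighted vote F, this margin reduces to y·F(x). A minimal illustration in Python (the helper below is our own, not from the paper):

def margin(x, y, hypotheses, weights):
    # Normalized weighted vote F(x); each hypothesis returns +1 or -1, y is the true +1/-1 label.
    total = sum(weights)
    F_x = sum(w * h(x) for h, w in zip(hypotheses, weights)) / total
    return y * F_x  # positive if and only if the example is classified correctly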
In [14], Mason, Bartlett and Baxter have presented improved upper bounds on the misclassification probability of a combined classifier in terms of the average over the training data of a certain cost function of the margins. That paper also describes experiments with an algorithm that directly minimizes this cost function through the choice of weights associated with each base classifier. This algorithm exhibits performance improvements over AdaBoost, which suggests that these margin cost functions are appropriate quantities to optimize.
In this paper, we present a general class of algorithms (called AnyBoost) which are gradient descent algorithms for choosing linear combinations of elements of an inner product space so as to minimize some cost functional. Each component of the linear combination is chosen to maximize a certain inner product. (In the specific case of choosing a combination of classifiers to optimize the sample average of a cost function of the margin, the choice of the base classifier corresponds to a minimization problem involving weighted classification error. That is, for a certain weighting of the training data, the base classifier learning algorithm attempts to return a classifier that minimizes the weight of misclassified training examples.) In Section 4, we give convergence results for this class of algorithms.
In Section 3, we show that this general class of algorithms includes as special cases a number of popular and successful voting methods, including Freund and Schapire's AdaBoost [10], Schapire and Singer's extension of AdaBoost to combinations of real-valued functions [19], and Friedman, Hastie and Tibshirani's LogitBoost [12]. That is, all of these algorithms implicitly minimize some margin cost function by gradient descent.
In Section 5, we present experimental results for a particular implementation of the AnyBoost algorithm using cost functions of the margin that are motivated by the theoretical results presented in [14]. The cost functions suggested by these results are significantly different from the cost functions that are implicitly minimized by the methods described in Section 3. The experiments show that the new algorithm typically outperforms AdaBoost, and that this is especially true with label noise. In addition, the theoretically-motivated cost functions provide good estimates of the error of AdaBoost, in the sense that they can be used to predict its overfitting behaviour.
⟨F, G⟩ := (1/m) Σ_{i=1}^m F(x_i) G(x_i)   (2)

for all F, G ∈ lin(F). However, the AnyBoost algorithm defined in this section and its convergence properties studied in Section 4 are valid for any cost function and inner product.
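As a concrete reading of (2), assuming each hypothesis can be evaluated on the m training points, the sample inner product can be computed as follows (an illustrative sketch, not the authors' code):

def inner_product(F, G, X):
    # <F, G> := (1/m) * sum_i F(x_i) * G(x_i) over the training sample X.
    m = len(X)
    return sum(F(x) * G(x) for x in X) / m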
Now suppose we have a function F ∈ lin(F) and we wish to find a new f ∈ F to add to F so that the cost C(F + εf) decreases, for some small value of ε. Viewed in function space terms, we are asking for the "direction" f such that C(F + εf) most rapidly decreases. The desired direction is simply the negative of the functional derivative of C at F, −∇C(F)(x), where

∇C(F)(x) := ∂C(F + α 1_x)/∂α |_{α=0},

where 1_x is the indicator function of x. Since we are restricted to choosing our new function f from F, in general it will not be possible to choose f = −∇C(F), so instead we search for an f with greatest inner product with −∇C(F). That is, we should choose f to maximize −⟨∇C(F), f⟩. This can be motivated by observing that, to first order in ε, C(F + εf) = C(F) + ε⟨∇C(F), f⟩, and hence the greatest reduction in cost will occur for the f maximizing −⟨∇C(F), f⟩.
The preceding discussion motivates Algorithm 1, an iterative algorithm for finding linear combinations F of base hypotheses in F that minimize the cost C(F). Note that we have allowed the base hypotheses to take values in an arbitrary set Y, we have not restricted the form of the cost or the inner product, and we have not specified what the step-sizes should be. Appropriate choices for these things will be made when we apply the algorithm to more concrete situations. Note also that the algorithm terminates when −⟨∇C(F_t), f_{t+1}⟩ ≤ 0, i.e., when the weak learner L returns a base hypothesis f_{t+1} which no longer points in the downhill direction of the cost function C(F). Thus, the algorithm terminates when, to first order, a step in function space in the direction of the base hypothesis returned by L would increase the cost.
Algorithm 1 : AnyBoost

Require:
An inner product space (X, ⟨·,·⟩) containing functions mapping from X to some set Y.
A class of base classifiers F ⊆ X.
A differentiable cost functional C : lin(F) → ℝ.
A weak learner L(F) that accepts F ∈ lin(F) and returns f ∈ F with a large value of −⟨∇C(F), f⟩.

Let F_0(x) := 0.
for t := 0 to T do
    Let f_{t+1} := L(F_t).
    if −⟨∇C(F_t), f_{t+1}⟩ ≤ 0 then
        return F_t.
    end if
    Choose w_{t+1}.
    Let F_{t+1} := F_t + w_{t+1} f_{t+1}.
end for
return F_{T+1}.
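The loop above translates directly into code. The sketch below is our own illustration of Algorithm 1, not the authors' implementation; weak_learner, gradient, inner_product and step_size stand for L, ∇C, ⟨·,·⟩ (assumed to close over the training sample) and an unspecified step-size rule.

def anyboost(weak_learner, gradient, inner_product, step_size, T):
    # The ensemble is a list of (weight, hypothesis) pairs representing
    # F = sum_t w_t * f_t; the empty list is the zero function F_0.
    ensemble = []
    F = lambda x: sum(w * f(x) for w, f in ensemble)
    for t in range(T + 1):
        f_next = weak_learner(F)                 # f_{t+1} := L(F_t)
        grad = gradient(F)                       # grad C(F_t), itself a function of x
        if -inner_product(grad, f_next) <= 0:    # no downhill direction remains
            return ensemble                      # return F_t
        w_next = step_size(F, f_next)            # choose w_{t+1}
        ensemble.append((w_next, f_next))        # F_{t+1} := F_t + w_{t+1} f_{t+1}
    return ensemble                              # return F_{T+1}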
For margin cost functionals of the form C(F) = (1/m) Σ_{i=1}^m c(y_i F(x_i)), the quantity the weak learner should maximize becomes

−⟨∇C(F), f⟩ = −(1/m²) Σ_{i=1}^m y_i f(x_i) c′(y_i F(x_i)),

so that choosing f ∈ F to maximize −⟨∇C(F), f⟩ is equivalent to choosing f to minimize the weighted error Σ_{i : f(x_i) ≠ y_i} D(i), where the weights D(i) are proportional to |c′(y_i F(x_i))| and normalized so that Σ_{i=1}^m D(i) = 1.
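Concretely, the weak learner in this setting only needs per-example weights. A sketch under the reconstruction above (c_prime denotes the derivative c′ of the margin cost; names are ours):

def example_weights(F, X, y, c_prime):
    # D(i) proportional to |c'(y_i F(x_i))|, normalized to sum to one.
    raw = [abs(c_prime(y_i * F(x_i))) for x_i, y_i in zip(X, y)]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_error(f, X, y, D):
    # Total weight of training examples misclassified by f.
    return sum(D_i for x_i, y_i, D_i in zip(X, y, D) if f(x_i) != y_i)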
Many of the most successful voting methods are, for the appropriate choice of cost function and step-size, specific cases of the AnyBoost algorithm. Table 1 summarizes the AnyBoost cost function and step-size settings needed to obtain the AdaBoost [10], confidence-rated AdaBoost [19], ARC-X4 [3] and LogitBoost [12] algorithms. A more detailed analysis of these algorithms as specific cases of AnyBoost can be found in the full version of this paper [15].
Table 1: Existing voting methods viewed as gradient descent optimizers of margin cost functions.

Algorithm              Cost function         Step size
AdaBoost [10]          e^(−yF(x))            Line search
ARC-X4 [3]             (1 − yF(x))^5         1/t
ConfidenceBoost [19]   e^(−yF(x))            Line search
LogitBoost [12]        ln(1 + e^(−yF(x)))    Newton-Raphson

4 Convergence of AnyBoost

In this section we provide convergence results for the abstract AnyBoost algorithm, under quite weak conditions on the cost functional C. The prescriptions given for the step-sizes w_t in these results are for convergence guarantees only: in practice they will almost always be smaller than necessary, hence fixed small steps or some form of line search should be used.
The following theorem (proof omitted, see [15]) supplies a specific step-size for AnyBoost and characterizes the limiting behaviour with this step-size.

Theorem 1. Let C : lin(F) → ℝ be any lower bounded, Lipschitz differentiable cost functional (that is, there exists L > 0 such that ‖∇C(F) − ∇C(F′)‖ ≤ L‖F − F′‖ for all F, F′ ∈ lin(F)). Let F_0, F_1, ... be the sequence of combined hypotheses generated by the AnyBoost algorithm, using step-sizes

w_{t+1} := −⟨∇C(F_t), f_{t+1}⟩ / (L ‖f_{t+1}‖²).   (3)

Then AnyBoost either halts on round T with −⟨∇C(F_T), f_{T+1}⟩ ≤ 0, or C(F_t) converges to some finite value, in which case lim_{t→∞} ⟨∇C(F_t), f_{t+1}⟩ = 0.
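For illustration, step-size (3) can be computed from sample averages, assuming the inner product of (2) and a known (or estimated) Lipschitz constant L (our sketch, not part of the paper):

def theorem1_step_size(grad_F, f_next, X, L):
    # w_{t+1} = -<grad C(F_t), f_{t+1}> / (L * ||f_{t+1}||^2), with the sample inner product.
    m = len(X)
    ip = sum(grad_F(x) * f_next(x) for x in X) / m      # <grad C(F_t), f_{t+1}>
    norm_sq = sum(f_next(x) ** 2 for x in X) / m        # ||f_{t+1}||^2 = <f_{t+1}, f_{t+1}>
    return -ip / (L * norm_sq)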
The next theorem (proof omitted, see [15]) shows that if the weak learner can always find the best weak hypothesis f_t ∈ F on each round of AnyBoost, and if the cost functional C is convex, then any accumulation point F of the sequence (F_t) generated by AnyBoost with the step-sizes (3) is a global minimum of the cost. For ease of exposition, we have assumed that rather than terminating when −⟨∇C(F_T), f_{T+1}⟩ ≤ 0, AnyBoost simply continues to return F_T for all subsequent time steps t.

Theorem 2. Let C : lin(F) → ℝ be a convex cost functional with the properties in Theorem 1, and let (F_t) be the sequence of combined hypotheses generated by the AnyBoost algorithm with step-sizes given by (3). Assume that the weak hypothesis class F is negation closed (f ∈ F ⟹ −f ∈ F) and that on each round the AnyBoost algorithm finds a function f_{t+1} maximizing −⟨∇C(F_t), f_{t+1}⟩. Then any accumulation point F of the sequence (F_t) satisfies

sup_{f ∈ F} −⟨∇C(F), f⟩ = 0,   and   C(F) = inf_{G ∈ lin(F)} C(G).
5 Experiments
AdaBoost had been perceived to be resistant to overfitting despite the fact that it can produce combinations involving very large numbers of classifiers. However, recent studies have shown that this is not the case, even for base classifiers as simple as decision stumps. Grove and Schuurmans [13] demonstrated that running AdaBoost for hundreds of thousands of rounds can lead to significant overfitting, while a number of authors (e.g., [5, 17]) showed that, by adding label noise, overfitting can be induced in AdaBoost even with relatively few classifiers in the combination.
The main theoretical result from [14] provides bounds on the generalization performance of a convex combination of classifiers in terms of training sample averages of certain sigmoid-like cost functions of the margin. Given this theoretical motivation we propose a new algorithm (DOOM II) which is a specific case of AnyBoost using the cost functional

C(F) = (1/m) Σ_{i=1}^m (1 − tanh(λ y_i F(x_i))),   (4)

where F is restricted to be a convex combination of classifiers from some base class F and λ is an adjustable parameter of the cost function. Henceforth we will refer to (4) as the normalized sigmoid cost function (normalized because the weights are normalized so F is a convex combination). This family of cost functions (parameterized by λ) is qualitatively similar to the theoretically motivated family of cost functions used in [14]. Using the family from [14] in practice may cause difficulties for a gradient descent procedure because the functions are very flat for negative margins and for margins close to 1. Using the normalized sigmoid cost function alleviates this problem.
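As an illustration (our own sketch, with lam standing for the parameter λ), the cost (4) and the per-example weights it induces through its derivative c′(z) = −λ(1 − tanh²(λz)) can be written as:

import math

def normalized_sigmoid_cost(F, X, y, lam):
    # C(F) = (1/m) * sum_i (1 - tanh(lam * y_i * F(x_i))).
    m = len(X)
    return sum(1.0 - math.tanh(lam * y_i * F(x_i)) for x_i, y_i in zip(X, y)) / m

def doom2_weights(F, X, y, lam):
    # Weights proportional to |c'(y_i F(x_i))| for c(z) = 1 - tanh(lam * z).
    raw = [lam * (1.0 - math.tanh(lam * y_i * F(x_i)) ** 2) for x_i, y_i in zip(X, y)]
    total = sum(raw)
    return [r / total for r in raw]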
Following the theoretical analysis in [14], λ can be viewed as a data-dependent complexity parameter which measures the resolution at which we examine the margins. A large value of λ corresponds to a high resolution and hence high effective complexity of the convex combination. Thus, choosing a large value of λ amounts to a belief that a high-complexity classifier can be used without overfitting.
In our implementation of DOOM II we use a fixed small step-size ε (for all of the experiments ε = 0.05). In practice the use of a fixed ε could be replaced by a line search for the optimal step-size at each round. For full details of the algorithm the reader is referred to the full version of this paper [15].
Given that the normalized sigmoid cost function is non-convex, the DOOM II algorithm will suffer from problems with local minima. In fact, the following result shows that for cost functions satisfying C(−α) = 1 − C(α), the algorithm will strike a local minimum at the first step.

Lemma 3. Let C : ℝ → ℝ be any cost function satisfying C(−α) = 1 − C(α). If DOOM II can find the optimal weak hypothesis f_1 at the first time step, it will terminate at the next time step, returning f_1.
One way of avoiding this local minimum is to remove f_1 from F after the first round and then continue the algorithm, returning f_1 to F only when the cost goes below that of the first round. Since f_1 is a local minimum the cost is guaranteed to increase after the first round. However, if we continue to step in the best available direction (the flattest uphill direction) we should eventually `crest the hill' defined by the basin of attraction of the first classifier and then start to decrease the cost. Once the cost decreases below that of the first classifier we can safely return the first classifier to the class of available base classifiers. Of course, we have no guarantee that the cost will decrease below that of the first classifier at any round after the first. Practically however, this does not seem to be a problem except for very small values of λ where the cost function is almost linear over [−1, 1] (in which case the first classifier corresponds to a global minimum anyway).
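In outline, this heuristic might be implemented as follows. The sketch is our reading of the text, not the authors' code; step is assumed to append one weighted hypothesis chosen from pool to ensemble, and cost to evaluate the current combination.

def boost_with_f1_removal(step, cost, base_class, T):
    # Round one: pick f_1 as usual, record the cost, then remove f_1 from the pool.
    pool = list(base_class)
    ensemble = []                      # list of (weight, hypothesis) pairs
    step(ensemble, pool)               # appends (w_1, f_1)
    f1 = ensemble[0][1]
    round_one_cost = cost(ensemble)
    pool.remove(f1)
    for _ in range(1, T):
        step(ensemble, pool)           # keep stepping in the best available direction
        if f1 not in pool and cost(ensemble) < round_one_cost:
            pool.append(f1)            # cost has dropped below round one; f_1 is safe again
    return ensemble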
In order to compare the performance of DOOM II and AdaBoost, a series of experiments was carried out on a selection of data sets taken from the UCI machine learning repository [4]. To simplify matters, only binary classification problems were considered. All of the experiments were repeated 100 times with 80%, 10% and 10% of the examples randomly selected for training, validation and test purposes respectively. The results were then averaged over the 100 repeats. For all of the experiments axis-orthogonal hyperplanes (also known as decision stumps) were used as the base classifiers. This fixed the complexity of the weak learner and thus avoided any problems with the complexity of the combined classifier being dependent on the actual classifiers produced by the weak learner.
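For reference, an axis-orthogonal decision stump minimizing weighted error can be fit by exhaustive search over features and thresholds; the sketch below is ours, not the code used in the experiments.

def fit_stump(X, y, D):
    # X: list of feature vectors, y: +1/-1 labels, D: example weights summing to one.
    best = None
    n_features = len(X[0])
    for j in range(n_features):
        for theta in sorted(set(row[j] for row in X)):
            for sign in (+1, -1):
                err = sum(D_i for row, y_i, D_i in zip(X, y, D)
                          if sign * (1 if row[j] > theta else -1) != y_i)
                if best is None or err < best[0]:
                    best = (err, j, theta, sign)
    _, j, theta, sign = best
    # Return the stump as a function of a single feature vector x.
    return lambda x, j=j, theta=theta, sign=sign: sign * (1 if x[j] > theta else -1)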
For AdaBoost, the validation set was used to perform early stopping. AdaBoost was run for 2000 rounds and then the combined classifier from the round corresponding to minimum error on the validation set was chosen. For DOOM II, the validation set was used to set the data-dependent complexity parameter λ. DOOM II was run for 2000 rounds with λ = 2, 4, 6, 10, 15 and 20, and the optimal λ was chosen to correspond to minimum error on the validation set after 2000 rounds.
AdaBoost and DOOM II were run on nine data sets to which varying levels of label noise had been
applied. A summary of the experimental results is shown in Figure 1. The improvement in test error
exhibited by DOOM II over AdaBoost (with standard error bars) is shown for each data set and noise
level. These results show that DOOM II generally outperforms AdaBoost and that the improvement is
more pronounced in the presence of label noise.
Figure 1: Summary of test error advantage (%, with standard error bars) of DOOM II over AdaBoost at 0%, 5% and 15% label noise on nine UCI data sets (sonar, cleve, ionosphere, vote1, credit, breast-cancer, pima-indians, hypo1 and splice).
The effect of using the normalized sigmoid cost function rather than the exponential cost function is best illustrated by comparing the cumulative margin distributions generated by AdaBoost and DOOM II. Figure 2 shows comparisons for two data sets with 0% and 15% label noise applied. For a given margin, the value on the curve corresponds to the proportion of training examples with margin less than or equal to this value. These curves show that, in trying to increase the margins of examples with negative margins, AdaBoost is willing to significantly sacrifice the margins of examples with positive margins. In contrast, DOOM II `gives up' on examples with large negative margins in order to reduce the value of the cost function.
Figure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast-cancer and splice data sets.
Given that AdaBoost suffers from overfitting yet minimizes an exponential cost function of the margins, this cost function clearly does not relate well to test error. How does the value of our proposed cost function correlate with AdaBoost's test error? The theoretical bound suggests that for the `right' value of the data-dependent complexity parameter λ our cost function and the test error should be closely correlated. Figure 3 shows the variation in the normalized sigmoid cost function, the exponential cost function and the test error for AdaBoost for two UCI data sets over 10000 rounds. As before, the values of these curves were averaged over 100 random train/validation/test splits. The value of λ used in each case was chosen by running DOOM II for various values of λ and choosing the λ corresponding to minimum error on the validation set. These curves show that there is a strong correlation between the normalized sigmoid cost (for the right value of λ) and AdaBoost's test error. In both data sets the minimum of AdaBoost's test error and the minimum of the normalized sigmoid cost very nearly coincide. In the labor data set AdaBoost's test error converges and overfitting does not occur. For this data set both the normalized sigmoid cost and the exponential cost converge. In the vote1 data set AdaBoost's test error initially decreases and then increases (as overfitting sets in). For this data set the normalized sigmoid cost mirrors this behaviour, while the exponential cost converges to 0.
Figure 3: AdaBoost test error, exponential cost and normalized sigmoid cost over 10000 rounds of AdaBoost for the labor and vote1 data sets. Both costs have been scaled in each case for easier comparison with the test error.
6 Conclusions
We have shown that many existing "boosting-type" algorithms for combining classifiers can be viewed as gradient descent on an appropriate cost functional in a suitable inner product space. We presented "AnyBoost", an abstract algorithm of this type for generating linear combinations of functions from some base hypothesis class. A prescription for the step-sizes in this algorithm that guarantees convergence to the optimal cost was given.
Motivated by the main theoretical result from [14], we derived DOOM II, a specialization of AnyBoost that uses 1 − tanh(λz) as its cost function of the margin z. Experimental results on the UCI data sets verified that DOOM II generally outperformed AdaBoost when boosting decision stumps, particularly in the presence of label noise. We also found that DOOM II's cost on the training data was a very reliable predictor of test error, while AdaBoost's exponential cost was not.
Acknowledgments
This research was supported by the Australian Research Council. Llew Mason was supported by an
Australian Postgraduate Research Award. Jonathan Baxter was supported by an Australian Postdoctoral Fellowship. Peter Bartlett and Marcus Frean were supported by an Institute of Advanced Studies/Australian Universities Collaborative grant. Thanks to Shai Ben-David for a stimulating discussion.
References
[1] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, March 1998.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Department of Statistics, University of California, Berkeley, 1998.
[4] C. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/mlearn/MLRepository.html.
[5] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Technical report, Computer Science Department, Oregon State University, 1998.
[6] H. Drucker and C. Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479–485, 1996.
[7] N. Duffy and D. Helmbold. A geometric approach to leveraging weak learners. In Computational Learning Theory: 4th European Conference, 1999. (to appear).
[8] Y. Freund. An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 1999. (to appear).
[9] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156, 1996.
[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[11] J. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Stanford University, 1999.
[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Stanford University, 1998.
[13] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 692–699, 1998.
[14] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 1999. (to appear; extended abstract in NIPS 98).
[15] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. Technical report, RSISE, Australian National University, 1999. (http://wwwsyseng.anu.edu.au/jon/papers/doom2.ps.gz).
[16] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, 1996.
[17] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998.
[18] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, October 1998.
[19] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80–91, 1998.