
The Annals of Statistics, to appear.

Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods
Robert E. Schapire, AT&T Labs, 180 Park Avenue, Room A279, Florham Park, NJ 07932-0971 USA, [email protected]
Peter Bartlett, Dept. of Systems Engineering, RSISE, Aust. National University, Canberra, ACT 0200 Australia, [email protected]
Yoav Freund, AT&T Labs, 180 Park Avenue, Room A205, Florham Park, NJ 07932-0971 USA, [email protected]
Wee Sun Lee, School of Electrical Engineering, University College UNSW, Australian Defence Force Academy, Canberra ACT 2600 Australia, [email protected]

May 7, 1998

Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik's support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.

1  Introduction

This paper is about methods for improving the performance of a learning algorithm, sometimes also called a prediction algorithm or classification method. Such an algorithm operates on a given set of instances (or cases) to produce a classifier, sometimes also called a classification rule or, in the machine-learning literature, a hypothesis. The goal of a learning algorithm is to find a classifier with low generalization or prediction error, i.e., a low misclassification rate on a separate test set. In recent years, there has been growing interest in learning algorithms which achieve high accuracy by voting the predictions of several classifiers. For example, several researchers have reported significant improvements in performance using voting methods with decision-tree learning algorithms such as C4.5 or CART as well as with neural networks [3, 6, 8, 12, 13, 16, 18, 29, 31, 37]. We refer to each of the classifiers that is combined in the vote as a base classifier and to the final voted classifier as the combined classifier. As examples of the effectiveness of these methods, consider the results of the following two experiments using the letter dataset. (All datasets are described in Appendix B.) In the first experiment,

we used Breiman's bagging method [6] on top of C4.5 [32], a decision-tree learning algorithm similar to CART [9]. That is, we reran C4.5 many times on random "bootstrap" subsamples and combined the computed trees using simple voting. In the top left of Figure 1, we have shown the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of trees combined. The test error of C4.5 on this dataset (run just once) is 13.8%. The test error of bagging 1000 trees is 6.6%, a significant improvement. (Both of these error rates are indicated in the figure as horizontal grid lines.)

In the second experiment, we used Freund and Schapire's AdaBoost algorithm [20] on the same dataset, also using C4.5. This method is similar to bagging in that it reruns the base learning algorithm C4.5 many times and combines the computed trees using voting. However, the subsamples that are used for training each tree are chosen in a manner which concentrates on the "hardest" examples. (Details are given in Section 3.) The results of this experiment are shown in the top right of Figure 1. Note that boosting drives the test error down even further, to just 3.1%. Similar improvements in test error have been demonstrated on many other benchmark problems (see Figure 2).

These error curves reveal a remarkable phenomenon, first observed by Drucker and Cortes [16], and later by Quinlan [31] and Breiman [8]. Ordinarily, as classifiers become more and more complex, we expect their generalization error eventually to degrade. Yet these curves reveal that test error does not increase for either method even after 1000 trees have been combined (by which point, the combined classifier involves more than two million decision-tree nodes). How can it be that such complex classifiers have such low error rates? This seems especially surprising for boosting in which each new decision tree is trained on an ever more specialized subsample of the training set.

Another apparent paradox is revealed in the error curve for AdaBoost. After just five trees have been combined, the training error of the combined classifier has already dropped to zero, but the test error continues to drop¹ from 8.4% on round 5 down to 3.1% on round 1000. Surely, a combination of five trees is much simpler than a combination of 1000 trees, and both perform equally well on the training set (perfectly, in fact). So how can it be that the larger and more complex combined classifier performs so much better on the test set?

The results of these experiments seem to contradict Occam's razor, one of the fundamental principles in the theory of machine learning. This principle states that in order to achieve good test error, the classifier should be as simple as possible. By "simple," we mean that the classifier is chosen from a restricted space of classifiers. When the space is finite, we use its cardinality as the measure of complexity, and when it is infinite we use the VC dimension [42], which is often closely related to the number of parameters that define the classifier. Typically, both in theory and in practice, the difference between the training error and the test error increases when the complexity of the classifier increases.

Indeed, such an analysis of boosting (which could also be applied to bagging) was carried out by Freund and Schapire [20] using the methods of Baum and Haussler [4]. This analysis predicts that the test error eventually will increase as the number of base classifiers combined increases.
Such a prediction is clearly incorrect in the case of the experiments described above, as was pointed out by Quinlan [31] and Breiman [8]. The apparent contradiction is especially stark in the boosting experiment in which the test error continues to decrease even after the training error has reached zero. Breiman [8] and others have proposed definitions of bias and variance for classification, and have argued that voting methods work primarily by reducing the variance of a learning algorithm. This explanation is useful for bagging in that bagging tends to be most effective when the variance is large.
¹ Even when the training error of the combined classifier reaches zero, AdaBoost continues to obtain new base classifiers by training the base learning algorithm on different subsamples of the data. Thus, the combined classifier continues to evolve, even after its training error reaches zero. See Section 3 for more detail.
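To make the bagging procedure used in the first experiment concrete, the following is a minimal sketch of ours (not the code used for the experiments reported here). It assumes a hypothetical training routine fit_tree(X, y) standing in for C4.5, draws bootstrap subsamples, and combines the resulting trees by simple voting; boosting differs only in how the subsamples are chosen (see Section 3).

```python
import numpy as np

def bag_and_vote(X, y, fit_tree, n_trees=1000, rng=None):
    """Train n_trees base classifiers on bootstrap subsamples and return a
    function that classifies by simple (unweighted) voting.

    fit_tree(X, y) is an assumed base learner standing in for C4.5; it must
    return an object with a .predict(X) method whose outputs are integer
    class labels.  X and y are numpy arrays.
    """
    rng = np.random.default_rng(rng)
    m = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, m, size=m)          # bootstrap sample (with replacement)
        trees.append(fit_tree(X[idx], y[idx]))

    def predict(X_new):
        votes = np.stack([t.predict(X_new) for t in trees])   # shape (n_trees, n_examples)
        # majority vote: most frequent label in each column
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

    return predict
```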

[Figure 1 appears here: two columns of panels, bagging (left) and boosting (right); the top panels plot error (%) against the number of classifiers (log scale, 10 to 1000), and the bottom panels plot the cumulative margin distribution against the margin (from -1 to 1).]
Figure 1: Error curves and margin distribution graphs for bagging and boosting C4.5 on the letter dataset. Learning curves are shown directly above corresponding margin distribution graphs. Each learning-curve figure shows the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of classifiers combined. Horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. The margin distribution graphs show the cumulative distribution of margins of the training instances after 5, 100 and 1000 iterations, indicated by short-dashed, long-dashed (mostly hidden) and solid curves, respectively.

However, for boosting, this explanation is, at best, incomplete. As will be seen in Section 5, large variance of the base classifiers is not a requirement for boosting to be effective. In some cases, boosting even increases the variance while reducing the overall generalization error.

Intuitively, it might seem reasonable to think that because we are simply voting the base classifiers, we are not actually increasing their complexity but merely "smoothing" their predictions. However, as argued in Section 5.4, the complexity of such combined classifiers can be much greater than that of the base classifiers and can result in overfitting.

In this paper, we present an alternative theoretical analysis of voting methods, applicable, for instance, to bagging, boosting, arcing [8] and ECOC [13]. Our approach is based on a similar result presented by Bartlett [2] in a different context. We prove rigorous, non-asymptotic upper bounds on the generalization error of voting methods in terms of a measure of performance of the combined classifier on the training set. Our bounds also depend on the number of training examples and the complexity of the base classifiers, but do not depend explicitly on the number of base classifiers. Although too loose

[Figure 2 appears here: two scatter plots comparing test error rates (0-30%) of C4.5 against bagging C4.5 (left) and against boosting C4.5 (right).]
Figure 2: Comparison of C4.5 versus bagging C4.5 and boosting C4.5 on a set of 27 benchmark problems as reported by Freund and Schapire [18]. Each point in each scatter plot shows the test error rate of the two competing algorithms on a single benchmark. The x-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the y-coordinate gives the error rate of bagging (left plot) or boosting (right plot). All error rates have been averaged over multiple runs.

to give practical quantitative predictions, our bounds do give a qualitative explanation of the shape of the observed learning curves, and our analysis may be helpful in understanding why these algorithms fail or succeed, possibly leading to the design of even more effective voting methods.

The key idea of this analysis is the following. In order to analyze the generalization error, one should consider more than just the training error, i.e., the number of incorrect classifications in the training set. One should also take into account the confidence of the classifications. Here, we use a measure of the classification confidence for which it is possible to prove that an improvement in this measure of confidence on the training set guarantees an improvement in the upper bound on the generalization error.

Consider a combined classifier whose prediction is the result of a vote (or a weighted vote) over a set of base classifiers. Suppose that the weights assigned to the different base classifiers are normalized so that they sum to one. Fixing our attention on a particular example, we refer to the sum of the weights of the base classifiers that predict a particular label as the weight of that label. We define the classification margin for the example as the difference between the weight assigned to the correct label and the maximal weight assigned to any single incorrect label. It is easy to see that the margin is a number in the range [-1, +1] and that an example is classified correctly if and only if its margin is positive. A large positive margin can be interpreted as a "confident" correct classification.

Now consider the distribution of the margin over the whole set of training examples. To visualize this distribution, we plot the fraction of examples whose margin is at most θ as a function of θ in [-1, +1]. We refer to these graphs as margin distribution graphs. At the bottom of Figure 1, we show the margin distribution graphs that correspond to the experiments described above. Our main observation is that both boosting and bagging tend to increase the margins associated with

examples and converge to a margin distribution in which most examples have large margins. Boosting is especially aggressive in its effect on examples whose initial margin is small. Even though the training error remains unchanged (at zero) after round 5, the margin distribution graph changes quite significantly so that after 100 iterations all examples have a margin larger than 0.5. In comparison, on round 5, about 7.7% of the examples have margin below 0.5. Our experiments, detailed later in the paper, show that there is a good correlation between a reduction in the fraction of training examples with small margin and improvements in the test error.

The idea that maximizing the margin can improve the generalization error of a classifier was previously suggested and studied by Vapnik [42] and led to his work with Cortes on support-vector classifiers [10], and with Boser and Guyon [5] on optimal margin classifiers. In Section 6, we discuss the relation between our work and Vapnik's in greater detail. Shawe-Taylor et al. [38] gave bounds on the generalization error of support-vector classifiers in terms of the margins, and Bartlett [2] used related techniques to give a similar bound for neural networks with small weights. A consequence of Bartlett's result is a bound on the generalization error of a voting classifier in terms of the fraction of training examples with small margin. In Section 2, we use a similar but simpler approach to give a slightly better bound.

Here we give the main intuition behind the proof. This idea brings us back to Occam's razor, though in a rather indirect way. Recall that an example is classified correctly if its margin is positive. If an example is classified by a large margin (either positive or negative), then small changes to the weights in the majority vote are unlikely to change the label. If most of the examples have a large margin then the classification error of the original majority vote and the perturbed majority vote will be similar. Suppose now that we had a small set of weighted majority rules that was fixed ahead of time, called the approximating set. One way of perturbing the weights of the classifier majority vote is to find a nearby rule within the approximating set. As the approximating set is small, we can guarantee that the error of the approximating rule on the training set is similar to its generalization error, and as its error is similar to that of the original rule, the generalization error of the original rule should also be small. Thus, we are back to an Occam's razor argument in which instead of arguing that the classification rule itself is simple, we argue that the rule is close to a simple rule.

Boosting is particularly good at finding classifiers with large margins in that it concentrates on those examples whose margins are small (or negative) and forces the base learning algorithm to generate good classifications for those examples. This process continues even after the training error has reached zero, which explains the continuing drop in test error. In Section 3, we show that the powerful effect of boosting on the margin is not merely an empirical observation but is in fact the result of a provable property of the algorithm. Specifically, we are able to prove upper bounds on the number of training examples below a particular margin in terms of the training errors of the individual base classifiers. Under certain conditions, these bounds imply that the number of training examples with small margin drops exponentially fast with the number of base classifiers.
In Section 4, we give more examples of margin distribution graphs for other datasets, base learning algorithms and combination methods. In Section 5, we discuss the relation of our work to bias-variance decompositions. In Section 6, we compare our work to Vapnik's optimal margin classifiers, and in Section 7, we briefly discuss similar results for learning convex combinations of functions for loss measures other than classification error.
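The margin and the cumulative margin distribution used throughout the paper can be computed directly from the base classifiers' weighted votes. The following sketch is ours, not the authors' code; it assumes the base classifiers' predictions on the training set and their normalized voting weights are already available.

```python
import numpy as np

def margins(vote_weights, predictions, labels):
    """Margins of a voting classifier on a set of examples.

    vote_weights: shape (T,), nonnegative weights summing to one.
    predictions:  shape (T, m); predictions[t, i] is the label predicted
                  for example i by base classifier t.
    labels:       shape (m,), the correct labels.

    The margin of example i is the total weight voting for the correct label
    minus the largest total weight voting for any single incorrect label;
    it lies in [-1, +1] and is positive iff the example is classified correctly.
    """
    T, m = predictions.shape
    classes = np.unique(np.concatenate([labels, predictions.ravel()]))
    # weight assigned to each label on each example: shape (num_classes, m)
    label_weight = np.array([
        (vote_weights[:, None] * (predictions == c)).sum(axis=0) for c in classes
    ])
    correct_idx = np.searchsorted(classes, labels)
    correct_weight = label_weight[correct_idx, np.arange(m)]
    label_weight[correct_idx, np.arange(m)] = -np.inf    # exclude the correct label
    wrong_weight = label_weight.max(axis=0)               # best incorrect label
    return correct_weight - wrong_weight

def cumulative_margin_distribution(margin_values, thetas):
    """Fraction of examples with margin at most theta, for each theta."""
    margin_values = np.sort(margin_values)
    return np.searchsorted(margin_values, thetas, side="right") / len(margin_values)
```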

2  Generalization Error as a Function of Margin Distributions

In this section, we prove that achieving a large margin on the training set results in an improved bound on the generalization error. This bound does not depend on the number of classifiers that are combined in the vote. The approach we take is similar to that of Shawe-Taylor et al. [38] and Bartlett [2], but the proof here is simpler and more direct. A slightly weaker version of Theorem 1 is a special case of Bartlett's main result.

We give a proof for the special case in which there are just two possible labels {-1, +1}. In Appendix A, we examine the case of larger finite sets of labels.

Let H denote the space from which the base classifiers are chosen; for example, for C4.5 or CART, it is the space of decision trees of an appropriate size. A base classifier h in H is a mapping from an instance space X to {-1, +1}. We assume that examples are generated independently at random according to some fixed but unknown distribution D over X x {-1, +1}. The training set S is a list of m pairs S = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)) chosen according to D. We use P_{(x,y)~D}[A] to denote the probability of the event A when the example (x, y) is chosen according to D, and P_{(x,y)~S}[A] to denote probability with respect to choosing an example uniformly at random from the training set. When clear from context, we abbreviate these by P_D[A] and P_S[A]. We use E_D[A] and E_S[A] to denote expected value in a similar manner.

We define the convex hull C of H as the set of mappings that can be generated by taking a weighted average of classifiers from H:

$$ \mathcal{C} \;=\; \left\{ f : x \mapsto \sum_{h \in \mathcal{H}} a_h\, h(x) \;\Bigm|\; a_h \ge 0,\ \sum_{h} a_h = 1 \right\}, $$

where it is understood that only finitely many a_h's may be nonzero.² The majority vote rule that is associated with f gives the wrong prediction on the example (x, y) only if yf(x) <= 0. Also, the margin of an example (x, y) in this case is simply yf(x).

The following two theorems, the main results of this section, state that with high probability, the generalization error of any majority vote classifier can be bounded in terms of the number of training examples with margin below a threshold θ, plus an additional term which depends on the number of training examples, some complexity measure of H, and the threshold θ (preventing us from choosing θ too close to zero).

The first theorem applies to the case that the base-classifier space H is finite, such as the set of all decision trees of a given size over a set of discrete-valued features. In this case, our bound depends only on log|H|, which is roughly the description length of a classifier in H. This means that we can tolerate very large classifier classes. If H is infinite — such as the class of decision trees over continuous features — the second theorem gives a bound in terms of the Vapnik-Chervonenkis dimension³ of H. Note that the theorems apply to every majority vote classifier, regardless of how it is computed. Thus, the theorems apply to any voting method, including boosting, bagging, etc.

² A finite support is not a requirement for our proof but is sufficient for the application here, which is to majority votes over a finite number of base classifiers.
³ Recall that the VC-dimension is defined as follows: Let F be a family of functions f : X -> Y where |Y| = 2. Then the VC-dimension of F is defined to be the largest number d such that there exist x_1, ..., x_d in X for which |{(f(x_1), ..., f(x_d)) : f in F}| = 2^d. Thus, the VC-dimension is the cardinality of the largest subset Z of the space X for which the set of restrictions to Z of functions in F contains all functions from Z to Y.

2.1  Finite base-classifier spaces

Theorem 1. Let D be a distribution over X x {-1, +1}, and let S be a sample of m examples chosen independently at random according to D. Assume that the base-classifier space H is finite, and let δ > 0. Then with probability at least 1 - δ over the random choice of the training set S, every weighted average function f in C satisfies the following bound for all θ > 0:

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; O\!\left( \frac{1}{\sqrt{m}} \left( \frac{\log m \,\log|\mathcal{H}|}{\theta^2} + \log(1/\delta) \right)^{1/2} \right). $$

Proof: For the sake of the proof, we define C_N to be the set of unweighted averages over N elements from H:

$$ \mathcal{C}_N \;=\; \left\{ g : x \mapsto \frac{1}{N} \sum_{j=1}^{N} h_j(x) \;\Bigm|\; h_j \in \mathcal{H} \right\}. $$

We allow the same h in H to appear multiple times in the sum. This set will play the role of the approximating set in the proof.

Any majority vote classifier f in C can be associated with a distribution over H as defined by the coefficients a_h. By choosing N elements of H independently at random according to this distribution we can generate an element of C_N. Using such a construction we map each f in C to a distribution Q over C_N. That is, a function g in C_N distributed according to Q is selected by choosing h_1, ..., h_N independently at random according to the coefficients a_h and then defining g(x) = (1/N) Σ_{j=1}^N h_j(x).

Our goal is to upper bound the generalization error of f in C. For any g in C_N and θ > 0 we can separate this probability into two terms:

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_D\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] + P_D\big[\, y g(x) > \tfrac{\theta}{2},\ y f(x) \le 0 \,\big]. \tag{1} $$

This holds because, in general, for two events A and B,

$$ P[A] \;=\; P[B \wedge A] + P[\bar{B} \wedge A] \;\le\; P[B] + P[\bar{B} \wedge A]. \tag{2} $$

As Equation (1) holds for any g in C_N, we can take the expected value of the right-hand side with respect to the distribution Q and get:

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; \mathbf{E}_{g\sim Q}\Big[ P_D\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] \Big] \;+\; \mathbf{E}_{(x,y)\sim D}\Big[ P_{g\sim Q}\big[\, y g(x) > \tfrac{\theta}{2},\ y f(x) \le 0 \,\big] \Big]. \tag{3} $$

We bound both terms in Equation (3) separately, starting with the second term. Consider a fixed example (x, y) and take the probability inside the expectation with respect to the random choice of g. It is clear that E_{g~Q}[g(x)] = f(x), so when yf(x) <= 0 the probability inside the expectation is at most the probability that the average of N random draws from a distribution over {-1, +1} is larger than its expected value by more than θ/2. The Chernoff bound yields

$$ P_{g\sim Q}\big[\, y g(x) > \tfrac{\theta}{2},\ y f(x) \le 0 \,\big] \;\le\; e^{-N\theta^2/8}. \tag{4} $$

To upper bound the first term in (3) we use the union bound. That is, the probability over the choice of S that there exists any g in C_N and θ > 0 for which

$$ P_D\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] \;>\; P_S\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] + \epsilon_N $$

is at most (N+1)|C_N| e^{-2m ε_N²}. The exponential term e^{-2m ε_N²} comes from the Chernoff bound, which holds for any single choice of g and θ. The term (N+1)|C_N| is an upper bound on the number of such choices, where we have used the fact that, because of the form of functions in C_N, we need only consider values of θ of the form 2i/N for i = 0, ..., N. Note that |C_N| <= |H|^N.

Thus, if we set

$$ \epsilon_N \;=\; \sqrt{ \frac{1}{2m} \ln\!\left( \frac{(N+1)\,|\mathcal{H}|^N}{\delta_N} \right) }, $$

and take expectation with respect to Q, we get that, with probability at least 1 - δ_N,

$$ P_{D,\,g\sim Q}\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] \;\le\; P_{S,\,g\sim Q}\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] + \epsilon_N \tag{5} $$

for every choice of θ, and every distribution Q.

To finish the argument we relate the fraction of the training set on which yg(x) <= θ/2 to the fraction on which yf(x) <= θ, which is the quantity that we measure. Using Equation (2) again, we have that

$$ P_{S,\,g\sim Q}\big[\, y g(x) \le \tfrac{\theta}{2} \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; \mathbf{E}_{(x,y)\sim S}\Big[ P_{g\sim Q}\big[\, y g(x) \le \tfrac{\theta}{2},\ y f(x) > \theta \,\big] \Big]. \tag{6} $$

To bound the expression inside the expectation we use the Chernoff bound as we did for Equation (4) and get

$$ P_{g\sim Q}\big[\, y g(x) \le \tfrac{\theta}{2},\ y f(x) > \theta \,\big] \;\le\; e^{-N\theta^2/8}. \tag{7} $$

Let δ_N = δ / (N(N+1)) so that the probability of failure for any N will be at most Σ_{N ≥ 1} δ_N = δ. Then combining Equations (3), (4), (5), (6) and (7), we get that, with probability at least 1 - δ, for every θ > 0 and every N >= 1:

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; 2\,e^{-N\theta^2/8} \;+\; \sqrt{ \frac{1}{2m} \ln\!\left( \frac{N(N+1)^2\,|\mathcal{H}|^N}{\delta} \right) }. \tag{8} $$

Finally, the statement of the theorem follows by setting N = ceil( (4/θ²) ln( m / ln|H| ) ).

2.2  Discussion of the bound

Let us consider the quantitative predictions that can be made using Theorem 1. It is not hard to show that, if δ > 0 and θ > 0 are held fixed as m grows, the bound given in Equation (8) with the choice of N given in the theorem converges to

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; \sqrt{ \frac{2\,\ln m\,\ln|\mathcal{H}|}{m\,\theta^2} }\,\big(1 + o(1)\big). \tag{9} $$

In fact, if θ <= 1/2, δ >= 0.01 (1% probability of failure), |H| >= 10^6 and m >= 1000, then the second term on the right-hand side of Equation (9) is a pretty good approximation of the second and third terms on the right-hand side of Equation (8), as is demonstrated in Figure 3.

From Equation (9) and from Figure 3 we see that the bounds given here start to be meaningful only when the size of the training set is in the tens of thousands. As we shall see in Section 4, the actual performance of AdaBoost is much better than predicted by our bounds; in other words, while our bounds are not asymptotic (i.e., they hold for any size of the training set), they are still very loose. The bounds we give in the next section for infinite base-classifier spaces are even looser. It is an important and challenging open problem to prove tighter bounds.

[Figure 3 appears here: the value of the bound (vertical axis, 0 to 1) plotted against the size of the training set (horizontal axis, logarithmic scale from 10,000 to 1,000,000).]

Figure 3: A few plots of the second and third terms in the bound given in Equation (8) (solid lines) and their approximation by the second term in Equation (9) (dotted lines). The horizontal axis denotes the number of training examples (with a logarithmic scale) and the vertical axis denotes the value of the bound. All plots are for δ = 0.01 and |H| = 10^6. Each pair of close lines corresponds to a different value of θ; counting the pairs from the upper right to the lower left, the values of θ are 1/20, 1/8, 1/4 and 1/2.
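As a rough illustration of why these bounds only become meaningful for training sets in the tens of thousands, the short sketch below evaluates the dominant term of Equation (9) for the parameter values used in Figure 3. This is our own back-of-the-envelope check, not the exact quantity plotted in the figure.

```python
import math

def eq9_term(m, theta, H_size=10**6):
    """Dominant term of Equation (9): sqrt(2 * ln(m) * ln|H| / (m * theta^2))."""
    return math.sqrt(2.0 * math.log(m) * math.log(H_size) / (m * theta * theta))

for theta in (1/20, 1/8, 1/4, 1/2):
    for m in (10**4, 10**5, 10**6):
        print(f"theta={theta:>6.3f}  m={m:>8d}  bound term ~ {eq9_term(m, theta):.3f}")
```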

From a practical standpoint, the fact that the bounds we give are so loose suggests that there might exist criteria different from the one used in Theorem 1 which are better in predicting the performance of voting classifiers. Breiman [7] and Grove and Schuurmans [23] experimented with maximizing the minimal margin, that is, the smallest margin achieved on the training set. The advantage of using this criterion is that maximizing the minimal margin can be done efficiently (if the set of base classifiers is not too large) using linear programming. Unfortunately, their experiments indicate that altering the combined classifier generated by AdaBoost so as to maximize the minimal margin increases the generalization error more often than not. In one experiment reported by Breiman, the generalization error increases even though the margins of all of the instances are increased (for this dataset, called "ionosphere," the number of instances is 351, much too small for our bounds to apply). While none of these experiments contradict the theory, they highlight the incompleteness of the theory and the need to refine it.

2.3  Infinite base-classifier spaces

Theorem 2. Let D be a distribution over X x {-1, +1}, and let S be a sample of m examples chosen independently at random according to D. Suppose the base-classifier space H has VC-dimension d, and let δ > 0. Assume that m >= d >= 1. Then with probability at least 1 - δ over the random choice of the training set S, every weighted average function f in C satisfies the following bound for all θ > 0:

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; O\!\left( \frac{1}{\sqrt{m}} \left( \frac{d\,\log^2(m/d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right). $$

The proof of this theorem uses the following uniform convergence result, which is a refinement of the Vapnik and Chervonenkis result due to Devroye [11]. Let A be a class of subsets of a space Z, and define

$$ s(\mathcal{A}, m) \;=\; \max_{\langle z_1,\ldots,z_m\rangle \in Z^m} \Big| \big\{ A \cap \{z_1,\ldots,z_m\} : A \in \mathcal{A} \big\} \Big|. $$

Lemma 3 (Devroye). For any class A of subsets of Z, and for a sample S of m examples chosen independently at random according to a distribution D over Z, we have

$$ P_{S\sim \mathcal{D}^m}\!\left[ \sup_{A\in\mathcal{A}} \big| P_S[A] - P_D[A] \big| > \epsilon \right] \;\le\; 4\,e^{8}\, s(\mathcal{A}, m^2)\, \exp\!\big(-2m\epsilon^2\big). $$
In other words, the lemma bounds the probability of a significant deviation between the empirical and true probabilities of any of the events in the family A.

Proof (of Theorem 2): The proof proceeds in the same way as that of Theorem 1, until we come to upper bound the first term in (3). Rather than the union bound, we use Lemma 3. Define

$$ \mathcal{A} \;=\; \Big\{ \big\{ (x,y) \in X \times \{-1,+1\} : y g(x) > \tfrac{\theta}{2} \big\} \;:\; g \in \mathcal{C}_N,\ \theta > 0 \Big\}. $$

Since the VC-dimension of H is d, Sauer's lemma [33, 41] states that

$$ \Big| \big\{ \langle h(x_1),\ldots,h(x_m)\rangle : h \in \mathcal{H} \big\} \Big| \;\le\; \sum_{i=0}^{d} \binom{m}{i} \;\le\; \left( \frac{e m}{d} \right)^{d} $$

for m >= d >= 1. This implies that

$$ \Big| \big\{ \langle g(x_1),\ldots,g(x_m)\rangle : g \in \mathcal{C}_N \big\} \Big| \;\le\; \left( \frac{e m}{d} \right)^{d N}, $$

since each g in C_N is composed of N functions from H. Since we need only consider N + 1 distinct values of θ, it follows that s(A, m) <= (N+1)(em/d)^{dN}. We can now apply Lemma 3 to bound the probability inside the expectation in the first term of (3). Setting

$$ \epsilon_N \;=\; \sqrt{ \frac{1}{2m} \ln\!\left( \frac{4\,e^{8}\,(N+1)\,(e m^2/d)^{dN}}{\delta_N} \right) } $$

and taking expectation with respect to Q, we get that, with probability at least 1 - δ_N, (5) holds for all θ > 0. Proceeding as in the proof of Theorem 1, we get that, with probability at least 1 - δ, for all θ > 0 and N >= 1,

$$ P_D\big[\, y f(x) \le 0 \,\big] \;\le\; P_S\big[\, y f(x) \le \theta \,\big] \;+\; 2\,e^{-N\theta^2/8} \;+\; \sqrt{ \frac{1}{2m} \ln\!\left( \frac{4\,e^{8}\,N(N+1)^2\,(e m^2/d)^{dN}}{\delta} \right) }. $$

Setting N = ceil( (4/θ²) ln(m/d) ) completes the proof.

2.4  Sketch of a more general approach

Instead of the proof above, we can use a more general approach which can also be applied to any class of real-valued functions. The use of an approximating class, such as C_N in the proofs of Theorems 1 and 2, is central to our approach. We refer to such an approximating class as a sloppy cover. More formally, for a class F of real-valued functions, a training set S of size m, and positive real numbers θ and ε, we say that a function class F̂ is an ε-sloppy θ-cover of F with respect to S if, for all f in F, there exists f̂ in F̂ with P_{x~S}[ |f̂(x) - f(x)| > θ ] < ε. Let N(F, θ, ε, m) denote the maximum, over all training sets S of size m, of the size of the smallest ε-sloppy θ-cover of F with respect to S. Standard techniques yield the following theorem (the proof is essentially identical to that of Theorem 2 of Bartlett [2]).

Theorem 4. Let F be a class of real-valued functions defined on the instance space X. Let D be a distribution over X x {-1, +1}, and let S be a sample of m examples chosen independently at random according to D. Let ε > 0 and let θ > 0. Then the probability over the random choice of the training set S that there exists any function f in F for which

$$ P_D\big[\, y f(x) \le 0 \,\big] \;>\; P_S\big[\, y f(x) \le \theta \,\big] + \epsilon $$

is at most

$$ \mathcal{N}\big(\mathcal{F}, \tfrac{\theta}{2}, \tfrac{\epsilon}{8}, 2m\big)\; 2\,\exp\!\big( -\epsilon^2 m / 32 \big). $$

Theorem 2 can now be proved by constructing a sloppy cover using the same probabilistic argument as in the proofs of Theorems 1 and 2, i.e., by choosing an element of C_N randomly by sampling functions from H. In addition, this result leads to a slight improvement (by log factors) of the main result of Bartlett [2], which gives bounds on generalization error for neural networks with real outputs in terms of the size of the network weights and the margin distribution.

3  The Effect of Boosting on Margin Distributions

We now give theoretical evidence that Freund and Schapire's [20] AdaBoost algorithm is especially suited to the task of maximizing the number of training examples with large margin. We briefly review their algorithm. We adopt the notation used in the previous section, and restrict our attention to the binary case.

Boosting works by sequentially rerunning a base learning algorithm, each time using a different distribution over training examples. That is, on each round t = 1, ..., T, a distribution D_t is computed over the training examples, or, formally, over the set of indices {1, ..., m}. The goal of the base learning algorithm then is to find a classifier h_t with small error ε_t = P_{i~D_t}[ y_i ≠ h_t(x_i) ]. The distribution used by AdaBoost is initially uniform (D_1(i) = 1/m), and then is updated multiplicatively on each round:

$$ D_{t+1}(i) \;=\; \frac{ D_t(i)\, \exp\!\big( -y_i\, \alpha_t\, h_t(x_i) \big) }{ Z_t }. $$

Here, α_t = (1/2) ln( (1 - ε_t)/ε_t ), and Z_t is a normalization factor chosen so that D_{t+1} sums to one. In our case, Z_t can be computed exactly:

$$ Z_t \;=\; \sum_{i=1}^{m} D_t(i)\, \exp\!\big( -y_i\, \alpha_t\, h_t(x_i) \big) \;=\; 2\sqrt{ \epsilon_t (1 - \epsilon_t) }. $$

The final combined classifier is a weighted majority vote of the base classifiers, namely, sign(f), where

$$ f(x) \;=\; \frac{ \sum_{t=1}^{T} \alpha_t\, h_t(x) }{ \sum_{t=1}^{T} \alpha_t }. \tag{10} $$

Note that, on round t, AdaBoost places the most weight on examples (x_i, y_i) for which y_i Σ_{t'=1}^{t-1} α_{t'} h_{t'}(x_i) is smallest. This quantity is exactly the margin of the combined classifier computed up to this point.

Freund and Schapire [20] prove that if the training error rates of all the base classifiers are bounded below 1/2, so that ε_t <= 1/2 - γ for all t for some γ > 0, then the training error of the combined classifier decreases exponentially fast with the number of base classifiers that are combined. The training error is equal to the fraction of training examples for which yf(x) <= 0. It is a simple matter to extend their proof to show that, under the same conditions on ε_t, if θ is not too large, then the fraction of training examples for which yf(x) <= θ also decreases to zero exponentially fast with the number of base classifiers (or boosting iterations).
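Before stating the theorem, here is a compact sketch of the algorithm just reviewed (our own paraphrase, not the authors' implementation). The assumed interface base_learner(X, y, sample_weight) stands in for C4.5 or any other weighted base learner; the function returns the voting weights α_t and the normalized margins yf(x) on the training set.

```python
import numpy as np

def adaboost_margins(X, y, base_learner, T=100):
    """Binary AdaBoost with labels y in {-1, +1}.

    base_learner(X, y, sample_weight) is an assumed interface: it must return
    an object with a .predict(X) method giving values in {-1, +1}.
    Returns (classifiers, alphas, margins), with margins[i] = y_i f(x_i) and
    f(x) = sum_t alpha_t h_t(x) / sum_t alpha_t, as in Equation (10).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1: uniform distribution
    classifiers, alphas = [], []
    for _ in range(T):
        h = base_learner(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = float(np.sum(D * (pred != y)))       # weighted training error eps_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # guard against 0 or 1
        alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t
        D = D * np.exp(-alpha * y * pred)          # multiplicative update
        D /= D.sum()                               # normalize (divide by Z_t)
        classifiers.append(h)
        alphas.append(alpha)
    alphas = np.array(alphas)
    votes = np.stack([h.predict(X) for h in classifiers])    # shape (T, m)
    margins = y * (alphas @ votes) / alphas.sum()
    return classifiers, alphas, margins
```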


Theorem 5. Suppose the base learning algorithm, when called by AdaBoost, generates classifiers with weighted training errors ε_1, ..., ε_T. Then for any θ, we have that

$$ P_{(x,y)\sim S}\big[\, y f(x) \le \theta \,\big] \;\le\; 2^T \prod_{t=1}^{T} \sqrt{ \epsilon_t^{\,1-\theta}\, (1-\epsilon_t)^{\,1+\theta} }. \tag{11} $$

Proof: Note that if yf(x) <= θ then

$$ y \sum_{t=1}^{T} \alpha_t h_t(x) \;\le\; \theta \sum_{t=1}^{T} \alpha_t, $$

and so

$$ \exp\!\left( -y \sum_{t=1}^{T} \alpha_t h_t(x) + \theta \sum_{t=1}^{T} \alpha_t \right) \;\ge\; 1. $$

Therefore,

$$
\begin{aligned}
P_{(x,y)\sim S}\big[\, y f(x) \le \theta \,\big]
&\le\; \mathbf{E}_{(x,y)\sim S}\!\left[ \exp\!\left( -y \sum_{t=1}^{T} \alpha_t h_t(x) + \theta \sum_{t=1}^{T} \alpha_t \right) \right] \\
&=\; \frac{ \exp\!\big( \theta \sum_{t} \alpha_t \big) }{ m } \sum_{i=1}^{m} \exp\!\left( -y_i \sum_{t=1}^{T} \alpha_t h_t(x_i) \right) \\
&=\; \exp\!\Big( \theta \sum_{t} \alpha_t \Big) \left( \prod_{t=1}^{T} Z_t \right) \sum_{i=1}^{m} D_{T+1}(i),
\end{aligned}
$$

[Figure 4 appears here: for each of the letter, satimage and vehicle datasets, error (%) versus number of classifiers (log scale, 10 to 1000) for bagging, boosting and ECOC with C4.5, with the corresponding cumulative margin distribution graphs below.]
Figure 4: Error curves and margin distribution graphs for three voting methods (bagging, boosting and ECOC) using C4.5 as the base learning algorithm. Results are given for the letter, satimage and vehicle datasets. (See caption under Figure 1 for an explanation of these curves.)


[Figure 5 appears here: for each of the letter, satimage and vehicle datasets, error (%) versus number of classifiers (log scale, 10 to 1000) for bagging, boosting and ECOC with decision stumps, with the corresponding cumulative margin distribution graphs below.]
Figure 5: Error curves and margin distribution graphs for three voting methods (bagging, boosting and ECOC) using decision stumps as the base learning algorithm. Results are given for the letter, satimage and vehicle datasets. (See caption under Figure 1 for an explanation of these curves.)


where the last equality in the display above follows from the definition of D_{T+1}. Noting that Σ_{i=1}^{m} D_{T+1}(i) = 1, and plugging in the values of α_t and Z_t, gives the theorem.

To understand the significance of the result, assume for a moment that, for all t, ε_t <= 1/2 - γ for some γ > 0. Since here we are considering only two-class prediction problems, a random prediction will be correct exactly half of the time. Thus, the condition that ε_t <= 1/2 - γ for some small positive γ means that the predictions of the base classifiers are slightly better than random guessing. Given this assumption, we can simplify the upper bound in Equation (11) to:

$$ \left( \sqrt{ (1-2\gamma)^{1-\theta}\,(1+2\gamma)^{1+\theta} } \right)^{T}. $$

If θ < γ, it can be shown that the expression inside the parentheses is smaller than 1, so that the probability that yf(x) <= θ decreases exponentially fast with T.⁴ In practice, ε_t increases as a function of t, possibly even converging to 1/2. However, if this increase is sufficiently slow, the bound of Theorem 5 is still useful. Characterizing the conditions under which the increase is slow is an open problem.

Although this theorem applies only to binary classification problems, Freund and Schapire [20] and others [35, 36] give extensive treatment to the multiclass case (see also Section 4). All of their results can be extended to prove analogous theorems about margin distributions for this more general case.
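As a quick numerical illustration of this discussion (our own check, not part of the original analysis): with ε_t = 1/2 - γ on every round, the sketch below evaluates the right-hand side of Equation (11) for a few values of θ and T, showing geometric decay when θ is below γ and blow-up when it is above.

```python
import math

def theorem5_bound(eps_list, theta):
    """Right-hand side of Equation (11): 2^T * prod_t sqrt(eps_t^(1-theta) * (1-eps_t)^(1+theta))."""
    bound = 1.0
    for eps in eps_list:
        bound *= 2.0 * math.sqrt(eps ** (1 - theta) * (1 - eps) ** (1 + theta))
    return bound

gamma = 0.1                                    # base classifiers 10% better than random
for theta in (0.05, 0.10, 0.15):               # below, equal to, and above gamma
    for T in (10, 100, 1000):
        b = theorem5_bound([0.5 - gamma] * T, theta)
        print(f"theta={theta:.2f}  T={T:>5d}  bound={b:.3e}")
```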

4  More Margin Distribution Graphs

In this section, we describe experiments we conducted to produce a series of error curves and margin distribution graphs for a variety of datasets and learning methods.

Datasets. We used three benchmark datasets called letter, satimage and vehicle. Brief descriptions of these are given in Appendix B. Note that all three of these learning problems are multiclass with 26, 6 and 4 classes, respectively.

Voting methods. In addition to bagging and boosting, we used a variant of Dietterich and Bakiri's [13] method of error-correcting output codes (ECOC), which can be viewed as a voting method. This approach was designed to handle multiclass problems using only a two-class learning algorithm. Briefly, it works as follows: As in bagging and boosting, a given base learning algorithm (which need only be designed for two-class problems) is rerun repeatedly. However, unlike bagging and boosting, the examples are not reweighted or resampled. Instead, on each round, the labels assigned to each example are modified so as to create a new two-class labeling of the data which is induced by a simple mapping from the set of classes into {-1, +1}. The base learning algorithm is then trained using this relabeled data, generating a base classifier. The sequence of bit assignments for each of the individual labels can be viewed as a "code word." A given test example is then classified by choosing the label whose associated code word is closest in Hamming distance to the sequence of predictions generated by the base classifiers. This coding-theoretic interpretation led Dietterich and Bakiri to the idea of choosing code words with strong error-correcting properties so that they will be as far apart from one another as possible. However, in our experiments, rather than carefully constructing error-correcting codes, we simply used random output codes which are highly likely to have similar properties.
⁴ We can show that if γ is known in advance then an exponential decrease in the probability can be achieved (by a slightly different boosting algorithm) for any θ < 2γ. However, we don't know how to achieve this improvement when no nontrivial lower bound on 1/2 - ε_t is known a priori.


The ECOC combination rule can also be viewed as a voting method: Each base classifier h_t, on a given instance x, predicts a single bit h_t(x) in {-1, +1}. We can interpret this bit as a single vote for each of the labels which were mapped on round t to h_t(x). The combined hypothesis then predicts with the label receiving the most votes overall. Since ECOC is a voting method, we can measure margins just as we do for boosting and bagging.

As noted above, we used three multiclass learning problems in our experiments, whereas the version of boosting given in Section 3 only handles two-class data. Freund and Schapire [20] describe a straightforward adaption of this algorithm to the multiclass case. The problem with this algorithm is that it still requires that the accuracy of each base classifier exceed 1/2. For two-class problems, this requirement is about as minimal as can be hoped for, since random guessing will achieve accuracy 1/2. However, for multiclass problems in which k > 2 labels are possible, accuracy 1/2 may be much harder to achieve than the random-guessing accuracy rate of 1/k. For fairly powerful base learners, such as C4.5, this does not seem to be a problem. However, the accuracy 1/2 requirement can often be difficult for less powerful base learning algorithms which may be unable to generate classifiers with small training errors.

Freund and Schapire [20] provide one solution to this problem by modifying the form of the base classifiers and refining the goal of the base learner. In this approach, rather than predicting a single class for each example, the base classifier chooses a set of "plausible" labels for each example. For instance, in a character recognition task, the base classifier might predict that a particular example is either a "6," "8" or "9," rather than choosing just a single label. Such a base classifier is then evaluated using a "pseudoloss" measure which, for a given example, penalizes the base classifier (1) for failing to include the correct label in the predicted plausible label set, and (2) for each incorrect label which is included in the plausible set. The combined classifier, for a given example, then chooses the single label which occurs most frequently in the plausible label sets chosen by the base classifiers (possibly giving more or less weight to some of the base classifiers). The exact form of the pseudoloss is under the control of the boosting algorithm, and the base learning algorithm must therefore be designed to handle changes in the form of the loss measure.
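A minimal sketch of the random output-code variant described above (our own illustration, not the experimental code): fit_binary(X, y_pm) is an assumed two-class base learner returning a classifier whose .predict gives values in {-1, +1}, and y is assumed to contain integer class indices 0, ..., n_classes-1.

```python
import numpy as np

def random_ecoc(X, y, fit_binary, n_classes, n_bits=30, rng=None):
    """ECOC with random code words: each class gets a random row of +/-1 bits;
    one binary classifier is trained per bit (column); a test point is assigned
    the class whose code word is closest in Hamming distance to the predicted bits."""
    rng = np.random.default_rng(rng)
    code = rng.choice([-1, 1], size=(n_classes, n_bits))   # random code word per class
    learners = []
    for b in range(n_bits):
        y_pm = code[y, b]                 # relabel each example by its class's b-th bit
        learners.append(fit_binary(X, y_pm))

    def predict(X_new):
        bits = np.stack([h.predict(X_new) for h in learners], axis=1)   # (n, n_bits)
        # with +/-1 bits, smaller Hamming distance == larger dot product
        return np.argmax(bits @ code.T, axis=1)

    return predict
```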

Base learning algorithms. In our experiments, for the base learning algorithm, we used C4.5. We also used a simple algorithm for finding the best single-node, binary-split decision tree (a "decision stump"). Since this latter algorithm is very weak, we used the pseudoloss versions of boosting and bagging, as described above. (See Freund and Schapire [20, 18] for details.)

Results. Figures 4 and 5 show error curves and margin distribution graphs for the three datasets, three voting methods and two base learning algorithms. Note that each figure corresponds only to a single run of each algorithm. As explained in the introduction, each of the learning curve figures shows the training error (bottom) and test error (top) curves. We have also indicated as horizontal grid lines the error rate of the base classifier when run just once, as well as the error rate of the combined classifier after 1000 iterations. Note the log scale used in these figures. Margin distribution graphs are shown for 5, 100 and 1000 iterations indicated by short-dashed, long-dashed (sometimes barely visible) and solid curves, respectively.

It is interesting that, across datasets, all of the learning algorithms tend to produce margin distribution graphs of roughly the same character. As already noted, when used with C4.5, boosting is especially aggressive at increasing the margins of the examples, so much so that it is willing to suffer significant reductions in the margins of those examples that already have large margins. This can be seen in Figure 4, where we observe that the maximal margin in the final classifier is bounded well away from 1. Contrast this with the margin distribution graphs after 1000 iterations of bagging in which as many


[Table 1 appears here. For each of the five synthetic datasets (waveform, twonorm, threenorm, ringnorm, and the Kong & Dietterich problem), the table reports bias, variance and error (in percent) under both the Kong & Dietterich [26] definitions and the Breiman [8] definitions, for the base learner run by itself and for boosting and bagging with decision stumps (error-based and, for multiclass problems, pseudoloss-based versions) and with C4.5.]

Table 1: Results of bias-variance experiments using boosting and bagging on five synthetic datasets (described in Appendix B). For each dataset and each learning method, we estimated bias, variance and generalization error rate, reported in percent, using two sets of definitions for bias and variance (given in Appendix C). Both C4.5 and decision stumps were used as base learning algorithms. For stumps, we used both error-based and pseudoloss-based versions of boosting and bagging on problems with more than two classes. Columns labeled with a dash indicate that the base learning algorithm was run by itself.

as half of the examples have a margin of 1. The graphs for ECOC with C4.5 resemble in shape those for boosting more so than bagging, but tend to have overall lower margins.

Note that, on every dataset, both boosting and bagging eventually achieve perfect or nearly perfect accuracy on the training sets (at least 99%), but the generalization error for boosting is better. The explanation for this is evident from the margin distribution graphs where we see that, for boosting, far fewer training examples have margin close to zero.

It should be borne in mind that, when combining decision trees, the complexity of the trees (as measured, say, by the number of leaves) may vary greatly from one combination method to another. As a result, the margin distribution graphs may not necessarily predict which method gives better generalization error. One must also always consider the complexity of the base classifiers, as explicitly indicated by Theorems 1 and 2.

When used with stumps, boosting can achieve training error much smaller than that of the base learner; however, it is unable to achieve large margins. This is because, consistent with Theorem 5, the base classifiers have much higher training errors. Presumably, such low margins do not adversely affect the generalization error because the complexity of decision stumps is so much smaller than that of full decision trees.

5  Relation to Bias-variance Theory

One of the main explanations for the improvements achieved by voting classifiers is based on separating the expected error of a classifier into a bias term and a variance term. While the details of these definitions differ from author to author [8, 25, 26, 40], they are all attempts to capture the following quantities: the bias term measures the persistent error of the learning algorithm, in other words, the error that would remain even if we had an infinite number of independently trained classifiers; the variance term measures the error that is due to fluctuations that are a part of generating a single classifier.


The idea is that by averaging over many classifiers one can reduce the variance term and in that way reduce the expected error. In this section, we discuss a few of the strengths and weaknesses of bias-variance theory as an explanation for the performance of voting methods, especially boosting.

5.1  The bias-variance decomposition for classification

The origins of bias-variance analysis are in quadratic regression. Averaging several independently trained regression functions will never increase the expected error. This encouraging fact is nicely reflected in the bias-variance separation of the expected quadratic error. Both bias and variance are always nonnegative and averaging decreases the variance term without changing the bias term.

One would naturally hope that this beautiful analysis would carry over from quadratic regression to classification. Unfortunately, as has been observed before us (see, for instance, Friedman [22]), taking the majority vote over several classification rules can sometimes result in an increase in the expected classification error. This simple observation suggests that it may be inherently more difficult or even impossible to find a bias-variance decomposition for classification as natural and satisfying as in the quadratic regression case. This difficulty is reflected in the myriad definitions that have been proposed for bias and variance [8, 25, 26, 40]. Rather than discussing each one separately, for the remainder of this section, except where noted, we follow the definitions given by Kong and Dietterich [26], and referred to as "Definition 0" by Breiman [8]. (These definitions are given in Appendix C.)

5.2  Bagging and variance reduction

The notion of variance certainly seems to be helpful in understanding bagging; empirically, bagging appears to be most effective for learning algorithms with large variance. In fact, under idealized conditions, variance is by definition the amount of decrease in error effected by bagging a large number of base classifiers. This ideal situation is one in which the bootstrap samples used in bagging faithfully approximate truly independent samples. However, this assumption can fail to hold in practice, in which case bagging may not perform as well as expected, even when variance dominates the error of the base learning algorithm.

This can happen even when the data distribution is very simple. As a somewhat contrived example, consider data generated according to the following distribution. The label y in {-1, +1} is chosen uniformly at random. The instance x in {-1, +1}^7 is then chosen by picking each of the 7 coordinates of x to be equal to y with probability 0.9 and -y with probability 0.1. Thus, each coordinate of x is an independent noisy version of y. For our base learner, we use a learning algorithm which generates a classifier that is equal to the single coordinate of x which is the best predictor of y with respect to the training set. It is clear that each coordinate of x has the same probability of being chosen as the classifier on a random training set, so the aggregate predictor over many independently trained samples is the unweighted majority vote over the coordinates of x, which is also the Bayes optimal predictor in this case. Thus, the bias of our learning algorithm is exactly zero. The prediction error of the majority rule is roughly 0.3%, and so a variance of about 9.7% strongly dominates the expected error rate of 10%. In such a favorable case, one would predict, according to the bias-variance explanation, that bagging could get close to the error of the Bayes optimal predictor.

However, using a training set of 500 examples, the generalization error achieved by bagging is 5.6% after 200 iterations. (All results are averaged over many runs.) The reason for this poor performance is that, in any particular random sample, some of the coordinates of x are slightly more correlated with y


and bagging tends to pick these coordinates much more often than the others. Thus, in this case, the behavior of bagging is very different from its expected behavior on truly independent training sets. Boosting, on the same data, achieved a test error of 0.6%.
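The contrived example is easy to simulate. The sketch below is ours (with the same distribution and sample sizes as above); it draws the 7-coordinate data, uses the single best coordinate as the base classifier, and compares bagging's vote with the Bayes-optimal unweighted majority over all coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    y = rng.choice([-1, 1], size=n)
    flip = rng.random((n, 7)) < 0.1            # each coordinate disagrees with y w.p. 0.1
    X = np.where(flip, -y[:, None], y[:, None])
    return X, y

def best_coordinate(X, y):
    """Base learner: the single coordinate that best predicts y on the sample."""
    return int(np.argmax(X.T @ y))             # coordinate with the highest agreement

X_train, y_train = draw(500)
X_test, y_test = draw(100000)

# bagging the base learner over 200 bootstrap samples
votes = np.zeros(len(y_test))
for _ in range(200):
    idx = rng.integers(0, len(y_train), size=len(y_train))
    j = best_coordinate(X_train[idx], y_train[idx])
    votes += X_test[:, j]
bagged_err = np.mean(np.sign(votes) != y_test)

# Bayes-optimal rule: unweighted majority over all 7 coordinates
bayes_err = np.mean(np.sign(X_test.sum(axis=1)) != y_test)
print(f"bagging error ~ {bagged_err:.3f},  majority-of-coordinates error ~ {bayes_err:.3f}")
```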

5.3  Boosting and variance reduction

Breiman [8] argued that boosting is primarily a variance-reducing procedure. Some of the evidence for this comes from the observed effectiveness of boosting when used with C4.5 or CART, algorithms known empirically to have high variance. As the error of these algorithms is mostly due to variance, it is not surprising that the reduction in the error is primarily due to a reduction in the variance. However, our experiments show that boosting can also be highly effective when used with learning algorithms whose error tends to be dominated by bias rather than variance.⁵

We ran boosting and bagging on four artificial datasets described by Breiman [8], as well as the artificial problem studied by Kong and Dietterich [26]. Following previous authors, we used training sets of size 200 for the latter problem and 300 for the others. For the base learning algorithm, we tested C4.5. We also used the decision-stump base-learning algorithm described in Section 4. We then estimated bias, variance and average error of these algorithms by rerunning them 1000 times each, and evaluating them on a test set of 10,000 examples. For these experiments, we used both the bias-variance definitions given by Kong and Dietterich [26] and those proposed more recently by Breiman [8]. (Definitions are given in Appendix C.) For multiclass problems, following Freund and Schapire [18], we tested both error-based and pseudoloss-based versions of bagging and boosting. For two-class problems, only the error-based versions were used. The results are summarized in Table 1.

Clearly, boosting is doing more than reducing variance. For instance, on "ringnorm," boosting decreases the overall error of the stump algorithm from 40.6% to 12.2%, but actually increases the variance from -7.9% to 6.6% using Kong and Dietterich's definitions, or from 6.7% to 8.0% using Breiman's definitions. (We did not check the statistical significance of this increase.)

Breiman also tested boosting with a low-variance base learning algorithm, namely, linear discriminant analysis (LDA), and attributed the ineffectiveness of boosting in this case to the "stability" (low variance) of LDA. The experiments with the fairly stable stump algorithm suggest that stability in itself may not be sufficient to predict boosting's failure.

Our theory suggests a different characterization of the cases in which boosting might fail. Taken together, Theorem 1 and Theorem 5 state that boosting can perform poorly only when either (1) there is insufficient training data relative to the complexity of the base classifiers, or (2) the training errors of the base classifiers (the ε_t's in Theorem 5) become too large too quickly. Certainly, this characterization is incomplete in that boosting often succeeds even in situations in which the theory provides no guarantees. However, while we hope that tighter bounds can be given, it seems unlikely that there exists a "perfect" theory. By a perfect theory we mean here a rigorous analysis of voting methods that, on the one hand, is general enough to apply to any base learning algorithm and to any i.i.d. source of labeled instances and, on the other hand, gives bounds that are accurate predictors of the performance of the algorithm in practice. This is because in any practical situation there is structure in the data and in the base learning algorithm that is not taken into account in the assumptions of a general theory.

⁵ In fact, the original goal of boosting was to reduce the error of so-called weak learning algorithms, which tend to have very large bias [17, 20, 34].


5.4  Why averaging can increase complexity

In this section, we challenge a common intuition which says that when one takes the majority vote over several base classiers the generalization error of the resulting classier is likely to be lower than the average generalization error of the base classiers. In this view, voting is seen as a method for smoothing or averaging the classication rule. This intuition is sometimes based on the biasvariance analysis of regression described in the previous section. Also, to some, it seems to follow from a Bayesian point of view according to which integrating the classications over the posterior is better than using any single classier. If one feels comfortable with these intuitions, there seems to be little point to most of the analysis given in this paper. It seems that because AdaBoost generates a majority vote over several classiers, its generalization error is, in general, likely to be better than the average generalization error of the base classiers. According to this point of view, the suggestion we make in the introduction that the majority vote over many classiers is more complex than any single classier seems to be irrelevant and misled. In this section, we describe a base learning algorithm which, when combined using AdaBoost, is likely to generate a majority vote over base classiers whose training error goes to zero, while at the same time the generalization error does not improve at all. In other words, it is a case in which voting results in over-tting. This is a case in which the intuition described above seems to break down, while the margin-based analysis developed in this paper gives the correct answer. Suppose we use classiers that are delta-functions, i.e., they predict 1 on a single point in the input space and 1 everywhere else, or vice versa ( 1 on one point and 1 elsewhere). (If you dislike delta functions, you can replace them with nicer functions. For example, if the input space is , use balls of sufciently small radius and make the prediction 1 or 1 inside, and 1 or 1, respectively, outside.) To this class of functions we add the constant functions that are 1 everywhere or 1 everywhere. Now, for any training sample of size we can easily construct a set of at most 2 functions from our class such that the majority vote over these functions will always be correct. To do this, we associate one delta function with each training example; the delta function gives the correct value on the training example and the opposite value everywhere else. Letting and denote the number of positive and negative examples, we next add copies of the function which predicts 1 everywhere, and copies of the function which predicts 1 everywhere. It can now be veried that the sum (majority vote) of all these functions will be positive on all of the positive examples in the training set, and negative on all the negative examples. In other words, we have constructed a combined classier which exactly ts the training set. Fitting the training set seems like a good thing; however, the very fact that we can easily t such a rule to any training set implies that we dont expect the rule to be very good on independently drawn points outside of the training set. In other words, the complexity of these average rules is too large, relative to the size of the sample, to make them useful. Note that this complexity is the result of averaging. 
Each one of the delta rules is very simple (the VC-dimension of this class of functions is exactly 2), and indeed, if we found a single delta function (or constant function) that fit a large sample, we could, with high confidence, expect the rule to be correct on new randomly drawn examples. How would boosting perform in this case? It can be shown using Theorem 5 (with $\theta = 0$) that boosting would slowly but surely find a combination of the type described above having zero training error but very bad generalization error. A margin-based analysis of this example shows that while all of the classifications are correct, they are correct only with a tiny margin of size $O(1/m)$, and so we cannot expect the generalization error to be very good.
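The construction is easy to check numerically. The following sketch is our own toy illustration (not code from the paper): it builds one delta classifier per training point plus the constant classifiers described above, confirms that the unweighted vote fits an arbitrary labeling, and shows that every normalized margin is only $1/m$.

```python
# Hypothetical toy check of the delta-function construction (not code from the paper).
# Each delta classifier is +1 on one training point and -1 elsewhere (or vice versa);
# constant classifiers are added so that the unweighted vote fits any training labels,
# but only with a normalized margin of about 1/m.
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = rng.normal(size=(m, 2))          # arbitrary training points
y = rng.choice([-1, +1], size=m)     # arbitrary labels: the construction fits any labeling

def delta_classifier(point, label):
    # Predicts `label` exactly at `point` and the opposite sign everywhere else.
    return lambda x: label if np.array_equal(x, point) else -label

classifiers = [delta_classifier(X[i], y[i]) for i in range(m)]
classifiers += [(lambda x: +1)] * int(np.sum(y == +1))   # m_+ constants voting +1
classifiers += [(lambda x: -1)] * int(np.sum(y == -1))   # m_- constants voting -1

votes = np.array([sum(h(x) for h in classifiers) for x in X])
margins = y * votes / len(classifiers)                   # normalized margin in [-1, +1]

assert np.all(margins > 0)                               # zero training error ...
print(min(margins), max(margins), 1.0 / m)               # ... but margins are only about 1/m
```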


Figure 6: The maximal margins classification method. In this example, the raw data point is an element of $\mathbb{R}$, but in that space the positive and negative examples are not linearly separable. The raw input is mapped to a point in a high dimensional space (here $\mathbb{R}^2$) by a fixed nonlinear transformation $h$. In the high dimensional space, the classes are linearly separable. The vector $\vec{w}$ is chosen to maximize the minimal margin. The circled instances are the support vectors; Vapnik shows that $\vec{w}$ can always be written as a linear combination of the support vectors.

Relation to Vapnik's Maximal Margin Classifiers

The use of the margins of real-valued classifiers to predict generalization error was previously studied by Vapnik [42] in his work with Boser and Guyon [5] and Cortes [10] on optimal margin classifiers. We start with a brief overview of optimal margin classifiers. One of the main ideas behind this method is that some nonlinear classifiers over a low dimensional space can be treated as linear classifiers over a high dimensional space. For example, consider the classifier that labels an instance $x \in \mathbb{R}$ as $+1$ if $2x^5 + 5x^2 - x \ge 10$ and $-1$ otherwise. This classifier can be seen as a linear classifier if we represent each instance $x$ by the vector $h(x) = \langle 1, x, x^2, x^3, x^4, x^5 \rangle$. If we set $\vec{w} = \langle -10, -1, 5, 0, 0, 2 \rangle$, then the classification is $+1$ when $\vec{w} \cdot h(x) \ge 0$ and $-1$ otherwise. In a typical case, the data consist of about 10,000 instances in $\mathbb{R}^{100}$ which are mapped into $\mathbb{R}^{1{,}000{,}000}$. Vapnik introduced the method of kernels, which provides an efficient way of calculating the predictions of linear classifiers in the high dimensional space. Using kernels, it is usually easy to find a linear classifier that separates the data perfectly. In fact, it is likely that there are many perfect linear classifiers, many of which might have very poor generalization ability. In order to overcome this problem, the prescription suggested by Vapnik is to find the classifier that maximizes the minimal margin. More precisely, suppose that the training sample $S$ consists of pairs of the form $(x, y)$ where $x$ is the instance and $y \in \{-1, +1\}$ is its label. Assume that $h(x)$ is some fixed nonlinear mapping of instances into $\mathbb{R}^N$ (where $N$ is typically very large). Then the maximal margin classifier is defined by the vector $\vec{w}$ which maximizes

$$\min_{(x,y) \in S} \frac{y\,(\vec{w} \cdot h(x))}{\|\vec{w}\|_2}. \qquad (12)$$

Here, $\|\vec{w}\|_2$ is the $\ell_2$ or Euclidean norm of the vector $\vec{w}$. A graphical sketch of the maximal margin method is given in Figure 6. For the analysis of this method, Vapnik assumes that all of the vectors $h(x)$ are enclosed within a ball of radius $R$, i.e., they all are within Euclidean distance $R$ of some fixed vector in $\mathbb{R}^N$. Without loss of generality, we can assume that $R = 1$.
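To make the reduction from a nonlinear rule to a linear one concrete, here is a small sketch (our own illustration; the grid of test points is made up) verifying that the polynomial rule above and the linear rule $\mathrm{sign}(\vec{w} \cdot h(x))$ agree.

```python
# A small check (our own illustration) that the polynomial rule
# "predict +1 iff 2x^5 + 5x^2 - x >= 10" is a linear classifier after the
# feature map h(x) = (1, x, x^2, x^3, x^4, x^5) with w = (-10, -1, 5, 0, 0, 2).
import numpy as np

def h(x):
    return np.array([1.0, x, x**2, x**3, x**4, x**5])

w = np.array([-10.0, -1.0, 5.0, 0.0, 0.0, 2.0])

def poly_label(x):
    return +1 if 2 * x**5 + 5 * x**2 - x >= 10 else -1

def linear_label(x):
    return +1 if np.dot(w, h(x)) >= 0 else -1

for x in np.linspace(-3, 3, 61):          # hypothetical sample points
    assert poly_label(x) == linear_label(x)
print("polynomial and linear forms agree")
```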


Vapnik [42] showed that the VC dimension of all linear classifiers with minimal margin at least $\theta$ is upper bounded by $1/\theta^2$. This result implies bounds on the generalization error in terms of the expected minimal margin on test points which do not depend on the dimension of the space into which the data are mapped. However, typically, the expected value of the minimal margin is not known. Shawe-Taylor et al. [38] used techniques from the theory of learning real-valued functions to give bounds on generalization error in terms of margins on the training examples. Shawe-Taylor et al. [39] also gave related results for arbitrary real-valued classes.

Consider the relation between Equation (10) and the argument of the minimum in Equation (12). We can view the coefficients $\{a_h\}$ as the coordinates of a vector $\vec{\alpha}$, and the predictions $\{h(x)\}$ as the coordinates of the vector $\vec{h}(x)$, whose entries lie in $[-1,+1]$. Then we can rewrite Equation (10) as

$$\frac{y\,(\vec{\alpha} \cdot \vec{h}(x))}{\|\vec{\alpha}\|_1},$$

where $\|\vec{\alpha}\|_1 = \sum_h |a_h|$ is the $\ell_1$ norm of $\vec{\alpha}$. In our analysis, we use the fact that all of the components of $\vec{h}(x)$ are in the range $[-1,+1]$, or, in other words, that the max or $\ell_\infty$ norm of $\vec{h}(x)$ is bounded by 1: $\|\vec{h}(x)\|_\infty = \max_h |h(x)| \le 1$.

Viewed this way, the connection between maximal margin classifiers and boosting becomes clear. Both methods aim to find a linear combination in a high dimensional space which has a large margin on the instances in the sample. The norms used to define the margin are different in the two cases, and the precise goal is also different: maximal margin classifiers aim to maximize the minimal margin, while boosting aims to minimize an exponential weighting of the examples as a function of their margins. Our interpretation of these differences is that boosting is more suited to the case in which the mapping $h$ maps $x$ into a high dimensional space where all of the coordinates have a similar maximal range, such as $[-1,+1]$. On the other hand, the optimal margin method is suitable for cases in which the Euclidean norm of $\vec{h}(x)$ is likely to be small, such as is the case when $h$ is an orthonormal transformation between inner-product spaces. Related to this, the optimal margin method uses quadratic programming for its optimization, whereas the boosting algorithm can be seen as a method for approximate linear programming [7, 19, 21, 23].

Both boosting and support vector machines aim to find a linear classifier in a very high dimensional space. However, computationally, they are very different: support vector machines use the method of kernels to perform computations in the high dimensional space, while boosting relies on a base learning algorithm which explores the high dimensional space one coordinate at a time.

Vapnik [42] gave an alternative analysis of optimal margin classifiers, based on the number of support vectors, i.e., the number of examples that define the final classifier. This analysis is preferable to the analysis that depends on the size of the margin when only a few of the training examples are support vectors. Previous work [17] has suggested that boosting also can be used as a method for selecting a small number of informative examples from the training set. Investigating the relevance of this type of bound when applying boosting to real-world problems is an interesting open research direction.
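To put the two margin definitions side by side, the following sketch (our own illustration; the base-classifier predictions, weights, and labels are randomly made up) computes the $\ell_1$-normalized margin used for voting classifiers and the $\ell_2$-normalized margin used by maximal margin classifiers.

```python
# Our own illustration of the two margin definitions discussed above:
#   boosting / voting:     y * (alpha . h(x)) / ||alpha||_1,  with h(x) in [-1,+1]^N
#   maximal margin (SVM):  y * (w . h(x))     / ||w||_2
import numpy as np

rng = np.random.default_rng(1)
N, m = 50, 10
H = rng.choice([-1.0, +1.0], size=(m, N))     # hypothetical base-classifier predictions h(x_i)
y = rng.choice([-1.0, +1.0], size=m)          # labels
alpha = rng.random(N)                         # nonnegative voting weights (made up)
w = rng.normal(size=N)                        # a hypothetical SVM weight vector

l1_margins = y * (H @ alpha) / np.sum(np.abs(alpha))
l2_margins = y * (H @ w) / np.linalg.norm(w)

# The l1-normalized margin always lies in [-1,+1] because |h(x)| <= 1 coordinatewise;
# the l2-normalized margin is instead bounded by ||h(x)||_2.
print(l1_margins.min(), l1_margins.max())
print(l2_margins.min(), l2_margins.max())
```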

Other loss functions

We describe briefly related work which has been done on loss functions other than the 0-1 loss. For quadratic loss, Jones [24] and Barron [1] have shown that functions in the convex hull of a class $\mathcal{H}$ of real-valued functions can be approximated by a convex combination of $N$ elements of $\mathcal{H}$ to an accuracy of $O(1/N)$ by iteratively adding the member of $\mathcal{H}$ which minimizes the residual error to the existing convex combination. Lee, Bartlett and Williamson [27] extended this result to show that the procedure will converge to the best approximation in the convex hull of $\mathcal{H}$ even when the target function is not in the convex hull of $\mathcal{H}$. Lee, Bartlett and Williamson [27, 28] also studied the generalization error when this procedure is used for learning. In results analogous to those presented here, they showed that the generalization error can be bounded in terms of the sum of the absolute values of the output weights (when the members of $\mathcal{H}$ are normalized to have output values in the interval $[-1,+1]$), rather than in terms of the number of components in the convex combination. Similar work on iterative convex approximation in non-Hilbert spaces was presented by Donahue et al. [14].

To the best of our knowledge, similar iterative schemes for combining functions have not been studied for the log loss.

Extensions of boosting to solve regression problems have been suggested by Freund [17] and Freund and Schapire [20]. These extensions are yet to be tested in practice. Drucker [15] experimented with a different extension of boosting for regression and reported some encouraging results.
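The greedy scheme of Jones and Barron mentioned above is short to write down. The sketch below is our own minimal version under simplifying assumptions (a small, hypothetical dictionary of functions evaluated on a grid, and a fixed $2/(t+1)$ step size); it is meant only to illustrate the iteration, not to reproduce any published implementation.

```python
# A minimal sketch (our own, under simplifying assumptions) of greedy convex
# approximation: at each round, mix in the dictionary element that most reduces
# the squared residual to the target, keeping a convex combination.
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0, 1, 200)
dictionary = [np.sin(np.pi * k * grid) for k in range(1, 9)]      # hypothetical class H
dictionary += [-h for h in dictionary]
coeffs = rng.random(8)
target = sum(c * h for c, h in zip(coeffs / coeffs.sum(), dictionary[:8]))  # lies in conv(H)

approx = np.zeros_like(grid)
for t in range(1, 51):
    lam = 2.0 / (t + 1)                                            # step size toward the new element
    # pick the element h in H minimizing ||(1-lam)*approx + lam*h - target||^2
    best = min(dictionary,
               key=lambda h: np.sum(((1 - lam) * approx + lam * h - target) ** 2))
    approx = (1 - lam) * approx + lam * best
    if t in (1, 5, 10, 50):
        print(t, np.mean((approx - target) ** 2))                  # squared error shrinks roughly like O(1/t)
```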

Open Problems

The methods in this paper allow us to upper bound the generalization error of a voted classifier based on simple statistics which can be measured using the training data. These statistics are a function of the empirical distribution of the margins. While our bounds seem to explain the experiments qualitatively, their quantitative predictions are greatly over-pessimistic. The challenge of coming up with better bounds can be divided into two questions. First, can one give better bounds that are a function of the empirical margins distribution? Second, are there better bounds that are functions of other statistics?

A different approach to understanding the behavior of AdaBoost is to find functions of the training set which predict the generalization error well on all or most of the datasets encountered in practice. While this approach does not give one the satisfaction of a mathematical proof, it might yield good results in practice.

Acknowledgments

Many thanks to Leo Breiman for a poignant email exchange which challenged us to think harder about these problems. Thanks also to all those who contributed to the datasets used in this paper, and to the three anonymous reviewers for many helpful criticisms.

References
[1] Andrew R. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
[2] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 1998 (to appear).
[3] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Unpublished manuscript, 1997.
[4] Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.
[5] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152, 1992.
[6] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[7] Leo Breiman. Prediction games and arcing classifiers. Technical Report 504, Statistics Department, University of California at Berkeley, 1997.
[8] Leo Breiman. Arcing classifiers. Annals of Statistics, to appear.
[9] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[11] Luc Devroye. Bounds for the uniform deviation of empirical measures. Journal of Multivariate Analysis, 12:72-79, 1982.
[12] Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Unpublished manuscript, 1998.
[13] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, January 1995.
[14] M. J. Donahue, L. Gurvits, C. Darken, and E. Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13:187-220, 1997.
[15] Harris Drucker. Improving regressors using boosting techniques. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 107-115, 1997.
[16] Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems 8, pages 479-485, 1996.
[17] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
[18] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156, 1996.
[19] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332, 1996.
[20] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
[21] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.
[22] Jerome H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Available electronically from http://stat.stanford.edu/~jhf.
[23] Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.
[24] Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Annals of Statistics, 20(1):608-613, 1992.
[25] Ron Kohavi and David H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 275-283, 1996.
[26] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313-321, 1995.
[27] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.
[28] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, to appear.
[29] Richard Maclin and David Opitz. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence, pages 546-551, 1997.
[30] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[31] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725-730, 1996.
[32] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[33] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.
[34] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
[35] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 313-321, 1997.
[36] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998.
[37] Holger Schwenk and Yoshua Bengio. Training methods for adaptive boosting of neural networks for character recognition. In Advances in Neural Information Processing Systems 10, 1998.
[38] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. A framework for structural risk minimisation. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 68-76, 1996.
[39] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. Technical Report NC-TR-96-053, Neurocolt, 1996.
[40] Robert Tibshirani. Bias, variance and prediction error for classification rules. Technical report, University of Toronto, November 1996.
[41] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280, 1971.
[42] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

A Generalization Error for Multiclass Problems
In this appendix, we describe how Theorems 1 and 2 can be extended to multiclass problems. Suppose there are $k$ classes, and define $\mathcal{Y} = \{1, 2, \ldots, k\}$ as the output space. We formally view the base classifiers as mappings from $\mathcal{X} \times \mathcal{Y}$ to $\{0, 1\}$, with the interpretation that if $h(x, y) = 1$ then $y$ is predicted by $h$ to be a plausible label for $x$. This general form of classifier covers the forms of base classifiers used throughout this paper. For simple classifiers, like the decision trees computed by C4.5, only a single label is predicted so that, for each $x$, $h(x, y) = 1$ for exactly one label $y$. However, some of the other combination methods, such as pseudoloss-based boosting or bagging, as well as Dietterich and Bakiri's output-coding method, use base classifiers which vote for a set of plausible labels so that $h(x, y)$ may be 1 for several labels $y$. We define the convex hull $\mathcal{C}$ of $\mathcal{H}$ as

$$\mathcal{C} = \left\{ f : (x,y) \mapsto \sum_{h \in \mathcal{H}} a_h\, h(x,y) \;\middle|\; a_h \ge 0,\; \sum_h a_h = 1 \right\},$$

so a classifier $f \in \mathcal{C}$ predicts label $y$ for input $x$ if $f(x,y) \ge \max_{y' \ne y} f(x,y')$ (and ties are broken arbitrarily). We define the margin of an example $(x,y)$ for such a function $f$ as

$$\mathrm{margin}(f, x, y) = f(x,y) - \max_{y' \ne y} f(x,y'). \qquad (13)$$

Clearly, $f$ gives the wrong prediction on $(x,y)$ only if $\mathrm{margin}(f,x,y) \le 0$. With these definitions, we have the following generalization of Theorems 1 and 2.
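The margin of Equation (13) is straightforward to compute from a table of values $f(x,y)$. The following sketch is our own illustration with made-up numbers.

```python
# Our own illustration of the multiclass margin of Equation (13):
#   margin(f, x, y) = f(x, y) - max_{y' != y} f(x, y').
import numpy as np

k = 4                                           # number of classes
# f_values[i, y] = f(x_i, y): weighted fraction of base classifiers voting y plausible for x_i
f_values = np.array([[0.70, 0.10, 0.15, 0.05],  # hypothetical values
                     [0.30, 0.40, 0.20, 0.10],
                     [0.25, 0.25, 0.25, 0.25]])
labels = np.array([0, 2, 1])                    # correct labels y_i

def multiclass_margin(f_row, y):
    others = np.delete(f_row, y)                # f(x, y') for y' != y
    return f_row[y] - others.max()

margins = np.array([multiclass_margin(f_values[i], labels[i]) for i in range(len(labels))])
print(margins)        # 0.55 for the first example, -0.20 for the second, 0.0 for the third
misclassified = margins <= 0                    # wrong (or tied) prediction exactly when margin <= 0
print(misclassified)
```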

Theorem 6 Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, and let $S$ be a sample of $m$ examples chosen independently at random according to $\mathcal{D}$. Assume that the base-classifier space $\mathcal{H}$ is finite, and let $\delta > 0$. Then with probability at least $1 - \delta$ over the random choice of the training set $S$, every function $f \in \mathcal{C}$ satisfies the following bound for all $\theta > 0$:

$$P_{\mathcal{D}}\big[\mathrm{margin}(f,x,y) \le 0\big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + O\!\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \, \log|\mathcal{H}|}{\theta^2} + \log(1/\delta)\right)^{1/2}\right).$$

More generally, for finite or infinite $\mathcal{H}$ with VC-dimension $d$, the following bound holds as well, assuming that $m \ge d \ge 1$:

$$P_{\mathcal{D}}\big[\mathrm{margin}(f,x,y) \le 0\big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + O\!\left(\frac{1}{\sqrt{m}}\left(\frac{d \log^2(m/d)}{\theta^2} + \log(1/\delta)\right)^{1/2}\right).$$

Proof: The proof closely follows that of Theorem 1, so we only describe the differences. We first consider the case of finite $\mathcal{H}$. First, we define

$$\mathcal{C}_N = \left\{ g : (x,y) \mapsto \frac{1}{N}\sum_{j=1}^{N} h_j(x,y) \;\middle|\; h_j \in \mathcal{H} \right\}.$$

As in the proof of Theorem 1, for any $f \in \mathcal{C}$ we choose an approximating function $g \in \mathcal{C}_N$ at random according to the distribution $\mathcal{Q}$ induced by $f$ (obtained by sampling $N$ base classifiers independently according to the coefficients $a_h$), and we have

$$P_{\mathcal{D}}\big[\mathrm{margin}(f,x,y) \le 0\big] \le P_{\mathcal{D},\,g\sim\mathcal{Q}}\big[\mathrm{margin}(g,x,y) \le \theta/2\big] + E_{\mathcal{D}}\Big[P_{g\sim\mathcal{Q}}\big[\mathrm{margin}(g,x,y) > \theta/2 \mid \mathrm{margin}(f,x,y) \le 0\big]\Big].$$

We bound the second term of the right hand side as follows. Fix $x$ and $y$, and let $y'$ achieve the maximum in Equation (13) so that

$$\mathrm{margin}(f,x,y) = f(x,y) - f(x,y').$$

Clearly,

$$\mathrm{margin}(g,x,y) \le g(x,y) - g(x,y'),$$

so

$$P_{g\sim\mathcal{Q}}\big[\mathrm{margin}(g,x,y) > \theta/2 \mid \mathrm{margin}(f,x,y) \le 0\big] \le P_{g\sim\mathcal{Q}}\big[g(x,y) - g(x,y') > \theta/2 \mid f(x,y) - f(x,y') \le 0\big] \le e^{-N\theta^2/8},$$

using the Chernoff bound.

Equations (5) and (6) follow exactly as in the proof of Theorem 1 with $y\,g(x)$ and $y\,f(x)$ replaced by $\mathrm{margin}(g,x,y)$ and $\mathrm{margin}(f,x,y)$. We can derive the analog of Equation (7) as follows:

$$P_{S,\,g\sim\mathcal{Q}}\big[\mathrm{margin}(g,x,y) \le \theta/2\big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + E_{S}\Big[P_{g\sim\mathcal{Q}}\big[\mathrm{margin}(g,x,y) \le \theta/2 \mid \mathrm{margin}(f,x,y) > \theta\big]\Big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + k\,e^{-N\theta^2/8},$$

where the last step uses the Chernoff bound together with a union bound over the incorrect labels. Proceeding as in the proof of Theorem 1, we see that, with probability at least $1 - \delta_N$, for any $\theta > 0$ and $N \ge 1$,

$$P_{\mathcal{D}}\big[\mathrm{margin}(f,x,y) \le 0\big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + (k+1)\,e^{-N\theta^2/8} + \sqrt{\frac{1}{2m}\ln\!\left(\frac{N(N+1)^2\,|\mathcal{H}|^N}{\delta_N}\right)}.$$

Setting $N = (4/\theta^2)\ln\big(m/(2\ln|\mathcal{H}|)\big)$ gives the result.

For infinite $\mathcal{H}$, we follow essentially the same modifications to the argument above as used in the proof of Theorem 2. As before, to apply Lemma 3, we need to derive an upper bound on the number of values that the vector

$$\big\langle \mathrm{margin}(g,x_1,y_1), \ldots, \mathrm{margin}(g,x_m,y_m) \big\rangle$$

can take as $g$ ranges over $\mathcal{C}_N$, where $(x_1,y_1),\ldots,(x_m,y_m)$ are the examples in the sample $S$. Let $d$ be the VC-dimension of $\mathcal{H}$. Then applying Sauer's lemma to the set $\{(x_i, y) : 1 \le i \le m,\; y \in \mathcal{Y}\}$ gives

$$\Big|\big\{\big\langle h(x_1,1),\ldots,h(x_1,k),\ldots,h(x_m,1),\ldots,h(x_m,k)\big\rangle : h \in \mathcal{H}\big\}\Big| \le \left(\frac{emk}{d}\right)^{d}.$$

This implies that

$$\Big|\big\{\big\langle g(x_1,1),\ldots,g(x_m,k)\big\rangle : g \in \mathcal{C}_N\big\}\Big| \le \left(\frac{emk}{d}\right)^{dN},$$

and hence

$$\Big|\big\{\big\langle \mathrm{margin}(g,x_1,y_1),\ldots,\mathrm{margin}(g,x_m,y_m)\big\rangle : g \in \mathcal{C}_N\big\}\Big| \le \left(\frac{emk}{d}\right)^{dN}.$$

Proceeding as before, we obtain the bound

$$P_{\mathcal{D}}\big[\mathrm{margin}(f,x,y) \le 0\big] \le P_{S}\big[\mathrm{margin}(f,x,y) \le \theta\big] + (k+1)\,e^{-N\theta^2/8} + \sqrt{\frac{1}{2m}\left(dN\ln\frac{emk}{d} + \ln\frac{N(N+1)^2}{\delta_N}\right)}.$$

Setting $N$ as above completes the proof.

B Brief Descriptions of Datasets

In this appendix, we briefly describe the datasets used in our experiments.

                # examples
  name        train     test    # classes   # features
  vehicle       423      423        4           18
  satimage     4435     2000        6           36
  letter      16000     4000       26           16

Table 2: The three benchmark machine-learning problems used in the experiments.

B.1 Non-synthetic datasets

In Section 4, we conducted experiments on three non-synthetic datasets called letter, satimage and vehicle. All three are available from the repository at the University of California at Irvine [30]. Some of the basic characteristics of these datasets are given in Table 2. The letter and satimage datasets came with their own test sets. For the vehicle dataset, we randomly selected half of the data to be held out as a test set. All features are continuous (real-valued). None of these datasets have missing values.

The letter benchmark is a letter image recognition task. The dataset was created by David J. Slate. According to the documentation provided with this dataset, "The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15."

The satimage dataset is the statlog version of a satellite image dataset. According to the documentation, "This database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and the classification associated with the central pixel in each neighborhood. The aim is to predict this classification, given the multi-spectral values... The original database was generated from Landsat Multi-Spectral Scanner image data... purchased from NASA by the Australian Center for Remote Sensing, and used for research at The Center for Remote Sensing... The sample database was generated taking a small section (82 rows and 100 columns) from the original data. The binary values were converted to their present ASCII form by Ashwin Srinivasan. The classification for each pixel was performed on the basis of an actual site visit by Ms. Karen Hall, when working for Professor John A. Richards.... Conversion to 3x3 neighborhoods and splitting into test and training sets was done by Alistair Sutherland...."

The purpose of the vehicle dataset, according to its documentation, is "to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles... This dataset comes from the Turing Institute... The [extracted] features were a combination of scale independent features utilizing both classical moments based measures such as scaled variance, skewness and kurtosis about the major/minor axes and heuristic measures such as hollows, circularity, rectangularity and compactness. Four Corgie model vehicles were used for the experiment: a double decker bus, Chevrolet van, Saab 9000 and an Opel Manta 400.... The images were acquired by a camera looking downwards at the model vehicle from a fixed angle of elevation...."



Figure 7: The 2-dimension, 6-class classification problem defined by Kong and Dietterich [26] on the region $[0,15] \times [0,15]$.

B.2 Synthetic datasets

In Section 5, we described experiments using synthetically generated data. Twonorm, threenorm and ringnorm were taken from Breiman [8]. Quoting from him:

Twonorm: This is 20-dimension, 2-class data. Each class is drawn from a multivariate normal distribution with unit covariance matrix. Class #1 has mean $(a, a, \ldots, a)$ and class #2 has mean $(-a, -a, \ldots, -a)$, where $a = 2/\sqrt{20}$.

Threenorm: This is 20-dimension, 2-class data. Class #1 is drawn with equal probability from a unit multivariate normal with mean $(a, a, \ldots, a)$ and from a unit multivariate normal with mean $(-a, -a, \ldots, -a)$. Class #2 is drawn from a unit multivariate normal with mean $(a, -a, a, -a, \ldots, -a)$, where $a = 2/\sqrt{20}$.

Ringnorm: This is 20-dimension, 2-class data. Class #1 is multivariate normal with mean zero and covariance matrix 4 times the identity. Class #2 has unit covariance matrix and mean $(a, a, \ldots, a)$, where $a = 1/\sqrt{20}$.

The waveform data is 21-dimension, 3-class data. It is described by Breiman et al. [9]. A program for generating this data is available from the UCI repository [30]. The last dataset was taken from Kong and Dietterich [26]. This is a 2-dimension, 6-class classification problem where the classes are defined by the regions of $[0,15] \times [0,15]$ shown in Figure 7.
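For reference, the three distributions quoted above can be sampled in a few lines. The sketch below is our own implementation of the quoted definitions (the sample sizes and random seed are arbitrary).

```python
# Our own sampler for the three synthetic distributions quoted above (Breiman [8]).
import numpy as np

d = 20
rng = np.random.default_rng(3)

def twonorm(n):
    a = 2.0 / np.sqrt(d)
    y = rng.choice([-1, +1], size=n)
    X = rng.normal(size=(n, d)) + np.where(y[:, None] == 1, a, -a)
    return X, y

def threenorm(n):
    a = 2.0 / np.sqrt(d)
    y = rng.choice([-1, +1], size=n)
    mean1 = np.where(rng.random(n)[:, None] < 0.5, a, -a)        # class 1: mixture of (a,...,a) and (-a,...,-a)
    mean2 = a * np.tile([1.0, -1.0], d // 2)                     # class 2: mean (a,-a,a,-a,...)
    means = np.where(y[:, None] == 1, mean1, mean2)
    return rng.normal(size=(n, d)) + means, y

def ringnorm(n):
    a = 1.0 / np.sqrt(d)
    y = rng.choice([-1, +1], size=n)
    X = np.where(y[:, None] == 1,
                 rng.normal(scale=2.0, size=(n, d)),             # class 1: N(0, 4I)
                 rng.normal(loc=a, size=(n, d)))                 # class 2: N((a,...,a), I)
    return X, y

X, y = twonorm(300)
print(X.shape, np.bincount((y + 1) // 2))
```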

C Two Definitions of Bias and Variance for Classification

For the sake of completeness, we include here the definitions for bias and variance for classification tasks which we have used in our experiments. The first set of definitions is due to Kong and Dietterich [26] and the second one is due to Breiman [8]. Assume that we have an infinite supply of independent training sets $S$ of size $m$. Each sample $S$ is drawn i.i.d. from a fixed distribution $\mathcal{D}$ over $\mathcal{X} \times \{1, \ldots, k\}$, where $k$ is the number of classes. Denote by $C_S$ the classifier that is generated by the base learning algorithm given the sample $S$. Denote by $C^*$ the classification rule that results from running the base learning algorithm on an infinite number of independent training sets and taking the plurality vote$^6$ over the resulting classifiers. Finally, denote by $C_B$ the Bayes optimal prediction rule for the distribution $\mathcal{D}$. The prediction of a classifier $C$ on an instance $x$ is denoted $C(x)$, and the expected error of a classifier $C$ is denoted

$$\mathrm{PE}(C) = P_{(x,y)\sim\mathcal{D}}\big[C(x) \ne y\big].$$

The definitions of Kong and Dietterich are:

$$\mathrm{Bias} = \mathrm{PE}(C^*) - \mathrm{PE}(C_B), \qquad \mathrm{Variance} = E_S\big[\mathrm{PE}(C_S)\big] - \mathrm{PE}(C^*).$$

Breiman defines a partition of the sample space into two sets. The unbiased set $U$ consists of all $x$ for which $C^*(x) = C_B(x)$, and the biased set $B$ is its complement. Given these sets, the definitions of bias and variance are:

$$\mathrm{Bias} = P_{S,\,(x,y)\sim\mathcal{D}}\big[C_S(x) \ne y,\; x \in B\big] - P_{(x,y)\sim\mathcal{D}}\big[C_B(x) \ne y,\; x \in B\big],$$

$$\mathrm{Variance} = P_{S,\,(x,y)\sim\mathcal{D}}\big[C_S(x) \ne y,\; x \in U\big] - P_{(x,y)\sim\mathcal{D}}\big[C_B(x) \ne y,\; x \in U\big].$$

$^6$The plurality vote outputs the class which receives the largest number of votes, breaking ties uniformly at random. When $k = 2$ the plurality vote is equal to the majority vote.
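Both decompositions are easy to estimate by simulation when the data-generating distribution is known. The sketch below is our own illustration on a made-up one-dimensional two-class problem with a decision-stump base learner; $C^*$ is approximated by voting over finitely many independent training sets.

```python
# Our own Monte Carlo sketch of the two bias/variance decompositions defined above,
# on a made-up one-dimensional two-class problem with a decision-stump base learner.
import numpy as np

rng = np.random.default_rng(4)
m, n_train_sets, n_test = 50, 200, 2000

def draw_sample(n):
    x = rng.random(n)
    p1 = np.where(x > 0.5, 0.8, 0.2)          # P(y = 1 | x)
    y = (rng.random(n) < p1).astype(int)
    return x, y

def train_stump(x, y):
    # pick the threshold/orientation with smallest training error
    best = None
    for t in np.linspace(0, 1, 21):
        for sign in (+1, -1):
            pred = (x > t).astype(int) if sign > 0 else (x <= t).astype(int)
            err = np.mean(pred != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: (x > t).astype(int) if sign > 0 else (x <= t).astype(int)

x_test, y_test = draw_sample(n_test)
bayes = (x_test > 0.5).astype(int)                               # C_B
preds = np.array([train_stump(*draw_sample(m))(x_test) for _ in range(n_train_sets)])
aggregate = (preds.mean(axis=0) > 0.5).astype(int)               # approximates C*, the plurality vote

pe_bayes = np.mean(bayes != y_test)
pe_agg = np.mean(aggregate != y_test)
pe_single = np.mean(preds != y_test[None, :])                    # estimates E_S[PE(C_S)]

# Kong & Dietterich
kd_bias = pe_agg - pe_bayes
kd_variance = pe_single - pe_agg

# Breiman: split the domain into the unbiased set U (C* agrees with C_B) and its complement B
unbiased = aggregate == bayes
br_bias = np.mean((preds != y_test[None, :]) & ~unbiased) - np.mean((bayes != y_test) & ~unbiased)
br_variance = np.mean((preds != y_test[None, :]) & unbiased) - np.mean((bayes != y_test) & unbiased)

print(kd_bias, kd_variance, br_bias, br_variance)
```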

