
Statistics and Its Interface Volume 2 (2009) 349–360

Multi-class AdaBoost∗
Ji Zhu†‡, Hui Zou§, Saharon Rosset and Trevor Hastie¶

∗ We thank the AE and a referee for their helpful comments and suggestions which greatly improved our paper.
† Corresponding author.
‡ Zhu was partially supported by NSF grant DMS-0705532.
§ Zou was partially supported by NSF grant DMS-0706733.
¶ Hastie was partially supported by NSF grant DMS-0204162.

Abstract. Boosting has been a very successful technique for solving the two-class classification problem. In going from two-class to multi-class classification, most algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems. In this paper, we develop a new algorithm that directly extends the AdaBoost algorithm to the multi-class case without reducing it to multiple two-class problems. We show that the proposed multi-class AdaBoost algorithm is equivalent to a forward stagewise additive modeling algorithm that minimizes a novel exponential loss for multi-class classification. Furthermore, we show that the exponential loss is a member of a class of Fisher-consistent loss functions for multi-class classification. As shown in the paper, the new algorithm is extremely easy to implement and is highly competitive in terms of misclassification error rate.

AMS 2000 subject classifications: Primary 62H30.

Keywords and phrases: boosting, exponential loss, multi-class classification, stagewise modeling.

1. INTRODUCTION

Boosting has been a very successful technique for solving the two-class classification problem. It was first introduced by [8], with their AdaBoost algorithm. In going from two-class to multi-class classification, most boosting algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems, e.g. [8], [19], and [21]. The ways to extend AdaBoost from two-class to multi-class depend on the interpretation or view of the success of AdaBoost in binary classification, which still remains controversial. Much theoretical work on AdaBoost has been based on the margin analysis, for example, see [20] and [13]. Another view on boosting, which is popular in the statistical community, regards AdaBoost as a functional gradient descent algorithm [6, 11, 17]. In [11], AdaBoost has been shown to be equivalent to a forward stagewise additive modeling algorithm that minimizes the exponential loss. [11] suggested that the success of AdaBoost can be understood by the fact that the population minimizer of the exponential loss is one-half of the log-odds. Based on this statistical explanation, [11] derived a multi-class logit-boost algorithm.

The multi-class boosting algorithm by [11] looks very different from AdaBoost, hence it is not clear if the statistical view of AdaBoost still works in the multi-class case. To resolve this issue, we think it is desirable to derive an AdaBoost-like multi-class boosting algorithm by using the exact same statistical explanation of AdaBoost. In this paper, we develop a new algorithm that directly extends the AdaBoost algorithm to the multi-class case without reducing it to multiple two-class problems. Surprisingly, the new algorithm is almost identical to AdaBoost but with a simple yet critical modification, and, similar to AdaBoost in the two-class case, this new algorithm combines weak classifiers and only requires the performance of each weak classifier be better than random guessing. We show that the proposed multi-class AdaBoost algorithm is equivalent to a forward stagewise additive modeling algorithm that minimizes a novel exponential loss for multi-class classification. Furthermore, we show that the exponential loss is a member of a class of Fisher-consistent loss functions for multi-class classification. Combined with forward stagewise additive modeling, these loss functions can be used to derive various multi-class boosting algorithms. We believe this paper complements [11].

1.1 AdaBoost

Before delving into the new algorithm for multi-class boosting, we briefly review the multi-class classification problem and the AdaBoost algorithm [8]. Suppose we are given a set of training data (x_1, c_1), ..., (x_n, c_n), where the input (prediction variable) x_i ∈ R^p, and the output (response variable) c_i is qualitative and assumes values in a finite set, e.g. {1, 2, ..., K}; K is the number of classes. Usually it is assumed that the training data are independently and identically distributed samples from an unknown probability distribution Prob(X, C). The goal is to find a classification rule C(x) from the training data, so that when given a new input x, we can assign it a class label c from {1, ..., K}. Under the 0/1 loss, the misclassification error rate of a classifier C(x) is given by

    1 − ∑_{k=1}^{K} E_X [ I_{C(X)=k} Prob(C = k | X) ].

It is clear that

    C*(x) = arg max_k Prob(C = k | X = x)

will minimize this quantity, with the misclassification error rate equal to 1 − E_X max_k Prob(C = k | X).
This classifier is known as the Bayes classifier, and its error rate is the Bayes error rate.

The AdaBoost algorithm is an iterative procedure that tries to approximate the Bayes classifier C*(x) by combining many weak classifiers. Starting with the unweighted training sample, AdaBoost builds a classifier, for example a classification tree [5], that produces class labels. If a training data point is misclassified, the weight of that training data point is increased (boosted). A second classifier is built using the new weights, which are no longer equal. Again, misclassified training data have their weights boosted and the procedure is repeated. Typically, one may build 500 or 1000 classifiers this way. A score is assigned to each classifier, and the final classifier is defined as the linear combination of the classifiers from each stage. Specifically, let T(x) denote a weak multi-class classifier that assigns a class label to x; then the AdaBoost algorithm proceeds as follows:

Algorithm 1. AdaBoost [8]

1. Initialize the observation weights w_i = 1/n, i = 1, 2, ..., n.
2. For m = 1 to M:
   (a) Fit a classifier T^(m)(x) to the training data using weights w_i.
   (b) Compute
       err^(m) = ∑_{i=1}^{n} w_i I(c_i ≠ T^(m)(x_i)) / ∑_{i=1}^{n} w_i.
   (c) Compute
       α^(m) = log[(1 − err^(m)) / err^(m)].
   (d) Set
       w_i ← w_i · exp(α^(m) · I(c_i ≠ T^(m)(x_i))), for i = 1, 2, ..., n.
   (e) Re-normalize w_i.
3. Output
       C(x) = arg max_k ∑_{m=1}^{M} α^(m) · I(T^(m)(x) = k).
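To make the weight-update mechanics of Algorithm 1 concrete, here is a minimal sketch of one boosting iteration (steps 2(b)–2(e)) in Python with NumPy; the function and array names (adaboost_step, w, c, pred) are ours and not from the paper, and the snippet is an illustration rather than a reference implementation.

```python
import numpy as np

def adaboost_step(w, c, pred):
    """One AdaBoost iteration: given current weights w, true labels c and
    weak-classifier predictions pred (all length-n arrays), return the
    classifier weight alpha and the updated, re-normalized weights."""
    miss = (pred != c).astype(float)          # I(c_i != T(x_i))
    err = np.sum(w * miss) / np.sum(w)        # step 2(b)
    alpha = np.log((1.0 - err) / err)         # step 2(c)
    w = w * np.exp(alpha * miss)              # step 2(d)
    return alpha, w / np.sum(w)               # step 2(e)
```

Step 3 then accumulates the alpha-weighted votes of the fitted weak classifiers for each class label.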
When applied to two-class classification problems, AdaBoost has been proved to be extremely successful in producing accurate classifiers. In fact, [1] called AdaBoost with trees the "best off-the-shelf classifier in the world." However, it is not the case for multi-class problems, although AdaBoost was also proposed to be used in the multi-class case [8]. Note that the theory of [8] assumes that the error of each weak classifier err^(m) is less than 1/2 (or equivalently α^(m) > 0), with respect to the distribution on which it was trained. This assumption is easily satisfied for two-class classification problems, because the error rate of random guessing is 1/2. However, it is much harder to achieve in the multi-class case, where the random guessing error rate is (K − 1)/K. As pointed out by the inventors of AdaBoost, the main disadvantage of AdaBoost is that it is unable to handle weak learners with an error rate greater than 1/2. As a result, AdaBoost may easily fail in the multi-class case.

To illustrate this point, we consider a simple three-class simulation example. Each input x ∈ R^10, and the ten input variables for all training examples are randomly drawn from a ten-dimensional standard normal distribution. The three classes are defined as

    c = 1,  if 0 ≤ ∑_j x_j² < χ²_{10,1/3},
    c = 2,  if χ²_{10,1/3} ≤ ∑_j x_j² < χ²_{10,2/3},
    c = 3,  if χ²_{10,2/3} ≤ ∑_j x_j²,

where χ²_{10,k/3} is the (k/3)·100% quantile of the χ²_{10} distribution, so as to put approximately equal numbers of observations in each class. In short, the decision boundaries separating successive classes are nested concentric ten-dimensional spheres. The training sample size is 3000, with approximately 1000 training observations in each class. An independently drawn test set of 10000 observations is used to estimate the error rate.
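A small data-generating sketch for this example, assuming NumPy and SciPy are available (the function and variable names are ours, not the paper's):

```python
import numpy as np
from scipy.stats import chi2

def nested_spheres(n, rng=np.random.default_rng(0)):
    """Three-class example of Section 1.1: x ~ N(0, I_10), and the class is
    determined by which chi-square(10) tercile the squared radius falls in."""
    x = rng.standard_normal((n, 10))
    r2 = np.sum(x**2, axis=1)             # squared radius, distributed chi2(10)
    cuts = chi2.ppf([1/3, 2/3], df=10)    # tercile boundaries chi2_{10,1/3}, chi2_{10,2/3}
    c = 1 + np.searchsorted(cuts, r2)     # labels in {1, 2, 3}
    return x, c

x_train, c_train = nested_spheres(3000)
x_test, c_test = nested_spheres(10000)
```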
Figure 1 (upper row) shows how AdaBoost breaks using ten-terminal node trees as weak classifiers. As we can see (upper left panel), the test error of AdaBoost actually starts to increase after a few iterations, then levels off around 0.53. What has happened can be understood from the upper middle and upper right panels: err^(m) starts below 0.5; after a few iterations, it overshoots 0.5 (α^(m) < 0), then quickly hinges onto 0.5. Once err^(m) is equal to 0.5, the weights of the training samples do not get updated (α^(m) = 0), hence the same weak classifier is fitted over and over again but is not added to the existing fit, and the test error rate stays the same.

Figure 1. Comparison of AdaBoost and the new algorithm SAMME on a simple three-class simulation example. The training sample size is 3000, and the testing sample size is 10000. Ten-terminal node trees are used as weak classifiers. The upper row is for AdaBoost and the lower row is for SAMME.

This illustrative example may help explain why AdaBoost is never used for multi-class problems. Instead, for multi-class classification problems, [21] proposed the AdaBoost.MH algorithm, which combines AdaBoost and the one-versus-all strategy. There are also several other multi-class extensions of the boosting idea, for example, the ECOC in [19] and the logit-boost in [11].

1.2 Multi-class AdaBoost

We introduce a new multi-class generalization of AdaBoost for multi-class classification. We refer to our algorithm as SAMME — Stagewise Additive Modeling using a Multi-class Exponential loss function — this choice of name will be clear in Section 2. Given the same setup as that of AdaBoost, SAMME proceeds as follows:

Algorithm 2. SAMME

1. Initialize the observation weights w_i = 1/n, i = 1, 2, ..., n.
2. For m = 1 to M:
   (a) Fit a classifier T^(m)(x) to the training data using weights w_i.
   (b) Compute
       err^(m) = ∑_{i=1}^{n} w_i I(c_i ≠ T^(m)(x_i)) / ∑_{i=1}^{n} w_i.
   (c) Compute
       (1)   α^(m) = log[(1 − err^(m)) / err^(m)] + log(K − 1).
   (d) Set
       w_i ← w_i · exp(α^(m) · I(c_i ≠ T^(m)(x_i))), for i = 1, ..., n.
   (e) Re-normalize w_i.
3. Output
       C(x) = arg max_k ∑_{m=1}^{M} α^(m) · I(T^(m)(x) = k).
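The following is a minimal end-to-end sketch of Algorithm 2 in Python, using scikit-learn decision trees as the weak classifiers. It is our own illustration under the stated setup (class labels assumed to be 1, ..., K), not the authors' reference implementation, which the paper says is in R. Setting K = 2 makes log(K − 1) = 0 and recovers Algorithm 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, c, K, M=600, max_leaf_nodes=10):
    """Algorithm 2 (SAMME): returns the weak classifiers and their alphas."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                              # step 1
    trees, alphas = [], []
    for m in range(M):                                   # step 2
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, c, sample_weight=w)                  # (a) weighted weak learner
        miss = (tree.predict(X) != c).astype(float)
        err = np.sum(w * miss) / np.sum(w)               # (b)
        err = float(np.clip(err, 1e-12, 1 - 1e-12))      # numerical guard (ours)
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # (c), eq. (1)
        if alpha <= 0:                                   # no better than random guessing
            break
        w = w * np.exp(alpha * miss)                     # (d)
        w = w / np.sum(w)                                # (e)
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas

def samme_predict(trees, alphas, X, K):
    """Step 3: alpha-weighted vote over the K classes (labels 1..K assumed)."""
    votes = np.zeros((X.shape[0], K))
    for tree, alpha in zip(trees, alphas):
        votes[np.arange(X.shape[0]), tree.predict(X) - 1] += alpha
    return 1 + np.argmax(votes, axis=1)
```

Run on the nested-spheres data sketched earlier (K = 3, ten-terminal-node trees), this corresponds to the SAMME runs summarized in the lower row of Figure 1.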
Note that Algorithm 2 (SAMME) shares the same simple modular structure of AdaBoost, with a simple but subtle difference in (1), specifically the extra term log(K − 1). Obviously, when K = 2, SAMME reduces to AdaBoost. However, the term log(K − 1) in (1) is critical in the multi-class case (K > 2). One immediate consequence is that now, in order for α^(m) to be positive, we only need (1 − err^(m)) > 1/K, or the accuracy of each weak classifier to be better than random guessing rather than 1/2. To appreciate its effect, we apply SAMME to the illustrative example in Section 1.1. As can be seen from Fig. 1, the test error of SAMME quickly decreases to a low value and keeps decreasing even after 600 iterations, which is exactly what we could expect from a successful boosting algorithm. In Section 2, we shall show that the term log(K − 1) is not artificial; it follows naturally from the multi-class generalization of the exponential loss in the binary case.

The rest of the paper is organized as follows: In Section 2, we give theoretical justification for our new algorithm SAMME. In Section 3, we present numerical results on both simulation and real-world data. Summary and discussion regarding the implications of the new algorithm are in Section 4.

2. STATISTICAL JUSTIFICATION

In this section, we are going to show that the extra term log(K − 1) in (1) is not artificial; it makes Algorithm 2 equivalent to fitting a forward stagewise additive model using a multi-class exponential loss function. Our arguments are in line with [11], who developed a statistical perspective on the original two-class AdaBoost algorithm, viewing the two-class AdaBoost algorithm as forward stagewise additive modeling using the exponential loss function

    L(y, f) = e^{−y f},

where y = I(c = 1) − I(c = 2) ∈ {−1, 1} in a two-class classification setting. A key argument is to show that the population minimizer of this exponential loss function is one half of the logit transform

    f*(x) = arg min_{f(x)} E_{Y|X=x} L(y, f(x)) = (1/2) log [ Prob(c = 1|x) / Prob(c = 2|x) ].

Therefore, the Bayes optimal classification rule agrees with the sign of f*(x). [11] recast AdaBoost as a functional gradient descent algorithm to approximate f*(x). We note that besides [11], [2] and [21] also made connections between the original two-class AdaBoost algorithm and the exponential loss function. We acknowledge that these views have been influential in our thinking for this paper.

2.1 SAMME as forward stagewise additive modeling

We now show that Algorithm 2 is equivalent to forward stagewise additive modeling using a multi-class exponential loss function.

We start with forward stagewise additive modeling using a general loss function L(·, ·), then apply it to the multi-class exponential loss function. In the multi-class classification setting, we can recode the output c with a K-dimensional vector y, with all entries equal to −1/(K − 1) except a 1 in position k if c = k, i.e. y = (y_1, ..., y_K)^T with

    (2)   y_k = 1 if c = k,  and  y_k = −1/(K − 1) if c ≠ k.

[14] and [16] used the same coding for the multi-class support vector machine.
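For concreteness, the recoding (2) can be written as a one-line helper (a Python sketch; the function name is ours):

```python
import numpy as np

def recode(c, K):
    """Map a class label c in {1, ..., K} to the K-vector y of (2):
    a 1 in position c and -1/(K-1) everywhere else (entries sum to zero)."""
    y = np.full(K, -1.0 / (K - 1))
    y[c - 1] = 1.0
    return y
```

For example, with K = 3 and c = 2 this gives y = (−1/2, 1, −1/2)^T.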
Given the training data, we wish to find f(x) = (f_1(x), ..., f_K(x))^T such that

    (3)   min_{f(x)} ∑_{i=1}^{n} L(y_i, f(x_i))
    (4)   subject to f_1(x) + ··· + f_K(x) = 0.

We consider f(x) that has the following form:

    f(x) = ∑_{m=1}^{M} β^(m) g^(m)(x),

where β^(m) ∈ R are coefficients, and g^(m)(x) are basis functions. We require g(x) to satisfy the symmetric constraint

    g_1(x) + ··· + g_K(x) = 0.

For example, the g(x) that we consider in this paper takes value in one of the K possible K-dimensional vectors in (2); specifically, at a given x, g(x) maps x onto Y,

    g : x ∈ R^p → Y,

where Y is the set containing the K K-dimensional vectors

    (5)   Y = { (1, −1/(K−1), ..., −1/(K−1))^T,
                (−1/(K−1), 1, ..., −1/(K−1))^T,
                ...,
                (−1/(K−1), ..., −1/(K−1), 1)^T }.

Forward stagewise modeling approximates the solution to (3)–(4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. Specifically, the algorithm starts with f^(0)(x) = 0, sequentially selecting new basis functions from a dictionary and adding them to the current fit:

Algorithm 3. Forward stagewise additive modeling

1. Initialize f^(0)(x) = 0.
2. For m = 1 to M:
   (a) Compute
       (β^(m), g^(m)(x)) = arg min_{β,g} ∑_{i=1}^{n} L(y_i, f^(m−1)(x_i) + β g(x_i)).
   (b) Set
       f^(m)(x) = f^(m−1)(x) + β^(m) g^(m)(x).

Now, we consider using the multi-class exponential loss function

    L(y, f) = exp( −(1/K)(y_1 f_1 + ··· + y_K f_K) ) = exp( −(1/K) y^T f )

in the above forward stagewise modeling algorithm. The choice of the loss function will be clear in Section 2.2 and Section 2.3. Then in step (2a), we need to find g^(m)(x) (and β^(m)) to solve:

    (6)   (β^(m), g^(m)) = arg min_{β,g} ∑_{i=1}^{n} exp( −(1/K) y_i^T (f^(m−1)(x_i) + β g(x_i)) )
    (7)                  = arg min_{β,g} ∑_{i=1}^{n} w_i exp( −(1/K) β y_i^T g(x_i) ),

where w_i = exp( −(1/K) y_i^T f^(m−1)(x_i) ) are the un-normalized observation weights.

Notice that every g(x) as in (5) has a one-to-one correspondence with a multi-class classifier T(x) in the following way:

    (8)   T(x) = k, if g_k(x) = 1,

and vice versa:

    (9)   g_k(x) = 1 if T(x) = k,  and  g_k(x) = −1/(K − 1) if T(x) ≠ k.

Hence, solving for g^(m)(x) in (7) is equivalent to finding the multi-class classifier T^(m)(x) that can generate g^(m)(x).

Lemma 1. The solution to (7) is

    (10)  T^(m)(x) = arg min_T ∑_{i=1}^{n} w_i I(c_i ≠ T(x_i)),
    (11)  β^(m) = [(K − 1)²/K] ( log[(1 − err^(m)) / err^(m)] + log(K − 1) ),

where err^(m) is defined as

    err^(m) = ∑_{i=1}^{n} w_i I(c_i ≠ T^(m)(x_i)) / ∑_{i=1}^{n} w_i.

Based on Lemma 1, the model is then updated

    f^(m)(x) = f^(m−1)(x) + β^(m) g^(m)(x),

and the weights for the next iteration will be

    w_i ← w_i · exp( −(1/K) β^(m) y_i^T g^(m)(x_i) ).

This is equal to

    (12)  w_i · exp( −[(K − 1)²/K²] α^(m) y_i^T g^(m)(x_i) )
          = w_i · e^{−[(K−1)/K] α^(m)},   if c_i = T^(m)(x_i),
          = w_i · e^{(1/K) α^(m)},        if c_i ≠ T^(m)(x_i),

where α^(m) is defined as in (1) with the extra term log(K − 1), and the new weight (12) is equivalent to the weight updating scheme in Algorithm 2 (2d) after normalization.

It is also a simple task to check that arg max_k (f_1^(m)(x), ..., f_K^(m)(x))^T is equivalent to the output C(x) = arg max_k ∑_{m=1}^{M} α^(m) · I(T^(m)(x) = k) in Algorithm 2. Hence, Algorithm 2 can be considered as forward stagewise additive modeling using the multi-class exponential loss function.
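A quick numeric sanity check of the two cases in (12), using the recode helper sketched above (our own illustration): with the coding (2) and (9), y^T g equals K/(K − 1) when the weak classifier is correct and −K/(K − 1)² when it is wrong, which is what turns the stagewise update into the re-weighting of Algorithm 2.

```python
K = 5
y = recode(2, K)          # true class c = 2
g_right = recode(2, K)    # classifier predicts class 2 (correct)
g_wrong = recode(4, K)    # classifier predicts class 4 (wrong)
assert np.isclose(y @ g_right, K / (K - 1))
assert np.isclose(y @ g_wrong, -K / (K - 1) ** 2)
```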
2.2 The multi-class exponential loss

We now justify the use of the multi-class exponential loss (6). Firstly, we note that when K = 2, the sum-to-zero constraint indicates f = (f_1, −f_1), and then the multi-class exponential loss reduces to the exponential loss used in binary classification. [11] justified the exponential loss by showing that its population minimizer is equivalent to the Bayes rule. We follow the same arguments to investigate what is the population minimizer of this multi-class exponential loss function. Specifically, we are interested in

    (13)  arg min_{f(x)} E_{Y|X=x} exp( −(1/K)(Y_1 f_1(x) + ··· + Y_K f_K(x)) )

subject to f_1(x) + ··· + f_K(x) = 0. The Lagrangian of this constrained optimization problem can be written as

    exp( −f_1(x)/(K − 1) ) Prob(c = 1|x) + ··· + exp( −f_K(x)/(K − 1) ) Prob(c = K|x) − λ (f_1(x) + ··· + f_K(x)),

where λ is the Lagrange multiplier. Taking derivatives with respect to f_k and λ, we reach

    −[1/(K − 1)] exp( −f_1(x)/(K − 1) ) Prob(c = 1|x) − λ = 0,
    ...
    −[1/(K − 1)] exp( −f_K(x)/(K − 1) ) Prob(c = K|x) − λ = 0,
    f_1(x) + ··· + f_K(x) = 0.

Solving this set of equations, we obtain the population minimizer

    (14)  f_k*(x) = (K − 1) log Prob(c = k|x) − [(K − 1)/K] ∑_{k'=1}^{K} log Prob(c = k'|x),

for k = 1, ..., K. Thus,

    arg max_k f_k*(x) = arg max_k Prob(c = k|x),

which is the multi-class Bayes optimal classification rule. This result justifies the use of this multi-class exponential loss function. Equation (14) also provides a way to recover the class probability Prob(c = k|x) once the f_k*(x)'s are estimated, i.e.

    (15)  Prob(C = k|x) = e^{f_k*(x)/(K−1)} / ( e^{f_1*(x)/(K−1)} + ··· + e^{f_K*(x)/(K−1)} ),

for k = 1, ..., K.

2.3 Fisher-consistent multi-class loss functions

We have shown that the population minimizer of the new multi-class exponential loss is equivalent to the multi-class Bayes rule. This property is shared by many other multi-class loss functions. Let us use the same notation as in Section 2.1, and consider a general multi-class loss function

    (16)  L(y, f) = φ( (1/K)(y_1 f_1 + ··· + y_K f_K) ) = φ( (1/K) y^T f ),

where φ(·) is a non-negative valued function. The multi-class exponential loss uses φ(t) = e^{−t}, for which (16) coincides with (6). We can use the general multi-class loss function in Algorithm 3 to minimize the empirical loss

    (17)  (1/n) ∑_{i=1}^{n} φ( (1/K) y_i^T f(x_i) ).

However, to derive a sensible algorithm, we need to require the φ(·) function be Fisher-consistent. Specifically, we say φ(·) is Fisher-consistent for K-class classification if, for ∀x in a set of full measure, the following optimization problem

    (18)  arg min_{f(x)} E_{Y|X=x} φ( (1/K)(Y_1 f_1(x) + ··· + Y_K f_K(x)) ),
          subject to f_1(x) + ··· + f_K(x) = 0,

has a unique solution f̂, and

    (19)  arg max_k f̂_k(x) = arg max_k Prob(C = k|x).

We use the sum-to-zero constraint to ensure the existence and uniqueness of the solution to (18).

Note that as n → ∞, the empirical loss in (17) becomes

    (20)  E_X E_{C|X=x} φ( (1/K)(Y_1 f_1(x) + ··· + Y_K f_K(x)) ).

Therefore, the multi-class Fisher-consistent condition basically says that with infinite samples, one can exactly recover the multi-class Bayes rule by minimizing the multi-class loss using φ(·). Thus our definition of Fisher-consistent losses is a multi-class generalization of the binary Fisher-consistent loss function discussed in [15].

In the following theorem, we show that there is a class of convex functions that are Fisher-consistent for K-class classification, for all K ≥ 2.

Theorem 1. Let φ(t) be a non-negative twice differentiable function. If φ′(0) < 0 and φ″(t) > 0 for ∀t, then φ is Fisher-consistent for K-class classification for ∀K ≥ 2. Moreover, let f̂ be the solution of (18); then we have

    (21)  Prob(C = k|x) = [ 1/φ′( f̂_k(x)/(K−1) ) ] / ∑_{k'=1}^{K} [ 1/φ′( f̂_{k'}(x)/(K−1) ) ],

for k = 1, ..., K.

Theorem 1 immediately concludes that the three most popular smooth loss functions, namely the exponential, logit and L2 loss functions, are Fisher-consistent for all multi-class classification problems regardless of the number of classes. The inversion formula (21) allows one to easily construct estimates for the conditional class probabilities. Table 1 shows the explicit inversion formulae for computing the conditional class probabilities using the exponential, logit and L2 losses.

Table 1. The probability inversion formula

    Loss          φ(t)               Prob(C = k|x)
    exponential   e^{−t}             e^{f̂_k(x)/(K−1)} / ∑_{k'=1}^{K} e^{f̂_{k'}(x)/(K−1)}
    logit         log(1 + e^{−t})    (1 + e^{f̂_k(x)/(K−1)}) / ∑_{k'=1}^{K} (1 + e^{f̂_{k'}(x)/(K−1)})
    L2            (1 − t)²           (1 − f̂_k(x)/(K−1))^{−1} / ∑_{k'=1}^{K} (1 − f̂_{k'}(x)/(K−1))^{−1}
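The three formulae in Table 1 are instances of (21): invert φ′ at f̂_k(x)/(K − 1) and normalize. The sketch below is ours (the numbers are made up for illustration); it implements the table and, for the exponential loss, checks that it recovers the class probabilities from the population minimizer (14)–(15).

```python
import numpy as np

def invert_probs(f_hat, K, loss="exponential"):
    """Conditional class probabilities from Table 1, given a fitted f_hat."""
    u = f_hat / (K - 1)
    if loss == "exponential":      # phi(t) = exp(-t)
        raw = np.exp(u)
    elif loss == "logit":          # phi(t) = log(1 + exp(-t))
        raw = 1.0 + np.exp(u)
    elif loss == "l2":             # phi(t) = (1 - t)^2
        raw = 1.0 / (1.0 - u)
    else:
        raise ValueError(loss)
    return raw / raw.sum()

# Exponential-loss check against (14)-(15): starting from hypothetical class
# probabilities, the population minimizer f* maps back to the same p.
p = np.array([0.5, 0.3, 0.2])
K = len(p)
f_star = (K - 1) * (np.log(p) - np.mean(np.log(p)))   # eq. (14), sums to zero
assert np.isclose(f_star.sum(), 0.0)
assert np.allclose(invert_probs(f_star, K, "exponential"), p)
```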

With these multi-class Fisher-consistent losses on hand, we can use the forward stagewise modeling strategy to derive various multi-class boosting algorithms by minimizing the empirical multi-class loss. The biggest advantage of the exponential loss is that it gives us a simple re-weighting formula. Other multi-class loss functions may not lead to such a simple closed-form re-weighting scheme. One could handle this computation issue by employing the computational trick used in [10] and [6]. For example, [24] derived a multi-class boosting algorithm using the logit loss. A multi-class version of the L2 boosting can be derived following the lines in [6]. We do not explore these directions in the current paper. To fix ideas, we shall focus on the multi-class AdaBoost algorithm.

3. NUMERICAL RESULTS

In this section, we use both simulation data and real-world data to demonstrate our multi-class AdaBoost algorithm. For comparison, a single decision tree (CART; [5]) and AdaBoost.MH [21] are also fit. We have chosen to compare with the AdaBoost.MH algorithm because it is conceptually easy to understand and it seems to have dominated other proposals in empirical studies [21]. Indeed, [22] also argue that with large samples, AdaBoost.MH has the optimal classification performance. The AdaBoost.MH algorithm converts the K-class problem into that of estimating a two-class classifier on a training set K times as large, with an additional feature defined by the set of class labels. It is essentially the same as the one vs. rest scheme [11].

We would like to emphasize that the purpose of our numerical experiments is not to argue that SAMME is the ultimate multi-class classification tool, but rather to illustrate that it is a sensible algorithm, and that it is the natural extension of the AdaBoost algorithm to the multi-class case.

3.1 Simulation

We mimic a popular simulation example found in [5]. This is a three-class problem with twenty-one variables, and it is considered to be a difficult pattern recognition problem with Bayes error equal to 0.140. The predictors are defined by

    (22)  x_j = u·v_1(j) + (1 − u)·v_2(j) + ε_j,   Class 1,
          x_j = u·v_1(j) + (1 − u)·v_3(j) + ε_j,   Class 2,
          x_j = u·v_2(j) + (1 − u)·v_3(j) + ε_j,   Class 3,

where j = 1, ..., 21, u is uniform on (0, 1), the ε_j are standard normal variables, and the v's are the shifted triangular waveforms: v_1(j) = max(6 − |j − 11|, 0), v_2(j) = v_1(j − 4) and v_3(j) = v_1(j + 4).
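A data-generating sketch for the waveform model (22), assuming NumPy (the helper name and the equal class priors, consistent with roughly 100 observations per class out of 300, are our own choices):

```python
import numpy as np

def waveform(n, rng=np.random.default_rng(0)):
    """Three-class waveform data of (22): each class mixes two of the three
    shifted triangular waveforms with a uniform weight u, plus N(0,1) noise."""
    j = np.arange(1, 22)
    v1 = np.maximum(6 - np.abs(j - 11), 0)
    v2 = np.maximum(6 - np.abs(j - 15), 0)    # v1(j - 4)
    v3 = np.maximum(6 - np.abs(j - 7), 0)     # v1(j + 4)
    pairs = [(v1, v2), (v1, v3), (v2, v3)]    # classes 1, 2, 3
    c = rng.integers(1, 4, size=n)            # equal class priors (assumption)
    u = rng.uniform(size=(n, 1))
    x = np.empty((n, 21))
    for k, (va, vb) in enumerate(pairs, start=1):
        idx = (c == k)
        x[idx] = u[idx] * va + (1 - u[idx]) * vb
    x += rng.standard_normal((n, 21))
    return x, c

x_train, c_train = waveform(300)
```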
The training sample size is 300, so that approximately 100 training observations are in each class. We use the classification tree as the weak classifier for SAMME. The trees are built using a greedy, top-down recursive partitioning strategy, and we restrict all trees within each method to have the same number of terminal nodes. This number is chosen via five-fold cross-validation. We use an independent test sample of size 5000 to estimate the error rate. Averaged results over ten such independently drawn training-test set combinations are shown in Fig. 2 and Table 2.

As we can see, for this particular simulation example, SAMME performs slightly better than the AdaBoost.MH algorithm. A paired t-test across the ten independent comparisons indicates a significant difference, with p-value around 0.003.

Figure 2. Test errors for SAMME and AdaBoost.MH on the waveform simulation example. The training sample size is 300, and the testing sample size is 5000. The results are averages over ten independently drawn training-test set combinations.

Table 2. Test error rates (%) of different methods on the waveform data. The results are averaged over ten independently drawn datasets. For comparison, a single decision tree is also fit.

    Waveform    CART error = 28.4 (1.8)
    Method      200 iterations   400 iterations   600 iterations
    Ada.MH      17.1 (0.6)       17.0 (0.5)       17.0 (0.6)
    SAMME       16.7 (0.8)       16.6 (0.7)       16.6 (0.6)

3.2 Real data

In this section, we show the results of running SAMME on a collection of datasets from the UC-Irvine machine learning archive [18]. Seven datasets were used: Letter, Nursery, Pendigits, Satimage, Segmentation, Thyroid and Vowel. These datasets come with pre-specified training and testing sets, and are summarized in Table 3. They cover a wide range of scenarios: the number of classes ranges from 3 to 26, and the size of the training data ranges from 210 to 16,000 data points. The types of input variables include both numerical and categorical; for example, in the Nursery dataset, all input variables are categorical variables. We used a classification tree as the weak classifier in each case. Again, the trees were built using a greedy, top-down recursive partitioning strategy. We restricted all trees within each method to have the same number of terminal nodes, and this number was chosen via five-fold cross-validation.

Table 3. Summary of seven benchmark datasets

    Dataset        #Train   #Test   #Variables   #Classes
    Letter         16000    4000    16           26
    Nursery        8840     3790    8            3
    Pendigits      7494     3498    16           10
    Satimage       4435     2000    36           6
    Segmentation   210      2100    19           7
    Thyroid        3772     3428    21           3
    Vowel          528      462     10           11

Figure 3 compares SAMME and AdaBoost.MH. The test error rates are summarized in Table 5. The standard errors are approximated by sqrt( te.err · (1 − te.err) / n.te ), where te.err is the test error, and n.te is the size of the testing data.
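For example, for the Vowel test set (n.te = 462), a test error of 43.9% gives a standard error of about 2.3%, matching the value reported in parentheses in Table 5 (a quick check, ours):

```python
import numpy as np
print(np.sqrt(0.439 * (1 - 0.439) / 462))   # ~0.023, i.e. about 2.3%
```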


The most interesting result is on the Vowel dataset. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data [12]. The data was collected by [7], who recorded examples of the eleven steady state vowels of English spoken by fifteen speakers for a speaker normalization study. The International Phonetic Association (IPA) symbols that represent the vowels and the words in which the eleven vowel sounds were recorded are given in Table 4.

Table 4. The International Phonetic Association (IPA) symbols that represent the eleven vowels

    vowel   word    vowel   word    vowel   word    vowel   word
    i:      heed    O       hod     I       hid     C:      hoard
    E       head    U       hood    A       had     u:      who'd
    a:      hard    3:      heard   Y       hud

Four male and four female speakers were used to train the classifier, and then another four male and three female speakers were used for testing the performance. Each speaker yielded six frames of speech from eleven vowels. This gave 528 frames from the eight speakers used as the training data and 462 frames from the seven speakers used as the testing data. Ten predictors are derived from the digitized speech in a rather complicated way, but standard in the speech recognition world. As we can see from Fig. 3 and Table 5, for this particular dataset, the SAMME algorithm performs almost 15% better than the AdaBoost.MH algorithm.

For other datasets, the SAMME algorithm performs slightly better than the AdaBoost.MH algorithm on Letter, Pendigits, and Thyroid, while slightly worse on Segmentation. In the Segmentation data, there are only 210 training data points, so the difference might be just due to randomness. It is also worth noting that for the Nursery data, both the SAMME algorithm and the AdaBoost.MH algorithm are able to reduce the test error to zero, while a single decision tree has about a 0.8% test error rate. Overall, we are comfortable to say that the performance of SAMME is comparable with that of AdaBoost.MH.

For the purpose of further investigation, we also merged the training and the test sets, and randomly split them into new training and testing sets. The procedure was repeated ten times. Again, the performance of SAMME is comparable with that of AdaBoost.MH. For the sake of space, we do not present these results.

Figure 3. Test errors for SAMME and AdaBoost.MH on six benchmark datasets. These datasets come with pre-specified training and testing splits, and they are summarized in Table 3. The results for the Nursery data are not shown because the test error rates are reduced to zero for both methods.

Table 5. Test error rates (%) on seven benchmark real datasets. The datasets come with pre-specified training and testing splits. The standard errors (in parentheses) are approximated by sqrt( te.err · (1 − te.err) / n.te ), where te.err is the test error, and n.te is the size of the testing data. For comparison, a single decision tree was also fit, and the tree size was determined by five-fold cross-validation.

    Dataset        (CART error)    Method    200 iterations   400 iterations   600 iterations
    Letter         13.5 (0.5)      Ada.MH    3.0 (0.3)        2.8 (0.3)        2.6 (0.3)
                                   SAMME     2.6 (0.3)        2.4 (0.2)        2.3 (0.2)
    Nursery        0.79 (0.14)     Ada.MH    0                0                0
                                   SAMME     0                0                0
    Pendigits      8.3 (0.5)       Ada.MH    3.0 (0.3)        3.0 (0.3)        2.8 (0.3)
                                   SAMME     2.5 (0.3)        2.5 (0.3)        2.5 (0.3)
    Satimage       13.8 (0.8)      Ada.MH    8.7 (0.6)        8.4 (0.6)        8.5 (0.6)
                                   SAMME     8.6 (0.6)        8.2 (0.6)        8.5 (0.6)
    Segmentation   9.3 (0.6)       Ada.MH    4.5 (0.5)        4.5 (0.5)        4.5 (0.5)
                                   SAMME     4.9 (0.5)        5.0 (0.5)        5.1 (0.5)
    Thyroid        0.64 (0.14)     Ada.MH    0.67 (0.14)      0.67 (0.14)      0.67 (0.14)
                                   SAMME     0.58 (0.13)      0.61 (0.13)      0.58 (0.13)
    Vowel          53.0 (2.3)      Ada.MH    52.8 (2.3)       51.5 (2.3)       51.5 (2.3)
                                   SAMME     43.9 (2.3)       43.3 (2.3)       43.3 (2.3)

4. DISCUSSION

The statistical view of boosting, as illustrated in [11], shows that the two-class AdaBoost builds an additive model to approximate the two-class Bayes rule. Following the same statistical principle, we have derived SAMME, the natural and clean multi-class extension of the two-class AdaBoost algorithm, and we have shown that:

• SAMME adaptively implements the multi-class Bayes rule by fitting a forward stagewise additive model for multi-class problems;
• SAMME follows closely the philosophy of boosting, i.e. adaptively combining weak classifiers (rather than regressors, as in logit-boost [11] and MART [10]) into a powerful one;
• At each stage, SAMME returns only one weighted classifier (rather than K), and the weak classifier only needs to be better than K-class random guessing;
• SAMME shares the same simple modular structure of AdaBoost.

Our numerical experiments have indicated that AdaBoost.MH in general performs very well, and SAMME's performance is comparable with that of AdaBoost.MH, and sometimes slightly better. However, we would like to emphasize that our goal is not to argue that SAMME is the ultimate multi-class classification tool, but rather to illustrate that it is the natural extension of the AdaBoost algorithm to the multi-class case. The success of SAMME is used here to demonstrate the usefulness of the forward stagewise modeling view of boosting.

[11] called the AdaBoost algorithm Discrete AdaBoost and proposed the Real AdaBoost and Gentle AdaBoost algorithms, which combine regressors to estimate the conditional class probability. Using their language, SAMME is also a discrete multi-class AdaBoost. We have also derived the corresponding Real Multi-class AdaBoost and Gentle Multi-class AdaBoost [23, 24]. These results further demonstrate the usefulness of the forward stagewise modeling view of boosting.

It should also be emphasized here that although our statistical view of boosting leads to interesting and useful results, we do not argue that it is the ultimate explanation of boosting. Why boosting works is still an open question. Interested readers are referred to the discussions on [11]. [9] mentioned that the forward stagewise modeling view of AdaBoost does not offer a bound on the generalization error as in the original AdaBoost paper [8]. [3] also pointed out that the statistical view of boosting does not explain why AdaBoost is robust against overfitting. Later, his understanding of AdaBoost led to the invention of random forests [4].

Finally, we discuss the computational cost of SAMME. Suppose one uses a classification tree as the weak learner, and the depth of each tree is fixed as d; then the computational cost for building each tree is O(d p n log(n)), where p is the dimension of the input x. The computational cost for our SAMME algorithm is then O(d p n log(n) M), since there are M iterations.

The SAMME algorithm has been implemented in the R computing environment, and will be publicly available from the authors' websites.

APPENDIX: PROOFS

Proof of Lemma 1. First, for any fixed value of β > 0, using the definition (8), one can express the criterion in (7) as:

    ∑_{c_i = T(x_i)} w_i e^{−β/(K−1)} + ∑_{c_i ≠ T(x_i)} w_i e^{β/(K−1)²}
    (23)  = e^{−β/(K−1)} ∑_i w_i + ( e^{β/(K−1)²} − e^{−β/(K−1)} ) ∑_i w_i I(c_i ≠ T(x_i)).

Since only the last sum depends on the classifier T(x), we get that (10) holds. Now plugging (10) into (7) and solving for β, we obtain (11) (note that (23) is a convex function of β).

Proof of Theorem 1. Firstly, we note that under the sum-to-zero constraint,

    E_{Y|X=x} φ( (1/K)(Y_1 f_1(x) + ··· + Y_K f_K(x)) )
    = φ( f_1(x)/(K−1) ) Prob(C = 1|x) + ··· + φ( f_K(x)/(K−1) ) Prob(C = K|x).

Therefore, we wish to solve

    min_f  φ( f_1(x)/(K−1) ) Prob(C = 1|x) + ··· + φ( f_K(x)/(K−1) ) Prob(C = K|x)
    subject to ∑_{k=1}^{K} f_k(x) = 0.

For convenience, let p_k = Prob(C = k|x), k = 1, 2, ..., K, and omit x in f_k(x). Using the Lagrangian multiplier, we define

    Q(f) = φ( f_1/(K−1) ) p_1 + ··· + φ( f_K/(K−1) ) p_K + λ ( f_1/(K−1) + ··· + f_K/(K−1) ).

Then we have

    (24)  ∂Q(f)/∂f_k = [1/(K−1)] φ′( f_k/(K−1) ) p_k + [1/(K−1)] λ = 0,

for k = 1, ..., K. Since φ″(t) > 0 for ∀t, φ′ has an inverse function, denoted by ψ. Equation (24) gives f_k/(K−1) = ψ(−λ/p_k). By the sum-to-zero constraint on f, we have

    (25)  ∑_{k=1}^{K} ψ( −λ/p_k ) = 0.

Since φ′ is a strictly monotone increasing function, so is ψ. Thus the left hand side (LHS) of (25) is a decreasing function of λ. It suffices to show that equation (25) has a root λ*, which is then the unique root. Then it is easy to see that f̂_k = (K − 1) ψ(−λ*/p_k) is the unique minimizer of (18), for the Hessian matrix of Q(f) is a diagonal matrix and the k-th diagonal element is ∂²Q(f)/∂f_k² = [1/(K−1)²] φ″( f_k/(K−1) ) p_k > 0. Note that when λ = −φ′(0) > 0, we have λ/p_k > −φ′(0), hence ψ(−λ/p_k) < ψ(φ′(0)) = 0. So the LHS of (25) is negative when λ = −φ′(0) > 0. On the other hand, let us define A = {a : φ′(a) = 0}. If A is an empty set, then φ′(t) → 0− as t → ∞ (since φ is convex). If A is not empty, denote a* = inf A. By the fact φ′(0) < 0, we conclude a* > 0. Hence φ′(t) → 0− as t → a*−. In both cases, we see that there exists a small enough λ_0 > 0 such that ψ(−λ_0/p_k) > 0 for all k. So the LHS of (25) is positive when λ = λ_0 > 0. Therefore there must be a positive λ* ∈ (λ_0, −φ′(0)) such that equation (25) holds. Now we show that the minimizer f̂ agrees with the Bayes rule. Without loss of generality, let p_1 > p_k for ∀k ≠ 1. Then since −λ*/p_1 > −λ*/p_k for ∀k ≠ 1, we have f̂_1 > f̂_k for ∀k ≠ 1. For the inversion formula, we note that p_k = −λ*/φ′( f̂_k/(K−1) ), and ∑_{k=1}^{K} p_k = 1 requires ∑_{k=1}^{K} [ −λ*/φ′( f̂_k/(K−1) ) ] = 1. Hence it follows that λ* = −( ∑_{k=1}^{K} 1/φ′( f̂_k/(K−1) ) )^{−1}. Then (21) is obtained.

ACKNOWLEDGMENTS

We would like to dedicate this work to the memory of Leo Breiman, who passed away while we were finalizing this manuscript. Leo Breiman has made tremendous contributions to the study of statistics and machine learning. His work has greatly influenced us.

Received 22 May 2009

REFERENCES

[1] Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 7 1493–1517.
[3] Breiman, L. (2000). Discussion of "Additive logistic regression: a statistical view of boosting" by Friedman, Hastie and Tibshirani. Annals of Statistics 28 374–377. MR1790002
[4] Breiman, L. (2001). Random forests. Machine Learning 45 5–32.
[5] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA. MR0726392
[6] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association 98 324–339. MR1995709
[7] Deterding, D. (1989). Speaker Normalization for Automatic Speech Recognition. Ph.D. thesis, University of Cambridge.
[8] Freund, Y. and Schapire, R. (1997). A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 119–139. MR1473055
[9] Freund, Y. and Schapire, R. (2000). Discussion of "Additive logistic regression: a statistical view of boosting" by Friedman, Hastie and Tibshirani. Annals of Statistics 28 391–393. MR1790002
[10] Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 1189–1232. MR1873328
[11] Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28 337–407. MR1790002
[12] Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag, New York. MR1851606
[13] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics 30 1–50. MR1892654
[14] Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99 67–81. MR2054287
[15] Lin, Y. (2004). A note on margin-based loss functions in classification. Statistics and Probability Letters 68 73–82. MR2064687
[16] Liu, Y. and Shen, X. (2006). Multicategory psi-learning. Journal of the American Statistical Association 101 500–509. MR2256170
[17] Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). Boosting algorithms as gradient descent in function space. Neural Information Processing Systems 12.
[18] Merz, C. and Murphy, P. (1998). UCI repository of machine learning databases.
[19] Schapire, R. (1997). Using output codes to boost multiclass learning problems. Proceedings of the Fourteenth International Conference on Machine Learning. Morgan Kaufmann.
[20] Schapire, R., Freund, Y., Bartlett, P., and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26 1651–1686. MR1673273
[21] Schapire, R. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 297–336. MR1811573
[22] Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research 5 1225–1251. MR2248016
[23] Zhu, J., Rosset, S., Zou, H., and Hastie, T. (2005). Multi-class AdaBoost. Technical Report #430, Department of Statistics, University of Michigan.
[24] Zou, H., Zhu, J., and Hastie, T. (2008). The margin vector, admissible loss, and multi-class margin-based classifiers. Annals of Applied Statistics 2 1290–1306.

Ji Zhu
Department of Statistics
University of Michigan
Ann Arbor, MI 48109
USA
E-mail address: [email protected]

Hui Zou
School of Statistics
University of Minnesota
Minneapolis, MN 55455
USA
E-mail address: [email protected]

Saharon Rosset
Department of Statistics
Tel Aviv University
Tel Aviv 69978
Israel
E-mail address: [email protected]

Trevor Hastie
Department of Statistics
Stanford University
Stanford, CA 94305
USA
E-mail address: [email protected]