Multi-class AdaBoost
Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie
tion 2, we give theoretical justification for our new algorithm SAMME. In Section 3, we present numerical results on both simulation and real-world data. Summary and discussion regarding the implications of the new algorithm are in Section 4.
In the two-class case, AdaBoost is equivalent to forward stagewise additive modeling using the exponential loss function

L(y, f) = e^{-yf},

where y = (I(c = 1) - I(c = 2)) \in \{-1, 1\} in a two-class classification setting. A key argument is to show that the population minimizer of this exponential loss function is one half of the logit transform:

f^*(x) = \arg\min_{f(x)} E_{Y|X=x} L(y, f(x)) = \frac{1}{2} \log \frac{Prob(c = 1|x)}{Prob(c = 2|x)}.
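As a quick sanity check of this claim, here is a short derivation sketch (added here, not taken verbatim from the paper; p(x) is shorthand we introduce for Prob(c = 1|x)):

E_{Y|X=x}\, e^{-Yf(x)} \;=\; p(x)\, e^{-f(x)} + \bigl(1 - p(x)\bigr)\, e^{f(x)},

\frac{\partial}{\partial f(x)}\Bigl[\, p(x)\, e^{-f(x)} + \bigl(1 - p(x)\bigr)\, e^{f(x)} \Bigr] = 0
\;\Longrightarrow\;
e^{2f(x)} = \frac{p(x)}{1 - p(x)}
\;\Longrightarrow\;
f^*(x) = \frac{1}{2}\log\frac{p(x)}{1 - p(x)}.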
Forward stagewise modeling approximates the solution
Therefore, the Bayes optimal classification rule agrees with to (3)–(4) by sequentially adding new basis functions to the
∗
the sign of f (x). [11] recast AdaBoost as a functional gra- expansion without adjusting the parameters and coefficients
∗
dient descent algorithm to approximate f (x). We note that of those that have already been added. Specifically, the al-
besides [11], [2] and [21] also made connections between the gorithm starts with f (0) (x) = 0, sequentially selecting new
original two-class AdaBoost algorithm and the exponential basis functions from a dictionary and adding them to the
loss function. We acknowledge that these views have been current fit:
influential in our thinking for this paper. Algorithm 3. Forward stagewise additive modeling
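To make the coding in (5) concrete, here is a short sketch (the helper names are ours, not the paper's) that builds the K code vectors and recodes integer class labels c in {1, ..., K}:

```python
import numpy as np

def class_code_vectors(K):
    """The K vectors of (5), returned as the rows of a K x K array.

    Row k has a 1 in position k and -1/(K-1) everywhere else, so each row sums to 0.
    """
    Y = np.full((K, K), -1.0 / (K - 1))
    np.fill_diagonal(Y, 1.0)
    return Y

def encode_labels(c, K):
    """Map integer labels c_i in {1, ..., K} (1-based, as in the paper) to the code vectors y_i."""
    return class_code_vectors(K)[np.asarray(c) - 1]

# Example: with K = 3, encode_labels([1, 3], 3) gives the rows
# (1, -1/2, -1/2) and (-1/2, -1/2, 1).
```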
Forward stagewise modeling approximates the solution to (3)–(4) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. Specifically, the algorithm starts with f^{(0)}(x) = 0, sequentially selecting new basis functions from a dictionary and adding them to the current fit:

Algorithm 3. Forward stagewise additive modeling

1. Initialize f^{(0)}(x) = 0.
2. For m = 1 to M:
   (a) Compute (\beta^{(m)}, g^{(m)}(x)) = \arg\min_{\beta, g} \sum_{i=1}^{n} L\bigl(y_i, f^{(m-1)}(x_i) + \beta g(x_i)\bigr).
   (b) Set f^{(m)}(x) = f^{(m-1)}(x) + \beta^{(m)} g^{(m)}(x).
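The loop structure of Algorithm 3 is easy to express in code. The sketch below is ours; solve_step_2a is a hypothetical user-supplied routine that carries out step 2(a) for the chosen loss and dictionary of basis functions:

```python
def forward_stagewise(X, y, loss, solve_step_2a, M):
    """Generic forward stagewise additive modeling (a sketch of Algorithm 3).

    solve_step_2a(loss, f_current, X, y) should return (beta_m, g_m), where g_m is a
    callable x -> K-vector approximately minimizing
        sum_i loss(y_i, f_current(x_i) + beta * g(x_i)).
    Returns the list of (beta_m, g_m) pairs; the final fit is
        f^{(M)}(x) = sum_m beta_m * g_m(x).
    """
    components = []

    def f_current(x):
        # f^{(m-1)}(x): the sum of all terms added so far (starts at 0)
        return sum(beta * g(x) for beta, g in components)

    for m in range(M):
        beta_m, g_m = solve_step_2a(loss, f_current, X, y)   # step 2(a)
        components.append((beta_m, g_m))                     # step 2(b)
    return components
```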
Now, we consider using the multi-class exponential loss function

L(y, f) = \exp\Bigl(-\tfrac{1}{K}(y_1 f_1 + \cdots + y_K f_K)\Bigr) = \exp\Bigl(-\tfrac{1}{K} y^T f\Bigr),

in the above forward stagewise modeling algorithm. The choice of the loss function will be clear in Section 2.2 and Section 2.3. Then in step (2a), we need to find g^{(m)}(x) (and \beta^{(m)}) to solve:

(6)   (\beta^{(m)}, g^{(m)}) = \arg\min_{\beta, g} \sum_{i=1}^{n} \exp\Bigl(-\tfrac{1}{K} y_i^T \bigl(f^{(m-1)}(x_i) + \beta g(x_i)\bigr)\Bigr)

(7)   = \arg\min_{\beta, g} \sum_{i=1}^{n} w_i \exp\Bigl(-\tfrac{1}{K} \beta\, y_i^T g(x_i)\Bigr),

where w_i = \exp\bigl(-\tfrac{1}{K} y_i^T f^{(m-1)}(x_i)\bigr) are the un-normalized observation weights.

Notice that every g(x) as in (5) has a one-to-one correspondence with a multi-class classifier T(x) in the following way:

(8)   T(x) = k,  if g_k(x) = 1,

and vice versa:

(9)   g_k(x) = 1 if T(x) = k,  and  g_k(x) = -\tfrac{1}{K-1} if T(x) \neq k.

Hence, solving for g^{(m)}(x) in (7) is equivalent to finding the multi-class classifier T^{(m)}(x) that can generate g^{(m)}(x). With \beta^{(m)} = \tfrac{(K-1)^2}{K}\, \alpha^{(m)}, the model is then updated as

f^{(m)}(x) = f^{(m-1)}(x) + \beta^{(m)} g^{(m)}(x),

and the weights for the next iteration will be

w_i \leftarrow w_i \cdot \exp\Bigl(-\tfrac{1}{K} \beta^{(m)} y_i^T g^{(m)}(x_i)\Bigr).

This is equal to

(12)   w_i \cdot e^{-\frac{(K-1)^2}{K^2} \alpha^{(m)} y_i^T g^{(m)}(x_i)}
       = w_i \cdot e^{-\frac{K-1}{K} \alpha^{(m)}}   if c_i = T^{(m)}(x_i),
       and  w_i \cdot e^{\frac{1}{K} \alpha^{(m)}}   if c_i \neq T^{(m)}(x_i),

where \alpha^{(m)} is defined as in (1) with the extra term \log(K - 1), and the new weight (12) is equivalent to the weight updating scheme in Algorithm 2 (2d) after normalization.

It is also a simple task to check that \arg\max_k \bigl(f_1^{(m)}(x), \ldots, f_K^{(m)}(x)\bigr)^T is equivalent to the output C(x) = \arg\max_k \sum_{m=1}^{M} \alpha^{(m)} \cdot I\bigl(T^{(m)}(x) = k\bigr) in Algorithm 2. Hence, Algorithm 2 can be considered as forward stagewise additive modeling using the multi-class exponential loss function.
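The equivalence above translates directly into a simple re-weighting recipe. The sketch below is our own illustration of one boosting iteration; it assumes that (1), which is not shown in this excerpt, is the usual two-class step size, so that alpha^(m) = log((1 - err^(m))/err^(m)) + log(K - 1), and fit_weighted_classifier is a hypothetical helper that fits the weak classifier T^(m) to weighted data:

```python
import numpy as np

def samme_iteration(X, c, w, K, fit_weighted_classifier):
    """One boosting iteration using the re-weighting scheme described above.

    X : (n, p) inputs;  c : (n,) integer labels in {1, ..., K}
    w : (n,) current observation weights (assumed to sum to 1)
    Returns the weak classifier's predictions, alpha^{(m)}, and the updated weights.
    """
    pred = fit_weighted_classifier(X, c, w)       # T^{(m)}(x_i) for all i
    miss = (pred != c).astype(float)              # I(c_i != T^{(m)}(x_i))
    err = np.sum(w * miss) / np.sum(w)            # weighted training error
    alpha = np.log((1.0 - err) / err) + np.log(K - 1.0)   # two-class step size plus log(K - 1)
    w_new = w * np.exp(alpha * miss)              # weight update, as in Algorithm 2 (2d)
    w_new /= w_new.sum()                          # re-normalize
    return pred, alpha, w_new

# After M iterations the output is C(x) = argmax_k sum_m alpha^{(m)} * I(T^{(m)}(x) = k).
```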
2.2 The multi-class exponential loss

We now justify the use of the multi-class exponential loss (6). Firstly, we note that when K = 2, the sum-to-zero constraint indicates f = (f_1, -f_1), and then the multi-class exponential loss reduces to the exponential loss used in binary classification. [11] justified the exponential loss by showing that its population minimizer is equivalent to the Bayes rule. We follow the same arguments to investigate what is the population minimizer of this multi-class exponential loss function. Specifically, we are interested in

(13)   \arg\min_{f(x)} E_{C|X=x} \exp\Bigl(-\tfrac{1}{K}\bigl(Y_1 f_1(x) + \cdots + Y_K f_K(x)\bigr)\Bigr)
       subject to f_1(x) + \cdots + f_K(x) = 0.

Setting the derivatives of the corresponding Lagrangian with respect to f_1(x), \ldots, f_K(x) to zero gives a system of equations. Solving this set of equations, we obtain the population minimizer

(14)   f_k^*(x) = (K - 1) \log Prob(c = k|x) - \tfrac{K-1}{K} \sum_{k'=1}^{K} \log Prob(c = k'|x),

for k = 1, \ldots, K. Thus,

\arg\max_k f_k^*(x) = \arg\max_k Prob(c = k|x),

which is the multi-class Bayes optimal classification rule. This result justifies the use of this multi-class exponential loss function. Equation (14) also provides a way to recover the class probability Prob(c = k|x) once the f_k^*(x)'s are estimated, i.e.

(15)   Prob(C = k|x) = \frac{e^{\frac{1}{K-1} f_k^*(x)}}{e^{\frac{1}{K-1} f_1^*(x)} + \cdots + e^{\frac{1}{K-1} f_K^*(x)}},

for k = 1, \ldots, K.
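Formula (15) is straightforward to apply once estimates of the f_k(x) are available; a small sketch of ours:

```python
import numpy as np

def class_probabilities(f, K):
    """Recover Prob(C = k | x) from f = (f_1(x), ..., f_K(x)) via (15).

    f : array of shape (K,) or (n, K) of estimated f_k(x) values
        (each row satisfying the sum-to-zero constraint).
    """
    f = np.atleast_2d(f)
    z = f / (K - 1.0)
    z -= z.max(axis=1, keepdims=True)   # shift for numerical stability; the ratios are unchanged
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)
```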
2.3 Fisher-consistent multi-class loss functions

We have shown that the population minimizer of the new multi-class exponential loss is equivalent to the multi-class Bayes rule. This property is shared by many other multi-class loss functions. Let us use the same notation as in Section 2.1, and consider a general multi-class loss function

(16)   L(y, f) = \phi\Bigl(-\tfrac{1}{K}(y_1 f_1 + \cdots + y_K f_K)\Bigr) = \phi\Bigl(-\tfrac{1}{K} y^T f\Bigr),

where \phi(\cdot) is a non-negative valued function. The multi-class exponential loss uses \phi(t) = e^{-t}. We can use the general multi-class loss function in Algorithm 3 to minimize the empirical loss

(17)   \frac{1}{n} \sum_{i=1}^{n} \phi\Bigl(-\tfrac{1}{K} y_i^T f(x_i)\Bigr).

However, to derive a sensible algorithm, we need to require that the \phi(\cdot) function be Fisher-consistent. Specifically, we say \phi(\cdot) is Fisher-consistent for K-class classification if the optimization problem

(18)   \arg\min_{f(x)} E_{C|X=x} \phi\Bigl(-\tfrac{1}{K}\bigl(Y_1 f_1(x) + \cdots + Y_K f_K(x)\bigr)\Bigr)
       subject to f_1(x) + \cdots + f_K(x) = 0,

has a unique solution \hat{f}, and

(19)   \arg\max_k \hat{f}_k(x) = \arg\max_k Prob(C = k|x).

We use the sum-to-zero constraint to ensure the existence and uniqueness of the solution to (18).

Note that as n \rightarrow \infty, the empirical loss in (17) becomes

(20)   E_X E_{C|X=x} \phi\Bigl(-\tfrac{1}{K}\bigl(Y_1 f_1(x) + \cdots + Y_K f_K(x)\bigr)\Bigr).

Therefore, the multi-class Fisher-consistent condition basically says that with infinite samples, one can exactly recover the multi-class Bayes rule by minimizing the multi-class loss using \phi(\cdot). Thus our definition of Fisher-consistent losses is a multi-class generalization of the binary Fisher-consistent loss function discussed in [15].

In the following theorem, we show that there is a class of convex functions that are Fisher-consistent for K-class classification, for all K \geq 2.

Theorem 1. Let \phi(t) be a non-negative twice differentiable function. If \phi'(0) < 0 and \phi''(t) > 0 for all t, then \phi is Fisher-consistent for K-class classification for all K \geq 2. Moreover, let \hat{f} be the solution of (18); then we have

(21)   Prob(C = k|x) = \frac{1/\phi'\bigl(\tfrac{1}{K-1}\hat{f}_k(x)\bigr)}{\sum_{k'=1}^{K} 1/\phi'\bigl(\tfrac{1}{K-1}\hat{f}_{k'}(x)\bigr)},

for k = 1, \ldots, K.

Theorem 1 immediately implies that the three most popular smooth loss functions, namely, the exponential, logit and L2 loss functions, are Fisher-consistent for all multi-class classification problems regardless of the number of classes. The inversion formula (21) allows one to easily construct estimates for the conditional class probabilities. Table 1 shows the explicit inversion formulae for computing the conditional class probabilities using the exponential, logit and L2 losses.
eral multi-class loss function in Algorithm 3 to minimize the With these multi-class Fisher-consistent losses on hand,
empirical loss we can use the forward stagewise modeling strategy to de-
rive various multi-class boosting algorithms by minimizing
1
n
1 T the empirical multi-class loss. The biggest advantage of the
(17) φ − y i f (xi ) .
n i=1 K exponential loss is that it gives us a simple re-weighting for-
mula. Other multi-class loss functions may not lead to such
However, to derive a sensible algorithm, we need to require a simple closed-form re-weighting scheme. One could han-
the φ(·) function be Fisher-consistent. Specifically, we say dle this computation issue by employing the computational
trick used in [10] and [6]. For example, [24] derived a multi- same number of terminal nodes. This number is chosen via
class boosting algorithm using the logit loss. A multi-class five-fold cross-validation. We use an independent test sam-
version of the L2 boosting can be derived following the lines ple of size 5000 to estimate the error rate. Averaged results
in [6]. We do not explore these directions in the current pa- over ten such independently drawn training-test set combi-
per. To fix ideas, we shall focus on the multi-class AdaBoost nations are shown in Fig. 2 and Table 2.
algorithm. As we can see, for this particular simulation example,
SAMME performs slightly better than the AdaBoost.MH al-
3. NUMERICAL RESULTS gorithm. A paired t-test across the ten independent compar-
isons indicates a significant difference with p-value around
In this section, we use both simulation data and real-
0.003.
world data to demonstrate our multi-class AdaBoost algo-
rithm. For comparison, a single decision tree (CART; [5]) 3.2 Real data
and AdaBoost.MH [21] are also fit. We have chosen to com-
pare with the AdaBoost.MH algorithm because it is concep- In this section, we show the results of running SAMME on
tually easy to understand and it seems to have dominated a collection of datasets from the UC-Irvine machine learn-
other proposals in empirical studies [21]. Indeed, [22] also ing archive [18]. Seven datasets were used: Letter, Nursery,
argue that with large samples, AdaBoost.MH has the op- Pendigits, Satimage, Segmentation, Thyroid and Vowel.
timal classification performance. The AdaBoost.MH algo- These datasets come with pre-specified training and testing
rithm converts the K-class problem into that of estimating sets, and are summarized in Table 3. They cover a wide
a two-class classifier on a training set K times as large, with range of scenarios: the number of classes ranges from 3
an additional feature defined by the set of class labels. It is to 26, and the size of the training data ranges from 210
essentially the same as the one vs. rest scheme [11]. to 16,000 data points. The types of input variables in-
We would like to emphasize that the purpose of our nu- clude both numerical and categorical, for example, in the
merical experiments is not to argue that SAMME is the ul- Nursery dataset, all input variables are categorical vari-
timate multi-class classification tool, but rather to illustrate ables. We used a classification tree as the weak classifier
that it is a sensible algorithm, and that it is the natural ex- in each case. Again, the trees were built using a greedy,
tension of the AdaBoost algorithm to the multi-class case. top-down recursive partitioning strategy. We restricted all
trees within each method to have the same number of ter-
3.1 Simulation minal nodes, and this number was chosen via five-fold cross-
validation.
We mimic a popular simulation example found in [5]. This
Figure 3 compares SAMME and AdaBoost.MH. The test
is a three-class problem with twenty one variables, and it is
error rates are summarized in Table 5. The standard er-
considered to be a difficult pattern recognition problem with
rors are approximated by te.err · (1 − te.err)/n.te, where
Bayes error equal to 0.140. The predictors are defined by
te.err is the test error, and n.te is the size of the testing
⎧
⎨ u · v1 (j) + (1 − u) · v2 (j) + j , Class 1, data.
(22) xj = u · v1 (j) + (1 − u) · v3 (j) + j , Class 2, The most interesting result is on the Vowel dataset. This
⎩ is a difficult classification problem, and the best methods
u · v2 (j) + (1 − u) · v3 (j) + j , Class 3,
achieve around 40% errors on the test data [12]. The data
where j = 1, . . . , 21, u is uniform on (0, 1), j are standard was collected by [7], who recorded examples of the eleven
normal variables, and the v are the shifted triangular wave- steady state vowels of English spoken by fifteen speakers for
forms: v1 (j) = max(6 − |j − 11|, 0), v2 (j) = v1 (j − 4) and a speaker normalization study. The International Phonetic
v3 (j) = v1 (j + 4). Association (IPA) symbols that represent the vowels and the
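For readers who want to reproduce this setup, the sketch below (ours) draws a sample from model (22), assuming equal class priors, which matches the roughly balanced training sets described next:

```python
import numpy as np

def waveform_sample(n, rng=None):
    """Draw n observations (X of shape (n, 21), labels c in {1, 2, 3}) from model (22)."""
    rng = np.random.default_rng() if rng is None else rng
    j = np.arange(1, 22)                      # j = 1, ..., 21
    v1 = np.maximum(6 - np.abs(j - 11), 0)    # triangular waveform centered at j = 11
    v2 = np.maximum(6 - np.abs(j - 15), 0)    # v2(j) = v1(j - 4)
    v3 = np.maximum(6 - np.abs(j - 7), 0)     # v3(j) = v1(j + 4)
    pairs = {1: (v1, v2), 2: (v1, v3), 3: (v2, v3)}

    c = rng.integers(1, 4, size=n)            # class labels, equal priors assumed
    u = rng.uniform(0, 1, size=n)
    eps = rng.standard_normal((n, 21))
    X = np.empty((n, 21))
    for i in range(n):
        a, b = pairs[c[i]]
        X[i] = u[i] * a + (1 - u[i]) * b + eps[i]
    return X, c

# Example: X_train, c_train = waveform_sample(300) gives roughly 100 observations per class.
```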
The training sample size is 300 so that approximately 100 training observations are in each class. We use the classification tree as the weak classifier for SAMME. The trees are built using a greedy, top-down recursive partitioning strategy, and we restrict all trees within each method to have the same number of terminal nodes. This number is chosen via five-fold cross-validation. We use an independent test sample of size 5000 to estimate the error rate. Averaged results over ten such independently drawn training-test set combinations are shown in Fig. 2 and Table 2.

Table 2. Test error rates (%) of different methods on the waveform data. The results are averaged over ten independently drawn datasets. For comparison, a single decision tree is also fit.

  Waveform data (CART error = 28.4 (1.8))
                          Iterations
  Method         200           400           600
  Ada.MH         17.1 (0.6)    17.0 (0.5)    17.0 (0.6)
  SAMME          16.7 (0.8)    16.6 (0.7)    16.6 (0.6)

As we can see, for this particular simulation example, SAMME performs slightly better than the AdaBoost.MH algorithm. A paired t-test across the ten independent comparisons indicates a significant difference, with a p-value around 0.003.
3.2 Real data

In this section, we show the results of running SAMME on a collection of datasets from the UC-Irvine machine learning archive [18]. Seven datasets were used: Letter, Nursery, Pendigits, Satimage, Segmentation, Thyroid and Vowel. These datasets come with pre-specified training and testing sets, and are summarized in Table 3. They cover a wide range of scenarios: the number of classes ranges from 3 to 26, and the size of the training data ranges from 210 to 16,000 data points. The types of input variables include both numerical and categorical; for example, in the Nursery dataset, all input variables are categorical. We used a classification tree as the weak classifier in each case. Again, the trees were built using a greedy, top-down recursive partitioning strategy. We restricted all trees within each method to have the same number of terminal nodes, and this number was chosen via five-fold cross-validation.

Table 3. Summary of seven benchmark datasets

  Dataset        #Train   #Test   #Variables   #Classes
  Letter          16000    4000       16          26
  Nursery          8840    3790        8           3
  Pendigits        7494    3498       16          10
  Satimage         4435    2000       36           6
  Segmentation      210    2100       19           7
  Thyroid          3772    3428       21           3
  Vowel             528     462       10          11

Figure 3 compares SAMME and AdaBoost.MH. The test error rates are summarized in Table 5. The standard errors are approximated by \sqrt{te.err \cdot (1 - te.err) / n.te}, where te.err is the test error and n.te is the size of the testing data.
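As a concrete illustration of this standard-error approximation (our own snippet; the inputs below use the roughly 40% Vowel error rate and its 462 test cases mentioned next, not a value taken from Table 5):

```python
import math

def test_error_se(te_err, n_te):
    """Binomial approximation to the standard error of a test error rate."""
    return math.sqrt(te_err * (1.0 - te_err) / n_te)

# A 40% error rate estimated on 462 test cases has a standard error of about 0.023.
print(round(test_error_se(0.40, 462), 3))
```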
The most interesting result is on the Vowel dataset. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data [12]. The data were collected by [7], who recorded examples of the eleven steady-state vowels of English spoken by fifteen speakers for a speaker normalization study. The International Phonetic Association (IPA) symbols that represent the vowels and the words in which the eleven vowel sounds were recorded are given in Table 4.

Table 4. The International Phonetic Association (IPA) symbols that represent the eleven vowels

  vowel   word     vowel   word     vowel   word     vowel   word
  i:      heed     O       hod      I       hid      C:      hoard
  E       head     U       hood     A       had      u:      who'd
  a:      hard     3:      heard    Y       hud

Four male and four female speakers were used to train the classifier, and then another four male and three female speakers were used for testing the performance. Each speaker yielded six frames of speech from the eleven vowels. This gave 528 frames from the eight speakers used as the training data and 462 frames from the seven speakers used as the testing data. Ten predictors are derived from the digitized speech in a rather complicated way, but one that is standard in the speech recognition world. As we can see from Fig. 3 and Table 5, for this particular dataset, the SAMME algorithm performs almost 15% better than the AdaBoost.MH algorithm.

For the other datasets, the SAMME algorithm performs slightly better than the AdaBoost.MH algorithm on Letter, Pendigits, and Thyroid, while slightly worse on Segmentation. In the Segmentation data, there are only 210 training data points, so the difference might be just due to randomness. It is also worth noting that for the Nursery data, both the SAMME algorithm and the AdaBoost.MH algorithm are able to reduce the test error to zero, while a single decision tree has about a 0.8% test error rate. Overall, we are comfortable saying that the performance of SAMME is comparable with that of AdaBoost.MH.

For the purpose of further investigation, we also merged the training and the test sets, and randomly split them into new training and testing sets. The procedure was repeated ten times. Again, the performance of SAMME is comparable with that of AdaBoost.MH. For the sake of space, we do not present these results.