Model Combination in Multiclass Classification
by
Samuel Robert Reid
Doctor of Philosophy
2010
This thesis entitled:
Model Combination in Multiclass Classification
written by Samuel Robert Reid
has been approved for the Department of Computer Science
Michael C. Mozer
Date
The final copy of this thesis has been examined by the signatories, and we find that
both the content and the form meet acceptable presentation standards of scholarly
work in the above mentioned discipline.
classifying a pattern into one of several classes, and encompasses domains such as hand-
identification and many others. In this thesis, we investigate three issues in combining
more effective to share hyperparameters across models than to optimize them indepen-
dently. Third, we introduce a new method for combining binary pairwise classifiers that
overcomes several problems with existing pairwise classification schemes and exhibits
significantly better performance on many problems. Our contributions span the themes
Acknowledgements
Many people are responsible for my productive and enjoyable learning experi-
ence at the University of Colorado and for the ideas and results in this dissertation
research. My first machine learning class was taught by Mike Mozer in Spring 2001,
and his infectious excitement sparked my interest in machine learning. Mike’s insight
and intuition make him an excellent teacher and an invaluable research advisor. Mike
improved every aspect of this dissertation, and has always pushed me to strive for the
best. Many thanks to Greg Grudic for his support and assistance with the ideas in
the early stages of this dissertation research, particularly concerning the proposal and
Caruana, Richard Byrd, James Martin and François Meyer for their service on the com-
mittees. Many thanks to PhET Interactive Simulations for funding my education and
providing hardware for the experimental studies. Thanks to the reviewers of the 2009
our publication. Thanks to all the groups that made their data sets publicly available,
including the UCI Repository and the Turing Institute at Glasgow, Scotland. This work
was also supported by NSF Science of Learning Center grant SBE-0542013 (Garrison
parents Robert and Kathy who exemplify encouragement and support. Many thanks
to my wonderful wife Ingrid for feedback on rough drafts and practice talks and for her
Contents
Chapter
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background 5
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 Conclusion 151
Bibliography 154
Appendix
Tables
Table
2.2 A meta-level stacking dataset for three classes, two base classifiers and
2.3 A meta-level stacking dataset for three classes, two base classifiers and
five examples, using Ting and Witten’s model [95]. ωi refers to the ith
discriminant value, ŷi is the ith model and ci is the predicted class. A
3.1 Data sets used in the experimental studies, and their properties . . . . . 39
3.2 Accuracy of each model for each data set. Entries are averages over the
3.3 for a description of the methods and Section 3.4 for discussion. . . . 42
3.3 Selected posterior probabilities and corresponding weights for the sat-
image problem for elastic net StackingC with α = 0.95. Only the 6
models with highest total weights are shown here. ann indicates a single-
4.2 Winning strategy for each combination of reduction and metric. Statis-
tically significant wins (at p ≤ 0.05) are highlighted. P-values from the
4.3 Accuracy results for linear decision boundaries (in %), for synthetic data
4.4 Accuracy results for mixed linear and nonlinear decision boundaries (in
4.5 Winning strategy for each combination of reduction and metric, when
0.05) are highlighted. P-values from the Wilcoxon signed-ranks test are
4.7 Average accuracy over 10 random splits for shared and independent
model selection strategies with the one-vs-all reduction, with the stan-
dard deviation indicated in parentheses. The winner for each data set is
indicated in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.8 Average accuracy over 10 random splits for shared and independent
model selection strategies with the all-pairs reduction, with the stan-
dard deviation indicated in parentheses. The winner for each data set is
indicated in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.9 Average accuracy over 10 random splits for shared and independent
4.10 Average accuracy over 10 random splits for shared and independent
4.11 Average rectified Brier score over 10 random splits for shared and inde-
pendent model selection strategies with the one-vs-all reduction, with the
4.12 Average rectified Brier score over 10 random splits for shared and inde-
pendent model selection strategies with the all-pairs reduction, with the
4.13 Average rectified Brier score over 10 random splits for shared and in-
4.14 Average rectified Brier score over 10 random splits for shared and inde-
5.2 Properties of the 20 data sets used in our experimental studies. . . . . . 114
5.3 Accuracy results (%) for the comparatively noiseless synthetic data set.
5.4 Accuracy results (%) for the noisy synthetic data set. The standard error
5.6 Accuracy scores for the J48 base classifier with various degradations, each
5.7 Adjusted p-values under the specified degradations for the accuracies in-
5.12 Results for j48 under the rectified Brier metric . . . . . . . . . . . . . . 146
5.13 Results for knn under the rectified Brier metric . . . . . . . . . . . . . . 148
5.14 Results for rf100 under the rectified Brier metric . . . . . . . . . . . . . 148
5.15 Results for svm121 under the rectified Brier metric . . . . . . . . . . . . 149
Figures
Figure
All-Pairs reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Figure 3.2(a) shows root mean squared error for the indicator problem for
the first class in the sat-image problem. Figure 3.2(b) shows overall ac-
curacy as a function of the ridge parameter for sat-image. The error bars
tion of the penalty term for lasso regression for the sat-image problem,
with standard errors indicated in Figure 3.3(a). Figure 3.3(b) shows ac-
the penalty term for α = {0.05, 0.5, 0.95} for the elastic net for sat-
image. The constant line indicates the accuracy of the classifier chosen
by select-best. Error bars have been omitted for clarity, but do not differ
3.4 Coefficient profiles for the first three subproblems in StackingC for the
all-pairs reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
eter selection for each reduction, averaged over 20 data sets. The advan-
4.4 Independent model selection curves for the four one-vs-all subproblems
4.5 Independent model selection curves for the 6 all-pairs subproblems in the
grees of noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.10 Model selection curves for Gaussian synthetic data sets under the one-
vs-all reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.11 Model selection curves for Gaussian synthetic data sets under the all-pairs
reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
average binary accuracy and the multiclass accuracy for the one-vs-all
reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
average binary accuracy and the multiclass accuracy for the all-pairs re-
duction. The R² computation yielded NaN for the datasets letter and
4.18 Average ranks of the 7 algorithms under study; algorithms not statisti-
5.2 Accuracy averaged over all 20 data sets for all combinations of base clas-
sifier and reduction method, with one standard error indicated. . . . . . 118
5.3 Rectified Brier score averaged over all 20 data sets for all combinations
of base classifier and reduction method, with one standard error indicated.119
5.4 Graphical depiction of the rank of each algorithm as averaged over all
20 data sets, shown for the accuracy metric (top row) and for the Brier
metric (bottom row). A vertical bar connects the top algorithm to any
5.5 Accuracy as a function of the sample size (2/3 of which used for training),
averaged over the 10 largest data sets described in Table 5.2 . . . . . . . 122
5.6 Accuracy as a function of the (log10 of the) number of trees in the random
forest base classifier, averaged over all 20 data sets described in Table 5.2 124
5.10 The accuracy relative to a random forest with 100 trees as a function
of the number of classes in the data set for voted pairwise classifica-
data point for each of the 20 benchmark data sets (see Section 5.3.1)
5.11 The accuracy relative to a random forest with 100 trees as a function of
the number of classes for regression data sets that have been discretized
with varying number of classes. The average over the 9 data sets is
5.12 The entropy relative to a random forest with 100 trees as a function of the
number of classes in the data set for voted pairwise classification (VPC),
5.13 Relative accuracy for decision trees under accuracy metric. . . . . . . . 142
5.14 Relative accuracy for k-nearest neighbor under accuracy metric. . . . . . 142
5.15 Relative accuracy for Random Forest under accuracy metric. . . . . . . 143
Introduction
1.1 Introduction
toring, image segmentation, protein binding site prediction and many others. Several
algorithms have been proposed and studied for solving multiclass classification prob-
lems [50, 75, 30]. Early work in machine learning focused on using a single classifier for
each problem, but recent work has shown the advantage of training many classifiers for
each problem and combining their predictions [101, 62, 11, 89, 31, 48, 17]. Multiclass
classifiers can be combined directly using voting, averaging or other linear or nonlin-
is to combine binary classifiers to solve multiclass problems, with each binary classifier
proach (Chapter 5). Our contributions span the related themes of model selection and
solving multiclass problems with binary classifiers. Each contribution of this thesis is
self-contained so that each of the contribution chapters 3-5 can be read independently of
the rest of the dissertation. Some methodological descriptions and background materi-
als are duplicated to facilitate independence of the chapters. This introductory chapter
provides a high level overview of the contributions and common themes between them.
3-5 contain the contributions of this thesis, and Chapter 6 concludes with a summary
nation function, under the framework of stacked generalization. We use many types
of classification algorithms with many different hyperparameter settings, then use their
We study several regularization techniques and show that proper regularization of the
combiner function is essential to improve performance. The standard linear least squares
regression) or a combination of the two (elastic net regression). We study a linear model
that applies one weight per classifier prediction rather than one weight per classifier,
ferent classes. This chapter was published in the conference proceedings and presented
Some machine learning algorithms were designed for solving binary classification
problems (e.g. support vector machines or AdaBoost). A popular and effective way to
solve a multiclass problem using binary classifiers is to transform the multiclass clas-
sification problem into a set of binary classification problems, solve them using binary
classification algorithms and combine the predictions of the binary classifiers. For ex-
ample, the one-vs-all method creates one binary subproblem for each class, separating
it from the remaining classes. Another common reduction method is called pairwise
classification (or all-pairs), in which each subproblem separates one class from another
class. In Chapter 4, we focus on the particular issue of how to perform model selection
opposed to monolithic approaches that solve the entire multiclass problem at once and
are regularized as a unit, techniques that reduce multiclass problems to binary subprob-
lems introduce the new flexibility to perform model selection in each subproblem. In
share similar structure. Conversely, we construct a synthetic data set with differing
decision boundary shapes, and show that independently optimizing subproblem models
is more effective in that case. We also rule out several confounding factors such as
Pairwise classification (all-pairs) [38, 40, 56] has been criticized because it relies
on classifiers that must make predictions over distributions that were unseen during
training [51, 23]. In Chapter 5, we address this issue with a new pairwise classifica-
tion technique called probabilistic pairwise classification (PPC) that uses probabilistic
predictions for pairwise discrimination and weights each pairwise prediction with an
estimated probability that the instance belongs to the pair. The technique is derived
from the Theorem of Total Probability, and relies only on the assumption that each
instance is assigned exactly one label. Our method is conceptually simpler and easier
to implement than other pairwise classification methods that incorporate and produce
probabilities. Experimental studies indicate that our proposed technique performs bet-
ter than other pairwise classification techniques on real world data sets, at the cost of
increased computational demands. We also show that our proposed method is capable
over many real world data sets, comparison with similar methodologies, investigation of
behavior under synthetic data sets, and discussion of related theories. Before presenting
Background
This chapter introduces supervised classification (Section 2.1), and the related
concepts of model selection (Section 2.2) and model combination (Section 2.3). We
discuss the two methodologies for performing model combination, namely commensurate
model combination, in which each classifier provides an estimate of the same target
function (Section 2.3.1) and complementary model combination, in which each classifier
tributes), the goal is to induce a model ŷ(x) that minimizes a loss function L(y, ŷ) over
the distribution of unseen data. Training and test points are assumed to be drawn i.i.d.
(independent and identically distributed) from the same full joint probability distribu-
tion q(z), where z = (x, y). In nearly all real world problems, the full joint distribution
is unknown; instead a finite number of observations D are generated by the full joint
distribution. Generative techniques seek to model the full joint probability distribution
for the purpose of making predictions whereas discriminative techniques directly model
the class boundaries. In this thesis, we restrict our focus to discriminative modeling
techniques.
known as a variable or feature, and Ω is the set of possible class labels Ω = {ω_1, ..., ω_k}.
set and a set of hyperparameters: f̂(D, θ) = ŷ(x) : K^{d·N} ⇒ (K^d ⇒ Ω). The hyperpa-
rameters θ, also known as learning parameters, are settings used to govern the learning
rate or momentum parameters in artificial neural network models or the {C, γ} learn-
ing parameters used in Gaussian support vector machines (SVMs). Each classification
algorithm entails a search through an implicit or explicit hypothesis space to find the
preferred model structure and/or parameters [75]. For classification problems, we con-
hyperparameters for a particular data set. We refer to a model as any function that
produces a classifier when evaluated on a given labeled training data set. Typically,
data components of the trained classifier rather than the mechanism used to obtain it.
If the algorithm is selected before seeing any labeled training data, then model selection
simply refers to the search over learning hyperparameters for the given classification
algorithm. Search techniques such as grid search or binary search are often used to
set. While parametric model selection methods have been proposed, such as minimum
models in order to make the final prediction. The underlying classifiers are commonly
of the base classifier predictions ŷ(x) = f (ŷ1 (x), ..., ŷL (x)), where ŷ(x) is the ensemble
prediction, ŷi (x) is the prediction from the ith base classifier (out of L base classifiers),
and f (.) is the combination function. Multiple classifier systems can also allow the
combination function to depend on the input vector f (.) = f (ŷ1 (x), ..., ŷL (x), x), though
The base classifiers ŷi , i = 1..L may be constructed to each solve the same prob-
lem (commensurate models), or each base classifier may be trained to solve a different
subproblem (complementary models). In bagging, for example, each of the base clas-
or perturbation of the original problem), and the base classifiers are combined using
voting [11]. In boosting and error-correcting output coding, different subproblems are
constructed to be solved by the different base classifiers; when the base classifiers solve
different problems, the combination function is typically more complex than averaging
or voting. In this thesis, Chapter 3 focuses on a method that combines many com-
mensurate models. Chapters 4-5 focus on methods that split multiclass classification
same (target) function. In Chapter 3, we use many classification algorithms and hyper-
parameter settings to approximate the target function, then combine the models using a
model combination.
There are many techniques for generating different base models for classification
minor change to A or D produces a large change in the resulting classifier C [11]. Un-
stable classification algorithms are vital for obtaining improved ensembles under some
types of perturbations. In this thesis, we mainly focus on the usage of different algo-
rithms and associated hyperparameters to generate diversity among the base classifiers.
Furthermore, base classification algorithms may include one or more of the following
approaches (for example, a random forest classifier can be used as a base classifier).
Many publications have focused on isolating and evaluating the particular dimensions
for perturbation described below; an important line of future research would be to com-
bine many of these types of perturbations to attempt to maximize diversity and thus
classification performance.
Modifying the Training Data Set Modification of the original training data
main techniques for producing a perturbed dataset are to: (1) subsample examples, (2)
disjoint subsets of the training data set, the resulting classifiers can exhibit reduced
fication algorithm on bootstrap samples of the original dataset; this technique is called
bagging (for bootstrap aggregating) [11]. Friedman and Popescu [37] point out that
there is nothing inherently advantageous about the bootstrap; different problems will
benefit from different resampling strategies in general. Boosting assigns higher weight
to difficult examples to produce classifiers that correctly label the difficult examples.
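As an illustrative sketch of bootstrap resampling and voting (this is not the thesis code; scikit-learn's BaggingClassifier provides an equivalent off-the-shelf implementation, and the decision-tree base learner and integer-coded class labels below are assumptions):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagged_ensemble(X, y, n_estimators=25, base=DecisionTreeClassifier(), seed=None):
    """Train n_estimators copies of `base`, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Plurality vote over the base classifiers (assumes integer class labels)."""
    preds = np.array([m.predict(X) for m in models])          # shape (L, N)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```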
Different Feature Sets For each classifier, a random subset of features from
the original problem are selected. This technique is known as the random subspace
method, and has been explored with decision tree classifiers [52]. This technique fails
for problems in which all features are required to attain sufficiently high classification
accuracy [61].
novel data based on the original data distribution. For example, in DECORATE, novel
data is generated and different classifiers are trained using that data with different
label assignments [72]. This technique can also be used productively in semi-supervised
methods, in which there is ample unlabeled data; in this case, no data synthesis step is
necessary.
Adding Noise Similarly, changing the class labels assigned to examples is one
way of producing different classifiers. For some datasets and classification algorithms,
data set to produce different classifiers, it is also possible to modify the algorithm used
ent inductive biases [75]; by using different learning algorithms, different classifiers are
obtained.
parameters that must be tuned in order to match properties of the dataset at hand.
Using different settings for these hyperparameters generally results in the production
of different classifiers. For example, backpropagation neural networks can have varying
domization to produce individual classifiers. For example, random forests [14] construct
each classifier by sampling i.i.d. from a specified distribution over decision tree classi-
fiers. This randomization increases the diversity of the classifiers without significantly
decreasing their accuracy, and subsequently, the ensemble has improved classification
accuracy.
Perturbation of the Loss Function When the loss function for a classifica-
tion algorithm is perturbed, the result is a novel classification algorithm. Friedman and
Popescu [37] discuss perturbation of the loss function to produce diverse classifiers.
In 2000, Dietterich [26] identified three distinct problems that can be overcome
by classifier combination:
(1) The statistical problem: Several classifiers may yield the same validation set
the output space, and reduces the risk of choosing a single poor classifier.
(2) The computational problem: Many classification algorithms entail a search that
NP-hard for neural networks [57] and decision trees [8]; therefore suboptimal
techniques are typically used instead (e.g. greedy search for decision trees or
runs, the ensemble may produce a better prediction than any of the constituent
classifiers.
decision boundary that is impossible to represent with any single base classifier
(for instance, consider that two right triangular decision boundaries may be
set of decision boundaries. For example, a neural network with a single hidden
layer and a sufficient number of hidden neurons can represent any continuous
function [53]. However, when trained on a finite amount of training data, clas-
sification algorithms are only capable of exploring a finite region of the classifier
space. It is possible to expand the hypothesis space by asserting that the final
Techniques for combining classifier outputs range from simple (e.g. majority vot-
ing) to complex (e.g. nonlinear combination functions). For any commensurate multiple
classifier technique to succeed, the base classifiers must exhibit nonrandom accuracy and
When classification algorithms are applied to overlapping (or equal) data samples
or share similar inductive biases, the resulting classifiers will probably make similar pre-
in order to understand the mechanism and benefits of classifier combination [63, 92].
When classifiers have identical accuracy and make independent predictions on a binary
classification problem, the probability that the majority vote is correct is

$$P_{maj} = \sum_{k=\lfloor L/2 \rfloor + 1}^{L} \binom{L}{k} \, p^{k} (1-p)^{L-k} \qquad (2.1)$$
base classifier accuracy and number of members in the group. These curves are also
depicted in Figure 2.1 for better-than-random classifiers. The behavior as the number
(1) p > 0.5 When each classifier predicts the correct class with probability greater
than 0.5, then as classifiers are added, the ensemble accuracy increases monotonically.
(2) p < 0.5 When each classifier predicts the correct class with probability less than
0.5, then as classifiers are added, the ensemble accuracy decreases monotonically.
(3) p = 0.5 When each classifier predicts the correct class with probability equal to
0.5, then as classifiers are added, the ensemble accuracy remains at 0.5.
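To make the three regimes concrete, the following short Python sketch (not from the thesis; the function name and the sample values of p and L are illustrative) evaluates Equation 2.1 directly:

```python
from math import comb, floor

def majority_vote_accuracy(p, L):
    """Probability that a majority of L independent classifiers,
    each correct with probability p, votes for the correct class (Eq. 2.1)."""
    return sum(comb(L, k) * p**k * (1 - p)**(L - k)
               for k in range(floor(L / 2) + 1, L + 1))

print(majority_vote_accuracy(0.8, 1))   # 0.80: a single classifier
print(majority_vote_accuracy(0.8, 9))   # ~0.98: accuracy grows when p > 0.5
print(majority_vote_accuracy(0.3, 9))   # shrinks when p < 0.5
print(majority_vote_accuracy(0.5, 9))   # stays at 0.5 when p = 0.5
```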
For instance, in Table 2.1, the row for which p = 0.8 shows that after adding just 8 more
independent classifiers, the accuracy of the group under majority vote increases to 0.98,
a relative accuracy gain of more than 20%. It is possible to improve upon this benefit
However, classifiers will typically be at least somewhat correlated, and the benefits will
fall short of those identified in the table. It is also possible to characterize the behavior
of independent classifiers for multiclass classification problems. In this case, sums are
Table 2.1: Ensemble accuracy is depicted for a given number of base classifiers L, and
individual accuracy p.
[Figure 2.1: Ensemble accuracy under majority voting as a function of the number of classifiers L (x-axis, 1 to 15), for better-than-random base classifiers; y-axis: ensemble accuracy, 0.50 to 1.00.]
When classifiers are trained using overlapping data or are produced by algorithms
that share a similar inductive bias, the classifiers are likely to be at least somewhat
correlated. Clemen and Winkler [20] and Jacobs [58] show that there is an upper limit
particular, they show that m dependent experts are worth the same as k independent
experts, where k ≤ m. This result is obtained under the assumption that experts
provide unbiased point estimates for a regression variable, and that the joint probability
of the experts’ errors is normally distributed with mean zero and covariance matrix Σ.
The combiner function is Bayes’ rule, treating the expert predictions as observations in a
new space (see Section 2.3.1.5). Therefore, combining dependent classifiers can still lead
classifiers.
dictions from human experts, so many of the ideas in this literature are referred to as
expert combination or opinion pooling, and the combiner function is called the decision
maker. Preliminary work involved the search for a combination technique that satis-
fied normative (axiomatic) constraints, which are intuitive properties one might expect
to find in a combination technique [79]. However, when just a few seemingly reason-
able axioms are asserted together, they can be shown to be jointly unsatisfiable, yielding
impossibility theorems in the same vein as Arrow’s impossibility theorem [4]. Further
work showed the advantage of viewing the set of individual predictions as data, and
many authors have advocated the usage of Bayes’ theorem on the meta-level data in
the so-called supra-Bayesian framework [76, 58]. A few authors have investigated par-
ticular forms for the likelihood function, with hyperparameters that are inferred from
data [46, 60]; however, the computational demands of these techniques are large and
grow rapidly with the addition of classifiers. Before discussing probabilistic models, we
combination technique.
That is, classifier predictions are weighted by the probability that the model is correct,
given the dataset. This technique appears to be a natural method for combining models,
and some have been tempted to treat it as such [28]. However, as is pointed out [74, 46],
Bayesian model averaging is not model combination, but rather a form of soft model
selection. As more data is observed, Bayesian model averaging will assign more and more
probability to the most probable model, and as the amount of data tends to infinity,
weights for all other models approach zero. Bayesian marginalization is appropriate
when the base classifiers Mi are mutually exclusive and exhaustive and the true data
a linear combination ŷ(x) = Σ_{k=1}^{L} α_k y_k(x), called a linear opinion pool. Typically
the weight for an expert is chosen as a function of his/her (perceived) accuracy for
the particular domain. In 1981, McConway [71] showed that the linear opinion pool
is the only combination scheme that satisfies the marginalization property. Consider
tion schemes that satisfy the marginalization property produce the same group decision
whether the marginal predictions are combined or the joint predictions are combined
then marginalized over. Marginalization is an intuitive property, however, there are two
main problems. First, there is no foundational approach for how the weights should be
allocated. Many studies have been devoted to find theoretically and empirically moti-
vated means for choosing expert weights [44], but this is still an open problem. Second,
axiomatic approaches have been criticized for their failure on particular straightfor-
ward examples. Lindley shows, in particular, that the marginalization property ignores
important information [68]. Another argument for linear opinion pools was made by
Genest and Schervish [45], who showed that the supra-Bayesian paradigm reduces to a
linear opinion pool when the decision maker asserts a value only for the mean of the
Supra-Bayesian Methods In his 1971 thesis, Morris showed that the appro-
priate way to combine predictions from multiple experts under uncertainty is to treat
the predictions as data in a new feature space, and to use Bayes’ rule to update the deci-
sion maker’s prior distribution [76]. This technique was later referred to as the modeling
approach, since it is necessary to model the joint predictive distribution of all experts,
and later called the supra-Bayesian approach [58]. In the supra-Bayesian framework,
predictions from base classifiers are treated as data by the meta-classifier, and Bayes’
rule is used to compute the ensemble decision. The ensemble decision is therefore

p(y | H, P_1, ..., P_m) ∝ p(P_1, ..., P_m | y, H) p(y | H),

where P_i = p_i(y|H) is the ith classifier’s probability distribution for y when its knowl-
edge is H. That is, the posterior probability distribution for the class label given the
combiner’s knowledge H and the predictions of each base model p(y|H, P1 , ..., Pm ) is
proportional to the product of the likelihood p(P1 , ..., Pm |y, H) of the experts’ decisions
given the true class, the combiner’s knowledge H and the prior distribution p(y|H) of
the combiner.
The main problem with supra-Bayesian methods was pointed out by Morris [77]
and Jacobs [58]; the difficulty is in specifying the likelihood function for the experts’
opinions given the data; the likelihood function must account for individual classifier cal-
cent research has focused on specifying a parametric model for the likelihood function,
2.3.1.6.
Garg et al. describe a simple graphical model for performing classifier combination
[43]. The classifiers are assumed to be conditionally independent of one another given
the ensemble classification. The Bayesian network is a tree of depth 2, with the root
being the ensemble classification and the leaves being the individual classifiers.
Ghahramani and Kim [46] describe a graphical model for combining classifiers,
called Bayesian classifier combination (BCC). They start with a simple graphical model
which assumes classifiers are independent (IBCC) and show how it can be embellished to
account for dependencies between classifiers. They describe a model called the enhanced
Bayesian classifier combination model, which uses separate graphs for easy and difficult
data points. A Markov network is added in the dependent BCC model to model the
correlations between classifiers directly. Finally, they combine dependent and enhanced
models to obtain a Markov network with the easy/difficult graph separation. Empirical
results are demonstrated for satellite, UCI digit and DNA datasets. The advantage of
BCC is shown to be greater when combining multiple different base classifiers trained on
the entire training set rather than combining different base classifiers trained on disjoint
sifiers are viewed as data in a new meta-feature space, and any classification algorithm
can be trained on this new problem. Wolpert also points out that the meta-feature space
can be augmented with the original inputs or with other relevant measures. Wolpert re-
stricted his focus to regression problems, but Ting and Witten later demonstrated that
stacked generalization can be used for classification problems as well, when using prob-
including model selection, majority voting, weighted averaging and nonlinear combina-
tion functions [101, 91]; in fact, Wolpert refers to model selection by cross-validation as
to an extraordinarily dumb level 1 generalizer”. Wolpert also showed that any classifier
combination technique can use embedded cross-validation to efficiently re-use all train-
ing data as validation data for training the combiner function. In this thesis, we refer
base classifiers as data in a new feature space. Below, we formalize the idea of stacked
generalization.
by aggregating the predictions of each classifier on a validation dataset Dval , and com-
bining them with the known labels. Table 2.2 shows an example of a meta-level dataset.
The meta-level dataset is used as input to the combiner classification algorithm. In or-
der to make a prediction on a new data point, each classifier makes its prediction, and
these predictions are input to the combiner function. The original problem of mapping
number of classifiers K^d ⇒ K^L ⇒ Ω.
2
Ting and Witten [95] mistakenly report that Wolpert’s original formulation addressed a classifica-
tion task rather than a regression task.
3
Wolpert referred to this as the level-1 dataset to emphasize that stacked generalization can be
extended to higher levels
Table 2.2: A meta-level stacking dataset for three classes, two base classifiers and five
examples, using Wolpert’s model [101].
ŷ0 ŷ1 y
ω1 ω1 c1
ω1 ω1 c1
ω2 ω3 c3
ω1 ω1 c1
ω1 ω1 c2
Ting and Witten [95] addressed the classification problem, using probability dis-
tributions as inputs to the combiner function rather than class predictions. In this case,
each classifier outputs normalized discriminant values for each of the possible classes.
The set of discriminant values is passed as an input vector to the combiner function,
which produces the final ensemble prediction. In Ting and Witten’s stacked gener-
problem: K^d ⇒ K^{L·c} ⇒ Ω. Table 2.3 shows an example of stacking using Ting and
Witten’s representation of the meta-feature space. Ting and Witten also pointed out
the need for regularization for the combiner function. They recommend using a simple
overfitting. MLR trains a linear model for each of the classes, and at prediction time
chooses the class for which the linear model outputs the highest value.
1992, he investigated a series of toy problems for a 1-dimensional piecewise linear func-
tion. This toy problem used a single classifier, and included the original features in the
meta-feature space, so that the meta-feature space was 2-dimensional. Wolpert’s second
numerical experiment involved the NETTalk text-to-speech program. The inputs are 7
Table 2.3: A meta-level stacking dataset for three classes, two base classifiers and five
examples, using Ting and Witten’s model [95]. ωi refers to the ith discriminant value,
ŷi is the ith model and ci is the predicted class. A single row (e.g. shown in bold)
corresponds to a meta-data point.
ŷ0 ŷ1
ω0 ω1 ω2 ω0 ω1 ω2 yi
0.8 0.1 0.1 0.6 0.3 0.3 c1
0.9 0.0 0.1 0.7 0.2 0.1 c1
0.2 0.7 0.1 0.0 0.1 0.9 c3
0.7 0.1 0.2 0.7 0.2 0.1 c1
0.9 0.1 0.0 0.6 0.4 0.0 c2
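For concreteness, a Ting-and-Witten-style meta-level data set such as Table 2.3 can be assembled by stacking each base classifier's class-posterior outputs side by side. The sketch below is illustrative only (it assumes scikit-learn-style predict_proba methods and omits the embedded cross-validation Wolpert recommends for re-using training data):

```python
import numpy as np

def build_meta_dataset(base_classifiers, X_val, y_val):
    """Each predict_proba call returns an (N, K) matrix of class posteriors;
    stacking them gives meta-level points of dimension L*K, as in Table 2.3."""
    Z = np.hstack([clf.predict_proba(X_val) for clf in base_classifiers])
    return Z, y_val   # a combiner (e.g. MLR) is then trained on (Z, y_val)
```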
The stacked generalization experiment combined three classifiers that each made predic-
tions for a single letter. Wolpert points out that the purpose of the experiment wasn’t
Breiman, 1996 Breiman investigated the Housing and Ozone datasets in his
1996 paper [12]. 50 CART (Classification And Regression Trees) regression trees were
used as the base classifiers in the first experiment. CART pruning yields subtrees, so
the 50 trees were nested subtrees. When using a regularized error function
for the combiner, Breiman found that a small number of combiner values had nonzero
weights; only a small number of models were being combined in the stacked regression.
Ting and Witten, 1999 [95] Ting and Witten investigate 10 problems:
Led24, Waveform, Horse, Credit, Vowel, Euthyroid, Splice, Abalone, Nettalk(s) and
Coding. They use C4.5, NB (a re-implementation of Naive Bayes) and IB1, a vari-
ant of the K-Nearest Neighbor algorithm as the base learners. They used C4.5, NB,
IB1 and MLR as combiner functions. No model selection is done at the base or meta-
levels. They show that MLR using confidence-level predictions beats model selection
by cross validation in all datasets, significant at over two standard errors. They also
show that stacked generalization with MLR has eight significant wins and two losses
(with insignificant differences) against majority vote. They also mention employing a
multilayer perceptron as the combiner function, and report that it had the same error
the original input features. This technique has been re-invented a few times in the
machine learning community; Chan and Stolfo [19] described this model as the class-
2006, Torres-Sospedra et al. called this technique stacked generalization plus in their
comparative study of combination techniques [96]. Bi-level stacking has the potential
to address the tradeoff between classifier selection and classifier combination. Since the
combiner function has access to the original inputs, it can select the classifier (or a
combination of classifiers) known to be more accurate in that region of the input space.
Poor results have been reported for bi-level stacking in the above references, which may
of bi-level stacking combined three base level classifiers with a decision tree, rule
set induction or a neural network; no parameter tuning was performed for the
tions from base classifiers as well as the original input features, the meta-feature
In a 2001 technical report, Breiman says (in a section titled “My Kingdom for
Some Good Theoretical Explanations”) [15] “The area of ensemble algorithms is filled
with excellent empirical results, but the understanding of how they work is a scarce
vated, but proposed axioms inevitably entail stronger and more restrictive implications
than intended, and tend to exhibit counterintuitive behavior on simple examples [100].
Winkler [100] says “The problem I have with the axiomatic approach (and this does
not apply to Morris) is that it is sometimes done in the spirit of a search for a single,
ing approach (also known as the supra-Bayesian approach) in which expert predictions
are taken as data and combined under Bayes’ rule is theoretically well-founded, but to
distribution over expert predictions which is often difficult or impossible [58]. In this
section, we identify other proposed theoretical models for explaining ensemble behavior.
of random forests in [14] based on Chebychev’s inequality. The margin function for a
where Θ is drawn i.i.d. from a probability distribution used to guide induction of the
decision trees. Informally, the margin function measures the expectation value of how
many more votes are cast for the correct class than for the next highest voted class.
By defining the strength of a set of classifiers to be the expectation value of its margin
error of the random forest in terms of the strength and correlation of the classifiers.
Informally, this inequality states that in any probability distribution, nearly all values
are close to the mean. The generalization error of a voted ensemble is given as the
probability that the margin function is negative: E* = P_D(m(x, y) < 0). Therefore,
of the margin function. Breiman goes on to show that for random forests, or indeed any
set of classifiers whose construction is guided by i.i.d. random sampling, the variance
of the margin function can be written as σ_m = ρ̄(1 − s²), where ρ̄ is the mean value of
the classifier correlation. Therefore an upper bound for the generalization error is given
by E* ≤ ρ̄(1 − s²)/s². Breiman points out that this bound is likely to be a loose upper
bound, but that it explains the success of particular types of ensembles and motivates
of a single classifier for a regression problem can be decomposed into contributions from
bias and variance [50]. Some work has been done to generalize this result to classification
problems [59]. The error in a single classifier prediction under the squared error loss can
be written as E(x) = σ² + [E ŷ(x) − y(x)]² + E[ŷ(x) − E ŷ(x)]². The first term on the
right hand side is the irreducible (Bayes optimal) error. The second term is the square
of the bias and the final term is the variance. In the regression case for ensembles, a
decomposition in terms of the average bias, average variance and average covariance is
shown for classifier ensembles. These results show the tradeoff between bias, variance
Added Error of the Ensemble Tumer and Ghosh [97] describe a mathemat-
ical framework that gives the relationship between classifier correlation and ensemble
error for the average combiner by approximating the decision boundaries as linear near
the Bayes optimal decision boundary. They show that when the classifiers are uncor-
of diversity with a model for the benefit due to diversity, we may obtain an understand-
ing of ensemble behavior. The above theoretical models have depicted the ensemble
accuracy in terms of the classifier correlation; however, recent work has searched for
other diversity measures with desirable properties. Diversity measures can be roughly
pairwise diversity is averaged across all L(L − 1)/2 pairs of classifiers in the ensemble.
A simple pairwise measure known as the disagreement measure is the probability that
where y_i^− indicates that classifier i is incorrect and y_i^+ indicates that classifier i is
failure diversity, as well as others [62]. The entropy measure estimates the amount
of disagreement in the votes, with a maximum when the votes are very nearly split.
classifier, and has been shown to differ from the averaged disagreement measure by a
coefficient [62].
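As an illustration (not taken from the thesis), the averaged pairwise disagreement can be estimated from validation-set predictions as the fraction of examples on which exactly one classifier of a pair is correct, averaged over all L(L − 1)/2 pairs:

```python
import numpy as np
from itertools import combinations

def disagreement(pred_i, pred_j, y):
    """Fraction of examples on which exactly one of the two classifiers is correct."""
    return np.mean((pred_i == y) != (pred_j == y))

def mean_pairwise_disagreement(predictions, y):
    """Average disagreement over all pairs; `predictions` is a list of label vectors."""
    pairs = combinations(range(len(predictions)), 2)
    return np.mean([disagreement(predictions[i], predictions[j], y) for i, j in pairs])
```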
the group decision [108, 107]. There are two main reasons to prune ensembles: to reduce
the computational requirements (both storage space and prediction time) and to increase
example, for ensembles with 10 classifiers, there are 1023 possible subsets, but for 100
classifiers, there are over 10^30 classifier subsets. Therefore suboptimal search techniques
(described in Section 2.3.1.6) with a trainable nonlinear combiner function have the
potential to learn which classifiers are most appropriate to combine for a problem, but
are prone to overfitting. Ensemble pruning, on the other hand, is a discrete technique
other techniques have been proposed for pruning classifier ensembles; we discuss the
and ensemble pruning approaches, come from the neural network ensemble literature.
Perrone and Cooper [80] studied the case of neural networks for regression, noting that in
the regression case it is possible to derive a closed form solution for weights in a weighted
combination model. They also address the issue of network pruning; the weights for a
authors point out that linearly dependent rows or columns indicate dependent classifiers,
and they recommend subsampling the population of neural networks to assure that C
$$\min_{x} \; x^{T} G x \quad \text{subject to} \quad \sum_{i} x_i = k, \qquad x_i \in \{0, 1\}$$
The matrix G is constructed to reflect classifier errors on the diagonal and correlation
off the diagonal. An ad hoc scheme for measuring the correlation of classifiers on
validation data is used; the authors report that they tried several ad hoc measures
and arrived at similar results. The authors used this technique to prune ensembles
produced by Adaboost, and report that they are able to maintain the same accuracy
with a dramatic reduction in the number of classifiers, in some cases obtaining improved
predictive accuracy.
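A brute-force sketch of the subset-selection problem above; it is feasible only for small ensembles, which is exactly the combinatorial growth noted earlier, and neither the matrix G nor this search routine is taken from the cited papers' implementations:

```python
import numpy as np
from itertools import combinations

def prune_by_exhaustive_search(G, k):
    """Return the size-k subset of classifiers minimizing x^T G x, where x is the
    0/1 indicator vector of the chosen subset (searches all C(L, k) subsets)."""
    L = G.shape[0]
    best_subset, best_score = None, np.inf
    for subset in combinations(range(L), k):
        x = np.zeros(L)
        x[list(subset)] = 1.0
        score = x @ G @ x
        if score < best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```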
set of classifiers and choose which will participate in the ensemble decision. However, in
the literature, the overproduce and choose paradigm typically refers to a set of heteroge-
neous classifiers (different classifier types built from a variety of learning algorithms and
learning parameters). The idea is that usage of a wide variety of algorithms will increase
the probability that there will be good models that are amenable to combination, and
diversity can be fostered by using different algorithms rather than subsampling from
the dataset.
In 2000, Sharkey and Sharkey [93] reviewed work that focused on constructing
many classifiers, then trying to identify the most appropriate subset to participate in the
ensemble decision [80, 78, 49], referring to them collectively as test and select methods.
This idea was further refined and generalized by Roli et al. in 2001 [86], and dubbed
the overproduce and choose paradigm. In the overproduce phase, many classifiers are
trained on the training data, using different algorithms, learning parameters, feature
subsets or resamplings of the dataset. Next, classifiers are selected in order to optimize
the accuracy of the ensemble; this requires choosing classifiers that are individually
accurate, but also different from one another. Roli et al. categorize selection rules as
heuristics, diversity measures, clustering and search methods. By analogy with feature
selection, we identify the first three techniques as filter methods, since they can be
computed without evaluation of the ensemble. Search methods, on the other hand,
are similar to wrapper methods, since estimates of the ensemble performance are used
to guide the search, and the search technique can be “wrapped around” any combiner
function.
ble selection evaluate classifier subsets together with the combiner function and overall
performance metric in order to choose classifier subsets. The wrapper method can be
contrasted with the filter method, which uses a statistical measure in order to perform
subset selection. This definition of filter and wrapper methods can be applied to fea-
ture selection, ensemble selection and meta-feature selection. Ensemble Selection from
Libraries of Models [17] is a wrapper method for ensemble selection that uses a forward
stepwise approach to construct the ensemble from a library of models. Classifiers are
selected with replacement, which produces a weighting of the final linear combination
function. Since this is a wrapper method, it is possible to use any metric to tune the
ensemble. Other stepwise selection procedures are possible such as backward selection,
The No-Free-Lunch Theorem [102] states that any two machine learning algo-
rithms will have identical performance averaged across all problems. We may be tempted
to assume that this theorem merely applies to base-level classifiers, and that model se-
cross-validation) are subject to the same kind of bias as base-level classifiers. When
able algorithm and model structure, including its inductive bias. When domain-specific
knowledge is unavailable, we must acknowledge that the implicit or explicit biases en-
tailed by our base- or meta-learning algorithms may be a poor match for the problem
at hand.
sifiers which each estimate the same (target) function. In the next section, we discuss
methods for constructing and combining classifiers that each estimate different func-
tions.
solve different parts of the overall problem. This includes the separate paradigms of
sensor fusion [3], classifier selection [62] (including mixture of experts techniques) and
In these cases, the classifiers are no longer estimating the same target function, but are
and background information on various techniques that are used to solve multiclass
2.3.2.1 One-vs-All
One of the simplest and most widely used techniques for reducing a multiclass
also known as unordered class binarization [40] or one-vs-rest (or 1vr) [105]. Using
classification subproblems, one for each class (see Figure 2.2). In the ith subproblem,
the classifier is trained to distinguish whether the instance belongs to class i or not. At
prediction time, the classifier with the highest output is chosen (alternatively, voting or
disagreement in the literature about the terminology for the one-vs-all reduction; Rifkin
and Klautau [85] use the term one-vs-all to indicate winner-take-all with continuous
outputs (i.e. choosing the classifier with the maximum output); other research such
as Beygelzimer et al. [5] refer to one-vs-all as it is used with Hamming decoding (e.g.
discrete outputs as in error-correcting output coding [27], and randomizing over the
selected classes). Here, we use the term OVA to refer to continuous winner-take-all one-
vs-all and we refer to the discrete version as OVA with Hamming decoding. Rifkin and
Klautau studied the one-vs-all technique, and compared it to other methods for reducing
multiclass to binary and to other SVM techniques that provide direct optimization on
the entire multiclass problem [85]. Their main thesis is that it is essential to perform
model selection and that under appropriate model selection, one-vs-all tends to perform
as accurately as other multiclass SVM methods. Rifkin and Klautau’s technique for
model selection is to choose one set of hyperparameters for all subproblems, rather
hyperparameters for all subproblems jointly. Rifkin and Klautau don’t explicitly state
that they use the shared-model paradigm for model selection, but it is implied since one
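A minimal sketch of the continuous winner-take-all one-vs-all scheme described in this subsection; it is an illustration rather than the thesis implementation, the SVM base learner is an arbitrary choice, and scikit-learn's OneVsRestClassifier offers an equivalent ready-made version:

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC

def train_one_vs_all(X, y, base=SVC()):
    """Train one binary classifier per class, separating it from all other classes."""
    classes = np.unique(y)
    return classes, [clone(base).fit(X, (y == c).astype(int)) for c in classes]

def predict_one_vs_all(classes, models, X):
    """Winner-take-all: choose the class whose classifier produces the largest output."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return classes[np.argmax(scores, axis=1)]
```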
2.3.2.2 All-Pairs
Figure 2.3: Illustration of an A-C decision boundary in a 2D, 3-class example of the
All-Pairs reduction.
Figure 2.3). At prediction time, each binary classifier votes for one class, and the class
with the most votes is selected as the multiclass prediction. Note that in contrast to
dicted probabilities or confidence predictions from each binary classifier, instead using
only a discrete vote from each, though it is possible to use the all-pairs encoding with
a decoding function other than Hamming decoding in Loss Based Decoding. Friedman
shows that Bayes optimal binary classifiers combine to produce a Bayes optimal multi-
class classifier, and therefore each binary subproblem can be solved independently and
where K is the set of possible labels, ω is the true label, x is the input feature vector
$$\hat{y}(x) = \operatorname*{argmax}_{k \in K} \sum_{i \in K} \mathbf{1}\!\left(\frac{p_k}{p_k + p_i} > \frac{p_i}{p_k + p_i}\right)$$

$$\hat{y}(x) = \operatorname*{argmax}_{k \in K} \sum_{i \in K} \mathbf{1}\big(p(\omega = k \mid \omega \in \{i, k\}, x) > p(\omega = i \mid \omega \in \{i, k\}, x)\big)$$
Therefore, given reliable p(ω = k|ω ∈ {i, k}, x), binary reduction under the All-Pairs re-
duction is equivalent to the true Bayes optimal decision. Note that this analysis assumes
that p(ω = k|ω ∈ {i, k}, x) can be determined exactly for each subproblem, whereas in
the outputs from each model must be commensurate with one another.
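A sketch of voted pairwise classification (the discrete-vote all-pairs scheme discussed above); it is illustrative rather than the thesis code, and the SVM base learner is an arbitrary placeholder:

```python
import numpy as np
from itertools import combinations
from sklearn.base import clone
from sklearn.svm import SVC

def train_all_pairs(X, y, base=SVC()):
    """Train one binary classifier for every pair of classes (i, j), i < j."""
    classes = np.unique(y)
    models = {}
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)
        models[(i, j)] = clone(base).fit(X[mask], (y[mask] == j).astype(int))
    return classes, models

def predict_all_pairs(classes, models, X):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    col = {c: t for t, c in enumerate(classes)}
    for (i, j), m in models.items():
        pred = m.predict(X)              # 1 means "class j", 0 means "class i"
        votes[pred == 1, col[j]] += 1
        votes[pred == 0, col[i]] += 1
    return classes[np.argmax(votes, axis=1)]
```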
While one-vs-all and all-pairs are the most widely studied and employed tech-
niques for reducing multiclass to binary, they are only two cases within the more general
output coding. Though we focus our experimental studies on the one-vs-all and all-pairs
reductions, we also describe these other frameworks, since they must also address the
issue of model selection and they are generalizations of the methods under our study.
framework was proposed by Dietterich and Bakiri in 1995 [27]. This scheme is named
for its similarity to error correcting codes in information theory, with the analogy that
the instance’s class is a message to be transmitted, and that error correcting codes are
employed to encode the message in order to make the transmission (or classification)
more tolerant of errors. ECOC requires all classes to appear in each subproblem, and
allows an arbitrary specification of how classes are reassigned to subproblems. The data
structure used to specify how classes are reassigned to subproblems is called the coding
might be assigned to the positive indicator class. This would correspond to a row in
the coding matrix equal to {+1, −1, +1, +1, −1}. At prediction time, each subproblem
classifier votes for or against membership in the positive indicator class, and the class
with the most votes is selected as the multiclass prediction, breaking ties randomly.
The number of unique and nontrivial binary splits (codewords) for a set of k
The other splits are different binary problems constructed from the original class labels.
Dietterich and Bakiri [27] proposed using as many of these dichotomies as computa-
tionally feasible in order to improve the multiclass prediction. At prediction time, each
base classifier is evaluated, and the class label with the minimum Hamming distance
to the predicted codeword is used as the multiclass prediction. Dietterich and Bakiri
recommend using all possible dichotomies when the number of classes is 7 or less; when
there are more classes, a random sampling of dichotomies is typically used. The error-
composite subproblems.
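A sketch of ECOC training and Hamming decoding for a user-supplied coding matrix (rows index classes, columns index subproblems, entries in {+1, −1}); the decision-tree base learner is an arbitrary choice, ties are broken by taking the first class rather than randomly, and nothing here reproduces Dietterich and Bakiri's implementation:

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def train_ecoc(X, y, code, classes, base=DecisionTreeClassifier()):
    """Train one binary classifier per column of `code` (shape: n_classes x n_subproblems)."""
    row = {c: r for r, c in enumerate(classes)}
    targets = np.array([row[label] for label in y])
    return [clone(base).fit(X, (code[targets, s] == 1).astype(int))
            for s in range(code.shape[1])]

def predict_ecoc(models, X, code, classes):
    """Choose the class whose codeword is closest in Hamming distance to the predictions."""
    preds = np.column_stack([m.predict(X) for m in models]) * 2 - 1   # back to +/-1
    hamming = (preds[:, None, :] != code[None, :, :]).sum(axis=2)     # (N, n_classes)
    return classes[np.argmin(hamming, axis=1)]
```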
Loss Based Decoding In 2000, Allwein et al. generalized the ECOC frame-
work to Loss Based Decoding, which (a) accounts for continuous instead of discrete
classifier outputs and (b) allows some subproblems to optionally ignore some of the
classes in the data set [2]. Incorporating continuous output makes it possible to rep-
resent the One-vs-All technique in the Loss Based Decoding framework (as we show
below), and allowing subproblems to omit subsets of data points makes it possible to
represent the All-Pairs technique. Loss Based Decoding was further generalized by
Crammer and Singer in 2000 to Continuous Output Coding [22], in which each class in
a subproblem has some continuous weight w ∈ R rather than w ∈ {−1, +1} as in ECOC
all one-vs-all has been lacking. In this section, we show that using the loss function
L(z) = (1 − z)² in one-vs-all loss-based decoding yields the same predictions as continu-
ous winner-take-all one-vs-all. Our experimental studies use this result in implementing
2.4 Conclusion
selection and commensurate and complementary model combination. The next chapter
Chapter Abstract
it tends to overfit unless the combiner function is sufficiently smooth. Previous stud-
ies attempt to avoid overfitting by using a linear function at the combiner level. This
chapter demonstrates experimentally that even with a linear combination function, reg-
standard linear least squares regression can be regularized with an L2 penalty (ridge
regression). In multiclass classification, sparse linear models select and combine individ-
perimental studies show that the dense ridge regularization is much more effective than
3.1 Introduction
the ensemble prediction [26, 86, 62]. Simple techniques such as voting or averaging can
to account for the strengths and weaknesses of the base classifiers and to produce a
the outputs of the base-level classifiers are viewed as data points in a new feature
space, and are used to train a combiner function [101]1 . Ting and Witten [95] applied
linear combiner outperformed several nonlinear combiners for their problem domains
and selection of base classifiers. They also showed that in classification problems, it is
more effective to combine predicted posterior probabilities for class membership than
class predictions.
Caruana et al. [18] evaluated stacked generalization with logistic regression with
thousands of classifiers on binary classification problems, and reported that stacked gen-
remedy this overfitting and improve overall generalization accuracy through regulariza-
tion.
ror at the cost of slightly increased bias error—this is known as the bias-variance trade-
off [50]. In this chapter, regularization is applied to linear stacked generalization for
gression [50], lasso regression [50], and elastic net regression [109] are used to regularize
the regression model by shrinking the model parameters. Lasso regression and some
settings of elastic net regression generate sparse models, selecting many of the weights
to be zero. This result means each class prediction may be produced by a different
are used to build a library of base models as in Caruana et al. [17]. We also perform
resampling at the ensemble level in order to obtain more statistically reliable estimates
of performance without the expense of retraining base classifiers. We look at the corre-
This chapter is organized as follows: Section 3.2.1 formally describes stacked gen-
into several regression problems and the class-conscious extension, StackingC. Section
3.2.2 describes linear regression, ridge regression, lasso regression and elastic net re-
gression, which are used to solve the indicator subproblems in stacked generalization.
Section 3.3 describes empirical studies that indicate the advantage of regularization.
Section 3.4 discusses the results and Section 3.5 concludes with a summary and future
work.
3.2 Model
Given a set of L classifiers ŷi (x|θ), i = 1..L, the predictions of each classifier
on a validation dataset Dval are aggregated and combined with the known labels to
this meta-level validation dataset. Given a test point, the predictions of all base-
level classifiers are combined to produce a new data point x0i . The combiner func-
tion is evaluated at the new data point x0i , and its output is taken as the ensem-
sg(x) = c(y11 (x), ..., y1K (x), ..., yL1 (x), ..., yLK (x)), where x is the test point, c is the
classifier combiner function and ylk is the posterior prediction of the lth classifier on the
k th class. Following Ting and Witten, a regression function can be used at the meta-
level by constructing one regression subproblem per class with an indicator function;
this is the so-called multi-response linear regression (MLR) formulation [95]. At predic-
tion time, the class corresponding to the subproblem model with the highest output is
The most general form of stacked generalization includes all outputs from all
approach in which each indicator model is trained using predictions on the indicated
class only, called StackingC [90]. Formally, the StackingC class prediction is given by
sc(x) = argmaxk rk (y1k (x), ..., yLk (x)), where x is the test point, k = 1..K is an index
over classes, rk is the regression model for the indicator problem corresponding to the
k th class and ylk is the posterior prediction of the lth classifier for the k th class. Ting
and Witten report that StackingC gives comparable predictive accuracy while running
considerably faster than stacked generalization [95], and Seewald also reports increased
predictive accuracy. Based on these arguments, the experiments in this chapter use
StackingC rather than complete stacked generalization. With Ting and Witten’s MLR,
the predictive model is p̂_j(x) = Σ_{i=1..L} w_ij y_ij(x), where p̂_j(x) is the predicted probability for class c_j, w_ij is the weight corresponding to classifier y_i and class c_j, and y_ij(x) is the ith classifier's output on class c_j. An example of this model is illustrated in Figure 3.1.
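As a concrete illustration of this combiner, the following minimal sketch (NumPy only; the array names are illustrative, not the thesis implementation) evaluates p̂_j(x) = Σ_i w_ij y_ij(x) for a single test point and predicts the class with the largest indicator-regression output:

import numpy as np

def stackingc_predict(posteriors, weights):
    # posteriors: (L, K) array, entry (i, j) = y_ij(x), the ith base
    #             classifier's posterior for class j at test point x.
    # weights:    (L, K) array, entry (i, j) = w_ij learned on the meta-level data.
    # StackingC is class-conscious: class j's score uses only class-j posteriors.
    p_hat = (weights * posteriors).sum(axis=0)   # p_hat_j = sum_i w_ij * y_ij(x)
    return int(np.argmax(p_hat))

# Toy example with L = 2 base classifiers and K = 3 classes.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.5, 0.3, 0.2]])
weights = np.array([[0.6, 0.5, 0.4],
                    [0.4, 0.5, 0.6]])
print(stackingc_predict(posteriors, weights))    # prints 0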
The least squares solution is given by ŷ = β̂x, where β̂ = argmin_β Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)². Here β̂ is the vector of model parameters determined by the regression, N is the number of training data points, y_i is the true output on data point i, x_ij is the jth feature of the ith data point, and p is the number of input dimensions for the problem.
In StackingC, the features are predicted probabilities from base classifiers. When the
Figure 3.1: Example illustration of the StackingC and Multi-Response Linear Regression
model used in our experiments for a 4-dimensional input vector in a 3-class classification
problem. The prediction for class cA is highlighted.
linear regression problem is underdetermined, there are many possible solutions. This
situation can occur when the dimensionality of the meta-feature space L is larger than
the effective rank of the input matrix (at most N ), where L is the number of classifiers
and N is the number of training points. In this case, it is possible to choose a basic
solution, which has at most m nonzero components, where m is the effective rank of the
input matrix.
Ridge regression augments the linear least squares problem with an L2-norm constraint: P_R = Σ_{j=1}^{p} β_j² ≤ s. This has the effect of conditioning the matrix inversion
Lasso regression augments the linear least squares problem with an L1-norm constraint: P_L = Σ_{j=1}^{p} |β_j| ≤ t. The L1-norm constraint makes the optimization problem
Unlike ridge regression, lasso regression tends to force some model parameters to be
identically zero if the constraint t is tight enough, thus resulting in sparse solutions.
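To make the penalties concrete, the sketch below fits one regularized indicator regression per class on the StackingC meta-features, using scikit-learn's Ridge, Lasso and ElasticNet as stand-ins for the Matlab ridge, LARS and glmnet implementations used in the experiments; the function and variable names are illustrative, and the elastic net variant is discussed next.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

def fit_stackingc(meta_features, labels, n_classes, penalty="ridge",
                  lam=1.0, alpha=0.95):
    # meta_features: dict mapping class j to an (N, L) array whose column i is
    #                base classifier i's posterior for class j (StackingC features).
    # labels:        (N,) integer class labels of the meta-level training points.
    models = []
    for j in range(n_classes):
        y_indicator = (labels == j).astype(float)          # indicator target for class j
        if penalty == "ridge":
            reg = Ridge(alpha=lam)                          # L2 penalty: dense weights
        elif penalty == "lasso":
            reg = Lasso(alpha=lam)                          # L1 penalty: sparse weights
        else:
            reg = ElasticNet(alpha=lam, l1_ratio=alpha)     # L1/L2 mixture
        models.append(reg.fit(meta_features[j], y_indicator))
    return models

def predict_stackingc(models, meta_features):
    scores = np.column_stack([m.predict(meta_features[j])
                              for j, m in enumerate(models)])
    return scores.argmax(axis=1)                            # class with largest output

In this sketch lam plays the role of the penalty λ and alpha that of the elastic-net mixing parameter α; in the experiments below a separate λ is selected for each subproblem by cross-validation.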
Table 3.1: Data sets used in the experimental studies, and their properties

Zou and Hastie describe a convex combination of the ridge and lasso penalties called the elastic net [109]. The penalty term is given by P_en(β|α) = (1 − α)·(1/2)‖β‖²_ℓ2 + α‖β‖_ℓ1, where 0 ≤ α ≤ 1 controls the amount of sparsity. The elastic net is particularly
effective when the number of predictors p (or classifiers in StackingC) is larger than
the number of training points n. The elastic net performs groupwise selection when
there are many correlated features (unlike the lasso, which instead tends to select a
single feature under the same circumstances). When there are many excellent classifiers
to combine, their outputs will be highly correlated, and the elastic-net will be able to
attributes and k ≥ 3 classes. Table 3.1 indicates the datasets and relevant properties.
For the 26-class letter dataset, we randomly subsampled a stratified selection of 4000
points.
Approximately half the data points (with stratified samples) in each problem are
used for training the base classifiers. The remaining data is split into approximately
equal disjoint segments for model selection at the ensemble level (e.g. stacking training
data or select-best data) and test data, again in stratified samples. In a real-world
application, the base classifiers would be re-trained using the combination of base-
level data and validation data once ensemble-level hyperparameters are determined, but
this additional training is not done in this study due to the expense of model library
construction.
Previous studies with L ≥ 1000 classifiers obtain one sample per problem, with
no resampling due to the expense of model library creation, such as in Caruana et al.
[18]; we partially overcome this problem by resampling at the ensemble training stages.
In particular, we use Dietterich’s 5x2 cross-validation resampling [25] over the ensemble
training data and test data. We use the Wilcoxon signed-rank test for identifying
statistical significance of the results, since the accuracies are unlikely to be normally
distributed [24].
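As a small illustration of the significance test (a sketch using SciPy rather than the tooling used in the thesis; the accuracy values are made up), the Wilcoxon signed-rank test is applied to the paired per-fold accuracies of two ensemble methods over the 10 ensemble-level folds:

from scipy.stats import wilcoxon

# Paired accuracies of two ensemble methods on the same 10 ensemble-level folds
# of Dietterich's 5x2 cross-validation (illustrative numbers only).
acc_method_a = [0.926, 0.931, 0.924, 0.929, 0.933, 0.921, 0.927, 0.930, 0.925, 0.928]
acc_method_b = [0.861, 0.874, 0.858, 0.869, 0.872, 0.855, 0.866, 0.870, 0.860, 0.867]

statistic, p_value = wilcoxon(acc_method_a, acc_method_b)
print(statistic, p_value)   # a small p-value indicates a significant difference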
machines, k-nearest neighbors, decision stumps, decision trees, random forests, and
We generate about 1000 classifiers for each problem. For each classification algo-
rithm, we generate a classifier for each combination of the parameters specified below.
All implementations are in Weka except for the Random Forest (R), for which we used
the R port of the Breiman-Cutler code by Andy Liaw, available through CRAN.
(6) Random Forest (Weka) numTrees={1, 2, 30, 50, 100, 300, 500}
(9) Random Forest (R) numTrees={1, 2, 30, 50, 100, 300, 500}
on the held-out ensemble training set (select-best). We also compare our linear models
to voting (vote) and averaging (average) techniques. The StackingC approaches are
For the majority of our datasets, there are more linear regression attributes p
(same as the number of classifiers L) than data points n (equal to the number of stacking
in the Matlab mldivide function, which provides a sparse solution based on the QR
factorization.
In order to select the ridge regression penalty, we search over a coarse grid
of λ = {0.0001, 0.01, 0.1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024} using cross-validation,
then use all validation data to train the ridge regression model with the selected penalty
² The waveform, letter, optdigits and sat-image datasets are exceptions.
Table 3.2: Accuracy of each model for each data set. Entries are averages over the
10 samples from Dietterich’s 5x2 cross-validation at the ensemble level. Variances are
omitted based on arguments in Demšar [24]. See Section 3.3 for a description of the
methods and Section 3.4 for discussion.
Dataset          select-best   vote      average   sg-linear   sg-lasso   sg-ridge
balance-scale 0.9872 0.9234 0.9265 0.9399 0.9610 0.9796
glass 0.6689 0.5887 0.6167 0.5275 0.6429 0.7271
letter 0.8747 0.8400 0.8565 0.5787 0.6410 0.9002
mfeat-m 0.7426 0.7390 0.7320 0.4534 0.4712 0.7670
optdigits 0.9893 0.9847 0.9858 0.9851 0.9660 0.9899
sat-image 0.9140 0.8906 0.9024 0.8597 0.8940 0.9257
segment 0.9768 0.9567 0.9654 0.9176 0.6147 0.9799
vehicle 0.7905 0.7991 0.8133 0.6312 0.7716 0.8142
waveform 0.8534 0.8584 0.8624 0.7230 0.6263 0.8599
yeast 0.6205 0.6024 0.6105 0.2892 0.4218 0.5970
parameter. We use the Matlab implementation of ridge regression from the Matlab
rather than choosing a single λ for all subproblems. For example, the regularization
hyperparameter for the first indicator problem λ1 may differ from λ2 . For lasso regres-
sion, we use the LARS software by Efron and Hastie [32], and search over a grid of
increments of 0.01 to select the regularization penalty term by cross-validation for each
subproblem. We search over a finer grid in sg-lasso than in sg-ridge since model selection
is much more efficient in LARS. For the elastic net, we use the glmnet package written
by Friedman, Hastie and Tibshirani and described in the corresponding technical report
[36].
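The per-subproblem penalty selection can be sketched as follows (scikit-learn shown as a stand-in for the Matlab and LARS implementations; the function name and the inner cv=5 are illustrative), with the coarse λ grid from the text and one cross-validated choice per class indicator problem:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

LAMBDA_GRID = [0.0001, 0.01, 0.1, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

def select_ridge_per_subproblem(meta_features, labels, n_classes, cv=5):
    # meta_features: dict mapping class j to its (N, L) StackingC feature matrix.
    # Returns, for each class indicator problem, the selected lambda and the refit model.
    chosen = []
    for j in range(n_classes):
        y_indicator = (labels == j).astype(float)
        search = GridSearchCV(Ridge(), {"alpha": LAMBDA_GRID},
                              scoring="neg_mean_squared_error", cv=cv)
        search.fit(meta_features[j], y_indicator)
        chosen.append((search.best_params_["alpha"], search.best_estimator_))
    return chosen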
3.4 Results
The test set accuracies of all ensemble methods are shown in Table 3.2. Each
entry in this table is an average over 10 folds of Dietterich’s 5x2 cross-validation [25]
ranks test [24], ridge regression StackingC outperforms unregularized linear regression at
p ≤ 0.084, and has more wins than any other algorithm. On two problems, select-best
outperforms all model combination methods. On all problems, sg-linear and sg-lasso
perform less accurately than sg-ridge; this result suggests that it may be more produc-
tive to assign nonzero weights to all posterior predictions when combining several base
classifiers. A possible explanation for the superiority of ridge regularization over lasso
regularization is that lasso is often more effective when it is able to perform feature
selection by discarding irrelevant inputs or inputs that are negatively correlated with
the target values; however, the base classifiers are predominantly positively correlated
with the target prediction values, so lasso regularization is forced to throw away good
predictors. In other words, lasso regularization actively searches for uncorrelated in-
put models; however, since all models are estimates of the same target value, the lasso
To study the effect of regularization on each subproblem, we plot the root mean
Figure 3.2(a) shows the root mean squared error in the first subproblem in the sat-image
dataset.⁴ As the ridge penalty λ increases from 10⁻⁸ to 10³, the error decreases by more than 10%. With such a small penalty term, the error at 10⁻⁸ roughly corresponds
to the error that would be obtained by unregularized linear regression. For individual
³ Thanks to Abhishek Jaiantilal for pointing out this explanation.
⁴ The root mean squared error is used instead of the accuracy because the subproblem in multi-response is a regression problem.
Figure 3.2(b) shows the overall accuracy of the multi-response linear regression
with ridge regularization for the sat-image dataset. Regularization increases the accu-
racy of the overall model by about 6.5%, peaking around λ = 10³. As the penalty is increased beyond 10³ (not pictured), the accuracy decreases, reaching 0.24, the proportion of the predominant class, around λ = 10⁸. Please note that in this figure, λ is the
Figure 3.2(c) shows the correlation between the accuracy of the overall multi-
response linear regression system and the root mean squared error on the first subprob-
lem. The fit is approximately linear, with a = −0.408e + 0.957, where a is the accuracy
of the multiclass classifier and e is the RMSE of the classifier on the first indicator
subproblem.
Figure 3.3(a) shows the overall accuracy of the multi-response linear regression
system as a function of the penalty term for lasso regression for the sat-image problem.
Standard errors over the 10 folds are indicated. As in the ridge regression case, λ is the
same over all subproblems in this figure. The accuracy falls dramatically as the penalty
increases beyond 0.2, stabilizing after λ = 0.50 at an accuracy of 0.24, the proportion
In order to view the effect of the elastic net’s mixing parameter α on the accuracy
of the multi-response system, accuracy vs penalty curves are plotted in Figure 3.3(b) for
α = {0.05, 0.5, 0.95}. The α = 1.0 curve indicated in Figure 3.3(a) is highly similar to
the α = 0.95 curve, and therefore omitted from Figure 3.3(b) for clarity. With a small
penalty term λ ≤ 10⁻¹, the curves are constant, and within one standard deviation of
the select-best curve. As the penalty increases, the accuracy reaches a maximum that
Figure 3.2: Figure 3.2(a) shows root mean squared error for the indicator problem for
the first class in the sat-image problem. Figure 3.2(b) shows overall accuracy as a
function of the ridge parameter for sat-image. The error bars indicate one standard
deviation over the 10 samples of Dietterich’s 5x2 cross validation. Figure 3.2(c) shows
the accuracy of the multi-response linear regression system as a function of mean squared
error on the first class indicator subproblem for ridge regression.
Figure 3.3: Overall accuracy of the multi-response linear regression system as a function
of the penalty term for lasso regression for the sat-image problem, with standard errors
indicated in Figure 3.3(a). Figure 3.3(b) shows accuracy of the multi-response linear
regression system as a function of the penalty term for α = {0.05, 0.5, 0.95} for the elastic
net for sat-image. The constant line indicates the accuracy of the classifier chosen by
select-best. Error bars have been omitted for clarity, but do not differ qualitatively from
those shown in Figure 3.3(a).
Figure 3.4: Coefficient profiles for the first three subproblems in StackingC for the sat-image dataset with elastic net regression at α = 0.95, over a single partition of ensemble training data. (Panels: Class 1, Class 2, Class 3; horizontal axis: Log Lambda.)

[Corresponding coefficient-profile panels for classes 4-6 of the sat-image dataset appear here.]

of the total weight. In lasso regression and some settings of elastic net regression, it is possible to obtain a sparse model in which many weights are identically zero. The sparse model reduces computational demand at prediction time and makes it possible to identify a small subset of base classifiers and predictions that are responsible for
dataset, for classes 1-3 with elastic net regularized StackingC with α = 0.95. The
line. At λ = λopt , only 244 of the 999 classifiers are assigned weight for any of the
subproblems. Table 3.3 shows the 6 classifiers with the highest total sum of weights
for all classes. Sparse models obtained by L1-regularized linear regression can choose
different classifiers for each class—that is, classwise posterior predictions are selected
instead of complete classifiers. For instance, the classifier assigned the most total weight
is k = 1-nearest neighbor, which contributes to the response for classes 3-6, but doesn’t
appear in the predictions for classes 1 or 2. The base classifier that makes the largest
contribution to the class-1 prediction is boosted decision trees run for 500 iterations.
Table 3.3: Selected posterior probabilities and corresponding weights for the sat-image
problem for elastic net StackingC with α = 0.95. Only the 6 models with highest
total weights are shown here. ann indicates a single-hidden-layer neural network, with the appended numbers giving the corresponding momentum, number of hidden units, and number of epochs used in training.
Classifier        class-1   class-2   class-3   class-4   class-5   class-6   total
adaboost-500 0.063 0 0.014 0.000 0.0226 0 0.100
ann-0.5-32-1000 0 0 0.061 0.035 0 0.004 0.100
ann-0.5-16-500 0.039 0 0 0.018 0.009 0.034 0.101
ann-0.9-16-500 0.002 0.082 0 0 0.007 0.016 0.108
ann-0.5-32-500 0.000 0.075 0 0.100 0.027 0 0.111
knn-1 0 0 0.076 0.065 0.008 0.097 0.246
Thus each classifier is able to specialize in different class-based subproblems rather than
3.5 Conclusion
when using many highly correlated, well-tuned models. In order to avoid overfitting
combiner level, even when using a linear combiner. Regularization can be performed by
penalization of the L2 norm of the weights (ridge regression), L1 norm of the weights
(lasso regression) or a combination of the two (elastic net regression). L1 penalties yield
sparse linear models; in stacked generalization, this means selecting from a small number
more effective than the sparse linear lasso penalty, and suggested that this result is
because many of the classifier outputs are well correlated with the target prediction
value, causing lasso to select from both correct and incorrect models.
solutions (under Gaussian and Laplacian priors for regularization), instead of the single-
point maximum likelihood estimates implicit in the ridge (Gaussian prior) and lasso
(a) selecting a single regularization hyperparameter for use in all subproblems or (b)
Chapter Abstract
(SVMs) are a popular and effective classification technique, and model selection for
Gaussian SVMs is performed by tuning the cost (C) and Gaussian width (γ) hyper-
parameters. SVMs were originally designed for binary classification, but have been
and effective method for solving multiclass classification problems using binary SVMs
subproblems, solve the binary subproblems, and combine the predictions from the bi-
nary classifiers. This raises the question of how to perform model selection; should the
share the same hyperparameters; this enables performing model selection on the tar-
get (multiclass) metric, but requires the assumption that a single hyperparameter set
works well on all subproblems and allows suboptimal subproblem performance. Our ex-
optimization for a variety of binary reductions, and has similar performance in one case.
We show two situations in which independent optimization is more effective than shared
hyperparameter optimization: (a) when the subproblems have very different structure
and (b) when Hamming decoding is used and there is enough validation data to decrease
SVMs as the binary classification algorithm, the issues and results identified in this
paper are applicable to any tunable binary classifier used with a multiclass-to-binary
reduction method.
4.1 Introduction
domains such as handwritten text recognition, protein structure prediction [73], heart-
beat arrhythmia monitoring, and many others. Support vector machines were intro-
duced as a binary classification algorithm, and several techniques have been proposed
for adapting them to address multiclass classification problems [2, 38, 22, 81, 67, 9].
One simple, effective and widely-used technique is to reduce the multiclass classification
are solved and predictions from the binary classifiers are combined to produce the mul-
ticlass prediction [2]. In this chapter, we focus on the issue of model selection in the
tal studies that illustrate the correlation between binary subproblem performance and
vs. the all-pairs reduction. We present results for two reduction methods (one-vs-all
and all-pairs), two decodings (Hamming decoding and squared-error decoding) and two
metrics (accuracy and the Brier score, a probability calibration metric). In Section
4.1.1, we describe the one-vs-all and all-pairs reductions and previous research. In Sec-
tion 4.1.2, we focus on the particular issue of model selection in the binary subproblems,
and discuss theory correlating multiclass accuracy to binary accuracy and consistency
results in Section 4.1.4. In Section 4.2, we present experimental studies showing the
Section 4.2, we show that shared optimization has an advantage because model selec-
tion tends to choose wrong solutions and subproblems in a multiclass problem are often
similar with respect to model selection. We analyze the results and perform control
studies in Section 4.3, report on supplementary results in Section 4.4 and conclude in
Section 4.5.
4.1.1.1 One-vs-All
One of the simplest and most widely used techniques for reducing a multiclass
also known as unordered class binarization [40] or one-vs-rest (or 1vr) [105]. Using the
sification subproblems, one for each class (see Figure 4.1). In the ith subproblem, the
classifier is trained to distinguish whether the instance belongs to class i or not. At pre-
diction time, the classifier with the highest output is chosen. An alternative scheme uses
voting, or Hamming decoding, to combine the predictions, with ties broken randomly.
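A minimal sketch of the one-vs-all reduction (using scikit-learn SVMs as the binary learners for concreteness; the thesis experiments use LibSVM, and the function names here are illustrative) shows both decodings, continuous winner-take-all and discrete Hamming voting with random tie-breaking:

import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(X, y, n_classes, C=1.0, gamma=0.1):
    # One Gaussian-kernel SVM per class, trained on the class-i-vs-rest labels.
    return [SVC(C=C, gamma=gamma).fit(X, (y == i).astype(int))
            for i in range(n_classes)]

def predict_winner_take_all(classifiers, X):
    # Continuous one-vs-all: pick the class with the largest real-valued output.
    outputs = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return outputs.argmax(axis=1)

def predict_hamming(classifiers, X, seed=0):
    # Discrete variant: each classifier casts a vote for its own class;
    # ties are broken randomly.
    rng = np.random.default_rng(seed)
    votes = np.column_stack([clf.predict(X) for clf in classifiers])
    return np.array([rng.choice(np.flatnonzero(row == row.max()))
                     for row in votes])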
There is some disagreement in the literature about the terminology for the one-vs-all
reduction; Rifkin and Klautau [85] use the term one-vs-all to indicate winner-take-all
with continuous outputs (i.e. choosing the classifier with the maximum output); other
research such as [5] refer to one-vs-all as using Hamming decoding. Here, we use the
term OVA to refer to continuous winner-take-all one-vs-all, and refer to the discrete
version as OVA with Hamming decoding. Rifkin and Klautau studied the one-vs-all
technique, and compared it to other methods for reducing multiclass to binary and to
other SVM techniques that provide direct optimization on the entire multiclass prob-
lem [85]. They show that model selection is essential and that under appropriate model
Rifkin and Klautau’s technique for model selection is to choose one set of hyperparame-
ters for all subproblems, rather than trying to optimize each subproblem independently
or trying to optimize differing hyperparameters for all subproblems jointly. Rifkin and
Klautau don’t explicitly state that they use the shared-model paradigm for model se-
lection, but it is implied since one set of regularization hyperparameters is reported for
4.1.1.2 All-Pairs
(or AVA) [85], round-robin classification [40] and 1-against-1 (or 1v1) [105]), a k-class classification problem is decomposed into k(k−1)/2 problems, one for each pair of classes
At prediction time, each binary classifier votes for one class, and the class with
the most votes is selected as the multiclass prediction. Note that in contrast to con-
Figure 4.2: Illustration of an A-C decision boundary in a 2D, 3-class example of the
all-pairs reduction.
probabilities or confidence predictions from each binary classifier, instead using only a
discrete vote from each, though it is possible to use the all-pairs encoding with a decod-
ing function other than Hamming decoding in loss-based decoding. Subsequent research
in pairwise classification has shown how to incorporate continuous outputs and also to
produce a probability distribution instead of a discrete vote [51, 103]. Friedman shows
that Bayes optimal binary classifiers combine to produce a Bayes optimal multiclass
classifier, and therefore each binary subproblem can be solved independently and as
accurately as possible [38]. This proof is repeated here for completeness.
where K is the set of possible labels, ω is the true label, x is the input feature vector
Therefore, given reliable p(ω = k|ω ∈ {i, k}, x), binary reduction under the all-pairs re-
duction is equivalent to the true Bayes optimal decision. Note that this analysis assumes
that p(ω = k|ω ∈ {i, k}, x) can be determined exactly for each subproblem, whereas in
a finite sample size. Friedman argues that unlike subproblems in the all-pairs reduction,
subproblems in one-vs-all must be tuned simultaneously, since the outputs from each
While one-vs-all and all-pairs are the most widely studied and employed tech-
niques for reducing multiclass to binary, they are only two cases within the more general
correcting output coding (ECOC). Though we focus our experimental studies on the
one-vs-all and all-pairs reductions, we also describe these other frameworks, since they
framework was proposed by Dietterich and Bakiri in 1995 [27]. This scheme is named for
its similarity to error correcting codes in information theory, with the analogy that an
instance’s class is a message to be transmitted, and error correcting codes are employed
to encode the message in order to make the transmission (or classification) more tolerant
of errors. ECOC requires all classes to appear in each subproblem, with an arbitrary
specification (called a coding matrix) of how classes are reassigned to subproblems. For
example, for a 5-class problem, classes 1, 3, 4 might be assigned to the positive indicator class and classes 2, 5 to the negative class, giving the codeword row ri = {+1, −1, +1, +1, −1}. At prediction time, each subproblem classifier votes for or
against membership in the positive indicator class, and the class with the most votes is
The number of unique and nontrivial binary splits (codewords) for a set of k classes is 2^(k−1) − 1 [62]. Of these splits, k correspond to the one-vs-all dichotomies. The
other splits are different binary problems constructed from the original class labels. Di-
etterich and Bakiri [27] proposed using as many dichotomies as computationally feasible
in order to improve the multiclass prediction. At prediction time, each base classifier
is evaluated, and the class label with the minimum Hamming distance to the predicted
codeword is used as the multiclass prediction. Dietterich and Bakiri recommend using
all possible dichotomies when the number of classes is 7 or less; when there are more
classes, a random sampling of dichotomies is typically used. The tradeoff is that codes
with better error correcting properties typically correspond to more difficult subprob-
lems.
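Hamming decoding against a coding matrix can be sketched in a few lines (a generic illustration with a made-up 3-class coding matrix, not the thesis code):

import numpy as np

def ecoc_hamming_predict(code_matrix, binary_preds):
    # code_matrix:  (K, S) matrix with entries in {-1, +1}; row k is class k's codeword.
    # binary_preds: length-S vector of subproblem predictions in {-1, +1}.
    distances = (code_matrix != binary_preds).sum(axis=1)   # Hamming distance to each row
    return int(np.argmin(distances))                        # closest codeword wins

# 3-class example: the first three columns are the one-vs-all dichotomies,
# the last column is an additional split {1,3} versus {2}.
code_matrix = np.array([[+1, -1, -1, +1],
                        [-1, +1, -1, -1],
                        [-1, -1, +1, +1]])
print(ecoc_hamming_predict(code_matrix, np.array([+1, -1, -1, -1])))   # prints 0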
Loss Based Decoding In 2000, Allwein et al. generalized the ECOC frame-
classifier outputs and (b) allows some subproblems to optionally ignore some of the
classes in the data set [2]. Incorporating continuous output makes it possible to repre-
sent the one-vs-all technique in the loss-based decoding framework (as we show below),
and allowing subproblems to omit subsets of data points makes it possible to represent
the all-pairs technique. Loss-based decoding was further generalized by Crammer and
Singer to continuous-output coding [22], in which each class in a subproblem has some
in loss-based decoding.
all one-vs-all has been lacking. In this section, we show that using the loss function
L(z) = (1 − z)2 in one-vs-all loss-based decoding yields the same predictions as continu-
ous winner-take-all one-vs-all. Our experimental studies use this result in implementing
Theorem 1. Using the loss function L(z) = (1 − z)2 in one-vs-all loss-based decoding
yields the same predictions as continuous winner-take-all one-vs-all for a problem with
ŷ = argmin_c [ L(f_c(x)) + Σ_{a ∈ K, a ≠ c} L(−f_a(x)) ]

which, after expanding the square and dropping terms that do not depend on c, reduces to

ŷ = argmax_c f_c(x)
This concludes the proof that winner-take-all one-vs-all classification can be im-
plemented by setting the loss function as L(z) = (1−z)2 in loss-based decoding with the
one-vs-all coding scheme. In our experimental studies, we also employ this loss function
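The equivalence can also be verified numerically; the sketch below (illustrative only) applies the loss L(z) = (1 − z)² in one-vs-all loss-based decoding to random continuous outputs f_c(x) and checks that it always agrees with winner-take-all:

import numpy as np

def squared_loss_ova_decode(f):
    # Total loss for picking class c: L(f_c) plus L(-f_a) for every other class a.
    loss = lambda z: (1.0 - z) ** 2
    K = len(f)
    totals = [loss(f[c]) + sum(loss(-f[a]) for a in range(K) if a != c)
              for c in range(K)]
    return int(np.argmin(totals))

rng = np.random.default_rng(0)
for _ in range(1000):
    f = rng.normal(size=5)                              # continuous one-vs-all outputs
    assert squared_loss_ova_decode(f) == int(np.argmax(f))
print("squared-error decoding matched winner-take-all in every trial")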
There are several differing views regarding subproblem model selection when us-
ing binary classifiers to solve multiclass classification problems. Before discussing the
different types of model selection, we identify the terminology used here and throughout
the chapter. We refer to a model as any function that produces a classifier when evalu-
ated on a labeled training data set. Typically, a model is the combination of a learning
algorithm (e.g. SVM or AdaBoosted decision stumps) and an associated set of hyper-
than the mechanism used to obtain it. A base classifier is a classification algorithm used
If the algorithm (but not the hyperparameter set) is selected before training,
then model selection refers to the search over learning hyperparameters for the given
classification algorithm. Search techniques such as grid search or binary search are often
training set.
class problem at once (and are regularized as a unit), techniques that reduce multiclass
problems to a set of binary subproblems introduce the new issue of model selection in
multiclass problem; the algorithm and hyperparameters (if any) are selected
of the others.
(4) full-joint optimization: A separate model is used for each subproblem, deter-
problem metric.
We examine each of these methods below, and point out prominent lines of re-
search based on each technique. There is some disagreement about the ‘classical’ way to
perform model selection for reducing multiclass to binary in SVMs; for example, Lebrun
et al. report that “the classical way to achieve optimization of multiclass schemes is an
individual model selection for each related binary sub-problem” [65]. However, often
cited works such as Rifkin et al. [85] select a single set of hyperparameters that are used
in all subproblems. Lorena provides an overview and discussion of these methods and
models, with the implicit assumption that behavior will transfer to the regularized
case. For example, Allwein et al. report usage of polynomial SVMs with polynomials of
degree 4, with no mention of whether or how this model was selected [2]. Beygelzimer
et al. report “We do not perform any kind of parameter optimization such as tuning
the regularization parameters for support vector machines or the pruning parameters
for decision trees. Our objective is simply to compare the performance of the two
reductions under the same conditions.” [7]. Zadrozny reports usage of the boosted
naive Bayes algorithm with 10 rounds of boosting [106], selected a priori without any
results in suboptimal performance [50], and there is no guarantee that results will be
The earliest and most pervasive view is that a single set of hyperparameters should
be used for all subproblems. Rifkin and Klautau follow this paradigm, and perform two
Duan et al. take the {C, γ} of each of the binary classifiers within a multi-category
method to be the same, tuned based on the multiclass classification performance over
Hsu and Lin [55] take the hyperparameter set for each subproblem to be the
same, and again use the shared-hyperparameters paradigm to parallel the shared-
Platt et al. report one set of hyperparameters for each set of binary subproblems,
model selection may be more appropriate than independent optimization because there
optimization has the advantage that the tuning is performed by evaluation on the actual
training data set for which a predictive model is desired, rather than tuning artificial
subproblem. More specifically, one model selection routine is applied to each sub-
are selected for each subproblem. Independent optimization removes the constraint in
shared-hyperparameters that the same model must be applied to all subproblems, and
increases the possibility of overfitting, or, more generally, may not entail an inductive
lem using heterogeneous classifiers for the all-pairs reduction [94]; that is, potentially
However, results on real world data sets reported in [94] are not statistically compelling;
only four data sets are used, and Wilcoxon signed ranks tests indicate that there is no
statistically significant difference between the proposed method and the baseline al-
gorithms at the predetermined value of p ≤ 0.05 (actual p-values are p ≤ 0.125 and
p ≤ 0.25). Furthermore, this work used only naive Bayes and LDA as the base classifi-
cation algorithms, and these results may not generalize to more powerful classification
overall multiclass classification performance. Due to the large search space, evolutionary
De Souza et al. introduces a particle swarm optimization for searching over all
subproblem models simultaneously [33]; however, they conclude that this technique did
mization over all subproblems simultaneously [65]. In this article, statistical claims are
also problematic, since only three datasets are used, and Wilcoxon signed ranks tests
Xu and Chan claim that all binary problems need to have different parameters,
but that it is insufficient to optimize each individually [104]. They propose an algorithm
that starts by searching over a 13 x 13 grid with shared hyperparameters, then uses
tractable); however, with so much flexibility in the model and given finite sample sizes,
techniques since they are computationally tractable and significantly more effective than
new methodology that uses shared hyperparameters that are selected by optimizing
The previously discussed techniques for model selection differ in their computa-
tional demands. At the low extreme, avoidance of model selection requires no additional
computational power beyond training the classifier itself because the learning algorithm
and learning hyperparameters are specified independently of the data. At the opposite
the hyperparameter set (e.g. two for Gaussian SVMs). Independent-optimization and
binary and multiclass metric have similar computational demands), since each requires
training and evaluation once for each combination of subproblem and model.
which means that optimal binary classifiers guarantee an optimal multiclass classifier.
As pointed out in [5], it is sufficient to minimize regret (difference from Bayes error
rate) rather than absolute error since the Bayes error rate may be nonzero. When us-
multiclass classifier. It has been shown that one-vs-all (or any other ECOC encoding)
under Hamming decoding is inconsistent, while the all-pairs reduction and one-vs-all
(with continuous outputs) is consistent [5, 6]. This result suggests that independent-
optimization model selection should work well for all-pairs and one-vs-all, since indepen-
dent optimization focuses on identifying the best model for each subproblem and has the
flexibility to choose different models for each subproblem. On the other hand, shared-
one-vs-all with Hamming decoding, since it is inconsistent. For some consistent reduc-
tions, the multiclass regret is bounded by a function of the average binary regret—this
describe the datasets, algorithms and statistical methods used in our studies. The main
experimental results are reported in Section 4.2.2 with discussion and analysis in Section
4.3.
4.2.1 Setup
number of classes, number of features and number of instances are indicated in Table
4.1. We formalize our data set selection decision procedure below to rule out bias
omitting any collection that consisted solely of artificial, ordinal or regression datasets,
(1) classes ≥ 3
The first rule (classes ≥ 3) selects multiclass classification problems, ignoring the
binary case k = 2. The second rule (5 ≤ numeric attributes ≤ 500) ensures there are
sufficiently many but not too many attributes. The remaining criterion ensures that
there is a sufficient number of data points. Uniquely identifying attributes that com-
pletely specify the identity of an instance (such as ID or index attributes) are discarded,
specifically: the counter attribute in collins, the BookID attribute in authorship and the
ID attribute in dj30-1985-2003. Classes with less than 20 instances are deleted, along
with corresponding instances. The free parameters in the above rules were hand-tuned
until 20 datasets were selected to facilitate statistical analysis. After deletion of classes,
any duplicate instances (based on attribute values, not class values) are deleted. Data
set selection was performed before evaluation of algorithms in order to avoid bias.
Stratified subsampling is used to reduce the total number of instances for large
over class labels commensurate with the original sample. For data sets with N ≥ 450
instances, random draws are sampled with Ntr = 300 training points and Nts = 150 test
points. For data sets with N < 450, random draws are taken with 2/3 of the instances
used for training and the remaining approximate 1/3 points for testing. Missing values
are filled in with the mean of non-missing values for each attribute. Datasets from
similar domains are discarded in order to improve tests for statistical significance of
its similarity to optdigits, only one of the mfeat- series was selected, and anneal.ORIG,
heart-h and cars-with-names were discarded due to similarity with other data sets. For
data sets with k > 20 (letter and dj30-1985-2003 ), 1/3 of the classes are removed to
decrease computational demands, and further stratified subsampling removes 1/3 of the
instances.
Table 4.1 indicates the datasets used in our experiments, and their relevant prop-
erties. The column labeled entropy refers to the normalized entropy (in base 2) of the class distribution, −(Σ_i p_i log₂ p_i)/log₂ k, where p_i is the proportion of instances with the class label c_i and k is the number of classes. For instance, the entropy is 1 for a class with an even distribution of class labels p_1 = ... = p_k = 1/k and 0 for a distribution that has only instances with one label, i.e. p_i = 1, p_{j≠i} = 0. To summarize, the number
of classes varies from 3 to 20, with entropy varying between 0.4819 and 0.9976. The
smallest training sample size (after subsampling) is 133, and the number of attributes
and the data has optionally been subsampled, not all results on these datasets will
correspond identically to those found in the literature. However, we are able to perform
comparative studies since our experiments include a number of baseline algorithms. Also
note that a few of these data sets have classes that would be more appropriately modeled
as ordinal attributes, but that in these experiments we omit any ordering information
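For reference, the normalized entropy reported in Table 4.1 can be computed as in the sketch below (illustrative code, not the thesis tooling): it is 1 for a uniform class distribution and 0 when all instances share one label.

import numpy as np

def normalized_entropy(labels):
    # Base-2 entropy of the class proportions, divided by log2(k).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    k = len(p)
    if k == 1:
        return 0.0
    return float(-(p * np.log2(p)).sum() / np.log2(k))

print(normalized_entropy([0, 0, 1, 1, 2, 2]))   # uniform over 3 classes -> 1.0
print(normalized_entropy([0, 0, 0, 0, 0, 1]))   # skewed two-class sample -> about 0.65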
4.2.1.2 Methods
cording to the algorithm presented in [103], using Weka’s adapter. LibSVM is non-
a deterministic seed instead of seeding from the system clock to enable reproducible
runs. The one-vs-all and all-pairs methods are implemented in Scala under the loss-
based decoding framework1 using a squared error decoding for one-vs-all to implement
implement voting.
plemented in LibSVM [103], using the Gaussian kernel. To search over the hyperpa-
{−5, −1.666, 1.666, 5.0, 8.333, 11.666, 15.0}² to first determine a value for the cost hy-
¹ The default Weka implementation of One-vs-All or ECOC uses a loss-oriented output, in which probabilities are summed (not using either Hamming decoding or continuous winner-take-all).
(ranging from −40 to 15) at the previously determined c-value to obtain the value for
γ. We used this sampling scheme since there was much more sensitivity to the γ hy-
perparameter than to the c hyperparameter, and so that our scheme would take a total
of 121 samples, as done in many other grid searches, such as LibSVM [103]. This fine-
granularity 1-d search for γ also facilitates visualization of the results. Platt scaling is
used to fit a sigmoid to each of the SVM models to improve probability estimates [82],
and Wu et al.’s technique for improving probability estimates by applying their second
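The two-stage hyperparameter search described above can be sketched as follows. This is an illustrative reconstruction under two assumptions: the grid values are in log2 units, and the fine sweep uses 72 points so that the whole search totals 121 evaluations as stated in the text; cross_val_accuracy stands for a routine that trains and scores a Gaussian-kernel SVM for a given (log2 C, log2 γ) pair.

import numpy as np

COARSE_LOG2 = [-5.0, -1.666, 1.666, 5.0, 8.333, 11.666, 15.0]   # joint (C, gamma) grid
FINE_LOG2_GAMMA = np.linspace(-40, 15, 72)                      # 1-d gamma sweep

def two_stage_search(cross_val_accuracy):
    # Stage 1: coarse 7x7 grid over (log2 C, log2 gamma); keep the best cost value.
    best_c, _ = max(((c, g) for c in COARSE_LOG2 for g in COARSE_LOG2),
                    key=lambda cg: cross_val_accuracy(*cg))
    # Stage 2: fine 1-d sweep over log2 gamma at the chosen cost.
    best_g = max(FINE_LOG2_GAMMA, key=lambda g: cross_val_accuracy(best_c, g))
    return 2.0 ** best_c, 2.0 ** best_g                         # selected (C, gamma)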
Accuracy Metric The accuracy metric counts the number of correctly clas-
ŷ(x) = argmaxi pi (x) is the predicted class, x is the input attribute vector, and y(x)
is the true class. In this chapter, we primarily focus on classifier using the accuracy
metric, though some results with the Brier metric are presented in Section 4.4.2. The
accuracy metric has the advantage that accuracy can be computed on both the binary
subproblems and on the original multiclass problem; other multiclass metrics may not
Since the scores are unlikely to be normally distributed, we use the Wilcoxon
signed-ranks test for identifying statistical significance of the results [24]. Also, Demšar
The rest of this section is organized as follows. Section 4.2.2 compares shared-
investigate the structure of the binary subproblems and of the multiclass problems and
the relationship between the two in Section 4.2. We analyze the results and perform
4.2.2 Results
independent optimization for one-vs-all, all-pairs, one-vs-all with the Hamming loss
function and all-pairs with the squared error loss function. In each case, our results show an advantage for shared-hyperparameter optimization over independent model optimization.
[Figure: average accuracy (%) of shared versus independent hyperparameter optimization for the one-vs-all, all-pairs, one-vs-all-hamming and all-pairs-squared reductions; the corresponding Wilcoxon p-values are 0.0663, 0.0027, 0.0027 and 0.0028.]
For continuous one-vs-all, the win is not statistically significant at p ≤ 0.05, with a
value of p ≤ 0.0663; for all other cases the win is statistically significant. For continuous
            one-vs-all        all-pairs         one-vs-all-hamming   all-pairs-squared
accuracy    shared (0.0663)   shared (0.0027)   shared (0.0027)      shared (0.0028)
Table 4.2: Winning strategy for each combination of reduction and metric. Statistically
significant wins (at p ≤ 0.05) are highlighted. P-values from the Wilcoxon signed-ranks
test are indicated after the winning strategy.
individual data sets, are reported in the appendix (see Section 4.6.1). Explanations for
4.3 Analysis
4.3.1). Conversely, we show that when subproblem decision boundaries have differ-
ing shapes that independent optimization is necessary (Section 4.3.2). We also discuss
the relationship between binary and multiclass accuracy, showing a strong correlation
(Section 4.3.3). We also perform control studies that rule out other explanations. In
more effective because it uses the true target metric for optimization instead of a binary
metric. In Section 4.3.5, we show that choosing optimal classifiers on subproblems (as
judged by the oracle) favors independent optimization for reductions that use Hamming
similar structure with respect to model selection and, subsequently, that they often share
For algorithms like one-vs-all, it is not unexpected for subproblems to have similar
structure since portions of one decision boundary may appear in several problems. For
the negative indicator class, and therefore, the subproblems will share similar portions
the generative distribution that produces elements from each class may have similar
physical, statistical, or noise properties. For instance, if each class in a data sample
subproblems in the all-pairs reduction will be optimally fit with linear discriminants.
sets are structurally similar, and that this causes shared hyperparameter optimization
to work well since a single hyperparameter value can often be chosen that performs
well on many subproblems. As an example, we select the four class problem vehicle to
illustrate the relationship between subproblems. Composite results over many data sets
We selected the vehicle data set for particular investigation since it illustrates
many of the important issues in subproblem similarity. The vehicle data set is a 4-class
problem, and therefore has four one-vs-all subproblems and 6 all-pairs subproblems.
One-vs-All Figure 4.4 indicates the model selection curves for the four sub-
problems in one-vs-all for the vehicle data. First, note that there seem to be two types
of behaviors; the upper pair of curves for class c2 and class c3 and the lower pair for
classes c0 and c1 . It is interesting to note that the lower curves correspond to the classes
opel and saab and the upper curves correspond to bus and van, showing that these con-
ceptually similar classes have similar structures under model selection. Second, note
that even though the pairs of curves have significantly different shape, they have op-
tima in nearly the same region of the log2 (γ) dimension, around log2 (γ) = −5.0. This
indicates that the shared hyperparameter would be effective for this problem since all
four classes, though different, peak at the same hyperparameter setting. Table 4.7 indi-
cates that independent optimization actually averages a 0.8% higher average accuracy,
though this difference is probably not significant compared to the standard deviation of
about 3.0%.
All-Pairs Figure 4.5 indicates the model selection curves for the 6 subprob-
lems in the all-pairs reduction for the vehicle data set. Note that 5 of the subproblems
(upper part of the chart) have a similar structure, with broad peaks in the range of
−13 ≤ log2 (γ) ≤ −3. The unique lower subproblem plot is between the two sedans
opel and saab. Again, despite the difference between these types of series, they have a
similar peak, around log2 (γ) = −6.0. Table 4.8 indicates that shared-hyperparameter
optimization averages about 1.3% higher average accuracy, again probably not signifi-
Figure 4.4: Independent model selection curves for the four one-vs-all subproblems in
the vehicle data set.
Figure 4.5: Independent model selection curves for the 6 all-pairs subproblems in the
vehicle data set.
In this section, we show the model selection curves for the cars, page-blocks and
letter data sets under the one-vs-all and all-pairs reduction methods, which represent
many of the salient features in the 20 data sets. Note that the curves tend to peak near
the same regions, and that often, subproblems have similar shapes.
Figure 4.6: Examples of subproblem performance as a function of γ for the cars, page-
blocks and letter data sets.
the average binary subproblem loss at the hyperparameter value selected by shared-
compute the difference between optimal binary accuracy of the subproblem attained at
the optimal value of gamma γo . The difference is d = ā(γo ) − a(γs ), which quantifies
for the one-vs-all reduction are indicated in Figure 4.7. All differences are less than
0.80%, with an average of 0.30%. The datasets halloffame and vehicle attain the opti-
mal values for all subproblems. These results validate our hypothesis because they show
from their optimal value. Furthermore, this result indicates that the individual subprob-
optimization.
For the all-pairs reduction (see Figure 4.8), the average loss is 4.24%, significantly
larger than the average loss for the one-vs-all reduction. The largest loss values for the
all-pairs reduction are 36.4% for letter and 29.4% for dj30-1985-2003. Note that the
larger loss values occur at large number of classes. In this case, subproblems deviate
from each other so significantly that they suggest that independent-optimization should
Section 4.2. In Section 4.3.5, we show that independent optimization is, in fact, more
appropriate for the Hamming decoding techniques (including all-pairs) once subsam-
Figure 4.7: Average subproblem accuracy loss at the value selected by shared-
hyperparameter optimization for the one-vs-all reduction.
Figure 4.8: Average subproblem accuracy loss at the value selected by shared-
hyperparameter optimization for the all-pairs reduction.
be able to determine the same hyperparameter sets for each subproblem. However, an
that the effective amount of validation data is the combination of validation data for
performed with a significantly smaller amount of validation data. This efficient re-use
behavior of both model selection techniques in two synthetic data sets that are designed
to have different optima for each subproblem with respect to model selection. Two
(a) noise and (b) shape of the decision boundary. We first experiment with Gaussian
data sets and varying degrees of noise between each data set, which contains only linear
decision boundaries. In the next section, we experiment with datasets in which some
subproblem decision boundaries are linear and others are nonlinear. For both synthetic
data sets, we use 300 training and 150 test points over 10 random resamplings as in
Section 4.2.
In the first synthetic data set, each decision boundary is linear, and there are varying degrees of noise between each class. To implement this, we sample data points for each class from a Gaussian with the identity covariance matrix. The centroids of the Gaussians are c1 = (0, 0), c2 = (0, 1), c3 = (0, 3), so that the interclass distances are 1, 2 and 3. Qualitatively, these settings correspond to a small, medium and large amount of noise between each pair of classes.
² Thanks to Shumin Wu for identifying this explanation.
In the one-vs-all reduction, some subproblem classes correspond to the union of two
classes, and again we have varying degrees of noise between each subproblem. In the
c1 vs c23 subproblem, the amount of noise between class c1 and c23 is determined by
the closest class in c23 , namely c2 . Therefore, in this case, there are interclass distances
of 1 (between class c1 and c23 ) and 2 (between class c3 and c12 ). The subproblem
corresponding to c2 vs c13 has class c2 lying between the two components of the composite class c13 . Therefore,
this configuration of Gaussian clusters offers varying degrees of noise between each
subproblem for both all-pairs and one-vs-all reductions. The datasets are shown in
Figure 4.9.
Figure 4.9: Synthetic datasets generated by Gaussian distributions with varying degrees
of noise.
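As an illustration, the following minimal sketch generates one random draw of this synthetic data set under the setup described above (identity covariance, centroids c1–c3, 300 training and 150 test points); the equal class proportions are an assumption, since the text does not state them explicitly.

    import numpy as np

    def sample_gaussian_dataset(n_points, rng):
        """Draw points for three classes from unit-covariance Gaussians with
        centroids (0,0), (0,1), (0,3), giving interclass distances of 1, 2 and 3."""
        centroids = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
        # Assumption: classes are sampled in equal proportion.
        labels = rng.integers(0, 3, size=n_points)
        points = centroids[labels] + rng.standard_normal((n_points, 2))
        return points, labels

    rng = np.random.default_rng(0)
    X_train, y_train = sample_gaussian_dataset(300, rng)
    X_test, y_test = sample_gaussian_dataset(150, rng)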
Results The results are indicated in Table 4.3. For this synthetic data set, the differences between shared and independent hyperparameter optimization are not very large: less than 1.1% in all cases. To understand these results, the model selection
plots are shown in Figures 4.10 (for the one-vs-all reduction) and 4.11 (for the all-pairs reduction). Note that in both cases, despite having a different amount of noise in each subproblem, the optima are not significantly separated, and choosing the optimal point with respect to the multiclass curve doesn't incur much loss on any of the subproblems. Shared-hyperparameter optimization is therefore sufficient for this synthetic problem because the subproblems are structurally similar with respect to model selection.
Figure 4.10: Model selection curves for Gaussian synthetic data sets under the one-vs-all
reduction
Table 4.3: Accuracy results for linear decision boundaries (in %), for synthetic data sets
as described in paragraph 4.3.2. Standard error over 10 random samplings is indicated
in parentheses.
Figure 4.11: Model selection curves for Gaussian synthetic data sets under the all-pairs
reduction
In the second synthetic data set, for classes c1 and c2 , a point is drawn uniformly from the unit circle and assigned to c1 or c2 according to which side of a sinusoidal decision boundary it falls on (the sinusoid completes several periods across the unit circle). A third class c3 is drawn from a Gaussian distribution with a covariance matrix of 0.1I, where I is the 2 × 2 identity matrix, and a centroid located at x = 0, y = 1, so that it overlaps significantly with class c2 and less with class c1 .
Results The results are indicated in Table 4.4. For this synthetic data set, the losses from shared optimization are larger than in the Gaussian case for both of the one-vs-all reductions, and reach 1.8% for the all-pairs reduction. To understand these
Figure 4.12: Synthetic datasets with sinusoidal and linear decision boundaries.
results, the model selection plots are shown in Figures 4.13 (for the one-vs-all reduction)
and 4.14 (for the all-pairs reduction). This problem was constructed to have differing
optimal hyperparameters for each subproblem. For both the one-vs-all and all-pairs
reductions, the optimal hyperparameter on one subproblem gives more than 2% error
on at least one other subproblem. Therefore, since the subproblems are significantly
different with respect to model selection, independent optimization is essential for this
problem.
Table 4.4: Accuracy results for mixed linear and nonlinear decision boundaries (in %),
described in paragraph 4.3.2. Standard error over 10 random samplings is indicated in
parentheses.
Figure 4.13: Model selection curves for the sinusoidal synthetic data set under the one-vs-all reduction.
Figure 4.14: Model selection curves for the sinusoidal synthetic data set under the all-pairs reduction.
Summary The results of the synthetic experiments indicate that when subproblems differ substantially with respect to model selection (as in the mixed linear/nonlinear example above), it is essential to allow each subproblem to select its own optimal hyperparameters. On the other hand, when the subproblems have similar shape, or, more specifically, have optima in similar regions of hyperparameter space, sharing a single hyperparameter value incurs little loss. Given the success of shared-hyperparameter optimization in the real-world data sets discussed in Section 4.2, we infer that this set of real-world data sets entailed subproblems that have similar structure with respect to hyperparameter selection.
In this section, we show that there is a strong correlation between binary and multiclass accuracy. This correlation has several ramifications for model selection in multiclass classification with binary classifiers. First, it means that optimizing with respect to the binary average is a reasonable way to optimize the multiclass accuracy. Second, it means that more effective binary classifiers (possibly obtained through increased attention to the binary subproblems) should yield a more accurate multiclass classifier. We first discuss an illustrative example in Section 4.3.3.1, then generalize results over all 20 data sets in Section 4.3.3.2.
Figure 4.15(a) indicates the accuracy as a function of the hyperparameter γ for the anneal data set under the one-vs-all reduction. The curve labeled one-vs-all-shared indicates the model selection performance for the one-vs-all method, while the curve labeled one-vs-all-sharedsub indicates the average binary accuracy over all subproblems as a function of γ. Note that these curves are qualitatively very highly correlated. The fact that one-vs-all-sharedsub lies several percent above the other curve indicates that the binary subproblems are much easier than the full multiclass problem. The curve labeled one-vs-all-shared-oracle indicates the value of the model as determined by held-out test data (instead of validation data as in the previous two curves). The similarity between the oracle curve and the validation-based curves indicates that the validation set provides reliable estimates for model selection.
[Figure 4.15 panels: (a) Anneal with one-vs-all; (b) Scatterplot for anneal with one-vs-all; (c) Anneal with all-pairs; (d) Scatterplot for anneal with all-pairs. Panels (a) and (c) plot accuracy against log2(γ) for the shared, sharedsub and shared-oracle curves; panels (b) and (d) plot multiclass accuracy (%) against average binary accuracy (%).]
Figure 4.15: Correlation between average binary accuracy and multiclass accuracy for
the dataset anneal
To quantify the linearity of the relationship between the average binary accuracy and multiclass accuracy, we fit a linear model to the multiclass vs. average-binary scatter plots for each data set and compute the R² statistic. An R² value of 1.0 indicates that the data points are collinear (note that this result does not guarantee that the generating model from which the points were sampled is linear). Figure 4.16 indicates the R² statistics for the linear fits for each of the data sets under the one-vs-all reduction. In this case, the average R² statistic is 0.791. For all-pairs (see Figure 4.17), the average R² statistic is 0.910. These results indicate that there is a significant correlation between average binary accuracy and multiclass accuracy.
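For concreteness, a minimal sketch of this computation is given below (hypothetical arrays, with one pair of values per sampled γ), using an ordinary least-squares fit.

    import numpy as np

    def r_squared(avg_binary_acc, multiclass_acc):
        """Coefficient of determination for a linear fit of multiclass accuracy
        against average binary accuracy, both sampled over the gamma grid."""
        x = np.asarray(avg_binary_acc, dtype=float)
        y = np.asarray(multiclass_acc, dtype=float)
        slope, intercept = np.polyfit(x, y, deg=1)      # least-squares line
        residuals = y - (slope * x + intercept)
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    # Hypothetical example: strongly correlated accuracies yield R^2 close to 1.
    binary = np.linspace(75, 95, 20)
    multi = 0.9 * binary - 10 + np.random.default_rng(0).normal(0, 0.5, 20)
    print(round(r_squared(binary, multi), 3))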
[Figure 4.16: R² values of the linear fit for each data set under the one-vs-all reduction.]
[Figure 4.17: R² values of the linear fit for each data set under the all-pairs reduction (chart titled "R-Squared Value for All-Pairs"; values range from 0.40 to 1.00).]
different (though related) metrics. In order to identify whether this mechanism explains the results obtained in Section 4.2, we implemented a new classification algorithm that chooses the hyperparameter set by optimizing average binary subproblem accuracy rather than the true multiclass target metric. For all combinations of {one-vs-all, all-pairs} × {Hamming, squared error}, the resulting method is compared to shared and independent selection with a statistical test at a p ≤ 0.05 level. The Wilcoxon signed-ranks test indicates a difference for the one-vs-all combinations and for all-pairs-squared at p ≤ 0.1231. (A three-way Holm test, which also includes the shared-hyperparameter method, does not indicate significant differences between the two methods.) Shared selection with subproblem-selection also beats independent optimization, suggesting both efficient re-use of validation data and protection against choosing suboptimal binary classifiers, even though selection is performed on a binary rather than a multiclass metric.
One possible explanation for the tendency for shared-hyperparameter model selection to outperform independent optimization concerns the number of free hyperparameters: under independent optimization, m hyperparameters must be fit for every subproblem, where m is the number of hyperparameters of the base learning algorithm (e.g. m = 2 for support vector machines with a Gaussian kernel). In independent optimization, the total number of fitted hyperparameters therefore grows with the number of subproblems. With the increased flexibility of the model, overfitting becomes a significant problem, and, in the absence of a sufficiently large validation set, incorrect hyperparameter values may be selected for many of the subproblems. In order to rule out the possibility that such validation-set overfitting drives the results, we also perform oracle selection, in which hyperparameters are selected not on validation data but rather on actual test data. While this technique cannot be used in practice, it identifies models that, while optimal with respect to the validation set, are suboptimal with respect to the actual test dataset. Note that the test data must be reduced into binary subproblems in order to perform oracle selection under independent optimization.
The results for oracle selection are indicated in Table 4.5. Even when an oracle is used to select optimal models for both shared and independent methods, shared optimization maintains its superiority for most combinations. Independent optimization with oracle selection outperforms shared optimization for the all-pairs reduction in some configurations.
Table 4.5: Winning strategy for each combination of reduction and metric, when using
the oracle selection method. Statistically significant wins (at p ≤ 0.05) are highlighted.
P-values from the Wilcoxon signed-ranks test are indicated after the winning strategy.
These results suggest that requiring re-use of the same hyperparameter set for each subproblem may provide some level of regularization for the problem; that is, it may provide essential smoothing across the subproblems.
In addition to the main results discussed in Section 4.2, we also discuss two supplementary results: a comparison of all methods in this study (Section 4.4.1) and an evaluation of probability calibration using the Brier metric. Several studies have previously compared the all-pairs reduction to one-vs-all [85]. Here we compare the seven combinations of reduction, decoding and model selection strategy studied in this chapter (where computationally feasible). The results are indicated in Table 4.6, and the average ranks are depicted in Figure 4.18. All-pairs with voting attains the highest average rank under shared-hyperparameter optimization, but it is not statistically significantly different from all-pairs with squared-error decoding under independent optimization.
Figure 4.18: Average ranks of the 7 algorithms under study; algorithms not statistically
significantly different from the top-scoring algorithm are connected to it with a vertical
line.
In order to evaluate the calibration of each method, we use the Brier metric [16], also known as the mean squared error (MSE) metric, in which b(x) = (1/d) Σj (tj(x) − p̂j(x))², where x is the input vector, tj is the target probability for the jth class, p̂j is the probability estimated by the classification method and d is the number of classes (the dimensionality of the probability vectors). We take tj(x) to be 1 if the label belongs to the jth class and 0 otherwise.
This metric is used in other related work such as Wu et al. [103] and Zadrozny [106].
In the results presented below, we use the rectified Brier score r(x) = (1 − b(x)) × 100
so that the results are structurally similar to percent accuracy; namely that higher is
better and that the values range between 0% and 100%. For all four combinations of reduction and decoding, the advantage of shared-hyperparameter selection is less pronounced on the Brier metric (see Table 4.2). One possible explanation for this result is that a significantly different metric is used for model selection than for model evaluation. Future studies could experiment with using a binary Brier metric for purposes of model selection.
Figure 4.19: Average rectified Brier scores comparing independent to shared hyperpa-
rameter selection for each reduction.
4.5 Conclusion
In this chapter, we studied whether hyperparameters should be shared across the binary subproblems or optimized independently in order to improve the overall model. We performed several control studies to isolate the mechanism behind the observed advantage of sharing, and showed that shared-hyperparameter selection maintains its superiority even when (a) the average binary accuracy is used as the metric instead of the true multiclass target metric and (b) an oracle is used to ensure that suboptimal models are not selected due to limited validation data. We argued that the binary subproblems arising from the real-world data sets have similar structure, and that this is an explanation for the superiority of shared-hyperparameter optimization. We then constructed synthetic data sets with differing optima for each subproblem and showed that independent optimization is more effective in this case. As a supplementary result, we also showed all-pairs with voting to rank higher than other algorithms, but not statistically significantly higher than several of the alternatives.
This work could be extended to cover random coding matrices (with a variety of loss functions). Another interesting avenue of research would be to study (possibly heterogeneous) model combination in each of the subproblems, as in Caruana et al. [17], and to see whether the same results we obtained for model selection also apply to the case of model combination (i.e. averaging binary models vs. averaging multiclass models). Also, this research focused on the accuracy metric (0-1 loss) and the Brier score metric. Future work could study multiclass metrics that do not necessarily have a corresponding analogous loss function for the associated binary subproblems. Additional studies could determine whether there is any advantage to be gained in such settings.
4.6 Appendix
This appendix reports the complete results comparing shared-hyperparameter optimization to independent optimization on each data set. Tables 4.7-4.10 show the accuracy values for one-vs-all and all-pairs under the squared-error and Hamming decodings. Tables 4.11-4.14 show the corresponding rectified Brier scores.
Table 4.7: Average accuracy over 10 random splits for shared and independent model
selection strategies with the one-vs-all reduction, with the standard deviation indicated
in parentheses. The winner for each data set is indicated in bold.
Table 4.8: Average accuracy over 10 random splits for shared and independent model
selection strategies with the all-pairs reduction, with the standard deviation indicated
in parentheses. The winner for each data set is indicated in bold.
Table 4.9: Average accuracy over 10 random splits for shared and independent model
selection strategies with the one-vs-all reduction with Hamming decoding, with the
standard deviation indicated in parentheses. The winner for each data set is indicated
in bold.
Table 4.10: Average accuracy over 10 random splits for shared and independent model
selection strategies with the all-pairs-squared reduction, with the standard deviation
indicated in parentheses. The winner for each data set is indicated in bold.
Table 4.11: Average rectified Brier score over 10 random splits for shared and inde-
pendent model selection strategies with the one-vs-all reduction, with the standard
deviation indicated in parentheses. The winner for each data set is indicated in bold.
Table 4.12: Average rectified Brier score over 10 random splits for shared and indepen-
dent model selection strategies with the all-pairs reduction, with the standard deviation
indicated in parentheses. The winner for each data set is indicated in bold.
Table 4.13: Average rectified Brier score over 10 random splits for shared and indepen-
dent model selection strategies with the one-vs-all reduction with Hamming decoding,
with the standard deviation indicated in parentheses. The winner for each data set is
indicated in bold.
Table 4.14: Average rectified Brier score over 10 random splits for shared and indepen-
dent model selection strategies with the all-pairs-squared reduction, with the standard
deviation indicated in parentheses. The winner for each data set is indicated in bold.
Chapter Abstract
Early pairwise classification techniques combined discrete votes from each pairwise classifier to produce a multiclass classification [38]. Subsequent work has shown the advantage of combining probability estimates over each pair of classes (instead of a discrete classification) [51, 106, 103]. Pairwise classification methods have been criticized because each pairwise classifier is trained on only two of the classes but makes predictions for instances from any class [51, 23]. In this chapter, we propose a new pairwise classification technique that addresses this problem by weighting each pairwise prediction with an estimated probability that the instance belongs to the pair. The technique is based on the Theorem of Total Probability, and relies on only the assumption that each instance is assigned exactly one label. Furthermore, our method is conceptually simpler and easier to implement than other methods that incorporate probabilistic predictions. Experiments on 20 benchmark data sets indicate that our proposed technique performs better than voted pairwise classification [38] and the pairwise coupling methods of Hastie and Tibshirani [51] and Wu et al. [103].
5.1 Introduction
Multiclass classification problems arise in many domains, including protein structure prediction [73] and many others. However, several supervised machine learning techniques such as support vector machines [10] and AdaBoost [35] are designed for solving binary classification problems. While many modifications and extensions have been proposed for adapting these methods to multiclass classification problems, another prominent line of research instead focuses on reducing the multiclass problem to a set of binary problems. The most general framework for solving multiclass classification problems with binary classifiers is the loss-based decoding framework [2], which is flexible in how the multiclass problem is coded into binary classification problems and how the binary classifier predictions are decoded as a multiclass prediction. The widely used one-vs-all reduction creates one binary classification problem for each class to discriminate the class from the union of the remaining classes. In this chapter, we focus on a related reduction technique called pairwise classification (or all-pairs), which creates a binary classifier for each pair of classes.
In pairwise classification (also known as all-pairs, all-vs-all (or AVA) [85], round-robin classification [40] and 1-against-1 (or 1-1) [105]), a k-class classification problem is reduced to k(k − 1)/2 subproblems, one for each pair of classes. For example, Figure 5.1 indicates a k = 3 class problem with the decision boundary between classes A and C. At prediction time, each binary classifier votes for one class, and the class with the most votes is selected as the multiclass prediction, with ties broken randomly [38]. This scheme, which we refer to as voted pairwise classification,
Figure 5.1: Illustration of an A-C decision boundary in a 2D, 3-class example of the
all-pairs reduction.
ignores the confidence predictions from each binary classifier, instead combining a discrete vote from each classifier, and produces a multiclass prediction rather than a probability distribution over classes. Despite these drawbacks, voted pairwise classification has several advantages. Although there are more subproblems (quadratic in the number of classes), each subproblem contains only a small fraction of the instances in the multiclass classification problem. Furthermore, since these subproblems involve only two classes, they can be significantly simpler than the original k-class problem. That is, the decision boundary between two classes can be less complex than the boundary between one class and the union of the remaining classes.
5.1.1.1 Consistency
Consistency refers to the optimality of the multiclass classifier if each of the binary classifiers is optimal [5]. The consistency of pairwise classification can be established by showing that Bayes optimal binary classifiers combine to produce a Bayes optimal multiclass classifier. Writing pi = p(y = ci |x), where L = {c1 , c2 , ..., ck } is the set of possible labels, y is the true label and x is the input vector, the voting rule selects

ŷ(x) = arg max_{i=1..k} Σ_{j=1..k} 1( pi/(pi + pj) > pj/(pi + pj) )    (5.2)

ŷ(x) = arg max_{i=1..k} Σ_{j=1..k} 1( p(y = ci |y ∈ {ci , cj }, x) > p(y = cj |y ∈ {ci , cj }, x) )    (5.3)
Therefore, given reliable values for p(y = ci |y ∈ {ci , cj }, x), binary reduction under the all-pairs scheme yields the Bayes optimal decision. This analysis assumes p(y = ci |y ∈ {ci , cj }, x) is known exactly, which holds only in the limit of infinite sample size. However, even with finite sample sizes this analysis motivates the all-pairs reduction to estimate the pairwise probabilities as accurately as possible. Friedman offers a related argument for why the all-pairs reduction works, in terms of estimation bias [38].
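As a concrete illustration of the voting rule in Equations (5.2)-(5.3), the sketch below (hypothetical function name) counts pairwise wins given a matrix of estimated pairwise probabilities mu[i][j] ≈ p(y = ci | y ∈ {ci, cj}, x) and returns the class with the most votes; tie-breaking by lowest index is a simplification of the random tie-breaking described above.

    import numpy as np

    def voted_pairwise_prediction(mu):
        """mu: (k, k) array with mu[i, j] = estimated p(y = c_i | y in {c_i, c_j}, x).
        Diagonal entries are ignored. Returns the index of the winning class."""
        mu = np.asarray(mu, dtype=float)
        k = mu.shape[0]
        votes = np.zeros(k)
        for i in range(k):
            for j in range(k):
                if i != j and mu[i, j] > mu[j, i]:  # class i beats class j on this pair
                    votes[i] += 1
        return int(np.argmax(votes))

    # Example with k = 3 and pairwise estimates favouring class 2.
    mu = np.array([[0.5, 0.6, 0.3],
                   [0.4, 0.5, 0.2],
                   [0.7, 0.8, 0.5]])
    print(voted_pairwise_prediction(mu))  # -> 2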
Friedman compared nearest-neighbor and decision tree (CART) methods with axis-oriented splits and linear combination splits on synthetic data sets generated from Gaussian mixtures. For the nearest-neighbor algorithm, Friedman identified that all-pairs is more accurate than one-vs-all and suggests that this is due to the ability to tune the regularization hyperparameter (number of neighbors) separately in each subproblem. He also observed that the all-pairs reduction outperforms the one-vs-all reduction for CART methods with linear combination splits, and suggested that this is because the one-vs-all problems are significantly more difficult than the corresponding pairwise problems.
Several authors have proposed algorithms for obtaining multiclass probabilities from pairwise probability predictions. These methods are often called pairwise coupling methods because the predicted pairwise probabilities are coupled together to produce the multiclass classification. In this chapter, we use the terms pairwise coupling and pairwise classification interchangeably. Pairwise coupling methods estimate the pairwise probabilities µij = pi/(pi + pj) and differ in how the µij are used to estimate the multiclass probability vector.
Hastie & Tibshirani In 1996, Hastie & Tibshirani proposed a pairwise coupling algorithm that works by tuning the k-dimensional multiclass probability estimate to bring into agreement the obtained pairwise estimates and the true pairwise probability values [51]. With the true pairwise probabilities unknown, a binary classification algorithm is first used to obtain pairwise probability estimates µ̂ij ≈ µij . An initial guess vector p is selected (with all elements pi ≥ 0.0 for i = 1..k, and normalized so that Σ_{i=1..k} pi = 1) and used to compute a new iteration of values for µij . Then, by minimizing the Kullback-Leibler divergence between µ̂ij and µij in this iteration, we obtain an updated estimate of p; the procedure is repeated until p approximately minimizes the KL divergence between µ̂ij and µij . For more details, see Hastie & Tibshirani [51].
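The following minimal sketch illustrates an iterative coupling update of this flavour. It is a simplified variant that weights every pair equally, whereas the algorithm in [51] weights pairs by their training-set sizes, so it should be read as an illustration of the idea rather than a faithful reproduction.

    import numpy as np

    def ht_coupling(mu_hat, n_iter=200):
        """mu_hat: (k, k) array, mu_hat[i, j] ~ p(y=c_i | y in {c_i, c_j}, x).
        Returns a multiclass probability vector p (simplified: equal pair weights)."""
        mu_hat = np.asarray(mu_hat, dtype=float)
        k = mu_hat.shape[0]
        p = np.full(k, 1.0 / k)                  # initial guess: uniform
        off = ~np.eye(k, dtype=bool)             # mask selecting i != j entries
        for _ in range(n_iter):
            mu = p[:, None] / (p[:, None] + p[None, :] + 1e-12)  # model mu_ij
            # Multiplicative update: scale p_i by observed vs. model pairwise mass.
            num = (mu_hat * off).sum(axis=1)
            den = (mu * off).sum(axis=1)
            p = p * num / np.maximum(den, 1e-12)
            p = p / p.sum()                      # renormalize to a distribution
        return p

    mu_hat = np.array([[0.0, 0.6, 0.7],
                       [0.4, 0.0, 0.55],
                       [0.3, 0.45, 0.0]])
    print(ht_coupling(mu_hat).round(3))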
Wu, Lin & Weng In 2004, Wu et al. [103] proposed the following procedure for estimating the multiclass probabilities from the pairwise estimates µ̂ij . The multiclass probability vector is then approximated as the solution to the constrained optimization problem

min_p  Σ_{i=1..k} Σ_{j≠i} (µ̂ji pi − µ̂ij pj)²

subject to  Σ_{i=1..k} pi = 1,  pi ≥ 0 ∀i
This optimization problem reduces to a linear system, which is also solvable with an iterative process that is proven to converge. Wu et al. show this pairwise classification method to be superior to the method proposed by Hastie & Tibshirani and to voted pairwise classification [103].
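A minimal sketch of this coupling step is given below: it forms the quadratic objective above and solves the associated KKT linear system directly with NumPy. The published method and the LibSVM implementation use an iterative solver, and the final clipping to keep probabilities non-negative is an added safeguard rather than part of the original description.

    import numpy as np

    def wlw_coupling(mu_hat):
        """mu_hat: (k, k) array with mu_hat[i, j] ~ p(y=c_i | y in {c_i, c_j}, x).
        Minimizes sum_i sum_{j!=i} (mu_hat[j,i]*p_i - mu_hat[i,j]*p_j)^2
        subject to sum_i p_i = 1 (non-negativity handled by clipping)."""
        mu_hat = np.asarray(mu_hat, dtype=float)
        k = mu_hat.shape[0]
        Q = np.zeros((k, k))
        for i in range(k):
            for j in range(k):
                if i != j:
                    Q[i, i] += mu_hat[j, i] ** 2              # diagonal term
                    Q[i, j] -= mu_hat[j, i] * mu_hat[i, j]    # cross term
        # KKT system for min p'Qp subject to 1'p = 1.
        A = np.zeros((k + 1, k + 1))
        A[:k, :k] = Q
        A[:k, k] = 1.0
        A[k, :k] = 1.0
        b = np.zeros(k + 1)
        b[k] = 1.0
        p = np.linalg.solve(A, b)[:k]
        p = np.clip(p, 0.0, None)
        return p / p.sum()

    mu_hat = np.array([[0.0, 0.6, 0.7],
                       [0.4, 0.0, 0.55],
                       [0.3, 0.45, 0.0]])
    print(wlw_coupling(mu_hat).round(3))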
Pairwise classification techniques have been criticized because each pairwise classifier is trained on only two of the k classes, but makes predictions for instances of any class. Evaluating a classifier on a distribution different from the one it was trained on can be problematic because it introduces unnecessary bias into the predic-
tions [51, 23]. Consider, for example, a 4-class problem with classes {A,B,C,D}. Assume
without loss of generality that we wish to classify a test point whose true label is A. At
prediction time, all 6 classifiers are evaluated; however, the predictions from the BC,
CD and BD classifiers will be unreliable since they were not trained on distributions
containing instances with class A. If there is bias, such as D instances being more eas-
ily mistaken for B than for C, multiclass classification errors can occur. Hastie and
Tibshirani show a simulation result that indicates this bias to be a real problem, but
they comment that other (non-pairwise) approaches may not fare any better. Furthermore, they show that when this bias is present, the probabilistic predictions of the multiclass classifier are more evenly distributed, indicating reduced confidence in the predicted class.
We propose probabilistic pairwise classification (PPC), which weights each pairwise probability with a probability that the instance belongs to the pair. This approach has the following features: it incorporates predicted probabilities from the base classifiers (rather than discrete votes), it compensates for the problem that pairwise classifiers make predictions for instances with different class labels than those used during training, and it has the property of consistency. It is based on the Theorem of Total Probability, which we now use to derive the probabilistic pairwise classification rule. Given N mutually exclusive and exhaustive events a1 , ..., aN , the probability of an event b given x is

p(b|x) = Σ_{i=1..N} p(b|x, ai ) p(ai )    (5.4)
Letting L = {c1 ..ck } be the set of labels in a k-class classification problem, and substituting the events ci ∪ cj and L − ci − cj for the ai , we have

p(ci |L, x) = p(ci |ci ∪ cj , x)p(ci ∪ cj |L, x) + p(ci |L − ci − cj , x)p(L − ci − cj |L, x)    (5.5)

The second term vanishes because the labels are mutually exclusive: no instance can be labeled both ci and any label from the set L − ci − cj . Therefore,

p(ci |L, x) = p(ci |ci ∪ cj , x) p(ci ∪ cj |L, x)    (5.6)
In practice, we do not obtain true probabilities from the trained classifiers (whether calibrated or not), so we substitute the predicted probabilities and make the prediction p̂(ci |x) = p̂(ci |ci ∪ cj , x) p̂(ci ∪ cj |L, x). Here, p̂(ci |ci ∪ cj , x) is the estimated probability that the given instance x belongs to class ci given that its label is ci or cj ; this quantity is estimated by a discriminative binary classifier trained using points from only classes ci and cj , where ci
is the positive indicator class and cj is the negative indicator class. Fürnkranz provides
theory and evidence that suggests that these pairwise discrimination problems are much
simpler than corresponding one-vs-all discrimination problems [40]. The more difficult
problem is estimating p̂(ci ∪ cj |L, x), the probability that an instance belongs to either
class ci or cj , selecting from all labels L. Depending on the properties of the underlying
distributions, predicting p̂(ci ∪ cj |L, x) may be more difficult than predicting the poste-
rior probability p(ci |L, x) itself. Because the predictions are based on estimated values
for probabilities instead of true probabilities, we average over all possibilities for j ≠ i (rather than arbitrarily selecting a single value for j in Equation (5.6)), giving

p̂(ci |L, x) = (1/(k−1)) Σ_{j≠i} p̂(ci |ci ∪ cj , x) p̂(ci ∪ cj |L, x)    (5.7)
Note that the average is over k − 1 terms since the i = j case can't be used for discrimination. The motivation for this averaging is that the individual classifiers may make noisy predictions and that averaging may be able to decrease the multiclass predictive error, as long as the classifiers are not systematically biased in the same way. The binary classification algorithm used with PPC (or any pairwise coupling method) is known as the base classifier; in our experiments, base classifiers include decision trees, nearest-neighbor classifiers, support vector machines and random forests.
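The prediction rule in Equation (5.7) is straightforward to implement once the two families of probability estimates are available. The sketch below (hypothetical array names) assumes a matrix of pairwise estimates and a matrix of pair-vs-rest estimates for a single instance; the final renormalization is added for convenience and is not part of Equation (5.7) itself.

    import numpy as np

    def ppc_predict(pairwise, pair_vs_rest):
        """pairwise[i, j]     ~ p(c_i | c_i or c_j, x)   (pairwise classifier output)
        pair_vs_rest[i, j] ~ p(c_i or c_j | L, x)        (pair-vs-rest classifier output)
        Both are (k, k) arrays for one instance x; diagonals are ignored.
        Returns an estimated multiclass probability vector following Eq. (5.7)."""
        pairwise = np.asarray(pairwise, dtype=float)
        pair_vs_rest = np.asarray(pair_vs_rest, dtype=float)
        k = pairwise.shape[0]
        off = ~np.eye(k, dtype=bool)
        # Average the weighted pairwise predictions over the k-1 partners of each class.
        p = (pairwise * pair_vs_rest * off).sum(axis=1) / (k - 1)
        return p / p.sum()   # renormalize so the estimates form a distribution

    k = 3
    rng = np.random.default_rng(0)
    pairwise = rng.uniform(0.2, 0.8, (k, k))
    pair_vs_rest = rng.uniform(0.4, 0.9, (k, k))
    probs = ppc_predict(pairwise, pair_vs_rest)
    print(probs, probs.argmax())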
Domain-specific knowledge, when available, can be used to construct or simplify the intermediate models. However, in our studies, we assume that no domain specific knowledge is available, and focus solely on the fully automatic setting. Note also that the derivation above is unsuitable for multi-label classification problems (in which each instance may be assigned more than one label), since it assumes exactly one label per instance. PPC is more computationally expensive than either voted pairwise classification or one-vs-all. In one-vs-all, there are k problems, each with N data points. In the voted all-pairs reduction or other pairwise coupling schemes, there are k(k − 1)/2 classifiers, each with approximately 2N/k training data points (for balanced class distributions). PPC requires k(k − 1) classifiers, 2 for each of the k(k − 1)/2 pairs: one for discriminating between elements of the pair and one for discriminating between the pair and the rest of the classes. The pairwise classifiers are each trained on approximately 2N/k data points as above, but the pair-vs-rest classifiers are each trained on the entire training set. These values are summarized in Table 5.1. Overall, PPC is more computationally demanding than both one-vs-all and pairwise classification. For applications that have a large number of training instances or classes, probabilistic pairwise classification may therefore be costly.
Table 5.1: Computational complexity of one-vs-all (OVA), pairwise coupling (PC) and probabilistic pairwise classification (PPC).

                                       OVA        PC           PPC
    number of subproblems              k          k(k−1)/2     k(k−1)
    instances per subproblem           N          2N/k         N (pair-vs-rest half), 2N/k (pairwise half)
    training complexity (SVM base)     O(kN³)     O(k⁻¹N³)     O(k²N³)
However, it is important to note that all of the aforementioned methods are fully parallelizable. For offline applications in which training time is the bottleneck, the main difference between PPC and other pairwise coupling methods is the training of the k(k − 1)/2 pair-vs-rest classifiers, which each train on all N data points instead of just 2N/k data points.
For a base classifier with a known computational complexity, the time complexity of the various methods can be computed. For instance, the binary SVM has a computational complexity for training time of O(N³), where N is the number of training data points. For one-vs-all, each of k subproblems must train on all N points, so the computational complexity is O(kN³). For balanced pairwise coupling methods, for which each subproblem contains approximately Ns = 2N/k points, the complexity is O(k² Ns³) = O(k⁻¹N³). For PPC, the pairwise subproblems have the same cost, but the k(k − 1)/2 pair-vs-rest classifiers each train on all N data points, giving O(k²N³). These computational complexities are summarized in the last row of Table 5.1.
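As a rough illustration of these formulas (a back-of-the-envelope calculation, not a benchmark), the snippet below counts subproblems and relative cubic training cost for a hypothetical problem with k = 10 classes and N = 1000 instances.

    k, N = 10, 1000

    ova_cost = k * N**3                              # k problems, N points each
    pc_cost = (k * (k - 1) // 2) * (2 * N // k)**3   # pairwise problems, ~2N/k points each
    ppc_cost = pc_cost + (k * (k - 1) // 2) * N**3   # plus pair-vs-rest problems on all N points

    print("subproblems:", k, k * (k - 1) // 2, k * (k - 1))
    for name, cost in [("OVA", ova_cost), ("PC", pc_cost), ("PPC", ppc_cost)]:
        print(f"{name}: {cost / ova_cost:.2f}x the one-vs-all cost")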
It could be argued that discriminative classifiers provide unreliable estimates for p(ci |ci ∪ cj ) and that proper density models should be used instead; this argument applies equally well to one-vs-all and other reductions. Furthermore, if any desired probability p(ci |L) could be easily and accurately estimated, then the one-vs-all reduction could be used directly. Inducing pairwise class probabilities should be easier than inducing one-against-all class probabilities, since all-pairs subproblems tend to be simpler than one-vs-all subproblems (and, in turn, simpler than the full multiclass problem).
Wu et al. show that Hastie and Tibshirani's pairwise coupling method can be derived from the identity

pi = Σ_{j≠i} ((pi + pj)/(k − 1)) µij    (5.8)

with µij = p(y = ci |y ∈ {ci , cj }, x), where pi + pj is assumed to be 2/k and the µij are taken as the pairwise predictions µ̂ij [103]. PPC is also based on this identity, but instead of assuming a fixed value for the pair probability, it estimates p̂(ci ∪ cj |L, x) directly with a trained classifier.
5.3 Methodology
In this section, we describe the methodology used to compare probabilistic pairwise classification to other pairwise classification methods, and to classifiers that are capable of direct multiclass classification, such as decision tree algorithms and random forests. The experimental results indicate that under a wide variety of base classifiers, metrics and data sets, PPC compares favorably with existing methods.
The data sets used in our experiments, along with their number of classes, number of features and number of instances, are indicated in Table 5.2. We formalize our data set selection procedure below to rule out selection bias. We began with publicly available collections, omitting any collection that consisted solely of artificial, ordinal or regression datasets, and filtered the remaining data sets, requiring that (a) the number of classes must be three or more, so that the problem is genuinely multiclass, (b) the number of numeric attributes is between 5 and 500, ensuring there are sufficiently many but not too many attributes, and (c) the number of instances is 200 or more, so that each class will have sufficiently many instances in each binary training set. Uniquely identifying attributes (such as index attributes) are discarded, specifically: the counter attribute in collins, the BookID attribute in authorship and the ID attribute in dj30-1985-2003. Classes with fewer than 20 instances are deleted, along with the corresponding instances. The free parameters in the above rules were hand-tuned until 20 datasets were selected, to facilitate statistical analysis. After deletion of classes, any duplicate instances (based on attribute values, not class values) are deleted. Data set selection was performed before evaluation of any of the algorithms.
Stratified subsampling is used to reduce the total number of instances for large data sets, maintaining a distribution over class labels commensurate with the original sample. For data sets with N ≥ 450 instances, random draws are sampled with Ntr = 300 training points and Nts = 150 test points. For data sets with N < 450, random draws are taken with 2/3 of the instances used for training and the remaining approximately 1/3 of the points used for testing. Missing values are filled in with the mean of non-missing values for each attribute. Datasets from similar domains are discarded in order to improve tests for statistical significance: due to its similarity to optdigits, only one of the mfeat- series was selected, and anneal.ORIG, heart-h and cars-with-names were filtered out due to similarity with other data sets. For data sets with k > 20 (letter and dj30-1985-2003), 1/3 of the classes are removed, along with the corresponding instances.
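A minimal sketch of the stratified subsampling step is shown below (hypothetical function; it assumes per-class proportional allocation and does not reproduce every detail of the procedure above).

    import numpy as np

    def stratified_subsample(y, n_total, rng):
        """Return indices of a subsample of size ~n_total whose class proportions
        match those of the label array y."""
        y = np.asarray(y)
        classes, counts = np.unique(y, return_counts=True)
        chosen = []
        for c, count in zip(classes, counts):
            idx = np.flatnonzero(y == c)
            n_c = max(1, int(round(n_total * count / len(y))))  # proportional allocation
            chosen.append(rng.choice(idx, size=min(n_c, count), replace=False))
        return np.concatenate(chosen)

    rng = np.random.default_rng(0)
    y = rng.integers(0, 4, size=2000)          # toy labels for a 4-class data set
    train_idx = stratified_subsample(y, 300, rng)
    print(len(train_idx))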
Table 5.2 indicates the datasets used in our experiments, and their relevant properties. The column labeled entropy refers to the normalized entropy (in base 2) of the class distribution, −(Σᵢ pᵢ log₂ pᵢ)/log₂ k, where pᵢ is the proportion of instances with the class label ci and k is the number of classes. For instance, the entropy is 1 for a class variable with an even distribution of class labels p1 = ... = pk = 1/k, and 0 for a distribution that has only instances with one label, i.e. pi = 1 and pj = 0 for j ≠ i. To summarize, the number of classes varies from 3 to 20, with entropy varying between 0.4819 and 0.9976. The smallest training sample size (after subsampling) is 133, and the number of attributes lies between 5 and 500, as required by the selection criteria.
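The normalized entropy used here can be computed as in the short snippet below (a direct transcription of the definition; the single-class guard is an implementation detail).

    import numpy as np

    def normalized_class_entropy(labels):
        """Base-2 entropy of the empirical class distribution, divided by log2(k)
        so that a uniform distribution gives 1 and a single-class distribution gives 0."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        k = len(p)
        if k == 1:
            return 0.0
        entropy = -np.sum(p * np.log2(p))        # p > 0 by construction
        return float(entropy / np.log2(k))

    print(normalized_class_entropy([0, 1, 2, 0, 1, 2]))   # -> 1.0 (uniform)
    print(normalized_class_entropy([0, 0, 0, 0, 0, 1]))   # -> ~0.65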
For some of our experiments, we use a base classifier that is itself capable of performing multiclass classification. For example, we use decision trees, random forests and nearest-neighbor classifiers directly as multiclass classifiers, in addition to using them as base classifiers for the pairwise schemes. We compare the following pairwise combination methods.
Voted pairwise classification (VPC). This is the method proposed by Friedman in which each base classifier makes a discrete vote for the prediction and the combined classifier makes a discrete classification rather than producing a probability distribution over classes.
Hastie-Tibshirani (HT). This is the pairwise coupling scheme proposed in [51]; we use the implementation available in Weka. The method is described in Section 5.1.1.2.
Wu-Lin-Weng (WLW). WLW is the pairwise coupling scheme proposed in [103] and implemented in LibSVM. We obtain its predictions by clipping each pairwise probability estimate to lie between 10⁻⁷ and 1 − 10⁻⁷ and calling the method multiclass probability, which is described in Section 5.1.1.2. While it would be possible to use the entire LibSVM implementation to obtain predictions for WLW-SVM-121, we merely use the multiclass probability method in order to reduce the number of differences between the different multiclass and pairwise methods, ensuring that (a) there are no other pre- or post-processing steps included in one framework but not the other and (b) the pairwise classifiers are identical between methods. In contrast, we use the Weka framework for HT because it is compatible with any base classifier, whereas the full WLW implementation in LibSVM is tied to the SVM base classifier. As a result, there may be a slight difference between the full LibSVM implementation of WLW-SVM-121 and the implementation used in our experiments.
We experiment with a variety of base classifiers, which are the classifiers used with the various reduction schemes or used directly as multiclass classifiers. The first three algorithms provide direct support for multiclass classification, so we are able to compare the reductions against direct multiclass classification, complementing the results given in Fürnkranz [39] and Wu et al. [103]. We use the following classifiers:
We use the decision tree classifier as implemented in Weka’s J48 (in Weka 3.7.0),
which is a reimplementation of Quinlan’s C4.5 decision trees [83]. We use Weka’s de-
fault parameter settings: a pruned decision tree that allows multiway splits on nominal
attributes.
We use the K-nearest neighbor algorithm (see [1]) as implemented in Weka's IBk class. In our experiments, we fix k = 1 and use no distance weighting, so that the prediction is the class of the single nearest training instance. Euclidean distance is used for numerical attributes and Hamming distance is used for nominal attributes.
We use the random forest algorithm as implemented in Weka's RandomForest class. Following Breiman [15], we set the number of trees in each forest to be L = 100 and set the number of features in the random inputs scheme to be (log2(d) + 1), where d is the number of attributes. The size of the trees is unconstrained. In Section 5.5.2, we study the effect of varying the number of trees.
We use the SVM algorithm as implemented in LibSVM [103], using the Gaussian
kernel. To search over the hyperparameters {c, γ}, we perform a search over the coarse
7 × 7 grid of {−5, −1.666, 1.666, 5.0, 8.333, 11.666, 15.0}² to first determine a value for
the cost hyperparameter c. Then a separate search is performed over a finer grid of
72 samples (ranging from −40 to 15) at the previously determined c-value to obtain
the value for γ. We used this sampling scheme since there was much more sensitivity
to the γ hyperparameter than to the c hyperparameter, and so that our scheme would
take a total of 121 samples, as done in many other grid searches, such as LibSVM
[103]. Platt scaling is used to fit a sigmoid to each of the SVM models to improve
probability estimates [82]. Additionally, since we are using LibSVM with probability estimates enabled, the WLW coupling procedure is already applied internally to the predictions before determining the full multiclass probability distributions by using the methods identified above. This implementation therefore has a double layer of the WLW method: once to estimate the binary probabilities and again to estimate the multiclass probabilities.
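The two-stage search can be sketched as follows. This uses scikit-learn's SVC with cross-validation as a stand-in for the LibSVM tools actually used, so the helper and its details (3-fold cross-validation, how the coarse stage picks C) are illustrative assumptions rather than the exact experimental pipeline.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def two_stage_grid_search(X, y):
        """Coarse 7x7 search over (log2 C, log2 gamma), then a fine 72-point search
        over log2 gamma in [-40, 15] at the chosen C."""
        coarse = [-5.0, -1.666, 1.666, 5.0, 8.333, 11.666, 15.0]
        def cv_acc(log2_c, log2_g):
            clf = SVC(C=2.0 ** log2_c, gamma=2.0 ** log2_g, kernel="rbf")
            return cross_val_score(clf, X, y, cv=3).mean()
        # Stage 1: pick C from the coarse grid (keeping its best gamma score).
        best_c, _ = max(((c, max(cv_acc(c, g) for g in coarse)) for c in coarse),
                        key=lambda t: t[1])
        # Stage 2: refine gamma on a 72-point grid at the chosen C.
        fine = np.linspace(-40, 15, 72)
        best_g = max(fine, key=lambda g: cv_acc(best_c, g))
        return 2.0 ** best_c, 2.0 ** best_g

    # Tiny usage example on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((90, 4))
    y = rng.integers(0, 3, size=90)
    C, gamma = two_stage_grid_search(X, y)
    print(C, gamma)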
5.4 Results
For a given pairwise classification method METHOD and base classifier BASE, we describe the combination as METHOD-BASE; for example, probabilistic pairwise classification with the J48 decision tree base classifier is named PPC-J48. When a method is used directly as a multiclass classifier, it is prefixed with MULTI (e.g. MULTI-J48). To improve readability, aggregate behavior over all data sets is reported here in the main text, while performance on individual data sets is discussed in the Appendix (Section 5.7). We first report on results for the accuracy metric (Section 5.4.1), then discuss results under the Brier metric (Section 5.4.2). For statistical comparisons we use the nonparametric statistical tests recommended by Demšar [24] and refined by García et al. [42]. In particular, we use the Holm test for comparing one algorithm against many others; this is a nonparametric procedure that controls the family-wise error rate and yields adjusted p-values. To perform these computations, we use existing statistical software.
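To make the statistical procedure concrete, the sketch below uses SciPy's Wilcoxon signed-rank test together with a hand-rolled Holm correction; the per-data-set accuracy arrays are hypothetical and the snippet is an illustration rather than the software actually used.

    import numpy as np
    from scipy.stats import wilcoxon

    def holm_adjust(p_values):
        """Holm step-down adjustment of a list of p-values (controls family-wise error)."""
        p = np.asarray(p_values, dtype=float)
        order = np.argsort(p)
        m = len(p)
        adjusted = np.empty(m)
        running_max = 0.0
        for rank, idx in enumerate(order):
            running_max = max(running_max, (m - rank) * p[idx])
            adjusted[idx] = min(1.0, running_max)
        return adjusted

    # Hypothetical per-data-set accuracies (one row per data set) for PPC and three rivals.
    rng = np.random.default_rng(0)
    ppc = rng.uniform(0.7, 0.9, size=20)
    rivals = {"vpc": ppc - rng.uniform(0.0, 0.05, 20),
              "ht": ppc - rng.uniform(0.0, 0.03, 20),
              "wlw": ppc + rng.normal(0.0, 0.01, 20)}

    raw = {name: wilcoxon(ppc, other).pvalue for name, other in rivals.items()}
    adj = holm_adjust(list(raw.values()))
    for (name, p_raw), p_adj in zip(raw.items(), adj):
        print(f"{name}: raw p={p_raw:.4f}, Holm-adjusted p={p_adj:.4f}")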
5.4.1 Accuracy
The accuracy metric counts the fraction of correctly classified instances, ignoring probabilistic predictions: a = (1/N) Σ_{n=1..N} 1(ŷ(xₙ) = y(xₙ)), where ŷ(x) = arg maxᵢ pᵢ(x) is the predicted class, x is the input attribute vector, and y(x) is the true class. Tables
and figures indicating performance on individual data sets are located in the Appendix.
Figure 5.2 indicates the behavior of each base classifier and reduction technique on the accuracy metric, averaged over all data sets. The first category (multiclass) indicates using the specified classifier directly as a multiclass classifier rather than using a pairwise coupling scheme. For the base classifiers J48, RF-100 and SVM-121, PPC attains the highest average accuracy.
Figure 5.2: Accuracy averaged over all 20 data sets for all combinations of base classifier
and reduction method, with one standard error indicated.
To evaluate the quality of the probabilistic predictions, we use the Brier metric [16], also known as the mean squared error (MSE) metric, in which b(x) = (1/d) Σj (tj(x) − p̂j(x))², where x is the input vector, tj is the target probability for the jth class, p̂j is the probability estimated by the classification method and d is the number of classes (the dimensionality of the probability vectors). We take tj(x) to be 1 if the label belongs to the jth class and 0 otherwise. This metric is used in other related work such as Wu et al. [103] and Zadrozny [106]. In the results presented below, we use the rectified Brier score r(x) = (1 − b(x)) × 100 so that the results are structurally similar to percent accuracy; namely, higher is better and the values range between 0% and 100%. Again, tables and figures indicating performance on each data set are located in the Appendix.
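For reference, a minimal implementation of the rectified Brier score as defined above (with d taken as the number of classes, per the discussion of the formula) is given below.

    import numpy as np

    def rectified_brier(p_hat, true_label):
        """p_hat: estimated class-probability vector for one instance.
        true_label: index of the correct class.
        Returns r(x) = (1 - b(x)) * 100, where b(x) is the per-instance Brier score."""
        p_hat = np.asarray(p_hat, dtype=float)
        d = len(p_hat)
        t = np.zeros(d)
        t[true_label] = 1.0                     # one-hot target probabilities
        b = np.mean((t - p_hat) ** 2)           # average over the d classes
        return (1.0 - b) * 100.0

    print(rectified_brier([0.7, 0.2, 0.1], 0))   # confident and correct -> high score
    print(rectified_brier([0.1, 0.2, 0.7], 0))   # confident and wrong   -> lower score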
Figure 5.3 shows the average performance of each base classifier and reduction
technique under the Brier metric, with a standard deviation for the PPC methods also
indicated. The first category multiclass indicates using the specified classifier directly
as a multiclass classifier. Again, PPC has the highest average performance for the base classifiers J48, RF-100 and SVM-121.
Figure 5.3: Rectified Brier score averaged over all 20 data sets for all combinations of
base classifier and reduction method, with one standard error indicated.
5.4.3 Discussion
Figure 5.4 plots the average rank of each algorithm over all 20 data sets. Depic-
tion of average rank (as opposed to average accuracy or average rectified Brier score) is
valuable since (a) it doesn’t assume that errors across different data sets are commen-
surate and (b) it doesn’t allow excellent performance on a single problem to obscure
mediocre performance on several other data sets. This visualization technique and re-
lated statistical tests are described in Demšar [24]. A vertical bar connects the top
algorithm to any other algorithm that is not statistically significantly different from it
under the Holm test [42] at a p ≤ 0.05 level of significance. Note that in all of these instances PPC is either the highest ranking method (6 out of the 8 runs), or not statistically significantly different from the top ranking method. Therefore, we conclude that for a variety of base classifiers, metrics and data sets, PPC yields statistically significantly better performance than other pairwise coupling methods and multiclass classification methods.
We noted that the predictions and performance were not preserved under permu-
tation of the subproblem assignments. Specifically, while only k(k − 1)/2 pairwise and
k(k − 1)/2 pair-vs-rest terms need to be computed, we found that modeling µ̂ij and
computing µ̂ji = 1 − µ̂ij for i = 1..k and j = 1..i − 1 provided slightly different results
than modeling µ̂ji and computing µ̂ij = 1 − µ̂ji . This suggests an asymmetry in the
behavior of the base classification algorithms. Future work could investigate whether it
is valuable to estimate and utilize both µ̂ij and µ̂ji independently in order to achieve a further improvement in performance. In the following sections, we analyze the learning curves (Section 5.5.1), look at the accuracy as a function of the number of classes (Section 5.5.4), identify dependence on data set entropy (Section 5.5.5) and test additional hypotheses about when PPC is effective.
Figure 5.4: Graphical depiction of the rank of each algorithm as averaged over all 20
data sets, shown for the accuracy metric (top row) and for the Brier metric (bottom
row). A vertical bar connects the top algorithm to any other algorithm(s) that are not
statistically significantly different from it, if any.
In this section, we examine the performance of each method as a function of the amount of training data—the so-called learning curves. Since the statistical comparisons per-
formed in Section 5.4 use a single value for the amount of training data, it is valuable
to look at the dependence on the amount of training data in order to determine, for ex-
ample, whether the relative effectiveness of each method changes with varying training
sample sizes. For this experiment, we take the 10 largest data sets from the 20 data sets
used in Section 5.4, again with a 2/3 training and 1/3 testing proportion. We focus on
using the random forest classifier with 100 trees (RF-100), since it was demonstrated
to be competitive compared to the other base classifiers, and less computationally ex-
pensive than the model-selected SVM-121. The average results over all 10 data sets are
indicated in Figure 5.5; behavior on individual data sets is not qualitatively different
than the composite behavior, so plots for individual data sets are omitted.
Figure 5.5: Accuracy as a function of the sample size (2/3 of which used for training),
averaged over the 10 largest data sets described in Table 5.2
Note that the relative performance of all 5 methods does not vary significantly as
the amount of training data is doubled; only voted pairwise classification (VPC) shows
an average change in rank, decreasing from the accuracy of WLW at low sample sizes to
the accuracy of HT at larger sample sizes. The fact that the ranks are predominantly stable indicates that the statistical comparisons in Section 5.4 apply to a larger range of
training sample sizes. Also note that the amount of training data is crucial in deter-
mining the performance of the learning algorithm—for instance, the worst method with
570 data points (380 training points) outperforms the best method with 450 data points
(300 training points). Informally, the addition of 20% more data points is more valuable
than switching from the least to the most effective combination rule.
Many studies assume that the benefits of a reduction scheme with a poor or
untuned base classification algorithm will generalize to more accurate base classifiers
[27, 64, 2]. As pointed out by Rifkin and Klautau [85], it is essential to study the
reduction with well-tuned base classifiers, since we are interested in understanding the
behavior in the regime with the best predictive power. In this section, we investigate
the accuracy of the various methods as the performance of the base classifiers is varied,
by using random forests as the base classifier and increasing the number of trees. The
accuracy of random forest classifiers tends to increase monotonically with the number
of trees [15]. We use the following numbers of trees: {10, 50, 100, 200, 500, 1000}. Figure
5.6 depicts the average accuracy of each of the 5 methods as a function of the base-10
While the average over 20 data sets indicates the advantage of PPC over the other
methods for varying numbers of trees, the variability between the methods decreases as
the number of trees increases. At 10 trees, the difference between the best and worst
performing methods is about 2.25%, while at 1000 trees, the difference between the
best and worst performing methods is only about 1.25%. While this average behavior
is indicated in Figure 5.6, the behavior on several individual data sets is indicated in
For instance, in the anneal dataset, the accuracy of all methods basically level
out around 200 trees, without too much noise. In the dj30-1985-2003 data set, the
124
Figure 5.6: Accuracy as a function of the (log10 of the) number of trees in the random
forest base classifier, averaged over all 20 data sets described in Table 5.2
methods PPC and MULTI have a similar performance, significantly higher than the other methods. In the eucalyptus data set, the multiclass classifier doesn't improve as rapidly as the pairwise methods. Increasing the number of trees is effective in all methods, and while the curves for VPC, HT and WLW tend to level out around 200 trees (with the performance of HT slightly decreasing), the multiclass method and PPC still attain increased accuracy with 500 or 1000 trees.
One of the primary differences between the direct multiclass methods (such as
decision trees and nearest-neighbor methods) and pairwise classification methods is that
the direct multiclass methods operate on all classes simultaneously, while the pairwise
[Figure 5.7 panels: accuracy (%) vs. log10(number of trees) for six individual data sets, including anneal and dj30-1985-2003, with curves for multi, vpc, ht, wlw and ppc.]
Figure 5.7: Examples of accuracy as a function of the number of trees in the random
forest base classifier for 6 of the data sets.
classification methods are restricted to using one pair of classes at a time. In a PPC prediction (see Equation (5.6)), the first term p(ci |ci ∪ cj , x) is restricted to using pairwise discriminations, and is multiplied by a pair-vs-rest weight p(ci ∪ cj |L, x). In this section, we hypothesize that PPC will be less effective than a direct multiclass method on a problem whose subproblems share decision boundaries. To test this, we construct a 4-class synthetic data set in which each class is drawn from a Gaussian distribution centered in one of the four quadrants. By varying the covariance matrices of the Gaussian distributions, we are able to change the amount of noise in each problem. The motivation for the structure of this synthetic data set is that the decision tree algorithm can see all four classes simultaneously, and therefore has the potential to use shared decision boundaries; for example, an ideal axis-aligned tree would use exactly the same decision boundaries for A − B as for C − D. Since the PPC algorithm never sees more than two classes at a time, PPC will not be able to obtain this same benefit. As for the data sets discussed in Section 5.3, 300 training points and 150 test points are used.
Figure 5.8 indicates the 4-class synthetic problem described above, and the results of the MULTI-J48 classifier and the PPC-J48 classifier are indicated in Table 5.3. The results for this experiment are averaged over 100 random draws from the underlying generative distribution; more runs are possible for this synthetic study since the decision-tree based algorithms are computationally inexpensive. Even though this is a simple learning problem, note that the accuracies are not exactly 100%; this result is due to the fact that the convex hull of the training points is responsible for inducing the decision boundary. Since 1/3 of the data is removed for testing purposes, the decision tree obtains suboptimal decision boundaries. For this data set, the decision tree outperforms probabilistic pairwise classification by 0.527%. The two-tailed paired t-test indicates a statistically significant difference at the p ≤ 0.05 level; the actual p-value is p ≤ 3.15 × 10⁻¹¹. Therefore we have verified our hypothesis that a multiclass classification method can outperform PPC on a problem for which the subproblems share decision boundaries.
[Figure 5.8: The comparatively noiseless 4-class synthetic data set, with classes A, B, C and D centered in the four quadrants.]
Next, we increase the amount of noise in the same configuration to see whether the benefits of the multiclass decision tree will be retained. Figure 5.9 indicates the 4-class problem with a significant amount of noise. Again, the results for this experiment are averaged over 100 random draws from the underlying generative distribution. The results from using decision trees (MULTI-J48) and probabilistic pairwise classification with decision tree base classifiers (PPC-J48) are indicated in Table 5.4. The increased noise is responsible for the significantly decreased accuracy for both algorithms, compared to the synthetic dataset presented in Section 5.5.3.1. In this case, PPC outperforms MULTI-J48 by 1.48%; that is, on this sample data set, it is slightly more accurate to reduce the dataset into 4·3/2 = 6 pairwise subproblems than to solve the 4-class
Table 5.3: Accuracy results (%) for the comparatively noiseless synthetic data set. The
standard error over 100 random samplings is indicated in parentheses.
multi-j48 ppc-j48
99.2 (0.08) 98.7 (0.10)
problem directly using a multiclass decision tree. The two-tailed paired t-test indicates a statistically significant difference at the p ≤ 0.05 level; the actual p-value is p ≤ 2.14 × 10⁻⁹. This result contradicts the result in Section 5.5.3.1, even though the structure of the decision boundaries remains unchanged. It is not entirely clear why multiclass decision trees do not maintain the same advantage in the noisy synthetic problem as in the noiseless synthetic problem. One possible explanation for the advantage of PPC is that the estimation and averaging of 6 subproblems performs a kind of smoothing, similar to bagging.
5.5.3.3 Summary
We hypothesized that a problem whose subproblems share decision boundaries would favor operating on all classes simultaneously (and thus a direct multiclass method) rather than limiting decision regions to pairwise comparisons, and the noiseless synthetic experiment confirmed this hypothesis. However, in a synthetic data set with identical decision boundaries and only increased noise, PPC is surprisingly more effective. We suggested that PPC may have an advantage in the noisy situation due to its averaging over a large number of subproblems; further experiments could test this
Table 5.4: Accuracy results (%) for the noisy synthetic data set. The standard error
over 100 random samplings is indicated in parentheses.
multi-j48 ppc-j48
84.5 (0.34) 86.0 (0.31)
[Figure 5.9: The noisy 4-class synthetic data set, with classes A, B, C and D.]
hypothesis.
For the J48 base classifier and accuracy metric, PPC exhibits excellent perfor-
mance gains for the data sets letter, optdigits, collins and vowel, which have 18, 10, 11,
and 11 classes, respectively. Since this larger performance gain occurs on four of the six
data sets with the largest number of classes, it suggests that PPC may have a bigger
advantage on data sets with more classes. First, we evaluate this hypothesis under the
20 benchmark data sets described in Section 5.3.1, then evaluate this hypothesis on new
data sets in which we incrementally increase the number of classes. In this section, we
restrict our focus to using random forest as the base classifier, since it had competitive
performance on the benchmark data sets, and since it is computationally less expensive than the model-selected SVM-121. Figure 5.10 plots the relative gain over the multiclass random forest as a function of the number of classes for each of the 20 benchmark data sets.
Figure 5.10: The accuracy relative to a random forest with 100 trees as a function of
the number of classes in the data set for voted pairwise classification (VPC), Hastie-
Tibshirani’s method (HT), Wu-Lin-Weng’s method (WLW), and probabilistic pairwise
classification (PPC). There is one data point for each of the 20 benchmark data sets
(see Section 5.3.1) and for each of the methods.
We examine how the relative accuracy varies with the number of classes. For instance, a relative accuracy of 0% indicates (at that num-
ber of classes) that random forests and the pairwise classification method have identical
performance. A positive slope indicates that the algorithm improves its benefit over ran-
dom forests as the number of classes increases; a negative slope indicates that random
forests is more effective at a higher number of classes. For voted pairwise classification (VPC), HT and WLW, there is a negative correlation between the number of classes and the relative accuracy (with multiclass random forests as the baseline). The results comparing the multiclass classifier to VPC, HT and WLW are consistent with the observation in Wu et al. [103] that multiclass classification is more effective than the pairwise classification schemes on these data sets. For probabilistic pairwise classification (PPC), there is an average gain of about 1% over the range of 17 classes, indicating that PPC's advantage grows modestly as the number of classes is increased. Qualitatively similar results are obtained for the Brier metric (omitted for
brevity).
Since the benchmark data sets were used to construct the hypothesis, they cannot be used to validate the hypothesis; instead, we experiment with a set of 9 new multiclass classification problems. These problems are formed using regression data sets whose numeric target attribute is discretized into different classes. For instance, a regression problem with outputs varying uniformly between 0 and 1 is transformed into a 3-class classification problem by taking class 1 to be instances with output between 0 and 1/3, and so on. This technique for converting regression into classification problems was proposed in Frank and Hall [34] and used in Fürnkranz [41]. We used this method due to its prominence in the related literature and because it provides a straightforward way to generate classification problems with a controlled number of classes from the same underlying problems.
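A minimal sketch of this discretization is shown below. It uses equal-width binning of the numeric target, which matches the uniform-output example above; equal-frequency binning would be a reasonable alternative.

    import numpy as np

    def discretize_target(y, n_classes):
        """Convert a numeric regression target into n_classes labels by
        splitting its range into equal-width intervals."""
        y = np.asarray(y, dtype=float)
        edges = np.linspace(y.min(), y.max(), n_classes + 1)
        labels = np.digitize(y, edges[1:-1])   # 0 .. n_classes-1
        return labels

    y = np.random.default_rng(0).uniform(0.0, 1.0, size=12)
    for k in (3, 4):
        print(k, np.bincount(discretize_target(y, k), minlength=k))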
To obtain the data sets for this study, we started with the collection of 37 re-
gression data sets available from the Weka website (obtained from various sources) and
filtered out data sets that had fewer than 300 instances. The remaining data sets are indicated in Table 5.5. As in Section 5.3, we use 2/3 of the points for training and 1/3 of the points for testing. We stop increasing the number of classes when the number of instances per class becomes too small.
Results are indicated in Figure 5.11. The vertical axis indicates the accuracy
relative to the random forest algorithm; a value of 0% indicates that the pairwise classi-
fication algorithm had equivalent performance to the random forest classifier. A positive
slope indicates increased advantage over the random forest at a higher number of classes.
On the meta dataset, there is a negative correlation between the number of classes and relative accuracy. In the housing data set, the relative accuracy is independent of the
number of classes. For the other 6 data sets, there are varying degrees of improvement.
The cholesterol data set attains the largest benefit; in this case, PPC increases rela-
tive improvement over the random forest algorithm by about 9.5% as the number of
classes is raised from 2 to 13. While there are exceptions, the results on discretized
data sets support the hypothesis that PPC confers more benefits at a higher number of
classes. One possible explanation for this behavior could be a bagging-like phenomenon, as discussed below.
Figure 5.11: The accuracy relative to a random forest with 100 trees as a function of the number of classes for regression data sets that have been discretized with varying numbers of classes. The average over the 9 data sets is indicated by the wide red line.
We next examine performance as a function of the normalized entropy of the class distribution; note that the probabilities of class membership (the terms $p_i$) are not equivalent to the class priors, which determine the normalized entropy. Recall that normalized class entropy (defined in Section 5.3.1) varies from 0
to 1, with a value of 0 meaning that all instances share the same class and a value of 1
meaning that the class distribution is uniform. In order to test the behavior of PPC and
other algorithms under varying class distributions, we plot the accuracy as a function
of normalized class entropy. The results on the accuracy metric are indicated in Figure
5.12, with each combination method compared to the direct multiclass method RF-100.
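For reference, normalized class entropy can be computed from the empirical class priors as sketched below (assuming normalization by the logarithm of the number of classes; the thesis's formal definition appears in Section 5.3.1):

    import math
    from collections import Counter

    def normalized_class_entropy(labels):
        """Entropy of the empirical class distribution, divided by log(k),
        so that 0 means a single class and 1 means a uniform distribution."""
        counts = Counter(labels)
        n, k = sum(counts.values()), len(counts)
        if k < 2:
            return 0.0
        priors = [c / n for c in counts.values()]
        entropy = -sum(p * math.log(p) for p in priors)
        return entropy / math.log(k)

    print(normalized_class_entropy(["a"] * 50 + ["b"] * 50))  # 1.0 (uniform)
    print(normalized_class_entropy(["a"] * 99 + ["b"]))       # close to 0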
There are many data sets of varying difficulty that all have nearly unity normalized entropy, so the results are very noisy. Over the domain of normalized entropies from 0.5 to 1.0, the largest difference is between PPC and HT, with an average difference of approximately 3% near a normalized entropy of unity for the linear fits. These small and noisy differences do not indicate significantly different performance for the various methods; a more controlled study would be needed in order to isolate an effect of the class distributions, possibly using a discretization scheme such as the one used in Section 5.5.4. In contrast, note that in Hastie & Tibshirani's assumption that $p_i + p_j \approx 2/k$, the probability values $p_i$ and $p_j$ are true probabilities of class membership rather than priors over class distributions. Wu et al. show that Hastie & Tibshirani's method indeed performs more poorly as the probabilities of class membership deviate from this assumption.
Figure 5.12: The accuracy relative to a random forest with 100 trees as a function of the normalized class entropy for voted pairwise classification (VPC), Hastie-Tibshirani's method (HT), Wu-Lin-Weng's method (WLW), and probabilistic pairwise classification (PPC).
Consider the PPC estimate (Equation 5.9):

\[
\hat{p}(c_i \mid L, x) \;=\; \frac{1}{k-1} \sum_{j \neq i} \hat{p}(c_i \mid c_i \cup c_j, x)\, \hat{p}(c_i \cup c_j \mid L, x) \tag{5.9}
\]

By replacing the pairwise term $\hat{p}(c_i \mid c_i \cup c_j, x)$ or the pair-vs-rest term $\hat{p}(c_i \cup c_j \mid L, x)$ with a constant, uninformative value, we can better understand the relative importance of these terms
in producing the overall multiclass probability estimates. Table 5.6 shows the average
accuracy scores for the MULTI-J48 classifier, with statistically significant differences in-
dicated in Table 5.7. We use the Holm test to identify statistically significant differences
between every pair of methods, and find that, of the 6 scheme-vs-scheme comparisons, the only pair that does not exhibit statistically significantly different behavior under the accuracy metric is the comparison of the no-weight degradation to the no-pair degradation.
Therefore, we conclude that both terms are equally important in making the pairwise
classification. Note that there are special cases in which the degradations actually per-
form more accurately than the PPC itself, such as for the anneal dataset. Specifically,
assuming a uniform distribution over the p̂(ci |ci ∪ cj , x) term or the p̂(ci ∪ cj |L, x) term
produces approximately 0.2% performance benefit for the anneal data set. Furthermore,
removal of both the weight and pairwise terms still yields relatively large accuracy on
some of the data sets (cars: 64%, halloffame: 85.3%, hypothyroid: 85.3%). These accuracies seem to be commensurate with the entropies of the data sets (cars: 0.869, ...).
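A sketch of how Equation (5.9) and its degradations might be computed, assuming the binary classifiers have already produced the pairwise and pair-vs-rest probabilities, stored in hypothetical dictionaries pairwise and pair_vs_rest keyed by ordered pairs of class indices (both orderings present); the constants used for the uniform degradations (1/2 and 2/k) and the final renormalization are our assumptions, not necessarily the thesis's implementation:

    def ppc_estimate(k, pairwise, pair_vs_rest, no_pair=False, no_weight=False):
        """Multiclass probability estimates via Equation (5.9).

        pairwise[(i, j)]     -- estimate of p(c_i | c_i or c_j, x); assumes
                                pairwise[(j, i)] == 1 - pairwise[(i, j)]
        pair_vs_rest[(i, j)] -- estimate of p(c_i or c_j | L, x), symmetric in (i, j)
        no_pair, no_weight   -- degradations: replace the corresponding term by a
                                constant (1/2 within a pair, 2/k as a pair prior)
        """
        scores = []
        for i in range(k):
            total = 0.0
            for j in range(k):
                if j == i:
                    continue
                pair = 0.5 if no_pair else pairwise[(i, j)]
                weight = 2.0 / k if no_weight else pair_vs_rest[(i, j)]
                total += pair * weight
            scores.append(total / (k - 1))
        z = sum(scores)                 # renormalize so the estimates sum to one
        return [s / z for s in scores]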
Table 5.6: Accuracy scores for the J48 base classifier with various degradations, each
cell is averaged over 10 random samplings.
dataset ppc-j48 no-weight no-pair no-pair-and-no-weight
anneal 0.982 0.984 0.984 0.059333
arrhythmia 0.777519 0.727907 0.770543 0.063566
authorship 0.939333 0.916 0.852667 0.212667
autos 0.705882 0.691176 0.7 0.119118
cars 0.822059 0.808824 0.803676 0.641912
collins 0.404 0.367333 0.406 0.097333
dj30-1985-2003 0.231343 0.220896 0.222388 0.049254
ecoli 0.849515 0.853398 0.840777 0.120388
eucalyptus 0.600667 0.591333 0.586667 0.140667
halloffame 0.89 0.885333 0.874667 0.853333
hypothyroid 0.984667 0.984 0.98 0.852667
letter 0.561765 0.485294 0.542647 0.052941
mfeat-morphological 0.730667 0.716667 0.730667 0.072667
optdigits 0.899333 0.826 0.904 0.098
page-blocks 0.932667 0.915333 0.93 0.060667
segment 0.94 0.911333 0.94 0.139333
synthetic-control 0.937333 0.887333 0.931333 0.17
vehicle 0.725333 0.698667 0.674667 0.248
vowel 0.693333 0.616667 0.688667 0.094
waveform 0.709333 0.7 0.672667 0.346
average 0.765837 0.739375 0.751802 0.224592
Table 5.7: Adjusted p-values under the specified degradations for the accuracies indi-
cated in Table 5.6.
hypothesis pHolm
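The Holm procedure used for these comparisons is the standard step-down adjustment (cf. Demšar [24]); the following is an illustrative sketch, not the thesis's code:

    def holm_adjust(p_values):
        """Holm step-down adjustment for a family of m hypotheses.
        Returns adjusted p-values in the original order; reject a hypothesis at
        level alpha when its adjusted p-value is <= alpha."""
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        adjusted, running_max = [0.0] * m, 0.0
        for rank, i in enumerate(order):
            candidate = min(1.0, (m - rank) * p_values[i])  # multipliers m, m-1, ..., 1
            running_max = max(running_max, candidate)       # enforce monotonicity
            adjusted[i] = running_max
        return adjusted

    # Example: six hypothetical scheme-vs-scheme p-values
    print(holm_adjust([0.001, 0.020, 0.300, 0.004, 0.048, 0.650]))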
5.6 Conclusion
In this chapter, we introduced a new method for combining binary pairwise classifiers, probabilistic pairwise classification (PPC). The derivation of the method is based on the Theorem of Total Probability, and the method reduces a k-class classification problem into k(k-1)/2 pairwise classification problems and k(k-1)/2 pair-vs-rest problems. Because PPC transforms multiclass problems into a set of binary problems, it can be combined with any binary or multiclass base classifier. Like related pairwise coupling methods [51, 103], PPC incorporates probabilistic predictions (rather than discrete votes) and produces multiclass probability estimates. Empirical results over 20 data sets, 4 base classifiers and 2 metrics show that PPC outranks related methods or is not statistically significantly different from the highest ranking method.
There is some variability across methods; for instance, for the k-nearest neighbor base
classifier and the accuracy metric, no method is statistically significantly different from
the highest average ranking method, which is voted pairwise classification. Under the
Brier metric, PPC ranked first on all 20 data sets. In order to understand the tradeoffs
of PPC versus direct multiclass classification, we constructed synthetic data sets that
showed that under some circumstances it is more valuable to perform a direct multiclass
classification, but this result was sensitive to the amount of noise in the problem. By
discretizing regression data sets into classification problems with varying numbers of
classes, we showed that the advantage of PPC over a multiclass random forest tends to increase with the number of classes, while voted pairwise classification and the methods of Hastie & Tibshirani and Wu, Lin & Weng became less competitive as the number of classes increased. Our results also indicate that the choice of base classifier
has a large impact on the effectiveness of each method; for instance, k-nearest neighbor
as a base classifier was not competitive with decision trees, random forests or support-
vector machine base classifiers on either the accuracy or the Brier metric. Furthermore,
we showed the value of increased strength of base classifiers and of increased amount of
training data. PPC exhibits excellent performance on a variety of problems and metrics, but its main drawback is computational cost: each pair-vs-rest problem is trained on all data points (with the pair as the positive indicator class and the other classes as the negative indicator class). When using an expensive classification algorithm, such as support vector machines with a model selection scheme, training these subproblems can take a prohibitively long time. One possible way to alleviate this problem would be to use a more efficient (while less accurate) classification algorithm for the pair-vs-rest predictions, while still using a more expensive algorithm to handle the smaller pairwise problems. For instance, while using a model-selected SVM for the pairwise classifications, a decision tree might be
used to make the pair-vs-rest classifications. There is no inherent reason that each type of subproblem must use the same base classifier, and it is plausible that for some problems, different classification algorithms would be suitable for the pairwise and the pair-vs-rest subproblems. An empirical study with several different settings for pairwise and pair-vs-rest classifiers could be used to explore this tradeoff, as sketched below.
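As an illustration of this idea, the two families of binary subproblems could be fit with different base classifiers, as in the following sketch (scikit-learn-style estimators; the function and its arguments are hypothetical and not the implementation used in this thesis):

    from itertools import combinations
    import numpy as np
    from sklearn.base import clone
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def fit_ppc_subproblems(X, y, pairwise_model, pair_vs_rest_model):
        """Fit the k(k-1)/2 pairwise and k(k-1)/2 pair-vs-rest binary subproblems,
        allowing a different (e.g. cheaper) base classifier for the pair-vs-rest family."""
        classes = np.unique(y)
        pairwise, pair_vs_rest = {}, {}
        for ci, cj in combinations(classes, 2):
            in_pair = np.isin(y, [ci, cj])
            # pairwise subproblem: only the points belonging to classes ci or cj
            pairwise[(ci, cj)] = clone(pairwise_model).fit(X[in_pair], y[in_pair])
            # pair-vs-rest subproblem: all points, binary target "in {ci, cj}?"
            pair_vs_rest[(ci, cj)] = clone(pair_vs_rest_model).fit(X, in_pair.astype(int))
        return pairwise, pair_vs_rest

    # e.g. an expensive model for the small pairwise problems, a cheap one for pair-vs-rest:
    # fit_ppc_subproblems(X, y, SVC(probability=True), DecisionTreeClassifier())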
Bagging constructs an ensemble by training classifiers on resamplings (bootstrap samples) of the training set and averaging or voting their predictions [11]. Bagging reduces variance and prevents overfitting, but only provides an advantage for unstable classifiers. The fact that probabilistic pairwise classification exhibits a higher benefit for J48 (an unstable algorithm) than for KNN (a stable algorithm) suggests that a similar effect may be at work. In bagging, diversity is obtained by resampling data points in the training set by drawing bootstrap samples (samples with the same size as the original training set, with replacement). In PPC, resampling is done over classes rather than training points; however, this resampling may result in a similar benefit, in which models are able to complement each other to increase the multiclass classification accuracy. One argument against this explanation is that the relative benefit of PPC over the base multiclass classifier does not always show a positive correlation with the number of classes. If class-bagging were the explanation for the advantage of PPC, then the benefit should correspond to the number of pairwise classifiers, which varies as the square of the number of classes. Some preliminary results indicate that as the number of classes is increased, the relative benefit of PPC increases (see Section 5.5.4). Future work could identify whether class-bagging is responsible for the benefit in PPC, and could additionally investigate other explanations for the advantage of PPC.
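For comparison, the following is a minimal sketch of standard bagging (Breiman [11]), i.e. bootstrap resampling of training points followed by voting, against which the class-resampling analogy above is drawn (an illustration only; assumes integer class labels and array inputs):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_predict(X_train, y_train, X_test, n_models=25, seed=0):
        """Train decision trees on bootstrap samples and combine them by majority vote."""
        rng = np.random.default_rng(seed)
        n = len(y_train)
        votes = []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)   # bootstrap sample: n draws with replacement
            tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
            votes.append(tree.predict(X_test))
        votes = np.asarray(votes, dtype=int)
        # majority vote across the ensemble for each test point
        return np.array([np.bincount(col).argmax() for col in votes.T])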
5.7 Appendix
This section reports the performance of each base classifier under the accuracy
metric (see Section 5.7.1) and the Brier metric (see Section 5.7.2). Composite results
(aggregated over data sets) are reported in Section 5.4. The following sections break
down the results first by metric, then by base classifier, with results given for each method and data set. In each of the tables, the standard deviation is indicated in parentheses.
The averages are computed over 10 random resamples, or 5 resamples for the SVM-121
methods.
Table 5.8 indicates the accuracy of the various pairwise classification methods
while using the decision tree MULTI-J48 as the base classifier. Note that PPC has only
four losses over the 20 data sets, which is statistically significant at p ≤ 0.05 under the
Holm test. Figure 5.13 shows the relative gain in accuracy over a multiclass J48 decision
tree. Note that PPC has a higher gain than VPC, HT and WLW for many data sets, and that for many data sets, PPC has a positive gain over MULTI-J48 while VPC, HT and WLW do not.
Table 5.9 indicates the accuracy of the various pairwise classification methods
while using the k-nearest neighbor algorithm as the base classifier. Figure 5.14 shows the relative gain in accuracy over a multiclass KNN classifier.
Figure 5.13: Relative accuracy for decision trees under the accuracy metric.
Figure 5.14: Relative accuracy for k-nearest neighbor under the accuracy metric.
Table 5.10 indicates the accuracy of the various pairwise classification methods while using the random forest algorithm as the base classifier. Figure 5.15 shows the relative gain in accuracy over a multiclass random forest classifier.
Figure 5.15: Relative accuracy for random forest under the accuracy metric.
Table 5.11 indicates the accuracy of the various pairwise classification methods while using the support vector machine as the base classifier.
Table 5.12 indicates the rectified Brier score of the various pairwise classification
methods while using the decision tree J48 as the base classifier. Figure 5.16 shows the
relative gain in rectified Brier score over a multiclass J48 decision tree.
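For reference, the standard multiclass Brier score underlying these tables can be computed as below; the thesis's rectified variant (defined earlier in Chapter 5) differs in detail from this standard form, so the sketch is illustrative rather than an exact reproduction of the reported metric:

    import numpy as np

    def multiclass_brier(prob_estimates, true_labels, k):
        """Mean squared difference between the predicted class-probability vector
        and the one-hot encoding of the true class (standard multiclass Brier score)."""
        P = np.asarray(prob_estimates, dtype=float)       # shape (n, k)
        onehot = np.eye(k)[np.asarray(true_labels)]       # shape (n, k)
        return float(np.mean(np.sum((P - onehot) ** 2, axis=1)))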
Table 5.12: Results for j48 under the rectified Brier metric
Table 5.13 indicates the rectified Brier score of the various pairwise classification
methods while using the K-nearest neighbor algorithm as the base classifier. Figure
5.17 shows the relative gain in rectified Brier score over a multiclass KNN classifier.
Table 5.14 indicates the rectified Brier score of the various pairwise classification
methods while using the random forest algorithm as the base classifier. Figure 5.18
shows the relative gain in rectified Brier score over a multiclass random forest classifier.
Table 5.15 indicates the rectified Brier score of the various pairwise classification
methods while using the support vector machine as the base classifier.
Table 5.13: Results for knn under the rectified Brier metric
Table 5.14: Results for rf100 under the rectified Brier metric
Table 5.15: Results for svm121 under the rectified Brier metric
Conclusion
6.1 Conclusions
This thesis investigated the complementary issues of model combination and model selection, and how to perform model selection for models that are combined. We showed that regularized linear models are effective for combining classifiers in stacked generalization (Chapter 3), and that applying one weight per prediction rather than one weight per classifier allows classifiers to focus on subproblems. We also showed that when multiclass problems are reduced to binary subproblems, it is more effective to share hyperparameters across models than to optimize them independently, because smoothing over subproblems regularizes the binary models, and because subproblems typically share similar structure. Finally, we introduced probabilistic pairwise classification (PPC), which is simple and easy to implement. Evaluation of PPC on real-world data sets indicates that it is superior not only to other commonly used pairwise classification approaches, but can also be used to improve the classification performance of multiclass classifiers.
The tradeoff is that PPC is computationally more expensive, but this cost can be mitigated, for example, by using a less expensive base classifier for the pair-vs-rest subproblems.
Our stacked generalization experiments relied on the single-point maximum-likelihood estimates implicit in the ridge and lasso regularizers. An interesting extension of this work would be to examine the full Bayesian solutions corresponding to the Gaussian prior over weights (corresponding to ridge regularization) and the Laplacian prior (corresponding to lasso regularization). Other extensions include selecting a single regularization parameter for use in all subproblems or constraining the weights to be non-negative for each subproblem.
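The correspondence referred to here is the standard one between penalized least squares and maximum a posteriori (MAP) estimation; for stacking weights $w$, the symbols $X$, $y$, $\sigma^2$, $\tau$ and $b$ below are generic, not notation from this thesis:

\[
\hat{w}_{\text{ridge}} = \arg\min_{w} \|y - Xw\|_2^2 + \lambda \|w\|_2^2
\quad\Longleftrightarrow\quad
\hat{w}_{\text{MAP}} \text{ with } w_d \sim \mathcal{N}(0, \tau^2),\;\; \lambda = \sigma^2/\tau^2,
\]
\[
\hat{w}_{\text{lasso}} = \arg\min_{w} \|y - Xw\|_2^2 + \lambda \|w\|_1
\quad\Longleftrightarrow\quad
\hat{w}_{\text{MAP}} \text{ with } w_d \sim \text{Laplace}(0, b),\;\; \lambda = 2\sigma^2/b.
\]

A full Bayesian treatment would integrate over $w$ under these priors rather than taking the single MAP point, as suggested above.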
Most of this work has focused on the accuracy metric (Chapters 3 - 5), with some
studies under the Brier metric (Chapter 5). Future work should investigate how our results generalize to other metrics. One issue particular to reducing multiclass problems to binary classification is the relationship between the binary metric and the multiclass metric. Other studies have shown that a mismatch between the training metric and the target metric is problematic [17]. Further studies could investigate the relationship between the training and target metrics in the case of multiclass-to-binary reduction methods.
While this thesis focused on the complementary issues of model selection and
model combination, and how to perform model selection for models that are combined,
we have only used the shared-hyperparameters technique for the PPC algorithm. An-
other valuable line of research would be to see whether PPC attains a higher benefit
from independent optimization than that attained by one-vs-all or other pairwise classi-
fication techniques. Since the Theorem of Total Probability is the foundation for PPC,
being able to more accurately estimate each term in the total probability should lead to improved overall performance. Further experiments could be performed to identify whether PPC requires shared hyperparameters for the same reasons as one-vs-all and other pairwise classification techniques, namely that smoothing over subproblems provides a necessary regularization and that subproblems are often similar. One aspect of PPC that may make it more amenable to improved accuracy under independent optimization is its averaging over many pairwise estimates (Equation (5.7)), which may alleviate the need for shared hyperparameter constraints.
Bibliography
[1] David W. Aha and Dennis Kibler. Instance-based learning algorithms. In Machine
Learning, pages 37–66, 1991.
[2] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
[4] Kenneth J. Arrow. Social choice and individual values. Yale University Press,
1951.
[5] Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, and Bianca
Zadrozny. Error limiting reductions between classification tasks. In ICML ’05:
Proceedings of the 22nd International Conference on Machine learning, pages 49–
56, New York, NY, USA, 2005. ACM.
[7] Alina Beygelzimer, John Langford, and Bianca Zadrozny. Weighted one-against-
all. In Manuela M. Veloso and Subbarao Kambhampati, editors, AAAI, pages
720–725. AAAI Press / The MIT Press, 2005.
[9] Kai-Bo Duan and S. Sathiya Keerthi. Which is the best multiclass SVM method? An empirical study. In Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pages 278–285, 2005.
[15] Leo Breiman. Some infinity theory for predictor ensembles, 2001.
[17] Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. Getting the most
out of ensemble selection. In ICDM ’06: Proceedings of the Sixth International
Conference on Data Mining, pages 828–833, Washington, DC, USA, 2006. IEEE
Computer Society.
[18] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble
selection from libraries of models. In ICML ’04: Proceedings of the Twenty-first
International Conference on Machine Learning, page 18, New York, NY, USA,
2004. ACM.
[19] Philip K. Chan and Salvatore J. Stolfo. On the accuracy of meta-learning for
scalable data mining. Journal of Intelligent Information Systems, 8(1):5–28, 1997.
[20] Robert T. Clemen and Robert L. Winkler. Limits for the precision and value of information from dependent sources. Operations Research, 33(2):427–442, 1985.
[22] Koby Crammer and Yoram Singer. Improved output coding for classification using
continuous relaxation. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp,
editors, NIPS, pages 437–443. MIT Press, 2000.
[23] Florin Cutzu. Polychotomous classification with pairwise classifiers: A new voting
principle. In Windeatt and Roli [99], pages 115–124.
[24] Janez Demšar. Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research, 7:1–30, 2006.
[27] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning prob-
lems via error-correcting output codes. Journal of Artificial Intelligence Research,
2:263–286, 1995.
[28] Pedro Domingos. Bayesian averaging of classifiers and the overfitting problem. In
Proc. 17th International Conference on Machine Learning, pages 223–230. Morgan
Kaufmann, San Francisco, CA, 2000.
[29] Kaibo Duan, S. Sathiya Keerthi, Wei Chu, Shirish Krishnaj Shevade, and
Aun Neow Poo. Multi-category classification by soft-max combination of binary
classifiers. In Windeatt and Roli [99], pages 125–134.
[30] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd
Edition). Wiley-Interscience, November 2000.
[31] Saso Džeroski and Bernard Ženko. Is combining classifiers with stacking better
than selecting the best one? Machine Learning, 54(3):255–273, 2004.
[32] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
[33] Bruno Feres de Souza, Andre C. P. L. F. de Carvalho, Rodrigo Calvo, and Re-
nato Porfirio Ishii. Multiclass SVM model selection using particle swarm optimiza-
tion. In HIS ’06: Proceedings of the Sixth International Conference on Hybrid
Intelligent Systems, page 31, Washington, DC, USA, 2006. IEEE Computer Soci-
ety.
[34] Eibe Frank and Mark Hall. A simple approach to ordinal classification. In EMCL
’01: Proceedings of the 12th European Conference on Machine Learning, pages
145–156, London, UK, 2001. Springer-Verlag.
[35] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm.
In International Conference on Machine Learning, pages 148–156, 1996.
[36] J. Friedman, T. Hastie, and R. Tibshirani. Regularized paths for generalized linear
models via coordinate descent. Technical report, Stanford, 2008.
[41] Johannes Fürnkranz. Round robin ensembles. Intell. Data Anal., 7(5):385–403,
2003.
[43] Ashutosh Garg and Vladimir Pavlovic. Bayesian networks as ensemble of classi-
fiers. In ICPR, Quebec City, Quebec, 2002.
[44] Christian Genest and Kevin J. McConway. Allocating the weights in the linear opinion pool. Journal of Forecasting, 9:53–73, 1990.
[45] Christian Genest and Mark J. Schervish. Modeling expert judgments for Bayesian updating. The Annals of Statistics, 13:1198–1212, 1985.
[46] Zoubin Ghahramani and Hyun-Chul Kim. Bayesian classifier combination. Gatsby
Technical Report, 2003.
[48] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern
Anal. Mach. Intell., 12(10):993–1001, 1990.
[51] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. In The
Annals of Statistics, pages 507–513. MIT Press, 1996.
[52] Tin Kam Ho. The random subspace method for constructing decision forests.
IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
[54] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support
vector machines, 2002.
[55] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support
vector machines. Neural Networks, IEEE Transactions on, 13(2):415–425, 2002.
[56] Eyke Hüllermeier and Stijn Vanderlooy. Combining predictions in pairwise classi-
fication: An optimal adaptive voting strategy and its relation to weighted voting.
Pattern Recogn., 43(1):128–142, 2010.
[57] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
[58] Robert A. Jacobs. Methods for combining experts’ probability assessments. Neural
Computation, 7(5):867–888, 1995.
[60] Joseph M. Kahn. A generative Bayesian model for aggregating experts' probabilities. In AUAI '04: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 301–308, Arlington, Virginia, United States, 2004. AUAI Press.
[61] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial
Intelligence, 97(1-2):273–324, 1997.
[63] L. Lam and C.Y. Suen. Application of majority voting to pattern recognition: An
analysis of the behavior and performance. IEEE Trans. Systems Man Cybernet,
27(5):553–567, 1997.
[64] John Langford and Alina Beygelzimer. Sensitive error correcting output codes.
In Peter Auer and Ron Meir, editors, COLT, volume 3559 of Lecture Notes in
Computer Science, pages 158–172. Springer, 2005.
[65] Gilles Lebrun, Olivier Lezoray, Christophe Charrier, and Hubert Cardot. An EA multi-model selection for SVM multiclass schemes. In Francisco Sandoval
Hernández, Alberto Prieto, Joan Cabestany, and Manuel Graña, editors, IWANN,
volume 4507 of Lecture Notes in Computer Science, pages 260–267. Springer, 2007.
[66] Martina Liepert. Topological fields chunking for German with SVMs: Optimizing SVM parameters with GAs. In Proceedings of the International Conference
on Recent Advances in Natural Language Processing (RANLP 2003), Borovets,
Bulgaria, 2003.
[67] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13:415–425, 2002.
[69] Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks,
12(10):1399–1404, 1999.
[72] P. Melville and R. J. Mooney. Constructing diverse classifier ensembles using arti-
ficial training examples. In Eighteenth International Joint Conference on Artificial
Intelligence, pages 505–510, 2003.
[73] Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, and Christina
Leslie. Multi-class protein classification using adaptive codes. Journal of Machine
Learning Research, 8:1557–1581, 2007.
[76] Peter Morris. Bayesian Expert Resolution. PhD thesis, Stanford University, 1971.
[79] David M. Pennock, Pedrito Maynard-Reid II, C. Lee Giles, and Eric Horvitz.
A normative examination of ensemble learning algorithms. In Proc. 17th
International Conference on Machine Learning, pages 735–742. Morgan Kauf-
mann, San Francisco, CA, 2000.
[80] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for
hybrid neural networks. In R. J. Mammone, editor, Neural Networks for Speech
and Image Processing, pages 126–142. Chapman-Hall, 1993.
[81] John C. Platt, Nello Cristianini, and John Shawe-taylor. Large margin dags for
multiclass classification. In Advances in Neural Information Processing Systems,
pages 547–553. MIT Press, 2000.
[82] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[83] J. Ross Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series
in Machine Learning). Morgan Kaufmann, 1 edition, January 1993.
[84] Samuel Robert Reid and Gregory Z. Grudic. Regularized linear models in stacked
generalization. In Jon Atli Benediktsson, Josef Kittler, and Fabio Roli, edi-
tors, MCS, volume 5519 of Lecture Notes in Computer Science, pages 112–121.
Springer, 2009.
[85] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal
of Machine Learning Research, 5:101–141, 2004.
[86] Fabio Roli, Giorgio Giacinto, and Gianni Vernazza. Methods for designing mul-
tiple classifier systems. In MCS ’01: Proceedings of the Second International
Workshop on Multiple Classifier Systems, pages 78–87, London, UK, 2001.
Springer-Verlag.
[90] Alexander K. Seewald. How to make stacking better and faster while also taking
care of an unknown weakness. In ICML, pages 554–561, 2002.
[92] L. Shapley and B. Grofman. Optimizing group judgmental accuracy in the presence of interdependencies. Public Choice, 1984.
[93] Amanda J. C. Sharkey, Noel E. Sharkey, Uwe Gerecke, and G. O. Chandroth. The
“test and select” approach to ensemble combination. Lecture Notes in Computer
Science, 1857, 2000.
[94] Gero Szepannek, Bernd Bischl, and Claus Weihs. On the combination of locally
optimal pairwise classifiers. In MLDM ’07: Proceedings of the 5th International
Conference on Machine Learning and Data Mining in Pattern Recognition, pages
104–116, Berlin, Heidelberg, 2007. Springer-Verlag.
[95] Kai Ming Ting and Ian H. Witten. Issues in stacked generalization. Journal of
Artificial Intelligence Research, 10:271–289, 1999.
[97] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in en-
semble classifiers. Connection Science, 8(3-4):385–403, 1996.
[98] N. Ueda and R. Nakano. Generalization error of ensemble estimators. Proc. IEEE
Int’l Conf. Neural Networks, pages 90–95, 1996.
[99] Terry Windeatt and Fabio Roli, editors. Multiple Classifier Systems, 4th
International Workshop, MCS 2003, Guilford, UK, June 11-13, 2003, Proceedings,
volume 2709 of Lecture Notes in Computer Science. Springer, 2003.
[102] David H. Wolpert and William G. Macready. No free lunch theorems for opti-
mization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April
1997.
[103] Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. Probability estimates for multi-
class classification by pairwise coupling. Journal of Machine Learning Research,
5:975–1005, 2004.
[105] Naoto Yukinawa, Shigeyuki Oba, Kikuya Kato, and Shin Ishii. Optimal aggre-
gation of binary classifiers for multiclass cancer diagnosis using gene expression
profiles. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 6(2):333–343, 2009.
[107] Yi Zhang, Samuel Burer, and W. Nick Street. Ensemble pruning via semi-definite
programming. Journal of Machine Learning Research, 7:1315–1338, 2006.
[108] Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137(1-2):239–263, 2002.
[109] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society B, 67:301–320, 2005.
Appendix A
Companion software, data sets, errata and other supporting material are available
at http://spot.colorado.edu/~reids/thesis/.
Appendix B
(1) anneal (Annealing Data Set): Predict the class for a steel annealing problem; available from the UCI repository (datasets/Annealing) and appears in many experimental machine learning studies.
(2) arrhythmia: Predict whether cardiac arrhythmia exists, and if so, which class it belongs to (of 15 classes), given patient age, gender, height, heart rate and temporal characteristics of the cardiac waveform.
(3) authorship (from Analysis of Categorical Data): Predict whether an author was
(4) autos (1985 Auto Imports Database): Predict the symboling or risk factor (from
-3 to 3) based on features such as engine size, length, weight, stroke, bore, price,
etc.
(5) cars: Predict whether a car is American, European or Japanese based on MPG,
(6) collins: Predict the genre of a text from the Brown corpus using features such
(7) dj30-1985-2003 : Predict the identity of a stock, given attributes such as its
(8) ecoli: Predict the protein localization site, given attributes such as McGeoch's method for signal sequence recognition and von Heijne's method for signal sequence recognition.
(9) eucalyptus (from agridatasets): Predict which eucalyptus seedlots are best for
The attributes include features such as {altitude, rainfall, frosts, year of plant-
ing, species, seedlot, height and stem, crown and branch form}. Note that this
(10) halloffame (from Analyzing Categorical Data, p. 418): Predict whether a baseball player was {not inducted into the Hall of Fame, elected into the Hall of Fame, ...}.
(11) hypothyroid: Predict the thyroid condition of a patient, given measurements such as age, TSH, T3, TT4, T4U, FTI and TBG.
(12) letter : Predict letter (A-Z) generated by one of 20 fonts and distorted, given
(14) optdigits: Predict the handwritten numeral (0-9) based on pixel counts over
(15) page-blocks: Predict whether a page block is (text, horizontal line, picture,
vertical line or graphic) given its height, length, area, eccentricity, percentage
(16) segment: Predict whether an image segment is (brickface, sky, foliage, cement,
window, path, grass) based on 19 input attributes such as column, row, number
the region, average intensity, average red, green and blue values, etc.
(17) synthetic-control: Predict the type of a synthetically generated control chart time series (Cyclic, Decreasing-trend, Downward-shift, Increasing-trend, Normal, Upward-shift).
(19) vowel: Predict the audio form of a vowel (hid, hId, hEd, hAd, hYd, had, hOd, ...).
(20) waveform: Predict which of three wave types a waveform belongs to given 21
noisy attributes.