
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 108C (2017) 1692–1701

International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland
Improving Performance of Multiclass Classification by Inducing Class Hierarchies

Daniel Silva-Palacios, Cèsar Ferri, and María José Ramírez-Quintana

DSIC, Universitat Politècnica de València, Camí de Vera s/n, 46022, Valencia, Spain
[email protected], {cferri,mramirez}@dsic.upv.es
Abstract

In the last decades, one issue that has received a lot of attention in classification problems is how to obtain better classifications. This problem becomes even more complicated when the number of classes is high. In this multiclass scenario, it is assumed that the class labels are independent of each other, and thus, most techniques and methods proposed to improve the performance of the classifiers rely on it. An alternative way to address the multiclass problem is to hierarchically distribute the classes in a collection of multiclass subproblems by reducing the number of classes involved in each local subproblem. In this paper, we propose a new method for inducing a class hierarchy from the confusion matrix of a multiclass classifier. Then, we use the class hierarchy to learn a tree-like hierarchy of classifiers for solving the original multiclass problem in a similar way as the top-down hierarchical classification approach does for working with hierarchical domains. We experimentally evaluate the proposal on a collection of multiclass datasets, showing that, in general, the generated hierarchies outperform not only the original (flat) classification but also hierarchical approaches based on other alternative ways of constructing the class hierarchy.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science
10.1016/j.procs.2017.05.218

Keywords: Multiclass classification, Class hierarchy inference, Hierarchy of classifiers
1 Introduction

In machine learning, classification is the problem of identifying to which of a set of categories (classes) a new instance belongs. When the set of categories contains more than two different labels, the problem is known as multiclass classification. In the basic scenario of multiclass classification it is assumed that (1) only one class label is assigned to each instance (in other words, this is single-class classification, as opposed to multi-label classification, which allows multiple class labels for each instance), and (2) the class labels are independent (that is, there is no relationship among the class labels, as opposed to hierarchical classification, which addresses classification problems where classes are organised into a hierarchical structure).

One of the main objectives when solving a classification task is to make predictions as
“good” as possible, where the notion of “good” depends on the evaluation measure used to
assess the quality of the classifier. For instance, in classification, accuracy is one of the most
popular evaluation measures. However, obtaining good predictions is not always a simple task,
and things are usually even more complicated in the multiclass case, since here the classifier has
to distinguish among a high number of classes to make the predictions. For this reason, in the
last decades, numerous efforts have been made to improve classifier performance. Most
approaches proposed in the literature fall into one of these three groups: a) to design better
learning techniques, b) to apply some kind of transformation over the training data, and c) to
modify or adjust the predictions given by the classifier. As an example of approach a), we can
mention ensemble learning, a general term based on the idea of using more than one model to
determine the predicted output, and that encompasses a collection of techniques (such as
bagging [3] or boosting [8], among others). The different approaches based
on instance and feature selection ([17, 14]) and the oversampling and undersampling methods
proposed to deal with imbalanced datasets ([15]) are examples of proposals that belong to
group b). Finally, approaches based on threshold choice methods [12] or classifier calibration
techniques [16, 21], that are applied to scoring models, are examples of approaches belonging
to group c). All of these methods assume that condition (2) above holds, that is, that there is
no relationship between the class labels.
However, in multiclass classification, it has been shown that classification performance can
also be improved by decomposing the multiclass problem into a hierarchy of intermediate clas-
sification problems that are smaller or less complex than the original one. Two alternative
ways of decomposing the original problem have been explored. The first one relies on the idea
that a problem becomes less complex if its dimensionality is reduced. For instance, in [22] the
instance attributes are iteratively split into disjoint sets and then a new classification problem
is defined for each partition. The second way relies on the idea that a multiclass problem be-
comes simpler if the number of classes is reduced. For instance, in [13] and [11] a class hierarchy
is constructed (by assuming that there exists some relationship between the class labels) and
then each internal node of the hierarchy defines a new classification problem involving only
its children class labels. This latter way to address the multiclass problem is inspired by the
top-down hierarchical classification method. The different approaches proposed mainly differ in
the way in which the class hierarchy is automatically generated (based on instances or based on
predictions) and/or the learning technique used for learning the intermediate classifiers.
In this paper we also propose to decompose the original multiclass problem by using a
class hierarchy from which a tree-like structure of classifiers is constructed. Similar to [11],
the relationships between classes are derived from the confusion matrix of a “flat”1 classifier
(interpreting the confusion matrix as an indicator of how easy or hard it is for the classifier to
distinguish the classes). But instead of deriving the similarity between classes directly from the
values of the confusion matrix (as [11] does), we transform the matrix trying to obtain as much
information as possible from it. From the transformed matrix, we define a semi-metric function
to compare the class labels that is used to generate the class hierarchy. We experimentally
evaluate our proposal using a large collection of datasets and techniques.
The paper is organized as follows. In Section 2 we review some previous works and present
our method for defining the similarity between classes from the confusion matrix that is then
used for inducing the class hierarchy. In Section 3, we describe how to solve multiclass problems
by using a class hierarchy. Section 4 presents the experiments we conducted in order to evaluate

1 In the literature, it is common to refer to a classifier that does not take into consideration any relationship between the class labels as “flat”.


our approach. Finally, Section 5 concludes the paper.

2 Inducing Class Hierarchies from Predictions


In this section we present our proposal for learning class hierarchies. First, we briefly summarise
previous proposals for automatically inducing class hierarchies. Then, we introduce a new way
of defining the similarity between classes, which we use for obtaining the class hierarchy.

2.1 Related work


Most approaches for automatically generating class hierarchies are instance-based, that is, they
build the hierarchy by applying a hierarchical clustering algorithm that makes use of a distance
or similarity function between the instances. For example, some works about ontology
construction [6, 2] and concept hierarchy learning from texts follow this approach.
Other instance-based approaches first pre-process or transform the instances, and the hierarchy
is then induced by using a distance function between the transformed items. For instance,
[13] presents a method for automatically learning a hierarchy for document classification.
The approach consists in applying a linear discriminant projection to transform the documents
(represented as vectors) onto a low-dimensional space and then the similarity between two
classes is defined as the distance between their centroids. A similar approach is presented in
[19] for learning the ontologies used in a recommendation system developed using an item-
based collaborative filtering technique. From the user-item matrix that describes the item
ratings given by the users, the authors derive a collection of ontologies by using several distance
functions and applying both agglomerative and divisive clustering.
Another alternative way of determining the similarity between classes is to use the predictions
given by a classifier, in many cases constructed especially for this purpose. In the framework
of multi-label hierarchical classification, in [5] an ARTMAP neural network is used as a self-
organizing expert system to derive hierarchical knowledge structures. A similar approach is
presented in [4], where the authors use an association rule learner that extracts class hierar-
chies from the predictions given by a multi-label classifier. In the field of (non-hierarchical)
multiclass classification, [11] presents an approach to improve document classification by com-
bining Naive Bayes (NB) and Support Vector Machines (SVM). Concretely, once an NB classifier
has been learnt, each row of its confusion matrix (after normalising it) is used to represent each
class as a tuple of numerical values. Classes are then compared by applying the Euclidean distance.
Our proposal, although also based on the predictions of a classifier, differs from the approaches
just mentioned in that the similarity between classes is obtained neither directly from the
predictions of the classifier nor by applying a well-known distance function over the values of
the confusion matrix. Instead, the confusion matrix is turned into a similarity matrix.

2.2 Calculating the similarity between classes


Let C = {c1, ..., cn} be a set of n class labels and D be a set of labelled instances of the form
⟨x, y⟩, where x is an m-tuple whose components are the input attributes and y ∈ C is the target
attribute, i.e. the class. Given a flat classifier F trained using D, its confusion matrix M is
an n × n matrix that describes the performance of F. The rows of M represent the instances
in actual classes whereas the columns represent the instances in predicted classes. Thus, the
elements of M placed on the main diagonal (M_{ci,ci}) are the instances of D that F correctly
classifies, whereas the rest of the elements of M are the misclassification errors; that is, M_{ci,cj} with ci ≠ cj


represents the instances of class ci that F classifies as being of class cj. For the sake of simplicity,
in what follows we denote any class ci by its subscript i.
In general, in any confusion matrix M we can observe the following facts: (1) misclassifica-
tion errors are not usually uniformly distributed, which indicates that it is more difficult for the
classifier to separate some classes than others; and (2) M is usually non-symmetrical, which
means that to really measure the degree of confusion between two classes i and j we must take
into account all the errors the classifier makes involving both classes, that is Mi,j and Mj,i .
Our proposal is based on this reasoning.
Given a confusion matrix M, the new function for comparing classes derived from M, which we
denote dS, is obtained by applying the following steps:

1. Normalisation: $\overline{M}_{ij} = \dfrac{M_{ij}}{\sum_{k=1}^{n} M_{ik}}$

2. Class overlap: the overlap degree between two classes i and j is

   $\mathrm{overlap}(i,j) = \begin{cases} \dfrac{\overline{M}_{ij} + \overline{M}_{ji}}{2} & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$

   Applying this function we obtain a symmetric matrix M^O we call the Overlapping Matrix.

3. Similarity between classes: the similarity between two classes i and j is defined as

   $d_S(i,j) = 1 - \mathrm{overlap}(i,j)$

   The similarity values belong to the interval [0, 1], where dS(i, j) = 0 means that classes i
   and j are completely indistinguishable (they overlap completely), whereas dS(i, j) = 1
   indicates that the classes are completely distinguishable (no overlap between them). After
   applying dS we obtain another symmetric matrix M^S we call the Similarity Matrix.

Note that, (C, dS ) is a semi-metric space [10] since dS satisfies the non-negativity, symmetry
and identity conditions but, for some pairs of classes, dS can violate the triangle inequality
property (for instance, this is the case with classes a and b in Example 1).
Example 1 Let us consider a classification problem with four class labels C = {a, b, c, d}. Sup-
pose that we have trained a flat classifier F (using any classification technique) whose confusion
matrix M is depicted in Table 1a. From M we can see that F perfectly classifies the instances
belonging to class d but makes a lot of mistakes classifying the instances belonging to the other
classes. In fact, we could say that F is a poor classifier (its accuracy is 0.55). Tables 1b to 1d
show, step by step, the different matrices obtained by applying our method. In this example
it is easy to see that dS does not satisfy the triangle inequality, since for classes a and b we have
dS(a, b) = 1 > dS(a, c) + dS(c, b) = 0.972.


(a) Confusion Matrix M (rows: real class, columns: predicted class)

       a    b    c    d
  a   20    0   30    0
  b    0   20   30    0
  c   30   30   10    0
  d    0    0    0  100

(b) Normalised Confusion Matrix M̄

       a      b      c      d
  a  0.400  0.000  0.600  0.000
  b  0.000  0.400  0.600  0.000
  c  0.429  0.429  0.143  0.000
  d  0.000  0.000  0.000  1.000

(c) Overlapping Matrix M^O (symmetric; lower triangle shown)

       a      b      c      d
  a  1
  b  0.000  1
  c  0.514  0.514  1
  d  0.000  0.000  0.000  1

(d) Similarity Matrix M^S (symmetric; lower triangle shown)

       a      b      c      d
  a  0
  b  1.000  0
  c  0.486  0.486  0
  d  1.000  1.000  1.000  0

Table 1: Example that illustrates the calculation of the similarity between classes from the
confusion matrix of a flat classifier.
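To make the three steps above concrete, the following small NumPy sketch (ours, for illustration only; the authors' experiments were carried out in R) reproduces Tables 1b–1d from the confusion matrix of Table 1a.

```python
import numpy as np

# Confusion matrix M from Table 1a (rows: real class, columns: predicted class).
M = np.array([[20,  0, 30,   0],
              [ 0, 20, 30,   0],
              [30, 30, 10,   0],
              [ 0,  0,  0, 100]], dtype=float)

# Step 1: row-wise normalisation (Table 1b).
M_norm = M / M.sum(axis=1, keepdims=True)

# Step 2: symmetric overlap matrix (Table 1c); the diagonal is set to 1.
overlap = (M_norm + M_norm.T) / 2.0
np.fill_diagonal(overlap, 1.0)

# Step 3: semi-metric d_S = 1 - overlap (Table 1d); the diagonal becomes 0.
d_S = 1.0 - overlap

print(np.round(d_S, 3))
# [[0.    1.    0.486 1.   ]
#  [1.    0.    0.486 1.   ]
#  [0.486 0.486 0.    1.   ]
#  [1.    1.    1.    0.   ]]
```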

2.3 Inferring the class hierarchy


Once the distance matrix has been obtained, the class hierarchy is built by applying an agglomerative
hierarchical clustering algorithm. The agglomerative approach works in a bottom-up
manner, starting by assigning each element to its own cluster (singleton) and then iteratively
merging pairs of clusters until only one cluster remains. Clusters are joined based on the
distance between them, referred to as the linkage distance. Common linkage distances are, among others,
the complete linkage distance (the maximum distance between elements of the two clusters), the single
linkage distance (the minimum distance between elements of the two clusters) and the average linkage
distance (the mean distance between elements of the two clusters). The hierarchical clustering is
graphically represented as a binary tree called a dendrogram, which shows not only how the clusters
are grouped but also the distance at which each grouping takes place. From the dendrogram,
the hierarchy of classes is derived by considering that the clusters created between the
leaves (the set of original classes) and the root constitute the internal nodes of the hierarchy.
We could obtain different hierarchies depending on the linkage distance used.
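As an illustration of this step, the sketch below (ours, not the authors' R implementation) builds the dendrogram from the semi-metric matrix of Table 1d using SciPy's agglomerative clustering with complete linkage; the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Similarity (semi-metric) matrix M^S from Table 1d: symmetric, zero diagonal.
d_S = np.array([[0.000, 1.000, 0.486, 1.000],
                [1.000, 0.000, 0.486, 1.000],
                [0.486, 0.486, 0.000, 1.000],
                [1.000, 1.000, 1.000, 0.000]])

# Convert the square matrix to SciPy's condensed form and cluster.
condensed = squareform(d_S, checks=False)
Z = linkage(condensed, method='complete')   # complete-linkage agglomerative clustering

# Each row of Z merges two clusters and records the linkage distance of the merge;
# the class hierarchy is read off this dendrogram.
print(Z)
# dendrogram(Z, labels=['a', 'b', 'c', 'd'])  # optional: plot the class hierarchy
```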

3 Decomposing Multiclass Problems using a Class Hierarchy
Once the class hierarchy has been inferred, we propose to use it for solving the original multiclass
problem by creating a tree of classifiers in the same way as the Local Classifier per Parent Node
(LCPN) [20] approach does for hierarchical classification. LCPN is easy to implement and, hence, it is
one of the approaches most commonly used in the literature. It consists in training a
flat classifier for every non-leaf node (or parent node) in the class hierarchy (including the root)
to distinguish between its child nodes. As a result, we obtain a set of classifiers arranged in
a tree. Then, to classify a new instance, the tree of classifiers is traversed in a top-down manner,
applying the classifiers from the root until a leaf is reached.
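A minimal sketch of this top-down traversal follows (ours; the Node structure with classifier, children and label attributes is a hypothetical illustration, not the paper's code, and children is assumed to map each local prediction to the corresponding child node).

```python
def classify_top_down(x, node):
    """Route instance x from the root of the classifier tree down to a leaf class label."""
    while node.children:                      # internal node: holds a local flat classifier
        # The local classifier was trained to distinguish only among node's children,
        # so its prediction selects which branch to follow.
        branch = node.classifier.predict([x])[0]
        node = node.children[branch]
    return node.label                         # leaf node: the final predicted class
```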
As the agglomerative clustering algorithm aggregates two groups at each step, the class
hierarchy and the classifier tree derived from it are binary trees. That means that for a problem
with n classes we have to train (n − 1) flat classifiers. However, the number of flat classifiers we
have to apply for classifying a new instance varies from about log2(n), when the class hierarchy (and

the tree of classifiers) is balanced, up to (n − 1), when the class hierarchy is a chain. In what
follows, we present a procedure to simplify the class hierarchy in order to reduce the number
of internal nodes, and therefore the number of flat classifiers we have to train. This procedure
generally implies a complexity reduction of the method.

Figure 1: Example of the Hierarchy Compression Process applied to the Zoo DataSet. (Left)
the class hierarchy given by the dendrogram. (Right) the compressed class hierarchy.

Algorithm 1 shows the compression method we propose to reduce the size of the class
hierarchy. The class hierarchy is traversed in a depth-first search, verifying that no two
contiguous nodes are at the same linkage distance; when they are, the nodes are merged.
In hierarchical clustering, the idea of using a post-process to reduce the dendrogram was
also applied in [7] to develop a multidimensional hierarchical clustering.

Algorithm 1: Compression Algorithm

Function CompressionProcess(class hierarchy) is
    /* Depth-first search of the class hierarchy. */
    node ← getRootNode(class hierarchy);
    node ← VerifyLevel(node);
    foreach child in getChildren(node) do
        replaceChild(node, child, CompressionProcess(child));
    return node;

Function VerifyLevel(node) is
    /* While some child is at the same linkage distance (height) as node, merge it into node. */
    repeat
        merged ← false;
        foreach child in getChildren(node) do
            if getHeight(child) == getHeight(node) then
                node ← JoinNodes(node, child);
                merged ← true;
                break;
    until not merged;
    return node;

Function JoinNodes(nodeFather, nodeChild) is
    removeChild(nodeFather, nodeChild);
    foreach descendant in getChildren(nodeChild) do
        addChildNode(nodeFather, descendant);
    return nodeFather;
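As a complementary illustration, here is a small Python sketch of the same compression idea over a dict-based dendrogram (our own simplification, not the authors' implementation): each node is assumed to be a dict with a 'height' (its linkage distance) and a list of 'children', where leaves have an empty list.

```python
def compress(node):
    """Merge contiguous internal nodes that sit at the same linkage distance."""
    if not node['children']:                  # leaf: nothing to compress
        return node
    new_children = []
    for child in node['children']:
        child = compress(child)               # depth-first: compress the subtree first
        if child['children'] and child['height'] == node['height']:
            # Internal child at the same height as its parent: lift its children up.
            new_children.extend(child['children'])
        else:
            new_children.append(child)
    node['children'] = new_children
    return node
```

On a hierarchy like the one of Figure 1, such a pass would merge the root, c5, c7 and c11 nodes into a single root, as described below.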

Figure 1 shows the compression process applied to the Zoo dataset (from the UCI repository
[1]). As can be seen, the nodes root, c5, c7 and c11 in the hierarchy on the left are merged
into one node that is the root of the resulting compressed hierarchy (depicted on the right). As
a consequence, the number of flat classifiers to be learnt has been reduced from six to three.
Note that the node-compression condition of being at the same linkage distance could easily be
modified to allow merging nodes whose linkage distance does not exceed a certain threshold


depending on the domain or chosen by the user.

4 Experimental Analysis
In this section we evaluate the performance of our proposal for solving multiclass problems. To
this end, we carried out a set of experiments following the method described in Section 3.
The experiments were performed over 20 different datasets (Table 2) taken from the UCI [1]
and the LIBSVM2 public repositories, and the Reuters-21578 and Newsgroups datasets. All the
datasets are multivariate, multiclass, non-hierarchical a priori, and the criterion we followed
to select them was to include datasets of different sizes and with different numbers of classes. We have
preprocessed the datasets by removing the instances with missing values. Additionally, we have
applied under-sampling to datasets 11 and 15 (to cope with class imbalance) and a stratified
sampling to datasets 19 and 20 (to reduce their size).
In order to analyse the suitability of the hierarchies generated by our method, we include two
existing methods for generating class hierarchies. The first is a method that calculates the
distance between the centroids of each class (we denote it as dC), based on instances [13]. The
second approach computes similarities by applying the Euclidean distance over the normalised
confusion matrix generated by the flat classifier (we denote this distance as dE) [11]. Our
proposal, which uses a semi-metric function to compute the distance matrix, is denoted by dS.
We use the complete linkage distance in the agglomerative hierarchical clustering algorithm used
to infer the class hierarchies (using dS and dE). In addition, to analyse whether multiclass
classification can be improved by using class hierarchies, we compare the results obtained using
hierarchies with those obtained by the flat classifier that is first trained for inducing the hierarchy.

DataSet 1 2 3 4 5 6 7 8 9 10
Id Arrhytmia Covtype Dermatology Flare Forest Glass Letters Nuclear Pendigits Reuters
NumInst 416 2100 358 1066 523 214 2600 525 7494 1000
NumAtt 330 54 34 19 27 9 16 80 16 58
NumClass 7 7 6 6 4 6 26 8 10 10
DataSet 11 12 13 14 15 16 17 18 19 20
Id Satimage Segmentation Sports TrafficLight Usps Vertebral Yeast Zoo 10News 20News
NumInst 1795 3000 8000 300 3000 620 1484 101 3700 6660
NumAtt 36 18 13 10 256 6 8 16 45 48
NumClass 6 7 10 6 10 4 10 7 10 20

Table 2: Information about the datasets used in the experiments: number of instances, at-
tributes and classes.

In the experiments we apply eleven classification techniques in an R [18] script by means
of the libraries caret [9], rpart, e1071 and C50. Specifically, we use the following classification
algorithms: two decision trees “J48” and “C50”, a propositional rule learner “JRIP”, Naive
Bayes “NB”, K-nearest neighbours “KNN”, a decision list “PART”, a recursive partitioning
tree “RPART”, a neural network “NNET”, a parallel random forest “RF”, flexible discriminant
analysis “FDA” and support vector machines “SVM”. We adopt a 10-fold cross-validation for
the complete process and we use accuracy as the evaluation measure. In what follows, the letters
S, C and E denote that the tree of classifiers has been built from a class hierarchy inferred from
our semi-metric distance dS, the distance between centroids dC and the Euclidean distance dE,
respectively. Additionally, F stands for the flat classifier approach. Table 3 shows the results
obtained. For each classification technique and dataset, the best result is underlined, and the
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/


best method using a hierarchy is highlighted in bold. As we can see, some algorithms were not
able to process all datasets due to resource limitations.
In general, the results show that the methods that use a class hierarchy outperform, in terms
of accuracy, the flat classification for most datasets and techniques, except for NB and SVM,
whose flat classifiers are generally more accurate than the classifiers using hierarchies. Hence,
we can say that, in general, the use of class hierarchies for decomposing the original problem
seems to be a suitable way to address multiclass classification. Regarding the way in which the
class hierarchies are induced, on average our semi-metric function dS obtains the best results
for almost all the techniques, except for NB, where the distance between centroids dC gives
better accuracy values, and C50, for which dS and dC obtain similar results.
Next, we analyse how the size of the datasets (in terms of number of classes and instances)
affects the performance of the methods. To do this, we first grouped the datasets into three categories
according to NumClass: small (NumClass ≤ 6), medium (6 < NumClass ≤ 10) and large
(NumClass > 10). We observe that for the small group there are no big differences among
the flat, S and C approaches. For medium-size datasets, the flat and S approaches are the
best for almost all the learning techniques and, finally, for large datasets, the classifiers that
use our semi-metric distance achieve the best accuracy. Grouping the datasets according to
the number of instances, we see that for small datasets (NumInst < 526) all methods
perform similarly; for medium-size datasets (527 < NumInst < 2101) and large datasets
(NumInst > 2100) the methods that use a class hierarchy are better than the flat ones, with
the methods based on the Euclidean and semi-metric distances being the best.

5 Conclusions

This research proposes a method to improve accuracy on multiclass classification problems. The
idea is that in situations where there is a high number of classes, traditional methods find it
difficult to correctly discriminate new observations, given the high number of possibilities. The
proposal consists in building specialised classifiers for the classes that are most often confused
with each other, i.e., in building a chain of specialised classifiers for simpler problems. The
method is therefore based on the inference of a hierarchy of classes. From the confusion matrix obtained
from the training data we derive a similarity matrix, and then an agglomerative clustering technique derives
the hierarchy of classifiers. We propose a semi-metric to calculate the distances between classes
given a confusion matrix. We also introduce a method to compress hierarchies, which allows
us to better represent the hierarchical structure and also helps to reduce the complexity of
hierarchical classification, as it reduces the number of local classifiers.
Experiments with twenty multiclass datasets show the validity of our proposal. We show
that the new technique is able to improve the accuracy with respect to the basic flat approach.
We also include in the experiments other proposals to build the hierarchy of classifiers. The
new method based on the semi-metric distance obtains a better performance for the majority
of datasets.
As future work, we propose to further analyse the relation between the number of classes and the
performance obtained by the chain-of-classifiers approach. We are also interested in studying
the effect of class balance on the performance of the method. Finally, we plan to derive some
ensemble methods based on the combination of the local classifiers of the hierarchy.


Datasets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Average
J48
F 0,714 0,695 0,947 0,732 0,849 0,720 0,736 1,000 0,962 0,826 0,871 0,624 0,740 0,844 0,516 0,605 0,925 0,740
E 0,716 0,708 0,947 0,738 0,874 0,685 0,720 0,996 0,956 0,811 0,957 0,620 0,747 0,848 0,581 0,573 0,943 0,743
S 0,709 0,707 0,949 0,744 0,866 0,714 0,720 1,000 0,961 0,817 0,956 0,628 0,724 0,852 0,563 0,581 0,933 0,745
C 0,680 0,711 0,955 0,728 0,877 0,739 0,703 0,989 0,954 0,816 0,957 0,620 0,711 0,861 0,560 0,587 0,929 0,742
JRIP
F 0,758 0,649 0,947 0,686 0,847 0,676 0,687 0,899 0,952 0,827 0,838 0,569 0,720 0,842 0,577 0,587 0,869 0,718
E 0,730 0,686 0,961 0,742 0,877 0,704 0,708 0,955 0,956 0,839 0,957 0,615 0,713 0,857 0,568 0,587 0,902 0,741
S 0,738 0,692 0,955 0,732 0,881 0,684 0,715 0,960 0,959 0,826 0,957 0,608 0,743 0,863 0,561 0,574 0,931 0,743
C 0,738 0,676 0,964 0,710 0,871 0,671 0,744 0,975 0,960 0,820 0,950 0,608 0,760 0,866 0,540 0,577 0,931 0,741
KNN
F 0,637 0,643 0,875 0,720 0,893 0,669 0,800 0,995 0,993 0,507 0,871 0,790 0,524 0,325 0,954 0,394 0,588 0,861 0,246 0,172 0,650
E 0,654 0,661 0,855 0,730 0,894 0,621 0,766 0,996 0,993 0,450 0,875 0,935 0,523 0,347 0,954 0,406 0,579 0,812 0,214 0,137 0,645
S 0,654 0,658 0,858 0,733 0,891 0,612 0,779 0,996 0,993 0,471 0,874 0,935 0,522 0,334 0,955 0,418 0,578 0,824 0,241 0,166 0,650
C 0,654 0,649 0,858 0,707 0,896 0,622 0,721 0,996 0,993 0,444 0,875 0,938 0,520 0,361 0,954 0,400 0,581 0,812 0,228 0,150 0,644
NB
F 0,589 0,218 0,916 0,593 0,870 0,571 0,674 0,975 0,883 0,213 0,822 0,824 0,571 0,543 0,738 0,544 0,424 0,882 0,185 0,143 0,584
E 0,589 0,165 0,802 0,548 0,852 0,559 0,581 0,886 0,805 0,186 0,750 0,871 0,511 0,452 0,632 0,527 0,350 0,599 0,171 0,108 0,515
S 0,589 0,146 0,771 0,551 0,852 0,583 0,535 0,982 0,793 0,197 0,749 0,871 0,520 0,371 0,684 0,532 0,344 0,645 0,142 0,078 0,518
C 0,589 0,240 0,785 0,590 0,860 0,599 0,707 0,879 0,864 0,223 0,795 0,878 0,538 0,482 0,694 0,531 0,362 0,796 0,157 0,101 0,557
NNET
F 0,587 0,283 0,972 0,761 0,876 0,703 0,362 1,000 0,463 0,537 0,843 0,605 0,229 0,659 0,672 0,521 0,596 0,941 0,264 0,163 0,578
E 0,644 0,508 0,958 0,759 0,898 0,684 0,770 1,000 0,920 0,569 0,876 0,937 0,491 0,634 0,935 0,532 0,569 0,953 0,264 0,181 0,663
S 0,661 0,503 0,964 0,759 0,917 0,703 0,763 1,000 0,913 0,578 0,870 0,933 0,517 0,707 0,931 0,566 0,579 0,963 0,281 0,192 0,674
C 0,685 0,454 0,966 0,751 0,906 0,694 0,720 1,000 0,921 0,559 0,870 0,913 0,508 0,560 0,931 0,558 0,587 0,954 0,273 0,176 0,659
PART
F 0,710 0,707 0,939 0,723 0,845 0,711 0,724 1,000 0,970 0,824 0,881 0,618 0,753 0,878 0,526 0,555 0,925 0,739
E 0,678 0,697 0,950 0,726 0,861 0,678 0,727 0,996 0,960 0,812 0,951 0,592 0,694 0,868 0,571 0,555 0,953 0,735
S 0,664 0,704 0,955 0,733 0,870 0,693 0,708 1,000 0,966 0,824 0,954 0,588 0,700 0,861 0,579 0,571 0,943 0,738
C 0,709 0,695 0,944 0,741 0,866 0,711 0,730 0,998 0,960 0,819 0,962 0,590 0,693 0,873 0,579 0,551 0,911 0,737
RPART
F 0,738 0,625 0,941 0,739 0,868 0,700 0,463 1,000 0,836 0,505 0,793 0,917 0,559 0,746 0,760 0,492 0,569 0,873 0,194 0,103 0,644
E 0,764 0,666 0,924 0,729 0,868 0,708 0,645 0,984 0,901 0,492 0,803 0,954 0,589 0,707 0,814 0,427 0,579 0,866 0,228 0,131 0,662
S 0,769 0,667 0,927 0,736 0,866 0,680 0,636 1,000 0,906 0,508 0,798 0,948 0,584 0,724 0,830 0,450 0,583 0,866 0,229 0,141 0,666
C 0,757 0,669 0,941 0,723 0,858 0,695 0,646 0,962 0,882 0,500 0,799 0,932 0,596 0,736 0,840 0,450 0,584 0,832 0,193 0,123 0,656
RF
F 0,800 0,772 0,978 0,738 0,883 0,818 0,875 1,000 0,991 0,635 0,882 0,974 0,710 0,807 0,943 0,339 0,627 0,973 0,338 0,259 0,742
E 0,805 0,775 0,978 0,750 0,891 0,816 0,849 0,998 0,991 0,557 0,876 0,976 0,708 0,807 0,900 0,329 0,604 0,982 0,265 0,163 0,729
S 0,800 0,774 0,983 0,751 0,889 0,831 0,880 1,000 0,991 0,572 0,875 0,975 0,707 0,803 0,938 0,339 0,608 0,973 0,260 0,164 0,733
C 0,793 0,769 0,980 0,735 0,889 0,797 0,822 1,000 0,988 0,578 0,877 0,972 0,700 0,790 0,926 0,335 0,616 0,981 0,289 0,193 0,725
SVM
F 0,589 0,684 0,970 0,756 0,895 0,714 0,834 0,998 0,995 0,569 0,870 0,939 0,626 0,571 0,969 0,473 0,596 0,951 0,327 0,230 0,704
E 0,589 0,650 0,937 0,749 0,898 0,659 0,753 0,998 0,994 0,477 0,872 0,958 0,607 0,670 0,956 0,460 0,584 0,972 0,151 0,089 0,675
S 0,589 0,690 0,945 0,744 0,898 0,666 0,803 0,998 0,996 0,503 0,873 0,958 0,616 0,677 0,957 0,477 0,569 0,972 0,272 0,185 0,693
C 0,589 0,702 0,925 0,735 0,896 0,680 0,819 0,996 0,994 0,460 0,871 0,934 0,612 0,571 0,965 0,477 0,555 0,962 0,238 0,157 0,682
C50
F 0,719 0,708 0,941 0,728 0,855 0,713 0,725 1,000 0,968 0,561 0,822 0,963 0,622 0,769 0,851 0,510 0,560 0,943 0,213 0,133 0,688
E 0,713 0,709 0,944 0,736 0,864 0,694 0,697 0,995 0,955 0,519 0,815 0,960 0,620 0,746 0,851 0,598 0,562 0,943 0,200 0,116 0,685
S 0,709 0,713 0,947 0,741 0,864 0,731 0,742 1,000 0,959 0,515 0,822 0,954 0,625 0,769 0,817 0,596 0,577 0,925 0,221 0,148 0,681
C 0,706 0,712 0,955 0,734 0,877 0,700 0,718 0,996 0,956 0,530 0,820 0,961 0,620 0,786 0,862 0,571 0,592 0,909 0,224 0,139 0,690
FDA
F 0,755 0,684 0,953 0,755 0,896 0,691 0,642 1,000 0,874 0,495 0,842 0,948 0,577 0,715 0,811 0,513 0,592 0,933 0,261 0,172 0,729
E 0,774 0,699 0,947 0,749 0,877 0,684 0,680 0,998 0,939 0,495 0,853 0,966 0,604 0,801 0,884 0,548 0,578 0,933 0,226 0,144 0,742
S 0,772 0,703 0,944 0,754 0,881 0,694 0,737 1,000 0,955 0,505 0,851 0,963 0,599 0,787 0,896 0,548 0,586 0,953 0,275 0,191 0,751
C 0,757 0,703 0,958 0,758 0,889 0,727 0,737 0,998 0,927 0,484 0,848 0,956 0,606 0,707 0,895 0,548 0,586 0,942 0,257 0,169 0,745

Table 3: Accuracy values obtained by the different classification techniques. Within each
technique's block, the first column of each row indicates the methodology applied in building the
hierarchy (Centroid = C, Euclidean = E, Semi-metric = S) with respect to flat classification (F).
The distance that obtains the best result for each method is highlighted in bold and the best
result for each method and dataset is underlined.

Acknowledgments
This work was partially supported by the EU (FEDER) and the Spanish MINECO un-
der grant TIN 2015-69175-C4-1-R, and by Generalitat Valenciana PROMETEOII2015/013.
This work has been supported by the Secretary of Higher Education, Science and Technol-
ogy (SENESCYT: Secretarı́a Nacional de Educación Superior, Ciencia y Tecnologı́a), of the
Republic of Ecuador.

References
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] Fernando Benites and Elena Sapozhnikova. Learning different concept hierarchies and the relations
between them from classified data. Intel. Data Analysis for Real-Life Applications: Theory and
Practice, pages 18–34, 2012.
[3] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.


[4] Florian Brucker, Fernando Benites, and Elena Sapozhnikova. Multi-label classification and ex-
tracting predicted class hierarchies. Pattern Recognition, 44(3):724–738, 2011.
[5] Gail A Carpenter, Siegfried Martens, and Ogi J Ogas. Self-organizing information fusion and hier-
archical knowledge discovery: a new framework using artmap neural networks. Neural Networks,
18(3):287–295, 2005.
[6] Philipp Cimiano, Aleksander Pivk, Lars Schmidt-Thieme, and Steffen Staab. Learning taxonomic
relations from heterogeneous sources of evidence. Ontology Learning from Text: Methods, evalua-
tion and applications, 123:59–73, 2005.
[7] Rakesh Dugad and Narendra Ahuja. Unsupervised multidimensional hierarchical clustering. In
Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Con-
ference on, volume 5, pages 2761–2764. IEEE, 1998.
[8] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. In European conference on computational learning theory, pages 23–37.
Springer, 1995.
[9] Max Kuhn, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan
Engelhardt, Tony Cooper, Zachary Mayer, and the R Core Team. caret: Classification and Regression
Training, 2014. R package version 6.0-30.
[10] Fred Galvin and SD Shore. Distance functions and topologies. The American Mathematical
Monthly, 98(7):620–623, 1991.
[11] Shantanu Godbole, Sunita Sarawagi, and Soumen Chakrabarti. Scaling multi-class support vector
machines using inter-class confusion. In Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 513–518. ACM, 2002.
[12] José Hernández-Orallo, Peter Flach, and Cesar Ferri. A unified view of performance metrics:
Translating threshold choice into expected classification loss. Journal of Machine Learning Re-
search, 13(Oct):2813–2869, 2012.
[13] T. Li, S. Zhu, and M. Ogihara. Hierarchical document classification using automatically generated
hierarchy. Intelligent Information Systems, 29:211–230, 2007.
[14] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classification and clus-
tering. IEEE Transactions on knowledge and data engineering, 17(4):491–502, 2005.
[15] V. López, A. Fernández, S. Garcı́a, V. Palade, and F. Herrera. An insight into classification
with imbalanced data: Empirical results and current trends on using data intrinsic characteristics.
Information Sciences, 250:113–141, 2013.
[16] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learn-
ing. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM,
2005.
[17] J Arturo Olvera-López, J Ariel Carrasco-Ochoa, J Francisco Martı́nez-Trinidad, and Josef Kittler.
A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.
[18] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2015.
[19] V. Schickel-Zuber and B. Faltings. Using hierarchical clustering for learning the ontologies used
in recommendation systems. In 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD’2007, pages 599–608, 2007.
[20] Carlos N. Silla Jr. and Alex A. Freitas. A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
[21] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass proba-
bility estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 694–699. ACM, 2002.
[22] Blaž Zupan, Marko Bohanec, Janez Demšar, and Ivan Bratko. Learning by discovering concept
hierarchies. Artificial Intelligence, 109(1):211–242, 1999.
