One of the main objectives when solving a classification task is to make predictions as
“good” as possible, where the notion of “good” depends on the evaluation measure used to
assess the quality of the classifier. For instance, in classification, accuracy is one of the most
popular evaluation measures. However, obtaining good predictions is not always a simple task.
Things are usually even more complicated in the multiclass case, since the classifier has
to distinguish among a high number of classes to make its predictions. For this reason, in the
last decades numerous efforts have been made to improve classifier performance. Most
approaches proposed in the literature fall into one of three groups: a) designing better
learning techniques, b) applying some kind of transformation to the training data, and c)
modifying or adjusting the predictions given by the classifier. For instance, as an example of an approach
of type a), we can mention ensemble learning, a general term based on the idea of
using more than one model to determine the predicted output, which encompasses a collection
of techniques (such as bagging [3] or boosting [8], among others). The different approaches based
on instance and feature selection ([17, 14]) and the oversampling and undersampling methods
proposed to deal with imbalanced datasets ([15]) are examples of proposals that belong to
group b). Finally, approaches based on threshold choice methods [12] or classifier calibration
techniques [16, 21], which are applied to scoring models, are examples of approaches belonging
to group c). All of these methods assume that condition (2) mentioned above holds, that is,
that there is no relationship between the class labels.
However, in multiclass classification, it has been shown that classification performance can
also be improved by decomposing the multiclass problem into a hierarchy of intermediate clas-
sification problems that are smaller or less complex than the original one. Two alternative
ways of decomposing the original problem have been explored. The first one relies on the idea
that a problem becomes less complex if its dimensionality is reduced. For instance, in [22] the
instance attributes are iteratively split into disjoint sets and then a new classification problem
is defined for each partition. The second way relies on the idea that a multiclass problem be-
comes simpler if the number of classes is reduced. For instance, in [13] and [11] a class hierarchy
is constructed (by assuming that there exists some relationship between the class labels) and
then each internal node of the hierarchy defines a new classification problem involving only
its children class labels. This latter way of addressing the multiclass problem is inspired by the
top-down hierarchical classification method. The approaches proposed so far differ mainly in
the way in which the class hierarchy is automatically generated (based on instances or on
predictions) and/or in the learning technique used to learn the intermediate classifiers.
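As a rough illustration of the top-down scheme (our own minimal sketch, with a hypothetical Node structure and scikit-learn-style local classifiers, not the specific procedure of [11] or [13]), a prediction starts at the root and repeatedly asks the local classifier of the current internal node which child to descend into, until a leaf holding an original class label is reached.

# Minimal sketch of top-down prediction over a class hierarchy.
# `Node` is a hypothetical structure: internal nodes hold a trained local
# classifier over their children (meta-classes); leaves hold a class label.

class Node:
    def __init__(self, label=None, children=None, classifier=None):
        self.label = label               # original class label (leaves only)
        self.children = children or []   # child nodes (internal nodes only)
        self.classifier = classifier     # local model predicting a child index

    def is_leaf(self):
        return not self.children


def predict_top_down(root, x):
    """Descend from the root, letting each local classifier choose a child,
    until a leaf holding an original class label is reached."""
    node = root
    while not node.is_leaf():
        child_idx = node.classifier.predict([x])[0]  # scikit-learn-style API assumed
        node = node.children[child_idx]
    return node.label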
In this paper we also propose to decompose the original multiclass problem by using a
class hierarchy from which a tree-like structure of classifiers is constructed. Similar to [11],
the relationships between classes are derived from the confusion matrix of a “flat”1 classifier
(interpreting the confusion matrix as an indicator of how easy or hard it is for the classifier to
distinguish the classes). But instead of deriving the similarity between classes directly from the
values of the confusion matrix (as [11] does), we transform the matrix, trying to extract as much
information as possible from it. From the transformed matrix, we define a semi-metric function
to compare the class labels that is used to generate the class hierarchy. We experimentally
evaluate our proposal using a large collection of datasets and techniques.
The paper is organized as follows. In Section 2 we review some previous works and present
our method for defining the similarity between classes from the confusion matrix that is then
used for inducing the class hierarchy. In Section 3, we describe how to solve multiclass problems
by using a class hierarchy. Section 4 presents the experiments we conducted in order to evaluate
1 In the literature, it is common to refer to a classifier that does not take into consideration any relationship between the class labels as a flat classifier.
represents the instances of class ci that F classifies as being of class cj . For the sake of simplicity,
in what follows we denote any class ci by its subscript i.
In general, in any confusion matrix M we can observe the following facts: (1) misclassifica-
tion errors are not usually uniformly distributed, which indicates that it is more difficult for the
classifier to separate some classes than others; and (2) M is usually non-symmetrical, which
means that to really measure the degree of confusion between two classes i and j we must take
into account all the errors the classifier makes involving both classes, that is Mi,j and Mj,i .
Our proposal is based on this reasoning.
Given a confusion matrix M, the new function for comparing classes derived from M, which we
denote by dS, is obtained by applying the following steps:

1. Normalisation: each entry of M is divided by its row total, obtaining the normalised matrix
   M̄ with entries M̄ij = Mij / (Mi1 + · · · + Min).

2. Overlapping: the degree of overlap between two classes i and j is defined as

   overlap(i, j) = (M̄ij + M̄ji) / 2   if i ≠ j
   overlap(i, j) = 1                  if i = j

   Applying this function we obtain a symmetric matrix MO that we call the Overlapping Matrix.

3. Similarity between classes: the similarity between two classes i and j is defined as

   dS(i, j) = 1 − overlap(i, j)

   The similarity values belong to the interval [0, 1], where dS(i, j) = 0 means that classes i
   and j are completely overlapping (the classifier cannot tell them apart at all), whereas
   dS(i, j) = 1 indicates that the classes are perfectly distinguishable (the classifier never
   confuses one with the other). After applying dS we obtain another symmetric matrix MS
   that we call the Similarity Matrix.
Note that (C, dS) is a semi-metric space [10], since dS satisfies the non-negativity, symmetry
and identity conditions but, for some pairs of classes, can violate the triangle inequality
(this is the case, for instance, with classes a and b in Example 1).
Example 1 Let us consider a classification problem with four class labels C = {a, b, c, d}. Sup-
pose that we have trained a flat classifier F (using any classification technique) whose confusion
matrix M is depicted in Table 1a. From M we can see that F perfectly classifies the instances
belonging to class d but makes a lot of mistakes classifying the instances belonging to the other
classes. In fact, we could say that F is a poor classifier (whose accuracy is 0.55). Tables from
1b to 1d show, step by step, the different matrices obtained by applying our method. In this
example it is easy to see that dS does not satisfy the triangle inequality, since for classes
a and b we have dS(a, b) = 1 > dS(a, c) + dS(c, b) = 0.972.
(a) Confusion matrix M (rows: real class; columns: predicted class)

         a     b     c     d
   a    20     0    30     0
   b     0    20    30     0
   c    30    30    10     0
   d     0     0     0   100

(b) Normalised matrix M̄ (each row of M divided by its row total)

         a       b       c       d
   a   0.400   0.000   0.600   0.000
   b   0.000   0.400   0.600   0.000
   c   0.429   0.429   0.143   0.000
   d   0.000   0.000   0.000   1.000

Table 1: Example that illustrates the calculation of the similarity between classes from the
confusion matrix of a flat classifier.
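As an illustration only, the following Python sketch (NumPy-based; the variable names are ours, not from the paper) applies the three steps above to the confusion matrix of Table 1a, reproducing the normalised matrix of Table 1b and computing, up to rounding, the overlapping and similarity matrices.

import numpy as np

# Confusion matrix M of Example 1 (rows: real class, columns: predicted class).
M = np.array([[20,  0, 30,   0],
              [ 0, 20, 30,   0],
              [30, 30, 10,   0],
              [ 0,  0,  0, 100]], dtype=float)

# Step 1: normalisation, dividing each entry by its row total.
M_bar = M / M.sum(axis=1, keepdims=True)

# Step 2: overlapping matrix, averaging the two normalised confusion values
# of each pair of distinct classes; by definition, 1 on the diagonal.
overlap = (M_bar + M_bar.T) / 2.0
np.fill_diagonal(overlap, 1.0)

# Step 3: similarity (semi-metric) matrix dS = 1 - overlap.
d_S = 1.0 - overlap

print(np.round(M_bar, 3))  # matches Table 1b
print(np.round(d_S, 3))    # dS(a, b) = 1.0 exceeds dS(a, c) + dS(c, b)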
the tree of classifiers) is balanced, up to (n − 1), when the class hierarchy is a chain. In what
follows, we present a procedure to simplify the class hierarchy in order to reduce the number
of internal nodes, and therefore the number of flat classifiers we have to train. This procedure
thus generally reduces the complexity of the method.
Figure 1: Example of the Hierarchy Compression Process applied to the Zoo DataSet. (Left)
the class hierarchy given by the dendrogram. (Right) the compressed class hierarchy.
Algorithm 1 shows the compression method we propose to reduce the size of the class
hierarchy. To do so, the class hierarchy is traversed in a depth-first manner, checking whether
two contiguous nodes are at the same linkage distance; if they are, the nodes are merged.
In hierarchical clustering, the idea of using a post-process to reduce the dendrogram was
also applied in [7] to develop a multidimensional hierarchical clustering.
Figure 1 shows the compression process applied to the Zoo dataset (from the UCI repository
[1]). As can be seen, the nodes root, c5, c7 and c11 in the hierarchy on the left are merged
into one node that is the root of the resulting compressed hierarchy (depicted on the right). As
a consequence, the number of flat classifiers to be learnt has been reduced from six to three.
Note that the node compression condition of being at the same linkage distance could easily be
modified to allow merging nodes whose linkage distance does not exceed a certain threshold.
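Algorithm 1 itself is not reproduced here; the following Python sketch is our own illustration of the idea (assuming the dendrogram is stored as a simple tree whose internal nodes record the linkage distance at which their children were merged), not the paper's exact pseudocode.

# Illustrative sketch of the hierarchy compression idea. Internal nodes store
# the linkage distance at which their children were merged; leaves store a
# class label.

class HNode:
    def __init__(self, label=None, distance=None, children=None):
        self.label = label
        self.distance = distance
        self.children = children or []

    def is_leaf(self):
        return not self.children


def compress(node, tol=0.0):
    """Depth-first traversal: if an internal child sits at (almost) the same
    linkage distance as its parent, the child is removed and its children
    are adopted by the parent, reducing the number of internal nodes."""
    if node.is_leaf():
        return node
    new_children = []
    for child in node.children:
        child = compress(child, tol)
        if (not child.is_leaf()
                and abs(child.distance - node.distance) <= tol):
            new_children.extend(child.children)  # merge child into parent
        else:
            new_children.append(child)
    node.children = new_children
    return node

With tol = 0 the condition is the strict one (equal linkage distances); a positive tol corresponds to the relaxed, threshold-based condition mentioned above.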
4 Experimental Analysis
In this section we evaluate the performance of our proposal for solving multiclass problems. To
this end, we carried out a set of experiments following the method described in Section 3.
The experiments were performed over 20 different datasets (Table 2) taken from the UCI [1]
and the LIBSVM2 public repositories, and the Reuters-21578 and Newsgroups datasets. All the
datasets are multivariate, multiclass, non-hierarchical a priori, and the criterion we followed
to select them was to include datasets of different sizes and different numbers of classes. We
preprocessed the datasets by removing the instances with missing values. Additionally, we
applied under-sampling to datasets 11 and 15 (to cope with class imbalance) and stratified
sampling to datasets 19 and 20 (to reduce their size).
In order to analyse the suitability of the hierarchies generated by our method, we include two
existing methods for generating class hierarchies. The first is an instance-based method that
calculates the distance between the centroids of each class (we denote it as dC) [13]. The
second approach computes similarities by applying the Euclidean distance over the normalised
confusion matrix generated by the flat classifier (we denote this distance as dE) [11]. Our
proposal, which uses a semi-metric function to compute the distance matrix, is denoted by dS.
We use the complete linkage distance in the agglomerative hierarchical clustering algorithm used
to infer the class hierarchies (using dS and dE). In addition, to analyse whether multiclass
classification can be improved by using class hierarchies, we compare the results obtained using
hierarchies with those obtained by the flat classifier, which is first trained to induce the hierarchy.
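As a sketch of how such hierarchies can be induced in practice (the reference list includes R and caret [9, 18], but here we assume SciPy purely for illustration), the pairwise dS or dE values can be passed, in condensed form, to an agglomerative clustering routine with complete linkage, and the resulting dendrogram provides the class hierarchy.

import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

# d_S: symmetric similarity (semi-metric) matrix between classes, e.g. the
# one computed in the previous sketch; its diagonal must be zero.
# classes: list of class labels in the same order as the rows of d_S.

def build_class_hierarchy(d_S, classes):
    condensed = squareform(d_S, checks=False)   # condensed (upper-triangular) form
    Z = linkage(condensed, method="complete")   # complete-linkage agglomerative clustering
    root, _ = to_tree(Z, rd=True)               # dendrogram as a binary tree
    return root, Z

# Example with the matrix of Example 1:
# root, Z = build_class_hierarchy(d_S, ["a", "b", "c", "d"])
# Each internal node of `root` then defines a local classification problem
# over the (meta-)classes given by its subtrees.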
DataSet      1          2        3            4      5       6      7        8        9          10
Id           Arrhytmia  Covtype  Dermatology  Flare  Forest  Glass  Letters  Nuclear  Pendigits  Reuters
NumInst      416        2100     358          1066   523     214    2600     525      7494       1000
NumAtt       330        54       34           19     27      9      16       80       16         58
NumClass     7          7        6            6      4       6      26       8        10         10

DataSet      11         12            13      14            15     16         17     18   19      20
Id           Satimage   Segmentation  Sports  TrafficLight  Usps   Vertebral  Yeast  Zoo  10News  20News
NumInst      1795       3000          8000    300           3000   620        1484   101  3700    6660
NumAtt       36         18            13      10            256    6          8      16   45      48
NumClass     6          7             10      6             10     4          10     7    10      20

Table 2: Information about the datasets used in the experiments: number of instances,
attributes and classes.
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
best method using a hierarchy is highlighted in bold. As we can see, some algorithms were not
able to process all datasets due to resource limitations.
In general, the results show that the methods that use a class hierarchy outperform, in terms
of accuracy, the flat classification for most datasets and techniques, except for NB and SVM,
whose flat classifiers are generally more accurate than the classifiers using hierarchies. Hence,
we can say that, in general, decomposing the original problem by means of class hierarchies
seems to be a suitable way to address multiclass problems. Regarding the way in which the
class hierarchies are induced, on average our semi-metric function dS obtains the best results
for almost all the techniques, except for NB, where the distance between centroids dC gives
better accuracy values, and C50, for which dS and dC obtain similar results.
Next, we analyse how the size of the datasets (in terms of number of classes and instances)
affects the performance of the methods. To do this, we first grouped the datasets into three
categories according to NumClass: small (NumClass ≤ 6), medium (6 < NumClass ≤ 10) and
large (NumClass > 10). We observe that for the small group there are no big differences among
the flat, S and C approaches. For medium-size datasets, the flat and S approaches are the
best for almost all the learning techniques and, finally, for large datasets, the classifiers that
use our semi-metric distance achieve the best accuracy. Grouping the datasets according to
the number of instances, we see that for small datasets (NumInst < 526) all methods
perform similarly; for medium-size datasets (527 < NumInst < 2101) and large datasets
(NumInst > 2100) the methods that use a class hierarchy are better than the flat ones, with
the methods based on the Euclidean and semi-metric distances being the best.
5 Conclusions
This research proposes a method to improve accuracy on multiclass classification problems. The
idea is that, in situations where there is a high number of classes, traditional methods find
it difficult to correctly discern the new observations given the high number of possibilities. The
proposal consists of building specialised classifiers for the classes that are most often confused
with each other, i.e., building a chain of specialised classifiers for simpler problems. The
method is therefore based on the inference of a hierarchy of classes. From the confusion matrix
obtained on the training data we derive a similarity matrix, and then an agglomerative clustering
technique derives the hierarchy of classifiers. We propose a semi-metric to calculate the distances
between classes given a confusion matrix. We also introduce a method to compress hierarchies,
which allows us to better represent the hierarchical structure and also helps to reduce the
complexity of hierarchical classification, as it reduces the number of local classifiers.
Experiments with twenty multiclass datasets show the validity of our proposal. We show
that the new technique is able to improve the accuracy with respect to the basic flat approach.
We also include in the experiments other proposals to build the hierarchy of classifiers. The
new method based on the semi-metric distance obtains a better performance for the majority
of datasets.
As future work, we propose to further analyse the relation between the number of classes and
the performance obtained by the chain-of-classifiers approach. We are also interested in studying
the effect of class balance on the performance of the method. Finally, we plan to derive some
ensemble methods based on the combination of the local classifiers of the hierarchy.
Datasets
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Average
F 0,714 0,695 0,947 0,732 0,849 0,720 0,736 1,000 0,962 0,826 0,871 0,624 0,740 0,844 0,516 0,605 0,925 0,740
E 0,716 0,708 0,947 0,738 0,874 0,685 0,720 0,996 0,956 0,811 0,957 0,620 0,747 0,848 0,581 0,573 0,943 0,743
J48
S 0,709 0,707 0,949 0,744 0,866 0,714 0,720 1,000 0,961 0,817 0,956 0,628 0,724 0,852 0,563 0,581 0,933 0,745
C 0,680 0,711 0,955 0,728 0,877 0,739 0,703 0,989 0,954 0,816 0,957 0,620 0,711 0,861 0,560 0,587 0,929 0,742
F 0,758 0,649 0,947 0,686 0,847 0,676 0,687 0,899 0,952 0,827 0,838 0,569 0,720 0,842 0,577 0,587 0,869 0,718
E 0,730 0,686 0,961 0,742 0,877 0,704 0,708 0,955 0,956 0,839 0,957 0,615 0,713 0,857 0,568 0,587 0,902 0,741
JRIP
S 0,738 0,692 0,955 0,732 0,881 0,684 0,715 0,960 0,959 0,826 0,957 0,608 0,743 0,863 0,561 0,574 0,931 0,743
C 0,738 0,676 0,964 0,710 0,871 0,671 0,744 0,975 0,960 0,820 0,950 0,608 0,760 0,866 0,540 0,577 0,931 0,741
F 0,637 0,643 0,875 0,720 0,893 0,669 0,800 0,995 0,993 0,507 0,871 0,790 0,524 0,325 0,954 0,394 0,588 0,861 0,246 0,172 0,650
E 0,654 0,661 0,855 0,730 0,894 0,621 0,766 0,996 0,993 0,450 0,875 0,935 0,523 0,347 0,954 0,406 0,579 0,812 0,214 0,137 0,645
KNN
S 0,654 0,658 0,858 0,733 0,891 0,612 0,779 0,996 0,993 0,471 0,874 0,935 0,522 0,334 0,955 0,418 0,578 0,824 0,241 0,166 0,650
C 0,654 0,649 0,858 0,707 0,896 0,622 0,721 0,996 0,993 0,444 0,875 0,938 0,520 0,361 0,954 0,400 0,581 0,812 0,228 0,150 0,644
F 0,589 0,218 0,916 0,593 0,870 0,571 0,674 0,975 0,883 0,213 0,822 0,824 0,571 0,543 0,738 0,544 0,424 0,882 0,185 0,143 0,584
E 0,589 0,165 0,802 0,548 0,852 0,559 0,581 0,886 0,805 0,186 0,750 0,871 0,511 0,452 0,632 0,527 0,350 0,599 0,171 0,108 0,515
NB
S 0,589 0,146 0,771 0,551 0,852 0,583 0,535 0,982 0,793 0,197 0,749 0,871 0,520 0,371 0,684 0,532 0,344 0,645 0,142 0,078 0,518
C 0,589 0,240 0,785 0,590 0,860 0,599 0,707 0,879 0,864 0,223 0,795 0,878 0,538 0,482 0,694 0,531 0,362 0,796 0,157 0,101 0,557
F 0,587 0,283 0,972 0,761 0,876 0,703 0,362 1,000 0,463 0,537 0,843 0,605 0,229 0,659 0,672 0,521 0,596 0,941 0,264 0,163 0,578
E 0,644 0,508 0,958 0,759 0,898 0,684 0,770 1,000 0,920 0,569 0,876 0,937 0,491 0,634 0,935 0,532 0,569 0,953 0,264 0,181 0,663
NNET
S 0,661 0,503 0,964 0,759 0,917 0,703 0,763 1,000 0,913 0,578 0,870 0,933 0,517 0,707 0,931 0,566 0,579 0,963 0,281 0,192 0,674
C 0,685 0,454 0,966 0,751 0,906 0,694 0,720 1,000 0,921 0,559 0,870 0,913 0,508 0,560 0,931 0,558 0,587 0,954 0,273 0,176 0,659
F 0,710 0,707 0,939 0,723 0,845 0,711 0,724 1,000 0,970 0,824 0,881 0,618 0,753 0,878 0,526 0,555 0,925 0,739
E 0,678 0,697 0,950 0,726 0,861 0,678 0,727 0,996 0,960 0,812 0,951 0,592 0,694 0,868 0,571 0,555 0,953 0,735
PART
S 0,664 0,704 0,955 0,733 0,870 0,693 0,708 1,000 0,966 0,824 0,954 0,588 0,700 0,861 0,579 0,571 0,943 0,738
C 0,709 0,695 0,944 0,741 0,866 0,711 0,730 0,998 0,960 0,819 0,962 0,590 0,693 0,873 0,579 0,551 0,911 0,737
F 0,738 0,625 0,941 0,739 0,868 0,700 0,463 1,000 0,836 0,505 0,793 0,917 0,559 0,746 0,760 0,492 0,569 0,873 0,194 0,103 0,644
E 0,764 0,666 0,924 0,729 0,868 0,708 0,645 0,984 0,901 0,492 0,803 0,954 0,589 0,707 0,814 0,427 0,579 0,866 0,228 0,131 0,662
RPART
S 0,769 0,667 0,927 0,736 0,866 0,680 0,636 1,000 0,906 0,508 0,798 0,948 0,584 0,724 0,830 0,450 0,583 0,866 0,229 0,141 0,666
C 0,757 0,669 0,941 0,723 0,858 0,695 0,646 0,962 0,882 0,500 0,799 0,932 0,596 0,736 0,840 0,450 0,584 0,832 0,193 0,123 0,656
F 0,800 0,772 0,978 0,738 0,883 0,818 0,875 1,000 0,991 0,635 0,882 0,974 0,710 0,807 0,943 0,339 0,627 0,973 0,338 0,259 0,742
E 0,805 0,775 0,978 0,750 0,891 0,816 0,849 0,998 0,991 0,557 0,876 0,976 0,708 0,807 0,900 0,329 0,604 0,982 0,265 0,163 0,729
RF
S 0,800 0,774 0,983 0,751 0,889 0,831 0,880 1,000 0,991 0,572 0,875 0,975 0,707 0,803 0,938 0,339 0,608 0,973 0,260 0,164 0,733
C 0,793 0,769 0,980 0,735 0,889 0,797 0,822 1,000 0,988 0,578 0,877 0,972 0,700 0,790 0,926 0,335 0,616 0,981 0,289 0,193 0,725
F 0,589 0,684 0,970 0,756 0,895 0,714 0,834 0,998 0,995 0,569 0,870 0,939 0,626 0,571 0,969 0,473 0,596 0,951 0,327 0,230 0,704
E 0,589 0,650 0,937 0,749 0,898 0,659 0,753 0,998 0,994 0,477 0,872 0,958 0,607 0,670 0,956 0,460 0,584 0,972 0,151 0,089 0,675
SVM
S 0,589 0,690 0,945 0,744 0,898 0,666 0,803 0,998 0,996 0,503 0,873 0,958 0,616 0,677 0,957 0,477 0,569 0,972 0,272 0,185 0,693
C 0,589 0,702 0,925 0,735 0,896 0,680 0,819 0,996 0,994 0,460 0,871 0,934 0,612 0,571 0,965 0,477 0,555 0,962 0,238 0,157 0,682
F 0,719 0,708 0,941 0,728 0,855 0,713 0,725 1,000 0,968 0,561 0,822 0,963 0,622 0,769 0,851 0,510 0,560 0,943 0,213 0,133 0,688
E 0,713 0,709 0,944 0,736 0,864 0,694 0,697 0,995 0,955 0,519 0,815 0,960 0,620 0,746 0,851 0,598 0,562 0,943 0,200 0,116 0,685
C50
S 0,709 0,713 0,947 0,741 0,864 0,731 0,742 1,000 0,959 0,515 0,822 0,954 0,625 0,769 0,817 0,596 0,577 0,925 0,221 0,148 0,681
C 0,706 0,712 0,955 0,734 0,877 0,700 0,718 0,996 0,956 0,530 0,820 0,961 0,620 0,786 0,862 0,571 0,592 0,909 0,224 0,139 0,690
F 0,755 0,684 0,953 0,755 0,896 0,691 0,642 1,000 0,874 0,495 0,842 0,948 0,577 0,715 0,811 0,513 0,592 0,933 0,261 0,172 0,729
E 0,774 0,699 0,947 0,749 0,877 0,684 0,680 0,998 0,939 0,495 0,853 0,966 0,604 0,801 0,884 0,548 0,578 0,933 0,226 0,144 0,742
FDA
S 0,772 0,703 0,944 0,754 0,881 0,694 0,737 1,000 0,955 0,505 0,851 0,963 0,599 0,787 0,896 0,548 0,586 0,953 0,275 0,191 0,751
C 0,757 0,703 0,958 0,758 0,889 0,727 0,737 0,998 0,927 0,484 0,848 0,956 0,606 0,707 0,895 0,548 0,586 0,942 0,257 0,169 0,745
Table 3: Accuracy values obtained by the different classification techniques. The letter at the
start of each row indicates the method used to build the hierarchy (Centroid = C, Euclidean = E,
Semi-metric = S) or flat classification (F). The distance that obtains the best result for each
technique is highlighted in bold, and the best result for each technique and dataset is
underlined.
Acknowledgments
This work was partially supported by the EU (FEDER) and the Spanish MINECO under
grant TIN 2015-69175-C4-1-R, and by Generalitat Valenciana PROMETEOII2015/013.
This work has also been supported by the Secretary of Higher Education, Science and
Technology (SENESCYT: Secretaría Nacional de Educación Superior, Ciencia y Tecnología)
of the Republic of Ecuador.
References
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] Fernando Benites and Elena Sapozhnikova. Learning different concept hierarchies and the relations
between them from classified data. Intel. Data Analysis for Real-Life Applications: Theory and
Practice, pages 18–34, 2012.
[3] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[4] Florian Brucker, Fernando Benites, and Elena Sapozhnikova. Multi-label classification and ex-
tracting predicted class hierarchies. Pattern Recognition, 44(3):724–738, 2011.
[5] Gail A Carpenter, Siegfried Martens, and Ogi J Ogas. Self-organizing information fusion and hier-
archical knowledge discovery: a new framework using artmap neural networks. Neural Networks,
18(3):287–295, 2005.
[6] Philipp Cimiano, Aleksander Pivk, Lars Schmidt-Thieme, and Steffen Staab. Learning taxonomic
relations from heterogeneous sources of evidence. Ontology Learning from Text: Methods, evalua-
tion and applications, 123:59–73, 2005.
[7] Rakesh Dugad and Narendra Ahuja. Unsupervised multidimensional hierarchical clustering. In
Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Con-
ference on, volume 5, pages 2761–2764. IEEE, 1998.
[8] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. In European conference on computational learning theory, pages 23–37.
Springer, 1995.
[9] Max Kuhn. caret: Classification and Regression Training, 2014. R package version 6.0-30. With
contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt,
Tony Cooper, Zachary Mayer, and the R Core Team.
[10] Fred Galvin and SD Shore. Distance functions and topologies. The American Mathematical
Monthly, 98(7):620–623, 1991.
[11] Shantanu Godbole, Sunita Sarawagi, and Soumen Chakrabarti. Scaling multi-class support vector
machines using inter-class confusion. In Proceedings of the eighth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 513–518. ACM, 2002.
[12] José Hernández-Orallo, Peter Flach, and Cesar Ferri. A unified view of performance metrics:
Translating threshold choice into expected classification loss. Journal of Machine Learning Re-
search, 13(Oct):2813–2869, 2012.
[13] T. Li, S. Zhu, and M. Ogihara. Hierarchical document classification using automatically generated
hierarchy. Intelligent Information Systems, 29:211–230, 2007.
[14] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classification and clus-
tering. IEEE Transactions on knowledge and data engineering, 17(4):491–502, 2005.
[15] V. López, A. Fernández, S. García, V. Palade, and F. Herrera. An insight into classification
with imbalanced data: Empirical results and current trends on using data intrinsic characteristics.
Information Sciences, 250:113–141, 2013.
[16] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learn-
ing. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM,
2005.
[17] J Arturo Olvera-López, J Ariel Carrasco-Ochoa, J Francisco Martínez-Trinidad, and Josef Kittler.
A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.
[18] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2015.
[19] V. Schickel-Zuber and B. Faltings. Using hierarchical clustering for learning the ontologies used
in recommendation systems. In 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD’2007, pages 599–608, 2007.
[20] Carlos N. Silla Jr. and Alex A. Freitas. A survey of hierarchical classification across different
application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
[21] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass proba-
bility estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 694–699. ACM, 2002.
[22] Blaž Zupan, Marko Bohanec, Janez Demšar, and Ivan Bratko. Learning by discovering concept
hierarchies. Artificial Intelligence, 109(1):211–242, 1999.