ML-KNN: A Lazy Learning Approach to Multi-Label Learning
Min-Ling Zhang, Zhi-Hua Zhou*
National Laboratory for Novel Software Technology
Nanjing University, Nanjing 210093, China
Email: {zhangml, zhouzh}@lamda.nju.edu.cn
Abstract
Multi-label learning originated from the investigation of the text categorization
problem, where each document may belong to several predefined topics simul-
taneously. In multi-label learning, the training set is composed of instances
each associated with a set of labels, and the task is to predict the label sets
of unseen instances through analyzing training instances with known label
sets. In this paper, a multi-label lazy learning approach named Ml-knn is
presented, which is derived from the traditional k-Nearest Neighbor (kNN)
algorithm. In detail, for each unseen instance, its k nearest neighbors in the
training set are firstly identified. After that, based on statistical information
gained from the label sets of these neighboring instances, i.e. the number
of neighboring instances belonging to each possible class, maximum a pos-
teriori (MAP) principle is utilized to determine the label set for the unseen
instance. Experiments on three different real-world multi-label learning prob-
lems, i.e. Yeast gene functional analysis, natural scene classification and auto-
matic web page categorization, show that Ml-knn achieves superior perfor-
mance to some well-established multi-label learning algorithms.
1 Introduction
In many real-world problems, one object may belong to multiple classes si-
multaneously. For instance, in text categorization, each document may belong
to several predefined topics; in functional genomics, each
gene may be associated with a set of functional classes, such as metabolism,
transcription and protein synthesis [7]; in scene classification, each scene im-
age may belong to several semantic classes, such as beach and urban [2]. In all
these cases, each instance in the training set is associated with a set of labels,
and the task is to output a label set whose size is unknown a priori for each
unseen instance.
Traditional two-class and multi-class problems can both be cast into multi-
label ones by restricting each instance to have only one label. On the other
hand, the generality of multi-label problems inevitably makes them more difficult
to learn. An intuitive approach to solving a multi-label problem is to decompose
it into multiple independent binary classification problems (one per category).
However, this kind of method does not consider the correlations between the
different labels of each instance and the expressive power of such a system
can be weak [14,19,7]. Fortunately, several approaches specially designed for
multi-label learning tasks have been proposed, such as multi-label text cate-
gorization algorithms [14,19], multi-label decision trees [3,4] and multi-label
kernel methods [7,2]. In this paper, a lazy learning algorithm named Ml-knn,
i.e. Multi-Label k-Nearest Neighbor, is proposed, which is the first multi-label
lazy learning algorithm. As its name implies, Ml-knn is derived from the
popular k-Nearest Neighbor (kNN) algorithm [1]. Firstly, for each test in-
stance, its k nearest neighbors in the training set are identified. Then, accord-
ing to statistical information gained from the label sets of these neighboring
instances, i.e. the number of neighboring instances belonging to each possible
class, maximum a posteriori (MAP) principle is utilized to determine the label
set for the test instance. The effectiveness of Ml-knn is evaluated through
three different multi-label learning problems, i.e. Yeast gene functional anal-
ysis [7], natural scene classification and automatic web page categorization
[21]. Experimental results show that the performance of Ml-knn is superior
to those of some well-established multi-label learning methods.
The rest of this paper is organized as follows. In Section 2, notations and eval-
uation metrics used in multi-label learning are briefly introduced. In Section 3,
previous works on multi-label learning are reviewed. In Section 4, Ml-knn is
proposed. In Section 5, experimental results of Ml-knn and other multi-label
learning algorithms are presented. Finally in Section 6, the main contribution
of this paper is summarized.
2 Preliminaries
Let X denote the domain of instances and let Y = {1, 2, . . . , Q} be the finite
set of labels. Given a training set T = {(x1 , Y1 ), (x2 , Y2 ), ..., (xm , Ym )} (xi ∈
X , Yi ⊆ Y) i.i.d. drawn from an unknown distribution D, the goal of the learn-
ing system is to output a multi-label classifier h : X → 2^Y which optimizes
some specific evaluation metric. In most cases however, instead of outputting
a multi-label classifier, the learning system will produce a real-valued function
of the form f : X × Y → R. It is supposed that, given an instance xi and its
associated label set Yi , a successful learning system will tend to output larger
values for labels in Yi than for those not in Yi , i.e. f (xi , y1 ) > f (xi , y2 ) for any
y1 ∈ Yi and y2 ∉ Yi .
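As a small illustration of this setting (the label indices, scores and variable names below are made up for the example), a label set Yi can be stored as a Q-dimensional indicator vector, and the real-valued function f (xi , ·) induces a ranking of the Q labels:

```python
import numpy as np

Q = 5                                    # |Y| = Q possible labels, indexed 1..Q
Y_i = {1, 3}                             # label set of instance x_i
y_vec = np.array([1 if l in Y_i else 0 for l in range(1, Q + 1)])  # indicator vector
f_xi = np.array([0.9, 0.1, 0.7, 0.3, 0.2])   # hypothetical scores f(x_i, y) for y = 1..Q
# A successful learner scores the labels in Y_i (here 1 and 3) above the others.
ranking = np.argsort(-f_xi) + 1          # labels sorted by decreasing score
print(y_vec, ranking)                    # [1 0 1 0 0] [1 3 4 5 2]
```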
The real-valued function f (·, ·) can be transformed into a ranking function
rankf (·, ·), which maps the outputs of f (xi , y) for any y ∈ Y to {1, 2, . . . , Q}
such that f (xi , y1 ) > f (xi , y2 ) implies rankf (xi , y1 ) < rankf (xi , y2 ). Given a
test set S = {(x1 , Y1 ), (x2 , Y2 ), . . . , (xp , Yp )}, the following evaluation metrics
are used in this paper:
(1) hamming loss: evaluates how many times an instance-label pair is mis-
classified, i.e. a label not belonging to the instance is predicted or a label
belonging to the instance is not predicted. The performance is perfect when
hlossS (h) = 0; the smaller the value of hlossS (h), the better the performance.
$$\mathrm{hloss}_S(h) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{Q}\,\big|h(x_i)\,\Delta\,Y_i\big| \qquad (1)$$
where ∆ stands for the symmetric difference between two sets. Note that
when |Yi | = 1 for all instances, a multi-label system is in fact a multi-class
single-label one and the hamming loss is 2/Q times the usual classification
error.
While hamming loss is based on the multi-label classifier h(·), the following
metrics are defined based on the real-valued function f (·, ·) which concern the
ranking quality of different labels for each instance:
(2) one-error: evaluates how many times the top-ranked label is not in the set of
proper labels of the instance. The performance is perfect when one-errorS (f ) =
0; the smaller the value of one-errorS (f ), the better the performance.
$$\text{one-error}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\Big[\Big[\big[\arg\max_{y\in\mathcal{Y}} f(x_i,y)\big]\notin Y_i\Big]\Big] \qquad (2)$$
where for any predicate π, [[π]] equals 1 if π holds and 0 otherwise. Note that,
for single-label classification problems, the one-error is identical to ordinary
classification error.
(3) coverage: evaluates how far we need, on average, to go down the list
of labels in order to cover all the proper labels of the instance. It is loosely
related to precision at the level of perfect recall. The smaller the value of
coverageS (f ), the better the performance.
$$\mathrm{coverage}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\max_{y\in Y_i}\mathrm{rank}_f(x_i,y)-1 \qquad (3)$$
(4) ranking loss: evaluates the average fraction of label pairs that are reversely
ordered for the instance. The performance is perfect when rlossS (f ) = 0; the
smaller the value of rlossS (f ), the better the performance.
$$\mathrm{rloss}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i||\bar{Y}_i|}\Big|\big\{(y_1,y_2)\;\big|\;f(x_i,y_1)\le f(x_i,y_2),\ (y_1,y_2)\in Y_i\times\bar{Y}_i\big\}\Big| \qquad (4)$$

where $\bar{Y}$ denotes the complementary set of $Y$ in $\mathcal{Y}$.
(5) average precision: evaluates the average fraction of labels ranked above a
particular label y ∈ Y which actually are in Y . It was originally used in infor-
mation retrieval (IR) systems to evaluate the document ranking performance
for query retrieval [17]. The performance is perfect when avgprecS (f ) = 1; the
bigger the value of avgprecS (f ), the better the performance.
$$\mathrm{avgprec}_S(f) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|Y_i|}\sum_{y\in Y_i}\frac{\big|\{y'\;|\;\mathrm{rank}_f(x_i,y')\le \mathrm{rank}_f(x_i,y),\ y'\in Y_i\}\big|}{\mathrm{rank}_f(x_i,y)} \qquad (5)$$
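For reference, Eqs. (1)-(5) translate directly into code. The NumPy sketch below is an illustrative transcription only; the function name, the score matrix F and the binary prediction matrix H are assumptions made for the example, not code from the paper.

```python
import numpy as np

def multilabel_metrics(Y, F, H):
    """Y: (p, Q) binary ground-truth labels; F: (p, Q) scores f(x_i, y);
    H: (p, Q) binary predictions h(x_i). Assumes every Y_i is non-empty and proper."""
    p, Q = Y.shape
    # rank_f(x_i, y): 1 for the highest-scored label, Q for the lowest-scored one.
    order = np.argsort(-F, axis=1)
    rank = np.empty_like(order)
    rank[np.arange(p)[:, None], order] = np.arange(1, Q + 1)

    hamming = np.mean((H != Y).sum(axis=1) / Q)                            # Eq. (1)
    one_error = np.mean([Y[i, np.argmax(F[i])] == 0 for i in range(p)])    # Eq. (2)
    coverage = np.mean([rank[i, Y[i] == 1].max() - 1 for i in range(p)])   # Eq. (3)

    rloss, avgprec = 0.0, 0.0
    for i in range(p):
        pos, neg = np.flatnonzero(Y[i] == 1), np.flatnonzero(Y[i] == 0)
        # Eq. (4): fraction of (relevant, irrelevant) label pairs that are reversely ordered.
        rloss += np.mean(F[i, pos][:, None] <= F[i, neg][None, :])
        # Eq. (5): for each relevant label, fraction of labels ranked above it that are relevant.
        r = rank[i, pos]
        avgprec += np.mean([np.sum(r <= ri) / ri for ri in r])
    return hamming, one_error, coverage, rloss / p, avgprec / p
```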
The basic assumption of Pmms is that multi-label text has a mixture of
characteristic words appearing in single-label text belonging to each category
of the multi-categories.
It is worth noting that the generative models used in [14] and [21] are both
based on learning text frequencies in documents, and are thus specific to text
applications. Also in 2003, Comité et al. [4] extended the alternating decision tree
[8] to handle multi-label data, where the AdaBoost.MH algorithm proposed
by Schapire and Singer [18] is employed to train the multi-label alternating
decision trees.
In 2004, Gao et al. [11] generalized the maximal figure-of-merit (MFoM) ap-
proach [10] for binary classifier learning to the case of multiclass, multi-label
text categorization. They defined a continuous and differentiable function of
the classifier parameters to simulate specific performance metrics, such as pre-
cision and recall (micro-averaging F1 in their paper). Their method assigns
a uniform score function to each category of interest for each given test ex-
ample, and thus the classical Bayes decision rules can be applied. One year
later, Kazawa et al. [13] converted the original multi-label learning problem of
text categorization into a multiclass single-label problem by regarding a set
of topics (labels) as a new class. To cope with the data sparseness caused
by the huge number of possible classes (Q topics will yield 2^Q classes), they
embedded labels into a similarity-induced vector space in which prototype
vectors of similar labels will be placed close to each other. They also provided
an approximation method in learning and efficient classification algorithms in
testing to overcome the demanding computational cost of their method.
plete classification. One year later, through defining a special cost function
based on ranking loss (as shown in Eq.(4)) and the corresponding margin
for multi-label models, Elisseeff and Weston [7] proposed a kernel method for
multi-label classification and tested their algorithm on a Yeast gene functional
classification problem with positive results. In 2004, Boutell et al. [2] applied
multi-label learning techniques to scene classification. They decomposed the
multi-label learning problem into multiple independent binary classification
problems (one per category), where each example associated with label set Y
will be regarded as a positive example when building the classifier for class y ∈ Y ,
and as a negative example when building the classifier for class y ∉ Y .
They also provided various labeling criteria to predict a set of labels for each
test instance based on its output on each binary classifier. Note that although
most works on multi-label learning assume that an instance can be associated
with multiple valid labels, there are also works assuming that only one of the
labels associated with an instance is correct [12] 1 .
In the following Section, the first multi-label lazy learning algorithm, Ml-knn,
is proposed.
4 ML-KNN
For each test instance t, Ml-knn first identifies its k nearest neighbors N(t)
in the training set. Let H_1^l be the event that t has label l, and H_0^l the event
that t does not have label l. Furthermore, let E_j^l (j ∈ {0, 1, . . . , k}) denote the
event that, among the k nearest neighbors of t, there are exactly j instances
which have label l. Based on the membership counting vector ~C_t, whose l-th
component ~C_t(l) records the number of neighbors of t having label l, the category
vector ~y_t is determined using the following maximum a posteriori principle:

$$\vec{y}_t(l) = \arg\max_{b\in\{0,1\}} P\big(H_b^l \mid E_{\vec{C}_t(l)}^l\big), \quad l \in \mathcal{Y} \qquad (8)$$
As shown in Eq.(8), in order to determine the category vector ~y_t, all the
information needed is the prior probabilities P(H_b^l) (l ∈ Y, b ∈ {0, 1}) and
the posterior probabilities P(E_j^l | H_b^l) (j ∈ {0, 1, . . . , k}). Actually, these prior
and posterior probabilities can all be directly estimated from the training set
based on frequency counting.
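As an illustration of such frequency counting (a sketch only; it assumes that the parameter s appearing in the algorithm's signature is a Laplace-style smoothing constant), the prior probabilities can be estimated from the m training instances as

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{s \times 2 + m}, \qquad P(H_0^l) = 1 - P(H_1^l), \qquad l \in \mathcal{Y},$$

where $\vec{y}_{x_i}(l)$ equals 1 if training instance $x_i$ has label $l$ and 0 otherwise.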
Fig. 1 gives the pseudo-code of Ml-knn, whose signature is [~y_t, ~r_t] =
Ml-knn(T , k, t, s). Given the set T of multi-label training instances, steps (1)
and (2) estimate the prior probabilities P(H_b^l). Steps (3) to (13) estimate the
posterior probabilities P(E_j^l | H_b^l), where c[j] used in each iteration over l
counts the number of training instances with label l whose k nearest neighbors
contain exactly j instances with label l; correspondingly, c'[j] counts the number
of training instances without label l whose k nearest neighbors contain exactly
j instances with label l. Finally, using the Bayesian rule, steps (14) to (18)
compute the algorithm's outputs based on the estimated probabilities; in
particular, step (18) computes the real-valued output as

$$\vec{r}_t(l) = P\big(H_1^l \mid E_{\vec{C}_t(l)}^l\big) = \frac{P(H_1^l)\,P\big(E_{\vec{C}_t(l)}^l \mid H_1^l\big)}{\sum_{b\in\{0,1\}} P(H_b^l)\,P\big(E_{\vec{C}_t(l)}^l \mid H_b^l\big)}.$$
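To make the procedure concrete, the following is a minimal NumPy sketch of the training and prediction stages described above (prior estimation, neighbor counting, posterior estimation, and the MAP and Bayes-rule outputs). The brute-force neighbor search, the function and variable names, and the default smoothing value s = 1.0 are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def knn_indices(X, x, k):
    # Brute-force k nearest neighbors by Euclidean distance (illustrative only).
    d = np.linalg.norm(X - x, axis=1)
    return np.argsort(d)[:k]

def mlknn_train(X, Y, k=10, s=1.0):
    """X: (m, d) features; Y: (m, Q) binary label matrix. Returns estimated probabilities."""
    m, Q = Y.shape
    # Steps (1)-(2): prior probabilities P(H_1^l) with Laplace-style smoothing.
    prior1 = (s + Y.sum(axis=0)) / (s * 2 + m)
    # Steps (3)-(13): posterior probabilities P(E_j^l | H_b^l) via frequency counting.
    c1 = np.zeros((Q, k + 1))   # c[j]: instances WITH label l whose k NN contain j instances with l
    c0 = np.zeros((Q, k + 1))   # c'[j]: instances WITHOUT label l, same counting
    for i in range(m):
        # Exclude the instance itself from its own neighbor set.
        neigh = knn_indices(np.delete(X, i, axis=0), X[i], k)
        counts = np.delete(Y, i, axis=0)[neigh].sum(axis=0)   # membership counting vector
        for l in range(Q):
            (c1 if Y[i, l] == 1 else c0)[l, int(counts[l])] += 1
    post1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))   # P(E_j^l | H_1^l)
    post0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))   # P(E_j^l | H_0^l)
    return prior1, post1, post0

def mlknn_predict(X, Y, t, k, prior1, post1, post0):
    """Predict category vector y_t and real-valued output r_t for a test instance t."""
    Q = Y.shape[1]
    counts = Y[knn_indices(X, t, k)].sum(axis=0)       # C_t(l) for each label l
    j = counts.astype(int)
    p1 = prior1 * post1[np.arange(Q), j]               # P(H_1^l) P(E_{C_t(l)}^l | H_1^l)
    p0 = (1 - prior1) * post0[np.arange(Q), j]         # P(H_0^l) P(E_{C_t(l)}^l | H_0^l)
    y_t = (p1 > p0).astype(int)                        # MAP decision, Eq. (8)
    r_t = p1 / (p1 + p0)                               # Bayes-rule posterior, step (18)
    return y_t, r_t
```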
5 Experiments
5.1 Yeast Gene Functional Analysis
Fig. 2. First level of the hierarchy of the Yeast gene functional classes. One gene, for
instance the one named YAL062w, can belong to several classes (shaded in grey) of
the 14 possible classes.
ture of the functional classes are used. Actually, the whole set of functional
classes is structured into hierarchies up to 4 levels deep 4 . In this paper, as
has been done in the literature [7], only functional classes in the top
hierarchy are considered. The first level of the hierarchy is depicted in Fig. 2.
The resulting multi-label data set contains 2 417 genes each represented by
a 103-dimensional feature vector. There are 14 possible class labels and the
average number of labels for each gene is 4.24 ± 1.57.
Table 1
Experimental results of Ml-knn (mean±std) on the Yeast data with different num-
ber of nearest neighbors considered.
Table 2
Experimental results of each multi-label learning algorithm (mean±std) on the
Yeast data.
Evaluation Criterion    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Hamming Loss 0.194±0.010 0.220±0.011 0.207±0.010 0.207±0.013
One-error 0.230±0.030 0.278±0.034 0.244±0.035 0.243±0.039
Coverage 6.275±0.240 6.550±0.243 6.390±0.203 7.090±0.503
Ranking Loss 0.167±0.016 0.186±0.015 N/A 0.195±0.021
Average Precision 0.765±0.021 0.737±0.022 0.744±0.025 0.749±0.026
rest of this paper are obtained with the parameter k set to be the moderate
value of 10.
Note that the partial order “≻” only measures the relative performance be-
tween two algorithms A1 and A2 on one specific evaluation criterion. However,
it is quite possible that A1 performs better than A2 in terms of some met-
rics but worse than A2 in terms of other ones. In this case, it is hard to judge
Table 3
Relative performance between each multi-label learning algorithm on the Yeast data.
Evaluation Criterion    Relative Performance (A1: Ml-knn; A2: BoosTexter; A3: Adtboost.MH; A4: Rank-svm)
Hamming Loss    A1 ≻ A2, A1 ≻ A3, A1 ≻ A4, A3 ≻ A2, A4 ≻ A2
One-error    A1 ≻ A2, A3 ≻ A2, A4 ≻ A2
Coverage    A1 ≻ A2, A1 ≻ A3, A1 ≻ A4, A2 ≻ A4, A3 ≻ A2, A3 ≻ A4
Ranking Loss    A1 ≻ A2, A1 ≻ A4
Average Precision    A1 ≻ A2, A1 ≻ A3
Total Order Ml-knn(11)>Adtboost.MH(1)>Rank-svm(-3)>BoosTexter(-9)
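The scoring step behind the total order is not shown on this page, but the parenthesized numbers in Table 3 are consistent with a simple wins-minus-losses count over the partial order; the snippet below reproduces them under that assumption (the pair list is copied from Table 3).

```python
# Assumption: each algorithm scores +1 for every partial-order comparison it wins
# and -1 for every one it loses; the sum gives the number shown in the total order.
pairs = (
    [("A1", "A2"), ("A1", "A3"), ("A1", "A4"), ("A3", "A2"), ("A4", "A2")]                  # Hamming Loss
    + [("A1", "A2"), ("A3", "A2"), ("A4", "A2")]                                            # One-error
    + [("A1", "A2"), ("A1", "A3"), ("A1", "A4"), ("A2", "A4"), ("A3", "A2"), ("A3", "A4")]  # Coverage
    + [("A1", "A2"), ("A1", "A4")]                                                          # Ranking Loss
    + [("A1", "A2"), ("A1", "A3")]                                                          # Average Precision
)
score = {a: 0 for a in ("A1", "A2", "A3", "A4")}
for winner, loser in pairs:
    score[winner] += 1
    score[loser] -= 1
print(score)   # {'A1': 11, 'A2': -9, 'A3': 1, 'A4': -3}, matching Table 3
```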
Table 2 shows that Ml-knn performs fairly well in terms of all the evaluation
criteria, where on all these metrics no algorithm has outperformed Ml-knn.
In particular, Ml-knn outperforms all the other algorithms with respect to ham-
ming loss, coverage and ranking loss 5 . It is also worth noting that BoosTex-
ter performs quite poorly compared to other algorithms. As indicated in the
literature [7], the reason may be that the simple decision function realized by
this method is not suitable to learn from the Yeast data set. On the whole (as
shown by the total order), Ml-knn substantially outperforms all the other
algorithms on the Yeast data.
5.2 Natural Scene Classification
In natural scene classification, each natural scene image may belong to several
image types (classes) simultaneously, e.g. the image shown in Fig. 3(a) can be
classified as a mountain scene as well as a tree scene, while the image shown
in Fig. 3(b) can be classified as a sea scene as well as a sunset scene. Through
analyzing images with known label sets, a multi-label learning system will
automatically predict the sets of labels for unseen images. The above process
of semantic scene classification can be applied to many areas, such as content-
based indexing and organization and content-sensitive image enhancement,
etc. [2]. In this paper, the effectiveness of multi-label learning algorithms is
also evaluated via this specific kind of multi-label learning problem.
The experimental data set consists of 2 000 natural scene images, where a set
of labels is manually assigned to each image. Table 4 gives the detailed descrip-
tion of the number of images associated with different label sets, where all the
possible class labels are desert, mountains, sea, sunset and trees. The number
of images belonging to more than one class (e.g. sea+sunset) comprises over
22% of the data set, while many combined classes (e.g. mountains+sunset+trees)
are extremely rare. On average, each image is associated with 1.24 class labels.
In this paper, each image is represented by a feature vector using the same
method employed in the literature [2]. Concretely, each color image is firstly
converted to the CIE Luv space, which is a more perceptually uniform color
space.
Table 4
Summary of the natural scene image data set.
Table 5
Experimental results of each multi-label learning algorithm (mean±std) on the nat-
ural scene image data set.
Evaluation Criterion    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Hamming Loss 0.169±0.016 0.179±0.015 0.193±0.014 0.253±0.055
One-error 0.300±0.046 0.311±0.041 0.375±0.049 0.491±0.135
Coverage 0.939±0.100 0.939±0.092 1.102±0.111 1.382±0.381
Ranking Loss 0.168±0.024 0.168±0.020 N/A 0.278±0.096
Average Precision 0.803±0.027 0.798±0.024 0.755±0.027 0.682±0.093
Table 6
Relative performance between each multi-label learning algorithm on the natural
scene image data set.
Evaluation Criterion    Relative Performance (A1: Ml-knn; A2: BoosTexter; A3: Adtboost.MH; A4: Rank-svm)
Hamming Loss    A1 ≻ A2, A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4, A3 ≻ A4
One-error    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4, A3 ≻ A4
Coverage    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4, A3 ≻ A4
Ranking Loss    A1 ≻ A4, A2 ≻ A4
Average Precision    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4, A3 ≻ A4
Total Order Ml-knn(10)>BoosTexter(8)>Adtboost.MH(-4)>Rank-svm(-14)
Fig. 4. Some example images on which Ml-knn works better than Boos-
Texter, Adtboost.MH and Rank-svm. (a) Ground-truth: desert, Ml-knn:
desert, BoosTexter: desert+trees, Adtboost.MH: null, Rank-svm: moun-
tains; (b) Ground-truth: mountains, Ml-knn: mountains, BoosTexter: moun-
tains+trees, Adtboost.MH: trees, Rank-svm: trees; (c) Ground-truth: moun-
tains+trees, Ml-knn: mountains+trees, BoosTexter: null, Adtboost.MH:
mountains, Rank-svm: mountains; (d) Ground-truth: sea+sunset, Ml-knn:
sea+sunset, BoosTexter: sunset, Adtboost.MH: null, Rank-svm: sunset.
Table 7
Experimental results of each multi-label learning algorithm (mean±std) on the fil-
tered natural scene image data set.
Evaluation Criterion    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Hamming Loss 0.235±0.019 0.240±0.043 0.270±0.027 0.275±0.035
One-error 0.205±0.051 0.211±0.058 0.250±0.070 0.236±0.064
Coverage 1.875±0.072 1.921±0.140 2.080±0.151 2.054±0.162
Ranking Loss 0.195±0.021 0.204±0.034 N/A 0.228±0.040
Average Precision 0.832±0.018 0.828±0.029 0.799±0.034 0.805±0.032
Table 8
Relative performance between each multi-label learning algorithm on the filtered
natural scene image data set.
Evaluation Criterion    Relative Performance (A1: Ml-knn; A2: BoosTexter; A3: Adtboost.MH; A4: Rank-svm)
Hamming Loss    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4
One-error    A1 ≻ A3
Coverage    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3
Ranking Loss    A1 ≻ A4
Average Precision    A1 ≻ A3, A1 ≻ A4, A2 ≻ A3, A2 ≻ A4
Total Order Ml-knn(8)>BoosTexter(5)>Rank-svm(-6)>Adtboost.MH(-7)
Note that in this data set, the average number of class labels associated with
each image is relatively small (i.e. 1.24). Therefore, to further evaluate the
performance of the multi-label learning algorithms on the problem of natural
scene classification, images with only one class label are excluded from the
original data set. Thus, a filtered data set containing 457 images is obtained,
in which each image is associated with 2.03 class labels on average. Ten-
fold cross-validation is again performed on the filtered image data set, where
experimental results of the multi-label learning algorithms are reported in
Table 7 with the best result on each evaluation criterion shown in bold face.
As with the Yeast data, the partial order “≻” and the total order “>” are
also defined on the set of all comparing algorithms, as shown in Table 8.
Table 8 shows that both Ml-knn and BoosTexter are superior or at least
comparable to Adtboost.MH and Rank-svm in terms of all evaluation
criteria. Furthermore, Ml-knn outperforms Adtboost.MH on all evalua-
tion metrics while BoosTexter outperforms Adtboost.MH and Ml-knn
outperforms Rank-svm on all evaluation metrics except one-error. On the
whole (as shown by the total order), as on the original (unfiltered) data
set, Ml-knn again slightly outperforms BoosTexter and is far superior to
Adtboost.MH and Rank-svm on the filtered natural scene image data set.
These results show that Ml-knn can also work well on the problem of natural
scene classification when more class labels are associated with each image.
5.3 Automatic Web Page Categorization

Recently, Ueda and Saito [21] presented two types of probabilistic gener-
ative models called parametric mixture models (Pmm1, Pmm2) for multi-
label text. They also designed efficient learning and prediction algorithms for
Pmms and tested the effectiveness of their method with application to the
specific text categorization problem of WWW page categorization 6 . Specif-
ically, they tried to categorize real Web pages linked from the “yahoo.com”
domain, which consists of 14 top-level categories (i.e. “Arts&Humanities”,
“Business&Economy”, etc.), and each category is further divided into a number of
second-level subcategories. By focusing on the second-level categories, they
used 11 out of the 14 independent text categorization problems. For each
problem, the training set contains 2 000 documents while the test set contains
3 000 documents.
In this paper, these data sets are used to further evaluate the performance of
each multi-label learning algorithm. The simple term selection method based
on document frequency (the number of documents containing a specific term)
is used to reduce the dimensionality of each data set. Actually, only the 2% of
words with the highest document frequency are retained in the final vocabulary 7 . Note
6 Data set available at http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar.gz.
7 Based on a series of experiments, Yang and Pedersen [22] have shown that based
Table 9
Characteristics of the web page data sets (after term selection). PMC denotes the
percentage of documents belonging to more than one category, ANL denotes the
average number of labels for each document, and PRC denotes the percentage of
rare categories, i.e. categories to which less than 1% of the instances in the
data set belong.
that other term selection methods such as information gain and mutual in-
formation could also be adopted. After term selection, each document in the
data set is described as a feature vector using the “Bag-of-Words” represen-
tation [6], i.e. each dimension of the feature vector corresponds to the number
of times the corresponding vocabulary word appears in the document. Table 9 sum-
marizes the characteristics of the web page data sets. It is worth noting that,
compared with the Yeast data and the natural scene image data, instances
are represented by much higher dimensional feature vectors and a large por-
tion of them (about 20% ∼ 45%) are multi-labelled over the 11 problems.
Furthermore, in those 11 data sets, the number of categories is much larger
(minimum 21, maximum 40) and many of them are rare categories (about
20% ∼ 55%). Therefore, the web page data sets are more difficult to learn
from than the previous data collections.
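As a sketch of the preprocessing described above (document-frequency term selection followed by a “Bag-of-Words” count representation), the following assumes an already tokenized corpus; the tokenization, function name and cut-off handling are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from collections import Counter

def build_bow(docs, keep_fraction=0.02):
    """docs: list of token lists. Keep the top `keep_fraction` of words by document
    frequency and represent each document by term counts over that vocabulary."""
    df = Counter()
    for d in docs:
        df.update(set(d))                       # document frequency: one count per document
    n_keep = max(1, int(len(df) * keep_fraction))
    vocab = [w for w, _ in df.most_common(n_keep)]
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d:
            j = index.get(w)
            if j is not None:
                X[i, j] += 1                    # "Bag-of-Words": raw term counts
    return X, vocab

# Toy usage with a made-up corpus and a larger cut-off so the vocabulary is non-trivial.
docs = [["web", "page", "page", "sports"], ["web", "news"], ["sports", "news", "news"]]
X, vocab = build_bow(docs, keep_fraction=0.7)
```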
Table 10
Experimental results of each multi-label learning algorithm on the web page data
sets in terms of hamming loss.
Data Set    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Arts&Humanities 0.0612 0.0652 0.0585 0.0615
Business&Economy 0.0269 0.0293 0.0279 0.0275
Computers&Internet 0.0412 0.0408 0.0396 0.0392
Education 0.0387 0.0457 0.0423 0.0398
Entertainment 0.0604 0.0626 0.0578 0.0630
Health 0.0458 0.0397 0.0397 0.0423
Recreation&Sports 0.0620 0.0657 0.0584 0.0605
Reference 0.0314 0.0304 0.0293 0.0300
Science 0.0325 0.0379 0.0344 0.0340
Social&Science 0.0218 0.0243 0.0234 0.0242
Society&Culture 0.0537 0.0628 0.0575 0.0555
Average 0.0432 0.0459 0.0426 0.0434
Table 11
Experimental results of each multi-label learning algorithm on the web page data
sets in terms of one-error.
Data Set    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Arts&Humanities 0.6330 0.5550 0.5617 0.6653
Business&Economy 0.1213 0.1307 0.1337 0.1237
Computers&Internet 0.4357 0.4287 0.4613 0.4037
Education 0.5207 0.5587 0.5753 0.4937
Entertainment 0.5300 0.4750 0.4940 0.4933
Health 0.4190 0.3210 0.3470 0.3323
Recreation&Sports 0.7057 0.5557 0.5547 0.5627
Reference 0.4730 0.4427 0.4840 0.4323
Science 0.5810 0.6100 0.6170 0.5523
Social&Science 0.3270 0.3437 0.3600 0.3550
Society&Culture 0.4357 0.4877 0.4845 0.4270
Average 0.4711 0.4463 0.4612 0.4401
As with the Yeast data, the partial order “≻” and total order “>” are also defined
on the set of all comparing algorithms, as shown in Table 15.
As shown in Table 15, Ml-knn achieves comparable results in terms of all the
evaluation criteria, where on all these metrics no algorithm has outperformed
Ml-knn. On the other hand, although BoosTexter performs quite well in
terms of one-error, coverage, ranking loss and average precision, it performs
Table 12
Experimental results of each multi-label learning algorithm on the web page data
sets in terms of coverage.
Data Set    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Arts&Humanities 5.4313 5.2973 5.1900 9.2723
Business&Economy 2.1840 2.4123 2.4730 3.3637
Computers&Internet 4.4117 4.4887 4.4747 8.7910
Education 3.4973 4.0673 3.9663 8.9560
Entertainment 3.1467 3.0883 3.0877 6.5210
Health 3.3043 3.0780 3.0843 5.5400
Recreation&Sports 5.1010 4.4737 4.3380 5.6680
Reference 3.5420 3.2100 3.2643 6.9683
Science 6.0470 6.6907 6.6027 12.4010
Social&Science 3.0340 3.6870 3.4820 8.2177
Society&Culture 5.3653 5.8463 4.9545 6.8837
Average 4.0968 4.2127 4.0834 7.5075
Table 13
Experimental results of each multi-label learning algorithm on the web page data
sets in terms of ranking loss.
Data Set    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Arts&Humanities 0.1514 0.1458 N/A 0.2826
Business&Economy 0.0373 0.0416 N/A 0.0662
Computers&Internet 0.0921 0.0950 N/A 0.2091
Education 0.0800 0.0938 N/A 0.2080
Entertainment 0.1151 0.1132 N/A 0.2617
Health 0.0605 0.0521 N/A 0.1096
Recreation&Sports 0.1913 0.1599 N/A 0.2094
Reference 0.0919 0.0811 N/A 0.1818
Science 0.1167 0.1312 N/A 0.2570
Social&Science 0.0561 0.0684 N/A 0.1661
Society&Culture 0.1338 0.1483 N/A 0.1716
Average 0.1024 0.1028 N/A 0.1930
almost the worst among all the comparing algorithms in terms of hamming loss.
It is also worth noting that all the algorithms perform quite poorly in terms
of one-error (around 45% for all comparing algorithms). The reason may be
that there are many more categories in those 11 data sets, which makes it much
more difficult for the top-ranked label to be in the set of proper labels of an
instance. On the whole (as shown by the total order), Ml-knn is comparable
to BoosTexter, and both are superior to Adtboost.MH and Rank-svm on
the web page data sets.
Table 14
Experimental results of each multi-label learning algorithm on the web page data
sets in terms of average precision.
Data Set    Ml-knn    BoosTexter    Adtboost.MH    Rank-svm
Arts&Humanities 0.5097 0.5448 0.5526 0.4170
Business&Economy 0.8798 0.8697 0.8702 0.8694
Computers&Internet 0.6338 0.6449 0.6235 0.6123
Education 0.5993 0.5654 0.5619 0.5702
Entertainment 0.6013 0.6368 0.6221 0.5637
Health 0.6817 0.7408 0.7257 0.6839
Recreation&Sports 0.4552 0.5572 0.5639 0.5315
Reference 0.6194 0.6578 0.6264 0.6176
Science 0.5324 0.5006 0.4940 0.5007
Social&Science 0.7481 0.7262 0.7217 0.6788
Society&Culture 0.6128 0.5717 0.5881 0.5717
Average 0.6249 0.6378 0.6318 0.6046
Table 15
Relative performance between each multi-label learning algorithm on the web page
data sets.
Evaluation Criterion    Relative Performance (A1: Ml-knn; A2: BoosTexter; A3: Adtboost.MH; A4: Rank-svm)
Hamming Loss    A3 ≻ A2, A4 ≻ A2
One-error    A2 ≻ A3
Coverage    A1 ≻ A4, A2 ≻ A4, A3 ≻ A4
Ranking Loss    A1 ≻ A4, A2 ≻ A4
Average Precision    A2 ≻ A4
Total Order {Ml-knn(2), BoosTexter(2)}>Adtboost.MH(1)>Rank-svm(-5)
6 Conclusion
In this paper, a lazy learning algorithm named Ml-knn, which is the multi-
label version of kNN, is proposed. Based on statistical information derived
from the label sets of an unseen instance’s neighboring instances, i.e. the mem-
bership counting statistic as shown in Section 4, Ml-knn utilizes maximum
a posteriori principle to determine the label set for the unseen instance. Ex-
periments on three real-world multi-label learning problems, i.e. Yeast gene
functional analysis, natural scene classification and automatic web page cate-
gorization, show that Ml-knn outperforms some well-established multi-label
learning algorithms.
Acknowledgment
The authors wish to thank the anonymous reviewers for their constructive
suggestions. Many thanks to A. Elisseeff and J. Weston for providing the
authors with the Yeast data and the implementation details of Rank-svm.
This work was supported by the National Science Foundation of China under
the Grant No. 60473046.
References
[6] S. T. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning
algorithms and representation for text categorization, in: Proceedings of the 7th
ACM International Conference on Information and Knowledge Management,
Bethesda, MD, 1998, pp. 148–155.
[8] Y. Freund, L. Mason, The alternating decision tree learning algorithm, in:
Proceedings of the 16th International Conference on Machine Learning, Bled,
Slovenia, 1999, pp. 124–133.
[10] S. Gao, W. Wu, C.-H. Lee, T.-S. Chua, A maximal figure-of-merit learning
approach to text categorization, in: Proceedings of the 26th Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, Toronto, Canada, 2003, pp. 174–181.
[11] S. Gao, W. Wu, C.-H. Lee, T.-S. Chua, A MFoM learning approach to
robust multiclass multi-label text categorization, in: Proceedings of the 21st
International Conference on Machine Learning, Banff, Canada, 2004, pp. 329–
336.
[12] R. Jin, Z. Ghahramani, Learning with multiple labels, in: S. Becker, S. Thrun,
K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15,
MIT Press, Cambridge, MA, 2003, pp. 897–904.
[16] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San
Mateo, California, 1993.
[17] G. Salton, Developments in automatic text retrieval, Science 253 (1991) 974–
980.
[18] R. E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated
predictions, in: Proceedings of the 11th Annual Conference on Computational
Learning Theory, New York, 1998, pp. 80–91.
[21] N. Ueda, K. Saito, Parametric mixture models for multi-label text, in: S. Becker,
S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing
Systems 15, MIT Press, Cambridge, MA, 2003, pp. 721–728.