A k-Nearest Neighbor Based Algorithm for Multi-label Classification
Min-Ling Zhang and Zhi-Hua Zhou
National Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210093, China
Email: {zhangml, zhouzh}@lamda.nju.edu.cn
Abstract- In multi-label learning, each instance in the training set is associated with a set of labels, and the task is to output a label set whose size is unknown a priori for each unseen instance. In this paper, a multi-label lazy learning approach named ML-kNN is presented, which is derived from the traditional k-nearest neighbor (kNN) algorithm. In detail, for each new instance, its k nearest neighbors are first identified. After that, according to the label sets of these neighboring instances, the maximum a posteriori (MAP) principle is utilized to determine the label set for the new instance. Experiments on a real-world multi-label bioinformatic data set show that ML-kNN is highly comparable to existing multi-label learning algorithms.

I. INTRODUCTION

Multi-label classification tasks are ubiquitous in real-world problems. For example, in text categorization, each document may belong to several predefined topics; in bioinformatics, one protein may have many effects on a cell when predicting its functional classes. In either case, instances in the training set are each associated with a set of labels, and the task is to output the label set for the unseen instance, whose set size is not available a priori.

Traditional two-class and multi-class problems can both be cast into multi-label ones by restricting each instance to have only one label. However, the generality of the multi-label problem makes it more difficult to learn. An intuitive approach to solving a multi-label problem is to decompose it into multiple independent binary classification problems (one per category). But this kind of method does not consider the correlations between the different labels of each instance. Fortunately, several approaches specially designed for multi-label classification have been proposed, such as multi-label text categorization algorithms [1], [2], [3], multi-label decision trees [4], [5] and the multi-label kernel method [6]. However, a multi-label lazy learning approach is still not available. In this paper, this problem is addressed by a multi-label classification algorithm named ML-kNN, i.e. Multi-Label k-Nearest Neighbor, which is derived from the popular k-nearest neighbor (kNN) algorithm [7]. ML-kNN first identifies the k nearest neighbors of the test instance, from which the label sets of these neighboring instances are obtained. After that, the maximum a posteriori (MAP) principle is employed to predict the set of labels of the test instance.

The rest of this paper is organized as follows. Section 2 reviews previous work on multi-label learning and summarizes different evaluation criteria used in this area. Section 3 presents the ML-kNN algorithm. Section 4 reports experimental results on a real-world multi-label bioinformatic data set. Finally, Section 5 concludes and indicates several issues for future work.

II. MULTI-LABEL LEARNING

Research on multi-label learning was initially motivated by the difficulty of concept ambiguity encountered in text categorization, where each document may belong to several topics (labels) simultaneously. One famous approach to solving this problem is BoosTexter proposed by Schapire and Singer [2], which is in fact extended from the popular ensemble learning method AdaBoost [8]. In the training phase, BoosTexter maintains a set of weights over both training examples and their labels, where training examples and their corresponding labels that are hard (easy) to predict correctly get incrementally higher (lower) weights. Following the work of BoosTexter, multi-label learning has attracted much attention from machine learning researchers.

In 1999, McCallum [1] proposed a Bayesian approach to multi-label document classification, where a mixture probabilistic model is assumed to generate each document and the EM algorithm [9] is utilized to learn the mixture weights and the word distributions in each mixture component. In 2001, through defining a special cost function based on Ranking Loss (as shown in Eq.(5)) and the corresponding margin for multi-label models, Elisseeff and Weston [6] proposed a kernel method for multi-label classification. In the same year, Clare and King [4] adapted the C4.5 decision tree [10] to handle multi-label data through modifying the definition of entropy. One year later, using the independent word-based Bag-of-Words representation [11], Ueda and Saito [3] presented two types of probabilistic generative models for multi-label text called parametric mixture models (PMM1, PMM2), where the basic assumption under PMMs is that multi-label text has a mixture of the characteristic words appearing in the single-label texts that belong to each of its multiple categories. In the same year, Comite et al. [5] extended the alternating decision tree [12] to handle multi-label data, where the AdaBoost.MH algorithm proposed by Schapire and Singer [13] is employed to train the multi-label alternating decision tree. In 2004, Boutell et al. [14] applied multi-label learning techniques to scene classification. They decomposed the multi-label learning problem into multiple independent binary classification problems (one per category), where each example associated with label set Y will be regarded as a positive example when building
Thus, based on the label sets of these neighbors, a membership counting vector can be defined as:

$C_x(l) = \sum_{a \in N(x)} \vec{y}_a(l), \quad l \in \mathcal{Y}$    (7)

where $C_x(l)$ counts how many neighbors of $x$ belong to the $l$-th class.

For each test instance $t$, ML-kNN first identifies its k nearest neighbors $N(t)$. Let $H^l_1$ be the event that $t$ has label $l$, and $H^l_0$ the event that $t$ does not have label $l$. Furthermore, let $E^l_j$ ($j \in \{0, \ldots, k\}$) denote the event that, among the k nearest neighbors of $t$, there are exactly $j$ instances which have label $l$. Therefore, based on the membership counting vector $\vec{C}_t$, the category vector $\vec{y}_t$ is determined using the following maximum a posteriori principle:

$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P\big(H^l_b \mid E^l_{C_t(l)}\big), \quad l \in \mathcal{Y}$    (8)

Using the Bayesian rule, Eq.(8) can be rewritten as:

$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} \dfrac{P(H^l_b)\, P\big(E^l_{C_t(l)} \mid H^l_b\big)}{P\big(E^l_{C_t(l)}\big)} = \arg\max_{b \in \{0,1\}} P(H^l_b)\, P\big(E^l_{C_t(l)} \mid H^l_b\big)$    (9)

Note that the prior probabilities $P(H^l_b)$ ($l \in \mathcal{Y}$, $b \in \{0,1\}$) and the posterior probabilities $P(E^l_j \mid H^l_b)$ ($j \in \{0, \ldots, k\}$) can all be estimated from the training set $S$.
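As a small numerical illustration of Eqs.(8) and (9), the sketch below evaluates the MAP decision and the corresponding ranking score (the quantity computed in step (19) of Figure 1) for a single label; all probability values are made up for illustration and are not estimates from the paper's data.

```python
# Illustrative MAP decision for one label l (Eqs.(8)/(9)); all numbers are hypothetical.
prior = {1: 0.3, 0: 0.7}                     # P(H^l_1), P(H^l_0)
posterior = {1: [0.05, 0.10, 0.25, 0.60],    # P(E^l_j | H^l_1) for j = 0..k (here k = 3)
             0: [0.55, 0.25, 0.15, 0.05]}    # P(E^l_j | H^l_0)

C_t_l = 2   # membership count C_t(l): 2 of the k neighbors of t carry label l (Eq.(7))

scores = {b: prior[b] * posterior[b][C_t_l] for b in (0, 1)}   # numerators of Eq.(9)
y_t_l = max(scores, key=scores.get)                            # MAP decision of Eq.(8)
r_t_l = scores[1] / (scores[0] + scores[1])                    # P(H^l_1 | E^l_{C_t(l)})
print(y_t_l, round(r_t_l, 3))                                  # here: 0 0.417
```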
Figure 1 illustrates the complete description of ML-kNN. The meanings of the input arguments $S$, $k$, $t$ and the output argument $\vec{y}_t$ are the same as described previously, while the input argument $s$ is a smoothing parameter controlling the strength of the uniform prior (in this paper, $s$ is set to 1, which yields Laplace smoothing). $\vec{r}_t$ is a real-valued vector calculated for ranking labels in $\mathcal{Y}$, where $\vec{r}_t(l)$ corresponds to the posterior probability $P(H^l_1 \mid E^l_{C_t(l)})$. As shown in Figure 1, based on the training instances, steps (1) to (3) estimate the prior probabilities $P(H^l_b)$. Steps (4) to (14) estimate the posterior probabilities $P(E^l_j \mid H^l_b)$, where $c[j]$ used in each iteration over $l$ counts the number of training instances with label $l$ whose k nearest neighbors contain exactly $j$ instances with label $l$. Correspondingly, $c'[j]$ counts the number of training instances without label $l$ whose k nearest neighbors contain exactly $j$ instances with label $l$. Finally, using the Bayesian rule, steps (15) to (19) compute the algorithm's outputs based on the estimated probabilities.

$[\vec{y}_t, \vec{r}_t] = \text{ML-kNN}(S, k, t, s)$

%Computing the prior probabilities $P(H^l_b)$
(1)  for $l \in \mathcal{Y}$ do
(2)    $P(H^l_1) = \big(s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)\big) / (s \times 2 + m)$;
(3)    $P(H^l_0) = 1 - P(H^l_1)$;
%Computing the posterior probabilities $P(E^l_j \mid H^l_b)$
(4)  Identify $N(x_i)$, $i \in \{1, \ldots, m\}$;
(5)  for $l \in \mathcal{Y}$ do
(6)    for $j \in \{0, \ldots, k\}$ do
(7)      $c[j] = 0$; $c'[j] = 0$;
(8)    for $i \in \{1, \ldots, m\}$ do
(9)      $\delta = C_{x_i}(l) = \sum_{a \in N(x_i)} \vec{y}_a(l)$;
(10)     if ($\vec{y}_{x_i}(l) == 1$) then $c[\delta] = c[\delta] + 1$;
(11)     else $c'[\delta] = c'[\delta] + 1$;
(12)   for $j \in \{0, \ldots, k\}$ do
(13)     $P(E^l_j \mid H^l_1) = (s + c[j]) / \big(s \times (k+1) + \sum_{p=0}^{k} c[p]\big)$;
(14)     $P(E^l_j \mid H^l_0) = (s + c'[j]) / \big(s \times (k+1) + \sum_{p=0}^{k} c'[p]\big)$;
%Computing $\vec{y}_t$ and $\vec{r}_t$
(15) Identify $N(t)$;
(16) for $l \in \mathcal{Y}$ do
(17)   $C_t(l) = \sum_{a \in N(t)} \vec{y}_a(l)$;
(18)   $\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H^l_b)\, P(E^l_{C_t(l)} \mid H^l_b)$;
(19)   $\vec{r}_t(l) = P(H^l_1 \mid E^l_{C_t(l)}) = P(H^l_1)\, P(E^l_{C_t(l)} \mid H^l_1) / P(E^l_{C_t(l)})$
         $= P(H^l_1)\, P(E^l_{C_t(l)} \mid H^l_1) / \big(\sum_{b \in \{0,1\}} P(H^l_b)\, P(E^l_{C_t(l)} \mid H^l_b)\big)$;

Fig. 1. Pseudo code of ML-kNN.
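For concreteness, the following is a minimal NumPy sketch of the procedure in Figure 1. The class name MLkNN, the fit/predict interface, the Euclidean neighbor search and the synthetic data at the end are illustrative choices of ours (the excerpt does not fix a distance metric); only the prior/posterior estimation and the MAP rule follow the pseudo code.

```python
# Minimal sketch of ML-kNN following the steps of Fig. 1 (illustrative, not the authors' code).
import numpy as np

class MLkNN:
    def __init__(self, k=7, s=1.0):
        self.k = k      # number of nearest neighbors
        self.s = s      # smoothing parameter (s = 1 gives Laplace smoothing)

    def _neighbors(self, x, X, exclude=None):
        # Indices of the k nearest neighbors of x among the rows of X (Euclidean distance assumed).
        d = np.linalg.norm(X - x, axis=1)
        if exclude is not None:
            d[exclude] = np.inf              # leave the training instance itself out
        return np.argsort(d)[:self.k]

    def fit(self, X, Y):
        # X: (m, d) training instances; Y: (m, q) binary label matrix with Y[i, l] = y_{x_i}(l).
        m, q = Y.shape
        k, s = self.k, self.s
        self.X, self.Y = X, Y
        # Steps (1)-(3): prior probabilities P(H^l_1) and P(H^l_0).
        self.prior1 = (s + Y.sum(axis=0)) / (s * 2 + m)
        self.prior0 = 1.0 - self.prior1
        # Steps (4)-(14): posterior probabilities P(E^l_j | H^l_1) and P(E^l_j | H^l_0).
        c = np.zeros((q, k + 1))    # c[l, j]:  instances with label l whose neighbors contain j labels l
        cp = np.zeros((q, k + 1))   # c'[l, j]: instances without label l, same neighbor count
        for i in range(m):
            nn = self._neighbors(X[i], X, exclude=i)
            delta = Y[nn].sum(axis=0)            # membership counts C_{x_i}(l) for all labels at once
            for l in range(q):
                if Y[i, l] == 1:
                    c[l, delta[l]] += 1
                else:
                    cp[l, delta[l]] += 1
        self.post1 = (s + c) / (s * (k + 1) + c.sum(axis=1, keepdims=True))
        self.post0 = (s + cp) / (s * (k + 1) + cp.sum(axis=1, keepdims=True))
        return self

    def predict(self, t):
        # Steps (15)-(19): MAP label vector y_t and ranking vector r_t for a test instance t.
        nn = self._neighbors(t, self.X)
        C_t = self.Y[nn].sum(axis=0)             # Eq.(7)
        q = self.Y.shape[1]
        score1 = self.prior1 * self.post1[np.arange(q), C_t]   # P(H^l_1) P(E^l_{C_t(l)} | H^l_1)
        score0 = self.prior0 * self.post0[np.arange(q), C_t]   # P(H^l_0) P(E^l_{C_t(l)} | H^l_0)
        y_t = (score1 > score0).astype(int)                    # Eqs.(8)/(9)
        r_t = score1 / (score1 + score0)                       # step (19): P(H^l_1 | E^l_{C_t(l)})
        return y_t, r_t

# Tiny synthetic usage example (random data, purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = (rng.random((50, 4)) < 0.3).astype(int)
model = MLkNN(k=7, s=1.0).fit(X, Y)
y_t, r_t = model.predict(rng.normal(size=10))
print(y_t, np.round(r_t, 3))
```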
IV. EXPERIMENTS

A real-world Yeast gene functional data set, which has been studied in the literature [6], [16], is used for experiments. Each gene is associated with a set of functional classes whose maximum size can potentially be more than 190. In order to make the problem easier, Elisseeff and Weston [6] preprocessed the data set so that only the known structure of the functional classes is used. Actually, the whole set of functional classes is structured into hierarchies up to 4 levels deep (see http://mips.gsf.de/proj/yeast/catalogues/funcat/ for more details). In this paper, as has been done in the literature [6], only functional classes in the top hierarchy are considered. For fair comparison, the same data set division used in the literature [6] is adopted. In detail, there are 1,500 genes in the training set and 917 in the test set. The input dimension is 103. There are 14 possible class labels, and the average number of labels for all genes in the training set is 4.2 ± 1.6.
Table I presents the performance of ML-kNN on the Yeast data when different values of k (number of neighbors) are considered. It can be found that the value of k does not significantly affect the classifier's Hamming Loss, while ML-kNN achieves its best performance on the other four ranking-based criteria with k = 7.

TABLE I
THE PERFORMANCE OF ML-kNN ON THE YEAST DATA WITH DIFFERENT VALUES OF k (NUMBER OF NEIGHBORS).

Evaluation Criterion    k=6     k=7     k=8     k=9
Hamming Loss            0.197   0.197   0.197   0.197
One-error               0.241   0.239   0.248   0.251
Coverage                6.374   6.302   6.357   6.424
Ranking Loss            0.170   0.168   0.171   0.173
Average Precision       0.758   0.761   0.756   0.755

Table II shows the experimental results on the Yeast data of several other multi-label learning algorithms introduced in Section 2. It is worth noting that a re-implemented version of Rank-SVM [6] is used in this paper, where polynomial kernels with degree 8 are chosen and the cost parameter C is set to 1. As for ADTBoost.MH [5], the number of boosting steps is set to 30, considering that the performance of the boosting algorithm rarely changes after 30 iterations. Besides, the results of BoosTexter [2] shown in Table II are those reported in the literature [6].

TABLE II
PERFORMANCE ON THE YEAST DATA FOR OTHER MULTI-LABEL LEARNING ALGORITHMS.

Evaluation Criterion    Rank-SVM    ADTBoost.MH    BoosTexter
Hamming Loss            0.196       0.213          0.237
One-error               0.225       0.245          0.302
Coverage                6.717       6.502          N/A
Ranking Loss            0.179       N/A            0.298
Average Precision       0.763       0.738          0.717

As shown in Table I and Table II, the performance of ML-kNN is comparable to that of Rank-SVM. Moreover, it is obvious that both algorithms perform significantly better than ADTBoost.MH and BoosTexter. One possible reason for the poor results of BoosTexter may be the simple decision function realized by this method [6].
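To make the evaluation in Table I and Table II concrete, the sketch below computes Hamming Loss from binary label matrices. It uses the standard definition (the fraction of misclassified instance-label pairs); the paper's own definition in Eq.(1) is not part of this excerpt, so equating the two is our assumption, and the toy values are illustrative.

```python
# Hamming Loss for multi-label predictions: fraction of instance-label pairs that are
# misclassified, i.e. the per-instance symmetric difference between predicted and true
# label sets, averaged and normalized by the number of labels. Standard definition,
# assumed to match the paper's Eq.(1).
import numpy as np

def hamming_loss(Y_true, Y_pred):
    # Y_true, Y_pred: (n, q) binary matrices of true and predicted label vectors.
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

# Toy example with 3 instances and 4 labels (illustrative values only).
Y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
Y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]
print(hamming_loss(Y_true, Y_pred))   # 2 mismatches out of 12 pairs -> about 0.167
```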
V. CONCLUSION

In this paper, the problem of designing a multi-label lazy learning approach is addressed, and a k-nearest neighbor based method for multi-label classification named ML-kNN is proposed. Experiments on a multi-label bioinformatic data set show that the proposed algorithm is highly competitive with other existing multi-label learners.

Nevertheless, the experimental results reported in this paper are rather preliminary. Thus, conducting more experiments on other multi-label data sets to fully evaluate the effectiveness of ML-kNN will be an important issue to be explored in the near future. On the other hand, adapting other traditional machine learning approaches such as neural networks to handle multi-label data will be another interesting issue to be investigated.

ACKNOWLEDGMENT

Many thanks to A. Elisseeff and J. Weston for providing the authors with the Yeast data and the implementation details of Rank-SVM. This work was supported by the National Natural Science Foundation of China under Grant No. 60473046.

REFERENCES

[1] A. McCallum, "Multi-label text classification with a mixture model trained by EM," in Working Notes of the AAAI'99 Workshop on Text Learning, Orlando, FL, 1999.
[2] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2/3, pp. 135-168, 2000.
[3] N. Ueda and K. Saito, "Parametric mixture models for multi-label text," in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, pp. 721-728.
[4] A. Clare and R. D. King, "Knowledge discovery in multi-label phenotype data," in Lecture Notes in Computer Science 2168, L. De Raedt and A. Siebes, Eds. Berlin: Springer, 2001, pp. 42-53.
[5] F. De Comite, R. Gilleron, and M. Tommasi, "Learning multi-label alternating decision trees from texts and data," in Lecture Notes in Computer Science 2734, P. Perner and A. Rosenfeld, Eds. Berlin: Springer, 2003, pp. 35-49.
[6] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 681-687.
[7] D. W. Aha, "Special AI review issue on lazy learning," Artificial Intelligence Review, vol. 11, 1997.
[8] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Lecture Notes in Computer Science 904, P. M. B. Vitanyi, Ed. Berlin: Springer, 1995, pp. 23-37.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society - B, vol. 39, no. 1, pp. 1-38, 1977.
[10] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann, 1993.
[11] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in Proc. of the 7th ACM International Conference on Information and Knowledge Management (CIKM'98), Bethesda, MD, 1998, pp. 148-155.
[12] Y. Freund and L. Mason, "The alternating decision tree learning algorithm," in Proc. of the 16th International Conference on Machine Learning (ICML'99), Bled, Slovenia, 1999, pp. 124-133.
[13] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," in Proc. of the 11th Annual Conference on Computational Learning Theory (COLT'98), New York, 1998, pp. 80-91.
[14] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.
[15] G. Salton, "Developments in automatic text retrieval," Science, vol. 253, pp. 974-980, 1991.
[16] P. Pavlidis, J. Weston, J. Cai, and W. N. Grundy, "Combining microarray expression data and phylogenetic profiles to learn functional categories using support vector machines," in Proceedings of the 5th Annual International Conference on Computational Biology, Montréal, Canada, 2001, pp. 242-248.