
A k-Nearest Neighbor Based Algorithm for Multi-label Classification
Min-Ling Zhang and Zhi-Hua Zhou
National Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210093, China
Email: {zhangml, zhouzh}@lamda.nju.edu.cn

Abstract- In multi-label learning, each instance in the training set is associated with a set of labels, and the task is to output a label set whose size is unknown a priori for each unseen instance. In this paper, a multi-label lazy learning approach named ML-kNN is presented, which is derived from the traditional k-nearest neighbor (kNN) algorithm. In detail, for each new instance, its k nearest neighbors are first identified. After that, according to the label sets of these neighboring instances, the maximum a posteriori (MAP) principle is utilized to determine the label set for the new instance. Experiments on a real-world multi-label bioinformatic data set show that ML-kNN is highly comparable to existing multi-label learning algorithms.

I. INTRODUCTION

Multi-label classification tasks are ubiquitous in real-world problems. For example, in text categorization, each document may belong to several predefined topics; in bioinformatics, one protein may have many effects on a cell when predicting its functional classes. In either case, instances in the training set are each associated with a set of labels, and the task is to output the label set for the unseen instance, whose set size is not available a priori.

Traditional two-class and multi-class problems can both be cast into multi-label ones by restricting each instance to have only one label. However, the generality of the multi-label problem makes it more difficult to learn. An intuitive approach to solving the multi-label problem is to decompose it into multiple independent binary classification problems (one per category). But this kind of method does not consider the correlations between the different labels of each instance. Fortunately, several approaches specially designed for multi-label classification have been proposed, such as multi-label text categorization algorithms [1], [2], [3], multi-label decision trees [4], [5] and the multi-label kernel method [6]. However, a multi-label lazy learning approach is still not available. In this paper, this problem is addressed by a multi-label classification algorithm named ML-kNN, i.e. Multi-Label k-Nearest Neighbor, which is derived from the popular k-nearest neighbor (kNN) algorithm [7]. ML-kNN first identifies the k nearest neighbors of the test instance, from which the label sets of the neighboring instances are obtained. After that, the maximum a posteriori (MAP) principle is employed to predict the set of labels of the test instance.

The rest of this paper is organized as follows. Section 2 reviews previous work on multi-label learning and summarizes the different evaluation criteria used in this area. Section 3 presents the ML-kNN algorithm. Section 4 reports experimental results on a real-world multi-label bioinformatic data set. Finally, Section 5 concludes and indicates several issues for future work.

II. MULTI-LABEL LEARNING

Research on multi-label learning was initially motivated by the difficulty of concept ambiguity encountered in text categorization, where each document may belong to several topics (labels) simultaneously. One famous approach to solving this problem is BoosTexter, proposed by Schapire and Singer [2], which is in fact extended from the popular ensemble learning method AdaBoost [8]. In the training phase, BoosTexter maintains a set of weights over both training examples and their labels, where training examples and their corresponding labels that are hard (easy) to predict correctly get incrementally higher (lower) weights. Following the work of BoosTexter, multi-label learning has attracted much attention from machine learning researchers.

In 1999, McCallum [1] proposed a Bayesian approach to multi-label document classification, where a mixture probabilistic model is assumed to generate each document and the EM algorithm [9] is utilized to learn the mixture weights and the word distributions in each mixture component. In 2001, through defining a special cost function based on Ranking Loss (as shown in Eq.(5)) and the corresponding margin for multi-label models, Elisseeff and Weston [6] proposed a kernel method for multi-label classification. In the same year, Clare and King [4] adapted the C4.5 decision tree [10] to handle multi-label data through modifying the definition of entropy. One year later, using the independent word-based Bag-of-Words representation [11], Ueda and Saito [3] presented two types of probabilistic generative models for multi-label text called parametric mixture models (PMM1, PMM2), where the basic assumption under the PMMs is that multi-label text has a mixture of the characteristic words appearing in single-label texts belonging to each of its categories. In the same year, Comite et al. [5] extended the alternating decision tree [12] to handle multi-label data, where the AdaBoost.MH algorithm proposed by Schapire and Singer [13] is employed to train the multi-label alternating decision tree. In 2004, Boutell et al. [14] applied multi-label learning techniques to scene classification. They decomposed the multi-label learning problem into multiple independent binary classification problems (one per category), where each example associated with label set Y is regarded as a positive example when building the
classifier for class y ∈ Y, and as a negative example when building the classifier for class y ∉ Y.

It is worth noting that in the multi-label learning paradigm, various evaluation criteria have been proposed to measure the performance of a multi-label learning system. Let X = R^d be the d-dimensional instance domain and let Y = {1, 2, ..., Q} be a set of labels or classes. Given a learning set S = <(x_1, Y_1), ..., (x_m, Y_m)> ∈ (X × 2^Y)^m drawn i.i.d. from an unknown distribution D, the goal of the learning system is to output a multi-label classifier h: X → 2^Y which optimizes some specific evaluation criterion. However, in most cases, the learning system will produce a ranking function of the form f: X × Y → R with the interpretation that, for a given instance x, the labels in Y should be ordered according to f(x, ·). That is, a label l_1 is considered to be ranked higher than l_2 if f(x, l_1) > f(x, l_2). If Y is the associated label set for x, then a successful learning system will tend to rank labels in Y higher than those not in Y. Note that the corresponding multi-label classifier h(·) can be conveniently derived from the ranking function f(·, ·):

    h(x) = \{ l \mid f(x, l) > t(x), \; l \in Y \}    (1)

where t(x) is the threshold function, which is usually set to be the zero constant function.

Based on the above notation, several evaluation criteria can be defined in multi-label learning as shown in [2]. Given a set of multi-label instances S = {(x_1, Y_1), ..., (x_m, Y_m)}, a learned ranking function f(·, ·) and the corresponding multi-label classifier h(·), the first evaluation criterion to be introduced is the so-called Hamming Loss, defined as:

    HL_S(h) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{Q} \, |h(x_i) \,\Delta\, Y_i|    (2)

where Δ stands for the symmetric difference between two sets. The smaller the value of HL_S(h), the better the classifier's performance. When |Y_i| = 1 for all instances, a multi-label system is in fact a multi-class single-label one, and the Hamming Loss is 2/Q times the usual classification error.

While the Hamming Loss is based on the multi-label classifier h(·), the following measurements are defined based on the ranking function f(·, ·). The first ranking-based measurement to be considered is One-error:

    One\text{-}err_S(f) = \frac{1}{m} \sum_{i=1}^{m} H(x_i), \quad \text{where } H(x_i) = 0 \text{ if } \arg\max_{k \in Y} f(x_i, k) \in Y_i, \text{ and } 1 \text{ otherwise}    (3)

The smaller the value of One-err_S(f), the better the performance. Note that, for single-label classification problems, the One-error is identical to the ordinary classification error.

The second ranking-based measurement to be introduced is Coverage, defined as:

    Coverage_S(f) = \frac{1}{m} \sum_{i=1}^{m} |C(x_i)| - 1, \quad \text{where } C(x_i) = \{ l \mid f(x_i, l) \ge f(x_i, l'), \; l \in Y \} \text{ and } l' = \arg\min_{k \in Y_i} f(x_i, k)    (4)

It measures how far we need, on average, to go down the list of labels in order to cover all the possible labels assigned to an instance. The smaller the value of Coverage_S(f), the better the performance.

Let \bar{Y}_i denote the complementary set of Y_i in Y. Another ranking-based measurement, named Ranking Loss, is defined as:

    RL_S(f) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i||\bar{Y}_i|} \, |R(x_i)|, \quad \text{where } R(x_i) = \{ (l_1, l_0) \mid f(x_i, l_1) \le f(x_i, l_0), \; (l_1, l_0) \in Y_i \times \bar{Y}_i \}    (5)

It represents the average fraction of label pairs that are not correctly ordered. The smaller the value of RL_S(f), the better the performance.

The fourth evaluation criterion for the ranking function is Average Precision, which was originally used in information retrieval (IR) systems to evaluate the document ranking performance for query retrieval [15]. Nevertheless, it is used here to measure the effectiveness of the label rankings:

    Ave\text{-}prec_S(f) = \frac{1}{m} \sum_{i=1}^{m} P(x_i), \quad \text{where } P(x_i) = \frac{1}{|Y_i|} \sum_{k \in Y_i} \frac{|\{ l \mid f(x_i, l) \ge f(x_i, k), \; l \in Y_i \}|}{|\{ l \mid f(x_i, l) \ge f(x_i, k), \; l \in Y \}|}    (6)

In words, this measurement evaluates the average fraction of labels ranked above a particular label l ∈ Y_i which actually are in Y_i. Note that when Ave-prec_S(f) = 1, the learning system achieves perfect performance; the bigger the value of Ave-prec_S(f), the better the performance.

III. ML-kNN

As reviewed in the above section, although there have been several learning algorithms specially designed for multi-label learning, developing a lazy learning approach for multi-label problems is still an unsolved issue. In this section, a novel k-nearest neighbor based method for multi-label classification named ML-kNN is presented. To begin, several notations are introduced in addition to those used in Section 2 to simplify the derivation of ML-kNN.

Given an instance x and its associated label set Y_x ⊆ Y, suppose k nearest neighbors are considered in the kNN method. Let y_x be the category vector for x, where its l-th component y_x(l) (l ∈ Y) takes the value of 1 if l ∈ Y_x and 0 otherwise. In addition, let N(x) denote the index set of the k nearest neighbors of x identified in the training set.
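As a brief aside on Section 2, the five evaluation criteria of Eqs.(2)-(6) are straightforward to compute from a 0/1 label matrix and a matrix of ranking scores. The sketch below is a minimal NumPy reading of those definitions written for this summary; the function names and the conventions (Y and H are m x Q binary matrices of true and predicted label sets, F is an m x Q matrix of scores f(x_i, l)) are my own, not code from the paper.

    import numpy as np

    def hamming_loss(Y, H):
        # Eq.(2): size of the symmetric difference, averaged over labels and instances.
        return float(np.mean(np.sum(Y != H, axis=1) / Y.shape[1]))

    def one_error(Y, F):
        # Eq.(3): fraction of instances whose top-ranked label is not a relevant one.
        top = np.argmax(F, axis=1)
        return float(np.mean(Y[np.arange(len(Y)), top] == 0))

    def coverage(Y, F):
        # Eq.(4): average number of steps down the ranking needed to cover all relevant labels.
        vals = []
        for y, f in zip(Y, F):
            rel = np.flatnonzero(y)
            worst = rel[np.argmin(f[rel])]          # l' = arg min_{k in Y_i} f(x_i, k)
            vals.append(np.sum(f >= f[worst]) - 1)  # |C(x_i)| - 1
        return float(np.mean(vals))

    def ranking_loss(Y, F):
        # Eq.(5): fraction of (relevant, irrelevant) label pairs that are wrongly ordered.
        vals = []
        for y, f in zip(Y, F):
            rel, irr = np.flatnonzero(y), np.flatnonzero(y == 0)
            wrong = (f[rel][:, None] <= f[irr][None, :]).sum()
            vals.append(wrong / (len(rel) * len(irr)))
        return float(np.mean(vals))

    def average_precision(Y, F):
        # Eq.(6): for each relevant label, the fraction of labels ranked at or above it
        # that are themselves relevant, averaged over relevant labels and instances.
        vals = []
        for y, f in zip(Y, F):
            rel = np.flatnonzero(y)
            prec = [np.sum((f >= f[k]) & (y == 1)) / np.sum(f >= f[k]) for k in rel]
            vals.append(np.mean(prec))
        return float(np.mean(vals))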

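For readers who prefer code, the following sketch anticipates the estimation and prediction steps that are derived in the rest of this section and summarized in Fig. 1 below. It is a simplified NumPy re-implementation written for this summary, not the authors' code; the class name MLkNN, the Euclidean-distance neighbor search, the leave-one-out treatment of each training instance in step (4), and the array layout (X is m x d, Y is an m x Q 0/1 matrix) are my assumptions.

    import numpy as np

    class MLkNN:
        """Sketch of ML-kNN: per-label kNN statistics plus MAP prediction."""

        def __init__(self, k=7, s=1.0):
            self.k = k      # number of nearest neighbors
            self.s = s      # smoothing parameter (s = 1 gives Laplace smoothing)

        def _neighbors(self, x, exclude=None):
            # Indices of the k nearest training instances under Euclidean distance.
            d = np.linalg.norm(self.X - x, axis=1)
            if exclude is not None:
                d[exclude] = np.inf     # leave the training instance itself out
            return np.argsort(d)[: self.k]

        def fit(self, X, Y):
            self.X, self.Y = X, Y
            m, Q = Y.shape
            k, s = self.k, self.s
            # Prior probabilities P(H_1^l), cf. steps (1)-(3) of Fig. 1.
            self.prior1 = (s + Y.sum(axis=0)) / (s * 2 + m)
            self.prior0 = 1.0 - self.prior1
            # Posterior probabilities P(E_j^l | H_b^l), cf. steps (4)-(14).
            c1 = np.zeros((Q, k + 1))   # c[j]:  label present, j neighbors carry the label
            c0 = np.zeros((Q, k + 1))   # c'[j]: label absent,  j neighbors carry the label
            for i in range(m):
                delta = Y[self._neighbors(X[i], exclude=i)].sum(axis=0)  # C_{x_i}(l) for all l
                for l in range(Q):
                    (c1 if Y[i, l] == 1 else c0)[l, int(delta[l])] += 1
            self.post1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
            self.post0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))
            return self

        def predict(self, t):
            # Steps (15)-(19): membership counts C_t(l), MAP label vector y_t, ranking vector r_t.
            C = self.Y[self._neighbors(t)].sum(axis=0).astype(int)
            Q = self.Y.shape[1]
            p1 = self.prior1 * self.post1[np.arange(Q), C]
            p0 = self.prior0 * self.post0[np.arange(Q), C]
            y_t = (p1 > p0).astype(int)
            r_t = p1 / (p1 + p0)        # posterior P(H_1^l | E_{C_t(l)}^l)
            return y_t, r_t

Under these assumptions, MLkNN(k=7, s=1.0).fit(X_train, Y_train).predict(t) returns a pair analogous to the two outputs of Fig. 1.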
Thus, based on the label sets of these neighbors, a membership counting vector can be defined as:

    C_x(l) = \sum_{a \in N(x)} y_{x_a}(l), \quad l \in Y    (7)

where C_x(l) counts how many neighbors of x belong to the l-th class.

For each test instance t, ML-kNN first identifies its k nearest neighbors N(t). Let H_1^l be the event that t has label l, and H_0^l be the event that t does not have label l. Furthermore, let E_j^l (j ∈ {0, ..., k}) denote the event that, among the k nearest neighbors of t, there are exactly j instances which have label l. Therefore, based on the membership counting vector C_t, the category vector y_t is determined using the following maximum a posteriori principle:

    y_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l \mid E_{C_t(l)}^l), \quad l \in Y    (8)

Using the Bayesian rule, Eq.(8) can be rewritten as:

    y_t(l) = \arg\max_{b \in \{0,1\}} \frac{P(H_b^l) \, P(E_{C_t(l)}^l \mid H_b^l)}{P(E_{C_t(l)}^l)} = \arg\max_{b \in \{0,1\}} P(H_b^l) \, P(E_{C_t(l)}^l \mid H_b^l)    (9)

Note that the prior probabilities P(H_b^l) (l ∈ Y, b ∈ {0, 1}) and the posterior probabilities P(E_j^l | H_b^l) (j ∈ {0, ..., k}) can all be estimated from the training set S.

    [y_t, r_t] = ML-kNN(S, k, t, s)

    %Computing the prior probabilities P(H_1^l)
    (1)  for l ∈ Y do
    (2)      P(H_1^l) = (s + Σ_{i=1}^{m} y_{x_i}(l)) / (s × 2 + m);
    (3)      P(H_0^l) = 1 - P(H_1^l);
    %Computing the posterior probabilities P(E_j^l | H_b^l)
    (4)  Identify N(x_i), i ∈ {1, ..., m};
    (5)  for l ∈ Y do
    (6)      for j ∈ {0, ..., k} do
    (7)          c[j] = 0; c'[j] = 0;
    (8)      for i ∈ {1, ..., m} do
    (9)          δ = C_{x_i}(l) = Σ_{a ∈ N(x_i)} y_{x_a}(l);
    (10)         if (y_{x_i}(l) == 1) then c[δ] = c[δ] + 1;
    (11)         else c'[δ] = c'[δ] + 1;
    (12)     for j ∈ {0, ..., k} do
    (13)         P(E_j^l | H_1^l) = (s + c[j]) / (s × (k+1) + Σ_{p=0}^{k} c[p]);
    (14)         P(E_j^l | H_0^l) = (s + c'[j]) / (s × (k+1) + Σ_{p=0}^{k} c'[p]);
    %Computing y_t and r_t
    (15) Identify N(t);
    (16) for l ∈ Y do
    (17)     C_t(l) = Σ_{a ∈ N(t)} y_{x_a}(l);
    (18)     y_t(l) = arg max_{b ∈ {0,1}} P(H_b^l) P(E_{C_t(l)}^l | H_b^l);
    (19)     r_t(l) = P(H_1^l | E_{C_t(l)}^l)
                    = P(H_1^l) P(E_{C_t(l)}^l | H_1^l) / P(E_{C_t(l)}^l)
                    = P(H_1^l) P(E_{C_t(l)}^l | H_1^l) / (Σ_{b ∈ {0,1}} P(H_b^l) P(E_{C_t(l)}^l | H_b^l));

    Fig. 1. Pseudo code of ML-kNN.

Figure 1 illustrates the complete description of ML-kNN. The meanings of the input arguments S, k, t and the output argument y_t are the same as described previously, while the input argument s is a smoothing parameter controlling the strength of the uniform prior (in this paper, s is set to 1, which yields Laplace smoothing). r_t is a real-valued vector calculated for ranking labels in Y, where r_t(l) corresponds to the posterior probability P(H_1^l | E_{C_t(l)}^l). As shown in Figure 1, based on the training instances, steps (1) to (3) estimate the prior probabilities P(H_1^l). Steps (4) to (14) estimate the posterior probabilities P(E_j^l | H_b^l), where c[j], used in each iteration over l, counts the number of training instances with label l whose k nearest neighbors contain exactly j instances with label l. Correspondingly, c'[j] counts how many training instances without label l have k nearest neighbors containing exactly j instances with label l. Finally, using the Bayesian rule, steps (15) to (19) compute the algorithm's outputs based on the estimated probabilities.

IV. EXPERIMENTS

A real-world Yeast gene functional data set, which has been studied in the literature [6], [16], is used for the experiments. Each gene is associated with a set of functional classes whose maximum size can potentially be more than 190. To make the task more tractable, Elisseeff and Weston [6] preprocessed the data set so that only the known structure of the functional classes is used. Actually, the whole set of functional classes is structured into hierarchies up to 4 levels deep (see http://mips.gsf.de/proj/yeast/catalogues/funcat/ for more details). In this paper, as has been done in the literature [6], only functional classes in the top hierarchy are considered. For a fair comparison, the same data set division used in the literature [6] is adopted. In detail, there are 1,500 genes in the training set and 917 in the test set. The input dimension is 103. There are 14 possible class labels, and the average number of labels for all genes in the training set is 4.2 ± 1.6.

Table I presents the performance of ML-kNN on the Yeast data when different values of k (number of neighbors) are considered. It can be found that the value of k doesn't
significantly affect the classifier's Hamming Loss, while ML-kNN achieves its best performance on the other four ranking-based criteria with k = 7.

TABLE I
THE PERFORMANCE OF ML-kNN ON THE YEAST DATA WITH DIFFERENT VALUES OF k (NUMBER OF NEIGHBORS)

    Evaluation Criterion    k=6      k=7      k=8      k=9
    Hamming Loss            0.197    0.197    0.197    0.197
    One-error               0.241    0.239    0.248    0.251
    Coverage                6.374    6.302    6.357    6.424
    Ranking Loss            0.170    0.168    0.171    0.173
    Average Precision       0.758    0.761    0.756    0.755

Table II shows the experimental results on the Yeast data of several other multi-label learning algorithms introduced in Section 2. It is worth noting that a re-implemented version of Rank-SVM [6] is used in this paper, where polynomial kernels with degree 8 are chosen and the cost parameter C is set to 1. As for ADTBoost.MH [5], the number of boosting steps is set to 30, considering that the performance of the boosting algorithm rarely changes after 30 iterations. Besides, the results of BoosTexter [2] shown in Table II are those reported in the literature [6].

TABLE II
PERFORMANCE ON THE YEAST DATA FOR OTHER MULTI-LABEL LEARNING ALGORITHMS

    Evaluation Criterion    Rank-SVM    ADTBoost.MH    BoosTexter
    Hamming Loss            0.196       0.213          0.237
    One-error               0.225       0.245          0.302
    Coverage                6.717       6.502          N/A
    Ranking Loss            0.179       N/A            0.298
    Average Precision       0.763       0.738          0.717

As shown in Table I and Table II, the performance of ML-kNN is comparable to that of Rank-SVM. Moreover, it is obvious that both algorithms perform significantly better than ADTBoost.MH and BoosTexter. One possible reason for the poor results of BoosTexter may be the simple decision function realized by this method [6].

V. CONCLUSION

In this paper, the problem of designing a multi-label lazy learning approach is addressed, and a k-nearest neighbor based method for multi-label classification named ML-kNN is proposed. Experiments on a multi-label bioinformatic data set show that the proposed algorithm is highly competitive with other existing multi-label learners.

Nevertheless, the experimental results reported in this paper are rather preliminary. Thus, conducting more experiments on other multi-label data sets to fully evaluate the effectiveness of ML-kNN will be an important issue to be explored in the near future. On the other hand, adapting other traditional machine learning approaches, such as neural networks, to handle multi-label data will be another interesting issue to be investigated.

ACKNOWLEDGMENT

Many thanks to A. Elisseeff and J. Weston for providing the authors with the Yeast data and the implementation details of Rank-SVM. This work was supported by the National Natural Science Foundation of China under Grant No. 60473046.

REFERENCES

[1] A. McCallum, "Multi-label text classification with a mixture model trained by EM," in Working Notes of the AAAI'99 Workshop on Text Learning, Orlando, FL, 1999.
[2] R. E. Schapire and Y. Singer, "BoosTexter: a boosting-based system for text categorization," Machine Learning, vol. 39, no. 2/3, pp. 135-168, 2000.
[3] N. Ueda and K. Saito, "Parametric mixture models for multi-label text," in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA: MIT Press, 2003, pp. 721-728.
[4] A. Clare and R. D. King, "Knowledge discovery in multi-label phenotype data," in Lecture Notes in Computer Science 2168, L. De Raedt and A. Siebes, Eds. Berlin: Springer, 2001, pp. 42-53.
[5] F. De Comite, R. Gilleron, and M. Tommasi, "Learning multi-label alternating decision trees from texts and data," in Lecture Notes in Computer Science 2734, P. Perner and A. Rosenfeld, Eds. Berlin: Springer, 2003, pp. 35-49.
[6] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA: MIT Press, 2002, pp. 681-687.
[7] D. W. Aha, "Special AI review issue on lazy learning," Artificial Intelligence Review, vol. 11, 1997.
[8] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Lecture Notes in Computer Science 904, P. M. B. Vitanyi, Ed. Berlin: Springer, 1995, pp. 23-37.
[9] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[10] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[11] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, "Inductive learning algorithms and representations for text categorization," in Proc. of the 7th ACM International Conference on Information and Knowledge Management (CIKM'98), Bethesda, MD, 1998, pp. 148-155.
[12] Y. Freund and L. Mason, "The alternating decision tree learning algorithm," in Proc. of the 16th International Conference on Machine Learning (ICML'99), Bled, Slovenia, 1999, pp. 124-133.
[13] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," in Proc. of the 11th Annual Conference on Computational Learning Theory (COLT'98), New York, 1998, pp. 80-91.
[14] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, "Learning multi-label scene classification," Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.
[15] G. Salton, "Developments in automatic text retrieval," Science, vol. 253, pp. 974-980, 1991.
[16] P. Pavlidis, J. Weston, J. Cai, and W. N. Grundy, "Combining microarray expression data and phylogenetic profiles to learn functional categories using support vector machines," in Proceedings of the 5th Annual International Conference on Computational Biology, Montréal, Canada, 2001, pp. 242-248.

