K-Nearest Neighbour Classifiers
1 Introduction
There is a large range of possibilities for this distance metric; a basic version for continuous and discrete attributes would be:

\[
\delta(q_f, x_{if}) =
\begin{cases}
0 & f \text{ discrete and } q_f = x_{if} \\
1 & f \text{ discrete and } q_f \neq x_{if} \\
|q_f - x_{if}| & f \text{ continuous}
\end{cases}
\tag{2}
\]
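As a concrete illustration, here is a minimal Python sketch of this per-feature distance; aggregating with an unweighted sum over features is a simplifying assumption made here (the general form can weight each feature).

```python
def feature_distance(q_f, x_f, is_discrete):
    """Per-feature distance delta(q_f, x_f) from Eq. 2."""
    if is_discrete:
        return 0.0 if q_f == x_f else 1.0
    return abs(q_f - x_f)

def distance(q, x, discrete_mask):
    """Distance between query q and example x: an unweighted sum of the
    per-feature distances (a simplifying assumption for illustration)."""
    return sum(feature_distance(qf, xf, d)
               for qf, xf, d in zip(q, x, discrete_mask))

# Example: two features, the first discrete and the second continuous.
print(distance(["red", 2.0], ["blue", 3.5], [True, False]))  # 1.0 + 1.5 = 2.5
```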
The k nearest neighbours are selected based on this distance metric. Then
there are a variety of ways in which the k nearest neighbours can be used to
determine the class of q. The most straightforward approach is to assign the
majority class among the nearest neighbours to the query.
It will often make sense to assign more weight to the nearer neighbours in
deciding the class of the query. A fairly general technique to achieve this is
distance weighted voting where the neighbours get to vote on the class of the
query case with votes weighted by the inverse of their distance to the query.
\[
Vote(y_j) = \sum_{c=1}^{k} \frac{1}{d(q, x_c)^n}\, 1(y_j, y_c)
\tag{3}
\]
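To make Eq. 3 concrete, the sketch below implements distance-weighted voting over the k nearest neighbours; the indicator $1(y_j, y_c)$ is realised by accumulating votes per class, and the Euclidean distance, toy data and small eps guard against zero distances are illustrative choices.

```python
from collections import defaultdict
import math

def knn_classify(query, training_set, dist, k=3, n=1, eps=1e-9):
    """Distance-weighted k-NN vote (Eq. 3): each of the k nearest neighbours
    votes for its class with weight 1 / d(q, x_c)^n."""
    # Sort the training data by distance to the query and keep the k nearest.
    neighbours = sorted(training_set, key=lambda item: dist(query, item[0]))[:k]
    votes = defaultdict(float)
    for x_c, y_c in neighbours:
        votes[y_c] += 1.0 / (dist(query, x_c) ** n + eps)  # eps guards against d = 0
    return max(votes, key=votes.get)

# Toy usage with a Euclidean distance on 2-D points.
euclid = lambda a, b: math.dist(a, b)
data = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_classify((0.2, 0.1), data, euclid, k=3))  # "A"
```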
In this paper we consider three important issues that arise with the use of
k-NN classifiers. In the next section we look at the core issue of similarity and
distance measures and explore some exotic (dis)similarity measures to illustrate
the generality of the k-NN idea. In section 3 we look at computational complex-
ity issues and review some speed-up techniques for k-NN. In section 4 we look
at dimension reduction – both feature selection and sample selection. Dimension
reduction is of particular importance with k-NN as it has a big impact on com-
putational performance and accuracy. The paper concludes with a summary of
the advantages and disadvantages of k-NN.
attribute only the $MD_p(A, B)$ is 4 for all $p$, whereas $MD_p(A, C)$ is 6, 4.24 and 3.78 for $p$ values of 1, 2 and 3 respectively. So C becomes the nearer neighbour to A for $p$ values of 3 and greater.
The other important Minkowski distance is the $L_\infty$ or Chebyshev distance. This is simply the distance in the dimension in which the two examples are most different; it is sometimes referred to as the chessboard distance, as it corresponds to the number of moves a chess king must make to travel between two squares on the board.
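To see where the figures above come from, the sketch below computes Minkowski and Chebyshev distances for hypothetical points chosen to reproduce them: B differs from A by 4 in a single attribute, and C differs from A by 3 in each of two attributes.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance MD_p(x, y) = (sum_f |x_f - y_f|^p)^(1/p)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

def chebyshev(x, y):
    """L-infinity (Chebyshev) distance: the largest per-dimension difference."""
    return float(np.max(np.abs(np.asarray(x) - np.asarray(y))))

# Hypothetical points reproducing the figures above: B differs from A by 4 in
# one attribute, C differs from A by 3 in each of two attributes.
A, B, C = np.array([0, 0]), np.array([4, 0]), np.array([3, 3])
for p in (1, 2, 3):
    print(p, minkowski(A, B, p), round(minkowski(A, C, p), 2))
# p=1: 4.0, 6.0   p=2: 4.0, 4.24   p=3: 4.0, 3.78  -> C is nearer for p >= 3
print(chebyshev(A, B), chebyshev(A, C))  # 4.0, 3.0
```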
In the remainder of this section we will review a selection of other metrics and distance measures that are important in multimedia analysis.
The Minkowski distance defined in (5) is a very general metric that can be
used in a k-NN classifier for any data that is represented as a feature vector.
When working with image data a convenient representation for the purpose of
calculating distances is a colour histogram. An image can be considered as a
grey-scale histogram $H$ of $N$ levels or bins, where $h_i$ is the number of pixels that fall into the interval represented by bin $i$ (this vector $h$ is the feature vector). The Minkowski distance formula (5) can be used to compare two images described as histograms; the $L_1$, $L_2$ and, less often, $L_\infty$ norms are used.
Other popular measures for comparing histograms are the Kullback-Leibler
divergence (6) [12] and the χ2 statistic (7) [23].
\[
d_{KL}(H, K) = \sum_{i=1}^{N} h_i \log \frac{h_i}{k_i}
\tag{6}
\]

\[
d_{\chi^2}(H, K) = \sum_{i=1}^{N} \frac{(h_i - m_i)^2}{h_i}
\tag{7}
\]

where $H$ and $K$ are two histograms, $h$ and $k$ are the corresponding vectors of bin values and $m_i = \frac{h_i + k_i}{2}$.
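A minimal sketch of both histogram measures, following the formulas above; the small eps terms and the toy four-bin histograms are illustrative assumptions used to avoid division by zero and the log of zero.

```python
import numpy as np

def kl_divergence(h, k, eps=1e-12):
    """Kullback-Leibler divergence (Eq. 6) between two histograms h and k."""
    h = np.asarray(h, dtype=float) + eps
    k = np.asarray(k, dtype=float) + eps
    return float(np.sum(h * np.log(h / k)))

def chi_squared(h, k, eps=1e-12):
    """Chi-squared statistic (Eq. 7) with m_i = (h_i + k_i) / 2."""
    h = np.asarray(h, dtype=float)
    k = np.asarray(k, dtype=float)
    m = (h + k) / 2.0
    return float(np.sum((h - m) ** 2 / (h + eps)))

# Two toy 4-bin grey-level histograms (normalised pixel counts).
H, K = [0.1, 0.4, 0.4, 0.1], [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(H, K), chi_squared(H, K))
```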
Earth Mover Distance The Earth Mover Distance (EMD) is a distance mea-
sure that overcomes many of the problems that arise from the arbitrariness of
binning. As the name implies, the distance is based on the notion of the amount
of effort required to convert one image to another based on the analogy of trans-
porting mass from one distribution to another. If we think of two images as
distributions and view one distribution as a mass of earth in space and the other
distribution as a hole (or set of holes) in the same space then the EMD is the
minimum amount of work involved in filling the holes with the earth.
In their analysis of the EMD Rubner et al. argue that a measure based on
the notion of a signature is better than one based on a histogram. A signature $\{s_j = (m_j, w_{m_j})\}$ is a set of $j$ clusters, where $m_j$ is a vector describing the mode of cluster $j$ and $w_{m_j}$ is the fraction of pixels falling into that cluster. Thus a
signature is a generalisation of the notion of a histogram where boundaries and
the number of partitions are not set in advance; instead j should be ‘appropriate’
to the complexity of the image [23].
The example in Figure 2 illustrates this idea. We can think of the clustering as
a quantization of the image in some colour space so that the image is represented
by a set of cluster modes and their weights. In the figure the source image is
represented in a 2D space as two points of weights 0.6 and 0.4; the target image
is represented by three points with weights 0.5, 0.3 and 0.2. In this example
the EMD is calculated to be the sum of the amounts moved (0.2, 0.2, 0.1 and
0.5) multiplied by the distances they are moved. Calculating the EMD involves
discovering an assignment that minimizes this amount.
Fig. 2. An example of the EMD between two 2D signatures with two points (clusters)
in one signature and three in the other (based on example in [22]).
For two images described by signatures $S = \{(m_j, w_{m_j})\}_{j=1}^{n}$ and $Q = \{(p_k, w_{p_k})\}_{k=1}^{r}$ we are interested in the work required to transfer from one to the other for a given flow pattern $F$:

\[
WORK(S, Q, F) = \sum_{j=1}^{n} \sum_{k=1}^{r} d_{jk} f_{jk}
\tag{8}
\]
where $d_{jk}$ is the distance between clusters $m_j$ and $p_k$ and $f_{jk}$ is the flow between $m_j$ and $p_k$ that minimises overall cost. An example of this in a 2D colour space is shown in Figure 2. Once the transportation problem of identifying the flow that minimises effort is solved (using dynamic programming) the EMD is defined to be:
\[
EMD(S, Q) = \frac{\sum_{j=1}^{n} \sum_{k=1}^{r} d_{jk} f_{jk}}{\sum_{j=1}^{n} \sum_{k=1}^{r} f_{jk}}
\tag{9}
\]
Efficient algorithms for the EMD are described in [23]; however, this measure is expensive to compute, with cost increasing more than linearly with the number of clusters. Nevertheless, it is an effective measure for capturing similarity between images.
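The minimisation underlying (8) and (9) is a standard transportation problem, so a small instance can be solved directly with a linear-programming routine. The sketch below uses scipy.optimize.linprog; the 2D cluster positions are hypothetical, the weights follow the Figure 2 example, and the weights are assumed to sum to one on each side.

```python
import numpy as np
from scipy.optimize import linprog

def emd(sources, s_weights, targets, t_weights):
    """Earth Mover's Distance between two signatures (Eq. 9), solved as a
    transportation problem: minimise sum_jk d_jk * f_jk subject to the flow
    out of each source cluster equalling its weight and the flow into each
    target cluster equalling its weight (weights assumed to sum to 1)."""
    n, r = len(sources), len(targets)
    # Ground distances d_jk between cluster modes (Euclidean here).
    d = np.linalg.norm(sources[:, None, :] - targets[None, :, :], axis=2).ravel()
    # Equality constraints: row sums = source weights, column sums = target weights.
    a_eq = np.zeros((n + r, n * r))
    for j in range(n):
        a_eq[j, j * r:(j + 1) * r] = 1.0
    for k in range(r):
        a_eq[n + k, k::r] = 1.0
    b_eq = np.concatenate([s_weights, t_weights])
    res = linprog(d, A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    flow = res.x
    return float(d @ flow / flow.sum())

# Hypothetical 2-D cluster modes; the weights follow the Figure 2 example.
S = np.array([[0.0, 0.0], [3.0, 1.0]]); ws = np.array([0.6, 0.4])
Q = np.array([[1.0, 0.0], [2.0, 2.0], [4.0, 1.0]]); wq = np.array([0.5, 0.3, 0.2])
print(emd(S, ws, Q, wq))
```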
\[
d_{Kv}(x, y) = \frac{Kv(x|y) + Kv(y|x)}{Kv(xy)}
\tag{10}
\]
where Kv(x|y) is the length of the shortest program that computes x when y
is given as an auxiliary input to the program and Kv(xy) is the length of the
shortest program that outputs y concatenated to x. While this is an abstract idea, it can be approximated using compression:
\[
d_{C}(x, y) = \frac{C(x|y) + C(y|x)}{C(xy)}
\tag{11}
\]
where $C(x)$ is the size of data $x$ after compression, and $C(x|y)$ is the size of $x$ after compressing it with the compression model built for $y$. If we assume that $Kv(x|y) \approx Kv(xy) - Kv(y)$ then we can define a normalised compression distance.
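Equation (11) requires conditional compression, which standard compressors do not expose directly. A widely used alternative is the Normalised Compression Distance of Li et al. [15], $(C(xy) - \min(C(x), C(y)))/\max(C(x), C(y))$, which needs only plain compressed sizes. A minimal sketch using zlib follows; the choice of compressor and the toy strings are illustrative assumptions.

```python
import zlib

def c(data: bytes) -> int:
    """C(x): the compressed size of x in bytes."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised Compression Distance of Li et al. [15]:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy usage: texts that share content compress well together, so their NCD is lower.
x1 = b"the quick brown fox jumps over the lazy dog " * 20
x2 = b"the quick brown fox jumps over the lazy cat " * 20
x3 = b"completely unrelated text about histograms and signatures " * 20
print(ncd(x1, x2), ncd(x1, x3))  # the first value should be smaller
```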
3 Computational Complexity
Computationally expensive metrics such as the Earth Mover's Distance and compression-based (dis)similarity metrics focus attention on the computational issues associated with k-NN classifiers. Basic k-NN classifiers that use a simple Minkowski distance will have a time behaviour that is $O(|D||F|)$, where $D$ is the training set and $F$ is the set of features that describe the data, i.e. the distance metric is linear in the number of features and the comparison process increases linearly with the amount of data. The computational complexity of the EMD and compression metrics is more difficult to characterise, but a k-NN classifier that incorporates an EMD metric is likely to be $O(|D|\,n^3 \log n)$, where $n$ is the number of clusters [23].
For these reasons there has been considerable research on editing down the
training data and on reducing the number of features used to describe the data
(see section 4). There has also been considerable research on alternatives to the
exhaustive search strategy that is used in the standard k-NN algorithm. Here is
a summary of four of the strategies for speeding up nearest-neighbour retrieval:
– Case-Retrieval Nets: k-NN retrieval is widely used in Case-Based Reason-
ing and Case-Retrieval Nets (CRNs) are perhaps the most popular technique
for speeding up the retrieval process. The cases are pre-processed to form a network structure that is used at retrieval time, and retrieval is then performed by spreading activation through this network.
CRNs can be configured to return exactly the same cases as k-NN [14, 13].
– Footprint-Based Retrieval: As with all strategies for speeding up nearest-
neighbour retrieval, Footprint-Based Retrieval involves a preprocessing stage
to organise the training data into a two level hierarchy on which a two stage
retrieval process operates. The preprocessing constructs a competence model
which identifies ‘footprint’ cases which are landmark cases in the data. This
process is not guaranteed to retrieve the same cases as k-NN but the results
of the evaluation of speed-up and retrieval quality are nevertheless impressive
[27].
– Fish & Shrink: This technique requires the distance to be a true metric as
it exploits the triangle inequality property to produce an organisation of the
case-base into candidate neighbours and cases excluded from consideration.
Cases that are remote from the query can be bounded out so that they need
not be considered in the retrieval process. Fish & Shrink can be guaranteed
to be equivalent to k-NN [24].
– Cover Trees for Nearest Neighbor: This technique might be considered
the state-of-the-art in nearest-neighbour speed-up. It uses a data-structure
called a Cover Tree to organise the cases for efficient retrieval. The use of Cover Trees requires that the distance measure is a true metric; however, they have attractive characteristics in terms of space requirements and speed-up. (A related tree-based index is sketched below.)
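None of the four techniques above ships as a routine library call, but the same pre-processing idea, building a metric-exploiting index over the cases so that retrieval avoids an exhaustive scan, is readily available. A sketch using scikit-learn's BallTree follows; the random data and parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 8))      # training data: 10,000 cases, 8 features
queries = rng.random((5, 8))

# Pre-processing stage: build a tree index over the cases.
tree = BallTree(X, metric="euclidean")

# Retrieval stage: exact k nearest neighbours without an exhaustive scan.
dist, idx = tree.query(queries, k=3)
print(idx[0], dist[0])  # indices and distances of the 3 nearest cases to query 0
```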
4 Dimension Reduction
Given the high-dimensional nature of the data, Dimension Reduction is a core research topic in the processing of multimedia data. Research on Dimension Reduction has itself two dimensions; the dimensions of a dataset of $|D|$ examples described by $|F|$ features can be reduced by selecting a subset of the examples or by selecting a subset of the features (an alternative to this is to transform the data into a representation with fewer features). Dimension reduction as achieved
by supervised feature selection is described in section 4.1. Unsupervised feature
transformation is described elsewhere in this book in the chapter on Unsuper-
vised Learning and Clustering. The other aspect of dimension reduction is the
deletion of redundant or noisy instances in the training data – this is reviewed
in section 4.2.
When the objective is to reduce the number of features used to describe data there are two strategies that can be employed. Techniques such as Principal Components Analysis (PCA) may be employed to transform the data into a lower-dimensional representation. Alternatively, feature selection may be employed to discard some of the features. In using k-NN with high-dimensional data there are several reasons why it is useful to perform feature selection:
– For many distance measures, the retrieval time increases directly with the
number of features (see section 3).
– Noisy or irrelevant features can have the same influence on retrieval as pre-
dictive features so they will impact negatively on accuracy.
– Things look more similar on average the more features used to describe them
(see Figure 3).
Fig. 3. The more dimensions used to describe objects the more similar on average
things appear. This figure shows the cosine similarity between objects described by 5
and by 20 features. It is clear that in 20 dimensions similarity has a lower variance
than in 5.
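The effect shown in Figure 3 is easy to reproduce. The sketch below compares the spread of cosine similarities between randomly generated feature vectors in 5 and in 20 dimensions; uniform random features are an assumption made purely for illustration.

```python
import numpy as np

def cosine_similarity_spread(n_features, n_pairs=10_000, seed=0):
    """Mean and standard deviation of cosine similarity between random vectors."""
    rng = np.random.default_rng(seed)
    a = rng.random((n_pairs, n_features))
    b = rng.random((n_pairs, n_features))
    sims = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return sims.mean(), sims.std()

for dims in (5, 20):
    mean, std = cosine_similarity_spread(dims)
    print(f"{dims:>2} features: mean similarity {mean:.3f}, std {std:.3f}")
# The 20-dimensional similarities cluster more tightly around their mean.
```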
– Filter approaches attempt to remove irrelevant features from the feature set
prior to the application of the learning algorithm. Initially, the data is anal-
ysed to identify those dimensions that are most relevant for describing its
structure. The chosen feature subset is subsequently used to train the learning algorithm. Feedback regarding an algorithm's performance is not required during the selection process, though it may be useful when attempting to gauge the effectiveness of the filter.
– Wrapper methods for feature selection make use of the learning algorithm
itself to choose a set of relevant features. The wrapper conducts a search
through the feature space, evaluating candidate feature subsets by estimating
the predictive accuracy of the classifier built on that subset. The goal of the
search is to find the subset that maximises this criterion.
Filter Techniques Central to the Filter strategy for feature selection is the criterion used to score the predictiveness of the features. In recent years Information Gain (IG) has become perhaps the most popular criterion for feature
selection. The Information Gain of a feature is a measure of the amount of infor-
mation that a feature brings to the training set [19]. It is defined as the expected
reduction in entropy caused by partitioning the training set D using the feature
f as shown in Equation 13 where Dv is that subset of the training set D where
feature f has value v.
\[
IG(D, f) = Entropy(D) - \sum_{v \in values(f)} \frac{|D_v|}{|D|} Entropy(D_v)
\tag{13}
\]
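A minimal sketch of the IG computation in Equation 13 for a single discrete feature; the toy word-occurrence data is purely illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Information Gain (Eq. 13): the reduction in entropy from partitioning
    the training set by the values of one feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy usage: a binary word-occurrence feature against spam/non-spam labels.
word_present = [1, 1, 1, 0, 0, 0, 1, 0]
label        = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
print(information_gain(word_present, label))
```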
\[
OR(f, c) = \frac{Odds(f|c)}{Odds(f|\bar{c})}
\tag{15}
\]
Where a specific feature does not occur in a class, it can be assigned a small
fixed value so that the OR can still be calculated. For feature selection, the
features can be ranked according to their OR with high values indicating features
that are very predictive of the class. The same can be done for the non-class to
highlight features that are predictive of the non-class.
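A minimal sketch of the OR score for a binary (word-occurrence) feature; the smoothing constant used when a feature never occurs in a class is an illustrative choice.

```python
def odds_ratio(feature_present, labels, target_class, floor=1e-3):
    """Odds Ratio (Eq. 15) for a binary feature: Odds(f | c) / Odds(f | not c).
    'floor' is a small fixed value used when the feature never occurs in a class."""
    def odds(class_filter):
        docs = [f for f, y in zip(feature_present, labels) if class_filter(y)]
        p = sum(docs) / len(docs)
        p = min(max(p, floor), 1 - floor)   # keep the odds finite
        return p / (1 - p)
    return odds(lambda y: y == target_class) / odds(lambda y: y != target_class)

word_present = [1, 1, 1, 0, 0, 0, 1, 0]
label        = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
print(odds_ratio(word_present, label, "spam"))
```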
We can look at the impact of these feature selection criteria in an email
spam classification task. In this experiment we selected the $n$ features with the highest IG value and $n/2$ features each from the $OR(f, spam)$ and $OR(f, nonspam)$ sets. The results, displayed in Figure 4, show that IG performed significantly
better than OR. The reason for this is that OR is inclined to select features that
occur rarely but are very strong indicators of the class. This means that some
objects (emails) are described by no features and thus have no similarity to any
cases in the case base. In this experiment this occurs in 8.8% of cases with OR
compared with 0.2% for the IG technique.
This shows a simple but effective strategy for feature selection in very high
dimension data. IG can be used to rank features, then a cross validation process
can be employed to identify the number of features above which classification
accuracy is not improved. This evaluation suggests that the top 350 features as
ranked by IG are adequate.
While this is an effective strategy for feature selection it has the drawback that features are considered in isolation, so redundancies or dependencies are ignored. Two strongly correlated features may both have high IG scores but one
may be redundant once the other is selected. More sophisticated Filter techniques
that address these issues using Mutual Information to score groups of features
have been researched by Novovičová et al. [18] and have been shown to be more
effective than these simple Filter techniques.
Fig. 4. Comparing Information Gain with Odds Ratio: classification accuracy plotted against the number of features (50 to 500) for 1-NN, 3-NN and 5-NN classifiers using IG and OR feature selection. Results are the average of three 10-fold cross validation experiments on a dataset of 1000 emails, 500 spam and 500 legitimate, where word features only were used.
– Forward Selection which starts with no features selected, evaluates all the
options with just one feature, selects the best of these and considers the
options with that feature plus one other, etc.
– Backward Elimination starts with all features selected, considers the op-
tions with one feature deleted, selects the best of these and continues to
eliminate features.
These strategies terminate when adding (or deleting) a feature does not produce an improvement in classification accuracy as assessed by cross validation.
Both of these are greedy search strategies and so are not guaranteed to discover
the best feature subset. More sophisticated search strategies can be employed
to better explore the search space; however, Reunanen [20] cautions that more
intensive search strategies are more likely to overfit the training data.
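A sketch of Forward Selection as a wrapper around a k-NN classifier, with cross-validated accuracy as the criterion; scikit-learn's KNeighborsClassifier, cross_val_score and the iris data are used purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, k=3, cv=5):
    """Greedy forward selection: repeatedly add the single feature that most
    improves cross-validated k-NN accuracy; stop when no feature helps."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scored = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X[:, cols], y, cv=cv).mean()
            scored.append((acc, f))
        acc, f = max(scored)
        if acc <= best_score:        # terminate: no improvement from adding a feature
            break
        best_score, selected = acc, selected + [f]
        remaining.remove(f)
    return selected, best_score

X, y = load_iris(return_X_y=True)
print(forward_selection(X, y))
```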
Noise reduction, on the other hand, aims to remove noisy or corrupt cases, but it can also remove exceptional or border cases which may not be distinguishable from true noise, so a balance of both approaches can be useful.
set and the latter using incrementing values of k. These techniques focus on
noisy or exceptional cases and do not result in the same storage reduction gains
as the competence preservation approaches.
Later editing techniques can be classified as hybrid techniques incorporating both competence preservation and competence enhancement stages. [1] presented a series of instance-based learning algorithms to reduce storage requirements and tolerate noisy instances. IB2 is similar to CNN, adding only cases that cannot be classified correctly by the reduced training set. IB2's susceptibility to noise is handled by IB3, which records how well cases classify and only keeps those that classify correctly to a statistically significant degree. Other researchers have provided variations on the IBn algorithms [4, 5, 31].
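A sketch of IB2-style condensing as described above; the use of 1-NN with Euclidean distance and seeding the edited set with the first case are simplifying assumptions.

```python
import math

def ib2_edit(cases):
    """IB2-style condensing: keep a case only if the cases retained so far
    would misclassify it (a simplified sketch using 1-NN and Euclidean distance)."""
    edited = [cases[0]]                      # seed with the first case
    for x, y in cases[1:]:
        nearest = min(edited, key=lambda c: math.dist(c[0], x))
        if nearest[1] != y:                  # current edited set gets it wrong
            edited.append((x, y))
    return edited

data = [((0.0, 0.0), "A"), ((0.1, 0.1), "A"), ((0.2, 0.0), "A"),
        ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((0.55, 0.5), "A")]
print(len(ib2_edit(data)), "cases retained of", len(data))
```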
(i) an ordering policy for the presentation of the cases that is based on the
competence characteristics of the cases,
(ii) an addition rule to determine the cases to be added to the edited set,
(iii) a deletion rule to determine the cases to be removed from the training set
and
(iv) an update policy which indicates whether the competence model is updated
after each editing step.
The different combinations of ordering policy, addition rule, deletion rule and
update policy produce the family of algorithms.
Brighton and Mellish [3] also use the coverage and reachability properties
of cases in their Iterative Case Filtering (ICF) algorithm. ICF is a decremental
strategy contracting the training set by removing those cases c where the number of other cases that can correctly classify c is higher than the number of cases that c can correctly classify. This strategy focuses on removing cases far from class
borders. After each pass over the training set, the competence model is updated
and the process repeated until no more cases can be removed. ICF includes a
pre-processing noise reduction stage, effectively RENN, to remove noisy cases.
McKenna and Smyth compared their family of algorithms to ICF and concluded
that the overall best algorithm of the family delivered improved accuracy (albeit
marginal, 0.22%) with less than 50% of the cases needed by the ICF edited set
[16].
Wilson and Martinez [29] present a series of Reduction Technique (RT) algo-
rithms, RT1, RT2 and RT3 which, although published before the definitions of
coverage and reachability, could also be considered to use a competence model.
They define the set of associates of a case c which is comparable to the coverage
set of McKenna and Smyth except that the associates set will include cases of
a different class from case c whereas the coverage set will only include cases of
the same class as c. The RTn algorithms use a decremental strategy. RT1, the
basic algorithm, removes a case c if at least as many of its associates would still
be classified correctly without c. This algorithm focuses on removing noisy cases
and cases at the centre of clusters of cases of the same class as their associates
which will most probably still be classified correctly without them. RT2 fixes
the order of presentation of cases as those furthest from their nearest unlike
neighbour (i.e. nearest case of a different class) to remove cases furthest from
the class borders first. RT2 also uses the original set of associates when making
the deletion decision, which effectively means that the associates' competence model is not rebuilt after each editing step, as it is in RT1. RT3 adds a
noise reduction pre-processing pass based on Wilson’s noise reduction algorithm.
Wilson and Martinez [29] concluded from their evaluation of the RTn algo-
rithms against IB3 that RT3 had a higher average generalization accuracy and
lower storage requirements overall but that certain datasets seem well suited to
the techniques while others were unsuited. [3] evaluated their ICF against RT3
and found that neither algorithm consistently outperformed the other and that both represented the “cutting edge in instance set reduction techniques”.
– Because all the work is done at run-time, k-NN can have poor run-time
performance if the training set is large.
– k-NN is very sensitive to irrelevant or redundant features because all features
contribute to the similarity (see Eq. 1) and thus to the classification. This
can be ameliorated by careful feature selection or feature weighting.
– On very difficult classification tasks, k-NN may be outperformed by more
exotic techniques such as Support Vector Machines or Neural Networks.
References
15. M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi. The similarity metric. IEEE
Transactions on Information Theory, 50(12):3250–3264, 2004.
16. E. McKenna and B. Smyth. Competence-guided editing methods for lazy learning.
In W. Horn, editor, ECAI 2000, Proceedings of the 14th European Conference on
Artificial Intelligence, pages 60–64. IOS Press, 2000.
17. D. Mladenic. Feature subset selection in text-learning. In C. Nedellec and C. Rou-
veirol, editors, ECML, volume 1398 of Lecture Notes in Computer Science, pages
95–100. Springer, 1998.
18. J. Novovičová, A. Malı́k, and P. Pudil. Feature selection using improved mutual
information for text classification. In A. L. N. Fred, T. Caelli, R. P. W. Duin, A. C.
Campilho, and D. de Ridder, editors, SSPR/SPR, volume 3138 of Lecture Notes
in Computer Science, pages 1010–1017. Springer, 2004.
19. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
20. J. Reunanen. Overfitting in making comparisons between variable selection meth-
ods. Journal of Machine Learning Research, 3:1371–1382, 2003.
21. G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm
for a selective nearest neighbor decision rule. IEEE Transactions on Information
Theory, 21(6):665–669, 1975.
22. Y. Rubner, L. J. Guibas, and C. Tomasi. The earth mover’s distance, multi-
dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA
Image Understanding Workshop, pages 661–668, 1997.
23. Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric
for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
24. J. W. Schaaf. Fish and Shrink: a next step towards efficient case retrieval in large-scale case bases. In I. Smith and B. Faltings, editors, European Conference on Case-Based Reasoning (EWCBR'96), pages 362–376. Springer, 1996.
25. R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237:1317–1323, 1987.
26. B. Smyth and M. Keane. Remembering to forget: A competence preserving case deletion policy for CBR systems. In C. Mellish, editor, Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI 1995), pages 337–382. Morgan Kaufmann, 1995.
27. B. Smyth and E. McKenna. Footprint-based retrieval. In Klaus-Dieter Althoff,
Ralph Bergmann, and Karl Branting, editors, ICCBR, volume 1650 of Lecture
Notes in Computer Science, pages 343–357. Springer, 1999.
28. I. Tomek. An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 6(6):448–452, 1976.
29. D. Wilson and T. Martinez. Instance pruning techniques. In ICML ’97: Proceedings
of the Fourteenth International Conference on Machine Learning, pages 403–411.
Morgan Kaufmann Publishers Inc., 1997.
30. D. L. Wilson. Asymptotic properties of nearest neighbor rules using edited data.
IEEE Transactions on Systems, Man and Cybernetics, 2(3):408–421, 1972.
31. J. Zhang. Selecting typical instances in instance-based learning. In Proceedings of
the 9th International Conference on Machine Learning (ICML 92), pages 470–479.
Morgan Kaufmann Publishers Inc., 1992.
32. J. Zhu and Q. Yang. Remembering to add: competence preserving case-addition policies for case-base maintenance. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI 99), pages 234–239. Morgan Kaufmann Publishers Inc., 1999.