k Nearest Neighbors
(KNN) Classifier
Dr. Arvind Selwal
Central University of Jammu, J&K
Supervised learning
and classification
Given: dataset of instances with
known categories
Goal: using the “knowledge” in the dataset, classify a given instance, i.e., predict the category of the given instance that is rationally consistent with the dataset
Classifiers
[Diagram: feature values X1, X2, X3, …, Xn are fed into a Classifier, which outputs the category Y; the classifier is built from DB, a collection of instances with known categories.]
k NEAREST NEIGHBOR
Requires 3 things:
Feature space (training data)
Distance metric
• to compute the distance between records
The value of k
• the number of nearest neighbors to retrieve, from which to take the majority class
To classify an unknown record:
Compute distance to other
training records
Identify k nearest neighbors
Use class labels of the nearest
neighbors to determine the
class label of the unknown record
k NEAREST NEIGHBOR
For an unknown point (marked “?” in the example plot of square and triangle classes):
k = 1: belongs to the square class
k = 3: belongs to the triangle class
k = 7: belongs to the square class
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, the neighborhood may include points from other classes
Choose an odd value of k to eliminate ties (in two-class problems)
K - Nearest Neighbors
For a given instance T, get the top k
dataset instances that are “nearest” to T
Select a reasonable distance measure
Inspect the category of these k
instances and choose the category C
represented by the most instances
Conclude that T belongs to category C
K - Nearest Neighbors
Algorithm
Input: training data set, test data set, value of 'k'
Steps:
Do for all test data points
i. Calculate the distance of the test data point from each of the training data points.
ii. Sort all the training data points in ascending order of the distance computed in step i.
iii. Choose the top k items from the sorted list of step ii.
if k = 1
then assign the class label of that single training data point to the test data point
else
assign the class label to the test data point by majority voting among the k neighbors
End do
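A minimal Python sketch of this algorithm, assuming numeric feature vectors and Euclidean distance; the function names, variable names, and toy data below are illustrative, not part of the original slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Square root of the sum of squared feature differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, test_point, k):
    # i. distance of the test point from every training point
    distances = [(euclidean(test_point, x), label)
                 for x, label in zip(train_X, train_y)]
    # ii. sort in ascending order of distance
    distances.sort(key=lambda pair: pair[0])
    # iii. keep the top k items
    top_k = distances[:k]
    if k == 1:
        return top_k[0][1]  # label of the single nearest neighbor
    # majority voting among the k nearest neighbors
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Toy usage: two "square" and two "triangle" training points
train_X = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (4.8, 5.3)]
train_y = ["square", "square", "triangle", "triangle"]
print(knn_classify(train_X, train_y, (1.1, 1.0), k=3))  # -> square
```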
Example 1
Determining decision on scholarship
application based on the following features:
Household income (annual income in
millions of pesos)
Number of siblings in family
High school grade (on a QPI scale of 1.0 –
4.0)
Intuition (reflected in the data set): award scholarships to high performers and to those with financial need
Distance formula
Euclidean distance: the square root of the sum of the squares of the differences
for two features: √((Δx)² + (Δy)²)
Intuition: similar samples should be
close to each other
May not always apply
(example: quota and actual sales)
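As a quick worked example of the formula, two instances with feature values (3, 4) and (0, 0) lie at distance √((3−0)² + (4−0)²) = √25 = 5.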
Example revisited
Suppose household income was
instead indicated in thousands of
pesos per month and that grades are
given on a 70-100 scale
Note the different results produced by
the kNN algorithm on the same dataset
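Because kNN is distance-based, a feature measured on a much larger numeric scale (such as income in thousands of pesos) dominates the computed distances and can flip the classification. A minimal sketch of min-max rescaling to [0, 1], one common remedy; the function and the income values below are illustrative, not from the slides:

```python
def min_max_scale(column):
    # Rescale a list of numeric values to the [0, 1] range
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

incomes = [120, 300, 80, 450]  # hypothetical monthly incomes (thousands of pesos)
print(min_max_scale(incomes))  # values now on the same [0, 1] scale as other rescaled features
```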
Non-numeric data
Feature values are not always
numbers
Example
Boolean values: Yes or no, presence
or absence of an attribute
Categories: Colors, educational
attainment, gender
How do these values factor into the
computation of distance?
Dealing with non-numeric
data
Boolean values => convert to 0 or 1
Applies to yes-no/presence-absence
attributes
Non-binary characterizations
Use natural progression when applicable;
e.g., educational attainment: GS, HS,
College, MS, PHD => 1,2,3,4,5
Assign arbitrary numbers but be careful
about distances; e.g., color: red, yellow, blue
=> 1, 2, 3 (see the encoding sketch after this list)
What about unavailable (missing) data?
(a 0 value is not always the answer)
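A minimal sketch of the boolean and categorical encodings described above; the mappings and the sample record are illustrative, and handling of missing values is left open, as the slide notes:

```python
# Boolean attribute: presence/absence or yes/no -> 1/0
def encode_boolean(value):
    return 1 if value in ("yes", "present", True) else 0

# Ordinal attribute with a natural progression (educational attainment)
EDUCATION = {"GS": 1, "HS": 2, "College": 3, "MS": 4, "PhD": 5}

# Nominal attribute with no natural order: arbitrary codes make the
# resulting distances arbitrary too, so interpret them with care
COLOR = {"red": 1, "yellow": 2, "blue": 3}

record = {"employed": "yes", "education": "College", "color": "blue"}
encoded = [encode_boolean(record["employed"]),
           EDUCATION[record["education"]],
           COLOR[record["color"]]]
print(encoded)  # [1, 3, 3]
```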
k-NN variations
Value of k
Larger k increases confidence in prediction
Note that if k is too large, decision may be
skewed
Weighted evaluation of nearest neighbors
Plain majority may unfairly skew decision
Revise the algorithm so that closer neighbors
have greater “vote weight” (see the sketch after this list)
Other distance measures
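A minimal sketch of distance-weighted voting, assuming the common 1/distance weighting scheme; the slides do not fix a particular weighting function, and the names and data below are illustrative:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest neighbors
    weights = defaultdict(float)
    for dist, label in neighbors:
        # Closer neighbors cast a larger vote; the small constant avoids
        # division by zero when a neighbor coincides with the query point
        weights[label] += 1.0 / (dist + 1e-9)
    return max(weights, key=weights.get)

# One very close "award" neighbor outweighs two distant "deny" neighbors
print(weighted_knn_vote([(0.5, "award"), (2.0, "deny"), (2.5, "deny")]))  # -> award
```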
k-NN Time Complexity
Suppose there are m instances and n
features in the dataset
Nearest neighbor algorithm requires
computing m distances
Each distance computation involves
scanning through each feature value
Running time complexity is therefore proportional to m × n
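For example, classifying a single test instance against, say, m = 10,000 training instances with n = 20 features requires scanning 10,000 × 20 = 200,000 feature values.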
k NEAREST NEIGHBOR
ADVANTAGES
Simple technique that is easily implemented
Building the model is inexpensive
Extremely flexible classification scheme
does not involve preprocessing
Well suited for
Multi-modal classes (classes of multiple forms)
Records with multiple class labels
Asymptotic error rate is at most twice the Bayes rate (Cover & Hart, 1967)
Can sometimes be the best method
Michihiro Kuramochi and George Karypis, Gene Classification using
Expression Profiles: A Feasibility Study, International Journal on
Artificial Intelligence Tools, Vol. 14, No. 4, pp. 641-660, 2005
K nearest neighbor outperformed SVM for protein function prediction
using expression profiles
k NEAREST NEIGHBOR
DISADVANTAGES
Classifying unknown records is relatively
expensive
Requires computing the distance to the training
records to find the k nearest neighbors
Computationally intensive, especially when the size
of the training set grows
Accuracy can be severely degraded by the
presence of noisy or irrelevant features
NN classification expects the class-conditional
probability to be locally constant, an assumption
that is biased in high dimensions