
k Nearest Neighbors (KNN) Classifier

Dr. Arvind Selwal
Central University of Jammu, J&K
Supervised learning and classification
• Given: a dataset of instances with known categories
• Goal: using the “knowledge” in the dataset, classify a given instance
  - i.e., predict the category of the given instance that is rationally consistent with the dataset
Classifiers

[Figure: feature values X1, X2, X3, …, Xn are fed into a Classifier, which outputs a category Y; the classifier is built from a DB, a collection of instances with known categories.]
k NEAREST NEIGHBOR
• Requires three things:
  - Feature space (the training data)
  - A distance metric, to compute the distance between records
  - The value of k, the number of nearest neighbors to retrieve, from which the majority class is taken
• To classify an unknown record:
  - Compute its distance to the training records
  - Identify the k nearest neighbors
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record
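As an illustration of these three ingredients (training data, a distance metric, and the value of k), the following is a minimal sketch using scikit-learn. The library, the toy feature arrays, and the labels are assumptions for illustration only and are not part of the original slides.

# Minimal k-NN sketch with scikit-learn (assumed dependency; toy data is hypothetical).
from sklearn.neighbors import KNeighborsClassifier

# Training data: each row is a record of feature values, with a known class label.
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ["square", "square", "triangle", "triangle"]

# The three ingredients: the training data, a distance metric, and the value of k.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# Classify an unknown record by majority vote among its 3 nearest neighbors.
print(knn.predict([[1.2, 1.9]]))  # prints ['square'] for this toy data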
k NEAREST NEIGHBOR

[Figure: an unknown point “?” among square and triangle points in the feature space.]
• k = 1: the unknown point belongs to the square class
• k = 3: the unknown point belongs to the triangle class
• k = 7: the unknown point belongs to the square class
• Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes
  - Choose an odd value for k to eliminate ties
K-Nearest Neighbors
• For a given instance T, get the top k dataset instances that are “nearest” to T
  - Select a reasonable distance measure
• Inspect the categories of these k instances and choose the category C that represents the most instances
• Conclude that T belongs to category C
K-Nearest Neighbors Algorithm
Input: training data set, test data set, value of ‘k’
Steps:
Do for all test data points
  i.   Calculate the distance of the test data point from each of the training data points.
  ii.  Sort all the training data points in ascending order of the distance computed in step i.
  iii. Choose the top k items from the sorted list of step ii.
       If k = 1, assign the class label of that training data point to the test data point.
       Else, assign a class to the test data point by majority voting over the class labels of the k nearest neighbors.
End do
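The steps above translate almost directly into code. Below is a minimal from-scratch sketch in Python (standard library only); the function and variable names, and the toy data, are my own and not part of the original slides.

import math
from collections import Counter

def euclidean(a, b):
    # Square root of the sum of squared feature differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, test_point, k):
    # Step i: distance of the test point from every training point.
    distances = [(euclidean(test_point, x), y) for x, y in zip(train_X, train_y)]
    # Step ii: sort training points by ascending distance.
    distances.sort(key=lambda pair: pair[0])
    # Step iii: take the top k items.
    top_k = distances[:k]
    if k == 1:
        return top_k[0][1]  # class label of the single nearest neighbor
    labels = [label for _, label in top_k]
    return Counter(labels).most_common(1)[0][0]  # majority vote

# Hypothetical toy data for illustration only.
train_X = [[1, 1], [2, 1], [8, 9], [9, 8]]
train_y = ["A", "A", "B", "B"]
print(knn_classify(train_X, train_y, [1.5, 1.2], k=3))  # -> "A"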
Example 1
• Determining a decision on a scholarship application based on the following features:
  - Household income (annual income in millions of pesos)
  - Number of siblings in the family
  - High school grade (on a QPI scale of 1.0 – 4.0)
• Intuition (reflected in the data set): award scholarships to high performers and to those with financial need
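To make this concrete, here is a tiny invented dataset; the specific values and decisions are hypothetical, chosen only to match the three features named above, and the classification call reuses the knn_classify sketch from the algorithm section.

# Hypothetical scholarship records: (annual income in millions of pesos,
# number of siblings, high-school QPI grade) -> decision
applicants = [
    ((0.3, 4, 3.6), "award"),     # low income, strong grades
    ((0.4, 3, 2.1), "no award"),
    ((2.5, 1, 3.9), "award"),     # high performer
    ((3.0, 0, 2.4), "no award"),
]

# A new applicant is classified by its k nearest neighbors in this 3-feature space.
train_X = [list(features) for features, _ in applicants]
train_y = [decision for _, decision in applicants]
new_applicant = [0.5, 2, 3.4]
print(knn_classify(train_X, train_y, new_applicant, k=3))  # -> "award" for this toy data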
Distance formula
• Euclidean distance: the square root of the sum of squares of the differences between feature values
  - For two features: d = √((Δx)² + (Δy)²)
• Intuition: similar samples should be close to each other
  - May not always apply (example: quota and actual sales)
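As a worked instance of the two-feature case (the sample points are invented for illustration):

import math

# Two hypothetical samples described by two features each.
a = (3.0, 4.0)
b = (0.0, 0.0)

# Euclidean distance: square root of the sum of squared feature differences.
d = math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)
print(d)  # 5.0, since sqrt(3^2 + 4^2) = 5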
Example revisited
• Suppose household income was instead indicated in thousands of pesos per month, and that grades are given on a 70–100 scale
• Note the different results produced by the kNN algorithm on the same dataset
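Because features on larger numeric scales dominate the Euclidean distance, a common remedy (standard practice, not stated on the slide) is to rescale every feature to a comparable range, for example min–max normalization:

def min_max_scale(column):
    # Rescale a list of values to the [0, 1] range so that no single
    # feature dominates the distance purely because of its units.
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical grades on a 70-100 scale become comparable to other features.
print(min_max_scale([70, 85, 100]))  # [0.0, 0.5, 1.0]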
Non-numeric data
• Feature values are not always numbers
• Examples:
  - Boolean values: yes or no, presence or absence of an attribute
  - Categories: colors, educational attainment, gender
• How do these values factor into the computation of distance?
Dealing with non-numeric data
• Boolean values => convert to 0 or 1
  - Applies to yes-no / presence-absence attributes
• Non-binary characterizations (see the encoding sketch after this list)
  - Use a natural progression when applicable; e.g., educational attainment: GS, HS, College, MS, PhD => 1, 2, 3, 4, 5
  - Assign arbitrary numbers, but be careful about distances; e.g., color: red, yellow, blue => 1, 2, 3
• How about unavailable data? (a 0 value is not always the answer)
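A minimal sketch of these encodings; the mappings follow the slide, while the code structure and the one-hot alternative for unordered categories are my own additions.

# Boolean attribute: yes/no -> 1/0
has_attribute = {"yes": 1, "no": 0}

# Ordinal attribute with a natural progression: educational attainment
education = {"GS": 1, "HS": 2, "College": 3, "MS": 4, "PhD": 5}

# Nominal attribute with no natural order: arbitrary codes distort distances
# (red=1 vs blue=3 looks "farther apart" than red=1 vs yellow=2 for no real reason),
# so one-hot encoding is often a safer alternative.
color_onehot = {"red": (1, 0, 0), "yellow": (0, 1, 0), "blue": (0, 0, 1)}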
k-NN variations
• Value of k
  - A larger k increases confidence in the prediction
  - Note that if k is too large, the decision may be skewed
• Weighted evaluation of nearest neighbors
  - Plain majority may unfairly skew the decision
  - Revise the algorithm so that closer neighbors have greater “vote weight” (see the sketch below)
• Other distance measures
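One common way to weight the vote, used here purely as an illustrative choice (the slide does not prescribe a particular scheme), is inverse-distance weighting, where each neighbor contributes 1/d to its class:

from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest neighbors.
    # Each neighbor votes with weight 1/distance, so closer points count more.
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + 1e-9)  # small epsilon avoids division by zero
    return max(scores, key=scores.get)

print(weighted_vote([(0.5, "A"), (2.0, "B"), (2.5, "B")]))  # "A" wins despite being outvoted 2-1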
k-NN Time Complexity
• Suppose there are m instances and n features in the dataset
• The nearest neighbor algorithm requires computing m distances for each query
• Each distance computation involves scanning through all n feature values
• The running time per query is therefore proportional to m × n
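The m × n cost is visible directly in the distance loop; in the sketch below (my own illustration, standard library only), the outer loop runs m times and the inner loop n times for a single query:

def all_distances(train_X, test_point):
    # Outer loop over m training records, inner loop over n features:
    # total work proportional to m * n per query.
    dists = []
    for record in train_X:                              # m iterations
        s = 0.0
        for f_train, f_test in zip(record, test_point): # n iterations
            s += (f_train - f_test) ** 2
        dists.append(s ** 0.5)
    return dists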
k NEAREST NEIGHBOR ADVANTAGES
• Simple technique that is easily implemented
• Building the model is inexpensive
• Extremely flexible classification scheme
  - Does not involve preprocessing
• Well suited for
  - Multi-modal classes (classes of multiple forms)
  - Records with multiple class labels
• Asymptotic error rate is at most twice the Bayes rate
  - Cover & Hart paper (1967)
• Can sometimes be the best method
  - Michihiro Kuramochi and George Karypis, “Gene Classification using Expression Profiles: A Feasibility Study”, International Journal on Artificial Intelligence Tools, Vol. 14, No. 4, pp. 641-660, 2005
  - k nearest neighbor outperformed SVM for protein function prediction using expression profiles
k NEAREST NEIGHBOR DISADVANTAGES
• Classifying unknown records is relatively expensive
  - Requires distance computations to find the k nearest neighbors
  - Computationally intensive, especially as the size of the training set grows
• Accuracy can be severely degraded by the presence of noisy or irrelevant features
• NN classification expects the class-conditional probability to be locally constant
  - Bias of high dimensions
