Lecture 12


Data Science
CSE-4075
(K-Nearest Neighbor)

"If you want to annoy your neighbors, tell the truth about them."
Pietro Aretino
Different Learning Methods
• Eager Learning
  – Learns an explicit description of the target function over the whole training set
• Instance-based Learning
  – Learning = storing all training instances
  – Classification = assigning a target function value to a new instance
  – Referred to as "lazy" learning
Eager Learning
(Illustration) "Any random movement => it's a mouse."
"I saw a mouse!"

Instance-based Learning
(Illustration) "It's very similar to a desktop!"
Instance-based Learning
• Approximates real-valued or discrete-valued target functions
• Learning consists simply of storing the presented training data
• When a new query instance is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance
• A disadvantage of instance-based methods is that the cost of classifying new instances can be high
• Nearly all computation takes place at classification time rather than at learning time
K-Nearest Neighbor Algorithm
• Most basic instance-based method
• Data are represented in a vector space
• Supervised learning
WHY NEAREST NEIGHBOR?
• Used to classify objects based on the closest training examples in the feature space
  – Feature space: raw data transformed into sample vectors of fixed length using feature extraction (the training data)
• One of the top 10 data mining algorithms
  – ICDM paper – December 2007
• Among the simplest of all data mining algorithms
  – A classification method
• An implementation of a lazy learner
  – All computation is deferred until classification time
K NEAREST NEIGHBOR
• Requires 3 things:
  – Feature space (training data)
  – Distance metric
    • to compute the distance between records
  – The value of k
    • the number of nearest neighbors to retrieve, from which to take the majority class
• To classify an unknown record:
  – Compute its distance to all training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (a minimal sketch follows below)
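Taken together, these three steps amount to only a few lines of code. A minimal sketch in Python, assuming NumPy arrays, Euclidean distance, and majority voting (the function and variable names are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training records."""
    # 1. Compute the distance from the query to every training record (Euclidean).
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # 3. Use the neighbors' class labels to determine the query's label (majority vote).
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```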
Feature Space
• Training data: { (x^(1), f(x^(1))), (x^(2), f(x^(2))), ..., (x^(n), f(x^(n))) }
• Each instance is a vector x = (x_1, x_2, ..., x_d) in R^d
• Euclidean distance between two instances x and y:
  ||x - y|| = sqrt( sum_{i=1}^{d} (x_i - y_i)^2 )
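As a quick numerical check of this formula, the hand-computed distance agrees with NumPy's built-in norm (a small sketch; the two sample vectors reuse two of the training instances from the worked example later in this lecture):

```python
import numpy as np

x = np.array([7.0, 7.0])   # training instance (I) from the worked example
y = np.array([3.0, 4.0])   # training instance (III) from the worked example

# Euclidean distance: sqrt(sum_i (x_i - y_i)^2)
d_manual = np.sqrt(np.sum((x - y) ** 2))
d_numpy = np.linalg.norm(x - y)
assert np.isclose(d_manual, d_numpy)   # both evaluate to 5.0 here
```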


K NEAREST NEIGHBOR
(Illustration: a query point "?" surrounded by square-class and triangle-class training points)
• k = 1: the query belongs to the square class
• k = 3: the query belongs to the triangle class
• k = 7: the query belongs to the square class

• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
  – Choose an odd value for k to eliminate ties
(Source: ICDM, Top Ten Data Mining Algorithms, k-nearest neighbor classification, December 2006)
How to Determine a Good Value for k
• Determined experimentally
• Start with k = 1 and use a test set to validate the error rate of the classifier
• Repeat with k = k + 2
• Choose the value of k for which the error rate is minimum (a sketch of this search follows below)
• Note: k should be an odd number to avoid ties
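A sketch of this search in Python, assuming a held-out test set and the knn_classify helper sketched earlier (the candidate range k_max and all names are illustrative):

```python
import numpy as np

def choose_k(X_train, y_train, X_test, y_test, k_max=15):
    """Try odd values of k and return the one with the lowest test error rate."""
    best_k, best_error = None, float("inf")
    for k in range(1, k_max + 1, 2):          # k = 1, 3, 5, ... (odd to avoid ties)
        predictions = [knn_classify(X_train, y_train, x, k) for x in X_test]
        error = np.mean(np.array(predictions) != np.array(y_test))
        if error < best_error:                # keep the k with the minimum error rate
            best_k, best_error = k, error
    return best_k
```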


Training Data
  (I) (7, 7), False    (II) (7, 4), False    (III) (3, 4), True    (IV) (1, 4), True

Testing Instance: ?

Parameters:
  Distance metric = Euclidean distance
  Number of nearest neighbors = K = 3

Neighbor ranking (by Euclidean distance to the testing instance):

  Instance        Neighbor Closeness   Neighbor Class
  (I)   (7, 7)    N = 3                False
  (II)  (7, 4)    N = 4                False
  (III) (3, 4)    N = 1                True
  (IV)  (1, 4)    N = 2                True

Decision: For K = 3, True = 2 > False = 1. Hence, the testing instance is classified as True.
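The slide's example can be reproduced with the knn_classify sketch from above (a minimal sketch; the slide does not state the coordinates of the testing instance, so the query point (3, 7) below is a hypothetical placeholder that happens to give the same neighbor ranking):

```python
import numpy as np

# Training data from the slide: instances (I)-(IV)
X_train = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
y_train = np.array([False, False, True, True])

# Hypothetical query point; the slide does not give the real testing instance.
x_query = np.array([3.0, 7.0])

label = knn_classify(X_train, y_train, x_query, k=3)
print(label)  # majority vote among the 3 nearest neighbors -> True
```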
When to Consider Nearest Neighbors
• Instances map to points in R^d
• Fewer than 20 features (attributes) per instance, typically normalized
• Lots of training data

Advantages:
• Training is very fast
• Can learn complex target functions
• Does not lose information

Disadvantages:
• Slow at query time
  – Presorting and indexing training samples into search trees reduces query time (see the sketch below)
• Easily fooled by irrelevant features (attributes)
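One common way to do that presorting in practice is a KD-tree. The sketch below uses scikit-learn's KNeighborsClassifier with a KD-tree index (an assumption: scikit-learn is not part of the lecture, it is just one widely used implementation; the training data are the four instances from the worked example and the query point is hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[7, 7], [7, 4], [3, 4], [1, 4]], dtype=float)
y_train = np.array([0, 0, 1, 1])  # 0 = False, 1 = True

# Index the training samples in a KD-tree so queries avoid a full linear scan.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree", metric="euclidean")
clf.fit(X_train, y_train)

print(clf.predict([[3.0, 7.0]]))  # hypothetical query point -> class 1 (True)
```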
