Lecture 17 - KNN


Supervised Learning – Classification


K-Nearest Neighbor Algorithm
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that
have the k smallest distances to x
Basic Idea

 The k-NN classification rule assigns to a test sample the majority
category label of its k nearest training samples (a minimal sketch is
given below)
 In practice, k is usually chosen to be odd, so as to avoid ties
 The k = 1 rule is generally called the nearest-neighbor
classification rule
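
To make the voting rule concrete, below is a minimal brute-force sketch in Python; the function name knn_predict and the toy data are illustrative, not part of the lecture.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote over its k nearest training samples."""
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority label among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "A"
```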
Nearest-Neighbor Classifiers: Issues

– The value of k, the number of nearest neighbors to retrieve
– Choice of distance metric to compute the distance between records
– Computational complexity
– Size of the training set
– Dimension of the data
Value of K
 Choosing the value of k:
 If k is too small, the classifier is sensitive to noise points
 If k is too large, the neighborhood may include points from
other classes

Rule of thumb (a worked example follows below):
k = sqrt(N), where N is the number of training points
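
As a worked example of this rule of thumb (combined with the earlier advice to keep k odd), here is a small hypothetical helper; the name rule_of_thumb_k is made up for illustration.

```python
import math

def rule_of_thumb_k(n_train):
    """Pick k near sqrt(N), nudged to the nearest odd value to avoid ties."""
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

print(rule_of_thumb_k(100))  # sqrt(100) = 10 -> 11
```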
Distance Metrics
Distance Measure: Scale Effects

 Different features may have different measurement scales
 E.g., patient weight in kg (range [50, 200]) vs. blood protein
values in ng/dL (range [-3, 3])
 Consequences
 Patient weight will have a much greater influence on the
distance between samples
 May bias the performance of the classifier
Standardization

 Transform raw feature values into z-scores


z_ij = (x_ij - m_j) / s_j

 x_ij is the value of the jth feature for the ith sample
 m_j is the average of x_ij over all samples, for feature j
 s_j is the standard deviation of x_ij over all samples, for feature j
 Range and scale of z-scores should be similar (provided the
distributions of the raw feature values are alike)
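
A short sketch of this standardization in Python/NumPy; the function name standardize and the sample data are illustrative.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: subtract each feature's mean, divide by its std."""
    m = X.mean(axis=0)          # m_j: per-feature mean
    s = X.std(axis=0)           # s_j: per-feature standard deviation
    s[s == 0] = 1.0             # guard against constant features
    return (X - m) / s

# Weight in kg and a protein value on a much smaller scale
X = np.array([[60.0, 0.5], [90.0, -1.2], [150.0, 2.0]])
Z = standardize(X)
print(Z.std(axis=0))  # each column now has unit standard deviation
```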
Additional Material
Voronoi Diagram

Properties:
1) All possible points within a sample's Voronoi cell are the nearest
neighboring points for that sample
2) For any sample, the nearest sample is determined by the closest
Voronoi cell edge
Distance-weighted k-NN

Replace the unweighted vote

f̂(q) = argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(x_i))

with a vote weighted by the inverse squared distance to the query x_q:

f̂(q) = argmax_{v ∈ V} Σ_{i=1..k} [1 / d(x_i, x_q)²] · δ(v, f(x_i))

where δ(a, b) = 1 if a = b and 0 otherwise (a sketch follows below).

General kernel functions, such as Parzen windows, may be used
instead of the inverse distance.
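
A minimal sketch of this inverse-squared-distance vote in Python; the names are illustrative, and an exact match is returned directly to avoid dividing by zero.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3):
    """Distance-weighted k-NN: each neighbor votes with weight 1 / d^2."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        if dists[i] == 0.0:              # exact match: return its label directly
            return y_train[i]
        w = 1.0 / (dists[i] ** 2)        # inverse squared distance weight
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)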
Distance for Heterogeneous Data

Wilson, D. R. and Martinez, T. R., "Improved Heterogeneous Distance Functions,"
Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1-34, 1997.
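
As a rough illustration of the kind of heterogeneous distance discussed in that paper, here is a simplified HEOM-style sketch: nominal attributes contribute an overlap term (0 if equal, 1 otherwise), numeric attributes a range-normalized difference. The function name, arguments, and data are assumptions for illustration, not the paper's exact definition.

```python
import math

def heom_distance(x, y, is_nominal, ranges):
    """Simplified HEOM-style distance over mixed nominal/numeric attributes.

    is_nominal[j] marks nominal attributes; ranges[j] is (max - min) of
    numeric attribute j over the training data.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if is_nominal[j]:
            d = 0.0 if a == b else 1.0                        # overlap metric for nominal values
        else:
            d = abs(a - b) / ranges[j] if ranges[j] else 0.0  # range-normalized numeric difference
        total += d * d
    return math.sqrt(total)

# Example: (weight in kg, blood type)
print(heom_distance((70.0, "A"), (90.0, "B"), is_nominal=(False, True), ranges=(150.0, None)))
```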
Nearest Neighbor: Computational Complexity
 Expensive
 To determine the nearest neighbor of a query point q, we must
compute the distance to all N training examples
+ Pre-sort training examples into fast data structures such as kd-trees
(see the sketch after this list)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
 Storage Requirements
 Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
 High-Dimensional Data
 "Curse of Dimensionality"
 Required amount of training data increases exponentially with dimension
 Computational cost also increases dramatically
 Partitioning techniques degrade to linear search in high dimensions
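
As a sketch of the kd-tree mitigation, the example below uses SciPy's cKDTree to pre-sort the training data and answer queries without scanning all N points; it assumes SciPy is installed, and the data are random and purely illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 3))      # 10,000 training points in 3-D
tree = cKDTree(X_train)               # pre-sort once into a kd-tree

query = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(query, k=3)   # k nearest neighbors via tree search
print(idx, dists)
```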
KNN: Alternate Terminologies

 Instance-Based Learning
 Lazy Learning
 Case-Based Reasoning
 Exemplar-Based Learning
Discussions
 kNN can deal with complex and arbitrary decision boundaries.
 Despite its simplicity, researchers have shown that the
classification accuracy of kNN can be quite strong and, in many
cases, as accurate as that of more elaborate methods.
 kNN is slow at classification time.
 kNN does not produce an understandable model.
Summary
 Applications of supervised learning arise in almost any field
or domain.
 We studied 4 classification techniques.
 There are still many other methods, e.g.,
 Bayesian networks
 Neural networks
 Genetic algorithms
 Fuzzy classification
 This large number of methods also shows the importance of
classification and its wide applicability.
 It remains an active research area.
