K Nearest Neighbor Classification
Nearest Neighbor Classifiers
 Basic idea:
 If it walks like a duck and quacks like a duck, then it's probably a duck

[Figure: given a test record, compute its distance to the training records, then choose the k "nearest" records]
Basic Idea

 The k-NN classification rule is to assign to a test sample the majority category label of its k nearest training samples
 In practice, k is usually chosen to be odd, so as to avoid ties
 The k = 1 rule is generally called the nearest-neighbor classification rule
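
A minimal sketch of this rule in Python (assuming Euclidean distance and plain lists of feature vectors; names such as knn_classify are illustrative, not from the slides):

from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(test_point, training_points, labels, k=3):
    # Rank all training samples by distance to the test sample
    order = sorted(
        range(len(training_points)),
        key=lambda i: euclidean(test_point, training_points[i]),
    )
    # Majority vote among the k nearest training samples
    k_nearest_labels = [labels[i] for i in order[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Example: two classes in 2-D
X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
y = ["duck", "duck", "duck", "goose", "goose", "goose"]
print(knn_classify((2, 2), X, y, k=3))  # -> "duck"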
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a test point X]

 The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Voronoi Diagram

Properties:
1) All possible points within a sample's Voronoi cell have that sample as their nearest neighbor
2) For any sample, the nearest sample is determined by the closest Voronoi cell edge
Other Distance Measures
 City-block distance (Manhattan distance)
 Sum of the absolute values of the differences
 Cosine similarity
 Measures the angle formed by the two samples (with the origin)
 Jaccard distance
 Determines the percentage of exact matches between the samples (not including unavailable data)
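
Illustrative implementations of these measures (a sketch; function names are mine, and the Jaccard variant follows the slide's matching-percentage description for nominal vectors rather than the set-based formula):

import math

def manhattan(a, b):
    # City-block distance: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle the two samples form with the origin
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    # Fraction of positions that do not match exactly,
    # skipping positions where either value is unavailable (None)
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    matches = sum(1 for x, y in pairs if x == y)
    return 1 - matches / len(pairs)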
Predicting Continuous Values

 For regression, replace the majority vote with a weighted average of the k nearest neighbors' target values:

   y(x) = ( Σi wi yi ) / ( Σi wi ),  summed over the k nearest neighbors

 Note: the unweighted average corresponds to wi = 1 for all i
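
A small sketch of this prediction rule (assuming inverse-distance weights wi = 1/d(x, xi), a common choice that the slide does not fix):

import math

def knn_regress(x, X_train, y_train, k=3):
    # Distance-weighted average of the k nearest neighbors' target values,
    # using w_i = 1 / d(x, x_i) (an assumed, common weighting scheme)
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbors = sorted((dist(x, p), v) for p, v in zip(X_train, y_train))[:k]
    if neighbors[0][0] == 0:  # exact match in the training set
        return neighbors[0][1]
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * v for w, (d, v) in zip(weights, neighbors)) / sum(weights)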


Nearest-Neighbor Classifiers: Issues
– The value of k, the number of nearest neighbors to retrieve
– Choice of distance metric to compute distance between records
– Computational complexity
– Size of training set
– Dimension of data
Value of K
 Choosing the value of k:
 If k is too small, the classifier is sensitive to noise points
 If k is too large, the neighborhood may include points from other classes

Rule of thumb: k = sqrt(N), where N is the number of training points
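
A quick illustration of the rule of thumb; the nudge to an odd value is my own combination of this rule with the earlier slide's advice to avoid ties:

import math

def rule_of_thumb_k(n_train):
    # k ≈ sqrt(N), nudged to an odd value so majority votes cannot tie
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

print(rule_of_thumb_k(100))  # 11  (sqrt(100) = 10, made odd)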
Distance Metrics
Distance Measure: Scale Effects

 Different features may have different measurement scales
 E.g., patient weight in kg (range [50, 200]) vs. blood protein values in ng/dL (range [-3, 3])
 Consequences:
 Patient weight will have a much greater influence on the distance between samples
 May bias the performance of the classifier
Standardization

 Transform raw feature values into z-scores:

   zij = (xij − μj) / σj

 xij is the value for the ith sample and jth feature
 μj is the average of all xij for feature j
 σj is the standard deviation of all xij over all input samples
 Range and scale of z-scores should be similar (provided the distributions of raw feature values are alike)
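
A small sketch of this transformation (assuming the data is a list of numeric feature vectors; NumPy is used for the column-wise statistics):

import numpy as np

def standardize(X):
    # Convert each column (feature) to z-scores: z_ij = (x_ij - mu_j) / sigma_j
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)       # per-feature average
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma

# Example: weight in kg vs. a protein value on a much smaller scale
X = [[70.0, 0.5], [120.0, -1.2], [95.0, 2.1]]
print(standardize(X))         # both columns now have mean 0, std 1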
Nearest Neighbor: Dimensionality
 Problem with Euclidean measure:
 High-dimensional data
 Curse of dimensionality
 Can produce counter-intuitive results
 Shrinking density – sparsification effect

[Figure: two pairs of high-dimensional binary vectors, one pair mostly 1s and the other mostly 0s; each pair differs in exactly two positions, so both pairs have the same Euclidean distance, d = 1.4142]
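
A quick numeric check of this effect (a sketch; the vectors are my own stand-ins for the slide's figure):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A mostly-1 pair and a mostly-0 pair, each differing in exactly two positions
ones_a  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
ones_b  = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]
zeros_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
zeros_b = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(euclidean(ones_a, ones_b))    # 1.4142...
print(euclidean(zeros_a, zeros_b))  # 1.4142... identical, despite very different vectors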
Distance for Nominal Attributes
Distance for Heterogeneous Data

Wilson, D. R. and Martinez, T. R., "Improved Heterogeneous Distance Functions," Journal of Artificial Intelligence Research, vol. 6, no. 1, pp. 1–34, 1997
Nearest Neighbour: Computational Complexity
 Expensive
 To determine the nearest neighbour of a query point q, must compute the distance to all N training examples
+ Pre-sort training examples into fast data structures (kd-trees)
+ Compute only an approximate distance (LSH)
+ Remove redundant data (condensing)
 Storage Requirements
 Must store all training data
+ Remove redundant data (condensing)
- Pre-sorting often increases the storage requirements
 High Dimensional Data
 "Curse of Dimensionality"
 Required amount of training data increases exponentially with dimension
 Computational cost also increases dramatically
 Partitioning techniques degrade to linear search in high dimensions
Reduction in Computational Complexity
 Reduce size of training set
 Condensation, editing
 Use geometric data structures for high-dimensional search
Condensation: Decision Regions

Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample. A Voronoi diagram divides the space into such cells.

Every query point will be assigned the classification of the sample within that cell. The decision boundary separates the class regions based on the 1-NN decision rule. Knowledge of this boundary is sufficient to classify new points. The boundary itself is rarely computed; many algorithms seek to retain only those points necessary to generate an identical boundary.
Condensing

 Aim is to reduce the number of training samples
 Retain only the samples that are needed to define the decision boundary
 Decision Boundary Consistent – a subset whose nearest neighbour decision boundary is identical to the boundary of the entire training set
 Minimum Consistent Set – the smallest subset of the training data that correctly classifies all of the original training data

[Figure: original data, condensed data, and the Minimum Consistent Set]
Condensing

 Condensed Nearest Neighbour (CNN)
1. Initialize the subset with a single (or k) training example(s)
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
3. Return to 2 until no transfers occurred or the subset is full

Properties of CNN:
• Incremental
• Order dependent
• Neither minimal nor decision-boundary consistent
• O(n³) for the brute-force method

A sketch of this procedure follows.
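
A minimal Python sketch of CNN condensing (assuming a 1-NN classifier over Euclidean distance; all names are illustrative):

import math

def _dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def _nn_label(x, subset_X, subset_y):
    # 1-NN label of x within the current subset
    i = min(range(len(subset_X)), key=lambda j: _dist(x, subset_X[j]))
    return subset_y[i]

def cnn_condense(X, y):
    # Condensed Nearest Neighbour: start from one example, then keep
    # transferring misclassified samples until a full pass makes no change.
    # The result depends on the order of X (order dependent).
    keep = [0]                                  # indices in the subset
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            sub_X = [X[j] for j in keep]
            sub_y = [y[j] for j in keep]
            if _nn_label(X[i], sub_X, sub_y) != y[i]:
                keep.append(i)
                changed = True
    return keep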
High Dimensional Search

 Given a point set and a nearest-neighbour query point:
 Find the points enclosed in a rectangle (range) around the query
 Perform a linear search for the nearest neighbour only within the rectangle

[Figure: a query point with a rectangular search range drawn around it]
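
A sketch of this range-then-linear-search idea (the box half-width "radius" is my own parameter; the slide does not specify how the rectangle is chosen):

import math

def range_then_linear(query, points, radius):
    # Keep only points inside an axis-aligned box around the query,
    # then do a linear nearest-neighbour scan within that box
    box = [p for p in points
           if all(abs(c - q) <= radius for c, q in zip(p, query))]
    return min(box, key=lambda p: math.dist(query, p), default=None)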
kd-tree: Data Structure for Range Search
 Index data into a tree
 Search on the tree
 Tree construction: at each level, use a different dimension to split

[Figure: a 2-D point set (A, B, C, D, E) partitioned first at x = 5 (x < 5 vs. x >= 5), then at y = 3 and y = 6, then at x = 6, with the corresponding tree]
kd-tree Example

[Figure: a kd-tree over 2-D points with vertical splits at x = 3, x = 5, x = 7, x = 8 and horizontal splits at y = 2, y = 5, y = 6]
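
A compact kd-tree sketch (my own illustrative implementation of the construction rule above: cycle the splitting dimension at each level; the nearest-neighbour search prunes subtrees whose splitting plane lies farther away than the best distance found so far):

import math

def build_kdtree(points, depth=0):
    # Split on a different dimension at each level (cycled)
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    # Depth-first search, pruning subtrees that cannot
    # contain a closer point than the current best
    if node is None:
        return best
    d = math.dist(query, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < best[0]:          # the splitting plane may hide a closer point
        best = nearest(far, query, best)
    return best

tree = build_kdtree([(3, 6), (7, 5), (5, 2), (8, 2), (5, 6), (7, 2)])
print(nearest(tree, (6, 5)))         # (distance, nearest point)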
KNN: Alternate Terminologies

 Instance-Based Learning
 Lazy Learning
 Case-Based Reasoning
 Exemplar-Based Learning
