Chap 5.1: NN Classification

The document discusses K-nearest neighbor (KNN) classification. It explains that KNN is a lazy learning algorithm that delays model building until classification. During classification, KNN finds the K training records closest in distance to the new record and assigns the most common class among those K neighbors. The document covers different distance measures for different attribute types, and how to determine the optimal K value. It notes advantages of KNN include quick training and ability to learn complex patterns, while disadvantages include slow query time and susceptibility to irrelevant attributes.


CS446 – Fall 2015

Introduction to Data Mining

K Nearest Neighbor
Classification
All the slides were adapted from:
1- Intro. to Data Mining by Tan et al.
2- Dr. Ibrahim Albluwi
3- Dr. Noureddin Sadawi
Is it a Duck?
• If it quacks like a duck, walks like a duck, and looks like a duck, then most probably it is a duck!

[Diagram: compare the test record with all the training records and choose the "nearest" record.]
k Nearest Neighbor (kNN)
NN Classification
• Given an unseen record r that needs to be classified:
– Compute the distance between r and all of the other records in the
training set.
– Choose the record r_minDist that has the minimum distance to r.
– Assign to r the class value of rminDist.

• Example: How should r = (X=2, Y=2) be classified?

  Record   X   Y   Class
  1        1   1   YES
  2        3   5   NO
  3        7   9   NO
  4        4   7   YES

• Distance(r1, r) = sqrt((1-2)^2 + (1-2)^2) = sqrt(2)
• Distance(r2, r) = sqrt((3-2)^2 + (5-2)^2) = sqrt(10)
• Distance(r3, r) = sqrt((7-2)^2 + (9-2)^2) = sqrt(74)
• Distance(r4, r) = sqrt((4-2)^2 + (7-2)^2) = sqrt(29)

The closest record is r1, so r will be classified as YES.


Algorithm
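A minimal Python sketch of the nearest-neighbor procedure described above; the function and variable names are illustrative assumptions, not taken from the original slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric records of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_records, class_values, r, k=1):
    # Compute the distance from r to every training record,
    # keep the k closest ones, and return the majority class among them.
    neighbors = sorted(zip(training_records, class_values),
                       key=lambda pair: euclidean(pair[0], r))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# The example from the previous slide: r = (2, 2) is closest to record 1, so YES.
records = [(1, 1), (3, 5), (7, 9), (4, 7)]
classes = ["YES", "NO", "NO", "YES"]
print(knn_classify(records, classes, (2, 2), k=1))   # YES
```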
Lazy Learners
• Nearest neighbor classification is considered a lazy method, whereas decision tree classification is considered an eager method.

• Lazy Learners:
– Do not build any model: Zero training time.
– Delay “thinking” to classification time.
– Most time is spent on classification.

• Eager Learners:
– Spend most of the time on building the model prior to
classification.
– Classification is quick since the model is ready.
Proximity Measures
Definitions:
• Similarity: A numerical measure of how alike two data objects are.

• Dissimilarity (or Distance): A numerical measure of how different two data objects are.

• Proximity: Similarity or dissimilarity, depending on context.

• Which proximity measure should be used? This is highly dependent on the attribute types.
Distance Measures

• Numeric Attributes:
– Manhattan Distance, Euclidean Distance, etc.

– Normalization is very important to avoid having one (or a few) attributes dominate the others.

Example:
• The height of a person may vary from 1.5 m to 1.8 m.
• The weight of a person may vary from 90 lb to 300 lb.
• The income of a person may vary from $10K to $1M.
If we do not normalize, the income attribute will dominate the distance computation (see the sketch below).
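A minimal sketch of that effect, using hypothetical values within the ranges above; the min-max rescaling step is an assumption, not prescribed by the slides:

```python
# Two hypothetical people: (height in m, weight in lb, income in $)
p1 = (1.5,  90,  10_000)
p2 = (1.8, 300, 1_000_000)

# Unnormalized Euclidean distance: dominated almost entirely by income.
raw = sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5
print(round(raw))            # ~990000, essentially just the income gap

# Rescale each attribute to [0, 1] using its assumed min/max range.
ranges = [(1.5, 1.8), (90, 300), (10_000, 1_000_000)]
scaled1 = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(p1, ranges)]
scaled2 = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(p2, ranges)]
norm = sum((a - b) ** 2 for a, b in zip(scaled1, scaled2)) ** 0.5
print(round(norm, 3))        # ~1.732: all three attributes now contribute
```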
Numeric Attributes
• Distance between two attribute values: |v1-v2|
• Distance between two records: many possibilities.

• Euclidean Distance:
  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• Manhattan Distance:
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
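As a quick illustration, the two formulas above can be written as short Python functions (an illustrative sketch, assuming records are sequences of p numeric values):

```python
def euclidean(xi, xj):
    # d(i, j) = sqrt(|xi1 - xj1|^2 + ... + |xip - xjp|^2)
    return sum(abs(a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5

def manhattan(xi, xj):
    # d(i, j) = |xi1 - xj1| + ... + |xip - xjp|
    return sum(abs(a - b) for a, b in zip(xi, xj))

print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7
```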
Euclidean Distance
Example:
             Age   Income   Height   Weight
  Record 1    45     2000     1.6      80
  Record 2    32     1200     1.75     75

Normalization:
age1 = 45/140 = 0.32 age2 = 32/140 = 0.23
Income1 = 2000/5000 = 0.4 Income2 = 1200/5000 = 0.24
Height1 = 1.6/2.1 = 0.76 Height2 = 1.75/2.1 = 0.83
Weight1 = 80/150 = 0.53 Weight2 = 75/150 = 0.5

Euclidean Distance =

sqrt((0.32 - 0.23)^2 + (0.4 - 0.24)^2 + (0.76 - 0.83)^2 + (0.53 - 0.5)^2) ≈ 0.199
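A quick check of the computation above in Python, with each attribute divided by its assumed maximum (140 for age, 5000 for income, 2.1 for height, 150 for weight), as in the slide:

```python
from math import sqrt

r1 = {"age": 45, "income": 2000, "height": 1.6,  "weight": 80}
r2 = {"age": 32, "income": 1200, "height": 1.75, "weight": 75}
maxima = {"age": 140, "income": 5000, "height": 2.1, "weight": 150}

# Normalize each attribute by its assumed maximum, then apply Euclidean distance.
d = sqrt(sum((r1[a] / maxima[a] - r2[a] / maxima[a]) ** 2 for a in maxima))
print(round(d, 3))   # ~0.201 (about 0.199 with the rounded values shown above)
```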


Ordinal Attributes
• Assign numbers depending on order.

• Distance between two attribute values: |v1 - v2| / (n - 1), where the values are mapped to ranks 0, 1, ..., n-1 and n is the number of possible values.

• Distance between two records: average distance between attributes.

Example:
             Height   Income   GPA
  Record 1   Short    Low      A
  Record 2   Tall     Medium   A

• A1 [Short, Medium, Tall]: d1 = |2 - 0| / 2 = 1
• A2 [Low, Medium, High]:   d2 = |1 - 0| / 2 = 0.5
• A3 [A, B, C, D, F]:       d3 = |0 - 0| / 4 = 0
• Distance = (1+0.5+0)/3 = 0.5
• Similarity = 1-0.5 = 0.5
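A small sketch of the ordinal distance just described; the value scales are taken from the example, while the helper function itself is an assumption:

```python
def ordinal_distance(v1, v2, scale):
    # Distance = |rank(v1) - rank(v2)| / (n - 1), where n = number of levels.
    return abs(scale.index(v1) - scale.index(v2)) / (len(scale) - 1)

heights = ["Short", "Medium", "Tall"]
incomes = ["Low", "Medium", "High"]
grades  = ["A", "B", "C", "D", "F"]

d1 = ordinal_distance("Short", "Tall", heights)   # 1.0
d2 = ordinal_distance("Low", "Medium", incomes)   # 0.5
d3 = ordinal_distance("A", "A", grades)           # 0.0
print((d1 + d2 + d3) / 3)                         # 0.5, as in the example
```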
Nominal Attributes
• Similarity between two attribute values: Match = 1 and Mismatch = 0.

• Similarity between two records:
  (Number of matches) / (Number of attributes)

• Distance between two records = 1 – similarity.

Example:
             Eye Color   Country   Job        Married
  Record 1   Black       Jordan    Engineer   Yes
  Record 2   Green       Jordan    Engineer   Yes

• Similarity= (0 + 1 + 1 + 1)/4 = 0.75


• Distance = 1 – 0.75 = 0.25
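The same nominal-attribute example as a short Python sketch (the helper function is illustrative, not from the slides):

```python
def nominal_similarity(rec1, rec2):
    # Similarity = (number of matching attributes) / (number of attributes).
    matches = sum(1 for a, b in zip(rec1, rec2) if a == b)
    return matches / len(rec1)

rec1 = ("Black", "Jordan", "Engineer", "Yes")
rec2 = ("Green", "Jordan", "Engineer", "Yes")

sim = nominal_similarity(rec1, rec2)
print(sim, 1 - sim)   # 0.75 0.25 (similarity, distance)
```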
Notes
• For records with mixed attribute types, compute the distance for each
attribute individually in the range [0-1] and then compute the average
over all attributes.

• Example:
             Age   Gender   Height   Weight
  Record 1    45   Female   Short    80
  Record 2    32   Male     Tall     75

• |R1(Age) - R2(Age)| = |45/140 - 32/140| = |0.32 - 0.23| = 0.09
• |R1(Gender) - R2(Gender)| = 1 (Mismatch = 1, Match = 0)
• |R1(Height) - R2(Height)| = |2 - 0| / 2 = 1
• |R1(Weight) - R2(Weight)| = |80/150 - 75/150| = |0.53 - 0.5| = 0.03

• Distance(R1, R2) = (0.09 + 1 + 1 + 0.03) / 4 = 0.53
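A sketch that combines the pieces for mixed attribute types and reproduces the example above; the per-attribute maxima and the ordinal scale are assumptions carried over from the earlier slides:

```python
def mixed_distance(r1, r2):
    # Each attribute contributes a distance in [0, 1]; the record
    # distance is the average over all attributes.
    d_age    = abs(r1["age"] / 140 - r2["age"] / 140)          # numeric
    d_gender = 0 if r1["gender"] == r2["gender"] else 1        # nominal
    heights  = ["Short", "Medium", "Tall"]                     # ordinal scale
    d_height = abs(heights.index(r1["height"]) -
                   heights.index(r2["height"])) / (len(heights) - 1)
    d_weight = abs(r1["weight"] / 150 - r2["weight"] / 150)    # numeric
    return (d_age + d_gender + d_height + d_weight) / 4

r1 = {"age": 45, "gender": "Female", "height": "Short", "weight": 80}
r2 = {"age": 32, "gender": "Male",   "height": "Tall",  "weight": 75}
print(round(mixed_distance(r1, r2), 2))   # ~0.53
```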


KNN Concerns!
• What if the nearest neighbor is actually a noisy record?
• What if there are several nearest neighbors (all equally distant)?
• What if there are several records that are all very close to the
unseen record but each having a different distance?

• Use K-Nearest Neighbors instead of 1-Nearest Neighbor!

• Assign to the unseen record (see the sketch after this list):
– The majority class value among the K-NNs if the class
attribute is nominal.
– The median class value among the K-NNs if the class attribute
is ordinal.
– The mean class value among the K-NNs if the class attribute is
numeric.
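A minimal sketch of these three combination rules, assuming the class values of the K nearest neighbors have already been collected into a list:

```python
from collections import Counter
from statistics import median, mean

def combine(neighbor_classes, attribute_type):
    # Majority vote for nominal, median for ordinal, mean for numeric.
    if attribute_type == "nominal":
        return Counter(neighbor_classes).most_common(1)[0][0]
    if attribute_type == "ordinal":
        return median(neighbor_classes)   # assumes classes are encoded as ranks
    return mean(neighbor_classes)         # numeric class: k-NN regression

print(combine(["YES", "NO", "YES"], "nominal"))   # YES
print(combine([1, 2, 2], "ordinal"))              # 2
print(combine([3.0, 5.0, 10.0], "numeric"))       # 6.0
```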
Examples

[Figure: an unseen record X and its (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors.]

• Using a very small K: sensitive to noise and susceptible to overfitting.
• Using a very large K: computationally expensive and may
consider irrelevant records.
• May need to set K experimentally.
How Many Neighbors?
Example
Example
Standardized Distance
Does k-NN Classification Work?
• In the limit (as the number of training records grows), nearest-neighbor classification is guaranteed to have an error rate that is no more than twice the error rate of an optimal (Bayes) classifier.

[Figure: Voronoi diagram of the training records.]

• In a Voronoi diagram, all points in a cell are closer to the record in that cell than to any record in the other cells.
• To classify a record: see in which cell it falls and assign to it the class of the record in that cell.
• NN-classifiers can learn complex patterns that are difficult for decision trees.
Notes
• When to use NN-Classification?
– If there are fewer than 20 attributes.
  [Curse of Dimensionality: in higher dimensions, intuition fails, distance measures become less meaningful, and computation becomes expensive.]
– If the application affords long classification time.
– If there are lots of training data.

• Advantages of NN-Classification:
– Quick training time.
– Can learn complex patterns.
– Can be used for regression (numeric class attributes).

• Disadvantages of NN-Classification:
– Slow at query time.
– Easily fooled by irrelevant attributes [Feature subset selection is
very important].
