Chap 5.1: NN Classification

The document discusses K-nearest neighbor (KNN) classification. It explains that KNN is a lazy learning algorithm that delays model building until classification. During classification, KNN finds the K training records closest in distance to the new record and assigns the most common class among those K neighbors. The document covers different distance measures for different attribute types, and how to determine the optimal K value. It notes advantages of KNN include quick training and ability to learn complex patterns, while disadvantages include slow query time and susceptibility to irrelevant attributes.


CS446 – Fall 2015

Introduction to Data Mining

K Nearest Neighbor
Classification
All the slides were adapted from:
1- Intro. to Data Mining by Tan et al.
2- Dr. Ibrahim Albluwi
3- Dr. Noureddin Sadawi
Is it a Duck?
• If it quacks like a duck, walks like a duck, and looks like a duck, then most probably it is a duck!

[Diagram: compare the test record with all the training records and choose the "nearest" record.]
k Nearest Neighbor (kNN)
NN Classification
• Given an unseen record r that needs to be classified:
– Compute the distance between r and all of the other records in the
training set.
– Choose the record r_minDist that has the minimum distance to r.
– Assign to r the class value of rminDist.

• Example: How should r = (X=2, Y=2) be classified?

  Record   X   Y   Class
  1        1   1   YES
  2        3   5   NO
  3        7   9   NO
  4        4   7   YES

• Distance(r1, r) = sqrt((1-2)^2 + (1-2)^2) = sqrt(2)
• Distance(r2, r) = sqrt((3-2)^2 + (5-2)^2) = sqrt(10)
• Distance(r3, r) = sqrt((7-2)^2 + (9-2)^2) = sqrt(74)
• Distance(r4, r) = sqrt((4-2)^2 + (7-2)^2) = sqrt(29)

The closest record is r1, so r will be classified as YES.


Algorithm
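A minimal Python sketch of the nearest-neighbor procedure described above; the function and variable names are illustrative assumptions, not taken from the original slides:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric records of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_records, class_values, r, k=1):
    # Compute the distance from r to every training record,
    # keep the k closest ones, and return the majority class among them.
    neighbors = sorted(zip(training_records, class_values),
                       key=lambda pair: euclidean(pair[0], r))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# The example from the previous slide: r = (2, 2) is closest to record 1, so YES.
records = [(1, 1), (3, 5), (7, 9), (4, 7)]
classes = ["YES", "NO", "NO", "YES"]
print(knn_classify(records, classes, (2, 2), k=1))   # YES
```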
Lazy Learners
• Nearest neighbor classification is considered a lazy method, whereas decision tree classification is considered an eager method.

• Lazy Learners:
– Do not build any model: Zero training time.
– Delay “thinking” to classification time.
– Most time is spent on classification.

• Eager Learners:
– Spend most of the time on building the model prior to
classification.
– Classification is quick since the model is ready.
Proximity Measures
Definitions:
• Similarity: A numerical measure of how alike two data objects are.

• Dissimilarity (or Distance): A numerical measure of how different two data objects are.

• Proximity: Similarity or dissimilarity, depending on context.

• Which proximity measure should be used? This is highly dependent on the attribute types.
Distance Measures

• Numeric Attributes:
– Manhattan Distance, Euclidean Distance, etc.

– Normalization is very important to avoid having one (or a few) attributes dominate the others.

Example:
• The height of a person may vary from 1.5 m to 1.8 m.
• The weight of a person may vary from 90 lb to 300 lb.
• The income of a person may vary from $10K to $1M.
If we do not normalize, the income attribute will dominate the distance computation (see the sketch below).
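A minimal sketch of that effect, using hypothetical values within the ranges above; the min-max rescaling step is an assumption, not prescribed by the slides:

```python
# Two hypothetical people: (height in m, weight in lb, income in $)
p1 = (1.5,  90,  10_000)
p2 = (1.8, 300, 1_000_000)

# Unnormalized Euclidean distance: dominated almost entirely by income.
raw = sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5
print(round(raw))            # ~990000, essentially just the income gap

# Rescale each attribute to [0, 1] using its assumed min/max range.
ranges = [(1.5, 1.8), (90, 300), (10_000, 1_000_000)]
scaled1 = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(p1, ranges)]
scaled2 = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(p2, ranges)]
norm = sum((a - b) ** 2 for a, b in zip(scaled1, scaled2)) ** 0.5
print(round(norm, 3))        # ~1.732: all three attributes now contribute
```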
Numeric Attributes
• Distance between two attribute values: |v1-v2|
• Distance between two records: many possibilities.

• Euclidean Distance:
  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• Manhattan Distance:
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
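As a quick illustration, the two formulas above can be written as short Python functions (an illustrative sketch, assuming records are sequences of p numeric values):

```python
def euclidean(xi, xj):
    # d(i, j) = sqrt(|xi1 - xj1|^2 + ... + |xip - xjp|^2)
    return sum(abs(a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5

def manhattan(xi, xj):
    # d(i, j) = |xi1 - xj1| + ... + |xip - xjp|
    return sum(abs(a - b) for a, b in zip(xi, xj))

print(euclidean([0, 0], [3, 4]))   # 5.0
print(manhattan([0, 0], [3, 4]))   # 7
```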
Euclidean Distance
Example:
             Age   Income   Height   Weight
  Record 1    45     2000     1.6      80
  Record 2    32     1200     1.75     75

Normalization:
age1 = 45/140 = 0.32 age2 = 32/140 = 0.23
Income1 = 2000/5000 = 0.4 Income2 = 1200/5000 = 0.24
Height1 = 1.6/2.1 = 0.76 Height2 = 1.75/2.1 = 0.83
Weight1 = 80/150 = 0.53 Weight2 = 75/150 = 0.5

Euclidean Distance =

sqrt((0.32 - 0.23)^2 + (0.4 - 0.24)^2 + (0.76 - 0.83)^2 + (0.53 - 0.5)^2) ≈ 0.199
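A quick check of the computation above in Python, with each attribute divided by its assumed maximum (140 for age, 5000 for income, 2.1 for height, 150 for weight), as in the slide:

```python
from math import sqrt

r1 = {"age": 45, "income": 2000, "height": 1.6,  "weight": 80}
r2 = {"age": 32, "income": 1200, "height": 1.75, "weight": 75}
maxima = {"age": 140, "income": 5000, "height": 2.1, "weight": 150}

# Normalize each attribute by its assumed maximum, then apply Euclidean distance.
d = sqrt(sum((r1[a] / maxima[a] - r2[a] / maxima[a]) ** 2 for a in maxima))
print(round(d, 3))   # ~0.201 (about 0.199 with the rounded values shown above)
```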


Ordinal Attributes
• Assign numbers depending on order.

• Distance between two attribute values: |v1 - v2| / (n - 1), where the values are mapped to ranks 0, 1, ..., n-1 and n is the number of possible values.

• Distance between two records: average distance between attributes.

Example:
             Height   Income   GPA
  Record 1   Short    Low      A
  Record 2   Tall     Medium   A

• A1 [Short, Medium, Tall]: d1 = |2 - 0| / 2 = 1
• A2 [Low, Medium, High]:   d2 = |1 - 0| / 2 = 0.5
• A3 [A, B, C, D, F]:       d3 = |0 - 0| / 4 = 0
• Distance = (1+0.5+0)/3 = 0.5
• Similarity = 1-0.5 = 0.5
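A small sketch of the ordinal distance just described; the value scales are taken from the example, while the helper function itself is an assumption:

```python
def ordinal_distance(v1, v2, scale):
    # Distance = |rank(v1) - rank(v2)| / (n - 1), where n = number of levels.
    return abs(scale.index(v1) - scale.index(v2)) / (len(scale) - 1)

heights = ["Short", "Medium", "Tall"]
incomes = ["Low", "Medium", "High"]
grades  = ["A", "B", "C", "D", "F"]

d1 = ordinal_distance("Short", "Tall", heights)   # 1.0
d2 = ordinal_distance("Low", "Medium", incomes)   # 0.5
d3 = ordinal_distance("A", "A", grades)           # 0.0
print((d1 + d2 + d3) / 3)                         # 0.5, as in the example
```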
Nominal Attributes
• Similarity between two attribute values: Match = 1 and Mismatch = 0.

• Similarity between two records:
  (Number of matches) / (Number of attributes)

• Distance between two records = 1 – similarity.

Example:
             Eye Color   Country   Job        Married
  Record 1   Black       Jordan    Engineer   Yes
  Record 2   Green       Jordan    Engineer   Yes

• Similarity= (0 + 1 + 1 + 1)/4 = 0.75


• Distance = 1 – 0.75 = 0.25
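The same nominal-attribute example as a short Python sketch (the helper function is illustrative, not from the slides):

```python
def nominal_similarity(rec1, rec2):
    # Similarity = (number of matching attributes) / (number of attributes).
    matches = sum(1 for a, b in zip(rec1, rec2) if a == b)
    return matches / len(rec1)

rec1 = ("Black", "Jordan", "Engineer", "Yes")
rec2 = ("Green", "Jordan", "Engineer", "Yes")

sim = nominal_similarity(rec1, rec2)
print(sim, 1 - sim)   # 0.75 0.25 (similarity, distance)
```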
Notes
• For records with mixed attribute types, compute the distance for each
attribute individually in the range [0-1] and then compute the average
over all attributes.

• Example:
             Age   Gender   Height   Weight
  Record 1    45   Female   Short    80
  Record 2    32   Male     Tall     75

• |R1(Age) - R2(Age)| = |45/140 - 32/140| = |0.32 - 0.23| = 0.09
• |R1(Gender) - R2(Gender)| = 1 (Mismatch = 1, Match = 0)
• |R1(Height) - R2(Height)| = |2 - 0| / 2 = 1
• |R1(Weight) - R2(Weight)| = |80/150 - 75/150| = |0.53 - 0.5| = 0.03

• Distance(R1, R2) = (0.09 + 1 + 1 + 0.03) / 4 = 0.53
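A sketch that combines the pieces for mixed attribute types and reproduces the example above; the per-attribute maxima and the ordinal scale are assumptions carried over from the earlier slides:

```python
def mixed_distance(r1, r2):
    # Each attribute contributes a distance in [0, 1]; the record
    # distance is the average over all attributes.
    d_age    = abs(r1["age"] / 140 - r2["age"] / 140)          # numeric
    d_gender = 0 if r1["gender"] == r2["gender"] else 1        # nominal
    heights  = ["Short", "Medium", "Tall"]                     # ordinal scale
    d_height = abs(heights.index(r1["height"]) -
                   heights.index(r2["height"])) / (len(heights) - 1)
    d_weight = abs(r1["weight"] / 150 - r2["weight"] / 150)    # numeric
    return (d_age + d_gender + d_height + d_weight) / 4

r1 = {"age": 45, "gender": "Female", "height": "Short", "weight": 80}
r2 = {"age": 32, "gender": "Male",   "height": "Tall",  "weight": 75}
print(round(mixed_distance(r1, r2), 2))   # ~0.53
```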


KNN Concerns!
• What if the nearest neighbor is actually a noisy record?
• What if there are several nearest neighbors (all equally distant)?
• What if there are several records that are all very close to the
unseen record but each having a different distance?

• Use K-Nearest Neighbors instead of 1-Nearest Neighbor!

• Assign to the unseen record (see the sketch after this list):
– The majority class value among the K-NNs if the class
attribute is nominal.
– The median class value among the K-NNs if the class attribute
is ordinal.
– The mean class value among the K-NNs if the class attribute is
numeric.
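A minimal sketch of these three combination rules, assuming the class values of the K nearest neighbors have already been collected into a list:

```python
from collections import Counter
from statistics import median, mean

def combine(neighbor_classes, attribute_type):
    # Majority vote for nominal, median for ordinal, mean for numeric.
    if attribute_type == "nominal":
        return Counter(neighbor_classes).most_common(1)[0][0]
    if attribute_type == "ordinal":
        return median(neighbor_classes)   # assumes classes are encoded as ranks
    return mean(neighbor_classes)         # numeric class: k-NN regression

print(combine(["YES", "NO", "YES"], "nominal"))   # YES
print(combine([1, 2, 2], "ordinal"))              # 2
print(combine([3.0, 5.0, 10.0], "numeric"))       # 6.0
```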
Examples

[Figure: an unseen record X and its (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors.]

• Using a very small K: sensitive to noise and susceptible to overfitting.
• Using a very large K: computationally expensive and may
consider irrelevant records.
• May need to set K experimentally.
How Many Neighbors?
Example
Example
Standardized Distance
Does k-NN Classification Work?
• In the limit (as the number of training records grows), nearest-neighbor classification is guaranteed to have an error rate that is no more than twice the error rate of an optimal (Bayes) classifier.

[Figure: Voronoi diagram of the training records.]

• In a Voronoi diagram, all points in a cell are closer to the record in that cell than to any record in the other cells.
• To classify a record: see in which cell it falls and assign to it the class of the record in that cell.
• NN-classifiers can learn complex patterns that are difficult for decision trees.
Notes
• When to use NN-Classification?
– If there are fewer than 20 attributes.
  [Curse of Dimensionality: in higher dimensions, intuition fails, distance measures become less meaningful, and computation becomes expensive.]
– If the application affords long classification time.
– If there are lots of training data.

• Advantages of NN-Classification:
– Quick training time.
– Can learn complex patterns.
– Can be used for regression (numeric class attributes).

• Disadvantages of NN-Classification:
– Slow at query time.
– Easily fooled by irrelevant attributes [Feature subset selection is
very important].
