0% found this document useful (0 votes)
83 views24 pages

K-Nearest Neighbor

This document discusses classification using the k-nearest neighbors algorithm. Classification is the task of assigning data points to predefined categories or classes based on their attributes. The k-nearest neighbors algorithm identifies the k training data points that are closest in distance to a new data point and assigns the most common class among those neighbors as the predicted class. Choosing an appropriate value for k and scaling attributes appropriately are important considerations for the k-nearest neighbors algorithm.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
83 views24 pages

K-Nearest Neighbor

This document discusses classification using the k-nearest neighbors algorithm. Classification is the task of assigning data points to predefined categories or classes based on their attributes. The k-nearest neighbors algorithm identifies the k training data points that are closest in distance to a new data point and assigns the most common class among those neighbors as the predicted class. Choosing an appropriate value for k and scaling attributes appropriately are important considerations for the k-nearest neighbors algorithm.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

DATA MINING

LECTURE
Classification
k-nearest neighbor classifier
What is classification?
• Classification is the task of learning a target function f that
maps attribute set x to one of the predefined class labels y

Tid Refund Marital Taxable


Status Income Cheat One of the attributes is the class attribute
1 Yes Single 125K No
In this case: Cheat
2 No Married 100K No
3 No Single 70K No
Two class labels (or classes): Yes (1), No (0)
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Examples of Classification Tasks
• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate


or fraudulent

• Categorizing news stories as finance,


weather, entertainment, sports, etc

• Identifying spam email, spam web pages, adult


content
General approach to classification

• Predictive modeling: Predict a class of a previously


unseen record

• Training set consists of records with known class


labels

• A labeled test set of previously unseen data records


is used to evaluate the quality of the model.

• The classification model is applied to new records


with unknown class labels
Illustrating Classification Task
Tid Attrib1 Attrib2 Attrib3 Class Learning
1 Yes Large 125K No
algorithm
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ? Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set
Nearest neighbor classification
Nearest Neighbor Classifiers
Set of Stored Cases • Store the training records

……... • Use training records to


Atr1 AtrN Class
predict the class label of
A unseen cases
B
B
Unseen Case
C
Atr1 ……... AtrN
A
C
B
Nearest Neighbor Classifiers
• Basic idea:
• “If it walks like a duck, quacks like a duck, then it’s
probably a duck”

Compute
Distance Test
Record

Training Choose k of the


Records “nearest” records
Nearest-Neighbor Classifiers
Unknown record Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

To classify an unknown record:


1. Compute distance to other
training records
2. Identify k nearest neighbors
3. Use class labels of nearest
neighbors to determine the
class label of unknown
record (e.g., by taking
majority vote)
Definition of Nearest Neighbor

X X X

(a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor

K-nearest neighbors of a record x are data points


that have the k smallest distance to x
Nearest Neighbor Classifiers
• Nearest neighbor classifier
• Uses k “closest” points (nearest neighbors) for performing
classification
Nearest Neighbor Classifiers
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
K-Nearest Neighbors example
Nearest Neighbor Classification…
• Scaling issues
• Attributes may have to be scaled to prevent distance
measures from being dominated by one of the attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Nearest Neighbor Classification…
• Choosing the value of k:
• If k is too small, sensitive to noise points
• If k is too large, neighborhood may include points from
other classes

You might also like