K Nearest Neighbor (Revised)


Classification Methods:

K-Nearest Neighbors
Overview
• Goal:
– Adding another classification method to
our toolbox
• Outline:
– A Motivating Example
– K-nearest neighbors
• The methodology
• Examples and K-NN in R

2
K-NEAREST NEIGHBORS

3
K-Nearest Neighbor Algorithm
• The k-Nearest Neighbor Algorithm (KNN) is based on a simple
underlying idea: classify/predict a value of a new point based on the
point(s) “close to it”

[Figure: a new record to be classified, shown with (a) its 1 nearest neighbor, (b) 2 nearest neighbors, (c) 3 nearest neighbors]

• For prediction, use the average outcome of the neighbors


• For classification, use the majority of the votes of the k closest neighbors

Image credit: Tan, Steinbach, Kumar
4
The Algorithm
• Requires three things:
– The set of stored records (the training set)
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to use for
classification
• To classify an unknown record:
– Compute distance to all training records
– Identify k nearest neighbors
– Use class labels of nearest neighbors to determine the class
label of unknown record (e.g., by taking majority vote)

5
Adapted from: Tan, Steinbach, Kumar
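
A minimal from-scratch sketch of these three steps in R (the toy data and the function name knn_classify are made up for illustration; the deck's actual examples use knn() from library(class), shown later):

knn_classify <- function(train_x, train_y, new_x, k = 3) {
  # 1. compute the (Euclidean) distance from the new record to every training record
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # 2. identify the k nearest neighbors
  nn <- order(d)[1:k]
  # 3. take the majority vote of their class labels
  names(which.max(table(train_y[nn])))
}

# Made-up training set: two numeric attributes, classes "A" and "B"
train_x <- matrix(c(1, 1, 2, 1, 8, 9, 9, 8), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "B", "B")
knn_classify(train_x, train_y, new_x = c(2, 2), k = 3)   # "A"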
Distances
• In order to use nearest neighbor classifiers one
must define what “close” means, that is one
needs a measure of “distances” between two
records
– The distance measure need not necessarily be
Euclidean distance, although it is the one
commonly used
• Euclidean distance between x and u:

$D_E(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}$
6
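
As a quick check of the formula, the same distance can be computed by hand or with R's built-in dist(); the two records here are the query point and the first row of the table on the Normalization slide:

x <- c(0.07, 0.09)
u <- c(0.50, 0.50)
sqrt(sum((x - u)^2))   # 0.594, straight from the formula
dist(rbind(x, u))      # same value; dist() uses Euclidean distance by default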
Dimensionality and Scale
• Nearest neighbor methods can work poorly when
the dimensionality is large (meaning there are a
large number of attributes)
– It is hard to find neighbors that are close enough – neighbors are all far
away

• The scales of the different attributes are
important. If a single numeric attribute has a
large spread, it can dominate the distance metric.
– Common practice is to scale all numeric attributes to have equal variance
– Can use the R functions scale(.) or apply(.)

7
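
A small sketch of that practice, using a made-up matrix:

# Two attributes on very different scales (hypothetical values)
X <- cbind(income = c(45, 120, 80, 95), age = c(23, 41, 35, 52))
Xs <- scale(X)       # center each column and divide it by its standard deviation
apply(Xs, 2, sd)     # every column of the scaled matrix now has sd = 1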
Normalization
• The example above has been constructed with
perfect symmetry
• What if we changed the units of measurement of
one of the variables (say by multiplying by 100)?
• This would affect the distance calculation
• Recall: one way around this is to normalize,
e.g., with the R functions scale(.) or apply(.)

8
Normalization
• Consider the following series of numbers, where
Y2 = 10·X2 and Z2 = 100·X2:

                              Distance from (0.5, 0.5)
X1      X2      Y2     Z2     (X1, X2)  (X1, Y2)  (X1, Z2)
0.07    0.09    0.9     9       0.59      0.59      8.51
0.60    0.70    7      70       0.22      6.50     69.50
0.64    0.35    3.5    35       0.20      3.00     34.50
0.12    0.15    1.5    15       0.52      1.07     14.51
0.75    0.80    8      80       0.39      7.50     79.50
0.57    0.65    6.5    65       0.16      6.00     64.50
0.36    0.25    2.5    25       0.28      2.00     24.50
0.62    0.10    1      10       0.42      0.51      9.50
0.42    0.40    4      40       0.13      3.50     39.50
0.61    0.50    5      50       0.11      4.50     49.50

Mean:   0.40    3.99   39.90   (of the X2, Y2, Z2 columns)
SD:     0.244   2.441  24.41

• The shortest distance from (0.5, 0.5) falls on a different
record in each case: row 10 for (X1, X2), row 8 for (X1, Y2),
and row 1 for (X1, Z2) – rescaling one variable changes which
neighbor is “nearest”

9
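
The effect is easy to reproduce in R with the columns of the table above:

X1 <- c(0.07, 0.60, 0.64, 0.12, 0.75, 0.57, 0.36, 0.62, 0.42, 0.61)
X2 <- c(0.09, 0.70, 0.35, 0.15, 0.80, 0.65, 0.25, 0.10, 0.40, 0.50)
Z2 <- 100 * X2   # the same variable in different units

which.min(sqrt((X1 - 0.5)^2 + (X2 - 0.5)^2))   # 10: row 10 is nearest to (0.5, 0.5)
which.min(sqrt((X1 - 0.5)^2 + (Z2 - 0.5)^2))   # 1: Z2 now dominates the distance

# Normalizing undoes the unit change: scaled Z2 equals scaled X2
all.equal(as.vector(scale(Z2)), as.vector(scale(X2)))   # TRUE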
The Number k
• This number decides how many neighbors (where neighbors is
defined based on the distance metric) influence the classification
– This is usually an odd number if the number of classes is 2
• Choice of k is very critical
– A small value of k will result in noise having a higher influence on the
result – risk of overfitting
– A large value of k will reduce noise, but we may miss out on local
structure (too much averaging)
• What happens when we select k=n? (Then every record is
assigned the overall majority class, regardless of its inputs)
• Often k is in the range 1-20
– Depends heavily on the structure of the data
– If k=1, then the algorithm is simply called the nearest neighbor algorithm

10
Choosing k
• Select the best value of k!
– Compute the misclassification rate for a range of
values of k
– Choose the best value based on the validation
sample

Note: Since we now use the validation set to select k,
the validation set is no longer a hold-out set, meaning
data that the method has never seen. Ideally, we would
set aside a third set – the testing set – to judge
out-of-sample performance.

11
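
A sketch of that selection loop with knn() from library(class); the object names (train_input, valid_input, and their label vectors) are assumed to exist already:

library(class)

err <- numeric(20)
for (k in 1:20) {
  pred <- knn(train_input, valid_input, train_output, k = k)
  err[k] <- mean(pred != valid_output)   # validation misclassification rate
}
which.min(err)   # the value of k with the smallest validation error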
KNN in One Dimension
• k=1
x          class
-0.39701   0
-0.11216   0
-0.08226   0
-0.83098   0
 0.172896  0
-0.23603   0
-0.99109   0
-0.0739    0
-0.91048   0
 0.777112  0
-0.32008   0
 0.521552  1
-0.70176   1
 0.397391  1
 0.36739   1
 0.098193  1
 0.813719  1
-0.27989   1
 0.631199  1
 0.378386  1

[Plot: the 20 points on a number line; with k=1 the line is carved into
alternating “classify as 0” / “classify as 1” regions around each point]

No classification errors in training set!
12
KNN in One Dimension
• k=3

(Same 20 training points as on the previous slide)

[Plot: the same points classified with k=3; the decision boundary is at
0.1467, and two training points lie at equal distance from it]

Some classification errors!
13
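
The training-error comparison on the two slides above can be reproduced with knn(), using the 20 points listed there:

library(class)

x <- c(-0.39701, -0.11216, -0.08226, -0.83098, 0.172896, -0.23603, -0.99109,
       -0.0739, -0.91048, 0.777112, -0.32008, 0.521552, -0.70176, 0.397391,
       0.36739, 0.098193, 0.813719, -0.27989, 0.631199, 0.378386)
cls <- factor(c(rep(0, 11), rep(1, 9)))
train <- matrix(x, ncol = 1)

# Predict the training data itself: k=1 memorizes it (0 errors), k=3 does not
for (k in c(1, 3)) {
  pred <- knn(train, train, cls, k = k)
  cat("k =", k, " training error rate:", mean(pred != cls), "\n")
}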
Which is the better model?
• Training error is not a useful guide
• Would need a validation data set to compute error rates
for different k
• Pick model with smallest validation data set error rate
• But now, since the validation data set was used to “tune”
the model, it cannot be used to provide a reliable
estimate of test error rates
• Use a test data set to get an estimate of error rates on new
data.
• All of this is easily done in R
14
Nonlinear Decision Boundaries

• We saw previously that if the decision
boundary is nonlinear, we need to build this
into our Logistic Regression model in advance
• The next example shows that in KNN we need
do nothing, and nonlinearity is automatically
taken care of!
• This is our circular data example from
previous classes

15
KNN – With Circular Data
• k=1

16
KNN – With Circular Data
• k=3

17
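
The circular-data figures do not survive the text export, but the example is easy to re-create; everything below (sample size, radius, grid) is a hypothetical reconstruction, not the original data:

library(class)

set.seed(2)
n  <- 200
x1 <- runif(n, -1, 1); x2 <- runif(n, -1, 1)
y  <- factor(ifelse(x1^2 + x2^2 < 0.5, 1, 0))   # class 1 inside a circle

# Classify a grid of points to visualize the decision regions
grid <- expand.grid(x1 = seq(-1, 1, 0.05), x2 = seq(-1, 1, 0.05))
pred <- knn(cbind(x1, x2), grid, y, k = 3)
plot(grid, col = ifelse(pred == 1, "grey80", "white"), pch = 15)
points(x1, x2, col = ifelse(y == 1, "red", "blue"))   # circular boundary emerges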
Example: Personal Loan Offer
• As part of its customer acquisition efforts,
Universal Bank wants to run a campaign encouraging
current customers to take out a personal loan
• In order to improve target marketing, they want
to find customers that are most likely to accept
the personal loan offer
• They use data from a previous campaign on
5000 customers, out of which 480 (9%)
accepted the personal loan offer
18
Personal Loan Data Description
• The data has information about the customers’
relationship with the bank, as well as some
demographic information

ID                  Customer ID
Age                 Customer's age in completed years
Experience          # years of professional experience
Income              Annual income of the customer ($000)
ZIPCode             Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by UniversalBank?

19
KNN in R
• First partition the data set
• Place the input variables in three matrices (training, validation
and test), and training output in a vector
• The knn(…) function
prediction <- knn(train_input, test_input, train_output, k = i)

For each case in test_input, the algorithm finds the k nearest
neighbors in the training input and uses the majority label
from the training output to make a class prediction
• The function requires library(class)

20
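
Putting the steps together, a minimal end-to-end sketch; the data frame df, its class column y, and the 50/25/25 split are assumptions for illustration:

library(class)
set.seed(1)

# Assumed: df is a data frame with numeric inputs plus a class column y
n    <- nrow(df)
part <- sample(c("train", "valid", "test"), n, replace = TRUE,
               prob = c(0.5, 0.25, 0.25))

X <- scale(df[, names(df) != "y"])   # normalize the inputs
train_input <- X[part == "train", ]; train_output <- df$y[part == "train"]
valid_input <- X[part == "valid", ]; valid_output <- df$y[part == "valid"]
test_input  <- X[part == "test", ];  test_output  <- df$y[part == "test"]

# Tune k on the validation set ...
err <- sapply(1:20, function(k)
  mean(knn(train_input, valid_input, train_output, k = k) != valid_output))
best_k <- which.min(err)

# ... then estimate error on new data once, using the test set
mean(knn(train_input, test_input, train_output, k = best_k) != test_output)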
