
K-Nearest Neighbor

Classification
Agenda
• KNN Classification Algorithm
• Solving Business Problems using the KNN Algorithm
• Hands-on
Sample Business Problem
• Let’s assume a money lending company “XYZ”, similar to UpStart, IndiaLends, etc.
• Company XYZ wants to make the money lending system comfortable and safe for lenders as well as for borrowers. The company holds a database of customer details.
• Using a customer’s detailed information from the database, it will calculate a credit score (a discrete value) for each customer.
• The calculated credit score helps the company and lenders clearly understand the credibility of a customer.
• Based on the score, they can decide whether or not they should lend money to a particular customer.
Sample Business Problem
• The customer’s details could include:
  • Educational background
    • Highest degree obtained
    • Cumulative grade point average (CGPA) or marks percentage
    • Reputation of the college
    • Consistency across earlier degrees
    • Whether education loan dues were cleared
  • Employment details
    • Salary
    • Years of experience
    • Any onsite opportunities received
    • Average duration between job changes
Sample Business Problem
• The company (XYZ) uses these kinds of details to calculate the credit score of a customer.
• The process of calculating the credit score from a customer’s details is expensive.
• To reduce this cost, the company observed that customers with similar background details tend to get similar credit scores.
• So it decided to use the already available customer data and predict a new customer’s credit score by comparing it with similar records.
• These kinds of problems, finding the most similar existing customers, are handled by the K-nearest neighbor classifier.
Introduction
• The K-nearest neighbor classifier is one of the introductory supervised classifiers that every data science learner should know.
• Fix & Hodges proposed the K-nearest neighbor algorithm in 1951 for performing pattern classification tasks.
• For simplicity, this classifier is usually just called the KNN classifier.
• KNN addresses pattern recognition problems and is also a strong choice for many classification-related tasks.
• The simplest version of the K-nearest neighbor classifier predicts the target label by finding the class of the nearest neighbor.
• The closest neighbors are identified using a distance measure such as Euclidean distance.
K-Nearest Neighbour Algorithm

To determine the class of a new example E:

• Calculate the distance between E and every example in the training set
• Select the K examples in the training set nearest to E
• Assign E to the most common class among its K nearest neighbors

Distance Between Neighbors

Each example is represented with a set of numerical attributes

Jay: Age=35, Income=95K, No. of credit cards=3
Rina: Age=41, Income=215K, No. of credit cards=2

• “Closeness” is defined in terms of the Euclidean distance between two examples
• The Euclidean distance between X = (x₁, x₂, x₃, …, xₙ) and Y = (y₁, y₂, y₃, …, yₙ) is defined as:

D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

Distance(Jay, Rina) = √((35 − 41)² + (95,000 − 215,000)² + (3 − 2)²)
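As a quick illustration, the distance above can be computed with a minimal Python sketch (the customer tuples below just hold this slide's numbers, not a real API):

import math

# Each customer is a tuple of numerical attributes:
# (age, income, number of credit cards)
jay = (35, 95_000, 3)
rina = (41, 215_000, 2)

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance(jay, rina))  # ≈ 120000.0, dominated by the income term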


K-Nearest Neighbours: Example

Customer   Age   Income   No. credit cards   Response
Jay        35    35K      3                  No
Rina       22    50K      2                  Yes
Hema       63    200K     1                  No
Tommy      59    170K     1                  No
Neil       25    40K      4                  Yes
Dravid     37    50K      2                  ?
K-Nearest Neighbours: Example

Customer   Age   Income   No. credit cards   Response   Distance from Dravid
Jay        35    35K      3                  No         √((35 − 37)² + (35 − 50)² + (3 − 2)²) = 15.16
Rina       22    50K      2                  Yes        15
Hema       63    200K     1                  No         152.23
Tommy      59    170K     1                  No         122
Neil       25    40K      4                  Yes        15.74
Dravid     37    50K      2                  ?          0
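The same computation can be scripted; here is a minimal Python sketch (the dictionary layout is only for illustration, with income in thousands as in the table):

import math

# Training customers: (age, income in thousands, no. of credit cards), response
customers = {
    "Jay":   ((35, 35, 3), "No"),
    "Rina":  ((22, 50, 2), "Yes"),
    "Hema":  ((63, 200, 1), "No"),
    "Tommy": ((59, 170, 1), "No"),
    "Neil":  ((25, 40, 4), "Yes"),
}
dravid = (37, 50, 2)  # new customer, response unknown

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Distance from Dravid to every training customer, sorted nearest first
for name, (feats, response) in sorted(customers.items(),
                                      key=lambda kv: euclidean(kv[1][0], dravid)):
    print(f"{name}: {euclidean(feats, dravid):.2f} ({response})")
# With k = 3 the nearest neighbours are Rina, Jay and Neil, so the
# majority response predicted for Dravid is "Yes".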
K-Nearest Neighbours

Jay: Age=35, Income=95K, No. of credit cards=3
Rina: Age=41, Income=215K, No. of credit cards=2

Distance(Jay, Rina) = √((35 − 41)² + (95,000 − 215,000)² + (3 − 2)²)

• The distance between neighbors can be dominated by attributes with relatively large values (e.g., income in our example)
• It is therefore important to normalize the features (e.g., map the values to numbers between 0 and 1)

Example: Income
Highest income = 200K
Dravid’s income is normalized to 50/200, Rina’s income is normalized to 50/200, etc.
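A small Python sketch of this divide-by-the-maximum normalization, using the feature values from the table of customers:

# Each attribute is divided by its maximum value over all customers,
# so every feature ends up between 0 and 1.
customers = {
    "Jay":    (35, 35, 3),
    "Rina":   (22, 50, 2),
    "Hema":   (63, 200, 1),
    "Tommy":  (59, 170, 1),
    "Neil":   (25, 40, 4),
    "Dravid": (37, 50, 2),
}

max_per_feature = [max(feats[i] for feats in customers.values()) for i in range(3)]

normalized = {
    name: tuple(value / max_value for value, max_value in zip(feats, max_per_feature))
    for name, feats in customers.items()
}
print(normalized["Dravid"])  # (37/63, 50/200, 2/4) ≈ (0.587, 0.25, 0.5)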
K-Nearest Neighbours

Normalization of Variables

Customer   Age            Income           No. credit cards   Response
Jay        35/63 = 0.56   35/200 = 0.175   3/4 = 0.75         No
Rina       22/63 = 0.34   50/200 = 0.25    2/4 = 0.5          Yes
Hema       63/63 = 1      200/200 = 1      1/4 = 0.25         No
Tommy      59/63 = 0.93   170/200 = 0.85   1/4 = 0.25         No
Neil       25/63 = 0.39   40/200 = 0.2     4/4 = 1            Yes
Dravid     37/63 = 0.58   50/200 = 0.25    2/4 = 0.5          Yes
K-Nearest Neighbor

• Distance works naturally with numerical attributes:

d(Jay, Dravid) = √((35 − 37)² + (35 − 50)² + (3 − 2)²) = 15.16

• What if we have nominal attributes?

Example: Married

Customer   Married   Income   No. credit cards   Response
Jay        Yes       35K      3                  No
Rina       No        50K      2                  Yes
Hema       No        200K     1                  No
Tommy      Yes       170K     1                  No
Neil       No        40K      4                  Yes
Dravid     Yes       50K      2                  Yes
Non-Numeric Data

• Feature values are not always numbers

• Examples
  • Boolean values: yes/no, presence or absence of an attribute
  • Categories: colors, educational attainment, gender
• How do these values factor into the computation of distance?
Dealing with Non-Numeric Data

• Boolean values => convert to 0 or 1 (see the encoding sketch after this list)
  • Applies to yes/no and presence/absence attributes
• Non-binary characterizations
  • Use a natural progression when applicable; e.g., educational attainment: GS, HS, College, MS, PhD => 1, 2, 3, 4, 5
  • Assign arbitrary numbers, but be careful about the resulting distances; e.g., color: red, yellow, blue => 1, 2, 3
• How about unavailable data? (A 0 value is not always the answer)
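One simple way to do this encoding in Python, as a sketch with made-up mapping tables:

# Sketch: turn non-numeric attributes into numbers before computing distances
boolean_map = {"Yes": 1, "No": 0}                                    # yes/no -> 1/0
education_map = {"GS": 1, "HS": 2, "College": 3, "MS": 4, "PhD": 5}  # natural ordering

customer = {"Married": "Yes", "Education": "MS", "Income": 50}

encoded = (
    boolean_map[customer["Married"]],
    education_map[customer["Education"]],
    customer["Income"],
)
print(encoded)  # (1, 4, 50)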
Distance measures
• How do we determine the similarity between data points?
• Let x = (x₁, …, xₙ) and y = (y₁, …, yₙ) be n-dimensional vectors of data points for objects g₁ and g₂
  – g₁, g₂ can be, for example, two different genes in microarray data
How do we calculate distance mathematically?
1. Euclidean Distance

[Figure: Euclidean distance between two points x = (x₁, x₂) and y = (y₁, y₂) in 2-D]
[Figure: 2-D example, x = (5, 5) and y = (9, 8); distance = √((9 − 5)² + (8 − 5)²) = 5]
[Figure: 3-D example, x = (5, 5, 7) and y = (9, 8, 3); distance = √(4² + 3² + 4²) ≈ 6.4]

The Euclidean distance is also known as the L2 norm.
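Those figure values can be checked quickly in Python (a small sketch assuming numpy is installed):

import numpy as np

# 2-D example from the figure: distance between (5, 5) and (9, 8)
print(np.linalg.norm(np.array([9, 8]) - np.array([5, 5])))        # 5.0

# 3-D example: distance between (5, 5, 7) and (9, 8, 3)
print(np.linalg.norm(np.array([9, 8, 3]) - np.array([5, 5, 7])))  # ≈ 6.40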
k-NN Variations

• Value of k
– A larger k increases confidence in the prediction
– Note that if k is too large, the decision may be skewed
• Weighted evaluation of nearest neighbors (see the sketch below)
  – A plain majority vote may unfairly skew the decision
  – Revise the algorithm so that closer neighbors have greater “vote weight”
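In scikit-learn, this weighted variation can be sketched as follows (the toy data mirrors the demo at the end of the deck):

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# weights="distance" gives closer neighbours a larger vote weight
# instead of a plain, unweighted majority vote.
weighted_knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
weighted_knn.fit(X, y)
print(weighted_knn.predict([[1.1]]))  # array([0])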
How to Choose "K"?

• For k = 1, …, 5, point x gets classified correctly (red class)
• For a larger k, the classification of x is wrong (blue class)
How to Choose "K"?

• Selecting the value of K in K-nearest neighbor is the most critical problem.
• A small value of K means that noise has a higher influence on the result, i.e., the probability of overfitting is very high.
• A large value of K makes the algorithm computationally expensive and defeats the basic idea behind KNN (that nearby points are likely to have similar classes).
• A simple approach to selecting K is K = √n, where n is the number of training samples.
• It depends on the individual case; at times the best process is to run through each possible value of K and test the result (see the sketch below).
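One way to run that scan, as a sketch assuming scikit-learn and a labelled dataset (the iris data here is only a stand-in for the lending data):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate each candidate K and keep the one with the best mean accuracy
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))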
KNN algorithm Pseudo Code
• Let (Xᵢ, Cᵢ), where i = 1, 2, ⋯, n, be the data points; Xᵢ denotes the feature values and Cᵢ denotes the label of Xᵢ for each i
• Assuming the number of classes is c, Cᵢ ∈ {1, 2, 3, ⋯, c} for all values of i
• Let x be a point whose label is not known
• We would like to find its label class using the k-nearest neighbor algorithm
KNN algorithm Pseudo Code
• Calculate d(x, xᵢ) for i = 1, 2, ⋯, n, where d denotes the Euclidean distance between the points.
• Consider a setup with n training samples, where xᵢ is the i-th training data point.
• The training data points are categorized into c classes.
• Using KNN, we want to predict the class of the new data point x.
• The first step is to calculate the (Euclidean) distance between the new data point and all the training data points.
• The next step is to arrange all the distances in non-decreasing order.
• Assuming a positive value of k, keep the k smallest values from the sorted list.
• Now we have the k nearest distances.
• Let kᵢ denote the number of points belonging to the i-th class among these k points.
• If kᵢ > kⱼ for all j ≠ i, then put x in class i (a Python sketch follows).
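A minimal from-scratch Python sketch of this pseudo code (function and variable names are mine, not from the slides; the data is the normalized customer table shown earlier):

import math
from collections import Counter

def knn_predict(training_points, training_labels, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from x to every training point
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x))), ci)
        for xi, ci in zip(training_points, training_labels)
    ]
    # Step 2: sort in non-decreasing order and keep the k smallest
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Normalized customers from the earlier slide: (age, income, credit cards)
X_train = [(0.56, 0.175, 0.75), (0.34, 0.25, 0.5), (1.0, 1.0, 0.25),
           (0.93, 0.85, 0.25), (0.39, 0.2, 1.0)]
y_train = ["No", "Yes", "No", "No", "Yes"]
print(knn_predict(X_train, y_train, (0.58, 0.25, 0.5), k=3))  # "Yes"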
KNN algorithm: Example
• Consider the image shown here, where we have two different target classes: white circles and orange circles.
• We have a total of 26 training samples.
• We would like to predict the target class of the blue circle.
• Taking the k value as three, we need to calculate the distance using a similarity measure such as Euclidean distance.
• A smaller distance means the points are closer, i.e., more likely to share a class.
• In the image, we have calculated the distances and placed the circles with the smallest distances to the blue circle inside the big circle.
Advantages and Disadvantages
• Advantages
  • Makes no assumptions about the distribution of classes in feature space
  • Needs no prior knowledge about the structure of the data in the training set
  • No retraining is required when a new training pattern is added to the existing training set
  • Works for multiple classes simultaneously
  • Easy to implement and understand
• Disadvantages
  • Fixing the optimal value of K is a challenge
  • Does not output a model; it calculates distances for every new point (a lazy learner)
  • For every test point, the distance must be computed to all the training data, so testing can take a lot of time
Demo Using Python
Sample Code in Python

from sklearn.neighbors import KNeighborsClassifier

# Toy training set: four 1-D points with binary labels
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# Fit a 3-nearest-neighbour classifier and classify a new point
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))  # array([0])

Summary
• KNN classification algorithm
• Different distance measures
• KNN algorithm pseudo code
• Advantages and disadvantages
• Case study 1 (using KNN)
Thanks!
