A Complete Guide To K Nearest Neighbors Algorithm
https://ashutoshtripathi.com
In this article you will learn how to implement the k-Nearest Neighbours (KNN) algorithm from
scratch using Python. The problem is to predict whether a person will take a personal loan or
not, using the Universal Bank data set.
Table of Contents
1. The intuition behind KNN – understand with the help of a graph
2. How does KNN work as an algorithm?
3. How to find the k nearest neighbours?
4. Deciding k – the hyperparameter in KNN
5. Complete end-to-end example using Python, which includes:
• Exploratory data analysis
• Imputing missing values
• Data pre-processing
• Train/test split of the data
• Training the model using KNN
• Predicting on test data
6. Additional Reading
The intuition behind KNN
If you look carefully at the above scatter plot, you will observe that the test point is closest
to the circled points, and hence its weight will be close to the weights of those two persons,
which is a fair enough answer. These circled points become the neighbours of the test data point.
This is the exact idea behind the KNN algorithm.
How does KNN work as an algorithm?
Let’s take one more example. Consider one predictor variable x and a target variable y, and
suppose we want to predict the value of y for x = 13 (see the data below).
We look for the data points in x which are equal or closest to x = 13; those are known as the
neighbours of the new data point. If we take k = 3 nearest neighbours, these points are 12.5,
13.8 and 10. The y values corresponding to the selected neighbours are 13.5, 14.8 and 11.
Note that k is a hyperparameter; how to decide its value is discussed under the next heading.
We then take the mean of those y values: (11 + 14.8 + 13.5)/3 = 13.1. This is the predicted
value for the new data point x = 13. Whether we take the mean, the median or some other
measure depends on the loss function. In the case of L2 loss, i.e. minimizing the squared
error, we take the mean of the y values, known as the conditional mean. If our loss function
is the L1 loss, we take the median of the neighbours’ y values.
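To make the mechanics concrete, here is a minimal Python sketch of KNN regression on a toy version of this example. The full data table from the article is not shown above, so all points other than the three neighbours (and their y values) are made up for illustration:

```python
import numpy as np

# Toy 1-D training data. The three neighbours of x = 13 and their y
# values match the example above; the remaining points are invented.
x_train = np.array([2.0, 5.0, 10.0, 12.5, 13.8, 18.0, 21.0])
y_train = np.array([3.1, 6.2, 11.0, 13.5, 14.8, 19.0, 22.3])

def knn_regress(x_train, y_train, x_new, k=3):
    """Predict y for x_new as the mean of the k nearest neighbours' y values (L2 loss)."""
    distances = np.abs(x_train - x_new)   # distance to every training point
    nearest = np.argsort(distances)[:k]   # indices of the k closest points
    return y_train[nearest].mean()        # conditional mean

print(knn_regress(x_train, y_train, 13.0))  # (11 + 14.8 + 13.5) / 3 = 13.1
```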
That was an example of predicting a continuous value, i.e. a regression problem. KNN can
also be used for classification problems. The only difference is that in this case we take
the mode of the neighbours’ y values, i.e. the majority class. For example, in the case above,
if the neighbours’ y values are 1, 0, 1, then the majority is 1 and hence we predict that the
data point x = 13 belongs to class 1. This is how KNN can also be used for classification
problems, as in the sketch below.
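A small sketch of the classification variant, reusing the toy x values from the regression sketch with hypothetical class labels; the prediction for x = 13 is the majority vote of its three nearest neighbours:

```python
from collections import Counter
import numpy as np

x_train = np.array([2.0, 5.0, 10.0, 12.5, 13.8, 18.0, 21.0])
labels = np.array([0, 0, 1, 1, 0, 1, 1])  # hypothetical class labels

def knn_classify(x_train, labels, x_new, k=3):
    """Predict the class of x_new as the mode (majority vote) of the k nearest labels."""
    nearest = np.argsort(np.abs(x_train - x_new))[:k]
    return Counter(labels[nearest]).most_common(1)[0][0]

print(knn_classify(x_train, labels, 13.0))  # neighbours' labels are 1, 0, 1 -> class 1
```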
How to find the k nearest neighbours?
Finding the neighbours requires a distance measure. The most commonly used measures are
Euclidean and Manhattan distance for continuous (numeric) variables, and Hamming distance for
categorical variables.
1. Euclidean Distance
Euclidean distance is calculated as the square root of the sum of the squared differences
between a new point (X2) and an existing point (X1).
2. Manhattan Distance
Manhattan distance is the distance between real vectors, calculated as the sum of their
absolute differences.
3. Hamming Distance
Hamming distance is used for categorical variables. If the value x and the value y are the
same, the distance D is 0; otherwise D = 1 (summed over the variables being compared).
Source: Wikipedia
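All three measures are straightforward to implement; a minimal sketch in Python/NumPy:

```python
import numpy as np

def euclidean(x1, x2):
    """Square root of the sum of squared differences between two points."""
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

def manhattan(x1, x2):
    """Sum of the absolute differences between two real-valued vectors."""
    return np.sum(np.abs(np.asarray(x1) - np.asarray(x2)))

def hamming(x1, x2):
    """Number of positions at which two categorical vectors differ."""
    return np.sum(np.asarray(x1) != np.asarray(x2))

print(euclidean([1, 2, 3], [4, 6, 3]))      # sqrt(9 + 16 + 0) = 5.0
print(manhattan([1, 2, 3], [4, 6, 3]))      # 3 + 4 + 0 = 7
print(hamming(["red", "S"], ["red", "M"]))  # differs in 1 position -> 1
```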
Deciding k – the hyperparameter in KNN
We start with some small value of k and keep increasing it as long as it reduces the error in
the predicted values; once the error starts increasing, we stop. Overfitting also needs to be
taken care of here: sometimes we end up choosing a value of k which suits the training data
best but drastically increases the error on test or live data. Hence we divide the data into
three parts – train, validation and test – select k based on the training data, and check that
the model is not overfitting by validating it against the validation data.
This procedure requires multiple iterations before we arrive at the best-suited value of k.
However, we need not do all of this manually; we can write a function, or use the built-in
libraries in Python, to produce the final value of k, as in the sketch below.
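A minimal sketch of such a search using scikit-learn. A synthetic data set stands in for the Universal Bank features here, purely so the snippet runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real features/target, just to make the sketch runnable.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# Hold out a validation set to guard against overfitting in the choice of k.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

best_k, best_acc = 1, 0.0
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best k = {best_k}, validation accuracy = {best_acc:.3f}")
```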
Problem Description
In the following supervised learning activity, we try to predict which customers are likely to
accept the offer of a new personal loan. The data set contains the following fields:
• ID: Customer ID
• Age: Customer’s age in completed years
• Experience: Number of years of professional experience
• Income: Annual income of the customer ($000)
• ZIP Code: Home address ZIP code (not used as a predictor)
• Family: Family size of the customer
• CCAvg: Average spending on credit cards per month ($000)
• Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
• Mortgage: Value of house mortgage, if any ($000)
• Personal Loan: Did this customer accept the personal loan offered in the last campaign? (target variable)
• Securities Account: Does the customer have a securities account with the bank?
• CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
• Online: Does the customer use internet banking facilities?
• Credit Card: Does the customer use a credit card issued by Universal Bank?
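As a starting point for the end-to-end example, the data can be loaded and trimmed with pandas. The file name UniversalBank.csv is an assumption; the column names follow the data dictionary above:

```python
import pandas as pd

# File name is assumed; adjust to wherever the Universal Bank CSV lives.
df = pd.read_csv("UniversalBank.csv")

# Drop identifiers that carry no predictive signal, per the data dictionary.
df = df.drop(columns=["ID", "ZIP Code"])

print(df.shape)
print(df.isna().sum())                     # check for missing values to impute
print(df["Personal Loan"].value_counts())  # class balance of the target
```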
From the above plot of test accuracy against k, we can see that the maximum test accuracy is
obtained for k = 8, after which it stays roughly constant. Hence we finalize k as 8 and train
the model with 8 nearest neighbours.
5.7 Training the model using KNN
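The original training code is not reproduced here; the following is a sketch of this step with scikit-learn, continuing from the dataframe loaded above (column names per the data dictionary). Features are standardized first, since KNN is distance-based and large-valued columns such as Income or Mortgage would otherwise dominate the distances:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

X = df.drop(columns=["Personal Loan"])
y = df["Personal Loan"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize so that every feature contributes comparably to the distance.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# k = 8, as chosen from the accuracy-vs-k plot above.
knn = KNeighborsClassifier(n_neighbors=8).fit(X_train_s, y_train)

# Predict on the held-out test data.
y_pred = knn.predict(X_test_s)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```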
Additional Reading
• Reduced Nearest Neighbour (Reduced NN)
Thank You
For more articles please visit -> https://ashutoshtripathi.com