
Tech Tunnel

https://ashutoshtripathi.com

A Complete Guide to K-Nearest Neighbours Algorithm – KNN using Python
K-Nearest Neighbours (KNN) is a simple yet powerful machine learning algorithm. It can be used both for classification and for regression, that is, predicting a continuous value. The basic idea behind KNN is to find the k nearest data points, known as neighbours, to the new data point for which we need to make a prediction. For regression, the prediction is the conditional mean of the neighbours’ y-values; for classification, it is the mode (majority value) of the neighbours’ y-values, which becomes the predicted class of the new data point.

In this article you will learn how to implement the k-Nearest Neighbours (KNN) algorithm in Python using scikit-learn. The problem addressed is to predict whether a person will accept a personal loan offer, using the Universal Bank data set.

Table of Contents
1. The intuition behind KNN – understand with the help of a graph
2. How KNN as an algorithm works
3. How to find the k-Nearest Neighbours?
4. Deciding k – the hyperparameter in KNN
5. A complete end-to-end example using Python, which includes:
• Exploratory data analysis
• Imputing missing values
• Data pre-processing
• Train-test split of data
• Training the model using KNN
• Predicting on test data
6. Additional Reading

The intuition behind KNN – understand with the help of a graph
Let’s start with a basic example and try to understand the intuition behind the KNN algorithm. Consider the following example containing a data frame with three columns: Height, Age and Weight. The values are randomly chosen.
The problem now is to predict the weight of a person whose height is 5.50 feet and whose age is 38 years (see the 10th row in the data set). In the scatter plot of Height against Age below, this test point is marked with an “x” in blue.

If you look carefully at the scatter plot above, you can see that this test point is closest to the circled points, and hence its weight will be close to the weights of those two persons. That is a fair enough answer. These circled points become the neighbours of the test data point, and this is the exact idea behind the KNN algorithm.
How KNN as an algorithm works
Let’s take one more example. Consider one predictor variable x and a target variable y, and suppose we want to predict the value of y for x = 13 (see the data below).

We look for data points in x that are equal or close to x = 13; those are known as the neighbours of the new data point. If we take k = 3 nearest neighbours, these points are 12.5, 13.8 and 10, and the corresponding y-values of the selected neighbours are 13.5, 14.8 and 11. (Note that k is a hyperparameter; how to choose it is discussed under the next heading.) Taking the mean of those y-values gives (11 + 14.8 + 13.5)/3 = 13.1, which is the predicted value for the new data point x = 13. Whether we take the mean, the median or some other measure depends on the loss function. In the case of L2 loss, that is, minimizing the squared errors, we take the mean of the y-values, known as the conditional mean. If our loss function is L1 loss, we take the median of the neighbours’ y-values instead.

According to statistics, “the best prediction of y at a point X = x is the conditional mean in the case of L2 loss, and the conditional median in the case of L1 loss”.

That was an example of predicting a continuous value, that is, a regression problem. KNN can also be used for classification; the only difference is that we take the mode of the neighbours’ y-values, that is, the majority class. For example, in the case above, if the neighbours’ y-values were 1, 0, 1, the majority would be 1, so we would predict that the data point x = 13 belongs to class 1. This is how KNN can also be used for classification problems.
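As a minimal sketch of this idea in plain Python (the three nearest points and their y-values match the worked example above; the other data points are made up for illustration):

import statistics

def knn_predict(x_train, y_train, x_new, k=3, task='regression'):
    # Sort training points by distance to the new point
    ordered = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_new))
    # Take the y-values of the k nearest neighbours
    neighbour_ys = [y for _, y in ordered[:k]]
    if task == 'regression':
        return statistics.mean(neighbour_ys)  # conditional mean (L2 loss)
    return statistics.mode(neighbour_ys)      # majority vote for classification

x = [2.0, 4.1, 10.0, 12.5, 13.8, 19.0]
y = [3.0, 5.2, 11.0, 13.5, 14.8, 20.1]
print(knn_predict(x, y, 13, k=3))  # ≈ 13.1, matching the worked example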

How to find the k-Nearest Neighbours?


To find the nearest neighbours, the algorithm calculates the distance between the new data point and the training data points, and then shortlists the points that are closest to the new data point. These shortlisted points are known as the nearest neighbours of the new data point. Two questions now arise: how to calculate the distance between points, and how many (k) of the closest points to shortlist. The choice of k is discussed under the next heading; for now, let’s understand how to calculate the distance.

The most commonly used distance measures are Euclidean and Manhattan distance for continuous value prediction, that is, regression, and Hamming distance for categorical or classification problems.
1. Euclidean Distance
Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (X2) and an existing point (X1).
2. Manhattan Distance
This is the distance between real-valued vectors, calculated as the sum of their absolute differences.
3. Hamming Distance
This is used for categorical variables. If the value x and the value y are the same, the distance D is 0; otherwise D is 1.

Source: Wikipedia
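As a small illustration of these three measures, here is a sketch in plain Python, treating points as lists of coordinates (the example values are illustrative only):

import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x1 - x2) ** 2 for x1, x2 in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x1 - x2) for x1, x2 in zip(a, b))

def hamming(a, b):
    # Count of positions where the categorical values differ
    return sum(x1 != x2 for x1, x2 in zip(a, b))

print(euclidean([5.5, 38], [5.0, 30]))                # ≈ 8.02
print(manhattan([5.5, 38], [5.0, 30]))                # 8.5
print(hamming(['M', 'Graduate'], ['F', 'Graduate']))  # 1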

4. Deciding k – the hyperparameter in KNN

k is simply the number of nearest neighbours to be selected to finally predict the outcome for a new data point. The choice of k is very important, although there is no mathematical formula for deciding it.

We start with some value of k and keep increasing it as long as it reduces the error in the predicted values; once the error starts increasing, we stop. Overfitting also needs to be taken care of here: sometimes we end up choosing a large value of k that suits the training data best but drastically increases the error on test or live data. Hence we divide the data into three parts: train, validation and test. We select k based on the training data and check that it is not overfitting by validating against the validation data.

This procedure requires multiple iterations before we finally get the best-suited value of k. However, we need not do all of this manually; we can write a function or use the built-in libraries in Python that produce the final k value.
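For example, scikit-learn’s GridSearchCV can automate this search using cross-validation (a sketch, assuming the X_train and y_train arrays prepared in section 5.5 below):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search over candidate k values with 5-fold cross-validation
param_grid = {'n_neighbors': range(1, 15)}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)  # e.g. {'n_neighbors': 8}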

5. A complete end-to-end example using Python

5.1 Problem Description
In the following supervised learning activity, we try to predict which customers are likely to accept the offer of a new personal loan. The data set contains the following fields:
• ID: Customer ID
• Age: Customer’s age in completed years
• Experience: #years of professional experience
• Income: Annual income of the customer ($000)
• ZIP Code: Home address ZIP code. Do not use ZIP code as a predictor.
• Family: Family size of the customer
• CCAvg: Avg. spending on credit cards per month ($000)
• Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
• Mortgage: Value of house mortgage if any. ($000)
• Personal Loan: Did this customer accept the personal loan offered in the last campaign?
• Securities Account: Does the customer have a securities account with the bank?
• CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
• Online: Does the customer use internet banking facilities?
• Credit Card: Does the customer use a credit card issued by Universal Bank?

5.2 Exploratory data analysis
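The original EDA output is not reproduced here; a minimal sketch of the usual first steps (the file name UniversalBank.csv is an assumption):

import pandas as pd

data = pd.read_csv('UniversalBank.csv')      # hypothetical file name
print(data.shape)                            # number of rows and columns
print(data.head())                           # first few records
print(data.describe())                       # summary statistics
print(data['Personal Loan'].value_counts())  # class balance of the target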


5.3 Imputing missing values
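The original imputation code is not shown; a hedged sketch of one common approach (checking for gaps and filling numeric columns with the median):

# Inspect missing values per column
print(data.isnull().sum())

# Fill numeric gaps with the column median (one common choice)
data = data.fillna(data.median(numeric_only=True))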
5.4 Data Pre-processing
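A sketch of the pre-processing implied by the problem description: drop the ID and ZIP Code columns (the description says not to use ZIP code) and put the features on comparable scales, since KNN is distance based. Strictly, the scaler should be fitted on the training split only; this sketch scales everything for brevity.

from sklearn.preprocessing import StandardScaler

# Drop the identifier and ZIP code columns
data = data.drop(['ID', 'ZIP Code'], axis=1)

# Standardize the features so no single feature dominates the distance
feature_cols = data.columns.drop('Personal Loan')
data[feature_cols] = StandardScaler().fit_transform(data[feature_cols])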

5.5 Train-test split of data

from sklearn.model_selection import train_test_split

X = data.drop('Personal Loan', axis=1).values
y = data['Personal Loan'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
5.6 Deciding k – the number of nearest neighbours

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Set up arrays to store training and test accuracies
neighbors = np.arange(1, 15)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    # Set up a KNN classifier with k neighbours
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the test set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying number of neighbours')
plt.plot(neighbors, test_accuracy, label='Testing Accuracy')
plt.plot(neighbors, train_accuracy, label='Training Accuracy')
plt.legend()
plt.xlabel('Number of neighbours')
plt.ylabel('Accuracy')
plt.show()

From the plot above we can see that the maximum test accuracy occurs at k = 8 and stays roughly constant afterwards. Hence we finalize k as 8 and train the model with 8 nearest neighbours.
5.7 Training the model using KNN
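The original code for this step is not reproduced here; a minimal sketch consistent with the k chosen above:

# Train the final classifier with the chosen k = 8
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)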

5.8 Predicting on test data


# Get accuracy. Note: for classification algorithms,
# the score method returns accuracy.
knn.score(X_test, y_test)

5.9 Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. Scikit-learn provides the confusion_matrix function to calculate it.
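A minimal sketch using the model trained above (y_pred is introduced here and reused in the classification report below):

from sklearn.metrics import confusion_matrix

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))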
5.10 Classification Report

Another important report is the classification report. It is a text summary of the precision, recall and F1 score for each class. Scikit-learn provides the classification_report function to calculate it.
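A sketch, reusing y_pred from the confusion matrix step:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))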

Some Important Points

• KNN is a model-free algorithm: it makes no assumption about the form of the function f. In regression, by contrast, we end up with a function in terms of x and y, including coefficients and a constant term.
• Computational complexity increases with the size of the data set.
• It suffers from the “curse of dimensionality”: as the number of dimensions increases, accuracy tends to drop.
• Increasing the number of neighbours produces smoother decision boundaries.

Improving (Speeding up) KNN

Clustering as a Pre-processing Step
• Eliminate most points (keep only cluster centroids)
• Apply KNN (a sketch follows below)
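A sketch of this idea, assuming the NumPy arrays from section 5.5 and an arbitrary choice of 10 centroids per class:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Replace each class's points with a handful of cluster centroids
centroids, labels = [], []
for cls in np.unique(y_train):
    km = KMeans(n_clusters=10, n_init=10, random_state=42)
    km.fit(X_train[y_train == cls])
    centroids.append(km.cluster_centers_)
    labels.extend([cls] * 10)

# Apply KNN to the much smaller centroid set
knn_small = KNeighborsClassifier(n_neighbors=3)
knn_small.fit(np.vstack(centroids), labels)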
Condensed NN

• Retain samples closest to the “decision boundaries”
• Decision-boundary consistent: a subset whose nearest-neighbour decision boundary is identical to the boundary of the entire training set
• Minimum consistent set: the smallest subset of the training data that correctly classifies all of the original training data

Reduced NN

• Remove a sample if doing so does not cause any incorrect classifications. The subset is built iteratively (see the sketch after this list):
1. Initialize the subset with a single training example
2. Classify all remaining samples using the subset, and transfer any incorrectly classified samples to the subset
3. Return to step 2 until no transfers occur or the subset is full
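A sketch of the iterative procedure above, assuming NumPy arrays X and y and using a 1-NN classifier to test the subset:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense(X, y, max_size=None):
    keep = [0]  # step 1: start the subset with a single training example
    transferred = True
    while transferred and (max_size is None or len(keep) < max_size):
        transferred = False
        nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        for i in range(len(X)):
            if i in keep:
                continue
            # Step 2: transfer any sample the current subset misclassifies
            if nn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)
                transferred = True
                break  # step 3: refit on the grown subset and repeat
    return X[keep], y[keep]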

Thank You
For more articles, please visit -> https://ashutoshtripathi.com
