
Introduction to KNN

KNN, which stands for K Nearest Neighbors, is a supervised machine learning algorithm that
classifies a new data point into a target class based on the features of its neighboring
data points. To understand how the KNN algorithm works, consider the following scenario.

Figure 1 Two classes with one unknown object


In the above image, we have two classes of data, namely Class A (squares) and Class B
(triangles). The problem is to assign the new input data point to one of these two classes
using the KNN algorithm. The first step in the KNN algorithm is to define the value of K.
But what does the K in KNN stand for? K stands for the number of nearest neighbors, and
hence the name K Nearest Neighbors (KNN).

Figure 2 K=3 (Three Nearest Neighbors)


In the above figure we defined the value of K as 3. This means that the algorithm will
consider the three neighbors that are closest to the new data point in order to decide its
class. Closeness between data points is calculated using measures such as Euclidean or
Manhattan distance. At K = 3, the neighbors include two squares and one triangle, so the
new data point would be assigned to Class A (squares).

Figure 3 K=7 (Seven Nearest Neighbors)


But what if the value of K is set to 7? Then the algorithm looks for the seven nearest
neighbors and classifies the new data point into the class it is most similar to. At
K = 7, the neighbors include three squares and four triangles, so the new data point
would be assigned to Class B (triangles), since the majority of its neighbors belong to
Class B.

Figure 4 Unknown object assigned to the majority class, Class B
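
The voting step itself is simple to express in code. Below is a minimal Python sketch of how the majority vote flips as K grows; the neighbor labels and their ordering are invented to match Figures 2-4 ('A' = square, 'B' = triangle):

```python
from collections import Counter

# Labels of the neighbors, sorted from nearest to farthest.
# This ordering is invented purely to mirror Figures 2-4.
neighbors_by_distance = ['A', 'A', 'B', 'B', 'B', 'A', 'B']

for k in (3, 7):
    votes = Counter(neighbors_by_distance[:k])  # count classes among the K nearest
    predicted = votes.most_common(1)[0][0]      # class with the most votes
    print(f"K={k}: votes={dict(votes)} -> class {predicted}")

# K=3: votes={'A': 2, 'B': 1} -> class A
# K=7: votes={'A': 3, 'B': 4} -> class B
```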


Features of KNN Algorithm
The KNN algorithm has the following features:
1. KNN is a supervised learning algorithm: it uses a labeled input data set to predict
the output for new data points.
2. It is one of the simplest machine learning algorithms and can be easily implemented
for a wide variety of problems.
3. It is based mainly on feature similarity: KNN checks how similar a data point is to
its neighbors and classifies it into the class it is most similar to.
4. KNN is a non-parametric model, which means that it makes no assumptions about the
underlying data distribution. This helps it handle realistic data.
5. It memorizes the training data set instead of learning a discriminative function from
the training data.
6. KNN can be used for solving both classification and regression problems (a short
library example follows Figure 5).

Figure 5 Features of the KNN algorithm: supervised learning, based on feature similarity, simple machine learning algorithm, non-parametric, memorizes the training data set, used for classification and regression
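
Because KNN handles both classification and regression, libraries such as scikit-learn expose it as two separate estimators. A minimal sketch, assuming scikit-learn is installed; the toy data points and labels here are invented purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy labeled data: two features per point, two classes.
X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = ['A', 'A', 'A', 'B', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)  # K = 3
clf.fit(X, y)                              # "training" just stores the data
print(clf.predict([[2, 2], [6, 5]]))       # -> ['A' 'B']

# The same idea works for regression: the prediction is the average
# of the K nearest targets instead of a majority vote.
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, [1.0, 1.0, 1.0, 9.0, 9.0, 9.0])
print(reg.predict([[2, 2]]))               # -> [1.]
```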


Outline of Proposed Approach

1. Start.
2. Compute the distance between the given unknown object and all other objects.
3. Initialize the value of K.
4. Select the K nearest neighbors.
5. Apply majority voting over the K nearest neighbors.
6. Decide the class of the unknown object.
7. End.

Figure 6 Outline of KNN Algorithm


KNN Algorithm Pseudo Code
1. Calculate D(x, xᵢ) for i = 1, 2, ..., n, where D denotes the Euclidean distance
between the points.
2. Arrange the n calculated Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let kᵢ denote the number of points belonging to the i-th class among the k nearest
points (kᵢ ≥ 0).
6. If kᵢ > kⱼ for all j ≠ i, then put x in class i.
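
The pseudo code translates almost line for line into Python. A minimal sketch, using Euclidean distance and ignoring the tie case (which the worked example below runs into); the tiny 2-D data set at the bottom is invented for illustration:

```python
import math
from collections import Counter

def knn_classify(train, labels, x, k):
    """Classify point x from labeled training points, following the pseudo code."""
    # Step 1: Euclidean distance from x to every training point xi.
    dists = [math.dist(x, xi) for xi in train]
    # Step 2: sort indices by distance (non-decreasing).
    order = sorted(range(len(train)), key=lambda i: dists[i])
    # Steps 3-4: take the labels of the k nearest points.
    nearest_labels = [labels[i] for i in order[:k]]
    # Steps 5-6: count class membership and return the majority class.
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny invented example: two clusters in 2-D space.
train = [(1, 1), (2, 1), (8, 8), (9, 8)]
labels = ['A', 'A', 'B', 'B']
print(knn_classify(train, labels, (2, 2), k=3))  # -> 'A'
```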
Illustrate with Example
Table 1 Simple training data set with 12 records

S.No.  Customer Name  Age  Loan     Class (Default)
1      John           25   40000    N
2      Smith          35   60000    N
3      Pat            40   62000    Y
4      Alex           45   80000    N
5      Jade           20   20000    N
6      Jim            48   220000   Y
7      Jack           33   150000   Y
8      Kate           35   120000   N
9      Mark           52   18000    N
10     Anil           23   95000    Y
11     George         60   100000   N
12     Andrew         48   142000   ?

We need to predict Andrew's default status.

Euclidean distance: The Euclidean distance between any two instances is the
length of the line segment connecting them. In this example, each record has 2
numeric attributes (Age and Loan), so each record is a point in 2-dimensional
space. In general, if x = (x₁, x₂, ..., xₙ) and y = (y₁, y₂, ..., yₙ) are two
points, then the distance from x to y is given by:

Euclidean distance = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

The first step is to calculate the Euclidean distance.

Dist(John (X₁, Y₁), Andrew (X₂, Y₂))
= √((X₁ − X₂)² + (Y₁ − Y₂)²)
= √((25 − 48)² + (40000 − 142000)²)

Dist(John, Andrew) ≈ 102000

Similarly, we need to calculate the distance for all other objects.
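
The distance column of Table 2 below can be reproduced in a few lines of Python. A minimal sketch, with the records typed in from Table 1 (Andrew's loan is 142000):

```python
import math

# (name, age, loan, default) records from Table 1.
records = [
    ("John", 25, 40000, "N"),  ("Smith", 35, 60000, "N"),
    ("Pat", 40, 62000, "Y"),   ("Alex", 45, 80000, "N"),
    ("Jade", 20, 20000, "N"),  ("Jim", 48, 220000, "Y"),
    ("Jack", 33, 150000, "Y"), ("Kate", 35, 120000, "N"),
    ("Mark", 52, 18000, "N"),  ("Anil", 23, 95000, "Y"),
    ("George", 60, 100000, "N"),
]
andrew = (48, 142000)  # (age, loan) of the unknown object

for name, age, loan, default in records:
    d = math.dist((age, loan), andrew)   # Euclidean distance to Andrew
    print(f"{name:8s} {default}  {d:>10.0f}")

# Jack is nearest (~8000), then Kate (~22000), George (~42000), ...
```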
Table 2 Euclidean distance of each object from the unknown object (Andrew)

S.No.  Customer Name  Age  Loan     Class (Default)  Distance
1      John           25   40000    N                102000
2      Smith          35   60000    N                82000
3      Pat            40   62000    Y                80000
4      Alex           45   80000    N                62000
5      Jade           20   20000    N                122000
6      Jim            48   220000   Y                78000
7      Jack           33   150000   Y                8000
8      Kate           35   120000   N                22000
9      Mark           52   18000    N                124000
10     Anil           23   95000    Y                47000
11     George         60   100000   N                42000
12     Andrew         48   142000   ?                -

Table 3 Five nearest neighbors with minimum distance

S.No.  Customer Name  Age  Loan     Class (Default)  Distance  Rank (nearest first)
1      John           25   40000    N                102000
2      Smith          35   60000    N                82000
3      Pat            40   62000    Y                80000
4      Alex           45   80000    N                62000     5
5      Jade           20   20000    N                122000
6      Jim            48   220000   Y                78000
7      Jack           33   150000   Y                8000      1
8      Kate           35   120000   N                22000     2
9      Mark           52   18000    N                124000
10     Anil           23   95000    Y                47000     4
11     George         60   100000   N                42000     3
12     Andrew         48   142000   ?                -
Let K = 2
With K = 2, there is one Default = Y (Jack, 33, 150000, Y) and one Default = N (Kate,
35, 120000, N). The vote is tied, so we cannot decide the class for Andrew. (Ties like
this are why an odd K is usually preferred for two-class problems.)

Let K = 3
With K = 3, there is one Default = Y (Jack, 33, 150000, Y) and two Default = N (Kate,
35, 120000, N) and (George, 60, 100000, N). Out of the three closest objects, only one
is Y and two are N, so we can say the default class for Andrew is N.

Let K = 4
With K = 4, there are two Default = Y (Jack, 33, 150000, Y) and (Anil, 23, 95000, Y)
and two Default = N (Kate, 35, 120000, N) and (George, 60, 100000, N). Out of the four
closest objects, two are Y and two are N, so again we cannot decide the class for Andrew.

Let K = 5
With K = 5, there are two Default = Y (Jack, 33, 150000, Y) and (Anil, 23, 95000, Y)
and three Default = N (Kate, 35, 120000, N), (George, 60, 100000, N) and (Alex, 45,
80000, N). Out of the five closest objects, two are Y and three are N, so we can say
the default class for Andrew is N.
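
Running the walk-through above in code makes the ties at K = 2 and K = 4 explicit. A self-contained Python sketch over the Table 1 data (the tie check assumes exactly two classes, as in this example):

```python
import math
from collections import Counter

# Table 1 records as (name, age, loan, default); Andrew is the query point.
records = [
    ("John", 25, 40000, "N"),  ("Smith", 35, 60000, "N"),
    ("Pat", 40, 62000, "Y"),   ("Alex", 45, 80000, "N"),
    ("Jade", 20, 20000, "N"),  ("Jim", 48, 220000, "Y"),
    ("Jack", 33, 150000, "Y"), ("Kate", 35, 120000, "N"),
    ("Mark", 52, 18000, "N"),  ("Anil", 23, 95000, "Y"),
    ("George", 60, 100000, "N"),
]
andrew = (48, 142000)

# Sort every record by its Euclidean distance to Andrew (as in Table 2).
by_dist = sorted(records, key=lambda r: math.dist((r[1], r[2]), andrew))

for k in (2, 3, 4, 5):
    votes = Counter(r[3] for r in by_dist[:k])
    if len(votes) > 1 and len(set(votes.values())) == 1:
        print(f"K={k}: {dict(votes)} -> tie, cannot decide")
    else:
        print(f"K={k}: {dict(votes)} -> class {votes.most_common(1)[0][0]}")

# K=2 and K=4 are ties; K=3 and K=5 both predict class N for Andrew.
```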
Pros of KNN
1. Simple to implement.
2. Flexible in the choice of features and distance measure.
3. Naturally handles multi-class cases.
4. Can perform well in practice given enough representative data.
Cons of KNN
1. The value of the parameter K (the number of nearest neighbors) must be chosen.
2. Computation cost is quite high, because the distance from each query instance to
all training samples must be computed.
3. The entire training data set must be stored.
4. A meaningful distance function is required.
