Classification KNN
Arvind Deshpande
KNN classifier
• KNN classifiers are based on learning by analogy, that is,
by comparing a given test tuple with training tuples that
are similar to it.
• The training tuples are described by n attributes.
• Each tuple represents a point in an n-dimensional space.
In this way, all the training tuples are stored in an n-
dimensional pattern space.
• When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
• These k training tuples are the k “nearest neighbors” of
the unknown tuple.
KNN classifier
• “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$, is

$$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
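A minimal sketch of this formula in Python (the function name euclidean_distance is ours, for illustration only):

import math

def euclidean_distance(x1, x2):
    # Square the difference in each of the n attributes, sum, take the root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 5.0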
KNN classifier
• Typically, we normalize the values of each attribute before
using the equation. This helps prevent attributes with
initially large ranges (e.g., income) from outweighing
attributes with initially smaller ranges (e.g., binary
attributes).
• Min-max normalization, for example, can be used to
transform a value v of a numeric attribute A to v’ in the
range [0, 1] by computing
$$v' = \frac{v - min_A}{max_A - min_A}$$

where $min_A$ and $max_A$ are the minimum and maximum values of attribute A.
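A minimal sketch of min-max normalization in Python (the function name is illustrative; it assumes max_A > min_A, i.e. the attribute is not constant):

def min_max_normalize(values):
    # Rescale each value v of attribute A to (v - min_A) / (max_A - min_A)
    min_a, max_a = min(values), max(values)
    return [(v - min_a) / (max_a - min_a) for v in values]

incomes = [20000, 35000, 50000, 120000]
print(min_max_normalize(incomes))  # [0.0, 0.15, 0.3, 1.0]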
KNN Algorithm
1. Calculate d(x, xi) for i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
2. Arrange the calculated Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points belonging to the ith class among these k points, i.e. ki ≥ 0 and the ki sum to k.
6. If ki > kj for all j ≠ i, then put x in class i (see the sketch below).
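The six steps above fit in a short Python sketch (names such as knn_classify are illustrative; ties between equally frequent classes are broken arbitrarily here):

import math
from collections import Counter

def knn_classify(x, training_points, labels, k):
    # Step 1: Euclidean distance from x to every training point
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))
                 for p in training_points]
    # Steps 2-4: indices of the k points with the smallest distances
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Steps 5-6: count class labels among the k neighbors, return the majority
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [(1, 1), (1, 2), (6, 6), (7, 7)]
y = ["A", "A", "B", "B"]
print(knn_classify((2, 2), train, y, k=3))  # 'A'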
Advantages
• The algorithm is simple and easy to implement.
• It is a lazy learning algorithm and therefore requires no training prior to making real-time predictions.
• There is no need to build a model, tune several parameters, or make additional assumptions. This makes KNN much faster at training time than algorithms that require a training phase, such as SVM and linear regression.
• Since the algorithm requires no training before making
predictions, new data can be added seamlessly.
• Only two parameters are required to implement KNN: the value of k and the distance function (see the sketch after this list).
• The algorithm is versatile. It can be used for classification,
regression and search.
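As a concrete illustration of those two parameters, a minimal sketch using scikit-learn (assuming scikit-learn is installed; it exposes them as n_neighbors and metric):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit([[1, 1], [1, 2], [6, 6], [7, 7]], ["A", "A", "B", "B"])
print(clf.predict([[2, 2]]))  # ['A']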
Disadvantages
• It doesn't work well with high-dimensional data, because with a large number of dimensions it becomes difficult for the algorithm to compute meaningful distances.
• KNN has a high prediction cost on large datasets, since classifying a new point requires computing its distance to every existing point.
• KNN doesn't work well with categorical features, since it is difficult to define a distance over dimensions that hold categorical values.
• The algorithm gets significantly slower as the number of examples and/or predictors (independent variables) increases.
Disadvantages
• When making a classification or numeric prediction, lazy
learners can be computationally expensive.
• They therefore require efficient storage techniques and are well suited to implementation on parallel hardware.
• They offer little explanation or insight into the data’s
structure.
• They can suffer from poor accuracy when given noisy or
irrelevant attributes.
• Nearest-neighbor classifiers can be extremely slow when
classifying test tuples.