WEEK 07
K NEAREST NEIGHBOUR
A K Nearest Neighbor classifier is a machine learning model that makes predictions based
on the majority class of the K nearest data points in the feature space.
The KNN algorithm assumes that similar things exist in close proximity, making it intuitive
and easy to understand.
Working example: classifying Job Employment from CGPA and Age.

CGPA   Age   Job Employment
3.5    22    1
3.2    23    0
3.8    21    1
3.0    24    0
3.7    22    1
3.3    25    1
2.9    23    0
3.6    21    1
3.1    24    0
3.4    22    1

To classify a new point:
1. Calculate the distance from each point in the space (e.g. Euclidean distance)
2. Sort all distances
3. Majority count
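Below is a minimal Python sketch of the three steps above, run on the table data; the new applicant's values (CGPA 3.4, Age 23) and the choice of K = 3 are illustrative assumptions, not part of the slides.

import numpy as np
from collections import Counter

# Feature columns: CGPA, Age; label column: Job Employment
X = np.array([[3.5, 22], [3.2, 23], [3.8, 21], [3.0, 24], [3.7, 22],
              [3.3, 25], [2.9, 23], [3.6, 21], [3.1, 24], [3.4, 22]])
y = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1])

def knn_predict(query, X, y, k=3):
    # 1. Calculate the Euclidean distance from the query to every point
    distances = np.linalg.norm(X - query, axis=1)
    # 2. Sort the distances and keep the indices of the k nearest points
    nearest = np.argsort(distances)[:k]
    # 3. Majority count of the nearest neighbours' labels
    return Counter(y[nearest]).most_common(1)[0][0]

# Hypothetical new applicant with CGPA 3.4 and Age 23
print(knn_predict(np.array([3.4, 23]), X, y, k=3))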
HOW IS THE K-DISTANCE CALCULATED?
Euclidean distance
The Euclidean distance between two points is the length of the straight line segment connecting them. This most common distance metric is applied to real-valued vectors.
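A rough sketch of the Euclidean distance between two feature vectors (the sample values, taken from the table above, are only illustrative):

import numpy as np

def euclidean_distance(a, b):
    # Length of the straight line segment between two real-valued vectors
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

print(euclidean_distance([3.5, 22], [3.2, 23]))  # approximately 1.04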
HOW IS THE K-DISTANCE CALCULATED?
Manhattan distance
The Manhattan distance between two points is the sum of the absolute differences between the x and y coordinates of each point.
Used to measure the minimum distance by summing the length of all the intervals needed to get from one location to another in a city, it is also known as the taxicab distance.
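A minimal sketch of the Manhattan distance, using the same illustrative vectors as above:

import numpy as np

def manhattan_distance(a, b):
    # Sum of the absolute coordinate differences ("taxicab" distance)
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

print(manhattan_distance([3.5, 22], [3.2, 23]))  # approximately 1.3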
HOW IS THE K-DISTANCE CALCULATED?
Minkowski distance
Minkowski distance generalizes the Euclidean and Manhattan distances. It adds a parameter called "order" that allows different distance measures to be calculated.
Minkowski distance indicates a distance between two points in a normed vector space.
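A sketch of the Minkowski distance with a configurable order p; setting p = 1 or p = 2 reproduces the Manhattan and Euclidean distances (the sample vectors are assumptions):

import numpy as np

def minkowski_distance(a, b, p=2):
    # The "order" parameter p selects the metric: p=1 is Manhattan, p=2 is Euclidean
    diff = np.abs(np.asarray(a) - np.asarray(b))
    return np.sum(diff ** p) ** (1 / p)

print(minkowski_distance([3.5, 22], [3.2, 23], p=1))  # Manhattan, approximately 1.3
print(minkowski_distance([3.5, 22], [3.2, 23], p=2))  # Euclidean, approximately 1.04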
HOW IS THE K-DISTANCE CALCULATED?
Hamming distance
Hamming distance is used to compare two binary vectors (also called data strings or bitstrings).
To calculate it, data first has to be translated into a binary system.
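A minimal sketch of the Hamming distance between two equal-length bitstrings (the example strings are assumptions):

def hamming_distance(a, b):
    # Number of positions at which the two bitstrings differ
    assert len(a) == len(b), "bitstrings must have equal length"
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("101100", "100110"))  # 2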
REFER TO CODING EXAMPLE
DATASET DISTRIBUTION
HOW TO DETERMINE THE K VALUE IN THE K-NEIGHBORS CLASSIFIER?
Choosing the optimal k value helps the model achieve its maximum accuracy, but this process is always challenging.
The simplest solution is to try out k values and find the one that brings the best results on the
testing set. For this, we follow these steps:
1. Select a k value to start with. In practice, k is usually chosen between 3 and 10, but there are no strict rules.
a) A small value of k results in unstable decision boundaries.
b) A large value of k often leads to smoother decision boundaries, but not always to better metrics.
c) So it's always about trial and error.
2. Try out different k values and note their accuracy on the testing set.
3. Choose k with the lowest error rate and implement the model.
Cross validation can be used to estimate the accuracy of each candidate k more reliably than a single train/test split (see the sketch below).
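A hedged sketch of this search using scikit-learn's KNeighborsClassifier and 5-fold cross validation; the Iris dataset and the k range of 3 to 10 are illustrative assumptions, not the lecture's dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset; replace with the dataset used in the coding example
X, y = load_iris(return_X_y=True)

# Try k = 3..10 and record the mean cross-validated accuracy for each
scores = {}
for k in range(3, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])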
Dataset Used
https://towardsdatascience.com/k-nearest-neighbor-classifier-explained-a-visual-guide-with-code-examples-for-beginners-a3d85cad00e1
PROS & CONS
Pros:
Simplicity: Easy to understand and implement.
No Assumptions: Doesn’t assume anything about the data distribution.
Versatility: Can be used for both classification and regression tasks.
No Training Phase: Can quickly incorporate new data without retraining.
Cons:
Computationally Expensive: Needs to compute distances to all training samples for each
prediction.
Memory Intensive: Requires storing all training data.
Sensitive to Irrelevant Features: Can be thrown off by features that aren’t important to the
classification.
Curse of Dimensionality: Performance degrades in high-dimensional spaces.
FINAL REMARKS
Introduction to KNN
Simple and Intuitive: A straightforward algorithm for classification.
Proximity-Based: Makes predictions based on the similarity of data points.
No Explicit Training: Leverages the entire dataset for predictions.
Advantages of KNN
Easy to Understand: Simple concept, easy to implement.
Versatile: Applicable to various classification problems.
No Model Training: Quick to deploy.
Disadvantages of KNN
Computational Cost: Can be slow for large datasets.
Sensitive to Noise: Noisy data can impact predictions.
Curse of Dimensionality: Performance degrades in high-dimensional spaces.