
CSC – 368

WEEK 07
K NEAREST NEIGHBOUR

Dr. Sadaf Hussain


Asst. Professor CS
Faculty of Computer Science
Lahore Garrison University
WHAT IS KNN

 A K Nearest Neighbor classifier is a machine learning model that makes predictions based
on the majority class of the K nearest data points in the feature space.
 The KNN algorithm assumes that similar things exist in close proximity, making it intuitive
and easy to understand.
WORKING

1. Calculate the distance from each point in the space (e.g. Euclidean distance)
2. Sort all distances
3. Majority count

Example data (features: CGPA, Age; label: Job Employment):

CGPA   Age   Job Employment
3.5    22    1
3.2    23    0
3.8    21    1
3.0    24    0
3.7    22    1
3.3    25    1
2.9    23    0
3.6    21    1
3.1    24    0
3.4    22    1
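A minimal from-scratch sketch of the three steps, assuming the table above as the training set; the query point and k = 3 are illustrative choices:

```python
import numpy as np
from collections import Counter

# Training data assumed from the table above: features = [CGPA, Age], label = Job Employment
X = np.array([[3.5, 22], [3.2, 23], [3.8, 21], [3.0, 24], [3.7, 22],
              [3.3, 25], [2.9, 23], [3.6, 21], [3.1, 24], [3.4, 22]])
y = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1])

def knn_predict(query, X, y, k=3):
    # 1. Calculate the Euclidean distance from the query to every training point
    distances = np.linalg.norm(X - query, axis=1)
    # 2. Sort the distances and keep the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # 3. Majority count among the labels of those neighbours
    return Counter(y[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([3.6, 22]), X, y, k=3))  # -> 1
```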
HOW IS THE K-DISTANCE CALCULATED?

 Euclidean distance
 The Euclidean distance between two points is the length of the straight line segment connecting them. It is the most common distance metric and is applied to real-valued vectors.
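A minimal sketch of the Euclidean distance between two real-valued vectors (the points are illustrative):

```python
import numpy as np

a = np.array([3.5, 22.0])  # e.g. [CGPA, Age] of one point
b = np.array([3.2, 23.0])  # and of another

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
print(euclidean)              # ~1.044
print(np.linalg.norm(a - b))  # equivalent one-liner
```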
HOW IS THE K-DISTANCE CALCULATED?

 Manhattan distance
 The Manhattan distance between two points is the sum of the absolute differences between the x and y coordinates of each point.
 Used to measure the minimum distance by summing the length of all the intervals needed to get from one location to another in a city, it's also known as the taxicab distance.
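A minimal sketch of the Manhattan distance for the same illustrative points:

```python
import numpy as np

a = np.array([3.5, 22.0])
b = np.array([3.2, 23.0])

# Manhattan (taxicab) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(a - b))
print(manhattan)  # |3.5 - 3.2| + |22 - 23| ≈ 1.3
```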
HOW IS THE K-DISTANCE CALCULATED?

 Minkowski distance
 Minkowski distance generalizes the Euclidean and Manhattan distances.
 It adds a parameter called "order" that allows different distance measures to be calculated.
 Minkowski distance indicates a distance between two points in a normed vector space.
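A small sketch showing how the order parameter p recovers the previous two metrics (the points are illustrative):

```python
import numpy as np

def minkowski(a, b, p):
    # Minkowski distance of order p: (sum of |a_i - b_i|^p) ** (1/p)
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([3.5, 22.0])
b = np.array([3.2, 23.0])

print(minkowski(a, b, 1))  # p = 1 reduces to the Manhattan distance
print(minkowski(a, b, 2))  # p = 2 reduces to the Euclidean distance
```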
HOW IS THE K-DISTANCE CALCULATED?

 Hamming distance
 Hamming distance is used to compare two binary vectors (also called data strings or bitstrings).
 To calculate it, data first has to be translated into a binary system.
 REFER TO CODING EXAMPLE
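As a stand-in for the coding example referred to above, a minimal sketch of the Hamming distance between two illustrative bitstrings:

```python
import numpy as np

# Two binary vectors (bitstrings) of equal length
u = np.array([1, 0, 1, 1, 0, 1])
v = np.array([1, 1, 1, 0, 0, 1])

# Hamming distance: the number of positions at which the bits differ
hamming = np.sum(u != v)
print(hamming)  # 2
```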
DATASET DISTRIBUTION
HOW TO DETERMINE THE K VALUE IN THE K-NEIGHBORS
CLASSIFIER?

The optimal k value will help you achieve the maximum accuracy of the model.
Finding it, however, is always challenging.

The simplest solution is to try out k values and find the one that brings the best results on the
testing set. For this, we follow these steps:
1. Select an initial k value. In practice, k is usually chosen between 3 and 10, but there are no strict rules.
a) A small value of k results in unstable decision boundaries.
b) A large value of k often leads to the smoothening of decision boundaries but not always to better
metrics.
c) So it’s always about trial and error.

2. Try out different k values and note their accuracy on the testing set.
3. Choose k with the lowest error rate and implement the model.
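A minimal sketch of steps 2 and 3 with scikit-learn; the Iris dataset, the 70/30 split, and the k range 3–10 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: try out different k values and note their accuracy on the testing set
scores = {}
for k in range(3, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

# Step 3: choose the k with the lowest error rate (highest accuracy)
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```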
Cross validation

HOW TO SELECT THE VALUE OF “K”

 Depends on the problem


 Heuristic approach: k ≈ √n, where n is the number of training examples
 Experimental approach
 Cross-validation (elbow method)
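A hedged sketch of the elbow method with cross-validation; the Iris dataset, the 5-fold CV, and the k range are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Heuristic starting point: k near the square root of the number of training examples
print("sqrt(n) heuristic:", int(np.sqrt(len(X))))

# Elbow method: inspect (or plot) the mean cross-validated error against k and pick the bend
for k in range(1, 21):
    error = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(error, 3))
```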
PICKING K IN MORE COMPLICATED CASES
UNDER FITTING AND OVERFITTING IN KNN

 K is very low → overfitting


 K is very high → underfitting
DECISION BOUNDARIES

 Plotting with a numpy meshgrid


 Use the mlxtend library
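A minimal sketch using mlxtend's plot_decision_regions on two illustrative features of the Iris dataset (the classifier and k = 5 are assumptions):

```python
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Keep only two features so the boundary can be drawn on a 2-D meshgrid
X, y = load_iris(return_X_y=True)
X = X[:, :2]

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# mlxtend builds the numpy meshgrid internally and colours each region by predicted class
plot_decision_regions(X, y, clf=clf)
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
```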
WHEN TO USE AND WHEN NOT TO USE
 When not to use:
1. Very large datasets, since KNN is a lazy learner (all the work is done at query time for each query point)
2. High-dimensional data (a large number of features), which leads to the curse of dimensionality, since distance measurements become unreliable
3. Doesn't work well in the presence of outliers
4. Non-homogeneous feature scales lead to poor performance, since the feature with the largest scale dominates the decision and makes distance measurements unreliable (see the scaling sketch after this list)
5. Imbalanced datasets lead to biased predictions
6. When the goal is inference rather than prediction (KNN only measures distances and says little about how individual features contribute)
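A hedged sketch of pairing KNN with feature scaling so that no single scale dominates the distance; the Wine dataset and k = 5 are illustrative assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Without scaling, the feature with the largest numeric range dominates the distance
raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("raw   :", cross_val_score(raw, X, y, cv=5).mean())
print("scaled:", cross_val_score(scaled, X, y, cv=5).mean())
```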
EXAMPLE 2

 Dataset Used

https://towardsdatascience.com/k-nearest-neighbor-classifier-explained-a-visual-guide-with-code-examples-for-beginners-a3d85cad00e1
PROS & CONS
 Pros:
 Simplicity: Easy to understand and implement.
 No Assumptions: Doesn’t assume anything about the data distribution.
 Versatility: Can be used for both classification and regression tasks.
 No Training Phase: Can quickly incorporate new data without retraining.

 Cons:
 Computationally Expensive: Needs to compute distances to all training samples for each
prediction.
 Memory Intensive: Requires storing all training data.
 Sensitive to Irrelevant Features: Can be thrown off by features that aren’t important to the
classification.
 Curse of Dimensionality: Performance degrades in high-dimensional spaces.
FINAL REMARKS

 Introduction to KNN
 Simple and Intuitive: A straightforward algorithm for classification.
 Proximity-Based: Makes predictions based on the similarity of data points.
 No Explicit Training: Leverages the entire dataset for predictions.

 How KNN Works


 Distance Calculation: Measures the distance between a new data point and existing ones.
 K-Nearest Neighbors: Identifies the K closest data points to the new point.
 Majority Vote: Assigns the new point to the class that is most frequent among its K neighbors.
ADVANTAGES & DISADVANTAGES OF KNN

 Advantages of KNN
 Easy to Understand: Simple concept, easy to implement.
 Versatile: Applicable to various classification problems.
 No Model Training: Quick to deploy.

 Disadvantages of KNN
 Computational Cost: Can be slow for large datasets.
 Sensitive to Noise: Noisy data can impact predictions.
 Curse of Dimensionality: Performance degrades in high-dimensional spaces.
CHOOSING THE RIGHT K VALUE

 Impact of K: The choice of K influences the model's flexibility and robustness.


 Cross-Validation: A technique to find the optimal K value.
 Trade-off: A larger K reduces noise sensitivity but can oversmooth the decision boundary.
APPLICATIONS OF KNN

 Recommendation Systems: Recommending products or services.


 Image Recognition: Classifying images based on visual features.
 Text Classification: Categorizing text documents.
 Anomaly Detection: Identifying outliers in data.
