What Is KNN?
1. KNN, also called K-Nearest Neighbour, is a supervised machine learning algorithm that can be used for both classification and regression problems.
2. The KNN algorithm uses ‘feature similarity’ to predict the value of any new data point: the new point is assigned a value based on how closely it resembles the points in the training set.
Example:
Suppose there are two categories, Category A and Category B, and we have a new data point (shown in orange). In which of these categories will the new point lie? To classify it, we can use the KNN algorithm, which observes the behaviour of the nearest points and assigns the new point to a category accordingly.
3. K-Nearest Neighbour is non-parametric, i.e. it does not make any assumptions about the underlying data distribution.
(Parametric: whenever you make an assumption about the nature of the function underlying your data, that algorithm is parametric. A parametric algorithm has a fixed number of parameters that does not depend on the number of rows in your data, i.e. no matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. Linear regression is a good example of parametric ML: while doing linear regression you assume the function is a line, and the number of coefficients is also fixed, namely the slope and the intercept.
On the contrary, if you don’t make any such assumption, the algorithm is non-parametric. Non-parametric algorithms also have parameters; it’s just that these change, or rather grow, with the number of rows in the data, e.g. decision trees and KNN.)
4. K-Nearest Neighbour is also termed a lazy algorithm, as it does not learn during the training phase; it merely stores the data points and does the actual work during the testing (prediction) phase.
5. It is a distance-based algorithm.
We can use Euclidean or Manhattan distance.
The Euclidean distance is the straight-line distance between two points, which we have already studied in geometry. Consider points A(x1, y1) and B(x2, y2), where (x1, y1) is the observed value and (x2, y2) is the actual value. It can be calculated as:

Euclidean distance = √((x2 − x1)² + (y2 − y1)²)

The Manhattan distance is instead the sum of the absolute differences of the coordinates: |x2 − x1| + |y2 − y1|.
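As a quick sketch, both distance measures can be written in plain Python; the coordinate tuples below are illustrative:

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two points of any dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # "City block" distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
print(manhattan_distance((1, 2), (4, 6)))  # 7
```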
Steps to perform K-NN
• Choose the K value.
• Calculate the distance between the new data point and all the training points.
• Sort the computed distances in ascending order.
• Pick the first K distances from the sorted list.
• Take the mode (for classes) or mean (for values) of the K neighbours associated with those distances; a sketch of these steps follows below.
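A minimal Python sketch of these steps (knn_predict and the small dataset are hypothetical, not from any particular library):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5):
    # Steps 1-2: compute the distance from the new point to every training point
    distances = [
        (math.dist(x, new_point), label)  # math.dist is the Euclidean distance
        for x, label in zip(X_train, y_train)
    ]
    # Step 3: sort the distances in ascending order
    distances.sort(key=lambda pair: pair[0])
    # Step 4: keep only the first K neighbours
    k_labels = [label for _, label in distances[:k]]
    # Step 5: mode of the neighbours' classes (use a mean instead for regression)
    return Counter(k_labels).most_common(1)[0][0]

# Illustrative (height, weight) -> size data
X_train = [(158, 58), (160, 60), (163, 61), (165, 64), (170, 68)]
y_train = ["M", "M", "M", "L", "L"]
print(knn_predict(X_train, y_train, (161, 61), k=3))  # "M"
```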
How does KNN work for ‘Classification’ and ‘Regression’ problem statements?
Classification: When the problem statement is of the ‘classification’ type, KNN uses the concept of “Majority Voting”: among the K nearest neighbours, the class with the most votes is chosen.
Suppose we have the height, weight, and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only their height and weight.
The height, weight, and T-shirt size data are shown below –
A new customer named 'Monica' has a height of 161 cm and weighs 61 kg, so what will be her T-shirt size?
After standardization, the 5th closest value changed, because height was dominating the distance calculation before standardization. Hence, it is important to standardize the predictors before running the K-Nearest Neighbour algorithm.
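A minimal standardization sketch with scikit-learn's StandardScaler; the height/weight rows are illustrative stand-ins, not the actual table:

```python
from sklearn.preprocessing import StandardScaler

# Illustrative height (cm) and weight (kg) rows
X = [[158, 58], [160, 60], [163, 61], [165, 64], [170, 68]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)
```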
We have plotted the above information. In the graph below, 'Medium' T-shirt size is in blue and 'Large' T-shirt size is in orange, and the new customer is shown as a yellow circle. Four blue data points and one orange data point are close to the yellow circle, so by majority vote the prediction for the new customer is the Medium T-shirt size.
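Putting the pieces together, an end-to-end sketch with scikit-learn might look like this (the training rows are again illustrative, not the original table):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Illustrative (height_cm, weight_kg) -> T-shirt size data
X = [[158, 58], [160, 60], [163, 61], [165, 64], [168, 66], [170, 68]]
y = ["M", "M", "M", "L", "L", "L"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = KNeighborsClassifier(n_neighbors=5)  # majority vote over 5 neighbours
model.fit(X_scaled, y)

# Predict Monica's size from her standardized height and weight
monica = scaler.transform([[161, 61]])
print(model.predict(monica))  # e.g. ['M']
```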
Regression:
For regression problem statements, KNN employs a mean/average method to predict the value of a new data point: based on the value of K, it averages the values of the K nearest neighbours.
As in the previous classification example, where we took the mode of the 5 nearest neighbours because the target variable was categorical, in regression we take the mean or median of the 5 nearest neighbours because the target variable is continuous.
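A small regression sketch with scikit-learn; the one-feature dataset is illustrative:

```python
from sklearn.neighbors import KNeighborsRegressor

# Illustrative single-feature training data with a continuous target
X = [[1], [2], [3], [4], [5]]
y = [1.0, 2.1, 2.9, 4.2, 5.1]

model = KNeighborsRegressor(n_neighbors=3)  # prediction = mean of the 3 nearest targets
model.fit(X, y)

# The 3 nearest neighbours of 3.2 are x = 3, 4, 2,
# so the prediction is (2.9 + 4.2 + 2.1) / 3
print(model.predict([[3.2]]))
```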
How to choose the value of K?
1. Larger K value: the case of underfitting occurs when the value of K is increased too far. In this case, the model is unable to learn the training data correctly.
2. Smaller K value: the condition of overfitting occurs when the value of K is too small. The model captures all of the training data, including noise, and then performs poorly on the test data.
3. We should not use even values of K in binary classification problems. Suppose we choose K = 4 and the 4 neighbouring points are evenly distributed among the classes, i.e. 2 data points belong to category 1 and 2 data points belong to category 2. In that case, the data point cannot be classified because there is a tie between the classes.
4. Plot the elbow curve of the error rate against different K values, and choose the K value at the elbow, i.e. the point after which the error rate stops dropping sharply, as sketched below.
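A sketch of such an elbow plot, assuming a feature matrix X and labels y have already been loaded:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumes X (features) and y (labels) are defined elsewhere
k_values = range(1, 31)
error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    error_rates.append(1 - accuracy)  # error rate = 1 - mean CV accuracy

plt.plot(k_values, error_rates, marker="o")
plt.xlabel("K")
plt.ylabel("Error rate")
plt.title("Elbow curve for choosing K")
plt.show()
```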
Impact of Imbalanced dataset and Outliers on KNN
Imbalanced dataset~
When dealing with an imbalanced dataset, the model becomes biased. Consider the example shown in the diagram below, where the “Yes” class is more prominent.
As a consequence, the bulk of the nearest neighbours of a new point will come from the dominant class. Because of this, we must balance our dataset using either an “Upscaling” (oversampling) or “Downscaling” (undersampling) strategy; one sketch of upscaling follows below.
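One way to sketch the upscaling step, assuming a hypothetical pandas DataFrame df with a binary 'label' column where “Yes” dominates:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical DataFrame `df` with a 'label' column dominated by "Yes"
majority = df[df["label"] == "Yes"]
minority = df[df["label"] == "No"]

# Upscaling: resample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])
```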
Outliers~
Outliers are points that differ significantly from the rest of the data points.
Outliers impact the classification/prediction of the model. According to the following diagram, the appropriate class for the new data point should be “Category B”, shown in green. Due to the presence of outliers, however, the model is unable to make the appropriate classification. As a result, removing outliers before using KNN is recommended; a simple IQR-based filter is sketched below.
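A minimal sketch of such a filter, assuming the features are in a NumPy array (the sample rows are illustrative):

```python
import numpy as np

def remove_outliers_iqr(X):
    # Keep only rows whose every feature lies within 1.5 * IQR of that feature
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = np.all((X >= lower) & (X <= upper), axis=1)
    return X[mask]

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2], [10.0, 2.0]])
print(remove_outliers_iqr(X))  # the (10.0, 2.0) row is dropped
```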