Seminar Report File On KNN Models: University Institute of Engineering and Technology, Kurukshetra University
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
o At the training phase, the K-NN algorithm simply stores the dataset; when it receives new data, it classifies that data into the category most similar to it.
o Example:
Imagine we have a dataset containing images of cats and dogs,
each labeled as either "cat" or "dog." Now, we get a new, unlabeled
image of an animal that has mixed characteristics resembling both.
To identify whether this new image is a cat or a dog, we use the
K-Nearest Neighbors (KNN) algorithm.
● In K-NN, we first choose a value for 'k', representing the number of nearest neighbors to consider. Then, we calculate the "distance" between the new data point x1 and each data point in both Category A and Category B. The distance metric, typically Euclidean distance, allows us to measure the closeness of x1 to each neighboring data point. Once we have found the 'k' nearest neighbors to x1, we observe the categories these neighbors belong to.
● If most of the closest neighbors are from Category A, K-NN will classify x1 as belonging to Category A. Conversely, if most neighbors are from Category B, x1 will be classified into Category B. This approach is simple yet powerful because it bases predictions on real, measurable patterns in the data. Plotting the dataset illustrates how x1's position relative to the Category A and Category B points decides its category, making K-NN well suited to problems where spatial closeness correlates with categorical similarity.
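As a concrete illustration of the distance step, the following minimal sketch computes the Euclidean distance between a new point x1 and a few labelled points using NumPy; the points and values are made up for illustration and are not taken from the report's dataset:

import numpy as nm

# Hypothetical 2-D feature vectors (illustrative values only)
x1 = nm.array([2.0, 3.0])                  # the new, unlabelled point
neighbors = nm.array([[1.0, 2.0],          # a point from Category A
                      [2.5, 3.5],          # a point from Category A
                      [8.0, 9.0]])         # a point from Category B

# Euclidean distance: square root of the sum of squared feature differences
distances = nm.sqrt(((neighbors - x1) ** 2).sum(axis=1))
print(distances)   # smaller values mean the point lies closer to x1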
How does K-NN work?
The working of the K-Nearest Neighbors (K-NN) algorithm can be explained with a
step-by-step approach as follows:
o Step 1: Choose the number K of neighbors to consider.
o Step 2: Calculate the Euclidean distance between the new data point and each of the data points in the dataset. The Euclidean distance formula is used to measure how close or far apart each point is from the new data point. This distance metric allows us to identify the points that are closest to the new data point in terms of feature values.
o Step 3: Identify the K data points with the smallest Euclidean distances to
the new data point. These are considered the "K nearest neighbors" of the
new point. By focusing on these closest neighbors, K-NN assumes that
points close to each other are likely to share the same category.
o Step 4: Among these K nearest neighbors, count the number of data points
belonging to each category (e.g., Category A and Category B). This is the
process of "voting," where each neighbor essentially "votes" for its
category, influencing the classification of the new data point.
o Step 5: Assign the new data point to the category with the majority vote
among its K neighbors. For example, if more neighbors belong to
Category A than Category B, the new point will be classified under
Category A. This decision rule assumes that the category with the most
neighbors around a point is the most appropriate classification.
o Step 6: The model is now ready for use. With the classification assigned
to the new data point, our K-NN model can now classify additional new
data points using the same process. The simplicity and flexibility of K-NN
make it suitable for both classification and regression tasks, particularly
where patterns in data can be recognized through proximity or similarity.
● Firstly, we choose the number of neighbors; suppose we choose k = 5.
● If, say, 3 of the 5 nearest neighbors belong to Category A and 2 belong to Category B, then the new data point is assigned to Category A.
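The step-by-step procedure above can be condensed into a short from-scratch sketch. This is only an illustration of the technique, not the report's implementation; the dataset, the labels, and the helper function knn_classify are hypothetical:

import numpy as nm
from collections import Counter

def knn_classify(new_point, points, labels, k=5):
    # Step 2: Euclidean distance from the new point to every stored point
    distances = nm.sqrt(((points - new_point) ** 2).sum(axis=1))
    # Step 3: indices of the K smallest distances (the K nearest neighbors)
    nearest = nm.argsort(distances)[:k]
    # Steps 4-5: each neighbor "votes" for its label and the majority wins
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D dataset: Category A near the origin, Category B further away
points = nm.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_classify(nm.array([2, 2]), points, labels, k=5))   # expected output: 'A'

With k = 5, three of the five nearest neighbors of the point (2, 2) belong to Category A, so the majority vote assigns it to Category A, mirroring the example above.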
How to select the value of K in the K-NN
Algorithm?
When selecting the value of 'K' in the K-Nearest Neighbors (K-NN) algorithm, some key points to keep in mind are:
o There is no fixed rule for choosing 'K'; a common starting point is K = 5. A very small 'K' makes the model sensitive to noise and outliers, while a very large 'K' can over-generalize and miss finer patterns in the data, so several candidate values are usually evaluated (see the sketch below).
o As the dataset size increases, K-NN's accuracy can improve because it has a wider pool of reference points, making it more effective at recognizing complex patterns.
o K-NN can be used not only for classification tasks but also for regression, by averaging the values of the nearest neighbors, making it a flexible tool for different types of problems.
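One common way to choose 'K' in practice is to try several candidate values and keep the one that scores best under cross-validation. The sketch below is only an illustration and is not part of the report's experiments; X and y are assumed to be an already loaded and scaled feature matrix and label vector:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y are assumed to exist: scaled features and their class labels
best_k, best_score = None, 0.0
for k in range(1, 21):                                   # candidate values of K
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()      # mean 5-fold accuracy
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)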
Disadvantages of KNN Algorithm
Six disadvantages of the K-Nearest Neighbors (K-NN) algorithm:
1. Choosing the Value of K
Determining the optimal value for 'K' can be complex and time-consuming. If
the value of K is too small, the model may become sensitive to noise, while a
very large 'K' can overly generalize and miss finer patterns in the data, making
it tricky to achieve optimal performance.
2. High Computation Cost
K-NN requires calculating the distance between the new data point and all the
training samples in the dataset. This can be computationally expensive,
especially when dealing with large datasets, as the model performs this
distance calculation for every prediction, leading to slow performance.
3. Memory Intensive
As a lazy learner, K-NN needs to store the entire training dataset in memory
to classify new data points. This can be quite memory-intensive, especially
with large datasets, making it less scalable for very large training sets.
4. Sensitive to Irrelevant Features
K-NN can perform poorly when there are irrelevant or redundant features in
the dataset. These features can distort the distance calculations, leading to
inaccurate predictions since the model treats all features equally, even if they
don’t contribute to the classification task.
5. Difficulty Handling High-Dimensional Data (Curse of Dimensionality)
As the number of features increases, the distances between data points become
less distinct, making it harder for K-NN to differentiate between them. This
issue, known as the "curse of dimensionality," can reduce the algorithm’s
effectiveness in high-dimensional spaces, such as in text classification or
image recognition.
6. Poor Performance with Imbalanced Data
K-NN struggles with imbalanced datasets, where one class significantly
outnumbers the other. In such cases, the algorithm may be biased toward the
majority class, leading to inaccurate predictions for the minority class, as the
majority class will dominate the "voting" process of the K nearest neighbors.
Python implementation of the KNN
algorithm
For the Python implementation of the K-NN algorithm, we will use the same problem and dataset (user_data.csv) that we used for Logistic Regression, but here we will improve the performance of the model.
The Data Pre-processing step remains exactly the same as for Logistic Regression. Below is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set= pd.read_csv('user_data.csv')

# extracting independent and dependent variables
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported into the program and pre-processed, and after feature scaling the training and test feature values are successfully standardized.
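To check this, one can, for example, print the first few rows of the scaled test set (a small inspection step that is not part of the original code):

print(x_test[:5])   # each row holds the two standardized feature values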
Once we have chosen the optimal value for 'K' and preprocessed the data, the next
step is to fit the classifier to the training data. This process involves training the
K-NN model by feeding it the feature values and corresponding labels from the
training dataset. Because K-NN is a lazy learner, fitting essentially stores these examples so that the classifier can later make predictions on new, unseen data. The code provided below
demonstrates how to implement this step in Python, using a K-NN algorithm from a
machine learning library, such as scikit-learn, to fit the model. This step is
crucial for enabling the model to classify new data based on the training it has
received.
# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)   # p=2 with the Minkowski metric is the Euclidean distance
classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic Regression. Below is the code for it:

# Predicting the test set result
y_pred= classifier.predict(x_test)

o Creating the Confusion Matrix: To evaluate these predictions, we create a confusion matrix from the true test labels and the predicted labels, as we did in Logistic Regression:

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output: By executing the above code, we get the confusion matrix below:
From this confusion matrix, we can see there are 64 + 29 = 93 correct predictions and 3 + 4 = 7 incorrect predictions, whereas in Logistic Regression there were 11 incorrect predictions. So we can say that the performance of the model is improved by using the K-NN algorithm.
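The same result can also be summarised as an accuracy score; the lines below are a small addition for illustration and are not part of the original code:

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))   # 93 correct out of 100 test samples, i.e. 0.93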
o Visualizing the Test set result: Plotting the classifier's decision regions for the test set shows that the predictions are quite good: most of the red points fall in the red region and most of the green points fall in the green region. However, there are a few green points in the red region and a few red points in the green region. These are the incorrect observations that we observed in the confusion matrix (7 incorrect predictions).
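For reference, a minimal sketch of how such a decision-region plot can be produced with matplotlib is given below; it assumes the classifier and the scaled x_test, y_test from the steps above, and the colour choices are illustrative:

from matplotlib.colors import ListedColormap

x_set, y_set = x_test, y_test
# Build a fine grid covering the feature space of the test set
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
# Colour each grid cell by the class the classifier predicts for it
mtp.contourf(x1, x2,
             classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
# Overlay the actual test points, coloured by their true class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c=[ListedColormap(('red', 'green'))(i)], label=j)
mtp.title('K-NN algorithm (Test set)')
mtp.legend()
mtp.show()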