
SEMINAR REPORT FILE ON KNN MODELS

UNIVERSITY INSTITUTE OF ENGINEERING AND TECHNOLOGY,
KURUKSHETRA UNIVERSITY
(SESSION: 2021 – 2025)

Submitted To: Dr Kulwinder Singh, Asst. Professor, UIET, KUK
Submitted By: Samarjeet Singh, 252102058, CSE – A, 7th Sem
K-Nearest Neighbor (KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning
algorithms, based on the Supervised Learning technique.

o The K-NN algorithm assumes similarity between the new case/data
and the available cases and puts the new case into the category that is
most similar to the available categories.

o The K-NN algorithm stores all the available data and classifies a new
data point based on similarity. This means that when new data
appears, it can be easily classified into a well-suited category by
using the K-NN algorithm.

o The K-NN algorithm can be used for Regression as well as for
Classification, but it is mostly used for Classification problems.

o K-NN is a non-parametric algorithm, which means it does not
make any assumption about the underlying data.

o It is also called a lazy learner algorithm because it does not learn
from the training set immediately; instead, it stores the dataset
and, at the time of classification, performs an action on the
dataset.

o At the training phase, the KNN algorithm just stores the dataset, and
when it gets new data, it classifies that data into the category
that is most similar to the new data.
o Example:
Imagine we have a dataset containing images of cats and dogs,
each labeled as either "cat" or "dog." Now, we get a new, unlabeled
image of an animal that has mixed characteristics resembling both.
To identify whether this new image is a cat or a dog, we use the
K-Nearest Neighbors (KNN) algorithm.

The KNN model compares the features of this new image, such as
its shape, fur texture, ear shape, and other visual
characteristics, with the images in our dataset. By calculating the
similarity or "distance" between the new image and each of the
images in the dataset, KNN identifies the 'k' closest images. If most
of these nearest neighbors are labeled as cats, then the algorithm
classifies the new image as a cat; otherwise, if most are dogs, it
classifies it as a dog.

KNN is effective here because it relies on the similarity measure,
allowing the model to make an accurate prediction based on known
patterns in the data. This approach is particularly useful for
non-complex image recognition tasks with distinct, measurable
differences in categories, and it adapts well when new images with
similar features are added to the dataset. This makes KNN a
flexible and intuitive method for classifying images without
needing an intensive training process, as it can work directly on
raw data points.
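As a quick illustration of this idea in code, the following minimal sketch uses scikit-learn's KNeighborsClassifier on a tiny, made-up set of numeric animal features (the ear-length and snout-length values are hypothetical, chosen only for this example):

# A minimal sketch of the cat/dog example with scikit-learn.
# The feature values below are hypothetical and only for illustration.
from sklearn.neighbors import KNeighborsClassifier

# Each row is [ear length in cm, snout length in cm] for one labeled animal
features = [[4.0, 3.0], [4.5, 3.2], [5.0, 3.5],    # cats
            [9.0, 8.0], [9.5, 8.5], [10.0, 9.0]]   # dogs
labels = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog']

knn = KNeighborsClassifier(n_neighbors=3)   # look at the 3 closest animals
knn.fit(features, labels)                   # "training" just stores the data

print(knn.predict([[5.2, 3.6]]))            # nearest neighbors are cats -> ['cat']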
Why do we need a K-NN Algorithm?
● Suppose we have two categories, Category A and Category B, each containing
multiple data points, plotted on a graph based on certain features. Now, we
receive a new data point, x1, which is unlabeled, and our goal is to determine
whether x1 belongs to Category A or Category B. To solve this type of
classification problem, we can use the K-Nearest Neighbors (K-NN)
algorithm, which is effective in identifying the most likely category of a new
data point based on its similarity to existing labeled data points.

● In K-NN, we first choose a value for 'k,' representing the number of nearest
neighbors to consider. Then, we calculate the "distance" between x1 and each
data point in both Category A and Category B. The distance metric, typically
Euclidean distance, allows us to measure the closeness of x1 to each
neighboring data point. Once we’ve found the 'k' nearest neighbors to x1, we
observe the categories these neighbors belong to.

● If most of the closest neighbors are from Category A, K-NN will classify x1 as
belonging to Category A. Conversely, if most neighbors are from Category B,
x1 will be classified into Category B. This approach is simple yet powerful
because it bases predictions on real, measurable patterns in the data. The
visual diagram of the dataset helps illustrate how x1’s position relative to
Category A and Category B points can help decide its category, making K-NN
ideal for solving problems where spatial closeness correlates with categorical
similarity.
How does K-NN work?
The working of the K-Nearest Neighbors (K-NN) algorithm can be explained with a
step-by-step approach as follows:

o Step 1: Select the number of neighbors, 'K,' to consider. This is a crucial
parameter, as it determines how many nearest data points will influence
the classification. Choosing the right K value is important; if K is too
small, the model might be overly sensitive to noise, while a larger K can
lead to a more general, but possibly less accurate, classification.

o Step 2: Calculate the Euclidean distance between the new data point and
each of the data points in the dataset. The Euclidean distance formula is
used to measure how close or far apart each point is from the new data
point. This distance metric allows us to identify the points that are
closest to the new data point in terms of feature values.

o Step 3: Identify the K data points with the smallest Euclidean distances to
the new data point. These are considered the "K nearest neighbors" of the
new point. By focusing on these closest neighbors, K-NN assumes that
points close to each other are likely to share the same category.

o Step 4: Among these K nearest neighbors, count the number of data points
belonging to each category (e.g., Category A and Category B). This is the
process of "voting," where each neighbor essentially "votes" for its
category, influencing the classification of the new data point.

o Step 5: Assign the new data point to the category with the majority vote
among its K neighbors. For example, if more neighbors belong to
Category A than Category B, the new point will be classified under
Category A. This decision rule assumes that the category with the most
neighbors around a point is the most appropriate classification.

o Step 6: The model is now ready for use. With the classification assigned
to the new data point, our K-NN model can classify additional new
data points using the same process. The simplicity and flexibility of K-NN
make it suitable for both classification and regression tasks, particularly
where patterns in data can be recognized through proximity or similarity.

By following these steps, the K-NN algorithm provides an intuitive, data-driven
method to classify new data points based on similarity, making it a useful algorithm
for tasks like image recognition, recommendation systems, and even predictive
analysis.
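These six steps can also be written out directly as a small, self-contained Python sketch. The function and the tiny two-feature dataset below are made up purely for illustration; the scikit-learn implementation used later in this report performs the same logic more efficiently:

import math
from collections import Counter

def knn_classify(training_points, new_point, k=5):
    # Step 1: 'k' is the number of neighbors to consider.
    # Step 2: compute the Euclidean distance from the new point to every training point.
    distances = []
    for features, category in training_points:
        d = math.sqrt(sum((f - n) ** 2 for f, n in zip(features, new_point)))
        distances.append((d, category))
    # Step 3: keep the K points with the smallest distances.
    nearest = sorted(distances)[:k]
    # Step 4: count how many of these neighbors fall into each category.
    votes = Counter(category for _, category in nearest)
    # Step 5: assign the new point to the category with the majority vote.
    return votes.most_common(1)[0][0]

# Step 6: the "model" (the stored data) can now classify any new point.
data = [((1.0, 1.2), 'A'), ((1.5, 1.8), 'A'), ((2.0, 1.0), 'A'),
        ((6.0, 6.5), 'B'), ((7.0, 6.0), 'B')]
print(knn_classify(data, (1.4, 1.1), k=3))   # prints 'A'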
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:

● Firstly, we will choose the number of neighbors, so we will choose k = 5.

● Next, we will calculate the Euclidean distance between the data
points. The Euclidean distance is the distance between two points,
which we have already studied in geometry. It can be calculated as:
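For two points A = (x1, y1) and B = (x2, y2), the formula is:

Euclidean Distance between A and B = √((x2 − x1)² + (y2 − y1)²)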
● By calculating the Euclidean distance between our new data point and all the
data points in the dataset, we identify the nearest neighbors based on their
proximity. Suppose the value of 'K' we chose is 5, meaning we are looking at
the five closest neighbors to our new data point. After measuring the distances,
we find that three of the nearest neighbors belong to Category A, while two
belong to Category B.
● This setup means that, out of the five nearest neighbors, the majority of them
(three) are in Category A, suggesting that our new data point is more similar to
Category A than to Category B. In K-NN, this majority vote among the
neighbors is a critical step because it determines the likely classification of the
new data point.
● Since Category A has more representatives among the nearest neighbors, the
algorithm will classify the new data point as belonging to Category A. This
decision is based on the principle that points closer together in the feature
space likely share the same category or class. The K-NN model relies on this
assumption, making it effective for problems where proximity correlates well
with category or class membership.
● If this example were visualized, we would see the new data point situated
closer to the cluster of points in Category A than to those in Category B,
reinforcing why it is classified as Category A. As we apply this to more data,
the K-NN model continues to classify each new point by comparing it with its
closest neighbors, making it both an adaptable and intuitive classification tool.

● As we can see, three of the five nearest neighbors are from Category A; hence this
new data point must belong to Category A.
How to select the value of K in the K-NN
Algorithm?
When selecting the value of 'K' in the K-Nearest Neighbors (K-NN) algorithm, here are
some key points to keep in mind:

1. No Fixed Rule for the Best 'K'

● There's no single method to determine the optimal 'K' value for all datasets, so
it's often necessary to experiment with different values to find the one that
provides the best accuracy for the specific dataset. Cross-validation is a
common approach for testing various 'K' values; a short sketch of this idea is
given after this list.

2. Commonly Used Value

● The most commonly used value for 'K' is 5. This value tends to strike a
balance between stability and responsiveness, as it considers enough neighbors
to minimize the impact of random outliers but is still responsive to local data
patterns.

3. Effect of a Low 'K' Value

● When 'K' is set to a very low number, such as 1 or 2, the model can become
overly sensitive to outliers. This can lead to incorrect classifications if these
outliers differ significantly from the actual category patterns, resulting in a
"noisy" model that lacks generalization.

4. Challenges with a High 'K' Value

● Higher values of 'K' provide a more stable and generalized model by reducing
the effect of individual points. However, too large a value can dilute the
influence of nearby points, potentially leading to less accurate classifications,
especially in cases where the categories are closely spaced.

5. Finding the Right Balance

● Choosing 'K' is about finding a balance between bias and variance. A smaller
'K' may capture finer details in the data, while a larger 'K' smooths the
decision boundary but can overlook local patterns.
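As mentioned in point 1, cross-validation is a common way to compare candidate 'K' values. A minimal sketch of this idea with scikit-learn is shown below; it assumes that x and y are the feature matrix and labels (as prepared in the implementation section later in this report):

# Try several K values and keep the one with the best cross-validated accuracy.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_k, best_score = None, 0.0
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, x, y, cv=5).mean()   # 5-fold cross-validation
    if score > best_score:
        best_k, best_score = k, score

print("Best K:", best_k, "with cross-validated accuracy:", best_score)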
Advantages of KNN Algorithm

Six advantages of the K-Nearest Neighbors (K-NN) algorithm:

1. Simple to Understand and Implement

K-NN is an intuitive and straightforward algorithm that doesn’t require
complex calculations, making it easy to implement and interpret.

2. No Training Phase Required

Unlike many algorithms, K-NN is a lazy learner, meaning it doesn’t require a
lengthy training phase. This allows it to be applied directly to classify new
data points without pre-training.

3. Adaptable to Multi-Class Problems

K-NN can be easily adapted for multi-class classification tasks, where
multiple categories are present, by simply counting the majority among the
neighbors.

4. Handles Noisy Data Well

K-NN is generally robust to noise in the dataset, especially when an
appropriate 'K' value is chosen, as it uses nearby points to smooth out outlier
effects. K-NN makes no assumptions about the underlying distribution of the
data, unlike some algorithms (e.g., linear regression). This makes it useful for
datasets where the relationships between features are complex or non-linear.

5. Effective with Large Datasets

As the dataset size increases, K-NN’s accuracy can improve because it has a
wider pool of reference points, making it more effective in recognizing
complex patterns.

6. Versatile for Both Classification and Regression

K-NN can be used not only for classification tasks but also for regression, by
averaging the values of the nearest neighbors, making it a flexible tool for
different types of problems (a short regression sketch is shown below).
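To illustrate point 6, the following minimal sketch uses scikit-learn's KNeighborsRegressor on a tiny, made-up one-dimensional dataset; the prediction is simply the average of the target values of the nearest neighbors:

# K-NN used for regression: predict by averaging the targets of the K nearest points.
from sklearn.neighbors import KNeighborsRegressor

x = [[1], [2], [3], [4], [5]]     # single feature (hypothetical values)
y = [10, 20, 30, 40, 50]          # target values (hypothetical)

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(x, y)
print(reg.predict([[3.5]]))       # average of the targets at x=3 and x=4 -> [35.]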
Disadvantages of KNN Algorithm
Six disadvantages of the K-Nearest Neighbors (K-NN) algorithm:
1. Choosing the Value of K
Determining the optimal value for 'K' can be complex and time-consuming. If
the value of K is too small, the model may become sensitive to noise, while a
very large 'K' can overly generalize and miss finer patterns in the data, making
it tricky to achieve optimal performance.
2. High Computation Cost
K-NN requires calculating the distance between the new data point and all the
training samples in the dataset. This can be computationally expensive,
especially when dealing with large datasets, as the model performs this
distance calculation for every prediction, leading to slow performance.
3. Memory Intensive
As a lazy learner, K-NN needs to store the entire training dataset in memory
to classify new data points. This can be quite memory-intensive, especially
with large datasets, making it less scalable for very large training sets.
4. Sensitive to Irrelevant Features
K-NN can perform poorly when there are irrelevant or redundant features in
the dataset. These features can distort the distance calculations, leading to
inaccurate predictions since the model treats all features equally, even if they
don’t contribute to the classification task.
5. Difficulty Handling High-Dimensional Data (Curse of Dimensionality)
As the number of features increases, the distances between data points become
less distinct, making it harder for K-NN to differentiate between them. This
issue, known as the "curse of dimensionality," can reduce the algorithm’s
effectiveness in high-dimensional spaces, such as in text classification or
image recognition.
6. Poor Performance with Imbalanced Data
K-NN struggles with imbalanced datasets, where one class significantly
outnumbers the other. In such cases, the algorithm may be biased toward the
majority class, leading to inaccurate predictions for the minority class, as the
majority class will dominate the "voting" process of the K nearest neighbors.
Python implementation of the KNN
algorithm
For the Python implementation of the K-NN algorithm, we will use the
same problem and dataset that we used in Logistic Regression.
But here we will improve the performance of the model. Below is the
problem description:

Problem for K-NN Algorithm: A car manufacturer company has
manufactured a new SUV car. The company wants to show the ads
to the users who are interested in buying that SUV. For this
problem, we have a dataset that contains information about multiple
users collected through a social network. The dataset contains many
columns, but we will use Estimated Salary and Age as the independent
variables and Purchased as the dependent variable. Below
is the dataset:
Steps to implement the K-NN algorithm:

o Data Pre-processing step
o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of the Confusion Matrix)
o Visualizing the test set result

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as in Logistic
Regression. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

By executing the above code, our dataset is imported into our program and properly
pre-processed. After feature scaling, our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
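To check the scaling directly in code, a couple of extra lines like the following (using the variables defined above) will print the first few scaled test rows and confirm that both features are now on a comparable scale:

# Inspect the scaled data: Age and Estimated Salary after StandardScaler
print(x_test[:5])                        # first five scaled test rows
print("mean  :", x_train.mean(axis=0))   # close to 0 after scaling
print("stddev:", x_train.std(axis=0))    # close to 1 after scaling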

o Fitting the K-NN classifier to the Training data:
Now we will fit the K-NN classifier to the training data. To do this,
we will import the KNeighborsClassifier class from the sklearn.neighbors
library. After importing the class, we will create the classifier object of
the class. The parameters of this class are:

o n_neighbors: the number of neighbors the algorithm considers.
Usually, it is set to 5.
o metric='minkowski': the default parameter; it decides how the
distance between the points is measured.
o p=2: with the Minkowski metric, p=2 is equivalent to the standard
Euclidean distance.

Once we have chosen the value for 'K' and preprocessed the data, the next
step is to fit the classifier to the training data. Because K-NN is a lazy learner,
this step essentially stores the feature values and corresponding labels from the
training dataset, so that they can later be compared against new, unseen data
to make predictions. The code provided below demonstrates how to implement
this step in Python, using the K-NN classifier from scikit-learn to fit the
model. This step is what enables the model to classify new data based on the
training data it has received.
# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will
create a y_pred vector as we did in Logistic Regression. Below is
the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)
Output:

The output for the above code will be:


o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to
see the accuracy of the classifier. Below is the code for it:

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and
stored its result in the variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions
and 3+4= 7 incorrect predictions, whereas, in Logistic Regression, there
were 11 incorrect predictions. So we can say that the performance of the
model is improved by using the K-NN algorithm.
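The same comparison can be made directly in code. The short sketch below (using the y_test, y_pred, and cm variables from above) computes the accuracy, which for the counts quoted above would be 93/100 = 0.93:

# Accuracy computed directly from the predictions
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

# ...or from the confusion matrix: correct predictions divided by all predictions
print(cm.trace() / cm.sum())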

o Visualizing the Training set result:
Now, we will visualize the training set result for the K-NN model. The
code remains the same as in Logistic Regression, except for the
title of the graph. Below is the code for it:
# Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

Visualizing the Test set result:

After training the model, we will now test the result on a new
dataset, i.e., the Test dataset. The code remains the same
except for some minor changes: x_train and y_train are
replaced by x_test and y_test.
Below is the code for it:
# Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output: The above graph shows the output for the test dataset.

As we can see in the graph, the predicted output is quite good, as most of the red
points are in the red region and most of the green points are in the green region.

However, there are a few green points in the red region and a few red
points in the green region. These are the incorrect observations that we
observed in the confusion matrix (7 incorrect predictions).
