
UNIT 3

Support Vector Machine Algorithm


Support Vector Machine, or SVM, is one of the most popular Supervised Learning
algorithms, used for both Classification and Regression problems. However, it is
primarily used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes, so that we can easily put a new data point
in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
Since the support vectors create a decision boundary between these two classes of data (cat
and dog) using the extreme cases (the support vectors), the model will consider the extreme
cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a
cat. Consider the below diagram:

The SVM algorithm can be used for face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the
dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line,
and if there are 3 features, the hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum
distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features, x1
and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green
or blue. Consider the below image:

As it is a 2-D space, we can easily separate these two classes just by using a straight line. But
there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from
both classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
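For reference, this margin-maximization goal can be stated formally. The following is the standard textbook formulation of the hard-margin linear SVM, added here as a sketch (it does not appear in the original text):

% hyperplane: w^T x + b = 0, with labels y_i in {-1, +1}
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i

The margin, i.e., the distance between the two supporting hyperplanes passing through the support vectors, works out to 2/||w||, which is why minimizing ||w|| maximizes the margin.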
Non-Linear SVM:

If data is linearly arranged, we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So, to separate these data points, we need to add one more dimension. For linear data,
we have used the two dimensions x and y, so for non-linear data we will add a third
dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as in the below image:

So now SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert
it back to 2-D space with z = 1, it becomes:

Hence, we get a circle of radius 1 in the case of non-linear data.
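To make the z = x² + y² idea concrete, here is a minimal Python sketch (not part of the original tutorial) on synthetic circular data; make_circles and the variable names are assumptions for illustration:

# A minimal sketch of the manual kernel trick described above (assumed data).
import numpy as nm
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: not separable by a straight line in 2-D
x, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# add the third dimension z = x1^2 + x2^2
z = (x[:, 0] ** 2 + x[:, 1] ** 2).reshape(-1, 1)
x_3d = nm.hstack([x, z])

# a linear SVM can now separate the classes with a plane
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_3d, y)
print(classifier.score(x_3d, y))  # close to 1.0 on this data

In practice, passing kernel='rbf' to SVC performs a comparable implicit mapping without building the extra feature by hand.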

Advantages of SVM
 Effective in high-dimensional cases.
 Memory efficient, as it uses only a subset of the training points (the support vectors) in
the decision function.
 Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels (see the sketch below).
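As a sketch of the last point, scikit-learn's SVC accepts a Python callable as the kernel; the function below is a hypothetical example that simply reproduces the built-in linear kernel:

# A minimal custom-kernel sketch (assumed, not from the original text).
import numpy as nm
from sklearn.svm import SVC

def my_kernel(X, Y):
    # Gram matrix of the linear kernel: K(x, y) = x . y
    return nm.dot(X, Y.T)

classifier = SVC(kernel=my_kernel)
# classifier.fit(x_train, y_train)  # used like any other SVC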

Python Implementation of Support Vector Machine

Now we will implement the SVM algorithm using Python. Here we will use the same
dataset, user_data, which we used in Logistic Regression and KNN classification.

o Data Pre-processing step

Up to the data pre-processing step, the code remains the same. Below is the code:

#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

After executing the above code, the data will be pre-processed. The code will give the
dataset as:

The scaled output for the test set will be:

Fitting the SVM classifier to the training set:

Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we
will import the SVC class from the sklearn.svm library. Below is the code for it:

from sklearn.svm import SVC # "Support vector classifier"
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

In the above code, we have used kernel='linear', as here we are creating an SVM for linearly
separable data. However, we can change it for non-linear data. We then fitted the
classifier to the training dataset (x_train, y_train).

Output:

Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)

The model performance can be altered by changing the value of C (the regularization
factor), gamma, and the kernel.
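As a sketch of how such tuning might be done, a cross-validated grid search can be run over these parameters; the grid values below are assumptions, not from the original text:

# A minimal tuning sketch (assumed parameter grid).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],            # regularization factor
    'gamma': ['scale', 0.1, 1],   # kernel coefficient
    'kernel': ['linear', 'rbf'],
}
search = GridSearchCV(SVC(random_state=0), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)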

o Predicting the test set result:


Now, we will predict the output for the test set. For this, we will create a new vector
y_pred. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

After getting the y_pred vector, we can compare y_pred and y_test to check
the difference between the actual and predicted values.

Output: Below is the output for the prediction of the test set:

o Creating the confusion matrix:


Now we will see the performance of the SVM classifier: how many incorrect
predictions there are compared to the Logistic Regression classifier. To create
the confusion matrix, we need to import the confusion_matrix function of the
sklearn library. After importing the function, we will call it using a new
variable cm. The function takes two parameters, mainly y_true (the actual values)
and y_pred (the predicted values returned by the classifier). Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

As we can see in the above output image, there are 66+24 = 90 correct predictions and
8+2 = 10 incorrect predictions. Therefore, we can say that our SVM model improved as
compared to the Logistic Regression model.
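The accuracy can also be computed directly; the following is a small sketch (not in the original) that derives it from the confusion matrix and cross-checks it with accuracy_score:

# Accuracy from the confusion matrix (sketch).
from sklearn.metrics import accuracy_score

accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()  # correct predictions / total
print(accuracy)                               # 90 / 100 = 0.9 for the counts above
print(accuracy_score(y_test, y_pred))         # should agree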

o Visualizing the training set result:


Now we will visualize the training set result, below is the code for it:

from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see, the above output appears similar to the Logistic Regression output. In
the output, we get a straight line as the hyperplane because we used a linear kernel
in the classifier. And as discussed above, for 2-D space the hyperplane in SVM is a
straight line.

o Visualizing the test set result:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the output as:

As we can see in the above output image, the SVM classifier has divided the users into
two regions (Purchased or Not Purchased). Users who purchased the SUV are in the red
region with the red scatter points, and users who did not purchase the SUV are in the
green region with the green scatter points. The hyperplane has divided the two classes
of the Purchased variable.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based
on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the
available cases and puts the new case into the category that is most similar to the
available categories.
o The K-NN algorithm stores all the available data and classifies a new data point
based on similarity. This means that when new data appears, it can be easily
classified into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as Classification, but
mostly it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs an action on
it at the time of classification.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets
new data, it classifies that data into the category that is most similar to the new data.
o KNN is one of the most basic yet essential classification algorithms in
machine learning. It belongs to the supervised learning domain and finds
intense application in pattern recognition, data mining, and intrusion
detection.
o It is widely applicable in real-life scenarios since it is non-parametric,
meaning it does not make any underlying assumptions about the
distribution of the data (as opposed to other algorithms such as GMM, which
assume a Gaussian distribution of the given data). We are given some
prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
o Example: Suppose we have an image of a creature that looks similar to both a cat
and a dog, but we want to know whether it is a cat or a dog. For this identification,
we can use the KNN algorithm, as it works on a similarity measure. Our KNN model
will find the features of the new image most similar to the cat and dog images, and
based on the most similar features it will put the image in either the cat or the dog
category.
As an example, consider the following table of data points containing two
features:

[Figure: KNN algorithm working visualization]

Now, given another set of data points (also called testing data), allocate these
points to a group by analyzing the training set. Note that the unclassified points
are marked as 'White'.

Intuition Behind KNN Algorithm


If we plot these points on a graph, we may be able to locate some clusters or
groups. Now, given an unclassified point, we can assign it to a group by
observing what group its nearest neighbors belong to. This means a point close
to a cluster of points classified as 'Red' has a higher probability of being
classified as 'Red'.
Intuitively, we can see that the first point (2.5, 7) should be classified as 'Green',
and the second point (5.5, 4.5) should be classified as 'Red'.

Why do we need a KNN algorithm?


The K-NN algorithm is a versatile and widely used machine learning algorithm,
valued primarily for its simplicity and ease of implementation. It does not
require any assumptions about the underlying data distribution. It can also
handle both numerical and categorical data, making it a flexible choice for
various types of datasets in classification and regression tasks. It is a
non-parametric method that makes predictions based on the similarity of data
points in a given dataset. K-NN is also less sensitive to outliers compared to
other algorithms.
The K-NN algorithm works by finding the K nearest neighbors to a given data
point based on a distance metric, such as the Euclidean distance. The class or
value of the data point is then determined by the majority vote or average of the
K neighbors. This approach allows the algorithm to adapt to different patterns
and make predictions based on the local structure of the data.
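A minimal sketch of both modes (the toy data and k=3 below are assumptions for illustration):

# Majority vote for classification, neighbor average for regression (sketch).
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [9], [10], [11]]
y_class = [0, 0, 0, 1, 1, 1]
y_value = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
print(clf.predict([[1.5]]))  # majority vote of the 3 nearest -> [0]

reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)
print(reg.predict([[1.5]]))  # average of the 3 nearest targets -> [0.1]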

Suppose there are two categories, Category A and Category B, and we have a new
data point x1. In which of these categories will the data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:
Distance Metrics Used in KNN Algorithm
As we know, the KNN algorithm helps us identify the nearest points or the
groups for a query point. But to determine the closest groups or the nearest
points for a query point, we need some metric. For this purpose, we use the
distance metrics below:
Euclidean Distance
This is nothing but the Cartesian distance between two points in the
plane/hyperplane. The Euclidean distance can also be visualized as the length
of the straight line that joins the two points under consideration. This metric
gives the net displacement between the two states of an object. For two points
x and y in n dimensions:

d(x, y) = sqrt( Σᵢ (xᵢ − yᵢ)² )

Manhattan Distance
The Manhattan distance metric is generally used when we are interested in the
total distance traveled by the object rather than its displacement. This metric is
calculated by summing the absolute differences between the coordinates of the
points in n dimensions:

d(x, y) = Σᵢ |xᵢ − yᵢ|

Minkowski Distance
We can say that the Euclidean as well as the Manhattan distance are special
cases of the Minkowski distance:

d(x, y) = ( Σᵢ |xᵢ − yᵢ|ᵖ )^(1/p)

From the formula above, we can see that when p = 2 it is the same as the
formula for the Euclidean distance, and when p = 1 we obtain the formula
for the Manhattan distance.
The metrics discussed above are the most common when dealing with a Machine
Learning problem, but there are other distance metrics as well, like the Hamming
distance, which comes in handy for problems requiring overlapping
comparisons between two vectors whose contents can be Boolean as well as
string values.
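These three metrics can be computed directly with NumPy; the points below are arbitrary examples (a sketch, not from the original text):

# Euclidean, Manhattan, and Minkowski distances (sketch with assumed points).
import numpy as nm

a = nm.array([1.0, 2.0])
b = nm.array([4.0, 6.0])

print(nm.sqrt(nm.sum((a - b) ** 2)))  # Euclidean: 5.0
print(nm.sum(nm.abs(a - b)))          # Manhattan: 7.0

def minkowski(a, b, p):
    return nm.sum(nm.abs(a - b) ** p) ** (1 / p)

print(minkowski(a, b, 2))  # reduces to the Euclidean distance
print(minkowski(a, b, 1))  # reduces to the Manhattan distance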

How to choose the value of k for KNN Algorithm?


The value of k is very crucial in the KNN algorithm, as it defines the number of
neighbors considered. The value of k in the k-nearest neighbors (k-NN)
algorithm should be chosen based on the input data. If the input data has more
outliers or noise, a higher value of k is usually better. It is recommended to
choose an odd value for k to avoid ties in classification. Cross-validation
methods can help in selecting the best k value for the given dataset (see the
sketch after this list).
o There is no particular way to determine the best value for "K", so we need to try
some values to find the best among them. The most preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model
to the effects of outliers.
o Large values for K smooth out noise, but the model may then have difficulty
capturing fine class boundaries.
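A minimal cross-validation sketch for choosing k (the candidate values are assumptions; x_train and y_train refer to the scaled training data prepared in the implementation section below):

# Choosing k by cross-validation (sketch with assumed candidate values).
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             x_train, y_train, cv=5)
    print(k, scores.mean())
# pick the k with the best mean cross-validation accuracy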

How does K-NN work?


The K-Nearest Neighbors (KNN) algorithm operates on the principle of
similarity: it predicts the label or value of a new data point by considering
the labels or values of its K nearest neighbors in the training dataset.
A step-by-step explanation of how KNN works is given below:
Step 1: Selecting the optimal value of K
 K represents the number of nearest neighbors that need to be considered
while making a prediction.
Step 2: Calculating distance
 To measure the similarity between the target and training data points, the
Euclidean distance is used. The distance is calculated between each of the
data points in the dataset and the target point.
Step 3: Finding nearest neighbors
 The k data points with the smallest distances to the target point are the
nearest neighbors.
Step 4: Voting for classification or taking the average for regression
 In a classification problem, the class label is determined by majority voting.
The class with the most occurrences among the neighbors becomes the
predicted class for the target data point.
 In a regression problem, the predicted value is calculated by taking the
average of the target values of the K nearest neighbors. The calculated
average becomes the predicted output for the target data point.

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance of the K number of neighbors.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors; here we choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as shown above.
o By calculating the Euclidean distance, we get the nearest neighbors: three
nearest neighbors in category A and two nearest neighbors in category B. Consider
the below image:
o As we can see, the 3 nearest neighbors are from category A; hence this new data
point must belong to category A.

o Let X be the training dataset with n data points, where each data point is
represented by a d-dimensional feature vector, and let Y be the
corresponding labels or values for each data point in X. Given a new data
point x, the algorithm calculates the distance between x and each data
point xᵢ in X using a distance metric, such as the Euclidean distance:

d(x, xᵢ) = sqrt( Σⱼ (xⱼ − xᵢⱼ)² )

o The algorithm selects the K data points from X that have the shortest
distances to x. For classification tasks, the algorithm assigns the label y
that is most frequent among the K nearest neighbors of x. For regression
tasks, it calculates the average or weighted average of the values y of the
K nearest neighbors and assigns that as the predicted value for x.
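The procedure above can be written as a short from-scratch function; this is a sketch (the helper name knn_predict and the example data are assumptions, not the tutorial's code):

# From-scratch KNN classification: Euclidean distance + majority vote (sketch).
import numpy as nm
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # distance between x_new and every training point
    distances = nm.sqrt(nm.sum((X_train - x_new) ** 2, axis=1))
    # indices of the k nearest neighbors
    nearest = nm.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example usage with assumed data:
X = nm.array([[0, 0], [1, 1], [5, 5], [6, 6]])
y = nm.array([0, 0, 1, 1])
print(knn_predict(X, y, nm.array([0.5, 0.5]), k=3))  # -> 0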

Advantages of the KNN Algorithm


 Easy to implement, as the complexity of the algorithm is not that high.
 Adapts easily: since the KNN algorithm stores all the data in memory,
whenever a new example or data point is added, the algorithm adjusts itself to
that new example, which contributes to future predictions as well.
 Few hyperparameters: the only parameters required in training a KNN
algorithm are the value of k and the choice of distance metric suited to our
evaluation metric.
 It is robust to noisy training data.
 It can be more effective if the training data is large.

Disadvantages of the KNN Algorithm


 Does not scale: the KNN algorithm is also considered a lazy algorithm, which
mainly means that it takes lots of computing power as well as data storage.
This makes the algorithm both time-consuming and resource-exhausting.
 Curse of dimensionality: according to a phenomenon known as the peaking
phenomenon, the KNN algorithm is affected by the curse of dimensionality,
which means the algorithm has a hard time classifying data points properly
when the dimensionality is too high.
 Prone to overfitting: since the algorithm is affected by the curse of
dimensionality, it is prone to overfitting as well. Hence, feature selection as
well as dimensionality reduction techniques are generally applied to deal
with this problem (see the pipeline sketch after this list).
 We always need to determine the value of K, which may sometimes be complex.
 The computation cost is high because of calculating the distances between a data
point and all the training samples.
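As a sketch of the remedy mentioned above (the component count and k below are assumptions), dimensionality reduction can be chained in front of KNN with a pipeline:

# Scaling + PCA + KNN in one pipeline (sketch with assumed settings).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(StandardScaler(),
                      PCA(n_components=2),
                      KNeighborsClassifier(n_neighbors=5))
# model.fit(x_train, y_train); model.predict(x_test)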

Python implementation of the KNN algorithm


To do the Python implementation of the K-NN algorithm, we will use the same problem
and dataset which we used in Logistic Regression. But here we will improve the
performance of the model. Below is the problem description:

Problem for the K-NN algorithm: There is a car manufacturer company that has
manufactured a new SUV car. The company wants to show ads to the users who are
interested in buying that SUV. For this problem, we have a dataset that contains
multiple users' information from a social network. The dataset contains lots of
information, but we will consider Estimated Salary and Age as the independent
variables and the Purchased variable as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:

o Data Pre-processing step


o Fitting the K-NN algorithm to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below
is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported into our program and well
pre-processed. After feature scaling, our test dataset will look like:

From the above output image, we can see that our data is successfully scaled.

o Fitting K-NN classifier to the Training data:


Now we will fit the K-NN classifier to the training data. To do this, we will import
the KNeighborsClassifier class of the sklearn.neighbors library. After importing the
class, we will create the classifier object of the class. The parameters of this class
are:
o n_neighbors: defines the required number of neighbors for the algorithm.
Usually, it takes 5.
o metric='minkowski': this is the default parameter, and it decides the
distance metric between the points.
o p=2: makes the Minkowski metric equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:

#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:

o Creating the Confusion Matrix:


Now we will create the Confusion Matrix for our K-NN model to see the accuracy
of the classifier. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and called it,
storing the result in the variable cm.

Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29 = 93 correct predictions and 3+4 = 7
incorrect predictions, whereas in Logistic Regression there were 11 incorrect predictions.
So we can say that the performance of the model is improved by using the K-NN
algorithm.

o Visualizing the Training set result:


Now, we will visualize the training set result for the K-NN model. The code will remain
the same as in Logistic Regression, except for the name of the graph. Below is the
code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:

The output graph is different from the graph we obtained in Logistic
Regression. It can be understood from the points below:

o As we can see, the graph shows red points and green points. The
green points are for the Purchased (1) variable and the red points for the
Not Purchased (0) variable.
o The graph shows an irregular boundary instead of a straight line or a
curve because the K-NN algorithm classifies by finding the nearest
neighbors.
o The graph has classified users into the correct categories, as most of the
users who didn't buy the SUV are in the red region and users who bought
the SUV are in the green region.
o The graph shows a good result, but still there are some green points in
the red region and red points in the green region. This is no big issue, as
it prevents the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by using a new dataset,
i.e., the test dataset. The code remains the same except for some minor changes:
x_train and y_train are replaced by x_test and y_test.
Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:
The above graph shows the output for the test dataset. As we can see in the graph,
the predicted output is quite good, as most of the red points are in the red region and
most of the green points are in the green region.

Time Complexity: O(N log N) (for each query, dominated by sorting the distances to
all N training points)

Auxiliary Space: O(1)
Applications of the KNN Algorithm
 Data Preprocessing: while dealing with any Machine Learning problem, we
first perform the EDA part, and if we find that the data contains missing
values, multiple imputation methods are available. One such method is the
KNN Imputer, which is quite effective and generally used for sophisticated
imputation methodologies (see the sketch after this list).
 Pattern Recognition: KNN algorithms work very well here; if you train a KNN
algorithm on the MNIST dataset and then perform the evaluation, you will
find that the accuracy is quite high.
 Recommendation Engines: the main task performed by a KNN algorithm is to
assign a new query point to a pre-existing group that has been created using
a huge corpus of datasets. This is exactly what is required in recommender
systems: assign each user to a particular group and then provide them
recommendations based on that group's preferences.
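As a sketch of the KNN Imputer mentioned in the first point (the toy matrix below is an assumption), scikit-learn provides KNNImputer:

# Filling missing values with the mean of the nearest neighbors (sketch).
import numpy as nm
from sklearn.impute import KNNImputer

X = [[1, 2, nm.nan], [3, 4, 3], [nm.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # nan entries replaced by neighbor averages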
