Unit 3
Support Vector Machine (SVM) Algorithm for Machine Learning
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily place a new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram, in which two different categories
are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
Since SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors), it will consider the extreme cases of cats and dogs.
On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
The SVM algorithm can be used for face detection, image classification, text
categorization, and more.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called the Linear SVM
classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying
the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features in the dataset: if
there are 2 features (as shown in the image), the hyperplane is a straight line,
and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane with the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood with an example. Suppose we
have a dataset with two tags (green and blue) and two features, x1 and x2. We want a
classifier that can classify each pair (x1, x2) of coordinates as either green
or blue. Consider the below image:
Since this is a 2-D space, we can easily separate the two classes with just a straight line. But
there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the points of each class that are
closest to the boundary; these points are called support vectors. The distance between the
support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If the data is linearly separable, then we can separate it using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions, x and y, so for non-linear data, we will add a third dimension,
z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the below image:
So now SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we
convert it to 2-D space with z=1, it becomes:
Hence we get a circle of radius 1 in the case of non-linear data.
Advantages of SVM
Effective in high-dimensional cases.
Memory efficient, as it uses only a subset of the training points (the support vectors) in
the decision function.
Different kernel functions can be specified for the decision function, and it is also possible
to specify custom kernels.
Now we will implement the SVM algorithm using Python. Here we will use the same
dataset, user_data, which we used in the Logistic Regression and KNN classification tutorials.
Up to the data pre-processing step, the code will remain the same. Below is the code:
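A sketch of that shared pre-processing code is shown below; it matches the block given later in the KNN section of this unit (loading user_data.csv, taking Age and Estimated Salary as the independent variables, splitting 75/25, and scaling the features):

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent (Age, Estimated Salary) and dependent (Purchased) variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)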
After executing the above code, the data will be pre-processed. The code will give the
dataset as:
The scaled output for the test set will be:
Now the training set will be fitted to the SVM classifier. To create the SVM classifier, we
will import the SVC class from the sklearn.svm library. Below is the code for it:
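A minimal sketch, consistent with the Out[8] printout shown below (kernel='linear' and random_state=0 are taken from that printout):

# fitting the SVM classifier to the training set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)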
In the above code, we have used kernel='linear', as here we are creating an SVM for linearly
separable data. However, we can change it (for example, to a non-linear kernel) for non-linear
data. We then fitted the classifier to the training dataset (x_train, y_train).
Output:
Out[8]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=0,
shrinking=True, tol=0.001, verbose=False)
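Predicting the test set result: a short sketch that produces the y_pred vector referred to below:

# predicting the test set results
y_pred = classifier.predict(x_test)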
After getting the y_pred vector, we can compare y_pred and y_test to check
the difference between the actual and predicted values.
Output: Below is the output for the prediction of the test set:
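Creating the confusion matrix: to count the correct and incorrect predictions discussed below, we can use the confusion_matrix function from sklearn.metrics; a minimal sketch:

# creating the confusion matrix from the actual and predicted labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)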
Output:
As we can see in the above output image, there are 66+24= 90 correct predictions and
8+2= 10 incorrect predictions. Therefore, we can say that our SVM model improved as
compared to the Logistic Regression model.
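Visualizing the training set result: the plotting code itself is not shown in this section, so the block below is a typical sketch using matplotlib's ListedColormap; the title, axis labels, and red/green colour assignment are illustrative assumptions.

# visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# build a fine grid over the feature space and colour each grid point by the classifier's prediction
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# scatter the actual training points, coloured by their true class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=['red', 'green'][i], label=j)
mtp.title('SVM classifier (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()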
Output:
As we can see, the above output appears similar to the Logistic Regression output. In
the output, we got a straight line as the hyperplane because we used a linear kernel
in the classifier. And as discussed above, for 2-D space the hyperplane
in SVM is a straight line.
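Visualizing the test set result: the same sketch, with the training data swapped for the test data:

# visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=['red', 'green'][i], label=j)
mtp.title('SVM classifier (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()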
Output:
As we can see in the above output image, the SVM classifier has divided the users into
two regions (Purchased or Not Purchased). Users who purchased the SUV are in the red
region with the red scatter points, and users who did not purchase the SUV are in the
green region with the green scatter points. The hyperplane has divided the two classes into
the Purchased and Not Purchased variables.
K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based
on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
o The K-NN algorithm stores all the available data and classifies a new data point based
on similarity. This means that when new data appears, it can be easily classified
into a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as Classification, but mostly
it is used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and performs an action
on it at the time of classification.
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new
data, it classifies that data into the category most similar to the new data.
o KNN is one of the most basic yet essential classification algorithms in
machine learning. It belongs to the supervised learning domain and finds
intense application in pattern recognition, data mining, and intrusion
detection.
o It is widely usable in real-life scenarios since it is non-parametric,
meaning it does not make any underlying assumptions about the
distribution of the data (as opposed to other algorithms such as GMM, which
assume a Gaussian distribution of the given data). We are given some
prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a
dog, and we want to know whether it is a cat or a dog. For this identification, we can
use the KNN algorithm, as it works on a similarity measure. Our KNN model will
find the features of the new data set that are similar to the cat and dog images, and based
on the most similar features it will put the image in either the cat or the dog category.
As an example, consider the following table of data points containing two
features:
Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1. In which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:
Distance Metrics Used in KNN Algorithm
As we know, the KNN algorithm helps us identify the nearest points or
groups for a query point. But to determine the closest groups or the nearest
points for a query point, we need some metric. For this purpose, we use the below
distance metrics:
Euclidean Distance
This is nothing but the Cartesian distance between two points in the
plane/hyperplane. The Euclidean distance can also be visualized as the length
of the straight line joining the two points under consideration. This
metric helps us calculate the net displacement between the two states of
an object.
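For two n-dimensional points x and y, it can be written as:

$$ D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$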
Manhattan Distance
The Manhattan distance metric is generally used when we are interested in the total
distance traveled by the object instead of the displacement. It is
calculated by summing the absolute differences between the coordinates of the
points in n dimensions.
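In the same notation:

$$ D(x, y) = \sum_{i=1}^{n} |x_i - y_i| $$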
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan distance, are special
cases of the Minkowski distance.
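Its general form, for an order p, is:

$$ D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $$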
From the formula above we can say that when p = 2 then it is the same as the
formula for the Euclidean distance and when p = 1 then we obtain the formula
for the Manhattan distance.
The metrics discussed above are the most common when dealing with a Machine
Learning problem, but there are other distance metrics as well, like the Hamming
distance, which comes in handy for problems that require
overlapping comparisons between two vectors whose contents can be Boolean
or string values.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors; here we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. It can be calculated as:
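For two points A(x₁, y₁) and B(x₂, y₂), the standard formula is:

$$ d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$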
o By calculating the Euclidean distance, we got the nearest neighbors: three
nearest neighbors in Category A and two nearest neighbors in Category B. Consider
the below image:
o As we can see, the 3 nearest neighbors are from Category A; hence this new data
point must belong to Category A.
o Let X be the training dataset with n data points, where each data point is
represented by a d-dimensional feature vector, and let Y be the
corresponding labels or values for each data point in X. Given a new data
point x, the algorithm calculates the distance between x and each data
point in X using a distance metric, such as the Euclidean distance.
o The algorithm selects the K data points from X that have the shortest
distances to x. For classification tasks, it assigns to x the label y
that is most frequent among the K nearest neighbors. For regression
tasks, it calculates the average or weighted average of the
values y of the K nearest neighbors and assigns it as the predicted value
for x. A minimal sketch of this procedure is shown below.
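To make the procedure concrete, here is a minimal from-scratch sketch of KNN classification (the function name, variable names, and toy data are illustrative, not from the source):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy example: two features, two categories (0 = Category A, 1 = Category B)
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))  # -> 0 (Category A)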
Problem for the K-NN Algorithm: There is a car manufacturer company that has
manufactured a new SUV car. The company wants to show ads to the users who are
interested in buying that SUV. For this problem, we have a dataset that contains
multiple users' information gathered through a social network. The dataset contains lots of
information, but we will consider Estimated Salary and Age as the independent
variables and the Purchased variable as the dependent variable. Below is the dataset:
Steps to implement the K-NN algorithm:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below
is the code for it:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# importing the dataset
data_set = pd.read_csv('user_data.csv')

# extracting the independent and dependent variables
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
By executing the above code, our dataset is imported into our program and well
pre-processed. After feature scaling, our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
Now we will fit the K-NN classifier to the training data. Below is the code for it:
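A minimal sketch, consistent with the Out[10] printout shown below (n_neighbors=5, metric='minkowski', and p=2 are taken from that printout; with p=2 the Minkowski metric is equivalent to the Euclidean distance):

# fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)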
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create
a y_pred vector as we did in Logistic Regression. Below is the code for it:
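A short sketch:

# predicting the test set results
y_pred = classifier.predict(x_test)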
Output:
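Creating the Confusion Matrix: the code referred to in the next paragraph is a one-step use of scikit-learn's confusion_matrix; a minimal sketch:

# creating the confusion matrix from the actual and predicted labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)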
In the above code, we have imported the confusion_matrix function and called it, storing
the result in the variable cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7
incorrect predictions, whereas, in Logistic Regression, there were 11 incorrect predictions.
So we can say that the performance of the model is improved by using the K-NN
algorithm.
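Visualizing the Training set result: as in the SVM section, the plotting code is not shown in the source, so the block below is a typical ListedColormap sketch; with classes (0, 1) mapped to (red, green), it matches the description that follows (green for Purchased(1), red for Not Purchased(0)).

# visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# colour every point of a fine grid by the class the K-NN classifier predicts for it
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
# scatter the actual training points, coloured by their true class
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=['red', 'green'][i], label=j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()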
Output:
The output graph is different from the graph we obtained in Logistic
Regression. It can be understood from the below points:
o As we can see, the graph shows red points and green points. The
green points are for the Purchased(1) variable and the red points for the
Not Purchased(0) variable.
o The graph shows an irregular boundary instead of a straight
line or a curve because it is a K-NN algorithm, i.e., it finds the nearest
neighbors.
o The graph has classified users into the correct categories, as most of the users
who didn't buy the SUV are in the red region and users who bought the SUV
are in the green region.
o The graph shows a good result, but still there are some green points in
the red region and red points in the green region. This is not a big issue,
as it prevents the model from overfitting.
o Hence our model is well trained.
o Visualizing the Test set result:
After the training of the model, we will now test the result by using a new dataset,
i.e., the test dataset. The code remains the same except for some minor changes:
x_train and y_train are replaced by x_test and y_test.
Below is the code for it:
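A sketch mirroring the training set block, with the test data substituted:

# visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
                     nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], c=['red', 'green'][i], label=j)
mtp.title('K-NN Algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()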
Output:
The above graph shows the output for the test dataset. As we can see in the graph,
the predicted output is quite good, as most of the red points are in the red region and most
of the green points are in the green region.