
Lab-5 Manual: Machine Learning Models for Business Analytics

National University of Computer & Emerging Sciences, Karachi
Fall 2024, Lab Manual – 5

Supervised Learning - Classification with Support Vector Machine (SVM)

MSBA-3A
Course code:
Instructor: Adil Sheraz

Objectives
The following are the objectives of this lab.
1. Support Vector Machine
2. How SVM works
3. Support Vector Kernels
4. Implementation of SVC

1. Support Vector Machine


Support Vector Machines (SVMs) are supervised machine learning algorithms used for both classification and regression. An SVM classifier builds a model that assigns new data points to one of the given categories, so it can be viewed as a non-probabilistic binary linear classifier. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, which implicitly maps the inputs into high-dimensional feature spaces. Before diving into the working of SVM, let's first understand two basic terms used in the algorithm: the support vector and the hyperplane.
Hyperplane
A hyperplane is the decision boundary that separates the two classes in SVM. Data points falling on either side of the hyperplane are attributed to different classes. The dimension of the hyperplane depends on the number of input features in the dataset: with 2 input features the hyperplane is a line; with 3 features it becomes a two-dimensional plane.


Support Vectors
Support vectors are the data points that are nearest to the hyperplane; they determine the position and orientation of the hyperplane. We have to select the hyperplane for which the margin, i.e., the distance between the support vectors and the hyperplane, is maximum. Even a small change in the position of these support vectors can change the hyperplane.
Margin
A margin is the separation gap between the two lines passing through the closest data points. It is calculated as the perpendicular distance from the line to the support vectors, or closest data points. In SVMs, we try to maximize this separation gap so that we get the maximum margin.
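For concreteness, here is a minimal sketch in scikit-learn, using synthetic blob data as a stand-in for a linearly separable problem, that fits a linear SVM and inspects its support vectors and margin width:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic blobs stand in for a linearly separable problem.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)    # the points nearest the hyperplane
w = clf.coef_[0]                                     # only available for the linear kernel
print("Margin width:", 2 / np.linalg.norm(w))        # distance between the two margin lines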

2. How does SVM work


Let's take an example: we have a classification problem where we have to separate the red data points from the blue ones.

Since it is a two-dimensional problem, our decision boundary will be a line; for a three-dimensional problem we would use a plane, and the complexity of the solution increases as the number of features grows.

As shown in the above image, we have multiple lines that separate the data points successfully. But our objective is to find the best solution. There are a few rules that can help us identify the best line:
• Maximum Classification
• Best Separation


Maximum classification means the selected line must be able to successfully segregate all the data points into their respective classes. In our example, we can see that lines E and D misclassify a red data point. Hence, for this problem, lines A, B, and C are better than E and D, so we will drop E and D.

The second rule is best separation, which means we must choose a line that can still separate the points correctly when new data points arrive.

In our example, if we get a new red data point closer to line A, as shown in the image below, line A will misclassify that point. Similarly, if we get a new blue instance closer to line B, lines A and C will classify it correctly, whereas line B will misclassify it.

Notice that in both cases line C successfully classifies all the data points. Why? To understand this, let's take the lines one by one.

Let’s discuss which line to choose:


Why not line A

First, consider line A. If we move line A towards the left, it has very little chance of misclassifying the blue points. On the other hand, if we shift line A towards the right, it will very easily misclassify the red points. The reason is that on the left side the margin, i.e., the distance between the nearest data point and the line, is large, whereas on the right side the margin is very small.


Why not line B

Similarly, consider line B. If we shift line B towards the right, it has a sufficient margin on the right side, but it will wrongly classify instances on the left side because the margin towards the left is very small. Hence, B is not our perfect solution either.

Why not line C

In the case of line C, it has a sufficient margin on both the left and the right side. This maximum margin makes line C more robust to new data points that might appear in the future. Hence, C is the best fit in this case, successfully classifying all the data points with the maximum margin.

This is exactly what SVM looks for: it aims for the maximum margin and chooses the line that is equidistant from the closest points of both classes, which is line C in our case. So we can say that C represents the maximum-margin SVM classifier.

Now let's look at the data below. As we can see, this data is not linearly separable, so a linear decision boundary will not work in this situation. If we try to classify this data with a line anyway, the result will not be promising.


So, is there any way that SVM can classify this kind of data? For this problem, we have to create a
decision boundary that looks something like this.

The question is: is it possible to create such a decision boundary using SVM? The answer is yes. SVM does this by projecting the data into a higher dimension, as shown in the following image. In the first case, the data is not linearly separable, hence we project it into a higher dimension.

If we have more complex data, SVM will keep projecting the data into higher dimensions until it becomes linearly separable. Once the data becomes linearly separable, we can use SVM to classify it just like in the previous problems.

Projection into Higher Dimension


Now let's understand how SVM projects the data into a higher dimension. Take this data: it is a circular dataset that is not linearly separable.

To project the data into a higher dimension, we are going to create another dimension Z, where

Z = X² + Y²


Now we will plot this feature Z with respect to X, which gives us linearly separable data that looks like this.

Here, we have created a new feature Z from the base features X and Y; this process is known as a kernel transformation. Precisely, a kernel takes the original features as input and produces data that is linearly separable in a higher dimension.
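As a minimal sketch of this idea, the snippet below builds the Z = X² + Y² feature by hand on scikit-learn's make_circles toy data, used here purely as a stand-in for the circular dataset in the figure:

from sklearn.datasets import make_circles

# Circular, non-linearly-separable toy data.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# New dimension Z = X^2 + Y^2: points near the centre get a small Z and points
# on the outer ring get a large Z, so a single threshold on Z separates the classes.
z = X[:, 0] ** 2 + X[:, 1] ** 2
for label in (0, 1):
    print(f"class {label}: mean Z = {z[y == label].mean():.3f}")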

Now the question is: do we have to perform this transformation manually? The answer is no. SVM handles this process itself; we just have to choose the kernel type.
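A minimal sketch of this, again on the make_circles toy data, is to simply pass kernel="rbf" to SVC and let it handle the projection internally:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf")            # the kernel trick replaces the manual Z feature
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))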

3. Support Vector Kernels


Linear Kernel

To start with, in the linear kernel the decision boundary is a straight line. Unfortunately, most real-world data is not linearly separable, which is the reason the linear kernel is not widely used in SVM.


Gaussian / RBF kernel


It is the most commonly used kernel. It projects the data onto a Gaussian surface, where the red points become the peak of the surface and the green data points form its base, making the data linearly separable.

However, this kernel can be prone to overfitting, since it may also capture noise in the data.

Polynomial kernel
Finally, we have the polynomial kernel, which is non-linear in nature because it uses polynomial combinations of the base features. It often gives good results.

The problem with the polynomial kernel is that the number of higher-dimensional features grows very quickly with the polynomial degree. As a result, it is computationally more expensive than the RBF or linear kernel.
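As an illustration only, the sketch below compares the three kernels discussed above on scikit-learn's built-in breast cancer dataset (chosen here just as a convenient example; results will vary by dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ["linear", "rbf", "poly"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))   # scale, then fit the SVM
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>6} kernel: mean CV accuracy = {scores.mean():.3f}")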


Hard margin
In a hard margin SVM, the goal is to find the hyperplane that can perfectly separate the data into two
classes without any misclassification. However, this is not always possible when the data is not linearly
separable or contains outliers. In such cases, the hard margin SVM will fail to find a hyperplane that can
perfectly separate the data, and the optimization problem will have no solution.

Soft Margin
In a soft margin SVM, we allow some misclassification by introducing slack variables that allow some
data points to be on the wrong side of the margin. The optimization problem in a soft margin SVM is
modified as follows:
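    minimize    (1/2) ‖w‖² + C Σᵢ ξᵢ
    subject to  yᵢ (w·xᵢ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0   for all i,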

where ξi are the slack variables, and C is a hyperparameter that controls the trade-off between maximizing
the margin and minimizing the misclassification. A larger value of C results in a narrow margin and fewer
misclassifications, while a smaller value of C results in a wider margin but more misclassifications.

Geometrically, the soft margin SVM introduces a penalty for the data points that lie on the wrong side of
the margin or even on the wrong side of the hyperplane. The slack variables ξi allow these data points to
be within the margin or on the wrong side of the hyperplane, but they incur a penalty in the objective
function. The optimization problem in a soft margin SVM finds the hyperplane that maximizes the margin
while minimizing the penalty for the misclassified data points.
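The sketch below, using make_circles toy data as a hypothetical example, shows this trade-off: a very large C behaves close to a hard margin, while a small C tolerates more training misclassifications in exchange for a wider margin:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100, 10000]:            # a very large C behaves close to a hard margin
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    print(f"C={C:>8}: train acc = {clf.score(X_train, y_train):.3f}, "
          f"test acc = {clf.score(X_test, y_test):.3f}")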

Gamma Parameter
It tells us how much influence an individual data point has on the decision boundary.

– Large gamma: the influence of each point is very local, so only nearby points shape each part of the decision boundary. The boundary becomes highly non-linear, which can lead to overfitting.

– Small gamma: the influence of each point reaches further, so more data points influence the decision boundary. The boundary is smoother and more generic.
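A minimal sketch of the gamma effect, reusing the same hypothetical make_circles data with an RBF kernel:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.5, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1, 100]:               # large gamma -> very local influence, overfitting risk
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma:>6}: train acc = {clf.score(X_train, y_train):.3f}, "
          f"test acc = {clf.score(X_test, y_test):.3f}")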


4. Implementation of SVC

TASK 1
Download the Breast Cancer Dataset from here: https://drive.google.com/file/d/1dtxThgA7XHVq08-ffjJoKuR7fX-QIB1D/view?usp=sharing

1. Perform EDA.
a. Check the head and tail of the dataset.
b. Check the datatype of each feature.
c. Check the missing values.
d. Print the column names.
e. Check unique values of the target column.
f. Check the distribution of the dataset (balanced or not) with respect to the target variable.
g. Check the distribution of the features (a histogram helps in this).
h. Check and plot each feature for outliers (a box plot helps in this).
i. If you find outliers in your dataset, which SVM variant should we apply (hard margin or
soft margin)? Explain the reason in your own words.
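A minimal sketch of these EDA steps is given below; the file name breast_cancer.csv and the target column name diagnosis are assumptions, so adjust them to match the downloaded file:

import pandas as pd

df = pd.read_csv("breast_cancer.csv")        # assumed local file name

print(df.head())                             # a. head ...
print(df.tail())                             # ... and tail
print(df.dtypes)                             # b. datatype of each feature
print(df.isnull().sum())                     # c. missing values
print(df.columns.tolist())                   # d. column names
print(df["diagnosis"].unique())              # e. unique values of the target column
print(df["diagnosis"].value_counts())        # f. class balance of the target
df.hist(figsize=(14, 12))                    # g. feature distributions
df.plot(kind="box", subplots=True, layout=(6, 6), figsize=(14, 12))   # h. outliers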

TASK 2
Perform the following steps:
1. Split the dataset into features and target variable, and verify the shape of both.
2. Normalize the features with sklearn's StandardScaler.
3. The target variable is [M/B]; convert it into [0/1] categories using sklearn's LabelEncoder.
4. Split the dataset into train & test splits, with 70% of the data for training and 30% for testing. Verify
the split by printing the shapes of X_train, X_test, y_train, and y_test.
5. Implement the SVM as an SVC with default parameters from the sklearn library.
6. Check the training and testing accuracies of the model, and explain in your own words whether
overfitting or underfitting happened.
7. Explain in your own words what overfitting and underfitting are, and which kind of results
indicates each.
8. Plot / print the confusion matrix.
9. Print the classification report.
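A minimal sketch of this workflow is given below, reusing the assumed breast_cancer.csv file name and diagnosis target column from the EDA sketch; adapt the names to the actual dataset:

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("breast_cancer.csv")                 # assumed file / column names

# 1. Features and target (drop an "id" column as well if the file has one)
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"]
print(X.shape, y.shape)

# 2. Normalize the features; 3. encode B/M as 0/1
X = StandardScaler().fit_transform(X)
y = LabelEncoder().fit_transform(y)

# 4. 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# 5. SVC with default parameters
clf = SVC().fit(X_train, y_train)

# 6. Compare training and testing accuracy to judge over/underfitting
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))

# 8-9. Confusion matrix and classification report
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))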

