
Support Vector Machines

Moving towards better generalization for big data

• When we deal with a big data set that needs a complicated model, the full Bayesian framework is very computationally expensive

• So we need a frequentist method that is faster yet still generalizes well

• The process starts with preprocessing the input vectors

• Try to extract a layer of "features" rather than predict answers directly from raw inputs
  - Sensible if we already know that certain combinations of input values would be useful (such as edges or corners in an image)

• Instead of learning the features, try to design them by hand
  - Hand-coded features are equivalent to a layer of non-linear neurons that do not need to be learned
Support Vector Machines (SVM)

What is an SVM?

• A supervised learning model

• Analyzes data and recognizes patterns for classification and regression problems

• SVMs utilize a very large set of non-linear features that is task-independent

• Exhibits a clever way to prevent overfitting

• Basically, SVM training maps examples as points in space with a wide, clear division between the examples of separate categories or groups

• New examples are then mapped into the same space and assigned to a category based on which side of the division they fall on
Types of SVM

• Simple SVM: used for linear classification and regression problems

• Kernel SVM: more flexible for non-linear data, since more features can be added to fit a hyperplane (i.e. by adding extra dimensions) instead of staying in the original two-dimensional space (see the fitcsvm sketch below)
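As a concrete illustration, here is a minimal MATLAB sketch of how the two variants might be trained with fitcsvm (this assumes the Statistics and Machine Learning Toolbox; the toy matrices X and Y are made up for the example):

% Minimal sketch: toy 2-D data with labels in {-1, +1}.
X = [randn(20,2)-1; randn(20,2)+1];
Y = [-ones(20,1); ones(20,1)];

% "Simple" (linear) SVM: fits a separating hyperplane in the input space.
linModel = fitcsvm(X, Y, 'KernelFunction', 'linear');

% Kernel SVM: the RBF kernel implicitly adds extra dimensions so a
% hyperplane in the feature space can separate non-linear data.
rbfModel = fitcsvm(X, Y, 'KernelFunction', 'rbf');

% Both models are used the same way at prediction time.
labels = predict(rbfModel, [0 0; 2 2]);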
Purpose of using SVMs

• Handles both classification and regression on linear and non-linear datasets

• Finds complex relationships in your data without you needing to perform a lot of transformations on your own

• Effective on datasets with multiple features, like financial or medical data

• Utilizes a subset of training points in the decision function, called "support vectors", which makes it memory efficient
Example: Classification

Let us consider M (Male) = +1, F (Female) = -1

Height (cm)   Weight (kg)   M/F
    144           56         -1
    145           68         -1
    171          107          1
    149           58         -1
    168           84          1
    172           93          1
    156           39         -1
    157           84          1
    160          106          1
    162           78          1
    153           39         -1
    168          100          1
    163           79          1
    171           87          1
    150           44         -1
    162           81          1
    165          101          1
    152           39         -1
    169          108          1
    156           54         -1

[Figure: scatter plot of the data with Weight (up to 100 kg) on the horizontal axis and Height (up to 200 cm) on the vertical axis, showing the separating hyperplane, the support vectors of each class, and the margin r on either side.]
How does SVM work

• SVM is a supervised learning algorithm which finds an optimal hyperplane between two classes for making a decision about a new, unknown entry point

• Support vectors are the data points of each class that lie closest to this decision boundary

• There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM

• A hyperplane is a subspace of dimension D-1

• For a 2-D vector space, the hyperplane will be a line

[Figure: the Height (up to 200 cm) vs Weight (up to 100 kg) scatter plot again, with the hyperplane, the support vectors, and the margin distance r marked on both sides.]

How does SVM work

Since the hyperplane is a line in our case, let us consider the function of a vector x lying in the vector space:

    f(x) = w^T x + b

w is an orthogonal (normal) vector to the hyperplane, such that for any point x lying on the hyperplane

    w^T x + b = 0

The main idea is to partition the space by a hyperplane such that the cluster lying on one side of the hyperplane is denoted by label +1 and the other is denoted by -1

[Figure: the hyperplane with its normal vector w at 90° to it and offset b.]
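As a tiny illustration of this decision rule, the following MATLAB sketch evaluates the sign of w^T x + b for a few points (the w and b here are made up for the example, not learned):

% Hypothetical hyperplane parameters for a 2-D input space.
w = [2; -1];          % normal (orthogonal) vector to the hyperplane
b = 0.5;              % offset

f = @(x) w.'*x + b;   % signed value of the hyperplane function

xOn  = [0; 0.5];      % lies on the hyperplane: f(xOn) = 0
xPos = [3; 1];        % f > 0 -> label +1
xNeg = [-2; 1];       % f < 0 -> label -1

labels = sign([f(xOn) f(xPos) f(xNeg)])   % -> [0 1 -1]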
Classification

• For all points lying on the positive side, with label y_i = +1

    w^T x_i + b > 0   ……………………(1)

• For all points lying on the negative side, with label y_i = -1

    w^T x_i + b < 0   ……………………(2)

• After combining (1) and (2), we get

    y_i (w^T x_i + b) > 0   for all i
How to choose the hyperplane

There are infinitely many candidates for the hyperplane; which one should we choose?

• To find a unique solution, we should choose the separating hyperplane that maximizes the margin between the positive and negative classes

• Margin: the distance of the hyperplane to the closest examples in the dataset, assuming the dataset is linearly separable
How to choose the hyperplane

Consider a hyperplane (w, b) and an example x_a

• x_a lies on the positive side of the hyperplane

• x'_a is the orthogonal projection of x_a onto the hyperplane

• The difference vector x_a - x'_a lies in the direction of w

• The unit vector in the direction of w is w / ||w||, therefore

    x_a = x'_a + r (w / ||w||)

• Since the hyperplane must bisect the two classes, the positive and negative support vectors must have equal margin

• All points/vectors in both classes must satisfy

    y_i (w^T x_i + b) / ||w|| ≥ r,   where r is a scalar representing the margin

[Figure: x_a, its projection x'_a onto the hyperplane, the normal vector w, and equal margins r on both sides of the hyperplane.]
Optimization problem

The optimization problem can be defined as

    max_{w,b} r
    subject to: y_i (w^T x_i + b) / ||w|| ≥ r for all i

We know that x_a = x'_a + r (w / ||w||) and that w^T x'_a + b = 0

We have

    w^T x_a + b = w^T x'_a + b + r (w^T w) / ||w|| = r ||w||

so that

    r = (w^T x_a + b) / ||w||   …………………(3)
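A quick numeric check of (3) in MATLAB, using a hypothetical hyperplane and point (both made up for the example):

% Distance of a point x_a from the hyperplane w'x + b = 0, as in (3).
w  = [3; 4];                 % ||w|| = 5
b  = -5;
xa = [3; 4];                 % an example on the positive side

r = (w.'*xa + b) / norm(w);  % (9 + 16 - 5) / 5 = 4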
Optimization problem

Let us assume the scenario shown in the figure. Here we normalize everything such that, for the positive support vector x_a,

    w^T x'_a + b = 0   and   w^T x_a + b = 1   ………………….(4)

Applying (4) into (3), and by observing that the support vectors are the closest points to the hyperplane, we get

    r = 1 / ||w||

As w^T x_a + b = 1 for the support vector, the constraint gets modified as

    y_i (w^T x_i + b) ≥ 1   for all i

Hence we get the modified optimization problem as

    max_{w,b} 1 / ||w||
    subject to: y_i (w^T x_i + b) ≥ 1 for all i

[Figure: the support vector x_a, its projection x'_a onto the hyperplane ⟨w, x⟩ + b = 0, the margin r, and the parallel plane ⟨w, x⟩ + b = 1 passing through x_a.]
Modified Optimization problem

The modified optimization problem can be rewritten as

    min_{w,b} (1/2) ||w||^2
    subject to: y_i (w^T x_i + b) ≥ 1 for all i   ……………………(A)

• The factor of 1/2 is included to have a tidier form for the gradient calculation

• We minimize (1/2) ||w||^2 in place of maximizing 1/||w||, which is equivalent

Solution to the SVM optimization problem

From the optimization problem (A), we can formulate the Lagrangian as

    L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ]   ……………………..(B)

Now, find the derivatives of L w.r.t. w and b

By setting ∂L/∂w and ∂L/∂b to zero, we get

    w = Σ_i α_i y_i x_i   and   Σ_i α_i y_i = 0   ……………………(C)
Dual Optimization problem

Substituting (C) into (B), we get the dual Lagrangian function

    L_D(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)

• Dual optimization problem

    max_α L_D(α)
    subject to: α_i ≥ 0 for all i,   Σ_i α_i y_i = 0   ……………………(D)

• The examples for which α_i is zero do not contribute to the solution of (D)

• All other points, for which α_i > 0, are called SUPPORT VECTORS since they support the hyperplane

• Once we obtain the solution values (or dual parameters) of the optimal α from (D), we can recover the optimal w by using

    w = Σ_i α_i y_i x_i   …………………….(E)

• From (E), and any support vector x_s with label y_s, we can also recover the optimal b as b = y_s - w^T x_s (a small numerical sketch follows below)
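For intuition, here is a minimal MATLAB sketch that solves the dual (D) directly with quadprog on a tiny, hand-made separable toy set and then recovers w and b via (E). It assumes the Optimization Toolbox and is only an illustration of the formulas above, not how fitcsvm works internally:

% Toy linearly separable data (hypothetical), labels in {-1, +1}.
X = [1 1; 2 2; 2 0; 0 0; -1 0; 0 -1];
y = [1; 1; 1; -1; -1; -1];
n = numel(y);

% Dual (D) written as a quadratic program:
%   min_a  1/2 a'Qa - 1'a   s.t.  y'a = 0,  a >= 0
Q   = (y*y') .* (X*X');
f   = -ones(n,1);
Aeq = y'; beq = 0;
lb  = zeros(n,1);
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, []);

% Recover w from (E) and b from any support vector (alpha_i > 0).
w  = X' * (alpha .* y);
sv = alpha > 1e-6;
b  = mean(y(sv) - X(sv,:)*w);

% Decision rule for new points: sign(w'x + b); here it reproduces y.
sign(X*w + b)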


Male Female Classification Based on Weight and Height
MATLAB Implementation

clear all;
%% Generation of the training data
W=[];
H=[];
L=[];
for n=1:30
    r=randi([0 1],1,1);
    if(r==1)
        height=160+randi([-5 15],1,1);
        weight=80+randi([-12 30],1,1);
        H=[H;height];
        W=[W;weight];
        L=[L;1];
    else
        height=145+randi([-5 12],1,1);
        weight=50+randi([-12 20],1,1);
        H=[H;height];
        W=[W;weight];
        L=[L;-1];
    end
end
scatter(W,H);
D=[W,H];
%% SVM training
SVMModel = fitcsvm(D,L,'KernelFunction','linear','BoxConstraint',Inf,'ClassNames',[-1,1]);
% Predict scores over the grid
d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(D(:,1)):d:max(D(:,1)),min(D(:,2)):d:max(D(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
[~,scores] = predict(SVMModel,xGrid);
h(1:2) = gscatter(D(:,1),D(:,2),L,'rb','.');
hold on;
h(3) = plot(D(SVMModel.IsSupportVector,1),D(SVMModel.IsSupportVector,2),'ko');
contour(x1Grid,x2Grid,reshape(scores(:,2),size(x1Grid)),[0 0],'k');
hold on;
F=[100,180;60,155;55,142;75,165];
[clss,score] = predict(SVMModel,F);
h(4)=scatter(F(:,1),F(:,2),'d');
legend(h,{'Female','Male','Support Vectors','Test Data'});
hold on;
Non-linear SVMs

• Datasets that are linearly separable with some noise work out great

• But what are we going to do if the dataset is just too hard?

• How about… mapping the data to a higher-dimensional space?

[Figure: three 1-D number lines labelled x. In the first, the classes are separable with some noise; in the second, they are interleaved and no single threshold works; in the third, the same data is mapped to (x, x²) and becomes separable.]
Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable

    Φ: x → φ(x)
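As a toy illustration of Φ, the following MATLAB sketch (with hand-made 1-D data) maps points that are not separable on the line to (x, x²), where a horizontal line separates them:

% 1-D data: the negative class sits between the two halves of the
% positive class, so no single threshold on x separates them.
x = [-3 -2.5 -2 -0.5 0 0.5 2 2.5 3];
y = [ 1   1   1  -1 -1  -1  1  1  1];

% Feature map phi(x) = (x, x^2): in 2-D the classes become separable
% by the horizontal line x^2 = 2 (a hyperplane in the new space).
plot(x(y==1), x(y==1).^2, 'b+', x(y==-1), x(y==-1).^2, 'ro');
hold on; plot([-3.5 3.5], [2 2], 'k--'); hold off
xlabel('x'); ylabel('x^2');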
Non-linear SVMs: The "Kernel Trick"

In the objective function of the dual problem we have

    L_D(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)

• This only contains the inner product of x_i and x_j

• The objective function does not contain x_i or x_j on their own, only their inner products

• Therefore x_i and x_j can be replaced by φ(x_i) and φ(x_j), where φ(x) is a set of features used to represent x

• φ could be a non-linear function, and we can then use the SVM to construct classifiers that are non-linear in x
Non-linear SVMs: The "Kernel Trick"

• For this we rely on the similarity function

    K(x_i, x_j) = φ(x_i)^T φ(x_j)

K is also called a kernel; the feature space it corresponds to is a Hilbert space

• A kernelized SVM is equivalent to a linear SVM that operates in feature space rather than in input space

Some examples of SVM kernels (see the sketch below)

• Polynomial Kernel
• Gaussian Kernel
• Radial Basis Function Kernel
• Rational Quadratic Kernel

[Figure: points plotted against "First feature" and "Second feature" in the kernel-induced feature space.]
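For reference, these kernels can be written as simple similarity functions. A MATLAB sketch of their usual forms follows; the exact parameterization (constants c, degree d, length scale, etc.) varies from library to library, and the values used here are just examples:

% Common kernel functions k(x, z) as anonymous functions (x, z column vectors).
polyKernel = @(x,z,c,d)   (x.'*z + c).^d;                     % polynomial of degree d
rbfKernel  = @(x,z,sigma) exp(-norm(x-z)^2/(2*sigma^2));      % Gaussian / RBF
ratQuad    = @(x,z,a,l)   (1 + norm(x-z)^2/(2*a*l^2)).^(-a);  % rational quadratic

x = [1; 2]; z = [2; 0];
k1 = polyKernel(x, z, 1, 2);   % (x'z + 1)^2 = (2 + 1)^2 = 9
k2 = rbfKernel(x, z, 1);       % exp(-5/2)
k3 = ratQuad(x, z, 1, 1);      % (1 + 5/2)^-1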
Radial Basis Function (RBF) Kernel

• While working with SVMs for non-linear datasets, sometimes we cannot decide which would be the right feature transform or which kernel to choose; in that case the RBF kernel is the savior

• Most generalized form of kernelization

• One of the most widely used kernels, due to its similarity to the Gaussian distribution

• Similarity measurement: the RBF kernel function for two points, say X1 and X2, computes how close they are to each other

• Mathematical representation

    K(X1, X2) = exp( - ||X1 - X2||² / (2σ²) )

where σ is the variance and our "hyperparameter", and ||X1 - X2|| is the Euclidean (L2-norm) distance between X1 and X2
Radial Basis Function (RBF) Kernel

• Let d12 = ||X1 - X2|| be the distance between X1 and X2; then the kernel can be written as

    K(X1, X2) = exp( - d12² / (2σ²) )

• The maximum value the RBF kernel can take is 1, which occurs when d12 = 0, i.e. when X1 = X2

• When the points are the same (no distance between them), they are exactly similar points

• When the points are separated by a large distance, the kernel value is < 1 and close to 0: dissimilar points

• Distance can be thought of as equivalent to dissimilarity, since increased distance between points means they are less similar (similarity ↓ as distance ↑)

[Figure: two points X1 and X2 in space separated by the distance d12 = ||X1 - X2||, with a farther point at distance d13 illustrating lower similarity.]
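A small MATLAB sketch of this behaviour, computing the RBF kernel value for a few pairs of points (the points and σ are made up for the example):

% RBF kernel as a function of two points and sigma.
rbf = @(x1,x2,sigma) exp(-norm(x1-x2)^2 / (2*sigma^2));

sigma = 1;
rbf([1;1], [1;1],   sigma)   % identical points  -> 1
rbf([1;1], [1.5;1], sigma)   % nearby points     -> close to 1 (~0.88)
rbf([1;1], [6;1],   sigma)   % far-apart points  -> close to 0 (~3.7e-6)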
Radial Basis Function (RBF) Kernel

• It is important to find the right value of 'σ' to decide which points should be considered similar, and this can be demonstrated on a case-by-case basis

When σ = 1, the RBF kernel's mathematical equation will be

    K(X1, X2) = exp( - d12² / 2 )

• Notice that as the distance increases, the RBF kernel decreases exponentially and is approximately 0 for distances greater than 4

• We can notice that when d12 = 0 the similarity is 1, and as d12 increases beyond 4 units the similarity drops to 0

• From the graph, we see that if the distance is below 4 the points can be considered similar, and if the distance is greater than 4 then the points are dissimilar
Radial Basis Function (RBF) Kernel

When σ = 0.1, the RBF kernel's mathematical equation will be

    K(X1, X2) = exp( - d12² / 0.02 )

The width of the Region of Similarity is minimal for σ = 0.1, and hence points are considered similar only if they are extremely close

1. We see that the curve is extremely peaked and is approximately 0 for distances greater than 0.2

2. The points are considered similar only if the distance is less than or equal to 0.2
Radial Basis Function (RBF) Kernel

When σ = 10, the RBF kernel's mathematical equation will be

    K(X1, X2) = exp( - d12² / 200 )

The width of the Region of Similarity is large for σ = 10, because of which points that are farther away can still be considered similar

1. The width of the curve is large

2. The points are considered similar for distances up to about 10 units, and beyond 10 units they are dissimilar
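To visualize the three cases above, a short MATLAB sketch that plots the kernel value against distance for σ = 0.1, 1 and 10, showing how the width of the "Region of Similarity" changes with σ:

% Kernel value vs. distance for the three sigma values discussed above.
d = linspace(0, 12, 500);             % distance d12 between two points
for sigma = [0.1 1 10]
    plot(d, exp(-d.^2 ./ (2*sigma^2))); hold on
end
hold off
xlabel('distance d_{12}'); ylabel('RBF kernel value');
legend('\sigma = 0.1', '\sigma = 1', '\sigma = 10');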
Radial Basis Function (RBF) Kernel

• It is evident from the above cases that the width of the Region of Similarity changes as σ changes

• Finding the right σ for a given dataset is important and can be done by using hyperparameter tuning techniques like Grid Search Cross-Validation and Random Search Cross-Validation

• Popular because of its similarity to the K-Nearest Neighbors algorithm: it has the advantages of K-NN and overcomes the space complexity problem, as an RBF-kernel SVM just needs to store the support vectors during training and not the entire dataset

• Implemented in the scikit-learn library with two hyperparameters associated with it, 'C' for the SVM and 'gamma' for the RBF kernel; here, gamma is inversely proportional to σ

Finding the right gamma (or σ) along with the value of C is essential in order to achieve the best bias-variance trade-off
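In MATLAB's fitcsvm the analogous knobs are 'BoxConstraint' (playing the role of C) and 'KernelScale' (playing the role of σ). A hedged sketch of letting the built-in tuner search both by cross-validation, assuming a recent Statistics and Machine Learning Toolbox and toy data made up for the example:

% Hypothetical toy data; in practice X, Y are your training set.
X = [randn(50,2)-1; randn(50,2)+1];
Y = [-ones(50,1); ones(50,1)];

% Search BoxConstraint (C) and KernelScale (the sigma-like parameter of
% the RBF kernel) by cross-validated Bayesian optimization.
model = fitcsvm(X, Y, 'KernelFunction', 'rbf', ...
    'OptimizeHyperparameters', {'BoxConstraint','KernelScale'}, ...
    'HyperparameterOptimizationOptions', struct('ShowPlots', false));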
Visual Representation: 2D

1. SVM will search for apples that are similar to lemons (apples which are yellow and have an elliptic form). This will be a support vector. The other support vector will be a lemon similar to an apple (green and rounded).

2. Find the best hyperplane that separates the classes.

3. We have an infinite number of possibilities to draw the decision boundary. So how can we find the optimal one?
Visual Representation: Finding the Optimal Hyperplane

• Intuitively, the best line is the line that is far away from both the apple and the lemon examples (has the largest margin). To have an optimal solution, we have to maximize the margin in both directions (if we have multiple classes, then we have to maximize it considering each of the classes).

4. Comparing the 1st and 2nd pictures above, we can easily observe that the first is the optimal hyperplane (line) and the second is a sub-optimal solution, because its margin is far shorter.

5. Use a "global" margin which takes into consideration all the classes. This margin would look like the purple line; it is orthogonal to the boundary and equidistant to the support vectors.

• Each of the calculations (distances and optimal hyperplanes) is made in a vector space, so each data point is considered a vector. The dimension of the space is defined by the number of attributes of the examples.

• Basically, learning is equivalent to finding the hyperplane with the best margin, so it is a simple optimization problem.
Visual Representation: SVM for Non-Linear Data Sets

• In this case we cannot find a straight line to separate apples from lemons. So how can we solve this problem? We will use the Kernel Trick!

• The basic idea is that when a data set is inseparable in the current dimensions, we add another dimension; maybe that way the data will be separable.

• Map the apples and lemons (which are just simple points) to this new space.

• Use a transformation in which we add levels based on distance.

• Points at the origin will be on the lowest level.

• While moving away from the origin (from the center of the plane towards the margins), we are climbing the hill, so the level of the points will be higher.
Visual Representation: Mapping from 1D to 2D

• Using transformations we can easily separate the two classes. These transformations are called kernels.

• After the transformation, we can easily delimit the two classes using just a single line.

• In real-life applications we won't have a simple straight line; we will have lots of curves and high dimensions.

• We need some trade-offs, i.e. tolerance for outliers.

• The SVM algorithm has a so-called regularization parameter to configure the trade-off and to tolerate outliers.
Non-linear SVM Example

%% Non-linear SVM
%% Generate 100 points uniformly distributed in the unit disk.
r = sqrt(rand(100,1)); % Radius
t = 2*pi*rand(100,1);  % Angle
data1 = [r.*cos(t), r.*sin(t)]; % Points
%% Generate 100 points uniformly distributed in the annulus.
r2 = sqrt(3*rand(100,1)+1); % Radius
t2 = 2*pi*rand(100,1);      % Angle
data2 = [r2.*cos(t2), r2.*sin(t2)]; % Points
%% Plot the points, and plot circles of radii 1 and 2 for comparison.
figure;
plot(data1(:,1),data1(:,2),'r.','MarkerSize',15)
hold on
plot(data2(:,1),data2(:,2),'b.','MarkerSize',15)
axis equal
hold off
%% Put the data in one matrix, and make a vector of classifications.
data3 = [data1;data2];
theclass = ones(200,1);
theclass(1:100) = -1;
%% Train an SVM classifier with KernelFunction set to 'rbf' and BoxConstraint set to Inf.
cl = fitcsvm(data3,theclass,'KernelFunction','rbf','BoxConstraint',Inf,'ClassNames',[-1,1]);
% Predict scores over the grid
d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(data3(:,1)):d:max(data3(:,1)),min(data3(:,2)):d:max(data3(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
[~,scores] = predict(cl,xGrid);
% Plot the data and the decision boundary
figure;
h(1:2) = gscatter(data3(:,1),data3(:,2),theclass,'rb','.');
hold on
h(3) = plot(data3(cl.IsSupportVector,1),data3(cl.IsSupportVector,2),'ko');
contour(x1Grid,x2Grid,reshape(scores(:,2),size(x1Grid)),[0 0],'k');
legend(h,{'-1','+1','Support Vectors'});
axis equal
hold off
Visual Representation: Regularization

• The regularization parameter (in Python it is called C) tells the SVM optimization how much you want to avoid misclassifying each training example

• When C is low, the margin is wider (so implicitly we don't get so many curves and the line doesn't strictly follow the data points), even if a couple of apples end up classified as lemons

• When C is high, the boundary is full of curves and all the training data is classified correctly

• However, increasing C keeps pushing the training accuracy up, but at the cost of overfitting (see the BoxConstraint sketch below)
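A MATLAB sketch of this effect using fitcsvm's 'BoxConstraint' option (the counterpart of C), on hypothetical noisy data made up for the example:

% Hypothetical noisy 2-D data with overlapping classes.
X = [randn(50,2)-0.7; randn(50,2)+0.7];
Y = [-ones(50,1); ones(50,1)];

% Low C: wide, smooth margin that tolerates some misclassified points.
softModel = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'BoxConstraint', 0.1);

% High C: the optimizer tries hard to classify every training point,
% producing a tighter, curvier boundary that is more prone to overfitting.
hardModel = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'BoxConstraint', 100);

% Compare training error (resubstitution loss) of the two settings.
[resubLoss(softModel), resubLoss(hardModel)]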
Visual Representation: Gamma

• The next important parameter is gamma. The gamma parameter defines how far the influence of a single training example reaches

• Decreasing gamma means that finding the correct hyperplane will consider points at greater distances, so more and more points will be used (green lines indicate which points were considered when finding the optimal hyperplane); see the KernelScale sketch below

• The last parameter is the margin. We have already talked about the margin: a higher margin results in a better model, and so in better classification (or prediction). The margin should always be maximized
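In fitcsvm the reach of a single training example is controlled by 'KernelScale' (a small KernelScale behaves like a large gamma, and vice versa). A sketch on the same kind of made-up toy data:

% Hypothetical toy data (as in the earlier sketches).
X = [randn(50,2)-1; randn(50,2)+1];
Y = [-ones(50,1); ones(50,1)];

% Small KernelScale ~ large gamma: each training point influences only a
% tiny neighbourhood, giving a very wiggly boundary.
narrow = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'KernelScale', 0.2);

% Large KernelScale ~ small gamma: far-away points also matter, giving a
% much smoother boundary.
wide = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'KernelScale', 5);

% Inspect how many support vectors each boundary needs.
[sum(narrow.IsSupportVector), sum(wide.IsSupportVector)]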
Conclusion

• The Support Vector Machine is a very popular and powerful supervised learning algorithm

• We have learnt the basic idea: what a hyperplane is, what support vectors are, and why they are so important

• We have seen visual representations, which help to better understand all the concepts

• We covered another important topic, the Kernel Trick, which helps to solve non-linear problems

Cons

1. Training time is high when we have large data sets

2. When the data set has more noise (i.e. the target classes are overlapping), SVM doesn't perform well

Popular Use Cases

1. Text classification
2. Detecting spam
3. Sentiment analysis
4. Aspect-based recognition
5. Handwritten digit recognition
