SVM Part A
Introduction
SVM is a powerful supervised algorithm that works best on smaller but complex datasets. Support Vector Machine, abbreviated as SVM, can be used for both regression and classification tasks, but it generally works best on classification problems. SVMs were very popular around the time they were created, in the 1990s, and they remain a go-to method for a high-performing algorithm with a little tuning.
By now, I hope you have mastered Decision Trees, Random Forest, Naïve Bayes, K-Nearest Neighbors, and ensemble modelling techniques. If not, I would suggest you take a few minutes and read about them as well.
In this article, I will explain what SVM is, how the SVM algorithm works, and the math intuition behind this crucial ML algorithm.
Table of contents
What is a Support Vector Machine?
Logistic Regression vs Support Vector Machine
Types of Support Vector Machine Algorithms
Important Terms
How Does Support Vector Machine Work?
Mathematical Intuition Behind Support Vector Machine
Margin in Support Vector Machine
Optimization Function and its Constraints
Soft Margin SVM
Kernels in Support Vector Machine
Different Kernel Functions
How to Choose the Right Kernel?
Implementation and hyperparameter tuning of Support Vector Machine in Python
End Notes
Frequently Asked Questions
What is a Support Vector Machine?
A Support Vector Machine is a supervised machine learning algorithm in which we try to find a hyperplane that best separates the two classes. Note: don't confuse SVM with logistic regression. Both algorithms try to find the best hyperplane, but the main difference is that logistic regression takes a probabilistic approach, whereas the support vector machine takes a geometric, margin-based approach.
Now the question is: which hyperplane does it select? There can be an infinite number of hyperplanes that separate the two classes perfectly. So, which one is the best?
SVM answers this by finding the hyperplane with the maximum margin, which means the maximum distance between the two classes.
Logistic Regression vs Support Vector Machine
Depending on the number of features you have, you can choose either Logistic Regression or SVM.
SVM works best when the dataset is small and complex. It is usually advisable to first try logistic regression and see how it performs; if it fails to give good accuracy, you can go for SVM without any kernel (we will talk more about kernels in a later section). Logistic regression and SVM without any kernel have similar performance, but depending on your features, one may be more efficient than the other.
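To make this comparison concrete, here is a minimal sketch of my own (not from the original article), assuming scikit-learn and its built-in breast cancer dataset; it fits both a logistic regression and a kernel-free (linear) SVM and compares test accuracy:

# Minimal sketch (assumes scikit-learn); the dataset choice is illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try logistic regression first, then an SVM without any kernel (i.e. a linear kernel).
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))

for name, model in [("Logistic Regression", log_reg), ("Linear SVM", linear_svm)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))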
Types of Support Vector Machine Algorithms
1. Linear SVM
When the data is perfectly linearly separable, only then can we use Linear SVM. Perfectly linearly separable means that the data points can be classified into 2 classes by using a single straight line (in 2D).
2. Non-Linear SVM
When the data is not linearly separable, we can use Non-Linear SVM. This means that when the data points cannot be separated into 2 classes by using a straight line (in 2D), we use advanced techniques like the kernel trick to classify them. In most real-world applications we do not find linearly separable data points, hence we use the kernel trick to solve them.
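To see the difference in practice, here is a small illustrative sketch of my own (assuming scikit-learn) on a toy dataset that is not linearly separable; a linear SVM struggles while an RBF-kernel SVM separates the classes well:

# Illustrative sketch (assumes scikit-learn); make_moons is a toy non-linearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # Linear SVM
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)         # Non-Linear SVM via the kernel trick

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF-kernel SVM accuracy:", rbf_svm.score(X_test, y_test))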
Important Terms
Now let's define the two main terms which will be repeated again and again in this article:
Support Vectors: These are the points that are closest to the hyperplane. The separating line is defined with the help of these data points.
Margin: It is the distance between the hyperplane and the observations closest to the hyperplane (the support vectors). In SVM, a large margin is considered a good margin. There are two types of margins: hard margin and soft margin. I will talk more about these two in a later section.
How Does Support Vector Machine Work?
To separate two classes of data points, we can draw many possible decision boundaries, but the question is: which is the best, and how do we find it? NOTE: Since we are plotting the data points in a 2-dimensional graph we call this decision boundary a straight line, but if we have more dimensions, we call this decision boundary a “hyperplane”.
The best hyperplane is the one that has the maximum distance from both classes, and this is the main aim of SVM. This is done by finding the different hyperplanes which classify the labels in the best way, then choosing the one which is farthest from the data points, i.e., the one with the maximum margin.
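As a side note, a fitted SVM exposes exactly these quantities; the short sketch below (my own, assuming scikit-learn and a separable toy dataset) shows how to inspect the support vectors and the (w, b) that define the maximum-margin hyperplane:

# Sketch (assumes scikit-learn): inspect the support vectors of a fitted linear SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)  # linearly separable toy data
clf = SVC(kernel="linear", C=1000).fit(X, y)                # large C behaves like a hard margin

print("Support vectors (points closest to the hyperplane):")
print(clf.support_vectors_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])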
Mathematical Intuition Behind Support Vector Machine
[Figure: two vectors a and b with the angle θ between them]
Here a and b are two vectors. To find the dot product between these two vectors, we first find the magnitude of each vector; to find a magnitude we use the Pythagorean theorem or the distance formula.
After finding the magnitudes we simply multiply them by the cosine of the angle between the two vectors. Mathematically it can be written as:
A.B = |A| cosθ * |B|
where |A| cosθ is the projection of A on B, and |B| is the magnitude of vector B.
Now in SVM we just need the projection of A, not the magnitude of B (I'll tell you why later). To get just the projection, we can simply take the unit vector of B, because it will be in the direction of B but its magnitude will be 1. Hence the equation becomes:
A.B = |A| cosθ * (unit vector of B)
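In code, this projection is just a dot product with a unit vector; here is a tiny NumPy sketch of my own with made-up vectors:

# NumPy sketch of the projection idea: A.(unit vector of B) = |A| cosθ
import numpy as np

A = np.array([3.0, 4.0])
B = np.array([2.0, 0.0])

B_unit = B / np.linalg.norm(B)             # unit vector in the direction of B
projection_of_A_on_B = np.dot(A, B_unit)   # |A| * cosθ

print("Projection of A on B:", projection_of_A_on_B)  # 3.0 for these vectors
print("Full dot product A.B:", np.dot(A, B))          # projection * |B| = 6.0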
Now let’s move to the next part and see how we will use this in SVM.
Use of Dot Product in SVM
Consider a random point X, and suppose we want to know whether it lies on the right side of the plane or the left side of the plane (positive or negative).
To find this, we first treat this point as a vector (X) and then construct a vector (w) which is perpendicular to the hyperplane. Let's say the distance from the origin to the decision boundary along w is 'c'. Now we take the projection of the vector X on w.
We already know that the projection of one vector onto another is found using the dot product. Hence, we take the dot product of the x and w vectors. If the dot product is greater than 'c', the point lies on the right side; if the dot product is less than 'c', the point is on the left side; and if the dot product is equal to 'c', the point lies on the decision boundary.
You might be wondering why we took this vector w perpendicular to the hyperplane. What we want is the distance of the vector X from the decision boundary, and there can be infinite points on the boundary from which to measure that distance. So we need a standard: we simply take the perpendicular direction as a reference, project all the other data points onto this perpendicular vector, and then compare the distances.
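Putting this into a tiny NumPy sketch of my own (the values of w, c, and X are made up for illustration): project the point onto the unit vector along w and compare the result with 'c':

# Sketch: decide which side of the decision boundary a point X lies on.
import numpy as np

w = np.array([1.0, 1.0])   # vector perpendicular to the decision boundary (illustrative)
c = 2.0                    # distance from the origin to the boundary along w (illustrative)
X = np.array([3.0, 2.0])   # the point we want to classify

projection = np.dot(X, w / np.linalg.norm(w))  # projection of X on the unit vector of w

if projection > c:
    print("X lies on the positive side")
elif projection < c:
    print("X lies on the negative side")
else:
    print("X lies on the decision boundary")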
In SVM we also have the concept of a margin. In the next section, we will see how we find the equation of a hyperplane and what exactly we need to optimize in SVM.
Margin in Support Vector Machine
We all know that the equation of a hyperplane is w.x + b = 0, where w is a vector normal to the hyperplane and b is an offset.
To classify a point as negative or positive we need to define a decision rule. We can define the decision rule as: if w.x + b > 0 then it is a positive point, otherwise it is a negative point.
Now we need (w, b) such that the margin has maximum width. Let's say this distance is 'd'. To calculate 'd' we need the equations of the two margin lines L1 and L2. For this, we make the assumption that the equation of L1 is w.x + b = 1 and that of L2 is w.x + b = -1.
Now the questions are:
1. Why are the magnitudes equal? Why didn't we take 1 and -2?
2. Why did we take only 1 and -1, and not other values like 24 and -100?
3. Why did we assume these lines?
Let's try to answer these questions:
1. We want our plane to be at an equal distance from both classes, which means L should pass through the center of L1 and L2; that's why we take equal magnitudes.
2. Let's say the equation of our hyperplane is 2x + y = 2. We observe that even if we multiply the whole equation by some other number, the line doesn't change (try plotting it on a graph). Hence, for mathematical convenience, we take the value as 1.
3. Now the main question: why exactly do we need to assume only these lines? To answer this, I'll take the help of graphs.
Suppose the equation of our hyperplane is 2x + y = 2. Let's create the margin lines for this hyperplane, i.e. 2x + y - 2 = 1 and 2x + y - 2 = -1.
If you multiply these equations by 10 (so the hyperplane becomes 20x + 10y - 20 = 0 and the margin lines 20x + 10y - 20 = ±1), you will see that the parallel lines (red and green) get closer to our hyperplane. For more clarity look at this graph (https://www.desmos.com/calculator/dvjo3vacyp).
We also observe that if we divide the equation by 10 then these parallel lines move farther apart. Look at this graph (https://www.desmos.com/calculator/15dbwehq9g).
By this I wanted to show you that the parallel lines depend on the (w, b) of our hyperplane: if we multiply the equation of the hyperplane by a factor greater than 1, the parallel lines shrink toward it, and if we multiply by a factor less than 1, they expand away from it.
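The same observation can be checked numerically. The distance between the parallel lines w.x + b = +1 and w.x + b = -1 is 2/||w|| (derived in the next sections), so scaling (w, b) by a factor k changes that distance to 2/(k·||w||); a quick NumPy sketch of my own:

# Sketch: distance between the lines w.x + b = +1 and w.x + b = -1 is 2 / ||w||.
import numpy as np

def margin_width(w):
    return 2 / np.linalg.norm(w)

w = np.array([2.0, 1.0])                                  # hyperplane 2x + y - 2 = 0
print("original width:", margin_width(w))                 # 2 / sqrt(5)
print("after multiplying by 10:", margin_width(10 * w))   # 10x smaller: the lines shrink in
print("after dividing by 10:", margin_width(w / 10))      # 10x larger: the lines expand out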
We can now say that these lines move as we make changes to (w, b), and this is how the margin gets optimized. But what is the optimization function? Let's calculate it.
We know that the aim of SVM is to maximize this margin, i.e. the distance (d). But there are a few constraints on this distance (d). Let's look at what these constraints are.
Optimization Function and its Constraints
In order to get our optimization function, there are a few constraints to consider. The constraint is that “we'll calculate the distance (d) in such a way that no positive or negative point crosses the margin line”. Let's write these constraints mathematically:
w.xi + b >= 1 for every positive point (yi = +1)
w.xi + b <= -1 for every negative point (yi = -1)
Rather than carrying these 2 constraints forward, we'll simplify them into 1. We assume that negative classes have y = -1 and positive classes have y = 1.
We can then say that for every point to be correctly classified, this condition should always be true:
yi * (w.xi + b) >= 1
Suppose a green (positive) point is correctly classified; then it follows w.x + b >= 1, and if we multiply this by y = 1 we get the same combined equation above. Similarly, if we do this for a red (negative) point with y = -1, we again get this equation. Hence, we can say that we need to maximize (d) such that this constraint holds true.
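As a quick check of this combined constraint, here is a tiny NumPy sketch of my own with made-up numbers:

# Sketch: the combined constraint yi * (w.xi + b) >= 1 covers both classes at once.
import numpy as np

w, b = np.array([1.0, -1.0]), 0.0
points = np.array([[3.0, 1.0],    # a positive point, y = +1
                   [-2.0, 1.0]])  # a negative point, y = -1
labels = np.array([1, -1])

scores = points @ w + b              # w.x + b for every point
satisfied = labels * scores >= 1     # True when the point is on or outside its margin line
print(scores, satisfied)             # [ 2. -3.] [ True  True]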
We will take 2 support vectors, one from the negative class and the other from the positive class. The distance between these two vectors x1 and x2 is the vector (x2 - x1). What we need is the shortest distance between these two points, which can be found using the trick we used in the dot product section: we take a vector 'w' perpendicular to the hyperplane and then find the projection of the (x2 - x1) vector on 'w'. Note: this perpendicular vector should be a unit vector, only then will this work (why it should be a unit vector was explained in the dot product section). To make 'w' a unit vector we divide it by the norm of 'w':
d = (x2 - x1) . w / ||w||     ... (1)
Since x2 and x1 are support vectors, they lie on the margin lines and therefore satisfy yi * (w.xi + b) = 1, so we can write:
w.x2 + b = 1      ... (2)
w.x1 + b = -1     ... (3)
Putting equations (2) and (3) in equation (1), we get:
d = ((w.x2 + b) - (w.x1 + b)) / ||w|| = 2 / ||w||
So our optimization problem is to maximize 2/||w|| subject to yi * (w.xi + b) >= 1 for all points.
We have now found our optimization function, but there is a catch: we rarely find this kind of perfectly linearly separable data in industry, so there is hardly any case where we can use the condition proved here. The type of problem we have just studied is called Hard Margin SVM; now we shall study the soft margin, which is similar, but there are a few more interesting tricks used in Soft Margin SVM.
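Although truly separable data is rare, a hard margin can be approximated in scikit-learn by fitting a linear SVC with a very large C on a separable toy dataset; the sketch below (my own) does that and reports the margin d = 2/||w|| we just derived:

# Sketch (assumes scikit-learn): approximate a hard-margin SVM with a very large C.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)  # well-separated toy blobs
clf = SVC(kernel="linear", C=1e6).fit(X, y)                 # huge C ~ hard margin

w = clf.coef_[0]
print("Margin width d = 2/||w|| =", 2 / np.linalg.norm(w))
# Every training point should satisfy yi * (w.xi + b) >= 1 (up to tiny numerical error):
print("min of yi*(w.xi + b):", np.min((2 * y - 1) * clf.decision_function(X)))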
Soft Margin SVM
In real-life applications we rarely find a dataset that is perfectly linearly separable; what we find is either an almost linearly separable dataset or a non-linearly separable dataset. In this scenario, we can't use the hard margin formulation shown above, because it works only when the dataset is perfectly linearly separable.
To tackle this problem, we modify the equation in such a way that it allows a few misclassifications, that is, it allows a few points to be wrongly classified.
We know that max[f(x)] can also be written as min[1/f(x)], and it is common practice to minimize a cost function in optimization problems; therefore, instead of maximizing 2/||w|| we minimize ||w||/2 (in practice the equivalent (1/2)||w||² is used because it is easier to optimize).
To make the soft margin equation we add one more term to this objective: a slack variable ζi (zeta) for every data point, multiplied by a hyperparameter 'c':
minimize (1/2)||w||² + c * Σ ζi, subject to yi * (w.xi + b) >= 1 - ζi and ζi >= 0
For all the correctly classified points, ζ will be equal to 0, and for each incorrectly classified point, ζ is simply the distance of that particular point from its correct margin line. That means for a wrongly classified green point, the value of ζ will be the distance of that point from the L1 margin line, and for a wrongly classified red point, ζ will be the distance of that point from the L2 margin line.
So now we can say that SVM Error = Margin Error + Classification Error. The higher the margin, the lower the margin error, and vice versa.
Let's say you take a high value of 'c', for example c = 1000. This would mean that you don't want to focus on the margin error and just want a model that doesn't misclassify any data point.
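The effect of 'c' can also be seen directly. Here is an illustrative scikit-learn sketch of my own on overlapping (not separable) data: as 'c' grows, the margin 2/||w|| shrinks and the model tries harder not to misclassify training points:

# Sketch (assumes scikit-learn): effect of the hyperparameter C on a soft-margin linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some misclassification is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in [0.01, 1, 1000]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2 / np.linalg.norm(clf.coef_[0])
    acc = clf.score(X, y)
    print(f"C={C:>7}: margin width = {width:.3f}, training accuracy = {acc:.3f}")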
Now consider two models: one where the margin is maximum but 2 points are misclassified, and one where the margin is much smaller but all the points are correctly classified. If someone asks you which is the better model, what would you say?
Well, there's no single correct answer to this question; rather, we can use SVM Error = Margin Error + Classification Error to reason about it. If you don't want any misclassification in the model, you would choose the second model; that means we increase 'c' to decrease the classification error. But if you want the margin to be maximized, then the value of 'c' should be decreased.
That's why 'c' is a hyperparameter, and we find the optimal value of 'c' using GridSearchCV and cross-validation.
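A minimal sketch (my own, assuming scikit-learn) of tuning 'c' with GridSearchCV and cross-validation might look like this; the full implementation and hyperparameter tuning are covered in the Python section listed in the table of contents:

# Sketch (assumes scikit-learn): finding a good C with GridSearchCV + cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [0.01, 0.1, 1, 10, 100]}   # candidate values of the hyperparameter C

search = GridSearchCV(pipe, param_grid, cv=5)      # 5-fold cross-validation
search.fit(X, y)

print("Best C:", search.best_params_["svc__C"])
print("Best cross-validated accuracy:", search.best_score_)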