Support Vector Machines
Support Vector Machines (SVMs for short) are supervised machine learning algorithms used for
classification, regression, and outlier detection. An SVM classifier builds a model that assigns new
data points to one of the given categories, so it can be viewed as a non-probabilistic binary linear
classifier.
The original SVM algorithm was developed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in
1963. At that time the algorithm was in its early stages: the only possibility was to draw separating
hyperplanes for a linear classifier. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N.
Vapnik suggested a way to create non-linear classifiers by applying the kernel trick to maximum-margin
hyperplanes. The current standard (soft margin) formulation was proposed by Corinna Cortes and
Vapnik in 1993 and published in 1995.
SVMs can be used for linear classification. In addition, they can efficiently perform non-linear
classification using the kernel trick, which implicitly maps the inputs into high-dimensional
feature spaces.
Example
Let us consider two tags, yellow and blue, and suppose our data has two features, x and y. Given a
pair of (x, y) coordinates, we want a classifier that outputs either yellow or blue. We plot the
labeled training data on a plane:
An SVM takes these data points and outputs the hyperplane (which in two dimensions is simply a
line) that best separates the tags. This line is the decision boundary: anything falling on one side
of it is classified as yellow, and anything on the other side as blue.
For an SVM, the best hyperplane is the one that maximizes the margin to both tags, i.e. the
hyperplane whose distance to the nearest element of each tag is largest.
The above was easy since the data was linearly separable—a straight line can be drawn to separate
yellow and blue. However, in real scenarios, cases are usually not this simple. Consider the
following case:
There is no linear decision boundary. The vectors are, however, very clearly segregated, and it
seems as if it should be easy to separate them.
In this case, we add a third dimension. Up to now we have worked with two dimensions, x and y.
A new z dimension is introduced, defined in a way that is convenient for us: z = x² + y² (the
equation of a circle). Taking a slice of this three-dimensional space looks like this:
Note that since we are now in three dimensions, the hyperplane is a plane parallel to the x-y plane
at a particular value of z, say z = 1. Mapping this back to two dimensions:
There we go! The decision boundary is a circle of radius 1 that separates the two tags.
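The idea of lifting the data into a third dimension can be sketched in a few lines of Python. The dataset, library calls, and parameter values below are illustrative assumptions (scikit-learn's make_circles stands in for the two rings of yellow and blue points), not something specified in the text:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in (x, y) separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Lift to three dimensions with the convenient extra feature z = x^2 + y^2.
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_3d = np.hstack([X, z])

# In the lifted space a plane (a linear SVM) separates the two rings,
# and that plane maps back to a circle in the original (x, y) plane.
clf = SVC(kernel="linear").fit(X_3d, y)
print("training accuracy:", clf.score(X_3d, y))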
How SVM works
In SVMs, our main objective is to select a hyperplane with the maximum possible margin between
the support vectors in the given dataset. The SVM searches for this maximum margin hyperplane in
the following two-step process (a code sketch follows the list):
1. Generate hyperplanes that segregate the classes in the best possible way. There are many
hyperplanes that might classify the data; we look for the one that represents the largest
separation, or margin, between the two classes.
2. Choose the hyperplane so that the distance from it to the support vectors on each side is
maximized. If such a hyperplane exists, it is known as the maximum margin hyperplane, and the
linear classifier it defines is known as a maximum margin classifier.
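As a concrete sketch of this two-step idea, the snippet below fits a hard-margin linear SVM with scikit-learn and reads off the hyperplane, the support vectors, and the resulting margin; the synthetic dataset and parameter values are assumptions chosen for illustration, not taken from the text:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the two classes.
X, y = make_blobs(n_samples=40, centers=2, random_state=6)

# A very large C keeps the margin hard, so the fitted hyperplane is the
# maximum margin hyperplane for this separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]        # hyperplane: w . x + b = 0
print("w =", w, ", b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))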
Step-by-step discussion
Hyperplane
A hyperplane is a decision boundary that separates a given set of data points having different
class labels. The SVM classifier separates data points using the hyperplane with the maximum
margin. This hyperplane is known as the maximum margin hyperplane, and the linear classifier it
defines is known as the maximum margin classifier.
The decision surface separating the classes is a hyperplane of the form:
w^T x + b = 0
where
– w is the weight vector
– x is the input vector
– b is the bias
This allows us to write the decision rule as:
w^T x + b ≥ 0 for d_i = +1
w^T x + b < 0 for d_i = –1
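A tiny sketch of this decision rule, with arbitrary made-up values for w, b, and the input points:

import numpy as np

w = np.array([2.0, -1.0])    # weight vector (assumed values)
b = -0.5                     # bias (assumed value)

points = np.array([[1.0, 0.5],
                   [-1.0, 2.0]])

# d_i = +1 when w^T x + b >= 0, and d_i = -1 otherwise.
scores = points @ w + b
labels = np.where(scores >= 0, +1, -1)
print(scores)   # [ 1.  -4.5]
print(labels)   # [ 1 -1]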
Computing the margin
Here H0 denotes the decision boundary w•x + b = 0, and H1 and H2 denote the parallel hyperplanes
w•x + b = +1 and w•x + b = –1 that pass through the support vectors of each class.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is
|Ax0 + By0 + c| / sqrt(A^2 + B^2). Therefore:
The distance between H0 and H1 is |w•x + b| / ||w|| = 1 / ||w||, so
the total distance between H1 and H2 is 2 / ||w||.
To maximize the margin, we therefore need to minimize ||w||, subject to the condition that there
are no data points between H1 and H2:
x_i•w + b ≥ +1 when y_i = +1
x_i•w + b ≤ –1 when y_i = –1
These two conditions can be combined into: y_i(x_i•w + b) ≥ 1
Maximum margin hyperplane
The problem is: minimize ||w|| subject to the discrimination constraint being obeyed, i.e.
min f(x) s.t. g(x) = 0, which we can rewrite as:
min f: ½ ||w||^2 (note that this is a quadratic function)
s.t. g: y_i(w•x_i + b) = 1, or [y_i(w•x_i + b)] – 1 = 0
This is a constrained optimization problem, and it can be solved by the Lagrange multiplier
method. Because the objective is quadratic, the surface is a paraboloid with a single global minimum.
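The sketch below checks the two facts used in this derivation on a fitted hard-margin linear SVM (the synthetic dataset is an assumption used only for illustration): the support vectors satisfy y_i(w•x_i + b) = 1, and the margin width equals 2/||w||:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)
y_signed = np.where(y == 1, 1, -1)          # relabel the classes as +1 / -1

clf = SVC(kernel="linear", C=1e6).fit(X, y_signed)
w, b = clf.coef_[0], clf.intercept_[0]

sv = clf.support_vectors_                   # the points lying on H1 and H2
sv_labels = y_signed[clf.support_]

# On H1 and H2 the constraint is active: y_i (w . x_i + b) = 1.
print(sv_labels * (sv @ w + b))             # all entries approximately 1.0

# The total distance between H1 and H2 is 2 / ||w||.
print("margin width:", 2 / np.linalg.norm(w))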
Kernel functions
In practice, the SVM algorithm is implemented using a kernel, via a technique called the kernel
trick. In simple words, a kernel is just a function that maps the data to a higher dimension in
which it becomes separable. A kernel transforms a low-dimensional input space into a
higher-dimensional space, and so converts non-linearly separable problems into linearly separable
ones by adding more dimensions. The kernel trick thus helps us build a more accurate classifier
and is useful for non-linear separation problems.
The most widely used kernels in SVMs are the linear kernel, the polynomial kernel, and the
Gaussian (radial basis function) kernel. The choice of kernel depends on the nature of the data
and the task at hand. The linear kernel is used when the data is roughly linearly separable, the
polynomial kernel when the data has a complicated curved boundary, and the Gaussian kernel when
the data has no clear boundaries and contains complicated areas of overlap.
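A quick way to see this in practice is to cross-validate the same SVM with different kernels; the dataset and settings below are assumptions chosen for illustration (a two-ring dataset with a curved boundary, where the linear kernel should do poorly):

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, "mean accuracy:", round(scores.mean(), 3))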
In each case, applying the kernel trick transforms the data into a higher-dimensional feature
space in which it becomes linearly separable. The individual kernels are described below.
Linear
The linear kernel works really well when there are a lot of features, and text classification
problems have a lot of features. Linear kernel functions are faster than most of the others and you
have fewer parameters to optimize.
f(X) = w^T * X + b
In this equation, w is the weight vector that you want to minimize, X is the data that you're
trying to classify, and b is the bias (intercept) estimated from the training data. This equation
defines the decision boundary that the SVM returns.
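For a linear-kernel SVM fitted with scikit-learn, w and b correspond to the coef_ and intercept_ attributes. The short check below, on an assumed synthetic dataset, confirms that the decision values really are w^T X + b:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=120, centers=2, random_state=7)
clf = SVC(kernel="linear").fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]

# f(X) = w^T X + b reproduces the classifier's decision function.
manual = X @ w + b
print(np.allclose(manual, clf.decision_function(X)))   # True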
Polynomial
The polynomial kernel isn't used in practice very often because it isn't as computationally efficient
as other kernels and its predictions aren't as accurate.
One of the simpler polynomial kernel equations you can use is:
f(X1, X2) = (X1 · X2 + 1)^d
Here f(X1, X2) is the kernel value used to build the polynomial decision boundary that will
separate your data, X1 and X2 are data points, and d is the degree of the polynomial.
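As a sanity check, this kernel value can be computed by hand and compared with scikit-learn's pairwise helper; the data points, degree, gamma, and coef0 values are arbitrary assumptions for the example:

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X1 = np.array([[1.0, 2.0]])
X2 = np.array([[0.5, -1.0]])
degree, gamma, coef0 = 2, 1.0, 1.0    # gamma=1, coef0=1 gives (X1 . X2 + 1)^d

# General polynomial kernel: (gamma * X1 . X2 + coef0) ** degree
manual = (gamma * X1 @ X2.T + coef0) ** degree
print(manual)
print(polynomial_kernel(X1, X2, degree=degree, gamma=gamma, coef0=coef0))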
Gaussian radial basis function (RBF)
One of the most powerful and commonly used kernels in SVMs, and usually the choice for non-linear
data. Its equation is:
f(X1, X2) = exp(-gamma * ||X1 - X2||^2)
In this equation, gamma specifies how much influence a single training point has on the data
points around it, and ||X1 - X2|| is the Euclidean distance between your feature vectors.
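The same kind of hand check works for the Gaussian kernel; the points and the gamma value are assumptions for illustration:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X1 = np.array([[1.0, 2.0]])
X2 = np.array([[0.5, -1.0]])
gamma = 0.5

# Gaussian (RBF) kernel: exp(-gamma * ||X1 - X2||^2)
manual = np.exp(-gamma * np.sum((X1 - X2) ** 2))
print(manual)
print(rbf_kernel(X1, X2, gamma=gamma))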
Sigmoid
More useful in neural networks than in support vector machines, but there are occasional specific
use cases. A common form of its equation is:
f(X1, X2) = tanh(alpha * X1 · X2 + C)
In this function, alpha is a slope (scaling) parameter and C is an offset applied inside the tanh.
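A hand check of the sigmoid kernel against scikit-learn's pairwise helper (the points, alpha, and C are assumed example values; the helper calls these parameters gamma and coef0):

import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

X1 = np.array([[1.0, 2.0]])
X2 = np.array([[0.5, -1.0]])
alpha, C = 0.5, 1.0

# Sigmoid kernel: tanh(alpha * X1 . X2 + C)
manual = np.tanh(alpha * X1 @ X2.T + C)
print(manual)
print(sigmoid_kernel(X1, X2, gamma=alpha, coef0=C))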
Assumptions
Support Vector Machines (SVMs) have certain assumptions and properties that are important to
understand when using them:
Linear Separability: The primary assumption of SVM is that the data is or can be transformed into
a linearly separable space. In other words, there exists a hyperplane that can distinctly separate
the classes.
Margin Maximization: SVM aims to find the hyperplane that maximizes the margin between
classes. This assumes that a larger margin contributes to better generalization and improved
performance.
Noisy Data Handling: SVMs are sensitive to noisy data and outliers, as these may influence the
position and orientation of the decision boundary. Outliers can have a significant impact on the
resulting hyperplane.
Kernel Function Choice: The choice of the kernel function (linear, polynomial, radial basis
function) and its parameters can affect the performance of SVM. The appropriate kernel and
parameters depend on the characteristics of the data.
Memory Efficiency: SVMs are memory-efficient due to the use of a subset of training points
(support vectors) in decision-making. This can be an advantage in terms of memory usage, but it
also assumes that these support vectors are representative of the entire dataset.
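This support vector property is easy to observe on a fitted model; the synthetic dataset below is an assumption used only to show that far fewer points than the full training set are retained:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=500, centers=2, cluster_std=1.5, random_state=1)
clf = SVC(kernel="rbf").fit(X, y)

# Only the support vectors are needed at prediction time.
print("training points:", len(X))
print("support vectors per class:", clf.n_support_)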
Strengths:
Robust to Overfitting: SVMs are less prone to overfitting, especially in high-dimensional spaces,
because the margin maximization objective penalizes only points that fall inside or on the wrong
side of the margin.
Effective in Cases with Clear Margin of Separation: SVMs work well when there is a clear margin
of separation between classes, making them suitable for tasks with distinct and well-separated
classes.
Kernel Trick for Non-Linear Data: The kernel trick allows SVMs to handle non-linear decision
boundaries by implicitly mapping data into higher-dimensional spaces.
Versatile Kernels: SVMs support different kernel functions, providing flexibility in capturing
different types of relationships in the data.
Memory Efficiency: SVMs use a subset of training points (support vectors) in decision-making,
making them memory-efficient, especially when dealing with large datasets.
Weaknesses:
Sensitivity to Noise and Outliers: SVMs can be sensitive to noise and outliers, as they may
influence the position and orientation of the decision boundary.
Difficulty in Handling Large Datasets: SVMs can become computationally expensive and memory-
intensive, particularly with large datasets.
Choice of Kernel and Parameters: The choice of the appropriate kernel and tuning of
hyperparameters can be challenging, and the performance may be sensitive to these choices.
Limited Interpretability: The decision function of SVMs is not easily interpretable, making it
challenging to understand the contribution of each feature to the final decision.
Not Suitable for Imbalanced Datasets: SVMs may not perform well on highly imbalanced datasets
where one class significantly outnumbers the other.
Binary Classification: SVMs are inherently binary classifiers, and extensions to handle multiclass
problems may require strategies like one-vs-one or one-vs-all.
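As a sketch of these strategies (the iris dataset is used here purely as an assumed three-class example), scikit-learn's SVC applies a one-vs-one scheme internally, and a one-vs-rest wrapper is available when that scheme is preferred:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# SVC handles the 3-class problem with one-vs-one classifiers under the hood.
ovo = SVC(kernel="rbf").fit(X, y)

# An explicit one-vs-rest wrapper trains one binary SVM per class instead.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print("one-vs-one accuracy:", ovo.score(X, y))
print("one-vs-rest accuracy:", ovr.score(X, y))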