SVM 1
• When we deal with a big data set that needs a complicated model, the full Bayesian framework is very
computationally expensive
• Try to extract a layer of "features" rather than predicting answers directly from the raw inputs
This is sensible if we already know that certain combinations of input values would be useful (such as edges
or corners in an image)
What is SVM?
• Analyzes data and recognizes patterns for classification and regression problems
• Basically, an SVM training model maps the examples as points in space so that there is a wide, clear
division between examples of the separate categories or groups
• New examples are then categorized based on how closely their features align with the
corresponding group
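As a minimal sketch of this behaviour (assuming scikit-learn; the toy points and labels below are made up for illustration), training an SVM and categorizing a new example might look like:

```python
# Minimal linear SVM sketch (hypothetical toy data, scikit-learn assumed available)
from sklearn.svm import SVC

# Toy 2D points: two loosely separated groups with labels -1 and +1
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # group -1
     [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]]   # group +1
y = [-1, -1, -1, +1, +1, +1]

clf = SVC(kernel="linear")   # linear SVM: separating hyperplane with maximum margin
clf.fit(X, y)

# A new example is categorized by which side of the hyperplane it falls on
print(clf.predict([[2.0, 3.0]]))   # expected: [-1]
print(clf.support_vectors_)        # the points that define the margin
```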
Types of SVM
The main idea is to partition the space with a hyperplane such that the cluster
lying on one side of the hyperplane is denoted by label +1 and the cluster on the
other side by label -1
Classification
$\langle w, x_i \rangle + b \ge 0$ with label $y_i = +1$ ……………………(1)
$\langle w, x_i \rangle + b < 0$ with label $y_i = -1$ ……………………(2)
The learning problem is to find the hyperplane $(w, b)$ with the largest margin $r$:
$\max_{w, b} \; r$
Sub to: every training example being classified correctly.
We know that the labels $y_i \in \{+1, -1\}$ let us combine (1) and (2) into a single condition.
We have
$y_i \left( \langle w, x_i \rangle + b \right) \ge 0 \quad \forall i$ …………………(3)
Optimization problem
Let us assume the scenario shown in the figure: $x_a$ is a point on the hyperplane, $x'$ is the training point closest to the hyperplane, and $r$ is the distance (the margin) between them.
• Here we normalize everything such that $\langle w, x' \rangle + b = 1$ for the closest point $x'$
• Since $x'$ lies at distance $r$ from the hyperplane in the direction of $w$, we can write
$x' = x_a + r \, \dfrac{w}{\|w\|}$ ………………….(4)
Applying (4) into (3), and by observing that $\langle w, x_a \rangle + b = 0$ (because $x_a$ lies on the hyperplane), we get
$\langle w, x' \rangle + b = \langle w, x_a \rangle + b + r \, \dfrac{\langle w, w \rangle}{\|w\|} = r \, \|w\|$
As $\langle w, x' \rangle + b = 1$, the constraint gets modified and the margin becomes
$r = \dfrac{1}{\|w\|}$
Hence maximizing the margin is equivalent to minimizing $\|w\|^2$, and we get the modified optimization problem as
$\min_{w, b} \; \dfrac{1}{2} \, \|w\|^{2}$
Sub to: $y_i \left( \langle w, x_i \rangle + b \right) \ge 1 \quad \forall i$ ……………………(A)
Modified Optimization problem
The modified optimization problem can be rewritten as a Lagrangian with multipliers $\alpha_i \ge 0$:
$L(w, b, \alpha) = \dfrac{1}{2} \, \|w\|^{2} - \sum_{i} \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right]$ ……………………..(B)
Setting the derivatives of (B) with respect to $w$ and $b$ to zero gives
$w = \sum_{i} \alpha_i \, y_i \, x_i, \qquad \sum_{i} \alpha_i \, y_i = 0$ ……………………..(C)
Dual Optimization problem
Substituting (C) into (B), we get a dual Lagrangian function
$\max_{\alpha} \; \sum_{i} \alpha_i - \dfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \, \alpha_j \, y_i \, y_j \, \langle x_i, x_j \rangle$ ……………………(D)
Sub to: $\alpha_i \ge 0 \;\; \forall i$, $\sum_{i} \alpha_i \, y_i = 0$
• Once we obtain the solution values (or dual parameters) of the optimal $\alpha^{*}$ from (D), we can recover the optimal $w^{*}$ (and then $b^{*}$ from any support vector, $b^{*} = y_k - \langle w^{*}, x_k \rangle$) by using
$w^{*} = \sum_{i} \alpha_i^{*} \, y_i \, x_i$ …………………….(E)
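As a rough numeric sketch of (E) (assuming scikit-learn; the blob data are made up for illustration), a fitted linear SVC exposes the products $\alpha_i^{*} y_i$ as `dual_coef_`, so $w^{*}$ can be recovered from the support vectors:

```python
# Sketch: recover w* from the dual solution of a linear SVM (scikit-learn assumed)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical linearly separable toy data
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
y = 2 * y - 1                      # relabel to {-1, +1}

clf = SVC(kernel="linear", C=1e6)  # very large C ~ hard margin
clf.fit(X, y)

# dual_coef_ holds alpha_i* * y_i for the support vectors, as in (E)
w_star = clf.dual_coef_ @ clf.support_vectors_
print(w_star)          # should match clf.coef_ for the linear kernel
print(clf.coef_)
print(clf.intercept_)  # the recovered b*
```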
• Datasets that are linearly separable with some noise work out great
Non-linear SVMs: Feature spaces
The inputs are mapped into a (usually higher-dimensional) feature space where the classes become linearly separable: Φ: x → φ(x)
Non-linear SVMs: The “Kernel Trick”
φ(x) can be a non-linear function, and we can use the SVM to construct classifiers that are non-linear in x. Because the dual problem (D) depends on the data only through inner products $\langle x_i, x_j \rangle$, we can replace them with a kernel $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ and never compute φ explicitly (a numeric sketch follows below)
Non-linear SVMs: The “Kernel Trick”
(Figure: the input features are mapped into a higher-dimensional Hilbert space.)
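To make the trick concrete (a small numeric sketch, not from the slides; the quadratic kernel and its explicit feature map are standard textbook choices), one can check that a kernel evaluation equals a dot product in the mapped space:

```python
# Sketch: the "kernel trick" for the quadratic kernel K(x, z) = (x . z)^2 in 2D
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (x . z)^2 with x = (x1, x2):
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def K(x, z):
    # Kernel evaluated directly in the input space -- no mapping needed
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(K(x, z))                 # 16.0
print(np.dot(phi(x), phi(z)))  # 16.0 -- same value, computed via phi
```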
• One of the most widely used kernels, due to its similarity to the Gaussian distribution
• It is a similarity measurement
• The RBF kernel function for two points, say $x_1$ and $x_2$, computes how close they are to each other
• Mathematical representation:
$K(x_1, x_2) = \exp\!\left( -\dfrac{\|x_1 - x_2\|^{2}}{2\sigma^{2}} \right)$, where $\|x_1 - x_2\|$ is the Euclidean distance between the two points and $\sigma$ controls the width of the similarity region
• Distance can be thought of as equivalent to dissimilarity, since an increased distance between points means they are less similar
(Figure: three points with pairwise distances $d_{12}$ and $d_{13}$; similarity ↓ as distance ↑.)
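A small numeric sketch of this (NumPy assumed; the points and σ are made up for illustration) shows the similarity dropping as the distance grows:

```python
# Sketch: RBF similarity K(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2))
import numpy as np

def rbf(x1, x2, sigma=1.0):
    d = np.linalg.norm(x1 - x2)              # Euclidean distance
    return np.exp(-d**2 / (2 * sigma**2))    # similarity in (0, 1]

x1 = np.array([0.0, 0.0])
x2 = np.array([0.5, 0.0])    # close to x1  -> d12 small
x3 = np.array([3.0, 0.0])    # far from x1  -> d13 large

print(rbf(x1, x2))   # ~0.88  (high similarity)
print(rbf(x1, x3))   # ~0.011 (low similarity)
```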
Radial Basis Function (RBF) Kernel
• It is important to find the right value of $\sigma$ to decide which points should be considered similar, and this can be demonstrated on
a case-by-case basis
• It is evident from the above cases that the width of the Region of Similarity changes as $\sigma$ changes
• Finding the right $\sigma$ for a given dataset is important and can be done by using hyperparameter tuning techniques like Grid
Search Cross-Validation and Random Search Cross-Validation
• Popular because of its similarity to the K-Nearest Neighbors algorithm: it has the advantages of K-NN and overcomes
the space complexity problem, as an RBF-kernel Support Vector Machine only needs to store the support vectors during
training and not the entire dataset
• Implemented in the scikit-learn library with two hyperparameters associated with it: 'C' for the SVM and 'gamma' for the RBF
kernel. Here, gamma is inversely proportional to $\sigma$
Finding the right gamma (or $\sigma$) along with the value of C is essential in order to achieve the best bias-variance trade-off (a tuning sketch follows below)
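As a minimal sketch of such tuning (scikit-learn assumed; the parameter grid and toy data are made up for illustration):

```python
# Sketch: tuning C and gamma for an RBF-kernel SVM with grid-search cross-validation
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy non-linear data

param_grid = {
    "C": [0.1, 1, 10, 100],        # regularization strength
    "gamma": [0.01, 0.1, 1, 10],   # inverse width of the RBF similarity region
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # e.g. {'C': 10, 'gamma': 1}
print(search.best_score_)    # cross-validated accuracy for the best combination
```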
Visual Representation—2D
1. SVM will search for apples that are similar to lemons (apples which are yellow and have an elliptic form). This will be a support vector. The other support vector will be a lemon similar to an apple (green and rounded).
2. Find the best hyperplane that separates the classes.
3. We have an infinite number of possibilities to draw the decision boundary. So how can we find the optimal one?
Visual Representation—Finding the Optimal Hyperplane
• Intuitively, the best line is the line that is far away from both the apple and the lemon examples (i.e. it has the largest margin). To have an optimal solution,
we have to maximize the margin in both directions (if we have multiple classes, then we have to maximize it considering each of the classes)
4. Comparing the 1st and 2nd pictures above, we can easily observe that the first is the optimal hyperplane (line) and the second is a sub-optimal solution, because its margin is far shorter.
5. Use a "global" margin, which takes into consideration all the classes. This margin would look like the purple line. This margin is orthogonal to the boundary and equidistant to the support vectors.
• Each of the calculations (calculating distances and optimal hyperplanes) is made in vector space, so each data point is considered a vector.
The dimension of the space is defined by the number of attributes of the examples
• Basically, the learning is equivalent to finding the hyperplane with the best margin, so it is a simple optimization problem
Visual Representation—SVM for Non-Linear Data Sets
• In this case we cannot find a straight line to separate apples from lemons. So how can we solve this problem? We will use the Kernel Trick!
• The basic idea is that when a data set is inseparable in the current dimensions, we add another dimension; maybe that way the data will be separable
• Map the apples and lemons (which are just simple points) to this new space
• Use a transformation in which we add levels based on distance from the origin
• When a point is at the origin, it will be on the lowest level
• While moving away from the origin (moving from the center of the plane towards the margins), we are climbing the hill, so the level of the points will be higher
Visual Representation—Mapping from 1D to 2D
• Using transformations we can easily separate the two classes. These transformations are called kernels
• After the transformation, we can easily delimit the two classes using just a single line (a sketch follows below)
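As a rough sketch of this idea (NumPy assumed; the data and the mapping x → (x, x²) are illustrative choices, not the slides' exact figures), lifting 1D points into 2D makes the classes separable by a single line:

```python
# Sketch: 1D data that a single threshold cannot separate becomes
# linearly separable after the mapping x -> (x, x^2)
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])   # "apples" surround "lemons"

# Add a second dimension: the "level" grows with distance from the origin
X = np.column_stack([x, x**2])

# In the new space the horizontal line x2 = 2 separates the classes
predicted = np.where(X[:, 1] > 2.0, +1, -1)
print(predicted)                # matches y
print((predicted == y).all())   # True
```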
• The Regularization Parameter (in python it's called C) tells the SVM optimization how much you want to avoid misclassifying each training example
• When C is low, the margin is larger (so implicitly we don't have so many curves; the line doesn't strictly follow the data points), even if
two apples were classified as lemons
• When C is high, the boundary is full of curves and all the training data was classified correctly
• However, increasing C will not always increase real precision: a very large C can cause overfitting
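A minimal sketch of this trade-off (scikit-learn assumed; the dataset and C values are made up) compares a low-C and a high-C fit on the same noisy data:

```python
# Sketch: effect of the regularization parameter C on an SVM fit
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)   # noisy toy data

for C in (0.01, 1000):
    clf = SVC(kernel="rbf", C=C, gamma=1.0).fit(X, y)
    # Low C: smoother boundary, more training mistakes tolerated.
    # High C: wiggly boundary that tries to classify every training point correctly.
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```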
Visual Representation—Gamma
• The next important parameter is Gamma. The gamma parameter defines how far the influence of a single training
example reaches.
• Decreasing the gamma means that points at greater distances are also considered when finding the correct hyperplane, so more and
more points will be used (the green lines indicate which points were considered when finding the optimal hyperplane)
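A small numeric sketch of this (NumPy assumed; the gamma and distance values are illustrative) shows how gamma controls how quickly one example's influence decays with distance:

```python
# Sketch: gamma controls how far a training example's influence reaches
import numpy as np

def rbf(d, gamma):
    # RBF kernel value for two points separated by distance d
    return np.exp(-gamma * d**2)

for gamma in (0.1, 1.0, 10.0):
    # influence of one example on points 0.5, 1 and 2 units away
    print(f"gamma={gamma}:", [round(rbf(d, gamma), 3) for d in (0.5, 1.0, 2.0)])
# Small gamma -> influence reaches far; large gamma -> influence is very local
```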
• The last parameter is the margin. We have already talked about the margin: a higher margin results in a better model, and so better
classification (or prediction). The margin should always be maximized
Conclusion
• The Support Vector Machine is a very popular and powerful supervised learning algorithm
• We have learnt the basic idea, what a hyperplane is, and what support vectors are and why they are so important
• We have seen visual representations, which help to better understand all of these concepts
• When the data set has more noise (i.e. the target classes are overlapping), SVM doesn't perform well