Support Vector Machines
INTRODUCTION
The Support Vector Machine (SVM) is one of the most popular machine learning algorithms, with roots in statistical learning theory
Maximal-Margin Classifier
A line that separates the two classes can be written as:
B0 + (B1 × X1) + (B2 × X2) = 0
where the coefficients B1 and B2 (which determine the slope of the line) and the intercept B0 are found by the learning algorithm, and X1 and X2 are the two input variables
By plugging input values into the line equation, we can calculate whether a new point is above or below the line:
Above the line, the equation returns a value greater than 0 and the point
belongs to the first class (class 0)
Below the line, the equation returns a value less than 0 and the point belongs to
the second class (class 1)
A point close to the line yields a value close to zero, and may be difficult to classify
If the magnitude of the value is large, the model may have more confidence in
the prediction
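As a minimal sketch of this decision rule (the coefficient values here are made up; B0, B1 and B2 would normally come from the learning algorithm):

    # Illustrative coefficients (made up for this sketch)
    B0, B1, B2 = -5.0, 1.0, 1.0

    def classify(x1, x2):
        value = B0 + (B1 * x1) + (B2 * x2)
        label = 0 if value > 0 else 1  # > 0: class 0 (above); < 0: class 1 (below)
        return label, value            # the magnitude of value hints at confidence

    print(classify(6.0, 4.0))  # (0, 5.0): above the line, confidently class 0
    print(classify(1.0, 2.0))  # (1, -2.0): below the line, class 1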
The distance between the line and the closest data points is referred to as the margin
The best or optimal line separating the two classes is the line that has the largest margin (the Maximal-Margin hyperplane)
The margin is calculated as the perpendicular distance from the line to only the closest points
These closest points are called the support vectors: only they are relevant in defining the line and in the construction of the classifier
The hyperplane is learned from the training data using an optimization procedure that maximizes the margin, as sketched below
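A minimal illustration with scikit-learn, assuming a small made-up toy dataset; fitting SVC with a linear kernel exposes the learned support vectors and hyperplane coefficients:

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class data (invented purely for illustration)
    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear")
    clf.fit(X, y)

    print(clf.support_vectors_)        # the closest points that define the margin
    print(clf.coef_, clf.intercept_)   # hyperplane coefficients (B1, B2) and B0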
Soft Margin Classifier
In practice, real data is rarely perfectly separable, so the constraint of maximizing the margin is relaxed: some points in the training data are allowed to violate the separating line
This increases the complexity of the model, as there are more parameters for the model to fit to the data
A tuning parameter is introduced, called simply C, that defines the magnitude of the wiggle allowed across all dimensions
The C parameter defines the amount of violation of the margin allowed
The larger the value of C, the more violations of the hyperplane are permitted
During the learning of the hyperplane from data, all training instances that lie
within the distance of the margin will affect the placement of the hyperplane
The smaller the value of C, the more sensitive the algorithm is to the training
data (higher variance and lower bias)
The larger the value of C, the less sensitive the algorithm is to the training
data (lower variance and higher bias)
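A quick sketch of tuning C with scikit-learn, on the same style of made-up toy data as above. One caveat: scikit-learn's C is a penalty on margin violations, so it behaves inversely to the budget convention described here (in scikit-learn, a smaller C permits more violations):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # In scikit-learn: small C -> more violations tolerated (higher bias),
    # large C -> training data fit more tightly (higher variance)
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, clf.n_support_)  # number of support vectors per class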
Support Vector Machines (Kernels)
The linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves
For example, the inner product of the vectors [2, 3] and [5, 6] is (2 × 5) + (3 × 6) = 28
The prediction for a new input is made using the dot product between the input (x) and each support vector (xi), calculated as follows:
f(x) = B0 + sum(ai × (x · xi))
This equation involves calculating the inner product of a new input vector (x) with all support vectors in the training data
The coefficients B0 and ai (one for each training instance) must be estimated from the training data by the learning algorithm
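A direct transcription of this prediction equation, with made-up coefficients and support vectors purely for illustration:

    import numpy as np

    def predict(x, support_vectors, a, B0):
        # f(x) = B0 + sum over support vectors of ai * (x . xi)
        return B0 + sum(ai * np.dot(x, xi) for ai, xi in zip(a, support_vectors))

    # Hypothetical learned values (invented for this sketch)
    support_vectors = [np.array([1.0, 2.0]), np.array([6.0, 5.0])]
    a = [-0.5, 0.5]
    B0 = 0.1

    print(predict(np.array([7.0, 6.0]), support_vectors, a, B0))  # sign gives the class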
Kernels
Linear Kernel SVM
The dot product is called the kernel, and it can be re-written as:
K(x, xi) = sum(x × xi)
Kernel defines the similarity or a distance measure between new data and the support
vectors
Dot product is the similarity measure used for linear SVM or a linear kernel because the
distance is a linear combination of the inputs
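As a one-line sketch in NumPy, reproducing the [2, 3] · [5, 6] example above:

    import numpy as np

    def linear_kernel(x, xi):
        # K(x, xi) = sum(x * xi), i.e. the dot product
        return np.dot(x, xi)

    print(linear_kernel(np.array([2, 3]), np.array([5, 6])))  # 28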
Polynomial Kernel SVM
Instead of the dot product, we can use a polynomial kernel, for example:
K(x, xi) = (1 + sum(x × xi))^d
where the degree of the polynomial (d) must be specified by hand
The polynomial kernel allows for curved lines in the input space
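A corresponding sketch; the degree d is a hyperparameter you choose:

    import numpy as np

    def polynomial_kernel(x, xi, d=2):
        # K(x, xi) = (1 + sum(x * xi))^d
        return (1 + np.dot(x, xi)) ** d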
Radial Kernel SVM
We can also have a more complex radial kernel, for example:
K(x, xi) = e^(−gamma × sum((x − xi)^2))
A good default value for gamma is 0.1; gamma often lies in the range 0 < gamma < 1
Radial kernel is very local and can create complex regions within the feature
space, like closed polygons in a two-dimensional space
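And a sketch of the radial kernel with the suggested default gamma:

    import numpy as np

    def radial_kernel(x, xi, gamma=0.1):
        # K(x, xi) = exp(-gamma * sum((x - xi)^2))
        return np.exp(-gamma * np.sum((x - xi) ** 2))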
How to Learn an SVM Model
A general-purpose numerical optimization procedure could be used to search for the hyperplane coefficients, but this is inefficient and is not the approach used in widely used SVM implementations like LIBSVM
A variation of gradient descent called sub-gradient descent can be used
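As an illustration of the sub-gradient descent approach, here is a minimal sketch of minimizing the regularized hinge loss for a linear SVM (this is not how LIBSVM works; hyperparameters are made up):

    import numpy as np

    def fit_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
        # Sub-gradient descent on (lam/2)*||w||^2 + hinge loss.
        # Labels y must be in {-1, +1}.
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        rng = np.random.default_rng(0)
        for _ in range(epochs):
            for i in rng.permutation(n):
                if y[i] * (X[i] @ w + b) < 1:
                    # margin violated: hinge term contributes to the sub-gradient
                    w -= lr * (lam * w - y[i] * X[i])
                    b += lr * y[i]
                else:
                    # only the regularizer contributes
                    w -= lr * lam * w
        return w, b

    X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
    y = np.array([-1, -1, 1, 1])
    w, b = fit_linear_svm(X, y)
    print(np.sign(X @ w + b))  # should match y on this toy data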
There are specialized optimization procedures that re-formulate the
optimization problem to be a Quadratic Programming problem
The most popular method for fitting SVMs is the Sequential Minimal Optimization (SMO) method, which is very efficient
It breaks the problem down into sub-problems that can be solved analytically
(by calculating) rather than numerically (by searching or optimizing)
Preparing Data For SVM
If we have categorical inputs, we may need to convert them to binary dummy variables (one variable for each category), as sketched below
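For example, with pandas (the "color" column here is hypothetical, purely for illustration):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                       "x1": [1.0, 2.0, 3.0, 4.0]})
    # One binary dummy variable per category
    df = pd.get_dummies(df, columns=["color"])
    print(df.columns.tolist())  # ['x1', 'color_blue', 'color_green', 'color_red']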
Binary Classification: basic SVM, as described here, is intended for binary (two-class) classification problems