Introduction To Support Vector Machines
SUPPORT VECTOR MACHINES
SVMs: A New Generation of Learning Algorithms
• Pre-1980:
– Almost all learning methods learned linear decision surfaces.
– Linear learning methods have nice theoretical properties.
• 1980s:
– Decision trees and neural networks allowed efficient learning of non-linear decision surfaces.
– They had little theoretical basis, and all suffer from local minima.
• 1990s:
– Efficient learning algorithms for non-linear functions, based on computational learning theory, were developed.
– They have nice theoretical properties.
Key Ideas
• Two independent developments within the last decade:
– Computational learning theory
– New, efficient ways of learning separating surfaces for non-linear functions using “kernel functions”
• The resulting learning algorithm is an optimization algorithm rather than a greedy search.
Statistical Learning Theory
• A learning system can be described mathematically as a system that
– receives data (observations) as input, and
– outputs a function that can be used to predict some features of future data.
• Statistical learning theory models this as a function estimation problem.
• Generalization performance (accuracy in labeling test data) is what is measured.
Motivation for Support Vector Machines
• The problem to be solved is one of supervised binary classification. That is, we wish to categorize new, unseen objects into two separate groups based on their properties and a set of known examples that are already categorized.
• A good example of such a system is classifying a set of new documents
into positive or negative sentiment groups, based on other documents
which have already been classified as positive or negative.
• Similarly, we could classify new emails into spam or non-spam, based on
a large corpus of documents that have already been marked as spam or
non-spam by humans. SVMs are highly applicable to such situations.
Motivation for Support Vector Machines
• A Support Vector Machine models the situation by creating a feature space, a finite-dimensional vector space in which each dimension represents a "feature" of a particular object. In the context of spam or document classification, each "feature" is the prevalence or importance of a particular word (a small sketch of such a feature space follows this list).
• The goal of the SVM is to train a model that assigns new unseen objects into
a particular category.
• It achieves this by creating a linear partition of the feature space into two
categories.
• Based on the features of a new, unseen object (e.g. a document or email), it places the object "above" or "below" the separating plane, leading to a categorization (e.g. spam or non-spam). This makes it an example of a non-probabilistic linear classifier: it is non-probabilistic because the features of a new object fully determine its location in feature space, and there is no stochastic element involved.
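As a minimal sketch of how such a feature space might be built (the documents and vocabulary below are invented purely for illustration), each distinct word becomes one dimension and each document becomes a vector of word counts:

```python
# Minimal sketch: turning documents into word-count feature vectors.
# The documents and vocabulary are invented for illustration only.
docs = ["cheap offer win money now", "meeting agenda for project review"]

# One dimension of the feature space per distinct word in the corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of word counts over that vocabulary.
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```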
OBJECTIVES
• Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
• SVMs are a machine learning approach: they analyze large amounts of data to identify patterns.
• SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.
Support Vectors
• Support vectors are simply the coordinates of individual observations; the Support Vector Machine is the frontier (hyperplane/line) that best segregates the two classes.
• Support vectors are the data points that lie closest to the decision surface (or hyperplane).
• They are the data points most difficult to classify.
• They have a direct bearing on the optimum location of the decision surface.
• We can show that the optimal hyperplane stems from the function class with the lowest "capacity" (VC dimension).
• Support vectors are the data points nearest to the hyperplane: the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.
What is a hyperplane?
• As a simple example, for a classification task with only two features,
you can think of a hyperplane as a line that linearly separates and
classifies a set of data.
• Intuitively, the further from the hyperplane our data points lie, the
more confident we are that they have been correctly classified. We
therefore want our data points to be as far away from the hyperplane
as possible, while still being on the correct side of it.
• So when a new test point is added, whichever side of the hyperplane it lands on decides the class that we assign to it.
How do we find the right hyperplane?
• How do we best segregate the two classes within the data?
• The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly. In the separable (hard-margin) case, no data point ever lies inside the margin.
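Writing the hyperplane as w · x + b = 0 (notation assumed here for illustration), the distance of a training point from the hyperplane and the margin can be stated compactly as:

```latex
\[
\operatorname{dist}(x_i) \;=\; \frac{\lvert w \cdot x_i + b \rvert}{\lVert w \rVert},
\qquad
\text{margin} \;=\; \min_{i}\, \operatorname{dist}(x_i).
\]
```

The maximum-margin hyperplane is then the choice of (w, b) that maximizes this minimum distance while keeping every training point on its correct side.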
But what happens when there is no clear hyperplane?
• Data are rarely as clean as our simple example above. A dataset will often look more like a jumble of mixed points, representing a linearly non-separable dataset.
• In order to classify a dataset like this, it is necessary to move from a 2D view of the data to a 3D view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This "lifting" of the balls represents the mapping of the data into a higher dimension, and is known as kernelling.
• Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate the classes.
How does it work? How can we identify the right hyperplane?
Identify the right hyperplane (Scenario-1):
• Here we have three hyperplanes (A, B and C). Now, identify the right hyperplane to classify the stars and circles.
Scenario-2
• Here all three hyperplanes segregate the classes well, so we choose the one that maximizes the distance between the hyperplane and the nearest data point of either class. This distance is called the margin.
• The margin for hyperplane C is higher than for both A and B, so we name C as the right hyperplane. Another compelling reason for selecting the hyperplane with the higher margin is robustness: if we select a hyperplane with a low margin, there is a high chance of misclassification.
Identify the right hyperplane (Scenario-3)
• We are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
• The star at the other end is like an outlier for the star class. SVM has a feature to ignore outliers and find the hyperplane that has the maximum margin. Hence, we can say that SVM is robust to outliers.
Find the hyperplane to segregate two classes (Scenario-5)
• In this scenario, we cannot have a linear hyperplane between the two classes, so how does an SVM classify them? So far we have only looked at linear hyperplanes.
Scenario-5
• SVM solves this problem by introducing an additional feature z = x² + y². Now let's plot the data points on the x and z axes: in this view the two classes become linearly separable.
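A minimal sketch of this idea, assuming the classic toy picture in which one class sits inside a ring formed by the other; the data set here is generated synthetically, and adding the feature z = x² + y² makes the classes separable by a straight line:

```python
# Minimal sketch: adding the feature z = x^2 + y^2 turns a circular
# decision boundary into a straight line in the lifted space.
# A synthetic "circles" data set stands in for the stars/circles example.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicit feature mapping: (x, y) -> (x, y, z) with z = x^2 + y^2.
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_lifted = np.hstack([X, z])

# A *linear* SVM separates the lifted data easily, even though no straight
# line separates the original 2-D points.
clf = SVC(kernel="linear").fit(X_lifted, y)
print("training accuracy in the lifted space:", clf.score(X_lifted, y))
```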
Linear Separating Hyperplanes
• The linear separating hyperplane is the key geometric entity that is at
the heart of the SVM. Informally, if we have a high-dimensional
feature space, then the linear hyperplane is an object one dimension
lower than this space that divides the feature space into two regions.
• This linear separating plane need not pass through the origin of our
feature space, i.e. it does not need to include the zero vector as an
entity within the plane. Such hyperplanes are known as affine.
• If we consider a real-valued p-dimensional feature space, known mathematically as ℝ^p, then our linear separating hyperplane is an affine (p−1)-dimensional space embedded within it.
• For the case of p=2 this hyperplane is simply a one-dimensional
straight line, which lives in the larger two-dimensional plane, whereas
for p=3 the hyperplane is a two-dimensional plane that lives in the
larger three-dimensional feature space.
Classification
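A standard way to set the problem up (notation assumed in what follows): given training observations x_i in ℝ^p with class labels y_i in {−1, +1}, a separating hyperplane and the resulting classifier can be written as

```latex
\[
b + w \cdot x \;=\; b + w_1 x^{(1)} + \cdots + w_p x^{(p)} \;=\; 0,
\qquad
f(x) \;=\; \operatorname{sign}\!\left(b + w \cdot x\right),
\]
```

so that points with b + w · x > 0 are assigned to class +1 and points with b + w · x < 0 to class −1.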
Deriving the Classifier
• Separating hyperplanes are not unique, since it is possible to slightly
translate or rotate such a plane without touching any training
observations.
• So, not only do we need to know how to construct such a plane, but we also need to determine the optimal one. This motivates the concept of the maximal margin hyperplane (MMH), which is the separating hyperplane that is farthest from any training observation and is thus "optimal".
• One of the key features of the maximal margin classifier (MMC), and subsequently of the SVC and SVM, is that the location of the MMH depends only on the support vectors, which are the training observations that lie directly on the margin boundary, but not on the hyperplane itself (see points A, B and C in the figure). This means that the location of the MMH is NOT dependent upon any other training observations.
Constructing the Maximal Margin Classifier
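The constraints referred to below can be sketched, in the standard maximal margin formulation and with the notation used above, as:

```latex
\[
\max_{w,\, b,\, M}\; M
\quad \text{subject to} \quad
\lVert w \rVert = 1,
\qquad
y_i \left( b + w \cdot x_i \right) \,\ge\, M,
\quad i = 1, \ldots, n.
\]
```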
• Despite the complex looking constraints, they actually state that each
observation must be on the correct side of the hyperplane and at least a
distance M from it. Since the goal of the procedure is to maximize M,
this is precisely the condition we need to create the MMC.
• Clearly, the case of perfect separability is an ideal one. Most "real world" datasets will not have such perfect separability via a linear hyperplane. However, if there is no separability then we are unable to construct an MMC by the optimization procedure above. So how do we create a form of separating hyperplane?
Support Vector Classifiers
• Essentially we have to relax the requirement that a separating
hyperplane will perfectly separate every training observation on the
correct side of the line (i.e. guarantee that it is associated with its true
class label), using what is called a soft margin. This motivates the
concept of a support vector classifier (SVC).
• MMCs can be extremely sensitive to the addition of new training
observations.
If we add one point to the +1 class of a dataset whose MMH perfectly separates the two classes, we see that the location of the MMH changes substantially. Hence, in this situation, the MMH has clearly been over-fit.
• We could consider a classifier based on a separating hyperplane that doesn't
perfectly separate the two classes, but does have a greater robustness to the
addition of new individual observations and has a better classification on most of
the training observations. This comes at the expense of some misclassification of
a few training observations.
• This is how a support vector classifier, or soft margin classifier, works. An SVC allows some observations to be on the incorrect side of the margin, or even on the incorrect side of the hyperplane itself, hence it provides a "soft" separation.
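The optimization problem being described is, in its standard soft-margin form (a sketch using the same notation as above, with slack variables ϵ_i):

```latex
\[
\max_{w,\, b,\, \epsilon_1, \ldots, \epsilon_n,\, M}\; M
\quad \text{subject to} \quad
\lVert w \rVert = 1,
\qquad
y_i \left( b + w \cdot x_i \right) \,\ge\, M (1 - \epsilon_i),
\qquad
\epsilon_i \ge 0,
\qquad
\sum_{i=1}^{n} \epsilon_i \le C,
\]
```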
where C, the budget, is a non-negative "tuning" parameter, M still represents the margin, and the slack variables ϵ_i allow individual observations to be on the wrong side of the margin or hyperplane.
• In essence, the ϵ_i tell us where the i-th observation is located relative to the margin and hyperplane. For ϵ_i = 0, the training observation x_i is on the correct side of the margin. For ϵ_i > 0, x_i is on the wrong side of the margin, while for ϵ_i > 1, x_i is on the wrong side of the hyperplane.
• C collectively controls how much the individual ϵ_i are allowed to violate the margin. C = 0 implies that ϵ_i = 0 for all i, and thus no violation of the margin is possible; in that case (for separable classes) we have the MMC situation.
• For C > 0, it means that no more than C observations can be on the wrong side of the hyperplane. As C increases, more violations are tolerated and the margin widens; as C decreases, the margin narrows.
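A minimal sketch of how margin softness is controlled in practice, assuming scikit-learn. Note that scikit-learn's C parameter is a penalty on margin violations, so it behaves inversely to the budget C described above: a small scikit-learn C tolerates many violations (a wide, soft margin), while a large one tolerates few.

```python
# Minimal sketch: the effect of the (penalty-style) C parameter in
# scikit-learn on a synthetic, partially overlapping data set.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for penalty in (0.01, 100.0):
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    # A softer margin typically leaves more points on or inside the margin,
    # i.e. more support vectors.
    print(f"C={penalty}: {len(clf.support_vectors_)} support vectors")
```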
• One way to handle non-separable data is to enlarge the feature space with non-linear transformations of the original features, such as quadratic polynomial terms. This is clearly not restricted to quadratic polynomials: higher-degree polynomials, interaction terms and other functional forms could all be considered. The drawback is that this dramatically increases the dimension of the feature space, to the point that some algorithms can become intractable.
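As a rough illustration of that blow-up, including every polynomial feature of degree at most d built from p inputs (constant term included) gives

```latex
\[
\binom{p + d}{d}
\]
```

features; for example, p = 100 and d = 3 already yields C(103, 3) = 176,851 features.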
SVM Kernel Functions
• SVM algorithms use a set of mathematical functions defined as kernels. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions: for example linear, non-linear, polynomial, radial basis function (RBF), and sigmoid.
• Kernel functions can be introduced for sequence data, graphs, text and images, as well as for vectors. The most widely used kernel is the RBF, because it has a localized and finite response across the entire range of the input.
• Kernel functions return the inner product between two points in a suitable feature space. They thus define a notion of similarity at little computational cost, even in very high-dimensional feature spaces.
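A minimal sketch of these kernels in use, assuming scikit-learn; the data set and parameter values are illustrative only:

```python
# Minimal sketch: the same SVM classifier fitted with different kernels.
# scikit-learn is assumed; data set and parameters are illustrative only.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # The kernel computes inner products in an implicit feature space,
    # so the mapping itself is never constructed explicitly.
    clf = SVC(kernel=kernel, gamma="scale")
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>7}: mean cross-validated accuracy = {score:.2f}")
```

On this ring-shaped data, the non-linear kernels (particularly the RBF) would be expected to do markedly better than the linear one.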