Support Vector Machine
Continue...
• Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Continue...
• Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created by using the SVM algorithm.
• We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature.
• Since SVM creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cats and dogs.
• On the basis of the support vectors, it will classify the new example as a cat. A small sketch of this workflow is shown after this list.
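The sketch below (not part of the original slides) trains a linear SVM on made-up cat/dog feature vectors with scikit-learn and classifies a new point; the feature values and labels are assumptions for illustration only.

```python
# Illustrative sketch: fit an SVM on precomputed feature vectors and
# classify a new, ambiguous example. Data are invented for this example.
import numpy as np
from sklearn.svm import SVC

# Each image reduced to a feature vector; labels: 0 = cat, 1 = dog.
X_train = np.array([[0.2, 0.9], [0.4, 0.8], [0.9, 0.1], [0.8, 0.3]])
y_train = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear")   # linear decision boundary
clf.fit(X_train, y_train)

# The "strange creature": features partly cat-like, partly dog-like.
x_new = np.array([[0.35, 0.7]])
print(clf.predict(x_new))        # expected: array([0]) -> the cat class
print(clf.support_vectors_)      # the extreme cases that define the boundary
```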
Continue...
• Consider the diagram below:
• The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM
Continue...
• SVM can be of two types (a short sketch contrasting them follows below):
• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linearly separable data, and the classifier used is called a Non-linear SVM classifier.
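A minimal sketch of the two cases, assuming scikit-learn and toy data invented for illustration: a linear kernel handles the linearly separable set, while an RBF kernel is one common choice for data that no single straight line can separate.

```python
# Illustrative contrast between a linear SVM and a non-linear (RBF) SVM.
import numpy as np
from sklearn.svm import SVC

# Linearly separable data: one straight line separates the classes.
X_lin = np.array([[0, 0], [1, 1], [3, 3], [4, 4]], dtype=float)
y_lin = np.array([0, 0, 1, 1])
linear_clf = SVC(kernel="linear").fit(X_lin, y_lin)

# Non-linearly separable data (XOR-like): no single straight line works,
# so a non-linear kernel such as RBF is used instead.
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])
rbf_clf = SVC(kernel="rbf", gamma=2.0).fit(X_xor, y_xor)

print(linear_clf.score(X_lin, y_lin))  # 1.0 on this toy set
print(rbf_clf.score(X_xor, y_xor))     # training accuracy with the RBF kernel
```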
Continue...
• Hyperplane and Support Vectors in the SVM algorithm:
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
• The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
• We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of either class.
• Support Vectors:
• The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors. The sketch below shows how to read them off a fitted model.

How does SVM work?
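The sketch below, assuming scikit-learn and a made-up 2-D dataset, fits a linear SVM and reads back the hyperplane parameters w and b together with the support vectors.

```python
# Illustrative sketch: inspect the hyperplane and support vectors of a
# fitted linear SVM. Dataset is invented for illustration.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane wTx + b = 0
b = clf.intercept_[0]     # bias term b
print("hyperplane:", w, b)
print("support vectors:\n", clf.support_vectors_)  # points closest to the boundary
```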
Continue...
• Linear SVM:
• The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image below:
Continue...
• Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that separate these classes. Consider the image below:
• Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
• The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
• The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Linear Separators
• A linear separator is a hyperplane wTx + b = 0: points with wTx + b > 0 lie on one side, points with wTx + b < 0 on the other, so the classifier is f(x) = sign(wTx + b). A small sketch of this decision rule follows below.
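A minimal NumPy sketch of this decision rule, with an invented weight vector and bias purely for illustration:

```python
# Sketch of the linear decision rule f(x) = sign(wTx + b);
# w and b here are made-up values for illustration.
import numpy as np

w = np.array([2.0, -1.0])   # normal vector of the hyperplane
b = -0.5                    # bias term

def f(x):
    """Classify x as +1 or -1 depending on which side of wTx + b = 0 it lies."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(f(np.array([1.0, 0.0])))   # 2*1 - 1*0 - 0.5 = 1.5 > 0 -> +1
print(f(np.array([0.0, 1.0])))   # 2*0 - 1*1 - 0.5 = -1.5 < 0 -> -1
```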
Classification Margin
• The distance from an example xi to the separator is r = (wTxi + b) / ||w||.
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between support vectors. A sketch computing these quantities follows below.
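The following sketch, assuming scikit-learn and a toy dataset, computes the distance of each training point to the fitted hyperplane and the resulting margin width (2 / ||w|| for the canonical maximum-margin separator):

```python
# Illustrative sketch: distances r = (wTxi + b) / ||w|| and margin width.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [5, 5], [6, 6]], dtype=float)
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="linear", C=10.0).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]

# Signed distance of each training point to the hyperplane wTx + b = 0.
r = (X @ w + b) / np.linalg.norm(w)
print("distances:", r)

# For the canonical maximum-margin separator, the margin width is 2 / ||w||.
print("margin width:", 2.0 / np.linalg.norm(w))
```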
• Given a solution α1…αn to the dual problem, the solution to the primal is:
f(x) = ΣαiyixiTx + b
• Notice that it relies on an inner product between the test point x and the
support vectors xi – we will return to this later.
• Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points.
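To make the inner-product form concrete, the sketch below (toy data, scikit-learn assumed) evaluates f(x) = ΣαiyixiTx + b directly from the dual coefficients of a fitted SVC and compares it with the library's own decision function:

```python
# Illustrative sketch: evaluate f(x) = sum_i alpha_i y_i xiTx + b from the
# dual coefficients of a fitted linear SVC. Data are invented for illustration.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [4, 4], [5, 5]], dtype=float)
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only; b is intercept_.
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_
b = clf.intercept_[0]

x_test = np.array([3.0, 2.0])
f_manual = np.sum(alpha_y * (sv @ x_test)) + b        # inner products with support vectors
print(f_manual, clf.decision_function([x_test])[0])   # the two values should agree
```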
[Figure: soft-margin classification, where slack variables ξi measure how far points violate the margin]
• The most "important" training points are the support vectors; they define the hyperplane.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products (a numerical check of the constraints follows below):

Find α1…αN such that
Q(α) = Σαi - ½ ΣΣ αiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi (C is the soft-margin parameter that bounds each αi)

The resulting classifier is f(x) = ΣαiyixiTx + b.
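The sketch below (scikit-learn, randomly generated toy data) fits a soft-margin SVM and numerically checks constraints (1) and (2) on the recovered αi:

```python
# Illustrative sketch: check the dual constraints sum_i alpha_i y_i = 0 and
# 0 <= alpha_i <= C on a fitted soft-margin SVM. Data and C are invented.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]          # alpha_i * y_i for the support vectors
alpha = np.abs(alpha_y)              # alpha_i >= 0 by definition

print(np.isclose(alpha_y.sum(), 0))  # constraint (1): sum_i alpha_i y_i = 0
print(np.all(alpha <= C + 1e-9))     # constraint (2): 0 <= alpha_i <= C
```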
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• How about… mapping data to a higher-dimensional space?
[Figure: 1-D data on the x axis, and the same data mapped to the (x, x²) plane, where it becomes separable]
• General idea: the original feature space can always be mapped to some
higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
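As a small sketch of this idea (dataset invented for illustration), 1-D data that no threshold on x can separate becomes linearly separable after the mapping φ(x) = (x, x²):

```python
# Illustrative sketch of the mapping Phi: x -> phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])   # inner vs. outer points: no 1-D threshold separates them

phi = np.column_stack([x, x ** 2])    # mapped features phi(x) = (x, x^2)
clf = SVC(kernel="linear", C=10.0).fit(phi, y)

print(clf.score(phi, y))   # should print 1.0: a straight line in (x, x^2) space separates the classes
```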
• For some functions K(xi,xj), checking directly that K(xi,xj) = φ(xi)Tφ(xj) can be cumbersome.
• Mercer's theorem: every positive semi-definite symmetric function is a kernel.
• Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix (a numerical check follows below):

K =
[ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) ]
[ K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) ]
[    …         …         …      …     …     ]
[ K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) ]
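A short numerical sketch (toy data, RBF kernel chosen for illustration) builds such a Gram matrix and checks that it is symmetric and positive semi-definite:

```python
# Illustrative sketch: Gram matrix K[i, j] = K(x_i, x_j) for the RBF kernel,
# checked for symmetry and non-negative eigenvalues (PSD). Data are invented.
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
n = len(X)

K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # eigenvalues non-negative (PSD)
```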
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.
• Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of αi's at a time, e.g. SMO [Platt '99] and [Joachims '99].
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a trial-and-error manner.