Introduction To Support Vector Machines: Andrew Moore CMU
Thanks to Andrew Moore (CMU) and Martin Law (Michigan State University)
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
Linear Classifiers
Estimation: x → f(x, w, b) → y_est, with f(x, w, b) = sign(w · x + b)
(In the figures, one marker denotes class +1, the other denotes class −1.)
w: weight vector
x: data vector
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Many different lines separate the two classes. Any of these would be fine..
..but which is best?
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM: Linear SVM).
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM: Linear SVM).
Support Vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
Intuitively this feels safest: if we have made a small error in the location of the boundary, the maximum margin classifier gives us the least chance of causing a misclassification.
The model is determined only by the support vectors, the datapoints that the margin pushes up against, and empirically it generalizes well.
How to calculate the distance from a point to a line?
The boundary is the line w·x + b = 0, where
x – data vector
w – normal vector of the line
b – scale (offset) value
See http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
In our 2-D case the line is w1·x1 + w2·x2 + b = 0, and the distance from a point x to it is
d(x) = |w1·x1 + w2·x2 + b| / √(w1² + w2²) = |w·x + b| / ‖w‖.
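A quick NumPy illustration of this distance formula (a sketch, not part of the original slides):

```python
# A minimal NumPy sketch: distance of points to the hyperplane
# w.x + b = 0, using d(x) = |w.x + b| / ||w||.
import numpy as np

def point_to_hyperplane_distance(X, w, b):
    """Distance of each row of X to the hyperplane w.x + b = 0."""
    X = np.atleast_2d(X)
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Example: the line x1 + 2*x2 - 4 = 0, i.e. w = (1, 2), b = -4
w = np.array([1.0, 2.0])
b = -4.0
print(point_to_hyperplane_distance([[0.0, 0.0], [2.0, 1.0]], w, b))
# -> [1.7889  0.    ]  (the second point lies on the line)
```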
Estimate the Margin
For the boundary w·x + b = 0 (x – data vector, w – normal vector, b – scale value), the distance from any point x to the boundary is
d(x) = |w·x + b| / ‖w‖.
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible.
We should maximize the margin, m.
(Figure: Class 1 and Class 2 separated by the boundary, with the margin m shown as the width between the closest points of the two classes.)
If the boundary is scaled so that the closest points satisfy |w·x + b| = 1, then m = 2/‖w‖.
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of xi.
The decision boundary should classify all points correctly, i.e. yi(w·xi + b) ≥ 1 for all i.
To see this: when yi = −1 we wish (w·xi + b) ≤ −1, and when yi = 1 we wish (w·xi + b) ≥ 1; both cases are captured by yi(w·xi + b) ≥ 1.
Maximizing the margin m = 2/‖w‖ under these constraints is equivalent to minimizing ½‖w‖², a quadratic program (QP).
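Written out, this is the standard hard-margin formulation consistent with the constraints above:

```latex
\begin{aligned}
\min_{\mathbf{w},\,b} \quad & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} \\
\text{subject to} \quad & y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1,
   \qquad i = 1,\dots,n .
\end{aligned}
```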
Next step… Optional
Converting SVM to a form we can solve
Dual form
Allowing a few errors
Soft margin
Allowing nonlinear boundary
Kernel functions
The Dual Problem (we ignore the derivation)
The new objective function is in terms of the αi only.
It is known as the dual problem: if we know w, we know all the αi; if we know all the αi, we know w.
The dual objective needs to be maximized. The dual problem is therefore:
maximize Σi αi − ½ Σi Σj αi αj yi yj (xi·xj)
subject to αi ≥ 0 and Σi αi yi = 0.
w can be recovered by w = Σi αi yi xi.
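A minimal sketch of feeding this dual to a generic QP solver (assumes NumPy and the CVXOPT package; function and variable names are illustrative, not from the slides):

```python
# Solve the hard-margin SVM dual with the CVXOPT QP solver.
# X is an (n, d) array, y an (n,) array of labels in {+1, -1}.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                              # Q_ij = y_i y_j (x_i . x_j)
    # CVXOPT minimizes (1/2) a'Pa + q'a  s.t.  G a <= h,  A a = b
    P = matrix(Q)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                     # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1))               # sum_i a_i y_i = 0
    b = matrix(0.0)
    a = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (a * y) @ X                            # w = sum_i a_i y_i x_i
    sv = a > 1e-6                              # the support vectors
    b0 = float(np.mean(y[sv] - X[sv] @ w))     # from y_i (w.x_i + b) = 1
    return a, w, b0
```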
Characteristics of the Solution
Many of the αi are zero (see the next page for an example).
w is a linear combination of a small number of data points.
This “sparse” representation can be viewed as a form of data compression.
We can write w = Σ_{i∈SV} αi yi xi, where SV is the set of support vectors (the points with αi > 0).
For testing with a new data point z:
compute wᵀz + b = Σ_{i∈SV} αi yi (xiᵀz) + b, and classify z as class 1 if the sum is positive, and class 2 otherwise.
Note: w need not be formed explicitly.
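A sketch of this test-time rule, using only the support vectors and never forming w (names such as sv_X, sv_y, sv_a are illustrative, not from the slides):

```python
# f(z) = sum_i a_i y_i K(x_i, z) + b over the support vectors only.
import numpy as np

def decision(z, sv_X, sv_y, sv_a, b, kernel=np.dot):
    return sum(a * y * kernel(x, z) for x, y, a in zip(sv_X, sv_y, sv_a)) + b

def classify(z, sv_X, sv_y, sv_a, b, kernel=np.dot):
    return 1 if decision(z, sv_X, sv_y, sv_a, b, kernel) > 0 else -1
```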
A Geometrical Interpretation
(Figure: the ten training points of Class 1 and Class 2 with their αi values. Only the points on the margin have nonzero values, α1 = 0.8, α6 = 1.4 and α8 = 0.6; all other αi are 0.)
Allowing errors in our solutions
We allow an “error” ξi (a slack variable) in classification; it is based on the output of the discriminant function wᵀx + b.
Σi ξi approximates the number of misclassified samples.
(Figure: Class 1 and Class 2 with misclassified points and their slacks ξi.)
Soft Margin Hyperplane
If we minimize Σi ξi, the ξi can be computed by ξi = max(0, 1 − yi(wᵀxi + b)), so ξi = 0 for points on the correct side of the margin.
We want to minimize ½‖w‖² + C Σi ξi, subject to yi(wᵀxi + b) ≥ 1 − ξi and ξi ≥ 0.
C: tradeoff parameter between error and margin.
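A small sketch of the slack computation above (NumPy; names are illustrative):

```python
# xi_i = max(0, 1 - y_i (w . x_i + b)); sum(slacks(...)) is the penalty term.
import numpy as np

def slacks(X, y, w, b):
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```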
Extension to Non-linear Decision Boundary
So far, we have only considered a large-margin classifier with a linear decision boundary.
How to generalize it to become non-linear?
Key idea: transform the data xi into a higher-dimensional feature space with a non-linear transformation φ(·), and learn a linear boundary there.
Transforming the Data (c.f. DHS Ch. 5)
(Figure: points in the input space are mapped by φ(·) into the feature space, where the two classes become linearly separable.)
The Kernel Trick
Recall the SVM optimization problem: in the dual, the data appear only through inner products xiᵀxj.
After the transformation φ(·) they appear only as φ(xi)ᵀφ(xj), so we never need φ explicitly; we only need a kernel function K(xi, xj) = φ(xi)ᵀφ(xj).
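Concretely, the dual from before with the kernel substituted in (standard form; for the soft margin the first constraint becomes 0 ≤ αi ≤ C):

```latex
\begin{aligned}
\max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{n}\alpha_i
   - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
     \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j) \\
\text{subject to} \quad & \alpha_i \ge 0, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0 .
\end{aligned}
```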
An Example for φ(·) and K(·,·)
Suppose φ(·) is given as follows:
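The slide's formulas did not survive conversion; a standard instance, consistent with the degree-2 polynomial kernel used later in this deck, is:

```latex
\varphi\!\left(\begin{bmatrix}x_1\\ x_2\end{bmatrix}\right)
  = \bigl(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\bigr),
\qquad
\varphi(\mathbf{x})^{\top}\varphi(\mathbf{y}) = (1 + \mathbf{x}^{\top}\mathbf{y})^2
  = K(\mathbf{x},\mathbf{y}).
```

So the inner product in the 6-dimensional feature space can be computed directly from the 2-dimensional inputs.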
More on Kernel Functions
Not all similarity measures can be used as a kernel function, however.
The kernel function needs to satisfy the Mercer condition, i.e., the function is “positive semi-definite”.
This implies that the n-by-n kernel matrix, in which the (i, j)-th entry is K(xi, xj), is always positive semi-definite.
This also means that the optimization problem is convex and can be solved in polynomial time!
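A quick numerical check of this property (a sketch, not from the slides):

```python
# A valid kernel matrix K[i, j] = K(x_i, x_j) should have no negative eigenvalues.
import numpy as np

def is_psd(K, tol=1e-8):
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)   # symmetrize for safety
    return bool(np.all(eigvals >= -tol))

X = np.random.randn(20, 3)
K_poly = (X @ X.T + 1.0) ** 2                     # degree-2 polynomial kernel
print(is_psd(K_poly))                             # -> True
```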
Examples of Kernel Functions
Polynomial kernel of degree d: K(x, y) = (xᵀy + 1)^d
Gaussian (radial basis function) kernel with width σ: K(x, y) = exp(−‖x − y‖² / (2σ²))
Sigmoid kernel: K(x, y) = tanh(κ xᵀy + θ) (satisfies the Mercer condition only for some κ and θ)
Non-linear SVMs: Feature spaces
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Example
Suppose we have 5 one-dimensional data points:
x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y1 = 1, y2 = 1, y3 = −1, y4 = −1, y5 = 1.
We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)².
C is set to 100.
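As a sanity check (not part of the original slides), roughly the same setup can be run with scikit-learn; its polynomial kernel is (γ·xᵀy + coef0)^degree, so γ = 1 and coef0 = 1 match K(x, y) = (xy + 1)².

```python
# The slide's 1-D example with scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100.0)
clf.fit(X, y)

print(clf.support_)        # indices of the support vectors
print(clf.dual_coef_)      # a_i * y_i for the support vectors
print(clf.predict([[3.0]]))
```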
Example
By using a QP solver, we get
α1 = 0, α2 = 2.5, α3 = 0, α4 = 7.333, α5 = 4.833.
Note that the constraints are indeed satisfied: all 0 ≤ αi ≤ C, and Σi αi yi = 2.5 − 7.333 + 4.833 = 0.
The support vectors are the points with nonzero αi: {x2 = 2, x4 = 5, x5 = 6}.
Example
(Figure: the five data points 1, 2, 4, 5, 6 on the number line, with the learned decision boundary separating {4, 5} from {1, 2, 6}.)
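For completeness, the discriminant function implied by these α values can be worked out (a derivation from the values above, not copied from the slides):

```latex
\begin{aligned}
f(z) &= \sum_{i \in SV} \alpha_i\, y_i\,(x_i z + 1)^2 + b \\
     &= 2.5\,(2z+1)^2 \;-\; 7.333\,(5z+1)^2 \;+\; 4.833\,(6z+1)^2 + b \\
     &= 0.6667\,z^{2} - 5.333\,z + 9 ,
\end{aligned}
```

where b = 9 is fixed by requiring f(x2) = f(2) = 1 on the support vector x2. The boundary f(z) = 0 lies at z ≈ 2.42 and z ≈ 5.58, separating {4, 5} from {1, 2, 6}.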
Degree of Polynomial Features
Choosing the Kernel Function
Probably the most tricky part of using SVM.
In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a common first try, with the kernel parameters chosen by cross-validation.
Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMlight is among the earliest implementations of SVM
Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification
Prepare the pattern matrix
Select the kernel function to use
Select the kernel parameters and the value of C (e.g. via a validation set or cross-validation)
Execute the training algorithm to obtain the αi
Classify unseen data using the αi and the support vectors (as sketched below)
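A compact end-to-end sketch of these steps with scikit-learn (an assumed library choice; the data and parameter grids are illustrative):

```python
# Select the RBF kernel width and C by cross-validation, then inspect
# the support vectors of the best model.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.randn(100, 2)                    # pattern matrix (toy data)
y = np.sign(X[:, 0] * X[:, 1])                 # toy labels in {-1, +1}
y[y == 0] = 1

params = {"C": [1, 10, 100], "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), params, cv=5).fit(X, y)
print(search.best_params_)
print(search.best_estimator_.support_vectors_.shape)
```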
Conclusion
SVM is a useful alternative to neural networks
Two key concepts of SVM: maximize the margin and the kernel trick
Many SVM implementations are available on the web for you to try on your data
Resources
http://www.kernel-machines.org/
http://www.support-vector.net/
http://www.support-vector.net/icml-tutorial.pdf
http://www.kernel-machines.org/papers/tutorial-
nips.ps.gz
http://www.clopinet.com/isabelle/Projects/SVM/applist.h
tml
2021/3/3 38
Appendix: Distance from a point to a line
Distance and margin
Parameterize the line through P1 = (x1, y1) and P2 = (x2, y2) as
x = x1 + u (x2 − x1)
y = y1 + u (y2 − y1).
The point P on the line closest to a given point P3 = (x3, y3) satisfies (P3 − P) ⊥ (P2 − P1), which gives
u = [(x3 − x1)(x2 − x1) + (y3 − y1)(y2 − y1)] / ‖P2 − P1‖²,
and the distance is d = ‖P3 − P‖. Applied to the line w·x + b = 0, this yields d = |w·x3 + b| / ‖w‖, the expression used for the margin.