SVM Class
Support Vector Machines
History of SVM
SVM is related to statistical learning theory [3]
SVM was first introduced in 1992 [1]
Linear Classifiers
Estimation: x → f(x, w, b) → y_est
f(x, w, b) = sign(w · x + b)
[Figure: 2-D training data; one marker denotes +1, the other denotes -1]
w: weight vector
x: data vector
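As a concrete illustration of the estimator above, here is a minimal Python/numpy sketch (numpy and the toy numbers are assumptions; the slides do not prescribe a language) of the linear classifier f(x, w, b) = sign(w · x + b) applied to a batch of points.

# Minimal sketch of the linear classifier f(x, w, b) = sign(w . x + b).
# The weight vector, bias and data below are made up for illustration.
import numpy as np

w = np.array([2.0, -1.0])          # weight vector (normal to the boundary)
b = -0.5                           # bias / offset
X = np.array([[1.0, 0.5],          # one data vector per row
              [0.2, 1.5],
              [2.0, -1.0]])

y_est = np.sign(X @ w + b)         # +1 on one side of the boundary, -1 on the other
print(y_est)                       # here: [ 1. -1.  1.]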
Linear Classifiers
f(x, w, b) = sign(w · x + b)
Any of these separating lines would be fine... but which is best?
[Figure: several different lines that all separate the +1 points from the -1 points]
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
[Figure: a separating line with the margin band drawn around it]
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM, called a linear SVM (LSVM).
Maximum Margin
f(x, w, b) = sign(w · x + b)
Support vectors are the datapoints that the margin pushes up against.
[Figure: the maximum margin boundary with the support vectors highlighted on the margin]
Why Maximum Margin?
f(x, w, b) = sign(w · x + b)
The maximum margin classifier is determined only by the support vectors, the datapoints that the margin pushes up against.
Intuitively, the boundary farthest from the closest points of both classes is the safest choice: small perturbations of the data or of the boundary are least likely to cause misclassifications, and statistical learning theory backs this intuition with margin-based generalization bounds.
How to calculate the distance from a point to a line?
[Figure: a point x and the line w · x + b = 0 with its normal vector w]
x – data vector
w – normal vector of the line
b – scale (offset) value
http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
In our 2-D case the line is w1·x1 + w2·x2 + b = 0.
Estimate the Margin
[Figure: a point x and the decision boundary w · x + b = 0]
The distance from a point x to the boundary w · x + b = 0 is
d(x) = |w · x + b| / ||w|| = |w1·x1 + w2·x2 + b| / sqrt(w1² + w2²)
The margin is therefore determined by the training points closest to the boundary.
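A small numpy sketch of the distance formula above (numpy and the toy numbers are assumptions): it computes d(x) = |w · x + b| / ||w|| for every training point; the smallest value is the distance of the closest point to this particular boundary.

# Distance from each point to the boundary w . x + b = 0.
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [4.0, 1.0],
              [0.0, 4.0]])

d = np.abs(X @ w + b) / np.linalg.norm(w)   # d(x) = |w.x + b| / ||w||
print(d)                                     # distance of every point
print(d.min())                               # distance of the closest point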
Large-margin Decision Boundary
The decision boundary should be as far away from the data of both classes as possible.
We should maximize the margin, m.
[Figure: Class 1 and Class 2 separated by the boundary, with the margin width m marked]
If we rescale w and b so that the closest points satisfy |w · x + b| = 1, the margin is m = 2 / ||w||, so maximizing m is the same as minimizing ||w||² / 2.
Finding the Decision Boundary
Let {x1, ..., xn} be our data set and let yi ∈ {+1, -1} be the class label of xi.
The decision boundary should classify all points correctly: when yi = -1 we want w · xi + b ≤ -1, and when yi = +1 we want w · xi + b ≥ +1; together, yi (w · xi + b) ≥ 1 for all i.
The decision boundary is then found by solving the constrained optimization problem: minimize ||w||² / 2 subject to yi (w · xi + b) ≥ 1 for all i.
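To make the optimization concrete, here is a minimal sketch of the hard-margin primal problem, minimize ||w||²/2 subject to yi (w · xi + b) ≥ 1, solved with the cvxpy library on a made-up separable toy set (cvxpy and the data are assumptions, not part of the slides).

# Hard-margin primal SVM: minimize (1/2)||w||^2  s.t.  y_i (w.x_i + b) >= 1.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]      # y_i (w.x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print("w =", w.value, "b =", b.value)
print("margin m = 2/||w|| =", 2.0 / np.linalg.norm(w.value))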
Next Steps (Optional)
Converting SVM to a form we can solve: the dual form
Allowing a few errors: the soft margin
Allowing a nonlinear boundary: kernel functions
The Dual Problem (we ignore the derivation)
The new objective function is in terms of the αi only.
It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w.
The original problem is known as the primal problem.
The objective function of the dual problem needs to be maximized.
The dual problem is therefore:
maximize W(α) = Σi αi − ½ Σi Σj αi αj yi yj (xi · xj)
subject to αi ≥ 0 for all i and Σi αi yi = 0
w can be recovered by w = Σi αi yi xi
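The dual above can be solved with any quadratic-programming or general constrained solver; the following is a rough sketch using scipy's SLSQP optimizer on a tiny made-up data set (scipy and the data are assumptions).

# Solve: maximize  sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
#        s.t.      a_i >= 0,  sum_i a_i y_i = 0
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(a):                             # negate because scipy minimizes
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                        # any support vector (alpha_i > 0)
b = y[sv] - w @ X[sv]                        # from y_i (w.x_i + b) = 1
print("alpha =", np.round(alpha, 3), "w =", w, "b =", b)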
Characteristics of the Solution
Many of the αi are zero (see the example below).
w is a linear combination of a small number of data points.
This "sparse" representation can be viewed as a form of data compression.
[Figure: Class 1 and Class 2 points labelled with their α values; only α1 = 0.8, α6 = 1.4 and α8 = 0.6 are nonzero (the support vectors), all other αi = 0]
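A quick way to see this sparsity in practice is to train a linear SVM with scikit-learn (an assumption; any SVM implementation would do) and inspect which training points end up with nonzero αi.

# Most alpha_i come out zero, so w is a combination of only the support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(10, 2) + [2, 2],     # class +1
               rng.randn(10, 2) - [2, 2]])    # class -1
y = np.array([1] * 10 + [-1] * 10)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
print("support vector indices:", clf.support_)   # only a few of the 20 points
print("alpha_i * y_i:", clf.dual_coef_)          # the nonzero alphas (signed)
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i alpha_i y_i x_i
print("w =", w.ravel(), "b =", clf.intercept_)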
Extension to Non-linear Decision Boundary
So far, we have only considered a large-margin classifier with a linear decision boundary.
How can we generalize it to a nonlinear boundary?
The Kernel Trick
Recall the SVM optimization problem: in the dual, the data appear only through the inner products xi · xj.
We can therefore replace every inner product with a kernel K(xi, xj) = φ(xi) · φ(xj), which computes the inner product in a (possibly much higher-dimensional) feature space without ever computing φ explicitly.
An Example for φ(·) and K(·,·)
Suppose φ(·) maps a 2-D input x = (x1, x2) to
φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2)
Then φ(x) · φ(y) = (x · y + 1)², so the kernel K(x, y) = (x · y + 1)² computes the inner product in this 6-dimensional feature space directly from x and y.
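A short numerical check of the claim above (numpy assumed): the kernel value (x · y + 1)² matches the explicit inner product φ(x) · φ(y).

# Check that K(x, y) = (x.y + 1)^2 equals phi(x).phi(y) for the map above.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z + 1) ** 2)        # kernel value: (1*3 + 2*(-1) + 1)^2 = 4.0
print(phi(x) @ phi(z))         # same value, computed in the feature space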
More on Kernel Functions
Not all similarity measures can be used as kernel functions, however.
The kernel function needs to satisfy the Mercer condition, i.e. the function is "positive definite".
This implies that the n by n kernel matrix, in which the (i, j)-th entry is K(xi, xj), is always positive semi-definite.
This also means that the optimization problem remains convex and can be solved in polynomial time.
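As a quick numerical illustration of the Mercer property (numpy assumed), the Gaussian kernel matrix on a handful of random points has no negative eigenvalues.

# The RBF kernel matrix is positive semi-definite (up to round-off).
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-sq / 2.0)                                  # Gaussian kernel, sigma = 1
print(np.linalg.eigvalsh(K).min() >= -1e-10)           # True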
Examples of Kernel Functions
Polynomial kernel of degree d: K(x, y) = (x · y + 1)^d
Gaussian (radial basis) kernel: K(x, y) = exp(-||x - y||² / (2σ²))
Non-linear SVMs: Feature Spaces
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
Example
Suppose we have 5 one-dimensional data points: x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, i.e. y1=1, y2=1, y3=-1, y4=-1, y5=1.
We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)²
C (the soft-margin penalty parameter) is set to 100.
Example (continued)
[Figure: the five data points 1, 2, 4, 5, 6 on the number line with the resulting nonlinear decision function]
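One way to reproduce this example is with scikit-learn's SVC (an assumption; the slides do not name a library): with gamma=1, coef0=1 and degree=2 its polynomial kernel is exactly (xy + 1)².

# The 5-point example with the degree-2 polynomial kernel and C = 100.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100).fit(X, y)
print("support vectors:", clf.support_vectors_.ravel())
print("alpha_i * y_i  :", clf.dual_coef_.ravel())
print("predictions    :", clf.predict(X))   # should reproduce y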
Conclusion
SVM is a useful alternative to neural networks.
Two key concepts of SVM: maximize the margin and the kernel trick.
Many SVM implementations are available on the web for you to try on your own data.
Resources
http://www.kernel-machines.org/
http://www.support-vector.net/
http://www.support-vector.net/icml-tutorial.pdf
http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
http://www.clopinet.com/isabelle/Projects/SVM/applist.html
Appendix: Distance from a point to a line
Equation for the line: let u be a parameter; any point P on the line through P1 and P2 can be described as
P = P1 + u (P2 - P1)
Let P be the foot of the perpendicular from the point P3 to the line.
Then (P2 - P1) is orthogonal to (P3 - P):
(P3 - P1 - u (P2 - P1)) · (P2 - P1) = 0
That is, u = ((P3 - P1) · (P2 - P1)) / |P2 - P1|²
Distance and margin
In coordinates: x = x1 + u (x2 - x1), y = y1 + u (y2 - y1)
The distance from P3 to the line is d = |P3 - P|.
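A minimal numpy sketch of the appendix derivation with made-up points: solve for u from the orthogonality condition, form P, and measure |P3 - P|.

# Distance from P3 to the line through P1 and P2.
import numpy as np

P1, P2, P3 = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([2.0, 3.0])
u = (P3 - P1) @ (P2 - P1) / ((P2 - P1) @ (P2 - P1))   # orthogonality condition
P = P1 + u * (P2 - P1)                                 # foot of the perpendicular
print("u =", u, "P =", P, "distance =", np.linalg.norm(P3 - P))   # 3.0 here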