Lecture 18 - SVM
Support Vector Machines - Overview
• Proposed by Vapnik and his colleagues
  - Started in 1963, taking shape in the late 70's as part of his statistical
    learning theory (with Chervonenkis)
  - Current form established in the early 90's (with Cortes)
• Became popular in the last decade
  - Classification, regression (function approximation), optimization
• Basic ideas
  - Maximize the margin of the decision boundary
  - Overcome the linear separability problem by transforming the problem
    into a higher-dimensional space using kernel functions
Linear Classifiers
f(x, w, b) = sign(w · x + b)
[Figure: input x passes through the classifier f to give the estimate y_est;
two classes of points (labeled +1 and -1) with several candidate separating lines.]
Any of these would be fine...
...but which is best?
Classifier Margin
f(x, w, b) = sign(w · x + b)
[Figure: a separating line with its margin shaded on both sides.]
Define the margin of a linear classifier as the width that the boundary
could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x + b)
[Figure: the maximum-margin separating line, labeled "Linear SVM".]
The maximum margin linear classifier is the linear classifier with the
maximum margin.
This is the simplest kind of SVM (called an LSVM).
Maximum Margin
[Figure: the maximum-margin line with the datapoints on the margin highlighted.]
Support vectors are those datapoints that the margin pushes up against.
The maximum margin linear classifier is the linear classifier with the
maximum margin.
This is the simplest kind of SVM (called an LSVM).
Why Maximum Margin?
[Figure: the maximum-margin line with its support vectors highlighted.]
Support vectors are those datapoints that the margin pushes up against.
Intuitively, placing the boundary as far as possible from the nearest
datapoints makes the classifier more robust to noise and less prone to
overfitting; this is the simplest kind of SVM (called an LSVM).
Specifying a line and margin
[Figure: the classifier boundary with its parallel plus-plane and minus-plane.]
• Plus-plane: { x : w · x + b = +1 }
• Classifier boundary: { x : w · x + b = 0 }
• Minus-plane: { x : w · x + b = -1 }
• x – data vector, w – normal vector to the boundary, b – scale (offset) value
Maximize Margin
[Figure: the two classes (+1 and -1) separated by the line w · x + b = 0, with the margin shown.]
• Decision boundary: w · x + b = 0
• Margin: maximize the distance from the boundary to the closest datapoint:
  argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt(Σ_{j=1}^d w_j²)
  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0
• Fix the scale of w and b so that the closest points satisfy
  w · x_i + b ≥ +1 iff y_i = +1
  w · x_i + b ≤ -1 iff y_i = -1,
  i.e. y_i (x_i · w + b) ≥ 1 for all i.
• The problem then becomes
  argmin_{w,b} Σ_{j=1}^d w_j²
  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
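As a quick check (not on the original slide) that minimizing Σ_j w_j² maximizes
the margin, the distance between the plus-plane and the minus-plane works out
to 2/||w||:

% Take x^- on the minus-plane; the closest point on the plus-plane is x^+ = x^- + \lambda w.
w \cdot x^+ + b = 1
\;\Rightarrow\; \underbrace{(w \cdot x^- + b)}_{=\,-1} + \lambda\,\|w\|^2 = 1
\;\Rightarrow\; \lambda = \frac{2}{\|w\|^2}

\text{Margin} \;=\; \|x^+ - x^-\| \;=\; \lambda\,\|w\| \;=\; \frac{2}{\|w\|}

So maximizing the margin is equivalent to minimizing ||w||² = Σ_j w_j², which is
the form used above.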
Linear SVM
• Linear model:
  f(x) = +1 if w · x + b ≥ 1
         -1 if w · x + b ≤ -1
• Constrained optimization
• Lagrangian method
SVM – Lagrangian Formulation
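A compact sketch of the standard Lagrangian treatment of the hard-margin
problem (the usual derivation, using the notation of the previous slides):

% Primal Lagrangian for  min ||w||^2/2  s.t.  y_i (w \cdot x_i + b) \ge 1:
L_P(w, b, \lambda) = \frac{\|w\|^2}{2}
  - \sum_{i=1}^{N} \lambda_i \bigl( y_i (w \cdot x_i + b) - 1 \bigr), \qquad \lambda_i \ge 0

% Setting the derivatives with respect to w and b to zero:
\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \lambda_i y_i x_i,
\qquad
\frac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \lambda_i y_i = 0

% Substituting back gives the dual problem, maximized over \lambda_i \ge 0:
L_D(\lambda) = \sum_{i=1}^{N} \lambda_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \lambda_i \lambda_j\, y_i y_j\, (x_i \cdot x_j)

% Complementary slackness: \lambda_i \bigl( y_i (w \cdot x_i + b) - 1 \bigr) = 0,
% so \lambda_i > 0 only for points exactly on the margin -- the support vectors.

This is why, in the example below, only the two points lying on the margin
receive a nonzero multiplier λ.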
Example of Linear SVM
  x1       x2       y    λ
  0.3858   0.4687   +1   65.5261
  0.4871   0.6110   -1   65.5261
  0.9218   0.4103   -1   0
  0.7382   0.8936   -1   0
  0.1763   0.0579   +1   0
  0.4057   0.3529   +1   0
  0.9355   0.8132   -1   0
  0.2146   0.0099   +1   0
Example of Linear SVM
Support vectors: the two datapoints with nonzero multiplier λ (the first two
rows of the table above); all remaining points have λ = 0.
Example of Linear SVM
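To make the example concrete, here is a small Python sketch (not from the
original slides) that recovers w and b from the multipliers in the table above,
using w = Σ_i λ_i y_i x_i and the margin condition y_i (w · x_i + b) = 1 for the
support vectors:

import numpy as np

# Data from the table above: inputs, labels, and Lagrange multipliers lambda_i.
X = np.array([[0.3858, 0.4687],
              [0.4871, 0.6110],
              [0.9218, 0.4103],
              [0.7382, 0.8936],
              [0.1763, 0.0579],
              [0.4057, 0.3529],
              [0.9355, 0.8132],
              [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])
lam = np.array([65.5261, 65.5261, 0, 0, 0, 0, 0, 0])

# Weight vector from the stationarity condition: w = sum_i lambda_i * y_i * x_i
w = (lam * y) @ X

# Bias from the support vectors (lambda_i > 0): y_i * (w . x_i + b) = 1
sv = lam > 0
b = np.mean(y[sv] - X[sv] @ w)

print("w =", w)            # roughly [-6.64, -9.32]
print("b =", round(b, 2))  # roughly 7.93

# Check: sign(w . x + b) reproduces the labels of all eight points.
print(np.sign(X @ w + b) == y)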
Learning Linear SVM
• The decision boundary depends only on the support vectors
• If you have a data set with the same support vectors, the decision
  boundary will not change (see the sketch below)
  f(x_i) = +1 if w · x_i + b ≥ 1
           -1 if w · x_i + b ≤ -1
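A minimal sketch (scikit-learn, not part of the original slides) illustrating
this point: refitting on the support vectors alone reproduces essentially the
same boundary. The toy data are made up for illustration.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs as toy data.
X = np.vstack([rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),
               rng.normal(loc=[2, 2], scale=0.3, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Hard-margin-like linear SVM (very large C, so almost no slack).
clf_full = SVC(kernel="linear", C=1e6).fit(X, y)

# Refit using only the support vectors of the first model.
sv_idx = clf_full.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])

# The two models have (almost) identical w and b.
print(clf_full.coef_, clf_full.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)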
Support Vector Machines
• What if the problem is not linearly separable?
Support Vector Machines
• What if the problem is not linearly separable?
• Introduce slack variables ξ_i
• Need to minimize:
  L(w) = ||w||² / 2 + C Σ_{i=1}^N (ξ_i)^k
• Subject to:
  w · x_i + b ≥ +1 - ξ_i if y_i = +1
  w · x_i + b ≤ -1 + ξ_i if y_i = -1
[Figure: two candidate decision boundaries, B1 and B2, shown with their margin
planes b11/b12 and b21/b22.]
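A small sketch (scikit-learn, not from the slides) showing the role of the cost
parameter C that multiplies the slack penalty: small C tolerates more margin
violations, large C approaches the hard-margin solution. The toy data are made up.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Two overlapping blobs: not linearly separable.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # More slack (small C) usually means more support vectors.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")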
Nonlinear Support Vector Machines
• Transform the data into a higher-dimensional space via a mapping x → Φ(x),
  so that the classes become (nearly) linearly separable there
• Decision boundary: w · Φ(x) + b = 0
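A brief sketch (scikit-learn, not from the slides) of the kernel idea on data
that no line can separate; dataset and parameters are illustrative only.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by any line in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM struggles, while an RBF-kernel SVM separates the classes by
# implicitly working in a higher-dimensional feature space Phi(x).
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))   # typically near chance
print("RBF kernel accuracy:   ", rbf.score(X, y))      # typically near 1.0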
Example of Nonlinear SVM
• Robust to noise
• Overfitting is handled by maximizing the margin of the decision boundary
• SVM can handle irrelevant and redundant attributes better than many
  other techniques
• The user needs to provide the type of kernel function and the cost function
• Difficult to handle missing values
References
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition.
  Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
• Download SVM-light:
  http://svmlight.joachims.org/
Some other issues in SVM
• SVM works only in a real-valued space. For a categorical attribute, we
  need to convert its categorical values to numeric values.
• SVM performs only two-class classification. For multi-class problems,
  strategies such as one-against-rest and error-correcting output coding
  can be applied (a minimal one-against-rest sketch follows below).
• The hyperplane produced by SVM is hard for human users to interpret,
  and kernels make the matter worse. Thus, SVM is commonly used in
  applications that do not require human understanding of the model.
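A minimal sketch (scikit-learn, not from the slides) of the one-against-rest
strategy on a standard three-class dataset; the parameters are illustrative only.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class dataset; each binary SVM separates one class from the rest.
X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

print(len(ovr.estimators_))   # 3 binary SVMs, one per class
print(ovr.score(X, y))        # training accuracy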