20 SVM
References:
1. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press / Prentice Hall of India.
Support Vector Machine
• SVM is a supervised learning model
• Each instance in the dataset is associated with a label
• Example
– Classify the mails in a mailbox as 'complaint' or 'not complaint' (a usage sketch follows this list)
• Classification
– Linearly separable data
» Maximal-margin classifier
– Linearly inseparable data
» Kernel-trick SVM
• Regression
– Support Vector Regression
• SVM can also be unsupervised
– Support Vector Clustering
• The discriminant is defined in terms of the support vectors
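The sketch below shows both supervised uses on toy data. It is a minimal illustration, assuming scikit-learn as the library (the slides do not name one); the data and labels are invented for the example.

```python
# Minimal sketch (assumes scikit-learn; the slides do not name a library).
# SVC handles the classification case, SVR the regression case.
import numpy as np
from sklearn.svm import SVC, SVR

# Toy labelled data: two well-separated clusters for classification.
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.5, 2.8]])
r = np.array([-1, -1, +1, +1])          # class labels r^t in {-1, +1}

clf = SVC(kernel="linear")              # maximal-margin classifier (linear case)
clf.fit(X, r)
print(clf.predict([[0.2, 0.1], [3.2, 3.1]]))   # -> [-1  1]

# Toy regression data for Support Vector Regression.
x = np.linspace(0, 5, 20).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0
reg = SVR(kernel="linear").fit(x, y)
print(reg.predict([[2.5]]))             # close to 6.0
```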
Discriminating Plane
• Max-margin plane: among all planes that separate the two classes, the one farthest from the closest instances of either class
Optimal Separating Hyperplane
• In general, given a two-class sample
$$\mathcal{X} = \{x^t, r^t\}, \qquad r^t = \begin{cases} +1 & \text{if } x^t \in C_1 \\ -1 & \text{if } x^t \in C_2 \end{cases}$$
find $w$ and $w_0$ such that
$$w^T x^t + w_0 \ge +1 \quad \text{for } r^t = +1$$
$$w^T x^t + w_0 \le -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t (w^T x^t + w_0) \ge +1$$
• For a margin of size $\rho$, we require
$$\frac{r^t (w^T x^t + w_0)}{\lVert w \rVert} \ge \rho, \quad \forall t$$
• Aim: to maximize $\rho$
– but there are an infinite number of solutions that we can get by scaling $w$
– for a unique solution, fix $\rho \lVert w \rVert = 1$; maximizing the margin then becomes
$$\min \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \; \forall t$$
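To make the optimization concrete, here is a hedged sketch: a large C in scikit-learn's SVC approximates the hard-margin problem (an assumption, since the slides do not prescribe a solver), after which the constraints $r^t(w^T x^t + w_0) \ge 1$ and the half-margin $1/\lVert w \rVert$ can be checked directly.

```python
# Sketch: approximate the hard-margin problem with a large C
# (assumption: scikit-learn's SVC; the slides do not prescribe a solver).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.0], [5.0, 4.5]])
r = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, r)    # large C ~ hard margin
w, w0 = clf.coef_[0], clf.intercept_[0]

# Every training instance must satisfy r^t (w^T x^t + w_0) >= 1 (up to tolerance).
print(r * (X @ w + w0))                        # all entries >= 1 (approximately)
print("half-margin 1/||w|| =", 1.0 / np.linalg.norm(w))
```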
Lagrangian Method
• Consider the optimization problem
maximize f(x, y) subject to g(x, y) = 0.
The Lagrange function (or Lagrangian) is defined by
$$L(x, y, \lambda) = f(x, y) - \lambda \cdot g(x, y)$$
• For the general case of an arbitrary number n of choice variables and an arbitrary number M of constraints, the Lagrangian takes the form
$$\mathcal{L}(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_M) = f(x_1, \ldots, x_n) - \sum_{j=1}^{M} \lambda_j\, g_j(x_1, \ldots, x_n)$$
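As a quick illustration (a toy example, not from the slides): maximize f(x, y) = xy subject to x + y = 1.

```latex
% Toy illustration (not from the slides): maximize f(x,y) = xy
% subject to g(x,y) = x + y - 1 = 0.
\[
\mathcal{L}(x, y, \lambda) = xy - \lambda (x + y - 1)
\]
\[
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0
\]
\[
\Rightarrow\; x = y = \lambda = \tfrac{1}{2}, \qquad
f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{4}
\]
```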
• Applying this to the SVM problem
$$\min \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \; \forall t$$
gives the primal Lagrangian
$$L_p = \frac{1}{2}\lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right]$$
$$\phantom{L_p} = \frac{1}{2}\lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t$$
• Setting the derivatives to zero:
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$
(Cortes and Vapnik, 1995; Vapnik, 1995)
Lagrangian Method
• $L_p$ should be minimized with respect to $w$ and $w_0$ and maximized with respect to $\alpha^t \ge 0$; the saddle point gives the solution.
• Substituting $w = \sum_t \alpha^t r^t x^t$ and $\sum_t \alpha^t r^t = 0$ into $L_p$ gives the dual
$$L_d = \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t$$
$$\phantom{L_d} = -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t$$
• $L_d$ is maximized with respect to $\alpha^t \ge 0$ subject to $\sum_t \alpha^t r^t = 0$.
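As a sanity check, the dual can be solved numerically on a toy dataset. The sketch below uses scipy's SLSQP solver (an assumption; any QP solver would do) and then recovers $w = \sum_t \alpha^t r^t x^t$.

```python
# Sketch: solve the dual  max_a  sum_t a_t - 1/2 sum_{t,s} a_t a_s r_t r_s x_t.x_s
# subject to a_t >= 0 and sum_t a_t r_t = 0, on a toy dataset.
# (Assumption: scipy's SLSQP as the QP solver; the slides do not prescribe one.)
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.0], [5.0, 4.5]])
r = np.array([+1.0, +1.0, -1.0, -1.0])
K = (r[:, None] * X) @ (r[:, None] * X).T        # r^t r^s (x^t)^T x^s

def neg_dual(a):                                  # minimize -L_d
    return 0.5 * a @ K @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ r}     # sum_t a_t r_t = 0
bnds = [(0, None)] * len(r)                       # a_t >= 0
res = minimize(neg_dual, x0=np.zeros(len(r)),
               method="SLSQP", bounds=bnds, constraints=cons)

alpha = res.x
w = (alpha * r) @ X                               # w = sum_t a_t r_t x_t
print("alpha =", alpha.round(3), " w =", w.round(3))
```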
Lagrangian Method
• Once we solve for the $\alpha^t$, we see that although there are N of them, most vanish with $\alpha^t = 0$ and only a small percentage have $\alpha^t > 0$.
• The set of $x^t$ with $\alpha^t > 0$ are the support vectors; $w$ is written as the weighted sum of these training instances.
– These are the $x^t$ that satisfy $r^t (w^T x^t + w_0) = 1$ and lie on the margin.
• We can use this fact to calculate $w_0$ from any support vector as $w_0 = r^t - w^T x^t$.
– For numerical stability, it is advised that this be done for all support vectors and an average be taken (see the sketch below).
• The discriminant thus found is called the support vector machine (SVM).
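A hedged sketch of that $w_0$ computation, averaging over all support vectors as advised. It assumes a scikit-learn linear SVC, whose `dual_coef_` attribute stores $\alpha^t r^t$ for the support vectors only.

```python
# Sketch: recover w and w_0 from a fitted linear SVC (scikit-learn; dual_coef_
# holds alpha^t r^t for the support vectors only).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.0], [5.0, 4.5]])
r = np.array([+1, +1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, r)

sv = clf.support_vectors_                      # the x^t with alpha^t > 0
w = clf.dual_coef_[0] @ sv                     # w = sum_t alpha^t r^t x^t
r_sv = r[clf.support_]                         # labels of the support vectors

# w_0 = r^t - w^T x^t for each support vector; average for numerical stability.
w0 = np.mean(r_sv - sv @ w)
print("w0 =", w0, " (library value:", clf.intercept_[0], ")")
```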
Margin
• For a two-class problem where the instances of the classes
are shown by plus signs and dots, the thick line is the
boundary and the dashed lines define the margins on either
side. Circled instances are the support vectors.
Margin
• The majority of the $\alpha^t$ are 0, for which $r^t (w^T x^t + w_0) > 1$.
– These are the $x^t$ that lie more than sufficiently far from the discriminant, and they have no effect on the hyperplane.
• The instances that are not support vectors carry no information; even if any subset of them is removed, we would still get the same solution (demonstrated in the sketch below).
• From this perspective, the SVM algorithm can be likened to the condensed nearest neighbor algorithm, which stores only the instances neighboring (and hence constraining) the class discriminant.
• Being a discriminant-based method, the SVM cares only about the instances close to the boundary and discards those that lie in the interior.
– Using this idea, it is possible to use a simpler classifier before the SVM to filter out a large portion of such instances, thereby decreasing the complexity of the optimization step of the SVM.
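The claim that non-support vectors carry no information can be checked directly; a sketch, again assuming scikit-learn and synthetic data:

```python
# Sketch: refitting on the support vectors alone reproduces the same hyperplane,
# illustrating that non-support vectors carry no information (assumes scikit-learn).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
r = np.array([-1] * 20 + [+1] * 20)

full = SVC(kernel="linear", C=1e6).fit(X, r)

# Keep only the support vectors and refit.
idx = full.support_
reduced = SVC(kernel="linear", C=1e6).fit(X[idx], r[idx])

print(np.allclose(full.coef_, reduced.coef_, atol=1e-4),
      np.allclose(full.intercept_, reduced.intercept_, atol=1e-4))
```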
Hyperplane
• By solving, we obtain α1 = −3.5, α2 = 0.75 and α3 = 0.75 (the example data are shown in the figure).
SVM – Linearly Inseparable data – Case 1
Nonlinearly separable sample data points
Non-Linear SVM
Data represented in feature space
• The two support vectors (in feature space) are marked as
yellow circles.
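One way to see the feature-space idea concretely: map the inputs through an explicit φ and run the linear machinery there. The sketch below uses the illustrative map φ(x1, x2) = (x1, x2, x1² + x2²), an assumption, since the slides' exact mapping is not shown; it makes concentric-circle data linearly separable.

```python
# Sketch: an explicit feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2) makes
# concentric-circle data linearly separable (illustrative map; the slides'
# exact mapping is not shown). Assumes scikit-learn.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Lift to feature space: append the squared radius as a third coordinate.
Phi = np.hstack([X, (X ** 2).sum(axis=1, keepdims=True)])

clf = SVC(kernel="linear").fit(Phi, y)
print("training accuracy in feature space:", clf.score(Phi, y))   # ~1.0
```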
Hyperplane
• The discriminating hyperplane corresponding to the values α1 = −7 and α2 = 4.
SVM – Linearly Inseparable data – Case 2
Nonlinearly separable sample data points
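For such data, the kernel trick avoids computing φ explicitly: dot products $(x^t)^T x^s$ are replaced by a kernel $K(x^t, x^s)$. A sketch on the same kind of data, with an RBF kernel as an illustrative choice (the slides do not fix one):

```python
# Sketch: the kernel trick - replace (x^t)^T x^s with K(x^t, x^s) and never
# compute phi explicitly. RBF kernel shown; the choice is illustrative.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)      # kernel SVM in the input space
print("training accuracy:", clf.score(X, y))      # ~1.0
print("number of support vectors:", clf.n_support_.sum())
```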