
Module4_SupportVectorMachine

References:
1. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press / Prentice Hall of India.
Support Vector Machine
• SVM is a supervised learning model
• Each data point in the dataset is associated with a label
• Example
  – Identify the mails in a mailbox as 'complaint' or 'not complaint'
• Classification
  – Linearly separable data
    » Maximal-margin classifier
  – Linearly inseparable data
    » Kernel-trick SVM
• Regression
  – Support Vector Regression
• SVM can also be used in an unsupervised setting
  – Support Vector Clustering
• The discriminant is defined in terms of the support vectors
Discriminating Plane vs. Max-Margin Plane
Optimal Separating Hyperplane

• In general, the training set is
$$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \qquad r^t = \begin{cases} +1 & \text{if } x^t \in C_1 \\ -1 & \text{if } x^t \in C_2 \end{cases}$$
• Find w and w_0 such that
$$w^T x^t + w_0 \ge +1 \ \text{ for } r^t = +1$$
$$w^T x^t + w_0 \le -1 \ \text{ for } r^t = -1$$
  which can be rewritten as
$$r^t (w^T x^t + w_0) \ge +1$$
(Cortes and Vapnik, 1995; Vapnik, 1995)
Maximizing the Margin
Margin

• The distance from the hyperplane to the instances closest to it on either side
  is called the margin, which should be maximized for best generalization.
• The distance of x^t to the hyperplane is
$$\frac{|w^T x^t + w_0|}{\lVert w \rVert}$$
• We therefore require
$$\frac{r^t (w^T x^t + w_0)}{\lVert w \rVert} \ge \rho, \quad \forall t$$
• Aim: maximize ρ
  – but there are an infinite number of solutions that we can get by scaling w
  – for a unique solution, fix ρ‖w‖ = 1; maximizing the margin is then equivalent to
$$\min_{w,\, w_0} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \ \forall t$$
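The following minimal sketch (illustrative toy data and a hand-picked hyperplane, not values from the slides) computes these distances and the resulting margin:

```python
import numpy as np

# Illustrative, linearly separable toy data: class C1 (r = +1) and C2 (r = -1).
X = np.array([[3.0, 1.0], [3.0, -1.0], [4.0, 0.0],   # C1
              [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])  # C2
r = np.array([+1, +1, +1, -1, -1, -1])

# A candidate separating hyperplane w^T x + w_0 = 0 (assumed, for illustration).
w = np.array([1.0, 0.0])
w0 = -2.0

# Signed distance of each x^t to the hyperplane: r^t (w^T x^t + w_0) / ||w||
dist = r * (X @ w + w0) / np.linalg.norm(w)
print("distances:", dist)         # all positive => the hyperplane separates the data
print("margin rho:", dist.min())  # distance of the closest instance(s) to the hyperplane
```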
Lagrangian Method
• Consider the optimization problem
  maximize f(x, y) subject to g(x, y) = 0.
• The Lagrange function (or Lagrangian) is defined by
$$L(x, y, \lambda) = f(x, y) - \lambda\, g(x, y)$$
• For the general case of n choice variables x = (x_1, …, x_n) and M constraints
  g_k(x) = 0, the Lagrangian takes the form
$$L(x, \lambda_1, \ldots, \lambda_M) = f(x) - \sum_{k=1}^{M} \lambda_k\, g_k(x)$$
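A short worked single-constraint example (illustrative, not from the slides): maximize f(x, y) = xy subject to g(x, y) = x + y − 10 = 0.
$$L(x, y, \lambda) = xy - \lambda (x + y - 10)$$
$$\frac{\partial L}{\partial x} = y - \lambda = 0, \qquad \frac{\partial L}{\partial y} = x - \lambda = 0 \;\Rightarrow\; x = y = \lambda$$
$$x + y - 10 = 0 \;\Rightarrow\; x = y = 5, \qquad f(5, 5) = 25$$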

• Applying this to the primal problem
$$\min_{w,\, w_0} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad r^t (w^T x^t + w_0) \ge +1, \ \forall t$$
  gives the primal Lagrangian
$$\begin{aligned} L_p &= \frac{1}{2}\lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t (w^T x^t + w_0) - 1 \right] \\ &= \frac{1}{2}\lVert w \rVert^2 - \sum_{t=1}^{N} \alpha^t r^t (w^T x^t + w_0) + \sum_{t=1}^{N} \alpha^t \end{aligned}$$
• Setting the derivatives to zero:
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{t=1}^{N} \alpha^t r^t x^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{t=1}^{N} \alpha^t r^t = 0$$
(Cortes and Vapnik, 1995; Vapnik, 1995)
Lagrangian Method
• L_p should be minimized with respect to w and w_0 and maximized with respect
  to α^t ≥ 0. The saddle point gives the solution.
• This is a convex quadratic optimization problem because the main term is
  convex and the linear constraints are also convex.
• Therefore, we can equivalently solve the dual problem, making use of the
  Karush-Kuhn-Tucker (KKT) conditions (a generalization of the method of
  Lagrange multipliers that allows inequality constraints).
• The dual is to maximize L_p with respect to α^t, subject to the constraints
  that the gradients of L_p with respect to w and w_0 are 0 and that α^t ≥ 0.
• This can be solved using quadratic optimization methods. The size of the dual
  depends on N, the sample size, and not on d, the input dimensionality.
Lagrangian Method
• Most α^t are 0 and only a small number have α^t > 0; the corresponding x^t are
  the support vectors.
• Substituting w = Σ_t α^t r^t x^t and Σ_t α^t r^t = 0 into L_p gives the dual
$$\begin{aligned} L_d &= \frac{1}{2} w^T w - w^T \sum_t \alpha^t r^t x^t - w_0 \sum_t \alpha^t r^t + \sum_t \alpha^t \\ &= -\frac{1}{2} w^T w + \sum_t \alpha^t \\ &= -\frac{1}{2} \sum_t \sum_s \alpha^t \alpha^s r^t r^s (x^t)^T x^s + \sum_t \alpha^t \end{aligned}$$
  which is maximized subject to
$$\sum_t \alpha^t r^t = 0 \quad \text{and} \quad \alpha^t \ge 0, \ \forall t$$
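Because the dual is a standard quadratic program, it can be handed to an off-the-shelf QP solver. A minimal sketch, assuming the cvxopt package and the same illustrative toy data as above (not data from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

X = np.array([[3.0, 1.0], [3.0, -1.0], [4.0, 0.0],
              [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
r = np.array([+1.0, +1.0, +1.0, -1.0, -1.0, -1.0])
N = len(r)

# cvxopt solves: minimize (1/2) a^T P a + q^T a  s.t.  G a <= h,  A a = b.
# Maximizing L_d is the same as minimizing -L_d, so:
P = matrix(np.outer(r, r) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for numerical stability
q = matrix(-np.ones(N))          # from -sum_t alpha^t
G = matrix(-np.eye(N))           # -alpha^t <= 0, i.e. alpha^t >= 0
h = matrix(np.zeros(N))
A = matrix(r.reshape(1, -1))     # sum_t alpha^t r^t = 0
b = matrix(np.zeros(1))

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-5                              # support vectors have alpha^t > 0
w = ((alpha * r)[:, None] * X).sum(axis=0)     # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)                # averaged over the support vectors
print("support vectors:\n", X[sv])
print("w =", w, " w0 =", w0)
```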
Lagrangian Method
• Once we solve for α^t, we see that though there are N of them, most vanish
  with α^t = 0 and only a small percentage have α^t > 0.
• The set of x^t whose α^t > 0 are the support vectors, and w is written as the
  weighted sum of these training instances.
  – These are the x^t that satisfy r^t (w^T x^t + w_0) = 1 and lie on the margin.
• We can use this fact to calculate w_0 from any support vector as
  w_0 = r^t − w^T x^t.
  – For numerical stability, it is advised that this be done for all support
    vectors and an average be taken.
• The discriminant thus found is called the support vector machine (SVM).
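The same quantities can be read off a library implementation. A minimal sketch using scikit-learn's SVC with a linear kernel (a large C is used here to approximate the hard margin) on the illustrative toy data from above:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 1.0], [3.0, -1.0], [4.0, 0.0],
              [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
r = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, r)   # large C approximates the hard margin

sv = clf.support_vectors_            # the x^t with alpha^t > 0
alpha_r = clf.dual_coef_.ravel()     # alpha^t * r^t for each support vector
w = clf.coef_.ravel()                # = sum_t alpha^t r^t x^t (available for the linear kernel)

# w_0 = r^t - w^T x^t from each support vector, averaged for numerical stability
r_sv = np.sign(alpha_r)              # recover the label r^t of each support vector
w0 = np.mean(r_sv - sv @ w)
print("support vectors:\n", sv)
print("w =", w)
print("w0 (averaged) =", w0, " sklearn intercept_ =", clf.intercept_[0])
```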
Margin
• For a two-class problem where the instances of the classes
are shown by plus signs and dots, the thick line is the
boundary and the dashed lines define the margins on either
side. Circled instances are the support vectors.
Margin
• The majority of the α^t are 0, for which r^t (w^T x^t + w_0) > 1.
  – These are the x^t that lie more than sufficiently away from the discriminant,
    and they have no effect on the hyperplane.
• The instances that are not support vectors carry no information; even if any
  subset of them were removed, we would still get the same solution.
• From this perspective, the SVM algorithm can be likened to the condensed
  nearest neighbor algorithm, which stores only the instances neighboring (and
  hence constraining) the class discriminant.
• Being a discriminant-based method, the SVM cares only about the instances
  close to the boundary and discards those that lie in the interior.
  – Using this idea, it is possible to use a simpler classifier before the SVM to
    filter out a large portion of such instances, thereby decreasing the
    complexity of the optimization step of the SVM.

• During testing, the margin is not enforced.
  – We calculate g(x) = w^T x + w_0 and choose according to the sign of g(x):
    choose C1 if g(x) > 0 and C2 otherwise (see the snippet below).
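A minimal sketch of this decision rule, reusing the illustrative w and w_0 from the earlier toy example (assumed values, not from the slides):

```python
import numpy as np

w, w0 = np.array([1.0, 0.0]), -2.0   # illustrative hyperplane from the toy example

def predict(x):
    """Choose C1 if g(x) = w^T x + w_0 > 0, else C2; the margin is not enforced."""
    g = w @ np.asarray(x, dtype=float) + w0
    return "C1" if g > 0 else "C2"

print(predict([2.5, 0.3]))    # C1
print(predict([0.5, -1.0]))   # C2
```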
SVM by Example – Linearly Separable Data
• The support vectors are shown in the figure.
SVM Architecture
• The three support vectors without the bias term are represented as follows.
• The three support vectors with the bias term are represented as follows.
• By solving, α1 = −3.5, α2 = 0.75 and α3 = 0.75 (see the sketch below).
Hyperplane
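The support vectors themselves appear only in the slide's figures, so the sketch below assumes the values commonly paired with these α's: s1 = (1, 0) in the negative class and s2 = (3, 1), s3 = (3, −1) in the positive class, each augmented with a bias component of 1. Solving the 3×3 system Σj αj (s̃j · s̃i) = ri reproduces the α values above and yields the hyperplane:

```python
import numpy as np

# Assumed support vectors (the slide's figure did not survive extraction):
# s1 = (1, 0) from the negative class; s2 = (3, 1), s3 = (3, -1) from the positive class.
S = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
r = np.array([-1.0, +1.0, +1.0])

# Augment each support vector with a bias component of 1 (absorbs w0 into w).
S_aug = np.hstack([S, np.ones((3, 1))])

# Solve sum_j alpha_j (s~_j . s~_i) = r_i for the (signed) alphas.
K = S_aug @ S_aug.T
alpha = np.linalg.solve(K, r)
print("alpha =", alpha)                      # approx [-3.5, 0.75, 0.75]

# Augmented weight vector w~ = sum_j alpha_j s~_j  =>  w = w~[:2], w0 = w~[2]
w_aug = alpha @ S_aug
print("w =", w_aug[:2], " w0 =", w_aug[2])   # approx w = (1, 0), w0 = -2, i.e. the plane x1 = 2
```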
SVM – Linearly Inseparable data – Case 1
Nonlinearly separable sample data points
Non-Linear SVM
Data represented in feature space
• The two support vectors (in feature space) are marked as yellow circles.
Hyperplane
• The discriminating hyperplane corresponds to the values α1 = −7 and α2 = 4.
SVM – Linearly Inseparable data – Case 2
Nonlinearly separable sample data points
SVM – Linearly Inseparable data – Case 2

• In the previous example, the input and feature spaces are the same size.
  – However, it is often the case that, in order to effectively separate the data,
    we must use a feature space that is of (sometimes very much) higher dimension
    than our input space.
• Let us now consider an alternative mapping function
  – which transforms our data from the 2-dimensional input space to a
    3-dimensional feature space.
SVM – Linearly Inseparable data – Case 2
• Using this alternative mapping, the data in the new feature space looks as
  shown in the figure, for the positive and the negative samples.
SVM – Linearly Inseparable data – Case 2

• Solving for the 8 support vectors,
  – αi = 1/46 for the positive samples
  – αi = −7/46 for the negative samples
• Therefore, the discriminating feature is x3.
  – Hence, g(x) = σ(x3)
SVM – Linearly Inseparable data – Case 2

• The decision surface induced in the input space by the new mapping function is
  shown in the figure.
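This explicit-mapping view connects to the kernel trick mentioned at the start of the module: instead of computing φ(x) explicitly, a kernel can return the feature-space dot product directly. A minimal sketch with an illustrative quadratic mapping (not the slide's specific φ):

```python
import numpy as np

def phi(x):
    """Illustrative mapping from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2 equals phi(x) . phi(z) without forming phi explicitly."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 1.0
print(poly_kernel(x, z))        # 1.0 -- the same value
```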