This document discusses the derivation and application of Support Vector Machines (SVM) and kernels, focusing on the dual formulation for both linearly separable and non-separable cases. It highlights the importance of support vectors, the role of Lagrange multipliers, and the use of various kernel functions to avoid explicit feature computation. Additionally, it addresses concerns about overfitting in high-dimensional feature spaces and strategies to mitigate it.


Support Vector Machines & Kernels

Lecture 6

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Vibhav Gogate
Dual SVM derivation (1) – the linearly separable case

Original optimization problem (one constraint per training example):

Rewrite the constraints, introducing one Lagrange multiplier per example.

Lagrangian:

Our goal now is to solve:

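The formulas on this slide did not survive extraction. As a hedged reconstruction, the standard hard-margin primal, its Lagrangian, and the min-max problem referred to above are:

\min_{w,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_j (w \cdot x_j + b) \ge 1 \;\; \forall j

L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j \left[ y_j (w \cdot x_j + b) - 1 \right], \qquad \alpha_j \ge 0

\min_{w,b} \; \max_{\alpha \ge 0} \; L(w, b, \alpha)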

Dual SVM derivation (2) – the linearly separable case

(Primal)

Swap min and max

(Dual)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!

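Written out (a standard rendering of what the slide shows; under Slater's condition the two optimal values coincide):

\text{(Primal)} \quad \min_{w,b} \; \max_{\alpha \ge 0} \; L(w, b, \alpha)

\text{(Dual)} \quad \max_{\alpha \ge 0} \; \min_{w,b} \; L(w, b, \alpha)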
Dual SVM derivation (3) – the linearly separable case

[Background figure: an example feature mapping, Φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )ᵀ ]

(Dual)

Can solve for optimal w, b as a function of α:

∂L/∂w = w − Σj αj yj xj = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)

(Sums are over all training examples; the αj and yj are scalars; xj · xk is a dot product.)


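For reference, the resulting dual problem in its standard hard-margin form (the slide's own rendering was lost in extraction):

\max_{\alpha} \; \sum_j \alpha_j - \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k \, y_j y_k \, (x_j \cdot x_k) \quad \text{s.t.} \quad \alpha_j \ge 0 \;\; \forall j, \quad \sum_j \alpha_j y_j = 0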


So, in dual formulation we will solve for α directly!


• w and b are computed from α (if needed)
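A minimal numpy sketch of this step, assuming the optimal dual variables have already been found by some QP solver (the function and variable names are illustrative, not from the slides):

import numpy as np

def recover_w_b(X, y, alpha, tol=1e-8):
    """Recover (w, b) from the dual variables of a hard-margin SVM.

    X: (m, d) training inputs, y: (m,) labels in {-1, +1},
    alpha: (m,) optimal dual variables.
    """
    # w = sum_j alpha_j y_j x_j
    w = (alpha * y) @ X
    # Any support vector (alpha_j > 0) has a tight constraint y_j (w.x_j + b) = 1,
    # so b = y_j - w.x_j for that j.
    j = int(np.where(alpha > tol)[0][0])
    b = y[j] - w @ X[j]
    return w, b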
Dual SVM derivation (3) – the linearly separable case

Lagrangian:

αj > 0 for some j implies the constraint is tight. We use this to obtain b:

(1)

(2)

(3)
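The three numbered steps are images in the original; a standard reconstruction, for some j with αj > 0, is:

(1) \quad y_j (w \cdot x_j + b) = 1

(2) \quad b = y_j - w \cdot x_j \qquad (\text{using } y_j \in \{-1, +1\}, \text{ so } 1/y_j = y_j)

(3) \quad b = y_j - \sum_k \alpha_k y_k \, (x_k \cdot x_j)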
Classification rule using dual solution

Using the dual solution, prediction requires only dot products of the new example's feature vector with the support vectors.
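A hedged reconstruction of the rule the slide shows:

\hat{y}(x) = \operatorname{sign}\Big( \sum_{j \in SV} \alpha_j y_j \, (x_j \cdot x) + b \Big)

Only training points with αj > 0 (the support vectors) contribute to the sum.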
Dual for the non-separable case

Primal: Solve for w,b,α:

Dual:

What changed?
• Added an upper bound of C on each αi!
• Intuitive explanation:
  – Without slack, αi → ∞ when constraints are violated (points misclassified)
  – An upper bound of C limits the αi, so misclassifications are allowed
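For reference, the standard soft-margin formulas (the slide's own boxes did not extract):

\text{Primal:} \quad \min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_j \xi_j \quad \text{s.t.} \quad y_j (w \cdot x_j + b) \ge 1 - \xi_j, \;\; \xi_j \ge 0

\text{Dual:} \quad \max_{\alpha} \; \sum_j \alpha_j - \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k \, y_j y_k \, (x_j \cdot x_k) \quad \text{s.t.} \quad 0 \le \alpha_j \le C, \quad \sum_j \alpha_j y_j = 0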
Support vectors

• Complementary slackness conditions (written out after this list of bullets):

• Support vectors: points xj such that yj (w* · xj + b) ≤ 1
  (includes all j such that αj* > 0, but also additional points where αj* = 0 ∧ yj (w* · xj + b) = 1)

• Note: the SVM dual solution may not be unique!
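The complementary slackness conditions referenced above, in their standard soft-margin (KKT) form:

\alpha_j^* \left[ y_j (w^* \cdot x_j + b) - 1 + \xi_j^* \right] = 0, \qquad (C - \alpha_j^*) \, \xi_j^* = 0 \quad \forall j

So αj* > 0 forces the margin constraint to be tight (up to the slack ξj*), and αj* < C forces ξj* = 0.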


Dual SVM interpretation: Sparsity

[Figure: separating hyperplane w.x + b = 0 with margin boundaries w.x + b = +1 and w.x + b = −1]

Final solution tends to be sparse:
• αj = 0 for most j
• don’t need to store these points to compute w or make predictions

Non-support vectors:
• αj = 0
• moving them will not change w

Support vectors:
• αj > 0
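A small illustration of this sparsity using scikit-learn (not from the slides; the toy data and parameters are arbitrary):

import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs; most points end up far from the boundary.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) + [2, 2], rng.randn(200, 2) - [2, 2]])
y = np.hstack([np.ones(200), -np.ones(200)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors (alpha_j > 0) are stored; typically far fewer than n.
print("training points:", len(X))
print("support vectors:", clf.support_vectors_.shape[0])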
SVM with kernels

• Never compute features explicitly!!!
  – Compute dot products in closed form

• Predict with: (see the sketch after this slide)

• O(n²) time in size of dataset to compute the objective
  – much work on speeding this up
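A minimal sketch of kernelized prediction, assuming the support vectors X_sv, their labels y_sv, dual variables alpha_sv, and bias b are already available (all names are illustrative, and the quadratic kernel is just one choice):

import numpy as np

def quadratic_kernel(u, v):
    # K(u, v) = (1 + u.v)^2, a degree-2 polynomial kernel
    return (1.0 + u @ v) ** 2

def predict(x, X_sv, y_sv, alpha_sv, b, K=quadratic_kernel):
    # Classify x using only kernel evaluations against the support vectors:
    # sign( sum_j alpha_j y_j K(x_j, x) + b )
    s = sum(a * yj * K(xj, x) for a, yj, xj in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)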
Quadratic kernel

[Tommi Jaakkola]
Quadratic kernel

Feature mapping given by:

[Cynthia Rudin]
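The kernel and feature mapping on these slides are images; as a hedged reconstruction, one standard quadratic kernel and its feature mapping for 2-dimensional inputs are:

K(u, v) = (1 + u \cdot v)^2

\Phi(x) = \big( 1, \; \sqrt{2}\,x_1, \; \sqrt{2}\,x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\,x_1 x_2 \big), \quad \text{so that} \quad \Phi(u) \cdot \Phi(v) = (1 + u \cdot v)^2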
Common kernels

• Polynomials of degree exactly d

• Polynomials of degree up to d

• Gaussian kernels (the argument of the exponential is the Euclidean distance, squared; standard forms of all three are written out after this list)

• And many others: very active area of research!
  (e.g., structured kernels that use dynamic programming to evaluate, string kernels, …)
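A hedged reconstruction of the standard forms of the three kernels listed above:

K(u, v) = (u \cdot v)^d \qquad \text{(degree exactly } d\text{)}

K(u, v) = (1 + u \cdot v)^d \qquad \text{(degree up to } d\text{)}

K(u, v) = \exp\!\big( -\|u - v\|^2 / 2\sigma^2 \big) \qquad \text{(Gaussian)}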
Gaussian kernel

[Figure: decision surface of a Gaussian-kernel SVM, showing level sets (i.e. w.x = r for some r) and the support vectors]

[Cynthia Rudin] [mblondel.org]


Kernel algebra

Q: How would you prove that the “Gaussian kernel” is a valid kernel?

A: Expand the Euclidean norm as follows:

Then, apply (e) from above.

To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c). The feature mapping is infinite dimensional!

[Justin Domke]
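As a hedged sketch of the expansion being described (taking σ = 1 for simplicity):

\exp\!\big( -\tfrac{1}{2}\|u - v\|^2 \big) = \exp\!\big( -\tfrac{1}{2}\|u\|^2 \big) \exp\!\big( -\tfrac{1}{2}\|v\|^2 \big) \exp(u \cdot v) = \exp\!\big( -\tfrac{1}{2}\|u\|^2 \big) \exp\!\big( -\tfrac{1}{2}\|v\|^2 \big) \sum_{k=0}^{\infty} \frac{(u \cdot v)^k}{k!}

Each term (u \cdot v)^k is a polynomial kernel, and the closure rules let us scale, sum, and take limits of kernels, so the Gaussian is a valid kernel whose feature mapping is infinite dimensional.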
Overfitting?

• Huge feature space with kernels: should we worry about overfitting?
  – SVM objective seeks a solution with large margin
    • Theory says that large margin leads to good generalization (we will see this in a couple of lectures)
  – But everything overfits sometimes!!!
  – Can control by (see the sketch after this list):
    • Setting C
    • Choosing a better kernel
    • Varying parameters of the kernel (width of Gaussian, etc.)
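One common way to tune these knobs in practice, sketched with scikit-learn (the data and parameter grid are arbitrary, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data; in practice, use the actual training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Search over C and the Gaussian-kernel width (gamma is roughly 1 / (2 sigma^2)),
# keeping the combination with the best cross-validated accuracy.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)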
