
Support Vector Machines & Kernels

Lecture 5

David Sontag
New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin


Multi-class SVM
As with the binary SVM, we introduce slack variables and maximize the margin:

To predict, we use:

Now can we learn it? 
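The slide's equations (images in the original deck) are not reproduced above. As a sketch, one standard way to write a multi-class SVM with slack variables — assuming one weight vector w_y per class and a shared slack per example, which may differ in detail from the slide — is:

\min_{\{w_y\},\,\xi} \;\; \frac{1}{2}\sum_{y} \|w_y\|^2 \;+\; C \sum_{j} \xi_j
\quad\text{s.t.}\quad w_{y_j} \cdot x_j \;\ge\; w_{y'} \cdot x_j + 1 - \xi_j \;\;\; \forall j,\; \forall y' \ne y_j, \qquad \xi_j \ge 0

with prediction \hat{y}(x) = \arg\max_{y} \; w_y \cdot x.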


How to deal with imbalanced data?

•  In many practical applications we may have imbalanced data sets
•  We may want errors to be equally distributed
between the positive and negative classes
•  A slight modification to the SVM objective
does the trick!

Class-specific weighting of the slack variables
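The modified objective is not shown on the slide; a plausible form (an assumption, using separate slack penalties C_+ and C_- for the two classes) is:

\min_{w,b,\xi} \;\; \frac{1}{2}\|w\|^2 \;+\; C_{+} \sum_{j:\, y_j = +1} \xi_j \;+\; C_{-} \sum_{j:\, y_j = -1} \xi_j

subject to the usual constraints y_j (w \cdot x_j + b) \ge 1 - \xi_j and \xi_j \ge 0. Choosing C_+ and C_- inversely proportional to the class sizes spreads the errors more evenly between the positive and negative classes.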


What’s Next!
•  Learn one of the most interesting and
exciting recent advancements in machine
learning
–  The “kernel trick”
–  High dimensional feature spaces at no extra
cost!
•  But first, a detour
–  Constrained optimization!
Constrained optimization

[Three plots of minimizing x²:]
No constraint: x* = 0
Constraint x ≥ -1: x* = 0
Constraint x ≥ 1: x* = 1

How do we solve with constraints?


 Lagrange Multipliers!!!
Lagrange multipliers – Dual variables
Constraint: x ≥ b. Rewrite it as x − b ≥ 0 and add a Lagrange multiplier α ≥ 0.
Introduce the Lagrangian (objective):
L(x, α) = x² − α(x − b)
We will solve:
min_x max_{α ≥ 0} L(x, α)
Why is this equivalent?
•  min is fighting max!
   x < b  ⇒  (x − b) < 0  ⇒  max_{α ≥ 0} −α(x − b) = ∞
   min won't let this happen!
•  x > b, α ≥ 0 (the new constraint)  ⇒  (x − b) > 0  ⇒  max_{α ≥ 0} −α(x − b) = 0, with α* = 0
   min is cool with 0, and L(x, α) = x² (the original objective)
•  x = b  ⇒  α can be anything, and L(x, α) = x² (the original objective)

The min on the outside forces max to behave, so constraints will be satisfied.
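As a small worked instance (my own illustration, not from the slides), take b > 0:

\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x = \frac{\alpha}{2},
\qquad
g(\alpha) = \min_x L(x, \alpha) = \alpha b - \frac{\alpha^2}{4},
\qquad
\max_{\alpha \ge 0} g(\alpha) \;\Rightarrow\; \alpha^* = 2b,\;\; x^* = b

which is exactly the constrained optimum; for b ≤ 0 the unconstrained optimum is already feasible, so α* = 0 and x* = 0 as before.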
Dual SVM derivation (1) – the linearly
separable case

Original optimization problem:

Rewrite the constraints, with one Lagrange multiplier per example.

Lagrangian:

Our goal now is to solve:
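The equations themselves are omitted above; the standard linearly separable (hard-margin) SVM they refer to is, as a sketch:

\min_{w,b} \;\; \frac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_j (w \cdot x_j + b) - 1 \ge 0 \;\;\forall j

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_j \alpha_j \big[ y_j (w \cdot x_j + b) - 1 \big], \qquad \alpha_j \ge 0

\min_{w,b} \; \max_{\alpha \ge 0} \; L(w, b, \alpha)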


Dual SVM derivation (2) – the linearly
separable case

(Primal)

Swap min and max

(Dual)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!
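Written out (a sketch of the step the slide illustrates):

(Primal) \;\; \min_{w,b} \; \max_{\alpha \ge 0} \; L(w, b, \alpha)
\quad\longrightarrow\quad
(Dual) \;\; \max_{\alpha \ge 0} \; \min_{w,b} \; L(w, b, \alpha)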
 
Dual SVM derivation (3) – the linearly separable case

φ(x) = [ x^(1), …, x^(n), x^(1) x^(2), x^(1) x^(3), …, e^(x^(1)), … ]

(Dual)

Can solve for the optimal w, b as a function of α:

∂L/∂w = w − Σ_j α_j y_j x_j   (set to zero ⇒ w = Σ_j α_j y_j x_j)

Substituting these values back in (and simplifying), we obtain the dual problem:

(Dual)

The sums run over all training examples, the α_i α_j y_i y_j terms are scalars, and the training examples enter only through the dot product x_i · x_j.
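For reference, the standard dual of the hard-margin SVM (which is what the omitted slide formula should correspond to) is:

\max_{\alpha \ge 0} \;\; \sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad\text{s.t.}\quad \sum_i \alpha_i y_i = 0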


 
So, in dual formulation we will solve for α directly!


•  w and b are computed from α (if needed)
Dual SVM derivation (3) – the linearly
separable case
Lagrangian:

αj > 0 for some j implies the corresponding constraint is tight (complementary slackness). We use this to obtain b:

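The three numbered steps on the slide are not reproduced; reconstructed in the standard way (a sketch, using a support vector j with α_j > 0):

(1) \;\; y_j (w \cdot x_j + b) = 1
(2) \;\; b = y_j - w \cdot x_j
(3) \;\; b = y_j - \sum_i \alpha_i y_i \, (x_i \cdot x_j)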
Dual for the non-separable case – same basic
story (we will skip details)

Primal: Solve for w, b, ξ:

Dual:

What changed?
•  Added upper bound of C on αi!
•  Intuitive explanation:
   •  Without slack, αi → ∞ when constraints are violated (points misclassified)
   •  Upper bound of C limits the αi, so misclassifications are allowed
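For reference, the soft-margin primal and its dual (standard forms; the slide's own notation may differ slightly):

Primal: \;\; \min_{w,b,\xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_j \xi_j
\quad\text{s.t.}\quad y_j (w \cdot x_j + b) \ge 1 - \xi_j, \;\; \xi_j \ge 0

Dual: \;\; \max_{\alpha} \;\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0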
Wait a minute: why did we learn about the dual
SVM?

•  There are some quadratic programming algorithms that can solve the dual faster than the primal
–  At least for small datasets

•  But, more importantly, the “kernel trick”!!!


Reminder: What if the data is not
linearly separable?
Use features of features
of features of features….
 
φ(x) = [ x^(1), …, x^(n), x^(1) x^(2), x^(1) x^(3), …, e^(x^(1)), … ]

Feature space can get really large really quickly!


Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]

m – number of input features
d – degree of polynomial

The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
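A quick sanity check on that count (my own snippet, assuming the plot counts monomials of degree exactly d in m variables, i.e. C(m + d − 1, d)):

# Count monomials of degree exactly d in m variables: C(m + d - 1, d)
import math

def num_monomials(m: int, d: int) -> int:
    return math.comb(m + d - 1, d)

print(num_monomials(100, 6))  # 1609344100 -- about 1.6 billion, matching the slide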
Dual formulation only depends on
dot-products, not on w!

First, we introduce features. (Remember: the examples x only appear in one dot product.)

Next, replace the dot product with a kernel:

Why is this useful???
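Spelling out the replacement (standard form; the slide shows it as an image): with K(x_i, x_j) = φ(x_i) · φ(x_j), the dual becomes

\max_{\alpha} \;\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0

so training never needs φ(x) explicitly — only kernel values.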



Efficient dot-product of polynomials
Polynomials of degree exactly d

d = 1:
φ(u) · φ(v) = [u1, u2] · [v1, v2] = u1 v1 + u2 v2 = u · v

d = 2:
φ(u) · φ(v) = [u1², u1 u2, u2 u1, u2²] · [v1², v1 v2, v2 v1, v2²]
            = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
            = (u1 v1 + u2 v2)²
            = (u · v)²

For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d

•  Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product
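A quick numerical check of the d = 2 identity (my own snippet; the explicit feature-map ordering is one of several equivalent choices):

# Verify: phi(u) . phi(v) == (u . v)**2 for the degree-2 monomial feature map
import numpy as np

def phi2(x):
    # All degree-2 monomials of a 2-d vector: [x1^2, x1*x2, x2*x1, x2^2]
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi2(u) @ phi2(v))   # 1.0
print((u @ v) ** 2)        # 1.0 -- same value, without ever forming phi explicitly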
Finally: the “kernel trick”!

•  Never compute features explicitly!!!
–  Compute dot products in closed form
•  Constant-time high-dimensional dot-products for many classes of features
•  But, O(n²) time in size of dataset to compute the objective
–  Naïve implementations are slow
–  Much work on speeding this up
Common kernels
•  Polynomials of degree exactly d

•  Polynomials of degree up to d

•  Gaussian kernels

•  Sigmoid

•  And many others: very active area of research!
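The kernel formulas themselves are omitted above; the usual forms (standard definitions, with my own notation for the hyperparameters) are:

Polynomial of degree exactly d: \; K(u, v) = (u \cdot v)^d
Polynomial of degree up to d:  \; K(u, v) = (u \cdot v + 1)^d
Gaussian (RBF):                \; K(u, v) = \exp\!\big( -\|u - v\|^2 / 2\sigma^2 \big)
Sigmoid:                       \; K(u, v) = \tanh(\eta \, u \cdot v + \nu)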
