
Support Vector Machines

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore                                    Nov 23rd, 2001



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 2



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 3



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 4



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 5



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: the same data; one marker denotes +1, the other denotes -1.)

Any of these would be fine..
..but which is best?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 6



Classifier Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 7



Maximum Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).

Linear SVM
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 8

Maximum Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.

Linear SVM
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 9
Why Maximum Margin?
f(x,w,b) = sign(w . x - b)
(Figure: the maximum margin classifier; one marker denotes +1, the other denotes -1; the Support Vectors are the datapoints that the margin pushes up against.)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 10
Specifying a line and margin
(Figure: the Plus-Plane, the Classifier Boundary and the Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• How do we represent this mathematically?
• ...in m input dimensions?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 11


Specifying a line and margin
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as..   +1                 if  w . x + b >= 1
                -1                 if  w . x + b <= -1
                Universe explodes  if  -1 < w . x + b < 1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 12
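To make the classification rule above concrete, here is a minimal numpy sketch of the plus-plane / minus-plane convention on this slide; the weight vector, offset and test points are made-up values, not anything from the slides:

```python
import numpy as np

def classify(w, b, x):
    """Classify x against the planes w.x + b = +1 and w.x + b = -1.

    Returns +1, -1, or None for a point strictly inside the margin
    (the "universe explodes" zone on the slide).
    """
    score = np.dot(w, x) + b
    if score >= 1:
        return +1
    if score <= -1:
        return -1
    return None  # inside the margin: no prediction under this rule

# Hypothetical 2-d example values.
w = np.array([2.0, -1.0])
b = -0.5
print(classify(w, b, np.array([1.5, 0.0])))   # lands on the +1 side
print(classify(w, b, np.array([-1.0, 1.0])))  # lands on the -1 side
```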
Computing the margin width
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 13


Computing the margin width
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Let u and v be two vectors on the Plus Plane. What is w . ( u - v ) ?

And so of course the vector w is also perpendicular to the Minus Plane.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 14
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x+ be the closest plus-plane-point to x-.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 15


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 16


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in direction w.

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 17


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

It's now easy to get M in terms of w and b.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 18
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

    w . (x- + λw) + b = 1
=>  w . x- + b + λ w.w = 1
=>  -1 + λ w.w = 1
=>  λ = 2 / (w.w)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 19
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width = 2 / sqrt(w.w).)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
• λ = 2 / (w.w)

M = |x+ - x-| = |λw| = λ|w| = λ sqrt(w.w)
  = 2 sqrt(w.w) / (w.w) = 2 / sqrt(w.w)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 20
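To make the final formula concrete, a small numpy sketch that computes the margin width from a given w (the particular numbers are illustrative, not from the slides):

```python
import numpy as np

def margin_width(w):
    """M = 2 / sqrt(w . w): the distance between the plus- and minus-planes."""
    return 2.0 / np.sqrt(np.dot(w, w))

w = np.array([3.0, 4.0])      # |w| = 5, chosen just for the example
print(margin_width(w))        # 0.4, i.e. 2 / 5
```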
Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = Margin Width = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 21
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 22


Quadratic Programming

Find  arg max_u  c + d^T u + (u^T R u) / 2          (quadratic criterion)

Subject to n additional linear inequality constraints:
    a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
    a_21 u_1 + a_22 u_2 + ... + a_2m u_m <= b_2
      :
    a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n

And subject to e additional linear equality constraints:
    a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
    a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
      :
    a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 23
Quadratic Programming

Find  arg max_u  c + d^T u + (u^T R u) / 2          (quadratic criterion)

Subject to n additional linear inequality constraints and e additional linear equality constraints, as on the previous slide.

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they're very fiddly... you probably don't want to write one yourself.)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 24
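In the spirit of "don't write your own QP solver", here is a hedged sketch of handing a problem in the slide's form to an off-the-shelf package. It assumes the cvxopt library (an assumption of this example; any QP package would do), whose solvers.qp routine minimizes (1/2) u'Pu + q'u subject to Gu <= h and Au = b, so the slide's maximization of c + d'u + u'Ru/2 is passed in with signs flipped (P = -R, q = -d). The matrices below are tiny made-up numbers, not an SVM yet:

```python
from cvxopt import matrix, solvers
import numpy as np

# Slide's form: maximize c + d'u + u'Ru/2 subject to linear constraints.
# cvxopt minimizes (1/2) u'Pu + q'u, so pass P = -R and q = -d.
R = np.array([[-1.0, 0.0], [0.0, -1.0]])   # must be negative semidefinite for a bounded max
d = np.array([1.0, 1.0])

P = matrix(-R)
q = matrix(-d)
# Inequality constraints G u <= h:  u1 + u2 <= 1,  -u1 <= 0,  -u2 <= 0
G = matrix(np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]))
h = matrix(np.array([1.0, 0.0, 0.0]))

sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).ravel())   # optimizer of this toy problem, roughly (0.5, 0.5)
```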
Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 25


Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize w.w

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1   if y_k =  1
    w . x_k + b <= -1   if y_k = -1

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 26
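A hedged sketch of this primal problem, using scipy's general-purpose SLSQP optimizer rather than a dedicated QP package (an assumption of this example; the toy dataset is made up). The two constraint families on the slide are folded into the single equivalent form y_k (w . x_k + b) >= 1:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = X.shape[1]

def objective(params):
    w = params[:m]
    return np.dot(w, w)                     # minimize w . w

def margin_constraints(params):
    w, b = params[:m], params[m]
    # One constraint per datapoint: y_k (w . x_k + b) - 1 >= 0
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(m + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:m], res.x[m]
print("w =", w, "b =", b, "margin =", 2.0 / np.sqrt(w @ w))
```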


This is going to be a problem!
(Figure: a dataset that is not linearly separable; one marker denotes +1, the other denotes -1.)
Uh-oh! What should we do?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 27


This is going to be a problem!
Uh-oh! What should we do?

Idea 1:
Find minimum w.w, while minimizing number of training set errors.
Problemette: Two things to minimize makes for an ill-defined optimization.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 28


This is going to be a problem!
Uh-oh! What should we do?

Idea 1.1:
Minimize  w.w + C (#train errors)
          where C is a tradeoff parameter

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 29
This is going to be a problem!
Uh-oh! What should we do?

Idea 1.1:
Minimize  w.w + C (#train errors)
          where C is a tradeoff parameter

Can't be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, doesn't distinguish between disastrous errors and near misses.)

So... any other ideas?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 30
This is going to be a problem!
Uh-oh! What should we do?

Idea 2.0:
Minimize  w.w + C (distance of error points to their correct place)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 31


Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 32


Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w); points on the wrong side are marked with their distances ε_k to their correct zones.)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 33
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)
m = # input dimensions;  R = # records

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Our original (noiseless data) QP had m+1 variables: w_1, w_2, ... w_m, and b.
Our new (noisy data) QP has m+1+R variables: w_1, w_2, ... w_m, b, ε_1, ... ε_R.

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 34
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1

There's a bug in this QP. Can you spot it?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 35
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? 2R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
    ε_k >= 0                  for all k
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 36
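A sketch of the same soft-margin QP, again via scipy's SLSQP purely for illustration (the data and C below are made up). The variables are (w, b, ε_1..ε_R), i.e. m+1+R of them, and the two constraint families above give 2R constraints:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, slightly noisy toy data; the last point sits on the "wrong" side.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
R, m = X.shape
C = 1.0

def objective(p):
    w, eps = p[:m], p[m + 1:]
    return 0.5 * np.dot(w, w) + C * np.sum(eps)   # (1/2) w.w + C * sum of slacks

def margin_constraints(p):
    w, b, eps = p[:m], p[m], p[m + 1:]
    return y * (X @ w + b) - 1.0 + eps            # >= 0, one per datapoint

def slack_constraints(p):
    return p[m + 1:]                              # eps_k >= 0, one per datapoint

res = minimize(objective, x0=np.zeros(m + 1 + R), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints},
                            {"type": "ineq", "fun": slack_constraints}])
w, b, eps = res.x[:m], res.x[m], res.x[m + 1:]
print("w =", w, "b =", b, "slacks =", np.round(eps, 3))
```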
An Equivalent QP
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k=1..R} α_k y_k x_k
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 37
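Once a QP package has returned the α's, the "Then define" step above is just a couple of weighted sums. A minimal numpy sketch; the α, y, X values here are placeholders standing in for whatever the solver returned:

```python
import numpy as np

def recover_w_and_b(alpha, y, X, eps=None):
    """Build w and b from the dual solution, following the slide:
    w = sum_k alpha_k y_k x_k
    b = y_K (1 - eps_K) - x_K . w, where K = argmax_k alpha_k
    (eps defaults to zero, i.e. the separable case)."""
    if eps is None:
        eps = np.zeros_like(alpha)
    w = (alpha * y) @ X                  # weighted sum over datapoints
    K = int(np.argmax(alpha))
    b = y[K] * (1.0 - eps[K]) - X[K] @ w
    return w, b

# Placeholder values standing in for a solver's output.
alpha = np.array([0.3, 0.0, 0.3, 0.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
w, b = recover_w_and_b(alpha, y, X)
print("w =", w, "b =", b)
print("classify per the slide, sign(w.x - b):", np.sign(X @ w - b))
```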


An Equivalent QP
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k=1..R} α_k y_k x_k          Datapoints with α_k > 0 will be the support vectors
    b = y_K (1 - ε_K) - x_K . w          ..so this sum only needs to be over the support vectors.
       where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 38


An Equivalent QP

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you're about to learn.

Then define:
    w = Σ_{k=1..R} α_k y_k x_k          Datapoints with α_k > 0 will be the support vectors
    b = y_K (1 - ε_K) - x_K . w          ..so this sum only needs to be over the support vectors.
       where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 39


Suppose we're in 1-dimension

What would SVMs do with this data?
(Figure: a linearly separable 1-d dataset on a line, with x = 0 marked.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 40


Suppose we're in 1-dimension

Not a big surprise
(Figure: the maximum margin separation, with the positive "plane" and negative "plane" marked around x = 0.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 41


Harder 1-dimensional dataset

That's wiped the smirk off SVM's face.
What can be done about this?
(Figure: a 1-d dataset centered at x = 0 that is not linearly separable.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 42


Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too:

    z_k = ( x_k , x_k^2 )

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 43


Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too:

    z_k = ( x_k , x_k^2 )

(Figure: the same data plotted in the (x, x^2) plane, where it becomes linearly separable.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 44
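A tiny numpy illustration of this lift: the made-up 1-d dataset below is not linearly separable, but after mapping x → (x, x^2) a single linear rule (here, a threshold on x^2; the particular w and b are ours, not from the slides) separates the classes:

```python
import numpy as np

# Made-up 1-d data: negatives near the origin, positives far from it.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# The basis expansion from the slide: z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space a linear rule works, e.g. w = (0, 1), b = -2,
# i.e. "predict +1 when x^2 > 2" (illustrative values only).
w, b = np.array([0.0, 1.0]), -2.0
print(np.sign(Z @ w + b) == y)   # all True for this toy data
```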


Common SVM basis functions

z_k = ( polynomial terms of x_k of degree 1 to q )
z_k = ( radial basis functions of x_k )
      where  z_k[j] = φ_j(x_k) = KernelFn( |x_k - c_j| / KW )
z_k = ( sigmoid functions of x_k )

This is sensible. Is that the end of the story?
No... there's one more trick!
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 45



Quadratic Basis Functions

Φ(x) = (  1,                                          Constant Term
          √2 x_1, √2 x_2, ..., √2 x_m,                Linear Terms
          x_1^2, x_2^2, ..., x_m^2,                   Pure Quadratic Terms
          √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_1 x_m,
          √2 x_2 x_3, ..., √2 x_(m-1) x_m )           Quadratic Cross-Terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m^2 / 2

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 46
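A direct transcription of Φ(x) in numpy, useful mainly to count terms and to check the kernel identity coming up on the next few slides (the term ordering follows the slide; the sample input is made up):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis expansion from the slide:
    constant, sqrt(2)*linear, pure quadratic, sqrt(2)*cross terms."""
    x = np.asarray(x, dtype=float)
    const = [1.0]
    linear = list(np.sqrt(2.0) * x)
    pure_quad = list(x ** 2)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(const + linear + pure_quad + cross)

m = 4
x = np.arange(1.0, m + 1)                     # made-up input vector
print(len(phi(x)), (m + 2) * (m + 1) // 2)    # both 15: (m+2)(m+1)/2 terms
```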
QP with basis functions
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 47


QP with basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications. The whole thing costs R^2 m^2 / 4. Yeeks!
...or does it?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 48


 1   1 
Quadratic Dot 
 2a1  
 
2b1 
 1
+
Products 

2 a2  
 
2b2 
 m

 :   :   2a b i i
 2 a   2 b  i 1
 m   m 
 a12
  b1 2
 +
 2   2 
 a 2   b 2  m

 :   :   i bi
a 2 2

 2
  2
 i 1

 a m   b m 
Φ(a)  Φ(b)   
2a1a2   2b1b2 
   
 2a1a3   2b1b3  +
 :   : 
   
 2a1am   2b1bm 
    m m

 2 a a
2 3   2 b b
2 3    2a a b b i j i j
 :   :  i 1 j i 1
   
 2 a a
1 m   2b b
1 m 
 :   : 
   
 2am 1am   2bm 1bm 
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 49
Quadratic Dot Products
Just out of casual, innocent, interest, let's look at another function of a and b:

(a.b + 1)^2 = (a.b)^2 + 2 a.b + 1
            = ( Σ_{i=1..m} a_i b_i )^2 + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} (a_i b_i)^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1

while  Φ(a) . Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i^2 b_i^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 50


Quadratic Dot Products
Just out of casual, innocent, interest, let's look at another function of a and b:

(a.b + 1)^2 = (a.b)^2 + 2 a.b + 1
            = ( Σ_{i=1..m} a_i b_i )^2 + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} (a_i b_i)^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1

while  Φ(a) . Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i^2 b_i^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

They're the same!
And this is only O(m) to compute!
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 51
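A quick numerical check of the identity, written to be self-contained (the quadratic expansion is re-stated inline and the test vectors are random made-up values):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic expansion: (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)   # arbitrary test vectors

lhs = phi(a) @ phi(b)          # explicit dot product over ~m^2/2 terms
rhs = (a @ b + 1.0) ** 2       # the kernel shortcut: only O(m) work
print(np.allclose(lhs, rhs))   # True
```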
QP with Quadratic basis functions
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. Each dot product now requires only m additions and multiplications.

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 52


Higher Order Polynomials

Polynomial | Φ(x)                            | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | Φ(a).Φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic  | All m^2/2 terms up to degree 2  | m^2 R^2 / 4                             | 2,500 R^2          | (a.b+1)^2 | m R^2 / 2                          | 50 R^2
Cubic      | All m^3/6 terms up to degree 3  | m^3 R^2 / 12                            | 83,000 R^2         | (a.b+1)^3 | m R^2 / 2                          | 50 R^2
Quartic    | All m^4/24 terms up to degree 4 | m^4 R^2 / 48                            | 1,960,000 R^2      | (a.b+1)^4 | m R^2 / 2                          | 50 R^2

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 53


QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 54
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 55
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 56
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 57
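The evaluation trick above, written out: rather than ever forming w in the huge quintic feature space, sum the kernel over the support vectors. A hedged numpy sketch with placeholder support-vector data (in practice these values come from the QP solution):

```python
import numpy as np

def predict(x, sv_x, sv_y, sv_alpha, b, degree=5):
    """f(x) = sign( sum_k alpha_k y_k (x_k . x + 1)^degree - b ),
    summing only over the S support vectors: O(S*m) work."""
    k = (sv_x @ x + 1.0) ** degree          # kernel value against each support vector
    return np.sign(np.sum(sv_alpha * sv_y * k) - b)

# Placeholder support vectors, multipliers and offset.
sv_x = np.array([[1.0, 2.0], [-1.5, 0.5], [0.5, -2.0]])
sv_y = np.array([1.0, -1.0, -1.0])
sv_alpha = np.array([0.4, 0.3, 0.1])
b = 0.2

print(predict(np.array([0.8, 1.9]), sv_x, sv_y, sv_alpha, b))
```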
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)

When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long."
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 58
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

Andrew's opinion of why SVMs don't overfit as much as you'd think:
No matter what the basis function, there are really only up to R parameters: α_1, α_2, ..., α_R, and usually most are set to zero by the Maximum Margin.
Asking for small w.w is like "weight decay" in Neural Nets and like Ridge Regression parameters in Linear regression and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 59
SVM Kernel Functions
• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
      K(a,b) = exp( - (a - b)^2 / (2 σ^2) )
• Neural-net-style Kernel Function:
      K(a,b) = tanh( κ a.b - δ )

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 60
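The three kernels on this slide, written as plain functions; d, σ, κ and δ are the "magic parameters" the slide says must come from model selection, and the default values and test vectors below are arbitrary:

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """K(a,b) = (a . b + 1)^d"""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """K(a,b) = exp( -(a - b)^2 / (2 sigma^2) ), with (a - b)^2 the squared distance."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """K(a,b) = tanh( kappa * a.b - delta )"""
    return np.tanh(kappa * np.dot(a, b) - delta)

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # made-up inputs
print(poly_kernel(a, b), rbf_kernel(a, b), neural_net_kernel(a, b))
```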


VC-dimension of an SVM
• Very very very loosely speaking there is some theory which under some different assumptions puts an upper bound on the VC dimension as
      (Diameter / Margin)^2
• where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set.
  • Margin is the smallest margin we'll let the SVM use
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 61


SVM Performance
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• Another Example: Andrew knows several reliable
people doing practical real-world work who claim
that SVMs have saved them when their other
favorite classifiers did poorly.
• There is a lot of excitement and religious fervor
about SVMs as of 2001.
• Despite this, some practitioners (including your
lecturer) are a little skeptical.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 62
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVM’s
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
• :
• SVM N learns “Output==N” vs “Output != N”
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 63
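A sketch of that one-versus-rest recipe. It leans on scikit-learn's SVC as the underlying two-class SVM (an assumption of this example, not something the slides prescribe); each binary SVM scores a new input with its decision_function, and the class whose SVM pushes the prediction furthest into the positive region wins:

```python
import numpy as np
from sklearn.svm import SVC   # off-the-shelf two-class SVM used for illustration

def train_one_vs_rest(X, y, n_classes, **svm_params):
    """Train one SVM per class: "Output == c" vs "Output != c"."""
    machines = []
    for c in range(n_classes):
        labels = np.where(y == c, 1, -1)
        machines.append(SVC(**svm_params).fit(X, labels))
    return machines

def predict_one_vs_rest(machines, X):
    """Pick the class whose SVM puts the point furthest into its positive region."""
    scores = np.column_stack([m.decision_function(X) for m in machines])
    return np.argmax(scores, axis=1)

# Made-up 3-class toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c * 3.0, size=(20, 2)) for c in range(3)])
y = np.repeat(np.arange(3), 20)

machines = train_one_vs_rest(X, y, n_classes=3, kernel="rbf", C=1.0)
print((predict_one_vs_rest(machines, X) == y).mean())   # training accuracy
```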


References
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 64


What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you
don’t need to know how it does it)
• How Maximum Margin can be turned into a QP
problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend
we’re working with ultra-high-dimensional basis-
function terms

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 65
