
Support Vector Machines

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Copyright © 2001, 2003, Andrew W. Moore                                    Nov 23rd, 2001



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 2



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 3



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 4



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 5



Linear Classifiers
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: the same data; one marker denotes +1, the other denotes -1.)

Any of these would be fine..
..but which is best?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 6



Classifier Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 7



Maximum Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).

Linear SVM
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 8

Maximum Margin
x → f(x,w,b) = sign(w . x - b) → y^est
(Figure: a 2-d dataset; one marker denotes +1, the other denotes -1.)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.

Linear SVM
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 9
Why Maximum Margin?
f(x,w,b) = sign(w . x - b)
(Figure: the maximum margin classifier; one marker denotes +1, the other denotes -1; the Support Vectors are the datapoints that the margin pushes up against.)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 10
Specifying a line and margin
(Figure: the Plus-Plane, the Classifier Boundary and the Minus-Plane, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• How do we represent this mathematically?
• ...in m input dimensions?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 11


Specifying a line and margin
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1, separating the "Predict Class = +1" zone from the "Predict Class = -1" zone.)

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as..   +1                 if  w . x + b >= 1
                -1                 if  w . x + b <= -1
                Universe explodes  if  -1 < w . x + b < 1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 12
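To make the classification rule above concrete, here is a minimal numpy sketch of the plus-plane / minus-plane convention on this slide; the weight vector, offset and test points are made-up values, not anything from the slides:

```python
import numpy as np

def classify(w, b, x):
    """Classify x against the planes w.x + b = +1 and w.x + b = -1.

    Returns +1, -1, or None for a point strictly inside the margin
    (the "universe explodes" zone on the slide).
    """
    score = np.dot(w, x) + b
    if score >= 1:
        return +1
    if score <= -1:
        return -1
    return None  # inside the margin: no prediction under this rule

# Hypothetical 2-d example values.
w = np.array([2.0, -1.0])
b = -0.5
print(classify(w, b, np.array([1.5, 0.0])))   # lands on the +1 side
print(classify(w, b, np.array([-1.0, 1.0])))  # lands on the -1 side
```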
Computing the margin width
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 13


Computing the margin width
(Figure: the planes w.x+b = +1, w.x+b = 0 and w.x+b = -1; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Let u and v be two vectors on the Plus Plane. What is w . ( u - v ) ?

And so of course the vector w is also perpendicular to the Minus Plane.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 14
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x+ be the closest plus-plane-point to x-.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 15


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

How do we compute M in terms of w and b?

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint)
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 16


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

The line from x- to x+ is perpendicular to the planes. So to get from x- to x+, travel some distance in direction w.

• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 17


Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

It's now easy to get M in terms of w and b.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 18
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width.)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

    w . (x- + λw) + b = 1
=>  w . x- + b + λ w.w = 1
=>  -1 + λ w.w = 1
=>  λ = 2 / (w.w)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 19
Computing the margin width
(Figure: x+ on the plus-plane and x- on the minus-plane; M = Margin Width = 2 / sqrt(w.w).)

What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
• λ = 2 / (w.w)

M = |x+ - x-| = |λw| = λ|w| = λ sqrt(w.w)
  = 2 sqrt(w.w) / (w.w) = 2 / sqrt(w.w)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 20
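To make the final formula concrete, a small numpy sketch that computes the margin width from a given w (the particular numbers are illustrative, not from the slides):

```python
import numpy as np

def margin_width(w):
    """M = 2 / sqrt(w . w): the distance between the plus- and minus-planes."""
    return 2.0 / np.sqrt(np.dot(w, w))

w = np.array([3.0, 4.0])      # |w| = 5, chosen just for the example
print(margin_width(w))        # 0.4, i.e. 2 / 5
```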
Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = Margin Width = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?

Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 21
Learning via Quadratic Programming
• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 22


Quadratic Programming

Find  arg max_u  c + d^T u + (u^T R u) / 2          (quadratic criterion)

Subject to n additional linear inequality constraints:
    a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
    a_21 u_1 + a_22 u_2 + ... + a_2m u_m <= b_2
      :
    a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n

And subject to e additional linear equality constraints:
    a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
    a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
      :
    a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 23
Quadratic Programming

Find  arg max_u  c + d^T u + (u^T R u) / 2          (quadratic criterion)

Subject to n additional linear inequality constraints and e additional linear equality constraints, as on the previous slide.

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent. (But they're very fiddly... you probably don't want to write one yourself.)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 24
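In the spirit of "don't write your own QP solver", here is a hedged sketch of handing a problem in the slide's form to an off-the-shelf package. It assumes the cvxopt library (an assumption of this example; any QP package would do), whose solvers.qp routine minimizes (1/2) u'Pu + q'u subject to Gu <= h and Au = b, so the slide's maximization of c + d'u + u'Ru/2 is passed in with signs flipped (P = -R, q = -d). The matrices below are tiny made-up numbers, not an SVM yet:

```python
from cvxopt import matrix, solvers
import numpy as np

# Slide's form: maximize c + d'u + u'Ru/2 subject to linear constraints.
# cvxopt minimizes (1/2) u'Pu + q'u, so pass P = -R and q = -d.
R = np.array([[-1.0, 0.0], [0.0, -1.0]])   # must be negative semidefinite for a bounded max
d = np.array([1.0, 1.0])

P = matrix(-R)
q = matrix(-d)
# Inequality constraints G u <= h:  u1 + u2 <= 1,  -u1 <= 0,  -u2 <= 0
G = matrix(np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]))
h = matrix(np.array([1.0, 0.0, 0.0]))

sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).ravel())   # optimizer of this toy problem, roughly (0.5, 0.5)
```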
Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 25


Learning the Maximum Margin Classifier
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize w.w

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1   if y_k =  1
    w . x_k + b <= -1   if y_k = -1

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 26
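A hedged sketch of this primal problem, using scipy's general-purpose SLSQP optimizer rather than a dedicated QP package (an assumption of this example; the toy dataset is made up). The two constraint families on the slide are folded into the single equivalent form y_k (w . x_k + b) >= 1:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = X.shape[1]

def objective(params):
    w = params[:m]
    return np.dot(w, w)                     # minimize w . w

def margin_constraints(params):
    w, b = params[:m], params[m]
    # One constraint per datapoint: y_k (w . x_k + b) - 1 >= 0
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(m + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:m], res.x[m]
print("w =", w, "b =", b, "margin =", 2.0 / np.sqrt(w @ w))
```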


This is going to be a problem!
(Figure: a dataset that is not linearly separable; one marker denotes +1, the other denotes -1.)
Uh-oh! What should we do?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 27


This is going to be a problem!
Uh-oh! What should we do?

Idea 1:
Find minimum w.w, while minimizing number of training set errors.
Problemette: Two things to minimize makes for an ill-defined optimization.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 28


This is going to be a problem!
Uh-oh! What should we do?

Idea 1.1:
Minimize  w.w + C (#train errors)
          where C is a tradeoff parameter

There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 29
This is going to be a problem!
Uh-oh! What should we do?

Idea 1.1:
Minimize  w.w + C (#train errors)
          where C is a tradeoff parameter

Can't be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, doesn't distinguish between disastrous errors and near misses.)

So... any other ideas?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 30
This is going to be a problem!
Uh-oh! What should we do?

Idea 2.0:
Minimize  w.w + C (distance of error points to their correct place)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 31


Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 32


Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w); points on the wrong side are marked with their distances ε_k to their correct zones.)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 33
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)
m = # input dimensions;  R = # records

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Our original (noiseless data) QP had m+1 variables: w_1, w_2, ... w_m, and b.
Our new (noisy data) QP has m+1+R variables: w_1, w_2, ... w_m, b, ε_1, ... ε_R.

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 34
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1

There's a bug in this QP. Can you spot it?
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 35
Learning Maximum Margin with Noise
(Figure: the planes w.x+b = +1, 0, -1; M = 2 / sqrt(w.w).)

Given a guess of w and b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1.

What should our quadratic optimization criterion be?
    Minimize  (1/2) w.w + C Σ_{k=1..R} ε_k

How many constraints will we have? 2R. What should they be?
    w . x_k + b >=  1 - ε_k   if y_k =  1
    w . x_k + b <= -1 + ε_k   if y_k = -1
    ε_k >= 0                  for all k
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 36
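A sketch of the same soft-margin QP, again via scipy's SLSQP purely for illustration (the data and C below are made up). The variables are (w, b, ε_1..ε_R), i.e. m+1+R of them, and the two constraint families above give 2R constraints:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, slightly noisy toy data; the last point sits on the "wrong" side.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
R, m = X.shape
C = 1.0

def objective(p):
    w, eps = p[:m], p[m + 1:]
    return 0.5 * np.dot(w, w) + C * np.sum(eps)   # (1/2) w.w + C * sum of slacks

def margin_constraints(p):
    w, b, eps = p[:m], p[m], p[m + 1:]
    return y * (X @ w + b) - 1.0 + eps            # >= 0, one per datapoint

def slack_constraints(p):
    return p[m + 1:]                              # eps_k >= 0, one per datapoint

res = minimize(objective, x0=np.zeros(m + 1 + R), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints},
                            {"type": "ineq", "fun": slack_constraints}])
w, b, eps = res.x[:m], res.x[m], res.x[m + 1:]
print("w =", w, "b =", b, "slacks =", np.round(eps, 3))
```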
An Equivalent QP
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k=1..R} α_k y_k x_k
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 37
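Once a QP package has returned the α's, the "Then define" step above is just a couple of weighted sums. A minimal numpy sketch; the α, y, X values here are placeholders standing in for whatever the solver returned:

```python
import numpy as np

def recover_w_and_b(alpha, y, X, eps=None):
    """Build w and b from the dual solution, following the slide:
    w = sum_k alpha_k y_k x_k
    b = y_K (1 - eps_K) - x_K . w, where K = argmax_k alpha_k
    (eps defaults to zero, i.e. the separable case)."""
    if eps is None:
        eps = np.zeros_like(alpha)
    w = (alpha * y) @ X                  # weighted sum over datapoints
    K = int(np.argmax(alpha))
    b = y[K] * (1.0 - eps[K]) - X[K] @ w
    return w, b

# Placeholder values standing in for a solver's output.
alpha = np.array([0.3, 0.0, 0.3, 0.0])
y = np.array([1.0, 1.0, -1.0, -1.0])
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
w, b = recover_w_and_b(alpha, y, X)
print("w =", w, "b =", b)
print("classify per the slide, sign(w.x - b):", np.sign(X @ w - b))
```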


An Equivalent QP
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k=1..R} α_k y_k x_k          Datapoints with α_k > 0 will be the support vectors
    b = y_K (1 - ε_K) - x_K . w          ..so this sum only needs to be over the support vectors.
       where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 38


An Equivalent QP

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you're about to learn.

Then define:
    w = Σ_{k=1..R} α_k y_k x_k          Datapoints with α_k > 0 will be the support vectors
    b = y_K (1 - ε_K) - x_K . w          ..so this sum only needs to be over the support vectors.
       where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . x - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 39


Suppose we're in 1-dimension

What would SVMs do with this data?
(Figure: a linearly separable 1-d dataset on a line, with x = 0 marked.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 40


Suppose we're in 1-dimension

Not a big surprise
(Figure: the maximum margin separation, with the positive "plane" and negative "plane" marked around x = 0.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 41


Harder 1-dimensional dataset

That's wiped the smirk off SVM's face.
What can be done about this?
(Figure: a 1-d dataset centered at x = 0 that is not linearly separable.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 42


Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too:

    z_k = ( x_k , x_k^2 )

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 43


Harder 1-dimensional dataset

Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too:

    z_k = ( x_k , x_k^2 )

(Figure: the same data plotted in the (x, x^2) plane, where it becomes linearly separable.)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 44
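A tiny numpy illustration of this lift: the made-up 1-d dataset below is not linearly separable, but after mapping x → (x, x^2) a single linear rule (here, a threshold on x^2; the particular w and b are ours, not from the slides) separates the classes:

```python
import numpy as np

# Made-up 1-d data: negatives near the origin, positives far from it.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.6, 3.1])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# The basis expansion from the slide: z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

# In the lifted space a linear rule works, e.g. w = (0, 1), b = -2,
# i.e. "predict +1 when x^2 > 2" (illustrative values only).
w, b = np.array([0.0, 1.0]), -2.0
print(np.sign(Z @ w + b) == y)   # all True for this toy data
```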


Common SVM basis functions

z_k = ( polynomial terms of x_k of degree 1 to q )
z_k = ( radial basis functions of x_k )
      where  z_k[j] = φ_j(x_k) = KernelFn( |x_k - c_j| / KW )
z_k = ( sigmoid functions of x_k )

This is sensible. Is that the end of the story?
No... there's one more trick!
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 45



Quadratic Basis Functions

Φ(x) = (  1,                                          Constant Term
          √2 x_1, √2 x_2, ..., √2 x_m,                Linear Terms
          x_1^2, x_2^2, ..., x_m^2,                   Pure Quadratic Terms
          √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_1 x_m,
          √2 x_2 x_3, ..., √2 x_(m-1) x_m )           Quadratic Cross-Terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m^2 / 2

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 46
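A direct transcription of Φ(x) in numpy, useful mainly to count terms and to check the kernel identity coming up on the next few slides (the term ordering follows the slide; the sample input is made up):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis expansion from the slide:
    constant, sqrt(2)*linear, pure quadratic, sqrt(2)*cross terms."""
    x = np.asarray(x, dtype=float)
    const = [1.0]
    linear = list(np.sqrt(2.0) * x)
    pure_quad = list(x ** 2)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(const + linear + pure_quad + cross)

m = 4
x = np.arange(1.0, m + 1)                     # made-up input vector
print(len(phi(x)), (m + 2) * (m + 1) // 2)    # both 15: (m+2)(m+1)/2 terms
```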
QP with basis functions
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 47


QP with basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications. The whole thing costs R^2 m^2 / 4. Yeeks!
...or does it?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 48


 1   1 
Quadratic Dot 
 2a1  
 
2b1 
 1
+
Products 

2 a2  
 
2b2 
 m

 :   :   2a b i i
 2 a   2 b  i 1
 m   m 
 a12
  b1 2
 +
 2   2 
 a 2   b 2  m

 :   :   i bi
a 2 2

 2
  2
 i 1

 a m   b m 
Φ(a)  Φ(b)   
2a1a2   2b1b2 
   
 2a1a3   2b1b3  +
 :   : 
   
 2a1am   2b1bm 
    m m

 2 a a
2 3   2 b b
2 3    2a a b b i j i j
 :   :  i 1 j i 1
   
 2 a a
1 m   2b b
1 m 
 :   : 
   
 2am 1am   2bm 1bm 
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 49
Quadratic Dot Products
Just out of casual, innocent, interest, let's look at another function of a and b:

(a.b + 1)^2 = (a.b)^2 + 2 a.b + 1
            = ( Σ_{i=1..m} a_i b_i )^2 + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} (a_i b_i)^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1

while  Φ(a) . Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i^2 b_i^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 50


Quadratic Dot Products
Just out of casual, innocent, interest, let's look at another function of a and b:

(a.b + 1)^2 = (a.b)^2 + 2 a.b + 1
            = ( Σ_{i=1..m} a_i b_i )^2 + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
            = Σ_{i=1..m} (a_i b_i)^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1

while  Φ(a) . Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i^2 b_i^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

They're the same!
And this is only O(m) to compute!
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 51
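A quick numerical check of the identity, written to be self-contained (the quadratic expansion is re-stated inline and the test vectors are random made-up values):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic expansion: (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)   # arbitrary test vectors

lhs = phi(a) @ phi(b)          # explicit dot product over ~m^2/2 terms
rhs = (a @ b + 1.0) ** 2       # the kernel shortcut: only O(m) work
print(np.allclose(lhs, rhs))   # True
```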
QP with Quadratic basis functions
Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. Each dot product now requires only m additions and multiplications.

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 52


Higher Order Polynomials

Polynomial | Φ(x)                            | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | Φ(a).Φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic  | All m^2/2 terms up to degree 2  | m^2 R^2 / 4                             | 2,500 R^2          | (a.b+1)^2 | m R^2 / 2                          | 50 R^2
Cubic      | All m^3/6 terms up to degree 3  | m^3 R^2 / 12                            | 83,000 R^2         | (a.b+1)^3 | m R^2 / 2                          | 50 R^2
Quartic    | All m^4/24 terms up to degree 4 | m^4 R^2 / 48                            | 1,960,000 R^2      | (a.b+1)^4 | m R^2 / 2                          | 50 R^2

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 53


QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 54
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 55
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 56
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 57
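The evaluation trick above, written out: rather than ever forming w in the huge quintic feature space, sum the kernel over the support vectors. A hedged numpy sketch with placeholder support-vector data (in practice these values come from the QP solution):

```python
import numpy as np

def predict(x, sv_x, sv_y, sv_alpha, b, degree=5):
    """f(x) = sign( sum_k alpha_k y_k (x_k . x + 1)^degree - b ),
    summing only over the S support vectors: O(S*m) work."""
    k = (sv_x @ x + 1.0) ** degree          # kernel value against each support vector
    return np.sign(np.sum(sv_alpha * sv_y * k) - b)

# Placeholder support vectors, multipliers and offset.
sv_x = np.array([[1.0, 2.0], [-1.5, 0.5], [0.5, -2.0]])
sv_y = np.array([1.0, -1.0, -1.0])
sv_alpha = np.array([0.4, 0.3, 0.1])
b = 0.2

print(predict(np.array([0.8, 1.9]), sv_x, sv_y, sv_alpha, b))
```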
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.
But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
  The use of Maximum Margin magically makes this not a problem.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)

When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long."
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 58
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l ( Φ(x_k) . Φ(x_l) )

Andrew's opinion of why SVMs don't overfit as much as you'd think:
No matter what the basis function, there are really only up to R parameters: α_1, α_2, ..., α_R, and usually most are set to zero by the Maximum Margin.
Asking for small w.w is like "weight decay" in Neural Nets and like Ridge Regression parameters in Linear regression and like the use of Priors in Bayesian Regression: all designed to smooth the function and reduce overfitting.

Subject to these constraints:
    0 <= α_k <= C   for all k
    Σ_{k=1..R} α_k y_k = 0

Then define:
    w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
    b = y_K (1 - ε_K) - x_K . w      where K = arg max_k α_k

    w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
             = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5
    Only S m operations (S = # support vectors)

Then classify with:
    f(x,w,b) = sign(w . Φ(x) - b)
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 59
SVM Kernel Functions
• K(a,b) = (a . b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
      K(a,b) = exp( - (a - b)^2 / (2 σ^2) )
• Neural-net-style Kernel Function:
      K(a,b) = tanh( κ a.b - δ )

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture)

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 60
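The three kernels on this slide, written as plain functions; d, σ, κ and δ are the "magic parameters" the slide says must come from model selection, and the default values and test vectors below are arbitrary:

```python
import numpy as np

def poly_kernel(a, b, d=2):
    """K(a,b) = (a . b + 1)^d"""
    return (np.dot(a, b) + 1.0) ** d

def rbf_kernel(a, b, sigma=1.0):
    """K(a,b) = exp( -(a - b)^2 / (2 sigma^2) ), with (a - b)^2 the squared distance."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """K(a,b) = tanh( kappa * a.b - delta )"""
    return np.tanh(kappa * np.dot(a, b) - delta)

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # made-up inputs
print(poly_kernel(a, b), rbf_kernel(a, b), neural_net_kernel(a, b))
```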


VC-dimension of an SVM
• Very very very loosely speaking there is some theory which under some different assumptions puts an upper bound on the VC dimension as
      (Diameter / Margin)^2
• where
  • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set.
  • Margin is the smallest margin we'll let the SVM use
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 61


SVM Performance
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• Another Example: Andrew knows several reliable
people doing practical real-world work who claim
that SVMs have saved them when their other
favorite classifiers did poorly.
• There is a lot of excitement and religious fervor
about SVMs as of 2001.
• Despite this, some practitioners (including your
lecturer) are a little skeptical.
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 62
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVM’s
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
• :
• SVM N learns “Output==N” vs “Output != N”
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 63
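A sketch of that one-versus-rest recipe. It leans on scikit-learn's SVC as the underlying two-class SVM (an assumption of this example, not something the slides prescribe); each binary SVM scores a new input with its decision_function, and the class whose SVM pushes the prediction furthest into the positive region wins:

```python
import numpy as np
from sklearn.svm import SVC   # off-the-shelf two-class SVM used for illustration

def train_one_vs_rest(X, y, n_classes, **svm_params):
    """Train one SVM per class: "Output == c" vs "Output != c"."""
    machines = []
    for c in range(n_classes):
        labels = np.where(y == c, 1, -1)
        machines.append(SVC(**svm_params).fit(X, labels))
    return machines

def predict_one_vs_rest(machines, X):
    """Pick the class whose SVM puts the point furthest into its positive region."""
    scores = np.column_stack([m.decision_function(X) for m in machines])
    return np.argmax(scores, axis=1)

# Made-up 3-class toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c * 3.0, size=(20, 2)) for c in range(3)])
y = np.repeat(np.arange(3), 20)

machines = train_one_vs_rest(X, y, n_classes=3, kernel="rbf", C=1.0)
print((predict_one_vs_rest(machines, X) == y).mean())   # training accuracy
```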


References
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
  http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 64


What You Should Know
• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (but, for this class, you
don’t need to know how it does it)
• How Maximum Margin can be turned into a QP
problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM Kernel functions permit us to pretend
we’re working with ultra-high-dimensional basis-
function terms

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 65
