Support Vector Machines

Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
[Figure: a linearly separable two-class dataset with several candidate linear separators drawn through it.]

Any of these would be fine... but which is best?
• Plus-plane  = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
• Claim: the vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w . (u - v)?
• Let x- be any point on the minus plane (any location in R^m: not necessarily a datapoint).
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?

What we know:
  w . x- + b = -1
  x+ = x- + λ w
  |x+ - x-| = M

Substituting the second line into w . x+ + b = +1 gives w . x- + b + λ (w . w) = +1, and since w . x- + b = -1,

  λ = 2 / (w . w)

and therefore

  M = |x+ - x-| = |λ w| = λ sqrt(w . w) = 2 / sqrt(w . w)
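A quick numerical sanity check of this derivation may help. The sketch below is not from the slides; it is plain NumPy with an arbitrary made-up w and b: pick a point on the minus plane, step to the plus plane along w, and confirm that λ = 2 / (w . w) and M = 2 / sqrt(w . w).

```python
import numpy as np

# Arbitrary example hyperplane parameters (made up for illustration).
w = np.array([3.0, 4.0])
b = -2.0

# Any point on the minus plane { x : w.x + b = -1 }.
x_minus = np.array([0.0, (-1.0 - b) / w[1]])
assert np.isclose(w @ x_minus + b, -1.0)

# Step from x- towards the plus plane along w:  x+ = x- + lambda * w.
lam = 2.0 / (w @ w)
x_plus = x_minus + lam * w
assert np.isclose(w @ x_plus + b, +1.0)      # x+ really lies on the plus plane

# The margin width is the distance between the two points.
M = np.linalg.norm(x_plus - x_minus)
assert np.isclose(M, 2.0 / np.sqrt(w @ w))   # M = 2 / sqrt(w.w)
print("lambda =", lam, " M =", M)
```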
Learning the Maximum Margin Classifier

[Figure: the "Predict Class = +1" zone lies beyond wx + b = +1 and the "Predict Class = -1" zone beyond wx + b = -1, with the decision boundary wx + b = 0 in between; x+ and x- mark the closest points on the two planes, and M = Margin Width = 2 / sqrt(w.w).]

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So learning amounts to searching the space of w's and b's for the widest margin that keeps every datapoint in its correct half-plane. That search can be posed as a standard constrained optimization problem:
Quadratic Programming

Find   arg max_u   c + d^T u + (u^T R u) / 2        (a quadratic criterion)

subject to n linear inequality constraints

  a_11 u_1 + a_12 u_2 + ... + a_1m u_m <= b_1
    :
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m <= b_n

and subject to e additional linear equality constraints

  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)

There exist off-the-shelf algorithms for finding such constrained quadratic optima; you probably don't want to write one yourself.
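As a concrete illustration (not from the slides), here is a minimal sketch of handing a small made-up problem of exactly this form to an off-the-shelf solver. It assumes the third-party cvxopt package; its solvers.qp routine minimizes (1/2) u'Pu + q'u subject to Gu <= h and Au = b, so the maximization above is handled by negating R and d.

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

# Made-up instance of the QP above: maximize c + d'u + u'Ru/2 over u = (u1, u2).
R = np.array([[-2.0, 0.0],
              [0.0, -2.0]])                           # quadratic term (negative definite, so a max exists)
d = np.array([[2.0], [3.0]])                          # linear term
G = np.array([[1.0, 1.0]]); h = np.array([[10.0]])    # one inequality constraint:  u1 + u2 <= 10
A = np.array([[1.0, -1.0]]); b = np.array([[0.0]])    # one equality constraint:    u1 - u2  = 0

# cvxopt minimizes, so negate the criterion to turn the max into a min.
sol = solvers.qp(matrix(-R), matrix(-d), matrix(G), matrix(h), matrix(A), matrix(b))
u = np.array(sol['x']).ravel()
print("arg max u =", u)                               # approximately [1.25, 1.25] for this toy problem
```

The SVM training problems on the following slides have exactly this shape, with u playing the role of (w, b) in the primal formulation or of the α's in the dual.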
Learning Maximum Margin with Noise

[Figure: the same margin picture (wx + b = +1, wx + b = 0, wx + b = -1, width M = 2 / sqrt(w.w)), but now some datapoints fall inside the margin or on the wrong side; each such point k sits a distance ε_k from its correct zone.]

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the sum of distances ε_k of points to their correct zones
• Compute the margin width

Assume R datapoints, each (x_k, y_k) where y_k = +/- 1, and m input dimensions.

The quadratic optimization criterion:

  Minimize   (1/2) w . w  +  C Σ_{k=1..R} ε_k

subject to, for each k,

  w . x_k + b >= +1 - ε_k   if y_k = +1
  w . x_k + b <= -1 + ε_k   if y_k = -1

There's a bug in this QP. Can you spot it? As written, nothing stops an ε_k from going negative; the corrected QP adds the constraints

  ε_k >= 0   for all k
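To make "given a guess of w and b we can compute..." concrete, here is a small NumPy sketch (toy data made up, not from the slides). For a guessed (w, b) it computes the smallest slacks ε_k consistent with the constraints above, the margin width, and the noisy-margin criterion.

```python
import numpy as np

# Toy dataset (made up): R datapoints x_k with labels y_k in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, 0.5], [0.2, 0.1]])
y = np.array([+1, +1, -1, -1, +1])

def soft_margin_report(w, b, C):
    margins = y * (X @ w + b)
    # Smallest slacks satisfying  w.x_k + b >= 1 - eps_k (y_k = +1),
    # w.x_k + b <= -1 + eps_k (y_k = -1), and eps_k >= 0:
    eps = np.maximum(0.0, 1.0 - margins)
    width = 2.0 / np.sqrt(w @ w)                  # margin width
    criterion = 0.5 * (w @ w) + C * eps.sum()     # (1/2) w.w + C * sum of slacks
    return eps, width, criterion

eps, width, criterion = soft_margin_report(np.array([1.0, 1.0]), 0.0, C=1.0)
print("slacks:", eps)
print("margin width:", width, " criterion:", criterion)
```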
An Equivalent QP

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (x_k . x_l)

Subject to these constraints:
  0 <= α_k <= C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k
  b = y_K (1 - ε_K) - x_K . w      where K = argmax_k α_k

Then classify with:
  f(x, w, b) = sign(w . x - b)
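A minimal NumPy sketch of the bookkeeping around this dual (not from the slides): it builds the matrix Q_kl = y_k y_l (x_k . x_l) that a QP solver would receive, then recovers w and classifies. The α's are made-up placeholders for a solver's output, and the offset b is set here so that w . x_K - b = y_K for the reference point K, which is an assumption of this sketch rather than the slide's exact formula.

```python
import numpy as np

# Toy data (made up): R datapoints and their labels.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

# The matrix handed to the QP solver:  Q_kl = y_k y_l (x_k . x_l).
Q = np.outer(y, y) * (X @ X.T)

# Pretend a QP solver returned these alphas (made up; a real solver would
# enforce 0 <= alpha_k <= C and sum_k alpha_k y_k = 0).
alpha = np.array([0.1, 0.0, 0.1, 0.0])
assert np.isclose(alpha @ y, 0.0)

# Recover the weight vector and classify.
w = (alpha * y) @ X                  # w = sum_k alpha_k y_k x_k
K = np.argmax(alpha)                 # the slide's choice of reference point
b = w @ X[K] - y[K]                  # assumed offset: puts point K exactly on its margin plane
f = lambda x: np.sign(w @ x - b)     # f(x, w, b) = sign(w . x - b)
print("w =", w, " b =", b, " f([3, 3]) =", f(np.array([3.0, 3.0])))
```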
What would SVMs do with this data?

[Figure: one-dimensional datasets laid out along the x axis around x = 0. The positive "plane" and negative "plane" are just points on the line, and for the harder dataset the two classes are not separable by any single threshold on x.]

The trick for the harder dataset is to remap each point with basis functions, e.g. z_k = (x_k, x_k^2), so that the data becomes separable in the new space. Other common choices include z_k = (sigmoid functions of x_k). A small sketch of the (x_k, x_k^2) remap follows.
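This sketch (made-up 1-D points, plain NumPy, not from the slides) shows the remap in action: no single threshold on x separates the classes, but after z_k = (x_k, x_k^2) a straight line does.

```python
import numpy as np

# Made-up 1-D dataset: negatives cluster around x = 0, positives lie on both sides.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.0, 3.0])
y = np.array([+1,   +1,   -1,  -1,  -1,  +1,  +1])

# No single threshold on x puts all positives on one side and all negatives on the other...
thresholds = [t for t in np.linspace(-4, 4, 201)
              if len(set(y[x < t])) <= 1 and len(set(y[x >= t])) <= 1]
print("separating thresholds in 1-D:", thresholds)        # -> []

# ...but after the remap z_k = (x_k, x_k^2) the line z2 = 1 separates them perfectly.
Z = np.column_stack([x, x ** 2])
w, b = np.array([0.0, 1.0]), -1.0                          # the separator z2 - 1 = 0
print("correctly classified:", np.sign(Z @ w + b) == y)    # -> all True
```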
This is sensible.
Is that the end of the story?
No…there’s one more trick!
Quadratic Basis Functions

Φ(x) is the vector containing:

  1                                                  Constant term
  √2 x_1, √2 x_2, ..., √2 x_m                        Linear terms
  x_1^2, x_2^2, ..., x_m^2                           Pure quadratic terms
  √2 x_1 x_2, √2 x_1 x_3, ..., √2 x_1 x_m,
  √2 x_2 x_3, ..., √2 x_{m-1} x_m                    Quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m^2/2.

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon
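A direct NumPy construction of this Φ (a sketch, with a made-up input vector), including an assertion that its length matches (m+2)(m+1)/2.

```python
import numpy as np

def quadratic_phi(x):
    """Constant term, sqrt(2)*linear terms, pure quadratic terms, sqrt(2)*cross-terms."""
    m = len(x)
    return np.concatenate([
        np.array([1.0]),                                    # constant term
        np.sqrt(2.0) * x,                                   # linear terms
        x ** 2,                                             # pure quadratic terms
        np.array([np.sqrt(2.0) * x[i] * x[j]                # quadratic cross-terms
                  for i in range(m) for j in range(i + 1, m)]),
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])            # m = 4, made-up input
phi = quadratic_phi(x)
m = len(x)
assert len(phi) == (m + 2) * (m + 1) // 2     # = (m+2)-choose-2 = 15 terms for m = 4
print(phi)
```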
QP with basis functions

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (Φ(x_k) . Φ(x_l))

Subject to these constraints:
  0 <= α_k <= C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 - ε_K) - x_K . w      where K = argmax_k α_k

Then classify with:
  f(x, w, b) = sign(w . Φ(x) - b)
Quadratic Dot Products

Working out Φ(a) . Φ(b) term by term:

  Φ(a) . Φ(b) = 1                                             (constant terms)
              + Σ_{i=1..m} 2 a_i b_i                          (the √2 linear terms)
              + Σ_{i=1..m} a_i^2 b_i^2                        (the pure quadratic terms)
              + Σ_{i=1..m} Σ_{j=i+1..m} 2 a_i a_j b_i b_j     (the √2 cross-terms)
Just out of casual, innocent interest, let's look at another function of a and b:

  (a . b + 1)^2 = (a . b)^2 + 2 a . b + 1
                = (Σ_{i=1..m} a_i b_i)^2 + 2 Σ_{i=1..m} a_i b_i + 1
                = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
                = Σ_{i=1..m} (a_i b_i)^2 + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j + 2 Σ_{i=1..m} a_i b_i + 1

Term for term this is exactly Φ(a) . Φ(b) from above, so Φ(a) . Φ(b) = (a . b + 1)^2.
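A quick numeric check of this identity, as a self-contained NumPy sketch with made-up random vectors: the explicit Φ(a) . Φ(b) (about m^2/2 work) agrees with the (a . b + 1)^2 shortcut (about m work).

```python
import numpy as np

def quadratic_phi(x):
    # constant, sqrt(2)*linear, pure quadratic, sqrt(2)*cross terms (as on the earlier slide)
    m = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=20), rng.normal(size=20)   # made-up 20-dimensional vectors

lhs = quadratic_phi(a) @ quadratic_phi(b)         # explicit Phi(a).Phi(b)
rhs = (a @ b + 1.0) ** 2                          # the kernel shortcut
assert np.isclose(lhs, rhs)
print(lhs, "==", rhs)
```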
QP with Quadratic basis functions

Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct.

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (Φ(x_k) . Φ(x_l))

We must do R^2/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.

Subject to these constraints:
  0 <= α_k <= C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 - ε_K) - x_K . w      where K = argmax_k α_k

Then classify with:
  f(x, w, b) = sign(w . Φ(x) - b)
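At the level of the whole Q matrix, the saving looks like this sketch (made-up data, NumPy): the R^2/2 dot products are computed with the (x_k . x_l + 1)^2 shortcut rather than with explicit Φ's.

```python
import numpy as np

rng = np.random.default_rng(1)
R, m = 200, 100                       # made-up dataset size and dimensionality
X = rng.normal(size=(R, m))
y = rng.choice([-1.0, 1.0], size=R)

# Q_kl = y_k y_l * Phi(x_k).Phi(x_l), computed via the kernel identity:
# each of the ~R^2/2 distinct dot products costs about m operations, not m^2/2.
K = (X @ X.T + 1.0) ** 2
Q = np.outer(y, y) * K
print(Q.shape)                        # (200, 200): ready to hand to the QP solver
```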
QP with Quintic basis functions

Maximize   Σ_{k=1..R} α_k  -  (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl      where  Q_kl = y_k y_l (Φ(x_k) . Φ(x_l))

We must do R^2/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms. (The use of Maximum Margin magically makes this not a problem.)
• The evaluation phase (doing a set of predictions on a test set) will be very expensive. Why? Because each w . Φ(x) (see below) needs 75 million operations. What can be done?

Subject to these constraints:
  0 <= α_k <= C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 - ε_K) - x_K . w      where K = argmax_k α_k

Then classify with:
  f(x, w, b) = sign(w . Φ(x) - b)

The fix for the expensive evaluation phase:

  w . Φ(x) = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k) . Φ(x)
           = Σ_{k s.t. α_k > 0} α_k y_k (x_k . x + 1)^5

Only S·m operations (S = # support vectors).

(When you see this many callout bubbles on a slide it's time to wrap the author in a blanket, gently take him away and murmur "someone's been at the PowerPoint for too long.")
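That fix, as a sketch in NumPy: the classifier never forms the huge Φ vectors; it sums α_k y_k (x_k . x + 1)^5 over the support vectors only. The data, α's, and b below are made-up placeholders for what the QP above would return.

```python
import numpy as np

# Placeholders for what training would produce (all made up for illustration).
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y_train = np.array([+1.0, +1.0, -1.0, -1.0])
alpha   = np.array([0.3, 0.0, 0.3, 0.0])       # zero alphas = non-support-vectors
b       = 0.5

def quintic_kernel(u, v):
    return (u @ v + 1.0) ** 5                  # Phi(u).Phi(v) without ever forming Phi

def predict(x):
    sv = alpha > 0                             # only the S support vectors contribute
    ks = np.array([quintic_kernel(xk, x) for xk in X_train[sv]])
    return np.sign(np.sum(alpha[sv] * y_train[sv] * ks) - b)   # sign(w.Phi(x) - b)

print(predict(np.array([1.5, 1.5])), predict(np.array([-1.5, -0.5])))
```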
Andrew's opinion of why SVMs don't overfit as much as you'd think: no matter what the basis function, there are really only up to R parameters (α_1, α_2, ..., α_R), and usually most of them are set to zero by the Maximum Margin.