SVM Extra Kernels

Lecture slides on SVM and Kernels (ML)

Support Vector Machines
Doing Really Well with Linear Decision Surfaces
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick
Linear Separators
• Training instances
  • x ∈ ℝⁿ
  • y ∈ {-1, 1}
• Weight vector w ∈ ℝⁿ
• Bias b ∈ ℝ
• Hyperplane
  • <w, x> + b = 0
  • w1x1 + w2x2 + … + wnxn + b = 0
• Decision function
  • f(x) = sign(<w, x> + b)

Math review — inner (dot) product:
<a, b> = a · b = ∑ ai bi = a1b1 + a2b2 + … + anbn
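As a concrete illustration, here is a minimal NumPy sketch of this decision function; the weight vector and bias are made-up values, not learned ones:

import numpy as np

def decision_function(w, b, x):
    """Linear decision rule: f(x) = sign(<w, x> + b)."""
    return np.sign(np.dot(w, x) + b)

# Hypothetical separator: the hyperplane x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(decision_function(w, b, np.array([2.0, 2.0])))  #  1.0 (positive side)
print(decision_function(w, b, np.array([0.0, 0.0])))  # -1.0 (negative side)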
Intuitions

[Figure: four slides showing the same 2-D scatter of X and O training points, each with a different candidate linear separator drawn through the data.]
A “Good” Separator

[Figure: the X/O scatter with a single well-placed separator between the classes.]
Noise in the Observations

[Figure: the X/O scatter with noisy points falling near the separator.]
Ruling Out Some Separators

[Figure: the X/O scatter; separators that pass too close to either class are ruled out.]
Lots of Noise

[Figure: the X/O scatter under heavier noise.]
Maximizing the Margin

[Figure: the X/O scatter with the separator that maximizes the distance to the nearest points of each class.]
“Fat” Separators

[Figure: the X/O scatter with a “fat” separator, a thick slab rather than a thin line.]
Support Vectors

[Figure: the X/O scatter with the support vectors highlighted: the training points lying on the margin that determine the separator.]
The Math
• Training instances
  • x ∈ ℝⁿ
  • y ∈ {-1, 1}
• Decision function
  • f(x) = sign(<w, x> + b)
  • w ∈ ℝⁿ
  • b ∈ ℝ
• Find w and b that
  • Perfectly classify the training instances (assuming linear separability)
  • Maximize the margin
The Math
• For perfect classification, we want
  • yi (<w, xi> + b) ≥ 0 for all i
  • Why? The product is positive exactly when the sign of <w, xi> + b agrees with the label yi.
• To maximize the margin, rescale w and b so that yi (<w, xi> + b) ≥ 1 for all i; the margin is then 2/|w|, so we want
  • the w that minimizes |w|²
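A quick numeric check of the 2/|w| margin formula, using a made-up weight vector rather than one learned from data:

import numpy as np

# Hypothetical separator <w, x> + b with margin hyperplanes at +/- 1
w = np.array([3.0, 4.0])   # |w| = 5
b = 0.0

# Points on the two margin hyperplanes, along the unit normal w/|w|
norm = np.linalg.norm(w)
x_plus = (1 - b) * w / norm**2    # satisfies <w, x> + b = +1
x_minus = (-1 - b) * w / norm**2  # satisfies <w, x> + b = -1

print(np.dot(w, x_plus) + b, np.dot(w, x_minus) + b)  # 1.0 -1.0
print(np.linalg.norm(x_plus - x_minus), 2 / norm)     # both 0.4 = 2/|w|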
Dual Optimization Problem
• Maximize over α
  • W(α) = Σi αi − ½ Σi,j αi αj yi yj <xi, xj>
• Subject to
  • αi ≥ 0
  • Σi αi yi = 0
• Decision function
  • f(x) = sign(Σi αi yi <x, xi> + b)
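A minimal NumPy sketch of these two formulas; the toy data and the value of α are made up for illustration (a real solver such as SMO would optimize α subject to the constraints):

import numpy as np

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j <x_i, x_j>."""
    G = (y[:, None] * X) @ (y[:, None] * X).T  # G[i, j] = y_i y_j <x_i, x_j>
    return alpha.sum() - 0.5 * alpha @ G @ alpha

def decision(alpha, X, y, b, x):
    """f(x) = sign(sum_i alpha_i y_i <x, x_i> + b)."""
    return np.sign((alpha * y) @ (X @ x) + b)

# Toy data and a hypothetical alpha satisfying the constraints
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.25, 0.0, 0.25, 0.0])  # alpha_i >= 0 and sum_i alpha_i y_i = 0
print(dual_objective(alpha, X, y))
print(decision(alpha, X, y, b=0.0, x=np.array([3.0, 3.0])))  # 1.0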
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick …
What if the Surface is Non-Linear?

[Figure: O points surrounding a central cluster of X points, so no straight line separates the classes.]
Image from http://www.atrandomresearch.com/iclass/
Kernel Methods
Making the Non-Linear Linear
When Linear Separators Fail

[Figure: points ordered X X O O O O X X along the x1 axis, which no threshold separates; plotted against x1², the same points become linearly separable.]
Mapping into a New Feature Space

Φ : x → X = Φ(x)
Φ(x1, x2) = (x1, x2, x1², x2², x1x2)
• Rather than run the SVM on xi, run it on Φ(xi)
• A linear separator in the new space is a non-linear separator in the input space
• What if Φ(xi) is really big?
• Use kernels to compute it implicitly! (see the sketch below)
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
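A minimal sketch of this particular feature map; the sample point is arbitrary:

import numpy as np

def phi(x):
    """The quadratic feature map Phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# The 2-D point (1, 2) becomes a point in R^5:
print(phi(np.array([1.0, 2.0])))  # [1. 2. 1. 4. 2.]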
Kernels
• Find a kernel K such that
  • K(x1, x2) = <Φ(x1), Φ(x2)>
• Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)
• Use K(x1, x2) in the SVM algorithm rather than <x1, x2>
• Remarkably, this is possible
The Polynomial Kernel
• K(x1, x2) = <x1, x2>²
• x1 = (x11, x12)
• x2 = (x21, x22)
• <x1, x2> = x11x21 + x12x22
• <x1, x2>² = x11²x21² + x12²x22² + 2 x11x12x21x22
• Φ(x1) = (x11², x12², √2 x11x12)
• Φ(x2) = (x21², x22², √2 x21x22)
• K(x1, x2) = <Φ(x1), Φ(x2)>
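A quick numeric check that the kernel and the explicit feature map agree (the specific points are arbitrary):

import numpy as np

def phi(x):
    """Feature map for the degree-2 polynomial kernel on 2-D inputs."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(a, b):
    """K(a, b) = <a, b>^2."""
    return np.dot(a, b) ** 2

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
print(poly_kernel(x1, x2))       # 121.0  (= (1*3 + 2*4)^2)
print(np.dot(phi(x1), phi(x2)))  # 121.0  -- same value, via Phi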
The Polynomial Kernel
• Φ(x) contains all monomials of degree d
• Useful in visual pattern recognition
• Number of monomials
  • 16×16 pixel image
  • ~10¹⁰ monomials of degree 5
• Never explicitly compute Φ(x)!
• Variation: K(x1, x2) = (<x1, x2> + 1)²
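As a sanity check on that count: the number of degree-d monomials over n variables is C(n + d − 1, d), a standard stars-and-bars count not stated on the slide:

import math

n, d = 16 * 16, 5                 # 256 pixel intensities, degree-5 monomials
count = math.comb(n + d - 1, d)   # multisets of size d drawn from n variables
print(count)                      # 9525431552, i.e. roughly 10**10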
A Few Good Kernels
• Dot product kernel
  • K(x1, x2) = <x1, x2>
• Polynomial kernel
  • K(x1, x2) = <x1, x2>^d (monomials of degree d)
  • K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)
• Gaussian kernel
  • K(x1, x2) = exp(−|x1 − x2|²/2σ²)
  • Radial basis functions
• Sigmoid kernel
  • K(x1, x2) = tanh(<x1, x2> + θ)
  • Neural networks
• Establishing “kernel-hood” from first principles is non-trivial
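Minimal NumPy sketches of these kernels; the hyperparameter defaults (d, c, sigma, theta) are placeholders, not recommended values:

import numpy as np

def dot_kernel(a, b):
    """K(a, b) = <a, b>."""
    return np.dot(a, b)

def poly_kernel(a, b, d=2, c=1.0):
    """(<a, b> + c)^d; c = 0 gives monomials of degree d only."""
    return (np.dot(a, b) + c) ** d

def gaussian_kernel(a, b, sigma=1.0):
    """exp(-|a - b|^2 / (2 sigma^2)), a radial basis function."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

def sigmoid_kernel(a, b, theta=0.0):
    """tanh(<a, b> + theta); positive definite only for some parameters."""
    return np.tanh(np.dot(a, b) + theta)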
The Kernel Trick

“Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2”

• SVMs can use the kernel trick
Using a Different Kernel in the Dual Optimization Problem
• For example, use the polynomial kernel with d = 4 (including lower-order terms)
• The inner products in the dual are themselves kernels, so by the kernel trick we simply replace them with (<xi, xj> + 1)⁴:
• Maximize over α
  • W(α) = Σi αi − ½ Σi,j αi αj yi yj (<xi, xj> + 1)⁴
• Subject to
  • αi ≥ 0
  • Σi αi yi = 0
• Decision function
  • f(x) = sign(Σi αi yi (<x, xi> + 1)⁴ + b)
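In a library this swap is a configuration change. A minimal scikit-learn sketch (assuming scikit-learn is available; the XOR-style toy data is made up, and gamma=1, coef0=1, degree=4 makes SVC's polynomial kernel exactly (<xi, xj> + 1)⁴):

import numpy as np
from sklearn.svm import SVC

# Toy XOR-like data that no linear separator can handle
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# SVC's poly kernel is (gamma * <xi, xj> + coef0)^degree
clf = SVC(kernel="poly", degree=4, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print(clf.predict(X))  # expected: [-1  1  1 -1]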
Conclusion
• SVMs find the optimal (maximum-margin) linear separator
• The kernel trick makes SVMs non-linear learning algorithms
