SVM Extra Kernels

Support Vector Machines
Doing Really Well with Linear Decision Surfaces
Strengths of SVMs
• Good generalization in theory
• Good generalization in practice
• Work well with few training instances
• Find globally best model
• Efficient algorithms
• Amenable to the kernel trick
Linear Separators
• Training instances
  • x ∈ ℝⁿ
  • y ∈ {−1, 1}
• w ∈ ℝⁿ
• b ∈ ℝ
• Hyperplane
  • ⟨w, x⟩ + b = 0
  • w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
• Decision function
  • f(x) = sign(⟨w, x⟩ + b) (sketched in code below)

Math Review
Inner (dot) product: ⟨a, b⟩ = a · b = ∑ᵢ aᵢbᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ
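To make the decision function concrete, here is a minimal sketch in Python with NumPy (the weights, bias, and test points are invented for illustration):

    import numpy as np

    def decision(w, b, x):
        """Linear decision function: f(x) = sign(<w, x> + b)."""
        return np.sign(np.dot(w, x) + b)

    # Hypothetical hyperplane x1 + x2 - 1 = 0 in 2-D
    w = np.array([1.0, 1.0])
    b = -1.0
    print(decision(w, b, np.array([2.0, 2.0])))  # 1.0, positive side
    print(decision(w, b, np.array([0.0, 0.0])))  # -1.0, negative side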
Intuitions
[Figure: X and O training points with candidate linear separators]
A “Good” Separator
[Figure: X and O points with a separating line]
Noise in the Observations
[Figure: X and O points]
Ruling Out Some Separators
[Figure: X and O points]
Lots of Noise
[Figure: X and O points]
Maximizing the Margin
[Figure: X and O points]
“Fat” Separators
[Figure: X and O points]
Support Vectors
[Figure: X and O points]
The Math
• Training instances
  • x ∈ ℝⁿ
  • y ∈ {−1, 1}
• Decision function
  • f(x) = sign(⟨w, x⟩ + b)
  • w ∈ ℝⁿ
  • b ∈ ℝ
• Find w and b that
  • Perfectly classify training instances
    • Assuming linear separability
  • Maximize the margin
The Math
• For perfect classification, we want
  • yᵢ(⟨w, xᵢ⟩ + b) ≥ 0 for all i
  • Why? yᵢ and ⟨w, xᵢ⟩ + b have the same sign exactly when xᵢ falls on the correct side of the hyperplane, so their product is nonnegative precisely when xᵢ is classified correctly.
[Figure: X and O points]
Image from http://www.atrandomresearch.com/iclass/
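As a quick sketch of the condition above (Python with NumPy; the data, weights, and bias are invented for illustration):

    import numpy as np

    def perfectly_classified(w, b, X, y):
        """True iff y_i * (<w, x_i> + b) >= 0 for every training instance."""
        margins = y * (X @ w + b)
        return bool(np.all(margins >= 0))

    # Hypothetical linearly separable training set
    X = np.array([[2.0, 2.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]])
    y = np.array([1, 1, -1, -1])
    print(perfectly_classified(np.array([1.0, 1.0]), -1.0, X, y))  # True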
Kernel Methods
Making the Non-Linear Linear
When Linear Separators Fail
[Figure: two panels; left: X and O points plotted in (x₁, x₂), not linearly separable; right: the same points plotted in (x₁, x₁²), now linearly separable]
Mapping into a New Feature Space
Φ: x → X = Φ(x)
Φ(x₁, x₂) = (x₁, x₂, x₁², x₂², x₁x₂)
• Rather than run the SVM on xᵢ, run it on Φ(xᵢ)
• Find a non-linear separator in input space
• What if Φ(xᵢ) is really big?
• Use kernels to compute it implicitly!
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
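A minimal sketch of this explicit feature map (Python with NumPy; the name phi is mine):

    import numpy as np

    def phi(x):
        """Explicit feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
        x1, x2 = x
        return np.array([x1, x2, x1**2, x2**2, x1 * x2])

    # A linear separator over phi(x) corresponds to a quadratic
    # (non-linear) decision boundary in the original 2-D input space.
    print(phi(np.array([3.0, -2.0])))  # [ 3. -2.  9.  4. -6.]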
Kernels
• Find a kernel K such that
  • K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩
• Computing K(x₁, x₂) should be efficient, much more so than computing Φ(x₁) and Φ(x₂)
• Use K(x₁, x₂) in the SVM algorithm rather than ⟨x₁, x₂⟩
• Remarkably, this is possible (see the sketch below)
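The SVM training algorithm touches the data only through inner products, so it can run entirely off a Gram (kernel) matrix. A sketch under that assumption, using the degree-2 polynomial kernel introduced on the next slide (the names poly2_kernel and gram_matrix are mine):

    import numpy as np

    def poly2_kernel(a, b):
        """Degree-2 polynomial kernel: K(a, b) = <a, b>^2."""
        return np.dot(a, b) ** 2

    def gram_matrix(X, kernel):
        """K[i, j] = kernel(x_i, x_j): all the SVM solver needs to see."""
        n = X.shape[0]
        K = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                K[i, j] = kernel(X[i], X[j])
        return K

    X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, -1.0]])
    print(gram_matrix(X, poly2_kernel))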
The Polynomial Kernel
• K(x₁, x₂) = ⟨x₁, x₂⟩²
• x₁ = (x₁₁, x₁₂)
• x₂ = (x₂₁, x₂₂)
• ⟨x₁, x₂⟩ = x₁₁x₂₁ + x₁₂x₂₂
• ⟨x₁, x₂⟩² = x₁₁²x₂₁² + x₁₂²x₂₂² + 2x₁₁x₁₂x₂₁x₂₂
• Φ(x₁) = (x₁₁², x₁₂², √2 x₁₁x₁₂)
• Φ(x₂) = (x₂₁², x₂₂², √2 x₂₁x₂₂)
• K(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩ (verified numerically below)
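A quick numeric check of this identity (Python with NumPy; the vectors are arbitrary):

    import numpy as np

    def phi2(x):
        """Implicit feature map of the degree-2 polynomial kernel."""
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    x1 = np.array([1.0, 3.0])
    x2 = np.array([2.0, -1.0])

    lhs = np.dot(x1, x2) ** 2         # K(x1, x2) = <x1, x2>^2
    rhs = np.dot(phi2(x1), phi2(x2))  # <Phi(x1), Phi(x2)>
    print(lhs, rhs)                   # both print 1.0

Note the efficiency gap: the kernel costs one 2-D dot product, while the explicit route builds two 3-D feature vectors and takes a 3-D dot product, and the gap grows rapidly with input dimension and polynomial degree.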
The Polynomial Kernel
• Φ(x) contains all monomials of degree d
• Useful in visual pattern recognition
• Number of monomials
  • 16×16 pixel image
  • 10¹⁰ monomials of degree 5
The Dual Problem
• Maximize W(α) = Σᵢ αᵢ − ½ Σᵢ,ⱼ αᵢαⱼ yᵢyⱼ ⟨xᵢ, xⱼ⟩
• Subject to αᵢ ≥ 0 for all i and Σᵢ αᵢyᵢ = 0
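To make the objective concrete, here is a sketch that evaluates W(α) for given multipliers (Python with NumPy; the alphas and data are invented, not the output of a solver):

    import numpy as np

    def dual_objective(alpha, X, y):
        """W(a) = sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j <x_i, x_j>."""
        K = X @ X.T                      # Gram matrix of inner products
        v = alpha * y
        return alpha.sum() - 0.5 * v @ K @ v

    X = np.array([[2.0, 2.0], [0.0, 0.0]])
    y = np.array([1.0, -1.0])
    alpha = np.array([0.25, 0.25])       # satisfies sum_i alpha_i y_i = 0
    print(dual_objective(alpha, X, y))   # 0.25

With the kernel trick, the Gram matrix X @ X.T is simply replaced by a matrix of kernel evaluations K(xᵢ, xⱼ), as on the earlier Kernels slide.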