Kernel and Multiclass
Support Vector Machines and Kernels
Doing Really Well with Linear Decision Surfaces
Linear Separators
Training instances
x ∈ ℝⁿ
y ∈ {-1, 1}
w ∈ ℝⁿ
b ∈ ℝ
Math review: inner (dot) product
<a, b> = a · b = ∑ ai bi = a1b1 + a2b2 + … + anbn
Hyperplane
<w, x> + b = 0
w1x1 + w2x2 + … + wnxn + b = 0
Decision function
f(x) = sign(<w, x> + b)
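A minimal sketch of this decision function in Python (the weight vector and bias below are made-up values, not from the slides):

```python
# Sketch: linear decision function f(x) = sign(<w, x> + b) for a 2-D input.
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def decide(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return int(np.sign(np.dot(w, x) + b))

print(decide(np.array([1.0, 1.0])))   # <w,x>+b = 2 - 1 + 0.5 = 1.5 -> +1
print(decide(np.array([-1.0, 1.0])))  # <w,x>+b = -2 - 1 + 0.5 = -2.5 -> -1
```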
Intuitions
[Figure: X and O training points with several candidate linear separators drawn through them]
A “Good” Separator
[Figure: X and O points with a single well-placed separating line]
Noise in the
Observations
[Figure: X and O points where a few noisy observations lie close to the boundary]
Ruling Out Some
Separators
[Figure: X and O points; separators that pass too close to either class are ruled out]
Lots of Noise
[Figure: X and O points with heavy noise and overlap between the classes]
Maximizing the Margin
[Figure: the separator positioned to maximize the margin between the two classes]
“Fat” Separators
[Figure: a "fat" separator, a wide band between the classes with no points inside it]
Why Maximize Margin?
Increasing margin reduces capacity
Must restrict capacity to generalize
m training instances
2^m ways to label them
What if some function class can separate them all?
Then it shatters the training instances
[Figure: several different labelings of a small point set, each separable by the function class]
Support Vectors
[Figure: maximum-margin separator with the support vectors lying on the margin boundaries]
The Math
For perfect classification, we want
yi (<w,xi> + b) ≥ 0 for all i
Why? The product is positive exactly when sign(<w,xi> + b) matches yi
To maximize the margin, we want
w that minimizes |w|² (with the rescaled constraint yi (<w,xi> + b) ≥ 1, the margin width is 2/|w|)
Dual Optimization
Problem
Maximize over α
W(α) = ∑i αi - 1/2 ∑i,j αi αj yi yj <xi, xj>
Subject to
αi ≥ 0
∑i αi yi = 0
Decision function
f(x) = sign(∑i αi yi <x, xi> + b)
Primal/dual problems: https://www.youtube.com/watch?v=vx7d32Jz97w
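As a hedged illustration, this dual-form decision function can be recovered from a fitted scikit-learn SVC, which stores αi·yi in dual_coef_ and the corresponding xi in support_vectors_; the toy data here are assumptions for the sketch:

```python
# Sketch: recover f(x) = sign( sum_i alpha_i y_i <x, x_i> + b )
# from a fitted scikit-learn SVC (dual_coef_ holds alpha_i * y_i).
import numpy as np
from sklearn.svm import SVC

# Tiny made-up problem: two linearly separable clusters.
X = np.array([[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.2, 1.8]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

def dual_decision(x):
    # alpha_i * y_i for the support vectors only; zero alphas are dropped by SVC.
    coeffs = clf.dual_coef_[0]
    sv = clf.support_vectors_
    return int(np.sign(np.sum(coeffs * (sv @ x)) + clf.intercept_[0]))

x_new = np.array([1.8, 2.1])
print(dual_decision(x_new), clf.predict([x_new])[0])  # the two should agree
```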
What if Data Are Not
Perfectly Linearly
Separable?
Cannot find w and b that satisfy
yi (<w,xi> + b) ≥ 1 for all i
Introduce slack variables ξi
yi (<w,xi> + b) ≥ 1 - ξi for all i
Minimize
|w|² + C ∑i ξi
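A small sketch (with made-up w, b, C, and data) of evaluating this soft-margin objective, using the fact that at the optimum each slack ξi equals the hinge loss max(0, 1 - yi(<w,xi> + b)):

```python
# Sketch: evaluate |w|^2 + C * sum_i xi_i for a candidate (w, b) on toy data.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w, b, C = np.array([1.0, 1.0]), 0.0, 1.0   # hypothetical candidate solution

margins = y * (X @ w + b)                  # y_i(<w, x_i> + b)
slack = np.maximum(0.0, 1.0 - margins)     # xi_i, the hinge losses
objective = np.dot(w, w) + C * slack.sum()
print(slack, objective)
```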
Strengths of SVMs
Good generalization in theory
Good generalization in practice
Work well with few training
instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick …
Simple example
We have 2 colors of balls on the table that
we want to separate.
We get a stick and put it on the table. This works pretty well, right?
Some villain comes and places more balls on the table. It kind of works, but one of the balls is on the wrong side, and there is probably a better place to put the stick now.
SVMs try to put the stick in the best possible place by
having as big a gap on either side of the stick as possible.
There is another trick in the SVM toolbox that is
even more important. Say the villain has seen how
good you are with a stick so he gives you a new
challenge.
There’s no stick in the world that will let you split
those balls well, so what do you do? You flip the
table of course! Throwing the balls into the air.
Then, with your pro ninja skills, you grab a sheet
of paper and slip it between the balls.
Here,
[Figure: X points clustered inside a ring of O points, not separable by any straight line]
Image from http://www.atrandomresearch.com/iclass/
Soft Margins
This idea is based on a simple premise: allow the SVM to make a certain number of mistakes while keeping the margin as wide as possible, so that other points can still be classified correctly.
This can be done simply by modifying the objective of the SVM.
A decision boundary with a wider margin, even if it lets a few points fall on the wrong side, tends to generalize better on unseen data; the soft-margin formulation helps avoid overfitting in this way.
How it Works (mathematically)?
Aim to minimize the soft-margin objective from above:
|w|² + C ∑i ξi   subject to   yi (<w,xi> + b) ≥ 1 - ξi and ξi ≥ 0 for all i
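A hedged sketch of the trade-off that C controls, using scikit-learn's linear SVC on made-up data: a small C tolerates more slack and typically gives a wider margin 2/|w|:

```python
# Sketch: effect of C on the soft-margin SVM (margin width = 2 / |w|).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:7g}  margin width = {2 / np.linalg.norm(w):.3f}  "
          f"support vectors = {len(clf.support_vectors_)}")
```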
Φ: x → φ(x)
φ(x1, x2) = (x1, x2, x1², x2², x1x2)
Rather than run the SVM on the original xi, run it on φ(xi)
(Transform to a higher-dimensional feature space)
Find a non-linear separator in input space
What if φ(xi) is really big?
Use kernels to compute it implicitly!
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
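A short sketch of this explicit feature map applied to a data matrix (the sample points are made up); a linear SVM trained on φ(X) is a non-linear separator in the original space:

```python
# Sketch: explicit feature map phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2).
import numpy as np

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

X = np.array([[1.0, 2.0], [0.5, -1.0]])
print(phi(X))
# A linear SVM could now be trained on phi(X) instead of X.
```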
Kernels
Find kernel K such that
K(x1, x2) = <φ(x1), φ(x2)>
Computing K(x1, x2) should be efficient, much more so than computing φ(x1) and φ(x2)
Use K(x1,x2) in SVM algorithm rather
than <x1,x2>
Remarkably, this is possible
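One way to see this in practice (a sketch with assumed toy data and an assumed degree-2 kernel): scikit-learn's SVC accepts a precomputed kernel matrix, so K(xi, xj) can stand in for <xi, xj> without ever forming φ(x):

```python
# Sketch: hand the SVM a kernel matrix directly instead of explicit features.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1, -1])

K_train = (X @ X.T) ** 2                 # K(x_i, x_j) = <x_i, x_j>^2
clf = SVC(kernel="precomputed").fit(K_train, y)

X_new = np.array([[0.5, 0.5]])
K_new = (X_new @ X.T) ** 2               # kernel between new point and training points
print(clf.predict(K_new))
```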
Examples of Kernel
Functions
Linear: K(xi, xj) = xiᵀxj
Mapping Φ: x → φ(x), where φ(x) is x itself
Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p
Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions
Gaussian (radial-basis function):
K(xi, xj) = exp(-‖xi - xj‖² / (2σ²))
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); the combination of the functions for the support vectors is the separator.
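The three kernels above, written as plain Python functions (σ and p are free parameters; the test vectors are made up):

```python
# Sketch: the linear, polynomial, and Gaussian (RBF) kernels.
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

a, b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b))
```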
The Polynomial Kernel
K(x1, x2) = <x1, x2>²
x1 = (x11, x12)
x2 = (x21, x22)
<x1, x2> = (x11x21 + x12x22)
<x1, x2>² = (x11²x21² + x12²x22² + 2 x11x12x21x22)
φ(x1) = (x11², x12², √2 x11x12)
φ(x2) = (x21², x22², √2 x21x22)
K(x1, x2) = <φ(x1), φ(x2)>
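A quick numeric check of this identity (the vectors are arbitrary made-up values):

```python
# Check: <x1, x2>^2 equals <phi(x1), phi(x2)> for phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x1 = np.array([3.0, -1.0])
x2 = np.array([2.0, 4.0])

kernel_value = np.dot(x1, x2) ** 2        # computed in the 2-D input space
feature_value = np.dot(phi(x1), phi(x2))  # computed in the 3-D feature space
print(kernel_value, feature_value)        # both are 4.0 here
```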
The Polynomial Kernel
φ(x) contains all monomials of degree d
Useful in visual pattern recognition
Number of monomials
16×16 pixel image
~10^10 monomials of degree 5
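The count can be checked directly: the number of degree-5 monomials in d = 256 pixel variables is C(d + 5 - 1, 5) by stars and bars, which is roughly 10^10:

```python
# Check the slide's count of degree-5 monomials for a 16x16 = 256 pixel image.
from math import comb

d, degree = 16 * 16, 5
print(comb(d + degree - 1, degree))   # 9525431552, i.e. about 10^10
```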