SVM and Kernels
• Many decision boundaries!
  – The Perceptron algorithm can be used to find such a boundary
• Are all decision boundaries equally good?
[Figure: linearly separable Class 1 and Class 2 data]
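The Perceptron mentioned above can be sketched in a few lines. The following is a minimal illustrative implementation (the toy data, epoch count, and variable names are assumptions added here, not part of the slides); it finds some separating boundary when one exists, but not necessarily a good one.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Minimal Perceptron: finds *some* separating boundary if one exists."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only on misclassified points (labels yi in {-1, +1})
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi
    return w, b

# Toy linearly separable data: Class 1 (y = -1) vs. Class 2 (y = +1)
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # all training points classified correctly
```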
Examples of Bad Decision Boundaries
[Figure: decision boundaries that separate Class 1 and Class 2 but pass very close to the data]
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi
• For yi = 1:   w^T xi + b ≥ 1
• For yi = -1:  w^T xi + b ≤ -1
• So: yi (w^T xi + b) ≥ 1 for all (xi, yi)
[Figure: Class 1 (y = -1) and Class 2 (y = 1) separated by a decision boundary with margin m]
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
  – We should maximize the margin, m
[Figure: Class 1 and Class 2 separated by a boundary with margin m]
Finding the Decision Boundary
• The decision boundary should classify all points correctly: yi (w^T xi + b) ≥ 1 for all i
• The Lagrangian is
  L = 1/2 w^T w − Σ_{i=1..n} αi (yi (w^T xi + b) − 1),   with αi ≥ 0
  – Note that ||w||² = w^T w
Gradient with respect to w and b
• Setting the gradient of L w.r.t. w and b to zero, we have
  L = 1/2 w^T w − Σ_{i=1..n} αi (yi (w^T xi + b) − 1)
    = 1/2 Σ_{k=1..m} (w^k)² − Σ_{i=1..n} αi (yi (Σ_{k=1..m} w^k xi^k + b) − 1)
  (n: number of examples, m: dimension of the space)
• ∂L/∂w^k = 0 for all k   ⇒   w = Σ_{i=1..n} αi yi xi
• ∂L/∂b = 0               ⇒   Σ_{i=1..n} αi yi = 0
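As a small numerical check of the stationarity condition w = Σi αi yi xi, the sketch below (added here for illustration; the toy data and use of scikit-learn are assumptions) trains a linear SVM and recovers w from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors, so
# w = sum_i alpha_i y_i x_i reduces to dual_coef_ @ support_vectors_
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual)   # should match ...
print(clf.coef_)     # ... the primal weight vector reported by scikit-learn
```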
A “Good” Separator
Noise in the Observations
Ruling Out Some Separators
Lots of Noise
Maximizing the Margin
“Fat” Separators
An example of VC dimension
• Suppose our model class is a hyperplane.
• In 2-D, we can find a plane (i.e. a line) to deal with any labeling of three points: a 2-D hyperplane shatters 3 points.
The Math
• Training instances
  – x ∈ ℝ^n
  – y ∈ {-1, 1}
• Decision function
  – f(x) = sign(<w, x> + b)
  – w ∈ ℝ^n
  – b ∈ ℝ
• Find w and b that
  – Perfectly classify training instances (assuming linear separability)
  – Maximize the margin
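As a concrete reading of this slide, the snippet below takes a candidate (w, b), evaluates f(x) = sign(<w, x> + b), and checks that every training instance is perfectly classified. The data and the candidate (w, b) are made up for illustration.

```python
import numpy as np

# Toy training instances (illustrative): x in R^2, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate separator (assumed for the example, not computed here)
w = np.array([1.0, 1.0])
b = 0.0

f = np.sign(X @ w + b)   # decision function f(x) = sign(<w, x> + b)
print("perfectly classified:", np.all(f == y))
print("margin constraints y_i(<w, x_i> + b):", y * (X @ w + b))  # all >= 1
```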
The Math
• f(x) = a sin(b x)
  – A classifier with only two parameters can still have very high capacity: by choosing the frequency b, a sine can fit any labeling of suitably placed points, so its VC dimension is infinite.
The probabilistic guarantee

  E_test ≤ E_train + [ ( h (log(2N/h) + 1) − log(p/4) ) / N ]^(1/2)

where N = size of training set
      h = VC dimension of the model class
      p = upper bound on probability that this bound fails
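To get a feel for how this bound behaves, the sketch below plugs sample values into the gap term; the numbers N, h, and p are arbitrary and only illustrate that the gap shrinks as N grows and grows with h.

```python
import math

def vc_bound_gap(N, h, p=0.05):
    """Gap term of the VC bound: sqrt((h*(log(2N/h) + 1) - log(p/4)) / N)."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(p / 4)) / N)

for N in (100, 1_000, 10_000):
    print(N, round(vc_bound_gap(N, h=10), 3))   # gap on E_test - E_train
```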
• The classification rule for a test point x^test uses only the support vectors:
  bias + Σ_{s ∈ SV} w_s K(x^test, x^s) ≥ 0
  where SV is the set of support vectors.
• Radial basis function (Gaussian) kernel:
  K(x, y) = exp( −||x − y||² / (2σ²) )
  – σ is a parameter that the user must choose
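A direct transcription of the radial basis function kernel as code; sigma is the user-chosen parameter the slide refers to, and the two test points are arbitrary.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian / RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

print(rbf_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))   # far apart -> close to 0
print(rbf_kernel([1.0, 1.0], [1.0, 1.0], sigma=0.5))   # identical  -> exactly 1
```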
               GP on the pixels   GP on top-level features   GP on top-level features with fine-tuning
  100 labels        22.2                 17.9                              15.2
  500 labels        17.2                 12.7                               7.2
  1000 labels       16.3                 11.2                               6.4
• Maximize over α
  – W(α) = Σi αi − 1/2 Σi,j αi αj yi yj <xi, xj>
• Subject to
  – αi ≥ 0
  – Σi αi yi = 0
• Decision function
  – f(x) = sign(Σi αi yi <x, xi> + b)
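This dual can be handed to a generic solver. The sketch below (an illustration added here, not from the slides) uses scipy.optimize to maximize W(α) under the two constraints on a toy data set, then recovers w = Σi αi yi xi and b from the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                               # Gram matrix of <x_i, x_j>

def neg_W(a):                             # minimize -W(alpha)
    return -(a.sum() - 0.5 * np.sum(np.outer(a * y, a * y) * K))

cons = ({"type": "eq", "fun": lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * len(y)                      # alpha_i >= 0
res = minimize(neg_W, x0=np.zeros(len(y)), bounds=bounds, constraints=cons)
alpha = res.x

w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                         # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)            # from y_i(<w, x_i> + b) = 1 on the SVs
print(np.sign(X @ w + b))                 # recovers the training labels
```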
Strengths of SVMs
Image from http://www.atrandomresearch.com/iclass/
Kernel Methods
Making the Non-Linear Linear
When Linear Separators Fail
[Figure: X and O points that are not linearly separable in the original x1–x2 space; plotting x1² against x1 makes them separable]
Mapping into a New Feature Space
• Define Φ: x → X = Φ(x)
  – e.g. Φ(x1, x2) = (x1, x2, x1², x2², x1x2)
• Rather than run the SVM on xi, run it on Φ(xi)
• Find a non-linear separator in input space
• What if Φ(xi) is really big?
• Use kernels to compute it implicitly!
Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
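A small sketch of the idea (the data, the helper name phi, and the use of scikit-learn are assumptions added here): map each point through Φ(x1, x2) = (x1, x2, x1², x2², x1x2) and run a linear SVM in the new feature space, which yields a non-linear separator in the original space.

```python
import numpy as np
from sklearn.svm import SVC

def phi(X):
    """Explicit feature map Phi(x1, x2) = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Data that is not linearly separable in input space: one class inside, one outside a ring
rng = np.random.default_rng(0)
radius = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])
angle = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
y = np.concatenate([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=10.0).fit(phi(X), y)   # linear SVM on Phi(x_i)
print("training accuracy:", clf.score(phi(X), y))   # separable in the new feature space
```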
Kernels
• In the dual, the data appears only through inner products. These are kernels, so by the kernel trick we just replace them, e.g. with (<xi, xj> + 1)^4:
• Maximize over α
  – W(α) = Σi αi − 1/2 Σi,j αi αj yi yj (<xi, xj> + 1)^4
• Subject to
  – αi ≥ 0
  – Σi αi yi = 0
• Decision function
  – f(x) = sign(Σi αi yi (<x, xi> + 1)^4 + b)
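The replacement described on this slide can be made literal in code: scikit-learn accepts a callable kernel, so K(x, z) = (<x, z> + 1)^4 can be passed directly and the expanded feature vectors are never formed. The data set and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def poly4_kernel(A, B):
    """The kernel trick: K(x, z) = (<x, z> + 1)^4 for all pairs of rows of A and B."""
    return (A @ B.T + 1) ** 4

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])    # an XOR-like labeling, not linearly separable

custom = SVC(kernel=poly4_kernel).fit(X, y)
builtin = SVC(kernel="poly", degree=4, gamma=1.0, coef0=1.0).fit(X, y)   # same kernel

# Both models solve the same dual, so their decision values agree
# up to the solver's numerical tolerance.
print(custom.decision_function(X[:3]))
print(builtin.decision_function(X[:3]))
```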
Conclusion