Support Vector Machine
• In logistic regression, we take the output of the linear function and squash the value within the range [0, 1] using the sigmoid function.
• If the squashed value is greater than a threshold value (0.5), we assign it the label 1; otherwise we assign it the label 0.
• In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class, and if the output is less than −1, we identify it with the other class.
• Since the threshold values are changed to 1 and −1 in SVM, we obtain this reinforcement range of values ([−1, 1]), which acts as a margin (see the sketch below).
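A minimal sketch of this decision rule in Python (the weight vector w, bias b, and test point below are made-up values for illustration, not from the slides):

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical hyperplane normal
b = -0.5                    # hypothetical bias

def svm_predict(x):
    # Classify by the sign of the linear score wTx + b.
    return 1 if np.dot(w, x) + b >= 0 else -1

def in_margin(x):
    # True if the point falls inside the reinforcement band (-1, 1).
    return -1 < np.dot(w, x) + b < 1

x = np.array([1.0, 0.25])
print(svm_predict(x), in_margin(x))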
Maximum Margin: Formalization
w: decision hyperplane normal vector
Margin
• Distance from an example to the separator is r = y(wTx + b)/|w|
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between support vectors of classes.
Derivation of finding r:
• The dotted line x′ – x is perpendicular to the decision boundary, so it is parallel to w. The unit vector is w/|w|, so the line is rw/|w|.
• x′ = x – yrw/|w|
• x′ satisfies wTx′ + b = 0, so
wT(x – yrw/|w|) + b = 0
• Recall that |w| = sqrt(wTw), so
wTx – yr|w| + b = 0
• Solving for r gives:
r = y(wTx + b)/|w|
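As a quick numeric check of this formula (toy values assumed for illustration), the signed distance r = y(wTx + b)/|w| can be computed directly:

import numpy as np

w = np.array([3.0, 4.0])   # toy normal vector, |w| = 5
b = -5.0
x = np.array([3.0, 4.0])   # a point with label y = +1
y = 1

r = y * (np.dot(w, x) + b) / np.linalg.norm(w)
print(r)   # (25 - 5) / 5 = 4.0, the distance from x to the hyperplane wTx + b = 0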
Linear SVM Mathematically: the linearly separable case
• Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(xi, yi)}:
wTxi + b ≥ 1 if yi = 1
wTxi + b ≤ −1 if yi = −1
• Since each example's distance from the hyperplane is r = y(wTxi + b)/|w|, the margin is ρ = 2/|w|.
• Hyperplane: wTx + b = 0
• The closest points on either side satisfy wTxa + b = 1 and wTxb + b = −1
• Extra scale constraint: mini=1,…,n |wTxi + b| = 1
• This implies:
wT(xa – xb) = 2
ρ = ||xa – xb||2 = 2/||w||2
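A small sketch (toy 2-D points chosen so the closest examples lie exactly on wTx + b = ±1) that checks the constraints yi(wTxi + b) ≥ 1 and computes the resulting margin ρ = 2/||w||:

import numpy as np

w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [1.0, 0.0],    # negative class
              [2.0, 2.0], [3.0, 2.0]])   # positive class
y = np.array([-1, -1, 1, 1])

margins = y * (X @ w + b)
print(margins)                  # all >= 1; support vectors give exactly 1
print(2 / np.linalg.norm(w))    # margin width rho = 2/||w|| = sqrt(2)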
Solving the Optimization Problem
Find w and b such that
Φ(w) = ½ wTw is minimized;
and for all {(xi, yi)}: yi(wTxi + b) ≥ 1
• The solution classifier has the form: f(x) = Σαi yi xiTx + b
• Notice that it relies on an inner product between the test point x and the support vectors xi
• We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all pairs of training points.
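As an illustrative sketch (using scikit-learn, which the slides do not mention, so treat the library calls as an assumption), one can fit a linear SVM and confirm that f(x) can be evaluated purely from inner products with the support vectors:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])   # toy data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_test = np.array([2.5, 1.5])
# dual_coef_ stores alpha_i * y_i for the support vectors.
f = clf.dual_coef_ @ (clf.support_vectors_ @ x_test) + clf.intercept_
print(f, clf.decision_function([x_test]))   # the two values agree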
Classification with SVMs
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with
non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
Find α1…αN such that
Q(α) = Σαi – ½ ΣΣ αiαj yiyj xiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
The solution is again of the form f(x) = Σαi yi xiTx + b
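A minimal sketch (made-up data and candidate multipliers, for illustration only) of evaluating the dual objective Q(α) and its constraints with numpy:

import numpy as np

X = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([1.0, 0.0, 1.0, 0.0])   # candidate multipliers with sum(alpha*y) = 0
C = 10.0

K = X @ X.T                                        # Gram matrix of inner products xiTxj
Q = alpha.sum() - 0.5 * alpha @ (np.outer(y, y) * K) @ alpha
feasible = np.isclose(alpha @ y, 0) and bool(np.all((alpha >= 0) & (alpha <= C)))
print(Q, feasible)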
Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great:
[Figure: data points along the x-axis]
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
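A tiny illustration (made-up 1-D data, and the feature map φ(x) = (x, x²) is a hypothetical choice) of how mapping to a higher-dimensional space can make a non-separable dataset linearly separable:

import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1, -1, -1, -1, 1])        # not separable by any threshold on x alone

phi = np.column_stack([x, x ** 2])      # phi(x) = (x, x^2)
w, b = np.array([0.0, 1.0]), -2.0       # in the mapped space, x^2 = 2 separates the classes
print(np.sign(phi @ w + b) == y)        # [ True  True  True  True  True ]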
The “Kernel Trick”
• The linear classifier relies on an inner product between vectors K(xi, xj) = xiTxj
• If every datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the
inner product becomes:
K(xi, xj) = φ(xi)Tφ(xj)
• A kernel function is some function that corresponds to an inner product in some expanded feature
space.
• Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiTxj)²
Need to show that K(xi, xj) = φ(xi)Tφ(xj):
K(xi, xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)Tφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
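The identity above can be checked numerically; a short sketch with arbitrary example vectors:

import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel K(x, z) = (1 + xTz)^2 in 2-D.
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print((1 + xi @ xj) ** 2)     # kernel computed directly: 4.0
print(phi(xi) @ phi(xj))      # inner product in the expanded space: 4.0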
Kernels
Common kernels
• Linear: K(x, z) = xTz
• Polynomial: K(x, z) = (1 + xTz)^d
• Gives feature conjunctions
• Radial basis function (infinite dimensional space)
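A brief sketch (numpy, with assumed parameter values d = 3 and σ = 1) of evaluating these common kernels on two vectors:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

k_linear = x @ z                                             # linear kernel xTz
k_poly   = (1 + x @ z) ** 3                                  # polynomial, degree d = 3
k_rbf    = np.exp(-np.sum((x - z) ** 2) / (2 * 1.0 ** 2))    # RBF with sigma = 1
print(k_linear, k_poly, k_rbf)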