Lecture Notes: Geometry of Support Vector Machines and Kernel Trick
Statistics & Discrete Methods of Data Sciences CS395T(51800), CSE392(63625), M393C(54377)
MW 11:00a.m.-12:30p.m. GDC 5.304, [email protected]
A separating hyperplane $w \cdot x = b$ classifies every training point $(x_i, y_i)$, $y_i \in \{-1, +1\}$, correctly if
\[
y_i (w \cdot x_i - b) > 0, \quad i = 1, \dots, m.
\]
If such a hyperplane exists, it is not unique. In real-world classification problems it is quite likely that one would require non-linear separators with a reasonable complexity vs. accuracy tradeoff.
Since the training data are merely samples of the instance space, and not necessarily adequate "representative" samples, doing well on the training data does not guarantee (or even imply) that one will do well on the entire instance space. A related issue is that the training data distribution is unknown; in contrast to classical statistical inference, we do not estimate this unknown distribution. Nevertheless, optimal learning algorithms can be developed without first estimating the distribution.
Note that
\[
\operatorname*{argmax}_{w} \frac{2}{\sqrt{w \cdot w}}
= \operatorname*{argmax}_{w} \frac{2}{\|w\|_2}
= \operatorname*{argmax}_{w} \frac{2}{\|w\|_2^2}
= \operatorname*{argmin}_{w} \frac{1}{2}\,(w \cdot w)
\]
Then we could write down the following optimization problem that SVM seeks to solve:
\[
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2}\, w \cdot w \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) - 1 \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{1}
\]
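As a concrete illustration, here is a minimal Python sketch of problem (1); it assumes the cvxpy and numpy packages are available and uses a small hypothetical, linearly separable toy data set.

    # Sketch: hard-margin SVM primal (1), assuming cvxpy and numpy are available.
    import numpy as np
    import cvxpy as cp

    # Hypothetical separable toy data: rows of X are instances x_i, labels y_i in {-1, +1}.
    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    m, n = X.shape

    w = cp.Variable(n)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2) w . w
    constraints = [cp.multiply(y, X @ w - b) >= 1]        # y_i (w . x_i - b) - 1 >= 0
    cp.Problem(objective, constraints).solve()

    print("w =", w.value, "b =", b.value)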
Noisy Data Case. One relaxes the SVM problem to a "soft" margin, so that separability is required to hold only up to some error:
\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{2}
\]
With $\xi_i > 0$, a point can lie inside the margin. Note that $\sum_{i=1}^{m} \xi_i$ is $\|\xi\|_1$, i.e. the $L_1$ norm; this promotes sparsity and thus sparse errors. The ideal penalty would be based on the number of errors, i.e. the $L_0$ norm $\|\xi\|_0 = |\{i : \xi_i > 0\}|$, which directly minimizes the number of errors. However, the $L_0$ norm is non-convex; the $L_1$ norm is its convex relaxation.
Keeping the $L_1$ penalty, the soft-margin SVM problem is therefore
\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad i = 1, \dots, m \\
& \xi_i \ge 0, \quad i = 1, \dots, m
\end{aligned}
\tag{3}
\]
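A similar sketch for the soft-margin problem (3) adds the slack variables $\xi$ and the penalty weight $\nu$; cvxpy, the toy data, and the choice $\nu = 1$ are again assumptions made only for illustration.

    # Sketch: soft-margin SVM primal (3) with slack variables, assuming cvxpy/numpy.
    import numpy as np
    import cvxpy as cp

    # Hypothetical data that are NOT linearly separable (last point sits among the positives).
    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5], [1.5, 1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
    m, n = X.shape
    nu = 1.0                                      # penalty weight (arbitrary choice)

    w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(m)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + nu * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()

    print("slacks xi =", np.round(xi.value, 3))   # points with xi_i > 0 violate the margin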
Introducing Lagrange multipliers $\mu_i \ge 0$ for the margin constraints and $\delta_i \ge 0$ for the constraints $\xi_i \ge 0$, the Lagrangian is
\[
L(w, b, \xi, \mu, \delta) = \frac{1}{2}\, w \cdot w + \nu \sum_{i=1}^{m} \xi_i
- \sum_{i=1}^{m} \mu_i \big[ y_i (w \cdot x_i - b) - 1 + \xi_i \big]
- \sum_{i=1}^{m} \delta_i \xi_i .
\]
Setting its partial derivatives to zero gives
\[
\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \mu_i y_i x_i = 0 \;\Longrightarrow\; w = \sum_{i=1}^{m} \mu_i y_i x_i
\]
\[
\frac{\partial L}{\partial b} = \sum_{i=1}^{m} \mu_i y_i = 0
\]
\[
\frac{\partial L}{\partial \xi_i} = \nu - \mu_i - \delta_i = 0 \;\Longrightarrow\; 0 \le \mu_i \le \nu, \quad i = 1, 2, \dots, m
\]
When $\mu_i = 0$, i.e. $y_i (w \cdot x_i - b) > 1 - \xi_i$, instance $x_i$ is classified correctly and is not a boundary point. When $\mu_i > 0$, i.e. $y_i (w \cdot x_i - b) = 1 - \xi_i$, then $x_i$ is a boundary point with margin error $\xi_i \ge 0$ kept as small as possible. These boundary points are the support vectors, and $w$ is determined by them.
Substituting $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial \xi_i} = 0$ back into the Lagrangian yields the following dual formulation:
\[
\begin{aligned}
\max_{\mu} \quad & \sum_{i=1}^{m} \mu_i - \frac{1}{2} \sum_{i,j} (y_i y_j \, x_i \cdot x_j)\, \mu_i \mu_j \\
\text{s.t.} \quad & 0 \le \mu_i \le \nu, \quad i = 1, \dots, m \\
& \sum_{i=1}^{m} y_i \mu_i = 0
\end{aligned}
\tag{4}
\]
If we denote by $\mathbf{1}$ the vector with all entries equal to 1, then the maximization problem can be written as
\[
\max_{\mu} \; \mu^{T} \mathbf{1} - \frac{1}{2}\, \mu^{T} M \mu
\]
with Gram matrix $M_{ij} = y_i y_j \, x_i \cdot x_j$, which is a Positive Semi-Definite (PSD) matrix.
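The dual (4) can likewise be solved as a small quadratic program. A minimal sketch, again assuming cvxpy, numpy, and hypothetical toy data: it builds $M$, solves for $\mu$, recovers $w = \sum_i \mu_i y_i x_i$, and marks the support vectors as the points with $\mu_i > 0$.

    # Sketch: dual soft-margin SVM (4) in the matrix form above, assuming cvxpy/numpy.
    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5], [1.5, 1.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])   # hypothetical toy data
    m = X.shape[0]
    nu = 1.0

    A = y[:, None] * X                           # rows are y_i x_i, so M = A A^T
    mu = cp.Variable(m)
    # mu^T M mu = || A^T mu ||^2, which keeps the objective in an explicitly concave form.
    objective = cp.Maximize(cp.sum(mu) - 0.5 * cp.sum_squares(A.T @ mu))
    constraints = [mu >= 0, mu <= nu, y @ mu == 0]
    cp.Problem(objective, constraints).solve()

    w = (mu.value * y) @ X                       # w = sum_i mu_i y_i x_i
    support = np.where(mu.value > 1e-6)[0]       # support vectors: mu_i > 0
    print("w =", np.round(w, 3), "support vectors:", support)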
The reason to introduce the dual problem is the following: the dual form of the SVM is simpler than the primal SVM, and its key feature is that the objective function is now expressed entirely through inner products of data instance pairs $\langle x_i, x_j \rangle$.
Consider a feature map
\[
\phi(x) : \mathbb{R}^n \longrightarrow \mathbb{R}^p, \quad p > n.
\]
The functional space formulation of a kernel space is a Hilbert space $\mathcal{H} = \{K : \mathbb{R}^n \times \mathbb{R}^n \longrightarrow \mathbb{R} \text{ defines an inner product}\}$. Here $K$ is given by $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
Algorithms whose computations on input vectors can be expressed purely through inner products between those vectors are amenable to the kernel trick, in which $x_i \cdot x_j$ is replaced by $\phi^{T}(x_i)\, \phi(x_j) = K(x_i, x_j)$.
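As an illustrative sketch (with an assumed polynomial kernel and hypothetical data), the only change to the dual above is that the Gram matrix and the decision function are built from $K(x_i, x_j)$ instead of $x_i \cdot x_j$:

    # Sketch: kernelized dual ingredients, assuming numpy; K replaces the raw inner product.
    import numpy as np

    def K(u, v, d=2):
        """Homogeneous polynomial kernel (u . v)^d, standing in for phi(u)^T phi(v)."""
        return (u @ v) ** d

    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])          # hypothetical data
    m = len(y)

    # Kernel Gram matrix for the dual: M_ij = y_i y_j K(x_i, x_j).
    M = np.array([[y[i] * y[j] * K(X[i], X[j]) for j in range(m)] for i in range(m)])

    # Given dual variables mu (e.g. from the QP above), the classifier never needs w explicitly:
    #   f(x) = sum_i mu_i y_i K(x_i, x) - b
    def decision(x, mu, b):
        return sum(mu[i] * y[i] * K(X[i], x) for i in range(m)) - b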
We know the Gram matrix $M$ in the dual formulation is Positive Semi-Definite (PSD) since $M = Q^{T} Q$, where
\[
Q = \begin{pmatrix} y_1 x_1 & y_2 x_2 & \cdots & y_m x_m \end{pmatrix}.
\]
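A quick numerical confirmation of this fact, assuming numpy and the same kind of hypothetical data:

    # Numerical confirmation that M = Q^T Q is PSD: its eigenvalues are all >= 0.
    import numpy as np

    X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])          # hypothetical data
    Q = (y[:, None] * X).T                        # columns of Q are y_i x_i
    M = Q.T @ Q                                   # Gram matrix M_ij = y_i y_j x_i . x_j
    print(np.linalg.eigvalsh(M) >= -1e-10)        # all True, up to round-off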
(III) For every finite set $\{x_i\}_{i=1}^{q}$ and all $q$, the matrix $K$ with $K_{ij} = K(x_i, x_j)$ is PSD.
To better understand the relationship between feature maps and kernels, consider:
• Homogeneous Polynomial Kernel: $x, y \in \mathbb{R}^s \Longrightarrow k(x, y) = (x^T y)^d = \big( \sum_{i=1}^{s} x_i y_i \big)^d$, $d > 0$.
The feature map can be defined as a vector in a $\binom{s+d-1}{d}$-dimensional vector space,
\[
\phi(x) \equiv \left( \sqrt{\binom{d}{n_1 \cdots n_s}}\; x_1^{n_1} \cdots x_s^{n_s} \right), \qquad \sum_{i=1}^{s} n_i = d, \quad n_i \ge 0.
\]
One constructs the Gram matrix of kernel values as follows: $M_{ij} = K(x_i, x_j) = (x_i^T x_j)^d$.
When $s = d = 2$, $(x^T y)^2 = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2$; we pick $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ and thereby create the Gram matrix with $M_{ij} = (x_i^T x_j)^2$.
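A quick numerical check of this identity, assuming numpy; the two vectors are arbitrary.

    # Check that (x^T y)^2 equals phi(x) . phi(y) for the s = d = 2 feature map above.
    import numpy as np

    def phi(x):
        # phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
        return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

    x = np.array([1.0, 3.0])
    y = np.array([2.0, 1.0])
    print((x @ y) ** 2, phi(x) @ phi(y))   # both print 25.0: (2 + 3)^2 = 25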
• Non-Homogeneous Polynomial Kernel: all monomials of degree $\le d$,
\[
K(x, y) = (x^T y + \alpha)^d = \left( x_1 y_1 + x_2 y_2 + \cdots + x_k y_k + \sqrt{\alpha}\,\sqrt{\alpha} \right)^d .
\]
Again consider the case $s = d = 2$, so that $K(x, y) = (x_1 y_1 + x_2 y_2 + \alpha)^2$: the map $\phi : \mathbb{R}^2 \longrightarrow \mathbb{R}^6$ sends a conic curve in the measurement plane to a hyperplane in the six-dimensional feature space.
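One explicit choice of such a map, written here as an assumption consistent with the expansion of $(x^T y + \alpha)^2$ rather than a form taken from the notes, together with a numerical check assuming numpy:

    # One possible phi: R^2 -> R^6 with phi(x) . phi(y) = (x^T y + alpha)^2 (assumed form).
    import numpy as np

    alpha = 1.0   # arbitrary choice of the constant offset

    def phi6(x):
        x1, x2 = x
        return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                         np.sqrt(2 * alpha) * x1, np.sqrt(2 * alpha) * x2, alpha])

    x = np.array([1.0, 3.0])
    y = np.array([2.0, 1.0])
    print((x @ y + alpha) ** 2, phi6(x) @ phi6(y))   # both evaluate to 36.0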