
Support Vector Machine (SVM)

DSA5103 Lecture 5

Yangjing Zhang
12-Sep-2023
NUS
Today’s content

1. SVM
2. Lagrange duality and KKT
3. Dual of SVM and kernels
4. SVM with soft constraints

lecture5 1/43
SVM
Idea of support vector machine (SVM)

• Data: xi ∈ Rp, yi ∈ {−1, 1} (instead of {0, 1} in logistic regression), i = 1, . . . , n
• The two classes are assumed to be linearly separable
• Aim: learn a linear classifier f(x) = sign(βᵀx + β0)

• Question: what is the “best” separating hyperplane?
• SVM answer: the hyperplane with maximum margin.
• Margin = the distance to the closest data points.

lecture5 2/43
Maximum margin separating hyperplane

For the separating hyperplane with maximum margin,


distance to points in positive class = distance to points in negative class

lecture5 3/43
Normal cone of a hyperplane

Hyperplane H = Hβ,β0 = {x ∈ Rp | βᵀx + β0 = 0}
• a linear decision boundary
• a (p − 1)-dimensional affine subspace, closed, convex

Figure 1: Left: p = 2, H is a line. Right: p = 3, H is a plane.

• For any x̄ ∈ H, the normal cone is NH(x̄) = {λβ | λ ∈ R}; it is 1-dimensional. We can show that β ∈ NH(x̄), i.e., ⟨β, z − x̄⟩ ≤ 0 ∀ z ∈ H.
  This is true since z, x̄ ∈ H ⇒ βᵀz + β0 = 0 and βᵀx̄ + β0 = 0 ⇒ βᵀ(z − x̄) = 0. Obviously, we also have −β ∈ NH(x̄).
lecture5 4/43
Distance of a point to a hyperplane

Compute the distance of a point x to a hyperplane H = {x ∈ Rp | βᵀx + β0 = 0}.

1. x̄ = ΠH(x) ⇐⇒ x − x̄ ∈ NH(x̄) ⇐⇒ x − x̄ = λβ for some λ ∈ R (projection onto a convex set, Lecture 4, page 24)
2. x̄ ∈ H ⇒ βᵀx̄ + β0 = 0 ⇒ βᵀ(x − λβ) + β0 = 0 ⇒ λ = (βᵀx + β0) / (βᵀβ)
3. x − x̄ = ((βᵀx + β0) / (βᵀβ)) β, so ‖x − x̄‖ = |βᵀx + β0| / ‖β‖

The distance of a point x to the hyperplane H is |βᵀx + β0| / ‖β‖; it is invariant to scaling of the parameters β, β0.
lecture5 5/43
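As a quick numerical check (a sketch added here, not from the slides; the numbers are arbitrary), the distance formula can be compared against an explicit projection onto H:

# Check |beta^T x + beta0| / ||beta|| against the explicit projection of x onto H.
import numpy as np

beta = np.array([3.0, 4.0])
beta0 = -5.0
x = np.array([2.0, 7.0])

lam = (beta @ x + beta0) / (beta @ beta)       # lambda from step 2
x_bar = x - lam * beta                         # projection of x onto H
dist = abs(beta @ x + beta0) / np.linalg.norm(beta)

print(np.isclose(beta @ x_bar + beta0, 0.0))           # x_bar lies on H
print(np.isclose(np.linalg.norm(x - x_bar), dist))     # distances agree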
Maximize margin

• Margin: γ = γ(β, β0) = min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
• All data points must lie on the correct side:
  βᵀxi + β0 ≥ 0 when yi = 1,   βᵀxi + β0 ≤ 0 when yi = −1
  ⇐⇒ yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n] = {1, . . . , n}
• Therefore, the optimization problem is
  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n]

lecture5 6/43
Simplify the optimization problem

  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0| / ‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i

⇐⇒

  max_{β,β0}  1/‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i
        min_{i=1,...,n} |βᵀxi + β0| = 1

• The hyperplane and the margin are scale invariant: (β, β0) → (cβ, cβ0) for any c ≠ 0
• If xk is the closest point to H, i.e., k = arg min_{i=1,...,n} |βᵀxi + β0|, we can scale β, β0 such that |βᵀxk + β0| = 1
lecture5 7/43
Simplify the optimization problem

  max_{β,β0}  1/‖β‖
  s.t.  yi(βᵀxi + β0) ≥ 0 ∀ i
        min_{i=1,...,n} |βᵀxi + β0| = 1

⇐⇒

  min_{β,β0}  ‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1 ∀ i

• “⇒” Note that yi ∈ {−1, 1}
• “⇐” Note that we minimize ‖β‖

lecture5 8/43
SVM

SVM is a quadratic programming (QP) problem — it can be solved by generic QP solvers:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, ∀ i ∈ [n]

• Later, we will discuss the Lagrangian duality and derive the dual
problem of the above
• The dual problem will play a key role in allowing us to use kernels
(introduced later)
• The dual problem will also allow us to derive an efficient algorithm better than generic QP solvers (especially when n ≪ p)

lecture5 9/43
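As a concrete illustration (my own sketch, not part of the lecture), the hard-margin QP above can be handed directly to a generic solver such as cvxpy; the toy data below are assumed to be linearly separable:

# Hard-margin SVM primal solved as a generic QP with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1
               rng.normal(+2, 0.5, (20, 2))])   # class +1
y = np.hstack([-np.ones(20), np.ones(20)])

beta = cp.Variable(2)
beta0 = cp.Variable()
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]   # y_i (beta^T x_i + beta0) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
prob.solve()

print(beta.value, beta0.value)    # maximum-margin hyperplane parameters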
Support vectors

Support vectors are the data points xi whose constraints are tight (active):
  yi(βᵀxi + β0) = 1

• Support vectors must exist
• Number of support vectors ≪ sample size n
• The resulting hyperplane may change if some support vectors are removed

lecture5 10/43
Lagrange duality and KKT
Lagrangian

Consider a general nonlinear programming problem (NLP), which is known as a primal problem:
  (P)  min_{x∈Rp}  f(x)
       s.t.  gi(x) = 0, i ∈ [m]
             hj(x) ≤ 0, j ∈ [l]

• Define the Lagrangian
    L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
  for v = [v1; . . . ; vm] ∈ Rm, u = [u1; . . . ; ul] ∈ Rl+.

• Define the Lagrange dual function (always concave)
    θ(v, u) = min_x L(x, v, u)

lecture5 11/43
Lagrangian

• In evaluating θ(v, u) for each v, u, we must solve
    min_x L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
  We may set ∂L/∂x = 0 if f, gi, hj are convex and differentiable
• The dual function θ is concave even when (P) is not convex — verify that
    θ(λv + (1 − λ)v′, λu + (1 − λ)u′) ≥ λθ(v, u) + (1 − λ)θ(v′, u′)
• Suppose x is a feasible point of (P). Then for any v ∈ Rm, u ∈ Rl+,
    θ(v, u) = min_x L(x, v, u) ≤ L(x, v, u)
            = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
            ≤ f(x)
  (the last inequality holds since gi(x) = 0 and uj hj(x) ≤ 0)

lecture5 12/43
Lagrangian dual problem

• The dual function is a lower bound of the primal function:
    θ(v, u) ≤ f(x)  for any dual feasible (v, u) (v ∈ Rm, u ∈ Rl+) and any primal feasible x
• We want to search for the largest lower bound — leading to the Lagrangian dual problem
    (D)  max_{v,u}  θ(v, u)
         s.t.  v ∈ Rm, u ∈ Rl+
  Here vi, uj are called dual variables or Lagrange multipliers.

lecture5 13/43
Primal and dual

Definition (Lagrangian dual problem)


For a primal nonlinear programming problem (P)
  (P)  min_{x∈Rp}  f(x)
       s.t.  gi(x) = 0, i ∈ [m]
             hj(x) ≤ 0, j ∈ [l]
the Lagrangian dual problem (D) is the following nonlinear programming problem
  (D)  max_{v,u}  θ(v, u) = min_x { f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x) }
       s.t.  v ∈ Rm, u ∈ Rl+

• Weak duality: optimal value of (D) ≤ optimal value of (P)
• Under certain assumptions (see page 21), strong duality: optimal value of (D) = optimal value of (P)
lecture5 14/43
Example

Find the dual problem of the convex program
  min  x1² + x2²               f(x) = x1² + x2²
  s.t.  x1 + x2 ≥ 4            h1(x) = 4 − x1 − x2 ≤ 0  ← u1 ≥ 0

Solution. For u1 ≥ 0, the Lagrangian is
  L(x1, x2, u1) = f(x) + u1 h1(x) = x1² + x2² + u1(4 − x1 − x2)
The dual function is
  θ(u1) = min_{x1,x2}  x1² + x2² + u1(4 − x1 − x2)
        = 4u1 + min_{x1} {x1² − u1 x1} + min_{x2} {x2² − u1 x2}
        = 4u1 − u1²/2    (attained at x1 = u1/2, x2 = u1/2)
The dual problem is
  max  4u1 − u1²/2
  s.t.  u1 ≥ 0
Its optimum is u1 = 4 with value 8, which matches the primal optimum x1 = x2 = 2 — an instance of strong duality.
lecture5 15/43
Example: LP

Consider the linear programming (LP) problem in standard form
  min_x  cᵀx
  s.t.   Ax = b
         x ≥ 0
where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and x ≥ 0 means xi ≥ 0, i ∈ [n]. Find the dual function and dual problem.

Solution. Let v ∈ Rm and u ∈ Rn+. The dual function is
  θ(v, u) = min_x {cᵀx + vᵀ(b − Ax) − uᵀx} = vᵀb + min_x {xᵀ(c − Aᵀv − u)}
          = vᵀb   if c − Aᵀv − u = 0
          = −∞    otherwise
Dual problem:
  max_{v,u}  bᵀv                          max_v  bᵀv
  s.t.  Aᵀv + u = c          i.e.,        s.t.  Aᵀv ≤ c
        u ≥ 0
lecture5 16/43
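A small numerical illustration of this primal/dual pair (my own sketch with made-up data) using scipy's linprog:

# Standard-form LP and its dual, both solved with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
b = np.array([4.0, 3.0])
c = np.array([2.0, 3.0, 1.0])

# Primal: min c^T x  s.t.  Ax = b, x >= 0
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual: max b^T v  s.t.  A^T v <= c  (solved as min -b^T v)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print(primal.fun, -dual.fun)    # equal optimal values: strong duality holds for LP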
Example: LP

Consider the LP in standard inequality form
  min_x  cᵀx
  s.t.   Ax ≤ b
where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and the inequality in the constraint Ax ≤ b is interpreted component-wise. Find the dual function and dual problem.

Solution. Let u ∈ Rm+. The dual function is
  θ(u) = min_{x∈Rn} {cᵀx + uᵀ(Ax − b)} = −uᵀb + min_{x∈Rn} {xᵀ(c + Aᵀu)}
       = −uᵀb   if c + Aᵀu = 0
       = −∞     otherwise
Dual problem:
  max_u  −bᵀu
  s.t.   Aᵀu + c = 0
         u ≥ 0
lecture5 17/43
Example: Lasso

Consider the problem
  min_{β,z}  (1/2)‖z‖² + λ‖β‖₁
  s.t.  z + Xβ = Y
where X ∈ Rn×p, Y ∈ Rn, λ > 0. Find the dual function and dual problem.

Solution.
1. Let y ∈ Rn. The Lagrangian is
   L(β, z, y) = (1/2)‖z‖² + λ‖β‖₁ + ⟨y, Y − z − Xβ⟩
              = (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩
lecture5 18/43
Example: Lasso

2. The dual function is
   θ(y) = min_{β,z} L(β, z, y)
        = min_{β,z} { (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩ }
        = min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } + min_z { (1/2)‖z‖² − ⟨y, z⟩ } + ⟨y, Y⟩

• Set ∇z = z − y = 0 ⇒ z = y, so min_z { (1/2)‖z‖² − ⟨y, z⟩ } = −(1/2)‖y‖²
• min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } = −λ max_β { ⟨Xᵀy/λ, β⟩ − h(β) } = −λ h∗(Xᵀy/λ) = −δB1(Xᵀy/λ),
  where h(β) = ‖β‖₁, h∗ is its conjugate (Lecture 4, page 34), and B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}
lecture5 19/43
Example: Lasso

2. The dual function is
   θ(y) = −δB1(Xᵀy/λ) − (1/2)‖y‖² + ⟨y, Y⟩,   B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}

3. The dual problem is
   max_y  θ(y) = − min_y { δB1(Xᵀy/λ) + (1/2)‖y‖² − ⟨y, Y⟩ }
               = − min_y { (1/2)‖y‖² − ⟨y, Y⟩  s.t.  ‖Xᵀy‖∞ ≤ λ }
lecture5 20/43
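A numerical sanity check (my own sketch with random data, not part of the slides) that the primal and dual optimal values coincide:

# Lasso primal and the dual derived above, both solved with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, lam = 30, 10, 0.5
X, Y = rng.normal(size=(n, p)), rng.normal(size=n)

# Primal: min 0.5*||z||^2 + lam*||beta||_1  s.t.  z + X beta = Y
beta, z = cp.Variable(p), cp.Variable(n)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z) + lam * cp.norm1(beta)),
                    [z + X @ beta == Y])
primal.solve()

# Dual: max -0.5*||y||^2 + <y, Y>  s.t.  ||X^T y||_inf <= lam
yv = cp.Variable(n)
dual = cp.Problem(cp.Maximize(-0.5 * cp.sum_squares(yv) + Y @ yv),
                  [cp.norm(X.T @ yv, "inf") <= lam])
dual.solve()

print(primal.value, dual.value)    # equal up to solver tolerance: strong duality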
KKT

Assumptions:
1. f, hj : Rp → R differentiable and convex
2. gi : Rp → R affine (gi(x) = aiᵀx + bi)
3. Slater’s condition holds, i.e., there exists x̂ such that
   gi(x̂) = 0 ∀ i,   hj(x̂) < 0 ∀ j

Under the above assumptions, strong duality holds, and there exist a solution x∗ to (P) and a solution (v∗, u∗) to (D) satisfying the Karush-Kuhn-Tucker (KKT) conditions:
  (∂/∂x) L(x∗, v∗, u∗) = ∇f(x∗) + Σ_{i=1}^m vi∗ ∇gi(x∗) + Σ_{j=1}^l uj∗ ∇hj(x∗) = 0
  gi(x∗) = 0,  hj(x∗) ≤ 0,  uj∗ ≥ 0,  uj∗ hj(x∗) = 0,   ∀ i ∈ [m], j ∈ [l]

lecture5 21/43
KKT

• We say (x∗, v∗, u∗) (or simply x∗) is a KKT point or a KKT solution if (x∗, v∗, u∗) satisfies the KKT conditions
• Under the above assumptions, (x∗, v∗, u∗) is a KKT solution ⇐⇒ x∗ is an optimal solution to (P) and (v∗, u∗) is an optimal solution to (D)
• We call
    uj∗ hj(x∗) = 0, ∀ j ∈ [l]
  the complementary slackness condition. It implies
    uj∗ = 0 if hj(x∗) < 0,   hj(x∗) = 0 if uj∗ > 0
• For each pair hj(x∗) ≤ 0, uj∗ ≥ 0 with uj∗ hj(x∗) = 0:
  • If the constraint hj(x∗) ≤ 0 is slack (hj(x∗) < 0), then the constraint uj∗ ≥ 0 is active (uj∗ = 0)
  • If the constraint uj∗ ≥ 0 is slack (uj∗ > 0), then the constraint hj(x∗) ≤ 0 is active (hj(x∗) = 0)
lecture5 22/43
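As a concrete check (my own, reusing the earlier example min x1² + x2² s.t. x1 + x2 ≥ 4, whose solution is x∗ = (2, 2) with multiplier u1∗ = 4):

# Numerically verify the KKT conditions for the earlier example.
import numpy as np

x_star = np.array([2.0, 2.0])
u_star = 4.0

grad_f = 2 * x_star                    # gradient of f(x) = x1^2 + x2^2
grad_h = np.array([-1.0, -1.0])        # gradient of h1(x) = 4 - x1 - x2
h_val = 4.0 - x_star.sum()             # h1(x*) = 0: the constraint is active

print(np.allclose(grad_f + u_star * grad_h, 0.0))         # stationarity
print(h_val <= 0, u_star >= 0, u_star * h_val == 0.0)     # feasibility and complementary slackness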
Dual of SVM
Dual of SVM

Derive the dual of the following SVM problem
  min_{β,β0}  (1/2)‖β‖²
  s.t.  1 − yi(βᵀxi + β0) ≤ 0, ∀ i ∈ [n]

1. For α ∈ Rn+, the Lagrangian is
   L(β, β0, α) = (1/2)‖β‖² + Σ_{i=1}^n αi (1 − yi(βᵀxi + β0))

2. The dual function is
   θ(α) = min_{β,β0} L(β, β0, α)
        = min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n αi
lecture5 23/43
Dual of SVM

We need to solve the optimization problem
  min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n αi
Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0 and ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0, we obtain
  θ(α) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0
       = −∞                                                   otherwise

3. The dual problem is
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
lecture5 24/43
KKT of SVM

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, i ∈ [n]

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]

Verify the assumptions (on page 21) for strong duality and the existence of KKT points: (Slater’s condition) there exist β̂, β̂0 such that yi(β̂ᵀxi + β̂0) > 1, i ∈ [n]. This requires that the two classes are strictly separable.

Figure 2: Left: strictly separable. Right: separable but not strictly separable
lecture5 25/43
KKT of SVM

KKT conditions:
  Σ_{i=1}^n αi∗ yi xi = β∗,   Σ_{i=1}^n αi∗ yi = 0
  yi((β∗)ᵀxi + β0∗) ≥ 1, i ∈ [n]
  αi∗ ≥ 0, i ∈ [n]
  αi∗ (1 − yi((β∗)ᵀxi + β0∗)) = 0, i ∈ [n]

1. If we obtain a dual solution α∗ (by solving the SVM dual problem), then we can construct a primal solution (β∗, β0∗) from the KKT conditions:
   β∗ = Σ_{i=1}^n αi∗ yi xi
   β0∗ = yk − Σ_{i=1}^n αi∗ yi ⟨xi, xk⟩,   for some k satisfying αk∗ > 0

lecture5 26/43
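The sketch below (my own, not from the slides; it assumes linearly separable toy data) solves the dual with cvxpy and then recovers (β∗, β0∗) exactly as in point 1:

# Solve the SVM dual, then recover the primal solution from the KKT conditions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1
               rng.normal(+2, 0.5, (20, 2))])   # class +1
y = np.hstack([-np.ones(20), np.ones(20)])

alpha = cp.Variable(len(y))
# sum_i alpha_i - 0.5 * || sum_i alpha_i y_i x_i ||^2  (equals the double sum above)
objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y))
prob = cp.Problem(cp.Maximize(objective), [alpha >= 0, alpha @ y == 0])
prob.solve()

a = alpha.value
beta = (a * y) @ X                       # beta* = sum_i alpha_i* y_i x_i
k = int(np.argmax(a))                    # an index with alpha_k* > 0 (a support vector)
beta0 = y[k] - (a * y) @ (X @ X[k])      # beta0* = y_k - sum_i alpha_i* y_i <x_i, x_k>
print(beta, beta0)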
KKT of SVM

With the KKT conditions above:

2. If αi∗ > 0, then xi is a support vector, by the complementary slackness condition:
   αi∗ > 0 ⇒ yi((β∗)ᵀxi + β0∗) = 1
3. |{i | αi∗ > 0}| ≤ the number of support vectors ≪ n
   The dual solution is sparse (many αi∗ = 0)

lecture5 27/43
KKT of SVM

With the KKT conditions above:

4. Decision boundary:
   0 = (β∗)ᵀx + β0∗ = Σ_{i=1}^n αi∗ yi ⟨xi, x⟩ + β0∗ = Σ_{i: αi∗>0} αi∗ yi ⟨xi, x⟩ + β0∗
For a new test point x, the prediction only depends on ⟨xi, x⟩ for i with αi∗ > 0, namely the inner products between x and the support vectors.
lecture5 28/43
Primal vs Dual

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.  yi(βᵀxi + β0) ≥ 1, i ∈ [n]
  Classifier: f(x) = sign(βᵀx + β0)

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
  Classifier: f(x) = sign( Σ_{i=1}^n αi yi ⟨xi, x⟩ + β0 );  many αi’s are zero (sparse solution)

• Optimize p + 1 variables for the primal, n variables for the dual
• When n ≪ p, it might be more efficient to solve the dual
• The dual problem only involves ⟨xi, xj⟩ — allowing the use of kernels
lecture5 29/43
Feature mapping

• Recall feature expansion, for example, the i-th feature vector
    xi = (xi1, xi2)   → feature expansion →   (xi1, xi2, xi1², xi2², xi1 xi2)
• Let φ denote the feature mapping, which maps from original features to new features. For example,
    φ((z1, z2)) = (z1, z2, z1², z2², z1 z2)
• Instead of using the original feature vectors xi, i ∈ [n], we may apply SVM using the new features φ(xi), i ∈ [n]
• The new feature space can be very high dimensional
lecture5 30/43
Kernel

Recall the SVM primal and dual from the previous page. Using the dual:
• For feature expansion, simply replace ⟨xi, xj⟩ with ⟨φ(xi), φ(xj)⟩
• Given a feature mapping φ, we define the corresponding kernel
    K(a, b) = ⟨φ(a), φ(b)⟩,   a, b ∈ Rp
• Usually computing K(a, b) is very cheap, even though computing φ(a), φ(b) (high-dimensional vectors) may be expensive
• The dual of SVM only requires the computation of the kernel values K(xi, xj); explicitly calculating φ(xi) is not necessary
lecture5 31/43
Example: (homogeneous) polynomial kernel

For a, b ∈ Rp, consider
  K(a, b) = (aᵀb)²
It can be written as
  K(a, b) = (Σ_{i=1}^p ai bi)(Σ_{j=1}^p aj bj) = Σ_{i,j=1}^p ai aj bi bj = Σ_{i,j=1}^p (ai aj)(bi bj)
Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^{p²} is given by
  φ(a) = φ((a1, . . . , ap)) = (a1 a1, a1 a2, a1 a3, . . . , ap ap)
Computing φ(a): O(p²) operations; computing K(a, b): O(p) operations
lecture5 32/43
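A small numpy check (my own illustration) that the kernel value equals the inner product in the expanded feature space:

# Check that (a^T b)^2 equals <phi(a), phi(b)> with phi as above.
import numpy as np

def phi(a):
    # all p^2 pairwise products a_i a_j
    return np.outer(a, a).ravel()

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

print(np.isclose((a @ b) ** 2, phi(a) @ phi(b)))   # True; K costs O(p), phi costs O(p^2)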
Example: (inhomogeneous) polynomial kernel

Given c ≥ 0. For a, b ∈ Rp, consider
  K(a, b) = (aᵀb + c)² = Σ_{i,j=1}^p (ai aj)(bi bj) + Σ_{i=1}^p (√(2c) ai)(√(2c) bi) + c²
Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^{p²+p+1} is given by
  φ(a) = ( a1 a1, a1 a2, a1 a3, . . . , ap ap,   √(2c) a1, √(2c) a2, . . . , √(2c) ap,   c )ᵀ
           (second-order terms)                  (first-order terms)
The parameter c controls the relative weighting between the first-order and second-order terms.

lecture5 33/43
Common kernels

• Polynomials of degree d:   K(a, b) = (aᵀb)^d
• Polynomials up to degree d:   K(a, b) = (aᵀb + 1)^d
• Gaussian kernel — polynomials of all orders (recall e^x = Σ_{n=0}^∞ xⁿ/n!):
    K(a, b) = exp( −‖a − b‖² / (2σ²) ),   σ > 0

lecture5 34/43
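For reference, a short sketch (my own, not from the slides) computing kernel matrices for the kernels above:

# Kernel matrices K[i, j] = K(a_i, b_j) for the polynomial and Gaussian kernels.
import numpy as np

def polynomial_kernel(A, B, d=2, c=1.0):
    # (a^T b + c)^d; c = 0 gives the homogeneous polynomial kernel
    return (A @ B.T + c) ** d

def gaussian_kernel(A, B, sigma=1.0):
    # exp(-||a - b||^2 / (2 sigma^2)), via the expanded squared distance
    sq_dist = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-sq_dist / (2.0 * sigma**2))

X = np.random.default_rng(0).normal(size=(5, 3))
print(polynomial_kernel(X, X).shape, gaussian_kernel(X, X).shape)   # (5, 5) (5, 5)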
Kernel

• SVM can be applied in high-dimensional feature spaces, without explicitly applying the feature mapping
• The two classes might be separable in the high-dimensional space, but not separable in the original feature space
• Kernels can be used efficiently in the dual problem of SVM because the dual only involves inner products

lecture5 35/43
SVM with soft constraints
SVM with soft constraints

When the two classes are not separable, no feasible separating hyperplane exists. We allow the constraints to be violated slightly (C > 0 is given):
  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.  yi(βᵀxi + β0) ≥ 1 − εi, ∀ i ∈ [n]
        εi ≥ 0, i ∈ [n]
At the optimum,
  εi = 1 − yi(βᵀxi + β0)   if yi(βᵀxi + β0) < 1
     = 0                    if yi(βᵀxi + β0) ≥ 1
     = max{1 − yi(βᵀxi + β0), 0}
So SVM with soft constraints solves
  min_{β,β0}  (1/2)‖β‖²  +  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0}
             (ridge regularization)    (hinge-loss function)
lecture5 37/43
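The unconstrained hinge-loss form can be written directly as a function; a minimal numpy sketch (mine, not from the slides):

# 0.5*||beta||^2 + C * sum_i max{1 - y_i (beta^T x_i + beta0), 0}
import numpy as np

def soft_margin_objective(beta, beta0, X, y, C=1.0):
    margins = y * (X @ beta + beta0)          # y_i (beta^T x_i + beta0)
    eps = np.maximum(1.0 - margins, 0.0)      # the optimal slack eps_i from above
    return 0.5 * beta @ beta + C * eps.sum()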
Logistic regression

Recall that (Lecture 3, page 19)
  logistic-loss = Σ_{i=1}^n log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi)
Equivalently, for each data point,
  logistic-loss = log(1 + e^{−(βᵀxi + β0)})   if yi = 1
                = log(1 + e^{βᵀxi + β0})       if yi = 0

Change the label yi = 0 → yi = −1:
  logistic-loss = log(1 + e^{−yi(βᵀxi + β0)}),   yi ∈ {−1, 1}

Logistic regression with ridge regularization:
  min_{β,β0}  Σ_{i=1}^n log(1 + e^{−yi(βᵀxi + β0)}) + λ‖β‖²
lecture5 38/43
SVM vs. logistic regression

SVM with soft constraints:
  min_{β,β0}  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0} + (1/2)‖β‖²
Logistic regression with ridge regularization:
  min_{β,β0}  Σ_{i=1}^n log(1 + e^{−yi(βᵀxi + β0)}) + λ‖β‖²

Hinge-loss: max{1 − z, 0}, with z = yi(βᵀxi + β0); hope z ≥ 1.
Logistic-loss: log(1 + e^{−z}), with z = yi(βᵀxi + β0); hope z ≫ 0.

Figure: hinge loss and logistic loss as functions of z — the logistic loss is a “smoothed version” of the hinge loss.
lecture5 39/43
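A tiny sketch (my own) evaluating both losses on a grid of margins z:

# Compare hinge loss and logistic loss at a few values of z = y_i (beta^T x_i + beta0).
import numpy as np

z = np.linspace(-2, 2, 5)
hinge = np.maximum(1.0 - z, 0.0)       # max{1 - z, 0}
logistic = np.log1p(np.exp(-z))        # log(1 + e^{-z})
print(np.column_stack([z, hinge, logistic]))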
SVM with soft constraints: dual

SVM with soft constraints:
  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.  1 − εi − yi(βᵀxi + β0) ≤ 0, ∀ i ∈ [n]
        −εi ≤ 0, i ∈ [n]

Find the dual problem.

1. For α ∈ Rn+, r ∈ Rn+, the Lagrangian is
   L(β, β0, ε, α, r) = (1/2)‖β‖² + C Σ_{i=1}^n εi + Σ_{i=1}^n αi (1 − εi − yi(βᵀxi + β0)) − Σ_{i=1}^n ri εi
                     = (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi

lecture5 40/43
SVM with soft constraints: dual

2. The dual function is
   θ(α, r) = min_{β,β0,ε} L(β, β0, ε, α, r)
           = min_{β,β0,ε}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − Σ_{i=1}^n αi yi β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi
Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0,  ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0,  ∂L/∂εi = C − αi − ri = 0,
we obtain
   θ(α, r) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0 and αi + ri = C ∀ i
           = −∞                                                   otherwise

lecture5 41/43
SVM with soft constraints: dual

3. The dual problem is  max_{α,r} { θ(α, r) | α ∈ Rn+, r ∈ Rn+ }

⇐⇒  max_{α,r}  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
     s.t.  Σ_{i=1}^n αi yi = 0
           αi + ri = C, i ∈ [n]
           α ∈ Rn+, r ∈ Rn+

⇐⇒  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
     s.t.  Σ_{i=1}^n αi yi = 0
           0 ≤ αi ≤ C, i ∈ [n]
lecture5 42/43
In practice

You are encouraged to learn two popular open source machine learning
libraries:
LIBLINEAR https://www.csie.ntu.edu.tw/~cjlin/liblinear/
LIBSVM https://www.csie.ntu.edu.tw/~cjlin/libsvm/

lecture5 43/43
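If you work in Python, scikit-learn exposes both libraries: SVC wraps LIBSVM (kernels supported) and LinearSVC wraps LIBLINEAR. A minimal usage sketch (mine, with made-up data):

# Soft-margin SVMs in scikit-learn; C is the soft-constraint parameter from the slides.
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])     # overlapping classes: not separable

kernel_svm = SVC(kernel="rbf", C=1.0).fit(X, y)    # Gaussian-kernel SVM (LIBSVM)
linear_svm = LinearSVC(C=1.0).fit(X, y)            # linear SVM (LIBLINEAR)

print(kernel_svm.n_support_)                       # support vectors per class
print(linear_svm.coef_, linear_svm.intercept_)     # linear decision boundary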
