
Support Vector Machine (SVM)

Nguyen Minh Bao


Nguyen Minh Triet

Math for Computer Science

November 18, 2024



Overview of Support Vector Machine

Section 1: Introduction

Section 2: Support Vector Machine

Section 3: Types of SVM

Section 4: Advantages and Drawbacks


Section 1: Introduction


Related Mathematical Concepts

In an n-dimensional space, a hyperplane (α) is a subspace of dimension n − 1, represented by the equation:

(α): w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b = 0    (1)

where:
x = [x_1, x_2, …, x_n]^T is the coordinate vector of a point on the hyperplane,
w = [w_1, w_2, …, w_n]^T is a normal vector of (α),
b is a constant (the bias term).


Distance from a Point to a Hyperplane

In an n-dimensional space, the distance d from a point x_0 = [x_{01}, x_{02}, …, x_{0n}]^T to the hyperplane (α) is given by:

d = \frac{|w_1 x_{01} + w_2 x_{02} + ⋯ + w_n x_{0n} + b|}{\sqrt{w_1^2 + w_2^2 + ⋯ + w_n^2}} = \frac{|w^T x_0 + b|}{\|w\|_2}

where \|w\|_2 = \sqrt{w_1^2 + w_2^2 + ⋯ + w_n^2} = \sqrt{w^T w} is the ℓ2-norm of w.
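A minimal sketch (not from the slides) of this formula in NumPy, using an illustrative hyperplane and point:

# Distance from a point to a hyperplane: |wᵀx0 + b| / ||w||_2
import numpy as np

w = np.array([3.0, 4.0])    # normal vector of the hyperplane
b = -5.0                    # bias term
x0 = np.array([2.0, 1.0])   # point whose distance we want

d = abs(w @ x0 + b) / np.linalg.norm(w)
print(d)   # |3*2 + 4*1 - 5| / 5 = 1.0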


Illustration of Distance to Hyperplane



Duality in Optimization

Given an optimization problem:

x^* = \arg\min_x f_0(x)

subject to:    (P1)
f_i(x) ≤ 0,  for i = 1, …, m
h_j(x) = 0,  for j = 1, …, p

where:

x^* is the optimal point of (P1).

f_0(x^*) is the optimal value of (P1).

D = \left( \bigcap_{i=0}^{m} \operatorname{dom} f_i \right) ∩ \left( \bigcap_{j=1}^{p} \operatorname{dom} h_j \right) is the domain of (P1).


Lagrangian Function

The Lagrangian function of (P1) combines the objective function and the constraints using multipliers:

L(x, λ, ν) = f_0(x) + \sum_{i=1}^{m} λ_i f_i(x) + \sum_{j=1}^{p} ν_j h_j(x)

where:
λ_i ≥ 0: Lagrange multipliers for the inequality constraints.
ν_j: Lagrange multipliers for the equality constraints.
λ = [λ_1, λ_2, …, λ_m]^T and ν = [ν_1, ν_2, …, ν_p]^T: the Lagrange multiplier vectors.


Lagrange Dual Function

Definition:
The Lagrange dual function of (P1) is derived from its Lagrangian function. For any input pair (λ, ν), it is the infimum of the Lagrangian over all x in the domain D:

g(λ, ν) = \inf_{x ∈ D} L(x, λ, ν) = \inf_{x ∈ D} \left( f_0(x) + \sum_{i=1}^{m} λ_i f_i(x) + \sum_{j=1}^{p} ν_j h_j(x) \right)

Key properties:
The dual function g(λ, ν) is always concave, even if f_0(x) is not convex.
For any λ with λ_i ≥ 0, the dual function g(λ, ν) provides a lower bound on the optimal value f_0(x^*).


Proof: The Dual Function is Concave

Steps of the proof:

1. Fix two dual variable pairs (λ_1, ν_1) and (λ_2, ν_2).
2. Let θ ∈ [0, 1] and define the convex combination (λ, ν) = θ(λ_1, ν_1) + (1 − θ)(λ_2, ν_2).
3. Evaluate the dual function at this convex combination:

g(λ, ν) = \inf_{x ∈ D} L\big( x, θ(λ_1, ν_1) + (1 − θ)(λ_2, ν_2) \big)
        = \inf_{x ∈ D} \left( f_0(x) + \sum_{i=1}^{m} [θλ_{1,i} + (1 − θ)λ_{2,i}] f_i(x) + \sum_{j=1}^{p} [θν_{1,j} + (1 − θ)ν_{2,j}] h_j(x) \right)
        = \inf_{x ∈ D} \big( θ L(x, λ_1, ν_1) + (1 − θ) L(x, λ_2, ν_2) \big).


Proof: The Dual Function is Concave

Conclusion: By the properties of the infimum (the infimum of a sum is at least the sum of the infima):

g(λ, ν) ≥ θ \inf_{x ∈ D} L(x, λ_1, ν_1) + (1 − θ) \inf_{x ∈ D} L(x, λ_2, ν_2).

Since

\inf_{x ∈ D} L(x, λ_1, ν_1) = g(λ_1, ν_1)  and  \inf_{x ∈ D} L(x, λ_2, ν_2) = g(λ_2, ν_2),

it follows that

g(λ, ν) ≥ θ g(λ_1, ν_1) + (1 − θ) g(λ_2, ν_2).

Therefore, g(λ, ν) is concave.


Lagrange Dual Problem

Key concept:
Each pair (λ, ν) with λ_i ≥ 0 provides a lower bound g(λ, ν) on the optimal value f_0(x^*).
The pair (λ^*, ν^*) that gives the highest lower bound, g(λ^*, ν^*), is called the optimal Lagrange multipliers.

Dual problem:

(λ^*, ν^*) = \arg\max_{λ, ν} g(λ, ν)

subject to:    (P2)
λ_i ≥ 0,  for i = 1, …, m

Additional notes:
(P2) is always a convex optimization problem, regardless of the convexity of (P1).
The difference f_0(x^*) − g(λ^*, ν^*) is called the optimal duality gap.

Strong Duality and Optimal Duality Gap

Strong Duality:
If the optimal duality gap is zero, we say that strong duality occurs.
Significance:
Solving the dual problem (P2) allows us to find the exact optimal
value of the primal problem (P1).



Slater’s Condition and Strong Duality

Constraint Qualifications:
For a convex optimization problem (P1), certain conditions called
constraint qualifications ensure strong duality.
A fundamental example of such a qualification is Slater’s condition.
Strictly Feasible Point (Definition):
A point x is strictly feasible if it satisfies:

f_i(x) < 0, ∀i = 1, …, m,   h_j(x) = 0, ∀j = 1, …, p.

This means the inequality constraints are strictly satisfied, and the
equality constraints hold.
Slater’s Theorem:
If a strictly feasible point exists and (P1) is convex, then strong
duality holds.



Karush-Kuhn-Tucker (KKT) Conditions

General KKT conditions:

Assume strong duality holds. The following conditions are necessary for (x^*, λ^*, ν^*) to be optimal for both the primal and dual problems.

Necessary conditions:
1. Primal feasibility:
   f_i(x^*) ≤ 0, ∀i = 1, …, m,   h_j(x^*) = 0, ∀j = 1, …, p.
2. Dual feasibility:
   λ^*_i ≥ 0, ∀i = 1, …, m.
3. Complementary slackness:
   λ^*_i f_i(x^*) = 0, ∀i = 1, …, m.
4. Stationarity:
   ∇_x f_0(x^*) + \sum_{i=1}^{m} λ^*_i ∇_x f_i(x^*) + \sum_{j=1}^{p} ν^*_j ∇_x h_j(x^*) = 0.

KKT Conditions for Convex Problems

If (P1) is a convex problem and strong duality holds (for example, under Slater's condition), the KKT conditions are both necessary and sufficient.
If a point satisfies the KKT conditions, it is optimal for both the primal and dual problems.
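A one-variable example (added for illustration, not from the slides) ties these ideas together. Minimize f_0(x) = x^2 subject to f_1(x) = 1 − x ≤ 0:

L(x, λ) = x^2 + λ(1 − x),   λ ≥ 0
g(λ) = \inf_x L(x, λ) = λ − λ^2/4   (the minimizer is x = λ/2)
λ^* = \arg\max_{λ ≥ 0} g(λ) = 2,   g(λ^*) = 1 = f_0(x^*)   with x^* = 1

Slater's condition holds (x = 2 is strictly feasible), so the duality gap is zero, and (x^*, λ^*) = (1, 2) satisfies all the KKT conditions: 1 − x^* ≤ 0, λ^* ≥ 0, λ^*(1 − x^*) = 0, and the stationarity condition 2x^* − λ^* = 0.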


Section 2: Support Vector Machine


Problem Statement

Given:
A dataset D = {(x^{(i)}, y^{(i)})}_{i=1}^{m}, where:
x^{(i)} ∈ R^n (feature vectors),
y^{(i)} ∈ {−1, 1} (class labels), for i = 1, …, m.
Assume the two classes of data points (y^{(i)} = 1 and y^{(i)} = −1) are linearly separable.

Question:
How can we find the best hyperplane separating these two classes?


Illustration of separating hyperplane



Support Vector Machine (SVM)

Definition:
Support Vector Machine (SVM) is a supervised learning algorithm
designed for classification problems.
It identifies the optimal hyperplane that separates two classes of
data points.
The hyperplane is selected to maximize the margin, which is the
distance to the nearest data points (support vectors) from both
classes.



Support Vector Machine Visualization



Separating Hyperplane in n-Dimensional Space

Definition:
In n-dimensional space, the separating hyperplane (α) has the form:

w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b = w^T x + b = 0,

where:
x = [x_1, x_2, …, x_n]^T: coordinates of a point on the hyperplane.
w = [w_1, w_2, …, w_n]^T: normal vector to the hyperplane (α).
b: a constant (bias term).

Conditions (C1):
For all pairs (x^{(i)}, y^{(i)}) ∈ D, the following must hold:

w^T x^{(i)} + b ≥ 0,  if y^{(i)} = 1,
w^T x^{(i)} + b < 0,  if y^{(i)} = −1.


Distance to the Hyperplane

Definition:
The distance d from a data point (x^{(i)}, y^{(i)}) to the hyperplane (α) is given by:

d = \frac{|w_1 x_1^{(i)} + w_2 x_2^{(i)} + ⋯ + w_n x_n^{(i)} + b|}{\sqrt{w_1^2 + w_2^2 + ⋯ + w_n^2}} = \frac{|w^T x^{(i)} + b|}{\|w\|_2}.

Using (C1), the distance can also be written as:

d = \frac{y^{(i)} (w^T x^{(i)} + b)}{\|w\|_2}.


Scaling for Support Vectors

Support vector condition:
If (x^{(a)}, y^{(a)}) is a support vector, it satisfies:

y^{(a)} (w^T x^{(a)} + b) = c,   c ∈ R^+.

Scaling observation:
The coefficients (w, b) are not unique: scaling them by any positive constant k ∈ R^+ still satisfies (C1), and the distance d from any point in D to the hyperplane remains unchanged under this scaling.

Simplified assumption:
To remove this redundancy, we can assume:

y^{(a)} (w^T x^{(a)} + b) = 1,   (Eq.1)

without affecting the relative geometry of the problem.

Margin Size

Implication of (Eq.1):
Since (Eq.1) holds, for every i = 1, …, m we have:

y^{(i)} (w^T x^{(i)} + b) ≥ 1.

Margin size:
The margin size is calculated as:

margin = \frac{y^{(a)} (w^T x^{(a)} + b)}{\|w\|_2} = \frac{1}{\|w\|_2}.


SVM Optimization Problem

Objective:
The goal of SVM is to maximize the margin, which is equivalent to finding the optimal pair (w^*, b^*) of the following optimization problem:

(w^*, b^*) = \arg\max_{w, b} \frac{1}{\|w\|_2},

subject to:
y^{(i)} (w^T x^{(i)} + b) ≥ 1,  ∀i = 1, …, m.

Reformulated problem:
The problem above is equivalent to minimizing the squared norm of w:

(w^*, b^*) = \arg\min_{w, b} \frac{1}{2} \|w\|_2^2,

subject to:    (P3)
1 − y^{(i)} (w^T x^{(i)} + b) ≤ 0,  ∀i = 1, …, m.
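Since (P3) is a quadratic program, it can be handed directly to an off-the-shelf convex solver. A minimal sketch, assuming the CVXPY library and a small synthetic, linearly separable dataset (not from the slides):

# Solve the primal problem (P3) with CVXPY on toy data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (20, 2)),     # class +1
               rng.normal([-2, -2], 0.5, (20, 2))])  # class -1
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2)||w||^2  subject to  y_i (wᵀx_i + b) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w* =", w.value, " b* =", b.value)
print("margin =", 1 / np.linalg.norm(w.value))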


Lagrangian of the Optimization Problem (P3)

Lagrangian:
The Lagrangian of the optimization problem (P3) is:

L(w, b, λ) = \frac{1}{2} \|w\|_2^2 + \sum_{i=1}^{m} λ_i \left( 1 − y^{(i)} (w^T x^{(i)} + b) \right),

where:
λ = [λ_1, λ_2, …, λ_m]^T are the Lagrange multipliers,
λ_i ≥ 0, ∀i = 1, …, m.


Convexity and Optimality of (P3)

Key insights:
It can be shown that (P3) is a convex optimization problem.
Because the data are assumed to be linearly separable, a strictly feasible point exists, so Slater's condition is satisfied and strong duality holds.

Conclusion:
The optimal solutions w^*, b^* (primal) and λ^* (dual) can be obtained by solving the Karush-Kuhn-Tucker (KKT) conditions of (P3).


KKT Conditions for (P3)

KKT conditions:

Primal feasibility:
1 − y^{(i)} \left( (w^*)^T x^{(i)} + b^* \right) ≤ 0,  ∀i = 1, …, m.   (C2.1)

Dual feasibility:
λ^*_i ≥ 0,  ∀i = 1, …, m.   (C2.2)

Complementary slackness:
λ^*_i \left( 1 − y^{(i)} \left( (w^*)^T x^{(i)} + b^* \right) \right) = 0,  ∀i = 1, …, m.   (C2.3)

Stationarity with respect to w^*:
\frac{∂L}{∂w^*} = w^* − \sum_{i=1}^{m} λ^*_i y^{(i)} x^{(i)} = 0.   (C2.4)

Stationarity with respect to b^*:
\frac{∂L}{∂b^*} = \sum_{i=1}^{m} λ^*_i y^{(i)} = 0.   (C2.5)

Solving (P3) Using the Dual Problem

Motivation:
Directly solving for w^*, b^*, λ^* from the KKT conditions can be computationally intensive.
Instead, solving for λ in the Lagrange dual problem of (P3) is more efficient and is the usual approach.

Lagrange dual function:
The dual function g(λ) is defined as:

g(λ) = \inf_{w, b} L(w, b, λ),

where the Lagrangian L(w, b, λ) is given by:

L(w, b, λ) = \frac{1}{2} \|w\|_2^2 + \sum_{i=1}^{m} λ_i \left( 1 − y^{(i)} (w^T x^{(i)} + b) \right).

Key insight:
The infimum over (w, b) can be computed in closed form by setting the partial derivatives of L to zero, as shown next.

Finding \inf_{w,b} L

Key steps:
To find \inf_{w,b} L, set the partial derivatives of L with respect to w and b to zero.

With respect to w:

\frac{∂L}{∂w} = w − \sum_{i=1}^{m} λ_i y^{(i)} x^{(i)} = 0  ⇒  w = \sum_{i=1}^{m} λ_i y^{(i)} x^{(i)}.   (Eq.2)

With respect to b:

\frac{∂L}{∂b} = \sum_{i=1}^{m} λ_i y^{(i)} = 0.   (Eq.3)

Substituting (Eq.2) and (Eq.3) into g(λ):

g(λ) = \sum_{i=1}^{m} λ_i − \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} λ_i λ_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}.   (Eq.4)


Lagrange Dual Problem of (P3)

Dual problem formulation:

Combining (Eq.3), (Eq.4), and the constraints on λ, we obtain the Lagrange dual problem of (P3):

λ^* = \arg\max_{λ} g(λ),

subject to:    (P4)
λ_i ≥ 0,  ∀i = 1, …, m,
\sum_{i=1}^{m} λ_i y^{(i)} = 0.

Solving the dual problem:
(P4) is a quadratic programming problem. To solve it, we can use, for example (see the sketch below):
SMO (Sequential Minimal Optimization),
libraries such as CVXOPT,
the projected gradient descent algorithm.

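A minimal sketch of solving (P4) with CVXOPT on a toy, linearly separable dataset (the dataset, ridge term, and variable names are illustrative, not from the slides):

# Cast (P4) as a standard QP:  minimize (1/2) λᵀPλ + qᵀλ
#   with P_ij = y_i y_j x_iᵀx_j and q = -1, subject to λ >= 0 and yᵀλ = 0.
import numpy as np
from cvxopt import matrix, solvers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.4, (20, 2)),
               rng.normal([-2, -2], 0.4, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
m = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(m))  # small ridge for numerical stability
q = matrix(-np.ones(m))
G = matrix(-np.eye(m))        # -λ <= 0  (i.e. λ >= 0)
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, -1))  # yᵀλ = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
print("number of support vectors:", int(np.sum(lam > 1e-6)))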
Calculating w^* and b^* from the KKT Conditions

Observation:
From (C2.3), λ^*_i can be greater than 0 only if

y^{(i)} \left( (w^*)^T x^{(i)} + b^* \right) = 1,

meaning x^{(i)} is a support vector.

Step 1: Solve for λ^* and identify the support vectors.
After solving (P4) for λ^*, define the index set of the support vectors:

S = \{ i \mid λ^*_i ≠ 0 \}.

Step 2: Calculate w^*.
Using (C2.4):

w^* = \sum_{i ∈ S} λ^*_i y^{(i)} x^{(i)}.


Calculating w^* and b^* from the KKT Conditions

Step 3: Calculate b^*.
Since x^{(i)} is a support vector for every i ∈ S, we have:

y^{(i)} \left( (w^*)^T x^{(i)} + b^* \right) = 1.

For any i ∈ S, this gives:

b^* = \frac{1}{y^{(i)}} − (w^*)^T x^{(i)} = y^{(i)} − (w^*)^T x^{(i)},

using 1/y^{(i)} = y^{(i)} since y^{(i)} ∈ {−1, 1}. Alternatively, for numerical stability, we can calculate b^* as the mean over all support vectors:

b^* = \frac{1}{|S|} \sum_{i ∈ S} \left( y^{(i)} − (w^*)^T x^{(i)} \right).
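A minimal sketch (not from the slides) of Steps 1–3 in NumPy. The dual solution λ^* is hard-coded for the two-point toy problem x^{(1)} = (1, 1), y^{(1)} = +1 and x^{(2)} = (−1, −1), y^{(2)} = −1, whose dual optimum is λ^* = (0.25, 0.25); in practice λ^* would come from a QP solver as in the previous sketch:

import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])        # assumed dual solution for this toy problem

S = lam > 1e-6                      # support-vector indices (λ*_i > 0)
w = ((lam[S] * y[S])[:, None] * X[S]).sum(axis=0)   # w* = Σ λ*_i y_i x_i
b = np.mean(y[S] - X[S] @ w)                        # b* = mean(y_i − w*ᵀx_i)

print("w* =", w)                          # [0.5 0.5]
print("b* =", b)                          # 0.0
print("margin =", 1 / np.linalg.norm(w))  # ≈ 1.414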


Separating Hyperplane and Prediction

Separating hyperplane:
The separating hyperplane (α) is defined as:

(α): (w^*)^T x + b^* = \sum_{i ∈ S} λ^*_i y^{(i)} x^{(i)T} x + \frac{1}{|S|} \sum_{i ∈ S} \left( y^{(i)} − (w^*)^T x^{(i)} \right) = 0.

Prediction for a new data point x^{(n)}:
The label y^{(n)} for a new data point x^{(n)} is determined as follows:

y^{(n)} = \begin{cases} 1, & \text{if } (w^*)^T x^{(n)} + b^* ≥ 0, \\ −1, & \text{otherwise.} \end{cases}
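In practice the whole pipeline is available in standard libraries. A minimal sketch using scikit-learn's SVC (not from the slides); a very large C approximates the hard-margin setting described here:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("w* =", clf.coef_[0], " b* =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("prediction for (0.5, 1.5):", clf.predict([[0.5, 1.5]])[0])   # sign of w*ᵀx + b*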


Section 3: Types of SVM

Linear Support Vector Machine
Hard Margin
Soft Margin

Non-Linear Support Vector Machine
Kernel Function
Linear Kernel
Polynomial Kernel
Sigmoid Kernel
Radial Basis Function (RBF) Kernel


Section 3: Types of SVM - Linear SVM - Hard Margin

Assumption: The data is perfectly linearly separable, meaning there exists a hyperplane that can separate the two classes without any misclassification.

Goal: Maximize the margin between the classes with no points inside the margin.

Conditions:
No points allowed within the margin.
No misclassifications.


Section 3: Types of SVM - Linear SVM - Soft Margin

Problem: In real-world scenarios, data is often noisy and not perfectly linearly separable, so no (w, b) satisfying the hard-margin constraints can be found. Therefore, a model that allows some misclassification is needed to handle these cases.


Section 3: Types of SVM - Linear SVM - Soft Margin

Solution: To address the problem of non-separable data, we introduce a slack variable ξ_i (xi) for each data point.

Role of ξ:
ξ_i measures how much the margin constraint is violated by each data point.
ξ_i = 0: the point is correctly classified and lies on or outside the margin.
ξ_i > 0: the point either lies within the margin or is misclassified.

Constraints:

⟨w, x_i⟩ + b ≥ 1 − ξ_i,   if y_i = 1
⟨w, x_i⟩ + b ≤ −1 + ξ_i,  if y_i = −1

Section 3: Types of SVM – Linear SVM – Soft Margin

Constraints with slack variables ξ:

For the positive class (y_i = 1):
⟨w, x_i⟩ + b ≥ 1 − ξ_i,  ∀i = 1, …, N
For the negative class (y_i = −1):
⟨w, x_i⟩ + b ≤ −1 + ξ_i,  ∀i = 1, …, N
Non-negativity constraint:
ξ_i ≥ 0,  ∀i = 1, …, N

Optimization objective:

\arg\min_{w, b, ξ} \left( \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} ξ_i \right)   (Eq.1)

where:
C > 0 is the penalty constant.
The greater C, the heavier the penalty on margin violations and misclassifications.

Section 3: Types of SVM – Linear SVM – Soft Margin

The Lagrange function is:

L(w, b, ξ, λ, µ) = \frac{1}{2} ⟨w, w⟩ + C \sum_{i=1}^{N} ξ_i − \sum_{i=1}^{N} λ_i \left( y_i (⟨w, x_i⟩ + b) − 1 + ξ_i \right) − \sum_{i=1}^{N} µ_i ξ_i   (Eq.2)

where λ_i ≥ 0 and µ_i ≥ 0 are Lagrange multipliers.

Optimization conditions:

∇_w L = 0  ⇒  w = \sum_{n=1}^{N} λ_n y_n x_n   (Eq.3)

∇_b L = 0  ⇒  \sum_{n=1}^{N} λ_n y_n = 0   (Eq.4)

∇_{ξ_n} L = 0  ⇒  λ_n = C − µ_n   (Eq.5)

Interpretation of the conditions:
Since λ_n = C − µ_n with λ_n ≥ 0 and µ_n ≥ 0, each λ_n is constrained to 0 ≤ λ_n ≤ C.
If µ_n = 0, then λ_n = C, which (as shown later) corresponds to a point lying on or inside the margin.

Section 3: Types of SVM - Linear SVM – Soft Margin

New Lagrange function (dual form):

Substituting (Eq.3), (Eq.4), and (Eq.5) back into the Lagrange function eliminates w, b, ξ, and µ, leaving:

L(λ) = \sum_{i=1}^{N} λ_i − \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} λ_i λ_j y_i y_j ⟨x_i, x_j⟩   (Eq.6)

Notes:
This dual form depends only on the λ_i and the inner products ⟨x_i, x_j⟩, which makes it computationally convenient.
The goal is to maximize L(λ) with respect to λ, which controls the influence of each support vector.

Subject to the following constraints:
0 ≤ λ_i ≤ C,  ∀i = 1, …, N
\sum_{i=1}^{N} λ_i y_i = 0


Section 3: Types of SVM - Linear SVM – Soft Margin

KKT conditions for the soft margin:

ξ_i ≥ 0   (Eq.7)
λ_i ≥ 0   (Eq.8)
µ_i ≥ 0   (Eq.9)
µ_i ξ_i = 0   (Eq.10)
y_i (⟨w, x_i⟩ + b) − 1 + ξ_i ≥ 0,  ∀i = 1, …, N   (Eq.11)
λ_i \left( y_i (⟨w, x_i⟩ + b) − 1 + ξ_i \right) = 0   (Eq.12)


Section 3: Types of SVM - Linear SVM – Soft Margin

If λ_n > 0, the corresponding point contributes to the solution w of the soft-margin SVM problem.
The set S = {n : λ_n > 0} is called the support set, and the vectors {x_n, n ∈ S} are known as support vectors.
The weight vector w is determined solely by the support vectors:

w = \sum_{n ∈ S} λ_n y_n x_n

When λ_n > 0, complementary slackness (Eq.12) gives:

y_n (w^T x_n + b) = 1 − ξ_n


Section 3: Types of SVM - Linear SVM – Soft Margin

If 0 < λ_n < C, then by (Eq.5) µ_n = C − λ_n > 0, so (Eq.10) forces ξ_n = 0, and hence:

y_n (w^T x_n + b) = 1,

indicating that these points lie exactly on the margin boundary.

The bias term b can be computed as:

b = \frac{1}{N_M} \sum_{m ∈ M} \left( y_m − w^T x_m \right)

where M = {m : 0 < λ_m < C} and N_M = |M|.

Final solution:

w = \sum_{m ∈ S} λ_m y_m x_m,    b = \frac{1}{N_M} \sum_{m ∈ M} \left( y_m − w^T x_m \right)


Section 3: Types of SVM - Linear SVM – Soft Margin

If λ_n = C, then λ_n > 0, so (Eq.12) together with ξ_n ≥ 0 gives:

y_n (w^T x_n + b) = 1 − ξ_n ≤ 1,

indicating that these points lie on the margin boundary or violate it (and are misclassified when ξ_n > 1).

The final decision function for a new point x is given by:

w^T x + b = \sum_{m ∈ S} λ_m y_m x_m^T x + \frac{1}{N_M} \sum_{n ∈ M} \left( y_n − \sum_{m ∈ S} λ_m y_m x_m^T x_n \right)   (Eq.13)


Section 3: Types of SVM – Linear SVM – Soft Margin

Illustration of the resulting decision boundary for C = 0.1, C = 1, and C = 10000 (figures).
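The comparison can be reproduced with scikit-learn's soft-margin SVC; a minimal sketch on synthetic, overlapping data (the dataset is illustrative, not from the slides):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),
               rng.normal([-1, -1], 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.1, 1, 10000):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>7}: margin width = {margin:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")

A smaller C tolerates more margin violations (a wider margin and more support vectors), while a very large C approaches the hard-margin behaviour.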


Section 3: Types of SVM – Non-linear SVM – Solution

Φ: x → Φ(x)

Basic concept: It is always possible to transform the initial feature space into a higher-dimensional feature space in which the training set becomes separable.


Section 3: Types of SVM - Non-linear SVM - Mathematics

Starting from the dual problem of the linear (soft-margin) SVM:

λ = \arg\max_{λ} \sum_{n=1}^{N} λ_n − \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m y_n y_m x_n^T x_m

subject to:

\sum_{n=1}^{N} λ_n y_n = 0,   0 ≤ λ_n ≤ C,  ∀n

To avoid explicitly computing the transformation Φ(x), the kernel trick is used:

k(x, z) = Φ(x)^T Φ(z),

allowing the computations to stay in the original input space.


Section 3: Types of SVM - Non-linear SVM - Mathematics

The decision function in the transformed feature space Φ(x) is given by:

w^T Φ(x) + b = \sum_{m ∈ S} λ_m y_m Φ(x_m)^T Φ(x) + \frac{1}{N_M} \sum_{n ∈ M} \left( y_n − \sum_{m ∈ S} λ_m y_m Φ(x_m)^T Φ(x_n) \right)   (Eq.14)

The objective function for the dual form of the SVM optimization problem becomes:

λ = \arg\max_{λ} \sum_{n=1}^{N} λ_n − \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m y_n y_m k(x_n, x_m)   (Eq.15)

subject to:

\sum_{n=1}^{N} λ_n y_n = 0,   0 ≤ λ_n ≤ C,  ∀n

Here:
S is the support set, with λ_m > 0.
M is the set of support vectors on the margin, where 0 < λ_m < C.
k(x_n, x_m) = Φ(x_n)^T Φ(x_m) is the kernel function.


Section 3: Types of SVM - Non-linear SVM - Mathematics - Instance

Consider a transformation of a point from the two-dimensional space x = [x_1, x_2]^T to a higher-dimensional feature space:

Φ(x) = [1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2]^T

For two points x and z in the original space, the dot product in the feature space is:

Φ(x)^T Φ(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2

This can be rewritten as:

Φ(x)^T Φ(z) = (1 + x^T z)^2 = k(x, z),

so the six-dimensional dot product can be evaluated directly in the original two-dimensional space.
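A minimal sketch (not from the slides) that numerically checks this identity:

import numpy as np

def phi(v):
    # Explicit 6-dimensional feature map for a 2-D input v = (v1, v2).
    v1, v2 = v
    return np.array([1.0, np.sqrt(2) * v1, np.sqrt(2) * v2,
                     v1**2, np.sqrt(2) * v1 * v2, v2**2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

lhs = phi(x) @ phi(z)      # dot product computed in the feature space
rhs = (1 + x @ z) ** 2     # kernel evaluated in the original space
print(lhs, rhs, np.isclose(lhs, rhs))   # both sides agree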


Section 3: Types of SVM – Non-linear SVM – Conditions for Kernel Functions

Symmetry: kernel functions must be symmetric, i.e., k(x, z) = k(z, x).

Mercer's condition:

\sum_{n=1}^{N} \sum_{m=1}^{N} k(x_m, x_n) c_m c_n ≥ 0,   ∀c_i ∈ R

This condition ensures that the kernel matrix K is positive semi-definite, which keeps the dual problem a well-posed convex quadratic program.

Practical consideration: some functions that do not satisfy Mercer's condition may still yield acceptable results in practice and are used as kernels.
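A minimal sketch (not from the slides) that checks positive semi-definiteness of a Gaussian kernel matrix on random points by inspecting its eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sigma = 1.0

# Pairwise squared distances and the Gaussian (RBF) kernel matrix.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # non-negative up to round-off, so K is PSD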


Section 3: Types of SVM – Non-linear SVM – Kernel Functions

Polynomial:

K(x, z) = ((x · z) + θ)^d,   θ ∈ R, d ∈ N

Gaussian radial basis function (RBF):

K(x, z) = \exp\left( −\frac{\|x − z\|^2}{2σ^2} \right),   σ > 0

Sigmoid:

K(x, z) = \tanh(γ (x · z) + r),   γ, r ∈ R
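A minimal sketch (not from the slides) of these kernels as plain Python functions; the parameter values are illustrative defaults:

import numpy as np

def polynomial_kernel(x, z, theta=1.0, d=2):
    return (np.dot(x, z) + theta) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def sigmoid_kernel(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))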


Section 3: Types of SVM - Summary

SVM is a supervised learning algorithm used for classification and regression.
SVM works with real-valued attributes; any nominal attribute must first be encoded numerically.
The learning formulation of SVM is for two classes. A multiclass problem can be solved by reducing it to several two-class problems (one-vs-rest, one-vs-one), as in the sketch below.
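A minimal sketch (not from the slides) of a one-vs-rest reduction of a three-class problem with scikit-learn; note that SVC itself already handles multiclass input via one-vs-one internally:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # 3 classes, 4 real-valued features
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print("binary classifiers trained:", len(ovr.estimators_))   # one per class
print("training accuracy:", ovr.score(X, y))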
Section 4: Advantages and Drawbacks

Pros
Works well when there is a clear margin of separation between classes
Effective in high-dimensional spaces
Effective even when the number of dimensions exceeds the number of samples
Memory-efficient at prediction time, since the decision function uses only the support vectors

Drawbacks
Computationally expensive and slow to train on large datasets
Performs poorly when classes overlap heavily (no clear margin)
Prone to overfitting when features greatly outnumber training samples unless the kernel and regularization are chosen carefully
Does not natively provide a probabilistic interpretation of its classifications
Memory-intensive during training due to storing the kernel matrix
Sensitive to the choice of kernel and hyperparameters
Natively a two-class method; multiclass problems require one-vs-rest or one-vs-one reductions
Cannot handle datasets with missing values directly
