
Support Vector Machine

Contents
- Support Vector Machine
- Support Vectors
- Hard Margin
- Linear Separability
- SVM for Linear Classification
- SVM for Non-linear Classification
- Kernel SVM
Support Vector Machine
- Supervised ML algorithm
- Used for classification; can also be used for regression
- Used for linear classification; can also be used for non-linear classification
Linearly Separable Data
(figures: examples of linearly separable data)
Support Vector Machine: Linear Separability
- Objective: Find the optimal hyperplane in an N-dimensional feature space that separates the data points of the different classes.
- Technique: Determine the best hyperplane, i.e. the one that maximizes the separation margin between the data points.
Best Separating Hyperplane
(figure: candidate separating lines; L2 is the maximum-margin hyperplane, or hard margin)

Best Separating Hyperplane
- Hard margin: select the hyperplane whose distance from the nearest data points on each side is maximized.
- Equivalently, select the hyperplane so that the margin is maximized.
Support Vectors
- Data points that lie closest to the decision surface (hyperplane).
- The most difficult points to classify.
Mathematical Formulation: Classification

(diagram: input features feed into the SVM, which learns $(\mathbf{w}, b)$ and produces the target output)

- Input to the SVM: a set of (input, output) training pairs $(X, Y)$.
- Input feature set: $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$
- Target output: class 1 or class 2, labeled as $+1$ and $-1$.
- Output of the SVM: a set of weights $\mathbf{w} = \{w_1, w_2, \dots, w_f\}$, one for each feature dimension, and a bias $b$, whose linear combination predicts the value of the output.
Mathematical Formulation: Classification
- Equation of the separating line:
  $\mathbf{w}^T \mathbf{x} + b = 0$, with $\mathbf{w} = \{w_1, w_2\}$ in the two-dimensional case.
- SVM classifier:
  $\hat{y} = +1$ if $\mathbf{w}^T \mathbf{x} + b \ge 0$
  $\hat{y} = -1$ if $\mathbf{w}^T \mathbf{x} + b < 0$
- This is the decision function for binary classification.
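A minimal NumPy sketch of this decision function; the weight vector, bias, and test point below are illustrative values, not taken from the slides:

import numpy as np

def svm_decision(x, w, b):
    """Return +1 if w^T x + b >= 0, else -1 (the decision function above)."""
    score = np.dot(w, x) + b
    return 1 if score >= 0 else -1

# Illustrative (hypothetical) parameters and input point
w = np.array([2.0, -1.0])     # one weight per feature dimension
b = -0.5                      # bias
x = np.array([1.0, 0.5])      # a 2-D input point
print(svm_decision(x, w, b))  # 2*1.0 - 1*0.5 - 0.5 = 1.0 >= 0, so prints 1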


Mathematical Formulation: Classification
- The distance between the $i$-th data point $\mathbf{x}_i$ and the decision boundary $\mathbf{w}^T \mathbf{x} + b = 0$ is
  $d_i = \dfrac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|_2}$
Mathematical Formulation: Classification
- Derivation: $\mathbf{w}$ is normal to the decision boundary, so $\mathbf{w}/\|\mathbf{w}\|_2$ is the unit normal; projecting $\mathbf{x}_i$ onto this direction gives
  $d_i = \dfrac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|_2}$
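A quick NumPy check of this distance formula, using illustrative values for $\mathbf{w}$, $b$, and $\mathbf{x}_i$ (not from the slides):

import numpy as np

w = np.array([3.0, 4.0])      # normal vector of the boundary (illustrative)
b = -5.0                      # bias (illustrative)
x_i = np.array([3.0, 4.0])    # a data point

# Signed distance of x_i from the boundary w^T x + b = 0
d_i = (np.dot(w, x_i) + b) / np.linalg.norm(w)
print(d_i)                    # (9 + 16 - 5) / 5 = 4.0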
Mathematical Formulation: Classification
- Assumption: scale $\mathbf{w}$ and $b$ so that every data point lies at a (functional) distance of at least 1 from the hyperplane, i.e. $|\mathbf{w}^T \mathbf{x} + b| \ge 1$.
- Previously:
  $y = +1$ if $\mathbf{w}^T \mathbf{x} + b \ge 0$
  $y = -1$ if $\mathbf{w}^T \mathbf{x} + b < 0$

(figure: the decision boundary $H_0$ with the margin hyperplanes $H_1$ and $H_2$)
Mathematical Formulation: Classification
- $H_0$: $\mathbf{w}^T \mathbf{x} + b = 0$ (decision boundary)
- $H_1$: $\mathbf{w}^T \mathbf{x} + b = 1$
- $H_2$: $\mathbf{w}^T \mathbf{x} + b = -1$
- $d^{+}$ = distance of $H_0$ from the support vector of the positive class (on $H_1$).
- $d^{-}$ = distance of $H_0$ from the support vector of the negative class (on $H_2$).
Mathematical Formulation: Classification
- $d^{+} = d^{-} = d$
- Applying $d_i = \dfrac{\mathbf{w}^T \mathbf{x}_i + b}{\|\mathbf{w}\|_2}$ to a support vector on $H_1$ (where $\mathbf{w}^T \mathbf{x} + b = 1$) gives
  $d = \dfrac{1}{\|\mathbf{w}\|_2}$
Mathematical Formulation: Classification
- $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|_2}$
- Objective: maximize the margin
  $\Leftrightarrow$ minimize $\|\mathbf{w}\|_2$
  $\Leftrightarrow$ minimize $\frac{1}{2}\|\mathbf{w}\|_2^2$
- Condition: there are no data points between $H_1$ and $H_2$:
  $\mathbf{w}^T \mathbf{x} + b \ge 1$ when $y = +1$
  $\mathbf{w}^T \mathbf{x} + b \le -1$ when $y = -1$
  Combined: $y(\mathbf{w}^T \mathbf{x} + b) \ge 1$, i.e. $y(\mathbf{w}^T \mathbf{x} + b) - 1 \ge 0$
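A small worked example of the margin formula, with an illustrative weight vector rather than anything from the slides:

\[
\mathbf{w} = (3, 4) \;\Rightarrow\; \|\mathbf{w}\|_2 = \sqrt{3^2 + 4^2} = 5,
\qquad \text{Margin} = \frac{2}{\|\mathbf{w}\|_2} = \frac{2}{5} = 0.4 .
\]

A smaller $\|\mathbf{w}\|_2$ gives a wider margin, which is why minimizing $\frac{1}{2}\|\mathbf{w}\|_2^2$ is equivalent to maximizing the margin.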
Mathematical Formulation: Classification
- Optimization problem:
  minimize $\frac{1}{2}\|\mathbf{w}\|_2^2$
  such that $y(\mathbf{w}^T \mathbf{x} + b) - 1 \ge 0$
- This is a constrained optimization problem.
- It can be solved by the Lagrange multiplier method.


Optimal Parameter Calculation
- Using the Lagrange multiplier method, the primal problem is
  $\min\; L_p = \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{i=1}^{n} a_i\, y_i(\mathbf{w}^T \mathbf{x}_i + b) + \sum_{i=1}^{n} a_i$
  such that $a_i \ge 0$ for all $i$


Optimal Parameter Calculation
- Primal problem:
  $\min\; L_p = \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{i=1}^{n} a_i\, y_i(\mathbf{w}^T \mathbf{x}_i + b) + \sum_{i=1}^{n} a_i$, such that $a_i \ge 0$ for all $i$
- Solution: take the partial derivatives with respect to $\mathbf{w}$ and $b$ and equate them to 0:
  $\dfrac{\partial L_p}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} a_i y_i \mathbf{x}_i = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$
  $\dfrac{\partial L_p}{\partial b} = -\sum_{i=1}^{n} a_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} a_i y_i = 0$
Optimal Parameter Calculation
- Lagrange dual problem: instead of minimizing over $\mathbf{w}$ and $b$ subject to constraints involving the $a_i$'s, we can maximize over the $a_i$'s, subject to
  $\mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$ and $\sum_{i=1}^{n} a_i y_i = 0$
Optimal Parameter Calculation
- Primal problem:
  $\min\; L_p = \frac{1}{2}\|\mathbf{w}\|_2^2 - \sum_{i=1}^{n} a_i\, y_i(\mathbf{w}^T \mathbf{x}_i + b) + \sum_{i=1}^{n} a_i$, such that $a_i \ge 0$ for all $i$
- We got:
  $\mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$ and $\sum_{i=1}^{n} a_i y_i = 0$
Optimal Parameter Calculation
- Dual problem:
  $\max\; L_D = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) + b \sum_{j=1}^{n} a_j y_j$
  such that $\sum_{i=1}^{n} a_i y_i = 0$ and $a_i \ge 0$ for all $i$


Optimal Parameter Calculation
- Dual problem (the $b\sum_j a_j y_j$ term drops out because $\sum_i a_i y_i = 0$):
  $\max\; L_D = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)$
  such that $\sum_{i=1}^{n} a_i y_i = 0$ and $a_i \ge 0$ for all $i$
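The dual is a quadratic program in the $a_i$'s. A minimal sketch of solving it numerically, assuming SciPy's SLSQP solver and a tiny made-up dataset (the names X, y, H, and neg_dual are mine, not from the slides):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# H[i, j] = y_i * y_j * (x_i . x_j), so L_D = sum(a) - 0.5 * a^T H a
H = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    # Minimize the negative of L_D, which maximizes L_D
    return 0.5 * a @ H @ a - np.sum(a)

constraints = [{"type": "eq", "fun": lambda a: np.dot(a, y)}]  # sum_i a_i y_i = 0
bounds = [(0.0, None)] * n                                     # a_i >= 0

result = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
                  bounds=bounds, constraints=constraints)
a = result.x
print("dual variables a:", np.round(a, 4))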


Optimal Parameter Calculation
- Find the $a_i$'s by taking the derivative of $L_D$ with respect to each $a_i$ and equating it to zero, subject to the constraints above.
- With the values of $a_i$ for all $i$, compute the optimal weights as
  $\mathbf{w}^* = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$
Optimal Parameter Calculation
- According to the KKT condition:
  $a_i\left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0$
- So $a_i > 0$ only for the support vectors, and
  $\mathbf{w}^* = \sum_{a_i > 0} a_i y_i \mathbf{x}_i$
Optimal Parameter Calculation
- To compute the optimal bias, use the KKT condition $a_i\left( y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right) = 0$ to compute $b_i$ for each support vector:
  $b_i = \dfrac{1}{y_i} - \mathbf{w}^T \mathbf{x}_i$
- Optimal bias: average the $b_i$'s over all support vectors:
  $b^* = \underset{a_i > 0}{\text{avg}}\, \{ b_i \}$
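Continuing the hedged sketch above (it assumes the arrays a, X, y from the dual-optimization example), the optimal weights and bias can be recovered like this:

import numpy as np

sv = a > 1e-6                          # support vectors are the points with a_i > 0

# w* = sum over support vectors of a_i y_i x_i
w_star = (a[sv] * y[sv]) @ X[sv]

# b_i = 1/y_i - w*^T x_i for each support vector, then average
b_star = np.mean(1.0 / y[sv] - X[sv] @ w_star)

print("w* =", np.round(w_star, 4), " b* =", round(float(b_star), 4))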
Inference
- For a new data point $\mathbf{z}$, using
  $\mathbf{w}^* = \sum_{a_i > 0} a_i y_i \mathbf{x}_i$ and $b^* = \underset{a_i > 0}{\text{avg}}\, \{ b_i \}$,
  predict
  $\hat{y} = \text{sign}(\mathbf{w}^{*T} \mathbf{z} + b^*)$
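A minimal prediction step in the same sketch, assuming w_star and b_star computed above and an illustrative new point z:

import numpy as np

def predict(z, w_star, b_star):
    """sign(w*^T z + b*), with a score of exactly 0 mapped to +1 as in the decision function."""
    return 1 if w_star @ z + b_star >= 0 else -1

z = np.array([1.5, 2.5])               # illustrative new data point
print(predict(z, w_star, b_star))      # +1 or -1, depending on which side of the hyperplane z falls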
Intuition
- Dual problem:
  $\max\; L_D = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j) + b \sum_{j=1}^{n} a_j y_j$
  such that $\sum_{i=1}^{n} a_i y_i = 0$ and $a_i \ge 0$ for all $i$
- Note that the training data enter the dual only through the pairwise inner products $\mathbf{x}_i \cdot \mathbf{x}_j$.
Support Vector Machine for Not Linearly Separable Data
- Modify the objective function to handle data that are not linearly separable.

Support Vector Machine for Not Linearly Separable Data
- Previously: $y(\mathbf{w}^T \mathbf{x} + b) - 1 \ge 0$
- Goal: maximize the margin while getting as many data points classified correctly as possible.
- Modified constraint (with slack variables $\xi_i$):
  $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$
Support Vector Machine for Not Linearly Separable Data
- Modified objective function:
  $\min\; \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{n} \xi_i$
  such that $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
Support Vector Machine for Not Linearly Separable Data
- Primal problem:
  $\min\; L_p = \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n} a_i\left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - (1 - \xi_i) \right] - \sum_{i=1}^{n} \mu_i \xi_i$
  such that $a_i, \mu_i \ge 0$ for all $i$
- Solution:
  $\mathbf{w} = \sum_{i=1}^{n} a_i y_i \mathbf{x}_i$, $\quad \sum_{i=1}^{n} a_i y_i = 0$, $\quad a_i = C - \mu_i$ for all $i$
Support Vector Machine for Not Linearly Separable Data
- Dual problem:
  $\max\; L_D = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)$
  such that $\sum_{i=1}^{n} a_i y_i = 0$ and $0 \le a_i \le C$ for all $i$


Support Vector Machine for Not Linearly Separable Data
- KKT conditions:
  $a_i\left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - (1 - \xi_i) \right] = 0$
  $\mu_i \xi_i = 0$
  $y_i(\mathbf{w}^T \mathbf{x}_i + b) - (1 - \xi_i) \ge 0$
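This soft-margin formulation is what library implementations solve in practice. A hedged sketch with scikit-learn's SVC (the synthetic overlapping dataset and the specific C values are illustrative choices, not from the slides), showing how the slack penalty C from the objective above trades margin width against violations:

import numpy as np
from sklearn.svm import SVC

# Small synthetic, slightly overlapping two-class dataset (illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C tolerates more margin violations (more support vectors);
# large C penalizes violations heavily (narrower margin).
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"w={np.round(clf.coef_[0], 3)}, b={round(float(clf.intercept_[0]), 3)}")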
Support Vector Machine for Non-linearly Separable Data
- Gain linear separability by mapping the data to a higher-dimensional space.


SVM Kernel
- Dual problem with the mapped features $h(\mathbf{x})$:
  $\max\; L_D = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j\, \langle h(\mathbf{x}_i), h(\mathbf{x}_j) \rangle$
- Only the inner products $\langle h(\mathbf{x}_i), h(\mathbf{x}_j) \rangle$ are needed, so they can be replaced by a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ computed directly from the original features.
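A small NumPy sketch of that replacement, assuming an RBF kernel with a made-up width parameter gamma; the dual never needs h(x) itself, only the pairwise kernel values:

import numpy as np

def rbf_gram_matrix(X, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), standing in for <h(x_i), h(x_j)>."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])   # illustrative points
print(np.round(rbf_gram_matrix(X), 4))                # 3x3 matrix of pairwise kernel values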
SVM Kernel
- Polynomial kernel
- RBF kernel
- ANN kernel
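A hedged scikit-learn sketch comparing these kernels on a synthetic non-linearly separable dataset (concentric circles); "poly", "rbf", and "sigmoid" are that library's names for the polynomial, RBF, and ANN-style kernels, and the dataset parameters are illustrative:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class inside a ring of the other: not separable by any straight line
X, y = make_circles(n_samples=200, factor=0.4, noise=0.08, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:>7} kernel: training accuracy = {clf.score(X, y):.2f}")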
References
- https://youtu.be/3-FhNaTkAZo?si=_b4ECthJQzZdRsvd
- https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
- https://people.csail.mit.edu/dsontag/courses/ml14/slides/lecture2.pdf
- https://course.ccs.neu.edu/cs5100f11/resources/jakkula.pdf
- https://www.esann.org/sites/default/files/proceedings/legacy/es2004-11.pdf
Thank You
