
M8. Classification: Support Vector Machines (SVMs)
Manikandan Narayanan
Week . (Nov 6-, 2023)
PRML Jul-Nov 2023 (Grads section)
Acknowledgment of Sources
• Slides based on content from related courses and books:
• Courses:
• IITM – Profs. Arun/Harish/Chandra’s PRML offerings (slides, quizzes, notes, etc.), Prof. Ravi’s “Intro to ML” slides – cited respectively as [AR], [HR], [CC], [BR] in the bottom right of a slide.
• India – NPTEL PR course by IISc Prof. P.S. Sastry (slides, etc.) – cited as [PSS] in the bottom right of a slide.

• Books:
• PRML by Bishop. (content, figures, slides, etc.) – cited as [CMB]
• Pattern Classification by Duda, Hart and Stork. (content, figures, etc.) – [DHS]
• Mathematics for ML by Deisenroth, Faisal and Ong. (content, figures, etc.) – [DFO]
• Information Theory, Inference and Learning Algorithms by David JC MacKay – [DJM]
Outline for Module M8
• M8. Classification (Support Vector Machines)
• M8.0 Introduction/Motivation
• (concrete understanding of SVMs – beyond popular pictures & software)
• M8.1 SVM Problem Statement
• (Hard/Soft-Margin SVM Problems)
• M8.2 SVM Solution
• (Background: Constrained optimization - KKT & Primal-Dual)
• (SVM Dual Problem & Optimization algo. sketch)
• M8.3 SVM Interpretations
• (Support vectors, Kernels, Loss function view)
• M8.4 Concluding thoughts
SVM hard-margin: popular pic. → geometry
• (Linear) SVM – max margin, sparse support vectors

• (Non-linear) SVM – uses non-linear kernels followed by applying the SVM above in the feature map space φ((a, b)) = (a, b, a² + b²)

[Images source: https://en.wikipedia.org/wiki/Support-vector_machine]


SVM soft-margin (popular pic./software → ...)
• Parameter C controls where you lie in the soft-hard margin spectrum.

[From: https://stats.stackexchange.com/a/159051]

• Software
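(A minimal software sketch, not from the original slides: scikit-learn’s SVC exposes the soft-margin parameter C directly; the dataset and parameter values below are illustrative only.)

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two-class toy data (illustrative; not the data shown in the slide figures).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Small C -> softer margin (more margin violations tolerated);
# large C -> harder margin (violations penalized heavily).
soft_clf = SVC(kernel="linear", C=0.01).fit(X, y)
hard_clf = SVC(kernel="linear", C=100.0).fit(X, y)

# Softer margins typically keep more support vectors.
print(len(soft_clf.support_), len(hard_clf.support_))
```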
SVM soft-margin: … → concrete understanding
Primal OP: Dual OP:

Prediction for new point x:

SVM, aka the max-margin classifier, is a type of Sparse Kernel Machine (SKM) method
(the Relevance Vector Machine is another type of SKM method, specifically a probabilistic/Bayesian variant)
[Above formulas from sklearn help pages]
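(The prediction rule referenced above, in its standard kernelized form – a sketch of what the slide’s figure likely shows, with α_i* the dual variables and k the kernel:)

f(x) = \mathrm{sign}\Big( \sum_{i=1}^{n} \alpha_i^* \, y_i \, k(x_i, x) + b^* \Big)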
Recall: Inference and decision: three approaches for classification –
SVM: discriminant approach initially motivated by Computational Learning Theory.

• Generative model approach:


(I) Model 𝑝(𝑥, 𝐶𝑘) = 𝑝(𝑥 | 𝐶𝑘) 𝑝(𝐶𝑘)
(I) Use Bayes’ theorem
(D) Apply optimal decision criteria

• Discriminative model approach:


(I) Model the posterior 𝑝(𝐶𝑘 | 𝑥) directly
(D) Apply optimal decision criteria

• Discriminant function approach:


(D) Learn a function that maps each x to a class label directly from training data
Note: No posterior probabilities!

[CMB]
Recall: Discriminant
• Discriminant is a function that takes an input vector 𝑥 ∈ ℝ𝑑 and assigns it
to one of the 𝐾 classes
• we will assume K=2 henceforth for simplicity!

• We focus only on linear discriminants (i.e., those for which the DB is a hyperplane wrt 𝑥 (or 𝜙(𝑥))).
• 𝑧(𝑥) = 𝑤ᵀ𝑥 + 𝑤₀ (or 𝑤ᵀ𝜙(𝑥))
• DB: 𝑧(𝑥) = 0 (hyperplane)
• Prediction: 𝑓(𝑧(𝑥)) = sign(𝑧(𝑥)) (i.e., predict 𝐶₁ if 𝑧(𝑥) ≥ 0, and 𝐶₂ if 𝑧(𝑥) < 0)

Recall defn. of hyperplane {𝑥 ∈ ℝᵈ : 𝑤ᵀ𝑥 = 𝑏}, which is a (𝑑 − 1)-dimensional (affine) subspace of a 𝑑-dim. vector space.
Recall: Geometry of decision surfaces: signed
distance from decision surface
DB is 𝑤ᵀ𝑥 + 𝑤₀ = const., but the const. can be absorbed into 𝑤₀ to get the DB (decision surface) as 𝒘ᵀ𝒙 + 𝒘₀ = 𝟎.
Let 𝑧(𝑤, 𝑥) = 𝑤ᵀ𝑥 + 𝑤₀ with the const. absorbed.

[CMB]
Outline for Module M8
• M8. Classification (Support Vector Machines)
• M8.0 Introduction/Motivation
• M8.1 SVM Problem Statement
• (Hard/Soft-Margin SVM Problems)
• M8.2 SVM Solution
• M8.3 SVM Interpretations
• M8.4 Concluding thoughts
SVM hard-margin problem

[HR]
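(The problem statement itself is in the [HR] figure; for reference, the standard hard-margin primal OP for training data {(x_i, y_i)} with y_i ∈ {−1, +1} is:)

\min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \dots, n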
SVM soft-margin problem

[HR]
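(Again for reference, the standard soft-margin primal OP with slack variables ε_i and trade-off parameter C is:)

\min_{w, b, \epsilon} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \epsilon_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \epsilon_i, \;\; \epsilon_i \ge 0, \quad i = 1, \dots, n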
Example: what is 𝜖𝑖 for a given 𝑤, 𝑏?

[HR]
Example: answer

[HR]
Outline for Module M8
• M8. Classification (Support Vector Machines)
• M8.0 Introduction/Motivation
• M8.1 SVM Problem Statement
• M8.2 SVM Solution
• (Background: Constrained optimization - KKT & Primal-Dual)
• (SVM Dual Problem & Optimization algo. sketch)
• M8.3 SVM Interpretations
• M8.4 Concluding thoughts
From unconstrained to constrained opt. - FONC

FONC for 𝑥* to be a local optimum!

FONC for 𝑥* to be a local feasible (constrained) optimum?

[HR]
Recall: Linear approximation using gradient vector

[HR]
Recall:

[HR]
[HR]
KKT conditions (FONC) – General Case

[HR]
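(The slide’s statement is in the [HR] figure; the standard form, for min_x f(x) s.t. g_j(x) ≤ 0 and h_k(x) = 0 with multipliers μ_j and λ_k, reads:)

\text{Stationarity:} \quad \nabla f(x^*) + \sum_j \mu_j^* \nabla g_j(x^*) + \sum_k \lambda_k^* \nabla h_k(x^*) = 0
\text{Primal feasibility:} \quad g_j(x^*) \le 0, \;\; h_k(x^*) = 0
\text{Dual feasibility:} \quad \mu_j^* \ge 0
\text{Complementary slackness:} \quad \mu_j^* \, g_j(x^*) = 0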
KKT Conditions - Example

[HR]
KKT Conditions – Example (checking FONC)

[HR]
Optional: Bishop, Appendix E is also a good read to get similar intuition about Lagrange multipliers and FONC.

[CMB]
Exercises: Other constrained optimization
problems!
• Prove that the KL-divergence 𝐾𝐿(𝑝 || 𝑞) is minimized when 𝑞 = 𝑝.
• Prove that entropy 𝐻(𝑝) is maximized when 𝑝 is uniform.
• What is the distance of a point 𝑢 ∈ ℝ𝑑 to the closest point 𝑣 in a
hyperplane given by {𝑥 ∈ ℝ𝑑 ∶ 𝑤 𝑇 𝑥 + 𝑏 = 0}?

[HR]
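(A sketch of the standard answer to the third exercise, via a Lagrange multiplier on the constraint wᵀv + b = 0:)

\min_{v} \tfrac{1}{2}\|v - u\|^2 \;\; \text{s.t.} \;\; w^T v + b = 0
\;\Rightarrow\; v^* = u - \frac{w^T u + b}{\|w\|^2}\, w, \qquad \text{dist}(u, \text{hyperplane}) = \|u - v^*\| = \frac{|w^T u + b|}{\|w\|}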
Having established the KKT cdnts., can we actually (constructively) find the KKT multipliers 𝜆*, 𝜇* that satisfy the KKT cdnts.?
Primal-Dual Relation (via the Minimax Theorem)

[HR]
Function 𝑓̂(·)
• Desired function:

• First attempt:

• Second attempt:

[HR]
[HR]
How to get one solution from the other?

[HR]
How to get one solution from the other?

[HR]
Duality Example: A linear program

[HR]
Duality Example (contd.)

[HR]
Duality Example (contd.)

[HR]
Outline for Module M8
• M8. Classification (Support Vector Machines)
• M8.0 Introduction/Motivation
• M8.1 SVM Problem Statement
• M8.2 SVM Solution
• (Background: Constrained optimization - KKT & Primal-Dual)
• (SVM Dual Problem & Optimization algo. sketch)
• M8.3 SVM Interpretations
• M8.4 Concluding thoughts
SVM: From Primal → Dual

[HR]
SVM: From Primal → Dual

[HR]
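(A sketch of the step carried out in the [HR] slides: form the Lagrangian of the soft-margin primal, with multipliers α_i ≥ 0 for the margin constraints and β_i ≥ 0 for ε_i ≥ 0, and set its gradients to zero:)

L(w, b, \epsilon, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C \sum_i \epsilon_i - \sum_i \alpha_i \big( y_i (w^T x_i + b) - 1 + \epsilon_i \big) - \sum_i \beta_i \epsilon_i
\partial L / \partial w = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i, \qquad \partial L / \partial b = 0 \Rightarrow \sum_i \alpha_i y_i = 0, \qquad \partial L / \partial \epsilon_i = 0 \Rightarrow \alpha_i + \beta_i = C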
SVM Dual Problem

[HR]
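(Substituting the stationarity conditions back into the Lagrangian gives the standard soft-margin dual OP:)

\max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, x_i^T x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0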
Stop & Think: What have we achieved so far?

Next steps:
• This Dual problem can be maximized using SMO or PGD methods much more easily than the Primal problem!
• Then convert the Dual solution to a Primal solution using the KKT conditions

[HR]
Brief aside: Kernel-ize our Dual Problem!

Identity kernel gives back the original Dual Problem!

Recall: Why kernel-ize?

[CMB, HR]
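(In the kernelized dual, the inner products xᵢᵀxⱼ are simply replaced by kernel evaluations k(xᵢ, xⱼ):)

\max_{\alpha} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0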
Brief aside: Kernel SVM – Prediction for a new
data point (using support vectors)
Need to go from one solution (dual: 𝛼 ∗ ) to the other (primal: 𝑤 ∗ , 𝑏 ∗ (, 𝜖 ∗ )).

[HR]
Having found 𝑤* from 𝛼* using KKT (stationarity),
now find 𝑏 also using KKT complementary slackness

[CMB, HR]
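(A sketch of the standard recovery step: by complementary slackness, any support vector x_j with 0 < α_j* < C lies exactly on its margin boundary, giving)

b^* = y_j - \sum_i \alpha_i^* \, y_i \, k(x_i, x_j) \qquad \text{(in practice averaged over all such } j \text{ for numerical stability)}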
How do we optimize the Dual Problem?

[HR]
PGD: Projected Gradient Descent
(actually Ascent)

Note: the optimum can be an interior or a boundary point!

[HR; Also from https://home.ttic.edu/~nati/Teaching/TTIC31070/2015/Lecture16.pdf]
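(A minimal sketch, not from the slides, of projected gradient ascent on the dual, assuming the bias b has been absorbed into w (see the later exercise) so the only remaining constraint is the box 0 ≤ αᵢ ≤ C and the projection is elementwise clipping; function and parameter names are hypothetical.)

```python
import numpy as np

def pga_svm_dual(K, y, C, step=1e-3, n_iters=2000):
    """Projected gradient ascent on the SVM dual
         max_alpha  sum_i alpha_i - 0.5 * alpha^T Q alpha,   Q_ij = y_i * y_j * K_ij,
       subject only to the box constraint 0 <= alpha_i <= C
       (valid when the bias b is absorbed into w, so sum_i alpha_i y_i = 0 is not needed)."""
    n = len(y)
    Q = np.outer(y, y) * K
    alpha = np.zeros(n)
    for _ in range(n_iters):
        grad = np.ones(n) - Q @ alpha                  # gradient of the dual objective
        alpha = np.clip(alpha + step * grad, 0.0, C)   # ascent step + projection onto the box
    return alpha

# Example usage with a linear kernel (X: n x d data matrix, y in {-1, +1}):
# K = X @ X.T
# alpha_star = pga_svm_dual(K, y, C=1.0)
```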


SMO: Sequential Minimal Optimization
• Repeat until convergence:
• Find a KKT multiplier 𝛼₁ that violates the KKT conditions for the OP
• Pick a second multiplier 𝛼₂ and optimize the objective fn. over 𝛼₁, 𝛼₂

[Platt 1998 paper; and Wikipedia on SMO]
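(A sketch of the analytic two-variable update from Platt’s 1998 paper: with errors E_i = f(x_i) − y_i and η = K₁₁ + K₂₂ − 2K₁₂, update α₂ and clip it to the feasible segment [L, H] implied by the box and equality constraints, then adjust α₁ to keep Σᵢ αᵢ yᵢ unchanged:)

\alpha_2^{\text{new}} = \mathrm{clip}\Big( \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta}, \; L, \; H \Big), \qquad \alpha_1^{\text{new}} = \alpha_1 + y_1 y_2 \, (\alpha_2 - \alpha_2^{\text{new}})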


Exercise: Can we absorb intercept/bias 𝑏 into 𝑤?

[HR]
Outline for Module M8
• M8. Classification (Support Vector Machines)
• M8.0 Introduction/Motivation
• M8.1 SVM Problem Statement
• M8.2 SVM Solution
• M8.3 SVM Interpretations
• (Support vectors, Kernels, Loss function view)
• M8.4 Concluding thoughts
Summary so far, and support vectors/kernels

Optimize using PGD/SMO to find 𝛼 ∗

Support Vectors:
• Training data points for which 𝛼𝑖∗ ≠ 0.
• Typically sparse. Why?
Exercise0: What are the possible values of 𝑦𝑖(𝑤ᵀ𝑥𝑖 + 𝑏) (and hence the location of a training point 𝑥𝑖 wrt the DB/MBs) when:
(i) 𝛼𝑖* = 0,
(ii) 𝛼𝑖* = 𝐶,
(iii) 0 < 𝛼𝑖* < 𝐶?
(Hint: use KKT complementary slackness and 𝛽𝑖* = 𝐶 − 𝛼𝑖*) [HR]
Example: the two SV case
Example: the two SV case
Worked-out example:

[HR]
Exercise1:

[HR]
Exercise2:

[HR]
Loss function view: Hinge loss

[HR]
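(The standard rewriting behind the loss-function view: at the optimum the slack is ε_i = max(0, 1 − y_i(wᵀx_i + b)), so the soft-margin primal is equivalent to the unconstrained problem)

\min_{w, b} \;\; C \sum_{i=1}^{n} \max\big(0, \, 1 - y_i (w^T x_i + b)\big) + \tfrac{1}{2}\|w\|^2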
The larger context: loss fn view offers a unified motivation of many classifn. methods
(thereby allowing us to condense a “laundry” list of methods into a single framework)

[HR]
Surrogate loss fns

[HR]
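(Writing z = y·(wᵀx + b), some commonly used surrogates for the 0–1 loss 𝟙[z ≤ 0]:)

\text{hinge: } \max(0, 1 - z), \qquad \text{logistic: } \log(1 + e^{-z}), \qquad \text{exponential: } e^{-z}, \qquad \text{squared: } (1 - z)^2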
Why learn about loss fns view? An example
with outliers!

[HR]
Example: Logistic loss
Exercise: What is hinge loss?

[HR]
SVM Interpretations
• Support vectors
• Kernel machines
• Loss fn. view
Goal we set out: concrete understanding of SVM --
hope we reached it!

[Above formulas from sklearn help pages]


Concluding thoughts
• SVM
• Concept of max-margin classifn., sparse support vectors, & kernel machines.
• Use of constrained optimization, and Primal - Dual problems.
• Extensions: SVR (Support Vector Regression) and RVM (Relevance Vector Machines).

• Next steps: From linear to non-linear regression/classifn.


Non-linear method | (Non-linear) Basis functions | Objective function / OP

Vanilla extn. of linear models | Fixed – non-linear basis fns (feature map, 𝜙: ℝ𝑑 → ℝ𝑑) fixed before seeing training data (manually via feature engineering); only weights of these basis fns learnt using training data. | Convex (unconstrained opt.)

SVM | Selective – center basis fns on training data points (dual/kernel view) and use training data to learn their weights and select a subset of them (non-zero weight support vectors) for eventual predictions. | Convex (constrained opt.)

Neural networks | Adaptive – fix # of basis fns in advance, but allow them to be adaptive; parameterize basis fns and learn these parameters using training data. | Non-convex
[CMB]
Thank you!
Backup
From classification to regression: Support Vector Regression or SVR
(𝜖-insensitive “tube” and obj/loss fns.)

[From CMB; Smola and Scholkopf, 2004 tutorial]


Loss functions drawn to scale

[CMB]
