Support Vector Machines
Mehul Motani
Electrical & Computer Engineering
National University of Singapore
Email: [email protected]
© Mehul Motani Support Vector Machines 1
AI and Machine Learning
• “AI is the new electricity. Just as 100 years ago
electricity transformed industry after industry, AI will
now do the same.” – Andrew Ng
• Machine learning is about learning from data.
• Supervised learning – Learning from data with labels
which serve a supervisory purpose
• Unsupervised learning – Learning from data without
labels, which allows tasks such as clustering.
• Reinforcement learning – Learning from data without
labels, but with feedback (rewards) from the environment.
© Mehul Motani Support Vector Machines 2
Support Vector Machines (SVM)
• SVM is a supervised learning algorithm
– Useful for both classification and regression problems
• Linear SVM – Maximum-Margin Classifier
– Formalize notion of the best linear separator
• Optimization Problem with Lagrangian Multipliers
– Technique to solve a constrained optimization problem
• Nonlinear SVM – Extending Linear SVM with Kernels
– Project data into higher-dimensional space to make it
linearly separable.
– Create nonlinear classifiers by applying the kernel trick to
maximum-margin hyperplanes.
– Complexity: Depends only on the number of training
examples, not on dimensionality of the kernel space!
© Mehul Motani Support Vector Machines 3
SVM – A Brief History
• Pre-1980: Almost all learning methods learned linear decision
surfaces.
– Linear learning methods have nice theoretical properties
• 1980’s: Decision trees and Neural Nets allowed efficient
learning of non-linear decision surfaces
– Little theoretical basis and all suffer from local minima
• 1990’s: Efficient learning algorithms for non-linear functions
based on computational learning theory developed
• Support Vector Machines
– The original SVM algorithm was invented by Vapnik and
Chervonenkis in 1963.
– Nonlinear SVMs using the kernel trick were first introduced in a
conference paper by Boser, Guyon and Vapnik in 1992.
– The SVM with soft margin was proposed by Cortes and Vapnik in
1993 and published in 1995.
© Mehul Motani Support Vector Machines 4
Supervised Learning: Linear Separators
• Binary classification can be viewed as the task of
separating classes in feature space.
• Two features: x1 and x2
• Two classes: red and blue
• Linear separator given by the line:
wTx + b = 0 (1)
where x = [x1 x2]T and w = [w1 w2]T
• Classification
wTx + b < 0 → blue (-1)
wTx + b > 0 → red (+1)
• Classifier function
f(x) = sign(wTx + b)
• New data: Green dot will be classified as red class (+1)
[Figure: data points of the two classes in the (x1, x2) plane, separated by the line wTx + b = 0, with the regions wTx + b > 0 and wTx + b < 0 on either side]
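A minimal NumPy sketch of this classifier; the weights w and b below are illustrative values, not learned ones:

```python
import numpy as np

# Hypothetical separator parameters (in practice these are learned by the SVM).
w = np.array([2.0, -1.0])   # weight vector [w1, w2]
b = -0.5                    # bias term

def classify(x):
    """Return +1 (red) or -1 (blue) depending on which side of w^T x + b = 0 the point lies."""
    return 1 if np.dot(w, x) + b > 0 else -1

# A new data point (the "green dot") gets assigned to one of the two classes.
x_new = np.array([1.5, 0.5])
print(classify(x_new))   # sign(w^T x + b)
```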
© Mehul Motani Support Vector Machines 5
Linear Separators
• There are many possible linear separators!
• Which of the linear separators is optimal?
• The linear SVM solution defines an objective and finds the
linear separator which maximizes that objective.
© Mehul Motani Support Vector Machines 6
Linear Separators and Margin
• Consider three new data points: A, B, C (green dots), all of which are classified as class 0.
• How confident are you that point A is class 0?
• What about point C?
• What about point B?
• Intuitively, we are more confident about point A than point C.
• Intuition: if a point is far from the separating hyperplane (i.e., large margin), then we may be more confident in our prediction.
[Figure: points from Class 0 and Class 1 separated by a hyperplane, with new points A, B, and C at decreasing distances from the boundary]
© Mehul Motani Support Vector Machines 7
Classification Margin
• Data points closest to the hyperplane are called the support
vectors (circled data points)
• The margin ρ of the separator is the width of the band between the support vectors of the two classes.
• Note that the separator is completely defined by its support vectors.
What is the distance, r, from data point xi to the separator?
For x1 and x2 on the separating hyperplane:
wT(x1 − x2) = 0 ⇒ w ⊥ hyperplane (1)
Let Q be the projection of xi onto the hyperplane, i.e., Q = xi − r w/||w||. Since Q lies on the hyperplane:
wT(xi − r w/||w||) + b = 0 ⇒ r = (wTxi + b) / ||w|| (2)
[Figure: separating hyperplane with the support vectors circled, the margin ρ, and the distance r from point xi to the hyperplane]
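A short numerical check of equation (2), using made-up values for w, b, and xi:

```python
import numpy as np

# Illustrative (not learned) hyperplane parameters and a data point.
w = np.array([3.0, 4.0])    # so that ||w|| = 5
b = 1.0
x_i = np.array([2.0, 1.0])

# Distance from x_i to the hyperplane w^T x + b = 0, per equation (2).
r = (np.dot(w, x_i) + b) / np.linalg.norm(w)
print(r)   # (3*2 + 4*1 + 1) / 5 = 2.2

# Sanity check: the projection Q = x_i - r * w/||w|| lies on the hyperplane.
Q = x_i - r * w / np.linalg.norm(w)
print(np.dot(w, Q) + b)     # ~0 up to floating-point error
```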
© Mehul Motani Support Vector Machines 8
Maximum Margin Classification
• Maximizing the margin is provably good and intuitive
– Larger margin leads to lower generalization error (Vapnik).
• Implies that only support vectors matter; other training
examples can be ignored → SVM is stable and robust to outliers
[Figure: maximum-margin separator with margin ρ between the two classes]
© Mehul Motani Support Vector Machines 9
Linear SVM Mathematically
• Let the training set be S = {(xi, yi)}i=1,2,...,n with xi ∈ Rd and yi ∈ {-1, 1}.
• Suppose we have a separating hyperplane with margin ρ,
weight vector w and scalar b.
• Then for each training example (xi, yi):
wTxi + b ≤ -ρ/2 if yi = -1
wTxi + b ≥ ρ/2 if yi = +1    ⇔    yi(wTxi + b) ≥ ρ/2 (1)
• For every support vector xs the above inequality is an equality.
After rescaling w and b by ρ/2 in the equality, we obtain that the
distance between each xs and the hyperplane is
r = ys(wTxs + b) / ||w|| = 1 / ||w|| (2)
• Then the margin can be expressed through the (rescaled) w and b as:
ρ = 2r = 2 / ||w|| (3)
© Mehul Motani Support Vector Machines 10
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
"
Find w and b such that ! = #
is maximized (a)
(1)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1 (b)
• We can reformulate the problem in (1) as follows:
Find w and b such that
Φ(w) = ||w||² = wTw is minimized (2)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1
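One way to see the maximum-margin solution in practice is scikit-learn's linear-kernel SVC with a very large C, which closely approximates the hard-margin problem in (2). A sketch, assuming scikit-learn is installed; the toy data is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set: class +1 on the right, class -1 on the left.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Very large C ~ hard margin: misclassifications are (almost) forbidden.
clf = SVC(kernel='linear', C=1e10).fit(X, y)

w = clf.coef_[0]          # learned weight vector
b = clf.intercept_[0]     # learned bias
print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```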
© Mehul Motani Support Vector Machines 11
Solving the Optimization Problem
Primal: Find w and b such that
Φ(w) =wTw is minimized (1)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1
• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical
programming problems for which several (non-trivial) algorithms exist.
• Solution involves constructing a dual problem where a Lagrange multiplier αi
is associated with every inequality constraint in the primal problem:
Dual: Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0 (2)
(2) αi ≥ 0 for all αi
See https://en.wikipedia.org/wiki/Quadratic_programming
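The dual can also be handed to a general-purpose constrained optimizer. The sketch below uses SciPy's SLSQP, which handles the equality constraint Σαiyi = 0 and the bounds αi ≥ 0; the toy data is made up, and real SVM solvers exploit the QP structure rather than calling a generic routine like this:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# H[i, j] = y_i y_j x_i^T x_j, the matrix appearing in the dual objective.
H = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # Negative of Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    return -(alpha.sum() - 0.5 * alpha @ H @ alpha)

constraints = {'type': 'eq', 'fun': lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * n                               # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(n), method='SLSQP',
               bounds=bounds, constraints=constraints)
alpha = res.x
print("alphas:", np.round(alpha, 4))   # non-zero entries mark the support vectors
```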
© Mehul Motani Support Vector Machines 12
The Optimization Problem Solution
• Given a solution α1…αn to the dual problem, solution to the primal is:
w =Σαiyixi b = yk - Σαiyixi Txk for any αk > 0 (1)
• Each non-zero αi indicates that corresponding xi is a support vector.
• Then the classifying function is (note that we don’t need w explicitly):
f(x) = ΣαiyixiTx + b (2)
• The quantity xTy is called the inner product or dot product between the
vector x and the vector y.
• Notice that the solution relies on the inner product between the test point x
and the support vectors xi – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing
the inner products xiTxj between all training points.
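As a quick check of (1) and (2) with scikit-learn (a sketch, assuming scikit-learn is available): for a fitted linear-kernel SVC, dual_coef_ stores the products αiyi for the support vectors, so w, b, and f(x) can be reconstructed and compared against the library's own decision function.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e10).fit(X, y)

# Reconstruct w = sum_i alpha_i y_i x_i using only the support vectors.
alpha_y = clf.dual_coef_[0]              # alpha_i * y_i for each support vector
w = alpha_y @ clf.support_vectors_
b = clf.intercept_[0]

# f(x) = sum_i alpha_i y_i x_i^T x + b, evaluated at a new point.
x_new = np.array([1.0, 0.5])
f_manual = alpha_y @ (clf.support_vectors_ @ x_new) + b
print(f_manual, clf.decision_function([x_new])[0])   # the two values should agree
```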
© Mehul Motani Support Vector Machines 13
Linear and nonlinear data models
A B C
D E F
© Mehul Motani Support Vector Machines 14
Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called a soft margin.
What should our quadratic optimization criterion be?
Minimize:
(1/2) wTw + C Σk=1..R ξk
The first term maximizes the margin; the second term penalizes misclassifications.
[Figure: soft-margin separator with slack variables ξi and ξj for points on the wrong side of the margin]
Note: ξ is the Greek letter Xi and is pronounced ‘zai’ or ‘ksi’.
© Mehul Motani Support Vector Machines 15
Hard margin vs Soft margin
• The hard-margin SVM formulation:
Find w and b such that
Φ(w) =wTw is minimized (1)
and for all (xi ,yi) ∈ S: yi (wTxi + b) ≥ 1
• Modified soft-margin SVM formulation with slack variables:
Find w and b such that
Φ(w) =wTw + CΣξi is minimized
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1 – ξi, ξi ≥ 0 (2)
• Parameter C can be viewed as a way to control overfitting
– It trades off the relative importance of maximizing the margin and fitting the training
data.
– Larger C → the higher the penalty for misclassifications. This leads to smaller
margins but fewer misclassifications, which amounts to overfitting.
– Smaller C → the lower the penalty for misclassifications. This leads to larger margins
but more misclassifications.
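A small experiment illustrating this trade-off (a sketch assuming scikit-learn; the noisy toy data is made up): with a small C the margin 2/||w|| is wider and more points become support vectors, while a large C shrinks the margin to fit the training data more tightly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping Gaussian blobs (made-up data).
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin={margin:.3f}, "
          f"#support vectors={len(clf.support_)}, "
          f"training accuracy={clf.score(X, y):.2f}")
```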
© Mehul Motani Support Vector Machines 16
Soft Margin Classification – Solution
• Dual problem is identical to separable case:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1)
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Again, xi with non-zero αi will be support vectors.
• Solution to the soft margin SVM is:
w =Σαiyixi
b = yk(1- ξk) - ΣαiyixiTxk for any k s.t. αk > 0 (in practice, pick k with 0 < αk < C so that ξk = 0) (2)
Note: We don’t need to compute w explicitly for classification:
f(x) = ΣαiyixiTx + b
Note: If the 2-norm penalty for slack variables, CΣξi², were used in the primal
objective, we would need additional Lagrange multipliers for the slack variables…
© Mehul Motani Support Vector Machines 17
Theoretical Justification for Maximum Margins
• VC dimension is a measure of the complexity of a classifier. The more
complex the classifier, the more prone it is to overfitting.
• Vapnik proved the following:
The class of optimal linear separators has VC dimension h bounded
from above as
h ≤ min{ ⌈D²/ρ²⌉, m0 } + 1 (1)
where ρ is the margin, D is the diameter of the smallest sphere that
can enclose all of the training examples, and m0 is the dimensionality.
• Intuitively, this implies that regardless of dimensionality m0 we can
minimize the VC dimension by maximizing the margin ρ.
• Thus, complexity of the classifier is kept small regardless of
dimensionality.
© Mehul Motani Support Vector Machines 18
Summary of Linear SVMs
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they
define the hyperplane.
• Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
• Both in the dual formulation of the problem and in the solution,
the training points appear only inside inner products:
Find α1…αN such that
Q(α) = Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi          (1)
f(x) = ΣαiyixiTx + b               (2)
© Mehul Motani Support Vector Machines 19
Non-linear SVMs
Consider this noisy dataset (panel A).
How about this dataset (panel B)?
But what are we going to do if the dataset is just too hard (panel C)?
How about… mapping the data to a higher-dimensional space (panel D)?
[Figure: panels A–D show 1-D datasets along the x-axis; in panel D the data is lifted to the (x, x²) plane, where it becomes linearly separable]
© Mehul Motani Support Vector Machines 20
Non-linear SVMs: Feature spaces
• General idea: the original feature space is mapped to some
higher-dimensional feature space where the training set is
separable:
Lifting function Φ: x → φ(x)
[Figure: panel A shows the original feature space; panel B shows the data mapped by φ into the higher-dimensional feature space, where a separating hyperplane exists]
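A concrete instance of such a lifting (a sketch with made-up 1-D data): points with |x| small belong to one class and |x| large to the other, which no single threshold on x can separate, but the map φ(x) = (x, x²) makes them separable by a horizontal line in the lifted space.

```python
import numpy as np

# 1-D data: class -1 sits near the origin, class +1 further out (not separable by one threshold).
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.7, 2.8, 3.2])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Lifting function phi: x -> (x, x^2); in the lifted plane, the line x^2 = 2 separates the classes.
phi = np.column_stack([x, x ** 2])
predictions = np.where(phi[:, 1] > 2.0, 1, -1)
print(predictions)                          # matches y: the lifted data is linearly separable
print(np.array_equal(predictions, y))       # True
```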
© Mehul Motani Support Vector Machines 21
The “Kernel Trick”
• The linear SVM classifier relies on the inner product between vectors, for
example: K(xi,xj)=xiTxj
• If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes: K(xi,xj)= φ(xi) Tφ(xj)
• A kernel function is a function that is equivalent to an inner product in some
higher dimensional feature space.
• Example: 2-dimensional vectors x = [x1 x2]T
– Let K(xi,xj) = (1 + xiTxj)² (1)
– Need to show that K(xi,xj) = φ(xi)Tφ(xj) for some φ(x):
K(xi,xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2 (2)
= [1 xi1² √2 xi1xi2 xi2² √2xi1 √2xi2] [1 xj1² √2 xj1xj2 xj2² √2xj1 √2xj2]T
⇒ K(xi,xj) = φ(xi)Tφ(xj), where φ(x) = [1 x1² √2 x1x2 x2² √2x1 √2x2]T (3)
• Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
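A quick numerical check of equation (3), with arbitrary 2-D vectors:

```python
import numpy as np

def K(xi, xj):
    # Polynomial kernel (1 + xi^T xj)^2 from equation (1).
    return (1.0 + np.dot(xi, xj)) ** 2

def phi(x):
    # Explicit feature map from equation (3).
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.3, -1.2])
xj = np.array([2.0, 0.5])
print(K(xi, xj), np.dot(phi(xi), phi(xj)))   # the two numbers agree (up to rounding)
```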
© Mehul Motani Support Vector Machines 22
What Functions are Kernels?
• For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be
cumbersome.
• Mercer’s theorem: Every positive semi-definite symmetric function is a
valid kernel.
• A positive semi-definite symmetric function corresponds to a positive
semi-definite symmetric Gram matrix:
K = [ K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
      K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
      …         …         …         …  …
      K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) ]    (1)
Check out the discussion at: https://www.quora.com/What-is-the-kernel-trick
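An empirical version of this check (a sketch, not a proof of Mercer's theorem): build the Gram matrix of a candidate kernel on some sample points and verify that its eigenvalues are non-negative.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # Gaussian (RBF) kernel, which is known to be a valid (positive semi-definite) kernel.
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))          # 20 arbitrary sample points in R^3

# Gram matrix K[i, j] = K(x_i, x_j) as in (1).
G = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(G)       # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to numerical error
```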
© Mehul Motani Support Vector Machines 23
Examples of Kernel Functions
1. Linear: K(xi,xj)= xiTxj
– Mapping Φ: x → φ(x), where φ(x) is x itself
2. Polynomial of power p: K(xi,xj)= (1+ xiTxj)p
– Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) = (d+p choose p) dimensions, where d is the
original feature space dimension.
3. Gaussian (radial-basis function): K(xi,xj) = exp( −||xi − xj||² / (2σ²) )
– Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped
to a function (a Gaussian); combination of functions for support vectors is the
separator.
• Note: the higher-dimensional space still has intrinsic dimensionality d (the mapping is
not onto), but linear separators in it correspond to non-linear separators in the
original space.
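The three kernels above, written out in NumPy (a sketch; the values of p and σ are free parameters chosen here for illustration):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```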
© Mehul Motani Support Vector Machines 24
Non-linear SVMs Mathematically
• Dual problem formulation:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1)
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
• The solution is:
f(x) = ΣαiyiK(xi, x) + b (2)
• Optimization techniques for finding αi’s remain the same!
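A short illustration (a sketch, assuming scikit-learn; the concentric-circles dataset is a standard synthetic example): on data that no linear separator can handle, an SVM with the Gaussian kernel separates the classes well.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear', C=1.0).fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma=1.0, C=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))   # near chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))      # close to 1.0
```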
© Mehul Motani Support Vector Machines 25
Nonlinear SVM - Summary
• In summary, linear SVM locates a separating
hyperplane in the feature space and classifies
points in that space
• Nonlinear SVM lifts the problem to a higher
dimensional space and performs linear SVM in the
higher dimensional space.
• This corresponds to a nonlinear separator in the
original feature space.
• The algorithm does not need to represent the
higher-dimensional space explicitly; it simply defines a
kernel function, which plays the role of the inner
product in the high-dimensional feature space.
© Mehul Motani Support Vector Machines 26
Properties of SVM
• Sparseness of solution when dealing with large data sets as only
support vectors are used to specify the separating hyperplane
• Ability to handle large feature spaces as the complexity does
not depend on the dimensionality of the feature space
• Overfitting can be controlled by soft margin approach
• Mathematically nice – a simple convex optimization problem
which is guaranteed to converge to a single global solution
• Supported by theory and intuition
• SVM empirically works very well
– Text (and hypertext) categorization, image classification,
– Protein classification, Disease classification
– Hand-written character recognition
© Mehul Motani Support Vector Machines 27
Weakness of SVM
• SVM is sensitive to noise
- A relatively small number of mislabeled examples can dramatically decrease
the performance
• Standard SVM only considers two classes
• Question: How to do multi-class classification with SVM?
• Answer: Build multiple SVMs
1. With m classes, learn m SVMs
– SVM 1 learns “Output = 1” vs “Output != 1”
– SVM 2 learns “Output = 2” vs “Output != 2”
– …
– SVM m learns “Output = m” vs “Output != m”
2. To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive
region.
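A minimal one-vs-rest sketch along these lines (assuming scikit-learn; binary SVC is the building block, and the iris dataset stands in for any m-class problem):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)     # 3 classes -> train 3 binary SVMs
classes = np.unique(y)

# SVM k learns "class == k" (+1) vs "class != k" (-1).
models = {k: SVC(kernel='linear', C=1.0).fit(X, (y == k).astype(int) * 2 - 1)
          for k in classes}

def predict(x):
    # Pick the class whose SVM pushes the point furthest into the positive region.
    scores = {k: models[k].decision_function([x])[0] for k in classes}
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print("training accuracy:", (preds == y).mean())
```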
© Mehul Motani Support Vector Machines 28
SVM Summary
• SVMs in their modern kernelized form were proposed by Boser, Guyon and
Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors
(e.g. graphs, sequences, relational data) by designing kernel functions
for such data.
• Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
• Some references on VC-dimension and Support Vector Machines:
• C.J.C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir
Vapnik, Wiley-Interscience, 1998
• https://en.wikipedia.org/wiki/Support_vector_machine
© Mehul Motani Support Vector Machines 29
What do data engineers and thieves have in common?
© Mehul Motani Support Vector Machines 30