A09 Support Vector Machines 2up

The document provides an overview of Support Vector Machines (SVM), a supervised learning algorithm used for classification and regression. It discusses the concepts of linear and nonlinear SVMs, the importance of maximizing the margin for better generalization, and the use of kernel functions to handle non-linear data. Additionally, it covers the mathematical formulation of SVMs, optimization problems, and the significance of support vectors in defining the separating hyperplane.

Support Vector Machines

Mehul Motani
Electrical & Computer Engineering
National University of Singapore
Email: [email protected]

© Mehul Motani Support Vector Machines 1

AI and Machine Learning


• “AI is the new electricity. Just as 100 years ago
electricity transformed industry after industry, AI will
now do the same.” – Andrew Ng
• Machine learning is about learning from data.
• Supervised learning – Learning from data with labels
which serve a supervisory purpose
• Unsupervised learning – Learning from data without
labels, which enables tasks such as clustering.
• Reinforcement learning – Learning from data without
labels, but with feedback from the environment.

© Mehul Motani Support Vector Machines 2


Support Vector Machines (SVM)
• SVM is a supervised learning algorithm
– Useful for both classification and regression problems
• Linear SVM – Maximum-Margin Classifier
– Formalize notion of the best linear separator
• Optimization Problem with Lagrangian Multipliers
– Technique to solve a constrained optimization problem
• Nonlinear SVM – Extending Linear SVM with Kernels
– Project data into higher-dimensional space to make it
linearly separable.
– Create nonlinear classifiers by applying the kernel trick to
maximum-margin hyperplanes.
– Complexity: Depends only on the number of training
examples, not on dimensionality of the kernel space!

© Mehul Motani Support Vector Machines 3

SVM – A Brief History


• Pre-1980: Almost all learning methods learned linear decision
surfaces.
– Linear learning methods have nice theoretical properties
• 1980’s: Decision trees and Neural Nets allowed efficient
learning of non-linear decision surfaces
– Little theoretical basis and all suffer from local minima
• 1990’s: Efficient learning algorithms for non-linear functions
based on computational learning theory developed
• Support Vector Machines
– The original SVM algorithm was invented by Vapnik and
Chervonenkis in 1963.
– Nonlinear SVMs using the kernel trick were first introduced in a
conference paper by Boser, Guyon and Vapnik in 1992.
– The SVM with soft margin was proposed by Cortes and Vapnik in
1993 and published in 1995.

© Mehul Motani Support Vector Machines 4


Supervised Learning: Linear Separators
• Binary classification can be viewed as the task of
separating classes in feature space.
• Two features: x1 and x2, collected in x = [x1 x2]T, with weight vector w = [w1 w2]T
• Two classes: red and blue
• Linear separator given by the line: wTx + b = 0 (1)
  – The line splits the feature plane into the regions wTx + b > 0 and wTx + b < 0
• Classification
  wTx + b < 0 → blue (-1)
  wTx + b > 0 → red (+1)
• Classifier function
  f(x) = sign(wTx + b)
• New data: the green dot in the figure falls on the positive side and is
classified as the red class (+1)
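A minimal NumPy sketch of the classifier function f(x) = sign(wTx + b); the values of w, b, and the query point are illustrative assumptions standing in for the figure.

```python
import numpy as np

def linear_classify(w, b, x):
    """Return +1 or -1 according to f(x) = sign(w^T x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Illustrative values (assumptions, not taken from the slide's figure):
w = np.array([2.0, -1.0])     # weight vector defining the separator w^T x + b = 0
b = -0.5                      # bias term
x_new = np.array([1.0, 0.5])  # the "green dot" query point

print(linear_classify(w, b, x_new))  # +1 -> red class, -1 -> blue class
```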
© Mehul Motani Support Vector Machines 5

Linear Separators
• There are many possible linear separators!
• Which of the linear separators is optimal?
• The linear SVM solution defines an objective and finds the
linear separator which maximizes that objective.

© Mehul Motani Support Vector Machines 6


Linear Separators and Margin
• Consider three new data points A, B, and C (green dots in the figure),
all of which are classified as class 0 (the other region being class 1).
• How confident are you that point A is class 0?
• What about point C?
• What about point B?
• Intuitively, we are more confident
about point A than about point C.
• Intuition: if a point is far from the
separating hyperplane (i.e., it has a large
margin), then we may be more
confident in our prediction.

© Mehul Motani Support Vector Machines 7

Classification Margin
• Data points closest to the hyperplane are called the support
vectors (the circled data points in the figure)
• The margin ρ of the separator is the distance between the support vectors
of the two classes
• Note that the separator is completely defined by its support
vectors.
• What is the distance r from a data point xi to the separator?
  – For any x1 and x2 on the separating hyperplane:
        wT(x1 − x2) = 0  ⇒  w ⊥ hyperplane   (1)
  – Let Q = xi − r (w / ||w||) be the projection of xi onto the hyperplane.
    Since Q lies on the hyperplane:
        wT( xi − r w/||w|| ) + b = 0  ⇒  r = (wTxi + b) / ||w||   (2)
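A short NumPy sketch of equation (2): the signed distance r = (wTxi + b)/||w|| from each data point to the separator. The separator (w, b) and the points are assumed values, chosen only for illustration.

```python
import numpy as np

def distance_to_hyperplane(w, b, X):
    """Signed distance r = (w^T x + b) / ||w|| for each row x of X (eq. (2))."""
    return (X @ w + b) / np.linalg.norm(w)

# Illustrative separator and points (assumed values):
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.5, 0.6], [0.0, 0.0]])
print(distance_to_hyperplane(w, b, X))  # positive on one side, negative on the other
```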
© Mehul Motani Support Vector Machines 8
Maximum Margin Classification
• Maximizing the margin is provably good and intuitive
– Larger margin leads to lower generalization error (Vapnik).
• Implies that only support vectors matter; other training
examples can be ignored → SVM is stable and robust to outliers

© Mehul Motani Support Vector Machines 9

Linear SVM Mathematically


• Let the training set be S = {(xi, yi)}i=1,...,n with xi ∈ Rd and yi ∈ {-1, 1}.
• Suppose we have a separating hyperplane with margin ρ,
weight vector w and scalar b.
• Then for each training example (xi, yi):
wTxi + b ≤ −ρ/2 if yi = −1
wTxi + b ≥ +ρ/2 if yi = +1      ⇔      yi(wTxi + b) ≥ ρ/2   (1)

• For every support vector xs the above inequality is an equality.


After rescaling w and b by ρ/2 in the equality, we obtain that the
distance between each xs and the hyperplane is
      r = ys(wTxs + b) / ||w|| = 1 / ||w||   (2)
• Then the margin can be expressed through the (rescaled) w and b as:
      ρ = 2r = 2 / ||w||   (3)
© Mehul Motani Support Vector Machines 10
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
"
Find w and b such that ! = #
is maximized (a)
(1)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1 (b)

• We can reformulate the problem in (1) as follows:

Find w and b such that


Φ(w) = ||w||2=wTw is minimized (2)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1

© Mehul Motani Support Vector Machines 11

Solving the Optimization Problem


Primal: Find w and b such that
Φ(w) =wTw is minimized (1)
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1

• Need to optimize a quadratic function subject to linear constraints.


• Quadratic optimization problems are a well-known class of mathematical
programming problems for which several (non-trivial) algorithms exist.
• Solution involves constructing a dual problem where a Lagrange multiplier αi
is associated with every inequality constraint in the primal problem:
Dual: Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0 (2)
(2) αi ≥ 0 for all αi

See https://en.wikipedia.org/wiki/Quadratic_programming
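For concreteness, the dual can be handed to a general-purpose solver. The sketch below uses scipy.optimize.minimize (SLSQP) on a tiny made-up dataset; production SVM libraries use dedicated QP or SMO solvers, so this is only an illustration of the problem structure.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (assumed for illustration)
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Minimize -Q(alpha) = -sum(alpha) + 1/2 alpha^T G alpha
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * n                                  # alpha_i >= 0 (hard margin)

res = minimize(neg_dual, x0=np.zeros(n), bounds=bounds, constraints=constraints)
alpha = res.x
print(np.round(alpha, 3))   # non-zero entries mark the support vectors
```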
© Mehul Motani Support Vector Machines 12
The Optimization Problem Solution
• Given a solution α1…αn to the dual problem, solution to the primal is:

w = Σαiyixi ,    b = yk − ΣαiyixiTxk for any αk > 0   (1)

• Each non-zero αi indicates that corresponding xi is a support vector.


• Then the classifying function is (note that we don’t need w explicitly):

f(x) = ΣαiyixiTx + b (2)

• The quantity xTy is called the inner product or dot product between the
vector x and the vector y.
• Notice that the solution relies on the inner product between the test point x
and the support vectors xi – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing
the inner products xiTxj between all training points.
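Assuming alpha, X, and y come from a dual solver (such as the sketch on the previous slide), a minimal sketch of recovering w and b as in (1) and evaluating the classifier (2):

```python
import numpy as np

def recover_primal(alpha, X, y, tol=1e-6):
    """w = sum_i alpha_i y_i x_i,  b = y_k - w^T x_k for any alpha_k > 0 (eq. (1))."""
    w = (alpha * y) @ X
    k = np.argmax(alpha > tol)            # index of one support vector
    b = y[k] - w @ X[k]
    return w, b

def decision_function(alpha, X, y, b, x):
    """f(x) = sum_i alpha_i y_i x_i^T x + b (eq. (2)) -- uses only inner products."""
    return (alpha * y) @ (X @ x) + b

# Example usage with assumed alpha, X, y from a dual solver:
# w, b = recover_primal(alpha, X, y)
# print(np.sign(decision_function(alpha, X, y, b, np.array([1.0, 1.0]))))
```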

© Mehul Motani Support Vector Machines 13

Linear and nonlinear data models

[Figure: six example datasets, panels A–F, illustrating linear and nonlinear data models.]
© Mehul Motani Support Vector Machines 14
Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of
difficult or noisy examples; the resulting margin is called a soft margin.
• What should our quadratic optimization criterion be? Minimize
      (1/2) wTw + C Σk=1..R ξk
  where the first term corresponds to maximizing the margin and the
  second term is the misclassification penalty.
  (In the figure, ξi and ξj are the slack variables of two misclassified points.)
• Note: ξ is the Greek letter xi and is pronounced as ‘zai’ or ‘ksi’.
© Mehul Motani Support Vector Machines 15

Hard margin vs Soft margin


• The hard-margin SVM formulation:
Find w and b such that
Φ(w) =wTw is minimized (1)
and for all (xi ,yi) ∈ S: yi (wTxi + b) ≥ 1
• Modified soft-margin SVM formulation with slack variables:
Find w and b such that
Φ(w) =wTw + CΣξi is minimized
and for all (xi, yi) ∈ S: yi (wTxi + b) ≥ 1 – ξi, ξi ≥ 0 (2)
• Parameter C can be viewed as a way to control overfitting
– It trades off the relative importance of maximizing the margin and fitting the training
data.
– Larger C → higher penalty for misclassifications. This leads to smaller
margins but fewer misclassifications, and can amount to overfitting.
– Smaller C → lower penalty for misclassifications. This leads to larger margins
but more misclassifications.
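A brief illustration of the C trade-off, assuming scikit-learn is available; the synthetic dataset and the particular C values are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Synthetic, slightly overlapping two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])      # rho = 2 / ||w||
    print(f"C={C:>6}: margin={margin:.3f}, "
          f"training accuracy={clf.score(X, y):.2f}, "
          f"#support vectors={len(clf.support_)}")
```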
© Mehul Motani Support Vector Machines 16
Soft Margin Classification – Solution
• Dual problem is identical to separable case:
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and (1)
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Again, xi with non-zero αi will be support vectors.
• Solution to the soft margin SVM is:
w =Σαiyixi
b= yk(1- ξk) - ΣαiyixiTxk for any k s.t. αk>0 (2)
Note: We don’t need to compute w explicitly for classification:
      f(x) = ΣαiyixiTx + b

Note: If the 2-norm penalty for slack variables CΣξi2 was used in primal
objective, we would need additional Lagrange multipliers for slack variables…
© Mehul Motani Support Vector Machines 17

Theoretical Justification for Maximum Margins


• VC dimension is a measure of the complexity of a classifier. The more
complex the classifier, the more prone it is to overfitting.
• Vapnik proved the following:
The class of optimal linear separators has VC dimension h bounded
from above as
      h ≤ min{ ⌈D2/ρ2⌉, m0 } + 1   (1)
where ρ is the margin, D is the diameter of the smallest sphere that
can enclose all of the training examples, and m0 is the dimensionality.
• Intuitively, this implies that regardless of dimensionality m0 we can
minimize the VC dimension by maximizing the margin ρ.
• Thus, complexity of the classifier is kept small regardless of
dimensionality.

© Mehul Motani Support Vector Machines 18


Summary of Linear SVMs
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they
define the hyperplane.
• Quadratic optimization algorithms can identify which training
points xi are support vectors with non-zero Lagrangian
multipliers αi.
• Both in the dual formulation of the problem and in the solution,
the training points appear only inside inner products:
Find α1…αN such that
      Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
      (1) Σαiyi = 0
      (2) 0 ≤ αi ≤ C for all αi            (1)

      f(x) = ΣαiyixiTx + b                 (2)

© Mehul Motani Support Vector Machines 19

Non-linear SVMs
• Consider this noisy 1-D dataset (panel A: points along the x axis).
• How about this dataset (panel B)?
• But what are we going to do if the dataset is just too hard (panel C)?
• How about mapping the data to a higher-dimensional space,
e.g. x → (x, x2) (panel D)?
© Mehul Motani Support Vector Machines 20
Non-linear SVMs: Feature spaces
• General idea: the original feature space (A) is mapped to some
higher-dimensional feature space (B) where the training set is
separable:
      Lifting function Φ: x → φ(x)
  (The figure shows the separating hyperplane in the lifted space B.)
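A tiny NumPy sketch of the lifting idea, using the assumed mapping φ(x) = (x, x2) on made-up 1-D data that is not linearly separable on the line but becomes separable in the lifted plane.

```python
import numpy as np

# 1-D data: class +1 near the origin, class -1 further out (illustrative)
x = np.array([-3.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0])
y = np.array([-1,   -1,    1,    1,   1,   1,   1,  -1,  -1])

# Lifting function phi: x -> (x, x^2)
phi = np.column_stack([x, x**2])

# In the lifted plane the horizontal line x^2 = 2 separates the classes:
w, b = np.array([0.0, -1.0]), 2.0
print(np.sign(phi @ w + b).astype(int))   # matches y, so a linear separator exists
```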
© Mehul Motani Support Vector Machines 21

The “Kernel Trick”


• The linear SVM classifier relies on the inner product between vectors, for
example: K(xi,xj)=xiTxj
• If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes: K(xi,xj)= φ(xi) Tφ(xj)
• A kernel function is a function that is equivalent to an inner product in some
higher dimensional feature space.
• Example: 2-dimensional vectors x=[x1 x2] T
– Let K(xi,xj)=(1 + xiTxj)2 (1)
– Need to show that K(xi,xj)= φ(xi) Tφ(xj) for some φ(x)
  K(xi,xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2   (2)
           = [1  xi12  √2xi1xi2  xi22  √2xi1  √2xi2] [1  xj12  √2xj1xj2  xj22  √2xj1  √2xj2]T
  → K(xi,xj) = φ(xi)Tφ(xj), where φ(x) = [1  x12  √2x1x2  x22  √2x1  √2x2]T   (3)
• Thus, a kernel function implicitly maps data to a high-dimensional space
(without the need to compute each φ(x) explicitly).
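The identity in (3) can be checked numerically. The sketch below compares K(xi, xj) = (1 + xiTxj)2 with the explicit φ(xi)Tφ(xj); the two test vectors are arbitrary assumptions.

```python
import numpy as np

def K(xi, xj):
    """Quadratic kernel K(xi, xj) = (1 + xi^T xj)^2."""
    return (1.0 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the quadratic kernel in 2-D (eq. (3))."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

xi = np.array([0.7, -1.2])   # arbitrary test vectors (assumed)
xj = np.array([2.0, 0.3])
print(K(xi, xj), phi(xi) @ phi(xj))   # the two numbers agree (up to rounding)
```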
© Mehul Motani Support Vector Machines 22
What Functions are Kernels?
• For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be
cumbersome.
• Mercer’s theorem: every symmetric positive semi-definite function is a
valid kernel.
• Symmetric positive semi-definite functions correspond to a symmetric
positive semi-definite Gram matrix:

            K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn)
            K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn)
      K =   …         …         …         …  …              (1)
            K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn)

Check out the discussion at: https://www.quora.com/What-is-the-kernel-trick
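One practical (if informal) way to probe Mercer's condition on a finite sample: build the Gram matrix for a candidate kernel and check that its eigenvalues are non-negative up to numerical tolerance. The RBF kernel and random sample below are illustrative assumptions.

```python
import numpy as np

def gram_matrix(kernel, X):
    """K[i, j] = kernel(x_i, x_j) for all pairs of rows of X (matrix (1))."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rbf = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(6, 2))   # arbitrary sample points
K = gram_matrix(rbf, X)
eigvals = np.linalg.eigvalsh(K)                    # K is symmetric, so eigvalsh applies
print(np.all(eigvals >= -1e-10))                   # True: K is positive semi-definite
```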

© Mehul Motani Support Vector Machines 23

Examples of Kernel Functions


1. Linear: K(xi,xj)= xiTxj
– Mapping Φ: x → φ(x), where φ(x) is x itself

2. Polynomial of power p: K(xi,xj) = (1 + xiTxj)p
   – Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions (the binomial
     coefficient “d+p choose p”), where d is the original feature space dimension.

3. Gaussian (radial-basis function): K(xi,xj) = exp( −||xi − xj||2 / (2σ2) )
   – Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped
     to a function (a Gaussian); the combination of the functions for the support
     vectors is the separator.

4. The higher-dimensional space still has intrinsic dimensionality d (the mapping is
   not onto), but linear separators in it correspond to non-linear separators in the
   original space.
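The three kernels above, written out as plain NumPy functions (a minimal sketch; the parameters p and σ are free choices):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                    # K(xi, xj) = xi^T xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p                       # K(xi, xj) = (1 + xi^T xj)^p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma**2))   # RBF kernel

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])  # arbitrary example vectors
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```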

© Mehul Motani Support Vector Machines 24


Non-linear SVMs Mathematically
• Dual problem formulation:
Find α1…αn such that
Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1)
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

• The solution is:

f(x) = ΣαiyiK(xi, x) + b (2)

• Optimization techniques for finding αi’s remain the same!
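A small sketch of the kernelized decision function (2); the arrays alpha, X, y, b and the kernel function are assumed to come from a solved dual, e.g. the sketches on the earlier slides.

```python
import numpy as np

def kernel_decision(alpha, X, y, b, kernel, x):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b  (eq. (2))."""
    k_vals = np.array([kernel(x_i, x) for x_i in X])
    return (alpha * y) @ k_vals + b

# Usage (with assumed alpha, X, y, b and a kernel such as gaussian_kernel):
# label = np.sign(kernel_decision(alpha, X, y, b, gaussian_kernel, x_new))
```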

© Mehul Motani Support Vector Machines 25

Nonlinear SVM - Summary


• In summary, linear SVM locates a separating
hyperplane in the feature space and classifies
points in that space
• Nonlinear SVM lifts the problem to a higher
dimensional space and performs linear SVM in the
higher dimensional space.
• This corresponds to a nonlinear separator in the
original feature space.
• The algorithm does not need to represent the higher-dimensional
space explicitly; it simply defines a kernel function, which plays
the role of the inner product in the high-dimensional feature space.

© Mehul Motani Support Vector Machines 26


Properties of SVM
• Sparseness of solution when dealing with large data sets as only
support vectors are used to specify the separating hyperplane
• Ability to handle large feature spaces as the complexity does
not depend on the dimensionality of the feature space
• Overfitting can be controlled by soft margin approach
• Mathematically nice – a simple convex optimization problem
which is guaranteed to converge to a single global solution
• Supported by theory and intuition
• SVM empirically works very well
– Text (and hypertext) categorization, image classification,
– Protein classification, Disease classification
– Hand-written character recognition

© Mehul Motani Support Vector Machines 27

Weakness of SVM
• SVM is sensitive to noise
- A relatively small number of mislabeled examples can dramatically decrease
the performance
• Standard SVM only considers two classes
• Question: How to do multi-class classification with SVM?
• Answer: Build multiple SVMs
1. With m classes, learn m SVMs
– SVM 1 learns “Output = 1” vs “Output != 1”
– SVM 2 learns “Output = 2” vs “Output != 2”
– …
– SVM m learns “Output = m” vs “Output != m”
2. To predict the output for a new input, just predict with each SVM and
find out which one puts the prediction the furthest into the positive
region.
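A hedged sketch of the one-vs-rest recipe above, assuming scikit-learn: train one binary SVC per class and predict with the largest decision_function value. (scikit-learn's built-in multi-class handling or OneVsRestClassifier would normally be used instead; the dataset here is synthetic.)

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=150, centers=3, random_state=0)   # synthetic 3-class data
classes = np.unique(y)

# Train one binary SVM per class: "class c" vs "not class c"
models = [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes]

def predict(x):
    # Pick the class whose SVM pushes x furthest into its positive region
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in models]
    return classes[int(np.argmax(scores))]

print(predict(X[0]), "true:", y[0])
```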

© Mehul Motani Support Vector Machines 28


SVM Summary
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in late 1990s.
• SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors
(e.g. graphs, sequences, relational data) by designing kernel functions
for such data.
• Tuning SVMs remains a black art: selecting a specific kernel and
parameters is usually done in a try-and-see manner.
• Some references on VC-dimension and Support Vector Machines:
• C.J.C. Burges. A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir
Vapnik, Wiley-Interscience, 1998
• https://en.wikipedia.org/wiki/Support_vector_machine
© Mehul Motani Support Vector Machines 29

What do data engineers and thieves have in common?

© Mehul Motani Support Vector Machines 30
