Lec06 SVM

SVM - Support Vector Machines

• A classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• In this new, higher-dimensional space, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary")
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

SVM - History and Applications
• Vapnik and colleagues (1992)—groundwork from Vapnik &
Chervonenkis’ statistical learning theory in 1960s
• Features: training can be slow but accuracy is high owing to their ability
to model complex nonlinear decision boundaries (margin maximization)
• Used for both classification and numeric prediction (regression)
• Applications:
– handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests

Linear Classifiers

Consider a two-dimensional dataset with two classes.

How would we classify this dataset?

Linear Classifiers

Both of the lines can be linear classifiers.

Linear Classifiers

There are many lines that can be linear classifiers.

Which one is the optimal classifier?


Classifier Margin

Define the margin of a linear classifier as the width by which the decision boundary could be increased before hitting a data point.

Maximum Margin

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is the simplest kind of SVM (called a Linear SVM).

Support Vectors
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the two parallel hyperplanes that pass through the support vectors of each class.

[Figure: points of the two classes, red (+1) and blue (−1), separated by the hyperplane w·x + b = 0 with margin ρ. The half-space w·x + b > 0 is classified as +1 and w·x + b < 0 as −1, so the classification rule is f(x) = sign(w·x + b). The points lying on the edges of the margin are the support vectors.]

Support Vectors

• Distance from example xi to the separator is r = |wTxi + b| / ||w||, where ||w|| = √(w1² + … + wn²).
SVM - Linearly Separable
• A separating hyperplane can be written as
  W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
  w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin are:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
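As an illustration, a minimal scikit-learn sketch (toy data, scikit-learn assumed available) showing that the support vectors of an approximately hard-margin linear SVM lie on H1 and H2, i.e. satisfy w·x + b ≈ ±1:

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable 2-D data
X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin (linearly separable) case
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors lie on H1 / H2, so w.x + b evaluates to roughly +1 or -1
for sv in clf.support_vectors_:
    print(sv, "->", round(float(w @ sv + b), 3))

print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))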

Linear SVM Mathematically
• Let the training set {(xi, yi)}i=1..n, xi ∈ Rd, yi ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):

  wTxi + b ≤ −ρ/2   if yi = −1
  wTxi + b ≥ +ρ/2   if yi = +1

  or, equivalently, yi(wTxi + b) ≥ ρ/2

• For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is

  r = ys(wTxs + b) / ||w|| = 1 / ||w||

• Then the margin can be expressed through the (rescaled) w and b as:

  ρ = 2r = 2 / ||w||
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:

Find w and b such that
  ρ = 2 / ||w|| is maximized
  and for all (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1

Which can be reformulated as:

Find w and b such that
  Φ(w) = ||w||² = wTw is minimized
  and for all (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1

Solving the Optimization Problem

Find w and b such that
  Φ(w) = wTw is minimized
  and for all (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1
• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical
programming problems for which several (non-trivial) algorithms exist.
• The solution involves constructing a dual problem where a Lagrange multiplier αi is
associated with every inequality constraint in the primal (original) problem:

Find α1…αn such that
  Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
  (1) Σαiyi = 0
  (2) αi ≥ 0 for all αi
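To make this concrete, here is a hedged sketch of solving the dual with the cvxopt QP solver (assuming cvxopt is installed; the function and variable names are illustrative, not part of the slides). cvxopt minimizes ½αᵀPα + qᵀα subject to Gα ≤ h and Aα = b, so the dual above maps to P_ij = yiyjxiTxj, q = −1, G = −I, h = 0, A = yᵀ, b = 0:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Solve the hard-margin SVM dual for a small toy problem (illustrative sketch)."""
    n = X.shape[0]
    K = X @ X.T                                 # Gram matrix of inner products x_i . x_j
    P = matrix(np.outer(y, y) * K)              # P_ij = y_i y_j x_i.x_j
    q = matrix(-np.ones(n))                     # maximizing sum(alpha) == minimizing -sum(alpha)
    G = matrix(-np.eye(n))                      # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint sum(alpha_i y_i) = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).flatten()         # the Lagrange multipliers alpha_i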

The Optimization Problem Solution
• Given a solution α1…αn to the dual problem, the solution to the primal is:

  w = Σαiyixi        b = yk − ΣαiyixiTxk   for any αk > 0

• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function is (note that we don't need w explicitly):

  f(x) = ΣαiyixiTx + b

• Notice that it relies on an inner product between the test point x and the support vectors xi.
• Also keep in mind that solving the optimization problem involved computing the inner products xiTxj between all training points.
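For reference, a fitted scikit-learn SVC stores exactly these quantities: dual_coef_ holds the products αiyi for the support vectors, support_vectors_ holds the corresponding xi, and intercept_ holds b (the toy data below is made up), so f(x) can be reproduced by hand:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_test = np.array([3.0, 2.0])
# f(x) = sum_i (alpha_i y_i) x_i.x + b, summed over the support vectors only
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_test) + clf.intercept_[0]
print(f, clf.decision_function([x_test])[0])   # the two values should agree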

Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.


Soft Margin Classification Mathematically
• The old formulation:

Find w and b such that
  Φ(w) = wTw is minimized
  and for all (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1

• The modified formulation incorporates slack variables:

Find w and b such that
  Φ(w) = wTw + CΣξi is minimized
  and for all (xi, yi), i = 1..n : yi(wTxi + b) ≥ 1 − ξi,  ξi ≥ 0
• Parameter C can be viewed as a way to control overfitting: it “trades off” the relative
importance of maximizing the margin and fitting the training data.
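As an illustration of this trade-off, a small scikit-learn sketch (made-up data with one injected noisy point; scikit-learn assumed available) that compares a small and a large C:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([-1] * 20 + [1] * 20)
X[0] = [2.5, 2.5]   # a "difficult" class -1 point deep inside class +1 territory

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width = {margin:.3f}, support vectors = {len(clf.support_vectors_)}")

# Typically the small C keeps a wide margin and simply absorbs the noisy point as slack,
# while the large C narrows the margin in an attempt to fit it.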

Soft Margin Classification – Solution
• The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables, CΣξi², were used in the primal objective; in that case we would need additional Lagrange multipliers for the slack variables):
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Again, xi with non-zero αi will be support vectors.
• The solution to the dual problem is:

  w = Σαiyixi
  b = yk(1 − ξk) − ΣαiyixiTxk   for any k s.t. αk > 0

• Again, we don't need to compute w explicitly for classification:

  f(x) = ΣαiyixiTx + b
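Relative to the earlier hard-margin cvxopt sketch (again a hypothetical helper, cvxopt assumed installed), only the inequality constraints change: 0 ≤ αi ≤ C becomes two stacked blocks in G and h:

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft(X, y, C):
    """Soft-margin SVM dual: the same QP as the separable case, with box constraints 0 <= alpha_i <= C."""
    n = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # -alpha_i <= 0  and  alpha_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).flatten()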

Theoretical Justification for Maximum Margins
• Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded from above as

  h ≤ min( ⌈D²/ρ²⌉ , m0 ) + 1

where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.

• Intuitively, this implies that regardless of dimensionality m0 we can minimize


the VC dimension by maximizing the margin ρ.

• Thus, complexity of the classifier is kept small regardless of dimensionality.

Linear SVMs: Overview
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they define the
hyperplane.
• Quadratic optimization algorithms can identify which training points xi
are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find α1…αN such that
  Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
  (1) Σαiyi = 0
  (2) 0 ≤ αi ≤ C for all αi

f(x) = ΣαiyixiTx + b
Non-linear SVMs

• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a higher-dimensional space?
  [Figure: 1-D data on the x axis that no single threshold can separate becomes separable once each point is also plotted against x², i.e. mapped to (x, x²).]
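A minimal NumPy sketch of this idea (made-up 1-D data; the mapping φ(x) = (x, x²) and the separating line are illustrative):

import numpy as np

# 1-D data: class +1 sits in the middle, class -1 on both sides,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])

# Map each point to phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])

# In the (x, x^2) plane the horizontal line x^2 = 2.5 separates the classes
w, b = np.array([0.0, -1.0]), 2.5
print(np.sign(phi @ w + b))   # matches y: the mapped data is linearly separable
print(y)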
Non-linear SVMs: Feature spaces

• General idea: the original feature space can always be mapped to some higher-
dimensional feature space where the training set is separable:

Φ: x → φ(x)

The “Kernel Trick”

• The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj


• If every datapoint is mapped into high-dimensional space via some transformation
Φ: x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
• A kernel function is a function that is equivalent to an inner product in some feature space.
• Example:
  2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiTxj)²
  Need to show that K(xi,xj) = φ(xi)Tφ(xj):
  K(xi,xj) = (1 + xiTxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
           = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]T [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
           = φ(xi)Tφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
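A quick numerical check of this identity (the two vectors are made-up values):

import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel K(a, b) = (1 + a.b)^2 in 2-D."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

lhs = (1 + a @ b) ** 2   # kernel computed in the original 2-D space
rhs = phi(a) @ phi(b)    # inner product in the 6-D feature space
print(lhs, rhs)          # both print 4.0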
Examples of Kernel Functions
• Linear: K(xi,xj)= xiTxj
– Mapping Φ: x → φ(x), where φ(x) is x itself

• Polynomial of power p: K(xi,xj) = (1 + xiTxj)^p
  – Mapping Φ: x → φ(x), where φ(x) has (d+p choose p) dimensions

• Gaussian (radial-basis function): K(xi,xj) = exp( −||xi − xj||² / (2σ²) )
  – Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of such functions for the support vectors is the separator.

• The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
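A compact sketch of these kernels as plain functions (the parameter values p, σ and the test vectors are illustrative):

import numpy as np

def linear_kernel(a, b):
    return a @ b

def polynomial_kernel(a, b, p=3):
    return (1 + a @ b) ** p

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b), gaussian_kernel(a, b))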

Non-linear SVMs Mathematically
• Dual problem formulation:

Find α1…αn such that
  Q(α) = Σαi − ½ΣΣαiαjyiyjK(xi, xj) is maximized and
  (1) Σαiyi = 0
  (2) αi ≥ 0 for all αi

• The solution is:

  f(x) = ΣαiyiK(xi, x) + b

• Optimization techniques for finding αi’s remain the same!
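Putting the pieces together, a hedged, self-contained sketch of the kernelized decision function (the support vectors, multipliers αi, bias b, and RBF kernel below are made-up illustrative values):

import numpy as np

def decision_function(x, X_sv, y_sv, alphas, b, kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_sv, X_sv)) + b

rbf = lambda u, v: np.exp(-np.linalg.norm(u - v) ** 2 / 2.0)   # illustrative RBF kernel
X_sv = np.array([[0.0, 1.0], [2.0, 2.0]])                      # made-up support vectors
y_sv = np.array([1, -1])
alphas = np.array([0.7, 0.7])
b = 0.1

print(np.sign(decision_function(np.array([0.5, 1.0]), X_sv, y_sv, alphas, b, rbf)))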

SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks
ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs,
sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik
et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of the αi's at a time, e.g. SMO [Platt '99] and SVMlight [Joachims '99].
• Tuning SVMs remains a black art: selecting a specific kernel and parameters is
usually done in a try-and-see manner.
