Lec06 SVM
SVM - History and Applications
• Vapnik and colleagues (1992); groundwork from Vapnik &
Chervonenkis’ statistical learning theory in the 1960s
• Features: training can be slow but accuracy is high owing to their ability
to model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications:
– handwritten digit recognition, object recognition, speaker
identification, benchmarking time-series prediction tests
Linear Classifiers
Maximum Margin
Support Vectors
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors on the two sides of the hyperplane.
[Figure: separating hyperplane w·x + b = 0 with margin ρ; the region w·x + b > 0 is labeled +1 (red) and the region w·x + b < 0 is labeled −1 (blue); the classifier is f(x) = sign(w·x + b), and the examples on the margin are the support vectors.]
Support Vectors
• Distance from example xi to the separator is  r = yi(wTxi + b) / ||w||
SVM - Linearly Separable
• A separating hyperplane can be written as
• W●X+b=0
– where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
• w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
• H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
• H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
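A minimal sketch of these conditions in NumPy (the data, W, and b below are illustrative assumptions, not values from the lecture):

import numpy as np

# Illustrative 2-D training tuples and a separating hyperplane W·X + b = 0
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.25, 0.25])   # weight vector W = {w1, w2}
b = 0.0                      # bias

scores = X @ w + b                       # W·X + b for every tuple
predictions = np.sign(scores)            # which side of the hyperplane
on_margin = np.isclose(y * scores, 1.0)  # tuples lying on H1 or H2 (support vectors)
print(predictions, on_margin)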
Linear SVM Mathematically
• Let training set {(xi, yi)}i=1..n, xi ∈ Rd, yi ∈ {−1, 1}, be separated by a
hyperplane with margin ρ. Then for each training example (xi, yi):
    wTxi + b ≤ −ρ/2   if yi = −1
    wTxi + b ≥  ρ/2   if yi = +1
which can be combined into  yi(wTxi + b) ≥ ρ/2.
• For every support vector xs the above inequality is an equality. After
rescaling w and b by ρ/2 in the equality, we obtain that the distance between
each xs and the hyperplane is  r = ys(wTxs + b) / ||w|| = 1 / ||w||.
• Then the margin can be expressed through the (rescaled) w and b as:
    ρ = 2r = 2 / ||w||
Linear SVMs Mathematically (cont.)
• Then we can formulate the quadratic optimization problem:
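Maximizing the margin 2/||w|| subject to correct classification of all training examples gives:
    Find w and b such that ρ = 2/||w|| is maximized and
    for all (xi, yi), i = 1..n:  yi(wTxi + b) ≥ 1
which can be reformulated as:
    Find w and b such that Φ(w) = ½ wTw is minimized and
    for all (xi, yi), i = 1..n:  yi(wTxi + b) ≥ 1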
Solving the Optimization Problem
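• We need to optimize a quadratic function subject to linear constraints, a well-studied class of mathematical programming problems.
• The standard route is to construct a dual problem in which a Lagrange multiplier αi is associated with every inequality constraint of the primal:
    Find α1…αn such that
    Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj is maximized and
    (1) Σαiyi = 0
    (2) αi ≥ 0 for all αi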
The Optimization Problem Solution
• Given a solution α1…αn to the dual problem, the solution to the primal is:
    w = Σαiyixi,   b = yk − ΣαiyixiTxk  for any k s.t. αk > 0
  and the classifying function is:
    f(x) = ΣαiyixiTx + b
• Notice that it relies on an inner product between the test point x and the
support vectors xi.
• Also keep in mind that solving the optimization problem involved
computing the inner products xiTxj between all training points.
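A minimal sketch of this classifier in NumPy (the support vectors, multipliers αi, labels, and b are assumed to come from a QP solver):

import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b):
    # f(x) = sum_i alpha_i * y_i * (x_i . x) + b -- only inner products with
    # the support vectors are needed, never w itself
    return np.sum(alphas * labels * (support_vectors @ x)) + b

def svm_classify(x, support_vectors, alphas, labels, b):
    return np.sign(svm_decision(x, support_vectors, alphas, labels, b))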
Soft Margin Classification
• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or
noisy examples; the resulting margin is called a soft margin.
Soft Margin Classification Mathematically
• The old formulation:
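    Find w and b such that Φ(w) = ½ wTw is minimized and
    for all (xi, yi), i = 1..n:  yi(wTxi + b) ≥ 1
• The new formulation incorporates slack variables:
    Find w, b, and ξi ≥ 0 such that Φ(w) = ½ wTw + CΣξi is minimized and
    for all (xi, yi), i = 1..n:  yi(wTxi + b) ≥ 1 − ξi
• The parameter C trades off margin width against training error, i.e., it controls overfitting.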
Soft Margin Classification – Solution
• The dual problem is identical to the separable case (it would not be identical if the 2-norm
penalty for slack variables CΣξi² were used in the primal objective; we would then need
additional Lagrange multipliers for the slack variables):
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
• Again, xi with non-zero αi will be support vectors.
• Solution to the dual problem is:
    w = Σαiyixi
    b = yk(1 − ξk) − ΣαiyixiTxk  for any k s.t. αk > 0
• Again, we don’t need to compute w explicitly for classification:
    f(x) = ΣαiyixiTx + b
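A minimal soft-margin example with scikit-learn’s SVC (the toy data and C value are illustrative assumptions):

import numpy as np
from sklearn.svm import SVC

# Toy data that is not perfectly separable
X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [2, 3]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)  # C penalizes the slack variables
clf.fit(X, y)

print(clf.support_vectors_)  # the xi with non-zero alpha_i
print(clf.dual_coef_)        # alpha_i * y_i for those support vectors
print(clf.intercept_)        # b
print(clf.predict([[1.5, 1.5]]))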
Theoretical Justification for Maximum Margins
• Vapnik has proved the following:
The class of optimal linear separators has VC dimension h bounded from above as
    h ≤ min( ⌈D²/ρ²⌉ , m0 ) + 1
where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of
the training examples, and m0 is the dimensionality.
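• As a numerical illustration (values assumed): with D = 10, ρ = 2 and m0 = 1000, the bound gives h ≤ min(⌈100/4⌉, 1000) + 1 = 26, far smaller than the dimensionality, so maximizing the margin keeps the capacity of the classifier small regardless of the number of features.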
Linear SVMs: Overview
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they define the
hyperplane.
• Quadratic optimization algorithms can identify which training points xi
are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution, training
points appear only inside inner products:
    Q(α) = Σαi − ½ΣΣαiαjyiyjxiTxj        f(x) = ΣαiyixiTx + b
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• How about mapping data to a higher-dimensional space, e.g. x → (x, x²)?
[Figure: 1-D data plotted on the x-axis, and the same data mapped to the (x, x²) plane]
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-
dimensional feature space where the training set is separable:
Φ: x → φ(x)
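A minimal sketch for 2-D inputs, checking that the inner product in the mapped space equals a kernel computed in the original space (the quadratic map φ below is an illustrative choice):

import numpy as np

def phi(x):
    # Quadratic feature map: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))   # inner product in the mapped space
print((x @ z) ** 2)      # K(x, z) = (x·z)^2 computed in the original space; same value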
The “Kernel Trick”
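• The linear classifier relies only on inner products between vectors: K(xi, xj) = xiTxj.
• If every data point is mapped into a higher-dimensional space via Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)Tφ(xj).
• A kernel function computes this inner product directly in the original space, so φ(x) never has to be evaluated explicitly.
• Commonly used kernels include the linear kernel xiTxj, the polynomial kernel (1 + xiTxj)^p, and the Gaussian (RBF) kernel exp(−||xi − xj||² / 2σ²).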
Non-linear SVMs Mathematically
• Dual problem formulation:
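With the kernel K(xi, xj) in place of the inner product xiTxj, the dual becomes:
    Find α1…αN such that
    Q(α) = Σαi − ½ΣΣαiαjyiyjK(xi, xj) is maximized and
    (1) Σαiyi = 0
    (2) 0 ≤ αi ≤ C for all αi
• The solution is f(x) = ΣαiyiK(xi, x) + b; the optimization techniques for finding the αi remain the same.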
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained
increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks
ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs,
sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik
et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• Most popular optimization algorithms for SVMs use decomposition to hill-climb over
a subset of αi’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99]
• Tuning SVMs remains a black art: selecting a specific kernel and its parameters is
usually done by trial and error (see the sketch below).
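A minimal sketch of such trial-and-error tuning via cross-validated grid search in scikit-learn (the synthetic data and parameter grid are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a real classification task
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.1],  # used only by the RBF kernel
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)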