
Support Vector Machines - an Introduction

Ron Meir
Department of Electrical Engineering
Technion, Israel

Sources of Information

Web: http://www.kernel-machines.org/

Tutorial: C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition" (download from the above site)

Books:
- V. Vapnik, Statistical Learning Theory, 1998.
- N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, 2000.
- A. Smola and B. Schölkopf, Learning with Kernels, 2002.

Optimization book: D. Bertsekas, Nonlinear Programming, Second Edition, 1999.


Application Domains

Supervised Learning
- Pattern Recognition - state-of-the-art results for OCR, text classification, biological sequence analysis
- Regression and time series - good results

Unsupervised Learning
- Dimensionality Reduction - non-linear principal component analysis
- Clustering
- Novelty detection

Reinforcement Learning: some preliminary results


Classification I

(Figure: labeled examples '3', '1', '3' and a new example marked '?' to be classified.)

The problem:

Input: x, a feature vector
Label: y ∈ {1, 2, . . . , k}
Data: {(x_i, y_i)}_{i=1}^m

Unknown source: x ∼ p(x)
Target: y = f(x)

Objective: Given a new x, predict y so that the probability of error is minimal


Classification II
The 'Model'

Hypothesis class: H, a set of maps h : R^d → {±1}
Loss: ℓ(y, h(x)) = I[y ≠ h(x)]
Generalization error: L(h) = E{ℓ(Y, h(X))}
Objective: Find h ∈ H which minimizes L(h)

Caveat: We only have the data at our disposal
'Solution': Form an empirical estimator which 'generalizes well'
Question: How can we efficiently construct complex hypotheses with good generalization?
Focus: Two-class problem, y ∈ {−1, +1}
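
To make the empirical counterpart of L(h) concrete, here is a minimal Python sketch (the toy data, the linear hypothesis, and the use of numpy are illustrative assumptions, not part of the slides):

    import numpy as np

    # Toy data: 2-D feature vectors with labels in {-1, +1} (made up for illustration).
    X = np.array([[0.5, 1.2], [2.0, 0.3], [-1.0, -0.7], [-0.4, 0.9]])
    y = np.array([+1, +1, -1, -1])

    # A candidate linear hypothesis h(x) = sgn(w^T x + b); w and b are arbitrary here.
    w, b = np.array([1.0, 0.5]), -0.2
    predictions = np.sign(X @ w + b)

    # Empirical 0-1 risk: the fraction of examples on which h disagrees with the label.
    empirical_risk = np.mean(predictions != y)
    print("empirical 0-1 risk:", empirical_risk)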


Linearly Separable Classes


    y = sgn(w^T x + b) =  +1  if w^T x + b > 0
                          −1  if w^T x + b ≤ 0

Problem: Many solutions! Some are very poor.
Task: Based on the data, select a hyper-plane which works well 'in general'.


Selection of a Good Hyper-Plane

Objective: Select a 'good' hyper-plane using only the data!

Intuition (Vapnik 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data


VC Dimension

Given: H = {h : R^d → {−1, +1}}
Question: How complex is the class?

Shattering: H shatters a set X if H achieves all dichotomies on X

(Figure: point configurations in the plane and the number of dichotomies achieved by lines on them: all 8 dichotomies, 14 dichotomies, 6 dichotomies; VCdim = 3.)

VC-dimension: The size of the largest shattered subset of X

Hyper-planes: VCdim(H) = d + 1
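
As a sanity check of the VCdim(H) = d + 1 claim for d = 2, the following sketch tries to realize all 8 labelings of 3 non-collinear points with a linear classifier (scikit-learn and the hard-margin approximation C = 1e6 are assumptions made for illustration):

    from itertools import product

    import numpy as np
    from sklearn.svm import LinearSVC

    # Three non-collinear points in R^2.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

    shattered = True
    for labels in product([-1, +1], repeat=3):
        y = np.array(labels)
        if len(set(labels)) == 1:
            continue  # a one-class labeling is trivially realized by a far-away line
        clf = LinearSVC(C=1e6).fit(X, y)  # large C approximates a hard separator
        if clf.score(X, y) < 1.0:         # some point is misclassified
            shattered = False
    print("all 8 dichotomies realized by lines:", shattered)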


What is the True Performance?

For h ∈ H:
L(h) - probability of misclassification
L̂_n(h) - empirical fraction of misclassifications

Vapnik and Chervonenkis 1971: For any distribution, with probability 1 − δ, for all h ∈ H,

    L(h) < L̂_n(h) + c sqrt( (VCdim(H) log n + log(1/δ)) / n )

where the first term is the empirical error and the second term is a complexity penalty.

(Figure: illustration of the empirical error L̂_n(h) versus the true error L(h).)
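
For a feel of how the complexity penalty behaves, here is a small numeric sketch (the constant c is left unspecified on the slide; taking c = 1 below is an arbitrary assumption):

    import numpy as np

    def vc_complexity_penalty(vc_dim, n, delta, c=1.0):
        # c * sqrt((VCdim * log n + log(1/delta)) / n), with c = 1 as a placeholder.
        return c * np.sqrt((vc_dim * np.log(n) + np.log(1.0 / delta)) / n)

    # The penalty shrinks with the sample size n and grows with the VC dimension.
    for n in (100, 1000, 10000):
        print(n, round(vc_complexity_penalty(vc_dim=10, n=n, delta=0.05), 3))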


An Improved VC Bound I
Hyper-plane: H(w, b) = {x : w^T x + b = 0}

Distance of a point from a hyper-plane:

    d(x, H(w, b)) = |w^T x + b| / ||w||

(Figure: a point x and its distance d(x, H(w, b)) to the hyper-plane.)

Optimal hyper-plane (linearly separable case):

    max_{w,b} min_{1≤i≤n} d(x_i, H(w, b))
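
A short sketch of the quantity being maximized, i.e. the smallest distance of any data point to H(w, b) (numpy and the toy numbers are assumptions for illustration):

    import numpy as np

    def geometric_margin(X, w, b):
        # min_i d(x_i, H(w, b)) = min_i |w^T x_i + b| / ||w||
        return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

    X = np.array([[2.0, 2.0], [3.0, 3.5], [-1.0, -1.5], [-2.0, -0.5]])  # toy points
    w, b = np.array([1.0, 1.0]), 0.0                                    # candidate hyper-plane
    print("margin of the candidate hyper-plane:", geometric_margin(X, w, b))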


An Improved VC Bound II
Canonical hyper-plane:

    min_{1≤i≤n} |w^T x_i + b| = 1

(No loss of generality.)

Improved VC bound (Vapnik 95): The VC dimension of the set of canonical hyper-planes such that

    ||w|| ≤ A,   x_i ∈ ball of radius L

is

    VCdim ≤ min(A² L², d) + 1

Observe: The constraints reduce the VC-dimension bound; canonical hyper-planes with minimal norm yield the best bound.
Suggestion: Use the hyper-plane with minimal norm.


The Optimization Problem I


Canonical hyper-planes (x_i, w ∈ R^d):

    min_{1≤i≤n} |w^T x_i + b| ≥ 1

Support vectors:

    {x_i : |w^T x_i + b| = 1}

(Figure: the separating hyper-plane and the margin between the hyper-planes through the support vectors.)

Margin: The distance between the hyper-planes defined by the support vectors


The Optimization Problem II


Distance from a support vector to H(w, b):

    (w^T x_i + b) / ||w|| = ±1 / ||w||

Margin:

    | 1/||w|| − (−1)/||w|| | = 2/||w||

The resulting optimization problem:

    minimize    (1/2) w^T w
    subject to  y_i (w^T x_i + b) ≥ 1,   i = 1, 2, . . . , n

1. Convex quadratic program
2. Linear inequality constraints (many!)
3. d + 1 parameters, n constraints
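
The primal problem can be handed to any convex QP solver; below is a minimal sketch using the cvxpy modelling library (cvxpy and the toy separable data are assumptions, not part of the slides):

    import cvxpy as cp
    import numpy as np

    # Toy, linearly separable data in R^2.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.5]])
    y = np.array([+1, +1, +1, -1, -1, -1])

    w = cp.Variable(2)
    b = cp.Variable()
    # minimize (1/2) w^T w   subject to   y_i (w^T x_i + b) >= 1
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                         [cp.multiply(y, X @ w + b) >= 1])
    problem.solve()

    print("w =", w.value, " b =", float(b.value),
          " margin =", 2.0 / np.linalg.norm(w.value))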


Convex Optimization
Problem:

    minimize    f(x)
    subject to  h_i(x) = 0,  i = 1, . . . , m,
                g_j(x) ≤ 0,  j = 1, . . . , r

Active constraints: A(x) = {j : g_j(x) = 0}

Lagrangian:

    L(x, λ, µ) = f(x) + Σ_{i=1}^m λ_i h_i(x) + Σ_{j=1}^r µ_j g_j(x)

Sufficient conditions for a minimum (KKT): Let x* be a local minimum. Then there exist λ*, µ* such that

    ∇_x L(x*, λ*, µ*) = 0
    µ_j* ≥ 0,   j = 1, 2, . . . , r
    µ_j* = 0    for all j ∉ A(x*)


The Dual Problem I


Motivation:
- Many inequality constraints
- High (sometimes infinite) input dimension

Primal problem:

    minimize    f(x)
    subject to  e_i^T x = d_i,  i = 1, . . . , m,
                a_j^T x ≤ b_j,  j = 1, . . . , r

Lagrangian:

    L_P(x, λ, µ) = f(x) + Σ_{i=1}^m λ_i (e_i^T x − d_i) + Σ_{j=1}^r µ_j (a_j^T x − b_j)


The Dual Problem II


Dual Lagrangian:

    L_D(λ, µ) = inf_x L_P(x, λ, µ)

Dual problem:

    maximize_{λ,µ}  L_D(λ, µ)
    subject to      µ ≥ 0

Observations:
- L_P(x, λ, µ) quadratic ⇒ L_D(λ, µ) quadratic
- The constraints in the dual are greatly simplified
- m + r variables, r constraints

Duality theorem: The optimal values of the primal and dual problems coincide


SVM in the Primal Space I

    minimize_{w,b}  (1/2) ||w||²
    subject to      y_i (w^T x_i + b) ≥ 1,   i = 1, . . . , n.

    L_P(w, b, α) = (1/2) ||w||² − Σ_{i=1}^n α_i [ y_i (w^T x_i + b) − 1 ]

Solution:

    w = Σ_{i=1}^n α_i y_i x_i
    0 = Σ_{i=1}^n α_i y_i        (α_i ≥ 0)

KKT condition:

    α_i = 0 unless y_i (w^T x_i + b) = 1

Sparsity: Often many α_i vanish!


SVM in the Primal Space II


Recall:

    w = Σ_{i=1}^n α_i y_i x_i

(Figure: the separating hyper-plane, the margin, and the support vectors on the margin.)

Support vectors: All x_i for which α_i > 0; this occurs only if the corresponding constraint is obeyed with equality


Support Vectors in the Dual Space

    maximize    L_D(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
    subject to  Σ_{i=1}^n α_i y_i = 0 ;   α_i ≥ 0

Determination of b: For the support vectors

    y_i (w^T x_i + b) = 1

Thus

    b* = −(1/2) ( min_{y_i = +1} {w*^T x_i} + max_{y_i = −1} {w*^T x_i} )

Classifier (recall w = Σ_i α_i y_i x_i):

    f(x) = sgn( Σ_{i=1}^n α_i* y_i x_i^T x + b )
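
The relations w = Σ_i α_i y_i x_i and the recovery of b can be checked numerically; the sketch below uses scikit-learn's SVC (an assumption, with a very large C approximating the hard-margin case) and rebuilds w from the dual coefficients:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
                  [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.5]])  # toy separable data
    y = np.array([+1, +1, +1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # dual_coef_ stores alpha_i * y_i for the support vectors, so
    # w = sum_i alpha_i y_i x_i is recovered from the dual solution.
    w_from_dual = (clf.dual_coef_ @ clf.support_vectors_).ravel()
    print("support vectors:", clf.support_vectors_)
    print("w from dual:", w_from_dual, " w from sklearn:", clf.coef_.ravel())
    print("b:", clf.intercept_[0])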


Non-Separable Case I

Objective: Find a good separating hyper-plane for the non-separable case
Problem: Cannot satisfy y_i (w^T x_i + b) ≥ 1 for all i
Solution: Slack variables

    w^T x_i + b ≥ +1 − ξ_i   for y_i = +1,
    w^T x_i + b ≤ −1 + ξ_i   for y_i = −1,
    ξ_i ≥ 0,   i = 1, 2, . . . , n.

An error occurs if ξ_i > 1. Thus,

    Σ_{i=1}^n I(ξ_i > 1) = # errors


Non-Separable Case II
Proposed solution: Minimize

    (1/2) ||w||² + C Σ_{i=1}^n I(ξ_i > 1)        (non-convex!)

Suggestion: Replace I(ξ_i > 1) by ξ_i (an upper bound):

    minimize_{w,b,ξ}  L_P(w, ξ) = (1/2) ||w||² + C Σ_{i=1}^n ξ_i
    subject to        y_i (w^T x_i + b) ≥ 1 − ξ_i
                      ξ_i ≥ 0

Tradeoff: Large C penalizes errors, small C penalizes complexity

Dual problem: Same as in the separable case, except that 0 ≤ α_i ≤ C

Support vectors: α_i > 0 - but we lose the geometric interpretation!
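
A brief sketch of the C tradeoff on overlapping (non-separable) classes, using scikit-learn's SVC with a linear kernel (the library and the synthetic data are assumptions for illustration):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two overlapping Gaussian clouds: not linearly separable.
    X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
                   rng.normal(+1.0, 1.2, size=(50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)

    # Large C penalizes slack (training errors); small C favours a larger margin.
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print("C =", C,
              " support vectors:", clf.support_.size,
              " training error:", round(float(np.mean(clf.predict(X) != y)), 2))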


Non-Separable Case III


Solution:

    w = Σ_{i=1}^n α_i y_i x_i

    f(x) = sgn( Σ_i α_i y_i x_i^T x + b )

KKT conditions:

    0 = Σ_{i=1}^n α_i y_i
    0 = α_i [ y_i (w^T x_i + b) − 1 + ξ_i ]
    0 = (C − α_i) ξ_i

Support vectors: characterized by α_i > 0


Non-Separable Case IV
Two types of support vectors. Recall:

    α_i [ y_i (w^T x_i + b) − 1 + ξ_i ] = 0
    (C − α_i) ξ_i = 0

Margin vectors:

    0 < α_i < C  ⇒  ξ_i = 0  ⇒  d(x_i, H(w, b)) = 1/||w||

Non-margin vectors: α_i = C
- Errors: ξ_i > 1, misclassified
- Non-errors: 0 ≤ ξ_i ≤ 1, correctly classified but within the margin
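
The two types can be read off a fitted model from the dual coefficients, since |dual coefficient| = α_i for each support vector; the sketch below (scikit-learn is an assumption) counts margin vectors (α_i < C) and non-margin vectors (α_i = C):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),
                   rng.normal(+1.0, 1.2, size=(50, 2))])  # overlapping classes
    y = np.array([-1] * 50 + [+1] * 50)

    C = 1.0
    clf = SVC(kernel="linear", C=C).fit(X, y)

    alphas = np.abs(clf.dual_coef_).ravel()   # alpha_i for each support vector
    margin_sv = np.sum(alphas < C - 1e-8)     # 0 < alpha_i < C  ->  xi_i = 0, on the margin
    bound_sv = np.sum(alphas >= C - 1e-8)     # alpha_i = C      ->  within the margin or an error
    print("margin support vectors:", margin_sv, " non-margin support vectors:", bound_sv)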


Non-Separable Case V

(Figure: data points with the three types of support vectors labeled 1, 2 and 3, as listed below.)

Support vectors:
1. margin s.v.: ξ_i = 0, correct
2. non-margin s.v.: ξ_i < 1, correct (inside the margin)
3. non-margin s.v.: ξ_i > 1, error

Problem: We lose the clear geometric intuition and sparsity


Non-linear SVM I

Linear separability: More likely in high dimensions

Mapping: Map the input into a high-dimensional feature space via Φ

Classifier: Construct a linear classifier in the feature space

Motivation: An appropriate choice of Φ leads to linear separability. Non-linearity and high dimension are essential (Cover '65)!

    Φ : R^d → R^D   (D ≫ d)
    x ↦ Φ(x)

Hyper-plane condition: w^T Φ(x) + b = 0

Inner products: x^T z ↦ Φ(x)^T Φ(z)


Non-Linear SVM II

Decisions in input and feature space: The problem becomes linearly separable in feature space (figure from Schölkopf & Smola 2002):

(Figure: data in the input space R² with coordinates (x_1, x_2) are mapped by Φ : R² → R³, (x_1, x_2) ↦ (x_1², x_2², √2 x_1 x_2), into a feature space where a separating hyper-plane with weights (w_1, w_2, w_3) exists; the resulting decision function is f(x) = sgn(w_1 x_1² + w_2 x_2² + w_3 √2 x_1 x_2 + b).)

Recall that for the linear SVM

    f(x) = Σ_{i=1}^d w_i x_i + b ;   w = Σ_{i=1}^n α_i y_i x_i


Non-Linear SVM III

We obtained

    f(x) = Σ_{i=1}^n α_i y_i x_i^T x + b

In feature space:

    f(x) = Σ_{i=1}^n α_i y_i Φ(x_i)^T Φ(x) + b

Kernel: A symmetric function K : R^d × R^d → R

Inner product kernels: In addition,

    K(x, z) = Φ(x)^T Φ(z)

Motivation: Φ(x) ∈ R^D, where D may be very large - inner products are expensive


Non-Linear Support Vectors IV

Examples:

Linear mapping: Φ(x) = Ax

    K(x, z) = (Ax)^T (Az) = x^T A^T A z = x^T B z

Quadratic map: Φ(x) = {(x_i x_j)}_{i,j=1}^{d,d}

    K(x, z) = (x^T z)²
            = ( Σ_{i=1}^d x_i z_i )²
            = Σ_{i,j=(1,1)}^{(d,d)} (x_i x_j)(z_i z_j)

Objective: Work directly with kernels, avoiding the mapping Φ

Question: Under what conditions is K(x, z) = Φ(x)^T Φ(z)?
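
The quadratic-map example can be verified numerically: the kernel value (x^T z)² equals the inner product of the explicit d²-dimensional feature vectors (numpy and the random vectors are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=4), rng.normal(size=4)   # two arbitrary vectors in R^4

    def phi(v):
        # Explicit quadratic feature map: all products v_i * v_j, i.e. D = d^2 coordinates.
        return np.outer(v, v).ravel()

    kernel_value = (x @ z) ** 2            # K(x, z) = (x^T z)^2, no feature map needed
    feature_value = phi(x) @ phi(z)        # Phi(x)^T Phi(z), computed explicitly
    print(kernel_value, feature_value)     # the two agree up to rounding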


Mercer Kernels I
Assumptions:

1. K(x, z) is a continuous symmetric function
2. K is positive definite: for any f ∈ L₂ that is not identically zero,

    ∫∫ f(x) K(x, z) f(z) dx dz > 0

Mercer's theorem:

    K(x, z) = Σ_j λ_j ψ_j(x) ψ_j(z)

    ∫ K(x, z) ψ_j(z) dz = λ_j ψ_j(x)

Conclusion: Let φ_j(x) = √λ_j ψ_j(x); then

    K(x, z) = Φ(x)^T Φ(z)


Mercer Kernels II
Classifier:

    f(x) = sgn( Σ_{i=1}^n α_i y_i Φ(x_i)^T Φ(x) + b )
         = sgn( Σ_{i=1}^n α_i y_i K(x_i, x) + b )

The gain: Implement a (possibly infinite-dimensional) mapping, but do all calculations in a finite dimension:

    maximize    Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
    subject to  Σ_{i=1}^n α_i y_i = 0 ;   0 ≤ α_i ≤ C

Observe: The only difference from the linear case is in the kernel - the optimization task is unchanged!
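
Since only kernel values enter the dual, a solver can be fed a precomputed Gram matrix; a minimal sketch using scikit-learn's SVC with kernel="precomputed" (the library, the toy labels and the quadratic kernel choice are assumptions):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy non-linear labels

    # Gram matrix of the quadratic kernel K(x, z) = (x^T z + 1)^2.
    K = (X @ X.T + 1.0) ** 2

    # The solver only ever sees kernel values; the dual QP is unchanged.
    clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
    print("training accuracy:", clf.score(K, y))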


Kernel Selection I
Simpler Mercer condition: For any finite set of points, the matrix K_ij = K(x_i, x_j) is positive definite:

    v^T K v > 0      for all v ≠ 0

Classifier:

    f(x) = sgn( Σ_{i=1}^n α_i y_i K(x_i, x) + b )

Constructing kernels: Assume K_1 and K_2 are kernels and f is a real-valued function. Then the following are also kernels:

1. K_1(x, z) + K_2(x, z)
2. K_1(x, z) K_2(x, z)
3. f(x) f(z)
4. K_3(Φ(x), Φ(z))


Kernel Selection II
Explicit constructions:

1. p(K_1(x, z)) - a polynomial with positive coefficients
2. exp[K(x, z)]
3. exp(−||x − z||² / 2σ²)

Some standard kernels:

    Polynomial:  (x^T x' + 1)^p
    Gaussian:    exp(−||x − x'||² / 2σ²)
    Splines:     piece-wise polynomials between knots
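
The finite-set Mercer condition from the previous slide can be checked directly for, e.g., the Gaussian kernel by forming a Gram matrix and inspecting its eigenvalues (a numpy sketch; the sample points and σ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(30, 5))        # arbitrary sample points
    sigma = 1.5

    # Gaussian Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Positive (semi-)definiteness on this finite set: all eigenvalues >= 0.
    print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())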


Kernel Selection III


Adaptive kernels?

- Determine kernel parameters (e.g. the width) using cross-validation (see the sketch after this list)
- Improved bounds that take into account properties of the kernel
- Use invariances to constrain the kernels

Applications: State of the art in many domains; can deal with huge data sets.

- Hand-writing recognition
- Text classification
- Bioinformatics
- Many more
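
A minimal cross-validation sketch for choosing the Gaussian kernel width and C, using scikit-learn's GridSearchCV (the library, the grid values and the synthetic data are assumptions, not part of the slides):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # toy non-linear labels

    # gamma plays the role of 1/(2 sigma^2) for the Gaussian (RBF) kernel.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
    print("best parameters:", search.best_params_,
          " cross-validated accuracy:", round(search.best_score_, 3))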


The Kernel Trick - Summary

- Can be thought of as a non-linear similarity measure
- Can be used with any algorithm that uses only inner products
- Allows constructing non-linear classifiers using only linear algorithms
- Can be applied to vectorial and non-vectorial data (e.g. tree and string structures)


SVM and Penalization I


Recall the linear problem:

    minimize_{w,b,ξ}  L_P(w, ξ) = (1/2) ||w||² + C Σ_{i=1}^n ξ_i
    subject to        y_i (w^T x_i + b) ≥ 1 − ξ_i
                      ξ_i ≥ 0

Let [u]_+ = max(0, u). One can show the equivalence to

    minimize_{w,b}  Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ ||w||²
    subject to      f(x) = x^T w + b

This has the classic form of regularization theory:

    empirical error + regularization term
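
The equivalent penalized form is easy to evaluate directly; a small numpy sketch (the data and parameter values are made up for illustration):

    import numpy as np

    def hinge_objective(w, b, X, y, lam):
        # sum_i [1 - y_i (w^T x_i + b)]_+  +  lam * ||w||^2
        margins = y * (X @ w + b)
        return np.sum(np.maximum(0.0, 1.0 - margins)) + lam * np.dot(w, w)

    X = np.array([[1.0, 2.0], [-1.5, 0.5], [0.2, -1.0]])   # toy data
    y = np.array([+1, -1, -1])
    print(hinge_objective(np.array([0.5, -0.3]), 0.1, X, y, lam=0.01))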


SVM and Penalization II

    minimize_{w,b}  Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ ||w||²

- Maximizes the distance of points from the hyper-plane
- Correctly classified points may also be penalized (if they fall within the margin)

(Figure: the hinge loss [1 − y f(x)]_+ as a function of y f(x), upper-bounding the 0-1 loss I[y f(x) < 0].)

Suggestion: Consider a general formulation

    minimize_{w,b}  Σ_{i=1}^n ℓ(y_i, f(x_i)) + λ ||w||²
    subject to      f(x) = Φ(x)^T w + b


SVM and Penalization III


Recall

    minimize_{w,b}  Σ_{i=1}^n ℓ(y_i, f(x_i)) + λ ||w||²      (∗)
    subject to      f(x) = Φ(x)^T w + b

Representer theorem (special case): For any loss function ℓ(y, f(x)), the solution of (∗) is of the form

    f(x) = Σ_{i=1}^n α_i K(x_i, x) + b

Compression bounds: Sparsity improves generalization! (A version of Occam's razor.)


SVM Regression I

Data: (x_1, y_1), . . . , (x_n, y_n),  y_i ∈ R

Loss: Chosen in an attempt to achieve sparsity

(Figure: the ε-insensitive loss as a function of y − (w^T x + b): zero inside the tube [−ε, +ε] and linear outside it.)

Achieved: Only badly predicted points contribute

(Figure: regression fits with the ε-tube; points outside the tube incur slack ξ.)


SVM Regression II

Dual problem: Very similar to the classification case

Solution: Solve a quadratic optimization problem in the 2n variables α_i, α_i*, i = 1, 2, . . . , n:

    f(x) = Σ_{i=1}^n (α_i* − α_i) K(x_i, x) + b

Sparsity: Only data for which α_i ≠ α_i* contribute. This occurs only if

    |f(x_i) − y_i| ≥ ε      (outside the tube)
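
The sparsity mechanism can be seen by varying ε in scikit-learn's SVR (the library and the noisy 1-D toy data are assumptions): a wider tube leaves fewer points outside it, hence fewer support vectors.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(5)
    X = np.sort(rng.uniform(0.0, 1.0, size=(60, 1)), axis=0)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.1, size=60)   # noisy 1-D target

    # A larger epsilon-tube means fewer points outside it, hence fewer support vectors.
    for eps in (0.05, 0.1, 0.2, 0.5):
        reg = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
        print("epsilon =", eps, " support vectors:", reg.support_.size)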


SVM Regression III

Effect of ε: fits for ε = 0.1, 0.2, 0.5

Improved regression: ε-adaptive procedures

(Figures: regression fits for the different values of ε; left - zero noise, right - noise level σ = 0.2. Figures from Schölkopf and Smola 2002.)


Practically Useful Bounds?

Practically useful bounds can be obtained using:

- Data-dependent complexities
- Specific learning algorithms
- A great deal of ingenuity

(Figures: test error versus the span-based prediction, plotted (a) as a function of log σ and (b) as a function of log C.)

From: "Bounds on Error Expectations for Support Vector Machines", V. Vapnik and O. Chapelle, 2001 (200 examples in (b)).


Summary I
Advantages

- Systematic implementation through quadratic programming (very efficient implementations exist)
- Excellent data-dependent generalization bounds exist
- Regularization is built into the cost function
- Statistical performance is independent of the dimension of the feature space
- Theoretically related to the widely studied fields of regularization theory and sparse approximation
- Fully adaptive procedures are available for determining the hyper-parameters


Summary II
Drawbacks

- Treatment of the non-separable case is somewhat heuristic
- The number of support vectors may depend strongly on the kernel type and the hyper-parameters
- Systematic choice of kernels is difficult (prior information) - some ideas exist
- Optimization may require clever heuristics for large problems


Summary III
Extensions

- Online algorithms
- Systematic choice of kernels using generative statistical models
- Applications to:
  - Clustering
  - Non-linear principal component analysis
  - Independent component analysis
- Generalization bounds are constantly improving (some are even practically useful!)
