
Lecture 3: Dual problems and Kernels

C4B Machine Learning Hilary 2011 A. Zisserman

• Primal and dual forms

• Linear separability revisited

• Feature mapping

• Kernels for SVMs


• Kernel trick
• requirements
• radial basis functions

SVM – review
• We have seen that for an SVM learning a linear classifier

f(x) = w^T x + b
is formulated as solving an optimization problem over w :
\min_{w \in R^d} \|w\|^2 + C \sum_{i=1}^{N} \max(0, 1 - y_i f(x_i))
• This quadratic optimization problem is known as the primal problem.

• Instead, the SVM can be formulated to learn a linear classifier


f(x) = \sum_{i=1}^{N} \alpha_i y_i (x_i^T x) + b

by solving an optimization problem over α_i.

• This is known as the dual problem, and we will look at the advantages
of this formulation.
Sketch derivation of dual form
The Representer Theorem states that the solution w can always be
written as a linear combination of the training data:
w = \sum_{j=1}^{N} \alpha_j y_j x_j

Proof: see example sheet.

Now, substitute for w in f(x) = w^T x + b:

f(x) = \left( \sum_{j=1}^{N} \alpha_j y_j x_j \right)^T x + b = \sum_{j=1}^{N} \alpha_j y_j (x_j^T x) + b

and for w in the cost function \min_w \|w\|^2 subject to y_i (w^T x_i + b) \ge 1, \forall i:

\|w\|^2 = \left( \sum_j \alpha_j y_j x_j \right)^T \left( \sum_k \alpha_k y_k x_k \right) = \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^T x_k)

Hence, an equivalent optimization problem is over α_j:

\min_{\alpha_j} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^T x_k) \quad \text{subject to} \quad y_i \left( \sum_{j=1}^{N} \alpha_j y_j (x_j^T x_i) + b \right) \ge 1, \; \forall i
and a few more steps are required to complete the derivation.

Primal and dual formulations


N is number of training points, and d is dimension of feature vector x.

Primal problem: for w ∈ R^d

\min_{w \in R^d} \|w\|^2 + C \sum_{i=1}^{N} \max(0, 1 - y_i f(x_i))

Dual problem: for α ∈ R^N (stated without proof):


\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^T x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \; \forall i, \;\; \sum_i \alpha_i y_i = 0

• Complexity of solution is O(d^3) for primal, and O(N^3) for dual

• If N ≪ d then it is more efficient to solve for α than for w

• Dual form only involves (x_j^T x_i). We will return to why this is an
advantage when we look at kernels.
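
As a concrete check of the primal/dual relationship and the Representer Theorem (a minimal sketch assuming numpy and scikit-learn are available; the toy data and parameter choices are illustrative): SVC with a linear kernel solves the dual, and w can be reconstructed from the α_i y_i stored for the support vectors.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)   # toy two-class data
y = 2 * y - 1                                                 # relabel classes as {-1, +1}

clf = SVC(kernel='linear', C=1.0).fit(X, y)                   # fitting solves the dual problem

# dual_coef_ stores alpha_i * y_i for the support vectors; all other alpha_i are zero
w_dual = clf.dual_coef_ @ clf.support_vectors_                # w = sum_i alpha_i y_i x_i
print(np.allclose(w_dual, clf.coef_))                         # matches the primal weight vector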
Primal and dual formulations

Primal version of classifier:

f(x) = w^T x + b

Dual version of classifier:


f(x) = \sum_{i=1}^{N} \alpha_i y_i (x_i^T x) + b

At first sight the dual form appears to have the disadvantage of a K-NN classifier — it requires the training data points x_i. However, many of the α_i are zero. The ones that are non-zero define the support vectors x_i.

Support Vector Machine

[Figure: the separating hyperplane w^T x + b = 0, its distance b/||w|| from the origin, and the support vectors lying on the margin]

f(x) = \sum_i \alpha_i y_i (x_i^T x) + b   (the sum runs over the support vectors)
Handling data that is not linearly separable

• introduce slack variables


\min_{w \in R^d, \; \xi_i \in R^+} \|w\|^2 + C \sum_{i=1}^{N} \xi_i

subject to

y_i (w^T x_i + b) \ge 1 - \xi_i \quad \text{for } i = 1 \ldots N

• but what can be done when a linear classifier is not appropriate?

Solution 1: use polar coordinates

[Figure: the same data plotted in the original coordinates and in polar coordinates (r, θ)]

• Data is linearly separable in polar coordinates


• Acts non-linearly in original space
Φ : (x_1, x_2)^T → (r, θ)^T, \quad R^2 → R^2
Solution 2: map data to higher dimension
Φ : (x_1, x_2)^T → (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)^T, \quad R^2 → R^3


Z= 2x1x2

0
Y = x2
2 X = x2
1
• Data is linearly separable in 3D
• This means that the problem can still be solved by a linear classifier
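
A minimal sketch of this idea (assuming numpy and scikit-learn; the toy data and the circular labelling rule are illustrative): the points are not linearly separable in 2D but become separable after the explicit map Φ.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)     # circular rule: not linearly separable in 2D

def phi(X):
    # explicit feature map R^2 -> R^3 from the slide
    return np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])

clf = LinearSVC(C=10.0, max_iter=10000).fit(phi(X), y)  # linear classifier in the mapped space
print(clf.score(phi(X), y))                             # close to 1.0: separable after the map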

SVM classifiers in a transformed feature space


[Figure: the map Φ sends points from R^d to R^D, where the decision boundary f(x) = 0 is a hyperplane]

Φ : x → Φ(x), \quad R^d → R^D

Learn classifier linear in w for R^D:

f(x) = w^T Φ(x) + b
Primal Classifier in transformed feature space

Classifier, with w ∈ R^D:

f(x) = w^T Φ(x) + b

Learning, for w ∈ R^D:

\min_{w \in R^D} \|w\|^2 + C \sum_{i=1}^{N} \max(0, 1 - y_i f(x_i))

• Simply map x to Φ(x) where data is separable

• Solve for w in high dimensional space R^D

• Complexity of solution is now O(D^3) rather than O(d^3)

Dual Classifier in transformed feature space

Classifier:
f(x) = \sum_{i=1}^{N} \alpha_i y_i (x_i^T x) + b

→ f(x) = \sum_{i=1}^{N} \alpha_i y_i \, \Phi(x_i)^T \Phi(x) + b
Learning:
\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^T x_k)

→ \max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \, \Phi(x_j)^T \Phi(x_k)
subject to
0 \le \alpha_i \le C \; \forall i, \quad \text{and} \quad \sum_i \alpha_i y_i = 0
Dual Classifier in transformed feature space
• Note that Φ(x) only occurs in pairs Φ(x_j)^T Φ(x_i)

• Once the scalar products are computed, complexity is again O(N^3); it is not necessary to learn in the D-dimensional space, as it is for the primal

• Write k(x_j, x_i) = Φ(x_j)^T Φ(x_i). This is known as a Kernel

Classifier:
f(x) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b
Learning:
\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \, k(x_j, x_k)
subject to
0 \le \alpha_i \le C \; \forall i, \quad \text{and} \quad \sum_i \alpha_i y_i = 0
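
As an illustration of how the kernelised classifier is evaluated, here is a plain-numpy sketch (the function and variable names are illustrative; the α_i and b are assumed to come from whatever solver was used for the dual). Only points with non-zero α_i, the support vectors, contribute.

import numpy as np

def decision_function(x, X_train, y_train, alpha, b, k):
    # only training points with alpha_i > 0 (the support vectors) contribute to the sum
    sv = alpha > 0
    return sum(a * y * k(xi, x) for a, y, xi in zip(alpha[sv], y_train[sv], X_train[sv])) + b

def quadratic(x, z):
    # example kernel from the slides: k(x, z) = (x^T z)^2
    return (x @ z) ** 2

# toy usage with made-up values for alpha and b
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([1, -1, 1])
alpha = np.array([0.5, 0.5, 0.0])      # the third point is not a support vector
print(decision_function(np.array([0.5, 0.5]), X_train, y_train, alpha, 0.1, quadratic))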

Special transformations
Φ : (x_1, x_2)^T → (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)^T, \quad R^2 → R^3

\Phi(x)^T \Phi(z) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2) \, (z_1^2, \; z_2^2, \; \sqrt{2}\, z_1 z_2)^T
                  = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2
                  = (x_1 z_1 + x_2 z_2)^2
                  = (x^T z)^2
Kernel Trick
• Classifier can be learnt and applied without explicitly computing Φ(x)

• All that is required is the kernel k(x, z) = (x^T z)^2

• Complexity is still O(N^3)
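
A quick numeric check of this identity (toy vectors, assuming numpy): the explicit map Φ and the kernel k(x, z) = (x^T z)^2 give the same value, so Φ never has to be formed.

import numpy as np

def phi(x):
    # explicit feature map R^2 -> R^3 from the previous slide
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(phi(x) @ phi(z))     # scalar product in the mapped space
print((x @ z) ** 2)        # kernel evaluated in the original space: same value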


Example kernels

• Linear kernels: k(x, x') = x^T x'


• Polynomial kernels: k(x, x') = (1 + x^T x')^d for any d > 0

— Contains all polynomial terms up to degree d


• Gaussian kernels: k(x, x') = exp(−||x − x'||^2 / 2σ^2) for σ > 0

— Infinite dimensional feature space
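
For reference, one-line sketches of these three kernels (assuming numpy; x and z are 1-D arrays, and the default parameter values are illustrative):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3):
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))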

Valid kernels – when can the kernel trick be used?

• Given some arbitrary function k(x_i, x_j), how do we know if it corresponds to a scalar product Φ(x_i)^T Φ(x_j) in some space?

• Mercer kernels: if k(·, ·) satisfies:

— Symmetric: k(x_i, x_j) = k(x_j, x_i)
— Positive semi-definite: α^T K α ≥ 0 for all α ∈ R^N, where K is the N × N Gram matrix with entries K_ij = k(x_i, x_j)

then k(·, ·) is a valid kernel.

• e.g. k(x, z) = x^T z is a valid kernel, k(x, z) = x − x^T z is not.
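
A small sketch of the Mercer check in practice (toy data, assuming numpy): build the Gram matrix of a candidate kernel and test symmetry and positive semi-definiteness through its eigenvalues.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))

def gram(k, X):
    # N x N Gram matrix with entries K_ij = k(x_i, x_j)
    return np.array([[k(xi, xj) for xj in X] for xi in X])

K = gram(lambda x, z: x @ z, X)                   # linear kernel
print(np.allclose(K, K.T))                        # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)      # eigenvalues non-negative (up to round-off)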


SVM classifier with Gaussian kernel

N = size of training data


f(x) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b

(each x_i is a support vector; its weight α_i may be zero)

Gaussian kernel: k(x, x') = exp(−||x − x'||^2 / 2σ^2)

Radial Basis Function (RBF) SVM


f(x) = \sum_{i=1}^{N} \alpha_i y_i \exp(-\|x - x_i\|^2 / 2\sigma^2) + b

RBF Kernel SVM Example

[Figure: training data from two classes plotted against feature x and feature y]

• data is not linearly separable in original feature space


f(x) = \sum_{i=1}^{N} \alpha_i y_i \exp(-\|x - x_i\|^2 / 2\sigma^2) + b

[Figures: the RBF kernel SVM decision function on this data, with the contours f(x) = 1, f(x) = 0 and f(x) = −1 marked, for several parameter settings: σ = 1.0 with C = ∞, C = 100 and C = 10, then C = ∞ with σ = 1.0, σ = 0.25 and σ = 0.1]

• Decreasing C gives a wider (soft) margin

• Decreasing σ moves the classifier towards a nearest neighbour classifier
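
A hedged sketch of this kind of experiment (assuming scikit-learn; the toy data is illustrative, scikit-learn parameterises the RBF kernel as gamma = 1/(2σ^2), and a very large C stands in for C = ∞):

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy non-linear data

for sigma, C in [(1.0, 1e6), (1.0, 100.0), (1.0, 10.0), (0.25, 1e6), (0.1, 1e6)]:
    gamma = 1.0 / (2.0 * sigma ** 2)                          # scikit-learn's RBF parameter
    clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)
    print(f"sigma={sigma}, C={C:g}: {clf.n_support_.sum()} support vectors")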
Kernel block structure
N × N Gram matrix with entries K_ij = k(x_i, x_j)

[Figures: decision boundaries for a linear kernel (C = 0.1) and an RBF kernel (C = 1, gamma = 0.25), with positive and negative vectors, support vectors, margin vectors, the decision boundary and the two margins marked; below them, the corresponding Gram matrices for the linear and RBF kernels]

The kernel measures similarity between the points.
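
A small sketch (assuming numpy and scikit-learn; the toy data is illustrative) that computes both Gram matrices with the training points ordered by class, so the within-class similarity of the RBF kernel shows up as blocks:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

rng = np.random.default_rng(2)
X_pos = rng.normal(loc=+2.0, size=(15, 2))        # positive class
X_neg = rng.normal(loc=-2.0, size=(15, 2))        # negative class
X = np.vstack([X_pos, X_neg])                     # rows ordered by class

K_lin = linear_kernel(X)                          # K_ij = x_i^T x_j
K_rbf = rbf_kernel(X, gamma=0.25)                 # K_ij = exp(-gamma ||x_i - x_j||^2)

# with the rows sorted by class, within-class entries are much larger than between-class ones
print(K_rbf[:15, :15].mean(), K_rbf[:15, 15:].mean())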

Kernel Trick - Summary


• Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space

• Data may be linearly separable in the high dimensional space, but not
linearly separable in the original feature space

• Kernels can be used for an SVM because of the scalar product in the dual
form, but can also be used elsewhere – they are not tied to the SVM formalism

• Kernels apply also to objects that are not vectors, e.g. k(h, h') = \sum_k \min(h_k, h'_k) for histograms with bins h_k, h'_k (see the sketch after this list)

• We will see other examples of kernels later in regression and unsupervised learning
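
A sketch of the histogram intersection kernel mentioned above (assuming numpy; the histogram values are illustrative toys):

import numpy as np

def histogram_intersection(h1, h2):
    # k(h, h') = sum_k min(h_k, h'_k)
    return np.minimum(h1, h2).sum()

h1 = np.array([0.1, 0.4, 0.3, 0.2])    # two toy normalised histograms
h2 = np.array([0.3, 0.3, 0.2, 0.2])
print(histogram_intersection(h1, h2))  # 0.8: close to 1 when the histograms are similar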
Background reading
• Bishop, chapters 6.2 and 7

• Hastie et al, chapter 12

• More on web page: http://www.robots.ox.ac.uk/~az/lectures/ml
