This document discusses features and kernels for nonlinear functions in statistical learning theory. It begins with linear functions defined by an inner product in a feature space, then uses nonlinear feature maps to define linear spaces of functions and applies regularization methods such as ridge regression in the feature space. It introduces the "kernel trick", which replaces explicit feature-space computations with evaluations of a kernel corresponding to the feature-space dot product, yielding kernel ridge regression. Finally, it discusses positive definite kernels, the reproducing property, and reproducing kernel Hilbert spaces (RKHS).


MIT 9.520/6.860, Fall 2018


Statistical Learning Theory and Applications

Class 04: Features and Kernels

Lorenzo Rosasco


Linear functions

Let $\mathcal{H}_{\mathrm{lin}}$ be the space of linear functions

$$f(x) = w^\top x.$$

- $f \leftrightarrow w$ is one to one,
- inner product $\langle f, \bar{f} \rangle_{\mathcal{H}} := w^\top \bar{w}$,
- norm/metric $\|f - \bar{f}\|_{\mathcal{H}} := \|w - \bar{w}\|$.



An observation

Function norm controls point-wise convergence.

Since
$$|f(x) - \bar{f}(x)| \le \|x\| \, \|w - \bar{w}\|, \quad \forall x \in X,$$
then
$$w_j \to w \;\Rightarrow\; f_j(x) \to f(x), \quad \forall x \in X.$$



ERM

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \|w\|^2, \quad \lambda \ge 0$$

- $\lambda \to 0$: ordinary least squares (biased toward the minimal norm solution),

- $\lambda > 0$: ridge regression (stable).



Computations

Let $X_n \in \mathbb{R}^{n \times d}$ and $\hat{Y} \in \mathbb{R}^n$.

The ridge regression solution is

$$\hat{w}^\lambda = (X_n^\top X_n + n\lambda I)^{-1} X_n^\top \hat{Y}, \qquad \text{time } O(nd^2 \vee d^3), \ \text{memory } O(nd \vee d^2),$$

but also

$$\hat{w}^\lambda = X_n^\top (X_n X_n^\top + n\lambda I)^{-1} \hat{Y}, \qquad \text{time } O(dn^2 \vee n^3), \ \text{memory } O(nd \vee n^2).$$
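As a concrete illustration of the two formulas (a minimal NumPy sketch; the function names and synthetic data are my own, not from the slides): the first solves a $d \times d$ system, the second an $n \times n$ system, and both recover the same $\hat{w}^\lambda$.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Primal form: solve the d x d system (X^T X + n*lam*I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def ridge_dual(X, y, lam):
    """Dual form: solve the n x n system (X X^T + n*lam*I) c = y, then w = X^T c."""
    n, d = X.shape
    c = np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)
    return X.T @ c

# Tiny synthetic check that both forms give the same weights.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
w1, w2 = ridge_primal(X, y, 0.1), ridge_dual(X, y, 0.1)
print(np.allclose(w1, w2))  # True (up to numerical precision)
```

Which form is cheaper depends on whether $d$ or $n$ is smaller, matching the costs quoted above.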



Representer theorem in disguise

We noted that

$$\hat{w}^\lambda = X_n^\top c = \sum_{i=1}^{n} x_i c_i \quad\Longleftrightarrow\quad \hat{f}^\lambda(x) = \sum_{i=1}^{n} x^\top x_i \, c_i,$$

$$c = (X_n X_n^\top + n\lambda I)^{-1} \hat{Y}, \qquad (X_n X_n^\top)_{ij} = x_i^\top x_j.$$
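A short numerical sketch of this identity (variable names and data are illustrative): the dual coefficients $c$ give both the weights $\hat{w}^\lambda = X_n^\top c$ and the predictions written as sums of inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # rows are the points x_i
y = rng.standard_normal(50)
lam, n = 0.1, X.shape[0]

# Dual coefficients: c = (X X^T + n*lam*I)^{-1} y
c = np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)
# Equivalent primal weights: w = X^T c
w = X.T @ c

x_new = rng.standard_normal(5)
# f(x) = w^T x  equals  sum_i (x^T x_i) c_i
print(np.allclose(w @ x_new, (X @ x_new) @ c))  # True
```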



Limits of linear functions

Regression



Limits of linear functions

Classification



Nonlinear functions

Two main possibilities:

$$f(x) = w^\top \Phi(x), \qquad f(x) = \Phi(w^\top x)$$

where $\Phi$ is a nonlinear map.

- The former choice leads to linear spaces of functions.¹

- The latter choice can be iterated:
  $$f(x) = \Phi(w_L^\top \Phi(w_{L-1}^\top \cdots \Phi(w_1^\top x))).$$

¹ The spaces are linear, NOT the functions!

Features and feature maps

$$f(x) = w^\top \Phi(x),$$

where $\Phi : X \to \mathbb{R}^p$,

$$\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))^\top$$

and $\varphi_j : X \to \mathbb{R}$, for $j = 1, \ldots, p$.

- $X$ need not be $\mathbb{R}^d$.
- We can also write
  $$f(x) = \sum_{j=1}^{p} w_j \varphi_j(x).$$
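For instance (an illustrative sketch, not from the slides), a simple explicit feature map on $X = \mathbb{R}$ is the polynomial map $\Phi(x) = (1, x, \ldots, x^{p-1})$; $f(x) = w^\top \Phi(x)$ is then nonlinear in $x$ but linear in the features.

```python
import numpy as np

def poly_features(x, p):
    """Explicit feature map Phi(x) = (1, x, x^2, ..., x^(p-1)) for scalar inputs x."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(p)], axis=-1)  # shape (..., p)

# f(x) = w^T Phi(x) is nonlinear in x but linear in the features.
w = np.array([1.0, -2.0, 0.5])          # weights for features 1, x, x^2
x = np.linspace(-1, 1, 5)
f = poly_features(x, 3) @ w             # evaluates 1 - 2x + 0.5 x^2 at each x
print(f)
```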



Geometric view

$$f(x) = w^\top \Phi(x)$$



An example



More examples

The equation
$$f(x) = w^\top \Phi(x) = \sum_{j=1}^{p} w_j \varphi_j(x)$$

suggests thinking of features as some form of basis.

Indeed we can consider

- Fourier basis,
- wavelets and their variations,
- ...



And even more examples

Any set of functions

$$\varphi_j : X \to \mathbb{R}, \qquad j = 1, \ldots, p$$

can be considered.

Feature design/engineering:
- vision: SIFT, HOG
- audio: MFCC
- ...



Nonlinear functions using features

Let $\mathcal{H}_\Phi$ be the space of linear functions

$$f(x) = w^\top \Phi(x).$$

- $f \leftrightarrow w$ is one to one, if the $(\varphi_j)_j$ are linearly independent,
- inner product $\langle f, \bar{f} \rangle := w^\top \bar{w}$,
- norm/metric $\|f - \bar{f}\| := \|w - \bar{w}\|$.

In this case

$$|f(x) - \bar{f}(x)| \le \|\Phi(x)\| \, \|w - \bar{w}\|, \quad \forall x \in X.$$



Back to ERM

$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top \Phi(x_i))^2 + \lambda \|w\|^2, \quad \lambda \ge 0,$$

Equivalent to,

$$\min_{f \in \mathcal{H}_\Phi} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}_\Phi}^2, \quad \lambda \ge 0.$$



Computations using features

Let $\hat{\Phi} \in \mathbb{R}^{n \times p}$ with

$$(\hat{\Phi})_{ij} = \varphi_j(x_i).$$

The ridge regression solution is

$$\hat{w}^\lambda = (\hat{\Phi}^\top \hat{\Phi} + n\lambda I)^{-1} \hat{\Phi}^\top \hat{Y}, \qquad \text{time } O(np^2 \vee p^3), \ \text{memory } O(np \vee p^2),$$

but also

$$\hat{w}^\lambda = \hat{\Phi}^\top (\hat{\Phi}\hat{\Phi}^\top + n\lambda I)^{-1} \hat{Y}, \qquad \text{time } O(pn^2 \vee n^3), \ \text{memory } O(np \vee n^2).$$
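A sketch tying this to the explicit feature map from the earlier example (function names, the feature map, and the toy data are illustrative): ridge regression on polynomial features, using whichever of the two formulas is cheaper depending on $n$ versus $p$.

```python
import numpy as np

def poly_features(x, p):
    """Phi(x) = (1, x, ..., x^(p-1)) for scalar x (same map as in the earlier sketch)."""
    return np.stack([np.asarray(x, float)**j for j in range(p)], axis=-1)

def ridge_features(Phi, y, lam):
    """Pick the cheaper formula: p x p system if p <= n, else n x n system."""
    n, p = Phi.shape
    if p <= n:
        return np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ y)
    c = np.linalg.solve(Phi @ Phi.T + n * lam * np.eye(n), y)
    return Phi.T @ c

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)   # a nonlinear target
w = ridge_features(poly_features(x, 8), y, 1e-3)
print(w.shape)  # (8,) weights; predictions are poly_features(x_new, 8) @ w
```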



Representer theorem a little less in disguise

Analogously to before
$$\hat{w}^\lambda = \hat{\Phi}^\top c = \sum_{i=1}^{n} \Phi(x_i)\, c_i \quad\Longleftrightarrow\quad \hat{f}^\lambda(x) = \sum_{i=1}^{n} \Phi(x)^\top \Phi(x_i)\, c_i,$$

$$c = (\hat{\Phi}\hat{\Phi}^\top + n\lambda I)^{-1} \hat{Y}, \qquad (\hat{\Phi}\hat{\Phi}^\top)_{ij} = \Phi(x_i)^\top \Phi(x_j),$$

$$\Phi(x)^\top \Phi(\bar{x}) = \sum_{s=1}^{p} \varphi_s(x)\, \varphi_s(\bar{x}).$$



Unleash the features

- Can we consider linearly dependent features?

- Can we consider $p = \infty$?



An observation

For $X = \mathbb{R}$ consider

$$\varphi_j(x) = x^{j-1} e^{-x^2\gamma}\sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}, \qquad j = 1, \ldots, \infty$$

(so that $\varphi_1(x) = e^{-x^2\gamma}$).

Then

$$\sum_{j=1}^{\infty} \varphi_j(x)\varphi_j(\bar{x})
= \sum_{j=1}^{\infty} x^{j-1} e^{-x^2\gamma}\sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}\;
  \bar{x}^{j-1} e^{-\bar{x}^2\gamma}\sqrt{\frac{(2\gamma)^{j-1}}{(j-1)!}}$$

$$= e^{-x^2\gamma} e^{-\bar{x}^2\gamma} \sum_{j=1}^{\infty} \frac{(2\gamma x\bar{x})^{j-1}}{(j-1)!}
= e^{-x^2\gamma} e^{-\bar{x}^2\gamma} e^{2\gamma x\bar{x}}
= e^{-|x-\bar{x}|^2\gamma}.$$
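A quick numerical check of this identity, truncating the infinite series at a finite number of terms (the truncation level and test points are my own choices, not from the slides):

```python
import numpy as np
from math import factorial

def phi(x, j, gamma):
    """j-th Gaussian feature (j = 1, 2, ...): x^(j-1) e^(-gamma x^2) sqrt((2 gamma)^(j-1) / (j-1)!)."""
    return x**(j - 1) * np.exp(-gamma * x**2) * np.sqrt((2 * gamma)**(j - 1) / factorial(j - 1))

gamma, x, x_bar = 0.5, 1.2, -0.7
series = sum(phi(x, j, gamma) * phi(x_bar, j, gamma) for j in range(1, 40))
print(series, np.exp(-gamma * (x - x_bar)**2))  # the two values agree to high precision
```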



From features to kernels


$$\Phi(x)^\top \Phi(\bar{x}) = \sum_{j=1}^{\infty} \varphi_j(x)\, \varphi_j(\bar{x}) = k(x, \bar{x})$$

We might be able to compute the series in closed form.

The function $k$ is called a kernel.

Can we run ridge regression?



Kernel ridge regression

We have
$$\hat{f}^\lambda(x) = \sum_{i=1}^{n} \Phi(x)^\top \Phi(x_i)\, c_i = \sum_{i=1}^{n} k(x, x_i)\, c_i$$

$$c = (\hat{K} + n\lambda I)^{-1} \hat{Y}, \qquad (\hat{K})_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j).$$

$\hat{K}$ is the kernel matrix, the Gram (inner product) matrix of the data.

“The kernel trick”
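A minimal kernel ridge regression sketch with the Gaussian kernel (the function names, the $n\lambda$ scaling taken from the formulas above, and the toy data are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs of rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :])**2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma):
    """Coefficients c = (K + n*lam*I)^{-1} y, never forming the (possibly infinite) feature map."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, c, X_new, gamma):
    """f(x) = sum_i k(x, x_i) c_i."""
    return gaussian_kernel(X_new, X_train, gamma) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)
c = krr_fit(X, y, lam=1e-3, gamma=1.0)
print(krr_predict(X, c, np.array([[0.0], [1.5]]), gamma=1.0))  # roughly sin(0), sin(1.5)
```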



Kernels

- Can we start from kernels instead of features?

- Which functions $k : X \times X \to \mathbb{R}$ define kernels we can use?



Positive definite kernels

A function $k : X \times X \to \mathbb{R}$ is called positive definite:

- if the matrix $\hat{K}$ is positive semidefinite for all choices of points $x_1, \ldots, x_n$, i.e.
  $$a^\top \hat{K} a \ge 0, \quad \forall a \in \mathbb{R}^n.$$

- Equivalently,
  $$\sum_{i,j=1}^{n} k(x_i, x_j)\, a_i a_j \ge 0,$$
  for any $a_1, \ldots, a_n \in \mathbb{R}$, $x_1, \ldots, x_n \in X$.
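A small numerical illustration (my own, not from the slides): build a Gaussian kernel matrix on random points and check that its eigenvalues are nonnegative, i.e. that $\hat{K}$ is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

# Gaussian kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
K = np.exp(-gamma * sq)

eigvals = np.linalg.eigvalsh(K)          # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)           # True: K is positive semidefinite (up to round-off)
```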



Inner product kernels are pos. def.

Assume $\Phi : X \to \mathbb{R}^p$, $p \le \infty$, and

$$k(x, \bar{x}) = \Phi(x)^\top \Phi(\bar{x}).$$

Note that

$$\sum_{i,j=1}^{n} k(x_i, x_j)\, a_i a_j = \sum_{i,j=1}^{n} \Phi(x_i)^\top \Phi(x_j)\, a_i a_j = \Big\| \sum_{i=1}^{n} \Phi(x_i)\, a_i \Big\|^2 \ge 0.$$

Clearly k is symmetric.



But there are many pos. def. kernels

Classic examples:
- linear: $k(x, \bar{x}) = x^\top \bar{x}$
- polynomial: $k(x, \bar{x}) = (x^\top \bar{x} + 1)^s$
- Gaussian: $k(x, \bar{x}) = e^{-\|x - \bar{x}\|^2 \gamma}$

But one can also consider

- kernels on probability distributions
- kernels on strings
- kernels on functions
- kernels on groups
- kernels on graphs
- ...

It is natural to think of a kernel as a measure of similarity.
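The three classic kernels written as short functions, vectorized over the rows of the input matrices (a sketch; the parameter defaults are my own):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T                                     # k(x, z) = x^T z

def polynomial_kernel(X, Z, s=3):
    return (X @ Z.T + 1.0)**s                          # k(x, z) = (x^T z + 1)^s

def gaussian_kernel(X, Z, gamma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :])**2).sum(-1)  # pairwise squared distances
    return np.exp(-gamma * sq)                         # k(x, z) = exp(-gamma ||x - z||^2)

X = np.random.default_rng(0).standard_normal((4, 2))
print(linear_kernel(X, X).shape, polynomial_kernel(X, X).shape, gaussian_kernel(X, X).shape)
```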



From pos. def. kernels to functions

Let $X$ be any set and $k$ a pos. def. kernel.

- Consider the space $\mathcal{H}_k$ of functions
  $$f(x) = \sum_{i=1}^{N} k(x, x_i)\, a_i$$
  for any $a_1, \ldots, a_N \in \mathbb{R}$, $x_1, \ldots, x_N \in X$, and any $N \in \mathbb{N}$.

- Define an inner product on $\mathcal{H}_k$
  $$\langle f, \bar{f} \rangle_{\mathcal{H}_k} = \sum_{i=1}^{N} \sum_{j=1}^{\bar{N}} k(x_i, \bar{x}_j)\, a_i \bar{a}_j.$$

- $\mathcal{H}_k$ can be completed to a Hilbert space.



An illustration

Functions defined by Gaussian kernels with large and small widths.



A key result

Theorem
Given a pos. def. $k$ there exists $\Phi$ s.t. $k(x, \bar{x}) = \langle \Phi(x), \Phi(\bar{x}) \rangle_{\mathcal{H}_k}$ and

$$\mathcal{H}_\Phi \simeq \mathcal{H}_k.$$

Roughly speaking,

$$f(x) = w^\top \Phi(x) \quad \simeq \quad f(x) = \sum_{i=1}^{N} k(x, x_i)\, a_i.$$



From features and kernels to RKHS and beyond

$\mathcal{H}_k$ and $\mathcal{H}_\Phi$ have many properties, characterizations, and connections:

- reproducing property
- reproducing kernel Hilbert spaces (RKHS)
- Mercer theorem (Karhunen-Loève expansion)
- Gaussian processes
- Cameron-Martin spaces



Reproducing property

Note that by definition of $\mathcal{H}_k$:

- $k_x = k(x, \cdot)$ is a function in $\mathcal{H}_k$.
- For all $f \in \mathcal{H}_k$, $x \in X$,
  $$f(x) = \langle f, k_x \rangle_{\mathcal{H}_k},$$
  called the reproducing property.
- Note that
  $$|f(x) - \bar{f}(x)| \le \|k_x\|_{\mathcal{H}_k} \, \|f - \bar{f}\|_{\mathcal{H}_k}, \quad \forall x \in X.$$

The above observations have a converse.
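A numerical illustration of these points (my own example; for simplicity both functions are expansions over the same Gaussian-kernel centers): the RKHS norm of $f - \bar{f}$ is computed from the Gram matrix, $\|k_x\|_{\mathcal{H}_k} = \sqrt{k(x, x)}$, and the pointwise bound above holds.

```python
import numpy as np

def k(x, z, gamma=1.0):
    """Gaussian kernel on R."""
    return np.exp(-gamma * (x - z)**2)

rng = np.random.default_rng(0)
centers = rng.uniform(-2, 2, 6)
a = rng.standard_normal(6)            # f    = sum_i a[i]     k(., centers[i])
a_bar = rng.standard_normal(6)        # fbar = sum_i a_bar[i] k(., centers[i])

def f(x, coef):
    return sum(c * k(x, xc) for c, xc in zip(coef, centers))

# RKHS norm: ||f - fbar||^2 = (a - a_bar)^T K (a - a_bar), with K_ij = k(x_i, x_j)
K = k(centers[:, None], centers[None, :])
diff_norm = np.sqrt((a - a_bar) @ K @ (a - a_bar))

x = 0.3
lhs = abs(f(x, a) - f(x, a_bar))
rhs = np.sqrt(k(x, x)) * diff_norm    # ||k_x|| = sqrt(k(x, x))
print(lhs <= rhs + 1e-12)             # True: |f(x) - fbar(x)| <= ||k_x|| ||f - fbar||
```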



RKHS

Definition
A RKHS $\mathcal{H}$ is a Hilbert space with a function $k : X \times X \to \mathbb{R}$ s.t.
- $k_x = k(x, \cdot) \in \mathcal{H}$,
- and
  $$f(x) = \langle f, k_x \rangle_{\mathcal{H}}.$$

Theorem
If $\mathcal{H}$ is a RKHS then $k$ is pos. def.



Evaluation functionals in a RKHS

If $\mathcal{H}$ is a RKHS then the evaluation functionals

$$e_x(f) = f(x)$$

are continuous, i.e.

$$|e_x(f) - e_x(\bar{f})| \lesssim \|f - \bar{f}\|_{\mathcal{H}}, \quad \forall x \in X,$$

since
$$e_x(f) = \langle f, k_x \rangle_{\mathcal{H}}.$$

Note that $L^2(\mathbb{R}^d)$ or $C(\mathbb{R}^d)$ do not have this property!



Alternative RKHS definition

It turns out that the previous property also characterizes a RKHS.

Theorem
A Hilbert space with continuous evaluation functionals is a RKHS.



Summing up

- From linear to nonlinear functions
  - using features
  - using kernels

plus

- pos. def. functions
- reproducing property
- RKHS

