Machine Learning With Kernel Methods
2 / 635
Or
3 / 635
But real data are often more complicated...
4 / 635
Main goal of this course
Extend
well-understood, linear statistical learning techniques
to
real-world, complicated, structured, high-dimensional data
based on
a rigorous mathematical framework
leading to
practical modelling tools and algorithms
5 / 635
Organization of the course
Contents
1 Present the basic mathematical theory of kernel methods.
2 Introduce algorithms for supervised and unsupervised machine
learning with kernels.
3 Develop a working knowledge of kernel engineering for specific data
and applications (graphs, biological sequences, images).
4 Discuss open research topics related to kernels such as large-scale
learning with kernels and deep kernel learning.
Practical
Course homepage with slides, schedules, homework etc...:
http://cbio.mines-paristech.fr/~jvert/svn/kernelcourse/course/2017mva
Evaluation: 60% homework (3)+ 40% data challenge.
6 / 635
Outline
1 Kernels and RKHS
Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional
2 Kernel tricks
The kernel trick
The representer theorem
3 Kernel Methods: Supervised Learning
4 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA
8 / 635
Part 1
9 / 635
Overview
Motivations
Develop versatile algorithms to process and analyze data...
...without making any assumptions regarding the type of data
(vectors, strings, graphs, images, ...)
The approach
Develop methods based on pairwise comparisons.
By imposing constraints on the pairwise comparison function
(positive definite kernels), we obtain a general framework for
learning from data (optimization in RKHS).
10 / 635
Representation by pairwise comparisons
Example: S = (aatcgagtcac, atggacgtct, tgcactact) is represented by the similarity matrix
K = [ 1 0.5 0.3 ; 0.5 1 0.6 ; 0.3 0.6 1 ]
Idea
Define a comparison function K : X × X → R.
Represent a set of n data points S = {x1, x2, ..., xn} by the n × n matrix:
[K]ij := K(xi, xj) .
12 / 635
Representation by pairwise comparisons
Remarks
K is always an n n matrix, whatever the nature of data: the same
algorithm will work for any type of data (vectors, strings, ...).
Total modularity between the choice of function K and the choice of
the algorithm.
Poor scalability with respect to the dataset size (n2 to compute and
store K)... but wait until the end of the course to see how to deal
with large-scale problems
We will restrict ourselves to a particular class of pairwise comparison
functions.
13 / 635
Positive Definite (p.d.) Kernels
Definition
A positive definite (p.d.) kernel on a set X is a function K : X × X → R that is symmetric:
∀(x, x′) ∈ X², K(x, x′) = K(x′, x),
and which satisfies, for any N ∈ N, any (x1, ..., xN) ∈ X^N and any (a1, ..., aN) ∈ R^N:
∑_{i=1}^N ∑_{j=1}^N ai aj K(xi, xj) ≥ 0 .
14 / 635
Similarity matrices of p.d. kernels
Remarks
Equivalently, a kernel K is p.d. if and only if, for any N ∈ N and any set of points (x1, x2, ..., xN) ∈ X^N, the similarity matrix [K]ij := K(xi, xj) is positive semidefinite.
Kernel methods are algorithms that take such matrices as input.
15 / 635
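As a minimal numerical sketch of this equivalence (assuming Python with NumPy; the Gaussian kernel and the toy points are arbitrary choices, not prescribed by the slides), one can build the similarity matrix of a few points and check that its eigenvalues are nonnegative:

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a p.d. kernel on R^d
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def gram_matrix(K, points):
    # Similarity matrix [K]_ij = K(x_i, x_j) for any comparison function K
    n = len(points)
    return np.array([[K(points[i], points[j]) for j in range(n)] for i in range(n)])

X = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([-2.0, 1.0])]
G = gram_matrix(gaussian_kernel, X)
# A p.d. kernel yields a positive semidefinite matrix: all eigenvalues >= 0 (up to rounding)
print(np.linalg.eigvalsh(G))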
The simplest p.d. kernel, for real numbers
Lemma
Let X = R. The function K : R² → R defined by K(x, x′) = x x′ for all (x, x′) ∈ R² is p.d.
Proof:
symmetry: x x′ = x′ x ;
positivity: ∑_{i=1}^N ∑_{j=1}^N ai aj xi xj = ( ∑_{i=1}^N ai xi )² ≥ 0 .
16 / 635
The simplest p.d. kernel, for vectors
Lemma
Let X = R^d. The function K : X² → R defined by K(x, x′) = ⟨x, x′⟩_{R^d} for all (x, x′) ∈ X² is p.d.
17 / 635
A more ambitious p.d. kernel
Lemma
Let X be any set, and Φ : X → R^d. Then the function K : X² → R defined as follows is p.d.:
∀(x, x′) ∈ X², K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{R^d} .
Proof:
symmetry: ⟨Φ(x), Φ(x′)⟩_{R^d} = ⟨Φ(x′), Φ(x)⟩_{R^d} ;
positivity: ∑_{i=1}^N ∑_{j=1}^N ai aj ⟨Φ(xi), Φ(xj)⟩_{R^d} = ‖ ∑_{i=1}^N ai Φ(xi) ‖²_{R^d} ≥ 0 .
18 / 635
Example: polynomial kernel
For x = (x1, x2)^T ∈ R², let Φ(x) = (x1², √2 x1 x2, x2²) ∈ R³. Then for any x, x′ ∈ R²:
K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{R³} = (x1 x1′ + x2 x2′)² = ⟨x, x′⟩²_{R²} ,
so this kernel is p.d. (it is an inner product after the mapping Φ into the feature space F = R³).
20 / 635
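A quick numerical check of this identity, as a sketch in Python/NumPy (the test points are arbitrary):

import numpy as np

def phi(x):
    # feature map of the degree-2 polynomial kernel on R^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, -2.0])
xp = np.array([0.5, 3.0])
lhs = phi(x) @ phi(xp)   # inner product in the feature space R^3
rhs = (x @ xp) ** 2      # kernel evaluation <x, x'>^2 in the input space R^2
print(lhs, rhs)          # identical up to rounding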
In case of ...
Definitions
An inner product on an R-vector space H is a mapping (f, g) ↦ ⟨f, g⟩_H from H² to R that is bilinear, symmetric and such that ⟨f, f⟩_H > 0 for all f ∈ H\{0}.
A vector space endowed with an inner product is called pre-Hilbert. It is endowed with the norm ‖f‖_H = ⟨f, f⟩_H^{1/2}.
A Cauchy sequence (fn)_{n≥0} is a sequence whose elements become progressively arbitrarily close to each other: ∀ε > 0, ∃N ∈ N such that ‖fn − fm‖_H < ε for all n, m ≥ N.
A Hilbert space is a pre-Hilbert space that is complete, i.e., in which every Cauchy sequence converges.
In the finite case, the similarity matrix can be diagonalized as K = ∑_{l=1}^N λl ul ul^T with λl ≥ 0, which gives K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ with
Φ(xi) = ( √λ1 [u1]_i , ... , √λN [uN]_i )^T .
22 / 635
Proof: general case
23 / 635
Reproducing Kernel Hilbert Spaces (RKHS)
Definition
Let X be a set and H ⊂ R^X a Hilbert space of functions mapping X to R. The function K : X × X → R is called a reproducing kernel (r.k.) of H if:
1 H contains all functions of the form Kx : t ↦ K(x, t), for x ∈ X ;
2 for every x ∈ X and f ∈ H, the reproducing property holds: f(x) = ⟨f, Kx⟩_H .
If a r.k. exists, then H is called a reproducing kernel Hilbert space (RKHS).
26 / 635
An equivalent definition of RKHS
Theorem
The Hilbert space H RX is a RKHS if and only if for any x X , the
mapping:
F : H R
f 7 f (x)
is continuous.
Corollary
Convergence in a RKHS implies pointwise convergence, i.e., if (fn )nN
converges to f in H, then (fn (x))nN converges to f (x) for any x X .
26 / 635
Proof
If H is a RKHS then f ↦ f(x) is continuous
If a r.k. K exists, then for any (x, f) ∈ X × H:
|f(x)| = |⟨f, Kx⟩_H| ≤ ‖f‖_H ‖Kx‖_H (Cauchy-Schwarz) = ‖f‖_H K(x, x)^{1/2} ,
so the linear form f ↦ f(x) is continuous.
27 / 635
Proof (Converse)
If f 7 f (x) is continuous then H is a RKHS
Conversely, let us assume that for any x ∈ X the linear form f ∈ H ↦ f(x) is continuous.
Then by the Riesz representation theorem (a general property of Hilbert spaces) there exists a unique gx ∈ H such that:
f(x) = ⟨f, gx⟩_H .
The function K(x, y) := gx(y) is then a r.k. of H.
28 / 635
Unicity of r.k. and RKHS
Theorem
If H is a RKHS, then it has a unique r.k.
Conversely, a function K can be the r.k. of at most one RKHS.
Consequence
This shows that we can talk of the kernel of a RKHS, or the RKHS
of a kernel.
29 / 635
Proof
If a r.k. exists then it is unique
Let K and K′ be two r.k. of a RKHS H. Then for any x ∈ X:
‖Kx − K′x‖²_H = ⟨Kx − K′x, Kx⟩_H − ⟨Kx − K′x, K′x⟩_H = 0 ,
by the reproducing property of K and K′. This shows that Kx = K′x as functions, i.e., Kx(y) = K′x(y) for any y ∈ X. In other words, K = K′.
30 / 635
An important result
Theorem
A function K : X X R is p.d. if and only if it is a r.k.
31 / 635
Proof
A r.k. is p.d.
1 A r.k. is symmetric because, for any (x, y) ∈ X²:
K(x, y) = ⟨Kx, Ky⟩_H = ⟨Ky, Kx⟩_H = K(y, x) .
2 It is p.d. because, for any x1, ..., xN ∈ X and a1, ..., aN ∈ R:
∑_{i,j=1}^N ai aj K(xi, xj) = ‖ ∑_{i=1}^N ai Kxi ‖²_H ≥ 0 .
32 / 635
Proof
A p.d. kernel is a r.k. (1/4)
Let H0 be the vector subspace of RX spanned by the functions
{Kx }xX .
For any f, g ∈ H0, given by:
f = ∑_{i=1}^m ai Kxi ,   g = ∑_{j=1}^n bj Kyj ,
let:
⟨f, g⟩_{H0} := ∑_{i,j} ai bj K(xi, yj) .
33 / 635
Proof
A p.d. kernel is a r.k. (2/4)
⟨f, g⟩_{H0} does not depend on the chosen expansions of f and g because:
⟨f, g⟩_{H0} = ∑_{i=1}^m ai g(xi) = ∑_{j=1}^n bj f(yj) .
In particular, for any x ∈ X and f ∈ H0:
⟨f, Kx⟩_{H0} = f(x) .
34 / 635
Proof
A p.d. kernel is a r.k. (3/4)
K is assumed to be p.d., therefore:
‖f‖²_{H0} = ∑_{i,j=1}^m ai aj K(xi, xj) ≥ 0 .
Moreover, if ‖f‖_{H0} = 0 then for any x ∈ X, |f(x)| = |⟨f, Kx⟩_{H0}| ≤ ‖f‖_{H0} ‖Kx‖_{H0} = 0 (Cauchy-Schwarz), therefore ‖f‖_{H0} = 0 ⟹ f = 0.
H0 is therefore a pre-Hilbert space endowed with the inner product
h., .iH0 .
35 / 635
Proof
A p.d. kernel is a r.k. (4/4)
For any Cauchy sequence (fn)_{n≥0} in (H0, ⟨·,·⟩_{H0}), we note that:
∀(x, m, n) ∈ X × N², |fm(x) − fn(x)| ≤ ‖fm − fn‖_{H0} · K(x, x)^{1/2} .
Therefore for any x the sequence (fn (x))n0 is Cauchy in R and has
therefore a limit.
If we add to H0 the functions defined as the pointwise limits of
Cauchy sequences, then the space becomes complete and is
therefore a Hilbert space, with K as r.k. (up to a few technicalities,
left as exercise).
36 / 635
Application: back to Aronszajn's theorem
Theorem (Aronszajn, 1950)
K is a p.d. kernel on the set X if and only if there exists a Hilbert space H and a mapping
Φ : X → H ,
such that, for any x, x′ in X:
K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H .
37 / 635
Proof of Aronszajn's theorem
If K is p.d. over a set X then it is the r.k. of a Hilbert space H ⊂ R^X.
Let the mapping Φ : X → H be defined by:
∀x ∈ X, Φ(x) = Kx .
By the reproducing property, for any (x, x′) ∈ X²: ⟨Φ(x), Φ(x′)⟩_H = ⟨Kx, Kx′⟩_H = K(x, x′).
38 / 635
RKHS of the linear kernel
Theorem
The RKHS of the linear kernel K(x, x′) = ⟨x, x′⟩_{R^d} is the set of linear functions of the form
fw(x) = ⟨w, x⟩_{R^d} for w ∈ R^d, endowed with the norm ‖fw‖_H = ‖w‖₂ .
40 / 635
Proof
The RKHS of the linear kernel consists of functions:
x ∈ R^d ↦ f(x) = ∑_i ai ⟨xi, x⟩_{R^d} = ⟨w, x⟩_{R^d} ,
with w = ∑_i ai xi .
The RKHS is therefore the set of linear forms, endowed with the following inner product:
⟨f, g⟩_H = ⟨w, v⟩_{R^d} ,
when f(x) = ⟨w, x⟩_{R^d} and g(x) = ⟨v, x⟩_{R^d}.
41 / 635
RKHS of the linear kernel (cont.)
0 = x> x0 .
Klin (x, x )
f (x) = w> x ,
k f kH = k w k2 .
42 / 635
The polynomial kernel
Let's find the RKHS of the polynomial kernel:
∀x, y ∈ R^d, K(x, y) = ⟨x, y⟩²_{R^d} = (x^T y)²
43 / 635
The polynomial kernel
Second step: propose a candidate RKHS.
We know that H contains all the functions
f(x) = ∑_i ai K(xi, x) = ∑_i ai ⟨xi xi^T, x x^T⟩_F = ⟨ ∑_i ai xi xi^T , x x^T ⟩_F ,
where ⟨·, ·⟩_F denotes the Frobenius inner product between matrices.
A natural candidate is therefore the set of functions fS(x) = ⟨S, x x^T⟩_F = x^T S x for S a symmetric d × d matrix.
Why is it important?
44 / 635
The polynomial kernel
Third step: check that the candidate is a Hilbert space.
This step is trivial in the present case since it is easy to see that H is a Euclidean space, isomorphic to S^{d×d} by Φ : S ↦ fS. Sometimes things are not so simple and we need to prove the completeness explicitly.

Theorem (operations that preserve positive definiteness)
If K1 and K2 are p.d. kernels on X, then the following are p.d. kernels too:
K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0.
If (Ki)_{i≥1} is a sequence of p.d. kernels converging pointwise to a function K, i.e.,
∀(x, x′) ∈ X², K(x, x′) = lim_{i→∞} Ki(x, x′) ,
then K is also a p.d. kernel.
46 / 635
Examples
Theorem
If K is a kernel, then e K is a kernel too.
Proof:
e^{K(x,x′)} = lim_{n→+∞} ∑_{i=0}^n K(x, x′)^i / i! ,
and each term K(x, x′)^i / i! is p.d. (products of p.d. kernels scaled by a nonnegative constant), finite sums of p.d. kernels are p.d., and a pointwise limit of p.d. kernels is p.d.
47 / 635
Quizz: which of the following are p.d. kernels?
X = (−1, 1), K(x, x′) = 1 / (1 − x x′)
X = N, K(x, x′) = 2^{x+x′}
X = N, K(x, x′) = 2^{x x′}
X = R+, K(x, x′) = log(1 + x x′)
X = R, K(x, x′) = exp(−|x − x′|²)
X = R, K(x, x′) = cos(x + x′)
X = R, K(x, x′) = cos(x − x′)
X = R+, K(x, x′) = min(x, x′)
X = R+, K(x, x′) = max(x, x′)
X = R+, K(x, x′) = min(x, x′) / max(x, x′)
X = N, K(x, x′) = GCD(x, x′)
X = N, K(x, x′) = LCM(x, x′)
X = N, K(x, x′) = GCD(x, x′) / LCM(x, x′)
48 / 635
Outline
2 Kernel tricks
50 / 635
Smoothness functional
A simple inequality
By Cauchy-Schwarz we have, for any function f ∈ H and any two points x, x′ ∈ X:
|f(x) − f(x′)| = |⟨f, Kx − Kx′⟩_H|
 ≤ ‖f‖_H · ‖Kx − Kx′‖_H
 = ‖f‖_H · dK(x, x′) .
The norm of a function in the RKHS controls how fast the function varies over X with respect to the geometry defined by the kernel (Lipschitz with constant ‖f‖_H).
Important message: small norm implies slow variations.
51 / 635
Kernels and RKHS : Summary
P.d. kernels can be thought of as inner product after embedding the
data space X in some Hilbert space. As such a p.d. kernel defines a
metric on X .
A realization of this embedding is the RKHS, valid without
restriction on the space X nor on the kernel.
The RKHS is a space of functions over X . The norm of a function
in the RKHS is related to its degree of smoothness w.r.t. the metric
defined by the kernel on X .
We will now see some applications of kernels and RKHS in statistics,
before coming back to the problem of choosing (and eventually
designing) the kernel.
52 / 635
Part 2
Kernel tricks
53 / 635
Motivations
Two theoretical results underpin a family of powerful algorithms for data
analysis using p.d. kernels, collectively known as kernel methods:
The kernel trick, based on the representation of p.d. kernels as inner
products;
The representer theorem, based on some properties of the
regularization functional defined by the RKHS norm.
54 / 635
Outline
2 Kernel tricks
The kernel trick
The representer theorem
55 / 635
Motivations
Choosing a p.d. kernel K on a set X amounts to embedding the
data in a Hilbert space: there exists a Hilbert space H and a
mapping : X 7 H such that, for all x, x0 X ,
x, x0 X 2 , K x, x0 = (x) , x0 H .
56 / 635
The kernel trick
Proposition
Any algorithm to process finite-dimensional vectors that can be expressed
only in terms of pairwise inner products can be applied to potentially
infinite-dimensional vectors in the feature space of a p.d. kernel by
replacing each inner product evaluation by a kernel evaluation.
Remarks:
The proof of this proposition is trivial, because the kernel is exactly
the inner product in the feature space.
This trick has huge practical applications.
Vectors in the feature space are only manipulated implicitly, through
pairwise inner products.
57 / 635
Example 1: computing distances in the feature space
(Figure: two points x1, x2 in X mapped to Φ(x1), Φ(x2) in the feature space F.)
dK(x1, x2)² := ‖Φ(x1) − Φ(x2)‖²_H = K(x1, x1) + K(x2, x2) − 2 K(x1, x2) ,
so the distance can be computed from kernel evaluations only, without knowing Φ explicitly.
58 / 635
Distance for the Gaussian kernel
For the Gaussian kernel K(x, y) = exp(−‖x − y‖² / (2σ²)), K(x, x) = 1 = ‖Φ(x)‖²_H, so all points are mapped to the unit sphere in the feature space.
The distance between the images of two points x and y in the feature space is given by:
dK(x, y) = √( 2 ( 1 − e^{−‖x−y‖² / (2σ²)} ) ) .
(Figure: dK(x, y) as a function of ‖x − y‖.)
59 / 635
Example 2: distance between a point and a set
Problem
Let S = (x1 , , xn ) be a finite set of points in X .
How to define and compute the similarity between any point x in X
and the set S?
A solution:
Map all points to the feature space.
Summarize S by the barycenter of the points:
μ := (1/n) ∑_{i=1}^n Φ(xi) .
Define the distance:
dK(x, S) := ‖Φ(x) − μ‖_H .
60 / 635
Computation
(Figure: the point Φ(x) and the barycenter μ of Φ(S) in the feature space F.)
dK(x, S) = ‖ Φ(x) − (1/n) ∑_{i=1}^n Φ(xi) ‖_H
 = √( K(x, x) − (2/n) ∑_{i=1}^n K(x, xi) + (1/n²) ∑_{i,j=1}^n K(xi, xj) ) .
Remark
The barycenter μ only exists in the feature space in general: it does not necessarily have a pre-image x_μ such that Φ(x_μ) = μ.
61 / 635
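The formula above only involves kernel evaluations, so it applies to any p.d. kernel. A small sketch in Python/NumPy (the Gaussian kernel and the toy set, which matches the 1D illustration below, are arbitrary choices):

import numpy as np

def gaussian_K(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))   # 1D Gaussian kernel

def dist_to_set(x, S, K):
    # d_K(x, S): distance in the feature space between phi(x) and the barycenter of phi(S)
    k_xx = K(x, x)
    k_xS = np.mean([K(x, xi) for xi in S])
    k_SS = np.mean([[K(xi, xj) for xj in S] for xi in S])
    return np.sqrt(k_xx - 2.0 * k_xS + k_SS)

S = [2.0, 3.0]
print(dist_to_set(2.5, S, gaussian_K), dist_to_set(5.0, S, gaussian_K))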
1D illustration
S = {2, 3}. Plot f(x) = d(x, S) for three kernels:
K(x, y) = xy (linear); K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1; K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 0.2.
(Figure: d(x, S) as a function of x for the three kernels.)
Remarks
For the linear kernel, H = R, μ = 2.5 and d(x, S) = |x − μ|.
For the Gaussian kernel, d(x, S) = √( C − (2/n) ∑_{i=1}^n K(xi, x) ) for a constant C.
62 / 635
2D illustration
(Figure: level sets of d(x, S) in 2D for K(x, y) = xy (linear), K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1, and with σ = 0.2.)
Remark
As before, the barycenter μ in H (which is a single point in H) may carry a lot of information about the training data.
63 / 635
Application in discrimination
(Figure: discrimination of two point clouds, for K(x, y) = xy (linear), K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1, and with σ = 0.2.)
64 / 635
Example 3: Centering data in the feature space
Problem
Let S = (x1, ..., xn) be a finite set of points in X endowed with a p.d. kernel K. Let K be their n × n Gram matrix: [K]ij = K(xi, xj).
Let μ = (1/n) ∑_{i=1}^n Φ(xi) be their barycenter, and ui = Φ(xi) − μ for i = 1, ..., n the centered data in H.
How to compute the centered Gram matrix [Kc]ij = ⟨ui, uj⟩_H?
Computation
A direct computation gives, for 1 ≤ i, j ≤ n:
Kc = K − UK − KU + UKU = (I − U) K (I − U) ,
where U is the n × n matrix with all entries equal to 1/n.
66 / 635
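A minimal sketch of this computation in Python/NumPy, using the closed form above (the linear-kernel test is just a sanity check, not part of the slides):

import numpy as np

def center_gram(Kmat):
    # Gram matrix of the centered points u_i = phi(x_i) - mu in the feature space
    n = Kmat.shape[0]
    U = np.full((n, n), 1.0 / n)
    I = np.eye(n)
    return (I - U) @ Kmat @ (I - U)

# With a linear kernel, centering the Gram matrix equals centering the data directly
X = np.random.randn(5, 3)
K = X @ X.T
Xc = X - X.mean(axis=0)
print(np.allclose(center_gram(K), Xc @ Xc.T))   # True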
Kernel trick Summary
The kernel trick is a trivial statement with important applications.
It can be used to obtain nonlinear versions of well-known linear
algorithms, e.g., by replacing the classical inner product by a
Gaussian kernel.
It can be used to apply classical algorithms to non vectorial data
(e.g., strings, graphs) by again replacing the classical inner product
by a valid kernel for the data.
It allows in some cases to embed the initial space to a larger feature
space and involve points in the feature space with no pre-image
(e.g., barycenter).
67 / 635
Outline
2 Kernel tricks
The kernel trick
The representer theorem
68 / 635
Motivation
An RKHS is a space of (potentially nonlinear) functions, and k f kH
measures the smoothness of f
Given a set of data (xi ∈ X, yi ∈ R)_{i=1,...,n}, a natural way to estimate a regression function f : X → R is to solve something like:
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ(yi, f(xi))  +  λ ‖f‖²_H        (1)
where the first term is the empirical risk (data fit) and the second the regularization.
69 / 635
The Theorem
Representer Theorem
Let X be a set endowed with a p.d. kernel K , H the corresponding
RKHS, and S = {x1 , , xn } X a finite set of points in X .
Let Ψ : R^{n+1} → R be a function of n + 1 variables, strictly increasing with respect to the last variable.
Then, any solution to the optimization problem:
min_{f∈H} Ψ( f(x1), ..., f(xn), ‖f‖_H )
admits a representation of the form
f ∈ Span(Kx1, ..., Kxn), i.e., f(x) = ∑_{i=1}^n αi K(xi, x) for some α ∈ R^n.
70 / 635
Proof (1/2)
Let (f ) be the functional that is minimized in the statement of the
representer theorem, and HS the linear span in H of the vectors Kxi :
HS = { f ∈ H : f(x) = ∑_{i=1}^n αi K(xi, x), (α1, ..., αn) ∈ R^n } .
Any f ∈ H can be uniquely decomposed as
f = fS + f⊥ ,
with fS ∈ HS and f⊥ ⊥ HS.
71 / 635
Proof (2/2)
H being a RKHS it holds that:
i = 1, , n, f (xi ) = fS (xi ) .
72 / 635
Remarks
Often the function has the form:
73 / 635
Practical use of the representer theorem (1/2)
When the representer theorem holds, we know that we can look for a solution of the form
f(x) = ∑_{i=1}^n αi K(xi, x) , for some α ∈ R^n .
Furthermore,
‖f‖²_H = ∑_{i=1}^n ∑_{j=1}^n αi αj K(xi, xj) = α^T K α .
74 / 635
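As a sketch (Python/NumPy), once a kernel and a candidate vector α are fixed, both the predictions and the penalty term above are plain matrix products; the helper names are illustrative only:

import numpy as np

def f_values(K_test_train, alpha):
    # Evaluate f(x) = sum_i alpha_i K(x_i, x) at test points from the cross Gram matrix
    return K_test_train @ alpha

def rkhs_norm_sq(K_train, alpha):
    # ||f||_H^2 = alpha^T K alpha
    return float(alpha @ K_train @ alpha)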
Practical use of the representer theorem (2/2)
Therefore, a problem of the form
75 / 635
Remarks
Dual interpretations of kernel methods
Most kernel methods have two complementary interpretations:
A geometric interpretation in the feature space, thanks to the kernel
trick. Even when the feature space is large, most kernel methods
work in the linear span of the embeddings of the points available.
A functional interpretation, often as an optimization problem over
(subsets of) the RKHS associated to the kernel.
The representer theorem has important consequences, but it is in fact
rather trivial. We are looking for a function f in H such that for all x
in X , f (x) = hKx , f iH . The part f that is orthogonal to the Kxi s is
thus useless to explain the training data.
76 / 635
Part 3
Kernel Methods
Supervised Learning
77 / 635
Supervised learning
Definition
Given:
X , a space of inputs,
Y, a space of outputs,
Sn = (xi , yi )i=1,...,n , a training set of (input,output) pairs,
the supervised learning problem is to estimate a function h : X Y to
predict the output for any future input.
Depending on the nature of the output, this covers:
Regression when Y = R;
Classification when Y = {1, 1} or any set of two labels;
Structured output regression or classification when Y is more
general.
78 / 635
Example: regression
Task: predict the capacity of a small molecule to inhibit a drug target
X = set of molecular structures (graphs?)
Y=R
79 / 635
Example: classification
Task: recognize if an image is a dog or a cat
X = set of images (Rd )
Y = {cat,dog}
80 / 635
Example: structured output
Task: translate from Japanese to French
X = finite-length strings of japanese characters
Y = finite-length strings of french characters
81 / 635
Supervised learning with kernels: general principles
1 Express h : X → Y using a real-valued function f : Z → R:
regression (Y = R): h(x) = f(x) with f : X → R (Z = X)
classification (Y = {−1, 1}): h(x) = sign(f(x)) with f : X → R (Z = X)
structured output: h(x) = argmax_{y∈Y} f(x, y) with f : X × Y → R (Z = X × Y)
83 / 635
Regression
Setup
X set of inputs
Y = R real-valued outputs
Sn = (xi , yi )i=1,...,n (X R)n a training set of n pairs
Goal = find a function f : X R to predict y by f (x)
85 / 635
Least-square regression over a general functional space
Let us quantify the error when f predicts f(x) instead of y by the squared error:
ℓ(f(x), y) = (y − f(x))²
Fix a set of functions H.
Least-square regression amounts to finding the function in H with the smallest empirical risk, called in this case the mean squared error (MSE):
f̂ ∈ argmin_{f∈H}  (1/n) ∑_{i=1}^n (yi − f(xi))²
86 / 635
Kernel ridge regression (KRR)
Let us now consider a RKHS H, associated to a p.d. kernel K on X .
KRR is obtained by regularizing the MSE criterion by the RKHS
norm:
f̂ = argmin_{f∈H}  (1/n) ∑_{i=1}^n (yi − f(xi))² + λ ‖f‖²_H        (2)
87 / 635
Solving KRR
Let y = (y1, ..., yn)^T ∈ R^n
Let α = (α1, ..., αn)^T ∈ R^n be the coefficients in the expansion f(x) = ∑_{i=1}^n αi K(xi, x)
Let K be the n × n Gram matrix: Kij = K(xi, xj)
We can then write:
( f(x1), ..., f(xn) )^T = Kα ,   ‖f‖²_H = α^T K α .
The KRR problem (2) becomes:
argmin_{α∈R^n}  (1/n) (Kα − y)^T (Kα − y) + λ α^T K α
88 / 635
Solving KRR
(1/n) (Kα − y)^T (Kα − y) + λ α^T K α is convex and differentiable in α; setting its gradient to zero gives
K [ (K + λnI) α − y ] = 0 ,
and a solution is
α = (K + λnI)^{-1} y .
89 / 635
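As a sketch, the whole KRR procedure fits in a few lines of Python/NumPy; the Gaussian kernel, its bandwidth, the value of λ and the toy data below are arbitrary choices, not prescribed by the slides:

import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)) between two sets of points
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # alpha = (K + lambda n I)^{-1} y  (solve a linear system rather than invert)
    n = len(y)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x), evaluated at all test points
    return gaussian_gram(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
alpha = krr_fit(X, y, lam=0.01)
print(krr_predict(X, alpha, np.array([[5.0]])))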
Example (KRR with Gaussian RBF kernel)
(Figures: KRR fit on 1D data for λ = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 10^-4, 10^-5, 10^-6, 10^-7.)
90 / 635
Remark: uniqueness of the solution
Let us find all α's that solve
K [ (K + λnI) α − y ] = 0 .
K being a symmetric matrix, it can be diagonalized in an orthonormal basis and Ker(K) ⊥ Im(K).
In this basis we see that (K + λnI)^{-1} leaves Im(K) and Ker(K) invariant.
The problem is therefore equivalent to:
(K + λnI) α − y ∈ Ker(K)
⟺ α − (K + λnI)^{-1} y ∈ Ker(K)
⟺ α = (K + λnI)^{-1} y + ε , with Kε = 0.
However, if α′ = α + ε with Kε = 0, then:
‖f − f′‖²_H = (α − α′)^T K (α − α′) = 0 ,
so the function f̂ is uniquely defined even when α is not.
Remark: link with standard ridge regression
When X = R^d with the linear kernel, K = XX^T and the representer theorem gives f̂(x) = w_KRR^T x with
w_KRR = ∑_{i=1}^n αi xi = X^T α = X^T (XX^T + λnI)^{-1} y .
92 / 635
Remark: link with standard ridge regression
On the other hand, the RKHS is the set of linear functions
f (x) = w> x and the RKHS norm is k f kH = k w k
We can therefore directly rewrite the original KRR problem (2) as
argmin_{w∈R^d}  (1/n) ∑_{i=1}^n (yi − w^T xi)² + λ ‖w‖²
 = argmin_{w∈R^d}  (1/n) (y − Xw)^T (y − Xw) + λ w^T w
93 / 635
Remark: link with standard ridge regression
Matrix inversion lemma
For any matrices B and C, and λ > 0, the following holds (when dimensions make sense):
B (CB + λI)^{-1} = (BC + λI)^{-1} B .
We deduce that (of course...):
w_RR = (X^T X + λnI)^{-1} X^T y = X^T (XX^T + λnI)^{-1} y = w_KRR ,
where the first expression inverts a d × d matrix and the second an n × n matrix.
94 / 635
Robust regression
The squared error ℓ(t, y) = (t − y)² is arbitrary and sensitive to outliers.
Many other loss functions exist for regression.
Weighted regression: given nonnegative weights W = diag(W1, ..., Wn), consider
argmin_{α∈R^n}  (1/n) (Kα − y)^T W (Kα − y) + λ α^T K α .
96 / 635
Weighted regression
Setting the gradient to zero gives
0 = (2/n) (KWKα − KWy) + 2λKα
 = (2/n) K W^{1/2} [ (W^{1/2} K W^{1/2} + λnI) W^{-1/2} α − W^{1/2} y ] .
A solution is therefore given by
(W^{1/2} K W^{1/2} + λnI) W^{-1/2} α − W^{1/2} y = 0 ,
therefore
α = W^{1/2} (W^{1/2} K W^{1/2} + λnI)^{-1} W^{1/2} y .
97 / 635
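A sketch of this weighted solver in Python/NumPy, which is reused as the inner step of the IRLS algorithm for kernel logistic regression further below; it assumes strictly positive weights so that W^{1/2} and W^{-1/2} exist:

import numpy as np

def solve_wkrr(K, W, z, lam):
    # alpha = W^{1/2} (W^{1/2} K W^{1/2} + lambda n I)^{-1} W^{1/2} z,
    # the minimizer of (1/n)(K alpha - z)^T W (K alpha - z) + lambda alpha^T K alpha
    n = K.shape[0]
    Wsq = np.diag(np.sqrt(W))          # W given as a vector of positive weights
    A = Wsq @ K @ Wsq + lam * n * np.eye(n)
    return Wsq @ np.linalg.solve(A, Wsq @ z)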
Binary classification
Setup
X set of inputs
Y = {−1, 1} binary outputs
Sn = (xi , yi )i=1,...,n (X Y)n a training set of n pairs
Goal = find a function f : X R to predict y by sign(f (x))
99 / 635
The 0/1 loss
The 0/1 loss measures if a prediction is correct or not:
ℓ_{0/1}(f(x), y) = 1(y f(x) < 0) = 0 if y = sign(f(x)), 1 otherwise.
However:
The problem is non-smooth, and typically NP-hard to solve.
The regularization has no effect since the 0/1 loss is invariant by scaling of f.
In fact, no function achieves the minimum when λ > 0 (why?).
100 / 635
The logistic loss
An alternative is to define a probabilistic model of y parametrized by f(x), e.g.:
∀y ∈ {−1, 1}, p(y | f(x)) = 1 / (1 + e^{−y f(x)}) = σ(y f(x))
(Figure: the sigmoid function σ(u).)
101 / 635
Kernel logistic regression (KLR)
f̂ = argmin_{f∈H}  (1/n) ∑_{i=1}^n ℓ_logistic( f(xi), yi ) + (λ/2) ‖f‖²_H
  = argmin_{f∈H}  (1/n) ∑_{i=1}^n ln( 1 + e^{−yi f(xi)} ) + (λ/2) ‖f‖²_H
102 / 635
Solving KLR
By the representer theorem, any solution of KLR can be expanded as
f̂(x) = ∑_{i=1}^n αi K(xi, x) .
103 / 635
Technical facts
(Figure: the sigmoid σ(u) and the logistic loss ℓ_logistic(u) = ln(1 + e^{−u}).)
The logistic loss is convex, with ℓ′_logistic(u) = −σ(−u) and ℓ″_logistic(u) = σ(u)σ(−u) ≥ 0.
104 / 635
Back to KLR
min_{α∈R^n}  J(α) = (1/n) ∑_{i=1}^n ℓ_logistic( yi [Kα]_i ) + (λ/2) α^T K α
105 / 635
Computing the quadratic approximation
Gradient:
∂J/∂αj = (1/n) ∑_{i=1}^n ℓ′_logistic( yi [Kα]_i ) yi Kij + λ [Kα]_j ,
therefore
∇J(α) = (1/n) K P(α) y + λKα ,
where P(α) = diag( P1(α), ..., Pn(α) ) and Pi(α) = ℓ′_logistic( yi [Kα]_i ).
Hessian:
∂²J/∂αj∂αl = (1/n) ∑_{i=1}^n ℓ″_logistic( yi [Kα]_i ) yi Kij yi Kil + λ Kjl ,
therefore
∇²J(α) = (1/n) K W(α) K + λK ,
where W(α) = diag( W1(α), ..., Wn(α) ) and Wi(α) = ℓ″_logistic( yi [Kα]_i ).
106 / 635
Computing the quadratic approximation
The Newton step minimizes the quadratic approximation of J around the current iterate α0:
Jq(α) = J(α0) + (α − α0)^T ∇J(α0) + (1/2) (α − α0)^T ∇²J(α0) (α − α0) .
Terms that depend on α, with P = P(α0) and W = W(α0):
α^T ∇J(α0) = (1/n) α^T K P y + λ α^T K α0
(1/2) α^T ∇²J(α0) α = (1/(2n)) α^T K W K α + (λ/2) α^T K α
−α^T ∇²J(α0) α0 = −(1/n) α^T K W K α0 − λ α^T K α0
Putting it all together:
2 Jq(α) = −(2/n) α^T K W ( K α0 − W^{-1} P y ) + (1/n) α^T K W K α + λ α^T K α + C
        = (1/n) (Kα − z)^T W (Kα − z) + λ α^T K α + C′ ,   with z := K α0 − W^{-1} P y .
This is a standard weighted kernel ridge regression (WKRR) problem!
107 / 635
Solving KLR by IRLS
In summary, one way to solve KLR is to iteratively solve a WKRR
problem until convergence:
α^{t+1} ← solveWKRR(K, W^t, z^t)
108 / 635
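A compact sketch of this loop in Python/NumPy, reusing the solve_wkrr helper sketched after the weighted regression slide; the zero initialization and fixed iteration count are arbitrary choices:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def klr_irls(K, y, lam, n_iter=20):
    # Kernel logistic regression by iteratively reweighted least squares (Newton's method)
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = K @ alpha                          # current values f(x_i)
        P = -sigmoid(-y * m)                   # P_i = l'_logistic(y_i m_i)
        W = sigmoid(y * m) * sigmoid(-y * m)   # W_i = l''_logistic(y_i m_i) > 0
        z = m - P * y / W                      # z = K alpha - W^{-1} P y
        alpha = solve_wkrr(K, W, z, lam)       # weighted KRR step
    return alpha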
Loss function examples
(Figure: the loss functions φ(u) listed below.)
Method | φ(u)
Kernel logistic regression | log(1 + e^{−u})
Support vector machine (1-SVM) | max(1 − u, 0)
Support vector machine (2-SVM) | max(1 − u, 0)²
Boosting | e^{−u}
111 / 635
Large-margin classifiers
Definition
Given a non-increasing function φ : R → R+, a (kernel) large-margin classifier is an algorithm that estimates a function f : X → R by solving
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H
Questions:
1 Can we solve the optimization problem for other s?
2 Is it a good idea to optimize this objective function, if at the end of
the day we are interested in the `0/1 loss, i.e., learning models that
make few errors?
112 / 635
Solving large-margin classifiers
n
1X
min (yi f (xi )) + k f k2H
f H n
i=1
When is convex, this can be solved using general tools for convex
optimization, or specific algorithms (e.g., for SVM, see later).
113 / 635
A tiny bit of learning theory
Assumptions and notations
Let P be an (unknown) distribution on X × Y, and η(x) = P(Y = 1 | X = x) a measurable version of the conditional distribution of Y given X.
Assume the training set Sn = (Xi, Yi)_{i=1,...,n} consists of i.i.d. random variables with distribution P.
The risk of a classifier f : X → R is R(f) = P( sign(f(X)) ≠ Y ).
The Bayes risk is R* = inf_{f measurable} R(f).
The φ-risk of f is Rφ(f) = E[ φ(Y f(X)) ], with optimal value Rφ* = inf_{f measurable} Rφ(f).
115 / 635
A small -risk ensures a small 0/1 risk
Theorem [Bartlett et al., 2003]
Let φ : R → R+ be convex, non-increasing, differentiable at 0 with φ′(0) < 0. Let f* : X → R be measurable such that
Rφ(f*) = min_{g measurable} Rφ(g) = Rφ* .
Then
R(f*) = min_{g measurable} R(g) = R* .
Remarks:
This tells us that, if we know P, then minimizing the -risk is a
good idea even if our focus is on the classification error.
The assumptions on can be relaxed; it works for the broader class
of classification-calibrated loss functions [Bartlett et al., 2003].
More generally, we can show that if Rφ(f) − Rφ* is small, then
R(f) − R* is small too [Bartlett et al., 2003].
116 / 635
A small -risk ensures a small 0/1 risk
Proof sketch:
Condition on X = x:
Rφ(f | X = x) = E[ φ(Y f(X)) | X = x ] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)) .
Therefore:
117 / 635
Empirical risk minimization (ERM)
To find a function with a small -risk, the following is a good candidate:
Definition
The ERM estimator on a functional class F is the solution (when it
exists) of:
fn = argmin Rn (f ) .
f F
Questions
Is Rn (f ) a good estimate of the true risk R (f )?
Is R (fn ) small?
118 / 635
Class capacity
Motivations
The ERM principle gives a good solution if R fn is similar to the
minimum achievable risk inf f F R (f ).
This can be ensured if F is not too large.
We need a measure of the capacity of F: the Rademacher complexity
Rad_n(F) = E[ sup_{f∈F} (2/n) ∑_{i=1}^n σi f(Xi) ] ,
where the expectation is over (Xi)_{i=1,...,n} and the independent uniform {±1}-valued (Rademacher) random variables (σi)_{i=1,...,n}.
119 / 635
Basic learning bounds
Theorem
Suppose φ is Lipschitz with constant Lφ:
∀u, u′ ∈ R, |φ(u) − φ(u′)| ≤ Lφ |u − u′| .
Then the φ-risk of the ERM estimator satisfies (on average over the sampling of the training set)
E_{Sn} Rφ(f̂n) − Rφ*  ≤  4 Lφ Rad_n(F)  +  ( inf_{f∈F} Rφ(f) − Rφ* ) ,
where the left-hand side is the excess φ-risk, the first term bounds the estimation error and the second is the approximation error.
121 / 635
Proof (1/2)
For the ball F_B = { f ∈ H : ‖f‖_H ≤ B }:
Rad_n(F_B) = E_{X,σ}[ sup_{f∈F_B} (2/n) ∑_{i=1}^n σi f(Xi) ]
 = E_{X,σ}[ sup_{f∈F_B} (2/n) ⟨ f, ∑_{i=1}^n σi K_{Xi} ⟩_H ]      (RKHS)
 = E_{X,σ}[ (2B/n) ‖ ∑_{i=1}^n σi K_{Xi} ‖_H ]                    (Cauchy-Schwarz)
 = (2B/n) E_{X,σ} √( ‖ ∑_{i=1}^n σi K_{Xi} ‖²_H )
 ≤ (2B/n) √( E_{X,σ} ∑_{i,j=1}^n σi σj K(Xi, Xj) )                (Jensen)
122 / 635
Proof (2/2)
But E[σi σj] is 1 if i = j, 0 otherwise. Therefore:
Rad_n(F_B) ≤ (2B/n) √( E_X ∑_{i,j=1}^n E[σi σj] K(Xi, Xj) )
 = (2B/n) √( E_X ∑_{i=1}^n K(Xi, Xi) )
 = 2B √( E_X K(X, X) ) / √n .
123 / 635
Basic learning bounds in RKHS balls
Corollary
Suppose K(X, X) ≤ κ² a.s. (e.g., Gaussian kernel, κ = 1). Then the ERM estimator in F_B satisfies
E Rφ(f̂n) − Rφ*  ≤  8 Lφ κ B / √n  +  ( inf_{f∈F_B} Rφ(f) − Rφ* ) .
Remarks
B controls the trade-off between approximation and estimation error.
The bound on the estimation error is independent of P and decreases with n.
The approximation error is harder to analyze in general.
In practice, B (or λ, next slide) is tuned by cross-validation.
124 / 635
ERM as penalized risk minimization
ERM over F_B solves the constrained minimization problem:
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) )   subject to   ‖f‖_H ≤ B .
By Lagrangian duality, this is equivalent (for some λ ≥ 0) to the penalized problem
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H .
125 / 635
Summary: large margin classifiers
(Figure: the loss functions φ(u) for the 1-SVM, 2-SVM, logistic and boosting losses.)
A large-margin classifier solves
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H .
A few slides on convex duality
(Figure: a primal objective f(x) and its dual q(λ); weak duality, strong duality, and the duality gap.)
Parenthesis on duality gaps
For a primal point x and a dual point λ, the duality gap δ(x, λ) = f(x) − q(λ) guarantees that 0 ≤ f(x) − f(x*) ≤ δ(x, λ).
Dual problems are often obtained by Lagrangian or Fenchel duality.
129 / 635
A few slides on Lagrangian duality
Setting
We consider an equality and inequality constrained optimization
problem over a variable x X :
minimize f (x)
subject to hi (x) = 0 , i = 1, . . . , m ,
gj (x) 0 , j = 1, . . . , r ,
130 / 635
A few slides on Lagrangian duality
Lagrangian
The Lagrangian of this problem is the function L : X Rm Rr R
defined by:
m
X r
X
L (x, , ) = f (x) + i hi (x) + j gj (x) .
i=1 j=1
131 / 635
A few slides on convex Lagrangian duality
The Lagrange dual function is defined as q(λ, μ) = inf_{x∈X} L(x, λ, μ).
For the (primal) problem:
minimize f(x)   subject to h(x) = 0, g(x) ≤ 0,
the Lagrange dual problem is:
maximize q(λ, μ)   subject to μ ≥ 0.
Proposition
q is concave in (λ, μ), even if the original problem is not convex.
The dual function yields lower bounds on the optimal value f* of the original problem when μ is nonnegative:
q(λ, μ) ≤ f* , ∀λ ∈ R^m, ∀μ ∈ R^r, μ ≥ 0 .
132 / 635
Proofs
For each x, the function (, ) 7 L(x, , ) is linear, and therefore
both convex and concave in (, ). The pointwise minimum of
concave functions is concave, therefore q is concave.
Let x be any feasible point, i.e., h(x) = 0 and g (x) 0. Then we
have, for any and 0:
m
X r
X
i hi (x) + i gi (x) 0 ,
i=1 i=1
m
X r
X
= L(x, , ) = f (x) + i hi (x) + i gi (x) f (x) ,
i=1 i=1
133 / 635
Weak duality
Let q the optimal value of the Lagrange dual problem. Each
q(, ) is a lower bound for f ? and by definition q ? is the best lower
bound that is obtained. The following weak duality inequality
therefore always hold:
q? f ? .
134 / 635
Strong duality
We say that strong duality holds if the optimal duality gap is zero,
i.e.:
q? = f ? .
If strong duality holds, then the best lower bound that can be
obtained from the Lagrange dual function is tight
Strong duality does not hold for general nonlinear problems.
It usually holds for convex problems.
Conditions that ensure strong duality for convex problems are called
constraint qualification.
in that case, we have for all feasible primal and dual points x, , ,
135 / 635
Slaters constraint qualification
Strong duality holds for a convex problem:
minimize f (x)
subject to gj (x) 0 , j = 1, . . . , r ,
Ax = b ,
if it is strictly feasible, i.e., there exists at least one feasible point that
satisfies:
gj (x) < 0 , j = 1, . . . , r , Ax = b .
136 / 635
Remarks
Slaters conditions also ensure that the maximum q ? (if > ) is
attained, i.e., there exists a point (? , ? ) with
q (? , ? ) = q ? = f ?
137 / 635
Dual optimal pairs
Suppose that strong duality holds, x? is primal optimal, (? , ? ) is dual
optimal. Then we have:
f(x*) = q(λ*, μ*)
 = inf_x [ f(x) + ∑_{i=1}^m λi* hi(x) + ∑_{j=1}^r μj* gj(x) ]
 ≤ f(x*) + ∑_{i=1}^m λi* hi(x*) + ∑_{j=1}^r μj* gj(x*)
 ≤ f(x*) ,
hence both inequalities are in fact equalities.
138 / 635
Complementary slackness
The first equality shows that:
μj* gj(x*) = 0 , j = 1, . . . , r .
139 / 635
Support vector machines (SVM)
Definition
The hinge loss is the function ℓ_hinge : R → R+:
ℓ_hinge(u) = max(1 − u, 0) = 0 if u ≥ 1, 1 − u otherwise.
(Figure: the hinge loss as a function of y f(x).)
142 / 635
Problem reformulation (1/3)
By the representer theorem, the solution of
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ_hinge( yi f(xi) ) + λ ‖f‖²_H
satisfies
f̂(x) = ∑_{i=1}^n αi K(xi, x) ,
where α solves
min_{α∈R^n}  (1/n) ∑_{i=1}^n ℓ_hinge( yi [Kα]_i ) + λ α^T K α .
143 / 635
Problem reformulation (2/3)
Let us introduce additional slack variables ξ1, ..., ξn ∈ R. The problem is equivalent to:
min_{α∈R^n, ξ∈R^n}  (1/n) ∑_{i=1}^n ξi + λ α^T K α ,
subject to:
ξi ≥ ℓ_hinge( yi [Kα]_i ) .
The objective function is now smooth, but not the constraints.
However it is easy to replace the non-smooth constraint by a conjunction of two smooth constraints, because:
u ≥ ℓ_hinge(v)  ⟺  u ≥ 1 − v and u ≥ 0 .
144 / 635
Problem reformulation (3/3)
In summary, the SVM solution is
f̂(x) = ∑_{i=1}^n αi K(xi, x) ,
where α solves:
SVM (primal formulation)
min_{α∈R^n, ξ∈R^n}  (1/n) ∑_{i=1}^n ξi + λ α^T K α ,
subject to:
yi [Kα]_i + ξi − 1 ≥ 0 , for i = 1, ..., n ,
ξi ≥ 0 , for i = 1, ..., n .
145 / 635
Solving the SVM problem
This is a classical quadratic program (minimization of a convex
quadratic function with linear constraints) for which any
out-of-the-box optimization package can be used.
The dimension of the problem and the number of constraints,
however, are 2n where n is the number of points. General-purpose
QP solvers will have difficulties when n exceeds a few thousands.
Solving the dual of this problem (also a QP) will be more convenient
and lead to faster algorithms (due to the sparsity of the final
solution).
146 / 635
Lagrangian
Let us introduce the Lagrange multipliers μ ∈ R^n and ν ∈ R^n for the two sets of constraints.
The Lagrangian of the problem is:
L(α, ξ, μ, ν) = (1/n) ∑_{i=1}^n ξi + λ α^T K α − ∑_{i=1}^n μi [ yi [Kα]_i + ξi − 1 ] − ∑_{i=1}^n νi ξi .
147 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. α
L(α, ξ, μ, ν) is a convex quadratic function in α. It is minimized whenever its gradient is zero:
∇_α L = 2λKα − K diag(y) μ = 0 , which holds for
α = diag(y) μ / (2λ) .
148 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. ξ
L(α, ξ, μ, ν) is a linear function in ξ.
Its minimum is −∞ except when it is constant in ξ, i.e., when:
∇_ξ L = (1/n) 1 − μ − ν = 0 ,
or equivalently
μ + ν = (1/n) 1 .
149 / 635
Dual function
We therefore obtain the Lagrange dual function:
q(μ, ν) = inf_{α∈R^n, ξ∈R^n} L(α, ξ, μ, ν)
 = μ^T 1 − (1/(4λ)) μ^T diag(y) K diag(y) μ   if μ + ν = (1/n) 1 ,
 = −∞   otherwise.
The dual problem is:
maximize q(μ, ν) subject to μ ≥ 0, ν ≥ 0 .
150 / 635
Dual problem
If μi > 1/n for some i, then there is no νi ≥ 0 such that μi + νi = 1/n, hence q(μ, ν) = −∞.
If 0 ≤ μi ≤ 1/n for all i, then the dual function takes finite values that depend only on μ, by taking νi = 1/n − μi.
The dual problem is therefore equivalent to:
max_{0 ≤ μ ≤ 1/n}  μ^T 1 − (1/(4λ)) μ^T diag(y) K diag(y) μ ,
or with indices:
max_{0 ≤ μ ≤ 1/n}  ∑_{i=1}^n μi − (1/(4λ)) ∑_{i,j=1}^n yi yj μi μj K(xi, xj) .
151 / 635
Back to the primal
Once the dual problem is solved in μ we get a solution of the primal problem by α = diag(y) μ / (2λ).
Because the link is so simple, we can directly plug this into the dual problem to obtain the QP that α must solve:
max_{α∈R^n}  2 ∑_{i=1}^n αi yi − ∑_{i,j=1}^n αi αj K(xi, xj) ,
subject to:
0 ≤ yi αi ≤ 1/(2λn) , for i = 1, ..., n .
152 / 635
Complementary slackness conditions
The complementary slackness conditions are, for i = 1, ..., n:
μi [ yi f̂(xi) + ξi − 1 ] = 0 ,
νi ξi = 0 ,
153 / 635
Analysis of KKT conditions
In terms of α, the conditions read, for i = 1, ..., n:
αi [ yi f̂(xi) + ξi − 1 ] = 0 ,
( αi − yi/(2λn) ) ξi = 0 .
154 / 635
Geometric interpretation
(Figure: the decision function with level sets f(x) = +1, f(x) = 0, f(x) = −1; according to the KKT conditions each training point falls in one of three regimes: yi αi = 1/(2λn), 0 < yi αi < 1/(2λn), or αi = 0.)
155 / 635
Support vectors
Consequence of KKT conditions
The training points with αi ≠ 0 are called support vectors.
Only support vectors are important for the classification of new points:
∀x ∈ X, f̂(x) = ∑_{i=1}^n αi K(xi, x) = ∑_{i∈SV} αi K(xi, x) ,
where SV denotes the set of support vectors.
Consequences
The solution is sparse in , leading to fast algorithms for training
(use of decomposition methods).
The classification of a new point only involves kernel evaluations
with support vectors (fast).
156 / 635
Remark: C-SVM
Often the SVM optimization problem is written in terms of a regularization parameter C instead of λ, as follows:
argmin_{f∈H}  (1/2) ‖f‖²_H + C ∑_{i=1}^n ℓ_hinge( f(xi), yi ) .
This is equivalent to our formulation with C = 1/(2λn).
The SVM optimization problem is then:
max_{α∈R^n}  2 ∑_{i=1}^n αi yi − ∑_{i,j=1}^n αi αj K(xi, xj) ,
subject to:
0 ≤ yi αi ≤ C , for i = 1, ..., n .
This formulation is often called C-SVM.
157 / 635
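In practice one rarely solves this QP by hand. As a sketch (assuming scikit-learn is available; the toy data and C value are arbitrary), a C-SVM can be trained directly from any precomputed p.d. Gram matrix:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                     # Gaussian Gram matrix (any p.d. kernel matrix works)

clf = SVC(C=1.0, kernel="precomputed")    # C plays the role of 1/(2 lambda n)
clf.fit(K, y)                             # train on the n x n Gram matrix
print(clf.support_)                       # indices of the support vectors
# To predict on new points, pass the (n_test x n_train) kernel matrix K(x_test, x_train).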
Remark: 2-SVM
A variant of the SVM, sometimes called 2-SVM, is obtained by
replacing the hinge loss by the square hinge loss:
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ_hinge( yi f(xi) )² + λ ‖f‖²_H .
158 / 635
Part 4
Kernel Methods
Unsupervised Learning
159 / 635
The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1, ..., xn in R^p, it consists of performing alternate minimization steps for optimizing the following cost function:
min_{μj ∈ R^p for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ xi − μ_{si} ‖²₂ .
161 / 635
The kernel K-means algorithm
We may now modify the objective to operate in a RKHS. Given data
points x1 , . . . , xn in X and a p.d. kernel K : X X R with H its
RKHS, the new objective becomes
min_{μj ∈ H for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − μ_{si} ‖²_H .
To optimize the cost function, we will first use the following Proposition
Proposition
The center of mass μ_n := (1/n) ∑_{i=1}^n Φ(xi) solves the following optimization problem:
min_{μ∈H}  ∑_{i=1}^n ‖ Φ(xi) − μ ‖²_H .
162 / 635
The kernel K-means algorithm
Proof
(1/n) ∑_{i=1}^n ‖Φ(xi) − μ‖²_H = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − 2 ⟨ (1/n) ∑_{i=1}^n Φ(xi), μ ⟩_H + ‖μ‖²_H
 = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − 2 ⟨μ_n, μ⟩_H + ‖μ‖²_H
 = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − ‖μ_n‖²_H + ‖μ_n − μ‖²_H ,
which is minimized for μ = μ_n.
163 / 635
The kernel K-means algorithm
Given now the objective,
min_{μj ∈ H for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − μ_{si} ‖²_H ,
the algorithm alternates two steps: (i) assignment, si ← argmin_j ‖Φ(xi) − μj‖²_H for each i, and (ii) centroid update, μj ← (1/|Cj|) ∑_{i∈Cj} Φ(xi), where Cj = {i : si = j}.
164 / 635
The kernel K-means algorithm, equivalent objective
Note that all operations are performed by manipulating kernel values K(xi, xj) only. Implicitly, we are in fact optimizing
min_{si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − (1/|C_{si}|) ∑_{j∈C_{si}} Φ(xj) ‖²_H ,
or, equivalently,
min_{si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n [ K(xi, xi) − (2/|C_{si}|) ∑_{j∈C_{si}} K(xi, xj) + (1/|C_{si}|²) ∑_{j,l∈C_{si}} K(xj, xl) ] .
Note that
∑_{i=1}^n (1/|C_{si}|²) ∑_{j,l∈C_{si}} K(xj, xl) = ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj)
and
∑_{i=1}^n (1/|C_{si}|) ∑_{j∈C_{si}} K(xi, xj) = ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj) .
165 / 635
The kernel K-means algorithm, equivalent objective
Then, after removing the constant terms K(xi, xi), we obtain:
Proposition
The kernel K-means objective is equivalent to the following one:
max_{si ∈ {1,...,k} for i=1,...,n}  ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj) .
166 / 635
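A sketch of the resulting algorithm in Python/NumPy: assignments are updated using only kernel evaluations, via the expansion of ‖Φ(xi) − μl‖² given above; the random initialization and fixed iteration budget are arbitrary choices:

import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    # Kernel K-means: cluster n points given only their Gram matrix K (n x n)
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for l in range(k):
            idx = np.where(labels == l)[0]
            if len(idx) == 0:                    # keep empty clusters out of the way
                dist[:, l] = np.inf
                continue
            # ||phi(x_i) - mu_l||^2, up to the constant K(x_i, x_i)
            dist[:, l] = -2.0 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels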
The spectral clustering algorithms
Instead of a greedy approach, we can relax the problem into a feasible
one, which yields a class of algorithms called spectral clustering.
First, consider the objective
k
X 1 X
max K (xi , xj ).
si {1,...,k} |Cl |
for i=1,...,n l=1 i,jCl
and we introduce
(?) the binary assignment matrix A in {0, 1}nk whose rows sum to one.
167 / 635
The spectral clustering algorithms
168 / 635
Principal Component Analysis (PCA)
Classical setting
Let S = {x1 , . . . , xn } be a set of vectors (xi Rd )
PCA is a classical algorithm in multivariate statistics to define a set
of orthogonal directions that capture the maximum variance
Applications: low-dimensional representation of high-dimensional
points, visualization
PC2 PC1
170 / 635
Principal Component Analysis (PCA)
Formalization
Assume that the data are centered (otherwise center them as preprocessing), i.e.:
(1/n) ∑_{i=1}^n xi = 0 .
For a direction w ∈ R^d, let hw(x) = x^T w / ‖w‖₂ denote the projection onto w.
171 / 635
Principal Component Analysis (PCA)
Formalization
The empirical variance captured by hw is:
var̂(hw) := (1/n) ∑_{i=1}^n hw(xi)² = (1/n) ∑_{i=1}^n (xi^T w)² / ‖w‖²₂ .
172 / 635
Principal Component Analysis (PCA)
Solution
Let X be the n × d data matrix whose rows are the vectors x1^T, ..., xn^T. We can then write:
var̂(hw) = (1/n) ∑_{i=1}^n (xi^T w)² / ‖w‖²₂ = (1/n) w^T X^T X w / (w^T w) ,
so the principal directions are the leading eigenvectors of X^T X.
173 / 635
Kernel Principal Component Analysis (PCA)
Let x1 , . . . , xn be a set of data points in X ; let K : X X R be a
positive definite kernel and H be its RKHS.
Formalization
Assume that the data are centered in the feature space (otherwise center by manipulating the kernel matrix), i.e.:
(1/n) ∑_{i=1}^n Φ(xi) = 0 .
174 / 635
Kernel Principal Component Analysis (PCA)
Let x1, ..., xn be a set of data points in X; let K : X × X → R be a positive definite kernel and H be its RKHS.
Formalization
The empirical variance captured by a function f ∈ H (playing the role of hw) is:
var̂(hf) := (1/n) ∑_{i=1}^n ⟨Φ(xi), f⟩²_H / ‖f‖²_H = (1/n) ∑_{i=1}^n f(xi)² / ‖f‖²_H .
The i-th kernel principal component is defined recursively by
fi = argmax_{f ⊥ {f1,...,fi−1}} var̂(hf)   s.t.   ‖f‖_H = 1 .
175 / 635
Sanity check: kernel PCA with linear kernel = PCA
fw (x) = w> x ,
Moreover, w w0 fw fw0 .
176 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA solves, for i = 1, ..., d:
fi = argmax_{f ⊥ {f1,...,fi−1}}  ∑_{k=1}^n f(xk)²   s.t.   ‖f‖_H = 1 .
By the representer theorem, each fi can be written fi(·) = ∑_{k=1}^n αi,k K(xk, ·) for some αi ∈ R^n.
177 / 635
Kernel Principal Component Analysis (PCA)
Therefore we have:
‖fi‖²_H = ∑_{k,l=1}^n αi,k αi,l K(xk, xl) = αi^T K αi .
Similarly:
∑_{k=1}^n fi(xk)² = αi^T K² αi ,
and
⟨fi, fj⟩_H = αi^T K αj .
178 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA maximizes over αi ∈ R^n the function αi^T K² αi, under the constraints αi^T K αi = 1 and αi^T K αj = 0 for j < i.
179 / 635
Kernel Principal Component Analysis (PCA)
Solution
Compute the eigenvalue decomposition of the kernel matrix K = UΔU^T, with eigenvalues δ1 ≥ ... ≥ δn ≥ 0.
After the change of variable βi = K^{1/2} αi (with K^{1/2} = UΔ^{1/2}U^T), the problem becomes a standard eigenvalue problem for K, solved by the successive eigenvectors ui.
180 / 635
Kernel Principal Component Analysis (PCA)
Summary
1 Center the Gram matrix
2 Compute the first eigenvectors (ui, δi)
3 Normalize the eigenvectors: αi = ui / √δi
4 The projections of the points onto the i-th component are given by Kαi
181 / 635
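A sketch of these four steps in Python/NumPy, assuming the Gram matrix K of the (possibly non-vectorial) data is available; the linear-kernel example is just a sanity check:

import numpy as np

def kernel_pca(K, n_components=2):
    # Projections of the n points on the first kernel principal components
    n = K.shape[0]
    U = np.full((n, n), 1.0 / n)
    Kc = (np.eye(n) - U) @ K @ (np.eye(n) - U)        # 1. center the Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)             # 2. eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, u = eigvals[order], eigvecs[:, order]
    alpha = u / np.sqrt(np.maximum(lam, 1e-12))       # 3. alpha_i = u_i / sqrt(delta_i)
    return Kc @ alpha                                 # 4. projections K_c alpha_i

# With a linear kernel, this recovers ordinary PCA scores (up to signs)
X = np.random.randn(20, 5)
proj = kernel_pca(X @ X.T, n_components=2)
print(proj.shape)   # (20, 2)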
Kernel Principal Component Analysis (PCA)
Remarks
In this formulation, we must diagonalize the centered kernel Gram
matrix, instead of the covariance matrix in the classical setting
Exercise: check that X> X and XX> have the same spectrum (up to
0 eigenvalues) and that the eigenvectors are related by a simple
relationship.
This formulation remains valid for any p.d. kernel: this is kernel PCA
Applications: nonlinear PCA with nonlinear kernels for vectors, PCA
of non-vector objects (strings, graphs..) with specific kernels...
182 / 635
Example
(Figure: projection onto the first two kernel principal components, PC1 and PC2.)
A set of 74 human tRNA sequences is analyzed using a kernel for sequences (the second-order marginalized kernel based on SCFG). This set of tRNAs contains three classes, called Ala-AGC (white circles), Asn-GTT (black circles) and Cys-GCA (plus symbols) (from Tsuda et al., 2003).
183 / 635
Canonical Correlation Analysis (CCA)
Given two views X = [x1, ..., xn] in R^{p×n} and Y = [y1, ..., yn] in R^{d×n} of the same dataset, the goal of canonical correlation analysis (CCA) is to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize
max_{wa∈R^p, wb∈R^d}  [ (1/n) ∑_{i=1}^n wa^T xi yi^T wb ]  /  [ ( (1/n) ∑_{i=1}^n wa^T xi xi^T wa )^{1/2} ( (1/n) ∑_{i=1}^n wb^T yi yi^T wb )^{1/2} ] .
Assuming that the pairs (xi, yi) are i.i.d. samples from an unknown distribution, CCA seeks to maximize an empirical estimate of the correlation between wa^T x and wb^T y.
185 / 635
Canonical Correlation Analysis (CCA)
Formulation
Assuming that the datasets are centered,
max wa> X> Ywb s.t. wa> X> Xwa = 1 and wb> Y> Ywb = 1.
wa Rp ,wb Rd
186 / 635
Canonical Correlation Analysis (CCA)
Taking the derivatives of the Lagrangian and setting the gradients to zero yields two stationarity conditions. Multiply the first equality by wa^T and the second by wb^T; subtract the two resulting equalities and we get that the two multipliers coincide (call the common value ρ), so that the conditions can be written as
[ 0      X^T Y ] [ wa ]   =   ρ [ X^T X   0     ] [ wa ]
[ Y^T X  0     ] [ wb ]         [ 0       Y^T Y ] [ wb ] .
Canonical Correlation Analysis (CCA)
Let us define
A = [ 0  X^T Y ; Y^T X  0 ] ,   B = [ X^T X  0 ; 0  Y^T Y ] ,   w = [ wa ; wb ] ,
so that CCA amounts to solving the generalized eigenvalue problem A w = ρ B w.
188 / 635
Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels Ka , Kb : X X R, we can obtain two views of a
dataset x1 , . . . , xn in X n :
189 / 635
Kernel Canonical Correlation Analysis
Up to a few technical details (exercise), we can apply the representer theorem and look for solutions fa(·) = ∑_{i=1}^n αi Ka(xi, ·) and fb(·) = ∑_{i=1}^n βi Kb(xi, ·). We finally obtain the formulation
max_{α∈R^n, β∈R^n}  [ (1/n) ∑_{i=1}^n [Ka α]_i [Kb β]_i ]  /  [ ( (1/n) ∑_{i=1}^n [Ka α]_i² )^{1/2} ( (1/n) ∑_{i=1}^n [Kb β]_i² )^{1/2} ] ,
which is equivalent to
max_{α∈R^n, β∈R^n}  α^T Ka Kb β / [ (α^T Ka² α)^{1/2} (β^T Kb² β)^{1/2} ] ,
or, after removing the scaling ambiguity for α and β,
Equivalent formulation
max_{α∈R^n, β∈R^n}  α^T Ka Kb β   s.t.   α^T Ka² α = 1 and β^T Kb² β = 1 .
190 / 635
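This is again a generalized eigenvalue problem. A sketch in Python/NumPy/SciPy; note that the unregularized problem can reach spuriously perfect correlation when the Gram matrices are invertible, so the small ridge term eps on the constraints below is a common fix and an assumption beyond the formulation above:

import numpy as np
from scipy.linalg import eigh

def kernel_cca(Ka, Kb, eps=1e-3):
    # First kernel canonical pair (alpha, beta) for two Gram matrices of the same n points
    n = Ka.shape[0]
    Z = np.zeros((n, n))
    # Generalized eigenproblem  A w = rho B w  with w = [alpha; beta]
    A = np.block([[Z, Ka @ Kb], [Kb @ Ka, Z]])
    B = np.block([[Ka @ Ka + eps * np.eye(n), Z], [Z, Kb @ Kb + eps * np.eye(n)]])
    vals, vecs = eigh(A, B)               # B is symmetric positive definite thanks to eps
    w = vecs[:, np.argmax(vals)]          # eigenvector with the largest correlation
    return w[:n], w[n:]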
Kernel Canonical Correlation Analysis
191 / 635
Kernel Canonical Correlation Analysis
Figure: http://www.tylervigen.com/.
192 / 635
Spurious correlations
Spurious correlations are bad:
Figure: http://www.tylervigen.com/.
193 / 635
Kernel Canonical Correlation Analysis
194 / 635
Kernel Canonical Correlation Analysis
195 / 635
Motivations
The RKHS norm is related to the smoothness of functions.
Smoothness of a function is naturally quantified by Sobolev norms
(in particular L2 norms of derivatives).
Example: spline regression
min_f  ∑_{i=1}^n (yi − f(xi))² + λ ∫ ( f″(t) )² dt
200 / 635
The RKHS point of view
Theorem
Let H = { f : [0,1] → R absolutely continuous, f′ ∈ L²([0,1]), f(0) = 0 }, endowed with the inner product ⟨f, g⟩_H = ∫₀¹ f′(u) g′(u) du.
Then H is a RKHS with r.k. given by:
K(x, y) = min(x, y) .
202 / 635
Proof (2/5)
H is a pre-Hilbert space of functions
H is a vector space of functions, and hf , g iH a bilinear form that
satisfies hf , f iH 0.
f absolutely continuous implies differentiable almost everywhere, and
∀x ∈ [0,1], f(x) = f(0) + ∫₀^x f′(u) du .
By Cauchy-Schwarz:
|f(x)| = | ∫₀^x f′(u) du | ≤ √x ( ∫₀¹ f′(u)² du )^{1/2} = √x · ⟨f, f⟩_H^{1/2} .
203 / 635
Proof (3/5)
H is a Hilbert space
To show that H is complete, let (fn )nN a Cauchy sequence in H
(fn0 )nN is a Cauchy sequence in L2 [0, 1], thus converges to
g L2 [0, 1]
By the previous inequality, (fn (x))nN is a Cauchy sequence and
thus converges to a real number f (x), for any x [0, 1]. Moreover:
Z x Z x
0
f (x) = lim fn (x) = lim fn (u)du = g (u)du ,
n n 0 0
204 / 635
Proof (4/5)
∀x ∈ [0,1], Kx ∈ H
Let Kx(y) = K(x, y) = min(x, y) on [0,1]²:
(Figure: the function t ↦ K(s, t) = min(s, t).)
Kx is differentiable except at s, has a square integrable derivative,
and Kx (0) = 0, therefore Kx H for all x [0, 1].
205 / 635
Proof (5/5)
For all x, f , hf , Kx iH = f (x)
For any x ∈ [0,1] and f ∈ H we have:
⟨f, Kx⟩_H = ∫₀¹ f′(u) Kx′(u) du = ∫₀^x f′(u) du = f(x) ,
206 / 635
Generalization
Theorem
Let X = Rd and D a differential operator on a class of functions H such
that, endowed with the inner product:
(f , g ) H2 , hf , g iH = hDf , Dg iL2 (X ) ,
it is a Hilbert space.
Then H is a RKHS that admits as r.k. the Green function of the
operator D D, where D denotes the adjoint operator of D.
207 / 635
Green function?
Definition
Let the differential equation on H:
f = Dg ,
208 / 635
Proof
Let H be a Hilbert space endowed with the inner product:
hf , g iX = hDf , Dg iL2 (X ) ,
209 / 635
Example
Back to our example, take X = [0,1] and Df(u) = f′(u).
To find the r.k. of H we need to solve in kx:
f(x) = ⟨D*D kx, f⟩_{L²([0,1])} = ⟨D kx, D f⟩_{L²([0,1])} = ∫₀¹ kx′(u) f′(u) du .
The solution is
kx′(u) = 1_{[0,x]}(u) ,
which gives
kx(u) = u if u ≤ x, x otherwise,
and therefore
K(x, x′) = min(x, x′) .
210 / 635
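A quick numerical sanity check of this reproducing property, as a sketch in Python/NumPy: the integral is discretized on a grid and f is an arbitrary smooth function with f(0) = 0.

import numpy as np

u = np.linspace(0.0, 1.0, 100001)
f = np.sin(3.0 * u) * u                 # some f with f(0) = 0 and square-integrable f'
x = 0.37
kx = np.minimum(u, x)                   # K_x(u) = min(x, u)
du = u[1] - u[0]
inner = np.sum(np.gradient(f, u) * np.gradient(kx, u)) * du   # <f, K_x>_H = int_0^1 f' k_x'
print(inner, np.interp(x, u, f))        # both approximately equal f(x)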
Outline
211 / 635
Mercer kernels
Definition
A kernel K on a set X is called a Mercer kernel if:
1 X is a compact metric space (typically, a closed bounded subset of
Rd ).
2 K : X X R is a continuous p.d. kernel (w.r.t. the Borel
topology)
Motivations
We can exhibit an explicit and intuitive feature space for a large
class of p.d. kernels
Historically, provided the first proof that a p.d. kernel is an inner
product for non-finite sets X (Mercer, 1905).
Can be thought of as the natural generalization of the factorization
of positive semidefinite matrices over infinite spaces.
212 / 635
Sketch of the proof that a Mercer kernel is an inner
product
1 The kernel matrix when X is finite becomes a linear operator when
X is a metric space.
2 The matrix was positive semidefinite in the finite case, the linear
operator is self-adjoint and positive in the metric case.
3 The spectral theorem states that any compact linear operator
admits a complete orthonormal basis of eigenfunctions, with
non-negative eigenvalues (just like positive semidefinite matrices can
be diagonalized with nonnegative eigenvalues).
4 The kernel function can then be expanded over the basis of eigenfunctions as:
K(x, t) = ∑_{k=1}^∞ λk ψk(x) ψk(t) .
Recall that a linear operator L on a Hilbert space is self-adjoint if ⟨f, Lg⟩ = ⟨Lf, g⟩ for all f, g, and positive if ⟨f, Lf⟩ ≥ 0 for all f.
214 / 635
An important lemma
The linear operator
Let μ be any Borel measure on X, and L²μ(X) the Hilbert space of square integrable functions on X.
For any function K : X² → R, define the operator LK by:
∀f ∈ L²μ(X), (LK f)(x) = ∫ K(x, t) f(t) dμ(t) .
Lemma
If K is a Mercer kernel, then LK is a compact and bounded linear
operator over L2 (X ), self-adjoint and positive.
215 / 635
Proof (1/6)
LK is a mapping from L2 (X ) to L2 (X )
For any f L2 (X ) and (x1 , x1 ) X 2 :
Z
| LK f (x1 ) LK f (x2 ) | = (K (x1 , t) K (x2 , t)) f (t) d (t)
k K (x1 , ) K (x2 , ) kk f k
(Cauchy-Schwarz)
p
(X ) max | K (x1 , t) K (x2 , t) | k f k.
tX
216 / 635
Proof (2/6)
LK is linear and continuous
Linearity is obvious (by definition of LK and linearity of the integral).
For continuity, we observe that for all f L2 (X ) and x X :
Z
| (LK f ) (x) | = K (x, t) f (t) d (t)
p
(X ) max | K (x, t) | k f k
tX
p
(X )CK k f k.
217 / 635
Proof (3/6)
Criterion for compactness
In order to prove the compactness of LK we need the following criterion.
Let C (X ) denote the set of continuous functions on X endowed with
infinite norm k f k = maxxX | f (x) |.
A set of functions G C (X ) is called equicontinuous if:
Ascoli Theorem
A part H C (X ) is relatively compact (i.e., its closure is compact) if
and only if it is uniformly bounded and equicontinuous.
218 / 635
Proof (4/6)
LK is compact
Let (fn )n0 be a bounded sequence of L2 (X ) (k fn k < M for all n).
The sequence (LK fn )n0 is a sequence of continuous functions, uniformly
bounded because:
p p
k LK fn k (X )CK k fn k (X )CK M .
It is equicontinuous because:
p
| LK fn (x1 ) LK fn (x2 ) | (X ) max | K (x1 , t) K (x2 , t) | M .
tX
219 / 635
Proof (5/6)
LK is self-adjoint
K being symmetric, we have for all f , g H:
Z
hf , Lg i = f (x) (Lg ) (x) (dx)
Z Z
= f (x) g (t) K (x, t) (dx) (dt) (Fubini)
= hLf , g i .
220 / 635
Proof (6/6)
LK is positive
We can approximate the integral by finite sums:
Z Z
hf , Lf i = f (x) f (t) K (x, t) (dx) (dt)
k
(X ) X
= lim K (xi , xj ) f (xi ) f (xj )
k k 2
i,j=1
0,
221 / 635
Diagonalization of the operator
We need the following general result:
Spectral theorem
Let L be a compact linear operator on a Hilbert space H. Then there
exists in H a complete orthonormal system (1 , 2 , . . .) of eigenvectors
of L. The eigenvalues (1 , 2 , . . .) are real if L is self-adjoint, and
non-negative if L is positive.
Remark
This theorem can be applied to LK . In that case the eigenfunctions k
associated to the eigenfunctions k 6= 0 can be considered as continuous
functions, because:
1
k = LK .
k
222 / 635
Main result
Mercer Theorem
Let X be a compact metric space, a Borel measure on X , and K a
continuous p.d. kernel. Let (1 , 2 , . . .) denote the nonnegative
eigenvalues of LK and (1 , 2 , . . .) the corresponding eigenfunctions.
Then all functions k are continuous, and for any x, t X :
X
K (x, t) = k k (x) k (t) ,
k=1
223 / 635
Mercer kernels as inner products
Corollary
The mapping
: X 7 l 2
p
x 7 k k (x)
kN
224 / 635
Proof of the corollary
k k2 (x) converges
P
By Mercer theorem we see that for all x X ,
to K (x, x) < , therefore (x) l 2 .
The continuity of results from:
X
k (x) (t) k2l 2 = k (k (x) k (t))2
k=1
= K (x, x) + K (t, t) 2K (x, t)
225 / 635
Summary
This proof extends the proof valid when X is finite.
This is a constructive proof, developed by Mercer (1905).
The eigensystem (k and k ) depend on the choice of the measure
(dx): different s lead to different feature spaces for a given
kernel and a given space X
Compactness and continuity are required. For instance, for X = Rd ,
the eigenvalues of:
Z
K (x, t) (t) dt = (x)
X
are not necessarily countable, Mercer theorem does not hold. Other
tools are thus required such as the Fourier transform for
shift-invariant kernels.
226 / 635
Example (1/6)
227 / 635
Example (2/6)
Let a p.d. kernel on S d1 of the form:
K (x, t) = x> t ,
2f 2f
f = + . . . + =0
x12 xd2
228 / 635
Example (3/6)
Definition (Spherical harmonics)
A homogeneous polynomial of degree k 0 in Rd whose Laplacian
vanishes is called a homogeneous harmonic of order k.
A spherical harmonic of order k is a homogeneous harmonic of order
k on the unit sphere S d1
The set Yk (d) of spherical harmonics is a vector space of dimension
(2k + d 2)(k + d 3)!
N(n, k) = dim (Yk (d)) = .
k!(d 2)!
229 / 635
Example (4/6)
Spherical harmonics form the Mercers eigenfunctions, because:
Theorem (Funk-Hecke) [e.g., Muller, 1998, p.30]
For any x S d1 , Yk Yk (d) and C ([1, 1]),
Z
>
x t Yk (t) d(t) = k Yk (x)
S d1
where
Z 1 d3
k = S d2 (t)Pk (d; t)(1 t 2 ) 2 dt
1
d1
Z 1
d2
k+ d3
k = S 2
d1
(k) (t) 1 t 2 2
dt
k
2 k+ 2 1
230 / 635
Example (5/6)
N(d;k)
For any k 0, let {Yk,j (d; x)}j=1 an orthonormal basis of Yk (d)
N(d;k)
n o
Spherical harmonics {Yk,j (d; x)}j=1 form an orthonormal
k=0
basis for L2 S d1
231 / 635
Example (6/6)
2
Take d = 2 and K (x, t) = 1 + x> t for x, t S 1
Using Rodrigeus rule we get 3 nonzero eigenvalues:
0 = 3 , 1 = 2 , 2 =
2
with multiplicities 1, 2 and 2
Corresponding eigenfunctions:
x1 x2 x1 x2 x12 x22
1
, , , ,
2
The resulting Mercer feature map is
!
3
r
x12 x22
(x) = , 2x1 , 2x2 , 2x1 x2 ,
2 2
233 / 635
Reminder: expansion of Mercer kernel
Theorem
Denote by LK the linear operator of L2 (X ) defined by:
Z
f L2 (X ) , (LK f ) (x) = K (x, t) f (t) d (t) .
234 / 635
RKHS construction
Theorem
Assuming that all eigenvalues are positive, the RKHS is the Hilbert
space:
( )
X X a 2
HK = f L2 (X ) : f = ai i , with k
<
k
i=1 k=1
Remark
If some eigenvalues are equal to zero, then the result and the proof remain valid
on the subspace spanned by the eigenfunctions with positive eigenvalues.
235 / 635
Proof (1/6)
Sketch
In order to show that HK is the RKHS of the kernel K we need to show
that:
1 it is a Hilbert space of functions from X to R,
2 for any x X , Kx HK ,
3 for any x X and f HK , f (x) = hf , Kx iHK .
236 / 635
Proof (2/6)
HK is a Hilbert space
Indeed the function:
1
LK2 :L2 (X ) HK
X
X p
ai i 7 ai i i
i=1 i=1
237 / 635
Proof (3/6)
HK is a space of continuous functions
P
For any f = i=1 ai i HK , and x X , we have (if f (x) makes sense):
X X a p
i
| f (x) | = ai i (x) = i i (x)
i=1
i=1
i
!1
!1
X ai2 2 X 2
2
. i i (x)
i
i=1 i=1
1
= k f kHK K (x, x) 2
p
= k f kHK CK .
238 / 635
Proof (4/6)
HK is a space of continuous functions (cont.)
Let now fn = ni=1 ai i HK . The functions i are continuous
P
functions, therefore fn is also continuous, for all n. The fn s are
convergent in HK , therefore also in the (complete) space of continuous
functions endowed with the uniform norm.
Let fc the continuous limit function. Then fc L2 (X ) and
k fn fc kL2 (X ) 0.
n
k f fn kL2 (X ) 1 k f fn kHK 0,
n
therefore f = fc .
239 / 635
Proof (5/6)
Kx HK
For any x X let, for all i, ai = i i (x). We have:
X a2 X
i
= i i (x)2 = K (x, x) < ,
i
i=1 i=1
P
therefore x := i=1 ai i HK . As seen earlier the convergence in HK
implies pointwise convergence, therefore for any t X :
X
X
x (t) = ai i (t) = i i (x) i (t) = K (x, t) ,
i=1 i=1
therefore x = Kx HK .
240 / 635
Proof (6/6)
f (x) = hf , Kx iHK
P
Let f = i=1 ai i HK , et x X . We have seen that:
X
Kx = i i (x) i ,
i=1
therefore:
X i i (x) ai X
hf , Kx iHK = = ai i (x) = f (x) ,
i
i=1 i=1
241 / 635
Remarks
Although HK was built from the eigenfunctions of LK , which depend
on the choice of the measure (x), we know by uniqueness of the
RKHS that HK is independant of and LK .
Mercer theorem provides a concrete way to build the RKHS, by
taking linear combinations of the eigenfunctions of LK (with
adequately chosen weights).
The eigenfunctions (i )iN form an orthogonal basis of the RKHS:
1
hi , j iHK = 0 si i 6= j, k i kHK = .
i
The RKHS is a well-defined ellipsoid with axes given by the
eigenfunctions.
242 / 635
Outline
243 / 635
Motivation
Let us suppose that X is not compact, for example X = Rd .
In that case, the eigenvalues of:
Z
K (x, t) (t) d(t) = (t)
X
244 / 635
Fourier-Stieltjes transform on the torus
Let T the torus [0, 2] with 0 and 2 identified
C (T) the set of continuous functions on T
M(T) the finite complex Borel measures2 on T
M(T) can be identified as the dual space (C (T)) : for any
continuous/bounded linear functional : C (T) C there exists
1
R
M(T) such that (f ) = 2 T f (t)d(t) (Riesz theorem).
246 / 635
Translation invariant kernels on Z
Definition
A kernel K : Z Z 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Z , K (x, y) = axy
for some sequence {an }nZ . Such a sequence is called positive definite if
the corresponding kernel K is p.d.
Theorem (Herglotz)
A sequence {an }nZ is p.d. if and only if it is the Fourier-Stieltjes
transform of a positive measure M(T)
246 / 635
Examples
Diagonal kernel:
(
1 if n = 0 ,
Z
1
= dt , an = (n) = e int dt =
2 T 0 otherwise.
resulting in K (x, t) = C
247 / 635
Proof of Herglotzs theorem:
If an = (n) for M(T) positive, then for any n N, x1 , . . . , xn Z
and z1 , . . . , zn R (or C) :
n X
n n n Z
X 1 XX
zi zj axi xj = zi zj e i(xi xj )t d(t)
2 T
i=1 j=1 i=1 j=1
n n Z
1 XX
= zi zj e ixi t e ixj t d(t)
2 T
i=1 j=1
Z X n
1
= | zj e ixj t |2 d(t)
2 T
j=1
0.
248 / 635
Proof of Herglotzs theorem: (1/4)
Let {an }nZ a p.d. sequence
For a given t R and N N let {zn }nZ be
(
e int if | n | N ,
zn =
0 otherwise.
Since {an }nZ is p.d. we get:
N
X N
X N
X N
X
0 akl zk zl = akl e i(kl)t
k=N l=N k=N l=N
2N
X
= (2N + 1 |k|)ak e ikt
k=2N
|k|
1 X
= max 0, 1 ak e ikt
2N + 1 2N + 1
kZ
| {z }
2N (t)
249 / 635
Proof of Herglotzs theorem: (2/4)
dN = N (t)dt is a positive measure (for N even) and satisfies
N
|j| |n|
X Z
i(nj)t
N (n) = aj 1 e = an max 0, 1
N +1 T N +1
j=N
Moreover
Z
k N kM(T) = sup f (t)N (t)dt
k f k 1 T
Z
= N (t)dt (take f = 1 because N (t) 0)
T
N Z
|n|
X
= an 1 e int dt
T N +1
n=N
= a0
250 / 635
Proof of Herglotzs theorem: (3/4)
For any P
trigonometric polynomial of the form
P(t) = K ikt
k=K bk e , with Fourier coefficient P(n) = bn , we have
Z
lim P(t)dN (t)
N+ T
K N Z
|n|
X X
= lim an bk 1 e i(nk)t dt
N+ T N +1
k=K n=N
K
|n|
X
= ak bk lim 1
N+ N +1
k=K
K
X
= ak b k
k=K
X
= ak P(k)
kZ
251 / 635
Proof of Herglotzs theorem: (4/4)
P
This shows that (P) = kZ ak P(k) is a linear functional over
trigonometric polynomials, with norm a0
It can be extended to all continuous functions because trigonometric
polynomials are dense in C (T)
By Riesz representation theorem, there exists a measure M(T)
such that k kM(T) a0
Z
f C (T) , (f ) = f (t)d(t)
T
252 / 635
Fourier transform on Rd
Definition
For any f L1 Rd , the Fourier transform of f is the function:
Z
>
Rd , f () = e ix f (x) dx .
Rd
253 / 635
Fourier transform on Rd
Properties
f is complex-valued, continuous, tends to 0 at infinity and
k f kL k f kL1 .
If f L1 Rd , then the inverse Fourier formula holds:
Z
1 >
x Rd , f (x) = d
e ix f () d.
(2) Rd
Z Z
1 2
| f (x) |2 dx =
f () d .
d
Rd (2) R d
254 / 635
Fourier-Stieltjes transform on Rd
C0 (Rd ) the set of continuous functions on Rd that vanish at infinity
M(Rd ) the finite complex Borel measures on Rd
M(Rd ) can be identified as the dual space C0 (Rd ) : for any
continuous/bounded linear functional : C0 (Rd ) C there exists
M(Rd ) such that (f ) = Rd f (t)d(t) (Riesz theorem).
R
255 / 635
Fourier-Stieltjes transform on Rd
This extends the standard Fourier transform for integrable functions
by taking d(x) = f (x)dx.
For M(Rd ), is still uniformly continuous, but () does not
necessarily go to 0 at infinity (e.g., take the Dirac = 0 , then
() = 1 for all )
Parsevals formula becomes: if M(Rd ), and both g , g are in
L1 (Rd ), then
Z Z
1
g (x)d(x) = g ()()d
Rd (2)d Rd
256 / 635
Translation invariant kernels on Rd
Definition
A kernel K : Rd Rd 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Rd , K (x, y) = (x y)
for some function : Rd R. Such a function is called positive
definite if the corresponding kernel K is p.d.
257 / 635
Translation invariant kernels on Rd
Definition
A kernel K : Rd Rd 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Rd , K (x, y) = (x y)
for some function : Rd R. Such a function is called positive
definite if the corresponding kernel K is p.d.
Theorem (Bochner)
A continuous function : Rd R is p.d. if and only if it is the
Fourier-Stieltjes transform of a symmetric and positive finite Borel
measure M(T)
257 / 635
Proof of Bochners theorem:
If = for some M(T) positive, then for any n N,
x1 , . . . , xn Rd and z1 , . . . , zn R (or C) :
n X
n n X
n Z
>
X X
zi zj (xi xj ) = zi zj e i(xi xj ) t d(t)
i=1 j=1 i=1 j=1 Rd
n X n Z
X > >
= zi zj e ixi t e ixj t d(t)
i=1 j=1 Rd
Z n
>
zj e ixj t |2 d(t)
X
= |
Rd j=1
0.
258 / 635
Proof of Bochners theorem: (1/5)
Lemma
Let : R R continuous. If there exists C 0 such that
Z
1
g ()()d C sup | g (x) |
2
R
xR
261 / 635
Proof of Bochners theorem: (3/5)
In addition, for any n Z:
Z
1
G (n) = e int G (t)dt
2 T
1 X 2 int
Z
t + 2m
= e g dt
2 0
mZ
Z 2(m+1)
X
= e in(u+2m) g (u)du
2 2m
mZ
2(m+1)
XZ
= e inu g (u)du
2 2m
mZ
Z
= e inu g (u)du
2 R
= g (n)
2
262 / 635
Proof of Bochners theorem: (4/5)
This gives:
X X
g (n)(n) = G (n) (n)
2
nZ nZ
Z
1
= G (t)d (t) (Parceval)
2 T
k kM(T) sup | G (t) |
tT
C sup | G (t) |
tT
C sup | g (x) | + C
xR
with C = (0).
263 / 635
Proof of Bochners theorem: (5/5)
Putting it all together gives:
Z
1
2 g ()()d < C sup | g (x) | + (C + 1)
R xR
2
f ()
Z
1
k f k2K := d < + ,
(2)d Rd ()
f()g ()
Z
1
hf , g i := d
(2)d Rd ()
265 / 635
Proof
H is a Hilbert space: exercise.
For x Rd , Kx (y) = K (x, y) = (x y) therefore:
Z
> >
Kx () = e i u (u x)du = e i x () .
Kx ()f ()
Z Z
1 1
hf , Kx iH = d = f ()e i.x d
(2)d Rd () (2)d Rd
= f (x)
266 / 635
Example
Gaussian kernel
(xy )2
K (x, y ) = e 2 2
corresponds to:
2 2
() = e 2
and Z 2 2 2
H= f : f () e 2 d < .
267 / 635
Example
Laplace kernel
1
K (x, y ) = e | xy |
2
corresponds to:
() =
2 + 2
and ( )
Z 2 2 + 2
H= f : f() d < ,
268 / 635
Example
Low-frequency filter
sin ((x y ))
K (x, y ) =
(x y )
corresponds to:
() = U ( + ) U ( )
and ( Z )
2
H= f : f () d = 0 ,
| |>
269 / 635
Outline
270 / 635
Generalization to semigroups (cf Berg et al., 1983)
Definition
A semigroup (S, ) is a nonempty set S equipped with an
associative composition and a neutral element e.
A semigroup with involution (S, , ) is a semigroup (S, ) together
with a mapping : S S called involution satisfying:
1 (s t) = t s , for s, t S.
2 (s ) = s for s S.
Examples
Any group (G , ) is a semigroup with involution when we define
s = s 1 .
Any abelian semigroup (S, +) is a semigroup with involution when
we define s = s, the identical involution.
271 / 635
Positive definite functions on semigroups
Definition
Let (S, , ) be a semigroup with involution. A function : S R is
called positive definite if the function:
s, t S, K (s, t) = (s t)
is a p.d. kernel on S.
272 / 635
Semicharacters
Definition
A function : S C on an abelian semigroup with involution (S, +, )
is called a semicharacter if
1 (0) = 1,
2 (s + t) = (s)(t) for s, t S,
3 (s ) = (s) for s S.
The set of semicharacters on S is denoted by S .
Remarks
If is the identity, a semicharacter is automatically real-valued.
If (S, +) is an abelian group and s = s, a semicharacter has its
values in the circle group {z C | | z | = 1} and is a group character.
273 / 635
Semicharacters are p.d.
Lemma
Every semicharacter is p.d., in the sense that:
K (s, t) = K (t, s),
Pn
i,j=1 ai aj K (xi , xj ) 0.
Proof
Direct from definition, e.g.,
n
X n
X
ai aj xi + xj = ai aj (xi ) (xj ) 0 .
i,j=1 i,j=1
Examples
(t) = e t on (R, +, Id).
(t) = e it on (R, +, ).
274 / 635
Integral representation of p.d. functions
Definition
An function : S R on a semigroup with involution is called an absolute
value if (i) (e) = 1, (ii)(s t) (s)(t), and (iii) (s ) = (s).
A function f : S R is called exponentially bounded if there exists an
absolute value and a constant C > 0 s.t. | f (s) | C (s) for s S.
Theorem
Let (S, +, ) an abelian semigroup with involution. A function : S R is p.d.
and exponentially bounded (resp. bounded) if and only if it has a representation
of the form: Z
(s) = (s)d() .
S
275 / 635
Proof
Sketch (details in Berg et al., 1983, Theorem 4.2.5)
For an absolute value , the set P1 of -bounded p.d. functions
that satisfy (0) = 1 is a compact convex set whose extreme points
are precisely the -bounded semicharacters.
If is p.d. and exponentially bounded then there exists an absolute
value such that (0)1 P1 .
By the Krein-Milman theorem there exits a Radon probability
measure on P1 having (0)1 as barycentre.
Remarks
The result is not true without the assumption of exponentially
bounded semicharacters.
In the case of abelian groups with s = s this reduces to
Bochners theorem for discrete abelian groups, cf. Rudin (1962).
276 / 635
Example 1: (R+ , +, Id)
Semicharacters
S = (R+ , +, Id) is an abelian semigroup.
2
P.d. functions are nonnegative, because (x) = x .
The set of bounded semicharacters is exactly the set of functions:
s R+ 7 a (s) = e as ,
277 / 635
Example 1: (R+ , +, Id) (cont.)
P.d. functions
By the integral representation theorem for bounded semi-characters
we obtain that a function : R+ R is p.d. and bounded if and
only if it has the form:
Z
(s) = e as d(a) + b (s)
0
278 / 635
Example 2: Semigroup kernels for finite measures (1/6)
Setting
We assume that data to be processed are bags-of-points, i.e., sets
of points (with repeats) of a space U.
Example : a finite-length string as a set of k-mers.
How to define a p.d. kernel between any two bags that only depends
on the union of the bags?
See details and proofs in Cuturi et al. (2005).
279 / 635
Example 2: Semigroup kernels for finite measures (2/6)
Semigroup of bounded measures
We can represent any bag-of-point x as a finite measure on U:
X
x= ai xi ,
i
280 / 635
Example 2: Semigroup kernels for finite measures (3/6)
Semicharacters
For any Borel measurable function f : U R the function
f : Mb+ (U) R defined by:
f () = e [f ]
281 / 635
Example 2: Semigroup kernels for finite measures (4/6)
Corollary
Let U be a Hausdorff space. For any Radon measure Mc+ (C (U))
with compact support on the Hausdorff space of continuous real-valued
functions on U endowed with the topology of pointwise convergence, the
following function K is a continuous p.d. kernel on Mb+ (U) (endowed
with the topology of weak convergence):
Z
K (, ) = e [f ]+[f ] d(f ) .
C (X )
Remarks
The converse is not true: there exist continuous p.d. kernels that do not have
this integral representation (it might include non-continuous semicharacters)
282 / 635
Example 2: Semigroup kernels for finite measures (5/6)
Example : entropy kernel
Let X be the set of probability densities (w.r.t. some reference
measure) on U with finite entropy:
Z
h(x) = x ln x .
U
Remark: only valid for densities (e.g., for a kernel density estimator
from a bag-of-parts)
283 / 635
Example 2: Semigroup kernels for finite measures (6/6)
Examples : inverse generalized variance kernel
Let U = Rd and MV + (U) be the set of finite measure with second
order moment and non-singular variance
h i
() = xx > [x] [x]> .
284 / 635
Application of semigroup kernel
Weighted linear PCA of two different measures, with the first PC shown.
Variances captured by the first and second PC are shown. The
generalized variance kernel is the inverse of the product of the two values.
285 / 635
Kernelization of the IGV kernel
Motivations
Gaussian distributions may be poor models.
The method fails in large dimension
Solution
1 Regularization:
1
K , 0 =
0
.
det +
2 + I d
2 Kernel trick: the non-zero eigenvalues of UU > and U > U are the
same = replace the covariance matrix by the centered Gram
matrix (technical details in Cuturi et al., 2005).
286 / 635
Illustration of kernel IGV kernel
287 / 635
Semigroup kernel remarks
Motivations
A very general formalism to exploit an algebraic structure of the
data.
Kernel IVG kernel has given good results for character recognition
from a subsampled image.
The main motivation is more generally to develop kernels for
complex objects from which simple patches can be extracted.
The extension to nonabelian groups (e.g., permutation in the
symmetric group) might find natural applications.
288 / 635
Kernel examples: Summary
Many notions of smoothness can be translated as RKHS norms for
particular kernels (eigenvalues convolution operator, Sobolev norms
and Green operators, Fourier transforms...).
There is no uniformly best kernel, but rather a large toolbox of
methods and tricks to encode prior knowledge and exploit the
nature or structure of the data.
In the following sections we focus on particular data and
applications to illustrate the process of kernel design.
289 / 635
Outline
2 Kernel tricks
291 / 635
Motivation
Kernel methods are sometimes criticized for their lack of flexibility: a
large effort is spent in designing by hand the kernel.
Question
How do we design a kernel adapted to the data?
Answer
A successful strategy is given by kernels for generative models, which
are/have been the state of the art in many fields, including
representation of image and sequence data representation.
Parametric model
A model is a family of distributions
{P , Rm } M+
1 (X ) .
291 / 635
Outline
292 / 635
Fisher kernel
Definition
Fix a parameter 0 (obtained for instance by maximum
likelihood over a training set).
For each sequence x, compute the Fisher score vector:
293 / 635
Fisher kernel properties (1/2)
The Fisher score describes how each parameter contributes to the
process of generating a particular example
A kernel classifier employing the Fisher kernel derived from a model
that contains the label as a latent variable is, asymptotically, at least
as good as the MAP labelling based on the model (Jaakkola and
Haussler, 1999).
A variant of the Fisher kernel (called the Tangent of Posterior
kernel) can also improve over the direct posterior classification by
helping to correct the effect of estimation errors in the parameter
(Tsuda et al., 2002).
294 / 635
Fisher kernel properties (2/2)
Lemma
The Fisher kernel is invariant under change of parametrization.
295 / 635
Fisher kernel in practice
296 / 635
Fisher kernels: example with Gaussian data model (1/2)
Consider a normal distribution N (, 2 ) and denote by = 1/ 2 the
inverse variance, i.e., precision parameter. With = (, ), we have
1 1 1
log P (x) = log log(2) (x )2 ,
2 2 2
and thus
log P (x) log P (x) 1 1 2
= (x ), = (x ) ,
2
and (exercise)
0
I() = .
0 (1/2)2
The Fisher vector is then
(x) = (x )/ .
(1/ 2)(1 (x )2 / 2 )
297 / 635
Fisher kernels: example with Gaussian data model (2/2)
Now consider an i.i.d. data model over a set of data points x1 , . . . , xn all
distributed according to N (, 2 ):
n
Y
P (x1 , . . . , xn ) = P (xi ).
i=1
Then, the Fisher vector is given by the sum of Fisher vectors of the
points.
Encodes the discrepancy in the first and second order moment of
the data w.r.t. those of the model.
n
X ( )/
(x1 , . . . , xn ) = (xi ) = n ,
( 2 2 )/( 2 2 )
i=1
where
n n
1X 1X
= xi and = (xi )2 .
n n
i=1 i=1
298 / 635
Application: Aggregation of visual words (1/4)
Patch extraction and description stage:
In various contexts, images may be described as a set of
patches x1 , . . . , xn computed at interest points. For example, SIFT,
HOG, LBP, color histograms, convolutional features...
Coding stage: The set of patches is then encoded into a single
representation (xi ), typically in a high-dimensional space.
Pooling stage: For example, sum pooling
n
X
(x1 , . . . , xn ) = (xi ).
i=1
299 / 635
Application: Aggregation of visual words (2/4)
Let = (j , j , j )j=1 ...,k be the parameters of a GMM with k Gaussian
components. Then, the probabilistic model is given by
k
X
P (x) = j N (x; j , j ).
j=1
Remarks
Each mixture component corresponds to a visual word, with a mean,
variance, and mixing weight.
Diagonal covariances j = diag (j1 , . . . , jp ) = diag ( j ) are often
used for simplicity.
This is a richer model than the traditional bag of words approach.
The probabilistic model is learned offline beforehand.
300 / 635
Application: Aggregation of visual words (3/4)
After cumbersome calculations (exercise), we obtain (x1 , . . . , xn ) =
with
n
1 X
j (X) = ij (xi j )/ j
n j
i=1
n
1 X
ij (xi j )2 / 2j 1 ,
j (X) = p
n 2j i=1
301 / 635
Application: Aggregation of visual words (4/4)
Finally, we also have the following interpretation of encoding first and
second-order statistics:
j
j (X) = (j j )/ j
j
j
j (X) = p ( 2j 2j )/ 2j ,
2j
with
n n n
X 1 X 1 X
j = ij and j = ij xi and j = ij (xi j )2 .
j j
i=1 i=1 i=1
302 / 635
Relation to classification with generative models (1/3)
Assume that we have a generative probabilistic model P to model
random variables (X , Y ) where Y is a label in {1, . . . , p}.
Assume that the marginals P (Y = k) = k are among the model
parameters , which we can also parametrize as
e k
P (Y = k) = k = Pp k 0
.
k 0 =1 e
303 / 635
Relation to classification with generative models (2/3)
Then, consider the Fisher score
1
log P (x) = P (x)
P (x)
p
1 X
= P (x, Y = k)
P (x)
k=1
p
1 X
= P (x, Y = k) log P (x, Y = k)
P (x)
k=1
p
X
= P (Y = k|x)[ log k + log P (x|Y = k)].
k=1
In particular (exercise)
log P (x)
= P (Y = k|x) k .
k
304 / 635
Relation to classification with generative models (3/3)
The first p elements in the Fisher score are given by class posteriors
minus a constant
Bayes rule is implemented via this simple classifier using Fisher kernel.
305 / 635
Outline
306 / 635
Mutual information kernels
Definition
Chose a prior w (d) on the measurable set .
Form the kernel (Seeger, 2002):
Z
0
P (x)P (x0 )w (d) .
K x, x =
(x) = (P (x)) .
307 / 635
Example: coin toss
Let P (X = 1) = and P (X = 0) = 1 a model for random coin
toss, with [0, 1].
Let d be the Lebesgue measure on [0, 1]
The mutual information kernel between x = 001 and x0 = 1010 is:
(
P (x) = (1 )2 ,
P (x0 ) = 2 (1 )2 ,
Z 1
3!4! 1
0
3 (1 )4 d =
K x, x = = .
0 8! 280
308 / 635
Outline
309 / 635
Marginalized kernels
Definition
For any observed data x X , let a latent variable y Y be
associated probabilistically through a conditional probability Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then, the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):
310 / 635
Marginalized kernels: proof of positive definiteness
KZ is p.d. on Z. Therefore, there exists a Hilbert space H and
Z : Z H such that:
KZ z, z0 = Z (z) , Z z0 H .
therefore KX is p.d. on X .
311 / 635
Marginalized kernels: proof of positive definiteness
KZ is p.d. on Z. Therefore, there exists a Hilbert space H and
Z : Z H such that:
KZ z, z0 = Z (z) , Z z0 H .
therefore KX is p.d. on X .
Of course, we make the right assumptions such that each operation
above is valid, and all quantities are well defined.
311 / 635
Outline
2 Kernel tricks
313 / 635
Short history of genomics
314 / 635
A cell
315 / 635
Chromosomes
316 / 635
Chromosomes and DNA
317 / 635
Structure of DNA
We wish to suggest a
structure for the salt of
desoxyribose nucleic acid
(D.N.A.). This structure have
novel features which are of
considerable biological
interest (Watson and Crick,
1953)
318 / 635
The double helix
319 / 635
Central dogma
320 / 635
Proteins
321 / 635
Genetic code
322 / 635
Human genome project
Goal : sequence the 3,000,000,000 bases of the human genome
Consortium with 20 labs, 6 countries
Cost : between 0.5 and 1 billion USD
323 / 635
2003: End of genomics era
Findings
About 25,000 genes only (representing 1.2% of the genome).
Automatic gene finding with graphical models.
97% of the genome is considered junk DNA.
Superposition of a variety of signals (many to be discovered).
324 / 635
Cost of human genome sequencing
325 / 635
Protein sequence
326 / 635
Challenges with protein sequences
A protein sequences can be seen as a variable-length sequence over
the 20-letter alphabet of amino-acids, e.g., insuline:
FVNQHLCGSHLVEALYLVCGERGFFYTPKA
These sequences are produced at a fast rate (result of the
sequencing programs)
Need for algorithms to compare, classify, analyze these sequences
Applications: classification into functional or structural classes,
prediction of cellular localization and interactions, ...
327 / 635
Example: supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Goal
Build a classifier to predict whether new proteins are secreted or not.
328 / 635
Supervised classification with vector embedding
The idea
Map each string x X to a vector (x) F.
Train a classifier for vectors on the images (x1 ), . . . , (xn ) of the
training set (nearest neighbor, linear perceptron, logistic regression,
support vector machine...)
X F
maskat...
msises
marssl...
malhtv...
mappsv...
mahtlg...
329 / 635
Kernels for protein sequences
Kernel methods have been widely investigated since Jaakkola et al.s
seminal paper (1998).
What is a good kernel?
it should be mathematically valid (symmetric, p.d. or c.p.d.)
fast to compute
adapted to the problem (gives good performances)
330 / 635
Kernel engineering for protein sequences
331 / 635
Kernel engineering for protein sequences
331 / 635
Kernel engineering for protein sequences
331 / 635
Outline
332 / 635
Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
(x) Rn . How to perform this embedding?
333 / 635
Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
(x) Rn . How to perform this embedding?
Physico-chemical kernel
Extract relevant features, such as:
length of the sequence
time series analysis of numerical physico-chemical properties of
amino-acids along the sequence (e.g., polarity, hydrophobicity),
using for example:
Fourier transforms (Wang et al., 2004)
Autocorrelation functions (Zhang et al., 2003)
nj
1 X
rj = hi hi+j
nj
i=1
333 / 635
Substring indexation
The approach
Alternatively, index the feature space by fixed-length strings, i.e.,
(x) = (u (x))uAk
334 / 635
Example: Spectrum kernel (1/4)
Kernel definition
The 3-spectrum of
x = CGGSLIAMMWFGV
is:
(CGG,GGS,GSL,SLI,LIA,IAM,AMM,MMW,MWF,WFG,FGV) .
Let u (x) denote the number of occurrences of u in x. The
k-spectrum kernel is:
X
K x, x0 := u (x) u x0 .
uAk
335 / 635
Example: Spectrum kernel (2/4)
Implementation
The computation of the kernel is formally a sum over |A|k terms,
but at most | x | k + 1 terms are non-zero in (x) =
Computation in O (| x | + | x0 |) with pre-indexation of the strings.
Fast classification of a sequence x in O (| x |):
| x |k+1
X X
f (x) = w (x) = wu u (x) = wxi ...xi+k1 .
u i=1
Remarks
Work with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
336 / 635
Example: Spectrum kernel (3/4)
If pre-indexation is not possible: retrieval tree (trie)
337 / 635
Example: Spectrum kernel (4/4)
If pre-indexation is not possible: use a prefix tree
The complexity for computing K (x, x0 ) becomes O(|x| + |x0 |), but with a
larger constant than with pre-indexation.
338 / 635
Example 2: Substring kernel (1/12)
Definition
For 1 k n N, we denote by I(k, n) the set of sequences of
indices i = (i1 , . . . , ik ), with 1 i1 < i2 < . . . < ik n.
For a string x = x1 . . . xn X of length n, for a sequence of indices
i I(k, n), we define a substring as:
l (i) = ik i1 + 1.
339 / 635
Example 2: Substring kernel (2/12)
Example
ABRACADABRA
i = (3, 4, 7, 8, 10)
x (i) =RADAR
l (i) = 10 3 + 1 = 8
340 / 635
Example 2: Substring kernel (3/12)
The kernel
Let k N and R+ fixed. For all u Ak , let u : X R be
defined by:
X
x X , u (x) = l(i) .
iI(k,| x |): x(i)=u
uAk
341 / 635
Example 2: Substring kernel (4/12)
Example
u ca ct at ba bt cr ar br
u (cat) 2 3 2 0 0 0 0 0
u (car) 2 0 0 0 0 3 2 0
u (bat) 0 0 2 2 3 0 0 0
u (bar) 0 0 0 2 0 0 2 3
4 6
K (cat,cat) = K (car,car) = 2 +
K (cat,car) = 4
K (cat,bar) = 0
342 / 635
Example 2: Substring kernel (5/12)
Kernel computation
We need to compute, for any pair x, x0 X , the kernel:
X
Kk, x, x0 = u (x) u x0
uAk
0
X X X
= l(i)+l(i ) .
uAk i:x(i)=u i0 :x0 (i0 )=u
343 / 635
Example 2: Substring kernel (6/12)
Kernel computation (cont.)
For u Ak remember that:
X
u (x) = ik i1 +1 .
i:x(i)=u
Let now: X
u (x) = | x |i1 +1 .
i:x(i)=u
344 / 635
Example 2: Substring kernel (7/12)
Kernel computation (cont.)
Let us note x[1,j] = x1 . . . xj . A simple rewriting shows that, if we note
a A the last letter of u (u = va):
X
va (x) = v x[1,j1] ,
j[1,| x |]:xj =a
and X
v x[1,j1] | x |j+1 .
va (x) =
j[1,| x |]:xj =a
345 / 635
Example 2: Substring kernel (8/12)
Kernel computation (cont.)
Moreover we observe that if the string is of the form xa (i.e., the last
letter is a A), then:
If the last letter of u is not a:
(
u (xa) = u (x) ,
u (xa) = u (x) .
346 / 635
Example 2: Substring kernel (9/12)
Kernel computation (cont.)
Let us now show how the function:
X
Bk x, x0 := u (x) u x0
uAk
uAk
347 / 635
Example 2: Substring kernel (10/12)
Recursive computation of Bk
Bk xa, x0
X
u (xa) u x0
=
uAk
X X
u (x) u x0 + v (x) va x0
=
uAk vAk1
0
= Bk x, x +
X X 0
v (x) v x0[1,j1] | x |j+1
j[1,| x0 |]:xj0 =a
348 / 635
Example 2: Substring kernel (11/12)
Recursive computation of Bk
Bk xa, x0 b
X 0
= Bk x, x0 b + Bk1 x, x0[1,j1] | x |j+2
j[1,| x0 |]:xj0 =a
349 / 635
Example 2: Substring kernel (12/12)
Recursive computation of Kk
Kk xa, x0
X
u (xa) u x0
=
uAk
X X
u (x) u x0 + v (x) va x0
=
uAk vAk1
0
= Kk x, x +
X X
v (x) v x0[1,j1]
vAk1 j[1,| x0 |]:xj0 =a
X
= Kk x, x0 + 2 Bk1 x, x0[1,j1]
j[1,| x0 |]:xj0 =a
350 / 635
Summary: Substring indexation
351 / 635
Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping D (x) = (s (x, xi ))xi D
352 / 635
Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping D (x) = (s (x, xi ))xi D
Examples
This includes:
Motif kernels (Logan et al., 2001): the dictionary is a library of
motifs, the similarity function is a matching function
Pairwise kernel (Liao & Noble, 2003): the dictionary is the training
set, the similarity is a classical measure of similarity between
sequences.
352 / 635
Outline
353 / 635
Probabilistic models for sequences
Probabilistic modeling of biological sequences is older than kernel
designs. Important models include HMM for protein sequences, SCFG for
RNA sequences.
{P , Rm } M+
1 (X )
354 / 635
Context-tree model
Definition
A context-tree model is a variable-memory Markov chain:
n
Y
PD, (x) = PD, (x1 . . . xD ) PD, (xi | xiD . . . xi1 )
i=D+1
D is a suffix tree
D is a set of conditional probabilities (multinomials)
355 / 635
Context-tree model: example
356 / 635
The context-tree kernel
Theorem (Cuturi et al., 2005)
For particular choices of priors, the context-tree kernel:
Z
0
X
K x, x = PD, (x)PD, (x0 )w (d|D)(D)
D D
357 / 635
Marginalized kernels
Recall: Definition
For any observed data x X , let a latent variable y Y be
associated probabilistically through a conditional probability Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):
358 / 635
Example: HMM for normal/biased coin toss
0.85
N 0.05
0.5
0.1 E Normal (N) and biased (B)
S 0.1 coins (not observed)
B
0.5 0.05
0.85
Observed output are 0/1 with probabilities:
(
(0|N) = 1 (1|N) = 0.5,
(0|B) = 1 (1|B) = 0.2.
(a,s)AS
360 / 635
1-spectrum marginalized kernel on observed data
The marginalized kernel for observed data is:
X
KX x, x0 = KZ (x, y) , x0 , y0 P (y|x) P y0 |x0
y,y0 S
X
a,s (x) a,s x0 ,
=
(a,s)AS
with X
a,s (x) = P (y|x) na,s (x, y)
yS
361 / 635
Computation of the 1-spectrum marginalized kernel
X
a,s (x) = P (y|x) na,s (x, y)
yS
( n )
X X
= P (y|x) (xi , a) (yi , s)
yS i=1
n
X X
= (xi , a) P (y|x) (yi , s)
i=1 yS
n
X
= (xi , a) P (yi = s|x) .
i=1
362 / 635
HMM example (DNA)
363 / 635
HMM example (protein)
364 / 635
SCFG for RNA sequences
SFCG rules
S SS
S aSa
S aS
S a
365 / 635
Marginalized kernels in practice
Examples
Spectrum kernel on the hidden states of a HMM for protein
sequences (Tsuda et al., 2002)
Kernels for RNA sequences based on SCFG (Kin et al., 2002)
Kernels for graphs based on random walks on graphs (Kashima et
al., 2004)
Kernels for multiple alignments based on phylogenetic models (Vert
et al., 2006)
366 / 635
Marginalized kernels: example
367 / 635
Outline
368 / 635
Sequence alignment
Motivation
How to compare 2 sequences?
x1 = CGGSLIAMMWFGV
x2 = CLIVMMNRLMWFGV
CGGSLIAMM------WFGV
|...|||||....||||
C-----LIVMMNRLMWFGV
369 / 635
Alignment score
In order to quantify the relevance of an alignment , define:
a substitution matrix S RAA
a gap penalty function g : N R
Any alignment is then scored as follows
CGGSLIAMM------WFGV
|...|||||....||||
C----LIVMMNRLMWFGV
370 / 635
Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g ().
(x,y)
371 / 635
Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g ().
(x,y)
371 / 635
LA kernel is p.d.: proof (1/11)
Lemma
If K1 and K2 are p.d. kernels, then:
K1 + K2 ,
K1 K2 , and
cK1 , for c 0,
x, x0 X 2 , K x, x0 = lim Ki x, x0 ,
n
372 / 635
LA kernel is p.d.: proof (2/11)
Proof of lemma
Let A and B be n n positive semidefinite matrices. By diagonalization
of A:
Xn
Ai,j = fp (i)fp (j)
p=1
The matrix Ci,j = Ai,j Bi,j is therefore p.d. Other properties are obvious
from definition.
373 / 635
LA kernel is p.d.: proof (3/11)
Lemma (direct sum and product of kernels)
Let X = X1 X2 . Let K1 be a p.d. kernel on X1 , and K2 be a p.d.
kernel on X2 . Then the following functions are p.d. kernels on X :
the direct sum,
374 / 635
LA kernel is p.d.: proof (4/11)
Proof of lemma
If K1 is a p.d. kernel, let 1 : X1 7 H be such that:
((x1 , x2 )) = 1 (x1 ) .
375 / 635
LA kernel is p.d.: proof (5/11)
Lemma: kernel for sets
Let K be a p.d. kernel on X , and let P (X ) be the set of finite subsets of
X . Then the function KP on P (X ) P (X ) defined by:
XX
A, B P (X ) , KP (A, B) := K (x, y)
xA yB
is a p.d. kernel on P (X ).
376 / 635
LA kernel is p.d.: proof (6/11)
Proof of lemma
Let : X 7 H be such that
377 / 635
LA kernel is p.d.: proof (7/11)
Definition: Convolution kernel (Haussler, 1999)
Let K1 and K2 be two p.d. kernels for strings. The convolution of K1
and K2 , denoted K1 ? K2 , is defined for any x, x0 X by:
X
K1 ? K2 (x, y) := K1 (x1 , y1 )K2 (x2 , y2 ).
x1 x2 =x,y1 y2 =y
Lemma
If K1 and K2 are p.d. then K1 ? K2 is p.d..
378 / 635
LA kernel is p.d.: proof (8/11)
Proof of lemma
Let X be the set of finite-length strings. For x X , let
R (x) = {(x1 , x2 ) X X : x = x1 x2 } X X .
379 / 635
LA kernel is p.d.: proof (9/11)
3 basic string kernels
The constant kernel:
K0 (x, y) := 1 .
380 / 635
LA kernel is p.d.: proof (10/11)
Remark
S : A2 R is the similarity function between letters used in the
()
alignment score. Ka is only p.d. when the matrix:
381 / 635
LA kernel is p.d.: proof (11/11)
Lemma
The local alignment kernel is a (limit) of convolution kernel:
() (n1)
() () ()
X
KLA = K0 ? Ka ? Kg ? Ka ? K0 .
n=0
As such it is p.d..
Proof (sketch)
By induction on n (simple but long to write).
See details in Vert et al. (2004).
382 / 635
LA kernel computation
We assume an affine gap penalty:
(
g (0) = 0,
g (n) = d + e(n 1) si n 1,
where M(i, j), X (i, j), Y (i, j), X2 (i, j), and Y2 (i, j) for 0 i |x|,
and 0 j |y| are defined recursively.
383 / 635
LA kernel is p.d.: proof (/)
Initialization
M(i, 0) = M(0, j) = 0,
X (i, 0) = X (0, j) = 0,
Y (i, 0) = Y (0, j) = 0,
X2 (i, 0) = X2 (0, j) = 0,
Y2 (i, 0) = Y2 (0, j) = 0,
384 / 635
LA kernel is p.d.: proof (/)
Recursion
For i = 1, . . . , |x| and j = 1, . . . , |y|:
h
M(i, j) = exp(S(x ,
i j y )) 1 + X (i 1, j 1)
i
+Y (i 1, j 1) + M(i 1, j 1) ,
X (i, j) = exp(d)M(i 1, j) + exp(e)X (i 1, j),
385 / 635
LA kernel in practice
X1 a:0/D X X2 0:0/1
a:0/1 a:b/m(a,b)
0:b/D a:0/1
a:b/m(a,b) 0:a/1
B a:b/m(a,b) M 0:0/1 E
0:a/1
a:b/m(a,b)
0:0/1
a:b/m(a,b) 0:a/1
0:a/1 0:b/D
Y1 Y Y2
0:0/1
386 / 635
Outline
387 / 635
Remote homology
gs
gs
o
on
o
ol
ol
m
tz
m
ho
gh
ho
se
ili
on
lo
Tw
N
C
Sequence similarity
388 / 635
SCOP database
SCOP
Fold
Superfamily
Family
Remote homologs Close homologs
389 / 635
A benchmark experiment
Goal: recognize directly the superfamily
Training: for a sequence of interest, positive examples come from
the same superfamily, but different families. Negative from other
superfamilies.
Test: predict the superfamily.
390 / 635
Difference in performance
60
SVM-LA
SVM-pairwise
SVM-Mismatch
No. of families with given performance
50 SVM-Fisher
40
30
20
10
0
0 0.2 0.4 0.6 0.8 1
ROC50
391 / 635
String kernels: Summary
A variety of principles for string kernel design have been proposed.
Good kernel design is important for each data and each task.
Performance is not the only criterion.
Still an art, although principled ways have started to emerge.
Fast implementation with string algorithms is often possible.
Their application goes well beyond computational biology.
392 / 635
Outline
2 Kernel tricks
394 / 635
Virtual screening for drug discovery
active
inactive
active
inactive
inactive
active
396 / 635
Our approach
397 / 635
Our approach
1 Represent each graph x in X by a vector (x) H, either explicitly
or implicitly through the kernel
X H
397 / 635
Our approach
1 Represent each graph x in X by a vector (x) H, either explicitly
or implicitly through the kernel
397 / 635
Outline
398 / 635
The approach
1 Represent explicitly each graph x by a vector of fixed dimension
(x) Rp .
X H
399 / 635
The approach
1 Represent explicitly each graph x by a vector of fixed dimension
(x) Rp .
2 Use an algorithm for regression or pattern recognition in Rp .
X H
399 / 635
Example
2D structural keys in chemoinformatics
Index a molecule by a binary fingerprint defined by a limited set of
predefined structures
N N N
O O O O O
O
N
400 / 635
Challenge: which descriptors (patterns)?
N N N
O O O O O
O
N
401 / 635
Indexing by substructures
N N N
O O O O O
O
N
402 / 635
Subgraphs
Definition
A subgraph of a graph (V , E ) is a graph (V 0 , E 0 ) with V 0 V and
E0 E.
403 / 635
Indexing by all subgraphs?
404 / 635
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
404 / 635
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
Proof
The linear graph of size n is a subgraph of a graph X with n vertices
iff X has a Hamiltonian path;
The decision problem whether a graph has a Hamiltonian path is
NP-complete.
404 / 635
Paths
Definition
A path of a graph (V , E ) is a sequence of distinct vertices
v1 , . . . , vn V (i 6= j = vi 6= vj ) such that (vi , vi+1 ) E for
i = 1, . . . , n 1.
Equivalently the paths are the linear subgraphs.
405 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
406 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Theorem
Computing all path occurrences is NP-hard.
406 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Theorem
Computing all path occurrences is NP-hard.
Proof
Same as for subgraphs.
406 / 635
Indexing by what?
Substructure selection
We can imagine more limited sets of substructures that lead to more
computationnally efficient indexing (non-exhaustive list)
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest path lengths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al.,
2009)
all frequent subgraphs in the database (Helma et al., 2004)
407 / 635
Example: Indexing by all shortest path lengths and their
endpoint labels
A 3 B
A A
B (0,...,0,2,0,...,0,1,0,...)
B A
A 1 A A 3 A
408 / 635
Example: Indexing by all shortest path lengths and their
endpoint labels
A 3 B
A A
B (0,...,0,2,0,...,0,1,0,...)
B A
A 1 A A 3 A
408 / 635
Example: Indexing by all subgraphs up to k vertices
409 / 635
Example: Indexing by all subgraphs up to k vertices
409 / 635
Summary
Explicit computation of substructure occurrences can be
computationnally prohibitive (subgraphs, paths);
Several ideas to reduce the set of substructures considered;
In practice, NP-hardness may not be so prohibitive (e.g., graphs
with small degrees), the strategy followed should depend on the
data considered.
410 / 635
Outline
411 / 635
The idea
412 / 635
The idea
1 Represent implicitly each graph x in X by a vector (x) H
through the kernel
X H
412 / 635
The idea
1 Represent implicitly each graph x in X by a vector (x) H
through the kernel
X H
412 / 635
Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:
G1 , G2 X , dK (G1 , G2 ) = 0 = G1 ' G2 .
413 / 635
Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:
G1 , G2 X , dK (G1 , G2 ) = 0 = G1 ' G2 .
413 / 635
Complexity of complete kernels
Proposition (Gartner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.
414 / 635
Complexity of complete kernels
Proposition (Gartner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.
Proof
For any kernel K the complexity of computing dK is the same as the
complexity of computing K , because:
414 / 635
Subgraph kernel
Definition
Let (G )G X be a set or nonnegative real-valued weights
For any graph G X and any connected graph H X , let
H (G ) = G 0 is a subgraph of G : G 0 ' H .
415 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.
416 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.
Proof (1/2)
Let Pn be the path graph with n vertices.
Subgraphs of Pn are path graphs:
Proof (2/2)
If G is a graph with n vertices, then it has a path that visits each
node exactly once (Hamiltonian path) if and only if (G )> ePn > 0,
i.e.,
n n
!
X X
>
(G ) i (Pi ) = i Ksubgraph (G , Pi ) > 0 .
i=1 i=1
417 / 635
Path kernel
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = H H (G1 )H (G2 ) ,
HP
418 / 635
Path kernel
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = H H (G1 )H (G2 ) ,
HP
418 / 635
Summary
Expressiveness vs Complexity trade-off
It is intractable to compute complete graph kernels.
It is intractable to compute the subgraph kernels.
Restricting subgraphs to be linear does not help: it is also
intractable to compute the path kernel.
One approach to define polynomial time computable graph kernels is
to have the feature space be made up of graphs homomorphic to
subgraphs, e.g., to consider walks instead of paths.
419 / 635
Outline
420 / 635
Walks
Definition
A walk of a graph (V , E ) is sequence of v1 , . . . , vn V such that
(vi , vi+1 ) E for i = 1, . . . , n 1.
We note Wn (G ) the set of walks with n vertices of the graph G ,
and W(G ) the set of all walks.
etc...
421 / 635
Walks 6= paths
422 / 635
Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = n1 Sn .
For any graph X let a weight G (w ) be associated to each walk
w W(G ).
Let the feature vector (G ) = (s (G ))sS be defined by:
X
s (G ) = G (w )1 (s is the label sequence of w ) .
w W(G )
423 / 635
Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = n1 Sn .
For any graph X let a weight G (w ) be associated to each walk
w W(G ).
Let the feature vector (G ) = (s (G ))sS be defined by:
X
s (G ) = G (w )1 (s is the label sequence of w ) .
w W(G )
423 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
424 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with G (w ) = PG (w ), where
PG is a Markov random walk on G . In that case we have:
424 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with G (w ) = PG (w ), where
PG is a Markov random walk on G . In that case we have:
424 / 635
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can
be computed efficiently in polynomial time.
425 / 635
Product graph
Definition
Let G1 = (V1 , E1 ) and G2 = (V2 , E2 ) be two graphs with labeled vertices.
The product graph G = G1 G2 is the graph G = (V , E ) with:
1 V = {(v1 , v2 ) V1 V2 : v1 and v2 have the same label} ,
2 E = {((v1 , v2 ), (v10 , v20 )) V V : (v1 , v10 ) E1 and (v2 , v20 ) E2 }.
1 a b 1b 2a 1d
c 3c 3e
2
1a 2b 2d
3 4 d e
4c 4e
G1 G2 G1 x G2
426 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).
427 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).
Corollary
X
Kwalk (G1 , G2 ) = s (G1 )s (G2 )
sS
X
= G1 (w1 )G2 (w2 )1(l(w1 ) = l(w2 ))
(w1 ,w2 )W(G1 )W(G1 )
X
= G1 G2 (w ) .
w W(G1 G2 )
427 / 635
Computation of the nth-order walk kernel
428 / 635
Computation of random and geometric walk kernels
429 / 635
Extensions 1: Label enrichment
Atom relabeling with the Morgan index (Mahe et al., 2004)
1 2 4
1 1 2 2 4 5
1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5
430 / 635
Extension 2: Non-tottering walk kernel
Tottering walks
A tottering walk is a walk w = v1 . . . vn with vi = vi+2 for some i.
Nontottering
Tottering
431 / 635
Computation of the non-tottering walk kernel (Mahe et al.,
2005)
Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented graph
Normal walk kernel on the augmented graph (which is always a
directed graph).
432 / 635
Extension 3: Subtree kernels
.
.
. C C
C .
N
N O
.
.
C
O N C C N
. N O
.
.
N C
N N C C C
.
.
.
434 / 635
Computation of the subtree kernel (Ramon and Gartner,
2003; Mahe and Vert, 2009)
435 / 635
Back to label enrichment
Link between the Morgan index and subtrees
Recall the Morgan index:
1 2 4
1 1 2 2 4 5
1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5
2
1
1 3
2 3 6
6 4
1 3 1 2 4 5 1 5
5
436 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a
j g
3 Relabeling
i h
c d
f f
e
b
437 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a
j g
3 Relabeling
i h
c d
f f
e
b
e b b e
d c d c (1)
WLsubtree(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)
a b c d e f g h i j k l m
a a a b (1)
WLsubtree(G) = ( 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)
m h i m a b c d e f g h i j k l m
k j l j Counts of Counts of
original compressed
G G
f f f g node labels node labels
(1) (1) (1)
KWLsubtree(G,G)=<WLsubtree(G), WLsubtree(G)>=11.
Properties
The WL features up to the k-th order are computed in O(|E |k).
Similarly to the Morgan index, the WL relabeling can be exploited in
combination with any graph kernel (that takes into account
categorical node labels) to make it more expressive (Shervashidze et
al., 2011).
438 / 635
Outline
439 / 635
Application in chemoinformatics (Mahe et al., 2005)
MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity /no mutagenic activity, assayed in
Salmonella typhimurium.
188 compounds: 125 + / 63 -
Results
10-fold cross-validation accuracy
Method Accuracy
Progol1 81.4%
2D kernel 91.2%
440 / 635
AUC
70 72 74 76 78 80
CCRFCEM
HL60(TB)
K562
MOLT4
Walks
RPMI8226
Subtrees
SR
A549/ATCC
EKVX
HOP62
HOP92
NCIH226
NCIH23
NCIH322M
NCIH460
NCIH522
COLO_205
HCC2998
HCT116
HCT15
HT29
KM12
SW620
SF268
2D subtree vs walk kernels
SF295
SF539
SNB19
SNB75
U251
LOX_IMVI
MALME3M
M14
SKMEL2
SKMEL28
SKMEL5
UACC257
UACC62
IGROV1
OVCAR3
OVCAR4
OVCAR5
OVCAR8
442 / 635
Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination
(M).
0.12
0.11
0.1
Test error
0.09
0.08
0.07
0.06
0.05
H W TW wTW M
Kernels
443 / 635
Summary: graph kernels
What we saw
Kernels do not allow to overcome the NP-hardness of subgraph
patterns.
They allow to work with approximate subgraphs (walks, subtrees) in
infinite dimension, thanks to the kernel trick.
However: using kernels makes it difficult to come back to patterns
after the learning stage.
444 / 635
Outline
2 Kernel tricks
446 / 635
Graphs
Motivation
Data often come in the form of nodes in a graph for different reasons:
by definition (interaction network, internet...)
by discretization/sampling of a continuous domain
by convenience (e.g., if only a similarity function is available)
447 / 635
Example: web
448 / 635
Example: social network
449 / 635
Example: protein-protein interaction
450 / 635
Kernel on a graph
451 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .
452 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .
How to translate the graph topology into the kernel?
Direct geometric approach: Ki,j should be large when xi and xj are
close to each other on the graph?
Functional approach: k f kK should be small when f is smooth
on the graph?
Link discrete/continuous: is there an equivalent to the continuous
Gaussian kernel on the graph (e.g., limit by fine discretization)?
452 / 635
Outline
453 / 635
Conditionally p.d. kernels
Hilbert distance
Any p.d. kernel is an inner product in a Hilbert space
K x, x0 = (x) , x0 H .
454 / 635
Example
A direct approach
For X = Rn , the inner product is p.d.:
K (x, x0 ) = x> x0 .
455 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.
456 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.
Lemma
In general graphs cannot be embedded exactly in Hilbert spaces.
In some cases exact embeddings exist, e.g.:
trees can be embedded exactly,
closed chains can be embedded exactly.
456 / 635
Example: non-c.p.d. graph distance
1 3 5
4
0 1 1 1 2
1 0 2 2 1
dG =
1 2 0 2 1
1 2 2 0 1
2 1 1 1 0
h i
min e (0.2dG (i,j)) = 0.028 < 0 .
457 / 635
Graph distances on trees are c.p.d.
Proof
Let G = (V , E ) be a tree;
Fix a root x0 V ;
Represent any vertex x V by a vector (x) R|E | , where
(x)i = 1 if the i-th edge is part of the (unique) path between x
and x0 , 0 otherwise.
Then
dG (x, x 0 ) = k (x) (x 0 ) k2 ,
and therefore dG is c.p.d., in particular exp(tdG (x, x 0 )) is p.d.
for all t > 0.
458 / 635
Example
1
3 5
4
2
1 0.14 0.37 0.14 0.05
h i 0.14 1 0.37 0.14 0.05
dG (i,j)
e =
0.37 0.37 1 0.37 0.14
0.14 0.14 0.37 1 0.37
0.05 0.05 0.14 0.37 1
459 / 635
Graph distances on closed chains are c.p.d.
Proof: case |V| = 2p
Let G = (V, E) be a directed cycle with an even number of vertices |V| = 2p.
Fix a root $x_0 \in V$ and number the 2p edges from $x_0$ to $x_0$;
Label the 2p edges with $e_1, \dots, e_p, -e_1, \dots, -e_p$ (vectors in $\mathbb R^p$);
For a vertex v, take $\Phi(v)$ to be the sum of the labels of the edges in the shortest directed path between $x_0$ and v.
460 / 635
Outline
461 / 635
Functional approach
Motivation
How to design a p.d. kernel on general graphs?
Designing a kernel is equivalent to defining an RKHS.
There are intuitive notions of smoothness on a graph.
Idea
Define a priori a smoothness functional on the functions $f : \mathcal X \to \mathbb R$;
Show that it defines an RKHS and identify the corresponding kernel.
462 / 635
Notations
$\mathcal X = (x_1, \dots, x_m)$ is finite.
For $x, x' \in \mathcal X$, we write $x \sim x'$ to indicate the existence of an edge between x and x'.
We assume that there is no self-loop $x \sim x$, and that there is a single connected component.
The adjacency matrix is $A \in \mathbb R^{m \times m}$:
$$A_{i,j} = \begin{cases} 1 & \text{if } i \sim j,\\ 0 & \text{otherwise.} \end{cases}$$
D is the diagonal matrix where $D_{i,i}$ is the number of neighbors of $x_i$ ($D_{i,i} = \sum_{j=1}^m A_{i,j}$).
463 / 635
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 1 & 1 & 0 & 1 & 0\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\,, \qquad D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 3 & 0 & 0\\ 0 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
464 / 635
Graph Laplacian
Definition
The Laplacian of the graph is the matrix L = D - A.
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$L = D - A = \begin{pmatrix} 1 & 0 & -1 & 0 & 0\\ 0 & 1 & -1 & 0 & 0\\ -1 & -1 & 3 & -1 & 0\\ 0 & 0 & -1 & 2 & -1\\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$
465 / 635
Properties of the Laplacian
Lemma
Let L = D - A be the Laplacian of a connected graph. For any $f : \mathcal X \to \mathbb R$,
$$\Omega(f) := \sum_{i \sim j} (f(x_i) - f(x_j))^2 = f^\top L f\,.$$
466 / 635
Proof: link between $\Omega(f)$ and L
$$\begin{aligned}
\Omega(f) &= \sum_{i \sim j} (f(x_i) - f(x_j))^2\\
&= \sum_{i \sim j} \left( f(x_i)^2 + f(x_j)^2 - 2 f(x_i) f(x_j) \right)\\
&= \sum_{i=1}^m D_{i,i}\, f(x_i)^2 - 2 \sum_{i \sim j} f(x_i) f(x_j)\\
&= f^\top D f - f^\top A f\\
&= f^\top L f
\end{aligned}$$
467 / 635
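As a sanity check of the lemma, the identity $\Omega(f) = f^\top L f$ can be verified numerically on the example graph of the previous slides; a minimal numpy sketch (the edge list is the one read off the adjacency matrix above):

    import numpy as np

    # Adjacency matrix of the 5-vertex example graph (edges 1-3, 2-3, 3-4, 4-5).
    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # graph Laplacian

    edges = [(0, 2), (1, 2), (2, 3), (3, 4)]    # 0-based edge list
    f = np.random.randn(5)
    omega = sum((f[i] - f[j]) ** 2 for i, j in edges)
    print(np.allclose(f @ L @ f, omega))        # True: f^T L f = Omega(f)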
Proof: eigenstructure of L
L is symmetric because A and D are symmetric.
For any $f \in \mathbb R^m$, $f^\top L f = \Omega(f) \ge 0$, therefore the (real-valued) eigenvalues of L are $\ge 0$: L is positive semi-definite.
f is an eigenvector associated to the eigenvalue 0 iff $f^\top L f = \Omega(f) = 0$, i.e., iff f is constant (since the graph is connected).
468 / 635
Our first graph kernel
Theorem
The set $H = \{ f \in \mathbb R^m : \sum_{i=1}^m f_i = 0 \}$, endowed with the norm
$$\Omega(f) = \sum_{i \sim j} (f(x_i) - f(x_j))^2\,,$$
is an RKHS whose reproducing kernel is $L^*$, the pseudo-inverse of the graph Laplacian.
469 / 635
In case of...
Pseudo-inverse of L
Remember that the pseudo-inverse $L^*$ of L is the linear map equal to:
0 on Ker(L),
$L^{-1}$ on Im(L); that is, if we write
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top$$
the eigendecomposition of L, then
$$L^* = \sum_{\lambda_i \neq 0} \lambda_i^{-1} u_i u_i^\top\,.$$
On H, the norm $\Omega$ derives from the inner product
$$\langle f, g \rangle = f^\top L g\,, \qquad \|f\|^2 = \langle f, f \rangle = f^\top L f = \Omega(f)\,.$$
471 / 635
Proof (2/2)
To check that H is an RKHS with reproducing kernel $K = L^*$, it suffices to show that:
$$\begin{cases} \forall x \in \mathcal X, & K_x \in H\,,\\ \forall (x, f) \in \mathcal X \times H, & \langle f, K_x \rangle = f(x)\,. \end{cases}$$
Indeed, for $f \in H$, $g := K L f = L^* L f = \Pi_H(f) = f$, where $\Pi_H$ denotes the orthogonal projection onto $\mathrm{Im}(L) = H$.
472 / 635
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$L^* = \begin{pmatrix} 0.88 & -0.12 & 0.08 & -0.32 & -0.52\\ -0.12 & 0.88 & 0.08 & -0.32 & -0.52\\ 0.08 & 0.08 & 0.28 & -0.12 & -0.32\\ -0.32 & -0.32 & -0.12 & 0.48 & 0.28\\ -0.52 & -0.52 & -0.32 & 0.28 & 1.08 \end{pmatrix}$$
473 / 635
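The reproducing property of $L^*$ on H can also be checked numerically; a minimal sketch, assuming the same example graph, with inner product $\langle f, g \rangle = f^\top L g$ and $K_x$ the x-th column of the pseudo-inverse:

    import numpy as np

    # Laplacian of the example graph and its pseudo-inverse K = L^*.
    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    K = np.linalg.pinv(L)                       # candidate reproducing kernel

    f = np.random.randn(5)
    f -= f.mean()                               # f belongs to H (zero sum)
    x = 2                                       # any vertex
    print(np.allclose(f @ L @ K[:, x], f[x]))   # True: <f, K_x> = f(x)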
Interpretation of the Laplacian
[Figure: discretization of an interval with step dx; grid points i-1, i, i+1 and a function f.]
$$\begin{aligned}
\Delta f(x) = -f''(x) &\approx -\frac{f'(x + dx/2) - f'(x - dx/2)}{dx}\\
&\approx -\frac{f(x + dx) - f(x) - f(x) + f(x - dx)}{dx^2}\\
&= -\frac{f_{i-1} + f_{i+1} - 2 f(x)}{dx^2}\\
&= \frac{L f(i)}{dx^2}\,.
\end{aligned}$$
474 / 635
Interpretation of regularization
For $f : [0,1] \to \mathbb R$ and $x_i = i/m$, we have:
$$\begin{aligned}
\Omega(f) &= \sum_{i=1}^m \left( f\!\left(\frac{i+1}{m}\right) - f\!\left(\frac{i}{m}\right) \right)^2\\
&\approx \sum_{i=1}^m \left( \frac{1}{m} f'\!\left(\frac{i}{m}\right) \right)^2\\
&= \frac{1}{m} \left[ \frac{1}{m} \sum_{i=1}^m f'\!\left(\frac{i}{m}\right)^2 \right]\\
&\approx \frac{1}{m} \int_0^1 f'(t)^2\, dt\,.
\end{aligned}$$
475 / 635
Outline
476 / 635
Motivation
477 / 635
The diffusion equation
Lemma
For any $x_0 \in \mathbb R^d$, the function
$$K_{x_0}(x, t) = K_t(x_0, x) = \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{\|x - x_0\|^2}{4t} \right)$$
is a solution of the diffusion (heat) equation.
478 / 635
Discrete diffusion equation
For finite-dimensional $f_t \in \mathbb R^m$, the diffusion equation becomes
$$\frac{\partial f_t}{\partial t} = -L f_t\,,$$
which admits the following solution:
$$f_t = e^{-tL} f_0 \qquad \text{with} \qquad e^{-tL} = I - tL + \frac{t^2}{2!} L^2 - \frac{t^3}{3!} L^3 + \dots$$
479 / 635
Diffusion kernel (Kondor and Lafferty, 2002)
This suggests to consider
$$K = e^{-tL}\,,$$
which is indeed symmetric positive semi-definite because, if we write
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top \qquad (\lambda_i \ge 0)\,,$$
we obtain
$$K = e^{-tL} = \sum_{i=1}^m e^{-t\lambda_i} u_i u_i^\top\,.$$
480 / 635
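A minimal numpy sketch of the diffusion kernel, computing $e^{-tL}$ from the eigendecomposition of the Laplacian of the example graph (t = 1 is an arbitrary illustrative choice):

    import numpy as np

    def diffusion_kernel(L, t):
        """K = exp(-t L) computed from the eigendecomposition of the Laplacian."""
        lam, U = np.linalg.eigh(L)                 # L = U diag(lam) U^T, lam >= 0
        return (U * np.exp(-t * lam)) @ U.T

    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    K = diffusion_kernel(L, t=1.0)
    print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > 0))   # symmetric, p.d.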
Example: complete graph
$$K_{i,j} = \begin{cases} \dfrac{1 + (m-1)\,e^{-tm}}{m} & \text{for } i = j,\\[2mm] \dfrac{1 - e^{-tm}}{m} & \text{for } i \neq j.\end{cases}$$
481 / 635
Example: closed chain
$$K_{i,j} = \frac{1}{m} \sum_{\nu=0}^{m-1} \exp\left( -2t \left( 1 - \cos\frac{2\pi\nu}{m} \right) \right) \cos\frac{2\pi\nu (i-j)}{m}\,.$$
482 / 635
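The closed-form expression above (as reconstructed here) can be checked against a direct computation of $e^{-tL}$ for a small cycle; a sketch with arbitrary m and t:

    import numpy as np

    m, t = 7, 0.5
    # Laplacian of the closed chain (cycle) on m vertices.
    A = np.zeros((m, m))
    for i in range(m):
        A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)
    K = (U * np.exp(-t * lam)) @ U.T               # exact e^{-tL}

    # Closed-form expression reconstructed above.
    i, j = np.indices((m, m))
    nu = np.arange(m)
    K2 = np.mean(np.exp(-2 * t * (1 - np.cos(2 * np.pi * nu / m)))
                 * np.cos(2 * np.pi * nu[None, None, :] * (i - j)[:, :, None] / m),
                 axis=2)
    print(np.allclose(K, K2))                      # True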
Outline
483 / 635
Motivation
In this section we show that the diffusion and Laplace kernels can be
interpreted in the frequency domain of functions
This shows that our strategy to design kernels on graphs was based
on (discrete) harmonic analysis on the graph
This follows the approach we developed for semigroup kernels!
484 / 635
Spectrum of the diffusion kernel
Let $0 = \lambda_1 < \lambda_2 \le \dots \le \lambda_m$ be the eigenvalues of the Laplacian:
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top \qquad (\lambda_i \ge 0)$$
485 / 635
Norm in the diffusion RKHS
Any function $f \in \mathbb R^m$ can be written as $f = K K^{-1} f$, so its norm in the RKHS of K is $\|f\|_K^2 = f^\top K^{-1} f$.
For $i = 1, \dots, m$, let
$$\hat f_i = u_i^\top f$$
be the projection of f onto the eigenbasis of K.
We then have:
$$\|f\|_{K_t}^2 = f^\top K^{-1} f = \sum_{i=1}^m e^{t\lambda_i}\, \hat f_i^2\,.$$
This looks similar to $\int |\hat f(\omega)|^2\, e^{\frac{\sigma^2 \omega^2}{2}}\, d\omega$ ...
486 / 635
Discrete Fourier transform
Definition
The vector $\hat f = (\hat f_1, \dots, \hat f_m)^\top$ is called the discrete Fourier transform of $f \in \mathbb R^m$.
487 / 635
Example: eigenvectors of the Laplacian
488 / 635
Generalization
This observation suggests to define a whole family of kernels:
$$K_r = \sum_{i=1}^m r(\lambda_i)\, u_i u_i^\top\,,$$
where $r : \mathbb R^+ \to \mathbb R^+$ is a non-increasing function.
489 / 635
Example : regularized Laplacian
$$r(\lambda_i) = \frac{1}{\lambda_i + \lambda}\,, \qquad \lambda > 0$$
$$K = \sum_{i=1}^m \frac{1}{\lambda_i + \lambda}\, u_i u_i^\top = (L + \lambda I)^{-1}$$
$$\|f\|_K^2 = f^\top K^{-1} f = \sum_{i \sim j} (f(x_i) - f(x_j))^2 + \lambda \sum_{i=1}^m f(x_i)^2\,.$$
490 / 635
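A short numerical check of the regularized Laplacian kernel and of the norm identity above, on the same example graph ($\lambda = 1$, so K should match the matrix on the next slide up to rounding):

    import numpy as np

    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    lam = 1.0
    K = np.linalg.inv(L + lam * np.eye(5))      # regularized Laplacian kernel

    edges = [(0, 2), (1, 2), (2, 3), (3, 4)]
    f = np.random.randn(5)
    norm_K = f @ np.linalg.inv(K) @ f           # ||f||_K^2 = f^T K^{-1} f
    omega = sum((f[i] - f[j]) ** 2 for i, j in edges) + lam * np.sum(f ** 2)
    print(np.allclose(norm_K, omega))           # True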
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$(L + I)^{-1} = \begin{pmatrix} 0.60 & 0.10 & 0.19 & 0.08 & 0.04\\ 0.10 & 0.60 & 0.19 & 0.08 & 0.04\\ 0.19 & 0.19 & 0.38 & 0.15 & 0.08\\ 0.08 & 0.08 & 0.15 & 0.46 & 0.23\\ 0.04 & 0.04 & 0.08 & 0.23 & 0.62 \end{pmatrix}$$
491 / 635
Outline
492 / 635
Applications 1: graph partitioning
A classical relaxation of graph partitioning is:
$$\min_{f \in \mathbb R^{\mathcal X}} \sum_{i \sim j} (f_i - f_j)^2 \quad \text{s.t.} \quad \sum_i f_i^2 = 1$$
[Figure: data embedded on the first two principal directions (PC1, PC2).]
493 / 635
Applications 2: search on a graph
494 / 635
Application 3: Semi-supervised learning
495 / 635
Application 3: Semi-supervised learning
496 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on fewer than 100 samples from two (or more) different classes (e.g., different tumors)
497 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on fewer than 100 samples from two (or more) different classes (e.g., different tumors)
Goal
Design a classifier to automatically assign a class to future samples
from their expression profile
Interpret biologically the differences between the classes
497 / 635
Linear classifiers
The approach
Each sample is represented by a vector $x = (x_1, \dots, x_p)$ where $p > 10^5$ is the number of probes.
Classification: given the set of labeled samples, learn a linear decision function:
$$f(x) = \sum_{i=1}^p \beta_i x_i + \beta_0\,,$$
498 / 635
Linear classifiers
Pitfalls
No robust estimation procedure exists for 100 samples in $10^5$ dimensions!
It is necessary to reduce the complexity of the problem with prior
knowledge.
499 / 635
Example : Norm Constraints
The approach
A common method in statistics to learn with few samples in high dimension is to constrain the norm of $\beta$, e.g.:
Euclidean norm (support vector machines, ridge regression): $\|\beta\|_2^2 = \sum_{i=1}^p \beta_i^2$
$L_1$-norm (lasso regression): $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$
Pros: good performance in classification.
Cons: limited interpretation (small weights); no prior biological knowledge.
500 / 635
Example 2: Feature Selection
The approach
Constrain most weights to be 0, i.e., select a few genes (< 20) whose expression is sufficient for classification. Interpretation is then about the selected genes.
Pros: good performance in classification; useful for biomarker selection; apparently easy interpretation.
Cons: the gene selection process is usually not robust; wrong interpretation is the rule (too much correlation between genes).
501 / 635
Pathway interpretation
Motivation
Basic biological functions are usually expressed in terms of pathways
and not of single genes (metabolic, signaling, regulatory)
Many pathways are already known
How to use this prior knowledge to constrain the weights to have an
interpretation at the level of pathways?
502 / 635
Pathway interpretation
[Figure: N-Glycan biosynthesis pathway.]
503 / 635
Pathway interpretation
Good example
The graph is the complete
known metabolic network
of the budding yeast
(from KEGG database)
We project the classifier weights learned by a spectral SVM
Good classification
accuracy, and good
interpretation!
504 / 635
Part 6
Open Problems
and Research Topics
505 / 635
Outline
2 Kernel tricks
506 / 635
Motivation
507 / 635
Setting: learning with one kernel
For any $f : \mathcal X \to \mathbb R$, let $f^n = (f(x_1), \dots, f(x_n)) \in \mathbb R^n$.
Given a p.d. kernel $K : \mathcal X \times \mathcal X \to \mathbb R$, we learn with K by solving:
$$\min_{f \in \mathcal H_K} R(f^n) + \lambda \|f\|_{\mathcal H_K}^2\,,$$
where $\lambda > 0$ and $R : \mathbb R^n \to \mathbb R$ is a closed³ convex function (typically an empirical risk).
³ R is closed if, for each $A \in \mathbb R$, the sublevel set $\{u \in \mathbb R^n : R(u) \le A\}$ is closed. For example, if R is continuous then it is closed.
508 / 635
Sum kernel
Definition
Let $K_1, \dots, K_M$ be M kernels on $\mathcal X$. The sum kernel $K_S$ is the kernel on $\mathcal X$ defined as
$$\forall x, x' \in \mathcal X, \quad K_S(x, x') = \sum_{i=1}^M K_i(x, x')\,.$$
509 / 635
Sum kernel and vector concatenation
Theorem
For $i = 1, \dots, M$, let $\Phi_i : \mathcal X \to \mathcal H_i$ be a feature map such that
$$K_i(x, x') = \langle \Phi_i(x), \Phi_i(x') \rangle_{\mathcal H_i}\,.$$
Then $K_S = \sum_{i=1}^M K_i$ can be written as
$$K_S(x, x') = \langle \Phi_S(x), \Phi_S(x') \rangle_{\mathcal H_S}\,,$$
where $\Phi_S(x) = (\Phi_1(x), \dots, \Phi_M(x))$ lives in the direct sum $\mathcal H_S = \mathcal H_1 \oplus \dots \oplus \mathcal H_M$.
511 / 635
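A tiny numpy illustration of the theorem: the Gram matrix of the sum kernel coincides with the Gram matrix of the concatenated feature maps (the two feature maps below are arbitrary illustrative choices):

    import numpy as np

    n, d = 5, 4
    X = np.random.randn(n, d)

    # Two explicit feature maps and their kernels.
    Phi1 = X                         # linear features
    Phi2 = X ** 2                    # squared features (just for illustration)
    K1, K2 = Phi1 @ Phi1.T, Phi2 @ Phi2.T

    # Sum kernel vs. concatenated feature map.
    PhiS = np.hstack([Phi1, Phi2])
    print(np.allclose(K1 + K2, PhiS @ PhiS.T))   # True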
Example: data integration with the sum kernel
[Screenshot of a Bioinformatics paper (Vol. 20 Suppl. 1, 2004, pages i363-i370) on protein network inference from multiple types of genomic data. Table 1 lists the experiments of the direct approach, the spectral approach based on kernel PCA, and the supervised approach based on kernel CCA, each using the expression kernel Kexp, the protein interaction kernel Kppi, the localization kernel Kloc, the phylogenetic profile kernel Kphy, and their sum Kexp + Kppi + Kloc + Kphy (integration), evaluated against the gold-standard protein network kernel Kgold. The ROC curves of the supervised approach show that integrating the four data sources with the sum kernel improves the prediction of the yeast protein network.]
512 / 635
The sum kernel: functional point of view
Theorem
The solution $f^* \in \mathcal H_{K_S}$ when we learn with $K_S = \sum_{i=1}^M K_i$ is equal to
$$f^* = \sum_{i=1}^M f_i^*\,,$$
where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^2\,.$$
513 / 635
Generalization: The weighted sum kernel
Theorem
The solution $f^*$ when we learn with $K_\eta = \sum_{i=1}^M \eta_i K_i$, with $\eta_1, \dots, \eta_M \ge 0$, is equal to
$$f^* = \sum_{i=1}^M f_i^*\,,$$
where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i}\,.$$
514 / 635
Proof (1/4)
We know from the previous theorem that $f^* = \sum_{i=1}^M f_i^*$ with $(f_1^*, \dots, f_M^*)$ solving
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i}\,.$$
By the representer theorem, $f_i^*(x) = \sum_{j=1}^n \alpha_{ij} K_i(x_j, x)$, where $(\alpha_1^*, \dots, \alpha_M^*)$ is the solution of
$$\min_{\alpha_1, \dots, \alpha_M \in \mathbb R^n} R\left( \sum_{i=1}^M K_i \alpha_i \right) + \lambda \sum_{i=1}^M \frac{\alpha_i^\top K_i \alpha_i}{\eta_i}\,.$$
515 / 635
Proof (2/4)
This is equivalent to
$$\min_{u, \alpha_1, \dots, \alpha_M \in \mathbb R^n} R(u) + \lambda \sum_{i=1}^M \frac{\alpha_i^\top K_i \alpha_i}{\eta_i} \quad \text{s.t.} \quad u = \sum_{i=1}^M K_i \alpha_i\,.$$
516 / 635
Proof (3/4)
Introducing a Lagrange multiplier $2\lambda\gamma \in \mathbb R^n$ for the equality constraint, we minimize the Lagrangian separately in u and in the $\alpha_i$'s.
Minimization in u:
$$\min_u \left\{ R(u) + 2\lambda \gamma^\top u \right\} = -\max_u \left\{ -2\lambda\gamma^\top u - R(u) \right\} = -R^*(-2\lambda\gamma)\,,$$
where $R^*$ denotes the Fenchel conjugate of R.
Minimization in $\alpha_i$ for $i = 1, \dots, M$:
$$\min_{\alpha_i} \left\{ \lambda \frac{\alpha_i^\top K_i \alpha_i}{\eta_i} - 2\lambda \gamma^\top K_i \alpha_i \right\} = -\lambda \eta_i\, \gamma^\top K_i \gamma\,,$$
where the minimum is attained at $\alpha_i = \eta_i \gamma$.
517 / 635
Proof (4/4)
The dual problem is therefore
$$\max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top \left( \sum_{i=1}^M \eta_i K_i \right) \gamma \right\}\,.$$
Note that if we learn with a single kernel K, we get the same dual problem
$$\max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K \gamma \right\}\,.$$
If $\gamma^*$ is a solution of the dual problem, then $\alpha_i^* = \eta_i \gamma^*$, leading to:
$$\forall x \in \mathcal X, \quad f_i^*(x) = \sum_{j=1}^n \alpha_{ij}^* K_i(x_j, x) = \eta_i \sum_{j=1}^n \gamma_j^* K_i(x_j, x)\,.$$
Therefore, $f^* = \sum_{i=1}^M f_i^*$ satisfies
$$f^*(x) = \sum_{i=1}^M \sum_{j=1}^n \eta_i \gamma_j^* K_i(x_j, x) = \sum_{j=1}^n \gamma_j^* K_\eta(x_j, x)\,.$$
518 / 635
Learning the kernel
Motivation
If we know how to weight each kernel, then we can learn with the weighted kernel
$$K_\eta = \sum_{i=1}^M \eta_i K_i\,.$$
519 / 635
An objective function for K
Theorem
For any p.d. kernel K on $\mathcal X$, let
$$J(K) := \min_{f \in \mathcal H_K} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_K}^2 \right\}\,.$$
The function $K \mapsto J(K)$ is convex.
520 / 635
Proof
We have shown by strong duality that
$$J(K) = \max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K \gamma \right\}\,.$$
For each $\gamma$, the term inside the braces is an affine function of K; J is therefore a pointwise maximum of affine functions of K, hence convex.
521 / 635
MKL (Lanckriet et al., 2004)
We consider the set of convex combinations
$$K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i = 1 \right\}\,,$$
and learn the kernel weights together with the predictor by solving
$$\min_{\eta \in \Sigma_M} J(K_\eta)\,.$$
522 / 635
Example: protein annotation
[Screenshot of Lanckriet et al., Bioinformatics 20(16):2626-2635, 2004: a statistical framework for genomic data fusion. Table 1 lists the seven kernels used to compare proteins (KSW: Smith-Waterman on protein sequences; KB: BLAST; KPfam: Pfam HMM; KFFT: FFT of hydropathy profiles; KLI: linear kernel on protein interactions; KD: diffusion kernel on protein interactions; KE: radial basis kernel on gene expression; KRND: random numbers, included as a control). Bar plots of ROC scores and of the learned kernel weights for (A) ribosomal and (B) membrane protein classification show that combining the kernels with the SDP/SVM (MKL) method yields better performance than any single kernel.]
523 / 635
Example: Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination by
MKL (M).
[Figure: test error on COREL14 for the kernels H, W, TW, wTW and the MKL combination M.]
524 / 635
MKL revisited (Bach et al., 2004)
$$K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i = 1 \right\}$$
Theorem
The solution $f^*$ of
$$\min_{\eta \in \Sigma_M} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}$$
is $f^* = \sum_{i=1}^M f_i^*$, where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of:
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2\,.$$
525 / 635
Proof (1/2)
$$\begin{aligned}
&\min_{\eta \in \Sigma_M} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}\\
&= \min_{\eta \in \Sigma_M} \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i} \right\}\\
&= \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \min_{\eta \in \Sigma_M} \left\{ \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i} \right\} \right\}\\
&= \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2 \right\}\,,
\end{aligned}$$
526 / 635
Proof (2/2)
where the last equality results from:
$$\forall a \in \mathbb R_+^M, \quad \left( \sum_{i=1}^M a_i \right)^2 = \inf_{\eta \in \Sigma_M} \sum_{i=1}^M \frac{a_i^2}{\eta_i}\,,$$
which follows from the Cauchy-Schwarz inequality
$$\sum_{i=1}^M a_i = \sum_{i=1}^M \frac{a_i}{\sqrt{\eta_i}} \sqrt{\eta_i} \le \left( \sum_{i=1}^M \frac{a_i^2}{\eta_i} \right)^{1/2} \left( \sum_{i=1}^M \eta_i \right)^{1/2}\,,$$
with equality when $\eta_i \propto a_i$.
527 / 635
Algorithm: simpleMKL (Rakotomamonjy et al., 2008)
We want to minimize in $\eta \in \Sigma_M$:
$$\min_{\eta \in \Sigma_M} J(K_\eta) = \min_{\eta \in \Sigma_M} \max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K_\eta \gamma \right\}\,.$$
At the optimum $\gamma^*$ of the inner problem,
$$J(K_\eta) = -R^*(-2\lambda\gamma^*) - \lambda \gamma^{*\top} K_\eta \gamma^*\,,$$
and (by Danskin's theorem) the gradient of J with respect to $\eta$ can be computed from $\gamma^*$, which allows a projected gradient scheme on $\Sigma_M$.
528 / 635
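The sketch below is one possible instantiation of these ideas, not the reduced-gradient procedure of simpleMKL: it uses kernel ridge regression, for which $J(K) = \lambda\, y^\top (K + n\lambda I)^{-1} y$ in closed form, and multiplicative (exponentiated-gradient) updates to stay on the simplex. The step size and the toy data are arbitrary choices.

    import numpy as np

    def J_and_grad(eta, Ks, y, lam):
        """Objective J(K_eta) and its gradient for kernel ridge regression,
        where J(K) = min_f (1/n)||y - f^n||^2 + lam ||f||^2 = lam y^T (K + n lam I)^{-1} y."""
        n = len(y)
        K = sum(e * Ki for e, Ki in zip(eta, Ks))
        gamma = np.linalg.solve(K + n * lam * np.eye(n), y)
        J = lam * y @ gamma
        grad = np.array([-lam * gamma @ Ki @ gamma for Ki in Ks])
        return J, grad

    def mkl_ridge(Ks, y, lam=0.1, step=1.0, iters=200):
        """Minimize J(K_eta) over the simplex with multiplicative updates."""
        eta = np.ones(len(Ks)) / len(Ks)
        for _ in range(iters):
            _, grad = J_and_grad(eta, Ks, y, lam)
            eta = eta * np.exp(-step * grad)
            eta /= eta.sum()                      # stay on the simplex Sigma_M
        return eta

    # Toy example with rank-1 kernels K_i(x, x') = x_i x_i'.
    rng = np.random.RandomState(0)
    X = rng.randn(40, 5)
    y = X[:, 0] - X[:, 1]
    Ks = [np.outer(X[:, i], X[:, i]) for i in range(5)]
    print(np.round(mkl_ridge(Ks, y), 2))

On this toy problem most of the weight should end up on the two informative rank-1 kernels, mirroring the sparsity-inducing behaviour of MKL discussed on the following slides.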
Sum kernel vs MKL
Learning with the sum kernel (uniform combination) solves
$$\min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^2 \right\}\,,$$
whereas MKL solves
$$\min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2 \right\}\,:$$
the former penalizes the squared $\ell_2$-norm of the vector of RKHS norms, while the latter penalizes its squared $\ell_1$-norm, which induces sparsity among the kernels.
529 / 635
Example: ridge vs LASSO regression
Take $\mathcal X = \mathbb R^d$, and for $x = (x_1, \dots, x_d)^\top$ consider the rank-1 kernels:
$$\forall i = 1, \dots, d, \quad K_i(x, x') = x_i x_i'\,.$$
Each $\mathcal H_{K_i}$ contains the functions $f_i(x) = \beta_i x_i$ with $\|f_i\|_{\mathcal H_{K_i}} = |\beta_i|$: learning with the sum kernel then amounts to ridge regression ($\ell_2$ penalty on $\beta$), while MKL amounts to LASSO-type regression (squared $\ell_1$ penalty on $\beta$).
530 / 635
Extensions (Micchelli et al., 2005)
$$\text{For } r > 0, \quad K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M^r = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i^r = 1 \right\}$$
Theorem
The solution $f^*$ of
$$\min_{\eta \in \Sigma_M^r} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}$$
is $f^* = \sum_{i=1}^M f_i^*$, where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of:
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^{\frac{2r}{r+1}} \right)^{\frac{r+1}{r}}\,.$$
531 / 635
Outline
2 Kernel tricks
532 / 635
Outline
533 / 635
Motivation
Main problem
All methods we have seen require computing the n x n Gram matrix, which is infeasible when n is significantly greater than 100 000, both in terms of memory and computation.
Solutions
low-rank approximation of the kernel;
random Fourier features.
The goal is to find an approximate embedding $\psi : \mathcal X \to \mathbb R^d$ such that
$$K(x, x') \approx \langle \psi(x), \psi(x') \rangle_{\mathbb R^d}\,.$$
534 / 635
Motivation
Then, functions f in $\mathcal H$ may be approximated by linear ones in $\mathbb R^d$, e.g.,
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) \approx \left\langle \sum_{i=1}^n \alpha_i \psi(x_i), \psi(x) \right\rangle_{\mathbb R^d} = \langle w, \psi(x) \rangle_{\mathbb R^d}\,,$$
and the kernel ERM problem becomes, approximately,
$$\min_{w \in \mathbb R^d} \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top \psi(x_i)) + \lambda \|w\|_2^2\,,$$
535 / 635
Outline
536 / 635
Interlude: Large-scale learning with linear models
Let us study for a while optimization techniques for minimizing large sums of functions
$$\min_{w \in \mathbb R^d} \frac{1}{n} \sum_{i=1}^n f_i(w)\,.$$
537 / 635
Introduction of a few optimization principles
Why do we care about convexity?
538 / 635
Introduction of a few optimization principles
Why do we care about convexity?
Local observations give information about the global optimum
[Figures: for a convex function, the tangent at any point lies below the graph, so a local observation (the gradient at the current iterate) constrains where the global optimum can be.]
If f is convex and L-smooth (i.e., $\nabla f$ is L-Lipschitz), consider the gradient descent iterates
$$w_t \leftarrow w_{t-1} - \frac{1}{L} \nabla f(w_{t-1})\,.$$
Then,
$$f(w_t) - f^\star \le \frac{L \|w_0 - w^\star\|_2^2}{2t}\,.$$
Remarks
the convergence rate improves under additional assumptions on f
(strong convexity);
some variants have a O(1/t 2 ) convergence rate [Nesterov, 2004].
541 / 635
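A minimal sketch of gradient descent with step size 1/L on a smooth convex least-squares objective (the problem and constants are illustrative, not from the course):

    import numpy as np

    # f(w) = (1/2n) ||Xw - y||^2; its gradient is L-Lipschitz with L = ||X^T X||_2 / n.
    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true

    L = np.linalg.norm(X.T @ X, 2) / n
    f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
    grad = lambda w: X.T @ (X @ w - y) / n

    w = np.zeros(d)
    for t in range(500):
        w = w - grad(w) / L          # w_t = w_{t-1} - (1/L) grad f(w_{t-1})
    print(f(w))                      # decreases at least as O(L ||w_0 - w*||^2 / (2t))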
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
$$f(w) \le f(z) + \nabla f(z)^\top (w - z) + \frac{L}{2} \|w - z\|_2^2\,.$$
542 / 635
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
$$f(w) \le f(z) + \nabla f(z)^\top (w - z) + \frac{L}{2} \|w - z\|_2^2\,.$$
By using Taylor's theorem with integral form,
$$f(w) - f(z) = \int_0^1 \nabla f(tw + (1-t)z)^\top (w - z)\, dt\,.$$
Then,
$$\begin{aligned}
f(w) - f(z) - \nabla f(z)^\top (w - z) &= \int_0^1 \left( \nabla f(tw + (1-t)z) - \nabla f(z) \right)^\top (w - z)\, dt\\
&\le \int_0^1 \left| \left( \nabla f(tw + (1-t)z) - \nabla f(z) \right)^\top (w - z) \right| dt\\
&\le \int_0^1 \|\nabla f(tw + (1-t)z) - \nabla f(z)\|_2\, \|w - z\|_2\, dt \quad \text{(C.-S.)}\\
&\le \int_0^1 L t \|w - z\|_2^2\, dt = \frac{L}{2} \|w - z\|_2^2\,.
\end{aligned}$$
542 / 635
Proof (2/2)
Proof of the theorem
We have shown that for all w,
$$f(w) \le g_t(w) := f(w_{t-1}) + \nabla f(w_{t-1})^\top (w - w_{t-1}) + \frac{L}{2} \|w - w_{t-1}\|_2^2\,.$$
$g_t$ is minimized by $w_t$; it can be rewritten $g_t(w) = g_t(w_t) + \frac{L}{2} \|w - w_t\|_2^2$. Then,
$$\begin{aligned}
f(w_t) \le g_t(w_t) &= g_t(w^\star) - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&= f(w_{t-1}) + \nabla f(w_{t-1})^\top (w^\star - w_{t-1}) + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&\le f^\star + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\,.
\end{aligned}$$
By summing from t = 1 to T, we have a telescopic sum
$$T (f(w_T) - f^\star) \le \sum_{t=1}^T f(w_t) - f^\star \le \frac{L}{2} \|w^\star - w_0\|_2^2 - \frac{L}{2} \|w^\star - w_T\|_2^2\,.$$
543 / 635
Introduction of a few optimization principles
An important inequality for smooth and μ-strongly convex functions
[Figure: quadratic lower bound on f at the current iterate.]
If f is μ-strongly convex, then for all w,
$$f(w) \ge f^\star + \frac{\mu}{2} \|w - w^\star\|_2^2\,.$$
545 / 635
Proof
We start from an inequality from the previous proof:
$$\begin{aligned}
f(w_t) &\le f(w_{t-1}) + \nabla f(w_{t-1})^\top (w^\star - w_{t-1}) + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&\le f^\star + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\,.
\end{aligned}$$
In addition, we have that $f(w_t) \ge f^\star + \frac{\mu}{2} \|w_t - w^\star\|_2^2$, and thus
$$\|w^\star - w_t\|_2^2 \le \frac{L}{L + \mu} \|w^\star - w_{t-1}\|_2^2 = \left( 1 - \frac{\mu}{L + \mu} \right) \|w^\star - w_{t-1}\|_2^2\,.$$
Finally,
$$f(w_t) - f^\star \le \frac{L}{2} \|w_t - w^\star\|_2^2 \le \left( 1 - \frac{\mu}{L + \mu} \right)^t \frac{L \|w^\star - w_0\|_2^2}{2}\,.$$
546 / 635
The stochastic (sub)gradient descent algorithm
Consider now the minimization of an expectation
$$\min_{w \in \mathbb R^d} \mathbb E_x[\ell(x, w)]\,.$$
At iteration t, the stochastic (sub)gradient descent algorithm draws a data point $x_t$, performs the update $w_t \leftarrow w_{t-1} - \eta_t \nabla_w \ell(x_t, w_{t-1})$, and possibly averages the iterates, e.g.,
$$\bar w_t \leftarrow (1 - \gamma_t)\, \bar w_{t-1} + \gamma_t\, w_t\,.$$
547 / 635
The stochastic (sub)gradient descent algorithm
There are various learning rate strategies (constant or varying step sizes) and averaging strategies. Depending on the problem assumptions and the choice of $\eta_t, \gamma_t$, classical convergence rates may be obtained:
$f(\bar w_t) - f^\star = O(1/\sqrt t)$ for convex problems;
$f(\bar w_t) - f^\star = O(1/t)$ for strongly convex ones.
Remarks
The convergence rates are not that great, but the complexity
per-iteration is small (1 gradient evaluation for minimizing an
empirical risk versus n for the batch algorithm).
When the amount of data is infinite, the method minimizes the
expected risk.
Choosing a good learning rate automatically is an open problem.
548 / 635
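A minimal SGD sketch on a least-squares problem, with one randomly drawn example per iteration and a decreasing step size (the schedule is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true + 0.1 * rng.randn(n)

    w = np.zeros(d)
    for t in range(1, 20000):
        i = rng.randint(n)
        g = (X[i] @ w - y[i]) * X[i]          # gradient of the i-th loss only
        w -= g / (10.0 + t)                   # decreasing step size eta_t
    print(0.5 * np.mean((X @ w - y) ** 2))    # close to the noise level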
Randomized incremental algorithms (1/2)
Consider now the minimization of a large finite sum of smooth convex functions:
$$\min_{w \in \mathbb R^p} \frac{1}{n} \sum_{i=1}^n f_i(w)\,,$$
SAG algorithm
$$w_t \leftarrow w_{t-1} - \frac{\gamma}{L n} \sum_{i=1}^n y_i^t \qquad \text{with} \qquad y_i^t = \begin{cases} \nabla f_i(w_{t-1}) & \text{if } i = i_t,\\ y_i^{t-1} & \text{otherwise.} \end{cases}$$
See also SAGA [Defazio et al., 2014], SVRG [Xiao and Zhang, 2014],
SDCA [Shalev-Shwartz and Zhang, 2015], MISO [Mairal, 2015];
549 / 635
Randomized incremental algorithms (2/2)
Many of these techniques are in fact performing SGD-type steps
$$w_t \leftarrow w_{t-1} - \gamma_t\, g_t\,,$$
where $\mathbb E[g_t \,|\, w_{t-1}] = \nabla f(w_{t-1})$, but where the estimator of the gradient has lower variance than in SGD, see SVRG [Xiao and Zhang, 2014].
Typically, these methods have the convergence rate
$$f(w_t) - f^\star = O\left( \left( 1 - \frac{C}{\max\left( n, \frac{L}{\mu} \right)} \right)^{t} \right)\,,$$
Remarks
their complexity per-iteration is independent of n!
unlike SGD, they are often almost parameter-free.
besides, they can be accelerated [Lin et al., 2015].
550 / 635
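A minimal sketch of the SAG update above for a least-squares sum (γ = 1 and the data are illustrative choices; the stored gradients $y_i$ are kept in a table):

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true + 0.1 * rng.randn(n)

    L = np.max(np.sum(X ** 2, axis=1))        # max per-example smoothness constant
    Y = np.zeros((n, d))                      # stored gradients y_i
    g_sum = np.zeros(d)
    w = np.zeros(d)
    for t in range(20 * n):                   # about 20 passes over the data
        i = rng.randint(n)
        g_new = (X[i] @ w - y[i]) * X[i]      # fresh gradient for index i
        g_sum += g_new - Y[i]                 # update the running sum of stored gradients
        Y[i] = g_new
        w -= g_sum / (L * n)                  # step gamma/(L n) with gamma = 1
    print(0.5 * np.mean((X @ w - y) ** 2))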
Large-scale learning with linear models
Conclusion
we know how to deal with huge-scale linear problems;
this is also useful to learn with kernels!
551 / 635
Outline
552 / 635
Nystrom approximations: principle
Consider a p.d. kernel $K : \mathcal X \times \mathcal X \to \mathbb R$ and its RKHS $\mathcal H$, with the mapping $\varphi : \mathcal X \to \mathcal H$ such that $K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal H}$.
Motivation
This principle allows us to work explicitly in a finite-dimensional
space; it was introduced several times in the kernel literature [Williams
and Seeger, 2002], [Smola and Scholkopf, 2000], [Fine and Scheinberg, 2001].
553 / 635
Nystrom approximations: principle
[Figure: orthogonal projection of $\varphi(x)$ and $\varphi(x')$ onto a finite-dimensional subspace $\mathcal F = \mathrm{Span}(f_1, \dots, f_p)$ of $\mathcal H$.]
554 / 635
Nystrom approximations: principle
or also
$$\min_{\beta \in \mathbb R^p} \ -2 \sum_{j=1}^p \beta_j f_j(x) + \sum_{j,l=1}^p \beta_j \beta_l \langle f_j, f_l \rangle_{\mathcal H}\,.$$
555 / 635
Nystrom approximations: principle
Then, call $[\mathbf K_f]_{jl} = \langle f_j, f_l \rangle_{\mathcal H}$ and $\mathbf f(x) = [f_1(x), \dots, f_p(x)]$ in $\mathbb R^p$. The problem may be rewritten as
$$\min_{\beta \in \mathbb R^p} -2\, \beta^\top \mathbf f(x) + \beta^\top \mathbf K_f\, \beta\,,$$
whose solution is $\beta^\star(x) = \mathbf K_f^{-1} \mathbf f(x)$, and
$$\langle \Pi\varphi(x), \Pi\varphi(x') \rangle_{\mathcal H} \approx \left\langle \sum_{j=1}^p \beta_j^\star(x) f_j,\ \sum_{j=1}^p \beta_j^\star(x') f_j \right\rangle_{\mathcal H} = \sum_{j,l=1}^p \beta_j^\star(x) \beta_l^\star(x') \langle f_j, f_l \rangle_{\mathcal H} = \beta^\star(x)^\top \mathbf K_f\, \beta^\star(x')\,.$$
556 / 635
Nystrom approximations: principle
This allows us to define the mapping
$$\psi(x) = \mathbf K_f^{1/2} \beta^\star(x) = \mathbf K_f^{-1/2} \mathbf f(x)\,,$$
Remarks
the mapping provides low-rank approximations of the kernel matrix. Given an n x n Gram matrix K computed on a training set $S = \{x_1, \dots, x_n\}$, we have
$$K \approx \psi(S)^\top \psi(S)\,,$$
557 / 635
Nystrom approximation via kernel PCA
Let us now try to learn the $f_j$'s given training data $x_1, \dots, x_n$ in $\mathcal X$:
$$\min_{\substack{f_1, \dots, f_p \in \mathcal H\\ \beta_{ij} \in \mathbb R}} \ \sum_{i=1}^n \left\| \varphi(x_i) - \sum_{j=1}^p \beta_{ij} f_j \right\|_{\mathcal H}^2\,.$$
558 / 635
Nystrom approximation via kernel PCA
Remember the objective:
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,.$$
Consider an optimal solution $\mathbf f^\star$ and compute the eigenvalue decomposition $\mathbf K_{f^\star} = U \Delta U^\top$. Then, define the functions
$$\mathbf g^\star(x) := [g_1^\star(x), \dots, g_p^\star(x)] = \Delta^{-1/2} U^\top \mathbf f^\star(x)\,.$$
The functions $g_j^\star$ are points in the RKHS $\mathcal H$ since they are linear combinations of the functions $f_j^\star$ in $\mathcal H$.
559 / 635
Nystrom approximation via kernel PCA
Remember the objective:
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,.$$
Consider an optimal solution $\mathbf f^\star$ and compute the eigenvalue decomposition $\mathbf K_{f^\star} = U \Delta U^\top$. Then, define the functions
$$\mathbf g^\star(x) := [g_1^\star(x), \dots, g_p^\star(x)] = \Delta^{-1/2} U^\top \mathbf f^\star(x)\,.$$
The functions $g_j^\star$ are points in the RKHS $\mathcal H$ since they are linear combinations of the functions $f_j^\star$ in $\mathcal H$.
Exercise: check that all we do here and in the next slides can be
extended to deal with singular Gram matrices Kf ? and Kf .
559 / 635
Nystrom approximation via kernel PCA
Besides, by construction, $\mathbf K_{g^\star} = \Delta^{-1/2} U^\top \mathbf K_{f^\star} U \Delta^{-1/2} = I$.
560 / 635
Nystrom approximation via kernel PCA
Then, $\mathbf K_{g^\star} = I$ and $\mathbf g^\star$ is also a solution of the problem
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,,$$
since $\mathbf g^\star(x_i)^\top \mathbf K_{g^\star}^{-1} \mathbf g^\star(x_i) = \mathbf f^\star(x_i)^\top \mathbf K_{f^\star}^{-1} \mathbf f^\star(x_i)$ for all i.
561 / 635
Nystrom approximation via kernel PCA
Then, $\mathbf K_{g^\star} = I$ and $\mathbf g^\star$ is also a solution of the problem
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,,$$
since $\mathbf g^\star(x_i)^\top \mathbf K_{g^\star}^{-1} \mathbf g^\star(x_i) = \mathbf f^\star(x_i)^\top \mathbf K_{f^\star}^{-1} \mathbf f^\star(x_i)$ for all i.
561 / 635
Nystrom approximation via kernel PCA
Our first recipe with kernel PCA
Given a dataset of n training points $x_1, \dots, x_n$ in $\mathcal X$,
randomly choose a subset $Z = [x_{z_1}, \dots, x_{z_m}]$ of $m \le n$ training points;
compute the m x m kernel matrix $K_Z$;
perform kernel PCA to find the $p \le m$ largest principal directions (parametrized by p vectors $\alpha_j$ in $\mathbb R^m$).
Then, every point x in $\mathcal X$ may be approximated by
$$\psi(x) = \mathbf K_{g^\star}^{1/2} \mathbf g^\star(x) = \mathbf g^\star(x) = [g_1^\star(x), \dots, g_p^\star(x)]^\top = \left[ \sum_{i=1}^m \alpha_{1i} K(x_{z_i}, x),\ \dots,\ \sum_{i=1}^m \alpha_{pi} K(x_{z_i}, x) \right]^\top\,.$$
562 / 635
Nystrom approximation via kernel PCA
Remarks
The vector $\psi(x)$ can be interpreted as coordinates of the projection of $\varphi(x)$ onto the (orthogonal) PCA basis.
The complexity of training is $O(m^3)$ (eigendecomposition of $K_Z$) + $O(m^2)$ kernel evaluations.
The complexity of encoding a new point x is $O(mp)$ (matrix-vector multiplication) + $O(m)$ kernel evaluations.
563 / 635
Nystrom approximation via kernel PCA
Remarks
The vector $\psi(x)$ can be interpreted as coordinates of the projection of $\varphi(x)$ onto the (orthogonal) PCA basis.
The complexity of training is $O(m^3)$ (eigendecomposition of $K_Z$) + $O(m^2)$ kernel evaluations.
The complexity of encoding a new point x is $O(mp)$ (matrix-vector multiplication) + $O(m)$ kernel evaluations.
The main issue is the encoding time, which depends linearly on m > p.
563 / 635
Nystrom approximation via random sampling
A popular alternative is instead to select the anchor points among the training data points $x_1, \dots, x_n$, that is, to take $f_j = \varphi(x_{z_j})$ for a random subset $\{z_1, \dots, z_p\}$ of the training indices.
564 / 635
Nystrom approximation via random sampling
The complexity of training is $O(p^3)$ (eigendecomposition) + $O(p^2)$ kernel evaluations.
The complexity of encoding a point x is $O(p^2)$ (matrix-vector multiplication) + $O(p)$ kernel evaluations.
565 / 635
Nystrom approximation via random sampling
The complexity of training is $O(p^3)$ (eigendecomposition) + $O(p^2)$ kernel evaluations.
The complexity of encoding a point x is $O(p^2)$ (matrix-vector multiplication) + $O(p)$ kernel evaluations.
The main issue: the complexity is better, but we lose the optimality of the PCA basis, and the random choice of anchor points is not clever.
565 / 635
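A minimal sketch of the Nystrom approximation with randomly sampled anchor points, here for a Gaussian kernel (the kernel, bandwidth and sizes are illustrative choices; the mapping is $\psi(x) = K_Z^{-1/2}[K(z_1,x), \dots, K(z_p,x)]$):

    import numpy as np
    from scipy.linalg import sqrtm

    def gaussian_kernel(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def nystrom_features(X, Z, sigma=1.0):
        """psi(x) = K_Z^{-1/2} [K(z_1, x), ..., K(z_p, x)] for anchor points Z."""
        KZ = gaussian_kernel(Z, Z, sigma)
        W = np.real(np.linalg.inv(sqrtm(KZ + 1e-10 * np.eye(len(Z)))))   # K_Z^{-1/2}
        return gaussian_kernel(X, Z, sigma) @ W

    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    Z = X[rng.choice(500, 50, replace=False)]      # random anchor points
    Psi = nystrom_features(X, Z)
    K_approx = Psi @ Psi.T                         # low-rank approximation of K
    K_exact = gaussian_kernel(X, X)
    print(np.abs(K_approx - K_exact).max())        # small approximation error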
Nystrom approximation via greedy approach
Better approximations can be obtained with a greedy algorithm that iteratively selects one column at a time with largest residual (Bach and Jordan, 2002; Smola and Scholkopf, 2000; Fine and Scheinberg, 2001).
The squared residual of the projection of $\varphi(x)$ onto $\mathrm{Span}(f_1, \dots, f_p)$ is equal to
$$\|\varphi(x)\|_{\mathcal H}^2 - \mathbf f_Z(x)^\top K_Z^{-1} \mathbf f_Z(x)\,,$$
and since $f_j = \varphi(x_{z_j})$ for all j, the data point $x_i$ with largest residual is the one that maximizes
$$K(x_i, x_i) - \mathbf f_Z(x_i)^\top K_Z^{-1} \mathbf f_Z(x_i)\,.$$
566 / 635
Nystrom approximation via greedy approach
This brings us to the following algorithm
Third recipe with greedy anchor point selection
Initialize $Z = \emptyset$. For $k = 1, \dots, p$, do the data point selection step: add to Z the point with largest residual.
Remarks
A naive implementation costs $O(k^2 n + k^3)$ at every iteration.
To get a reasonable complexity, one has to use simple linear algebra tricks (see next slide).
567 / 635
Nystrom approximation via greedy approach
If $Z' = Z \cup \{z\}$,
$$K_{Z'}^{-1} = \begin{pmatrix} K_Z & \mathbf f_Z(z)\\ \mathbf f_Z(z)^\top & K(z,z) \end{pmatrix}^{-1} = \begin{pmatrix} K_Z^{-1} + \frac{1}{s} b b^\top & -\frac{1}{s} b\\ -\frac{1}{s} b^\top & \frac{1}{s} \end{pmatrix}\,,$$
where $b = K_Z^{-1} \mathbf f_Z(z)$ and $s = K(z,z) - \mathbf f_Z(z)^\top b$ (Schur complement).
Complexity analysis
$K_{Z'}^{-1}$ can be obtained from $K_Z^{-1}$ and $\mathbf f_Z(z)$ in $O(k^2)$ float operations; for that we need to always keep in memory the n vectors $\mathbf f_Z(x_i)$.
Updating the $\mathbf f_{Z'}(x_i)$'s from $\mathbf f_Z(x_i)$ requires n kernel evaluations.
The total training complexity is $O(p^2 n)$ float operations and $O(pn)$ kernel evaluations.
568 / 635
Nystrom approximation via K-means
When $\mathcal X = \mathbb R^d$, it is also possible to synthesize points $z_1, \dots, z_p$ such that they represent well the training data $x_1, \dots, x_n$, leading to the clustered Nystrom approximation (Zhang and Kwok, 2008).
Remarks
The complexity is the same as Nystrom with random selection
(except for the K-means step);
The method is data-dependent and can significantly outperform the
other variants in practice.
569 / 635
Nystrom approximation: conclusion
Concluding remarks
The greedy selection rule is equivalent to computing an incomplete Cholesky factorization of the kernel matrix (Bach and Jordan, 2002; Scholkopf and Smola, 2000; Fine and Scheinberg, 2001);
The techniques we have seen produce low-rank approximations of the kernel matrix, $K \approx L L^\top$;
The method admits a geometric interpretation in terms of
orthogonal projection onto a finite-dimensional subspace.
The approximation provides points in the RKHS. As such, many
operations on the mapping are valid (translations, linear
combinations, projections), unlike the method that will come next.
570 / 635
Outline
571 / 635
Random Fourier features [Rahimi and Recht, 2007] (1/5)
A large class of approximations for shift-invariant kernels is based on sampling techniques. Consider a real-valued positive-definite continuous translation-invariant kernel $K(x, y) = \kappa(x - y)$ with $\kappa : \mathbb R^d \to \mathbb R$. Then, if $\kappa(0) = 1$, Bochner's theorem tells us that κ is a valid characteristic function for some probability measure:
$$\kappa(z) = \mathbb E_w\!\left[ e^{i w^\top z} \right]\,.$$
572 / 635
Random Fourier features (2/5)
Then,
$$\begin{aligned}
\kappa(x - y) &= \frac{1}{(2\pi)^d} \int_{\mathbb R^d} \hat\kappa(w)\, e^{i w^\top x} e^{-i w^\top y}\, dw\\
&= \int_{\mathbb R^d} q(w) \cos(w^\top x - w^\top y)\, dw\\
&= \int_{\mathbb R^d} q(w) \left( \cos(w^\top x)\cos(w^\top y) + \sin(w^\top x)\sin(w^\top y) \right) dw\\
&= \int_{\mathbb R^d} \int_{b=0}^{2\pi} \frac{q(w)}{2\pi}\, 2 \cos(w^\top x + b) \cos(w^\top y + b)\, dw\, db \quad \text{(exercise)}\\
&= \mathbb E_{w \sim q(w),\, b \sim U[0, 2\pi]} \left[ \sqrt 2 \cos(w^\top x + b) \cdot \sqrt 2 \cos(w^\top y + b) \right]
\end{aligned}$$
573 / 635
Random Fourier features (3/5)
Random Fourier features recipe
Compute the Fourier transform of the kernel and define the probability density $q(w) = \hat\kappa(w)/(2\pi)^d$;
Draw p i.i.d. samples $w_1, \dots, w_p$ from q and p i.i.d. samples $b_1, \dots, b_p$ from the uniform distribution on $[0, 2\pi]$;
define the mapping
$$x \mapsto \psi(x) = \sqrt{\frac{2}{p}} \left[ \cos(w_1^\top x + b_1), \dots, \cos(w_p^\top x + b_p) \right]^\top\,.$$
Then, we have that
$$\kappa(x - y) \approx \langle \psi(x), \psi(y) \rangle_{\mathbb R^p}\,.$$
574 / 635
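A minimal sketch of the recipe for the Gaussian kernel $e^{-\|x-y\|^2/(2\sigma^2)}$, whose spectral density q is $\mathcal N(0, I/\sigma^2)$ (sizes and σ are illustrative choices):

    import numpy as np

    def rff_features(X, p=500, sigma=1.0, seed=0):
        """Random Fourier features for the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
        rng = np.random.RandomState(seed)
        d = X.shape[1]
        W = rng.randn(p, d) / sigma               # w_j ~ q = N(0, I / sigma^2)
        b = rng.uniform(0, 2 * np.pi, p)          # b_j ~ U[0, 2 pi]
        return np.sqrt(2.0 / p) * np.cos(X @ W.T + b)

    rng = np.random.RandomState(1)
    X = rng.randn(300, 5)
    Psi = rff_features(X, p=2000)
    K_approx = Psi @ Psi.T
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_exact = np.exp(-d2 / 2.0)                   # sigma = 1
    print(np.abs(K_approx - K_exact).max())       # error decreases as p grows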
Random Fourier features (4/5)
Theorem, [Rahimi and Recht, 2007]
On any compact subset $\mathcal X$ of $\mathbb R^m$, for all $\varepsilon > 0$,
$$P\left[ \sup_{x, y \in \mathcal X} \left| \kappa(x - y) - \langle \psi(x), \psi(y) \rangle_{\mathbb R^p} \right| \ge \varepsilon \right] \le 2^8 \left( \frac{\sigma_q\, \mathrm{diam}(\mathcal X)}{\varepsilon} \right)^2 e^{-\frac{p \varepsilon^2}{4(m+2)}}\,,$$
where $\sigma_q^2 = \mathbb E_{w \sim q}[w^\top w]$.
Remarks
The convergence is uniform, not data dependent;
Take the sequence $\varepsilon_p = \sqrt{\frac{\log p}{p}}\, \sigma_q\, \mathrm{diam}(\mathcal X)$; then the term on the right converges to zero when p grows to infinity;
Prediction functions built with random Fourier features are not in $\mathcal H$.
575 / 635
Random Fourier features (5/5)
Ingredients of the proof
For a fixed pair of points x, y, Hoeffding's inequality says that
$$P\Big[ \big| \underbrace{\kappa(x - y) - \langle \psi(x), \psi(y) \rangle_{\mathbb R^p}}_{f(x,y)} \big| \ge \varepsilon \Big] \le 2\, e^{-\frac{p \varepsilon^2}{4}}\,.$$
Cover $\mathcal X \times \mathcal X$ with balls of radius r and apply a union bound over the ball centers.
Glue things together: control the probability for points (x, y) inside each ball (using the Lipschitz continuity of f), and adjust the radius r (a bit technical).
576 / 635
Outline
2 Kernel tricks
577 / 635
Adaline: a physical neural net for least square regression
Figure: Adaline, [Widrow and Hoff, 1960]: A physical device that performs least
square regression using stochastic gradient descent.
578 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function $f : \mathbb R^p \to \mathbb R$ given labeled training data $(x_i, y_i)_{i=1,\dots,n}$ with $x_i$ in $\mathbb R^p$ and $y_i$ in $\mathbb R$:
$$\min_{f \in \mathcal F}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda \Omega(f)}_{\text{regularization}}\,.$$
579 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function $f : \mathbb R^p \to \mathbb R$ given labeled training data $(x_i, y_i)_{i=1,\dots,n}$ with $x_i$ in $\mathbb R^p$ and $y_i$ in $\mathbb R$:
$$\min_{f \in \mathcal F}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda \Omega(f)}_{\text{regularization}}\,.$$
580 / 635
A quick zoom on convolutional neural networks
581 / 635
A quick zoom on convolutional neural networks
Figure: Picture from Yann LeCun's tutorial, based on Zeiler and Fergus [2014].
582 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
Nonetheless...
they are the focus of a huge academic and industrial effort;
there is efficient and well-documented open-source software.
583 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.
584 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.
But...
poor scalability in n, at least $O(n^2)$;
decoupling of data representation and learning may not be a good
thing, according to recent supervised deep learning success.
584 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
Remark
there exist already successful data-adaptive kernels that rely on probabilistic models, e.g., the Fisher kernel.
[Jaakkola and Haussler, 1999, Perronnin and Dance, 2007].
585 / 635
Some more motivation
Longer term objectives
build a kernel for images (abstract object), for which we can
precisely quantify the invariance, stability to perturbations,
recovery, and complexity properties.
build deep networks which can be easily regularized.
build deep networks for structured objects (graph, sequences)...
add more geometric interpretation to deep networks.
...
586 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel $K_1 : \mathcal X^2 \to \mathbb R$ and its RKHS $\mathcal H_1$ with mapping $\varphi_1 : \mathcal X \to \mathcal H_1$. Consider also a p.d. kernel $K_2 : \mathcal H_1^2 \to \mathbb R$ and its RKHS $\mathcal H_2$ with mapping $\varphi_2 : \mathcal H_1 \to \mathcal H_2$. Then, $K_3 : \mathcal X^2 \to \mathbb R$ below is also p.d.:
$$K_3(x, x') := K_2(\varphi_1(x), \varphi_1(x'))\,.$$
587 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel $K_1 : \mathcal X^2 \to \mathbb R$ and its RKHS $\mathcal H_1$ with mapping $\varphi_1 : \mathcal X \to \mathcal H_1$. Consider also a p.d. kernel $K_2 : \mathcal H_1^2 \to \mathbb R$ and its RKHS $\mathcal H_2$ with mapping $\varphi_2 : \mathcal H_1 \to \mathcal H_2$. Then, $K_3 : \mathcal X^2 \to \mathbb R$ below is also p.d.:
$$K_3(x, x') := K_2(\varphi_1(x), \varphi_1(x'))\,.$$
Examples
$$K_3(x, x') = e^{-\frac{1}{2\sigma^2} \|\varphi_1(x) - \varphi_1(x')\|_{\mathcal H_1}^2}\,.$$
587 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on Rp may be defined as an expectation
589 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on $\mathbb R^p$ may be defined as an expectation, $K(x, y) = \mathbb E_w[s_w(x)\, s_w(y)]$, for some family of simple functions $s_w$ (one "neuron" per w).
Gaussian kernel
$$e^{-\frac{1}{2\sigma^2} \|x - y\|_2^2} = e^{-\frac{\|x\|_2^2}{\sigma^2}}\, e^{-\frac{\|y\|_2^2}{\sigma^2}}\, \mathbb E_w\!\left[ e^{\frac{2}{\sigma^2} w^\top x}\, e^{\frac{2}{\sigma^2} w^\top y} \right] \qquad \text{with } w \sim \mathcal N\!\left(0, (\sigma^2/4) I\right)\,.$$
589 / 635
Basic principles of deep kernel machines: infinite NN
Example, arc-cosine kernels
$$K(x, y) \propto \mathbb E_w\!\left[ \max(w^\top x, 0)\, \max(w^\top y, 0) \right] \qquad \text{with } w \sim \mathcal N(0, I)\,,$$
Remarks
infinite neural nets were discovered by Neal, 1994; then revisited
many times [Le Roux, 2007, Cho and Saul, 2009].
the concept does not lead to more powerful kernel methods...
590 / 635
Basic principles of DKM: dot-product kernels
Another basic link between kernels and neural networks can be obtained
by considering dot-product kernels.
Proposition
Let $\mathcal X = \mathbb S^{d-1}$ be the unit sphere of $\mathbb R^d$. The kernel $K : \mathcal X^2 \to \mathbb R$
$$K(x, y) = \kappa(\langle x, y \rangle_{\mathbb R^d})$$
is positive definite as soon as κ admits a Taylor expansion with non-negative coefficients, $\kappa(u) = \sum_{j \ge 0} a_j u^j$ with $a_j \ge 0$.
Remarks
the proposition holds if $\mathcal X$ is the unit sphere of some Hilbert space and $\langle x, y \rangle_{\mathbb R^d}$ is replaced by the corresponding inner product.
591 / 635
Basic principles of DKM: dot-product kernels
The Nystrom method consists of replacing any point $\varphi(x)$ in $\mathcal H$, for x in $\mathcal X$, by its orthogonal projection onto the finite-dimensional subspace
$$\mathcal F = \mathrm{Span}\big( \varphi(z_1), \dots, \varphi(z_p) \big)$$
spanned by a set of anchor points $z_1, \dots, z_p$
[Williams and Seeger, 2001, Smola and Scholkopf, 2000, Fine and Scheinberg, 2001].
592 / 635
Basic principles of DKM: dot-product kernels
The projection is equivalent to
$$\Pi_{\mathcal F}[x] := \sum_{j=1}^p \beta_j^\star \varphi(z_j) \qquad \text{with} \qquad \beta^\star \in \operatorname*{argmin}_{\beta \in \mathbb R^p} \left\| \varphi(x) - \sum_{j=1}^p \beta_j \varphi(z_j) \right\|_{\mathcal H}^2\,,$$
which leads to the finite-dimensional mapping
$$\psi(x) = \kappa(Z^\top Z)^{-1/2}\, \kappa(Z^\top x)\,,$$
where the function κ is applied pointwise to its arguments. The resulting ψ can be interpreted as a neural network performing (i) a linear operation, (ii) a pointwise non-linearity, (iii) a linear operation.
593 / 635
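A minimal sketch of this interpretation, with an illustrative dot-product kernel $\kappa(u) = e^{\alpha(u-1)}$ (a Gaussian kernel on the unit sphere); Z, α and the data are arbitrary choices, not the course's parameters:

    import numpy as np
    from scipy.linalg import sqrtm

    def kappa(u, alpha=2.0):
        """Dot-product kernel applied pointwise: kappa(u) = exp(alpha (u - 1))."""
        return np.exp(alpha * (u - 1.0))

    def nystrom_layer(X, Z, alpha=2.0):
        """psi(x) = kappa(Z^T Z)^{-1/2} kappa(Z^T x):
        linear map, pointwise non-linearity, then another linear map."""
        A = np.real(np.linalg.inv(sqrtm(kappa(Z @ Z.T, alpha))))   # kappa(Z^T Z)^{-1/2}
        return kappa(X @ Z.T, alpha) @ A

    # Unit-norm data and anchor points (the kernel is defined on the sphere).
    rng = np.random.RandomState(0)
    X = rng.randn(200, 10); X /= np.linalg.norm(X, axis=1, keepdims=True)
    Z = rng.randn(30, 10);  Z /= np.linalg.norm(Z, axis=1, keepdims=True)

    Psi = nystrom_layer(X, Z)
    K_approx = Psi @ Psi.T
    K_exact = kappa(X @ X.T)
    print(np.abs(K_approx - K_exact).mean())      # approximation quality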
Convolutional kernel networks [Mairal et al., 2014, Mairal, 2016]
594 / 635
Related work
proof of concept for combining kernels and deep learning [Cho and
Saul, 2009];
hierarchical kernel descriptors [Bo et al., 2011];
other multilayer models [Bouvrie et al., 2009, Montavon et al., 2011,
Anselmi et al., 2015];
deep Gaussian processes [Damianou and Lawrence, 2013].
multilayer PCA [Scholkopf et al., 1998].
old kernels for images [Scholkopf, 1997].
RBF networks [Broomhead and Lowe, 1988].
595 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$; $I_1$ is obtained from $I_0$ by extracting patches $P_0$, mapping them to $\mathcal H_1$, and linear pooling.]
596 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
How do we go from $I_0 : \Omega_0 \to \mathcal H_0$ to $I_1 : \Omega_1 \to \mathcal H_1$?
597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
How do we go from $I_0 : \Omega_0 \to \mathcal H_0$ to $I_1 : \Omega_1 \to \mathcal H_1$?
597 / 635
The multilayer convolutional kernel
Going from $I_0$ to $I_{0.5}$: kernel trick
Patches of size $e_0 \times e_0$ can be defined as elements of the Cartesian product $\mathcal P_0 := \mathcal H_0^{e_0 \times e_0}$ endowed with its natural inner product.
Define a p.d. kernel on such patches: for all $x, x'$ in $\mathcal P_0$,
$$K_1(x, x') = \|x\|_{\mathcal P_0} \|x'\|_{\mathcal P_0}\, \kappa_1\!\left( \frac{\langle x, x' \rangle_{\mathcal P_0}}{\|x\|_{\mathcal P_0} \|x'\|_{\mathcal P_0}} \right) \ \text{if } x, x' \neq 0 \quad \text{and} \quad 0 \ \text{otherwise.}$$
598 / 635
The multilayer convolutional kernel
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$, obtained by patch extraction, kernel mapping, and linear pooling.]
599 / 635
The multilayer convolutional kernel
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$, obtained by patch extraction, kernel mapping, and linear pooling.]
Finally,
We may now repeat the process and build $I_0, I_1, \dots, I_k$,
and obtain the multilayer convolutional kernel
$$K(I_k, I_k') = \sum_{\omega \in \Omega_k} \langle I_k(\omega), I_k'(\omega) \rangle_{\mathcal H_k}\,.$$
600 / 635
The multilayer convolutional kernel
In summary
The multilayer convolutional kernel builds upon similar principles as
a convolutional neural net (multiscale, local stationarity).
It remains a conceptual object due to its high complexity.
Learning and modelling are still decoupled.
Let us first address the second point (scalability).
601 / 635
Unsupervised learning for convolutional kernel networks
Learn linear subspaces of finite dimension where we project the data.
[Figure: from the map $M_0$, each patch x is mapped to the Hilbert space $\mathcal H_1$ by the kernel trick, projected onto the finite-dimensional subspace $\mathcal F_1$ (map $M_{0.5}$), and linearly pooled to produce $M_1$.]
602 / 635
Unsupervised learning for convolutional kernel networks
Formally, this means using the Nystrom approximation.
We now manipulate finite-dimensional maps $M_j : \Omega_j \to \mathbb R^{p_j}$.
Every linear subspace is parametrized by anchor points
$$\mathcal F_j := \mathrm{Span}\big( \varphi(z_{j,1}), \dots, \varphi(z_{j,p_j}) \big)\,,$$
where the $z_{j,i}$'s are in $\mathbb R^{p_{j-1} e_{j-1}^2}$ for patches of size $e_{j-1} \times e_{j-1}$.
The encoding function at layer j is
$$\psi_j(x) := \|x\|\, \kappa_j(Z_j^\top Z_j)^{-1/2}\, \kappa_j\!\left( Z_j^\top \frac{x}{\|x\|} \right) \ \text{if } x \neq 0 \quad \text{and} \quad 0 \ \text{otherwise,}$$
603 / 635
Unsupervised learning for convolutional kernel networks
The pooling operation keeps points in the linear subspace $\mathcal F_j$, and pooling $M_{0.5} : \Omega_0 \to \mathbb R^{p_1}$ is equivalent to pooling $I_{0.5} : \Omega_0 \to \mathcal H_1$.
[Figure: same diagram as before: kernel trick, projection onto $\mathcal F_1$, and linear pooling.]
604 / 635
Unsupervised learning for convolutional kernel networks
How do we learn the filters with no supervision?
we learn one layer at a time, starting from the bottom one;
we extract a large number (say 100 000) of patches from layer j-1 computed on an image database and normalize them;
perform a spherical K-means algorithm to learn the filters $Z_j$;
compute the projection matrix $\kappa_j(Z_j^\top Z_j)^{-1/2}$.
Remarks
with kernels, we map patches in infinite dimension; with the
projection, we manipulate finite-dimensional objects.
we obtain an unsupervised convolutional net with a geometric
interpretation, where we perform projections in the RKHSs.
605 / 635
Unsupervised learning for convolutional kernel networks
Remark on input image pre-processing
Unsupervised CKNs are sensitive to pre-processing; we have tested
RAW RGB input;
local centering of every color channel;
local whitening of each color channel;
2D image gradients.
With 2D image gradients, each pixel carries a two-dimensional vector
$$x = \rho\, [\cos(\theta), \sin(\theta)]\,,$$
where ρ is the gradient magnitude and θ its orientation, and the filters can be chosen as
$$z_j = [\cos(\theta_j), \sin(\theta_j)] \qquad \text{with } \theta_j = 2\pi j / p_0\,.$$
Then, the vector $\psi(x) = \|x\|\, \kappa(Z^\top Z)^{-1/2} \kappa\!\left( Z^\top \frac{x}{\|x\|} \right)$ can be interpreted as a soft-binning of the gradient orientation.
After pooling, the representation of this first layer is very close to SIFT/HOG descriptors [see Bo et al., 2011].
607 / 635
Convolutional kernel networks with supervised learning
How do we learn the filters with supervision?
Given a kernel K and RKHS $\mathcal H$, the ERM objective is
$$\min_{f \in \mathcal H}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\frac{\lambda}{2} \|f\|_{\mathcal H}^2}_{\text{regularization}}\,.$$
609 / 635
Convolutional kernel networks
In summary
a multilayer kernel for images, which builds upon similar principles
as a convolutional neural net (multiscale, local stationarity).
A new type of convolutional neural network with a geometric
interpretation: orthogonal projections in RKHS.
Learning may be unsupervised: align subspaces with data.
Learning may be supervised: subspace learning in RKHSs.
610 / 635
Image classification
Experiments were conducted on classical deep learning datasets, on
CPUs with no model averaging and no data augmentation.
Dataset     #classes   im. size   n_train    n_test
CIFAR-10    10         32 x 32    50 000     10 000
SVHN        10         32 x 32    604 388    26 032
Remarks on CIFAR-10
10% is the standard good result for single model with no data
augmentation.
the best unsupervised architecture has two layers, is wide
(1024-16384 filters), and achieves 14.2%;
611 / 635
Image super-resolution
The task is to predict a high-resolution image y from a low-resolution one x. This may be formulated as a multivariate regression problem.
612 / 635
Image super-resolution
The task is to predict a high-resolution image y from a low-resolution one x. This may be formulated as a multivariate regression problem.
612 / 635
Image super-resolution
Fact.   Dataset   Bicubic   SC      CNN     CSCN    SCKN
x2      Set5      33.66     35.78   36.66   36.93   37.07
x2      Set14     30.23     31.80   32.45   32.56   32.76
x2      Kodim     30.84     32.19   32.80   32.94   33.21
x3      Set5      30.39     31.90   32.75   33.10   33.08
x3      Set14     27.54     28.67   29.29   29.41   29.50
x3      Kodim     28.43     29.21   29.64   29.76   29.88
[Zeyde et al., 2010, Dong et al., 2016, Wang et al., 2015, Kim et al., 2016].
613 / 635
Image super-resolution
614 / 635
Image super-resolution
Figure: Bicubic
615 / 635
Image super-resolution
Figure: SCKN
615 / 635
Image super-resolution
616 / 635
Image super-resolution
Figure: Bicubic
617 / 635
Image super-resolution
Figure: SCKN
617 / 635
Image super-resolution
618 / 635
Image super-resolution
Figure: Bicubic
619 / 635
Image super-resolution
Figure: SCKN
619 / 635
Image super-resolution
620 / 635
Image super-resolution
Figure: Bicubic
621 / 635
Image super-resolution
Figure: SCKN
621 / 635
Conclusion of the course
622 / 635
What we saw
Basic definitions of p.d. kernels and RKHS
How to use RKHS in machine learning
The importance of the choice of kernels, and how to include prior
knowledge there.
Several approaches for kernel design (there are many!)
Review of kernels for strings and on graphs
Recent research topics about kernel methods
623 / 635
What we did not see
624 / 635
References I
F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical
kernel machines. arXiv preprint arXiv:1508.01084, 2015.
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337-404, 1950.
URL http://www.jstor.org/stable/1990404.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In Proceedings of the Twenty-First International Conference on
Machine Learning, page 6, New York, NY, USA, 2004. ACM. doi:
http://doi.acm.org/10.1145/1015330.1015424.
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Technical
Report 638, UC Berkeley Statistics, 2003.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups.
Springer-Verlag, New-York, 1984.
L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In
Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05:
Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81,
Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi:
http://dx.doi.org/10.1109/ICDM.2005.132.
J. V. Bouvrie, L. Rosasco, and T. Poggio. On invariance in hierarchical models. In Adv. NIPS,
2009.
625 / 635
References II
D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation
and adaptive networks. Technical report, DTIC Document, 1988.
Y. Cho and L. K. Saul. Kernel methods for deep learning. In Adv. NIPS, 2009.
M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Networks, 18(4):
1111-1123, 2005. doi: 10.1016/j.neunet.2005.07.010. URL
http://dx.doi.org/10.1016/j.neunet.2005.07.010.
M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn. Res.,
6:1169-1198, 2005. URL http://jmlr.csail.mit.edu/papers/v6/cuturi05a.html.
A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc. AISTATS, 2013.
A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with
support for non-strongly convex composite objectives. In Advances in Neural Information
Processing Systems (NIPS), pages 1646-1654, 2014.
C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional
networks. IEEE T. Pattern Anal., 38(2):295-307, 2016.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J.
Mach. Learn. Res., 2:243-264, 2001.
626 / 635
References III
T. Gartner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient
alternatives. In B. Scholkopf and M. Warmuth, editors, Proceedings of the Sixteenth
Annual Conference on Computational Learning Theory and the Seventh Annual Workshop
on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129-143,
Heidelberg, 2003. Springer. doi: 10.1007/b12006. URL
http://dx.doi.org/10.1007/b12006.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2007), pages 1-8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL
http://dx.doi.org/10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning
techniques for the identification of mutagenicity inducing substructures and structure
activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):
1402-11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.
T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting
Remote Protein Homologies. J. Comput. Biol., 7(1,2):95-114, 2000. URL
http://www.cse.ucsc.edu/research/compbio/discriminative/Jaakola2-1998.ps.
627 / 635
References IV
T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In
Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999.
URL http://www.cse.ucsc.edu/research/ml/papers/Jaakola.ps.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In
T. Faucett and N. Mishra, editors, Proceedings of the Twentieth International Conference
on Machine Learning, pages 321-328, New York, NY, USA, 2003. AAAI Press.
J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep
convolutional networks. In Proc. CVPR, 2016.
T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In
R. Lathtop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics
2002, pages 112-122. Universal Academic Press, 2002. URL
http://www.jsbi.org/journal/GIW02/GIW02F012.html.
R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In
Proceedings of the Nineteenth International Conference on Machine Learning, pages
315-322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel
matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004a. URL
http://www.jmlr.org/papers/v5/lanckriet04a.html.
628 / 635
References V
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical
framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004b. doi:
10.1093/bioinformatics/bth294. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/16/2626.
Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood - computing hilbert space expansions in
loglinear time. In Proceedings of the 30th International Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, pages
244-252, 2013. URL http://jmlr.org/proceedings/papers/v28/le13.html.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J.
Mach. Learn. Res., 5:1435-1455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein
classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein,
editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575,
Singapore, 2002. World Scientific.
L. Liao and W. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines
for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol.,
10(6):857-868, 2003. URL
http://www.liebertonline.com/doi/abs/10.1089/106652703322756113.
629 / 635
References VI
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In
Advances in Neural Information Processing Systems (NIPS), 2015.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification
using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. URL
http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology
Detection. Technical Report CRL 2001/05, Compaq Cambridge Research laboratory, June
2001.
P. Mahe and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75
(1):3-35, 2009. doi: 10.1007/s10994-008-5086-2. URL
http://dx.doi.org/10.1007/s10994-008-5086-2.
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph
kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First
International Conference on Machine Learning (ICML 2004), pages 552–559. ACM Press,
2004.
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular
structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model.,
45(4):939–951, 2005. doi: 10.1021/ci050039t. URL
http://dx.doi.org/10.1021/ci050039t.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
630 / 635
References VII
J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In
Advances in Neural Information Processing Systems, pages 1399–1407, 2016.
J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In
Advances in Neural Information Processing Systems, 2014.
C. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach. Learn.
Res., 6:1099–1125, 2005. URL http://jmlr.org/papers/v6/micchelli05a.html.
G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of
Machine Learning Research, 12(Sep):2563–2581, 2011.
C. Müller. Analysis of spherical symmetries in Euclidean spaces, volume 129 of Applied
Mathematical Sciences. Springer, 1998.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic
Publishers, 2004.
A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.
F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. NIPS, 2007.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res.,
9:2491–2521, 2008. URL http://jmlr.org/papers/v9/rakotomamonjy08a.html.
631 / 635
References VIII
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and
L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs,
Trees and Sequences, pages 65–74, 2003.
F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray
data using gene networks. BMC Bioinformatics, 8:35, 2007. doi: 10.1186/1471-2105-8-35.
URL http://dx.doi.org/10.1186/1471-2105-8-35.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string
alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average
gradient. Mathematical Programming, 2016.
B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002. URL
http://www.learning-with-kernels.org.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT
Press, Cambridge, Massachusetts, 2004.
632 / 635
References IX
M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform.
Process. Syst., volume 14, pages 905–912, 2002.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for
regularized loss minimization. Mathematical Programming, 2015.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA, 2004a.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University
Press, 2004b.
N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural
Information Processing Systems, pages 1660–1668, 2009.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient
graphlet kernels for large graph comparison. In 12th International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida USA,
2009. Society for Artificial Intelligence and Statistics.
N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-Lehman graph kernels. The Journal of Machine Learning Research, 12:
2539–2561, 2011.
T. Smith and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol.,
147:195–197, 1981.
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.
633 / 635
References X
K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative
kernel from probabilistic models. Neural Computation, 14(10):2397–2414, 2002a. doi:
10.1162/08997660260293274. URL http://dx.doi.org/10.1162/08997660260293274.
K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics,
18:S268–S275, 2002b.
V. N. Vapnik. Statistical Learning Theory. Wiley, New-York, 1998.
A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In
B. Schölkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology,
pages 131–154. MIT Press, Cambridge, Massachusetts, 2004.
J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss,
B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages
1401–1408, Cambridge, MA, 2006. MIT Press.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution
with sparse prior. In Proc. ICCV, 2015.
B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra
arising during this reduction. Nauchno-Technicheskaya Informatsia, Ser. 2, 9, 1968.
634 / 635
References XI
B. Widrow and M. E. Hoff. Adaptive switching circuits. In IRE WESCON convention record,
volume 4, pages 96–104. New York, 1960.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Adv.
NIPS, 2001.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance
reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic
data: a supervised approach. Bioinformatics, 20:i363–i370, 2004. URL
http://bioinformatics.oupjournals.org/cgi/reprint/19/suppl_1/i323.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision (ECCV), 2014.
R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In
Curves and Surfaces, pages 711–730. 2010.
635 / 635