
Machine Learning

with Kernel Methods


Julien Mairal (Inria)

Jean-Philippe Vert (Institut Curie, Mines ParisTech)

Nino Shervashidze (Institut Curie, Mines ParisTech)

Julien Mairal (Inria) 1/564




History of the course

A large part of the course material is due to Jean-Philippe Vert, who gave the course from 2004 to 2015 and who is on sabbatical at UC Berkeley in 2016.

Over the years, the course has become more and more exhaustive, and the slides are probably one of the best references available on kernels.
This is a course with a fairly large amount of math, but it remains accessible to computer scientists who have heard what a Hilbert space is (at least once in their life).

Julien Mairal (Inria) 2/564


Starting point: what we know how to solve

Julien Mairal (Inria) 3/564


Or

Julien Mairal (Inria) 4/564


But real data are often more complicated...

Julien Mairal (Inria) 5/564


Main goal of this course

Extend well-understood, linear statistical learning techniques to real-world, complicated, structured, high-dimensional data (images, texts, time series, graphs, distributions, permutations...).

Julien Mairal (Inria) 6/564



A concrete supervised learning problem
Regularized empirical risk minimization formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

A simple parametrization when X = R^p and Y = {−1, +1}:

F = {f_w : w ∈ R^p} where the f_w's are linear: f_w : x ↦ x^T w.
The regularization is the simple Euclidean norm Ω(f_w) = ‖w‖₂².

Julien Mairal (Inria) 7/564


A concrete supervised learning problem
This simple setting corresponds to many well-studied formulations.
Ridge regression:      min_{w ∈ R^p}  (1/n) Σ_{i=1}^n ½ (y_i − w^T x_i)²  +  λ‖w‖₂².

Linear SVM:            min_{w ∈ R^p}  (1/n) Σ_{i=1}^n max(0, 1 − y_i w^T x_i)  +  λ‖w‖₂².

Logistic regression:   min_{w ∈ R^p}  (1/n) Σ_{i=1}^n log(1 + e^{−y_i w^T x_i})  +  λ‖w‖₂².

Julien Mairal (Inria) 8/564
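As a minimal illustration (my addition, not part of the original slides), the ridge-regression formulation above has a closed form: setting the gradient of (1/n) Σ_i ½(y_i − w^T x_i)² + λ‖w‖₂² to zero gives w = (X^T X + 2nλI)^{−1} X^T y. A NumPy sketch:

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Minimize (1/n) sum_i 0.5*(y_i - w^T x_i)^2 + lam*||w||_2^2.

    Setting the gradient to zero gives (X^T X + 2*n*lam*I) w = X^T y.
    """
    n, p = X.shape
    return np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(p), X.T @ y)

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
w_hat = ridge_regression(X, y, lam=0.01)
```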


A concrete supervised learning problem
Unfortunately, linear models often perform poorly unless the problem
features are well-engineered or the problem is very simple.
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

First approach to work with a non-linear functional space F

The “deep learning” space F is parametrized as follows:

    f(x) = σ_k(A_k σ_{k−1}(A_{k−1} … σ_2(A_2 σ_1(A_1 x)) …)).

Finding the optimal A_1, A_2, …, A_k involves solving an (intractable) non-convex optimization problem in huge dimension.

Julien Mairal (Inria) 9/564


A concrete supervised learning problem

Figure: Example of a convolutional neural network, from LeCun et al. [1998].
What are the main limitations of neural networks?
Poor theoretical understanding.
They require cumbersome hyper-parameter tuning.
They are hard to regularize.
Despite these shortcomings, they have had an enormous success, thanks
to large amounts of labeled data, computational power and engineering.
Julien Mairal (Inria) 10/564
A concrete supervised learning problem
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

Second approach based on kernels


Works with possibly infinite-dimensional functional spaces F;
Works with non-vectorial structured data sets X such as graphs;
Regularization is natural and easy.

Current limitations (and open research topics)


Lack of scalability with n (traditionally O(n²));
Lack of adaptivity to data and task.

Julien Mairal (Inria) 11/564


Organization of the course
Contents
1 Present the basic theory of kernel methods.
2 Develop a working knowledge of kernel engineering for specific data
and applications (graphs, biological sequences, images).
3 Introduce open research topics related to kernels such as large-scale
learning with kernels and “deep kernel learning”.

Practical
Course homepage with slides, schedules, homework etc...:
http://lear.inrialpes.fr/people/mairal/teaching/2015-2016/MVA/.
Evaluation: 50% homework + 50% data challenge.

Julien Mairal (Inria) 12/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

Julien Mairal (Inria) 13/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

Julien Mairal (Inria) 13/564


Outline
3 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

Julien Mairal (Inria) 14/564


Outline
3 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA
4 The Kernel Jungle
Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 14/564


Outline
3 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA
4 The Kernel Jungle
Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
5 Open Problems and Research Topics
Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels

Julien Mairal (Inria) 14/564


Part 1

Kernels and RKHS

Julien Mairal (Inria) 15/564


Overview
Motivations
Develop versatile algorithms to process and analyze data...
...without making any assumptions regarding the type of data
(vectors, strings, graphs, images, ...)

The approach
Develop methods based on pairwise comparisons.
By imposing constraints on the pairwise comparison function
(positive definite kernels), we obtain a general framework for
learning from data (optimization in RKHS).

Julien Mairal (Inria) 16/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 17/564


Representation by pairwise comparisons

[Figure: a set S ⊂ X of three strings, S = (aatcgagtcac, atggacgtct, tgcactact), is represented by its pairwise comparison matrix]

         ⎡ 1    0.5   0.3 ⎤
    K =  ⎢ 0.5  1     0.6 ⎥
         ⎣ 0.3  0.6   1   ⎦

Idea
Define a “comparison function”: K : X × X 7→ R.
Represent a set of n data points S = {x1 , x2 , . . . , xn } by the n × n
matrix:
[K]ij := K (xi , xj )

Julien Mairal (Inria) 18/564
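As an illustration of this idea (my addition, not from the slides), the sketch below builds the n × n matrix [K]_ij = K(x_i, x_j) for a small set of strings, using a simple comparison function: the normalized inner product between bag-of-3-mer count vectors. Since it is an explicit (normalized) inner product, it is a valid p.d. kernel, in the spirit of the spectrum kernel for sequences.

```python
import numpy as np
from collections import Counter

def kmer_counts(s, k=3):
    """Bag-of-k-mers representation of a string."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def K(x, xp, k=3):
    """Comparison function: normalized inner product of k-mer counts
    (an explicit inner product, hence a p.d. kernel)."""
    cx, cxp = kmer_counts(x, k), kmer_counts(xp, k)
    dot = sum(cx[w] * cxp[w] for w in cx)
    nx = np.sqrt(sum(v * v for v in cx.values()))
    nxp = np.sqrt(sum(v * v for v in cxp.values()))
    return dot / (nx * nxp)

S = ["aatcgagtcac", "atggacgtct", "tgcactact"]
gram = np.array([[K(x, xp) for xp in S] for x in S])  # n x n similarity matrix
```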


Representation by pairwise comparisons
Remarks
Always an n × n matrix, whatever the nature of data: the same
algorithm will work for any type of data (vectors, strings, ...).
Total modularity between the choice of K and the choice of the
algorithm.
Poor scalability w.r.t. the dataset size (n2 )
We will restrict ourselves to a particular class of pairwise
comparison functions.

Julien Mairal (Inria) 19/564


Positive Definite (p.d.) Kernels
Definition
A positive definite (p.d.) kernel on the set X is a function
K : X × X → R that is symmetric:

    ∀ (x, x′) ∈ X²,   K(x, x′) = K(x′, x),

and which satisfies, for all N ∈ N, (x_1, x_2, …, x_N) ∈ X^N and (a_1, a_2, …, a_N) ∈ R^N:

    Σ_{i=1}^N Σ_{j=1}^N a_i a_j K(x_i, x_j) ≥ 0.

Julien Mairal (Inria) 20/564


Similarity matrices of p.d. kernels
Remarks
Equivalently, a kernel K is p.d. if and only if, for any N ∈ N and
any set of points (x1 , x2 , . . . , xN ) ∈ X N , the similarity matrix
[K]ij := K (xi , xj ) is positive semidefinite.
Kernel methods are algorithms that take such matrices as input.

Julien Mairal (Inria) 21/564
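A quick numerical sanity check (my addition, not from the slides): for a candidate kernel, draw a few points, build the similarity matrix and verify that its smallest eigenvalue is nonnegative up to numerical precision. This does not prove positive definiteness, but it often catches invalid kernels.

```python
import numpy as np

def is_psd(points, kernel, tol=1e-10):
    """Check that the similarity matrix of `kernel` on `points` is PSD."""
    K = np.array([[kernel(x, y) for y in points] for x in points])
    return np.min(np.linalg.eigvalsh(K)) >= -tol

rng = np.random.default_rng(0)
pts = list(rng.standard_normal((20, 3)))
print(is_psd(pts, lambda x, y: float(x @ y)))                  # linear kernel: True
print(is_psd(pts, lambda x, y: float(np.linalg.norm(x - y))))  # a distance, not a p.d. kernel: False
```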



The simplest p.d. kernel
Lemma
Let X = R^d. The function K : X² → R defined by

    ∀ (x, x′) ∈ X²,   K(x, x′) = ⟨x, x′⟩_{R^d}

is p.d. (it is often called the linear kernel).

Proof
⟨x, x′⟩_{R^d} = ⟨x′, x⟩_{R^d},
Σ_{i=1}^N Σ_{j=1}^N a_i a_j ⟨x_i, x_j⟩_{R^d} = ‖ Σ_{i=1}^N a_i x_i ‖²_{R^d} ≥ 0.

Julien Mairal (Inria) 22/564



A more ambitious p.d. kernel
[Figure: a feature map φ from the input space X to a feature space F]

Lemma
Let X be any set, and Φ : X → R^d. Then, the function K : X² → R defined as follows is p.d.:

    ∀ (x, x′) ∈ X²,   K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{R^d}.

Proof
⟨Φ(x), Φ(x′)⟩_{R^d} = ⟨Φ(x′), Φ(x)⟩_{R^d},
Σ_{i=1}^N Σ_{j=1}^N a_i a_j ⟨Φ(x_i), Φ(x_j)⟩_{R^d} = ‖ Σ_{i=1}^N a_i Φ(x_i) ‖²_{R^d} ≥ 0.

Julien Mairal (Inria) 23/564


Example: polynomial kernel
[Figure: the feature map sends R², with coordinates (x_1, x_2), into R³]

For x = (x_1, x_2)^T ∈ R², let Φ(x) = (x_1², √2 x_1 x_2, x_2²) ∈ R³:

    K(x, x′) = x_1² x_1′² + 2 x_1 x_2 x_1′ x_2′ + x_2² x_2′²
             = (x_1 x_1′ + x_2 x_2′)²
             = (x · x′)².

Exercise: show that (x · x′)^d is p.d. for any d ∈ N.


Julien Mairal (Inria) 24/564
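A two-line numerical check of this example (my addition): the explicit map Φ(x) = (x_1², √2 x_1 x_2, x_2²) reproduces the kernel (x · x′)².

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose(phi(x) @ phi(xp), (x @ xp) ** 2)  # <Phi(x), Phi(x')> = (x.x')^2
```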
Conversely: Kernels as inner products
Theorem (Aronszajn, 1950)
K is a p.d. kernel on the set X if and only if there exists a Hilbert space
H and a mapping
    Φ : X → H

such that, for any x, x′ in X:

    K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

[Figure: a feature map φ from the input space X to a feature space F]

Julien Mairal (Inria) 25/564


In case of ...
Definitions
An inner product on an R-vector space H is a mapping (f, g) ↦ ⟨f, g⟩_H from H² to R that is bilinear, symmetric and such that ⟨f, f⟩_H > 0 for all f ∈ H\{0}.
A vector space endowed with an inner product is called pre-Hilbert. It is endowed with a norm defined as ‖f‖_H = ⟨f, f⟩_H^{1/2}.
A Hilbert space is a pre-Hilbert space complete for the norm ‖.‖_H, that is, such that any Cauchy sequence in H converges in H.
A Cauchy sequence (f_n)_{n≥0} is a sequence whose elements become progressively arbitrarily close to each other:

    lim_{N→+∞}  sup_{n,m≥N}  ‖f_n − f_m‖_H = 0.

Completeness is necessary to keep the “good” convergence properties of Euclidean spaces in an infinite-dimensional context.
Julien Mairal (Inria) 26/564
Proof: finite case
Proof
Assume X = {x1 , x2 , . . . , xN } is finite of size N.
Any p.d. kernel K : X × X → R is entirely defined by the N × N
symmetric positive semidefinite matrix [K]ij := K (xi , xj ).
It can therefore be diagonalized on an orthonormal basis of
eigenvectors (u1 , u2 , . . . , uN ), with non-negative eigenvalues
0 ≤ λ1 ≤ . . . ≤ λN , i.e.,
" N # N
X X
K (xi , xj ) = λl ul u> l = λl ul (i)ul (j) = hΦ (xi ) , Φ (xj )iRN ,
l=1 ij l=1

with  √ 
λ1 u1 (i)
..
Φ (xi ) =  . 
 
√ .
λN uN (i)
Julien Mairal (Inria) 27/564
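The construction in this proof is easy to reproduce numerically (sketch below, my addition): diagonalize the kernel matrix and read off the feature vectors as the rows of U diag(√λ).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
K = (X @ X.T + 1.0) ** 2              # a p.d. kernel matrix (polynomial kernel)

lam, U = np.linalg.eigh(K)            # K = U diag(lam) U^T, eigenvalues in ascending order
lam = np.clip(lam, 0.0, None)         # clip tiny negative values due to round-off
Phi = U * np.sqrt(lam)                # row i is Phi(x_i) in R^N

assert np.allclose(Phi @ Phi.T, K)    # <Phi(x_i), Phi(x_j)> recovers K(x_i, x_j)
```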
Proof: general case

Mercer (1909) for X = [a, b] ⊂ R (more generally X compact) and K continuous.
Kolmogorov (1941) for X countable.
Aronszajn (1944, 1950) for the general case.
We will go through the proof of the general case by introducing the
concept of Reproducing Kernel Hilbert Spaces (RKHS).

Julien Mairal (Inria) 28/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 29/564


RKHS Definition
Definition
Let X be a set and H ⊂ R^X be a class of functions forming a (real) Hilbert space with inner product ⟨.,.⟩_H. The function K : X² → R is called a reproducing kernel (r.k.) of H if
1  H contains all functions of the form

       ∀x ∈ X,   K_x : t ↦ K(x, t).

2  For every x ∈ X and f ∈ H the reproducing property holds:

       f(x) = ⟨f, K_x⟩_H.

If a r.k. exists, then H is called a reproducing kernel Hilbert space (RKHS).
Julien Mairal (Inria) 30/564

An equivalent definition of RKHS
Theorem
The Hilbert space H ⊂ RX is a RKHS if and only if for any x ∈ X , the
mapping:

    F : H → R
        f ↦ f(x)

is continuous.

Corollary
Convergence in a RKHS implies pointwise convergence, i.e., if (fn )n∈N
converges to f in H, then (fn (x))n∈N converges to f (x) for any x ∈ X .

Julien Mairal (Inria) 31/564


Proof
If H is a RKHS then f 7→ f (x) is continuous
If a r.k. K exists, then for any (x, f ) ∈ X × H:

    |f(x)| = |⟨f, K_x⟩_H|
           ≤ ‖f‖_H · ‖K_x‖_H          (Cauchy-Schwarz)
           ≤ ‖f‖_H · K(x, x)^{1/2},

because ‖K_x‖²_H = ⟨K_x, K_x⟩_H = K(x, x). Therefore f ∈ H ↦ f(x) ∈ R is a continuous linear mapping. □
Since F is linear, it is indeed sufficient to show that f → 0 ⇒ f(x) → 0.

Julien Mairal (Inria) 32/564


Proof (Converse)
If f 7→ f (x) is continuous then H is a RKHS
Conversely, let us assume that for any x ∈ X the linear form
f ∈ H 7→ f (x) is continuous.
Then by Riesz representation theorem (general property of Hilbert
spaces) there exists a unique gx ∈ H such that:

f (x) = hf , gx iH .

The function K (x, y) = gx (y) is then a r.k. for H. 

Julien Mairal (Inria) 33/564


Unicity of r.k. and RKHS
Theorem
If H is a RKHS, then it has a unique r.k.
Conversely, a function K can be the r.k. of at most one RKHS.

Consequence
This shows that we can talk of ”the” kernel of a RKHS, or ”the” RKHS
of a kernel.

Julien Mairal (Inria) 34/564



Proof
If a r.k. exists then it is unique
Let K and K 0 be two r.k. of a RKHS H. Then for any x ∈ X :

    ‖K_x − K′_x‖²_H = ⟨K_x − K′_x, K_x − K′_x⟩_H
                    = ⟨K_x − K′_x, K_x⟩_H − ⟨K_x − K′_x, K′_x⟩_H
                    = K_x(x) − K′_x(x) − K_x(x) + K′_x(x)
                    = 0.

This shows that K_x = K′_x as functions, i.e., K_x(y) = K′_x(y) for any y ∈ X. In other words, K = K′. □

The RKHS of a r.k. K is unique


Left as exercise.

Julien Mairal (Inria) 35/564


An important result
Theorem
A function K : X × X → R is p.d. if and only if it is a r.k.

Julien Mairal (Inria) 36/564


Proof
A r.k. is p.d.
1  A r.k. is symmetric because, for any (x, y) ∈ X²:

       K(x, y) = ⟨K_x, K_y⟩_H = ⟨K_y, K_x⟩_H = K(y, x).

2  It is p.d. because for any N ∈ N, (x_1, x_2, …, x_N) ∈ X^N, and (a_1, a_2, …, a_N) ∈ R^N:

       Σ_{i,j=1}^N a_i a_j K(x_i, x_j) = Σ_{i,j=1}^N a_i a_j ⟨K_{x_i}, K_{x_j}⟩_H
                                       = ‖ Σ_{i=1}^N a_i K_{x_i} ‖²_H
                                       ≥ 0.  □

Julien Mairal (Inria) 37/564


Proof
A p.d. kernel is a r.k. (1/4)
Let H_0 be the vector subspace of R^X spanned by the functions {K_x}_{x∈X}.
For any f, g ∈ H_0, given by:

    f = Σ_{i=1}^m a_i K_{x_i},    g = Σ_{j=1}^n b_j K_{y_j},

let:

    ⟨f, g⟩_{H_0} := Σ_{i,j} a_i b_j K(x_i, y_j).

Julien Mairal (Inria) 38/564


Proof
A p.d. kernel is a r.k. (2/4)
⟨f, g⟩_{H_0} does not depend on the expansion of f and g because:

    ⟨f, g⟩_{H_0} = Σ_{i=1}^m a_i g(x_i) = Σ_{j=1}^n b_j f(y_j).

This also shows that ⟨.,.⟩_{H_0} is a symmetric bilinear form.
This also shows that for any x ∈ X and f ∈ H_0:

    ⟨f, K_x⟩_{H_0} = f(x).

Julien Mairal (Inria) 39/564


Proof
A p.d. kernel is a r.k. (3/4)
K is assumed to be p.d., therefore:

    ‖f‖²_{H_0} = Σ_{i,j=1}^m a_i a_j K(x_i, x_j) ≥ 0.

In particular Cauchy-Schwarz is valid with ⟨.,.⟩_{H_0}.
By Cauchy-Schwarz we deduce that ∀x ∈ X:

    |f(x)| = |⟨f, K_x⟩_{H_0}| ≤ ‖f‖_{H_0} · K(x, x)^{1/2},

therefore ‖f‖_{H_0} = 0 ⟹ f = 0.
H_0 is therefore a pre-Hilbert space endowed with the inner product ⟨.,.⟩_{H_0}.

Julien Mairal (Inria) 40/564


Proof
A p.d. kernel is a r.k. (4/4)

For any Cauchy sequence (f_n)_{n≥0} in (H_0, ⟨.,.⟩_{H_0}), we note that:

    ∀ (x, m, n) ∈ X × N²,   |f_m(x) − f_n(x)| ≤ ‖f_m − f_n‖_{H_0} · K(x, x)^{1/2}.

Therefore for any x the sequence (fn (x))n≥0 is Cauchy in R and has
therefore a limit.
If we add to H0 the functions defined as the pointwise limits of
Cauchy sequences, then the space becomes complete and is
therefore a Hilbert space, with K as r.k. (up to a few technicalities,
left as exercise). 

Julien Mairal (Inria) 41/564


Application: back to Aronzsajn’s theorem
Theorem (Aronszajn, 1950)
K is a p.d. kernel on the set X if and only if there exists a Hilbert space
H and a mapping
    Φ : X → H,

such that, for any x, x′ in X:

    K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

[Figure: a feature map φ from the input space X to a feature space F]

Julien Mairal (Inria) 42/564


Proof of Aronzsajn’s theorem
Proof
If K is p.d. over a set X then it is the r.k. of a Hilbert space H ⊂ R^X.
Define the mapping Φ : X → H by:

    ∀x ∈ X,   Φ(x) = K_x.

By the reproducing property we have:

    ∀ (x, y) ∈ X²,   ⟨Φ(x), Φ(y)⟩_H = ⟨K_x, K_y⟩_H = K(x, y).  □

[Figure: a feature map φ from the input space X to a feature space F]

Julien Mairal (Inria) 43/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 44/564


The linear kernel
Take X = R^d and the linear kernel:

    K(x, y) = ⟨x, y⟩_{R^d}.

Theorem
The RKHS of the linear kernel is the set of linear functions of the form

    f_w(x) = ⟨w, x⟩_{R^d}   for w ∈ R^d,

endowed with the norm

    ‖f_w‖_H = ‖w‖₂.

Julien Mairal (Inria) 45/564


Proof
The RKHS of the linear kernel consists of functions:

    x ∈ R^d ↦ f(x) = Σ_i a_i ⟨x_i, x⟩_{R^d} = ⟨w, x⟩_{R^d},

with w = Σ_i a_i x_i.
The RKHS is therefore the set of linear forms endowed with the following inner product:

    ⟨f, g⟩_{H_K} = ⟨w, v⟩_{R^d},

when f(x) = w·x and g(x) = v·x.

Julien Mairal (Inria) 46/564


RKHS of the linear kernel (cont.)

    K_lin(x, x′) = x^T x′,
    f(x) = w^T x,
    ‖f‖_H = ‖w‖₂.

[Figure: level sets of linear functions with ‖f‖ = 2, 1, and 0.5]

Julien Mairal (Inria) 47/564



The polynomial kernel
We have already mentioned a generalization of the linear kernel: the
polynomial kernel of degree p:

    K_poly(x, y) = (⟨x, y⟩_{R^d} + c)^p.

Let us find its RKHS for p = 2 and c = 0.

First step: Look for an inner-product.

    K(x, y) = trace(x^T y x^T y)
            = trace(y^T x x^T y)
            = trace(x x^T y y^T)
            = ⟨x x^T, y y^T⟩_F,

where ⟨.,.⟩_F is the Frobenius inner product for matrices in R^{d×d}.


Julien Mairal (Inria) 48/564
The polynomial kernel
Second step: propose a candidate RKHS.
We know that H contains all the functions

    f(x) = Σ_i a_i K(x_i, x) = Σ_i a_i ⟨x_i x_i^T, x x^T⟩_F = ⟨ Σ_i a_i x_i x_i^T, x x^T ⟩_F.

Any symmetric matrix in R^{d×d} may be decomposed as Σ_i a_i x_i x_i^T. Our candidate RKHS H will be the set of quadratic functions

    f_S(x) = ⟨S, x x^T⟩_F = x^T S x   for S ∈ S^{d×d},

where S^{d×d} is the set of symmetric matrices in R^{d×d}, endowed with the inner-product ⟨f_{S_1}, f_{S_2}⟩_H = ⟨S_1, S_2⟩_F.

Julien Mairal (Inria) 49/564


The polynomial kernel
Third step: check that the candidate is a Hilbert space.
This step is trivial in the present case since it is easy to see that H is a Euclidean space. Sometimes, things are not so simple and we need to prove the completeness explicitly.

Fourth step: check that H is the RKHS.

H contains all the functions K_x : t ↦ K(x, t) = ⟨x x^T, t t^T⟩_F.
Moreover, we have for all f_S in H and x in X,

    f_S(x) = ⟨S, x x^T⟩_F = ⟨f_S, f_{x x^T}⟩_H = ⟨f_S, K_x⟩_H.

Remark
All points x in X are mapped to a rank-one matrix x x^T. Most points in H do not admit a pre-image.
Exercise: what is the RKHS of the general polynomial kernel?
Julien Mairal (Inria) 50/564
Combining kernels
Theorem
If K1 and K2 are p.d. kernels, then:

K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0,

are also p.d. kernels.

If (K_i)_{i≥1} is a sequence of p.d. kernels that converges pointwise to a function K:

    ∀ (x, x′) ∈ X²,   K(x, x′) = lim_{i→∞} K_i(x, x′),

then K is also a p.d. kernel.


Proof: left as exercise

Julien Mairal (Inria) 51/564
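A quick numerical illustration of the closure properties above (my addition): on a random sample, the Gram matrices of K₁ + K₂ and of the pointwise product K₁K₂ (a Hadamard product of PSD matrices, PSD by the Schur product theorem) remain positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))

K1 = X @ X.T                                                # linear kernel
sq = np.sum(X ** 2, axis=1)
K2 = np.exp(-(sq[:, None] + sq[None, :] - 2 * K1) / 2.0)    # Gaussian kernel, sigma = 1

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(K1 + K2) >= -1e-10)   # sum of p.d. kernels
print(min_eig(K1 * K2) >= -1e-10)   # pointwise (Hadamard) product
```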


Examples
Theorem
If K is a kernel, then e^K is a kernel too.
Proof:

    e^{K(x,x′)} = lim_{n→+∞} Σ_{i=0}^n K(x, x′)^i / i!

Julien Mairal (Inria) 52/564


Quiz: which of the following are p.d. kernels?
X = (−1, 1),  K(x, x′) = 1 / (1 − xx′)
X = N,        K(x, x′) = 2^{x+x′}
X = N,        K(x, x′) = 2^{xx′}
X = R+,       K(x, x′) = log(1 + xx′)
X = R,        K(x, x′) = exp(−|x − x′|²)
X = R,        K(x, x′) = cos(x + x′)
X = R,        K(x, x′) = cos(x − x′)
X = R+,       K(x, x′) = min(x, x′)
X = R+,       K(x, x′) = max(x, x′)
X = R+,       K(x, x′) = min(x, x′)/max(x, x′)
X = N,        K(x, x′) = GCD(x, x′)
X = N,        K(x, x′) = LCM(x, x′)
X = N,        K(x, x′) = GCD(x, x′)/LCM(x, x′)

Julien Mairal (Inria) 53/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 54/564


Remember the RKHS of the linear kernel

    K_lin(x, x′) = x^T x′,
    f(x) = w^T x,
    ‖f‖_H = ‖w‖₂.

[Figure: level sets of linear functions with ‖f‖ = 2, 1, and 0.5]

Julien Mairal (Inria) 55/564


Smoothness functional
A simple inequality
By Cauchy-Schwarz we have, for any function f ∈ H and any two
points x, x0 ∈ X :

    |f(x) − f(x′)| = |⟨f, K_x − K_{x′}⟩_H|
                   ≤ ‖f‖_H × ‖K_x − K_{x′}‖_H
                   = ‖f‖_H × d_K(x, x′).

The norm of a function in the RKHS controls how fast the function varies over X with respect to the geometry defined by the kernel (Lipschitz with constant ‖f‖_H).

Important message

Small norm =⇒ slow variations.

Julien Mairal (Inria) 56/564


Kernels and RKHS : Summary
P.d. kernels can be thought of as inner product after embedding
the data space X in some Hilbert space. As such a p.d. kernel
defines a metric on X .
A realization of this embedding is the RKHS, valid without
restriction on the space X nor on the kernel.
The RKHS is a space of functions over X . The norm of a function
in the RKHS is related to its degree of smoothness w.r.t. the metric
defined by the kernel on X .
We will now see some applications of kernels and RKHS in
statistics, before coming back to the problem of choosing (and
eventually designing) the kernel.

Julien Mairal (Inria) 57/564


Outline

1 Kernels and RKHS


Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
My first kernels
Smoothness functional
The kernel trick

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 58/564


The kernel trick
Choosing a p.d. kernel K on a set X amounts to embedding the
data in a Hilbert space: there exists a Hilbert space H and a
mapping Φ : X → H such that, for all x, x′ ∈ X,

    K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

However this mapping might not be explicitly given, nor convenient to work with in practice (e.g., large or even infinite dimensions).
A solution is to work implicitly in the feature space!

Kernel trick
Any algorithm to process finite-dimensional vectors that can be
expressed only in terms of pairwise inner products can be applied to
potentially infinite-dimensional vectors in the feature space of a p.d.
kernel by replacing each inner product evaluation by a kernel evaluation.

Julien Mairal (Inria) 59/564


Kernel trick Summary
Summary
The kernel trick is a trivial statement with important applications.
It can be used to obtain nonlinear versions of well-known linear
algorithms, e.g., by replacing the classical inner product by a
Gaussian kernel.
It can be used to apply classical algorithms to non vectorial data
(e.g., strings, graphs) by again replacing the classical inner product
by a valid kernel for the data.
It allows in some cases to embed the initial space to a larger feature
space and involve points in the feature space with no pre-image
(e.g., barycenter).

Julien Mairal (Inria) 60/564


Example 1: computing distances in the feature space

[Figure: two points x_1, x_2 ∈ X at distance d(x_1, x_2) are mapped to φ(x_1), φ(x_2) in the feature space F]

    d_K(x_1, x_2)² = ‖Φ(x_1) − Φ(x_2)‖²_H
                   = ⟨Φ(x_1) − Φ(x_2), Φ(x_1) − Φ(x_2)⟩_H
                   = ⟨Φ(x_1), Φ(x_1)⟩_H + ⟨Φ(x_2), Φ(x_2)⟩_H − 2⟨Φ(x_1), Φ(x_2)⟩_H

    d_K(x_1, x_2)² = K(x_1, x_1) + K(x_2, x_2) − 2K(x_1, x_2)

Julien Mairal (Inria) 61/564
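A small sketch of this computation (my addition), here with the Gaussian kernel used on the next slide:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, y, kernel):
    """d_K(x, y) = sqrt(K(x,x) + K(y,y) - 2 K(x,y)): distance in the feature space."""
    return np.sqrt(kernel(x, x) + kernel(y, y) - 2 * kernel(x, y))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(kernel_distance(x, y, gaussian_kernel))   # close to sqrt(2), since K(x, y) ~ 0 here
```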


Distance for the Gaussian kernel

The Gaussian kernel with bandwidth σ on R^d is:

    K(x, y) = e^{−‖x−y‖² / (2σ²)},

K(x, x) = 1 = ‖Φ(x)‖²_H, so all points are on the unit sphere in the feature space.
The distance between the images of two points x and y in the feature space is given by:

    d_K(x, y) = √( 2 (1 − e^{−‖x−y‖² / (2σ²)}) )

[Figure: plot of d_K(x, y) as a function of ‖x − y‖]

Julien Mairal (Inria) 62/564



Example 2: distance between a point and a set
Problem
Let S = (x1 , · · · , xn ) be a finite set of points in X .
How to define and compute the similarity between any point x in X
and the set S?
A solution
Map all points to the feature space.
Summarize S by the barycenter of the points:
    µ := (1/n) Σ_{i=1}^n Φ(x_i).

Define the distance between x and S by:

    d_K(x, S) := ‖Φ(x) − µ‖_H.

Julien Mairal (Inria) 63/564


Computation
[Figure: the points of S and x are mapped to the feature space F, where m denotes the barycenter]

Kernel trick

    d_K(x, S) = ‖ Φ(x) − (1/n) Σ_{i=1}^n Φ(x_i) ‖_H
              = √( K(x, x) − (2/n) Σ_{i=1}^n K(x, x_i) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n K(x_i, x_j) ).

Julien Mairal (Inria) 64/564
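A direct transcription of this formula (my addition), usable with any kernel function:

```python
import numpy as np

def dist_to_set(x, S, kernel):
    """Distance in the feature space between Phi(x) and the barycenter of Phi(S)."""
    n = len(S)
    term1 = kernel(x, x)
    term2 = sum(kernel(x, xi) for xi in S) / n
    term3 = sum(kernel(xi, xj) for xi in S for xj in S) / n ** 2
    return np.sqrt(term1 - 2 * term2 + term3)

gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
S = [np.array([2.0]), np.array([3.0])]
print(dist_to_set(np.array([2.5]), S, gauss))   # the 1D example of the following slides
```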


Remarks
Remarks
In general, the barycenter µ exists only in the feature space: it does not necessarily have a pre-image x_µ such that Φ(x_µ) = µ.
The distance obtained is a Hilbert metric (e.g., Pythagoras' theorem holds, etc.).

Julien Mairal (Inria) 65/564


1D illustration

S = {2, 3}
Plot f (x) = d(x, S)
[Figure: plots of f(x) = d(x, S) for three kernels: k(x, y) = xy (linear); k(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1; and the same Gaussian kernel with σ = 0.2]

Julien Mairal (Inria) 66/564


2D illustration

S = {(1, 1)^T, (1, 2)^T, (2, 2)^T}

Plot f(x) = d(x, S)

[Figure: level sets of f(x) = d(x, S) for three kernels: k(x, y) = xy (linear); k(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1; and the same Gaussian kernel with σ = 0.2]

Julien Mairal (Inria) 67/564


Application in discrimination

S_1 = {(1, 1)^T, (1, 2)^T} and S_2 = {(1, 3)^T, (2, 2)^T}

Plot f(x) = d(x, S_1)² − d(x, S_2)²

[Figure: level sets of f for three kernels: k(x, y) = xy (linear); k(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1; and the same Gaussian kernel with σ = 0.2]

Julien Mairal (Inria) 68/564


Example 3: Centering data in the feature space
Problem
Let S = (x1 , · · · , xn ) be a finite set of points in X endowed with a
p.d. kernel K . Let K be their n × n Gram matrix:
[K]ij = K (xi , xj ) .
Let µ = (1/n) Σ_{i=1}^n Φ(x_i) be their barycenter, and u_i = Φ(x_i) − µ for i = 1, …, n be the centered data in H.
How to compute the centered Gram matrix [K^c]_{i,j} = ⟨u_i, u_j⟩_H?

[Figure: the points are mapped to the feature space F and centered around their barycenter m]

Julien Mairal (Inria) 69/564


Computation
Kernel trick
A direct computation gives, for 1 ≤ i, j ≤ n:

    K^c_{i,j} = ⟨Φ(x_i) − µ, Φ(x_j) − µ⟩_H
              = ⟨Φ(x_i), Φ(x_j)⟩_H − ⟨µ, Φ(x_i) + Φ(x_j)⟩_H + ⟨µ, µ⟩_H
              = K_{i,j} − (1/n) Σ_{k=1}^n (K_{i,k} + K_{j,k}) + (1/n²) Σ_{k,l=1}^n K_{k,l}.

This can be rewritten in matrix form:

    K^c = K − UK − KU + UKU = (I − U) K (I − U),

where U_{i,j} = 1/n for 1 ≤ i, j ≤ n.

Julien Mairal (Inria) 70/564
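In code, centering a Gram matrix is one line (sketch, my addition):

```python
import numpy as np

def center_gram(K):
    """Return (I - U) K (I - U) with U_ij = 1/n, i.e. the Gram matrix of centered data."""
    n = K.shape[0]
    I_minus_U = np.eye(n) - np.full((n, n), 1.0 / n)
    return I_minus_U @ K @ I_minus_U

# sanity check against explicit centering in a finite-dimensional feature space
rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 3))
K = Phi @ Phi.T
Phi_c = Phi - Phi.mean(axis=0)
assert np.allclose(center_gram(K), Phi_c @ Phi_c.T)
```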


Part 2

Kernel Methods
Supervised Learning

Julien Mairal (Inria) 71/564



Back to classifying cats and dogs
Regularized empirical risk formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

A simple parametrization when X = R^p and Y = {−1, +1}:

F = {f_w : w ∈ R^p} where the f_w's are linear: f_w : x ↦ x^T w.
The regularization is the simple Euclidean norm Ω(f_w) = ‖w‖₂².

Julien Mairal (Inria) 72/564


Back to classifying cats and dogs
Regularized empirical risk formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

A simple parametrization when X = R^p and Y = {−1, +1}:

This is equivalent to using a linear kernel K(x, x′) = x^T x′.
In that case, F is the Hilbert space H of linear functions f_w : x ↦ x^T w and Ω(f_w) = ‖f_w‖²_H = ‖w‖₂².

Julien Mairal (Inria) 72/564


Back to classifying cats and dogs
Regularized empirical risk formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

What are the new perspectives with kernel methods?

being able to deal with non-linear functional spaces endowed with a natural regularization function ‖.‖²_H;
being able to deal with non-vectorial data (graphs, trees).

Julien Mairal (Inria) 72/564


Motivations
Two theoretical results underpin a family of powerful algorithms for data
analysis using positive definite kernels, collectively known as kernel
methods:
The kernel trick, based on the representation of p.d. kernels as
inner products,
the representer theorem, based on some properties of the
regularization functional defined by the RKHS norm.

An important property
When needed, the RKHS norm acts as a natural regularization function
that penalizes variations of functions.

Julien Mairal (Inria) 73/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 74/564
Back to classifying cats and dogs
Regularized empirical risk formulation with kernels
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ H}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ ‖f‖²_H,        (1)

where the first term is the empirical risk (data fit) and the second is the regularization.

Question: how to solve the above minimization problem?


A simple theorem, called “representer theorem” can turn (1) into a
concrete optimization problem in Rn .

Julien Mairal (Inria) 75/564


The Theorem
Representer Theorem
Let X be a set endowed with a p.d. kernel K , HK the corresponding
RKHS, and S = {x1 , · · · , xn } ⊂ X a finite set of points in X .
Let Ψ : Rn+1 → R be a function of n + 1 variables, strictly
increasing with respect to the last variable.
Then, any solution to the optimization problem:

    min_{f ∈ H_K}  Ψ(f(x_1), …, f(x_n), ‖f‖_{H_K}),        (2)

admits a representation of the form:

    ∀x ∈ X,   f(x) = Σ_{i=1}^n α_i K(x_i, x).        (3)

Julien Mairal (Inria) 76/564


Proof (1/2)

Let ξ(f, S) be the functional that is minimized in the statement of the representer theorem, and H_K^S the linear span in H_K of the vectors K_{x_i}, i.e.,

    H_K^S = { f ∈ H_K : f(x) = Σ_{i=1}^n α_i K(x_i, x),  (α_1, …, α_n) ∈ R^n }.

H_K^S is a finite-dimensional subspace, therefore any function f ∈ H_K can be uniquely decomposed as:

    f = f_S + f_⊥,

with f_S ∈ H_K^S and f_⊥ ⊥ H_K^S (by orthogonal projection).

Julien Mairal (Inria) 77/564


Proof (2/2)
H_K being a RKHS, it holds that:

    ∀i = 1, …, n,   f_⊥(x_i) = ⟨f_⊥, K(x_i, .)⟩_{H_K} = 0,

because K(x_i, .) ∈ H_K^S, therefore:

    ∀i = 1, …, n,   f(x_i) = f_S(x_i).

Pythagoras' theorem in H_K then shows that:

    ‖f‖²_{H_K} = ‖f_S‖²_{H_K} + ‖f_⊥‖²_{H_K}.

As a consequence, ξ(f, S) ≥ ξ(f_S, S), with equality if and only if ‖f_⊥‖_{H_K} = 0. The minimum of Ψ is therefore necessarily in H_K^S.  □

Julien Mairal (Inria) 78/564


Remarks
Practical and theoretical consequences
Often the function Ψ has the form:

    Ψ(f(x_1), …, f(x_n), ‖f‖_{H_K}) = c(f(x_1), …, f(x_n)) + λ Ω(‖f‖_{H_K}),

where c(.) measures the “fit” of f to a given problem (regression, classification, dimension reduction, ...) and Ω is strictly increasing. This
formulation has two important consequences:
Theoretically, the minimization will enforce the norm k f kHK to be
“small”, which can be beneficial by ensuring a sufficient level of
smoothness for the solution (regularization effect).
Practically, we know by the representer theorem that the solution
lives in a subspace of dimension n, which can lead to efficient
algorithms although the RKHS itself can be of infinite dimension.

Julien Mairal (Inria) 79/564


Remarks
Dual interpretations of kernel methods
Most kernel methods have two complementary interpretations:
A geometric interpretation in the feature space, thanks to the kernel
trick. Even when the feature space is “large”, most kernel methods
work in the linear span of the embeddings of the points available.
A functional interpretation, often as an optimization problem over
(subsets of) the RKHS associated to the kernel.
The representer theorem has important consequences, but it is in fact
rather trivial. We are looking for a function f in H such that for all x
in X , f (x) = hKx , f iH . The part f ⊥ that is orthogonal to the Kxi ’s is
thus “useless” to explain the training data.

Julien Mairal (Inria) 80/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 81/564
Regression
Setup
Let S = {x1 , . . . , xn } ∈ X n be a set of points
Let y = {y1 , . . . , yn } ∈ Rn be real numbers attached to the points
Regression = find a function f : X → R to predict y by f (x)
[Figure: a 1D regression example, with data points and two candidate regression lines (“line 1”, “line 2”)]

Julien Mairal (Inria) 82/564
Least-square regression

Let us quantify the error if f predicts f (x) instead of y by:

    L(f(x), y) = (y − f(x))².

Fix a set of functions H.
Least-square regression amounts to solving:

    f̂ ∈ arg min_{f ∈ H}  (1/n) Σ_{i=1}^n (y_i − f(x_i))².

Issues: unstable (especially in large dimensions), overfitting if H is too “large”.

Julien Mairal (Inria) 83/564


Regularized least-square
Let us consider a RKHS H, RKHS associated to a p.d. kernel K
on X .
Let us regularize the functional to be minimized by:
    f̂ = arg min_{f ∈ H}  (1/n) Σ_{i=1}^n (y_i − f(x_i))²  +  λ‖f‖²_H.

1st effect = prevent overfitting by penalizing non-smooth functions.

Julien Mairal (Inria) 84/564


Representation of the solution
By the representer theorem, any solution of:
    f̂ = arg min_{f ∈ H_K}  (1/n) Σ_{i=1}^n (y_i − f(x_i))²  +  λ‖f‖²_{H_K}

can be expanded as:

    f̂(x) = Σ_{i=1}^n α_i K(x_i, x).

2nd effect = simplifying the solution.

Julien Mairal (Inria) 85/564


Dual formulation

Let α = (α_1, …, α_n)^T ∈ R^n,
Let K be the n × n Gram matrix: K_{i,j} = K(x_i, x_j).
We can then write:

    ( f̂(x_1), …, f̂(x_n) )^T = Kα.

The following holds as usual:

    ‖f̂‖²_{H_K} = α^T Kα.

Julien Mairal (Inria) 86/564


Dual formulation
The problem is therefore equivalent to:
    arg min_{α ∈ R^n}  (1/n) (Kα − y)^T (Kα − y)  +  λ α^T Kα.

This is a convex and differentiable function of α. Its minimum can therefore be found by setting the gradient in α to zero:

    0 = (2/n) K (Kα − y) + 2λKα
      = (2/n) K [ (K + λnI) α − y ].

Julien Mairal (Inria) 87/564


Dual formulation
K being a symmetric matrix, it can be diagonalized in an
orthonormal basis and Ker (K) ⊥ Im(K).
In this basis we see that (K + λnI )−1 leaves Im(K) and Ker (K)
invariant.
The problem is therefore equivalent to:

    (K + λnI) α − y ∈ Ker(K)
    ⇔ α − (K + λnI)^{−1} y ∈ Ker(K)
    ⇔ α = (K + λnI)^{−1} y + ε,   with Kε = 0.

Julien Mairal (Inria) 88/564


Kernel ridge regression

However, if α′ = α + ε with Kε = 0, then:

    ‖f − f′‖²_H = (α − α′)^T K (α − α′) = 0,

therefore f = f′.
One solution to the initial problem is therefore:

    f̂(x) = Σ_{i=1}^n α_i K(x_i, x),   with   α = (K + λnI)^{−1} y.

Julien Mairal (Inria) 89/564
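A compact NumPy sketch of kernel ridge regression as derived above (my addition); `solve` is preferred to an explicit matrix inverse.

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_krr(X, y, lam, sigma=1.0):
    """Return alpha = (K + lam*n*I)^{-1} y for the Gaussian kernel."""
    n = X.shape[0]
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict_krr(X_train, alpha, X_test, sigma=1.0):
    """f_hat(x) = sum_i alpha_i K(x_i, x)."""
    return gaussian_gram(X_test, X_train, sigma) @ alpha

# toy 1D example
rng = np.random.default_rng(0)
X = rng.uniform(-2, 7, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_krr(X, y, lam=0.01)
y_pred = predict_krr(X, alpha, X)
```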


Remarks

The matrix K + nλI is invertible when λ > 0.
When λ → 0, the method converges towards the solution of the classical unregularized least-square problem. When λ → ∞, the solution converges to f = 0.
In practice the symmetric matrix K + nλI is inverted with specific algorithms (e.g., Cholesky decomposition).
This method becomes difficult to use when the number of points becomes large.

Julien Mairal (Inria) 90/564


Example
[Figure: kernel ridge regression fits on the 1D data set for regularization parameters λ = 0, 0.01, 0.1, and 1]

Julien Mairal (Inria) 91/564


Kernel methods: Summary
The kernel trick allows to extend many linear algorithms to
non-linear settings and to general data (even non-vectorial).
The representer theorem shows that functional optimization over (subsets of) the RKHS is feasible in practice.
We will see next a particularly successful application of kernel methods: pattern recognition.

Julien Mairal (Inria) 92/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 93/564
Pattern recognition

[Figure: a training set of objects labeled APPLE or PEAR, and new objects marked “???” to be classified]

Input variables x ∈ X .
Output y ∈ {−1, 1}.
Training set S = {(x1 , y1 ) , . . . , (xn , yn )}.

Julien Mairal (Inria) 94/564


Or again the cats and dogs example...
Regularized empirical risk formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i))  +  λ Ω(f),

where the first term is the empirical risk (data fit) and the second is the regularization.

Julien Mairal (Inria) 95/564


...which we may reformulate with kernels
Regularized empirical risk formulation
The goal is to learn a prediction function f : X → Y given labeled
training data (xi ∈ X , yi ∈ Y)i=1,...,n :
    min_{f ∈ H}  (1/n) Σ_{i=1}^n ϕ(y_i f(x_i))  +  λ ‖f‖²_H,

where the first term is the empirical risk (data fit) and the second is the regularization.

By the representer theorem, the solution of the unconstrained problem can be expanded as:

    f(x) = Σ_{i=1}^n α_i K(x_i, x).

Julien Mairal (Inria) 96/564


Optimization in RKHS
Plugging into the original problem we obtain the following
unconstrained and convex optimization problem in Rn :
   
    min_{α ∈ R^n}  (1/n) Σ_{i=1}^n ϕ( y_i Σ_{j=1}^n α_j K(x_i, x_j) )  +  λ Σ_{i,j=1}^n α_i α_j K(x_i, x_j),

which in matrix notation gives

    min_{α ∈ R^n}  (1/n) Σ_{i=1}^n ϕ(y_i [Kα]_i)  +  λ α^T Kα.

This can be implemented using general packages for convex optimization or specific algorithms (e.g., for SVM).

Julien Mairal (Inria) 97/564
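As an illustration (my addition, not the course's reference implementation), the finite-dimensional problem above can be solved by plain gradient descent when ϕ is smooth, e.g. the logistic loss ϕ(u) = log(1 + e^{−u}):

```python
import numpy as np

def kernel_logistic_fit(K, y, lam, lr=0.1, n_iter=2000):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i [K a]_i)) + lam * a^T K a."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = y * (K @ alpha)                 # margins y_i f(x_i)
        sig = 1.0 / (1.0 + np.exp(m))       # = sigmoid(-m) = -d/dm log(1 + e^{-m})
        grad = -(K @ (y * sig)) / n + 2 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha

# usage: alpha = kernel_logistic_fit(K, y, lam=0.1); predict with f(x) = sum_i alpha_i K(x_i, x)
```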


Loss function examples

[Figure: plot of ϕ(u) versus u for the 1-SVM (hinge), 2-SVM (squared hinge), logistic, and boosting (exponential) losses]

Method                              ϕ(u)
Kernel logistic regression          log(1 + e^{−u})
Support vector machine (1-SVM)      max(1 − u, 0)
Support vector machine (2-SVM)      max(1 − u, 0)²
Boosting                            e^{−u}
Julien Mairal (Inria) 98/564
Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 99/564
Formalization
Definition of the risk and notation
Let P be an (unknown) distribution on X × Y.
Observation: Sn = (Xi , Yi )i=1,...,n i.i.d. random variables according
to P.
Loss function L (f (x) , y) ∈ R small when f (x) is a good predictor
for y .
Risk: R(f ) = E[L (f (X ) , Y )].
Estimator f̂_n : X → Y.
Goal: small risk R(f̂_n).

Julien Mairal (Inria) 100/564


Large-margin classifiers
Definition of the margin
For pattern recognition Y = {−1, 1}.
The goal is to estimate a prediction function f : X → R.
The margin of the function f for a pair (x, y) is:

yf (x) .

Large margin classifiers


Focusing on large margins ensures that f (x) has the same sign as y
and a large absolute value (confidence).
Suggests a loss function L (f (x) , y) = ϕ (yf (x)), where ϕ : R → R
is non-increasing.
Goal: small ϕ-risk Rϕ (f ) = E[ϕ (Yf (X ))].
Julien Mairal (Inria) 101/564

Empirical risk minimization (ERM)
ERM estimator
Given n observations, the empirical ϕ-risk is:
    R_ϕ^n(f) = (1/n) Σ_{i=1}^n ϕ(Y_i f(X_i)).

The ERM estimator on the functional class F is the solution (when it exists) of:

    f̂_n = arg min_{f ∈ F} R_ϕ^n(f).

Question
When is Rϕn (f ) a good estimate of the true risk Rϕ (f )?

Julien Mairal (Inria) 102/564


Class capacity
Motivations
 
The ERM principle gives a good solution if Rϕn fˆn is similar to the
minimum achievable risk inf f ∈F Rϕ (f ).
This can be ensured if F is not “too large”.
We need a measure of the “capacity” of F.

Definition: Rademacher complexity


The Rademacher complexity of a class of functions F is:
    Rad_n(F) = E_{X,σ} [ sup_{f ∈ F}  (2/n) Σ_{i=1}^n σ_i f(X_i) ],

where the expectation is over (Xi )i=1,...,n and the independent uniform
{±1}-valued (Rademacher) random variables (σi )i=1,...,n .

Julien Mairal (Inria) 103/564


Basic learning bounds
Suppose ϕ is Lipschitz with constant L_ϕ:

    ∀u, u′ ∈ R,   |ϕ(u) − ϕ(u′)| ≤ L_ϕ |u − u′|.

Then on average over the training set (and with high probability) the ϕ-risk of the ERM estimator is close to the empirical one:

    E_S [ sup_{f ∈ F} | R_ϕ(f) − R_ϕ^n(f) | ] ≤ 2 L_ϕ Rad_n(F).

The ϕ-risk of the ERM estimator is also close to the smallest achievable on F (on average and with large probability):

    E_S [ R_ϕ(f̂_n) ] ≤ inf_{f ∈ F} R_ϕ(f) + 4 L_ϕ Rad_n(F).

Julien Mairal (Inria) 104/564


ERM in RKHS balls
Principle
Assume X is endowed with a p.d. kernel.
We consider the ball of radius B in the RKHS as function class for
the ERM:
    F_B = { f ∈ H : ‖f‖_H ≤ B }.

Theorem (capacity control of RKHS balls)

    Rad_n(F_B) ≤ ( 2B √(E[K(X, X)]) ) / √n.

Julien Mairal (Inria) 105/564


Proof (1/2)

" n
#
2X
Radn (FB ) = EX ,σ sup σi f (Xi )
f ∈FB n i=1
" * n
+ #
2X
= EX ,σ sup f, σi KXi (RKHS)
f ∈FB n
i=1
" n
#
2X
= EX ,σ Bk σi KXi kH (Cauchy-Schwarz)
n
i=1
v 
u n
2B X
EX ,σ tk
u
= σi KXi k2H 
n
i=1
v  
u
n
2B u
u X
≤ tEX ,σ  σi σj K (Xi , Xj ) (Jensen)
n
i,j=1

Julien Mairal (Inria) 106/564


Proof (2/2)
But Eσ [σi σj ] is 1 if i = j, 0 otherwise. Therefore:
    Rad_n(F_B) ≤ (2B/n) √( E_X [ Σ_{i,j=1}^n E_σ[σ_i σ_j] K(X_i, X_j) ] )
               ≤ (2B/n) √( E_X [ Σ_{i=1}^n K(X_i, X_i) ] )
               = ( 2B √(E_X[K(X, X)]) ) / √n.   □

Julien Mairal (Inria) 107/564


Basic learning bounds in RKHS balls
Corollary
Suppose K(X, X) ≤ κ² a.s. (e.g., Gaussian kernel and κ = 1).
Let the minimum possible ϕ-risk be:

    R_ϕ* = inf_{f measurable} R_ϕ(f).

Then we directly get for the ERM estimator in F_B:

    E R_ϕ(f̂_n) − R_ϕ*  ≤  8 L_ϕ κ B / √n  +  ( inf_{f ∈ F_B} R_ϕ(f) − R_ϕ* ).

Julien Mairal (Inria) 108/564


Choice of B by structural risk minimization
Remark

The estimation error upper bound 8L_ϕκB/√n increases (linearly) with B.
The approximation error inf_{f ∈ F_B} R_ϕ(f) − R_ϕ* decreases with B.
Ideally, the choice of B should find a trade-off that minimizes the upper bound.
This is achieved when

    ∂( inf_{f ∈ F_B} R_ϕ(f) ) / ∂B  =  − 8 L_ϕ κ / √n.

Julien Mairal (Inria) 109/564



ERM in practice
Reformulation as penalized minimization
We must solve the constrained minimization problem:
    min_{f ∈ H}  (1/n) Σ_{i=1}^n ϕ(y_i f(x_i))
    subject to   ‖f‖_H ≤ B.

This is a constrained optimization problem.

To make this practical we assume that ϕ is convex.
The problem is then a convex problem in f for which strong duality holds. In particular, f solves the problem if and only if it solves, for some dual parameter λ, the unconstrained problem:

    min_{f ∈ H}  (1/n) Σ_{i=1}^n ϕ(y_i f(x_i))  +  λ ‖f‖²_H.

Julien Mairal (Inria) 110/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 111/564
A few slides on convex duality
Strong Duality

[Figure: a primal objective f(α) with minimizer α*, and a dual objective g(κ) with maximizer κ*, whose optimal values coincide]

Strong duality means that max_κ g(κ) = min_α f(α).
Strong duality holds in most “reasonable cases” for convex optimization (to be detailed soon).

Julien Mairal (Inria) 112/564


A few slides on convex duality
Strong Duality

[Figure: the same primal f(α) and dual g(κ) curves, with optima α* and κ*]

The relation between κ* and α* is not always known a priori.

Julien Mairal (Inria) 112/564


A few slides on convex duality
Parenthesis on duality gaps
[Figure: for a candidate primal point α̃ and dual point κ̃, the duality gap δ(α̃, κ̃) = f(α̃) − g(κ̃) upper-bounds the suboptimality of α̃]

The duality gap guarantees us that 0 ≤ f(α̃) − f(α*) ≤ δ(α̃, κ̃).
Dual problems are often obtained by Lagrangian or Fenchel duality.

Julien Mairal (Inria) 113/564


A few slides on Lagrangian duality
Setting
We consider an equality and inequality constrained optimization
problem over a variable x ∈ X :

minimize f (x)
subject to hi (x) = 0 , i = 1, . . . , m ,
gj (x) ≤ 0 , j = 1, . . . , r ,

making no assumption on f, g and h.


Let us denote by f ∗ the optimal value of the decision function
under the constraints, i.e., f ∗ = f (x ∗ ) if the minimum is reached at
a global minimum x ∗ .

Julien Mairal (Inria) 114/564


A few slides on Lagrangian duality
Lagrangian
The Lagrangian of this problem is the function L : X × Rm × Rr → R
defined by:
    L(x, λ, µ) = f(x) + Σ_{i=1}^m λ_i h_i(x) + Σ_{j=1}^r µ_j g_j(x).

Lagrangian dual function


The Lagrange dual function q : R^m × R^r → R is:

    q(λ, µ) = inf_{x ∈ X} L(x, λ, µ)
            = inf_{x ∈ X} ( f(x) + Σ_{i=1}^m λ_i h_i(x) + Σ_{j=1}^r µ_j g_j(x) ).

Julien Mairal (Inria) 115/564


A few slides on convex Lagrangian duality
For the (primal) problem:
minimize f (x)
subject to h(x) = 0 , g (x) ≤ 0 ,
the Lagrange dual problem is:
maximize q(λ, µ)
subject to µ≥0,

Proposition
q is concave in (λ, µ), even if the original problem is not convex.
The dual function yields lower bounds on the optimal value f ∗ of
the original problem when µ is nonnegative:

q(λ, µ) ≤ f ∗ , ∀λ ∈ Rm , ∀µ ∈ Rr , µ ≥ 0 .

Julien Mairal (Inria) 116/564


Proofs

For each x, the function (λ, µ) ↦ L(x, λ, µ) is linear, and therefore both convex and concave in (λ, µ). The pointwise minimum of concave functions is concave, therefore q is concave.
Let x̄ be any feasible point, i.e., h(x̄) = 0 and g(x̄) ≤ 0. Then we have, for any λ and µ ≥ 0:

    Σ_{i=1}^m λ_i h_i(x̄) + Σ_{j=1}^r µ_j g_j(x̄) ≤ 0,

    ⟹ L(x̄, λ, µ) = f(x̄) + Σ_{i=1}^m λ_i h_i(x̄) + Σ_{j=1}^r µ_j g_j(x̄) ≤ f(x̄),

    ⟹ q(λ, µ) = inf_x L(x, λ, µ) ≤ L(x̄, λ, µ) ≤ f(x̄),   ∀x̄.  □

Julien Mairal (Inria) 117/564


Weak duality
Let d* be the optimal value of the Lagrange dual problem. Each q(λ, µ) is a lower bound for f* and by definition d* is the best lower bound that is obtained. The following weak duality inequality therefore always holds:

    d* ≤ f*.

This inequality holds even when d* or f* are infinite. The difference f* − d* is called the optimal duality gap of the original problem.

Julien Mairal (Inria) 118/564


Strong duality
We say that strong duality holds if the optimal duality gap is zero,
i.e.:
d∗ = f ∗ .

If strong duality holds, then the best lower bound that can be
obtained from the Lagrange dual function is tight
Strong duality does not hold for general nonlinear problems.
It usually holds for convex problems.
Conditions that ensure strong duality for convex problems are called
constraint qualification.
in that case, we have for all feasible primal and dual points x, λ, µ,

    q(λ, µ) ≤ q(λ*, µ*) = L(x*, λ*, µ*) = f(x*) ≤ f(x).

Julien Mairal (Inria) 119/564


Slater’s constraint qualification
Strong duality holds for a convex problem:

minimize f (x)
subject to gj (x) ≤ 0 , j = 1, . . . , r ,
Ax = b ,

if it is strictly feasible, i.e., there exists at least one feasible point that
satisfies:
gj (x) < 0 , j = 1, . . . , r , Ax = b .

Julien Mairal (Inria) 120/564


Remarks

Slater’s conditions also ensure that the maximum d ∗ (if > −∞) is
attained, i.e., there exists a point (λ∗ , µ∗ ) with

q (λ∗ , µ∗ ) = d ∗ = f ∗

They can be sharpened. For example, strict feasibility is not


required for affine constraints.
There exist many other types of constraint qualifications

Julien Mairal (Inria) 121/564


Dual optimal pairs

Suppose that strong duality holds, x* is primal optimal, and (λ*, µ*) is dual optimal. Then we have:

    f(x*) = q(λ*, µ*)
          = inf_{x ∈ R^n} ( f(x) + Σ_{i=1}^m λ*_i h_i(x) + Σ_{j=1}^r µ*_j g_j(x) )
          ≤ f(x*) + Σ_{i=1}^m λ*_i h_i(x*) + Σ_{j=1}^r µ*_j g_j(x*)
          ≤ f(x*).

Hence both inequalities are in fact equalities.

Julien Mairal (Inria) 122/564


Complementary slackness
The first equality shows that:

    L(x*, λ*, µ*) = inf_{x ∈ R^n} L(x, λ*, µ*),

showing that x* minimizes the Lagrangian at (λ*, µ*). The second equality shows that:

    µ*_j g_j(x*) = 0,   j = 1, …, r.

This property is called complementary slackness:
the jth optimal Lagrange multiplier is zero unless the jth constraint is active at the optimum.

Julien Mairal (Inria) 123/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning


The representer theorem
Kernel ridge regression
Classification with empirical risk minimization
A (tiny) bit of learning theory
Foundations of constrained optimization
Support vector machines

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Julien Mairal (Inria) 124/564
Motivations
Support vector machines (SVM)
Historically the first “kernel method” for pattern recognition, still
the most popular.
Often state-of-the-art in performance.
One particular choice of loss function (hinge loss).
Leads to a sparse solution, i.e., not all points are involved in the
decomposition (compression).
Particular algorithm for fast optimization (decomposition by
chunking methods).

Julien Mairal (Inria) 125/564


Definitions
l(f(x),y)

yf(x)

The loss function is the hinge loss:


(
0 if u ≥ 1,
ϕhinge (u) = max (1 − u, 0) =
1−u otherwise.

SVM solve the problem:


( n )
1X
min ϕhinge (yi f (xi )) + λk f k2H .
f ∈H n
i=1

Julien Mairal (Inria) 126/564


Problem reformulation (1/3)
Slack variables
This is a convex optimization problem
However the objective function in not differentiable, so we
reformulate the problem with additional slack variables
ξ1 , . . . , ξn ∈ R:
( n )
1X 2
min ξi + λk f kH ,
f ∈H,ξ ∈Rn n
i=1

subject to:
ξi ≥ ϕhinge (yi f (xi )) .

Julien Mairal (Inria) 127/564


Problem reformulation (2/3)
The objective function is now differentiable in f and ξi , and we can
rewrite the constraints as a conjunction of linear constraints:
n
1X
min ξi + λk f k2H ,
f ∈H,ξ ∈Rn n
i=1

subject to: (
ξi ≥ 1 − yi f (xi ) , for i = 1, . . . , n ,
ξi ≥ 0, for i = 1, . . . , n .

Julien Mairal (Inria) 128/564


Problem reformulation (3/3)
Finite-dimensional expansion
Replacing fˆ by
n
X
fˆ (x) = αi K (xi , x) ,
i=1

the problem can be rewritten as an optimization problem in α and ξ:


n
1X
min ξi + λα> Kα ,
α∈Rn ,ξ ∈Rn n i=1

subject to:
( P
yi nj=1 αj K (xi , xj ) + ξi − 1 ≥ 0 , for i = 1, . . . , n ,
ξi ≥ 0 , for i = 1, . . . , n .

Julien Mairal (Inria) 129/564


Problem reformulation (3/3)
Finite-dimensional expansion
Replacing fˆ by
n
X
fˆ (x) = αi K (xi , x) ,
i=1

the problem can be rewritten as an optimization problem in α and ξ:


n
1X
min ξi + λα> Kα ,
α∈Rn ,ξ ∈Rn n i=1

subject to:
(
yi [Kα]i + ξi − 1 ≥ 0 , for i = 1, . . . , n ,
ξi ≥ 0 , for i = 1, . . . , n .

Julien Mairal (Inria) 129/564


Solving the problem
Remarks
This is a classical quadratic program (minimization of a convex
quadratic function with linear constraints) for which any
out-of-the-box optimization package can be used.
The dimension of the problem and the number of constraints,
however, are 2n where n is the number of points. General-purpose
QP solvers will have difficulties when n exceeds a few thousands.
Solving the dual of this problem (also a QP) will be more
convenient and lead to faster algorithms (due to the sparsity of the
final solution).

Julien Mairal (Inria) 130/564


Lagrangian
Let us introduce the Lagrange multipliers µ ∈ Rn and ν ∈ Rn .
The Lagrangian of the problem is:
n
1X
L (α, ξ, µ, ν) = ξi + λα> Kα
n
i=1
n
X n
X
− µi [yi [Kα]i + ξi − 1] − νi ξi .
i=1 i=1

Julien Mairal (Inria) 131/564


Lagrangian
Let us introduce the Lagrange multipliers µ ∈ Rn and ν ∈ Rn .
The Lagrangian of the problem is:
n
1X
L (α, ξ, µ, ν) = ξi + λα> Kα
n
i=1
− (diag (y)µ)> Kα − (µ + ν)> ξ + µ> 1.

Julien Mairal (Inria) 131/564


Minimizing L (α, ξ, µ, ν) w.r.t. α

L (α, ξ, µ, ν) is a convex quadratic function in α. It is minimized


when its gradient is null:

∇α L = 2λKα − K diag (y)µ = K (2λα − diag (y)µ) ,

Solving ∇α L = 0 leads to

diag (y)µ
α= + ,

with K = 0. But  does not change f (same as kernel ridge
regression), so we can choose for example  = 0 and:
yi µi
αi∗ (µ, ν) = , for i = 1, . . . , n.

Julien Mairal (Inria) 132/564


Minimizing L (α, ξ, µ, ν) w.r.t. ξ

L (α, ξ, µ, ν) is a linear function in ξ.


Its minimum is −∞ except when ∇ξ L = 0, i.e.:

∂L 1
= − µi − νi = 0.
∂ξi n

Julien Mairal (Inria) 133/564


Dual function
We therefore obtain the Lagrange dual function:

q (µ, ν) = inf L (α, ξ, µ, ν)


α∈Rn ,ξ ∈Rn
(P
n 1 Pn 1
i=1 µi − 4λ i,j=1 yi yj µi µj K (xi , xj ) if µi + νi = n for all i,
=
−∞ otherwise.

The dual problem is:

maximize q (µ, ν)
subject to µ ≥ 0,ν ≥ 0 .

Julien Mairal (Inria) 134/564


Dual problem

If µi > 1/n for some i, then there is no νi ≥ 0 such that


µi + νi = 1/n, hence q (µ, ν) = −∞.
If 0 ≤ µi ≤ 1/n for all i, then the dual function takes finite values
that depend only on µ by taking νi = 1/n − µi .
The dual problem is therefore equivalent to:
n n
X 1 X
max µi − yi yj µi µj K (xi , xj ) .
0≤µ≤1/n 4λ
i=1 i,j=1

Julien Mairal (Inria) 135/564


Back to the primal
Once the dual problem is solved in µ we get a solution of the
primal problem by α = diag (y)µ/2λ.
We can therefore directly plug this into the dual problem to obtain
the QP that α must solve:
n
X n
X
max 2 αi yi − αi αj K (xi , xj ) = 2α> y − α> Kα ,
α∈Rn
i=1 i,j=1

subject to:
1
0 ≤ yi αi ≤ , for i = 1, . . . , n .
2λn

Julien Mairal (Inria) 136/564


Complimentary slackness conditions
The complimentary slackness conditions are, for i = 1, . . . , n:
(
µi [yi f (xi ) + ξi − 1] = 0,
νi ξi = 0,

In terms of α this can be rewritten as:


(
αi [yi f (xi ) + ξi − 1] = 0 ,
yi 
αi − 2λn ξi = 0 .

Julien Mairal (Inria) 137/564


Analysis of KKT conditions
(
αi [yi f (xi ) + ξi − 1] = 0 ,
yi 
αi − 2λn ξi = 0 .

If αi = 0, then the second constraint is active: ξi = 0. This implies


yi f (xi ) ≥ 1.
1
If 0 < yi αi < 2λn , then both constraints are active: ξi = 0 et
yi f (xi ) + ξi − 1 = 0. This implies yi f (xi ) = 1.
yi
If αi = 2λn , then the second constraint is not active (ξi ≥ 0) while
the first one is active: yi f (xi ) + ξi = 1. This implies yi f (xi ) ≤ 1

Julien Mairal (Inria) 138/564


Geometric interpretation

Julien Mairal (Inria) 139/564


Geometric interpretation

) = +1
f(x
( x )=0 )= −1
f f(x

Julien Mairal (Inria) 139/564


Geometric interpretation

αy=1/2nλ

0<α y<1/2n λ

α=0

Julien Mairal (Inria) 139/564


Support vectors
Consequence of KKT conditions
The training points with αi 6= 0 are called support vectors.
Only support vectors are important for the classification of new
points:
n
X X
∀x ∈ X , f (x) = αi K (xi , x) = αi K (xi , x) ,
i=1 i∈SV

where SV is the set of support vectors.

Consequences
The solution is sparse in α, leading to fast algorithms for training
(use of decomposition methods).
The classification of a new point only involves kernel evaluations
with support vectors (fast).
Julien Mairal (Inria) 140/564
Remark: C-SVM
Often the SVM optimization problem is written in terms of a
regularization parameter C instead of λ as follows:
n
1 X
arg min k f k2H + C Lhinge (f (xi ) , yi ) .
f ∈H 2 i=1

1
This is equivalent to our formulation with C = 2nλ .
The SVM optimization problem is then:
n
X n
X
max 2 αi yi − αi αj K (xi , xj ) ,
α∈Rd i=1 i,j=1

subject to:
0 ≤ y i αi ≤ C , for i = 1, . . . , n .
This formulation is often called C-SVM.
Julien Mairal (Inria) 141/564
Remark: 2-SVM
A variant of the SVM, sometimes called 2-SVM, is obtained by
replacing the hinge loss by the square hinge loss:
( n )
1X 2 2
min ϕhinge (yi f (xi )) + λk f kH .
f ∈H n
i=1

After some computation (left as exercice) we find that the dual


problem of the 2-SVM is:

max 2α> y − α> (K + nλI ) α ,


α∈Rd
subject to:
0 ≤ y i αi , for i = 1, . . . , n .
This is therefore equivalent to the previous SVM with the kernel
K + nλI and C = +∞

Julien Mairal (Inria) 142/564


Part 3

Kernel Methods
Unsupervised Learning

Julien Mairal (Inria) 143/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 144/564


The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
n
X
min kxi − µsi k22 .
µj ∈Rp for j=1,...,k
i=1
si ∈{1,...,k}, for i=1,...,n

K-means alternates between two steps:


1 cluster assignment:
Given fixed µ1 , . . . , µk , assign each xi to its closest centroid

∀i, si ∈ argmin kxi − µs k22 .


s∈{1,...,k}

Julien Mairal (Inria) 145/564


The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
n
X
min kxi − µsi k22 .
µj ∈Rp for j=1,...,k
i=1
si ∈{1,...,k}, for i=1,...,n

K-means alternates between two steps:


2 centroids update:
Given the previous assignments s1 , . . . , sn , update the centroids
X
∀j, µj = argmin kxi − µk22 .
µ∈Rp i:si =j

Julien Mairal (Inria) 145/564


The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1 , . . . , xn in Rp , it consists of performing alternate
minimization steps for optimizing the following cost function
n
X
min kxi − µsi k22 .
µj ∈Rp for j=1,...,k
i=1
si ∈{1,...,k}, for i=1,...,n

K-means alternates between two steps:


2 centroids update:
Given the previous assignments s1 , . . . , sn , update the centroids
1 X
⇔ ∀j, µj = xi .
nj
i:si =j

Julien Mairal (Inria) 145/564


Kernel K-means and spectral clustering
We may now modify the objective to operate in a RKHS. Given data
points x1 , . . . , xn in X and a p.d. kernel K : X × X → R with H its
RKHS, the new objective becomes
n
X
min kϕ(xi ) − µsi k2H .
µj ∈H for j=1,...,k
i=1
si ∈{1,...,k} for i=1,...,n

To optimize the cost function, we will first use the following Proposition
Proposition
1 Pn
The center of mass ϕn = n i=1 ϕ(xi ) solves the following optimization
problem
n
X
min kϕ(xi ) − µk2H .
µ∈H
i=1

Julien Mairal (Inria) 146/564


Kernel K-means and spectral clustering
Proof

n n n
* +
1X 1X 2X
kϕ(xi ) − µk2H = kϕ(xi )k2H − ϕ(xi ), µ + kµk2H
n n n
i=1 i=1 i=1 H
n
1X
= kϕ(xi )k2H − 2 hϕn , µiH + kµk2H
n
i=1
n
1X
= kϕ(xi )k2H − kϕn k2H + kϕn − µk2H ,
n
i=1

which is minimum for µ = ϕn .

Julien Mairal (Inria) 147/564


Kernel K-means and spectral clustering
Back with the objective,
n
X
min kϕ(xi ) − µsi k2H ,
µj ∈H for j=1,...,k
i=1
si ∈{1,...,k} for i=1,...,n

we know that given assignments si , the optimal µj are the centers of


mass of the respective clusters and we obtain the equivalent objective:
2
n
X 1 X
min ϕ(xi ) − ϕ(xj ) ,
si ∈{1,...,k} |Csi |
for i=1,...,n i=1 j∈Csi
H

or, after short calculations,


n
X 2 X 1 X X
min K (xi , xi ) − K (xi , xj ) + K (xj , xl ).
si ∈{1,...,k} |Csi | |Csi |2
for i=1,...,n i=1 j∈Csi j∈Csi l∈Csi

Julien Mairal (Inria) 148/564


Kernel K-means and spectral clustering
and, after removing the constant terms, we obtain the objective
n
X 1 X
min − K (xi , xj ), (?)
si ∈{1,...,k} |Csi |
for i=1,...,n i=1 j∈Csi

The objective can be expressed with pairwise kernel comparisons.


Unfortunately, the problem is hard and we need an appropriate strategy
to obtain an approximate solution.
Greedy approach: kernel K-means
At every iteration,
Update the sets Cl , l = 1, . . . , k given current assignments si ’s.
Update the assignments by minimizing (?) keeping the sets Cl fixed.
The algorithm is similar to the traditional K-means algorithm.

Julien Mairal (Inria) 149/564


Kernel K-means and spectral clustering
Another approach consists of relaxing the non-convex problem with a
feasible one, which yields a class of algorithms called spectral clustering.
First, we rewrite the objective function as
k X
X 1
min − K (xi , xj ).
si ∈{1,...,k} |Cl |
for i=1,...,n l=1 i,j∈Cl

and we introduce
the binary matrix A in {0, 1}n×k such that [A]ij = 1 if si = j and 0
otherwise.
a diagonal matrix D in Rl×l with diagonal entries [D]jj equal to the
inverse of the number of elements in cluster j.
and the objective can be rewritten (proof is easy and left as an exercise)
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

Julien Mairal (Inria) 150/564


Kernel K-means and spectral clustering
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

The constraints on A, D are such that D1/2 A> AD1/2 = I (exercise). A


natural relaxation consists of dropping the constraints on A and instead
optimize over Z = AD1/2 :

max trace (Z> KZ) s.t. Z> Z = I.


Z∈Rn×k

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.

Julien Mairal (Inria) 151/564


Kernel K-means and spectral clustering
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

The constraints on A, D are such that D1/2 A> AD1/2 = I (exercise). A


natural relaxation consists of dropping the constraints on A and instead
optimize over Z = AD1/2 :

max trace (Z> KZ) s.t. Z> Z = I.


Z∈Rn×k

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Question
How do we obtain an approximate solution (A, D) of the original
problem from Z? ?

Julien Mairal (Inria) 151/564


Kernel K-means and spectral clustering
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

The constraints on A, D are such that D1/2 A> AD1/2 = I (exercise). A


natural relaxation consists of dropping the constraints on A and instead
optimize over Z = AD1/2 :

max trace (Z> KZ) s.t. Z> Z = I.


Z∈Rn×k

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 1
With the original constraints on A, every row of A has a single non-zero
entry ⇒ compute the maximum entry of every row of Z? .

Julien Mairal (Inria) 151/564


Kernel K-means and spectral clustering
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

The constraints on A, D are such that D1/2 A> AD1/2 = I (exercise). A


natural relaxation consists of dropping the constraints on A and instead
optimize over Z = AD1/2 :

max trace (Z> KZ) s.t. Z> Z = I.


Z∈Rn×k

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 2
Normalize the rows of Z? to have unit `2 -norm, and apply the traditional
K-means algorithm on the rows. This is called spectral clustering.

Julien Mairal (Inria) 151/564


Kernel K-means and spectral clustering
h i
min − trace (D1/2 A> KAD1/2 ) .
A,D

The constraints on A, D are such that D1/2 A> AD1/2 = I (exercise). A


natural relaxation consists of dropping the constraints on A and instead
optimize over Z = AD1/2 :

max trace (Z> KZ) s.t. Z> Z = I.


Z∈Rn×k

A solution Z? to this problem may be obtained by computing the


eigenvectors of K associated to the k-largest eigenvalues. As we will see
in a few slides, this procedure is related to the kernel PCA algorithm.
Answer 3
Choose another variant of the previous procedures.

Julien Mairal (Inria) 151/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 152/564


Principal Component Analysis (PCA)
Classical setting
Let S = {x1 , . . . , xn } be a set of vectors (xi ∈ Rd )
PCA is a classical algorithm in multivariate statistics to define a set
of orthogonal directions that capture the maximum variance
Applications: low-dimensional representation of high-dimensional
points, visualization

PC2 PC1

Julien Mairal (Inria) 153/564


Principal Component Analysis (PCA)
Formalization
Assume that the data are centered (otherwise center them as
preprocessing), i.e.:
Xn
xi = 0.
i=1

The orthogonal projection onto a direction w ∈ Rd is the function


hw : X → R defined by:
w
hw (x) = x> .
kwk

Julien Mairal (Inria) 154/564


Principal Component Analysis (PCA)
Formalization
The empirical variance captured by hw is:
n n 2
1X 2 1 X x>i w
var
ˆ (hw ) := hw (xi ) = .
n n k w k2
i=1 i=1

The i-th principal direction wi (i = 1, . . . , d) is defined by:

wi = arg max ˆ (hw ) .


var
w⊥{w1 ,...,wi−1 }

Julien Mairal (Inria) 155/564


Principal Component Analysis (PCA)
Solution
Let X be the n × d data matrix whose rows are the vectors
x1 , . . . , xn . We can then write:
n 2
1 X x>i w 1 w> X> Xw
var
ˆ (hw ) = = .
n k w k2 n w> w
i=1

The solutions of:


1 w> X> Xw
wi = arg max
w⊥{w1 ,...,wi−1 } n w> w

are the successive eigenvectors of K = X> X, ranked by decreasing


eigenvalues.

Julien Mairal (Inria) 156/564


Functional point of view

Let K (x, y) = x> y be the linear kernel.


The associated RKHS H is the set of linear functions:

fw (x) = w> x ,

endowed with the norm k fw kH = k w kRd .


Therefore we can write:
n 2 n
1 X x>i w 1 X
var
ˆ (hw ) = = fw (xi )2 .
n k w k2 nk fw k2
i=1 i=1

Moreover, w ⊥ w0 ⇔ fw ⊥ fw0 .

Julien Mairal (Inria) 157/564


Functional point of view
In other words, PCA solves, for i = 1, . . . , d:
n
1 X
fi = arg max 2
f (xi )2 .
f ⊥{f1 ,...,fi−1 } nk f k i=1

We can apply the representer theorem (exercise: check that is is


also valid in a linear subspace): for i = 1, . . . , d, we have:
n
X
∀x ∈ X , fi (x) = αi,j K (xj , x) ,
j=1

with αi = (αi,1 , . . . , αi,n )> ∈ Rn .

Julien Mairal (Inria) 158/564


Functional point of view
Therefore we have:
d
X
k fi k2H = αi,k αi,l K (xk , xl ) = α>
i Kαi ,
k,l=1

Similarly:
n
X
fi (xk )2 = α> 2
i K αi .
k=1

Julien Mairal (Inria) 159/564


Functional point of view
PCA maximizes in α the function:
α> K2 α
αi = arg max > ,
α nα Kα
under the constraints:

α>
i Kαj = 0 for j = 1, . . . , i − 1 .

Julien Mairal (Inria) 160/564


Solution

Let U = (u1 , . . . , un ) be an orthonormal basis of eigenvectors of K


with eigenvalues λ1 ≥ . . . ≥ λn ≥ 0.
Let αi = nj=1 βij uj , then
P

Pn 2 2
α> 2
i K αi j=1 βij λj
= ,
nα> n nj=1 βij2 λj
P
i Kαi

which is maximized at α1 = β11 u1 , α2 = β22 u2 , etc...

Julien Mairal (Inria) 161/564


Normalization
For αi = βii ui , we want:

1 = k fi k2H = α> 2
i Kαi = βii λi .

Therefore:
1
αi = √ ui .
λi

Julien Mairal (Inria) 162/564


Kernel PCA: summary
1 Center the Gram matrix
2 Compute the first eigenvectors (ui , λi )

3 Normalize the eigenvectors αi = ui / λi
4 The projections of the points onto the i-th eigenvector is given by
Kαi

Julien Mairal (Inria) 163/564


Kernel PCA: remarks
In this formulation, we must diagonalize the centered kernel Gram
matrix, instead of the covariance matrix in the classical setting
Exercise: check that X> X and XX> have the same spectrum (up
to 0 eigenvalues) and that the eigenvectors are related by a simple
relationship.
This formulation remains valid for any p.d. kernel: this is kernel
PCA
Applications: nonlinear PCA with nonlinear kernels for vectors, PCA
of non-vector objects (strings, graphs..) with specific kernels...

Julien Mairal (Inria) 164/564


Example
PC2 A set of 74 human tRNA
sequences is analyzed using
a kernel for sequences (the
second-order marginalized
kernel based on SCFG). This
set of tRNAs contains three
PC1
classes, called Ala-AGC
(white circles), Asn-GTT
(black circles) and Cys-GCA
(plus symbols) (from Tsuda
et al., 2003).

Julien Mairal (Inria) 165/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning


Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA

4 The Kernel Jungle

5 Open Problems and Research Topics

Julien Mairal (Inria) 166/564


Canonical Correlation Analysis (CCA)
Given two views X = [x1 , . . . , xn ] in Rp×n and Y = [y1 , . . . , yn ] in Rd×n
of the same dataset, the goal of canonical correlation analysis (CCA) is
to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize
1 Pn > >
n i=1 wa xi yi wb
max  .
> x x> w 1/2 1 > y y> w 1/2
wa ∈Rp ,wb ∈Rd 1
Pn  Pn
n i=1 w a i i a n w
i=1 b i i b

Assuming that the pairs (xi , yi ) are i.i.d. samples from an unknown
distribution, CCA seeks to maximize

cov (wa> X , wb> Y )


max q .
wa ∈Rp ,wb ∈Rd
p
> >
var (wa X ) var (wb Y )

Julien Mairal (Inria) 167/564


Canonical Correlation Analysis (CCA)
Given two views X = [x1 , . . . , xn ] in Rp×n and Y = [y1 , . . . , yn ] in Rd×n
of the same dataset, the goal of canonical correlation analysis (CCA) is
to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize
1 Pn > >
n i=1 wa xi yi wb
max  .
> x x> w 1/2 1 > y y> w 1/2
wa ∈Rp ,wb ∈Rd 1
Pn  Pn
n i=1 w a i i a n w
i=1 b i i b

It is possible to show that this is an generalized eigenvalue problem (see


next slide or see Section 6.5 of Shawe-Taylor and Cristianini 2004b).
The above problem provides the first pair of canonical directions. Next
directions can be obtained by solving the same problem under the
constraint that they are orthogonal to the previous canonical directions.

Julien Mairal (Inria) 167/564


Canonical Correlation Analysis (CCA)
Formulation
Assuming that the datasets are centered,

wa> X> Ywb


max 1/2 1/2 .
wa ∈Rp ,wb ∈Rd (wa> X> Xwa ) wb> Y> Ywb

can be formulated, after removing the scaling ambiguity, as

max wa> X> Ywb s.t. wa> X> Xwa = 1 and wb> Y> Ywb = 1.
wa ∈Rp ,wb ∈Rd

Then, there exists λa and λb such that the problem is equivalent to


λa > > λb
min −wa> X> Ywb + (w X Xwa − 1) + (wb> Y> Ywb − 1).
wa ∈Rp ,w b ∈Rd 2 a 2

Julien Mairal (Inria) 168/564


Canonical Correlation Analysis (CCA)
Taking the derivatives and setting the gradient to zero, we obtain

−X> Ywb + λa X> Xwa = 0


−Y> Xwa + λb Y> Ywb = 0

Multiply first equality by wa> and second equality by wb> ; subtract the
two resulting equalities and we get

λa wa> X> Xwa = λb wb> Y> Ywb = λa = λb = λ,

and then, we obtain the generalized eigenvalue problem:

X> Y
    >  
0 wa X X 0 wa

Y> X 0 wb 0 Y> Y wb

Julien Mairal (Inria) 169/564


Canonical Correlation Analysis (CCA)
Let us define
X> Y X> X
     
0 0 wa
ΣA = , ΣB = and w =
Y> X 0 0 >
Y Y wb

Assuming the covariances are invertible, the generalized eigenvalue


problem is equivalent to
−1/2 1/2
ΣB ΣA w = λΣB w

which is also equivalent to the eigenvalue problem


−1/2 −1/2 −1/2 −1/2
ΣB ΣA ΣB (ΣB w) = λ(ΣB w).

Julien Mairal (Inria) 170/564


Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels Ka , Kb : X × X → R, we can obtain two “views” of a
dataset x1 , . . . , xn in X n :

(ϕa (x1 ), . . . , ϕa (xn )) and (ϕb (x1 ), . . . , ϕb (xn )),

where ϕa : X → Ha and ϕb : X → Hb are the embeddings in the


RKHSs Ha of Ka and Hb of Kb , respectively. Then, we may formulate
kernel CCA as the following optimization problem
1 Pn
n i=1 hfa , ϕa (xi )iHa hϕb (xi ), fb iHb
max
fa ∈Ha ,fb ∈Hb 1
 1/2  P 1/2 .
P n 2 1 n 2
n hf
i=1 a , ϕ (x )i
a i Ha n hf
i=1 b , ϕ (x )i
b i Hb

Julien Mairal (Inria) 171/564


Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels Ka , Kb : X × X → R, we can obtain two “views” of a
dataset x1 , . . . , xn in X n :

(ϕa (x1 ), . . . , ϕa (xn )) and (ϕb (x1 ), . . . , ϕb (xn )),

where ϕa : X → Ha and ϕb : X → Hb are the embeddings in the


RKHSs Ha of Ka and Hb of Kb , respectively. Then, we may formulate
kernel CCA as the following optimization problem
1 Pn
n i=1 fa (xi )fb (xi )
max  .
n 2 1/2 1 2 1/2
fa ∈Ha ,fb ∈Hb 1 P  Pn
n i=1 fa (xi ) n i=1 fb (xi )

Julien Mairal (Inria) 171/564


Kernel Canonical Correlation Analysis
Up to a few technical details (exercise),Pwe can apply the representer
theoremP and look for solutions fa (.) = ni=1 αi Ka (xi , .) and
fb (.) = ni=1 βi Kb (xi , .). We finally obtain the formulation
1 Pn
ni=1 [Ka α]i [Kb β]i
α∈R
max
n
,β∈Rn 1 Pn 2 1/2 1
 Pn  ,
2 1/2
n i=1 [Ka α]i n i=1 [Kb β]i

which is equivalent to

α> Ka Kb β
max
α∈R ,β∈Rn
n 1/2 1/2 ,
(α> K2a α) β > K2b β

or, after removing the scaling ambiguity for α and β,

max α> Ka Kb β s.t. α> K2a α = 1 and β > K2b β = 1.


α∈Rn ,β∈Rn

Julien Mairal (Inria) 172/564


Kernel Canonical Correlation Analysis
Remarks
kernel CCA also yields a generalized eigenvalue problem.
the subsequent canonical directions are obtained by solving the
same problem with additional orthogonality constraints.
in practice, kernel CCA is numerically unstable; it requires
regularization to replace the constraints α> K2a α by
α> (K2a + µa I)α = 1 (same for Kb ), which improves the condition
number of the matrix K2a .

Julien Mairal (Inria) 173/564


Part 4

The Kernel Jungle

Julien Mairal (Inria) 174/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

5 Open Problems and Research Topics

Julien Mairal (Inria) 175/564


Motivation
Kernel methods are sometimes criticized for their lack of flexibility: a
large effort is spent in designing by hand the kernel.
Question
How do we design a kernel adapted to the data?

Julien Mairal (Inria) 176/564


Motivation
Kernel methods are sometimes criticized for their lack of flexibility: a
large effort is spent in designing by hand the kernel.
Question
How do we design a kernel adapted to the data?

Answer
A successful strategy is given by kernels for generative models, which
are/have been the state of the art in many fields, including image and
sequence representations.

Parametric model
A model is a family of distributions

{Pθ , θ ∈ Θ ⊂ Rm } ⊆ M+
1 (X ) .

Julien Mairal (Inria) 176/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 177/564


Fisher kernel
Definition
Fix a parameter θ0 ∈ Θ (e.g., by maximum likelihood over a
training set of sequences)
For each sequence x, compute the Fisher score vector:

Φθ0 (x) = ∇θ log Pθ (x)|θ=θ0 .

Form the kernel (Jaakkola et al., 2000):

K x, x0 = Φθ0 (x)> I(θ0 )−1 Φθ0 (x0 ) ,




where I(θ0 ) = E Φθ0 (x)Φθ0 (x)> is the Fisher information matrix.


 

Julien Mairal (Inria) 178/564


Fisher kernel properties (1/2)
The Fisher score describes how each parameter contributes to the
process of generating a particular example
A kernel classifier employing the Fisher kernel derived from a model
that contains the label as a latent variable is, asymptotically, at
least as good a classifier as the MAP labelling based on the model
(Jaakkola and Haussler, 1999).
A variant of the Fisher kernel (called the Tangent of Posterior
kernel) can also improve over the direct posterior classification by
helping to correct the effect of estimation errors in the parameter
(Tsuda et al., 2002).

Julien Mairal (Inria) 179/564


Fisher kernel properties (2/2)
Lemma
The Fisher kernel is invariant under change of parametrization.

Consider indeed different parametrization given by some


diffeomorphism λ = f (θ). The Jacobian matrix relating the
∂θ
parametrization is denoted by [J]ij = ∂λji .
The gradient of log-likelihood w.r.t. to the new parameters is

Φλ0 (x) = ∇λ log Pλ0 (x) = J∇θ log Pθ0 (x) = JΦθ0 (x).

the Fisher information matrix is


h i
I(λ0 ) = E Φθ0 (x)Φθ0 (x)> = JI(θ0 )J> .

we conclude by noticing that I(λ0 )−1 = J−1 I(θ0 )−1 J>−1 .

Julien Mairal (Inria) 180/564


Fisher kernel in practice

Φθ0 (x) can be computed explicitly for many models (e.g., HMMs),
where the model is first estimated from data.
I(θ0 ) is often replaced by the identity matrix for simplicity.
Several different models (i.e., different θ0 ) can be trained and
combined.
The Fisher vectors are defined as ϕθ0 (x) = I(θ0 )−1/2 Φθ0 (x). They
are explicitly computed and correspond to an explicit embedding:
K (x, x0 ) = ϕθ0 (x)> ϕθ0 (x0 ).

Julien Mairal (Inria) 181/564


Fisher kernels: example with Gaussian data model (1/2)
Consider a normal distribution N (µ, σ 2 ) and denote by α = 1/σ 2 the
inverse variance, i.e., precision parameter. With θ = (µ, α), we have
1 1 1
log Pθ (x) = log α − log(2π) − α(x − µ)2 ,
2 2 2
and thus
 
∂ log Pθ (x) ∂ log Pθ (x) 1 1
= α(x − µ), = − (x − µ)2 ,
∂µ ∂α 2 α
and (exercise)  
α 0
I(θ) = .
0 (1/2)α−2
The Fisher vector is then
 
(x − µ)/σ

ϕθ (x) = .
(1/ 2)(1 − (x − µ)2 /σ 2 )

Julien Mairal (Inria) 182/564


Fisher kernels: example with Gaussian data model (2/2)
Now consider an i.i.d. data model over a set of data points x1 , . . . , xn all
distributed according to N (µ, σ 2 ):
n
Y
Pθ (x1 , . . . , xn ) = Pθ (xi ).
i=1

Then, the Fisher vector is given by the sum of Fisher vectors of the
points.
Encodes the discrepancy in the first and second order moment of
the data w.r.t. those of the model.
n  
X (µ̂ − µ)/σ√
ϕ(x1 , . . . , xn ) = ϕ(xi ) = n ,
(σ 2 − σ̂ 2 )/( 2σ 2 )
i=1

where
n n
1X 1X
µ̂ = xi and σ̂ = (xi − µ̂)2 .
n n
i=1 i=1

Julien Mairal (Inria) 183/564


Application: Aggregation of visual words (1/4)
Patch extraction and description stage:
In various contexts, images may be described as a set of
patches x1 , . . . , xn computed at interest points. For example, SIFT,
HOG, LBP, color histograms, convolutional features...
Coding stage: The set of patches is then encoded into a single
representation ϕ(xi ), typically in a high-dimensional space.
Pooling stage: For example, sum pooling
n
X
ϕ(x1 , . . . , xn ) = ϕ(xi ).
i=1

Fisher vectors with a Gaussian Mixture Model (GMM) is


considered to be a state-of-the-art aggregation
technique [Perronnin and Dance, 2007].

Julien Mairal (Inria) 184/564


Application: Aggregation of visual words (2/4)
Let θ = (πj , µj , Σj )j=1 ldots,k be the parameters of a GMM with k
Gaussian components. Then, the probabilistic model is given by
k
X
Pθ (x) = πj N (x; µj , Σj ).
j=1

Remarks
Each mixture component corresponds to a visual word, with a
mean, variance, and mixing weight.
Diagonal covariances Σj = diag (σj1 , . . . , σjp ) = diag (σ j ) are often
used for simplicity.
This is a richer model than the traditional “bag of words” approach.
The probabilistic model is learned offline beforehand.

Julien Mairal (Inria) 185/564


Application: Aggregation of visual words (3/4)
After a few calculations (exercise), we obtain ϕθ (x1 , . . . , xn ) =

[ϕπ1 (X), . . . , ϕπp (X), ϕµ1 (X)> , . . . , ϕµp (X)> , ϕσ1 (X)> , . . . , ϕσp (X)> ]> ,

with
n
1 X
ϕµj (X) = √ γij (xi − µj )/σ j
n πj
i=1
n
1 X
γij (xi − µj )2 /σ 2j − 1 ,
 
ϕσj (X) = p
n 2πj i=1

where with an abuse of notation, the division between two vectors is


meant elementwise and the scalars γij can be interpreted as the
soft-assignment of word i to component j:
πj N (xi ; µj , σ j )
γij = Pk .
l=1 π l N (x i ; µl , σ l )

Julien Mairal (Inria) 186/564


Application: Aggregation of visual words (4/4)
Finally, we also have the following interpretation of encoding first and
second-order statistics:
γj
ϕµj (X) = √ (µ̂j − µj )/σ j
πj
γj
ϕσj (X) = p (σ̂ 2j − σ 2j )/σ 2j ,
2πj

with
n n n
X 1 X 1 X
γj = γij and µ̂j = γij xi and σ̂ j = γij (xi − µj )2 .
γj γj
i=1 i=1 i=1

The component ϕπ (X ) is often dropped due to its negligible


contribution in practice, and the resulting representation is of
dimension 2kp where p is the dimension of the xi ’s.

Julien Mairal (Inria) 187/564


Relation to classification with generative models (1/3)
Assume that we have a generative probabilistic model Pθ to model
random variables (X , Y ) where Y is a label in {1, . . . , p}.
Assume that the marginals Pθ (Y = k) = πk are among the model
parameters θ, which we can also parametrize as
e αk
Pθ (Y = k) = πk = Pp αk 0
.
k 0 =1 e
The classification of a new point x can be obtained via Bayes’ rule:
ŷ (x) = argmax Pθ (Y = k|x),
k=1,...,p

where Pθ (Y = k|x) is short for Pθ (Y = k|X = x) and


Pθ (Y = k|x) = Pθ (x|Y = k)Pθ (Y = k)/Pθ (x)
p
X
= Pθ (x|Y = k)πk / Pθ (x|Y = k 0 )πk 0
k 0 =1

Julien Mairal (Inria) 188/564


Relation to classification with generative models (2/3)
Then, consider the Fisher score
1
∇θ log Pθ (x) = ∇θ Pθ (x)
Pθ (x)
p
1 X
= ∇θ Pθ (x, Y = k)
Pθ (x)
k=1
p
1 X
= Pθ (x, Y = k)∇θ log Pθ (x, Y = k)
Pθ (x)
k=1
p
X
= Pθ (Y = k|x)[∇θ log πk + ∇θ log Pθ (x|Y = k)].
k=1

In particular (exercise)

∂ log Pθ (x)
= Pθ (Y = k|x) − πk .
∂αk

Julien Mairal (Inria) 189/564


Relation to classification with generative models (3/3)
The first p elements in the Fisher score are given by class posteriors
minus a constant

ϕθ (x) = [Pθ (Y = 1|x) − π1 , . . . , Pθ (Y = p|x) − πp , ...].

Consider a multi-class linear classifier on ϕθ(x) such that for class k


The weights are zero except one for the k-th position;
The intercept bk be −πk ;
Then,

ŷ (x) = argmax ϕθ (x)> wk + bk


k=1,...,p

ŷ (x) = argmax Pθ (Y = k|x).


k=1,...,p

Bayes’ rule is implemented via this simple classifier using Fisher kernel.

Julien Mairal (Inria) 190/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 191/564


Mutual information kernels
Definition
Chose a prior w (dθ) on the measurable set Θ.
Form the kernel (Seeger, 2002):
Z
0
Pθ (x)Pθ (x0 )w (dθ) .

K x, x =
θ∈Θ

No explicit computation of a finite-dimensional feature vector.


K (x, x0 ) =< ϕ (x) , ϕ (x0 ) >L2 (w ) with

ϕ (x) = (Pθ (x))θ∈Θ .

Julien Mairal (Inria) 192/564


Example: coin toss
Let Pθ (X = 1) = θ and Pθ (X = 0) = 1 − θ a model for random
coin toss, with θ ∈ [0, 1].
Let dθ be the Lebesgue measure on [0, 1]
The mutual information kernel between x = 001 and x0 = 1010 is:
(
Pθ (x) = θ (1 − θ)2 ,
Pθ (x0 ) = θ2 (1 − θ)2 ,
Z 1
3!4! 1
0
θ3 (1 − θ)4 dθ =

K x, x = = .
0 8! 280

Julien Mairal (Inria) 193/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Fisher kernel
Mutual information kernels
Marginalized kernels
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 194/564


Marginalized kernels
Definition
For any observed data x ∈ X , let a latent variable y ∈ Y be
associated probabilistically through a conditional probability
Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):

KX x, x0 := EPx (dy)×Px0 (dy0 ) KZ z, z0


 
Z Z
KZ (x, y) , x0 , y0 Px (dy) Px0 dy0 .
 
=

Julien Mairal (Inria) 195/564


Marginalized kernels: proof of positive definiteness
KZ is p.d. on Z. Therefore there exists a Hilbert space H and
ΦZ : Z → H such that:

KZ z, z0 = ΦZ (z) , ΦZ z0 H .
 

Marginalizing therefore gives:

KX x, x0 = EPx (dy)×Px0 (dy0 ) KZ z, z0


 

= EPx (dy)×Px0 (dy0 ) ΦZ (z) , ΦZ z0 H




= EPx (dy) ΦZ (z) , EPx (dy0 ) ΦZ z0 H ,




therefore KX is p.d. on X . 

Julien Mairal (Inria) 196/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

5 Open Problems and Research Topics

Julien Mairal (Inria) 197/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 198/564


Short history of genomics

1866 : Laws of heredity (Mendel)


1909 : Morgan and the drosophilists
1944 : DNA supports heredity (Avery)
1953 : Structure of DNA (Crick and Watson)
1966 : Genetic code (Nirenberg)
1960-70 : Genetic engineering
1977 : Method for sequencing (Sanger)
1982 : Creation of Genbank
1990 : Human genome project launched
2003 : Human genome project completed

Julien Mairal (Inria) 199/564


A cell

Julien Mairal (Inria) 200/564


Chromosomes

Julien Mairal (Inria) 201/564


Chromosomes and DNA

Julien Mairal (Inria) 202/564


Structure of DNA

“We wish to suggest a


structure for the salt of
desoxyribose nucleic acid
(D.N.A.). This structure have
novel features which are of
considerable biological
interest” (Watson and Crick,
1953)

Julien Mairal (Inria) 203/564


The double helix

Julien Mairal (Inria) 204/564


Central dogma

Julien Mairal (Inria) 205/564


Proteins

Julien Mairal (Inria) 206/564


Genetic code

Julien Mairal (Inria) 207/564


Human genome project
Goal : sequence the 3,000,000,000 bases of the human genome
Consortium with 20 labs, 6 countries
Cost : about 3,000,000,000 USD

Julien Mairal (Inria) 208/564


2003: End of genomics era

Findings
About 25,000 genes only (representing 1.2% of the genome).
Automatic gene finding with graphical models.
97% of the genome is considered “junk DNA”.
Superposition of a variety of signals (many to be discovered).

Julien Mairal (Inria) 209/564


Protein sequence

A : Alanine V : Valine L : Leucine


F : Phenylalanine P : Proline M : Methionine
E : Glutamic acid K : Lysine R : Arginine
T : Threonine C : Cysteine N : Asparagine
H : Histidine Y : Tyrosine W : Tryptophane
I : Isoleucine S : Serine Q : Glutamine
D : Aspartic acid G : Glycine

Julien Mairal (Inria) 210/564


Challenges with protein sequences
A protein sequences can be seen as a variable-length sequence over
the 20-letter alphabet of amino-acids, e.g., insuline:
FVNQHLCGSHLVEALYLVCGERGFFYTPKA
These sequences are produced at a fast rate (result of the
sequencing programs)
Need for algorithms to compare, classify, analyze these sequences
Applications: classification into functional or structural classes,
prediction of cellular localization and interactions, ...

Julien Mairal (Inria) 211/564


Example: supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...

Goal
Build a classifier to predict whether new proteins are secreted or not.

Julien Mairal (Inria) 212/564


Supervised classification with vector embedding
The idea
Map each string x ∈ X to a vector Φ(x) ∈ F.
Train a classifier for vectors on the images Φ(x1 ), . . . , Φ(xn ) of the
training set (nearest neighbor, linear perceptron, logistic regression,
support vector machine...)

φ
X F
maskat...
msises
marssl...
malhtv...
mappsv...
mahtlg...

Julien Mairal (Inria) 213/564


Kernels for protein sequences
Kernel methods have been widely investigated since Jaakkola et
al.’s seminal paper (1998).
What is a good kernel?
it should be mathematically valid (symmetric, p.d. or c.p.d.)
fast to compute
adapted to the problem (give good performances)

Julien Mairal (Inria) 214/564


Kernel engineering for protein sequences

Define a (possibly high-dimensional) feature space of interest


Physico-chemical kernels
Spectrum, mismatch, substring kernels
Pairwise, motif kernels

Julien Mairal (Inria) 215/564


Kernel engineering for protein sequences

Define a (possibly high-dimensional) feature space of interest


Physico-chemical kernels
Spectrum, mismatch, substring kernels
Pairwise, motif kernels
Derive a kernel from a generative model
Fisher kernel
Mutual information kernel
Marginalized kernel

Julien Mairal (Inria) 215/564


Kernel engineering for protein sequences

Define a (possibly high-dimensional) feature space of interest


Physico-chemical kernels
Spectrum, mismatch, substring kernels
Pairwise, motif kernels
Derive a kernel from a generative model
Fisher kernel
Mutual information kernel
Marginalized kernel
Derive a kernel from a similarity measure
Local alignment kernel

Julien Mairal (Inria) 215/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 216/564


Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
Φ (x) ∈ Rn . How to perform this embedding?

Julien Mairal (Inria) 217/564


Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
Φ (x) ∈ Rn . How to perform this embedding?

Physico-chemical kernel
Extract relevant features, such as:
length of the sequence
time series analysis of numerical physico-chemical properties of
amino-acids along the sequence (e.g., polarity, hydrophobicity),
using for example:
Fourier transforms (Wang et al., 2004)
Autocorrelation functions (Zhang et al., 2003)
n−j
1 X
rj = hi hi+j
n−j
i=1

Julien Mairal (Inria) 217/564


Substring indexation
The approach
Alternatively, index the feature space by fixed-length strings, i.e.,

Φ (x) = (Φu (x))u∈Ak

where Φu (x) can be:


the number of occurrences of u in x (without gaps) : spectrum
kernel (Leslie et al., 2002)
the number of occurrences of u in x up to m mismatches (without
gaps) : mismatch kernel (Leslie et al., 2004)
the number of occurrences of u in x allowing gaps, with a weight
decaying exponentially with the number of gaps : substring kernel
(Lohdi et al., 2002)

Julien Mairal (Inria) 218/564


Example: spectrum kernel (1/2)
Kernel definition
The 3-spectrum of
x = CGGSLIAMMWFGV
is:
(CGG,GGS,GSL,SLI,LIA,IAM,AMM,MMW,MWF,WFG,FGV) .
Let Φu (x) denote the number of occurrences of u in x. The
k-spectrum kernel is:
X
K x, x0 := Φu (x) Φu x0 .
 

u∈Ak

Julien Mairal (Inria) 219/564


Example: spectrum kernel (2/2)
Implementation
The computation of the kernel is formally a sum over |A|k terms,
but at most | x | − k + 1 terms are non-zero in Φ (x) =⇒
Computation in O (| x | + | x0 |) with pre-indexation of the strings.
Fast classification of a sequence x in O (| x |):
| x |−k+1
X X
f (x) = w · Φ (x) = wu Φu (x) = wxi ...xi+k−1 .
u i=1

Remarks
Work with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.

Julien Mairal (Inria) 220/564


Example 2: Substring kernel (1/11)
Definition
For 1 ≤ k ≤ n ∈ N, we denote by I(k, n) the set of sequences of
indices i = (i1 , . . . , ik ), with 1 ≤ i1 < i2 < . . . < ik ≤ n.
For a string x = x1 . . . xn ∈ X of length n, for a sequence of indices
i ∈ I(k, n), we define a substring as:

x (i) := xi1 xi2 . . . xik .

The length of the substring is:

l (i) = ik − i1 + 1.

Julien Mairal (Inria) 221/564


Example 2: Substring kernel (2/11)
Example

ABRACADABRA
i = (3, 4, 7, 8, 10)
x (i) =RADAR
l (i) = 10 − 3 + 1 = 8

Julien Mairal (Inria) 222/564


Example 2: Substring kernel (3/11)
The kernel
Let k ∈ N and λ ∈ R+ fixed. For all u ∈ Ak , let Φu : X → R be
defined by:
X
∀x ∈ X , Φu (x) = λl(i) .
i∈I(k,| x |): x(i)=u

The substring kernel is the p.d. kernel defined by:


X
∀ x, x0 ∈ X 2 , Kk,λ x, x0 = Φu (x) Φu x0 .
  

u∈Ak

Julien Mairal (Inria) 223/564


Example 2: Substring kernel (4/11)
Example

u ca ct at ba bt cr ar br
Φu (cat) λ2 λ3 λ2 0 0 0 0 0
Φu (car) λ2 0 0 0 0 λ3 λ2 0
Φu (bat) 0 0 λ2 λ2 λ3 0 0 0
Φu (bar) 0 0 0 λ2 0 0 λ2 λ3

4 6
K (cat,cat) = K (car,car) = 2λ + λ

K (cat,car) = λ4

K (cat,bar) = 0

Julien Mairal (Inria) 224/564


Example 2: Substring kernel (5/11)
Kernel computation
We need to compute, for any pair x, x0 ∈ X , the kernel:
X
Kn,λ x, x0 = Φu (x) Φu x0
 

u∈Ak
0
X X X
= λl(i)+l(i ) .
u∈Ak i:x(i)=u i0 :x0 (i0 )=u

Enumerating the substrings is too slow (of order | x |k ).

Julien Mairal (Inria) 225/564


Example 2: Substring kernel (6/11)
Kernel computation (cont.)
For u ∈ Ak remember that:
X
Φu (x) = λin −i1 +1 .
i:x(i)=u

Let now: X
Ψu (x) = λ| x |−i1 +1 .
i:x(i)=u

Julien Mairal (Inria) 226/564


Example 2: Substring kernel (7/11)
Kernel computation (cont.)
Let us note x (1, j) = x1 . . . xj . A simple rewriting shows that, if we note
a ∈ A the last letter of u (u = va):
X
Φva (x) = Ψv (x (1, j − 1)) λ ,
j∈[1,| x |]:xj =a

and X
Ψva (x) = Ψv (x (1, j − 1)) λ| x |−j+1 .
j∈[1,| x |]:xj =a

Julien Mairal (Inria) 227/564


Example 2: Substring kernel (8/11)
Kernel computation (cont.)
Moreover we observe that if the string is of the form xa (i.e., the last
letter is a ∈ A), then:
If the last letter of u is not a:
(
Φu (xa) = Φu (x) ,
Ψu (xa) = λΨu (x) .

If the last letter of u is a (i.e., u = va with v ∈ An−1 ):


(
Φva (xa) = Φva (x) + λΨv (x) ,
Ψva (xa) = λΨva (x) + λΨv (x) .

Julien Mairal (Inria) 228/564


Example 2: Substring kernel (9/11)
Kernel computation (cont.)
Let us now show how the function:
X
Bn x, x0 := Ψu (x) Ψu x0
 

u∈An

and the kernel: X


Kn x, x0 := Φu (x) Φu x0
 

u∈An

can be computed recursively. We note that:


(
B0 (x, x0 ) = K0 (x, x0 ) = 0 for all x, x0
Bk (x, x0 ) = Kk (x, x0 ) = 0 if min (| x | , | x0 |) < k

Julien Mairal (Inria) 229/564


Example 2: Substring kernel (10/11)
Recursive computation of Bn

Bn xa, x0

X
Ψu (xa) Ψu x0

=
u∈An
X X
Ψu (x) Ψu x0 + λ Ψv (x) Ψva x0
 

u∈An v∈An−1
0

= λBn x, x +
 
0
X X
x0 (1, j − 1) λ| x |−j+1 

λ Ψv (x)  Ψv
v∈An−1 j∈[1,| x0 |]:xj0 =a
X  0
= λBn x, x0 + Bn−1 x, x0 (1, j − 1) λ| x |−j+2


j∈[1,| x0 |]:xj0 =a

Julien Mairal (Inria) 230/564


Example 2: Substring kernel (10/11)
Recursive computation of Bn

Bn xa, x0 b

X  0
= λBn x, x0 b + λ Bn−1 x, x0 (1, j − 1) λ| x |−j+2


j∈[1,| x0 |]:xj0 =a

+ δa=b Bn−1 (x, x0 )λ2


x, x0 b + λ(Bn (xa, x0 ) − λBn (x, x0 )) + δa=b Bn−1 (x, x0 )λ2

= λBn
x, x0 b + λBn (xa, x0 ) − λ2 Bn (x, x0 ) + δa=b Bn−1 (x, x0 )λ2 .

= λBn

The dynamic programming table can be filled in O(n|x||x0 |) operations.

Julien Mairal (Inria) 231/564


Example 2: Substring kernel (10/11)
Recursive computation of Kn

Kn xa, x0

X
Φu (xa) Φu x0

=
u∈An
X X
Φu (x) Φu x0 + λ Ψv (x) Φva x0
 
=
u∈An v∈An−1
0

= Kn x, x +
 
X X
x0 (1, j − 1) λ

λ Ψv (x)  Ψv
v∈An−1 j∈[1,| x0 |]:xj0 =a
X
= λKn x, x0 + λ2 Bn−1 x, x0 (1, j − 1)
 

j∈[1,| x0 |]:xj0 =a

Julien Mairal (Inria) 232/564


Summary: Substring indexation

Implementation in O(|x| + |x0 |) in memory and time for the


spectrum and mismatch kernels (with suffix trees)
Implementation in O(k(|x| + |x0 |)) in memory and time for the
spectrum and mismatch kernels (with tries)
Implementation in O(k|x| × |x0 |) in memory and time for the
substring kernels
The feature space has high dimension (|A|k ), so learning requires
regularized methods (such as SVM)

Julien Mairal (Inria) 233/564


Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping ΦD (x) = (s (x, xi ))xi ∈D

Julien Mairal (Inria) 234/564


Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping ΦD (x) = (s (x, xi ))xi ∈D

Examples
This includes:
Motif kernels (Logan et al., 2001): the dictionary is a library of
motifs, the similarity function is a matching function
Pairwise kernel (Liao & Noble, 2003): the dictionary is the training
set, the similarity is a classical measure of similarity between
sequences.

Julien Mairal (Inria) 234/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 235/564


Probabilistic models for sequences
Probabilistic modeling of biological sequences is older than kernel
designs. Important models include HMM for protein sequences, SCFG
for RNA sequences.

Recall: parametric model


A model is a family of distributions

{Pθ , θ ∈ Θ ⊂ Rm } ⊂ M+
1 (X )
Julien Mairal (Inria) 236/564
Context-tree model
Definition
A context-tree model is a variable-memory Markov chain:
n
Y
PD,θ (x) = PD,θ (x1 . . . xD ) PD,θ (xi | xi−D . . . xi−1 )
i=D+1

D is a suffix tree
θ ∈ ΣD is a set of conditional probabilities (multinomials)

Julien Mairal (Inria) 237/564


Context-tree model: example

P(AABACBACC ) = P(AAB)θAB (A)θA (C )θC (B)θACB (A)θA (C )θC (A) .

Julien Mairal (Inria) 238/564


The context-tree kernel
Theorem (Cuturi et al., 2005)
For particular choices of priors, the context-tree kernel:
Z
0
 X
K x, x = PD,θ (x)PD,θ (x0 )w (dθ|D)π(D)
D θ∈ΣD

can be computed in O(|x| + |x0 |) with a variant of the Context-Tree


Weighting algorithm.
This is a valid mutual information kernel.
The similarity is related to information-theoretical measure of
mutual information between strings.

Julien Mairal (Inria) 239/564


Marginalized kernels
Recall: Definition
For any observed data x ∈ X , let a latent variable y ∈ Y be
associated probabilistically through a conditional probability
Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):

KX x, x0 := EPx (dy)×Px0 (dy0 ) KZ z, z0


 
Z Z
KZ (x, y) , x0 , y0 Px (dy) Px0 dy0 .
 
=

Julien Mairal (Inria) 240/564


Example: HMM for normal/biased coin toss
0.85

N 0.05
0.5
0.1 E Normal (N) and biased (B)
S 0.1 coins (not observed)
B
0.5 0.05

0.85
Observed output are 0/1 with probabilities:
(
π(0|N) = 1 − π(1|N) = 0.5,
π(0|B) = 1 − π(1|B) = 0.8.

Example of realization (complete data):


NNNNNBBBBBBBBBNNNNNNNNNNNBBBBBB
1001011101111010010111001111011
Julien Mairal (Inria) 241/564
1-spectrum kernel on complete data
If both x ∈ A∗ and y ∈ S ∗ were observed, we might rather use the
1-spectrum kernel on the complete data z = (x, y):
X
KZ z, z0 =

na,s (z) na,s (z) ,
(a,s)∈A×S

where na,s (x, y) for a = 0, 1 and s = N, B is the number of


occurrences of s in y which emit a in x.
Example:
z =1001011101111010010111001111011,
z0 =0011010110011111011010111101100101,

z, z0 = n0 (z) n0 z0 + n0 (z) n0 z0 + n1 (z) n1 z0 + n1 (z) n1 z0


   
KZ
= 7 × 15 + 9 × 12 + 13 × 6 + 2 × 1 = 293.

Julien Mairal (Inria) 242/564


1-spectrum marginalized kernel on observed data
The marginalized kernel for observed data is:
X
KX x, x0 = KZ ((x, y) , (x, y)) P (y|x) P y0 |x0
 

y,y0 ∈S ∗
X
Φa,s (x) Φa,s x0 ,

=
(a,s)∈A×S

with X
Φa,s (x) = P (y|x) na,s (x, y)
y∈S ∗

Julien Mairal (Inria) 243/564


Computation of the 1-spectrum marginalized kernel

X
Φa,s (x) = P (y|x) na,s (x, y)
y∈S ∗
( n )
X X
= P (y|x) δ (xi , a) δ (yi , s)
y∈S ∗ i=1
 
n
X X 
= δ (xi , a) P (y|x) δ (yi , s)

 
i=1 y∈S
n
X
= δ (xi , a) P (yi = s|x) .
i=1

and P (yi = s|x) can be computed efficiently by forward-backward


algorithm!

Julien Mairal (Inria) 244/564


HMM example (DNA)

Julien Mairal (Inria) 245/564


HMM example (protein)

Julien Mairal (Inria) 246/564


SCFG for RNA sequences

SFCG rules
S → SS
S → aSa
S → aS
S →a

Marginalized kernel (Kin et al., 2002)


Feature: number of occurrences of each (base,state) combination
Marginalization using classical inside/outside algorithm

Julien Mairal (Inria) 247/564


Marginalized kernels in practice
Examples
Spectrum kernel on the hidden states of a HMM for protein
sequences (Tsuda et al., 2002)
Kernels for RNA sequences based on SCFG (Kin et al., 2002)
Kernels for graphs based on random walks on graphs (Kashima et
al., 2004)
Kernels for multiple alignments based on phylogenetic models (Vert
et al., 2006)

Julien Mairal (Inria) 248/564


Marginalized kernels: example

PC2 A set of 74 human tRNA


sequences is analyzed using a
kernel for sequences (the
second-order marginalized
kernel based on SCFG). This
PC1
set of tRNAs contains three
classes, called Ala-AGC (white
circles), Asn-GTT (black
circles) and Cys-GCA (plus
symbols) (from Tsuda et al.,
2002).

Julien Mairal (Inria) 249/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 250/564


Sequence alignment
Motivation
How to compare 2 sequences?

x1 = CGGSLIAMMWFGV
x2 = CLIVMMNRLMWFGV

Find a good alignment:

CGGSLIAMM------WFGV
|...|||||....||||
C-----LIVMMNRLMWFGV

Julien Mairal (Inria) 251/564


Alignment score
In order to quantify the relevance of an alignment π, define:
a substitution matrix S ∈ RA×A
a gap penalty function g : N → R
Any alignment is then scored as follows

CGGSLIAMM------WFGV
|...|||||....||||
C----LIVMMNRLMWFGV

sS,g (π) = S(C , C ) + S(L, L) + S(I , I ) + S(A, V ) + 2S(M, M)


+ S(W , W ) + S(F , F ) + S(G , G ) + S(V , V ) − g (3) − g (4)

Julien Mairal (Inria) 252/564


Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g (π).
π∈Π(x,y)

It is symmetric, but not positive definite...

Julien Mairal (Inria) 253/564


Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g (π).
π∈Π(x,y)

It is symmetric, but not positive definite...

LA kernel (Saigo et al., 2004)


The local alignment kernel:
(β)
X
KLA (x, y) = exp (βsS,g (x, y, π)) ,
π∈Π(x,y)

is symmetric positive definite.

Julien Mairal (Inria) 253/564


LA kernel is p.d.: proof (1/11)
Lemma
If K1 and K2 are p.d. kernels, then:

K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0,

are also p.d. kernels


If (Ki )i≥1 is a sequence of p.d. kernels that converges pointwisely
to a function K :

∀ x, x0 ∈ X 2 , K x, x0 = lim Ki x, x0 ,
  
n→∞

then K is also a p.d. kernel.

Julien Mairal (Inria) 254/564


LA kernel is p.d.: proof (2/11)
Proof of lemma
Let A and B be n × n positive semidefinite matrices. By diagonalization
of A:
Xn
Ai,j = fp (i)fp (j)
p=1

for some vectors f1 , . . . , fn . Then, for any α ∈ Rn :


n
X n X
X n
αi αj Ai,j Bi,j = αi fp (i)αj fp (j)Bi,j ≥ 0.
i,j=1 p=1 i,j=1

The matrix Ci,j = Ai,j Bi,j is therefore p.d. Other properties are obvious
from definition. 

Julien Mairal (Inria) 255/564


LA kernel is p.d.: proof (3/11)
Lemma (direct sum and product of kernels)
Let X = X1 × X2 . Let K1 be a p.d. kernel on X1 , and K2 be a p.d.
kernel on X2 . Then the following functions are p.d. kernels on X :
the direct sum,

K ((x1 , x2 ) , (y1 , y2 )) = K1 (x1 , y1 ) + K2 (x2 , y2 ) ,

The direct product:

K ((x1 , x2 ) , (y1 , y2 )) = K1 (x1 , y1 ) K2 (x2 , y2 ) .

Julien Mairal (Inria) 256/564


LA kernel is p.d.: proof (4/11)
Proof of lemma
If K1 is a p.d. kernel, let Φ1 : X1 7→ H be such that:

K1 (x1 , y1 ) = hΦ1 (x1 ) , Φ1 (y1 )iH .

Let Φ : X1 × X2 → H be defined by:

Φ ((x1 , x2 )) = Φ1 (x1 ) .

Then for x = (x1 , x2 ) and y = (y1 , y2 ) ∈ X , we get

hΦ ((x1 , x2 )) , Φ ((y1 , y2 ))iH = K1 (x1 , x2 ) ,

which shows that K (x, y) := K1 (x1 , y1 ) is p.d. on X1 × X2 . The lemma


follows from the properties of sums and products of p.d. kernels. 

Julien Mairal (Inria) 257/564


LA kernel is p.d.: proof (5/11)
Lemma: kernel for sets
Let K be a p.d. kernel on X , and let P (X ) be the set of finite subsets
of X . Then the function KP on P (X ) × P (X ) defined by:
XX
∀A, B ∈ P (X ) , KP (A, B) := K (x, y)
x∈A y∈B

is a p.d. kernel on P (X ).

Julien Mairal (Inria) 258/564


LA kernel is p.d.: proof (6/11)
Proof of lemma
Let Φ : X 7→ H be such that

K (x, y) = hΦ (x) , Φ (y)iH .

Then, for A, B ∈ P (X ), we get:


XX
KP (A, B) = hΦ (x) , Φ (y)iH
x∈A y∈B
* +
X X
= Φ (x) , Φ (y)
x∈A y∈B H
= hΦP (A), ΦP (B)iH ,
P
with ΦP (A) := x∈A Φ (x). 

Julien Mairal (Inria) 259/564


LA kernel is p.d.: proof (7/11)
Definition: Convolution kernel (Haussler, 1999)
Let K1 and K2 be two p.d. kernels for strings. The convolution of K1
and K2 , denoted K1 ? K2 , is defined for any x, x0 ∈ X by:
X
K1 ? K2 (x, y) := K1 (x1 , y1 )K2 (x2 , y2 ).
x1 x2 =x,y1 y2 =y

Lemma
If K1 and K2 are p.d. then K1 ? K2 is p.d..

Julien Mairal (Inria) 260/564


LA kernel is p.d.: proof (8/11)
Proof of lemma
Let X be the set of finite-length strings. For x ∈ X , let

R (x) = {(x1 , x2 ) ∈ X × X : x = x1 x2 } ⊂ X × X .

We can then write


X X
K1 ? K2 (x, y) = K1 (x1 , y1 )K2 (x2 , y2 )
(x1 ,x2 )∈R(x) (y1 ,y2 )∈R(y)

which is a p.d. kernel by the previous lemmas. 

Julien Mairal (Inria) 261/564


LA kernel is p.d.: proof (9/11)
3 basic string kernels
The constant kernel:
K0 (x, y) := 1 .

A kernel for letters:



    K_a^(β)(x, y) := 0 if |x| ≠ 1 or |y| ≠ 1,   K_a^(β)(x, y) := exp(β S(x, y)) otherwise.

A kernel for gaps:


    K_g^(β)(x, y) = exp[β (g(|x|) + g(|y|))].

Julien Mairal (Inria) 262/564


LA kernel is p.d.: proof (10/11)
Remark
S : A² → R is the similarity function between letters used in the
alignment score. K_a^(β) is only p.d. when the matrix

    (exp(β S(a, b)))_{(a,b) ∈ A²}

is positive semidefinite (this is true for all β when S is conditionally p.d.).
g is the gap penalty function used in the alignment score. The gap
kernel is always p.d. (with no restriction on g) because it can be
written as:

    K_g^(β)(x, y) = exp(β g(|x|)) × exp(β g(|y|)).

Julien Mairal (Inria) 263/564


LA kernel is p.d.: proof (11/11)
Lemma
The local alignment kernel is a (limit of) convolution kernels:

    K_LA^(β) = Σ_{n=0}^∞ K_0 ⋆ (K_a^(β) ⋆ K_g^(β))^{(n−1)} ⋆ K_a^(β) ⋆ K_0.

As such it is p.d..

Proof (sketch)
By induction on n (simple but long to write).
See details in Vert et al. (2004).

Julien Mairal (Inria) 264/564


LA kernel computation
We assume an affine gap penalty:

    g(0) = 0,    g(n) = d + e(n − 1) if n ≥ 1.

The LA kernel can then be computed by dynamic programming by:

    K_LA^(β)(x, y) = 1 + X2(|x|, |y|) + Y2(|x|, |y|) + M(|x|, |y|),

where M(i, j), X(i, j), Y(i, j), X2(i, j), and Y2(i, j) for 0 ≤ i ≤ |x|
and 0 ≤ j ≤ |y| are defined recursively.

Julien Mairal (Inria) 265/564


LA kernel computation (cont.)
Initialization

    M(i, 0) = M(0, j) = 0,
    X(i, 0) = X(0, j) = 0,
    Y(i, 0) = Y(0, j) = 0,
    X2(i, 0) = X2(0, j) = 0,
    Y2(i, 0) = Y2(0, j) = 0.

Julien Mairal (Inria) 266/564


LA kernel computation (cont.)
Recursion
For i = 1, . . . , |x| and j = 1, . . . , |y|:

    M(i, j)  = exp(β S(x_i, y_j)) [1 + X(i−1, j−1) + Y(i−1, j−1) + M(i−1, j−1)],
    X(i, j)  = exp(β d) M(i−1, j) + exp(β e) X(i−1, j),
    Y(i, j)  = exp(β d) [M(i, j−1) + X(i, j−1)] + exp(β e) Y(i, j−1),
    X2(i, j) = M(i−1, j) + X2(i−1, j),
    Y2(i, j) = M(i, j−1) + X2(i, j−1) + Y2(i, j−1).

Julien Mairal (Inria) 267/564
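The recursion above translates directly into code. The sketch below is a literal transcription (plain Python/NumPy); the substitution scores S, the parameter β, and the gap parameters d, e are user-supplied placeholders, with the same sign conventions as in the recursion.

```python
import numpy as np

def la_kernel(x, y, S, beta, d, e):
    """Local alignment kernel K_LA^(beta)(x, y) by dynamic programming.

    x, y : strings; S : dict mapping a pair of letters to a substitution score;
    d, e : affine gap parameters, used exactly as in the recursion above.
    """
    n, m = len(x), len(y)
    M  = np.zeros((n + 1, m + 1))
    X  = np.zeros((n + 1, m + 1))
    Y  = np.zeros((n + 1, m + 1))
    X2 = np.zeros((n + 1, m + 1))
    Y2 = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i, j] = np.exp(beta * S[(x[i-1], y[j-1])]) * (
                1 + X[i-1, j-1] + Y[i-1, j-1] + M[i-1, j-1])
            X[i, j] = np.exp(beta * d) * M[i-1, j] + np.exp(beta * e) * X[i-1, j]
            Y[i, j] = (np.exp(beta * d) * (M[i, j-1] + X[i, j-1])
                       + np.exp(beta * e) * Y[i, j-1])
            X2[i, j] = M[i-1, j] + X2[i-1, j]
            Y2[i, j] = M[i, j-1] + X2[i, j-1] + Y2[i, j-1]
    return 1 + X2[n, m] + Y2[n, m] + M[n, m]
```

As noted on the next slide, the returned values grow exponentially with the sequence lengths, so a practical implementation rather works with their logarithm.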


LA kernel in practice

Implementation by a finite-state transducer in O(|x| × |x′|).

[Figure: finite-state transducer computing the LA kernel, with states B, X1, X, X2, M, Y1, Y, Y2, E and transitions weighted by m(a,b), D and E.]

In practice, values are too large (exponential scale), so taking the
logarithm is a safer choice (but it is not p.d. anymore!)

Julien Mairal (Inria) 268/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Motivations and history of genomics
Kernels derived from large feature spaces
Kernels derived from generative models
Kernels derived from a similarity measure
Application to remote homology detection
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 269/564


Remote homology

[Figure: sequence similarity axis, going from non homologs through the twilight zone to close homologs.]

Homologs have common ancestors


Structures and functions are more conserved than sequences
Remote homologs can not be detected by direct sequence
comparison

Julien Mairal (Inria) 270/564


SCOP database

[Figure: the SCOP hierarchy (Fold > Superfamily > Family); close homologs belong to the same family, remote homologs only to the same superfamily.]

Julien Mairal (Inria) 271/564


A benchmark experiment
Goal: recognize directly the superfamily
Training: for a sequence of interest, positive examples come from
the same superfamily, but different families. Negative from other
superfamilies.
Test: predict the superfamily.

Julien Mairal (Inria) 272/564


Difference in performance
[Figure: number of families with given performance (0–60) as a function of ROC50 (0–1), for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher.]

Performance on the SCOP superfamily recognition benchmark (from


Saigo et al., 2004).

Julien Mairal (Inria) 273/564


String kernels: Summary
A variety of principles for string kernel design have been proposed.
Good kernel design is important for each data and each task.
Performance is not the only criterion.
Still an art, although principled ways have started to emerge.
Fast implementation with string algorithms is often possible.
Their application goes well beyond computational biology.

Julien Mairal (Inria) 274/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

5 Open Problems and Research Topics

Julien Mairal (Inria) 275/564


Motivations
The RKHS norm is related to the smoothness of functions.
Smoothness of a function is naturally quantified by Sobolev norms
(in particular L2 norms of derivatives), or by the decay of the
Fourier transform.
In this section, we introduce several kernels where this link is explicit,
and we make a general link between RKHS and Green functions
defined by differential operators.

Julien Mairal (Inria) 276/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Shift-invariant kernels
Generalization to semigroups
Mercer kernels
RKHS and Green functions
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 277/564


Translation invariant kernels
Definition
A kernel K : Rd × Rd 7→ R is called translation invariant (t.i.) if it only
depends on the difference between its argument, i.e.:

∀ (x, y) ∈ R2d , K (x, y) = κ (x − y) .

Examples
Gaussian kernel (or RBF kernel)
1 2
K (x, y) = e − 2σ2 kx−yk2 .

Laplace kernel
K (x, y) = e −αkx−yk1 .

Julien Mairal (Inria) 278/564


In case of...
Definition
Let f ∈ L¹(Rd). The Fourier transform of f, denoted f̂ or F[f], is the
function defined for all ω ∈ Rd by:

    f̂(ω) = ∫_{Rd} e^{−i x·ω} f(x) dx.

Julien Mairal (Inria) 279/564


In case of...
Properties
f̂ is complex-valued, continuous, tends to 0 at infinity, and ‖f̂‖_{L∞} ≤ ‖f‖_{L¹}.
If f̂ ∈ L¹(Rd), then the inverse Fourier formula holds:

    ∀x ∈ Rd,   f(x) = (2π)^{−d} ∫_{Rd} e^{i x·ω} f̂(ω) dω.

If f ∈ L¹(Rd) is square integrable, then Parseval's formula holds:

    ∫_{Rd} |f(x)|² dx = (2π)^{−d} ∫_{Rd} |f̂(ω)|² dω.

Julien Mairal (Inria) 280/564


Translation invariant kernels
Definition
A kernel K : Rd × Rd → R is called translation invariant (t.i.) if it only
depends on the difference between its arguments, i.e.:

    ∀(x, y) ∈ R^{2d},   K(x, y) = κ(x − y).

Intuition
If K is t.i. and κ ∈ L¹(Rd), then

    κ(x − y) = (2π)^{−d} ∫_{Rd} e^{i(x−y)·ω} κ̂(ω) dω
             = ∫_{Rd} e^{iω·x} e^{−iω·y} κ̂(ω)/(2π)^d dω.

Julien Mairal (Inria) 281/564


Characterization of p.d. t.i. kernels
Theorem (Bochner)
A real-valued continuous function κ(x − y) on Rd is positive definite if
and only if it is the Fourier-Stieltjes transform of a symmetric, positive,
and finite Borel measure µ:
Z
κ(z) = e iz.ω µ(dω).
Rd

Remarks
If κ(0) = 1, κ is a characteristic function—that is, κ(z) = Eω [e iz.ω ].
⇐ is easy:

    Σ_{k,l} α_k α_l κ(x_k − x_l) = ∫_{Rd} |Σ_k α_k e^{i x_k·ω}|² µ(dω) ≥ 0.

Julien Mairal (Inria) 282/564
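Bochner's theorem can be illustrated numerically. The sketch below assumes (consistently with the expression of κ̂ given two slides further) that the spectral measure of the Gaussian kernel is, up to normalization, a Gaussian with variance 1/σ² per coordinate; it checks κ(z) = E_ω[cos(z·ω)] by Monte-Carlo, and checks the "easy direction" by verifying that a Gaussian Gram matrix is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d = 1.0, 2

# Monte-Carlo check of kappa(z) = E_omega[cos(z . omega)] for the Gaussian kernel,
# with omega drawn from its (Gaussian) spectral measure of variance 1/sigma^2.
z = rng.normal(size=d)
omega = rng.normal(scale=1.0 / sigma, size=(100000, d))
print(np.mean(np.cos(omega @ z)), np.exp(-z @ z / (2 * sigma ** 2)))   # close values

# "Easy direction": the Gram matrix of kappa(x - y) on any points is p.s.d.
X = rng.normal(size=(50, d))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))
print(np.linalg.eigvalsh(K).min() >= -1e-10)                           # True
```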


RKHS of translation invariant kernels
Theorem
Let K be a translation invariant p.d. kernel, such that κ is integrable on
Rd as well as its Fourier transform κ̂. The subset H_K of L²(Rd) that
consists of integrable and continuous functions f such that:

    ‖f‖²_K := (2π)^{−d} ∫_{Rd} |f̂(ω)|² / κ̂(ω) dω < +∞,

endowed with the inner product:

    ⟨f, g⟩ := (2π)^{−d} ∫_{Rd} f̂(ω) ĝ(ω)* / κ̂(ω) dω

is a RKHS with K as r.k.

Julien Mairal (Inria) 283/564


Proof
HK is a Hilbert space: exercise.
For x ∈ Rd, K_x(y) = K(x, y) = κ(x − y), therefore:

    K̂_x(ω) = ∫ e^{−iω·u} κ(u − x) du = e^{−iω·x} κ̂(ω).

This leads to K_x ∈ H, because:

    ∫_{Rd} |K̂_x(ω)|² / κ̂(ω) dω ≤ ∫_{Rd} |κ̂(ω)| dω < ∞.

Moreover, if f ∈ H and x ∈ Rd, we have:

    ⟨f, K_x⟩_H = (2π)^{−d} ∫_{Rd} K̂_x(ω) f̂(ω)* / κ̂(ω) dω = (2π)^{−d} ∫_{Rd} f̂(ω)* e^{−iω·x} dω = f(x). □

Julien Mairal (Inria) 284/564


Example
Gaussian kernel

    K(x, y) = exp(−(x − y)²/(2σ²))

corresponds to:

    κ̂(ω) = e^{−σ²ω²/2}

and

    H = { f : ∫ |f̂(ω)|² e^{σ²ω²/2} dω < ∞ }.

In particular, all functions in H are infinitely differentiable with all
derivatives in L².

Julien Mairal (Inria) 285/564


Example
Laplace kernel

    K(x, y) = (1/2) e^{−γ|x−y|}

corresponds to:

    κ̂(ω) = γ/(γ² + ω²)

and

    H = { f : ∫ |f̂(ω)|² (γ² + ω²)/γ dω < ∞ },

the set of L² functions that are differentiable with derivative in L²
(Sobolev norm).

Julien Mairal (Inria) 286/564


Example
Low-frequency filter

    K(x, y) = sin(Ω(x − y)) / (π(x − y))

corresponds to:

    κ̂(ω) = U(ω + Ω) − U(ω − Ω)    (U the unit step function)

and

    H = { f : ∫_{|ω|>Ω} |f̂(ω)|² dω = 0 },

the set of functions whose spectrum is included in [−Ω, Ω].

Julien Mairal (Inria) 287/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Shift-invariant kernels
Generalization to semigroups
Mercer kernels
RKHS and Green functions
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 288/564


Generalization to semigroups (cf Berg et al., 1983)
Definition
A semigroup (S, ◦) is a nonempty set S equipped with an
associative composition ◦ and a neutral element e.
A semigroup with involution (S, ◦, ∗) is a semigroup (S, ◦) together
with a mapping ∗ : S → S called involution satisfying:

1 (s ◦ t) = t ∗ ◦ s ∗ , for s, t ∈ S.

2 (s ∗ ) = s for s ∈ S.

Examples
Any group (G , ◦) is a semigroup with involution when we define
s ∗ = s −1 .
Any abelian semigroup (S, +) is a semigroup with involution when
we define s ∗ = s, the identical involution.

Julien Mairal (Inria) 289/564


Positive definite functions on semigroups
Definition
Let (S, ◦, ∗) be a semigroup with involution. A function ϕ : S → R is
called positive definite if the function:

∀s, t ∈ S, K (s, t) = ϕ (s ∗ ◦ t)

is a p.d. kernel on S.

Example: translation invariant kernels


Rd , +, − is an abelian group with involution. A function ϕ : Rd → R


is p.d. if the function


K (x, y) = ϕ(x − y)
is p.d. on Rd (translation invariant kernels).

Julien Mairal (Inria) 290/564


Semicharacters
Definition
A function ρ : S → C on an abelian semigroup with involution (S, +, ∗)
is called a semicharacter if
1 ρ(0) = 1,
2 ρ(s + t) = ρ(s)ρ(t) for s, t ∈ S,
3 ρ (s ∗ ) = ρ(s) for s ∈ S.
The set of semicharacters on S is denoted by S ∗ .

Remarks
If ∗ is the identity, a semicharacter is automatically real-valued.
If (S, +) is an abelian group and s ∗ = −s, a semicharacter has its
values in the circle group {z ∈ C | | z | = 1} and is a group
character.

Julien Mairal (Inria) 291/564


Semicharacters are p.d.
Lemma
Every semicharacter is p.d., in the sense that:
K (s, t) = K (t, s),
Pn
i,j=1 ai aj K (xi , xj ) ≥ 0.

Proof
Direct from definition, e.g.,

    Σ_{i,j=1}^n a_i a_j ρ(x_i + x_j*) = Σ_{i,j=1}^n a_i a_j ρ(x_i) ρ(x_j) ≥ 0.

Examples
ϕ(t) = e βt on (R, +, Id).
ϕ(t) = e iωt on (R, +, −).
Julien Mairal (Inria) 292/564
Integral representation of p.d. functions
Definition
A function α : S → R on a semigroup with involution is called an
absolute value if (i) α(e) = 1, (ii) α(s ◦ t) ≤ α(s)α(t), and (iii)
α(s*) = α(s).
A function f : S → R is called exponentially bounded if there exists an
absolute value α and a constant C > 0 s.t. |f(s)| ≤ C α(s) for s ∈ S.

Theorem
Let (S, +, ∗) an abelian semigroup with involution. A function ϕ : S → R is
p.d. and exponentially bounded (resp. bounded) if and only if it has a
representation of the form:
Z
ϕ(s) = ρ(s)dµ(ρ) .
S∗

where µ is a Radon measure with compact support on S ∗ (resp. on Ŝ, the set
of bounded semicharacters).
Julien Mairal (Inria) 293/564
Proof
Sketch (details in Berg et al., 1983, Theorem 4.2.5)
For an absolute value α, the set P1α of α-bounded p.d. functions
that satisfy ϕ(0) = 1 is a compact convex set whose extreme points
are precisely the α-bounded semicharacters.
If ϕ is p.d. and exponentially bounded then there exists an absolute
value α such that ϕ(0)−1 ϕ ∈ P1α .
By the Krein-Milman theorem there exits a Radon probability
measure on P1α having ϕ(0)−1 ϕ as barycentre.

Remarks
The result is not true without the assumption of exponentially
bounded semicharacters.
In the case of abelian groups with s ∗ = −s this reduces to
Bochner’s theorem for discrete abelian groups, cf. Rudin (1962).

Julien Mairal (Inria) 294/564


Example 1: (R+ , +, Id)
Semicharacters
S = (R+ , +, Id) is an abelian semigroup.
√ 2
P.d. functions are nonnegative, because ϕ(x) = ϕ x .
The set of bounded semicharacters is exactly the set of functions:

s ∈ R+ 7→ ρa (s) = e −as ,

for a ∈ [0, +∞] (left as exercice).


Non-bounded semicharacters are more difficult to characterize; in
fact there exist nonmeasurable solutions of the equation
h(x + y ) = h(x)h(y ).

Julien Mairal (Inria) 295/564


Example 1: (R+ , +, Id) (cont.)
P.d. functions
By the integral representation theorem for bounded semi-characters
we obtain that a function ϕ : R+ → R is p.d. and bounded if and
only if it has the form:
Z ∞
ϕ(s) = e −as dµ(a) + bρ∞ (s)
0

where µ ∈ Mb+ (R+ ) and b ≥ 0.


The first term is the Laplace transform of µ. ϕ is p.d., bounded and
continuous iff it is the Laplace transform of a measure in Mb+ (R).

Julien Mairal (Inria) 296/564


Example 2: Semigroup kernels for finite measures (1/6)
Setting
We assume that data to be processed are “bags-of-points”, i.e., sets
of points (with repeats) of a space U.
Example : a finite-length string as a set of k-mers.
How to define a p.d. kernel between any two bags that only
depends on the union of the bags?
See details and proofs in Cuturi et al. (2005).

Julien Mairal (Inria) 297/564


Example 2: Semigroup kernels for finite measures (2/6)
Semigroup of bounded measures
We can represent any bag-of-point x as a finite measure on U:
X
x= ai δxi ,
i

where ai is the number of occurrences on xi in the bag.


The measure that represents the union of two bags is the sum of
the measures that represent each individual bag.
This suggests to look at the semigroup Mb+ (U) , +, Id of


bounded Radon measures on U and to search for p.d. functions ϕ


on this semigroup.

Julien Mairal (Inria) 298/564


Example 2: Semigroup kernels for finite measures (3/6)
Semicharacters
For any Borel measurable function f : U → R the function
ρf : Mb+ (U) → R defined by:

ρf (µ) = e µ[f ]

is a semicharacter on Mb+ (U) , + .




Conversely, ρ is continuous semicharacter (for the topology of weak


convergence) if and only if there exists a continuous function
f : U → R such that ρ = ρf .
No such characterization for non-continuous characters, even
bounded.

Julien Mairal (Inria) 299/564


Example 2: Semigroup kernels for finite measures (4/6)
Corollary
Let U be a Hausdorff space. For any Radon measure µ ∈ Mc+ (C (U))
with compact support on the Hausdorff space of continuous real-valued
functions on U endowed with the topology of pointwise convergence, the
following function K is a continuous p.d. kernel on Mb+ (U) (endowed
with the topology of weak convergence):
Z
K (µ, ν) = e µ[f ]+ν[f ] dµ(f ) .
C (X )

Remarks
The converse is not true: there exist continuous p.d. kernels that do not have
this integral representation (it might include non-continuous semicharacters)

Julien Mairal (Inria) 300/564


Example 2: Semigroup kernels for finite measures (5/6)
Example : entropy kernel
Let X be the set of probability densities (w.r.t. some reference
measure) on U with finite entropy:
    h(x) = −∫_U x ln x.

Then the following entropy kernel is a p.d. kernel on X for all β > 0:

    K(x, x′) = exp(−β h((x + x′)/2)).


Remark: only valid for densities (e.g., for a kernel density estimator
from a bag-of-parts)

Julien Mairal (Inria) 301/564
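For a concrete feel, here is a minimal sketch of the entropy kernel for discrete densities (normalized histograms, i.e., densities w.r.t. the counting measure); the histograms and the value of β are arbitrary toy choices.

```python
import numpy as np

def entropy(p):
    """Shannon entropy h(p) = -sum_i p_i log p_i (with 0 log 0 = 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropy_kernel(p, q, beta=1.0):
    """Entropy kernel K(p, q) = exp(-beta * h((p + q)/2)) for two densities p, q."""
    return np.exp(-beta * entropy((p + q) / 2.0))

# toy usage with two normalized histograms
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.1, 0.8])
print(entropy_kernel(p, q, beta=0.5))
```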


Example 2: Semigroup kernels for finite measures (6/6)
Examples : inverse generalized variance kernel
Let U = Rd and M_+^V(U) be the set of finite measures µ with a
second-order moment and non-singular variance

    Σ(µ) = µ[x x⊤] − µ[x] µ[x]⊤.

Then the following function is a p.d. kernel on M_+^V(U), called the
inverse generalized variance kernel:

    K(µ, µ′) = 1 / det Σ((µ + µ′)/2).

Generalization possible with regularization and kernel trick.

Julien Mairal (Inria) 302/564


Application of semigroup kernel

Weighted linear PCA of two different measures, with the first PC shown.
Variances captured by the first and second PC are shown. The
generalized variance kernel is the inverse of the product of the two
values.

Julien Mairal (Inria) 303/564


Kernelization of the IGV kernel
Motivations
Gaussian distributions may be poor models.
The method fails in large dimension

Solution
1 Regularization:

    K_λ(µ, µ′) = 1 / det( Σ((µ + µ′)/2) + λ I_d ).

2 Kernel trick: the non-zero eigenvalues of UU > and U > U are the
same =⇒ replace the covariance matrix by the centered Gram
matrix (technical details in Cuturi et al., 2005).

Julien Mairal (Inria) 304/564
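A minimal sketch of the regularized IGV kernel between two bags of points, where (as an assumption of this sketch) each bag is represented by the empirical probability measure of its points and the mixture (µ + µ′)/2 puts half of its mass on each bag:

```python
import numpy as np

def cov_of_measure(X, w):
    """Covariance Sigma(mu) = mu[x x^T] - mu[x] mu[x]^T of a weighted empirical measure."""
    w = w / w.sum()
    mean = w @ X
    second = (X * w[:, None]).T @ X
    return second - np.outer(mean, mean)

def igv_kernel(X1, X2, lam=0.1):
    """Regularized IGV kernel K_lambda(mu, mu') = 1 / det(Sigma((mu + mu')/2) + lam * I)."""
    X = np.vstack([X1, X2])
    w = np.concatenate([np.full(len(X1), 0.5 / len(X1)),
                        np.full(len(X2), 0.5 / len(X2))])
    Sigma = cov_of_measure(X, w)
    return 1.0 / np.linalg.det(Sigma + lam * np.eye(X.shape[1]))

# toy usage: two bags of 2-D points
rng = np.random.default_rng(0)
print(igv_kernel(rng.normal(size=(10, 2)), rng.normal(size=(15, 2))))
```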


Illustration of kernel IGV kernel

Julien Mairal (Inria) 305/564


Semigroup kernel remarks
Motivations
A very general formalism to exploit an algebraic structure of the
data.
The kernelized IGV kernel has given good results for character
recognition from a subsampled image.
The main motivation is more generally to develop kernels for
complex objects from which simple “patches” can be extracted.
The extension to nonabelian groups (e.g., permutation in the
symmetric group) might find natural applications.

Julien Mairal (Inria) 306/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Shift-invariant kernels
Generalization to semigroups
Mercer kernels
RKHS and Green functions
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 307/564


Mercer kernels
Definition
A kernel K on a set X is called a Mercer kernel if:
1 X is a compact metric space (typically, a closed bounded subset of
Rd ).
2 K : X × X → R is a continuous p.d. kernel (w.r.t. the Borel
topology)

Motivations
We can exhibit an explicit and intuitive feature space for a large
class of p.d. kernels
Historically, provided the first proof that a p.d. kernel is an inner
product for non-finite sets X (Mercer, 1905).
Can be thought of as the natural generalization of the factorization
of positive semidefinite matrices over infinite spaces.

Julien Mairal (Inria) 308/564


Sketch of the proof
1 The kernel matrix when X is finite becomes a linear operator when
X is a metric space.
2 The matrix was positive semidefinite in the finite case, the linear
operator is self-adjoint and positive in the metric case.
3 The spectral theorem states that any compact linear operator
admits a complete orthonormal basis of eigenfunctions, with
non-negative eigenvalues (just like positive semidefinite matrices
can be diagonalized with nonnegative eigenvalues).
4 The kernel function can then be expanded over basis of
eigenfunctions as:

X
K (x, t) = λk ψk (x) ψk (t) ,
k=1

where λi ≥ 0 are the non-negative eigenvalues.

Julien Mairal (Inria) 309/564


In case of...
Definition
Let H be a Hilbert space
A linear operator is a continuous linear mapping from H to itself.
A linear operator L is called compact if, for any bounded sequence
{fn }∞ ∞
n=1 , the sequence {Lfn }n=1 has a subsequence that converges.
L is called self-adjoint if, for any f , g ∈ H:

hf , Lg i = hLf , g i .

L is called positive if it is self-adjoint and, for any f ∈ H:

hf , Lf i ≥ 0 .

Julien Mairal (Inria) 310/564


An important lemma
The linear operator
Let ν be any Borel measure on X , and Lν2 (X ) the Hilbert space of
square integrable functions on X .
For any function K : X 2 7→ R, let the transform:
Z
∀f ∈ Lν2 (X ) , (LK f ) (x) = K (x, t) f (t) dν (t) .

Lemma
If K is a Mercer kernel, then LK is a compact and bounded linear
operator over Lν2 (X ), self-adjoint and positive.

Julien Mairal (Inria) 311/564


Proof (1/6)
LK is a mapping from Lν2 (X ) to Lν2 (X )
For any f ∈ Lν2 (X ) and (x1 , x1 ) ∈ X 2 :
Z
| LK f (x1 ) − LK f (x2 ) | = (K (x1 , t) − K (x2 , t)) f (t) dν (t)

≤ k K (x1 , ·) − K (x2 , ·) kk f k
(Cauchy-Schwarz)
p
≤ ν (X ) max | K (x1 , t) − K (x2 , t) | k f k.
t∈X

K being continuous and X compact, K is uniformly continuous,


therefore LK f is continuous. In particular, LK f ∈ Lν2 (X ) (with the slight
abuse of notation C (X ) ⊂ Lν2 (X )). 

Julien Mairal (Inria) 312/564


Proof (2/6)
LK is linear and continuous
Linearity is obvious (by definition of LK and linearity of the
integral).
For continuity, we observe that for all f ∈ Lν2 (X ) and x ∈ X :
Z
| (LK f ) (x) | = K (x, t) f (t) dν (t)
p
≤ ν (X ) max | K (x, t) | k f k
t∈X
p
≤ ν (X )CK k f k.

with CK = maxx,t∈X | K (x, t) |. Therefore:


Z 1
2
2
k LK f k = LK f (t) dν (t) ≤ ν (X ) CK k f k. 

Julien Mairal (Inria) 313/564


Proof (3/6)
Criterion for compactness
In order to prove the compactness of LK we need the following criterion.
Let C (X ) denote the set of continuous functions on X endowed with
infinite norm k f k∞ = maxx∈X | f (x) |.
A set of functions G ⊂ C (X ) is called equicontinuous if:

∀ > 0, ∃δ > 0, ∀ (x, y) ∈ X 2 ,


k x − y k < δ =⇒ ∀g ∈ G , | g (x) − g (y) | < .

Ascoli Theorem
A part H ⊂ C (X ) is relatively compact (i.e., its closure is compact) if
and only if it is uniformly bounded and equicontinuous.

Julien Mairal (Inria) 314/564


Proof (4/6)
LK is compact
Let (fn )n≥0 be a bounded sequence of Lν2 (X ) (k fn k < M for all n).
The sequence (LK fn )n≥0 is a sequence of continuous functions,
uniformly bounded because:
p p
k LK fn k∞ ≤ ν (X )CK k fn k ≤ ν (X )CK M .

It is equicontinuous because:
p
| LK fn (x1 ) − LK fn (x2 ) | ≤ ν (X ) max | K (x1 , t) − K (x2 , t) | M .
t∈X

By Ascoli theorem, we can extract a sequence uniformly convergent in


C (X ), and therefore in Lν2 (X ). 

Julien Mairal (Inria) 315/564


Proof (5/6)
LK is self-adjoint
K being symmetric, we have for all f , g ∈ H:
Z
hf , Lg i = f (x) (Lg ) (x) ν (dx)
Z Z
= f (x) g (t) K (x, t) ν (dx) ν (dt) (Fubini)

= hLf , g i .

Julien Mairal (Inria) 316/564


Proof (6/6)
LK is positive
We can approximate the integral by finite sums:
Z Z
hf , Lf i = f (x) f (t) K (x, t) ν (dx) ν (dt)
k
ν (X ) X
= lim K (xi , xj ) f (xi ) f (xj )
k→∞ k 2
i,j=1

≥ 0,

because K is positive definite. 

Julien Mairal (Inria) 317/564


Diagonalization of the operator
We need the following general result:
Spectral theorem
Let L be a compact linear operator on a Hilbert space H. Then there
exists in H a complete orthonormal system (ψ1 , ψ2 , . . .) of eigenvectors
of L. The eigenvalues (λ1 , λ2 , . . .) are real if L is self-adjoint, and
non-negative if L is positive.

Remark
This theorem can be applied to LK . In that case the eigenfunctions ϕk
associated to the eigenfunctions λk 6= 0 can be considered as continuous
functions, because:
1
ψk = LψK .
λk

Julien Mairal (Inria) 318/564


Main result
Mercer Theorem
Let X be a compact metric space, ν a Borel measure on X , and K a
continuous p.d. kernel. Let (λ1 , λ2 , . . .) denote the nonnegative
eigenvalues of LK and (ψ1 , ψ2 , . . .) the corresponding eigenfunctions.
Then all ψk are continuous functions, and for any x, t ∈ X :

X
K (x, t) = λk ψk (x) ψk (t) ,
k=1

where the convergence is absolute for each x, t ∈ X , and uniform on


X × X.

Julien Mairal (Inria) 319/564
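Mercer's expansion can be checked numerically: when ν is the empirical measure of a finite sample, L_K reduces to the Gram matrix divided by n, and its eigendecomposition recovers the expansion exactly on the sample points. A sketch (Gaussian kernel on [−1, 1]; the sample size and bandwidth are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))          # a sample of the compact set X = [-1, 1]
K = np.exp(-(X - X.T) ** 2 / (2 * 0.5 ** 2))   # Gram matrix of a Gaussian Mercer kernel

# When nu is the empirical measure of the sample, (L_K f)(x_i) = (1/n) sum_j K(x_i, x_j) f(x_j),
# so L_K is represented by K / n.  Its eigendecomposition gives the discrete analogue of the
# eigenvalues lambda_k and eigenfunctions psi_k (normalized in L^2(nu)).
n = len(X)
eigvals, U = np.linalg.eigh(K / n)
lam = eigvals[::-1]                            # eigenvalues in decreasing order
psi = U[:, ::-1] * np.sqrt(n)                  # psi_k(x_i), with (1/n) sum_i psi_k(x_i)^2 = 1

# Mercer expansion recovered on the sample: K(x_i, x_j) = sum_k lam_k psi_k(x_i) psi_k(x_j)
K_rec = (psi * lam) @ psi.T
print(np.max(np.abs(K - K_rec)))               # ~ 1e-12
```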


Mercer kernels as inner products
Corollary
The mapping

    Φ : X → ℓ²,   x ↦ (√λ_k ψ_k(x))_{k∈N}

is well defined, continuous, and satisfies

    K(x, t) = ⟨Φ(x), Φ(t)⟩_{ℓ²}.

Julien Mairal (Inria) 320/564


Proof of the corollary
Proof
λk ψk2 (x) converges to
P
By Mercer theorem we see that for all x ∈ X ,
K (x, x) < ∞, therefore Φ (x) ∈ l 2 .
The continuity of Φ results from:

X
k Φ (x) − Φ (t) k2l 2 = λk (ψk (x) − ψk (t))2
k=1
= K (x, x) + K (t, t) − 2K (x, t)

Julien Mairal (Inria) 321/564


Summary
This proof extends the proof valid when X is finite.
This is a constructive proof, developed by Mercer (1905).
Compactness and continuity are required. For instance, for
X = Rd , the eigenvalues of:
Z
K (x, t) ψ (t) = λψ (t)
X

are not necessarily countable, Mercer theorem does not hold. Other
tools are thus required such as the Fourier transform for
shift-invariant kernels.

Julien Mairal (Inria) 322/564


RKHS of Mercer kernels
Let X be a compact metric space, and K a Mercer kernel on X
(symmetric, continuous and positive definite).
We have expressed a decomposition of the kernel in terms of the
eigenfunctions of the linear convolution operator.
In some cases this provides an intuitive feature space.
The kernel also has a RKHS, like any p.d. kernel.
Can we get an intuition of the RKHS norm in terms of the
eigenfunctions and eigenvalues of the convolution operator?

Julien Mairal (Inria) 323/564


Reminder: expansion of Mercer kernel
Theorem
Denote by LK the linear operator of Lν2 (X ) defined by:
Z
ν
∀f ∈ L2 (X ) , (LK f ) (x) = K (x, t) f (t) dν (t) .

Let (λ1 , λ2 , . . .) denote the eigenvalues of LK in decreasing order, and


(ψ1 , ψ2 , . . .) the corresponding eigenfunctions. Then it holds that for
any x, y ∈ X :

X
K (x, y) = λk ψk (x) ψk (y) = hΦ (x) , Φ (y)il 2 ,
k=1

with Φ : X → ℓ² defined by Φ(x) = (√λ_k ψ_k(x))_{k∈N}.

Julien Mairal (Inria) 324/564


RKHS construction
Theorem
Assuming that all eigenvalues are positive, the RKHS is the Hilbert
space:

    H_K = { f ∈ L²_ν(X) : f = Σ_{i=1}^∞ a_i ψ_i, with Σ_{k=1}^∞ a_k²/λ_k < ∞ }

endowed with the inner product:

    ⟨f, g⟩_K = Σ_{k=1}^∞ a_k b_k / λ_k,   for f = Σ_k a_k ψ_k, g = Σ_k b_k ψ_k.

Remark
If some eigenvalues are equal to zero, then the result and the proof remain valid
on the subspace spanned by the eigenfunctions with positive eigenvalues.

Julien Mairal (Inria) 325/564


Proof (1/6)
Sketch
In order to show that HK is the RKHS of the kernel K we need to show
that:
1 it is a Hilbert space of functions from X to R,
2 for any x ∈ X , Kx ∈ HK ,
3 for any x ∈ X and f ∈ HK , f (x) = hf , Kx iHK .

Julien Mairal (Inria) 326/564


Proof (2/6)
HK is a Hilbert space
Indeed the function:

    L_K^{1/2} : L²_ν(X) → H_K,   Σ_{i=1}^∞ a_i ψ_i ↦ Σ_{i=1}^∞ a_i √λ_i ψ_i

is an isomorphism, therefore H_K is a Hilbert space, like L²_ν(X). □

Julien Mairal (Inria) 327/564


Proof (3/6)
HK is a space of continuous functions
For any f = Σ_{i=1}^∞ a_i ψ_i ∈ H_K and x ∈ X, we have (if f(x) makes sense):

    |f(x)| = |Σ_{i=1}^∞ a_i ψ_i(x)| = |Σ_{i=1}^∞ (a_i/√λ_i) √λ_i ψ_i(x)|
           ≤ (Σ_{i=1}^∞ a_i²/λ_i)^{1/2} (Σ_{i=1}^∞ λ_i ψ_i(x)²)^{1/2}
           = ‖f‖_{H_K} K(x, x)^{1/2}
           ≤ ‖f‖_{H_K} √C_K.

Therefore convergence in ‖·‖_{H_K} implies uniform convergence for
functions.

Julien Mairal (Inria) 328/564


Proof (4/6)
HK is a space of continuous functions (cont.)
Let now fn = ni=1 ai ψi ∈ HK . The functions ψi are continuous
P
functions, therefore fn is also continuous, for all n. The fn ’s are
convergent in HK , therefore also in the (complete) space of continuous
functions endowed with the uniform norm.
Let fc the continuous limit function. Then fc ∈ Lν2 (X ) and

k fn − fc kLν2 (X ) → 0.
n→∞

On the other hand,

k f − fn kLν2 (X ) ≤ λ1 k f − fn kHK → 0,
n→∞

therefore f = fc . 

Julien Mairal (Inria) 329/564


Proof (5/6)
Kx ∈ HK
For any x ∈ X let, for all i, ai = λi ψi (x). We have:
∞ ∞
X a2 X
i
= λi ψi (x)2 = K (x, x) < ∞,
λi
i=1 i=1
P∞
therefore ϕx := i=1 ai ψi ∈ HK . As seen earlier the convergence in HK
implies pointwise convergence, therefore for any t ∈ X :

X ∞
X
ϕx (t) = ai ψi (t) = λi ψi (x) ψi (t) = K (x, t) ,
i=1 i=1

therefore ϕx = Kx ∈ HK . 

Julien Mairal (Inria) 330/564


Proof (6/6)
f (x) = hf , Kx iHK
P∞
Let f = i=1 ai ψi ∈ HK , et x ∈ X . We have seen that:

X
Kx = λi ψi (x) ψi ,
i=1

therefore:
∞ ∞
X λi ψi (x) ai X
hf , Kx iHK = = ai ψi (x) = f (x) ,
λi
i=1 i=1

which concludes the proof. 

Julien Mairal (Inria) 331/564


Remarks
Although H_K was built from the eigenfunctions of L_K, which
depend on the choice of the measure ν(x), we know by uniqueness
of the RKHS that H_K is independent of ν and L_K.
Mercer theorem provides a concrete way to build the RKHS, by
taking linear combinations of the eigenfunctions of L_K (with
adequately chosen weights).
The eigenfunctions (ψ_i)_{i∈N} form an orthogonal basis of the RKHS:

    ⟨ψ_i, ψ_j⟩_{H_K} = 0 if i ≠ j,    ‖ψ_i‖_{H_K} = 1/√λ_i.
The RKHS is a well-defined ellipsoid with axes given by the
eigenfunctions.

Julien Mairal (Inria) 332/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Shift-invariant kernels
Generalization to semigroups
Mercer kernels
RKHS and Green functions
Kernels for graphs
Kernels on graphs

Julien Mairal (Inria) 333/564


Motivations
The RKHS norm is related to the smoothness of functions.
Smoothness of a function is naturally quantified by Sobolev norms
(in particular L2 norms of derivatives).
In this section we make a general link between RKHS and Green
functions defined by differential operators.

Julien Mairal (Inria) 334/564


A simple example
Explicit choice of smoothness
Let

H = f : [0, 1] 7→ R, absolutely continuous, f 0 ∈ L2 ([0, 1]) , f (0) = 0 .




endowed with the bilinear form:


Z 1
2
∀ (f , g ) ∈ F hf , g iH = f 0 (u) g 0 (u) du .
0

Note that hf , f iH measures the smoothness of f :


Z 1
hf , f iH = f 0 (u)2 du = k f 0 k2L2 ([0,1]) .
0

Julien Mairal (Inria) 335/564


The RKHs point of view
Theorem
H is a RKHS with r.k. given by:

∀ (x, y ) ∈ [0, 1]2 , K (x, y ) = min (x, y ) .

Remark
Therefore, k f kH = k f 0 kL2 : the RKHS norm is precisely the smoothness
functional defined in the simple example.

Julien Mairal (Inria) 336/564
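Both claims can be sanity-checked numerically. The sketch below (with f(u) = sin(u) and x = 0.63 as arbitrary choices) verifies that the Gram matrix of min(x, y) is positive semidefinite and that ⟨f, K_x⟩_H = ∫_0^1 f′(u) K_x′(u) du equals f(x).

```python
import numpy as np

# 1) The Gram matrix of K(x, y) = min(x, y) on random points of [0, 1] is p.s.d.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
K = np.minimum(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min() >= -1e-10)          # True

# 2) Reproducing property: <f, K_x>_H = int_0^1 f'(u) K_x'(u) du = f(x),
#    checked for f(u) = sin(u) (so f(0) = 0) and x = 0.63 by a Riemann sum.
x0 = 0.63
u = np.linspace(0.0, 1.0, 200001)
integrand = np.cos(u) * (u < x0)                      # f'(u) * d/du min(x0, u)
print(np.sum(integrand) * (u[1] - u[0]), np.sin(x0))  # both ~ 0.589
```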


Proof (1/3)
Sketch
We need to show that
H is a Hilbert space
∀x ∈ [0, 1], Kx ∈ H,
∀ (x, f ) ∈ [0, 1] × H, hf , Kx iH = f (x).

Julien Mairal (Inria) 337/564



Proof (2/3)
H is a pre-Hilbert space
f absolutely continuous implies differentiable almost everywhere, and

    ∀x ∈ [0, 1],   f(x) = f(0) + ∫_0^x f′(u) du.

For any f ∈ H, f(0) = 0 implies by Cauchy-Schwarz:

    |f(x)| = |∫_0^x f′(u) du| ≤ √x (∫_0^1 f′(u)² du)^{1/2} = √x ‖f‖_H.

Therefore, ‖f‖_H = 0 ⟹ f = 0, showing that ⟨·,·⟩_H is an inner
product. H is thus a pre-Hilbert space.

Julien Mairal (Inria) 338/564


Proof (2/3)
H is a Hilbert space
To show that H is complete, let (fn )n∈N a Cauchy sequence in H
(fn0 )n∈N is a Cauchy sequence in L2 [0, 1], thus converges to
g ∈ L2 [0, 1]
By the previous inequality, (fn (x))n∈N is a Cauchy sequence and
thus converges to a real number f (x), for any x ∈ [0, 1]. Moreover:
Z x Z x
0
f (x) = lim fn (x) = lim fn (u)du = g (u)du ,
n n 0 0

showing that f is absolutely continuous and f 0 = g almost


everywhere; in particular, f 0 ∈ L2 [0, 1].
Finally, f (0) = limn fn (0) = 0, therefore f ∈ H and

lim k fn − f kH = k f 0 − gn kL2 [0,1] = 0 .


n

Julien Mairal (Inria) 339/564


Proof (2/3)
∀x ∈ [0, 1], Kx ∈ H
Let K_x(y) = K(x, y) = min(x, y) on [0, 1]²:

[Figure: graph of u ↦ K_x(u) = min(x, u) on [0, 1].]

K_x is differentiable except at x, has a square integrable derivative, and
K_x(0) = 0, therefore K_x ∈ H for all x ∈ [0, 1]. □

Julien Mairal (Inria) 340/564


Proof (3/3)
For all x, f , hf , Kx iH = f (x)
For any x ∈ [0, 1] and f ∈ H we have:
Z 1 Z x
0
hf , Kx iH = f (u)Kx0 (u)du = f 0 (u)du = f (x),
0 0

which shows that K is the r.k. associated to H. 

Julien Mairal (Inria) 341/564


Generalization
Theorem
Let X = Rd and D a differential operator on a class of functions H such
that, endowed with the inner product:

∀ (f , g ) ∈ H2 , hf , g iH = hDf , Dg iL2 (X ) ,

it is a Hilbert space.
Then H is a RKHS that admits as r.k. the Green function of the
operator D ∗ D, where D ∗ denotes the adjoint operator of D.

Julien Mairal (Inria) 342/564


In case of...
Green functions
Let the differential equation on H:

f = Dg ,

where g is unknown. In order to solve it we can look for g of the form:


Z
g (x) = k (x, y ) f (y ) dy
X

for some function k : X 2 7→ R. k must then satisfy, for all x ∈ X ,

f (x) = Dg (x) = hDkx , f iL2 (X ) .

k is called the Green function of the operator D.

Julien Mairal (Inria) 343/564


Proof
Let H be a Hilbert space endowed with the inner product:

hf , g iX = hDf , Dg iL2 (X ) ,

and K be the Green function of the operator D ∗ D. For all x ∈ X ,


Kx ∈ H because:

hDKx , DKx iL2 (X ) = hD ∗ DKx , Kx iL2 (X ) = Kx (x) < ∞ .

Moreover, for all f ∈ H and x ∈ X , we have:

f (x) = hD ∗ DKx , f iL2 (X ) = hDKx , Df iL2 (X ) = hKx , f iH ,

which shows that H is a RKHS with K as r.k. 

Julien Mairal (Inria) 344/564


Kernel examples: Summary
Many notions of smoothness can be translated as RKHS norms for
particular kernels (eigenvalues convolution operator, Sobolev norms
and Green operators, Fourier transforms...).
There is no “uniformly best kernel”, but rather a large toolbox of
methods and tricks to encode prior knowledge and exploit the
nature or structure of the data.
In the following sections we focus on particular data and
applications to illustrate the process of kernel design.

Julien Mairal (Inria) 345/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

5 Open Problems and Research Topics

Julien Mairal (Inria) 346/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

Julien Mairal (Inria) 347/564


Virtual screening for drug discovery

[Figure: example molecules labeled active or inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

Julien Mairal (Inria) 348/564


Image retrieval and classification

From Harchaoui and Bach (2007).

Julien Mairal (Inria) 349/564


Our approach
1 Represent each graph x in X by a vector Φ(x) ∈ H, either explicitly
or implicitly through the kernel

K (x, x0 ) = Φ(x)> Φ(x0 ) .

2 Use a linear method for classification in H.


φ
X H

Julien Mairal (Inria) 350/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

Julien Mairal (Inria) 351/564



The approach
1 Represent explicitly each graph x by a vector of fixed dimension
Φ(x) ∈ Rp .
2 Use an algorithm for regression or pattern recognition in Rp .

φ
X H

Julien Mairal (Inria) 352/564


Example
2D structural keys in chemoinformatics
Index a molecule by a binary fingerprint defined by a limited set of
predefined structures
N N N
O O O O O

O
N

Use a machine learning algorithm such as SVM, kNN, PLS,


decision tree, etc.

Julien Mairal (Inria) 353/564


Challenge: which descriptors (patterns)?

N N N
O O O O O

O
N

Expressiveness: they should retain as much information as possible


from the graph
Computation: they should be fast to compute
Large dimension of the vector representation: memory storage,
speed, statistical issues

Julien Mairal (Inria) 354/564


Indexing by substructures

N N N
O O O O O

O
N

Often we believe that the presence or absence of particular


substructures may be important predictive patterns
Hence it makes sense to represent a graph by features that indicate
the presence (or the number of occurrences) of these substructures
However, detecting the presence of particular substructures may be
computationally challenging...

Julien Mairal (Inria) 355/564


Subgraphs
Definition
A subgraph of a graph (V , E ) is a graph (V 0 , E 0 ) with V 0 ⊂ V and
E0 ⊂ E.

A graph and all its connected subgraphs.

Julien Mairal (Inria) 356/564



Indexing by all subgraphs?

Theorem
Computing all subgraph occurrences is NP-hard.

Proof
The linear graph of size n is a subgraph of a graph X with n
vertices iff X has a Hamiltonian path;
The decision problem whether a graph has a Hamiltonian path is
NP-complete.

Julien Mairal (Inria) 357/564


Paths
Definition
A path of a graph (V , E ) is a sequence of distinct vertices
v1 , . . . , vn ∈ V (i 6= j =⇒ vi 6= vj ) such that (vi , vi+1 ) ∈ E for
i = 1, . . . , n − 1.
Equivalently the paths are the linear subgraphs.

Julien Mairal (Inria) 358/564



Indexing by all paths?

A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A

Theorem
Computing all path occurrences is NP-hard.

Proof
Same as for subgraphs.

Julien Mairal (Inria) 359/564


Indexing by what?
Substructure selection
We can imagine more limited sets of substructures that lead to more
computationnally efficient indexing (non-exhaustive list)
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest path lengths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al.,
2009)
all frequent subgraphs in the database (Helma et al., 2004)

Julien Mairal (Inria) 360/564


Example: Indexing by all shortest path lengths and their
endpoint labels

A 3 B
A A
B (0,...,0,2,0,...,0,1,0,...)
B A
A 1 A A 3 A

Properties (Borgwardt and Kriegel, 2005)


There are O(n2 ) shortest paths.
The vector of counts can be computed in O(n3 ) with the
Floyd-Warshall algorithm.

Julien Mairal (Inria) 361/564
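A sketch of this construction, assuming unit edge weights, categorical node labels, and dense adjacency matrices as NumPy arrays (function and variable names are illustrative):

```python
import numpy as np
from collections import Counter

def floyd_warshall(A):
    """All-pairs shortest-path lengths of an unweighted graph given by its adjacency matrix."""
    n = len(A)
    D = np.full((n, n), np.inf)
    D[np.asarray(A) > 0] = 1.0
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

def sp_features(A, labels):
    """Counts of (endpoint labels, shortest-path length) triplets over all vertex pairs."""
    D = floyd_warshall(A)
    feats = Counter()
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):
            if np.isfinite(D[i, j]):
                a, b = sorted((labels[i], labels[j]))   # unordered pair of endpoint labels
                feats[(a, b, D[i, j])] += 1
    return feats

def sp_kernel(A1, labels1, A2, labels2):
    """Shortest-path kernel: inner product of the two count vectors."""
    f1, f2 = sp_features(A1, labels1), sp_features(A2, labels2)
    return sum(c * f2[k] for k, c in f1.items())
```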


Example: Indexing by all subgraphs up to k vertices

Properties (Shervashidze et al., 2009)


Naive enumeration scales as O(nk ).
Enumeration of connected graphlets in O(nd k−1 ) for graphs with
degree ≤ d and k ≤ 5.
Randomly sample subgraphs if enumeration is infeasible.

Julien Mairal (Inria) 362/564


Summary
Explicit computation of substructure occurrences can be
computationnally prohibitive (subgraphs, paths);
Several ideas to reduce the set of substructures considered;
In practice, NP-hardness may not be so prohibitive (e.g., graphs
with small degrees), the strategy followed should depend on the
data considered.

Julien Mairal (Inria) 363/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

Julien Mairal (Inria) 364/564


The idea
1 Represent implicitly each graph x in X by a vector Φ(x) ∈ H
through the kernel

K (x, x0 ) = Φ(x)> Φ(x0 ) .

2 Use a kernel method for classification in H.

φ
X H

Julien Mairal (Inria) 365/564


Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:

∀G1 , G2 ∈ X , dK (G1 , G2 ) = 0 =⇒ G1 ' G2 .


Equivalently, Φ(G1 ) 6= Φ(G2 ) if G1 and G2 are not isomorphic.

Julien Mairal (Inria) 366/564


Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:

∀G1 , G2 ∈ X , dK (G1 , G2 ) = 0 =⇒ G1 ' G2 .


Equivalently, Φ(G1 ) 6= Φ(G2 ) if G1 and G2 are not isomorphic.

Expressiveness vs Complexity trade-off


If a graph kernel is not complete, then there is no hope to learn all
possible functions over X : the kernel is not expressive enough.
On the other hand, kernel computation must be tractable, i.e., no
more than polynomial (with small degree) for practical applications.
Can we define tractable and expressive graph kernels?

Julien Mairal (Inria) 366/564


Complexity of complete kernels
Proposition (Gärtner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.

Julien Mairal (Inria) 367/564


Complexity of complete kernels
Proposition (Gärtner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.

Proof
For any kernel K the complexity of computing dK is the same as
the complexity of computing K , because:

dK (G1 , G2 )2 = K (G1 , G1 ) + K (G2 , G2 ) − 2K (G1 , G2 ) .

If K is a complete graph kernel, then computing dK solves the


graph isomorphism problem (dK (G1 , G2 ) = 0 iff G1 ' G2 ). 

Julien Mairal (Inria) 367/564


Subgraph kernel
Definition
Let (λG )G ∈X be a set or nonnegative real-valued weights
For any graph G ∈ X and any connected graph H ∈ X , let

ΦH (G ) = G 0 is a subgraph of G : G 0 ' H .


The subgraph kernel between any two graphs G1 and G2 ∈ X is


defined by:
X
Ksubgraph (G1 , G2 ) = λH ΦH (G1 )ΦH (G2 ) .
H∈X
H connected

Julien Mairal (Inria) 368/564


Subgraph kernel complexity
Proposition (Gärtner et al., 2003)
Computing the subgraph kernel is NP-hard.

Julien Mairal (Inria) 369/564


Subgraph kernel complexity
Proposition (Gärtner et al., 2003)
Computing the subgraph kernel is NP-hard.

Proof (1/2)
Let Pn be the path graph with n vertices.
Subgraphs of Pn are path graphs:

Φ(Pn ) = neP1 + (n − 1)eP2 + . . . + ePn .

The vectors Φ(P1 ), . . . , Φ(Pn ) are linearly independent, therefore:


n
X
ePn = αi Φ(Pi ) ,
i=1

where the coefficients αi can be found in polynomial time (solving


an n × n triangular system).
Julien Mairal (Inria) 369/564
Subgraph kernel complexity
Proposition (Gärtner et al., 2003)
Computing the subgraph kernel is NP-hard.

Proof (2/2)
If G is a graph with n vertices, then it has a path that visits each
node exactly once (Hamiltonian path) if and only if Φ(G )> ePn > 0,
i.e.,
n n
!
X X
>
Φ(G ) αi Φ(Pi ) = αi Ksubgraph (G , Pi ) > 0 .
i=1 i=1

The decision problem whether a graph has a Hamiltonian path is


NP-complete. 

Julien Mairal (Inria) 370/564


Path kernel

A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A

Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = λH ΦH (G1 )ΦH (G2 ) ,
H∈P

where P ⊂ X is the set of path graphs.

Julien Mairal (Inria) 371/564


Path kernel

A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A

Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = λH ΦH (G1 )ΦH (G2 ) ,
H∈P

where P ⊂ X is the set of path graphs.

Proposition (Gärtner et al., 2003)


Computing the path kernel is NP-hard.

Julien Mairal (Inria) 371/564


Summary
Expressiveness vs Complexity trade-off
It is intractable to compute complete graph kernels.
It is intractable to compute the subgraph kernels.
Restricting subgraphs to be linear does not help: it is also
intractable to compute the path kernel.
One approach to define polynomial time computable graph kernels
is to have the feature space be made up of graphs homomorphic to
subgraphs, e.g., to consider walks instead of paths.

Julien Mairal (Inria) 372/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

Julien Mairal (Inria) 373/564


Walks
Definition
A walk of a graph (V, E) is a sequence of vertices v1, . . . , vn ∈ V such
that (vi, vi+1) ∈ E for i = 1, . . . , n − 1.
We note Wn (G ) the set of walks with n vertices of the graph G ,
and W(G ) the set of all walks.

etc...

Julien Mairal (Inria) 374/564


Walks 6= paths

Julien Mairal (Inria) 375/564


Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = ∪n≥1 Sn .
For any graph X let a weight λG (w ) be associated to each walk
w ∈ W(G ).
Let the feature vector Φ(G ) = (Φs (G ))s∈S be defined by:
X
Φs (G ) = λG (w )1 (s is the label sequence of w ) .
w ∈W(G )

Julien Mairal (Inria) 376/564


Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = ∪n≥1 Sn .
For any graph X let a weight λG (w ) be associated to each walk
w ∈ W(G ).
Let the feature vector Φ(G ) = (Φs (G ))s∈S be defined by:
X
Φs (G ) = λG (w )1 (s is the label sequence of w ) .
w ∈W(G )

A walk kernel is a graph kernel defined by:


X
Kwalk (G1 , G2 ) = Φs (G1 )Φs (G2 ) .
s∈S

Julien Mairal (Inria) 376/564


Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with λG (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with λG (w ) = PG (w ), where
PG is a Markov random walk on G . In that case we have:

K (G1 , G2 ) = P(label(W1 ) = label(W2 )) ,

where W1 and W2 are two independent random walks on G1 and


G2 , respectively (Kashima et al., 2003).
The geometric walk kernel is obtained (when it converges) with
λG (w ) = β length(w ) , for β > 0. In that case the feature space is of
infinite dimension (Gärtner et al., 2003).

Julien Mairal (Inria) 377/564


Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can
be computed efficiently in polynomial time.

Julien Mairal (Inria) 378/564


Product graph
Definition
Let G1 = (V1 , E1 ) and G2 = (V2 , E2 ) be two graphs with labeled vertices.
The product graph G = G1 × G2 is the graph G = (V , E ) with:
1 V = {(v1 , v2 ) ∈ V1 × V2 : v1 and v2 have the same label} ,
2 E=
{((v1 , v2 ), (v10 , v20 )) ∈ V × V : (v1 , v10 ) ∈ E1 and (v2 , v20 ) ∈ E2 }.

[Figure: a graph G1 with vertices 1–4, a graph G2 with vertices a–e, and their product graph G1 × G2 with vertices 1a, 1b, 1d, 2a, 2b, 2d, 3c, 3e, 4c, 4e.]

Julien Mairal (Inria) 379/564


Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 ∈ Wn (G1 ) and w2 ∈ Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w ∈ Wn (G1 × G2 ).

Julien Mairal (Inria) 380/564


Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 ∈ Wn (G1 ) and w2 ∈ Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w ∈ Wn (G1 × G2 ).

Corollary
X
Kwalk (G1 , G2 ) = Φs (G1 )Φs (G2 )
s∈S
X
= λG1 (w1 )λG2 (w2 )1(l(w1 ) = l(w2 ))
(w1 ,w2 )∈W(G1 )×W(G1 )
X
= λG1 ×G2 (w ) .
w ∈W(G1 ×G2 )

Julien Mairal (Inria) 380/564


Computation of the nth-order walk kernel

For the nth-order walk kernel we have λG1 ×G2 (w ) = 1 if the length
of w is n, 0 otherwise.
Therefore:

    K_{nth-order}(G1, G2) = Σ_{w ∈ W_n(G1×G2)} 1.

Let A be the adjacency matrix of G1 × G2. Then we get:

    K_{nth-order}(G1, G2) = Σ_{i,j} [A^n]_{i,j} = 1⊤ A^n 1.

Computation in O(n|V1 ||V2 |d1 d2 ), where di is the maximum degree


of Gi .

Julien Mairal (Inria) 381/564
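A direct sketch of this computation, assuming dense adjacency matrices as NumPy arrays and categorical vertex labels (edge labels are ignored here for simplicity):

```python
import numpy as np

def product_graph_adjacency(A1, labels1, A2, labels2):
    """Adjacency matrix of the product graph of two vertex-labeled graphs."""
    # vertices of the product graph: pairs of vertices with the same label
    V = [(i, j) for i in range(len(A1)) for j in range(len(A2))
         if labels1[i] == labels2[j]]
    A = np.zeros((len(V), len(V)))
    for a, (i, j) in enumerate(V):
        for b, (k, m) in enumerate(V):
            A[a, b] = A1[i, k] * A2[j, m]
    return A

def nth_order_walk_kernel(A1, labels1, A2, labels2, n=3):
    """K(G1, G2) = 1^T A^n 1, with A the product-graph adjacency matrix."""
    A = product_graph_adjacency(A1, labels1, A2, labels2)
    one = np.ones(len(A))
    return one @ np.linalg.matrix_power(A, n) @ one
```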


Computation of random and geometric walk kernels

In both cases λ_G(w) for a walk w = v_1 . . . v_n can be decomposed as:

    λ_G(v_1 . . . v_n) = λ_i(v_1) ∏_{i=2}^n λ_t(v_{i−1}, v_i).

Let Λ_i be the vector of λ_i(v) and Λ_t be the matrix of λ_t(v, v′):

    K_walk(G1, G2) = Σ_{n=1}^∞ Σ_{w ∈ W_n(G1×G2)} λ_i(v_1) ∏_{i=2}^n λ_t(v_{i−1}, v_i)
                   = Σ_{n=0}^∞ Λ_i Λ_t^n 1
                   = Λ_i (I − Λ_t)^{−1} 1.

Computation in O(|V1|³ |V2|³).

Julien Mairal (Inria) 382/564
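Given the product-graph adjacency matrix A (e.g., built as in the previous sketch), the geometric walk kernel reduces to one linear solve. In this sketch λ_i is taken uniform and Λ_t = βA, which are illustrative choices rather than the only possibility:

```python
import numpy as np

def geometric_walk_kernel(A, beta=0.1):
    """Lambda_i (I - Lambda_t)^{-1} 1 with uniform lambda_i and Lambda_t = beta * A,
    where A is the adjacency matrix of the product graph."""
    n = len(A)
    if n == 0:
        return 0.0
    # the geometric series converges iff beta < 1 / spectral_radius(A)
    assert beta * np.max(np.abs(np.linalg.eigvals(A))) < 1, "beta too large: series diverges"
    one = np.ones(n)
    return one @ np.linalg.solve(np.eye(n) - beta * A, one)
```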


Extensions 1: Label enrichment
Atom relabeling with the Morgan index (Mahé et al., 2004)
[Figure: a molecular graph relabeled with Morgan indices of orders 0, 1 and 2: every atom starts with index 1, and at each iteration the index of an atom is replaced by the sum of its neighbors' indices.]

Compromise between fingerprints and structural keys.


Other relabeling schemes are possible.
Faster computation with more labels (less matches implies a smaller
product graph).

Julien Mairal (Inria) 383/564


Extension 2: Non-tottering walk kernel
Tottering walks
A tottering walk is a walk w = v1 . . . vn with vi = vi+2 for some i.

Non−tottering

Tottering

Tottering walks seem irrelevant for many applications.


Focusing on non-tottering walks is a way to get closer to the path
kernel (e.g., equivalent on trees).

Julien Mairal (Inria) 384/564


Computation of the non-tottering walk kernel (Mahé et
al., 2005)
Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented
graph
Normal walk kernel on the augmented graph (which is always a
directed graph).

Julien Mairal (Inria) 385/564


Extension 3: Subtree kernels

Remark: Here and in subsequent slides by subtree we mean a tree-like


pattern with potentially repeated nodes and edges.
Julien Mairal (Inria) 386/564
Example: Tree-like fragments of molecules

.
.
. C C

C .
N

N O
.
.
C

O N C C N
. N O
.
.
N C
N N C C C
.
.
.

Julien Mairal (Inria) 387/564


Computation of the subtree kernel (Ramon and Gärtner,
2003; Mahé and Vert, 2009)

Like the walk kernel, amounts to computing the (weighted) number


of subtrees in the product graph.
Recursion: if T(v, n) denotes the weighted number of subtrees of
depth n rooted at the vertex v, then:

    T(v, n + 1) = Σ_{R ⊆ N(v)} ∏_{v′ ∈ R} λ_t(v, v′) T(v′, n),

where N(v) is the set of neighbors of v.


Can be combined with the non-tottering graph transformation as
preprocessing to obtain the non-tottering subtree kernel.

Julien Mairal (Inria) 388/564


Back to label enrichment
Link between the Morgan index and subtrees
Recall the Morgan index:
[Figure: Morgan index relabeling of orders 0, 1 and 2, as on the earlier label-enrichment slide.]

The Morgan index of order k at a node v in fact corresponds to the


number of leaves in the k-th order full subtree pattern rooted at v .


A full subtree pattern of order 2 rooted at node 1.

Julien Mairal (Inria) 389/564


Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
1 Multiset-label determination and sorting: each node collects the sorted multiset of its neighbors' labels.
2 Label compression: each (label, multiset) pair is mapped to a new, shorter label.
3 Relabeling: every node takes its compressed label as its new label.

[Figure: one WL iteration on a small graph with node labels a–e, producing compressed labels f–j.]

Compressed labels represent full subtree patterns.


Julien Mairal (Inria) 390/564
Weisfeiler-Lehman (WL) subtree kernel

[Figure: two graphs G and G′ after one WL iteration. Their feature vectors count original and compressed node labels (indexed by a, b, ..., m):

    φ^(1)_WLsubtree(G)  = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1),
    φ^(1)_WLsubtree(G′) = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1),

so K^(1)_WLsubtree(G, G′) = ⟨φ^(1)_WLsubtree(G), φ^(1)_WLsubtree(G′)⟩ = 11.]

Properties
The WL features up to the k-th order are computed in O(|E |k).
Similarly to the Morgan index, the WL relabeling can be exploited
in combination with any graph kernel (that takes into account
categorical node labels) to make it more expressive (Shervashidze et
al., 2011).
Julien Mairal (Inria) 391/564
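A compact sketch of the WL relabeling and of the resulting subtree kernel, assuming adjacency lists and hashable node labels; compressed labels are prefixed with "c" only to avoid clashes with the original labels in this toy setting.

```python
from collections import Counter

def wl_relabel(adj, labels, h=2):
    """h iterations of Weisfeiler-Lehman relabeling.

    adj : list of neighbor lists; labels : list of hashable node labels.
    Returns the label assignments of iterations 0, 1, ..., h.
    """
    compressed = {}                      # (label, sorted neighbor multiset) -> new label
    all_labels = [list(labels)]
    for _ in range(h):
        prev, new = all_labels[-1], []
        for v, neigh in enumerate(adj):
            key = (prev[v], tuple(sorted(prev[u] for u in neigh)))
            if key not in compressed:
                compressed[key] = "c%d" % len(compressed)
            new.append(compressed[key])
        all_labels.append(new)
    return all_labels

def wl_subtree_kernel(adj1, labels1, adj2, labels2, h=2):
    """Inner product of the counts of original and compressed labels.

    The compression dictionary must be shared between the two graphs, so they
    are relabeled together as a disjoint union.
    """
    n1 = len(adj1)
    adj = list(adj1) + [[u + n1 for u in nb] for nb in adj2]
    all_labels = wl_relabel(adj, list(labels1) + list(labels2), h)
    c1 = Counter(lab[v] for lab in all_labels for v in range(n1))
    c2 = Counter(lab[v] for lab in all_labels for v in range(n1, len(adj)))
    return sum(c * c2[k] for k, c in c1.items())

# toy usage: two labeled triangles
adj = [[1, 2], [0, 2], [0, 1]]
print(wl_subtree_kernel(adj, ["a", "a", "b"], adj, ["a", "b", "b"], h=1))
```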
Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Motivation
Explicit enumeration of features
Challenges
Walk-based kernels
Applications
Kernels on graphs

Julien Mairal (Inria) 392/564


Application in chemoinformatics (Mahé et al., 2005)
MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity /no mutagenic activity, assayed in
Salmonella typhimurium.
188 compounds: 125 + / 63 -

Results
10-fold cross-validation accuracy
Method Accuracy
Progol1 81.4%
2D kernel 91.2%

Julien Mairal (Inria) 393/564


2D subtree vs walk kernels

[Figure: per-cell-line AUC (roughly 70–80) of the 2D walk and subtree kernels, for the screening of inhibitors against 60 cancer cell lines.]

Julien Mairal (Inria) 394/564
Comparison of several graph feature extraction
methods/kernels (Shervashidze et al., 2011)
10-fold cross-validation accuracy on garph classification problems in
chemo- and bioinformatics:
NCI1 and NCI109 - active/inactive compounds in an anti-cancer screen
ENZYMES - 6 types of enzymes from the BRENDA database

Method/Data Set NCI1 NCI109 ENZYMES


WL subtree 82.19 (± 0.18) 82.46 (±0.24) 52.22 (±1.26)
WL shortest path 84.55 (±0.36) 83.53 (±0.30) 59.05 (±1.05)
Ramon & Gärtner 61.86 (±0.27) 61.67 (±0.21) 13.35 (±0.87)
Geometric p-walk 58.66 (±0.28) 58.36 (±0.94) 27.67 (±0.95)
Geometric walk 64.34 (±0.27) 63.51 (± 0.18) 21.68 (±0.94)
Graphlet count 66.00 (±0.07) 66.59 (±0.08) 32.70 (±1.20)
Shortest path 73.47 (±0.11) 73.07 (±0.11) 41.68 (±1.79)

Julien Mairal (Inria) 395/564


Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination
(M).

[Figure: test error on Corel14 (roughly between 0.05 and 0.12) for the kernels H, W, TW, wTW and M.]

Julien Mairal (Inria) 396/564


Summary: graph kernels
What we saw
Kernels do not allow to overcome the NP-hardness of subgraph
patterns.
They allow to work with approximate subgraphs (walks, subtrees) in
infinite dimension, thanks to the kernel trick.
However: using kernels makes it difficult to come back to patterns
after the learning stage.

Julien Mairal (Inria) 397/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs

5 Open Problems and Research Topics

Julien Mairal (Inria) 398/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 399/564


Graphs
Motivation
Data often come in the form of nodes in a graph for different reasons:
by definition (interaction network, internet...)
by discretization/sampling of a continuous domain
by convenience (e.g., if only a similarity function is available)

Julien Mairal (Inria) 400/564


Example: web

Julien Mairal (Inria) 401/564


Example: social network

Julien Mairal (Inria) 402/564


Example: protein-protein interaction

Julien Mairal (Inria) 403/564


Kernel on a graph

We need a kernel K (x, x0 ) between nodes of the graph.


Example: predict protein functions from high-throughput
protein-protein interaction data.

Julien Mairal (Inria) 404/564


General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric positive semi-definite matrix K defines a valid
p.d. kernel on X .

Julien Mairal (Inria) 405/564


General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric positive semi-definite matrix K defines a valid
p.d. kernel on X .
How to “translate” the graph topology into the kernel?
Direct geometric approach: Ki,j should be “large” when xi and xj are
“close” to each other on the graph?
Functional approach: ‖f‖K should be “small” when f is “smooth”
on the graph?
Link discrete/continuous: is there an equivalent of the continuous
Gaussian kernel on the graph (e.g., as a limit by fine discretization)?

Julien Mairal (Inria) 405/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 406/564


Conditionally p.d. kernels
Hilbert distance
Any p.d. kernel is an inner product in a Hilbert space

    K(x, x') = ⟨Φ(x), Φ(x')⟩_H .

It defines a Hilbert distance:

    dK(x, x')² = K(x, x) + K(x', x') − 2 K(x, x') .

−dK² is conditionally positive definite (c.p.d.), i.e.:

    ∀t > 0 ,   exp(−t dK(x, x')²)   is p.d.

Julien Mairal (Inria) 407/564


Example
A direct approach
For X = R^n , the inner product is p.d.:

    K(x, x') = x^T x' .

The corresponding Hilbert distance is the Euclidean distance:

    dK(x, x')² = x^T x + x'^T x' − 2 x^T x' = ‖x − x'‖² .

−dK² is conditionally positive definite (c.p.d.), i.e.:

    ∀t > 0 ,   exp(−t ‖x − x'‖²)   is p.d.




Julien Mairal (Inria) 408/564


Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG(x, x') between
any two vertices is the length of the shortest path between x and x'.
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if −dG is c.p.d., which implies in particular that
exp(−t dG(x, x')) is p.d. for all t > 0.

Julien Mairal (Inria) 409/564


Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG(x, x') between
any two vertices is the length of the shortest path between x and x'.
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if −dG is c.p.d., which implies in particular that
exp(−t dG(x, x')) is p.d. for all t > 0.

Lemma
In general graphs cannot be embedded exactly in Hilbert spaces.
In some cases exact embeddings exist, e.g.:
trees can be embedded exactly,
closed chains can be embedded exactly.

Julien Mairal (Inria) 409/564


Example: non-c.p.d. graph distance

[Figure: a graph on 5 vertices]

         [ 0 1 1 1 2 ]
         [ 1 0 2 2 1 ]
    dG = [ 1 2 0 2 1 ]
         [ 1 2 2 0 1 ]
         [ 2 1 1 1 0 ]

    λmin[ e^(−0.2 dG(i,j)) ] = −0.028 < 0 .

Julien Mairal (Inria) 410/564


Graph distances on trees are c.p.d.
Proof
Let G = (V , E ) be a tree;
Fix a root x0 ∈ V ;
Represent any vertex x ∈ V by a vector Φ(x) ∈ R|E | , where
Φ(x)i = 1 if the i-th edge is part of the (unique) path between x
and x0 , 0 otherwise.
Then
    dG(x, x') = ‖Φ(x) − Φ(x')‖² ,

and therefore −dG is c.p.d.; in particular exp(−t dG(x, x')) is p.d.
for all t > 0.

Julien Mairal (Inria) 411/564


Example

[Figure: a tree on 5 vertices with edges 1–3, 2–3, 3–4, 4–5]

                     [ 1     0.14  0.37  0.14  0.05 ]
                     [ 0.14  1     0.37  0.14  0.05 ]
    [e^(−dG(i,j))] = [ 0.37  0.37  1     0.37  0.14 ]
                     [ 0.14  0.14  0.37  1     0.37 ]
                     [ 0.05  0.05  0.14  0.37  1    ]

Julien Mairal (Inria) 412/564


Graph distances on closed chains are c.p.d.
Proof: case |V | = 2p
Let G = (V , E ) be a directed cycle with an even number of vertices
|V | = 2p.
Fix a root x0 ∈ V , number the 2p edges from x0 to x0 ;
Label the 2p edges with e1 , . . . , ep , −e1 , . . . , −ep (vectors in Rp );
For a vertex v , take Φ(v ) to be the sum of the labels of the edges
in the shortest directed path between x0 and v .

Julien Mairal (Inria) 413/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 414/564


Functional approach
Motivation
How to design a p.d. kernel on general graphs?
Designing a kernel is equivalent to defining an RKHS.
There are intuitive notions of smoothness on a graph.

Idea
Define a priori a smoothness functional on the functions f : X → R;
Show that it defines an RKHS and identify the corresponding kernel.

Julien Mairal (Inria) 415/564


Notations

X = (x1 , . . . , xm ) is finite.
For x, x' ∈ X , we write x ∼ x' to indicate the existence of an edge
between x and x'.
We assume that there is no self-loop x ∼ x, and that there is a
single connected component.
The adjacency matrix is A ∈ R^{m×m} :

    Ai,j = 1 if i ∼ j ,   0 otherwise.

D is the diagonal matrix where Di,i is the number of neighbors of xi
(Di,i = Σ_{j=1}^m Ai,j).

Julien Mairal (Inria) 416/564


Example

[Figure: a tree on 5 vertices with edges 1–3, 2–3, 3–4, 4–5]

        [ 0 0 1 0 0 ]         [ 1 0 0 0 0 ]
        [ 0 0 1 0 0 ]         [ 0 1 0 0 0 ]
    A = [ 1 1 0 1 0 ] ,   D = [ 0 0 3 0 0 ]
        [ 0 0 1 0 1 ]         [ 0 0 0 2 0 ]
        [ 0 0 0 1 0 ]         [ 0 0 0 0 1 ]

Julien Mairal (Inria) 417/564


Graph Laplacian
Definition
The Laplacian of the graph is the matrix L = D − A.

[Figure: a tree on 5 vertices with edges 1–3, 2–3, 3–4, 4–5]

                [  1   0  −1   0   0 ]
                [  0   1  −1   0   0 ]
    L = D − A = [ −1  −1   3  −1   0 ]
                [  0   0  −1   2  −1 ]
                [  0   0   0  −1   1 ]

Julien Mairal (Inria) 418/564


Properties of the Laplacian
Lemma
Let L = D − A be the Laplacian of a connected graph:
For any f : X → R,

    Ω(f) := Σ_{i∼j} (f(xi) − f(xj))² = f^T L f

L is a symmetric positive semi-definite matrix
0 is an eigenvalue with multiplicity 1, associated to the constant
eigenvector 1 = (1, . . . , 1)
The image of L is

    Im(L) = { f ∈ R^m : Σ_{i=1}^m fi = 0 }

Julien Mairal (Inria) 419/564
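These properties can be verified numerically on the running example graph; the
sketch below (not part of the original slides) assumes NumPy and the adjacency
matrix of that graph.

    import numpy as np

    # Adjacency matrix of the example graph (edges 1-3, 2-3, 3-4, 4-5)
    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]])
    D = np.diag(A.sum(axis=1))   # degree matrix
    L = D - A                    # graph Laplacian

    f = np.random.randn(5)
    edges = [(i, j) for i in range(5) for j in range(i + 1, 5) if A[i, j]]
    omega = sum((f[i] - f[j]) ** 2 for i, j in edges)
    print(np.isclose(omega, f @ L @ f))           # Omega(f) = f^T L f
    print(np.linalg.eigvalsh(L).min() >= -1e-10)  # L is positive semi-definite
    print(np.allclose(L @ np.ones(5), 0))         # the constant vector spans Ker(L)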


Proof: link between Ω(f ) and L

    Ω(f) = Σ_{i∼j} (f(xi) − f(xj))²
         = Σ_{i∼j} ( f(xi)² + f(xj)² − 2 f(xi) f(xj) )
         = Σ_{i=1}^m Di,i f(xi)² − 2 Σ_{i∼j} f(xi) f(xj)
         = f^T D f − f^T A f
         = f^T L f

Julien Mairal (Inria) 420/564


Proof: eigenstructure of L
L is symmetric because A and D are symmetric.
For any f ∈ R^m , f^T L f = Ω(f) ≥ 0, therefore the (real-valued)
eigenvalues of L are ≥ 0: L is therefore positive semi-definite.
f is an eigenvector associated to eigenvalue 0
iff f^T L f = 0 ,
iff Σ_{i∼j} (f(xi) − f(xj))² = 0 ,
iff f(xi) = f(xj) when i ∼ j,
iff f is constant (because the graph is connected).
L being symmetric, Im(L) is the orthogonal complement of Ker(L),
that is, the set of functions orthogonal to 1. □

Julien Mairal (Inria) 421/564


Our first graph kernel
Theorem
The set H = { f ∈ R^m : Σ_{i=1}^m fi = 0 }, endowed with the norm

    Ω(f) = Σ_{i∼j} (f(xi) − f(xj))² ,

is a RKHS whose reproducing kernel is L∗, the pseudo-inverse of the
graph Laplacian.

Julien Mairal (Inria) 422/564


In case of...
Pseudo-inverse of L
Remember that the pseudo-inverse L∗ of L is the linear map that is
equal to:
0 on Ker(L)
L^{−1} on Im(L); that is, if we write the eigendecomposition of L as

    L = Σ_{i=1}^m λi ui ui^T ,

then

    L∗ = Σ_{λi ≠ 0} (λi)^{−1} ui ui^T .

In particular it holds that L∗L = LL∗ = Π_H , the projection onto
Im(L) = H.
Julien Mairal (Inria) 423/564
Proof (1/2)
Restricted to H, the symmetric bilinear form:

    ⟨f, g⟩ = f^T L g

is positive definite (because L is positive semi-definite, and
H = Im(L)). It is therefore a scalar product, making H a Hilbert
space (in fact Euclidean).
The norm in this Hilbert space H is:

    ‖f‖² = ⟨f, f⟩ = f^T L f = Ω(f) .

Julien Mairal (Inria) 424/564


Proof (2/2)
To check that H is a RKHS with reproducing kernel K = L∗, it suffices
to show that:

    ∀x ∈ X ,  Kx ∈ H ,
    ∀(x, f) ∈ X × H ,  ⟨f, Kx⟩ = f(x) .

Ker(K) = Ker(L∗) = Ker(L), implying K 1 = 0. Therefore, each
row/column of K is in H.
For any f ∈ H, if we let gi = ⟨K(i, ·), f⟩ we get:

    g = K L f = L∗ L f = Π_H(f) = f .

As a conclusion, K = L∗ is the reproducing kernel of H. □

Julien Mairal (Inria) 425/564


Example

[Figure: a tree on 5 vertices with edges 1–3, 2–3, 3–4, 4–5]

         [  0.88  −0.12   0.08  −0.32  −0.52 ]
         [ −0.12   0.88   0.08  −0.32  −0.52 ]
    L∗ = [  0.08   0.08   0.28  −0.12  −0.32 ]
         [ −0.32  −0.32  −0.12   0.48   0.28 ]
         [ −0.52  −0.52  −0.32   0.28   1.08 ]

Julien Mairal (Inria) 426/564
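The matrix above can be reproduced with a pseudo-inverse routine; the sketch
below (an illustration assuming NumPy, not part of the original slides) also
checks two properties used in the proof.

    import numpy as np

    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]])
    L = np.diag(A.sum(axis=1)) - A

    K = np.linalg.pinv(L)                  # reproducing kernel L* of H
    print(np.round(K, 2))                  # matches the matrix displayed above
    print(np.allclose(K @ np.ones(5), 0))  # every row/column of K is in H
    # L* L is the projection onto H, i.e., I - (1/m) 1 1^T
    print(np.allclose(K @ L, np.eye(5) - np.ones((5, 5)) / 5))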


Interpretation of the Laplacian

[Figure: discretization of a function f on a regular grid with spacing dx, at nodes i−1, i, i+1]

    Δf(x) = f''(x)
          ≈ ( f'(x + dx/2) − f'(x − dx/2) ) / dx
          ≈ ( f(x + dx) − 2 f(x) + f(x − dx) ) / dx²
          = ( f_{i−1} + f_{i+1} − 2 f_i ) / dx²
          = − Lf(i) / dx² .
Julien Mairal (Inria) 427/564
Interpretation of regularization
For f : [0, 1] → R and xi = i/m, we have:

    Ω(f) = Σ_{i=1}^m ( f((i+1)/m) − f(i/m) )²
         ∼ Σ_{i=1}^m ( (1/m) f'(i/m) )²
         = (1/m) × (1/m) Σ_{i=1}^m f'(i/m)²
         ∼ (1/m) ∫_0^1 f'(t)² dt .

Julien Mairal (Inria) 428/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 429/564


Motivation

Consider the normalized Gaussian kernel on R^d :

    Kt(x, x') = (4πt)^{−d/2} exp( −‖x − x'‖² / (4t) ) .

In order to transpose it to the graph, replacing the Euclidean
distance by the shortest-path distance does not work.
In this section we provide a characterization of the Gaussian kernel
as the solution of a partial differential equation involving the
Laplacian, which we can transpose to the graph: the diffusion
equation.
The solution of the discrete diffusion equation will be called the
diffusion kernel or heat kernel.

Julien Mairal (Inria) 430/564


The diffusion equation
Lemma
For any x0 ∈ R^d , the function:

    K_{x0}(x, t) = Kt(x0, x) = (4πt)^{−d/2} exp( −‖x − x0‖² / (4t) )

is a solution of the diffusion equation:

    ∂/∂t K_{x0}(x, t) = Δ K_{x0}(x, t)

with initial condition K_{x0}(x, 0) = δ_{x0}(x)

(proof by direct computation).

Julien Mairal (Inria) 431/564


Discrete diffusion equation
For finite-dimensional ft ∈ R^m , the diffusion equation becomes:

    ∂/∂t ft = −L ft ,

which admits the following solution:

    ft = e^{−tL} f0    with    e^{−tL} = I − tL + (t²/2!) L² − (t³/3!) L³ + . . .

Julien Mairal (Inria) 432/564


Diffusion kernel (Kondor and Lafferty, 2002)
This suggests considering:

    K = e^{−tL} ,

which is indeed symmetric positive semi-definite because, if we write

    L = Σ_{i=1}^m λi ui ui^T    (λi ≥ 0),

we obtain:

    K = e^{−tL} = Σ_{i=1}^m e^{−tλi} ui ui^T .

Julien Mairal (Inria) 433/564


Example: complete graph

    Ki,j = ( 1 + (m−1) e^{−tm} ) / m    for i = j ,
           ( 1 − e^{−tm} ) / m          for i ≠ j .

Julien Mairal (Inria) 434/564
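This closed form is easy to verify numerically; the sketch below (an
illustration assuming NumPy/SciPy, with arbitrary values of m and t) compares
it with a direct computation of the matrix exponential.

    import numpy as np
    from scipy.linalg import expm

    m, t = 5, 0.3
    L = m * np.eye(m) - np.ones((m, m))        # Laplacian of the complete graph on m vertices
    K = expm(-t * L)                           # diffusion kernel e^{-tL}

    diag = (1 + (m - 1) * np.exp(-t * m)) / m  # closed form for i = j
    off = (1 - np.exp(-t * m)) / m             # closed form for i != j
    K_formula = off * np.ones((m, m)) + (diag - off) * np.eye(m)

    print(np.allclose(K, K_formula))           # True
    print(np.linalg.eigvalsh(K).min() > 0)     # the diffusion kernel is positive definite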


Example: closed chain

    Ki,j = (1/m) Σ_{ν=0}^{m−1} exp( −2t (1 − cos(2πν/m)) ) cos( 2πν(i−j)/m ) .

Julien Mairal (Inria) 435/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 436/564


Motivation
In this section we show that the diffusion and Laplace kernels can
be interpreted in the frequency domain of functions
This shows that our strategy to design kernels on graphs was based
on (discrete) harmonic analysis on the graph
This follows the approach we developed for semigroup kernels!

Julien Mairal (Inria) 437/564


Spectrum of the diffusion kernel
Let 0 = λ1 < λ2 ≤ . . . ≤ λm be the eigenvalues of the Laplacian:
    L = Σ_{i=1}^m λi ui ui^T    (λi ≥ 0)

The diffusion kernel Kt is an invertible matrix because its
eigenvalues are strictly positive:

    Kt = Σ_{i=1}^m e^{−tλi} ui ui^T

Julien Mairal (Inria) 438/564


Norm in the diffusion RKHS
Any function f ∈ R^m can be written as f = K K^{−1} f , therefore its
norm in the diffusion RKHS is:

    ‖f‖²_{Kt} = f^T K^{−1} K K^{−1} f = f^T K^{−1} f .

For i = 1, . . . , m, let:

    f̂i = ui^T f

be the projection of f onto the eigenbasis of K .
We then have:

    ‖f‖²_{Kt} = f^T K^{−1} f = Σ_{i=1}^m e^{tλi} f̂i² .

This looks similar to ∫ |f̂(ω)|² e^{σ²ω²} dω . . .
Julien Mairal (Inria) 439/564
Discrete Fourier transform
Definition
The vector f̂ = (f̂1, . . . , f̂m)^T is called the discrete Fourier transform of
f ∈ R^m .

The eigenvectors of the Laplacian are the discrete equivalent of the
sine/cosine Fourier basis on R^n .
The eigenvalues λi are the equivalent of the frequencies ω².
Successive eigenvectors “oscillate” more and more as the corresponding
eigenvalues increase.

Julien Mairal (Inria) 440/564


Example: eigenvectors of the Laplacian

Julien Mairal (Inria) 441/564


Generalization
This observation suggests defining a whole family of kernels:

    Kr = Σ_{i=1}^m r(λi) ui ui^T ,

associated with the following RKHS norms:

    ‖f‖²_{Kr} = Σ_{i=1}^m f̂i² / r(λi) ,

where r : R+ → R+* is a non-increasing function.

Julien Mairal (Inria) 442/564


Example : regularized Laplacian

    r(λ) = 1 / (λ + ε) ,    ε > 0

    K = Σ_{i=1}^m (1 / (λi + ε)) ui ui^T = (L + εI)^{−1}

    ‖f‖²_K = f^T K^{−1} f = Σ_{i∼j} (f(xi) − f(xj))² + ε Σ_{i=1}^m f(xi)² .

Julien Mairal (Inria) 443/564


Example

[Figure: a tree on 5 vertices with edges 1–3, 2–3, 3–4, 4–5]

                   [ 0.60  0.10  0.19  0.08  0.04 ]
                   [ 0.10  0.60  0.19  0.08  0.04 ]
    (L + I)^{−1} = [ 0.19  0.19  0.38  0.15  0.08 ]
                   [ 0.08  0.08  0.15  0.46  0.23 ]
                   [ 0.04  0.04  0.08  0.23  0.62 ]

Julien Mairal (Inria) 444/564


Outline

4 The Kernel Jungle


Kernels for probabilistic models
Kernels for biological sequences
Mercer kernels and shift-invariant kernels
Kernels for graphs
Kernels on graphs
Motivation
Graph distance and p.d. kernels
Construction by regularization
The diffusion kernel
Harmonic analysis on graphs
Applications

Julien Mairal (Inria) 445/564


Applications 1: graph partitioning
A classical relaxation of graph partitioning is:

    min_{f ∈ R^X} Σ_{i∼j} (fi − fj)²    s.t.   Σ_i fi² = 1

This can be rewritten

    max_f Σ_i fi²    s.t.   ‖f‖_H ≤ 1

This is principal component analysis in the RKHS (“kernel PCA”).

[Figure: a point cloud with its principal directions PC1 and PC2]

Julien Mairal (Inria) 446/564


Applications 2: search on a graph

Let x1, . . . , xq be a set of q nodes (the query). How to find
“similar” nodes (and rank them)?
One solution:

    min_f ‖f‖_H    s.t.   f(xi) ≥ 1 for i = 1, . . . , q .

Julien Mairal (Inria) 447/564


Application 3: Semi-supervised learning

Julien Mairal (Inria) 448/564


Application 3: Semi-supervised learning

Julien Mairal (Inria) 449/564


Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on less than 100 samples of two (or more) different
classes (e.g., different tumors)

Julien Mairal (Inria) 450/564


Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on less than 100 samples of two (or more) different
classes (e.g., different tumors)

Goal
Design a classifier to automatically assign a class to future samples
from their expression profile
Interpret biologically the differences between the classes

Julien Mairal (Inria) 450/564


Linear classifiers
The approach
Each sample is represented by a vector x = (x1, . . . , xp) where
p > 10⁵ is the number of probes
Classification: given the set of labeled samples, learn a linear
decision function:

    f(x) = Σ_{i=1}^p βi xi + β0 ,

that is positive for one class, negative for the other
Interpretation: the weight βi quantifies the influence of gene i on
the classification

Julien Mairal (Inria) 451/564


Linear classifiers
Pitfalls
No robust estimation procedure exists for 100 samples in 10⁵
dimensions!
It is necessary to reduce the complexity of the problem with prior
knowledge.

Julien Mairal (Inria) 452/564


Example : Norm Constraints
The approach
A common method in statistics to learn with few samples in high
dimension is to constrain the norm of β, e.g.:
Euclidean norm (support vector machines, ridge regression):
‖β‖₂² = Σ_{i=1}^p βi²
L1-norm (lasso regression): ‖β‖₁ = Σ_{i=1}^p |βi|

Pros
Good performance in classification
Cons
Limited interpretation (small weights)
No prior biological knowledge

Julien Mairal (Inria) 453/564


Example 2: Feature Selection
The approach
Constrain most weights to be 0, i.e., select a few genes (< 20) whose
expression is enough for classification. Interpretation is then about the
selected genes.

Pros
Good performance in classification
Useful for biomarker selection
Apparently easy interpretation
Cons
The gene selection process is usually not robust
Wrong interpretation is the rule (too much correlation between genes)

Julien Mairal (Inria) 454/564


Pathway interpretation
Motivation
Basic biological functions are usually expressed in terms of
pathways and not of single genes (metabolic, signaling, regulatory)
Many pathways are already known
How to use this prior knowledge to constrain the weights to have an
interpretation at the level of pathways?

Solution (Rapaport et al., 2006)


Constrain the diffusion RKHS norm of β
Relevant if the true decision function is indeed smooth w.r.t. the
biological network

Julien Mairal (Inria) 455/564


Pathway interpretation
Bad example
The graph is the complete known metabolic network of the budding
yeast (from the KEGG database)
We project the classifier weights learned by a SVM
Good classification accuracy, but no possible interpretation!

[Figure: the yeast metabolic network with the projected SVM weights; labeled
pathways include N-glycan biosynthesis, glycolysis/gluconeogenesis, porphyrin
and chlorophyll metabolism, protein kinases, sulfur metabolism, nitrogen and
asparagine metabolism, riboflavin metabolism, folate biosynthesis, DNA and RNA
polymerase subunits, biosynthesis of steroids / ergosterol metabolism, lysine
biosynthesis, oxidative phosphorylation, TCA cycle, phenylalanine, tyrosine and
tryptophan biosynthesis, purine metabolism.]
Julien Mairal (Inria) 456/564


Pathway interpretation

Good example
The graph is the complete
known metabolic network
of the budding yeast
(from KEGG database)
We project the classifier
weight learned by a
spectral SVM
Good classification
accuracy, and good
interpretation!

Julien Mairal (Inria) 457/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels

Julien Mairal (Inria) 458/564


Motivation

We have seen how to make learning algorithms given a kernel K on


some data space X
Often we may have several possible kernels:
by varying the kernel type or parameters on a given description of the
data (eg, linear, polynomial, Gaussian kernels with different
bandwidths...)
because we have different views of the same data, eg, a protein can
be characterized by its sequence, its structure, its mass spectrometry
profile...
How to choose or integrate different kernels in a learning task?

Julien Mairal (Inria) 459/564


Setting: learning with one kernel
For any f : X → R, let f^n = (f(x1), . . . , f(xn)) ∈ R^n .
Given a p.d. kernel K : X × X → R, we learn with K by solving:

    min_{f ∈ H_K} R(f^n) + λ‖f‖²_{H_K} ,        (4)

where λ > 0 and R : R^n → R is a closed¹ and convex empirical risk:

    R(u) = (1/n) Σ_{i=1}^n (ui − yi)²              for kernel ridge regression
    R(u) = (1/n) Σ_{i=1}^n max(1 − yi ui, 0)       for SVM
    R(u) = (1/n) Σ_{i=1}^n log(1 + exp(−yi ui))    for kernel logistic regression

¹ R is closed if, for each A ∈ R, the sublevel set {u ∈ R^n : R(u) ≤ A} is closed.
For example, if R is continuous then it is closed.
Julien Mairal (Inria) 460/564
Sum kernel

Definition
Let K1 , . . . , KM be M kernels on X . The sum kernel KS is the kernel on
X defined as
    ∀x, x' ∈ X ,   KS(x, x') = Σ_{i=1}^M Ki(x, x') .

Julien Mairal (Inria) 461/564


Sum kernel and vector concatenation
Theorem
For i = 1, . . . , M, let Φi : X → Hi be a feature map such that

    Ki(x, x') = ⟨Φi(x), Φi(x')⟩_{Hi} .

Then KS = Σ_{i=1}^M Ki can be written as:

    KS(x, x') = ⟨ΦS(x), ΦS(x')⟩_{HS} ,

where ΦS : X → HS = H1 ⊕ . . . ⊕ HM is the concatenation of the
feature maps Φi :

    ΦS(x) = (Φ1(x), . . . , ΦM(x))^T .

Therefore, summing kernels amounts to concatenating their feature
space representations, which is quite a natural way to integrate different
features.
Julien Mairal (Inria) 462/564
Proof
For ΦS(x) = (Φ1(x), . . . , ΦM(x))^T , we easily compute:

    ⟨ΦS(x), ΦS(x')⟩_{HS} = Σ_{i=1}^M ⟨Φi(x), Φi(x')⟩_{Hi}
                         = Σ_{i=1}^M Ki(x, x')
                         = KS(x, x') .

Julien Mairal (Inria) 463/564


Example: data integration with the sum kernel
[Slide shows Yamanishi, Vert and Kanehisa, “Protein network inference from
multiple genomic data: a supervised approach”, Bioinformatics 20(Suppl. 1),
i363–i370, 2004. Table 1 lists the experiments: the direct approach, the
spectral approach based on kernel PCA, and the supervised approach based on
kernel CCA, with predictor kernels Kexp (expression), Kppi (protein
interaction), Kloc (localization), Kphy (phylogenetic profile) and their sum
Kexp + Kppi + Kloc + Kphy (integration); the supervised approach additionally
uses the target kernel Kgold (protein network). ROC curves are reported for
the supervised approach.]

Julien Mairal (Inria) 464/564
The sum kernel: functional point of view
Theorem
The solution f∗ ∈ H_{KS} when we learn with KS = Σ_{i=1}^M Ki is equal to:

    f∗ = Σ_{i=1}^M fi∗ ,

where (f1∗, . . . , fM∗) ∈ H_{K1} × . . . × H_{KM} is the solution of:

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ Σ_{i=1}^M ‖fi‖²_{H_{Ki}} .

Julien Mairal (Inria) 465/564


Generalization: The weighted sum kernel
Theorem
The solution f∗ when we learn with Kη = Σ_{i=1}^M ηi Ki , with
η1, . . . , ηM ≥ 0, is equal to:

    f∗ = Σ_{i=1}^M fi∗ ,

where (f1∗, . . . , fM∗) ∈ H_{K1} × . . . × H_{KM} is the solution of:

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ Σ_{i=1}^M ‖fi‖²_{H_{Ki}} / ηi .

Julien Mairal (Inria) 466/564


Proof (1/4)
    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ Σ_{i=1}^M ‖fi‖²_{H_{Ki}} / ηi .

R being convex, the problem is strictly convex and has a unique
solution (f1∗, . . . , fM∗) ∈ H_{K1} × . . . × H_{KM} .
By the representer theorem, there exist α∗1, . . . , α∗M ∈ R^n such that

    fi∗(x) = Σ_{j=1}^n α∗ij Ki(xj, x) .

(α∗1, . . . , α∗M) is the solution of

    min_{α1,...,αM ∈ R^n} R( Σ_{i=1}^M Ki αi ) + λ Σ_{i=1}^M αi^T Ki αi / ηi .

Julien Mairal (Inria) 467/564


Proof (2/4)
This is equivalent to

    min_{u, α1,...,αM ∈ R^n} R(u) + λ Σ_{i=1}^M αi^T Ki αi / ηi    s.t.   u = Σ_{i=1}^M Ki αi .

This is equivalent to the saddle point problem:

    min_{u, α1,...,αM ∈ R^n} max_{γ ∈ R^n} R(u) + λ Σ_{i=1}^M αi^T Ki αi / ηi + 2λ γ^T ( u − Σ_{i=1}^M Ki αi ) .

By Slater’s condition, strong duality holds, meaning we can invert
min and max:

    max_{γ ∈ R^n} min_{u, α1,...,αM ∈ R^n} R(u) + λ Σ_{i=1}^M αi^T Ki αi / ηi + 2λ γ^T ( u − Σ_{i=1}^M Ki αi ) .

Julien Mairal (Inria) 468/564


Proof (3/4)
Minimization in u:

    min_u { R(u) + 2λ γ^T u } = − max_u { −2λ γ^T u − R(u) } = −R*(−2λγ) ,

where R* is the Fenchel dual of R:

    ∀v ∈ R^n ,   R*(v) = sup_{u ∈ R^n} u^T v − R(u) .

Minimization in αi for i = 1, . . . , M:

    min_{αi} { λ αi^T Ki αi / ηi − 2λ γ^T Ki αi } = −λ ηi γ^T Ki γ ,

where the minimum in αi is reached for α∗i = ηi γ .

Julien Mairal (Inria) 469/564


Proof (4/4)
The dual problem is therefore

    max_{γ ∈ R^n} { −R*(−2λγ) − λ γ^T ( Σ_{i=1}^M ηi Ki ) γ } .

Note that if we learn from a single kernel Kη , we get the same dual
problem

    max_{γ ∈ R^n} { −R*(−2λγ) − λ γ^T Kη γ } .

If γ∗ is a solution of the dual problem, then α∗i = ηi γ∗ , leading to:

    ∀x ∈ X ,   fi∗(x) = Σ_{j=1}^n α∗ij Ki(xj, x) = ηi Σ_{j=1}^n γ∗j Ki(xj, x) .

Therefore, f∗ = Σ_{i=1}^M fi∗ satisfies

    f∗(x) = Σ_{i=1}^M Σ_{j=1}^n ηi γ∗j Ki(xj, x) = Σ_{j=1}^n γ∗j Kη(xj, x) .  □

Julien Mairal (Inria) 470/564


Learning the kernel

Motivation
If we know how to weight each kernel, then we can learn with the
weighted kernel
    Kη = Σ_{i=1}^M ηi Ki

However, usually we don’t know...


Perhaps we can optimize the weights ηi during learning?

Julien Mairal (Inria) 471/564


An objective function for K
Theorem
For any p.d. kernel K on X , let

    J(K) = min_{f ∈ H_K} R(f^n) + λ‖f‖²_{H_K} .

The function K ↦ J(K) is convex.


This suggests a principled way to ”learn” a kernel: define a convex set of
candidate kernels, and minimize J(K ) by convex optimization.

Julien Mairal (Inria) 472/564


Proof
We have shown by strong duality that

    J(K) = max_{γ ∈ R^n} { −R*(−2λγ) − λ γ^T K γ } .

For each fixed γ, this is an affine function of K, hence convex.
A supremum of convex functions is convex. □

Julien Mairal (Inria) 473/564


MKL (Lanckriet et al., 2004)
We consider the set of convex combinations

    Kη = Σ_{i=1}^M ηi Ki    with   η ∈ Σ_M = { ηi ≥ 0 , Σ_{i=1}^M ηi = 1 }

We optimize both η and f∗ by solving:

    min_{η ∈ Σ_M} J(Kη) = min_{η ∈ Σ_M} min_{f ∈ H_{Kη}} { R(f^n) + λ‖f‖²_{H_{Kη}} }

The problem is jointly convex in (η, α) and can be solved efficiently.
The output is both a set of weights η and a predictor
corresponding to the kernel method trained with kernel Kη .
This method is usually called Multiple Kernel Learning (MKL).

Julien Mairal (Inria) 474/564


Example: protein annotation
[Slide shows Lanckriet, De Bie, Cristianini, Jordan and Noble, “A statistical
framework for genomic data fusion”, Bioinformatics 20(16):2626–2635, 2004.
Table 1 lists the kernels used to compare proteins: KSW (protein sequences,
Smith-Waterman), KB (protein sequences, BLAST), KPfam (protein sequences,
Pfam HMM), KFFT (hydropathy profile, FFT), KLI (protein interactions, linear
kernel), KD (protein interactions, diffusion kernel), KE (gene expression,
radial basis kernel) and KRND (random numbers, linear kernel, included as a
control). Bar plots report ROC scores, true positives at 1% false positives,
and kernel weights for (A) ribosomal proteins and (B) membrane proteins;
combining datasets yields better classification performance (Fig. 1).]

Julien Mairal (Inria) 475/564
Example: Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination
by MKL (M).

Performance comparison on Corel14

[Figure: test error (roughly 0.05–0.12) for the kernels H, W, TW, wTW and M.]

Julien Mairal (Inria) 476/564


MKL revisited (Bach et al., 2004)
    Kη = Σ_{i=1}^M ηi Ki    with   η ∈ Σ_M = { ηi ≥ 0 , Σ_{i=1}^M ηi = 1 }

Theorem
The solution f∗ of

    min_{η ∈ Σ_M} min_{f ∈ H_{Kη}} { R(f^n) + λ‖f‖²_{H_{Kη}} }

is f∗ = Σ_{i=1}^M fi∗ , where (f1∗, . . . , fM∗) ∈ H_{K1} × . . . × H_{KM} is the solution of:

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ ( Σ_{i=1}^M ‖fi‖_{H_{Ki}} )² .

Julien Mairal (Inria) 477/564


Proof (1/2)

    min_{η ∈ Σ_M} min_{f ∈ H_{Kη}} { R(f^n) + λ‖f‖²_{H_{Kη}} }

      = min_{η ∈ Σ_M} min_{f1,...,fM} { R( Σ_{i=1}^M fi^n ) + λ Σ_{i=1}^M ‖fi‖²_{H_{Ki}} / ηi }

      = min_{f1,...,fM} { R( Σ_{i=1}^M fi^n ) + λ min_{η ∈ Σ_M} Σ_{i=1}^M ‖fi‖²_{H_{Ki}} / ηi }

      = min_{f1,...,fM} { R( Σ_{i=1}^M fi^n ) + λ ( Σ_{i=1}^M ‖fi‖_{H_{Ki}} )² } ,

Julien Mairal (Inria) 478/564


Proof (2/2)
where the last equality results from:

    ∀a ∈ R^M_+ ,   ( Σ_{i=1}^M ai )² = inf_{η ∈ Σ_M} Σ_{i=1}^M ai² / ηi ,

which is a direct consequence of the Cauchy-Schwarz inequality:

    Σ_{i=1}^M ai = Σ_{i=1}^M (ai / √ηi) × √ηi ≤ ( Σ_{i=1}^M ai² / ηi )^{1/2} ( Σ_{i=1}^M ηi )^{1/2} .

Julien Mairal (Inria) 479/564


Algorithm: simpleMKL (Rakotomamonjy et al., 2008)
We want to minimize in η ∈ Σ_M :

    min_{η ∈ Σ_M} J(Kη) = min_{η ∈ Σ_M} max_{γ ∈ R^n} { −R*(−2λγ) − λ γ^T Kη γ } .

For a fixed η ∈ Σ_M , we can compute f(η) = J(Kη) by using a
standard solver for a single kernel to find γ∗ :

    J(Kη) = −R*(−2λγ∗) − λ γ∗^T Kη γ∗ .

From γ∗ we can also compute the gradient of J(Kη) with respect to η:

    ∂J(Kη)/∂ηi = −λ γ∗^T Ki γ∗ .

J(Kη) can then be minimized on Σ_M by a projected gradient or
reduced gradient algorithm.

Julien Mairal (Inria) 480/564
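For illustration, here is a hedged sketch of such a loop when the single-kernel
solver is kernel ridge regression, for which the dual variable
γ∗ = (Kη + λnI)^{−1} y is available in closed form. The step size, the number
of iterations and the crude renormalization used instead of an exact Euclidean
projection onto the simplex are simplifying assumptions, not part of simpleMKL
itself.

    import numpy as np

    def simple_mkl_krr(kernels, y, lam, n_iter=200, step=0.05):
        # kernels: list of M precomputed (n, n) Gram matrices
        M, n = len(kernels), len(y)
        eta = np.full(M, 1.0 / M)   # start from the uniform combination
        for _ in range(n_iter):
            K_eta = sum(e * K for e, K in zip(eta, kernels))
            gamma = np.linalg.solve(K_eta + lam * n * np.eye(n), y)       # single-kernel KRR solver
            grad = np.array([-lam * gamma @ K @ gamma for K in kernels])  # dJ(K_eta)/d eta_i
            eta = np.maximum(eta - step * grad, 0)
            eta /= eta.sum()        # crude renormalization, not an exact simplex projection
        return eta, gamma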


Sum kernel vs MKL
Learning with the sum kernel (uniform combination) solves

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ Σ_{i=1}^M ‖fi‖²_{H_{Ki}} .

Learning with MKL (best convex combination) solves

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ ( Σ_{i=1}^M ‖fi‖_{H_{Ki}} )² .

Although MKL can be thought of as optimizing a convex
combination of kernels, it is more correct to think of it as a
penalized risk minimization estimator with the group lasso penalty:

    Ω(f) = min_{f1+...+fM = f} Σ_{i=1}^M ‖fi‖_{H_{Ki}} .

Julien Mairal (Inria) 481/564


Example: ridge vs LASSO regression
Take X = R^d , and for x = (x1, . . . , xd)^T consider the rank-1
kernels:

    ∀i = 1, . . . , d ,   Ki(x, x') = xi xi' .

A function fi ∈ H_{Ki} has the form fi(x) = βi xi , with ‖fi‖_{H_{Ki}} = |βi| .
The sum kernel is KS(x, x') = Σ_{i=1}^d xi xi' = x^T x' , and a function of H_{KS} is
of the form f(x) = β^T x , with norm ‖f‖_{H_{KS}} = ‖β‖_{R^d} .
Learning with the sum kernel solves a ridge regression problem:

    min_{β ∈ R^d} R(Xβ) + λ Σ_{i=1}^d βi² .

Learning with MKL solves a LASSO regression problem:

    min_{β ∈ R^d} R(Xβ) + λ ( Σ_{i=1}^d |βi| )² .

Julien Mairal (Inria) 482/564


Extensions (Micchelli et al., 2005)

For r > 0 ,   Kη = Σ_{i=1}^M ηi Ki   with   η ∈ Σ^r_M = { ηi ≥ 0 , Σ_{i=1}^M ηi^r = 1 }

Theorem
The solution f∗ of

    min_{η ∈ Σ^r_M} min_{f ∈ H_{Kη}} { R(f^n) + λ‖f‖²_{H_{Kη}} }

is f∗ = Σ_{i=1}^M fi∗ , where (f1∗, . . . , fM∗) ∈ H_{K1} × . . . × H_{KM} is the solution of:

    min_{f1,...,fM} R( Σ_{i=1}^M fi^n ) + λ ( Σ_{i=1}^M ‖fi‖_{H_{Ki}}^{2r/(r+1)} )^{(r+1)/r} .

Julien Mairal (Inria) 483/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels

Julien Mairal (Inria) 484/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Large-scale learning with linear models
Nyström approximations
Random Fourier features
New challenges
“Deep” learning with kernels

Julien Mairal (Inria) 485/564


Motivation
Main problem
All methods we have seen require computing the n × n Gram matrix,
which is infeasible when n is significantly greater than 100 000 both in
terms of memory and computation.

Solutions
low-rank approximation of the kernel;
random Fourier features.
The goal is to find an approximate embedding ψ : X → Rd such that

    K(x, x') ≈ ⟨ψ(x), ψ(x')⟩_{R^d} .

Julien Mairal (Inria) 486/564


Motivation
Then, functions f in H may be approximated by linear ones in R^d , e.g.,

    f(x) = Σ_{i=1}^n αi K(xi, x) ≈ ⟨ Σ_{i=1}^n αi ψ(xi), ψ(x) ⟩_{R^d} = ⟨w, ψ(x)⟩_{R^d} .

Then, the ERM problem

    min_{f ∈ H} (1/n) Σ_{i=1}^n L(yi, f(xi)) + λ‖f‖²_H

becomes, approximately,

    min_{w ∈ R^d} (1/n) Σ_{i=1}^n L(yi, w^T ψ(xi)) + λ‖w‖²₂ ,

which we know how to solve when n is large.

Julien Mairal (Inria) 487/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Large-scale learning with linear models
Nyström approximations
Random Fourier features
New challenges
“Deep” learning with kernels

Julien Mairal (Inria) 488/564


Large-scale learning with linear models
Let us study for a while optimization techniques for minimizing large
sums of functions
    min_{w ∈ R^d} (1/n) Σ_{i=1}^n fi(w) .

Good candidates are


stochastic optimization techniques;
randomized incremental optimization techniques;
We will see a couple of such algorithms with their convergence rates and
start with the (batch) gradient descent method.

Julien Mairal (Inria) 489/564


Introduction of a few optimization principles
Why do we care about convexity?

Julien Mairal (Inria) 490/564


Introduction of a few optimization principles
Why do we care about convexity?
Local observations give information about the global optimum
[Figure: a convex function f(w) and its global minimizer w⋆]

∇f(w) = 0 is a necessary and sufficient optimality condition for
differentiable convex functions;
it is often easy to upper-bound f(w) − f⋆ .
Julien Mairal (Inria) 490/564
Introduction of a few optimization principles
An important inequality for smooth convex functions

If f is convex
f (w)

w⋆ w0
b b

w
b

f (w) ≥ f (w0 ) + ∇f (w0 )> (w − w0 );


| {z }
linear approximation
this is an equivalent definition of convexity for smooth functions.
Julien Mairal (Inria) 491/564
Introduction of a few optimization principles
An important inequality for smooth functions

If ∇f is L-Lipschitz continuous (f does not need to be convex)

[Figure: f(w) lies below its quadratic upper bound g(w) built at w0; w1 minimizes g]

    f(w) ≤ g(w) = f(w0) + ∇f(w0)^T (w − w0) + (L/2)‖w − w0‖²₂ ;
    g(w) = C_{w0} + (L/2)‖w0 − (1/L)∇f(w0) − w‖²₂ .
Julien Mairal (Inria) 492/564
Introduction of a few optimization principles
An important inequality for smooth functions

If ∇f is L-Lipschitz continuous (f does not need to be convex)

[Figure: f(w) lies below its quadratic upper bound g(w) built at w0; w1 minimizes g]

    f(w) ≤ g(w) = f(w0) + ∇f(w0)^T (w − w0) + (L/2)‖w − w0‖²₂ ;
    w1 = w0 − (1/L)∇f(w0)   (gradient descent step).
Julien Mairal (Inria) 492/564
Introduction of a few optimization principles
Gradient Descent Algorithm

Assume that f is convex and differentiable, and that ∇f is L-Lipschitz.

Theorem
Consider the algorithm

    wt ← wt−1 − (1/L) ∇f(wt−1) .

Then,

    f(wt) − f⋆ ≤ L‖w0 − w⋆‖²₂ / (2t) .

Remarks
the convergence rate improves under additional assumptions on f
(strong convexity);
some variants have a O(1/t²) convergence rate [Nesterov, 2004].

Julien Mairal (Inria) 493/564
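A minimal illustration of the algorithm on a least-squares objective, where the
Lipschitz constant L is the largest eigenvalue of (1/n) X^T X (a sketch with
random data, not part of the original slides).

    import numpy as np

    np.random.seed(0)
    n, d = 100, 10
    X, w_true = np.random.randn(n, d), np.random.randn(d)
    y = X @ w_true

    f = lambda w: 0.5 / n * np.sum((X @ w - y) ** 2)
    grad = lambda w: X.T @ (X @ w - y) / n
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient

    w = np.zeros(d)
    for t in range(500):
        w = w - grad(w) / L                     # w_t = w_{t-1} - (1/L) grad f(w_{t-1})
    print(f(w))                                 # close to the optimal value 0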


Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,

    f(w) ≤ f(z) + ∇f(z)^T (w − z) + (L/2)‖w − z‖²₂ .

By using Taylor’s theorem in integral form,

    f(w) − f(z) = ∫₀¹ ∇f(tw + (1 − t)z)^T (w − z) dt .

Then,

    f(w) − f(z) − ∇f(z)^T (w − z) ≤ ∫₀¹ ( ∇f(tw + (1 − t)z) − ∇f(z) )^T (w − z) dt
                                  ≤ ∫₀¹ | ( ∇f(tw + (1 − t)z) − ∇f(z) )^T (w − z) | dt
                                  ≤ ∫₀¹ ‖∇f(tw + (1 − t)z) − ∇f(z)‖₂ ‖w − z‖₂ dt    (C.-S.)
                                  ≤ ∫₀¹ L t ‖w − z‖²₂ dt = (L/2)‖w − z‖²₂ .

Julien Mairal (Inria) 494/564


Proof (2/2)
Proof of the theorem
We have shown that for all w,

    f(w) ≤ gt(w) = f(wt−1) + ∇f(wt−1)^T (w − wt−1) + (L/2)‖w − wt−1‖²₂ .

gt is minimized by wt ; it can be rewritten gt(w) = gt(wt) + (L/2)‖w − wt‖²₂ . Then,

    f(wt) ≤ gt(wt) = gt(w⋆) − (L/2)‖w⋆ − wt‖²₂
          = f(wt−1) + ∇f(wt−1)^T (w⋆ − wt−1) + (L/2)‖w⋆ − wt−1‖²₂ − (L/2)‖w⋆ − wt‖²₂
          ≤ f⋆ + (L/2)‖w⋆ − wt−1‖²₂ − (L/2)‖w⋆ − wt‖²₂ .

By summing from t = 1 to T , we have a telescopic sum

    T ( f(wT) − f⋆ ) ≤ Σ_{t=1}^T f(wt) − f⋆ ≤ (L/2)‖w⋆ − w0‖²₂ − (L/2)‖w⋆ − wT‖²₂ .

Julien Mairal (Inria) 495/564


Introduction of a few optimization principles
An important inequality for smooth and µ-strongly convex functions

If ∇f is L-Lipschitz continuous and f is µ-strongly convex

[Figure: f(w) is sandwiched between a quadratic lower bound and a quadratic upper bound built at w0]

    f(w) ≤ f(w0) + ∇f(w0)^T (w − w0) + (L/2)‖w − w0‖²₂ ;
    f(w) ≥ f(w0) + ∇f(w0)^T (w − w0) + (µ/2)‖w − w0‖²₂ ;
Julien Mairal (Inria) 496/564
Introduction of a few optimization principles
Proposition
When f is µ-strongly convex, differentiable and ∇f is L-Lipschitz, the
gradient descent algorithm with step-size 1/L produces iterates such that

    f(wt) − f⋆ ≤ (1 − µ/L)^t · L‖w0 − w⋆‖²₂ / 2 .

We call that a linear convergence rate (even though it has an
exponential form).

Julien Mairal (Inria) 497/564


Proof
We start from an inequality from the previous proof:

    f(wt) ≤ f(wt−1) + ∇f(wt−1)^T (w⋆ − wt−1) + (L/2)‖w⋆ − wt−1‖²₂ − (L/2)‖w⋆ − wt‖²₂
          ≤ f⋆ + ((L−µ)/2)‖w⋆ − wt−1‖²₂ − (L/2)‖w⋆ − wt‖²₂ .

In addition, we have that f(wt) ≥ f⋆ + (µ/2)‖wt − w⋆‖²₂ , and thus

    ‖w⋆ − wt‖²₂ ≤ ((L−µ)/(L+µ)) ‖w⋆ − wt−1‖²₂
                ≤ (1 − µ/L) ‖w⋆ − wt−1‖²₂ .

Finally,

    f(wt) − f⋆ ≤ (L/2)‖wt − w⋆‖²₂
               ≤ (1 − µ/L)^t · L‖w⋆ − w0‖²₂ / 2 .

Julien Mairal (Inria) 498/564


The stochastic (sub)gradient descent algorithm
Consider now the minimization of an expectation

    min_{w ∈ R^p} f(w) = E_x[ ℓ(x, w) ] .

To simplify, we assume that for all x, w ↦ ℓ(x, w) is differentiable, but
everything here is true for nonsmooth functions.

Algorithm
At iteration t,
Randomly draw one example xt from the training set;
Update the current iterate

    wt ← wt−1 − ηt ∇w ℓ(xt, wt−1) .

Perform online averaging of the iterates (optional)

    w̃t ← (1 − γt) w̃t−1 + γt wt .
Julien Mairal (Inria) 499/564
The stochastic (sub)gradient descent algorithm
There are various learning-rate strategies (constant or varying step sizes)
and averaging strategies. Depending on the problem assumptions and the
choice of ηt, γt, classical convergence rates may be obtained (see
Nemirovsky et al., 2009):

    f(w̃t) − f⋆ = O(1/√t) for convex problems;
    f(w̃t) − f⋆ = O(1/t) for strongly-convex ones.

Remarks
The convergence rates are not that great, but the complexity
per-iteration is small (1 gradient evaluation for minimizing an
empirical risk versus n for the batch algorithm).
When the amount of data is infinite, the method minimizes the
expected risk.
Choosing a good learning rate automatically is an open problem.

Julien Mairal (Inria) 500/564
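A hedged sketch of the procedure on a least-squares problem; the step-size
schedule ηt = 1/√(t+1) and the uniform averaging are simple illustrative
choices, not prescriptions from the slides.

    import numpy as np

    np.random.seed(0)
    n, d = 1000, 10
    X, w_true = np.random.randn(n, d), np.random.randn(d)
    y = X @ w_true + 0.1 * np.random.randn(n)

    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(10000):
        i = np.random.randint(n)              # draw one example at random
        g = (X[i] @ w - y[i]) * X[i]          # stochastic gradient of the squared loss
        w = w - g / np.sqrt(t + 1)            # step size eta_t = 1/sqrt(t+1)
        w_avg = (t * w_avg + w) / (t + 1)     # online uniform averaging
    print(np.mean((X @ w_avg - y) ** 2))      # residual error of the averaged iterate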


Randomized incremental algorithms (1/3)
Consider now the minimization of a large finite sum of smooth convex
functions:
    min_{w ∈ R^p} (1/n) Σ_{i=1}^n fi(w) .

A class of algorithms with low per-iteration complexity has been
recently introduced that enjoys exponential (aka linear) convergence
rates for strongly-convex problems, e.g., SAG (Schmidt et al., 2013)

SAG algorithm

    wt ← wt−1 − (γ/(Ln)) Σ_{i=1}^n yi^t    with    yi^t = ∇fi(wt−1) if i = it ,
                                                           yi^{t−1}  otherwise.

Julien Mairal (Inria) 501/564
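A hedged sketch of the SAG update for a least-squares sum with
fi(w) = (1/2)(xi^T w − yi)²; the constant L (the largest ‖xi‖²), the choice
γ = 1 and the random data are illustrative assumptions.

    import numpy as np

    np.random.seed(0)
    n, d = 500, 10
    X, w_true = np.random.randn(n, d), np.random.randn(d)
    y = X @ w_true

    L = np.max(np.sum(X ** 2, axis=1))   # Lipschitz constant of the individual gradients
    w = np.zeros(d)
    y_mem = np.zeros((n, d))             # y_i: last gradient stored for each f_i
    g_sum = np.zeros(d)                  # running sum of the stored gradients
    for t in range(20 * n):
        i = np.random.randint(n)
        g_new = (X[i] @ w - y[i]) * X[i] # gradient of f_i at the current iterate
        g_sum += g_new - y_mem[i]
        y_mem[i] = g_new
        w = w - g_sum / (L * n)          # w_t = w_{t-1} - (gamma/(Ln)) sum_i y_i^t, gamma = 1
    print(np.mean((X @ w - y) ** 2))     # residual error after 20 passes over the data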


Randomized incremental algorithms (2/3)
Consider now the minimization of a large finite sum of smooth convex
functions:
n
1X µ
minp fi (w) + kwk22 ,
w∈R n 2
i=1

A class of algorithms with low per-iteration complexity have been


recently developed that enjoy exponential convergence rates for
strongly-convex problems, e.g., MISO/Finito (Mairal, 2015; Defazio et
al., 2015; Lin et al., 2015)

Basic MISO/Finito algorithm (requires n ≥ 2L/µ)

∇fi (wt−1 ) if i = it

1 t
t
w ←w t−1
− (y − yit−1 ) with yit = .
µn it t yit−1 otherwise

see also SDCA (Shalev-Shwartz and Zhang, 2012).

Julien Mairal (Inria) 502/564


Randomized incremental algorithms (3/3)
Many of these techniques are in fact performing SGD-type steps

    wt ← wt−1 − ηt gt ,

where E[gt | wt−1] = ∇f(wt−1), but where the estimator of the gradient
has lower variance than in SGD (see SVRG [Johnson and Zhang, 2013]).
Typically, these methods have the convergence rate

    f(wt) − f⋆ = O( (1 − C min(1/n, µ/L))^t ) ,

and their complexity per iteration is independent of n! In addition, they
are often almost parameter-free (theoretical values for their learning
rates work in practice).

Julien Mairal (Inria) 503/564


Large-scale learning with linear models
Conclusion
we know how to deal with huge-scale problems when the models are
linear;
significant progress has been made during the last 3-4 years;
all of this is also useful to learn with kernels!

Julien Mairal (Inria) 504/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Large-scale learning with linear models
Nyström approximations
Random Fourier features
New challenges
“Deep” learning with kernels

Julien Mairal (Inria) 505/564


Nyström approximations [Williams and Seeger, 2002] (1/14)
Consider a dataset x1, . . . , xn in X with a p.d. kernel K : X × X → R.
Call H its RKHS and ϕ : X → H the mapping such that
K(x, x') = ⟨ϕ(x), ϕ(x')⟩_H .
A natural approximation consists of representing each data point xi as a
linear combination of a few anchor points fj in H:

    ϕ(x) ≈ Σ_{j=1}^d βj(x) fj .

Then,

    ⟨ϕ(x), ϕ(x')⟩_H ≈ ⟨ Σ_{j=1}^d βj(x) fj , Σ_{j=1}^d βj(x') fj ⟩_H
                    = Σ_{j,l=1}^d βj(x) βl(x') ⟨fj, fl⟩_H = β(x)^T G β(x') .

Julien Mairal (Inria) 506/564


Nyström approximations (2/14)
Then, we have

    ⟨ϕ(x), ϕ(x')⟩_H ≈ β(x)^T G β(x') = ⟨ψ(x), ψ(x')⟩_{R^d} ,

with

    ψ(x) = G^{1/2} β(x) .

In practice, the anchor points fj in H and the coordinates β are learned
by minimizing the least-squares error in H

    min_{f1,...,fd ∈ H, βij ∈ R} Σ_{i=1}^n ‖ ϕ(xi) − Σ_{j=1}^d βij fj ‖²_H .

Julien Mairal (Inria) 507/564


Nyström approximations (3/14)
Note that the problem

    min_{f1,...,fd ∈ H, βij ∈ R} Σ_{i=1}^n ‖ ϕ(xi) − Σ_{j=1}^d βij fj ‖²_H

is equivalent, after expanding the quadratic function, to

    min_{f1,...,fd ∈ H, βij ∈ R} Σ_{i=1}^n ( −2 Σ_{j=1}^d βij ⟨fj, ϕ(xi)⟩_H + Σ_{j,l=1}^d βij βil ⟨fj, fl⟩_H ) ,

or also

    min_{f1,...,fd ∈ H, βij ∈ R} Σ_{i=1}^n ( −2 Σ_{j=1}^d βij fj(xi) + Σ_{j,l=1}^d βij βil ⟨fj, fl⟩_H ) .

Julien Mairal (Inria) 508/564


Nyström approximations (4/14)
Then, call [Kf]jl = ⟨fj, fl⟩_H and f(xi) = [f1(xi), . . . , fd(xi)] in R^d . The
problem may be rewritten as

    min_{f1,...,fd ∈ H, βi ∈ R^d} Σ_{i=1}^n −2 βi^T f(xi) + βi^T Kf βi ,

and by minimizing with respect to all βi with f fixed, we have that
βi = Kf^{−1} f(xi) (assuming Kf to be invertible to simplify), which leads to

    max_{f1,...,fd ∈ H} Σ_{i=1}^n f(xi)^T Kf^{−1} f(xi) .

Consider an optimal solution f⋆ and perform the eigenvalue
decomposition Kf⋆ = U Δ U^T . Then, define the functions
[g1⋆(x), . . . , gd⋆(x)] = Δ^{−1/2} U^T f⋆(x). The functions gj⋆ are points in the
RKHS H (as linear combinations of entries of f⋆).
Julien Mairal (Inria) 509/564
Nyström approximations (5/14)
By construction,

    [Kg⋆]jl = ⟨gj⋆, gl⋆⟩_H
            = ⟨ (1/√Δjj) Σ_{k=1}^d [U]kj fk⋆ , (1/√Δll) Σ_{k=1}^d [U]kl fk⋆ ⟩_H
            = (1/√Δjj)(1/√Δll) Σ_{k,k'=1}^d [U]kj [U]k'l ⟨fk⋆, fk'⋆⟩_H
            = (1/√Δjj)(1/√Δll) Σ_{k,k'=1}^d [U]kj [U]k'l [Kf⋆]kk'
            = (1/√Δjj)(1/√Δll) uj^T Kf⋆ ul
            = δ_{j=l} .

Julien Mairal (Inria) 510/564


Nyström approximations (6/14)
Then, Kg⋆ = I and g⋆ is also a solution of the problem

    max_{f1,...,fd ∈ H} Σ_{i=1}^n f(xi)^T Kf^{−1} f(xi) ,

since

    f⋆(xi)^T Kf⋆^{−1} f⋆(xi) = f⋆(xi)^T U Δ^{−1} U^T f⋆(xi)
                             = g⋆(xi)^T g⋆(xi) = g⋆(xi)^T Kg⋆^{−1} g⋆(xi) ,

and also a solution of the problem

    max_{g1,...,gd ∈ H} Σ_{j=1}^d Σ_{i=1}^n gj(xi)²    s.t.   gj ⊥ gk for k ≠ j .

Julien Mairal (Inria) 511/564


Nyström approximations (6/14)
Then, Kg⋆ = I and g⋆ is also a solution of the problem

    max_{f1,...,fd ∈ H} Σ_{i=1}^n f(xi)^T Kf^{−1} f(xi) ,

since

    f⋆(xi)^T Kf⋆^{−1} f⋆(xi) = f⋆(xi)^T U Δ^{−1} U^T f⋆(xi)
                             = g⋆(xi)^T g⋆(xi) = g⋆(xi)^T Kg⋆^{−1} g⋆(xi) ,

and also a solution of the problem

    max_{g1,...,gd ∈ H} Σ_{j=1}^d Σ_{i=1}^n gj(xi)²    s.t.   gj ⊥ gk for k ≠ j .

This is the kernel PCA formulation!

Julien Mairal (Inria) 511/564


Nyström approximations (7/14)
First recipe with kernel PCA
Given a dataset of n training points x1 , . . . , xn in X ,
randomly choose a subset Z = [xz1 , . . . , xzm ] of m ≤ n training
points;
compute the m × m kernel matrix KZ ,Z .
perform kernel PCA to find the d ≤ m largest principal directions
(parametrized by d vectors αj in Rm );
Then, every point x in X may be approximated by

    ψ(x) = β(x) = [g1⋆(x), . . . , gd⋆(x)]^T
         = [ Σ_{i=1}^m α1i K(xzi, x), . . . , Σ_{i=1}^m αdi K(xzi, x) ]^T .

Julien Mairal (Inria) 512/564


Nyström approximations (8/14)
The complexity of training is O(m3 ) (eig decomposition) + O(m2 )
kernel evaluations.
The complexity of encoding a point x is O(md) (matrix vector
multiplication) + O(m) kernel evaluations.

Images courtesy of Vedaldi and Zisserman [2012]


Julien Mairal (Inria) 513/564
Nyström approximations (9/14)
The main issue with kernel PCA is the encoding time, which depends
linearly on m. A popular alternative is instead to select the anchor points
among the training data points x1, . . . , xn . Then, choose
f1 = ϕ(xz1), . . . , fd = ϕ(xzd).

Second recipe with random point sampling
Given a dataset of n training points x1, . . . , xn in X ,
randomly choose a subset Z = [xz1, . . . , xzd] of d training points;
compute the d × d kernel matrix KZ,Z .
Then, a new point x is encoded as

    ψ(x) = K_{Z,Z}^{1/2} β(x) = K_{Z,Z}^{1/2} K_{Z,Z}^{−1} f(x)
         = K_{Z,Z}^{−1/2} [K(xz1, x), . . . , K(xzd, x)]^T
         = K_{Z,Z}^{−1/2} K_{Z,x} .

Julien Mairal (Inria) 514/564
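A hedged sketch of this second recipe for the Gaussian kernel; the kernel
choice, the bandwidth, the data and the small ridge added before taking the
inverse square root are assumptions made for illustration.

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    np.random.seed(0)
    X = np.random.randn(1000, 5)                         # training points
    d = 50
    Z = X[np.random.choice(len(X), d, replace=False)]    # random anchor points

    KZZ = gaussian_kernel(Z, Z) + 1e-8 * np.eye(d)       # small ridge for numerical stability
    vals, U = np.linalg.eigh(KZZ)
    KZZ_inv_sqrt = U @ np.diag(vals ** -0.5) @ U.T       # K_{Z,Z}^{-1/2}

    def psi(X_new):
        # psi(x) = K_{Z,Z}^{-1/2} K_{Z,x}, one row per point
        return gaussian_kernel(X_new, Z) @ KZZ_inv_sqrt

    approx = psi(X[:5]) @ psi(X[:5]).T                   # approximate 5 x 5 Gram block
    print(np.abs(approx - gaussian_kernel(X[:5], X[:5])).max())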


Nyström approximations (10/14)
The complexity of training is O(d 3 ) (eig decomposition) + O(d 2 )
kernel evaluations.
The complexity of encoding a point x is O(d 2 ) (matrix vector
multiplication) + O(d) kernel evaluations.

Images courtesy of Vedaldi and Zisserman [2012]


Julien Mairal (Inria) 515/564
Nyström approximations (11/14)
The encoding time is now low, but the (random) choice of anchor points
is not clever. Better approximation can be obtained with a greedy
algorithm that iteratively selects one column at a time with largest
residual (Bach and Jordan, 2002; Smola and Schölkopf, 2000).

At iteration k, assume that Z = [z1, . . . , zk]; then, the residual for a
data point x encoded with k anchor points f1, . . . , fk is

    min_{β ∈ R^k} ‖ ϕ(x) − Σ_{j=1}^k βj fj ‖²_H ,

which is equal to

    ‖ϕ(x)‖²_H − f(x)^T Kf^{−1} f(x) ,

and since fj = ϕ(xzj) for all j, the data point xi with largest residual is
the one that maximizes

    K(xi, xi) − K_{xi,Z} K_{Z,Z}^{−1} K_{Z,xi} .

Julien Mairal (Inria) 516/564


Nyström approximations (12/14)
This brings us to the following algorithm
Third recipe with greedy anchor point selection
Initialize Z = ∅. For k = 1, . . . , d do
data point selection

    zk ← argmax_{i ∈ {1,...,n}}  K(xi, xi) − K_{xi,Z} K_{Z,Z}^{−1} K_{Z,xi} ;

update the set Z

    Z ← [Z, zk] .

A naive implementation is slow (O(j²n + j³) at every iteration). To get
a reasonable complexity, one has to use simple linear algebra tricks (see
next slide).
Julien Mairal (Inria) 517/564
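A hedged, deliberately naive sketch of this selection rule on a precomputed
Gram matrix; it recomputes K_{Z,Z}^{−1} from scratch at every step, which is
exactly the cost that the linear algebra tricks of the next slide avoid.

    import numpy as np

    def greedy_anchor_selection(K, d):
        # K: (n, n) precomputed Gram matrix; returns the indices of d anchor points
        n = K.shape[0]
        Z = []
        for _ in range(d):
            if not Z:
                residuals = np.diag(K).copy()
            else:
                KZZ_inv = np.linalg.inv(K[np.ix_(Z, Z)])
                KXZ = K[:, Z]
                # residual_i = K(x_i, x_i) - K_{x_i,Z} K_{Z,Z}^{-1} K_{Z,x_i}
                residuals = np.diag(K) - np.einsum('ij,jk,ik->i', KXZ, KZZ_inv, KXZ)
            residuals[Z] = -np.inf               # never pick the same point twice
            Z.append(int(np.argmax(residuals)))
        return Z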


Nyström approximations (13/14)
    K_{[Z,z],[Z,z]}^{−1} = [ K_{Z,Z}  K_{Z,z} ]^{−1} = [ K_{Z,Z}^{−1} + (1/s) b b^T    −(1/s) b ]
                           [ K_{z,Z}  K_{z,z} ]        [ −(1/s) b^T                    1/s      ] ,

where s is the Schur complement s = K_{z,z} − K_{z,Z} K_{Z,Z}^{−1} K_{Z,z} , and
b = K_{Z,Z}^{−1} K_{Z,z} .
the matrix K_{[Z,z],[Z,z]}^{−1} can be obtained from K_{Z,Z}^{−1} and K_{Z,z} in
O(j²) float operations; for that we need to always keep in
memory the j × n matrix K_{Z,X} .
computing the matrix K_{[Z,z],X} from K_{Z,X} requires n kernel
evaluations;
the quantity K_{xi,[Z,z]} K_{[Z,z],[Z,z]}^{−1} K_{[Z,z],xi} can be computed from
K_{xi,Z} K_{Z,Z}^{−1} K_{Z,xi} in O(j) float operations.
The total training complexity is O(d²n) float operations and O(dn)
kernel evaluations.
Julien Mairal (Inria) 518/564
Nyström approximations (14/14)
Concluding remarks
The last technique is equivalent to computing an incomplete
Cholesky factorization of the kernel matrix (Bach and Jordan, 2002;
Fine and Scheinberg, 2001);
The techniques we have seen produce low-rank approximations of
the kernel matrix K ≈ L L^T ;
When X = R^d , it is also possible to synthesize training points
z1, . . . , zd and use anchor points ϕ(z1), . . . , ϕ(zd), e.g., with a
K-means algorithm.

Julien Mairal (Inria) 519/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Large-scale learning with linear models
Nyström approximations
Random Fourier features
New challenges
“Deep” learning with kernels

Julien Mairal (Inria) 520/564


Random Fourier features [Rahimi and Recht, 2007] (1/5)
A large class of approximations for shift-invariant kernels is based on
sampling techniques. Consider a real-valued positive-definite continuous
translation-invariant kernel K(x, y) = κ(x − y) with κ : R^d → R. Then,
if κ(0) = 1, Bochner's theorem tells us that κ is a valid characteristic
function for some probability measure:

    κ(z) = E_w[ e^{i w^T z} ] .

Remember indeed that, with the right assumptions on κ,

    κ(x − y) = (1/(2π)^d) ∫_{R^d} κ̂(w) e^{i w^T x} e^{−i w^T y} dw ,

and the probability measure admits a density p(w) = κ̂(w)/(2π)^d
(non-negative, real-valued, integrating to 1 since κ(0) = 1).

Julien Mairal (Inria) 521/564


Random Fourier features (2/5)
Then,

    κ(x − y) = (1/(2π)^d) ∫_{R^d} κ̂(w) e^{i w^T x} e^{−i w^T y} dw
             = ∫_{R^d} p(w) cos(w^T x − w^T y) dw
             = ∫_{R^d} p(w) ( cos(w^T x) cos(w^T y) + sin(w^T x) sin(w^T y) ) dw
             = ∫_{R^d} ∫_0^{2π} (p(w)/(2π)) · 2 cos(w^T x + b) cos(w^T y + b) dw db    (exercise)
             = E_{w∼p(w), b∼U[0,2π]}[ √2 cos(w^T x + b) · √2 cos(w^T y + b) ] .

Julien Mairal (Inria) 522/564


Random Fourier features (3/5)
Random Fourier features recipe
Compute the Fourier transform κ̂ of the kernel and define the
probability density p(w) = κ̂(w)/(2π)^d ;
Draw d i.i.d. samples w1, . . . , wd from p and d i.i.d. samples
b1, . . . , bd from the uniform distribution on [0, 2π];
define the mapping

    x ↦ ψ(x) = √(2/d) [ cos(w1^T x + b1), . . . , cos(wd^T x + bd) ]^T .

Then, we have that

    κ(x − y) ≈ ⟨ψ(x), ψ(y)⟩_{R^d} .

The two quantities are equal in expectation.

Julien Mairal (Inria) 523/564
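A hedged sketch of the recipe for the Gaussian kernel κ(z) = exp(−‖z‖²/(2σ²)),
whose spectral density corresponds to w ∼ N(0, σ^{−2} I); the dimensions,
bandwidth and test points are illustrative choices.

    import numpy as np

    np.random.seed(0)
    m, d, sigma = 10, 2000, 1.5            # input dimension, number of features, bandwidth

    W = np.random.randn(d, m) / sigma      # w_j ~ N(0, sigma^{-2} I): density p(w) of the Gaussian kernel
    b = np.random.uniform(0, 2 * np.pi, d)

    def psi(X):
        # random Fourier feature map: sqrt(2/d) cos(W x + b), one row per point
        return np.sqrt(2.0 / d) * np.cos(X @ W.T + b)

    x = np.random.randn(m)
    y = x + 0.3 * np.random.randn(m)
    exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
    approx = (psi(x[None, :]) @ psi(y[None, :]).T)[0, 0]
    print(exact, approx)                   # close, up to an O(1/sqrt(d)) error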


Random Fourier features (4/5)
Theorem, [Rahimi and Recht, 2007]
On any compact subset X of R^m , for all ε > 0,

    P[ sup_{x,y ∈ X} |κ(x − y) − ⟨ψ(x), ψ(y)⟩_{R^d}| ≥ ε ]
        ≤ 2^8 ( σp diam(X) / ε )² e^{−dε²/(4(m+2))} ,

where σp² = E_{w∼p(w)}[w^T w] is the second moment of the Fourier
transform of κ.

Remarks
The convergence is uniform, not data dependent;
Take the sequence εd = √(log(d)/d) · σp diam(X); then the term on the
right converges to zero when d grows to infinity;
Prediction functions with random Fourier features are not in H.

Julien Mairal (Inria) 524/564


Random Fourier features (5/5)
Ingredients of the proof
For a fixed pair of points x, y, Hoeffding’s inequality says that

    P[ f(x, y) ≥ ε ] ≤ 2 e^{−dε²/4} ,   where f(x, y) := |κ(x − y) − ⟨ψ(x), ψ(y)⟩_{R^d}| .

Consider a net (set of balls of radius r) that covers
X_Δ = {x − y : x, y ∈ X } with at most T = (4 diam(X)/r)^m balls.
Apply Hoeffding’s inequality to the centers xi − yi of the balls;
Use a basic union bound

    P[ sup_i f(xi, yi) ≥ ε/2 ] ≤ Σ_i P[ f(xi, yi) ≥ ε/2 ] ≤ 2T e^{−dε²/8} .

Glue things together: control the probability for points (x, y) inside
each ball, and adjust the radius r (a bit technical).
Julien Mairal (Inria) 525/564
Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
Motivation
Large-scale learning with linear models
Nyström approximations
Random Fourier features
New challenges
“Deep” learning with kernels

Julien Mairal (Inria) 526/564


New challenges
We have seen two classes of kernel approximation techniques. Several
challenges remain
make random Fourier features data dependent (e.g., Bach, 2015);
make these approximation techniques data and task dependent;
reduce the number of dimensions;
find more explicit approximate feature maps dedicated to useful
kernels [e.g., Vedaldi and Zisserman, 2012].

Julien Mairal (Inria) 527/564


Outline

1 Kernels and RKHS

2 Kernel Methods: Supervised Learning

3 Kernel Methods: Unsupervised Learning

4 The Kernel Jungle

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels

Julien Mairal (Inria) 528/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels
Motivation
“Deep” feature maps
Convolutional kernel networks

Julien Mairal (Inria) 529/564


Deep learning with kernels
Main question
In some fields producing large amounts of labeled data (notably in
computer vision), kernel methods do not perform as well as
multilayer neural networks. Why? How can we improve kernel methods?

Possible angles of attack


are multilayer neural networks close to a kernel machine?
build multilayer kernels with successful principles from multilayer
neural networks (successful = “convolutional” or “recurrent”);
perform end-to-end learning with kernels (crafting the kernel).

Perspectives
build multilayer architectures that are easy to regularize and that
may work without (or with less) supervision.
build versatile architectures to process structured data.
Julien Mairal (Inria) 530/564
Classical criticisms of kernel methods
lack of adaptivity to data?
if necessary, use kernels for probabilistic models;
lack of adaptivity to the task (end-to-end learning)?
most critical point, important open problem;
kernel methods are glorified template matching algorithms?
irrelevant, only true for Gaussian kernel with σ too small;
f(x) = Σ_{i=1}^n α_i K(x_i, x)   ≈(???)   Σ_{i=1}^n [ y_i / Σ_{l=1}^n K(x_l, x) ] K(x_i, x).

The representer theorem simply tells us that the prediction function f
lies in a subspace spanned by the data (nothing to do with the
“template-matching” Nadaraya-Watson estimator on the right).
The α_i's do not have the same sign as the y_i's in general.
The theorem also applies to the last layer of neural networks...

Julien Mairal (Inria) 531/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels
Motivation
“Deep” feature maps
Convolutional kernel networks

Julien Mairal (Inria) 532/564


Links between kernels and neural networks
A large class of kernels on R^p may be defined as an expectation

K(x, y) = E_w[s(w^T x) s(w^T y)],

where s : R → R is a nonlinear function. Then, approximating the
expectation by a finite sum yields

K(x, y) ≈ (1/d) Σ_{j=1}^d s(w_j^T x) s(w_j^T y) = ⟨ψ(x), ψ(y)⟩_{R^d},

where ψ(x) may be interpreted as a one-layer neural network.

Example
Any shift-invariant kernel with random Fourier features!
ψ(x) = √(2/d) [cos(w_1^T x + b_1), . . . , cos(w_d^T x + b_d)]^T.
Julien Mairal (Inria) 533/564
Links between kernels and neural networks
A large class of kernels on R^p may be defined as an expectation

K(x, y) = E_w[s(w^T x) s(w^T y)],

where s : R → R is a nonlinear function.


Example
The Gaussian kernel on the hypersphere:
e^{−‖x−y‖²/(2σ²)} = (2/(πσ²))^{m/2} ∫_{w∈R^m} e^{−‖x−w‖²/σ²} e^{−‖y−w‖²/σ²} dw
                 = ∫_{w∈R^m} p(w) e^{−1/σ² + (2/σ²) w^T x} e^{−1/σ² + (2/σ²) w^T y} dw,

where p(w) is the density of the multivariate normal distribution
N(0, (σ²/4) I).
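As a quick sanity check of this identity (a Monte Carlo sketch, not part of the original slides): drawing w from N(0, (σ²/4) I) and averaging s(w^T x) s(w^T y) with s(u) = e^{−1/σ² + 2u/σ²} should recover the Gaussian kernel for unit-norm x and y.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 5, 1.5
x = rng.normal(size=m); x /= np.linalg.norm(x)   # x, y on the unit sphere
y = rng.normal(size=m); y /= np.linalg.norm(y)

W = rng.normal(scale=sigma / 2.0, size=(500_000, m))           # w ~ N(0, (sigma^2/4) I)
s = lambda u: np.exp(-1.0 / sigma ** 2 + 2.0 * u / sigma ** 2)
mc = np.mean(s(W @ x) * s(W @ y))                              # estimate of E_w[s(w'x) s(w'y)]
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
print(mc, exact)   # the two values agree up to Monte Carlo error
```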

Julien Mairal (Inria) 534/564


Links between kernels and neural networks
Example, arc-cosine kernels
Cho and Saul, 2009 have proposed a collection of kernels defined as

K(x, y) = 2 ∫_{w∈R^m} p(w) s(w^T x) s(w^T y) dw,

for x, y on the hypersphere S^{m−1}, where p(w) is the density of the
multivariate normal distribution N(0, I). Interestingly, the nonlinearities s
are typical ones from the neural network literature:
s(u) = max(0, u) (rectified linear units) leads to
K_1(x, y) = (1/π)[sin(θ) + (π − θ) cos(θ)] with θ = cos^{−1}(x^T y);
s(u) = max(0, u)² (squared rectified linear units) leads to
K_2(x, y) = (1/π)[3 sin(θ) cos(θ) + (π − θ)(1 + 2 cos²(θ))];
and there is also a general formula for s(u) = max(0, u)^p, with p ≥ 0.
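As a sketch (assuming the 1/π normalization of Cho and Saul, 2009), the closed form of the degree-1 arc-cosine kernel can be checked against a Monte Carlo estimate built from random ReLU features:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
x = rng.normal(size=m); x /= np.linalg.norm(x)   # x, y on the hypersphere
y = rng.normal(size=m); y /= np.linalg.norm(y)

theta = np.arccos(np.clip(x @ y, -1.0, 1.0))
K1_exact = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi   # closed form

W = rng.normal(size=(500_000, m))                 # w ~ N(0, I)
relu = lambda u: np.maximum(0.0, u)
K1_mc = 2.0 * np.mean(relu(W @ x) * relu(W @ y))  # 2 E_w[s(w'x) s(w'y)] with s = ReLU
print(K1_exact, K1_mc)                            # agree up to Monte Carlo error
```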

Julien Mairal (Inria) 535/564


Links between kernels and neural networks

[Figure: comparison of the arc-cosine kernels of degrees 1 and 2 with RBF
kernels (σ = 0.5 and σ = 1), plotted for u ∈ [−1, 1].]
Julien Mairal (Inria) 536/564
Links between kernels and neural networks
We have seen that some kernels admit an interpretation as one-layer
neural networks with random weights and an infinite number of neurons.

Another common feature between neural networks and kernel methods is
the composition of feature maps [Cho and Saul, 2009].

Consider kernels of the form

K_1(x, y) = κ(‖ϕ_0(x)‖_{H_0}, ‖ϕ_0(y)‖_{H_0}, ⟨ϕ_0(x), ϕ_0(y)⟩_{H_0}) = ⟨ϕ_1(x), ϕ_1(y)⟩_{H_1},

e.g., linear, polynomial, Gaussian, arc-cosine with ϕ_0(x) = x.

Then, it is easy to obtain a new kernel K_2 by composition:

K_2(x, y) = κ(‖ϕ_1(x)‖_{H_1}, ‖ϕ_1(y)‖_{H_1}, ⟨ϕ_1(x), ϕ_1(y)⟩_{H_1}) = ⟨ϕ_2(x), ϕ_2(y)⟩_{H_2},

and recursively build multilayer kernels.
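At the level of Gram matrices this recursion is straightforward to implement, since ‖ϕ_k(x)‖² = K_k(x, x) and ⟨ϕ_k(x), ϕ_k(y)⟩ = K_k(x, y). Below is a small sketch with a Gaussian-type κ (one illustrative choice of κ among many):

```python
import numpy as np

def compose(K_prev, sigma):
    """One composition step: Gaussian kernel between the features of the previous layer."""
    d = np.diag(K_prev)                               # ||phi(x_i)||^2 = K_prev(x_i, x_i)
    sq_dist = d[:, None] + d[None, :] - 2.0 * K_prev  # ||phi(x_i) - phi(x_j)||^2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
K1 = compose(X @ X.T, sigma=1.0)   # layer 1: Gaussian kernel on the raw inputs (phi_0(x) = x)
K2 = compose(K1, sigma=1.0)        # layer 2: Gaussian kernel between the phi_1 features
print(K2.shape)                    # (6, 6); still a valid positive-definite Gram matrix
```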

Julien Mairal (Inria) 537/564


Outline

5 Open Problems and Research Topics


Multiple Kernel Learning (MKL)
Large-scale learning with kernels
“Deep” learning with kernels
Motivation
“Deep” feature maps
Convolutional kernel networks

Julien Mairal (Inria) 538/564


Motivation
We have made explicit some links between neural networks
(approximation by linear operations followed by pointwise non-linearities,
and composition of feature maps leading to multilayer kernels).

However, one important ingredient in the kernel world is still missing:


The main deep learning success, convolutional neural networks, is able to

learn local structures in images (local stationarity);


learn how to combine these local structures into mid and high-level
ones (spatial composition).

From a tutorial of Y. LeCun, quoting Stuart Geman: “the world is
compositional or there is a God”.

Julien Mairal (Inria) 539/564


Motivation

Figure : Picture from Yann LeCun's tutorial, based on [Zeiler and Fergus, 2013].

Julien Mairal (Inria) 540/564


Convolutional kernel networks
A few words about convolutional kernel networks [Mairal et al., 2014]
Unsupervised representation of images based on a multilayer kernel,
along with a finite-dimensional embedding ψ, which is a new type
of convolutional neural network;
State-of-the-art results for image retrieval [Paulin et al., 2016];
New principles to perform end-to-end supervised learning with
multilayer kernels (not yet published).

Julien Mairal (Inria) 541/564


Convolutional kernel networks

[Figure: multilayer construction of a convolutional kernel network — a feature
map ϕ_0 : Ω_0 → H_0 is encoded into ϕ_1 : Ω_1 → H_1 by mapping each patch
{z_1} + P_1 of ϕ_0 to a point ϕ_1(z_1) ∈ H_1, and similarly ϕ_2 : Ω_2 → H_2 is
built from patches {z_2} + P_2 of ϕ_1.]

Julien Mairal (Inria) 542/564


Main properties of CKNs
CKNs are organized in a multi-layer fashion.
Each layer produces an image feature map.
An image feature map ϕ is a function ϕ : Ω → H, where Ω ⊆ [0, 1]²
is a set of “coordinates” and H is a Hilbert space.
Concretely, these are similar to feature maps of CNNs.
Each layer defines a kernel between patches of the previous layer.
The approximation scheme requires learning each layer sequentially,
and can be interpreted as a CNN layer with a different objective.

Julien Mairal (Inria) 543/564


Image feature maps and convolutional kernels
An image feature map ϕ is a function ϕ : Ω → H, where Ω ⊆ [0, 1]² is a
set of “coordinates” in the image and H is a Hilbert space.
It is possible to define a convolutional kernel between ϕ and ϕ′:

K(ϕ, ϕ′) := Σ_{z∈Ω} Σ_{z′∈Ω} ‖ϕ(z)‖_H ‖ϕ′(z′)‖_H e^{−‖z−z′‖²/(2β²)} e^{−‖ϕ̃(z)−ϕ̃′(z′)‖²_H/(2σ²)},

where ϕ̃(z) denotes the normalized feature ϕ(z)/‖ϕ(z)‖_H (and 0 when ϕ(z) = 0).
when β is large, K is invariant to the positions z and z′;
when β is small, only features placed at the same location z = z′
are compared to each other.

The kernel is inspired from the kernel descriptors of Bo et al., 2011.
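For illustration, here is a brute-force numpy evaluation of this kernel for finite-dimensional feature maps (a sketch with made-up names, not an efficient or official implementation):

```python
import numpy as np

def conv_kernel(coords1, feats1, coords2, feats2, beta, sigma, eps=1e-12):
    """K(phi, phi') for maps given by coordinates (|Omega| x 2) and features (|Omega| x dim)."""
    n1 = np.linalg.norm(feats1, axis=1)                     # ||phi(z)||
    n2 = np.linalg.norm(feats2, axis=1)
    t1 = feats1 / np.maximum(n1, eps)[:, None]              # normalized features phi_tilde
    t2 = feats2 / np.maximum(n2, eps)[:, None]
    dz = ((coords1[:, None, :] - coords2[None, :, :]) ** 2).sum(-1)   # ||z - z'||^2
    df = ((t1[:, None, :] - t2[None, :, :]) ** 2).sum(-1)             # ||phi_tilde - phi_tilde'||^2
    w = n1[:, None] * n2[None, :]
    return np.sum(w * np.exp(-dz / (2 * beta ** 2)) * np.exp(-df / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
coords = rng.uniform(size=(25, 2))                           # coordinates in [0, 1]^2
f1, f2 = rng.normal(size=(25, 2)), rng.normal(size=(25, 2))  # two feature maps with H = R^2
print(conv_kernel(coords, f1, coords, f2, beta=0.3, sigma=1.0))
```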

Julien Mairal (Inria) 544/564


Image feature maps and convolutional kernels
An image feature map ϕ is a function ϕ : Ω → H, where Ω ⊆ [0, 1]² is a
set of “coordinates” in the image and H is a Hilbert space.
It is possible to define a convolutional kernel between ϕ and ϕ′:

K(ϕ, ϕ′) := Σ_{z∈Ω} Σ_{z′∈Ω} ‖ϕ(z)‖_H ‖ϕ′(z′)‖_H e^{−‖z−z′‖²/(2β²)} e^{−‖ϕ̃(z)−ϕ̃′(z′)‖²_H/(2σ²)}.

The kernel can be defined on patches:

Σ_{z∈P} Σ_{z′∈P} ‖ϕ(u + z)‖_H ‖ϕ′(u′ + z′)‖_H e^{−‖z−z′‖²/(2β²)} e^{−‖ϕ̃(u+z)−ϕ̃′(u′+z′)‖²_H/(2σ²)},

where P is a patch shape and u, u′ are locations in Ω.

Julien Mairal (Inria) 544/564


Zoom on the zero-th layer
Before we build a hierarchy, we can specify two simple zero-th layer
feature maps ϕ0 .
Gradient map
H_0 = R² and ϕ_0(z) is the two-dimensional gradient of the image at
pixel z. Then, the quantity ‖ϕ_0(z)‖_{H_0} is the gradient intensity,
and ϕ̃_0(z) is its orientation [cos(θ), sin(θ)].

Patch map
ϕ_0 associates to a location z an image patch of size m × m centered
at z. Then, H_0 = R^{m²}, and ϕ̃_0(z) is a contrast-normalized version of
the patch.
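A minimal numpy sketch of the gradient map (the finite-difference gradient below is an illustrative choice; the slides do not specify the discrete operator):

```python
import numpy as np

def gradient_feature_map(image):
    """Return ||phi_0(z)|| (gradient intensity) and phi_0_tilde(z) = [cos(theta), sin(theta)]."""
    gy, gx = np.gradient(image.astype(np.float64))   # finite-difference image gradients
    intensity = np.sqrt(gx ** 2 + gy ** 2)           # ||phi_0(z)||_{H_0}
    theta = np.arctan2(gy, gx)                       # gradient orientation
    orientation = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    return intensity, orientation

img = np.random.default_rng(0).random((16, 16))
norms, orients = gradient_feature_map(img)
print(norms.shape, orients.shape)                    # (16, 16) and (16, 16, 2)
```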

Julien Mairal (Inria) 545/564


Multilayer kernels
Let us consider a set of coordinates Ωk–1 and a Hilbert space Hk–1 . We
build a new set Ωk and a new Hilbert space Hk as follows:
choose a patch shape Pk and a set of coordinates Ωk such that
each zk in Ωk corresponds to a patch in Ωk–1 centered at zk .
call Kk the kernel of the previous slide on the “patch” feature
maps Pk → Hk–1 (with parameters βk , σk ). We denote by Hk the
Hilbert space for which the p.d. kernel Kk is reproducing.

An image represented by a feature map ϕk–1 : Ωk–1 → Hk–1 at layer k–1


is now encoded in the k-th layer as ϕk : Ωk → Hk , where ϕk (zk ) is the
representation in Hk of the patch of ϕk–1 centered at zk .

Julien Mairal (Inria) 546/564


Convolutional kernel networks

[Figure (recap): the same multilayer construction — ϕ_0 : Ω_0 → H_0 is encoded
into ϕ_1 : Ω_1 → H_1 from patches {z_1} + P_1, then into ϕ_2 : Ω_2 → H_2 from
patches {z_2} + P_2.]

Julien Mairal (Inria) 547/564


Optimization
Key approximation
When x and y are on the sphere,

e^{−‖x−y‖²/(2α²)} = E_{z∼p(z)}[s(z^T x) s(z^T y)],

where s(u) ∝ e^{−1/α² + 2u/α²} and p(z) is the density of the multivariate
normal distribution N(0, (α²/4) I). Then,

e^{−‖x−y‖²/(2α²)} ≈ (1/p) Σ_{j=1}^p η_j s(z_j^T x) s(z_j^T y).

Instead of random sampling, the z_j's and η_j's are learned on training data:

min_{Z,η} Σ_{i=1}^n ( e^{−‖x_i−y_i‖²/(2α²)} − (1/p) Σ_{j=1}^p η_j s(z_j^T x_i) s(z_j^T y_i) )².
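A small sketch of this layer-wise training step (the tiny dimensions, random pairs, and scipy's L-BFGS with numerical gradients are illustrative stand-ins, not the actual solver):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m, p, alpha = 200, 5, 10, 1.0
X = rng.normal(size=(n, m)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # pairs (x_i, y_i)
Y = rng.normal(size=(n, m)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # on the sphere
target = np.exp(-((X - Y) ** 2).sum(1) / (2 * alpha ** 2))                  # exact kernel values

s = lambda u: np.exp(-1.0 / alpha ** 2 + 2.0 * u / alpha ** 2)

def objective(params):
    Z = params[: p * m].reshape(p, m)
    eta = params[p * m:]
    approx = (s(X @ Z.T) * s(Y @ Z.T)) @ eta / p   # (1/p) sum_j eta_j s(z_j'x_i) s(z_j'y_i)
    return ((target - approx) ** 2).sum()

init = np.concatenate([rng.normal(scale=alpha / 2, size=p * m), np.ones(p)])
res = minimize(objective, init, method="L-BFGS-B", options={"maxiter": 200})
print(objective(init), res.fun)   # the training objective decreases
```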

Julien Mairal (Inria) 548/564


Approximation principles
We proceed by recursion, with the approximation holding for k = 0.

Main ingredients for approximating K(ϕ_{k–1}, ϕ′_{k–1}):

replace ϕ_{k–1} by its finite-dimensional approximation ψ_{k–1}:

K(ϕ_{k–1}, ϕ′_{k–1}) ≈ Σ_{z,z′∈Ω_{k–1}} ‖ψ_{k–1}(z)‖_2 ‖ψ′_{k–1}(z′)‖_2 e^{−‖z−z′‖²/(2β_k²)} e^{−‖ψ̃_{k–1}(z)−ψ̃′_{k–1}(z′)‖²/(2σ_k²)};

use the finite-dimensional approximation of the Gaussian kernel:

≈ Σ_{z,z′∈Ω_{k–1}} ζ_k(z)^T ζ′_k(z′) e^{−‖z−z′‖²/(2β_k²)};

approximate the remaining Gaussian kernel:

≈ (2/π) Σ_{u∈Ω_k} ( Σ_{z∈Ω_{k–1}} e^{−‖z−u‖²/β_k²} ζ_k(z) )^T ( Σ_{z′∈Ω_{k–1}} e^{−‖z′−u‖²/β_k²} ζ′_k(z′) ).

Julien Mairal (Inria) 549/564


Zoom between layers k–1 and k

[Figure: zoom between layers k–1 and k — patches {z_{k–1}} + P′_{k–1} are
extracted from the map ψ_{k–1} on Ω′_{k–1}, a convolution + non-linearity
produces ζ_k(z_{k–1}) on Ω_{k–1}, and Gaussian filtering + downsampling
(pooling) produces ξ_k(z) on Ω′_k.]
Julien Mairal (Inria) 550/564
Application to image retrieval
Encoding of interest points with CKN + VLAD.
Possible inputs:

Results (mAP or true positives in top-4 for UKB)


Method \ Dataset Holidays UKB Oxford
VLAD+SIFT [Jegou et al., 2012] 63.4 3.47 -
VLAD++ [Arandjelovic and Zisserman, 2013] 64.6 - 55.5
CNN [Babenko et al., 2014] 79.3 3.56 54.5
CNN2 [Gong et al., 2014] 80.2 - -
Sum-pooling VGG [Babenko et al., 2015] 80.2 3.65 53.1
Ours (vanilla, high-dimensional) 79.3 3.76 49.8
Ours + PCA 4096 + whitening 82.9 3.77 47.2

Julien Mairal (Inria) 551/564


What about image classification?
A first proof of concept was evaluated on classical “deep learning”
datasets, without data augmentation or data pre-processing.

Tr. size   CNN [25]   Scat-1 [8]   Scat-2 [8]   CKN-GM1 (12/50)   CKN-GM2 (12/400)   CKN-PM1 (200)   CKN-PM2 (50/200)   [32] [18] [19]
300        7.18       4.7          5.6          4.39              4.24               5.98            4.15               NA
1K         3.21       2.3          2.6          2.60              2.05               3.23            2.76               NA
2K         2.53       1.3          1.8          1.85              1.51               1.97            2.28               NA
5K         1.52       1.03         1.4          1.41              1.21               1.41            1.56               NA
10K        0.85       0.88         1            1.17              0.88               1.18            1.10               NA
20K        0.76       0.79         0.58         0.89              0.60               0.83            0.77               NA
40K        0.65       0.74         0.53         0.68              0.51               0.64            0.58               NA
60K        0.53       0.70         0.4          0.58              0.39               0.63            0.53               0.47 / 0.45 / 0.53

Table : Test error in % for various approaches on the MNIST dataset.


Method [12] [27] [18] [13] [4] [17] [32] CKN-GM CKN-PM CKN-CO
CIFAR-10 82.0 82.2 88.32 79.6 NA 83.96 84.87 74.84 78.30 82.18
STL-10 60.1 58.7 NA 51.5 64.5 62.3 NA 60.04 60.25 62.32

Table : Classification accuracy in % on CIFAR-10 and STL-10.

Julien Mairal (Inria) 552/564


Current Perspectives
Engineering effort helps
higher (huge)-dimensional models may be learned; they give about
86% on CIFAR-10 (≈ 88% with data augmentation);

Supervision helps
preliminary supervised models are already close to 90% (single
model, no data augmentation);

Future challenges
video data;
structured data, sequences, graphs;
theory and faster algorithms;
complete the work on supervision.

Julien Mairal (Inria) 553/564


Conclusion of the course

Julien Mairal (Inria) 554/564


What we saw
Basic definitions of p.d. kernels and RKHS
How to use RKHS in machine learning
The importance of the choice of kernels, and how to include “prior
knowledge” there.
Several approaches for kernel design (there are many!)
Review of kernels for strings and on graphs
Recent research topics about kernel methods

Julien Mairal (Inria) 555/564


What we did not see

How to automate the process of kernel design (kernel selection?
kernel optimization?)
How to deal with non-p.d. kernels
The Bayesian view of kernel methods, known as Gaussian processes.

Julien Mairal (Inria) 556/564


References I
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337 – 404, 1950.
URL http://www.jstor.org/stable/1990404.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In Proceedings of the Twenty-First International Conference on
Machine Learning, page 6, New York, NY, USA, 2004. ACM. doi:
http://doi.acm.org/10.1145/1015330.1015424.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups.
Springer-Verlag, New-York, 1984.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM ’05:
Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81,
Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi:
http://dx.doi.org/10.1109/ICDM.2005.132.
M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Network., 18(4):
1111–1123, 2005. doi: 10.1016/j.neunet.2005.07.010. URL
http://dx.doi.org/10.1016/j.neunet.2005.07.010.
M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn.
Res., 6:1169–1198, 2005. URL
http://jmlr.csail.mit.edu/papers/v6/cuturi05a.html.

Julien Mairal (Inria) 557/564


References II
T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient
alternatives. In B. Schölkopf and M. Warmuth, editors, Proceedings of the Sixteenth
Annual Conference on Computational Learning Theory and the Seventh Annual Workshop
on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129–143,
Heidelberg, 2003. Springer. doi: 10.1007/b12006. URL
http://dx.doi.org/10.1007/b12006.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2007), pages 1–8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL
http://dx.doi.org/10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning
techniques for the identification of mutagenicity inducing substructures and structure
activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):
1402–11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.
T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting
Remote Protein Homologies. J. Comput. Biol., 7(1,2):95–114, 2000. URL
http://www.cse.ucsc.edu/research/compbio/discriminative/Jaakola2-1998.ps.

Julien Mairal (Inria) 558/564


References III
T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In
Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999.
URL http://www.cse.ucsc.edu/research/ml/papers/Jaakola.ps.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In
T. Faucett and N. Mishra, editors, Proceedings of the Twentieth International Conference
on Machine Learning, pages 321–328, New York, NY, USA, 2003. AAAI Press.
T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In
R. Lathtop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome
Informatics 2002, pages 112–122. Universal Academic Press, 2002. URL
http://www.jsbi.org/journal/GIW02/GIW02F012.html.
R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input. In
Proceedings of the Nineteenth International Conference on Machine Learning, pages
315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel
matrix with semidefinite programming. J. Mach. Learn. Res., 5:27–72, 2004a. URL
http://www.jmlr.org/papers/v5/lanckriet04a.html.
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical
framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004b. doi:
10.1093/bioinformatics/bth294. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/16/2626.

Julien Mairal (Inria) 559/564


References IV
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J.
Mach. Learn. Res., 5:1435–1455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein
classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein,
editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564–575,
Singapore, 2002. World Scientific.
L. Liao and W. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines
for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol.,
10(6):857–868, 2003. URL
http://www.liebertonline.com/doi/abs/10.1089/106652703322756113.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification
using string kernels. J. Mach. Learn. Res., 2:419–444, 2002. URL
http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology
Detection. Technical Report CRL 2001/05, Compaq Cambridge Research laboratory, June
2001.
P. Mahé and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75
(1):3–35, 2009. doi: 10.1007/s10994-008-5086-2. URL
http://dx.doi.org/10.1007/s10994-008-5086-2.

Julien Mairal (Inria) 560/564


References V
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph
kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First
International Conference on Machine Learning (ICML 2004), pages 552–559. ACM Press,
2004.
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular
structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model.,
45(4):939–51, 2005. doi: 10.1021/ci050039t. URL
http://dx.doi.org/10.1021/ci050039t.
C. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach. Learn.
Res., 6:1099–1125, 2005. URL http://jmlr.org/papers/v6/micchelli05a.html.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic
Publishers, 2004.
A. Nicholls. Oechem, version 1.3.4, openeye scientific software. website, 2005.
F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res.,
9:2491–2521, 2008. URL http://jmlr.org/papers/v9/rakotomamonjy08a.html.
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and
L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs,
Trees and Sequences, pages 65–74, 2003.

Julien Mairal (Inria) 561/564


References VI
F. Rapaport, A. Zynoviev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray
data using gene networks. BMC Bioinformatics, 8:35, 2007. doi:
10.1186/1471-2105-8-35. URL http://dx.doi.org/10.1186/1471-2105-8-35.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string
alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002. URL
http://www.learning-with-kernels.org.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT
Press, The MIT Press, Cambridge, Massachussetts, 2004.
M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform.
Process. Syst., volume 14, pages 905–912, 2002.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA, 2004a.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge
University Press, 2004b.
N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural
Information Processing Systems, pages 1660–1668, 2009.

Julien Mairal (Inria) 562/564


References VII
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient
graphlet kernels for large graph comparison. In 12th International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida USA,
2009. Society for Artificial Intelligence and Statistics.
N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-lehman graph kernels. The Journal of Machine Learning Research, 12:
2539–2561, 2011.
T. Smith and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol.,
147:195–197, 1981.
K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative
kernel from probabilistic models. Neural Computation, 14(10):2397–2414, 2002a. doi:
10.1162/08997660260293274. URL http://dx.doi.org/10.1162/08997660260293274.
K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences.
Bioinformatics, 18:S268–S275, 2002b.
V. N. Vapnik. Statistical Learning Theory. Wiley, New-York, 1998.
A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In
B. Schölkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology,
pages 131–154. MIT Press, The MIT Press, Cambridge, Massachussetts, 2004.

Julien Mairal (Inria) 563/564


References VIII
J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss,
B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages
1401–1408, Cambridge, MA, 2006. MIT Press.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra
arising during this reduction. Nauchno-Technicheskaya Informatsia, Ser. 2, 9, 1968.
Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic
data: a supervised approach. Bioinformatics, 20:i363–i370, 2004. URL
http://bioinformatics.oupjournals.org/cgi/reprint/19/suppl_1/i323.

Julien Mairal (Inria) 564/564
