Machine Learning With Kernel Methods
2 / 635
Or
3 / 635
But real data are often more complicated...
4 / 635
Main goal of this course
Extend
well-understood, linear statistical learning techniques
to
real-world, complicated, structured, high-dimensional data
based on
a rigorous mathematical framework
leading to
practical modelling tools and algorithms
5 / 635
Organization of the course
Contents
1 Present the basic mathematical theory of kernel methods.
2 Introduce algorithms for supervised and unsupervised machine
learning with kernels.
3 Develop a working knowledge of kernel engineering for specific data
and applications (graphs, biological sequences, images).
4 Discuss open research topics related to kernels such as large-scale
learning with kernels and deep kernel learning.
Practical
Course homepage with slides, schedules, homework etc...:
http://cbio.mines-paristech.fr/~jvert/svn/kernelcourse/course/2017mva
Evaluation: 60% homework (3)+ 40% data challenge.
6 / 635
Outline
1 Kernels and RKHS
Positive Definite Kernels
Reproducing Kernel Hilbert Spaces (RKHS)
Examples
Smoothness functional
2 Kernel tricks
The kernel trick
The representer theorem
3 Kernel Methods: Supervised Learning
4 Kernel Methods: Unsupervised Learning
Kernel K-means and spectral clustering
Kernel PCA
A quick note on kernel CCA
8 / 635
Part 1
9 / 635
Overview
Motivations
Develop versatile algorithms to process and analyze data...
...without making any assumptions regarding the type of data
(vectors, strings, graphs, images, ...)
The approach
Develop methods based on pairwise comparisons.
By imposing constraints on the pairwise comparison function
(positive definite kernels), we obtain a general framework for
learning from data (optimization in RKHS).
10 / 635
Representation by pairwise comparisons
Example: S = (aatcgagtcac, atggacgtct, tgcactact) is represented by the similarity matrix
K = [ 1 0.5 0.3 ; 0.5 1 0.6 ; 0.3 0.6 1 ]
Idea
Define a comparison function K : X × X → R.
Represent a set of n data points S = {x1, x2, ..., xn} by the n × n matrix:
[K]ij := K(xi, xj) .
12 / 635
Representation by pairwise comparisons
Remarks
K is always an n n matrix, whatever the nature of data: the same
algorithm will work for any type of data (vectors, strings, ...).
Total modularity between the choice of function K and the choice of
the algorithm.
Poor scalability with respect to the dataset size (n2 to compute and
store K)... but wait until the end of the course to see how to deal
with large-scale problems
We will restrict ourselves to a particular class of pairwise comparison
functions.
13 / 635
Positive Definite (p.d.) Kernels
Definition
A positive definite (p.d.) kernel on a set X is a function K : X × X → R that is symmetric:
∀(x, x′) ∈ X², K(x, x′) = K(x′, x),
and which satisfies, for any N ∈ N, any (x1, ..., xN) ∈ X^N and any (a1, ..., aN) ∈ R^N:
∑_{i=1}^N ∑_{j=1}^N ai aj K(xi, xj) ≥ 0 .
14 / 635
Similarity matrices of p.d. kernels
Remarks
Equivalently, a kernel K is p.d. if and only if, for any N ∈ N and any set of points (x1, x2, ..., xN) ∈ X^N, the similarity matrix [K]ij := K(xi, xj) is positive semidefinite.
Kernel methods are algorithms that take such matrices as input.
15 / 635
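As a minimal numerical sketch of this equivalence (assuming Python with NumPy; the Gaussian kernel and the toy points are arbitrary choices, not prescribed by the slides), one can build the similarity matrix of a few points and check that its eigenvalues are nonnegative:

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), a p.d. kernel on R^d
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def gram_matrix(K, points):
    # Similarity matrix [K]_ij = K(x_i, x_j) for any comparison function K
    n = len(points)
    return np.array([[K(points[i], points[j]) for j in range(n)] for i in range(n)])

X = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([-2.0, 1.0])]
G = gram_matrix(gaussian_kernel, X)
# A p.d. kernel yields a positive semidefinite matrix: all eigenvalues >= 0 (up to rounding)
print(np.linalg.eigvalsh(G))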
The simplest p.d. kernel, for real numbers
Lemma
Let X = R. The function K : R² → R defined by K(x, x′) = x x′ for all (x, x′) ∈ R² is p.d.
Proof:
symmetry: x x′ = x′ x ;
positivity: ∑_{i=1}^N ∑_{j=1}^N ai aj xi xj = ( ∑_{i=1}^N ai xi )² ≥ 0 .
16 / 635
The simplest p.d. kernel, for vectors
Lemma
Let X = R^d. The function K : X² → R defined by K(x, x′) = ⟨x, x′⟩_{R^d} for all (x, x′) ∈ X² is p.d.
17 / 635
A more ambitious p.d. kernel
Lemma
Let X be any set, and Φ : X → R^d. Then the function K : X² → R defined as follows is p.d.:
∀(x, x′) ∈ X², K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{R^d} .
Proof:
symmetry: ⟨Φ(x), Φ(x′)⟩_{R^d} = ⟨Φ(x′), Φ(x)⟩_{R^d} ;
positivity: ∑_{i=1}^N ∑_{j=1}^N ai aj ⟨Φ(xi), Φ(xj)⟩_{R^d} = ‖ ∑_{i=1}^N ai Φ(xi) ‖²_{R^d} ≥ 0 .
18 / 635
Example: polynomial kernel
For x = (x1, x2)^T ∈ R², let Φ(x) = (x1², √2 x1 x2, x2²) ∈ R³. Then for any x, x′ ∈ R²:
K(x, x′) = ⟨Φ(x), Φ(x′)⟩_{R³} = (x1 x1′ + x2 x2′)² = ⟨x, x′⟩²_{R²} ,
so this kernel is p.d. (it is an inner product after the mapping Φ into the feature space F = R³).
20 / 635
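A quick numerical check of this identity, as a sketch in Python/NumPy (the test points are arbitrary):

import numpy as np

def phi(x):
    # feature map of the degree-2 polynomial kernel on R^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, -2.0])
xp = np.array([0.5, 3.0])
lhs = phi(x) @ phi(xp)   # inner product in the feature space R^3
rhs = (x @ xp) ** 2      # kernel evaluation <x, x'>^2 in the input space R^2
print(lhs, rhs)          # identical up to rounding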
In case of ...
Definitions
An inner product on an R-vector space H is a mapping (f, g) ↦ ⟨f, g⟩_H from H² to R that is bilinear, symmetric and such that ⟨f, f⟩_H > 0 for all f ∈ H\{0}.
A vector space endowed with an inner product is called pre-Hilbert. It is endowed with the norm ‖f‖_H = ⟨f, f⟩_H^{1/2}.
A Cauchy sequence (fn)_{n≥0} is a sequence whose elements become progressively arbitrarily close to each other: ∀ε > 0, ∃N ∈ N such that ‖fn − fm‖_H < ε for all n, m ≥ N.
A Hilbert space is a pre-Hilbert space that is complete, i.e., in which every Cauchy sequence converges.
In the finite case, the similarity matrix can be diagonalized as K = ∑_{l=1}^N λl ul ul^T with λl ≥ 0, which gives K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ with
Φ(xi) = ( √λ1 [u1]_i , ... , √λN [uN]_i )^T .
22 / 635
Proof: general case
23 / 635
Reproducing Kernel Hilbert Spaces (RKHS)
Definition
Let X be a set and H ⊂ R^X a Hilbert space of functions mapping X to R. The function K : X × X → R is called a reproducing kernel (r.k.) of H if:
1 H contains all functions of the form Kx : t ↦ K(x, t), for x ∈ X ;
2 for every x ∈ X and f ∈ H, the reproducing property holds: f(x) = ⟨f, Kx⟩_H .
If a r.k. exists, then H is called a reproducing kernel Hilbert space (RKHS).
26 / 635
An equivalent definition of RKHS
Theorem
The Hilbert space H RX is a RKHS if and only if for any x X , the
mapping:
F : H R
f 7 f (x)
is continuous.
Corollary
Convergence in a RKHS implies pointwise convergence, i.e., if (fn )nN
converges to f in H, then (fn (x))nN converges to f (x) for any x X .
26 / 635
Proof
If H is a RKHS then f ↦ f(x) is continuous
If a r.k. K exists, then for any (x, f) ∈ X × H:
|f(x)| = |⟨f, Kx⟩_H| ≤ ‖f‖_H ‖Kx‖_H (Cauchy-Schwarz) = ‖f‖_H K(x, x)^{1/2} ,
so the linear form f ↦ f(x) is continuous.
27 / 635
Proof (Converse)
If f 7 f (x) is continuous then H is a RKHS
Conversely, let us assume that for any x ∈ X the linear form f ∈ H ↦ f(x) is continuous.
Then by the Riesz representation theorem (a general property of Hilbert spaces) there exists a unique gx ∈ H such that:
f(x) = ⟨f, gx⟩_H .
The function K(x, y) := gx(y) is then a r.k. of H.
28 / 635
Unicity of r.k. and RKHS
Theorem
If H is a RKHS, then it has a unique r.k.
Conversely, a function K can be the r.k. of at most one RKHS.
Consequence
This shows that we can talk of the kernel of a RKHS, or the RKHS
of a kernel.
29 / 635
Proof
If a r.k. exists then it is unique
Let K and K′ be two r.k. of a RKHS H. Then for any x ∈ X:
‖Kx − K′x‖²_H = ⟨Kx − K′x, Kx⟩_H − ⟨Kx − K′x, K′x⟩_H = 0 ,
by the reproducing property of K and K′. This shows that Kx = K′x as functions, i.e., Kx(y) = K′x(y) for any y ∈ X. In other words, K = K′.
30 / 635
An important result
Theorem
A function K : X X R is p.d. if and only if it is a r.k.
31 / 635
Proof
A r.k. is p.d.
1 A r.k. is symmetric because, for any (x, y) ∈ X²:
K(x, y) = ⟨Kx, Ky⟩_H = ⟨Ky, Kx⟩_H = K(y, x) .
2 It is p.d. because, for any x1, ..., xN ∈ X and a1, ..., aN ∈ R:
∑_{i,j=1}^N ai aj K(xi, xj) = ‖ ∑_{i=1}^N ai Kxi ‖²_H ≥ 0 .
32 / 635
Proof
A p.d. kernel is a r.k. (1/4)
Let H0 be the vector subspace of RX spanned by the functions
{Kx }xX .
For any f, g ∈ H0, given by:
f = ∑_{i=1}^m ai Kxi ,   g = ∑_{j=1}^n bj Kyj ,
let:
⟨f, g⟩_{H0} := ∑_{i,j} ai bj K(xi, yj) .
33 / 635
Proof
A p.d. kernel is a r.k. (2/4)
⟨f, g⟩_{H0} does not depend on the chosen expansions of f and g because:
⟨f, g⟩_{H0} = ∑_{i=1}^m ai g(xi) = ∑_{j=1}^n bj f(yj) .
In particular, for any x ∈ X and f ∈ H0:
⟨f, Kx⟩_{H0} = f(x) .
34 / 635
Proof
A p.d. kernel is a r.k. (3/4)
K is assumed to be p.d., therefore:
‖f‖²_{H0} = ∑_{i,j=1}^m ai aj K(xi, xj) ≥ 0 .
Moreover, if ‖f‖_{H0} = 0 then for any x ∈ X, |f(x)| = |⟨f, Kx⟩_{H0}| ≤ ‖f‖_{H0} ‖Kx‖_{H0} = 0 (Cauchy-Schwarz), therefore ‖f‖_{H0} = 0 ⟹ f = 0.
H0 is therefore a pre-Hilbert space endowed with the inner product
h., .iH0 .
35 / 635
Proof
A p.d. kernel is a r.k. (4/4)
For any Cauchy sequence (fn)_{n≥0} in (H0, ⟨·,·⟩_{H0}), we note that:
∀(x, m, n) ∈ X × N², |fm(x) − fn(x)| ≤ ‖fm − fn‖_{H0} · K(x, x)^{1/2} .
Therefore for any x the sequence (fn (x))n0 is Cauchy in R and has
therefore a limit.
If we add to H0 the functions defined as the pointwise limits of
Cauchy sequences, then the space becomes complete and is
therefore a Hilbert space, with K as r.k. (up to a few technicalities,
left as exercise).
36 / 635
Application: back to Aronszajn's theorem
Theorem (Aronszajn, 1950)
K is a p.d. kernel on the set X if and only if there exists a Hilbert space H and a mapping
Φ : X → H ,
such that, for any x, x′ in X:
K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H .
37 / 635
Proof of Aronszajn's theorem
If K is p.d. over a set X then it is the r.k. of a Hilbert space H ⊂ R^X.
Let the mapping Φ : X → H be defined by:
∀x ∈ X, Φ(x) = Kx .
By the reproducing property, for any (x, x′) ∈ X²: ⟨Φ(x), Φ(x′)⟩_H = ⟨Kx, Kx′⟩_H = K(x, x′).
38 / 635
RKHS of the linear kernel
Theorem
The RKHS of the linear kernel K(x, x′) = ⟨x, x′⟩_{R^d} is the set of linear functions of the form
fw(x) = ⟨w, x⟩_{R^d} for w ∈ R^d, endowed with the norm ‖fw‖_H = ‖w‖₂ .
40 / 635
Proof
The RKHS of the linear kernel consists of functions:
x ∈ R^d ↦ f(x) = ∑_i ai ⟨xi, x⟩_{R^d} = ⟨w, x⟩_{R^d} ,
with w = ∑_i ai xi .
The RKHS is therefore the set of linear forms, endowed with the following inner product:
⟨f, g⟩_H = ⟨w, v⟩_{R^d} ,
when f(x) = ⟨w, x⟩_{R^d} and g(x) = ⟨v, x⟩_{R^d}.
41 / 635
RKHS of the linear kernel (cont.)
0 = x> x0 .
Klin (x, x )
f (x) = w> x ,
k f kH = k w k2 .
42 / 635
The polynomial kernel
Let's find the RKHS of the polynomial kernel:
∀x, y ∈ R^d, K(x, y) = ⟨x, y⟩²_{R^d} = (x^T y)²
43 / 635
The polynomial kernel
Second step: propose a candidate RKHS.
We know that H contains all the functions
f(x) = ∑_i ai K(xi, x) = ∑_i ai ⟨xi xi^T, x x^T⟩_F = ⟨ ∑_i ai xi xi^T , x x^T ⟩_F ,
where ⟨·, ·⟩_F denotes the Frobenius inner product between matrices.
A natural candidate is therefore the set of functions fS(x) = ⟨S, x x^T⟩_F = x^T S x for S a symmetric d × d matrix.
Why is it important?
44 / 635
The polynomial kernel
Third step: check that the candidate is a Hilbert space.
This step is trivial in the present case since it is easy to see that H is a Euclidean space, isomorphic to S^{d×d} by Φ : S ↦ fS. Sometimes things are not so simple and we need to prove the completeness explicitly.

Theorem (operations that preserve positive definiteness)
If K1 and K2 are p.d. kernels on X, then the following are p.d. kernels too:
K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0.
If (Ki)_{i≥1} is a sequence of p.d. kernels converging pointwise to a function K, i.e.,
∀(x, x′) ∈ X², K(x, x′) = lim_{i→∞} Ki(x, x′) ,
then K is also a p.d. kernel.
46 / 635
Examples
Theorem
If K is a kernel, then e K is a kernel too.
Proof:
e^{K(x,x′)} = lim_{n→+∞} ∑_{i=0}^n K(x, x′)^i / i! ,
and each term K(x, x′)^i / i! is p.d. (products of p.d. kernels scaled by a nonnegative constant), finite sums of p.d. kernels are p.d., and a pointwise limit of p.d. kernels is p.d.
47 / 635
Quizz: which of the following are p.d. kernels?
X = (−1, 1), K(x, x′) = 1 / (1 − x x′)
X = N, K(x, x′) = 2^{x+x′}
X = N, K(x, x′) = 2^{x x′}
X = R+, K(x, x′) = log(1 + x x′)
X = R, K(x, x′) = exp(−|x − x′|²)
X = R, K(x, x′) = cos(x + x′)
X = R, K(x, x′) = cos(x − x′)
X = R+, K(x, x′) = min(x, x′)
X = R+, K(x, x′) = max(x, x′)
X = R+, K(x, x′) = min(x, x′) / max(x, x′)
X = N, K(x, x′) = GCD(x, x′)
X = N, K(x, x′) = LCM(x, x′)
X = N, K(x, x′) = GCD(x, x′) / LCM(x, x′)
48 / 635
Outline
2 Kernel tricks
50 / 635
Smoothness functional
A simple inequality
By Cauchy-Schwarz we have, for any function f ∈ H and any two points x, x′ ∈ X:
|f(x) − f(x′)| = |⟨f, Kx − Kx′⟩_H|
 ≤ ‖f‖_H · ‖Kx − Kx′‖_H
 = ‖f‖_H · dK(x, x′) .
The norm of a function in the RKHS controls how fast the function varies over X with respect to the geometry defined by the kernel (Lipschitz with constant ‖f‖_H).
Important message: small norm implies slow variations.
51 / 635
Kernels and RKHS : Summary
P.d. kernels can be thought of as inner product after embedding the
data space X in some Hilbert space. As such a p.d. kernel defines a
metric on X .
A realization of this embedding is the RKHS, valid without
restriction on the space X nor on the kernel.
The RKHS is a space of functions over X . The norm of a function
in the RKHS is related to its degree of smoothness w.r.t. the metric
defined by the kernel on X .
We will now see some applications of kernels and RKHS in statistics,
before coming back to the problem of choosing (and eventually
designing) the kernel.
52 / 635
Part 2
Kernel tricks
53 / 635
Motivations
Two theoretical results underpin a family of powerful algorithms for data
analysis using p.d. kernels, collectively known as kernel methods:
The kernel trick, based on the representation of p.d. kernels as inner
products;
The representer theorem, based on some properties of the
regularization functional defined by the RKHS norm.
54 / 635
Outline
2 Kernel tricks
The kernel trick
The representer theorem
55 / 635
Motivations
Choosing a p.d. kernel K on a set X amounts to embedding the
data in a Hilbert space: there exists a Hilbert space H and a
mapping : X 7 H such that, for all x, x0 X ,
x, x0 X 2 , K x, x0 = (x) , x0 H .
56 / 635
The kernel trick
Proposition
Any algorithm to process finite-dimensional vectors that can be expressed
only in terms of pairwise inner products can be applied to potentially
infinite-dimensional vectors in the feature space of a p.d. kernel by
replacing each inner product evaluation by a kernel evaluation.
Remarks:
The proof of this proposition is trivial, because the kernel is exactly
the inner product in the feature space.
This trick has huge practical applications.
Vectors in the feature space are only manipulated implicitly, through
pairwise inner products.
57 / 635
Example 1: computing distances in the feature space
(Figure: two points x1, x2 in X mapped to Φ(x1), Φ(x2) in the feature space F.)
dK(x1, x2)² := ‖Φ(x1) − Φ(x2)‖²_H = K(x1, x1) + K(x2, x2) − 2 K(x1, x2) ,
so the distance can be computed from kernel evaluations only, without knowing Φ explicitly.
58 / 635
Distance for the Gaussian kernel
For the Gaussian kernel K(x, y) = exp(−‖x − y‖² / (2σ²)), K(x, x) = 1 = ‖Φ(x)‖²_H, so all points are mapped to the unit sphere in the feature space.
The distance between the images of two points x and y in the feature space is given by:
dK(x, y) = √( 2 ( 1 − e^{−‖x−y‖² / (2σ²)} ) ) .
(Figure: dK(x, y) as a function of ‖x − y‖.)
59 / 635
Example 2: distance between a point and a set
Problem
Let S = (x1 , , xn ) be a finite set of points in X .
How to define and compute the similarity between any point x in X
and the set S?
A solution:
Map all points to the feature space.
Summarize S by the barycenter of the points:
μ := (1/n) ∑_{i=1}^n Φ(xi) .
Define the distance:
dK(x, S) := ‖Φ(x) − μ‖_H .
60 / 635
Computation
(Figure: the point Φ(x) and the barycenter μ of Φ(S) in the feature space F.)
dK(x, S) = ‖ Φ(x) − (1/n) ∑_{i=1}^n Φ(xi) ‖_H
 = √( K(x, x) − (2/n) ∑_{i=1}^n K(x, xi) + (1/n²) ∑_{i,j=1}^n K(xi, xj) ) .
Remark
The barycenter μ only exists in the feature space in general: it does not necessarily have a pre-image x_μ such that Φ(x_μ) = μ.
61 / 635
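The formula above only involves kernel evaluations, so it applies to any p.d. kernel. A small sketch in Python/NumPy (the Gaussian kernel and the toy set, which matches the 1D illustration below, are arbitrary choices):

import numpy as np

def gaussian_K(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))   # 1D Gaussian kernel

def dist_to_set(x, S, K):
    # d_K(x, S): distance in the feature space between phi(x) and the barycenter of phi(S)
    k_xx = K(x, x)
    k_xS = np.mean([K(x, xi) for xi in S])
    k_SS = np.mean([[K(xi, xj) for xj in S] for xi in S])
    return np.sqrt(k_xx - 2.0 * k_xS + k_SS)

S = [2.0, 3.0]
print(dist_to_set(2.5, S, gaussian_K), dist_to_set(5.0, S, gaussian_K))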
1D illustration
S = {2, 3}. Plot f(x) = d(x, S) for three kernels:
K(x, y) = xy (linear); K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1; K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 0.2.
(Figure: d(x, S) as a function of x for the three kernels.)
Remarks
For the linear kernel, H = R, μ = 2.5 and d(x, S) = |x − μ|.
For the Gaussian kernel, d(x, S) = √( C − (2/n) ∑_{i=1}^n K(xi, x) ) for a constant C.
62 / 635
2D illustration
(Figure: level sets of d(x, S) in 2D for K(x, y) = xy (linear), K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1, and with σ = 0.2.)
Remark
As before, the barycenter μ in H (which is a single point in H) may carry a lot of information about the training data.
63 / 635
Application in discrimination
(Figure: discrimination of two point clouds, for K(x, y) = xy (linear), K(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1, and with σ = 0.2.)
64 / 635
Example 3: Centering data in the feature space
Problem
Let S = (x1, ..., xn) be a finite set of points in X endowed with a p.d. kernel K. Let K be their n × n Gram matrix: [K]ij = K(xi, xj).
Let μ = (1/n) ∑_{i=1}^n Φ(xi) be their barycenter, and ui = Φ(xi) − μ for i = 1, ..., n the centered data in H.
How to compute the centered Gram matrix [Kc]ij = ⟨ui, uj⟩_H?
Computation
A direct computation gives, for 1 ≤ i, j ≤ n:
Kc = K − UK − KU + UKU = (I − U) K (I − U) ,
where U is the n × n matrix with all entries equal to 1/n.
66 / 635
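A minimal sketch of this computation in Python/NumPy, using the closed form above (the linear-kernel test is just a sanity check, not part of the slides):

import numpy as np

def center_gram(Kmat):
    # Gram matrix of the centered points u_i = phi(x_i) - mu in the feature space
    n = Kmat.shape[0]
    U = np.full((n, n), 1.0 / n)
    I = np.eye(n)
    return (I - U) @ Kmat @ (I - U)

# With a linear kernel, centering the Gram matrix equals centering the data directly
X = np.random.randn(5, 3)
K = X @ X.T
Xc = X - X.mean(axis=0)
print(np.allclose(center_gram(K), Xc @ Xc.T))   # True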
Kernel trick Summary
The kernel trick is a trivial statement with important applications.
It can be used to obtain nonlinear versions of well-known linear
algorithms, e.g., by replacing the classical inner product by a
Gaussian kernel.
It can be used to apply classical algorithms to non vectorial data
(e.g., strings, graphs) by again replacing the classical inner product
by a valid kernel for the data.
It allows in some cases to embed the initial space to a larger feature
space and involve points in the feature space with no pre-image
(e.g., barycenter).
67 / 635
Outline
2 Kernel tricks
The kernel trick
The representer theorem
68 / 635
Motivation
An RKHS is a space of (potentially nonlinear) functions, and k f kH
measures the smoothness of f
Given a set of data (xi ∈ X, yi ∈ R)_{i=1,...,n}, a natural way to estimate a regression function f : X → R is to solve something like:
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ(yi, f(xi))  +  λ ‖f‖²_H        (1)
where the first term is the empirical risk (data fit) and the second the regularization.
69 / 635
The Theorem
Representer Theorem
Let X be a set endowed with a p.d. kernel K , H the corresponding
RKHS, and S = {x1 , , xn } X a finite set of points in X .
Let Ψ : R^{n+1} → R be a function of n + 1 variables, strictly increasing with respect to the last variable.
Then, any solution to the optimization problem:
min_{f∈H} Ψ( f(x1), ..., f(xn), ‖f‖_H )
admits a representation of the form
f ∈ Span(Kx1, ..., Kxn), i.e., f(x) = ∑_{i=1}^n αi K(xi, x) for some α ∈ R^n.
70 / 635
Proof (1/2)
Let (f ) be the functional that is minimized in the statement of the
representer theorem, and HS the linear span in H of the vectors Kxi :
HS = { f ∈ H : f(x) = ∑_{i=1}^n αi K(xi, x), (α1, ..., αn) ∈ R^n } .
Any f ∈ H can be uniquely decomposed as
f = fS + f⊥ ,
with fS ∈ HS and f⊥ ⊥ HS.
71 / 635
Proof (2/2)
H being a RKHS it holds that:
i = 1, , n, f (xi ) = fS (xi ) .
72 / 635
Remarks
Often the function has the form:
73 / 635
Practical use of the representer theorem (1/2)
When the representer theorem holds, we know that we can look for a solution of the form
f(x) = ∑_{i=1}^n αi K(xi, x) , for some α ∈ R^n .
Furthermore,
‖f‖²_H = ∑_{i=1}^n ∑_{j=1}^n αi αj K(xi, xj) = α^T K α .
74 / 635
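As a sketch (Python/NumPy), once a kernel and a candidate vector α are fixed, both the predictions and the penalty term above are plain matrix products; the helper names are illustrative only:

import numpy as np

def f_values(K_test_train, alpha):
    # Evaluate f(x) = sum_i alpha_i K(x_i, x) at test points from the cross Gram matrix
    return K_test_train @ alpha

def rkhs_norm_sq(K_train, alpha):
    # ||f||_H^2 = alpha^T K alpha
    return float(alpha @ K_train @ alpha)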
Practical use of the representer theorem (2/2)
Therefore, a problem of the form
75 / 635
Remarks
Dual interpretations of kernel methods
Most kernel methods have two complementary interpretations:
A geometric interpretation in the feature space, thanks to the kernel
trick. Even when the feature space is large, most kernel methods
work in the linear span of the embeddings of the points available.
A functional interpretation, often as an optimization problem over
(subsets of) the RKHS associated to the kernel.
The representer theorem has important consequences, but it is in fact
rather trivial. We are looking for a function f in H such that for all x
in X , f (x) = hKx , f iH . The part f that is orthogonal to the Kxi s is
thus useless to explain the training data.
76 / 635
Part 3
Kernel Methods
Supervised Learning
77 / 635
Supervised learning
Definition
Given:
X , a space of inputs,
Y, a space of outputs,
Sn = (xi , yi )i=1,...,n , a training set of (input,output) pairs,
the supervised learning problem is to estimate a function h : X Y to
predict the output for any future input.
Depending on the nature of the output, this covers:
Regression when Y = R;
Classification when Y = {1, 1} or any set of two labels;
Structured output regression or classification when Y is more
general.
78 / 635
Example: regression
Task: predict the capacity of a small molecule to inhibit a drug target
X = set of molecular structures (graphs?)
Y=R
79 / 635
Example: classification
Task: recognize if an image is a dog or a cat
X = set of images (Rd )
Y = {cat,dog}
80 / 635
Example: structured output
Task: translate from Japanese to French
X = finite-length strings of japanese characters
Y = finite-length strings of french characters
81 / 635
Supervised learning with kernels: general principles
1 Express h : X → Y using a real-valued function f : Z → R:
regression (Y = R): h(x) = f(x) with f : X → R (Z = X)
classification (Y = {−1, 1}): h(x) = sign(f(x)) with f : X → R (Z = X)
structured output: h(x) = argmax_{y∈Y} f(x, y) with f : X × Y → R (Z = X × Y)
83 / 635
Regression
Setup
X set of inputs
Y = R real-valued outputs
Sn = (xi , yi )i=1,...,n (X R)n a training set of n pairs
Goal = find a function f : X R to predict y by f (x)
85 / 635
Least-square regression over a general functional space
Let us quantify the error when f predicts f(x) instead of y by the squared error:
ℓ(f(x), y) = (y − f(x))²
Fix a set of functions H.
Least-square regression amounts to finding the function in H with the smallest empirical risk, called in this case the mean squared error (MSE):
f̂ ∈ argmin_{f∈H}  (1/n) ∑_{i=1}^n (yi − f(xi))²
86 / 635
Kernel ridge regression (KRR)
Let us now consider a RKHS H, associated to a p.d. kernel K on X .
KRR is obtained by regularizing the MSE criterion by the RKHS
norm:
f̂ = argmin_{f∈H}  (1/n) ∑_{i=1}^n (yi − f(xi))² + λ ‖f‖²_H        (2)
87 / 635
Solving KRR
Let y = (y1, ..., yn)^T ∈ R^n
Let α = (α1, ..., αn)^T ∈ R^n be the coefficients in the expansion f(x) = ∑_{i=1}^n αi K(xi, x)
Let K be the n × n Gram matrix: Kij = K(xi, xj)
We can then write:
( f(x1), ..., f(xn) )^T = Kα ,   ‖f‖²_H = α^T K α .
The KRR problem (2) becomes:
argmin_{α∈R^n}  (1/n) (Kα − y)^T (Kα − y) + λ α^T K α
88 / 635
Solving KRR
(1/n) (Kα − y)^T (Kα − y) + λ α^T K α is convex and differentiable in α; setting its gradient to zero gives
K [ (K + λnI) α − y ] = 0 ,
and a solution is
α = (K + λnI)^{-1} y .
89 / 635
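As a sketch, the whole KRR procedure fits in a few lines of Python/NumPy; the Gaussian kernel, its bandwidth, the value of λ and the toy data below are arbitrary choices, not prescribed by the slides:

import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)) between two sets of points
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # alpha = (K + lambda n I)^{-1} y  (solve a linear system rather than invert)
    n = len(y)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma=1.0):
    # f(x) = sum_i alpha_i K(x_i, x), evaluated at all test points
    return gaussian_gram(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
alpha = krr_fit(X, y, lam=0.01)
print(krr_predict(X, alpha, np.array([[5.0]])))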
Example (KRR with Gaussian RBF kernel)
(Figures: KRR fit on 1D data for λ = 1000, 100, 10, 1, 0.1, 0.01, 0.001, 10^-4, 10^-5, 10^-6, 10^-7.)
90 / 635
Remark: uniqueness of the solution
Let us find all α's that solve
K [ (K + λnI) α − y ] = 0 .
K being a symmetric matrix, it can be diagonalized in an orthonormal basis and Ker(K) ⊥ Im(K).
In this basis we see that (K + λnI)^{-1} leaves Im(K) and Ker(K) invariant.
The problem is therefore equivalent to:
(K + λnI) α − y ∈ Ker(K)
⟺ α − (K + λnI)^{-1} y ∈ Ker(K)
⟺ α = (K + λnI)^{-1} y + ε , with Kε = 0.
However, if α′ = α + ε with Kε = 0, then:
‖f − f′‖²_H = (α − α′)^T K (α − α′) = 0 ,
so the function f̂ is uniquely defined even when α is not.
Remark: link with standard ridge regression
When X = R^d with the linear kernel, K = XX^T and the representer theorem gives f̂(x) = w_KRR^T x with
w_KRR = ∑_{i=1}^n αi xi = X^T α = X^T (XX^T + λnI)^{-1} y .
92 / 635
Remark: link with standard ridge regression
On the other hand, the RKHS is the set of linear functions
f (x) = w> x and the RKHS norm is k f kH = k w k
We can therefore directly rewrite the original KRR problem (2) as
argmin_{w∈R^d}  (1/n) ∑_{i=1}^n (yi − w^T xi)² + λ ‖w‖²
 = argmin_{w∈R^d}  (1/n) (y − Xw)^T (y − Xw) + λ w^T w
93 / 635
Remark: link with standard ridge regression
Matrix inversion lemma
For any matrices B and C, and λ > 0, the following holds (when dimensions make sense):
B (CB + λI)^{-1} = (BC + λI)^{-1} B .
We deduce that (of course...):
w_RR = (X^T X + λnI)^{-1} X^T y = X^T (XX^T + λnI)^{-1} y = w_KRR ,
where the first expression inverts a d × d matrix and the second an n × n matrix.
94 / 635
Robust regression
The squared error ℓ(t, y) = (t − y)² is arbitrary and sensitive to outliers.
Many other loss functions exist for regression.
Weighted regression: given nonnegative weights W = diag(W1, ..., Wn), consider
argmin_{α∈R^n}  (1/n) (Kα − y)^T W (Kα − y) + λ α^T K α .
96 / 635
Weighted regression
Setting the gradient to zero gives
0 = (2/n) (KWKα − KWy) + 2λKα
 = (2/n) K W^{1/2} [ (W^{1/2} K W^{1/2} + λnI) W^{-1/2} α − W^{1/2} y ] .
A solution is therefore given by
(W^{1/2} K W^{1/2} + λnI) W^{-1/2} α − W^{1/2} y = 0 ,
therefore
α = W^{1/2} (W^{1/2} K W^{1/2} + λnI)^{-1} W^{1/2} y .
97 / 635
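A sketch of this weighted solver in Python/NumPy, which is reused as the inner step of the IRLS algorithm for kernel logistic regression further below; it assumes strictly positive weights so that W^{1/2} and W^{-1/2} exist:

import numpy as np

def solve_wkrr(K, W, z, lam):
    # alpha = W^{1/2} (W^{1/2} K W^{1/2} + lambda n I)^{-1} W^{1/2} z,
    # the minimizer of (1/n)(K alpha - z)^T W (K alpha - z) + lambda alpha^T K alpha
    n = K.shape[0]
    Wsq = np.diag(np.sqrt(W))          # W given as a vector of positive weights
    A = Wsq @ K @ Wsq + lam * n * np.eye(n)
    return Wsq @ np.linalg.solve(A, Wsq @ z)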
Binary classification
Setup
X set of inputs
Y = {−1, 1} binary outputs
Sn = (xi , yi )i=1,...,n (X Y)n a training set of n pairs
Goal = find a function f : X R to predict y by sign(f (x))
99 / 635
The 0/1 loss
The 0/1 loss measures if a prediction is correct or not:
ℓ_{0/1}(f(x), y) = 1(y f(x) < 0) = 0 if y = sign(f(x)), 1 otherwise.
However:
The problem is non-smooth, and typically NP-hard to solve.
The regularization has no effect since the 0/1 loss is invariant by scaling of f.
In fact, no function achieves the minimum when λ > 0 (why?).
100 / 635
The logistic loss
An alternative is to define a probabilistic model of y parametrized by f(x), e.g.:
∀y ∈ {−1, 1}, p(y | f(x)) = 1 / (1 + e^{−y f(x)}) = σ(y f(x))
(Figure: the sigmoid function σ(u).)
101 / 635
Kernel logistic regression (KLR)
f̂ = argmin_{f∈H}  (1/n) ∑_{i=1}^n ℓ_logistic( f(xi), yi ) + (λ/2) ‖f‖²_H
  = argmin_{f∈H}  (1/n) ∑_{i=1}^n ln( 1 + e^{−yi f(xi)} ) + (λ/2) ‖f‖²_H
102 / 635
Solving KLR
By the representer theorem, any solution of KLR can be expanded as
f̂(x) = ∑_{i=1}^n αi K(xi, x) .
103 / 635
Technical facts
(Figure: the sigmoid σ(u) and the logistic loss ℓ_logistic(u) = ln(1 + e^{−u}).)
The logistic loss is convex, with ℓ′_logistic(u) = −σ(−u) and ℓ″_logistic(u) = σ(u)σ(−u) ≥ 0.
104 / 635
Back to KLR
min_{α∈R^n}  J(α) = (1/n) ∑_{i=1}^n ℓ_logistic( yi [Kα]_i ) + (λ/2) α^T K α
105 / 635
Computing the quadratic approximation
Gradient:
∂J/∂αj = (1/n) ∑_{i=1}^n ℓ′_logistic( yi [Kα]_i ) yi Kij + λ [Kα]_j ,
therefore
∇J(α) = (1/n) K P(α) y + λKα ,
where P(α) = diag( P1(α), ..., Pn(α) ) and Pi(α) = ℓ′_logistic( yi [Kα]_i ).
Hessian:
∂²J/∂αj∂αl = (1/n) ∑_{i=1}^n ℓ″_logistic( yi [Kα]_i ) yi Kij yi Kil + λ Kjl ,
therefore
∇²J(α) = (1/n) K W(α) K + λK ,
where W(α) = diag( W1(α), ..., Wn(α) ) and Wi(α) = ℓ″_logistic( yi [Kα]_i ).
106 / 635
Computing the quadratic approximation
The Newton step minimizes the quadratic approximation of J around the current iterate α0:
Jq(α) = J(α0) + (α − α0)^T ∇J(α0) + (1/2) (α − α0)^T ∇²J(α0) (α − α0) .
Terms that depend on α, with P = P(α0) and W = W(α0):
α^T ∇J(α0) = (1/n) α^T K P y + λ α^T K α0
(1/2) α^T ∇²J(α0) α = (1/(2n)) α^T K W K α + (λ/2) α^T K α
−α^T ∇²J(α0) α0 = −(1/n) α^T K W K α0 − λ α^T K α0
Putting it all together:
2 Jq(α) = −(2/n) α^T K W ( K α0 − W^{-1} P y ) + (1/n) α^T K W K α + λ α^T K α + C
        = (1/n) (Kα − z)^T W (Kα − z) + λ α^T K α + C′ ,   with z := K α0 − W^{-1} P y .
This is a standard weighted kernel ridge regression (WKRR) problem!
107 / 635
Solving KLR by IRLS
In summary, one way to solve KLR is to iteratively solve a WKRR
problem until convergence:
α^{t+1} ← solveWKRR(K, W^t, z^t)
108 / 635
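A compact sketch of this loop in Python/NumPy, reusing the solve_wkrr helper sketched after the weighted regression slide; the zero initialization and fixed iteration count are arbitrary choices:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def klr_irls(K, y, lam, n_iter=20):
    # Kernel logistic regression by iteratively reweighted least squares (Newton's method)
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = K @ alpha                          # current values f(x_i)
        P = -sigmoid(-y * m)                   # P_i = l'_logistic(y_i m_i)
        W = sigmoid(y * m) * sigmoid(-y * m)   # W_i = l''_logistic(y_i m_i) > 0
        z = m - P * y / W                      # z = K alpha - W^{-1} P y
        alpha = solve_wkrr(K, W, z, lam)       # weighted KRR step
    return alpha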
Loss function examples
(Figure: the loss functions φ(u) listed below.)
Method | φ(u)
Kernel logistic regression | log(1 + e^{−u})
Support vector machine (1-SVM) | max(1 − u, 0)
Support vector machine (2-SVM) | max(1 − u, 0)²
Boosting | e^{−u}
111 / 635
Large-margin classifiers
Definition
Given a non-increasing function φ : R → R+, a (kernel) large-margin classifier is an algorithm that estimates a function f : X → R by solving
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H
Questions:
1 Can we solve the optimization problem for other s?
2 Is it a good idea to optimize this objective function, if at the end of
the day we are interested in the `0/1 loss, i.e., learning models that
make few errors?
112 / 635
Solving large-margin classifiers
n
1X
min (yi f (xi )) + k f k2H
f H n
i=1
When is convex, this can be solved using general tools for convex
optimization, or specific algorithms (e.g., for SVM, see later).
113 / 635
A tiny bit of learning theory
Assumptions and notations
Let P be an (unknown) distribution on X × Y, and η(x) = P(Y = 1 | X = x) a measurable version of the conditional distribution of Y given X.
Assume the training set Sn = (Xi, Yi)_{i=1,...,n} consists of i.i.d. random variables with distribution P.
The risk of a classifier f : X → R is R(f) = P( sign(f(X)) ≠ Y ).
The Bayes risk is R* = inf_{f measurable} R(f).
The φ-risk of f is Rφ(f) = E[ φ(Y f(X)) ], with optimal value Rφ* = inf_{f measurable} Rφ(f).
115 / 635
A small -risk ensures a small 0/1 risk
Theorem [Bartlett et al., 2003]
Let φ : R → R+ be convex, non-increasing, differentiable at 0 with φ′(0) < 0. Let f* : X → R be measurable such that
Rφ(f*) = min_{g measurable} Rφ(g) = Rφ* .
Then
R(f*) = min_{g measurable} R(g) = R* .
Remarks:
This tells us that, if we know P, then minimizing the -risk is a
good idea even if our focus is on the classification error.
The assumptions on can be relaxed; it works for the broader class
of classification-calibrated loss functions [Bartlett et al., 2003].
More generally, we can show that if Rφ(f) − Rφ* is small, then
R(f) − R* is small too [Bartlett et al., 2003].
116 / 635
A small -risk ensures a small 0/1 risk
Proof sketch:
Condition on X = x:
Rφ(f | X = x) = E[ φ(Y f(X)) | X = x ] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)) .
Therefore:
117 / 635
Empirical risk minimization (ERM)
To find a function with a small -risk, the following is a good candidate:
Definition
The ERM estimator on a functional class F is the solution (when it
exists) of:
fn = argmin Rn (f ) .
f F
Questions
Is Rn (f ) a good estimate of the true risk R (f )?
Is R (fn ) small?
118 / 635
Class capacity
Motivations
The ERM principle gives a good solution if R fn is similar to the
minimum achievable risk inf f F R (f ).
This can be ensured if F is not too large.
We need a measure of the capacity of F: the Rademacher complexity
Rad_n(F) = E[ sup_{f∈F} (2/n) ∑_{i=1}^n σi f(Xi) ] ,
where the expectation is over (Xi)_{i=1,...,n} and the independent uniform {±1}-valued (Rademacher) random variables (σi)_{i=1,...,n}.
119 / 635
Basic learning bounds
Theorem
Suppose φ is Lipschitz with constant Lφ:
∀u, u′ ∈ R, |φ(u) − φ(u′)| ≤ Lφ |u − u′| .
Then the φ-risk of the ERM estimator satisfies (on average over the sampling of the training set)
E_{Sn} Rφ(f̂n) − Rφ*  ≤  4 Lφ Rad_n(F)  +  ( inf_{f∈F} Rφ(f) − Rφ* ) ,
where the left-hand side is the excess φ-risk, the first term bounds the estimation error and the second is the approximation error.
121 / 635
Proof (1/2)
For the ball F_B = { f ∈ H : ‖f‖_H ≤ B }:
Rad_n(F_B) = E_{X,σ}[ sup_{f∈F_B} (2/n) ∑_{i=1}^n σi f(Xi) ]
 = E_{X,σ}[ sup_{f∈F_B} (2/n) ⟨ f, ∑_{i=1}^n σi K_{Xi} ⟩_H ]      (RKHS)
 = E_{X,σ}[ (2B/n) ‖ ∑_{i=1}^n σi K_{Xi} ‖_H ]                    (Cauchy-Schwarz)
 = (2B/n) E_{X,σ} √( ‖ ∑_{i=1}^n σi K_{Xi} ‖²_H )
 ≤ (2B/n) √( E_{X,σ} ∑_{i,j=1}^n σi σj K(Xi, Xj) )                (Jensen)
122 / 635
Proof (2/2)
But E[σi σj] is 1 if i = j, 0 otherwise. Therefore:
Rad_n(F_B) ≤ (2B/n) √( E_X ∑_{i,j=1}^n E[σi σj] K(Xi, Xj) )
 = (2B/n) √( E_X ∑_{i=1}^n K(Xi, Xi) )
 = 2B √( E_X K(X, X) ) / √n .
123 / 635
Basic learning bounds in RKHS balls
Corollary
Suppose K(X, X) ≤ κ² a.s. (e.g., Gaussian kernel, κ = 1). Then the ERM estimator in F_B satisfies
E Rφ(f̂n) − Rφ*  ≤  8 Lφ κ B / √n  +  ( inf_{f∈F_B} Rφ(f) − Rφ* ) .
Remarks
B controls the trade-off between approximation and estimation error.
The bound on the estimation error is independent of P and decreases with n.
The approximation error is harder to analyze in general.
In practice, B (or λ, next slide) is tuned by cross-validation.
124 / 635
ERM as penalized risk minimization
ERM over F_B solves the constrained minimization problem:
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) )   subject to   ‖f‖_H ≤ B .
By Lagrangian duality, this is equivalent (for some λ ≥ 0) to the penalized problem
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H .
125 / 635
Summary: large margin classifiers
(Figure: the loss functions φ(u) for the 1-SVM, 2-SVM, logistic and boosting losses.)
A large-margin classifier solves
min_{f∈H}  (1/n) ∑_{i=1}^n φ( yi f(xi) ) + λ ‖f‖²_H .
A few slides on convex duality
(Figure: a primal objective f(x) and its dual q(λ); weak duality, strong duality, and the duality gap.)
Parenthesis on duality gaps
For a primal point x and a dual point λ, the duality gap δ(x, λ) = f(x) − q(λ) guarantees that 0 ≤ f(x) − f(x*) ≤ δ(x, λ).
Dual problems are often obtained by Lagrangian or Fenchel duality.
129 / 635
A few slides on Lagrangian duality
Setting
We consider an equality and inequality constrained optimization
problem over a variable x X :
minimize f (x)
subject to hi (x) = 0 , i = 1, . . . , m ,
gj (x) 0 , j = 1, . . . , r ,
130 / 635
A few slides on Lagrangian duality
Lagrangian
The Lagrangian of this problem is the function L : X Rm Rr R
defined by:
m
X r
X
L (x, , ) = f (x) + i hi (x) + j gj (x) .
i=1 j=1
131 / 635
A few slides on convex Lagrangian duality
The Lagrange dual function is defined as q(λ, μ) = inf_{x∈X} L(x, λ, μ).
For the (primal) problem:
minimize f(x)   subject to h(x) = 0, g(x) ≤ 0,
the Lagrange dual problem is:
maximize q(λ, μ)   subject to μ ≥ 0.
Proposition
q is concave in (λ, μ), even if the original problem is not convex.
The dual function yields lower bounds on the optimal value f* of the original problem when μ is nonnegative:
q(λ, μ) ≤ f* , ∀λ ∈ R^m, ∀μ ∈ R^r, μ ≥ 0 .
132 / 635
Proofs
For each x, the function (, ) 7 L(x, , ) is linear, and therefore
both convex and concave in (, ). The pointwise minimum of
concave functions is concave, therefore q is concave.
Let x be any feasible point, i.e., h(x) = 0 and g (x) 0. Then we
have, for any and 0:
m
X r
X
i hi (x) + i gi (x) 0 ,
i=1 i=1
m
X r
X
= L(x, , ) = f (x) + i hi (x) + i gi (x) f (x) ,
i=1 i=1
133 / 635
Weak duality
Let q the optimal value of the Lagrange dual problem. Each
q(, ) is a lower bound for f ? and by definition q ? is the best lower
bound that is obtained. The following weak duality inequality
therefore always hold:
q? f ? .
134 / 635
Strong duality
We say that strong duality holds if the optimal duality gap is zero,
i.e.:
q? = f ? .
If strong duality holds, then the best lower bound that can be
obtained from the Lagrange dual function is tight
Strong duality does not hold for general nonlinear problems.
It usually holds for convex problems.
Conditions that ensure strong duality for convex problems are called
constraint qualification.
in that case, we have for all feasible primal and dual points x, , ,
135 / 635
Slaters constraint qualification
Strong duality holds for a convex problem:
minimize f (x)
subject to gj (x) 0 , j = 1, . . . , r ,
Ax = b ,
if it is strictly feasible, i.e., there exists at least one feasible point that
satisfies:
gj (x) < 0 , j = 1, . . . , r , Ax = b .
136 / 635
Remarks
Slaters conditions also ensure that the maximum q ? (if > ) is
attained, i.e., there exists a point (? , ? ) with
q (? , ? ) = q ? = f ?
137 / 635
Dual optimal pairs
Suppose that strong duality holds, x? is primal optimal, (? , ? ) is dual
optimal. Then we have:
f(x*) = q(λ*, μ*)
 = inf_x [ f(x) + ∑_{i=1}^m λi* hi(x) + ∑_{j=1}^r μj* gj(x) ]
 ≤ f(x*) + ∑_{i=1}^m λi* hi(x*) + ∑_{j=1}^r μj* gj(x*)
 ≤ f(x*) ,
hence both inequalities are in fact equalities.
138 / 635
Complementary slackness
The first equality shows that:
μj* gj(x*) = 0 , j = 1, . . . , r .
139 / 635
Support vector machines (SVM)
Definition
The hinge loss is the function ℓ_hinge : R → R+:
ℓ_hinge(u) = max(1 − u, 0) = 0 if u ≥ 1, 1 − u otherwise.
(Figure: the hinge loss as a function of y f(x).)
142 / 635
Problem reformulation (1/3)
By the representer theorem, the solution of
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ_hinge( yi f(xi) ) + λ ‖f‖²_H
satisfies
f̂(x) = ∑_{i=1}^n αi K(xi, x) ,
where α solves
min_{α∈R^n}  (1/n) ∑_{i=1}^n ℓ_hinge( yi [Kα]_i ) + λ α^T K α .
143 / 635
Problem reformulation (2/3)
Let us introduce additional slack variables ξ1, ..., ξn ∈ R. The problem is equivalent to:
min_{α∈R^n, ξ∈R^n}  (1/n) ∑_{i=1}^n ξi + λ α^T K α ,
subject to:
ξi ≥ ℓ_hinge( yi [Kα]_i ) .
The objective function is now smooth, but not the constraints.
However it is easy to replace the non-smooth constraint by a conjunction of two smooth constraints, because:
u ≥ ℓ_hinge(v)  ⟺  u ≥ 1 − v and u ≥ 0 .
144 / 635
Problem reformulation (3/3)
In summary, the SVM solution is
f̂(x) = ∑_{i=1}^n αi K(xi, x) ,
where α solves:
SVM (primal formulation)
min_{α∈R^n, ξ∈R^n}  (1/n) ∑_{i=1}^n ξi + λ α^T K α ,
subject to:
yi [Kα]_i + ξi − 1 ≥ 0 , for i = 1, ..., n ,
ξi ≥ 0 , for i = 1, ..., n .
145 / 635
Solving the SVM problem
This is a classical quadratic program (minimization of a convex
quadratic function with linear constraints) for which any
out-of-the-box optimization package can be used.
The dimension of the problem and the number of constraints,
however, are 2n where n is the number of points. General-purpose
QP solvers will have difficulties when n exceeds a few thousands.
Solving the dual of this problem (also a QP) will be more convenient
and lead to faster algorithms (due to the sparsity of the final
solution).
146 / 635
Lagrangian
Let us introduce the Lagrange multipliers μ ∈ R^n and ν ∈ R^n for the two sets of constraints.
The Lagrangian of the problem is:
L(α, ξ, μ, ν) = (1/n) ∑_{i=1}^n ξi + λ α^T K α − ∑_{i=1}^n μi [ yi [Kα]_i + ξi − 1 ] − ∑_{i=1}^n νi ξi .
147 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. α
L(α, ξ, μ, ν) is a convex quadratic function in α. It is minimized whenever its gradient is zero:
∇_α L = 2λKα − K diag(y) μ = 0 , which holds for
α = diag(y) μ / (2λ) .
148 / 635
Minimizing L(α, ξ, μ, ν) w.r.t. ξ
L(α, ξ, μ, ν) is a linear function in ξ.
Its minimum is −∞ except when it is constant in ξ, i.e., when:
∇_ξ L = (1/n) 1 − μ − ν = 0 ,
or equivalently
μ + ν = (1/n) 1 .
149 / 635
Dual function
We therefore obtain the Lagrange dual function:
q(μ, ν) = inf_{α∈R^n, ξ∈R^n} L(α, ξ, μ, ν)
 = μ^T 1 − (1/(4λ)) μ^T diag(y) K diag(y) μ   if μ + ν = (1/n) 1 ,
 = −∞   otherwise.
The dual problem is:
maximize q(μ, ν) subject to μ ≥ 0, ν ≥ 0 .
150 / 635
Dual problem
If μi > 1/n for some i, then there is no νi ≥ 0 such that μi + νi = 1/n, hence q(μ, ν) = −∞.
If 0 ≤ μi ≤ 1/n for all i, then the dual function takes finite values that depend only on μ, by taking νi = 1/n − μi.
The dual problem is therefore equivalent to:
max_{0 ≤ μ ≤ 1/n}  μ^T 1 − (1/(4λ)) μ^T diag(y) K diag(y) μ ,
or with indices:
max_{0 ≤ μ ≤ 1/n}  ∑_{i=1}^n μi − (1/(4λ)) ∑_{i,j=1}^n yi yj μi μj K(xi, xj) .
151 / 635
Back to the primal
Once the dual problem is solved in μ we get a solution of the primal problem by α = diag(y) μ / (2λ).
Because the link is so simple, we can directly plug this into the dual problem to obtain the QP that α must solve:
max_{α∈R^n}  2 ∑_{i=1}^n αi yi − ∑_{i,j=1}^n αi αj K(xi, xj) ,
subject to:
0 ≤ yi αi ≤ 1/(2λn) , for i = 1, ..., n .
152 / 635
Complementary slackness conditions
The complementary slackness conditions are, for i = 1, ..., n:
μi [ yi f̂(xi) + ξi − 1 ] = 0 ,
νi ξi = 0 ,
153 / 635
Analysis of KKT conditions
In terms of α, the conditions read, for i = 1, ..., n:
αi [ yi f̂(xi) + ξi − 1 ] = 0 ,
( αi − yi/(2λn) ) ξi = 0 .
154 / 635
Geometric interpretation
(Figure: the decision function with level sets f(x) = +1, f(x) = 0, f(x) = −1; according to the KKT conditions each training point falls in one of three regimes: yi αi = 1/(2λn), 0 < yi αi < 1/(2λn), or αi = 0.)
155 / 635
Support vectors
Consequence of KKT conditions
The training points with αi ≠ 0 are called support vectors.
Only support vectors are important for the classification of new points:
∀x ∈ X, f̂(x) = ∑_{i=1}^n αi K(xi, x) = ∑_{i∈SV} αi K(xi, x) ,
where SV denotes the set of support vectors.
Consequences
The solution is sparse in , leading to fast algorithms for training
(use of decomposition methods).
The classification of a new point only involves kernel evaluations
with support vectors (fast).
156 / 635
Remark: C-SVM
Often the SVM optimization problem is written in terms of a regularization parameter C instead of λ, as follows:
argmin_{f∈H}  (1/2) ‖f‖²_H + C ∑_{i=1}^n ℓ_hinge( f(xi), yi ) .
This is equivalent to our formulation with C = 1/(2λn).
The SVM optimization problem is then:
max_{α∈R^n}  2 ∑_{i=1}^n αi yi − ∑_{i,j=1}^n αi αj K(xi, xj) ,
subject to:
0 ≤ yi αi ≤ C , for i = 1, ..., n .
This formulation is often called C-SVM.
157 / 635
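In practice one rarely solves this QP by hand. As a sketch (assuming scikit-learn is available; the toy data and C value are arbitrary), a C-SVM can be trained directly from any precomputed p.d. Gram matrix:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                     # Gaussian Gram matrix (any p.d. kernel matrix works)

clf = SVC(C=1.0, kernel="precomputed")    # C plays the role of 1/(2 lambda n)
clf.fit(K, y)                             # train on the n x n Gram matrix
print(clf.support_)                       # indices of the support vectors
# To predict on new points, pass the (n_test x n_train) kernel matrix K(x_test, x_train).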
Remark: 2-SVM
A variant of the SVM, sometimes called 2-SVM, is obtained by
replacing the hinge loss by the square hinge loss:
min_{f∈H}  (1/n) ∑_{i=1}^n ℓ_hinge( yi f(xi) )² + λ ‖f‖²_H .
158 / 635
Part 4
Kernel Methods
Unsupervised Learning
159 / 635
The K-means algorithm
K-means is probably the most popular algorithm for clustering.
Optimization point of view
Given data points x1, ..., xn in R^p, it consists of performing alternate minimization steps for optimizing the following cost function:
min_{μj ∈ R^p for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ xi − μ_{si} ‖²₂ .
161 / 635
The kernel K-means algorithm
We may now modify the objective to operate in a RKHS. Given data
points x1 , . . . , xn in X and a p.d. kernel K : X X R with H its
RKHS, the new objective becomes
min_{μj ∈ H for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − μ_{si} ‖²_H .
To optimize the cost function, we will first use the following Proposition
Proposition
The center of mass μ_n := (1/n) ∑_{i=1}^n Φ(xi) solves the following optimization problem:
min_{μ∈H}  ∑_{i=1}^n ‖ Φ(xi) − μ ‖²_H .
162 / 635
The kernel K-means algorithm
Proof
(1/n) ∑_{i=1}^n ‖Φ(xi) − μ‖²_H = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − 2 ⟨ (1/n) ∑_{i=1}^n Φ(xi), μ ⟩_H + ‖μ‖²_H
 = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − 2 ⟨μ_n, μ⟩_H + ‖μ‖²_H
 = (1/n) ∑_{i=1}^n ‖Φ(xi)‖²_H − ‖μ_n‖²_H + ‖μ_n − μ‖²_H ,
which is minimized for μ = μ_n.
163 / 635
The kernel K-means algorithm
Given now the objective,
min_{μj ∈ H for j=1,...,k;  si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − μ_{si} ‖²_H ,
the algorithm alternates two steps: (i) assignment, si ← argmin_j ‖Φ(xi) − μj‖²_H for each i, and (ii) centroid update, μj ← (1/|Cj|) ∑_{i∈Cj} Φ(xi), where Cj = {i : si = j}.
164 / 635
The kernel K-means algorithm, equivalent objective
Note that all operations are performed by manipulating kernel values K(xi, xj) only. Implicitly, we are in fact optimizing
min_{si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n ‖ Φ(xi) − (1/|C_{si}|) ∑_{j∈C_{si}} Φ(xj) ‖²_H ,
or, equivalently,
min_{si ∈ {1,...,k} for i=1,...,n}  ∑_{i=1}^n [ K(xi, xi) − (2/|C_{si}|) ∑_{j∈C_{si}} K(xi, xj) + (1/|C_{si}|²) ∑_{j,l∈C_{si}} K(xj, xl) ] .
Note that
∑_{i=1}^n (1/|C_{si}|²) ∑_{j,l∈C_{si}} K(xj, xl) = ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj)
and
∑_{i=1}^n (1/|C_{si}|) ∑_{j∈C_{si}} K(xi, xj) = ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj) .
165 / 635
The kernel K-means algorithm, equivalent objective
Then, after removing the constant terms K(xi, xi), we obtain:
Proposition
The kernel K-means objective is equivalent to the following one:
max_{si ∈ {1,...,k} for i=1,...,n}  ∑_{l=1}^k (1/|C_l|) ∑_{i,j∈C_l} K(xi, xj) .
166 / 635
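A sketch of the resulting algorithm in Python/NumPy: assignments are updated using only kernel evaluations, via the expansion of ‖Φ(xi) − μl‖² given above; the random initialization and fixed iteration budget are arbitrary choices:

import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    # Kernel K-means: cluster n points given only their Gram matrix K (n x n)
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for l in range(k):
            idx = np.where(labels == l)[0]
            if len(idx) == 0:                    # keep empty clusters out of the way
                dist[:, l] = np.inf
                continue
            # ||phi(x_i) - mu_l||^2, up to the constant K(x_i, x_i)
            dist[:, l] = -2.0 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels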
The spectral clustering algorithms
Instead of a greedy approach, we can relax the problem into a feasible
one, which yields a class of algorithms called spectral clustering.
First, consider the objective
k
X 1 X
max K (xi , xj ).
si {1,...,k} |Cl |
for i=1,...,n l=1 i,jCl
and we introduce
(?) the binary assignment matrix A in {0, 1}nk whose rows sum to one.
167 / 635
The spectral clustering algorithms
168 / 635
Principal Component Analysis (PCA)
Classical setting
Let S = {x1 , . . . , xn } be a set of vectors (xi Rd )
PCA is a classical algorithm in multivariate statistics to define a set
of orthogonal directions that capture the maximum variance
Applications: low-dimensional representation of high-dimensional
points, visualization
PC2 PC1
170 / 635
Principal Component Analysis (PCA)
Formalization
Assume that the data are centered (otherwise center them as preprocessing), i.e.:
(1/n) ∑_{i=1}^n xi = 0 .
For a direction w ∈ R^d, let hw(x) = x^T w / ‖w‖₂ denote the projection onto w.
171 / 635
Principal Component Analysis (PCA)
Formalization
The empirical variance captured by hw is:
var̂(hw) := (1/n) ∑_{i=1}^n hw(xi)² = (1/n) ∑_{i=1}^n (xi^T w)² / ‖w‖²₂ .
172 / 635
Principal Component Analysis (PCA)
Solution
Let X be the n × d data matrix whose rows are the vectors x1^T, ..., xn^T. We can then write:
var̂(hw) = (1/n) ∑_{i=1}^n (xi^T w)² / ‖w‖²₂ = (1/n) w^T X^T X w / (w^T w) ,
so the principal directions are the leading eigenvectors of X^T X.
173 / 635
Kernel Principal Component Analysis (PCA)
Let x1 , . . . , xn be a set of data points in X ; let K : X X R be a
positive definite kernel and H be its RKHS.
Formalization
Assume that the data are centered in the feature space (otherwise center by manipulating the kernel matrix), i.e.:
(1/n) ∑_{i=1}^n Φ(xi) = 0 .
174 / 635
Kernel Principal Component Analysis (PCA)
Let x1, ..., xn be a set of data points in X; let K : X × X → R be a positive definite kernel and H be its RKHS.
Formalization
The empirical variance captured by a function f ∈ H (playing the role of hw) is:
var̂(hf) := (1/n) ∑_{i=1}^n ⟨Φ(xi), f⟩²_H / ‖f‖²_H = (1/n) ∑_{i=1}^n f(xi)² / ‖f‖²_H .
The i-th kernel principal component is defined recursively by
fi = argmax_{f ⊥ {f1,...,fi−1}} var̂(hf)   s.t.   ‖f‖_H = 1 .
175 / 635
Sanity check: kernel PCA with linear kernel = PCA
fw (x) = w> x ,
Moreover, w w0 fw fw0 .
176 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA solves, for i = 1, ..., d:
fi = argmax_{f ⊥ {f1,...,fi−1}}  ∑_{k=1}^n f(xk)²   s.t.   ‖f‖_H = 1 .
By the representer theorem, each fi can be written fi(·) = ∑_{k=1}^n αi,k K(xk, ·) for some αi ∈ R^n.
177 / 635
Kernel Principal Component Analysis (PCA)
Therefore we have:
‖fi‖²_H = ∑_{k,l=1}^n αi,k αi,l K(xk, xl) = αi^T K αi .
Similarly:
∑_{k=1}^n fi(xk)² = αi^T K² αi ,
and
⟨fi, fj⟩_H = αi^T K αj .
178 / 635
Kernel Principal Component Analysis (PCA)
Solution
Kernel PCA maximizes over αi ∈ R^n the function αi^T K² αi, under the constraints αi^T K αi = 1 and αi^T K αj = 0 for j < i.
179 / 635
Kernel Principal Component Analysis (PCA)
Solution
Compute the eigenvalue decomposition of the kernel matrix K = UΔU^T, with eigenvalues δ1 ≥ ... ≥ δn ≥ 0.
After the change of variable βi = K^{1/2} αi (with K^{1/2} = UΔ^{1/2}U^T), the problem becomes a standard eigenvalue problem for K, solved by the successive eigenvectors ui.
180 / 635
Kernel Principal Component Analysis (PCA)
Summary
1 Center the Gram matrix
2 Compute the first eigenvectors (ui, δi)
3 Normalize the eigenvectors: αi = ui / √δi
4 The projections of the points onto the i-th component are given by Kαi
181 / 635
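A sketch of these four steps in Python/NumPy, assuming the Gram matrix K of the (possibly non-vectorial) data is available; the linear-kernel example is just a sanity check:

import numpy as np

def kernel_pca(K, n_components=2):
    # Projections of the n points on the first kernel principal components
    n = K.shape[0]
    U = np.full((n, n), 1.0 / n)
    Kc = (np.eye(n) - U) @ K @ (np.eye(n) - U)        # 1. center the Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)             # 2. eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, u = eigvals[order], eigvecs[:, order]
    alpha = u / np.sqrt(np.maximum(lam, 1e-12))       # 3. alpha_i = u_i / sqrt(delta_i)
    return Kc @ alpha                                 # 4. projections K_c alpha_i

# With a linear kernel, this recovers ordinary PCA scores (up to signs)
X = np.random.randn(20, 5)
proj = kernel_pca(X @ X.T, n_components=2)
print(proj.shape)   # (20, 2)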
Kernel Principal Component Analysis (PCA)
Remarks
In this formulation, we must diagonalize the centered kernel Gram
matrix, instead of the covariance matrix in the classical setting
Exercise: check that X> X and XX> have the same spectrum (up to
0 eigenvalues) and that the eigenvectors are related by a simple
relationship.
This formulation remains valid for any p.d. kernel: this is kernel PCA
Applications: nonlinear PCA with nonlinear kernels for vectors, PCA
of non-vector objects (strings, graphs..) with specific kernels...
182 / 635
Example
(Figure: projection onto the first two kernel principal components, PC1 and PC2.)
A set of 74 human tRNA sequences is analyzed using a kernel for sequences (the second-order marginalized kernel based on SCFG). This set of tRNAs contains three classes, called Ala-AGC (white circles), Asn-GTT (black circles) and Cys-GCA (plus symbols) (from Tsuda et al., 2003).
183 / 635
Canonical Correlation Analysis (CCA)
Given two views X = [x1, ..., xn] in R^{p×n} and Y = [y1, ..., yn] in R^{d×n} of the same dataset, the goal of canonical correlation analysis (CCA) is to find pairs of directions in the two views that are maximally correlated.
Formulation
Assuming that the datasets are centered, we want to maximize
max_{wa∈R^p, wb∈R^d}  [ (1/n) ∑_{i=1}^n wa^T xi yi^T wb ]  /  [ ( (1/n) ∑_{i=1}^n wa^T xi xi^T wa )^{1/2} ( (1/n) ∑_{i=1}^n wb^T yi yi^T wb )^{1/2} ] .
Assuming that the pairs (xi, yi) are i.i.d. samples from an unknown distribution, CCA seeks to maximize an empirical estimate of the correlation between wa^T x and wb^T y.
185 / 635
Canonical Correlation Analysis (CCA)
Formulation
Assuming that the datasets are centered,
max wa> X> Ywb s.t. wa> X> Xwa = 1 and wb> Y> Ywb = 1.
wa Rp ,wb Rd
186 / 635
Canonical Correlation Analysis (CCA)
Taking the derivatives of the Lagrangian and setting the gradients to zero yields two stationarity conditions. Multiply the first equality by wa^T and the second by wb^T; subtract the two resulting equalities and we get that the two multipliers coincide (call the common value ρ), so that the conditions can be written as
[ 0      X^T Y ] [ wa ]   =   ρ [ X^T X   0     ] [ wa ]
[ Y^T X  0     ] [ wb ]         [ 0       Y^T Y ] [ wb ] .
Canonical Correlation Analysis (CCA)
Let us define
A = [ 0  X^T Y ; Y^T X  0 ] ,   B = [ X^T X  0 ; 0  Y^T Y ] ,   w = [ wa ; wb ] ,
so that CCA amounts to solving the generalized eigenvalue problem A w = ρ B w.
188 / 635
Kernel Canonical Correlation Analysis
Similar to kernel PCA, it is possible to operate in a RKHS. Given two
p.d. kernels Ka , Kb : X X R, we can obtain two views of a
dataset x1 , . . . , xn in X n :
189 / 635
Kernel Canonical Correlation Analysis
Up to a few technical details (exercise), we can apply the representer theorem and look for solutions fa(·) = ∑_{i=1}^n αi Ka(xi, ·) and fb(·) = ∑_{i=1}^n βi Kb(xi, ·). We finally obtain the formulation
max_{α∈R^n, β∈R^n}  [ (1/n) ∑_{i=1}^n [Ka α]_i [Kb β]_i ]  /  [ ( (1/n) ∑_{i=1}^n [Ka α]_i² )^{1/2} ( (1/n) ∑_{i=1}^n [Kb β]_i² )^{1/2} ] ,
which is equivalent to
max_{α∈R^n, β∈R^n}  α^T Ka Kb β / [ (α^T Ka² α)^{1/2} (β^T Kb² β)^{1/2} ] ,
or, after removing the scaling ambiguity for α and β,
Equivalent formulation
max_{α∈R^n, β∈R^n}  α^T Ka Kb β   s.t.   α^T Ka² α = 1 and β^T Kb² β = 1 .
190 / 635
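This is again a generalized eigenvalue problem. A sketch in Python/NumPy/SciPy; note that the unregularized problem can reach spuriously perfect correlation when the Gram matrices are invertible, so the small ridge term eps on the constraints below is a common fix and an assumption beyond the formulation above:

import numpy as np
from scipy.linalg import eigh

def kernel_cca(Ka, Kb, eps=1e-3):
    # First kernel canonical pair (alpha, beta) for two Gram matrices of the same n points
    n = Ka.shape[0]
    Z = np.zeros((n, n))
    # Generalized eigenproblem  A w = rho B w  with w = [alpha; beta]
    A = np.block([[Z, Ka @ Kb], [Kb @ Ka, Z]])
    B = np.block([[Ka @ Ka + eps * np.eye(n), Z], [Z, Kb @ Kb + eps * np.eye(n)]])
    vals, vecs = eigh(A, B)               # B is symmetric positive definite thanks to eps
    w = vecs[:, np.argmax(vals)]          # eigenvector with the largest correlation
    return w[:n], w[n:]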
Kernel Canonical Correlation Analysis
191 / 635
Kernel Canonical Correlation Analysis
Figure: http://www.tylervigen.com/.
192 / 635
Spurious correlations
Spurious correlations are bad:
Figure: http://www.tylervigen.com/.
193 / 635
Kernel Canonical Correlation Analysis
194 / 635
Kernel Canonical Correlation Analysis
195 / 635
Motivations
The RKHS norm is related to the smoothness of functions.
Smoothness of a function is naturally quantified by Sobolev norms
(in particular L2 norms of derivatives).
Example: spline regression
min_f  ∑_{i=1}^n (yi − f(xi))² + λ ∫ ( f″(t) )² dt
200 / 635
The RKHS point of view
Theorem
Let H = { f : [0,1] → R absolutely continuous, f′ ∈ L²([0,1]), f(0) = 0 }, endowed with the inner product ⟨f, g⟩_H = ∫₀¹ f′(u) g′(u) du.
Then H is a RKHS with r.k. given by:
K(x, y) = min(x, y) .
202 / 635
Proof (2/5)
H is a pre-Hilbert space of functions
H is a vector space of functions, and hf , g iH a bilinear form that
satisfies hf , f iH 0.
f absolutely continuous implies differentiable almost everywhere, and
∀x ∈ [0,1], f(x) = f(0) + ∫₀^x f′(u) du .
By Cauchy-Schwarz:
|f(x)| = | ∫₀^x f′(u) du | ≤ √x ( ∫₀¹ f′(u)² du )^{1/2} = √x · ⟨f, f⟩_H^{1/2} .
203 / 635
Proof (3/5)
H is a Hilbert space
To show that H is complete, let (fn )nN a Cauchy sequence in H
(fn0 )nN is a Cauchy sequence in L2 [0, 1], thus converges to
g L2 [0, 1]
By the previous inequality, (fn (x))nN is a Cauchy sequence and
thus converges to a real number f (x), for any x [0, 1]. Moreover:
Z x Z x
0
f (x) = lim fn (x) = lim fn (u)du = g (u)du ,
n n 0 0
204 / 635
Proof (4/5)
∀x ∈ [0,1], Kx ∈ H
Let Kx(y) = K(x, y) = min(x, y) on [0,1]²:
(Figure: the function t ↦ K(s, t) = min(s, t).)
Kx is differentiable except at s, has a square integrable derivative,
and Kx (0) = 0, therefore Kx H for all x [0, 1].
205 / 635
Proof (5/5)
For all x, f , hf , Kx iH = f (x)
For any x ∈ [0,1] and f ∈ H we have:
⟨f, Kx⟩_H = ∫₀¹ f′(u) Kx′(u) du = ∫₀^x f′(u) du = f(x) ,
206 / 635
Generalization
Theorem
Let X = Rd and D a differential operator on a class of functions H such
that, endowed with the inner product:
(f , g ) H2 , hf , g iH = hDf , Dg iL2 (X ) ,
it is a Hilbert space.
Then H is a RKHS that admits as r.k. the Green function of the
operator D D, where D denotes the adjoint operator of D.
207 / 635
Green function?
Definition
Let the differential equation on H:
f = Dg ,
208 / 635
Proof
Let H be a Hilbert space endowed with the inner product:
hf , g iX = hDf , Dg iL2 (X ) ,
209 / 635
Example
Back to our example, take X = [0,1] and Df(u) = f′(u).
To find the r.k. of H we need to solve in kx:
f(x) = ⟨D*D kx, f⟩_{L²([0,1])} = ⟨D kx, D f⟩_{L²([0,1])} = ∫₀¹ kx′(u) f′(u) du .
The solution is
kx′(u) = 1_{[0,x]}(u) ,
which gives
kx(u) = u if u ≤ x, x otherwise,
and therefore
K(x, x′) = min(x, x′) .
210 / 635
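A quick numerical sanity check of this reproducing property, as a sketch in Python/NumPy: the integral is discretized on a grid and f is an arbitrary smooth function with f(0) = 0.

import numpy as np

u = np.linspace(0.0, 1.0, 100001)
f = np.sin(3.0 * u) * u                 # some f with f(0) = 0 and square-integrable f'
x = 0.37
kx = np.minimum(u, x)                   # K_x(u) = min(x, u)
du = u[1] - u[0]
inner = np.sum(np.gradient(f, u) * np.gradient(kx, u)) * du   # <f, K_x>_H = int_0^1 f' k_x'
print(inner, np.interp(x, u, f))        # both approximately equal f(x)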
Outline
211 / 635
Mercer kernels
Definition
A kernel K on a set X is called a Mercer kernel if:
1 X is a compact metric space (typically, a closed bounded subset of
Rd ).
2 K : X X R is a continuous p.d. kernel (w.r.t. the Borel
topology)
Motivations
We can exhibit an explicit and intuitive feature space for a large
class of p.d. kernels
Historically, provided the first proof that a p.d. kernel is an inner
product for non-finite sets X (Mercer, 1905).
Can be thought of as the natural generalization of the factorization
of positive semidefinite matrices over infinite spaces.
212 / 635
Sketch of the proof that a Mercer kernel is an inner
product
1 The kernel matrix when X is finite becomes a linear operator when
X is a metric space.
2 The matrix was positive semidefinite in the finite case, the linear
operator is self-adjoint and positive in the metric case.
3 The spectral theorem states that any compact linear operator
admits a complete orthonormal basis of eigenfunctions, with
non-negative eigenvalues (just like positive semidefinite matrices can
be diagonalized with nonnegative eigenvalues).
4 The kernel function can then be expanded over the basis of eigenfunctions as:
K(x, t) = ∑_{k=1}^∞ λk ψk(x) ψk(t) .
Recall that a linear operator L on a Hilbert space is self-adjoint if ⟨f, Lg⟩ = ⟨Lf, g⟩ for all f, g, and positive if ⟨f, Lf⟩ ≥ 0 for all f.
214 / 635
An important lemma
The linear operator
Let μ be any Borel measure on X, and L²μ(X) the Hilbert space of square integrable functions on X.
For any function K : X² → R, define the operator LK by:
∀f ∈ L²μ(X), (LK f)(x) = ∫ K(x, t) f(t) dμ(t) .
Lemma
If K is a Mercer kernel, then LK is a compact and bounded linear
operator over L2 (X ), self-adjoint and positive.
215 / 635
Proof (1/6)
LK is a mapping from L2 (X ) to L2 (X )
For any f L2 (X ) and (x1 , x1 ) X 2 :
Z
| LK f (x1 ) LK f (x2 ) | = (K (x1 , t) K (x2 , t)) f (t) d (t)
k K (x1 , ) K (x2 , ) kk f k
(Cauchy-Schwarz)
p
(X ) max | K (x1 , t) K (x2 , t) | k f k.
tX
216 / 635
Proof (2/6)
LK is linear and continuous
Linearity is obvious (by definition of LK and linearity of the integral).
For continuity, we observe that for all f L2 (X ) and x X :
Z
| (LK f ) (x) | = K (x, t) f (t) d (t)
p
(X ) max | K (x, t) | k f k
tX
p
(X )CK k f k.
217 / 635
Proof (3/6)
Criterion for compactness
In order to prove the compactness of LK we need the following criterion.
Let C (X ) denote the set of continuous functions on X endowed with
infinite norm k f k = maxxX | f (x) |.
A set of functions G C (X ) is called equicontinuous if:
Ascoli Theorem
A part H C (X ) is relatively compact (i.e., its closure is compact) if
and only if it is uniformly bounded and equicontinuous.
218 / 635
Proof (4/6)
LK is compact
Let (fn )n0 be a bounded sequence of L2 (X ) (k fn k < M for all n).
The sequence (LK fn )n0 is a sequence of continuous functions, uniformly
bounded because:
p p
k LK fn k (X )CK k fn k (X )CK M .
It is equicontinuous because:
p
| LK fn (x1 ) LK fn (x2 ) | (X ) max | K (x1 , t) K (x2 , t) | M .
tX
219 / 635
Proof (5/6)
LK is self-adjoint
K being symmetric, we have for all f , g H:
Z
hf , Lg i = f (x) (Lg ) (x) (dx)
Z Z
= f (x) g (t) K (x, t) (dx) (dt) (Fubini)
= hLf , g i .
220 / 635
Proof (6/6)
LK is positive
We can approximate the integral by finite sums:
Z Z
hf , Lf i = f (x) f (t) K (x, t) (dx) (dt)
k
(X ) X
= lim K (xi , xj ) f (xi ) f (xj )
k k 2
i,j=1
0,
221 / 635
Diagonalization of the operator
We need the following general result:
Spectral theorem
Let L be a compact linear operator on a Hilbert space H. Then there
exists in H a complete orthonormal system (1 , 2 , . . .) of eigenvectors
of L. The eigenvalues (1 , 2 , . . .) are real if L is self-adjoint, and
non-negative if L is positive.
Remark
This theorem can be applied to LK . In that case the eigenfunctions k
associated to the eigenfunctions k 6= 0 can be considered as continuous
functions, because:
1
k = LK .
k
222 / 635
Main result
Mercer Theorem
Let X be a compact metric space, a Borel measure on X , and K a
continuous p.d. kernel. Let (1 , 2 , . . .) denote the nonnegative
eigenvalues of LK and (1 , 2 , . . .) the corresponding eigenfunctions.
Then all functions k are continuous, and for any x, t X :
X
K (x, t) = k k (x) k (t) ,
k=1
223 / 635
Mercer kernels as inner products
Corollary
The mapping
: X 7 l 2
p
x 7 k k (x)
kN
224 / 635
Proof of the corollary
k k2 (x) converges
P
By Mercer theorem we see that for all x X ,
to K (x, x) < , therefore (x) l 2 .
The continuity of results from:
X
k (x) (t) k2l 2 = k (k (x) k (t))2
k=1
= K (x, x) + K (t, t) 2K (x, t)
225 / 635
Summary
This proof extends the proof valid when X is finite.
This is a constructive proof, developed by Mercer (1905).
The eigensystem (k and k ) depend on the choice of the measure
(dx): different s lead to different feature spaces for a given
kernel and a given space X
Compactness and continuity are required. For instance, for X = Rd ,
the eigenvalues of:
Z
K (x, t) (t) dt = (x)
X
are not necessarily countable, Mercer theorem does not hold. Other
tools are thus required such as the Fourier transform for
shift-invariant kernels.
226 / 635
Example (1/6)
227 / 635
Example (2/6)
Let a p.d. kernel on S d1 of the form:
K (x, t) = x> t ,
2f 2f
f = + . . . + =0
x12 xd2
228 / 635
Example (3/6)
Definition (Spherical harmonics)
A homogeneous polynomial of degree k 0 in Rd whose Laplacian
vanishes is called a homogeneous harmonic of order k.
A spherical harmonic of order k is a homogeneous harmonic of order
k on the unit sphere S d1
The set Yk (d) of spherical harmonics is a vector space of dimension
(2k + d 2)(k + d 3)!
N(n, k) = dim (Yk (d)) = .
k!(d 2)!
229 / 635
Example (4/6)
Spherical harmonics form the Mercers eigenfunctions, because:
Theorem (Funk-Hecke) [e.g., Muller, 1998, p.30]
For any x S d1 , Yk Yk (d) and C ([1, 1]),
Z
>
x t Yk (t) d(t) = k Yk (x)
S d1
where
Z 1 d3
k = S d2 (t)Pk (d; t)(1 t 2 ) 2 dt
1
d1
Z 1
d2
k+ d3
k = S 2
d1
(k) (t) 1 t 2 2
dt
k
2 k+ 2 1
230 / 635
Example (5/6)
N(d;k)
For any k 0, let {Yk,j (d; x)}j=1 an orthonormal basis of Yk (d)
N(d;k)
n o
Spherical harmonics {Yk,j (d; x)}j=1 form an orthonormal
k=0
basis for L2 S d1
231 / 635
Example (6/6)
2
Take d = 2 and K (x, t) = 1 + x> t for x, t S 1
Using Rodrigeus rule we get 3 nonzero eigenvalues:
0 = 3 , 1 = 2 , 2 =
2
with multiplicities 1, 2 and 2
Corresponding eigenfunctions:
x1 x2 x1 x2 x12 x22
1
, , , ,
2
The resulting Mercer feature map is
!
3
r
x12 x22
(x) = , 2x1 , 2x2 , 2x1 x2 ,
2 2
233 / 635
Reminder: expansion of Mercer kernel
Theorem
Denote by LK the linear operator of L2 (X ) defined by:
Z
f L2 (X ) , (LK f ) (x) = K (x, t) f (t) d (t) .
234 / 635
RKHS construction
Theorem
Assuming that all eigenvalues are positive, the RKHS is the Hilbert
space:
( )
X X a 2
HK = f L2 (X ) : f = ai i , with k
<
k
i=1 k=1
Remark
If some eigenvalues are equal to zero, then the result and the proof remain valid
on the subspace spanned by the eigenfunctions with positive eigenvalues.
235 / 635
Proof (1/6)
Sketch
In order to show that HK is the RKHS of the kernel K we need to show
that:
1 it is a Hilbert space of functions from X to R,
2 for any x X , Kx HK ,
3 for any x X and f HK , f (x) = hf , Kx iHK .
236 / 635
Proof (2/6)
HK is a Hilbert space
Indeed the function:
1
LK2 :L2 (X ) HK
X
X p
ai i 7 ai i i
i=1 i=1
237 / 635
Proof (3/6)
HK is a space of continuous functions
P
For any f = i=1 ai i HK , and x X , we have (if f (x) makes sense):
X X a p
i
| f (x) | = ai i (x) = i i (x)
i=1
i=1
i
!1
!1
X ai2 2 X 2
2
. i i (x)
i
i=1 i=1
1
= k f kHK K (x, x) 2
p
= k f kHK CK .
238 / 635
Proof (4/6)
HK is a space of continuous functions (cont.)
Let now fn = ni=1 ai i HK . The functions i are continuous
P
functions, therefore fn is also continuous, for all n. The fn s are
convergent in HK , therefore also in the (complete) space of continuous
functions endowed with the uniform norm.
Let fc the continuous limit function. Then fc L2 (X ) and
k fn fc kL2 (X ) 0.
n
k f fn kL2 (X ) 1 k f fn kHK 0,
n
therefore f = fc .
239 / 635
Proof (5/6)
Kx HK
For any x X let, for all i, ai = i i (x). We have:
X a2 X
i
= i i (x)2 = K (x, x) < ,
i
i=1 i=1
P
therefore x := i=1 ai i HK . As seen earlier the convergence in HK
implies pointwise convergence, therefore for any t X :
X
X
x (t) = ai i (t) = i i (x) i (t) = K (x, t) ,
i=1 i=1
therefore x = Kx HK .
240 / 635
Proof (6/6)
f (x) = hf , Kx iHK
P
Let f = i=1 ai i HK , et x X . We have seen that:
X
Kx = i i (x) i ,
i=1
therefore:
X i i (x) ai X
hf , Kx iHK = = ai i (x) = f (x) ,
i
i=1 i=1
241 / 635
Remarks
Although HK was built from the eigenfunctions of LK , which depend
on the choice of the measure (x), we know by uniqueness of the
RKHS that HK is independant of and LK .
Mercer theorem provides a concrete way to build the RKHS, by
taking linear combinations of the eigenfunctions of LK (with
adequately chosen weights).
The eigenfunctions (i )iN form an orthogonal basis of the RKHS:
1
hi , j iHK = 0 si i 6= j, k i kHK = .
i
The RKHS is a well-defined ellipsoid with axes given by the
eigenfunctions.
242 / 635
Outline
243 / 635
Motivation
Let us suppose that X is not compact, for example X = Rd .
In that case, the eigenvalues of:
Z
K (x, t) (t) d(t) = (t)
X
244 / 635
Fourier-Stieltjes transform on the torus
Let T the torus [0, 2] with 0 and 2 identified
C (T) the set of continuous functions on T
M(T) the finite complex Borel measures2 on T
M(T) can be identified as the dual space (C (T)) : for any
continuous/bounded linear functional : C (T) C there exists
1
R
M(T) such that (f ) = 2 T f (t)d(t) (Riesz theorem).
246 / 635
Translation invariant kernels on Z
Definition
A kernel K : Z Z 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Z , K (x, y) = axy
for some sequence {an }nZ . Such a sequence is called positive definite if
the corresponding kernel K is p.d.
Theorem (Herglotz)
A sequence {an }nZ is p.d. if and only if it is the Fourier-Stieltjes
transform of a positive measure M(T)
246 / 635
Examples
Diagonal kernel:
(
1 if n = 0 ,
Z
1
= dt , an = (n) = e int dt =
2 T 0 otherwise.
resulting in K (x, t) = C
247 / 635
Proof of Herglotzs theorem:
If an = (n) for M(T) positive, then for any n N, x1 , . . . , xn Z
and z1 , . . . , zn R (or C) :
n X
n n n Z
X 1 XX
zi zj axi xj = zi zj e i(xi xj )t d(t)
2 T
i=1 j=1 i=1 j=1
n n Z
1 XX
= zi zj e ixi t e ixj t d(t)
2 T
i=1 j=1
Z X n
1
= | zj e ixj t |2 d(t)
2 T
j=1
0.
248 / 635
Proof of Herglotzs theorem: (1/4)
Let {an }nZ a p.d. sequence
For a given t R and N N let {zn }nZ be
(
e int if | n | N ,
zn =
0 otherwise.
Since {an }nZ is p.d. we get:
N
X N
X N
X N
X
0 akl zk zl = akl e i(kl)t
k=N l=N k=N l=N
2N
X
= (2N + 1 |k|)ak e ikt
k=2N
|k|
1 X
= max 0, 1 ak e ikt
2N + 1 2N + 1
kZ
| {z }
2N (t)
249 / 635
Proof of Herglotzs theorem: (2/4)
dN = N (t)dt is a positive measure (for N even) and satisfies
N
|j| |n|
X Z
i(nj)t
N (n) = aj 1 e = an max 0, 1
N +1 T N +1
j=N
Moreover
Z
k N kM(T) = sup f (t)N (t)dt
k f k 1 T
Z
= N (t)dt (take f = 1 because N (t) 0)
T
N Z
|n|
X
= an 1 e int dt
T N +1
n=N
= a0
250 / 635
Proof of Herglotzs theorem: (3/4)
For any P
trigonometric polynomial of the form
P(t) = K ikt
k=K bk e , with Fourier coefficient P(n) = bn , we have
Z
lim P(t)dN (t)
N+ T
K N Z
|n|
X X
= lim an bk 1 e i(nk)t dt
N+ T N +1
k=K n=N
K
|n|
X
= ak bk lim 1
N+ N +1
k=K
K
X
= ak b k
k=K
X
= ak P(k)
kZ
251 / 635
Proof of Herglotzs theorem: (4/4)
P
This shows that (P) = kZ ak P(k) is a linear functional over
trigonometric polynomials, with norm a0
It can be extended to all continuous functions because trigonometric
polynomials are dense in C (T)
By Riesz representation theorem, there exists a measure M(T)
such that k kM(T) a0
Z
f C (T) , (f ) = f (t)d(t)
T
252 / 635
Fourier transform on Rd
Definition
For any f L1 Rd , the Fourier transform of f is the function:
Z
>
Rd , f () = e ix f (x) dx .
Rd
253 / 635
Fourier transform on Rd
Properties
f is complex-valued, continuous, tends to 0 at infinity and
k f kL k f kL1 .
If f L1 Rd , then the inverse Fourier formula holds:
Z
1 >
x Rd , f (x) = d
e ix f () d.
(2) Rd
Z Z
1 2
| f (x) |2 dx =
f () d .
d
Rd (2) R d
254 / 635
Fourier-Stieltjes transform on Rd
C0 (Rd ) the set of continuous functions on Rd that vanish at infinity
M(Rd ) the finite complex Borel measures on Rd
M(Rd ) can be identified as the dual space C0 (Rd ) : for any
continuous/bounded linear functional : C0 (Rd ) C there exists
M(Rd ) such that (f ) = Rd f (t)d(t) (Riesz theorem).
R
255 / 635
Fourier-Stieltjes transform on Rd
This extends the standard Fourier transform for integrable functions
by taking d(x) = f (x)dx.
For M(Rd ), is still uniformly continuous, but () does not
necessarily go to 0 at infinity (e.g., take the Dirac = 0 , then
() = 1 for all )
Parsevals formula becomes: if M(Rd ), and both g , g are in
L1 (Rd ), then
Z Z
1
g (x)d(x) = g ()()d
Rd (2)d Rd
256 / 635
Translation invariant kernels on Rd
Definition
A kernel K : Rd Rd 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Rd , K (x, y) = (x y)
for some function : Rd R. Such a function is called positive
definite if the corresponding kernel K is p.d.
257 / 635
Translation invariant kernels on Rd
Definition
A kernel K : Rd Rd 7 R is called translation invariant (t.i.), or
shift-invariant, if it only depends on the difference between its argument,
i.e.:
x, y Rd , K (x, y) = (x y)
for some function : Rd R. Such a function is called positive
definite if the corresponding kernel K is p.d.
Theorem (Bochner)
A continuous function : Rd R is p.d. if and only if it is the
Fourier-Stieltjes transform of a symmetric and positive finite Borel
measure M(T)
257 / 635
Proof of Bochners theorem:
If = for some M(T) positive, then for any n N,
x1 , . . . , xn Rd and z1 , . . . , zn R (or C) :
n X
n n X
n Z
>
X X
zi zj (xi xj ) = zi zj e i(xi xj ) t d(t)
i=1 j=1 i=1 j=1 Rd
n X n Z
X > >
= zi zj e ixi t e ixj t d(t)
i=1 j=1 Rd
Z n
>
zj e ixj t |2 d(t)
X
= |
Rd j=1
0.
258 / 635
Proof of Bochners theorem: (1/5)
Lemma
Let : R R continuous. If there exists C 0 such that
Z
1
g ()()d C sup | g (x) |
2
R
xR
261 / 635
Proof of Bochners theorem: (3/5)
In addition, for any n Z:
Z
1
G (n) = e int G (t)dt
2 T
1 X 2 int
Z
t + 2m
= e g dt
2 0
mZ
Z 2(m+1)
X
= e in(u+2m) g (u)du
2 2m
mZ
2(m+1)
XZ
= e inu g (u)du
2 2m
mZ
Z
= e inu g (u)du
2 R
= g (n)
2
262 / 635
Proof of Bochners theorem: (4/5)
This gives:
X X
g (n)(n) = G (n) (n)
2
nZ nZ
Z
1
= G (t)d (t) (Parceval)
2 T
k kM(T) sup | G (t) |
tT
C sup | G (t) |
tT
C sup | g (x) | + C
xR
with C = (0).
263 / 635
Proof of Bochners theorem: (5/5)
Putting it all together gives:
Z
1
2 g ()()d < C sup | g (x) | + (C + 1)
R xR
2
f ()
Z
1
k f k2K := d < + ,
(2)d Rd ()
f()g ()
Z
1
hf , g i := d
(2)d Rd ()
265 / 635
Proof
H is a Hilbert space: exercise.
For x Rd , Kx (y) = K (x, y) = (x y) therefore:
Z
> >
Kx () = e i u (u x)du = e i x () .
Kx ()f ()
Z Z
1 1
hf , Kx iH = d = f ()e i.x d
(2)d Rd () (2)d Rd
= f (x)
266 / 635
Example
Gaussian kernel
(xy )2
K (x, y ) = e 2 2
corresponds to:
2 2
() = e 2
and Z 2 2 2
H= f : f () e 2 d < .
267 / 635
Example
Laplace kernel
1
K (x, y ) = e | xy |
2
corresponds to:
() =
2 + 2
and ( )
Z 2 2 + 2
H= f : f() d < ,
268 / 635
Example
Low-frequency filter
sin ((x y ))
K (x, y ) =
(x y )
corresponds to:
() = U ( + ) U ( )
and ( Z )
2
H= f : f () d = 0 ,
| |>
269 / 635
Outline
270 / 635
Generalization to semigroups (cf Berg et al., 1983)
Definition
A semigroup (S, ) is a nonempty set S equipped with an
associative composition and a neutral element e.
A semigroup with involution (S, , ) is a semigroup (S, ) together
with a mapping : S S called involution satisfying:
1 (s t) = t s , for s, t S.
2 (s ) = s for s S.
Examples
Any group (G , ) is a semigroup with involution when we define
s = s 1 .
Any abelian semigroup (S, +) is a semigroup with involution when
we define s = s, the identical involution.
271 / 635
Positive definite functions on semigroups
Definition
Let (S, , ) be a semigroup with involution. A function : S R is
called positive definite if the function:
s, t S, K (s, t) = (s t)
is a p.d. kernel on S.
272 / 635
Semicharacters
Definition
A function : S C on an abelian semigroup with involution (S, +, )
is called a semicharacter if
1 (0) = 1,
2 (s + t) = (s)(t) for s, t S,
3 (s ) = (s) for s S.
The set of semicharacters on S is denoted by S .
Remarks
If is the identity, a semicharacter is automatically real-valued.
If (S, +) is an abelian group and s = s, a semicharacter has its
values in the circle group {z C | | z | = 1} and is a group character.
273 / 635
Semicharacters are p.d.
Lemma
Every semicharacter is p.d., in the sense that:
K (s, t) = K (t, s),
Pn
i,j=1 ai aj K (xi , xj ) 0.
Proof
Direct from definition, e.g.,
n
X n
X
ai aj xi + xj = ai aj (xi ) (xj ) 0 .
i,j=1 i,j=1
Examples
(t) = e t on (R, +, Id).
(t) = e it on (R, +, ).
274 / 635
Integral representation of p.d. functions
Definition
An function : S R on a semigroup with involution is called an absolute
value if (i) (e) = 1, (ii)(s t) (s)(t), and (iii) (s ) = (s).
A function f : S R is called exponentially bounded if there exists an
absolute value and a constant C > 0 s.t. | f (s) | C (s) for s S.
Theorem
Let (S, +, ) an abelian semigroup with involution. A function : S R is p.d.
and exponentially bounded (resp. bounded) if and only if it has a representation
of the form: Z
(s) = (s)d() .
S
275 / 635
Proof
Sketch (details in Berg et al., 1983, Theorem 4.2.5)
For an absolute value , the set P1 of -bounded p.d. functions
that satisfy (0) = 1 is a compact convex set whose extreme points
are precisely the -bounded semicharacters.
If is p.d. and exponentially bounded then there exists an absolute
value such that (0)1 P1 .
By the Krein-Milman theorem there exits a Radon probability
measure on P1 having (0)1 as barycentre.
Remarks
The result is not true without the assumption of exponentially
bounded semicharacters.
In the case of abelian groups with s = s this reduces to
Bochners theorem for discrete abelian groups, cf. Rudin (1962).
276 / 635
Example 1: (R+ , +, Id)
Semicharacters
S = (R+ , +, Id) is an abelian semigroup.
2
P.d. functions are nonnegative, because (x) = x .
The set of bounded semicharacters is exactly the set of functions:
s R+ 7 a (s) = e as ,
277 / 635
Example 1: (R+ , +, Id) (cont.)
P.d. functions
By the integral representation theorem for bounded semi-characters
we obtain that a function : R+ R is p.d. and bounded if and
only if it has the form:
Z
(s) = e as d(a) + b (s)
0
278 / 635
Example 2: Semigroup kernels for finite measures (1/6)
Setting
We assume that data to be processed are bags-of-points, i.e., sets
of points (with repeats) of a space U.
Example : a finite-length string as a set of k-mers.
How to define a p.d. kernel between any two bags that only depends
on the union of the bags?
See details and proofs in Cuturi et al. (2005).
279 / 635
Example 2: Semigroup kernels for finite measures (2/6)
Semigroup of bounded measures
We can represent any bag-of-point x as a finite measure on U:
X
x= ai xi ,
i
280 / 635
Example 2: Semigroup kernels for finite measures (3/6)
Semicharacters
For any Borel measurable function f : U R the function
f : Mb+ (U) R defined by:
f () = e [f ]
281 / 635
Example 2: Semigroup kernels for finite measures (4/6)
Corollary
Let U be a Hausdorff space. For any Radon measure Mc+ (C (U))
with compact support on the Hausdorff space of continuous real-valued
functions on U endowed with the topology of pointwise convergence, the
following function K is a continuous p.d. kernel on Mb+ (U) (endowed
with the topology of weak convergence):
Z
K (, ) = e [f ]+[f ] d(f ) .
C (X )
Remarks
The converse is not true: there exist continuous p.d. kernels that do not have
this integral representation (it might include non-continuous semicharacters)
282 / 635
Example 2: Semigroup kernels for finite measures (5/6)
Example : entropy kernel
Let X be the set of probability densities (w.r.t. some reference
measure) on U with finite entropy:
Z
h(x) = x ln x .
U
Remark: only valid for densities (e.g., for a kernel density estimator
from a bag-of-parts)
283 / 635
Example 2: Semigroup kernels for finite measures (6/6)
Examples : inverse generalized variance kernel
Let U = Rd and MV + (U) be the set of finite measure with second
order moment and non-singular variance
h i
() = xx > [x] [x]> .
284 / 635
Application of semigroup kernel
Weighted linear PCA of two different measures, with the first PC shown.
Variances captured by the first and second PC are shown. The
generalized variance kernel is the inverse of the product of the two values.
285 / 635
Kernelization of the IGV kernel
Motivations
Gaussian distributions may be poor models.
The method fails in large dimension
Solution
1 Regularization:
1
K , 0 =
0
.
det +
2 + I d
2 Kernel trick: the non-zero eigenvalues of UU > and U > U are the
same = replace the covariance matrix by the centered Gram
matrix (technical details in Cuturi et al., 2005).
286 / 635
Illustration of kernel IGV kernel
287 / 635
Semigroup kernel remarks
Motivations
A very general formalism to exploit an algebraic structure of the
data.
Kernel IVG kernel has given good results for character recognition
from a subsampled image.
The main motivation is more generally to develop kernels for
complex objects from which simple patches can be extracted.
The extension to nonabelian groups (e.g., permutation in the
symmetric group) might find natural applications.
288 / 635
Kernel examples: Summary
Many notions of smoothness can be translated as RKHS norms for
particular kernels (eigenvalues convolution operator, Sobolev norms
and Green operators, Fourier transforms...).
There is no uniformly best kernel, but rather a large toolbox of
methods and tricks to encode prior knowledge and exploit the
nature or structure of the data.
In the following sections we focus on particular data and
applications to illustrate the process of kernel design.
289 / 635
Outline
2 Kernel tricks
291 / 635
Motivation
Kernel methods are sometimes criticized for their lack of flexibility: a
large effort is spent in designing by hand the kernel.
Question
How do we design a kernel adapted to the data?
Answer
A successful strategy is given by kernels for generative models, which
are/have been the state of the art in many fields, including
representation of image and sequence data representation.
Parametric model
A model is a family of distributions
{P , Rm } M+
1 (X ) .
291 / 635
Outline
292 / 635
Fisher kernel
Definition
Fix a parameter 0 (obtained for instance by maximum
likelihood over a training set).
For each sequence x, compute the Fisher score vector:
293 / 635
Fisher kernel properties (1/2)
The Fisher score describes how each parameter contributes to the
process of generating a particular example
A kernel classifier employing the Fisher kernel derived from a model
that contains the label as a latent variable is, asymptotically, at least
as good as the MAP labelling based on the model (Jaakkola and
Haussler, 1999).
A variant of the Fisher kernel (called the Tangent of Posterior
kernel) can also improve over the direct posterior classification by
helping to correct the effect of estimation errors in the parameter
(Tsuda et al., 2002).
294 / 635
Fisher kernel properties (2/2)
Lemma
The Fisher kernel is invariant under change of parametrization.
295 / 635
Fisher kernel in practice
296 / 635
Fisher kernels: example with Gaussian data model (1/2)
Consider a normal distribution N (, 2 ) and denote by = 1/ 2 the
inverse variance, i.e., precision parameter. With = (, ), we have
1 1 1
log P (x) = log log(2) (x )2 ,
2 2 2
and thus
log P (x) log P (x) 1 1 2
= (x ), = (x ) ,
2
and (exercise)
0
I() = .
0 (1/2)2
The Fisher vector is then
(x) = (x )/ .
(1/ 2)(1 (x )2 / 2 )
297 / 635
Fisher kernels: example with Gaussian data model (2/2)
Now consider an i.i.d. data model over a set of data points x1 , . . . , xn all
distributed according to N (, 2 ):
n
Y
P (x1 , . . . , xn ) = P (xi ).
i=1
Then, the Fisher vector is given by the sum of Fisher vectors of the
points.
Encodes the discrepancy in the first and second order moment of
the data w.r.t. those of the model.
n
X ( )/
(x1 , . . . , xn ) = (xi ) = n ,
( 2 2 )/( 2 2 )
i=1
where
n n
1X 1X
= xi and = (xi )2 .
n n
i=1 i=1
298 / 635
Application: Aggregation of visual words (1/4)
Patch extraction and description stage:
In various contexts, images may be described as a set of
patches x1 , . . . , xn computed at interest points. For example, SIFT,
HOG, LBP, color histograms, convolutional features...
Coding stage: The set of patches is then encoded into a single
representation (xi ), typically in a high-dimensional space.
Pooling stage: For example, sum pooling
n
X
(x1 , . . . , xn ) = (xi ).
i=1
299 / 635
Application: Aggregation of visual words (2/4)
Let = (j , j , j )j=1 ...,k be the parameters of a GMM with k Gaussian
components. Then, the probabilistic model is given by
k
X
P (x) = j N (x; j , j ).
j=1
Remarks
Each mixture component corresponds to a visual word, with a mean,
variance, and mixing weight.
Diagonal covariances j = diag (j1 , . . . , jp ) = diag ( j ) are often
used for simplicity.
This is a richer model than the traditional bag of words approach.
The probabilistic model is learned offline beforehand.
300 / 635
Application: Aggregation of visual words (3/4)
After cumbersome calculations (exercise), we obtain (x1 , . . . , xn ) =
with
n
1 X
j (X) = ij (xi j )/ j
n j
i=1
n
1 X
ij (xi j )2 / 2j 1 ,
j (X) = p
n 2j i=1
301 / 635
Application: Aggregation of visual words (4/4)
Finally, we also have the following interpretation of encoding first and
second-order statistics:
j
j (X) = (j j )/ j
j
j
j (X) = p ( 2j 2j )/ 2j ,
2j
with
n n n
X 1 X 1 X
j = ij and j = ij xi and j = ij (xi j )2 .
j j
i=1 i=1 i=1
302 / 635
Relation to classification with generative models (1/3)
Assume that we have a generative probabilistic model P to model
random variables (X , Y ) where Y is a label in {1, . . . , p}.
Assume that the marginals P (Y = k) = k are among the model
parameters , which we can also parametrize as
e k
P (Y = k) = k = Pp k 0
.
k 0 =1 e
303 / 635
Relation to classification with generative models (2/3)
Then, consider the Fisher score
1
log P (x) = P (x)
P (x)
p
1 X
= P (x, Y = k)
P (x)
k=1
p
1 X
= P (x, Y = k) log P (x, Y = k)
P (x)
k=1
p
X
= P (Y = k|x)[ log k + log P (x|Y = k)].
k=1
In particular (exercise)
log P (x)
= P (Y = k|x) k .
k
304 / 635
Relation to classification with generative models (3/3)
The first p elements in the Fisher score are given by class posteriors
minus a constant
Bayes rule is implemented via this simple classifier using Fisher kernel.
305 / 635
Outline
306 / 635
Mutual information kernels
Definition
Chose a prior w (d) on the measurable set .
Form the kernel (Seeger, 2002):
Z
0
P (x)P (x0 )w (d) .
K x, x =
(x) = (P (x)) .
307 / 635
Example: coin toss
Let P (X = 1) = and P (X = 0) = 1 a model for random coin
toss, with [0, 1].
Let d be the Lebesgue measure on [0, 1]
The mutual information kernel between x = 001 and x0 = 1010 is:
(
P (x) = (1 )2 ,
P (x0 ) = 2 (1 )2 ,
Z 1
3!4! 1
0
3 (1 )4 d =
K x, x = = .
0 8! 280
308 / 635
Outline
309 / 635
Marginalized kernels
Definition
For any observed data x X , let a latent variable y Y be
associated probabilistically through a conditional probability Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then, the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):
310 / 635
Marginalized kernels: proof of positive definiteness
KZ is p.d. on Z. Therefore, there exists a Hilbert space H and
Z : Z H such that:
KZ z, z0 = Z (z) , Z z0 H .
therefore KX is p.d. on X .
311 / 635
Marginalized kernels: proof of positive definiteness
KZ is p.d. on Z. Therefore, there exists a Hilbert space H and
Z : Z H such that:
KZ z, z0 = Z (z) , Z z0 H .
therefore KX is p.d. on X .
Of course, we make the right assumptions such that each operation
above is valid, and all quantities are well defined.
311 / 635
Outline
2 Kernel tricks
313 / 635
Short history of genomics
314 / 635
A cell
315 / 635
Chromosomes
316 / 635
Chromosomes and DNA
317 / 635
Structure of DNA
We wish to suggest a
structure for the salt of
desoxyribose nucleic acid
(D.N.A.). This structure have
novel features which are of
considerable biological
interest (Watson and Crick,
1953)
318 / 635
The double helix
319 / 635
Central dogma
320 / 635
Proteins
321 / 635
Genetic code
322 / 635
Human genome project
Goal : sequence the 3,000,000,000 bases of the human genome
Consortium with 20 labs, 6 countries
Cost : between 0.5 and 1 billion USD
323 / 635
2003: End of genomics era
Findings
About 25,000 genes only (representing 1.2% of the genome).
Automatic gene finding with graphical models.
97% of the genome is considered junk DNA.
Superposition of a variety of signals (many to be discovered).
324 / 635
Cost of human genome sequencing
325 / 635
Protein sequence
326 / 635
Challenges with protein sequences
A protein sequences can be seen as a variable-length sequence over
the 20-letter alphabet of amino-acids, e.g., insuline:
FVNQHLCGSHLVEALYLVCGERGFFYTPKA
These sequences are produced at a fast rate (result of the
sequencing programs)
Need for algorithms to compare, classify, analyze these sequences
Applications: classification into functional or structural classes,
prediction of cellular localization and interactions, ...
327 / 635
Example: supervised sequence classification
Data (training)
Secreted proteins:
MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA...
MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW...
MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL...
...
Non-secreted proteins:
MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG...
MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG...
MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..
...
Goal
Build a classifier to predict whether new proteins are secreted or not.
328 / 635
Supervised classification with vector embedding
The idea
Map each string x X to a vector (x) F.
Train a classifier for vectors on the images (x1 ), . . . , (xn ) of the
training set (nearest neighbor, linear perceptron, logistic regression,
support vector machine...)
X F
maskat...
msises
marssl...
malhtv...
mappsv...
mahtlg...
329 / 635
Kernels for protein sequences
Kernel methods have been widely investigated since Jaakkola et al.s
seminal paper (1998).
What is a good kernel?
it should be mathematically valid (symmetric, p.d. or c.p.d.)
fast to compute
adapted to the problem (gives good performances)
330 / 635
Kernel engineering for protein sequences
331 / 635
Kernel engineering for protein sequences
331 / 635
Kernel engineering for protein sequences
331 / 635
Outline
332 / 635
Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
(x) Rn . How to perform this embedding?
333 / 635
Vector embedding for strings
The idea
Represent each sequence x by a fixed-length numerical vector
(x) Rn . How to perform this embedding?
Physico-chemical kernel
Extract relevant features, such as:
length of the sequence
time series analysis of numerical physico-chemical properties of
amino-acids along the sequence (e.g., polarity, hydrophobicity),
using for example:
Fourier transforms (Wang et al., 2004)
Autocorrelation functions (Zhang et al., 2003)
nj
1 X
rj = hi hi+j
nj
i=1
333 / 635
Substring indexation
The approach
Alternatively, index the feature space by fixed-length strings, i.e.,
(x) = (u (x))uAk
334 / 635
Example: Spectrum kernel (1/4)
Kernel definition
The 3-spectrum of
x = CGGSLIAMMWFGV
is:
(CGG,GGS,GSL,SLI,LIA,IAM,AMM,MMW,MWF,WFG,FGV) .
Let u (x) denote the number of occurrences of u in x. The
k-spectrum kernel is:
X
K x, x0 := u (x) u x0 .
uAk
335 / 635
Example: Spectrum kernel (2/4)
Implementation
The computation of the kernel is formally a sum over |A|k terms,
but at most | x | k + 1 terms are non-zero in (x) =
Computation in O (| x | + | x0 |) with pre-indexation of the strings.
Fast classification of a sequence x in O (| x |):
| x |k+1
X X
f (x) = w (x) = wu u (x) = wxi ...xi+k1 .
u i=1
Remarks
Work with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
336 / 635
Example: Spectrum kernel (3/4)
If pre-indexation is not possible: retrieval tree (trie)
337 / 635
Example: Spectrum kernel (4/4)
If pre-indexation is not possible: use a prefix tree
The complexity for computing K (x, x0 ) becomes O(|x| + |x0 |), but with a
larger constant than with pre-indexation.
338 / 635
Example 2: Substring kernel (1/12)
Definition
For 1 k n N, we denote by I(k, n) the set of sequences of
indices i = (i1 , . . . , ik ), with 1 i1 < i2 < . . . < ik n.
For a string x = x1 . . . xn X of length n, for a sequence of indices
i I(k, n), we define a substring as:
l (i) = ik i1 + 1.
339 / 635
Example 2: Substring kernel (2/12)
Example
ABRACADABRA
i = (3, 4, 7, 8, 10)
x (i) =RADAR
l (i) = 10 3 + 1 = 8
340 / 635
Example 2: Substring kernel (3/12)
The kernel
Let k N and R+ fixed. For all u Ak , let u : X R be
defined by:
X
x X , u (x) = l(i) .
iI(k,| x |): x(i)=u
uAk
341 / 635
Example 2: Substring kernel (4/12)
Example
u ca ct at ba bt cr ar br
u (cat) 2 3 2 0 0 0 0 0
u (car) 2 0 0 0 0 3 2 0
u (bat) 0 0 2 2 3 0 0 0
u (bar) 0 0 0 2 0 0 2 3
4 6
K (cat,cat) = K (car,car) = 2 +
K (cat,car) = 4
K (cat,bar) = 0
342 / 635
Example 2: Substring kernel (5/12)
Kernel computation
We need to compute, for any pair x, x0 X , the kernel:
X
Kk, x, x0 = u (x) u x0
uAk
0
X X X
= l(i)+l(i ) .
uAk i:x(i)=u i0 :x0 (i0 )=u
343 / 635
Example 2: Substring kernel (6/12)
Kernel computation (cont.)
For u Ak remember that:
X
u (x) = ik i1 +1 .
i:x(i)=u
Let now: X
u (x) = | x |i1 +1 .
i:x(i)=u
344 / 635
Example 2: Substring kernel (7/12)
Kernel computation (cont.)
Let us note x[1,j] = x1 . . . xj . A simple rewriting shows that, if we note
a A the last letter of u (u = va):
X
va (x) = v x[1,j1] ,
j[1,| x |]:xj =a
and X
v x[1,j1] | x |j+1 .
va (x) =
j[1,| x |]:xj =a
345 / 635
Example 2: Substring kernel (8/12)
Kernel computation (cont.)
Moreover we observe that if the string is of the form xa (i.e., the last
letter is a A), then:
If the last letter of u is not a:
(
u (xa) = u (x) ,
u (xa) = u (x) .
346 / 635
Example 2: Substring kernel (9/12)
Kernel computation (cont.)
Let us now show how the function:
X
Bk x, x0 := u (x) u x0
uAk
uAk
347 / 635
Example 2: Substring kernel (10/12)
Recursive computation of Bk
Bk xa, x0
X
u (xa) u x0
=
uAk
X X
u (x) u x0 + v (x) va x0
=
uAk vAk1
0
= Bk x, x +
X X 0
v (x) v x0[1,j1] | x |j+1
j[1,| x0 |]:xj0 =a
348 / 635
Example 2: Substring kernel (11/12)
Recursive computation of Bk
Bk xa, x0 b
X 0
= Bk x, x0 b + Bk1 x, x0[1,j1] | x |j+2
j[1,| x0 |]:xj0 =a
349 / 635
Example 2: Substring kernel (12/12)
Recursive computation of Kk
Kk xa, x0
X
u (xa) u x0
=
uAk
X X
u (x) u x0 + v (x) va x0
=
uAk vAk1
0
= Kk x, x +
X X
v (x) v x0[1,j1]
vAk1 j[1,| x0 |]:xj0 =a
X
= Kk x, x0 + 2 Bk1 x, x0[1,j1]
j[1,| x0 |]:xj0 =a
350 / 635
Summary: Substring indexation
351 / 635
Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping D (x) = (s (x, xi ))xi D
352 / 635
Dictionary-based indexation
The approach
Chose a dictionary of sequences D = (x1 , x2 , . . . , xn )
Chose a measure of similarity s (x, x0 )
Define the mapping D (x) = (s (x, xi ))xi D
Examples
This includes:
Motif kernels (Logan et al., 2001): the dictionary is a library of
motifs, the similarity function is a matching function
Pairwise kernel (Liao & Noble, 2003): the dictionary is the training
set, the similarity is a classical measure of similarity between
sequences.
352 / 635
Outline
353 / 635
Probabilistic models for sequences
Probabilistic modeling of biological sequences is older than kernel
designs. Important models include HMM for protein sequences, SCFG for
RNA sequences.
{P , Rm } M+
1 (X )
354 / 635
Context-tree model
Definition
A context-tree model is a variable-memory Markov chain:
n
Y
PD, (x) = PD, (x1 . . . xD ) PD, (xi | xiD . . . xi1 )
i=D+1
D is a suffix tree
D is a set of conditional probabilities (multinomials)
355 / 635
Context-tree model: example
356 / 635
The context-tree kernel
Theorem (Cuturi et al., 2005)
For particular choices of priors, the context-tree kernel:
Z
0
X
K x, x = PD, (x)PD, (x0 )w (d|D)(D)
D D
357 / 635
Marginalized kernels
Recall: Definition
For any observed data x X , let a latent variable y Y be
associated probabilistically through a conditional probability Px (dy).
Let KZ be a kernel for the complete data z = (x, y)
Then the following kernel is a valid kernel on X , called a
marginalized kernel (Tsuda et al., 2002):
358 / 635
Example: HMM for normal/biased coin toss
0.85
N 0.05
0.5
0.1 E Normal (N) and biased (B)
S 0.1 coins (not observed)
B
0.5 0.05
0.85
Observed output are 0/1 with probabilities:
(
(0|N) = 1 (1|N) = 0.5,
(0|B) = 1 (1|B) = 0.2.
(a,s)AS
360 / 635
1-spectrum marginalized kernel on observed data
The marginalized kernel for observed data is:
X
KX x, x0 = KZ (x, y) , x0 , y0 P (y|x) P y0 |x0
y,y0 S
X
a,s (x) a,s x0 ,
=
(a,s)AS
with X
a,s (x) = P (y|x) na,s (x, y)
yS
361 / 635
Computation of the 1-spectrum marginalized kernel
X
a,s (x) = P (y|x) na,s (x, y)
yS
( n )
X X
= P (y|x) (xi , a) (yi , s)
yS i=1
n
X X
= (xi , a) P (y|x) (yi , s)
i=1 yS
n
X
= (xi , a) P (yi = s|x) .
i=1
362 / 635
HMM example (DNA)
363 / 635
HMM example (protein)
364 / 635
SCFG for RNA sequences
SFCG rules
S SS
S aSa
S aS
S a
365 / 635
Marginalized kernels in practice
Examples
Spectrum kernel on the hidden states of a HMM for protein
sequences (Tsuda et al., 2002)
Kernels for RNA sequences based on SCFG (Kin et al., 2002)
Kernels for graphs based on random walks on graphs (Kashima et
al., 2004)
Kernels for multiple alignments based on phylogenetic models (Vert
et al., 2006)
366 / 635
Marginalized kernels: example
367 / 635
Outline
368 / 635
Sequence alignment
Motivation
How to compare 2 sequences?
x1 = CGGSLIAMMWFGV
x2 = CLIVMMNRLMWFGV
CGGSLIAMM------WFGV
|...|||||....||||
C-----LIVMMNRLMWFGV
369 / 635
Alignment score
In order to quantify the relevance of an alignment , define:
a substitution matrix S RAA
a gap penalty function g : N R
Any alignment is then scored as follows
CGGSLIAMM------WFGV
|...|||||....||||
C----LIVMMNRLMWFGV
370 / 635
Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g ().
(x,y)
371 / 635
Local alignment kernel
Smith-Waterman score (Smith and Waterman, 1981)
The widely-used Smith-Waterman local alignment score is defined
by:
SWS,g (x, y) := max sS,g ().
(x,y)
371 / 635
LA kernel is p.d.: proof (1/11)
Lemma
If K1 and K2 are p.d. kernels, then:
K1 + K2 ,
K1 K2 , and
cK1 , for c 0,
x, x0 X 2 , K x, x0 = lim Ki x, x0 ,
n
372 / 635
LA kernel is p.d.: proof (2/11)
Proof of lemma
Let A and B be n n positive semidefinite matrices. By diagonalization
of A:
Xn
Ai,j = fp (i)fp (j)
p=1
The matrix Ci,j = Ai,j Bi,j is therefore p.d. Other properties are obvious
from definition.
373 / 635
LA kernel is p.d.: proof (3/11)
Lemma (direct sum and product of kernels)
Let X = X1 X2 . Let K1 be a p.d. kernel on X1 , and K2 be a p.d.
kernel on X2 . Then the following functions are p.d. kernels on X :
the direct sum,
374 / 635
LA kernel is p.d.: proof (4/11)
Proof of lemma
If K1 is a p.d. kernel, let 1 : X1 7 H be such that:
((x1 , x2 )) = 1 (x1 ) .
375 / 635
LA kernel is p.d.: proof (5/11)
Lemma: kernel for sets
Let K be a p.d. kernel on X , and let P (X ) be the set of finite subsets of
X . Then the function KP on P (X ) P (X ) defined by:
XX
A, B P (X ) , KP (A, B) := K (x, y)
xA yB
is a p.d. kernel on P (X ).
376 / 635
LA kernel is p.d.: proof (6/11)
Proof of lemma
Let : X 7 H be such that
377 / 635
LA kernel is p.d.: proof (7/11)
Definition: Convolution kernel (Haussler, 1999)
Let K1 and K2 be two p.d. kernels for strings. The convolution of K1
and K2 , denoted K1 ? K2 , is defined for any x, x0 X by:
X
K1 ? K2 (x, y) := K1 (x1 , y1 )K2 (x2 , y2 ).
x1 x2 =x,y1 y2 =y
Lemma
If K1 and K2 are p.d. then K1 ? K2 is p.d..
378 / 635
LA kernel is p.d.: proof (8/11)
Proof of lemma
Let X be the set of finite-length strings. For x X , let
R (x) = {(x1 , x2 ) X X : x = x1 x2 } X X .
379 / 635
LA kernel is p.d.: proof (9/11)
3 basic string kernels
The constant kernel:
K0 (x, y) := 1 .
380 / 635
LA kernel is p.d.: proof (10/11)
Remark
S : A2 R is the similarity function between letters used in the
()
alignment score. Ka is only p.d. when the matrix:
381 / 635
LA kernel is p.d.: proof (11/11)
Lemma
The local alignment kernel is a (limit) of convolution kernel:
() (n1)
() () ()
X
KLA = K0 ? Ka ? Kg ? Ka ? K0 .
n=0
As such it is p.d..
Proof (sketch)
By induction on n (simple but long to write).
See details in Vert et al. (2004).
382 / 635
LA kernel computation
We assume an affine gap penalty:
(
g (0) = 0,
g (n) = d + e(n 1) si n 1,
where M(i, j), X (i, j), Y (i, j), X2 (i, j), and Y2 (i, j) for 0 i |x|,
and 0 j |y| are defined recursively.
383 / 635
LA kernel is p.d.: proof (/)
Initialization
M(i, 0) = M(0, j) = 0,
X (i, 0) = X (0, j) = 0,
Y (i, 0) = Y (0, j) = 0,
X2 (i, 0) = X2 (0, j) = 0,
Y2 (i, 0) = Y2 (0, j) = 0,
384 / 635
LA kernel is p.d.: proof (/)
Recursion
For i = 1, . . . , |x| and j = 1, . . . , |y|:
h
M(i, j) = exp(S(x ,
i j y )) 1 + X (i 1, j 1)
i
+Y (i 1, j 1) + M(i 1, j 1) ,
X (i, j) = exp(d)M(i 1, j) + exp(e)X (i 1, j),
385 / 635
LA kernel in practice
X1 a:0/D X X2 0:0/1
a:0/1 a:b/m(a,b)
0:b/D a:0/1
a:b/m(a,b) 0:a/1
B a:b/m(a,b) M 0:0/1 E
0:a/1
a:b/m(a,b)
0:0/1
a:b/m(a,b) 0:a/1
0:a/1 0:b/D
Y1 Y Y2
0:0/1
386 / 635
Outline
387 / 635
Remote homology
gs
gs
o
on
o
ol
ol
m
tz
m
ho
gh
ho
se
ili
on
lo
Tw
N
C
Sequence similarity
388 / 635
SCOP database
SCOP
Fold
Superfamily
Family
Remote homologs Close homologs
389 / 635
A benchmark experiment
Goal: recognize directly the superfamily
Training: for a sequence of interest, positive examples come from
the same superfamily, but different families. Negative from other
superfamilies.
Test: predict the superfamily.
390 / 635
Difference in performance
60
SVM-LA
SVM-pairwise
SVM-Mismatch
No. of families with given performance
50 SVM-Fisher
40
30
20
10
0
0 0.2 0.4 0.6 0.8 1
ROC50
391 / 635
String kernels: Summary
A variety of principles for string kernel design have been proposed.
Good kernel design is important for each data and each task.
Performance is not the only criterion.
Still an art, although principled ways have started to emerge.
Fast implementation with string algorithms is often possible.
Their application goes well beyond computational biology.
392 / 635
Outline
2 Kernel tricks
394 / 635
Virtual screening for drug discovery
active
inactive
active
inactive
inactive
active
396 / 635
Our approach
397 / 635
Our approach
1 Represent each graph x in X by a vector (x) H, either explicitly
or implicitly through the kernel
X H
397 / 635
Our approach
1 Represent each graph x in X by a vector (x) H, either explicitly
or implicitly through the kernel
397 / 635
Outline
398 / 635
The approach
1 Represent explicitly each graph x by a vector of fixed dimension
(x) Rp .
X H
399 / 635
The approach
1 Represent explicitly each graph x by a vector of fixed dimension
(x) Rp .
2 Use an algorithm for regression or pattern recognition in Rp .
X H
399 / 635
Example
2D structural keys in chemoinformatics
Index a molecule by a binary fingerprint defined by a limited set of
predefined structures
N N N
O O O O O
O
N
400 / 635
Challenge: which descriptors (patterns)?
N N N
O O O O O
O
N
401 / 635
Indexing by substructures
N N N
O O O O O
O
N
402 / 635
Subgraphs
Definition
A subgraph of a graph (V , E ) is a graph (V 0 , E 0 ) with V 0 V and
E0 E.
403 / 635
Indexing by all subgraphs?
404 / 635
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
404 / 635
Indexing by all subgraphs?
Theorem
Computing all subgraph occurrences is NP-hard.
Proof
The linear graph of size n is a subgraph of a graph X with n vertices
iff X has a Hamiltonian path;
The decision problem whether a graph has a Hamiltonian path is
NP-complete.
404 / 635
Paths
Definition
A path of a graph (V , E ) is a sequence of distinct vertices
v1 , . . . , vn V (i 6= j = vi 6= vj ) such that (vi , vi+1 ) E for
i = 1, . . . , n 1.
Equivalently the paths are the linear subgraphs.
405 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
406 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Theorem
Computing all path occurrences is NP-hard.
406 / 635
Indexing by all paths?
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Theorem
Computing all path occurrences is NP-hard.
Proof
Same as for subgraphs.
406 / 635
Indexing by what?
Substructure selection
We can imagine more limited sets of substructures that lead to more
computationnally efficient indexing (non-exhaustive list)
substructures selected by domain knowledge (MDL fingerprint)
all paths up to length k (Openeye fingerprint, Nicholls 2005)
all shortest path lengths (Borgwardt and Kriegel, 2005)
all subgraphs up to k vertices (graphlet kernel, Shervashidze et al.,
2009)
all frequent subgraphs in the database (Helma et al., 2004)
407 / 635
Example: Indexing by all shortest path lengths and their
endpoint labels
A 3 B
A A
B (0,...,0,2,0,...,0,1,0,...)
B A
A 1 A A 3 A
408 / 635
Example: Indexing by all shortest path lengths and their
endpoint labels
A 3 B
A A
B (0,...,0,2,0,...,0,1,0,...)
B A
A 1 A A 3 A
408 / 635
Example: Indexing by all subgraphs up to k vertices
409 / 635
Example: Indexing by all subgraphs up to k vertices
409 / 635
Summary
Explicit computation of substructure occurrences can be
computationnally prohibitive (subgraphs, paths);
Several ideas to reduce the set of substructures considered;
In practice, NP-hardness may not be so prohibitive (e.g., graphs
with small degrees), the strategy followed should depend on the
data considered.
410 / 635
Outline
411 / 635
The idea
412 / 635
The idea
1 Represent implicitly each graph x in X by a vector (x) H
through the kernel
X H
412 / 635
The idea
1 Represent implicitly each graph x in X by a vector (x) H
through the kernel
X H
412 / 635
Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:
G1 , G2 X , dK (G1 , G2 ) = 0 = G1 ' G2 .
413 / 635
Expressiveness vs Complexity
Definition: Complete graph kernels
A graph kernel is complete if it distinguishes non-isomorphic graphs, i.e.:
G1 , G2 X , dK (G1 , G2 ) = 0 = G1 ' G2 .
413 / 635
Complexity of complete kernels
Proposition (Gartner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.
414 / 635
Complexity of complete kernels
Proposition (Gartner et al., 2003)
Computing any complete graph kernel is at least as hard as the graph
isomorphism problem.
Proof
For any kernel K the complexity of computing dK is the same as the
complexity of computing K , because:
414 / 635
Subgraph kernel
Definition
Let (G )G X be a set or nonnegative real-valued weights
For any graph G X and any connected graph H X , let
H (G ) = G 0 is a subgraph of G : G 0 ' H .
415 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.
416 / 635
Subgraph kernel complexity
Proposition (Gartner et al., 2003)
Computing the subgraph kernel is NP-hard.
Proof (1/2)
Let Pn be the path graph with n vertices.
Subgraphs of Pn are path graphs:
Proof (2/2)
If G is a graph with n vertices, then it has a path that visits each
node exactly once (Hamiltonian path) if and only if (G )> ePn > 0,
i.e.,
n n
!
X X
>
(G ) i (Pi ) = i Ksubgraph (G , Pi ) > 0 .
i=1 i=1
417 / 635
Path kernel
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = H H (G1 )H (G2 ) ,
HP
418 / 635
Path kernel
A A
B (0,...,0,1,0,...,0,1,0,...)
B A
A A A B A
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
X
Kpath (G1 , G2 ) = H H (G1 )H (G2 ) ,
HP
418 / 635
Summary
Expressiveness vs Complexity trade-off
It is intractable to compute complete graph kernels.
It is intractable to compute the subgraph kernels.
Restricting subgraphs to be linear does not help: it is also
intractable to compute the path kernel.
One approach to define polynomial time computable graph kernels is
to have the feature space be made up of graphs homomorphic to
subgraphs, e.g., to consider walks instead of paths.
419 / 635
Outline
420 / 635
Walks
Definition
A walk of a graph (V , E ) is sequence of v1 , . . . , vn V such that
(vi , vi+1 ) E for i = 1, . . . , n 1.
We note Wn (G ) the set of walks with n vertices of the graph G ,
and W(G ) the set of all walks.
etc...
421 / 635
Walks 6= paths
422 / 635
Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = n1 Sn .
For any graph X let a weight G (w ) be associated to each walk
w W(G ).
Let the feature vector (G ) = (s (G ))sS be defined by:
X
s (G ) = G (w )1 (s is the label sequence of w ) .
w W(G )
423 / 635
Walk kernel
Definition
Let Sn denote the set of all possible label sequences of walks of
length n (including vertex and edge labels), and S = n1 Sn .
For any graph X let a weight G (w ) be associated to each walk
w W(G ).
Let the feature vector (G ) = (s (G ))sS be defined by:
X
s (G ) = G (w )1 (s is the label sequence of w ) .
w W(G )
423 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
424 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with G (w ) = PG (w ), where
PG is a Markov random walk on G . In that case we have:
424 / 635
Walk kernel examples
Examples
The nth-order walk kernel is the walk kernel with G (w ) = 1 if the
length of w is n, 0 otherwise. It compares two graphs through their
common walks of length n.
The random walk kernel is obtained with G (w ) = PG (w ), where
PG is a Markov random walk on G . In that case we have:
424 / 635
Computation of walk kernels
Proposition
These three kernels (nth-order, random and geometric walk kernels) can
be computed efficiently in polynomial time.
425 / 635
Product graph
Definition
Let G1 = (V1 , E1 ) and G2 = (V2 , E2 ) be two graphs with labeled vertices.
The product graph G = G1 G2 is the graph G = (V , E ) with:
1 V = {(v1 , v2 ) V1 V2 : v1 and v2 have the same label} ,
2 E = {((v1 , v2 ), (v10 , v20 )) V V : (v1 , v10 ) E1 and (v2 , v20 ) E2 }.
1 a b 1b 2a 1d
c 3c 3e
2
1a 2b 2d
3 4 d e
4c 4e
G1 G2 G1 x G2
426 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).
427 / 635
Walk kernel and product graph
Lemma
There is a bijection between:
1 The pairs of walks w1 Wn (G1 ) and w2 Wn (G2 ) with the same
label sequences,
2 The walks on the product graph w Wn (G1 G2 ).
Corollary
X
Kwalk (G1 , G2 ) = s (G1 )s (G2 )
sS
X
= G1 (w1 )G2 (w2 )1(l(w1 ) = l(w2 ))
(w1 ,w2 )W(G1 )W(G1 )
X
= G1 G2 (w ) .
w W(G1 G2 )
427 / 635
Computation of the nth-order walk kernel
428 / 635
Computation of random and geometric walk kernels
429 / 635
Extensions 1: Label enrichment
Atom relabeling with the Morgan index (Mahe et al., 2004)
1 2 4
1 1 2 2 4 5
1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5
430 / 635
Extension 2: Non-tottering walk kernel
Tottering walks
A tottering walk is a walk w = v1 . . . vn with vi = vi+2 for some i.
Nontottering
Tottering
431 / 635
Computation of the non-tottering walk kernel (Mahe et al.,
2005)
Second-order Markov random walk to prevent tottering walks
Written as a first-order Markov random walk on an augmented graph
Normal walk kernel on the augmented graph (which is always a
directed graph).
432 / 635
Extension 3: Subtree kernels
.
.
. C C
C .
N
N O
.
.
C
O N C C N
. N O
.
.
N C
N N C C C
.
.
.
434 / 635
Computation of the subtree kernel (Ramon and Gartner,
2003; Mahe and Vert, 2009)
435 / 635
Back to label enrichment
Link between the Morgan index and subtrees
Recall the Morgan index:
1 2 4
1 1 2 2 4 5
1 O1 2 O1 4 O3
1 3 7
N1 N3 N5
1 2 5
2
1
1 3
2 3 6
6 4
1 3 1 2 4 5 1 5
5
436 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a
j g
3 Relabeling
i h
c d
f f
e
b
437 / 635
Label enrichment via the Weisfeiler-Lehman algorithm
A slightly more involved label enrichment strategy (Weisfeiler and
Lehman, 1968) is exploited in the definition and computation of the
Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009).
e b
1 Multiset-label determination
d c
and sorting
a a
j g
3 Relabeling
i h
c d
f f
e
b
e b b e
d c d c (1)
WLsubtree(G) = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)
a b c d e f g h i j k l m
a a a b (1)
WLsubtree(G) = ( 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)
m h i m a b c d e f g h i j k l m
k j l j Counts of Counts of
original compressed
G G
f f f g node labels node labels
(1) (1) (1)
KWLsubtree(G,G)=<WLsubtree(G), WLsubtree(G)>=11.
Properties
The WL features up to the k-th order are computed in O(|E |k).
Similarly to the Morgan index, the WL relabeling can be exploited in
combination with any graph kernel (that takes into account
categorical node labels) to make it more expressive (Shervashidze et
al., 2011).
438 / 635
Outline
439 / 635
Application in chemoinformatics (Mahe et al., 2005)
MUTAG dataset
aromatic/hetero-aromatic compounds
high mutagenic activity /no mutagenic activity, assayed in
Salmonella typhimurium.
188 compounds: 125 + / 63 -
Results
10-fold cross-validation accuracy
Method Accuracy
Progol1 81.4%
2D kernel 91.2%
440 / 635
AUC
70 72 74 76 78 80
CCRFCEM
HL60(TB)
K562
MOLT4
Walks
RPMI8226
Subtrees
SR
A549/ATCC
EKVX
HOP62
HOP92
NCIH226
NCIH23
NCIH322M
NCIH460
NCIH522
COLO_205
HCC2998
HCT116
HCT15
HT29
KM12
SW620
SF268
2D subtree vs walk kernels
SF295
SF539
SNB19
SNB75
U251
LOX_IMVI
MALME3M
M14
SKMEL2
SKMEL28
SKMEL5
UACC257
UACC62
IGROV1
OVCAR3
OVCAR4
OVCAR5
OVCAR8
442 / 635
Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination
(M).
0.12
0.11
0.1
Test error
0.09
0.08
0.07
0.06
0.05
H W TW wTW M
Kernels
443 / 635
Summary: graph kernels
What we saw
Kernels do not allow to overcome the NP-hardness of subgraph
patterns.
They allow to work with approximate subgraphs (walks, subtrees) in
infinite dimension, thanks to the kernel trick.
However: using kernels makes it difficult to come back to patterns
after the learning stage.
444 / 635
Outline
2 Kernel tricks
446 / 635
Graphs
Motivation
Data often come in the form of nodes in a graph for different reasons:
by definition (interaction network, internet...)
by discretization/sampling of a continuous domain
by convenience (e.g., if only a similarity function is available)
447 / 635
Example: web
448 / 635
Example: social network
449 / 635
Example: protein-protein interaction
450 / 635
Kernel on a graph
451 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .
452 / 635
General remarks
Strategies to design a kernel on a graph
X being finite, any symmetric semi-definite matrix K defines a valid
p.d. kernel on X .
How to translate the graph topology into the kernel?
Direct geometric approach: Ki,j should be large when xi and xj are
close to each other on the graph?
Functional approach: k f kK should be small when f is smooth
on the graph?
Link discrete/continuous: is there an equivalent to the continuous
Gaussian kernel on the graph (e.g., limit by fine discretization)?
452 / 635
Outline
453 / 635
Conditionally p.d. kernels
Hilbert distance
Any p.d. kernel is an inner product in a Hilbert space
K x, x0 = (x) , x0 H .
454 / 635
Example
A direct approach
For X = Rn , the inner product is p.d.:
K (x, x0 ) = x> x0 .
455 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.
456 / 635
Graph distance
Graph embedding in a Hilbert space
Given a graph G = (V , E ), the graph distance dG (x, x 0 ) between
any two vertices is the length of the shortest path between x and x 0 .
We say that the graph G = (V , E ) can be embedded (exactly) in a
Hilbert space if dG is c.p.d., which implies in particular that
exp(tdG (x, x 0 )) is p.d. for all t > 0.
Lemma
In general graphs cannot be embedded exactly in Hilbert spaces.
In some cases exact embeddings exist, e.g.:
trees can be embedded exactly,
closed chains can be embedded exactly.
456 / 635
Example: non-c.p.d. graph distance
1 3 5
4
0 1 1 1 2
1 0 2 2 1
dG =
1 2 0 2 1
1 2 2 0 1
2 1 1 1 0
h i
min e (0.2dG (i,j)) = 0.028 < 0 .
457 / 635
Graph distances on trees are c.p.d.
Proof
Let G = (V , E ) be a tree;
Fix a root x0 V ;
Represent any vertex x V by a vector (x) R|E | , where
(x)i = 1 if the i-th edge is part of the (unique) path between x
and x0 , 0 otherwise.
Then
dG (x, x 0 ) = k (x) (x 0 ) k2 ,
and therefore dG is c.p.d., in particular exp(tdG (x, x 0 )) is p.d.
for all t > 0.
458 / 635
Example
1
3 5
4
2
1 0.14 0.37 0.14 0.05
h i 0.14 1 0.37 0.14 0.05
dG (i,j)
e =
0.37 0.37 1 0.37 0.14
0.14 0.14 0.37 1 0.37
0.05 0.05 0.14 0.37 1
459 / 635
Graph distances on closed chains are c.p.d.
Proof: case |V| = 2p
Let G = (V, E) be a directed cycle with an even number of vertices |V| = 2p.
Fix a root $x_0 \in V$ and number the 2p edges from $x_0$ to $x_0$;
Label the 2p edges with $e_1, \dots, e_p, -e_1, \dots, -e_p$ (vectors in $\mathbb R^p$);
For a vertex v, take $\Phi(v)$ to be the sum of the labels of the edges in the shortest directed path between $x_0$ and v.
460 / 635
Outline
461 / 635
Functional approach
Motivation
How to design a p.d. kernel on general graphs?
Designing a kernel is equivalent to defining an RKHS.
There are intuitive notions of smoothness on a graph.
Idea
Define a priori a smoothness functional on the functions $f : \mathcal X \to \mathbb R$;
Show that it defines an RKHS and identify the corresponding kernel.
462 / 635
Notations
$\mathcal X = (x_1, \dots, x_m)$ is finite.
For $x, x' \in \mathcal X$, we write $x \sim x'$ to indicate the existence of an edge between x and x'.
We assume that there is no self-loop $x \sim x$, and that there is a single connected component.
The adjacency matrix is $A \in \mathbb R^{m \times m}$:
$$A_{i,j} = \begin{cases} 1 & \text{if } i \sim j,\\ 0 & \text{otherwise.} \end{cases}$$
D is the diagonal matrix where $D_{i,i}$ is the number of neighbors of $x_i$ ($D_{i,i} = \sum_{j=1}^m A_{i,j}$).
463 / 635
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 1 & 1 & 0 & 1 & 0\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}\,, \qquad D = \begin{pmatrix} 1 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 3 & 0 & 0\\ 0 & 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
464 / 635
Graph Laplacian
Definition
The Laplacian of the graph is the matrix L = D - A.
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$L = D - A = \begin{pmatrix} 1 & 0 & -1 & 0 & 0\\ 0 & 1 & -1 & 0 & 0\\ -1 & -1 & 3 & -1 & 0\\ 0 & 0 & -1 & 2 & -1\\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$
465 / 635
Properties of the Laplacian
Lemma
Let L = D - A be the Laplacian of a connected graph. For any $f : \mathcal X \to \mathbb R$,
$$\Omega(f) := \sum_{i \sim j} (f(x_i) - f(x_j))^2 = f^\top L f\,.$$
466 / 635
Proof: link between $\Omega(f)$ and L
$$\begin{aligned}
\Omega(f) &= \sum_{i \sim j} (f(x_i) - f(x_j))^2\\
&= \sum_{i \sim j} \left( f(x_i)^2 + f(x_j)^2 - 2 f(x_i) f(x_j) \right)\\
&= \sum_{i=1}^m D_{i,i}\, f(x_i)^2 - 2 \sum_{i \sim j} f(x_i) f(x_j)\\
&= f^\top D f - f^\top A f\\
&= f^\top L f
\end{aligned}$$
467 / 635
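As a sanity check of the lemma, the identity $\Omega(f) = f^\top L f$ can be verified numerically on the example graph of the previous slides; a minimal numpy sketch (the edge list is the one read off the adjacency matrix above):

    import numpy as np

    # Adjacency matrix of the 5-vertex example graph (edges 1-3, 2-3, 3-4, 4-5).
    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    D = np.diag(A.sum(axis=1))
    L = D - A                                   # graph Laplacian

    edges = [(0, 2), (1, 2), (2, 3), (3, 4)]    # 0-based edge list
    f = np.random.randn(5)
    omega = sum((f[i] - f[j]) ** 2 for i, j in edges)
    print(np.allclose(f @ L @ f, omega))        # True: f^T L f = Omega(f)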
Proof: eigenstructure of L
L is symmetric because A and D are symmetric.
For any $f \in \mathbb R^m$, $f^\top L f = \Omega(f) \ge 0$, therefore the (real-valued) eigenvalues of L are $\ge 0$: L is positive semi-definite.
f is an eigenvector associated to the eigenvalue 0 iff $f^\top L f = \Omega(f) = 0$, i.e., iff f is constant (since the graph is connected).
468 / 635
Our first graph kernel
Theorem
The set $H = \{ f \in \mathbb R^m : \sum_{i=1}^m f_i = 0 \}$, endowed with the norm
$$\Omega(f) = \sum_{i \sim j} (f(x_i) - f(x_j))^2\,,$$
is an RKHS whose reproducing kernel is $L^*$, the pseudo-inverse of the graph Laplacian.
469 / 635
In case of...
Pseudo-inverse of L
Remember that the pseudo-inverse $L^*$ of L is the linear map equal to:
0 on Ker(L),
$L^{-1}$ on Im(L); that is, if we write
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top$$
the eigendecomposition of L, then
$$L^* = \sum_{\lambda_i \neq 0} \lambda_i^{-1} u_i u_i^\top\,.$$
On H, the norm $\Omega$ derives from the inner product
$$\langle f, g \rangle = f^\top L g\,, \qquad \|f\|^2 = \langle f, f \rangle = f^\top L f = \Omega(f)\,.$$
471 / 635
Proof (2/2)
To check that H is an RKHS with reproducing kernel $K = L^*$, it suffices to show that:
$$\begin{cases} \forall x \in \mathcal X, & K_x \in H\,,\\ \forall (x, f) \in \mathcal X \times H, & \langle f, K_x \rangle = f(x)\,. \end{cases}$$
Indeed, for $f \in H$, $g := K L f = L^* L f = \Pi_H(f) = f$, where $\Pi_H$ denotes the orthogonal projection onto $\mathrm{Im}(L) = H$.
472 / 635
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$L^* = \begin{pmatrix} 0.88 & -0.12 & 0.08 & -0.32 & -0.52\\ -0.12 & 0.88 & 0.08 & -0.32 & -0.52\\ 0.08 & 0.08 & 0.28 & -0.12 & -0.32\\ -0.32 & -0.32 & -0.12 & 0.48 & 0.28\\ -0.52 & -0.52 & -0.32 & 0.28 & 1.08 \end{pmatrix}$$
473 / 635
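The reproducing property of $L^*$ on H can also be checked numerically; a minimal sketch, assuming the same example graph, with inner product $\langle f, g \rangle = f^\top L g$ and $K_x$ the x-th column of the pseudo-inverse:

    import numpy as np

    # Laplacian of the example graph and its pseudo-inverse K = L^*.
    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    K = np.linalg.pinv(L)                       # candidate reproducing kernel

    f = np.random.randn(5)
    f -= f.mean()                               # f belongs to H (zero sum)
    x = 2                                       # any vertex
    print(np.allclose(f @ L @ K[:, x], f[x]))   # True: <f, K_x> = f(x)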
Interpretation of the Laplacian
[Figure: discretization of an interval with step dx; grid points i-1, i, i+1 and a function f.]
$$\begin{aligned}
\Delta f(x) = -f''(x) &\approx -\frac{f'(x + dx/2) - f'(x - dx/2)}{dx}\\
&\approx -\frac{f(x + dx) - f(x) - f(x) + f(x - dx)}{dx^2}\\
&= -\frac{f_{i-1} + f_{i+1} - 2 f(x)}{dx^2}\\
&= \frac{L f(i)}{dx^2}\,.
\end{aligned}$$
474 / 635
Interpretation of regularization
For $f : [0,1] \to \mathbb R$ and $x_i = i/m$, we have:
$$\begin{aligned}
\Omega(f) &= \sum_{i=1}^m \left( f\!\left(\frac{i+1}{m}\right) - f\!\left(\frac{i}{m}\right) \right)^2\\
&\approx \sum_{i=1}^m \left( \frac{1}{m} f'\!\left(\frac{i}{m}\right) \right)^2\\
&= \frac{1}{m} \left[ \frac{1}{m} \sum_{i=1}^m f'\!\left(\frac{i}{m}\right)^2 \right]\\
&\approx \frac{1}{m} \int_0^1 f'(t)^2\, dt\,.
\end{aligned}$$
475 / 635
Outline
476 / 635
Motivation
477 / 635
The diffusion equation
Lemma
For any $x_0 \in \mathbb R^d$, the function
$$K_{x_0}(x, t) = K_t(x_0, x) = \frac{1}{(4\pi t)^{d/2}} \exp\left( -\frac{\|x - x_0\|^2}{4t} \right)$$
is a solution of the diffusion (heat) equation.
478 / 635
Discrete diffusion equation
For finite-dimensional $f_t \in \mathbb R^m$, the diffusion equation becomes
$$\frac{\partial f_t}{\partial t} = -L f_t\,,$$
which admits the following solution:
$$f_t = e^{-tL} f_0 \qquad \text{with} \qquad e^{-tL} = I - tL + \frac{t^2}{2!} L^2 - \frac{t^3}{3!} L^3 + \dots$$
479 / 635
Diffusion kernel (Kondor and Lafferty, 2002)
This suggests to consider
$$K = e^{-tL}\,,$$
which is indeed symmetric positive semi-definite because, if we write
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top \qquad (\lambda_i \ge 0)\,,$$
we obtain
$$K = e^{-tL} = \sum_{i=1}^m e^{-t\lambda_i} u_i u_i^\top\,.$$
480 / 635
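A minimal numpy sketch of the diffusion kernel, computing $e^{-tL}$ from the eigendecomposition of the Laplacian of the example graph (t = 1 is an arbitrary illustrative choice):

    import numpy as np

    def diffusion_kernel(L, t):
        """K = exp(-t L) computed from the eigendecomposition of the Laplacian."""
        lam, U = np.linalg.eigh(L)                 # L = U diag(lam) U^T, lam >= 0
        return (U * np.exp(-t * lam)) @ U.T

    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    K = diffusion_kernel(L, t=1.0)
    print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > 0))   # symmetric, p.d.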
Example: complete graph
$$K_{i,j} = \begin{cases} \dfrac{1 + (m-1)\,e^{-tm}}{m} & \text{for } i = j,\\[2mm] \dfrac{1 - e^{-tm}}{m} & \text{for } i \neq j.\end{cases}$$
481 / 635
Example: closed chain
$$K_{i,j} = \frac{1}{m} \sum_{\nu=0}^{m-1} \exp\left( -2t \left( 1 - \cos\frac{2\pi\nu}{m} \right) \right) \cos\frac{2\pi\nu (i-j)}{m}\,.$$
482 / 635
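The closed-form expression above (as reconstructed here) can be checked against a direct computation of $e^{-tL}$ for a small cycle; a sketch with arbitrary m and t:

    import numpy as np

    m, t = 7, 0.5
    # Laplacian of the closed chain (cycle) on m vertices.
    A = np.zeros((m, m))
    for i in range(m):
        A[i, (i + 1) % m] = A[(i + 1) % m, i] = 1
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)
    K = (U * np.exp(-t * lam)) @ U.T               # exact e^{-tL}

    # Closed-form expression reconstructed above.
    i, j = np.indices((m, m))
    nu = np.arange(m)
    K2 = np.mean(np.exp(-2 * t * (1 - np.cos(2 * np.pi * nu / m)))
                 * np.cos(2 * np.pi * nu[None, None, :] * (i - j)[:, :, None] / m),
                 axis=2)
    print(np.allclose(K, K2))                      # True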
Outline
483 / 635
Motivation
In this section we show that the diffusion and Laplace kernels can be
interpreted in the frequency domain of functions
This shows that our strategy to design kernels on graphs was based
on (discrete) harmonic analysis on the graph
This follows the approach we developed for semigroup kernels!
484 / 635
Spectrum of the diffusion kernel
Let $0 = \lambda_1 < \lambda_2 \le \dots \le \lambda_m$ be the eigenvalues of the Laplacian:
$$L = \sum_{i=1}^m \lambda_i u_i u_i^\top \qquad (\lambda_i \ge 0)$$
485 / 635
Norm in the diffusion RKHS
Any function $f \in \mathbb R^m$ can be written as $f = K K^{-1} f$, so its norm in the RKHS of K is $\|f\|_K^2 = f^\top K^{-1} f$.
For $i = 1, \dots, m$, let
$$\hat f_i = u_i^\top f$$
be the projection of f onto the eigenbasis of K.
We then have:
$$\|f\|_{K_t}^2 = f^\top K^{-1} f = \sum_{i=1}^m e^{t\lambda_i}\, \hat f_i^2\,.$$
This looks similar to $\int |\hat f(\omega)|^2\, e^{\frac{\sigma^2 \omega^2}{2}}\, d\omega$ ...
486 / 635
Discrete Fourier transform
Definition
The vector $\hat f = (\hat f_1, \dots, \hat f_m)^\top$ is called the discrete Fourier transform of $f \in \mathbb R^m$.
487 / 635
Example: eigenvectors of the Laplacian
488 / 635
Generalization
This observation suggests to define a whole family of kernels:
$$K_r = \sum_{i=1}^m r(\lambda_i)\, u_i u_i^\top\,,$$
where $r : \mathbb R^+ \to \mathbb R^+$ is a non-increasing function.
489 / 635
Example : regularized Laplacian
$$r(\lambda_i) = \frac{1}{\lambda_i + \lambda}\,, \qquad \lambda > 0$$
$$K = \sum_{i=1}^m \frac{1}{\lambda_i + \lambda}\, u_i u_i^\top = (L + \lambda I)^{-1}$$
$$\|f\|_K^2 = f^\top K^{-1} f = \sum_{i \sim j} (f(x_i) - f(x_j))^2 + \lambda \sum_{i=1}^m f(x_i)^2\,.$$
490 / 635
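A short numerical check of the regularized Laplacian kernel and of the norm identity above, on the same example graph ($\lambda = 1$, so K should match the matrix on the next slide up to rounding):

    import numpy as np

    A = np.array([[0, 0, 1, 0, 0],
                  [0, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    lam = 1.0
    K = np.linalg.inv(L + lam * np.eye(5))      # regularized Laplacian kernel

    edges = [(0, 2), (1, 2), (2, 3), (3, 4)]
    f = np.random.randn(5)
    norm_K = f @ np.linalg.inv(K) @ f           # ||f||_K^2 = f^T K^{-1} f
    omega = sum((f[i] - f[j]) ** 2 for i, j in edges) + lam * np.sum(f ** 2)
    print(np.allclose(norm_K, omega))           # True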
Example
[Figure: graph on vertices 1-5 with edges {1,3}, {2,3}, {3,4}, {4,5}.]
$$(L + I)^{-1} = \begin{pmatrix} 0.60 & 0.10 & 0.19 & 0.08 & 0.04\\ 0.10 & 0.60 & 0.19 & 0.08 & 0.04\\ 0.19 & 0.19 & 0.38 & 0.15 & 0.08\\ 0.08 & 0.08 & 0.15 & 0.46 & 0.23\\ 0.04 & 0.04 & 0.08 & 0.23 & 0.62 \end{pmatrix}$$
491 / 635
Outline
492 / 635
Applications 1: graph partitioning
A classical relaxation of graph partitioning is:
$$\min_{f \in \mathbb R^{\mathcal X}} \sum_{i \sim j} (f_i - f_j)^2 \quad \text{s.t.} \quad \sum_i f_i^2 = 1$$
[Figure: data embedded on the first two principal directions (PC1, PC2).]
493 / 635
Applications 2: search on a graph
494 / 635
Application 3: Semi-supervised learning
495 / 635
Application 3: Semi-supervised learning
496 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on fewer than 100 samples from two (or more) different classes (e.g., different tumors)
497 / 635
Application 4: Tumor classification from microarray data
(Rapaport et al., 2006)
Data available
Gene expression measures for more than 10k genes
Measured on fewer than 100 samples from two (or more) different classes (e.g., different tumors)
Goal
Design a classifier to automatically assign a class to future samples
from their expression profile
Interpret biologically the differences between the classes
497 / 635
Linear classifiers
The approach
Each sample is represented by a vector $x = (x_1, \dots, x_p)$ where $p > 10^5$ is the number of probes.
Classification: given the set of labeled samples, learn a linear decision function:
$$f(x) = \sum_{i=1}^p \beta_i x_i + \beta_0\,,$$
498 / 635
Linear classifiers
Pitfalls
No robust estimation procedure exists for 100 samples in $10^5$ dimensions!
It is necessary to reduce the complexity of the problem with prior
knowledge.
499 / 635
Example : Norm Constraints
The approach
A common method in statistics to learn with few samples in high dimension is to constrain the norm of $\beta$, e.g.:
Euclidean norm (support vector machines, ridge regression): $\|\beta\|_2^2 = \sum_{i=1}^p \beta_i^2$
$L_1$-norm (lasso regression): $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$
Pros: good performance in classification.
Cons: limited interpretation (small weights); no prior biological knowledge.
500 / 635
Example 2: Feature Selection
The approach
Constrain most weights to be 0, i.e., select a few genes (< 20) whose expression is sufficient for classification. Interpretation is then about the selected genes.
Pros: good performance in classification; useful for biomarker selection; apparently easy interpretation.
Cons: the gene selection process is usually not robust; wrong interpretation is the rule (too much correlation between genes).
501 / 635
Pathway interpretation
Motivation
Basic biological functions are usually expressed in terms of pathways
and not of single genes (metabolic, signaling, regulatory)
Many pathways are already known
How to use this prior knowledge to constrain the weights to have an
interpretation at the level of pathways?
502 / 635
Pathway interpretation
[Figure: N-Glycan biosynthesis pathway.]
503 / 635
Pathway interpretation
Good example
The graph is the complete
known metabolic network
of the budding yeast
(from KEGG database)
We project the classifier weights learned by a spectral SVM
Good classification
accuracy, and good
interpretation!
504 / 635
Part 6
Open Problems
and Research Topics
505 / 635
Outline
2 Kernel tricks
506 / 635
Motivation
507 / 635
Setting: learning with one kernel
For any $f : \mathcal X \to \mathbb R$, let $f^n = (f(x_1), \dots, f(x_n)) \in \mathbb R^n$.
Given a p.d. kernel $K : \mathcal X \times \mathcal X \to \mathbb R$, we learn with K by solving:
$$\min_{f \in \mathcal H_K} R(f^n) + \lambda \|f\|_{\mathcal H_K}^2\,,$$
where $\lambda > 0$ and $R : \mathbb R^n \to \mathbb R$ is a closed³ convex function (typically an empirical risk).
³ R is closed if, for each $A \in \mathbb R$, the sublevel set $\{u \in \mathbb R^n : R(u) \le A\}$ is closed. For example, if R is continuous then it is closed.
508 / 635
Sum kernel
Definition
Let $K_1, \dots, K_M$ be M kernels on $\mathcal X$. The sum kernel $K_S$ is the kernel on $\mathcal X$ defined as
$$\forall x, x' \in \mathcal X, \quad K_S(x, x') = \sum_{i=1}^M K_i(x, x')\,.$$
509 / 635
Sum kernel and vector concatenation
Theorem
For $i = 1, \dots, M$, let $\Phi_i : \mathcal X \to \mathcal H_i$ be a feature map such that
$$K_i(x, x') = \langle \Phi_i(x), \Phi_i(x') \rangle_{\mathcal H_i}\,.$$
Then $K_S = \sum_{i=1}^M K_i$ can be written as
$$K_S(x, x') = \langle \Phi_S(x), \Phi_S(x') \rangle_{\mathcal H_S}\,,$$
where $\Phi_S(x) = (\Phi_1(x), \dots, \Phi_M(x))$ lives in the direct sum $\mathcal H_S = \mathcal H_1 \oplus \dots \oplus \mathcal H_M$.
511 / 635
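A tiny numpy illustration of the theorem: the Gram matrix of the sum kernel coincides with the Gram matrix of the concatenated feature maps (the two feature maps below are arbitrary illustrative choices):

    import numpy as np

    n, d = 5, 4
    X = np.random.randn(n, d)

    # Two explicit feature maps and their kernels.
    Phi1 = X                         # linear features
    Phi2 = X ** 2                    # squared features (just for illustration)
    K1, K2 = Phi1 @ Phi1.T, Phi2 @ Phi2.T

    # Sum kernel vs. concatenated feature map.
    PhiS = np.hstack([Phi1, Phi2])
    print(np.allclose(K1 + K2, PhiS @ PhiS.T))   # True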
Example: data integration with the sum kernel
[Screenshot of a Bioinformatics paper (Vol. 20 Suppl. 1, 2004, pages i363-i370) on protein network inference from multiple types of genomic data. Table 1 lists the experiments of the direct approach, the spectral approach based on kernel PCA, and the supervised approach based on kernel CCA, each using the expression kernel Kexp, the protein interaction kernel Kppi, the localization kernel Kloc, the phylogenetic profile kernel Kphy, and their sum Kexp + Kppi + Kloc + Kphy (integration), evaluated against the gold-standard protein network kernel Kgold. The ROC curves of the supervised approach show that integrating the four data sources with the sum kernel improves the prediction of the yeast protein network.]
512 / 635
The sum kernel: functional point of view
Theorem
The solution $f^* \in \mathcal H_{K_S}$ when we learn with $K_S = \sum_{i=1}^M K_i$ is equal to
$$f^* = \sum_{i=1}^M f_i^*\,,$$
where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^2\,.$$
513 / 635
Generalization: The weighted sum kernel
Theorem
The solution $f^*$ when we learn with $K_\eta = \sum_{i=1}^M \eta_i K_i$, with $\eta_1, \dots, \eta_M \ge 0$, is equal to
$$f^* = \sum_{i=1}^M f_i^*\,,$$
where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i}\,.$$
514 / 635
Proof (1/4)
We know from the previous theorem that $f^* = \sum_{i=1}^M f_i^*$ with $(f_1^*, \dots, f_M^*)$ solving
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i}\,.$$
By the representer theorem, $f_i^*(x) = \sum_{j=1}^n \alpha_{ij} K_i(x_j, x)$, where $(\alpha_1^*, \dots, \alpha_M^*)$ is the solution of
$$\min_{\alpha_1, \dots, \alpha_M \in \mathbb R^n} R\left( \sum_{i=1}^M K_i \alpha_i \right) + \lambda \sum_{i=1}^M \frac{\alpha_i^\top K_i \alpha_i}{\eta_i}\,.$$
515 / 635
Proof (2/4)
This is equivalent to
$$\min_{u, \alpha_1, \dots, \alpha_M \in \mathbb R^n} R(u) + \lambda \sum_{i=1}^M \frac{\alpha_i^\top K_i \alpha_i}{\eta_i} \quad \text{s.t.} \quad u = \sum_{i=1}^M K_i \alpha_i\,.$$
516 / 635
Proof (3/4)
Introducing a Lagrange multiplier $2\lambda\gamma \in \mathbb R^n$ for the equality constraint, we minimize the Lagrangian separately in u and in the $\alpha_i$'s.
Minimization in u:
$$\min_u \left\{ R(u) + 2\lambda \gamma^\top u \right\} = -\max_u \left\{ -2\lambda\gamma^\top u - R(u) \right\} = -R^*(-2\lambda\gamma)\,,$$
where $R^*$ denotes the Fenchel conjugate of R.
Minimization in $\alpha_i$ for $i = 1, \dots, M$:
$$\min_{\alpha_i} \left\{ \lambda \frac{\alpha_i^\top K_i \alpha_i}{\eta_i} - 2\lambda \gamma^\top K_i \alpha_i \right\} = -\lambda \eta_i\, \gamma^\top K_i \gamma\,,$$
where the minimum is attained at $\alpha_i = \eta_i \gamma$.
517 / 635
Proof (4/4)
The dual problem is therefore
$$\max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top \left( \sum_{i=1}^M \eta_i K_i \right) \gamma \right\}\,.$$
Note that if we learn with a single kernel K, we get the same dual problem
$$\max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K \gamma \right\}\,.$$
If $\gamma^*$ is a solution of the dual problem, then $\alpha_i^* = \eta_i \gamma^*$, leading to:
$$\forall x \in \mathcal X, \quad f_i^*(x) = \sum_{j=1}^n \alpha_{ij}^* K_i(x_j, x) = \eta_i \sum_{j=1}^n \gamma_j^* K_i(x_j, x)\,.$$
Therefore, $f^* = \sum_{i=1}^M f_i^*$ satisfies
$$f^*(x) = \sum_{i=1}^M \sum_{j=1}^n \eta_i \gamma_j^* K_i(x_j, x) = \sum_{j=1}^n \gamma_j^* K_\eta(x_j, x)\,.$$
518 / 635
Learning the kernel
Motivation
If we know how to weight each kernel, then we can learn with the weighted kernel
$$K_\eta = \sum_{i=1}^M \eta_i K_i\,.$$
519 / 635
An objective function for K
Theorem
For any p.d. kernel K on $\mathcal X$, let
$$J(K) := \min_{f \in \mathcal H_K} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_K}^2 \right\}\,.$$
The function $K \mapsto J(K)$ is convex.
520 / 635
Proof
We have shown by strong duality that
$$J(K) = \max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K \gamma \right\}\,.$$
For each $\gamma$, the term inside the braces is an affine function of K; J is therefore a pointwise maximum of affine functions of K, hence convex.
521 / 635
MKL (Lanckriet et al., 2004)
We consider the set of convex combinations
$$K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i = 1 \right\}\,,$$
and learn the kernel weights together with the predictor by solving
$$\min_{\eta \in \Sigma_M} J(K_\eta)\,.$$
522 / 635
Example: protein annotation
[Screenshot of Lanckriet et al., Bioinformatics 20(16):2626-2635, 2004: a statistical framework for genomic data fusion. Table 1 lists the seven kernels used to compare proteins (KSW: Smith-Waterman on protein sequences; KB: BLAST; KPfam: Pfam HMM; KFFT: FFT of hydropathy profiles; KLI: linear kernel on protein interactions; KD: diffusion kernel on protein interactions; KE: radial basis kernel on gene expression; KRND: random numbers, included as a control). Bar plots of ROC scores and of the learned kernel weights for (A) ribosomal and (B) membrane protein classification show that combining the kernels with the SDP/SVM (MKL) method yields better performance than any single kernel.]
523 / 635
Example: Image classification (Harchaoui and Bach, 2007)
COREL14 dataset
1400 natural images in 14 classes
Compare kernel between histograms (H), walk kernel (W), subtree
kernel (TW), weighted subtree kernel (wTW), and a combination by
MKL (M).
[Figure: test error on COREL14 for the kernels H, W, TW, wTW and the MKL combination M.]
524 / 635
MKL revisited (Bach et al., 2004)
$$K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i = 1 \right\}$$
Theorem
The solution $f^*$ of
$$\min_{\eta \in \Sigma_M} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}$$
is $f^* = \sum_{i=1}^M f_i^*$, where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of:
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2\,.$$
525 / 635
Proof (1/2)
$$\begin{aligned}
&\min_{\eta \in \Sigma_M} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}\\
&= \min_{\eta \in \Sigma_M} \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i} \right\}\\
&= \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \min_{\eta \in \Sigma_M} \left\{ \sum_{i=1}^M \frac{\|f_i\|_{\mathcal H_{K_i}}^2}{\eta_i} \right\} \right\}\\
&= \min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2 \right\}\,,
\end{aligned}$$
526 / 635
Proof (2/2)
where the last equality results from:
$$\forall a \in \mathbb R_+^M, \quad \left( \sum_{i=1}^M a_i \right)^2 = \inf_{\eta \in \Sigma_M} \sum_{i=1}^M \frac{a_i^2}{\eta_i}\,,$$
which follows from the Cauchy-Schwarz inequality
$$\sum_{i=1}^M a_i = \sum_{i=1}^M \frac{a_i}{\sqrt{\eta_i}} \sqrt{\eta_i} \le \left( \sum_{i=1}^M \frac{a_i^2}{\eta_i} \right)^{1/2} \left( \sum_{i=1}^M \eta_i \right)^{1/2}\,,$$
with equality when $\eta_i \propto a_i$.
527 / 635
Algorithm: simpleMKL (Rakotomamonjy et al., 2008)
We want to minimize in $\eta \in \Sigma_M$:
$$\min_{\eta \in \Sigma_M} J(K_\eta) = \min_{\eta \in \Sigma_M} \max_{\gamma \in \mathbb R^n} \left\{ -R^*(-2\lambda\gamma) - \lambda \gamma^\top K_\eta \gamma \right\}\,.$$
At the optimum $\gamma^*$ of the inner problem,
$$J(K_\eta) = -R^*(-2\lambda\gamma^*) - \lambda \gamma^{*\top} K_\eta \gamma^*\,,$$
and (by Danskin's theorem) the gradient of J with respect to $\eta$ can be computed from $\gamma^*$, which allows a projected gradient scheme on $\Sigma_M$.
528 / 635
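The sketch below is one possible instantiation of these ideas, not the reduced-gradient procedure of simpleMKL: it uses kernel ridge regression, for which $J(K) = \lambda\, y^\top (K + n\lambda I)^{-1} y$ in closed form, and multiplicative (exponentiated-gradient) updates to stay on the simplex. The step size and the toy data are arbitrary choices.

    import numpy as np

    def J_and_grad(eta, Ks, y, lam):
        """Objective J(K_eta) and its gradient for kernel ridge regression,
        where J(K) = min_f (1/n)||y - f^n||^2 + lam ||f||^2 = lam y^T (K + n lam I)^{-1} y."""
        n = len(y)
        K = sum(e * Ki for e, Ki in zip(eta, Ks))
        gamma = np.linalg.solve(K + n * lam * np.eye(n), y)
        J = lam * y @ gamma
        grad = np.array([-lam * gamma @ Ki @ gamma for Ki in Ks])
        return J, grad

    def mkl_ridge(Ks, y, lam=0.1, step=1.0, iters=200):
        """Minimize J(K_eta) over the simplex with multiplicative updates."""
        eta = np.ones(len(Ks)) / len(Ks)
        for _ in range(iters):
            _, grad = J_and_grad(eta, Ks, y, lam)
            eta = eta * np.exp(-step * grad)
            eta /= eta.sum()                      # stay on the simplex Sigma_M
        return eta

    # Toy example with rank-1 kernels K_i(x, x') = x_i x_i'.
    rng = np.random.RandomState(0)
    X = rng.randn(40, 5)
    y = X[:, 0] - X[:, 1]
    Ks = [np.outer(X[:, i], X[:, i]) for i in range(5)]
    print(np.round(mkl_ridge(Ks, y), 2))

On this toy problem most of the weight should end up on the two informative rank-1 kernels, mirroring the sparsity-inducing behaviour of MKL discussed on the following slides.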
Sum kernel vs MKL
Learning with the sum kernel (uniform combination) solves
$$\min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^2 \right\}\,,$$
whereas MKL solves
$$\min_{f_1, \dots, f_M} \left\{ R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}} \right)^2 \right\}\,:$$
the former penalizes the squared $\ell_2$-norm of the vector of RKHS norms, while the latter penalizes its squared $\ell_1$-norm, which induces sparsity among the kernels.
529 / 635
Example: ridge vs LASSO regression
Take $\mathcal X = \mathbb R^d$, and for $x = (x_1, \dots, x_d)^\top$ consider the rank-1 kernels:
$$\forall i = 1, \dots, d, \quad K_i(x, x') = x_i x_i'\,.$$
Each $\mathcal H_{K_i}$ contains the functions $f_i(x) = \beta_i x_i$ with $\|f_i\|_{\mathcal H_{K_i}} = |\beta_i|$: learning with the sum kernel then amounts to ridge regression ($\ell_2$ penalty on $\beta$), while MKL amounts to LASSO-type regression (squared $\ell_1$ penalty on $\beta$).
530 / 635
Extensions (Micchelli et al., 2005)
$$\text{For } r > 0, \quad K_\eta = \sum_{i=1}^M \eta_i K_i \qquad \text{with} \qquad \eta \in \Sigma_M^r = \left\{ \eta_i \ge 0, \ \sum_{i=1}^M \eta_i^r = 1 \right\}$$
Theorem
The solution $f^*$ of
$$\min_{\eta \in \Sigma_M^r} \min_{f \in \mathcal H_{K_\eta}} \left\{ R(f^n) + \lambda \|f\|_{\mathcal H_{K_\eta}}^2 \right\}$$
is $f^* = \sum_{i=1}^M f_i^*$, where $(f_1^*, \dots, f_M^*) \in \mathcal H_{K_1} \times \dots \times \mathcal H_{K_M}$ is the solution of:
$$\min_{f_1, \dots, f_M} R\left( \sum_{i=1}^M f_i^n \right) + \lambda \left( \sum_{i=1}^M \|f_i\|_{\mathcal H_{K_i}}^{\frac{2r}{r+1}} \right)^{\frac{r+1}{r}}\,.$$
531 / 635
Outline
2 Kernel tricks
532 / 635
Outline
533 / 635
Motivation
Main problem
All methods we have seen require computing the n x n Gram matrix, which is infeasible when n is significantly greater than 100 000, both in terms of memory and computation.
Solutions
low-rank approximation of the kernel;
random Fourier features.
The goal is to find an approximate embedding $\psi : \mathcal X \to \mathbb R^d$ such that
$$K(x, x') \approx \langle \psi(x), \psi(x') \rangle_{\mathbb R^d}\,.$$
534 / 635
Motivation
Then, functions f in $\mathcal H$ may be approximated by linear ones in $\mathbb R^d$, e.g.,
$$f(x) = \sum_{i=1}^n \alpha_i K(x_i, x) \approx \left\langle \sum_{i=1}^n \alpha_i \psi(x_i), \psi(x) \right\rangle_{\mathbb R^d} = \langle w, \psi(x) \rangle_{\mathbb R^d}\,,$$
and the kernel ERM problem becomes, approximately,
$$\min_{w \in \mathbb R^d} \frac{1}{n} \sum_{i=1}^n L(y_i, w^\top \psi(x_i)) + \lambda \|w\|_2^2\,,$$
535 / 635
Outline
536 / 635
Interlude: Large-scale learning with linear models
Let us study for a while optimization techniques for minimizing large sums of functions
$$\min_{w \in \mathbb R^d} \frac{1}{n} \sum_{i=1}^n f_i(w)\,.$$
537 / 635
Introduction of a few optimization principles
Why do we care about convexity?
538 / 635
Introduction of a few optimization principles
Why do we care about convexity?
Local observations give information about the global optimum
[Figures: for a convex function, the tangent at any point lies below the graph, so a local observation (the gradient at the current iterate) constrains where the global optimum can be.]
If f is convex and L-smooth (i.e., $\nabla f$ is L-Lipschitz), consider the gradient descent iterates
$$w_t \leftarrow w_{t-1} - \frac{1}{L} \nabla f(w_{t-1})\,.$$
Then,
$$f(w_t) - f^\star \le \frac{L \|w_0 - w^\star\|_2^2}{2t}\,.$$
Remarks
the convergence rate improves under additional assumptions on f
(strong convexity);
some variants have a O(1/t 2 ) convergence rate [Nesterov, 2004].
541 / 635
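A minimal sketch of gradient descent with step size 1/L on a smooth convex least-squares objective (the problem and constants are illustrative, not from the course):

    import numpy as np

    # f(w) = (1/2n) ||Xw - y||^2; its gradient is L-Lipschitz with L = ||X^T X||_2 / n.
    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true

    L = np.linalg.norm(X.T @ X, 2) / n
    f = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
    grad = lambda w: X.T @ (X @ w - y) / n

    w = np.zeros(d)
    for t in range(500):
        w = w - grad(w) / L          # w_t = w_{t-1} - (1/L) grad f(w_{t-1})
    print(f(w))                      # decreases at least as O(L ||w_0 - w*||^2 / (2t))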
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
$$f(w) \le f(z) + \nabla f(z)^\top (w - z) + \frac{L}{2} \|w - z\|_2^2\,.$$
542 / 635
Proof (1/2)
Proof of the main inequality for smooth functions
We want to show that for all w and z,
$$f(w) \le f(z) + \nabla f(z)^\top (w - z) + \frac{L}{2} \|w - z\|_2^2\,.$$
By using Taylor's theorem with integral form,
$$f(w) - f(z) = \int_0^1 \nabla f(tw + (1-t)z)^\top (w - z)\, dt\,.$$
Then,
$$\begin{aligned}
f(w) - f(z) - \nabla f(z)^\top (w - z) &= \int_0^1 \left( \nabla f(tw + (1-t)z) - \nabla f(z) \right)^\top (w - z)\, dt\\
&\le \int_0^1 \left| \left( \nabla f(tw + (1-t)z) - \nabla f(z) \right)^\top (w - z) \right| dt\\
&\le \int_0^1 \|\nabla f(tw + (1-t)z) - \nabla f(z)\|_2\, \|w - z\|_2\, dt \quad \text{(C.-S.)}\\
&\le \int_0^1 L t \|w - z\|_2^2\, dt = \frac{L}{2} \|w - z\|_2^2\,.
\end{aligned}$$
542 / 635
Proof (2/2)
Proof of the theorem
We have shown that for all w,
$$f(w) \le g_t(w) := f(w_{t-1}) + \nabla f(w_{t-1})^\top (w - w_{t-1}) + \frac{L}{2} \|w - w_{t-1}\|_2^2\,.$$
$g_t$ is minimized by $w_t$; it can be rewritten $g_t(w) = g_t(w_t) + \frac{L}{2} \|w - w_t\|_2^2$. Then,
$$\begin{aligned}
f(w_t) \le g_t(w_t) &= g_t(w^\star) - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&= f(w_{t-1}) + \nabla f(w_{t-1})^\top (w^\star - w_{t-1}) + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&\le f^\star + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\,.
\end{aligned}$$
By summing from t = 1 to T, we have a telescopic sum
$$T (f(w_T) - f^\star) \le \sum_{t=1}^T f(w_t) - f^\star \le \frac{L}{2} \|w^\star - w_0\|_2^2 - \frac{L}{2} \|w^\star - w_T\|_2^2\,.$$
543 / 635
Introduction of a few optimization principles
An important inequality for smooth and μ-strongly convex functions
[Figure: quadratic lower bound on f at the current iterate.]
If f is μ-strongly convex, then for all w,
$$f(w) \ge f^\star + \frac{\mu}{2} \|w - w^\star\|_2^2\,.$$
545 / 635
Proof
We start from an inequality from the previous proof:
$$\begin{aligned}
f(w_t) &\le f(w_{t-1}) + \nabla f(w_{t-1})^\top (w^\star - w_{t-1}) + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\\
&\le f^\star + \frac{L}{2} \|w^\star - w_{t-1}\|_2^2 - \frac{L}{2} \|w^\star - w_t\|_2^2\,.
\end{aligned}$$
In addition, we have that $f(w_t) \ge f^\star + \frac{\mu}{2} \|w_t - w^\star\|_2^2$, and thus
$$\|w^\star - w_t\|_2^2 \le \frac{L}{L + \mu} \|w^\star - w_{t-1}\|_2^2 = \left( 1 - \frac{\mu}{L + \mu} \right) \|w^\star - w_{t-1}\|_2^2\,.$$
Finally,
$$f(w_t) - f^\star \le \frac{L}{2} \|w_t - w^\star\|_2^2 \le \left( 1 - \frac{\mu}{L + \mu} \right)^t \frac{L \|w^\star - w_0\|_2^2}{2}\,.$$
546 / 635
The stochastic (sub)gradient descent algorithm
Consider now the minimization of an expectation
$$\min_{w \in \mathbb R^d} \mathbb E_x[\ell(x, w)]\,.$$
At iteration t, the stochastic (sub)gradient descent algorithm draws a data point $x_t$, performs the update $w_t \leftarrow w_{t-1} - \eta_t \nabla_w \ell(x_t, w_{t-1})$, and possibly averages the iterates, e.g.,
$$\bar w_t \leftarrow (1 - \gamma_t)\, \bar w_{t-1} + \gamma_t\, w_t\,.$$
547 / 635
The stochastic (sub)gradient descent algorithm
There are various learning rate strategies (constant or varying step sizes) and averaging strategies. Depending on the problem assumptions and the choice of $\eta_t, \gamma_t$, classical convergence rates may be obtained:
$f(\bar w_t) - f^\star = O(1/\sqrt t)$ for convex problems;
$f(\bar w_t) - f^\star = O(1/t)$ for strongly convex ones.
Remarks
The convergence rates are not that great, but the complexity
per-iteration is small (1 gradient evaluation for minimizing an
empirical risk versus n for the batch algorithm).
When the amount of data is infinite, the method minimizes the
expected risk.
Choosing a good learning rate automatically is an open problem.
548 / 635
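A minimal SGD sketch on a least-squares problem, with one randomly drawn example per iteration and a decreasing step size (the schedule is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true + 0.1 * rng.randn(n)

    w = np.zeros(d)
    for t in range(1, 20000):
        i = rng.randint(n)
        g = (X[i] @ w - y[i]) * X[i]          # gradient of the i-th loss only
        w -= g / (10.0 + t)                   # decreasing step size eta_t
    print(0.5 * np.mean((X @ w - y) ** 2))    # close to the noise level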
Randomized incremental algorithms (1/2)
Consider now the minimization of a large finite sum of smooth convex functions:
$$\min_{w \in \mathbb R^p} \frac{1}{n} \sum_{i=1}^n f_i(w)\,,$$
SAG algorithm
$$w_t \leftarrow w_{t-1} - \frac{\gamma}{L n} \sum_{i=1}^n y_i^t \qquad \text{with} \qquad y_i^t = \begin{cases} \nabla f_i(w_{t-1}) & \text{if } i = i_t,\\ y_i^{t-1} & \text{otherwise.} \end{cases}$$
See also SAGA [Defazio et al., 2014], SVRG [Xiao and Zhang, 2014],
SDCA [Shalev-Shwartz and Zhang, 2015], MISO [Mairal, 2015];
549 / 635
Randomized incremental algorithms (2/2)
Many of these techniques are in fact performing SGD-type steps
$$w_t \leftarrow w_{t-1} - \gamma_t\, g_t\,,$$
where $\mathbb E[g_t \,|\, w_{t-1}] = \nabla f(w_{t-1})$, but where the estimator of the gradient has lower variance than in SGD, see SVRG [Xiao and Zhang, 2014].
Typically, these methods have the convergence rate
$$f(w_t) - f^\star = O\left( \left( 1 - \frac{C}{\max\left( n, \frac{L}{\mu} \right)} \right)^{t} \right)\,,$$
Remarks
their complexity per-iteration is independent of n!
unlike SGD, they are often almost parameter-free.
besides, they can be accelerated [Lin et al., 2015].
550 / 635
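A minimal sketch of the SAG update above for a least-squares sum (γ = 1 and the data are illustrative choices; the stored gradients $y_i$ are kept in a table):

    import numpy as np

    rng = np.random.RandomState(0)
    n, d = 200, 10
    X, w_true = rng.randn(n, d), rng.randn(d)
    y = X @ w_true + 0.1 * rng.randn(n)

    L = np.max(np.sum(X ** 2, axis=1))        # max per-example smoothness constant
    Y = np.zeros((n, d))                      # stored gradients y_i
    g_sum = np.zeros(d)
    w = np.zeros(d)
    for t in range(20 * n):                   # about 20 passes over the data
        i = rng.randint(n)
        g_new = (X[i] @ w - y[i]) * X[i]      # fresh gradient for index i
        g_sum += g_new - Y[i]                 # update the running sum of stored gradients
        Y[i] = g_new
        w -= g_sum / (L * n)                  # step gamma/(L n) with gamma = 1
    print(0.5 * np.mean((X @ w - y) ** 2))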
Large-scale learning with linear models
Conclusion
we know how to deal with huge-scale linear problems;
this is also useful to learn with kernels!
551 / 635
Outline
552 / 635
Nystrom approximations: principle
Consider a p.d. kernel $K : \mathcal X \times \mathcal X \to \mathbb R$ and its RKHS $\mathcal H$, with the mapping $\varphi : \mathcal X \to \mathcal H$ such that $K(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal H}$.
Motivation
This principle allows us to work explicitly in a finite-dimensional
space; it was introduced several times in the kernel literature [Williams
and Seeger, 2002], [Smola and Scholkopf, 2000], [Fine and Scheinberg, 2001].
553 / 635
Nystrom approximations: principle
[Figure: orthogonal projection of $\varphi(x)$ and $\varphi(x')$ onto a finite-dimensional subspace $\mathcal F = \mathrm{Span}(f_1, \dots, f_p)$ of $\mathcal H$.]
554 / 635
Nystrom approximations: principle
or also
$$\min_{\beta \in \mathbb R^p} \ -2 \sum_{j=1}^p \beta_j f_j(x) + \sum_{j,l=1}^p \beta_j \beta_l \langle f_j, f_l \rangle_{\mathcal H}\,.$$
555 / 635
Nystrom approximations: principle
Then, call $[\mathbf K_f]_{jl} = \langle f_j, f_l \rangle_{\mathcal H}$ and $\mathbf f(x) = [f_1(x), \dots, f_p(x)]$ in $\mathbb R^p$. The problem may be rewritten as
$$\min_{\beta \in \mathbb R^p} -2\, \beta^\top \mathbf f(x) + \beta^\top \mathbf K_f\, \beta\,,$$
whose solution is $\beta^\star(x) = \mathbf K_f^{-1} \mathbf f(x)$, and
$$\langle \Pi\varphi(x), \Pi\varphi(x') \rangle_{\mathcal H} \approx \left\langle \sum_{j=1}^p \beta_j^\star(x) f_j,\ \sum_{j=1}^p \beta_j^\star(x') f_j \right\rangle_{\mathcal H} = \sum_{j,l=1}^p \beta_j^\star(x) \beta_l^\star(x') \langle f_j, f_l \rangle_{\mathcal H} = \beta^\star(x)^\top \mathbf K_f\, \beta^\star(x')\,.$$
556 / 635
Nystrom approximations: principle
This allows us to define the mapping
$$\psi(x) = \mathbf K_f^{1/2} \beta^\star(x) = \mathbf K_f^{-1/2} \mathbf f(x)\,,$$
Remarks
the mapping provides low-rank approximations of the kernel matrix. Given an n x n Gram matrix K computed on a training set $S = \{x_1, \dots, x_n\}$, we have
$$K \approx \psi(S)^\top \psi(S)\,,$$
557 / 635
Nystrom approximation via kernel PCA
Let us now try to learn the $f_j$'s given training data $x_1, \dots, x_n$ in $\mathcal X$:
$$\min_{\substack{f_1, \dots, f_p \in \mathcal H\\ \beta_{ij} \in \mathbb R}} \ \sum_{i=1}^n \left\| \varphi(x_i) - \sum_{j=1}^p \beta_{ij} f_j \right\|_{\mathcal H}^2\,.$$
558 / 635
Nystrom approximation via kernel PCA
Remember the objective:
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,.$$
Consider an optimal solution $\mathbf f^\star$ and compute the eigenvalue decomposition $\mathbf K_{f^\star} = U \Delta U^\top$. Then, define the functions
$$\mathbf g^\star(x) := [g_1^\star(x), \dots, g_p^\star(x)] = \Delta^{-1/2} U^\top \mathbf f^\star(x)\,.$$
The functions $g_j^\star$ are points in the RKHS $\mathcal H$ since they are linear combinations of the functions $f_j^\star$ in $\mathcal H$.
559 / 635
Nystrom approximation via kernel PCA
Remember the objective:
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,.$$
Consider an optimal solution $\mathbf f^\star$ and compute the eigenvalue decomposition $\mathbf K_{f^\star} = U \Delta U^\top$. Then, define the functions
$$\mathbf g^\star(x) := [g_1^\star(x), \dots, g_p^\star(x)] = \Delta^{-1/2} U^\top \mathbf f^\star(x)\,.$$
The functions $g_j^\star$ are points in the RKHS $\mathcal H$ since they are linear combinations of the functions $f_j^\star$ in $\mathcal H$.
Exercise: check that all we do here and in the next slides can be
extended to deal with singular Gram matrices Kf ? and Kf .
559 / 635
Nystrom approximation via kernel PCA
Besides, by construction, $\mathbf K_{g^\star} = \Delta^{-1/2} U^\top \mathbf K_{f^\star} U \Delta^{-1/2} = I$.
560 / 635
Nystrom approximation via kernel PCA
Then, $\mathbf K_{g^\star} = I$ and $\mathbf g^\star$ is also a solution of the problem
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,,$$
since $\mathbf g^\star(x_i)^\top \mathbf K_{g^\star}^{-1} \mathbf g^\star(x_i) = \mathbf f^\star(x_i)^\top \mathbf K_{f^\star}^{-1} \mathbf f^\star(x_i)$ for all i.
561 / 635
Nystrom approximation via kernel PCA
Then, $\mathbf K_{g^\star} = I$ and $\mathbf g^\star$ is also a solution of the problem
$$\max_{f_1, \dots, f_p \in \mathcal H} \sum_{i=1}^n \mathbf f(x_i)^\top \mathbf K_f^{-1} \mathbf f(x_i)\,,$$
since $\mathbf g^\star(x_i)^\top \mathbf K_{g^\star}^{-1} \mathbf g^\star(x_i) = \mathbf f^\star(x_i)^\top \mathbf K_{f^\star}^{-1} \mathbf f^\star(x_i)$ for all i.
561 / 635
Nystrom approximation via kernel PCA
Our first recipe with kernel PCA
Given a dataset of n training points $x_1, \dots, x_n$ in $\mathcal X$,
randomly choose a subset $Z = [x_{z_1}, \dots, x_{z_m}]$ of $m \le n$ training points;
compute the m x m kernel matrix $K_Z$;
perform kernel PCA to find the $p \le m$ largest principal directions (parametrized by p vectors $\alpha_j$ in $\mathbb R^m$).
Then, every point x in $\mathcal X$ may be approximated by
$$\psi(x) = \mathbf K_{g^\star}^{1/2} \mathbf g^\star(x) = \mathbf g^\star(x) = [g_1^\star(x), \dots, g_p^\star(x)]^\top = \left[ \sum_{i=1}^m \alpha_{1i} K(x_{z_i}, x),\ \dots,\ \sum_{i=1}^m \alpha_{pi} K(x_{z_i}, x) \right]^\top\,.$$
562 / 635
Nystrom approximation via kernel PCA
Remarks
The vector $\psi(x)$ can be interpreted as coordinates of the projection of $\varphi(x)$ onto the (orthogonal) PCA basis.
The complexity of training is $O(m^3)$ (eigendecomposition of $K_Z$) + $O(m^2)$ kernel evaluations.
The complexity of encoding a new point x is $O(mp)$ (matrix-vector multiplication) + $O(m)$ kernel evaluations.
563 / 635
Nystrom approximation via kernel PCA
Remarks
The vector $\psi(x)$ can be interpreted as coordinates of the projection of $\varphi(x)$ onto the (orthogonal) PCA basis.
The complexity of training is $O(m^3)$ (eigendecomposition of $K_Z$) + $O(m^2)$ kernel evaluations.
The complexity of encoding a new point x is $O(mp)$ (matrix-vector multiplication) + $O(m)$ kernel evaluations.
The main issue is the encoding time, which depends linearly on m > p.
563 / 635
Nystrom approximation via random sampling
A popular alternative is instead to select the anchor points among the training data points $x_1, \dots, x_n$, that is, to take $f_j = \varphi(x_{z_j})$ for a random subset $\{z_1, \dots, z_p\}$ of the training indices.
564 / 635
Nystrom approximation via random sampling
The complexity of training is $O(p^3)$ (eigendecomposition) + $O(p^2)$ kernel evaluations.
The complexity of encoding a point x is $O(p^2)$ (matrix-vector multiplication) + $O(p)$ kernel evaluations.
565 / 635
Nystrom approximation via random sampling
The complexity of training is $O(p^3)$ (eigendecomposition) + $O(p^2)$ kernel evaluations.
The complexity of encoding a point x is $O(p^2)$ (matrix-vector multiplication) + $O(p)$ kernel evaluations.
The main issue: the complexity is better, but we lose the optimality of the PCA basis, and the random choice of anchor points is not clever.
565 / 635
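A minimal sketch of the Nystrom approximation with randomly sampled anchor points, here for a Gaussian kernel (the kernel, bandwidth and sizes are illustrative choices; the mapping is $\psi(x) = K_Z^{-1/2}[K(z_1,x), \dots, K(z_p,x)]$):

    import numpy as np
    from scipy.linalg import sqrtm

    def gaussian_kernel(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def nystrom_features(X, Z, sigma=1.0):
        """psi(x) = K_Z^{-1/2} [K(z_1, x), ..., K(z_p, x)] for anchor points Z."""
        KZ = gaussian_kernel(Z, Z, sigma)
        W = np.real(np.linalg.inv(sqrtm(KZ + 1e-10 * np.eye(len(Z)))))   # K_Z^{-1/2}
        return gaussian_kernel(X, Z, sigma) @ W

    rng = np.random.RandomState(0)
    X = rng.randn(500, 5)
    Z = X[rng.choice(500, 50, replace=False)]      # random anchor points
    Psi = nystrom_features(X, Z)
    K_approx = Psi @ Psi.T                         # low-rank approximation of K
    K_exact = gaussian_kernel(X, X)
    print(np.abs(K_approx - K_exact).max())        # small approximation error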
Nystrom approximation via greedy approach
Better approximations can be obtained with a greedy algorithm that iteratively selects one column at a time with largest residual (Bach and Jordan, 2002; Smola and Scholkopf, 2000; Fine and Scheinberg, 2001).
The squared residual of the projection of $\varphi(x)$ onto $\mathrm{Span}(f_1, \dots, f_p)$ is equal to
$$\|\varphi(x)\|_{\mathcal H}^2 - \mathbf f_Z(x)^\top K_Z^{-1} \mathbf f_Z(x)\,,$$
and since $f_j = \varphi(x_{z_j})$ for all j, the data point $x_i$ with largest residual is the one that maximizes
$$K(x_i, x_i) - \mathbf f_Z(x_i)^\top K_Z^{-1} \mathbf f_Z(x_i)\,.$$
566 / 635
Nystrom approximation via greedy approach
This brings us to the following algorithm
Third recipe with greedy anchor point selection
Initialize $Z = \emptyset$. For $k = 1, \dots, p$, do the data point selection step: add to Z the point with largest residual.
Remarks
A naive implementation costs $O(k^2 n + k^3)$ at every iteration.
To get a reasonable complexity, one has to use simple linear algebra tricks (see next slide).
567 / 635
Nystrom approximation via greedy approach
If $Z' = Z \cup \{z\}$,
$$K_{Z'}^{-1} = \begin{pmatrix} K_Z & \mathbf f_Z(z)\\ \mathbf f_Z(z)^\top & K(z,z) \end{pmatrix}^{-1} = \begin{pmatrix} K_Z^{-1} + \frac{1}{s} b b^\top & -\frac{1}{s} b\\ -\frac{1}{s} b^\top & \frac{1}{s} \end{pmatrix}\,,$$
where $b = K_Z^{-1} \mathbf f_Z(z)$ and $s = K(z,z) - \mathbf f_Z(z)^\top b$ (Schur complement).
Complexity analysis
$K_{Z'}^{-1}$ can be obtained from $K_Z^{-1}$ and $\mathbf f_Z(z)$ in $O(k^2)$ float operations; for that we need to always keep in memory the n vectors $\mathbf f_Z(x_i)$.
Updating the $\mathbf f_{Z'}(x_i)$'s from $\mathbf f_Z(x_i)$ requires n kernel evaluations.
The total training complexity is $O(p^2 n)$ float operations and $O(pn)$ kernel evaluations.
568 / 635
Nystrom approximation via K-means
When $\mathcal X = \mathbb R^d$, it is also possible to synthesize points $z_1, \dots, z_p$ such that they represent well the training data $x_1, \dots, x_n$, leading to the clustered Nystrom approximation (Zhang and Kwok, 2008).
Remarks
The complexity is the same as Nystrom with random selection
(except for the K-means step);
The method is data-dependent and can significantly outperform the
other variants in practice.
569 / 635
Nystrom approximation: conclusion
Concluding remarks
The greedy selection rule is equivalent to computing an incomplete Cholesky factorization of the kernel matrix (Bach and Jordan, 2002; Scholkopf and Smola, 2000; Fine and Scheinberg, 2001);
The techniques we have seen produce low-rank approximations of the kernel matrix, $K \approx L L^\top$;
The method admits a geometric interpretation in terms of
orthogonal projection onto a finite-dimensional subspace.
The approximation provides points in the RKHS. As such, many
operations on the mapping are valid (translations, linear
combinations, projections), unlike the method that will come next.
570 / 635
Outline
571 / 635
Random Fourier features [Rahimi and Recht, 2007] (1/5)
A large class of approximations for shift-invariant kernels is based on sampling techniques. Consider a real-valued positive-definite continuous translation-invariant kernel $K(x, y) = \kappa(x - y)$ with $\kappa : \mathbb R^d \to \mathbb R$. Then, if $\kappa(0) = 1$, Bochner's theorem tells us that κ is a valid characteristic function for some probability measure:
$$\kappa(z) = \mathbb E_w\!\left[ e^{i w^\top z} \right]\,.$$
572 / 635
Random Fourier features (2/5)
Then,
$$\begin{aligned}
\kappa(x - y) &= \frac{1}{(2\pi)^d} \int_{\mathbb R^d} \hat\kappa(w)\, e^{i w^\top x} e^{-i w^\top y}\, dw\\
&= \int_{\mathbb R^d} q(w) \cos(w^\top x - w^\top y)\, dw\\
&= \int_{\mathbb R^d} q(w) \left( \cos(w^\top x)\cos(w^\top y) + \sin(w^\top x)\sin(w^\top y) \right) dw\\
&= \int_{\mathbb R^d} \int_{b=0}^{2\pi} \frac{q(w)}{2\pi}\, 2 \cos(w^\top x + b) \cos(w^\top y + b)\, dw\, db \quad \text{(exercise)}\\
&= \mathbb E_{w \sim q(w),\, b \sim U[0, 2\pi]} \left[ \sqrt 2 \cos(w^\top x + b) \cdot \sqrt 2 \cos(w^\top y + b) \right]
\end{aligned}$$
573 / 635
Random Fourier features (3/5)
Random Fourier features recipe
Compute the Fourier transform of the kernel and define the probability density $q(w) = \hat\kappa(w)/(2\pi)^d$;
Draw p i.i.d. samples $w_1, \dots, w_p$ from q and p i.i.d. samples $b_1, \dots, b_p$ from the uniform distribution on $[0, 2\pi]$;
define the mapping
$$x \mapsto \psi(x) = \sqrt{\frac{2}{p}} \left[ \cos(w_1^\top x + b_1), \dots, \cos(w_p^\top x + b_p) \right]^\top\,.$$
Then, we have that
$$\kappa(x - y) \approx \langle \psi(x), \psi(y) \rangle_{\mathbb R^p}\,.$$
574 / 635
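A minimal sketch of the recipe for the Gaussian kernel $e^{-\|x-y\|^2/(2\sigma^2)}$, whose spectral density q is $\mathcal N(0, I/\sigma^2)$ (sizes and σ are illustrative choices):

    import numpy as np

    def rff_features(X, p=500, sigma=1.0, seed=0):
        """Random Fourier features for the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2))."""
        rng = np.random.RandomState(seed)
        d = X.shape[1]
        W = rng.randn(p, d) / sigma               # w_j ~ q = N(0, I / sigma^2)
        b = rng.uniform(0, 2 * np.pi, p)          # b_j ~ U[0, 2 pi]
        return np.sqrt(2.0 / p) * np.cos(X @ W.T + b)

    rng = np.random.RandomState(1)
    X = rng.randn(300, 5)
    Psi = rff_features(X, p=2000)
    K_approx = Psi @ Psi.T
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K_exact = np.exp(-d2 / 2.0)                   # sigma = 1
    print(np.abs(K_approx - K_exact).max())       # error decreases as p grows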
Random Fourier features (4/5)
Theorem, [Rahimi and Recht, 2007]
On any compact subset $\mathcal X$ of $\mathbb R^m$, for all $\varepsilon > 0$,
$$P\left[ \sup_{x, y \in \mathcal X} \left| \kappa(x - y) - \langle \psi(x), \psi(y) \rangle_{\mathbb R^p} \right| \ge \varepsilon \right] \le 2^8 \left( \frac{\sigma_q\, \mathrm{diam}(\mathcal X)}{\varepsilon} \right)^2 e^{-\frac{p \varepsilon^2}{4(m+2)}}\,,$$
where $\sigma_q^2 = \mathbb E_{w \sim q}[w^\top w]$.
Remarks
The convergence is uniform, not data dependent;
Take the sequence $\varepsilon_p = \sqrt{\frac{\log p}{p}}\, \sigma_q\, \mathrm{diam}(\mathcal X)$; then the term on the right converges to zero when p grows to infinity;
Prediction functions built with random Fourier features are not in $\mathcal H$.
575 / 635
Random Fourier features (5/5)
Ingredients of the proof
For a fixed pair of points x, y, Hoeffding's inequality says that
$$P\Big[ \big| \underbrace{\kappa(x - y) - \langle \psi(x), \psi(y) \rangle_{\mathbb R^p}}_{f(x,y)} \big| \ge \varepsilon \Big] \le 2\, e^{-\frac{p \varepsilon^2}{4}}\,.$$
Cover $\mathcal X \times \mathcal X$ with balls of radius r and apply a union bound over the ball centers.
Glue things together: control the probability for points (x, y) inside each ball (using the Lipschitz continuity of f), and adjust the radius r (a bit technical).
576 / 635
Outline
2 Kernel tricks
577 / 635
Adaline: a physical neural net for least square regression
Figure: Adaline, [Widrow and Hoff, 1960]: A physical device that performs least
square regression using stochastic gradient descent.
578 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function $f : \mathbb R^p \to \mathbb R$ given labeled training data $(x_i, y_i)_{i=1,\dots,n}$ with $x_i$ in $\mathbb R^p$ and $y_i$ in $\mathbb R$:
$$\min_{f \in \mathcal F}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda \Omega(f)}_{\text{regularization}}\,.$$
579 / 635
A quick zoom on multilayer neural networks
The goal is to learn a prediction function $f : \mathbb R^p \to \mathbb R$ given labeled training data $(x_i, y_i)_{i=1,\dots,n}$ with $x_i$ in $\mathbb R^p$ and $y_i$ in $\mathbb R$:
$$\min_{f \in \mathcal F}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\lambda \Omega(f)}_{\text{regularization}}\,.$$
580 / 635
A quick zoom on convolutional neural networks
581 / 635
A quick zoom on convolutional neural networks
Figure: Picture from Yann LeCun's tutorial, based on Zeiler and Fergus [2014].
582 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
583 / 635
A quick zoom on convolutional neural networks
What are the main features of CNNs?
they capture compositional and multiscale structures in images;
they provide some invariance;
they model local stationarity of images at several scales.
Nonetheless...
they are the focus of a huge academic and industrial effort;
there is efficient and well-documented open-source software.
583 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.
584 / 635
Context of kernel methods
What are the main features of kernel methods?
decoupling of data representation and learning algorithm;
a huge number of unsupervised and supervised algorithms;
typically, convex optimization problems in a supervised context;
versatility: applies to vectors, sequences, graphs, sets,. . . ;
natural regularization function to control the learning capacity;
well studied theoretical framework.
But...
poor scalability in n, at least $O(n^2)$;
decoupling of data representation and learning may not be a good
thing, according to recent supervised deep learning success.
584 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
585 / 635
Context of kernel methods
Challenges
Scaling-up kernel methods with approximate feature maps;
[Williams and Seeger, 2001, Rahimi and Recht, 2007, Vedaldi and
Zisserman, 2012, Le et al., 2013]...
Design data-adaptive and task-adaptive kernels;
Build kernel hierarchies to capture compositional structures.
Introduce supervision in the kernel design.
Remark
there exist already successful data-adaptive kernels that rely on probabilistic models, e.g., the Fisher kernel.
[Jaakkola and Haussler, 1999, Perronnin and Dance, 2007].
585 / 635
Some more motivation
Longer term objectives
build a kernel for images (abstract object), for which we can
precisely quantify the invariance, stability to perturbations,
recovery, and complexity properties.
build deep networks which can be easily regularized.
build deep networks for structured objects (graph, sequences)...
add more geometric interpretation to deep networks.
...
586 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel $K_1 : \mathcal X^2 \to \mathbb R$ and its RKHS $\mathcal H_1$ with mapping $\varphi_1 : \mathcal X \to \mathcal H_1$. Consider also a p.d. kernel $K_2 : \mathcal H_1^2 \to \mathbb R$ and its RKHS $\mathcal H_2$ with mapping $\varphi_2 : \mathcal H_1 \to \mathcal H_2$. Then, $K_3 : \mathcal X^2 \to \mathbb R$ below is also p.d.:
$$K_3(x, x') := K_2(\varphi_1(x), \varphi_1(x'))\,.$$
587 / 635
Basic principles of deep kernel machines: composition
Composition of feature spaces
Consider a p.d. kernel $K_1 : \mathcal X^2 \to \mathbb R$ and its RKHS $\mathcal H_1$ with mapping $\varphi_1 : \mathcal X \to \mathcal H_1$. Consider also a p.d. kernel $K_2 : \mathcal H_1^2 \to \mathbb R$ and its RKHS $\mathcal H_2$ with mapping $\varphi_2 : \mathcal H_1 \to \mathcal H_2$. Then, $K_3 : \mathcal X^2 \to \mathbb R$ below is also p.d.:
$$K_3(x, x') := K_2(\varphi_1(x), \varphi_1(x'))\,.$$
Examples
$$K_3(x, x') = e^{-\frac{1}{2\sigma^2} \|\varphi_1(x) - \varphi_1(x')\|_{\mathcal H_1}^2}\,.$$
587 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: composition
Remarks on the composition of feature spaces
we can iterate the process many times.
the idea appears early in the literature of kernel methods [see
Scholkopf et al., 1998, for a multilayer variant of kernel PCA].
588 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on Rp may be defined as an expectation
589 / 635
Basic principles of deep kernel machines: infinite NN
A large class of kernels on $\mathbb R^p$ may be defined as an expectation, $K(x, y) = \mathbb E_w[s_w(x)\, s_w(y)]$, for some family of simple functions $s_w$ (one "neuron" per w).
Gaussian kernel
$$e^{-\frac{1}{2\sigma^2} \|x - y\|_2^2} = e^{-\frac{\|x\|_2^2}{\sigma^2}}\, e^{-\frac{\|y\|_2^2}{\sigma^2}}\, \mathbb E_w\!\left[ e^{\frac{2}{\sigma^2} w^\top x}\, e^{\frac{2}{\sigma^2} w^\top y} \right] \qquad \text{with } w \sim \mathcal N\!\left(0, (\sigma^2/4) I\right)\,.$$
589 / 635
Basic principles of deep kernel machines: infinite NN
Example, arc-cosine kernels
$$K(x, y) \propto \mathbb E_w\!\left[ \max(w^\top x, 0)\, \max(w^\top y, 0) \right] \qquad \text{with } w \sim \mathcal N(0, I)\,,$$
Remarks
infinite neural nets were discovered by Neal, 1994; then revisited
many times [Le Roux, 2007, Cho and Saul, 2009].
the concept does not lead to more powerful kernel methods...
590 / 635
Basic principles of DKM: dot-product kernels
Another basic link between kernels and neural networks can be obtained
by considering dot-product kernels.
Proposition
Let $\mathcal X = \mathbb S^{d-1}$ be the unit sphere of $\mathbb R^d$. The kernel $K : \mathcal X^2 \to \mathbb R$
$$K(x, y) = \kappa(\langle x, y \rangle_{\mathbb R^d})$$
is positive definite as soon as κ admits a Taylor expansion with non-negative coefficients, $\kappa(u) = \sum_{j \ge 0} a_j u^j$ with $a_j \ge 0$.
Remarks
the proposition holds if $\mathcal X$ is the unit sphere of some Hilbert space and $\langle x, y \rangle_{\mathbb R^d}$ is replaced by the corresponding inner product.
591 / 635
Basic principles of DKM: dot-product kernels
The Nystrom method consists of replacing any point $\varphi(x)$ in $\mathcal H$, for x in $\mathcal X$, by its orthogonal projection onto the finite-dimensional subspace
$$\mathcal F = \mathrm{Span}\big( \varphi(z_1), \dots, \varphi(z_p) \big)$$
spanned by a set of anchor points $z_1, \dots, z_p$
[Williams and Seeger, 2001, Smola and Scholkopf, 2000, Fine and Scheinberg, 2001].
592 / 635
Basic principles of DKM: dot-product kernels
The projection is equivalent to
$$\Pi_{\mathcal F}[x] := \sum_{j=1}^p \beta_j^\star \varphi(z_j) \qquad \text{with} \qquad \beta^\star \in \operatorname*{argmin}_{\beta \in \mathbb R^p} \left\| \varphi(x) - \sum_{j=1}^p \beta_j \varphi(z_j) \right\|_{\mathcal H}^2\,,$$
which leads to the finite-dimensional mapping
$$\psi(x) = \kappa(Z^\top Z)^{-1/2}\, \kappa(Z^\top x)\,,$$
where the function κ is applied pointwise to its arguments. The resulting ψ can be interpreted as a neural network performing (i) a linear operation, (ii) a pointwise non-linearity, (iii) a linear operation.
593 / 635
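A minimal sketch of this interpretation, with an illustrative dot-product kernel $\kappa(u) = e^{\alpha(u-1)}$ (a Gaussian kernel on the unit sphere); Z, α and the data are arbitrary choices, not the course's parameters:

    import numpy as np
    from scipy.linalg import sqrtm

    def kappa(u, alpha=2.0):
        """Dot-product kernel applied pointwise: kappa(u) = exp(alpha (u - 1))."""
        return np.exp(alpha * (u - 1.0))

    def nystrom_layer(X, Z, alpha=2.0):
        """psi(x) = kappa(Z^T Z)^{-1/2} kappa(Z^T x):
        linear map, pointwise non-linearity, then another linear map."""
        A = np.real(np.linalg.inv(sqrtm(kappa(Z @ Z.T, alpha))))   # kappa(Z^T Z)^{-1/2}
        return kappa(X @ Z.T, alpha) @ A

    # Unit-norm data and anchor points (the kernel is defined on the sphere).
    rng = np.random.RandomState(0)
    X = rng.randn(200, 10); X /= np.linalg.norm(X, axis=1, keepdims=True)
    Z = rng.randn(30, 10);  Z /= np.linalg.norm(Z, axis=1, keepdims=True)

    Psi = nystrom_layer(X, Z)
    K_approx = Psi @ Psi.T
    K_exact = kappa(X @ X.T)
    print(np.abs(K_approx - K_exact).mean())      # approximation quality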
Convolutional kernel networks [Mairal et al., 2014, Mairal, 2016]
594 / 635
Related work
proof of concept for combining kernels and deep learning [Cho and
Saul, 2009];
hierarchical kernel descriptors [Bo et al., 2011];
other multilayer models [Bouvrie et al., 2009, Montavon et al., 2011,
Anselmi et al., 2015];
deep Gaussian processes [Damianou and Lawrence, 2013].
multilayer PCA [Scholkopf et al., 1998].
old kernels for images [Scholkopf, 1997].
RBF networks [Broomhead and Lowe, 1988].
595 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$; $I_1$ is obtained from $I_0$ by extracting patches $P_0$, mapping them to $\mathcal H_1$, and linear pooling.]
596 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
How do we go from $I_0 : \Omega_0 \to \mathcal H_0$ to $I_1 : \Omega_1 \to \mathcal H_1$?
597 / 635
The multilayer convolutional kernel
Definition: image feature maps
An image feature map is a function $I : \Omega \to \mathcal H$, where Ω is a 2D grid representing coordinates in the image and $\mathcal H$ is a Hilbert space.
How do we go from $I_0 : \Omega_0 \to \mathcal H_0$ to $I_1 : \Omega_1 \to \mathcal H_1$?
597 / 635
The multilayer convolutional kernel
Going from $I_0$ to $I_{0.5}$: kernel trick
Patches of size $e_0 \times e_0$ can be defined as elements of the Cartesian product $\mathcal P_0 := \mathcal H_0^{e_0 \times e_0}$ endowed with its natural inner product.
Define a p.d. kernel on such patches: for all $x, x'$ in $\mathcal P_0$,
$$K_1(x, x') = \|x\|_{\mathcal P_0} \|x'\|_{\mathcal P_0}\, \kappa_1\!\left( \frac{\langle x, x' \rangle_{\mathcal P_0}}{\|x\|_{\mathcal P_0} \|x'\|_{\mathcal P_0}} \right) \ \text{if } x, x' \neq 0 \quad \text{and} \quad 0 \ \text{otherwise.}$$
598 / 635
The multilayer convolutional kernel
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$, obtained by patch extraction, kernel mapping, and linear pooling.]
599 / 635
The multilayer convolutional kernel
[Figure: feature maps $I_0 : \Omega_0 \to \mathcal H_0$ and $I_1 : \Omega_1 \to \mathcal H_1$, obtained by patch extraction, kernel mapping, and linear pooling.]
Finally,
We may now repeat the process and build $I_0, I_1, \dots, I_k$,
and obtain the multilayer convolutional kernel
$$K(I_k, I_k') = \sum_{\omega \in \Omega_k} \langle I_k(\omega), I_k'(\omega) \rangle_{\mathcal H_k}\,.$$
600 / 635
The multilayer convolutional kernel
In summary
The multilayer convolutional kernel builds upon similar principles as
a convolutional neural net (multiscale, local stationarity).
It remains a conceptual object due to its high complexity.
Learning and modelling are still decoupled.
Let us first address the second point (scalability).
601 / 635
Unsupervised learning for convolutional kernel networks
Learn linear subspaces of finite dimension where we project the data.
[Figure: from the map $M_0$, each patch x is mapped to the Hilbert space $\mathcal H_1$ by the kernel trick, projected onto the finite-dimensional subspace $\mathcal F_1$ (map $M_{0.5}$), and linearly pooled to produce $M_1$.]
602 / 635
Unsupervised learning for convolutional kernel networks
Formally, this means using the Nystrom approximation.
We now manipulate finite-dimensional maps $M_j : \Omega_j \to \mathbb R^{p_j}$.
Every linear subspace is parametrized by anchor points
$$\mathcal F_j := \mathrm{Span}\big( \varphi(z_{j,1}), \dots, \varphi(z_{j,p_j}) \big)\,,$$
where the $z_{j,i}$'s are in $\mathbb R^{p_{j-1} e_{j-1}^2}$ for patches of size $e_{j-1} \times e_{j-1}$.
The encoding function at layer j is
$$\psi_j(x) := \|x\|\, \kappa_j(Z_j^\top Z_j)^{-1/2}\, \kappa_j\!\left( Z_j^\top \frac{x}{\|x\|} \right) \ \text{if } x \neq 0 \quad \text{and} \quad 0 \ \text{otherwise,}$$
603 / 635
Unsupervised learning for convolutional kernel networks
The pooling operation keeps points in the linear subspace $\mathcal F_j$, and pooling $M_{0.5} : \Omega_0 \to \mathbb R^{p_1}$ is equivalent to pooling $I_{0.5} : \Omega_0 \to \mathcal H_1$.
[Figure: same diagram as before: kernel trick, projection onto $\mathcal F_1$, and linear pooling.]
604 / 635
Unsupervised learning for convolutional kernel networks
How do we learn the filters with no supervision?
we learn one layer at a time, starting from the bottom one;
we extract a large number (say 100 000) of patches from layer j-1 computed on an image database and normalize them;
perform a spherical K-means algorithm to learn the filters $Z_j$;
compute the projection matrix $\kappa_j(Z_j^\top Z_j)^{-1/2}$.
Remarks
with kernels, we map patches in infinite dimension; with the
projection, we manipulate finite-dimensional objects.
we obtain an unsupervised convolutional net with a geometric
interpretation, where we perform projections in the RKHSs.
605 / 635
Unsupervised learning for convolutional kernel networks
Remark on input image pre-processing
Unsupervised CKNs are sensitive to pre-processing; we have tested
RAW RGB input;
local centering of every color channel;
local whitening of each color channel;
2D image gradients.
With 2D image gradients, each pixel carries a two-dimensional vector
$$x = \rho\, [\cos(\theta), \sin(\theta)]\,,$$
where ρ is the gradient magnitude and θ its orientation, and the filters can be chosen as
$$z_j = [\cos(\theta_j), \sin(\theta_j)] \qquad \text{with } \theta_j = 2\pi j / p_0\,.$$
Then, the vector $\psi(x) = \|x\|\, \kappa(Z^\top Z)^{-1/2} \kappa\!\left( Z^\top \frac{x}{\|x\|} \right)$ can be interpreted as a soft-binning of the gradient orientation.
After pooling, the representation of this first layer is very close to SIFT/HOG descriptors [see Bo et al., 2011].
607 / 635
Convolutional kernel networks with supervised learning
How do we learn the filters with supervision?
Given a kernel K and RKHS $\mathcal H$, the ERM objective is
$$\min_{f \in \mathcal H}\ \underbrace{\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))}_{\text{empirical risk, data fit}} + \underbrace{\frac{\lambda}{2} \|f\|_{\mathcal H}^2}_{\text{regularization}}\,.$$
609 / 635
Convolutional kernel networks
In summary
a multilayer kernel for images, which builds upon similar principles
as a convolutional neural net (multiscale, local stationarity).
A new type of convolutional neural network with a geometric
interpretation: orthogonal projections in RKHS.
Learning may be unsupervised: align subspaces with data.
Learning may be supervised: subspace learning in RKHSs.
610 / 635
Image classification
Experiments were conducted on classical deep learning datasets, on
CPUs with no model averaging and no data augmentation.
Dataset     #classes   im. size   n_train    n_test
CIFAR-10    10         32 x 32    50 000     10 000
SVHN        10         32 x 32    604 388    26 032
Remarks on CIFAR-10
10% is the standard good result for single model with no data
augmentation.
the best unsupervised architecture has two layers, is wide
(1024-16384 filters), and achieves 14.2%;
611 / 635
Image super-resolution
The task is to predict a high-resolution image y from a low-resolution one x. This may be formulated as a multivariate regression problem.
612 / 635
Image super-resolution
The task is to predict a high-resolution image y from a low-resolution one x. This may be formulated as a multivariate regression problem.
612 / 635
Image super-resolution
Fact.   Dataset   Bicubic   SC      CNN     CSCN    SCKN
x2      Set5      33.66     35.78   36.66   36.93   37.07
x2      Set14     30.23     31.80   32.45   32.56   32.76
x2      Kodim     30.84     32.19   32.80   32.94   33.21
x3      Set5      30.39     31.90   32.75   33.10   33.08
x3      Set14     27.54     28.67   29.29   29.41   29.50
x3      Kodim     28.43     29.21   29.64   29.76   29.88
[Zeyde et al., 2010, Dong et al., 2016, Wang et al., 2015, Kim et al., 2016].
613 / 635
Image super-resolution
614 / 635
Image super-resolution
Figure: Bicubic
615 / 635
Image super-resolution
Figure: SCKN
615 / 635
Image super-resolution
616 / 635
Image super-resolution
Figure: Bicubic
617 / 635
Image super-resolution
Figure: SCKN
617 / 635
Image super-resolution
618 / 635
Image super-resolution
Figure: Bicubic
619 / 635
Image super-resolution
Figure: SCKN
619 / 635
Image super-resolution
620 / 635
Image super-resolution
Figure: Bicubic
621 / 635
Image super-resolution
Figure: SCKN
621 / 635
Conclusion of the course
622 / 635
What we saw
Basic definitions of p.d. kernels and RKHS
How to use RKHS in machine learning
The importance of the choice of kernels, and how to include prior
knowledge there.
Several approaches for kernel design (there are many!)
Review of kernels for strings and on graphs
Recent research topics about kernel methods
623 / 635
What we did not see
624 / 635
References I
F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical
kernel machines. arXiv preprint arXiv:1508.01084, 2015.
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337-404, 1950.
URL http://www.jstor.org/stable/1990404.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In Proceedings of the Twenty-First International Conference on
Machine Learning, page 6, New York, NY, USA, 2004. ACM. doi:
http://doi.acm.org/10.1145/1015330.1015424.
P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Technical
Report 638, UC Berkeley Statistics, 2003.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups.
Springer-Verlag, New-York, 1984.
L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In
Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05:
Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81,
Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2278-5. doi:
http://dx.doi.org/10.1109/ICDM.2005.132.
J. V. Bouvrie, L. Rosasco, and T. Poggio. On invariance in hierarchical models. In Adv. NIPS,
2009.
625 / 635
References II
D. S. Broomhead and D. Lowe. Radial basis functions, multi-variable functional interpolation
and adaptive networks. Technical report, DTIC Document, 1988.
Y. Cho and L. K. Saul. Kernel methods for deep learning. In Adv. NIPS, 2009.
M. Cuturi and J.-P. Vert. The context-tree kernel for strings. Neural Networks, 18(4):
1111-1123, 2005. doi: 10.1016/j.neunet.2005.07.010. URL
http://dx.doi.org/10.1016/j.neunet.2005.07.010.
M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. Mach. Learn. Res.,
6:1169-1198, 2005. URL http://jmlr.csail.mit.edu/papers/v6/cuturi05a.html.
A. Damianou and N. Lawrence. Deep Gaussian processes. In Proc. AISTATS, 2013.
A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with
support for non-strongly convex composite objectives. In Advances in Neural Information
Processing Systems (NIPS), pages 1646-1654, 2014.
C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional
networks. IEEE T. Pattern Anal., 38(2):295-307, 2016.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J.
Mach. Learn. Res., 2:243-264, 2001.
626 / 635
References III
T. Gartner, P. Flach, and S. Wrobel. On graph kernels: hardness results and efficient
alternatives. In B. Scholkopf and M. Warmuth, editors, Proceedings of the Sixteenth
Annual Conference on Computational Learning Theory and the Seventh Annual Workshop
on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129-143,
Heidelberg, 2003. Springer. doi: 10.1007/b12006. URL
http://dx.doi.org/10.1007/b12006.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2007), pages 1-8. IEEE Computer Society, 2007. doi: 10.1109/CVPR.2007.383049. URL
http://dx.doi.org/10.1109/CVPR.2007.383049.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10,
UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning
techniques for the identification of mutagenicity inducing substructures and structure
activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4):
1402-11, 2004. doi: 10.1021/ci034254q. URL http://dx.doi.org/10.1021/ci034254q.
T. Jaakkola, M. Diekhans, and D. Haussler. A Discriminative Framework for Detecting
Remote Protein Homologies. J. Comput. Biol., 7(1,2):95-114, 2000. URL
http://www.cse.ucsc.edu/research/compbio/discriminative/Jaakola2-1998.ps.
627 / 635
References IV
T. S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In
Proc. of Tenth Conference on Advances in Neural Information Processing Systems, 1999.
URL http://www.cse.ucsc.edu/research/ml/papers/Jaakola.ps.
H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In
T. Faucett and N. Mishra, editors, Proceedings of the Twentieth International Conference
on Machine Learning, pages 321-328, New York, NY, USA, 2003. AAAI Press.
J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep
convolutional networks. In Proc. CVPR, 2016.
T. Kin, K. Tsuda, and K. Asai. Marginalized kernels for RNA sequence data analysis. In
R. Lathtop, K. Nakai, S. Miyano, T. Takagi, and M. Kanehisa, editors, Genome Informatics
2002, pages 112-122. Universal Academic Press, 2002. URL
http://www.jsbi.org/journal/GIW02/GIW02F012.html.
R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In
Proceedings of the Nineteenth International Conference on Machine Learning, pages
315-322, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel
matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, 2004a. URL
http://www.jmlr.org/papers/v5/lanckriet04a.html.
628 / 635
References V
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical
framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004b. doi:
10.1093/bioinformatics/bth294. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/16/2626.
Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood - computing hilbert space expansions in
loglinear time. In Proceedings of the 30th International Conference on Machine Learning,
ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Proceedings, pages
244-252, 2013. URL http://jmlr.org/proceedings/papers/v28/le13.html.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J.
Mach. Learn. Res., 5:1435-1455, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein
classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein,
editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564-575,
Singapore, 2002. World Scientific.
L. Liao and W. Noble. Combining Pairwise Sequence Similarity and Support Vector Machines
for Detecting Remote Protein Evolutionary and Structural Relationships. J. Comput. Biol.,
10(6):857-868, 2003. URL
http://www.liebertonline.com/doi/abs/10.1089/106652703322756113.
629 / 635
References VI
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In
Advances in Neural Information Processing Systems (NIPS), 2015.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification
using string kernels. J. Mach. Learn. Res., 2:419-444, 2002. URL
http://www.ai.mit.edu/projects/jmlr/papers/volume2/lodhi02a/abstract.html.
B. Logan, P. Moreno, B. Suzek, Z. Weng, and S. Kasif. A Study of Remote Homology
Detection. Technical Report CRL 2001/05, Compaq Cambridge Research laboratory, June
2001.
P. Mahe and J. P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75
(1):3-35, 2009. doi: 10.1007/s10994-008-5086-2. URL
http://dx.doi.org/10.1007/s10994-008-5086-2.
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph
kernels. In R. Greiner and D. Schuurmans, editors, Proceedings of the Twenty-First
International Conference on Machine Learning (ICML 2004), pages 552–559. ACM Press,
2004.
P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Graph kernels for molecular
structure-activity relationship analysis with support vector machines. J. Chem. Inf. Model.,
45(4):939–951, 2005. doi: 10.1021/ci050039t. URL
http://dx.doi.org/10.1021/ci050039t.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale
machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
630 / 635
References VII
J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In
Advances in Neural Information Processing Systems, pages 1399–1407, 2016.
J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In
Advances in Neural Information Processing Systems, 2014.
C. Micchelli and M. Pontil. Learning the kernel function via regularization. J. Mach. Learn.
Res., 6:1099–1125, 2005. URL http://jmlr.org/papers/v6/micchelli05a.html.
G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of
Machine Learning Research, 12(Sep):2563–2581, 2011.
C. Müller. Analysis of spherical symmetries in Euclidean spaces, volume 129 of Applied
Mathematical Sciences. Springer, 1998.
Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic
Publishers, 2004.
A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.
F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Adv. NIPS, 2007.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res.,
9:2491–2521, 2008. URL http://jmlr.org/papers/v9/rakotomamonjy08a.html.
631 / 635
References VIII
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and
L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs,
Trees and Sequences, pages 65–74, 2003.
F. Rapaport, A. Zinovyev, M. Dutreix, E. Barillot, and J.-P. Vert. Classification of microarray
data using gene networks. BMC Bioinformatics, 8:35, 2007. doi: 10.1186/1471-2105-8-35.
URL http://dx.doi.org/10.1186/1471-2105-8-35.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string
alignment kernels. Bioinformatics, 20(11):1682–1689, 2004. URL
http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average
gradient. Mathematical Programming, 2016.
B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002. URL
http://www.learning-with-kernels.org.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT
Press, Cambridge, Massachusetts, 2004.
632 / 635
References IX
M. Seeger. Covariance Kernels from Bayesian Generative Models. In Adv. Neural Inform.
Process. Syst., volume 14, pages 905–912, 2002.
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for
regularized loss minimization. Mathematical Programming, 2015.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge
University Press, New York, NY, USA, 2004a.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University
Press, 2004b.
N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural
Information Processing Systems, pages 1660–1668, 2009.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient
graphlet kernels for large graph comparison. In 12th International Conference on Artificial
Intelligence and Statistics (AISTATS), pages 488–495, Clearwater Beach, Florida USA,
2009. Society for Artificial Intelligence and Statistics.
N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt.
Weisfeiler-Lehman graph kernels. The Journal of Machine Learning Research, 12:
2539–2561, 2011.
T. Smith and M. Waterman. Identification of common molecular subsequences. J. Mol. Biol.,
147:195–197, 1981.
A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.
633 / 635
References X
K. Tsuda, M. Kawanabe, G. Rätsch, S. Sonnenburg, and K.-R. Müller. A new discriminative
kernel from probabilistic models. Neural Computation, 14(10):2397–2414, 2002a. doi:
10.1162/08997660260293274. URL http://dx.doi.org/10.1162/08997660260293274.
K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics,
18:S268–S275, 2002b.
V. N. Vapnik. Statistical Learning Theory. Wiley, New-York, 1998.
A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
J.-P. Vert, H. Saigo, and T. Akutsu. Local alignment kernels for biological sequences. In
B. Schölkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology,
pages 131–154. MIT Press, Cambridge, Massachusetts, 2004.
J.-P. Vert, R. Thurman, and W. S. Noble. Kernels for gene regulatory regions. In Y. Weiss,
B. Schölkopf, and J. Platt, editors, Adv. Neural. Inform. Process Syst., volume 18, pages
1401–1408, Cambridge, MA, 2006. MIT Press.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang. Deep networks for image super-resolution
with sparse prior. In Proc. ICCV, 2015.
B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra
arising during this reduction. Nauchno-Technicheskaya Informatsia, Ser. 2, 9, 1968.
634 / 635
References XI
B. Widrow and M. E. Hoff. Adaptive switching circuits. In IRE WESCON convention record,
volume 4, pages 96–104. New York, 1960.
C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Adv.
NIPS, 2001.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance
reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic
data: a supervised approach. Bioinformatics, 20:i363–i370, 2004. URL
http://bioinformatics.oupjournals.org/cgi/reprint/19/suppl_1/i323.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision (ECCV), 2014.
R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In
Curves and Surfaces, pages 711–730. 2010.
635 / 635