MVA - Slides: Machine Learning with Kernel Methods
Over the years, the course has become more and more exhaustive,
and the slides are probably one of the best references available on
kernels.
This is a course with a fairly large amount of math, but it remains
accessible to computer scientists who have heard of a Hilbert
space (at least once in their life).
What are the main limitations of neural networks?
Poor theoretical understanding.
They require cumbersome hyper-parameter tuning.
They are hard to regularize.
Despite these shortcomings, they have had an enormous success, thanks
to large amounts of labeled data, computational power and engineering.
A concrete supervised learning problem
min_{f ∈ F}  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ Ω(f),
where the first term is the empirical risk (data fit) and the second term is the regularization.
Practical
Course homepage with slides, schedule, homework, etc.:
http://lear.inrialpes.fr/people/mairal/teaching/2015-2016/MVA/.
Evaluation: 50% homework + 50% data challenge.
The approach
Develop methods based on pairwise comparisons.
By imposing constraints on the pairwise comparison function
(positive definite kernels), we obtain a general framework for
learning from data (optimization in RKHS).
Example: the set of sequences S = (aatcgagtcac, atggacgtct, tgcactact) is represented by the 3 × 3 matrix
K = [ 1    0.5  0.3
      0.5  1    0.6
      0.3  0.6  1 ].
Idea
Define a "comparison function" K : X × X → R.
Represent a set of n data points S = {x_1, x_2, . . . , x_n} by the n × n matrix:
[K]_{ij} := K(x_i, x_j).
∀ (x, x') ∈ X²,  K(x, x') = K(x', x),
∀ (x, x') ∈ X²,  K(x, x') = ⟨x, x'⟩_{R^d}.
Proof
⟨x, x'⟩_{R^d} = ⟨x', x⟩_{R^d},
Σ_{i=1}^N Σ_{j=1}^N a_i a_j ⟨x_i, x_j⟩_{R^d} = ‖ Σ_{i=1}^N a_i x_i ‖²_{R^d} ≥ 0.
Lemma
Let X be any set, and Φ : X → R^d. Then, the function K : X × X → R defined as follows is p.d.:
∀ (x, x') ∈ X²,  K(x, x') = ⟨Φ(x), Φ(x')⟩_{R^d}.
Proof
⟨Φ(x), Φ(x')⟩_{R^d} = ⟨Φ(x'), Φ(x)⟩_{R^d},
Σ_{i=1}^N Σ_{j=1}^N a_i a_j ⟨Φ(x_i), Φ(x_j)⟩_{R^d} = ‖ Σ_{i=1}^N a_i Φ(x_i) ‖²_{R^d} ≥ 0.
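A minimal numerical illustration of this lemma, not from the slides: build the Gram matrix of a kernel defined through an explicit feature map and verify that it is positive semidefinite. The data and the feature map below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))          # 10 toy points in R^2

def phi(x):
    # example feature map Phi : R^2 -> R^3 (the polynomial map used just below)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Phi = np.array([phi(x) for x in X])       # n x 3 matrix of mapped points
K = Phi @ Phi.T                           # [K]_ij = <Phi(x_i), Phi(x_j)>

# K is symmetric and its eigenvalues are (numerically) nonnegative.
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)
```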
For x⃗ = (x_1, x_2)ᵀ ∈ R², let Φ(x⃗) = (x_1², √2 x_1 x_2, x_2²) ∈ R³. Then
K(x, x') = ⟨Φ(x), Φ(x')⟩_H.
[Figure: the feature map φ : X → F.]
with
Φ(x_i) = ( √λ_1 u_1(i), . . . , √λ_N u_N(i) )ᵀ.
Proof: general case
∀x ∈ X,  K_x : t ↦ K(x, t).
f(x) = ⟨f, K_x⟩_H.
F : H → R, f ↦ f(x) is continuous.
Corollary
Convergence in a RKHS implies pointwise convergence, i.e., if (fn )n∈N
converges to f in H, then (fn (x))n∈N converges to f (x) for any x ∈ X .
|f(x)| = |⟨f, K_x⟩_H|
       ≤ ‖f‖_H ‖K_x‖_H   (Cauchy-Schwarz)
       ≤ ‖f‖_H K(x, x)^{1/2},
f(x) = ⟨f, g_x⟩_H.
Consequence
This shows that we can talk of ”the” kernel of a RKHS, or ”the” RKHS
of a kernel.
This shows that K_x = K'_x as functions, i.e., K_x(y) = K'_x(y) for any y ∈ X. In other words, K = K'.
let:
⟨f, g⟩_{H_0} := Σ_{i,j} a_i b_j K(x_i, y_j).
⟨f, K_x⟩_{H_0} = f(x),
therefore ‖f‖_{H_0} = 0 ⟹ f = 0.
H_0 is therefore a pre-Hilbert space endowed with the inner product ⟨·, ·⟩_{H_0}.
Therefore for any x the sequence (fn (x))n≥0 is Cauchy in R and has
therefore a limit.
If we add to H0 the functions defined as the pointwise limits of
Cauchy sequences, then the space becomes complete and is
therefore a Hilbert space, with K as r.k. (up to a few technicalities,
left as exercise).
K(x, x') = ⟨Φ(x), Φ(x')⟩_H.
[Figure: the feature map φ : X → F.]
∀x ∈ X,  Φ(x) = K_x.
Theorem
The RKHS of the linear kernel is the set of linear functions of the form f_w(x) = ⟨w, x⟩_{R^d}, with norm
‖f_w‖_H = ‖w‖_2.
Remark
All points x in X are mapped to a rank-one matrix xxᵀ. Most points in H do not admit a pre-image.
Exercise: what is the RKHS of the general polynomial kernel?
Combining kernels
Theorem
If K1 and K2 are p.d. kernels, then:
K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0,
∀ (x, x') ∈ X²,  K(x, x') = lim_{i→∞} K_i(x, x'),
X = R,   K(x, x') = cos(x + x')
X = R,   K(x, x') = cos(x − x')
X = R_+, K(x, x') = min(x, x')
X = R_+, K(x, x') = max(x, x')
X = R_+, K(x, x') = min(x, x') / max(x, x')
X = N,   K(x, x') = GCD(x, x')
X = N,   K(x, x') = LCM(x, x')
X = N,   K(x, x') = GCD(x, x') / LCM(x, x')
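A hedged way to build intuition for this exercise (not a proof, and not from the slides): sample a few points, compute the Gram matrix of each candidate kernel, and look at its smallest eigenvalue. A clearly negative eigenvalue certifies that a kernel is not p.d.; nonnegative eigenvalues on random samples only suggest p.d.-ness.

```python
import numpy as np
from math import gcd, cos

rng = np.random.default_rng(0)

def min_eig(k, xs):
    K = np.array([[k(x, y) for y in xs] for x in xs])
    return np.linalg.eigvalsh(K).min()

xs_real = rng.uniform(-5, 5, size=8)
xs_pos = rng.uniform(0.1, 5, size=8)
xs_int = rng.integers(1, 50, size=8)

candidates = {
    "cos(x+x')": (lambda x, y: cos(x + y), xs_real),
    "cos(x-x')": (lambda x, y: cos(x - y), xs_real),
    "min(x,x')": (lambda x, y: min(x, y), xs_pos),
    "max(x,x')": (lambda x, y: max(x, y), xs_pos),
    "min/max":   (lambda x, y: min(x, y) / max(x, y), xs_pos),
    "GCD":       (lambda x, y: gcd(int(x), int(y)), xs_int),
    "LCM":       (lambda x, y: int(x) * int(y) // gcd(int(x), int(y)), xs_int),
    "GCD/LCM":   (lambda x, y: gcd(int(x), int(y)) ** 2 / (int(x) * int(y)), xs_int),
}
for name, (k, xs) in candidates.items():
    print(f"{name:10s} smallest Gram eigenvalue: {min_eig(k, xs): .3e}")
```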
|f(x) − f(x')| = |⟨f, K_x − K_{x'}⟩_H|
              ≤ ‖f‖_H × ‖K_x − K_{x'}‖_H
              = ‖f‖_H × d_K(x, x').
The norm of a function in the RKHS controls how fast the function
varies over X with respect to the geometry defined by the kernel
(Lipschitz with constant k f kH ).
Important message
∀ (x, x') ∈ X²,  K(x, x') = ⟨Φ(x), Φ(x')⟩_H.
Kernel trick
Any algorithm to process finite-dimensional vectors that can be
expressed only in terms of pairwise inner products can be applied to
potentially infinite-dimensional vectors in the feature space of a p.d.
kernel by replacing each inner product evaluation by a kernel evaluation.
[Figure: the feature map φ : X → F maps x_1, x_2 to φ(x_1), φ(x_2); the distance d(x_1, x_2) is computed in the feature space.]
For the Gaussian kernel, K(x, x) = 1 = ‖Φ(x)‖²_H, so all points are mapped to the unit sphere of the feature space.
The distance between the images of two points x and y in the feature space is
d_K(x, y) = sqrt( 2 ( 1 − e^{−‖x−y‖²/(2σ²)} ) ).
[Figure: plot of d(x, y) as a function of x − y.]
d_K(x, S) := ‖Φ(x) − µ‖_H.
Kernel trick
d_K(x, S) = ‖ Φ(x) − (1/n) Σ_{i=1}^n Φ(x_i) ‖_H
          = sqrt( K(x, x) − (2/n) Σ_{i=1}^n K(x, x_i) + (1/n²) Σ_{i=1}^n Σ_{j=1}^n K(x_i, x_j) ).
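A small sketch of this kernel trick (an illustration, not the course's code): the distance in feature space between Φ(x) and the mean of the mapped set S is computed from kernel evaluations only, here with a Gaussian kernel as an example.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def dist_to_mean(x, S, kernel):
    # d_K(x, S)^2 = K(x,x) - (2/n) sum_i K(x, x_i) + (1/n^2) sum_ij K(x_i, x_j)
    k_xx = kernel(x, x)
    k_xS = np.mean([kernel(x, s) for s in S])
    k_SS = np.mean([[kernel(si, sj) for sj in S] for si in S])
    return np.sqrt(k_xx - 2 * k_xS + k_SS)

S = [np.array([2.0]), np.array([3.0])]        # the toy set S = {2, 3} used just below
print(dist_to_mean(np.array([2.5]), S, gaussian_kernel))
```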
Example: S = {2, 3}. Plot of f(x) = d(x, S) for three kernels:
k(x, y) = xy (linear);  k(x, y) = e^{−(x−y)²/(2σ²)} with σ = 1;  k(x, y) = e^{−(x−y)²/(2σ²)} with σ = 0.2.
[Figure: the corresponding curves of d(x, S) over x ∈ [0, 5].]
[Figure: the feature map φ : X → F.]
K_c = K − UK − KU + UKU = (I − U) K (I − U),
Kernel Methods
Supervised Learning
An important property
When needed, the RKHS norm acts as a natural regularization function
that penalizes variations of functions.
f = f_S + f_⊥,
with f_S ∈ H_K^S and f_⊥ ⊥ H_K^S (by orthogonal projection). Then
∀i = 1, · · · , n,  f(x_i) = f_S(x_i).
Least-squares regression
L(f(x), y) = (y − f(x))².
therefore f = f'.
One solution to the initial problem is therefore:
f̂(x) = Σ_{i=1}^n α_i K(x_i, x),
with
α = (K + λnI)^{−1} y.
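A minimal kernel ridge regression sketch implementing the closed form above, α = (K + λnI)^{−1} y, on a toy one-dimensional dataset (an illustration, not the course's code).

```python
import numpy as np

def gaussian_gram(X, Y, sigma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)

lam = 0.1
n = X.shape[0]
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)   # alpha = (K + lambda n I)^{-1} y

X_test = np.linspace(-1, 1, 5)[:, None]
y_pred = gaussian_gram(X_test, X) @ alpha             # f_hat(x) = sum_i alpha_i K(x_i, x)
print(y_pred)
```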
[Figure: toy classification problem with labeled examples APPLE and PEAR and unlabeled points marked "???".]
Input variables x ∈ X .
Output y ∈ {−1, 1}.
Training set S = {(x1 , y1 ) , . . . , (xn , yn )}.
[Figure: the convex surrogates ϕ(u) plotted for u ∈ [−2, 2].]
Method                              ϕ(u)
Kernel logistic regression          log(1 + e^{−u})
Support vector machine (1-SVM)      max(1 − u, 0)
Support vector machine (2-SVM)      max(1 − u, 0)²
Boosting                            e^{−u}
Outline
The quantity yf(x) is the margin of f at (x, y).
Question
When is Rϕn (f ) a good estimate of the true risk Rϕ (f )?
where the expectation is over (Xi )i=1,...,n and the independent uniform
{±1}-valued (Rademacher) random variables (σi )i=1,...,n .
Then on average over the training set (and with high probability)
the ϕ-risk of the ERM estimator is close to the empirical one:
E_S sup_{f∈F} ( R_ϕ(f) − R_ϕ^n(f) ) ≤ 2 L_ϕ Rad_n(F).
" n
#
2X
Radn (FB ) = EX ,σ sup σi f (Xi )
f ∈FB n i=1
" * n
+ #
2X
= EX ,σ sup f, σi KXi (RKHS)
f ∈FB n
i=1
" n
#
2X
= EX ,σ Bk σi KXi kH (Cauchy-Schwarz)
n
i=1
v
u n
2B X
EX ,σ tk
u
= σi KXi k2H
n
i=1
v
u
n
2B u
u X
≤ tEX ,σ σi σj K (Xi , Xj ) (Jensen)
n
i,j=1
2B EX K (X , X )
p
= √ .
n
R_ϕ* = inf_{f measurable} R_ϕ(f).
subject to ‖f‖_H ≤ B.
[Figure: a primal objective f(α) with minimizer α* and a dual objective g(κ) with maximizer κ*.]
Strong duality means that max_κ g(κ) = min_α f(α).
Strong duality holds in most "reasonable cases" for convex optimization (to be detailed soon).
The relation between κ* and α* is not always known a priori.
[Figure: a primal point α̃ and a dual point κ̃ with duality gap δ(α̃, κ̃).]
The duality gap guarantees that 0 ≤ f(α̃) − f(α*) ≤ δ(α̃, κ̃).
Dual problems are often obtained by Lagrangian or Fenchel duality.
minimize f (x)
subject to hi (x) = 0 , i = 1, . . . , m ,
gj (x) ≤ 0 , j = 1, . . . , r ,
Proposition
q is concave in (λ, µ), even if the original problem is not convex.
The dual function yields lower bounds on the optimal value f ∗ of
the original problem when µ is nonnegative:
q(λ, µ) ≤ f ∗ , ∀λ ∈ Rm , ∀µ ∈ Rr , µ ≥ 0 .
⟹ L(x̄, λ, µ) = f(x̄) + Σ_{i=1}^m λ_i h_i(x̄) + Σ_{i=1}^r µ_i g_i(x̄) ≤ f(x̄),
If strong duality holds, then the best lower bound that can be
obtained from the Lagrange dual function is tight
Strong duality does not hold for general nonlinear problems.
It usually holds for convex problems.
Conditions that ensure strong duality for convex problems are called
constraint qualification.
in that case, we have for all feasible primal and dual points x, λ, µ,
minimize f (x)
subject to gj (x) ≤ 0 , j = 1, . . . , r ,
Ax = b ,
if it is strictly feasible, i.e., there exists at least one feasible point that
satisfies:
gj (x) < 0 , j = 1, . . . , r , Ax = b .
Slater’s conditions also ensure that the maximum d ∗ (if > −∞) is
attained, i.e., there exists a point (λ∗ , µ∗ ) with
q (λ∗ , µ∗ ) = d ∗ = f ∗
f(x*) = q(λ*, µ*)
      = inf_{x∈R^n} [ f(x) + Σ_{i=1}^m λ*_i h_i(x) + Σ_{j=1}^r µ*_j g_j(x) ]
      ≤ f(x*) + Σ_{i=1}^m λ*_i h_i(x*) + Σ_{j=1}^r µ*_j g_j(x*)
      ≤ f(x*).
Hence both inequalities are equalities:
L(x*, λ*, µ*) = inf_{x∈R^n} L(x, λ*, µ*),
µ*_j g_j(x*) = 0,  j = 1, . . . , r.
[Figure: the hinge loss as a function of the margin yf(x).]
subject to:
ξ_i ≥ ϕ_hinge(y_i f(x_i)),  i.e.,
ξ_i ≥ 1 − y_i f(x_i),  for i = 1, . . . , n,
ξ_i ≥ 0,               for i = 1, . . . , n.
subject to:
y_i Σ_{j=1}^n α_j K(x_i, x_j) + ξ_i − 1 ≥ 0,  i.e.,  y_i [Kα]_i + ξ_i − 1 ≥ 0,  for i = 1, . . . , n,
ξ_i ≥ 0,  for i = 1, . . . , n.
Solving ∇_α L = 0 leads to
α = diag(y) µ / (2λ) + ε,  with Kε = 0.
But ε does not change f (same as kernel ridge regression), so we can choose for example ε = 0 and:
α*_i(µ, ν) = y_i µ_i / (2λ),  for i = 1, . . . , n.
∂L/∂ξ_i = 1/n − µ_i − ν_i = 0.
maximize q(µ, ν)
subject to µ ≥ 0, ν ≥ 0.
subject to:
0 ≤ y_i α_i ≤ 1/(2λn),  for i = 1, . . . , n.
[Figure: SVM solution with level sets f(x) = +1, f(x) = 0, f(x) = −1, and points annotated with α_i y_i = 1/(2nλ), 0 < α_i y_i < 1/(2nλ), and α_i = 0.]
Consequences
The solution is sparse in α, leading to fast algorithms for training
(use of decomposition methods).
The classification of a new point only involves kernel evaluations
with support vectors (fast).
Remark: C-SVM
Often the SVM optimization problem is written in terms of a
regularization parameter C instead of λ as follows:
arg min_{f∈H}  (1/2) ‖f‖²_H + C Σ_{i=1}^n L_hinge(f(x_i), y_i).
This is equivalent to our formulation with C = 1/(2nλ).
The SVM optimization problem is then:
max_{α∈R^n}  2 Σ_{i=1}^n α_i y_i − Σ_{i,j=1}^n α_i α_j K(x_i, x_j),
subject to:
0 ≤ y_i α_i ≤ C,  for i = 1, . . . , n.
This formulation is often called C-SVM.
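One possible way to solve this C-SVM problem in practice, assuming scikit-learn is available: pass a precomputed Gram matrix to SVC, whose parameter C plays the role above. This is an illustration on synthetic data, not code from the course.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((20, 2)) + 2, rng.standard_normal((20, 2)) - 2])
y = np.array([1] * 20 + [-1] * 20)

def gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K_train = gram(X, X)
clf = SVC(C=1.0, kernel="precomputed").fit(K_train, y)

X_test = rng.standard_normal((5, 2))
K_test = gram(X_test, X)               # rows: test points, columns: training points
print(clf.predict(K_test))
print("number of support vectors:", clf.support_.size)
```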
Remark: 2-SVM
A variant of the SVM, sometimes called 2-SVM, is obtained by
replacing the hinge loss by the square hinge loss:
min_{f∈H}  { (1/n) Σ_{i=1}^n ϕ_hinge(y_i f(x_i))² + λ ‖f‖²_H }.
Kernel Methods
Unsupervised Learning
To optimize the cost function, we will first use the following Proposition
Proposition
The center of mass ϕ_n = (1/n) Σ_{i=1}^n ϕ(x_i) solves the following optimization problem:
min_{µ∈H}  Σ_{i=1}^n ‖ϕ(x_i) − µ‖²_H.
(1/n) Σ_{i=1}^n ‖ϕ(x_i) − µ‖²_H = (1/n) Σ_{i=1}^n ‖ϕ(x_i)‖²_H − 2 ⟨ (1/n) Σ_{i=1}^n ϕ(x_i), µ ⟩_H + ‖µ‖²_H
 = (1/n) Σ_{i=1}^n ‖ϕ(x_i)‖²_H − 2 ⟨ϕ_n, µ⟩_H + ‖µ‖²_H
 = (1/n) Σ_{i=1}^n ‖ϕ(x_i)‖²_H − ‖ϕ_n‖²_H + ‖ϕ_n − µ‖²_H,
and we introduce
the binary matrix A in {0, 1}^{n×k} such that [A]_{ij} = 1 if s_i = j and 0 otherwise;
a diagonal matrix D in R^{k×k} with diagonal entries [D]_{jj} equal to the inverse of the number of elements in cluster j;
and the objective can be rewritten (proof is easy and left as an exercise)
min_{A,D}  − trace( D^{1/2} Aᵀ K A D^{1/2} ).
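A compact kernel k-means sketch (an illustration of the idea above, not the course's code): points are assigned to the cluster whose feature-space centroid is closest, using kernel evaluations only.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=20, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for j in range(k):
            mask = labels == j
            nj = mask.sum()
            if nj == 0:
                dist[:, j] = np.inf
                continue
            # ||phi(x_i) - mu_j||^2 = K_ii - (2/|C_j|) sum_{l in C_j} K_il + (1/|C_j|^2) sum_{l,m in C_j} K_lm
            dist[:, j] = (np.diag(K) - 2 * K[:, mask].sum(1) / nj
                          + K[np.ix_(mask, mask)].sum() / nj ** 2)
        new_labels = dist.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((30, 2)), rng.standard_normal((30, 2)) + 4])
K = X @ X.T                             # linear kernel on toy data
print(kernel_kmeans(K, k=2))
```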
[Figure: data projected on the first two principal components PC1 and PC2.]
f_w(x) = wᵀx,
Moreover, w ⊥ w' ⇔ f_w ⊥ f_{w'}.
Similarly:
Σ_{k=1}^n f_i(x_k)² = α_iᵀ K² α_i.
k=1
α>
i Kαj = 0 for j = 1, . . . , i − 1 .
Pn 2 2
α> 2
i K αi j=1 βij λj
= ,
nα> n nj=1 βij2 λj
P
i Kαi
1 = k fi k2H = α> 2
i Kαi = βii λi .
Therefore:
1
αi = √ ui .
λi
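A short kernel PCA sketch following this derivation: center the Gram matrix, take its eigenvectors u_i, set α_i = u_i/√λ_i, and project the points. This is an illustration with a Gaussian kernel on toy data, not the course's implementation.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    n = K.shape[0]
    U1 = np.full((n, n), 1.0 / n)
    Kc = K - U1 @ K - K @ U1 + U1 @ K @ U1        # centering in the feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # projection of the training points: f_i(x_k) = [Kc alpha_i]_k
    return Kc @ alphas

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)
print(kernel_pca(K).shape)             # (40, 2)
```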
Assuming that the pairs (x_i, y_i) are i.i.d. samples from an unknown distribution, CCA seeks to maximize
max_{w_a∈R^p, w_b∈R^d}  w_aᵀ Xᵀ Y w_b   s.t.  w_aᵀ Xᵀ X w_a = 1  and  w_bᵀ Yᵀ Y w_b = 1.
Multiply the first equality by w_aᵀ and the second equality by w_bᵀ; subtract the two resulting equalities and we get
[ 0     XᵀY ] [w_a]      [ XᵀX   0   ] [w_a]
[ YᵀX   0   ] [w_b]  = λ [ 0     YᵀY ] [w_b],
which is equivalent to
max_{α,β∈R^n}  ( αᵀ K_a K_b β ) / ( (αᵀ K_a² α)^{1/2} (βᵀ K_b² β)^{1/2} ),
Answer
A successful strategy is given by kernels for generative models, which
are/have been the state of the art in many fields, including image and
sequence representations.
Parametric model
A model is a family of distributions
{P_θ, θ ∈ Θ ⊂ R^m} ⊆ M_1^+(X).
Φλ0 (x) = ∇λ log Pλ0 (x) = J∇θ log Pθ0 (x) = JΦθ0 (x).
Φθ0 (x) can be computed explicitly for many models (e.g., HMMs),
where the model is first estimated from data.
I(θ0 ) is often replaced by the identity matrix for simplicity.
Several different models (i.e., different θ0 ) can be trained and
combined.
The Fisher vectors are defined as ϕθ0 (x) = I(θ0 )−1/2 Φθ0 (x). They
are explicitly computed and correspond to an explicit embedding:
K (x, x0 ) = ϕθ0 (x)> ϕθ0 (x0 ).
Then, the Fisher vector is given by the sum of Fisher vectors of the
points.
Encodes the discrepancy in the first and second order moment of
the data w.r.t. those of the model.
ϕ(x_1, . . . , x_n) = Σ_{i=1}^n ϕ(x_i) = √n [ (µ̂ − µ)/σ ;  (σ² − σ̂²)/(√2 σ²) ],
where
µ̂ = (1/n) Σ_{i=1}^n x_i   and   σ̂² = (1/n) Σ_{i=1}^n (x_i − µ̂)².
Remarks
Each mixture component corresponds to a visual word, with a
mean, variance, and mixing weight.
Diagonal covariances Σj = diag (σj1 , . . . , σjp ) = diag (σ j ) are often
used for simplicity.
This is a richer model than the traditional “bag of words” approach.
The probabilistic model is learned offline beforehand.
[ ϕ_{π_1}(X), . . . , ϕ_{π_p}(X), ϕ_{µ_1}(X)ᵀ, . . . , ϕ_{µ_p}(X)ᵀ, ϕ_{σ_1}(X)ᵀ, . . . , ϕ_{σ_p}(X)ᵀ ]ᵀ,
with
ϕ_{µ_j}(X) = (1/(n √π_j)) Σ_{i=1}^n γ_ij (x_i − µ_j)/σ_j,
ϕ_{σ_j}(X) = (1/(n √(2π_j))) Σ_{i=1}^n γ_ij [ (x_i − µ_j)²/σ_j² − 1 ],
with
γ_j = Σ_{i=1}^n γ_ij,   µ̂_j = (1/γ_j) Σ_{i=1}^n γ_ij x_i,   σ̂_j = (1/γ_j) Σ_{i=1}^n γ_ij (x_i − µ_j)².
In particular (exercise)
∂ log P_θ(x) / ∂α_k = P_θ(Y = k | x) − π_k.
Bayes' rule is thus implemented by this simple classifier using the Fisher kernel.
K_Z(z, z') = ⟨Φ_Z(z), Φ_Z(z')⟩_H.
therefore K_X is p.d. on X.
Findings
About 25,000 genes only (representing 1.2% of the genome).
Automatic gene finding with graphical models.
97% of the genome is considered “junk DNA”.
Superposition of a variety of signals (many to be discovered).
Goal
Build a classifier to predict whether new proteins are secreted or not.
[Figure: the feature map φ maps protein sequences (maskat..., msises..., marssl..., malhtv..., mappsv..., mahtlg...) from X to a feature space F.]
Physico-chemical kernel
Extract relevant features, such as:
length of the sequence
time series analysis of numerical physico-chemical properties of
amino-acids along the sequence (e.g., polarity, hydrophobicity),
using for example:
Fourier transforms (Wang et al., 2004)
Autocorrelation functions (Zhang et al., 2003)
r_j = (1/(n−j)) Σ_{i=1}^{n−j} h_i h_{i+j}
The spectrum kernel indexes sequences by all k-mers u ∈ A^k (each feature counts the occurrences of u in the sequence).
Remarks
Work with any string (natural language, time series...)
Fast and scalable, a good default method for string classification.
Variants allow matching of k-mers up to m mismatches.
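A small spectrum-kernel sketch (an illustration, not the course's implementation): represent each string by its k-mer counts and take the inner product. The two strings below are made-up toy sequences.

```python
from collections import Counter

def kmer_counts(s, k):
    # multiset of all substrings of length k
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k=3):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[u] * ct[u] for u in cs if u in ct)

x1 = "maskatlgvlvr"
x2 = "marsslgvkatr"
print(spectrum_kernel(x1, x2, k=3))
```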
l(i) = i_k − i_1 + 1.
Example: x = ABRACADABRA, i = (3, 4, 7, 8, 10), x(i) = RADAR, l(i) = 10 − 3 + 1 = 8.
u          ca  ct  at  ba  bt  cr  ar  br
Φ_u(cat)   λ²  λ³  λ²  0   0   0   0   0
Φ_u(car)   λ²  0   0   0   0   λ³  λ²  0
Φ_u(bat)   0   0   λ²  λ²  λ³  0   0   0
Φ_u(bar)   0   0   0   λ²  0   0   λ²  λ³
K(cat, cat) = K(car, car) = 2λ⁴ + λ⁶
K(cat, car) = λ⁴
K(cat, bar) = 0
K_n(x, x') = Σ_{u∈A^n} Σ_{i: x(i)=u} Σ_{i': x'(i')=u} λ^{l(i)+l(i')}.
Let now:
Ψ_u(x) = Σ_{i: x(i)=u} λ^{|x|−i_1+1},
and
Ψ_{va}(x) = Σ_{j∈[1,|x|]: x_j=a} Ψ_v(x(1, j−1)) λ^{|x|−j+1}.
Define B_n(x, x') := Σ_{u∈A^n} Ψ_u(x) Ψ_u(x') and recall K_n(x, x') = Σ_{u∈A^n} Φ_u(x) Φ_u(x'). Then
B_n(xa, x') = Σ_{u∈A^n} Ψ_u(xa) Ψ_u(x')
 = λ Σ_{u∈A^n} Ψ_u(x) Ψ_u(x') + λ Σ_{v∈A^{n−1}} Ψ_v(x) Ψ_{va}(x')
 = λ B_n(x, x') + λ Σ_{v∈A^{n−1}} Ψ_v(x) Σ_{j∈[1,|x'|]: x'_j=a} Ψ_v(x'(1, j−1)) λ^{|x'|−j+1}
 = λ B_n(x, x') + Σ_{j∈[1,|x'|]: x'_j=a} B_{n−1}(x, x'(1, j−1)) λ^{|x'|−j+2},
hence
B_n(xa, x'b) = λ B_n(x, x'b) + λ Σ_{j∈[1,|x'|]: x'_j=a} B_{n−1}(x, x'(1, j−1)) λ^{|x'|−j+2}.
Similarly,
K_n(xa, x') = Σ_{u∈A^n} Φ_u(xa) Φ_u(x')
 = Σ_{u∈A^n} Φ_u(x) Φ_u(x') + λ Σ_{v∈A^{n−1}} Ψ_v(x) Φ_{va}(x')
 = K_n(x, x') + λ Σ_{v∈A^{n−1}} Ψ_v(x) Σ_{j∈[1,|x'|]: x'_j=a} Ψ_v(x'(1, j−1)) λ
 = K_n(x, x') + λ² Σ_{j∈[1,|x'|]: x'_j=a} B_{n−1}(x, x'(1, j−1)).
Examples
This includes:
Motif kernels (Logan et al., 2001): the dictionary is a library of
motifs, the similarity function is a matching function
Pairwise kernel (Liao & Noble, 2003): the dictionary is the training
set, the similarity is a classical measure of similarity between
sequences.
{P_θ, θ ∈ Θ ⊂ R^m} ⊂ M_1^+(X)
Context-tree model
Definition
A context-tree model is a variable-memory Markov chain:
P_{D,θ}(x) = P_{D,θ}(x_1 . . . x_D) Π_{i=D+1}^n P_{D,θ}(x_i | x_{i−D} . . . x_{i−1})
D is a suffix tree
θ ∈ ΣD is a set of conditional probabilities (multinomials)
[Figure: HMM with start state S, hidden states N and B, end state E, and transition probabilities 0.5, 0.1, 0.05, 0.85.]
Normal (N) and biased (B) coins (not observed).
Observed outputs are 0/1 with probabilities:
π(0|N) = 1 − π(1|N) = 0.5,
π(0|B) = 1 − π(1|B) = 0.8.
K(x, x') = Σ_{(a,s)∈A×S} Φ_{a,s}(x) Φ_{a,s}(x'),
with
Φ_{a,s}(x) = Σ_{y∈S*} P(y|x) n_{a,s}(x, y)
 = Σ_{y∈S*} P(y|x) { Σ_{i=1}^n δ(x_i, a) δ(y_i, s) }
 = Σ_{i=1}^n δ(x_i, a) Σ_{y∈S*} P(y|x) δ(y_i, s)
 = Σ_{i=1}^n δ(x_i, a) P(y_i = s | x).
SCFG rules
S → SS
S → aSa
S → aS
S →a
x1 = CGGSLIAMMWFGV
x2 = CLIVMMNRLMWFGV
CGGSLIAMM------WFGV
|...|||||....||||
C-----LIVMMNRLMWFGV
K1 + K2 ,
K1 K2 , and
cK1 , for c ≥ 0,
∀ (x, x') ∈ X²,  K(x, x') = lim_{i→∞} K_i(x, x'),
The matrix Ci,j = Ai,j Bi,j is therefore p.d. Other properties are obvious
from definition.
Φ ((x1 , x2 )) = Φ1 (x1 ) .
is a p.d. kernel on P (X ).
Lemma
If K1 and K2 are p.d., then K1 ⋆ K2 is p.d.
R (x) = {(x1 , x2 ) ∈ X × X : x = x1 x2 } ⊂ X × X .
As such it is p.d..
Proof (sketch)
By induction on n (simple but long to write).
See details in Vert et al. (2004).
where M(i, j), X (i, j), Y (i, j), X2 (i, j), and Y2 (i, j) for 0 ≤ i ≤ |x|,
and 0 ≤ j ≤ |y| are defined recursively.
[Figure: pair-HMM / weighted finite-state transducer for the local alignment kernel, with states B, X1, X, X2, M, Y1, Y, Y2, E and transitions labeled a:b/m(a,b), a:0/D, 0:b/D, a:0/1, 0:a/1, 0:0/1.]
[Figure: the SCOP hierarchy (Fold, Superfamily, Family) ordered by decreasing sequence similarity, from close homologs to remote homologs and the "twilight zone".]
[Figure: benchmark results, number of families detected as a function of the ROC50 score.]
Examples
Gaussian kernel (or RBF kernel)
K(x, y) = e^{−‖x−y‖₂²/(2σ²)}.
Laplace kernel
K(x, y) = e^{−α‖x−y‖₁}.
∀x ∈ R^d,  f(x) = (1/(2π)^d) ∫_{R^d} e^{i x·ω} f̂(ω) dω.
∫_{R^d} |f(x)|² dx = (1/(2π)^d) ∫_{R^d} |f̂(ω)|² dω.
Intuition
If K is t.i. and κ ∈ L¹(R^d), then
κ(x − y) = (1/(2π)^d) ∫_{R^d} e^{i(x−y)·ω} κ̂(ω) dω
         = ∫_{R^d} e^{iω·x} e^{−iω·y} κ̂(ω)/(2π)^d dω.
Remarks
If κ(0) = 1, κ is a characteristic function, that is, κ(z) = E_ω[e^{iz·ω}].
⇐ is easy:
Σ_{k,l} α_k α_l κ(x_k − x_l) = ∫_{R^d} | Σ_k α_k e^{i x_k·ω} |² µ(dω) ≥ 0.
⟨f, g⟩ := (1/(2π)^d) ∫_{R^d} f̂(ω) ĝ(ω)* / κ̂(ω) dω
corresponds to:
κ̂(ω) = e^{−σ²ω²/2}
and
H = { f : ∫ |f̂(ω)|² e^{σ²ω²/2} dω < ∞ }.
κ̂(ω) = U(ω + Ω) − U(ω − Ω)
and
H = { f : ∫_{|ω|>Ω} |f̂(ω)|² dω = 0 },
Examples
Any group (G , ◦) is a semigroup with involution when we define
s ∗ = s −1 .
Any abelian semigroup (S, +) is a semigroup with involution when
we define s ∗ = s, the identical involution.
∀s, t ∈ S, K (s, t) = ϕ (s ∗ ◦ t)
is a p.d. kernel on S.
Remarks
If ∗ is the identity, a semicharacter is automatically real-valued.
If (S, +) is an abelian group and s ∗ = −s, a semicharacter has its
values in the circle group {z ∈ C | | z | = 1} and is a group
character.
Proof
Direct from definition, e.g.,
Σ_{i,j=1}^n a_i a_j ρ(x_i + x_j*) = Σ_{i,j=1}^n a_i a_j ρ(x_i) ρ(x_j) ≥ 0.
Examples
ϕ(t) = e βt on (R, +, Id).
ϕ(t) = e iωt on (R, +, −).
Integral representation of p.d. functions
Definition
A function α : S → R on a semigroup with involution is called an
absolute value if (i) α(e) = 1, (ii) α(s ◦ t) ≤ α(s)α(t), and (iii)
α(s*) = α(s).
A function f : S → R is called exponentially bounded if there exists an
absolute value α and a constant C > 0 s.t. | f (s) | ≤ C α(s) for s ∈ S.
Theorem
Let (S, +, ∗) be an abelian semigroup with involution. A function ϕ : S → R is
p.d. and exponentially bounded (resp. bounded) if and only if it has a
representation of the form:
ϕ(s) = ∫_{S*} ρ(s) dµ(ρ),
where µ is a Radon measure with compact support on S ∗ (resp. on Ŝ, the set
of bounded semicharacters).
Proof
Sketch (details in Berg et al., 1983, Theorem 4.2.5)
For an absolute value α, the set P1α of α-bounded p.d. functions
that satisfy ϕ(0) = 1 is a compact convex set whose extreme points
are precisely the α-bounded semicharacters.
If ϕ is p.d. and exponentially bounded then there exists an absolute
value α such that ϕ(0)−1 ϕ ∈ P1α .
By the Krein-Milman theorem there exists a Radon probability
measure on P1α having ϕ(0)−1 ϕ as barycentre.
Remarks
The result is not true without the assumption of exponentially
bounded semicharacters.
In the case of abelian groups with s ∗ = −s this reduces to
Bochner’s theorem for discrete abelian groups, cf. Rudin (1962).
s ∈ R+ 7→ ρa (s) = e −as ,
ρf (µ) = e µ[f ]
Remarks
The converse is not true: there exist continuous p.d. kernels that do not have
this integral representation (it might include non-continuous semicharacters)
Remark: only valid for densities (e.g., for a kernel density estimator
from a bag-of-parts)
Weighted linear PCA of two different measures, with the first PC shown.
Variances captured by the first and second PC are shown. The
generalized variance kernel is the inverse of the product of the two
values.
Solution
1 Regularization:
K_λ(µ, µ') = 1 / det( Σ_{(µ+µ')/2} + λ I_d ).
2 Kernel trick: the non-zero eigenvalues of UU > and U > U are the
same =⇒ replace the covariance matrix by the centered Gram
matrix (technical details in Cuturi et al., 2005).
Motivations
We can exhibit an explicit and intuitive feature space for a large
class of p.d. kernels
Historically, provided the first proof that a p.d. kernel is an inner
product for non-finite sets X (Mercer, 1905).
Can be thought of as the natural generalization of the factorization
of positive semidefinite matrices over infinite spaces.
hf , Lg i = hLf , g i .
hf , Lf i ≥ 0 .
Lemma
If K is a Mercer kernel, then LK is a compact and bounded linear
operator over Lν2 (X ), self-adjoint and positive.
≤ ‖K(x_1, ·) − K(x_2, ·)‖ ‖f‖   (Cauchy-Schwarz)
≤ √(ν(X)) max_{t∈X} |K(x_1, t) − K(x_2, t)| ‖f‖.
Ascoli Theorem
A part H ⊂ C (X ) is relatively compact (i.e., its closure is compact) if
and only if it is uniformly bounded and equicontinuous.
It is equicontinuous because:
|L_K f_n(x_1) − L_K f_n(x_2)| ≤ √(ν(X)) max_{t∈X} |K(x_1, t) − K(x_2, t)| M.
= hLf , g i .
≥ 0,
Remark
This theorem can be applied to L_K. In that case the eigenfunctions ψ_k associated with the eigenvalues λ_k ≠ 0 can be considered as continuous functions, because:
ψ_k = (1/λ_k) L_K ψ_k.
Φ : X → ℓ²
x ↦ ( √λ_k ψ_k(x) )_{k∈N}
are not necessarily countable, Mercer theorem does not hold. Other
tools are thus required such as the Fourier transform for
shift-invariant kernels.
Remark
If some eigenvalues are equal to zero, then the result and the proof remain valid
on the subspace spanned by the eigenfunctions with positive eigenvalues.
≤ ( Σ_{i=1}^∞ a_i²/λ_i )^{1/2} ( Σ_{i=1}^∞ λ_i ψ_i(x)² )^{1/2}
= ‖f‖_{H_K} K(x, x)^{1/2}
≤ ‖f‖_{H_K} √C_K.
‖f_n − f_c‖_{L²_ν(X)} → 0 as n → ∞.
‖f − f_n‖_{L²_ν(X)} ≤ √λ_1 ‖f − f_n‖_{H_K} → 0 as n → ∞,
therefore f = f_c.
therefore ϕx = Kx ∈ HK .
therefore:
⟨f, K_x⟩_{H_K} = Σ_{i=1}^∞ λ_i ψ_i(x) a_i / λ_i = Σ_{i=1}^∞ a_i ψ_i(x) = f(x),
⟨ψ_i, ψ_j⟩_{H_K} = 0 if i ≠ j,   ‖ψ_i‖_{H_K} = 1/√λ_i.
The RKHS is a well-defined ellipsoid with axes given by the
eigenfunctions.
Remark
Therefore, ‖f‖_H = ‖f'‖_{L²}: the RKHS norm is precisely the smoothness functional defined in the simple example.
|f(x)| = | ∫_0^x f'(u) du | ≤ √x ( ∫_0^1 f'(u)² du )^{1/2} = √x ‖f‖_H.
[Figure: the reproducing kernel K(s, t) on [0, 1]².]
∀ (f , g ) ∈ H2 , hf , g iH = hDf , Dg iL2 (X ) ,
it is a Hilbert space.
Then H is a RKHS that admits as r.k. the Green function of the
operator D ∗ D, where D ∗ denotes the adjoint operator of D.
f = Dg ,
hf , g iX = hDf , Dg iL2 (X ) ,
[Figure: molecules labeled active / inactive.]
Theorem
Computing all subgraph occurrences is NP-hard.
Proof
The linear graph of size n is a subgraph of a graph X with n
vertices iff X has a Hamiltonian path;
The decision problem whether a graph has a Hamiltonian path is
NP-complete.
[Figure: a labeled graph and its subgraph-indicator feature vector (0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . .).]
Theorem
Computing all path occurrences is NP-hard.
Proof
Same as for subgraphs.
[Figure: a labeled graph and its feature vector of walk counts (0, . . . , 0, 2, 0, . . . , 0, 1, 0, . . .).]
φ
X H
φ
X H
Proof
For any kernel K the complexity of computing dK is the same as
the complexity of computing K , because:
Φ_H(G) = | { G' subgraph of G : G' ≃ H } |.
Proof (1/2)
Let Pn be the path graph with n vertices.
Subgraphs of Pn are path graphs:
Proof (2/2)
If G is a graph with n vertices, then it has a path that visits each node exactly once (Hamiltonian path) if and only if Φ(G)ᵀ e_{P_n} > 0, i.e.,
Φ(G)ᵀ ( Σ_{i=1}^n α_i Φ(P_i) ) = Σ_{i=1}^n α_i K_subgraph(G, P_i) > 0.
[Figure: a labeled graph and its path-indicator feature vector (0, . . . , 0, 1, 0, . . . , 0, 1, 0, . . .).]
Definition
The path kernel is the subgraph kernel restricted to paths, i.e.,
K_path(G_1, G_2) = Σ_{H∈P} λ_H Φ_H(G_1) Φ_H(G_2),
etc...
[Figure: graphs G1 (vertices 1–4) and G2 (vertices a–e) and their product graph G1 × G2 (vertices 1a, 1b, 2a, 2b, 1d, 2d, 3c, 3e, 4c, 4e).]
Corollary
K_walk(G_1, G_2) = Σ_{s∈S} Φ_s(G_1) Φ_s(G_2)
 = Σ_{(w_1,w_2)∈W(G_1)×W(G_2)} λ_{G_1}(w_1) λ_{G_2}(w_2) 1( l(w_1) = l(w_2) )
 = Σ_{w∈W(G_1×G_2)} λ_{G_1×G_2}(w).
For the nth-order walk kernel we have λ_{G_1×G_2}(w) = 1 if the length of w is n, 0 otherwise.
Therefore:
K_nth-order(G_1, G_2) = Σ_{w∈W_n(G_1×G_2)} 1.
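A sketch of this corollary in code (an illustration, not the course's implementation): the number of common labeled walks of length n equals the sum of the entries of A_×^n, where A_× is the adjacency matrix of the product graph G1 × G2. The two toy graphs below are made up.

```python
import numpy as np

def product_adjacency(A1, labels1, A2, labels2):
    # product-graph vertices: pairs of vertices with matching labels
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    Ax = np.zeros((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            Ax[a, b] = A1[i, k] * A2[j, l]
    return Ax

def nth_order_walk_kernel(A1, labels1, A2, labels2, n):
    Ax = product_adjacency(A1, labels1, A2, labels2)
    return np.linalg.matrix_power(Ax, n).sum()

A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]]); labels1 = ["A", "B", "A"]
A2 = np.array([[0, 1], [1, 0]]);                  labels2 = ["A", "B"]
print(nth_order_walk_kernel(A1, labels1, A2, labels2, n=2))
```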
[Figure: Morgan index computation on a molecular graph, with vertex labels (O, N) augmented by iteration counts (O1, N1, ... then O3, N3, N5, ...).]
[Figure: tottering versus non-tottering walks on a molecular graph with atoms C, N, O.]
[Figure: one iteration of Weisfeiler-Lehman relabeling on two graphs G and G': each node label is augmented with the sorted labels of its neighbors, and the augmented labels are compressed into new labels.]
φ^(1)_WLsubtree(G)  = (2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 0, 1)
φ^(1)_WLsubtree(G') = (1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1)
(counts of the original node labels a, . . . , e and of the compressed node labels f, . . . , m)
K^(1)_WLsubtree(G, G') = ⟨ φ^(1)_WLsubtree(G), φ^(1)_WLsubtree(G') ⟩ = 11.
Properties
The WL features up to the k-th order are computed in O(|E |k).
Similarly to the Morgan index, the WL relabeling can be exploited
in combination with any graph kernel (that takes into account
categorical node labels) to make it more expressive (Shervashidze et
al., 2011).
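A compact sketch of the WL relabeling and the resulting subtree kernel (an illustration, not the reference implementation): the label compression must be shared across the graphs being compared, so all graphs are processed together. The two small graphs below are made up.

```python
from collections import Counter

def wl_subtree_features(graphs, h=1):
    # graphs: list of (adj, labels) with adj: node -> neighbors, labels: node -> label
    feats = [Counter((0, l) for l in labels.values()) for _, labels in graphs]
    current = [dict(labels) for _, labels in graphs]
    for it in range(1, h + 1):
        # signature of a node = (its label, sorted multiset of neighbor labels)
        sigs = [{v: (lab[v], tuple(sorted(lab[u] for u in adj[v]))) for v in adj}
                for (adj, _), lab in zip(graphs, current)]
        compressed = {s: i for i, s in enumerate(sorted({s for d in sigs for s in d.values()}))}
        current = [{v: compressed[d[v]] for v in d} for d in sigs]
        for f, lab in zip(feats, current):
            f.update((it, l) for l in lab.values())
    return feats

def wl_kernel(f1, f2):
    return sum(f1[k] * f2[k] for k in f1 if k in f2)

G  = ({0: [1], 1: [0, 2], 2: [1]}, {0: "a", 1: "b", 2: "a"})   # path a-b-a
G2 = ({0: [1], 1: [0]},            {0: "a", 1: "b"})            # edge a-b
print(wl_kernel(*wl_subtree_features([G, G2], h=1)))
```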
Outline
Results
10-fold cross-validation accuracy
Method Accuracy
Progol1 81.4%
2D kernel 91.2%
[Figure: 2D subtree vs. walk kernels: prediction accuracy (70–80%) on 60 NCI cancer cell line screens (CCRF-CEM, HL-60(TB), K-562, MOLT-4, ..., SF-539), comparing walks and subtrees.]
[Figure: test error (roughly 0.05–0.12) for the kernels H, W, TW, wTW, M.]
K(x, x') = ⟨Φ(x), Φ(x')⟩_H.
K(x, x') = xᵀx'.
Lemma
In general graphs cannot be embedded exactly in Hilbert spaces.
In some cases exact embeddings exist, e.g.:
trees can be embedded exactly,
closed chains can be embedded exactly.
[Figure: a graph with 5 vertices.]
      0 1 1 1 2
      1 0 2 2 1
d_G = 1 2 0 2 1
      1 2 2 0 1
      2 1 1 1 0
λ_min[ e^{−0.2 d_G(i,j)} ] = −0.028 < 0.
[Figure: a graph with 5 vertices.]
                    1    0.14 0.37 0.14 0.05
                    0.14 1    0.37 0.14 0.05
[ e^{−d_G(i,j)} ] = 0.37 0.37 1    0.37 0.14
                    0.14 0.14 0.37 1    0.37
                    0.05 0.05 0.14 0.37 1
Idea
Define a priori a smoothness functional on the functions f : X → R;
Show that it defines an RKHS and identify the corresponding kernel.
X = (x1 , . . . , xm ) is finite.
For x, x0 ∈ X , we note x ∼ x0 to indicate the existence of an edge
between x and x0
We assume that there is no self-loop x ∼ x, and that there is a
single connected component.
The adjacency matrix is A ∈ R^{m×m}:
A_{i,j} = 1 if i ∼ j, and 0 otherwise.
D is the diagonal matrix where D_{i,i} is the number of neighbors of x_i (D_{i,i} = Σ_{j=1}^m A_{i,j}).
[Figure: the graph with 5 vertices and edges 1–3, 2–3, 3–4, 4–5.]
    0 0 1 0 0        1 0 0 0 0
    0 0 1 0 0        0 1 0 0 0
A = 1 1 0 1 0 ,  D = 0 0 3 0 0
    0 0 1 0 1        0 0 0 2 0
    0 0 0 1 0        0 0 0 0 1
             1  0 −1  0  0
             0  1 −1  0  0
L = D − A = −1 −1  3 −1  0
             0  0 −1  2 −1
             0  0  0 −1  1
Ω(f) = Σ_{i∼j} ( f(x_i) − f(x_j) )²
     = Σ_{i∼j} ( f(x_i)² + f(x_j)² − 2 f(x_i) f(x_j) )
     = Σ_{i=1}^m D_{i,i} f(x_i)² − 2 Σ_{i∼j} f(x_i) f(x_j)
     = fᵀ D f − fᵀ A f
     = fᵀ L f
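A tiny numerical check of the identity Ω(f) = fᵀLf on the 5-vertex graph used in these slides (edges 1–3, 2–3, 3–4, 4–5); the function values are arbitrary.

```python
import numpy as np

edges = [(0, 2), (1, 2), (2, 3), (3, 4)]      # 0-indexed edges of the graph above
m = 5
A = np.zeros((m, m))
for i, j in edges:
    A[i, j] = A[j, i] = 1
D = np.diag(A.sum(axis=1))
L = D - A

f = np.array([1.0, -2.0, 0.5, 3.0, 1.5])
omega = sum((f[i] - f[j]) ** 2 for i, j in edges)  # sum over edges i ~ j
print(np.isclose(omega, f @ L @ f))                # True
```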
the eigendecomposition of L:
hf , g i = f > Lg
k f k2 = hf , f i = f > Lf = Ω(f ) .
g = KLf = L∗ Lf = ΠH (f ) = f .
[Figure: the same 5-vertex graph.]
       0.88 −0.12  0.08 −0.32 −0.52
      −0.12  0.88  0.08 −0.32 −0.52
L* =   0.08  0.08  0.28 −0.12 −0.32
      −0.32 −0.32 −0.12  0.48  0.28
      −0.52 −0.52 −0.32  0.28  1.08
[Figure: discretization of a function f on a regular grid with spacing dx, points i−1, i, i+1.]
∆f(x) = f''(x)
 ≈ ( f'(x + dx/2) − f'(x − dx/2) ) / dx
 ≈ ( f(x + dx) − 2 f(x) + f(x − dx) ) / dx²
 = ( f_{i−1} + f_{i+1} − 2 f_i ) / dx²
 = − L f(i) / dx².
Interpretation of regularization
For f : [0, 1] → R and x_i = i/m, we have:
Ω(f) = Σ_{i=1}^m ( f((i+1)/m) − f(i/m) )²
     ≈ Σ_{i=1}^m ( (1/m) f'(i/m) )²
     = (1/m) Σ_{i=1}^m (1/m) f'(i/m)²
     ≈ (1/m) ∫_0^1 f'(t)² dt.
K_{x_0}(x, t) = K_t(x_0, x) = (1/(4πt)^{d/2}) exp( −‖x − x_0‖² / (4t) )
f_t = f_0 e^{−tL}
with
e^{−tL} = I − tL + (t²/2!) L² − (t³/3!) L³ + . . .
we obtain:
K = e^{−tL} = Σ_{i=1}^m e^{−tλ_i} u_i u_iᵀ
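A short sketch computing the diffusion kernel K = e^{−tL} through the eigendecomposition of the Laplacian, on the same 5-vertex graph (an illustration, not the course's code).

```python
import numpy as np

A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(1)) - A

t = 1.0
lam, U = np.linalg.eigh(L)
K = U @ np.diag(np.exp(-t * lam)) @ U.T        # K = sum_i exp(-t lambda_i) u_i u_i^T

print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > 0)   # symmetric, positive definite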
For the complete graph with m vertices:
K_{i,j} = ( 1 + (m−1) e^{−tm} ) / m   for i = j,
K_{i,j} = ( 1 − e^{−tm} ) / m         for i ≠ j.
For the closed chain (cycle) with m vertices:
K_{i,j} = (1/m) Σ_{ν=0}^{m−1} exp( −2t ( 1 − cos(2πν/m) ) ) cos( 2πν(i−j)/m ).
For i = 1, . . . , m, let:
f̂_i = u_iᵀ f
be the projection of f onto the eigenbasis of K.
We then have:
‖f‖²_{K_t} = fᵀ K^{−1} f = Σ_{i=1}^m e^{tλ_i} f̂_i².
This looks similar to ∫ |f̂(ω)|² e^{σ²ω²/2} dω ...
Discrete Fourier transform
Definition
The vector f̂ = ( f̂_1, . . . , f̂_m )ᵀ is called the discrete Fourier transform of f ∈ R^m
where r : R_+ → R*_+ is a non-increasing function.
r(λ) = 1/(λ + ε),  ε > 0
K = Σ_{i=1}^m (1/(λ_i + ε)) u_i u_iᵀ = (L + εI)^{−1}
‖f‖²_K = fᵀ K^{−1} f = Σ_{i∼j} ( f(x_i) − f(x_j) )² + ε Σ_{i=1}^m f(x_i)².
[Figure: the same 5-vertex graph.]
               0.60 0.10 0.19 0.08 0.04
               0.10 0.60 0.19 0.08 0.04
(L + I)^{−1} = 0.19 0.19 0.38 0.15 0.08
               0.08 0.08 0.15 0.46 0.23
               0.04 0.04 0.08 0.23 0.62
[Figure: samples projected on the first two principal components PC1 and PC2.]
Goal
Design a classifier to automatically assign a class to future samples
from their expression profile
Interpret biologically the differences between the classes
Pros: good performance in classification.
Cons: limited interpretation (small weights); no prior biological knowledge.
Pros: good performance in classification; useful for biomarker selection; apparently easy interpretation.
Cons: the gene selection process is usually not robust; wrong interpretation is the rule (too much correlation between genes).
Good example
The graph is the complete known metabolic network of the budding yeast (from the KEGG database).
We project the classifier weights learned by a spectral SVM.
Good classification accuracy, and good interpretation!
1
R is closed if, for each A ∈ R, the sublevel set {u ∈ Rn : R(u) ≤ A} is closed.
For example, if R is continuous then it is closed.
Sum kernel
Definition
Let K1 , . . . , KM be M kernels on X . The sum kernel KS is the kernel on
X defined as
∀x, x' ∈ X,  K_S(x, x') = Σ_{i=1}^M K_i(x, x').
K_i(x, x') = ⟨Φ_i(x), Φ_i(x')⟩_{H_i}.
Then K_S = Σ_{i=1}^M K_i can be written as:
K_S(x, x') = ⟨Φ_S(x), Φ_S(x')⟩_{H_S},
[Figure: protein network inference by data integration: individual kernels Kexp (expression), Kppi (protein interaction), Kloc (localization), Kphy (phylogenetic profile) and their sum Kexp + Kppi + Kloc + Kphy (integration) are each compared against Kgold (gold-standard protein network); ROC curves for the supervised approach.]
The sum kernel: functional point of view
Theorem
The solution f* ∈ H_{K_S} when we learn with K_S = Σ_{i=1}^M K_i is equal to:
f* = Σ_{i=1}^M f_i*,
where (f_1*, . . . , f_M*) is the solution of
min_{f_1,...,f_M}  R( Σ_{i=1}^M f_i^n ) + λ Σ_{i=1}^M ‖f_i‖²_{H_{K_i}}.
With kernel weights η_i, the corresponding problem is
min_{f_1,...,f_M}  R( Σ_{i=1}^M f_i^n ) + λ Σ_{i=1}^M ‖f_i‖²_{H_{K_i}} / η_i.
Minimization in α_i for i = 1, . . . , M:
min_{α_i}  λ α_iᵀ K_i α_i / η_i − 2λ γᵀ K_i α_i  =  −λ η_i γᵀ K_i γ,
Motivation
If we know how to weight each kernel, then we can learn with the
weighted kernel
K_η = Σ_{i=1}^M η_i K_i
J(K) = min_{f∈H_K}  { R(f^n) + λ ‖f‖²_{H_K} }.
[Figure/table from G.R.G. Lanckriet et al. (2004): kernels K_SW (Smith-Waterman on protein sequences), K_B (BLAST), K_Pfam (Pfam HMM), K_FFT (hydropathy profile), K_LI, K_D, K_E and the combination "all", with their learned weights, ROC and TP1FP performance.]
[Figure: test error (roughly 0.05–0.12) for the kernels H, W, TW, wTW, M.]
Theorem
The solution f* of
min_{η∈Σ_M} min_{f∈H_{K_η}}  { R(f^n) + λ ‖f‖²_{H_{K_η}} }
is f* = Σ_{i=1}^M f_i*, where (f_1*, . . . , f_M*) ∈ H_{K_1} × . . . × H_{K_M} is the solution of:
min_{f_1,...,f_M}  R( Σ_{i=1}^M f_i^n ) + λ ( Σ_{i=1}^M ‖f_i‖_{H_{K_i}} )².
min_{η∈Σ_M} min_{f∈H_{K_η}}  { R(f^n) + λ ‖f‖²_{H_{K_η}} }
 = min_{η∈Σ_M} min_{f_1,...,f_M}  { R( Σ_{i=1}^M f_i^n ) + λ Σ_{i=1}^M ‖f_i‖²_{H_{K_i}} / η_i }
 = min_{f_1,...,f_M}  { R( Σ_{i=1}^M f_i^n ) + λ min_{η∈Σ_M} Σ_{i=1}^M ‖f_i‖²_{H_{K_i}} / η_i }
 = min_{f_1,...,f_M}  { R( Σ_{i=1}^M f_i^n ) + λ ( Σ_{i=1}^M ‖f_i‖_{H_{K_i}} )² },
∀a ∈ R_+^M,  ( Σ_{i=1}^M a_i )² = inf_{η∈Σ_M} Σ_{i=1}^M a_i²/η_i,
because by Cauchy-Schwarz
Σ_{i=1}^M a_i = Σ_{i=1}^M (a_i/√η_i) × √η_i ≤ ( Σ_{i=1}^M a_i²/η_i )^{1/2} ( Σ_{i=1}^M η_i )^{1/2}.
For r > 0,  K_η = Σ_{i=1}^M η_i K_i  with  η ∈ Σ_M^r = { η_i ≥ 0, Σ_{i=1}^M η_i^r = 1 }
Theorem
The solution f* of
min_{η∈Σ_M^r} min_{f∈H_{K_η}}  { R(f^n) + λ ‖f‖²_{H_{K_η}} }
is f* = Σ_{i=1}^M f_i*, where (f_1*, . . . , f_M*) ∈ H_{K_1} × . . . × H_{K_M} is the solution of:
min_{f_1,...,f_M}  R( Σ_{i=1}^M f_i^n ) + λ ( Σ_{i=1}^M ‖f_i‖_{H_{K_i}}^{2r/(r+1)} )^{(r+1)/r}.
Solutions
low-rank approximation of the kernel;
random Fourier features.
The goal is to find an approximate embedding ψ : X → R^d such that K(x, x') ≈ ⟨ψ(x), ψ(x')⟩, so that the kernel problem becomes, approximately,
min_{w∈R^d}  (1/n) Σ_{i=1}^n L(y_i, wᵀψ(x_i)) + λ ‖w‖²_2,
[Figure: a convex function f(w), its minimizer w*, and gradient descent iterates w_0, w_1, . . .]
w_t ← w_{t−1} − (1/L) ∇f(w_{t−1}).
Then,
f(w_t) − f* ≤ L ‖w_0 − w*‖²_2 / (2t).
Remarks
the convergence rate improves under additional assumptions on f
(strong convexity);
some variants have a O(1/t 2 ) convergence rate [Nesterov, 2004].
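A bare-bones gradient descent sketch using the 1/L step size above, applied to a toy least-squares objective (an illustration, not the course's code).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)

def grad(w):
    # gradient of f(w) = (1/2n) ||Xw - y||^2
    return X.T @ (X @ w - y) / len(y)

L = np.linalg.eigvalsh(X.T @ X / len(y)).max()   # Lipschitz constant of the gradient
w = np.zeros(5)
for t in range(200):
    w = w - grad(w) / L                          # w_t <- w_{t-1} - (1/L) grad f(w_{t-1})
print(w)
```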
Then,
f(w) − f(z) − ∇f(z)ᵀ(w − z) ≤ ∫_0^1 ( ∇f(tw + (1−t)z) − ∇f(z) )ᵀ (w − z) dt
 ≤ ∫_0^1 | ( ∇f(tw + (1−t)z) − ∇f(z) )ᵀ (w − z) | dt
 ≤ ∫_0^1 ‖ ∇f(tw + (1−t)z) − ∇f(z) ‖_2 ‖w − z‖_2 dt   (C.-S.)
 ≤ ∫_0^1 L t ‖w − z‖²_2 dt = (L/2) ‖w − z‖²_2.
f(w_t) ≤ g_t(w_t) = g_t(w*) − (L/2) ‖w* − w_t‖²_2
 = f(w_{t−1}) + ∇f(w_{t−1})ᵀ(w* − w_{t−1}) + (L/2) ‖w* − w_{t−1}‖²_2 − (L/2) ‖w* − w_t‖²_2
 ≤ f* + (L/2) ‖w* − w_{t−1}‖²_2 − (L/2) ‖w* − w_t‖²_2.
By summing from t = 1 to T, we have a telescopic sum
T ( f(w_T) − f* ) ≤ Σ_{t=1}^T ( f(w_t) − f* ) ≤ (L/2) ‖w* − w_0‖²_2 − (L/2) ‖w* − w_T‖²_2.
‖w* − w_t‖²_2 ≤ ( (L − µ)/(L + µ) ) ‖w* − w_{t−1}‖²_2
             ≤ ( 1 − µ/L ) ‖w* − w_{t−1}‖²_2.
Finally,
f(w_t) − f* ≤ (L/2) ‖w_t − w*‖²_2
            ≤ ( 1 − µ/L )^t  L ‖w* − w_0‖²_2 / 2
w̃t ← (1 − γt )w̃t−1 + γt wt .
The stochastic (sub)gradient descent algorithm
There are various learning rates strategies (constant, varying step-sizes),
and averaging strategies. Depending on the problem assumptions and
choice of ηt , γt , classical convergence rates may be obtained (see
Nemirovsky et al., 2009)
f(w̃_t) − f* = O(1/√t) for convex problems;
f(w̃_t) − f* = O(1/t) for strongly convex ones;
Remarks
The convergence rates are not that great, but the complexity
per-iteration is small (1 gradient evaluation for minimizing an
empirical risk versus n for the batch algorithm).
When the amount of data is infinite, the method minimizes the
expected risk.
Choosing a good learning rate automatically is an open problem.
SAG algorithm
w^t ← w^{t−1} − (γ/(Ln)) Σ_{i=1}^n y_i^t   with   y_i^t = ∇f_i(w^{t−1}) if i = i_t, and y_i^{t−1} otherwise.
w^t ← w^{t−1} − (1/(µn)) ( y_{i_t}^t − y_{i_t}^{t−1} )   with   y_i^t = ∇f_i(w^{t−1}) if i = i_t, and y_i^{t−1} otherwise.
wt ← wt−1 − ηt gt ,
where E[gt |wt−1 ] = ∇f (wt−1 ), but where the estimator of the gradient
has lower variance than in SGD (see SVRG [Johnson and Zhang, 2013]).
Typically, these methods have the convergence rate
f(w_t) − f* = O( ( 1 − C max(1/n, µ/L) )^t ).
Then,
⟨ϕ(x), ϕ(x')⟩_H ≈ ⟨ Σ_{j=1}^d β_j(x) f_j, Σ_{j=1}^d β_j(x') f_j ⟩_H
 = Σ_{j,l=1}^d β_j(x) β_l(x') ⟨f_j, f_l⟩_H = β(x)ᵀ G β(x'),
with
ψ(x) = G^{1/2} β(x).
In practice, the anchor points f_j in H and the coordinates β are learned by minimizing the least-squares error in H
min_{f_1,...,f_d∈H, β_ij∈R}  Σ_{i=1}^n ‖ ϕ(x_i) − Σ_{j=1}^d β_ij f_j ‖²_H.
or also
min_{f_1,...,f_d∈H, β_ij∈R}  Σ_{i=1}^n [ −2 Σ_{j=1}^d β_ij f_j(x_i) + Σ_{j,l=1}^d β_ij β_il ⟨f_j, f_l⟩_H ],
max_{f_1,...,f_d∈H}  Σ_{i=1}^n f(x_i)ᵀ K_f^{−1} f(x_i).
which is equal to
‖ϕ(x)‖²_H − f(x)ᵀ K_f^{−1} f(x),
and since f_j = ϕ(x_{z_j}) for all j, the data point x_i with the largest residual is the one that maximizes
K(x_i, x_i) − K_{x_i,Z} K_{Z,Z}^{−1} K_{Z,x_i}.
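A minimal Nyström-style sketch (an illustration, not the course's code): with anchor points f_j = ϕ(x_{z_j}), the approximate embedding is ψ(x) = K_{Z,Z}^{−1/2} K_{Z,x}, so that ⟨ψ(x), ψ(x')⟩ = K_{x,Z} K_{Z,Z}^{−1} K_{Z,x'}. Here the anchors are simply chosen at random.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Z = X[rng.choice(200, size=20, replace=False)]      # anchor points

K_ZZ = gaussian_gram(Z, Z)
lam, U = np.linalg.eigh(K_ZZ)
K_ZZ_inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12))) @ U.T

psi = gaussian_gram(X, Z) @ K_ZZ_inv_sqrt           # n x d approximate embedding

# compare the approximation <psi(x), psi(x')> with the exact kernel
K_exact = gaussian_gram(X[:5], X[:5])
K_approx = psi[:5] @ psi[:5].T
print(np.abs(K_exact - K_approx).max())
```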
Remarks
The convergence is uniform, not data dependent;
Take the sequence ε_d = sqrt( log(d)/d ) σ_p diam(X); then the term on the right converges to zero when d grows to infinity;
Prediction functions with Random Fourier features are not in H.
Glue things together: control the probability for points (x, y) inside
each ball, and adjust the radius r (a bit technical).
Outline
Perspectives
build multilayer architectures that are easy to regularize and that
may work without (or with less) supervision.
build versatile architectures to process structured data.
Classical criticisms of kernel methods
lack of adaptivity to data?
Example
Any shift-invariant kernel with random Fourier features!
ψ(x) = sqrt(2/d) [ cos(w_1ᵀx + b_1), . . . , cos(w_dᵀx + b_d) ]ᵀ.
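A sketch of random Fourier features for the Gaussian kernel, matching the formula above: sample w_j from the kernel's spectral density N(0, I/σ²) and b_j uniformly in [0, 2π]. This is an illustration, not the course's code.

```python
import numpy as np

def rff(X, d=500, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    W = rng.standard_normal((p, d)) / sigma        # w_j ~ N(0, I / sigma^2)
    b = rng.uniform(0, 2 * np.pi, size=d)
    return np.sqrt(2.0 / d) * np.cos(X @ W + b)    # psi(x) = sqrt(2/d) cos(W^T x + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
Z = rff(X)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-d2 / (2 * 1.0 ** 2))
print(np.abs(Z @ Z.T - K_exact).max())     # small approximation error
```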
Links between kernels and neural networks
A large class of kernels on Rp may be defined as an expectation
[Figure: the function s(u), u ∈ [−1, 1], for the arc-cosine1, arc-cosine2, and RBF (σ = 0.5 and σ = 1) kernels.]
Links between kernels and neural networks
We have seen that some kernels admit an interpretation as one-layer neural networks with random weights and an infinite number of neurons.
K1 (x, y) = κ (kϕ0 (x)kH0 , kϕ0 (y)kH0 , hϕ0 (x), ϕ0 (y)iH0 ) = hϕ1 (x), ϕ1 (y)iH1 ,
K2 (x, y) = κ (kϕ1 (x)kH1 , kϕ1 (y)kH1 , hϕ1 (x), ϕ1 (y)iH1 ) = hϕ2 (x), ϕ2 (y)iH2 ,
Figure: picture from Yann LeCun's tutorial, based on [Zeiler and Fergus, 2013].
[Figure: multilayer construction: maps ϕ_0(z_0) ∈ H_0 on Ω_0, ϕ_1(z_1) ∈ H_1 on Ω_1 with patches {z_1} + P_1, and ϕ_2(z_2) ∈ H_2 on Ω_2 with patches {z_2} + P_2.]
Patch map
ϕ_0 associates to a location z an image patch of size m × m centered at z. Then H_0 = R^{m²}, and ϕ̃_0(z) is a contrast-normalized version of the patch.
[Figure: same multilayer construction as above.]
[Figure: one layer of the construction: ζ_k(z_{k−1}) is obtained by convolution + non-linearity on patches {z_{k−1}} + P_{k−1} of Ω_{k−1}, then ξ_k(z) is obtained by Gaussian filtering + downsampling (pooling) on Ω'_k.]
Supervision helps
preliminary supervised models are already close to 90% (single
model, no data augmentation);
Future challenges
video data;
structured data, sequences, graphs;
theory and faster algorithms;
finish supervision.