Lecture Notes OperatorML 080424
Lecture Notes OperatorML 080424
in machine learning
G. Blanchard, Université Paris-Saclay
April 11, 2024
Warning: these lecture notes are an incomplete work in progress. They likely contain
many typos, inconsistencies and more serious errors! If you happen to stumble upon this
document, and notice some errors, feel free to contact me.
Contents
1 Introduction (and conventions used in these notes) 3
1.1 Motivation: regression in Hilbert space . . . . . . . . . . . . . . . . . . . . 3
1.2 Notation and convention index . . . . . . . . . . . . . . . . . . . . . . . . . 3
1
4 Spectral regularization methods 38
4.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Probabilistic inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 Analysis of spectral regularization methods . . . . . . . . . . . . . . . . . . 44
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6 Acceleration methods 57
6.1 Parallelizing: divide and average . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Nyström methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2
1 Introduction (and conventions used in these notes)
1.1 Motivation: regression in Hilbert space
TODO
3
2 Tools of operator theory and functional calculus
2.1 Basics on Hilbert spaces
[Source: [7, Chapter 1]]
Definition 2.1. A real resp. complex Hilbert space H is a R-, resp C-vector space with an
inner product ⟨·, ·⟩ on H2 which is complete for the metric induced by that inner product’s
norm. We recall here the defining properties of an inner product:
• ⟨u, v⟩ ∈ R, resp C;
• ⟨u, u⟩ ≥ 0;
• ⟨u, u⟩ = 0 ⇒ u = 0.
1
The norm induced by the inner product is ∥u∥ = ⟨u, u⟩ 2 . The inner product satisfies the
Cauchy-Schwartz inequality:
|⟨u, v⟩| ≤ ∥u∥∥v∥.
Proposition/Definition 2.2.
• If ⟨u, v⟩ = 0 we denote u ⊥ v.
4
The first point allows to formally define uncountable Hilbert sums; the second indicates that
the notion of summability is quite strong. For real or complex-valued sums, it is equivalent
to the corresponding sum being P absolutely convergent; for sums of Hilbert space elements
however, it does not imply that i∈Ie∥ui ∥ is convergent.
Proposition/Definition 2.4.
• An orthonormal set (ek )k∈I is a family of unit norm, pairwise orthogonal vectors.
(Bessel’s inequality)
• Any basis of H has the same cardinality; if H is separable, then any basis of H is
countable (or finite)
Proposition 2.5. If E = (ek )k∈I is a Hilbert basis of H, we have the following properties:
• E ⊥ = {0}.
• [E] = H.
5
2.2 Bounded operators on Hilbert space
[Source: [7, Chapter 2]]
• A is continuous.
It holds for bounded operators A, B s.t. the output space of B is the input space of A:
Proposition 2.8. We have the following properties for bounded operators A, B (with ap-
propriate compatibility for input/output spaces of operators):
• (AB)∗ = B ∗ A∗ .
• (A∗ )∗ = A.
6
Proof. We prove only the last statement: for any h ∈ H it holds
this implies ∥A∥2 ≤ ∥A∗ A∥ ≤ ∥A∗ ∥∥A∥, and further ∥A∥ ≤ ∥A∗ ∥. But since (A∗ )∗ = A,
we also obtain ∥A∗ ∥ ≤ ∥A∥. Thus we have equalities everywhere.
Proposition/Definition 2.9 (Rank-one and finite rank operators). Given two elements
u ∈ H, v ∈ H′ , we denote v ⊗ u∗ the linear mapping
v ⊗ u∗ : x ∈ H 7→ ⟨x, u⟩v.
For a proof of the closedness (implying completeness) of K(H, H′ ), see [7, Prop 4.2]
Theorem 2.13 (Spectral theorem for compact self-adjoint operators). Let A ∈ K(H) be
a compact, self-adjoint operator.
Then there exists a finite or countably infinite orthonormal family (ek )k∈I of eigenvectors
7
of A and family of corresponding real nonzero eigenvalues (λk )k∈I (here I = JnK for some
n ∈ N or I = N; possibly I = ∅ when A = 0) with λk → 0 if I = N, such that
X
A= λk ek ⊗ e∗k , (2.1)
k∈I
that is to say: X
∀u ∈ H Au = λk ⟨u, ek ⟩ek , (2.2)
k∈I
and the series in (2.1) converges in operator norm. It can be assumed, if needed, that the
sequence (λk )k∈I is ordered by decreasing absolute value (which we will just call ”ordered”
for short).
If I = N, the set σ(A) := {λk , k ∈ I} ∪ {0} ⊂ R is the spectrum of A and has 0 as only
accumulation value. If I is finite, we define σ(A) := {λk , k ∈ I} ∪ {0} if 0 is an eigenvalue
of A and σ(A) := {λk , k ∈ I} otherwise (the latter case can only happen if H is finite-
dimensional). P
If we group indices k corresponding to the same eigenvalue λ and denote Pλ := k:λk =λ ek ⊗
e∗k (necessarily a finite sum) then Pλ is the orthogonal projector on the eigenspace associated
to λ; when rewritten in the form
X
A= λPλ , (2.3)
λ∈σ(A)\{0}
the decomposition is unique (we will call this the ”canonical form” of the decomposition).
Correspondingly every representation of the form (2.1) has the same ordered sequence
(λk )k∈I (every nonzero eigenvalue of A is present in this sequence with its degree of multi-
plicity.)
If H is separable, we can complete (ek )k∈I to a finite or countable Hilbert basis of H. Defin-
ing λℓ = 0 for the added vectors of the completed basis, relations (2.1)-(2.2)-(2.3) still hold
for this completed basis (for (2.3), the sum is then over σ(A) and we include P0 , the orthog-
onal projector on the null space of A). We will call this the completed eigendecomposition
of A.
This completion operation can also be made if H is nonseparable, but then we have to
complete the orthonormal family (ek )k∈i to an uncountable Hilbert basis.
For a proof, see e.g. [5, Section 6.4] or [7, Section II.5]. Note that there is a more
general theory for the spectral decomposition of noncompact self-adjoint and even normal
(commuting with their adjoint) operators, for which the sum is replaced by an integral in
a suitable sense (see e.g. [7, Chapter IX]), be we won’t consider it here.
Proposition 2.14. If A is a compact, self-adjoint operator, it is positive if and only if its
spectrum is nonnegative.
8
The following consequence is important:
Theorem 2.15 (Singular value decomposition of a compact operator). Let A ∈ B(H, H′ ).
Then A is compact if there exists a finite or countably infinite set I (I = JnK for some
n ∈ N or I = N) and:
(1) a positive sequence (σk )k∈I , converging to 0 if I = N;
that is to say: X
∀u ∈ H Au = σk ⟨u, ek ⟩fk , (2.5)
k∈I
9
This establishes (2.5) i.e. convergence in the weak sense.
We will now check that conversely, if (σk , ek , fk )k∈I satisfy (1)-(2)-(3), then the sum
2
in (2.5) is well defined.
P Let M := maxk∈I σk2 . For any u ∈ H, since (ek )k∈I is an orthonor-
mal system we have k∈I ⟨u, ek ⟩ ≤ ∥u∥2 (Bessel’s inequality). Hence if ak := σk ⟨u, ek ⟩ we
2
have a2k ≤ M 2 ⟨u, ek ⟩2 , and further k∈I a2k ≤ M 2 ∥u∥2 , hence the sum k∈I ak fk is a well-
P P
defined element of H′ since (fk )k∈I is an orthonormal system. Linearity of this sum wrt.
u is straightforward, and we also have as a byproduct that the resulting linear operator
from H to H′ has operator norm bounded by M . This also establishes strong convergence
of (2.4) in operator norm, since by the same token, for any subset I ′ or I, it holds
X
σk fk ⊗ e∗k ≤ max′ σk ,
k∈I
k∈I ′ op
where the above series converges for the weak topology, that is to say
X
∀u ∈ H f (A)u = f (λk )⟨u, ek ⟩ek . (2.7)
k∈I
Proof. We have to prove the claim that the series (2.7) converges, and that the operator
defined this way is in B(H), so that this definition makes sense. The argument is the same
as the one used in the proof of Thm. 2.15: for any u ∈ H, we have k∈I ⟨u, ek ⟩2 ≤ ∥u∥2
P
10
Note that we do not have convergence in operator norm in (2.6) general, since (f (σk ))k∈I
does not converge to 0 in general (in fact, Theorem 2.15 indicates that convergence in
operator norm is equivalent to (f (σk ))k∈I converges to 0 and f (A) compact). For example,
if f is the function identically equal to 1, f (A) is the identity and the convergence is not
in the strong sense.
We have the following properties:
Proposition 2.17. Let A ∈ K(H) be a compact self-adjoint operator, and f, g be bounded
functions σ(A) → R. Then:
(a) For any λ, µ ∈ C : (λf + µg)(A) = λf (A) + µg(A);
(b) (f g)(A) = f (A)g(A), implying in particular f (A)g(A) = g(A)f (A);
(c) ∥f (A)∥op = supt∈σ(A) |f (t)|;
(d) If f is the constant function equal to 1, then f (A) = I;
(e) If f is the identity function, then f (A) = A.
The following proposition gives a useful trick:
Proposition 2.18 (Shift formula). Let A ∈ K(H, H′ ) and g : {λ2 |λ ∈ sv(A)} → R be a
bounded function. Then it holds
g(AA∗ )A = Ag(A∗ A) and A∗ g(AA∗ ) = g(A∗ A)A∗ . (2.8)
ek , fk )k∈I be an SVD of A, i.e. A = k∈I σk fk ⊗e∗k . Then A∗ = k∈I σk ek ⊗
P P
Proof. Let (σk ,P
fk∗ and AA∗ = k∈I σk2 fk ⊗ fk∗ , which is an eigendecomposition of AA∗ , hence
X X X
∗ ∗
g(AA )A = 2
g(σk )fk ⊗ fk σ ℓ fℓ ⊗ e ℓ = σk g(σk2 )fk ⊗ e∗k ,
k∈I ℓ∈I k∈I
it can be checked that Ag(AA∗ ) leads to the the same formula. The other part of the claim
is proved similarly.
meaning that if any of the sums is convergent the other also are, and their value is in-
dependent of the choice of basis. If these sums are convergent, the operator A is called
Hilbert-Schmidt operator.
11
Proof. By Parseval’s identity we have ∥Aek ∥2 = 2 ∗ 2
P P
ℓ∈J |⟨Aek , fℓ ⟩| = ℓ∈J |⟨ek , A fℓ ⟩| .
Summing over k ∈ I and using Fubini’s relation yields the two first equalities.
Proposition/Definition 2.20. The set of Hilbert-Schmidt operators from H to H′ , de-
noted HS(H, H′ ) (or sometimes B2 (H, H′ ) in the literature), is a closed linear subspace of
K(H, H′ ), and a Hilbert space, once endowed with the Hilbertian product
X X X
⟨A, B⟩2 := ⟨Aek , Bek ⟩ = ⟨A∗ fℓ , B ∗ fℓ ⟩ = ⟨Aek , fℓ ⟩⟨Bek , fℓ ⟩, (2.10)
k ℓ (k,ℓ)∈I×J
∥A∥op ≤ ∥A∥2 .
Proof. Let us fix (ek )k∈I , (fℓ )ℓ∈J orthonormal bases of H, H′ . It is easy to check that
HS(H, H′ ) is a vector space, using the definition and the fact that
|⟨(A + B)ek , fℓ ⟩|2 ≤ (|⟨Aek , fℓ ⟩| + |⟨Bek , fℓ ⟩|)2 ≤ 2 |⟨Aek , fℓ ⟩|2 + |⟨Bek , fℓ ⟩|2 .
Similarly, the sums in (2.10) are absolutely convergent if A and B are Hilbert-Schmidt due
to
1
⟨Aek , fℓ ⟩⟨Bek , fℓ ⟩ ≤ |⟨Aek , fℓ ⟩|2 + |⟨Bek , fℓ ⟩|2
2
for the last sum, and to the Cauchy-Schwartz inequality (to apply twice) for the two first
sums. P
It is straightforward to check that it is sesquilinear. Furthermore ⟨A, A⟩2 = k,ℓ |⟨Aek , fℓ ⟩| =
P 2
k ∥Aek ∥ is 0 iff A = 0. It is thus a Hibertian product and induces a norm on HS(H).
As a consequence, the Hilbertian product can be obtained by the polarization formula
⟨A, B⟩2 = 41 ∥A + B∥22 − ∥A + B∥22 (for a R-Hilbert space) or
⟨A, B⟩2 = 14 ∥A + B∥22 − ∥A + B∥22 + i∥A + iB∥22 − i∥A − iB∥22 (for a C-Hilbert space).
We have seen from Proposition 2.19 that the corresponding formula (2.9) does not depend
on the choice of basis, therefore neither does (2.10).
If u is a unit vector, we can complete it to an orthonormal basis (uk )k∈I , and we have
X
∥Au∥2 ≤ ∥Auk ∥2 = ∥A∥22 ,
k∈I
12
Hilbert-Schmidt operator can be arbitrarily approximated in operator norm by finite rank
operators, and hence is compact.
It remains to justify that HS(H, H′ ) is complete for its norm. Because the Hilbert-Schmidt
norm dominates the operator norm, a Cauchy sequence An in HS norm is Cauchy on
operator norm and converges in operator norm towards some limit A∞ , since B(H) is
complete. Thus ∥An ek ∥2 converges pointwise to ∥A∞ ek ∥2 for every k. By Fatou’s Lemma,
this implies that A∞ is Hilbert-Schmidt and that ∥An − A∞ ∥22 → 0.
Proposition 2.21. We have the following properties:
• For any u, v ∈ H and w, x ∈ H′ , and A ∈ HS(H, H′ ), it holds
• If (ek )k∈I and (fk )k∈J are bases of H, H′ , the family of rank-one operators (fℓ ⊗
e∗k )(k,ℓ)∈I×J forms an orthonormal basis of HS(H, H′ ).
13
• If A = σk fk ⊗ e∗k is an svd of A, then
P
k∈I
X
|A| = σk ek ⊗ e∗k .
k∈I
• For any u ∈ H,
∥Au∥ = ∥|A|u∥.
3. There exists a basis (ek )k∈I of H such that k∈I ⟨|A|p ek , ek ⟩ < ∞.
P
P p
4. For any basis (ek )k∈I of H, it holds k∈I ⟨|A| ek , ek ⟩ < ∞ and the value of this
quantity is independent of the choice of basis.
5. It holds X
sup |⟨Aek , fk ⟩|p < ∞,
(ek )k∈I ,(fk )k∈I k
where the supremum is over (ek )k∈I and (fk )k∈I bases of H.
Proof. The equivalence of points 1 to 4 (and equality of the quantities defined therein) is
a direct consequence of Propositions 2.19, 2.20, 2.21p
for Hilbert-Schmidt operators, and of
2
2.23, remarking that since |A| is self-adjoint, |A| 2 e = ⟨|A|p e, e⟩.
Concerning point 5, note that 2 ⇒ 5 by choosing the ”input and output” bases of a singular
value decomposition of A, so that |⟨Aek , fk ⟩|p = σkp for all k.
14
We now show the converse. Let (ek )k∈I , (fk )k∈I be the bases of a singular value decompo-
sition of A as above, and (eek )k∈I , (fek )k∈I arbitrary bases of H. We have
D E X X
Ae
ek , fek = σℓ ⟨ek , eeℓ ⟩⟨fk , feℓ ⟩ ≤ σℓ Wk,ℓ ,
ℓ ℓ
where
Wk,ℓ := |⟨ek , eeℓ ⟩| ⟨fk , feℓ ⟩ .
Observe that by the Cauchy-Schwartz inequality, it holds
X 21 X 12
2
X 2
0 ≤ Wk,• := Wk,ℓ ≤ |⟨ek , eeℓ ⟩| ⟨fk , fℓ ⟩
e = ∥ek ∥∥fk ∥ = 1,
ℓ ℓ ℓ
P
and similarly 0 ≤ W•,ℓ := Wk,ℓ ≤ 1. Therefore, by Jensen’s inequality:
k
XD E p XX p
Ae
ek , fek ≤ σℓ Wk,ℓ
k k ℓ
XX
≤ σℓp Wk,ℓ Wk,•
p−1
k ℓ
X
≤ σℓp W•,ℓ
ℓ
X
≤ σℓp .
ℓ
This argument also shows that the quantities appearing in points 2 and 5 are identical.
The variational characterization of point 5 allows us to establish the triangle inequality
(note that the other characterizations are not so nice for this since they involve |A|, not
A, and we are not sure what to do with |A + B|), namely for any bases (ek )k∈I , (fk )k∈I :
X p1 X p p1
p
|⟨A + Bek , fk ⟩| ≤ |⟨Aek , fk ⟩| + |⟨Bek , fk ⟩|
k k
X p1 X p1
p
≤ |⟨Aek , fk ⟩| + |⟨Bek , fk ⟩|p
k k
≤ ∥A∥p + ∥B∥p .
The announced norm inequalities follow directly from point 2, and the closedness/completeness
property a similar argument as in the proof of Theorem 2.20, using the characterization of
point and Fatou’s lemma.
15
choice of basis.
This quantity is called trace of A and denoted tr(A).
For this reason an operator in B1 (H) is also called ”trace-class” and ∥A∥1 sometimes ”trace
norm” (note that ∥A∥1 ̸= Tr(A) in general, though!)
Proof. As usual we start with a svd of A, A = k∈I σk fk ⊗ e∗k . Then for any orthonormal
P
basis (uℓ )ℓ∈I , it holds
X XX
|⟨Auℓ , uℓ ⟩| = σk ⟨uℓ , ek ⟩⟨fk , uℓ ⟩
ℓ∈I ℓ∈I k∈I
XX
≤ σk |⟨uℓ , ek ⟩⟨fk , uℓ ⟩|
ℓ∈I k∈I
X X
= σk |⟨uℓ , ek ⟩⟨fk , uℓ ⟩|
k∈I ℓ∈I
! 21 ! 21
X X X
≤ σk |⟨uℓ , ek ⟩|2 |⟨fk , uℓ ⟩|2
k∈I ℓ∈I ℓ∈I
X
= σk = ∥A∥1 < ∞,
k∈I
thus all the sums involved in the above chain on inequality converge (absolutely).
Since the first sum converges absolutely, we can write
X XX
⟨Auℓ , uℓ ⟩ = σk ⟨uℓ , ek ⟩⟨fk , uℓ ⟩
ℓ∈I ℓ∈I k∈I
X X
= σk ⟨fk , uℓ ⟩⟨ek , uℓ ⟩
k∈I ℓ∈I
X
= σk ⟨fk , ek ⟩;
k∈I
16
1
6. If A ∈ Bp (H) for p ∈ [1, ∞), then ∥A∥p = Tr(|A|p ) p .
7. If A, B are Hilbert-Schmidt operators, then Tr(AB) = Tr(BA) and ⟨A, B⟩2 = Tr(B ∗ A).
8. If A ∈ B1 (H) and B ∈ B(H) then AB and BA are both trace-class and Tr(AB) =
Tr(BA).
Proof. We only prove the two last points, as the previous ones are straightforward (possibly
reusing the explicit expression found in the proof of Proposition 2.25).
If A, B are Hilbert-Schmidt operators, we simply identify from (2.10) that for any basis
(ek )k∈I : X X
⟨A, B⟩2 = ⟨Aek , Bek ⟩ = ⟨B ∗ Aek , ek ⟩ = Tr(B ∗ A).
k k
(we know that the left-hand side is a Hilbert sum, so the right-hand side too, implying
B ∗ A ∈ B1 (H).)
Using (2.10) again, we have
X X
Tr(AB) = ⟨B, A∗ ⟩2 = ⟨Bek , eℓ ⟩⟨A∗ ek , eℓ ⟩ = ⟨Bek , eℓ ⟩⟨Aeℓ , ek ⟩, (2.12)
k,ℓ k,ℓ
where we have used characterization (2.9) of the HS norm. Since the double sum in
expression (2.12) is symmetrical in A and B, it holds Tr(AB) = Tr(BA).
If A ∈ B1 (H) and T ∈ B(H), we can we write A as a product A = BC of two Hilbert-
Schmidt operators B, C (this is clear from a svd of A.) Furthermore, since T is bounded,
CT and T B are also Hilbert-Schmidt (use Definition 2.19 of a Hilbert-Schmidt operator,
or Proposition 2.30 in the next section), therefore using the previous point
Tr(A)
intdim(A) := .
∥A∥op
17
The interpretation of this quantity is that is measures over how many dimensions the
spectrum of A is ”mainly concentrated”. Here are a few properties to get some intuition
on this quantity.
Proposition 2.28.
• If A is finite-rank, A ̸= 0, then 1 ≤ intdim(A) ≤ rank(A).
• If A is an orthogonal projector onto a finite-dimensional subspace E, then intdim(A) =
dim(E).
• If ∥A∥op = 1 and the singular values of A satisgy σk (A) ≤ k −α , α > 1, then
α
intdim(A) ≤ α−1 .
Proof. For the last point, use the sum-integral comparison
Z ∞
X
−α α
Tr(A) ≤ k ≤1+ x−α dx = .
k≥1 1 α−1
k,ℓ
X
= σℓp ,
ℓ
18
where we have used k |Wℓ,k |2 = ∥uℓ ∥2 = 1, ℓ |Wℓ,k |2 = ∥ek ∥2 = 1, and Jensen’s inequality
P P
p
for the convex (if p ≥ 2) resp. concave (if 1 ≤ p ≤ 2) function x 7→ x 2 (with the inequality
in different directions according to the case; it is an equality if p = 2.)
On the other hand, for any bases (uk )k∈I , (vℓ )ℓ∈I of H, it holds
X XX p X
p 2 2
∥Auk ∥ = |⟨Auk , vℓ ⟩| ⋛ |⟨Auk , vℓ ⟩|p ,
k k ℓ k,ℓ
p
where we have super- (if p ≥ 2) resp sub-additivity (if p ≤ 2) of the function x 7→ x 2 (note
that this is in the opposite direction as the previous display.) Note again that if take the
input/output bases of the svd of A we find again ∥A∥pp .
Proposition 2.30. If A ∈ B(H) and B ∈ Bp (H) (p ∈ [1, ∞]), then AB ∈ Bp (H) and
∥AB∥p ≤ ∥A∥op ∥B∥p .
Similarly, BA ∈ Bp (H) and ∥BA∥p ≤ ∥A∥op ∥B∥p .
(So Bp (H) is an ideal of B(H).)
Proof. We assume p < ∞ as the case p = ∞ (i.e. ∥.∥∞ = ∥.∥op ) was handled before.
It holds for any orthonormal basis (ek )k≥1 of H:
X X X
∥ABek ∥p ≤ (∥A∥op ∥Bek ∥)p = ∥A∥pop ∥Bek ∥p .
k k k
19
where we have used the (standard) Hölder’s inequality for the first inequality, and point 5
of Proposition 2.24 for the second.
Proving that Tr(AB) = Tr(BA) in that context is annoying. If A = k µk fk ⊗ e∗k is an
P
svd of A, we can start as above, writing
X X
⟨ABuk , uk ⟩ = σk ⟨Avk , uk ⟩
k k
XX
= σk µℓ ⟨vk , eℓ ⟩⟨fℓ , uk ⟩. (2.14)
k ℓ
It seems that the obtained expression is symmetric in the role of A, B and that we are
done? Unfortunately, for this argument to be correct we have to establish that the double
sum over k, ℓ is absolutely convergent (we know that for any fixed k, the sum over ℓ is
absolutely convergent; that is not enough to establish that the double sum is.) Let us
′
denote Wk,ℓ := |⟨vk , ℓ⟩| and Wk,ℓ := |⟨fℓ , uk ⟩|. Assume 1 < p ≤ 2 ≤ q < ∞. We want to
establish the convergence of
X X
′
σk µℓ |⟨vk , eℓ ⟩||⟨fℓ , uk ⟩| = σk µℓ Wk,ℓ Wk,ℓ
k,ℓ k,ℓ
2
X
q ′ 1− 2
= (σk Wk,ℓ )(µℓ Wk,ℓ q Wk,ℓ )
k,ℓ
! 1q ! p1
X X p(1− 2 )
≤ σkq Wk,ℓ
2
µpℓ Wk,ℓ q (Wk,ℓ
′ p
)
k,ℓ k,ℓ
! 1q ! p1
X X X p(1− 2 )
= σkq p
µℓ ′ p
Wk,ℓ q (Wk,ℓ ) ,
k ℓ k
2
P
where we have used Hölder’s inequality, then k Wk,ℓ = 1. Finally, applying Hölder’s
inequality again:
1− p2 ! p2
(1− 2
q)
X p(1− 2 ) X p (1− p ) X
′ p ′ 2
Wk,ℓ q (Wk,ℓ ) ≤ Wk,ℓ 2 (Wk,ℓ )
k k k
!1− p2 ! p2
X X
2 ′ 2
= Wk,ℓ (Wk,ℓ ) = 1.
k k
Thus (2.14) is absolutely convergent (we ended up also re-proving the trace-Hölder inequal-
ities established before, in a more complicated way. . . )
Proposition 2.32. Let A, B be two selfadjoint Hilbert-Schmidt operators and f : R → R
an L-Lipschitz function.
Then
∥f (A) − f (B)∥2 ≤ L∥A − B∥2 . (2.15)
20
Important remark: one can wonder if (2.15) holds for other norms. It is not the
case. In particular, it does not hold in general for the operator norm ∥.∥op : functions
satisfying (2.15) for the operator norm are called ”operator Lipschitz”, and not every
real-valued Lipschitz function is operator Lipschitz, even with a different constant.
Proof. Let (ek , λk )k≥1 and (fℓ , µℓ )ℓ≥1 be eigendecompositions of A and B, respectively.
Observe that in general, for an operator M ∈ HS(H), since both (ek )k≥1 and (fℓ )ℓ≥1
are Hilbert bases it holds
X X
∥M ∥22 = ∥M ek ∥2 = |⟨M ek , fℓ ⟩|2 .
k k,ℓ
It is now obvious that we can use the Lipschitz property of f for each (k, ℓ) term to reach
the claim.
21
3 Tools from concentration of measure
3.1 Random variables in Banach space
Proposition/Definition 3.1. Let B be a separable Banach space, with its Borel σ-
algebra. A random variable X from a base probability space (Ω, F, P ) to B is Bochner
integrable if E[∥X∥] < ∞. In this case there is a well-defined expectation E[X] ∈ B
satisfying the following properties:
• ∥E[X]∥ ≤ E[∥X∥];
• Simple linearity: if X, Y are Bochner-integrable then E[λX + Y ] = λE[X] + E[Y ];
• Operator linearity: for any bounded linear operator A from B to a separable Banach
space B ′ it holds that AX is Bochner-integrable in B ′ and
E[AX] = AE[X].
E[⟨X, u⟩⟨X, v⟩] = E[⟨v, ⟨u, X⟩X⟩] = ⟨v, E[⟨u, X⟩X]⟩ = ⟨v, Σu⟩.
22
3.2 Hoeffding’s inequality in Hilbert space
We first recall the Azuma-McDiarmid concentration theorem for “stable” functions of
independent random variables, also called Bounded difference inequality.
Theorem 3.4 (Azuma-McDiarmid). Let X be a measurable space, and f : X n → R a
measurable function such that
∀i ∈ {1, . . . , n}, ∀(x1 , . . . , xn ) ∈ X n , ∀x′i ∈ X :
|f (x1 , . . . , xi , . . . , xn ) − f (x1 , . . . , x′i , . . . , xn )| ≤ 2ci , (Stab)
for some positive constants (c1 , . . . , cn ).
Let (X1 , . . . , Xn ) be a independent family of random variables taking values in X (not
necessarily
Pnidentically distributed), then f (X1 , . . . , Xn ) is a sub-Gaussian variable with pa-
2
rameter i=1 ci , so that in particular
t2
P[f (X1 , . . . , Xn ) > E[f (X1 , . . . , Xn )] + t] ≤ exp − Pn 2 . (3.1)
2 i=1 ci
(In particular, if all constants ci are equal to c, the bound is exp(−t2 /(2nc2 )).)
For a proof, see e.g. [13, Section 3.4] or [2, Section 6.2] or [4, Section 6.1].
Theorem 3.5. Let X1 , . . . , Xn be i.i.d., Bochner-integrable random variables taking values
in a Hilbert space H, and having expectation 0.
Assume ∥Xi ∥ ≤ PB a.s. for some constant B.
n
Then if Sn := i=1 Xi , it holds for any δ ∈ (0, 1):
1 B p
−1
P Sn ≥ √ 1 + 2 log(δ ) ≤ δ.
n n
Proof. By the assumption ∥Xi ∥ ≤ B, we can assume that the variables Xi in fact take
their values in the ball of H centered at the origin and of radius B. As a first step, we note
that the function F (X1 , . . . , Xn ) := n1 Sn satisfies the condition (Stab) (with ci = B for
all i) on this ball, by the triangle inequality: namely, if we replace Xi by Xi′ in the sum
(i)
Sn , denoting it Sn , it holds
1 1 (i) 1 1
Sn − Sn ≤ Sn − Sn(i) = ∥Xi − Xi′ ∥ ≤ 2B.
n n n n
Applying the Azuma-McDiarmid inequality, we get
nt2
1 1
P Sn > E[∥Sn ∥] + t ≤ exp − 2 . (3.2)
n n 2B
In a Hilbert space, we have moreover due to Jensen’s inequality:
" n # 12
1 X √ 1 √
E[∥Sn ∥] ≤ E ∥Sn ∥2 2 ≤ E ⟨Xi , Xj ⟩ = nE ∥X1 ∥2 2 ≤ B n,
i,j=1
since E[⟨Xi , Xj ⟩] =
p0 if i ̸= j, using independence and E[Xi ] = 0. Combining this with (3.2)
and taking t = B 2 log(δ −1 )/n yields the claim.
23
3.2.1 (*) Extension to Banach space
Observe that equation (3.2) also holds more generally in a Banach space, since we have only
used the triangle inequality for the bounded difference concentration inequality. Only the
upper bound on E[∥Sn ∥] used the Hilbertian structure. For Banach spaces, a convenient
notion is that of type:
1 1
E[∥Sn ∥] ≤ 2Tp (B)n p E[∥X1 ∥p ] p .
Proof. Denote (X1′ , . . . , Xn′ ) an independent copy of (X1 , . . . , Xn ). Put C = Tp (B). Then
we have
" n #
X
E[∥Sn ∥] = E (Xi − E[Xi′ ])
i=1
" n
#
X
≤E (Xi − Xi′ ) (3.3)
i=1
" n
#
X
=E εi (Xi − Xi′ ) (3.4)
i=1
" n
#
X
≤ 2E εi X i (3.5)
i=1
" n
# p1
X p
≤ 2E εi X i (3.6)
i=1
n
! p1
X
≤ 2C E[∥Xi ∥p ] (3.7)
i=1
1 1
= 2Cn E[∥X1 ∥p ] p ,
p (3.8)
24
where (3.4) is due to invariance of the distribution of the vector (Xi − Xi′ )1≤i≤n by sign-
flipping, (3.5) is the triangle inequality, (3.6) is Jensen’s inequality, and (3.7) is the defini-
tion of Rademacher type p. Concerning (3.3), we use the first property of Proposition 3.1
for the conditional expectation E[.|Xi ].
Once we plug this estimate into (3.2), we see that for Banach spaces of type p < 2,
the situation is radically different than in Hilbert spaces, because the above bound on the
expectation of ∥n−1 Sn ∥ is of higher order O(n−(1−1/p) ) than the deviation term of order
1
O(n− 2 ).
Still, he have the following facts:
• for p ∈ [1, ∞), the Schatten p-class Sp (H) of operators is of the same type as Lp [25].
Hence, we have (as far as the order in n is concerned) a control of the same order in n (up to
constants) as in a Hilbert space for Lp spaces and Sp spaces for 2 ≤ p < ∞. Unfortunately,
p = ∞ is radically different as L∞ spaces are of type 1. Note that the above arguments
are useless in Banach spaces of type 1 (it does not give a better result than the triangle
inequality; any Banach space is at least of type 1).
Lemma 3.8. Let (Xi )i∈JnK be a sequence of Bochner integrable, independent random vari-
ables with values in a separable Banach space B with E[Xi ] = 0 for all i.
Let F : B → R be a function with the following properties:
′′
c) For all x, u ∈ B, u ̸= 0, it holds Fx,u (t) ≤ H(u, t)F (x) for some function H(u, t) ≥ 0.
Z 1
En [F (Sn )] ≤ F (Sn−1 ) 1 + E H(Xn , t)(1 − t)dt ,
0
25
Proof. Using the assumptions and Taylor’s expansion with integral remainder we get
Theorem 3.9 (Pinelis). Let (Xi )i∈JnK be a sequence of Bochner integrable, independent
Pn with values in a separable Banach space B with E[Xi ] = 0 for all i.
random variables
Denote Sn = i=1 Xn .
Assume Ψ : B → R+ is a function satisfying the following:
Then for any λ > 0 such that E[exp(λ∥Xk ∥)] for all k, it holds for all t > 0:
n
!
X
P[Ψ(Sn ) > t] ≤ 2 exp −λt + D2 Ak , (3.9)
k=1
Proof. Since we want to bound deviations from 0 rather than from the expectation E[Ψ(Sn )],
we introduce the symmetrized random variable RΨ(Sn ), where R is a Rademacher variable
(random sign) independent of Sn .
We start by the usual Chernov’s method bound: for any λ ≥ 0 and t ≥ 0, and recalling
Ψ(.) ≥ 0, it holds
26
It comes
′
Fx,u (t) = λ sinh(λG(t))G′ (t) = λ sinh(λG(t))[dΨx+tu ](u),
which shows that Assumption (b) of Lemma 3.8 is satisfied (with Fx = sinh(λG(0))dΨx );
and
′′
Fx,u (t) = λ sinh(λG(t))G′′ (t) + λ2 cosh(λG(t))(G′ (t))2 .
To upper bound the first term, we bound it by 0 if G(t) and G′′ (t) are of opposite signs,
otherwise we use |sinh(u)| ≤ u cosh(u), thus
(
′′ (G(t)G′′ (t) + (G′ (t))2 )λ2 cosh(λG(t)) if G(t)G′′ (t) ≥ 0;
Fx,u (t) ≤
(G′ (t))2 )λ2 cosh(λG(t)) if G(t)G′′ (t) < 0.
since D ≥ 1.
Note that Assumption 2 implies that Ψ is Lipschitz; we use this to get the bound
where we have used cosh(a + b) ≤ cosh(a) exp(b) for b ≥ 0. Gathering the previous
estimates we obtain (in all cases)
′′
Fx,u (t) ≤ D2 λ2 ∥u∥2 exp(λt∥u∥)F (x),
i.e. Assumption (c) of Lemma 3.8 is satisfied with H(u, t) = D2 λ2 ∥u∥2 exp(λt∥u∥). To
apply the lemma all is left is to evaluate
Z 1 Z 1
2 2 2
H(u, t)(1 − t)dt = D λ ∥u∥ (1 − t) exp(λt∥u∥)dt = D2 (exp(λ∥u∥) − 1 − λ∥u∥).
0 0
Applying Lemma 3.8 recursively (starting from n backwards to 1), and using Ψ(0) = 0 for
the last step, we thus get, with Ak := E[exp(λ∥Xk ∥) − 1 − λ∥Xk ∥]:
n n
!
Y X
E[cosh(λΨ(Sn ))] ≤ (1 + D2 Ak ) ≤ exp D2 Ak ,
k=1 k=1
27
Corollary 3.10. Under the same assumptions as Theorem 3.9, if for all i = 1, . . . , n, and
for all integers k ≥ 2:
h i k! 2 k−2
E ∥Xi ∥k ≤ σ M , (3.10)
2D2
for some constants M, σ > 0,
then for all t ≥ 0:
σ2
1 M
P ψ Sn > t ≤ 2 exp −n 2 h1 t 2 ,
n M σ
√
where h1 (u) := 1 + u − 1 + 2u.
As a consequence, for any x ≥ 0:
" r #
1 2x M x
P ψ Sn > σ + ≤ 2 exp(−x).
n n n
uj
Proof. Since π(u) = eu − 1 − u =
P
j≥2 j! , it holds (using Xi /n instead of Xi to account
for the sum normalization):
X λk h k
i X
k −k 2 k−2 2 1 σ 2 n−2 λ2
Ai = E[π(λ∥Xi ∥/n)] = k
E ∥X i ∥ ≤ λ n σ M /2D = 2 2(1 − λM n−1 )
,
k≥2
k!n k≥2
D
σ 2 λ2 /n
1
P Ψ Sn > t ≤ 2 exp −λt + ,
n 2(1 − λM/n)
and elementary computations show that
λ2 v
v ct
sup λt − = 2 h1 ,
λ∈(0,1/c) 2(1 − cλ) c v
√
leading to the first claim. It can be checked that h−1
1 (u) = u + 2u, leading to the second
claim.
The primary application of Pinelis’ concentration inequality is for Hilbert and (certain)
Banach norms. However norms are not differentiable everywhere (in particular not at the
origin). For this, we need the following result in order to slightly weaken the assumption
on Ψ:
Definition 3.11. If B is a Banach space, we call a function Ψ : B → R+ (2, D)-smooth
(for some constant D ≥ 1) if it satisfies Ψ(0) = 0 and for any x, u ∈ B:
28
Proposition 3.12. Theorem 3.9 and Corollary 3.10 also hold for any (2, D)-smooth func-
tion Ψ.
Proof. We only sketch the proof. As a first step, a centered random variable X taking values
in a separable Banach space B can be approximated by a sequence Xk such that Xk con-
ververges to X in probability, Xk is centered and takes only a finite number of values. This
can be seen as follows: put ε = 1/k, there exists a compact Kε such that P[X ̸∈ Kε ] ≤ ε by
tighthness of a probability measure on a separable Banach space. Cover Kε by a finite num-
ber of closed balls of radius ε. Define Xε = E[X|Fε ], where Fε is the (finite) sigma-algebra
generated by those balls. Thus Xε only takes a finite number of values, E[Xε ] = E[X] = 0,
P
and Xε → X because P[∥X − Xε ∥ > 2ε] ≤ P[X ̸∈ Kε ] + P[X ∈ Kε ; ∥X − Xε ∥ > 2ε] ≤ ε.
The second event has probability 0 because on Kε , Xε is the conditional average of X on
a partition piece of diameter less than 2ε, so ∥X − E[X|Fε ]∥ ≤ 2ε.
As a second step, we establish that on a finite-dimensional Banach space Be (the one
generated by the finite-numbered values of the approximant Xk for fixed k), there exists a
sequence Ψ e n of functions Be → R+ such that Ψ
e n (x) → Ψ(x) for all x ∈ B,
e and functions Ψen
satisfy (independently of n) conditions 1-2-3 of Theorem 3.9. Namely, let N be a centered
Gaussian variable in Be (for instance, take any (finite) basis of Be and a standard normal
with respect to that basis), then define
1
Ψε (x) = E Ψ2 (x − εN ) 2 .
29
A particular case of interest is for bounded random vectors. If X is a centered, bounded
random vector in Hilbert space, with ∥X∥ ≤ M and E[X ⊗ X ∗ ] = Σ its covariance opera-
tor, then
h i
E ∥X∥k ≤ M k−2 E ∥X∥2 = M k−2 E[Tr(X ⊗ X ∗ )] = M k−2 Tr(Σ),
• using a closely related (and more standard) notion of smoothness (for a connection
between the two notions see e.g. [11], eq. (9) ), it was established by [25] that the
p-Schatten classes share the same ”modulus of smoothness” (up to constant factors)
as an Lp space, for 1 ≤ p < ∞.
3.3.1 Discussion
TODO
• they rely on inequalities relating traces and functional calculus for matrices (”Trace
inequalities”), in particular Lieb’s inequality. One has to be somewhat careful that
the traces exist for operators and that the arguments can be carried over.
• the most basic inequalities involve the matrix dimension, which won’t be possible
for opertors over an infinite-dimensional Hilbert space. For this reason we will look
at refined concentration inequalities using the intrinsic dimension rather than the
ambient dimension.
30
Theorem 3.13 (Matrix Bernstein’s inequality with intrinsic dimension). Let X1 , . . . , Xn
be independent random
Pn matrices of the same size with E[Xk ] = 0 and ∥Xk ∥op ≤ L for all
k. Denote Sn = i=1 Xi .
Assume there are two matrices V1 , V2 satisfying
n
X
V1 ⪰ E[Sn Sn∗ ] = E[Xi Xi∗ ];
i=1
n
X
V2 ⪰ E[Sn∗ Sn ] = E[Xi∗ Xi ];
i=1
t2 /2
h i
P ∥Sn ∥op > t ≤ 8d exp − 2 .
σ + Lt/3
t2 /2
h i
P ∥Sn ∥op > t ≤ 8d exp − 2 .
σ + Lt/3
31
Remark: Condition (3.13) implies by positivity of expectation that, in fact, Xi must
be Hilbert-Schmidt operators. We have not required this explicitly in the assumption, to
avoid splitting hairs about what space the Xi s are Bochner integrable in. The condition
∥Xk ∥op ≤ L is enough to ensure Bochner integrability in B(H), but Bochner integrability
in HS(H) is not formally required (note that the Xi s may be unbounded in HS(H)).
We use the same device as in the proof of Proposition 3.12, which we rewrite for
convenience as a separate lemma:
Lemma 3.15. Let X be a Bochner-integrable random variable taking values in the space
K(H, H′ ) of compact operators from H to H′ , with E[X] = 0. Then there exists a sequence
of random variables X (k) converging in probability to X for ∥.∥op , i.e.
h i
(k)
∀t > 0 : lim P X −X op
≥ t → 0,
k→∞
• E X (k) = 0;
• X (k) only takes a finite number of different values in a subspace K(k) of operators,
such that there exists two finite-dimensional subspaces E (k) , F (k) of H and H′ with
(in other words PE (k) APF (k) = A for all A ∈ K(k) , where PE (k) , PF (k) are the orthogonal
projections onto E (k) , F (k) ).
• If XX∗ and X ∗ X are Bochner integrable, then E X (k) (X (k) )∗ ⪯ PE (k) E[XX ∗ ]PE (k) ,
Proof. For a fixed k, put ε = 1/k, there exists a compact Kε such that P[X ̸∈ Kε ] ≤ ε
by tighthness of a probability measure on a separable Banach space. Cover Kε by a fi-
nite number of closed balls of radius ε. Define Xε = E[X|Fε ], where Fε is the (finite)
sigma-algebra generated by those balls. Thus Xε only takes a finite number of values, and
E[Xε ] = E[X] = 0 by the properties of conditional expectation.
Let (A1 , . . . , ANε ) be the finite set of values taken by Xε . Since these are compact op-
erators, there exists (B1 , . . . , BNε ) such that Bi is finite-rank, ∥Bi − Ai ∥op ≤ ε for all
i ∈ {1, . . . , Nε }. Let Eε = ( Ker(Bi ), i ≤ Nε )⊥ and Fε = [Ran(Bi ), i ≤ Nε ]. Then be-
T
cause Ker(Bi ) is of finite codimension and Ran(Ai ) is of finite dimension, Eε and Fε are of
finite dimension; and we have PEε Bi PFε = Bi for all i ≤ N ε .
Define now X eε = PEε Xε PFε . Then PEε X eε PFε = Xeε , E Xeε = 0 by linearity, and it holds
h i h i h i
P X eε − X > 4ε ≤ P e ε − Xε
X > 2ε + P X ε − X > 2ε . (3.15)
op op op
32
Concerning the first term, we have
e ε − Xε
X ≤ sup PEε Ai PFε − Ai op
op
i≤Nε
≤ sup PEε Ai PFε − Bi op + ∥Bi − Ai ∥op
i≤Nε
≤ ε + sup PEε (Ai − Bi )PFε op
i≤Nε
≤ ε + sup Ai − Bi op
i≤Nε
≤ 2ε,
hence the first probability is zero. Concerning the second term in (3.15):
h i h i
P Xε − X op > 2ε ≤ P X ∈ Kε ; Xε − X op > 2ε + P[X ̸∈ Kε ] ≤ ε,
The first event above has probability 0 because on Kε , Xε is the conditional average of X
on a partition piece of diameter less than 2ε, so ∥X − E[X|Fε ]∥ ≤ 2ε. This proves that
X (k) := X
e1/k converges in probability to X for ∥.∥ .
op
Let us turn to the additional claims on boundedness and second moment. Since Xε is
defined as a conditional expectation of X, it inherits the boundedness property ∥Xε ∥op ≤ L,
and it holds X eε
op
= PEε Xε PFε op ≤ ∥Xε ∥op ≤ L. Concerning the variance, first note
that mimicking the usual argument for vector-valued variables, in general for an operator-
valued random variable Z such that ZZ ∗ is Bochner integrable, it holds
so E[ZZ ∗ ] ⪰ E[Z]E[Z]∗ ; and this also holds for conditional expectations. Because Xε is
∗ ∗
a conditional expectation of X, we therefore have E Xε Xε ⪯ E[XX ]. Finally, we have
Xeε = P Xε P ′ for two orthogonal projectors P, P ′ .
In general, A∗ P A ⪯ A∗ A. Namely, it holds for any u:
33
assume E (k) = F (k) ; let us denote P (k) the orthogonal projector on E (k) . Furthermore
(k)
Lemma 3.15 guarantees Xi op ≤ L and
n
X n
X
(k) 2 (k)
E Xi2 P (k) ⪯ P (k) V P (k) := V (k) .
E (Xi ) ⪯ P
i=1 i=1
Observe that Tr(V (k) ) = Tr(P (k) V ) ≤ P (k) op Tr |V | = Tr V since V is positive; and
V (k) op = P (k) V P (k) op ≤ ∥V ∥op ≤ σ 2 . To summarize, for fixed k the approximant
(k) (k)
variables X1 , . . . , Xn are independent, self-adjoint operators that are null on the orthog-
onal of the finite-dimensional subspace E (k) , hence can be conceived as finite-dimensional
Hermitian matrices acting on E (k) ; this is also the case for V (k) . We can thus apply
Theorem 3.13 to these variables, resulting in
t2 /2
h i
(k)
P Sn op > t ≤ 4d exp − 2 ,
σ + Lt/3
(k) Pn (k) (k) (k)
where Sn = i=1 Xi . Since Xi converges in probability to Xi , so does Sn to Sn ,
yielding the claim.
34
Proof. We take successively expectation with respect to X1 , . . . , Xn . Assume that after
k − 1 steps, we have established
" n
! # k−1 n
!
X X X
E Tr exp λ Xi Xk , . . . , Xn ≤ Tr exp ξi + λ Xi , (3.17)
i=1 i=1 i=k
combining with (3.17), and replacing the value of Hk we get (3.17) for k ← (k + 1). We
conclude by a straightforward recursion.
In order to establish Theorem 3.13, we will need the following properties on Hermitian
matrix functional calculus (defined in the same way as operator functional calculus, see
Section 2.4):
Proposition 3.18. Let A, B be Hermitian matrices. We denote A ⪯ B, resp. A ≺ B iff
B − A is positive semidefinite, resp. positive definite.
1. If f is nondecreasing on the union of the spectra of A and B, and A ⪯ B, then
Tr f (A) ≤ Tr f (B).
where the maximum runs over linear subspaces of dimension i. From this it follows that
A ⪯ B ⇒ ∀i λi (A) ≤ λi (B)
⇒ ∀i λi (f (A)) = f (λi (A)) ≤ f (λi (B)) = λi (f (B))
⇒ Tr(f (A)) ≤ Tr(f (B)).
(note that we have used the fact that f is nondecreasing to justify each of the relations in
the second implication).
For the second point, let (λi , ei ) be an eigendecomposition of A, then we have for any
i: ⟨ei , f (A)ei ⟩ = f (λi ) ≤ g(λi ) = ⟨ei , g(B)ei ⟩. It follows f (A) ⪯ g(A).
The third point is non-trivial: it is not true that a monotone function is ”operator
monotone”, in general. But is is true for log. See [26] for a short proof.
35
Lemma 3.19. If X is a random Hermitian matrix such that it spectrum is upper bounded
by L and E[X] = 0, then, for λ ∈ [0, 3/L]:
λ2 /2
E X2 .
log E[exp λX] ⪯
1 − λL/3
Proof. We start with defining ψ(x) := exp(x) − x − 1 and f (x) := ψ(x)/x2 . By inspection
1
of its series expansion, it holds that f (x) is nondecreasing and that f (x) ≤ g(x) := 2(1−x/3) .
Thus for x ≤ L and λ ∈ [0, 3/L]:
ψ(λx) ≤ λ2 x2 g(λL).
Using point 2 of Proposition 3.18, it follows that for a Hermitian matrix X such that its
spectrum is upper bounded by L, it holds
ψ(λX) ⪯ λ2 X 2 g(λL).
λ2 /2
E X2 .
E[ψ(λX)] = E[exp(λX)] − I ⪯
1 − λL/3
λ2 /2 λ2 /2
2
E X2 .
log E[exp(λX)] ⪯ log I + E X ⪯
1 − λL/3 1 − λL/3
λ2 /2
where g(λ) = 1−λL/3
. Using point 1 of Proposition 3.18, it comes
n
!
X
Tr exp log E[exp λXi ] ≤ Tr exp(g(λ)V ).
i=1
Combining the above with Proposition 3.17, we obtain a bound on the ”matrix Laplace
transform”
Tr E[exp λSn ] ≤ Tr exp(g(λ)V ). (3.18)
36
First, we relate the left-hand side of (3.18) to the probability of deviation of λmax (Sn ),
where λmax denotes largest eigenvalue. In general, if ψ is a nondecreasing function from R
to R+ , it holds:
We apply this to the function ψ(t) = exp(λt) − λt − 1 (with λ > 0), use E[Sn ] = 0 and
combine with (3.18) to obtain
Tr E[exp(λSn ) − 1] Tr(exp(g(λ)V ) − 1)
P[λmax (Sn ) > t] ≤ ≤ . (3.19)
exp(λt) − λt − 1 exp(λt) − λt − 1
Second, we further bound above right-hand side. For this, denote ψ(t) e = exp(t) − 1; we
t e
note that since ψ is convex, it holds ψ(t) ≤ M ψ(M ) for t ≤ M . Since ∥V ∥ ≤ σ 2 , we deduce
e e
by points 1-2 of Proposition 3.18 that
Tr g(λ)V e 2
Tr(exp(g(λ)V ) − 1) ≤ ψ(σ g(λ)) ≤ d exp(σ 2 g(λ)). (3.20)
g(λ)σ 2
exp(σ 2 g(λ))
P[λmax (Sn ) > t] ≤ d . (3.21)
exp(λt) − λt − 1
The rest is just somewhat tedious estimates. We rewrite the expression in the above bound
as
exp(σ 2 g(λ))
exp(λt) 2 3
= exp(σ g(λ) − λt) ≤ 1 + 2 2 exp(σg(λ) − λt),
exp(λt) − λt − 1 exp(λt) − λt − 1 λt
u
e
were one used eu −u−1 ≤ 1 + u32 for u ≥ 0, the proof of which is left to the reader. Next, we
choose λ = t/(σ 2 + Lt/3). If t ≥ σ + L/3,
it can be checked (again, left to the reader) that
3
with this choice of λ it holds 1 + λ2 t2 ≤ 4, and substituting the value of λ into g(λ) in
(3.21) yields
t2 /2
P[λmax (Sn ) > t] ≤ 4d exp − 2 ;
σ + Lt/3
we conclude by a union bound with a similar control for P[λmax (−S) > t].
37
4 Spectral regularization methods
Many sources on this theme, including [1, 6, 3, 16, 15, 21]. . .
4.1 Setting
In this chapter, we will consider a problem of linear regression with random design where
the covariate X lies in a Hilbert space, of the form
Y = ⟨f ∗ , X⟩ + ξ, (4.1)
These assumptions are quite restrictive and can be significantly weakened in the litera-
ture with the price of a more refined analysis. We will study this setting here for simplicity.
In the analysis to come, a lot of constants will depend on the parameters appearing in
the assumptions (such as κ and M abovel there will be more later.) To avoid a cumbersome
tracking of the effect of the constants, we will often use the notation C▲ to denote a number
implicitly depending on ”less important” parameters in the assumptions. For this section
C▲ will be a positive number only depending on (κ, M ). Note that the value of C▲ might
change in different contexts and even change from line to line!
We first need to introduce some notation which will be the infinite-dimensional analogue
of quantities appearing in traditional linear regression. For this we will need to introduce
the Hilbert space L2 (H, ρ) whose norm (and scalar product) we will denote as ∥·∥ρ resp
⟨·, ·⟩ρ .
38
Proposition/Definition 4.2. Let ρ be a distribution on the Hilbert space H such that
E ∥X∥2 < ∞ (this is implied in particular by Assumption 4.1). Denote S the ”population
evaluation” operator
S : H → L2 (H, ρ), f 7→ ⟨f, .⟩ = [x 7→ ⟨f, x⟩].
This is a Hilbert-Schmidt operator, and its adjoint is given by
S ∗ : L2 (H, ρ) → H, g 7→ E[g(X)X].
Finally, it holds S ∗ S = E[X ⊗ X ∗ ] = Σ, and Σ is a trace-class operator.
Proof. Let (ei )i∈I be an orthonormal basis of H. Then we have
X X
∥Sei ∥2 = E |⟨ei , X⟩|2 = E ∥X∥2 < ∞,
i i
39
Proof. Just note
1 X 1 X
Sf,
b u
n
= ⟨f, Xi ⟩ui = f, u i Xi ,
n n
i∈JnK i∈JnK
and X
∗ 1 X 1 ∗
S Sf =
b Xi ⟨f, Xi ⟩ = Xi ⊗ Xi f.
n n
i∈JnK i∈JnK
∗
Finally, let us define the excess risk we want to control forh an estimator if of f . We
b
will focus here on the quadratic prediction risk R(fb) = E ( fb, X − Y )2 , where the
expectation is over a new, independent example (X, Y ) drawn from P . Consequently, the
excess risk with respect to the optimal prediction f ∗ can be rewritten as
h i
∗ 2 ∗ 2
R(f ) − R(f ) = E ( f , X − Y ) − (⟨f , X⟩ − Y )
b b
h i
= E ( fb − f ∗ , X − ξ)2 − ξ 2
h i
= E ( fb − f ∗ , X )2
= S(fb − f ∗ ), S(fb − f ∗ ) ρ
= (fb − f ∗ ), Σ(fb − f ∗ ) H
1 2
= Σ 2 (fb − f ∗ ) H
(4.2)
b Sb∗ (Sf
fbλ = Fλ (Σ) b ∗ + ξ) = Fλ (Σ) b ∗ + Fλ (Σ)(
b Σf b Sb∗ ξ). (4.3)
b and Sb∗ ξ.
In view of the above, two quantities we wish to have control on are Σ
Let us start with an application of the simple Hoeffding’s inequality:
40
Proposition 4.4. Under Assumption 4.1, for δ ∈ (0, 1) denote Lδ := 1 + log δ −1 , it holds
with probability 1 − δ: √
4M κ L
Sb∗ ξ ≤ √ δ. (4.4)
n
Also, with probability 1 − δ:
√
2κ2 Lδ
b −Σ
Σ op
≤ Σ
b −Σ
2
≤ √ . (4.5)
n
Applying Hoeffding’s inequality in the Hilbert space HS(H) yields the second claim.
We turn to applications of Bernstein’s inequality. To better exploit it, we will consider
a ”warped” version of the quantities of interest. The following quantity will play an
important role:
Definition 4.5. In the context of Assumption 4.1, introduce and denote for λ > 0:
Σλ := (Σ + λI),
and
X λk
N (λ) := Tr ΣΣ−1
λ = , (4.6)
k≥1
λk + λ
41
Proof. For the first two cases, we’ll apply Pinelis’ inequality in a Hilbert space (Cor. 3.10
with Ψ(x) = ∥x∥ in a Hilbert space). For the first one, note that
−1 1 X −1
Σλ 2 Sb∗ ξ = Zi , Zi := Σλ 2 Xi ξi .
n
i∈JnK
−1 1 √
Since Σλ 2 ≤ λ− 2 , we have ∥Zi ∥ ≤ 2M κ/ λ. Moreover,
op
− 21
h i
2 2 2
E ∥Zi ∥ ≤ 4M E Σλ Xi
−1 −1
h i
= 4M 2 E Tr Σλ 2 Xi ⊗ (Σλ 2 Xi )∗
− 21 − 12
h i
2 ∗
= 4M E Tr Σλ (Xi ⊗ Xi )Σλ
−1 −1
= 4M 2 Tr(Σλ 2 E[Xi ⊗ Xi∗ ]Σλ 2 )
= 4M 2 N (λ).
−1 −1
= n1 i∈JnK Ai , with Ai = Σλ 2 ((Xi ⊗ Xi∗ ) − E[Xi ⊗ Xi∗ ]).
P
For the second one, we have Σλ 2 (Σ−Σ)
b
It holds
−1 −1 1
∥Ai ∥2 ≤ 2 Σλ 2 (Xi ⊗ Xi∗ ) 2 ≤ 2 Σλ 2 op ∥Xi ⊗ Xi∗ ∥2 ≤ 2λ− 2 κ2 ;
2
2 − 12 ∗
E ∥Ai ∥2 ≤ E Σλ (Xi ⊗ Xi )
2
−1
∗
= E Tr (Xi ⊗ Xi )Σλ (Xi ⊗ Xi∗ )
h i
≤ E ∥Xi ⊗ Xi∗ ∥op Tr Σ−1 λ (X i ⊗ X ∗
i )
≤ κ2 N (λ).
For the last claim, we will apply the operator Bernstein’s inequality (Theorem 3.14). The
estimates are similar to the above, now we consider a sum of i.i.d. self-adjoint random
−1 −1
operators having the form Bi := Σλ 2 (Xi ⊗ Xi − E[Xi ⊗ Xi∗ ])Σλ 2 . It holds ∥Bi ∥op ≤ 2κ2 /λ,
and due to E[(M − E(M ))2 ] ⪯ E[M 2 ] for a self-adjoint operator-valued random variable
M (s.t. M 2 is Bocher-integrable), it holds
−1 − 12
E Bi2 ⪯ Σλ 2 E (Xi ⊗ Xi∗ )Σ−1 ∗
λ (X i ⊗ X i ) Σλ ,
κ2
(Xi ⊗ Xi∗ )Σ−1 ∗ −1 ∗
λ (Xi ⊗ Xi ) = Xi , Σλ Xi Xi ⊗ Xi ⪯ (Xi ⊗ Xi∗ ).
λ
42
Using this into the previous display, and positivity of expectation, we obtain
κ2 − 1 −1
E Bi2 ⪯ Σλ 2 ΣΣλ 2 := V.
λ
It holds
κ2 − 12 −1 κ2
∥V ∥op ≤ Σλ ΣΣλ 2 ≤ .
λ op λ
Furthermore, since intrinsic dimension is invariant by rescaling, we have
1
−1
−
Tr Σλ 2 ΣΣλ 2
intdim(V ) = − 12 − 12
≤ 2N (λ),
Σλ ΣΣλ op
−1 −1
where for the last inequality we used that if λ1 = ∥Σ∥op , then Σλ 2 ΣΣλ 2 op
= λ1 /(λ1 +
λ) ≥ 1/2 provided λ ≤ λ1 .
The following corollary of (4.9) is extremely important and useful.
Corollary 4.7. Under Assumption 4.1, for λ ∈ (0, ∥Σ∥op ), and δ ∈ (0, 1), provided
log(2N (λ)) + Lδ
n ≥ C▲ A (4.10)
λ
for some A ≥ 1, then with probability at least (1 − δ) it holds simultaneously
− 1 12 2 C▲
Σλ 2 Σ b
λ ≤1+ √ ; (4.11)
op A
1 1 2 C▲
Σb− 2 Σ 2 ≤1+ √ . (4.12)
λ λ
op A
Proof. Provided C▲ is chosen large enough in condition (4.10), we have from (4.9) (with
probability at least 1 − δ):
−1 1
b − Σ Σ− 2 C▲′
Σλ 2 Σ λ ≤ √ .
op A
(we can assume C▲′ ≤ 21 provided C▲ is chosen large enough in condition (4.10)).
This immediately implies the first claim, since
− 1 12 2 − 12
b λ Σ− 2
1
− 12 1
b − Σ)Σ− 2 + I C▲′
Σλ 2 Σ
b
λ = Σ λ Σ λ = Σλ (Σ λ ≤ 1 + √ .
op op op A
For the second claim, we have
′
−1
C▲′′
− 12 1 2 − 12 − 21 −1 C
1 1
−1 ▲
Σ
b Σ2
λ = Σ 2Σb Σ2
λ = I − Σ λ (Σ − Σ)Σ
b
λ ≤ 1 − √ = 1+ √ ,
op op op A A
√
where we have used C▲′ / A < 12 . The second equality above is due to (for two invertible
self-adjoint operators A, B)
1 1 1 1 1 1
I − A− 2 (A − B)A− 2 = A− 2 BA− 2 = (A 2 B −1 A 2 )−1 .
43
4.3 Analysis of spectral regularization methods
As announced earlier, we will study estimates of the form
b Sb∗ Y ,
fbλ = Fλ (Σ) (4.13)
where Fλ : R+ → R+ is a ”regularized inverse” function depending on a regularization
parameter λ > 0.
We will study the statistical properties of this type of algorithms under somewhat
”generic” conditions for the family Fλ . These conditions are meant to allow for a large
variety of different methods and algorithms in practice. We defer precise examples to a
later section.
Assumption 4.8. For the family of functions Fλ : [0, κ2 ] → R+ defined for λ ∈ [0, κ2 ], is
said to be a regularization (or filter) function of qualification q > 0 if there exist positive
constants D, E, such that for all λ ∈ [0, κ2 ] and t ∈ [0, κ2 ], it holds:
Fλ (t) ≤ E min(λ−1 , t−1 ); (4.14)
q
λ
|1 − tFλ (t)| ≤ D . (4.15)
t
The following useful estimates follow directly:
Lemma 4.9. Under Assumption 4.8, the following holds true for all λ ∈ [0, κ2 ] and t ∈
[0, κ2 ]:
for all β ∈ [0, 1] : Fλ (t)tβ ≤ Eλβ−1 ; (4.16)
for all γ ∈ [0, q] : |1 − tFλ (t)|tγ ≤ D′ λγ , (4.17)
where E ′ = max(D, 1 + E).
Proof. For any λ ∈ [0, κ2 ] and t ∈ [0, κ2 ], and β ∈ [0, 1], it holds using (4.14):
Fλ (t)tβ = (Fλ (t))1−β (tFλ (t))β ≤ (E/λ)1−β E β ≤ Eλβ−1 .
Furthermore, for any γ ∈ [0, q], using (4.14) and (4.15):
γ γ γ γ
|1 − tFλ (t)|tγ = (|1 − tFλ (t)|tq ) q |1 − tFλ (t)|1− q ≤ D q (1 + E)1− q λγ .
The second type of assumption we will make concerns the ”regularity” of the target
function f ∗ , expressed in the ”scale” of the second moment operator Σ.
Assumption 4.10. Under the notation of Assumption 4.1, we say that the target f ∗ has a
Hölder source regularity condition of order r ≥ 0 if it can be written under the form
f ∗ = Σr g0 , (4.18)
for some g0 ∈ H.
44
Observe that since Σ is not invertible, the image of Σ is not H, and thus this condition
is not trivial; not every element of H has a non-trivial (r > 0) source regularity condition.
The higher r, the most restrictive the condition, and hence the higher ”regularity” the
target function has. Note that there exist more general source conditions in the literature
that use functions of Σ different from (fractional) powers, but the Hölder source condition
(using powers of Σ) is the most classical, and is the only type we will consider.
From now on, the “generic number” C▲ will be allowed to depend on κ, M , the constants
(E, F ) from Assumption 4.8 as well as (r, ∥g0 ∥) from Assumption 4.10. In fact, it is
probably more enlightening to say that C▲ indicates a factor that does not depend on n, λ
or δ. Remember that the value of C▲ might change from one line to the other.
Corollary 4.12. Under the same assumptions as Proposition 4.11, assume additionally
that the ordered eigenvalues of Σ satisfy
λk (Σ) ≤ ck −α , (4.20)
for some constants c > 0 and α > 1. Put β := α r + 12 . Then, choosing the regularization
constant α
λn = n− 2β+1 , (4.21)
for fixed δ ∈ (0, 1), for n big enough it holds with probability at least 1 − δ:
2β
R(fbλn ) − R(f ∗ ) ≤ C▲ n− 2β+1 , (4.22)
where C▲ depends on (c, α) in addition to the constants appearing in the other assumptions.
Proof. Let us derive a rough estimate of the effective dimension N (λ) in this case. Denote
kλ∗ = mink≥1:λk ≤λ . Then λk /(λk + λ) ≤ 21 for k ≤ kλ∗ ; and the assumption (4.20) implies
45
−1
that kλ∗ ≤ (λ/c)−α . Thus
X λk X λk X λk
N (λ) = = +
k≥1
λ + λk 1≤k<k∗ λk + λ k≥k∗ λk + λ
λ λ
1 X
≤ kλ∗ + λ−1 c k −α
2 k≥kλ∗
Z
1
≤ kλ∗ + cλ−1 t−α dt
2 t≥kλ∗
1 ∗ 1
≤ kλ + λ−1 (kλ∗ )1−α
2 α−1
−1
≤ C(c, α)λ−α .
It can then be checked that the choice (4.21) for λn , which balances the two terms for the
obtained risk bound (4.19), leads to (4.22).
Proof of Prop. 4.11. We will assume for this proof that the probabilistic inequalities of
Proposition 4.6 are satisfied, as well as those of Corollary 4.7. By the assumptions made,
the required conditions for Corollary 4.7, namely λ ≤ ∥Σ∥op ≤ κ2 and (4.10) are satisfied.
Note that we also implicitly use a union bound to get simultaneously the controls of
Proposition 4.6, but this amounts to replace Lδ by Lδ/c ≤ C▲ Lδ for a finite number
of events c to apply the union bound over, and this can be included in the numerical
constants.
We recall the starting decomposition 4.3 coming from model (4.1) and the definition of
the estimator:
b Sb∗ (Sf
fbλ = Fλ (Σ) b ∗ + ξ) = Fλ (Σ) b ∗ + Fλ (Σ)(
b Σf b Sb∗ ξ), (4.23)
hence the quantity we want to analyze for the control of the excess risk (4.2) is
1 1 1
Σ 2 (fbλ − f ∗ ) = Σ 2 (Fλ (Σ)
b Σb − I)f ∗ + Σ 2 Fλ (Σ)(
b Sb∗ ξ). (4.24)
We will control the two terms above, starting with the second one, “noise”, term. It holds
−1 1
−1 1 1 1 1
−1
1 1
b Sb∗ ξ) ≤ Σ 2 Σ 2
Σ 2 Fλ (Σ)( Σλ2 Σ
b 2 Σ
b 2 Fλ (Σ)
b Σb2 b− 2 Σ 2
Σ Σλ 2 Sb∗ ξ .
λ λ λ λ λ λ
op op op op
The first factor is bounded by 1. The second and fourth factors are bounded by a number
C▲ (with high probability) due to Corollary 4.7. The last factor is bounded using (4.7).
As for the the third factor, it holds
1 1
Σ
b 2 Fλ (Σ)
λ
b Σb2
λ ≤ sup Fλ (t)(t + λ)
op t∈[0,κ2 ]
≤ 2E,
46
using (4.14). In the end, we get
r ! r
1
b∗ Lδ N (λ) Lδ Lδ N (λ)
b S ξ) ≤ C▲
Σ Fλ (Σ)(
2 + √ ≤ C▲′ . (4.25)
n n λ n
For the last inequality, we used condition (4.10) which implies (noting that λ ≤ ∥Σ∥op
implies N (λ) ≥ 12 ):
log(2N (λ)) + Lδ Lδ
n ≥ C▲ ≥ C▲ ,
λ λ
so that √ r
Lδ Lδ Lδ N (δ)
√ ≤ C▲ √ ≤ C▲′ .
n λ n n
Let us turn to the first, ”approximation”, term in (4.24). We use the assumed source
condition (4.18) and start similarly as above; we denote Rλ (t) := (tFλ (t) − 1):
1 1
−1 1
1
Σ 2 (Fλ (Σ)
b Σb − I)f ∗ ≤ Σ 12 Σ− 2 Σλ2 Σ
b 2 b 2 Rλ (Σ)Σ
Σ b 12 Rλ (Σ)Σ
b r ∥g0 ∥ ≤ C▲ Σ b r .
λ λ λ
op op op
(4.26)
We will distinguish two cases: first, if r ≤ 21 , we will use the Cordes inequality ∥As B s ∥op ≤
∥AB∥sop if A,B, are self-adjoint and s ∈ [0, 1] to obtain
1 1 1 1 2r
Σ b r
b 2 Rλ (Σ)Σ ≤ Σ
b 2 Rλ (Σ)
b Σbr b− 2 Σ 2
Σ Σ−r r
λ λ λ λ λ λ Σ op
op op op
The last factor is bounded by 1 as before, the second by a number C▲ (with high proba-
bility) due to Corollary 4.7, and the first by
1
br r+ 21
b 2 Rλ (Σ)
Σ λ
b Σ
λ ≤ sup Rλ (t)(t + λ)
op t∈[0,κ2 ]
1 1
≤ sup Rλ (t)(tr+ 2 + λr+ 2 )
t∈[0,κ2 ]
1
≤ 2D′ λr+ 2 ,
where we have used r + 21 ≤ 1 for the second inequality, and property (4.17) for the last
(under the assumption r + 12 ≤ q).
If r ≥ 12 , we modify the argument:
1 1
b 2 Rλ (Σ)Σ
Σ λ
b r ≤ Σ b r
b 2 Rλ (Σ)Σ
λ λ Σ−r
λ Σ
r
op op | {z }
≤1
1 1
≤ Σ
b Rλ (Σ)
b Σ
2 b rλ + Σ
b 2 Rλ (Σ)
b Σrλ − Σ
b rλ
λ λ
op op op
2(r−1)+
√
1
′ 12 C▲ r(2κ) (2κ2 ) Lδ
≤ 2D′ λr+ 2 + 2D λ √ .
λ(1−r)+ n
47
To justify the last inequality, we use the HS-norm control (4.5) together with the Hilbert-
Schmidt Lipschitz perturbation inequality (2.15), for the function φr : x 7→ xr on the
interval [λ, 2κ2 ] containing the spectrum of both Σλ and Σ b λ (remember we assume λ ≤ κ2 ).
On this interval the function φr is rλr−1 -Lipschitz if r ≤ 1, and r(2κ)r−1 -Lipschitz if r ≥ 1.
Summing up the last computations into (4.26) and wrapping various factors into the generic
constant, we get
1
√ √ !
1 1
b − I)f ∗ ≤ C▲ λr+ 2 + 1 r ≥ 2
λ Lδ
Σ 2 (Fλ (Σ)
b Σ √ . (4.27)
λ(1−r)+ n
Plugging in (4.25) and (4.27) into (4.24), we thus obtain the risk bound holding with high
probability (using Lδ ≥ 1 to pull it out as a factor up to changes in the front constant):
r 1
√ !
1 N (λ) 1 1 r ≥ λ p
Σ 2 (fb − f ∗ ) H ≤ C▲ + λr+ 2 + (1−r)+2 √ Lδ . (4.28)
n λ n
Let us finally clean up the above expression by noticing that the third term is upper
bounded by the first up to a C▲ -factor, since, for r ≥ 12 :
√
λ min( 12 ,r− 12 )
p
= λ ≤ max(1, κ) ≤ max(1, κ) 2N (λ).
λ(1−r)+
This finally implies the announced estimate (4.19).
4.4 Examples
In this section we give a few examples of classical regularization functions and check that
they satisfy the conditions of Assumption 4.8. Most of these examples come from the
theory of inverse problems [9].
Spectral cut-off. The spectral cut-off (or truncated singular value decomposition,
TSVD) regularization function is given by Fλ (t) = 1{t ≥ λ}/t. In words, once applied to
a self-adjoint operator, this regularization function projects this operator onto the sum of
eigenspaces for eigenvalues less than λ, and takes its Moore-Penrose pseudoinverse. It is
immediate to check that Fλ (t) ≤ t−1 , Fλ (t) ≤ λ−1 , and |1 − tFλ (t)| = 1{t < λ} ≤ (λ/t)q ,
for any q ≥ 0. Therefore, the conditions of Assumption 4.8 are satisfied for E = D = 1
and any q > 0 we can say that this regularization has ”infinite qualification”.
Having infinite qualification sounds like a very desirable property, since it can adapt to
an arbitrarily regular source condition. However, eigendecomposition truncation is difficult
in practice since it requires to compute the eigendecomposition of Σ.b Furthermore, in prac-
tice somewhat more ”smooth” regularization functions turn out to have better behavior.
Ridge regression/Tikhonov regularization. The ridge regression regularizer, also
known as Tikhonov regularization, is given by Fλ (t) = (t + λ)−1 . It is easy to check that:
1 1 1 t λ λ
≤ min , ; 1− = ≤ ,
λ+t λ t λ+t λ+t t
48
hence the conditions of Assumption 4.8 are satisfied for E = D = 1 and qualification q = 1.
On the other hand, it can be check that qualification higher than 1 does not hold.
Iterated ridge regression/Tikhonov. To compensate for the limited qualification
of the standard ridge regression, it can be proposed to iterate it by applying it (with the
the same regularization parameter λ) recursively to the residuals. The following formulas
can be easily shown by recursion for m-times iteration:
m
λi−1 λm
(m)
X 1
Fλ (t) = = 1−
i=1
(λ + t)i t (λ + t)m
(m) (m) λm
(residuals)Rλ (t) = 1 − tFλ (t) = .
(λ + t)m
hence the conditions of Assumption 4.8 are satisfied for E = m, D = 1 and qualification
q = m.
Gradient descent/Landweber iteration. Consider the gradient method based on
the quadratic loss objective function
n
b )= 1
X 2
L(f (⟨f, Xi ⟩ − Yi )2 = Sf
b −Y ,
n i=1 n
Thus, if the estimate after k gradient iterations (with fixed stepsize η) has the form fbk =
b Sb∗ Y , the next gradient step is η(I − ΣF
Fk (Σ) b Sb∗ Y . Therefore, the k-th step of
b k (Σ))
gradient descent take the form of a regularization function Fk (t) satisfying the recursion
If t ∈ [0, κ2 ] and η ∈ (0, κ−2 ) then ηt < 1 and we have Fk (t) ≤ 1/t and, using (1 − x)k ≥
1 − kx for x ≤ 1 by convexity,
1
Fk (t) ≤ kηt ≤ kη.
t
49
Thus, is we define the equivalent regularization parameter λk := (ηk)−1 , the first part of
Assumption 4.8 is satisfied (for E = 1) and it holds for any q > 0:
q q
k 1 λ
|1 − tFk (t)| = (1 − ηt) ≤ exp(−kηt) ≤ cq = cq ,
kηt t
where cq = (q/e)q . For the last inequality we used the elementary fact that maxu>0 log u − uq =
log q−1, so that exp(−u) ≤ (q/e)q u−q . Hence the second part of Assumption 4.8 is satisfied
for any q > 0 with D = cq .
To summarize, gradient descent (for the quadratic risk) with fixed stepsize acts a regu-
lazation if if is stopped early at step k, provided the stopping time is chosen in accordance
with the target function’s source regularity.
50
5 Reproducing kernel methods
5.1 Reproducing kernel Hilbert spaces
We quickly review some important facts about the reproducing kernel Hilbert space methodology in data science. Initially considered in the context of spline methods in the 1970s
[27], where what we now call "kernel ridge regression" was already introduced, these methods
enjoyed an important resurgence in the 2000s due to their versatility for applications of
machine learning, in particular Support Vector Machines. While they have since been
outperformed by modern deep learning methods, they remain an important tool to understand and analyze machine learning methods (see e.g. the works on the "Neural Tangent
Kernel" [12]).
A (reproducing) kernel Hilbert space (rkHs) can be defined in several equivalent ways
(see [2], Chap. 5, for instance).
Definition 5.1 (kernel Hilbert space, abstract version). Given a base space X, a kernel
Hilbert space over X is a R- or C-Hilbert space H together with a "feature" mapping Φ :
X → H. The associated kernel is the function k defined as
k : X × X → R or C : (x, y) 7→ ⟨Φ(x), Φ(y)⟩. (5.1)
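As a simple illustration of Definition 5.1 (a sketch not taken from these notes), take $H = \mathbb{R}^3$ with the polynomial feature map $\Phi(x) = (1, \sqrt{2}\,x, x^2)$ on $X = \mathbb{R}$; the associated kernel is then $k(x,y) = \langle\Phi(x),\Phi(y)\rangle = (1+xy)^2$.

```python
import numpy as np

def phi(x):
    """Illustrative feature map X = R -> H = R^3."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def k(x, y):
    """Associated kernel: <phi(x), phi(y)> = (1 + x*y)**2."""
    return phi(x) @ phi(y)

x, y = 0.7, -1.3
assert np.isclose(k(x, y), (1.0 + x * y) ** 2)
```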
Proposition/Definition 5.2 (rkHs, functional space version). Given a base space X, a
reproducing kernel Hilbert space over X is a R- or C-Hilbert space H whose elements are R-
or C-valued functions on X, together with a "feature" mapping Φ : X → H such that the
following property holds:
∀f ∈ H, x ∈ X : f (x) = ⟨f, Φ(x)⟩. (5.2)
The associated kernel is the function k defined as (5.1).
As a consequence of (5.2), it can be checked that Φ must in fact be the mapping
∀x ∈ X : Φ(x) = k(x, ·) = (y 7→ k(x, y)), (5.3)
implying in particular that all functions k(x, ·) must belong to H.
The proof of (5.3) is left to the reader. The property (5.2) is called the "self-reproducing
property" because, when applied to the functions Φ(x) = k(x, ·) and Φ(y) = k(y, ·), it yields
⟨k(x, ·), k(y, ·)⟩ = k(x, y).
In this sense the term "reproducing" only makes sense in the functional space version; thus,
when referring to a rkHs, one always implicitly assumes the functional space version.
The following equivalent definition of the functional space version is sometimes useful:
Property 5.3. Let H be a R- or C-Hilbert space whose elements are R- or C-valued
functions on X, and such that for any x ∈ X the (linear) evaluation functional at point
x, f ∈ H 7→ f(x), is continuous. Then H is a rkHs (functional space version).
Proof. By Riesz’ theorem, since for any x ∈ X the evaluation mapping f ∈ H 7→ f (x) is
continuous, there exists an element Φ(x) such that f (x) = ⟨f, Φ(x)⟩, i.e. (5.2) is satisfied.
This Hilbert space Hk is the canonical (functional space) rkHS associated with the psd
kernel k. For any rkHs (abstract version) H◦ on X with feature mapping Φ◦ and kernel k,
the mapping ξ : u 7→ (x 7→ ⟨u, Φ◦ (x)⟩◦ ) maps H◦ onto Hk and satisfies ξ ◦ Φ◦ (x) = k(x, ·)
i.e. ξ ◦ Φ◦ is the canonical feature mapping from X to Hk .
For a proof see e.g. [24, 2].
Observe that if H is a Hilbert space, we can see its dual H∗ as a rkHs on H, with
the canonical mapping f ∈ H 7→ f ∗ ∈ H∗ and the ”linear” kernel k(f, g) = ⟨f, g⟩. This
somewhat convoluted way to present a Hilbert space is merely to remark that the case
of the linear regression model with covariate in a Hilbert space (4.1) considered in the
previous chapter can be cast into the framework considered here.
To come back to our purposes, given a rkHs H with kernel k and feature mapping Φ
on X, and data $(X_i, Y_i)_{i\in\llbracket n\rrbracket}$ taking values in X × R, we intend to apply the results of
the previous chapter to the mapped data $(\widetilde X_i, Y_i) \in H \times \mathbb{R}$, where $\widetilde X = \Phi(X)$.
The following property is immediate and useful; it relates boundedness of the kernel
to boundedness of the Hilbert-valued covariate $\widetilde X$:
Lemma 5.5. If k is a psd kernel on a space X , and if supx∈X k(x, x) = κ2 < ∞, then
for any rkHS with kernel k over X with feature mapping Φ, it holds ∥Φ(x)∥ ≤ κ for any
x ∈ X . Furthermore, it holds |k(x, y)| ≤ κ2 for any (x, y) ∈ X 2 .
The proof is almost tautological and left to the reader (for the second part, use the
Cauchy-Schwartz inequality).
Proposition 5.6. Let H be a rkHs over X with kernel k and canonical feature mapping
Φ(x) = k(x, ·). Let ρ be a probability distribution over X such that Eρ [k(X, X)] < ∞ (for
instance, this is the case under the assumption supx∈X k(x, x) = κ2 < ∞).
Then we generalize the different operators appearing in Propositions 4.2 and 4.3 as
follows:
• The operator $S^*S = \mathbb{E}[k(X,\cdot)\otimes k(X,\cdot)^*]$, which is a trace-class operator, is also given
by the kernel integral operator (5.4), but as an operator from H to H; the operator
$SS^*$ is again given by the same formula, but as an operator from $L^2(\mathcal X,\rho)$ to itself.
• It holds $\widehat S^*\widehat S = \widehat\Sigma = \frac{1}{n}\sum_{i=1}^n k(X_i,\cdot)\otimes k(X_i,\cdot)^*$, while $\widehat S\widehat S^* : \mathbb{R}^n \to \mathbb{R}^n$ is the normalized
Gram kernel matrix $\widehat G$ given by
$$\widehat G_{i,j} = \frac{1}{n}\,k(X_i, X_j), \qquad (i,j)\in\llbracket n\rrbracket^2.$$
Proof. For the first point, we must check that S is a well-defined, Hilbert-Schmidt operator.
First, under the assumption, any element f ∈ H is (as a function over X) square-integrable
with respect to ρ, since by the reproducing property and Cauchy-Schwarz:
$$\mathbb{E}_\rho\big[|f(X)|^2\big] = \mathbb{E}_\rho\big[|\langle f, k(X,\cdot)\rangle_H|^2\big] \le \mathbb{E}_\rho\big[\|f\|_H^2\,\|k(X,\cdot)\|_H^2\big] = \|f\|_H^2\,\mathbb{E}_\rho[k(X,X)] < \infty.$$
Moreover, if $(e_j)_j$ is an orthonormal basis of H, then $\sum_j \|Se_j\|_{L^2(\mathcal X,\rho)}^2 = \sum_j \mathbb{E}_\rho\big[|\langle e_j, k(X,\cdot)\rangle_H|^2\big] = \mathbb{E}_\rho\big[\|k(X,\cdot)\|_H^2\big] = \mathbb{E}_\rho[k(X,X)] < \infty$, establishing that S is Hilbert-Schmidt.
For the second point, we first check that for f ∈ L2 (X , ρ), the variable Z(X) =
f (X)k(X, ·) is Bochner-integrable in H (it does take its values in H, since for fixed X
it is a multiple of k(X, ·) ∈ H.) We have by the Cauchy-Schwarz inequality:
$$\mathbb{E}_\rho\big[\|f(X)k(X,\cdot)\|\big] = \mathbb{E}_\rho\big[|f(X)|\,\|k(X,\cdot)\|\big] \le \mathbb{E}_\rho\big[|f(X)|^2\big]^{\frac12}\,\mathbb{E}_\rho\big[\|k(X,\cdot)\|^2\big]^{\frac12} = \|f\|_{L^2(\mathcal X,\rho)}\,\mathbb{E}_\rho[k(X,X)]^{\frac12} < \infty,$$
so that $\mathbb{E}_\rho[f(X)k(X,\cdot)]$ is a well-defined element of H. Moreover, for f ∈ H and g ∈ $L^2(\mathcal X,\rho)$, the reproducing property yields
$$\langle Sf, g\rangle_{L^2(\mathcal X,\rho)} = \mathbb{E}_\rho[f(X)g(X)] = \mathbb{E}_\rho\big[\langle f, k(X,\cdot)\rangle_H\, g(X)\big] = \big\langle f, \mathbb{E}_\rho[k(X,\cdot)g(X)]\big\rangle_H,$$
leading to the announced formula for S ∗ . The rest is left to the reader.
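The duality between $\widehat S^*\widehat S = \widehat\Sigma$ and $\widehat S\widehat S^* = \widehat G$ can be checked numerically. The following sketch uses an explicit finite-dimensional feature map (an illustrative choice, not from these notes), so that the covariance operator is an ordinary matrix, and verifies that its nonzero eigenvalues coincide with those of the normalized Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.sort(rng.uniform(-1.0, 1.0, size=n))

def phi(x):
    """Illustrative finite-dimensional feature map (monomials up to degree 3)."""
    return np.array([1.0, x, x ** 2, x ** 3])

Phi = np.stack([phi(x) for x in X])     # n x d feature matrix
Sigma_hat = Phi.T @ Phi / n             # empirical covariance operator S^* S (d x d)
G_hat = Phi @ Phi.T / n                 # normalized Gram matrix S S^*, G_ij = k(X_i, X_j)/n

# The nonzero spectra of S^* S and S S^* coincide.
eig_Sigma = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]
eig_G = np.sort(np.linalg.eigvalsh(G_hat))[::-1]
assert np.allclose(eig_Sigma, eig_G[:4], atol=1e-8)
```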
Again, the "abstract" setting of the previous chapter can be recovered if we directly assume
random data taking values in a Hilbert space H, the linear kernel, and the rkHs given
by the dual H∗.
Conversely, if we have a rkHs over X with a feature mapping $\widetilde X = \Phi(X)$, we can
"forget" the original covariate space X and its rkHs and see the problem only in terms of
$\widetilde X$ and the setting of the previous chapter. From the point of view of the statistical
analysis, this does not change the arguments. But the rkHs view is richer in the sense that it
describes the model in terms of functions of the original covariate X, which is more concrete
than its mapped version $\widetilde X$.
In particular, the linear regression in Hilbert space model (4.1) becomes, in the rkHs
setting and in terms of the original covariate X ∈ X :
Y = f ∗ (X) + ξ, (5.6)
Computing this "abstract" estimator seems impossible in practice, since we apparently need
to manipulate infinite-dimensional vectors and operators in a Hilbert space. However,
thanks to the shift formula (2.8) we can rewrite the above as
$$\widehat f_\lambda = F_\lambda(\widehat\Sigma)\,\widehat S^*Y = \widehat S^*\,F_\lambda(\widehat G)\,Y. \qquad(5.8)$$
The benefit of this rewriting is that since $\widehat G$ is a finite $n\times n$ matrix, and $Y$ an $n$-dimensional vector, we can (at least in principle) numerically compute the $n$-vector of
coefficients $\widehat\alpha = F_\lambda(\widehat G)Y$. Using (5.5), the estimated function is then
$$\widehat f_\lambda = \frac{1}{n}\sum_{i\in\llbracket n\rrbracket}\widehat\alpha_i\,k(X_i,\cdot).$$
Hence, given the vector of coefficients $\widehat\alpha$ and continued access to the training points $(X_i)$,
we can easily compute the prediction $\widehat f_\lambda(x)$ at any test point $x$.
To sum up, we can use the equivalent representation (5.8) for actual numerical computation of the estimated function at any point, while we use the "abstract" operator representation
for the statistical analysis of the estimator, for which the entirety of the arguments from
the previous chapter applies.
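As an illustration, here is a minimal numerical sketch of the representation (5.8) for the Tikhonov filter $F_\lambda(t) = (t+\lambda)^{-1}$ (i.e. kernel ridge regression); the Gaussian kernel, the data and all parameter values are illustrative choices, not prescribed by these notes.

```python
import numpy as np

def gauss_kernel(x, y, bw=0.3):
    """Illustrative kernel choice: Gaussian kernel with bandwidth bw."""
    return np.exp(-((x - y) ** 2) / (2 * bw ** 2))

rng = np.random.default_rng(1)
n = 50
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

# Normalized Gram matrix G_ij = k(X_i, X_j)/n and coefficients alpha = F_lambda(G) Y.
G = gauss_kernel(X[:, None], X[None, :]) / n
lam = 1e-3
alpha = np.linalg.solve(G + lam * np.eye(n), Y)   # Tikhonov: F_lambda(G) = (G + lam I)^{-1}

def f_hat(x):
    """Estimated function f_hat(x) = (1/n) * sum_i alpha_i * k(X_i, x)."""
    return gauss_kernel(X, x) @ alpha / n

print(f_hat(0.25), np.sin(2 * np.pi * 0.25))
```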
A worked example. As an additional step towards interpreting the statistical
results in this setting, let us look at the kind of "regularity" that the source assumption 4.10
entails. We will look at a deliberately simplified illustrative example.
Let X be the interval [0, 2π] (seen as the unit circle) and assume the covariate distri-
bution ρ is the uniform distribution on X . Furthermore, assume we consider a kernel k on
X of the form k(x, y) = F (x − y), where F is a function X → R. We assume F can be
written as a Fourier series
$$F(t) = a_0 + \sum_{k\ge 1} 2a_k\cos(kt),$$
with positive, summable coefficients $(a_k)_{k\in\mathbb{N}}$ (the factor 2 for $k \ge 1$ is introduced to simplify
things later on).
Let us first justify that k is a psd kernel. We have
$$k(x,y) = F(x-y) = a_0 + \sum_{k\ge 1}2a_k\big(\cos(kx)\cos(ky) + \sin(kx)\sin(ky)\big) = \langle\Phi(x),\Phi(y)\rangle_{\ell^2},$$
where $\Phi(x) = \big(\sqrt{a_0},\ \sqrt{2a_1}\cos(x),\ \sqrt{2a_1}\sin(x),\ \ldots,\ \sqrt{2a_k}\cos(kx),\ \sqrt{2a_k}\sin(kx),\ \ldots\big)\in\ell^2(\mathbb{N})$.
Through this explicit representation in the (real) Hilbert space $\ell^2(\mathbb{N})$, via Definition 5.1 it
is indeed checked that k is a psd kernel on X.
Moreover, via Proposition 5.6 and in particular formula (5.4), we see that if f is a
function on X with Fourier expansion
$$f(x) = f_0 + \sum_{k\ge 1}\big(f_k^c\cos(kx) + f_k^s\sin(kx)\big), \qquad(5.9)$$
then
$$S^*f(x) = a_0 f_0 + \sum_{k\ge 1}\big(a_k f_k^c\cos(kx) + a_k f_k^s\sin(kx)\big).$$
In particular, we see that $S^*(x\mapsto\cos(kx)) = a_k\,(x\mapsto\cos(kx))$; thus all trigonometric
functions are elements of H, and are in fact eigenfunctions of the operator $\Sigma = S^*S$ with
corresponding eigenvalues $(\lambda_k = a_k)_{k\ge 0}$.
Due to the fact that $S(S^*S)^{-\frac12}$ is a partial isometry (this holds in general, as seen
from the singular value decomposition), whose range is dense in $L^2([0,2\pi])$, we can deduce
that if f is a function with Fourier decomposition (5.9), then f is an element of H iff
$\sum_{k\ge 0} a_k^{-1}\big((f_k^c)^2 + (f_k^s)^2\big) < \infty$, and in fact we have
$$\|f\|_H^2 = a_0^{-1}f_0^2 + \sum_{k\ge 1} a_k^{-1}\big((f_k^c)^2 + (f_k^s)^2\big).$$
From this, we understand precisely the nature of the functions in the rkHs, which have
roughly speaking half the regularity of the kernel function F . For instance, if ak ∝ k −α
with α > 1, then the kernel function is α times differentiable (possibly in a fractional
sense), while the rkHs is made of functions that are α/2 times differentiable.
By the same type of arguments, we find that if f is in the range of $\Sigma^r$ then
$$\big\|\Sigma^{-r}f\big\|_H^2 = a_0^{-(1+2r)}f_0^2 + \sum_{k\ge 1} a_k^{-(1+2r)}\big((f_k^c)^2 + (f_k^s)^2\big),$$
so the functions satisfying the Hölder source condition of order r are exactly those such that
$\sum_{k\ge 1} a_k^{-(1+2r)}\big((f_k^c)^2 + (f_k^s)^2\big) < \infty$. Again, if $a_k \propto k^{-\alpha}$, a function satisfies a source condition of order r iff (up to minor caveats) it is $\beta = \alpha(r + \frac12)$ times (fractionally) differentiable
(also called β-Hölder). It is notable that it is only this "intrinsic" regularity parameter
of the target function that drives the convergence rate of the statistical analysis (4.22).
In other words, if we used a different kernel function with a different spectral decay of
order α′ of the associated kernel integral operator, we would get the same convergence rate
for target functions of Hölder regularity β because they would satisfy a different source
regularity condition of order r′ for this kernel, leading to the same convergence rate only
depending on β. (In fact the convergence rate appearing in (4.22) can be shown to be
statistically minimax optimal for functions of that intrinsic regularity). This argument
holds provided α′ < 2β (since we need r′ ≥ 0) and the qualification of the methods is large
enough to cover source regularity r′ .
A conclusion from this example is that it could be preferable to choose a less regular
kernel and use a regularization method with large qualification, because it can adapt to
target functions more regular than the kernel via the source condition, while the converse
is not guaranteed: it is not clear if using a smooth kernel can adapt to irregular functions
(of smoothness less than half of that of the kernel). There are actually results in that
direction, but they are more difficult and require more assumptions.
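The correspondence between eigenvalue decay, rkHs norm and source condition in this example can be explored numerically. The following sketch (with illustrative choices throughout: $a_k \propto k^{-\alpha}$ with $\alpha = 2$, a finite truncation of the Fourier series, and a randomly generated target of the matching smoothness) computes the truncated rkHs norm and the source-condition norm of order r from the Fourier coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, r = 2.0, 0.5          # eigenvalue decay a_k ~ k^{-alpha} and source order r
K = 2000                     # truncation level of the Fourier series (illustration only)

k = np.arange(1, K + 1)
a = k ** (-alpha)            # eigenvalues lambda_k = a_k of Sigma

# Target whose Fourier coefficients decay like k^{-(beta + 1/2)}, i.e. (roughly) beta-smooth,
# with beta = alpha * (r + 1/2); the extra log factor ensures convergence at the boundary.
beta = alpha * (r + 0.5)
decay = k ** (-(beta + 0.5)) / np.log(k + 1.0)
f_c = rng.standard_normal(K) * decay
f_s = rng.standard_normal(K) * decay

rkhs_norm2 = np.sum((f_c ** 2 + f_s ** 2) / a)                    # ||f||_H^2 (truncated)
source_norm2 = np.sum((f_c ** 2 + f_s ** 2) / a ** (1 + 2 * r))   # ||Sigma^{-r} f||_H^2

print(f"||f||_H^2 ~= {rkhs_norm2:.3f}, ||Sigma^(-r) f||_H^2 ~= {source_norm2:.3f}")
```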
6 Acceleration methods
In this section we will consider different approaches to speed up the numerical computa-
tion of procedures seen previously, such as the spectral regularization procedures (4.13).
Usually, this is done at the price of some approximation, and it is of interest to analyze if
this can be done while keeping the statistical guarantee on the obtained estimator. Note
that a particularly important computational bottleneck is the computation of the regularized inverse $F_\lambda(\widehat G)$. Even if, for some regularization methods, this only requires forward matrix
multiplications, we still have to manipulate an n × n matrix, and when the data size n is
large, this can be problematic or time-costly. We would like to find ways to alleviate this
point in particular.
Proposition 6.1. Suppose granted Assumption 4.1, regularization function Assumption 4.8
with qualification q, and Hölder source Assumption 4.18 of order r, such that $q \ge r + \frac12$.
For any $\lambda \in [0, \|\Sigma\|_{\mathrm{op}}]$, consider the "distribute-and-average" estimator
$$\widetilde f_\lambda = \frac{1}{m}\sum_{k\in\llbracket m\rrbracket}\widehat f_\lambda^{(k)},$$
where the $\widehat f_\lambda^{(k)}$ are given by (4.13) applied to each of the m subsamples of size n/m.
Assume that it holds
$$m \le C_\blacktriangle\, n\,\frac{\lambda}{\log(2\mathcal N(\lambda))}, \qquad(6.1)$$
and, if $r > \frac12$:
$$m \le \mathcal N(\lambda)\,\lambda^{-\min(1,2r-1)}. \qquad(6.2)$$
Then, for n large enough, it holds with probability at least $1 - 2/n$:
$$\Big\|\Sigma^{\frac12}\big(\widetilde f_\lambda - f^*\big)\Big\|_H \le C_\blacktriangle\left(\sqrt{\frac{\mathcal N(\lambda)}{n}} + \lambda^{r+\frac12}\right)\log n. \qquad(6.3)$$
This means that we can attain the same statistical convergence rates as in Corollary 4.12 (up to a logarithmic factor in n) in the distributed setting, provided that, for the choice
of regularization parameter given by (4.21), the constraints (6.1) and (6.2) are satisfied.
Concretely, in the same setting as in Corollary 4.12, the first constraint reads as
$$m \le C_\blacktriangle\, n^{1-\frac{\alpha}{2\beta+1}}.$$
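A minimal sketch of the distribute-and-average estimator, built on the Gram-matrix form of kernel ridge regression used earlier (the kernel, the data and the parameter values are again purely illustrative):

```python
import numpy as np

def gauss_kernel(x, y, bw=0.3):
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * bw ** 2))

def krr_fit(X, Y, lam):
    """Kernel ridge regression on one subsample: alpha = (G + lam I)^{-1} Y, G = K/n_k."""
    n_k = len(X)
    G = gauss_kernel(X, X) / n_k
    return np.linalg.solve(G + lam * np.eye(n_k), Y)

def krr_predict(alpha, X_train, x):
    return gauss_kernel(np.atleast_1d(x), X_train) @ alpha / len(X_train)

rng = np.random.default_rng(3)
n, m, lam = 2000, 8, 1e-3                 # m machines, subsamples of size n/m
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

# Fit independently on each subsample, then average the m predictors.
subsets = np.array_split(rng.permutation(n), m)
fits = [(X[idx], krr_fit(X[idx], Y[idx], lam)) for idx in subsets]

def f_tilde(x):
    return np.mean([krr_predict(a, Xk, x) for Xk, a in fits], axis=0)

print(f_tilde(0.25), np.sin(2 * np.pi * 0.25))
```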
We note that the "bias" term $(I)^{(k)}$ has non-zero expectation, which is the same across
machines, while the "noise" term $(II)^{(k)}$ has expectation zero and has independent realizations across machines; we can therefore hope for a noise reduction effect for term $(II)^{(k)}$
when averaging across machines. On the other hand, this effect won't apply to term $(I)^{(k)}$,
which should therefore be uniformly small across machines.
We will need to reiterate the arguments used in the proof of Proposition 4.11 separately
on each of the m machines, each dealing with an independent data set of size N = n/m. For this,
as before we will assume that the probabilistic inequalities of Proposition 4.6 are satisfied,
as well as those of Corollary 4.7, simultaneously for all machines and data subsets. We
therefore use a union bound over machines, which amounts to saying that the statements
we make will hold with probability 1 − mδ rather than 1 − δ. (Thus, for statements with
probability 1 − δ we will need to replace $L_\delta$ by $L_{\delta/m} \le C_\blacktriangle L_\delta + \log(m)$, which we will do at
the end.)
Furthermore, remember that the requirements for these inequalities to hold are $\lambda \in (0, \|\Sigma\|_{\mathrm{op}})$
(which will be satisfied for n large enough, so we won't discuss it in more detail), and, more
importantly, condition (4.10) for a data sample of size N:
$$N \ge C_\blacktriangle\, A\,\frac{\log(2\mathcal N(\lambda)) + L_\delta}{\lambda}. \qquad(6.5)$$
Let us start with term $(I)^{(k)}$; we recall the control obtained in (4.27) for a data sample
of size N:
$$(I)^{(k)} = \Big\|\Sigma^{\frac12}\big(F_\lambda(\widehat\Sigma^{(k)})\widehat\Sigma^{(k)} - I\big)f^*\Big\| \le C_\blacktriangle\left(\lambda^{r+\frac12} + \mathbf{1}\{r\ge\tfrac12\}\,\frac{\sqrt{\lambda}\,\sqrt{L_\delta}}{\lambda^{(1-r)_+}\sqrt{N}}\right), \qquad(6.6)$$
which will hold (with probability $1 - m\delta$) for all machines simultaneously (for the $\widehat\Sigma^{(k)}$
corresponding to their respective data subsamples, $k\in\llbracket m\rrbracket$).
For the "noise" term $(II)^{(k)}$, under the same condition (6.5), for any individual $k\in\llbracket m\rrbracket$
(indicating the subsample) we have with probability $1-\delta$ the control (4.25), which we
rewrite here:
$$(II)^{(k)} = \Big\|\Sigma^{\frac12}F_\lambda(\widehat\Sigma^{(k)})\big(\widehat S_{(k)}^*\xi\big)\Big\| \le C_\blacktriangle\left(\sqrt{\frac{L_\delta\,\mathcal N(\lambda)}{N}} + \frac{L_\delta}{N\sqrt{\lambda}}\right) \le C_\blacktriangle'\,\sqrt{\frac{L_\delta\,\mathcal N(\lambda)}{N}}. \qquad(6.7)$$
Denote $U_{(k)} := \Sigma^{\frac12}F_\lambda(\widehat\Sigma^{(k)})\,\widehat S_{(k)}^*\xi$, so that $(II)^{(k)} = \|U_{(k)}\|$.
Note that the variables $U_{(k)}$ are independent (since $U_{(k)}$ only depends on the subsample k,
and these subsamples are independent), and bounded in norm by $B := C_\blacktriangle'\sqrt{\frac{L_\delta\,\mathcal N(\lambda)}{N}}$ with
high probability $1-\delta$ taken individually, or simultaneously with probability $1-m\delta$. We will
therefore apply Hoeffding's inequality, using the following trick: consider the "truncated"
random variables
$$\widetilde U_{(k)} := \begin{cases} U_{(k)}, & \text{if } \|U_{(k)}\| \le B;\\ 0, & \text{if } \|U_{(k)}\| > B.\end{cases}$$
We will apply Hoeffding's inequality to the family $(\widetilde U_{(k)})_{k\in\llbracket m\rrbracket}$ and argue that with high
probability their sum coincides with that of the $(U_{(k)})_{k\in\llbracket m\rrbracket}$.
First, an annoying point is that the modified variables $\widetilde U_{(k)}$ are not guaranteed to be
centered like the $U_{(k)}$ were. Let us therefore roughly upper bound this discrepancy: since
$\sup_{t\in[0,\kappa^2]}|F_\lambda(t)| \le E/\lambda$ and $\widehat S_{(k)}^*\xi_k$ is an average of variables bounded in norm by $2M\kappa$
(see e.g. the proof of Proposition 4.4), the following rough upper bound holds (always):
$$\|U_{(k)}\| \le \frac{C_\blacktriangle}{\lambda},$$
and therefore
$$\big\|\mathbb{E}\big[\widetilde U_{(k)}\big]\big\| = \big\|\mathbb{E}\big[\widetilde U_{(k)} - U_{(k)}\big]\big\| = \big\|\mathbb{E}\big[(\widetilde U_{(k)} - U_{(k)})\,\mathbf 1\{\widetilde U_{(k)}\ne U_{(k)}\}\big]\big\| \le \frac{C_\blacktriangle}{\lambda}\,\mathbb{P}\big(\widetilde U_{(k)}\ne U_{(k)}\big) = \frac{C_\blacktriangle}{\lambda}\,\mathbb{P}\big(\|U_{(k)}\| > B\big) \le C_\blacktriangle\,\frac{\delta}{\lambda}.$$
We will assume in the sequel the condition
$$\delta \le \frac{\lambda}{\sqrt{n}}, \qquad(6.8)$$
which implies in particular from the above that $\big\|\mathbb{E}\big[\widetilde U_{(k)}\big]\big\| \le C_\blacktriangle B$ (using the definition of
B to check this).
Applying the vectorial Hoeffding inequality to the centered variables $\overline U_{(k)} := \widetilde U_{(k)} - \mathbb{E}\big[\widetilde U_{(k)}\big]$, $k\in\llbracket m\rrbracket$, which are independent and bounded in norm by $C_\blacktriangle B$, therefore yields that for any
$\eta\in(0,1)$, with probability at least $1-\eta$ it holds
$$\Big\|\frac{1}{m}\sum_{k\in\llbracket m\rrbracket}\overline U_{(k)}\Big\| \le \frac{C_\blacktriangle B\sqrt{L_\eta}}{\sqrt{m}},$$
entailing
$$\Big\|\frac{1}{m}\sum_{k\in\llbracket m\rrbracket}\widetilde U_{(k)}\Big\| \le C_\blacktriangle\left(\frac{\delta}{\lambda} + \frac{B\sqrt{L_\eta}}{\sqrt{m}}\right) \le C_\blacktriangle\,\sqrt{\frac{\mathcal N(\lambda)\,L_\delta\,L_\eta}{n}},$$
where we have used the definition of B, condition (6.8), and n = N m.
Finally, the latter bound also holds for $\frac{1}{m}\sum_{k\in\llbracket m\rrbracket}U_{(k)}$, with probability $1-\eta-m\delta$
and provided that conditions (6.5) and (6.8) hold. This is to
be compared to the bound (4.28) that we obtained for the "single machine" analysis.
Let us analyze these conditions: first, condition (6.5) implies in particular $\lambda \ge C_\blacktriangle/N \ge C_\blacktriangle/n$. Hence, (6.8) is ensured if (6.5) is and if $\delta \le C_\blacktriangle\, n^{-3/2}$. This is reasonable since it
will only result in a logarithmic factor (coming from $L_\delta$). We choose henceforth $\delta = n^{-2}$,
so that $m\delta \le 1/n$; and $\eta = n^{-1}$.
As for condition (6.5) itself, since N = n/m, it is implied by the following sufficient condition
on the number of subsamples/machines m:
$$m \le C_\blacktriangle\, n\,\frac{\lambda}{\log(2\mathcal N(\lambda))}.$$
Finally, as in the analysis of (4.28), we would like to be able to wrap the second term into
the third one. This is the case (i.e. the second term is smaller), again using N = n/m, if
$$m \le \mathcal N(\lambda)\,\lambda^{-\min(1,2r-1)};$$
we recall that the latter form means that the solution can be written as
$$\widehat f_\lambda = \widehat S^*\widehat\alpha = \frac{1}{n}\sum_{i\in\llbracket n\rrbracket}\widehat\alpha_i\,k(X_i,\cdot), \qquad\text{with } \widehat\alpha = (\widehat G + \lambda I)^{-1}Y,$$
which is also known as a form of the so-called representer theorem (see for instance [2] for
details).
The idea of Nyström-based methods is to approximate the above expansion of $\widehat f_\lambda$ by a
reduced expansion on a subset of points of size m, i.e. of the form
$$\widetilde f_\lambda = \frac{1}{n}\sum_{i\in I}\widetilde\alpha_i\,k(X_i,\cdot), \qquad(6.11)$$
is given by
$$\widetilde f_\lambda = \frac{1}{n}\sum_{i\in I}\widetilde\alpha_i\,k(X_i,\cdot), \qquad\text{with } \widetilde\alpha = \big(\widehat G_{I,\llbracket n\rrbracket}\,\widehat G_{I,\llbracket n\rrbracket}^{\,t} + \lambda\,\widehat G_{I,I}\big)^{-1}\widehat G_{I,\llbracket n\rrbracket}\,Y, \qquad(6.13)$$
where $\widehat G_{I,J}$ denotes the submatrix of the normalized Gram kernel matrix $\widehat G$ corresponding to the index
sets I and J.
Proof. We can write explicitly, for $f = \sum_{i\in I}\alpha_i k(X_i,\cdot)$, that the vector of evaluations of f
at the points $(X_1,\ldots,X_n)$ is given by $n\,\widehat G_{\llbracket n\rrbracket, I}\,\alpha$ (the factor n because $\widehat G$ is the normalized Gram
matrix). Furthermore, by properties of a rkHs, it holds $\|f\|_H^2 = \sum_{i,j\in I}\alpha_i\alpha_j k(X_i,X_j) = n\,\alpha^t\widehat G_{I,I}\,\alpha$.
Thus, (6.12) is rewritten as the minimization of
$$\big\|n\,\widehat G_{\llbracket n\rrbracket,I}\,\alpha - Y\big\|_n^2 + n\lambda\,\alpha^t\widehat G_{I,I}\,\alpha = \alpha^t\big(n\,\widehat G_{I,\llbracket n\rrbracket}\widehat G_{I,\llbracket n\rrbracket}^{\,t} + n\lambda\,\widehat G_{I,I}\big)\alpha - 2\,\alpha^t\widehat G_{I,\llbracket n\rrbracket}\,Y + \|Y\|_n^2.$$
Standard formulas for quadratic optimization and some bookkeeping yield (6.13).
Note the interesting fact that to compute the Nyström approximate solution, it is not
necessary to compute the full kernel Gram matrix $\widehat G$, only the submatrix $\widehat G_{I,\llbracket n\rrbracket}$. Furthermore, the costly step of matrix inversion only concerns an (m, m) matrix instead of an (n, n)
one, thus significantly reducing computation.
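A minimal sketch of the Nyström solution (6.13) with uniform subsampling (the Gaussian kernel, data and parameter values are illustrative choices):

```python
import numpy as np

def gauss_kernel(x, y, bw=0.3):
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * bw ** 2))

rng = np.random.default_rng(4)
n, m, lam = 2000, 50, 1e-3
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

I = rng.choice(n, size=m, replace=False)     # uniform subsampling of m indices within [n]
G_In = gauss_kernel(X[I], X) / n             # submatrix G_{I,[n]} (m x n); the full G is never formed
G_II = gauss_kernel(X[I], X[I]) / n          # submatrix G_{I,I} (m x m)

# Formula (6.13): alpha = (G_{I,[n]} G_{I,[n]}^t + lam * G_{I,I})^{-1} G_{I,[n]} Y
alpha = np.linalg.solve(G_In @ G_In.T + lam * G_II, G_In @ Y)

def f_nystrom(x):
    """Reduced expansion (1/n) * sum_{i in I} alpha_i * k(X_i, x)."""
    return gauss_kernel(X[I], np.atleast_1d(x)).T @ alpha / n

print(f_nystrom(0.25), np.sin(2 * np.pi * 0.25))
```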
Several strategies can be proposed for the selection of the subset I:
• Uniform sampling (with or without replacement) of m indices within ⟦n⟧;
• Leverage score sampling, where the indices are sampled with weights proportional to
the so-called leverage scores
$$\ell_\lambda(i) := \Big(\widehat G\big(\widehat G + \lambda I\big)^{-1}\Big)_{ii}$$
(a small computational sketch is given after this list).
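A sketch of the exact computation of these scores (for illustration only; note that it manipulates the full Gram matrix and performs an (n, n) inversion, which is precisely the cost one would like to avoid, as discussed next):

```python
import numpy as np

def ridge_leverage_scores(G_hat, lam):
    """Exact leverage scores l_lambda(i) = (G (G + lam I)^{-1})_{ii}.

    Since G and (G + lam I)^{-1} commute, this equals diag((G + lam I)^{-1} G).
    """
    n = G_hat.shape[0]
    return np.diag(np.linalg.solve(G_hat + lam * np.eye(n), G_hat))

# Hypothetical usage: sample the Nystrom subset with probabilities proportional to the scores.
# scores = ridge_leverage_scores(G_hat, lam)
# I = rng.choice(n, size=m, replace=False, p=scores / scores.sum())
```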
A theoretical analysis (see [23, 14]) shows that under some lower bound on the subsample
size m, depending on the problem parameters (source condition, intrinsic dimension) but
still allowing m ≪ n, the statistical convergence rate obtained in Corollary 4.12 can be
preserved for the Nyström approximate estimator $\widetilde f_\lambda$. The use of leverage score
sampling allows a further reduction of the subsample size m; however, there is a chicken-and-egg problem in that exact computation of these scores itself in principle requires inversion
of the (n, n) matrix we were trying to avoid! To alleviate this, several approaches to
approximate the leverage scores have been proposed, see for instance [22].
References
[1] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms
in learning theory. Journal of complexity, 23(1):52–72, 2007.
[2] Gilles Blanchard. Mathematics for artificial intelligence I. (M1 Lecture notes), 2022.
[3] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical
inverse learning problems. Foundations of Computational Mathematics, 18(4):971–
1013, 2018.
[4] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities:
a nonasymptotic theory of independence. Oxford University Press, 2013.
[5] Haim Brezis. Functional Analysis, Sobolev Spaces and Partial Differential Equations.
Universitext. Springer Nature, New York, NY, 2010.
[6] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-
squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
[7] John B. Conway. A Course in Functional Analysis, volume 96 of Graduate Texts in
Mathematics. Springer, 1985. [Available in electronic form at the LMO library].
[8] John B. Conway. A Course in Operator Theory, volume 21 of Graduate Studies
in Mathematics. American Mathematical Society, 2000. [Available in electronic
form at the LMO library].
[9] H.W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer
Academic Publishers, 1996.
[10] Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved
large-scale machine learning. In International Conference on Machine Learning, pages
567–575. PMLR, 2013.
[11] Milen Ivanov and Stanimir Troyanski. Uniformly smooth renorming of Banach
spaces with modulus of convexity of power type 2. Journal of Functional Analysis,
237(2):373–390, 2006.
[12] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Conver-
gence and generalization in neural networks. Advances in neural information process-
ing systems, 31, 2018.
[13] Matthieu Lerasle. Lectures on high-dimensional probability. (M2 Lecture notes), 2023.
[14] Jian Li, Yong Liu, and Weiping Wang. Optimal convergence rates for agnostic Nyström
kernel learning. In International Conference on Machine Learning, pages 19811–19836.
PMLR, 2023.
[15] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with
stochastic gradient methods and spectral algorithms. Journal of Machine Learning
Research, 21(147):1–63, 2020.
[16] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates
for spectral algorithms with least-squares regression over Hilbert spaces. Applied and
Computational Harmonic Analysis, 48(3):868–890, 2020.
[17] Stanislav Minsker. On some extensions of Bernstein’s inequality for self-adjoint oper-
ators. Statistics & Probability Letters, 127:111–119, 2017.
[18] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard
Schölkopf. Kernel mean embedding of distributions: a review and beyond. Foun-
dations and Trends in Machine Learning, 10, 2017.
[19] Nicole Mücke and Gilles Blanchard. Parallelizing spectrally regularized kernel algo-
rithms. Journal of Machine Learning Research, 19(30):1–29, 2018.
[20] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces.
The Annals of Probability, 22(4):1679–1706, 1994.
[21] Abhishake Rastogi and Sivananthan Sampath. Optimal rates for the regularized learn-
ing algorithms under general source condition. Frontiers in Applied Mathematics and
Statistics, 3:3, 2017.
[22] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On
fast leverage score sampling and optimal learning. Advances in Neural Information
Processing Systems, 31, 2018.
[23] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström
computational regularization. Advances in neural information processing systems, 28,
2015.
[24] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[25] Nicole Tomczak-Jaegermann. The moduli of smoothness and convexity and the
Rademacher averages of the trace classes Sp (1 ≤ p < ∞). Studia Mathematica,
50(2):163–182, 1974.
[26] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and
Trends in Machine Learning, 8(1-2):1–230, 2015.
[27] G. Wahba. Spline Models for Observational Data, volume 59. SIAM CBMS-NSF
Series in Applied Mathematics, 1990.
[28] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge
regression. In Conference on learning theory, pages 592–617. PMLR, 2013.