High Dimensional Probability MA3K0 Notes 3
Stefan Adams
15 January 2024
Contents
1 Preliminaries on Probability Theory
1.1 Random variables
1.2 Classical Inequalities
1.3 L^p-spaces
1.4 Limit Theorems
2 Concentration inequalities for independent random variables
2.1 Why concentration inequalities
2.2 Hoeffding's Inequality
2.3 Chernoff's Inequality
2.4 Sub-Gaussian random variables
2.5 Sub-Exponential random variables
3 Random vectors in High Dimensions
3.1 Concentration of the Euclidean norm
3.2 The geometry of high dimensions
3.3 Covariance matrices and Principal Component Analysis (PCA)
3.4 Examples of High-Dimensional distributions
3.5 Sub-Gaussian random variables in higher dimensions
3.6 Application: Grothendieck's inequality
4 Random Matrices
4.1 Geometric concepts
4.2 Concentration of the operator norm of random matrices
4.3 Application: Community Detection in Networks
4.4 Application: Covariance Estimation and Clustering
5 Concentration of measure - general case
5.1 Concentration by entropic techniques
5.2 Concentration via Isoperimetric Inequalities
5.3 Some matrix calculus and covariance estimation
5.4 Application - Johnson-Lindenstrauss Lemma
6 Basic tools in high-dimensional probability
6.1 Decoupling
6.2 Concentration for Anisotropic random vectors
6.3 Symmetrisation
7 Random Processes
7.1 Basic concepts and examples
7.2 Slepian's inequality and Gaussian interpolation
7.3 The supremum of a process
7.4 Uniform law of large numbers
8 Application: Statistical Learning theory
Preface
Introduction
A convex combination of points z_1, . . . , z_m ∈ R^n is a sum of the form

Σ_{i=1}^m λ_i z_i   where λ_i ≥ 0 and Σ_{i=1}^m λ_i = 1.   (0.1)

Given a set M ⊂ R^n, the convex hull of M is the set of all convex combinations of
all finite collections of points in M, defined as

conv(M) := { Σ_{i=1}^m λ_i z_i : m ∈ N, z_i ∈ M, λ_i ≥ 0, Σ_{i=1}^m λ_i = 1 }.
Theorem 0.1 (Caratheodory’s theorem) Every point in the convex hull of a set M ⊂ Rn
can be expressed as a convex combination of at most n + 1 points from M .
1 A simplex is a generalisation of the notion of a triangle to arbitrary dimensions. Specifically, a k-simplex
S is the convex hull of its k + 1 vertices: Suppose u_0, . . . , u_k ∈ R^k are affinely independent, which means that
u_1 − u_0, . . . , u_k − u_0 are linearly independent. Then, the simplex determined by these vertices is the set of
points

S = { λ_0 u_0 + · · · + λ_k u_k : Σ_{i=0}^k λ_i = 1, λ_i ≥ 0 for i = 0, . . . , k }.
Remark 0.3 We have assumed diam(M) ≤ 1 for simplicity. For a general set M, the bound
in the theorem changes to diam(M)/√k.
Proof. The argument upon which our proof is based is known as the empirical
method of B. Maurey. W.l.o.g., we may assume that not only the diameter but also
the radius of M is bounded by 1. Fix a point x ∈ conv(M) and write it as a convex
combination x = Σ_{i=1}^m λ_i z_i of points z_1, . . . , z_m ∈ M as in (0.1). Define a random
vector Z ∈ R^n taking the values z_1, . . . , z_m with

P(Z = z_i) = λ_i,  i = 1, . . . , m.

This is possible by the fact that the weights λ_i ∈ [0, 1] and sum to 1. Consider now
a sequence (Z_j)_{j∈N} of independent copies of Z. This sequence is an independent identically
distributed sequence of R^n-valued random variables. By the strong law of large
numbers,
(1/k) Σ_{j=1}^k Z_j → x  almost surely as k → ∞.
We shall now get a quantitative version of this limiting statement, that is, we wish
to obtain an error bound. For this we shall compute the variance of (1/k) Σ_{j=1}^k Z_j. We
obtain

E[ ‖x − (1/k) Σ_{j=1}^k Z_j‖² ] = (1/k²) E[ ‖Σ_{j=1}^k (Z_j − x)‖² ]   (since E[Z_j − x] = 0)
  = (1/k²) Σ_{j=1}^k E[ ‖Z_j − x‖² ].
The last identity is just a higher-dimensional version of the basic fact that the vari-
ance of a sum of independent random variables equals the sum of the variances. To
bound the variances of the single terms we compute, using that Z_j is a copy of Z and
that ‖Z‖_2 ≤ 1 as Z ∈ M,

E[ ‖Z_j − x‖² ] = E[ ‖Z − x‖² ] = E[ ‖Z‖² ] − ‖x‖² ≤ 1,

where the second equality follows from the well-known property of the variance,
namely, for n = 1, Var(Z) = E[(Z − E[Z])²] = E[Z²] − E[Z]², and the cases for n > 1
follow similarly. We have thus shown that
E[ ‖x − (1/k) Σ_{j=1}^k Z_j‖² ] ≤ 1/k.
Proof. We shall define the centres of the balls as follows. Let k := ⌈1/ε²⌉ and
consider the set

N := { (1/k) Σ_{j=1}^k x_j : x_j are vertices of P }.

The polytope P is the convex hull of the set of its vertices, which we denote by
M. We then apply Theorem 0.2 to any point x ∈ P = conv(M) and deduce that
x is within a distance 1/√k ≤ ε from some point in N. This shows that the ε-balls
centred at N do indeed cover P. 2
In this lecture we will learn several other approaches to the covering problem in
relation to packing, entropy and coding, and random processes.
2 In geometry, a polytope is a geometric object with 'flat' sides. It is a generalisation of the three-dimensional
polyhedron, which is a solid with flat polygonal faces, straight edges and sharp corners/vertices. Flat sides mean
that the sides of a (k + 1)-polytope consist of k-polytopes.
be the system consisting of all compact rectangular boxes in Rn with rational vertices and
edges parallel to the axes. In honour of Émile Borel (1871–1956), the system B n = σ(G)
is called the Borel σ-algebra on Rn , and every A ∈ B n a Borel set. Here, σ(G) denotes the
smallest σ-algebra generated by the system G. Note that the B n can also be generated by the
system of open or half-open rectangular boxes, see [Dur19, Geo12].
♣
The decisive point in the process of building a stochastic model is the next step:
For each A ∈ F we need to define a value P (A) ∈ [0, 1] that indicates the probability
of A. Sensibly, this should be done so that the following holds.
Theorem 1.4 (Construction of probability measures via densities) (a) Discrete case: For
countable Ω, the relations

P(A) = Σ_{ω∈A} ϱ(ω) for A ∈ P(Ω),   ϱ(ω) = P({ω}) for ω ∈ Ω

(b) Continuous case: If Ω ⊂ R^n is Borel, then every function ϱ : Ω → [0, ∞) satisfying the
properties

Definition 1.5 A sequence or function ϱ as in Theorem 1.4 above is called a density (of
P) or, more explicitly (to emphasise normalisation), a probability density (function), often
abbreviated as pdf. If a distinction between the discrete and continuous case is required, a
sequence ϱ = (ϱ(ω))_{ω∈Ω} as in case (a) is called a discrete density, and a function ϱ in case
(b) a Lebesgue density.
In probability theory one often considers the transition from a measurable space
(event space) (Ω, F) to a coarser measurable (event) space (Ω′, F′) by means of a
mapping X : Ω → Ω′. In general such a mapping should satisfy the requirement

A′ ∈ F′ ⇒ X^{−1}A′ := {ω ∈ Ω : X(ω) ∈ A′} ∈ F.   (1.1)

Definition 1.6 Let (Ω, F) and (Ω′, F′) be two measurable (event) spaces. Then every map-
ping X : Ω → Ω′ satisfying property (1.1) is called a random variable from (Ω, F) to
(Ω′, F′), or a random element of Ω′, or an Ω′-valued random variable. Alternatively (in the
terminology of measure theory), X is said to be measurable relative to F and F′.
Definition 1.8 (a) The probability measure P 0 in Theorem 1.7 is called the distribution of
X under P , or the image of P under X , and is denoted by P ◦ X −1 . (In the literature,
one also finds the notations PX or L(X; P ). The letter L stands for the more traditional
term law, or loi in French.)
(b) Two random variables are said to be identically distributed if they have the same distri-
bution.
We are considering real-valued or R^n-valued random variables in the following
and we just call them random variables for all these cases. In basic courses in
probability theory, one learns about the two most important quantities associated
with a random variable X, namely the expectation 3 (also called the mean) and the
variance. They will be denoted in this lecture by

E[X] and Var(X) := E[(X − E[X])²].
It is often more convenient to work with the tails of random variables, namely with
the tail probabilities P(X ≥ t), t ∈ R. Here we write P for the generic distribution of
the random variable X which is given by the context.
For any real-valued random variable the moment generating function (MGF) is defined as

M_X(λ) := E[e^{λX}],  λ ∈ R.   (1.4)
When M_X is finite for all λ in a neighbourhood of the origin, we can easily compute
all moments by taking derivatives (interchanging differentiation and expectation (in-
tegration) in the usual way):

E[X^k] = (d^k/dλ^k) M_X(λ) |_{λ=0},  k ∈ N.   (1.5)
3 In measure theory the expectation E[X] of a random variable on a probability space (Ω, F, P) is the
Lebesgue integral of the function X : Ω → R. This makes theorems on Lebesgue integration applicable in
probability theory for expectations of random variables.
Lemma 1.9 (Integral Identity) Let X be a real-valued non-negative random variable. Then

E[X] = ∫_0^∞ P(X > t) dt.
Proof. We can write any non-negative real number x via the following identity using
the indicator function 4:

x = ∫_0^x 1 dt = ∫_0^∞ 1l_{t<x}(t) dt.
Substitute now the random variable X for x and take expectation (with respect to X)
on both sides. This gives

E[X] = E[ ∫_0^∞ 1l_{t<X}(t) dt ] = ∫_0^∞ E[1l_{t<X}] dt = ∫_0^∞ P(t < X) dt.

To change the order of expectation and integration in the second equality, we used
the Fubini-Tonelli theorem. 2
Exercise 1.10 (Integral identity) Prove the extension of Lemma 1.9 to any real-valued ran-
dom variable (not necessarily positive):

E[X] = ∫_0^∞ P(X > t) dt − ∫_{−∞}^0 P(X < t) dt.
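For instance, for X ∼ Exp(α), α > 0, the identity of Lemma 1.9 is quickly verified:
P(X > t) = e^{−αt} for t ≥ 0, so ∫_0^∞ P(X > t) dt = ∫_0^∞ e^{−αt} dt = 1/α, which agrees
with the direct computation of E[X].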
Theorem 1.11 (Jensen's inequality) Let X be a real-valued random variable with finite
expectation and let Φ : R → R be convex. Then

Φ(E[X]) ≤ E[Φ(X)].
Proof. See [Dur19] or [Geo12], using either the existence of sub-derivatives for
convex functions or the definition of convexity with the epi-graph of a function. The
epi-graph of a function f : I → R, I ⊂ R an interval, is the set {(x, y) ∈ I × R : y ≥ f(x)}.
4 1l_A denotes the indicator function of the set A, that is, 1l_A(t) = 1 if t ∈ A and 1l_A(t) = 0 if t ∉ A.
1.3 L^p-spaces
In the following let X be an R-valued random variable, i.e., there is a probability space
(Ω, F, P) such that X : Ω → R is a measurable function. By default, we equip the
real line R with its Borel σ-algebra. We begin with the definition of the essential
supremum of X.

Definition 1.12 (Essential supremum) Let X be an R-valued random variable. The essential
supremum of X, written ess-sup(X), is the smallest number α ∈ R such that the set {x ∈
Ω : X(x) > α} has measure zero, that is,

ess-sup(X) := inf{ α ∈ R : P({x ∈ Ω : X(x) > α}) = 0 }.
Example 1.13 (Essential supremum being infinity) Suppose that the probability space is
Ω = (0, 1) with σ-algebra F = B((0, 1)), and let P be the uniform measure on (0, 1). This
measure has constant probability density,

P(A) = ∫_Ω 1l_A(t) dt = b − a,  for any A = (a, b) with 0 ≤ a < b ≤ 1.

Define X(x) := 1/x for x ∈ (0, 1). Then, for every α ≥ 1,

{x ∈ (0, 1) : 1/x > α} = (0, 1/α)

and

P((0, 1/α)) = 1/α > 0.

As this holds for all α > 0, we have that ess-sup(X) = ∞. ♣
Definition 1.14 Let (Ω, F, P) be a probability space. Given two measurable functions
f, g : Ω → [0, ∞], we say that f is equivalent to g, written f ∼ g, if f = g holds P-almost surely,
that is,

P({x ∈ Ω : f(x) ≠ g(x)}) = 0.

We shall - with an abuse of notation - identify a measurable function f with its
equivalence class [f].
where

‖f‖_{L^p} := ( ∫_Ω |f|^p dP )^{1/p} = ( ∫_Ω |f(x)|^p P(dx) )^{1/p}.

If p = ∞, then

‖f‖_{L^∞} := ess-sup(|f|),

and we write ‖f‖_∞ ≡ ‖f‖_{L^∞} occasionally.
Lemma 1.20 (Linear Markov's inequality) For non-negative random variables X and t >
0 the tail probability is bounded as

P(X > t) ≤ E[X]/t.

Proof. Pick t > 0. Any positive number x can be written as

x = x 1l_{x≥t} + x 1l_{x<t}.

As X is non-negative, we insert X into the above expression in place of x and take
the expectation (integral) to obtain

E[X] = E[X 1l_{X≥t}] + E[X 1l_{X<t}] ≥ E[t 1l_{X≥t}] = t P(X ≥ t).
2
This is one particular version of the Markov inequality which provides linear decay
in t. In the following proposition we obtain the general version which will be used
frequently throughout the lecture.
Proposition 1.21 (Markov's inequality) Let Y be a real-valued random variable and let
f : [0, ∞) → [0, ∞) be monotone increasing. Then, for every ε > 0 with f(ε) > 0,

P(|Y| ≥ ε) ≤ E[f ∘ |Y|] / f(ε).

Proof. Take the expectation of the pointwise bound

f(ε) 1l_{|Y|≥ε} ≤ f ∘ |Y|.

2
The following version of the Markov inequality is often called Chebyshev’s in-
equality.
Corollary 1.22 (Chebyshev's inequality, 1867) For all Y ∈ L² with E[Y] ∈ (−∞, ∞)
and ε > 0,

P(|Y − E[Y]| ≥ ε) ≤ Var(Y)/ε².
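For example, if Var(Y) = 1 and ε = 3, Chebyshev's inequality gives P(|Y − E[Y]| ≥ 3) ≤ 1/9 ≈ 0.11,
while for a standard normal random variable the true probability is about 0.0027. This gap is
one motivation for the sharper concentration bounds developed in Chapter 2.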
1.4 Limit Theorems
Definition 1.23 (Variance and covariance) Let X, Y ∈ L2 be real-valued random vari-
ables.
(a)

Var(X) := E[(X − E[X])²] = E[X²] − E[X]²

is called the variance, and √Var(X) the standard deviation of X with respect to P.
(b)
cov(X, Y ) := E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]
is called the covariance of X and Y . It exists since |XY | ≤ X 2 + Y 2 .
Recall that two independent random variables are uncorrelated, but two uncorrelated
random variables are not necessarily independent, as the following example shows.
Example 1.24 Let Ω = {1, 2, 3} and let P be the uniform distribution on Ω. Define two ran-
dom variables by their images, that is,

(X(1), X(2), X(3)) = (1, 0, −1) and (Y(1), Y(2), Y(3)) = (0, 1, 0).

Then E[X] = 0 and XY ≡ 0, so cov(X, Y) = E[XY] − E[X]E[Y] = 0, i.e., X and Y are
uncorrelated. On the other hand,

P(X = 1, Y = 1) = 0 ≠ 1/9 = P(X = 1)P(Y = 1).

Hence X and Y are not independent. ♣
Theorem 1.25 (Weak law of large numbers, L2 -version) Let (Xi )i∈N be a sequence of un-
correlated (e.g. independent) real-valued random variables in L2 with bounded variance, in
that v := supi∈N Var(Xi ) < ∞. Then for all ε > 0
P( | (1/N) Σ_{i=1}^N (X_i − E[X_i]) | ≥ ε ) ≤ v/(Nε²) −→ 0 as N → ∞,

and thus

(1/N) Σ_{i=1}^N (X_i − E[X_i]) −→ 0 in probability as N → ∞

(the superscript P over the arrow means convergence in probability). In particular, if E[X_i] = E[X_1] holds for all
i ∈ N, then

(1/N) Σ_{i=1}^N X_i −→ E[X_1] in probability.
We now present a second version of the weak law of large numbers, which does
not require the existence of the variance. To compensate we must assume that
the random variables, instead of being pairwise uncorrelated, are even pairwise
independent and identically distributed.
Theorem 1.26 (Weak law of large numbers, L¹-version) Let (X_i)_{i∈N} be a sequence of pair-
wise independent, identically distributed real-valued random variables in L¹. Then

(1/N) Σ_{i=1}^N X_i −→ E[X_1] in probability as N → ∞.
Theorem 1.27 (Strong law of large numbers) If (X_i)_{i∈N} is a sequence of pairwise uncor-
related real-valued random variables in L² with v := sup_{i∈N} Var(X_i) < ∞, then

(1/N) Σ_{i=1}^N (X_i − E[X_i]) → 0 almost surely as N → ∞.
Theorem 1.28 (Central limit theorem; A.M. Lyapunov 1901, J.W. Lindeberg 1922, P. Lévy 1922)
Let (X_i)_{i∈N} be a sequence of independent, identically distributed real-valued random vari-
ables in L² with E[X_i] = m and Var(X_i) = v > 0. Then, as N → ∞,

S*_N := (1/√N) Σ_{i=1}^N (X_i − m)/√v −→ N(0, 1) in distribution.
Lemma 1.30 Let X1 and X2 be independent and normally distributed with zero mean and
variance σ 2 > 0. Then X1 + X2 and X1 − X2 are independent and normally distributed
with mean 0 and variance 2σ 2 .
Proposition 1.31 If X and Y are n-dimensional Gaussian vectors with E[X] = E[Y ] and
cov(X) = cov(Y ), then X and Y have the same distribution.
Corollary 1.32 A Gaussian random vector X has independent entries if and only if its co-
variance matrix is diagonal. In other words, the entries in a Gaussian vector are uncorre-
lated if and only if they are independent.
Exercise 1.33 (a) Let X ∼ Ber(p), p ∈ [0, 1]. Compute the expectation, the variance and
the moment generating function MX .
The Poisson distribution Poi(λ) with parameter λ > 0 is given by

P(X = k) = e^{−λ} λ^k / k!,  k ∈ N_0,

and Poi(λ) ∈ M_1(N_0, P(N_0)). Here, M_1(Ω) denotes the set of probability measures on Ω
and P(N_0) is the power set.
Exercise 1.34 Let X ∼ Poi(λ), λ > 0. Compute the expectation, the variance and the
moment generating function MX .
K
Exercise 1.35 Let X ∼ Exp(α), α > 0. Compute the expectation, the variance and the
moment generating function MX . K
Theorem 1.36 (Poisson limit theorem) Let X_i^{(N)}, i = 1, . . . , N, be independent Bernoulli
random variables X_i^{(N)} ∼ Ber(p_i^{(N)}), and denote S_N = Σ_{i=1}^N X_i^{(N)} their sum. Assume that, as
N → ∞,

max_{1≤i≤N} {p_i^{(N)}} −→ 0 and E[S_N] = Σ_{i=1}^N p_i^{(N)} −→ λ ∈ (0, ∞).

Then, as N → ∞,

S_N −→ Poi(λ) in distribution.
Proposition 2.1 (Tails of the normal distribution) Let Y ∼ N(0, 1). Then, for all t > 0,
we have

(1/t − 1/t³) (1/√(2π)) e^{−t²/2} ≤ P(Y ≥ t) ≤ (1/t) (1/√(2π)) e^{−t²/2}.
Proof. Denote f(x) := exp(−x²/2). For the upper bound we use x/t ≥ 1 on the domain of
integration to get the estimate

∫_t^∞ f(x) dx ≤ ∫_t^∞ (x/t) e^{−x²/2} dx = (1/t) e^{−t²/2}.

For the lower bound we use integration by parts (IBP) and f′(x) = −x e^{−x²/2}:

∫_t^∞ e^{−x²/2} dx = ∫_t^∞ (1/x) x e^{−x²/2} dx = [ −(1/x) e^{−x²/2} ]_t^∞ − ∫_t^∞ (1/x²) e^{−x²/2} dx
 = (1/t) e^{−t²/2} − ∫_t^∞ (1/x³) x e^{−x²/2} dx = (1/t) e^{−t²/2} − [ −(1/x³) e^{−x²/2} ]_t^∞ + 3 ∫_t^∞ (1/x⁴) e^{−x²/2} dx.
Hence

∫_t^∞ e^{−x²/2} dx = (1/t − 1/t³) e^{−t²/2} + 3 ∫_t^∞ (1/x⁴) e^{−x²/2} dx,

and, as the integral on the right hand side is positive,

P(Y ≥ t) ≥ (1/√(2π)) (1/t − 1/t³) e^{−t²/2}.
2
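As a quick numerical check: at t = 3 the two bounds read 0.00131 ≤ P(Y ≥ 3) ≤ 0.00148,
while the true value is P(Y ≥ 3) ≈ 0.00135, so already for moderate t both bounds are quite tight.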
The lower bound in Proposition 2.1 is lower than the tail lower bound in Lemma
C.5 in the appendix.
Lemma 2.2 (Lower tail bound for normal distribution) Let Y ∼ N(0, 1). Then, for all
t ≥ 0,

P(Y ≥ t) ≥ (t/(t² + 1)) (1/√(2π)) e^{−t²/2}.
Proof. Define

g(t) := ∫_t^∞ e^{−x²/2} dx − (t/(t² + 1)) e^{−t²/2}.

Then g(0) > 0 and g′(t) = −2 e^{−t²/2}/(t² + 1)² < 0. Thus g is strictly decreasing with lim_{t→∞} g(t) =
0, implying that g(t) > 0 for all t ≥ 0; dividing by √(2π) gives the claim. 2
Using the tail estimates in Proposition 2.1 we expect to obtain an exponential
bound for the right hand side of (2.1), namely that the probability of having at least
3N/4 heads seems to be smaller than (1/√(2π)) e^{−N/8}. However, we have not taken into
account the approximation errors of '≈' in (2.1). Unfortunately, it turns out that the
error decays too slowly, actually even more slowly than linearly in N. This can be
seen from the following version of the CLT which we state without proof, see for
example [Dur19].
Theorem 2.3 (Berry-Esseen CLT) In the setting of Theorem 1.28, for every N and every
t ∈ R, we have

|P(Z_N ≥ t) − P(Y ≥ t)| ≤ ϱ/√N,

where Y ∼ N(0, 1) and ϱ = E[|X_1 − m|³]/σ³.
Can we improve the approximation error '≈' in (2.1) by simply computing the
probabilities with the help of Stirling's formula? Suppose that N is even. Then

P(S_N = N/2) = 2^{−N} \binom{N}{N/2} ∼ 1/√N and P(Z_N = 0) ∼ 1/√N,

but P(Y = 0) = 0. This shows that the approximation error in (2.1) is genuinely of order
1/√N and cannot be improved by a more careful computation. We thus see that we need a
different argument to get a better concentration bound. This is the content of the next section.
Proof. Without loss of generality we can put ‖a‖_2 = 1, namely, if ‖a‖_2 ≠ 1, then
define ã_i = a_i/‖a‖_2, i = 1, . . . , N, to obtain the bound for

P( Σ_{i=1}^N ã_i X_i ≥ t ) = P( Σ_{i=1}^N a_i X_i ≥ ‖a‖_2 t ) ≤ exp(−t²/2).

To bound the right hand side use first that the random variables are independent,

E[ exp( λ Σ_{i=1}^N a_i X_i ) ] = ∏_{i=1}^N E[e^{λ a_i X_i}],

and then compute, for the symmetric Bernoulli variables X_i,

E[e^{λ a_i X_i}] = (e^{λ a_i} + e^{−λ a_i})/2 = cosh(λ a_i).
We are left to find a bound for the hyperbolic cosine. There are two ways to get the
upper bound

cosh(x) ≤ e^{x²/2}.

1.) Simply write down both Taylor series and compare term by term, that is,

cosh(x) = 1 + x²/2! + x⁴/4! + x⁶/6! + · · ·
and

e^{x²/2} = Σ_{k=0}^∞ (x²/2)^k / k! = 1 + x²/2 + x⁴/(2²·2!) + · · · .

2.) Use the product expansion (complex analysis – not needed for this course, just
as background information)

cosh(x) = ∏_{k=1}^∞ ( 1 + 4x²/(π²(2k − 1)²) ) ≤ exp( Σ_{k=1}^∞ 4x²/(π²(2k − 1)²) ) = e^{x²/2},

where the last equality uses Σ_{k=1}^∞ 1/(2k − 1)² = π²/8.
As the reader may have realised, the proof of Hoeffding’s inequality, Theorem 2.5,
is quite flexible as it is based on bounding the moment generating function. For ex-
ample, the following version of Hoeffding’s inequality is valid for general bounded
random variables. The proof will be given as an exercise.
Theorem 2.8 (Hoeffding's inequality for general bounded random variables) Suppose that
X_1, . . . , X_N are independent random variables with X_i ∈ [m_i, M_i], m_i < M_i, for i =
1, . . . , N. Then, for any t ≥ 0, we have

P( Σ_{i=1}^N (X_i − E[X_i]) ≥ t ) ≤ exp( − 2t² / Σ_{i=1}^N (M_i − m_i)² ).
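To get a feeling for the strength of this bound, the following minimal Python/NumPy sketch
(the parameters N, t and the number of trials are illustrative assumptions, not taken from the
notes) compares the empirical tail of a sum of centred Ber(1/2) variables, for which M_i − m_i = 1,
with the bound exp(−2t²/N).

import numpy as np

rng = np.random.default_rng(0)
N, trials, t = 100, 200_000, 10.0

# N independent Ber(1/2) variables per trial; X_i - E[X_i] = X_i - 1/2, so M_i - m_i = 1
X = rng.integers(0, 2, size=(trials, N))
S = (X - 0.5).sum(axis=1)

empirical = np.mean(S >= t)
hoeffding = np.exp(-2 * t**2 / N)      # exp(-2 t^2 / sum_i (M_i - m_i)^2)

print(f"empirical tail P(S >= {t}) ~ {empirical:.3e}")
print(f"Hoeffding bound              {hoeffding:.3e}")

The empirical tail (roughly 2-3%) sits well below the Hoeffding bound exp(−2) ≈ 0.135, as it should.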
We need to bound the MGF of each Bernoulli random variable X_i separately. Using
1 + x ≤ e^x, we get

E[exp(λX_i)] = e^λ p_i + (1 − p_i) = 1 + (e^λ − 1)p_i ≤ exp( (e^λ − 1) p_i ).

Thus

∏_{i=1}^N E[exp(λX_i)] ≤ exp( (e^λ − 1) Σ_{i=1}^N p_i ) = exp( (e^λ − 1) µ ),

and therefore

P(S_N ≥ t) ≤ e^{−λt} exp( (e^λ − 1) µ ).

Define g(λ) := −λt + (e^λ − 1)µ. Optimising g over λ > 0, we obtain g′(λ) = −t + µe^λ = 0
if and only if λ = log(t/µ), which, as t > µ, gives the minimal upper bound. 2
Exercise 2.12 In the setting of Theorem 2.11, prove that, for any t < µ,

P(S_N ≤ t) ≤ e^{−µ} (eµ/t)^t.

K

Solution. We get, for any λ > 0,

P(S_N ≤ t) = P(−S_N ≥ −t) ≤ e^{λt} E[e^{−λS_N}],

and

∏_{i=1}^N E[e^{−λX_i}] ≤ exp( (e^{−λ} − 1) µ ).

Thus

P(S_N ≤ t) ≤ e^{λt} exp( (e^{−λ} − 1) µ ).

Minimising over λ > 0 gives −λ = log(t/µ), which is valid as t < µ.
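For a quick numerical impression of these Chernoff bounds: with µ = 10 and t = 20 the
upper-tail bound of Theorem 2.11 gives P(S_N ≥ 20) ≤ e^{−µ}(eµ/t)^t = e^{10}/2^{20} ≈ 0.021,
whereas Chebyshev's inequality (using Var(S_N) ≤ µ) only yields P(S_N ≥ 20) ≤ 10/10² = 0.1.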
Example 2.13 (Degrees of Random Graphs) We consider the Erdös-Rényi random graph
model. This is the simplest stochastic model for large, real-world networks. The random
graph G(n, p) consists of n vertices, and each pair of distinct vertices is connected inde-
pendently (from all other pairs) with probability p ∈ [0, 1]. The degree of a vertex is the
(random) number of edges incident to that vertex. The expected degree of every vertex i (we
label the n vertices by numbers 1, . . . , n) is

d(i) := E[D(i)] = (n − 1)p.

Note that d(i) = d(j) for all i ≠ j, i, j = 1, . . . , n, and thus we simply write d in the
following. We call the random graph dense when d ≥ log n holds, and the graph is called
regular if the degrees of all vertices are approximately equal to d.
♣
Proposition 2.14 (Dense graphs are almost regular) There is an absolute constant C > 0
such that the following holds: Suppose that G(n, p) has expected degree d ≥ C log n. Then,
with high probability, all vertices of G(n, p) have degrees between 0.9d and 1.1d.
Proof. Pick a vertex i of the graph; its degree is simply a sum of Bernoulli random
variables, i.e.,

D(i) = Σ_{k=1}^{n−1} X_k,  X_k ∼ Ber(p).

We are using the two-sided version of the Chernoff bound in Theorem 2.11, see
Exercise 2(a) of Example Sheet 1:

P(|D(i) − d| ≥ 0.1d) ≤ 2 e^{−cd} for some absolute constant c > 0.

We can now 'unfix' i by taking the union bound over all vertices of the graph:

P(∃ i ∈ {1, . . . , n} : |D(i) − d| ≥ 0.1d) ≤ Σ_{i=1}^n P(|D(i) − d| ≥ 0.1d) ≤ 2n e^{−cd}.

If d ≥ C log n for a sufficiently large absolute constant C, the probability on the left
hand side is bounded as

2n e^{−cd} ≤ 2n e^{−cC log n} = 2 n^{1−cC} ≤ 0.1.

This means that, with probability at least 0.9, the complementary event occurs and we have
0.9d ≤ D(i) ≤ 1.1d for all vertices i. 2
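The following minimal Python/NumPy sketch (the values of n and p are illustrative assumptions,
not taken from the notes) samples one realisation of G(n, p) in the dense regime and reports the
fraction of vertices whose degree lies in [0.9d, 1.1d].

import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 0.5                       # expected degree d = (n-1)p ~ 1000 >> log n ~ 7.6

# sample the adjacency matrix of G(n, p): independent edges above the diagonal
upper = np.triu(rng.random((n, n)) < p, k=1)
A = upper | upper.T
degrees = A.sum(axis=1)

d = (n - 1) * p
frac = np.mean((degrees >= 0.9 * d) & (degrees <= 1.1 * d))
print(f"expected degree d = {d:.1f}")
print(f"fraction of vertices with degree in [0.9d, 1.1d]: {frac:.4f}")

In this regime the printed fraction is typically 1.0000, in line with the proposition.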
Suppose the real-valued random variable X has mean µ ∈ R and there is some
constant b > 0 such that the function

Φ(λ) := E[e^{λ(X−µ)}],

the so-called centred moment generating function, exists for all |λ| ≤ b, that is, Φ(λ) ∈
R for all |λ| ≤ b. For λ ∈ [0, b], we may apply Markov's inequality in Theorem 1.21 to
the random variable Y = e^{λ(X−E[X])}, thereby obtaining the upper bound

P(X − µ ≥ t) = P(e^{λ(X−E[X])} ≥ e^{λt}) ≤ Φ(λ)/e^{λt}.

Optimising our choice of λ so as to obtain the tightest result yields the so-called
Chernoff bound, namely, the inequality

P(X − µ ≥ t) ≤ exp( inf_{0≤λ≤b} { log Φ(λ) − λt } ).
Example 2.15 (Gaussian tail bounds) Let X ∼ N(µ, σ²), µ ∈ R, σ² > 0, and recall the
probability density function (p.d.f.) of X,

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

The moment generating function (MGF) is easily computed as

M_X(λ) = E[e^{λX}] = (1/√(2πσ²)) ∫_R e^{−(x−µ)²/(2σ²) + λx} dx = (e^{λµ}/√(2πσ²)) ∫_R e^{−y²/(2σ²) + λy} dy
 = e^{λµ + λ²σ²/2} < ∞

for all λ ∈ R. We obtain the Chernoff bound by optimising over λ ≥ 0 using Φ(λ) =
e^{−λµ} M_X(λ) = e^{λ²σ²/2}:

inf_{λ≥0} { log Φ(λ) − λt } = inf_{λ≥0} { λ²σ²/2 − λt } = −t²/(2σ²).

Thus any N(µ, σ²) random variable X satisfies the upper deviation inequality

P(X ≥ µ + t) ≤ exp( −t²/(2σ²) ),  t ≥ 0,

and the two-sided version

P(|X − µ| ≥ t) ≤ 2 exp( −t²/(2σ²) ),  t ≥ 0.

♣
Exercise 2.16 (Moments of the normal distribution) Let p ≥ 1 and X ∼ N(0, 1). Then

‖X‖_{L^p} = (E[|X|^p])^{1/p} = ( (1/√(2π)) ∫_R |x|^p e^{−x²/2} dx )^{1/p} = ( (2/√(2π)) ∫_0^∞ x^p e^{−x²/2} dx )^{1/p}
 = ( (2/√(2π)) ∫_0^∞ (2w)^{p/2} e^{−w} (1/√(2w)) dw )^{1/p}   (substituting x²/2 = w)
 = √2 ( Γ(p/2 + 1/2) / Γ(1/2) )^{1/p}.

Hence

‖X‖_{L^p} = O(√p) as p → ∞.

K
We also note that the MGF of X ∼ N(0, 1) is given as M_X(λ) = e^{λ²/2}. In the
following proposition we identify equivalent properties for a real-valued random vari-
able which exhibits similar tail bounds, moment bounds and MGF estimates as a
Gaussian random variable does.
Proposition 2.17 (Sub-Gaussian characterisations) Let X be a real-valued random variable. The
following properties are equivalent, in the sense that each implies the others with the parameters
C_1, . . . , C_4 > 0 differing from each other by at most an absolute constant factor.

(i) Tails of X:
P(|X| ≥ t) ≤ 2 exp( −t²/C_1² ) for all t ≥ 0.

(ii) Moments:
‖X‖_{L^p} ≤ C_2 √p for all p ≥ 1.

(iii) MGF of X²:
E[exp(λ²X²)] ≤ exp(C_3² λ²) for all λ with |λ| ≤ 1/C_3.

(iv) MGF of X² bounded at some point:
E[exp(X²/C_4²)] ≤ 2.

Moreover, if E[X] = 0, then the statements (i)-(iv) are equivalent to the following
statement.

(v) MGF of X:
E[exp(λX)] ≤ exp(C_5² λ²) for all λ ∈ R.
Proof. (i) ⇒ (ii): Without loss of generality we can assume that C_1 = 1 for property
(i). This can be seen by just multiplying the parameter t with C_1, which corresponds
to multiplying X by 1/C_1. The integral identity in Lemma 1.9 applied to the non-negative
random variable |X|^p gives

E[|X|^p] = ∫_0^∞ P(|X|^p > t) dt = ∫_0^∞ P(|X| ≥ u) p u^{p−1} du   (substituting t = u^p)
 ≤ 2 ∫_0^∞ e^{−u²} p u^{p−1} du   (by (i))
 = p Γ(p/2) ≤ p (p/2)^{p/2}   (substituting u² = s),

where we used the property Γ(x) ≤ x^x of the Gamma function. 5 Taking the p-th
root gives ‖X‖_{L^p} ≤ p^{1/p} √(p/2), and p^{1/p} ≤ 2 gives (ii) with some C_2 ≤ 2. To see that p^{1/p} ≤ 2,
recall that lim_{n→∞} n^{1/n} = 1. More precisely, in that proof one takes n^{1/n} = 1 + δ_n and
shows with the binomial theorem that 0 < δ_n < √(2/(n − 1)) for all n ≥ 4. For p = 1, 2, 3,
we directly obtain the estimates 1 = 1, √2 ≤ 2 and 3^{1/3} ≤ 2.
(ii) ⇒ (iii): Again without loss of generality we can assume that C_2 = 1. Taylor
expansion of the exponential function and linearity of the expectation gives

E[exp(λ²X²)] = 1 + Σ_{k=1}^∞ λ^{2k} E[X^{2k}] / k!.

Property (ii) implies that E[X^{2k}] ≤ (2k)^k, and Stirling's formula gives 6 that k! ≥ (k/e)^k.
Thus

E[exp(λ²X²)] ≤ 1 + Σ_{k=1}^∞ (2λ²k)^k / (k/e)^k = Σ_{k=0}^∞ (2eλ²)^k = 1/(1 − 2eλ²),

provided that 2eλ² < 1 (geometric series). To bound the right hand side, i.e., to
bound 1/(1 − 2eλ²), we use that

1/(1 − x) ≤ e^{2x} for all x ∈ [0, 1/2].
5 Γ(n) = (n − 1)!, n ∈ N, and

Γ(z) := ∫_0^∞ x^{z−1} e^{−x} dx for all z = x + iy ∈ C with x > 0,  Γ(z + 1) = zΓ(z),  Γ(1) = 1,  Γ(1/2) = √π.
6 Stirling's formula:

N! ∼ √(2πN) e^{−N} N^N, where ∼ means that the quotient of the two sides goes to one as N → ∞.

Alternatively,

N! = √(2πN) e^{−N} N^N ( 1 + 1/(12N) + O(1/N²) ).
For all |λ| ≤ 1/(2√e) we have 2eλ² < 1/2, and thus

E[exp(λ²X²)] ≤ exp(4eλ²) for all |λ| ≤ 1/(2√e),

and (iii) follows with C_3 = 2√e.

(iii) ⇒ (iv): Trivial and left as an exercise.

(iv) ⇒ (i): Again, we assume without loss of generality that C_4 = 1. Using Markov's
inequality 1.21 with f = exp ∘ x², we obtain

P(|X| ≥ t) = P(e^{X²} ≥ e^{t²}) ≤ e^{−t²} E[e^{X²}] ≤ 2e^{−t²},

using (iv) in the last step.
Suppose now that E[X] = 0. We show that (iii) ⇒ (v) and (v) ⇒ (i).

(iii) ⇒ (v): Again, we assume without loss of generality that C_3 = 1. We use the
well-known inequality

e^x ≤ x + e^{x²},  x ∈ R,

to obtain

E[e^{λX}] ≤ E[λX + e^{λ²X²}] = E[e^{λ²X²}] ≤ e^{λ²} if |λ| ≤ 1,

using (iii) in the last step, and thus we have (v) for the range |λ| ≤ 1. Now, assume |λ| ≥ 1. This time, use the
inequality 2λx ≤ λ² + x², λ, x ∈ R, to arrive at

E[e^{λX}] ≤ e^{λ²/2} E[e^{X²/2}] ≤ e^{λ²/2} exp(1/2) ≤ e^{λ²} as |λ| ≥ 1.
Exercise 2.19 (Sub-Gaussian norm) Let X be a sub-Gaussian random variable and define

‖X‖_{ψ_2} := inf{ t > 0 : E[exp(X²/t²)] ≤ 2 }.

Show that ‖·‖_{ψ_2} is indeed a norm on the space of sub-Gaussian random variables.
KK
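As a concrete example, for X ∼ N(0, 1) one computes E[exp(X²/t²)] = (1 − 2/t²)^{−1/2} for t² > 2,
so the defining condition E[exp(X²/t²)] ≤ 2 holds exactly when 1 − 2/t² ≥ 1/4, i.e. t² ≥ 8/3;
hence ‖X‖_{ψ_2} = √(8/3) ≈ 1.63.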
Exercise 2.21 (Exponential moments) Show that if X ∼ N(0, 1), the function

R ∋ λ ↦ E[exp(λ²X²)]

of the random variable X² is finite only in some bounded neighbourhood of zero. Determine this neighbourhood.
K

Solution. Recall that when X ∼ N(0, 1), X has the probability density (√(2π))^{−1} e^{−x²/2}.
Thus

E[exp(λ²X²)] = (1/√(2π)) ∫_R e^{λ²x²} e^{−x²/2} dx = (1/√(2π)) ∫_R exp( (λ² − 1/2) x² ) dx < ∞

⇔ (λ² − 1/2) < 0 ⇔ λ ∈ ( −√(1/2), √(1/2) ),

because λ ∈ ( −√(1/2), √(1/2) ) ensures that the integrand is (up to normalisation) a so-called log-concave
density with finite integral, where a log-concave density is a probability density f
which can be written as

f(x) = exp(ϕ(x)), with ϕ being concave, i.e., ϕ″(x) ≤ 0.
Recall that the sum of independent normal random variables X_i ∼ N(µ_i, σ_i²), i =
1, . . . , N, is normally distributed, that is,

S_N = Σ_{i=1}^N X_i ∼ N( Σ_{i=1}^N µ_i, Σ_{i=1}^N σ_i² ).

For a proof see [Dur19] or [Geo12]. We may then wonder if the sum of independent
sub-Gaussian random variables is sub-Gaussian as well. There are different state-
ments on this, depending on whether µ_i = 0 for all i = 1, . . . , N, as done in the book
[Ver18], or the means do not vanish. We shall follow the general case here. It proves
useful to have the following equivalent definition of sub-Gaussian random variables.
Definition 2.22 (Sub-Gaussian random variable, MGF version) A real-valued random variable X
with mean µ = E[X] is called sub-Gaussian if there is a constant σ > 0 such that

E[e^{λ(X−µ)}] ≤ e^{σ²λ²/2} for all λ ∈ R. (2.5)

The constant σ satisfying (2.5) is referred to as the sub-Gaussian parameter; for instance,
we say that X is sub-Gaussian with parameter σ when (2.5) holds.
Remark 2.23 (Sub-Gaussian definitions:) It is easy to see that our two definitions are equiv-
alent when the random variable X has zero mean µ = 0: use statement (v) of Proposi-
tion 2.17. If µ 6= 0 and X satisfies (2.5), we obtain the following tail bound
P(|X| ≥ t) ≤ 2 exp( −(t − µ)²/(2σ²) ) for all t ≥ 0. (2.6)
This tail bound is not exactly the statement (i) of Proposition 2.17 as the parameter t ≥ 0
is chosen with respect to the mean. In most cases, one is solely interested in tail estimates
away from the mean. Thus the definitions are equivalent in case µ 6= 0 if we limit the range
for parameter t to t ≥ |µ|. In the literature and in applications Definition 2.22 is widely used
and sometimes called sub-Gaussian for centred random variables. We use both definitions
synonymously.
We now show that the sum of sub-Gaussian random variables is again sub-
Gaussian, and we do so for each of the two definitions separately.
Proposition 2.24 (Sums of independent sub-Gaussians) Let X_1, . . . , X_N be independent
mean-zero sub-Gaussian random variables. Then S_N := Σ_{i=1}^N X_i is sub-Gaussian as well, and

‖ Σ_{i=1}^N X_i ‖²_{ψ_2} ≤ C Σ_{i=1}^N ‖X_i‖²_{ψ_2}, (2.7)

where C > 0 is an absolute constant.

Proof. Using independence,

E[exp(λS_N)] = ∏_{i=1}^N E[exp(λX_i)] ≤ ∏_{i=1}^N exp( Cλ² ‖X_i‖²_{ψ_2} ) = exp(λ²K²),

where K² := C Σ_{i=1}^N ‖X_i‖²_{ψ_2}. The inequality follows directly from Proposition 2.17,
see Exercise 2.42. Recall the equivalence of (v) and (iv) in Proposition 2.17 to see
that the sum S_N is sub-Gaussian and

‖S_N‖_{ψ_2} ≤ C_1 K for an absolute constant C_1 > 0. 2
Proposition 2.25 (Hoeffding bounds for sums of independent sub-Gaussian random variables)
Let X_1, . . . , X_N be real-valued independent sub-Gaussian random variables with sub-Gaussian
parameters σ_i and means µ_i, i = 1, . . . , N, respectively. Then, for all t ≥ 0,

P( Σ_{i=1}^N (X_i − µ_i) ≥ t ) ≤ exp( − t² / (2 Σ_{i=1}^N σ_i²) ).

Proof (sketch). By (2.5) and independence, P( Σ_{i=1}^N (X_i − µ_i) ≥ t ) ≤ e^{−λt} exp( λ² Σ_{i=1}^N σ_i² / 2 )
for every λ > 0. Optimising over λ, we obtain λ = t / Σ_{i=1}^N σ_i², and conclude with the statement.
2
Example 2.26 (Bounded random variables) Let X be a real-valued random variable with
mean zero and (bounded) support in [a, b], a < b. Denote by X′ an independent copy of X,
i.e., X ∼ X′. For any λ ∈ R, using Jensen's inequality 1.11,

E_X[e^{λX}] = E_X[ e^{λ(X − E_{X′}[X′])} ] ≤ E_{X,X′}[ e^{λ(X − X′)} ],

where E_{X,X′} is the expectation with respect to the two independent and identically distributed
random variables X and X′. Suppose ε is a Rademacher function independent of X and X′,
that is, ε is a symmetric Bernoulli random variable. Then (X − X′) ∼ ε(X − X′).
For Rademacher functions/symmetric Bernoulli random variables ε we estimate the moment
generating function as

E[e^{λε}] = (1/2)(e^λ + e^{−λ}) = (1/2)( Σ_{k=0}^∞ λ^k/k! + Σ_{k=0}^∞ (−λ)^k/k! ) = Σ_{k=0}^∞ λ^{2k}/(2k)!
 ≤ 1 + Σ_{k=1}^∞ λ^{2k}/(2^k k!) = e^{λ²/2}.

Thus

E_{X,X′}[e^{λ(X−X′)}] = E_{X,X′}[ E_ε[ e^{λε(X−X′)} ] ] ≤ E_{X,X′}[ exp( λ²(X − X′)²/2 ) ] ≤ E_{X,X′}[ exp( λ²(b − a)²/2 ) ],

as |X − X′| ≤ b − a. Thus X is sub-Gaussian with sub-Gaussian parameter σ = b − a > 0,
see Definition 2.22. ♣
As discussed above in Remark 2.23, in many situations and results we encounter
later, we typically assume that the random variables have zero means. Then the two
definitions are equivalent. In the next lemma we simply show that centering does
not harm the sub-Gaussian property. This way we can see that we can actually use
both our definitions for sub-Gaussian random variables.
Lemma 2.27 (Centering) Let X be a real-valued sub-Gaussian random variable. Then

‖X − E[X]‖_{ψ_2} ≤ C ‖X‖_{ψ_2}

for some absolute constant C > 0.

Proof. We use the fact that ‖·‖_{ψ_2} is a norm. The triangle inequality gives

‖X − E[X]‖_{ψ_2} ≤ ‖X‖_{ψ_2} + ‖E[X]‖_{ψ_2}.

For any constant a ∈ R and any t ≥ |a|/√(log 2) one has E[exp(a²/t²)] = exp(a²/t²) ≤ 2, thus

‖a‖_{ψ_2} = |a|/√(log 2) ≤ c|a|

for some constant c > 0. Hence

‖E[X]‖_{ψ_2} ≤ c|E[X]| ≤ c E[|X|] = c ‖X‖_{L¹} ≤ C ‖X‖_{ψ_2},

using ‖X‖_{L^p} ≤ C‖X‖_{ψ_2} √p for all p ≥ 1.
2
(ii) Moments:

‖X‖_{L^p} = (E[|X|^p])^{1/p} ≤ C_2 p for all p ≥ 1.

(iv) MGF of |X| bounded at some point:

E[exp(|X|/C_4)] ≤ 2.
Lemma 2.30 Equation (2.8) defines the sub-exponential norm ‖X‖_{ψ_1} on the set of all sub-
exponential random variables X.

Proof. Support class. 2
and thus X is sub-Gaussian. As for the norms, recall that kX 2 kψ1 is the infimum
of all numbers C > 0 satisfying E[exp(X 2 /C)] ≤ 2, whereas kXkψ2 is the infimum
of all numbers K > 0 satisfying E[exp(X 2 /K 2 )] ≤ 2. Putting C = K 2 , one obtains
kX 2 kψ1 = kXk2ψ2 .
2
Proof. Suppose that ‖X‖_{ψ_2} ≠ 0 (if ‖X‖_{ψ_2} = 0 and/or ‖Y‖_{ψ_2} = 0 the statement
trivially holds). Then X̃ := X/‖X‖_{ψ_2} is sub-Gaussian with ‖X̃‖_{ψ_2} = (1/‖X‖_{ψ_2}) ‖X‖_{ψ_2} = 1.
Thus we assume without loss of generality that ‖X‖_{ψ_2} = ‖Y‖_{ψ_2} = 1. To prove the
statement in the lemma we shall show that E[exp(X²)] ≤ 2 and E[exp(Y²)] ≤ 2 both
imply that E[exp(|XY|)] ≤ 2, where E[exp(|XY|)] ≤ 2 implies that ‖XY‖_{ψ_1} ≤ 1. We
are going to use Young's inequality:

ab ≤ a²/2 + b²/2 for a, b ∈ R.

Thus

E[exp(|XY|)] ≤ E[ exp( X²/2 + Y²/2 ) ] = E[ exp(X²/2) exp(Y²/2) ]   (Young's inequality)
 ≤ (1/2) E[ exp(X²) + exp(Y²) ] = (1/2)(2 + 2) = 2   (Young's inequality).
2
Example 2.33 (Exponential random variables) Suppose X ∼ Exp(α), α > 0. Then E[X] = 1/α
and Var(X) = 1/α². We compute, for every t > 1/α,

E[exp(|X|/t)] = ∫_0^∞ α e^{x/t} e^{−αx} dx = [ (−α/(α − 1/t)) e^{−x(α − 1/t)} ]_0^∞ = α/(α − 1/t) ≤ 2

if and only if t ≥ 2/α. Hence ‖X‖_{ψ_1} = 2/α.
♣
Example 2.34 (Sub-exponential but not sub-Gaussian) Suppose that Z ∼ N(0, 1), and
define X := Z². Then E[X] = 1. For λ < 1/2 we have

E[e^{λ(X−1)}] = (1/√(2π)) ∫_{−∞}^∞ e^{λ(z²−1)} e^{−z²/2} dz = e^{−λ}/√(1 − 2λ),

whereas for λ > 1/2 we have E[e^{λ(X−1)}] = +∞. Thus X is not sub-Gaussian. In fact one can
show, after some computation, that

e^{−λ}/√(1 − 2λ) ≤ e^{2λ²} = e^{4λ²/2} for all |λ| < 1/4.

This motivates the following alternative definition of sub-exponential random variables which
corresponds to the second definition for sub-Gaussian random variables in Definition 2.22.
♣
Remark 2.36 (a) The random variable X in Example 2.34 is sub-exponential with param-
eters (ν, α) = (2, 4).
(b) It is easy to see that our two definitions are equivalent when the random variable X has
zero mean µ = 0: use statement (v) of Proposition 2.28. If µ 6= 0 and X satisfies (2.10),
we obtain the following tail bound

P(|X| ≥ t) ≤ 2 exp( −(t − µ)²/(2ν²) ) for all t ≥ 0. (2.11)
This tail bound is not exactly the statement (i) of Proposition 2.28 as the parameter t ≥ 0
is chosen with respect to the mean. In most cases, one is solely interested in tail estimates
away from the mean. Thus the definitions are equivalent in case µ 6= 0 if we limit the
range for parameter t to t ≥ |µ|. In the literature and in applications Definition 2.35
is widely used and sometimes called sub-exponential for centred random variables. We
use both definitions synonymously.
Example 2.37 (Bounded random variable) Let X be a real-valued, mean-zero and bounded
random variable supported on the compact interval [a, b], a < b. Furthermore, let X′ be an
independent copy of X and let ε be a Rademacher function independent of both X and X′,
that is, ε is a symmetric Bernoulli random variable with P(ε = −1) = P(ε = 1) = 1/2.
Using Jensen's inequality for X′ we obtain

E_X[e^{λX}] = E_X[ e^{λ(X − E_{X′}[X′])} ] ≤ E_{X,X′}[ e^{λ(X−X′)} ] = E_{X,X′}[ E_ε[ e^{λε(X−X′)} ] ],

where we write E_{X,X′} for the expectation with respect to X and X′, E_ε for the expectation
with respect to ε, and where we used that (X − X′) ∼ ε(X − X′). Here, ∼ means that both
random variables have equal distribution. Hold α := (X − X′) fixed and compute

E_ε[e^{λεα}] = (1/2)[ e^{−λα} + e^{λα} ] = (1/2)( Σ_{k=0}^∞ (−λα)^k/k! + Σ_{k=0}^∞ (λα)^k/k! ) = Σ_{k=0}^∞ (λα)^{2k}/(2k)!
 ≤ 1 + Σ_{k=1}^∞ (λα)^{2k}/(2^k k!) = e^{λ²α²/2}.

Hence E_{X,X′}[e^{λ(X−X′)}] ≤ E_{X,X′}[ exp( λ²(X − X′)²/2 ) ] ≤ exp( λ²(b − a)²/2 ),

as |X − X′| ≤ b − a. ♣
|E[(X − µ)^k]| ≤ (1/2) k! σ² b^{k−2},  k = 2, 3, 4, . . . . (2.13)
Exercise 2.40 (a) Show that a bounded random variable X with |X − µ| ≤ b with variance
σ 2 > 0 satisfies the Bernstein condition (2.13) .
(b) Show that the bounded random variable X in (a) is sub-exponential and derive a bound
on the centred moment generating function
E[exp(λ(X − E[X]))] .
KK
Solution. (a) From our assumption we have E[(X − µ)²] = E[|X − µ|²] = σ² and
ess sup|X − µ|^{k−2} ≤ b^{k−2} ≤ b^{k−2} (1/2)k! for k ∈ N, k ≥ 2. Using Hölder's inequality we
obtain

E[ |X − µ|^{k−2} |X − µ|² ] ≤ E[|X − µ|²] ess sup|X − µ|^{k−2} ≤ σ² b^{k−2} (1/2)k!

for all k ∈ N, k ≥ 2.

(b) By power series expansion we have (using the Bernstein bound from (a))

E[e^{λ(X−µ)}] = 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k E[(X − µ)^k]/k! ≤ 1 + λ²σ²/2 + (λ²σ²/2) Σ_{k=3}^∞ (|λ|b)^{k−2},

and for |λ| < 1/b we can sum the geometric series to obtain

E[e^{λ(X−µ)}] ≤ 1 + (λ²σ²/2)/(1 − b|λ|) ≤ exp( (λ²σ²/2)/(1 − b|λ|) )

by using 1 + t ≤ e^t. Thus X is sub-exponential as we obtain

E[e^{λ(X−µ)}] ≤ exp( λ²(√2 σ)²/2 )

for all |λ| < 1/(2b).
2
Exercise 2.42 Restate property (v) of Proposition 2.17 in terms of the sub-Gaussian norm,
i.e., show that if E[X] = 0 then

E[exp(λX)] ≤ exp( Cλ² ‖X‖²_{ψ_2} ) for all λ ∈ R,

for some absolute constant C > 0.
Exercise 2.43 (Poisson distribution - various deviations) Let X ∼ Poi(λ), λ > 0. Then
the following holds.
Hint: Use the Poisson approximation in Theorem 1.36 in conjunction with the corresponding
concentration bounds for Bernoulli random variables. KK
We thus expect that the expectation of the Euclidean norm is approximately √n. We
will now see in a special case that the norm ‖X‖_2 is indeed very close to √n with
high probability.
Exercise 3.2 (Centering for sub-exponential random variables) Show the centering lemma
for sub-exponential random variables. This is an extension of Lemma 2.27 to sub-exponential
random variables: Let X be a real-valued sub-exponential random variable. Then
kX − E[X]kψ1 ≤ CkXkψ1 .
Exercise 3.3 (Bernstein's inequality) Let X_1, . . . , X_N be independent, mean-zero, sub-exponential
random variables. Show that, for every t ≥ 0,

P( | (1/N) Σ_{i=1}^N X_i | ≥ t ) ≤ 2 exp( − c min{ t²/K², t/K } N ),

for some absolute constant c > 0 and where K := max_{1≤i≤N} {‖X_i‖_{ψ_1}}. KK
Solution. Set S_N := (1/N) Σ_{i=1}^N X_i. As usual we start with

P(S_N ≥ t) ≤ e^{−λt} E[e^{λS_N}] = e^{−λt} ∏_{i=1}^N E[e^{(λ/N)X_i}],
and (v) in Proposition 2.28 implies that, writing X̃_i := X_i/N, there are c̃ > 0 and C > 0
such that

E[e^{λX̃_i}] ≤ exp( Cλ² ‖X̃_i‖²_{ψ_1} ) for |λ| ≤ c̃/K̃,

and ‖X̃_i‖²_{ψ_1} = (1/N²)‖X_i‖²_{ψ_1}, K̃ := max_{1≤i≤N}{‖X̃_i‖_{ψ_1}}. With σ̃² = Σ_{i=1}^N (1/N²)‖X_i‖²_{ψ_1} we
thus get

P(S_N ≥ t) ≤ exp( −λt + Cλ²σ̃² ) for |λ| ≤ c̃/K̃.

Define g(λ) := −λt + Cλ²σ̃². Then g′(λ) = −t + 2Cσ̃²λ. The zero of the derivative is
at λ = t/(2Cσ̃²). As long as t ≤ 2Cσ̃²c̃/K̃, this λ ≤ c̃/K̃ satisfies the constraint. For t > 2Cσ̃²c̃/K̃
we see that g′(λ) ≤ 0, and thus the function is monotonically decreasing and the
infimum will be attained at the upper bound for λ. Hence, optimising over λ one
obtains

λ = min{ t/(2Cσ̃²), c̃/K̃ }.

Inserting these values into the function g, we obtain

g(t/(2Cσ̃²)) = −t²/(4Cσ̃²) for t ≤ 2Cσ̃²c̃/K̃,

and for the other value we note that for t > (2Cσ̃²c̃)/K̃ we get
Remark 3.4 With high probability, e.g., with probability 0.99 (adjust the absolute constants
c > 0 and K > 0 accordingly), X stays within a constant distance from the sphere of radius
√n. S_n := ‖X‖²_2 has mean n and standard deviation O(√n):

Var(‖X‖²_2) = E[ ( ‖X‖²_2 − n )² ] = E[ Σ_{i,j=1}^n (X_iX_j)² − 2n‖X‖²_2 + n² ]
 = E[ Σ_{i,j=1}^n (X_iX_j)² ] − n² = Σ_{i=1}^n E[X_i⁴] + Σ_{i,j=1, i≠j}^n E[(X_iX_j)²] − n².

For any i ≠ j we have E[(X_iX_j)²] = E[X_i²]E[X_j²] = 1 (the X_i's are independent). Further-
more, we get E[X_i⁴] = O(1) because from (ii) in Proposition 2.17 we estimate

E[X_i⁴] = ‖X_i‖⁴_{L⁴} ≤ (2C)⁴ = 16C⁴ = O(1).

Thus Var(‖X‖²_2) ≤ 16C⁴n + n(n − 1) − n² = C′n = O(n), and therefore √(Var(‖X‖²_2)) =
O(√n). Hence ‖X‖_2 = √S_n deviates by O(1) around √n. Note that this follows from

√(n ± O(√n)) = √n √(1 ± O(1/√n)) = √n (1 ± O(1/√n)) = √n ± O(1).
We now apply the Bernstein inequality in Exercise 3.3 to obtain, for every u ≥ 0,

P( | (1/n)‖X‖²_2 − 1 | ≥ u ) ≤ 2 exp( − (cn/K⁴) min{u², u} ), (3.1)

where we used that K ≥ 1 implies K⁴ ≥ K². Note that for z ≥ 0 the inequality |z − 1| ≥ δ
implies the inequality

|z² − 1| ≥ max{δ, δ²}.

To see that, consider first z ≥ 1, which implies that z + 1 ≥ z − 1 ≥ δ and thus
|z² − 1| = |z − 1||z + 1| ≥ δ². For 0 ≤ z < 1 we have z + 1 ≥ 1 and thus |z² − 1| ≥ δ.
We apply this finding and (3.1) with u = max{δ, δ²} to obtain

P( | (1/√n)‖X‖_2 − 1 | ≥ δ ) ≤ P( | (1/n)‖X‖²_2 − 1 | ≥ max{δ, δ²} ) ≤ 2 exp( − (cn/K⁴) δ² ).

Setting t = δ√n, this reads

P( | ‖X‖_2 − √n | ≥ t ) ≤ 2 exp( − ct²/K⁴ ),

which shows that ‖X‖_2 − √n is sub-Gaussian. 2
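A minimal Python/NumPy sketch (sample sizes and dimensions are illustrative assumptions) of this
concentration for vectors with independent N(0, 1) coordinates:

import numpy as np

rng = np.random.default_rng(2)
for n in (100, 1_000, 4_000):
    X = rng.standard_normal((2_000, n))        # 2,000 samples of X in R^n
    norms = np.linalg.norm(X, axis=1)
    dev = norms - np.sqrt(n)
    print(f"n={n:5d}   mean |deviation from sqrt(n)| = {np.abs(dev).mean():.3f}   "
          f"std of the norm = {norms.std():.3f}")

The typical deviation of ‖X‖_2 from √n stays of order one while √n grows, in line with Theorem 3.1
and Remark 3.4.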
Solution.

P(‖X‖²_2 ≤ ε²n) = P(−‖X‖²_2 ≥ −ε²n) ≤ e^{λε²n} E[exp(−λ‖X‖²_2)] = e^{λε²n} ∏_{i=1}^n E[exp(−λX_i²)],

and inserting

E[exp(−λX_i²)] ≤ ∫_R e^{−λx_i²} |f_i(x)| dx ≤ ∫_R e^{−λx_i²} dx = √(π/λ),

we obtain

P(‖X‖²_2 ≤ ε²n) ≤ exp( λε²n − (n/2) log(λ/π) ) ≤ (Cε)^n,

where the last inequality follows by optimising over λ and getting λ = 1/(2ε²).
S^{(n−1)}_R := {x ∈ R^n : ‖x‖_2 = R}

is called the n-dimensional sphere with radius R around the origin, and S^{(n−1)}_R(a) := {x ∈
R^n : ‖x − a‖_2 = R}. If R = 1, we write S^{(n−1)} respectively S^{(n−1)}(a).

vol(B^{(n)}_R) = ( π^{n/2} / ( (n/2) Γ(n/2) ) ) R^n,

area(S^{(n−1)}_R) = ( 2π^{n/2} / Γ(n/2) ) R^{n−1}.
Example 3.6 n = 3, R = 1. Note that Γ(3/2) = (1/2)Γ(1/2) = (1/2)√π, and thus area(S^{(2)}) = 4π
and vol(B^{(3)}) = (4/3)π. ♣
It remains to determine the surface area area(S(n−1) ), that is, the surface integral
in (3.2) for general n ∈ N. In principle one can use the generalisation of the polar
coordinates from n = 3 to any higher dimensions. This is slightly elaborate, and we
therefore show a different and easier way to compute that area. For any n ∈ N we
have

I(n) := ∫_R · · · ∫_R e^{−(x_1² + · · · + x_n²)} dx_n · · · dx_1 = (√π)^n = π^{n/2}. (3.3)

Alternatively, we can compute I(n) in (3.3) using polar coordinates with differential
r^{n−1} dr and the change of variables t = r² in the integral,

I(n) = ∫_{S^{(n−1)}} dσ ∫_0^∞ e^{−r²} r^{n−1} dr = area(S^{(n−1)}) ∫_0^∞ e^{−r²} r^{n−1} dr
 = area(S^{(n−1)}) ∫_0^∞ e^{−t} t^{(n−1)/2} (1/2) t^{−1/2} dt (3.4)
 = area(S^{(n−1)}) (1/2) ∫_0^∞ e^{−t} t^{n/2 − 1} dt = area(S^{(n−1)}) (1/2) Γ(n/2).

Thus

area(S^{(n−1)}) = 2π^{n/2} / Γ(n/2). (3.5)
Notation 3.7 (Landau symbols) Asymptotic analysis is concerned with the behaviour of a
function f(n), n ∈ N, as n → ∞. Suppose f, g : N → R_+ (or R). We define the following
Landau symbols, called big-O and little-o.

• f(n) is O(g(n)) if there is a constant C > 0 such that f(n) ≤ C g(n) for all n ∈ N.

• f(n) is o(g(n)) if lim_{n→∞} f(n)/g(n) = 0.

• f(n) ∼ g(n) if lim_{n→∞} f(n)/g(n) = 1.
We now discuss briefly the fact that most of the volume of high-dimensional ob-
jects (sets with non-vanishing volume) is near the surface of that object. Let A ⊂ Rn
be a set with non-vanishing volume, i.e., vol(A) > 0, and pick ε > 0 small. Now we
shrink A by a small amount ε to produce
(1 − ε)A := {(1 − ε)x : x ∈ A} .
Then the following holds,
vol((1 − ε)A) = (1 − ε)^n vol(A). (3.6)
To see (3.6), partition the set A into infinitesimal cubes (for a Riemann sum approxi-
mate of the volume integral). Then, (1 − ε)A is the union of the set of cubes obtained
by shrinking the cubes of the partition of A by a factor (1 − ε). If we shrink each of
the 2n sides of an n-dimensional cube Q by (1 − ε), its volume vol((1 − ε)Q) shrinks
by the factor (1 − ε)n . Using that 1 − x ≤ e−x , we get the following estimate of the
ratio of the volumes:
vol((1 − ε)A) / vol(A) = (1 − ε)^n ≤ e^{−nε}. (3.7)
Thus nearly all of the volume of A must be in the portion of A that does not belong
to the region (1 − ε)A. For the unit ball B(n) we have at least a (1 − e−εn ) fraction of
the volume vol(B(n) ) of the unit ball concentrated in B(n) \ (1 − ε)B(n) , namely in a small
annulus of width ε at the boundary.
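For instance, in dimension n = 100 and with ε = 0.05, the shrunken set (1 − ε)A carries at most
e^{−5} ≈ 0.7% of the volume of A; more than 99% of the volume lies within relative distance 5%
of the boundary.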
Proposition 3.8 (Volume near the equator) For c ≥ 1 and n ≥ 3, at least a 1 − (2/c) e^{−c²/2}
fraction of the volume vol(B^{(n)}) of the unit ball has |x_1| ≤ c/√(n − 1). Here, the coordinate x_1
points to the north pole.
Proof. By symmetry it suffices to prove that at most a (2/c) e^{−c²/2} fraction of the half
of the ball with x_1 ≥ 0 has x_1 ≥ c/√(n − 1). Denote by H = {x ∈ B^{(n)} : x_1 ≥ 0} the
upper (northern) hemisphere and A = {x ∈ B^{(n)} : x_1 ≥ c/√(n − 1)}. We need
to show that the ratio of the volumes is bounded as

vol(A)/vol(H) ≤ (2/c) e^{−c²/2}. (3.8)

We prove (3.8) by obtaining an upper bound for vol(A) and a lower bound for vol(H).
To calculate the volume vol(A), integrate an incremental volume that is a disk of
width dx_1 and whose face is a ball of dimension n − 1 and radius √(1 − x_1²). The
surface area of the disk is (1 − x_1²)^{(n−1)/2} vol(B^{(n−1)}) and the volume above the slice is

vol(A) = ∫_{c/√(n−1)}^1 (1 − x_1²)^{(n−1)/2} vol(B^{(n−1)}) dx_1.

We obtain an upper bound by using 1 − x ≤ e^{−x}, integrating up to infinity and by
inserting x_1 √(n − 1)/c ≥ 1 into the integral. Then

vol(A) ≤ vol(B^{(n−1)}) ∫_{c/√(n−1)}^∞ (x_1 √(n − 1)/c) e^{−(n−1)x_1²/2} dx_1
 = vol(B^{(n−1)}) (√(n − 1)/c) (1/(n − 1)) e^{−c²/2} = ( vol(B^{(n−1)}) / (c√(n − 1)) ) e^{−c²/2}. (3.9)

The volume of the hemisphere below the plane x_1 = 1/√(n − 1) is a lower bound on the
entire volume vol(H), and this volume is at least that of a cylinder of height 1/√(n − 1) and
radius √(1 − 1/(n − 1)). The volume of the cylinder is

vol(B^{(n−1)}) (1 − 1/(n − 1))^{(n−1)/2} (1/√(n − 1)).

Using the fact that (1 − x)^a ≥ 1 − ax for a ≥ 1, the volume of the cylinder is at least

vol(B^{(n−1)}) / (2√(n − 1))

for n ≥ 3. Thus we obtain (3.8) from our bounds.
2
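For instance, for n = 101 and c = 2, Proposition 3.8 states that at least a 1 − e^{−2} ≈ 0.86 fraction
of the volume of the unit ball lies in the slab |x_1| ≤ 2/√100 = 0.2.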
We consider the orthogonality of two random vectors. Draw two points at random
from the unit ball B^{(n)} ⊂ R^n. With high probability their vectors will be nearly orthog-
onal to each other. To understand that, recall from our previous considerations that
most of the volume of the n-dimensional unit ball B^{(n)} is contained in an annulus of
width O(1/n) near the boundary (surface), that is, we pick ε = c/n, c ≥ 1, and have
from (3.7) that

vol((1 − ε)B^{(n)}) / vol(B^{(n)}) ≤ e^{−εn}.

Thus at least a (1 − e^{−εn}) fraction of vol(B^{(n)}) is concentrated in the annulus of width ε
at the boundary. Equivalently, using our result about the volume near the equator in
Proposition 3.8, if one vector points to the north pole, the other vector has projection
in this direction of only ±O(1/√n), and thus their dot/inner/scalar product will be of
order ±O(1/√n).
Proposition 3.9 Suppose that we sample N points X^{(1)}, . . . , X^{(N)} uniformly from the unit
ball B^{(n)}. Then with probability 1 − O(1/N) the following holds:

(a) ‖X^{(i)}‖_2 ≥ 1 − (2 log N)/n for i = 1, . . . , N;

(b) ⟨X^{(i)}, X^{(j)}⟩ ≤ (6 log N)/√(n − 1) for all i, j = 1, . . . , N, i ≠ j.

Proof. (a) For any i = 1, . . . , N, the probability that ‖X^{(i)}‖_2 < 1 − ε is less than e^{−εn}.
Thus

P( ‖X^{(i)}‖_2 < 1 − (2 log N)/n ) ≤ e^{−((2 log N)/n) n} = 1/N².

By the union bound, the probability that there exists an i ∈ {1, . . . , N} such that ‖X^{(i)}‖_2 <
1 − (2 log N)/n is at most 1/N.

(b) From Proposition 3.8 we know that the component X_1^{(i)} in the direction of the north
pole satisfies

P( |X_1^{(i)}| > c/√(n − 1) ) ≤ (2/c) e^{−c²/2}.
There are (N choose 2) pairs i and j, and for each pair we define X^{(i)} as the direction of the
and Σ(X) is a symmetric and positive-semidefinite matrix which admits the spectral
decomposition

Σ(X) = Σ_{i=1}^n s_i u_i u_i^T, (3.13)

where u_i ∈ R^n are the eigenvectors of Σ(X) for the eigenvalues s_i. The second
moment matrix allows the principal component analysis (PCA). We order the eigen-
values of Σ(X) according to their size: s_1 ≥ s_2 ≥ · · · ≥ s_n. For large values of the
dimension n one aims to identify a few principal directions. These directions corre-
spond to the eigenvectors with the largest eigenvalues. For example, suppose that
the first m eigenvalues are significantly larger than the remaining n − m ones. This
allows one to reduce the dimension of the given data to R^m by neglecting all contribu-
tions from directions with eigenvalues significantly smaller than the chosen principal
ones.
(b) Let X ∈ Rn be a random vector with mean µ and invertible covariance matrix Σ =
cov(X). Show that then Z := Σ−1/2 (X − µ) is an isotropic mean-zero random vector.
K
Proof. X is isotropic if and only if (E[X_iX_j])_{i,j=1,...,n} = (δ_{i,j})_{i,j=1,...,n} = 1l_n. For every
x ∈ R^n we have

E[⟨X, x⟩²] = Σ_{i,j=1}^n E[x_iX_i x_jX_j] = Σ_{i,j=1}^n x_i E[X_iX_j] x_j = ⟨x, Σ(X)x⟩ = ‖x‖²_2.

E[‖X‖²_2] = n.

E[⟨X, Y⟩²] = n.
Proof. For the first statement we view X T X as a 1 × 1 matrix and take advantage
of the cyclic property of the trace operation on matrices:
2
Suppose X ⊥ Y are isotropic vectors in R^n, and consider the normalised ver-
sions X̄ := X/‖X‖_2 and Ȳ := Y/‖Y‖_2. From the concentration results in this chap-
ter we know that with high probability, ‖X‖_2 ∼ √n, ‖Y‖_2 ∼ √n, and ⟨X, Y⟩ ∼ √n.
Thus, with high probability,

|⟨X̄, Ȳ⟩| ∼ 1/√n.

Thus, in high dimensions, independent and isotropic random vectors are almost or-
thogonal.
Exercise 3.16 X ∼ Unif(√n S^{(n−1)}) is isotropic but the coordinates X_i, i = 1, . . . , n, of X
are not independent due to the condition X_1² + · · · + X_n² = n. KK
Let e_1 and e_2 be the two unit basis vectors in R². It suffices to show E[⟨X, e_i⟩²] = 1, i =
1, 2, as any vector can be written as a linear combination of the two basis vectors.

The symmetric Bernoulli distribution is isotropic. This can be easily seen again by
checking E[⟨X, e_i⟩²] = 1 for all i = 1, . . . , n, or for any x ∈ R^n by checking that

E[⟨X, x⟩²] = E[ Σ_{i=1}^n X_i² x_i² ] + E[ 2 Σ_{1≤i<j≤n} X_iX_j x_i x_j ] = ‖x‖²_2,

as the second term vanishes because the X_i are independent mean-zero random
variables.

Any random vector X = (X_1, . . . , X_n) ∈ R^n with independent mean-zero coordi-
nates X_i with unit variance Var(X_i) = 1 is an isotropic random vector in R^n.
For the following recall the definition of the normal distribution. See also the
appendix sheets distributed at the beginning of the lecture for useful Gaussian calculus
formulae.
Definition 3.17 (Multivariate Normal / Gaussian distribution) We say a random vector
Y = (Y1 , . . . , Yn ) ∈ Rn has standard normal distribution in Rn , denoted
Y ∼ N(0, 1l_n),

U Y ∼ N(0, 1l_n).

‖Z‖²_2 = Z^T Z = Y^T U^T U Y = Y^T Y = ‖Y‖²_2.

Thus we have shown that Z has the same characteristic function/Laplace transform
as Y ∼ N(0, 1l_n), and therefore Z = U Y ∼ N(0, 1l_n). 2
For large values of n the standard normal distribution N(0, 1l_n) is not concen-
trated around the origin; instead it is concentrated in a thin spherical shell around
the sphere of radius √n around the origin (a shell with width of order O(1)). From
Theorem 3.1 we obtain for Y ∼ N(0, 1l_n),

P( | ‖Y‖_2 − √n | ≥ t ) ≤ 2 exp( −Ct² ), for all t ≥ 0

and an absolute constant C > 0. Therefore with high probability ‖Y‖_2 ≈ √n, and
thus with high probability,

Y ≈ √n Θ ∼ Unif(√n S^{(n−1)}),

with the unit direction vector Θ = Y/‖Y‖_2. Henceforth, with high probability,

N(0, 1l_n) ≈ Unif(√n S^{(n−1)}).
Proof. Let x ∈ S^{(n−1)}. We are using Proposition 2.24 for the sum of independent
sub-Gaussian random variables:

‖⟨X, x⟩‖²_{ψ_2} = ‖ Σ_{i=1}^n x_i X_i ‖²_{ψ_2} ≤ C Σ_{i=1}^n x_i² ‖X_i‖²_{ψ_2} ≤ C max_{1≤i≤n} {‖X_i‖²_{ψ_2}}   (Prop. 2.24),

where we used that Σ_{i=1}^n x_i² = 1.
2
Theorem 3.21 (Uniform distribution on the sphere) Let X ∈ R^n be a random vector uni-
formly distributed on the Euclidean sphere in R^n with centre at the origin and radius √n,
i.e.,

X ∼ Unif(√n S^{(n−1)}).

Then X is sub-Gaussian, and ‖X‖_{ψ_2} ≤ C for some absolute constant C > 0.
We will actually present two different proofs of this statement. The first uses
concentration properties whereas the second employs a geometric approach.
Proof of Theorem3.21 - Version I. See [Ver18] page 53-54. 2
Proof of Theorem3.21 - Version II.
For convenience we will work on the unit sphere, so let us rescale

Z := X/√n ∼ Unif(S^{(n−1)}).

It suffices to show that ‖Z‖_{ψ_2} ≤ C/√n, which by definition means that ‖⟨Z, x⟩‖_{ψ_2} ≤ C/√n
for all unit vectors x. By rotation invariance, all marginals ⟨Z, x⟩ have the same
distribution, and hence without loss of generality, we may prove our claim for x =
e_1 = (1, 0, . . . , 0) ∈ R^n. In other words, we shall show that ‖Z_1‖_{ψ_2} ≤ C/√n.
Remark 3.22 The so-called Projective Central Limit Theorem tells us that marginals of the
uniform distribution on the sphere in R^n become asymptotically normally distributed as the
dimension n increases. Namely, if X ∼ Unif(√n S^{(n−1)}), then for any fixed unit vector u ∈ R^n
we have

⟨X, u⟩ −→ N(0, 1) in distribution as n → ∞.
4 Random Matrices
4.1 Geometric concepts
(b) Covering number: The smallest possible cardinality of an ε-net of K is the covering
number of K and is denoted N(K, d, ε). Equivalently, N(K, d, ε) is the smallest number
of closed balls with centres in K and radii ε whose union covers K.
(c) ε-separated sets: A subset N ⊂ T is ε-separated if d(x, y) > ε for all distinct points
x, y ∈ N .
Lemma 4.2 Let (T, d) be a metric space. Suppose that N is a maximal ε separated subset
of K ⊂ T . Here maximal means that adding any new point x ∈ K to the set N destroys the
ε-separation property. Then N is an ε-net.
Lemma 4.3 (Equivalence of packing and covering numbers) Let (T, d) be a metric space.
For any K ⊂ T and any ε > 0, we have
N(K, k·k2 , ε) ≤ #N .
A + B := {a + b : a ∈ A, b ∈ B} .
Proposition 4.5 (Covering numbers of the Euclidean ball) (a) Let K ⊂ R^n and ε > 0.
Denote |K| the volume of K and denote B^{(n)} = {x ∈ R^n : ‖x‖_2 ≤ 1} the closed unit
Euclidean ball. Then

|K| / |εB^{(n)}| ≤ N(K, ‖·‖_2, ε) ≤ P(K, ‖·‖_2, ε) ≤ |K + (ε/2)B^{(n)}| / |(ε/2)B^{(n)}|.

(b)

(1/ε)^n ≤ N(B^{(n)}, ‖·‖_2, ε) ≤ (2/ε + 1)^n.
Proof.

(a) The middle inequality follows from Lemma 4.3.

Lower bound: Let N := N(K, ‖·‖_2, ε). Then we can cover K by N balls with radii ε,
so |K| ≤ N|εB^{(n)}|.

Upper bound: Let N := P(K, ‖·‖_2, ε) and construct N closed disjoint balls B_{ε/2}(x_i) with
centres x_i ∈ K and radii ε/2. These balls might not fit entirely into the set K, but certainly
into the extended set K + (ε/2)B^{(n)}. Thus

N |(ε/2)B^{(n)}| ≤ |K + (ε/2)B^{(n)}|.

(b) The statement follows easily with part (a) and is left as an exercise. 2
Remark 4.6 To simplify the bound in Proposition 4.5, note that in the nontrivial range ε ∈
(0, 1] we have

(1/ε)^n ≤ N(B^{(n)}, ‖·‖_2, ε) ≤ (3/ε)^n.
Example 4.7 (Euclidean balls - volume and surface area) Suppose R > 0 and denote B^{(n)}_R
the ball of radius R around the origin and S^{(n−1)}_R its surface, i.e.,

B^{(n)}_R := {x ∈ R^n : ‖x‖_2 ≤ R} and S^{(n−1)}_R := {x ∈ R^n : ‖x‖_2 = R}.

Then

vol(B^{(n)}_R) = |B^{(n)}_R| = ( π^{n/2} / Γ(n/2 + 1) ) R^n,
area(S^{(n−1)}_R) = |S^{(n−1)}_R| = ( 2π^{n/2} / Γ(n/2) ) R^{n−1},   (4.2)

where Γ is the Gamma function. ♣
Definition 4.8 (Hamming cube) The Hamming cube H is the set of binary strings of length
n, i.e.,

H = {0, 1}^n.

Define the Hamming distance d_H between two binary strings as the number of bits where
they disagree, i.e.,
Definition 4.10 Let A be an m × n matrix with real entries. The matrix A represents a
linear map R^n → R^m.

Equivalently,

‖A‖ = max_{x∈S^{(n−1)}, y∈S^{(m−1)}} {⟨Ax, y⟩}.

(b) The singular values s_i = s_i(A) of the matrix A are the square roots of the eigenvalues
of both AA^T and A^TA,

s_i(A) = √(λ_i(AA^T)) = √(λ_i(A^TA)),

where s_i = s_i(A) are the singular values of A, the vectors u_i ∈ R^m are the left singular
vectors, and the vectors v_i ∈ R^n are the right singular vectors of A.
Remark 4.11 (a) The extreme singular values s_1(A) and s_n(A) (respectively s_r(A)) are respectively the smallest number M and the largest number m such that
m‖x‖₂ ≤ ‖Ax‖₂ ≤ M‖x‖₂ for all x ∈ R^n.
Thus
s_n(A)‖x − y‖₂ ≤ ‖Ax − Ay‖₂ ≤ s_1(A)‖x − y‖₂ for all x, y ∈ R^n.
(b) In terms of its spectrum, the operator norm of A equals the largest singular value of A,
s_1(A) = ‖A‖.
Lemma 4.12 (Operator norm on a net) Let ε ∈ [0, 1) and A be an m × n matrix. Then, for any ε-net N of S^(n−1), we have
sup_{x∈N} ‖Ax‖₂ ≤ ‖A‖ ≤ 1/(1 − ε) · sup_{x∈N} ‖Ax‖₂.
Proof. The lower bound is trivial as N ⊂ S^(n−1). For the upper bound pick an x ∈ S^(n−1) for which ‖A‖ = ‖Ax‖₂ and choose x₀ ∈ N for this x such that ‖x − x₀‖₂ ≤ ε. Then
‖Ax − Ax₀‖₂ ≤ ‖A‖‖x − x₀‖₂ ≤ ε‖A‖.
The triangle inequality implies that
‖Ax₀‖₂ ≥ ‖Ax‖₂ − ‖Ax − Ax₀‖₂ ≥ (1 − ε)‖A‖,
and taking the supremum over x₀ ∈ N gives the upper bound. 2
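As a quick sanity check (not part of the notes), the following sketch assumes numpy and compares ‖A‖ = s_1(A), computed via the SVD as in Remark 4.11(b), with the supremum of ‖Ax‖₂ over an approximate ε-net built greedily from a dense random sample of the sphere (an assumption: with a finite sample the net is only approximate).

import numpy as np

rng = np.random.default_rng(1)
m, n, eps = 5, 3, 0.25

A = rng.normal(size=(m, n))
op_norm = np.linalg.svd(A, compute_uv=False)[0]        # s_1(A) = ||A||

sample = rng.normal(size=(20000, n))
sample /= np.linalg.norm(sample, axis=1, keepdims=True)
net = [sample[0]]
for x in sample[1:]:                                   # maximal eps-separated subset
    if np.min(np.linalg.norm(np.asarray(net) - x, axis=1)) > eps:
        net.append(x)
net = np.asarray(net)

sup_on_net = np.max(np.linalg.norm(net @ A.T, axis=1))
print("sup over net:", sup_on_net, " ||A||:", op_norm,
      " sup/(1-eps):", sup_on_net / (1 - eps))          # Lemma 4.12 sandwich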
Exercise 4.13 Let N be an ε-net of S^(n−1) and M be an ε-net of S^(m−1). Show that for any m × n matrix A one has
sup_{x∈N, y∈M} ⟨Ax, y⟩ ≤ ‖A‖ ≤ 1/(1 − 2ε) · sup_{x∈N, y∈M} ⟨Ax, y⟩.
Theorem 4.15 (Norm of sub-Gaussian random matrices) Let A be an m × n random matrix with independent mean-zero sub-Gaussian random entries A_ij, i = 1, . . . , m, j = 1, . . . , n. Then, for every t > 0, we have that
‖A‖ ≤ CK(√m + √n + t)
with probability at least 1 − 2 exp(−t²), where K := max_{1≤i≤m, 1≤j≤n} ‖A_ij‖_ψ2 and where C > 0 is an absolute constant.
Proof.
Step 1: Approximation. Fix ε = 1/4. By Remark 4.6 there is an ε-net N of S^(n−1) with |N| ≤ 9^n and an ε-net M of S^(m−1) with |M| ≤ 9^m. By Exercise 4.13,
‖A‖ ≤ 2 max_{x∈N, y∈M} ⟨Ax, y⟩.
Step 2: Concentration Pick x ∈ N and y ∈ M. We compute (using the fact that the
single matrix entries are sub-Gaussian random variables with their norm bounded
by K)
‖⟨Ax, y⟩‖²_ψ2 ≤ C Σ_{i=1}^m Σ_{j=1}^n ‖A_ij x_i y_j‖²_ψ2 ≤ CK² Σ_{i=1}^m Σ_{j=1}^n x_i² y_j² = CK²,
for some absolute constant C > 0. We therefore obtain a tail bound for hAx, yi, i.e.,
for every u ≥ 0,
P(hAx, yi ≥ u) ≤ 2 exp ( − cu2 /K 2 ) ,
for some absolute constant c > 0.
Step 3: Union bound We unfix the choice of x and y in Step 2 by a union bound.
P( max_{x∈N, y∈M} ⟨Ax, y⟩ ≥ u ) ≤ Σ_{x∈N, y∈M} P(⟨Ax, y⟩ ≥ u) ≤ 9^(n+m) · 2 exp(−cu²/K²).
We continue the estimate by choosing u = CK(√n + √m + t), which leads to the lower bound u² ≥ C²K²(n + m + t²), and furthermore adjust the constant C > 0 such that cu²/K² ≥ 3(n + m) + t². Inserting these choices we get
P( max_{x∈N, y∈M} ⟨Ax, y⟩ ≥ u ) ≤ 9^(n+m) · 2 exp(−3(n + m) − t²) ≤ 2 exp(−t²),
and thus
P(‖A‖ ≥ 2u) ≤ 2 exp(−t²).
2
Corollary 4.16 Let A be an n × n symmetric random matrix whose entries on and above the diagonal are independent mean-zero sub-Gaussian random variables. Then, for every t > 0, we have
‖A‖ ≤ CK(√n + t)
with probability at least 1 − 4 exp(−t²), where K := max_{1≤i,j≤n} ‖A_ij‖_ψ2.
Proof. Exercise. Decompose the matrix into an upper-triangular part and a lower-
triangular part and use the proof of the previous theorem. 2
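The following short simulation (an illustration, not part of the notes) assumes numpy and standard Gaussian entries, so K is an absolute constant; it shows that the observed operator norm is indeed of order √m + √n, as Theorem 4.15 predicts up to the absolute constant.

import numpy as np

rng = np.random.default_rng(2)
m, n, trials = 200, 100, 50

norms = [np.linalg.norm(rng.normal(size=(m, n)), ord=2) for _ in range(trials)]
print("mean ||A||        :", np.mean(norms))
print("sqrt(m) + sqrt(n) :", np.sqrt(m) + np.sqrt(n))
# Theorem 4.15: ||A|| <= C K (sqrt(m) + sqrt(n) + t) with probability >= 1 - 2 exp(-t^2)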
We can actually improve the statement in Theorem 4.15 in two ways. First we
obtain a two-sided bound, and, secondly, we can relax the independence assump-
tion. As this is a refinement of our previous statement we only state the result and
omit its proof. The statement is used in covariance estimation below.
Proof. (a) The proof follows similarly to the proof of Theorem 4.15 and is thus left as an exercise. To show that (4.4) indeed implies (4.3) we use Lemma 4.18 below.
(b) To obtain the bound for the expected operator norm of the difference of (1/m) A^T A to the identity, we use the integral identity (1.9) for the real-valued random variable ‖(1/m) A^T A − 1l_n‖. This calculation is quite long, requires some computational work for the different cases, and is omitted.
2
Proof. We first assume without loss of generality that kxk2 = 1. Then our assump-
tions give
max{δ, δ 2 } ≥ |h(AT A − 1ln )x, xi| = |kAxk22 − 1| .
For every z ≥ 0 the following elementary inequality holds:
max{|z − 1|, |z − 1|2 } ≤ |z 2 − 1| . (4.6)
To show (4.6) use that for z ≥ 1 we have |z −1|2 = |z 2 −2z +1| ≤ |z 2 −2+1| = |z 2 −1|,
and for z ∈ [0, 1), use |z − 1| ≤ |z 2 − 1| as z 2 ≤ z for z ∈ [0, 1). Then use (4.6) with
kAxk2 to conclude that
|kAxk2 − 1| ≤ δ .
This implies both statements of the lemma. 2
Note that X_i ∼ X implies that E[Σ_N] = Σ. The law of large numbers yields
Σ_N → Σ almost surely as N → ∞.
We bring X, X^(1), . . . , X^(N) all into isotropic position. That is, there exist independent isotropic random vectors Z, Z^(1), . . . , Z^(N) such that
X = Σ^(1/2) Z and X^(i) = Σ^(1/2) Z^(i), i = 1, . . . , N.
Then
‖Σ_N − Σ‖ = ‖Σ^(1/2) R_N Σ^(1/2)‖ ≤ ‖R_N‖ ‖Σ‖,
where
R_N := (1/N) Σ_{i=1}^N (Z^(i))(Z^(i))^T − 1l_n.
Suppose now that A is the N × n matrix whose rows are (Z^(i))^T. Then
(1/N) A^T A − 1l_n = R_N,
and Theorem 4.17 for A implies that
E[‖R_N‖] ≤ CK² ( √(n/N) + n/N ),
and therefore E[‖Σ_N − Σ‖] ≤ CK² ( √(n/N) + n/N ) ‖Σ‖.
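A minimal numerical sketch of this covariance estimation bound, assuming numpy and an arbitrarily chosen positive definite Σ (all parameters are illustrative): the operator-norm error of the empirical covariance decays at the rate √(n/N), matching the bound above up to constants.

import numpy as np

rng = np.random.default_rng(3)
n = 20
B = rng.normal(size=(n, n))
Sigma = B @ B.T / n                       # a "true" covariance matrix (assumption)
L = np.linalg.cholesky(Sigma)

for N in [50, 200, 800, 3200]:
    X = rng.normal(size=(N, n)) @ L.T     # rows X^(i) with covariance Sigma
    Sigma_N = X.T @ X / N                 # empirical second-moment matrix
    err = np.linalg.norm(Sigma_N - Sigma, ord=2)
    print(f"N={N:5d}  ||Sigma_N - Sigma|| = {err:.4f}   sqrt(n/N) = {np.sqrt(n/N):.4f}")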
5 Concentration of measure - general case
5.1 Concentration by entropic techniques
Definition 5.1 (Entropy) The entropy of the random variable X for the convex function ϕ is
H_ϕ(X) = E[ϕ(X)] − ϕ(E[X]).
(c) ϕ(u) = u log u, u > 0, and ϕ(0) := 0. The function ϕ is a convex function on R₊ and continuous when we set 0 log 0 = 0. For any non-negative random variable Z ≥ 0, the ϕ-entropy is
H_ϕ(Z) = E[Z log Z] − E[Z] log E[Z].
♣
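The next sketch (an illustration, not part of the notes) assumes numpy and estimates the ϕ-entropy H(Z) = E[Z log Z] − E[Z] log E[Z] by Monte Carlo for Z = e^(λX) with X ∼ N(0, σ²); the closed-form comparison value is the one derived in Example 5.9 below.

import numpy as np

rng = np.random.default_rng(4)

def phi_entropy(z):
    # H_phi(Z) = E[Z log Z] - E[Z] log E[Z] for a non-negative sample z
    ez = z.mean()
    return np.mean(z * np.log(z)) - ez * np.log(ez)

lam, sigma = 0.7, 1.3
x = sigma * rng.normal(size=2_000_000)
z = np.exp(lam * x)                          # Z = e^{lambda X}

mgf = np.exp(lam**2 * sigma**2 / 2)          # M_X(lambda) for the Gaussian
print("Monte Carlo H(e^{lam X})  :", phi_entropy(z))
print("closed form (Example 5.9) :", 0.5 * lam**2 * sigma**2 * mgf)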
In the following we will drop the index ϕ for the entropy whenever we take ϕ(u) = u log u as in Example 5.3 (c). There are several reasons why this choice is particularly useful. In the next remark we show some connections of this entropy to other entropy concepts in probability theory.
Remark 5.4 Suppose that Ω is a finite sample space, and let p, q ∈ M1 (Ω) be two probabil-
ity measures (vectors) such that q(ω) = 0 implies p(ω) = 0. Define
X : Ω → R, ω ↦ X(ω) := p(ω)/q(ω) if q(ω) > 0, and X(ω) := 0 if q(ω) = p(ω) = 0.
Then, taking expectations with respect to q (note that E[X] = Σ_{ω∈Ω} p(ω) = 1),
H(X) = Σ_{ω∈Ω} X(ω) q(ω) log X(ω) − E[X] log E[X] = Σ_{ω∈Ω} p(ω) log( p(ω)/q(ω) ) =: H(p|q) =: D(p‖q),
where H(p|q) is the relative entropy of p with respect to q, a widely used functional in probability theory (e.g. large deviation theory) and information theory, and where D(p‖q) is the Kullback-Leibler divergence of p and q used in information theory.
Definition 5.5 Let Ω be a finite (respectively discrete) sample space and denote M1 (Ω) the
set of probability measures (vectors).
(a) The relative entropy with respect to q ∈ M1 (Ω) is defined as the mapping
H(·|q) : M₁(Ω) → [0, ∞]; p ↦ H(p|q) := Σ_{ω∈Ω} p(ω) log( p(ω)/q(ω) ) if q(ω) = 0 implies p(ω) = 0, and H(p|q) := +∞ otherwise.
(b) The Shannon entropy of an Ω-valued random variable X with probability density (distribution) p ∈ M₁(Ω) is defined as
H(X) ≡ H(p) = − Σ_{ω∈Ω} p(ω) log p(ω).
Lemma 5.6 Let Ω be a finite sample space and denote by M₁(Ω) the set of probability measures (vectors). Let q ∈ M₁(Ω). Then the relative entropy H(·|q) is strictly convex, continuous, and
H(p|q) = 0 ⇔ p = q.
Proof. Exercise. 2
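A small numerical sketch (not part of the notes, numpy assumed, with arbitrarily chosen vectors p and q) of the two quantities just defined; it illustrates that H(p|q) > 0 for p ≠ q and that the Shannon entropy is bounded by log|Ω|, as in Exercise 5.7(b) below.

import numpy as np

def relative_entropy(p, q):
    # H(p|q) = sum p log(p/q); +infinity unless q(w)=0 implies p(w)=0
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def shannon_entropy(p):
    # H(p) = -sum p log p, with the convention 0 log 0 = 0
    p = np.asarray(p, float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

p, q = [0.5, 0.25, 0.25], [1/3, 1/3, 1/3]
print("H(p|q) =", relative_entropy(p, q))
print("H(p)   =", shannon_entropy(p), " <= log|Omega| =", np.log(3))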
Exercise 5.7 Let Ω be a finite sample space and denote by M₁(Ω) the set of probability measures (vectors). Let X be an Ω-valued random variable with distribution p ∈ M₁(Ω). Its Shannon entropy is then H(X) = − Σ_{ω∈Ω} p(ω) log p(ω). We shall explore the connection between the entropy functional H_ϕ with ϕ(u) = u log u and the Shannon entropy:
(a) Consider the random variable Z := p(U), where U ∼ Unif(Ω) is uniformly distributed over Ω. Show that
H_ϕ(Z) = (1/|Ω|) ( log|Ω| − H(X) ).
(b) Use part (a) and Lemma 5.6 to show that Shannon entropy for a discrete random variable
is maximised by a uniform distribution.
The following example motivates the definition of the Shannon entropy and can be skipped on first reading.
Example 5.8 (Shannon entropy) Let Ω be a finite sample space. Let X^(i), i = 1, . . . , N, be independent identically distributed Ω-valued random variables and write X = (X^(1), . . . , X^(N)). The empirical measure of the sample vector X ∈ Ω^N is then
L^X_N = (1/N) Σ_{i=1}^N δ_{X^(i)}.
For µ ∈ M₁(Ω) with Nµ(x) ∈ N₀ for all x ∈ Ω, the number of samples with empirical measure µ is
N_N(µ) = #{ω ∈ Ω^N : L^ω_N = µ} = N! / Π_{x∈Ω} (Nµ(x))!,
where the second equality follows from the multinomial coefficient. For any µ ∈ M₁(Ω) pick µ_N ∈ M₁(Ω) with Nµ_N(x) ∈ N₀, x ∈ Ω, such that µ_N → µ as N → ∞. Then, using Stirling's formula, one can show that
H(µ) = lim_{N→∞} (1/N) log N_N(µ_N).    (5.1)
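A quick numerical check of (5.1), assuming numpy and Python's log-Gamma function (the measure µ is an arbitrary illustrative choice): the normalised log-counts approach the Shannon entropy as N grows.

import numpy as np
from math import lgamma

def log_multinomial(counts):
    # log( N! / prod_x (N mu(x))! ) via the log-Gamma function
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)

mu = np.array([0.5, 0.3, 0.2])
print("Shannon entropy H(mu):", -np.sum(mu * np.log(mu)))
for N in [10, 100, 1000, 10000]:
    counts = np.round(mu * N).astype(int)
    counts[0] += N - counts.sum()            # force the counts to sum to N
    print(f"N={N:6d}  (1/N) log N_N(mu_N) = {log_multinomial(counts)/N:.4f}")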
Example 5.9 (Entropy - Gaussian random variable) Suppose X ∼ N(0, σ²), σ > 0. Then M_X(λ) = exp(λ²σ²/2) and M'_X(λ) = λσ² M_X(λ), and therefore
H(e^(λX)) = λ²σ² M_X(λ) − (λ²σ²/2) M_X(λ) = (1/2) λ²σ² M_X(λ).
♣
A bound on the entropy leads to a bound on the centred moment generating function Φ, see (2.3); this is the content of the so-called Herbst argument.
Proposition 5.10 (Herbst argument) Let X be a real-valued random variable and suppose that, for some σ > 0,
H(e^(λX)) ≤ (1/2) σ²λ² M_X(λ)
for λ ∈ I, with the interval I being either [0, ∞) or R. Then
log E[ exp(λ(X − E[X])) ] ≤ (1/2) λ²σ² for all λ ∈ I.    (5.3)
Remark 5.11 (a) If I = R, then the bound (5.3) is equivalent to asserting that the centred random variable X − E[X] is sub-Gaussian with parameter σ > 0.
(b) For I = [0, ∞), the bound (5.3) leads immediately to the one-sided tail estimate
P(X ≥ E[X] + t) ≤ exp( − t²/(2σ²) ) for all t ≥ 0.
Proof of Proposition 5.10. Suppose that I = [0, ∞). Using (5.2), our assumption turns into a differential inequality for the moment generating function,
λ M'_X(λ) − M_X(λ) log M_X(λ) ≤ (1/2) σ²λ² M_X(λ) for all λ ≥ 0.    (5.4)
Define now a function G(λ) := (1/λ) log M_X(λ) for λ ≠ 0, and then extend the function to 0 by continuity (L'Hospital's rule),
G(0) := lim_{λ→0} G(λ) = d/dλ log M_X(λ) |_{λ=0} = E[X].
Our assumptions on the existence of the MGF imply differentiability with respect to the parameter λ. Hence
G'(λ) = (1/λ) M'_X(λ)/M_X(λ) − (1/λ²) log M_X(λ),
and thus we can rewrite our differential inequality (5.4) as
G'(λ) ≤ (1/2) σ² for all λ ∈ I = [0, ∞).
For any λ₀ > 0 we can integrate both sides of the previous inequality over [λ₀, λ] to arrive at
G(λ) − G(λ₀) ≤ (1/2) σ² (λ − λ₀).
Now, letting λ₀ ↓ 0 and using the above definition of G(0), we get
G(λ) − E[X] = (1/λ) log M_X(λ) − (1/λ) log e^(λE[X]) ≤ (1/2) σ²λ,
and we conclude with the statement in (5.3). 2
Proposition 5.12 (Bernstein entropy bound) Suppose there exist B > 0 and σ > 0 such that the real-valued random variable X satisfies the following entropy bound.
Then
log E[ e^(λ(X−E[X])) ] ≤ σ²λ² (1 − Bλ)^(−1) for all λ ∈ [0, B^(−1)).    (5.5)
Remark 5.13 As a consequence of the Chernoff argument, Proposition 5.12 implies that X satisfies the upper tail bound
P(X ≥ E[X] + δ) ≤ exp( − δ²/(4σ² + 2Bδ) ) for all δ ≥ 0.    (5.6)
Proof of Proposition 5.12. We skip the proof of this statement as it employs similar
techniques as in the proof of Proposition 5.10.
2
So far, the entropic method has not provided substantial new insight as all con-
centration results are done via the usual Chernoff bound. We shall now study con-
centration for functions of many random variables.
Definition 5.15 (a) A function f : R^n → R is separately convex if, for every index k ∈ {1, . . . , n}, the univariate function
f_k : R → R, y_k ↦ f_k(y_k) := f(x₁, . . . , x_{k−1}, y_k, x_{k+1}, . . . , x_n),
is convex for every fixed choice of (x₁, . . . , x_{k−1}, x_{k+1}, . . . , x_n) ∈ R^(n−1).
(b) A function f : X → Y for metric spaces (X, dX ) and (Y, dY ) is Lipschitz continuous
(sometimes called locally Lipschitz continuous ) if for every x ∈ X there exists a
neighbourhood U ⊂ X such that f |U is globally Lipschitz continuous. Here f |U : U →
Y is the restriction of f to U .
(c) A function f : X → Y for metric spaces (X, d_X) and (Y, d_Y) is L-Lipschitz continuous (sometimes called globally Lipschitz continuous) if there exists L ∈ R such that
d_Y(f(x), f(y)) ≤ L d_X(x, y) for all x, y ∈ X.    (5.7)
The smallest constant L > 0 satisfying (5.7) is denoted ‖f‖_Lip. In the following some statements hold for global Lipschitz continuity and some only for local Lipschitz continuity.
|‖M‖ − ‖M'‖| ≤ ‖M − M'‖ ≤ ‖M − M'‖_F,
where ‖M‖_F := ( Σ_{i=1}^n Σ_{j=1}^d M_ij² )^(1/2) is the Euclidean norm of all entries of the matrix, called the Frobenius norm of the matrix M. The first inequality shows that f := ‖·‖ is 1-Lipschitz continuous. Thus Theorem 5.16 implies that, for every t ≥ 0,
P(‖M‖ ≥ E[‖M‖] + t) ≤ e^(−t²/16).
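The following simulation sketch (not part of the notes, numpy assumed, entries uniform on [−1, 1] as an illustrative bounded choice) compares the empirical upper-tail probability of ‖M‖ with the bound e^(−t²/16) above.

import numpy as np

rng = np.random.default_rng(5)
n, d, trials = 80, 60, 300

norms = np.array([np.linalg.norm(rng.uniform(-1, 1, size=(n, d)), ord=2)
                  for _ in range(trials)])
t = 3.0
print("mean ||M||                    :", norms.mean())
print("empirical P(||M|| >= mean + t):", np.mean(norms >= norms.mean() + t))
print("bound exp(-t^2/16)            :", np.exp(-t**2 / 16))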
Key ingredients for the proof of Theorem 5.16 are the following two lemmas.
Lemma 5.18 (Entropy bound for univariate functions) Let X and Y be two independent, identically distributed R-valued random variables. Denote by E_{X,Y} the expectation with respect to X and Y. For any function g : R → R the following statements hold:
(a)
H(e^(λg(X))) ≤ λ² E_{X,Y}[ (g(X) − g(Y))² e^(λg(X)) 1l{g(X) ≥ g(Y)} ] for all λ > 0.
(b) If in addition the random variable X is supported on [a, b], a < b, and the function g is convex and Lipschitz continuous, then
H(e^(λg(X))) ≤ λ²(b − a)² E[ (g'(X))² e^(λg(X)) ] for all λ > 0.
is still a function of X̄^k and is integrated with respect to E on the right-hand side of (5.9). We first finish the proof of Theorem 5.16.
Proof of Theorem 5.16. For k ∈ {1, . . . , n} and every fixed vector X̄^k ∈ R^(n−1) the function f_k is convex, and hence Lemma 5.18 implies, for all λ > 0 and every fixed vector X̄^k,
H( e^(λf_k(X_k)) | X̄^k ) ≤ λ²(b − a)² E_{X_k}[ (f'_k(X_k))² e^(λf_k(X_k)) | X̄^k ]
= λ²(b − a)² E_{X_k}[ ( ∂f/∂x_k (X₁, . . . , X_k, . . . , X_n) )² exp(λf(X₁, . . . , X_k, . . . , X_n)) | X̄^k ].
With Lemma 5.19 one obtains, writing X = (X₁, . . . , X_n),
H( e^(λf(X)) ) ≤ λ²(b − a)² E[ Σ_{k=1}^n E_{X_k}[ ( ∂f/∂x_k (X) )² exp(λf(X)) | X̄^k ] ]
= λ²(b − a)² E[ Σ_{k=1}^n ( ∂f/∂x_k (X) )² exp(λf(X)) ]
≤ λ²(b − a)² L² E[ e^(λf(X)) ],
where the equality follows from the fact that the single coordinates X_i, i = 1, . . . , n, are independent and thus E = E_{X₁} ⊗ · · · ⊗ E_{X_n}, and where we used the Lipschitz continuity of f leading to
‖∇f(X)‖₂² = Σ_{k=1}^n ( ∂f/∂x_k (X) )² ≤ L² almost surely.
The tail bound then follows from the Herbst argument in Proposition 5.10. 2
Proof of Lemma 5.18. Using the fact that X and Y are independent and identically distributed we have
H(e^(λg(X))) = E_X[λg(X)e^(λg(X))] − E_X[e^(λg(X))] log E_Y[e^(λg(Y))].
By Jensen's inequality,
log E_Y[e^(λg(Y))] ≥ E_Y[λg(Y)],
and thus, using the symmetry between X and Y, we obtain (we write E_{X,Y} for the expectation with respect to both X and Y when we want to distinguish expectations with respect to the single random variables; note that we can easily replace E_{X,Y} by E),
H(e^(λg(X))) ≤ E_X[λg(X)e^(λg(X))] − E_{X,Y}[e^(λg(X)) λg(Y)]
= (1/2) E_{X,Y}[ λ(g(X) − g(Y))(e^(λg(X)) − e^(λg(Y))) ]    (5.10)
= λ E[ (g(X) − g(Y))(e^(λg(X)) − e^(λg(Y))) 1l{g(X) ≥ g(Y)} ],
where the last equality follows by symmetry. Next we use the elementary estimate, for s ≥ t,
e^s − e^t = e^s(1 − e^(t−s)) ≤ e^s(1 − (1 + (t − s))) = e^s(s − t).
Applying this bound with s = λg(X) and t = λg(Y) to the inequality (5.10) yields
H(e^(λg(X))) ≤ λ² E[ (g(X) − g(Y))² e^(λg(X)) 1l{g(X) ≥ g(Y)} ].
Proof of Lemma 5.19. The key ingredient is the variational representation of the entropy (5.14), see Proposition 5.20 and Remark 5.21 below: it suffices to bound E[g(X)e^(λf(X))] for functions g with E[e^(g(X))] ≤ 1. Any such g can be decomposed as g(X) = Σ_{k=1}^n g^k(X_k, . . . , X_n) with
E[ exp(g^k(X_k, . . . , X_n)) | X_{k+1}, . . . , X_n ] = 1.
We use this decomposition within the variational representation (5.11), leading to the following chain of upper bounds:
E[g(X)e^(λf(X))] ≤ Σ_{k=1}^n E[ g^k(X_k, . . . , X_n) e^(λf(X)) ]    (5.12)
= Σ_{k=1}^n E_{X̄^k}[ E_{X_k}[ g^k(X_k, . . . , X_n) e^(λf(X)) | X̄^k ] ]
≤ Σ_{k=1}^n E_{X̄^k}[ H( e^(λf_k(X_k)) | X̄^k ) ],
where the last step uses the variational representation (5.11) conditionally on X̄^k.
Proposition 5.20 (Duality formula of the Entropy) Let Y be a non-negative R-valued random variable defined on a probability space (Ω, F, P) such that E[ϕ(Y)] < ∞, where ϕ(u) = u log u for u ≥ 0. Then
H(Y) = sup_{g∈U} { E[gY] },    (5.13)
Remark 5.21 (Variational representation of the entropy) One can easily extend the variational formula (5.13) in Proposition 5.20 to the set G = {g : Ω → R : E[e^g] ≤ 1}, namely
H(Y) = sup_{g∈G} { E[gY] }.    (5.14)
The statement of Theorem 5.22 amounts to the following concentration result: for every t ≥ 0,
P( |f(X) − E[f(X)]| ≥ t ) ≤ 2 exp( − Ct² / ‖f‖²_Lip ),    (5.15)
for some absolute constant C > 0. We already know this statement for linear functions, see Theorem 3.21, which says that when X ∼ Unif(√n S^(n−1)) we have that X (or any linear map of X) is sub-Gaussian. To prove the extension to any nonlinear Lipschitz function we need two fundamental results, the so-called isoperimetric inequalities, which we can only state in order not to overload the lecture.
Definition 5.23 Suppose f : R^n → R is some function. Its level sets (or sub-level sets) are
L_f(c) := {x ∈ R^n : f(x) ≤ c}, c ∈ R.
The ε-neighbourhood (blow-up) of a set A ⊂ R^n is
A_ε := {x ∈ R^n : ∃ y ∈ A : ‖x − y‖₂ ≤ ε} = A + εB^(n).
Theorem 5.25 (Isoperimetric inequality on the sphere) Let ε > 0. Then, among all A ⊂
S(n−1) with given area σn−1 (A), the spherical caps minimise the area of the ε-neighbourhood
σn−1 (Aε ),
Aε := {x ∈ S(n−1) : ∃ y ∈ A : kx − yk2 ≤ ε} .
A spherical cap C(a, ε) centred at a point a ∈ S(n−1) is the set
C(a, ε) := {x ∈ S(n−1) : kx − ak2 ≤ ε} .
The following lemma is a crucial step in the proof of Theorem 5.22.
Lemma 5.26 (Blow-up) For any A ⊂ √n S^(n−1), denote by σ the normalised area on the sphere. If σ(A) ≥ 1/2, then, for every t ≥ 0,
σ(A_t) ≥ 1 − 2 exp(−ct²)
for some absolute constant c > 0.
Proof. For the hemisphere H = {x ∈ √n S^(n−1) : x₁ ≤ 0} we have σ(A) ≥ 1/2 = σ(H). The t-neighbourhood H_t of the hemisphere H is a spherical cap, and the isoperimetric inequality in Theorem 5.25 gives
σ(A_t) ≥ σ(H_t).
We continue as in our proof of Theorem 3.21, noting that the normalised measure σ is the uniform probability measure on the sphere, so that
σ(H_t) = P(X ∈ H_t).
Recall that in that context X ∼ Unif(√n S^(n−1)), and thus X is sub-Gaussian according to Theorem 3.21. Because of
H_t ⊃ { x ∈ √n S^(n−1) : x₁ ≤ t/√2 }
we have
σ(H_t) ≥ P(X₁ ≤ t/√2) = 1 − P(X₁ > t/√2) ≥ 1 − 2 exp(−ct²),
for some absolute constant c > 0. 2
Proof of Theorem 5.22. Without loss of generality we assume that ‖f‖_Lip = 1. Let M denote a median of f(X), that is,
P(f(X) ≤ M) ≥ 1/2 and P(f(X) ≥ M) ≥ 1/2.
The set A = {x ∈ √n S^(n−1) : f(x) ≤ M} is a (sub-)level set of f with P(X ∈ A) ≥ 1/2. Then Lemma 5.26 implies that P(X ∈ A_t) ≥ 1 − 2 exp(−Ct²) for some absolute constant C > 0. We claim that, for every t ≥ 0,
P(X ∈ A_t) ≤ P(f(X) ≤ M + t).    (5.16)
To see (5.16), note that X ∈ A_t implies ‖X − y‖₂ ≤ t for some point y ∈ A. By our definition of the set A, f(y) ≤ M. As ‖f‖_Lip = 1, we have |f(X) − f(y)| ≤ ‖X − y‖₂ ≤ t, and thus
f(X) ≤ f(y) + ‖X − y‖₂ ≤ M + t,
which proves (5.16). Combining, P(f(X) ≤ M + t) ≥ 1 − 2 exp(−Ct²).
We now repeat our argument for −f: we define à = {x ∈ √n S^(n−1) : f(x) ≥ M}. Then P(X ∈ Ã) ≥ 1/2 and thus P(X ∈ Ã_t) ≥ 1 − 2 exp(−Ct²). Now X ∈ Ã_t implies ‖X − y‖₂ ≤ t for some y ∈ à with f(y) ≥ M, hence f(X) ≥ f(y) − t ≥ M − t, and thus P(f(X) ≥ M − t) ≥ 1 − 2 exp(−C̃t²) for some absolute constant C̃ > 0.
Combining our two estimates we obtain
P( |f(X) − M| ≥ t ) ≤ 4 exp(−Ĉt²)
for some absolute constant Ĉ > 0, and thus the tail estimate shows that
‖f(X) − M‖_ψ2 ≤ C
for some absolute constant C > 0. To replace the median M by the expectation E[f(X)], note that although the median is not unique it is a fixed real number determined by the distribution of the random variable X and the function f. We use the centering Lemma 2.27 to get
‖f(X) − E[f(X)]‖_ψ2 ≤ C.
2
Definition 5.27 (Isoperimetric problem) Let (E, d) be a metric space and B(E) its Borel σ-algebra, let P ∈ M₁(E, B(E)) be some given probability measure and X an E-valued random variable.
Isoperimetric problem: Given p ∈ (0, 1) and t > 0, find the sets A with P(X ∈ A) ≥ p for which P(d(X, A) ≥ t) is maximal, where d(x, A) := inf{d(x, y) : y ∈ A}.
Concentration function: the concentration function of X is
α(t) := sup{ P(X ∉ A_t) : A ∈ B(E), P(X ∈ A) ≥ 1/2 }, t > 0,
where A_t is the t-blow-up of the set A, A_t = {x ∈ E : d(x, A) < t}. For a given function f : E → R denote the median of f(X) by M f(X).
There are many isoperimetric inequalities; we only mention the Gaussian isoperimetric inequality as it is widely used. Recall that the Gaussian (standard normal) distribution is given by the probability measure γ_n ∈ M₁(R^n, B(R^n)),
γ_n(A) = (2π)^(−n/2) ∫_A e^(−‖x‖₂²/2) dx₁ · · · dx_n, A ∈ B(R^n).
Theorem 5.28 (Gaussian isoperimetric inequality) Let ε > 0 be given. Among all A ⊂
Rn with fixed Gaussian measure γn (A), the half-spaces minimise the Gaussian measure
γn (Aε ) of the ε-neighbourhood Aε .
From this we can obtain the following concentration result. The proof is using
similar steps as done above, and we leave the details as exercise for the reader.
where σ² = ‖ Σ_{i=1}^N E[X_i²] ‖ is the norm of the matrix variance of the sum.
For the proof we shall introduce a few well-known facts about matrix calculus.
Definition 5.31 (a) For any symmetric n × n matrix X with eigenvalues λ_i = λ_i(X) and corresponding eigenvectors u_i, the function of a matrix for any given f : R → R is defined as the n × n matrix
f(X) := Σ_{i=1}^n f(λ_i) u_i u_i^T.
In principle one can prove the Matrix Bernstein inequality in Theorem 5.30 with these two results from matrix analysis. If X is a random positive-definite matrix, then Lieb's and Jensen's inequality imply that, for any fixed symmetric matrix H,
E[ Trace exp(H + log X) ] ≤ Trace exp( H + log E[X] ).
Lemma 5.32 (Lieb's inequality for random matrices) Let H be a fixed n × n symmetric matrix and Z be a random n × n symmetric matrix. Then
E[ Trace exp(H + Z) ] ≤ Trace exp( H + log E[e^Z] ).
The proof of Theorem 5.30 follows below and is based on the following bound of
the moment generating function (MGF).
E[ exp(λX) ] ≤ exp( g(λ) E[X²] ), where g(λ) = (λ²/2) / (1 − |λ|K/3), for |λ| < 3/K.
Proof of Lemma 5.33. For real z with |z| < 3 we can easily obtain the following estimate:
e^z ≤ 1 + z + (1 / (1 − |z|/3)) · z²/2.
This can easily be derived from the Taylor series of the exponential function in conjunction with the lower bound p! ≥ 2 · 3^(p−2) and the geometric series. Details are left to the reader. Now with z = λx, |x| ≤ K, |λ| < 3/K, we obtain
e^(λx) ≤ 1 + λx + g(λ)x²,
and then, after taking the expectation and using E[X] = 0, we arrive at
E[e^(λX)] ≤ 1 + g(λ)E[X²] ≤ exp( g(λ)E[X²] ).    2
Proposition 5.34 (Expectation bound via the Bernstein Theorem) Under all the assumptions in Theorem 5.30 we have the tail bound
P( ‖ Σ_{i=1}^N X_i ‖ ≥ t ) ≤ 2n exp( − (t²/2) / (σ² + Kt/3) ).
Then
E[ ‖ Σ_{i=1}^N X_i ‖ ] ≤ 2σ( √π + √log(2n) ) + (4/3) K (1 + log(2n)).    (5.18)
Proof. Define b := K/3. Then the right-hand side in Theorem 5.30 reads
2n exp( − t² / (2(σ² + bt)) ).
We will use a crude union-type bound on the tail probability itself by observing that either σ² ≤ bt or σ² ≥ bt. Define Z := Σ_{i=1}^N X_i. For every t ≥ 0,
P(‖Z‖ ≥ t) ≤ 2n exp( − t²/(2(σ² + bt)) ) ≤ 2n max{ exp(−t/(4b)), exp(−t²/(4σ²)) }
≤ 2n exp(−t/(4b)) + 2n exp(−t²/(4σ²)).
We shall combine this with the trivial inequality P(‖Z‖ ≥ t) ≤ 1. Thus
P(‖Z‖ ≥ t) ≤ 1 ∧ 2n exp(−t/(4b)) + 1 ∧ 2n exp(−t²/(4σ²)),
and
E[‖Z‖] = ∫_0^∞ P(‖Z‖ ≥ t) dt ≤ I₁ + I₂,
with I₁ = ∫_0^∞ 1 ∧ 2n exp(−t/(4b)) dt and I₂ = ∫_0^∞ 1 ∧ 2n exp(−t²/(4σ²)) dt. Solve 2n exp(−t₁/(4b)) = 1 to obtain t₁ = 4b log(2n), and then
I₁ = ∫_0^(t₁) 1 dt + ∫_(t₁)^∞ 2n exp(−t/(4b)) dt = t₁ + 4b = 4b(1 + log(2n)).
Solve 2n exp(−t₂²/(4σ²)) = 1 to obtain t₂ = 2σ√log(2n), and then
I₂ = ∫_0^(t₂) 1 dt + ∫_(t₂)^∞ 2n exp(−t²/(4σ²)) dt = t₂ + √2 σ ∫_(√(2 log(2n)))^∞ 2n e^(−t̃²/2) dt̃ ≤ 2σ( √log(2n) + √π ),
where we substituted t̃ = t/(√2 σ) and, for the inequality, used another transformation, namely x = t̃ − √(2 log(2n)); equivalently, we may use 2n e^(−t̃²/2) ≤ e^(−x²/2) for all t̃ ≥ √(2 log(2n)). We arrive at
E[‖Z‖] ≤ 2σ( √π + √log(2n) ) + 4b(1 + log(2n)).
2
Proof of Theorem 5.30.
Step 1: Define S := Σ_{i=1}^N X_i and λ_m(S) := max_{1≤i≤n} λ_i(S), the largest eigenvalue of S. Then ‖S‖ = max{λ_m(S), λ_m(−S)}. All eigenvalues of e^(λS) are positive, hence the maximal eigenvalue of e^(λS) is bounded by the sum of all eigenvalues, that is, by the trace of e^(λS). Henceforth, for λ > 0,
P(λ_m(S) ≥ t) ≤ e^(−λt) E[ e^(λ λ_m(S)) ] = e^(−λt) E[ λ_m(e^(λS)) ] ≤ e^(−λt) E[ Trace(e^(λS)) ].
Step 2: We now turn to bound the expectation of the trace using Lieb's inequality, in particular Lemma 5.32. We write the exponent of e^(λS) separately, namely splitting off the last term of the sum, writing
Trace( exp( Σ_{i=1}^(N−1) λX_i + λX_N ) ).    (5.20)
Applying Lemma 5.32 conditionally with H = Σ_{i=1}^(N−1) λX_i and Z = λX_N gives
E[ Trace(e^(λS)) ] ≤ E[ Trace exp( Σ_{i=1}^(N−1) λX_i + log Ê[e^(λX_N)] ) ],
where Ê is the conditional expectation with respect to X_N conditioned on X₁, . . . , X_(N−1). We repeat this application of Lemma 5.32 successively for λX_(N−1), . . . , λX₁ to arrive at
E[ Trace(e^(λS)) ] ≤ Trace exp( Σ_{i=1}^N log E_{X_i}[ e^(λX_i) ] ).
Step 3: Using Lemma 5.33 to bound each term, log E_{X_i}[e^(λX_i)] ⪯ g(λ) E[X_i²] in the semidefinite order for |λ| < 3/K, so that
E[ Trace(e^(λS)) ] ≤ Trace exp( g(λ) Σ_{i=1}^N E[X_i²] ) ≤ n exp( g(λ) σ² ).
The minimum in the exponent −λt + g(λ)σ² is attained for λ = t/(σ² + 2Kt/3). We finally conclude with
P(λ_m(S) ≥ t) ≤ n exp( − (t²/2)/(σ² + Kt/3) ),
and repeating our steps for λ_m(−S) we conclude with the statement of the theorem.
2
Exercise 5.35 (Concentration for the unit sphere) Let f be a Lipschitz function on the unit sphere S^(n−1). Show that for X ∼ Unif(S^(n−1)) one has
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
Equivalently, for every t ≥ 0, we have
P( |f(X) − E[f(X)]| ≥ t ) ≤ 2 exp( − cnt² / ‖f‖²_Lip ).
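A simulation sketch of this exercise (not part of the notes), assuming numpy; the 1-Lipschitz test function f(x) = ‖x − e₁‖₂ and the constant 1/2 in the displayed reference bound are illustrative choices, since the absolute constant c is not specified.

import numpy as np

rng = np.random.default_rng(6)
n, samples = 400, 20000

X = rng.normal(size=(samples, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # X ~ Unif(S^(n-1))

f = np.linalg.norm(X - np.eye(n)[0], axis=1)           # a 1-Lipschitz function (assumption)
dev = np.abs(f - f.mean())
for t in [0.05, 0.1, 0.2]:
    print(f"t={t:4.2f}  empirical P(|f - Ef| >= t) = {np.mean(dev >= t):.4f}   "
          f"2 exp(-n t^2 / 2) = {2*np.exp(-n*t**2/2):.4f}")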
In Theorem 5.29 we have a concentration result for Lipschitz images of normally distributed random vectors using the Gaussian isoperimetric inequality. One can find similar concentration results for other metric spaces. Recall the Hamming cube and its metric d_H in Definition 4.8. We can define the uniform distribution on the Hamming cube H as the probability measure P(A) := |A|/2^n for any subset A ⊂ H. If X ∼ Unif(H), then the coordinates X_i of X are Bernoulli distributed with parameter 1/2. Then one can obtain the following concentration result.
Theorem 5.36 (Concentration for the Hamming cube) Suppose X ∼ Unif(H) and f : H → R. Then
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
E = {αu : α ∈ R}.
Then the set on the right-hand side generates a subspace E with dimension dim(E) = 1. So any concentration result for G_{n,m} includes the concentration for the sphere as a special case. We need a metric and a probability measure for G_{n,m}. The distance
(metric) between subspaces E and F can be defined as the operator norm
d(E, F ) := kPE − PF k ,
Theorem 5.37 Suppose that X ∼ Unif(G_{n,m}) and let f : G_{n,m} → R be some function. Then
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
The proof of that theorem goes beyond what we can do in this lecture. It is based
on concentration results for the special orthogonal group and the fact that Gn,m can
be written as quotient of orthogonal groups.
5.4 Application - Johnson-Lindenstrauss Lemma
We now present the Johnson-Lindenstrauss Lemma for the N data points and prove the statement that the geometry of the given data is well preserved if we choose E to be a random subspace of dimension dim(E) = m ∼ log N, where ∼ means that in the limit N → ∞ the quotient of both sides converges to 1. Thus we consider random subspaces E ∼ Unif(G_{n,m}) and the space (G_{n,m}, d, P) with P being the uniform (Haar) measure. Note that we have the following invariance.
(a)
( E[ ‖Pz‖₂² ] )^(1/2) = √(m/n) ‖z‖₂.
Proof. (a) Without loss of generality we can assume that ‖z‖₂ = 1. This is possible as for z = 0 the statement holds trivially and for z ≠ 0 we can simply multiply by 1/‖z‖₂. Instead of a random projection P we can fix a projection and consider z ∼ Unif(S^(n−1)). According to the above mentioned invariance we can assume without loss of generality that P is the coordinate projection onto the first m coordinates in R^n. Thus
E[ ‖Pz‖₂² ] = E[ Σ_{i=1}^m z_i² ] = Σ_{i=1}^m E[z_i²] = m E[z₁²],
where the last equality follows from the fact that the coordinates z_i are identically distributed. As z ∼ Unif(S^(n−1)) we have
1 = Σ_{i=1}^n E[z_i²] = n E[z₁²],
and thus E[z₁²] = 1/n. Therefore
E[ ‖Pz‖₂² ] = m/n,
and thus (a).
(b) Define the function f : S^(n−1) → R by f(x) := ‖Px‖₂. Then f is a Lipschitz function with ‖f‖_Lip = 1:
|f(x) − f(y)| = | ‖Px‖₂ − ‖Py‖₂ | ≤ ‖P(x − y)‖₂ ≤ ‖x − y‖₂.
Statement (5.21) is not quite (b), but we can replace the expectation E[f(X)] by √(m/n) as follows. First note that due to Jensen's inequality
E[f(X)] ≤ ( E[f(X)²] )^(1/2) = √(m/n).
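A minimal Johnson-Lindenstrauss demonstration (not part of the notes, numpy assumed): synthetic data points in R^n are projected onto a random m-dimensional subspace, built here via a QR factorisation of a Gaussian matrix, and rescaled by √(n/m); the choice of m is illustrative. All pairwise distance ratios stay close to one.

import numpy as np

rng = np.random.default_rng(7)
n, N, m = 1000, 50, 24                      # ambient dimension, #points, target dim ~ log N

X = rng.normal(size=(N, n))                 # synthetic data (assumption)

Q, _ = np.linalg.qr(rng.normal(size=(n, m)))
P = Q.T                                     # orthonormal rows: projection onto a random subspace

Y = X @ P.T * np.sqrt(n / m)                # rescaled projection
d_orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
d_proj = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

iu = np.triu_indices(N, k=1)
ratios = d_proj[iu] / d_orig[iu]
print("distance ratios: min =", ratios.min(), " max =", ratios.max())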
For a finite set of data points H ⊂ R^n one applies the above concentration to the difference set
H − H = {x − y : x, y ∈ H}.
6 Basic tools in high-dimensional probability
6.1 Decoupling
For simplicity, we assume in the following that the random variables X_i have mean zero and unit variances,
E[X_i] = 0, Var(X_i) = 1, i = 1, . . . , n.
Then
E[⟨X, AX⟩] = Σ_{i,j=1}^n a_ij E[X_i X_j] = Σ_{i=1}^n a_ii = Trace(A).
We shall study concentration properties for chaos. This time we need to develop tools to overcome the fact that we have sums of not necessarily independent random variables. The idea is to use the decoupling technique, i.e., to study the following random bilinear form,
Σ_{i,j=1}^n a_ij X_i X'_j = X^T A X' = ⟨X, AX'⟩,    (6.1)
where X' = (X'₁, . . . , X'_n) is a random vector independent of X but with the same distribution as X. Obviously, the bilinear form in (6.1) is easier to study; e.g., when we condition on X' we simply obtain a linear form for the random vector X. The vector X' is called an independent copy of X, and conditioning on X',
⟨X, AX'⟩ = Σ_{i=1}^n c_i X_i with c_i = Σ_{j=1}^n a_ij X'_j,
is a random linear form for X depending on the conditioning on the independent copy X'.
Lemma 6.2 Let Y and Z be independent random vectors in R^n such that E_Y[Y] = E_Z[Z] = 0 (we write indices when we want to stress the expectation for a certain random variable). Then, for every convex function F : R^n → R, one has
E[F(Y)] ≤ E[F(Y + Z)].
We fix such a realisation of I until the end. The random variables (X_i)_{i∈I} and (X_j)_{j∈I^c} are independent and thus the distribution of the sum on the right-hand side will not change if we replace X_j by X'_j. Hence we replace X_j, j ∈ I^c, by X'_j on the right-hand side of (6.3) to get
E_X[ F(⟨X, AX⟩) ] ≤ E[ F( 4 Σ_{(i,j)∈I×I^c} a_ij X_i X'_j ) ].    (6.4)
where [n] = {1, . . . , n}. We now split the argument of the function F on the right-hand side of (6.5) into three terms,
Σ_{(i,j)∈[n]×[n]} a_ij X_i X'_j =: Y + Z₁ + Z₂,
with
Y := Σ_{(i,j)∈I×I^c} a_ij X_i X'_j, Z₁ := Σ_{(i,j)∈I×I} a_ij X_i X'_j, Z₂ := Σ_{(i,j)∈I^c×[n]} a_ij X_i X'_j.
We now condition on all random variables except (X'_j)_{j∈I} and (X_i)_{i∈I^c}. We denote the corresponding conditional expectation by Ẽ. The conditioning already fixes Y. Furthermore, Z₁ and Z₂ have zero conditional expectation. An application of Lemma 6.2 leads to
F(4Y) ≤ Ẽ[ F(4Y + 4Z₁ + 4Z₂) ],
and thus
E[F(4Y)] ≤ E[ F(4Y + 4Z₁ + 4Z₂) ].
This proves (6.5) and finishes the argument by taking the final expectation with respect to I. 2
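The following Monte Carlo sketch (not part of the notes) illustrates the decoupling inequality for a quadratic form with zero diagonal, assuming numpy, symmetric Bernoulli coordinates and the convex test function F(s) = max(s, 0); the matrix A is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(8)
n, trials = 30, 100000

A = rng.normal(size=(n, n))
np.fill_diagonal(A, 0.0)                 # decoupling concerns the off-diagonal part
F = lambda s: np.maximum(s, 0.0)         # a convex function F

X  = rng.choice([-1.0, 1.0], size=(trials, n))   # independent symmetric Bernoulli
Xp = rng.choice([-1.0, 1.0], size=(trials, n))   # independent copy X'

quad     = np.einsum('ti,ij,tj->t', X, A, X)     # <X, A X>
bilinear = np.einsum('ti,ij,tj->t', X, A, Xp)    # <X, A X'>
print("E[F(<X,AX>)]    :", F(quad).mean())
print("E[F(4<X,AX'>)]  :", F(4 * bilinear).mean())   # decoupled upper bound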
Lemma 6.5 (MGF of Gaussian chaos) Let X, X' ∼ N(0, 1l_n), X and X' independent, and let A be an n × n matrix. Then
E[ exp(λ⟨X, AX'⟩) ] ≤ exp( Cλ²‖A‖²_F ) for all |λ| ≤ C/‖A‖,
where C > 0 is an absolute constant.
Proof. We use the singular value decomposition of the matrix A, see Definition 4.10:
A = Σ_{i=1}^n s_i u_i v_i^T.
Thus
⟨X, AX'⟩ = Σ_{i=1}^n s_i ⟨u_i, X⟩⟨v_i, X'⟩ = Σ_{i=1}^n s_i Y_i Y'_i,
where Y = (⟨u₁, X⟩, . . . , ⟨u_n, X⟩) ∼ N(0, 1l_n) and Y' = (⟨v₁, X'⟩, . . . , ⟨v_n, X'⟩) ∼ N(0, 1l_n). Independence yields
E[ exp(λ⟨X, AX'⟩) ] = Π_{i=1}^n E[ exp(λ s_i Y_i Y'_i) ].
For each i ∈ {1, . . . , n} we first take the expectation with respect to Y', i.e., the conditional expectation holding the Y_i's fixed, and then the expectation over Y,
E[ exp(λ s_i Y_i Y'_i) ] = E[ E[ exp(λ s_i Y_i Y'_i) | Y ] ] = E[ exp(λ²s_i²Y_i²/2) ] ≤ exp(Cλ²s_i²), provided λ²s_i² ≤ C,
where the first identity uses the fact that the MGF of a standard Gaussian N is E[exp(µN)] = exp(µ²/2), and the final inequality follows from the fact that the Y_i are Gaussian and thus sub-Gaussian, therefore the square Y_i² is sub-exponential, and property (v) in Proposition 2.28 gives the bound. Hence
E[ exp(λ⟨X, AX'⟩) ] ≤ exp( Cλ² Σ_{i=1}^n s_i² ), provided λ² ≤ C / max_{1≤i≤n}{s_i²}.
We conclude with Σ_{i=1}^n s_i² = ‖A‖²_F and max_{1≤i≤n}{s_i} = ‖A‖. 2
Choosing µ = √(2C) λK, we can match our estimates to get
E_X[ exp(λ⟨X, AX'⟩) ] ≤ E_Y[ exp(µ⟨Y, AX'⟩) ] = exp( Cλ²K² ‖AX'‖₂² ).
We can now take the expectation with respect to X' on both sides and see that we have successfully replaced X by Y. We can now repeat the same procedure for X' and Y' to obtain our statement. 2
Proof of Theorem 6.4. Without loss of generality, K = 1. It suffices to show the one-sided tail estimate. Denote
p := P( ⟨X, AX⟩ − E[⟨X, AX⟩] ≥ t ).
Write
⟨X, AX⟩ − E[⟨X, AX⟩] = Σ_{i=1}^n a_ii (X_i² − E[X_i²]) + Σ_{i,j : i≠j} a_ij X_i X_j,
and thus the problem reduces to estimating diagonal and off-diagonal sums:
p ≤ P( Σ_{i=1}^n a_ii (X_i² − E[X_i²]) ≥ t/2 ) + P( Σ_{i,j : i≠j} a_ij X_i X_j ≥ t/2 ) =: p₁ + p₂.
(b) Recall that we have already proved a version with B = 1l_n in Theorem 3.1, namely,
‖ ‖X‖₂ − √n ‖_ψ2 ≤ CK².
Thus
P( | ‖BX‖₂ − ‖B‖_F | ≥ δ‖B‖_F ) ≤ 2 exp( − Cδ² ‖B‖²_F / (K⁴‖B‖²) ),
and setting t = δ‖B‖_F we obtain the tail estimate
P( | ‖BX‖₂ − ‖B‖_F | ≥ t ) ≤ 2 exp( − Ct² / (K⁴‖B‖²) ).
2
6.3 Symmetrisation
(a) εX and ε|X| are symmetric random variables and εX and ε|X| have the same distribu-
tion.
Proof.
Upper bound: Let (X'_i) be an independent copy of the random vectors (X_i). Since Σ_{i=1}^N X'_i has zero mean,
p := E[ ‖ Σ_{i=1}^N X_i ‖ ] ≤ E[ ‖ Σ_{i=1}^N X_i − Σ_{i=1}^N X'_i ‖ ] = E[ ‖ Σ_{i=1}^N (X_i − X'_i) ‖ ],
where the inequality stems from an application of Lemma 6.2, namely, if E[Z] = 0 then E[‖Y‖] ≤ E[‖Y + Z‖].
Since all (X_i − X'_i) are symmetric random vectors, they have the same distribution as ε_i(X_i − X'_i), see Exercise 6.11. An application of the triangle inequality and our assumptions concludes the upper bound
p ≤ E[ ‖ Σ_{i=1}^N ε_i(X_i − X'_i) ‖ ] ≤ E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ] + E[ ‖ Σ_{i=1}^N ε_i X'_i ‖ ] = 2 E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ].
Lower bound: Similarly,
E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ] ≤ E[ ‖ Σ_{i=1}^N ε_i(X_i − X'_i) ‖ ] = E[ ‖ Σ_{i=1}^N (X_i − X'_i) ‖ ]
≤ E[ ‖ Σ_{i=1}^N X_i ‖ ] + E[ ‖ Σ_{i=1}^N X'_i ‖ ] = 2 E[ ‖ Σ_{i=1}^N X_i ‖ ].
2
7 Random Processes
7.1 Basic concepts and examples
Definition 7.1 A random process is a collection X := (X_t)_{t∈T} of random variables X_t defined on the same probability space, indexed by the elements t of some set T.
(c) When the dimension of the index set is greater than one, e.g., T ⊂ Rn or T ⊂ Zn , we
often call the process a random field (Xt )t∈T .
(d) The most well-known continuous-time process is the standard Brownian motion X = (X_t)_{t≥0}, also called the Wiener process. We can characterise it as follows:
(i) The process has continuous paths, i.e., X : [0, ∞) → R, t ↦ X_t is almost surely continuous.
(ii) The increments are independent and satisfy X_t − X_s ∼ N(0, t − s) for all t ≥ s.
♣
Definition 7.3 Let (Xt )t∈T be a random process with E[Xt ] = 0 for all t ∈ T . The covari-
ance function of the process is defined as
Σ(t, s) := cov(Xt , Xs ) , t, s ∈ T .
Note that Σ(t, s) = E[Xt Xs ], t, s ∈ T , when the process X = (Xt )t∈T has zero mean,
i.e., E[Xt ] = 0 for all t ∈ T .
Example 7.4 (Increments of random walks) The increments of the discrete-time random walk in Example 7.2 (b) with E[Z_i²] = 1, i ∈ N, are d(n, m) = √(n − m) for all n, m ∈ N₀, n ≥ m, since
d(n, m)² = E[ (X_n − X_m)² ] = E[ ( Σ_{k=m+1}^n Z_k )² ] = Σ_{k,j=m+1}^n E[Z_k Z_j] = n − m,
where we used the fact that the Z_i's are independent with zero mean, i.e., E[Z_k Z_j] = δ_{k,j}, and our assumption E[Z_k²] = 1 for all k ∈ N. ♣
Definition 7.5 (Gaussian process) (a) A random process (X_t)_{t∈T} is called a Gaussian process if, for any finite subset T₀ ⊂ T, the random vector (X_t)_{t∈T₀} has a normal distribution. Equivalently, (X_t)_{t∈T} is called a Gaussian process if every finite linear combination
Σ_{t∈T₀} a_t X_t, a_t ∈ R,
is a normal random variable.
(b) Let Y ∼ N(0, 1l_n) and define
X_t := ⟨Y, t⟩, t ∈ T ⊂ R^n.
We call the random process (X_t)_{t∈T} the canonical Gaussian process in R^n.
For general processes, even if they are Gaussian, obtaining such bounds is highly
nontrivial. In this section we learn first how the expectation of the supremum of two
processes compare to each other. In Section 7.3 below we obtain a lower and upper
bound for expected supremum of a process.
Theorem 7.7 (Slepian's inequality) Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T, we have
(i) E[X_t²] = E[Y_t²], (ii) E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²].    (7.2)
Then, for every τ ∈ R,
P( sup_{t∈T} X_t ≥ τ ) ≤ P( sup_{t∈T} Y_t ≥ τ ),    (7.3)
and, consequently,
E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ].
Whenever (7.3) holds, we say that the process (Xt )t∈T is stochastically domi-
nated by the process (Yt )t∈T . The proof of Theorem 7.7 follows by combining the
two versions of Slepian’s inequality which we will discuss below. To prepare these
statements we introduce the method of Gaussian interpolation first.
Suppose T is finite, e.g., |T| = n. Let X = (X_t)_{t∈T} and Y = (Y_t)_{t∈T} be two Gaussian random vectors (without loss of generality we may assume that X and Y are independent). We define the Gaussian interpolation as the following random vector in R^n,
Z(u) := √u X + √(1 − u) Y, u ∈ [0, 1].    (7.4)
It is easy to see (following exercise) that the covariance matrix of the interpolation
interpolates linearly between the covariance matrices of X and Y .
For a function f : R^n → R, we shall study how the expectation E[f(Z(u))] varies with u ∈ [0, 1]. Suppose, for example, that f(x) := 1l{max_{1≤i≤n} x_i < τ}, x ∈ R^n. Then one can show that E[f(Z(u))] increases with u, leading to E[f(Z(1))] ≥ E[f(Z(0))], and henceforth
P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ),
leading to
P( sup_{t∈T} X_t ≥ τ ) ≤ P( sup_{t∈T} Y_t ≥ τ ).    (7.5)
The following three lemmas collect basic facts about Gaussian random variables; their proofs are left as an exercise (homework).
Lemma 7.9 (Gaussian integration by parts) Suppose X ∼ N(0, 1). Then, for any differentiable function f : R → R with E[|f'(X)|] < ∞,
E[ f'(X) ] = E[ X f(X) ].
Proof. Exercise for the reader. Solution see Lemma 7.2.3 in [Ver18]. 2
Lemma 7.11 (Gaussian interpolation) Consider two independent Gaussian random vectors X ∼ N(0, Σ^X) and Y ∼ N(0, Σ^Y) in R^n. Define the interpolation Gaussian random vector as
Z(u) := √u X + √(1 − u) Y, u ∈ [0, 1].    (7.6)
Then for any twice-differentiable function f : R^n → R, we have
d/du E[f(Z(u))] = (1/2) Σ_{i,j=1}^n ( Σ^X_ij − Σ^Y_ij ) E[ ∂²f/∂x_i∂x_j (Z(u)) ].    (7.7)
Proof. Exercise for the reader (homework - Example Sheet 4). Solution see Lemma 7.2.7
in [Ver18].
2
∂²f/∂x_i∂x_j (x) ≥ 0 for all i ≠ j.
Then
E[f(X)] ≥ E[f(Y)].
We can now prove Theorem 7.7 with the following results for Gaussian vectors.
Theorem 7.13 Let X and Y be two mean-zero Gaussian random vectors in R^n as in Lemma 7.12. Then, for every τ ∈ R, we have
P( max_{1≤i≤n} X_i ≥ τ ) ≤ P( max_{1≤i≤n} Y_i ≥ τ ).
Consequently,
E[ max_{1≤i≤n} X_i ] ≤ E[ max_{1≤i≤n} Y_i ].
Proof. The key idea is to use Lemma 7.12 for some appropriate approximation of the maximum. For that to work, let h : R → [0, 1] be a twice-differentiable approximation of the indicator function of the interval (−∞, τ), that is, h(t) ≈ 1l_(−∞,τ)(t), t ∈ R, with h'(t) ≤ 0 (h is a smooth non-increasing function). Define f(x) := h(x₁) · · · h(x_n) for x = (x₁, . . . , x_n) ∈ R^n. The second partial derivatives
∂²f/∂x_i∂x_j (x) = h'(x_i) h'(x_j) Π_{k≠i,j} h(x_k)
are non-negative. It follows from Lemma 7.12 that E[f(X)] = E[f(Z(1))] ≥ E[f(Y)] = E[f(Z(0))]. Thus, by the above approximation,
P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ),
which is the claimed tail bound. The bound on the expectations then follows from the tail bound via the integral identity for the expectation. 2
Theorem 7.14 (Sudakov-Fernique inequality) Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T, we have
E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²].
Then
E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ].
Proof. We shall prove the statement for Gaussian random vectors X, Y ∈ Rn with
the help of Theorem 7.13. The idea this time is to use an approximation of the
maximum itself and not for the indicator function. For β > 0, define
f(x) := (1/β) log( Σ_{i=1}^n e^(βx_i) ), x ∈ R^n.
Inserting the function into the Gaussian interpolation formula (7.7) we can obtain
after some tedious calculation that
d
E[f (Z(u))] ≤ 0 ,
du
and conclude with our statement. 2
Exercise: Show that
f(x) −→ max_{1≤i≤n} x_i as β → ∞.
Solution. Without loss of generality let exp(βx_k) = max_{1≤i≤n} exp(βx_i). Writing
Σ_{i=1}^n e^(βx_i) = e^(βx_k) ( 1 + Σ_{i≠k} exp(β(x_i − x_k)) ),
we obtain that
0 ≤ lim_{β→∞} (1/β) log( 1 + Σ_{i≠k} exp(β(x_i − x_k)) ) ≤ lim_{β→∞} (1/β) log(1 + n) = 0,
and hence f(x) → x_k = max_{1≤i≤n} x_i as β → ∞.
Definition 7.16 (Canonical metric) Suppose X = (X_t)_{t∈T} is a random process with index set T. We define the canonical metric of the process by
d(t, s) := ‖X_t − X_s‖_{L²} = ( E[ (X_t − X_s)² ] )^(1/2), t, s ∈ T.
We can now study the question whether we can evaluate the expectation E[sup_{t∈T} X_t] by using properties of the geometry, in particular covering numbers of the index metric space (T, d) equipped with the canonical metric of the process. Via this approach we obtain a lower bound on the expectation of the supremum.
Theorem 7.17 (Sudakov's minorisation inequality) Let X = (X_t)_{t∈T} be a mean-zero Gaussian process. Then, for any ε ≥ 0, we have
E[ sup_{t∈T} X_t ] ≥ Cε √( log N(T, d, ε) ),
where d is the canonical metric of the process, C > 0 an absolute constant and N(T, d, ε) the covering number for T (recall Definition 4.1).
Proof. Assume that N(T, d, ε) = N < ∞ is finite. (When T is not compact, one can show that the expectation of the supremum is infinite; we skip this step.) Let M be a maximal ε-separated subset of T. Then M is an ε-net according to Lemma 4.2, and thus |M| ≥ N. It suffices to show
E[ sup_{t∈M} X_t ] ≥ Cε √(log N).    (7.9)
To show (7.9), we compare the process (X_t)_{t∈M} with a simpler process (Y_t)_{t∈M},
Y_t := (ε/√2) G_t, G_t ∼ N(0, 1), G_t independent of G_s for all t ≠ s.
For t, s ∈ M fixed we have
E[ (X_t − X_s)² ] = d(t, s)² ≥ ε²,
while
E[ (Y_t − Y_s)² ] = (ε²/2) E[ (G_t − G_s)² ] = ε².
Thus
E[ (X_t − X_s)² ] ≥ E[ (Y_t − Y_s)² ] for all t, s ∈ M.
We now obtain with Theorem 7.14,
E[ sup_{t∈M} X_t ] ≥ E[ sup_{t∈M} Y_t ] = (ε/√2) E[ max_{i∈M} G_i ] ≥ Cε √(log N),
where we used Proposition 7.18 below. 2
Proposition 7.18 (Maximum of normally distributed random variables) Let Y_i ∼ N(0, 1), i = 1, . . . , N, be independent normally distributed real-valued random variables. Then the following holds.
(a) E[ max_{1≤i≤N} Y_i ] ≤ √(2 log N).
(b) E[ max_{1≤i≤N} |Y_i| ] ≤ √(2 log(2N)).
(c) E[ max_{1≤i≤N} Y_i ] ≥ C √(2 log N) for some absolute constant C > 0.
Taking β = √(2 log(2N)), we conclude with the statement.
(c) The lower bound is slightly more involved and needs some preparation:
(i) The Y_i's are symmetric (Gaussian), and hence, by symmetry, it suffices to bound E[max_{1≤i≤N} |Y_i|] from below.
To obtain a lower bound we exploit the fact that our Gaussian random variables Y_i are independent and identically distributed. Then, for every δ > 0, noting that 1 − (1 − P(|Y₁| > t))^N is the probability that at least one of the random variables |Y_i|, i = 1, . . . , N, is larger than t,
E[ max_{1≤i≤N} |Y_i| ] ≥ ∫_0^δ [ 1 − (1 − P(|Y₁| > t))^N ] dt ≥ δ [ 1 − (1 − P(|Y₁| > δ))^N ],
where the first inequality follows from the integral identity for the expectation and the second inequality is just the interval length times the minimal value of the integrand. Now, we obtain with the lower tail bound of the normal distribution (see Proposition 2.1),
P(|Y₁| > δ) = √(2/π) ∫_δ^∞ exp(−t²/2) dt ≥ (1/π) (1/δ − 1/δ³) exp(−δ²/2).
Now we choose δ = √(log N) with N large enough so that
P(|Y₁| > δ) ≥ 1/N,
and hence
E[ max_{1≤i≤N} |Y_i| ] ≥ δ [ 1 − (1 − 1/N)^N ] ≥ δ (1 − 1/e).
We conclude with statement (c) using δ = √(log N), 1 − 1/e ≥ 1/2, and step (i). 2
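A quick simulation (not part of the notes, numpy assumed, sample sizes chosen for illustration) showing that the expected maximum of N independent standard Gaussians grows like √(2 log N), as in Proposition 7.18.

import numpy as np

rng = np.random.default_rng(9)
trials = 2000
for N in [10, 100, 1000, 10000]:
    maxima = rng.normal(size=(trials, N)).max(axis=1)    # max of N standard Gaussians
    print(f"N={N:6d}  E[max Y_i] ~ {maxima.mean():.3f}   "
          f"sqrt(2 log N) = {np.sqrt(2*np.log(N)):.3f}")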
We have seen that the expected supremum of the canonical Gaussian process on some set T ⊂ R^n,
E[ sup_{t∈T} ⟨Y, t⟩ ],
where Y ∼ N(0, 1l_n), plays an important role. In many applications this geometric quantity is an important parameter. This leads to the following definition.
Definition (Gaussian width) The Gaussian width of a subset T ⊂ R^n is
W(T) := E[ sup_{t∈T} ⟨Y, t⟩ ], Y ∼ N(0, 1l_n).
We summarise a few simple properties of the Gaussian width. We skip part of the proof as it involves mostly elementary properties of the norm and the Minkowski sum. Recall that the diameter of a set T ⊂ R^n with respect to the Euclidean norm is diam(T) := sup_{x,y∈T} ‖x − y‖₂.
(d)
W(T) = (1/2) W(T − T) = (1/2) E[ sup_{x,y∈T} ⟨Y, x − y⟩ ].
(e)
(1/√(2π)) diam(T) ≤ W(T) ≤ (√n/2) diam(T).
Proof.
(a) The Cauchy-Schwarz inequality gives ⟨Y, x⟩ ≤ ‖Y‖₂‖x‖₂ for every x ∈ T. If W(T) < ∞, then this implies that ‖x‖₂ < ∞ for all x ∈ T, and henceforth T is bounded. Conversely, if T is a bounded set we have that ‖x‖₂ ≤ C for all x ∈ T and some C > 0. Thus
W(T) = E[ sup_{x∈T} ⟨Y, x⟩ ] ≤ E[‖Y‖₂] C ≤ √n C < ∞.
(b) We simply use the rotation invariance of the normal distribution, i.e., Y ∼ N(0, 1l_n) implies that UY ∼ N(0, 1l_n) for any orthogonal matrix U.
(c) Recalling the definition of the Minkowski sum of two sets we get
W(T + S) = E[ sup_{x∈T, y∈S} ⟨Y, x + y⟩ ] = E[ sup_{x∈T} ⟨Y, x⟩ ] + E[ sup_{y∈S} ⟨Y, y⟩ ] = W(T) + W(S).
If a ≥ 0 we have a = |a| and ⟨Y, ax⟩ = |a|⟨Y, x⟩. If a < 0 we have |a| = −a, and using the fact that Y is symmetric, i.e., −Y has the same distribution as Y, we get W(aT) = |a| W(T) in both cases. 2
Example 7.22 (Gaussian width) (a) The Gaussian width of the unit sphere in n dimensions is
W(S^(n−1)) = E[‖Y‖₂] = √n ± C,
where the second equality follows from Exercise 6 - Example sheet 2 with some absolute constant C > 0, as an immediate consequence of our concentration of norm result in Theorem 3.1. To see the first equality, apply first the Cauchy-Schwarz inequality to obtain an upper bound for any x ∈ S^(n−1),
⟨Y, x⟩ ≤ ‖Y‖₂ ‖x‖₂ = ‖Y‖₂.
Pick
x_i = Y_i / ‖Y‖₂, i = 1, . . . , n;
then x = (x₁, . . . , x_n) ∈ S^(n−1) and ⟨Y, x⟩ = ‖Y‖₂, so the supremum equals ‖Y‖₂.
(b) The Gaussian width for the cube B^n_∞ = [−1, 1]^n with respect to the ℓ_∞ norm is
W(B^n_∞) = E[‖Y‖₁] = n E[|Y₁|] = n √(2/π),
where the second equality follows from the isotropy of Y and the definition of the norm ‖Y‖₁ = Σ_{i=1}^n |Y_i|. The third equality is just calculation, i.e.,
E[|Y₁|] = (1/√(2π)) ∫_{−∞}^∞ |y| e^(−y²/2) dy = (2/√(2π)) ∫_0^∞ y e^(−y²/2) dy = (2/√(2π)) [ −e^(−y²/2) ]_0^∞ = √(2/π).
The first equality follows since, for x ∈ B^n_∞,
⟨Y, x⟩ ≤ Σ_{i=1}^n |Y_i| |x_i| ≤ ‖Y‖₁,
with equality for x_i = sign(Y_i).
(c) The unit ball B₁^n in R^n with respect to the ℓ₁ norm is B₁^n = {x ∈ R^n : ‖x‖₁ ≤ 1}, and its Gaussian width is
W(B₁^n) = E[‖Y‖_∞] = E[ max_{1≤i≤n} |Y_i| ],
where the second equality is just the definition of the supremum norm and the first equality follows from ⟨Y, x⟩ ≤ ‖Y‖_∞ ‖x‖₁ ≤ ‖Y‖_∞ for x ∈ B₁^n, with equality attained at a signed standard basis vector. With Proposition 7.18 we get two absolute constants c, C > 0 such that
c √(log n) ≤ W(B₁^n) ≤ C √(log n).
♣
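The three widths in Example 7.22 are easy to estimate by Monte Carlo; the following sketch (not part of the notes, numpy assumed, n and the sample size are illustrative) compares the estimates with the closed-form expressions above.

import numpy as np

rng = np.random.default_rng(10)
n, trials = 100, 20000
Y = rng.normal(size=(trials, n))

w_sphere = np.linalg.norm(Y, axis=1).mean()        # W(S^(n-1)) = E||Y||_2
w_cube   = np.abs(Y).sum(axis=1).mean()            # W([-1,1]^n) = E||Y||_1
w_l1ball = np.abs(Y).max(axis=1).mean()            # W(B_1^n)    = E||Y||_inf

print("W(S^(n-1)) ~", w_sphere, " vs sqrt(n)        =", np.sqrt(n))
print("W(B_inf^n) ~", w_cube,   " vs n sqrt(2/pi)   =", n*np.sqrt(2/np.pi))
print("W(B_1^n)   ~", w_l1ball, " vs sqrt(2 log n)  =", np.sqrt(2*np.log(n)))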
Theorem 7.24 (Dudley's inequality) Let X = (X_t)_{t∈T} be a mean-zero stochastic process on a metric space (T, d) with sub-Gaussian increments. Then
E[ sup_{t∈T} X_t ] ≤ CK ∫_0^∞ √( log N(T, d, ε) ) dε.
The proof of the theorem needs some bounds on the covering number of the class of Lipschitz functions. We explore this in the next exercise before proving the theorem.
Exercise 7.27 Let F be the class of functions f : [0, 1] → [0, 1] with ‖f‖_Lip ≤ 1. Show that N(F, ‖·‖_∞, ε) ≤ (2/ε)^(2/ε) for ε ∈ (0, 1).
Solution. Recall the definition of the supremum norm ‖f‖_∞ = sup_{x∈[0,1]} |f(x)| and consider the square Λ = [0, 1]². We put a mesh of step size ε on Λ, so that we get (1/ε)² squares of side length ε. Mesh-following functions f₀ are step functions on the mesh, taking one of the 1/ε possible values on each interval of length ε. For every f ∈ F there is a mesh-following function f₀ such that ‖f − f₀‖_∞ ≤ ε. The number of all mesh-following functions is bounded by (1/ε)^(1/ε). Recall that the covering number N(K, d, ε) is the smallest cardinality of a family of closed ε-balls with centres in K whose union covers the set K. If we relax the assumption that the centres are in K we obtain the external covering number N_ext(K, d, ε). As we have done earlier, one can show that
N(K, d, ε) ≤ N_ext(K, d, ε/2),
which we need for our case as the centres of our balls might not be elements of F. Thus
N(F, ‖·‖_∞, ε) ≤ (2/ε)^(2/ε).
where
Z_i := (f − g)(X_i) − E[(f − g)(X)], i = 1, . . . , N.
In the following we write ≲ whenever we have ≤ with some absolute constant C > 0, to avoid adapting the absolute constant in each single step. The Z_i's are independent and mean-zero random variables, and thus Proposition 2.24 gives
‖X_f − X_g‖_ψ2 ≲ (1/N) ( Σ_{i=1}^N ‖Z_i‖²_ψ2 )^(1/2),
and therefore
‖X_f − X_g‖_ψ2 ≲ (1/N) N^(1/2) ‖f − g‖_∞ = (1/√N) ‖f − g‖_∞.
Step 2: Application of Dudley's inequality. According to Step 1 the empirical process has sub-Gaussian increments. The diameter of F is one, that is,
diam(F) = sup_{f,g∈F} ‖f − g‖_∞ = sup_{f,g∈F} sup_{x∈[0,1]} |f(x) − g(x)| = 1.
Application of Theorem 7.24, with the integral running between zero and the diameter, gives the upper bound for the expected supremum. For the integral we use the bound on the covering number in Exercise 7.27. Thus we get
E[ sup_{f∈F} |X_f| ] ≲ (1/√N) ∫_0^1 √( log N(F, ‖·‖_∞, ε) ) dε ≲ (1/√N) ∫_0^1 √( (c/ε) log(c/ε) ) dε ≲ 1/√N,
for some absolute constant c > 0. 2
We state, just for information, a uniform law of large numbers involving the VC dimension. It goes beyond the scope of this lecture to provide all details about the relationship between the covering numbers for Boolean functions and their VC dimension. We need the following result to appreciate the application example in Section 8.
Theorem 7.29 (Empirical processes via VC dimension) Let (Ω, B(Ω), µ) be a probability space and F = {f : Ω → {0, 1}} a class of Boolean functions with VC(F) ∈ [1, ∞). Consider the i.i.d. samples X, X₁, . . . , X_N with law µ ∈ M₁(Ω). Then
E[ sup_{f∈F} | (1/N) Σ_{i=1}^N f(X_i) − E[f(X)] | ] ≤ C √( VC(F)/N )
for some absolute constant C > 0.
8 Application: Statistical Learning theory
Example 8.1 (Health study) We examine N patients and determine n health parameters, e.g., blood pressure, heart rate, weight, etc. We then obtain the samples X_i ∈ R^n, i = 1, . . . , N. Suppose we know whether each patient has a certain illness or not, e.g. diabetes. That is, we know T(X_i) ∈ {0, 1}, i = 1, . . . , N, with T(X_i) = 1 meaning sick and T(X_i) = 0 meaning healthy. We want to learn from the given training sample to diagnose diabetes, i.e., we want to learn the target function T : Ω → {0, 1}. This target function would output a diagnosis for any person based on the n health parameters. ♣
Definition 8.2 Let (Ω, B(Ω), µ) be a probability space. The risk of a function f : Ω → R in a learning problem with the target function T : Ω → R and class F of functions is defined as
R(f) := E[ (f(X) − T(X))² ],    (8.1)
where the expectation is with respect to the probability measure µ ∈ M₁(Ω). The minimiser f* of the risk is defined as
f* := argmin_{f∈F} R(f).
Note that for Boolean functions f, T : Ω → {0, 1} the risk is the probability of misclassification,
R(f) = P( f(X) ≠ T(X) ),
as f²(X) − 2f(X)T(X) + T²(X) is zero when f(X) = T(X) and 1 otherwise.
In any learning problem the choice of the class of functions is crucial. In this context we call the class of functions F the hypothesis space. If we choose F to be a class of simple functions, like linear functions or Lipschitz functions, we may get easy calculations and estimates but we can still be far off the true target function; this is called underfitting. Conversely, if we consider too many different function types in F we risk overfitting.
Definition 8.3 (Empirical risk) Let (Ω, B(Ω), µ) be a probability space and f : Ω → R an
element of the hypothesis class F of a learning problem with target function T : Ω → R.
Let X1 , . . . , XN be Ω-valued i.i.d. samples with law µ ∈ M1 (Ω). The empirical risk of f
given the sample is defined as
R_N(f) := (1/N) Σ_{i=1}^N (f(X_i) − T(X_i))².    (8.3)
Denote by f*_N a minimiser of the empirical risk R_N over F. The quantity of interest is then the excess risk
R(f*_N) − R(f*).
Lemma 8.4 Let (Ω, B(Ω), µ) be a probability space and f : Ω → R an element of the hypothesis class F of a learning problem with target function T : Ω → R. Let X₁, . . . , X_N be Ω-valued i.i.d. samples with law µ ∈ M₁(Ω). Then
E[R_N(f)] = R(f).
Theorem 8.5 (Excess risk via VC dimension) Let (Ω, B(Ω), µ) be a probability space and F be a class of functions of a learning problem with target function T : Ω → R and VC(F) ≥ 1. Let X₁, . . . , X_N be Ω-valued i.i.d. samples with law µ ∈ M₁(Ω). Then
E[ R(f*_N) ] ≤ R(f*) + C √( VC(F)/N )
for some absolute constant C > 0.
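The following is a minimal empirical-risk-minimisation sketch in the spirit of the health study in Example 8.1 (not part of the notes, numpy assumed); the synthetic data, the threshold rule used as target, and the one-parameter family of half-space classifiers are all illustrative stand-ins.

import numpy as np

rng = np.random.default_rng(11)
n_par, N = 2, 500                                   # "health parameters" and sample size

X = rng.normal(size=(N, n_par))                     # synthetic training samples
T = (X[:, 0] > 0.3).astype(int)                     # synthetic target labels (assumption)

thetas = np.linspace(-2, 2, 201)                    # hypothesis class: x -> 1{x[0] > theta}

def empirical_risk(theta):
    f = (X[:, 0] > theta).astype(int)
    return np.mean((f - T) ** 2)                    # R_N(f), cf. (8.3)

risks = np.array([empirical_risk(th) for th in thetas])
theta_hat = thetas[risks.argmin()]
print("empirical risk minimiser theta =", theta_hat, " with R_N =", risks.min())
# Theorem 8.5: the population risk of this minimiser exceeds the optimum
# by at most C sqrt(VC(F)/N) in expectation (here VC(F) = 1 for threshold rules).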
L.H.S. of (8.4) ≲ (1/√N) E[ ∫_0^1 √( log N(L, L²(µ_N), ε) ) dε ],
To see (8.5) pick an ε-net {f_j}_{j=1,...,J} of F. For all ℓ = (f − T)² ∈ L there exists ℓ_j := (f_j − T)² such that
‖ℓ − ℓ_j‖_{L²(µ_N)} = ‖ f² − f_j² − 2T(f − f_j) ‖_{L²(µ_N)} = ‖ (f + f_j)(f − f_j) − 2T(f − f_j) ‖_{L²(µ_N)}
≤ 2‖f − f_j‖_{L²(µ_N)} + 2‖f − f_j‖_{L²(µ_N)} ≤ ε
whenever ‖f − f_j‖_{L²(µ_N)} ≤ ε/4. This shows our claim (8.5). Now we replace L by F and use Theorem 7.29 and its proof to see that the right-hand side is bounded by C√(VC(F)/N), which yields the statement of the theorem. 2
If we want to bound the expected excess risk in our health study Example 8.1 by
ε > 0, all we need to do is to take a sample of size
N ∼ ε−2 VC(F) .
References
[Dur19] R. Durrett. Probability: Theory and Examples. Cambridge University Press, fifth ed., 2019.
[Geo12] H.-O. Georgii. Stochastics. De Gruyter Textbook. De Gruyter, 2nd rev. and ext. ed., 2012.
[Ver18] R. Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.
Index
ε-net, 45
ε-separated, 45
n-dimensional ball, 35
n-dimensional sphere, 35
Sub-exponential norm, 27
Bernstein condition, 29
canonical Gaussian process in Rn, 82
canonical metric of the process, 87
centred moment generating function, 18
chaining, 93
Chaos, 74
covariance function, 82
covariance matrix, 9, 39
covering number, 45
cumulative distribution function (CDF), 3
Decoupling, 74
diameter, 90
empirical covariance, 52
empirical measure, 52
empirical process, 93
empirical risk, 98
essential supremum, 5
excess risk, 98
Gamma function, 20
Gaussian interpolation, 83
Gaussian process, 82
Gaussian width, 90
global Lipschitz continuity, 58
Hamming cube, 47
Hamming distance, 47
Herbst argument, 56
increments, 82
isotropic, 39
Jensen's inequality, 4
Johnson-Lindenstrauss Lemma, 72
Kullback-Leibler divergence, 54
Landau symbols, 36
level sets, 63
Lipschitz continuous, 58
Minkowski sum, 46
moment generation function, 3
Multivariate Normal / Gaussian distribution, 42
normal distribution, 9
operator norm, 48
packing number, 45
Rademacher function, 29
random field, 81
random process, 81
relative entropy, 54
second moment matrix, 39
separately convex, 58
Shannon entropy, 55
singular values, 48
spherically distributed, 41
standard Brownian motion, 81
Stirling's formula, 20
stochastically dominated, 83
Sub-exponential random variable, 26
Sub-exponential random variable, second definition, 28
Sub-exponential random variables, first definition, 26
Sub-Gaussian - first definition, 21
Sub-Gaussian - second definition, 23
Sub-Gaussian increments, 93
Sub-Gaussian properties, 18
sub-level sets, 63
symmetric, 80
symmetric Bernoulli distribution in Rn, 41
tails of the normal distribution, 11
target function, 97
training data, 97
unit sphere, 41
VC dimension, 95
Young's inequality, 27