High Dimensional Probability MA3K0 Notes 3
Stefan Adams
15 January 2024
Contents
1 Preliminaries on Probability Theory
1.1 Random variables
1.2 Classical Inequalities
1.3 L^p-spaces
1.4 Limit Theorems
2 Concentration inequalities for independent random variables
2.1 Why concentration inequalities
2.2 Hoeffding's Inequality
2.3 Chernoff's Inequality
2.4 Sub-Gaussian random variables
2.5 Sub-Exponential random variables
3 Random vectors in High Dimensions
3.1 Concentration of the Euclidean norm
3.2 The geometry of high dimensions
3.3 Covariance matrices and Principal Component Analysis (PCA)
3.4 Examples of High-Dimensional distributions
3.5 Sub-Gaussian random variables in higher dimensions
3.6 Application: Grothendieck's inequality
4 Random Matrices
4.1 Geometric concepts
4.2 Concentration of the operator norm of random matrices
4.3 Application: Community Detection in Networks
4.4 Application: Covariance Estimation and Clustering
5 Concentration of measure - general case
5.1 Concentration by entropic techniques
5.2 Concentration via Isoperimetric Inequalities
5.3 Some matrix calculus and covariance estimation
5.4 Application - Johnson-Lindenstrauss Lemma
6 Basic tools in high-dimensional probability
6.1 Decoupling
6.2 Concentration for Anisotropic random vectors
6.3 Symmetrisation
7 Random Processes
7.1 Basic concepts and examples
7.2 Slepian's inequality and Gaussian interpolation
7.3 The supremum of a process
7.4 Uniform law of large numbers
8 Application: Statistical Learning theory
Preface
Introduction
A convex combination of points z_1, . . . , z_m ∈ R^n is a sum of the form

Σ_{i=1}^m λ_i z_i   where λ_i ≥ 0 and Σ_{i=1}^m λ_i = 1.   (0.1)

Given a set M ⊂ R^n, the convex hull of M is the set of all convex combinations of
all finite collections of points in M, defined as

conv(M) := { Σ_{i=1}^m λ_i z_i : m ∈ N, z_i ∈ M, λ_i ≥ 0, Σ_{i=1}^m λ_i = 1 }.
Theorem 0.1 (Caratheodory’s theorem) Every point in the convex hull of a set M ⊂ Rn
can be expressed as a convex combination of at most n + 1 points from M .
1 A simplex is a generalisation of the notion of a triangle to arbitrary dimensions. Specifically, a k-simplex
S is the convex hull of its k + 1 vertices: Suppose u_0, . . . , u_k ∈ R^k are affinely independent, which means that
u_1 − u_0, . . . , u_k − u_0 are linearly independent. Then, the simplex determined by these vertices is the set of
points

S = { λ_0 u_0 + · · · + λ_k u_k : Σ_{i=0}^k λ_i = 1, λ_i ≥ 0 for i = 0, . . . , k }.
Remark 0.3 We have assumed diam(M) ≤ 1 for simplicity. For a general set M, the bound
in the theorem changes to diam(M)/√k.
Proof. The argument upon which our proof is based is known as the empirical
method of B. Maurey. W.l.o.g., we may assume that not only the diameter but also
the radius of M is bounded by 1. Fix a point x ∈ conv(M) and write it as a convex
combination x = Σ_{i=1}^m λ_i z_i of points z_1, . . . , z_m ∈ M as in (0.1). Define a random
vector Z ∈ R^n taking the values z_1, . . . , z_m with

P(Z = z_i) = λ_i,  i = 1, . . . , m.

This is possible by the fact that the weights λ_i ∈ [0, 1] and sum to 1. Consider now
a sequence (Z_j)_{j∈N} of independent copies of Z. This sequence is an independent identically
distributed sequence of R^n-valued random variables. By the strong law of large
numbers,
(1/k) Σ_{j=1}^k Z_j → x  almost surely as k → ∞.
We shall now get a quantitative version of this limiting statement, that is, we wish
to obtain an error bound. For this we shall compute the variance of (1/k) Σ_{j=1}^k Z_j. We
obtain

E[ ‖x − (1/k) Σ_{j=1}^k Z_j‖² ] = (1/k²) E[ ‖Σ_{j=1}^k (Z_j − x)‖² ]   (since E[Z_j − x] = 0)
  = (1/k²) Σ_{j=1}^k E[ ‖Z_j − x‖² ].
The last identity is just a higher-dimensional version of the basic fact that the vari-
ance of a sum of independent random variables equals the sum of the variances. To
bound the variances of the single terms we compute, using that Z_j is a copy of Z and
that ‖Z‖_2 ≤ 1 as Z ∈ M,

E[ ‖Z_j − x‖² ] = E[ ‖Z − x‖² ] = E[ ‖Z‖² ] − ‖x‖² ≤ 1,

where the second equality follows from the well-known property of the variance,
namely, for n = 1, Var(Z) = E[(Z − E[Z])²] = E[Z²] − E[Z]², and the cases for n > 1
follow similarly. We have thus shown that
E[ ‖x − (1/k) Σ_{j=1}^k Z_j‖² ] ≤ 1/k.
Proof. We shall define the centres of the balls as follows. Let k := ⌈1/ε²⌉ and
consider the set

N := { (1/k) Σ_{j=1}^k x_j : x_j are vertices of P }.

The polytope P is the convex hull of the set of its vertices, which we denote by
M. We then apply Theorem 0.2 to any point x ∈ P = conv(M) and deduce that
x is within a distance 1/√k ≤ ε from some point in N. This shows that the ε-balls
centred at N do indeed cover P. 2
In this lecture we will learn several other approaches to the covering problem in
relation to packing, entropy and coding, and random processes.
2 In geometry, a polytope is a geometric object with 'flat' sides. It is a generalisation of the three-dimensional
polyhedron, which is a solid with flat polygonal faces, straight edges and sharp corners/vertices. Flat sides mean
that the sides of a (k + 1)-polytope consist of k-polytopes.
be the system consisting of all compact rectangular boxes in Rn with rational vertices and
edges parallel to the axes. In honour of Émile Borel (1871–1956), the system B n = σ(G)
is called the Borel σ-algebra on Rn , and every A ∈ B n a Borel set. Here, σ(G) denotes the
smallest σ-algebra generated by the system G. Note that the B n can also be generated by the
system of open or half-open rectangular boxes, see [Dur19, Geo12].
♣
The decisive point in the process of building a stochastic model is the next step:
For each A ∈ F we need to define a value P (A) ∈ [0, 1] that indicates the probability
of A. Sensibly, this should be done so that the following holds.
Theorem 1.4 (Construction of probability measures via densities) (a) Discrete case: For
countable Ω, the relations

P(A) = Σ_{ω∈A} ϱ(ω) for A ∈ P(Ω),   ϱ(ω) = P({ω}) for ω ∈ Ω

(b) Continuous case: If Ω ⊂ R^n is Borel, then every function ϱ : Ω → [0, ∞) satisfying the
properties

Definition 1.5 A sequence or function ϱ as in Theorem 1.4 above is called a density (of
P) or, more explicitly (to emphasise normalisation), a probability density (function), often
abbreviated as pdf. If a distinction between the discrete and continuous case is required, a
sequence ϱ = (ϱ(ω))_{ω∈Ω} as in case (a) is called a discrete density, and a function ϱ in case
(b) a Lebesgue density.
In probability theory one often considers the transition from a measurable space
(event space) (Ω, F) to a coarser measurable (event) space (Ω′, F′) by means of a
mapping X : Ω → Ω′. In general such a mapping should satisfy the requirement

A′ ∈ F′ ⇒ X^{−1}A′ := {ω ∈ Ω : X(ω) ∈ A′} ∈ F.   (1.1)

Definition 1.6 Let (Ω, F) and (Ω′, F′) be two measurable (event) spaces. Then every map-
ping X : Ω → Ω′ satisfying property (1.1) is called a random variable from (Ω, F) to
(Ω′, F′), or a random element of Ω′, or an Ω′-valued random variable. Alternatively (in the
terminology of measure theory), X is said to be measurable relative to F and F′.
Definition 1.8 (a) The probability measure P 0 in Theorem 1.7 is called the distribution of
X under P , or the image of P under X , and is denoted by P ◦ X −1 . (In the literature,
one also finds the notations PX or L(X; P ). The letter L stands for the more traditional
term law, or loi in French.)
(b) Two random variables are said to be identically distributed if they have the same distri-
bution.
We are considering real-valued or R^n-valued random variables in the following
and we just call them random variables for all these cases. In basic courses in
probability theory, one learns about the two most important quantities associated
with a random variable X, namely the expectation 3 (also called the mean) and the
variance. They will be denoted in this lecture by

E[X] and Var(X) := E[(X − E[X])²].
It is often more convenient to work with the tails of random variables, namely with
the tail probabilities P(X ≥ t), t ∈ R. Here we write P for the generic distribution of
the random variable X which is given by the context.
For any real-valued random variable the moment generating function (MGF) is defined as

M_X(λ) := E[e^{λX}],  λ ∈ R.   (1.4)
When M_X is finite for all λ in a neighbourhood of the origin, we can easily compute
all moments by taking derivatives (interchanging differentiation and expectation (in-
tegration) in the usual way):

E[X^k] = (d^k/dλ^k) M_X(λ) |_{λ=0},  k ∈ N.   (1.5)
3 In measure theory the expectation E[X] of a random variable on a probability space (Ω, F, P) is the
Lebesgue integral of the function X : Ω → R. This makes theorems on Lebesgue integration applicable in
probability theory for expectations of random variables.
Lemma 1.9 (Integral Identity) Let X be a real-valued non-negative random variable. Then

E[X] = ∫_0^∞ P(X > t) dt.
Proof. We can write any non-negative real number x via the following identity using
the indicator function 4:

x = ∫_0^x 1 dt = ∫_0^∞ 1l_{t<x}(t) dt.
Substitute now the random variable X for x and take expectation (with respect to X)
on both sides. This gives

E[X] = E[ ∫_0^∞ 1l_{t<X}(t) dt ] = ∫_0^∞ E[1l_{t<X}] dt = ∫_0^∞ P(t < X) dt.

To change the order of expectation and integration in the second equality, we used
the Fubini-Tonelli theorem. 2
Exercise 1.10 (Integral identity) Prove the extension of Lemma 1.9 to any real-valued ran-
dom variable (not necessarily positive):

E[X] = ∫_0^∞ P(X > t) dt − ∫_{−∞}^0 P(X < t) dt.
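For instance, for X ∼ Exp(α), α > 0, the identity of Lemma 1.9 is quickly verified:
P(X > t) = e^{−αt} for t ≥ 0, so ∫_0^∞ P(X > t) dt = ∫_0^∞ e^{−αt} dt = 1/α, which agrees
with the direct computation of E[X].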
Theorem 1.11 (Jensen's inequality) Let X be a real-valued random variable with finite
expectation and let Φ : R → R be convex. Then

Φ(E[X]) ≤ E[Φ(X)].
Proof. See [Dur19] or [Geo12], using either the existence of sub-derivatives for
convex functions or the definition of convexity with the epi-graph of a function. The
epi-graph of a function f : I → R, I ⊂ R an interval, is the set {(x, y) ∈ I × R : y ≥ f(x)}.
4 1l_A denotes the indicator function of the set A, that is, 1l_A(t) = 1 if t ∈ A and 1l_A(t) = 0 if t ∉ A.
1.3 L^p-spaces
In the following let X be an R-valued random variable, i.e., there is a probability space
(Ω, F, P) such that X : Ω → R is a measurable function. By default, we equip the
real line R with its Borel σ-algebra. We begin with the definition of the essential
supremum of X.

Definition 1.12 (Essential supremum) Let X be an R-valued random variable. The essential
supremum of X, written ess-sup(X), is the smallest number α ∈ R such that the set {x ∈
Ω : X(x) > α} has measure zero, that is,

ess-sup(X) := inf{ α ∈ R : P({x ∈ Ω : X(x) > α}) = 0 }.
Example 1.13 (Essential supremum being infinity) Suppose that the probability space is
Ω = (0, 1) with σ-algebra F = B((0, 1)), and let P be the uniform measure on (0, 1). This
measure has constant probability density,

P(A) = ∫_Ω 1l_A(t) dt = b − a,  for any A = (a, b) with 0 ≤ a < b ≤ 1.

Define X(x) := 1/x for x ∈ (0, 1). Then, for every α ≥ 1,

{x ∈ (0, 1) : 1/x > α} = (0, 1/α)

and

P((0, 1/α)) = 1/α > 0.

As this holds for all α > 0, we have that ess-sup(X) = ∞. ♣
Definition 1.14 Let (Ω, F, P) be a probability space. Given two measurable functions
f, g : Ω → [0, ∞], we say that f is equivalent to g, written f ∼ g, if f = g holds P-almost surely,
that is,

P({x ∈ Ω : f(x) ≠ g(x)}) = 0.

We shall - with an abuse of notation - identify a measurable function f with its
equivalence class [f].
where

‖f‖_{L^p} := ( ∫_Ω |f|^p dP )^{1/p} = ( ∫_Ω |f(x)|^p P(dx) )^{1/p}.

If p = ∞, then

‖f‖_{L^∞} := ess-sup(|f|),

and we write ‖f‖_∞ ≡ ‖f‖_{L^∞} occasionally.
Lemma 1.20 (Linear Markov's inequality) For non-negative random variables X and t >
0 the tail probability is bounded as

P(X > t) ≤ E[X]/t.

Proof. Pick t > 0. Any positive number x can be written as

x = x 1l_{x≥t} + x 1l_{x<t}.

As X is non-negative, we insert X into the above expression in place of x and take
the expectation (integral) to obtain

E[X] = E[X 1l_{X≥t}] + E[X 1l_{X<t}] ≥ E[t 1l_{X≥t}] = t P(X ≥ t).
2
This is one particular version of the Markov inequality which provides linear decay
in t. In the following proposition we obtain the general version which will be used
frequently throughout the lecture.
Proposition 1.21 (Markov's inequality) Let Y be a real-valued random variable and let
f : [0, ∞) → [0, ∞) be monotone increasing. Then, for every ε > 0 with f(ε) > 0,

P(|Y| ≥ ε) ≤ E[f ∘ |Y|] / f(ε).

Proof. Take the expectation of the pointwise bound

f(ε) 1l_{|Y|≥ε} ≤ f ∘ |Y|.

2
The following version of the Markov inequality is often called Chebyshev’s in-
equality.
Corollary 1.22 (Chebyshev's inequality, 1867) For all Y ∈ L² with E[Y] ∈ (−∞, ∞)
and ε > 0,

P(|Y − E[Y]| ≥ ε) ≤ Var(Y)/ε².
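For example, if Var(Y) = 1 and ε = 3, Chebyshev's inequality gives P(|Y − E[Y]| ≥ 3) ≤ 1/9 ≈ 0.11,
while for a standard normal random variable the true probability is about 0.0027. This gap is
one motivation for the sharper concentration bounds developed in Chapter 2.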
1.4 Limit Theorems
Definition 1.23 (Variance and covariance) Let X, Y ∈ L2 be real-valued random vari-
ables.
(a)

Var(X) := E[(X − E[X])²] = E[X²] − E[X]²

is called the variance, and √Var(X) the standard deviation of X with respect to P.
(b)
cov(X, Y ) := E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ]
is called the covariance of X and Y . It exists since |XY | ≤ X 2 + Y 2 .
Recall that two independent random variables are uncorrelated, but two uncorrelated
random variables are not necessarily independent, as the following example shows.
Example 1.24 Let Ω = {1, 2, 3} and let P be the uniform distribution on Ω. Define two ran-
dom variables by their images, that is,

(X(1), X(2), X(3)) = (1, 0, −1) and (Y(1), Y(2), Y(3)) = (0, 1, 0).

Then E[X] = 0 and XY ≡ 0, so cov(X, Y) = E[XY] − E[X]E[Y] = 0, i.e., X and Y are
uncorrelated. On the other hand,

P(X = 1, Y = 1) = 0 ≠ 1/9 = P(X = 1)P(Y = 1).

Hence X and Y are not independent. ♣
Theorem 1.25 (Weak law of large numbers, L2 -version) Let (Xi )i∈N be a sequence of un-
correlated (e.g. independent) real-valued random variables in L2 with bounded variance, in
that v := supi∈N Var(Xi ) < ∞. Then for all ε > 0
P( | (1/N) Σ_{i=1}^N (X_i − E[X_i]) | ≥ ε ) ≤ v/(Nε²) −→ 0 as N → ∞,

and thus

(1/N) Σ_{i=1}^N (X_i − E[X_i]) −→ 0 in probability as N → ∞

(the superscript P over the arrow means convergence in probability). In particular, if E[X_i] = E[X_1] holds for all
i ∈ N, then

(1/N) Σ_{i=1}^N X_i −→ E[X_1] in probability.
We now present a second version of the weak law of large numbers, which does
not require the existence of the variance. To compensate we must assume that
the random variables, instead of being pairwise uncorrelated, are even pairwise
independent and identically distributed.
Theorem 1.26 (Weak law of large numbers, L¹-version) Let (X_i)_{i∈N} be a sequence of pair-
wise independent, identically distributed real-valued random variables in L¹. Then

(1/N) Σ_{i=1}^N X_i −→ E[X_1] in probability as N → ∞.
Theorem 1.27 (Strong law of large numbers) If (X_i)_{i∈N} is a sequence of pairwise uncor-
related real-valued random variables in L² with v := sup_{i∈N} Var(X_i) < ∞, then

(1/N) Σ_{i=1}^N (X_i − E[X_i]) → 0 almost surely as N → ∞.
Theorem 1.28 (Central limit theorem; A.M. Lyapunov 1901, J.W. Lindeberg 1922, P. Lévy 1922)
Let (X_i)_{i∈N} be a sequence of independent, identically distributed real-valued random vari-
ables in L² with E[X_i] = m and Var(X_i) = v > 0. Then, as N → ∞,

S*_N := (1/√N) Σ_{i=1}^N (X_i − m)/√v −→ N(0, 1) in distribution.
Lemma 1.30 Let X1 and X2 be independent and normally distributed with zero mean and
variance σ 2 > 0. Then X1 + X2 and X1 − X2 are independent and normally distributed
with mean 0 and variance 2σ 2 .
Proposition 1.31 If X and Y are n-dimensional Gaussian vectors with E[X] = E[Y ] and
cov(X) = cov(Y ), then X and Y have the same distribution.
Corollary 1.32 A Gaussian random vector X has independent entries if and only if its co-
variance matrix is diagonal. In other words, the entries in a Gaussian vector are uncorre-
lated if and only if they are independent.
Exercise 1.33 (a) Let X ∼ Ber(p), p ∈ [0, 1]. Compute the expectation, the variance and
the moment generating function MX .
The Poisson distribution Poi(λ) with parameter λ > 0 is given by

P(X = k) = e^{−λ} λ^k / k!,  k ∈ N_0,

and Poi(λ) ∈ M_1(N_0, P(N_0)). Here, M_1(Ω) denotes the set of probability measures on Ω
and P(N_0) is the power set.
Exercise 1.34 Let X ∼ Poi(λ), λ > 0. Compute the expectation, the variance and the
moment generating function MX .
K
Exercise 1.35 Let X ∼ Exp(α), α > 0. Compute the expectation, the variance and the
moment generating function MX . K
Theorem 1.36 (Poisson limit theorem) Let X_i^{(N)}, i = 1, . . . , N, be independent Bernoulli
random variables X_i^{(N)} ∼ Ber(p_i^{(N)}), and denote S_N = Σ_{i=1}^N X_i^{(N)} their sum. Assume that, as
N → ∞,

max_{1≤i≤N} {p_i^{(N)}} −→ 0 and E[S_N] = Σ_{i=1}^N p_i^{(N)} −→ λ ∈ (0, ∞).

Then, as N → ∞,

S_N −→ Poi(λ) in distribution.
Proposition 2.1 (Tails of the normal distribution) Let Y ∼ N(0, 1). Then, for all t > 0,
we have

(1/t − 1/t³) (1/√(2π)) e^{−t²/2} ≤ P(Y ≥ t) ≤ (1/t) (1/√(2π)) e^{−t²/2}.
Proof. Denote f(x) := exp(−x²/2). For the upper bound we use x/t ≥ 1 on the domain of
integration to get the estimate

∫_t^∞ f(x) dx ≤ ∫_t^∞ (x/t) e^{−x²/2} dx = (1/t) e^{−t²/2}.

For the lower bound we use integration by parts (IBP) and f′(x) = −x e^{−x²/2}:

∫_t^∞ e^{−x²/2} dx = ∫_t^∞ (1/x) x e^{−x²/2} dx = [ −(1/x) e^{−x²/2} ]_t^∞ − ∫_t^∞ (1/x²) e^{−x²/2} dx
 = (1/t) e^{−t²/2} − ∫_t^∞ (1/x³) x e^{−x²/2} dx = (1/t) e^{−t²/2} − [ −(1/x³) e^{−x²/2} ]_t^∞ + 3 ∫_t^∞ (1/x⁴) e^{−x²/2} dx.
Hence

∫_t^∞ e^{−x²/2} dx = (1/t − 1/t³) e^{−t²/2} + 3 ∫_t^∞ (1/x⁴) e^{−x²/2} dx,

and, as the integral on the right hand side is positive,

P(Y ≥ t) ≥ (1/√(2π)) (1/t − 1/t³) e^{−t²/2}.
2
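As a quick numerical check: at t = 3 the two bounds read 0.00131 ≤ P(Y ≥ 3) ≤ 0.00148,
while the true value is P(Y ≥ 3) ≈ 0.00135, so already for moderate t both bounds are quite tight.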
The lower bound in Proposition 2.1 is lower than the tail lower bound in Lemma
C.5 in the appendix.
Lemma 2.2 (Lower tail bound for normal distribution) Let Y ∼ N(0, 1). Then, for all
t ≥ 0,

P(Y ≥ t) ≥ (t/(t² + 1)) (1/√(2π)) e^{−t²/2}.
Proof. Define

g(t) := ∫_t^∞ e^{−x²/2} dx − (t/(t² + 1)) e^{−t²/2}.

Then g(0) > 0 and g′(t) = −2 e^{−t²/2}/(t² + 1)² < 0. Thus g is strictly decreasing with lim_{t→∞} g(t) =
0, implying that g(t) > 0 for all t ≥ 0; dividing by √(2π) gives the claim. 2
Using the tail estimates in Proposition 2.1 we expect to obtain an exponential
bound for the right hand side of (2.1), namely that the probability of having at least
3N/4 heads seems to be smaller than (1/√(2π)) e^{−N/8}. However, we have not taken into
account the approximation errors of '≈' in (2.1). Unfortunately, it turns out that the
error decays too slowly, actually even more slowly than linearly in N. This can be
seen from the following version of the CLT which we state without proof, see for
example [Dur19].
Theorem 2.3 (Berry-Esseen CLT) In the setting of Theorem 1.28, for every N and every
t ∈ R, we have

|P(Z_N ≥ t) − P(Y ≥ t)| ≤ ϱ/√N,

where Y ∼ N(0, 1) and ϱ = E[|X_1 − m|³]/σ³.
Can we improve the approximation error '≈' in (2.1) by simply computing the
probabilities with the help of Stirling's formula? Suppose that N is even. Then

P(S_N = N/2) = 2^{−N} \binom{N}{N/2} ∼ 1/√N and P(Z_N = 0) ∼ 1/√N,

but P(Y = 0) = 0. This shows that the approximation error in (2.1) is genuinely of order
1/√N and cannot be improved by a more careful computation. We thus see that we need a
different argument to get a better concentration bound. This is the content of the next section.
Proof. Without loss of generality we can put ‖a‖_2 = 1, namely, if ‖a‖_2 ≠ 1, then
define ã_i = a_i/‖a‖_2, i = 1, . . . , N, to obtain the bound for

P( Σ_{i=1}^N ã_i X_i ≥ t ) = P( Σ_{i=1}^N a_i X_i ≥ ‖a‖_2 t ) ≤ exp(−t²/2).

To bound the right hand side use first that the random variables are independent,

E[ exp( λ Σ_{i=1}^N a_i X_i ) ] = ∏_{i=1}^N E[e^{λ a_i X_i}],

and then compute, for the symmetric Bernoulli variables X_i,

E[e^{λ a_i X_i}] = (e^{λ a_i} + e^{−λ a_i})/2 = cosh(λ a_i).
We are left to find a bound for the hyperbolic cosine. There are two ways to get the
upper bound

cosh(x) ≤ e^{x²/2}.

1.) Simply write down both Taylor series and compare term by term, that is,

cosh(x) = 1 + x²/2! + x⁴/4! + x⁶/6! + · · ·
and

e^{x²/2} = Σ_{k=0}^∞ (x²/2)^k / k! = 1 + x²/2 + x⁴/(2²·2!) + · · · .

2.) Use the product expansion (complex analysis – not needed for this course, just
as background information)

cosh(x) = ∏_{k=1}^∞ ( 1 + 4x²/(π²(2k − 1)²) ) ≤ exp( Σ_{k=1}^∞ 4x²/(π²(2k − 1)²) ) = e^{x²/2},

where the last equality uses Σ_{k=1}^∞ 1/(2k − 1)² = π²/8.
As the reader may have realised, the proof of Hoeffding’s inequality, Theorem 2.5,
is quite flexible as it is based on bounding the moment generating function. For ex-
ample, the following version of Hoeffding’s inequality is valid for general bounded
random variables. The proof will be given as an exercise.
Theorem 2.8 (Hoeffding's inequality for general bounded random variables) Suppose that
X_1, . . . , X_N are independent random variables with X_i ∈ [m_i, M_i], m_i < M_i, for i =
1, . . . , N. Then, for any t ≥ 0, we have

P( Σ_{i=1}^N (X_i − E[X_i]) ≥ t ) ≤ exp( − 2t² / Σ_{i=1}^N (M_i − m_i)² ).
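To get a feeling for the strength of this bound, the following minimal Python/NumPy sketch
(the parameters N, t and the number of trials are illustrative assumptions, not taken from the
notes) compares the empirical tail of a sum of centred Ber(1/2) variables, for which M_i − m_i = 1,
with the bound exp(−2t²/N).

import numpy as np

rng = np.random.default_rng(0)
N, trials, t = 100, 200_000, 10.0

# N independent Ber(1/2) variables per trial; X_i - E[X_i] = X_i - 1/2, so M_i - m_i = 1
X = rng.integers(0, 2, size=(trials, N))
S = (X - 0.5).sum(axis=1)

empirical = np.mean(S >= t)
hoeffding = np.exp(-2 * t**2 / N)      # exp(-2 t^2 / sum_i (M_i - m_i)^2)

print(f"empirical tail P(S >= {t}) ~ {empirical:.3e}")
print(f"Hoeffding bound              {hoeffding:.3e}")

The empirical tail (roughly 2-3%) sits well below the Hoeffding bound exp(−2) ≈ 0.135, as it should.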
We need to bound the MGF of each Bernoulli random variable X_i separately. Using
1 + x ≤ e^x, we get

E[exp(λX_i)] = e^λ p_i + (1 − p_i) = 1 + (e^λ − 1)p_i ≤ exp( (e^λ − 1) p_i ).

Thus

∏_{i=1}^N E[exp(λX_i)] ≤ exp( (e^λ − 1) Σ_{i=1}^N p_i ) = exp( (e^λ − 1) µ ),

and therefore

P(S_N ≥ t) ≤ e^{−λt} exp( (e^λ − 1) µ ).

Define g(λ) := −λt + (e^λ − 1)µ. Optimising g over λ > 0, we obtain g′(λ) = −t + µe^λ = 0
if and only if λ = log(t/µ), which, as t > µ, gives the minimal upper bound. 2
Exercise 2.12 In the setting of Theorem 2.11, prove that, for any t < µ,

P(S_N ≤ t) ≤ e^{−µ} (eµ/t)^t.

K

Solution. We get, for any λ > 0,

P(S_N ≤ t) = P(−S_N ≥ −t) ≤ e^{λt} E[e^{−λS_N}],

and

∏_{i=1}^N E[e^{−λX_i}] ≤ exp( (e^{−λ} − 1) µ ).

Thus

P(S_N ≤ t) ≤ e^{λt} exp( (e^{−λ} − 1) µ ).

Minimising over λ > 0 gives −λ = log(t/µ), which is valid as t < µ.
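For a quick numerical impression of these Chernoff bounds: with µ = 10 and t = 20 the
upper-tail bound of Theorem 2.11 gives P(S_N ≥ 20) ≤ e^{−µ}(eµ/t)^t = e^{10}/2^{20} ≈ 0.021,
whereas Chebyshev's inequality (using Var(S_N) ≤ µ) only yields P(S_N ≥ 20) ≤ 10/10² = 0.1.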
Example 2.13 (Degrees of Random Graphs) We consider the Erdös-Rényi random graph
model. This is the simplest stochastic model for large, real-world networks. The random
graph G(n, p) consists of n vertices, and each pair of distinct vertices is connected inde-
pendently (from all other pairs) with probability p ∈ [0, 1]. The degree of a vertex is the
(random) number of edges incident to that vertex. The expected degree of every vertex i (we
label the n vertices by numbers 1, . . . , n) is

d(i) := E[D(i)] = (n − 1)p.

Note that d(i) = d(j) for all i ≠ j, i, j = 1, . . . , n, and thus we simply write d in the
following. We call the random graph dense when d ≥ log n holds, and the graph is called
regular if the degrees of all vertices are approximately equal to d.
♣
Proposition 2.14 (Dense graphs are almost regular) There is an absolute constant C > 0
such that the following holds: Suppose that G(n, p) has expected degree d ≥ C log n. Then,
with high probability, all vertices of G(n, p) have degrees between 0.9d and 1.1d.
Proof. Pick a vertex i of the graph; its degree is simply a sum of Bernoulli random
variables, i.e.,

D(i) = Σ_{k=1}^{n−1} X_k,  X_k ∼ Ber(p).

We are using the two-sided version of the Chernoff bound in Theorem 2.11, see
Exercise 2(a) of Example Sheet 1:

P(|D(i) − d| ≥ 0.1d) ≤ 2 e^{−cd} for some absolute constant c > 0.

We can now 'unfix' i by taking the union bound over all vertices of the graph:

P(∃ i ∈ {1, . . . , n} : |D(i) − d| ≥ 0.1d) ≤ Σ_{i=1}^n P(|D(i) − d| ≥ 0.1d) ≤ 2n e^{−cd}.

If d ≥ C log n for a sufficiently large absolute constant C, the probability on the left
hand side is bounded as

2n e^{−cd} ≤ 2n e^{−cC log n} = 2 n^{1−cC} ≤ 0.1.

This means that, with probability at least 0.9, the complementary event occurs and we have
0.9d ≤ D(i) ≤ 1.1d for all vertices i. 2
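The following minimal Python/NumPy sketch (the values of n and p are illustrative assumptions,
not taken from the notes) samples one realisation of G(n, p) in the dense regime and reports the
fraction of vertices whose degree lies in [0.9d, 1.1d].

import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 0.5                       # expected degree d = (n-1)p ~ 1000 >> log n ~ 7.6

# sample the adjacency matrix of G(n, p): independent edges above the diagonal
upper = np.triu(rng.random((n, n)) < p, k=1)
A = upper | upper.T
degrees = A.sum(axis=1)

d = (n - 1) * p
frac = np.mean((degrees >= 0.9 * d) & (degrees <= 1.1 * d))
print(f"expected degree d = {d:.1f}")
print(f"fraction of vertices with degree in [0.9d, 1.1d]: {frac:.4f}")

In this regime the printed fraction is typically 1.0000, in line with the proposition.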
Suppose the real-valued random variable X has mean µ ∈ R and there is some
constant b > 0 such that the function

Φ(λ) := E[e^{λ(X−µ)}],

the so-called centred moment generating function, exists for all |λ| ≤ b, that is, Φ(λ) ∈
R for all |λ| ≤ b. For λ ∈ [0, b], we may apply Markov's inequality in Theorem 1.21 to
the random variable Y = e^{λ(X−E[X])}, thereby obtaining the upper bound

P(X − µ ≥ t) = P(e^{λ(X−E[X])} ≥ e^{λt}) ≤ Φ(λ)/e^{λt}.

Optimising our choice of λ so as to obtain the tightest result yields the so-called
Chernoff bound, namely, the inequality

P(X − µ ≥ t) ≤ exp( inf_{0≤λ≤b} { log Φ(λ) − λt } ).
Example 2.15 (Gaussian tail bounds) Let X ∼ N(µ, σ²), µ ∈ R, σ² > 0, and recall the
probability density function (p.d.f.) of X,

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.

The moment generating function (MGF) is easily computed as

M_X(λ) = E[e^{λX}] = (1/√(2πσ²)) ∫_R e^{−(x−µ)²/(2σ²) + λx} dx = (e^{λµ}/√(2πσ²)) ∫_R e^{−y²/(2σ²) + λy} dy
 = e^{λµ + λ²σ²/2} < ∞

for all λ ∈ R. We obtain the Chernoff bound by optimising over λ ≥ 0 using Φ(λ) =
e^{−λµ} M_X(λ) = e^{λ²σ²/2}:

inf_{λ≥0} { log Φ(λ) − λt } = inf_{λ≥0} { λ²σ²/2 − λt } = −t²/(2σ²).

Thus any N(µ, σ²) random variable X satisfies the upper deviation inequality

P(X ≥ µ + t) ≤ exp( −t²/(2σ²) ),  t ≥ 0,

and the two-sided version

P(|X − µ| ≥ t) ≤ 2 exp( −t²/(2σ²) ),  t ≥ 0.

♣
Exercise 2.16 (Moments of the normal distribution) Let p ≥ 1 and X ∼ N(0, 1). Then

‖X‖_{L^p} = (E[|X|^p])^{1/p} = ( (1/√(2π)) ∫_R |x|^p e^{−x²/2} dx )^{1/p} = ( (2/√(2π)) ∫_0^∞ x^p e^{−x²/2} dx )^{1/p}
 = ( (2/√(2π)) ∫_0^∞ (2w)^{p/2} e^{−w} (1/√(2w)) dw )^{1/p}   (substituting x²/2 = w)
 = √2 ( Γ(p/2 + 1/2) / Γ(1/2) )^{1/p}.

Hence

‖X‖_{L^p} = O(√p) as p → ∞.

K
We also note that the MGF of X ∼ N(0, 1) is given as M_X(λ) = e^{λ²/2}. In the
following proposition we identify equivalent properties for a real-valued random vari-
able which exhibits similar tail bounds, moment bounds and MGF estimates as a
Gaussian random variable does.
Proposition 2.17 (Sub-Gaussian characterisations) Let X be a real-valued random variable. The
following properties are equivalent, in the sense that each implies the others with the parameters
C_1, . . . , C_4 > 0 differing from each other by at most an absolute constant factor.

(i) Tails of X:
P(|X| ≥ t) ≤ 2 exp( −t²/C_1² ) for all t ≥ 0.

(ii) Moments:
‖X‖_{L^p} ≤ C_2 √p for all p ≥ 1.

(iii) MGF of X²:
E[exp(λ²X²)] ≤ exp(C_3² λ²) for all λ with |λ| ≤ 1/C_3.

(iv) MGF of X² bounded at some point:
E[exp(X²/C_4²)] ≤ 2.

Moreover, if E[X] = 0, then the statements (i)-(iv) are equivalent to the following
statement.

(v) MGF of X:
E[exp(λX)] ≤ exp(C_5² λ²) for all λ ∈ R.
Proof. (i) ⇒ (ii): Without loss of generality we can assume that C_1 = 1 for property
(i). This can be seen by just multiplying the parameter t with C_1, which corresponds
to multiplying X by 1/C_1. The integral identity in Lemma 1.9 applied to the non-negative
random variable |X|^p gives

E[|X|^p] = ∫_0^∞ P(|X|^p > t) dt = ∫_0^∞ P(|X| ≥ u) p u^{p−1} du   (substituting t = u^p)
 ≤ 2 ∫_0^∞ e^{−u²} p u^{p−1} du   (by (i))
 = p Γ(p/2) ≤ p (p/2)^{p/2}   (substituting u² = s),

where we used the property Γ(x) ≤ x^x of the Gamma function. 5 Taking the p-th
root gives ‖X‖_{L^p} ≤ p^{1/p} √(p/2), and p^{1/p} ≤ 2 gives (ii) with some C_2 ≤ 2. To see that p^{1/p} ≤ 2,
recall that lim_{n→∞} n^{1/n} = 1. More precisely, in that proof one takes n^{1/n} = 1 + δ_n and
shows with the binomial theorem that 0 < δ_n < √(2/(n − 1)) for all n ≥ 4. For p = 1, 2, 3,
we directly obtain the estimates 1 = 1, √2 ≤ 2 and 3^{1/3} ≤ 2.
(ii) ⇒ (iii): Again without loss of generality we can assume that C_2 = 1. Taylor
expansion of the exponential function and linearity of the expectation gives

E[exp(λ²X²)] = 1 + Σ_{k=1}^∞ λ^{2k} E[X^{2k}] / k!.

Property (ii) implies that E[X^{2k}] ≤ (2k)^k, and Stirling's formula gives 6 that k! ≥ (k/e)^k.
Thus

E[exp(λ²X²)] ≤ 1 + Σ_{k=1}^∞ (2λ²k)^k / (k/e)^k = Σ_{k=0}^∞ (2eλ²)^k = 1/(1 − 2eλ²),

provided that 2eλ² < 1 (geometric series). To bound the right hand side, i.e., to
bound 1/(1 − 2eλ²), we use that

1/(1 − x) ≤ e^{2x} for all x ∈ [0, 1/2].
5 Γ(n) = (n − 1)!, n ∈ N, and

Γ(z) := ∫_0^∞ x^{z−1} e^{−x} dx for all z = x + iy ∈ C with x > 0,  Γ(z + 1) = zΓ(z),  Γ(1) = 1,  Γ(1/2) = √π.
6 Stirling's formula:

N! ∼ √(2πN) e^{−N} N^N, where ∼ means that the quotient of the two sides goes to one as N → ∞.

Alternatively,

N! = √(2πN) e^{−N} N^N ( 1 + 1/(12N) + O(1/N²) ).
For all |λ| ≤ 1/(2√e) we have 2eλ² < 1/2, and thus

E[exp(λ²X²)] ≤ exp(4eλ²) for all |λ| ≤ 1/(2√e),

and (iii) follows with C_3 = 2√e.

(iii) ⇒ (iv): Trivial and left as an exercise.

(iv) ⇒ (i): Again, we assume without loss of generality that C_4 = 1. Using Markov's
inequality 1.21 with f = exp ∘ x², we obtain

P(|X| ≥ t) = P(e^{X²} ≥ e^{t²}) ≤ e^{−t²} E[e^{X²}] ≤ 2e^{−t²},

using (iv) in the last step.
Suppose now that E[X] = 0. We show that (iii) ⇒ (v) and (v) ⇒ (i).

(iii) ⇒ (v): Again, we assume without loss of generality that C_3 = 1. We use the
well-known inequality

e^x ≤ x + e^{x²},  x ∈ R,

to obtain

E[e^{λX}] ≤ E[λX + e^{λ²X²}] = E[e^{λ²X²}] ≤ e^{λ²} if |λ| ≤ 1,

using (iii) in the last step, and thus we have (v) for the range |λ| ≤ 1. Now, assume |λ| ≥ 1. This time, use the
inequality 2λx ≤ λ² + x², λ, x ∈ R, to arrive at

E[e^{λX}] ≤ e^{λ²/2} E[e^{X²/2}] ≤ e^{λ²/2} exp(1/2) ≤ e^{λ²} as |λ| ≥ 1.
Exercise 2.19 (Sub-Gaussian norm) Let X be a sub-Gaussian random variable and define

‖X‖_{ψ_2} := inf{ t > 0 : E[exp(X²/t²)] ≤ 2 }.

Show that ‖·‖_{ψ_2} is indeed a norm on the space of sub-Gaussian random variables.
KK
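As a concrete example, for X ∼ N(0, 1) one computes E[exp(X²/t²)] = (1 − 2/t²)^{−1/2} for t² > 2,
so the defining condition E[exp(X²/t²)] ≤ 2 holds exactly when 1 − 2/t² ≥ 1/4, i.e. t² ≥ 8/3;
hence ‖X‖_{ψ_2} = √(8/3) ≈ 1.63.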
Exercise 2.21 (Exponential moments) Show that if X ∼ N(0, 1), the function

R ∋ λ ↦ E[exp(λ²X²)]

of the random variable X² is finite only in some bounded neighbourhood of zero. Determine this neighbourhood.
K

Solution. Recall that when X ∼ N(0, 1), X has the probability density (√(2π))^{−1} e^{−x²/2}.
Thus

E[exp(λ²X²)] = (1/√(2π)) ∫_R e^{λ²x²} e^{−x²/2} dx = (1/√(2π)) ∫_R exp( (λ² − 1/2) x² ) dx < ∞

⇔ (λ² − 1/2) < 0 ⇔ λ ∈ ( −√(1/2), √(1/2) ),

because λ ∈ ( −√(1/2), √(1/2) ) ensures that the integrand is (up to normalisation) a so-called log-concave
density with finite integral, where a log-concave density is a probability density f
which can be written as

f(x) = exp(ϕ(x)), with ϕ being concave, i.e., ϕ″(x) ≤ 0.
Recall that the sum of independent normal random variables X_i ∼ N(µ_i, σ_i²), i =
1, . . . , N, is normally distributed, that is,

S_N = Σ_{i=1}^N X_i ∼ N( Σ_{i=1}^N µ_i, Σ_{i=1}^N σ_i² ).

For a proof see [Dur19] or [Geo12]. We may then wonder if the sum of independent
sub-Gaussian random variables is sub-Gaussian as well. There are different state-
ments on this, depending on whether µ_i = 0 for all i = 1, . . . , N, as done in the book
[Ver18], or the means do not vanish. We shall follow the general case here. It proves
useful to have the following equivalent definition of sub-Gaussian random variables.
Definition 2.22 (Sub-Gaussian random variable, MGF version) A real-valued random variable X
with mean µ = E[X] is called sub-Gaussian if there is a constant σ > 0 such that

E[e^{λ(X−µ)}] ≤ e^{σ²λ²/2} for all λ ∈ R. (2.5)

The constant σ satisfying (2.5) is referred to as the sub-Gaussian parameter; for instance,
we say that X is sub-Gaussian with parameter σ when (2.5) holds.
Remark 2.23 (Sub-Gaussian definitions:) It is easy to see that our two definitions are equiv-
alent when the random variable X has zero mean µ = 0: use statement (v) of Proposi-
tion 2.17. If µ 6= 0 and X satisfies (2.5), we obtain the following tail bound
P(|X| ≥ t) ≤ 2 exp( −(t − µ)²/(2σ²) ) for all t ≥ 0. (2.6)
This tail bound is not exactly the statement (i) of Proposition 2.17 as the parameter t ≥ 0
is chosen with respect to the mean. In most cases, one is solely interested in tail estimates
away from the mean. Thus the definitions are equivalent in case µ 6= 0 if we limit the range
for parameter t to t ≥ |µ|. In the literature and in applications Definition 2.22 is widely used
and sometimes called sub-Gaussian for centred random variables. We use both definitions
synonymously.
We now show that the sum of sub-Gaussian random variables is again sub-
Gaussian, and we do so for each of the two definitions separately.
Proposition 2.24 (Sums of independent sub-Gaussians) Let X_1, . . . , X_N be independent
mean-zero sub-Gaussian random variables. Then S_N := Σ_{i=1}^N X_i is sub-Gaussian as well, and

‖ Σ_{i=1}^N X_i ‖²_{ψ_2} ≤ C Σ_{i=1}^N ‖X_i‖²_{ψ_2}, (2.7)

where C > 0 is an absolute constant.

Proof. Using independence,

E[exp(λS_N)] = ∏_{i=1}^N E[exp(λX_i)] ≤ ∏_{i=1}^N exp( Cλ² ‖X_i‖²_{ψ_2} ) = exp(λ²K²),

where K² := C Σ_{i=1}^N ‖X_i‖²_{ψ_2}. The inequality follows directly from Proposition 2.17,
see Exercise 2.42. Recall the equivalence of (v) and (iv) in Proposition 2.17 to see
that the sum S_N is sub-Gaussian and

‖S_N‖_{ψ_2} ≤ C_1 K for an absolute constant C_1 > 0. 2
Proposition 2.25 (Hoeffding bounds for sums of independent sub-Gaussian random variables)
Let X_1, . . . , X_N be real-valued independent sub-Gaussian random variables with sub-Gaussian
parameters σ_i and means µ_i, i = 1, . . . , N, respectively. Then, for all t ≥ 0,

P( Σ_{i=1}^N (X_i − µ_i) ≥ t ) ≤ exp( − t² / (2 Σ_{i=1}^N σ_i²) ).

Proof (sketch). By (2.5) and independence, P( Σ_{i=1}^N (X_i − µ_i) ≥ t ) ≤ e^{−λt} exp( λ² Σ_{i=1}^N σ_i² / 2 )
for every λ > 0. Optimising over λ, we obtain λ = t / Σ_{i=1}^N σ_i², and conclude with the statement.
2
Example 2.26 (Bounded random variables) Let X be a real-valued random variable with
mean zero and (bounded) support in [a, b], a < b. Denote by X′ an independent copy of X,
i.e., X ∼ X′. For any λ ∈ R, using Jensen's inequality 1.11,

E_X[e^{λX}] = E_X[ e^{λ(X − E_{X′}[X′])} ] ≤ E_{X,X′}[ e^{λ(X − X′)} ],

where E_{X,X′} is the expectation with respect to the two independent and identically distributed
random variables X and X′. Suppose ε is a Rademacher function independent of X and X′,
that is, ε is a symmetric Bernoulli random variable. Then (X − X′) ∼ ε(X − X′).
For Rademacher functions/symmetric Bernoulli random variables ε we estimate the moment
generating function as

E[e^{λε}] = (1/2)(e^λ + e^{−λ}) = (1/2)( Σ_{k=0}^∞ λ^k/k! + Σ_{k=0}^∞ (−λ)^k/k! ) = Σ_{k=0}^∞ λ^{2k}/(2k)!
 ≤ 1 + Σ_{k=1}^∞ λ^{2k}/(2^k k!) = e^{λ²/2}.

Thus

E_{X,X′}[e^{λ(X−X′)}] = E_{X,X′}[ E_ε[ e^{λε(X−X′)} ] ] ≤ E_{X,X′}[ exp( λ²(X − X′)²/2 ) ] ≤ E_{X,X′}[ exp( λ²(b − a)²/2 ) ],

as |X − X′| ≤ b − a. Thus X is sub-Gaussian with sub-Gaussian parameter σ = b − a > 0,
see Definition 2.22. ♣
As discussed above in Remark 2.23, in many situations and results we encounter
later, we typically assume that the random variables have zero means. Then the two
definitions are equivalent. In the next lemma we simply show that centering does
not harm the sub-Gaussian property. This way we can see that we can actually use
both our definitions for sub-Gaussian random variables.
Lemma 2.27 (Centering) Let X be a real-valued sub-Gaussian random variable. Then

‖X − E[X]‖_{ψ_2} ≤ C ‖X‖_{ψ_2}

for some absolute constant C > 0.

Proof. We use the fact that ‖·‖_{ψ_2} is a norm. The triangle inequality gives

‖X − E[X]‖_{ψ_2} ≤ ‖X‖_{ψ_2} + ‖E[X]‖_{ψ_2}.

For any constant a ∈ R and any t ≥ |a|/√(log 2) one has E[exp(a²/t²)] = exp(a²/t²) ≤ 2, thus

‖a‖_{ψ_2} = |a|/√(log 2) ≤ c|a|

for some constant c > 0. Hence

‖E[X]‖_{ψ_2} ≤ c|E[X]| ≤ c E[|X|] = c ‖X‖_{L¹} ≤ C ‖X‖_{ψ_2},

using ‖X‖_{L^p} ≤ C‖X‖_{ψ_2} √p for all p ≥ 1.
2
(ii) Moments:

‖X‖_{L^p} = (E[|X|^p])^{1/p} ≤ C_2 p for all p ≥ 1.

(iv) MGF of |X| bounded at some point:

E[exp(|X|/C_4)] ≤ 2.
Lemma 2.30 Equation (2.8) defines the sub-exponential norm ‖X‖_{ψ_1} on the set of all sub-
exponential random variables X.

Proof. Support class. 2
and thus X is sub-Gaussian. As for the norms, recall that kX 2 kψ1 is the infimum
of all numbers C > 0 satisfying E[exp(X 2 /C)] ≤ 2, whereas kXkψ2 is the infimum
of all numbers K > 0 satisfying E[exp(X 2 /K 2 )] ≤ 2. Putting C = K 2 , one obtains
kX 2 kψ1 = kXk2ψ2 .
2
Proof. Suppose that ‖X‖_{ψ_2} ≠ 0 (if ‖X‖_{ψ_2} = 0 and/or ‖Y‖_{ψ_2} = 0 the statement
trivially holds). Then X̃ := X/‖X‖_{ψ_2} is sub-Gaussian with ‖X̃‖_{ψ_2} = (1/‖X‖_{ψ_2}) ‖X‖_{ψ_2} = 1.
Thus we assume without loss of generality that ‖X‖_{ψ_2} = ‖Y‖_{ψ_2} = 1. To prove the
statement in the lemma we shall show that E[exp(X²)] ≤ 2 and E[exp(Y²)] ≤ 2 both
imply that E[exp(|XY|)] ≤ 2, where E[exp(|XY|)] ≤ 2 implies that ‖XY‖_{ψ_1} ≤ 1. We
are going to use Young's inequality:

ab ≤ a²/2 + b²/2 for a, b ∈ R.

Thus

E[exp(|XY|)] ≤ E[ exp( X²/2 + Y²/2 ) ] = E[ exp(X²/2) exp(Y²/2) ]   (Young's inequality)
 ≤ (1/2) E[ exp(X²) + exp(Y²) ] = (1/2)(2 + 2) = 2   (Young's inequality).
2
Example 2.33 (Exponential random variables) Suppose X ∼ Exp(α), α > 0. Then E[X] = 1/α
and Var(X) = 1/α². We compute, for every t > 1/α,

E[exp(|X|/t)] = ∫_0^∞ α e^{x/t} e^{−αx} dx = [ (−α/(α − 1/t)) e^{−x(α − 1/t)} ]_0^∞ = α/(α − 1/t) ≤ 2

if and only if t ≥ 2/α. Hence ‖X‖_{ψ_1} = 2/α.
♣
Example 2.34 (Sub-exponential but not sub-Gaussian) Suppose that Z ∼ N(0, 1), and
define X := Z². Then E[X] = 1. For λ < 1/2 we have

E[e^{λ(X−1)}] = (1/√(2π)) ∫_{−∞}^∞ e^{λ(z²−1)} e^{−z²/2} dz = e^{−λ}/√(1 − 2λ),

whereas for λ > 1/2 we have E[e^{λ(X−1)}] = +∞. Thus X is not sub-Gaussian. In fact one can
show, after some computation, that

e^{−λ}/√(1 − 2λ) ≤ e^{2λ²} = e^{4λ²/2} for all |λ| < 1/4.

This motivates the following alternative definition of sub-exponential random variables which
corresponds to the second definition for sub-Gaussian random variables in Definition 2.22.
♣
Remark 2.36 (a) The random variable X in Example 2.34 is sub-exponential with param-
eters (ν, α) = (2, 4).
(b) It is easy to see that our two definitions are equivalent when the random variable X has
zero mean µ = 0: use statement (v) of Proposition 2.28. If µ 6= 0 and X satisfies (2.10),
we obtain the following tail bound

P(|X| ≥ t) ≤ 2 exp( −(t − µ)²/(2ν²) ) for all t ≥ 0. (2.11)
This tail bound is not exactly the statement (i) of Proposition 2.28 as the parameter t ≥ 0
is chosen with respect to the mean. In most cases, one is solely interested in tail estimates
away from the mean. Thus the definitions are equivalent in case µ 6= 0 if we limit the
range for parameter t to t ≥ |µ|. In the literature and in applications Definition 2.35
is widely used and sometimes called sub-exponential for centred random variables. We
use both definitions synonymously.
Example 2.37 (Bounded random variable) Let X be a real-valued, mean-zero and bounded
random variable supported on the compact interval [a, b], a < b. Furthermore, let X′ be an
independent copy of X and let ε be a Rademacher function independent of both X and X′,
that is, ε is a symmetric Bernoulli random variable with P(ε = −1) = P(ε = 1) = 1/2.
Using Jensen's inequality for X′ we obtain

E_X[e^{λX}] = E_X[ e^{λ(X − E_{X′}[X′])} ] ≤ E_{X,X′}[ e^{λ(X−X′)} ] = E_{X,X′}[ E_ε[ e^{λε(X−X′)} ] ],

where we write E_{X,X′} for the expectation with respect to X and X′, E_ε for the expectation
with respect to ε, and where we used that (X − X′) ∼ ε(X − X′). Here, ∼ means that both
random variables have equal distribution. Hold α := (X − X′) fixed and compute

E_ε[e^{λεα}] = (1/2)[ e^{−λα} + e^{λα} ] = (1/2)( Σ_{k=0}^∞ (−λα)^k/k! + Σ_{k=0}^∞ (λα)^k/k! ) = Σ_{k=0}^∞ (λα)^{2k}/(2k)!
 ≤ 1 + Σ_{k=1}^∞ (λα)^{2k}/(2^k k!) = e^{λ²α²/2}.

Hence E_{X,X′}[e^{λ(X−X′)}] ≤ E_{X,X′}[ exp( λ²(X − X′)²/2 ) ] ≤ exp( λ²(b − a)²/2 ),

as |X − X′| ≤ b − a. ♣
|E[(X − µ)^k]| ≤ (1/2) k! σ² b^{k−2},  k = 2, 3, 4, . . . . (2.13)
Exercise 2.40 (a) Show that a bounded random variable X with |X − µ| ≤ b with variance
σ 2 > 0 satisfies the Bernstein condition (2.13) .
(b) Show that the bounded random variable X in (a) is sub-exponential and derive a bound
on the centred moment generating function
E[exp(λ(X − E[X]))] .
KK
Solution. (a) From our assumption we have E[(X − µ)²] = E[|X − µ|²] = σ² and
ess sup|X − µ|^{k−2} ≤ b^{k−2} ≤ b^{k−2} (1/2)k! for k ∈ N, k ≥ 2. Using Hölder's inequality we
obtain

E[ |X − µ|^{k−2} |X − µ|² ] ≤ E[|X − µ|²] ess sup|X − µ|^{k−2} ≤ σ² b^{k−2} (1/2)k!

for all k ∈ N, k ≥ 2.

(b) By power series expansion we have (using the Bernstein bound from (a))

E[e^{λ(X−µ)}] = 1 + λ²σ²/2 + Σ_{k=3}^∞ λ^k E[(X − µ)^k]/k! ≤ 1 + λ²σ²/2 + (λ²σ²/2) Σ_{k=3}^∞ (|λ|b)^{k−2},

and for |λ| < 1/b we can sum the geometric series to obtain

E[e^{λ(X−µ)}] ≤ 1 + (λ²σ²/2)/(1 − b|λ|) ≤ exp( (λ²σ²/2)/(1 − b|λ|) )

by using 1 + t ≤ e^t. Thus X is sub-exponential as we obtain

E[e^{λ(X−µ)}] ≤ exp( λ²(√2 σ)²/2 )

for all |λ| < 1/(2b).
2
Exercise 2.42 Restate property (v) of Proposition 2.17 in terms of the sub-Gaussian norm,
i.e., show that if E[X] = 0 then

E[exp(λX)] ≤ exp( Cλ² ‖X‖²_{ψ_2} ) for all λ ∈ R,

for some absolute constant C > 0.
Exercise 2.43 (Poisson distribution - various deviations) Let X ∼ Poi(λ), λ > 0. Then
the following holds.
Hint: Use the Poisson approximation in Theorem 1.36 in conjunction with the corresponding
concentration bounds for Bernoulli random variables. KK
We thus expect that the expectation of the Euclidean norm is approximately √n. We
will now see in a special case that the norm ‖X‖_2 is indeed very close to √n with
high probability.
Exercise 3.2 (Centering for sub-exponential random variables) Show the centering lemma
for sub-exponential random variables. This is an extension of Lemma 2.27 to sub-exponential
random variables: Let X be a real-valued sub-exponential random variable. Then
kX − E[X]kψ1 ≤ CkXkψ1 .
Exercise 3.3 (Bernstein's inequality) Let X_1, . . . , X_N be independent, mean-zero, sub-exponential
random variables. Show that, for every t ≥ 0,

P( | (1/N) Σ_{i=1}^N X_i | ≥ t ) ≤ 2 exp( − c min{ t²/K², t/K } N ),

for some absolute constant c > 0 and where K := max_{1≤i≤N} {‖X_i‖_{ψ_1}}. KK
Solution. Set S_N := (1/N) Σ_{i=1}^N X_i. As usual we start with

P(S_N ≥ t) ≤ e^{−λt} E[e^{λS_N}] = e^{−λt} ∏_{i=1}^N E[e^{(λ/N)X_i}],
and (v) in Proposition 2.28 implies that, writing X̃_i := X_i/N, there are c̃ > 0 and C > 0
such that

E[e^{λX̃_i}] ≤ exp( Cλ² ‖X̃_i‖²_{ψ_1} ) for |λ| ≤ c̃/K̃,

and ‖X̃_i‖²_{ψ_1} = (1/N²)‖X_i‖²_{ψ_1}, K̃ := max_{1≤i≤N}{‖X̃_i‖_{ψ_1}}. With σ̃² = Σ_{i=1}^N (1/N²)‖X_i‖²_{ψ_1} we
thus get

P(S_N ≥ t) ≤ exp( −λt + Cλ²σ̃² ) for |λ| ≤ c̃/K̃.

Define g(λ) := −λt + Cλ²σ̃². Then g′(λ) = −t + 2Cσ̃²λ. The zero of the derivative is
at λ = t/(2Cσ̃²). As long as t ≤ 2Cσ̃²c̃/K̃, this λ ≤ c̃/K̃ satisfies the constraint. For t > 2Cσ̃²c̃/K̃
we see that g′(λ) ≤ 0, and thus the function is monotonically decreasing and the
infimum will be attained at the upper bound for λ. Hence, optimising over λ one
obtains

λ = min{ t/(2Cσ̃²), c̃/K̃ }.

Inserting these values into the function g, we obtain

g(t/(2Cσ̃²)) = −t²/(4Cσ̃²) for t ≤ 2Cσ̃²c̃/K̃,

and for the other value we note that for t > (2Cσ̃²c̃)/K̃ we get
Remark 3.4 With high probability, e.g., with probability 0.99 (adjust the absolute constants
c > 0 and K > 0 accordingly), X stays within a constant distance from the sphere of radius
√n. S_n := ‖X‖²_2 has mean n and standard deviation O(√n):

Var(‖X‖²_2) = E[ ( ‖X‖²_2 − n )² ] = E[ Σ_{i,j=1}^n (X_iX_j)² − 2n‖X‖²_2 + n² ]
 = E[ Σ_{i,j=1}^n (X_iX_j)² ] − n² = Σ_{i=1}^n E[X_i⁴] + Σ_{i,j=1, i≠j}^n E[(X_iX_j)²] − n².

For any i ≠ j we have E[(X_iX_j)²] = E[X_i²]E[X_j²] = 1 (the X_i's are independent). Further-
more, we get E[X_i⁴] = O(1) because from (ii) in Proposition 2.17 we estimate

E[X_i⁴] = ‖X_i‖⁴_{L⁴} ≤ (2C)⁴ = 16C⁴ = O(1).

Thus Var(‖X‖²_2) ≤ 16C⁴n + n(n − 1) − n² = C′n = O(n), and therefore √(Var(‖X‖²_2)) =
O(√n). Hence ‖X‖_2 = √S_n deviates by O(1) around √n. Note that this follows from

√(n ± O(√n)) = √n √(1 ± O(1/√n)) = √n (1 ± O(1/√n)) = √n ± O(1).
We now apply the Bernstein inequality in Exercise 3.3 to obtain, for every u ≥ 0,

P( | (1/n)‖X‖²_2 − 1 | ≥ u ) ≤ 2 exp( − (cn/K⁴) min{u², u} ), (3.1)

where we used that K ≥ 1 implies K⁴ ≥ K². Note that for z ≥ 0 the inequality |z − 1| ≥ δ
implies the inequality

|z² − 1| ≥ max{δ, δ²}.

To see that, consider first z ≥ 1, which implies that z + 1 ≥ z − 1 ≥ δ and thus
|z² − 1| = |z − 1||z + 1| ≥ δ². For 0 ≤ z < 1 we have z + 1 ≥ 1 and thus |z² − 1| ≥ δ.
We apply this finding and (3.1) with u = max{δ, δ²} to obtain

P( | (1/√n)‖X‖_2 − 1 | ≥ δ ) ≤ P( | (1/n)‖X‖²_2 − 1 | ≥ max{δ, δ²} ) ≤ 2 exp( − (cn/K⁴) δ² ).

Setting t = δ√n, this reads

P( | ‖X‖_2 − √n | ≥ t ) ≤ 2 exp( − ct²/K⁴ ),

which shows that ‖X‖_2 − √n is sub-Gaussian. 2
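A minimal Python/NumPy sketch (sample sizes and dimensions are illustrative assumptions) of this
concentration for vectors with independent N(0, 1) coordinates:

import numpy as np

rng = np.random.default_rng(2)
for n in (100, 1_000, 4_000):
    X = rng.standard_normal((2_000, n))        # 2,000 samples of X in R^n
    norms = np.linalg.norm(X, axis=1)
    dev = norms - np.sqrt(n)
    print(f"n={n:5d}   mean |deviation from sqrt(n)| = {np.abs(dev).mean():.3f}   "
          f"std of the norm = {norms.std():.3f}")

The typical deviation of ‖X‖_2 from √n stays of order one while √n grows, in line with Theorem 3.1
and Remark 3.4.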
Solution.

P(‖X‖²_2 ≤ ε²n) = P(−‖X‖²_2 ≥ −ε²n) ≤ e^{λε²n} E[exp(−λ‖X‖²_2)] = e^{λε²n} ∏_{i=1}^n E[exp(−λX_i²)],

and inserting

E[exp(−λX_i²)] ≤ ∫_R e^{−λx_i²} |f_i(x)| dx ≤ ∫_R e^{−λx_i²} dx = √(π/λ),

we obtain

P(‖X‖²_2 ≤ ε²n) ≤ exp( λε²n − (n/2) log(λ/π) ) ≤ (Cε)^n,

where the last inequality follows by optimising over λ and getting λ = 1/(2ε²).
S^{(n−1)}_R := {x ∈ R^n : ‖x‖_2 = R}

is called the n-dimensional sphere with radius R around the origin, and S^{(n−1)}_R(a) := {x ∈
R^n : ‖x − a‖_2 = R}. If R = 1, we write S^{(n−1)} respectively S^{(n−1)}(a).

vol(B^{(n)}_R) = ( π^{n/2} / ( (n/2) Γ(n/2) ) ) R^n,

area(S^{(n−1)}_R) = ( 2π^{n/2} / Γ(n/2) ) R^{n−1}.
Example 3.6 n = 3, R = 1. Note that Γ(3/2) = (1/2)Γ(1/2) = (1/2)√π, and thus area(S^{(2)}) = 4π
and vol(B^{(3)}) = (4/3)π. ♣
It remains to determine the surface area area(S(n−1) ), that is, the surface integral
in (3.2) for general n ∈ N. In principle one can use the generalisation of the polar
coordinates from n = 3 to any higher dimensions. This is slightly elaborate, and we
therefore show a different and easier way to compute that area. For any n ∈ N we
have

I(n) := ∫_R · · · ∫_R e^{−(x_1² + · · · + x_n²)} dx_n · · · dx_1 = (√π)^n = π^{n/2}. (3.3)

Alternatively, we can compute I(n) in (3.3) using polar coordinates with differential
r^{n−1} dr and the change of variables t = r² in the integral,

I(n) = ∫_{S^{(n−1)}} dσ ∫_0^∞ e^{−r²} r^{n−1} dr = area(S^{(n−1)}) ∫_0^∞ e^{−r²} r^{n−1} dr
 = area(S^{(n−1)}) ∫_0^∞ e^{−t} t^{(n−1)/2} (1/2) t^{−1/2} dt (3.4)
 = area(S^{(n−1)}) (1/2) ∫_0^∞ e^{−t} t^{n/2 − 1} dt = area(S^{(n−1)}) (1/2) Γ(n/2).

Thus

area(S^{(n−1)}) = 2π^{n/2} / Γ(n/2). (3.5)
Notation 3.7 (Landau symbols) Asymptotic analysis is concerned with the behaviour of a
function f(n), n ∈ N, as n → ∞. Suppose f, g : N → R_+ (or R). We define the following
Landau symbols, called big-O and little-o.

• f(n) is O(g(n)) if there is a constant C > 0 such that f(n) ≤ C g(n) for all n ∈ N.

• f(n) is o(g(n)) if lim_{n→∞} f(n)/g(n) = 0.

• f(n) ∼ g(n) if lim_{n→∞} f(n)/g(n) = 1.
We now discuss briefly the fact that most of the volume of high-dimensional ob-
jects (sets with non-vanishing volume) is near the surface of that object. Let A ⊂ Rn
be a set with non-vanishing volume, i.e., vol(A) > 0, and pick ε > 0 small. Now we
shrink A by a small amount ε to produce
(1 − ε)A := {(1 − ε)x : x ∈ A} .
Then the following holds,
vol((1 − ε)A) = (1 − ε)^n vol(A). (3.6)
To see (3.6), partition the set A into infinitesimal cubes (for a Riemann sum approxi-
mate of the volume integral). Then, (1 − ε)A is the union of the set of cubes obtained
by shrinking the cubes of the partition of A by a factor (1 − ε). If we shrink each of
the 2n sides of an n-dimensional cube Q by (1 − ε), its volume vol((1 − ε)Q) shrinks
by the factor (1 − ε)n . Using that 1 − x ≤ e−x , we get the following estimate of the
ratio of the volumes:
vol((1 − ε)A) / vol(A) = (1 − ε)^n ≤ e^{−nε}. (3.7)
Thus nearly all of the volume of A must be in the portion of A that does not belong
to the region (1 − ε)A. For the unit ball B(n) we have at least a (1 − e−εn ) fraction of
the volume vol(B(n) ) of the unit ball concentrated in B(n) \ (1 − ε)B(n) , namely in a small
annulus of width ε at the boundary.
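For instance, in dimension n = 100 and with ε = 0.05, the shrunken set (1 − ε)A carries at most
e^{−5} ≈ 0.7% of the volume of A; more than 99% of the volume lies within relative distance 5%
of the boundary.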
Proposition 3.8 (Volume near the equator) For c ≥ 1 and n ≥ 3, at least a 1 − (2/c) e^{−c²/2}
fraction of the volume vol(B^{(n)}) of the unit ball has |x_1| ≤ c/√(n − 1). Here, the coordinate x_1
points to the north pole.
Proof. By symmetry it suffices to prove that at most a (2/c) e^{−c²/2} fraction of the half
of the ball with x_1 ≥ 0 has x_1 ≥ c/√(n − 1). Denote by H = {x ∈ B^{(n)} : x_1 ≥ 0} the
upper (northern) hemisphere and A = {x ∈ B^{(n)} : x_1 ≥ c/√(n − 1)}. We need
to show that the ratio of the volumes is bounded as

vol(A)/vol(H) ≤ (2/c) e^{−c²/2}. (3.8)

We prove (3.8) by obtaining an upper bound for vol(A) and a lower bound for vol(H).
To calculate the volume vol(A), integrate an incremental volume that is a disk of
width dx_1 and whose face is a ball of dimension n − 1 and radius √(1 − x_1²). The
surface area of the disk is (1 − x_1²)^{(n−1)/2} vol(B^{(n−1)}) and the volume above the slice is

vol(A) = ∫_{c/√(n−1)}^1 (1 − x_1²)^{(n−1)/2} vol(B^{(n−1)}) dx_1.

We obtain an upper bound by using 1 − x ≤ e^{−x}, integrating up to infinity and by
inserting x_1 √(n − 1)/c ≥ 1 into the integral. Then

vol(A) ≤ vol(B^{(n−1)}) ∫_{c/√(n−1)}^∞ (x_1 √(n − 1)/c) e^{−(n−1)x_1²/2} dx_1
 = vol(B^{(n−1)}) (√(n − 1)/c) (1/(n − 1)) e^{−c²/2} = ( vol(B^{(n−1)}) / (c√(n − 1)) ) e^{−c²/2}. (3.9)

The volume of the hemisphere below the plane x_1 = 1/√(n − 1) is a lower bound on the
entire volume vol(H), and this volume is at least that of a cylinder of height 1/√(n − 1) and
radius √(1 − 1/(n − 1)). The volume of the cylinder is

vol(B^{(n−1)}) (1 − 1/(n − 1))^{(n−1)/2} (1/√(n − 1)).

Using the fact that (1 − x)^a ≥ 1 − ax for a ≥ 1, the volume of the cylinder is at least

vol(B^{(n−1)}) / (2√(n − 1))

for n ≥ 3. Thus we obtain (3.8) from our bounds.
2
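For instance, for n = 101 and c = 2, Proposition 3.8 states that at least a 1 − e^{−2} ≈ 0.86 fraction
of the volume of the unit ball lies in the slab |x_1| ≤ 2/√100 = 0.2.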
We consider the orthogonality of two random vectors. Draw two points at random
from the unit ball B^{(n)} ⊂ R^n. With high probability their vectors will be nearly orthog-
onal to each other. To understand that, recall from our previous considerations that
most of the volume of the n-dimensional unit ball B^{(n)} is contained in an annulus of
width O(1/n) near the boundary (surface), that is, we pick ε = c/n, c ≥ 1, and have
from (3.7) that

vol((1 − ε)B^{(n)}) / vol(B^{(n)}) ≤ e^{−εn}.

Thus at least a (1 − e^{−εn}) fraction of vol(B^{(n)}) is concentrated in the annulus of width ε
at the boundary. Equivalently, using our result about the volume near the equator in
Proposition 3.8, if one vector points to the north pole, the other vector has projection
in this direction of only ±O(1/√n), and thus their dot/inner/scalar product will be of
order ±O(1/√n).
Proposition 3.9 Suppose that we sample N points X^{(1)}, . . . , X^{(N)} uniformly from the unit
ball B^{(n)}. Then with probability 1 − O(1/N) the following holds:

(a) ‖X^{(i)}‖_2 ≥ 1 − (2 log N)/n for i = 1, . . . , N;

(b) ⟨X^{(i)}, X^{(j)}⟩ ≤ (6 log N)/√(n − 1) for all i, j = 1, . . . , N, i ≠ j.

Proof. (a) For any i = 1, . . . , N, the probability that ‖X^{(i)}‖_2 < 1 − ε is less than e^{−εn}.
Thus

P( ‖X^{(i)}‖_2 < 1 − (2 log N)/n ) ≤ e^{−((2 log N)/n) n} = 1/N².

By the union bound, the probability that there exists an i ∈ {1, . . . , N} such that ‖X^{(i)}‖_2 <
1 − (2 log N)/n is at most 1/N.

(b) From Proposition 3.8 we know that the component X_1^{(i)} in the direction of the north
pole satisfies

P( |X_1^{(i)}| > c/√(n − 1) ) ≤ (2/c) e^{−c²/2}.
There are (N choose 2) pairs i and j, and for each pair we define X^{(i)} as the direction of the
and Σ(X) is a symmetric and positive-semidefinite matrix which admits the spectral
decomposition

Σ(X) = Σ_{i=1}^n s_i u_i u_i^T, (3.13)

where u_i ∈ R^n are the eigenvectors of Σ(X) for the eigenvalues s_i. The second
moment matrix allows the principal component analysis (PCA). We order the eigen-
values of Σ(X) according to their size: s_1 ≥ s_2 ≥ · · · ≥ s_n. For large values of the
dimension n one aims to identify a few principal directions. These directions corre-
spond to the eigenvectors with the largest eigenvalues. For example, suppose that
the first m eigenvalues are significantly larger than the remaining n − m ones. This
allows one to reduce the dimension of the given data to R^m by neglecting all contribu-
tions from directions with eigenvalues significantly smaller than the chosen principal
ones.
(b) Let X ∈ Rn be a random vector with mean µ and invertible covariance matrix Σ =
cov(X). Show that then Z := Σ−1/2 (X − µ) is an isotropic mean-zero random vector.
K
Proof. X is isotropic if and only if (E[X_iX_j])_{i,j=1,...,n} = (δ_{i,j})_{i,j=1,...,n} = 1l_n. For every
x ∈ R^n we have

E[⟨X, x⟩²] = Σ_{i,j=1}^n E[x_iX_i x_jX_j] = Σ_{i,j=1}^n x_i E[X_iX_j] x_j = ⟨x, Σ(X)x⟩ = ‖x‖²_2.

E[‖X‖²_2] = n.

E[⟨X, Y⟩²] = n.
Proof. For the first statement we view X T X as a 1 × 1 matrix and take advantage
of the cyclic property of the trace operation on matrices:
2
Suppose X ⊥ Y are isotropic vectors in R^n, and consider the normalised ver-
sions X̄ := X/‖X‖_2 and Ȳ := Y/‖Y‖_2. From the concentration results in this chap-
ter we know that with high probability, ‖X‖_2 ∼ √n, ‖Y‖_2 ∼ √n, and ⟨X, Y⟩ ∼ √n.
Thus, with high probability,

|⟨X̄, Ȳ⟩| ∼ 1/√n.

Thus, in high dimensions, independent and isotropic random vectors are almost or-
thogonal.
Exercise 3.16 X ∼ Unif(√n S^{(n−1)}) is isotropic but the coordinates X_i, i = 1, . . . , n, of X
are not independent due to the condition X_1² + · · · + X_n² = n. KK
Let e_1 and e_2 be the two unit basis vectors in R². It suffices to show E[⟨X, e_i⟩²] = 1, i =
1, 2, as any vector can be written as a linear combination of the two basis vectors.

The symmetric Bernoulli distribution is isotropic. This can be easily seen again by
checking E[⟨X, e_i⟩²] = 1 for all i = 1, . . . , n, or for any x ∈ R^n by checking that

E[⟨X, x⟩²] = E[ Σ_{i=1}^n X_i² x_i² ] + E[ 2 Σ_{1≤i<j≤n} X_iX_j x_i x_j ] = ‖x‖²_2,

as the second term vanishes because the X_i are independent mean-zero random
variables.

Any random vector X = (X_1, . . . , X_n) ∈ R^n with independent mean-zero coordi-
nates X_i with unit variance Var(X_i) = 1 is an isotropic random vector in R^n.
For the following recall the definition of the normal distribution. See also the
appendix sheets distributed at the beginning of the lecture for useful Gaussian calculus
formulae.
Definition 3.17 (Multivariate Normal / Gaussian distribution) We say a random vector
Y = (Y1 , . . . , Yn ) ∈ Rn has standard normal distribution in Rn , denoted
Y ∼ N(0, 1l_n),

U Y ∼ N(0, 1l_n).

‖Z‖²_2 = Z^T Z = Y^T U^T U Y = Y^T Y = ‖Y‖²_2.

Thus we have shown that Z has the same characteristic function/Laplace transform
as Y ∼ N(0, 1l_n), and therefore Z = U Y ∼ N(0, 1l_n). 2
For large values of n the standard normal distribution N(0, 1l_n) is not concen-
trated around the origin; instead it is concentrated in a thin spherical shell around
the sphere of radius √n around the origin (a shell with width of order O(1)). From
Theorem 3.1 we obtain for Y ∼ N(0, 1l_n),

P( | ‖Y‖_2 − √n | ≥ t ) ≤ 2 exp( −Ct² ), for all t ≥ 0

and an absolute constant C > 0. Therefore with high probability ‖Y‖_2 ≈ √n, and
thus with high probability,

Y ≈ √n Θ ∼ Unif(√n S^{(n−1)}),

with the unit direction vector Θ = Y/‖Y‖_2. Henceforth, with high probability,

N(0, 1l_n) ≈ Unif(√n S^{(n−1)}).
Proof. Let x ∈ S^{(n−1)}. We are using Proposition 2.24 for the sum of independent
sub-Gaussian random variables:

‖⟨X, x⟩‖²_{ψ_2} = ‖ Σ_{i=1}^n x_i X_i ‖²_{ψ_2} ≤ C Σ_{i=1}^n x_i² ‖X_i‖²_{ψ_2} ≤ C max_{1≤i≤n} {‖X_i‖²_{ψ_2}}   (Prop. 2.24),

where we used that Σ_{i=1}^n x_i² = 1.
2
Theorem 3.21 (Uniform distribution on the sphere) Let X ∈ R^n be a random vector uni-
formly distributed on the Euclidean sphere in R^n with centre at the origin and radius √n,
i.e.,

X ∼ Unif(√n S^{(n−1)}).

Then X is sub-Gaussian, and ‖X‖_{ψ_2} ≤ C for some absolute constant C > 0.
We will actually present two different proofs of this statement. The first uses
concentration properties whereas the second employs a geometric approach.
Proof of Theorem3.21 - Version I. See [Ver18] page 53-54. 2
Proof of Theorem3.21 - Version II.
For convenience we will work on the unit sphere, so let us rescale

Z := X/√n ∼ Unif(S^{(n−1)}).

It suffices to show that ‖Z‖_{ψ_2} ≤ C/√n, which by definition means that ‖⟨Z, x⟩‖_{ψ_2} ≤ C/√n
for all unit vectors x. By rotation invariance, all marginals ⟨Z, x⟩ have the same
distribution, and hence without loss of generality, we may prove our claim for x =
e_1 = (1, 0, . . . , 0) ∈ R^n. In other words, we shall show that ‖Z_1‖_{ψ_2} ≤ C/√n.
Remark 3.22 The so-called Projective Central Limit Theorem tells us that marginals of the
uniform distribution on the sphere in R^n become asymptotically normally distributed as the
dimension n increases. Namely, if X ∼ Unif(√n S^{(n−1)}), then for any fixed unit vector u ∈ R^n
we have

⟨X, u⟩ −→ N(0, 1) in distribution as n → ∞.
4 Random Matrices
4.1 Geometric concepts
(b) Covering number: The smallest possible cardinality of an ε-net of K is the covering
number of K and is denoted N(K, d, ε). Equivalently, N(K, d, ε) is the smallest number
of closed balls with centres in K and radii ε whose union covers K.
(c) ε-separated sets: A subset N ⊂ T is ε-separated if d(x, y) > ε for all distinct points
x, y ∈ N .
Lemma 4.2 Let (T, d) be a metric space. Suppose that N is a maximal ε separated subset
of K ⊂ T . Here maximal means that adding any new point x ∈ K to the set N destroys the
ε-separation property. Then N is an ε-net.
Lemma 4.3 (Equivalence of packing and covering numbers) Let (T, d) be a metric space.
For any K ⊂ T and any ε > 0, we have
N(K, k·k2 , ε) ≤ #N .
A + B := {a + b : a ∈ A, b ∈ B} .
Proposition 4.5 (Covering numbers of the Euclidean ball) (a) Let K ⊂ R^n and ε > 0.
Denote |K| the volume of K and denote B^{(n)} = {x ∈ R^n : ‖x‖_2 ≤ 1} the closed unit
Euclidean ball. Then

|K| / |εB^{(n)}| ≤ N(K, ‖·‖_2, ε) ≤ P(K, ‖·‖_2, ε) ≤ |K + (ε/2)B^{(n)}| / |(ε/2)B^{(n)}|.

(b)

(1/ε)^n ≤ N(B^{(n)}, ‖·‖_2, ε) ≤ (2/ε + 1)^n.
Proof.

(a) The middle inequality follows from Lemma 4.3.

Lower bound: Let N := N(K, ‖·‖_2, ε). Then we can cover K by N balls with radii ε,
so |K| ≤ N|εB^{(n)}|.

Upper bound: Let N := P(K, ‖·‖_2, ε) and construct N closed disjoint balls B_{ε/2}(x_i) with
centres x_i ∈ K and radii ε/2. These balls might not fit entirely into the set K, but certainly
into the extended set K + (ε/2)B^{(n)}. Thus

N |(ε/2)B^{(n)}| ≤ |K + (ε/2)B^{(n)}|.

(b) The statement follows easily with part (a) and is left as an exercise. 2
Remark 4.6 To simplify the bound in Proposition 4.5, note that in the nontrivial range ε ∈
(0, 1] we have

(1/ε)^n ≤ N(B^{(n)}, ‖·‖_2, ε) ≤ (3/ε)^n.
Example 4.7 (Euclidean balls - volume and surface area) Suppose R > 0 and denote B^{(n)}_R
the ball of radius R around the origin and S^{(n−1)}_R its surface, i.e.,

B^{(n)}_R := {x ∈ R^n : ‖x‖_2 ≤ R} and S^{(n−1)}_R := {x ∈ R^n : ‖x‖_2 = R}.

Then

vol(B^{(n)}_R) = |B^{(n)}_R| = ( π^{n/2} / Γ(n/2 + 1) ) R^n,
area(S^{(n−1)}_R) = |S^{(n−1)}_R| = ( 2π^{n/2} / Γ(n/2) ) R^{n−1},   (4.2)

where Γ is the Gamma function. ♣
Definition 4.8 (Hamming cube) The Hamming cube H is the set of binary strings of length
n, i.e.,

H = {0, 1}^n.

Define the Hamming distance d_H between two binary strings as the number of bits where
they disagree, i.e.,
Definition 4.10 Let A be an m × n matrix with real entries. The matrix A represents a
linear map R^n → R^m.

Equivalently,

‖A‖ = max_{x∈S^{(n−1)}, y∈S^{(m−1)}} {⟨Ax, y⟩}.

(b) The singular values s_i = s_i(A) of the matrix A are the square roots of the eigenvalues
of both AA^T and A^TA,

s_i(A) = √(λ_i(AA^T)) = √(λ_i(A^TA)),

where s_i = s_i(A) are the singular values of A, the vectors u_i ∈ R^m are the left singular
vectors, and the vectors v_i ∈ R^n are the right singular vectors of A.
Remark 4.11 (a) The extreme singular values s_1(A) and s_n(A) (respectively s_r(A)) are respectively the smallest number M and the largest number m such that
m‖x‖₂ ≤ ‖Ax‖₂ ≤ M‖x‖₂ for all x ∈ R^n.
Thus
s_n(A)‖x − y‖₂ ≤ ‖Ax − Ay‖₂ ≤ s_1(A)‖x − y‖₂ for all x, y ∈ R^n.
(b) In terms of its spectrum, the operator norm of A equals the largest singular value of A,
s_1(A) = ‖A‖.
Lemma 4.12 (Operator norm on a net) Let ε ∈ [0, 1) and A be an m × n matrix. Then, for any ε-net N of S^(n−1), we have
sup_{x∈N} ‖Ax‖₂ ≤ ‖A‖ ≤ 1/(1 − ε) · sup_{x∈N} ‖Ax‖₂.
Proof. The lower bound is trivial as N ⊂ S^(n−1). For the upper bound pick an x ∈ S^(n−1) for which ‖A‖ = ‖Ax‖₂ and choose x₀ ∈ N for this x such that ‖x − x₀‖₂ ≤ ε. Then
‖Ax − Ax₀‖₂ ≤ ‖A‖‖x − x₀‖₂ ≤ ε‖A‖.
The triangle inequality implies that
‖Ax₀‖₂ ≥ ‖Ax‖₂ − ‖Ax − Ax₀‖₂ ≥ (1 − ε)‖A‖,
and taking the supremum over x₀ ∈ N gives the upper bound. 2
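As a quick sanity check (not part of the notes), the following sketch assumes numpy and compares ‖A‖ = s_1(A), computed via the SVD as in Remark 4.11(b), with the supremum of ‖Ax‖₂ over an approximate ε-net built greedily from a dense random sample of the sphere (an assumption: with a finite sample the net is only approximate).

import numpy as np

rng = np.random.default_rng(1)
m, n, eps = 5, 3, 0.25

A = rng.normal(size=(m, n))
op_norm = np.linalg.svd(A, compute_uv=False)[0]        # s_1(A) = ||A||

sample = rng.normal(size=(20000, n))
sample /= np.linalg.norm(sample, axis=1, keepdims=True)
net = [sample[0]]
for x in sample[1:]:                                   # maximal eps-separated subset
    if np.min(np.linalg.norm(np.asarray(net) - x, axis=1)) > eps:
        net.append(x)
net = np.asarray(net)

sup_on_net = np.max(np.linalg.norm(net @ A.T, axis=1))
print("sup over net:", sup_on_net, " ||A||:", op_norm,
      " sup/(1-eps):", sup_on_net / (1 - eps))          # Lemma 4.12 sandwich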
Exercise 4.13 Let N be an ε-net of S^(n−1) and M be an ε-net of S^(m−1). Show that for any m × n matrix A one has
sup_{x∈N, y∈M} ⟨Ax, y⟩ ≤ ‖A‖ ≤ 1/(1 − 2ε) · sup_{x∈N, y∈M} ⟨Ax, y⟩.
Theorem 4.15 (Norm of sub-Gaussian random matrices) Let A be an m × n random matrix with independent mean-zero sub-Gaussian random entries A_ij, i = 1, . . . , m, j = 1, . . . , n. Then, for every t > 0, we have that
‖A‖ ≤ CK(√m + √n + t)
with probability at least 1 − 2 exp(−t²), where K := max_{1≤i≤m, 1≤j≤n} ‖A_ij‖_ψ2 and where C > 0 is an absolute constant.
Proof.
Step 1: Approximation. Fix ε = 1/4. By Remark 4.6 there is an ε-net N of S^(n−1) with |N| ≤ 9^n and an ε-net M of S^(m−1) with |M| ≤ 9^m. By Exercise 4.13,
‖A‖ ≤ 2 max_{x∈N, y∈M} ⟨Ax, y⟩.
Step 2: Concentration Pick x ∈ N and y ∈ M. We compute (using the fact that the
single matrix entries are sub-Gaussian random variables with their norm bounded
by K)
‖⟨Ax, y⟩‖²_ψ2 ≤ C Σ_{i=1}^m Σ_{j=1}^n ‖A_ij x_i y_j‖²_ψ2 ≤ CK² Σ_{i=1}^m Σ_{j=1}^n x_i² y_j² = CK²,
for some absolute constant C > 0. We therefore obtain a tail bound for hAx, yi, i.e.,
for every u ≥ 0,
P(hAx, yi ≥ u) ≤ 2 exp ( − cu2 /K 2 ) ,
for some absolute constant c > 0.
Step 3: Union bound We unfix the choice of x and y in Step 2 by a union bound.
P( max_{x∈N, y∈M} ⟨Ax, y⟩ ≥ u ) ≤ Σ_{x∈N, y∈M} P(⟨Ax, y⟩ ≥ u) ≤ 9^(n+m) · 2 exp(−cu²/K²).
We continue the estimate by choosing u = CK(√n + √m + t), which leads to the lower bound u² ≥ C²K²(n + m + t²), and furthermore adjust the constant C > 0 such that cu²/K² ≥ 3(n + m) + t². Inserting these choices we get
P( max_{x∈N, y∈M} ⟨Ax, y⟩ ≥ u ) ≤ 9^(n+m) · 2 exp(−3(n + m) − t²) ≤ 2 exp(−t²),
and thus
P(‖A‖ ≥ 2u) ≤ 2 exp(−t²).
2
Corollary 4.16 Let A be an n × n symmetric random matrix whose entries on and above the diagonal are independent mean-zero sub-Gaussian random variables. Then, for every t > 0, we have
‖A‖ ≤ CK(√n + t)
with probability at least 1 − 4 exp(−t²), where K := max_{1≤i,j≤n} ‖A_ij‖_ψ2.
Proof. Exercise. Decompose the matrix into an upper-triangular part and a lower-
triangular part and use the proof of the previous theorem. 2
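The following short simulation (an illustration, not part of the notes) assumes numpy and standard Gaussian entries, so K is an absolute constant; it shows that the observed operator norm is indeed of order √m + √n, as Theorem 4.15 predicts up to the absolute constant.

import numpy as np

rng = np.random.default_rng(2)
m, n, trials = 200, 100, 50

norms = [np.linalg.norm(rng.normal(size=(m, n)), ord=2) for _ in range(trials)]
print("mean ||A||        :", np.mean(norms))
print("sqrt(m) + sqrt(n) :", np.sqrt(m) + np.sqrt(n))
# Theorem 4.15: ||A|| <= C K (sqrt(m) + sqrt(n) + t) with probability >= 1 - 2 exp(-t^2)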
We can actually improve the statement in Theorem 4.15 in two ways. First we
obtain a two-sided bound, and, secondly, we can relax the independence assump-
tion. As this is a refinement of our previous statement we only state the result and
omit its proof. The statement is used in covariance estimation below.
Proof. (a) The proof follows similarly to the proof of Theorem 4.15 and is thus left as an exercise. To show that (4.4) indeed implies (4.3) we use Lemma 4.18 below.
(b) To obtain the bound for the expected operator norm of the difference of (1/m) A^T A to the identity, we use the integral identity (1.9) for the real-valued random variable ‖(1/m) A^T A − 1l_n‖. This calculation is quite long, requires some computational work for the different cases, and is omitted.
2
Proof. We first assume without loss of generality that kxk2 = 1. Then our assump-
tions give
max{δ, δ 2 } ≥ |h(AT A − 1ln )x, xi| = |kAxk22 − 1| .
For every z ≥ 0 the following elementary inequality holds:
max{|z − 1|, |z − 1|2 } ≤ |z 2 − 1| . (4.6)
To show (4.6) use that for z ≥ 1 we have |z −1|2 = |z 2 −2z +1| ≤ |z 2 −2+1| = |z 2 −1|,
and for z ∈ [0, 1), use |z − 1| ≤ |z 2 − 1| as z 2 ≤ z for z ∈ [0, 1). Then use (4.6) with
kAxk2 to conclude that
|kAxk2 − 1| ≤ δ .
This implies both statements of the lemma. 2
Note that X_i ∼ X implies that E[Σ_N] = Σ. The law of large numbers yields
Σ_N → Σ almost surely as N → ∞.
We bring X, X^(1), . . . , X^(N) all into isotropic position. That is, there exist independent isotropic random vectors Z, Z^(1), . . . , Z^(N) such that
X = Σ^(1/2) Z and X^(i) = Σ^(1/2) Z^(i), i = 1, . . . , N.
Then
‖Σ_N − Σ‖ = ‖Σ^(1/2) R_N Σ^(1/2)‖ ≤ ‖R_N‖ ‖Σ‖,
where
R_N := (1/N) Σ_{i=1}^N (Z^(i))(Z^(i))^T − 1l_n.
Suppose now that A is the N × n matrix whose rows are (Z^(i))^T. Then
(1/N) A^T A − 1l_n = R_N,
and Theorem 4.17 for A implies that
E[‖R_N‖] ≤ CK² ( √(n/N) + n/N ),
and therefore E[‖Σ_N − Σ‖] ≤ CK² ( √(n/N) + n/N ) ‖Σ‖.
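A minimal numerical sketch of this covariance estimation bound, assuming numpy and an arbitrarily chosen positive definite Σ (all parameters are illustrative): the operator-norm error of the empirical covariance decays at the rate √(n/N), matching the bound above up to constants.

import numpy as np

rng = np.random.default_rng(3)
n = 20
B = rng.normal(size=(n, n))
Sigma = B @ B.T / n                       # a "true" covariance matrix (assumption)
L = np.linalg.cholesky(Sigma)

for N in [50, 200, 800, 3200]:
    X = rng.normal(size=(N, n)) @ L.T     # rows X^(i) with covariance Sigma
    Sigma_N = X.T @ X / N                 # empirical second-moment matrix
    err = np.linalg.norm(Sigma_N - Sigma, ord=2)
    print(f"N={N:5d}  ||Sigma_N - Sigma|| = {err:.4f}   sqrt(n/N) = {np.sqrt(n/N):.4f}")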
5 Concentration of measure - general case
5.1 Concentration by entropic techniques
Definition 5.1 (Entropy) The entropy of the random variable X for the convex function ϕ is
H_ϕ(X) = E[ϕ(X)] − ϕ(E[X]).
(c) ϕ(u) = u log u, u > 0, and ϕ(0) := 0. The function ϕ is a convex function on R₊ and continuous when we set 0 log 0 = 0. For any non-negative random variable Z ≥ 0, the ϕ-entropy is
H_ϕ(Z) = E[Z log Z] − E[Z] log E[Z].
♣
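The next sketch (an illustration, not part of the notes) assumes numpy and estimates the ϕ-entropy H(Z) = E[Z log Z] − E[Z] log E[Z] by Monte Carlo for Z = e^(λX) with X ∼ N(0, σ²); the closed-form comparison value is the one derived in Example 5.9 below.

import numpy as np

rng = np.random.default_rng(4)

def phi_entropy(z):
    # H_phi(Z) = E[Z log Z] - E[Z] log E[Z] for a non-negative sample z
    ez = z.mean()
    return np.mean(z * np.log(z)) - ez * np.log(ez)

lam, sigma = 0.7, 1.3
x = sigma * rng.normal(size=2_000_000)
z = np.exp(lam * x)                          # Z = e^{lambda X}

mgf = np.exp(lam**2 * sigma**2 / 2)          # M_X(lambda) for the Gaussian
print("Monte Carlo H(e^{lam X})  :", phi_entropy(z))
print("closed form (Example 5.9) :", 0.5 * lam**2 * sigma**2 * mgf)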
In the following we will drop the index ϕ for the entropy whenever we take ϕ(u) = u log u as in Example 5.3 (c). There are several reasons why this choice is particularly useful. In the next remark we show some connections of this entropy to other entropy concepts in probability theory.
Remark 5.4 Suppose that Ω is a finite sample space, and let p, q ∈ M1 (Ω) be two probabil-
ity measures (vectors) such that q(ω) = 0 implies p(ω) = 0. Define
X : Ω → R, ω ↦ X(ω) := p(ω)/q(ω) if q(ω) > 0, and X(ω) := 0 if q(ω) = p(ω) = 0.
Then, taking expectations with respect to q (note that E[X] = Σ_{ω∈Ω} p(ω) = 1),
H(X) = Σ_{ω∈Ω} X(ω) q(ω) log X(ω) − E[X] log E[X] = Σ_{ω∈Ω} p(ω) log( p(ω)/q(ω) ) =: H(p|q) =: D(p‖q),
where H(p|q) is the relative entropy of p with respect to q, a widely used functional in probability theory (e.g. large deviation theory) and information theory, and where D(p‖q) is the Kullback-Leibler divergence of p and q used in information theory.
Definition 5.5 Let Ω be a finite (respectively discrete) sample space and denote M1 (Ω) the
set of probability measures (vectors).
(a) The relative entropy with respect to q ∈ M1 (Ω) is defined as the mapping
H(·|q) : M₁(Ω) → [0, ∞]; p ↦ H(p|q) := Σ_{ω∈Ω} p(ω) log( p(ω)/q(ω) ) if q(ω) = 0 implies p(ω) = 0, and H(p|q) := +∞ otherwise.
(b) The Shannon entropy of an Ω-valued random variable X with probability density (distribution) p ∈ M₁(Ω) is defined as
H(X) ≡ H(p) = − Σ_{ω∈Ω} p(ω) log p(ω).
Lemma 5.6 Let Ω be a finite sample space and denote by M₁(Ω) the set of probability measures (vectors). Let q ∈ M₁(Ω). Then the relative entropy H(·|q) is strictly convex, continuous, and
H(p|q) = 0 ⇔ p = q.
Proof. Exercise. 2
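A small numerical sketch (not part of the notes, numpy assumed, with arbitrarily chosen vectors p and q) of the two quantities just defined; it illustrates that H(p|q) > 0 for p ≠ q and that the Shannon entropy is bounded by log|Ω|, as in Exercise 5.7(b) below.

import numpy as np

def relative_entropy(p, q):
    # H(p|q) = sum p log(p/q); +infinity unless q(w)=0 implies p(w)=0
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def shannon_entropy(p):
    # H(p) = -sum p log p, with the convention 0 log 0 = 0
    p = np.asarray(p, float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(p[m])))

p, q = [0.5, 0.25, 0.25], [1/3, 1/3, 1/3]
print("H(p|q) =", relative_entropy(p, q))
print("H(p)   =", shannon_entropy(p), " <= log|Omega| =", np.log(3))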
Exercise 5.7 Let Ω be a finite sample space and denote by M₁(Ω) the set of probability measures (vectors). Let X be an Ω-valued random variable with distribution p ∈ M₁(Ω). Its Shannon entropy is then H(X) = − Σ_{ω∈Ω} p(ω) log p(ω). We shall explore the connection between the entropy functional H_ϕ with ϕ(u) = u log u and the Shannon entropy:
(a) Consider the random variable Z := p(U), where U ∼ Unif(Ω) is uniformly distributed over Ω. Show that
H_ϕ(Z) = (1/|Ω|) ( log|Ω| − H(X) ).
(b) Use part (a) and Lemma 5.6 to show that Shannon entropy for a discrete random variable
is maximised by a uniform distribution.
The following example motivates the definition of the Shannon entropy and can be skipped on first reading.
Example 5.8 (Shannon entropy) Let Ω be a finite sample space. Let X^(i), i = 1, . . . , N, be independent identically distributed Ω-valued random variables and write X = (X^(1), . . . , X^(N)). The empirical measure of the sample vector X ∈ Ω^N is then
L^X_N = (1/N) Σ_{i=1}^N δ_{X^(i)}.
For µ ∈ M₁(Ω) with Nµ(x) ∈ N₀ for all x ∈ Ω, the number of samples with empirical measure µ is
N_N(µ) = #{ω ∈ Ω^N : L^ω_N = µ} = N! / Π_{x∈Ω} (Nµ(x))!,
where the second equality follows from the multinomial coefficient. For any µ ∈ M₁(Ω) pick µ_N ∈ M₁(Ω) with Nµ_N(x) ∈ N₀, x ∈ Ω, such that µ_N → µ as N → ∞. Then, using Stirling's formula, one can show that
H(µ) = lim_{N→∞} (1/N) log N_N(µ_N).    (5.1)
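A quick numerical check of (5.1), assuming numpy and Python's log-Gamma function (the measure µ is an arbitrary illustrative choice): the normalised log-counts approach the Shannon entropy as N grows.

import numpy as np
from math import lgamma

def log_multinomial(counts):
    # log( N! / prod_x (N mu(x))! ) via the log-Gamma function
    N = sum(counts)
    return lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)

mu = np.array([0.5, 0.3, 0.2])
print("Shannon entropy H(mu):", -np.sum(mu * np.log(mu)))
for N in [10, 100, 1000, 10000]:
    counts = np.round(mu * N).astype(int)
    counts[0] += N - counts.sum()            # force the counts to sum to N
    print(f"N={N:6d}  (1/N) log N_N(mu_N) = {log_multinomial(counts)/N:.4f}")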
Example 5.9 (Entropy - Gaussian random variable) Suppose X ∼ N(0, σ²), σ > 0. Then M_X(λ) = exp(λ²σ²/2) and M'_X(λ) = λσ² M_X(λ), and therefore
H(e^(λX)) = λ²σ² M_X(λ) − (λ²σ²/2) M_X(λ) = (1/2) λ²σ² M_X(λ).
♣
A bound on the entropy leads to a bound on the centred moment generating function Φ, see (2.3); this is the content of the so-called Herbst argument.
Proposition 5.10 (Herbst argument) Let X be a real-valued random variable and suppose that, for some σ > 0,
H(e^(λX)) ≤ (1/2) σ²λ² M_X(λ)
for λ ∈ I, with the interval I being either [0, ∞) or R. Then
log E[ exp(λ(X − E[X])) ] ≤ (1/2) λ²σ² for all λ ∈ I.    (5.3)
Remark 5.11 (a) If I = R, then the bound (5.3) is equivalent to asserting that the centred random variable X − E[X] is sub-Gaussian with parameter σ > 0.
(b) For I = [0, ∞), the bound (5.3) leads immediately to the one-sided tail estimate
P(X ≥ E[X] + t) ≤ exp( − t²/(2σ²) ) for all t ≥ 0.
Proof of Proposition 5.10. Suppose that I = [0, ∞). Using (5.2), our assumption turns into a differential inequality for the moment generating function,
λ M'_X(λ) − M_X(λ) log M_X(λ) ≤ (1/2) σ²λ² M_X(λ) for all λ ≥ 0.    (5.4)
Define now a function G(λ) := (1/λ) log M_X(λ) for λ ≠ 0, and then extend the function to 0 by continuity (L'Hospital's rule),
G(0) := lim_{λ→0} G(λ) = d/dλ log M_X(λ) |_{λ=0} = E[X].
Our assumptions on the existence of the MGF imply differentiability with respect to the parameter λ. Hence
G'(λ) = (1/λ) M'_X(λ)/M_X(λ) − (1/λ²) log M_X(λ),
and thus we can rewrite our differential inequality (5.4) as
G'(λ) ≤ (1/2) σ² for all λ ∈ I = [0, ∞).
For any λ₀ > 0 we can integrate both sides of the previous inequality over [λ₀, λ] to arrive at
G(λ) − G(λ₀) ≤ (1/2) σ² (λ − λ₀).
Now, letting λ₀ ↓ 0 and using the above definition of G(0), we get
G(λ) − E[X] = (1/λ) log M_X(λ) − (1/λ) log e^(λE[X]) ≤ (1/2) σ²λ,
and we conclude with the statement in (5.3). 2
Proposition 5.12 (Bernstein entropy bound) Suppose there exist B > 0 and σ > 0 such that the real-valued random variable X satisfies the following entropy bound.
Then
log E[ e^(λ(X−E[X])) ] ≤ σ²λ² (1 − Bλ)^(−1) for all λ ∈ [0, B^(−1)).    (5.5)
Remark 5.13 As a consequence of the Chernoff argument, Proposition 5.12 implies that X satisfies the upper tail bound
P(X ≥ E[X] + δ) ≤ exp( − δ²/(4σ² + 2Bδ) ) for all δ ≥ 0.    (5.6)
Proof of Proposition 5.12. We skip the proof of this statement as it employs similar
techniques as in the proof of Proposition 5.10.
2
So far, the entropic method has not provided substantial new insight as all con-
centration results are done via the usual Chernoff bound. We shall now study con-
centration for functions of many random variables.
Definition 5.15 (a) A function f : R^n → R is separately convex if, for every index k ∈ {1, . . . , n}, the univariate function
f_k : R → R, y_k ↦ f_k(y_k) := f(x₁, . . . , x_{k−1}, y_k, x_{k+1}, . . . , x_n),
is convex for every fixed choice of (x₁, . . . , x_{k−1}, x_{k+1}, . . . , x_n) ∈ R^(n−1).
(b) A function f : X → Y for metric spaces (X, dX ) and (Y, dY ) is Lipschitz continuous
(sometimes called locally Lipschitz continuous ) if for every x ∈ X there exists a
neighbourhood U ⊂ X such that f |U is globally Lipschitz continuous. Here f |U : U →
Y is the restriction of f to U .
(c) A function f : X → Y for metric spaces (X, d_X) and (Y, d_Y) is L-Lipschitz continuous (sometimes called globally Lipschitz continuous) if there exists L ∈ R such that
d_Y(f(x), f(y)) ≤ L d_X(x, y) for all x, y ∈ X.    (5.7)
The smallest constant L > 0 satisfying (5.7) is denoted ‖f‖_Lip. In the following some statements hold for global Lipschitz continuity and some only for local Lipschitz continuity.
|‖M‖ − ‖M'‖| ≤ ‖M − M'‖ ≤ ‖M − M'‖_F,
where ‖M‖_F := ( Σ_{i=1}^n Σ_{j=1}^d M_ij² )^(1/2) is the Euclidean norm of all entries of the matrix, called the Frobenius norm of the matrix M. The first inequality shows that f := ‖·‖ is 1-Lipschitz continuous. Thus Theorem 5.16 implies that, for every t ≥ 0,
P(‖M‖ ≥ E[‖M‖] + t) ≤ e^(−t²/16).
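The following simulation sketch (not part of the notes, numpy assumed, entries uniform on [−1, 1] as an illustrative bounded choice) compares the empirical upper-tail probability of ‖M‖ with the bound e^(−t²/16) above.

import numpy as np

rng = np.random.default_rng(5)
n, d, trials = 80, 60, 300

norms = np.array([np.linalg.norm(rng.uniform(-1, 1, size=(n, d)), ord=2)
                  for _ in range(trials)])
t = 3.0
print("mean ||M||                    :", norms.mean())
print("empirical P(||M|| >= mean + t):", np.mean(norms >= norms.mean() + t))
print("bound exp(-t^2/16)            :", np.exp(-t**2 / 16))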
Key ingredients for the proof of Theorem 5.16 are the following two lemmas.
Lemma 5.18 (Entropy bound for univariate functions) Let X and Y be two independent, identically distributed R-valued random variables. Denote by E_{X,Y} the expectation with respect to X and Y. For any function g : R → R the following statements hold:
(a)
H(e^(λg(X))) ≤ λ² E_{X,Y}[ (g(X) − g(Y))² e^(λg(X)) 1l{g(X) ≥ g(Y)} ] for all λ > 0.
(b) If in addition the random variable X is supported on [a, b], a < b, and the function g is convex and Lipschitz continuous, then
H(e^(λg(X))) ≤ λ²(b − a)² E[ (g'(X))² e^(λg(X)) ] for all λ > 0.
is still a function of X̄^k and is integrated with respect to E on the right-hand side of (5.9). We first finish the proof of Theorem 5.16.
Proof of Theorem 5.16. For k ∈ {1, . . . , n} and every fixed vector X̄^k ∈ R^(n−1) the function f_k is convex, and hence Lemma 5.18 implies, for all λ > 0 and every fixed vector X̄^k,
H( e^(λf_k(X_k)) | X̄^k ) ≤ λ²(b − a)² E_{X_k}[ (f'_k(X_k))² e^(λf_k(X_k)) | X̄^k ]
= λ²(b − a)² E_{X_k}[ ( ∂f/∂x_k (X₁, . . . , X_k, . . . , X_n) )² exp(λf(X₁, . . . , X_k, . . . , X_n)) | X̄^k ].
With Lemma 5.19 one obtains, writing X = (X₁, . . . , X_n),
H( e^(λf(X)) ) ≤ λ²(b − a)² E[ Σ_{k=1}^n E_{X_k}[ ( ∂f/∂x_k (X) )² exp(λf(X)) | X̄^k ] ]
= λ²(b − a)² E[ Σ_{k=1}^n ( ∂f/∂x_k (X) )² exp(λf(X)) ]
≤ λ²(b − a)² L² E[ e^(λf(X)) ],
where the equality follows from the fact that the single coordinates X_i, i = 1, . . . , n, are independent and thus E = E_{X₁} ⊗ · · · ⊗ E_{X_n}, and where we used the Lipschitz continuity of f leading to
‖∇f(X)‖₂² = Σ_{k=1}^n ( ∂f/∂x_k (X) )² ≤ L² almost surely.
The tail bound then follows from the Herbst argument in Proposition 5.10. 2
Proof of Lemma 5.18. Using the fact that X and Y are independent and identically distributed we have
H(e^(λg(X))) = E_X[λg(X)e^(λg(X))] − E_X[e^(λg(X))] log E_Y[e^(λg(Y))].
By Jensen's inequality,
log E_Y[e^(λg(Y))] ≥ E_Y[λg(Y)],
and thus, using the symmetry between X and Y, we obtain (we write E_{X,Y} for the expectation with respect to both X and Y when we want to distinguish expectations with respect to the single random variables; note that we can easily replace E_{X,Y} by E),
H(e^(λg(X))) ≤ E_X[λg(X)e^(λg(X))] − E_{X,Y}[e^(λg(X)) λg(Y)]
= (1/2) E_{X,Y}[ λ(g(X) − g(Y))(e^(λg(X)) − e^(λg(Y))) ]    (5.10)
= λ E[ (g(X) − g(Y))(e^(λg(X)) − e^(λg(Y))) 1l{g(X) ≥ g(Y)} ],
where the last equality follows by symmetry. Next we use the elementary estimate, for s ≥ t,
e^s − e^t = e^s(1 − e^(t−s)) ≤ e^s(1 − (1 + (t − s))) = e^s(s − t).
Applying this bound with s = λg(X) and t = λg(Y) to the inequality (5.10) yields
H(e^(λg(X))) ≤ λ² E[ (g(X) − g(Y))² e^(λg(X)) 1l{g(X) ≥ g(Y)} ].
Proof of Lemma 5.19. The key ingredient is the variational representation of the entropy (5.14), see Proposition 5.20 and Remark 5.21 below: it suffices to bound E[g(X)e^(λf(X))] for functions g with E[e^(g(X))] ≤ 1. Any such g can be decomposed as g(X) = Σ_{k=1}^n g^k(X_k, . . . , X_n) with
E[ exp(g^k(X_k, . . . , X_n)) | X_{k+1}, . . . , X_n ] = 1.
We use this decomposition within the variational representation (5.11), leading to the following chain of upper bounds:
E[g(X)e^(λf(X))] ≤ Σ_{k=1}^n E[ g^k(X_k, . . . , X_n) e^(λf(X)) ]    (5.12)
= Σ_{k=1}^n E_{X̄^k}[ E_{X_k}[ g^k(X_k, . . . , X_n) e^(λf(X)) | X̄^k ] ]
≤ Σ_{k=1}^n E_{X̄^k}[ H( e^(λf_k(X_k)) | X̄^k ) ],
where the last step uses the variational representation (5.11) conditionally on X̄^k.
Proposition 5.20 (Duality formula of the Entropy) Let Y be a non-negative R-valued random variable defined on a probability space (Ω, F, P) such that E[ϕ(Y)] < ∞, where ϕ(u) = u log u for u ≥ 0. Then
H(Y) = sup_{g∈U} { E[gY] },    (5.13)
Remark 5.21 (Variational representation of the entropy) One can easily extend the variational formula (5.13) in Proposition 5.20 to the set G = {g : Ω → R : E[e^g] ≤ 1}, namely
H(Y) = sup_{g∈G} { E[gY] }.    (5.14)
The statement of Theorem 5.22 amounts to the following concentration result: for every t ≥ 0,
P( |f(X) − E[f(X)]| ≥ t ) ≤ 2 exp( − Ct² / ‖f‖²_Lip ),    (5.15)
for some absolute constant C > 0. We already know this statement for linear functions, see Theorem 3.21, which says that when X ∼ Unif(√n S^(n−1)) we have that X (or any linear map of X) is sub-Gaussian. To prove the extension to any nonlinear Lipschitz function we need two fundamental results, the so-called isoperimetric inequalities, which we can only state in order not to overload the lecture.
Definition 5.23 Suppose f : R^n → R is some function. Its level sets (or sub-level sets) are
L_f(c) := {x ∈ R^n : f(x) ≤ c}, c ∈ R.
The ε-neighbourhood (blow-up) of a set A ⊂ R^n is
A_ε := {x ∈ R^n : ∃ y ∈ A : ‖x − y‖₂ ≤ ε} = A + εB^(n).
Theorem 5.25 (Isoperimetric inequality on the sphere) Let ε > 0. Then, among all A ⊂
S(n−1) with given area σn−1 (A), the spherical caps minimise the area of the ε-neighbourhood
σn−1 (Aε ),
Aε := {x ∈ S(n−1) : ∃ y ∈ A : kx − yk2 ≤ ε} .
A spherical cap C(a, ε) centred at a point a ∈ S(n−1) is the set
C(a, ε) := {x ∈ S(n−1) : kx − ak2 ≤ ε} .
The following lemma is a crucial step in the proof of Theorem 5.22.
Lemma 5.26 (Blow-up) For any A ⊂ √n S^(n−1), denote by σ the normalised area on the sphere. If σ(A) ≥ 1/2, then, for every t ≥ 0,
σ(A_t) ≥ 1 − 2 exp(−ct²)
for some absolute constant c > 0.
Proof. For the hemisphere H = {x ∈ √n S^(n−1) : x₁ ≤ 0} we have σ(A) ≥ 1/2 = σ(H). The t-neighbourhood H_t of the hemisphere H is a spherical cap, and the isoperimetric inequality in Theorem 5.25 gives
σ(A_t) ≥ σ(H_t).
We continue as in our proof of Theorem 3.21, noting that the normalised measure σ is the uniform probability measure on the sphere, so that
σ(H_t) = P(X ∈ H_t).
Recall that in that context X ∼ Unif(√n S^(n−1)), and thus X is sub-Gaussian according to Theorem 3.21. Because of
H_t ⊃ { x ∈ √n S^(n−1) : x₁ ≤ t/√2 }
we have
σ(H_t) ≥ P(X₁ ≤ t/√2) = 1 − P(X₁ > t/√2) ≥ 1 − 2 exp(−ct²),
for some absolute constant c > 0. 2
Proof of Theorem 5.22. Without loss of generality we assume that ‖f‖_Lip = 1. Let M denote a median of f(X), that is,
P(f(X) ≤ M) ≥ 1/2 and P(f(X) ≥ M) ≥ 1/2.
The set A = {x ∈ √n S^(n−1) : f(x) ≤ M} is a (sub-)level set of f with P(X ∈ A) ≥ 1/2. Then Lemma 5.26 implies that P(X ∈ A_t) ≥ 1 − 2 exp(−Ct²) for some absolute constant C > 0. We claim that, for every t ≥ 0,
P(X ∈ A_t) ≤ P(f(X) ≤ M + t).    (5.16)
To see (5.16), note that X ∈ A_t implies ‖X − y‖₂ ≤ t for some point y ∈ A. By our definition of the set A, f(y) ≤ M. As ‖f‖_Lip = 1, we have |f(X) − f(y)| ≤ ‖X − y‖₂ ≤ t, and thus
f(X) ≤ f(y) + ‖X − y‖₂ ≤ M + t,
which proves (5.16). Combining, P(f(X) ≤ M + t) ≥ 1 − 2 exp(−Ct²).
We now repeat our argument for −f: we define à = {x ∈ √n S^(n−1) : f(x) ≥ M}. Then P(X ∈ Ã) ≥ 1/2 and thus P(X ∈ Ã_t) ≥ 1 − 2 exp(−Ct²). Now X ∈ Ã_t implies ‖X − y‖₂ ≤ t for some y ∈ à with f(y) ≥ M, hence f(X) ≥ f(y) − t ≥ M − t, and thus P(f(X) ≥ M − t) ≥ 1 − 2 exp(−C̃t²) for some absolute constant C̃ > 0.
Combining our two estimates we obtain
P( |f(X) − M| ≥ t ) ≤ 4 exp(−Ĉt²)
for some absolute constant Ĉ > 0, and thus the tail estimate shows that
‖f(X) − M‖_ψ2 ≤ C
for some absolute constant C > 0. To replace the median M by the expectation E[f(X)], note that although the median is not unique it is a fixed real number determined by the distribution of the random variable X and the function f. We use the centering Lemma 2.27 to get
‖f(X) − E[f(X)]‖_ψ2 ≤ C.
2
Definition 5.27 (Isoperimetric problem) Let (E, d) be a metric space and B(E) its Borel σ-algebra, let P ∈ M₁(E, B(E)) be some given probability measure and X an E-valued random variable.
Isoperimetric problem: Given p ∈ (0, 1) and t > 0, find the sets A with P(X ∈ A) ≥ p for which P(d(X, A) ≥ t) is maximal, where d(x, A) := inf{d(x, y) : y ∈ A}.
Concentration function: the concentration function of X is
α(t) := sup{ P(X ∉ A_t) : A ∈ B(E), P(X ∈ A) ≥ 1/2 }, t > 0,
where A_t is the t-blow-up of the set A, A_t = {x ∈ E : d(x, A) < t}. For a given function f : E → R denote the median of f(X) by M f(X).
There are many isoperimetric inequalities; we only mention the Gaussian isoperimetric inequality as it is widely used. Recall that the Gaussian (standard normal) distribution is given by the probability measure γ_n ∈ M₁(R^n, B(R^n)),
γ_n(A) = (2π)^(−n/2) ∫_A e^(−‖x‖₂²/2) dx₁ · · · dx_n, A ∈ B(R^n).
Theorem 5.28 (Gaussian isoperimetric inequality) Let ε > 0 be given. Among all A ⊂
Rn with fixed Gaussian measure γn (A), the half-spaces minimise the Gaussian measure
γn (Aε ) of the ε-neighbourhood Aε .
From this we can obtain the following concentration result. The proof is using
similar steps as done above, and we leave the details as exercise for the reader.
where σ² = ‖ Σ_{i=1}^N E[X_i²] ‖ is the norm of the matrix variance of the sum.
For the proof we shall introduce a few well-known facts about matrix calculus.
Definition 5.31 (a) For any symmetric n × n matrix X with eigenvalues λ_i = λ_i(X) and corresponding eigenvectors u_i, the function of a matrix for any given f : R → R is defined as the n × n matrix
f(X) := Σ_{i=1}^n f(λ_i) u_i u_i^T.
In principle one can prove the Matrix Bernstein inequality in Theorem 5.30 with these two results from matrix analysis. If X is a random positive-definite matrix, then Lieb's and Jensen's inequality imply that, for any fixed symmetric matrix H,
E[ Trace exp(H + log X) ] ≤ Trace exp( H + log E[X] ).
Lemma 5.32 (Lieb's inequality for random matrices) Let H be a fixed n × n symmetric matrix and Z be a random n × n symmetric matrix. Then
E[ Trace exp(H + Z) ] ≤ Trace exp( H + log E[e^Z] ).
The proof of Theorem 5.30 follows below and is based on the following bound of
the moment generating function (MGF).
E[ exp(λX) ] ≤ exp( g(λ) E[X²] ), where g(λ) = (λ²/2) / (1 − |λ|K/3), for |λ| < 3/K.
Proof of Lemma 5.33. For real z with |z| < 3 we can easily obtain the following estimate:
e^z ≤ 1 + z + (1 / (1 − |z|/3)) · z²/2.
This can easily be derived from the Taylor series of the exponential function in conjunction with the lower bound p! ≥ 2 · 3^(p−2) and the geometric series. Details are left to the reader. Now with z = λx, |x| ≤ K, |λ| < 3/K, we obtain
e^(λx) ≤ 1 + λx + g(λ)x²,
and then, after taking the expectation and using E[X] = 0, we arrive at
E[e^(λX)] ≤ 1 + g(λ)E[X²] ≤ exp( g(λ)E[X²] ).    2
Proposition 5.34 (Expectation bound via the Bernstein Theorem) Under all the assumptions in Theorem 5.30 we have the tail bound
P( ‖ Σ_{i=1}^N X_i ‖ ≥ t ) ≤ 2n exp( − (t²/2) / (σ² + Kt/3) ).
Then
E[ ‖ Σ_{i=1}^N X_i ‖ ] ≤ 2σ( √π + √log(2n) ) + (4/3) K (1 + log(2n)).    (5.18)
Proof. Define b := K/3. Then the right-hand side in Theorem 5.30 reads
2n exp( − t² / (2(σ² + bt)) ).
We will use a crude union-type bound on the tail probability itself by observing that either σ² ≤ bt or σ² ≥ bt. Define Z := Σ_{i=1}^N X_i. For every t ≥ 0,
P(‖Z‖ ≥ t) ≤ 2n exp( − t²/(2(σ² + bt)) ) ≤ 2n max{ exp(−t/(4b)), exp(−t²/(4σ²)) }
≤ 2n exp(−t/(4b)) + 2n exp(−t²/(4σ²)).
We shall combine this with the trivial inequality P(‖Z‖ ≥ t) ≤ 1. Thus
P(‖Z‖ ≥ t) ≤ 1 ∧ 2n exp(−t/(4b)) + 1 ∧ 2n exp(−t²/(4σ²)),
and
E[‖Z‖] = ∫_0^∞ P(‖Z‖ ≥ t) dt ≤ I₁ + I₂,
with I₁ = ∫_0^∞ 1 ∧ 2n exp(−t/(4b)) dt and I₂ = ∫_0^∞ 1 ∧ 2n exp(−t²/(4σ²)) dt. Solve 2n exp(−t₁/(4b)) = 1 to obtain t₁ = 4b log(2n), and then
I₁ = ∫_0^(t₁) 1 dt + ∫_(t₁)^∞ 2n exp(−t/(4b)) dt = t₁ + 4b = 4b(1 + log(2n)).
Solve 2n exp(−t₂²/(4σ²)) = 1 to obtain t₂ = 2σ√log(2n), and then
I₂ = ∫_0^(t₂) 1 dt + ∫_(t₂)^∞ 2n exp(−t²/(4σ²)) dt = t₂ + √2 σ ∫_(√(2 log(2n)))^∞ 2n e^(−t̃²/2) dt̃ ≤ 2σ( √log(2n) + √π ),
where we substituted t̃ = t/(√2 σ) and, for the inequality, used another transformation, namely x = t̃ − √(2 log(2n)); equivalently, we may use 2n e^(−t̃²/2) ≤ e^(−x²/2) for all t̃ ≥ √(2 log(2n)). We arrive at
E[‖Z‖] ≤ 2σ( √π + √log(2n) ) + 4b(1 + log(2n)).
2
Proof of Theorem 5.30.
Step 1: Define S := Σ_{i=1}^N X_i and λ_m(S) := max_{1≤i≤n} λ_i(S), the largest eigenvalue of S. Then ‖S‖ = max{λ_m(S), λ_m(−S)}. All eigenvalues of e^(λS) are positive, hence the maximal eigenvalue of e^(λS) is bounded by the sum of all eigenvalues, that is, by the trace of e^(λS). Henceforth, for λ > 0,
P(λ_m(S) ≥ t) ≤ e^(−λt) E[ e^(λ λ_m(S)) ] = e^(−λt) E[ λ_m(e^(λS)) ] ≤ e^(−λt) E[ Trace(e^(λS)) ].
Step 2: We now turn to bound the expectation of the trace using Lieb's inequality, in particular Lemma 5.32. We write the exponent of e^(λS) separately, namely splitting off the last term of the sum, writing
Trace( exp( Σ_{i=1}^(N−1) λX_i + λX_N ) ).    (5.20)
Applying Lemma 5.32 conditionally with H = Σ_{i=1}^(N−1) λX_i and Z = λX_N gives
E[ Trace(e^(λS)) ] ≤ E[ Trace exp( Σ_{i=1}^(N−1) λX_i + log Ê[e^(λX_N)] ) ],
where Ê is the conditional expectation with respect to X_N conditioned on X₁, . . . , X_(N−1). We repeat this application of Lemma 5.32 successively for λX_(N−1), . . . , λX₁ to arrive at
E[ Trace(e^(λS)) ] ≤ Trace exp( Σ_{i=1}^N log E_{X_i}[ e^(λX_i) ] ).
Step 3: Using Lemma 5.33 to bound each term, log E_{X_i}[e^(λX_i)] ⪯ g(λ) E[X_i²] in the semidefinite order for |λ| < 3/K, so that
E[ Trace(e^(λS)) ] ≤ Trace exp( g(λ) Σ_{i=1}^N E[X_i²] ) ≤ n exp( g(λ) σ² ).
The minimum in the exponent −λt + g(λ)σ² is attained for λ = t/(σ² + 2Kt/3). We finally conclude with
P(λ_m(S) ≥ t) ≤ n exp( − (t²/2)/(σ² + Kt/3) ),
and repeating our steps for λ_m(−S) we conclude with the statement of the theorem.
2
Exercise 5.35 (Concentration for the unit sphere) Let f be a Lipschitz function on the unit sphere S^(n−1). Show that for X ∼ Unif(S^(n−1)) one has
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
Equivalently, for every t ≥ 0, we have
P( |f(X) − E[f(X)]| ≥ t ) ≤ 2 exp( − cnt² / ‖f‖²_Lip ).
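A simulation sketch of this exercise (not part of the notes), assuming numpy; the 1-Lipschitz test function f(x) = ‖x − e₁‖₂ and the constant 1/2 in the displayed reference bound are illustrative choices, since the absolute constant c is not specified.

import numpy as np

rng = np.random.default_rng(6)
n, samples = 400, 20000

X = rng.normal(size=(samples, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # X ~ Unif(S^(n-1))

f = np.linalg.norm(X - np.eye(n)[0], axis=1)           # a 1-Lipschitz function (assumption)
dev = np.abs(f - f.mean())
for t in [0.05, 0.1, 0.2]:
    print(f"t={t:4.2f}  empirical P(|f - Ef| >= t) = {np.mean(dev >= t):.4f}   "
          f"2 exp(-n t^2 / 2) = {2*np.exp(-n*t**2/2):.4f}")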
In Theorem 5.29 we have a concentration result for Lipschitz images of normally distributed random vectors using the Gaussian isoperimetric inequality. One can find similar concentration results for other metric spaces. Recall the Hamming cube and its metric d_H in Definition 4.8. We can define the uniform distribution on the Hamming cube H as the probability measure P(A) := |A|/2^n for any subset A ⊂ H. If X ∼ Unif(H), then the coordinates X_i of X are Bernoulli distributed with parameter 1/2. Then one can obtain the following concentration result.
Theorem 5.36 (Concentration for the Hamming cube) Suppose X ∼ Unif(H) and f : H → R. Then
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
E = {αu : α ∈ R}.
Then the set on the right-hand side generates a subspace E with dimension dim(E) = 1. So any concentration result for G_{n,m} includes the concentration for the sphere as a special case. We need a metric and a probability measure for G_{n,m}. The distance
(metric) between subspaces E and F can be defined as the operator norm
d(E, F ) := kPE − PF k ,
Theorem 5.37 Suppose that X ∼ Unif(G_{n,m}) and let f : G_{n,m} → R be some function. Then
‖f(X) − E[f(X)]‖_ψ2 ≤ C‖f‖_Lip / √n.
The proof of that theorem goes beyond what we can do in this lecture. It is based
on concentration results for the special orthogonal group and the fact that Gn,m can
be written as quotient of orthogonal groups.
5.4 Application - Johnson-Lindenstrauss Lemma
We now present the Johnson-Lindenstrauss Lemma for the N data points and prove the statement that the geometry of the given data is well preserved if we choose E to be a random subspace of dimension dim(E) = m ∼ log N, where ∼ means that in the limit N → ∞ the quotient of both sides converges to 1. Thus we consider random subspaces E ∼ Unif(G_{n,m}) and the space (G_{n,m}, d, P) with P being the uniform (Haar) measure. Note that we have the following invariance.
(a)
( E[ ‖Pz‖₂² ] )^(1/2) = √(m/n) ‖z‖₂.
Proof. (a) Without loss of generality we can assume that ‖z‖₂ = 1. This is possible as for z = 0 the statement holds trivially and for z ≠ 0 we can simply multiply by 1/‖z‖₂. Instead of a random projection P we can fix a projection and consider z ∼ Unif(S^(n−1)). According to the above mentioned invariance we can assume without loss of generality that P is the coordinate projection onto the first m coordinates in R^n. Thus
E[ ‖Pz‖₂² ] = E[ Σ_{i=1}^m z_i² ] = Σ_{i=1}^m E[z_i²] = m E[z₁²],
where the last equality follows from the fact that the coordinates z_i are identically distributed. As z ∼ Unif(S^(n−1)) we have
1 = Σ_{i=1}^n E[z_i²] = n E[z₁²],
and thus E[z₁²] = 1/n. Therefore
E[ ‖Pz‖₂² ] = m/n,
and thus (a).
(b) Define the function f : S^(n−1) → R by f(x) := ‖Px‖₂. Then f is a Lipschitz function with ‖f‖_Lip = 1:
|f(x) − f(y)| = | ‖Px‖₂ − ‖Py‖₂ | ≤ ‖P(x − y)‖₂ ≤ ‖x − y‖₂.
Statement (5.21) is not quite (b), but we can replace the expectation E[f(X)] by √(m/n) as follows. First note that due to Jensen's inequality
E[f(X)] ≤ ( E[f(X)²] )^(1/2) = √(m/n).
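A minimal Johnson-Lindenstrauss demonstration (not part of the notes, numpy assumed): synthetic data points in R^n are projected onto a random m-dimensional subspace, built here via a QR factorisation of a Gaussian matrix, and rescaled by √(n/m); the choice of m is illustrative. All pairwise distance ratios stay close to one.

import numpy as np

rng = np.random.default_rng(7)
n, N, m = 1000, 50, 24                      # ambient dimension, #points, target dim ~ log N

X = rng.normal(size=(N, n))                 # synthetic data (assumption)

Q, _ = np.linalg.qr(rng.normal(size=(n, m)))
P = Q.T                                     # orthonormal rows: projection onto a random subspace

Y = X @ P.T * np.sqrt(n / m)                # rescaled projection
d_orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
d_proj = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

iu = np.triu_indices(N, k=1)
ratios = d_proj[iu] / d_orig[iu]
print("distance ratios: min =", ratios.min(), " max =", ratios.max())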
For a finite set of data points H ⊂ R^n one applies the above concentration to the difference set
H − H = {x − y : x, y ∈ H}.
6 Basic tools in high-dimensional probability
6.1 Decoupling
For simplicity, we assume in the following that the random variables X_i have mean zero and unit variances,
E[X_i] = 0, Var(X_i) = 1, i = 1, . . . , n.
Then
E[⟨X, AX⟩] = Σ_{i,j=1}^n a_ij E[X_i X_j] = Σ_{i=1}^n a_ii = Trace(A).
We shall study concentration properties for chaos. This time we need to develop tools to overcome the fact that we have sums of not necessarily independent random variables. The idea is to use the decoupling technique, i.e., to study the following random bilinear form,
Σ_{i,j=1}^n a_ij X_i X'_j = X^T A X' = ⟨X, AX'⟩,    (6.1)
where X' = (X'₁, . . . , X'_n) is a random vector independent of X but with the same distribution as X. Obviously, the bilinear form in (6.1) is easier to study; e.g., when we condition on X' we simply obtain a linear form for the random vector X. The vector X' is called an independent copy of X, and conditioning on X',
⟨X, AX'⟩ = Σ_{i=1}^n c_i X_i with c_i = Σ_{j=1}^n a_ij X'_j,
is a random linear form for X depending on the conditioning on the independent copy X'.
Lemma 6.2 Let Y and Z be independent random vectors in R^n such that E_Y[Y] = E_Z[Z] = 0 (we write indices when we want to stress the expectation for a certain random variable). Then, for every convex function F : R^n → R, one has
E[F(Y)] ≤ E[F(Y + Z)].
We fix such a realisation of I until the end. The random variables (X_i)_{i∈I} and (X_j)_{j∈I^c} are independent and thus the distribution of the sum on the right-hand side will not change if we replace X_j by X'_j. Hence we replace X_j, j ∈ I^c, by X'_j on the right-hand side of (6.3) to get
E_X[ F(⟨X, AX⟩) ] ≤ E[ F( 4 Σ_{(i,j)∈I×I^c} a_ij X_i X'_j ) ].    (6.4)
where [n] = {1, . . . , n}. We now split the argument of the function F on the right-hand side of (6.5) into three terms,
Σ_{(i,j)∈[n]×[n]} a_ij X_i X'_j =: Y + Z₁ + Z₂,
with
Y := Σ_{(i,j)∈I×I^c} a_ij X_i X'_j, Z₁ := Σ_{(i,j)∈I×I} a_ij X_i X'_j, Z₂ := Σ_{(i,j)∈I^c×[n]} a_ij X_i X'_j.
We now condition on all random variables except (X'_j)_{j∈I} and (X_i)_{i∈I^c}. We denote the corresponding conditional expectation by Ẽ. The conditioning already fixes Y. Furthermore, Z₁ and Z₂ have zero conditional expectation. An application of Lemma 6.2 leads to
F(4Y) ≤ Ẽ[ F(4Y + 4Z₁ + 4Z₂) ],
and thus
E[F(4Y)] ≤ E[ F(4Y + 4Z₁ + 4Z₂) ].
This proves (6.5) and finishes the argument by taking the final expectation with respect to I. 2
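The following Monte Carlo sketch (not part of the notes) illustrates the decoupling inequality for a quadratic form with zero diagonal, assuming numpy, symmetric Bernoulli coordinates and the convex test function F(s) = max(s, 0); the matrix A is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(8)
n, trials = 30, 100000

A = rng.normal(size=(n, n))
np.fill_diagonal(A, 0.0)                 # decoupling concerns the off-diagonal part
F = lambda s: np.maximum(s, 0.0)         # a convex function F

X  = rng.choice([-1.0, 1.0], size=(trials, n))   # independent symmetric Bernoulli
Xp = rng.choice([-1.0, 1.0], size=(trials, n))   # independent copy X'

quad     = np.einsum('ti,ij,tj->t', X, A, X)     # <X, A X>
bilinear = np.einsum('ti,ij,tj->t', X, A, Xp)    # <X, A X'>
print("E[F(<X,AX>)]    :", F(quad).mean())
print("E[F(4<X,AX'>)]  :", F(4 * bilinear).mean())   # decoupled upper bound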
Lemma 6.5 (MGF of Gaussian chaos) Let X, X' ∼ N(0, 1l_n), X and X' independent, and let A be an n × n matrix. Then
E[ exp(λ⟨X, AX'⟩) ] ≤ exp( Cλ²‖A‖²_F ) for all |λ| ≤ C/‖A‖,
where C > 0 is an absolute constant.
Proof. We use the singular value decomposition of the matrix A, see Definition 4.10:
A = Σ_{i=1}^n s_i u_i v_i^T.
Thus
⟨X, AX'⟩ = Σ_{i=1}^n s_i ⟨u_i, X⟩⟨v_i, X'⟩ = Σ_{i=1}^n s_i Y_i Y'_i,
where Y = (⟨u₁, X⟩, . . . , ⟨u_n, X⟩) ∼ N(0, 1l_n) and Y' = (⟨v₁, X'⟩, . . . , ⟨v_n, X'⟩) ∼ N(0, 1l_n). Independence yields
E[ exp(λ⟨X, AX'⟩) ] = Π_{i=1}^n E[ exp(λ s_i Y_i Y'_i) ].
For each i ∈ {1, . . . , n} we first take the expectation with respect to Y', i.e., the conditional expectation holding the Y_i's fixed, and then the expectation over Y,
E[ exp(λ s_i Y_i Y'_i) ] = E[ E[ exp(λ s_i Y_i Y'_i) | Y ] ] = E[ exp(λ²s_i²Y_i²/2) ] ≤ exp(Cλ²s_i²), provided λ²s_i² ≤ C,
where the first identity uses the fact that the MGF of a standard Gaussian N is E[exp(µN)] = exp(µ²/2), and the final inequality follows from the fact that the Y_i are Gaussian and thus sub-Gaussian, therefore the square Y_i² is sub-exponential, and property (v) in Proposition 2.28 gives the bound. Hence
E[ exp(λ⟨X, AX'⟩) ] ≤ exp( Cλ² Σ_{i=1}^n s_i² ), provided λ² ≤ C / max_{1≤i≤n}{s_i²}.
We conclude with Σ_{i=1}^n s_i² = ‖A‖²_F and max_{1≤i≤n}{s_i} = ‖A‖. 2
Choosing µ = √(2C) λK, we can match our estimates to get
E_X[ exp(λ⟨X, AX'⟩) ] ≤ E_Y[ exp(µ⟨Y, AX'⟩) ] = exp( Cλ²K² ‖AX'‖₂² ).
We can now take the expectation with respect to X' on both sides and see that we have successfully replaced X by Y. We can now repeat the same procedure for X' and Y' to obtain our statement. 2
Proof of Theorem 6.4. Without loss of generality, K = 1. It suffices to show the one-sided tail estimate. Denote
p := P( ⟨X, AX⟩ − E[⟨X, AX⟩] ≥ t ).
Write
⟨X, AX⟩ − E[⟨X, AX⟩] = Σ_{i=1}^n a_ii (X_i² − E[X_i²]) + Σ_{i,j : i≠j} a_ij X_i X_j,
and thus the problem reduces to estimating diagonal and off-diagonal sums:
p ≤ P( Σ_{i=1}^n a_ii (X_i² − E[X_i²]) ≥ t/2 ) + P( Σ_{i,j : i≠j} a_ij X_i X_j ≥ t/2 ) =: p₁ + p₂.
(b) Recall that we have already proved a version with B = 1l_n in Theorem 3.1, namely,
‖ ‖X‖₂ − √n ‖_ψ2 ≤ CK².
Thus
P( | ‖BX‖₂ − ‖B‖_F | ≥ δ‖B‖_F ) ≤ 2 exp( − Cδ² ‖B‖²_F / (K⁴‖B‖²) ),
and setting t = δ‖B‖_F we obtain the tail estimate
P( | ‖BX‖₂ − ‖B‖_F | ≥ t ) ≤ 2 exp( − Ct² / (K⁴‖B‖²) ).
2
6.3 Symmetrisation
(a) εX and ε|X| are symmetric random variables and εX and ε|X| have the same distribu-
tion.
Proof.
Upper bound: Let (X'_i) be an independent copy of the random vectors (X_i). Since Σ_{i=1}^N X'_i has zero mean,
p := E[ ‖ Σ_{i=1}^N X_i ‖ ] ≤ E[ ‖ Σ_{i=1}^N X_i − Σ_{i=1}^N X'_i ‖ ] = E[ ‖ Σ_{i=1}^N (X_i − X'_i) ‖ ],
where the inequality stems from an application of Lemma 6.2, namely, if E[Z] = 0 then E[‖Y‖] ≤ E[‖Y + Z‖].
Since all (X_i − X'_i) are symmetric random vectors, they have the same distribution as ε_i(X_i − X'_i), see Exercise 6.11. An application of the triangle inequality and our assumptions concludes the upper bound
p ≤ E[ ‖ Σ_{i=1}^N ε_i(X_i − X'_i) ‖ ] ≤ E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ] + E[ ‖ Σ_{i=1}^N ε_i X'_i ‖ ] = 2 E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ].
Lower bound: Similarly,
E[ ‖ Σ_{i=1}^N ε_i X_i ‖ ] ≤ E[ ‖ Σ_{i=1}^N ε_i(X_i − X'_i) ‖ ] = E[ ‖ Σ_{i=1}^N (X_i − X'_i) ‖ ]
≤ E[ ‖ Σ_{i=1}^N X_i ‖ ] + E[ ‖ Σ_{i=1}^N X'_i ‖ ] = 2 E[ ‖ Σ_{i=1}^N X_i ‖ ].
2
7 Random Processes
7.1 Basic concepts and examples
Definition 7.1 A random process is a collection X := (X_t)_{t∈T} of random variables X_t defined on the same probability space, indexed by the elements t of some set T.
(c) When the dimension of the index set is greater than one, e.g., T ⊂ Rn or T ⊂ Zn , we
often call the process a random field (Xt )t∈T .
(d) The most well-known continuous-time process is the standard Brownian motion X = (X_t)_{t≥0}, also called the Wiener process. We can characterise it as follows:
(i) The process has continuous paths, i.e., X : [0, ∞) → R, t ↦ X_t is almost surely continuous.
(ii) The increments are independent and satisfy X_t − X_s ∼ N(0, t − s) for all t ≥ s.
♣
Definition 7.3 Let (Xt )t∈T be a random process with E[Xt ] = 0 for all t ∈ T . The covari-
ance function of the process is defined as
Σ(t, s) := cov(Xt , Xs ) , t, s ∈ T .
Note that Σ(t, s) = E[Xt Xs ], t, s ∈ T , when the process X = (Xt )t∈T has zero mean,
i.e., E[Xt ] = 0 for all t ∈ T .
Example 7.4 (Increments of random walks) The increments of the discrete-time random walk in Example 7.2 (b) with E[Z_i²] = 1, i ∈ N, are d(n, m) = √(n − m) for all n, m ∈ N₀, n ≥ m, since
d(n, m)² = E[ (X_n − X_m)² ] = E[ ( Σ_{k=m+1}^n Z_k )² ] = Σ_{k,j=m+1}^n E[Z_k Z_j] = n − m,
where we used the fact that the Z_i's are independent with zero mean, i.e., E[Z_k Z_j] = δ_{k,j}, and our assumption E[Z_k²] = 1 for all k ∈ N. ♣
Definition 7.5 (Gaussian process) (a) A random process (X_t)_{t∈T} is called a Gaussian process if, for any finite subset T₀ ⊂ T, the random vector (X_t)_{t∈T₀} has a normal distribution. Equivalently, (X_t)_{t∈T} is called a Gaussian process if every finite linear combination
Σ_{t∈T₀} a_t X_t, a_t ∈ R,
is a normal random variable.
(b) Let Y ∼ N(0, 1l_n) and define
X_t := ⟨Y, t⟩, t ∈ T ⊂ R^n.
We call the random process (X_t)_{t∈T} the canonical Gaussian process in R^n.
For general processes, even if they are Gaussian, obtaining such bounds is highly
nontrivial. In this section we learn first how the expectation of the supremum of two
processes compare to each other. In Section 7.3 below we obtain a lower and upper
bound for expected supremum of a process.
Theorem 7.7 (Slepian's inequality) Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T, we have
(i) E[X_t²] = E[Y_t²], (ii) E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²].    (7.2)
Then, for every τ ∈ R,
P( sup_{t∈T} X_t ≥ τ ) ≤ P( sup_{t∈T} Y_t ≥ τ ),    (7.3)
and, consequently,
E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ].
Whenever (7.3) holds, we say that the process (Xt )t∈T is stochastically domi-
nated by the process (Yt )t∈T . The proof of Theorem 7.7 follows by combining the
two versions of Slepian’s inequality which we will discuss below. To prepare these
statements we introduce the method of Gaussian interpolation first.
Suppose T is finite, e.g., |T| = n. Let X = (X_t)_{t∈T} and Y = (Y_t)_{t∈T} be two Gaussian random vectors (without loss of generality we may assume that X and Y are independent). We define the Gaussian interpolation as the following random vector in R^n,
Z(u) := √u X + √(1 − u) Y, u ∈ [0, 1].    (7.4)
It is easy to see (following exercise) that the covariance matrix of the interpolation
interpolates linearly between the covariance matrices of X and Y .
For a function f : R^n → R, we shall study how the expectation E[f(Z(u))] varies with u ∈ [0, 1]. Suppose, for example, that f(x) := 1l{max_{1≤i≤n} x_i < τ}, x ∈ R^n. Then one can show that E[f(Z(u))] increases with u, leading to E[f(Z(1))] ≥ E[f(Z(0))], and henceforth
P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ),
leading to
P( sup_{t∈T} X_t ≥ τ ) ≤ P( sup_{t∈T} Y_t ≥ τ ).    (7.5)
The following three lemmas collect basic facts about Gaussian random variables; their proofs are left as an exercise (homework).
Lemma 7.9 (Gaussian integration by parts) Suppose X ∼ N(0, 1). Then, for any differentiable function f : R → R with E[|f'(X)|] < ∞,
E[ f'(X) ] = E[ X f(X) ].
Proof. Exercise for the reader. Solution see Lemma 7.2.3 in [Ver18]. 2
Lemma 7.11 (Gaussian interpolation) Consider two independent Gaussian random vectors X ∼ N(0, Σ^X) and Y ∼ N(0, Σ^Y) in R^n. Define the interpolation Gaussian random vector as
Z(u) := √u X + √(1 − u) Y, u ∈ [0, 1].    (7.6)
Then for any twice-differentiable function f : R^n → R, we have
d/du E[f(Z(u))] = (1/2) Σ_{i,j=1}^n ( Σ^X_ij − Σ^Y_ij ) E[ ∂²f/∂x_i∂x_j (Z(u)) ].    (7.7)
Proof. Exercise for the reader (homework - Example Sheet 4). Solution see Lemma 7.2.7
in [Ver18].
2
∂²f/∂x_i∂x_j (x) ≥ 0 for all i ≠ j.
Then
E[f(X)] ≥ E[f(Y)].
We can now prove Theorem 7.7 with the following results for Gaussian vectors.
Theorem 7.13 Let X and Y be two mean-zero Gaussian random vectors in R^n as in Lemma 7.12. Then, for every τ ∈ R, we have
P( max_{1≤i≤n} X_i ≥ τ ) ≤ P( max_{1≤i≤n} Y_i ≥ τ ).
Consequently,
E[ max_{1≤i≤n} X_i ] ≤ E[ max_{1≤i≤n} Y_i ].
Proof. The key idea is to use Lemma 7.12 for some appropriate approximation of the maximum. For that to work, let h : R → [0, 1] be a twice-differentiable approximation of the indicator function of the interval (−∞, τ), that is, h(t) ≈ 1l_(−∞,τ)(t), t ∈ R, with h'(t) ≤ 0 (h is a smooth non-increasing function). Define f(x) := h(x₁) · · · h(x_n) for x = (x₁, . . . , x_n) ∈ R^n. The second partial derivatives
∂²f/∂x_i∂x_j (x) = h'(x_i) h'(x_j) Π_{k≠i,j} h(x_k)
are non-negative. It follows from Lemma 7.12 that E[f(X)] = E[f(Z(1))] ≥ E[f(Y)] = E[f(Z(0))]. Thus, by the above approximation,
P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ),
which is the claimed tail bound. The bound on the expectations then follows from the tail bound via the integral identity for the expectation. 2
Theorem 7.14 (Sudakov-Fernique inequality) Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T, we have
E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²].
Then
E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ].
Proof. We shall prove the statement for Gaussian random vectors X, Y ∈ Rn with
the help of Theorem 7.13. The idea this time is to use an approximation of the
maximum itself and not for the indicator function. For β > 0, define
f(x) := (1/β) log( Σ_{i=1}^n e^(βx_i) ), x ∈ R^n.
Inserting the function into the Gaussian interpolation formula (7.7) we can obtain
after some tedious calculation that
d
E[f (Z(u))] ≤ 0 ,
du
and conclude with our statement. 2
Exercise: Show that
f(x) −→ max_{1≤i≤n} x_i as β → ∞.
Solution. Without loss of generality let exp(βx_k) = max_{1≤i≤n} exp(βx_i). Writing
Σ_{i=1}^n e^(βx_i) = e^(βx_k) ( 1 + Σ_{i≠k} exp(β(x_i − x_k)) ),
we obtain that
0 ≤ lim_{β→∞} (1/β) log( 1 + Σ_{i≠k} exp(β(x_i − x_k)) ) ≤ lim_{β→∞} (1/β) log(1 + n) = 0,
and hence f(x) → x_k = max_{1≤i≤n} x_i as β → ∞.
Definition 7.16 (Canonical metric) Suppose X = (X_t)_{t∈T} is a random process with index set T. We define the canonical metric of the process by
d(t, s) := ‖X_t − X_s‖_{L²} = ( E[ (X_t − X_s)² ] )^(1/2), t, s ∈ T.
We can now study the question whether we can evaluate the expectation E[sup_{t∈T} X_t] by using properties of the geometry, in particular covering numbers of the index metric space (T, d) equipped with the canonical metric of the process. Via this approach we obtain a lower bound on the expectation of the supremum.
Theorem 7.17 (Sudakov's minorisation inequality) Let X = (X_t)_{t∈T} be a mean-zero Gaussian process. Then, for any ε ≥ 0, we have
E[ sup_{t∈T} X_t ] ≥ Cε √( log N(T, d, ε) ),
where d is the canonical metric of the process, C > 0 an absolute constant and N(T, d, ε) the covering number for T (recall Definition 4.1).
Proof. Assume that N(T, d, ε) = N < ∞ is finite. (When T is not compact, one can show that the expectation of the supremum is infinite; we skip this step.) Let M be a maximal ε-separated subset of T. Then M is an ε-net according to Lemma 4.2, and thus |M| ≥ N. It suffices to show
E[ sup_{t∈M} X_t ] ≥ Cε √(log N).    (7.9)
To show (7.9), we compare the process (X_t)_{t∈M} with a simpler process (Y_t)_{t∈M},
Y_t := (ε/√2) G_t, G_t ∼ N(0, 1), G_t independent of G_s for all t ≠ s.
For t, s ∈ M fixed we have
E[ (X_t − X_s)² ] = d(t, s)² ≥ ε²,
while
E[ (Y_t − Y_s)² ] = (ε²/2) E[ (G_t − G_s)² ] = ε².
Thus
E[ (X_t − X_s)² ] ≥ E[ (Y_t − Y_s)² ] for all t, s ∈ M.
We now obtain with Theorem 7.14,
E[ sup_{t∈M} X_t ] ≥ E[ sup_{t∈M} Y_t ] = (ε/√2) E[ max_{i∈M} G_i ] ≥ Cε √(log N),
where we used Proposition 7.18 below. 2
Proposition 7.18 (Maximum of normally distributed random variables) Let Y_i ∼ N(0, 1), i = 1, . . . , N, be independent normally distributed real-valued random variables. Then the following holds.
(a) E[ max_{1≤i≤N} Y_i ] ≤ √(2 log N).
(b) E[ max_{1≤i≤N} |Y_i| ] ≤ √(2 log(2N)).
(c) E[ max_{1≤i≤N} Y_i ] ≥ C √(2 log N) for some absolute constant C > 0.
Taking β = √(2 log(2N)), we conclude with the statement.
(c) The lower bound is slightly more involved and needs some preparation:
(i) The Y_i's are symmetric (Gaussian), and hence, by symmetry, it suffices to bound E[max_{1≤i≤N} |Y_i|] from below.
To obtain a lower bound we exploit the fact that our Gaussian random variables Y_i are independent and identically distributed. Then, for every δ > 0, noting that 1 − (1 − P(|Y₁| > t))^N is the probability that at least one of the random variables |Y_i|, i = 1, . . . , N, is larger than t,
E[ max_{1≤i≤N} |Y_i| ] ≥ ∫_0^δ [ 1 − (1 − P(|Y₁| > t))^N ] dt ≥ δ [ 1 − (1 − P(|Y₁| > δ))^N ],
where the first inequality follows from the integral identity for the expectation and the second inequality is just the interval length times the minimal value of the integrand. Now, we obtain with the lower tail bound of the normal distribution (see Proposition 2.1),
P(|Y₁| > δ) = √(2/π) ∫_δ^∞ exp(−t²/2) dt ≥ (1/π) (1/δ − 1/δ³) exp(−δ²/2).
Now we choose δ = √(log N) with N large enough so that
P(|Y₁| > δ) ≥ 1/N,
and hence
E[ max_{1≤i≤N} |Y_i| ] ≥ δ [ 1 − (1 − 1/N)^N ] ≥ δ (1 − 1/e).
We conclude with statement (c) using δ = √(log N), 1 − 1/e ≥ 1/2, and step (i). 2
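A quick simulation (not part of the notes, numpy assumed, sample sizes chosen for illustration) showing that the expected maximum of N independent standard Gaussians grows like √(2 log N), as in Proposition 7.18.

import numpy as np

rng = np.random.default_rng(9)
trials = 2000
for N in [10, 100, 1000, 10000]:
    maxima = rng.normal(size=(trials, N)).max(axis=1)    # max of N standard Gaussians
    print(f"N={N:6d}  E[max Y_i] ~ {maxima.mean():.3f}   "
          f"sqrt(2 log N) = {np.sqrt(2*np.log(N)):.3f}")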
We have seen that the expected supremum of the canonical Gaussian process on some set T ⊂ R^n,
E[ sup_{t∈T} ⟨Y, t⟩ ],
where Y ∼ N(0, 1l_n), plays an important role. In many applications this geometric quantity is an important parameter. This leads to the following definition.
Definition (Gaussian width) The Gaussian width of a subset T ⊂ R^n is
W(T) := E[ sup_{t∈T} ⟨Y, t⟩ ], Y ∼ N(0, 1l_n).
We summarise a few simple properties of the Gaussian width. We skip part of the proof as it involves mostly elementary properties of the norm and the Minkowski sum. Recall that the diameter of a set T ⊂ R^n with respect to the Euclidean norm is diam(T) := sup_{x,y∈T} ‖x − y‖₂.
(d)
W(T) = (1/2) W(T − T) = (1/2) E[ sup_{x,y∈T} ⟨Y, x − y⟩ ].
(e)
(1/√(2π)) diam(T) ≤ W(T) ≤ (√n/2) diam(T).
Proof.
(a) The Cauchy-Schwarz inequality gives ⟨Y, x⟩ ≤ ‖Y‖₂‖x‖₂ for every x ∈ T. If W(T) < ∞, then this implies that ‖x‖₂ < ∞ for all x ∈ T, and henceforth T is bounded. Conversely, if T is a bounded set we have that ‖x‖₂ ≤ C for all x ∈ T and some C > 0. Thus
W(T) = E[ sup_{x∈T} ⟨Y, x⟩ ] ≤ E[‖Y‖₂] C ≤ √n C < ∞.
(b) We simply use the rotation invariance of the normal distribution, i.e., Y ∼ N(0, 1l_n) implies that UY ∼ N(0, 1l_n) for any orthogonal matrix U.
(c) Recalling the definition of the Minkowski sum of two sets we get
W(T + S) = E[ sup_{x∈T, y∈S} ⟨Y, x + y⟩ ] = E[ sup_{x∈T} ⟨Y, x⟩ ] + E[ sup_{y∈S} ⟨Y, y⟩ ] = W(T) + W(S).
If a ≥ 0 we have a = |a| and ⟨Y, ax⟩ = |a|⟨Y, x⟩. If a < 0 we have |a| = −a, and using the fact that Y is symmetric, i.e., −Y has the same distribution as Y, we get W(aT) = |a| W(T) in both cases. 2
Example 7.22 (Gaussian width) (a) The Gaussian width of the unit sphere in n dimensions is
W(S^(n−1)) = E[‖Y‖₂] = √n ± C,
where the second equality follows from Exercise 6 - Example sheet 2 with some absolute constant C > 0, as an immediate consequence of our concentration of norm result in Theorem 3.1. To see the first equality, apply first the Cauchy-Schwarz inequality to obtain an upper bound for any x ∈ S^(n−1),
⟨Y, x⟩ ≤ ‖Y‖₂ ‖x‖₂ = ‖Y‖₂.
Pick
x_i = Y_i / ‖Y‖₂, i = 1, . . . , n;
then x = (x₁, . . . , x_n) ∈ S^(n−1) and ⟨Y, x⟩ = ‖Y‖₂, so the supremum equals ‖Y‖₂.
(b) The Gaussian width for the cube B^n_∞ = [−1, 1]^n with respect to the ℓ_∞ norm is
W(B^n_∞) = E[‖Y‖₁] = n E[|Y₁|] = n √(2/π),
where the second equality follows from the isotropy of Y and the definition of the norm ‖Y‖₁ = Σ_{i=1}^n |Y_i|. The third equality is just calculation, i.e.,
E[|Y₁|] = (1/√(2π)) ∫_{−∞}^∞ |y| e^(−y²/2) dy = (2/√(2π)) ∫_0^∞ y e^(−y²/2) dy = (2/√(2π)) [ −e^(−y²/2) ]_0^∞ = √(2/π).
The first equality follows since, for x ∈ B^n_∞,
⟨Y, x⟩ ≤ Σ_{i=1}^n |Y_i| |x_i| ≤ ‖Y‖₁,
with equality for x_i = sign(Y_i).
(c) The unit ball B₁^n in R^n with respect to the ℓ₁ norm is B₁^n = {x ∈ R^n : ‖x‖₁ ≤ 1}, and its Gaussian width is
W(B₁^n) = E[‖Y‖_∞] = E[ max_{1≤i≤n} |Y_i| ],
where the second equality is just the definition of the supremum norm and the first equality follows from ⟨Y, x⟩ ≤ ‖Y‖_∞ ‖x‖₁ ≤ ‖Y‖_∞ for x ∈ B₁^n, with equality attained at a signed standard basis vector. With Proposition 7.18 we get two absolute constants c, C > 0 such that
c √(log n) ≤ W(B₁^n) ≤ C √(log n).
♣
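The three widths in Example 7.22 are easy to estimate by Monte Carlo; the following sketch (not part of the notes, numpy assumed, n and the sample size are illustrative) compares the estimates with the closed-form expressions above.

import numpy as np

rng = np.random.default_rng(10)
n, trials = 100, 20000
Y = rng.normal(size=(trials, n))

w_sphere = np.linalg.norm(Y, axis=1).mean()        # W(S^(n-1)) = E||Y||_2
w_cube   = np.abs(Y).sum(axis=1).mean()            # W([-1,1]^n) = E||Y||_1
w_l1ball = np.abs(Y).max(axis=1).mean()            # W(B_1^n)    = E||Y||_inf

print("W(S^(n-1)) ~", w_sphere, " vs sqrt(n)        =", np.sqrt(n))
print("W(B_inf^n) ~", w_cube,   " vs n sqrt(2/pi)   =", n*np.sqrt(2/np.pi))
print("W(B_1^n)   ~", w_l1ball, " vs sqrt(2 log n)  =", np.sqrt(2*np.log(n)))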
Theorem 7.24 (Dudley's inequality) Let X = (X_t)_{t∈T} be a mean-zero stochastic process on a metric space (T, d) with sub-Gaussian increments. Then
E[ sup_{t∈T} X_t ] ≤ CK ∫_0^∞ √( log N(T, d, ε) ) dε.
The proof of the theorem needs some bounds on the covering number of the class of Lipschitz functions. We explore this in the next exercise before proving the theorem.
Exercise 7.27 Let F be the class of functions f : [0, 1] → [0, 1] with ‖f‖_Lip ≤ 1. Show that N(F, ‖·‖_∞, ε) ≤ (2/ε)^(2/ε) for ε ∈ (0, 1).
Solution. Recall the definition of the supremum norm ‖f‖_∞ = sup_{x∈[0,1]} |f(x)| and consider the square Λ = [0, 1]². We put a mesh of step size ε on Λ, so that we get (1/ε)² squares of side length ε. Mesh-following functions f₀ are step functions on the mesh, taking one of the 1/ε possible values on each interval of length ε. For every f ∈ F there is a mesh-following function f₀ such that ‖f − f₀‖_∞ ≤ ε. The number of all mesh-following functions is bounded by (1/ε)^(1/ε). Recall that the covering number N(K, d, ε) is the smallest cardinality of a family of closed ε-balls with centres in K whose union covers the set K. If we relax the assumption that the centres are in K we obtain the external covering number N_ext(K, d, ε). As we have done earlier, one can show that
N(K, d, ε) ≤ N_ext(K, d, ε/2),
which we need for our case as the centres of our balls might not be elements of F. Thus
N(F, ‖·‖_∞, ε) ≤ (2/ε)^(2/ε).
where
Z_i := (f − g)(X_i) − E[(f − g)(X)], i = 1, . . . , N.
In the following we write ≲ whenever we have ≤ with some absolute constant C > 0, to avoid adapting the absolute constant in each single step. The Z_i's are independent and mean-zero random variables, and thus Proposition 2.24 gives
‖X_f − X_g‖_ψ2 ≲ (1/N) ( Σ_{i=1}^N ‖Z_i‖²_ψ2 )^(1/2),
and therefore
‖X_f − X_g‖_ψ2 ≲ (1/N) N^(1/2) ‖f − g‖_∞ = (1/√N) ‖f − g‖_∞.
Step 2: Application of Dudley's inequality. According to Step 1 the empirical process has sub-Gaussian increments. The diameter of F is one, that is,
diam(F) = sup_{f,g∈F} ‖f − g‖_∞ = sup_{f,g∈F} sup_{x∈[0,1]} |f(x) − g(x)| = 1.
Application of Theorem 7.24, with the integral running between zero and the diameter, gives the upper bound for the expected supremum. For the integral we use the bound on the covering number in Exercise 7.27. Thus we get
E[ sup_{f∈F} |X_f| ] ≲ (1/√N) ∫_0^1 √( log N(F, ‖·‖_∞, ε) ) dε ≲ (1/√N) ∫_0^1 √( (c/ε) log(c/ε) ) dε ≲ 1/√N,
for some absolute constant c > 0. 2
We state, just for information, a uniform law of large numbers involving the VC dimension. It goes beyond the scope of this lecture to provide all details about the relationship between the covering numbers for Boolean functions and their VC dimension. We need the following result to appreciate the application example in Section 8.
Theorem 7.29 (Empirical processes via VC dimension) Let (Ω, B(Ω), µ) be a probability space and F = {f : Ω → {0, 1}} a class of Boolean functions with VC(F) ∈ [1, ∞). Consider the i.i.d. samples X, X₁, . . . , X_N with law µ ∈ M₁(Ω). Then
E[ sup_{f∈F} | (1/N) Σ_{i=1}^N f(X_i) − E[f(X)] | ] ≤ C √( VC(F)/N )
for some absolute constant C > 0.
8 Application: Statistical Learning theory
Example 8.1 (Health study) We examine N patients and determine n health parameters, e.g., blood pressure, heart rate, weight, etc. We then obtain the samples X_i ∈ R^n, i = 1, . . . , N. Suppose we know whether each patient has a certain illness or not, e.g. diabetes. That is, we know T(X_i) ∈ {0, 1}, i = 1, . . . , N, with T(X_i) = 1 meaning sick and T(X_i) = 0 meaning healthy. We want to learn from the given training sample to diagnose diabetes, i.e., we want to learn the target function T : Ω → {0, 1}. This target function would output a diagnosis for any person based on the n health parameters. ♣
Definition 8.2 Let (Ω, B(Ω), µ) be a probability space. The risk of a function f : Ω → R in a learning problem with the target function T : Ω → R and class F of functions is defined as
R(f) := E[ (f(X) − T(X))² ],    (8.1)
where the expectation is with respect to the probability measure µ ∈ M₁(Ω). The minimiser f* of the risk is defined as
f* := argmin_{f∈F} R(f).
Note that for Boolean functions f, T : Ω → {0, 1} the risk is the probability of misclassification,
R(f) = P( f(X) ≠ T(X) ),
as f²(X) − 2f(X)T(X) + T²(X) is zero when f(X) = T(X) and 1 otherwise.
In any learning problem the choice of the class of functions is crucial. In this context we call the class of functions F the hypothesis space. If we choose F to be a class of simple functions, like linear functions or Lipschitz functions, we may get easy calculations and estimates but we can still be far off the true target function; this is called underfitting. Conversely, if we consider too many different function types in F we risk overfitting.
Definition 8.3 (Empirical risk) Let (Ω, B(Ω), µ) be a probability space and f : Ω → R an
element of the hypothesis class F of a learning problem with target function T : Ω → R.
Let X1 , . . . , XN be Ω-valued i.i.d. samples with law µ ∈ M1 (Ω). The empirical risk of f
given the sample is defined as
R_N(f) := (1/N) Σ_{i=1}^N (f(X_i) − T(X_i))².    (8.3)
Denote by f*_N a minimiser of the empirical risk R_N over F. The quantity of interest is then the excess risk
R(f*_N) − R(f*).
Lemma 8.4 Let (Ω, B(Ω), µ) be a probability space and f : Ω → R an element of the hypothesis class F of a learning problem with target function T : Ω → R. Let X₁, . . . , X_N be Ω-valued i.i.d. samples with law µ ∈ M₁(Ω). Then
E[R_N(f)] = R(f).
Theorem 8.5 (Excess risk via VC dimension) Let (Ω, B(Ω), µ) be a probability space and F be a class of functions of a learning problem with target function T : Ω → R and VC(F) ≥ 1. Let X₁, . . . , X_N be Ω-valued i.i.d. samples with law µ ∈ M₁(Ω). Then
E[ R(f*_N) ] ≤ R(f*) + C √( VC(F)/N )
for some absolute constant C > 0.
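The following is a minimal empirical-risk-minimisation sketch in the spirit of the health study in Example 8.1 (not part of the notes, numpy assumed); the synthetic data, the threshold rule used as target, and the one-parameter family of half-space classifiers are all illustrative stand-ins.

import numpy as np

rng = np.random.default_rng(11)
n_par, N = 2, 500                                   # "health parameters" and sample size

X = rng.normal(size=(N, n_par))                     # synthetic training samples
T = (X[:, 0] > 0.3).astype(int)                     # synthetic target labels (assumption)

thetas = np.linspace(-2, 2, 201)                    # hypothesis class: x -> 1{x[0] > theta}

def empirical_risk(theta):
    f = (X[:, 0] > theta).astype(int)
    return np.mean((f - T) ** 2)                    # R_N(f), cf. (8.3)

risks = np.array([empirical_risk(th) for th in thetas])
theta_hat = thetas[risks.argmin()]
print("empirical risk minimiser theta =", theta_hat, " with R_N =", risks.min())
# Theorem 8.5: the population risk of this minimiser exceeds the optimum
# by at most C sqrt(VC(F)/N) in expectation (here VC(F) = 1 for threshold rules).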
L.H.S. of (8.4) ≲ (1/√N) E[ ∫_0^1 √( log N(L, L²(µ_N), ε) ) dε ],
To see (8.5) pick an ε-net {f_j}_{j=1,...,J} of F. For all ℓ = (f − T)² ∈ L there exists ℓ_j := (f_j − T)² such that
‖ℓ − ℓ_j‖_{L²(µ_N)} = ‖ f² − f_j² − 2T(f − f_j) ‖_{L²(µ_N)} = ‖ (f + f_j)(f − f_j) − 2T(f − f_j) ‖_{L²(µ_N)}
≤ 2‖f − f_j‖_{L²(µ_N)} + 2‖f − f_j‖_{L²(µ_N)} ≤ ε
whenever ‖f − f_j‖_{L²(µ_N)} ≤ ε/4. This shows our claim (8.5). Now we replace L by F and use Theorem 7.29 and its proof to see that the right-hand side is bounded by C√(VC(F)/N), which yields the statement of the theorem. 2
If we want to bound the expected excess risk in our health study Example 8.1 by
ε > 0, all we need to do is to take a sample of size
N ∼ ε−2 VC(F) .
References
[Dur19] R. Durrett. Probability: Theory and Examples. Cambridge University Press, fifth ed., 2019.
[Geo12] H.-O. Georgii. Stochastics. De Gruyter Textbook. De Gruyter, 2nd rev. and ext. ed., 2012.
[Ver18] R. Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.
Index
ε-net, 45
ε-separated, 45
n-dimensional ball, 35
n-dimensional sphere, 35
Sub-exponential norm, 27
Bernstein condition, 29
canonical Gaussian process in Rn, 82
canonical metric of the process, 87
centred moment generating function, 18
chaining, 93
Chaos, 74
covariance function, 82
covariance matrix, 9, 39
covering number, 45
cumulative distribution function (CDF), 3
Decoupling, 74
diameter, 90
empirical covariance, 52
empirical measure, 52
empirical process, 93
empirical risk, 98
essential supremum, 5
excess risk, 98
Gamma function, 20
Gaussian interpolation, 83
Gaussian process, 82
Gaussian width, 90
global Lipschitz continuity, 58
Hamming cube, 47
Hamming distance, 47
Herbst argument, 56
increments, 82
isotropic, 39
Jensen's inequality, 4
Johnson-Lindenstrauss Lemma, 72
Kullback-Leibler divergence, 54
Landau symbols, 36
level sets, 63
Lipschitz continuous, 58
Minkowski sum, 46
moment generation function, 3
Multivariate Normal / Gaussian distribution, 42
normal distribution, 9
operator norm, 48
packing number, 45
Rademacher function, 29
random field, 81
random process, 81
relative entropy, 54
second moment matrix, 39
separately convex, 58
Shannon entropy, 55
singular values, 48
spherically distributed, 41
standard Brownian motion, 81
Stirling's formula, 20
stochastically dominated, 83
Sub-exponential random variable, 26
Sub-exponential random variable, second definition, 28
Sub-exponential random variables, first definition, 26
Sub-Gaussian - first definition, 21
Sub-Gaussian - second definition, 23
Sub-Gaussian increments, 93
Sub-Gaussian properties, 18
sub-level sets, 63
symmetric, 80
symmetric Bernoulli distribution in Rn, 41
tails of the normal distribution, 11
target function, 97
training data, 97
unit sphere, 41
VC dimension, 95
Young's inequality, 27