HDP Solution
Solution Manual
Pingbang Hu
These are the solutions I wrote while organizing the reading group on Roman Vershynin's High Dimensional Probability [Ver24]. While we aim to solve all the exercises, occasionally we omit some due to either 1.) simplicity; 2.) difficulty; or 3.) the section being skipped. Additionally, the solutions may contain factual and/or typographic errors.
The reading group started in Spring 2024, and the date on the cover page is the last updated time.
Contents
4 Random matrices 41
4.1 Preliminaries on matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Nets, covering numbers and packing numbers . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Application: error correcting codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Upper bounds on random sub-gaussian matrices . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Application: community detection in networks . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 Two-sided bounds on sub-gaussian matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Application: covariance estimation and clustering . . . . . . . . . . . . . . . . . . . . . . . 52
6 Quadratic forms, symmetrization and contraction 65
6.1 Decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2 Hanson-Wright Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.3 Concentration of anisotropic random vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.4 Symmetrization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.5 Random matrices with non-i.i.d. entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.6 Application: matrix completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.7 Contraction Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Appetizer: using probability to cover a geometric set
Week 1: Appetizer and Basic Inequalities
$$\frac{(n/m)^m}{\binom{n}{m}} = \prod_{j=0}^{m-1}\frac{n}{m}\cdot\frac{m-j}{n-j} \le 1$$
as $\frac{m-j}{n-j} \le \frac{m}{n}$ for all $j$. The second inequality $\binom{n}{m} \le \sum_{k=0}^{m}\binom{n}{k}$ is trivial since $\binom{n}{k} \ge 1$ for all $k$. The last inequality is due to
$$\sum_{k=0}^{m}\binom{n}{k} \le \left(\frac{n}{m}\right)^{m}\sum_{k=0}^{m}\binom{n}{k}\left(\frac{m}{n}\right)^{k} \le \left(\frac{n}{m}\right)^{m}\left(1+\frac{m}{n}\right)^{n} \le \left(\frac{n}{m}\right)^{m}e^{m}.$$
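The chain of bounds above is easy to sanity-check numerically; the sketch below (the helper `binom_bounds` is my own illustration, not from the text) uses Python's exact `math.comb`:

```python
import math

def binom_bounds(n, m):
    """Return the chain (n/m)^m, C(n,m), sum_{k<=m} C(n,k), (e*n/m)^m."""
    lower = (n / m) ** m
    middle = math.comb(n, m)
    partial_sum = sum(math.comb(n, k) for k in range(m + 1))
    upper = (math.e * n / m) ** m
    return lower, middle, partial_sum, upper

# The chain (n/m)^m <= C(n,m) <= sum_{k<=m} C(n,k) <= (en/m)^m for 1 <= m <= n.
for n, m in [(10, 3), (20, 7), (50, 25)]:
    lo, mid, s, up = binom_bounds(n, m)
    assert lo <= mid <= s <= up
```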
Chapter 1
Answer. Separating $X$ into its positive and negative parts does the job. Specifically, let $X = X_+ - X_-$ where $X_+ = \max(X, 0)$ and $X_- = \max(-X, 0)$, both of which are non-negative. Then, we see that by applying Lemma 1.2.1,
Problem (Exercise 1.2.3). Let $X$ be a random variable and $p \in (0, \infty)$. Show that
$$\mathbb{E}[|X|^p] = \int_0^\infty pt^{p-1}\,\mathbb{P}(|X| > t)\,\mathrm{d}t.$$
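The tail-integral identity can be checked deterministically for a distribution whose tail is explicit. The sketch below (my own illustration, assuming $X \sim U(0,1)$ so that $\mathbb{P}(X > t) = 1 - t$ and $\mathbb{E}[X^p] = 1/(p+1)$) evaluates the right-hand side with a midpoint rule:

```python
def tail_integral(p, steps=20_000):
    """Midpoint-rule evaluation of int_0^1 p t^{p-1} P(X > t) dt for X ~ U(0,1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += p * t ** (p - 1) * (1.0 - t) * h
    return total

# Matches E[X^p] = 1/(p+1) for the uniform distribution.
for p in (1, 2, 3.5):
    assert abs(tail_integral(p) - 1 / (p + 1)) < 1e-5
```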
Week 2: Basic Inequalities and Limit Theorems
As $\sigma < \infty$ is a constant, the rate is exactly $O(1/\sqrt{N})$. ⊛
implying that
$$\mathbb{E}\left[g^2\mathbb{1}_{g>t}\right] = t\cdot\frac{1}{\sqrt{2\pi}}e^{-t^2/2} + \int_t^\infty\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,\mathrm{d}x \le t\cdot\frac{1}{\sqrt{2\pi}}e^{-t^2/2} + \frac{1}{t}\cdot\frac{1}{\sqrt{2\pi}}e^{-t^2/2},$$
which gives the second inequality. ⊛
Week 3: More Powerful Concentration Inequalities
Answer. Omit. ⊛
The next exercise is to prove Theorem 2.2.5 (Hoeffding's inequality for general bounded random variables), which we restate for convenience.
Theorem 2.2.1 (Hoeffding's inequality for general bounded random variables). Let $X_1, \dots, X_N$ be independent random variables. Assume that $X_i \in [m_i, M_i]$ for every $i$. Then, for any $t > 0$, we have
$$\mathbb{P}\left(\sum_{i=1}^N(X_i - \mathbb{E}[X_i]) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^N(M_i - m_i)^2}\right).$$
Problem (Exercise 2.2.7). Prove Hoeffding's inequality for general bounded random variables, possibly with some absolute constant instead of 2 in the tail.
Answer. Raising both sides to the $p$-th power doesn't work since we're now dealing with a sum of random variables, so we instead consider the MGF trick (also known as the Cramér–Chernoff method). All that is left is to bound $\mathbb{E}[\exp(\lambda(X_i - \mathbb{E}[X_i]))]$. Before we proceed, we need one lemma.
Lemma 2.2.2. If $Z \in [a, b]$ almost surely, then
$$\operatorname{Var}[Z] \le \frac{(b-a)^2}{4}.$$
Proof. Since $|Z - \frac{a+b}{2}| \le \frac{b-a}{2}$,
$$\operatorname{Var}[Z] = \operatorname{Var}\left[Z - \frac{a+b}{2}\right] \le \mathbb{E}\left[\left(Z - \frac{a+b}{2}\right)^2\right] \le \frac{(b-a)^2}{4}. \qquad\blacksquare$$
$$\mathbb{E}\left[e^{\lambda X}\right] \le \exp\left(\lambda^2\frac{(b-a)^2}{8}\right).$$
$$\psi'(\lambda) = \frac{\mathbb{E}\left[Xe^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]}, \qquad \psi''(\lambda) = \frac{\mathbb{E}\left[X^2e^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]} - \left(\frac{\mathbb{E}\left[Xe^{\lambda X}\right]}{\mathbb{E}\left[e^{\lambda X}\right]}\right)^2.$$
Now, observe that $\psi''$ is the variance under the law of $X$ re-weighted by $\frac{e^{\lambda X}}{\mathbb{E}[e^{\lambda X}]}$, i.e., by defining
$$\mathrm{d}\mathbb{P}_\lambda(x) := \frac{e^{\lambda x}}{\mathbb{E}_{\mathbb{P}}[e^{\lambda X}]}\,\mathrm{d}\mathbb{P}(x),$$
then
$$\psi'(\lambda) = \frac{\mathbb{E}_{\mathbb{P}}\left[Xe^{\lambda X}\right]}{\mathbb{E}_{\mathbb{P}}\left[e^{\lambda X}\right]} = \int\frac{xe^{\lambda x}}{\mathbb{E}_{\mathbb{P}}[e^{\lambda X}]}\,\mathrm{d}\mathbb{P}(x) = \mathbb{E}_{\mathbb{P}_\lambda}[X]$$
and
$$\psi''(\lambda) = \frac{\mathbb{E}_{\mathbb{P}}\left[X^2e^{\lambda X}\right]}{\mathbb{E}_{\mathbb{P}}\left[e^{\lambda X}\right]} - \left(\frac{\mathbb{E}_{\mathbb{P}}\left[Xe^{\lambda X}\right]}{\mathbb{E}_{\mathbb{P}}\left[e^{\lambda X}\right]}\right)^2 = \mathbb{E}_{\mathbb{P}_\lambda}\left[X^2\right] - \mathbb{E}_{\mathbb{P}_\lambda}[X]^2 = \operatorname{Var}_{\mathbb{P}_\lambda}[X].$$
From Lemma 2.2.2, since $X$ under the new distribution $\mathbb{P}_\lambda$ is still bounded between $a$ and $b$,
$$\psi''(\lambda) = \operatorname{Var}_{\mathbb{P}_\lambda}[X] \le \frac{(b-a)^2}{4}.$$
Then by Taylor’s theorem, there exists some λ
e ∈ [0, λ] such that
1 e 2 = 1 ψ ′′ (λ)λ
ψ(λ) = ψ(0) + ψ ′ (0)λ + ψ ′′ (λ)λ e 2
2 2
1 (b − a)2 2 (b − a)2
ln E eλX = ψ(λ) ≤ · λ = λ2
,
2 4 8
raising both sides by e shows the desired result. ⊛
Now, given $X_i \in [m_i, M_i]$ for every $i$, $X_i - \mathbb{E}[X_i] \in [m_i - \mathbb{E}[X_i], M_i - \mathbb{E}[X_i]]$ has mean $0$ for every $i$. Then by the above bound, for all $\lambda \in \mathbb{R}$,
$$\mathbb{E}\left[e^{\lambda(X_i - \mathbb{E}[X_i])}\right] \le \exp\left(\lambda^2\frac{(M_i - m_i)^2}{8}\right).$$
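Hoeffding's lemma used above can be checked on a grid of $\lambda$ for two-point mean-zero distributions; a minimal sketch (the helper and the specific parameters are my own, not from the text):

```python
import math

def max_mgf_to_bound_ratio(a, b):
    """For X = a w.p. q, X = b w.p. 1-q with q chosen so E[X] = 0, return the
    largest observed ratio E[e^{lam X}] / exp(lam^2 (b-a)^2 / 8) over a lambda grid."""
    q = b / (b - a)                  # makes q*a + (1-q)*b = 0
    worst = 0.0
    for i in range(-50, 51):
        lam = i / 10.0
        mgf = q * math.exp(lam * a) + (1 - q) * math.exp(lam * b)
        bound = math.exp(lam ** 2 * (b - a) ** 2 / 8)
        worst = max(worst, mgf / bound)
    return worst

# Hoeffding's lemma predicts the ratio never exceeds 1.
for a, b in [(-1.0, 1.0), (-1.0, 3.0), (-2.5, 0.5)]:
    assert max_mgf_to_bound_ratio(a, b) <= 1 + 1e-12
```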
Problem (Exercise 2.2.8). Imagine we have an algorithm for solving some decision problem (e.g., is a given number $p$ a prime?). Suppose the algorithm makes a decision at random and returns the correct answer with probability $\frac{1}{2} + \delta$ with some $\delta > 0$, which is just a bit better than a random guess. To improve the performance, we run the algorithm $N$ times and take the majority vote. Show that, for any $\epsilon \in (0, 1)$, the answer is correct with probability at least $1 - \epsilon$, as long as
$$N \ge \frac{1}{2\delta^2}\ln\frac{1}{\epsilon}.$$
Answer. Consider $X_1, \dots, X_N \overset{\text{i.i.d.}}{\sim} \operatorname{Ber}(\frac{1}{2} + \delta)$, which is a series of indicators indicating whether each random decision is correct or not. Note that $\mathbb{E}[X_i] = \frac{1}{2} + \delta$.
We see that by taking the majority vote over $N$ runs, the algorithm makes a mistake only if $\sum_{i=1}^N X_i \le N/2$ (let's not consider ties). This happens with probability
$$\mathbb{P}\left(\sum_{i=1}^N X_i \le \frac{N}{2}\right) = \mathbb{P}\left(\sum_{i=1}^N(X_i - \mathbb{E}[X_i]) \le -N\delta\right) \le \exp\left(-\frac{2(N\delta)^2}{N}\right) = e^{-2N\delta^2}$$
from Hoeffding's inequality.^a Requiring $e^{-2N\delta^2} \le \epsilon$ is equivalent to requiring $N \ge \frac{1}{2\delta^2}\ln(1/\epsilon)$. ⊛
^a Note that the sign is flipped. However, Hoeffding's inequality still holds (why?).
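The boosting claim can be illustrated by simulation; a minimal sketch (the helper, trial count, and seed are arbitrary choices of mine):

```python
import math, random

def majority_correct_prob(delta, N, trials=20_000, seed=0):
    """Estimate the probability that a majority vote of N runs of a decider that
    is correct w.p. 1/2 + delta returns the correct answer."""
    rng = random.Random(seed)
    p = 0.5 + delta
    correct = 0
    for _ in range(trials):
        votes = sum(1 for _ in range(N) if rng.random() < p)
        if votes > N / 2:
            correct += 1
    return correct / trials

delta, eps = 0.1, 0.05
N = math.ceil(math.log(1 / eps) / (2 * delta ** 2))   # N >= (1/(2 delta^2)) ln(1/eps)
assert majority_correct_prob(delta, N) >= 1 - eps
```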
Problem (Exercise 2.2.9). Suppose we want to estimate the mean $\mu$ of a random variable $X$ from a sample $X_1, \dots, X_N$ drawn independently from the distribution of $X$. We want an $\epsilon$-accurate estimate, i.e., one that falls in the interval $(\mu - \epsilon, \mu + \epsilon)$.
(a) Show that a sample of size $N = O(\sigma^2/\epsilon^2)$ is sufficient to compute an $\epsilon$-accurate estimate with probability at least $3/4$, where $\sigma^2 = \operatorname{Var}[X]$.
(b) Show that a sample of size $N = O(\log(\delta^{-1})\sigma^2/\epsilon^2)$ is sufficient to compute an $\epsilon$-accurate estimate with probability at least $1 - \delta$.
as $\mathbb{E}[X_i] = 1/4$. From Hoeffding's inequality, the above probability is bounded above by $\exp(-2(k/4)^2/k)$; setting it to be less than $\delta$ we have
$$\exp\left(-\frac{2(k/4)^2}{k}\right) \le \delta \;\Leftrightarrow\; \ln\frac{1}{\delta} \le \frac{k}{8} \;\Leftrightarrow\; k = O(\ln\delta^{-1}),$$
i.e., the total number of samples required is $O(k\sigma^2/\epsilon^2) = O(\ln(\delta^{-1})\sigma^2/\epsilon^2)$.
Answer. (a) Since the $X_i$'s are non-negative and the densities $f_{X_i} \le 1$ uniformly, for every $t > 0$,
$$\mathbb{E}[\exp(-tX_i)] = \int_0^\infty e^{-tx}f_{X_i}(x)\,\mathrm{d}x \le \int_0^\infty e^{-tx}\,\mathrm{d}x = \left[-\frac{1}{t}e^{-tx}\right]_0^\infty = \frac{1}{t}.$$
hence
$$\mathbb{P}(S_N \le t) \le e^{\lambda t}\prod_{i=1}^N\exp\left((e^{-\lambda}-1)p_i\right) = e^{\lambda t}\exp\left((e^{-\lambda}-1)\mu\right) = \exp\left(\lambda t + (e^{-\lambda}-1)\mu\right).$$
Problem (Exercise 2.3.3). Let $X \sim \operatorname{Pois}(\lambda)$. Show that for any $t > \lambda$, we have
$$\mathbb{P}(X \ge t) \le e^{-\lambda}\left(\frac{e\lambda}{t}\right)^t.$$
hence
$$\mathbb{P}(X \ge t) \le e^{-\theta t}\exp\left((e^\theta - 1)\lambda\right) = \exp(t-\lambda)\left(\frac{\lambda}{t}\right)^t = e^{-\lambda}\left(\frac{e\lambda}{t}\right)^t,$$
where we take the minimizing $\theta = \ln(t/\lambda) > 0$ as $t > \lambda$. ⊛
Alternatively, we can also solve Exercise 2.3.3 directly as follows.
Answer. Consider a series of independent Bernoulli random variables $X_{N,i}$ for a fixed $N$ such that the Poisson limit theorem applies to approximate $X \sim \operatorname{Pois}(\lambda)$, i.e., as $N \to \infty$, $\max_{i\le N}p_{N,i} \to 0$ and $\lambda_N := \mathbb{E}[S_N] \to \lambda < \infty$, so $S_N \to \operatorname{Pois}(\lambda)$. From Chernoff's inequality, for any $t > \lambda_N$,
$$\mathbb{P}(S_N > t) \le e^{-\lambda_N}\left(\frac{e\lambda_N}{t}\right)^t.$$
Taking $N \to \infty$ gives the desired bound since $\lambda_N \to \lambda$. ⊛
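The resulting Poisson tail bound can be verified directly from the Poisson pmf; a small pure-Python sketch (the helper is mine):

```python
import math

def pois_tail(lam, t):
    """P(X >= t) for X ~ Pois(lam) and integer t, via 1 - CDF(t-1)."""
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(t))

# Check P(X >= t) <= e^{-lam} (e lam / t)^t for integer t > lam.
for lam in (1.0, 3.0, 7.5):
    for t in range(math.ceil(lam) + 1, math.ceil(lam) + 15):
        bound = math.exp(-lam) * (math.e * lam / t) ** t
        assert pois_tail(lam, t) <= bound * (1 + 1e-9)
```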
$$\mathbb{P}(|S_N - \mu| \ge \delta\mu) \le 2e^{-c\mu\delta^2}$$
Proof. As $(1+x/2)^2 = 1 + x + x^2/4 \ge 1 + x$,
$$[\log(1+x)]' = \frac{1}{1+x} \ge \frac{1}{(1+x/2)^2} = \left[\frac{x}{1+x/2}\right]'.$$
$$\ln\mathbb{P}(S_N \ge (1+\delta)\mu) \le \mu(\delta - (1+\delta)\ln(1+\delta)) \le \mu\delta - \mu(1+\delta)\frac{2\delta}{2+\delta} = -\frac{\mu\delta^2}{2+\delta} \le -\frac{\mu\delta^2}{3}.$$
Similarly, from Chernoff’s inequality (left-tail), for t = (1 − δ)µ, we have
δ2 µδ 2
ln P(SN ≤ (1 − δ)µ) ≤ µ(−δ − (1 − δ) ln(1 − δ)) ≤ −µδ − µ(1 − δ) −δ − ≤− .
2 2
Problem (Exercise 2.3.6). Let $X \sim \operatorname{Pois}(\lambda)$. Show that for $t \in (0, \lambda]$, we have
$$\mathbb{P}(|X - \lambda| \ge t) \le 2\exp\left(-\frac{ct^2}{\lambda}\right).$$
Answer. Fix $t =: \delta\lambda \in (0, \lambda]$ for some $\delta \in (0, 1]$ first. Consider a series of independent Bernoulli random variables $X_{N,i}$ for a fixed $N$ such that the Poisson limit theorem applies to approximate $X \sim \operatorname{Pois}(\lambda)$, i.e., as $N \to \infty$, $\max_{i\le N}p_{N,i} \to 0$ and $\lambda_N := \mathbb{E}[S_N] \to \lambda < \infty$, so $S_N \to \operatorname{Pois}(\lambda)$. From the multiplicative form of Chernoff's inequality, for $t_N := \delta\lambda_N$,
$$\mathbb{P}(|S_N - \lambda_N| \ge t_N = \delta\lambda_N) \le 2\exp\left(-\frac{ct_N^2}{\lambda_N}\right).$$
Taking limits,
$$\mathbb{P}(|X - \lambda| \ge t) = \lim_{N\to\infty}\mathbb{P}(|S_N - \lambda_N| \ge t_N) \le \lim_{N\to\infty}2\exp\left(-\frac{ct_N^2}{\lambda_N}\right) = 2\exp\left(-\frac{ct^2}{\lambda}\right)$$
since $t_N = \delta\lambda_N \to \delta\lambda = t$. ⊛
$$\frac{X - \lambda}{\sqrt{\lambda}} \xrightarrow{D} N(0, 1).$$
Answer. Since $X := \sum_{i=1}^\lambda X_i \sim \operatorname{Pois}(\lambda)$ if $X_i \overset{\text{i.i.d.}}{\sim} \operatorname{Pois}(1)$ for all $i$, from the Lindeberg–Lévy central limit theorem, we have
$$\frac{X - \mathbb{E}[X]}{\sqrt{\operatorname{Var}[X]}} = \frac{X - \lambda}{\sqrt{\lambda}} \xrightarrow{d} N(0, 1)$$
as $\mathbb{E}[X_i] = \operatorname{Var}[X_i] = 1$. ⊛
Answer. Since $d = O(\log n)$, there exists an absolute constant $M > 0$ such that $d = (n-1)p \le M\log n$ for all large enough $n$. Now, consider some $C > 0$ such that $eM/C =: \alpha < 1$. From Chernoff's inequality,
$$\mathbb{P}(d_i \ge C\log n) \le e^{-d}\left(\frac{ed}{C\log n}\right)^{C\log n} \le e^{-d}\left(\frac{eM}{C}\right)^{C\log n} \le \alpha^{C\log n}.$$
Problem (Exercise 2.4.3). Consider a random graph $G \sim G(n, p)$ with expected degrees $d = O(1)$. Show that with high probability (say, 0.9), all vertices of $G$ have degrees
$$O\left(\frac{\log n}{\log\log n}\right).$$
Answer. Since now $d = (n-1)p \le M$ for some absolute constant $M > 0$ for all large $n$, from Chernoff's inequality,
$$\mathbb{P}\left(d_i \ge C\frac{\log n}{\log\log n}\right) \le e^{-d}\left(\frac{ed}{C\frac{\log n}{\log\log n}}\right)^{C\frac{\log n}{\log\log n}} \le e^{-d}\left(\frac{eM\log\log n}{C\log n}\right)^{C\frac{\log n}{\log\log n}}.$$
Then a union bound over the $n$ vertices gives, as $n \to \infty$,
$$ne^{-d}\left(\frac{eM\log\log n}{C\log n}\right)^{C\frac{\log n}{\log\log n}} \to 0,$$
which is what we want to prove. ⊛
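A quick Monte-Carlo illustration of this bounded-degree claim (the graph sizes, the constant `C`, and the seed are arbitrary choices of mine, not from the text):

```python
import math, random

def max_degree(n, d, rng):
    """Sample G(n, p) with p = d/(n-1) and return its maximum degree."""
    p = d / (n - 1)
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return max(deg)

rng = random.Random(1)
d, C = 5.0, 20.0
for n in (200, 400, 800):
    # max degree should stay within C * log n / log log n for constant d
    assert max_degree(n, d, rng) <= C * math.log(n) / math.log(math.log(n))
```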
Problem (Exercise 2.4.4). Consider a random graph $G \sim G(n, p)$ with expected degrees $d = o(\log n)$. Show that with high probability (say, 0.9), $G$ has a vertex with degree $10d$.
Answer. Omit. ⊛
Problem (Exercise 2.4.5). Consider a random graph $G \sim G(n, p)$ with expected degrees $d = O(1)$. Show that with high probability (say, 0.9), $G$ has a vertex with degree
$$\Omega\left(\frac{\log n}{\log\log n}\right).$$
Answer. Firstly, note that the question is ill-posed in the sense that if $d = (n-1)p = O(1)$, it can be that $d = 0$ (with $p = 0$), in which case the claim is impossible to prove. Hence, consider the non-degenerate case, i.e., $d = \Theta(1)$.
We want to prove that there exists some absolute constant $C > 0$ such that with high probability $G$ has a vertex with degree at least $C\log n/\log\log n$. First, consider separating the graph randomly into two parts $A$, $B$, each of size $n/2$. It's then easy to see that by dropping every inner edge in $A$ and $B$, the graph becomes bipartite such that $A$ and $B$ form independent sets. Working on this new graph (with degrees denoted as $d_i'$), we have
$$\mathbb{P}(d_i' = k) = \binom{n/2}{k}\left(\frac{d}{n-1}\right)^k\left(1 - \frac{d}{n-1}\right)^{n/2-k} \ge \left(\frac{n}{2k}\right)^k\cdot\frac{d^k}{n^k}\cdot e^{-d} = \left(\frac{d}{2k}\right)^ke^{-d}.$$
Let $k = C\log n/\log\log n$ such that $d/2k > 1/\log n$ for large enough $n$;^a then
$$\mathbb{P}\left(d_i' = \frac{C\log n}{\log\log n}\right) \ge e^{-d}\left(\frac{d}{2k}\right)^k \ge e^{-d}(\log n)^{-k} = \exp(-d - k\log\log n) = \exp(-d - C\log n) = e^{-d}n^{-C}.$$
Let this probability be $q$, and focus on $A$. We can then define $X_i = \mathbb{1}_{d_i'=k}$ for $i \in A$, and note that the $X_i$ are all independent as $A$ is an independent set. Then the number of vertices in $A$ with degree exactly $k$, denoted as $X = \sum_{i\in A}X_i$, follows $\operatorname{Bin}(n/2, q)$ with mean $nq/2$ and variance $nq(1-q)/2$. From Chebyshev's inequality,
as $n \to \infty$, which means $\mathbb{P}(X \ge 1) \to 1$, i.e., with high probability there is at least one vertex with degree exactly $k = C\log n/\log\log n$ in the bipartite graph. Now, since the edges deleted in the beginning can only have decreased degrees, we conclude that the original graph has a vertex with degree
$$\Omega\left(\frac{\log n}{\log\log n}\right)$$
with overwhelming probability. ⊛
^a This is equivalent to $k < d\log n/2$. As $k$ has a $\log\log n \to \infty$ factor in the denominator, the claim holds.
Deduce that
$$\|X\|_{L^p} = O(\sqrt{p}) \quad\text{as } p \to \infty.$$
We see that for $p$ even, $(1+p)/2 = p/2 + 1/2$, so by letting $z := p/2 \in \mathbb{N}$ in the Legendre duplication formula $\Gamma(z)\Gamma(z+1/2) = 2^{1-2z}\sqrt{\pi}\,\Gamma(2z)$,
$$\Gamma((1+p)/2) = \frac{2^{1-p}\sqrt{\pi}\,\Gamma(p)}{\Gamma(p/2)} = 2^{1-p}\sqrt{\pi}\,\frac{(p-1)!}{(p/2-1)!} = 2^{1-p}\sqrt{\pi}\,\frac{(p-1)!}{(1/2)^{p/2-1}(p-2)!!} = 2^{-p/2}\sqrt{\pi}(p-1)!!. \qquad\blacksquare$$
We then see that as $p \to \infty$,
$$\|X\|_{L^p} = \sqrt{2}\left(\frac{\Gamma((1+p)/2)}{\Gamma(1/2)}\right)^{1/p} \lesssim \left((p-1)!!\right)^{1/p} = O\left(\left(\sqrt{p!}\right)^{1/p}\right) = O(\sqrt{p}).$$
⊛
Problem (Exercise 2.5.4). Show that the condition E[X] = 0 is necessary for property v to hold.
Answer. If $\mathbb{E}[\exp(\lambda X)] \le \exp(K_5^2\lambda^2)$ for all $\lambda \in \mathbb{R}$, then from Jensen's inequality, $\exp(\lambda\mathbb{E}[X]) \le \mathbb{E}[\exp(\lambda X)] \le \exp(K_5^2\lambda^2)$, i.e.,
$$\lambda\mathbb{E}[X] \le K_5^2\lambda^2.$$
Since this holds for every $\lambda \in \mathbb{R}$: if $\lambda > 0$, then $\mathbb{E}[X] \le K_5^2\lambda$; on the other hand, if $\lambda < 0$, then $\mathbb{E}[X] \ge K_5^2\lambda$. In either case, as $\lambda \to 0$ (from each side, respectively), $0 \le \mathbb{E}[X] \le 0$, hence $\mathbb{E}[X] = 0$. ⊛
Problem (Exercise 2.5.5). (a) Show that if $X \sim N(0, 1)$, the function $\lambda \mapsto \mathbb{E}[\exp(\lambda^2X^2)]$ is only finite in some bounded neighborhood of zero.
(b) Suppose that some random variable $X$ satisfies $\mathbb{E}[\exp(\lambda^2X^2)] \le \exp(K\lambda^2)$ for all $\lambda \in \mathbb{R}$ and some constant $K$. Show that $X$ is a bounded random variable, i.e., $\|X\|_\infty < \infty$.
Problem (Exercise 2.5.7). Check that ∥·∥ψ2 is indeed a norm on the space of sub-gaussian random
variables.
Answer. It’s clear that ∥X∥ψ2 = 0 if and only if X = 0. Also, for any λ > 0, ∥λX∥ψ2 = λ∥X∥ψ2
is obvious. Hence, we only need to verify triangle inequality, i.e., for any sub-gaussian random
variables X and Y ,
∥X + Y ∥ψ2 ≤ ∥X∥ψ2 + ∥Y ∥ψ2 .
Firstly, we observe that since $\exp(x)$ is convex and increasing and $x^2$ is convex (hence their composition $e^{x^2}$ is convex),
$$\exp\left(\left(\frac{X+Y}{\|X\|_{\psi_2}+\|Y\|_{\psi_2}}\right)^2\right) \le \frac{\|X\|_{\psi_2}}{\|X\|_{\psi_2}+\|Y\|_{\psi_2}}\exp\left((X/\|X\|_{\psi_2})^2\right) + \frac{\|Y\|_{\psi_2}}{\|X\|_{\psi_2}+\|Y\|_{\psi_2}}\exp\left((Y/\|Y\|_{\psi_2})^2\right).$$
Now, from the definition of $\|X+Y\|_{\psi_2}$ and with $t := \|X\|_{\psi_2}+\|Y\|_{\psi_2}$, taking expectations above gives $\mathbb{E}[\exp((X+Y)^2/t^2)] \le 2$, i.e., $\|X+Y\|_{\psi_2} \le t$. ⊛
Problem (Exercise 2.5.9). Check that Poisson, exponential, Pareto and Cauchy distributions are not
sub-gaussian.
Answer. Omit. ⊛
Answer. Let $Y_i := |X_i|/(K\sqrt{1+\log i})$ (which is always non-negative) for all $i \ge 1$, where $K := \max_i\|X_i\|_{\psi_2}$. Then for all $t \ge 0$,
$$\mathbb{P}(Y_i \ge t) = \mathbb{P}\left(\frac{|X_i|}{K\sqrt{1+\log i}} \ge t\right) = \mathbb{P}\left(|X_i| \ge tK\sqrt{1+\log i}\right) \le 2\exp\left(-\frac{ct^2K^2(1+\log i)}{\|X_i\|_{\psi_2}^2}\right) \le 2\exp\left(-ct^2(1+\log i)\right) = 2(ei)^{-ct^2}$$
as $K \ge \|X_i\|_{\psi_2}$. Our goal now is to show that $\mathbb{E}[\max_iY_i] \le C$ for some absolute constant $C$. Consider $t_0 := \sqrt{2/c}$, so that $(ei)^{-ct^2} = e^{-ct^2}i^{-ct^2} \le e^{-ct^2}i^{-2}$ for $t \ge t_0$; then we have
$$\begin{aligned}
\mathbb{E}\left[\max_iY_i\right] &= \int_0^\infty\mathbb{P}\left(\max_iY_i \ge t\right)\mathrm{d}t\\
&\le \int_0^{t_0}\mathbb{P}\left(\max_iY_i \ge t\right)\mathrm{d}t + \int_{t_0}^\infty\sum_{i=1}^\infty\mathbb{P}(Y_i \ge t)\,\mathrm{d}t &&\text{(union bound)}\\
&\le t_0 + \int_{t_0}^\infty\sum_{i=1}^\infty2(ei)^{-ct^2}\,\mathrm{d}t\\
&\le \sqrt{2/c} + 2\int_{t_0}^\infty e^{-ct^2}\sum_{i=1}^\infty i^{-2}\,\mathrm{d}t\\
&\le \sqrt{2/c} + 2\cdot\frac{\pi^2}{6}\int_0^\infty e^{-ct^2}\,\mathrm{d}t = \sqrt{2/c} + \frac{\pi^2}{6}\cdot\frac{\sqrt{\pi}}{\sqrt{c}} = \frac{\sqrt{2}+\pi^{5/2}/6}{\sqrt{c}} =: C. \qquad\text{⊛}
\end{aligned}$$
Problem (Exercise 2.5.11). Show that the bound in Exercise 2.5.10 is sharp. Let $X_1, X_2, \dots, X_N$ be independent $N(0, 1)$ random variables. Prove that
$$\mathbb{E}\left[\max_{i\le N}X_i\right] \ge c\sqrt{\log N}.$$
so
$$\mathbb{E}\left[\max_{i\le N}X_i\right] = \int_0^\infty1 - \left(1 - \mathbb{P}(X_1 \ge t)\right)^N\,\mathrm{d}t \ge \int_0^\infty1 - \left(1 - Ce^{-t^2}\right)^N\,\mathrm{d}t = \sqrt{\log N}\int_0^\infty1 - \left(1 - CN^{-u^2}\right)^N\,\mathrm{d}u \qquad (t =: \sqrt{\log N}\,u).$$
Finally, as the last integral can be bounded below by some absolute constant $c$ depending only on $C$, we obtain the desired result. ⊛
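Together with the matching upper bound of Exercise 2.5.10, this says $\mathbb{E}[\max_{i\le N}X_i] \asymp \sqrt{\log N}$; a quick Monte-Carlo sketch (the trial counts and bracketing constants are ad hoc choices of mine):

```python
import math, random

def mean_max_gaussian(N, trials=2000, seed=0):
    """Monte-Carlo estimate of E[max_{i<=N} X_i] for i.i.d. N(0,1) samples."""
    rng = random.Random(seed)
    return sum(max(rng.gauss(0, 1) for _ in range(N)) for _ in range(trials)) / trials

# The ratio to sqrt(log N) stays between absolute constants.
for N in (10, 100, 1000):
    ratio = mean_max_gaussian(N) / math.sqrt(math.log(N))
    assert 0.5 <= ratio <= 2.0
```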
Answer. Omit. ⊛
Problem (Exercise 2.6.5). Let $X_1, \dots, X_N$ be independent sub-gaussian random variables with zero means and unit variances, and let $a = (a_1, \dots, a_N) \in \mathbb{R}^N$. Prove that for every $p \in [2, \infty)$ we have
$$\left(\sum_{i=1}^Na_i^2\right)^{1/2} \le \left\|\sum_{i=1}^Na_iX_i\right\|_{L^p} \le CK\sqrt{p}\left(\sum_{i=1}^Na_i^2\right)^{1/2}$$
where $K = \max_i\|X_i\|_{\psi_2}$ and $C$ is an absolute constant.
$$\left\|\sum_{i=1}^Na_iX_i\right\|_{L^p} \ge \left\|\sum_{i=1}^Na_iX_i\right\|_{L^2} = \left(\mathbb{E}\left[\left(\sum_{i=1}^Na_iX_i\right)^2\right]\right)^{1/2},$$
and at the same time, as $\operatorname{Var}[X_i] = 1$, $\operatorname{Var}\left[\sum_{i=1}^Na_iX_i\right] = \sum_{i=1}^Na_i^2\operatorname{Var}[X_i] = \sum_{i=1}^Na_i^2 = \|a\|_2^2$, hence we have
$$\left\|\sum_{i=1}^Na_iX_i\right\|_{L^p} \ge \left(\|a\|_2^2\right)^{1/2} = \|a\|_2,$$
which is the desired lower bound. For the upper bound, we see that
$$\left\|\sum_{i=1}^Na_iX_i\right\|_{L^p}^2 \le C^2p\left\|\sum_{i=1}^Na_iX_i\right\|_{\psi_2}^2 \le C'p\sum_{i=1}^N\|a_iX_i\|_{\psi_2}^2 = C'p\sum_{i=1}^Na_i^2\|X_i\|_{\psi_2}^2 \le C'K^2p\|a\|_2^2,$$
where $C, C'$ are absolute constants. Taking the square root on both sides, we obtain the desired result. ⊛
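For Rademacher signs the $L^p$ norms in Khintchine's inequality can be computed exactly by enumerating all sign patterns; a minimal sketch (the helper is mine, and the constant $3\sqrt{p}$ on the upper side is a deliberately loose stand-in for $CK\sqrt{p}$):

```python
import itertools, math

def khintchine_lp(a, p):
    """Exact L^p norm of sum_i a_i eps_i over i.i.d. Rademacher signs eps_i."""
    N = len(a)
    total = sum(abs(sum(e * x for e, x in zip(eps, a))) ** p
                for eps in itertools.product((-1, 1), repeat=N))
    return (total / 2 ** N) ** (1 / p)

a = [3.0, -1.0, 2.0, 0.5, -2.5]
norm_a = math.sqrt(sum(x * x for x in a))
for p in (2, 4, 6, 10):
    lp = khintchine_lp(a, p)
    assert norm_a - 1e-9 <= lp <= 3 * math.sqrt(p) * norm_a
```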
Problem (Exercise 2.6.6). Show that in the setting of Exercise 2.6.5, we have
$$c(K)\left(\sum_{i=1}^Na_i^2\right)^{1/2} \le \left\|\sum_{i=1}^Na_iX_i\right\|_{L^1} \le \left(\sum_{i=1}^Na_i^2\right)^{1/2}.$$
Here $K = \max_i\|X_i\|_{\psi_2}$ and $c(K) > 0$ is a quantity which may depend only on $K$.
Problem (Exercise 2.6.7). State and prove a version of Khintchine's inequality for $p \in (0, 2)$.
Answer. We claim that
$$c(K, p)\left(\sum_{i=1}^Na_i^2\right)^{1/2} \le \left\|\sum_{i=1}^Na_iX_i\right\|_{L^p} \le \left(\sum_{i=1}^Na_i^2\right)^{1/2}.$$
Here $K = \max_i\|X_i\|_{\psi_2}$ and $c(K, p) > 0$ is a quantity which depends on $K$ and $p$. We first recall the generalized Hölder inequality.
Theorem 2.6.1 (Generalized Hölder inequality). For $1/p + 1/q = 1/r$ where $p, q \in (0, \infty]$, $\|fg\|_{L^r} \le \|f\|_{L^p}\|g\|_{L^q}$.
Proof. The classical case is when $r = 1$. Considering $|f|^r \in L^{p/r}$ and $|g|^r \in L^{q/r}$ with $r/p + r/q = 1$, the standard Hölder inequality implies
$$\|fg\|_{L^r}^r = \int|fg|^r = \left\||fg|^r\right\|_{L^1} \le \left\||f|^r\right\|_{L^{p/r}}\left\||g|^r\right\|_{L^{q/r}} = \left(\int(|f|^r)^{p/r}\right)^{r/p}\left(\int(|g|^r)^{q/r}\right)^{r/q} = \|f\|_{L^p}^r\|g\|_{L^q}^r,$$
implying
$$\|Z\|_{L^p} \ge \left(\frac{\|Z\|_{L^2}}{\|Z\|_{L^{4-p}}^{(4-p)/4}}\right)^{4/p} = \frac{\|Z\|_{L^2}^{4/p}}{\|Z\|_{L^{4-p}}^{(4-p)/p}}.$$
Finally, by letting $Z = \sum_{i=1}^Na_iX_i$,
$$\left\|\sum_{i=1}^Na_iX_i\right\|_{L^p} \ge \frac{\left\|\sum_{i=1}^Na_iX_i\right\|_{L^2}^{4/p}}{\left\|\sum_{i=1}^Na_iX_i\right\|_{L^{4-p}}^{(4-p)/p}}.$$
Remark. Exercise 2.6.6 is just the special case $p = 1$, with $c(K, 1) = (CK\sqrt{3})^{-3}$.
Problem (Exercise 2.6.9). Show that unlike (2.19), the centering inequality in Lemma 2.6.8 does not
hold with C = 1.
Answer. Consider the random variable $X := \sqrt{\log2}\cdot\epsilon$ where $\epsilon$ is a Rademacher random variable with parameter $p$, i.e.,
$$X = \begin{cases}\sqrt{\log2}, & \text{w.p. } p;\\-\sqrt{\log2}, & \text{w.p. } 1-p.\end{cases}$$
Since $\mathbb{E}[\exp(X^2)] = 2$, we know that $\|X\|_{\psi_2}$ is exactly $1$. We now want to show that $\|X - \mathbb{E}[X]\|_{\psi_2} > 1$ for a suitable choice of $p$.
Problem (Exercise 2.7.2). Prove the equivalence of properties a-d in Proposition 2.7.1 by modifying
the proof of Proposition 2.5.2.
Problem (Exercise 2.7.3). More generally, consider the class of distributions whose tail decay is of
the type exp(−ctα ) or faster. Here α = 2 corresponds to sub-gaussian distributions, and α = 1, to
sub-exponential. State and prove a version of Proposition 2.7.1 for such distributions.
Answer. The generalized version of Proposition 2.7.1 is known as the family of sub-Weibull distributions [Vla+20]: Let $X$ be a random variable. Then the following properties are equivalent; the parameters $K_i > 0$ appearing in these properties differ from each other by at most an absolute constant factor.
$$\mathbb{E}[\exp(|X|^\alpha/K_4^\alpha)] \le 2.$$
From (b), when $\alpha k \ge 1$, we have $\mathbb{E}[|X|^{\alpha k}] \le (K_2(\alpha k)^{1/\alpha})^{\alpha k} = K_2^{\alpha k}(\alpha k)^k$. On the other hand, for any given $\alpha > 0$, there are only finitely many $k \ge 1$ such that $\alpha k < 1$. Hence, there exists some $\widetilde{K}_2$ such that
$$\mathbb{E}[|X|^{\alpha k}] \le \widetilde{K}_2^{\alpha k}(\alpha k)^k$$
for all $k \ge 1$. Then, since $k! \ge (k/e)^k$,
$$\mathbb{E}[\exp(\lambda^\alpha|X|^\alpha)] \le 1 + \sum_{k=1}^\infty\frac{\lambda^{\alpha k}\widetilde{K}_2^{\alpha k}(\alpha k)^k}{k!} \le 1 + \sum_{k=1}^\infty(\widetilde{K}_2^\alpha\lambda^\alpha\alpha e)^k = \frac{1}{1 - \widetilde{K}_2^\alpha\lambda^\alpha\alpha e}.$$
As $(1-x)e^{2x} \ge 1$ for all $x \in [0, 1/2]$, the above is further bounded by
$$\exp\left(2\widetilde{K}_2^\alpha\lambda^\alpha\alpha e\right) = \exp\left(\left[(2\alpha e)^{1/\alpha}\widetilde{K}_2\lambda\right]^\alpha\right)$$
whenever $\widetilde{K}_2^\alpha\lambda^\alpha\alpha e \le 1/2$. By letting $K_3 := (2\alpha e)^{1/\alpha}\widetilde{K}_2$, the condition is equivalent to
$$0 < \lambda^\alpha \le \frac{1}{2\widetilde{K}_2^\alpha\alpha e} \;\Leftrightarrow\; 0 < \lambda \le \frac{1}{\widetilde{K}_2(2\alpha e)^{1/\alpha}} = \frac{1}{K_3}.$$
Hence, if $0 < \lambda \le 1/K_3$, the above is satisfied. ⊛
Proof. Assuming (c) holds, then (d) is obtained by taking $\lambda := 1/K_4$ where $K_4 := K_3(\ln2)^{-1/\alpha}$: in this case, $\lambda = (\ln2)^{1/\alpha}/K_3$, hence $\mathbb{E}[\exp(|X|^\alpha/K_4^\alpha)] \le \exp((K_3\lambda)^\alpha) = e^{\ln2} = 2$. Then, from (d) and Markov's inequality (for $X$ normalized so that $\mathbb{E}[\exp(|X|^\alpha)] \le 2$),
$$\mathbb{P}(|X| \ge t) = \mathbb{P}(\exp(|X|^\alpha) \ge \exp(t^\alpha)) \le \frac{\mathbb{E}[\exp(|X|^\alpha)]}{\exp(t^\alpha)} \le 2\exp(-t^\alpha),$$
Problem (Exercise 2.7.4). Argue that the bound in property c can not be extended for all λ such
that |λ| ≤ 1/K3 .
Answer. It’s easy to see that in the proof of Exercise 2.7.3, when we prove (b) ⇒ (c), the condition
for λ essentially comes from:
P∞ e α α P∞ e
• whether 1 + k=1 (K k
2 λ αe) = 1 + k=1 (K2 λe) as α = 1 converges; and
k
Problem (Exercise 2.7.10). Prove an analog of the Centering Lemma 2.6.8 for sub-exponential random variables $X$:
$$\|X - \mathbb{E}[X]\|_{\psi_1} \le C\|X\|_{\psi_1}.$$
Answer. Since $\|\cdot\|_{\psi_1}$ is a norm, we have $\|X - \mathbb{E}[X]\|_{\psi_1} \le \|X\|_{\psi_1} + \|\mathbb{E}[X]\|_{\psi_1}$, and
$$\|\mathbb{E}[X]\|_{\psi_1} \lesssim |\mathbb{E}[X]| \le \|X\|_{L^1} \le K_2 \cong \|X\|_{\psi_1}$$
since each $K_i \cong \|X\|_{\psi_1} = K_4$ up to absolute constants. ⊛
Answer. Clearly, $\|X\|_\psi \ge 0$. To check $\|X\|_\psi = 0$ if and only if $X = 0$ a.s., we first see that $\|0\|_\psi = 0$ as $\psi(0) = 0$. On the other hand, if $\|X\|_\psi = 0$, then by the monotone convergence theorem, we have
$$1 \ge \lim_{t\to0}\mathbb{E}[\psi(|X|/t)] = \mathbb{E}\left[\lim_{t\to0}\psi(|X|/t)\right] = \int_0^\infty\mathbb{P}\left(\lim_{t\to0}\psi(|X|/t) > u\right)\mathrm{d}u = \mathbb{P}(|X| > 0)\int_0^\infty\mathbb{P}\left(\lim_{t\to0}\psi(|X|/t) > u \,\middle|\, |X| > 0\right)\mathrm{d}u = \infty\cdot\mathbb{P}(|X| > 0),$$
since $\psi(x) \to \infty$ as $x \to \infty$, and in this case $x = |X|/t$, which indeed goes to $\infty$ as $t \to 0$ on $\{|X| > 0\}$. Overall, this implies $\mathbb{P}(|X| > 0) = 0$, i.e., $X = 0$ almost surely, hence we conclude that $\|X\|_\psi = 0$ if and only if $X = 0$ a.s. The other two properties follow the same proof as Exercise 2.5.7. ⊛
$$\mathbb{E}[\exp(\lambda X)] \le \exp\left(g(\lambda)\mathbb{E}[X^2]\right) \quad\text{where } g(\lambda) = \frac{\lambda^2/2}{1 - |\lambda|K/3},$$
hence
$$\mathbb{E}[\exp(\lambda X)] \le \mathbb{E}\left[1 + \lambda X + \frac{\lambda^2X^2/2}{1 - |\lambda X|/3}\right] \le 1 + \frac{\lambda^2\mathbb{E}[X^2]/2}{1 - |\lambda|K/3} \le \exp\left(\frac{\lambda^2\mathbb{E}[X^2]/2}{1 - |\lambda|K/3}\right),$$
where we let $x := \lambda X$ and apply the claim. Finally, noting that the right-hand side is exactly $\exp(g(\lambda)\mathbb{E}[X^2])$, we're done. ⊛
Problem (Exercise 2.8.6). Deduce Theorem 2.8.4 from the bound in Exercise 2.8.5.
from Exercise 2.8.5, if $|\lambda| < 3/K$. Denoting $\sigma^2 = \sum_{i=1}^N\mathbb{E}[X_i^2]$, we further have
$$\mathbb{P}\left(\sum_{i=1}^NX_i \ge t\right) \le \inf_{\lambda>0}\exp\left(-\lambda t + g(\lambda)\sigma^2\right).$$
Let $0 \le \lambda = \frac{t}{\sigma^2 + tK/3} < 3/K$; we see that
$$\mathbb{P}\left(\sum_{i=1}^NX_i \ge t\right) \le \exp\left(-\frac{t^2}{\sigma^2 + tK/3} + \frac{\sigma^2\lambda^2/2}{1 - |\lambda|K/3}\right) = \exp\left(-\frac{t^2/2}{\sigma^2 + tK/3}\right).$$
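The MGF bound from Exercise 2.8.5 that drives this derivation can be checked numerically on a two-point example; a sketch (the specific distribution, $X = -1$ w.p. $2/3$ and $X = 2$ w.p. $1/3$, so $\mathbb{E}[X] = 0$ and $K = 2$, is my own choice):

```python
import math

def max_bernstein_mgf_ratio():
    """Largest observed E[e^{lam X}] / exp(g(lam) E[X^2]) over 0 < lam < 3/K,
    for X = -1 w.p. 2/3 and X = 2 w.p. 1/3 (mean zero, |X| <= K = 2)."""
    K, var = 2.0, 2.0                       # E[X^2] = (2/3)*1 + (1/3)*4 = 2
    worst = 0.0
    for i in range(1, 140):
        lam = i / 100.0                     # stays below 3/K = 1.5
        mgf = (2 / 3) * math.exp(-lam) + (1 / 3) * math.exp(2 * lam)
        g = (lam ** 2 / 2) / (1 - lam * K / 3)
        worst = max(worst, mgf / math.exp(g * var))
    return worst

# The bound predicts the ratio never exceeds 1 on this range.
assert max_bernstein_mgf_ratio() <= 1 + 1e-12
```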
(b) We first observe that $\mathbb{E}[\|X\|_2] \le \sqrt{\mathbb{E}[\|X\|_2^2]} = \sqrt{n}$, hence we only need to deal with the lower bound.
Week 9: Concentration Inequalities of Random Vectors
and from the sub-gaussian property, this is $\lesssim n\cdot\max_{1\le i\le n}\|X_i\|_{\psi_2}^4 = nK^4$. Overall,
$$\mathbb{E}[\|X\|_2] \gtrsim \sqrt{n} - \frac{1}{2n^{3/2}}\cdot nK^4 = \sqrt{n} - \frac{K^4}{2\sqrt{n}} = \sqrt{n} + o(1),$$
$$\operatorname{Var}[\|X\|_2] \le CK^4.$$
Answer. From the definition and the fact that the mean minimizes the mean squared error,
$$\operatorname{Var}[\|X\|_2] = \mathbb{E}[(\|X\|_2 - \mathbb{E}[\|X\|_2])^2] \le \mathbb{E}[(\|X\|_2 - \sqrt{n})^2],$$
then from the proof of Exercise 3.1.4, as $\|\|X\|_2 - \sqrt{n}\|_{\psi_2} \le cK^2$ for some absolute constant $c$ (so in particular the second moment is $\lesssim K^4$),
$$\operatorname{Var}[\|X\|_2] \le \mathbb{E}[(\|X\|_2 - \sqrt{n})^2] \le c'K^4,$$
Problem (Exercise 3.1.6). Let $X = (X_1, \dots, X_n) \in \mathbb{R}^n$ be a random vector with independent coordinates $X_i$ that satisfy $\mathbb{E}[X_i^2] = 1$ and $\mathbb{E}[X_i^4] \le K^4$. Show that
$$\operatorname{Var}[\|X\|_2] \le CK^4.$$
Answer. Firstly, observe that with our new assumption, Exercise 3.1.4 (b) again gives $\mathbb{E}[\|X\|_2] \gtrsim \sqrt{n} - K^4/\sqrt{n}$. Then, for the same reason as stated in Exercise 3.1.5,
$$\operatorname{Var}[\|X\|_2] \le \mathbb{E}[(\|X\|_2 - \sqrt{n})^2] = 2n - 2\sqrt{n}\,\mathbb{E}[\|X\|_2] \lesssim 2n - 2\sqrt{n}\left(\sqrt{n} - \frac{K^4}{\sqrt{n}}\right) = 2K^4,$$
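A Monte-Carlo illustration that $\operatorname{Var}[\|X\|_2]$ stays bounded as $n$ grows, using standard normal coordinates (where $K$ is an absolute constant; trial counts and seeds are ad hoc choices of mine):

```python
import math, random

def var_norm(n, trials=2000, seed=0):
    """Sample variance of ||X||_2 for X with i.i.d. N(0,1) coordinates."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(n)))
             for _ in range(trials)]
    m = sum(norms) / trials
    return sum((x - m) ** 2 for x in norms) / trials

# The variance does not grow with the dimension (it tends to 1/2 for Gaussians).
for n in (10, 100, 1000):
    assert var_norm(n) < 1.0
```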
Problem (Exercise 3.1.7). Let $X = (X_1, \dots, X_n) \in \mathbb{R}^n$ be a random vector with independent coordinates $X_i$ with continuous distributions. Assume that the densities of $X_i$ are uniformly bounded by $1$. Show that, for any $\epsilon > 0$, we have
$$\mathbb{P}(\|X\|_2 \le \epsilon\sqrt{n}) \le (C\epsilon)^n.$$
Answer. Follow the same argument as Exercise 2.2.10,^a i.e., first we bound $\mathbb{E}[\exp(-tX_i^2)]$ for all $t > 0$. We have
$$\mathbb{E}[\exp(-tX_i^2)] = \int_0^\infty e^{-tx^2}f_{X_i}(x)\,\mathrm{d}x \le \int_0^\infty e^{-tx^2}\,\mathrm{d}x = \frac{1}{2}\sqrt{\frac{\pi}{t}}$$
from the Gaussian integral. Then, from the MGF trick, we have
$$\mathbb{P}(\|X\|_2 \le \epsilon\sqrt{n}) = \mathbb{P}(-\|X\|_2^2 \ge -\epsilon^2n) \le \inf_{t>0}\frac{\mathbb{E}[\exp(-t\|X\|_2^2)]}{\exp(-t\epsilon^2n)} \le \inf_{t>0}\left(\frac{1}{2}\sqrt{\frac{\pi}{t}}\right)^ne^{t\epsilon^2n}.$$
as $\Sigma$ is positive-semidefinite.
(b) Similarly,
$$\mathbb{E}[Z] = \Sigma^{-1/2}\mathbb{E}[X - \mu] = \Sigma^{-1/2}(\mu - \mu) = 0,$$
and moreover,
Problem (Exercise 3.2.6). Let $X$ and $Y$ be independent, mean zero, isotropic random vectors in $\mathbb{R}^n$. Check that
$$\mathbb{E}[\|X - Y\|_2^2] = 2n.$$
Answer. Firstly, from the spherical symmetry of $X$, for any $x \in \mathbb{R}^n$, $\langle X, x\rangle \overset{D}{=} \langle X, \|x\|_2e\rangle$ for all $e \in S^{n-1}$. Hence, to show $X$ is isotropic, from Lemma 3.2.3, it suffices to show that for any $x \in \mathbb{R}^n$,
$$\mathbb{E}[\langle X, x\rangle^2] = \frac{1}{n}\sum_{i=1}^n\mathbb{E}[\langle X, \|x\|_2e_i\rangle^2] = \frac{1}{n}\sum_{i=1}^n\mathbb{E}[(\|x\|_2X_i)^2] = \|x\|_2^2\,\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^nX_i^2\right] = \|x\|_2^2,$$
where $e_i$ denotes the $i$th standard unit vector. The last equality holds from the fact that
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^nX_i^2\right] = \frac{1}{n}\mathbb{E}[\|X\|_2^2] = \frac{1}{n}\cdot n = 1$$
as $X \sim U(\sqrt{n}S^{n-1})$. On the other hand, clearly the $X_i$'s can't be independent since the first $n-1$ coordinates determine the last coordinate. ⊛
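Isotropy of the uniform distribution on $\sqrt{n}S^{n-1}$ can be spot-checked by sampling (normalized Gaussians give the uniform direction; the helper, test vector, and tolerances are mine):

```python
import math, random

def isotropy_error(n, x, trials=40_000, seed=7):
    """Relative error |mean <X,x>^2 - ||x||^2| / ||x||^2 for X ~ U(sqrt(n) S^{n-1})."""
    rng = random.Random(seed)
    x_norm2 = sum(v * v for v in x)
    acc = 0.0
    for _ in range(trials):
        g = [rng.gauss(0, 1) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        X = [math.sqrt(n) * v / norm for v in g]   # uniform on sqrt(n) * S^{n-1}
        acc += sum(a * b for a, b in zip(X, x)) ** 2
    return abs(acc / trials - x_norm2) / x_norm2

# E[<X, x>^2] should equal ||x||_2^2 for any fixed x.
assert isotropy_error(5, [1.0, -2.0, 0.5, 0.0, 1.5]) < 0.05
```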
Problem (Exercise 3.3.3). Deduce the following properties from the rotation invariance of the normal
distribution.
(c) Let G be an m × n Gaussian random matrix, i.e., the entries of G are independent N (0, 1)
random variables. Let u ∈ Rn be a fixed unit vector. Then
Gu ∼ N (0, Im ).
Answer. (a) Without loss of generality, we may assume $\|u\|_2 = 1$ and prove
$$\langle g, u\rangle \sim N(0, 1)$$
for any fixed unit vector $u \in \mathbb{R}^n$. But this is clear as there must exist $u_1, \dots, u_{n-1}$ such that $\{u, u_1, \dots, u_{n-1}\}$ forms an orthonormal basis of $\mathbb{R}^n$, and $U := (u, u_1, \dots, u_{n-1})^\top$ is orthogonal, so by rotation invariance
$$Ug \sim N(0, I_n),$$
which implies $(Ug)_1 \sim N(0, 1)$. With $(Ug)_1 = u^\top g = \langle g, u\rangle$, we're done.
(b) For independent $X_i \sim N(0, \sigma_i^2)$, we have $X_i/\sigma_i \sim N(0, 1)$. We want to show
$$\sum_{i=1}^nX_i \sim N(0, \sigma^2)$$
where $\sigma^2 = \sum_{i=1}^n\sigma_i^2$. Firstly, we have $g := (X_1/\sigma_1, \dots, X_n/\sigma_n) \sim N(0, I_n)$; then by considering $u := (\sigma_1, \dots, \sigma_n) \in \mathbb{R}^n$, we have
$$\langle g, u\rangle = \sum_{i=1}^nX_i \sim N(0, \|u\|_2^2) = N\left(0, \sum_{i=1}^n\sigma_i^2\right) = N(0, \sigma^2)$$
from (a).
(c) For any fixed unit vector $u$, $(Gu)_i = \sum_{j=1}^ng_{ij}u_j = \langle g_i, u\rangle$ where $g_i = (g_{i1}, g_{i2}, \dots, g_{in})$ for all $i \in [m]$. It's clear that $g_i \sim N(0, I_n)$, and from (a), $\langle g_i, u\rangle \sim N(0, 1)$. As the rows $g_i$ are independent, this implies $Gu \sim N(0, I_m)$, as desired.
Problem (Exercise 3.3.4). Let X be a random vector in Rn . Show that X has a multivariate normal
distribution if and only if every one-dimensional marginal ⟨X, θ⟩, θ ∈ Rn , has a (univariate) normal
distribution.
Answer. This is an application of Cramér-Wold device and Exercise 3.3.3 (a). Omit the details. ⊛
(b) Given a vector $u \in \mathbb{R}^n$, consider the random variable $X_u := \langle X, u\rangle$. From Exercise 3.3.3 we know that $X_u \sim N(0, \|u\|_2^2)$. Check that
Problem (Exercise 3.3.6). Let $G$ be an $m \times n$ Gaussian random matrix, i.e., the entries of $G$ are independent $N(0, 1)$ random variables. Let $u, v \in \mathbb{R}^n$ be unit orthogonal vectors. Prove that $Gu$ and $Gv$ are independent $N(0, I_m)$ random vectors.
Answer. It’s clear that Gu and Gv are both N (0, Im ) random vectors from Exercise 3.3.3 (c). It
remains to show that Gu and Gv are independent, i.e., (Gu)i and (Gv)j are independent random
variables.
For $i \ne j$, this is clear since $(Gu)_i = e_i^\top(Gu)$ and $(Gv)_j = e_j^\top(Gv)$, and $e_i^\top G$ gives the $i$th row of $G$, while $e_j^\top G$ gives the $j$th row of $G$. The fact that $G$ has independent rows proves the result for the case of $i \ne j$.
For $i = j$, let $e_i^\top G =: g^\top$ where $g \sim N(0, I_n)$, and we want to show independence of $(Gu)_i = g^\top u$ and $(Gv)_i = g^\top v$. Since
$$\begin{pmatrix}g^\top u\\g^\top v\end{pmatrix} = (u, v)^\top g \sim N(0, (u, v)^\top I_n(u, v)) = N(0, I_2)$$
as $u$ and $v$ are orthonormal, $g^\top u$ and $g^\top v$ are uncorrelated and jointly Gaussian, hence independent. ⊛
$$g = r\theta$$
where $r = \|g\|_2$ is the length and $\theta = g/\|g\|_2$ is the direction of $g$. Prove the following:
(a) The length $r$ and direction $\theta$ are independent random variables.
Answer. For any measurable $M \subseteq \mathbb{R}^n$, given the normal density $f_G(g)$ of $g$, some elementary calculus gives the polar coordinate transformation $\mathrm{d}g = r^{n-1}\,\mathrm{d}r\,\mathrm{d}\sigma(\theta)$, hence
$$\mathbb{P}(g \in M) = \int_Mf_G(g)\,\mathrm{d}g = \int_A\int_Bf_G(r\theta)\,\mathrm{d}\sigma(\theta)\,r^{n-1}\,\mathrm{d}r = \frac{\omega_{n-1}}{(2\pi)^{n/2}}\int_Ar^{n-1}e^{-r^2/2}\,\mathrm{d}r\int_B\frac{\mathrm{d}\sigma(\theta)}{\omega_{n-1}} = \mathbb{P}(r \in A, \theta \in B) \qquad(3.1)$$
for some $A \subseteq [0, \infty)$ and $B \subseteq S^{n-1}$ generating $M$, where $\sigma$ is the surface area element on $S^{n-1}$ such that $\int_{S^{n-1}}\mathrm{d}\sigma = \omega_{n-1}$, i.e., $\omega_{n-1}$ is the surface area of the unit sphere $S^{n-1}$.
(a) From Equation 3.1, it's possible to write $\mathbb{P}(r \in A, \theta \in B) = f(A)g(B)$ as a product of two set functions, normalized such that $g(S^{n-1}) = 1$ with appropriate constant manipulation. Hence, with $B = S^{n-1}$, $\mathbb{P}(r \in A) = f(A)$, implying $f([0, \infty)) = 1$ as well. This further shows, by considering $A = [0, \infty)$, that $\mathbb{P}(\theta \in B) = g(B)$, so the joint law factorizes and $r$, $\theta$ are independent.
(b) From Equation 3.1, we see that for any B ⊆ S n−1 , the density is uniform among dσ(θ), hence
θ is uniformly distributed on S n−1 .
$$\sum_{i=1}^Nu_iu_i^\top = AI_n.$$
Answer. Recall that for two symmetric matrices $A, B \in \mathbb{R}^{n\times n}$, $A = B$ if and only if $x^\top Ax = x^\top Bx$ for all $x \in \mathbb{R}^n$. Hence,
$$\sum_{i=1}^Nu_iu_i^\top = AI_n \;\Leftrightarrow\; x^\top\left(\sum_{i=1}^Nu_iu_i^\top\right)x = x^\top(AI_n)x$$
2. Just consider $X_i = Z$ all identical, where $Z \sim N(0, 1)$. Then, we see that
$$\max_i\|X_i\|_{\psi_2} = \|Z\|_{\psi_2} = \sqrt{8/3},$$
while
$$\|X\|_{\psi_2} \ge \left\|\langle X, \mathbb{1}_n/\sqrt{n}\rangle\right\|_{\psi_2} = \|\sqrt{n}Z\|_{\psi_2} = \sqrt{8n/3}.$$
Answer. Since we not only want an upper bound, but the tight, non-asymptotic behavior, we need to calculate $\|X\|_{\psi_2}$ as precisely as possible. We note that
$$\|X\|_{\psi_2} = \sup_{x\in S^{n-1}}\|\langle X, x\rangle\|_{\psi_2} = \sup_{x\in S^{n-1}}\inf\{t > 0 : \mathbb{E}[\exp(\langle X, x\rangle^2/t^2)] \le 2\},$$
and clearly the supremum is attained when $x = e_i$ for some $i$. In this case, note that since $X \sim U(\{\sqrt{n}e_i\}_i)$, if we focus on a particular coordinate $i$,
$$X_i = \begin{cases}0, & \text{w.p. } \frac{n-1}{n};\\\sqrt{n}, & \text{w.p. } \frac{1}{n}.\end{cases}$$
Hence, for any $t > 0$,
$$\mathbb{E}[\exp(X_i^2/t^2)] = \frac{n-1}{n} + \frac{1}{n}\exp\left(\frac{n}{t^2}\right).$$
Equating the above to be exactly $2$ and solving w.r.t. $t$, we have
$$\frac{n-1+e^{n/t^2}}{n} = 2 \;\Leftrightarrow\; n-1+e^{n/t^2} = 2n \;\Leftrightarrow\; \frac{n}{t^2} = \ln(n+1) \;\Leftrightarrow\; t = \sqrt{\frac{n}{\ln(n+1)}},$$
meaning that
$$\|X\|_{\psi_2} = \inf\{t > 0 : \mathbb{E}[\exp(X_i^2/t^2)] \le 2\} = \sqrt{\frac{n}{\ln(n+1)}} \asymp \sqrt{\frac{n}{\log n}}.$$
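The closed form can be cross-checked by solving $\mathbb{E}[\exp(X_i^2/t^2)] = 2$ numerically; a small bisection sketch (the helper and search range are mine):

```python
import math

def psi2_coordinate(n):
    """Solve (n-1)/n + exp(n/t^2)/n = 2 for t by bisection (LHS decreases in t)."""
    def f(t):
        a = n / t ** 2
        if a > 700:                    # avoid math.exp overflow for tiny t
            return float("inf")
        return (n - 1) / n + math.exp(a) / n - 2.0
    lo, hi = 1e-6, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Agrees with the closed form t = sqrt(n / ln(n+1)).
for n in (2, 10, 1000):
    closed = math.sqrt(n / math.log(n + 1))
    assert abs(psi2_coordinate(n) - closed) < 1e-6 * closed
```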
Problem (Exercise 3.4.5). Let $X$ be an isotropic random vector supported in a finite set $T \subseteq \mathbb{R}^n$. Show that in order for $X$ to be sub-gaussian with $\|X\|_{\psi_2} = O(1)$, the cardinality of the set must be exponentially large in $n$:
$$|T| \ge e^{cn}.$$
Problem (Exercise 3.4.7). Extend Theorem 3.4.6 for the uniform distribution on the Euclidean ball $B(0, \sqrt{n})$ in $\mathbb{R}^n$ centered at the origin and with radius $\sqrt{n}$. Namely, show that a random vector
$$X \sim U(B(0, \sqrt{n}))$$
is sub-gaussian, and
$$\|X\|_{\psi_2} \le C.$$
Answer. For $X \sim U(B(0, \sqrt{n}))$, consider $R := \|X\|_2/\sqrt{n}$ and $Y := X/R = \sqrt{n}X/\|X\|_2 \sim U(\sqrt{n}S^{n-1})$. From Theorem 3.4.6, $\|Y\|_{\psi_2} \le C$. It's clear that $R \le 1$, hence for any $x \in S^{n-1}$,
$$\mathbb{E}[\exp(\langle X, x\rangle^2/t^2)] = \mathbb{E}[\exp(R^2\langle Y, x\rangle^2/t^2)] \le \mathbb{E}[\exp(\langle Y, x\rangle^2/t^2)],$$
K := {x ∈ Rn : ∥x∥1 ≤ r}.
Answer. (a) Observe that for $i \ne j$, $(X_i, X_j) \overset{D}{=} (X_i, -X_j)$, hence $\mathbb{E}[X_i] = 0$ and $\mathbb{E}[X_iX_j] = 0$ for $i \ne j$. Hence, for $X$ to be isotropic, we need $\mathbb{E}[X_i^2] = 1$. Now, we note that $\mathbb{P}(|X_i| > x) = (r-x)^n/r^n = (1 - x/r)^n$ for $x \in [0, r]$, hence
$$\mathbb{E}[X_i^2] = \int_0^\infty2x\,\mathbb{P}(|X_i| > x)\,\mathrm{d}x = 2r^2\int_0^r\frac{x}{r}\left(1 - \frac{x}{r}\right)^n\frac{\mathrm{d}x}{r} = 2r^2\int_0^1t(1-t)^n\,\mathrm{d}t,$$
which with some calculation is $2r^2/(n^2+3n+2)$. Equating this with $1$ gives $r \asymp n$.
(b) It suffices to show that $\|X_i\|_{L^p} > C\sqrt{p}$, which in turn blows up the sub-gaussian property in terms of the $L^p$ norm. We see that
$$\|X_i\|_{L^p}^p = \int_0^\infty px^{p-1}\,\mathbb{P}(|X_i| > x)\,\mathrm{d}x = pr^p\int_0^r\left(\frac{x}{r}\right)^{p-1}\left(1 - \frac{x}{r}\right)^n\frac{\mathrm{d}x}{r} = pr^p\int_0^1t^{p-1}(1-t)^n\,\mathrm{d}t = pr^p\cdot B(p, n+1),$$
i.e.,
$$\|X_i\|_{L^p}^p = pr^p\cdot\frac{\Gamma(p)\Gamma(n+1)}{\Gamma(p+n+1)},$$
hence $\|X_i\|_{L^p} > C\sqrt{p}$ is evident from Stirling's formula.
Problem (Exercise 3.4.10). Show that the concentration inequality in Theorem 3.1.1 may not hold
for a general isotropic sub-gaussian random vector X. Thus, independence of the coordinates of X
is an essential requirement in that result.
Answer. We want to show that $\|\|X\|_2 - \sqrt{n}\|_{\psi_2} \le C\max_i\|X_i\|_{\psi_2}^2$ does not hold for a general isotropic sub-gaussian random vector $X$ with $\mathbb{E}[X_i^2] = 1$. Let $0 < a < 1 < b$ such that $a^2 + b^2 = 2$, and define
$$X := (aZ)^\epsilon(bZ)^{1-\epsilon},$$
where $\epsilon \sim \operatorname{Bern}(1/2)$ and $Z \sim N(0, I_n)$. In human language, $X$ has the mixture distribution
$$F_X := \frac{1}{2}F_{aZ} + \frac{1}{2}F_{bZ},$$
and $\mathbb{E}[X_i^2] = 1$ with a similar calculation. Moreover, for any vector $x \in S^{n-1}$,
$$\mathbb{E}[\exp(\langle X, x\rangle^2/t^2)] = \frac{1}{2\sqrt{1 - 2a^2/t^2}} + \frac{1}{2\sqrt{1 - 2b^2/t^2}} < 2$$
when $t$ is large enough (compared to $a, b$). This shows $\|\langle X, x\rangle\|_{\psi_2} \le t$, and since $a, b$ are taken to be constants, $X$ is indeed a sub-gaussian random vector.
Now, we show that the norm of $X$ actually deviates away from $\sqrt{n}$ at a non-vanishing rate of $\sqrt{n}$. In particular, consider $t = (b-1)\sqrt{n}/2$; then
$$2\,\mathbb{E}\left[\exp\left((\|X\|_2 - \sqrt{n})^2/t^2\right)\right] > \mathbb{E}\left[\exp\left((\|bZ\|_2 - \sqrt{n})^2/t^2\right)\right] > \mathbb{E}\left[\exp\left((\|bZ\|_2 - \sqrt{n})^2/t^2\right)\mathbb{1}_{\|Z\|_2^2 > n}\right] > \exp\left((b\sqrt{n} - \sqrt{n})^2/t^2\right)\mathbb{P}(\|Z\|_2^2 > n) \quad\text{(since $b > 1$)}$$
$$= e^4\,\mathbb{P}(\|Z\|_2^2 > n) \to e^4/2 > 4$$
since $\mathbb{P}(\|Z\|_2^2 > n) = \mathbb{P}\left(\sum_{i=1}^nZ_i^2 > n\right)$, and with $\mathbb{E}[Z_i^2] = \operatorname{Var}[Z_i] = 1$ and $\operatorname{Var}[Z_i^2] = \mathbb{E}[Z_i^4] - \mathbb{E}[Z_i^2]^2 = 3 - 1 = 2 < \infty$,
$$\frac{\frac{1}{n}\sum_{i=1}^nZ_i^2 - 1}{\sqrt{2}/\sqrt{n}} = \frac{1}{\sqrt{2n}}\left(\sum_{i=1}^nZ_i^2 - n\right) \xrightarrow{D} N(0, 1)$$
by the central limit theorem; hence the asymptotic distribution of $\sum_{i=1}^nZ_i^2 - n$ is symmetric around $0$, meaning that $\mathbb{P}\left(\sum_{i=1}^nZ_i^2 > n\right) = \mathbb{P}\left(\sum_{i=1}^nZ_i^2 - n > 0\right) \to 1/2$. This implies that for all large enough $n$,
$$\left\|\|X\|_2 - \sqrt{n}\right\|_{\psi_2} \ge t = (b-1)\frac{\sqrt{n}}{2} \to \infty.$$
⊛
$$\left|\sum_{i,j}a_{ij}x_iy_j\right| \le \max_i|x_i|\cdot\max_j|y_j|$$
$$\left|\sum_{i,j}a_{ij}\langle u_i, v_j\rangle\right| \le K\max_i\|u_i\|\cdot\max_j\|v_j\|$$
Answer. Omit. ⊛
Problem (Exercise 3.5.3). Deduce the following version of Grothendieck's inequality for symmetric $n \times n$ matrices $A = (a_{ij})$ with real entries. Suppose that $A$ is either positive semidefinite or has zero diagonal. Assume that, for any numbers $x_i \in \{-1, 1\}$, we have
$$\left|\sum_{i,j}a_{ij}x_ix_j\right| \le 1.$$
Then, for any Hilbert space $H$ and any vectors $u_i, v_j \in H$ satisfying $\|u_i\| = \|v_j\| = 1$, we have
$$\left|\sum_{i,j}a_{ij}\langle u_i, v_j\rangle\right| \le 2K,$$
Answer. Omit. ⊛
Problem (Exercise 3.5.5). Show that the optimization (3.21) is equivalent to the following semidefinite program:
$$\max\;\langle A, X\rangle : X \succeq 0,\; X_{ii} = 1 \text{ for } i = 1, \dots, n.$$
Answer. Omit. ⊛
Answer. Omit. ⊛
Answer. Omit. ⊛
Answer. Omit. ⊛
$$\langle u^{\otimes k}, v^{\otimes k}\rangle = \sum_{i_1,\dots,i_k}u_{i_1\dots i_k}v_{i_1\dots i_k} = \sum_{i_1,\dots,i_k}u_{i_1}\cdots u_{i_k}v_{i_1}\cdots v_{i_k} = \left(\sum_{i=1}^nu_iv_i\right)^k$$
Problem (Exercise 3.7.5). (a) Show that there exist a Hilbert space H and a transformation
Φ : Rn → H such that
(b) More generally, consider a polynomial f : R → R with non-negative coefficients, and construct
H and Φ such that
⟨Φ(u), Φ(v)⟩ = f (⟨u, v⟩) for all u, v ∈ Rn .
(c) Show the same for any real analytic function $f : \mathbb{R} \to \mathbb{R}$ with non-negative coefficients, i.e., for any function that can be represented as a convergent series
$$f(x) = \sum_{k=0}^\infty a_kx^k, \quad x \in \mathbb{R}. \qquad(3.2)$$
Then, by a similar calculation as in (a), we have ⟨Φ(u), Φ(v)⟩ = f(⟨u, v⟩) for all u, v ∈ R^n.

(c) In this case, we just let m = ∞ in (b), i.e., consider

H := ⊕_{k=0}^∞ (R^n)^{⊗k},  and  Φ(x) := ⊕_{k=0}^∞ √(a_k) x^{⊗k},

where the limit is allowed as f converges everywhere. Note that a_k ≥ 0, hence √(a_k) is also
well-defined.
Problem (Exercise 3.7.6). Let f : R → R be any real analytic function (with possibly negative coefficients in Equation (3.2)). Show that there exist a Hilbert space H and transformations Φ, Ψ : R^n → H
such that
⟨Φ(u), Ψ(v)⟩ = f (⟨u, v⟩) for all u, v ∈ Rn .
Then, ⟨Φ(u), Ψ(v)⟩ = f(⟨u, v⟩) since the sign of a_k is now taken care of by Ψ. The norm can be
calculated as

∥Φ(u)∥^2 = ⟨Φ(u), Φ(u)⟩ = Σ_{k=0}^∞ ⟨√|a_k| u^{⊗k}, √|a_k| u^{⊗k}⟩ = Σ_{k=0}^∞ |a_k| ⟨u^{⊗k}, u^{⊗k}⟩ = Σ_{k=0}^∞ |a_k| ⟨u, u⟩^k = Σ_{k=0}^∞ |a_k| ∥u∥_2^{2k},
where the last equality follows from Exercise 3.7.4. A similar calculation can be carried out for
∥Ψ(u)∥2 . ⊛
Random matrices
Check that
A^{−1} = Σ_{i=1}^n (1/s_i) v_i u_i^⊤.
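As a quick numerical sanity check (not part of the original solution; an illustrative numpy sketch with an arbitrary random matrix), one can verify the formula A^{−1} = Σ_i (1/s_i) v_i u_i^⊤ directly from a full SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))  # invertible with probability 1

# Full SVD: A = U diag(s) V^T, with columns u_i of U and v_i of V
U, s, Vt = np.linalg.svd(A)
V = Vt.T

# A^{-1} = sum_i (1/s_i) v_i u_i^T
A_inv = sum((1 / s[i]) * np.outer(V[:, i], U[:, i]) for i in range(n))

assert np.allclose(A_inv, np.linalg.inv(A))
```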
Problem (Exercise 4.1.2). Prove the following bound on the singular values si of any matrix A:
s_i ≤ (1/√i) ∥A∥_F.
∥A∥_F^2 = Σ_{k=1}^r s_k^2 ≥ Σ_{k≤i} s_k^2 ≥ i s_i^2.
Problem (Exercise 4.1.3). Let Ak be the best rank k approximation of a matrix A. Express ∥A−Ak ∥2
and ∥A − Ak ∥2F in terms of the singular values si of A.
Week 12: High-Dimensional Sub-Gaussian Distributions
hence

A − A_k = Σ_{i=k+1}^n s_i u_i v_i^⊤.
This implies that the singular values of the matrix A − A_k are just s_{k+1}, …, s_n,^a implying

∥A − A_k∥ = s_{k+1},
and

∥A − A_k∥_F^2 = Σ_{i=k+1}^n s_i^2.
⊛
a This can be seen from the fact that the same U and V still work, but now si = 0 for all 1 ≤ i ≤ k.
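These expressions for the approximation errors can be verified numerically (an illustrative sketch, not part of the original solution; Ak is built by truncating the SVD):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 7, 5, 2
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A)

# Best rank-k approximation keeps the k largest singular triplets
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# spectral error = s_{k+1}; squared Frobenius error = sum of remaining s_i^2
assert np.isclose(np.linalg.norm(A - Ak, 2), s[k])
assert np.isclose(np.linalg.norm(A - Ak, "fro") ** 2, np.sum(s[k:] ** 2))
```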
Problem (Exercise 4.1.4). Let A be an m×n matrix with m ≥ n. Prove that the following statements
are equivalent.
(a) A^⊤A = I_n.
(b) P := AA^⊤ is an orthogonal projection.^a
(c) A is an isometry, i.e., ∥Ax∥_2 = ∥x∥_2 for all x ∈ R^n.
(d) s_n(A) = s_1(A) = 1.
a Recall that P is a projection if P 2 = P , and P is called orthogonal if the image and kernel of P are orthogonal
subspaces.
Answer. It's easy to see that (a), (c), and (d) are all equivalent. Indeed, for (a) and (c), we
want ∥Ax∥_2^2 = (Ax)^⊤(Ax) = x^⊤A^⊤Ax = x^⊤x = ∥x∥_2^2, and the equivalence lies in the equality
x^⊤A^⊤Ax = x^⊤x. If ∥Ax∥_2 = ∥x∥_2 holds for all x, since A^⊤A is a symmetric matrix, we know that
this means A^⊤A = I_n. On the other hand, if A^⊤A = I_n, then we clearly have the equality. For (c)
and (d), noting Equation (4.5) suffices. Now, we focus on proving the equivalence between (a)
and (b).
as matrix multiplication can only reduce the rank, hence rank(A) = n. This also implies
rank(A^⊤) = n, hence we're left to check whether Im A^⊤ ∩ ker A = {0}. If this is true, then
rank(AA^⊤) = n as well, and we're done. But it's well-known that Im A^⊤ = (ker A)^⊥, which
completes the proof.
Now, we use the fact that rank(P ) = rank(AA⊤ ) = n. From the previous argument, we know
that rank(A) = rank(A⊤ ) = n, and hence
(A⊤ A − In )⊤ A⊤ = 0 ⇒ (A⊤ A − In )⊤ = 0
⊛
a Note that such a characterization is standard.
Problem (Exercise 4.1.6). Prove the following converse to Lemma 4.1.5: if (4.7) holds, then
∥A⊤ A − In ∥ ≤ 3 max(δ, δ 2 ).
∥A^⊤A − I_n∥ ≤ max_{x∈S^{n−1}} |x^⊤(A^⊤A − I_n)x| = max_{x∈S^{n−1}} |∥Ax∥_2^2 − 1|.
Problem (Exercise 4.1.8). Canonical examples of isometries and projections can be constructed from
a fixed unitary matrix U . Check that any sub-matrix of U obtained by selecting a subset of columns
is an isometry, and any sub-matrix obtained by selecting a subset of rows is a projection.
Answer. Consider a tall sub-matrix An×k of Un×n for some k < n. We know that A is an isometry
if and only if A⊤ is a projection. From Remark 4.1.7, it suffices to check A⊤ A = Ik . But this
is trivial since U is unitary, and we’re basically computing pair-wise inner products between some
columns (selected in A) of U .
On the other hand, consider a fat sub-matrix B_{k×n} of U_{n×n} for some k < n. We want to show
that B^⊤B is an orthogonal projection (of dimension k). From Exercise 4.1.4, it's equivalent to
showing that B^⊤ is an isometry, and from the above, it reduces to showing that U^⊤ is also unitary, since
B^⊤ can be viewed as a tall sub-matrix of U^⊤. But this is true by definition. ⊛
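The claims about sub-matrices of a unitary matrix are easy to check numerically (an illustrative sketch, not part of the original solution; the orthogonal U comes from a QR factorization of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 3
# Random orthogonal (real unitary) matrix via QR
U, _ = np.linalg.qr(rng.standard_normal((n, n)))

A = U[:, :k]   # column sub-matrix: an isometry
B = U[:k, :]   # row sub-matrix

assert np.allclose(A.T @ A, np.eye(k))   # A^T A = I_k
P = B.T @ B
assert np.allclose(P @ P, P)             # P is a projection
assert np.allclose(P, P.T)               # and it is orthogonal (symmetric)
```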
(b) Show by example that the previous statement may be false for a general metric space T .
Answer. (a) Consider any ϵ-separated subset of K. Then, B(xi , ϵ/2)’s are disjoint since if not,
then there exists y ∈ B(xi , ϵ/2) ∩ B(xj , ϵ/2) such that
ϵ < d(x_i, x_j) ≤ d(x_i, y) + d(x_j, y) ≤ ϵ/2 + ϵ/2 = ϵ,
a contradiction. On the other hand, if d(xi , xj ) ≤ ϵ then
(x_i + x_j)/2 ∈ B(x_i, ϵ/2) ∩ B(x_j, ϵ/2),
hence, there is a one-to-one correspondence between ϵ-separated subset of K and families of
closed disjoint balls with centers in K and radii ϵ/2, proving the result.
(b) Let T = Z and d(x, y) = 1_{x≠y}. For K = {0, 1} and ϵ = 1, we have P(K, d, 1) = 1. On the
other hand, B(0, 1/2) = {0} and B(1, 1/2) = {1} are disjoint, so if the result of (a) held, we would
have P(K, d, 1) = 2, as there are exactly two such disjoint closed balls. ⊛
Problem (Exercise 4.2.9). In our definition of the covering numbers of K, we required that the
centers x_i of the balls B(x_i, ϵ) that form a covering lie in K. Relaxing this condition, define the
exterior covering number N^ext(K, d, ϵ) similarly but without requiring that x_i ∈ K. Prove that

N^ext(K, d, ϵ) ≤ N(K, d, ϵ) ≤ N^ext(K, d, ϵ/2).
Answer. The lower bound is trivial. We focus on the upper bound. Consider an exterior cover
{B(x_i, ϵ/2)} of K where the x_i might not lie in K. Now, for every i, choose exactly one y_i from
B(x_i, ϵ/2) ∩ K if it's non-empty. Then, {B(y_i, ϵ)} covers K since B(x_i, ϵ/2) ∩ K ⊆ B(y_i, ϵ):
for any x ∈ B(x_i, ϵ/2), d(x, y_i) ≤ d(x, x_i) + d(x_i, y_i) ≤ ϵ/2 + ϵ/2 = ϵ. Hence, by taking the
union over i, {B(y_i, ϵ)} indeed covers K, so the upper bound is proved. ⊛
Answer. The problem lies in the fact that we’re not allowing exterior covering. Consider K = [−1, 1]
and L = {−1, 1}. Then, N (L, d, 1) = 2 > 1 = N (K, d, 1) for d(x, y) = |x − y|.
The approximate version of monotonicity can be proved with a similar argument as Exercise
4.2.9: specifically, consider an ϵ/2-covering {xi } of K with size exactly N (K, d, ϵ/2). Now, for every
i, choose one yi ∈ B(xi , ϵ/2) ∩ L if the latter is non-empty. It turns out that {B(yi , ϵ)} covers L.
Indeed, B(x_i, ϵ/2) ∩ L ⊆ B(y_i, ϵ) since

d(x, y_i) ≤ d(x, x_i) + d(x_i, y_i) ≤ ϵ/2 + ϵ/2 = ϵ

for all x ∈ B(x_i, ϵ/2). ⊛
Intuition. The fundamental idea is that every such B(y_i, ϵ) covers B(x_i, ϵ/2).
Problem (Exercise 4.2.16). Let K = {0, 1}n . Prove that for every integer m ∈ [0, n], we have
2^n / Σ_{k=0}^{m} \binom{n}{k} ≤ N(K, d_H, m) ≤ P(K, d_H, m) ≤ 2^n / Σ_{k=0}^{⌊m/2⌋} \binom{n}{k}.
Answer. The middle inequality follows from Lemma 4.2.8. Now, for K = {0, 1}^n, we first note that
we have |K| = 2^n. Furthermore, observe the following.

• Upper bound: observe that |K| ≥ P(K, d_H, m) · |{y ∈ K : d_H(x_i, y) ≤ ⌊m/2⌋}|, where {x_i} is an
m-packing of size P(K, d_H, m).

Plugging in the above calculation completes the proof of both bounds. ⊛
Remark. Unlike Proposition 4.2.12, we don't have the issue of "going outside K" since we're working
with a Hamming cube, i.e., the entire universe is exactly the collection of n-bit strings. Moreover,
for the upper bound, we use ⌊m/2⌋ since m ∈ N, and taking the floor makes sure that the sets {y ∈
K : d_H(x_i, y) ≤ ⌊m/2⌋} are disjoint for {x_i} being m-separated. Hence, the total cardinality is
upper bounded by |K|.
(b) Deduce a converse to Theorem 4.3.5. Conclude that for any error correcting code that encodes
k-bit strings into n-bit strings and can correct r errors, the rate must be
R ≤ 1 − f (δ)
Answer. Omit. ⊛
Answer. The lower bound is again trivial. On the other hand, for any x ∈ Rn , consider an x0 ∈ N
such that ∥x0 −x/∥x∥2 ∥2 ≤ ϵ (normalization is necessary since N is an ϵ-net of S n−1 , while x ∈ Rn ).
Now, observe that from the Cauchy-Schwarz inequality, we have

∥x∥_2 − ⟨x, x_0⟩ = ⟨x, x/∥x∥_2 − x_0⟩ ≤ ∥x∥_2 ∥x/∥x∥_2 − x_0∥_2 ≤ ϵ∥x∥_2,
Answer. (a) The lower bound is again trivial. On the other hand, denote x∗ ∈ S n−1 and y ∗ ∈
S m−1 such that ∥A∥ = ⟨Ax∗ , y ∗ ⟩. Pick x0 ∈ N and y0 ∈ M such that ∥x∗ −x0 ∥2 , ∥y ∗ −y0 ∥2 ≤
ϵ. We then have
⟨Ax^*, y^*⟩ − ⟨Ax_0, y_0⟩ = ⟨A(x^* − x_0), y^*⟩ + ⟨Ax_0, y^* − y_0⟩
  ≤ ∥A∥(∥x^* − x_0∥_2 ∥y^*∥_2 + ∥x_0∥_2 ∥y^* − y_0∥_2) ≤ 2ϵ∥A∥.
(b) Follow the same argument as (a), with y^* := x^* and y_0 := x_0. To handle the
absolute value explicitly, we see that
Problem (Exercise 4.4.4). Let A be an m × n matrix, µ ∈ R and ϵ ∈ [0, 1/2). Show that for any
ϵ-net N of the sphere S n−1 , we have
sup_{x∈S^{n−1}} |∥Ax∥_2 − µ| ≤ C/(1 − 2ϵ) · sup_{x∈N} |∥Ax∥_2 − µ|.
Answer. Consider first µ = 1. Firstly,

∥Ax∥_2^2 − 1 = ⟨Rx, x⟩

for the symmetric R = A^⊤A − I_n. Secondly, there exists x^* such that ∥R∥ = |⟨Rx^*, x^*⟩|; consider
x_0 ∈ N such that ∥x_0 − x^*∥ ≤ ϵ. Now, from the numerical inequality |z − 1| ≤ |z^2 − 1| for z > 0, we
have

sup_{x∈S^{n−1}} |∥Ax∥_2 − 1| ≤ sup_{x∈S^{n−1}} |∥Ax∥_2^2 − 1| = ∥R∥ ≤ 1/(1 − 2ϵ) · sup_{x∈N} |⟨Rx, x⟩| = 1/(1 − 2ϵ) · sup_{x∈N} |∥Ax∥_2^2 − 1|,

where the last inequality follows from Exercise 4.4.3. Further, factoring |∥Ax∥_2^2 − 1| gives

sup_{x∈S^{n−1}} |∥Ax∥_2 − 1| ≤ 1/(1 − 2ϵ) · sup_{x∈N} |∥Ax∥_2 − 1| (∥Ax∥_2 + 1),
where the maximum is attained at some x′ ∈ S^{n−1}. With the existence of x″ ∈ N ∩ {x : ∥x − x′∥_2 ≤
ϵ}, the supremum over N can be lower bounded accordingly, provided that

C := 3 ≥ sup_{d>1−2ϵ} (1 − 2ϵ)/(1 − ϵ) · (1 + ϵ/d) ≥ (1 − 2ϵ)/(1 − ϵ) · (sup_{x∈N} |∥Ax∥_2 − 1| + ϵ)/(sup_{x∈N} |∥Ax∥_2 − 1|),

which is true since the middle supremum is just 1. The case that µ ≠ 1 can be easily generalized
by considering R = A^⊤A − µI_n. ⊛
Problem (Exercise 4.4.7). Suppose that in Theorem 4.4.5 the entries Aij have unit variances. Prove
that for sufficiently large n and m one has
E[∥A∥] ≥ (1/4)(√m + √n).
On the other hand, by picking x = (A_{11}, …, A_{1n})/∥(A_{1j})_{1≤j≤n}∥_2 ∈ S^{n−1} and
y = e_1 ∈ S^{m−1}, we have

∥A∥ = sup_{x∈S^{n−1}, y∈S^{m−1}} ⟨Ax, y⟩ ≥ Σ_{j=1}^n A_{1j} · A_{1j}/∥(A_{1j})_{1≤j≤n}∥_2 = ∥(A_{1j})_{1≤j≤n}∥_2.
Hence, ∥A∥ is lower bounded by the norms of the first row and the first column, i.e.,

Remark. An easier way to deduce the second bound (i.e., the lower bound by the norm of the first row) is
to note that ∥A^⊤∥ = ∥A∥ by some elementary (functional) analysis.
λ_1 = (p + q)/2 · n,  u_1 = (1_{n/2×1}; 1_{n/2×1}),  λ_2 = (p − q)/2 · n,  u_2 = (1_{n/2×1}; −1_{n/2×1}).
Answer. Let n be an even number. Firstly, for this D ∈ R^{n×n}, columns 1 to n/2 are identical, and the same
holds for columns n/2 + 1 to n. Furthermore, since p > q, columns 1 and n/2 + 1 are linearly independent,
so rank(D) = 2.

Instead of solving the characteristic equation for the eigenvalues and then finding the corresponding
eigenvectors, since we know that rank(D) = 2, it's immediate that there are only 2 non-zero eigenvalues,
λ_1 = (p + q)/2 · n,  u_1 = 1_{n×1},  λ_2 = (p − q)/2 · n,  u_2 = (1_{1×n/2}, −1_{1×n/2})^⊤.
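A quick numerical check of the two non-zero eigenpairs (an illustrative sketch, not part of the original solution; the values n = 8, p = 0.8, q = 0.2 are arbitrary choices):

```python
import numpy as np

n, p, q = 8, 0.8, 0.2
# Expected adjacency matrix of the two-community block model:
# entries p within a community, q across communities
D = np.full((n, n), q)
D[: n // 2, : n // 2] = p
D[n // 2 :, n // 2 :] = p

eig = np.sort(np.linalg.eigvalsh(D))[::-1]
assert np.isclose(eig[0], (p + q) / 2 * n)   # lambda_1
assert np.isclose(eig[1], (p - q) / 2 * n)   # lambda_2
assert np.allclose(eig[2:], 0)               # rank(D) = 2
```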
Problem (Exercise 4.5.4). Deduce Weyl’s inequality from the Courant-Fisher’s min-max character-
ization of eigenvalues.
λ_i(A) = −λ_{n−i+1}(−A) = − max_{dim E=n−i+1} min_{x∈S(E)} ⟨−Ax, x⟩ = min_{dim E=n−i+1} max_{x∈S(E)} ⟨Ax, x⟩.
i.e., there exists some EA with dim EA = n − i + 1 such that λi (A) = maxx∈S(EA ) ⟨Ax, x⟩.
Similarly, there exists some EB with dim EB = n − j + 1 satisfying the same property. Hence,
it suffices to find some unit vector x in EA ∩ EB ∩ E. We see that
which implies that EA ∩ EB will have a non-trivial intersection with E since dim E = i + j − 1,
hence we're done. For the upper bound, taking the negative gives the result. ■
To obtain the spectral stability, we see that from Weyl’s inequality, we have
{ λ_i(A + B) ≤ λ_i(A) + λ_1(B);
  λ_i(A + B) ≥ λ_i(A) + λ_n(B); }
⇒ λ_n(B) ≤ λ_i(A + B) − λ_i(A) ≤ λ_1(B).

Setting S = A + B and T = A, this gives

λ_i(S) − λ_i(T) ≤ λ_1(S − T) ≤ ∥S − T∥,

and symmetrically,

λ_i(T) − λ_i(S) ≤ λ_1(T − S) ≤ ∥T − S∥ = ∥S − T∥,

hence

max_i |λ_i(S) − λ_i(T)| ≤ ∥S − T∥

as desired. ⊛
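The resulting eigenvalue stability max_i |λ_i(S) − λ_i(T)| ≤ ∥S − T∥ can be sanity-checked numerically (an illustrative sketch, not part of the original solution; the symmetric matrices are arbitrary random choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
sym = lambda M: (M + M.T) / 2

S = sym(rng.standard_normal((n, n)))
T = sym(rng.standard_normal((n, n)))
lam = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]  # lambda_1 >= ... >= lambda_n

# Weyl's perturbation bound: matched eigenvalues differ by at most ||S - T||
assert np.max(np.abs(lam(S) - lam(T))) <= np.linalg.norm(S - T, 2) + 1e-10
```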
Answer. We have that for any t ≥ 0, with probability at least 1 − 2 exp(−t^2),

∥(1/m) A^⊤A − I_n∥ ≤ K^2 max(δ, δ^2),  where δ = C(√(n/m) + t/√m).

Plugging back v = u + K^2 (C√(n/m) + C^2 n/m) and integrating the tail,

E∥(1/m) A^⊤A − I_n∥ ≤ K^2 (C√(n/m) + C^2 n/m) + ∫_0^∞ 2e^{−t^2} K^2 (C/√m + 2C^2 √n/m + 2C^2 t/m) dt
  = K^2 (C√(n/m) + C^2 n/m) + K^2 (√π (C/√m + 2C^2 √n/m) + 2C^2/m),

which is asymptotically ≍ K^2 (√(n/m) + n/m). ⊛
Problem (Exercise 4.6.3). Deduce from Theorem 4.6.1 the following bounds on the expectation:
√m − CK^2 √n ≤ E[s_n(A)] ≤ E[s_1(A)] ≤ √m + CK^2 √n.

Answer. Consider

ξ := (1/(CK^2)) max(0, (√m − CK^2 √n) − s_n(A), s_1(A) − (√m + CK^2 √n)) ≥ 0;

then from the integral identity,

E[ξ] = ∫_0^∞ P(ξ > t) dt ≤ ∫_0^∞ 2e^{−t^2} dt = √π,
Problem (Exercise 4.6.4). Give a simpler proof of Theorem 4.6.1, using Theorem 3.1.1 to obtain a
concentration bound for ∥Ax∥2 and Exercise 4.4.4 to reduce to a union bound over a net.
Answer. From the proof of Theorem 4.6.1, we know that S n−1 admits a 1/4-net N such that
|N | ≤ 9n . Furthermore, for any x ∈ N , we have
• E[⟨Ai , x⟩] = ⟨E[Ai ], x⟩ = ⟨0, x⟩ = 0;
Answer. Omit. ⊛
Problem (Exercise 4.7.6). Prove Theorem 4.7.5 for the spectral clustering algorithm applied for the
Gaussian mixture model. Proceed as follows.
(a) Compute the covariance matrix Σ of X; note that the eigenvector corresponding to the largest
eigenvalue is parallel to µ.
(b) Use results about covariance estimation to show that the sample covariance matrix Σm is close
to Σ, if the sample size m is relatively large.
(c) Use the Davis-Kahan Theorem 4.5.5 to deduce that the first eigenvector v = v1 (Σm ) is close
to the direction of µ.
(d) Conclude that the signs of ⟨µ, Xi ⟩ predict well which community Xi belongs to.
Answer. Omit. ⊛
Answer. Omit. ⊛
f (x) = ⟨x, θ⟩
Answer. Omit. ⊛
Problem (Exercise 5.1.8). Prove inclusion (5.2), i.e., H_t ⊇ {x ∈ √n S^{n−1} : x_1 ≤ t/√2}.
Answer. Omit. ⊛
Week 17: Concentration of Lipschitz Functions on Spheres
Problem (Exercise 5.1.9). Let A be the subset of the sphere √n S^{n−1} such that
Problem (Exercise 5.1.11). We proved Theorem 5.1.4 for functions f that are Lipschitz with respect
to the Euclidean metric ∥x − y∥2 on the sphere. Argue that the same result holds for the geodesic
metric, which is the length of the shortest arc connecting x and y.
Answer. Omit. ⊛
Problem (Exercise 5.1.12). We stated Theorem 5.1.4 for the scaled sphere √n S^{n−1}. Deduce that a
Lipschitz function f on the unit sphere S^{n−1} satisfies

∥f(X) − E[f(X)]∥_{ψ_2} ≤ C∥f∥_Lip / √n,
Answer. Omit. ⊛
Problem (Exercise 5.1.13). Consider a random variable Z with median M . Show that
Answer. Omit. ⊛
Problem (Exercise 5.1.14). Consider a random vector X taking values in some metric space (T, d).
Assume that there exists K > 0 such that
for every Lipschitz function f : T → R. For a subset A ⊆ T , define σ(A) := P(X ∈ A). (Then σ is
a probability measure on T .) Show that if σ(A) ≥ 1/2 then, for every t ≥ 0,
Problem (Exercise 5.1.15). From linear algebra, we know that any set of orthonormal vectors in R^n
can contain at most n vectors. However, if we allow the vectors to be almost orthogonal, there
can be exponentially many of them! Prove this counterintuitive fact as follows. Fix ϵ ∈ (0, 1). Show
that there exists a set {x_1, …, x_N} of unit vectors in R^n which are mutually almost orthogonal, and whose cardinality satisfies
N ≥ exp(c(ϵ)n).
Answer. Omit. ⊛
Answer. Omit. ⊛
Problem (Exercise 5.2.4). Prove that in the concentration results for sphere and Gauss space (The-
orem 5.1.4 and 5.2.2), the expectation E[f (X)] can be replaced by the Lp norm (E[f (X)p ])1/p for
any p ≥ 1 and for any non-negative function f . The constants may depend on p.
Answer. Omit. ⊛
Problem (Exercise 5.2.11). Let Φ(x) denote the cumulative distribution function of the standard
normal distribution N (0, 1). Consider a random vector Z = (Z1 , . . . , Zn ) ∼ N (0, In ). Check that
Answer. Omit. ⊛
Problem (Exercise 5.2.12). Expressing X = ϕ(Z) by the previous exercise, use Gaussian concentra-
tion to control the deviation of f (ϕ(Z)) in terms of ∥f ◦ ϕ∥Lip ≤ ∥f ∥Lip ∥ϕ∥Lip . Show that ∥ϕ∥Lip is
bounded by an absolute constant and complete the proof of Theorem 5.2.10.
Answer. Omit. ⊛
Answer. Omit. ⊛
Answer. Omit. ⊛
Problem (Exercise 5.3.4). Give an example of a set X of N points for which no scaled projection
Answer. Omit. ⊛
f (x) = a0 + a1 x + · · · + ap xp .
f (X) = a0 I + a1 X + · · · + ap X p .
On the right-hand side, we use the standard rules for matrix addition and multiplication, so in
particular, X^p = X ⋯ X (p times) there.
(b) Consider a convergent power series expansion of f about x0 :
f(x) = Σ_{k=1}^∞ a_k (x − x_0)^k.
for arbitrary, not necessarily commuting, matrices. First, show that X ⪯ Y always implies
tr f(X) ≤ tr f(Y) for any increasing function f : R → R.
(f) Show that 0 ⪯ X ⪯ Y implies Y^{−1} ⪯ X^{−1} if X is invertible.
(g) Show that 0 ⪯ X ⪯ Y implies log X ⪯ log Y .
(b) Since |λ| ≤ K, we have g(λ) − f(λ) ≥ 0. This implies that g(X) − f(X) = U(g(Λ) − f(Λ))U^⊤
has non-negative eigenvalues. Therefore, g(X) ⪰ f(X).
(c) Since X and Y are symmetric and commute, then Y admits an eigendecomposition with V =
U . This implies λ ≤ µ. It follows that f (µ) − f (λ) ≥ 0, so f (Y ) − f (X) = U (f (M ) − f (Λ))U ⊤
has non-negative eigenvalues. Therefore, f (X) ⪯ f (Y ).
(d) We see that

λ([4 2; 2 4] − [3 0; 0 0]) = {5, 0},

while

λ([4 2; 2 4]^3 − [3 0; 0 0]^3) = {(√43993 + 197)/2, −(√43993 − 197)/2}.
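This counterexample is easy to confirm numerically (an illustrative sketch, not part of the original solution; X and Y are the two matrices above):

```python
import numpy as np

X = np.array([[3.0, 0.0], [0.0, 0.0]])
Y = np.array([[4.0, 2.0], [2.0, 4.0]])

# X <= Y in the PSD order: Y - X has eigenvalues {5, 0}
assert np.min(np.linalg.eigvalsh(Y - X)) >= -1e-12

# ... but X^3 <= Y^3 fails: Y^3 - X^3 has a negative eigenvalue
diff = np.linalg.matrix_power(Y, 3) - np.linalg.matrix_power(X, 3)
assert np.min(np.linalg.eigvalsh(diff)) < 0
```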
(f) Since X ⪯ Y, then I = X^{−1/2}XX^{−1/2} ⪯ X^{−1/2}YX^{−1/2}. This implies λ(X^{−1/2}YX^{−1/2}) ≥ 1.
Thus, λ(X^{1/2}Y^{−1}X^{1/2}) = λ^{−1}(X^{−1/2}YX^{−1/2}) ≤ 1, so X^{1/2}Y^{−1}X^{1/2} ⪯ I. It follows that
Y^{−1} = X^{−1/2}(X^{1/2}Y^{−1}X^{1/2})X^{−1/2} ⪯ X^{−1/2} I X^{−1/2} = X^{−1}.

(g) By (f), (X + tI)^{−1} ⪰ (Y + tI)^{−1} for t ≥ 0. Since log z = ∫_0^∞ (1/(1+t) − 1/(z+t)) dt, then

log X = ∫_0^∞ ((1 + t)^{−1} I − (X + tI)^{−1}) dt ⪯ ∫_0^∞ ((1 + t)^{−1} I − (Y + tI)^{−1}) dt = log Y.
eX+Y = eX eY .
eX+Y ̸= eX eY .
Answer. (a) Since X and Y commute, by the binomial theorem and the substitution i := k − j,

e^{X+Y} = Σ_{k=0}^∞ (X + Y)^k/k! = Σ_{k=0}^∞ (1/k!) Σ_{j=0}^k (k!/((k−j)! j!)) X^{k−j} Y^j = (Σ_{i=0}^∞ X^i/i!)(Σ_{j=0}^∞ Y^j/j!) = e^X e^Y.
(b) For X := [1 0; 0 −1] and Y := [0 1; 1 0],

e^{X+Y} = [cosh √2 + (sinh √2)/√2, (sinh √2)/√2; (sinh √2)/√2, cosh √2 − (sinh √2)/√2],
e^X e^Y = (1/2) [e^2 + 1, e^2 − 1; 1 − e^{−2}, 1 + e^{−2}].
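Both parts can be confirmed numerically (an illustrative sketch, not part of the original solution; the matrix exponential is computed by its power series rather than any particular library routine):

```python
import numpy as np

def expm(M, terms=60):
    # Matrix exponential via its (rapidly converging) power series
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

X = np.array([[1.0, 0.0], [0.0, -1.0]])
Y = np.array([[0.0, 1.0], [1.0, 0.0]])

# e^X e^Y matches the closed form above, but e^{X+Y} differs from it
assert np.allclose(expm(X) @ expm(Y),
                   0.5 * np.array([[np.e**2 + 1, np.e**2 - 1],
                                   [1 - np.e**-2, 1 + np.e**-2]]))
assert not np.allclose(expm(X + Y), expm(X) @ expm(Y))
```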
Answer. Let σ^2 := ∥Σ_{i=1}^N E[X_i^2]∥. By the matrix Bernstein inequality, for every u > 0, with the
substitution t := c^{−1/2} σ √(u + log n) + c^{−1} K(u + log n),

P(∥Σ_{i=1}^N X_i∥ ≥ t) ≤ 2n exp(−c min(t^2/σ^2, t/K)) ≤ 2n e^{−(u + log n)} = 2e^{−u}.
Problem (Exercise 5.4.12). Let ε1 , . . . , εn be independent symmetric Bernoulli random variables and
let A1 , . . . , AN be symmetric n × n matrices (deterministic). Prove that, for any t ≥ 0, we have
P(∥Σ_{i=1}^N ε_i A_i∥ ≥ t) ≤ 2n exp(−t^2/(2σ^2)),

where σ^2 = ∥Σ_{i=1}^N A_i^2∥.
Answer. Let σ^2 := ∥Σ_{i=1}^N A_i^2∥ and λ := t/σ^2 ≥ 0. By Exercise 2.2.3,

tr e^{Σ_{i=1}^N log E[e^{λ ε_i A_i}]} = tr e^{Σ_{i=1}^N log cosh(λ A_i)} ≤ tr e^{(λ^2/2) Σ_{i=1}^N A_i^2} ≤ n e^{(λ^2/2) λ_max(Σ_{i=1}^N A_i^2)} = n e^{λ^2 σ^2/2}
Answer. Since (a) follows from (b) with p = 1, we will only prove (b) here. As the inequality
trivially holds for n = 1 with C = 1, let's assume n ≥ 2 from now on.

Note that if 1 ≤ p ≤ 2, then by Stirling's approximation Γ(z) ≤ √(2π/z) (z/e)^z e^{1/(12z)},

P(∥Σ_{i=1}^N ε_i A_i∥^p ≥ t) = P(∥Σ_{i=1}^N ε_i A_i∥ ≥ t^{1/p}) ≤ 2n e^{−t^{2/p}/(2σ^2)}.
Then with the substitution t =: (σ√(2(log(2n) + s)))^p, by Lemma 1.2.1 and Minkowski's inequality,

E[∥Σ_{i=1}^N ε_i A_i∥^p]^{1/p} = ((∫_0^{(σ√(2 log(2n)))^p} + ∫_{(σ√(2 log(2n)))^p}^∞) P(∥Σ_{i=1}^N ε_i A_i∥^p ≥ t) dt)^{1/p}
  ≤ (∫_0^{(σ√(2 log(2n)))^p} 1 dt + ∫_{(σ√(2 log(2n)))^p}^∞ 2n e^{−t^{2/p}/(2σ^2)} dt)^{1/p}
  = ((σ√(2 log(2n)))^p + ∫_0^∞ e^{−s} (√2 σ)^p (p/2) (log(2n) + s)^{p/2−1} ds)^{1/p}
  = √2 σ ((log(2n))^{p/2} + (p/2) ∫_0^∞ e^{−s} (log(2n) + s)^{p/2−1} ds)^{1/p}
  ≤ √2 σ (√(log(2n)) + ((p/2) ∫_0^∞ e^{−s} (log(2n) + s)^{p/2−1} ds)^{1/p})
E[∥S∥] ≍ log n / log log n.
Deduce that the bound in Exercise 5.4.11 would fail if the logarithmic factors were removed
from it.
Answer. (a) We see that X = e_k e_k^⊤ with k chosen uniformly at random from [n], where
e_k e_k^⊤ is the matrix with all 0's except the k-th diagonal element being 1. Hence, by interpreting
each X_i as "throwing a ball into n bins," S_kk records the number of balls in the k-th bin when
N balls are thrown into n bins independently.
(b) We first observe that since S is diagonal, ∥S∥ = λ_1(S) = max_k S_kk, as all the diagonal elements
are eigenvalues of S. Let us first answer the question of how this relates to the coupon collector's
problem. Firstly, let's introduce the problem formally:
Problem 5.4.1 (Coupon collector’s problem). Say we have n different types of coupons
to collect, and we buy N boxes, where each box contains a (uniformly) random type of
coupon. The classical coupon collector’s problem asks for the expected number of boxes
(i.e., N ) we need in order to collect all coupons.
Intuition. From (a), we can view Skk as the number of coupons we have collected for the
k th type of the coupon, where N is the number of boxes we have bought.
Hence, the coupon collector’s problem asks for the expected N we need for λn (S) = mink Skk >
0, while (b) is asking for the expected number of the most frequent coupons (i.e., maxk Skk )
we will see when buying only N ≍ n boxes.
Next, let's prove the upper bound and the lower bound separately. Let 0 < c < C be some
constants satisfying N ≤ Cn and n ≤ cN.
For the upper bound, with L := (C + 1) log n / log log n,

E[∥S∥] = (Σ_{m=1}^{L−1} + Σ_{m=L}^∞) P(∥S∥ ≥ m)
  ≤ Σ_{m=1}^{L−1} 1 + Σ_{m=L}^∞ 3^{N/n + 1 − m log log n / log n}
  = L − 1 + 3^{N/n + 1 − L log log n / log n} / (1 − 3^{−log log n / log n})
  ≤ (C + 1) log n / log log n + 3^{C+1−(C+1)} · (3/2) · log n / log log n = (C + 5/2) log n / log log n,
The hard part lies in the lower bound. We will need the following fact.
Lemma 5.4.1 (Maximum of Poisson [Kim83; BSP09]). Given Y_1, …, Y_n ~ i.i.d. Pois(1),

E[max_{1≤k≤n} Y_k] ≍ log n / log log n.
Proof. Let M_T := E[max_k Y_k | Σ_{k=1}^n Y_k = T] with Y_1, …, Y_n ~ i.i.d. Pois(1). As
(Y_1, …, Y_n) | Σ_{k=1}^n Y_k = T ~ Multinomial(T; 1/n, …, 1/n), we know that M_T is non-
decreasing w.r.t. T. Moreover, as Σ_{k=1}^n Y_k ~ Pois(n), by the law of total expectation
and the maximum of Poisson lemma,

log n / log log n ≍ E[max_{1≤k≤n} Y_k] = (Σ_{T=0}^{⌊ne^{2+1/2e}⌋} + Σ_{T=⌊ne^{2+1/2e}⌋+1}^∞) M_T e^{−n} n^T / T!
  ≤ M_{⌊ne^{2+1/2e}⌋} Σ_{T=0}^{⌊ne^{2+1/2e}⌋} e^{−n} n^T / T! + Σ_{T=⌊ne^{2+1/2e}⌋+1}^∞ M_T e^{−n} n^T / T!
  ≤ M_{⌊ne^{2+1/2e}⌋} · 1 + Σ_{T=⌊ne^{2+1/2e}⌋+1}^∞ e^{−n} n^T / Γ(T).

From Stirling's approximation, Γ(z) ≥ √(2π) z^{z−1/2} e^{−z} for z > 0, so the above is

  ≤ M_{⌊ne^{2+1/2e}⌋} + Σ_{T=⌊ne^{2+1/2e}⌋+1}^∞ e^{−n} n^T / (√(2π) T^{T−1/2} e^{−T})
  = M_{⌊ne^{2+1/2e}⌋} + (e^{−n}/√(2π)) Σ_{T=⌊ne^{2+1/2e}⌋+1}^∞ (n e T^{1/(2T)} / T)^T,

leading to

M_{⌊ne^{2+1/2e}⌋} ≳ log n / log log n

as the trailing term is decreasing exponentially fast. Finally, we have

M_{⌊ne^{2+1/2e}⌋} ≤ M_{⌈⌊ne^{2+1/2e}⌋/N⌉ · N} ≤ ⌈ne^{2+1/2e}/N⌉ M_N ≤ ⌈ce^{2+1/2e}⌉ M_N,

where the second inequality follows from the triangle inequality of max. This leads to

E[∥S∥] = M_N ≥ (1/⌈ce^{2+1/2e}⌉) M_{⌊ne^{2+1/2e}⌋} ≳ log n / log log n

as desired. ⊛
Finally, that the bound in Exercise 5.4.11 fails if the logarithmic factors are removed becomes
obvious after a direct substitution. Indeed, since ∥X_i∥ = 1 =: K, Exercise 5.4.11, with the
logarithmic factors removed along with K = 1, would state that

E[∥Σ_{i=1}^N X_i∥] ≲ ∥Σ_{i=1}^N E[X_i^2]∥^{1/2}.

Now, using this bound for S := Σ_{i=1}^N X_i, with the observation that X_i^2 = X_i and E[X_i^2] = E[X_i] =
diag(1/n, …, 1/n), we see that the bound becomes √(N/n) = Θ(1), while the left-hand side
grows as log n / log log n → ∞, which is clearly not valid.
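The growth rate log n / log log n of the maximum bin load is easy to observe by simulation (an illustrative sketch, not part of the original solution; the tolerances 0.5 and 5 in the final check are loose, hypothetical constants):

```python
import numpy as np

rng = np.random.default_rng(5)
for n in (10**3, 10**4, 10**5):
    N = n  # throw N = n balls into n bins
    loads = np.bincount(rng.integers(0, n, size=N), minlength=n)
    max_load = loads.max()
    prediction = np.log(n) / np.log(np.log(n))
    # the maximum load tracks log n / log log n up to a modest constant
    assert 0.5 <= max_load / prediction <= 5
```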
Remark (Alternative examples). We give another example to demonstrate the sharpness of the matrix
Bernstein’s inequality. Consider the following random n × n matrix (slightly different from S)
T := Σ_{i=1}^N Σ_{k=1}^n b_{ik}^{(N)} e_k e_k^⊤,

where b_{ik}^{(N)} ~ i.i.d. Ber(1/N). Here, we view X_i := Σ_{k=1}^n b_{ik}^{(N)} e_k e_k^⊤.
Intuition. In expectation, T and S should behave the same. However, this is easier to work
with from independence.
Proof. For every k ∈ [n], we apply the Poisson limit theorem: as N → ∞, p_{N,ik} = 1/N → 0
while E[S_N^k] = E[Σ_{i=1}^N b_{ik}^{(N)}] = 1 =: λ, so S_N^k := Σ_{i=1}^N b_{ik}^{(N)} → Pois(1) in distribution.

With a similar interpretation as in (a), we can interpret S_N^k = Σ_{i=1}^N b_{ik}^{(N)} as the value
of the k-th diagonal element of T, i.e., T_kk. Hence, as N → ∞, for every k, T_kk → Y_k in distribution, where
Y_k ~ i.i.d. Pois(1). Since the T_kk's are independent, T → diag(Y_1, …, Y_n) in distribution, and therefore

E[λ_1(T)] = E[max_{1≤k≤n} T_kk] → E[max_{1≤k≤n} Y_k] ≍ log n / log log n
where

σ^2 = max(∥Σ_{i=1}^N E[X_i^⊤ X_i]∥, ∥Σ_{i=1}^N E[X_i X_i^⊤]∥).

Answer. Consider the Hermitian dilation

X_i′ := [0_{n×n}, X_i^⊤; X_i, 0_{m×m}].
To apply the matrix Bernstein inequality (Theorem 5.4.1), we need to show that ∥X_i′∥ ≤ K′ for
some K′, where we know that ∥X_i∥ ≤ K. However, it's easy to see that since ∥X_i∥ = ∥X_i^⊤∥, we
have ∥X_i′∥ ≤ K as well, since the characteristic equation for X_i′ is

det(X_i′ − λI) = det[−λI, X_i^⊤; X_i, −λI] = det(λ^2 I − X_i^⊤ X_i) = 0,

so ∥X_i′∥ = √(∥X_i^⊤ X_i∥) ≤ K.
Proof. Observe that for any matrix A ∈ R^{m×n}, as ∥A∥ = √(λ_1(AA^⊤)) = √(λ_1(A^⊤A)), we have

∥A∥ = √(λ_1([A^⊤A, 0; 0, AA^⊤])) = √(λ_1([0, A^⊤; A, 0]^2)) = ∥[0, A^⊤; A, 0]∥.
hence we have

σ^2 = max(∥Σ_{i=1}^N E[X_i^⊤ X_i]∥, ∥Σ_{i=1}^N E[X_i X_i^⊤]∥),
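The dilation identity ∥A∥ = ∥[0, A^⊤; A, 0]∥ can be checked numerically (an illustrative sketch, not part of the original solution; the rectangular matrix is an arbitrary random choice):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 7
A = rng.standard_normal((m, n))

# Hermitian dilation: a symmetric (n+m) x (n+m) matrix with A and A^T off-diagonal
D = np.block([[np.zeros((n, n)), A.T],
              [A, np.zeros((m, m))]])

assert np.allclose(D, D.T)
assert np.isclose(np.linalg.norm(D, 2), np.linalg.norm(A, 2))
```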
Answer. Omit. ⊛
Problem (Exercise 6.1.5). Prove the following alternative generalization of Theorem 6.1.1. Let
(uij )ni,j=1 be fixed vectors in some normed space. Let X1 , . . . , Xn be independent, mean zero
random variables. Show that, for every convex and increasing function F , one has
E[F(∥Σ_{i,j: i≠j} X_i X_j u_{ij}∥)] ≤ E[F(4∥Σ_{i,j} X_i X_j′ u_{ij}∥)],
Answer. Omit. ⊛
Answer. Omit. ⊛
Problem (Exercise 6.2.5). Give an alternative proof of the Hanson-Wright inequality for normal distributions, without separating the diagonal part or decoupling.
Week 19: Decoupling and Hanson-Wright Inequality
Answer. Omit. ⊛
Problem (Exercise 6.2.6). Consider a mean zero, sub-gaussian random vector X in Rn with ∥X∥ψ2 ≤
K. Let B be an m × n matrix. Show that
E[exp(λ^2 ∥BX∥_2^2)] ≤ exp(CK^2 λ^2 ∥B∥_F^2)  provided |λ| ≤ c/(K∥B∥).
To prove this bound, replace X with a Gaussian random vector g ∼ N (0, Im ) along the following
lines:
(a) Prove the comparison inequality
for every λ ∈ R.
(b) Check that
E[exp λ2 ∥B ⊤ g∥22 ] ≤ exp Cλ2 ∥B∥2F
Answer. Omit. ⊛
Problem (Exercise 6.2.7). Let X1 , . . . , Xn be independent, mean zero, sub-gaussian random vectors
in R^d. Let A = (a_{ij}) be an n × n matrix. Prove that for every t ≥ 0, we have

P(Σ_{i,j: i≠j} a_{ij} ⟨X_i, X_j⟩ ≥ t) ≤ 2 exp(−c min(t^2/(K^4 d ∥A∥_F^2), t/(K^2 ∥A∥)))
Answer. Omit. ⊛
Answer. Omit. ⊛
∥DB∥F ≤ ∥D∥∥B∥F .
Answer. Omit. ⊛
Problem (Exercise 6.3.5). Let B be an m×n matrix, and let X be a mean zero, sub-gaussian random
vector in Rn with ∥X∥ψ2 ≤ K. Prove that for any t ≥ 0, we have
P(∥BX∥_2 ≥ CK∥B∥_F + t) ≤ exp(−ct^2/(K^2 ∥B∥^2)).
Answer. Omit. ⊛
Problem (Exercise 6.3.6). Show that there exists a mean zero, isotropic, and sub-gaussian random
vector X in Rn such that
P(∥X∥_2 = 0) = P(∥X∥_2 ≥ 1.4√n) = 1/2.

In other words, ∥X∥_2 does not concentrate near √n.
Answer. Omit. ⊛
(b) If X is symmetric, show that the distributions of ξX and ξ|X| are the same as that of X.
(c) Let X ′ be an independent copy of X. Check that X − X ′ is symmetric.
Answer. (a) For any random variable X and a symmetric Bernoulli random variable ξ, we first
prove that ξX and −ξX have the same distribution, i.e., P(ξX ≥ t) = P(−ξX ≥ t) for any t ∈ R. Indeed, since

(b) We show that ξX and ξ|X| have the same distribution as X, i.e., P(ξX ≥ t) = P(ξ|X| ≥ t) = P(X ≥ t) for any t ∈ R. Again, we have

P(X ≥ t) = P(−X ≥ t) = (P(X ≥ t) + P(−X ≥ t))/2 = P(ξX ≥ t)

from the proof of (a).

(c) It suffices to show that X − X′ and X′ − X have the same distribution, but this is trivial since (X, X′) and (X′, X) have the same distribution. ⊛
Problem (Exercise 6.4.3). Where in this argument did we use the independence of the random
variables Xi ? Is mean zero assumption needed for both upper and lower bounds?
Problem (Exercise 6.4.4). (a) Prove the following generalization of Symmetrization Lemma 6.4.2
for random vectors Xi that do not necessarily have zero means:
" N N
# " N #
X X X
E Xi − E[Xi ] ≤ 2E εi Xi .
i=1 i=1 i=1
(b) Argue that there can not be any non-trivial reverse inequality.
while

E[∥ε_1 X_1∥_2] = λ∥1∥_2

can be arbitrarily large as λ → ∞.
Problem (Exercise 6.4.5). Prove the following generalization of Symmetrization Lemma 6.4.2. Let
F : R+ → R be an increasing, convex function. Show that the same inequalities in Lemma 6.4.2
hold if the norm ∥·∥ is replaced with F (∥·∥), namely
" N
!# " N
!# " N
!#
1 X X X
E F εi Xi ≤E F Xi ≤E F 2 εi Xi .
2 i=1 i=1 i=1
Problem (Exercise 6.4.6). Let X_1, …, X_N be independent, mean zero random variables. Show that
their sum Σ_i X_i is sub-gaussian if and only if Σ_i ε_i X_i is sub-gaussian, and

c∥Σ_{i=1}^N ε_i X_i∥_{ψ_2} ≤ ∥Σ_{i=1}^N X_i∥_{ψ_2} ≤ C∥Σ_{i=1}^N ε_i X_i∥_{ψ_2}.
Answer. Consider F_K(x) := exp(x^2/K^2) − 1 for some K ≥ 0, which is clearly convex and increasing. Hence, by
Exercise 6.4.5, if ∥Σ_{i=1}^n ε_i X_i∥_{ψ_2} ≤ K, then

E[F_{2K}(∥Σ_{i=1}^n X_i∥)] ≤ E[F_{2K}(2∥Σ_{i=1}^n ε_i X_i∥)] = E[F_K(∥Σ_{i=1}^n ε_i X_i∥)] ≤ 1,

implying ∥Σ_{i=1}^n X_i∥_{ψ_2} ≤ 2K. Conversely, if ∥Σ_{i=1}^n X_i∥_{ψ_2} ≤ K, then

E[F_{2K}(∥Σ_{i=1}^n ε_i X_i∥)] = E[F_K((1/2)∥Σ_{i=1}^n ε_i X_i∥)] ≤ E[F_K(∥Σ_{i=1}^n X_i∥)] ≤ 1,

thus ∥Σ_{i=1}^n ε_i X_i∥_{ψ_2} ≤ 2K. ⊛
Problem (Exercise 6.7.2). Check that the function f defined in (6.16) is convex. For reference,
f : R^N → R is defined as

f(a) := E[∥Σ_{i=1}^N a_i ε_i x_i∥].
Problem (Exercise 6.7.3). Prove the following generalization of Theorem 6.7.1. Let X_1, …, X_N be
independent, mean zero random vectors in a normed space, and let a = (a_1, …, a_N) ∈ R^N. Then

E[∥Σ_{i=1}^N a_i X_i∥] ≤ 4∥a∥_∞ · E[∥Σ_{i=1}^N X_i∥].
Answer. Let the ε_i's be independent symmetric Bernoulli random variables; then from symmetrization and
Theorem 6.7.1 with conditioning on the X_i's, we have

E[∥Σ_{i=1}^N a_i X_i∥] ≤ 2E[∥Σ_{i=1}^N a_i ε_i X_i∥] ≤ 2∥a∥_∞ · E[∥Σ_{i=1}^N ε_i X_i∥] ≤ 4∥a∥_∞ · E[∥Σ_{i=1}^N X_i∥],
Problem (Exercise 6.7.5). Show that the factor √(log N) in Lemma 6.7.4 is needed in general, and
is optimal. Thus, symmetrization with Gaussian random variables is generally weaker than symmetrization with symmetric Bernoullis.
Answer. Consider e_i being the i-th standard basis vector in R^N, and consider X_i := ε_i e_i for all i ≥ 1. We
have

E[∥Σ_{i=1}^N X_i∥_∞] = E[∥Σ_{i=1}^N ε_i e_i∥_∞] = E[∥(ε_1, …, ε_N)∥_∞] = 1,
Problem (Exercise 6.7.6). Let F : R+ → R be a convex increasing function. Generalize the sym-
metrization and contraction results of this and previous section by replacing the norm ∥·∥ with
F (∥·∥) throughout.
Answer. Omit. ⊛
Answer. (a) Replacing t with t′ in the second term on both sides gives
where the inequality comes from (a) by considering the supremum over

T^{(n)} := {(x, y) ∈ R^2 : x = Σ_{i=1}^{n−1} ε_i ϕ_i(t_i), y = t_n, (t_1, …, t_{n−1}, t_n) ∈ T}.
Explicitly, we get

E[E[sup_{t∈T} (Σ_{i=1}^{n−1} ε_i ϕ_i(t_i) + ε_n ϕ_n(t_n)) | ε_{1:n−1}]] ≤ E[E[sup_{t∈T} (Σ_{i=1}^{n−1} ε_i ϕ_i(t_i) + ε_n t_n) | ε_{1:n−1}]].
Problem (Exercise 6.7.8). Generalize Talagrand’s contraction principle for arbitrary Lipschitz func-
tions ϕi : R → R without restriction on their Lipschitz norms.
Answer. Looking into the proof of Exercise 6.7.7, we see that for general Lipschitz functions ϕ_i,

E[sup_{t∈T} Σ_{i=1}^n ε_i ϕ_i(t_i)] ≤ E[sup_{t∈T} Σ_{i=1}^n ε_i ∥ϕ_i∥_Lip t_i] ≤ max_{1≤i≤n} ∥ϕ_i∥_Lip · E[sup_{t∈T} Σ_{i=1}^n ε_i t_i],

where the last inequality follows from Theorem 6.7.1, noting that sup_{t∈T} satisfies all the conditions we need in Theorem 6.7.1. ⊛
[BSP09] K. M. Briggs, L. Song, and T. Prellberg. A note on the distribution of the maximum of a set
of Poisson random variables. 2009. arXiv: 0903.4373 [math.PR]. url: https://arxiv.org/
abs/0903.4373.
[Kim83] AC Kimber. “A note on Poisson maxima”. In: Zeitschrift für Wahrscheinlichkeitstheorie und
Verwandte Gebiete 63.4 (1983), pp. 551–552.
[Ver24] Roman Vershynin. High-Dimensional Probability. Vol. 47. Cambridge University Press, 2024.
url: https://www.math.uci.edu/~rvershyn/papers/HDP-book/HDP-book.html.
[Vla+20] Mariia Vladimirova et al. “Sub-Weibull distributions: Generalizing sub-Gaussian and sub-
Exponential properties to heavier tailed distributions”. In: Stat 9.1 (Jan. 2020). issn: 2049-
1573. doi: 10.1002/sta4.318. url: http://dx.doi.org/10.1002/sta4.318.