Rakhlin Mathstat sp22


IDS.160 – Mathematical Statistics: A Non-Asymptotic Approach


Lecturer: A. Rakhlin Lectures 1-26
Scribe: A. Rakhlin Spring 2022

1. INTRODUCTION

Suppose we would like to estimate the average height µ of students at MIT. Assuming
students in this course are a random sample from the overall population, we may build a
confidence interval for the unknown parameter µ as
\[ \left[ \bar X_n - 1.96\,\sigma/\sqrt{n},\;\; \bar X_n + 1.96\,\sigma/\sqrt{n} \right] \]

where X̄n is the sample average of n student heights in this course, and σ 2 is the population
variance (which we can also estimate from data). Classical statistics tells us that this
random interval contains µ with probability approximately 95%. Where does the number
1.96 come from?
More formally, let $X_1, \ldots, X_n$ be i.i.d. from some distribution $P$ on $\mathbb{R}$ such that $\mu = \mathbb{E}[X_i]$ and $\sigma^2 = \mathrm{var}(X_i)$ are finite. Of course, for the sample mean $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ we have $\mathbb{E}[\bar X_n] = \mu$, i.e. $\bar X_n$ is an unbiased estimate of the population mean. The (weak) Law of Large Numbers provides more information: $\bar X_n$ converges to $\mu$ in probability as $n \to \infty$: for any $\epsilon > 0$,
\[ \lim_{n\to\infty} \mathbb{P}\left( |\bar X_n - \mu| < \epsilon \right) = 1. \]

The strong LLN states that $\bar X_n$ converges to $\mu$ almost surely:
\[ \mathbb{P}\left( \lim_{n\to\infty} \bar X_n = \mu \right) = 1. \]

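Only finiteness of µ is needed for both of these results. Assuming σ is finite as well, we have the Central Limit Theorem:
\[ \sqrt{n}\left( \bar X_n - \mu \right) \xrightarrow{d} N(0, \sigma^2). \]

This means that for large enough n, the tail probability satisfies
\[ \mathbb{P}\left( \sqrt{n}\,\frac{|\bar X_n - \mu|}{\sigma} > u \right) \approx \mathbb{P}(|Z| > u) = 2\Phi(-u), \qquad Z \sim N(0,1) \tag{1.1} \]
where Φ is the cumulative distribution function (cdf) of the standard normal. This approximation gives us the number 1.96: it is the value of $u$ for which $2\Phi(-u) \approx 0.05$, so that the interval covers µ with probability close to 95%.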
Note that the statement of CLT is asymptotic, while we apply the conclusions for finite
n. The quality of the approximation for finite n should be a source of worry, but statisticians
devised rules of thumb, and indeed the approximation is quite good for n above, say, 30. In
practice, statisticians can perform simulations to see whether the CLT can be trusted for
the given sample size.
Of course, the quality of the approximation in (1.1) also depends on P . For instance,
student heights are approximately normal, and for a normal random variable we have the

exact non-asymptotic result: $\bar X_n \sim N(\mu, \sigma^2/n)$ (that is, $\sqrt{n}\,(\bar X_n - \mu)/\sigma \sim N(0,1)$). However,
for highly skewed distributions, the CLT kicks in for larger n.
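
The simulation mentioned above can be sketched as follows. This is a minimal illustration, not part of the original notes: the helper name `ci_coverage`, the choice of the Exp(1) distribution as a skewed example, and the number of trials are all assumptions made here. It estimates how often the interval with known σ and multiplier 1.96 actually contains µ.

```python
import math
import random


def ci_coverage(sampler, mu, sigma, n, trials=2000, z=1.96, seed=0):
    """Fraction of trials in which [xbar - z*sigma/sqrt(n), xbar + z*sigma/sqrt(n)]
    contains the true mean mu (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = sum(sampler(rng) for _ in range(n)) / n
        if abs(xbar - mu) <= half_width:
            hits += 1
    return hits / trials


def exp_sampler(rng):
    # Exp(1) has mean 1 and standard deviation 1, and is right-skewed.
    return rng.expovariate(1.0)


def normal_sampler(rng):
    return rng.gauss(0.0, 1.0)
```

For instance, `ci_coverage(exp_sampler, 1.0, 1.0, n)` can be compared across several values of n with `ci_coverage(normal_sampler, 0.0, 1.0, n)`, for which the nominal 95% coverage is exact up to Monte Carlo error.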

This course is centered on non-asymptotic results. In the context of one-dimensional
mean estimation, we will show in the next lecture that, under appropriate assumptions on P,
\[ \mathbb{P}\left( \sqrt{n}\,\frac{|\bar X_n - \mu|}{\sigma} > u \right) \le 2\exp\left\{ -c u^2 \right\}, \tag{1.2} \]
which holds for any n. Here c is some constant that depends on properties of P and may be somewhat smaller than the value 1/2 suggested by the limiting Gaussian distribution. Thus, confidence intervals derived with such non-asymptotic methods may be somewhat wider, yet they hold for any n. On the downside, we will have to place stronger assumptions on the distribution than those required by the CLT.
One may argue that in modern applications, n is very often large. However, it is also
true that many applications of interest are characterized by high dimensionality of data, or
complex structure. In such problems, as we see below, asymptotic analysis based on n → ∞
may not suffice.

1.1 Covariance estimation


Suppose $X_1, \ldots, X_n, X \overset{\text{i.i.d.}}{\sim} P$ on $\mathbb{R}^d$ and $\mathbb{E}[X] = 0$. Let $\Sigma = \mathbb{E}[XX^T]$ be the covariance matrix, and $\widehat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$ the sample covariance. Clearly, the sample covariance is an unbiased estimate of $\Sigma$. What can we say about the quality of such an estimate? For any pair $i, j = 1, \ldots, d$, it still holds that $\widehat\Sigma_{i,j}$ converges to $\Sigma_{i,j}$ in probability by the LLN (it's an average of independent products). Similarly,
\[ \sqrt{n}\left( \widehat\Sigma_{i,j} - \Sigma_{i,j} \right) \xrightarrow{d} N\left(0, \mathrm{var}(X_{1,i} X_{1,j})\right), \]
under the appropriate moment assumptions. In particular, $\widehat\Sigma$ is a consistent estimator of $\Sigma$. This positive result, however, disregards the role of dimension d. In many problems
of interest, dimensionality of the data may be of the same or similar order as the number
of datapoints (e.g. genomics applications). Is it then reasonable to disregard the role of d
while sending n to infinity?
There are two approaches to address the issue of high dimensionality:
• Consider an asymptotic setting where both n and d increase.

• Develop a non-asymptotic result that exhibits explicit dependence on d and n.


We start with the first approach. To make sense of this setting, consider a sequence of
problems where both n and d are increasing at a constant aspect ratio α = d/n ∈ (0, 1].
Such a scaling is called high-dimensional asymptotics.
In high-dimensional analysis we are often interested in convergence in the operator norm
\[ \|A\| = \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2}. \]

Let's see if such convergence holds in the high-dimensional asymptotics regime. For simplicity, take $\Sigma = I$, assume that coordinates of $X$ are i.i.d., and let $\lambda_1(\widehat\Sigma) \ge \ldots \ge \lambda_d(\widehat\Sigma) \ge 0$ be the eigenvalues of $\widehat\Sigma$. If $\widehat\Sigma$ converges in spectral norm to $\Sigma = I$, the histogram of (random) eigenvalues should be concentrated at 1. In particular, we would expect the empirical distribution of eigenvalues to satisfy $\frac{1}{d}\sum_{i=1}^d \delta_{\lambda_i} \to \delta_1$ as $n \to \infty$. This indeed happens when $d$ is kept fixed and $n$ is taken to infinity. Yet in the proportional high-dimensional regime, the limiting distribution of the empirical spectrum is not $\delta_1$ but follows the Marchenko-Pastur law [23]. This distribution has density supported on $[\lambda_-, \lambda_+]$, where $\lambda_+ = (1 + \sqrt{\alpha})^2$ and $\lambda_- = (1 - \sqrt{\alpha})^2$. The density has the form
\[ p(t) \propto \frac{\sqrt{(\lambda_+ - t)(t - \lambda_-)}}{t} \]
for $\alpha \in (0, 1]$ (and for $\alpha > 1$, there is an atom at 0). We see that when d, n both grow proportionally, $\widehat\Sigma$ does not converge to $\Sigma$ in the desired sense. To conclude, if we had, say, genomic data with d = 20K and n = 30K, we should probably not trust the sample covariance matrix as an estimate of the true covariance matrix, even though the data size appears to be large.
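
To see the effect numerically, here is a small sketch using only the Python standard library. It is an illustration under assumptions made here, not part of the original notes: the helper names `mp_edges` and `top_eigenvalue` are hypothetical, the data are taken to be N(0, I_d), and the largest eigenvalue of the sample covariance is approximated by power iteration rather than a full eigendecomposition.

```python
import math
import random


def mp_edges(alpha):
    """Endpoints (lambda_minus, lambda_plus) of the Marchenko-Pastur support for Sigma = I."""
    root = math.sqrt(alpha)
    return (1.0 - root) ** 2, (1.0 + root) ** 2


def top_eigenvalue(n, d, iters=150, seed=0):
    """Approximate the largest eigenvalue of (1/n) sum_i X_i X_i^T for
    X_i ~ N(0, I_d) by power iteration (hypothetical helper)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    v = [1.0 / math.sqrt(d)] * d  # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        Xv = [sum(x * y for x, y in zip(row, v)) for row in X]  # X v
        w = [0.0] * d
        for row, c in zip(X, Xv):  # w = X^T (X v) / n
            for j in range(d):
                w[j] += row[j] * c / n
        lam = sum(a * b for a, b in zip(v, w))  # Rayleigh quotient of the current v
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    return lam
```

With aspect ratio α = d/n = 1/2 (for example `top_eigenvalue(n=80, d=40)`), the returned value should sit near the upper edge `mp_edges(0.5)[1]`, which is about 2.91, rather than near 1.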
Analogously to the development in the previous section, we can contrast the asymptotic
approach with non-asymptotic tail bounds that hold for all n, d. In particular, we will show that, under additional assumptions, the largest eigenvalue of the sample covariance matrix satisfies
\[ \mathbb{P}\left( \lambda_1(\widehat\Sigma) \ge \left( 1 + \sqrt{d/n} + u \right)^2 \right) \le \exp\left\{ -n u^2/2 \right\}, \qquad u \ge 0. \tag{1.3} \]

1.2 Hypothesis testing in high dimension


Suppose X has distribution either P1 = N (µ1 , Σ) or P2 = N (µ2 , Σ). In this case, the
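Neyman-Pearson lemma says that the most powerful hypothesis test ($P_1$ vs $P_2$) is to compare the likelihood ratio to a fixed threshold τ:
\[ \log \frac{dP_1(x)}{dP_2(x)} \ge \tau. \]
Using the form of the Gaussian multivariate density yields a simple statistic for testing:¹
\[ \Psi(x) = \left\langle \mu_1 - \mu_2,\; \Sigma^{-1}\left( x - (\mu_1 + \mu_2)/2 \right) \right\rangle. \]
If Type I and II errors are weighted equally, the resulting error is
\[ \frac{1}{2} P_1(\Psi(x) \le 0) + \frac{1}{2} P_2(\Psi(x) > 0) = \Phi(-\Delta/2) \tag{1.4} \]
where $\Delta = \|\mu_1 - \mu_2\|_2$ (for $\Sigma = I$; for general $\Sigma$, the Euclidean norm is replaced by $\sqrt{(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)}$) and $\Phi(\gamma) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\gamma} e^{-t^2/2}\, dt$.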
If $\mu_1$ and $\mu_2$ are unknown, we may estimate them by $\bar X_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} X_i$ and $\bar X_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} X_i'$ on two independent samples from the two respective distributions. Let us assume for simplicity that $\Sigma = I$. The plug-in rule is then based on the statistic
\[ \widehat\Psi(x) = \left\langle \bar X_1 - \bar X_2,\; x - (\bar X_1 + \bar X_2)/2 \right\rangle. \]
Kolmogorov in the 1970s studied the high-dimensional asymptotics of this problem where $n_1, n_2, d \to \infty$ while $d/n_1 \to \alpha$, $d/n_2 \to \alpha$. He showed that the error of $\widehat\Psi$ instead converges in probability to
\[ \Phi\left( - \frac{\Delta^2}{2\sqrt{\Delta^2 + 2\alpha}} \right). \]
Note that when α = 0, the result recovers (1.4). When the asymptotics are proportional, however, the effect of dimensionality cannot be ignored: the error probability becomes skewed and failure to account for dimensionality can lead to incorrect hypothesis test acceptance/rejection.
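
The two error formulas are easy to compare numerically. The sketch below is illustrative only; the function names are assumptions made here, and Φ is computed through `math.erfc`.

```python
import math


def std_normal_cdf(x):
    """Phi(x) for the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def classical_error(delta):
    """Error (1.4) of the test with known means: Phi(-Delta / 2)."""
    return std_normal_cdf(-delta / 2.0)


def high_dim_error(delta, alpha):
    """Limiting error of the plug-in rule when d/n1 and d/n2 tend to alpha."""
    return std_normal_cdf(-delta ** 2 / (2.0 * math.sqrt(delta ** 2 + 2.0 * alpha)))
```

For example, with Δ = 2 the classical error is Φ(−1), while `high_dim_error(2.0, 1.0)` evaluates Φ at −4/(2√6), which is closer to zero and hence gives a strictly larger error.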
¹ The exposition here follows [30] and [34, Chap 1].

1.3 MLE
To give a taste of some other settings where asymptotic analysis is classically used, consider the case of $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P_\theta$, where $\theta \in \Theta$ is a parameter. Suppose, for simplicity, that the random variables are real-valued. Under some regularity conditions, the sequence of MLE solutions
\[ \widehat\theta_n \in \operatorname*{argmax}_{\theta \in \Theta} \sum_{i=1}^n \log P_\theta(X_i) \]
satisfies $\widehat\theta_n \to \theta$ in probability and asymptotic normality holds:
\[ \sqrt{n}\,(\widehat\theta_n - \theta) \xrightarrow{d} N\left(0, I(\theta)^{-1}\right) \]

where I(θ) is the Fisher information. Once again, while the asymptotic result sheds light
on the convergence of MLE for large enough n, it does not say much about finite n. In
particular, for some finite n, MLE may not be the best estimator, and some biased procedure
may be better.

1.4 Statistical Learning


In the problem of binary classification, we are given i.i.d. samples $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ from a joint distribution $P_{X \times Y}$ on $\mathcal{X} \times \{\pm 1\}$. Based on these n data points, we construct a classifier $\widehat f_n : \mathcal{X} \to \{\pm 1\}$. We can make the dependence on the dataset explicit by writing the prediction on $x \in \mathcal{X}$ as $\widehat f_n(x; D_n)$.
The classification rule is said to be (weakly) consistent [8] if
\[ \mathbb{P}\left( \widehat f_n(X; D_n) \ne Y \right) \to L^*, \]

where L∗ is the Bayes risk (lowest achievable error by any classification rule), and the
probability is with respect to Dn and a new datum (X, Y ) from the same distribution.
Strong universal consistency asks for almost sure convergence.
Once again, consistency does not guarantee good performance for any finite n. Much of
learning theory instead focuses on explicit rates of convergence in n, as well as on making
explicit the relevant complexity parameters of the problem. Such complexity parameters
are not always explicit (in contrast to dimensionality of linear models), as illustrated in the
next example.

1.5 Example of Complex Structure: Neural Networks


Consider a class of feed-forward neural networks

\[ f(x) = W^L \sigma\left( W^{L-1} \cdots \sigma(W^1 x) \cdots \right) \]
of depth L and nonlinearity $\sigma : \mathbb{R} \to \mathbb{R}$ applied coordinate-wise, with $W^\ell \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$. Here fixing the neural network structure and letting the number of data points n increase to infinity may not be too
interesting. Indeed, in modern practice, the size of the network is taken to be large for large
n (much like in the high-dimensional asymptotics regime). We would like to understand
what plays the role of “dimension” here. With the techniques developed in the second part
of the course, we will be able to develop results that hold for any particular n, particular
architecture, and, say, norms of the weight matrices.

2. SUB-GAUSSIAN RANDOM VARIABLES

This lecture is based on [33, Chap 2].


Let $X_1, \ldots, X_n$ be i.i.d. real-valued random variables with distribution P with mean µ and variance $\sigma^2$. Recall that the CLT implies approximate results of the form (1.1), i.e. for large enough n, tails (that is, values of $\mathbb{P}(\bar X \ge t)$) of sample averages of random variables behave like those of a Gaussian. So, what are these tails?

2.1 What is it like to be normal?


It is easy to show (by simple integration) that for $Z \sim N(0, 1)$ and for any $t > 0$,
\[ \left( \frac{1}{t} - \frac{1}{t^3} \right) \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \;\le\; \mathbb{P}(Z \ge t) \;\le\; \frac{1}{t} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \tag{2.5} \]
The right-hand side is especially easy to remember: for $t \ge 1$, the tail is at most the density of the standard normal itself! Further, note that the moments of a standard normal random variable have the following behavior:
\[ \left( \mathbb{E}|Z|^p \right)^{1/p} = \sqrt{2} \left( \frac{\Gamma\left(\frac{1+p}{2}\right)}{\Gamma\left(\frac{1}{2}\right)} \right)^{1/p} \sim c\sqrt{p}, \qquad p \ge 1 \tag{2.6} \]
Finally,
\[ \mathbb{E} e^{\lambda Z} = e^{\lambda^2/2} \tag{2.7} \]
for any $\lambda \in \mathbb{R}$. Hence $\mathbb{E} e^{\lambda Z'} = e^{\sigma^2 \lambda^2/2}$ for $Z' \sim N(0, \sigma^2)$. Since our aim is to develop CLT-like non-asymptotic tail bounds on averages of random variables, we will be checking whether approximate versions of (2.5), (2.6), (2.7) hold.
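
As a quick numerical sanity check of (2.5), the exact Gaussian tail can be computed from the complementary error function and compared with both sides of the inequality. This is a small sketch with function names chosen here for illustration.

```python
import math


def gaussian_upper_tail(t):
    """Exact P(Z >= t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def gaussian_density(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)


def tail_bounds(t):
    """Lower and upper bounds on P(Z >= t) from (2.5), valid for t > 0."""
    phi = gaussian_density(t)
    return (1.0 / t - 1.0 / t ** 3) * phi, phi / t
```

Note that the lower bound is negative (hence trivial) for t < 1, and both bounds become tight relative to the tail as t grows.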

2.2 Tail bounds


Let's now discuss some of the basic tools we have at our disposal for proving non-asymptotic tail bounds. We start with some familiar probabilistic inequalities. Markov's inequality says that for any non-negative X,
\[ \mathbb{P}(X \ge t) \le \frac{\mathbb{E}X}{t}, \qquad t > 0. \]
As a consequence, Chebyshev's inequality² says that for any real-valued random variable X with mean µ and variance $\sigma^2$,
\[ \mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}. \]
Applying Markov's inequality to higher moments yields
\[ \mathbb{P}(|X - \mu| \ge t) = \mathbb{P}\left( |X - \mu|^p \ge t^p \right) \le \min_{p \ge 1} \frac{\mathbb{E}|X - \mu|^p}{t^p}, \]
and applying Markov to an exponentiated random variable gives the Cramér-Chernoff bound
\[ \mathbb{P}(X - \mu \ge t) = \mathbb{P}\left( e^{\lambda(X - \mu)} \ge e^{\lambda t} \right) \le \inf_{\lambda > 0} e^{-t\lambda}\, \mathbb{E} e^{\lambda(X - \mu)}. \tag{2.8} \]
² Chebyshev was Markov's advisor.

Hence, to deduce Gaussian-like tails for the random variable $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$, we need to understand the behavior of its moments $\mathbb{E}|\bar X - \mu|^p$ or its moment generating function

MU (λ) = E exp{λU }, λ∈R

(defined abstractly here for any random variable U). Since the exponential of a sum is a product of exponentials, the upper bound furnished by optimizing λ in (2.8) will be easier to handle.
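
To make the comparison concrete, the sketch below evaluates the Chebyshev bound and the Cramér-Chernoff bound (2.8) for a Gaussian, where the moment generating function is known from (2.7). The function names and the grid search over λ are illustrative choices made here; for a Gaussian the minimizer is available in closed form, λ = t/σ².

```python
import math


def gaussian_upper_tail(t):
    """Exact P(Z >= t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def chebyshev_bound(t, sigma2=1.0):
    """Chebyshev bound on P(|X - mu| >= t), capped at 1."""
    return min(1.0, sigma2 / t ** 2)


def chernoff_bound_gaussian(t, sigma2=1.0, grid=2000):
    """Minimize exp(-t*lam + sigma2*lam^2/2) over a grid of lam in (0, 2t/sigma2].
    The exact minimizer is lam = t/sigma2, which lies on this grid."""
    lam_max = 2.0 * t / sigma2
    return min(
        math.exp(-t * lam + sigma2 * lam * lam / 2.0)
        for lam in (lam_max * i / grid for i in range(1, grid + 1))
    )
```

At t = 3, for example, the Chernoff bound equals exp(−4.5), roughly ten times smaller than the Chebyshev bound 1/9, while both remain above the exact tail.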
Before proceeding to analyze the sums and establishing tail bounds, we first discuss a
family of random variables that will be useful to work with. These random variables have
more restrictions than those for which CLT holds (finite second moment), and hence form
a smaller family. Nevertheless, the family is rich enough to cover many applications of
interest. In the next lecture we will see a larger family of random variables.

2.3 Sub-Gaussian random variables

Definition 1. A mean-zero random variable X is sub-Gaussian with variance factor (or variance proxy) $s^2$ if
\[ \mathbb{E} e^{\lambda X} \le e^{s^2 \lambda^2/2} \]
for all $\lambda \in \mathbb{R}$.

We will write $X \in \mathrm{subG}(s^2)$ to denote the fact that X belongs to the family of sub-Gaussian random variables with $s^2$ as the parameter.
A few remarks. First, if X is sub-Gaussian, then so is −X with the same variance proxy. This will be useful for deducing bounds on |X| from bounds on X. Second, the families of these random variables are nested in the sense that if $X \in \mathrm{subG}(s^2)$, then $X \in \mathrm{subG}(t^2)$ for all $t^2 > s^2$. Third, if $X \in \mathrm{subG}(s^2)$ then $cX \in \mathrm{subG}(c^2 s^2)$. In particular, we can often work with subG(1) and conclude the more general result by rescaling.
It turns out that there are several equivalent ways of defining sub-Gaussian behavior.

Lemma 1 (Prop 2.5.2 in [33]). Let X be a random variable with E[X] = 0. Then the following are equivalent, and the parameters $c_i > 0$ differ by at most absolute constant factors:

1. For all $\lambda \in \mathbb{R}$, $\mathbb{E} \exp\{\lambda X\} \le \exp\{c_1^2 \lambda^2\}$.

2. For all $t \ge 0$, $\mathbb{P}(|X| \ge t) \le 2\exp\{-t^2/c_2^2\}$.

3. For all $p = 1, 2, \ldots$, $(\mathbb{E}|X|^p)^{1/p} \le c_3 \sqrt{p}$.

4. For all λ such that $|\lambda| \le 1/c_4$, $\mathbb{E} \exp\{\lambda^2 X^2\} \le \exp\{c_4^2 \lambda^2\}$.

5. For some $c_5 < \infty$, $\mathbb{E} \exp\{X^2/c_5^2\} \le 2$.

We will only prove a few of the implications here (please see [33] for all the proofs). Let us illustrate (1) ⇒ (2). Suppose without loss of generality that $X \in \mathrm{subG}(1)$ (and hence $c_1^2 = 1/2$). In view of (2.8),
\[ \mathbb{P}(X \ge t) \le \inf_{\lambda > 0} e^{-t\lambda}\, \mathbb{E} e^{\lambda X} \le \inf_{\lambda > 0} e^{\lambda^2/2 - t\lambda} = \exp\left\{ -\frac{t^2}{2} \right\} \tag{2.9} \]
by plugging in the optimizing value λ = t. This is the Cramér-Chernoff method.

Let us now prove (2) ⇒ (3). By rescaling, assume $c_2 = 1$. We have
\[ \mathbb{E}|X|^p = \int_0^\infty \mathbb{P}\left( |X|^p \ge u \right) du = \int_0^\infty \mathbb{P}(|X| \ge t)\, p t^{p-1}\, dt \le \int_0^\infty 2\exp\{-t^2\}\, p t^{p-1}\, dt. \tag{2.10} \]
By the change of variables $t = \sqrt{s}$ (and hence $dt = \frac{1}{2} s^{-1/2} ds$), the last expression can be written as $p\,\Gamma(p/2)$ in terms of the Gamma function. Using Stirling's approximation, $\Gamma(p/2) \le (p/2)^{p/2}$. Hence,
\[ (\mathbb{E}|X|^p)^{1/p} \le p^{1/p} (p/2)^{1/2} \le c_3 \sqrt{p}. \]

2.3.1 Examples
Arguably, the simplest nontrivial random variables are Bernoulli or Rademacher. The Rademacher random variable ε takes values in {±1} with equal probability. We then have
\[ \mathbb{E} e^{\lambda\varepsilon} = \frac{1}{2} e^{\lambda} + \frac{1}{2} e^{-\lambda} = \frac{1}{2} \sum_{k=0}^\infty \frac{\lambda^k + (-\lambda)^k}{k!} = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le 1 + \sum_{k=1}^\infty \frac{(\lambda^2)^k}{2^k k!} = e^{\lambda^2/2}. \tag{2.11} \]
Hence, ε is 1-subGaussian. By re-scaling, the variable $\frac{b-a}{2}\varepsilon$ has subGaussian parameter $(b - a)^2/4$ and (obviously) takes values on the endpoints of $\left[ -\frac{b-a}{2}, \frac{b-a}{2} \right]$ (assuming $b \ge a$). In fact, any zero-mean random variable that takes on values in the interval [a, b] has subGaussian parameter at most $(b - a)^2/4$. In this sense, the scaled Rademacher random variable is extremal.

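Lemma 2 (Hoeffding's Lemma). For any zero-mean random variable X taking values in [a, b], the moment generating function satisfies
\[ \mathbb{E} \exp\{\lambda X\} \le \exp\{\lambda^2 (b - a)^2/8\}, \qquad \lambda \in \mathbb{R}. \]
Hence, $X \in \mathrm{subG}((b - a)^2/4)$.

Proof. Let $\psi(\lambda) = \log \mathbb{E} \exp\{\lambda X\}$. Then $\psi'(\lambda) = \frac{\mathbb{E}[X e^{\lambda X}]}{\mathbb{E} e^{\lambda X}}$. Observe that $\psi(0) = \psi'(0) = 0$. It remains to prove that $\psi''(\lambda) \le (b - a)^2/4$, since Taylor's theorem would then imply (for some ν between 0 and λ)
\[ \psi(\lambda) = \psi(0) + \lambda \psi'(0) + \frac{\lambda^2}{2} \psi''(\nu) \le \lambda^2 \frac{(b - a)^2}{8}. \]
We compute the second derivative as
\[ \psi''(\lambda) = \mathbb{E}\left[ X^2 \frac{e^{\lambda X}}{\mathbb{E} e^{\lambda X}} \right] - \left( \mathbb{E}\left[ X \frac{e^{\lambda X}}{\mathbb{E} e^{\lambda X}} \right] \right)^2 = \mathrm{var}(Y) \]
for Y with density tilted by $x \mapsto \frac{e^{\lambda x}}{\mathbb{E} e^{\lambda X}}$. Since Y takes on values in [a, b], its variance is at most $(b - a)^2/4$, concluding the proof.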
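
Hoeffding's Lemma is easy to check numerically for a simple bounded variable. The sketch below uses a centered Bernoulli, X = B − p with B ∼ Bernoulli(p), which takes values in [−p, 1 − p] so that b − a = 1. The example and the function names are assumptions chosen here for illustration; p = 1/2 corresponds to a rescaled Rademacher variable.

```python
import math


def centered_bernoulli_mgf(lam, p):
    """E exp(lam * (B - p)) for B ~ Bernoulli(p)."""
    return (1.0 - p) * math.exp(-lam * p) + p * math.exp(lam * (1.0 - p))


def hoeffding_lemma_holds(p, lam_max=10.0, steps=200):
    """Check E exp(lam X) <= exp(lam^2 (b - a)^2 / 8) with b - a = 1 on a grid of lam."""
    for i in range(-steps, steps + 1):
        lam = lam_max * i / steps
        bound = math.exp(lam * lam / 8.0)
        # Small relative slack guards against floating-point rounding at lam = 0.
        if centered_bernoulli_mgf(lam, p) > bound * (1.0 + 1e-12):
            return False
    return True
```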

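Observe now that if $X_1 \in \mathrm{subG}(\sigma_1^2)$ and $X_2 \in \mathrm{subG}(\sigma_2^2)$, then $X_1 + X_2 \in \mathrm{subG}(\sigma_1^2 + \sigma_2^2)$ whenever $X_1$ and $X_2$ are independent. As an immediate consequence,

Lemma 3. Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ be independent Rademacher and $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$. Then
\[ \langle \varepsilon, a \rangle = \sum_{i=1}^n \varepsilon_i a_i \in \mathrm{subG}(\|a\|_2^2). \]

In the same vein, for any sequence of independent random variables $X_i$ with $\mathbb{E}[X_i] = \mu_i$ and $X_i - \mu_i \in \mathrm{subG}(\sigma_i^2)$,
\[ \mathbb{P}\left( \sum_{i=1}^n (X_i - \mu_i) \ge t \right) \le \exp\left\{ -\frac{t^2}{2\sum_{i=1}^n \sigma_i^2} \right\} \tag{2.12} \]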

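In particular, we have

Lemma 4 (Hoeffding's inequality). For independent $X_i \in [a, b]$ with $\mathbb{E}[X_i] = \mu_i$,
\[ \mathbb{P}\left( \sum_{i=1}^n (X_i - \mu_i) \ge t \right) \le \exp\left\{ -\frac{2t^2}{n(b - a)^2} \right\} \tag{2.13} \]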
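
A short Monte Carlo experiment illustrates (2.13). The sketch below is an illustration under assumptions made here (uniform variables on [0, 1], hypothetical function names, a fixed seed); it compares the empirical tail of a centered sum with the Hoeffding bound.

```python
import math
import random


def hoeffding_bound(t, n, a=0.0, b=1.0):
    """Right-hand side of (2.13) for n independent variables in [a, b]."""
    return math.exp(-2.0 * t ** 2 / (n * (b - a) ** 2))


def empirical_tail(t, n, trials=5000, seed=0):
    """Monte Carlo estimate of P(sum_i (X_i - 1/2) >= t) for X_i ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        centered_sum = sum(rng.random() - 0.5 for _ in range(n))
        if centered_sum >= t:
            hits += 1
    return hits / trials
```

With n = 50 and t = 5 the bound equals exp(−1), whereas the empirical frequency is far smaller. The gap reflects that the uniform distribution has variance 1/12, well below the worst case (b − a)²/4 = 1/4 that the bound allows for.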

We close this section with two examples that indicate that the development of sub-Gaussian tail bounds so far is lacking on several fronts.
First, we will be interested in tail bounds on norms of Gaussian vectors $\|g\|$, where the coordinates are standard normal. Since $\|g\|^2 = \sum_i g_i^2$, it's tempting to use the sub-Gaussian results above. However, $g_i^2$ is not sub-Gaussian: $\mathbb{P}(g^2 \ge t) = \mathbb{P}(|g| \ge \sqrt{t}) \le 2\exp\{-t/2\}$, which is sub-exponential rather than sub-Gaussian (and, by the lower bound in (2.5), this rate cannot be improved). These tails are heavier (or, fatter) than those of sub-Gaussian.
The second example illustrates a larger concern with sub-Gaussian tail bounds à la Hoeffding that rely on the range of random variables but not on their variance. Consider the following variable X. Let $\mathbb{P}(X = 0) = 1 - 1/k^2$ and $\mathbb{P}(X = k) = \mathbb{P}(X = -k) = 1/(2k^2)$, where k is a parameter, which we think of as large. Observe that the range of this random variable is 2k, but the mean and (importantly) variance are small: $\mathbb{E}X = 0$, $\mathrm{var}(X) = 1$. If we draw $X_1, \ldots, X_k$ i.i.d., $\mathbb{P}(X_1 = \ldots = X_k = 0) = (1 - 1/k^2)^k \approx \exp\{-1/k\}$, which is close to 1 for large k. Since Hoeffding style inequalities only depend on the range, they are not able to distinguish this small-variance distribution from one that is uniform on [−k, k].

3. SUB-EXPONENTIAL RANDOM VARIABLES

In this section, we follow the notation of [34].


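As mentioned at the end of last lecture, the sub-Gaussian family leaves out some interesting random variables, in particular $X = Z^2$, where $Z \sim N(0, 1)$. Here X is called a chi-square random variable, denoted by $\chi^2$. Let's examine the moment generating function of its centered version:
\[ M_X(\lambda) = \mathbb{E} \exp\{\lambda(Z^2 - 1)\} = \frac{1}{\sqrt{2\pi}} \int e^{\lambda(z^2 - 1)} e^{-z^2/2}\, dz \tag{3.14} \]
Clearly, the MGF is infinite when $\lambda \ge 1/2$, so we only consider $\lambda < 1/2$. In that case,
\[ M_X(\lambda) = e^{-\lambda} \frac{1}{\sqrt{2\pi}} \int e^{-z^2(1 - 2\lambda)/2}\, dz = e^{-\lambda} \frac{1}{\sqrt{1 - 2\lambda}}. \tag{3.15} \]
One can further check that
\[ e^{-\lambda} \frac{1}{\sqrt{1 - 2\lambda}} \le \exp\left\{ \frac{\lambda^2}{1 - 2\lambda} \right\}, \qquad 0 < \lambda < 1/2 \tag{3.16} \]
Moreover, for $|\lambda| < 1/4$, the expression in (3.15) is dominated by $e^{2\lambda^2}$, and thus in this range of λ the centered variable $Z^2 - 1$ exhibits sub-Gaussian behavior.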
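
The closed form (3.15) and the two bounds stated above can be checked numerically. The following is a small sketch with names chosen here for illustration: the integral in (3.14) is approximated by the trapezoid rule on a truncated interval (an assumption that is harmless for λ well below 1/2, where the integrand decays quickly).

```python
import math


def mgf_centered_chi2(lam):
    """Closed form (3.15) for E exp(lam * (Z^2 - 1)), valid for lam < 1/2."""
    return math.exp(-lam) / math.sqrt(1.0 - 2.0 * lam)


def mgf_by_quadrature(lam, half_width=12.0, steps=24000):
    """Trapezoid-rule approximation of the integral in (3.14) over [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(lam * (z * z - 1.0) - z * z / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)


def bounds_hold(grid=200):
    """Check (3.16) on 0 < lam < 1/2 and the bound exp(2 lam^2) on |lam| < 1/4."""
    for i in range(1, grid):
        lam = 0.5 * i / grid
        if mgf_centered_chi2(lam) > math.exp(lam ** 2 / (1.0 - 2.0 * lam)) * (1 + 1e-12):
            return False
        small = -0.25 + 0.5 * i / grid
        if mgf_centered_chi2(small) > math.exp(2.0 * small ** 2) * (1 + 1e-12):
            return False
    return True
```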

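Definition 2 (p 26 in [34]). A random variable X with mean $\mu = \mathbb{E}[X]$ is sub-exponential if there are non-negative parameters $(s^2, \alpha)$ such that
\[ \mathbb{E}[\exp\{\lambda(X - \mu)\}] \le \exp\left\{ s^2 \lambda^2/2 \right\}, \qquad \forall |\lambda| < 1/\alpha. \tag{3.17} \]
We will write $X \in \mathrm{subE}(s^2, \alpha)$.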

Remarks:
• In some of the references, you will see that sub-exponential random variables are
defined with only one parameter; this corresponds to insisting that α = s, i.e. the
random variable has sub-Gaussian behavior with parameter s2 in the range |λ| < 1/s.
We follow [34] and decouple these two parameters.

• If we ask that (3.17) holds for λ ∈ (0, 1/α), the results stated below will only hold for
the upper tail of (X − µ). The behavior for the upper and lower tails can indeed be
different.

• Any $X \in \mathrm{subG}(s^2)$ is also sub-exponential with parameters $(s^2, 0)$, with the convention $1/0 = \infty$.

From the earlier calculation, $Z^2 \in \mathrm{subE}(2^2, 4)$.

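Lemma 5. Suppose $X \in \mathrm{subE}(v^2, \alpha)$. Then
\[ \mathbb{P}(X - \mu \ge t) \le \begin{cases} \exp\left\{ -t^2/(2v^2) \right\}, & 0 \le t \le v^2/\alpha \\ \exp\left\{ -t/(2\alpha) \right\}, & t \ge v^2/\alpha \end{cases} \tag{3.18} \]
The same holds for the tail of $-(X - \mu)$.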

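Alternatively, we can write
\[ \mathbb{P}(X - \mu \ge t) \le \exp\left\{ -\min\left\{ \frac{t^2}{2v^2}, \frac{t}{2\alpha} \right\} \right\} \tag{3.19} \]

Proof. Recall that
\[ \mathbb{P}(X - \mu \ge t) \le \inf_{\lambda \in [0, 1/\alpha)} e^{-t\lambda}\, \mathbb{E} e^{\lambda(X - \mu)} \le \inf_{\lambda \in [0, 1/\alpha)} e^{-t\lambda} e^{v^2 \lambda^2/2}, \tag{3.20} \]
where the limited range of λ, as compared to (2.8), is dictated by the definition of a sub-exponential random variable. By taking the derivative, we see that $\lambda = t/v^2$ is the unconstrained solution; we take this whenever $t/v^2 \le 1/\alpha$. Otherwise, the minimum is achieved at the endpoint $\lambda = 1/\alpha$, where the exponent takes the value $-t/\alpha + v^2/(2\alpha^2) \le -t/(2\alpha)$.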

Let us discuss Lemma 5. It shows that sub-exponential random variables exhibit two behaviors: sub-Gaussian (in the range $0 \le t \le v^2/\alpha$) and sub-exponential (in the range $t \ge v^2/\alpha$). We remark that the two-tail behavior arises simply by asking for the sub-Gaussian behavior in an interval.
Rather than writing the tail bound with a min as in (3.19), we can relax the exponent as follows. Note that for nonnegative u, v, it holds that $\min\{1/u, 1/v\} \ge 1/(u + v)$. We can thus upper bound the right-hand side of (3.19) as
\[ \exp\left\{ -\min\left\{ \frac{t^2/2}{v^2}, \frac{t^2/2}{t\alpha} \right\} \right\} \le \exp\left\{ -\frac{t^2/2}{v^2 + t\alpha} \right\} \tag{3.21} \]
We will see this form of a tail bound later in the lecture.
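
The two forms of the bound can be written as small functions and compared directly. This is an illustrative sketch with hypothetical function names; both functions assume α > 0.

```python
import math


def subexp_tail(t, v2, alpha):
    """Bound (3.19): exp(-min(t^2 / (2 v^2), t / (2 alpha)))."""
    return math.exp(-min(t * t / (2.0 * v2), t / (2.0 * alpha)))


def subexp_tail_relaxed(t, v2, alpha):
    """Relaxed bound (3.21): exp(-(t^2 / 2) / (v^2 + t * alpha))."""
    return math.exp(-(t * t / 2.0) / (v2 + t * alpha))
```

For parameters $(v^2, \alpha) = (4, 4)$, as for the centered $Z^2$, the two regimes of (3.19) meet at $t = v^2/\alpha = 1$, where both branches give exp(−1/8). The relaxed form is never smaller than (3.19), so it is a valid but slightly weaker bound.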


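As with sub-Gaussian random variables, we can easily calculate the parameters for the sum of sub-exponentials. If $X_1, \ldots, X_n$ are independent with means $\mathbb{E}[X_i] = \mu_i$ and $X_i - \mu_i \in \mathrm{subE}(v_i^2, \alpha_i)$, then
\[ \sum_{i=1}^n (X_i - \mu_i) \in \mathrm{subE}\left( \sum_{i=1}^n v_i^2,\; \max_i \alpha_i \right) \]

Lemma 6 (Bernstein's inequality). Suppose $X_1, \ldots, X_n$ are independent zero-mean and $X_i \in \mathrm{subE}(1, 1)$. Let $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$. Then
\[ \sum_{i=1}^n a_i X_i \in \mathrm{subE}\left( \|a\|_2^2, \|a\|_\infty \right) \]
and, hence,
\[ \mathbb{P}\left( \Big| \sum_{i=1}^n a_i X_i \Big| \ge t \right) \le 2\exp\left\{ -\min\left\{ \frac{t^2}{2\|a\|_2^2}, \frac{t}{2\|a\|_\infty} \right\} \right\} \]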
In particular, if all $a_i = 1/n$, under the conditions of the above lemma,
\[ \mathbb{P}\left( \Big| \frac{1}{n} \sum_{i=1}^n X_i \Big| \ge t \right) \le 2\exp\left\{ -n \cdot \min\left\{ \frac{t^2}{2}, \frac{t}{2} \right\} \right\} \tag{3.22} \]
To shed some light on (3.22), consider a tail bound for a single sub-exponential random variable with parameters (1, 1):
\[ \mathbb{P}\left( \Big| \frac{1}{n} X_i \Big| \ge t \right) \le 2\exp\left\{ -\frac{nt}{2} \right\}, \qquad t \ge 1 \tag{3.23} \]
from Lemma 5. Hence, the sub-exponential behavior of the averages in (3.22) comes not from averaging but rather from a single worst tail (e.g. that has the largest α for a general collection).
Another way to write (3.22) is
\[ \mathbb{P}\left( \Big| \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \Big| \ge t \right) \le \begin{cases} 2\exp\left\{ -\frac{t^2}{2} \right\}, & t \le \sqrt{n} \\ 2\exp\left\{ -\frac{t\sqrt{n}}{2} \right\}, & t \ge \sqrt{n} \end{cases} \tag{3.24} \]
The CLT would say that for large enough n, the random variable $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i$ should have Gaussian tails under finiteness of second moment. In contrast, (3.24) says that for the sub-exponential family (where the restriction is less strict than sub-Gaussian but more strict than finite second moment), the sub-Gaussian behavior holds until $t = \sqrt{n}$, after which it switches to heavier tails.

3.1 Bernstein’s Condition


It turns out that the two tail behaviors in (3.22) play an important role in statistical
applications. As we will see below, the interplay between these tails is due to the relative
behavior of the variance and range of random variables. So-called “fast rates” will be derived
in this course in situations with small variance or low noise. But first, we define a condition
that implies that the random variable is sub-exponential.

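Definition 3. We say that a random variable X with mean $\mu = \mathbb{E}X$ and variance $\sigma^2 = \mathrm{var}(X)$ satisfies Bernstein's Condition (BC) with parameter b if
\[ \left| \mathbb{E}(X - \mu)^k \right| \le \frac{1}{2} k!\, \sigma^2 b^{k-2}, \qquad k = 3, 4, \ldots \]

Lemma 7. Any bounded random variable with $|X - \mathbb{E}X| \le B$ satisfies Bernstein's Condition with b = B/3.

Proof. For any k = 3, 4, . . .,
\[ \mathbb{E}|X - \mu|^k \le \mathbb{E}\left\{ |X - \mu|^{k-2} (X - \mu)^2 \right\} \le B^{k-2} \sigma^2 \le \frac{\sigma^2}{2} k!\, (B/3)^{k-2} \tag{3.25} \]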

Lemma 8 (Bernstein's Inequality). For a random variable X satisfying Bernstein's Condition with parameter b > 0, it holds that for any $|\lambda| < 1/b$,
\[ \mathbb{E} \exp\{\lambda(X - \mu)\} \le \exp\left\{ \frac{\lambda^2 \sigma^2/2}{1 - b|\lambda|} \right\} \tag{3.26} \]
where $\mu = \mathbb{E}X$ and $\sigma^2 = \mathrm{var}(X)$. Hence, for all t > 0,
\[ \mathbb{P}(|X - \mu| \ge t) \le 2\exp\left\{ -\frac{t^2/2}{\sigma^2 + tb} \right\}. \tag{3.27} \]
In particular, for a bounded random variable with $|X - \mu| \le B$ a.s.,
\[ \mathbb{P}(|X - \mu| \ge t) \le 2\exp\left\{ -\frac{t^2/2}{\sigma^2 + Bt/3} \right\}. \tag{3.28} \]
It is worth comparing (3.27) to the tail in (3.21) for a $\mathrm{subE}(v^2, \alpha)$ random variable. Here, $v^2$ is replaced by the actual variance $\sigma^2$, and the parameter α by b.

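Proof. We have
\[ \mathbb{E} \exp\{\lambda(X - \mu)\} = 1 + \frac{\lambda^2 \sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k\, \mathbb{E}(X - \mu)^k}{k!} \tag{3.29} \]
\[ \le 1 + \frac{\lambda^2 \sigma^2}{2} + \frac{\lambda^2 \sigma^2}{2} \sum_{k=3}^\infty |\lambda|^{k-2} b^{k-2} \tag{3.30} \]
\[ = 1 + \frac{\lambda^2 \sigma^2}{2} \left( 1 + \sum_{k=1}^\infty |\lambda|^k b^k \right) \tag{3.31} \]
\[ \le 1 + \frac{\lambda^2 \sigma^2}{2} \cdot \frac{1}{1 - |\lambda| b} \tag{3.32} \]
provided that $|\lambda| < 1/b$, where $\sigma^2 = \mathrm{var}(X)$. Since $1 + x \le e^x$, we conclude
\[ \mathbb{E} \exp\{\lambda(X - \mu)\} \le \exp\left\{ \frac{\lambda^2 \sigma^2/2}{1 - b|\lambda|} \right\} \tag{3.33} \]
Choosing $\lambda = \frac{t}{bt + \sigma^2} \in [0, 1/b)$ in the Cramér-Chernoff bound (3.20) concludes the proof.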

In particular, (3.26) implies that random variables satisfying BC are sub-exponential. Indeed, by restricting $|\lambda| \le 1/(2b)$ in (3.33) we conclude that
\[ \mathbb{E} \exp\{\lambda(X - \mu)\} \le \exp\left\{ \frac{\lambda^2}{2} \cdot (\sqrt{2}\sigma)^2 \right\}, \tag{3.34} \]
which means that $X - \mu \in \mathrm{subE}(2\sigma^2, 2b)$. This, however, does not yield the constants of (3.28) as opposed to working directly with (3.33).
Finally, we mention a one-sided tail bound that has tighter constants:

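Lemma 9. Suppose for some positive v, b it holds that
\[ \mathbb{E} \exp\{\lambda(X - \mu)\} \le \exp\left\{ \frac{\lambda^2 v^2/2}{1 - b\lambda} \right\}, \qquad \lambda \in (0, 1/b). \tag{3.35} \]
Then
\[ \mathbb{P}\left( X - \mu \ge \sqrt{2v^2 t} + bt \right) \le \exp\{-t\}. \tag{3.36} \]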

See [5, p. 29] for a proof, or try to prove it yourself (Hint: solve for the optimal λ in
Cramér-Chernoff).

3.2 Bernstein’s inequality for sums


We now discuss the implication of Bernstein's inequality for a sum of independent random variables $X_i$. Let $\mu = \mathbb{E}[X_i]$ and $\sigma^2 = \mathrm{var}(X_i)$. If $X_i$ satisfies BC with parameter b then $X_i - \mu \in \mathrm{subE}(2\sigma^2, 2b)$ and thus $\sum_{i=1}^n (X_i - \mu) \in \mathrm{subE}(2n\sigma^2, 2b)$. Crucially, this is significantly better than saying that $\sum_{i=1}^n (X_i - \mu)$ is a random variable with range $B \cdot n$ and variance $n\sigma^2$. However, as mentioned at the end of last section, using $\sum_{i=1}^n (X_i - \mu) \in \mathrm{subE}(2n\sigma^2, 2b)$ in (3.19) will lose a constant factor, so we directly repeat the proof of Lemma 8 with the sum of random variables:

Lemma 10 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be independent with $\mathbb{E}X_i = \mu$, $\mathrm{var}(X_i) = \sigma^2$, and range $|X_i - \mu| \le B$ almost surely. Then
\[ \mathbb{P}\left( \Big| \sum_{i=1}^n (X_i - \mu) \Big| \ge t \right) \le 2\exp\left\{ -\frac{t^2/2}{n\sigma^2 + Bt/3} \right\}. \tag{3.37} \]

You may encounter the normalized version
\[ \mathbb{P}\left( \Big| \frac{1}{n} \sum_{i=1}^n X_i - \mu \Big| \ge t \right) \le 2\exp\left\{ -\frac{nt^2/2}{\sigma^2 + Bt/3} \right\}, \tag{3.38} \]
from which we can read off the following transition between the two tails. If $t \le 3\sigma^2/B$, the tails are sub-Gaussian, while for $t \ge 3\sigma^2/B$ they are sub-exponential.
As already indicated by (3.36), in view of (3.33), it also holds that with probability at least 1 − δ,
\[ \Big| \frac{1}{n} \sum_{i=1}^n X_i - \mu \Big| \le \sqrt{\frac{2\sigma^2 \log(2/\delta)}{n}} + \frac{B \log(2/\delta)}{3n}. \tag{3.39} \]
We will give a short proof of this with a worse constant 2 in the last term. To this end, set
\[ \delta = 2\exp\left\{ -\frac{nt^2}{2\sigma^2 + 2Bt/3} \right\}, \]
which is equivalent to solving the quadratic equation
\[ t^2 - \frac{2B \log(2/\delta)}{3n}\, t - \frac{2\sigma^2 \log(2/\delta)}{n} = 0 \]
and thus
\[ t \le \sqrt{\frac{2\sigma^2 \log(2/\delta)}{n}} + \frac{2B \log(2/\delta)}{3n} \]
using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$. For the sharper constant 1 in (3.39), see (3.36).
Let us examine (3.39). We see that in the small-variance case, the last term dominates and it indicates a faster convergence rate in terms of n (though at the expense of $\log(1/\delta)$ rather than $\sqrt{\log(1/\delta)}$ dependence on precision).
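
To get a feel for the gain, one can compare the deviation in (3.39) with the one obtained from Hoeffding's inequality (2.13). The sketch below is illustrative; the function names are assumptions made here, and the two-sided Hoeffding width $(b - a)\sqrt{\log(2/\delta)/(2n)}$ is obtained by applying (2.13) to both tails and solving for the deviation of the average.

```python
import math


def bernstein_width(sigma2, B, n, delta):
    """Right-hand side of (3.39) for variance sigma2 and |X_i - mu| <= B."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * sigma2 * log_term / n) + B * log_term / (3.0 * n)


def hoeffding_width(value_range, n, delta):
    """Two-sided Hoeffding deviation for averages of variables with range b - a = value_range."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

For the three-point variable from the end of Section 2 with k = 10 (variance 1, B = 10, range 20), n = 1000 and δ = 0.05, the Bernstein width is roughly 0.1 while the Hoeffding width is roughly 0.86.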

3.3 Equivalent conditions


Just as sub-Gaussian, sub-exponential random variables have several equivalent definitions.

Lemma 11. Let X be a random variable with E[X] = 0. Then the following are equivalent, and the parameters $c_i > 0$ differ by at most absolute constant factors:

1. For all $|\lambda| < 1/c_1$, $\mathbb{E} \exp\{\lambda X\} \le \exp\{c_1^2 \lambda^2\}$.

2. For all $t \ge 0$, $\mathbb{P}(|X| \ge t) \le 2\exp\{-t/c_2\}$.

3. For all $p = 1, 2, \ldots$, $(\mathbb{E}|X|^p)^{1/p} \le c_3\, p$.

4. For all $\lambda \in [0, 1/c_4]$, $\mathbb{E} \exp\{\lambda |X|\} \le \exp\{c_4 \lambda\}$.

5. For some $c_5 < \infty$, $\mathbb{E} \exp\{|X|/c_5\} \le 2$.

In particular, from the last point we immediately conclude that X is sub-Gaussian if and only if $X^2$ is sub-exponential.

3.4 Application: Classification


Suppose $f : \mathcal{X} \to \{\pm 1\}$ is a classifier that we developed (e.g. by training on some data). Now, suppose we have validation data $(X_1, Y_1), \ldots, (X_n, Y_n)$ sampled i.i.d. from an unknown $P_{X \times Y}$. The indicator loss compares the output f(X) of the classifier on a point X to the label Y, and we denote it by $\mathbf{1}\{f(X_i) \ne Y_i\}$. The validation error is then $\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{f(X_i) \ne Y_i\}$, while the true expected error is $\mathbb{E}\mathbf{1}\{f(X) \ne Y\} = \mathbb{P}(f(X) \ne Y) =: p$. Note that the variance of the random variable $\mathbf{1}\{f(X) \ne Y\}$ is simply p(1 − p), since this is a Bernoulli random variable.
Suppose we observe that the validation error is 0. What can we conclude about the actual true expected error? The CLT would suggest we are $O(1/\sqrt{n})$ away.
Bernstein's inequality tells us that with probability at least $1 - e^{-u}$,
\[ \mathbb{E}\mathbf{1}\{f(X) \ne Y\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{f(X_i) \ne Y_i\} \le \sqrt{\frac{2p(1 - p)u}{n}} + \frac{u}{3n} \tag{3.40} \]

Under the event that the validation error is zero, and using $p(1 - p) \le p$, we have
\[ p \le \sqrt{\frac{2pu}{n}} + \frac{u}{3n}, \]
which means
\[ p \le \frac{4u}{n}. \]
Note that this is better than what we expected from the CLT. The effect is due to low
variance (more precisely, here variance is upper bounded by expectation itself). This type
of argument appears often in statistical learning. Of course, we would be interested in the
case that f itself was produced by minimizing error on the same data (in which case the
validation error is in fact training error). The issue of the dependence of f on the data
(and hence failure of CLT due to lack of independence) will be dealt with through notions
of uniform convergence in the second part of the course.
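
The step from the self-bounding inequality $p \le \sqrt{2pu/n} + u/(3n)$ to $p \le 4u/n$ can be made explicit by solving the inequality as a quadratic in $\sqrt{p}$. The sketch below (with a hypothetical function name) returns the largest p that satisfies it.

```python
import math


def zero_error_bound(n, u):
    """Largest p with p <= sqrt(2 p u / n) + u / (3 n), found by solving
    x^2 - a x - c <= 0 for x = sqrt(p), with a = sqrt(2u/n) and c = u/(3n)."""
    a = math.sqrt(2.0 * u / n)
    c = u / (3.0 * n)
    root = (a + math.sqrt(a * a + 4.0 * c)) / 2.0
    return root * root
```

The exact solution is about 2.62 u/n, so 4u/n is a convenient round upper bound. For example, with n = 1000 validation points and u = 3 (probability at least $1 - e^{-3}$), zero validation error certifies a true error of at most about 0.008.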

3.5 Norm concentration


We now revisit the example we considered earlier. Let $Y = \|g\|^2 = \sum_{i=1}^d g_i^2$ be the squared norm of a random Gaussian vector with i.i.d. N(0, 1) coordinates (Y has the $\chi^2_d$ distribution with d degrees of freedom). Recall that $g_i^2 \in \mathrm{subE}(2^2, 4)$ and thus $Y \in \mathrm{subE}(4d, 4)$. Thus,
\[ \mathbb{P}\left( \Big| \frac{1}{d} \sum_{i=1}^d g_i^2 - 1 \Big| \ge t \right) = \mathbb{P}(|Y - d| \ge dt) \le 2\exp\{-dt^2/8\}, \qquad t \in (0, 1) \tag{3.41} \]
where we only took one tail of the two-tail behavior (NB: the constant 8 can be improved).

3.6 The Johnson–Lindenstrauss (JL) Lemma


Let $u_1, \ldots, u_N \in \mathbb{R}^M$ be fixed vectors in M dimensions. If M is large, we may ask whether we can reduce the dimensionality while preserving the norms of these vectors (or, pairwise distances). A classical way to do this is via random projections. Let m be the target dimensionality, m < M. Let $\Gamma \in \mathbb{R}^{m \times M}$ be a random matrix with independent entries $\Gamma_{i,j} \sim N(0, 1)$. We will reduce dimensionality by mapping each $u_i \mapsto \frac{1}{\sqrt{m}} \Gamma u_i$. It remains to analyze how norms change under the action of this matrix.
First, fix a vector $v \in \mathbb{R}^M$ with $\|v\| = 1$. Then $\langle \Gamma_i, v \rangle \sim N(0, 1)$ where $\Gamma_i$ is the ith row of the matrix Γ. Then $\sum_{i=1}^m \langle \Gamma_i, v \rangle^2 = \|\Gamma v\|^2 \sim \chi^2_m$. As shown in the previous section,
\[ \mathbb{P}\left( \Big| \frac{1}{m} \|\Gamma v\|^2 - 1 \Big| \ge t \right) \le 2\exp\{-mt^2/8\}, \qquad t \in (0, 1). \]
Hence, if we define the map $F(u) = \frac{1}{\sqrt{m}} \Gamma u$, we have proved that for any $u \ne 0$, $u \in \mathbb{R}^M$,
\[ \mathbb{P}\left( \frac{\|F(u)\|^2}{\|u\|^2} \notin [1 - t, 1 + t] \right) \le 2\exp\{-mt^2/8\}, \qquad t \in (0, 1). \]
By a union bound, for $u_1, \ldots, u_N \in \mathbb{R}^M$,
\[ \mathbb{P}\left( \exists\, u_i \ne u_j :\; \frac{\|F(u_i) - F(u_j)\|^2}{\|u_i - u_j\|^2} \notin [1 - t, 1 + t] \right) \le 2\binom{N}{2} \exp\{-mt^2/8\}, \qquad t \in (0, 1) \]
since F is linear. By setting the right-hand side to δ, we have that with probability at least 1 − δ, all the pairwise squared distances are preserved up to multiplicative accuracy 1 ± t as long as
\[ m > \frac{16}{t^2} \log(N/\delta). \]
Interestingly, the dimension M does not enter this estimate for the target dimension.
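
The construction above is short enough to try out directly. The following sketch uses only the standard library and hypothetical helper names; the point set, seeds, and parameter values in the usage note are assumptions made for illustration.

```python
import math
import random


def jl_dimension(N, t, delta):
    """Smallest integer m with m > (16 / t^2) * log(N / delta)."""
    return math.floor(16.0 / t ** 2 * math.log(N / delta)) + 1


def random_projection(points, m, seed=0):
    """Apply F(u) = Gamma u / sqrt(m), with Gamma an m x M matrix of i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    M = len(points[0])
    gamma = [[rng.gauss(0.0, 1.0) for _ in range(M)] for _ in range(m)]
    scale = 1.0 / math.sqrt(m)
    return [[scale * sum(g * x for g, x in zip(row, u)) for row in gamma] for u in points]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distortion_ratios(points, projected):
    """Ratios ||F(u_i) - F(u_j)||^2 / ||u_i - u_j||^2 over all pairs i < j."""
    N = len(points)
    return [
        sq_dist(projected[i], projected[j]) / sq_dist(points[i], points[j])
        for i in range(N)
        for j in range(i + 1, N)
    ]
```

For example, with N = 5 points, t = 0.5 and δ = 0.2, `jl_dimension` gives m = 207 regardless of the ambient dimension M. Projecting points from M = 300 dimensions and calling `distortion_ratios` should then return ten ratios, all within [0.5, 1.5] with probability at least 0.8 over the draw of Γ.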

3.7 Norm concentration: from sub-Exponential to sub-Gaussian tails


As we have seen earlier, $Y - d = \|g\|^2 - d = \sum_{i=1}^d (g_i^2 - 1)$ is a sub-exponential random variable. Hence, we expect the square root $\sqrt{Y}$ (that is, the norm of the random Gaussian vector) to be, after centering, sub-Gaussian. In fact, we will show this for a general vector with sub-Gaussian entries.
Following the exposition in [33, Chap 3.1], let $X = (X_1, \ldots, X_d)$ be a vector with i.i.d. sub-Gaussian entries $X_i$ with mean zero and variance 1. Then, $\|X\|^2 - d = \sum_{i=1}^d (X_i^2 - 1)$ is sub-exponential and, hence, satisfies Bernstein's inequality
\[ \mathbb{P}\left( \Big| \frac{1}{d} \|X\|^2 - 1 \Big| \ge t \right) \le 2\exp\left\{ -Cd \min\{t^2, t\} \right\}, \tag{3.42} \]

where C depends on the sub-Gaussian constant of Xi . A simple trick will now convert
the two-tailed behavior for the square into single-tail behavior for the norm itself. First,
following [33], observe that |z − 1| ≥ t implies |z 2 − 1| ≥ max{t, t2 } for all z ≥ 0. Now,
let u = max{t, t2 } and observe that min{u, u2 } = min{max{t, t2 }, (max{t, t2 })2 } = t2 (by
considering cases). Hence,
   
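\begin{align}
\mathbb{P}\left( \left| \frac{1}{\sqrt{d}} \|X\| - 1 \right| \ge t \right)
&\le \mathbb{P}\left( \left| \frac{1}{d} \|X\|^2 - 1 \right| \ge \max\{t, t^2\} \right) \tag{3.43} \\
&= \mathbb{P}\left( \left| \frac{1}{d} \|X\|^2 - 1 \right| \ge u \right) \tag{3.44} \\
&\le 2 \exp\left\{ -C d \min\{u^2, u\} \right\} \tag{3.45} \\
&= 2 \exp\left\{ -C d t^2 \right\} \tag{3.46}
\end{align}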

for all $t > 0$. Thus, the norm of a random vector with sub-Gaussian entries is itself sub-Gaussian (after centering). Or, rescaling,
\[
\mathbb{P}\left( \left| \|X\| - \sqrt{d} \right| \ge t \right) \le 2 \exp\left\{ -C t^2 \right\}. \tag{3.47}
\]
This means that a vector with independent sub-Gaussian entries with variance 1 has norm that is tightly concentrated around the value $\sqrt{d}$. This is one of the most basic high-dimensional phenomena.
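A quick simulation illustrates the concentration of the norm around $\sqrt{d}$ in the Gaussian case. This is a sketch under the assumption of standard normal entries; the function name `norm_deviations`, the dimensions, and the number of trials are illustrative choices.

```python
import math
import random
import statistics


def norm_deviations(d: int, trials: int, rng: random.Random):
    """Return samples of ||g|| - sqrt(d) for g ~ N(0, I_d)."""
    out = []
    for _ in range(trials):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        out.append(math.sqrt(sq_norm) - math.sqrt(d))
    return out


rng = random.Random(1)
# Standard deviation of the centered norm for growing dimension d.
spread = {d: statistics.pstdev(norm_deviations(d, 400, rng)) for d in (10, 100, 1000)}
for d, s in spread.items():
    print(d, round(s, 3))
```

The spread of $\|g\| - \sqrt{d}$ should stay of constant order as $d$ grows, even though $\|g\|$ itself grows like $\sqrt{d}$.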

3.8 From isotropic to anisotropic vectors


Recall that in (3.41), we treated the $\chi^2_1$ random variable $g^2$ (where $g \sim N(0, 1)$) as a $\mathrm{subE}(2^2, 4)$ random variable. However, $\chi^2_1$ has distinct upper and lower tails, and so it may be beneficial to consider one-sided tail bounds. We will only focus on the upper tail. For this purpose, recall (3.16), which holds for $\lambda \in (0, 1/2)$. Then, using the Cramér-Chernoff bound with the choice $\lambda = \frac{t}{2t+2} \in (0, 1/2)$, we arrive at the one-sided bound
\[
\mathbb{P}\left( g^2 - 1 \ge t \right) \le \exp\left\{ -\frac{t^2/2}{2 + 2t} \right\} \tag{3.48}
\]
or, in view of (3.16), from (3.36),
\[
\mathbb{P}\left( g^2 - 1 \ge 2\sqrt{t} + 2t \right) \le \exp\{-t\}. \tag{3.49}
\]

The rest of this subsection follows easily by checking what happens to (3.16) for the sum of random variables. For $\sum_{i=1}^d g_i^2 \sim \chi^2_d$, where $g_i \sim N(0, 1)$ independently,
\[
\mathbb{P}\left( \sum_{i=1}^d g_i^2 - d \ge t \right) \le \exp\left\{ -\frac{t^2/2}{2d + 2t} \right\} \tag{3.50}
\]
or, from (3.36),
\[
\mathbb{P}\left( \sum_{i=1}^d g_i^2 - d \ge 2\sqrt{dt} + 2t \right) \le \exp\{-t\}. \tag{3.51}
\]

For $a = (a_1, \ldots, a_d)$, $a_i \ge 0$, the above tail bound is easily extended to
\[
\mathbb{P}\left( \sum_{i=1}^d a_i g_i^2 - \|a\|_1 \ge 2 \|a\|_2 \sqrt{t} + 2 \|a\|_\infty t \right) \le \exp\{-t\}. \tag{3.52}
\]

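Let us write $g \sim N(0, I_d)$. Let $A \in \mathbb{R}^{d \times d}$ and let $\Sigma = A^T A$. We have that the mean is
\[
\mathbb{E} \|Ag\|^2 = \mathbb{E}\, g^T A^T A g = \mathrm{tr}(A^T A\, \mathbb{E}\, g g^T) = \mathrm{tr}(\Sigma).
\]
Since $\Sigma$ is positive semidefinite, it has an eigendecomposition (equivalently, an SVD) $\Sigma = U \Lambda U^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. Then $\Sigma^{1/2} g \sim N(0, \Sigma)$ and
\[
\|Ag\|^2 = g^T \Sigma g = \left\| \Sigma^{1/2} g \right\|^2 \overset{d}{=} \left\| \Lambda^{1/2} g \right\|^2 = \sum_{i=1}^d \lambda_i g_i^2
\]
by the rotational invariance of the multivariate normal. Then (3.52) implies
\[
\mathbb{P}\left( \|Ag\|^2 - \mathrm{tr}(\Sigma) \ge 2\sqrt{\mathrm{tr}(\Sigma^2)\, t} + 2 \|\Sigma\| t \right) \le \exp\{-t\} \tag{3.53}
\]
because $\Sigma$ and $\Lambda$ share the same set of eigenvalues. Note that $d$ does not explicitly appear, except through the trace of the eigenvalues.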
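As a sanity check on the weighted tail bound (3.52), and hence on (3.53) with eigenvalues $\lambda_i = a_i$, one can compare empirical tail frequencies with $\exp\{-t\}$. The sketch below assumes a hypothetical spectrum $\lambda_i = 1/i$ with $d = 20$; the function name and all numerical choices are illustrative.

```python
import math
import random


def tail_frequency(weights, t, trials, rng):
    """Empirical P( sum a_i g_i^2 - ||a||_1 >= 2 ||a||_2 sqrt(t) + 2 ||a||_inf t )."""
    l1 = sum(weights)
    l2 = math.sqrt(sum(a * a for a in weights))
    linf = max(weights)
    threshold = 2.0 * l2 * math.sqrt(t) + 2.0 * linf * t
    hits = 0
    for _ in range(trials):
        quad = sum(a * rng.gauss(0.0, 1.0) ** 2 for a in weights)
        if quad - l1 >= threshold:
            hits += 1
    return hits / trials


rng = random.Random(2)
eigenvalues = [1.0 / (i + 1) for i in range(20)]   # assumed spectrum, for illustration
results = {t: tail_frequency(eigenvalues, t, 5000, rng) for t in (0.5, 1.0, 2.0)}
for t, freq in results.items():
    print(t, freq, math.exp(-t))
```

The empirical frequencies should sit below $\exp\{-t\}$, typically with room to spare since the bound is not tight for every $t$.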
The tail bound (3.53) was proved in [14] for sub-Gaussian (rather than Gaussian) mean-
zero vectors, with the same constants as above. More precisely, a centered random vector
x ∈ Rd is sub-Gaussian with variance proxy v 2 if for any unit vector u ∈ Rd , hx, ui ∈
subG(v 2 ).
We also remark that tail bounds on kAxk2 = x T A T Ax have been proved in [29] for
more general quadratic forms x T Bx, where B is not necessarily psd, but with stronger
independence assumptions on coordinates of x. Such bounds are known as Hanson-Wright
inequalities.

4. MEAN ESTIMATION

4.1 High-dimensional mean estimation


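Now, let $X_1, \ldots, X_n \sim N(\mu, \Gamma)$ be independent multivariate Gaussian vectors in $\mathbb{R}^d$, and recall that $X_i - \mu = \Gamma^{1/2} Z_i$ where $Z_i \sim N(0, I_d)$. To estimate the mean, it is natural to take the sample average $\bar{X}_n$. Its quality can be measured by $\left\| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right\|^2$. Without loss of generality, for the purposes of analysis we can set $\mu = 0$. Then the quality of the estimate is simply
\[
\left\| \frac{1}{n} \sum_{i=1}^n X_i \right\|^2 = n^{-1} \left\| \Gamma^{1/2} \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n Z_i \right) \right\|^2 \overset{d}{=} n^{-1} \left\| \Gamma^{1/2} Z \right\|^2
\]
for an independent $Z \sim N(0, I_d)$. From (3.53) with $A = (\Gamma/n)^{1/2}$ (and hence $\Sigma = n^{-1} \Gamma$),
\[
\mathbb{P}\left( \left\| \frac{1}{n} \sum_{i=1}^n X_i \right\|^2 - \frac{\mathrm{tr}(\Gamma)}{n} \ge \frac{2\sqrt{\mathrm{tr}(\Gamma^2)\, t}}{n} + \frac{2 \|\Gamma\| t}{n} \right) \le \exp\{-t\}. \tag{4.54}
\]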

We conclude that when the error is measured in squared Euclidean norm, the expected error
is tr(Γ)/n and the deviations above this expectation are given by the two tails in terms of
the trace of Γ. Since dimension d never appears in these bounds, it can be very large or
infinite, as long as the covariance matrix Γ has a fast decay of eigenvalues.
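As in (3.47), if we instead consider the norm rather than the squared norm, we again only have the sub-Gaussian behavior. More precisely,
\[
\mathbb{P}\left( \left\| \frac{1}{n} \sum_{i=1}^n X_i \right\| - \sqrt{\frac{\mathrm{tr}(\Gamma)}{n}} \ge \sqrt{\frac{2 \|\Gamma\| t}{n}} \right) \le \exp\{-t\}. \tag{4.55}
\]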

We refer to [33, p. 135] for the proof of this fact, along the lines of our earlier conversion
to sub-Gaussian tails for the norm itself.
We see that the trace of the covariance matrix, tr(Γ), serves as the effective dimension
of the problem. Indeed, it replaces the actual dimension d that would arise if we used (3.47)
with identity covariance.
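The role of $\mathrm{tr}(\Gamma)$ as an effective dimension can be seen in a small Monte Carlo experiment. The sketch below assumes a diagonal covariance with eigenvalues $1/j^2$ (a hypothetical choice) and draws $\bar{X}_n$ directly from $N(0, \Gamma/n)$, using the distributional identity above rather than averaging $n$ samples.

```python
import random


def mean_sq_error(eigs, n, trials, rng):
    """Monte Carlo estimate of E||Xbar||^2 for Xbar ~ N(0, Gamma / n), Gamma = diag(eigs)."""
    total = 0.0
    for _ in range(trials):
        total += sum(lam / n * rng.gauss(0.0, 1.0) ** 2 for lam in eigs)
    return total / trials


rng = random.Random(3)
n = 50
results = {}
for d in (10, 100, 1000):
    eigs = [1.0 / (j + 1) ** 2 for j in range(d)]   # assumed fast eigenvalue decay
    results[d] = (mean_sq_error(eigs, n, 800, rng), sum(eigs) / n)
    print(d, results[d])
```

With this spectrum the error barely changes between $d = 10$ and $d = 1000$, because $\mathrm{tr}(\Gamma)$ stays bounded as $d$ grows.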

4.2 Median of means and heavy-tailed distributions


Consider mean estimation in 1 dimension, but let us not assume sub-Gaussianity. We will
see that sample average itself is not a good estimate of the unknown mean, and a small
modification will be needed.
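Let $X_1, \ldots, X_n$ be i.i.d. from a distribution $P$ with finite mean $\mu$ and variance $\sigma^2$. From Chebyshev's inequality,
\[
\mathbb{P}\left( |\bar{X}_n - \mu| \ge t \right) \le \frac{\sigma^2/n}{t^2},
\]
or, with probability at least $1 - 2\delta$,
\[
|\bar{X}_n - \mu| \le \sigma \sqrt{\frac{1}{2n\delta}}.
\]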
Unlike the sub-Gaussian tails we had studied so far, the above tails have a polynomial
dependence on t, or 1/δ. One may ask whether it is due to a sub-optimal choice of Cheby-
shev’s inequality. However, the result of Catoni tells us that Chebyshev here is essentially
unimprovable:

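Lemma 12 (Catoni [7]). For any $\delta \in (0, (2e)^{-1})$ and $\sigma^2 > 0$, there exists a distribution $P$ with mean 0 and variance $\sigma^2$ such that
\[
\mathbb{P}\left( |\bar{X}_n| \ge \sigma \sqrt{\frac{1}{2n\delta}} \left( 1 - \frac{2e\delta}{n} \right)^{\frac{n-1}{2}} \right) \ge 2\delta.
\]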

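Proof. Let $X_1, \ldots, X_n$ be i.i.d. with $\mathbb{E} X_i = 0$, and assume without loss of generality that $\sigma = 1$ (since we can divide through by $\sigma$). Fix $t$, to be chosen later, and define
\[
\mathbb{P}(X_i = nt) = \mathbb{P}(X_i = -nt) = \frac{1}{2n^2 t^2}
\]
and
\[
\mathbb{P}(X_i = 0) = 1 - \frac{1}{n^2 t^2}.
\]
We verify that
\[
\mathbb{E} X_i^2 = \mathrm{var}(X_i) = n^2 t^2 \cdot \frac{1}{n^2 t^2} = 1.
\]
Then for $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$,
\[
\mathbb{P}\left( \bar{X} \ge t \right) = \mathbb{P}\left( \bar{X} \le -t \right) \ge \mathbb{P}\left( \bar{X} = t \right) \ge \frac{n}{2n^2 t^2} \left( 1 - \frac{1}{n^2 t^2} \right)^{n-1}, \tag{4.56}
\]
since $\bar{X} = t$ is achieved whenever exactly one of the variables equals $nt$ and the rest are 0. Now choose
\[
t = \sqrt{\frac{1}{2n\delta}} \left( 1 - \frac{2e\delta}{n} \right)^{\frac{n-1}{2}}.
\]
If we show that the right-hand side of (4.56) with this value of $t$ is at least $\delta$, we will be done. This amounts to proving
\[
\left( 1 - \frac{2\delta}{n \left( 1 - \frac{2e\delta}{n} \right)^{n-1}} \right)^{n-1} \ge \left( 1 - \frac{2e\delta}{n} \right)^{n-1}, \tag{4.57}
\]
which is true if
\[
1 - \frac{2e\delta}{n} \le 1 - \frac{2\delta}{n} \cdot \frac{1}{\left( 1 - \frac{2e\delta}{n} \right)^{n-1}}.
\]
The last statement is true since
\[
\left( 1 - \frac{2e\delta}{n} \right)^{n-1} \ge 1/e.
\]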

Since the sample mean does not exhibit sub-Gaussian tail behavior in our heavy-tailed situation, the goal is to change the estimator itself. Perhaps this is the first "non-trivial" estimator in this course, since we only analyzed averages so far.
For simplicity of exposition, suppose $n = km$ with $k, m \ge 1$ integers. Define the median-of-means estimator as
\[
\hat{\mu} = \mathrm{median}\left( \frac{1}{m} \sum_{i=1}^{m} X_i, \ \ldots, \ \frac{1}{m} \sum_{i=(k-1)m+1}^{km} X_i \right). \tag{4.58}
\]

Lemma 13. Let $\delta \in (0, 1)$, $k = c \log(1/\delta)$, $m = \frac{n}{c \log(1/\delta)}$, for some absolute constant $c$. Then, with probability at least $1 - \delta$,
\[
|\hat{\mu} - \mu| \le \sigma \sqrt{\frac{c \log(1/\delta)}{n}}.
\]

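Proof. For each batch of size $m$, Chebyshev's inequality gives
\[
\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^m X_i - \mu \right| \ge \frac{2\sigma}{\sqrt{m}} \right) \le \frac{1}{4}. \tag{4.59}
\]
The corresponding bad event for the $j$th batch can be denoted by
\[
Y_j = \mathbf{1}\left\{ \left| \frac{1}{m} \sum_{i=(j-1)m+1}^{jm} X_i - \mu \right| \ge \frac{2\sigma}{\sqrt{m}} \right\}.
\]
Note that the $Y_j$ are i.i.d. Bernoulli with bias $p \le 1/4$. The median can deviate from $\mu$ by $2\sigma/\sqrt{m}$ only if at least half of the batches are bad, so
\[
\mathbb{P}\left( |\hat{\mu} - \mu| \ge \frac{2\sigma}{\sqrt{m}} \right) \le \mathbb{P}\left( \sum_{j=1}^k Y_j \ge k/2 \right) \le \mathbb{P}\left( \sum_{j=1}^k (Y_j - \mathbb{E} Y_j) \ge k/4 \right) \le \exp\left\{ -\frac{(k/4)^2}{k} \right\}.
\]
The right-hand side equals $\exp\{-k/16\}$, which is at most $\delta$ once $k \ge 16 \log(1/\delta)$.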
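The estimator (4.58) is short to implement. The following is a minimal sketch; the function name `median_of_means`, the Pareto test distribution, and the choice $k = 9$ are illustrative assumptions rather than anything prescribed by the notes.

```python
import random
import statistics


def median_of_means(samples, k):
    """Split the first k*m samples into k consecutive blocks of size m = len // k
    and return the median of the k block means."""
    m = len(samples) // k
    if m == 0:
        raise ValueError("need at least k samples")
    block_means = [sum(samples[j * m:(j + 1) * m]) / m for j in range(k)]
    return statistics.median(block_means)


rng = random.Random(4)
# Heavy-tailed example: Pareto with shape 3 has finite variance and mean 3 / (3 - 1) = 1.5.
data = [rng.paretovariate(3.0) for _ in range(9000)]
estimate = median_of_means(data, k=9)
print(estimate)
```

Choosing $k$ larger buys a smaller failure probability at the price of a smaller block size $m$, which is the trade-off behind Lemma 13.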

A disadvantage of the median-of-means estimator is that $k$ should be chosen as a function of the target confidence level $\delta$. One may ask whether there exist estimators with sub-Gaussian tails that work for all (or at least a nontrivial range of) $\delta$. Surprisingly, if $\sigma$ is not known, this is impossible [9].
Finally, the idea of median-of-means has been extended to multivariate distributions.
Here, a certain median-of-means tournament by [22] is shown to obtain the sub-Gaussian
behavior (4.55). [13] achieved the first poly-time algorithm with this behavior.

4.3 Sparse mean estimation and the Gaussian Sequence Model


In Section 4.1, we considered the problem of estimating the mean of a high-dimensional
vector. We saw that the trace of the covariance matrix serves as an effective dimension of the
problem and determines the number n of samples needed to achieve a certain accuracy. Low
trace of the covariance matrix is a coordinate-free notion of “simplicity” of the distribution
of the high-dimensional random variable. In contrast, here we will make an assumption
about the mean of the distribution.
First, let us extend the definition of sub-Gaussianity to random vectors.

Definition 4. A vector-valued random variable $X \in \mathbb{R}^d$ with mean $\mu$ is $v^2$-sub-Gaussian if for all $u \in \mathbb{R}^d$ with $\|u\| = 1$,
\[
\langle X - \mu, u \rangle \in \mathrm{subG}(v^2).
\]

Equivalently, we can state the definition as: hX − µ, ui ∈ subG(v 2 kuk2 ) for any u ∈
Rd . In other words, a vector is sub-Gaussian if all its 1-dimensional marginals are sub-
Gaussian. In particular, by choosing standard basis vectors, sub-Gaussianity of X implies
sub-Gaussianity of its coordinates, and thus the variance of each coordinate of X is at
most v 2 . Note that sub-Gaussianity of the vector does not require independence of the
coordinates.
Let $X_1, \ldots, X_n$ be i.i.d. from a $v^2$-sub-Gaussian distribution with mean $\mu$. Let us estimate $\mu$ by $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. Then it is easy to see that $\bar{X}$ is $v^2/n$-sub-Gaussian. Equivalently, we can think of observing a single vector from the model

Y =µ+ε (4.60)

with ε ∈ subG(v 2 /n), Eε = 0, and we are observing one vector realization Y = X̄. While
ε is sub-Gaussian, we will not assume independence of the coordinates. The model in
(4.60)—called the Gaussian Sequence Model —has been studied extensively (usually under
the Gaussian assumption on ε).
The model in (4.60) is also a prototypical example of a statistical problem in the form
“observation = signal + noise,” with the goal of denoising the observation and estimating
the signal, under various structural assumptions. We will study one such example now.
Suppose $\mu \in \mathbb{R}^d$ is $k$-sparse:
\[
\|\mu\|_0 = \sum_{j=1}^d \mathbf{1}\{\mu_j \ne 0\} = k.
\]

Returning $Y$ (or, in our earlier example, the sample mean $\bar{X}$) as an estimate of $\mu$ may be suboptimal if we know that $\mu$ is sparse. Indeed, the mean squared error of the estimator $\hat{\mu} = Y$ is
\[
\mathbb{E} \|\mu - \hat{\mu}\|^2 = \mathbb{E} \|\varepsilon\|^2 \propto \frac{d}{n}.
\]
In particular, this ignores the sparsity of the mean vector µ.
A natural modification is to threshold the coordinates of $Y$. Given $\delta$, consider the event $E = \{|\varepsilon_i| \le \lambda \text{ for all } i \in [d]\}$, where $\lambda$ will be chosen later as a function of $\delta, n, d$ such that this event holds with probability at least $1 - \delta$. Define a "kill-or-keep" estimate $\hat{\mu}^{HT} \in \mathbb{R}^d$ by
\[
\hat{\mu}^{HT}_i = Y_i \mathbf{1}\{|Y_i| > \lambda\}.
\]

The superscript here stands for "hard thresholding." How close is this estimate to $\mu$? In the case that $\mu_i = 0$, under the event $E$ it holds that $|Y_i| \le \lambda$, and thus $\hat{\mu}^{HT}_i = 0$ and $|\hat{\mu}^{HT}_i - \mu_i| = 0$ (that is, the coordinate was zeroed out correctly). If, on the other hand, $\mu_i \ne 0$ (i.e. $i$ is one of the $k$ non-zero coordinates), then
\[
|\hat{\mu}^{HT}_i - \mu_i| \le |\hat{\mu}^{HT}_i - Y_i| + |Y_i - \mu_i| \le 2\lambda.
\]
Putting these together, we have
\[
|\hat{\mu}^{HT}_i - \mu_i| \le 2\lambda \mathbf{1}\{\mu_i \ne 0\}
\]
or
\[
\left\| \hat{\mu}^{HT} - \mu \right\|^2 \le 4\lambda^2 \|\mu\|_0. \tag{4.61}
\]

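It remains to calculate $\lambda$ such that $\mathbb{P}(E) \ge 1 - \delta$. To this end, note that for any $i = 1, \ldots, d$, sub-Gaussianity (with parameter $v^2/n$) implies
\[
\mathbb{P}\left( |\varepsilon_i| > \frac{v}{\sqrt{n}} \sqrt{2 \log(2/\delta)} \right) \le \delta. \tag{4.62}
\]
By the union bound,
\[
\mathbb{P}\left( \forall i \in [d], \ |\varepsilon_i| \le \frac{v}{\sqrt{n}} \sqrt{2 \log(2d/\delta)} \right) \ge 1 - \delta. \tag{4.63}
\]
Hence, from (4.61) with $\lambda = \frac{v}{\sqrt{n}} \sqrt{2 \log(2d/\delta)}$, with probability at least $1 - \delta$,
\[
\left\| \hat{\mu}^{HT} - \mu \right\|^2 \le \frac{8 v^2 k \log(2d/\delta)}{n}. \tag{4.64}
\]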
A few remarks:
• The thresholding method requires the knowledge of v and δ, but not the sparsity
parameter k. In this sense, the method is adaptive to the unknown sparsity, i.e.
attains the rate in (4.64) that depends on k despite not knowing it.
• If the goal is to recover the correct support (i.e. non-zero entries) of µ, we need
to make an assumption about “signal strength,” i.e. that the minimum value of a
nonzero entry of µ is at least, say, 3λ. This ensures that signal can be separated from
the noisy values outside the support. In this case, we can threshold Y at the value of
2λ.
The hard thresholding estimator is a discontinuous function. Another popular thresholding scheme is soft thresholding, defined as
\[
\hat{\mu}^{ST}_i =
\begin{cases}
Y_i - \lambda, & Y_i > \lambda \\
0, & |Y_i| \le \lambda \\
Y_i + \lambda, & Y_i < -\lambda
\end{cases} \tag{4.65}
\]
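Both thresholding rules are one-liners in code. The sketch below implements them together with the threshold level from the union bound; the function names and the simulated instance (Gaussian noise, $d = 1000$, $k = 5$, non-zero means equal to 2) are hypothetical choices made for illustration.

```python
import math
import random


def hard_threshold(y, lam):
    """Kill-or-keep: retain Y_i when |Y_i| > lam, otherwise output 0."""
    return [yi if abs(yi) > lam else 0.0 for yi in y]


def soft_threshold(y, lam):
    """Shrink each coordinate toward 0 by lam, following the three cases of (4.65)."""
    out = []
    for yi in y:
        if yi > lam:
            out.append(yi - lam)
        elif yi < -lam:
            out.append(yi + lam)
        else:
            out.append(0.0)
    return out


def threshold_level(v, n, d, delta):
    """lambda = (v / sqrt(n)) * sqrt(2 log(2d / delta)), the level from the union bound."""
    return v / math.sqrt(n) * math.sqrt(2.0 * math.log(2.0 * d / delta))


# Hypothetical instance: k non-zero means, independent N(0, v^2 / n) noise.
rng = random.Random(6)
d, k, n, v, delta = 1000, 5, 100, 1.0, 0.1
mu = [2.0] * k + [0.0] * (d - k)
eps = [rng.gauss(0.0, v / math.sqrt(n)) for _ in range(d)]
y = [m_i + e_i for m_i, e_i in zip(mu, eps)]
lam = threshold_level(v, n, d, delta)
mu_ht = hard_threshold(y, lam)
err_ht = sum((a - b) ** 2 for a, b in zip(mu_ht, mu))   # error after thresholding
err_raw = sum(e * e for e in eps)                        # error of the raw estimate Y
print(lam, err_ht, err_raw)
```

On the event that all noise coordinates are at most `lam` in absolute value, `err_ht` is guaranteed to be at most `4 * lam**2 * k`, while `err_raw` scales with `d / n`.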
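Finally, we mention that both hard and soft thresholding schemes can be written in the form
\[
\hat{\mu}^{HT} = \operatorname*{argmin}_{\hat{\mu}} \ \|Y - \hat{\mu}\|^2 + \lambda^2 \|\hat{\mu}\|_0 \tag{4.66}
\]
and
\[
\hat{\mu}^{ST} = \operatorname*{argmin}_{\hat{\mu}} \ \|Y - \hat{\mu}\|^2 + 2\lambda \|\hat{\mu}\|_1, \tag{4.67}
\]
where the factor 2 in front of $\lambda$ makes the minimizer in (4.67) coincide with (4.65) at threshold $\lambda$ (setting the derivative $2(\hat{\mu}_i - Y_i) \pm 2\lambda$ to zero gives $\hat{\mu}_i = Y_i \mp \lambda$). To see the first one, note that the objective decomposes coordinate-wise, and for each coordinate we have
\[
\hat{\mu}^{HT}_i = \operatorname*{argmin}_{\hat{\mu}_i \in \mathbb{R}} \ (Y_i - \hat{\mu}_i)^2 + \lambda^2 \mathbf{1}\{\hat{\mu}_i \ne 0\}. \tag{4.68}
\]
If $\lambda^2 \ge Y_i^2$, the solution is attained at 0; otherwise at $Y_i$.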


In contrast to hard thresholding, the reformulation in (4.67) is convex. While for the
Gaussian Sequence Model it may not matter computationally, such a convex reformulation
helps in regression setups considered later in the course.
The model in (4.60) is also called a direct observation model since we are observing
the signal µ directly perturbed by noise. After a short detour into maximal inequalities,
we turn to linear regression, a problem where the parameter vector is observed via linear
measurements (indirect observations).

5. MAXIMAL INEQUALITIES: BASIC RESULTS

Before diving into linear regression, we make a brief detour and talk about maximal inequal-
ities. This topic is a precursor to the more detailed study of the suprema of sub-Gaussian
and empirical processes.
First, recall several basic notions. Given a norm $\|\cdot\|$ (say, on $\mathbb{R}^d$, although this extends to Banach spaces), the dual norm is defined as
\[
\|v\|_* = \sup_{\|u\| \le 1} \langle u, v \rangle.
\]
The $\ell_p$ norm is dual to $\ell_q$ with $1/p + 1/q = 1$, $p, q \ge 1$. In particular, $\ell_1$ is dual to $\ell_\infty$, while $\ell_2$ is dual to itself.
Next, recall that the maximum of a linear function over a bounded set is achieved at the vertices. More precisely, for any $V \subset \mathbb{R}^d$, $a \in \mathbb{R}^d$,
\[
\sup_{u \in \mathrm{conv}(V)} \langle a, u \rangle = \sup_{u \in V} \langle a, u \rangle.
\]

Next, we prove the following straightforward result.

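Lemma 14. Let $Z = (Z_1, \ldots, Z_d)$ be a centered random vector with $Z \in \mathrm{subG}(\sigma^2)$. Then
\[
\mathbb{E} \|Z\|_2 \le \sigma \sqrt{d}, \tag{5.69}
\]
\[
\mathbb{E} \max_i Z_i \le \sigma \sqrt{2 \log d}, \tag{5.70}
\]
and
\[
\mathbb{E} \|Z\|_\infty = \mathbb{E} \max_i |Z_i| \le \sigma \sqrt{2 \log(2d)}. \tag{5.71}
\]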

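Proof. First,
\[
\mathbb{E} \|Z\|_2 = \mathbb{E} \sqrt{\|Z\|_2^2} \le \sqrt{\mathbb{E} \|Z\|_2^2} = \sqrt{\sum_{i=1}^d \mathbb{E} Z_i^2} \le \sigma \sqrt{d}, \tag{5.72}
\]
where we used Jensen's inequality and the fact that the variance of a random variable is at most its sub-Gaussian parameter (homework). Next, we prove (5.70). For any $\lambda > 0$,
\begin{align}
\mathbb{E} \max_i Z_i &= \frac{1}{\lambda} \mathbb{E} \max_{i=1,\ldots,d} \log \exp\{\lambda Z_i\} \tag{5.73} \\
&= \frac{1}{\lambda} \mathbb{E} \log \max_{i=1,\ldots,d} \exp\{\lambda Z_i\} \tag{5.74} \\
&\le \frac{1}{\lambda} \log \mathbb{E} \max_{i=1,\ldots,d} \exp\{\lambda Z_i\} \tag{5.75} \\
&\le \frac{1}{\lambda} \log \mathbb{E} \sum_{i=1}^d \exp\{\lambda Z_i\} \tag{5.76} \\
&\le \frac{1}{\lambda} \log \left( d \exp\{\lambda^2 \sigma^2/2\} \right), \tag{5.77}
\end{align}
which is equal to
\[
\frac{1}{\lambda} \log d + \frac{\lambda \sigma^2}{2} = \sigma \sqrt{2 \log d} \tag{5.78}
\]
upon choosing $\lambda = \sqrt{\frac{2 \log d}{\sigma^2}}$. The estimate on $\|Z\|_\infty$ follows by considering the $2d$ variables $Z_1, \ldots, Z_d, -Z_1, \ldots, -Z_d$.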
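The bound on the expected maximum is easy to compare with simulation. The sketch below assumes independent $N(0, \sigma^2)$ coordinates (independence is not needed for the lemma, it is just convenient here); the function name and sample sizes are illustrative.

```python
import math
import random


def expected_max(d, sigma, trials, rng):
    """Monte Carlo estimate of E max_i Z_i for d independent N(0, sigma^2) variables."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(d))
    return total / trials


rng = random.Random(5)
sigma = 1.0
# For each d, store (simulated expected maximum, bound sigma * sqrt(2 log d)).
table = {d: (expected_max(d, sigma, 300, rng), sigma * math.sqrt(2.0 * math.log(d)))
         for d in (10, 100, 1000)}
for d, (estimate, bound) in table.items():
    print(d, round(estimate, 3), round(bound, 3))
```

Both columns grow like $\sqrt{\log d}$, so multiplying the number of variables by 100 only moves the maximum by a modest amount.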

We also leave the following as an exercise:

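Lemma 15. Let $Z_1, \ldots, Z_d$ be real-valued centered random variables satisfying
\[
\mathbb{E} \exp\{\lambda Z_i\} \le \exp\left\{ \frac{\lambda^2 v^2}{2(1 - b\lambda)} \right\}, \qquad 0 < \lambda < 1/b.
\]
Then
\[
\mathbb{E} \max_i Z_i \le \sqrt{2 v^2 \log d} + b \log d. \tag{5.79}
\]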

6. LINEAR REGRESSION

We now introduce the problem of linear regression, make the connection to the Gaussian
Sequence Model, and motivate the need to study maximal inequalities.
Consider the model
\[
Y_i = \langle x_i, \beta^* \rangle + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{6.80}
\]
where $\beta^*, x_i \in \mathbb{R}^d$. Assume the zero-mean noise satisfies $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \mathrm{subG}(\sigma^2)$. As usual, we assume that the $x_i$'s and $Y_i$'s are observed, but the parameter vector $\beta^*$ is unknown. We can write the $n$ equations together as
\[
Y = X\beta^* + \varepsilon \tag{6.81}
\]

where X is the n × d matrix with xi as rows, and Y = (Y1 , . . . , Yn ) T . For now, we will think
of the matrix X as being fixed and given to us (this is called fixed design). In later parts
of the course, we will work under the assumption that x1 , . . . , xn are drawn i.i.d. from a
distribution (i.e. random design).

6.1 Connection to the Gaussian Sequence Model
Multiplying both sides of (6.81) by $\frac{1}{n} X^T$ yields
\[
\frac{1}{n} X^T Y = \frac{1}{n} X^T X \beta^* + \frac{1}{n} X^T \varepsilon. \tag{6.82}
\]
Consider the following assumption on the matrix $X$:

Definition 5. If $\frac{1}{n} X^T X = I_d$, we say that the design (that is, the set $\{x_1, \ldots, x_n\}$) is orthonormal.

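In addition to orthogonality, the above definition implies that $\|x_i\|^2 = n$, which is what we would expect if coordinates of $x_i$ were independent. In this case, (6.82) becomes
\[
\tilde{Y} = \beta^* + \tilde{\varepsilon} \tag{6.83}
\]
where $\tilde{\varepsilon} = \frac{1}{n} X^T \varepsilon$ and $\tilde{Y} = \frac{1}{n} X^T Y$. If $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \sim N(0, I_n)$, then $\tilde{\varepsilon}$ is also Gaussian and mean-zero, and its covariance is $\mathbb{E} \tilde{\varepsilon} \tilde{\varepsilon}^T = \frac{1}{n^2} \mathbb{E} X^T \varepsilon \varepsilon^T X = \frac{1}{n} I_d$. Hence, $\tilde{\varepsilon}_i \sim N(0, 1/n)$. We see that in orthonormal design, regression becomes the Gaussian Sequence Model. Furthermore, the problem of estimating $\beta^*$ with respect to the Euclidean norm becomes equivalent to the problem of bounding the prediction error: for an estimator $\hat{\beta}$,
\[
\left\| \hat{\beta} - \beta^* \right\|_2^2 = (\hat{\beta} - \beta^*)^T (\hat{\beta} - \beta^*) = (\hat{\beta} - \beta^*)^T \left( \frac{1}{n} X^T X \right) (\hat{\beta} - \beta^*) = \frac{1}{n} \sum_{i=1}^n \left( \langle x_i, \hat{\beta} \rangle - \langle x_i, \beta^* \rangle \right)^2. \tag{6.84}
\]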

6.2 Estimation, de-noising, and fixed design.


Several goals can be set for analyzing linear regression (or, any other estimator). Let us
mention a few that will be central for the rest of the course.
The first goal is estimation in some measure of distance on the space of parameters. For instance, a natural measure is $\| \hat{\beta} - \beta^* \|_2^2$ for the distance between the estimator $\hat{\beta}$ and the true parameter.
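Another objective is to provide a bound on the error of the form
\[
\frac{1}{n} \sum_{i=1}^n \left( \langle x_i, \hat{\beta} \rangle - \langle x_i, \beta^* \rangle \right)^2,
\]
which can be called a "fixed-design error" or "de-noising objective." In other words, $x_1, \ldots, x_n$ are fixed, and we are interested in prediction (or de-noising of the $y$ values) on these very points. Following the calculation in (6.84), but without assuming orthonormal design,
\[
\frac{1}{n} \sum_{i=1}^n \left( \langle x_i, \hat{\beta} \rangle - \langle x_i, \beta^* \rangle \right)^2 = \frac{1}{n} \left\| X\hat{\beta} - X\beta^* \right\|_2^2 = \left\| \hat{\beta} - \beta^* \right\|_\Sigma^2
\]
where $\Sigma = \frac{1}{n} \sum_{i=1}^n x_i x_i^T = \frac{1}{n} X^T X$.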

6.3 Unconstrained Least Squares
We now go back to the model (6.80), without the assumption on the matrix X. Our goal
in this section will be to upper bound the fixed-design error.
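Let $\hat{\beta}$ be the least-squares solution
\[
\hat{\beta} \in \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \|X\beta - Y\|^2 = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (Y_i - \langle x_i, \beta \rangle)^2. \tag{6.85}
\]
Setting the gradient of the objective to zero,
\[
X^T X \hat{\beta} = X^T Y
\]
and thus
\[
\hat{\beta} = (X^T X)^\dagger X^T Y
\]
where $A^\dagger$ denotes the Moore-Penrose inverse.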
Rather than using the closed-form solution for the least squares, we will present analysis
based on the optimality of the solution with respect to the empirical error. This analysis is
more general and will hold for constrained least squares beyond linear regression.
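First, observe that by optimality,
\[
\left\| X\hat{\beta} - Y \right\|^2 \le \|X\beta^* - Y\|^2 = \|\varepsilon\|^2. \tag{6.86}
\]
On the other hand,
\[
\left\| X\hat{\beta} - Y \right\|^2 = \left\| X\hat{\beta} - X\beta^* - \varepsilon \right\|^2 = \left\| X\hat{\beta} - X\beta^* \right\|^2 - 2 \langle \varepsilon, X\hat{\beta} - X\beta^* \rangle + \|\varepsilon\|^2. \tag{6.87}
\]
These two equations together yield the so-called Basic Inequality:
\[
\frac{1}{n} \left\| X\hat{\beta} - X\beta^* \right\|^2 \le \frac{2}{n} \langle \varepsilon, X\hat{\beta} - X\beta^* \rangle. \tag{6.88}
\]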
Since generalizations of this inequality will be used many times in this course, it's worth making a few remarks. First, the inequality is deterministic. Second, on the left-hand side, we have our quantity of interest: the de-noising error. However, the right-hand side also involves our estimator $\hat{\beta}$. Moreover, if $X\hat{\beta} - X\beta^*$ is small, we expect the product with the random $\varepsilon$ to be "even smaller", which, in turn, leads to a smaller bound on $X\hat{\beta} - X\beta^*$, which... Indeed, this argument will be formalized in terms of a certain fixed point in the second part of the course. Third, (6.86) will be the only place in our analysis where we use the fact that $\hat{\beta}$ minimizes empirical error, and other properties of $\hat{\beta}$ will be irrelevant. This observation will lead to immediate generalizations of the analysis beyond linear least squares.
Our strategy in analyzing least squares will be to "remove the hat" (i.e. the dependence of the right-hand side of (6.88) on $\hat{\beta}$) by passing to a sufficiently localized supremum over all possible locations of $\hat{\beta}$.

6.4 Constrained Least Squares
Consider now a modification of the regression model (6.80), where we have the additional knowledge that $\beta^* \in K$ for some set $K \subset \mathbb{R}^d$. It then makes sense to minimize squared error subject to being in $K$:
\[
\hat{\beta} \in \operatorname*{argmin}_{\beta \in K} \|X\beta - Y\|^2. \tag{6.89}
\]
Of course, the unconstrained model and the corresponding least squares solution in (6.85) correspond to $K = \mathbb{R}^d$. Since the constrained case subsumes the unconstrained case, we will proceed below with the constrained analysis.
By examining (6.86) and (6.87), we see that the Basic Inequality (6.88) still holds for constrained least squares: the only property used is that $\beta^*$ is feasible, so that $\hat{\beta}$ has empirical error no larger than that of $\beta^*$.

6.5 Analyses of Least Squares: first strategy


For simplicity of exposition, we make the assumption that K is symmetric about the origin
(i.e. x ∈ K implies −x ∈ K). In this case, K − K ⊆ 2K.
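We will take one of the following two paths in analyzing (6.88). The first, which leads to the so-called fast rates, is to divide both sides of (6.88) by $\| X\hat{\beta} - X\beta^* \|$ and then take a supremum over all unit vectors that can arise when $\hat{\beta} = \hat{\beta}(\varepsilon)$ ranges over $K$ (or $\mathbb{R}^d$). More precisely, the basic inequality leads to
\[
\left\| X\hat{\beta} - X\beta^* \right\| \le 2 \left\langle \varepsilon, \frac{X\hat{\beta} - X\beta^*}{\| X\hat{\beta} - X\beta^* \|} \right\rangle \le 2 \sup_{\beta \in K} \left\langle \varepsilon, \frac{X\beta - X\beta^*}{\|X\beta - X\beta^*\|} \right\rangle. \tag{6.90}
\]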

Note that the right-hand side is now independent of the algorithm/estimator. We can now
treat the right-hand side as a supremum of a collection of random variables indexed by β.
The smaller the collection, the smaller is the upper bound (other things kept equal). Once
we have a high-probability bound on (6.90), we may square both sides to get a bound on the
squared error. We should point out that it is quite surprising that such a simple strategy
works for analyzing least squares. It may appear that the supremum on the right-hand side
of (6.90) can be significantly larger than the middle part of that inequality.
To illustrate the strategy, we consider the unconstrained least squares (also called Or-
dinary Least Squares, OLS).

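Lemma 16. Assume the regression model $Y = X\beta^* + \varepsilon$ with $\varepsilon \in \mathrm{subG}(\sigma^2)$. Let $r = \mathrm{rank}(X^T X)$. Then the unconstrained OLS enjoys
\[
\mathbb{E}_\varepsilon \left[ \frac{1}{n} \left\| X\hat{\beta} - X\beta^* \right\|^2 \right] \le \frac{4 r \sigma^2}{n}. \tag{6.91}
\]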

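Proof. Let us write the basic inequality as
\[
\left\| X\hat{\beta} - X\beta^* \right\| \le 2 \langle \varepsilon, v(\varepsilon) \rangle \tag{6.92}
\]
where $v(\varepsilon) = \frac{X\hat{\beta} - X\beta^*}{\| X\hat{\beta} - X\beta^* \|}$. Let $U = [u_1, \ldots, u_r]$ be the matrix with orthonormal columns that form a basis of the column space of $X$. Since $X(\hat{\beta} - \beta^*)$ is in the column space of $X$, we can write $v(\varepsilon) = Ua$ for some $a = a(\varepsilon) \in \mathbb{R}^r$ with $\|a\| = 1$. Then
\[
\left\| X\hat{\beta} - X\beta^* \right\| \le 2 \sup_{\|a\| \le 1} \langle \varepsilon, Ua \rangle = 2 \sup_{\|a\| \le 1} \langle U^T \varepsilon, a \rangle = 2 \left\| U^T \varepsilon \right\| \tag{6.93}
\]
since the Euclidean norm is self-dual. Squaring both sides,
\[
\left\| X\hat{\beta} - X\beta^* \right\|^2 \le 4 \left\| U^T \varepsilon \right\|^2. \tag{6.94}
\]
It is easy to see that $U^T \varepsilon \in \mathrm{subG}(\sigma^2)$. Indeed, for any $w \in \mathbb{R}^r$ with $\|w\| = 1$,
\[
\mathbb{E} \exp\{\lambda \langle w, U^T \varepsilon \rangle\} = \mathbb{E} \exp\{\lambda \langle Uw, \varepsilon \rangle\} \le \exp\{\lambda^2 \sigma^2/2\}.
\]
Since the variance of each of the $r$ coordinates of $U^T \varepsilon$ is at most $\sigma^2$, we have $\mathbb{E} \| U^T \varepsilon \|^2 \le r \sigma^2$, and the result follows.

In particular, $r \le \min\{n, d\}$, and if $r = d$, the guarantee takes on the familiar form of $\frac{\sigma^2 d}{n}$.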

6.6 Analyses of Least Squares: second strategy


Consider constrained least squares with $K = B_1^d = \{q \in \mathbb{R}^d : \sum_i |q_i| \le 1\}$. We have that $\hat{\beta} - \beta^* \in K - K = 2B_1^d$. Note that $2B_1^d$ has $2d$ vertices, and thus $X(\hat{\beta} - \beta^*) \in X \cdot (2B_1^d)$ also has at most $2d$ vertices.
Unfortunately, when we normalize $X(\hat{\beta} - \beta^*)$, we "lose" the vertices, as
\[
v(\varepsilon) = \frac{X(\hat{\beta} - \beta^*)}{\| X(\hat{\beta} - \beta^*) \|} \in S^{n-1} \cap \mathrm{col}(X).
\]
Such an approach, at least directly, would lead to rates of the previous section, without exploiting the structure of $K$.
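We proceed by avoiding normalization and directly analyzing the basic inequality:
\[
\left\| X\hat{\beta} - X\beta^* \right\|^2 \le 2 \langle \varepsilon, X\hat{\beta} - X\beta^* \rangle \le 2 \max_{v \in 2B_1^d} \langle \varepsilon, Xv \rangle = 4 \left\| X^T \varepsilon \right\|_\infty. \tag{6.95}
\]
Let $x_i$ now denote the $i$th column of $X$. From (5.71), and observing that $\langle x_i, \varepsilon \rangle \in \mathrm{subG}(\|x_i\|^2 \sigma^2)$,
\[
\mathbb{E} \left\| X^T \varepsilon \right\|_\infty \le \sigma \sqrt{2 \log(2d)} \cdot \max_i \|x_i\|.
\]
A natural normalization of the data is $\|x_i\| \le \sqrt{n}$, in which case we have proved the following result: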

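Lemma 17. Assume the regression model $Y = X\beta^* + \varepsilon$ with $\varepsilon \in \mathrm{subG}(\sigma^2)$ and $K = B_1^d$. Suppose columns of $X$ are normalized to satisfy $\|x_i\| \le \sqrt{n}$. Then the constrained OLS enjoys
\[
\mathbb{E}_\varepsilon \left[ \frac{1}{n} \left\| X\hat{\beta} - X\beta^* \right\|^2 \right] \le 4\sigma \sqrt{\frac{2 \log(2d)}{n}}. \tag{6.96}
\]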

Since the analysis of the previous section applies to constrained least squares as well, we have that
\[
\mathbb{E}_\varepsilon \left[ \frac{1}{n} \left\| X\hat{\beta} - X\beta^* \right\|^2 \right] \le 4 \min\left\{ \sigma \sqrt{\frac{2 \log(2d)}{n}}, \ \frac{r \sigma^2}{n} \right\},
\]
where $r$ is the rank of $X$. Disregarding the logarithmic factors, the transition between the two rates is at $r \sim \sqrt{n}$. The upper bound in (6.96) is sometimes referred to as the "slow rate," as opposed to the "fast rate" in (6.91). The fast rate kicks in for problems with small dimensionality (or rank), while the slow rate wins in the high-dimensional situation. We will see how these two regimes arise more generally in parametric and nonparametric regression through the lens of covering numbers.

6.7 Sparsity
To close our discussion of linear regression, consider one more example of constrained least squares, where $\beta^*$ is known to be sparse. In other words, assume that
\[
\beta^* \in B_0^d(k) = \{\beta \in \mathbb{R}^d : \|\beta\|_0 \le k\}.
\]
Let $\hat{\beta}$ be the constrained least squares solution, and assume, as before, that $\varepsilon \in \mathrm{subG}(\sigma^2)$. Constrained least squares is an inefficient method here, as it requires enumeration over $\binom{d}{k}$ subsets. Note that $\hat{\beta} - \beta^* \in K - K = B_0^d(2k)$, i.e. this difference vector is at most $2k$-sparse.
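We shall proceed with the first approach. The Basic Inequality then gives
\[
\left\| X\hat{\beta} - X\beta^* \right\| \le 2 \left\langle \varepsilon, \frac{X\hat{\beta} - X\beta^*}{\| X\hat{\beta} - X\beta^* \|} \right\rangle \le 2 \max_{S \subset [d], |S| \le 2k} \ \sup_{w : \mathrm{supp}(w) \subset S} \left\langle \varepsilon, \frac{Xw}{\|Xw\|} \right\rangle. \tag{6.97}
\]
As before, let $U_S$ be the orthonormal basis of the span of columns of $X$ corresponding to the index set $S$. We can then write $(Xw)/\|Xw\| = U_S a$ for some $a \in B_2^{2k}$. Then we have
\[
\left\| X\hat{\beta} - X\beta^* \right\| \le 2 \max_{S \subset [d], |S| \le 2k} \ \sup_{a \in B_2^{2k}} \langle U_S^T \varepsilon, a \rangle = 2 \max_{S \subset [d], |S| \le 2k} \left\| U_S^T \varepsilon \right\|
\]
\[
\tag{6.98}
\]
and
\begin{align}
\mathbb{E} \left\| X\hat{\beta} - X\beta^* \right\|^2 &\le 4\, \mathbb{E} \max_{S \subset [d], |S| \le 2k} \left\| U_S^T \varepsilon \right\|^2 \tag{6.99} \\
&\le 4 \left( \mathbb{E} \max_{S \subset [d], |S| \le 2k} \left\{ \left\| U_S^T \varepsilon \right\|^2 - \mathbb{E} \left\| U_S^T \varepsilon \right\|^2 \right\} + \max_{S \subset [d], |S| \le 2k} \mathbb{E} \left\| U_S^T \varepsilon \right\|^2 \right), \tag{6.100}
\end{align}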

where in the last expression we centered the random variables.


The first quantity in the last expression is an expected maximum of the centered random variables $\| U_S^T \varepsilon \|^2 - \mathbb{E} \| U_S^T \varepsilon \|^2$, indexed by $S$. There are at most $\binom{d}{2k}$ such variables, and each variable is sub-exponential (indeed, each coordinate $j$ of the vector $U_S^T \varepsilon$ is a $\sigma^2$-sub-Gaussian random variable). Thus, this expected maximum is at most
\[
C \sigma^2 \log \binom{d}{2k}
\]
for some absolute constant $C$. The second term is at most $2k\sigma^2$, a lower-order term. With the standard estimate $\binom{d}{i} \le \left( \frac{ed}{i} \right)^i$, we conclude that
\[
\frac{1}{n} \mathbb{E} \left\| X\hat{\beta} - X\beta^* \right\|^2 \lesssim \sigma^2 \frac{k \log(d/k)}{n}. \tag{6.101}
\]

7. COVERING NUMBERS: AN INTRODUCTION

7.1 $\ell_2$ ball cover


In analyzing linear regression, we started with the Basic Inequality and turned its estimator-dependent upper bound into a maximum of a collection of random variables, e.g. (6.93), (6.96), and (6.97). In these examples, the maximum was expressed conveniently as either the $\ell_2$ or $\ell_\infty$ norm of a vector-valued random variable. For constrained least squares with other sets $K$, we may not have such a convenient closed form, and we are seeking to understand a more general principle behind developing maximal inequalities. At the very least, we would like to see how to unify the analysis of (5.69) and (5.71), which at the moment appear to be very different: the first uses properties of the $\ell_2$ norm (specifically, smoothness) while the second relies on finiteness of extremal points. Is there a unified analysis or principle for these two results? Indeed, there is.
We start with a notion of a cover for a subset of Rd .

Definition 6. Given $K \subset \mathbb{R}^d$ and $\varepsilon \ge 0$, a set $V \subset \mathbb{R}^d$ is an $\varepsilon$-net (equivalently, an $\varepsilon$-cover) with respect to a distance measure $\rho$ on $\mathbb{R}^d$ if for any $x \in K$ there exists a $v \in V$ such that $\rho(x, v) \le \varepsilon$. The covering number $N(K, \rho, \varepsilon)$ is the cardinality of the smallest such cover.

Clearly, the definition extends beyond Rd to any metric space (X, ρ). If V ⊂ K, the cover
is called proper.

Lemma 18. For any $\varepsilon \in (0, 1]$,
\[
N(B_2^d, \|\cdot\|_2, \varepsilon) \le \left( 1 + \frac{2}{\varepsilon} \right)^d.
\]

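Proof. We use the following volume argument. We add centers $v_1, v_2, \ldots \in B_2^d$ such that $\|v_i - v_j\| > \varepsilon$ for every $i \ne j$, until no such additional point exists. Let $N$ be the size of this set, which is clearly an $\varepsilon$-net. The balls of radius $\varepsilon/2$ around the centers are disjoint and contained in $B_2^d + \frac{\varepsilon}{2} B_2^d$, so
\[
N \cdot \mathrm{vol}\left( \frac{\varepsilon}{2} B_2^d \right) \le \mathrm{vol}\left( B_2^d + \frac{\varepsilon}{2} B_2^d \right) \quad \Rightarrow \quad N \le \frac{\left( 1 + \frac{\varepsilon}{2} \right)^d}{\left( \frac{\varepsilon}{2} \right)^d} = \left( 1 + \frac{2}{\varepsilon} \right)^d.
\]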

Since we can start the iterative process of placing ε-balls with v1 = 0, we can assume
without loss of generality that the minimal cover contains 0.
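The greedy construction in the proof can be imitated on a finite sample of points. The sketch below works in dimension $d = 2$ and builds an $\varepsilon$-separated set starting from the origin; the sampling scheme, function names, and the value $\varepsilon = 0.25$ are illustrative assumptions, and the resulting set is a net only for the sampled points, not for the whole ball.

```python
import math
import random


def sample_unit_disc(count, rng):
    """Rejection-sample points uniformly from the Euclidean unit ball in R^2."""
    pts = []
    while len(pts) < count:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            pts.append((x, y))
    return pts


def greedy_separated_set(points, eps):
    """Start from the origin and keep every point that is more than eps away
    from all centers kept so far."""
    centers = [(0.0, 0.0)]
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


rng = random.Random(7)
eps = 0.25
points = sample_unit_disc(2000, rng)
centers = greedy_separated_set(points, eps)
print(len(centers), (1 + 2 / eps) ** 2)
```

Since the centers are pairwise more than $\varepsilon$ apart and lie in the unit disc, the volume argument caps their number at $(1 + 2/\varepsilon)^2$ regardless of how many points are sampled.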

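Lemma 19. Let $V$ be a cover of $B_2^d$ at scale $\varepsilon \in (0, 1)$ with respect to $\|\cdot\|_2$. Then for any $x \in \mathbb{R}^d$,
\[
\max_{u \in B_2^d} \langle u, x \rangle \le \frac{1}{1 - \varepsilon} \max_{v \in V} \langle v, x \rangle. \tag{7.102}
\]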

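Proof. For any $u \in B_2^d$, there exists $v \in V$ such that $\|u - v\| \le \varepsilon$ (i.e. $u - v \in \varepsilon B_2^d$). Since we have $\langle u, x \rangle = \langle v, x \rangle + \langle u - v, x \rangle$, it also holds that
\[
\langle u, x \rangle \le \max_{v \in V} \langle v, x \rangle + \max_{w \in \varepsilon B_2^d} \langle w, x \rangle.
\]
By linearity, the last term is $\varepsilon \max_{w \in B_2^d} \langle w, x \rangle$. Since the choice of $u \in B_2^d$ was arbitrary, the statement follows by rearranging the terms.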

We remark that this lemma trivially extends to norms beyond Euclidean, as long as the
ball is covered in the very norm with respect to which it is defined. This situation is rather
special, and we will use this comparison result only a couple of times in this course, with a
constant ε.

7.2 Recovering (5.69) via covering numbers


As a sanity check, let's see if we can recover (5.69). To this end, let $Z \in \mathrm{subG}(\sigma^2)$ be a random $d$-dimensional vector. Then
\[
\mathbb{E} \|Z\| = \mathbb{E} \max_{u \in B_2^d} \langle u, Z \rangle \le 2\, \mathbb{E} \max_{v \in V} \langle v, Z \rangle \tag{7.103}
\]
where $V$ is a minimal $1/2$-cover of $B_2^d$. Since $|V| \le 5^d$ by Lemma 18, we can conclude from (5.70) that
\[
\mathbb{E} \|Z\| = \mathbb{E} \max_{u \in B_2^d} \langle u, Z \rangle \le 2\sigma \sqrt{2 \log(5^d)} \le C \sigma \sqrt{d} \tag{7.104}
\]
for $C = 2\sqrt{2 \log 5}$. While the approach through finite discretization appears to "unify" both (5.69) and (5.71) (in the latter case, the set is already discrete), it did not recover the same constant 1 as in (5.69).

7.3 Operator norm


Recall that for a matrix $A \in \mathbb{R}^{p \times q}$, the operator norm (in the $\ell_2^q \to \ell_2^p$ sense) is defined as
\[
\|A\| = \sup_{\|x\| = 1} \|Ax\| = \sup_{\|x\| = 1, \|y\| = 1} y^T A x
\]
where $x \in \mathbb{R}^q$ and $y \in \mathbb{R}^p$. This norm is also known as the spectral norm since $\|A\| = \sqrt{\lambda_{\max}(A^T A)}$, the square root of the largest eigenvalue of $A^T A$, which is also the largest singular value of $A$, which we shall denote as $\sigma_{\max}(A)$. We emphasize our convention that the unadorned norm $\|\cdot\|$ for vectors stands for the Euclidean norm (unless stated otherwise), and the unadorned norm $\|\cdot\|$ for matrices will stand for the operator norm.
We have the following extension of Lemma 19, see [33, p. 84].

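Lemma 20. Let $\varepsilon \in (0, 1/2)$. Let $V$ and $U$ be proper $\varepsilon$-nets of $B_2^p$ and $B_2^q$, respectively, with respect to the Euclidean norm. Without loss of generality, assume $0 \in V, U$. Then for any $A \in \mathbb{R}^{p \times q}$,
\[
\max_{v \in V, u \in U} v^T A u \le \|A\| \le \frac{1}{1 - 2\varepsilon} \max_{v \in V, u \in U} v^T A u. \tag{7.105}
\]
Furthermore, if $p = q$, it holds that
\[
\|A\| \le \frac{1}{1 - 2\varepsilon} \max_{v \in V} v^T A v. \tag{7.106}
\]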

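Proof. Let $\|A\| = \|Ax\|$ for $\|x\| = 1$ and let $u \in U$ be such that $\|x - u\| \le \varepsilon$. Then
\[
\|A\| = \|Ax\| \le \|Au\| + \|A(x - u)\| \le \|Au\| + \|A\| \varepsilon.
\]
Then, combining with Lemma 19,
\[
(1 - \varepsilon) \|A\| \le \|Au\| \le \frac{1}{1 - \varepsilon} \max_{v \in V} \langle v, Au \rangle. \tag{7.107}
\]
Taking the maximum over $u$, and noting that $(1 - \varepsilon)^{-2} \le (1 - 2\varepsilon)^{-1}$, the upper bound follows. The lower bound is immediate since the $\varepsilon$-nets are proper.
For the second statement, let $x$ be such that $\|A\| = |x^T A x|$. We have for any $v \in V$ with $\|x - v\| \le \varepsilon$,
\[
\langle x, Ax \rangle = \langle v, Av \rangle + \langle v, A(x - v) \rangle + \langle x - v, Ax \rangle \le \langle v, Av \rangle + 2\varepsilon \|A\|
\]
and thus
\[
\|A\| \le |\langle v, Av \rangle| + 2\varepsilon \|A\| \le \max_{v \in V} |\langle v, Av \rangle| + 2\varepsilon \|A\|.
\]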

The following result appears, for example, in [33, Thm 4.4.5].

Lemma 21. Let $A \in \mathbb{R}^{p \times q}$ be a random matrix with mean-zero independent $\sigma^2$-sub-Gaussian entries. Then for any $t > 0$,
\[
\|A\| \le C\sigma(\sqrt{p} + \sqrt{q} + t) \tag{7.108}
\]
with probability at least $1 - \exp\{-t^2\}$ for some absolute constant $C$. Hence,
\[
\mathbb{E} \|A\| \lesssim \sigma(\sqrt{p} + \sqrt{q}). \tag{7.109}
\]

Proof. Let $V, U$ be, respectively, $1/4$-nets for $B_2^p$ and $B_2^q$, of size at most $9^p$ and $9^q$ as guaranteed by Lemma 18. From Lemma 20,
\[
\max_{v \in V, u \in U} v^T A u \le \|A\| \le 2 \max_{v \in V, u \in U} v^T A u. \tag{7.110}
\]
For any fixed $v \in V$, $u \in U$, the random variable $v^T A u = \sum_{i=1}^p \sum_{j=1}^q A_{i,j} v_i u_j$ is sub-Gaussian with variance proxy $\sigma^2 \sum_{i=1}^p \sum_{j=1}^q v_i^2 u_j^2 \le \sigma^2$. Hence,
\[
\mathbb{P}\left( v^T A u \ge t\sigma \right) \le \exp\left\{ -t^2/2 \right\}.
\]
Then by the union bound,
\[
\mathbb{P}\left( \exists u \in U, v \in V : v^T A u \ge t\sigma \right) \le 9^{p+q} \exp\left\{ -t^2/2 \right\}.
\]
Substituting $C(\sqrt{p} + \sqrt{q} + t)$ in place of $t$, for some absolute constant $C > 0$, and noting that $9^{p+q} \exp\{-C(\sqrt{p} + \sqrt{q} + t)^2\} \le \exp\{-Ct^2\}$ for $C$ large enough, we conclude the proof.

Note: as a corollary, for a symmetric (Wigner) random matrix $A \in \mathbb{R}^{p \times p}$ with independent
σ²-sub-Gaussian entries above the diagonal,

$$ \mathbb{E}\|A\| \lesssim \sigma\sqrt{p}. $$

This holds by applying the above lemma separately to the upper and lower triangular
components of A.

8. COVARIANCE ESTIMATION

Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be i.i.d. sub-Gaussian centered random variables with covariance Σ.
Recall that the sample covariance is $\hat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i X_i^T$. We would like to measure the quality
of the approximation of Σ by Σ̂ via the spectral norm:

$$ \|\hat\Sigma - \Sigma\| = \max_{v \in S^{d-1}} \left| \frac{1}{n} \sum_{i=1}^n \langle X_i, v \rangle^2 - v^T \Sigma v \right| \tag{8.111} $$

Lemma 21 does not directly apply here since the entries of Σ̂ − Σ are not independent.

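Lemma 22. Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be i.i.d. with mean zero and $X_i \in \mathrm{subG}(\sigma^2)$. Let
$\mathbb{E} X_i X_i^T = \Sigma$ and let Σ̂ be the sample covariance matrix. Then

$$ \mathbb{P}\left( \|\hat\Sigma - \Sigma\| \ge C\sigma^2 \max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \right) \le 2\exp\{-t\} \tag{8.112} $$

for some absolute constant C.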

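Proof. Define shorthand Q = Σ̂ − Σ. Let V be a 1/8-cover of $B_2^d$ of size at most $17^d$. A
small modification of the proof of Lemma 20 implies that

$$ \|Q\| \le 2 \max_{v \in V} |\langle v, Qv \rangle|. $$

Now, $X_i \in \mathrm{subG}(\sigma^2)$ implies that for any $u \in B_2^d$, the random variable $\langle X_i, u \rangle^2 - \langle u, \Sigma u \rangle$
is sub-exponential with parameters (cσ², cσ²) for some absolute constant c. From (3.22),
rescaling by cσ²,

$$ \mathbb{P}\left( |\langle u, Qu \rangle| \ge c\varepsilon\sigma^2 \right) \le 2\exp\left\{ -n \min\{\varepsilon, \varepsilon^2\} \right\} $$

Taking a union bound over the discretization,

$$ \mathbb{P}\left( \exists v \in V : |\langle v, Qv \rangle| \ge c\varepsilon\sigma^2 \right) \le 17^d \cdot 2\exp\left\{ -n \min\{\varepsilon, \varepsilon^2\} \right\} $$

and thus

$$ \mathbb{P}\left( \exists v \in V : |\langle v, Qv \rangle| \ge \varepsilon\sigma^2 \right) \le 2\exp\left\{ -c\left( n \min\{\varepsilon, \varepsilon^2\} - d \right) \right\} $$

Now for some t > 0, choose

$$ \varepsilon = \max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\}. $$

Then

$$ \min\{\varepsilon, \varepsilon^2\} = \frac{d}{n} + \frac{t}{n} $$

and

$$ n \min\{\varepsilon, \varepsilon^2\} - d = t. $$

This yields

$$ \mathbb{P}\left( \|\hat\Sigma - \Sigma\| \ge 2\sigma^2 \max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \right) \le 2\exp\{-ct\} $$

for some absolute constant c > 0.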

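Note that if $X_i \sim N(0, \Sigma)$, we have $\langle v, X_i \rangle \sim N(0, v^T \Sigma v)$. Since for any unit vector v,
$v^T \Sigma v \le \|\Sigma\|$, $X_i$ is a sub-Gaussian vector with variance proxy at most ‖Σ‖. More generally,
if we assume that the sub-Gaussian parameter of $X_i$ is at most C‖Σ‖ for some constant C,
then we have the following corollary:

Corollary 1. In the setting of Lemma 22, if we additionally assume that $X_i \in \mathrm{subG}(\|\Sigma\|)$,
then

$$ \mathbb{P}\left( \|\hat\Sigma - \Sigma\| \ge C\|\Sigma\| \max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \right) \le 2\exp\{-t\} \tag{8.113} $$

Furthermore,

$$ \mathbb{E}\|\hat\Sigma - \Sigma\| \lesssim \|\Sigma\| \max\left\{ \sqrt{\frac{d}{n}},\ \frac{d}{n} \right\} $$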

As discussed before, if d = o(n), then the sample covariance Σ̂ is a consistent estimator of
Σ and we have an explicit rate. Let us mention one more result, in terms of effective rank
(or, stable rank)

$$ r(\Sigma) = \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|}. \tag{8.114} $$
Note that a similar quantity arose in mean estimation in high dimension. The numerator
here is the sum of eigenvalues of Σ, while the denominator is the largest eigenvalue. The
ratio has the right “units” to qualify for a notion of a dimension. If $\Sigma = I_d$, we have
tr(Σ) = d and ‖Σ‖ = 1. More generally, effective rank can be small even though d is large,
as long as the eigenvalues decay fast enough.
The more general result says that if $X_i$ are sub-Gaussian centered vectors such that the
sub-Gaussian parameter of any one-dimensional projection $\langle X_i, u \rangle$ is at most a constant
multiple of its variance, then it holds that

$$ \mathbb{E}\|\hat\Sigma - \Sigma\| \lesssim \|\Sigma\| \max\left\{ \sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n} \right\} $$

(see [17], [33, Theorem 9.2.4])

8.1 Singular values

The reason we had the two-tailed behavior of the spectral norm ‖Σ̂ − Σ‖ is that Σ̂ is
an average of “squares” of sub-Gaussian random variables. If you recall, in the earlier
lecture, we deduced pure sub-Gaussian tails by taking square root of the random variable.
The analogue here will be the singular values $\sigma_i(X)$ of the data matrix $X \in \mathbb{R}^{n \times d}$ which
has rows $X_i^T$.
Recall that singular values of the matrix X and eigenvalues of sample covariance Σ̂ are
related as

$$ \sigma_j(X) = \sqrt{\lambda_j(X^T X)} $$

or, rescaling,

$$ \sigma_j\left( \tfrac{1}{\sqrt{n}} X \right) = \sqrt{\lambda_j\left( \tfrac{1}{n} X^T X \right)} = \sqrt{\lambda_j(\hat\Sigma)}. $$

Weyl’s Inequality then says that

$$ \max_{j = 1, \ldots, d} |\lambda_j(\hat\Sigma) - \lambda_j(\Sigma)| \le \|\hat\Sigma - \Sigma\| \tag{8.115} $$

Suppose for the purposes of illustration that $\Sigma = I_d$ (i.e. the random variables are isotropic).
Then our results tell us that

$$ \|\hat\Sigma - I_d\| \lesssim \sqrt{\frac{d}{n} + \frac{t}{n}} \vee \left( \frac{d}{n} + \frac{t}{n} \right) $$

with probability at least $1 - 2e^{-t}$, which means that for all i = 1, . . . , d,

$$ \left| \sigma_i^2\left( \tfrac{1}{\sqrt{n}} X \right) - 1 \right| \lesssim \sqrt{\frac{d}{n} + \frac{t}{n}} \vee \left( \frac{d}{n} + \frac{t}{n} \right) $$

Since max(|z − 1|, |z − 1|²) ≤ |z² − 1| for z ≥ 0, we get

$$ \max\left\{ \left| \sigma_i\left( \tfrac{1}{\sqrt{n}} X \right) - 1 \right|,\ \left| \sigma_i\left( \tfrac{1}{\sqrt{n}} X \right) - 1 \right|^2 \right\} \le \left| \sigma_i^2\left( \tfrac{1}{\sqrt{n}} X \right) - 1 \right| $$

which implies, after rescaling, that

$$ |\sigma_i(X) - \sqrt{n}| \lesssim \sqrt{d} + \sqrt{t} $$

with probability at least 1 − 2 exp{−t}. In other words, the singular values of a tall (n > d)
matrix X with sub-Gaussian isotropic rows are tightly concentrated in the
interval $[\sqrt{n} - C\sqrt{d}, \sqrt{n} + C\sqrt{d}]$. This result holds in more generality than stated here, and
we refer to [32, 33].

9. SPECTRAL METHODS

9.1 Perturbation Analysis


We now present several additional models that involve random matrices. In these appli-
cations, such as Principal Component Analysis, we will think of the matrix Y as a noisy
observation of some signal matrix X, with additive noise E, i.e.

Y = X + E. (9.116)

We will be mainly interested in estimating top eigenvector(s) of X from the noisy observation
Y. For now, however, we think of E as a non-random perturbation of X. How does this
perturbation affect the spectral properties? While Weyl’s inequality (see e.g. (8.115)) tells
us that eigenvalues do not change much when the perturbation of the matrix is small in
spectral norm, it does not say anything about closeness of eigenvectors. So, it is natural to
ask: Are eigenvectors of X and Y close if ‖X − Y‖ is small?
To provide some intuition, consider the following example, with some δ > 0:

$$ X = \begin{bmatrix} 1 + \delta & 0 \\ 0 & 1 - \delta \end{bmatrix}, \quad E = \begin{bmatrix} -\delta & \delta \\ \delta & \delta \end{bmatrix}, \quad Y = X + E = \begin{bmatrix} 1 & \delta \\ \delta & 1 \end{bmatrix} \tag{9.117} $$

The eigenvalues of X are 1 + δ and 1 − δ, with the corresponding eigenvectors $u_1 = e_1$,
$u_2 = e_2$. On the other hand, the eigenvalues of Y are also 1 + δ and 1 − δ, with eigenvectors
$v_1 = [2^{-1/2}, 2^{-1/2}]^T$ and $v_2 = [-2^{-1/2}, 2^{-1/2}]^T$. We see that $u_1^T v_1 = 2^{-1/2}$, i.e. the
eigenvectors rotated by 45 degrees because of the perturbation, even though the eigenvalues
remained the same. The source of the instability of eigenvectors is the gap between the
eigenvalues of X, which is on the same order as the size of the perturbation E when measured,
say, in the operator norm. One may wonder if, in general, this is the only reason that
eigenvectors may move significantly. Indeed, that is the case, as shown below.
We state a simplified version of the Davis-Kahan “sin(θ)” theorem (see [36]).

Theorem 1. Let $X, Y \in \mathbb{R}^{d \times d}$ be symmetric matrices with, respectively, eigenvalues
$\lambda_1 \ge \ldots \ge \lambda_d$ and $\mu_1 \ge \ldots \ge \mu_d$, as well as eigenvectors $u_1, \ldots, u_d$ and $v_1, \ldots, v_d$.
Then

$$ \sin(\theta) \le \frac{2\|X - Y\|}{\max\{\lambda_1 - \lambda_2,\ \mu_1 - \mu_2\}} \tag{9.118} $$

where $\theta = \arccos(|\langle u_1, v_1 \rangle|)$ is the principal angle between $u_1$ and $v_1$.

Proof. First, we have

$$ \begin{aligned} \langle u_1, X u_1 \rangle - \langle v_1, X v_1 \rangle &= \langle u_1, Y u_1 \rangle - \langle v_1, X v_1 \rangle + \langle u_1, (X - Y) u_1 \rangle \\ &\le \langle v_1, Y v_1 \rangle - \langle v_1, X v_1 \rangle + \langle u_1, (X - Y) u_1 \rangle \\ &= \langle X - Y, u_1 u_1^T - v_1 v_1^T \rangle \end{aligned} $$

where the last inner product is to be understood as trace. The last expression is at most

$$ \|X - Y\| \cdot \|u_1 u_1^T - v_1 v_1^T\|_1 \le \|X - Y\| \cdot \|u_1 u_1^T - v_1 v_1^T\|_F \cdot \sqrt{2} $$

where $\|\cdot\|_1$ is the nuclear norm (ℓ1 of eigenvalues) and $\|\cdot\|_F$ is the Frobenius norm (ℓ2 of
eigenvalues).
On the one hand,

$$ \|u_1 u_1^T - v_1 v_1^T\|_F^2 = 2 - 2\langle u_1, v_1 \rangle^2 = 2\sin^2(\theta). $$

On the other hand, the values $\langle u_1, X u_1 \rangle$ and $\langle v_1, X v_1 \rangle$ should be different if the angle is
large and there is a gap in the eigenvalues $\lambda_1$ and $\lambda_2$. More precisely,

$$ \langle v_1, X v_1 \rangle = \sum_{j=1}^d \lambda_j \langle u_j, v_1 \rangle^2 \le \lambda_1 \langle u_1, v_1 \rangle^2 + \lambda_2 (1 - \langle u_1, v_1 \rangle^2) = \lambda_1 \cos^2(\theta) + \lambda_2 \sin^2(\theta) $$

since $\sum_{j=1}^d \langle u_j, v_1 \rangle^2 = \|U^T v_1\|^2 = \|v_1\|^2 = 1$ for $U = [u_1, \ldots, u_d]$. Hence,

$$ \langle u_1, X u_1 \rangle - \langle v_1, X v_1 \rangle \ge \lambda_1 - \lambda_1 \cos^2(\theta) - \lambda_2 \sin^2(\theta) = (\lambda_1 - \lambda_2) \sin^2(\theta). \tag{9.119} $$

We conclude that

$$ (\lambda_1 - \lambda_2) \sin^2(\theta) \le 2\|X - Y\| \sin(\theta). $$

The analogous analysis with $\langle v_1, Y v_1 \rangle - \langle u_1, Y u_1 \rangle$ as a starting point yields $(\mu_1 - \mu_2)$ in the
denominator. We can take the best of these two bounds by introducing the maximum.

The theorem says that the top eigenvectors of X and Y are close (i.e. the sine of the
angle is small) if the gap between the top two eigenvalues of either X or Y is large compared
to the spectral norm of the difference of these two matrices. Recall that the lack of this
favorable comparison was exactly the reason for the instability in (9.117).
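
As a quick numerical check of this discussion, the following sketch evaluates both sides of (9.118) on the 2 × 2 example (9.117). It is only an illustration: the helper names `sym2_eig` and `davis_kahan_demo` are hypothetical, and the closed-form eigendecomposition of a symmetric 2 × 2 matrix is used so that only the Python standard library is needed.

```python
import math

def sym2_eig(a, b, c):
    """Eigenvalues lam1 >= lam2 and a unit top eigenvector of [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    lam1, lam2 = mean + rad, mean - rad
    scale = abs(a) + abs(b) + abs(c) + 1.0
    # (b, lam1 - a) solves (M - lam1 I) v = 0 unless it degenerates to zero
    vx, vy = b, lam1 - a
    if math.hypot(vx, vy) <= 1e-12 * scale:
        vx, vy = lam1 - c, b
    if math.hypot(vx, vy) <= 1e-12 * scale:
        vx, vy = 1.0, 0.0  # multiple of the identity: any unit vector works
    nrm = math.hypot(vx, vy)
    return lam1, lam2, (vx / nrm, vy / nrm)

def davis_kahan_demo(delta):
    """Return (sin(theta), right-hand side of (9.118)) for the example (9.117)."""
    lam1, lam2, u1 = sym2_eig(1 + delta, 0.0, 1 - delta)  # X
    mu1, mu2, v1 = sym2_eig(1.0, delta, 1.0)              # Y = X + E
    e1, e2, _ = sym2_eig(-delta, delta, delta)            # E
    op_norm = max(abs(e1), abs(e2))                       # operator norm of E
    cos_theta = abs(u1[0] * v1[0] + u1[1] * v1[1])
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta ** 2))
    bound = 2.0 * op_norm / max(lam1 - lam2, mu1 - mu2)
    return sin_theta, bound

if __name__ == "__main__":
    print(davis_kahan_demo(0.25))
```

For any δ > 0 the perturbation has operator norm √2·δ while both gaps equal 2δ, so the right-hand side of (9.118) equals √2 and the bound holds but is uninformative, matching the 45 degree rotation observed above.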
A few remarks:

• The statement of the theorem presented here is in terms of the gap between the eigen-
values of either X or Y . This form (see [36]) will be useful in statistical applications,
as we often have control on the gaps of the signal matrix X. Other versions in the
literature state the upper bound in terms of gaps between eigenvalues of X and the
corresponding eigenvalues of Y . In this case, one can use Weyl’s inequality to pass to
the gap on the signal matrix X only.

• We stated the result for the top eigenvector. More general results can be found in the
literature (e.g. [36]), for intermediate eigenvalues and eigenspaces.

• Wedin’s theorems generalize Davis-Kahan to singular vectors rather than eigenvectors.

9.2 Principal Component Analysis


One of the most basic questions we may ask when analyzing high-dimensional data is
whether there is a direction (or several directions) along which the data varies more than
other directions. In many situations, this may be the case if data lives close to a low-
dimensional subspace. We may determine this direction/subspace from the sample covari-
ance matrix. When can this estimate reliably tell us about the direction of large variance
in the population?
To motivate the Spiked Covariance Model below, consider two data generating scenarios.
First is a model of data that lives close to a single direction u. For g ∼ N (0, Id ), Z ∼
N (0, σ 2 Id ), we have X = hg, uiu + Z and EXX T = uu T + σ 2 I. The second model is a

mixture model of two Gaussian populations with means at u and −u: X = εu + Z, where
ε ∈ {±1} is a Rademacher random variable. Here, again, EXX T = uu T + σ 2 Id . This
covariance structure, present in both models, is the subject of the following investigation.
To this end, consider the following simple model. We assume that $X_1, \ldots, X_n \in \mathbb{R}^d$
are centered i.i.d. random variables with $\mathbb{E} X_i X_i^T = \Sigma$ and $X_i \in \mathrm{subG}(\|\Sigma\|)$. Assume that
the population covariance matrix has the following structure, called the Spiked Covariance
Model:

$$ \Sigma = \lambda u u^T + I_d \tag{9.120} $$

for some fixed $u \in \mathbb{R}^d$, ‖u‖ = 1. The parameter λ ≥ 1 here determines the signal-to-noise
ratio, the strength of the “spike.” Clearly, the top eigenvector of Σ is $u_1 = u$, corresponding
to the eigenvalue 1 + λ. The question is whether this spike persists in the sample covariance
matrix Σ̂. To this end, we view Y = Σ̂ as a randomly perturbed observation of a signal
matrix X = Σ with E = Σ̂ − Σ, as in (9.116). Let $v_1$ be the leading eigenvector of Σ̂.
Since $-v_1$ is also an eigenvector of Σ̂, we can only determine closeness to $u_1$ up to a sign.
Observe that

$$ \min_{\varepsilon \in \{\pm 1\}} \|\varepsilon v_1 - u_1\|^2 = 2 - 2|\langle u_1, v_1 \rangle| \le 2 - 2\langle u_1, v_1 \rangle^2 = 2\sin^2(\angle(u_1, v_1)) \tag{9.121} $$

We also have that the gap $\lambda_1 - \lambda_2$ of the top two eigenvalues of Σ is λ, while ‖Σ‖ = 1 + λ.
Together with results of the previous lecture,

$$ \min_{\varepsilon \in \{\pm 1\}} \|\varepsilon v_1 - u_1\| \lesssim \frac{1 + \lambda}{\lambda} \max\left\{ \sqrt{\frac{d}{n}} + \sqrt{\frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \tag{9.122} $$

with probability at least 1 − 2 exp{−t}. When d is large compared to n, we may employ
the corresponding results with low effective rank, or sparsity.

9.3 Spectral Clustering and Stochastic Block Model


Suppose we have d vertices, subdivided into two equal-sized groups. For concreteness,
the first d/2 vertices belong to the first cluster. A random graph is constructed as follows:
independently, each pair within the community has an edge with probability p, and any pair
across the two communities has an edge with probability q < p. The resulting distribution
of the random graph is denoted by G(d, p, q). One of the questions we may ask is: If we
observe the random graph, but not identity of the vertices, can we recover the communities?
Note that the adjacency matrix A of the random graph is a random matrix with entries
0 and 1. We can also calculate its expected value as

$$ \mathbb{E}A = \begin{bmatrix} p J_{d/2} & q J_{d/2} \\ q J_{d/2} & p J_{d/2} \end{bmatrix}, $$

a matrix made up of four blocks, where $J_{d/2}$ is a d/2 × d/2 matrix of all 1’s.
We now view the observation Y = A as a noisy value of the signal matrix X = EA, as in
(9.116), with E = A − EA. Let us examine the eigenstructure of EA. The first (normalized)
eigenvector is $u_1 = \frac{1}{\sqrt{d}} \mathbf{1}$ and the corresponding eigenvalue is $\lambda_1 = d(p + q)/2$. This vector
is not informative. The second eigenvector is $u_2 = \frac{1}{\sqrt{d}} [1, \ldots, 1, -1, \ldots, -1]^T$, with the
corresponding eigenvalue $\lambda_2 = d(p - q)/2$ (the rest of the eigenvalues are 0). Interestingly,

this second eigenvector contains community memberships. Note that not knowing identity
of the vertices means that the rows/columns of A are renamed, or permuted. This only
permutes the corresponding coordinates of the eigenvectors. Hence, we have the hope that
the second eigenvector v 2 of A also contains the necessary information about the community
memberships. In what follows, we will show that clustering vertices into two communities
according to the sign of v 2 (i) for each vertex i leads to only a constant number of errors.
This algorithm is known as spectral clustering.
Since we are aiming to recover the second rather than first eigenvector, we need to
appeal to a more general version of Davis-Kahan, which has $\min\{\lambda_1 - \lambda_2, \lambda_2 - \lambda_3\}$ instead
of $\lambda_1 - \lambda_2$ in the denominator of (9.118):

$$ \min_{\varepsilon \in \{\pm 1\}} \|\varepsilon v_2 - u_2\| \le \sqrt{2} \sin(\angle(u_2, v_2)) \le \frac{2\sqrt{2}\, \|A - \mathbb{E}A\|}{\min\{\lambda_1 - \lambda_2,\ \lambda_2 - \lambda_3\}} \tag{9.123} $$

Now recall that from Lemma 21, with high probability, $\|A - \mathbb{E}A\| \lesssim \sqrt{d}$, where we use the fact
that each entry is sub-Gaussian (recall that we need to apply the lemma separately to upper
and lower triangular parts of the matrix). On the other hand,

$$ \min\{\lambda_1 - \lambda_2,\ \lambda_2 - \lambda_3\} = d \min\{q, (p - q)/2\}. $$

We now recall that $\sqrt{d}\, u_2$ is a vector of ±1’s, while $\sqrt{d}\, v_2(i) \in [-1, 1]$ for any coordinate
i. Then we have for any ε ∈ {±1},

$$ \|\varepsilon v_2 - u_2\|^2 = \frac{1}{d} \sum_{i=1}^d \left( \varepsilon \sqrt{d}\, v_2(i) - \sqrt{d}\, u_2(i) \right)^2 \ge \frac{1}{d} \sum_{i=1}^d \mathbf{1}\{\varepsilon\, \mathrm{sign}(v_2(i)) \ne \mathrm{sign}(u_2(i))\} \tag{9.124} $$

Thus, if we think of p, q as constants (that is, for d large enough), we have that the right-
hand side of (9.123) is of order $1/\sqrt{d}$, and by squaring both sides we get that for a constant
$C_{p,q}$,

$$ \sum_{i=1}^d \mathbf{1}\{\varepsilon\, \mathrm{sign}(v_2(i)) \ne \mathrm{sign}(u_2(i))\} \lesssim C_{p,q}, \tag{9.125} $$

i.e. only a constant number (out of d) of vertices are misclassified by the spectral clustering
algorithm which separates the nodes into two clusters according to the sign of the second
eigenvector of A.
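
The spectral clustering procedure described above can be sketched in a few lines. This is a minimal illustration rather than a tuned implementation: the function names, the parameter values (d = 60, p = 0.9, q = 0.3), and the use of plain power iteration with deflation to approximate the first two eigenvectors are all assumptions made here for the sake of a self-contained example.

```python
import math
import random

def sbm_adjacency(d, p, q, rng):
    """Symmetric 0/1 adjacency matrix; the first d/2 vertices form one community."""
    half = d // 2
    A = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            same = (i < half) == (j < half)
            if rng.random() < (p if same else q):
                A[i][j] = A[j][i] = 1.0
    return A

def normalize(x):
    nrm = math.sqrt(sum(t * t for t in x))
    return [t / nrm for t in x]

def leading_eigvec(A, rng, orth=None, iters=300):
    """Power iteration; if `orth` is given, iterates are kept orthogonal to it."""
    x = normalize([rng.gauss(0.0, 1.0) for _ in A])
    for _ in range(iters):
        if orth is not None:
            c = sum(a * b for a, b in zip(x, orth))
            x = [a - c * b for a, b in zip(x, orth)]
        x = normalize([sum(a * b for a, b in zip(row, x)) for row in A])
    return x

def spectral_clustering_errors(d=60, p=0.9, q=0.3, seed=0):
    """Number of vertices misclassified by the sign of the second eigenvector."""
    rng = random.Random(seed)
    A = sbm_adjacency(d, p, q, rng)
    v1 = leading_eigvec(A, rng)            # approximates the uninformative u_1
    v2 = leading_eigvec(A, rng, orth=v1)   # approximates u_2 up to sign
    truth = [1 if i < d // 2 else -1 for i in range(d)]
    guess = [1 if t > 0 else -1 for t in v2]
    mismatches = sum(g != t for g, t in zip(guess, truth))
    return min(mismatches, d - mismatches)  # labels are identified up to a global sign

if __name__ == "__main__":
    print(spectral_clustering_errors())
```

Power iteration converges here because, for these parameter choices, the top two eigenvalues of EA are well separated from each other and from the noise level; for p close to q the gap shrinks and more iterations (or a proper eigensolver) would be needed.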

10. UNIFORM LAWS OF LARGE NUMBERS: MOTIVATION

By now you have seen a number of finite-sample guarantees: estimation of a mean vector,
matrix estimation, constrained and unconstrained linear regression. In all the examples, the
key technical step was a control of the maximum of some collection of random variables.
Over the next few lectures, we will extend the toolkit to arbitrary classes of functions
and then apply it to questions of parametric and nonparametric estimation and statistical
learning.

10.1 Kolmogorov’s Goodness-of-Fit test
Given n independent draws of a real-valued random variable X, you may want to ask
whether it has a hypothesized distribution with cdf $F_0$. For instance, can you test the
hypothesis that heights of people are $N(63, 3^2)$ (in inches)? Of course, we can try to see if
the sample mean is “close” to the mean of the hypothesized distribution. We can also try
the median, or some quantiles. In fact, we can try to compare all the quantiles at once and
see if they match the quantiles of F0 . It turns out that comparing “all quantiles” is again
a question about control of a maximum of a collection of correlated random variables. We
will make this connection precise.
If you have taken a course on statistics, you might have seen several approaches to the
hypothesis testing problem of whether X has a given distribution. One classical approach
is the Kolmogorov-Smirnov test. Let

$$ F(\theta) = \mathbb{P}(X \le \theta) $$

be the cdf of X, and let

$$ F_n(\theta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} $$

be the empirical cdf obtained from n examples. While for a single θ, the random variable
$|F(\theta) - F_n(\theta)|$ converges to zero almost surely by the Law of Large Numbers, the analogous
convergence of

$$ D_n = \sup_{\theta} |F(\theta) - F_n(\theta)| $$

to zero (that is, convergence uniform in θ) is less clear since we have a maximum of an
uncountable collection of correlated random variables.
Nevertheless, the Glivenko-Cantelli Theorem (1933) states that

$$ D_n \to 0 \quad \text{a.s.} $$
Hence, given a candidate F , one can test whether X has distribution with cdf F , but
for this we need to know the (asymptotic) distribution of Dn . Assuming continuity of F ,
Kolmogorov (1933) showed that the distribution of Dn does not depend on the law of X,
and he calculated the asymptotic distribution (now known as the Kolmogorov distribution).
Without going into details, we can observe that F(X) has the cdf of a uniform random variable
supported on [0, 1], and this transformation does not change the supremum. Hence, it is
enough to calculate $D_n$ for the uniform distribution on [0, 1]. $D_n$ fluctuates on the order of
$1/\sqrt{n}$ and

$$ \sqrt{n}\, D_n \longrightarrow \sup_{\theta \in \mathbb{R}} |B(F(\theta))| $$

in distribution. Here B(x) is a Brownian bridge on [0, 1] (a continuous-time stochastic process with distri-
bution being Wiener process conditioned on being pinned to 0 at the endpoints).
In particular, Kolmogorov in his 1933 paper calculates the asymptotic distribution, as
well as a table of a few values. For instance, he states that

$$ \mathbb{P}\left( D_n \le 2.4/\sqrt{n} \right) \longrightarrow \text{approx } 0.999973. $$
In the spirit of this course, we will take a non-asymptotic approach to this problem. While
we might not obtain such sharp constants, the deviation inequalities will be valid for finite
n.
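
To make the statistic $D_n$ concrete, here is a small sketch that computes it for a sample and a hypothesized cdf. The function names `ks_statistic` and `normal_cdf` are hypothetical; the computation relies on the standard fact that, for a continuous cdf, the supremum is attained at (or just before) one of the sample points.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cdf of N(mu, sigma^2), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """D_n = sup_theta |F(theta) - F_n(theta)| for a continuous cdf F."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (i-1)/n to i/n at x, so check both sides of the jump
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

if __name__ == "__main__":
    heights = [61.2, 66.0, 63.5, 59.8, 64.1]  # made-up numbers, for illustration only
    print(ks_statistic(heights, lambda x: normal_cdf(x, 63.0, 3.0)))
```

For the uniform cdf F(x) = x on [0, 1] and the sample {0.1, 0.4, 0.7}, the largest gap occurs just after 0.7, where $F_n = 1$ and F = 0.7, so $D_n = 0.3$.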
We will now come to the same question of uniform deviations from a different angle –
Statistical Learning Theory.

10.2 Statistical Learning and Empirical Risk Minimization
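Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be n i.i.d. copies of a random variable $(X, Y) \in \mathcal{X} \times \mathcal{Y}$
with distribution $P = P_X \times P_{Y|X}$, where the X variable lives in some abstract space $\mathcal{X}$ and
$\mathcal{Y} \subseteq \mathbb{R}$. Fix a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ and a class of functions $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$. We
may think of $\mathcal{F}$ as a set of neural networks, or decision trees, or whatever model you may
have. Given the dataset S, the empirical risk minimization (ERM) method is defined as

$$ \hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i) $$

Examples:
• Linear regression: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$, $\mathcal{F} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, $\ell(a, b) = (a - b)^2$

• Linear classification: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{0, 1\}$, $\mathcal{F} = \{x \mapsto (\mathrm{sign}(\langle w, x \rangle) + 1)/2 : w \in B_2\}$,
$\ell(a, b) = \mathbf{1}\{a \ne b\}$

We now define expected loss (error) as

$$ L(f) = \mathbb{E}_{(X,Y)}\, \ell(f(X), Y) \tag{10.126} $$

and empirical loss (error) as

$$ \hat{L}(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i). \tag{10.127} $$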

A central question in Statistical Learning is: what is an upper bound on the expected error
of ERM?

Lemma 23. For any class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$, the ERM f̂ satisfies

$$ \mathbb{E}\left[ L(\hat{f}) \right] - \inf_{f \in \mathcal{F}} L(f) \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}(f) \right) \right], \tag{10.128} $$

where the expectations are with respect to S.

Proof. Suppose without loss of generality that the infimum is attained at some $f^* \in \mathcal{F}$, i.e.
$L(f^*) = \inf_{f \in \mathcal{F}} L(f)$. The following decomposition holds:

$$ L(\hat{f}) - L(f^*) = \left[ L(\hat{f}) - \hat{L}(\hat{f}) \right] + \left[ \hat{L}(\hat{f}) - \hat{L}(f^*) \right] + \left[ \hat{L}(f^*) - L(f^*) \right]. $$

By definition of ERM, the second term is nonpositive. Since $f^*$ is independent of the random
sample, the third term is a difference between an average of random variables $\ell(f^*(X_i), Y_i)$
and their expectation. Hence, this term is zero-mean, and its fluctuations can be controlled
with the tail bounds we have seen in class. The first term, however, is generally not zero
in expectation, i.e. $\mathbb{E}_S \hat{L}(\hat{f}) \ne \mathbb{E}_S L(\hat{f})$ (why?). Let us proceed by taking expectation (with
respect to S) of both sides:

$$ \mathbb{E}\left[ L(\hat{f}) - L(f^*) \right] \le \mathbb{E}\left[ L(\hat{f}) - \hat{L}(\hat{f}) \right] \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}(f) \right) \right]. \tag{10.129} $$

Here we “removed the hat” on f̂ by “supping out” this data-dependent choice. We are only
using the knowledge that f̂ ∈ $\mathcal{F}$, and nothing else about the method. We will see later
that for “curved” loss functions, such as square loss, the supremum can be further localized
within $\mathcal{F}$. Note that (10.128) can lead to a vacuous (e.g. infinite) upper bound: one such
example is linear unconstrained regression.
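
To illustrate the definition of ERM on a concrete class, the sketch below minimizes the empirical indicator loss over one-dimensional threshold classifiers x ↦ 1{x ≤ θ}. The choice of this class, the toy data, and the function names are assumptions made for illustration. Since the empirical loss only changes when θ crosses a data point, it suffices to search over the data points and −∞.

```python
def empirical_risk(theta, xs, ys):
    """Empirical 0-1 loss of the classifier x -> 1{x <= theta}."""
    mistakes = sum((1 if x <= theta else 0) != y for x, y in zip(xs, ys))
    return mistakes / len(xs)

def erm_threshold(xs, ys):
    """Return a threshold minimizing the empirical risk over theta in R."""
    # The empirical risk is piecewise constant in theta, changing only at
    # data points, so these n + 1 candidates cover every achievable labeling.
    candidates = [float("-inf")] + sorted(xs)
    return min(candidates, key=lambda th: empirical_risk(th, xs, ys))

if __name__ == "__main__":
    xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
    ys = [1, 1, 1, 0, 0, 0]
    theta_hat = erm_threshold(xs, ys)
    print(theta_hat, empirical_risk(theta_hat, xs, ys))
```

On this toy sample the labels are perfectly separated by any threshold in [0.3, 0.6), so the ERM attains zero empirical loss; the lemma above concerns how far its expected loss can be from the best in the class.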

10.3 Example: Classification with thresholds


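We now specialize to the classification scenario with indicator loss $\ell(a, b) = \mathbf{1}\{a \ne b\}$.
Observe that $\mathbf{1}\{a \ne b\} = a + (1 - 2a)b$ for a, b ∈ {0, 1}. Hence, by taking a = Y and
b = f(X),

$$ \begin{aligned} \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}(f) \right) \right] &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}(Y + (1 - 2Y) f(X)) - \frac{1}{n} \sum_{i=1}^n (Y_i + (1 - 2Y_i) f(X_i)) \right) \right] \\ &= \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}((1 - 2Y) f(X)) - \frac{1}{n} \sum_{i=1}^n (1 - 2Y_i) f(X_i) \right) \right] \end{aligned} $$

where the second equality holds because $\mathbb{E}Y - \frac{1}{n}\sum_{i=1}^n Y_i$ does not depend on f and has zero mean.
Observe that (1 − 2Y) is a random sign that is jointly distributed with X. Let us omit this
random sign for a moment, and consider

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \right]. \tag{10.130} $$

Over the next few lectures, we will develop upper bounds on the above expected supremum
for any class $\mathcal{F}$. For now, let us gain a bit more intuition about this object by looking at a
particular class of 1D thresholds:

$$ \mathcal{F} = \{x \mapsto \mathbf{1}\{x \le \theta\} : \theta \in \mathbb{R}\}. $$

Substituting this choice, (10.130) becomes

$$ \mathbb{E}\left[ \sup_{\theta \in \mathbb{R}} \left( \mathbb{P}(X \le \theta) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} \right) \right] = \mathbb{E} \sup_{\theta \in \mathbb{R}} \left[ F(\theta) - F_n(\theta) \right], \tag{10.131} $$

which is precisely the quantity from the beginning of the lecture (albeit without absolute
values and in expectation). Again, (10.131) is the expected largest pointwise (and one-
sided) distance between the CDF and empirical CDF. Does it go to zero as n → ∞? How
fast?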
Let’s introduce the shorthand

$$ U_\theta = \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\}. $$

$\{U_\theta\}_{\theta \in \mathbb{R}}$ is an uncountable collection of correlated random variables, so how does the max-
imum behave? We have already encountered the question in the context of linear forms
⟨X, θ⟩, indexed by $\theta \in B_2$, and we were able to use a covering argument to control the
expected supremum. Recall the key step in that proof: we can introduce a cover $\theta_1, \ldots, \theta_N$
such that control of $\sup_\theta U_\theta$ can be reduced to control of $\max_{j=1,\ldots,N} U_{\theta_j}$. Does this idea
work here? Problems with this approach start appearing immediately: how do we cover R
by a finite collection?
In the next two sections, we present two approaches for upper-bounding (10.131); both
extend to the general case of (10.130).

10.4 Approach 1: Bracketing
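While we cannot provide a finite ε-grid of R directly, we observe that we should be placing
the covering elements according to the underlying measure P. Informally, $U_\theta$ is likely to be
constant over regions of θ with small mass.
For simplicity assume that P does not have atoms, and let $\theta_1, \theta_2, \ldots, \theta_N$ (with $\theta_0 =
-\infty$, $\theta_{N+1} = +\infty$) correspond to the quantiles: $\mathbb{P}(\theta_i \le X \le \theta_{i+1}) = \frac{1}{N+1}$. For a given θ,
let u(θ) and ℓ(θ) denote, respectively, the upper and lower elements corresponding to the
discrete collection $\theta_0, \ldots, \theta_{N+1}$. Then, trivially,

$$ \begin{aligned} \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} &\le \mathbb{E}\mathbf{1}\{X \le u(\theta)\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \ell(\theta)\} \\ &\le \mathbb{E}\mathbf{1}\{X \le \ell(\theta)\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \ell(\theta)\} + \frac{1}{N+1} \end{aligned} $$

and thus

$$ \mathbb{E}\left[ \sup_{\theta \in \mathbb{R}} \left( \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} \right) \right] \le \frac{1}{N+1} + \mathbb{E} \max_{j \in \{0, \ldots, N\}} \left( \mathbb{E}\mathbf{1}\{X \le \theta_j\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta_j\} \right) $$

Now, each random variable $\mathbb{E}\mathbf{1}\{X \le \theta\} - \mathbf{1}\{X_i \le \theta\}$ is centered and 1-subGaussian. Hence,
for each j, $U_{\theta_j}$ is $\frac{1}{\sqrt{n}}$-subGaussian, and the expected maximum is at most $\sqrt{\frac{2\log(N+1)}{n}}$. The
overall upper bound is then

$$ \frac{1}{N+1} + \sqrt{\frac{2\log(N+1)}{n}} = O\left( \sqrt{\frac{\log n}{n}} \right) $$

if we choose, for instance, N + 1 = n.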


Before presenting an alternative to this approach, we state a general lemma.

10.5 The Symmetrization Lemma


An alternative is a powerful technique that replaces the expected value by a ghost sample.
To motivate the technique, recall the following inequality for variance:

$$ \mathbb{E}(X - \mathbb{E}X)^2 \le \mathbb{E}(X - X')^2 = 2\,\mathbb{E}(X - \mathbb{E}X)^2 $$

where X′ is an independent copy of X.
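
For completeness, the equality in the variance identity above can be verified by a short expansion. The following worked computation is a standard argument, written out here as a reminder; it uses only that X′ is independent of X and has the same distribution.

```latex
\begin{aligned}
\mathbb{E}(X - X')^2
  &= \mathbb{E}\big( (X - \mathbb{E}X) - (X' - \mathbb{E}X') \big)^2 \\
  &= \mathbb{E}(X - \mathbb{E}X)^2 + \mathbb{E}(X' - \mathbb{E}X')^2
     - 2\,\mathbb{E}\big[ (X - \mathbb{E}X)(X' - \mathbb{E}X') \big] \\
  &= 2\,\mathbb{E}(X - \mathbb{E}X)^2 ,
\end{aligned}
```

where the cross term vanishes because X − EX and X′ − EX′ are independent and centered.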

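Lemma 24. Let $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$ be a class of real-valued functions. Let $X, X_1, \ldots, X_n$
be i.i.d. random variables with values in $\mathcal{X}$, and let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. Rademacher ran-
dom variables. Then

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \right] \le 2\,\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right]. $$

We also have

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f(X) \right) \right] \le 2\,\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right] $$

and

$$ \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f(X) \right| \le 2\,\mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right|. $$

Furthermore, the opposite direction holds:

$$ \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \le 2\,\mathbb{E} \sup_{f \in \mathcal{F}} \left| \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right| + \frac{1}{\sqrt{n}} \sup_{f \in \mathcal{F}} |\mathbb{E} f| $$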

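Proof. For the first statement, we introduce an i.i.d. sample $X_1', \ldots, X_n'$ with the same
distribution as X. Observe that $\mathbb{E} \frac{1}{n} \sum_{i=1}^n f(X_i') = \mathbb{E} f(X)$. Hence,

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \right] = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n f(X_i') \right] - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \right]. \tag{10.132} $$

By Jensen’s inequality, the last expression is at most

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( f(X_i') - f(X_i) \right) \right] \tag{10.133} $$

where the expectation is over $X_{1:n}, X_{1:n}'$. Now, since the distribution of $f(X_i') - f(X_i)$ is the
same as the distribution of $-(f(X_i') - f(X_i))$, we can insert arbitrary signs $\varepsilon_i$ without
changing the expected value:

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \left( f(X_i') - f(X_i) \right) \right]. \tag{10.134} $$

Since the quantity is constant for all the choices of $\varepsilon_1, \ldots, \varepsilon_n$, we have the same value by
taking an expectation. We have

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) \right] \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \left( f(X_i') - f(X_i) \right) \right], \tag{10.135} $$

where the $\varepsilon_i$’s are now Rademacher random variables. Breaking up the supremum into two
terms leads to an upper bound

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i') \right] + \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n -\varepsilon_i f(X_i) \right] \tag{10.136} $$

$$ = 2\,\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right] \tag{10.137} $$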

by the symmetry of Rademacher random variables. The second and third statement follow
from the same argument. For the last part,
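
$$ \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right| \le \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f(X_i) - \mathbb{E} f) \right| + \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i\, \mathbb{E} f \right|. $$

Consider the first term on the RHS:

$$ \begin{aligned} \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f(X_i) - \mathbb{E} f) \right| &\le \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f(X_i) - f(X_i')) \right| \\ &= \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n (f(X_i) - \mathbb{E} f + \mathbb{E} f - f(X_i')) \right| \\ &\le \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n (\mathbb{E} f - f(X_i)) \right| + \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n (f(X_i) - \mathbb{E} f) \right|, \end{aligned} $$

where the last step uses the triangle inequality and the fact that $X_i'$ has the same distribution as $X_i$.
As for the second term,

$$ \mathbb{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i\, \mathbb{E} f \right| \le \sup_{f \in \mathcal{F}} |\mathbb{E} f| \cdot \mathbb{E} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \right| \tag{10.138} $$

and $\mathbb{E} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \right| \le \frac{1}{\sqrt{n}}$ by Jensen’s inequality.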

10.6 Approach 2: Symmetrization


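We now illustrate the power of the symmetrization lemma for the case of thresholds. Recall
that our goal is to upper bound

$$ \mathbb{E}\left[ \sup_{\theta \in \mathbb{R}} \left( \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} \right) \right]. $$

From Lemma 24, this expected supremum is upper bounded by

$$ 2\,\mathbb{E} \sup_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \mathbf{1}\{X_i \le \theta\}. $$

Let us condition on $X_1, \ldots, X_n$ and think of the random variables

$$ V_\theta = \frac{1}{n} \sum_{i=1}^n \varepsilon_i \mathbf{1}\{X_i \le \theta\} $$

as a function of the Rademacher random variables. How many truly distinct $V_\theta$’s do we
have? Since $X_1, \ldots, X_n$ are now fixed, there are only at most n + 1 choices (say, midpoints
between datapoints), and so the last expression is

$$ 2\,\mathbb{E}\left[ \mathbb{E}\left[ \sup_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \mathbf{1}\{X_i \le \theta\} \,\middle|\, X_{1:n} \right] \right] = 2\,\mathbb{E}\,\mathbb{E}\left[ \max_{\theta \in \{\theta_1, \ldots, \theta_{n+1}\}} V_\theta \,\middle|\, X_{1:n} \right] \tag{10.139} $$

Since each $V_\theta$ is $\frac{1}{\sqrt{n}}$-subGaussian conditionally on $X_{1:n}$, we get an overall upper bound of

$$ 2\sqrt{\frac{2\log(n+1)}{n}} $$

which, up to constants, matches the bound with the bracketing approach.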
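
The counting argument above also gives a direct way to evaluate the conditional supremum numerically: once the data points are sorted (and assumed distinct, as when P has no atoms), the supremum over θ is the largest prefix sum of the signs, with the empty prefix corresponding to θ below all data points. The sketch below, with hypothetical function names and a Monte Carlo estimate over the signs, compares the resulting Rademacher average with the finite-class bound $\sqrt{2\log(n+1)/n}$.

```python
import math
import random

def sup_threshold_rademacher(eps_sorted):
    """sup over theta of (1/n) * sum_i eps_i 1{X_i <= theta}.

    `eps_sorted` lists the signs in increasing order of the (distinct) X_i.
    """
    n = len(eps_sorted)
    best = 0.0      # theta below every data point gives the empty sum
    running = 0
    for e in eps_sorted:
        running += e
        best = max(best, running / n)
    return best

def estimate_rademacher_average(n, trials, rng):
    """Monte Carlo estimate of E sup_theta V_theta over the random signs."""
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += sup_threshold_rademacher(eps)
    return total / trials

if __name__ == "__main__":
    n = 100
    est = estimate_rademacher_average(n, 2000, random.Random(0))
    print(est, math.sqrt(2 * math.log(n + 1) / n))
```

Because the signs are exchangeable, the actual values of the data points play no role here, only their ordering. The estimate is typically well below the bound, which is consistent with the finite-class lemma being an upper bound rather than an exact expression.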

10.7 Discussion
The bracketing and symmetrization approaches produced similar upper bounds for the case
of thresholds. We will see, however, that for more complex classes of functions, the two
approaches can give different results.
Of course, the symmetrization lemma can also be applied to the class of functions

{(x, y) 7→ (1 − 2y)f (x)}.

Since (1 − 2y) is {±1}-valued, the distribution of $(1 - 2Y_i)\varepsilon_i$ is also Rademacher. Hence,

$$ \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i (1 - 2Y_i) f(X_i) \right] = \mathbb{E}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \right]. \tag{10.140} $$

This justifies omitting (1 − 2Y) for binary classification in our earlier exposition. Hence, in
view of (10.129), the upper bounds we derived guarantee that for empirical risk minimization,

$$ \mathbb{E} L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \lesssim \sqrt{\frac{\log(n+1)}{n}}. $$
The power of symmetrization for studying the suprema of empirical processes has been
described in [11], who, in turn, attribute the technique to [15].

10.8 Empirical Processes


Let us also define an empirical process:

Definition 7. Let $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ and let $X, X_1, \ldots, X_n$ be i.i.d. The stochastic
process

$$ \nu_f = \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) $$

is called the empirical process indexed by $\mathcal{F}$.

We note that it is also customary to scale the empirical process as

$$ \nu_f = \sqrt{n} \left( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right) $$

Moreover, empirical process theory often employs the notation

$$ \nu_f = \sqrt{n}\,(P - P_n) f $$

where P is the distribution of X and $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ is the empirical measure. You may
also see the notation

$$ \sup_{f \in \mathcal{F}} |\nu_f| = \|P - P_n\|_{\mathcal{F}}. $$

We can view the supremum of the empirical process as the difference between the true and
empirical distributions when viewed through the lens of $\mathcal{F}$.

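Definition 8. A class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ is (weak) Glivenko-Cantelli with respect to P
if

$$ \sup_{f \in \mathcal{F}} \left| \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \right| \to 0 \tag{10.141} $$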

in probability (and strong Glivenko-Cantelli for almost sure convergence; these are equiv-
alent under certain boundedness assumptions).

Note: there do exist classes that are not Glivenko-Cantelli. These classes are, in a certain
sense, very rich, and both learning and uniform GC property fail. One trivial example is

F = {1 {S} : |S| = m, m ≥ 1},

indicators of discrete sets of arbitrary size, where P is absolutely continuous with respect to
Lebesgue measure. Another example is the class of bounded continuous functions on [0, 1] with respect
to, say, Lebesgue measure.

11. SUPREMA OF GAUSSIAN AND SUBGAUSSIAN PROCESSES

Definition 9. A stochastic process $(U_\theta)_{\theta \in \Theta}$, indexed by θ ∈ Θ, is a collection of random
variables on a common probability space.

The index θ can be “time,” but we will be primarily interested in cases where Θ has
some metric structure.
We will be interested in the behavior of the supremum of the stochastic process, and in
particular its expected value:

$$ \mathbb{E} \sup_{\theta \in \Theta} U_\theta. $$

To understand this object, we need to have a sense of the dependence structure of $U_\theta$ and
$U_{\theta'}$ for a pair of parameters, but also of the metric structure of Θ.
A Gaussian process is a collection of random variables such that any finite collection
$U_{\theta_1}, \ldots, U_{\theta_n}$, for any n ≥ 1, is zero-mean and jointly Gaussian. In this case

$$ \mathbb{E} \exp\{\lambda(U_\theta - U_{\theta'})\} = \exp\{\lambda^2 d(\theta, \theta')^2 / 2\} $$

with $d(\theta, \theta')^2 = \mathbb{E}(U_\theta - U_{\theta'})^2$. Hence, there is a natural metric for a Gaussian process.

11.1 SubGaussian Processes

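Definition 10. A stochastic process $(U_\theta)_{\theta \in \Theta}$ is sub-Gaussian with respect to a metric d
on Θ if $U_\theta$ is zero-mean and

$$ \forall \theta, \theta' \in \Theta,\ \lambda \in \mathbb{R}, \quad \mathbb{E} \exp\{\lambda(U_\theta - U_{\theta'})\} \le \exp\{\lambda^2 d(\theta, \theta')^2 / 2\} $$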

The main examples we will be studying have a particular linearly parametrized form:

Gaussian process: Let $G_\theta = \langle g, \theta \rangle$, $g = (g_1, \ldots, g_n)$, $g_i \sim N(0, 1)$ i.i.d. Take $d(\theta, \theta') =
\|\theta - \theta'\|$. Then

$$ G_\theta - G_{\theta'} = \langle g, \theta - \theta' \rangle \sim N(0, \|\theta - \theta'\|^2) $$

In particular, this Gaussian process is also, trivially, sub-Gaussian with respect to the
Euclidean distance on Θ.

Rademacher process: Let $R_\theta = \langle \varepsilon, \theta \rangle$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$, $\varepsilon_i$ i.i.d. Rademacher. Again,
take $d(\theta, \theta') = \|\theta - \theta'\|$. Then

$$ R_\theta - R_{\theta'} = \langle \varepsilon, \theta - \theta' \rangle $$

is subGaussian with parameter $\|\theta - \theta'\|^2$.
Note that in this linear parametrization of $U_\theta$, the expected supremum can be seen as
a kind of average ‘width’ of the set Θ.

Definition 11. We will call

$$ \hat{R}(\Theta) = \mathbb{E} \sup_{\theta \in \Theta} \langle \varepsilon, \theta \rangle $$

the (empirical) Rademacher averages of Θ. The corresponding expected supremum of
the Gaussian process will be called the Gaussian averages or the Gaussian width of Θ
and denoted by $\hat{G}(\Theta)$.

11.1.1 A few examples


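Let $U_\theta = \langle \varepsilon, \theta \rangle$, $\Theta \subset \mathbb{R}^n$, and take Euclidean distance as the metric. Let $B_p^n$ denote the unit
$\ell_p$ ball in $\mathbb{R}^n$. We have

$$ \hat{R}(B_\infty^n) = \mathbb{E} \sup_{\theta \in B_\infty^n} U_\theta = \mathbb{E} \sup_{\theta \in B_\infty^n} \langle \varepsilon, \theta \rangle = n. $$

To get a sublinear growth in n, we have to make sure Θ is significantly smaller than $B_\infty^n$.
A few other sets:

$$ \hat{R}(B_2^n) = \mathbb{E} \sup_{\theta \in B_2^n} \langle \varepsilon, \theta \rangle = \mathbb{E} \|\varepsilon\|_2 = \sqrt{n} $$

and

$$ \hat{G}(B_2^n) \le \sqrt{n}. $$

However, we observe that

$$ \hat{R}(B_1^n) = \mathbb{E} \sup_{\theta \in B_1^n} \langle \varepsilon, \theta \rangle = \mathbb{E} \|\varepsilon\|_\infty = 1 $$

while for the Gaussian process,

$$ \hat{G}(B_1^n) = \mathbb{E} \sup_{\theta \in B_1^n} \langle g, \theta \rangle = \mathbb{E} \max_{i \in [n]} |g_i| \le \sqrt{2\log(2n)}. $$

In fact, this discrepancy between the Rademacher and Gaussian averages for $B_1^n$ is the worst
that can happen and for any Θ

$$ \hat{R}(\Theta) \lesssim \hat{G}(\Theta) \lesssim \sqrt{\log n} \cdot \hat{R}(\Theta). \tag{11.142} $$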

Furthermore, the discrepancy is only there because $B_1^n$ has a small $\ell_1$ diameter, and for
many of the applications in statistics, we will work with a function class that will not have
such a small $\ell_1$ diameter.
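The three ball computations above can be checked by brute force for a small $n$. The following is a minimal sketch of ours, not part of the lecture: it uses the standard fact that the supremum of a linear functional over a unit $\ell_p$ ball is the dual norm, and all function and variable names are hypothetical.

```python
import itertools
import math

def rademacher_average_of_ball(n, dual_norm):
    """Exact E sup_{theta in B} <eps, theta> = E dual_norm(eps), averaging over all 2^n sign vectors."""
    total = 0.0
    for eps in itertools.product((-1.0, 1.0), repeat=n):
        total += dual_norm(eps)
    return total / 2 ** n

def l1_norm(v):    # dual norm of the l_infinity ball
    return sum(abs(x) for x in v)

def l2_norm(v):    # the l_2 ball is self-dual
    return math.sqrt(sum(x * x for x in v))

def linf_norm(v):  # dual norm of the l_1 ball
    return max(abs(x) for x in v)

n = 8
r_inf = rademacher_average_of_ball(n, l1_norm)    # should equal n
r_two = rademacher_average_of_ball(n, l2_norm)    # should equal sqrt(n)
r_one = rademacher_average_of_ball(n, linf_norm)  # should equal 1
print(r_inf, r_two, r_one)
```

Enumeration costs $2^n$ evaluations, so this is only meant for small $n$; the point is that the ordering $1 \le \sqrt{n} \le n$ mirrors the inclusion $B_1^n \subset B_2^n \subset B_\infty^n$.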
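For a singleton,
\[ \widehat{\mathcal{R}}(\{\theta\}) = 0 \]
while for the vector $1_n = (1, \ldots, 1)$,
\[ \widehat{\mathcal{R}}(\{-1_n, 1_n\}) = \mathbb{E} \max\{\langle \varepsilon, 1_n\rangle, -\langle \varepsilon, 1_n\rangle\} = \mathbb{E} \Big| \sum_{i=1}^n \varepsilon_i \Big| \le \sqrt{n}. \]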

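Some further properties of both Rademacher and Gaussian averages:

\[ \widehat{\mathcal{R}}(\Theta) \lesssim \mathrm{diam}(\Theta) \sqrt{\log \mathrm{card}(\Theta)}, \]
\[ \widehat{\mathcal{R}}(\mathrm{conv}(\Theta)) = \widehat{\mathcal{R}}(\Theta), \]
\[ \widehat{\mathcal{R}}(c\Theta) = |c| \, \widehat{\mathcal{R}}(\Theta) \quad \text{for constant } c. \]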

11.2 Finite-class lemma and a single-scale covering argument

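Lemma 25. Let $d$ be a metric on $\Theta$ and assume $(U_\theta)$ is a sub-Gaussian process. Then
for any finite subset $A \subseteq \Theta \times \Theta$,
\[ \mathbb{E} \max_{(\theta,\theta')\in A} (U_\theta - U_{\theta'}) \le \max_{(\theta,\theta')\in A} d(\theta,\theta') \cdot \sqrt{2 \log \mathrm{card}(A)}. \tag{11.143} \]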
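To see the finite-class lemma in action, one can evaluate both sides of (11.143) exactly for a Rademacher process on a small, arbitrary finite set. This is an illustrative sketch under our own assumptions: the set `theta` is randomly generated with a fixed seed and has no meaning beyond the example, and the names are hypothetical.

```python
import itertools
import math
import random

def expected_max_increment(pairs, n):
    """Exact E max_{(t, s) in pairs} <eps, t - s> over all 2^n Rademacher sign vectors."""
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(sum(e * (a - b) for e, a, b in zip(eps, t, s)) for t, s in pairs)
    return total / 2 ** n

rng = random.Random(0)  # fixed seed so the illustration is reproducible
n = 8
theta = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(6)]
pairs = [(t, s) for t in theta for s in theta if t is not s]  # finite A with card(A) = 30

lhs = expected_max_increment(pairs, n)
max_dist = max(math.dist(t, s) for t, s in pairs)          # max Euclidean distance over A
rhs = max_dist * math.sqrt(2 * math.log(len(pairs)))       # right-hand side of (11.143)
print(lhs, rhs)
```

The Rademacher process is sub-Gaussian with respect to the Euclidean distance, so the left-hand side cannot exceed the right-hand side regardless of the seed.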

How do we go beyond finite cover?

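Definition 12. Let $(\Theta, d)$ be a metric space. A set $\theta_1, \ldots, \theta_N \in \Theta$ is a (proper) cover
of $\Theta$ at scale $\varepsilon$ if for any $\theta$ there exists $j \in [N]$ such that $d(\theta, \theta_j) \le \varepsilon$. The covering
number of $\Theta$ at scale $\varepsilon$ is the size of the smallest cover, denoted by $\mathcal{N}(\Theta, d, \varepsilon)$.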

As a simple consequence,

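Lemma 26. If $(U_\theta)_{\theta\in\Theta}$ is sub-Gaussian with respect to $d$ on $\Theta$, then for any $\delta > 0$,
\[ \mathbb{E} \sup_{\theta\in\Theta} U_\theta \le 2\, \mathbb{E} \sup_{d(\theta,\theta')\le\delta} (U_\theta - U_{\theta'}) + 2\, \mathrm{diam}(\Theta) \sqrt{\log \mathcal{N}(\Theta, d, \delta)}. \]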

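Proof. Observe that, for any fixed $\theta' \in \Theta$, since the process is zero-mean,

\[ \mathbb{E} \sup_{\theta\in\Theta} U_\theta = \mathbb{E} \sup_{\theta\in\Theta} (U_\theta - U_{\theta'}) \le \mathbb{E} \sup_{\theta,\theta'\in\Theta} (U_\theta - U_{\theta'}). \]

Let $\widehat{\Theta}$ be a $\delta$-cover of $\Theta$. Then for $\hat\theta, \hat\theta' \in \widehat{\Theta}$ with $d(\theta, \hat\theta), d(\theta', \hat\theta') \le \delta$,

\begin{align}
U_\theta - U_{\theta'} &= U_\theta - U_{\hat\theta} + U_{\hat\theta} - U_{\hat\theta'} + U_{\hat\theta'} - U_{\theta'} \tag{11.144} \\
&\le 2 \sup_{d(\theta,\theta')\le\delta} (U_\theta - U_{\theta'}) + \sup_{\hat\theta,\hat\theta'\in\widehat{\Theta}} (U_{\hat\theta} - U_{\hat\theta'}). \tag{11.145}
\end{align}

By Lemma 25 applied to $A = \widehat{\Theta} \times \widehat{\Theta}$, the last term is
\[ \mathbb{E} \sup_{\hat\theta,\hat\theta'\in\widehat{\Theta}} (U_{\hat\theta} - U_{\hat\theta'}) \le \mathrm{diam}(\Theta) \sqrt{2 \log(\mathrm{card}(\widehat{\Theta})^2)} = 2\, \mathrm{diam}(\Theta) \sqrt{\log \mathrm{card}(\widehat{\Theta})}. \]
Taking $\widehat{\Theta}$ to be a smallest $\delta$-cover gives the claim.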

11.3 Example: Rademacher/Gaussian processes


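Let $U_\theta = \langle g, \theta\rangle$ or $\langle \varepsilon, \theta\rangle$, $\Theta \subset \mathbb{R}^n$, and take Euclidean distance as the metric. Then

\[ \mathbb{E} \sup_{d(\theta,\theta')\le\delta} (U_\theta - U_{\theta'}) \le \mathbb{E} \sup_{\|\theta\|\le\delta} \langle g, \theta\rangle \le \delta\, \mathbb{E}\|g\| \le \delta\sqrt{n}. \]

Hence,
\[ \mathbb{E} \sup_{\theta\in\Theta} U_\theta \le 2\delta\sqrt{n} + 2\, \mathrm{diam}(\Theta) \sqrt{\log \mathcal{N}(\Theta, \|\cdot\|_2, \delta)}. \tag{11.146} \]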

Roughly speaking, the supremum over Θ can be upper bounded by the supremum within a
ball of radius δ (“local complexity”) and the maximum over a finite collection of centers of
δ-balls. We will see this decomposition/idea again within the context of optimal estimators
with general (possibly nonparametric) classes of functions.
Is (11.146) a tight upper bound? To investigate this question, consider two examples.

First is the example of $\Theta = B_2^n$. In this case, (5.69) gives an upper bound of $\sqrt{n}$ and a
multiplicative-cover approach of (7.104) recovers this up to constant factors (here $n$ is the
dimensionality rather than $d$). We see that the same guarantee can be achieved by (11.146)
by taking $\delta$ a constant.
The next example, however, brings bad news: (11.146) is not necessarily tight. Consider

\[ \Theta = \{(0, \ldots, 0), (1, 0, \ldots, 0), (1, 1, 0, \ldots, 0), \ldots, (1, \ldots, 1)\}. \tag{11.147} \]

In this case, the expected supremum in (11.146) is $O(\sqrt{n})$, as we shall see soon, but this
cannot be recovered from the upper bound. To establish a guarantee, we need to take $\delta$
to be constant, yet the diameter of $\Theta$ is $\sqrt{n}$ while the covering number of $\Theta$ at a constant
scale must grow with $n$. We will soon see that (11.146) can lead to suboptimal rates.
Notice that we have seen the set $\Theta$ in (11.147) earlier: it corresponds to the $n + 1$
effective “signatures” of threshold functions when $x_1, \ldots, x_n$ are fixed (see (10.139)).
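The gap can be made concrete for a small $n$. The sketch below, which is our own illustration rather than the lecture's, computes the expected supremum over the set (11.147) exactly (it is the expected maximum of the partial sums of a random walk) and compares it with the right-hand side of (11.146) minimized over a grid of $\delta$. The function `cover_size` counts one explicit cover, which only upper-bounds the covering number; all names are hypothetical.

```python
import itertools
import math

def expected_sup_thresholds(n):
    """Exact E max_{0 <= k <= n} (eps_1 + ... + eps_k), enumerating all 2^n sign vectors."""
    total = 0
    for eps in itertools.product((-1, 1), repeat=n):
        partial, best = 0, 0          # k = 0 corresponds to the zero vector
        for e in eps:
            partial += e
            best = max(best, partial)
        total += best
    return total / 2 ** n

def cover_size(n, delta):
    """Size of one explicit proper delta-cover, using ||theta_j - theta_k||_2 = sqrt(|j - k|)."""
    reach = math.floor(delta ** 2)    # a center covers indices within `reach` of its own
    return math.ceil((n + 1) / (2 * reach + 1))

def single_scale_bound(n, delta):
    """Right-hand side of (11.146) with diam(Theta) = sqrt(n)."""
    return 2 * delta * math.sqrt(n) + 2 * math.sqrt(n) * math.sqrt(math.log(cover_size(n, delta)))

n = 12
exact = expected_sup_thresholds(n)
best_bound = min(single_scale_bound(n, 0.25 * j) for j in range(1, 4 * n))
print(exact, math.sqrt(n), best_bound)
```

For this set the exact value stays below $\sqrt{n}$, while the best single-scale bound on the grid is several times larger.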

11.4 Chaining

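Theorem 2. Let $(U_\theta)_{\theta\in\Theta}$ be a sub-Gaussian stochastic process with respect to a metric
$d$. Let $D = \mathrm{diam}(\Theta)$. Then for any $\delta \in [0, D]$,

\[ \mathbb{E} \sup_{\theta\in\Theta} U_\theta \le 2\, \mathbb{E} \sup_{d(\theta,\theta')\le\delta} (U_\theta - U_{\theta'}) + 8\sqrt{2} \int_{\delta/4}^{D/2} \sqrt{\log \mathcal{N}(\Theta, d, \varepsilon)}\, d\varepsilon. \tag{11.148} \]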

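Proof. Let $\Theta_j$ be a cover of $\Theta$ at scale $2^{-j}D$. We have $\mathrm{card}(\Theta_0) = 1$. Let

\[ N = \min\{j : 2^{-j}D \le \delta\} \]

(which means $2^{-N}D \le \delta \le 2^{-(N-1)}D$) and $\mathrm{card}(\Theta_N) = \mathcal{N}(\Theta, d, 2^{-N}D) \ge \mathcal{N}(\Theta, d, \delta)$. As
before, we start with a single (finest-scale) cover:

\[ \mathbb{E} \sup_{\theta\in\Theta} U_\theta \le 2\, \mathbb{E} \sup_{d(\theta,\theta')\le\delta} (U_\theta - U_{\theta'}) + \mathbb{E} \sup_{\theta_N,\theta_N'\in\Theta_N} (U_{\theta_N} - U_{\theta_N'}). \]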

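For $\theta_N \in \Theta_N$,
\[ U_{\theta_N} = \sum_{i=1}^N \big( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \big) + U_{\theta_0} \tag{11.149} \]
where, recursively, we define $\theta_{i-1} = \pi_{i-1}(\theta_i)$ to be the element of $\Theta_{i-1}$ closest to $\theta_i$. The
sequence $\theta_0, \theta_1, \ldots, \theta_N$ is a “chain” linking an element of the covering to the corresponding
closest element at the coarser scale.
Let the corresponding chain for $\theta_N' \in \Theta_N$ be denoted by $\theta_0', \theta_1', \ldots, \theta_N'$. Since $\mathrm{card}(\Theta_0) = 1$, we have $\theta_0 = \theta_0'$, and so

\[ U_{\theta_N} - U_{\theta_N'} = \sum_{i=1}^N \big( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \big) - \sum_{i=1}^N \big( U_{\theta_i'} - U_{\pi_{i-1}(\theta_i')} \big) \]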

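and
\begin{align}
\mathbb{E} \max_{\theta,\theta'\in\Theta_N} (U_\theta - U_{\theta'}) &\le \sum_{i=1}^N \mathbb{E} \max_{\theta_i\in\Theta_i} \big( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \big) + \sum_{i=1}^N \mathbb{E} \max_{\theta_i'\in\Theta_i} \big( U_{\pi_{i-1}(\theta_i')} - U_{\theta_i'} \big) \tag{11.150} \\
&\le 2 \sum_{i=1}^N D 2^{-(i-1)} \sqrt{2 \log \mathcal{N}(\Theta, d, 2^{-i}D)} \tag{11.151} \\
&= 8 \sum_{i=1}^N D 2^{-(i+1)} \sqrt{2 \log \mathcal{N}(\Theta, d, 2^{-i}D)} \tag{11.152} \\
&\le 8 \sum_{i=1}^N \int_{2^{-(i+1)}D}^{2^{-i}D} \sqrt{2 \log \mathcal{N}(\Theta, d, \varepsilon)}\, d\varepsilon. \tag{11.153}
\end{align}

Observe that $2^{-(N+1)}D \ge \delta/4$, which concludes the proof.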
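For the threshold set (11.147), the entropy integral in (11.148) can be evaluated numerically. The following sketch is ours and makes two assumptions: the covering number is replaced by the size of one explicit cover (which can only enlarge the bound), and the integral with $\delta = 0$ is approximated by a midpoint rule. Function names are hypothetical.

```python
import math

def cover_size(n, eps):
    """Size of one explicit proper eps-cover of (11.147), using ||theta_j - theta_k||_2 = sqrt(|j - k|)."""
    reach = math.floor(eps ** 2)
    return math.ceil((n + 1) / (2 * reach + 1))

def dudley_bound(n, steps=20000):
    """8*sqrt(2) * integral_0^{D/2} sqrt(log N(eps)) d eps by the midpoint rule, with D = sqrt(n)."""
    upper = math.sqrt(n) / 2
    h = upper / steps
    integral = sum(math.sqrt(math.log(cover_size(n, (i + 0.5) * h))) for i in range(steps)) * h
    return 8 * math.sqrt(2) * integral

# Ratio of the chaining bound to sqrt(n) for a few sample sizes.
ratios = {n: dudley_bound(n) / math.sqrt(n) for n in (100, 1000, 10000)}
print(ratios)
```

The ratios stay bounded as $n$ grows, which suggests that chaining gives the $\sqrt{n}$ order for this set, up to a constant, without the extra logarithmic factor of the single-scale bound.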

[Figure: the curve $\sqrt{\log \mathcal{N}(\Theta, d, \varepsilon)}$ as a function of $\varepsilon$, with marks at $2^{-(i+1)}D$, $2^{-i}D$, and $D$.]

Figure 1: Illustration of the Dudley integral upper bound

Sudakov’s theorem gives a single-scale lower bound:

Theorem 3. For a Gaussian process $(U_\theta)_{\theta\in\Theta}$,
\[ C \sup_{\alpha\ge0} \alpha \sqrt{\log \mathcal{N}(\Theta, d, \alpha)} \le \mathbb{E} \sup_{\theta\in\Theta} U_\theta \]

for some constant C.

We can interpret this lower bound as the largest rectangle under the curve in Figure 1. This
lower bound can be tight in the applications we consider (whenever the sum of the areas of
the rectangles in Figure 1 is of the same order as the largest one).

12. RADEMACHER/GAUSSIAN AVERAGES FOR FUNCTION CLASSES

We have developed general machinery for upper- and lower-bounding the expected suprema
of sub-Gaussian processes, including Rademacher and Gaussian processes linearly parametrized
by a Θ ⊂ Rn . How are these results relevant to the problem of learning or estimation with
a class of functions F?
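The symmetrization lemma (Lemma 24) tells us that for a class of real-valued functions
$\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ we can upper bound the expected supremum of the empirical process
indexed by $\mathcal{F}$ in terms of the expected supremum of the Rademacher process:
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \Big[ \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \Big] \le 2\, \mathbb{E} \sup_{f\in\mathcal{F}} \Big[ \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big]. \]

The key is that we can now condition on $(X_1, \ldots, X_n) = (x_1, \ldots, x_n)$, and

\[ \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \Big[ \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \Big] \]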

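corresponds, up to a factor of $1/\sqrt{n}$, to Rademacher averages of the following indexing set $\Theta$. To see this
correspondence, let
\[ \Theta = \frac{1}{\sqrt{n}} \mathcal{F}|_{x_1,\ldots,x_n} = \Big\{ \frac{1}{\sqrt{n}} (f(x_1), \ldots, f(x_n)) : f \in \mathcal{F} \Big\} \subseteq \mathbb{R}^n, \tag{12.154} \]
a (scaled by $1/\sqrt{n}$) projection (or, restriction) of $\mathcal{F}$ onto $x_1, \ldots, x_n$. Take $d$ to be
\[ d(\theta, \theta')^2 = \|\theta - \theta'\|^2 = \frac{1}{n} \sum_{i=1}^n (f(x_i) - f'(x_i))^2 =: \|f - f'\|_n^2 \tag{12.155} \]
where $\theta = \frac{1}{\sqrt{n}}(f(x_1), \ldots, f(x_n))$ and $\theta' = \frac{1}{\sqrt{n}}(f'(x_1), \ldots, f'(x_n))$, $f, f' \in \mathcal{F}$. Note that $\|\cdot\|_n$ is a
pseudo-metric, as it can be zero for functions that differ outside the given data. With these
definitions, we write
\[ \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(X_i) = \mathbb{E}_\varepsilon \sup_{\theta\in\Theta} \langle \varepsilon, \theta\rangle. \]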

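Furthermore,
\[ \mathcal{N}(\Theta, \|\cdot\|_2, \alpha) = \mathcal{N}(\mathcal{F}, \|\cdot\|_n, \alpha). \]
Then Theorem 2 tells us that for any $\delta \in [0, D]$,
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i) \le 2\delta\sqrt{n} + 8\sqrt{2} \int_{\delta/4}^{D/2} \sqrt{\log \mathcal{N}(\mathcal{F}, \|\cdot\|_n, \alpha)}\, d\alpha. \]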

Let us formalize this as the following corollary.

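Corollary 2. Let $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$. For any $x_1, \ldots, x_n$,

\[ \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \le \inf_{\delta\ge0} \Big\{ 8\delta + \frac{12}{\sqrt{n}} \int_{\delta}^{D/2} \sqrt{\log \mathcal{N}(\mathcal{F}, \|\cdot\|_n, \alpha)}\, d\alpha \Big\} \]

where $D = \sup_{f,g\in\mathcal{F}} \|f - g\|_n \le 2 \sup_{f\in\mathcal{F}} \|f\|_n \le 2 \sup_{f\in\mathcal{F}} \|f\|_\infty$ and $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$.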

Putting together the symmetrization lemma and above Corollary, we have

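Corollary 3. Let $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ be a class of functions and let $X_1, \ldots, X_n \sim P$ be
independent. Then
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \Big\{ \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \Big\} \le \mathbb{E} \inf_{\delta\ge0} \Big\{ 16\delta + \frac{24}{\sqrt{n}} \int_{\delta}^{D} \sqrt{\log \mathcal{N}(\mathcal{F}, \|\cdot\|_n, \alpha)}\, d\alpha \Big\} \tag{12.156} \]
where $D = \sup_{f\in\mathcal{F}} \|f\|_n = \sup_{f\in\mathcal{F}} \sqrt{\frac{1}{n} \sum_{i=1}^n f(X_i)^2}$.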

Expectations on both sides are with respect to X1 , . . . , Xn . Note that the above results
hold for the absolute value of the empirical process if we replace log N by log 2N , and the
log 2 can be further absorbed into the multiplicative constant.
The Sudakov lower bound for the Gaussian process implies (together with the rela-
tionship between Rademacher and Gaussian processes) the following lower bound for the
Rademacher averages:

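Corollary 4. For any $X_1, \ldots, X_n$,

\[ \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \ge \frac{c}{\sqrt{\log n}} \cdot \sup_{\alpha\ge0} \alpha \sqrt{\frac{\log \mathcal{N}(\mathcal{F}, \|\cdot\|_n, \alpha)}{n}} \]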

for some absolute constant c.

We note that a version of the lower bound (for a particular choice of α) without the log-
arithmic factor is available, under some conditions, and it often matches the upper bound
(see a few pages below).

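Definition 13. Given $x_1, \ldots, x_n$ and a class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$,
\[ \widehat{\mathcal{R}}(\mathcal{F}) = \mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i) \tag{12.157} \]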

are called the (empirical) Rademacher averages of F.

Note that we will occasionally adopt the 1/n scaling to follow the literature.

13. COVERING AND PACKING

Given a probability measure $P$ on $\mathcal{X}$, we define

\[ \|f\|_{L^2(P)}^2 = \mathbb{E} f(X)^2 = \int f(x)^2 P(dx). \]

Similarly, for a given $X_1, \ldots, X_n$ we define a random pseudometric

\[ \|f\|_{L^2(P_n)}^2 = \frac{1}{n} \sum_{i=1}^n f(X_i)^2 = \|f\|_n^2. \]

Of course, the second definition is just a special case of the first for the empirical measure
$\frac{1}{n} \sum_{i=1}^n \delta_{X_i}$.

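Definition 14. An $\alpha$-net (or, $\alpha$-cover) of $\mathcal{F}$ with respect to $L^2(P)$ is a set of functions
$f_1, \ldots, f_N$ such that

\[ \forall f \in \mathcal{F},\ \exists j \in [N] \ \text{s.t.}\ \|f - f_j\|_{L^2(P)} \le \alpha. \]

The size of the smallest $\alpha$-net is denoted by $\mathcal{N}(\mathcal{F}, L^2(P), \alpha)$.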

The above definition can also be generalized to $L^r(P)$. Next, we spell out the above definition
specifically for the empirical measure $P_n$:

Definition 15. Let $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ be the empirical measure supported on $x_1, \ldots, x_n$.
A set $V = \{v_1, \ldots, v_N\}$ of vectors in $\mathbb{R}^n$ forms an $\alpha$-net (or, $\alpha$-cover) of $\mathcal{F}$ with respect
to $L^r(P_n)$ if
\[ \forall f \in \mathcal{F},\ \exists j \in [N] \ \text{s.t.}\ \frac{1}{n} \sum_{i=1}^n |f(x_i) - v_j(i)|^r \le \alpha^r. \]
The size of the smallest $\alpha$-net is denoted by $\mathcal{N}(\mathcal{F}, L^r(P_n), \alpha)$. Similarly, an $\alpha$-net (or,
$\alpha$-cover) with respect to $L^\infty(P_n)$ requires
\[ \forall f \in \mathcal{F},\ \exists j \in [N] \ \text{s.t.}\ \max_{i\in[n]} |f(x_i) - v_j(i)| \le \alpha. \]
The size of the smallest $\alpha$-net is denoted by $\mathcal{N}(\mathcal{F}, L^\infty(P_n), \alpha)$.

Observe that the elements of the cover V can be “improper,” i.e. they do not need to
correspond to values of some function on the data. However, one can go between proper
and improper covers at a cost of a constant (check!).
Second, observe that

\[ \mathcal{N}(\mathcal{F}, L^r(P_n), \alpha) \le \mathcal{N}(\mathcal{F}, L^q(P_n), \alpha) \]

for $r \le q$ since $\|f\|_{L^r(P_n)}$ is nondecreasing in $r$. Note that this is different for unweighted
metrics: e.g. $\|x\|_r$ is nonincreasing in $r$, and hence $\mathcal{N}(\Theta, \|\cdot\|_r, \alpha)$ is also nonincreasing in $r$.

Definition 16. An $\alpha$-packing of $\mathcal{F}$ with respect to $L^r(P_n)$ is a set $f_1, \ldots, f_N \in \mathcal{F}$ such
that
\[ \frac{1}{n} \sum_{i=1}^n |f_j(x_i) - f_k(x_i)|^r \ge \alpha^r \]
for any $j \ne k$. The size of the largest $\alpha$-packing is denoted by $\mathcal{D}(\mathcal{F}, L^r(P_n), \alpha)$.

A standard relationship between covering and packing holds for any $P$:

\[ \mathcal{D}(\mathcal{F}, L^r(P), 2\alpha) \le \mathcal{N}(\mathcal{F}, L^r(P), \alpha) \le \mathcal{D}(\mathcal{F}, L^r(P), \alpha). \]

In fact, this relationship is true for any metric.
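The upper inequality rests on the fact that a maximal packing is automatically a cover. A minimal sketch of this, under our own assumptions, builds a packing greedily for a hypothetical class $f_a(x) = |x - a|$ restricted to a grid of points and uses the $L^2(P_n)$ distance; the class, the grid, and all names are illustrative only.

```python
import math

def empirical_l2(u, v):
    """L2(P_n) distance between two vectors of function values on x_1, ..., x_n."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def greedy_net(vectors, alpha):
    """Keep a vector whenever it is more than alpha away from every vector kept so far."""
    centers = []
    for v in vectors:
        if all(empirical_l2(v, c) > alpha for c in centers):
            centers.append(v)
    return centers

# Hypothetical class for illustration: f_a(x) = |x - a| restricted to a grid of 21 points.
xs = [i / 20 for i in range(21)]
restricted = [[abs(x - a / 100) for x in xs] for a in range(101)]

alpha = 0.1
centers = greedy_net(restricted, alpha)
print(len(centers), "centers for", len(restricted), "functions")
```

By construction the kept vectors are pairwise more than $\alpha$ apart (a packing), and every discarded vector is within $\alpha$ of a kept one (a proper cover), so the size of the output lies between the covering and packing numbers at scale $\alpha$.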

14. PARAMETRIC AND NONPARAMETRIC CLASSES OF FUNCTIONS

There is no clear definition of what constitutes a “nonparametric class,” especially since
the same class of functions (e.g. neural networks) can be treated as either parametric or
nonparametric (e.g. if neural network complexity is measured by matrix norms rather than
number of parameters).
Consider the following (slightly vague) definition as a possibility:

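Definition 17. We will say that a class $\mathcal{F}$ is parametric if there is a constant $C$ and
a notion of dimension $\dim$ such that
\[ \sup_{P_n} \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim \Big( \frac{C}{\varepsilon} \Big)^{\dim}. \]

We will say that $\mathcal{F}$ is nonparametric if there is a $p > 0$ and $C$ such that

\[ \sup_{P_n} \log \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \asymp \Big( \frac{C}{\varepsilon} \Big)^{p}. \tag{14.158} \]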

The requirement that (14.158) holds for all measures Pn and values of n is quite strong.
Yet, we will show that as an upper bound, it is true for a variety of function classes.

However, one should keep in mind that there are also cases where dependence of the upper
bound on $n$ can lead to better overall estimates. The quantity

\[ \sup_{Q} \log \mathcal{N}(\mathcal{F}, L^2(Q), \varepsilon), \]

where the supremum is taken over all discrete measures, is called Koltchinskii-Pollard entropy.
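Let’s consider a “parametric” class $\mathcal{F}$ such that functions in $\mathcal{F}$ are uniformly bounded:
$\|f\|_\infty \le 1$. Uniform boundedness implies an upper bound on the diameter: $D/2 \le 1$. Then,
taking $\delta = 0$ in Corollary 2, conditionally on $X_1, \ldots, X_n$,
\begin{align*}
\mathbb{E}_\varepsilon \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) &\le \frac{12}{\sqrt{n}} \int_0^1 \sqrt{\log \mathcal{N}(\mathcal{F}, L^2(P_n), \alpha)}\, d\alpha \\
&\le \frac{12}{\sqrt{n}} \int_0^1 \sqrt{d \log(1/\alpha)}\, d\alpha \\
&\le c \sqrt{\frac{d}{n}}.
\end{align*}
Here it’s useful to note that
\[ \int_0^a \sqrt{\log(1/\alpha)}\, d\alpha \le \begin{cases} 2a\sqrt{\log(1/a)} & a \le 1/e \\ 2a & a > 1/e \end{cases} \]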
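The last inequality is easy to check numerically for a few values of $a \in (0, 1]$. The sketch below is ours; it approximates the integral with a midpoint rule (an approximation, not an exact evaluation), and the function names are hypothetical.

```python
import math

def entropy_integral(a, steps=100000):
    """Midpoint-rule approximation of integral_0^a sqrt(log(1/alpha)) d alpha for 0 < a <= 1."""
    h = a / steps
    return sum(math.sqrt(math.log(1.0 / ((i + 0.5) * h))) for i in range(steps)) * h

def stated_bound(a):
    """The two-case upper bound from the text."""
    if a <= 1 / math.e:
        return 2 * a * math.sqrt(math.log(1 / a))
    return 2 * a

checks = {a: (entropy_integral(a), stated_bound(a)) for a in (0.01, 0.1, 1 / math.e, 0.5, 1.0)}
for a, (integral, bound) in checks.items():
    print(f"a = {a:.4f}: integral ~ {integral:.4f}, bound = {bound:.4f}")
```

At $a = 1$ the integral equals $\Gamma(3/2) = \sqrt{\pi}/2$, comfortably below the bound of $2$.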

The following theorem is due to D. Haussler (an earlier version with exponent O(d) is
due to Dudley ’78):

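Theorem 4. Let $\mathcal{F} = \{f : \mathcal{X} \to \{0, 1\}\}$ be a class of binary-valued functions with VC
dimension $\mathrm{vc}(\mathcal{F}) = d$. Then for any $n$ and any $P_n$,
\[ \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \le C d (4e)^d \Big( \frac{1}{\varepsilon} \Big)^{2d}. \]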

We will explain what “VC dimension” means a bit later, and let’s just say here that the
class of thresholds has dimension 1 and the class of homogeneous linear classifiers in $\mathbb{R}^d$ has
dimension $d$. In particular, this removes the extraneous $\log(n + 1)$ factor we had in Lecture
14 when analyzing thresholds.

14.1 A phase transition


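Let us inspect the Dudley integral upper bound. Note that when we plug in
\[ \log \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim \Big( \frac{1}{\varepsilon} \Big)^{p}, \]
the integral becomes
\[ \int_{\delta}^{D/2} \varepsilon^{-p/2}\, d\varepsilon. \]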

If $p < 2$, the integral converges, and we can take $\delta = 0$. However, when $p > 2$, the lower
limit of the integral matters and we get an overall bound of the order
\[ \delta + n^{-1/2} \big[ \varepsilon^{1-p/2} \big]_{\delta}^{D/2} \le \delta + n^{-1/2} \delta^{1-p/2}. \]
By choosing $\delta$ to balance the two terms (and thus minimize the upper bound) we obtain
$\delta = n^{-1/p}$. Hence, for $p > 2$, the estimate on Rademacher averages provided by the Dudley
bound is
\[ \frac{1}{\sqrt{n}} \widehat{\mathcal{R}}(\mathcal{F}) \lesssim n^{-1/p}. \]
On the other hand, for $p < 2$, the Dudley entropy integral upper bound becomes (by setting
$\delta = 0$) on the order of
\[ n^{-1/2} D^{1-p/2} = O(n^{-1/2}), \]
yielding
\[ \frac{1}{\sqrt{n}} \widehat{\mathcal{R}}(\mathcal{F}) \lesssim n^{-1/2}. \]
We see that there is a transition at $p = 2$ in terms of the growth of Rademacher averages
(“elbow” behavior). The phase transition will be important in the rest of the course when
we study optimality of nonparametric least squares.
Remark that in the $p < 2$ regime, the rate $n^{-1/2}$ is the same CLT rate we would
have if we simply considered $\mathbb{E} \big| \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f \big|$ (or the average with random signs)
with a single function. Hence, the payment for the supremum over class $\mathcal{F}$ is only in a
constant that may depend on $\mathcal{F}$ but does not depend on $n$.

14.2 Single scale vs chaining


It is also worthwhile to compare the single-scale upper bound we obtained earlier to the
tighter upper bound given by chaining. In other words, we are comparing
\[ \delta + \sqrt{\frac{\log \mathcal{N}(\delta)}{n}} \]
versus
\[ \delta + \int_{\delta}^{D/2} \sqrt{\frac{\log \mathcal{N}(\varepsilon)}{n}}\, d\varepsilon, \]
simplifying the notation for brevity.
In the parametric case, the single-scale bound becomes (with the choice of $\delta = 1/n$)
\[ \sqrt{\frac{\dim \log n}{n}} \]
while chaining gives
\[ \sqrt{\frac{\dim}{n}}. \]
In the nonparametric case, the difference is more stark:
\[ \delta + \sqrt{\frac{\delta^{-p}}{n}} \asymp n^{-\frac{1}{2+p}} \]
vs
\[ n^{-1/2} \]
for $p < 2$, and
\[ \delta + \frac{\delta^{1-p/2}}{\sqrt{n}} \asymp n^{-1/p} \]
for $p > 2$.
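The two nonparametric exponents can be recovered numerically by minimizing each expression over $\delta$ and fitting the slope in $\log n$. This is a rough sketch of ours with all constants set to one; the grid search, the choice $p = 4$, and the function names are assumptions made for illustration.

```python
import math

def minimize_over_delta(g, grid_points=4000, lo=1e-10, hi=1.0):
    """Minimize g over a logarithmic grid of delta values in [lo, hi]."""
    best = float("inf")
    for j in range(grid_points + 1):
        delta = lo * (hi / lo) ** (j / grid_points)
        best = min(best, g(delta))
    return best

def fitted_exponent(bound, n_small=10 ** 4, n_large=10 ** 8):
    """Slope of log(min_delta bound(delta, n)) against log(n) between two sample sizes."""
    v_small = minimize_over_delta(lambda d: bound(d, n_small))
    v_large = minimize_over_delta(lambda d: bound(d, n_large))
    return (math.log(v_large) - math.log(v_small)) / (math.log(n_large) - math.log(n_small))

def single_scale(p):
    # delta + sqrt(delta^{-p} / n)
    return lambda d, n: d + math.sqrt(d ** (-p) / n)

def chaining_large_p(p):
    # delta + delta^{1 - p/2} / sqrt(n), relevant for p > 2
    return lambda d, n: d + d ** (1 - p / 2) / math.sqrt(n)

p = 4
slope_single = fitted_exponent(single_scale(p))     # close to -1/(2 + p)
slope_chain = fitted_exponent(chaining_large_p(p))  # close to -1/p
print(slope_single, slope_chain)
```

For $p = 4$ the fitted exponents are near $-1/6$ and $-1/4$, so chaining improves the rate even in the regime where the integral diverges at zero.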

14.3 Linear class: Parametric or Nonparametric?


Let’s take a closer look at the function class

\[ \mathcal{F} = \{x \mapsto \langle w, x\rangle : w \in B_2^d\} \]

and take $\mathcal{X} = B_2^d$. Recall that for a given $x_1, \ldots, x_n$,

\[ \mathcal{F}|_{x_1,\ldots,x_n} = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\} = \{Xw : w \in B_2^d\} \]

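where $X$ is the $n \times d$ data matrix. As we have seen, the key quantity we need to compute
is
\[ \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon). \]
What is a good upper bound for this quantity? What we had done earlier in the course was
to discretize the set $B_2^d$ to create an $\varepsilon$-net $w_1, \ldots, w_N$ of size $\mathcal{N}(B_2^d, \|\cdot\|_2, \varepsilon)$. Observe that for
any $w, w' \in B_2^d$,
\begin{align*}
\Big( \frac{1}{n} \sum_{i=1}^n (\langle w, x_i\rangle - \langle w', x_i\rangle)^2 \Big)^{1/2} &\le \max_{i\in[n]} |\langle w - w', x_i\rangle| \\
&\le \max_{x\in B_2^d} |\langle w - w', x\rangle| \\
&\le \|w - w'\|_2.
\end{align*}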

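This sequence of inequalities corresponds to

\[ \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \le \mathcal{N}(\mathcal{F}, L^\infty(P_n), \varepsilon) \le \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le \mathcal{N}(B_2^d, \|\cdot\|_2, \varepsilon), \tag{14.159} \]

where the sup-norm (or, pointwise over the domain) metric is $\|f - g\|_\infty = \sup_{x\in\mathcal{X}} |f(x) - g(x)|$. Recall that the covering number of $B_2^d$ satisfies
\[ \mathcal{N}(B_2^d, \|\cdot\|_2, \varepsilon) \le \Big( 1 + \frac{2}{\varepsilon} \Big)^{d}. \]
This gives a “parametric” growth of entropy

\[ \log \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim d \log(1 + 2/\varepsilon). \]

However, if $d$ is large or infinite, this bound is loose. We will show that it also holds that

\[ \log \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim \varepsilon^{-2}, \]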

which is a nonparametric behavior. Hence, the same class can be viewed as either parametric
or nonparametric. In fact, in the parametric behavior, it is not important that the domain

of $w$ is $B_2^d$ since we would expect a similar estimate for other sets (including $B_\infty^d$). In
contrast, it will be crucial in nonparametric estimates that the norm of $w$ is $\ell_2$-bounded.
Jumping ahead, we will study neural networks and show a similar phenomenon: we
can either count the number of neurons or connections (parameters) or we can calculate
nonparametric “norm-based” estimates by looking at the norms of the layers in the network.
It’s worth emphasizing again that (??) can lead to very loose bounds in high-dimensional
situations. A cover of function values on a finite set of data can be significantly smaller than
a cover with respect to the sup norm.

14.4 A more general result (Optional)


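We have that for any fixed function
\[ \mathbb{E} \Big| \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - \mathbb{E} f(X)) \Big| \le \mathrm{var}(f)^{1/2} = \|f - \mathbb{E} f\|_{L^2(P)}. \]

Obviously this implies

\[ \sup_{f\in\mathcal{F}} \mathbb{E} \Big| \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - \mathbb{E} f(X)) \Big| \le \sup_{f\in\mathcal{F}} \mathrm{var}(f)^{1/2} =: \sigma. \]

If we could ever prove

\[ \mathbb{E} \sup_{f\in\mathcal{F}} \Big| \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - \mathbb{E} f(X)) \Big| \le C(\mathcal{F}) \cdot \sigma, \]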

it would imply that we only paid C(F) for having a statement uniform in f ∈ F.
Next, rather than assuming that functions in $\mathcal{F}$ are uniformly bounded, it will be enough
to assume that they have an $L^2(P)$-integrable envelope $F$:

\[ F(x) = \sup_{f\in\mathcal{F}} |f(x)|. \]

Rather than assuming that $F(x) \le 1$, we shall assume that $\|F\|_{L^2(P)}^2 = \mathbb{E} F(X)^2 < \infty$ and
everything will be phrased in terms of $\|F\|_{L^2(P)}^2$.
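Now, let $H : [0, \infty) \to [0, \infty)$ be such that $H(z)$ is non-decreasing for $z > 0$ and
$z \sqrt{H(1/z)}$ is non-decreasing for $z \in (0, 1]$. Assume
\[ \int_0^D \sqrt{H(1/x)}\, dx \le C_H D \sqrt{H(1/D)} \]

for all $D \in (0, 1]$, and suppose that

\[ \sup_{Q} \log \big( 2 \mathcal{N}(\mathcal{F}, L^2(Q), \tau \|F\|_{L^2(Q)}) \big) \le H(1/\tau) \]

for all $\tau > 0$. With this control on Koltchinskii-Pollard entropy, it follows that
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \Big| \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - \mathbb{E} f(X)) \Big| \lesssim \sigma \sqrt{H\Big( \frac{2 \|F\|_{L^2(P)}}{\sigma} \Big)} \tag{14.160} \]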

if $n$ is large enough. We refer to [10] for more details, in particular Theorem 3.5.6 and the
following corollaries.
Remarkably, under additional mild conditions on the size of $n$, the inequality (14.160) can
be reversed for a given $P$ as soon as the entropy with respect to $L^2(P)$ indeed grows at
least as $H\big( \|F\|_{L^2(P)} / \sigma \big)$.
Hence, the price we pay for uniformity in $f \in \mathcal{F}$ is truly
\[ C(\mathcal{F}) \asymp \sqrt{H\Big( \frac{\|F\|_{L^2(P)}}{\sigma} \Big)}. \]

Of course, this expression is even simpler if $\sigma^2 = \sup_{f\in\mathcal{F}} \mathbb{E}(f(X) - \mathbb{E} f)^2$ is on the same
order as $\|F\|_{L^2(P)}^2 = \mathbb{E} \sup_f |f(X)|^2$.

15. COMBINATORIAL PARAMETERS


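Let us gain some intuition for what can make $\widehat{\mathcal{R}}(\Theta)$ large. First, recall that
\[ \widehat{\mathcal{R}}(\{\pm1\}^n) = \mathbb{E} \sup_{\theta\in\{\pm1\}^n} \langle \theta, \varepsilon\rangle = n. \]

Next, suppose that for $\alpha > 0$ and $v \in \mathbb{R}^n$,

\[ \alpha\{\pm1\}^n + v \subseteq \Theta. \]

Then
\[ \widehat{\mathcal{R}}(\Theta) \ge \widehat{\mathcal{R}}(\alpha\{\pm1\}^n + v) = \widehat{\mathcal{R}}(\alpha\{\pm1\}^n) = \alpha \widehat{\mathcal{R}}(\{\pm1\}^n) \ge \alpha n. \]
Hence, “large cubes” inside $\Theta$ make Rademacher averages large. It turns out, this is the
only reason $\widehat{\mathcal{R}}(\mathcal{F}|_{x_1,\ldots,x_n})$ can be large!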
The key question is whether F|x1 ,...,xn contains large cubes for a given class F.

15.1 Binary-Valued Functions


Let’s start with function classes of $\{0, 1\}$-valued functions. In this case, $\mathcal{F}|_{x_1,\ldots,x_n}$ is either a
full $\{0, 1\}^n$ cube or not. Consider the particular example of threshold functions on the real
line. Take any point $x_1$. Clearly, $\mathcal{F}|_{x_1} = \{0, 1\}$, which is a one-dimensional cube. Take two
points $x_1, x_2$. We can only realize sign patterns $(0, 0), (0, 1), (1, 1)$, but not $(1, 0)$. Hence, for
no two points can we get a cube.

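Definition 18. Let $\mathcal{F} = \{f : \mathcal{X} \to \{0, 1\}\}$. We say that $\mathcal{F}$ shatters $x_1, \ldots, x_n \in \mathcal{X}$ if
$\mathcal{F}|_{x_1,\ldots,x_n} = \{0, 1\}^n$. The Vapnik-Chervonenkis dimension of $\mathcal{F}$ is

\[ \mathrm{vc}(\mathcal{F}) = \max\{n : \mathcal{F} \text{ shatters some } x_1, \ldots, x_n\}. \]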
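The threshold example can be checked mechanically by listing the restriction of the class to a set of points. In this small sketch of ours, thresholds are taken to be $x \mapsto 1\{x \ge a\}$ for a handful of values of $a$; this parametrization and the helper names are assumptions made for illustration.

```python
def restriction(functions, points):
    """The set F|_{x_1..x_n} of value patterns realized by the functions on the points."""
    return {tuple(f(x) for x in points) for f in functions}

def shatters(functions, points):
    """For binary-valued functions, shattering means all 2^n patterns are realized."""
    return len(restriction(functions, points)) == 2 ** len(points)

def threshold(a):
    """Hypothetical threshold function x -> 1{x >= a}."""
    return lambda x: 1 if x >= a else 0

thresholds = [threshold(a) for a in (-1.0, 0.5, 1.5, 3.0)]

one_point = [0.0]
two_points = [0.0, 1.0]
print(restriction(thresholds, two_points))
print(shatters(thresholds, one_point), shatters(thresholds, two_points))
```

A single point is shattered, while on two ordered points the pattern $(1, 0)$ is missing, in line with the class of thresholds having VC dimension 1.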

Lemma 27 (Sauer-Shelah-Vapnik-Chervonenkis). If $\mathrm{vc}(\mathcal{F}) = d < \infty$,
\[ \mathrm{card}(\mathcal{F}|_{x_1,\ldots,x_n}) \le \sum_{i=0}^{d} \binom{n}{i} \le \Big( \frac{en}{d} \Big)^{d}. \]

This result is quite remarkable. It says that as soon as $n > \mathrm{vc}(\mathcal{F})$, the proportion of the
cube that can be realized by $\mathcal{F}$ becomes very small ($n^d$ vs $2^n$). This combinatorial result is
at the heart of empirical process theory and the early developments in pattern recognition.
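The gap between polynomial and exponential growth is easy to tabulate. The following sketch of ours evaluates the two bounds in the lemma for $n \ge d$ and compares them with $2^n$; the function names are hypothetical.

```python
import math

def sauer_sum(n, d):
    """sum_{i=0}^{d} C(n, i): the Sauer-Shelah bound on card(F restricted to n points)."""
    return sum(math.comb(n, i) for i in range(d + 1))

def sauer_bound(n, d):
    """The simpler upper bound (e n / d)^d, used here for n >= d."""
    return (math.e * n / d) ** d

for n in (5, 10, 20, 40):
    d = 3
    print(n, sauer_sum(n, d), round(sauer_bound(n, d)), 2 ** n)
```

For $d = 1$ the sum equals $n + 1$, which matches the number of distinct patterns thresholds realize on $n$ points.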
In particular, the lemma can be interpreted as a covering number upper bound:
\[ \mathcal{N}(\mathcal{F}, L^\infty(P_n), \varepsilon) \le \Big( \frac{en}{d} \Big)^{d} \]
for any $\varepsilon > 0$. Observe that these numbers are with respect to $L^\infty(P_n)$ rather than
$L^2(P_n)$, and hence can be an overkill. Indeed, $L^\infty(P_n)$ covering numbers are necessarily
$n$-dependent while we can hope to get dimension-independent $L^2(P_n)$ covering numbers.
Indeed, this result (Dudley, Haussler) was already mentioned: for a binary-valued class with
finite $\mathrm{vc}(\mathcal{F}) = d$,
\[ \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim \Big( \frac{C}{\varepsilon} \Big)^{Cd}. \]

Hence, a class with finite VC dimension is “parametric”. On the other hand, if vc(F) is
infinite, then F|x1 ,...,xn is a full cube for arbitrarily large n (for some appropriately chosen
points). Hence, Rademacher averages of this set are too large and there is no uniform
convergence for all P (to see this, consider P supported on the shattered set). Hence,
finiteness of VC dimension is a characterization (of both distribution-free learnability and
uniform convergence).
A word of caution: VC dimension does not always correspond to “number of param-
eters.” For instance, the one-parameter family F = {x 7→ 1 {sin(αx) ≥ 0} : α ∈ R} over
X = R has infinite VC dimension.

15.2 Real-Valued Functions


For binary-valued functions, the size of the cube contained in F|x1 ,...,xn was trivially 1, and
we only varied n to see where the phase transition occurs. In contrast, for a general real-
valued function class, it is feasible that F|x1 ,...,xn contains a cube of size α, but not larger
than α; this extra parameter is in addition to the dimensionality of the cube. To deal with
this extra degree of freedom, we fix the scale α and ask for the largest size n such that
F|x1 ,...,xn contains a (translate of a) cube of size α. A true containment statement would
read s + (α/2){−1, 1}n ⊆ F|x1 ,...,xn . However, it is enough to ask that the equalities for the
vertices are replaced with inequalities:

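Definition 19. We say that $\mathcal{F}$ shatters a set of points $x_1, \ldots, x_n$ at scale $\alpha$ if there
exists $s \in \mathbb{R}^n$ such that
\[ \forall \varepsilon \in \{\pm1\}^n,\ \exists f \in \mathcal{F} \ \text{s.t.}\ \begin{cases} f(x_t) \ge s_t + \alpha/2 & \text{if } \varepsilon_t = +1 \\ f(x_t) \le s_t - \alpha/2 & \text{if } \varepsilon_t = -1 \end{cases} \]
The combinatorial dimension $\mathrm{vc}(\mathcal{F}, \alpha)$ of $\mathcal{F}$ (on domain $\mathcal{X}$) at scale $\alpha$ is defined as the
size $n$ of the largest shattered set.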

This scale-sensitive dimension (or ‘fat-shattering’ dimension) was introduced by [16].

15.2.1 Example: non-decreasing functions


Consider the class of nondecreasing functions $f : \mathbb{R} \to [0, 1]$. First, observe that a pointwise
cover of this class does not exist ($\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = \infty$ for any $\varepsilon < 1/2$). However,
$\mathcal{N}(\mathcal{F}, L^\infty(P_n), \varepsilon)$ is necessarily finite. Let’s calculate the scale-sensitive dimension of this
class.
Claim: $\mathrm{vc}(\mathcal{F}, \alpha) \lesssim \alpha^{-1}$. Indeed, fix any $x_1, \ldots, x_n$ and assume these are arranged in
an increasing order. Suppose $\mathcal{F}$ shatters this set at scale $\alpha$. Take the alternating sequence $\varepsilon =
(+1, -1, \ldots)$. We then must have a nondecreasing function that is at least $s_1 + \alpha/2$ at
$x_1$ but then no greater than $s_2 - \alpha/2$ at $x_2$. The nondecreasing constraint implies that
$s_2 \ge s_1 + \alpha$. A similar argument then holds for the next point and so forth. Since functions
are bounded, $n\alpha \le 1$, which concludes the proof.

15.2.2 Control of covering numbers


The following generalization of the earlier result for binary-valued functions is due to
Mendelson and Vershynin:

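Theorem 5. Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [-1, 1]$. Then for any distribution $P$,
\[ \mathcal{N}(\mathcal{F}, L^2(P), \varepsilon) \le \Big( \frac{c}{\varepsilon} \Big)^{c \cdot \mathrm{vc}(\mathcal{F}, \varepsilon/c)} \]
for all $\varepsilon > 0$. Here $c$ is an absolute constant.

In particular, plugging into the entropy integral yields

\[ \int \sqrt{\mathrm{vc}(\mathcal{F}, \varepsilon) \log(1/\varepsilon)}\, d\varepsilon. \]

By a result of Rudelson and Vershynin, the $\log(1/\varepsilon)$ factor can be removed.

Back to the class of non-decreasing functions, we immediately get
\[ \log \mathcal{N}(\mathcal{F}, L^2(P_n), \varepsilon) \lesssim \varepsilon^{-1} \cdot \log \frac{c}{\varepsilon}. \]

In particular, Rademacher averages of this class scale as $n^{-1/2}$ since this is a nonparametric
class with entropy exponent $p < 2$.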

15.3 Scale-sensitive dimension of linear class via Perceptron


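In this section, we will prove the following.

Proposition 1. For
\[ \mathcal{F} = \{x \mapsto \langle w, x\rangle : w \in B_2^d\} \]
and $\mathcal{X} \subseteq B_2^d$, it holds that
\[ \mathrm{vc}(\mathcal{F}, \alpha) \le 16\alpha^{-2}. \]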

We turn to the Perceptron algorithm, defined as follows. We start with $\hat{w}_1 = 0$. At
time $t = 1, \ldots, n$, we observe $x_t \in \mathcal{X}$ and predict $\hat{y}_t = \mathrm{sign}(\langle \hat{w}_t, x_t\rangle)$, a deterministic guess
of the label of $x_t$ given the hypothesis $\hat{w}_t$. We then observe the true label of the example
$y_t \in \{\pm1\}$. If $\hat{y}_t \ne y_t$, we update

\[ \hat{w}_{t+1} = \hat{w}_t + y_t x_t, \]

and otherwise $\hat{w}_{t+1} = \hat{w}_t$.

Lemma 28 (Novikoff ’62). For any sequence $(x_1, y_1), \ldots, (x_n, y_n) \in B_2^d \times \{\pm1\}$ the Perceptron
algorithm makes at most $\gamma^{-2}$ mistakes, where $\gamma$ is the margin of the sequence,
defined as
\[ \gamma = \max_{w^*\in B_2^d} \min_{t} \, y_t \langle w^*, x_t\rangle \vee 0. \]

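Proof. If a mistake is made on round $t$,

\[ \|\hat{w}_{t+1}\|^2 = \|\hat{w}_t + y_t x_t\|^2 \le \|\hat{w}_t\|^2 + 2 y_t \langle \hat{w}_t, x_t\rangle + 1 \le \|\hat{w}_t\|^2 + 1. \]

Denote the number of mistakes at the end as $m$. Then $\|\hat{w}_{n+1}\|^2 \le m$. Next, for a maximizer $w^*$ in the definition of the margin, on every round where a mistake is made,

\[ \gamma \le \langle w^*, y_t x_t\rangle = \langle w^*, \hat{w}_{t+1} - \hat{w}_t\rangle, \]

and so by summing and telescoping, $m\gamma \le \langle w^*, \hat{w}_{n+1}\rangle \le \sqrt{m}$. This concludes the proof.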
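The mistake bound can be observed on synthetic data. The sketch below is an illustration under our own assumptions: the data are generated so that the first coordinate direction separates them with margin at least `margin` (so the true margin $\gamma$ is at least that large), ties at a zero score are predicted as $+1$, and all function names and parameter values are hypothetical.

```python
import math
import random

def perceptron_mistakes(examples):
    """Run Perceptron from w = 0 over (x, y) pairs and count the mistakes."""
    d = len(examples[0][0])
    w = [0.0] * d
    mistakes = 0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1              # deterministic prediction
        if y_hat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]  # update only on a mistake
    return mistakes

def separable_sequence(n, d, margin, seed=0):
    """Points in the unit ball labeled by the sign of their first coordinate, |x[0]| >= margin."""
    rng = random.Random(seed)
    examples = []
    while len(examples) < n:
        x = [rng.uniform(-1, 1) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in x))
        if norm > 1:
            x = [v / norm for v in x]                # project onto the unit ball
        if abs(x[0]) >= margin:                      # rejection step enforces the margin
            examples.append((x, 1 if x[0] > 0 else -1))
    return examples

margin = 0.2
examples = separable_sequence(n=500, d=20, margin=margin)
m = perceptron_mistakes(examples)
print(m, "mistakes; bound is", 1 / margin ** 2)
```

The count does not depend on the sequence length or on the dimension, only on the margin, which is the feature of the lemma used later.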

Remarkably, the number of mistakes does not depend on the dimension d. We will now
show that the mistake bound translates into a bound on the scale-sensitive dimension.

Proof of Proposition. Suppose there exists a shattered set $x_1, \ldots, x_m \in B_2^d$: there exist
$s_1, \ldots, s_m \in [-1, 1]$ such that for any sequence of signs $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_m)$ there exists a
$w_\varepsilon \in B_2^d$ such that
\[ \varepsilon_i (\langle w_\varepsilon, x_i\rangle - s_i) \ge \alpha/2. \]
Claim: we can reparametrize the problem so that $s_i = 0$. Indeed, take

\[ \tilde{w}_\varepsilon = [w_\varepsilon, 1], \quad \tilde{x}_i = [x_i, -s_i]. \]

Then we have
\[ \varepsilon_i \langle \tilde{w}_\varepsilon, \tilde{x}_i\rangle \ge \alpha/2, \]

while the norms are at most $\sqrt{2}$:

\[ \|\tilde{w}_\varepsilon\|^2 = \|w_\varepsilon\|^2 + 1 \le 2, \quad \|\tilde{x}_i\|^2 \le 2. \]

Now comes the key step. We run Perceptron on the sequence $\tilde{x}_1/\sqrt{2}, \ldots, \tilde{x}_m/\sqrt{2}$ and
$y_i = -\hat{y}_i$. That is, we force Perceptron to make mistakes on every round, no matter what
the predictions are. It is important that Perceptron makes deterministic predictions for this
argument to work. Note that the sequence of predictions of Perceptron defines the sequence
$y = (y_1, \ldots, y_m)$ with
\[ y_i \langle \tilde{w}_y/\sqrt{2}, \tilde{x}_i/\sqrt{2}\rangle \ge \alpha/4. \]
Hence, by Novikoff’s result,
\[ m \le 16/\alpha^2. \]

Interestingly, both Perceptron and VC theory were developed in the 60’s as distinct
approaches (online vs batch), yet the connection between them runs deeper than was recog-
nized, until recently. In particular, the above proof in fact shows that a stronger sequential
version of vc(F, α) is also bounded by 16α−2 , where (roughly speaking) sequential analogues
allow the sequence to evolve as a predictable process with respect to a dyadic filtration. It
turns out that there are sequential analogues of Rademacher averages, covering numbers,
Dudley chaining, and combinatorial dimensions, and these govern online (rather than i.i.d.)
learning. We will mention these towards the end of the course.

16. PREDICTION WITH LIPSCHITZ LOSS FUNCTIONS

In the past few lectures, we have developed tools for analyzing the expected suprema of
empirical processes. We have already seen that such quantities can be used to derive
sample complexity bounds for empirical risk minimization algorithms. Let us recall the
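setup. Excess loss with respect to a class of functions $\mathcal{F}$ is defined as

\[ \mathbb{E} \ell(f(X), Y) - \min_{f\in\mathcal{F}} \mathbb{E} \ell(f(X), Y) \tag{16.161} \]

for some $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$. Earlier, we have shown that ERM

\[ \hat{f} \in \operatorname*{argmin}_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i) \]

enjoys
\[ \mathbb{E} \ell(\hat{f}(X), Y) - \min_{f\in\mathcal{F}} \mathbb{E} \ell(f(X), Y) \le \mathbb{E} \sup_{f\in\mathcal{F}} \Big[ \mathbb{E} \ell(f(X), Y) - \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i) \Big]. \]

The latter is at most

\[ 2\, \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \ell(f(X_i), Y_i) \tag{16.162} \]

by symmetrization, which is Rademacher averages of the loss class

\[ \ell \circ \mathcal{F}|_{(X_1,Y_1),\ldots,(X_n,Y_n)}. \]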
We would like to further upper bound this with Rademacher averages of the function class
itself. This can be done if ` is Lipschitz in the first argument.

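Lemma 29 (Contraction). Let $\phi_i : \mathbb{R} \to \mathbb{R}$ be 1-Lipschitz, $i = 1, \ldots, n$. Let $\Theta \subset \mathbb{R}^n$
and $\phi \circ \theta = (\phi_1(\theta_1), \ldots, \phi_n(\theta_n))$ for $\theta \in \Theta$. Denote $\phi \circ \Theta = \{\phi \circ \theta : \theta \in \Theta\}$. Then
\[ \widehat{\mathcal{R}}(\phi \circ \Theta) \le \widehat{\mathcal{R}}(\Theta). \]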

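Proof. Conditionally on $\varepsilon_1, \ldots, \varepsilon_{n-1}$,

\begin{align*}
\mathbb{E}_{\varepsilon_n} \sup_{\theta\in\Theta} \langle \phi \circ \theta, \varepsilon\rangle
&= \frac{1}{2} \Big[ \sup_{\theta\in\Theta} \big\{ \langle \phi \circ \theta_{1:n-1}, \varepsilon_{1:n-1}\rangle + \phi_n(\theta_n) \big\} + \sup_{\theta'\in\Theta} \big\{ \langle \phi \circ \theta'_{1:n-1}, \varepsilon_{1:n-1}\rangle - \phi_n(\theta'_n) \big\} \Big] \\
&\le \frac{1}{2} \sup_{\theta,\theta'\in\Theta} \big\{ \langle \phi \circ \theta_{1:n-1}, \varepsilon_{1:n-1}\rangle + \langle \phi \circ \theta'_{1:n-1}, \varepsilon_{1:n-1}\rangle + |\theta_n - \theta'_n| \big\} \\
&= \frac{1}{2} \sup_{\theta,\theta'\in\Theta} \big\{ \langle \phi \circ \theta_{1:n-1}, \varepsilon_{1:n-1}\rangle + \langle \phi \circ \theta'_{1:n-1}, \varepsilon_{1:n-1}\rangle + \theta_n - \theta'_n \big\} \\
&= \frac{1}{2} \Big[ \sup_{\theta\in\Theta} \big\{ \langle \phi \circ \theta_{1:n-1}, \varepsilon_{1:n-1}\rangle + \theta_n \big\} + \sup_{\theta'\in\Theta} \big\{ \langle \phi \circ \theta'_{1:n-1}, \varepsilon_{1:n-1}\rangle - \theta'_n \big\} \Big] \\
&= \mathbb{E}_{\varepsilon_n} \sup_{\theta\in\Theta} \big\{ \langle \phi \circ \theta_{1:n-1}, \varepsilon_{1:n-1}\rangle + \varepsilon_n \theta_n \big\}.
\end{align*}

The inequality follows from the Lipschitz condition and the following equality is justified because
of the symmetry of the other two terms with respect to renaming $\theta$ and $\theta'$. Proceeding
to remove the other signs concludes the proof.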
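The contraction inequality can be verified exactly on a small finite set by enumerating all sign vectors. The sketch below is ours: the set `theta_set` is randomly generated with a fixed seed, and the six coordinate maps are arbitrary 1-Lipschitz functions chosen for illustration; none of these choices come from the lecture.

```python
import itertools
import math
import random

def rademacher_average(vectors):
    """Exact E sup_{theta in vectors} <eps, theta> over all 2^n sign vectors."""
    n = len(vectors[0])
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        total += max(sum(e * t for e, t in zip(eps, theta)) for theta in vectors)
    return total / 2 ** n

rng = random.Random(0)  # fixed seed for reproducibility
n = 6
theta_set = [[rng.uniform(-2, 2) for _ in range(n)] for _ in range(15)]

# One 1-Lipschitz map per coordinate (hypothetical choices).
phis = [abs, math.tanh, math.sin, lambda t: max(t, 0.0), lambda t: -t, lambda t: 0.5 * t + 1.0]
phi_set = [[phi(t) for phi, t in zip(phis, theta)] for theta in theta_set]

before = rademacher_average(theta_set)
after = rademacher_average(phi_set)
print(before, after)
```

Note that the maps need not vanish at zero: a constant offset in a coordinate contributes a term with zero expectation, so it does not affect the average.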

We now apply this lemma to functions $\phi_i(\cdot) = \ell(\cdot, Y_i)$. As long as these functions are
$L$-Lipschitz, the contraction lemma gives
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \ell(f(X_i), Y_i) \le L \cdot \mathbb{E} \sup_{f\in\mathcal{F}} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i), \tag{16.163} \]

the (expected) Rademacher averages of $\mathcal{F}$. The argument can be seen as a generalization
of the argument in (10.140) for classification where we “erased” multipliers $(1 - 2Y_i)$.
The simple analysis we just performed applies to any Lipschitz loss function. For uniformly
bounded $\mathcal{F}$ and $\mathcal{Y}$, square loss is Lipschitz, but that is no longer true for unbounded
$\mathcal{Y}$ (e.g. for real-valued prediction with Gaussian noise). Hence, such an analysis only goes so
far.
Second, observe that one would only obtain rates n−1/2 or worse with such an analysis,
while we might hope to have faster decrease. For instance, in finite-dimensional regression,
one can recall the classical d · n−1 rates for Least Squares.
A quick inspection tells us that the second step in the sequence of inequalities
\[ \mathbb{E}\left[ L(\widehat{f}) - L(f^*) \right] \le \mathbb{E}\left[ L(\widehat{f}) - \widehat{L}(\widehat{f}) \right] \le \mathbb{E}\left[ \sup_{f \in \mathcal{F}} L(f) - \widehat{L}(f) \right] \tag{16.164} \]
for ERM $\widehat{f}$ may be too loose. The second step only used the fact that $\widehat{f}$ belongs to $\mathcal{F}$. It turns out one can localize its place in $\mathcal{F}$ better than that. Before turning to this question of localization, let us point out a relationship between the problems of estimation and prediction with square loss.

17. REGRESSION. PREDICTION VS ESTIMATION

As before, let $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a set of i.i.d. pairs with distribution $P = P_X \times P_{Y|X}$ on $\mathcal{X} \times \mathcal{Y}$. Let $f^*(x) = \mathbb{E}[Y | X = x]$ be the regression function. One can show that
\[ f^* \in \operatorname*{argmin}_{f} \mathbb{E}(f(X) - Y)^2 \]
where minimization is over all measurable functions. Given a class $\mathcal{F}$ of functions $\mathcal{X} \to \mathcal{Y}$, we also define
\[ f_{\mathcal{F}} \in \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}(f(X) - Y)^2 \]
to be the best predictor within the class $\mathcal{F}$.


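Note that for any function $f : \mathcal{X} \to \mathbb{R}$,
\begin{align}
& \mathbb{E}(f(X) - Y)^2 - \inf_{h \in \mathcal{F}} \mathbb{E}(h(X) - Y)^2 \tag{17.165} \\
&= \mathbb{E}(f(X) - f^*(X) + f^*(X) - Y)^2 - \inf_{h \in \mathcal{F}} \mathbb{E}(h(X) - f^*(X) + f^*(X) - Y)^2 \notag \\
&= \mathbb{E}(f(X) - f^*(X))^2 - \inf_{h \in \mathcal{F}} \mathbb{E}(h(X) - f^*(X))^2 \notag \\
&= \|f - f^*\|^2_{L_2(P)} - \inf_{h \in \mathcal{F}} \|h - f^*\|^2_{L_2(P)}. \tag{17.166}
\end{align}
The penultimate equality holds because the cross term
\[ \mathbb{E}\left[ (f(X) - f^*(X))(f^*(X) - Y) \right] = 0, \]
as follows by conditioning on $X$ and using the definition of $f^*$.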
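The identity between (17.165) and (17.166) can be verified exactly on a small discrete distribution. In the sketch below, the joint distribution `joint`, the two-element class `F`, and the candidate `f` are made-up illustrations; functions are stored as lookup tables on the three points of the domain.

```python
# Joint distribution of (X, Y) on a finite grid: (x, y) -> probability.
joint = {(0, 0.0): 0.2, (0, 1.0): 0.1, (1, 0.0): 0.1, (1, 2.0): 0.3, (2, 1.0): 0.3}

def regression_function(joint):
    """f*(x) = E[Y | X = x]."""
    mass, mean = {}, {}
    for (x, y), p in joint.items():
        mass[x] = mass.get(x, 0.0) + p
        mean[x] = mean.get(x, 0.0) + p * y
    return {x: mean[x] / mass[x] for x in mass}

def risk(f, joint):
    """E (f(X) - Y)^2."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in joint.items())

def sq_dist(f, g, joint):
    """Squared L2(P) distance between f and g."""
    return sum(p * (f[x] - g[x]) ** 2 for (x, _), p in joint.items())

f_star = regression_function(joint)
# Hypothetical class F with two elements, and an arbitrary candidate f.
F = [{0: 0.0, 1: 1.0, 2: 2.0}, {0: 1.0, 1: 1.0, 2: 1.0}]
f = {0: 0.5, 1: 2.0, 2: 0.0}

lhs = risk(f, joint) - min(risk(h, joint) for h in F)                        # (17.165)
rhs = sq_dist(f, f_star, joint) - min(sq_dist(h, f_star, joint) for h in F)  # (17.166)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point error, since the noise variance term cancels on both sides.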


On the left-hand side of (17.165), we have the object of interest in Statistical Learning: predicting well relative to a given class $\mathcal{F}$ (e.g. agnostic PAC learning in the realm of computational learning theory). On the other hand, (17.166) measures the quality of estimation of an unknown regression function in the $L_2(P)$ norm. This object is within the purview of Statistics. We see that the problem of prediction and the problem of estimation naturally coincide for square loss.
Of course, we will be interested in analyzing estimators $\widehat{f}$ constructed on the basis of $n$ datapoints. The hat on $\widehat{f}$ reminds us about the dependence on $S$. Then (17.165) will be evaluated with $f$ replaced by $\widehat{f}$.
Two standard scenarios:

• Well-specified case: given some class $\mathcal{F}$, assume $f^* \in \mathcal{F}$. More precisely, $P$ is such that the regression function is in the class $\mathcal{F}$. In this case, (17.166) becomes $\|f - f^*\|^2_{L_2(P)}$.

• Misspecified case: do not insist that $f^* \in \mathcal{F}$. Upper bounds on (17.166) are called Oracle Inequalities in statistics, while the prediction form has been studied in statistical learning theory (sometimes under the name of Agnostic PAC).

The misspecified problem arises naturally as a relaxation of an assumption on the form of the distribution.

18. NONPARAMETRIC REGRESSION: WELL-SPECIFIED CASE

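We will start with “fixed design”: $x_1, \ldots, x_n \in \mathcal{X}$ are fixed. Let
\[ Y_i = f^*(x_i) + \eta_i \]
where $\eta_i$ are zero-mean independent subGaussian. Suppose $f^* \in \mathcal{F}$. Goal: estimate $f^*$ on the points $x_1, \ldots, x_n$ (denoise the observed values). That is, the goal is to provide nonasymptotic bounds on
\[ \mathbb{E}_\eta \left\| \widehat{f} - f^* \right\|^2_{L_2(P_n)}, \]
where $\widehat{f}$ is the least squares (ERM) solution constrained to $\mathcal{F}$. In contrast, in random design the goal is w.r.t. $L_2(P)$ with $P$ unknown, while here $P_n$ is known. We write the $L_2(P_n)$ norm more succinctly as $\|\cdot\|_n$, so that the quantity above reads $\mathbb{E} \| \widehat{f} - f^* \|^2_n$.
Since
\[ \widehat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(x_i) - Y_i)^2 = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - Y\|^2_n, \]
we have
\[ \|f^* - Y\|^2_n \ge \left\| \widehat{f} - Y \right\|^2_n = \left\| \widehat{f} - f^* + f^* - Y \right\|^2_n = \left\| \widehat{f} - f^* \right\|^2_n + \|f^* - Y\|^2_n + 2 \langle \widehat{f} - f^*, f^* - Y \rangle_n \]
where $\langle a, b \rangle_n = \frac{1}{n} \langle a, b \rangle$. Since $f^* - Y = -\eta$ on the data, we obtain
\[ \left\| \widehat{f} - f^* \right\|^2_n \le 2 \langle \eta, \widehat{f} - f^* \rangle_n \tag{18.167} \]
which is the Basic Inequality developed earlier in (6.88) for linear regression.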

18.1 Informal intuition for localization


Before developing the localization approach, we provide some intuition. The first intuition comes from viewing (18.167) as a fixed point.
Let’s assume for simplicity that $\eta_i$ are 1-subGaussian. For fixed $a \in \mathbb{R}^n$, we have that with high probability
\[ \langle \eta, a \rangle \lesssim \|a\|. \tag{18.168} \]
Hence, if it holds that
\[ \|a\|^2 \le \langle \eta, a \rangle, \]
then $\|a\| \lesssim 1$, or, dividing by $n$, $\|a\|^2_n \lesssim 1/n$.
We can try to repeat this argument with $a$ being the values of $\widehat{f} - f^*$ on the data. However, since $\widehat{f}$ depends on $\eta$, we do not have the averaging in (18.168) that we need. Still, we can do the mental experiment of assuming that the dependence is “weak” (e.g. we fit linear regression in small $d$ and large $n$). Then a bound on the size of $\| \widehat{f} - f^* \|_n$ would lead to an improved bound on the RHS of the basic inequality, which would in turn tighten the bound on the LHS of the basic inequality, suggesting some kind of a fixed point. It also seems intuitive that this fixed point likely depends on $\mathcal{F}$ and its richness.

18.2 1st approach to localization: ratio-type inequalities
To simplify the proof somewhat, we will assume that $\eta_1, \ldots, \eta_n$ are independent standard normal $N(0, 1)$.
We proceed as in the linear case earlier in the course. First, we divide both sides of the Basic Inequality (18.167) by $\| \widehat{f} - f^* \|_n$ and further upper bound the right-hand side by a supremum over $f$, removing the dependence of the algorithm on the data:
\[ \left\| \widehat{f} - f^* \right\|_n \le 2 \sup_{f \in \mathcal{F}} \left\langle \eta, \frac{f - f^*}{\|f - f^*\|_n} \right\rangle_n \tag{18.169} \]
By squaring both sides, we would get an upper bound on the estimation error (in probability or in expectation).
Let us use the shorthand $\mathcal{F}^* = \mathcal{F} - f^*$. The rest of the discussion will be about complexity of the neighborhood around $f^*$ in $\mathcal{F}$, or, equivalently, complexity of the neighborhood of 0 in $\mathcal{F}^*$. Observe that we only care about values of functions on the data $x_1, \ldots, x_n$, so the discussion is really about the set $\mathcal{F}^*|_{x_1, \ldots, x_n}$, drawn in blue below.

At this point, one can say that there is no difference from the linear case, and we should just go ahead and analyze
\[ \sup_{g \in \mathcal{F}^*} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
After all, this is just the Gaussian width (normalized by $\sqrt{n}$) of the subset of the sphere obtained by rescaling all the functions:
\[ K = \left\{ v \in S^{n-1} : \exists g \in \mathcal{F}^* \text{ s.t. } v = (g(x_1), \ldots, g(x_n)) / (\sqrt{n} \|g\|_n) \right\}. \]
(Here the normalization is because $\|g\|_n$ is scaled as $1/\sqrt{n}$ times the $\ell_2$ norm.) How big is this subset of the sphere? Note: if the set is all of $S^{n-1}$, we are doomed since in that case
\[ \sup_{g \in \mathcal{F}^*} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n = \sup_{v \in S^{n-1}} \frac{1}{\sqrt{n}} \langle \eta, v \rangle = \frac{1}{\sqrt{n}} \|\eta\| \sim 1 \]
and does not converge to zero. What we would need is that $K$ is a significantly smaller subset of the sphere. In the linear case, this was easy: we simply used the fact that the subset is $d$-dimensional. However, for nonlinear functions, it is not easy to see what the set is.
There is a bigger problem, however. Upon rescaling every vector to the sphere, all the functions are treated equally even if their unscaled versions are very close to being zero (that is, close to $f^*$ in the original class $\mathcal{F}$). In other words, the quantity
\[ \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
can be potentially much smaller than the unrestricted supremum. This is depicted in the above figure. If we look at functions within the smaller green sphere, its rescaled version is the whole sphere. However, at larger scales (e.g. the larger green sphere), the set can be much smaller. Understanding the map
\[ u \mapsto \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
will be key. In particular, we can break up the balance at scale $u$ and instead have a better upper bound
\[ \left\| \widehat{f} - f^* \right\|_n \le u + 2 \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \tag{18.170} \]
Indeed, to show (18.170), write
\begin{align*}
\left\| \widehat{f} - f^* \right\|_n
&= \left\| \widehat{f} - f^* \right\|_n \mathbf{1}\left\{ \left\| \widehat{f} - f^* \right\|_n < u \right\} + \left\| \widehat{f} - f^* \right\|_n \mathbf{1}\left\{ \left\| \widehat{f} - f^* \right\|_n \ge u \right\} \\
&\le u + \left\| \widehat{f} - f^* \right\|_n \mathbf{1}\left\{ \left\| \widehat{f} - f^* \right\|_n \ge u \right\} \\
&\le u + 2 \left\langle \eta, \frac{\widehat{f} - f^*}{\| \widehat{f} - f^* \|_n} \right\rangle_n \times \mathbf{1}\left\{ \left\| \widehat{f} - f^* \right\|_n \ge u \right\} \\
&\le u + 2 \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n
\end{align*}
Consider the following assumption:

Consider the following assumption:

Definition 20. A class $\mathcal{H}$ is star-shaped (around 0) if $h \in \mathcal{H}$ implies $\lambda h \in \mathcal{H}$ for $\lambda \in [0, 1]$. In particular, if $\mathcal{H}$ is convex and contains 0, it is star-shaped.

We will assume that $\mathcal{F}^*$ is star-shaped. In particular, if $\mathcal{F}$ is convex, then $\mathcal{F}^*$ is star-shaped.
The key property of a star-shaped class is that by increasing the radius, the sets cannot become more complex, as for any function there is a scaled copy of it at a smaller magnitude.
In light of this last remark, we claim that the inequality $\|g\|_n \ge u$ in the supremum in (18.170) can be replaced with an equality if the class is star-shaped. Indeed, for any $g \in \mathcal{F}^*$ with $\|g\|_n \ge u$, there is a corresponding function $h = u \frac{g}{\|g\|_n} \in \mathcal{F}^*$ with norm $\|h\|_n = u$ and
\[ \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n = \left\langle \eta, \frac{h}{u} \right\rangle_n \]
Hence,
\[ \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \le \frac{1}{u} \sup_{h \in \mathcal{F}^* : \|h\|_n = u} \langle \eta, h \rangle_n \]
Taking a supremum on the LHS over $g$ with $\|g\|_n \ge u$ gives an upper bound on (18.170) as
\begin{align}
\left\| \widehat{f} - f^* \right\|_n
&\le u + \frac{2}{u} \sup_{g \in \mathcal{F}^* : \|g\|_n = u} \langle \eta, g \rangle_n \notag \\
&\le u + \frac{2}{u} \sup_{g \in \mathcal{F}^* : \|g\|_n \le u} \langle \eta, g \rangle_n \tag{18.171}
\end{align}
where in the last step we included all the functions below level $u$. We will use concentration to replace the second term with its expectation. In particular, define
\[ Z(u) = \sup_{g \in \mathcal{F}^* : \|g\|_n \le u} \langle \eta, g \rangle_n \]
and
\[ G(u) = \mathbb{E} Z(u). \]
If we were to replace $Z(u)$ on the RHS of (18.171) with $G(u)$, the natural balance between the two terms would be
\[ u = \frac{2}{u} G(u) \]

Definition 21. The critical radius $\delta_n$ will be the minimum $\delta$ satisfying
\[ G(\delta) \le \delta^2 / 2 \]

One can ask if this critical radius is actually well-defined. This follows from the following:

Lemma 30. If $\mathcal{F}^*$ is star-shaped, the function $u \mapsto G(u)/u$ is non-increasing.

Proof. Let $\delta' < \delta$. Take any $h \in \mathcal{F}^*$ with $\delta' < \|h\|_n \le \delta$. By star-shapedness,
\[ h' = \left( \frac{\delta'}{\delta} \right) h \in \mathcal{F}^* \]
and $\|h'\|_n = \frac{\delta'}{\delta} \|h\|_n \le \delta'$. Hence,
\[ \langle \eta, h \rangle_n = \frac{\delta}{\delta'} \langle \eta, h' \rangle_n \le \frac{\delta}{\delta'} Z(\delta') \]
Taking supremum on the left-hand side over $h$ with $\|h\|_n \le \delta$, as well as expectation on both sides, finishes the proof.
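Lemma 30 and Definition 21 can be illustrated numerically. The sketch below uses, as a hypothetical star-shaped class, the star hull $\{\lambda v : v \in V, \lambda \in [0,1]\}$ of a finite set `V` of value vectors, for which $Z(u)$ has a closed form. $G(u)$ is estimated by Monte Carlo, so the reported critical radius is only an approximation on a grid.

```python
import math
import random

def norm_n(v):
    """Empirical norm ||v||_n = sqrt(mean of squares)."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def inner_n(a, b):
    """Empirical inner product <a, b>_n = (1/n) <a, b>."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

def Z(u, eta, V):
    """sup of <eta, g>_n over g = lam * v, v in V, lam in [0, 1], ||g||_n <= u."""
    best = 0.0  # g = 0 is always feasible
    for v in V:
        lam_max = min(1.0, u / norm_n(v))  # largest admissible scaling of v
        best = max(best, lam_max * max(inner_n(eta, v), 0.0))
    return best

random.seed(1)
n = 20
# Hypothetical generators of the class, at several magnitudes.
V = [[random.gauss(0, s) for _ in range(n)]
     for s in (0.1, 0.5, 1.0, 2.0) for _ in range(5)]
radii = [0.05 * k for k in range(1, 41)]
draws = [[random.gauss(0, 1) for _ in range(n)] for _ in range(200)]

# Monte Carlo estimate of G(u) = E Z(u), reusing the same noise draws for all u.
G = [sum(Z(u, eta, V) for eta in draws) / len(draws) for u in radii]
ratios = [g / u for g, u in zip(G, radii)]

# Smallest grid radius with G(delta) <= delta^2 / 2 (approximate critical radius).
delta_n = next((u for u, g in zip(radii, G) if g <= u * u / 2), None)
print(delta_n)
```

Because the same draws are reused at every radius, the estimated ratio $G(u)/u$ is non-increasing along the grid, mirroring the pathwise argument in the proof.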

In particular, for any $u \ge \delta_n$,
\[ G(u) \le u^2 / 2 \]
Indeed,
\[ G(u) = u \frac{G(u)}{u} \le u \frac{G(\delta_n)}{\delta_n} \le u \delta_n / 2 \le u^2 / 2. \tag{18.172} \]
To formally replace $Z(u)$ with $G(u)$ in the balancing equation, we need a concentration result.

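Lemma 31 (Gaussian Concentration). Let $\eta = (\eta_1, \ldots, \eta_n)$ be a vector of independent standard normals. Let $\phi : \mathbb{R}^n \to \mathbb{R}$ be $L$-Lipschitz (w.r.t. Euclidean norm). Then for all $t > 0$
\[ P\left( \phi(\eta) - \mathbb{E}\phi \ge t \right) \le \exp\left\{ - \frac{t^2}{2L^2} \right\} \]

First, observe that $Z(u)$ is a $(u/\sqrt{n})$-Lipschitz function of $\eta$. Omitting the argument $u$,
\[ Z[\eta] - Z[\eta'] \le \sup_{g \in \mathcal{F}^*, \|g\|_n \le u} \left\{ \langle \eta, g \rangle_n - \langle \eta', g \rangle_n \right\} \le \left\| \eta - \eta' \right\|_n \sup_{g \in \mathcal{F}^*, \|g\|_n \le u} \|g\|_n \le \frac{u}{\sqrt{n}} \left\| \eta - \eta' \right\| \]
Hence, for any $u > 0$,
\[ P\left( Z(u) - \mathbb{E} Z(u) \ge t \right) \le \exp\left\{ - \frac{n t^2}{2 u^2} \right\} \tag{18.173} \]
In particular, by setting $t = u^2$,
\[ P\left( Z(u) \ge G(u) + u^2 \right) \le \exp\left\{ - \frac{n u^2}{2} \right\} \tag{18.174} \]
In light of (18.172), we have proved

Lemma 32. Assuming $\mathcal{F}^*$ is star-shaped, with probability at least $1 - \exp\left\{ - \frac{n u^2}{2} \right\}$,
\[ Z(u) \le 1.5 u^2 \tag{18.175} \]
for any $u \ge \delta_n$.

Thus, from (18.171), we have
\[ \left\| \widehat{f} - f^* \right\|_n \le u + \frac{2}{u} \cdot 1.5 u^2 = 4u \tag{18.176} \]
with probability at least $1 - \exp\left\{ - \frac{n u^2}{2} \right\}$, for any $u \ge \delta_n$. Squaring both sides yields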

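Theorem 6. Assume $x_1, \ldots, x_n$ are fixed, $\eta_1, \ldots, \eta_n$ are i.i.d. standard normal, and $Y_i = f^*(x_i) + \eta_i$ with $f^* \in \mathcal{F}$. Assume $\mathcal{F} - f^*$ is star-shaped and $\delta_n$ the corresponding critical radius. Then constrained least squares $\widehat{f}$ satisfies
\[ P\left( \left\| \widehat{f} - f^* \right\|^2_n \ge 16 s \delta_n^2 \right) \le \exp\left\{ - \frac{n s \delta_n^2}{2} \right\} \tag{18.177} \]
for any $s \ge 1$. In particular, this implies
\[ \mathbb{E} \left\| \widehat{f} - f^* \right\|^2_n \lesssim \delta_n^2 + \frac{1}{n}. \]
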
Note: in the literature, you will find a slightly different parametrization. Write $\psi(r) = \mathbb{E} Z(\sqrt{r})$. In other words, $\psi(u^2) = G(u)$. Then the property that $G(u)/u$ is non-increasing translates into $\psi$ having the subroot property: for $a \ge 1$,
\[ \psi(ra) \le \sqrt{a}\, \psi(r) \]
using the same type of proof as above. The fixed point then reads as the smallest $r$ such that $\psi(r) \le r$ (ignoring the constant).
Let’s quickly discuss the behavior of $G(\delta)/\delta$.

The above sketch shows the function $\delta \mapsto G(\delta)/\delta$ for two classes of functions. The purple curve corresponds to a more complex class, since the Gaussian width (normalized by $\delta$) grows faster as $\delta \to 0$. The corresponding fixed point is larger for a richer class.

18.3 2nd approach to localization: offset


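We start again with the basic inequality
\[ \left\| \widehat{f} - f^* \right\|^2_n \le 2 \langle \eta, \widehat{f} - f^* \rangle_n \]
and trivially write it as
\[ \left\| \widehat{f} - f^* \right\|^2_n \le 4 \langle \eta, \widehat{f} - f^* \rangle_n - \left\| \widehat{f} - f^* \right\|^2_n \]
Now take the supremum and expectation on both sides:
\begin{align*}
\mathbb{E} \left\| \widehat{f} - f^* \right\|^2_n
&\le \mathbb{E} \sup_{f \in \mathcal{F}} \left\{ 4 \langle \eta, f - f^* \rangle_n - \|f - f^*\|^2_n \right\} \\
&= \mathbb{E} \sup_{g \in \mathcal{F} - f^*} \frac{1}{n} \sum_{i=1}^n \left( 4 \eta_i g(x_i) - g(x_i)^2 \right)
\end{align*}
which we shall call the offset Rademacher (or Gaussian) averages.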


Contrast this approach with the first approach where we divided both sides by the norm $\| \widehat{f} - f^* \|_n$ and then upper bounded by supremum over an appropriately localized subset, then squared both sides.
Surprisingly, this somewhat simpler approach yields correct upper bounds. Note that the negative quadratic term annihilates the fluctuations of the term $\eta_i g(x_i)$ when the magnitude of $g$ becomes large enough (beyond some critical radius). Hence, the supremum is achieved in a finite radius, no larger than the critical radius:

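Lemma 33. Let $\delta_n$ be the critical radius. Then for any $c \ge 1$,
\[ P\left( \sup_{g \in \mathcal{F}^*} \left\{ 2c \langle \eta, g \rangle_n - \|g\|^2_n \right\} > 2 c^2 \delta_n^2 + \frac{2 c^2 u}{n} \right) \le \exp\{-u/2\} \tag{18.178} \]
In particular,
\[ \mathbb{E} \sup_{g \in \mathcal{F}^*} \left\{ 2 \langle \eta, g \rangle_n - \|g\|^2_n \right\} \lesssim \delta_n^2 + \frac{1}{n}. \]

Proof. By Gaussian concentration,
\[ P\left( Z(\delta_n) \ge \mathbb{E} Z(\delta_n) + t \delta_n \right) \le \exp\left\{ - \frac{n t^2}{2} \right\}. \tag{18.179} \]
We now condition on the complement of the above event. Take $g \in \mathcal{F}^*$. Consider two cases. First, if $\|g\|_n \le \delta_n$ then
\[ 2c \langle \eta, g \rangle_n - \|g\|^2_n \le 2c Z(\delta_n) \le 2c \left( \mathbb{E} Z(\delta_n) + t \delta_n \right) \le 2c \left( \frac{\delta_n^2}{2} + t \delta_n \right) \le c (t + \delta_n)^2 \tag{18.180} \]
Second, if $\|g\|_n \ge \delta_n$, we set $r = \delta_n / \|g\|_n \le 1$. Then
\[ 2c \langle \eta, g \rangle_n - \|g\|^2_n = \frac{2c}{r} \left\langle \eta, \frac{\delta_n}{\|g\|_n} g \right\rangle_n - \frac{\delta_n^2}{r^2} \le \frac{2c}{r} Z(\delta_n) - \frac{\delta_n^2}{r^2} = 2 \cdot \frac{\delta_n}{r} \cdot \frac{c Z(\delta_n)}{\delta_n} - \frac{\delta_n^2}{r^2}. \tag{18.181} \]
Using $2ab - b^2 \le a^2$, we get a further upper bound of
\[ c^2 \left( \frac{Z(\delta_n)}{\delta_n} \right)^2 \le c^2 \left( \frac{\delta_n^2 / 2 + t \delta_n}{\delta_n} \right)^2 = c^2 (\delta_n / 2 + t)^2 \tag{18.182} \]
Since $c \ge 1$, both cases are bounded by $c^2 (t + \delta_n)^2 \le 2 c^2 (\delta_n^2 + t^2)$, and choosing $t^2 = u/n$ in (18.179) gives (18.178).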

18.3.1 Example: linear regression


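To get a sense of the behavior of the offset process, consider the linear class $\mathcal{F} = \{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$. First, $\mathcal{F} - f^* = \mathcal{F}$. Second, note that functions are unbounded, and so Rademacher/Gaussian averages are unbounded too. However, offset Gaussian/Rademacher averages are
\begin{align}
\sup_{w \in \mathbb{R}^d} \sum_{i=1}^n \left( \eta_i \langle w, x_i \rangle - c \langle w, x_i \rangle^2 \right)
&= \sup_{w \in \mathbb{R}^d} \left\{ \left\langle w, \sum_{i=1}^n \eta_i x_i \right\rangle - c \|w\|^2_\Sigma \right\} \tag{18.183} \\
&= \frac{1}{4c} \left\| \sum_{i=1}^n \eta_i x_i \right\|^2_{\Sigma^\dagger} \tag{18.184}
\end{align}
where $\Sigma = \sum_{i=1}^n x_i x_i^T$ and $\Sigma^\dagger$ is the pseudoinverse. Assuming $\mathbb{E} \eta_i^2 \le 1$,
\[ \mathbb{E} \left\| \sum_{i=1}^n \eta_i x_i \right\|^2_{\Sigma^\dagger} \le \sum_{i=1}^n x_i^T \Sigma^\dagger x_i = \mathrm{tr}(\Sigma \Sigma^\dagger) = \mathrm{rank}(\Sigma) \]
We see that these offset Rademacher/Gaussian averages have the right behavior: we already saw in the first part of the course that the fast rate for linear regression is $O\left( \frac{\mathrm{rank}(\Sigma)}{n} \right)$ without further assumptions.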
We can view the negative term that extinguishes the fluctuations of the zero-mean
process as coming from the curvature of the square loss. Without the curvature, the negative
term is not there and we are left with the usual Rademacher/Gaussian averages.
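The closed form (18.184) and the trace identity can be checked numerically in dimension $d = 2$, where the inverse of $\Sigma$ can be written by hand. The design points `xs`, the noise `eta`, and the value of `c` below are arbitrary illustrative choices; with these random points $\Sigma$ is invertible, so the pseudoinverse coincides with the inverse.

```python
import random

random.seed(2)
n, c = 50, 0.5
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
eta = [random.gauss(0, 1) for _ in range(n)]

# Sigma = sum_i x_i x_i^T and its inverse (2x2 case).
s11 = sum(x[0] * x[0] for x in xs)
s12 = sum(x[0] * x[1] for x in xs)
s22 = sum(x[1] * x[1] for x in xs)
det = s11 * s22 - s12 * s12
inv = ((s22 / det, -s12 / det), (-s12 / det, s11 / det))

def quad_inv(b):
    """b^T Sigma^{-1} b."""
    return (b[0] * (inv[0][0] * b[0] + inv[0][1] * b[1])
            + b[1] * (inv[1][0] * b[0] + inv[1][1] * b[1]))

def offset(w):
    """sum_i eta_i <w, x_i> - c <w, x_i>^2, the objective in (18.183)."""
    total = 0.0
    for e, x in zip(eta, xs):
        pred = w[0] * x[0] + w[1] * x[1]
        total += e * pred - c * pred ** 2
    return total

b = (sum(e * x[0] for e, x in zip(eta, xs)),
     sum(e * x[1] for e, x in zip(eta, xs)))
closed_form = quad_inv(b) / (4 * c)                      # right-hand side of (18.184)
w_star = tuple((inv[i][0] * b[0] + inv[i][1] * b[1]) / (2 * c) for i in range(2))
leverage_sum = sum(quad_inv(x) for x in xs)              # tr(Sigma Sigma^{-1})
print(offset(w_star), closed_form, leverage_sum)
```

The maximizer `w_star` is obtained by setting the gradient of the concave quadratic to zero, and `leverage_sum` equals the rank of $\Sigma$, here 2.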

18.4 Minimax Optimal Rates


It can be shown that minimax optimal rates of estimation (for any estimator) for fixed design are given by the fixed point (see [35])
\[ \frac{\log N(\mathcal{F}, L_2(P_n), \delta_*)}{n} \asymp \delta_*^2 \tag{18.185} \]
For instance, if $\log N(\mathcal{F}, L_2(P_n), \delta) \asymp \delta^{-p}$, the balance is
\[ \delta_*^{-p} n^{-1} \asymp \delta_*^2 \]
which gives the rate of $\delta_*^2 = n^{-\frac{2}{2+p}}$.
Do we recover this rate for least squares?

18.5 Rates for Fixed Design Least Squares


We would like to calculate the critical radius $\delta_n$ for some function classes of interest. Recall that $\delta_n$ is defined as the smallest number $\delta$ such that
\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \le \delta^2 / 2. \]
The strategy is to find upper bounds on the left-hand side in terms of $\delta$ and then solve for the minimal $\delta$. In particular, we know that for any $\alpha \ge 0$,
\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim \alpha + \frac{1}{\sqrt{n}} \int_{\alpha/4}^{\delta} \sqrt{\log N(\mathcal{F}^*, L_2(P_n), \varepsilon)} \, d\varepsilon \tag{18.186} \]
If the Dudley integral in (18.186) is of the order of the single-scale value (think area under the curve)
\[ \delta \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}^*, L_2(P_n), \delta)} \]
then an upper bound on the critical radius is obtained by the balance
\[ \delta \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}^*, L_2(P_n), \delta)} \asymp \delta^2 \tag{18.187} \]
which matches the optimal rate in (18.185). In this case, least squares is an optimal procedure. Below we compute the fixed point under entropy growth conditions.

18.5.1 Nonparametric
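Suppose we have
\[ \log N(\mathcal{F}^*, L_2(P_n), \varepsilon) \lesssim \varepsilon^{-p} \]
for $p \in (0, 2)$. Then, taking $\alpha = 0$,
\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim n^{-1/2} \left[ \varepsilon^{1 - p/2} \right]_0^\delta = n^{-1/2} \delta^{1 - p/2} \]
Setting
\[ n^{-1/2} \delta^{1 - p/2} = \delta^2 \]
yields
\[ \delta_n \lesssim n^{-\frac{1}{2+p}} \]
and thus the rate of the least squares estimator is
\[ \mathbb{E} \left\| \widehat{f} - f^* \right\|^2_n \lesssim n^{-\frac{2}{2+p}} \]
Hence, least squares are optimal in this minimax sense for $p \in (0, 2)$.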
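The balance equation above can also be solved numerically, which is how one would proceed when no closed form is available. The sketch below finds the fixed point by bisection and compares it with $n^{-1/(2+p)}$. All constants hidden by $\lesssim$ are set to 1, which is an assumption made only for illustration; the function name `critical_radius` is likewise hypothetical.

```python
def critical_radius(n, p, lo=1e-12, hi=1.0, iters=200):
    """Solve n^{-1/2} * delta^{1 - p/2} = delta^2 for delta by bisection.

    For p in (0, 2) the left side dominates for small delta and the right
    side dominates for large delta, so the sign change is unique.
    """
    def gap(delta):
        return n ** -0.5 * delta ** (1 - p / 2) - delta ** 2

    for _ in range(iters):
        mid = (lo + hi) / 2
        if gap(mid) > 0:   # complexity term still exceeds delta^2: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for p in (0.5, 1.0, 1.5):
    n = 10 ** 4
    print(p, critical_radius(n, p), n ** (-1 / (2 + p)))
```

Squaring the returned radius gives the rate $n^{-2/(2+p)}$ for the chosen $p$.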

Figure 2: Optimal (in general) rates $n^{-\frac{2}{2+p}}$ (obtained with localization for $p \in (0, 2)$ by ERM) vs. rates obtained without localization (e.g. via global Rademacher averages).

Example: Convex $L$-Lipschitz functions on a compact domain in $\mathbb{R}^d$:
\[ \log N(\mathcal{F}_{\mathrm{cvx,lip}}, L_2(P_n), \varepsilon) \le (L/\varepsilon)^{d/2} \]
Example: $L$-Lipschitz functions on a compact domain in $\mathbb{R}^d$:
\[ \log N(\mathcal{F}_{\mathrm{lip}}, L_2(P_n), \varepsilon) \le (L/\varepsilon)^{d} \]

18.5.2 Parametric
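Consider the parametric case,
\[ \log N(\mathcal{F}^*, L_2(P_n), \varepsilon) \lesssim d \log(1 + 2/\varepsilon) \]
Then
\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim \frac{1}{\sqrt{n}} \int_0^\delta \sqrt{d \log(1 + 2/\varepsilon)} \, d\varepsilon \tag{18.188} \]
Change of variables gives an upper bound
\[ \delta \cdot \sqrt{\frac{d}{n}} \int_0^1 \sqrt{\log(1 + 2/(u\delta))} \, du \tag{18.189} \]
Unfortunately, this gives a pesky logarithmic factor that should not always be there. For some parametric cases one can, in fact, prove that local covering numbers behave as
\[ \log N(\mathcal{F}^* \cap \{g : \|g\|_n \le \delta\}, L_2(P_n), \varepsilon) \lesssim d \log(1 + 2\delta/\varepsilon) \tag{18.190} \]
In this case, the change of variables leads to
\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim \delta \cdot \sqrt{\frac{d}{n}} \int_0^1 \sqrt{\log(1 + 2/\varepsilon)} \, d\varepsilon \lesssim \delta \sqrt{\frac{d}{n}} \tag{18.191} \]
Equating
\[ \delta \sqrt{\frac{d}{n}} \asymp \delta^2 \]
yields
\[ \delta_n^2 \asymp \frac{d}{n} \]
Note that local covering numbers (18.190) are available in some parametric cases (e.g. when we discretize the parameter space of linear functions) but may not be available for some other classes (e.g. for VC classes, except under additional conditions).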

18.6 Remarks
• to bound metric entropy of F ∗ = F − f ∗ , instead consider F − F. This often leads to
only mild increase in a constant. For instance, if F is a class of L-Lipschitz functions,
then F − F is a subset of 2L-Lipschitz functions.
• Note that the rate δn2 depends on local covering numbers (or, local complexity) around
f ∗ . This gives a path to proving adaptivity results (e.g. if f ∗ is convex but has
only k linear pieces, the rate of estimation is parametric because its neighborhood is
“simple”).
• A simple counting argument (see Yang & Barron 1999, Section 7) shows that for
rich enough classes (e.g. nonparametric) worst-case local entropy (worst-case location
in the class) and global entropies behave similarly. This implies, in particular, that
instead of constructing a local packing for a lower bound (via hypothesis testing), one
can instead use global entropy with Fano inequality, justifying the LHS of (18.185)
as the lower bound for estimation. See also Mendelson’s “local vs global parameters”
paper for an in-depth discussion.

19. ORACLE INEQUALITIES

What if we do not assume the regression function $f^*$ is in $\mathcal{F}$? How can we prove an oracle inequality
\[ \mathbb{E} \left\| \widehat{f} - f^* \right\|^2_n - \inf_{f \in \mathcal{F}} \|f - f^*\|^2_n \le \phi(\mathcal{F}, n) \]
Again, we will focus on fixed design.

19.1 Convex F

Suppose $\mathcal{F}$ is convex (or, rather, $\mathcal{F}|_{x_1, \ldots, x_n}$ is convex). Let $\widehat{f}$ be the constrained least squares:
\[ \widehat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(x_i) - Y_i)^2 = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - Y\|^2_n \]
For the basic inequality we used
\[ \left\| \widehat{f} - Y \right\|^2_n \le \|f^* - Y\|^2_n \]
but in the misspecified case this is no longer true. However, what is true is that
\[ \left\| \widehat{f} - Y \right\|^2_n \le \|f_{\mathcal{F}} - Y\|^2_n \]
Unfortunately, this inequality is not strong enough to get us the desired result. Fortunately, we can do better. Since $\widehat{f}$ is a projection of $Y$ onto $F = \mathcal{F}|_{x_1, \ldots, x_n}$, it holds that
\[ \left\| \widehat{f} - Y \right\|^2_n \le \|f - Y\|^2_n - \left\| \widehat{f} - f \right\|^2_n \tag{19.192} \]
for any $f \in \mathcal{F}$, and in particular for $f_{\mathcal{F}}$. This is a simple consequence of convexity and the Pythagorean theorem. The negative quadratic will give us the extra juice we need.
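Inequality (19.192) is easy to probe numerically for a convex set whose projection is known in closed form. The sketch below uses the Euclidean unit ball in $\mathbb{R}^n$ as a stand-in for the convex set $F$; this choice, the response vector `Y`, and the random competitors are illustrative assumptions.

```python
import math
import random

def project_ball(y, radius=1.0):
    """Euclidean projection of y onto the ball of the given radius."""
    norm = math.sqrt(sum(t * t for t in y))
    scale = 1.0 if norm <= radius else radius / norm
    return [scale * t for t in y]

def sq_dist_n(a, b):
    """||a - b||_n^2 = (1/n) * squared Euclidean distance."""
    return sum((s - t) ** 2 for s, t in zip(a, b)) / len(a)

random.seed(3)
n = 10
Y = [random.gauss(0, 2) for _ in range(n)]
f_hat = project_ball(Y)  # least squares over the ball is the projection of Y

# Slack of (19.192): ||f - Y||^2 - ||f_hat - f||^2 - ||f_hat - Y||^2 >= 0.
slacks = []
for _ in range(1000):
    f = project_ball([random.gauss(0, 1) for _ in range(n)])  # a point of the set
    slacks.append(sq_dist_n(f, Y) - sq_dist_n(f_hat, f) - sq_dist_n(f_hat, Y))
print(min(slacks))
```

The slack equals $2 \langle f - \widehat{f}, \widehat{f} - Y \rangle_n$, which is nonnegative by the first-order optimality condition for projections onto convex sets.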
Adding and subtracting $f^*$ on both sides and expanding,
\[ \left\| \widehat{f} - f^* \right\|^2_n + \|f^* - Y\|^2_n + 2 \langle \widehat{f} - f^*, -\eta \rangle_n \le \|f_{\mathcal{F}} - f^*\|^2_n + \|f^* - Y\|^2_n + 2 \langle f_{\mathcal{F}} - f^*, -\eta \rangle_n - \left\| f_{\mathcal{F}} - \widehat{f} \right\|^2_n \]
which leads to
\begin{align}
\left\| \widehat{f} - f^* \right\|^2_n - \|f_{\mathcal{F}} - f^*\|^2_n
&\le 2 \langle \eta, \widehat{f} - f_{\mathcal{F}} \rangle_n - \left\| \widehat{f} - f_{\mathcal{F}} \right\|^2_n \tag{19.193} \\
&\le \sup_{h \in \mathcal{F} - f_{\mathcal{F}}} \left\{ 2 \langle \eta, h \rangle_n - \|h\|^2_n \right\} \tag{19.194}
\end{align}
We conclude that for convex $\mathcal{F}$ and fixed design, the upper bounds we find for well-specified and misspecified cases match. Moreover, since the misspecified case is strictly more general, and since lower bounds for the well-specified case and polynomial entropy growth (in the $p < 2$ regime) match the upper bounds, we conclude that constrained least squares are also minimax optimal for the fixed design misspecified case.
Note: a crucial observation is that offset complexity would arise even if (19.192) had a different constant multiplier in front of $- \| f - \widehat{f} \|^2_n$. We will exploit this observation in a bit.

19.2 General F
What if F is not convex? It turns out that least squares (ERM) can be suboptimal even if
F is a finite class!

19.2.1 A lower bound for ERM (or any proper procedure)
The suboptimality can be illustrated on a very simple example. Suppose $\mathcal{X} = \{x\}$, $Y$ is $\{0, 1\}$-valued, and $\mathcal{F} = \{f_0, f_1\}$ such that $f_0(x) = 0$ and $f_1(x) = 1$. The marginal distribution is the trivial $P_X = \delta_x$ and suppose we have two conditional distributions $P_0(Y = 1) = 1/2 - \alpha$ and $P_1(Y = 1) = 1/2 + \alpha$. Clearly, the population minimizer for $P_j$ is $f_j$. Also, under $P_0$ the regression function is $f_0^* = 1/2 - \alpha$ while under $P_1$ it is $f_1^* = 1/2 + \alpha$. Finally, ERM is a method that goes after the most frequent observation in the data $Y_1, \ldots, Y_n$.
However, if $\alpha \propto 1/\sqrt{n}$, there is a constant probability of error in determining whether $P_0$ or $P_1$ generated the data. Note that the oracle risk is $\min_{f \in \{f_0, f_1\}} \|f - f_i^*\|^2 = (1/2 - \alpha)^2$ while the risk of the estimator is $p (1/2 + \alpha)^2 + (1 - p)(1/2 - \alpha)^2$, where $p$ is the probability of making a mistake and not selecting $f_i$ under the distribution $P_i$. Hence, the overall comparison to the oracle is at least $p((1/2 + \alpha)^2 - (1/2 - \alpha)^2) = 2 p \alpha = \Omega(\alpha)$ when $p$ is constant. Hence, ERM (or any “proper” method that selects from $\mathcal{F}$) cannot achieve excess loss smaller than $\Omega(n^{-1/2})$:
\[ \max_{P_i \in \{P_0, P_1\}} \left\{ \mathbb{E} \left\| \widehat{f} - f_i^* \right\|^2 - \min_{f \in \{f_0, f_1\}} \|f - f_i^*\|^2 \right\} = \Omega(n^{-1/2}) \]
Yet, an improper method that selects $\widehat{f}$ outside $\mathcal{F}$ can achieve an $O(n^{-1})$ rate.
A similar simple lower bound can be constructed for ERM with random design.^3
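The two-point example can be made concrete by computing the mistake probability of ERM exactly from the binomial distribution. The sketch below takes $\alpha = a/\sqrt{n}$ with the arbitrary choice $a = 0.5$, evaluates the excess risk $2 p \alpha$ under $P_0$, and rescales it by $\sqrt{n}$. The tie-breaking rule (ties go to $f_0$) is an assumption of this illustration.

```python
import math

def binom_tail_above_half(n, q):
    """P(Bin(n, q) > n / 2), summed in log space to avoid underflow."""
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(q) + (n - k) * math.log(1 - q))
        total += math.exp(log_pmf)
    return total

def erm_excess_risk(n, a=0.5):
    """Excess risk of ERM under P_0 when alpha = a / sqrt(n)."""
    alpha = a / math.sqrt(n)
    # ERM picks f_1 when ones are the strict majority, which is a mistake under P_0.
    p_mistake = binom_tail_above_half(n, 0.5 - alpha)
    return p_mistake * ((0.5 + alpha) ** 2 - (0.5 - alpha) ** 2)

# sqrt(n) * excess risk stays of constant order, so the excess risk is ~ n^{-1/2}.
scaled = {n: math.sqrt(n) * erm_excess_risk(n) for n in (100, 400, 1600)}
print(scaled)
```

The rescaled values stay bounded away from zero as $n$ grows, which is the $\Omega(n^{-1/2})$ behavior claimed above.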

19.2.2 How about ERM over Convex Hull?


Given that the procedure has to be “improper” (select from outside of $\mathcal{F}$), one can hypothesize that doing ERM over $\mathrm{conv}(\mathcal{F})$ may work. Interestingly, this procedure is also rate-suboptimal for a finite $\mathcal{F}$ since $\mathrm{conv}(\mathcal{F})$ is too expressive.^4

19.2.3 An improper procedure


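Somewhat surprisingly, only a small modification of ERM is required to make it optimal for general classes. Consider the following two-step procedure^5 (Star Estimator):
\begin{align}
\widehat{g} &= \operatorname*{argmin}_{f \in \mathcal{F}} \|f - Y\|^2_n \tag{19.195} \\
\widehat{f} &= \operatorname*{argmin}_{f \in \mathrm{star}(\mathcal{F}, \widehat{g})} \|f - Y\|^2_n \tag{19.196}
\end{align}
where
\[ \mathrm{star}(\mathcal{F}, g) = \{\alpha f + (1 - \alpha) g : f \in \mathcal{F}, \alpha \in [0, 1]\}. \]
Note that $\widehat{f}$ need not be in $\mathcal{F}$ but is a convex combination of two elements of $\mathcal{F}$.
Note: the method is, in general, different from single ERM over a convex hull of $\mathcal{F}$, and so it is not clear that a version of (19.192) holds. Nevertheless, the following is true [21]: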

^3 For more detailed discussion, we refer to [20].
^4 Proof can be found in Lecué & Mendelson.
^5 For a finite class, the above estimator was analyzed by J-Y. Audibert [1].

Lemma 34. For any $f \in \mathcal{F}$,
\[ \|f - Y\|^2_n - \left\| \widehat{f} - Y \right\|^2_n \ge \frac{1}{18} \left\| \widehat{f} - f \right\|^2_n. \tag{19.197} \]

The above inequality is an approximate version of (19.192), a generalization of the Pythagorean relationship for convex sets.
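For a finite class, both steps of the Star Estimator can be computed in closed form, since the second step reduces to a one-dimensional quadratic minimization along each segment from $\widehat{g}$ to an element of the class. The sketch below is a minimal illustration on the two-function class from the lower-bound example, with a made-up response vector `Y`; the helper names are hypothetical.

```python
def sq_dist_n(a, b):
    """||a - b||_n^2 = (1/n) * squared Euclidean distance."""
    return sum((s - t) ** 2 for s, t in zip(a, b)) / len(a)

def star_estimator(F, Y):
    """Two-step procedure (19.195)-(19.196) for a finite class F of value vectors."""
    g_hat = min(F, key=lambda f: sq_dist_n(f, Y))        # step 1: ERM over F
    best = g_hat
    for f in F:                                           # step 2: ERM over star(F, g_hat)
        d = [s - t for s, t in zip(f, g_hat)]
        dd = sum(t * t for t in d)
        if dd == 0.0:
            continue  # f coincides with g_hat on the data
        # Minimizer of ||g_hat + a * d - Y||^2 over a in [0, 1].
        a = sum((y - g) * t for y, g, t in zip(Y, g_hat, d)) / dd
        a = min(1.0, max(0.0, a))
        cand = [g + a * t for g, t in zip(g_hat, d)]
        if sq_dist_n(cand, Y) < sq_dist_n(best, Y):
            best = cand
    return g_hat, best

F = [[0.0] * 4, [1.0] * 4]      # f_0 = 0 and f_1 = 1 evaluated on four points
Y = [1.0, 0.0, 1.0, 1.0]
g_hat, f_hat = star_estimator(F, Y)
print(g_hat, f_hat)
```

On this data ERM returns the constant 1, while the second step moves to the constant 0.75, the empirical mean of `Y`, which lies outside the class. The inequality of Lemma 34 can be checked directly for both elements of `F`.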
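As a consequence,
\[ \left\| \widehat{f} - f^* \right\|^2_n - \|f_{\mathcal{F}} - f^*\|^2_n \le 2 \langle \eta, \widehat{f} - f_{\mathcal{F}} \rangle_n - \frac{1}{18} \left\| f_{\mathcal{F}} - \widehat{f} \right\|^2_n \]
and the same upper bounds hold as in the convex case, up to constants. The difference is that the supremum is now in $\mathrm{star}(\mathcal{F}, \widehat{g}) \subseteq \mathcal{F} - f^* + \mathrm{star}(\mathcal{F} - \mathcal{F})$ which is not significantly larger than $\mathcal{F}$ in terms of entropy (unless $\mathcal{F}$ is finite, which can be handled separately).
Remarks:

1. If the set is convex, $\widehat{f} = \widehat{g}$.

2. The Star Estimator can be viewed as one step of Frank-Wolfe. More steps can improve the constant.

Exercise: for any $\alpha > 0$ and a set $F \subset \mathbb{R}^n$, the covering numbers satisfy
\[ \log N(F, \|\cdot\|, 2\alpha) \le \log N(\mathrm{star}(F, 0), \|\cdot\|, 2\alpha) \le \log(\mathrm{diam}(F)/\alpha) + \log N(F, \|\cdot\|, \alpha) \]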

19.3 Offset Rademacher averages


We have seen that offset Gaussian or Rademacher averages are a convenient way of proving
rates of convergence for least squares in well-specified, as well as misspecified (both convex
and general function class) cases. We finish this section by estimating offset complexities
via covering numbers. We start with a finite set.
For a set $V \subset \mathbb{R}^n$, the offset process indexed by $V$ is defined as a stochastic process
\[ v \mapsto \sum_{i=1}^n \left( \epsilon_i v_i - c v_i^2 \right) = \langle \epsilon, v \rangle - c \|v\|^2. \]
Here $\epsilon_i$ are independent Rademacher, but the same results hold for any sub-Gaussian random variables.

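Lemma 35. Let $V \subset \mathbb{R}^n$ be a finite set of vectors, $\mathrm{card}(V) = N$. Then for any $c > 0$,
\[ \mathbb{E}_\epsilon \max_{v \in V} \left\{ \langle \epsilon, v \rangle - c \|v\|^2 \right\} \le \frac{\log N}{2c}. \]
Furthermore,
\[ P\left( \max_{v \in V} \left\{ \langle \epsilon, v \rangle - c \|v\|^2 \right\} \ge \frac{1}{2c} \left( \log N + \log(1/\delta) \right) \right) \le \delta \]

Proof. Assuming the random variables are 1-subGaussian,
\begin{align*}
\mathbb{E} \max_{v \in V} \left\{ \langle \epsilon, v \rangle - c \|v\|^2 \right\}
&= \mathbb{E} \frac{1}{\lambda} \log \exp \max_{v \in V} \left\{ \lambda \langle \epsilon, v \rangle - \lambda c \|v\|^2 \right\} \\
&\le \frac{1}{\lambda} \log \mathbb{E} \sum_{v \in V} \exp\left\{ \lambda \langle \epsilon, v \rangle - \lambda c \|v\|^2 \right\} \\
&\le \frac{1}{\lambda} \log \left( \sum_{v \in V} \exp\left\{ \lambda^2 \|v\|^2 / 2 - \lambda c \|v\|^2 \right\} \right) \\
&= \frac{1}{2c} \log N
\end{align*}
where we chose $\lambda = 2c$.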
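The first bound of Lemma 35 can be verified exactly for small $n$ by enumerating all sign vectors. In the sketch below the finite set `V` is random and the offset constants `c` are arbitrary; both are illustrative assumptions.

```python
import itertools
import math
import random

def expected_offset_max(V, c):
    """Exact E_eps max_{v in V} <eps, v> - c ||v||^2 over all 2^n sign vectors."""
    n = len(V[0])
    penalties = [c * sum(x * x for x in v) for v in V]
    total = 0.0
    for eps in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(e * x for e, x in zip(eps, v)) - pen
                     for v, pen in zip(V, penalties))
    return total / 2 ** n

random.seed(4)
n, N = 8, 16
V = [[random.uniform(-3, 3) for _ in range(n)] for _ in range(N)]

# For each c: (exact expected offset maximum, bound log(N) / (2c)).
results = {c: (expected_offset_max(V, c), math.log(N) / (2 * c))
           for c in (0.1, 0.5, 2.0)}
for c, (lhs, rhs) in results.items():
    print(c, lhs, "<=", rhs)
```

Note that the bound does not depend on the magnitudes of the vectors in `V`: the quadratic penalty absorbs them.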
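Finally, we state a general result.

Theorem 7. Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to \mathbb{R}$. Then for any $x_1, \ldots, x_n \in \mathcal{X}$ and the corresponding empirical measure $P_n$,
\begin{align}
& \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \left( \epsilon_i f(x_i) - c f(x_i)^2 \right) \tag{19.198} \\
& \quad \le \inf_{\gamma \ge 0, \alpha \in [0, \gamma]} \left\{ \frac{(2/c) \log N(\mathcal{F}, L_2(P_n), \gamma)}{n} + 4\alpha + \frac{12}{\sqrt{n}} \int_\alpha^\gamma \sqrt{\log N(\mathcal{F}, L_2(P_n), \delta)} \, d\delta \right\} \notag
\end{align}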

We have shown in Lemma 33 that the offset Gaussian process cannot be more than a constant multiple of the critical radius. But what if the bound of the above theorem is too loose to be useful? To see that it attains the optimal balance of (18.185) in some cases, consider the situation where, as in the discussion preceding (18.187), the Dudley entropy integral is of the order of the single-scale estimate
\[ \gamma \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}, L_2(P_n), \gamma)}. \]
In this case, the optimal balance in (19.198) is
\[ \gamma \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}, L_2(P_n), \gamma)} \asymp \frac{\log N(\mathcal{F}, L_2(P_n), \gamma)}{n}. \tag{19.199} \]
Dividing and squaring, we do recover (18.185), an optimal rate. We conclude that the upper bound of Theorem 7 recovers optimal rates for regression in the $p \in (0, 2)$ regime and, more generally, under the above-stated condition on the Dudley integral.

20. TALAGRAND’S INEQUALITY AND APPLICATIONS

For the last half of the course, we have only considered the expected suprema of empirical,
Rademacher, or Gaussian processes. We mentioned that high-probability statements follow
from different arguments. In this lecture, we provide the tools to study deviations of random
suprema above (or below) their expected values.
The following version of Talagrand’s inequality is due to Bousquet:

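Theorem 8. Let $X_1, \ldots, X_n$ be i.i.d., and let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Suppose
\[ \sup_{f \in \mathcal{F}} \mathrm{var}(f(X)) \le \sigma^2 \]
for some $\sigma > 0$. Let either
\[ Z = \sup_{f \in \mathcal{F}} \left\{ \mathbb{E} f - \frac{1}{n} \sum_{i=1}^n f(X_i) \right\} \quad \text{or} \quad Z = \sup_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f \right\} \]
and set $v = \sigma^2 + 2 \mathbb{E} Z$. Then for any $t \ge 0$,
\[ P\left( Z \ge \mathbb{E} Z + \sqrt{\frac{2tv}{n}} + \frac{t}{3n} \right) \le e^{-t}. \]

Consider a particular case of a singleton $\mathcal{F} = \{f\}$. Then $Z = \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f$ (or the other form), and $v = \sigma^2 = \mathrm{var}(f(X))$ because $\mathbb{E} Z = 0$. Then Theorem 8 says that
\[ P\left( \frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f \ge \sigma \sqrt{\frac{2t}{n}} + \frac{t}{3n} \right) \le e^{-t} \]
which is Bernstein’s inequality. Moreover, the constants match those in Bernstein’s inequality, which is remarkable.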
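To get a feel for how tight the singleton case is, one can compare the Bernstein bound with an exact tail probability. The sketch below takes $f(X) = X$ with $X \sim \mathrm{Bernoulli}(p)$, so that the sample mean is a rescaled binomial. The values of `n`, `p`, and `t` are arbitrary choices for illustration.

```python
import math

def upper_tail(n, p, threshold):
    """P(mean of n Bernoulli(p) draws - p >= threshold), exact via the binomial pmf."""
    total = 0.0
    for k in range(n + 1):
        if k / n - p >= threshold:
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1 - p))
            total += math.exp(log_pmf)
    return total

n, p = 200, 0.1
sigma = math.sqrt(p * (1 - p))  # standard deviation of a single Bernoulli(p)

# For each t: (t, exact tail at the Bernstein deviation level, bound e^{-t}).
checks = []
for t in (0.5, 1.0, 2.0, 4.0):
    threshold = sigma * math.sqrt(2 * t / n) + t / (3 * n)
    checks.append((t, upper_tail(n, p, threshold), math.exp(-t)))
for row in checks:
    print(row)
```

With a small variance ($p = 0.1$), the deviation level is driven by $\sigma$ rather than by the range of $f$, which is the advantage of a Bernstein-type bound over a Hoeffding-type one.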
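Now, recall the definition of empirical Rademacher averages. In this lecture we will scale these averages by $1/n$:
\[ \widehat{R}(\mathcal{F}) = \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \,\Big|\, X_1, \ldots, X_n \right]. \]
We have $\widehat{R}(\mathcal{F}) \ge 0$ by Jensen’s inequality. Moreover, this function satisfies a self-bounding property [5, Ch. 6], which implies the following

Theorem 9. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Then for any $t > 0$,
\[ P\left( \widehat{R}(\mathcal{F}) \ge \mathbb{E} \widehat{R}(\mathcal{F}) + \sqrt{\frac{2t\, \mathbb{E} \widehat{R}(\mathcal{F})}{n}} + \frac{t}{3n} \right) \le e^{-t} \]
and
\[ P\left( \widehat{R}(\mathcal{F}) \le \mathbb{E} \widehat{R}(\mathcal{F}) - \sqrt{\frac{2t\, \mathbb{E} \widehat{R}(\mathcal{F})}{n}} \right) \le e^{-t} \]

This first statement has a similar form to that of Theorem 8 (after normalizing $Z$ by $n$) with $v$ replaced by $\mathbb{E} \widehat{R}(\mathcal{F})$, a consequence of the self-bounding property of $\widehat{R}(\mathcal{F})$.
In particular, by using the inequality
\[ \forall x, y, \lambda > 0, \quad \sqrt{xy} \le \frac{\lambda}{2} x + \frac{1}{2\lambda} y, \]
we have
\[ P\left( \widehat{R}(\mathcal{F}) \ge 2 \mathbb{E} \widehat{R}(\mathcal{F}) + \frac{5t}{6n} \right) \le e^{-t} \]
and
\[ P\left( \mathbb{E} \widehat{R}(\mathcal{F}) \ge 2 \widehat{R}(\mathcal{F}) + \frac{2t}{n} \right) \le e^{-t}. \]
Finally, recall that the symmetrization lemma states that for the supremum of the empirical process $Z$ in Theorem 8,
\[ \mathbb{E} Z \le 2 \mathbb{E} \widehat{R}(\mathcal{F}). \]
Together with Theorem 8 and Theorem 9, this yields (see e.g. [3, Thm 2.1])

Theorem 10. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Let $\sup_{f \in \mathcal{F}} \mathrm{var}(f(X)) \le \sigma^2$. Then for any $t > 0$, with probability at least $1 - 2e^{-t}$, for any $f \in \mathcal{F}$,
\[ \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \le 6 \widehat{R}(\mathcal{F}) + \sigma \sqrt{\frac{2t}{n}} + \frac{11t}{n} \tag{20.200} \]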

A few remarks. First, the constants here can be balanced differently (see [3, Thm 2.1]). Second, the same result holds with $\frac{1}{n} \sum_{i=1}^n f(X_i) - \mathbb{E} f(X)$ on the left-hand side of (20.200). Third, Theorem 8 can be replaced with McDiarmid’s inequality if one does not aim to take advantage of small variance $\sigma$; however, some of the key results on fast rates in learning theory do take advantage of this Bernstein-style bound.
Theorem 10 can be applied in a variety of situations. To start, since $\widehat{R}$ does not depend on the unknown distribution of $X$, all the terms (except for $\sigma$) on the right-hand side of (20.200) can be computed from the data. While the supremum of the empirical process on the left-hand side of (20.200) cannot be computed in general since the distribution of $X$ is not known, the expression provides a data-dependent estimate of this quantity.
In the setting of prediction and model selection, we could consider a model $\mathcal{G}_\lambda$ such that $\mathcal{G}_\lambda \subseteq \mathcal{G}_{\lambda'}$ for $\lambda \le \lambda'$, i.e. $\lambda$ is a tunable parameter that controls complexity of the model (e.g. width of a neural network). Inequality (20.200) can then be viewed as an upper bound on the expected loss of any function in $\mathcal{G}_\lambda$ in terms of its empirical fit to data plus a penalty term for model complexity, as given by the Rademacher averages. Moreover, this penalty is data-driven.
The above theorems are also at the heart of proving localization results for random design, both in the well-specified and misspecified settings. Let us only mention one consequence (see [3, Thm 4.1]).

Theorem 11. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Suppose for every $f \in \mathcal{F}$, it holds that $\mathbb{E} f(X)^2 \le B\, \mathbb{E} f(X)$. Then with probability at least $1 - 3e^{-t}$, for all $f \in \mathcal{F}$,
\[ \mathbb{E} f(X) \le \frac{2}{n} \sum_{i=1}^n f(X_i) + c \delta_n^2 + \frac{c' t}{n} \tag{20.201} \]
where $\delta_n$ is the critical radius^a of $\mathrm{star}(\mathcal{F}, 0)$ and $c, c'$ are constants that depend on $B$.
^a To be precise, [3, Thm 4.1] includes a confidence term $t/n$ in the computation of the critical radius.

20.1 Application: Learning and Low-Noise


Consider the setting of statistical learning with a class $\mathcal{F}$ and a 1-Lipschitz loss function $\ell$. Let $\widehat{f}$ be a minimizer of empirical risk $\widehat{L}(f)$ as defined in (10.127). Let $f_{\mathcal{F}}$ be a minimizer of expected risk $L(f)$ over $\mathcal{F}$. We apply Theorem 11 to the class $\ell \circ \mathcal{F} - \ell \circ f_{\mathcal{F}}$. Under the high-probability event of the Theorem, the inequality holds for all functions, so we can apply it to $\widehat{f} \in \mathcal{F}$. Since $\widehat{L}(\widehat{f}) - \widehat{L}(f_{\mathcal{F}}) \le 0$, we have that, with probability at least $1 - 3e^{-t}$,
\[ L(\widehat{f}) - L(f_{\mathcal{F}}) \le c \delta_n^2 + \frac{c' t}{n}. \tag{20.202} \]
This conclusion holds under the assumption
\[ \mathbb{E}(f(X) - f^*(X))^2 \le B (L(f) - L(f_{\mathcal{F}})), \]
which, together with the Lipschitz condition on the loss, implies the so-called Bernstein condition
\[ \mathbb{E}(\ell \circ f - \ell \circ f_{\mathcal{F}})^2 \le B\, \mathbb{E}(\ell \circ f - \ell \circ f_{\mathcal{F}}) = B (L(f) - L(f_{\mathcal{F}})) \]
Such a condition (or a closely related variant) is implied by, for instance, convexity of $\mathcal{F}$ and uniform convexity of $\ell$, or by low-noise assumptions in classification settings. For the case of square loss, (20.202) implies a random design oracle inequality in the misspecified case. We now provide more details for the case of well-specified random design regression and develop general tools for passing from fixed to random design.

21. FROM FIXED TO RANDOM DESIGN

Recall that in fixed design regression we aim to prove that for a given set of points x1 , . . . , xn ,
an estimator (such as constrained least squares) attains
2
fb − f ∗ ≤ ...
L2 (Pn )

where on the right-hand side we have either a quantity that goes to zero with n or oracle
risk as in the misspecified case. We would like to analyze random design regression where
X1 , . . . , Xn are i.i.d from P . Importantly, we also measure the risk through the L2 (P ) norm.
However,
2 2
E fb − f ∗ 6= E fb − f ∗
L2 (Pn ) L2 (P )

83
since the algorithm fb depends on X1 , . . . , Xn , and so lifting the results from the fixed design
case is not straightforward.
Imagine, however, we could prove that with high probability, for all functions f ∈ F,

\[ \|f - f^*\|^2_{L^2(P)} \le 2\,\|f - f^*\|^2_{L^2(P_n)} + \psi(n, \mathcal F). \tag{21.203} \]

In that case, a guarantee for fixed-design regression would translate into a guarantee for
random design regression as long as fb ∈ F (for the Star Algorithm, just enlarge F appro-
priately). Furthermore, as long as ψ(n, F) decays with n at least as fast as the rate of fixed
design regression, we would be able to conclude that random design is not harder than fixed
design. Let’s see if this can be shown.

21.1 Uniformly Bounded Functions


Our plan of action for proving results of the form (21.203) is to view the inequality as an
instance of a more general uniform comparison

\[ \forall g \in \mathcal G, \qquad \mathbb{E} g(X) \le \frac{2}{n}\sum_{i=1}^n g(X_i) + \psi(n, \mathcal G) \]

for a class $\mathcal G$ of nonnegative functions. In this part of the lecture, we sketch the analysis for
uniformly bounded functions. This requirement is necessitated by the use of Theorems 8
and 10.
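Let $\bar\delta$ be such that for all $\delta \ge \bar\delta$,

\[ \mathbb{E}_\varepsilon \sup_{g \in \mathcal G:\ \frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2}\ \frac{1}{n}\sum_{i=1}^n \varepsilon_i g(X_i) \le \delta^2/2 \tag{21.204} \]

conditionally on $X_1, \dots, X_n$.
Alternatively, we can write (21.204) as

\[ \mathbb{E}_\varepsilon \sup_{g \in \mathcal G:\ \|\sqrt{g}\|_n \le \delta}\ \frac{1}{n}\sum_{i=1}^n \varepsilon_i g(X_i) \le \delta^2/2. \tag{21.205} \]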

The following result can be proved using Theorem 10 (see [6, Theorem 6.1]):

Lemma 36. Let $\mathcal G$ be a class of functions with values in $[0, 1]$. Then with probability
at least $1 - e^{-t}$, for all $g \in \mathcal G$,

\[ \mathbb{E} g(X) \le \frac{2}{n}\sum_{i=1}^n g(X_i) + c \cdot \bar\delta^2 + \frac{c' \cdot (t + \log\log n)}{n} \tag{21.206} \]

where $\bar\delta = \bar\delta(\mathcal G)$ is any upper bound on the fixed point in (21.204).

Applying this inequality for the class $\mathcal G = \{(f - f')^2 : f, f' \in \mathcal F\}$, assuming $\mathcal F$ is a class
of $[0, 1]$-valued functions, yields

\[ \|f - f'\|^2_{L^2(P)} \le 2\,\|f - f'\|^2_{L^2(P_n)} + c \cdot \bar\delta^2 + \frac{c' \cdot (t + \log\log n)}{n}. \tag{21.207} \]
A few remarks. First, $\mathcal G = (\mathcal F - \mathcal F)^2$ can be replaced by $(\mathcal F - f^*)^2$, even if $f^* \notin \mathcal F$,
as long as the resulting class is uniformly bounded. Second, we observe that (21.204) is
defined with a localization restriction $\frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2$ rather than $\frac{1}{n}\sum_{i=1}^n g(X_i)^2 \le \delta^2$
as in the previous lecture. Since functions are bounded by 1, the set

\[ \widehat{\mathcal M} := \Big\{ g : \frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2 \Big\} \subseteq \{ \|g\|_n^2 \le \delta^2 \} \]

and hence the set in (21.204) is smaller. Thus the fixed point (21.204) is potentially smaller
than the one defined in the previous lecture.

21.1.1 Evaluating the new critical radius


Now, one can ask how to compute a suitable upper bound on the critical radius in (21.204)
for particular classes of interest. As in the earlier lectures, the strategy is to upper bound
the left-hand side of (21.204) in terms of some more tangible measures of complexity and
δ, and then balance with δ 2 /2.
In particular, we are interested in the case when $\mathcal G = \mathcal F^2$ (the same analysis works for
$(\mathcal F - \mathcal F)^2$ or $(\mathcal F - f^*)^2$) for some class $\mathcal F$ of $[-1, 1]$-valued functions. In this case, it is
tempting to proceed with the help of the contraction inequality and the upper bound

\[ \mathbb{E}_\varepsilon \sup_{g \in \mathcal F^2 \cap \widehat{\mathcal M}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i g(X_i) \le 2\,\mathbb{E}_\varepsilon \sup_{f \in \mathcal F:\ \|f\|_n^2 \le \delta^2} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \tag{21.208} \]

since the square is 2-Lipschitz on $[-1, 1]$. Balancing this with $\delta^2$ gives, up to constants, precisely
the critical radius of $\mathcal F$, as in Definition 21 (modulo the use of Gaussian vs Rademacher
random variables). Interestingly, one can significantly improve upon this argument and
show that the localization radius for $\mathcal F^2$ with the left-hand side of (21.208) can be smaller
than that of $\mathcal F$. In particular, a useful result is the following:

Lemma 37. For any class $\mathcal F = \{f : \mathcal X \to [-1, 1]\}$ of bounded functions, the critical
radius in (21.204) for the class $\mathcal G = \mathcal F^2$ can be upper bounded by a solution to

\[ \frac{12}{\sqrt{n}} \int_{\delta/16}^{1} \sqrt{\log N(\mathcal F, L^\infty(P_n), u/2)}\, du \le \delta/4. \tag{21.209} \]

Proof. We start by upper bounding the left-hand side of (21.204). Observe that functions in
$\mathcal G$ are nonnegative and bounded uniformly in $[0, 1]$. As discussed earlier, the restriction
$\frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2$ implies $\|g\|_n \le \delta$, and hence the left-hand side of (21.204) is upper
bounded by

\[ \inf_{\alpha} \Big\{ 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{\delta} \sqrt{\log N(\mathcal G \cap \widehat{\mathcal M}, L^2(P_n), \varepsilon)}\, d\varepsilon \Big\}. \tag{21.210} \]

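Let $V = \{\tilde f_1, \dots, \tilde f_N\}$ be a proper $L^\infty(P_n)$-cover of $\mathcal F \cap \{\|f\|_n \le \delta\}$ at scale $\tau \le \delta$ (proper
implies $\|\tilde f\|_n \le \delta$). Fix any $g = f^2 \in \mathcal G \cap \widehat{\mathcal M}$. Let $\tilde f$ be an element of $V$ that is $\tau$-close to
$f$. Then

\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \big(f(x_i)^2 - \tilde f(x_i)^2\big)^2
&= \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - \tilde f(x_i)\big)^2 \big(f(x_i) + \tilde f(x_i)\big)^2 \\
&\le \max_i \big(f(x_i) - \tilde f(x_i)\big)^2 \cdot \frac{1}{n}\sum_{i=1}^n \big(f(x_i) + \tilde f(x_i)\big)^2 \\
&\le \tau^2 \big(2\,\|f\|_n^2 + 2\,\|\tilde f\|_n^2\big) \\
&\le 4\tau^2\delta^2 =: \varepsilon^2.
\end{aligned}
\]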

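We conclude that

\[
\begin{aligned}
N(\mathcal G \cap \widehat{\mathcal M}, L^2(P_n), \varepsilon) &\le N(\mathcal F \cap \{\|f\|_n \le \delta\}, L^\infty(P_n), \varepsilon/(2\delta)) \\
&\le N(\mathcal F, L^\infty(P_n), \varepsilon/(2\delta)).
\end{aligned}
\]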

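Substituting into (21.210), the upper bound on the right-hand side becomes

\[
\begin{aligned}
&\inf_{\alpha \ge 0} \Big\{ 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{\delta} \sqrt{\log N(\mathcal F, L^\infty(P_n), \varepsilon/(2\delta))}\, d\varepsilon \Big\} \\
&\qquad \le \delta^2/4 + \delta \times \frac{12}{\sqrt{n}} \int_{\delta/16}^{1} \sqrt{\log N(\mathcal F, L^\infty(P_n), u/2)}\, du
\end{aligned}
\]

where we performed the change of variables $u = \varepsilon/\delta$ and chose $\alpha = \delta^2/16$. Using this in (21.204)
and balancing with $\delta^2/2$ yields (21.209).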

A key outcome of the above lemma is that the critical radius of $\mathcal G = \mathcal F^2$ (or $(\mathcal F - \mathcal F)^2$)
given by (21.204) is much smaller than that of $\mathcal F$. Note that whenever the Dudley integral
in (21.209) converges with $\delta = 0$, the solution is $\delta \propto n^{-1/2}$ (up to $\log n$ factors) and hence
the remainder in (21.207) is of the order $1/n$, a smaller order term as compared to the rate
of estimation for fixed design. The fact that the remainder term is of a lower order can
be shown more generally, for instance under polynomial growth of entropy, or in the
parametric cases. For instance, for

\[ N(\mathcal F, L^\infty(P_n), \varepsilon) \le \Big(\frac{cn}{\varepsilon}\Big)^{d}, \]

the localization radius of $\mathcal G = \mathcal F^2$ can be upper bounded as

\[ \bar\delta(\mathcal G) = C \sqrt{\frac{d}{n} \log\Big(\frac{cn}{d}\Big)} \]

and for a finite class we immediately have

\[ \bar\delta(\mathcal G) \le C \sqrt{\frac{\log |\mathcal F|}{n}}. \]

We can also prove a general and useful result, albeit with extra log factors (due to its
generality). Following [31], we have

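Lemma 38. For any class $\mathcal F = \{f : \mathcal X \to [-1, 1]\}$, the critical radius in (21.209) is at
most

\[ \bar\delta(\mathcal F^2) \le C \log^2 n \cdot \bar{\mathcal R}(\mathcal F), \]

where

\[ \bar{\mathcal R}(\mathcal F) = \sup_{x_1, \dots, x_n} \hat{\mathcal R}(\mathcal F). \]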

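Proof. Substitute the following estimate for $L^\infty$ covering numbers in terms of the scale-sensitive
dimension (see e.g. [28]):

\[ \log N(\mathcal F, L^\infty(P_n), \alpha) \le 2\,\mathrm{vc}(\mathcal F, c\alpha) \cdot \log n \cdot \log\Big(\frac{cn}{\mathrm{vc}(\mathcal F, c\alpha) \cdot \alpha}\Big) \tag{21.211} \]

and then use the following fact: for any $\alpha > \bar{\mathcal R}(\mathcal F)$,

\[ \mathrm{vc}(\mathcal F, \alpha) \le \frac{4n\,\bar{\mathcal R}(\mathcal F)^2}{\alpha^2}. \tag{21.212} \]

This last inequality can be written in the more familiar form

\[ \sup_{\alpha > \bar{\mathcal R}(\mathcal F)} \alpha \sqrt{\frac{\mathrm{vc}(\mathcal F, \alpha)}{4n}} \le \bar{\mathcal R}(\mathcal F), \tag{21.213} \]

which bears similarity to Sudakov’s minoration. This inequality is proved by taking the
$\alpha$-shattered set, replicating it $\lceil n/\mathrm{vc}(\mathcal F, \alpha) \rceil$ times, and using our previous argument about
Rademacher averages being large when there is a cube inside the set. We leave it as an
exercise.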
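Back to the estimate, we have

\[
\begin{aligned}
\frac{1}{\sqrt{n}} \int_{\delta/64}^{1/4} \sqrt{\log N(\mathcal F, L^\infty(P_n), \alpha)}\, d\alpha
&\lesssim \frac{\sqrt{\log n}}{\sqrt{n}} \int_{\delta/64}^{1/4} \sqrt{\mathrm{vc}(\mathcal F, c\alpha) \log\Big(\frac{cn}{\alpha}\Big)}\, d\alpha && (21.214) \\
&\lesssim \sqrt{\log n}\; \bar{\mathcal R}(\mathcal F) \int_{\delta/64}^{1/4} \frac{1}{\alpha} \sqrt{\log\Big(\frac{cn}{\alpha}\Big)}\, d\alpha. && (21.215)
\end{aligned}
\]

To finish the proof, choose $\delta = 64\,\bar{\mathcal R}(\mathcal F)$ and observe that

\[ \int_{\bar{\mathcal R}(\mathcal F)}^{1} \frac{1}{\alpha} \sqrt{\log\Big(\frac{cn}{\alpha}\Big)}\, d\alpha \lesssim \log^2\big(cn/\bar{\mathcal R}(\mathcal F)\big). \]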

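Hence, for $\mathcal G = \mathcal F^2$, ignoring logarithmic factors, the remainder satisfies $\bar\delta(\mathcal G)^2 \le \tilde O(n^{-1})$ when $\bar{\mathcal R}(\mathcal F) \lesssim n^{-1/2}$
and $\bar\delta(\mathcal G)^2 \le \tilde O(n^{-2/p})$ when $\bar{\mathcal R}(\mathcal F) \lesssim n^{-1/p}$, which is smaller than the rate of estimation for
least squares, ignoring logarithmic factors.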
We conclude that rates of estimation for fixed design translate into rates for estimation
with random design, at least for bounded functions. It is worth emphasizing that the extra
factors one gains from comparing $\|f - f^*\|^2_{L^2(P)}$ to $2\,\|f - f^*\|^2_{L^2(P_n)}$ are typically of smaller
order than what one gets from denoising for fixed design. The next section provides further
motivation for why this happens, and presents an approach that does not rely on uniform
boundedness of functions.

21.2 Beyond boundedness: the small-ball method
This approach was pioneered by [18] and then developed by Mendelson in a series of papers
starting with [24]. Importantly, this approach does not rely on uniform boundedness of
functions as in the application of Talagrand’s inequality.
Roughly speaking, the realization is that whenever the population norm kf kL2 (P ) is
large enough, it is highly unlikely that the random empirical norm kf kL2 (Pn ) can be smaller
than a fraction of the population norm. Moreover, conditions for such a statement to be
true are rather weak and do not require uniform boundedness.
We first recall the Paley-Zygmund inequality (1932), which states that for a nonnegative
random variable $Z$ with finite variance,

\[ \mathbb{P}(Z \ge t\,\mathbb{E}Z) \ge (1 - t)^2\, \frac{(\mathbb{E}Z)^2}{\mathbb{E}Z^2} \]

for any $0 \le t \le 1$.
Let us use the following shorthand. We will write $\|f\|_2 = \|f\|_{L^2(P)} = (\mathbb{E} f(X)^2)^{1/2}$ and
$\|f\|_4 = \|f\|_{L^4(P)} = (\mathbb{E} f(X)^4)^{1/4}$. Then, applying the inequality to $Z = f(X)^2$ at level $t^2$,

\[ \mathbb{P}(|f(X)| \ge t\,\|f\|_2) = \mathbb{P}\big(f(X)^2 \ge t^2\,\|f\|_2^2\big) \ge (1 - t^2)^2\, \frac{\|f\|_2^4}{\|f\|_4^4}. \]

Now, we make the assumption that for every $f \in \mathcal F$,

\[ \mathbb{E} f(X)^4 \le c\,(\mathbb{E} f(X)^2)^2 \]

for some $c$.
Under this $L^4$-$L^2$ norm comparison, it holds that

\[ \mathbb{P}(|f(X)| \ge t\,\|f\|_2) \ge (1 - t^2)^2 / c, \]

an “anti-concentration” inequality. More generally, the condition that there exist constants $c$ and $c'$
such that for all $f \in \mathcal F$,

\[ \mathbb{P}(|f(X)| \ge c\,\|f\|_2) \ge c' \tag{21.216} \]

is called the small-ball property.
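
As a quick numerical sanity check of the anti-concentration bound, the following sketch estimates both sides by Monte Carlo. It assumes a linear function f(x) = <v, x> and a standard Gaussian X (for which the L4-L2 constant is 3); these choices, the sample size, and all variable names are illustrative assumptions and not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 5, 200_000

v = rng.normal(size=d)                  # f(x) = <v, x>, an arbitrary linear function
X = rng.normal(size=(n_samples, d))     # assumption: X ~ N(0, I_d)
f = X @ v

l2 = np.sqrt(np.mean(f**2))             # Monte Carlo estimate of ||f||_2
l4 = np.mean(f**4) ** 0.25              # Monte Carlo estimate of ||f||_4
c = (l4 / l2) ** 4                      # L4-L2 constant; equals 3 for Gaussian f(X)

t = 0.5
small_ball_prob = np.mean(np.abs(f) >= t * l2)   # estimate of P(|f(X)| >= t ||f||_2)
paley_zygmund_bound = (1 - t**2) ** 2 / c        # lower bound (1 - t^2)^2 / c

print(small_ball_prob, paley_zygmund_bound)
```

The bound is far from tight in this Gaussian case, but it only uses a fourth-moment comparison and no boundedness.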

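Lemma 39. Assume (21.216). Let $\mathcal F$ be star-shaped around 0. Then with probability
at least $1 - e^{-c_1 n}$,

\[ \inf_{f \in \mathcal F:\ \|f\|_2 \ge \tilde\delta} \frac{\|f\|_n}{\|f\|_2} \ge c_2 \]

for some constants $c_1, c_2$, i.e. for all $f \in \mathcal F$,

\[ \|f\|_2^2 \lesssim \|f\|_n^2 + \tilde\delta^2, \]

where $\tilde\delta$ is a critical radius defined as the smallest $\delta$ such that

\[ \mathbb{E} \sup_{f \in \mathcal F,\ \|f\|_2 \le \delta} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \le c_3\,\delta. \tag{21.217} \]
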
Proof. Let’s see how we can compare the empirical and population norms, uniformly over
$\mathcal F$, given such a condition. First, let’s consider any function with norm $\|f\|_2 = 1$. Observe
that if we could show with high probability

\[ \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c_1\} \ge c_2 \tag{21.218} \]

for some constants $c_1, c_2$, we would be done, since such a lower bound implies a constant
lower bound $\frac{1}{n}\sum_{i=1}^n f(X_i)^2 \ge c\,\|f\|_2^2 = c$. By rescaling and assuming star-shapedness,
we would extend the result to all functions in $\mathcal F$ (above some critical level for which we can
prove (21.218)).
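For a given $c > 0$, we have

\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c\}
&= \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \Big( \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c\} \Big) \\
&\ge \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \Big( \mathbb{E}\varphi(|f(X)|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big) \\
&\ge \inf_{f \in \mathcal F} \mathbb{P}(|f(X)| \ge 2c\,\|f\|_2) - \sup_{f \in \mathcal F,\ \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big),
\end{aligned}
\]

where $\varphi(u) = 0$ on $(-\infty, c]$, $\varphi(u) = u/c - 1$ on $[c, 2c]$, and $\varphi(u) = 1$ on $[2c, \infty)$, so that
$\mathbf{1}\{u \ge 2c\} \le \varphi(u) \le \mathbf{1}\{u \ge c\}$.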

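Now, using concentration (since $\varphi(|f|)$ are in $[0, 1]$), the random supremum

\[ \sup_{f \in \mathcal F,\ \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big) \]

can be upper bounded with probability at least $1 - e^{-2u^2}$ by its expectation plus a deviation term,

\[ \mathbb{E} \sup_{f \in \mathcal F,\ \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big) + \frac{u}{\sqrt{n}}, \]

which, in turn, can be upper bounded via symmetrization and the contraction inequality (since
$\varphi$ is $1/c$-Lipschitz) by

\[ \frac{4}{c}\, \mathbb{E} \sup_{f \in \mathcal F,\ \|f\|_2 = 1} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) + \frac{u}{\sqrt{n}}. \]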

By choosing $u = \sqrt{n} \cdot c''$, we can make the additive term $u/\sqrt{n}$ an arbitrarily small constant $c''$.
Now, we see that (21.218) will hold with a non-zero constant $c_2$ as long as

\[ \mathbb{E} \sup_{f \in \mathcal F,\ \|f\|_2 = 1} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \le c'' \]

for an appropriately small constant $c''$.

We now need to extend this control to all $\|f\|_2$
above some critical radius. Assuming that $\mathcal F$ is star-shaped around 0, the control extends
to all $f$ such that $\|f\|_2 \ge \tilde\delta$.
Observe that $\tilde\delta$ can be significantly smaller than if (21.217) were defined with $\delta^2$ on the
right-hand side, as before.

21.3 Example: Random Projections and Johnson-Lindenstrauss lemma
The development here can be seen as a nonlinear generalization of the random projection
method and the Johnson–Lindenstrauss lemma. Let $\Gamma \in \mathbb{R}^{n \times d}$ be an appropriately scaled
random matrix. We then prove that for any fixed $v \in \mathbb{R}^d$, with high probability

\[ (1 - \varepsilon)^2\, \|v\|_2^2 \le \|\Gamma v\|_2^2 \le (1 + \varepsilon)^2\, \|v\|_2^2. \]

Of particular interest in applications is the lower side of this inequality:

\[ \frac{\|\Gamma v\|_2^2}{\|v\|_2^2} \ge 1 - \alpha \]

where $\alpha \in (0, 1)$. A corresponding uniform statement over a set $V \subset \mathbb{R}^d$ asks that with
high probability,

\[ \inf_{v \in V} \frac{\|\Gamma v\|_2^2}{\|v\|_2^2} \ge 1 - \alpha. \]
Statements of this form are very useful in statistics, signal processing, etc. The lower
isometry says that the energy of the signal is preserved under random measurement. Or,
the null space of the random matrix Γ is likely to miss (in a quantitative way) the set V . Of
course, if V is too large, it’s not possible to miss it, and so complexity of V (as quantified
by the measures we have studied) enters the picture.
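The connection to today’s lecture can be seen by taking

\[ \Gamma = \frac{1}{\sqrt{n}} \begin{pmatrix} - X_1 - \\ \vdots \\ - X_n - \end{pmatrix} \]

with $X_1, \dots, X_n$ i.i.d. from an isotropic distribution. Then

\[ \|\Gamma v\|_2^2 = \frac{1}{n}\sum_{i=1}^n \langle v, X_i \rangle^2 \]

while $\|v\|_2^2 = \mathbb{E}_X \langle v, X \rangle^2$. Each $v \in V$ then corresponds to $f \in \mathcal F$ in our earlier notation.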
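
The correspondence between $\|\Gamma v\|_2^2$ and the empirical norm can be checked numerically. The sketch below assumes Gaussian rows (one example of an isotropic distribution) and a finite set V of 100 random directions; the sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50

X = rng.normal(size=(n, d))        # rows X_i ~ N(0, I_d): an isotropic distribution
Gamma = X / np.sqrt(n)             # the scaled random matrix

V = rng.normal(size=(100, d))      # a finite set V of 100 directions in R^d
# Row k of V @ Gamma.T is Gamma v_k, so each ratio is ||Gamma v||_2^2 / ||v||_2^2.
ratios = np.sum((V @ Gamma.T) ** 2, axis=1) / np.sum(V**2, axis=1)

worst_lower, worst_upper = ratios.min(), ratios.max()
print(worst_lower, worst_upper)
```

Both extremes stay close to 1 for this small set; for larger or richer sets V the complexity of V governs how large n must be.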

21.4 Example: Interpolation

Suppose we observe noiseless values yi = f ∗ (Xi ) at i.i.d. locations X1 , . . . , Xn . Let fb be


an ERM with respect to square loss over F and assume f ∗ ∈ F. Clearly, fb achieves zero
error, and the question is what the expected deviation from f ∗ is. This is a question of a
“version space size” – what is the L2 (P ) diameter of the random subset of F that matches
f ∗ on a set of data points. More precisely, define the interpolation set

\[ I_{X_1, \dots, X_n} = \{f \in \mathcal F : f(X_i) = f^*(X_i)\}, \]

a random subset of the class $\mathcal F$, and its diameter as

\[ \mathrm{diam}_2(I_{X_1, \dots, X_n}) = \sup_{f, f' \in I_{X_1, \dots, X_n}} \|f - f'\|_{L^2(P)}. \]

Of course, from the earlier calculations, we have that with high probability

\[ \|f - f'\|^2_{L^2(P)} \lesssim \bar\delta^2 \]

where $\bar\delta$ is the localization radius for $(\mathcal F - \mathcal F)^2$, and $\bar\delta^2$ can be upper bounded by $\sup_{x_{1:n}} \hat{\mathcal R}(\mathcal F)^2$,
up to polylog factors. Alternatively, we can use the fixed point $\tilde\delta^2$ under the small-ball
property.

22. LARGE MARGIN THEORY

We now switch gears and discuss the problem of classification with margin. Recall that
for a class of binary functions G = {g : X → {±1}}, we established learning and uniform
convergence results in terms of the ratio vc(G)/n. Yet, the VC dimension can be easily larger
than the sample size for neural networks (where it is related to the number of parameters)
and high-dimensional linear separators (e.g. kernels). Perhaps more importantly, we do not
usually work with a class G directly but rather with a real-valued class F = {f : X → R},
with sign of the function determining the class label. That is, in applications of interest,
we work with sign(F) = {sign(f ) : f ∈ F }. It is important to realize that nearly-constant
f can generate very complex sign(f ), a situation we would like to avoid. Large-margin
approach below allows us to replace complexity of sign(F) with that of F itself, as a class
of real-valued functions.
The proof below utilizes the same technique as that in Lemma 39 (not surprisingly,
when we look at the authors of [19] and [18]).
Let F be a class of R-valued functions. Consider a classification problem with binary
Y ∈ {±1}. Fix γ > 0 as a margin parameter.
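Define $\varphi : \mathbb{R} \to \mathbb{R}$ as

\[ \varphi(s) = \begin{cases} 1 & \text{if } s \le 0 \\ 1 - s/\gamma & \text{if } 0 < s < \gamma \\ 0 & \text{if } s \ge \gamma \end{cases} \]

Then

\[ \mathbf{1}\{y f(x) \le 0\} \le \varphi(y f(x)) \le \mathbf{1}\{y f(x) \le \gamma\}. \]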
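
The sandwich property and the $1/\gamma$-Lipschitz constant of the ramp function can be verified on a grid. This is a minimal sketch; the function name `ramp`, the value of gamma, and the grid are illustrative assumptions.

```python
import numpy as np

def ramp(s, gamma):
    """Ramp surrogate: 1 for s <= 0, 1 - s/gamma on (0, gamma), 0 for s >= gamma."""
    return np.clip(1.0 - s / gamma, 0.0, 1.0)

gamma = 0.5
margins = np.linspace(-2.0, 2.0, 401)            # candidate values of y * f(x)

zero_one = (margins <= 0).astype(float)          # 1{y f(x) <= 0}
margin_loss = (margins <= gamma).astype(float)   # 1{y f(x) <= gamma}
phi = ramp(margins, gamma)

# Largest slope of phi on the grid; it should not exceed 1/gamma.
max_slope = np.max(np.abs(np.diff(phi)) / np.diff(margins))
```

The Lipschitz constant is what enters the contraction step in the argument that follows, which is where the factor $1/\gamma$ comes from.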
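Hence, with probability at least $1 - e^{-2u^2}$, for any $f \in \mathcal F$,

\[
\begin{aligned}
\mathbb{E}\mathbf{1}\{Y f(X) \le 0\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\}
&\le \sup_{f \in \mathcal F} \Big( \mathbb{E}\varphi(Y f(X)) - \frac{1}{n}\sum_{i=1}^n \varphi(Y_i f(X_i)) \Big) \\
&\le \mathbb{E} \sup_{f \in \mathcal F} \Big( \mathbb{E}\varphi(Y f(X)) - \frac{1}{n}\sum_{i=1}^n \varphi(Y_i f(X_i)) \Big) + \frac{u}{\sqrt{n}}
\end{aligned}
\]

since $\varphi$ is in $[0, 1]$. By symmetrization, the above expectation is at most

\[ 2\,\mathbb{E} \sup_{f \in \mathcal F} \frac{1}{n}\sum_{i=1}^n \varepsilon_i \varphi(Y_i f(X_i)) \le \frac{2}{\gamma}\, \mathbb{E} \sup_{f \in \mathcal F} \frac{1}{n}\sum_{i=1}^n \varepsilon_i Y_i f(X_i) = \frac{2}{\gamma}\, \mathbb{E} \sup_{f \in \mathcal F} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) = \frac{2}{\gamma}\, \mathcal R(\mathcal F). \]

Hence, with probability at least $1 - e^{-2u^2}$, for any $f \in \mathcal F$,

\[ \mathbb{E}\mathbf{1}\{Y f(X) \le 0\} \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\} + \frac{2}{\gamma}\, \mathcal R(\mathcal F) + \frac{u}{\sqrt{n}}. \]
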
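By a union bound over a discretization of $(0, B]$, we can prove the following result [19, Thm
2]:

Theorem 12. Let $\mathcal F = \{f : \mathcal X \to [-1, 1]\}$. For all $u > 0$, with probability at least
$1 - 2e^{-2u^2}$, for all $f \in \mathcal F$, $\gamma \in (0, 1]$,

\[ \mathbb{E}\mathbf{1}\{Y f(X) \le 0\} \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\} + \frac{8}{\gamma}\, \mathcal R(\mathcal F) + \sqrt{\frac{\log\log(2/\gamma)}{n}} + \frac{u}{\sqrt{n}}. \tag{22.219} \]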

The key message of this theorem is that the upper bound on the expected error is in terms
of the complexity of F as a class of real-valued functions, rather than the complexity of sign(F).
The price for this is the margin parameter γ, which sets the resolution at which we view
predictions as being incorrect (or not confident enough). Given that we would like to have a
small left-hand side (for some estimator), the above bound suggests maximizing the margin
(i.e. minimizing the number of margin mistakes) while balancing this goal with the complexity
of the class. Methods such as support vector machines or boosting can be seen as directly
or indirectly having this goal.
Finally, suppose we apply the above theorem to some class $\mathcal F_B = \{f_\theta : \mathrm{compl}(\theta) \le B\}$,
where $\mathrm{compl}(\theta)$ is some notion of complexity of the parameter. Suppose $\mathcal F_B \subseteq \mathcal F_{B'}$ for
$B \le B'$ and that $\sup_{f \in \mathcal F_B} \|f\|_\infty \le \psi(B)$. In other words, we allow the function range to
increase (linearly or otherwise) with increasing $B$. We can then apply a union bound to
obtain a statement for any $f \in \cup_{B > 0} \mathcal F_B$ in terms of the complexity of $f$, defined as the
smallest radius $B$ such that $f \in \mathcal F_B$. We leave this as an exercise.

22.1 Linear example and comparison to perceptron


As an example, consider the class of linear functions

\[ \mathcal F = \{x \mapsto \langle x, w \rangle : w \in B_2^d\} \]

and $X \in B_2^d$. We saw earlier that

\[ \mathcal R(\mathcal F) \le \frac{1}{\sqrt{n}} \]

(recall that here we normalized Rademacher averages by $1/n$). Thus, one can derive an
upper bound on classification out-of-sample performance that does not depend on the
dimensionality of the space, despite the fact that the VC dimension of the set of hyperplanes
in $\mathbb{R}^d$ is $d$ and covering numbers of $\mathrm{sign}(\mathcal F)$ necessarily grow with $d$. Similarly, one can prove
margin bounds for neural networks in terms of norms of the weight matrices and without
any dependence on the number of neurons.
Observe that the rate of $\frac{1}{\gamma\sqrt{n}}$ is the “slow rate analogue” of the $\frac{1}{n\gamma^2}$ rate we can prove
under the assumption that the distribution of the data has a hard margin $\gamma$. We show this
argument in Section 24.1.
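
The dimension-free bound can be illustrated by Monte Carlo. For the unit-ball linear class the supremum over w has a closed form, namely the Euclidean norm of the signed average of the data points. The sketch below assumes Gaussian data rescaled into the unit ball and compares two dimensions; sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_draws = 200, 2000
bound = 1.0 / np.sqrt(n)

estimates = {}
for d in (5, 500):
    X = rng.normal(size=(n, d))
    # Rescale any point with norm above 1 so that every x_i lies in the unit ball.
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher signs
    # sup over ||w||_2 <= 1 of (1/n) sum_i eps_i <w, x_i> equals ||(1/n) sum_i eps_i x_i||_2.
    sup_values = np.linalg.norm(eps @ X / n, axis=1)
    estimates[d] = sup_values.mean()                   # Monte Carlo estimate of the Rademacher average

print(estimates, bound)
```

Up to Monte Carlo error, both estimates sit at or below $1/\sqrt{n}$ even though the dimension changes by a factor of 100.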

23. COMPLEXITIES OF NEURAL NETWORKS

Neural networks are a class of functions built in a hierarchical manner. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be
a fixed 1-Lipschitz function. Given parameters $\theta = (W_1, \dots, W_L)$, we define

\[ f_\theta(x) = \sigma(W_L\, \sigma(W_{L-1} \cdots \sigma(W_1 x) \cdots )), \tag{23.220} \]

where $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, $d_0 = d$, and $\sigma$ is applied coordinate-wise.
In our setting, the architecture of a neural network corresponds to the choices of input
and intermediate dimensions. For the fixed architecture, the set of neural networks we
consider is
F = {fθ : compl(θ) ≤ B}
where compl(θ) is some notion of complexity of the weight matrices. That is, just as in
the case of a class of linear functions Flin = {x 7→ hw, xi : kwk ≤ B}, we would like to
define a “ball” in the space of neural networks.
Note that many tuples (W1 , . . . , WL ) lead to the same function fθ . For example, take
ReLU activation, scale one layer up by 100, another down by 100. The function does not
change under this transformation. There are many transformations that leave the function
intact, and we would like to make sure compl does not assign different values of complexity
to different sets of parameters if they lead to same function.
As an example, take the sum of Frobenius norms of all the layers:

\[ \mathrm{compl}(\theta) = \sum_{j=1}^{L} \|W_j\|_F \]

since this is a natural “generalization” of the corresponding Euclidean norm for $\mathcal F_{\mathrm{lin}}$.
Unfortunately, this measure does not capture the scaling invariance of the layers. However, a
product of Frobenius norms would reflect the invariance (though it may not reflect many
other invariances):

\[ \mathrm{compl}(\theta) = \prod_{j=1}^{L} \|W_j\|_F \]

Of course, it is not at all clear that the Rademacher averages of a unit ball defined with
respect to this complexity are non-vacuous. Remember that we relied heavily on linearity of
functions to analyze $\hat{\mathcal R}(\mathcal F_{\mathrm{lin}})$.
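
The rescaling invariance, and how the two complexity measures react to it, can be seen in a small numerical sketch. It assumes a three-layer ReLU network of the form (23.220) with arbitrary random weights; the layer sizes and names are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(weights, x):
    """f_theta(x) = sigma(W_L sigma(... sigma(W_1 x) ...)) with ReLU sigma, as in (23.220)."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

def frobenius_norms(weights):
    return [np.linalg.norm(W, "fro") for W in weights]

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))
x = rng.normal(size=4)

original = [W1, W2, W3]
rescaled = [100.0 * W1, W2 / 100.0, W3]    # scale one layer up by 100, another down by 100

out_original, out_rescaled = network(original, x), network(rescaled, x)
sum_before, sum_after = sum(frobenius_norms(original)), sum(frobenius_norms(rescaled))
prod_before, prod_after = np.prod(frobenius_norms(original)), np.prod(frobenius_norms(rescaled))
```

The two parameter tuples define the same function, the sum of Frobenius norms changes, and the product does not.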

23.1 Short primer on matrix norms


Before we start, we briefly describe some other norms of a $d_1 \times d_2$ matrix $A$. We have
already seen the operator norm (or, spectral norm, or 2-norm) of a matrix $A$:

\[ \|A\| = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^* A)} \]

which can also be written as

\[ \|A\| = \max_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \]

General Schatten norms are

\[ \|A\|_p = \Big( \sum_{i=1}^{\min(d_1, d_2)} \sigma_i^p \Big)^{1/p}, \]

and the $p = 2$ case coincides with the Frobenius norm. The $p = 1$ case is termed nuclear
norm, or trace norm, or Ky Fan norm:

\[ \|A\|_{\mathrm{nuc}} = \sum_{i=1}^{\min(d_1, d_2)} \sigma_i = \mathrm{trace}\big(\sqrt{A^* A}\big). \]

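Next, we describe entry-wise norms. We start with the sum of $\ell_2$ norms of columns:

\[ \|A\|_{2,1} = \sum_{j=1}^{d_2} \|A_{\cdot, j}\| = \sum_{j=1}^{d_2} \Big( \sum_{i=1}^{d_1} A_{i,j}^2 \Big)^{1/2} \]

whereas the maximum $\ell_2$ norm of columns is

\[ \|A\|_{2,\infty} = \max_{j = 1, \dots, d_2} \|A_{\cdot, j}\|. \]

For general $p, q \ge 1$,

\[ \|A\|_{p,q} = \Big( \sum_{j=1}^{d_2} \Big( \sum_{i=1}^{d_1} |A_{i,j}|^p \Big)^{q/p} \Big)^{1/q}. \]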
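
The norms in this primer can be computed directly from singular values and column norms. The following is a minimal NumPy sketch on an arbitrary random matrix; the helper name `entrywise_norm` is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))                     # d1 = 4 rows, d2 = 6 columns
sigma = np.linalg.svd(A, compute_uv=False)      # singular values of A

operator = sigma.max()                          # operator (spectral) norm
frobenius = np.sqrt(np.sum(sigma**2))           # Schatten norm with p = 2
nuclear = sigma.sum()                           # Schatten norm with p = 1

def entrywise_norm(A, p, q):
    """||A||_{p,q}: the l_p norm of each column, followed by the l_q norm across columns."""
    column_norms = np.linalg.norm(A, ord=p, axis=0)
    return np.linalg.norm(column_norms, ord=q)

norm_2_1 = entrywise_norm(A, 2, 1)              # sum of column l_2 norms
norm_2_inf = entrywise_norm(A, 2, np.inf)       # maximum column l_2 norm
```

For a cross-check, `np.linalg.norm(A, 2)`, `np.linalg.norm(A, "fro")`, and `np.linalg.norm(A, "nuc")` return the operator, Frobenius, and nuclear norms respectively.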

23.2 Neural networks with bounded (1, ∞) and Frobenius norms


Let us generalize the definition in (23.220) and write it down recursively as follows. Take a
base class

\[ \mathcal F_1 = \{f : \mathcal X \to \mathbb{R}\}, \qquad \mathcal X \subset \mathbb{R}^d, \]

and assume (in order to simplify the proof) that $0 \in \mathcal F_1$. We now define

\[ \mathcal F_i = \Big\{ x \mapsto \sum_{j=1}^{d_{i-1}} w_j\, \sigma(f_j(x)) : f_j \in \mathcal F_{i-1},\ \|w\|_1 \le B_i \Big\}. \tag{23.221} \]

The following was proved in [2]:

Lemma 40. Let $\mathcal F_i$ be defined recursively as in (23.221), with a base function class $\mathcal F_1$
that contains the zero function. Assume $\sigma$ is 1-Lipschitz and $\sigma(0) = 0$. Then

\[ \hat{\mathcal R}(\mathcal F_i) \le 2 B_i\, \hat{\mathcal R}(\mathcal F_{i-1}). \]

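Proof.

\[ \hat{\mathcal R}(\mathcal F_i) = \mathbb{E} \sup_{\substack{\|w\|_1 \le B_i \\ f_j \in \mathcal F_{i-1}}} \sum_{t=1}^n \varepsilon_t \Big( \sum_j w_j\, \sigma(f_j(x_t)) \Big) = \mathbb{E} \sup_{\substack{\|w\|_1 \le B_i \\ f_j \in \mathcal F_{i-1}}} \sum_j w_j \sum_{t=1}^n \varepsilon_t\, \sigma(f_j(x_t)) \]

which is upper bounded via Hölder’s inequality by

\[ \mathbb{E} \sup_{\substack{w : \|w\|_1 \le B_i \\ f_j \in \mathcal F_{i-1}}} \|w\|_1 \max_j \Big| \sum_{t=1}^n \varepsilon_t\, \sigma(f_j(x_t)) \Big| \le B_i\, \mathbb{E} \sup_{f \in \mathcal F_{i-1}} \Big| \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)) \Big|. \]

Next, we remove the absolute values and pay a factor of 2:

\[
\begin{aligned}
\sup_{f \in \mathcal F_{i-1}} \Big| \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)) \Big|
&= \sup_{f \in \mathcal F_{i-1}} \max\Big\{ \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)),\ -\sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)) \Big\} \\
&\le \max\Big\{ \sup_{f \in \mathcal F_{i-1}} \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)),\ \sup_{f \in \mathcal F_{i-1}} \sum_{t=1}^n -\varepsilon_t\, \sigma(f(x_t)) \Big\}.
\end{aligned}
\]

Since $0 \in \mathcal F_1$ and $\sigma(0) = 0$, it also holds that $0 \in \mathcal F_{i-1}$. Hence both terms in the above max
are nonnegative and we can further upper bound the maximum by the sum of the two terms,
which are equal in expectation:

\[ \mathbb{E} \sup_{f \in \mathcal F_{i-1}} \Big| \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)) \Big| \le 2\,\mathbb{E} \sup_{f \in \mathcal F_{i-1}} \sum_{t=1}^n \varepsilon_t\, \sigma(f(x_t)) \le 2\,\hat{\mathcal R}(\mathcal F_{i-1}), \]

where the last step uses the contraction inequality for the 1-Lipschitz $\sigma$.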

Observe that the restriction that the norm of incoming weights for every neuron is
bounded as $\|w\|_1 \le B_i$ is equivalent to constraining the rows of $W_i$, which can be written
as

\[ \|W_i^T\|_{1,\infty} \le B_i. \]

Hence, we have the following corollary:

Corollary 5. Under the assumptions of Lemma 40, the Rademacher averages of $\mathcal F_L$
with weight matrices satisfying $\|W_i^T\|_{1,\infty} \le B_i$ obey

\[ \hat{\mathcal R}(\mathcal F_L) \lesssim 2^L \prod_{i=1}^{L} B_i \cdot \sqrt{\frac{\log d}{n}}, \]

where we also assumed $\mathcal F_1 = \{x \mapsto \langle w, x \rangle : \|w\|_1 \le B_1\}$ and $\mathcal X \subseteq B_\infty^d$.

It is easy to see that, in general, the factor $2^L$ is superfluous in the above bound. Indeed,
consider a thin neural network $f(x) = w_L\, \sigma(\dots \sigma(w_1 x) \dots)$ with $w_1 \in \mathbb{R}^{1 \times d}$ and all $w_j \in \mathbb{R}_{\ge 0}$,
$j > 1$, nonnegative numbers. Take $\sigma$ to be ReLU. Then by positive homogeneity of
ReLU,

\[ f(x) = \prod_{j > 1} w_j \cdot \sigma(\langle w_1, x \rangle). \]

Clearly, in this trivial case there is no exponential dependence on depth.


We mention a result that nearly removes dependence on the depth [12, Theorem 1]:

Theorem 13. Let $\sigma$ be 1-homogeneous (that is, $\sigma(\alpha x) = \alpha\sigma(x)$ for all $x \in \mathbb{R}$, $\alpha \ge 0$).
Suppose $\mathcal F$ is a class of functions of the form (23.220) with $\|W_i\|_F \le B_i$. Then

\[ \mathcal R(\mathcal F) \lesssim \sqrt{L}\, \prod_{i=1}^{L} B_i \cdot \frac{1}{\sqrt{n}}. \]

Under additional mild assumptions, [12, Corollary 1] also shows a depth-independent upper
bound of order (up to log factors)

\[ \prod_{i=1}^{L} B_i \cdot \frac{1}{n^{1/4}}. \]

24. BEYOND UNIFORM CONVERGENCE?

24.1 Perceptron
Recall Perceptron and its mistake bound in Lemma 28. Perceptron is an online method
that, given the next xt , predicts the label ybt and corrects the hyperplane only in case of a
mistake. Given any sequence, the number of mistakes is at most γ −2 , where γ is the margin
of the sequence.
Suppose $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. from a distribution on $\mathcal X \times \{\pm 1\}$, and suppose
$\mathcal X \subseteq B_2^d$. Consider the following procedure. Cycle through the data multiple times
until there is a pass with no more mistakes. The length $T$ of the resulting sequence ($T$ is a
multiple of $n$) is at most $n\gamma^{-2}$, corresponding to the case of one mistake per pass. Let $w_T$
be the final hyperplane output by this procedure. Clearly, it separates the data perfectly,
i.e. $\hat L_{01}(w_T) = 0$ where

\[ \hat L_{01}(w) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i \langle w, X_i \rangle \le 0\}. \]

Therefore, the function fb(x) = sign(hwT , xi) is a particular ERM solution (one of many).
Can we say anything about future performance of wT on data from the same distribution?

Lemma 41. Let $w_T$ be the output of Perceptron after no mistakes are made in a pass
over the i.i.d. data $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathcal X \times \{\pm 1\}$, and let $\mathcal X \subseteq B_2^d$. Let $\gamma$ be a
(random) margin of $n + 1$ data points drawn i.i.d. from the distribution. Then

\[ L_{01}(w_T) = \mathbb{E}\mathbf{1}\{Y \langle w_T, X \rangle \le 0\} \le \frac{1}{n+1} \times \mathbb{E}[\gamma^{-2}]. \]

Proof. Let us use the notation $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, $Z_i = (X_i, Y_i)$, $Z = (X, Y)$ and
$\ell(w, Z) = \mathbf{1}\{Y \langle w, X \rangle \le 0\}$. First,

\[ \mathbb{E}_S \mathbb{E}_Z\, \ell(w_T, Z) = \mathbb{E}_{S, Z_{n+1}} \Big[ \frac{1}{n+1} \sum_{t=1}^{n+1} \ell(w^{(-t)}, Z_t) \Big] \tag{24.222} \]

where $w^{(-t)}$ is Perceptron’s final hyperplane after (hypothetically) cycling through data
$Z_1, \dots, Z_{t-1}, Z_{t+1}, \dots, Z_{n+1}$. That is, leave-one-out is an unbiased estimate of the expected loss.
Now consider cycling Perceptron on $Z_1, \dots, Z_{n+1}$ until no more errors, and call the
output $\bar w$. Let $i_1, \dots, i_m$ be the indices on which Perceptron errs in any of the cycles. We
know $m \le \gamma^{-2}$. However, if index $t \notin \{i_1, \dots, i_m\}$, then whether or not $Z_t$ was included
in the computation of $\bar w$ does not matter, and so $\bar w = w^{(-t)}$. Furthermore, $Z_t$ is correctly
classified by $w^{(-t)}$. Thus, at most $\gamma^{-2}$ terms in (24.222) can be nonzero.
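
The cycling procedure and the mistake bound used in the proof can be sketched as follows. The synthetic data, the separator `w_star`, the enforced margin of 0.05, and the function name `cycle_perceptron` are illustrative assumptions. Since `gamma` below is the margin with respect to one particular unit vector, it lower-bounds the best margin of the sample, so the bound $m \le \gamma^{-2}$ still applies.

```python
import numpy as np

def cycle_perceptron(X, y, max_passes=10_000):
    """Cycle through the data until a full pass makes no mistakes; return (w, number of mistakes)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:       # mistake: correct the hyperplane
                w = w + y_t * x_t
                mistakes += 1
                clean_pass = False
        if clean_pass:
            return w, mistakes
    raise RuntimeError("no clean pass; the data may not be separable")

rng = np.random.default_rng(0)
d, n = 5, 200
w_star = np.zeros(d)
w_star[0] = 1.0                                        # hypothetical unit-norm separator

X = rng.uniform(-1, 1, size=(n, d)) / np.sqrt(d)       # every point has norm at most 1
X = X[np.abs(X @ w_star) >= 0.05]                      # enforce a hard margin around w_star
y = np.sign(X @ w_star)

gamma = np.min(y * (X @ w_star))                       # margin of the sample with respect to w_star
w_T, mistakes = cycle_perceptron(X, y)
print(mistakes, 1.0 / gamma**2)
```

The indices on which updates occur are exactly the $i_1, \dots, i_m$ of the proof; removing any other point from the sample leaves the output unchanged.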

If we assume hard margin in the distribution (otherwise, expected $\gamma^{-2}$ will be infinite), then the
Bayes error is $L_{01}(f^*) = 0$. Such an assumption on $P$ is not about its parametric or nonparametric
form, but rather on what happens at the boundary. As in Section 3.4, here we beat
the CLT rate of $1/\sqrt{n}$.
More importantly, we improved upon the $\frac{1}{\gamma\sqrt{n}}$ rate in Section 22.1 (which was an application
of Theorem 12) by a square factor! Recall that Theorem 12 was proved with
relatively heavy machinery of uniform convergence, while here we used a trivial argument
to obtain a better result. This observation motivates two questions: (1) is there a version
of Theorem 12 that achieves the correct rate? and (2) does the Perceptron-based argument
magically avoid uniform convergence altogether? The answer to the first question is yes,
and it involves developing an L∗ -style bound, beyond the scope of this course. But the
answer to the second question is more subtle, and should be morally taken as a ‘no’. The
mechanism employed in the proof of the above lemma is a version of the so-called online-to-
batch conversion, where one first proves an online mistake or regret bound for an arbitrary
sequence and then uses the i.i.d. nature of the sequence to derive a result on expected
loss. However, the very fact that one can show an online mistake bound or an online regret
bound for an arbitrary sequence implies a stronger version of uniform convergence – uniform
convergence for martingales. We will describe this in detail in the last lecture.

25. BIAS-VARIANCE DECOMPOSITION

For a large part of the course, we studied risk bounds for fb defined implicitly as an empirical
minimizer over some class of functions. Yet, in certain situations, an estimator of interest is
defined explicitly. This was the case, for instance, with linear unconstrained regression. In
that case, however, we opted for not using the closed-form solution since such an approach
would not be generalizable to nonlinear cases (or even to linear constrained regression). Here
we describe a classical approach that is convenient for analyzing closed-form estimators.
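We consider random design regression. To this end, let $P$ be the law of $(X, Y)$, and
$f^*(x) = \mathbb{E}[Y | X = x]$ be the regression function. We write $Y_i = f^*(X_i) + \xi_i$ for zero-mean $\xi_i$.
Let $\mathbf{X} = [X_1, \dots, X_n]^T \in \mathbb{R}^{n \times d}$ and $\mathbf{y} = [Y_1, \dots, Y_n]^T$. Given an estimator $\hat f(\cdot) = \hat f(\cdot; \mathbf{X}, \mathbf{y})$,
define

\[ B^2 = \mathbb{E}_X \big( f^*(X) - \mathbb{E}_{\mathbf{y}} \hat f(X) \big)^2, \qquad V = \mathbb{E}_{X, \mathbf{y}} \big( \hat f(X) - \mathbb{E}_{\mathbf{y}} \hat f(X) \big)^2. \tag{25.223} \]

Both $B^2$ and $V$ are random variables (in $\mathbf{X}$), and it is easy to check that

\[ \mathbb{E}\,\|\hat f - f^*\|^2_{L^2(P)} = \mathbb{E}_{\mathbf{X}} \big[ B^2 \big] + \mathbb{E}_{\mathbf{X}} [V]. \tag{25.224} \]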

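Consider estimators that are linear in $\mathbf{y}$:

\[ \hat f(x) = \sum_{i=1}^n Y_i\, \omega_i(x). \tag{25.225} \]

We then have

\[ B^2 = \mathbb{E}_X \Big( f^*(X) - \sum_{i=1}^n f^*(X_i)\, \omega_i(X) \Big)^2 \tag{25.226} \]

and

\[ V = \mathbb{E}_{X, \xi} \Big( \sum_{i=1}^n \xi_i\, \omega_i(X) \Big)^2 \le \sigma_\xi^2 \sum_{i=1}^n \mathbb{E}_X (\omega_i(X))^2. \tag{25.227} \]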

The form of (25.226) and (25.227) is particularly useful for analyzing “local methods”.
Indeed, let’s think of ωi (x) as the “relevance” of example (Xi , Yi ) to the given point x. In
this case, fb(x) in (25.225) aggregates the responses Yi according to these weights. The bias
term then asks whether the problem is easy if there is no noise ξi , and (25.226) has the
interpretation of the expected difference between the value of the true regression function at

X and its “reconstruction” from datapoints, assuming no noise. Smoothness of f ∗ helps to
upper bound this term. The variance term increases with the noise level σξ2 and the sum of
L2 (P )-norms of the weight functions. If these weight functions ωi are sufficiently localized
around Xi , one can often compute simple upper bounds on the variance term.

25.1 Example: Local Smoothing


In local smoothing,

\[ \omega_i(x) = \frac{K((x - x_i)/h)}{\sum_{j=1}^n K((x - x_j)/h)} \tag{25.228} \]

where $K : \mathbb{R}^d \to [0, \infty)$ is a kernel function and $h > 0$ is a bandwidth parameter.

Example: $K(u) = \exp\{-\|u\|\}$ or its truncated version $K(u) = \exp\{-\|u\|\}\, \mathbf{1}\{\|u\|_2 \le 1\}$.
Another example is $K(u) = \|u\|^{-a}\, \mathbf{1}\{\|u\|_2 \le 1\}$ for $0 < a < d/2$, which is singular at 0 and
leads to an interpolant of the data.
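
A minimal sketch of local smoothing with the first kernel above, written as a linear estimator of the form (25.225). The regression function, noise level, bandwidth, and function names are illustrative assumptions.

```python
import numpy as np

def exponential_kernel(u):
    """K(u) = exp(-||u||), applied row-wise."""
    return np.exp(-np.linalg.norm(u, axis=-1))

def smoothing_weights(x, X_train, h, kernel=exponential_kernel):
    """Weights omega_i(x) of (25.228) for a single query point x."""
    k = kernel((x - X_train) / h)
    return k / k.sum()

rng = np.random.default_rng(0)
n, d, h = 200, 1, 0.1
X_train = rng.uniform(-1, 1, size=(n, d))
Y = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=n)   # hypothetical f*(x) = sin(3x) plus noise

x_query = np.array([0.3])
omega = smoothing_weights(x_query, X_train, h)
prediction = omega @ Y                                      # f_hat(x) = sum_i Y_i omega_i(x)
print(prediction, np.sin(0.9))
```

The weights are nonnegative and sum to one, so the prediction is a weighted average of responses at points near the query; shrinking h lowers the bias and raises the variance.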

25.2 Example: Least Squares


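Consider unconstrained Least Squares, which has a closed-form solution

\[ \hat f(x) = \langle \hat\theta, x \rangle = \langle \mathbf{X}^\dagger \mathbf{y}, x \rangle = (\mathbf{X} x)^T (\mathbf{X}\mathbf{X}^T)^{-1} \mathbf{y}. \tag{25.229} \]

This solution can be written as $\hat f(x) = \sum_{i=1}^n Y_i\, \omega_i(x)$, where

\[ \omega_i(x) = (x^T \mathbf{X}^\dagger)_i = (\mathbf{X} x)^T (\mathbf{X}\mathbf{X}^T)^{-1} e_i. \tag{25.230} \]

To avoid confusion, we will use the lower-case $x$ for a random $x \sim P$. The bias is then

\[ B^2 = \mathbb{E}_x \langle P^\perp x, \theta^* \rangle^2 = \big\| \Sigma^{1/2} P^\perp \theta^* \big\|_2^2, \tag{25.231} \]

where $P^\perp = I_d - \mathbf{X}^T (\mathbf{X}\mathbf{X}^T)^{-1} \mathbf{X}$. On the other hand, the variance term is

\[ V \le \sigma_\xi^2 \cdot \mathbb{E}_x \big\| (\mathbf{X}\mathbf{X}^T)^{-1} (\mathbf{X} x) \big\|_2^2 = \sigma_\xi^2 \cdot \mathrm{trace}\big( (\mathbf{X}\mathbf{X}^T)^{-2}\, \mathbf{X} \Sigma \mathbf{X}^T \big). \tag{25.232} \]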

25.3 Example: Regularized Least Squares


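Regularized Least Squares (or, Ridge Regression) is a classical method employed for
high-dimensional data,

\[ \hat\theta_\lambda = \operatorname*{argmin}_{\theta}\ \frac{1}{n} \|\mathbf{X}\theta - \mathbf{y}\|^2 + \lambda\, \|\theta\|^2 \tag{25.233} \]

and it has a closed-form expression

\[ \hat\theta_\lambda = \mathbf{X}^T (\mathbf{X}\mathbf{X}^T + \lambda I_n)^{-1} \mathbf{y}. \tag{25.234} \]

It is easy to extend (25.231) and (25.232) to this case (exercise).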
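
The closed forms for least squares and ridge regression can be checked numerically. The sketch below uses the unnormalized objective $\|\mathbf{X}\theta - \mathbf{y}\|^2 + \lambda\|\theta\|^2$, for which the expression in (25.234) holds with $\lambda$ as written; under the $1/n$ normalization of (25.233) the same formula applies with $\lambda$ replaced by $n\lambda$. The data and dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Form (25.234), which inverts an n x n matrix.
theta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
# Equivalent d x d form from the first-order condition of ||X theta - y||^2 + lam ||theta||^2.
theta_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# With n < d, ridge approaches the minimum-norm least squares solution X^+ y of (25.229) as lam -> 0.
theta_min_norm = np.linalg.pinv(X) @ y
theta_tiny_lam = X.T @ np.linalg.solve(X @ X.T + 1e-8 * np.eye(n), y)
```

The $n \times n$ form is the one that carries over to kernels, since it touches the data only through inner products.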

25.4 Example: Kernel Ridge/Ridgeless Regression
Observe that the solution for Least Squares and Regularized Least Squares only depends
on inner products between data points $X_i$ and $X_j$, $i, j \in [n]$. There are several ways to
motivate kernel methods, but the one we take here just replaces $x$ with some feature map
$\varphi : \mathcal X \to \mathbb{R}^D$ with $D$ large or infinite. Let $\Phi \in \mathbb{R}^{n \times D}$ be the matrix with rows $\varphi(X_i)^T$.
From the earlier discussion, the least squares solution in this high- or infinite-dimensional
space is simply

\[ \hat f(x) = \langle \hat\theta, \varphi(x) \rangle = (\Phi \varphi(x))^T (\Phi\Phi^T)^{-1} \mathbf{y}. \tag{25.235} \]

It is useful to write $k(x, x') = \langle \varphi(x), \varphi(x') \rangle$ ($k$ is called a kernel), as well as write $\Phi\Phi^T$ as
$K(\mathbf{X}, \mathbf{X}) \in \mathbb{R}^{n \times n}$ with $(i, j)$ entry $k(X_i, X_j)$. Furthermore, we write $\Phi\varphi(x)$ as $K(x, \mathbf{X}) \in \mathbb{R}^n$
with $[K(x, \mathbf{X})]_i = k(X_i, x)$. With this notation, (25.235) becomes

\[ \hat f(x) = K(x, \mathbf{X})^T K(\mathbf{X}, \mathbf{X})^{-1} \mathbf{y}, \tag{25.236} \]

and the Kernel Ridge Regression solution becomes

\[ \hat f(x) = K(x, \mathbf{X})^T (K(\mathbf{X}, \mathbf{X}) + \lambda I_n)^{-1} \mathbf{y}. \tag{25.237} \]
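
The two kernel formulas translate directly into a few lines of NumPy. The sketch below assumes a Gaussian kernel with bandwidth 0.2 and synthetic data; the kernel choice and the function names are illustrative assumptions and are not prescribed by the notes.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=0.2):
    """k(a, b) = exp(-||a - b||^2 / (2 bandwidth^2)); an assumed choice of kernel."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * bandwidth**2))

def kernel_ridge_predict(X_train, y, X_test, lam):
    """f_hat(x) = K(x, X)^T (K(X, X) + lam I_n)^{-1} y as in (25.237); lam = 0 gives (25.236)."""
    K = gaussian_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return gaussian_kernel(X_test, X_train) @ alpha

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(15, 2))
y = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=15)

ridgeless_fit = kernel_ridge_predict(X_train, y, X_train, lam=0.0)   # interpolates the data
ridge_fit = kernel_ridge_predict(X_train, y, X_train, lam=1.0)       # shrinks, does not interpolate
```

With lam = 0 the fitted values reproduce the training responses whenever the kernel matrix is invertible, which is the "ridgeless" regime of the section title.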

25.5 Example: Linear Regime in Nonlinear Models


Let f (x, θ) be a function X → R parametrized (potentially non-linearly) by θ ∈ Rp . A
running example here is a neural network with 1 hidden layer
m
1 X
f (x, θ) = √ bj σ(hwj , xi), θ = (w1 , . . . , wm ) (25.238)
m
j=1

and b1 , . . . , bm fixed (for simplicity, we do not include them in θ). Here σ : R → R is a


non-linearity such as σ(a) = max{a, 0}.
Suppose our estimator is a solution of (potentially non-convex) least-squares problem

b = argmin 1 ky − fn (θ)k2
θ (25.239)
θ n
where $f_n : \theta \mapsto (f(X_1, \theta), \ldots, f(X_n, \theta))$ is the evaluation of the function parametrized by
$\theta$ on the data. Despite potential non-convexity of the problem, we can aim to minimize
the squared loss by gradient flow (or gradient descent). Taking $\theta^0$ as a starting point, the
evolution is given by
\[ \frac{d\theta^t}{dt} = \frac{1}{n} Df_n(\theta^t)^\top (y - f_n(\theta^t)), \tag{25.240} \]
where $Df_n(\theta) \in \mathbb{R}^{n \times p}$ is the Jacobian of $f_n$ at $\theta$. Let us linearize

\[ f_n(\theta^t) \approx f_n(\theta^0) + Df_n(\theta^0)(\theta^t - \theta^0) \]

around $\theta^0$. This linearization can be a good approximation if $\theta^t$ does not move too far from
$\theta^0$ and $f_n$ is "regular" enough. Since the linearization introduces different dynamics, we
use $\bar\theta^t$ to denote the resulting trajectory. We have
\[ \frac{d\bar\theta^t}{dt} = \frac{1}{n} Df_n(\theta^0)^\top \big( y - f_n(\theta^0) - Df_n(\theta^0)(\bar\theta^t - \theta^0) \big). \tag{25.241} \]

Under certain conditions (see e.g. [4, Thm 5.1]), the parameters $\theta^t$ stay close to $\bar\theta^t$, the square
loss (25.239) decays exponentially fast to 0 under the dynamics of $\theta^t$, and the function values
$f(x, \theta^t)$ are close to those of the linear model

\[ f^{\mathrm{lin}}(x, \theta) = f(x, \theta^0) + Df(x, \theta^0)(\theta - \theta^0) \tag{25.242} \]

in $L_2(P)$ (here, $D$ is extended to be a linear operator from the parameter space $\mathbb{R}^p$ to the
space of functions, $L_2(P)$). See [4] for references to prior work on this.
When we examine the linear model (25.242) on the data (note the subscript $n$),

\[ f_n^{\mathrm{lin}}(\theta) = f_n(\theta^0) + Df_n(\theta^0)(\theta - \theta^0), \tag{25.243} \]

we see that the data $X_1, \ldots, X_n$ are mapped to a feature space, with the feature matrix
being
\[ \Phi = Df_n(\theta^0), \]
and then we find the best parameter $\theta$ in this feature space.
Suppose we initialize the model in such a way that $f_n(\theta^0) = 0$ (or approximately 0) so
that we can drop it from the above expression. In this case, $\Phi \Phi^\top$ is the kernel matrix

\[ K(X, X) = Df_n(\theta^0) Df_n(\theta^0)^\top \in \mathbb{R}^{n \times n}. \]

If this kernel matrix is full-rank, the linearized gradient flow can be shown to converge to
the minimum-norm interpolant of the data:

\[ \bar\theta^\infty = \operatorname*{argmin}_{\theta} \big\{ \| \theta - \theta^0 \|_2 \, : \, Df_n(\theta^0)(\theta - \theta^0) = y - f_n(\theta^0) \big\} \]

(again, we can ensure $f_n(\theta^0) = 0$).
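
The convergence claim can be checked numerically with a small sketch. Here a random matrix `Phi`
stands in for the Jacobian $Df_n(\theta^0)$ and a random vector `f0` for $f_n(\theta^0)$; both, along
with the step size and iteration count, are assumptions of this example. The loop is an Euler
discretization of the linearized flow (25.241), i.e. plain gradient descent, not the continuous-time
flow itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 20                       # overparametrized: p > n

Phi = rng.standard_normal((n, p))  # stands in for the Jacobian Df_n(theta^0)
f0 = rng.standard_normal(n)        # stands in for f_n(theta^0)
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)

# Euler discretization of (25.241): theta <- theta + step * (1/n) Phi^T (residual).
step = n / np.linalg.norm(Phi, 2) ** 2   # 1 / (largest eigenvalue of Phi^T Phi / n)
theta = theta0.copy()
for _ in range(5000):
    residual = y - f0 - Phi @ (theta - theta0)
    theta = theta + step * Phi.T @ residual / n

# Closed-form minimum-norm interpolant: theta0 + Phi^T (Phi Phi^T)^{-1} (y - f0).
theta_min_norm = theta0 + Phi.T @ np.linalg.solve(Phi @ Phi.T, y - f0)
```

The iterates never leave the affine space $\theta^0 + \mathrm{row}(\Phi)$, which is why the limit is the
interpolant closest to $\theta^0$ rather than an arbitrary one.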

25.5.1 Feature map and kernels for (25.238)


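We can calculate the Jacobian $Df_n(\theta)$ for the model in (25.238). Here $p = md$, where
$x \in \mathbb{R}^d$ with $d$ the input dimension. Then, trivially,
\[ [Df_n(\theta)]_{i,(j,a)} = \frac{1}{\sqrt{m}} b_j \sigma'(\langle w_j, x_i \rangle) x_{i,a} \]
for $i \in [n]$, $(j, a) \in [m] \times [d]$. This corresponds to the feature matrix $\Phi \in \mathbb{R}^{n \times md}$ given by
\[ \Phi = \frac{1}{\sqrt{m}} \begin{pmatrix}
b_1 \sigma'(\langle x_1, w_1 \rangle) x_1^\top & b_2 \sigma'(\langle x_1, w_2 \rangle) x_1^\top & \cdots & b_m \sigma'(\langle x_1, w_m \rangle) x_1^\top \\
b_1 \sigma'(\langle x_2, w_1 \rangle) x_2^\top & b_2 \sigma'(\langle x_2, w_2 \rangle) x_2^\top & \cdots & b_m \sigma'(\langle x_2, w_m \rangle) x_2^\top \\
\vdots & \vdots & & \vdots \\
b_1 \sigma'(\langle x_n, w_1 \rangle) x_n^\top & b_2 \sigma'(\langle x_n, w_2 \rangle) x_n^\top & \cdots & b_m \sigma'(\langle x_n, w_m \rangle) x_n^\top
\end{pmatrix} \tag{25.244} \]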
The corresponding kernel, termed the Neural Tangent Kernel (NTK), is
\[ K_m(x_1, x_2) = \frac{1}{m} \sum_{i=1}^{m} \langle x_1, x_2 \rangle \sigma'(\langle w_i, x_1 \rangle) \sigma'(\langle w_i, x_2 \rangle), \]
where we assumed $b_i = \pm 1$. Now, suppose the weights $w_i \sim N(0, I_d/d)$, independently. As
the number of neurons $m$ increases, the finite-width kernel $K_m$ converges (under conditions)
to an infinite-width NTK given by

\[ K(x_1, x_2) = \mathbf{E}_w \big[ \langle x_1, x_2 \rangle \sigma'(\langle w, x_1 \rangle) \sigma'(\langle w, x_2 \rangle) \big]. \]
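
A small sketch can make the finite-width kernel concrete for the ReLU case, where
$\sigma'(a) = \mathbf{1}\{a > 0\}$. It builds the feature matrix with the $1/\sqrt{m}$ normalization of the
Jacobian, forms $K_m$, and compares it with a closed form for the infinite-width limit. That closed
form, $\mathbf{P}(\langle w, x \rangle > 0, \langle w, x' \rangle > 0) = (\pi - \angle(x, x'))/(2\pi)$ for isotropic Gaussian
$w$, is a standard identity brought in for this example and is not stated in the notes; the sizes
and variable names are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 3, 100_000

X = rng.standard_normal((n, d))                 # rows are inputs x_1, ..., x_n
W = rng.standard_normal((m, d)) / np.sqrt(d)    # rows w_j ~ N(0, I_d / d)
b = rng.choice([-1.0, 1.0], size=m)             # fixed signs b_j = +-1

# sigma'(a) = 1{a > 0} for the ReLU sigma(a) = max(a, 0).
act = (X @ W.T > 0).astype(float)               # act[i, j] = sigma'(<w_j, x_i>)

# Feature matrix: block j of row i is b_j sigma'(<x_i, w_j>) x_i^T / sqrt(m).
Phi = ((act * b)[:, :, None] * X[:, None, :]).reshape(n, m * d) / np.sqrt(m)

# Finite-width NTK: K_m(x_i, x_k) = <x_i, x_k> (1/m) sum_j sigma'(.) sigma'(.).
K_m = (X @ X.T) * (act @ act.T) / m

# Infinite-width limit for ReLU via the Gaussian orthant probability (assumed identity).
norms = np.linalg.norm(X, axis=1)
cos = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
K_inf = (X @ X.T) * (np.pi - np.arccos(cos)) / (2.0 * np.pi)
```

Here `Phi @ Phi.T` coincides with `K_m` up to rounding, and `K_m` approaches `K_inf` at the usual
Monte Carlo rate in $m$.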

Under the conditions which ensure that the nonlinear gradient flow (25.240) stays close
to the linearized gradient flow, $\theta^t$ converges to a minimum-norm interpolant of the data
with respect to the NTK (see e.g. [4]), and can be analyzed with the bias-variance
decomposition. These cases are among the few where we can provably analyze both
optimization and statistical properties of neural network models. Arguably, however, the linear
regime is not very interesting in practice.
Another aspect we have not discussed here is that of interpolation. See [4].

26. TIME SERIES

Suppose we observe a sequence

\[ x_{t+1} = f^*(x_t) + \eta_t, \qquad t = 1, \ldots, n, \]

where $x_t \in \mathbb{R}^d$ and $\eta_t$ are independent zero-mean vectors. The function $f^*$ is unknown, but
we assume it is a member of a known class $\mathcal{F}$ of functions $\mathbb{R}^d \to \mathbb{R}^d$. Let us treat this problem
as a fixed-design regression problem, except that the outcomes are now vectors rather than
reals, and the sequence $x_1, \ldots, x_n$ is a sequence of dependent random variables.
Consider the least squares solution:
\[ \hat f \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \| x_{t+1} - f(x_t) \|_2^2, \]

where the norm is the Euclidean norm. This is a natural generalization of least squares to
vector-valued regression. As before, we denote
\[ \| f - g \|_n^2 = \frac{1}{n} \sum_{t=1}^{n} \| f(x_t) - g(x_t) \|_2^2. \]

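The basic inequality can now be written as (exercise):

\[ \| \hat f - f^* \|_n^2 \le 2 \cdot \frac{1}{n} \sum_{t=1}^{n} \langle \eta_t, \hat f(x_t) - f^*(x_t) \rangle. \]

Choosing the offset-style approach covered in previous lectures, we have

\[ \| \hat f - f^* \|_n^2 \le \sup_{g \in \mathcal{F} - f^*} \frac{1}{n} \sum_{t=1}^{n} \Big( 4 \langle \eta_t, g(x_t) \rangle - \| g(x_t) \|_2^2 \Big). \]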
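
To see the two inequalities on a concrete sequence, the sketch below simulates a linear system,
takes $\mathcal{F}$ to be the class of linear maps $x \mapsto Ax$ (an assumption made for this example, as are
the matrix `A_star`, the noise level, and the sizes), fits least squares, and evaluates both sides.
Time is 0-indexed in the code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200

A_star = 0.5 * np.eye(d) + 0.1 * rng.standard_normal((d, d))  # hypothetical f*(x) = A_star x
x = np.zeros((n + 1, d))
x[0] = rng.standard_normal(d)
eta = 0.1 * rng.standard_normal((n, d))
for t in range(n):
    x[t + 1] = A_star @ x[t] + eta[t]

inputs, targets = x[:-1], x[1:]
# Least squares over linear maps f(x) = A x: minimize sum_t ||x_{t+1} - A x_t||^2.
B, *_ = np.linalg.lstsq(inputs, targets, rcond=None)   # targets ~ inputs @ B
A_hat = B.T

g = inputs @ (A_hat - A_star).T            # g[t] = f_hat(x_t) - f*(x_t)
lhs = np.mean(np.sum(g**2, axis=1))        # ||f_hat - f*||_n^2
cross = np.mean(np.sum(eta * g, axis=1))   # (1/n) sum_t <eta_t, g(x_t)>
offset_at_g = 4.0 * cross - lhs            # offset objective evaluated at g = f_hat - f*
```

Since `offset_at_g` is the offset objective at one particular $g \in \mathcal{F} - f^*$, it is itself a lower
bound on the supremum, so `lhs <= offset_at_g` is the chain used in the text.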

Up until now, the statement is conditional on $\{\eta_1, \ldots, \eta_n\}$. What happens if we take
expectations on both sides? On the left-hand side we have a denoising guarantee on the sequence.
On the right-hand side, we have a "dependent version" of offset Gaussian/Rademacher
complexity where $x_t$ is measurable with respect to $\sigma(\eta_1, \ldots, \eta_{t-1})$. To analyze this object, we
first need to understand the simpler $\mathbb{R}$-valued version without the offset: what is the
behavior of
\[ \mathbf{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t), \]
where $x_t$ is $\sigma(\epsilon_1, \ldots, \epsilon_{t-1})$-measurable, $\mathcal{F}$ is a class of real-valued functions $\mathcal{X} \to \mathbb{R}$, and
$\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables?

27. SEQUENTIAL COMPLEXITIES

We choose to study the random process generated by Rademacher random variables for
several reasons. First, just as in the classical case, conditioning on the data will lead to
a simpler object (binary tree) and, second, other noise processes can be reduced to the
Rademacher case, under moment assumptions on the noise. The development here is based
on [27], and we refer also to [26] for an introduction.
Let us elaborate on the first point. Note that $x_t$ being measurable with respect to
$\sigma(\epsilon_1, \ldots, \epsilon_{t-1})$ simply means that $x_t$ is a function of $\epsilon_1, \ldots, \epsilon_{t-1}$ (in other words, it is a
predictable process). The collection $x_1, \ldots, x_n$ can be "summarized" as a depth-$n$ binary tree
decorated with elements of $\mathcal{X}$ at the nodes. Indeed, $x_1 \in \mathcal{X}$ is a constant (root), $x_2 = x_2(\epsilon_1)$
takes on two possible values depending on the sign of $\epsilon_1$ (left or right), and so forth. It is
useful to think of $(x_1, \ldots, x_n)$ as a tree, even though it doesn't bring any more information
into the picture. We shall denote the collection of $n$ functions $x_i : \{\pm 1\}^{i-1} \to \mathcal{X}$ as
$\mathbf{x} = (x_1, \ldots, x_n)$ and simply call it an $\mathcal{X}$-valued tree. We shall refer to $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$
as a path in the tree. We will also talk about $\mathbb{R}$-valued trees, such as $f \circ \mathbf{x}$ for $f : \mathcal{X} \to \mathbb{R}$.
Given a tree $\mathbf{x}$, we shall call
\[ \widehat{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}) = \mathbf{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t(\epsilon_1, \ldots, \epsilon_{t-1})) \]
the sequential Rademacher complexity of $\mathcal{F}$ on the tree $\mathbf{x}$.


Comparing to the classical version,
\[ \mathbf{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t), \]
where $x_1, \ldots, x_n$ are constant values, we see that it is a special case of a tree with constant
levels $x_t(\epsilon_1, \ldots, \epsilon_{t-1}) = x_t$. Hence, sequential Rademacher complexity is a generalization of
the classical notion.
To ease the notation, we will write $x_t$ without explicit dependence on $\epsilon$, or for brevity
write $x_t(\epsilon)$ even though $x_t$ only depends on the prefix $\epsilon_{1:t-1}$.
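
For small $n$ and a finite class, the definition can be evaluated exactly by enumerating all $2^n$
paths. The sketch below does this; the representation of a tree as a function of the sign prefix,
the particular trees, and the threshold class are all hypothetical choices made for illustration.

```python
from itertools import product

def sequential_rademacher(functions, tree, n):
    """Exact sequential Rademacher complexity of a finite class on a depth-n tree.

    functions: list of callables f: X -> R.
    tree: callable mapping a sign prefix (eps_1, ..., eps_{t-1}) to the point x_t.
    """
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        # Points visited along this path: x_t depends only on eps_{1:t-1}.
        points = [tree(eps[:t]) for t in range(n)]
        total += max(sum(e * f(x) for e, x in zip(eps, points)) for f in functions) / n
    return total / 2 ** n

n = 3
constants = [lambda x: 1.0, lambda x: -1.0]

def search_tree(prefix):
    """Hypothetical tree on X = [0, 1]: move left after +1 and right after -1."""
    x, step = 0.5, 0.25
    for e in prefix:
        x, step = x - e * step, step / 2
    return x

def constant_tree(prefix):
    """Tree with constant levels: x_t ignores the signs (the classical case)."""
    return 0.1 * (len(prefix) + 1)

# Thresholds x -> +1 if x > theta else -1, on a grid of 17 values of theta.
thresholds = [lambda x, th=k / 16: 1.0 if x > th else -1.0 for k in range(17)]

r_const = sequential_rademacher(constants, search_tree, n)           # equals E|sum eps_t| / n
r_thr_tree = sequential_rademacher(thresholds, search_tree, n)       # every path is realized
r_thr_classical = sequential_rademacher(thresholds, constant_tree, n)
```

On the search tree the thresholds can match the signs along every path, while on any tree with
constant levels they realize only monotone sign patterns, so the classical quantity is strictly
smaller for the same class.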
Observe that for any $f \in \mathcal{F}$, the variable
\[ \nu_f = \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t) \]
is zero mean. Moreover, it is an average of martingale differences $\epsilon_t f(x_t)$, and so we expect
$1/\sqrt{n}$ behavior from the Azuma-Hoeffding inequality. It should be clear that, say, for $\mathcal{F}$
consisting of a finite collection of $[-1, 1]$-valued functions on $\mathcal{X}$, we have
\[ \mathbf{E} \max_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t) \le \sqrt{\frac{2 \log \mathrm{card}(\mathcal{F})}{n}}. \]
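
One way to see this bound is the usual exponential-moment argument, with independence replaced by
conditioning on the past. The sketch below is not taken from the notes: $\lambda > 0$ is a free
parameter introduced here, and $\mathbf{E}_{n-1}$ denotes expectation conditional on $\epsilon_{1:n-1}$.

```latex
\begin{align*}
\mathbf{E} \max_{f\in\mathcal F} \nu_f
  &\le \frac{1}{\lambda} \log \sum_{f \in \mathcal F} \mathbf{E} \exp(\lambda \nu_f)
  && \text{(Jensen, then max $\le$ sum)} \\
\mathbf{E}_{n-1} \exp\Big(\tfrac{\lambda}{n} \epsilon_n f(x_n)\Big)
  &= \cosh\Big(\tfrac{\lambda}{n} f(x_n)\Big) \le \exp\Big(\tfrac{\lambda^2}{2n^2}\Big)
  && \text{($x_n$ is fixed given $\epsilon_{1:n-1}$, and $|f| \le 1$)} \\
\mathbf{E} \exp(\lambda \nu_f)
  &\le \exp\Big(\tfrac{\lambda^2}{2n}\Big)
  && \text{(peel off $t = n, n-1, \dots, 1$)} \\
\mathbf{E} \max_{f\in\mathcal F} \nu_f
  &\le \frac{\log \mathrm{card}(\mathcal F)}{\lambda} + \frac{\lambda}{2n}
   = \sqrt{\frac{2 \log \mathrm{card}(\mathcal F)}{n}}
  && \text{at } \lambda = \sqrt{2n \log \mathrm{card}(\mathcal F)}.
\end{align*}
```

The only place the tree structure enters is the second line, where predictability of $x_n$ lets the
last sign be integrated out on its own.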

Given that this bound is no different from the classical case, one may wonder if we can just reduce
everything to the classical Rademacher averages. The answer is no, and the differences
already start to appear when we attempt to define covering numbers.
More precisely, since any tree $\mathbf{x}$ is defined by $2^n - 1$ values, one might wonder if we could
define a notion of pseudo-distance between $f$ and $f'$ as an $\ell_2$ distance on these $2^n - 1$ values.
It is easy to see that this is a huge overkill. Perhaps one of the key points to understand
here is: what is the equivalent of the projection $\mathcal{F}|_{x_1, \ldots, x_n}$ for the tree case? Spoiler: it's
not $\mathcal{F}|_{\mathbf{x}}$. The following turns out to be the right definition:

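Definition 22. A set $V$ of $\mathbb{R}$-valued trees is a 0-cover of $\mathcal{F}$ on a tree $\mathbf{x} = (x_1, \ldots, x_n)$
if
\[ \forall f \in \mathcal{F}, \ \epsilon \in \{\pm 1\}^n, \ \exists \mathbf{v} \in V \ \text{s.t.} \ f(x_t(\epsilon_{1:t-1})) = v_t(\epsilon_{1:t-1}) \quad \forall t \in [n]. \]
The size of the smallest 0-cover of $\mathcal{F}$ on a tree $\mathbf{x}$ will be denoted by $\mathcal{N}(\mathcal{F}, \mathbf{x}, 0)$.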

The key aspect of this definition is that $\mathbf{v} \in V$ can be chosen based on the sequence
$\epsilon \in \{\pm 1\}^n$. In other words, in contrast with the classical definition, for the same function $f$
different elements $\mathbf{v} \in V$ can provide a cover on different paths. This results in the needed
reduction in the size of $V$.
As an example, take a set of $2^{n-1}$ functions that take a value of 1 on one of the $2^{n-1}$
leaves of $\mathbf{x}$ and zero everywhere else. Then the projection $\mathcal{F}|_{\mathbf{x}}$ is of size $2^{n-1}$ but the size
of the 0-cover is only 2 (exercise!), corresponding to our intuition that the class is simple
(as it only varies on the last example). Indeed, the size of the 0-cover is the analogue of the
size of $\mathcal{F}|_{x_1, \ldots, x_n}$ in the binary-valued case.
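
The leaf-indicator example can be verified by brute force for a small depth. In the sketch below the
tree labels each node by its own sign prefix so that all nodes are distinct; this encoding, the
depth `n = 4`, and the helper names are assumptions made for illustration. The candidate cover
consists of the all-zero tree and the tree equal to 1 on the last level.

```python
from itertools import product

n = 4
paths = list(product((-1, 1), repeat=n))

def tree(prefix):
    """Hypothetical tree whose nodes are all distinct: node x_t is labelled by its prefix."""
    return prefix

leaves = list(product((-1, 1), repeat=n - 1))   # the 2^(n-1) nodes at the last level
# One function per leaf: value 1 on that leaf, 0 on every other node.
functions = [lambda x, leaf=leaf: 1.0 if x == leaf else 0.0 for leaf in leaves]

def along_path(real_tree, eps):
    """Values (v_1, ..., v_n) of a real-valued tree along the path eps."""
    return tuple(real_tree(eps[:t]) for t in range(n))

def is_zero_cover(V):
    """0-cover check: every (f, path) pair is matched exactly by some v in V."""
    return all(
        any(along_path(v, eps) == along_path(lambda p: f(tree(p)), eps) for v in V)
        for f in functions
        for eps in paths
    )

def zero_tree(prefix):
    return 0.0

def last_level_one(prefix):
    return 1.0 if len(prefix) == n - 1 else 0.0

# Size of the projection F|_x: number of distinct real-valued trees f o x.
all_prefixes = [p for t in range(n) for p in product((-1, 1), repeat=t)]
projection_size = len({tuple(f(tree(p)) for p in all_prefixes) for f in functions})
```

Which of the two trees matches a given $f$ depends on whether the path ends at the leaf where $f$
equals 1, which is exactly the path-dependent choice of $\mathbf{v}$ allowed by the definition.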
For real-valued functions, consider the following definition.

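Definition 23. A set $V$ of $\mathbb{R}$-valued trees is an $\alpha$-cover of $\mathcal{F}$ on a tree $\mathbf{x} = (x_1, \ldots, x_n)$
with respect to $\ell_2$ if
\[ \forall f \in \mathcal{F}, \ \epsilon \in \{\pm 1\}^n, \ \exists \mathbf{v} \in V \ \text{s.t.} \ \frac{1}{n} \sum_{t=1}^{n} \big( f(x_t(\epsilon_{1:t-1})) - v_t(\epsilon_{1:t-1}) \big)^2 \le \alpha^2. \]

The size of the smallest $\alpha$-cover of $\mathcal{F}$ on a tree $\mathbf{x}$ with respect to $\ell_2$ will be denoted by
$\mathcal{N}_2(\mathcal{F}, \mathbf{x}, \alpha)$.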

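A similar definition can be stated for covers with respect to $\ell_p$.

The following is an analogue of the chaining bound:

Theorem 14. For any class of $[-1, 1]$-valued functions $\mathcal{F}$,
\[ \widehat{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}) \le \inf_{\alpha \ge 0} \left\{ 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{1} \sqrt{\log \mathcal{N}_2(\mathcal{F}, \mathbf{x}, \varepsilon)} \, d\varepsilon \right\}. \]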

Recall the definition of VC dimension and a shattered set. Here is the right sequential
analogue:

Definition 24. Function class $\mathcal{F}$ of $\{\pm 1\}$-valued functions shatters a tree $\mathbf{x}$ of depth
$d$ if
\[ \forall \epsilon \in \{\pm 1\}^d, \ \exists f \in \mathcal{F} \ \text{s.t.} \ \forall t \in [d], \ f(x_t(\epsilon)) = \epsilon_t. \]
The largest depth $d$ for which there exists a shattered $\mathcal{X}$-valued tree is called the
Littlestone dimension and denoted by $\mathrm{ldim}(\mathcal{F})$.

In contrast with the classical definition, the points on which the signs should be realized
depend on the path $\epsilon$ itself. But it's clear that the definition serves the same purpose: if $\mathbf{x}$ is
shattered by $\mathcal{F}$ then $\widehat{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}) = 1$. It is also easy to see that $\mathrm{vc}(\mathcal{F}) \le \mathrm{ldim}(\mathcal{F})$, and the
gap can be infinite.
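
For a finite class on a finite domain, both dimensions can be computed by brute force. The sketch
below uses the standard recursion for the Littlestone dimension: a point $x$ can serve as the root of
a shattered tree only if both labels are realized at $x$, and the two subtrees must then be shattered
by the corresponding sub-classes. The encoding of functions as tuples of values and the threshold
example are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import combinations

def littlestone_dim(functions):
    """Littlestone dimension of a finite class; each function is a tuple of +-1 values
    indexed by the points of a finite domain."""
    @lru_cache(maxsize=None)
    def ldim(fs):
        best = 0
        for x in range(len(next(iter(fs)))):
            plus = frozenset(f for f in fs if f[x] == 1)
            minus = fs - plus
            # x can be the root of a shattered tree only if both labels are realized.
            if plus and minus:
                best = max(best, 1 + min(ldim(plus), ldim(minus)))
        return best
    return ldim(frozenset(functions))

def vc_dim(functions):
    """VC dimension by brute force over subsets of the domain."""
    num_points = len(functions[0])
    best = 0
    for k in range(1, num_points + 1):
        for S in combinations(range(num_points), k):
            if len({tuple(f[x] for x in S) for f in functions}) == 2 ** k:
                best = k
    return best

# Thresholds on the domain {0, ..., 6}: f_theta(x) = +1 if x >= theta else -1.
N = 7
thresholds = [tuple(1 if x >= theta else -1 for x in range(N)) for theta in range(N + 1)]

ldim_thresholds = littlestone_dim(thresholds)
vc_thresholds = vc_dim(thresholds)
```

For these 8 thresholds the shattered tree is a binary search over $\theta$, so the Littlestone
dimension is 3 while the VC dimension is 1; refining the grid increases the former and leaves the
latter unchanged.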
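The following is an analogue of the Sauer-Shelah-Vapnik-Chervonenkis lemma.

Theorem 15. For a class of binary-valued functions $\mathcal{F}$ with Littlestone dimension
$d = \mathrm{ldim}(\mathcal{F})$ and a tree $\mathbf{x}$ of depth $n \ge d$,
\[ \mathcal{N}(\mathcal{F}, \mathbf{x}, 0) \le \sum_{i=0}^{d} \binom{n}{i} \le \left( \frac{en}{d} \right)^{d}. \]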

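Scale-sensitive sequential versions are defined as follows:

Definition 25. Function class $\mathcal{F}$ of $\mathbb{R}$-valued functions shatters a tree $\mathbf{x}$ of depth $d$ at
scale $\alpha$ if there exists a witness $\mathbb{R}$-valued tree $\mathbf{s}$ such that

\[ \forall \epsilon \in \{\pm 1\}^d, \ \exists f \in \mathcal{F} \ \text{s.t.} \ \forall t \in [d], \ \epsilon_t \big( f(x_t(\epsilon)) - s_t(\epsilon) \big) \ge \alpha/2. \]

The largest depth $d$ for which there exists an $\alpha$-shattered $\mathcal{X}$-valued tree is called the
sequential scale-sensitive dimension and denoted $\mathrm{ldim}(\mathcal{F}, \alpha)$.

We note that the above definitions reduce to the classical ones if we consider only trees
$\mathbf{x}$ with constant levels.

Theorem 16. For any class of $[-1, 1]$-valued functions $\mathcal{F}$ and $\mathcal{X}$-valued tree $\mathbf{x}$ of depth
$n$,
\[ \mathcal{N}_\infty(\mathcal{F}, \mathbf{x}, \alpha) \le \left( \frac{2en}{\alpha} \right)^{\mathrm{ldim}(\mathcal{F}, \alpha)}. \]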

Finally, it is possible to show an analogue of the symmetrization lemma: for any joint
distribution of $(X_1, \ldots, X_n)$,
\[ \mathbf{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^{n} \Big( \mathbf{E}[f(X_t) \mid X_{1:t-1}] - f(X_t) \Big) \le 2 \sup_{\mathbf{x}} \widehat{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}). \]

If the sequence $(X_1, \ldots, X_n)$ is i.i.d., the left-hand side is the expected supremum of the
empirical process. The present version provides a martingale generalization. Furthermore,
if we take the supremum over all joint distributions on the left-hand side, then the lower bound
also matches the upper bound, up to a constant.
The offset Rademacher complexity has been analyzed in [25].

28. ONLINE LEARNING

Consider the following online classification problem. On each of $n$ rounds $t = 1, \ldots, n$, the
learner observes $x_t \in \mathcal{X}$, makes a prediction $\hat y_t \in \{\pm 1\}$, and observes the outcome $y_t \in \{\pm 1\}$.
The learner models the problem by fixing a class $\mathcal{F}$ of possible models $f : \mathcal{X} \to \{\pm 1\}$, and
aims to predict nearly as well as the best model in $\mathcal{F}$ in the sense of keeping regret
\[ \mathrm{Reg}(\mathcal{F}) = \mathbf{E} \left[ \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{\hat y_t \ne y_t\} \right] - \inf_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^{n} \mathbf{1}\{f(x_t) \ne y_t\} \right] \tag{28.245} \]

small for any sequence $(x_1, y_1), \ldots, (x_n, y_n)$. At least visually, this looks like oracle
inequalities for misspecified models. The distinguishing feature of this online framework is that
(a) data arrives sequentially, and (b) we aim to have low regret for any sequence without
assuming any generative process.
It is also worth noting that in the above protocol there is no separation of training and
test data: the online nature of the problem allows us to first test our current hypothesis
by making a prediction, then observe the outcome and incorporate the datum into our
dataset.
The expectation in the first term in (28.245) is with respect to the learner's internal
randomization. More specifically, let $Q_t$ be the distribution on $\{\pm 1\}$ that the learner uses to
predict $\hat y_t \sim Q_t$. Let $q_t = \mathbf{E} \hat y_t$ be the (conditional) mean of this distribution. For instance,
$q_t = 0$ would correspond to the learner tossing a fair coin.
A note about the protocol. The results below hold even if the sequence is chosen based
on the learner's past predictions. However, in this case, $y_t$ may only depend on $q_t$ but not
on the realization $\hat y_t$. To simplify the presentation, let us just assume that the sequence
$(x_1, y_1), \ldots, (x_n, y_n)$ is fixed in advance (this turns out not to matter).
We will answer the following question: what is the best achievable $\mathrm{Reg}(\mathcal{F})$ for a given
$\mathcal{F}$ by any prediction strategy?
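Let us first rewrite $\mathbf{1}\{\hat y_t \ne y_t\} = (1 - \hat y_t y_t)/2$ and do the same for the oracle term.
Taking the expectation over $\hat y_t$ (so that $\hat y_t$ is replaced by $q_t$) and cancelling the constants, we have
\begin{align}
2 \mathrm{Reg}(\mathcal{F}) &= \frac{1}{n} \sum_{t=1}^{n} -q_t y_t - \inf_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^{n} -y_t f(x_t) \right] \tag{28.246} \\
&= \sup_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^{n} y_t f(x_t) \right] - \frac{1}{n} \sum_{t=1}^{n} q_t y_t. \tag{28.247}
\end{align}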
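
The rewriting can be checked numerically on a fixed sequence. In the sketch below the class is
represented only through its predictions on the sequence (the array `preds`), and the outcomes, the
learner's means `q`, and the sizes are arbitrary illustrative values rather than anything
prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_models = 6, 5

y = rng.choice([-1, 1], size=n)                      # outcomes y_t
preds = rng.choice([-1, 1], size=(num_models, n))    # preds[k, t] = f_k(x_t) for a finite F
q = rng.uniform(-1, 1, size=n)                       # learner's means q_t = E[y_hat_t]

# Expected mistake rate: P(y_hat_t != y_t) = (1 - q_t y_t) / 2 when y_hat_t is in {-1, +1}.
expected_mistakes = np.mean((1 - q * y) / 2)
best_in_class = np.min(np.mean(preds != y, axis=1))
regret = expected_mistakes - best_in_class           # (28.245)

# Right-hand side of (28.247).
rewritten = np.max(preds @ y) / n - np.mean(q * y)
```

Twice the regret computed from mistake counts agrees with the correlation form, which is the
version used in the lower-bound argument that follows.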

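Now, consider a particular stochastic process for generating the data sequence: fix any
$\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, and on round $t$ let $x_t = x_t(y_1, \ldots, y_{t-1})$ and let $y_t = \epsilon_t$ be
an independent Rademacher random variable. This defines a stochastic process with $2^n$
possible sequences $(x_1, y_1), \ldots, (x_n, y_n)$. Now, clearly, the worst-case regret over sequences
is at least the expected regret under this process:

\[ \max_{\{(x_t, y_t)\}_{t=1}^{n}} 2 \mathrm{Reg}(\mathcal{F}) \ge 2 \mathbf{E}_{\epsilon} \mathrm{Reg}(\mathcal{F}). \]

Observe that $q_t = q_t(\epsilon_1, \ldots, \epsilon_{t-1})$ and thus

\[ \mathbf{E} \left[ \frac{1}{n} \sum_{t=1}^{n} q_t \epsilon_t \right] = 0. \]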

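Hence, by (28.247),
\[ 2 \mathbf{E}_{\epsilon} \mathrm{Reg}(\mathcal{F}) = \mathbf{E} \sup_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(x_t) \right]. \tag{28.248} \]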
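
The lower-bound argument can be replayed exactly for a small depth by enumerating all $2^n$ outcome
sequences. In the sketch below the tree, the finite threshold class, and the learner's strategy are
hypothetical; the point is only that the strategy's contribution averages out, so twice the
expected regret matches the sequential Rademacher complexity on the tree.

```python
from itertools import product

n = 3

def tree(prefix):
    """Hypothetical X-valued tree on X = [0, 1] (moves left after +1, right after -1)."""
    x, step = 0.5, 0.25
    for e in prefix:
        x, step = x - e * step, step / 2
    return x

# Hypothetical finite class: a few thresholds x -> +1 if x > theta else -1.
functions = [lambda x, th=th: 1 if x > th else -1 for th in (0.2, 0.4, 0.6, 0.8)]

def strategy(prefix):
    """An arbitrary predictable strategy: q_t in [-1, 1] depends only on past outcomes."""
    return 0.9 * sum(prefix) / (len(prefix) + 1)

expected_regret = 0.0
seq_rademacher = 0.0
for eps in product((-1, 1), repeat=n):
    xs = [tree(eps[:t]) for t in range(n)]
    qs = [strategy(eps[:t]) for t in range(n)]
    # Expected mistake rate of the learner and of the best model on this path (y_t = eps_t).
    learner = sum((1 - q * e) / 2 for q, e in zip(qs, eps)) / n
    best = min(sum(f(x) != e for x, e in zip(xs, eps)) for f in functions) / n
    expected_regret += (learner - best) / 2 ** n
    seq_rademacher += max(sum(e * f(x) for x, e in zip(xs, eps)) for f in functions) / n / 2 ** n
```

Replacing `strategy` by any other function of the past with values in $[-1, 1]$ leaves
`expected_regret` unchanged, which is why no learner can beat this bound on the worst sequence.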

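Since the argument holds for any $\mathbf{x}$, we have proved that the optimal value of $\mathrm{Reg}(\mathcal{F})$ is
lower bounded by half of
\[ \bar{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}) = \sup_{\mathbf{x}} \widehat{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}). \]

It turns out that this lower bound is within a factor of 2 from optimal. Define the minimax
value
\[ V = \min_{\mathrm{Algo}} \max_{\{(x_t, y_t)\}_{t=1}^{n}} \mathrm{Reg}(\mathcal{F}). \]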

Theorem 17. For a binary-valued class $\mathcal{F}$,
\[ \frac{1}{2} \bar{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}) \le V \le \bar{\mathcal{R}}^{\mathrm{seq}}(\mathcal{F}). \]

Similar results also hold for the absolute loss and other Lipschitz loss functions. For square
loss, the sequential Rademacher averages are replaced by offset sequential Rademacher
averages (again, as both upper and lower bounds).
In short, sequential complexities in online learning play a role similar to the role played
by i.i.d. complexities as studied in this course. However, quite a large number of questions
still remain open. But that's a topic for a different course.

References

[1] J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in
Neural Information Processing Systems, pages 41–48, 2008.

[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[3] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The
Annals of Statistics, 33(4):1497–1537, 2005.

[4] P. L. Bartlett, A. Montanari, and A. Rakhlin. Deep learning: a statistical viewpoint.
Acta Numerica, 30:87–201, 2021.

[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic
theory of independence. Oxford University Press, 2013.

[6] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to
the Analysis of Learning Algorithms. PhD thesis, Biologische Kybernetik, 2002.

[7] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study.
In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.

[8] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition.
Springer, 1996.

[9] L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators.
The Annals of Statistics, 44(6):2695–2725, 2016.

[10] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical
models, volume 40. Cambridge University Press, 2016.

[11] E. Giné and J. Zinn. Some limit theorems for empirical processes. The Annals of
Probability, pages 929–989, 1984.

[12] N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural
networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.

[13] S. B. Hopkins. Mean estimation with sub-Gaussian rates in polynomial time. The
Annals of Statistics, 48(2):1193–1213, 2020.

[14] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian
random vectors. Electronic Communications in Probability, 17:1–6, 2012.

[15] J.-P. Kahane. Some random series of functions. D. C. Heath, 1968.

[16] M. J. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic
concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.

[17] V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for
sample covariance operators. Bernoulli, 23(1):110–133, 2017.

[18] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random
matrix without concentration. International Mathematics Research Notices, 2015(23):
12991–13008, 2015.

[19] V. Koltchinskii, D. Panchenko, et al. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.

[20] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in
learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–
1980, 1998.

[21] T. Liang, A. Rakhlin, and K. Sridharan. Learning with square loss: Localization
through offset Rademacher complexity. In Conference on Learning Theory, pages 1260–
1285, 2015.

[22] G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector.
The Annals of Statistics, 47(2):783–794, 2019.

[23] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random
matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.

[24] S. Mendelson. Learning without concentration. In Conference on Learning Theory,
pages 25–39, 2014.

[25] A. Rakhlin and K. Sridharan. Online non-parametric regression. In Conference on
Learning Theory, pages 1232–1264, 2014.

[26] A. Rakhlin and K. Sridharan. On martingale extensions of Vapnik–Chervonenkis
theory with applications to online learning. In Measures of Complexity, pages 197–215.
Springer, 2015.

[27] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform
martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153,
2015.

[28] M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of
convex bodies. Annals of Mathematics, pages 603–648, 2006.

[29] M. Rudelson and R. Vershynin. Hanson-Wright inequality and sub-Gaussian
concentration. Electronic Communications in Probability, 18:1–9, 2013.

[30] V. I. Serdobolskii. Multivariate statistical analysis: A high-dimensional approach,
volume 41. Springer Science & Business Media, 2000.

[31] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In
Advances in Neural Information Processing Systems, pages 2199–2207, 2010.

[32] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv
preprint arXiv:1011.3027, 2010.

[33] R. Vershynin. High-dimensional probability: An introduction with applications in data
science, volume 47. Cambridge University Press, 2018.

[34] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48.
Cambridge University Press, 2019.

[35] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of
convergence. Annals of Statistics, pages 1564–1599, 1999.

[36] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for
statisticians, 2014.
