Rakhlin Mathstat sp22
1. INTRODUCTION
Suppose we would like to estimate the average height µ of students at MIT. Assuming
students in this course are a random sample from the overall population, we may build a
confidence interval for the unknown parameter µ as
$$\left[\bar X_n - 1.96\,\sigma/\sqrt{n},\ \ \bar X_n + 1.96\,\sigma/\sqrt{n}\right]$$
where X̄n is the sample average of n student heights in this course, and σ² is the population
variance (which we can also estimate from data). Classical statistics tells us that this
random interval contains µ with probability approximately 95%. Where does the number
1.96 come from?
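As a quick numerical aside, the constant can be recovered from the standard normal quantile function. The sketch below uses only the Python standard library; the helper name `normal_quantile_for_coverage` is ours and purely illustrative.

```python
from statistics import NormalDist


def normal_quantile_for_coverage(coverage: float) -> float:
    """Return z such that P(|Z| <= z) = coverage for Z ~ N(0, 1)."""
    # A two-sided interval leaves (1 - coverage) / 2 mass in each tail.
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)


z = normal_quantile_for_coverage(0.95)
print(round(z, 2))
```

For 95% coverage this returns a value that rounds to 1.96, which is the number appearing in the interval above.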
More formally, let X1, . . . , Xn be i.i.d. from some distribution P on R such that µ = E[Xi] and σ² = var(Xi) are finite. Of course, for the sample mean $\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$ we have
E X̄n = µ, i.e. X̄n is an unbiased estimate of the population mean. The (weak) Law of
Large Numbers provides more information: X̄n converges to µ in probability as n → ∞:
for any ε > 0,
$$\lim_{n\to\infty} P\left(|\bar X_n - \mu| < \epsilon\right) = 1.$$
The strong LLN states that X̄n converges to µ almost surely:
$$P\left(\lim_{n\to\infty} \bar X_n = \mu\right) = 1.$$
Only finiteness of µ is needed for both of these results. Assuming σ is finite as well, we have
the Central Limit Theorem:
$$\sqrt{n}\left(\bar X_n - \mu\right) \overset{d}{\to} N(0, \sigma^2).$$
This course is centered on non-asymptotic results. In the context of one-dimensional
mean estimation, we will show in the next lecture that, under appropriate assumptions on
P,
$$P\left(\sqrt{n}\,\frac{|\bar X_n - \mu|}{\sigma} > u\right) \le 2\exp\{-cu^2\}, \tag{1.2}$$
which holds for any n. Here c is some constant that depends on properties of P and may
be somewhat larger than the one suggested by the limiting distribution. Thus, confidence
intervals derived with such non-asymptotic methods may be somewhat wider, yet they hold
for any n. On the downside, we will have to place stronger assumptions on the distribution
than those required by the CLT.
One may argue that in modern applications, n is very often large. However, it is also
true that many applications of interest are characterized by high dimensionality of data, or
complex structure. In such problems, as we see below, asymptotic analysis based on n → ∞
may not suffice.
matrix, and $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$ the sample covariance. Clearly, the sample covariance is an unbiased estimate of Σ. What can we say about the quality of such an estimate? For any pair i, j = 1, . . . , d, it still holds that $\hat\Sigma_{i,j}$ converges to $\Sigma_{i,j}$ in probability by the LLN (it’s an average of independent products). Similarly,
$$\sqrt{n}\left(\hat\Sigma_{i,j} - \Sigma_{i,j}\right) \overset{d}{\to} N\left(0, \operatorname{var}(X_{1,i}X_{1,j})\right),$$
Let’s see if such convergence holds in the high-dimensional asymptotics regime. For simplicity, take Σ = I, assume that coordinates of X are i.i.d., and let $\lambda_1(\hat\Sigma) \ge \dots \ge \lambda_d(\hat\Sigma) \ge 0$ be the eigenvalues of $\hat\Sigma$. If $\hat\Sigma$ converges in spectral norm to Σ = I, the histogram of (random) eigenvalues should be concentrated at 1. In particular, we would expect the empirical distribution of eigenvalues $\frac{1}{d}\sum_{i=1}^d \delta_{\lambda_i} \to \delta_1$ as n → ∞. This indeed happens when d is kept fixed and n taken to infinity. Yet in the proportional high-dimensional regime, the limiting distribution of the empirical spectrum is not δ1 but follows the Marchenko-Pastur law [23]. This distribution has density supported on [λ−, λ+] where $\lambda_+ = (1 + \sqrt{\alpha})^2$, $\lambda_- = (1 - \sqrt{\alpha})^2$. The density has the form
$$p(t) \propto \frac{\sqrt{(\lambda_+ - t)(t - \lambda_-)}}{t}$$
for α ∈ (0, 1] (and for α > 1, there is an atom at 0). We see that when d, n both grow proportionally, $\hat\Sigma$ does not converge to Σ in the desired sense. To conclude, if we had,
say, genomic data with d = 20K and n = 30K, we should probably not trust the sample
covariance matrix as an estimate of the true covariance matrix, even though the data size
appears to be large.
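To get a feel for how far the spectrum spreads in the genomic example, the following sketch evaluates the Marchenko-Pastur support and checks numerically that the density integrates to one. It assumes that α denotes the limiting ratio d/n and uses the standard normalizing constant 1/(2πα) for identity covariance; both the helper names and the midpoint-rule integrator are illustrative choices of ours.

```python
import math


def mp_support(alpha: float) -> tuple[float, float]:
    """Endpoints (lambda_minus, lambda_plus) of the Marchenko-Pastur support."""
    root = math.sqrt(alpha)
    return (1 - root) ** 2, (1 + root) ** 2


def mp_density(t: float, alpha: float) -> float:
    """Marchenko-Pastur density for alpha in (0, 1], identity covariance."""
    lam_minus, lam_plus = mp_support(alpha)
    if t <= lam_minus or t >= lam_plus:
        return 0.0
    return math.sqrt((lam_plus - t) * (t - lam_minus)) / (2 * math.pi * alpha * t)


def midpoint_integral(f, a: float, b: float, steps: int = 100_000) -> float:
    """Plain midpoint rule; accurate enough for this smooth integrand."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


alpha = 20_000 / 30_000  # assumed ratio d / n from the example above
lam_minus, lam_plus = mp_support(alpha)
total_mass = midpoint_integral(lambda t: mp_density(t, alpha), lam_minus, lam_plus)
mean = midpoint_integral(lambda t: t * mp_density(t, alpha), lam_minus, lam_plus)
print(round(lam_minus, 3), round(lam_plus, 3), round(total_mass, 3), round(mean, 3))
```

Under these assumptions the limiting eigenvalues range over an interval from roughly 0.03 to 3.3, even though every population eigenvalue equals 1; the average eigenvalue is still 1, consistent with unbiasedness.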
Analogously to the development in the previous section, we can contrast the asymptotic
approach with non-asymptotic tail bounds that hold for all n, d. In particular, we will show
that, under additional assumptions, the largest eigenvalue of the sample covariance matrix
satisfies
$$P\left(\lambda_1(\hat\Sigma) \ge \left(1 + \sqrt{d/n} + u\right)^2\right) \le \exp\{-nu^2/2\}, \quad u \ge 0 \tag{1.3}$$
1.3 MLE
To give a taste of some other settings where asymptotic analysis is classically used, consider the case of X1, . . . , Xn i.i.d. ∼ Pθ on R, where θ ∈ Θ is a parameter. Suppose for simplicity, the random variables are real-valued. Under some regularity conditions, the sequence of MLE solutions
$$\hat\theta_n \in \operatorname*{argmax}_{\theta\in\Theta} \sum_{i=1}^n \log P_\theta(X_i)$$
where I(θ) is the Fisher information. Once again, while the asymptotic result sheds light
on the convergence of MLE for large enough n, it does not say much about finite n. In
particular, for some finite n, MLE may not be the best estimator, and some biased procedure
may be better.
$$P\left(f_n(X; D_n) \ne Y\right) \to L^*,$$
where L∗ is the Bayes risk (lowest achievable error by any classification rule), and the
probability is with respect to Dn and a new datum (X, Y ) from the same distribution.
Strong universal consistency asks for almost sure convergence.
Once again, consistency does not guarantee good performance for any finite n. Much of
learning theory instead focuses on explicit rates of convergence in n, as well as on making
explicit the relevant complexity parameters of the problem. Such complexity parameters
are not always explicit (in contrast to dimensionality of linear models), as illustrated in the
next example.
2. SUB-GAUSSIAN RANDOM VARIABLES
Finally,
$$\mathbb{E}e^{\lambda Z} = e^{\lambda^2/2} \tag{2.7}$$
for any λ ∈ R. Hence $\mathbb{E}e^{\lambda Z'} = e^{\sigma^2\lambda^2/2}$ for Z′ ∼ N(0, σ²). Since our aim is to develop
CLT-like non-asymptotic tail bounds on averages of random variables, we will be checking
whether approximate versions of (2.5), (2.6), (2.7) hold.
Hence, to deduce Gaussian-like tails for the random variable $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$, we need to understand the behavior of its moments $\mathbb{E}|\bar X - \mu|^p$ or its moment generating function
(defined abstractly here for any random variable U). Since the exponential of a sum is a product of exponentials, the upper bound furnished by optimizing λ in (2.8) will be easier to handle.
Before proceeding to analyze the sums and establishing tail bounds, we first discuss a
family of random variables that will be useful to work with. These random variables have
more restrictions than those for which CLT holds (finite second moment), and hence form
a smaller family. Nevertheless, the family is rich enough to cover many applications of
interest. In the next lecture we will see a larger family of random variables.
We will write X ∈ subG(s²) to denote the fact that X belongs to the family of sub-Gaussian random variables with s² as the parameter.
A few remarks. First, if X is sub-Gaussian, then so is −X with the same variance proxy. This will be useful for deducing bounds on |X| from bounds on X. Second, the families of these random variables are nested in the sense that if X ∈ subG(s²), then X ∈ subG(t²) for all t² > s². Third, if X ∈ subG(s²) then cX ∈ subG(c²s²). In particular, we can often work with subG(1) and conclude the more general result by rescaling.
It turns out that there are several equivalent ways of defining sub-Gaussian behavior.
Lemma 1 (Prop 2.5.2 in [33]). Let X be a random variable with E[X] = 0. Then the following are equivalent, and the parameters $c_i > 0$ differ by at most absolute constant factors:
1. For all λ ∈ R,
$$\mathbb{E}\exp\{\lambda X\} \le \exp\{c_1^2\lambda^2\}$$
2. For all t ≥ 0,
$$P(|X| \ge t) \le 2\exp\{-t^2/c_2^2\}$$
3. For all p = 1, 2, . . .,
$$(\mathbb{E}|X|^p)^{1/p} \le c_3\sqrt{p}$$
4. For all λ such that $|\lambda| \le 1/c_4$,
$$\mathbb{E}\exp\{\lambda^2 X^2\} \le \exp\{c_4^2\lambda^2\}$$
5. For some $c_5 < \infty$,
$$\mathbb{E}\exp\{X^2/c_5^2\} \le 2.$$
We will only prove a few of the implications here (please see [33] for all the proofs). Let us illustrate (1) ⇒ (2). Suppose without loss of generality that X ∈ subG(1) (and hence $c_1^2 = 1/2$). In view of (2.8),
$$P(X \ge t) \le \inf_{\lambda>0} e^{-t\lambda}\,\mathbb{E}e^{\lambda X} \le \inf_{\lambda>0} e^{\lambda^2/2 - t\lambda} = \exp\left\{-\frac{t^2}{2}\right\} \tag{2.9}$$
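The optimization over λ in (2.9) can be checked numerically. The sketch below minimizes the exponent λ²/2 − tλ over a grid (the minimizer is λ = t) and compares the resulting bound with the exact upper tail of a standard normal, which is one particular member of subG(1). The function names and the grid resolution are our own choices.

```python
import math


def gaussian_upper_tail(t: float) -> float:
    """Exact P(Z >= t) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def chernoff_bound(t: float) -> float:
    """Right-hand side of (2.9)."""
    return math.exp(-t * t / 2)


def optimized_exponent(t: float, lam_max: float = 10.0, grid_size: int = 10_001) -> float:
    """Grid minimum of lambda^2 / 2 - t * lambda over lambda in (0, lam_max]."""
    grid = (lam_max * i / (grid_size - 1) for i in range(1, grid_size))
    return min(lam * lam / 2 - t * lam for lam in grid)


for t in (0.5, 1.0, 2.0, 3.0):
    print(t, gaussian_upper_tail(t), chernoff_bound(t), optimized_exponent(t))
```

The grid minimum agrees with −t²/2 up to discretization error, and the exact Gaussian tail sits below the bound for every t tried.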
2.3.1 Examples
Arguably, the simplest nontrivial random variables are Bernoulli or Rademacher. The
Rademacher random variable ε takes values in {±1} with equal probability. We then have
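$$\mathbb{E}e^{\lambda\varepsilon} = \frac{1}{2}e^{\lambda} + \frac{1}{2}e^{-\lambda} = \frac{1}{2}\left(\sum_{k=0}^\infty \frac{\lambda^k}{k!} + \sum_{k=0}^\infty \frac{(-\lambda)^k}{k!}\right) = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le 1 + \sum_{k=1}^\infty \frac{(\lambda^2)^k}{2^k\,k!} = e^{\lambda^2/2}. \tag{2.11}$$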
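The two facts used in (2.11) are easy to sanity-check numerically: the term-by-term comparison (2k)! ≥ 2^k k!, and the resulting inequality cosh(λ) ≤ exp(λ²/2). A minimal sketch with illustrative helper names:

```python
import math


def rademacher_mgf(lam: float) -> float:
    """E exp(lam * eps) for eps uniform on {-1, +1}; equals cosh(lam)."""
    return 0.5 * math.exp(lam) + 0.5 * math.exp(-lam)


def subgaussian_mgf_bound(lam: float) -> float:
    """MGF of N(0, 1), the right-hand side of (2.11)."""
    return math.exp(lam * lam / 2)


# Term-by-term comparison behind the inequality in (2.11).
factorial_gap_ok = all(
    math.factorial(2 * k) >= 2**k * math.factorial(k) for k in range(0, 30)
)

lams = [i / 10 for i in range(-50, 51)]
mgf_ok = all(
    rademacher_mgf(lam) <= subgaussian_mgf_bound(lam) * (1 + 1e-12) for lam in lams
)
print(factorial_gap_ok, mgf_ok)
```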
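Lemma 2 (Hoeffding’s Lemma). For any zero-mean random variable X taking values in [a, b], the moment generating function satisfies
$$\mathbb{E}\exp\{\lambda X\} \le \exp\{\lambda^2 (b-a)^2/8\}, \quad \lambda \in \mathbb{R}.$$
Proof. Let ψ(λ) = log E exp{λX}. Then $\psi'(\lambda) = \frac{\mathbb{E}[Xe^{\lambda X}]}{\mathbb{E}e^{\lambda X}}$. Observe that ψ(0) = ψ′(0) = 0. It remains to prove that ψ″(λ) ≤ (b − a)²/4 since Taylor’s theorem would then imply (for some ν ∈ [0, λ])
$$\psi(\lambda) = \psi(0) + \lambda\psi'(0) + \frac{\lambda^2}{2}\psi''(\nu) \le \frac{\lambda^2 (b-a)^2}{8}.$$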
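We compute the second derivative as
$$\psi''(\lambda) = \mathbb{E}\left[X^2\,\frac{e^{\lambda X}}{\mathbb{E}e^{\lambda X}}\right] - \left(\mathbb{E}\left[X\,\frac{e^{\lambda X}}{\mathbb{E}e^{\lambda X}}\right]\right)^2 = \operatorname{var}(Y)$$
for Y with density tilted by $x \mapsto e^{\lambda x}/\mathbb{E}e^{\lambda X}$. Since Y takes on values in [a, b], its variance is at most (b − a)²/4, concluding the proof.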
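As an illustration of Hoeffding’s Lemma, consider the zero-mean two-point distribution on {a, b} with a < 0 < b, for which the moment generating function is available in closed form. The sketch below compares it with exp{λ²(b − a)²/8} on a grid of λ; the endpoints a = −1, b = 3 are arbitrary choices of ours.

```python
import math


def two_point_mgf(lam: float, a: float, b: float) -> float:
    """MGF of the zero-mean distribution supported on {a, b}, with a < 0 < b."""
    p_a = b / (b - a)   # P(X = a), chosen so that E[X] = 0
    p_b = -a / (b - a)  # P(X = b)
    return p_a * math.exp(lam * a) + p_b * math.exp(lam * b)


def hoeffding_mgf_bound(lam: float, a: float, b: float) -> float:
    """Upper bound exp(lam^2 (b - a)^2 / 8) from Hoeffding's Lemma."""
    return math.exp(lam * lam * (b - a) ** 2 / 8)


a, b = -1.0, 3.0
lams = [i / 10 for i in range(-50, 51)]
lemma_holds = all(
    two_point_mgf(lam, a, b) <= hoeffding_mgf_bound(lam, a, b) * (1 + 1e-12)
    for lam in lams
)
print(lemma_holds)
```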
In the same vein, for any sequence of independent random variables Xi with E[Xi] = µi and $X_i - \mu_i \in \mathrm{subG}(\sigma_i^2)$,
$$P\left(\sum_{i=1}^n (X_i - \mu_i) \ge t\right) \le \exp\left\{-\frac{t^2}{2\sum_{i=1}^n \sigma_i^2}\right\} \tag{2.12}$$
In particular, we have
We close this section with two examples that indicate that the development of sub-
Gaussian tail bounds so far is lacking on several fronts.
First, we will be interested in tail bounds on norms of Gaussian vectors ‖g‖, where coordinates are standard normal. Since $\|g\|^2 = \sum_i g_i^2$, it’s tempting to use the sub-Gaussian results above. However, $g_i^2$ is not sub-Gaussian: $P(g^2 \ge t) = P(|g| \ge \sqrt{t}) \le 2\exp\{-t/2\}$, which is sub-exponential rather than sub-Gaussian. These tails are heavier (or, fatter) than those of sub-Gaussian.
The second example illustrates a larger concern with sub-Gaussian tail bounds à la Hoeffding that rely on the range of random variables but not on their variance. Consider the following variable X. Let P(X = 0) = 1 − 1/k² and P(X = ±k) = 1/(2k²), where k is a parameter, which we think of as large. Observe that the range of this random variable is 2k, but the mean and (importantly) variance are small: EX = 0, var(X) = k² · (1/k²) = 1. If we draw X1, . . . , Xk i.i.d., P(X1 = . . . = Xk = 0) = (1 − 1/k²)^k ≈ exp{−1/k}, which is close to 1 for large k. Since Hoeffding-style inequalities only depend on the range, they are not able to distinguish this small-variance distribution from one that is uniform on [−k, k].
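The quantities in this example can be computed directly from the definition of X. The sketch below does so for a given k and sets them next to the range-based variance proxy (b − a)²/4 = k² that Hoeffding’s Lemma would assign, and the variance k²/3 of the uniform distribution on [−k, k]. The function name and the returned dictionary keys are illustrative.

```python
import math


def small_variance_example(k: int) -> dict:
    """Moments of X with P(X=0) = 1 - 1/k^2 and P(X=+k) = P(X=-k) = 1/(2k^2)."""
    p_tail = 1.0 / (2 * k**2)
    p_zero = 1.0 - 1.0 / k**2
    mean = k * p_tail + (-k) * p_tail
    variance = 2 * k**2 * p_tail - mean**2
    return {
        "mean": mean,
        "variance": variance,
        "prob_all_zero": p_zero**k,          # P(X_1 = ... = X_k = 0)
        "hoeffding_proxy": (2 * k) ** 2 / 4,  # (b - a)^2 / 4 with range 2k
        "uniform_variance": k**2 / 3,         # variance of Unif[-k, k]
    }


summary = small_variance_example(100)
print(summary)
```

For k = 100 the true variance is of order one, while the range-based proxy is 10,000, which is the gap the variance-sensitive bounds of the next section are designed to close.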
3. SUB-EXPONENTIAL RANDOM VARIABLES
Remarks:
• In some of the references, you will see that sub-exponential random variables are
defined with only one parameter; this corresponds to insisting that α = s, i.e. the
random variable has sub-Gaussian behavior with parameter s² in the range |λ| < 1/s.
We follow [34] and decouple these two parameters.
• If we ask that (3.17) holds for λ ∈ (0, 1/α), the results stated below will only hold for
the upper tail of (X − µ). The behavior for the upper and lower tails can indeed be
different.
The same holds for the tail of −(X − µ).
where the limited range of λ, as compared to (2.8), is dictated by the definition of a sub-exponential random variable. By taking the derivative, we see that λ = t/v² is the unconstrained solution; we take this whenever t/v² ≤ 1/α. Otherwise, the minimum is achieved at the endpoint λ = 1/α, with the value −t/α + v²/(2α²) ≤ −t/(2α).
Let us discuss Lemma 5. It shows that sub-exponential random variables exhibit two behaviors: sub-Gaussian (in the range 0 ≤ t ≤ v²/α) and sub-exponential (in the range t ≥ v²/α). We remark that the two-tail behavior arises simply by asking for the sub-Gaussian behavior in an interval.
Rather than writing the tail bound with a min as in (3.19), we can relax the exponent as follows. Note that for positive u, v, it holds that min{1/u, 1/v} ≥ 1/(u + v). We can thus upper bound the right-hand side of (3.19) as
$$\exp\left\{-\min\left\{\frac{t^2/2}{v^2}, \frac{t^2/2}{t\alpha}\right\}\right\} \le \exp\left\{-\frac{t^2/2}{v^2 + t\alpha}\right\} \tag{3.21}$$
and, hence,
$$P\left(\left|\sum_{i=1}^n a_i X_i\right| \ge t\right) \le 2\exp\left\{-\min\left\{\frac{t^2}{2\|a\|_2^2}, \frac{t}{2\|a\|_\infty}\right\}\right\}$$
In particular, if all ai = 1/n, under the conditions of the above lemma,
$$P\left(\left|\frac{1}{n}\sum_{i=1}^n X_i\right| \ge t\right) \le 2\exp\left\{-n\cdot\min\left\{\frac{t^2}{2}, \frac{t}{2}\right\}\right\} \tag{3.22}$$
To shed some light on (3.22), consider a tail bound for a single sub-exponential random
variable with parameters (1, 1):
$$P\left(\left|\frac{1}{n}X_i\right| \ge t\right) \le 2\exp\left\{-\frac{nt}{2}\right\}, \quad t \ge 1 \tag{3.23}$$
from Lemma 5. Hence, the sub-exponential behavior of the averages in (3.22) comes not
from averaging but rather from a single worst tail (e.g. that has the largest α for a general
collection).
Another way to write (3.22) is
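$$P\left(\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i\right| \ge t\right) \le \begin{cases} 2\exp\left\{-\frac{t^2}{2}\right\}, & t \le \sqrt{n} \\ 2\exp\left\{-\frac{t\sqrt{n}}{2}\right\}, & t \ge \sqrt{n} \end{cases} \tag{3.24}$$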
The CLT would say that for large enough n, the random variable $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ should have Gaussian tails under finiteness of second moment. In contrast, (3.24) says that for the sub-exponential family (where the restriction is less strict than sub-Gaussian but more strict than finite second moment), the sub-Gaussian behavior holds until $t = \sqrt{n}$, after which it switches to heavier tails.
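The equivalence between (3.22) and (3.24) is just the substitution t ↦ t/√n, and the switch between the two regimes happens exactly where the two exponents coincide. A small sketch (function names are ours) makes this concrete:

```python
import math


def tail_bound_average(s: float, n: int) -> float:
    """Right-hand side of (3.22) for the average (1/n) * sum X_i at level s."""
    return 2 * math.exp(-n * min(s * s / 2, s / 2))


def tail_bound_normalized_sum(t: float, n: int) -> float:
    """Right-hand side of (3.24) for (1/sqrt(n)) * sum X_i at level t."""
    if t <= math.sqrt(n):
        return 2 * math.exp(-t * t / 2)       # sub-Gaussian regime
    return 2 * math.exp(-t * math.sqrt(n) / 2)  # sub-exponential regime


n = 100
for t in (1.0, 5.0, 9.0, 11.0, 20.0):
    # The event {|sum / sqrt(n)| >= t} equals {|sum / n| >= t / sqrt(n)}.
    print(t, tail_bound_normalized_sum(t, n), tail_bound_average(t / math.sqrt(n), n))
```

The two columns agree for every t, and with n = 100 the Gaussian-type decay is in force for all t up to 10.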
Lemma 7. Any bounded random variable with |X − EX| ≤ B satisfies the Bernstein’s
Condition with b = B/3.
Lemma 8 (Bernstein’s Inequality). For a random variable X satisfying the Bernstein’s
Condition with parameter b > 0, it holds that for any |λ| < 1/b,
$$\mathbb{E}\exp\{\lambda(X - \mu)\} \le \exp\left\{\frac{\lambda^2\sigma^2/2}{1 - b|\lambda|}\right\} \tag{3.26}$$
It is worth comparing (3.27) to the tail in (3.21) for a subE(v², α) random variable. Here, v² is replaced by the actual variance σ², and the parameter α by b.
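Proof. We have
$$\begin{aligned}
\mathbb{E}\exp\{\lambda(X - \mu)\} &= 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \frac{\lambda^k\,\mathbb{E}(X - \mu)^k}{k!} && (3.29)\\
&\le 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{2}\sum_{k=3}^\infty |\lambda|^{k-2} b^{k-2} && (3.30)\\
&= 1 + \frac{\lambda^2\sigma^2}{2}\left(1 + \sum_{k=1}^\infty |\lambda|^k b^k\right) && (3.31)\\
&\le 1 + \frac{\lambda^2\sigma^2}{2}\,\frac{1}{1 - |\lambda| b} && (3.32)
\end{aligned}$$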
which means that X − µ ∈ subE(2σ², 2b). This, however, does not yield the constants of
(3.28) as opposed to working directly with (3.33).
Finally, we mention a one-sided tail bound that has tighter constants:
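Lemma 9. Suppose for some positive v, b it holds that
$$\mathbb{E}\exp\{\lambda(X - \mu)\} \le \exp\left\{\frac{\lambda^2 v^2/2}{1 - b\lambda}\right\}, \quad \lambda \in (0, 1/b). \tag{3.35}$$
Then
$$P\left(X - \mu \ge \sqrt{2v^2 t} + bt\right) \le \exp\{-t\}. \tag{3.36}$$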
See [5, p. 29] for a proof, or try to prove it yourself (Hint: solve for the optimal λ in
Cramér-Chernoff).
in (3.19) will lose a constant factor, so we directly repeat the proof of Lemma 8 with the
sum of random variables:
from which we can read off the following transition between the two tails. If t ≤ 3σ²/B, the tails are sub-Gaussian, while for t ≥ 3σ²/B they are sub-exponential.
As already indicated by (3.36), in view of (3.33), it also holds that with probability at
least 1 − δ,
$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \le \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}} + \frac{B\log(2/\delta)}{3n}. \tag{3.39}$$
We will give a short proof of this with a worse constant 2 in the last term. To this end, set
$$\delta = 2\exp\left\{-\frac{nt^2}{2\sigma^2 + 2Bt/3}\right\}$$
which is equivalent to solving the quadratic equation
$$t^2 - \frac{2B\log(2/\delta)}{3n}\,t - \frac{2\sigma^2\log(2/\delta)}{n} = 0$$
and thus
$$t \le \sqrt{\frac{2\sigma^2\log(2/\delta)}{n}} + \frac{2B\log(2/\delta)}{3n}$$
using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for a, b ≥ 0. For the sharper constant 1 in (3.39), see (3.36).
Let us examine (3.39). We see that in the small-variance case, the last term dominates and it indicates a faster convergence rate in terms of n (though at the expense of log(1/δ) rather than $\sqrt{\log(1/\delta)}$ dependence on precision).
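To see the effect of a small variance numerically, the sketch below evaluates the right-hand side of (3.39) and compares it with a range-only radius. The comparison radius B·sqrt(2 log(2/δ)/n) is our own addition: it is what (2.12) gives when each X_i − µ ∈ [−B, B] is treated as sub-Gaussian with variance proxy B² via Hoeffding’s Lemma. Function names and the numbers plugged in are illustrative.

```python
import math


def bernstein_radius(sigma2: float, B: float, n: int, delta: float) -> float:
    """Right-hand side of (3.39): variance term plus range term."""
    log_term = math.log(2 / delta)
    return math.sqrt(2 * sigma2 * log_term / n) + B * log_term / (3 * n)


def hoeffding_radius(B: float, n: int, delta: float) -> float:
    """Range-only radius, assuming the variance proxy B^2 (not from the notes)."""
    return B * math.sqrt(2 * math.log(2 / delta) / n)


n, delta, B = 1000, 0.05, 1.0
for sigma2 in (1.0, 0.1, 0.01):
    print(sigma2, bernstein_radius(sigma2, B, n, delta), hoeffding_radius(B, n, delta))
```

With σ² = 0.01 the Bernstein radius is roughly nine times smaller than the range-only one, while for σ² close to B² the two are comparable.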
Lemma 11. Let X be a random variable with E[X] = 0. Then the following are
equivalent, and the parameters ci > 0 differ by at most absolute constant factors:
2. For all t ≥ 0,
$$P(|X| \ge t) \le 2\exp\{-t/c_2\}$$
3. For all p = 1, 2, . . .,
$$(\mathbb{E}|X|^p)^{1/p} \le c_3\,p$$
Under the event that the validation error is zero, we have
$$p \le \sqrt{\frac{2pu}{n}} + \frac{u}{3n}$$
which means
$$p \le \frac{4u}{n}.$$
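For completeness, here is one way to verify the last implication. The shorthand x = √p and a = √(u/n) is introduced only for this remark; the inequality is then quadratic in x and can be solved explicitly.

```latex
% With x = \sqrt{p} and a = \sqrt{u/n}, the assumption reads
%   x^2 \le \sqrt{2}\, a\, x + a^2/3 .
\begin{aligned}
x &\le \frac{\sqrt{2}\,a + \sqrt{2a^2 + \tfrac{4}{3}a^2}}{2}
   = \frac{\sqrt{2} + \sqrt{10/3}}{2}\; a \;\approx\; 1.62\, a, \\
p &= x^2 \;\le\; 2.63\, a^2 \;=\; 2.63\,\frac{u}{n} \;\le\; \frac{4u}{n}.
\end{aligned}
```

So the constant 4 is not tight; it is simply a convenient upper bound.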
Note that this is better than what we expected from the CLT. The effect is due to low
variance (more precisely, here variance is upper bounded by expectation itself). This type
of argument appears often in statistical learning. Of course, we would be interested in the
case that f itself was produced by minimizing error on the same data (in which case the
validation error is in fact training error). The issue of the dependence of f on the data
(and hence failure of CLT due to lack of independence) will be dealt with through notions
of uniform convergence in the second part of the course.
where we only took one tail of the two-tail behavior (NB: the constant 8 can be improved).
Hence, if we define the map $F(u) = \frac{1}{\sqrt{m}}\Gamma u$, we have proved that for any u ≠ 0, u ∈ R^M,
$$P\left(\frac{\|F(u)\|^2}{\|u\|^2} \notin [1 - t, 1 + t]\right) \le 2\exp\{-mt^2/8\}, \quad t \in (0, 1)$$
since F is linear. By setting the right-hand side to δ, we have that with probability at least 1 − δ, all the norms are preserved up to multiplicative accuracy 1 ± t as long as
$$m > \frac{16}{t^2}\log(N/\delta).$$
Interestingly, the dimension M does not enter this estimate for the target dimension.
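To make the condition on m concrete, the following sketch returns the smallest integer strictly exceeding (16/t²) log(N/δ). The function name is ours; note that the ambient dimension M is not an argument, mirroring the remark above.

```python
import math


def jl_target_dimension(num_points: int, t: float, delta: float) -> int:
    """Smallest integer m with m > (16 / t^2) * log(N / delta)."""
    threshold = 16.0 / t**2 * math.log(num_points / delta)
    return math.floor(threshold) + 1


m = jl_target_dimension(num_points=1000, t=0.1, delta=0.01)
print(m)
```

For a thousand points, accuracy t = 0.1, and δ = 0.01, this gives a target dimension in the low tens of thousands regardless of how large M is; the quadratic dependence on 1/t dominates.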
where C depends on the sub-Gaussian constant of Xi. A simple trick will now convert the two-tailed behavior for the square into single-tail behavior for the norm itself. First, following [33], observe that |z − 1| ≥ t implies |z² − 1| ≥ max{t, t²} for all z ≥ 0. Now, let u = max{t, t²} and observe that min{u, u²} = min{max{t, t²}, (max{t, t²})²} = t² (by considering cases). Hence,
$$\begin{aligned}
P\left(\left|\tfrac{1}{\sqrt{d}}\|X\| - 1\right| \ge t\right) &\le P\left(\left|\tfrac{1}{d}\|X\|^2 - 1\right| \ge \max\{t, t^2\}\right) && (3.43)\\
&= P\left(\left|\tfrac{1}{d}\|X\|^2 - 1\right| \ge u\right) && (3.44)\\
&\le 2\exp\left\{-Cd\min\{u^2, u\}\right\} && (3.45)\\
&= 2\exp\left\{-Cdt^2\right\} && (3.46)
\end{aligned}$$
for all t > 0. Thus, the norm of a random vector with sub-Gaussian entries is sub-Gaussian itself (after centering). Or, rescaling,
$$P\left(\left|\|X\| - \sqrt{d}\right| \ge t\right) \le 2\exp\{-Ct^2\} \tag{3.47}$$
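A small simulation illustrates the dimension-free nature of (3.47): for standard Gaussian vectors, the deviation ‖X‖ − √d stays of constant order as d grows. The sketch uses only the standard library with a fixed seed; the sample sizes and the helper name are arbitrary choices of ours, and the Gaussian case is just one instance of sub-Gaussian coordinates.

```python
import math
import random


def norm_deviations(d: int, samples: int, seed: int = 0) -> list[float]:
    """Sample ||g|| - sqrt(d) for g with i.i.d. N(0, 1) coordinates."""
    rng = random.Random(seed)
    deviations = []
    for _ in range(samples):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        deviations.append(math.sqrt(squared_norm) - math.sqrt(d))
    return deviations


for d in (100, 900):
    devs = norm_deviations(d, samples=300)
    largest = max(abs(x) for x in devs)
    print(d, round(math.sqrt(d), 1), round(largest, 2))
```

Although √d triples between the two runs, the largest observed deviation remains a small constant in both.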
this purpose, recall (3.16), which holds for λ ∈ (0, 1/2). Then, using the Cramér-Chernoff bound with the choice $\lambda = \frac{t}{2t + 2} \in (0, 1/2)$, we arrive at the one-sided bound
$$P\left(g^2 - 1 \ge t\right) \le \exp\left\{-\frac{t^2/2}{2 + 2t}\right\} \tag{3.48}$$
or, in view of (3.16), from (3.36),
$$P\left(g^2 - 1 \ge 2\sqrt{t} + 2t\right) \le \exp\{-t\}. \tag{3.49}$$
Let us write $g \sim N(0, I_d)$. Let $A \in \mathbb{R}^{d\times d}$ and let $\Sigma = A^T A$. We have that the mean
because Σ and Λ share the same set of eigenvalues. Note that d does not explicitly appear,
except through the trace of the eigenvalues.
The tail bound (3.53) was proved in [14] for sub-Gaussian (rather than Gaussian) mean-
zero vectors, with the same constants as above. More precisely, a centered random vector
$x \in \mathbb{R}^d$ is sub-Gaussian with variance proxy v² if for any unit vector $u \in \mathbb{R}^d$, ⟨x, u⟩ ∈ subG(v²).
We also remark that tail bounds on $\|Ax\|^2 = x^T A^T A x$ have been proved in [29] for more general quadratic forms $x^T B x$, where B is not necessarily psd, but with stronger
independence assumptions on coordinates of x. Such bounds are known as Hanson-Wright
inequalities.
4. MEAN ESTIMATION
We conclude that when the error is measured in squared Euclidean norm, the expected error
is tr(Γ)/n and the deviations above this expectation are given by the two tails in terms of
the trace of Γ. Since dimension d never appears in these bounds, it can be very large or
infinite, as long as the covariance matrix Γ has a fast decay of eigenvalues.
As in (3.47), if we instead consider the norm rather than squared norm, we again only
have the sub-Gaussian behavior. More precisely,
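$$P\left(\left\|\frac{1}{n}\sum_{i=1}^n X_i\right\| - \sqrt{\frac{\operatorname{tr}(\Gamma)}{n}} \ge \sqrt{\frac{2\|\Gamma\|\,t}{n}}\right) \le \exp\{-t\}. \tag{4.55}$$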
We refer to [33, p. 135] for the proof of this fact, along the lines of our earlier conversion
to sub-Gaussian tails for the norm itself.
We see that the trace of the covariance matrix, tr(Γ), serves as the effective dimension
of the problem. Indeed, it replaces the actual dimension d that would arise if we used (3.47)
with identity covariance.
Lemma 12 (Catoni [7]). For any δ ∈ (0, (2e)⁻¹) and σ² > 0, there exists a distribution P with mean 0 and variance σ² s.t.
$$P\left(|\bar X_n| \ge \sigma\sqrt{\frac{1}{2n\delta}}\left(1 - \frac{2e\delta}{n}\right)^{\frac{n-1}{2}}\right) \ge 2\delta.$$
Proof. Let X1, . . . , Xn be i.i.d. with EXi = 0, and assume without loss of generality that σ = 1 (since we can divide through by σ). Fix t, to be chosen later, and define
$$P(X_i = nt) = P(X_i = -nt) = \frac{1}{2n^2t^2}$$
and
$$P(X_i = 0) = 1 - \frac{1}{n^2t^2}.$$
We verify that
$$\mathbb{E}X_i^2 = \operatorname{var}(X_i) = n^2t^2\cdot\frac{1}{n^2t^2} = 1.$$
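Then for $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$,
$$P\left(\bar X \ge t\right) = P\left(\bar X \le -t\right) \ge P\left(\bar X = t\right) \ge \frac{n}{2n^2t^2}\left(1 - \frac{1}{n^2t^2}\right)^{n-1}, \tag{4.56}$$
since X̄ = t can be achieved whenever any one of the variables is nt and the rest are 0. Now choose $t = \sqrt{\frac{1}{2n\delta}}\left(1 - \frac{2e\delta}{n}\right)^{\frac{n-1}{2}}$. If we show that the right-hand side of (4.56) with this value of t is at least δ, we will be done. This amounts to proving
$$\left(1 - \frac{1}{\frac{n}{2\delta}\left(1 - \frac{2e\delta}{n}\right)^{n-1}}\right)^{n-1} \ge \left(1 - \frac{2e\delta}{n}\right)^{n-1} \tag{4.57}$$
which is true if
$$1 - \frac{2e\delta}{n} \le 1 - \frac{2\delta}{n}\cdot\frac{1}{\left(1 - \frac{2e\delta}{n}\right)^{n-1}}.$$
The last statement is true since
$$\left(1 - \frac{2e\delta}{n}\right)^{n-1} \ge 1/e.$$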
Since the sample mean does not exhibit sub-Gaussian tail behavior in our heavy-tailed
situation, the goal is to change the estimator itself. Perhaps, this is the first “non-trivial”
estimator in this course, since we only analyzed averages so far.
For simplicity of exposition, suppose n = km with k, m ≥ 1 integers. Define the median-of-means estimator as
$$\hat\mu = \operatorname{median}\left(\frac{1}{m}\sum_{i=1}^{m} X_i,\ \dots,\ \frac{1}{m}\sum_{i=(k-1)m+1}^{km} X_i\right) \tag{4.58}$$
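The estimator in (4.58) takes only a few lines to implement. The sketch below splits the sample into k consecutive blocks of size m = ⌊n/k⌋, averages each block, and returns the median of the block means. Dropping leftover points when k does not divide n is a simplification of ours; the notes assume n = km.

```python
import statistics


def median_of_means(xs: list[float], k: int) -> float:
    """Median of k block means, using consecutive blocks of size len(xs) // k."""
    m = len(xs) // k
    if k < 1 or m < 1:
        raise ValueError("need 1 <= k <= len(xs)")
    block_means = [sum(xs[j * m:(j + 1) * m]) / m for j in range(k)]
    return statistics.median(block_means)


# A sample with one gross outlier: the plain average is pulled far from 0,
# while at most one block mean is affected, so the median is untouched.
sample = [0.0] * 99 + [1000.0]
print(sum(sample) / len(sample), median_of_means(sample, k=10))
```

The example shows the mechanism behind the estimator: a single heavy-tailed draw can corrupt one block, but the median ignores a minority of corrupted blocks.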
Lemma 13. Let δ ∈ (0, 1), k = c log(1/δ), m = n/(c log(1/δ)), for some absolute constant c. Then, with probability at least 1 − δ,
$$|\hat\mu - \mu| \le \sigma\sqrt{\frac{c\log(1/\delta)}{n}}.$$
The corresponding bad event for the jth batch can be denoted by
$$Y_j = \mathbf{1}\left\{\left|\frac{1}{m}\sum_{i=(j-1)m+1}^{jm} X_i - \mu\right| \ge \frac{2\sigma}{\sqrt{m}}\right\}$$
⟨X − µ, u⟩ ∈ subG(v²).
Equivalently, we can state the definition as: ⟨X − µ, u⟩ ∈ subG(v²‖u‖²) for any $u \in \mathbb{R}^d$. In other words, a vector is sub-Gaussian if all its 1-dimensional marginals are sub-Gaussian. In particular, by choosing standard basis vectors, sub-Gaussianity of X implies sub-Gaussianity of its coordinates, and thus the variance of each coordinate of X is at most v². Note that sub-Gaussianity of the vector does not require independence of the coordinates.
Let X1, . . . , Xn be i.i.d. from a v²-sub-Gaussian distribution with mean µ. Let us estimate µ by $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$. Then it is easy to see that X̄ is v²/n-sub-Gaussian. Equivalently, we can think of observing a single vector from the model
Y =µ+ε (4.60)
with ε ∈ subG(v²/n), Eε = 0, and we are observing one vector realization Y = X̄. While
ε is sub-Gaussian, we will not assume independence of the coordinates. The model in
(4.60)—called the Gaussian Sequence Model —has been studied extensively (usually under
the Gaussian assumption on ε).
The model in (4.60) is also a prototypical example of a statistical problem in the form
“observation = signal + noise,” with the goal of denoising the observation and estimating
the signal, under various structural assumptions. We will study one such example now.
Suppose $\mu \in \mathbb{R}^d$ is k-sparse:
$$\|\mu\|_0 = \sum_{j=1}^d \mathbf{1}\{\mu_j \ne 0\} = k.$$
Returning Y (or, in our earlier example, the sample mean X̄) as an estimate of µ may be suboptimal if we know that µ is sparse. Indeed, the mean squared error of the estimator $\hat\mu = Y$ is
$$\mathbb{E}\|\mu - \hat\mu\|^2 = \mathbb{E}\|\varepsilon\|^2 \propto \frac{d}{n}.$$
In particular, this ignores the sparsity of the mean vector µ.
A natural modification is to threshold coordinates of Y. Given δ, consider the event E = {|εi| ≤ λ for all i ∈ [d]}, where λ will be chosen later as a function of δ, n, d such that this event holds with probability at least 1 − δ. Define a “kill-or-keep” estimate $\hat\mu^{HT} \in \mathbb{R}^d$ by
$$\hat\mu_i^{HT} = Y_i\,\mathbf{1}\{|Y_i| > \lambda\}$$
The superscript here stands for “hard thresholding.” How close is this estimate to µ? In the case that µi = 0, under the event E it holds that |Yi| ≤ λ, and thus $\hat\mu_i^{HT} = 0$ and $|\hat\mu_i^{HT} - \mu_i| = 0$ (that is, the coordinate was zeroed out correctly). If, on the other hand, µi ≠ 0 (i.e. i is one of the k non-zero coordinates), then
$$|\hat\mu_i^{HT} - \mu_i| \le |\hat\mu_i^{HT} - Y_i| + |Y_i - \mu_i| \le 2\lambda.$$
Hence, under the event E,
$$|\hat\mu_i^{HT} - \mu_i| \le 2\lambda\,\mathbf{1}\{\mu_i \ne 0\}$$
or
$$\left\|\hat\mu^{HT} - \mu\right\|^2 \le 4\lambda^2\|\mu\|_0. \tag{4.61}$$
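It remains to calculate λ such that P(E) ≥ 1 − δ. To this end, note that for any i = 1, . . . , d, sub-Gaussianity (with parameter v²/n) implies
$$P\left(|\varepsilon_i| > \frac{v}{\sqrt{n}}\sqrt{2\log(2/\delta)}\right) \le \delta \tag{4.62}$$
By the union bound,
$$P\left(\forall i \in [d],\ |\varepsilon_i| \le \frac{v}{\sqrt{n}}\sqrt{2\log(2d/\delta)}\right) \ge 1 - \delta \tag{4.63}$$
Hence, from (4.61), with probability at least 1 − δ,
$$\left\|\hat\mu^{HT} - \mu\right\|^2 \le \frac{8v^2 k\log(2d/\delta)}{n}. \tag{4.64}$$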
A few remarks:
• The thresholding method requires the knowledge of v and δ, but not the sparsity
parameter k. In this sense, the method is adaptive to the unknown sparsity, i.e.
attains the rate in (4.64) that depends on k despite not knowing it.
• If the goal is to recover the correct support (i.e. non-zero entries) of µ, we need
to make an assumption about “signal strength,” i.e. that the minimum value of a
nonzero entry of µ is at least, say, 3λ. This ensures that signal can be separated from
the noisy values outside the support. In this case, we can threshold Y at the value of
2λ.
The hard thresholding estimator is a discontinuous function. Another popular thresholding scheme is soft thresholding, defined as
$$\hat\mu_i^{ST} = \begin{cases} Y_i - \lambda, & Y_i > \lambda \\ 0, & |Y_i| \le \lambda \\ Y_i + \lambda, & Y_i < -\lambda \end{cases} \tag{4.65}$$
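Both thresholding rules act coordinate-wise and are straightforward to implement. The following sketch operates on plain Python lists; the function names are ours.

```python
def hard_threshold(y: list[float], lam: float) -> list[float]:
    """Keep coordinates with |y_i| > lam, zero out the rest ("kill-or-keep")."""
    return [yi if abs(yi) > lam else 0.0 for yi in y]


def soft_threshold(y: list[float], lam: float) -> list[float]:
    """Shrink every coordinate toward zero by lam, as in (4.65)."""
    shrunk = []
    for yi in y:
        if yi > lam:
            shrunk.append(yi - lam)
        elif yi < -lam:
            shrunk.append(yi + lam)
        else:
            shrunk.append(0.0)
    return shrunk


y = [3.0, -0.5, 0.2, -2.0]
print(hard_threshold(y, 1.0))
print(soft_threshold(y, 1.0))
```

Hard thresholding leaves surviving coordinates untouched, whereas soft thresholding also shrinks them by λ, which makes it continuous in Y at the cost of extra bias on large coordinates.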
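Finally, we mention that both hard and soft thresholding schemes can be written in the form
$$\hat\mu^{HT} = \operatorname*{argmin}_{\hat\mu}\ \|Y - \hat\mu\|^2 + \lambda^2\|\hat\mu\|_0 \tag{4.66}$$
and
$$\hat\mu^{ST} = \operatorname*{argmin}_{\hat\mu}\ \|Y - \hat\mu\|^2 + 2\lambda\|\hat\mu\|_1. \tag{4.67}$$
To see the first one, note that the objective decomposes coordinate-wise, and for each coordinate we have
$$\hat\mu_i^{HT} = \operatorname*{argmin}_{\hat\mu_i \in \mathbb{R}}\ (Y_i - \hat\mu_i)^2 + \lambda^2\,\mathbf{1}\{\hat\mu_i \ne 0\}. \tag{4.68}$$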
5. MAXIMAL INEQUALITIES: BASIC RESULTS
Before diving into linear regression, we make a brief detour and talk about maximal inequal-
ities. This topic is a precursor to the more detailed study of the suprema of sub-Gaussian
and empirical processes.
First, recall several basic notions. Given a norm ‖·‖ (say, on $\mathbb{R}^d$, although this extends to Banach spaces), the dual norm is defined as
$$\mathbb{E}\max_i Z_i \le \sigma\sqrt{2\log d}, \tag{5.70}$$
and
$$\mathbb{E}\|Z\|_\infty = \mathbb{E}\max_i |Z_i| \le \sigma\sqrt{2\log(2d)} \tag{5.71}$$
Proof. First,
$$\mathbb{E}\|Z\|_2 = \mathbb{E}\sqrt{\|Z\|_2^2} \le \sqrt{\mathbb{E}\|Z\|_2^2} = \sqrt{\sum_{i=1}^d \mathbb{E}Z_i^2} \le \sigma\sqrt{d} \tag{5.72}$$
where we used the fact that variance of a random variable is at most its sub-Gaussian parameter (homework). Next, we prove (5.70). For any λ > 0,
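$$\begin{aligned}
\mathbb{E}\max_i Z_i &= \frac{1}{\lambda}\,\mathbb{E}\max_{i=1,\dots,d}\log\exp\{\lambda Z_i\} && (5.73)\\
&= \frac{1}{\lambda}\,\mathbb{E}\log\max_{i=1,\dots,d}\exp\{\lambda Z_i\} && (5.74)\\
&\le \frac{1}{\lambda}\log\mathbb{E}\max_{i=1,\dots,d}\exp\{\lambda Z_i\} && (5.75)\\
&\le \frac{1}{\lambda}\log\mathbb{E}\sum_{i=1}^d\exp\{\lambda Z_i\} && (5.76)\\
&\le \frac{1}{\lambda}\log\left(d\exp\{\lambda^2\sigma^2/2\}\right) && (5.77)
\end{aligned}$$
which is equal to
$$\frac{\log d}{\lambda} + \frac{\lambda\sigma^2}{2} = \sigma\sqrt{2\log d} \tag{5.78}$$
upon choosing $\lambda = \sqrt{2\log d/\sigma^2}$. The estimate on ‖Z‖∞ follows by considering 2d variables.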
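The bound σ√(2 log d) can be compared with a Monte Carlo estimate of the expected maximum of d independent N(0, σ²) variables, which are one instance of subG(σ²). The sketch below uses a fixed seed and the standard library only; the function names and simulation sizes are illustrative.

```python
import math
import random


def expected_max_gaussian(d: int, sigma: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E max_i Z_i for Z_i i.i.d. N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(d))
    return total / trials


def maximal_inequality_bound(d: int, sigma: float) -> float:
    """Right-hand side of (5.70)."""
    return sigma * math.sqrt(2 * math.log(d))


d, sigma = 100, 1.0
estimate = expected_max_gaussian(d, sigma, trials=2000)
bound = maximal_inequality_bound(d, sigma)

# The choice of lambda in the proof makes the two terms of (5.78) equal.
lam = math.sqrt(2 * math.log(d)) / sigma
optimized = math.log(d) / lam + lam * sigma**2 / 2
print(round(estimate, 3), round(bound, 3), round(optimized, 3))
```

For independent Gaussians the bound is known to be tight up to lower-order terms, and the simulation shows the estimate sitting somewhat below it at d = 100.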
Then
$$\mathbb{E}\max_i Z_i \le \sqrt{2v^2\log d} + b\log d \tag{5.79}$$
6. LINEAR REGRESSION
We now introduce the problem of linear regression, make the connection to the Gaussian
Sequence Model, and motivate the need to study maximal inequalities.
Consider the model
$$Y_i = \langle x_i, \beta^*\rangle + \varepsilon_i, \quad i = 1, \dots, n \tag{6.80}$$
or, in matrix form,
$$Y = X\beta^* + \varepsilon \tag{6.81}$$
where X is the n × d matrix with xi as rows, and $Y = (Y_1, \dots, Y_n)^T$. For now, we will think
of the matrix X as being fixed and given to us (this is called fixed design). In later parts
of the course, we will work under the assumption that x1 , . . . , xn are drawn i.i.d. from a
distribution (i.e. random design).
6.1 Connection to the Gaussian Sequence Model
Multiplying both sides of (6.81) by $\frac{1}{n}X^T$ yields
$$\frac{1}{n}X^T Y = \frac{1}{n}X^T X\beta^* + \frac{1}{n}X^T\varepsilon. \tag{6.82}$$
Consider the following assumption on the matrix X:
Definition 5. If $\frac{1}{n}X^T X = I_d$, we say that the design (that is, the set {x1, . . . , xn}) is orthonormal.
In addition to orthogonality, the above definition implies that ‖xi‖² = n, which is what we would expect if coordinates of xi were independent. In this case, (6.82) becomes
$$\tilde Y = \beta^* + \tilde\varepsilon \tag{6.83}$$
6.3 Unconstrained Least Squares
We now go back to the model (6.80), without the assumption on the matrix X. Our goal
in this section will be to upper bound the fixed-design error.
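Let $\hat\beta$ be the least-squares solution
$$\hat\beta \in \operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \|X\beta - Y\|^2 = \operatorname*{argmin}_{\beta\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n (Y_i - \langle x_i, \beta\rangle)^2 \tag{6.85}$$
and thus
$$\hat\beta = (X^T X)^\dagger X^T Y$$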
where A† denotes the Moore-Penrose inverse.
Rather than using the closed-form solution for the least squares, we will present analysis
based on the optimality of the solution with respect to the empirical error. This analysis is
more general and will hold for constrained least squares beyond linear regression.
First, observe that by optimality,
$$\left\|X\hat\beta - Y\right\|^2 \le \|X\beta^* - Y\|^2 = \|\varepsilon\|^2. \tag{6.86}$$
6.4 Constrained Least Squares
Consider now a modification of the regression model (6.80), where we have the additional
knowledge that β ∗ ∈ K for some set K ⊂ Rd . It then makes sense to minimize squared
error subject to being in K:
$$\hat\beta \in \operatorname*{argmin}_{\beta\in K}\ \|X\beta - Y\|^2 \tag{6.89}$$
Of course, the unconstrained model and the corresponding least squares solution in (6.85) correspond to $K = \mathbb{R}^d$. Since the constrained case subsumes the unconstrained case, we will proceed below with the constrained analysis.
By examining (6.86) and (6.87), the Basic Inequality (6.88) still holds in constrained
least squares.
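$$\left\|X\hat\beta - X\beta^*\right\| \le 2\left\langle\varepsilon, \frac{X\hat\beta - X\beta^*}{\|X\hat\beta - X\beta^*\|}\right\rangle \le 2\sup_{\beta\in K}\left\langle\varepsilon, \frac{X\beta - X\beta^*}{\|X\beta - X\beta^*\|}\right\rangle \tag{6.90}$$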
Note that the right-hand side is now independent of the algorithm/estimator. We can now
treat the right-hand side as a supremum of a collection of random variables indexed by β.
The smaller the collection, the smaller is the upper bound (other things kept equal). Once
we have a high-probability bound on (6.90), we may square both sides to get a bound on the
squared error. We should point out that it is quite surprising that such a simple strategy
works for analyzing least squares. It may appear that the supremum on the right-hand side
of (6.90) can be significantly larger than the middle part of that inequality.
To illustrate the strategy, we consider the unconstrained least squares (also called Or-
dinary Least Squares, OLS).
$$\left\|X\hat\beta - X\beta^*\right\| \le 2\langle\varepsilon, v(\varepsilon)\rangle \tag{6.92}$$
where $v(\varepsilon) = \frac{X\hat\beta - X\beta^*}{\|X\hat\beta - X\beta^*\|}$. Let U = [u1, . . . , ur] be the matrix with orthonormal columns, a basis of the column space of X. Since $X(\hat\beta - \beta^*)$ is in the column space of X, we can write v(ε) = Ua for some $a = a(\varepsilon) \in \mathbb{R}^r$ with ‖a‖ = 1. Then
$$v(\varepsilon) = \frac{X(\hat\beta - \beta^*)}{\|X(\hat\beta - \beta^*)\|} \in S^{n-1} \cap \operatorname{col}(X).$$
Such an approach, at least directly, would lead to rates of the previous section, without
exploiting the structure of K.
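We proceed by avoiding normalization and directly analyzing the basic inequality:
$$\left\|X\hat\beta - X\beta^*\right\|^2 \le 2\langle\varepsilon, X\hat\beta - X\beta^*\rangle \le 2\max_{v\in 2B_1^d}\langle\varepsilon, Xv\rangle = 4\left\|X^T\varepsilon\right\|_\infty \tag{6.95}$$
Let $x_i$ denote the ith column of X. From (5.71), and observing that $\langle x_i, \varepsilon\rangle \in \mathrm{subG}(\|x_i\|^2\sigma^2)$,
$$\mathbb{E}\left\|X^T\varepsilon\right\|_\infty \le \sigma\sqrt{2\log(2d)}\cdot\max_i\|x_i\|$$
A natural normalization of the data is $\|x_i\| \le \sqrt{n}$, in which case we have proved the following result: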
Lemma 17. Assume the regression model Y = Xβ* + ε with ε ∈ subG(σ²) and $K = B_1^d$. Suppose columns of X are normalized to be $\|x_i\| \le \sqrt{n}$. Then the constrained OLS enjoys
$$\frac{1}{n}\mathbb{E}_\varepsilon\left\|X\hat\beta - X\beta^*\right\|^2 \le 4\sigma\sqrt{\frac{2\log(2d)}{n}}. \tag{6.96}$$
Since the analysis of the previous section applies to constrained least squares as well, we have that
$$\frac{1}{n}\mathbb{E}_\varepsilon\left\|X\hat\beta - X\beta^*\right\|^2 \le 4\min\left\{\sigma\sqrt{\frac{2\log(2d)}{n}},\ \frac{r\sigma^2}{n}\right\},$$
where r is the rank of X. Disregarding the logarithmic factors, the transition between the two rates is at $r \sim \sqrt{n}$. The upper bound in (6.96) is sometimes referred to as the “slow
rate,” as opposed to the “fast rate” in (6.91). The fast rate kicks in for problems with small
dimensionality (or rank), while the slow rate wins in high-dimensional situation. We will
see how these two regimes arise more generally in parametric and nonparametric regression
through the lens of covering numbers.
6.7 Sparsity
To close our discussion of linear regression, consider one more example of constrained least
squares, where β ∗ is known to be sparse. In other words, assume that
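$$\left\|X\hat\beta - X\beta^*\right\| \le 2\max_{S\subset[d],\,|S|\le 2k}\ \sup_{a\in B_2^{2k}}\langle U_S^T\varepsilon, a\rangle = 2\max_{S\subset[d],\,|S|\le 2k}\left\|U_S^T\varepsilon\right\| \tag{6.98}$$
and
$$\begin{aligned}
\mathbb{E}\left\|X\hat\beta - X\beta^*\right\|^2 &\le 4\,\mathbb{E}\max_{S\subset[d],\,|S|\le 2k}\left\|U_S^T\varepsilon\right\|^2 && (6.99)\\
&\le 4\,\mathbb{E}\max_{S\subset[d],\,|S|\le 2k}\left\{\left\|U_S^T\varepsilon\right\|^2 - \mathbb{E}\left\|U_S^T\varepsilon\right\|^2\right\} + \max_{S\subset[d],\,|S|\le 2k}\mathbb{E}\left\|U_S^T\varepsilon\right\|^2. && (6.100)
\end{aligned}$$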
for some absolute constant C. The second term is at most 2kσ², a lower-order term. With the standard estimate $\binom{d}{i} \le \left(\frac{ed}{i}\right)^i$, we conclude that
$$\mathbb{E}\left\|X\hat\beta - X\beta^*\right\|^2 \lesssim \sigma^2\,\frac{k\log(d/k)}{n}. \tag{6.101}$$
Clearly, the definition extends beyond Rd to any metric space (X, ρ). If V ⊂ K, the cover
is called proper.
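Proof. We use the following volume argument. We add centers $v_1, v_2, \dots \in B_2^d$ such that ‖vi − vj‖ > ε for every i ≠ j, until no such additional point exists. Let N be the size of this set, which is clearly an ε-net. Then
$$N\cdot\operatorname{vol}\left(\frac{\varepsilon}{2}B_2^d\right) \le \operatorname{vol}\left(B_2^d + \frac{\varepsilon}{2}B_2^d\right) \quad\Rightarrow\quad N \le \frac{\left(1 + \frac{\varepsilon}{2}\right)^d}{\left(\frac{\varepsilon}{2}\right)^d} = \left(1 + \frac{2}{\varepsilon}\right)^d.$$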
Since we can start the iterative process of placing ε-balls with v1 = 0, we can assume
without loss of generality that the minimal cover contains 0.
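The greedy construction in the proof can be imitated on a computer. The sketch below runs it on a finite grid inside the unit disk (d = 2), which is a simplification of ours: the resulting centers are ε-separated points of the disk, so the volume bound applies to them, but they are only guaranteed to cover the grid rather than every point of the disk. Processing points in order of increasing norm makes the first center the origin.

```python
import math


def greedy_separated_set(points: list[tuple[float, float]], eps: float):
    """Add a point as a new center whenever it is > eps away from all centers."""
    centers = []
    for p in points:
        if all(math.hypot(p[0] - c[0], p[1] - c[1]) > eps for c in centers):
            centers.append(p)
    return centers


step, eps = 0.05, 0.5
grid = [
    (i * step, j * step)
    for i in range(-20, 21)
    for j in range(-20, 21)
    if (i * step) ** 2 + (j * step) ** 2 <= 1.0 + 1e-12
]
grid.sort(key=lambda p: math.hypot(p[0], p[1]))  # start the construction at 0

centers = greedy_separated_set(grid, eps)
volume_bound = (1 + 2 / eps) ** 2  # (1 + 2 / eps)^d with d = 2
print(len(centers), volume_bound)
```

The number of centers found stays below the volume bound of 25, as the packing argument guarantees for any ε-separated subset of the disk.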
Lemma 19. Let V be a cover of $B_2^d$ at scale ε ∈ (0, 1) with respect to ‖·‖₂. Then for any $x \in \mathbb{R}^d$,
$$\max_{u\in B_2^d}\langle u, x\rangle \le \frac{1}{1 - \varepsilon}\max_{v\in V}\langle v, x\rangle. \tag{7.102}$$
Proof. For any $u \in B_2^d$, there exists v ∈ V such that ‖u − v‖ ≤ ε (i.e. $u - v \in \varepsilon B_2^d$). Since we have ⟨u, x⟩ = ⟨v, x⟩ + ⟨u − v, x⟩, it also holds that
By linearity, the last term is $\varepsilon\max_{w\in B_2^d}\langle w, x\rangle$. Since the choice of $u \in B_2^d$ was arbitrary, the statement follows by rearranging the terms.
We remark that this lemma trivially extends to norms beyond Euclidean, as long as the
ball is covered in the very norm with respect to which it is defined. This situation is rather
special, and we will use this comparison result only a couple of times in this course, with a
constant ε.
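As a quick sanity check of Lemma 19, the sketch below builds an explicit cover of the unit disk (d = 2) from a square grid and compares both sides of (7.102) for a few random x. The grid construction, the helper name `lemma19_sides`, and ε = 1/2 are assumptions made for illustration; the cover is improper (some grid points lie slightly outside the ball), which the proof above permits since it only uses $u - v \in \varepsilon B_2^d$.

```python
import numpy as np

eps, d = 0.5, 2
h = eps * np.sqrt(2)              # grid spacing: nearest grid point is within h*sqrt(d)/2 = eps
k = int(np.ceil((1 + eps) / h))
axis = h * np.arange(-k, k + 1)
grid = np.array([(a, b) for a in axis for b in axis])
# Grid points within eps of the unit disk form a cover of the disk at scale eps.
V = grid[np.linalg.norm(grid, axis=1) <= 1 + eps + 1e-12]

def lemma19_sides(x):
    """Return (max over the unit ball of <u, x>, (1 - eps)^{-1} max over V of <v, x>)."""
    return np.linalg.norm(x), (V @ x).max() / (1 - eps)

rng = np.random.default_rng(1)
for _ in range(5):
    lhs, rhs = lemma19_sides(rng.standard_normal(d))
    print(f"{lhs:.3f} <= {rhs:.3f}")
```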
where V is a minimal 1/2-cover of $B_2^d$. Since $|V| \le 5^d$ by Lemma 18, we can conclude from
(5.70) that
\[
\mathbb{E}\|Z\| = \mathbb{E}\max_{u\in B_2^d}\langle u, Z\rangle \le 2\sqrt{2\sigma^2\log(5^d)} \le C\sigma\sqrt{d} \tag{7.104}
\]
for $C = 2\sqrt{2\log 5}$. While the approach through finite discretization appears to “unify” both
(5.69) and (5.71) (in the latter case, the set is already discrete), it did not recover the same
constant 1 as in (5.69).
where $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$. This norm is also known as the spectral norm since $\|A\| = \sqrt{\lambda_{\max}(A^T A)}$,
the square root of the largest eigenvalue of $A^T A$, which is also the largest singular
value of A, which we shall denote as $\sigma_{\max}(A)$. We emphasize our convention that the
unadorned norm $\|\cdot\|$ for vectors stands for the Euclidean norm (unless stated otherwise),
and the unadorned norm $\|\cdot\|$ for matrices will stand for the operator norm.
We have the following extension of Lemma 19, see [33, p. 84].
Lemma 20. Let ε ∈ (0, 1/2). Let V and U be proper ε-nets of $B_2^p$ and $B_2^q$, respectively,
with respect to Euclidean norm. Without loss of generality, assume 0 ∈ V, U. Then for
any $A \in \mathbb{R}^{p\times q}$,
\[
\max_{v\in V,\,u\in U} v^T A u \le \|A\| \le \frac{1}{1-2\varepsilon}\max_{v\in V,\,u\in U} v^T A u. \tag{7.105}
\]
Proof. Let kAk = kAxk for kxk = 1 and let u ∈ U be such that kx − uk ≤ ε. Then
Taking maximum over u, and noting that (1 − ε)−2 ≤ (1 − 2ε)−1 , the upper bound follows.
The lower bound is immediate since the ε-nets are proper.
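For the second statement, let x be such that $\|A\| = |x^T A x|$. For any v with
$\|x - v\| \le \varepsilon$, we have
\[
\langle x, Ax\rangle = \langle v, Av\rangle + \langle v, A(x - v)\rangle + \langle x - v, Ax\rangle \le \langle v, Av\rangle + 2\varepsilon\|A\|
\]
and thus
\[
\|A\| \le |\langle v, Av\rangle| + 2\varepsilon\|A\| \le \max_{v\in V}|\langle v, Av\rangle| + 2\varepsilon\|A\|.
\]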
Proof. Let V, U be, respectively, 1/4-nets for $B_2^p$ and $B_2^q$, of size $9^p$ and $9^q$ as guaranteed by
Lemma 18. From Lemma 20,
For any fixed v ∈ V, u ∈ U, the random variable $v^T A u = \sum_{i=1}^p\sum_{j=1}^q A_{i,j} v_i u_j$ is sub-Gaussian
with variance proxy $\sigma^2\sum_{i=1}^p\sum_{j=1}^q v_i^2 u_j^2 \le \sigma^2$. Hence,
\[
P\left(v^T A u \ge t\sigma\right) \le \exp\{-t^2/2\}.
\]
Note: as a corollary, for a symmetric (Wigner) random matrix $A \in \mathbb{R}^{p\times p}$ with independent
$\sigma^2$-sub-Gaussian entries above the diagonal,
\[
\mathbb{E}\|A\| \lesssim \sigma\sqrt{p}.
\]
This holds by applying the above lemma separately to the upper and lower triangular
components of A.
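The $\sigma\sqrt{p}$ scaling for symmetric matrices is easy to observe in simulation. A minimal sketch, assuming standard Gaussian entries (so σ = 1) on and above the diagonal; the dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
ratios = []
for p in (100, 200, 400):
    G = rng.standard_normal((p, p))
    # Symmetric matrix: independent entries on/above the diagonal, mirrored below.
    A = sigma * (np.triu(G) + np.triu(G, 1).T)
    op_norm = np.linalg.norm(A, 2)          # operator norm = largest singular value
    ratios.append(op_norm / (sigma * np.sqrt(p)))
    print(f"p={p}: ||A|| / (sigma * sqrt(p)) = {ratios[-1]:.2f}")
```

The printed ratio stays of constant order as p grows, which is what the bound $\mathbb{E}\|A\| \lesssim \sigma\sqrt{p}$ predicts.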
8. COVARIANCE ESTIMATION
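Lemma 22. Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be i.i.d. with mean zero and $X_i \in \mathrm{subG}(\sigma^2)$. Let
$\mathbb{E}X_i X_i^T = \Sigma$ and let $\hat\Sigma$ be the sample covariance matrix. Then
\[
P\left( \|\hat\Sigma - \Sigma\| \ge C\sigma^2\max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \right) \le 2\exp\{-t\} \tag{8.112}
\]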
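Now, $X_i \in \mathrm{subG}(\sigma^2)$ implies that for any $u \in B_2^d$, the random variable $\langle X_i, u\rangle^2 - \langle u, \Sigma u\rangle$
is sub-exponential with parameters $(c\sigma^2, c\sigma^2)$ for some absolute constant c. From (3.22),
rescaling by $c\sigma^2$,
\[
P\left( |\langle u, Qu\rangle| \ge c\varepsilon\sigma^2 \right) \le 2\exp\{-n\min\{\varepsilon, \varepsilon^2\}\}
\]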
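Taking a union bound over the discretization,
\[
P\left( \exists v\in V : |\langle v, Qv\rangle| \ge c\varepsilon\sigma^2 \right) \le 17^d\cdot 2\exp\{-n\min\{\varepsilon, \varepsilon^2\}\}
\]
and thus
\[
P\left( \exists v\in V : |\langle v, Qv\rangle| \ge \varepsilon\sigma^2 \right) \le 2\exp\{-c(n\min\{\varepsilon, \varepsilon^2\} - d)\}
\]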
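Now for some t > 0, choose
\[
\varepsilon = \max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\}.
\]
Then
\[
\min\{\varepsilon, \varepsilon^2\} = \frac{d}{n} + \frac{t}{n}
\]
and
\[
n\min\{\varepsilon, \varepsilon^2\} - d = t.
\]
This yields
\[
P\left( \|\hat\Sigma - \Sigma\| \ge 2\sigma^2\max\left\{ \sqrt{\frac{d}{n} + \frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \right) \le 2\exp\{-ct\}
\]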
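To get a feel for this rate, one can simulate the operator-norm error of the sample covariance. The sketch below assumes Gaussian data with Σ = I_d (so σ² is of constant order) and drops the t/n terms; `covariance_error` is a hypothetical helper name.

```python
import numpy as np

def covariance_error(n, d, rng):
    """Operator-norm error of the sample covariance for N(0, I_d) data."""
    X = rng.standard_normal((n, d))          # rows X_i, mean zero, Sigma = I_d
    Sigma_hat = X.T @ X / n                  # the mean is known to be zero
    return np.linalg.norm(Sigma_hat - np.eye(d), 2)

rng = np.random.default_rng(0)
d = 50
results = []
for n in (500, 2000, 8000):
    err = covariance_error(n, d, rng)
    rate = max(np.sqrt(d / n), d / n)
    results.append((n, err, rate))
    print(f"n={n}: error={err:.3f}, max(sqrt(d/n), d/n)={rate:.3f}")
```

The error tracks the rate up to a constant factor, and shrinks roughly by half each time n is multiplied by four.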
Note that if $X_i \sim N(0, \Sigma)$, we have $\langle v, X_i\rangle \sim N(0, v^T\Sigma v)$. Since for any unit vector v,
$v^T\Sigma v \le \|\Sigma\|$, $X_i$ is a sub-Gaussian vector with variance proxy at most $\|\Sigma\|$. More generally,
if we assume that the sub-Gaussian parameter of $X_i$ is at most $C\|\Sigma\|$ for some constant C,
then we have the following corollary:
Furthermore,
\[
\mathbb{E}\|\hat\Sigma - \Sigma\| \lesssim \|\Sigma\|\max\left\{ \sqrt{\frac{d}{n}},\ \frac{d}{n} \right\}
\]
As discussed before, if d = o(n), then the sample covariance $\hat\Sigma$ is a consistent estimator of
Σ and we have an explicit rate. Let us mention one more result, in terms of effective rank
(or, stable rank)
\[
r(\Sigma) = \frac{\mathrm{tr}(\Sigma)}{\|\Sigma\|}. \tag{8.114}
\]
Note that a similar quantity arose in mean estimation in high dimension. The numerator
here is the sum of eigenvalues of Σ, while the denominator is the largest eigenvalue. The
ratio has the right “units” to qualify for a notion of a dimension. If Σ = Id , we have
tr(Σ) = d and kΣk = 1. More generally, effective rank can be small even though d is large,
as long as the eigenvalues decay fast enough.
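The effective rank (8.114) is straightforward to compute. A small sketch, with the function name `effective_rank` and the two test covariances chosen for illustration only:

```python
import numpy as np

def effective_rank(Sigma):
    """r(Sigma) = tr(Sigma) / ||Sigma|| for a symmetric positive semidefinite Sigma."""
    eigvals = np.linalg.eigvalsh(Sigma)      # for PSD Sigma, ||Sigma|| is the top eigenvalue
    return eigvals.sum() / eigvals.max()

d = 1000
identity_rank = effective_rank(np.eye(d))                          # equals d
decaying = np.diag(1.0 / np.arange(1, d + 1) ** 2)                 # eigenvalues 1/j^2
decaying_rank = effective_rank(decaying)                           # stays below pi^2/6
print(identity_rank, decaying_rank)
```

With eigenvalues decaying like 1/j², the effective rank stays bounded even though the ambient dimension is 1000.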
The more general result says that if $X_i$ are sub-Gaussian centered vectors such that the
sub-Gaussian parameter of any one-dimensional projection $\langle X_i, u\rangle$ is at most a constant
multiple of its variance, then
\[
\mathbb{E}\|\hat\Sigma - \Sigma\| \lesssim \|\Sigma\|\max\left\{ \sqrt{\frac{r(\Sigma)}{n}},\ \frac{r(\Sigma)}{n} \right\}
\]
The reason we had the two-tailed behavior of the spectral norm $\|\Sigma - \hat\Sigma\|$ is that $\hat\Sigma$ is
an average of “squares” of sub-Gaussian random variables. If you recall, in the earlier
lecture, we deduced pure sub-Gaussian tails by taking the square root of the random variable.
The analogue here will be the singular values $\sigma_i(X)$ of the data matrix $X \in \mathbb{R}^{n\times d}$ which
has rows $X_i^T$.
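Recall that singular values of the matrix X and eigenvalues of sample covariance $\hat\Sigma$ are
related as
\[
\sigma_j(X) = \sqrt{\lambda_j(X^T X)}
\]
or, rescaling,
\[
\sigma_j\left(\tfrac{1}{\sqrt{n}}X\right) = \sqrt{\lambda_j\left(\tfrac{1}{n}X^T X\right)} = \sqrt{\lambda_j(\hat\Sigma)}.
\]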
Weyl’s Inequality then says that
\[
\max_{j=1,\ldots,d} |\lambda_j(\hat\Sigma) - \lambda_j(\Sigma)| \le \|\hat\Sigma - \Sigma\| \tag{8.115}
\]
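Suppose for the purposes of illustration that $\Sigma = I_d$ (i.e. the random variables are isotropic).
Then our results tell us that
\[
\|\hat\Sigma - I_d\| \lesssim \sqrt{\frac{d}{n} + \frac{t}{n}} \vee \left(\frac{d}{n} + \frac{t}{n}\right)
\]
with probability at least $1 - 2e^{-t}$, which means that for all i = 1, . . . , d,
\[
\left|\sigma_i^2\left(\tfrac{1}{\sqrt{n}}X\right) - 1\right| \lesssim \sqrt{\frac{d}{n} + \frac{t}{n}} \vee \left(\frac{d}{n} + \frac{t}{n}\right)
\]
Since $\max(|z - 1|, |z - 1|^2) \le |z^2 - 1|$ for z ≥ 0, we get
\[
\max\left\{ \left|\sigma_i\left(\tfrac{1}{\sqrt{n}}X\right) - 1\right|,\ \left|\sigma_i\left(\tfrac{1}{\sqrt{n}}X\right) - 1\right|^2 \right\} \le \left|\sigma_i^2\left(\tfrac{1}{\sqrt{n}}X\right) - 1\right|
\]
which implies, after rescaling, that
\[
|\sigma_i(X) - \sqrt{n}| \lesssim \sqrt{d} + \sqrt{t}
\]
with probability at least 1 − 2 exp{−t}. In other words, the singular values of a tall (n > d)
matrix X with sub-Gaussian isotropic rows are tightly concentrated in the
interval $[\sqrt{n} - C\sqrt{d},\ \sqrt{n} + C\sqrt{d}]$. This result holds in more generality than stated here, and
we refer to [32, 33].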
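This concentration of singular values is easy to see numerically. A minimal sketch, assuming a Gaussian matrix (a special case of sub-Gaussian isotropic rows) and arbitrary sizes n = 2000, d = 50:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50
X = rng.standard_normal((n, d))               # tall matrix with isotropic rows
s = np.linalg.svd(X, compute_uv=False)        # all d singular values

print(f"sqrt(n) = {np.sqrt(n):.1f}, sqrt(d) = {np.sqrt(d):.1f}")
print(f"singular values lie in [{s.min():.1f}, {s.max():.1f}]")
```

All singular values fall within a few multiples of √d around √n.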
9. SPECTRAL METHODS
Y = X + E. (9.116)
We will be mainly interested in estimating top eigenvector(s) of X from the noisy observation
Y. For now, however, we think of E as a non-random perturbation of X. How does this
perturbation affect the spectral properties? While Weyl’s inequality (see e.g. (8.115)) tells
us that eigenvalues do not change much when the perturbation of the matrix is small in
spectral norm, it does not say anything about closeness of eigenvectors. So, it is natural to
ask: Are eigenvectors of X and Y close if $\|X - Y\|$ is small?
To provide some intuition, consider the following example, with some δ > 0:
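\[
X = \begin{pmatrix} 1+\delta & 0 \\ 0 & 1-\delta \end{pmatrix}, \qquad
E = \begin{pmatrix} -\delta & \delta \\ \delta & \delta \end{pmatrix}, \qquad
Y = X + E = \begin{pmatrix} 1 & \delta \\ \delta & 1 \end{pmatrix} \tag{9.117}
\]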
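The instability in this example can be verified directly. The sketch below takes δ = 10⁻³ (an arbitrary small value) and computes the top eigenvectors of X and Y with `numpy.linalg.eigh`, which returns eigenvalues in ascending order.

```python
import numpy as np

delta = 1e-3
X = np.array([[1 + delta, 0.0], [0.0, 1 - delta]])
E = np.array([[-delta, delta], [delta, delta]])
Y = X + E

u1 = np.linalg.eigh(X)[1][:, -1]      # top eigenvector of X: the first basis vector, up to sign
v1 = np.linalg.eigh(Y)[1][:, -1]      # top eigenvector of Y: (1, 1)/sqrt(2), up to sign
cos_theta = abs(u1 @ v1)
perturbation = np.linalg.norm(E, 2)   # equals sqrt(2) * delta

print(f"||E|| = {perturbation:.2e}, angle = {np.degrees(np.arccos(cos_theta)):.1f} degrees")
```

The perturbation has norm √2·δ, yet the top eigenvector rotates by 45 degrees, because the eigengap 2δ of X is of the same order as the perturbation.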
where the last inner product is to be understood as trace. The last expression is at most
\[
\|X - Y\|\cdot\|u_1u_1^T - v_1v_1^T\|_1 \le \|X - Y\|\cdot\|u_1u_1^T - v_1v_1^T\|_F\cdot\sqrt{2}
\]
where $\|\cdot\|_1$ is the nuclear norm ($\ell_1$ of eigenvalues) and $\|\cdot\|_F$ is the Frobenius norm ($\ell_2$ of
eigenvalues).
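On the one hand,
\[
\|u_1u_1^T - v_1v_1^T\|_F^2 = 2 - 2\langle u_1, v_1\rangle^2 = 2\sin^2(\theta).
\]
On the other hand, the values $\langle u_1, Xu_1\rangle$ and $\langle v_1, Xv_1\rangle$ should be different if the angle is
large and there is a gap in the eigenvalues $\lambda_1$ and $\lambda_2$. More precisely,
\[
\langle v_1, Xv_1\rangle = \sum_{j=1}^d \lambda_j\langle u_j, v_1\rangle^2 \le \lambda_1\langle u_1, v_1\rangle^2 + \lambda_2(1 - \langle u_1, v_1\rangle^2) = \lambda_1\cos^2(\theta) + \lambda_2\sin^2(\theta)
\]
since $\sum_{j=1}^d \langle u_j, v_1\rangle^2 = \|U^T v_1\|^2 = \|v_1\|^2 = 1$ for $U = [u_1, \ldots, u_d]$. Hence,
\[
\langle u_1, Xu_1\rangle - \langle v_1, Xv_1\rangle \ge \lambda_1 - \lambda_1\cos^2(\theta) - \lambda_2\sin^2(\theta) = (\lambda_1 - \lambda_2)\sin^2(\theta). \tag{9.119}
\]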
We conclude that
(λ1 − λ2 ) sin2 (θ) ≤ 2 kX − Y k sin(θ).
The analogous analysis with hv 1 , Y v 1 i − hu1 , Y u1 i as a starting point yields (µ1 − µ2 ) in the
denominator. We can take the best of these two bounds by introducing the maximum.
The theorem says that the top eigenvectors of X and Y are close (i.e. the sine of the
angle is small) if the gap between the top two eigenvalues of either X or Y is large compared
to the spectral norm of the difference of these two matrices. Recall that the lack of this
favorable comparison was exactly the reason for the instability in (9.117).
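The bound derived above, $\sin\theta \le 2\|X - Y\| / \max\{\lambda_1 - \lambda_2, \mu_1 - \mu_2\}$, can be checked on a random instance. This is a sketch under the assumption that X and Y are symmetric; the matrices, the noise level 0.05, and the helper name `top_eig` are illustrative.

```python
import numpy as np

def top_eig(M):
    """Return (largest eigenvalue, second largest eigenvalue, top eigenvector)."""
    vals, vecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    return vals[-1], vals[-2], vecs[:, -1]

rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
X = (G + G.T) / 2                             # symmetric "signal" matrix
N = rng.standard_normal((d, d))
E = 0.05 * (N + N.T) / 2                      # small symmetric perturbation
Y = X + E

lam1, lam2, u1 = top_eig(X)
mu1, mu2, v1 = top_eig(Y)
sin_theta = np.sqrt(max(0.0, 1 - (u1 @ v1) ** 2))
bound = 2 * np.linalg.norm(E, 2) / max(lam1 - lam2, mu1 - mu2)
print(f"sin(theta) = {sin_theta:.4f}, bound = {bound:.4f}")
```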
A few remarks:
• The statement of the theorem presented here is in terms of the gap between the eigen-
values of either X or Y . This form (see [36]) will be useful in statistical applications,
as we often have control on the gaps of the signal matrix X. Other versions in the
literature state the upper bound in terms of gaps between eigenvalues of X and the
corresponding eigenvalues of Y . In this case, one can use Weyl’s inequality to pass to
the gap on the signal matrix X only.
• We stated the result for the top eigenvector. More general results can be found in the
literature (e.g. [36]), for intermediate eigenvalues and eigenspaces.
mixture model of two Gaussian populations with means at u and −u: X = εu + Z, where
ε ∈ {±1} is a Rademacher random variable. Here, again, $\mathbb{E}XX^T = uu^T + \sigma^2 I_d$. This
covariance structure, present in both models, is the subject of the following investigation.
To this end, consider the following simple model. We assume that $X_1, \ldots, X_n \in \mathbb{R}^d$
are centered i.i.d. random variables with $\mathbb{E}X_iX_i^T = \Sigma$ and $X_i \in \mathrm{subG}(\|\Sigma\|)$. Assume that
the population covariance matrix has the following structure, called the Spiked Covariance
Model:
\[
\Sigma = \lambda uu^T + I_d \tag{9.120}
\]
for some fixed $u \in \mathbb{R}^d$, $\|u\| = 1$. The parameter λ ≥ 1 here determines the signal-to-noise
ratio, the strength of the “spike.” Clearly, the top eigenvector of Σ is $u_1 = u$, corresponding
to the eigenvalue 1 + λ. The question is whether this spike persists in the sample covariance
matrix $\hat\Sigma$. To this end, we view $Y = \hat\Sigma$ as a randomly perturbed observation of a signal
matrix X = Σ with $E = \hat\Sigma - \Sigma$, as in (9.116). Let $v_1$ be the leading eigenvector of $\hat\Sigma$.
Since $-v_1$ is also an eigenvector of $\hat\Sigma$, we can only determine closeness to $u_1$ up to a sign.
Observe that
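We also have that the gap $\lambda_1 - \lambda_2$ of the top two eigenvalues of Σ is λ, while $\|\Sigma\| = 1 + \lambda$.
Together with results of the previous lecture,
\[
\min_{\varepsilon\in\{\pm 1\}} \|\varepsilon v_1 - u_1\| \lesssim \frac{1+\lambda}{\lambda}\max\left\{ \sqrt{\frac{d}{n}} + \sqrt{\frac{t}{n}},\ \frac{d}{n} + \frac{t}{n} \right\} \tag{9.122}
\]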
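A short simulation illustrates how well the spike is recovered. The sketch assumes a Gaussian instance of the spiked model, $X_i = \sqrt{\lambda}\,g_i u + Z_i$ with $g_i \sim N(0,1)$ and $Z_i \sim N(0, I_d)$, and ignores the t-dependent terms; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 2000, 20, 4.0
u = np.zeros(d)
u[0] = 1.0                                     # the spike direction, ||u|| = 1

# Rows X_i = sqrt(lam) * g_i * u + Z_i have covariance lam * u u^T + I_d.
X = np.sqrt(lam) * rng.standard_normal((n, 1)) * u + rng.standard_normal((n, d))
Sigma_hat = X.T @ X / n
v1 = np.linalg.eigh(Sigma_hat)[1][:, -1]       # leading eigenvector of the sample covariance

err = min(np.linalg.norm(v1 - u), np.linalg.norm(-v1 - u))   # distance up to sign
rate = (1 + lam) / lam * max(np.sqrt(d / n), d / n)
print(f"error = {err:.3f}, (1 + lam)/lam * max(sqrt(d/n), d/n) = {rate:.3f}")
```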
a matrix made up of four blocks, where $J_{d/2}$ is a d/2 × d/2 matrix of all 1’s.
We now view the observation Y = A as a noisy value of the signal matrix $X = \mathbb{E}A$, as in
(9.116), with $E = A - \mathbb{E}A$. Let us examine the eigenstructure of $\mathbb{E}A$. The first (normalized)
eigenvector is $u_1 = \frac{1}{\sqrt{d}}\mathbf{1}$ and the corresponding eigenvalue is $\lambda_1 = d(p + q)/2$. This vector
is not informative. The second eigenvector is $u_2 = \frac{1}{\sqrt{d}}[1, \ldots, 1, -1, \ldots, -1]^T$, with the
corresponding eigenvalue $\lambda_2 = d(p - q)/2$ (the rest of the eigenvalues are 0). Interestingly,
this second eigenvector contains community memberships. Note that not knowing identity
of the vertices means that the rows/columns of A are renamed, or permuted. This only
permutes the corresponding coordinates of the eigenvectors. Hence, we have the hope that
the second eigenvector $v_2$ of A also contains the necessary information about the community
memberships. In what follows, we will show that clustering vertices into two communities
according to the sign of $v_2(i)$ for each vertex i leads to only a constant number of errors.
This algorithm is known as spectral clustering.
Since we are aiming to recover the second rather than first eigenvector, we need to
appeal to a more general version of Davis-Kahan, which has min{λ1 − λ2 , λ2 − λ3 } instead
of λ1 − λ2 in the denominator of (9.118):
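\[
\min_{\varepsilon\in\{\pm 1\}} \|\varepsilon v_2 - u_2\| \le \sqrt{2}\sin(\angle(u_2, v_2)) \le \frac{2\sqrt{2}\,\|A - \mathbb{E}A\|}{\min\{\lambda_1 - \lambda_2,\ \lambda_2 - \lambda_3\}} \tag{9.123}
\]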
Now recall that from Lemma 21, with high probability, $\|A - \mathbb{E}A\| \lesssim \sqrt{d}$, where we use the fact
that each entry is sub-Gaussian (recall that we need to apply the lemma separately to upper
and lower triangular parts of the matrix). On the other hand,
i.e. only a constant number (out of d) vertices are misclassified by the spectral clustering
algorithm which separates the nodes into two clusters according to the sign of the second
eigenvector of A.
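The spectral clustering procedure described above can be sketched in a few lines. The sketch assumes a two-block stochastic block model with equal community sizes, within-community edge probability p and across-community probability q; the values d = 400, p = 0.6, q = 0.1 are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, q = 400, 0.6, 0.1
labels = np.repeat([1, -1], d // 2)                       # true community memberships

# Edge probabilities: p within a community, q across communities.
probs = np.where(np.equal.outer(labels, labels), p, q)
upper = np.triu(rng.random((d, d)) < probs, 1)            # independent edges above the diagonal
A = (upper + upper.T).astype(float)                       # symmetric adjacency, zero diagonal

v2 = np.linalg.eigh(A)[1][:, -2]                          # eigenvector of the second largest eigenvalue
guess = np.sign(v2)
errors = min(np.sum(guess != labels), np.sum(guess != -labels))   # memberships are defined up to sign
print(f"misclassified vertices: {errors} out of {d}")
```

Here the gap $\min\{\lambda_1 - \lambda_2, \lambda_2 - \lambda_3\}$ of $\mathbb{E}A$ is of order d, while the noise has operator norm of order √d, so very few vertices are misclassified.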
By now you have seen a number of finite-sample guarantees: estimation of a mean vector,
matrix estimation, constrained and unconstrained linear regression. In all the examples, the
key technical step was a control of the maximum of some collection of random variables.
Over the next few lectures, we will extend the toolkit to arbitrary classes of functions
and then apply it to questions of parametric and nonparametric estimation and statistical
learning.
10.1 Kolmogorov’s Goodness-of-Fit test
Given n independent draws of a real-valued random variable X, you may want to ask
whether it has a hypothesized distribution with cdf F0 . For instance, can you test the
hypothesis that heights of people are $N(63, 3^2)$ (in inches)? Of course, we can try to see if
the sample mean is “close” to the mean of the hypothesized distribution. We can also try
the median, or some quantiles. In fact, we can try to compare all the quantiles at once and
see if they match the quantiles of F0 . It turns out that comparing “all quantiles” is again
a question about control of a maximum of a collection of correlated random variables. We
will make this connection precise.
If you have taken a course on statistics, you might have seen several approaches to the
hypothesis testing problem of whether X has a given distribution. One classical approach
is the Kolmogorov-Smirnov test. Let
F (θ) = P (X ≤ θ)
be the cdf of X, and let
\[
F_n(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta\}
\]
be the empirical cdf obtained from n examples. While for a single θ, the random variable
|F (θ)−Fn (θ)| converges to zero almost surely by the Laws of Large Numbers, the analogous
convergence of
Dn = sup |F (θ) − Fn (θ)|
θ
to zero (that is, convergence uniform in θ) is less clear since we have a maximum of an
uncountable collection of correlated random variables.
Nevertheless, the Glivenko-Cantelli Theorem (1933) states that
Dn → 0 a.s.
Hence, given a candidate F , one can test whether X has distribution with cdf F , but
for this we need to know the (asymptotic) distribution of Dn . Assuming continuity of F ,
Kolmogorov (1933) showed that the distribution of Dn does not depend on the law of X,
and he calculated the asymptotic distribution (now known as the Kolmogorov distribution).
Without going into details, we can observe that F (X) has cdf of a uniform random variable
supported on [0, 1], and this transformation does not change the supremum. Hence, it is
enough to calculate $D_n$ for the uniform distribution on [0, 1]. $D_n$ fluctuates on the order of
$1/\sqrt{n}$ and
\[
\sqrt{n}\,D_n \longrightarrow \sup_{\theta\in\mathbb{R}} |B(F(\theta))|
\]
in distribution.
Here B(x) is a Brownian bridge on [0, 1] (a continuous-time stochastic process with distri-
bution being Wiener process conditioned on being pinned to 0 at the endpoints).
In particular, Kolmogorov in his 1933 paper calculates the asymptotic distribution, as
well as a table of a few values. For instance, he states that
\[
P\left(D_n \le 2.4/\sqrt{n}\right) \longrightarrow 0.999973 \quad \text{(approximately)}.
\]
In the spirit of this course, we will take a non-asymptotic approach to this problem. While
we might not obtain such sharp constants, the deviation inequalities will be valid for finite
n.
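The statistic $D_n$ is simple to compute exactly for a given sample. The sketch below does this for the uniform distribution on [0, 1], to which the general continuous case reduces as noted above; `ks_statistic_uniform` is a hypothetical helper name. It uses the fact that the supremum is attained at, or just to the left of, a data point.

```python
import numpy as np

def ks_statistic_uniform(x):
    """D_n = sup over theta of |F(theta) - F_n(theta)| for F the Uniform[0, 1] cdf."""
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    # F_n jumps from (i-1)/n to i/n at the i-th order statistic.
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

rng = np.random.default_rng(0)
scaled = {}
for n in (100, 1000, 10000):
    D = ks_statistic_uniform(rng.random(n))
    scaled[n] = np.sqrt(n) * D
    print(f"n={n}: D_n = {D:.4f}, sqrt(n) * D_n = {scaled[n]:.3f}")
```

While $D_n$ decreases with n, the rescaled value $\sqrt{n}D_n$ stays of constant order.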
We will now come to the same question of uniform deviations from a different angle –
Statistical Learning Theory.
10.2 Statistical Learning and Empirical Risk Minimization
Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be n i.i.d. copies of a random variable $(X, Y) \in \mathcal{X}\times\mathcal{Y}$
with distribution $P = P_X \times P_{Y|X}$, where the X variable lives in some abstract space $\mathcal{X}$ and
$\mathcal{Y} \subseteq \mathbb{R}$. Fix a loss function $\ell : \mathcal{Y}\times\mathcal{Y} \to \mathbb{R}$ and a class of functions $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$. We
may think of $\mathcal{F}$ as a set of neural networks, or decision trees, or whatever model you may
have. Given the dataset S, the empirical risk minimization (ERM) method is defined as
\[
\hat f \in \operatorname*{argmin}_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)
\]
Examples:
• Linear regression: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$, $\mathcal{F} = \{x \mapsto \langle w, x\rangle : w \in \mathbb{R}^d\}$, $\ell(a, b) = (a - b)^2$
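For the linear regression example, ERM with square loss is ordinary least squares. A minimal sketch on synthetic data, where the data-generating model (Gaussian design, noise level 0.5) and the names `w_true`, `empirical_risk` are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
Y = X @ w_true + 0.5 * rng.standard_normal(n)

def empirical_risk(w):
    """Average square loss of x -> <w, x> on the sample."""
    return np.mean((X @ w - Y) ** 2)

# ERM over the linear class with square loss is the least squares solution.
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(f"empirical risk of ERM: {empirical_risk(w_hat):.4f}")
print(f"empirical risk of w_true: {empirical_risk(w_true):.4f}")
```

By construction, the empirical risk of the ERM solution is no larger than that of any other element of the class, including the vector that generated the data.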
A central question in Statistical Learning is: what is an upper bound on the expected error
of ERM?
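Proof. Suppose without loss of generality that the infimum is attained, i.e. $L(f^*) = \inf_{f\in\mathcal{F}} L(f)$ for some $f^* \in \mathcal{F}$. The decomposition holds:
\[
L(\hat f) - L(f^*) = \left[ L(\hat f) - \hat L(\hat f) \right] + \left[ \hat L(\hat f) - \hat L(f^*) \right] + \left[ \hat L(f^*) - L(f^*) \right].
\]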
By definition of ERM, the second term is nonpositive. Since f ∗ is independent of the random
sample, the third term is a difference between an average of random variables `(f ∗ (Xi ), Yi )
and their expectation. Hence, this term is zero-mean, and its fluctuations can be controlled
with the tail bounds we have seen in class. The first term, however, is generally not zero
in expectation, i.e. $\mathbb{E}_S \hat L(\hat f) \ne \mathbb{E}_S L(\hat f)$ (why?). Let us proceed by taking expectation (with
respect to S) of both sides:
\[
\mathbb{E}\left[ L(\hat f) - L(f^*) \right] \le \mathbb{E}\left[ L(\hat f) - \hat L(\hat f) \right] \le \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left( L(f) - \hat L(f) \right) \right]. \tag{10.129}
\]
Here we “removed the hat” on $\hat f$ by “supping out” this data-dependent choice. We are only
using the knowledge that $\hat f \in \mathcal{F}$, and nothing else about the method. We will see later
that for “curved” loss functions, such as square loss, the supremum can be further localized
within F. Note that (10.128) can lead to a vacuous (e.g. infinite) upper bound: one such
example is linear unconstrained regression.
Observe that (1 − 2Y ) is a random sign that is jointly distributed with X. Let us omit this
random sign for a moment, and consider
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \right]. \tag{10.130}
\]
Over the next few lectures, we will develop upper bounds on the above expected supremum
for any class F. For now, let us gain a bit more intuition about this object by looking at a
particular class of 1D thresholds:
\[
\mathcal{F} = \{x \mapsto \mathbf{1}\{x \le \theta\} : \theta \in \mathbb{R}\}.
\]
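Substituting this choice, (10.130) becomes
\[
\mathbb{E}\sup_{\theta\in\mathbb{R}}\left[ P(X \le \theta) - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} \right] = \mathbb{E}\sup_{\theta\in\mathbb{R}}\left[ F(\theta) - F_n(\theta) \right], \tag{10.131}
\]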
which is precisely the quantity from the beginning of the lecture (albeit without absolute
values and in expectation). Again, (10.131) is the expected largest pointwise (and one-
sided) distance between the CDF and empirical CDF. Does it go to zero as n → ∞? How
fast?
Let’s introduce the shorthand
\[
U_\theta = \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta\}.
\]
$\{U_\theta\}_{\theta\in\mathbb{R}}$ is an uncountable collection of correlated random variables, so how does the maximum
behave? We have already encountered the question in the context of linear forms
$\langle X, \theta\rangle$, indexed by $\theta \in B_2$, and we were able to use a covering argument to control the
expected supremum. Recall the key step in that proof: we can introduce a cover $\theta_1, \ldots, \theta_N$
such that control of $\sup_\theta U_\theta$ can be reduced to control of $\max_{j=1,\ldots,N} U_{\theta_j}$. Does this idea
work here? Problems with this approach start appearing immediately: how do we cover $\mathbb{R}$
by a finite collection?
In the next two sections, we present two approaches for upper-bounding (10.131); both
extend to the general case of (10.130).
10.4 Approach 1: Bracketing
While we cannot provide a finite ε-grid of $\mathbb{R}$ directly, we observe that we should be placing
the covering elements according to the underlying measure P. Informally, $U_\theta$ is likely to be
constant over regions of θ with small mass.
For simplicity assume that P does not have atoms, and let $\theta_1, \theta_2, \ldots, \theta_N$ (with $\theta_0 = -\infty$,
$\theta_{N+1} = +\infty$) correspond to the quantiles: $P(\theta_i \le X \le \theta_{i+1}) = \frac{1}{N+1}$. For a given θ,
let u(θ) and ℓ(θ) denote, respectively, the upper and lower elements corresponding to the
discrete collection $\theta_0, \ldots, \theta_{N+1}$. Then, trivially,
\begin{align*}
\mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta\}
&\le \mathbb{E}\mathbf{1}\{X \le u(\theta)\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \ell(\theta)\} \\
&\le \mathbb{E}\mathbf{1}\{X \le \ell(\theta)\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \ell(\theta)\} + \frac{1}{N+1}
\end{align*}
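and thus
\[
\mathbb{E}\sup_{\theta\in\mathbb{R}}\left[ \mathbb{E}\mathbf{1}\{X \le \theta\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta\} \right]
\le \frac{1}{N+1} + \mathbb{E}\max_{j\in\{0,\ldots,N\}}\left[ \mathbb{E}\mathbf{1}\{X \le \theta_j\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le \theta_j\} \right]
\]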
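The bracketing inequality holds sample by sample, which makes it easy to check. The sketch below assumes P is uniform on [0, 1], so that F(θ) = θ on [0, 1] and the quantile grid is $\theta_j = j/(N+1)$; since F and $F_n$ both vanish at 0 and below, the grid point $\theta_0 = -\infty$ is represented by 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 500, 20
x = np.sort(rng.random(n))                       # sample from P = Uniform[0, 1]
i = np.arange(1, n + 1)

# sup over all theta of F(theta) - F_n(theta): approached just to the left of a data point.
sup_all = max(0.0, np.max(x - (i - 1) / n))

theta = np.arange(N + 1) / (N + 1)               # quantile grid theta_0, ..., theta_N
Fn_grid = np.searchsorted(x, theta, side="right") / n    # F_n(theta_j) = fraction of x_i <= theta_j
max_grid = np.max(theta - Fn_grid)
bound = 1 / (N + 1) + max_grid

print(f"sup over R: {sup_all:.4f}  <=  1/(N+1) + max over grid: {bound:.4f}")
```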
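We also have
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f(X) \right] \le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right]
\]
and
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f(X) \right| \le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right|.
\]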
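Proof. For the first statement, we introduce an i.i.d. sample $X_1', \ldots, X_n'$ with the same
distribution as X. Observe that $\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n f(X_i')\right] = \mathbb{E}f(X)$. Hence,
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \right] = \mathbb{E}\sup_{f\in\mathcal{F}}\left[ \mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^n f(X_i') \right] - \frac{1}{n}\sum_{i=1}^n f(X_i) \right]. \tag{10.132}
\]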
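Since the quantity is constant for all the choices of $\varepsilon_1, \ldots, \varepsilon_n$, we have the same value by
taking an expectation. We have
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \right] \le \mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i\left( f(X_i') - f(X_i) \right) \right], \tag{10.135}
\]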
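where $\varepsilon_i$’s are now Rademacher random variables. Breaking up the supremum into two
terms leads to an upper bound
\begin{align}
&\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i') \right] + \mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n -\varepsilon_i f(X_i) \right] \tag{10.136}\\
&\qquad = 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right] \tag{10.137}
\end{align}
by the symmetry of Rademacher random variables. The second and third statement follow
from the same argument. For the last part,
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right| \le \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\left( f(X_i) - \mathbb{E}f \right) \right| + \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\,\mathbb{E}f \right|.
\]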
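Consider the first term on the RHS:
\begin{align*}
\mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\left( f(X_i) - \mathbb{E}f \right) \right|
&\le \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i\left( f(X_i) - f(X_i') \right) \right| \\
&= \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}f + \mathbb{E}f - f(X_i') \right) \right| \\
&\le \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \left( \mathbb{E}f - f(X_i) \right) \right| + \mathbb{E}\sup_{f\in\mathcal{F}}\left| \frac{1}{n}\sum_{i=1}^n \left( f(X_i) - \mathbb{E}f \right) \right|.
\end{align*}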
Since each $V_\theta$ is $1/\sqrt{n}$-sub-Gaussian, we get an overall upper bound of
\[
2\sqrt{\frac{2\log(n+1)}{n}}
\]
which, up to constants, matches the bound with the bracketing approach.
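A Monte Carlo sketch can compare the three quantities involved for the class of thresholds: the expected one-sided deviation $\mathbb{E}\sup_\theta[F(\theta) - F_n(\theta)]$, twice the expected Rademacher average, and the bound $2\sqrt{2\log(n+1)/n}$. The sketch assumes P is uniform on [0, 1]; the sample size and number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 2000
i = np.arange(1, n + 1)
deviation, rademacher = [], []
for _ in range(reps):
    x = np.sort(rng.random(n))
    # sup over theta of F(theta) - F_n(theta), approached just left of a data point.
    deviation.append(max(0.0, np.max(x - (i - 1) / n)))
    eps = rng.choice([-1.0, 1.0], size=n)
    # sup over theta of (1/n) sum_i eps_i 1{x_i <= theta}: largest partial sum in sorted order (or 0).
    rademacher.append(max(0.0, np.max(np.cumsum(eps))) / n)

lhs = np.mean(deviation)
sym = 2 * np.mean(rademacher)
bound = 2 * np.sqrt(2 * np.log(n + 1) / n)
print(f"E sup [F - F_n] ~ {lhs:.4f} <= 2 E sup Rademacher ~ {sym:.4f} <= bound {bound:.4f}")
```

In this example the finite-maximum bound is loose by a factor of order $\sqrt{\log n}$.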
10.7 Discussion
The bracketing and symmetrization approaches produced similar upper bounds for the case
of thresholds. We will see, however, that for more complex classes of functions, the two
approaches can give different results.
Of course, the symmetrization lemma can also be applied to the class of functions
Since (1 − 2y) is {±1}-valued, the distribution of $(1 - 2Y_i)\varepsilon_i$ is also Rademacher. Hence,
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i(1 - 2Y_i)f(X_i) \right] = \mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right]. \tag{10.140}
\]
This justifies omitting (1 − 2Y) for binary classification in our earlier exposition. Hence, in
view of (10.129), the upper bounds we derived guarantee that for empirical risk minimization,
\[
\mathbb{E}L(\hat f) - \min_{f^*\in\mathcal{F}} L(f^*) \lesssim \sqrt{\frac{\log(n+1)}{n}}.
\]
The power of symmetrization for studying the suprema of empirical processes has been
described in [11], who, in turn, attribute the technique to [15].
We can view supremum of the empirical process as the difference between the true and
empirical distributions when viewed through the lens of F.
Definition 8. A class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ is (weak) Glivenko-Cantelli with respect to P
if
\[
\sup_{f\in\mathcal{F}}\left| \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \right| \to 0 \tag{10.141}
\]
in probability (and strong Glivenko-Cantelli for almost sure convergence; these are equiv-
alent under certain boundedness assumptions).
Note: there do exist classes that are not Glivenko-Cantelli. These classes are, in a certain
sense, very rich, and both learning and the uniform GC property fail. One trivial example is
the class of indicators of discrete sets of arbitrary size, when P is absolutely continuous with respect to
Lebesgue measure. Another example is the class of bounded continuous functions on [0, 1] with respect
to, say, Lebesgue measure.
The index θ can be “time,” but we will be primarily interested in cases where Θ has
some metric structure.
We will be interested in the behavior of the supremum of the stochastic process, and in
particular its expected value:
E sup Uθ .
θ∈Θ
To understand this object, we need to have a sense of the dependence structure of $U_\theta$ and
$U_{\theta'}$ for a pair of parameters, but also about the metric structure of Θ.
A Gaussian process is a collection of random variables such that any finite collection
$U_{\theta_1}, \ldots, U_{\theta_n}$, for any n ≥ 1, is zero-mean and jointly Gaussian. In this case
with $d(\theta, \theta')^2 = \mathbb{E}(U_\theta - U_{\theta'})^2$. Hence, there is a natural metric for a Gaussian process.
Definition 10. Stochastic process (Uθ )θ∈Θ is sub-Gaussian with respect to a metric d
on Θ if Uθ is zero-mean and
The main examples we will be studying have a particular linearly parametrized form:
Gaussian process: Let $G_\theta = \langle g, \theta\rangle$, $g = (g_1, \ldots, g_n)$, $g_i \sim N(0, 1)$ i.i.d. Take $d(\theta, \theta') = \|\theta - \theta'\|$. Then
\[
G_\theta - G_{\theta'} = \langle g, \theta - \theta'\rangle \sim N\left(0, \|\theta - \theta'\|^2\right)
\]
In particular, this Gaussian process is also, trivially, sub-Gaussian with respect to the
Euclidean distance on Θ.
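To get a sublinear growth in n, we have to make sure Θ is significantly smaller than $B_\infty^n$.
A few other sets:
\[
\hat R(B_2^n) = \mathbb{E}\sup_{\theta\in B_2^n}\langle\varepsilon, \theta\rangle = \mathbb{E}\|\varepsilon\|_2 = \sqrt{n}
\]
and
\[
\hat G(B_2^n) \le \sqrt{n}.
\]
However, we observe that
\[
\hat R(B_1^n) = \mathbb{E}\sup_{\theta\in B_1^n}\langle\varepsilon, \theta\rangle = \mathbb{E}\|\varepsilon\|_\infty = 1
\]
In fact, this discrepancy between the Rademacher and Gaussian averages for $B_1^n$ is the worst
that can happen and for any Θ
\[
\hat R(\Theta) \lesssim \hat G(\Theta) \lesssim \sqrt{\log n}\cdot\hat R(\Theta). \tag{11.142}
\]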
Furthermore, the discrepancy is only there because $B_1^n$ has a small $\ell_1$ diameter, and for
many of the applications in statistics, we will work with a function class that will not have
such a small $\ell_1$ diameter.
For a singleton,
\[
\hat R(\{\theta\}) = 0
\]
while for the vector $1_n = (1, \ldots, 1)$,
\[
\hat R(\{-1_n, 1_n\}) = \mathbb{E}\max\{\langle\varepsilon, 1_n\rangle, -\langle\varepsilon, 1_n\rangle\} = \mathbb{E}\left|\sum_{i=1}^n \varepsilon_i\right| \le \sqrt{n}.
\]
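These Rademacher and Gaussian averages can be estimated by Monte Carlo using the dual-norm identities above. A minimal sketch with n = 100 and 5000 repetitions (both arbitrary); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
eps = rng.choice([-1.0, 1.0], size=(reps, n))     # Rademacher vectors
g = rng.standard_normal((reps, n))                # Gaussian vectors

R_B2 = np.mean(np.linalg.norm(eps, axis=1))       # sup over B_2^n: ||eps||_2 = sqrt(n) exactly
R_B1 = np.mean(np.max(np.abs(eps), axis=1))       # sup over B_1^n: ||eps||_inf = 1 exactly
G_B1 = np.mean(np.max(np.abs(g), axis=1))         # Gaussian analogue: E ||g||_inf
R_pair = np.mean(np.abs(eps.sum(axis=1)))         # the pair {-1_n, 1_n}: E |sum_i eps_i|

print(f"R(B_2^n) = {R_B2:.2f}, R(B_1^n) = {R_B1:.2f}, G(B_1^n) = {G_B1:.2f}")
print(f"R({{-1_n, 1_n}}) = {R_pair:.2f} <= sqrt(n) = {np.sqrt(n):.2f}")
```

The Gaussian average of $B_1^n$ exceeds the Rademacher average by a factor of order $\sqrt{\log n}$, in line with (11.142).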
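Lemma 25. Let d be a metric on Θ and assume $(U_\theta)$ is a subGaussian process. Then
for any finite subset A ⊆ Θ × Θ,
\[
\mathbb{E}\max_{(\theta,\theta')\in A}\left( U_\theta - U_{\theta'} \right) \le \max_{(\theta,\theta')\in A} d(\theta, \theta')\cdot\sqrt{2\log\mathrm{card}(A)} \tag{11.143}
\]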
As a simple consequence,
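Lemma 26. If $(U_\theta)_{\theta\in\Theta}$ is subGaussian with respect to d on Θ, then for any δ > 0,
\[
\mathbb{E}\sup_{\theta\in\Theta} U_\theta \le 2\,\mathbb{E}\sup_{d(\theta,\theta')\le\delta}\left( U_\theta - U_{\theta'} \right) + 2\,\mathrm{diam}(\Theta)\sqrt{\log N(\Theta, d, \delta)}
\]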
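The last term is
\[
\mathbb{E}\sup_{\hat\theta,\hat\theta'\in\hat\Theta}\left( U_{\hat\theta} - U_{\hat\theta'} \right) \le \mathrm{diam}(\Theta)\sqrt{2\log\left(\mathrm{card}(\hat\Theta)^2\right)}
\]
Hence,
\[
\mathbb{E}\sup_{\theta\in\Theta} U_\theta \le 2\delta\sqrt{n} + 2\,\mathrm{diam}(\Theta)\sqrt{\log N(\Theta, \|\cdot\|_2, \delta)} \tag{11.146}
\]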
Roughly speaking, the supremum over Θ can be upper bounded by the supremum within a
ball of radius δ (“local complexity”) and the maximum over a finite collection of centers of
δ-balls. We will see this decomposition/idea again within the context of optimal estimators
with general (possibly nonparametric) classes of functions.
Is (11.146) a tight upper bound? To investigate this question, consider two examples.
First is the example of $\Theta = B_2^n$. In this case, (5.69) gives an upper bound of $\sqrt{n}$ and a
multiplicative-cover approach of (7.104) recovers this up to constant factors (here n is the
dimensionality rather than d). We see that the same guarantee can be achieved by (11.146)
by taking δ a constant.
The next example, however, brings bad news: (11.146) is not necessarily tight. Consider
11.4 Chaining
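Theorem 2. Let $(U_\theta)_{\theta\in\Theta}$ be a sub-Gaussian stochastic process with respect to a metric
d. Let D = diam(Θ). Then for any δ ∈ [0, D],
\[
\mathbb{E}\sup_{\theta\in\Theta} U_\theta \le 2\,\mathbb{E}\sup_{d(\theta,\theta')\le\delta}\left( U_\theta - U_{\theta'} \right) + 8\sqrt{2}\int_{\delta/4}^{D/2}\sqrt{\log N(\Theta, d, \varepsilon)}\,d\varepsilon \tag{11.148}
\]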
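To see what the entropy integral looks like in a concrete case, the sketch below evaluates it numerically for the Euclidean unit ball $\Theta = B_2^n$ with δ = 0, assuming the volume-argument bound $N(B_2^n, \|\cdot\|_2, \varepsilon) \le (1 + 2/\varepsilon)^n$ in place of the exact covering number. Here D = 2, so the integral runs up to D/2 = 1; `entropy_integral` is a hypothetical helper using a midpoint rule.

```python
import numpy as np

def entropy_integral(n, lower, upper, m=200_000):
    """Midpoint rule for the integral of sqrt(log N(eps)) with N(eps) = (1 + 2/eps)^n."""
    width = (upper - lower) / m
    eps = lower + (np.arange(m) + 0.5) * width       # midpoints avoid eps = 0
    return np.sum(np.sqrt(n * np.log1p(2 / eps))) * width

ratios = []
for n in (10, 100, 1000):
    value = entropy_integral(n, 0.0, 1.0)            # delta = 0 and D/2 = 1 for the unit ball
    ratios.append(value / np.sqrt(n))
    print(f"n={n}: integral / sqrt(n) = {ratios[-1]:.3f}")
```

The integral is a constant multiple of √n, so chaining recovers the correct order for the ball, matching the single-scale bound in this example.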
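(which means $2^{-N}D \le \delta \le 2^{-(N-1)}D$) and $\mathrm{card}(\Theta_N) = N(\Theta, d, 2^{-N}D) \ge N(\Theta, d, \delta)$. As
before, we start with a single (finest-scale) cover:
For $\theta_N \in \Theta_N$,
\[
U_{\theta_N} = \sum_{i=1}^N \left( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \right) + U_{\theta_0} \tag{11.149}
\]
where, recursively, we define $\theta_{i-1} = \pi_{i-1}(\theta_i)$ to be the element of $\Theta_{i-1}$ closest to $\theta_i$. The
sequence $\theta_0, \theta_1, \ldots, \theta_N$ is a “chain” linking an element of the covering to the corresponding
closest element at the coarser scale.
Let the corresponding chain for $\theta_N' \in \Theta_N$ be denoted by $\theta_0', \theta_1', \ldots, \theta_N'$. Then
\[
U_{\theta_N} - U_{\theta_N'} = \sum_{i=1}^N \left( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \right) - \sum_{i=1}^N \left( U_{\theta_i'} - U_{\pi_{i-1}(\theta_i')} \right)
\]
and
\begin{align}
\mathbb{E}\max_{\theta,\theta'\in\Theta_N}\left( U_\theta - U_{\theta'} \right)
&\le \sum_{i=1}^N \mathbb{E}\max_{\theta_i\in\Theta_i}\left( U_{\theta_i} - U_{\pi_{i-1}(\theta_i)} \right) + \sum_{i=1}^N \mathbb{E}\max_{\theta_i'\in\Theta_i}\left( U_{\pi_{i-1}(\theta_i')} - U_{\theta_i'} \right) \tag{11.150}\\
&\le 2\sum_{i=1}^N D2^{-(i-1)}\sqrt{2\log N(\Theta, d, 2^{-i}D)} \tag{11.151}\\
&= 8\sum_{i=1}^N D2^{-(i+1)}\sqrt{2\log N(\Theta, d, 2^{-i}D)} \tag{11.152}\\
&\le 8\sum_{i=1}^N \int_{2^{-(i+1)}D}^{2^{-i}D}\sqrt{2\log N(\Theta, d, \varepsilon)}\,d\varepsilon \tag{11.153}
\end{align}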
[Figure 1: the curve $\sqrt{\log N(\Theta, d, \varepsilon)}$ plotted against ε, with the scales $2^{-(i+1)}D$, $2^{-i}D$ and $D$ marked on the horizontal axis.]
Theorem 3. For a Gaussian process $(U_\theta)_{\theta\in\Theta}$,
\[
C\sup_{\alpha\ge 0}\alpha\sqrt{\log N(\Theta, d, \alpha)} \le \mathbb{E}\sup_{\theta\in\Theta} U_\theta
\]
We can interpret this lower bound as the largest rectangle under the curve in Figure 1. This
lower bound can be tight in the applications we consider (whenever the sum of the areas of
rectangles in Figure 1 is of the same order as the largest one).
We have developed general machinery for upper- and lower-bounding the expected suprema
of sub-Gaussian processes, including Rademacher and Gaussian processes linearly parametrized
by a Θ ⊂ Rn . How are these results relevant to the problem of learning or estimation with
a class of functions F?
The symmetrization lemma (Lemma 24) tells us that for a class of real-valued functions
$\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$ we can upper bound the expected supremum of the empirical process
indexed by $\mathcal{F}$ in terms of the expected supremum of the Rademacher processes:
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \right] \le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\left[ \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \right].
\]
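precisely corresponds to Rademacher averages of the following indexing set Θ. To see this
correspondence, let
\[
\Theta = \frac{1}{\sqrt{n}}\,\mathcal{F}|_{x_1,\ldots,x_n} = \left\{ \frac{1}{\sqrt{n}}\left( f(x_1), \ldots, f(x_n) \right) : f \in \mathcal{F} \right\} \subseteq \mathbb{R}^n \tag{12.154}
\]
a (scaled by $1/\sqrt{n}$) projection (or, restriction) of $\mathcal{F}$ onto $x_1, \ldots, x_n$. Take d to be
\[
d(\theta, \theta')^2 = \|\theta - \theta'\|^2 = \frac{1}{n}\sum_{i=1}^n \left( f(x_i) - f'(x_i) \right)^2 =: \|f - f'\|_n^2, \tag{12.155}
\]
where $\theta = \frac{1}{\sqrt{n}}(f(x_1), \ldots, f(x_n))$ and $\theta' = \frac{1}{\sqrt{n}}(f'(x_1), \ldots, f'(x_n))$, $f, f' \in \mathcal{F}$. Note that $\|\cdot\|_n$ is a
pseudo-metric, as it can be zero for functions that differ outside the given data. With these
definitions, we write
\[
\mathbb{E}_\varepsilon\sup_{f\in\mathcal{F}}\left[ \frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i f(x_i) \right] = \mathbb{E}_\varepsilon\sup_{\theta\in\Theta}\langle\varepsilon, \theta\rangle.
\]
Furthermore,
\[
N(\Theta, \|\cdot\|_2, \alpha) = N(\mathcal{F}, \|\cdot\|_n, \alpha).
\]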
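For the class of thresholds, the projected set Θ is finite and can be enumerated explicitly. The sketch below assumes n = 8 distinct data points drawn uniformly from [0, 1] and picks one threshold from each of the n + 1 regions delimited by the data; the names `thetas` and `Theta` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.random(n)
xs = np.sort(x)

# One threshold below all data, one between each pair of neighbours, one above all data.
thetas = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
# Rows of Theta are the scaled projections (f(x_1), ..., f(x_n)) / sqrt(n) for f = 1{. <= theta}.
Theta = np.array([(x <= t).astype(float) for t in thetas]) / np.sqrt(n)

distinct = np.unique(Theta, axis=0)
sq_dist = np.sum((Theta[0] - Theta[-1]) ** 2)      # ||f - f'||_n^2 for the all-zeros and all-ones patterns
print(f"{len(distinct)} distinct projections, squared distance between extremes = {sq_dist:.2f}")
```

Although the class is indexed by all of $\mathbb{R}$, its projection onto n points contains only n + 1 vectors, which is why covering in the data-dependent pseudo-metric succeeds where covering $\mathbb{R}$ directly does not.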
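Then Theorem 2 tells us that for any δ ∈ [0, D],
\[
\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{\sqrt{n}}\sum_{i=1}^n \varepsilon_i f(x_i) \le 2\delta\sqrt{n} + 8\sqrt{2}\int_{\delta/4}^{D/2}\sqrt{\log N(\mathcal{F}, \|\cdot\|_n, \alpha)}\,d\alpha
\]
where $D = \sup_{f,g\in\mathcal{F}}\|f - g\|_n \le 2\sup_{f\in\mathcal{F}}\|f\|_n \le 2\sup_{f\in\mathcal{F}}\|f\|_\infty$ and $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.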
Expectations on both sides are with respect to X1 , . . . , Xn . Note that the above results
hold for the absolute value of the empirical process if we replace log N by log 2N , and the
log 2 can be further absorbed into the multiplicative constant.
The Sudakov lower bound for the Gaussian process implies (together with the rela-
tionship between Rademacher and Gaussian processes) the following lower bound for the
Rademacher averages:
We note that a version of the lower bound (for a particular choice of α) without the log-
arithmic factor is available, under some conditions, and it often matches the upper bound
(see a few pages below).
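Definition 13. Given $x_1, \ldots, x_n$ and a class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$,
\[ \widehat{\mathcal{R}}(\mathcal{F}) = \mathbb{E}_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i) \tag{12.157} \]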
Note that we will occasionally adopt the 1/n scaling to follow the literature.
Of course, the second definition is just a special case of the first for the empirical measure $\frac{1}{n}\sum_{i=1}^n \delta_{X_i}$.
\[ \forall f \in \mathcal{F},\ \exists j \in [N] \ \text{s.t.}\ \|f - f_j\|_{L^2(P)} \le \alpha. \]
The above definition can be also generalized to Lr (P ). Next, we spell out the above defini-
tion specifically for the empirical measure Pn :
Definition 15. Let $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ be the empirical measure supported on $x_1, \ldots, x_n$. A set $V = \{v_1, \ldots, v_N\}$ of vectors in $\mathbb{R}^n$ forms an α-net (or, α-cover) of $\mathcal{F}$ with respect to $L^r(P_n)$ if
\[ \forall f \in \mathcal{F},\ \exists j \in [N] \ \text{s.t.}\ \frac{1}{n}\sum_{i=1}^n |f(x_i) - v_j(i)|^r \le \alpha^r. \]
The size of the smallest α-net is denoted by N (F, Lr (Pn ), α). Similarly, an α-net (or,
α-cover) with respect to L∞ (Pn ) requires
The size of the smallest α-net is denoted by N (F, L∞ (Pn ), α).
Observe that the elements of the cover V can be “improper,” i.e. they do not need to
correspond to values of some function on the data. However, one can go between proper
and improper covers at a cost of a constant (check!).
Second, observe that
for r ≤ q since kf kLr (Pn ) is nondecreasing with r. Note that this is different for unweighted
metrics: e.g. kxkr is nonincreasing in r, and hence N (Θ, k·kr , α) is also nonincreasing in r.
for any j 6= k. The size of the largest α-packing is denoted by D(F, Lr (Pn ), α).
Definition 17. We will say that a class $\mathcal{F}$ is parametric if there is a constant $C$ and a notion of dimension dim such that
\[ \sup_{P_n} N(\mathcal{F}, L^2(P_n), \epsilon) \lesssim \left( \frac{C}{\epsilon} \right)^{\dim}. \tag{14.158} \]
The requirement that (14.158) holds for all measures Pn and values of n is quite strong.
Yet, we will show that as an upper bound, it is true for a variety of function classes.
However, one should keep in mind that there are also cases where dependence of the upper
bound on n can lead to better overall estimates. The quantity
where supremum is taken over all discrete measures, is called Koltchinskii-Pollard entropy.
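Let’s consider a “parametric” class $\mathcal{F}$ such that functions in $\mathcal{F}$ are uniformly bounded: $\|f\|_\infty \le 1$. Uniform boundedness implies an upper bound on the diameter: $D/2 \le 1$. Then, taking $\delta = 0$ in Corollary 2, conditionally on $X_1, \ldots, X_n$,
\[ \mathbb{E}_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \le \frac{12}{\sqrt{n}} \int_0^1 \sqrt{\log N(\mathcal{F}, L^2(P_n), \alpha)}\, d\alpha \le \frac{12}{\sqrt{n}} \int_0^1 \sqrt{d \log(1/\alpha)}\, d\alpha \le c \sqrt{\frac{d}{n}} \]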
Here it’s useful to note that
\[ \int_0^a \sqrt{\log(1/\alpha)}\, d\alpha \le \begin{cases} 2a\sqrt{\log(1/a)} & a \le 1/e \\ 2a & a > 1/e \end{cases} \]
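The integral bound above can be sanity-checked numerically. The following is a minimal sketch, not part of the original notes: the function names, the truncation length `tail`, and the midpoint rule are illustrative choices. It uses the substitution $\alpha = e^{-t}$, under which the integral becomes $\int_{\log(1/a)}^{\infty} \sqrt{t}\, e^{-t}\, dt$.

```python
import math

def entropy_integral(a, steps=200_000, tail=50.0):
    """Approximate int_0^a sqrt(log(1/alpha)) d(alpha) for 0 < a <= 1.

    Substituting alpha = exp(-t) turns the integral into
    int_{log(1/a)}^{inf} sqrt(t) * exp(-t) dt, which is truncated at
    log(1/a) + tail and evaluated with the midpoint rule.
    """
    t0 = math.log(1.0 / a)
    h = tail / steps
    total = 0.0
    for k in range(steps):
        t = t0 + (k + 0.5) * h
        total += math.sqrt(t) * math.exp(-t)
    return total * h

def stated_bound(a):
    """Right-hand side of the bound quoted in the notes."""
    if a <= 1.0 / math.e:
        return 2.0 * a * math.sqrt(math.log(1.0 / a))
    return 2.0 * a

for a in (0.01, 0.1, 1.0 / math.e, 0.5, 1.0):
    print(f"a={a:.4f}  integral={entropy_integral(a):.4f}  bound={stated_bound(a):.4f}")
```

At $a = 1$ the integral equals $\Gamma(3/2) = \sqrt{\pi}/2$, which gives a convenient reference value for the numerical approximation.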
The following theorem is due to D. Haussler (an earlier version with exponent O(d) is
due to Dudley ’78):
We will explain what “VC dimension” means a bit later, and let’s just say here that the
class of thresholds has dimension 1 and the class of homogeneous linear classifiers in Rd has
dimension d. In particular, this removes the extraneous log(n + 1) factor we had in Lecture
14 when analyzing thresholds.
If p < 2, the integral converges, and we can take δ = 0. However, when p > 2, the lower
limit of the integral matters and we get an overall bound of the order
\[ \delta + n^{-1/2} \left[ \varepsilon^{1-p/2} \right]_{\delta}^{D/2} \le \delta + n^{-1/2} \delta^{1-p/2} \]
By choosing δ to balance the two terms (and thus minimize the upper bound) we obtain
δ = n−1/p . Hence, for p > 2, the estimate on Rademacher averages provided by the Dudley
bound is
\[ \frac{1}{\sqrt{n}} \widehat{\mathcal{R}}(\mathcal{F}) \lesssim n^{-1/p}. \]
On the other hand, for p < 2, the Dudley entropy integral upper bound becomes (by setting
δ = 0) on the order of
\[ n^{-1/2} D^{1-p/2} = O(n^{-1/2}), \]
yielding
\[ \frac{1}{\sqrt{n}} \widehat{\mathcal{R}}(\mathcal{F}) \lesssim n^{-1/2}. \]
We see that there is a transition at p = 2 in terms of the growth of Rademacher averages
(“elbow” behavior). The phase transition will be important in the rest of the course when
we study optimality of nonparametric least squares.
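The balancing step for $p > 2$ can be illustrated with a few lines of code. This is a sketch under the assumption that constants are dropped, so the bound is taken to be exactly $\delta + n^{-1/2}\delta^{1-p/2}$; the function names are hypothetical.

```python
def dudley_bound(delta, n, p):
    """Order of the truncated Dudley bound for p > 2: delta + n^{-1/2} * delta^{1 - p/2}."""
    return delta + n ** -0.5 * delta ** (1.0 - p / 2.0)

def balanced_delta(n, p):
    """Scale at which the two terms are equal: delta = n^{-1/p}."""
    return n ** (-1.0 / p)

def rate_exponent(p):
    """Exponent r in the rate n^{-r} for (1/sqrt(n)) * R-hat(F) discussed above."""
    return 0.5 if p < 2 else 1.0 / p

n, p = 10_000, 4.0
d = balanced_delta(n, p)
# At the balanced scale both terms equal n^{-1/p}, so the bound is 2 * n^{-1/p}.
print(d, dudley_bound(d, n, p), 2 * n ** -rate_exponent(p))
```

Moving $\delta$ away from $n^{-1/p}$ in either direction makes one of the two terms dominate, which is why the balanced choice gives the order of the minimum.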
Remark that in the $p < 2$ regime, the rate $n^{-1/2}$ is the same CLT rate we would have if we simply considered $\mathbb{E}\left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f \right|$ (or the average with random signs) with a single function. Hence, the payment for the supremum over class $\mathcal{F}$ is only in a constant that may depend on $\mathcal{F}$ but does not depend on $n$.
vs
\[ n^{-1/2} \]
for $p < 2$, and
\[ \delta + \frac{\delta^{1-p/2}}{\sqrt{n}} \asymp n^{-1/p} \]
for $p > 2$.
\[ \mathcal{F} = \{ x \mapsto \langle w, x \rangle : w \in B_2^d \} \]
where X is the n × d data matrix. As we have seen, the key quantity we need to compute
is
N (F, L2 (Pn ), ε).
What is a good upper bound for this quantity? What we had done earlier in the course was to discretize the set $B_2^d$ to create an ε-net $w_1, \ldots, w_N$ of size $N(B_2^d, \|\cdot\|_2, \varepsilon)$. Observe that for any $w, w' \in B_2^d$,
\[ \left( \frac{1}{n}\sum_{i=1}^n (\langle w, x_i \rangle - \langle w', x_i \rangle)^2 \right)^{1/2} \le \max_{i \in [n]} |\langle w - w', x_i \rangle| \le \max_{x \in B_2^d} |\langle w - w', x \rangle| \le \|w - w'\|_2. \]
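\[ N(\mathcal{F}, L^2(P_n), \varepsilon) \le N(\mathcal{F}, L^\infty(P_n), \varepsilon) \le N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le N(B_2^d, \|\cdot\|_2, \varepsilon). \tag{14.159} \]
where the sup-norm (or, pointwise over the domain) metric is $\|f - g\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - g(x)|$. Recall that the covering number of $B_2^d$ is
\[ N(B_2^d, \|\cdot\|_2, \varepsilon) \le \left( 1 + \frac{2}{\varepsilon} \right)^d. \]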
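The reduction from function-space covers to parameter-space covers rests on the inequality $\|f_w - f_{w'}\|_{L^2(P_n)} \le \|w - w'\|_2$ for data in the unit ball. Below is a small numerical illustration, assuming randomly generated points; the helper names and the sampling scheme are illustrative and not from the notes.

```python
import math
import random

def empirical_l2_distance(w, w_prime, xs):
    """L2(P_n) distance between x -> <w, x> and x -> <w', x> on the sample xs."""
    diffs = [sum((a - b) * c for a, b, c in zip(w, w_prime, x)) for x in xs]
    return math.sqrt(sum(v * v for v in diffs) / len(xs))

def random_point_in_ball(d, rng):
    """A point with Euclidean norm at most 1 (a rescaled Gaussian direction)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random()
    return [radius * v / norm for v in g]

def log_covering_bound(d, eps):
    """log of the volumetric bound (1 + 2/eps)^d on the covering number of the unit ball."""
    return d * math.log(1.0 + 2.0 / eps)

rng = random.Random(0)
d, n = 5, 50
xs = [random_point_in_ball(d, rng) for _ in range(n)]
w, w_prime = random_point_in_ball(d, rng), random_point_in_ball(d, rng)
param_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_prime)))
print(empirical_l2_distance(w, w_prime, xs), "<=", param_dist)
print(log_covering_bound(d, 0.1))
```

Any ε-net of the parameter ball therefore induces an ε-net of the function class in $L^2(P_n)$, which is what the chain (14.159) expresses.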
This gives a “parametric” growth of entropy
However, if d is large or infinite, this bound is loose. We will show that it also holds that
which is a nonparametric behavior. Hence, the same class can be viewed as either parametric
or nonparametric. In fact, in the parametric behavior, it is not important that the domain
of $w$ is $B_2^d$ since we would expect a similar estimate for other sets (including $B_\infty^d$). In
contrast, it will be crucial in nonparametric estimates that the norm of $w$ is $\ell_2$-bounded.
Jumping ahead, we will study neural networks and show a similar phenomenon: we
can either count the number of neurons or connections (parameters) or we can calculate
nonparametric “norm-based” estimates by looking at the norms of the layers in the network.
It’s worth emphasizing again that (??) can lead to very loose bounds in high-dimensional
situations. A cover of function values on a finite set of data can be significantly smaller than
a cover with respect to the sup norm.
it would imply that we only paid C(F) for having a statement uniform in f ∈ F.
Next, rather than assuming that functions in F are uniformly bounded, it will be enough
to assume that they have an L2 (P )-integrable envelope F :
Rather than assuming that $F(x) \le 1$, we shall assume that $\|F\|^2_{L^2(P)} = \mathbb{E}F(X)^2 < \infty$ and everything will be phrased in terms of $\|F\|^2_{L^2(P)}$.
Now, let $H : [0, \infty) \to [0, \infty)$ be such that $H(z)$ is non-decreasing for $z > 0$ and $z\sqrt{H(1/z)}$ is non-decreasing for $z \in (0, 1]$. Assume
\[ \int_0^D \sqrt{H(1/x)}\, dx \le C_H D \sqrt{H(1/D)} \]
for all τ > 0. With this control on Koltchinskii-Pollard entropy, it follows that
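\[ \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n (f(X_i) - \mathbb{E}f(X)) \lesssim \sigma \sqrt{H\left( \frac{2\|F\|_{L^2(P)}}{\sigma} \right)} \tag{14.160} \]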
if n is large enough. We refer to [10] for more details, in particular Theorem 3.5.6 and the
following corollaries.
Remarkably, under additional mild conditions on the size of $n$, the inequality (14.160) can be reversed for a given $P$ as soon as the entropy with respect to $L^2(P)$ indeed grows at least as $H\left( \|F\|_{L^2(P)} / \sigma \right)$.
Hence, the price we pay for uniformity in $f \in \mathcal{F}$ is truly
\[ C(\mathcal{F}) \asymp \sqrt{H\left( \frac{\|F\|_{L^2(P)}}{\sigma} \right)}. \]
Of course, this expression is even simpler if $\sigma^2 = \sup_{f \in \mathcal{F}} \mathbb{E}(f(X) - \mathbb{E}f)^2$ is on the same order as $\|F\|^2_{L^2(P)} = \mathbb{E} \sup_f |f(X)|^2$.
α{±1}n + v ⊆ Θ.
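Then
\[ \widehat{\mathcal{R}}(\Theta) \ge \widehat{\mathcal{R}}(\alpha\{\pm 1\}^n + v) = \widehat{\mathcal{R}}(\alpha\{\pm 1\}^n) = \alpha \widehat{\mathcal{R}}(\{\pm 1\}^n) \ge \alpha n \]
Hence, “large cubes” inside Θ make Rademacher averages large. It turns out, this is the only reason $\widehat{\mathcal{R}}(\mathcal{F}|_{x_1,\ldots,x_n})$ can be large!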
The key question is whether F|x1 ,...,xn contains large cubes for a given class F.
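For small $n$ the Rademacher average of a shifted, scaled cube can be computed exactly by enumerating all sign vectors. The sketch below assumes the unnormalized definition $\mathbb{E}_\varepsilon \sup_{\theta} \langle \varepsilon, \theta \rangle$; the function names and the particular shift vector are illustrative.

```python
from itertools import product

def rademacher_average(theta_set):
    """Exact E_eps sup_{theta} <eps, theta>, enumerating all 2^n sign vectors."""
    n = len(theta_set[0])
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(sum(e * t for e, t in zip(eps, theta)) for theta in theta_set)
    return total / 2 ** n

def scaled_cube(n, alpha, shift):
    """The set alpha * {-1, +1}^n + shift as a list of tuples."""
    return [tuple(alpha * s + v for s, v in zip(signs, shift))
            for signs in product((-1, 1), repeat=n)]

n, alpha = 4, 0.5
cube = scaled_cube(n, alpha, shift=(0.3, -1.0, 2.0, 0.0))
print(rademacher_average(cube))  # equals alpha * n up to rounding
```

The shift drops out because $\mathbb{E}\langle \varepsilon, v \rangle = 0$, and for every sign vector the supremum over the cube is attained at the matching vertex.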
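Lemma 27 (Sauer-Shelah-Vapnik-Chervonenkis). If $\mathrm{vc}(\mathcal{F}) = d < \infty$,
\[ \mathrm{card}\left( \mathcal{F}|_{x_1,\ldots,x_n} \right) \le \sum_{i=0}^{d} \binom{n}{i} \le \left( \frac{en}{d} \right)^d \]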
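Both sides of the lemma are easy to tabulate. The sketch below is illustrative (function names are not from the notes) and checks the second inequality in the range $n \ge d \ge 1$, which is the standard range of validity for the closed form.

```python
import math

def sauer_sum(n, d):
    """sum_{i=0}^{d} C(n, i): the bound on the number of realizable labelings."""
    return sum(math.comb(n, i) for i in range(d + 1))

def sauer_closed_form(n, d):
    """(e * n / d)^d, an upper bound on sauer_sum when n >= d >= 1."""
    return (math.e * n / d) ** d

for n in (10, 50, 200):
    d = 3
    print(n, sauer_sum(n, d), round(sauer_closed_form(n, d)), 2 ** n)
```

The last column shows how quickly the full cube $2^n$ outgrows the polynomial count once $n$ exceeds $d$.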
This result is quite remarkable. It says that as soon as $n > \mathrm{vc}(\mathcal{F})$, the proportion of the cube that can be realized by $\mathcal{F}$ becomes very small ($n^d$ vs $2^n$). This combinatorial result is at the heart of empirical process theory and the early developments in pattern recognition.
In particular, the lemma can be interpreted as a covering number upper bound:
\[ N(\mathcal{F}, L^\infty(P_n), \epsilon) \le \left( \frac{en}{d} \right)^d \]
for any $\epsilon > 0$. Observe that these numbers are with respect to $L^\infty(P_n)$ rather than $L^2(P_n)$, and hence can be an overkill. Indeed, $L^\infty(P_n)$ covering numbers are necessarily $n$-dependent while we can hope to get dimension-independent $L^2(P_n)$ covering numbers. Indeed, this result (Dudley, Haussler) was already mentioned: for a binary-valued class with finite $\mathrm{vc}(\mathcal{F}) = d$,
\[ N(\mathcal{F}, L^2(P_n), \epsilon) \lesssim \left( \frac{C}{\epsilon} \right)^{Cd}. \]
Hence, a class with finite VC dimension is “parametric”. On the other hand, if vc(F) is
infinite, then F|x1 ,...,xn is a full cube for arbitrarily large n (for some appropriately chosen
points). Hence, Rademacher averages of this set are too large and there is no uniform
convergence for all P (to see this, consider P supported on the shattered set). Hence,
finiteness of VC dimension is a characterization (of both distribution-free learnability and
uniform convergence).
A word of caution: VC dimension does not always correspond to “number of parameters.” For instance, the one-parameter family $\mathcal{F} = \{ x \mapsto \mathbf{1}\{\sin(\alpha x) \ge 0\} : \alpha \in \mathbb{R} \}$ over $\mathcal{X} = \mathbb{R}$ has infinite VC dimension.
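The claim about the sine family can be checked on a concrete point set. The sketch below uses a standard textbook construction (an assumption here, since the notes do not spell it out): the points $x_i = 2^{-i}$ and the frequency $\alpha = \pi \left( 1 + \sum_{i : y_i = 0} 2^i \right)$ for a target labeling $y \in \{0,1\}^m$.

```python
import math
from itertools import product

def sine_classifier(alpha, x):
    """The one-parameter classifier x -> 1{sin(alpha * x) >= 0}."""
    return 1 if math.sin(alpha * x) >= 0 else 0

def alpha_for_labels(labels):
    """A frequency realizing the given 0/1 labels on the points x_i = 2^{-i}, i = 1..m.

    Label 1 contributes nothing and label 0 contributes 2^i to the sum.
    """
    return math.pi * (1 + sum(2 ** i for i, y in enumerate(labels, start=1) if y == 0))

def is_shattered(m):
    """Check that every labeling of x_1, ..., x_m is realized by some alpha."""
    xs = [2.0 ** -i for i in range(1, m + 1)]
    for labels in product((0, 1), repeat=m):
        alpha = alpha_for_labels(labels)
        if tuple(sine_classifier(alpha, x) for x in xs) != labels:
            return False
    return True

print(is_shattered(8))
```

Since the construction works for every $m$ (up to floating-point precision in this numerical check), no finite number bounds the size of a shattered set.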
The combinatorial dimension vc(F, α) of F (on domain X ) at scale α is defined as the
size n of the largest shattered set.
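Theorem 5. Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [-1, 1]$. Then for any distribution $P$,
\[ N(\mathcal{F}, L^2(P), \varepsilon) \le \left( \frac{c}{\varepsilon} \right)^{c \cdot \mathrm{vc}(\mathcal{F}, \varepsilon/c)} \]
for all $\epsilon > 0$. Here $c$ is an absolute constant.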
In particular, Rademacher averages of this class scale as n−1/2 since this is a nonparametric
class with entropy exponent p < 2.
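Proposition 1. For
\[ \mathcal{F} = \{ x \mapsto \langle w, x \rangle : w \in B_2^d \} \]
and $\mathcal{X} \subseteq B_2^d$, it holds that
\[ \mathrm{vc}(\mathcal{F}, \alpha) \le 16\alpha^{-2}. \]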
\[ \widehat{w}_{t+1} = \widehat{w}_t + y_t x_t, \]
and otherwise $\widehat{w}_{t+1} = \widehat{w}_t$.
Lemma 28 (Novikoff’62). For any sequence $(x_1, y_1), \ldots, (x_n, y_n) \in B_2^d \times \{\pm 1\}$ the Perceptron algorithm makes at most $\gamma^{-2}$ mistakes, where γ is the margin of the sequence, defined as
\[ \gamma = \max_{w^* \in B_2^d} \min_t \; y_t \langle w^*, x_t \rangle \vee 0 \]
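\[ \|\widehat{w}_{t+1}\|^2 = \|\widehat{w}_t + y_t x_t\|^2 \le \|\widehat{w}_t\|^2 + 2 y_t \langle \widehat{w}_t, x_t \rangle + 1 \le \|\widehat{w}_t\|^2 + 1 \]
\[ \gamma \le \langle w^*, y_t x_t \rangle = \langle w^*, \widehat{w}_{t+1} - \widehat{w}_t \rangle, \]
and so by summing and telescoping, $m\gamma \le \langle w^*, \widehat{w}_{n+1} \rangle \le \sqrt{m}$. This concludes the proof.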
Remarkably, the number of mistakes does not depend on the dimension d. We will now
show that the mistake bound translates into a bound on the scale-sensitive dimension.
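The mistake bound can be observed on synthetic data. The sketch below is illustrative: it assumes the convention that a round is a mistake when $y_t \langle \widehat{w}_t, x_t \rangle \le 0$ (consistent with the norm calculation in the proof), and it generates a hypothetical separable sequence whose margin is at least `margin` with respect to the first coordinate direction.

```python
import math
import random

def perceptron_mistakes(sequence):
    """Run one pass of the Perceptron and return the number of mistakes.

    A round counts as a mistake when y_t * <w_t, x_t> <= 0, in which case
    w_{t+1} = w_t + y_t * x_t; otherwise w_{t+1} = w_t.
    """
    d = len(sequence[0][0])
    w = [0.0] * d
    mistakes = 0
    for x, y in sequence:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    return mistakes

def separable_sequence(n, d, margin, rng):
    """Points in the unit ball labeled by the first coordinate, with |x_1| >= margin."""
    data = []
    while len(data) < n:
        x = [rng.uniform(-1.0, 1.0) / math.sqrt(d) for _ in range(d)]
        if abs(x[0]) >= margin:
            data.append((x, 1 if x[0] > 0 else -1))
    return data

rng = random.Random(0)
margin = 0.1
data = separable_sequence(n=500, d=5, margin=margin, rng=rng)
print(perceptron_mistakes(data), "<=", 1 / margin ** 2)
```

Because the true margin γ of the sequence is at least `margin`, the bound $\gamma^{-2}$ is at most `1 / margin ** 2`, regardless of the dimension `d`.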
Proof of Proposition. Suppose there exists a shattered set $x_1, \ldots, x_m \in B_2^d$: there exist $s_1, \ldots, s_m \in [-1, 1]$ such that for any sequence of signs $\epsilon = (\epsilon_1, \ldots, \epsilon_m)$ there exists a $w_\epsilon \in B_2^d$ such that
\[ \epsilon_i (\langle w_\epsilon, x_i \rangle - s_i) \ge \alpha/2. \]
Claim: we can reparametrize the problem so that si = 0. Indeed, take
Then we have
\[ \epsilon_i \langle \tilde{w}_\epsilon, \tilde{x}_i \rangle \ge \alpha/2, \]
while the norms are at most $\sqrt{2}$:
Now comes the key step. We run Perceptron on the sequence $\tilde{x}_1/\sqrt{2}, \ldots, \tilde{x}_m/\sqrt{2}$ and $y_i = -\widehat{y}_i$. That is, we force Perceptron to make mistakes on every round, no matter what the predictions are. It is important that Perceptron makes deterministic predictions for this argument to work. Note that the sequence of predictions of Perceptron defines the sequence $y = (y_1, \ldots, y_m)$ with
\[ y_i \langle \tilde{w}_y/\sqrt{2}, \tilde{x}_i/\sqrt{2} \rangle \ge \alpha/4. \]
Hence, by Novikoff’s result,
\[ m \le 16/\alpha^2. \]
Interestingly, both Perceptron and VC theory were developed in the 60’s as distinct
approaches (online vs batch), yet the connection between them runs deeper than was recog-
nized, until recently. In particular, the above proof in fact shows that a stronger sequential
version of vc(F, α) is also bounded by 16α−2 , where (roughly speaking) sequential analogues
allow the sequence to evolve as a predictable process with respect to a dyadic filtration. It
turns out that there are sequential analogues of Rademacher averages, covering numbers,
Dudley chaining, and combinatorial dimensions, and these govern online (rather than i.i.d.)
learning. We will mention these towards the end of the course.
In the past few lectures, we have developed tools for analyzing the expected suprema of
empirical processes. We have already seen that such quantities can be used to derive
sample complexity bounds for empirical risk minimization algorithms. Let us recall the
setup. Excess loss with respect to a class of functions F is defined as
enjoys
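\[ \mathbb{E}\ell(\widehat{f}(X), Y) - \min_{f \in \mathcal{F}} \mathbb{E}\ell(f(X), Y) \le \mathbb{E} \sup_{f \in \mathcal{F}} \left[ \mathbb{E}\ell(f(X), Y) - \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i) \right]. \]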
Lemma 29 (Contraction). Let φi : R → R be 1-Lipschitz, i = 1, . . . , n. Let Θ ⊂ Rn
and φ ◦ θ = (φ1 (θ1 ), . . . , φn (θn )) for θ ∈ Θ. Denote φ ◦ Θ = {φ ◦ θ : θ ∈ Θ}. Then
\[ \widehat{\mathcal{R}}(\phi \circ \Theta) \le \widehat{\mathcal{R}}(\Theta). \]
The inequality follows from the Lipschitz condition and the following equality is justified be-
cause of the symmetry of the other two terms with respect to renaming θ and θ0 . Proceeding
to remove the other signs concludes the proof.
We now apply this lemma to functions φi (·) = `(·, Yi ). As long as these functions are
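L-Lipschitz, the contraction lemma gives
\[ \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \epsilon_i \ell(f(X_i), Y_i) \le L \cdot \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \epsilon_i f(X_i) \tag{16.163} \]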
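The contraction inequality of Lemma 29 can be verified exactly on a small finite set by enumerating all sign vectors. This is an illustrative sketch: the random set Θ and the choice of 1-Lipschitz maps (absolute value and tanh) are assumptions made for the example.

```python
import math
import random
from itertools import product

def rademacher_average(vectors):
    """Exact E_eps sup_v <eps, v> over a finite set, enumerating all sign vectors."""
    n = len(vectors[0])
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(sum(e * c for e, c in zip(eps, v)) for v in vectors)
    return total / 2 ** n

def compose(phis, vectors):
    """Apply the coordinate-wise maps phi_i to every vector in the set."""
    return [tuple(phi(c) for phi, c in zip(phis, v)) for v in vectors]

rng = random.Random(1)
n = 6
theta = [tuple(rng.uniform(-2, 2) for _ in range(n)) for _ in range(20)]
# Illustrative 1-Lipschitz maps, alternating between |.| and tanh.
phis = [abs if i % 2 == 0 else math.tanh for i in range(n)]
print(rademacher_average(compose(phis, theta)), "<=", rademacher_average(theta))
```

With an L-Lipschitz loss in place of the maps above, the same comparison holds after multiplying the right-hand side by L, which is the content of (16.163).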
for ERM fb may be too loose. The second step only used the fact that fb belongs to F. It
turns out one can localize its place in F better than that. Before turning to this question
of localization, let us point out a relationship between the problems of estimation and
prediction with square loss.
17. REGRESSION. PREDICTION VS ESTIMATION
• Well-specified case: given some class $\mathcal{F}$, assume $f^* \in \mathcal{F}$. More precisely, $P$ is such that
the regression function is in the class $\mathcal{F}$. In this case, (17.166) becomes $\|f - f^*\|^2_{L^2(P)}$.
• Misspecified case: do not insist that f ∗ ∈ F. Upper bounds on (17.166) are called Or-
acle Inequalities in statistics, while the prediction form has been studied in statistical
learning theory (sometimes under the name of Agnostic PAC).
18. NONPARAMETRIC REGRESSION: WELL-SPECIFIED CASE
Yi = f ∗ (xi ) + ηi
where $\widehat{f}$ is the least squares (ERM) constrained to $\mathcal{F}$. In contrast, in random design the goal is w.r.t. $L^2(P)$ with $P$ unknown, while here $P_n$ is known. We write the $L^2(P_n)$ norm more succinctly as $\mathbb{E}\|\widehat{f} - f^*\|_n^2$.
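Since
\[ \widehat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (f(x_i) - Y_i)^2 = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - Y\|_n^2 \]
we have
\[ \|f^* - Y\|_n^2 \ge \|\widehat{f} - Y\|_n^2 = \|\widehat{f} - f^* + f^* - Y\|_n^2 = \|\widehat{f} - f^*\|_n^2 + \|f^* - Y\|_n^2 + 2\langle \widehat{f} - f^*, f^* - Y \rangle_n \]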
which is the Basic Inequality developed earlier in (6.88) for linear regression.
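Rearranging the display above with $Y - f^* = \eta$ gives $\|\widehat{f} - f^*\|_n^2 \le 2\langle \eta, \widehat{f} - f^* \rangle_n$. The sketch below checks this on a hypothetical finite class of value vectors that contains $f^*$; the class, noise level, and helper names are illustrative assumptions.

```python
import random

def inner_n(u, v):
    """Empirical inner product <u, v>_n = (1/n) * sum_i u_i * v_i."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def sq_dist_n(u, v):
    """Squared empirical distance ||u - v||_n^2."""
    diff = [a - b for a, b in zip(u, v)]
    return inner_n(diff, diff)

def least_squares_over_class(candidates, y):
    """ERM over a finite class: the candidate vector closest to y in ||.||_n."""
    return min(candidates, key=lambda f: sq_dist_n(f, y))

rng = random.Random(0)
n = 30
# Hypothetical finite class of value vectors (f(x_1), ..., f(x_n)); f_star is its first element.
candidates = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(25)]
f_star = candidates[0]
eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [a + e for a, e in zip(f_star, eta)]

f_hat = least_squares_over_class(candidates, y)
diff = [a - b for a, b in zip(f_hat, f_star)]
lhs = inner_n(diff, diff)      # ||f_hat - f_star||_n^2
rhs = 2 * inner_n(eta, diff)   # 2 <eta, f_hat - f_star>_n
print(lhs, "<=", rhs)
```

The inequality is deterministic: it uses only that $\widehat{f}$ fits the data at least as well as $f^*$, which is why well-specification ($f^* \in \mathcal{F}$) is needed.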
18.2 1st approach to localization: ratio-type inequalities
To simplify the proof somewhat, we will assume that η1 , . . . , ηn are independent standard
normal N (0, 1).
We proceed as in the linear case earlier in the course. First, we divide both sides of the Basic Inequality (18.167) by $\|\widehat{f} - f^*\|_n$ and further upper bound the right-hand side by a supremum over $f$, removing the dependence of the algorithm on the data:
\[ \|\widehat{f} - f^*\|_n \le 2 \sup_{f \in \mathcal{F}} \left\langle \eta, \frac{f - f^*}{\|f - f^*\|_n} \right\rangle_n \tag{18.169} \]
By squaring both sides, we would get an upper bound on the estimation error (in probability
or in expectation).
Let us use the shorthand F ∗ = F −f ∗ . The rest of the discussion will be about complex-
ity of the neighborhood around f ∗ in F, or, equivalently, complexity of the neighborhood
of 0 in F ∗ . Observe that we only care about values of functions on the data x1 , . . . , xn , so
the discussion is really about the set F ∗ |x1 ,...,xn , drawn in blue below.
At this point, one can say that there is no difference from the linear case, and we should just go ahead and analyze
\[ \sup_{g \in \mathcal{F}^*} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
After all, this is just the Gaussian width (normalized by $\sqrt{n}$) of the subset of the sphere obtained by rescaling all the functions:
\[ K = \{ v \in S^{n-1} : \exists g \in \mathcal{F}^* \ \text{s.t.}\ v = (g(x_1), \ldots, g(x_n)) / (\sqrt{n}\, \|g\|_n) \}. \]
(here the normalization is because $\|g\|_n$ is scaled as $1/\sqrt{n}$ times the $\ell_2$ norm.) How big is this subset of the sphere? Note: if the set is all of $S^{n-1}$, we are doomed since in that case
\[ \sup_{g \in \mathcal{F}^*} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n = \sup_{v \in S^{n-1}} \frac{1}{\sqrt{n}} \langle \eta, v \rangle = \frac{1}{\sqrt{n}} \|\eta\| \sim 1 \]
and does not converge to zero. What we would need is that K is a significantly smaller
subset of the sphere. In the linear case, this was easy: we simply used the fact that the
subset is d-dimensional. However, for nonlinear functions, it is not easy to see what the set
is.
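The contrast between the full sphere and a low-dimensional subset is easy to see numerically. In the sketch below, the "subspace" is simply the set of unit vectors supported on the first $d$ coordinates, a hypothetical stand-in for the linear case; the function names are illustrative.

```python
import math
import random

def normalized_width_full_sphere(eta):
    """sup over v in S^{n-1} of <eta, v> / sqrt(n), which equals ||eta|| / sqrt(n)."""
    return math.sqrt(sum(e * e for e in eta) / len(eta))

def normalized_width_subspace(eta, d):
    """Same supremum restricted to unit vectors supported on the first d coordinates."""
    return math.sqrt(sum(e * e for e in eta[:d]) / len(eta))

rng = random.Random(0)
n, d = 10_000, 10
eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
print(normalized_width_full_sphere(eta))   # close to 1
print(normalized_width_subspace(eta, d), math.sqrt(d / n))
```

The first quantity does not decay with $n$, while the second is of order $\sqrt{d/n}$, matching the rate obtained in the linear case.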
There is a bigger problem, however. Upon rescaling every vector to the sphere, all the functions are treated equally even if their unscaled versions are very close to being zero (that is, close to $f^*$ in the original class $\mathcal{F}$). In other words, the quantity
\[ \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
can be potentially much smaller than the unrestricted supremum. This is depicted in the above figure. If we look at functions within the smaller green sphere, its rescaled version is the whole sphere. However, at larger scales (e.g. the larger green sphere), the set can be much smaller. Understanding the map
\[ u \mapsto \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \]
will be key. In particular, we can break up the balance at scale $u$ and instead have a better upper bound
\[ \|\widehat{f} - f^*\|_n \le u + 2 \sup_{g \in \mathcal{F}^* : \|g\|_n \ge u} \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \tag{18.170} \]
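\[ \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n = \left\langle \eta, \frac{h}{u} \right\rangle_n \]
Hence,
\[ \left\langle \eta, \frac{g}{\|g\|_n} \right\rangle_n \le \frac{1}{u} \sup_{h \in \mathcal{F}^* : \|h\|_n = u} \langle \eta, h \rangle_n \]
Taking a supremum on the LHS over $g$ with $\|g\|_n \ge u$ gives an upper bound on (18.170) as
\[ \|\widehat{f} - f^*\|_n \le u + \frac{2}{u} \sup_{g \in \mathcal{F}^* : \|g\|_n = u} \langle \eta, g \rangle_n \le u + \frac{2}{u} \sup_{g \in \mathcal{F}^* : \|g\|_n \le u} \langle \eta, g \rangle_n \tag{18.171} \]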
where in the last step we included all the functions below level u. We will use concentration
to replace the second term with its expectation. In particular, define
and
G(u) = EZ(u).
If we were to replace Z(u) on the RHS of (18.171) with G(u), the natural balance between
the two terms would be
\[ u = \frac{2}{u} G(u) \]
\[ G(\delta) \le \delta^2/2 \]
One can ask if this critical radius is actually well-defined. This follows from the following:
\[ \langle \eta, h \rangle_n = \frac{\delta}{\delta'} \langle \eta, h' \rangle_n \le \frac{\delta}{\delta'} Z(\delta') \]
Taking supremum on the left-hand side over h with khkn ≤ δ, as well as expectation on
both sides, finishes the proof.
\[ G(u) = u\, \frac{G(u)}{u} \le u\, \frac{G(\delta_n)}{\delta_n} \le u\delta_n/2 \le u^2/2. \tag{18.172} \]
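Because $G(\delta)/\delta$ is non-increasing, the condition $G(\delta) \le \delta^2/2$ is monotone in δ and the critical radius can be located by bisection. The sketch below assumes the critical radius is the smallest δ satisfying that condition; the search interval and the example $G(u) = u\sqrt{d/n}$ are illustrative choices, not taken from the notes.

```python
import math

def critical_radius(G, lo=1e-9, hi=1e3, iters=200):
    """Smallest delta in [lo, hi] with G(delta) <= delta^2 / 2, found by bisection.

    Assumes delta -> G(delta) / delta is non-increasing, so that the condition
    G(delta) / delta <= delta / 2 fails below the critical radius and holds above it.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)   # bisect on a logarithmic scale
        if G(mid) <= mid * mid / 2.0:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative choice (an assumption, not taken from the notes): G(u) = u * sqrt(d / n).
n, d = 1000, 10
delta_n = critical_radius(lambda u: u * math.sqrt(d / n))
print(delta_n, 2 * math.sqrt(d / n))
```

For this linear-type example the fixed point is available in closed form, $\delta_n = 2\sqrt{d/n}$, so the bisection output can be compared against it directly.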
To formally replace Z(u) with G(u) in the balancing equation, we need a concentration
result.
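Lemma 31 (Gaussian Concentration). Let $\eta = (\eta_1, \ldots, \eta_n)$ be a vector of independent standard normals. Let $\phi : \mathbb{R}^n \to \mathbb{R}$ be L-Lipschitz (w.r.t. Euclidean norm). Then for all $t > 0$
\[ P(\phi(\eta) - \mathbb{E}\phi \ge t) \le \exp\left( -\frac{t^2}{2L^2} \right) \]
First, observe that $Z(u)$ is a $(u/\sqrt{n})$-Lipschitz function of η. Omitting the argument $u$,
\[ Z[\eta] - Z[\eta'] \le \sup_{g \in \mathcal{F}^*, \|g\|_n \le u} \left( \langle \eta, g \rangle_n - \langle \eta', g \rangle_n \right) \le \|\eta - \eta'\|_n \sup_{g \in \mathcal{F}^*, \|g\|_n \le u} \|g\|_n \le \frac{u}{\sqrt{n}} \|\eta - \eta'\| \]
In particular, by setting $t = u^2$,
\[ P\left( Z(u) \ge G(u) + u^2 \right) \le \exp\left( -\frac{nu^2}{2} \right) \tag{18.174} \]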
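For reference, the substitution behind (18.174) can be written out. It only combines the Lipschitz constant $L = u/\sqrt{n}$ computed above with Lemma 31 at $t = u^2$:

```latex
\[
\exp\left( -\frac{t^2}{2L^2} \right) \Bigg|_{t = u^2,\; L = u/\sqrt{n}}
= \exp\left( -\frac{u^4}{2u^2/n} \right)
= \exp\left( -\frac{n u^2}{2} \right).
\]
```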
Lemma 32. Assuming $\mathcal{F}^*$ is star-shaped, with probability at least $1 - \exp\left\{ -\frac{nu^2}{2} \right\}$,
for any $u \ge \delta_n$.
\[ \|\widehat{f} - f^*\|_n \le 4u \tag{18.176} \]
with probability at least $1 - \exp\left\{ -\frac{nu^2}{2} \right\}$, for any $u \ge \delta_n$. Squaring both sides yields
Note: in the literature, you will find a slightly different parametrization. Write $\psi(r) = \mathbb{E}Z(\sqrt{r})$. In other words, $\psi(u^2) = G(u)$. Then the property $G(u)/u$ non-increasing translates into ψ having the subroot property:
\[ \psi(ra) \le \sqrt{a}\, \psi(r) \]
using the same type of proof as above. The fixed point then reads as the smallest r such
that ψ(r) ≤ r (ignoring the constant).
Let’s quickly discuss the behavior of G(δ)/δ.
The above sketch shows the function δ 7→ G(δ)/δ for two classes of functions. The
purple curve corresponds to a more complex class, since the Gaussian width (normalized
by δ) grows faster as δ → 0. The corresponding fixed point is larger for a richer class.
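Lemma 33. Let $\delta_n$ be the critical radius. Then for any $c \ge 1$,
\[ P\left( \sup_{g \in \mathcal{F}^*} \left[ 2c\langle \eta, g \rangle_n - \|g\|_n^2 \right] > 2c^2\delta_n^2 + \frac{2c^2 u}{n} \right) \le \exp\{-u/2\} \tag{18.178} \]
In particular,
\[ \mathbb{E} \sup_{g \in \mathcal{F}^*} \left[ 2\langle \eta, g \rangle_n - \|g\|_n^2 \right] \lesssim \delta_n^2 + \frac{1}{n}. \]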
We see that these offset Rademacher/Gaussian averages have the right behavior: we already saw in the first part of the course that the fast rate for linear regression is $O\left( \frac{\mathrm{rank}(\Sigma)}{n} \right)$ without further assumptions.
We can view the negative term that extinguishes the fluctuations of the zero-mean
process as coming from the curvature of the square loss. Without the curvature, the negative
term is not there and we are left with the usual Rademacher/Gaussian averages.
The strategy is to find upper bounds on the left-hand-side in terms of δ and then solve for
the minimal δ. In particular, we know that for any α ≥ 0,
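\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim \alpha + \frac{1}{\sqrt{n}} \int_{\alpha/4}^{\delta} \sqrt{\log N(\mathcal{F}^*, L^2(P_n), \varepsilon)}\, d\varepsilon \tag{18.186} \]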
If the Dudley integral in (18.186) is of the order of the single-scale value (think area under
the curve)
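\[ \delta \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}^*, L^2(P_n), \delta)} \]
then an upper bound on the critical radius is obtained by the balance
\[ \delta \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}^*, L^2(P_n), \delta)} \asymp \delta^2 \tag{18.187} \]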
which matches the optimal rate in (18.185). In this case, least squares is an optimal proce-
dure. Below we compute the fixed point under entropy growth conditions.
18.5.1 Nonparametric
Suppose we have
\[ \log N(\mathcal{F}^*, L^2(P_n), \varepsilon) \lesssim \varepsilon^{-p} \]
for p ∈ (0, 2). Then, taking α = 0,
Setting
\[ n^{-1/2} \delta^{1-p/2} = \delta^2 \]
yields
\[ \delta_n \lesssim n^{-\frac{1}{2+p}} \]
and thus the rate of the least squares estimator is
\[ \mathbb{E}\|\widehat{f} - f^*\|_n^2 \lesssim n^{-\frac{2}{2+p}} \]
Hence, least squares are optimal in this minimax sense for p ∈ (0, 2).
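The exponent bookkeeping in the fixed-point equation can be confirmed with exact rational arithmetic. The sketch below substitutes $\delta = n^{-a}$ with $a = 1/(2+p)$ into $n^{-1/2}\delta^{1-p/2} = \delta^2$ and compares powers of $n$ on both sides; the function names are illustrative.

```python
from fractions import Fraction

def critical_radius_exponent(p):
    """Exponent a with delta_n ~ n^{-a}, from n^{-1/2} * delta^{1 - p/2} = delta^2."""
    return 1 / (2 + Fraction(p))

def localized_rate_exponent(p):
    """Exponent of the squared-error rate delta_n^2 ~ n^{-2/(2+p)}."""
    return 2 * critical_radius_exponent(p)

for p in (Fraction(1, 2), Fraction(1), Fraction(3, 2)):
    a = critical_radius_exponent(p)
    lhs = Fraction(-1, 2) - a * (1 - p / 2)   # exponent of n in n^{-1/2} * delta^{1 - p/2}
    rhs = -2 * a                              # exponent of n in delta^2
    print(p, a, localized_rate_exponent(p), lhs == rhs)
```

For every $p \in (0, 2)$ the localized exponent $2/(2+p)$ exceeds $1/2$, which is the improvement over the rate obtained from global Rademacher averages.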
Figure 2: Optimal (in general) rates $n^{-\frac{2}{2+p}}$ (obtained with localization for $p \in (0, 2)$ by ERM) vs without localization (e.g. via global Rademacher averages)
18.5.2 Parametric
Consider the parametric case,
Then
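\[ \mathbb{E} \sup_{g \in \mathcal{F}^* : \|g\|_n \le \delta} \langle \eta, g \rangle_n \lesssim \frac{1}{\sqrt{n}} \int_0^{\delta} \sqrt{d \log(1 + 2/\varepsilon)}\, d\varepsilon \tag{18.188} \]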
Change of variables gives an upper bound
\[ \delta \cdot \sqrt{\frac{d}{n}} \int_0^1 \sqrt{\log(1 + 2/(u\delta))}\, du \tag{18.189} \]
Unfortunately, this gives a pesky logarithmic factor that should not always be there. For
some parametric cases one can, in fact, prove that local covering numbers behave as
Equating
\[ \delta \sqrt{\frac{d}{n}} \asymp \delta^2 \]
yields
\[ \delta_n^2 \asymp \frac{d}{n} \]
Note that local covering numbers (18.190) are available in some parametric cases (e.g. when
we discretize the parameter space of linear functions) but may not be available for some
other classes (e.g. for VC classes, except under additional conditions).
18.6 Remarks
• To bound metric entropy of F ∗ = F − f ∗ , instead consider F − F. This often leads to
only a mild increase in a constant. For instance, if F is a class of L-Lipschitz functions,
then F − F is a subset of 2L-Lipschitz functions.
• Note that the rate δn2 depends on local covering numbers (or, local complexity) around
f ∗ . This gives a path to proving adaptivity results (e.g. if f ∗ is convex but has
only k linear pieces, the rate of estimation is parametric because its neighborhood is
“simple”).
• A simple counting argument (see Yang & Barron 1999, Section 7) shows that for
rich enough classes (e.g. nonparametric) worst-case local entropy (worst-case location
in the class) and global entropies behave similarly. This implies, in particular, that
instead of constructing a local packing for a lower bound (via hypothesis testing), one
can instead use global entropy with Fano inequality, justifying the LHS of (18.185)
as the lower bound for estimation. See also Mendelson’s “local vs global parameters”
paper for an in-depth discussion.
What if we do not assume the regression function f ∗ is in F? How can we prove an oracle
inequality
\[ \mathbb{E}\|\widehat{f} - f^*\|_n^2 - \inf_{f \in \mathcal{F}} \|f - f^*\|_n^2 \le \phi(\mathcal{F}, n) \]
19.1 Convex F
Suppose F is convex (or, rather, F|x1 ,...,xn is convex). Let fb be the constrained least squares:
\[ \widehat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (f(x_i) - Y_i)^2 = \operatorname*{argmin}_{f \in \mathcal{F}} \|f - Y\|_n^2 \]
but in the misspecified case this is no longer true. However, what is true is that
\[ \|\widehat{f} - Y\|_n^2 \le \|f_{\mathcal{F}} - Y\|_n^2 \]
Unfortunately, this inequality is not strong enough to get us the desired result. Fortunately,
we can do better. Since $\widehat{f}$ is a projection of $Y$ onto $F = \mathcal{F}|_{x_1,\ldots,x_n}$, it holds that
\[ \|\widehat{f} - Y\|_n^2 \le \|f - Y\|_n^2 - \|\widehat{f} - f\|_n^2 \tag{19.192} \]
for any $f \in \mathcal{F}$, and in particular for $f_{\mathcal{F}}$. This is a simple consequence of convexity and the Pythagorean theorem. The negative quadratic will give us the extra juice we need.
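Inequality (19.192) can be checked on any convex set with an explicit projection. The sketch below uses the box $[-1, 1]^n$ as a hypothetical convex set of value vectors, for which the Euclidean projection is coordinate-wise clipping; the function names are illustrative.

```python
import random

def sq_norm_n(u):
    """Squared empirical norm ||u||_n^2 = (1/n) * sum_i u_i^2."""
    return sum(a * a for a in u) / len(u)

def project_onto_box(y, lo=-1.0, hi=1.0):
    """Euclidean projection of y onto the convex set [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(a, lo), hi) for a in y]

def pythagorean_gap(y, f):
    """||f - Y||_n^2 - ||f_hat - f||_n^2 - ||f_hat - Y||_n^2, nonnegative for f in the box."""
    f_hat = project_onto_box(y)
    d_fy = sq_norm_n([a - b for a, b in zip(f, y)])
    d_hf = sq_norm_n([a - b for a, b in zip(f_hat, f)])
    d_hy = sq_norm_n([a - b for a, b in zip(f_hat, y)])
    return d_fy - d_hf - d_hy

rng = random.Random(0)
n = 20
y = [rng.gauss(0.0, 2.0) for _ in range(n)]
f = [rng.uniform(-1.0, 1.0) for _ in range(n)]
print(pythagorean_gap(y, f))
```

The gap is nonnegative because the projection satisfies $\langle Y - \widehat{f}, f - \widehat{f} \rangle_n \le 0$ for every $f$ in the convex set, which is exactly where convexity enters.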
Adding and subtracting f ∗ on both sides and expanding,
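\[ \|\widehat{f} - f^*\|_n^2 + \|f^* - Y\|_n^2 + 2\langle \widehat{f} - f^*, -\eta \rangle_n \le \|f_{\mathcal{F}} - f^*\|_n^2 + \|f^* - Y\|_n^2 + 2\langle f_{\mathcal{F}} - f^*, -\eta \rangle_n - \|f_{\mathcal{F}} - \widehat{f}\|_n^2 \]
which leads to
\[ \|\widehat{f} - f^*\|_n^2 - \|f_{\mathcal{F}} - f^*\|_n^2 \le 2\langle \eta, \widehat{f} - f_{\mathcal{F}} \rangle_n - \|\widehat{f} - f_{\mathcal{F}}\|_n^2 \tag{19.193} \]
\[ \le \sup_{h \in \mathcal{F} - f_{\mathcal{F}}} 2\langle \eta, h \rangle_n - \|h\|_n^2 \tag{19.194} \]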
We conclude that for convex F and fixed design, the upper bounds we find for well-specified
and misspecified cases match. Moreover, since the misspecified case is strictly more general,
and since lower bounds for the well-specified case and polynomial entropy growth (in the
p < 2 regime) match the upper bounds, we conclude that constrained least squares are also
minimax optimal for fixed design misspecified case.
Note: a crucial observation is that offset complexity would arise even if (19.192) had a different constant multiplier in front of $-\|f - \widehat{f}\|_n^2$. We will exploit this observation in a bit.
19.2 General F
What if F is not convex? It turns out that least squares (ERM) can be suboptimal even if
F is a finite class!
19.2.1 A lower bound for ERM (or any proper procedure)
The suboptimality can be illustrated on a very simple example. Suppose X = {x}, Y
is {0, 1}-valued, and F = {f0 , f1 } such that f0 (x) = 0 and f1 (x) = 1. The marginal
distribution is the trivial PX = δx and suppose we have two conditional distributions P0 (Y =
1) = 1/2 − α and P1 (Y = 1) = 1/2 + α. Clearly, the population minimizer for Pj is fj . Also,
under P0 the regression function is f0∗ = 1/2 − α while under P1 it is f1∗ = 1/2 + α. Finally,
ERM is a method that goes after the most frequent observation in the data Y1 , . . . , Yn .
However, if $\alpha \propto 1/\sqrt{n}$, there is a constant probability of error in determining whether $P_0$ or $P_1$ generated the data. Note that the oracle risk is $\min_{f \in \{f_0, f_1\}} \|f - f_i^*\|^2 = (1/2 - \alpha)^2$ while the risk of the estimator is $p(1/2 + \alpha)^2 + (1 - p)(1/2 - \alpha)^2$, where $p$ is the probability of making a mistake and not selecting $f_i$ under the distribution $P_i$. Hence, the overall comparison to the oracle is at least $p((1/2 + \alpha)^2 - (1/2 - \alpha)^2) = 2\alpha p = \Omega(\alpha)$ when $p$ is constant. Hence, ERM (or any “proper” method that selects from $\mathcal{F}$) cannot achieve excess loss smaller than $\Omega(n^{-1/2})$:
\[ \max_{P_i \in \{P_0, P_1\}} \left\{ \mathbb{E}\|\widehat{f} - f_i^*\|^2 - \min_{f \in \{f_0, f_1\}} \|f - f_i^*\|^2 \right\} = \Omega(n^{-1/2}) \]
Yet, an improper method that selects $\widehat{f}$ outside $\mathcal{F}$ can achieve an $O(n^{-1})$ rate.
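The mistake probability $p$ in this example can be computed exactly from the binomial distribution. The sketch below assumes data from $P_1$, odd $n$ (to avoid ties in the majority vote), and the illustrative choice $\alpha = c/\sqrt{n}$ with $c = 0.5$; the function names are hypothetical.

```python
import math

def log_binom_pmf(k, n, q):
    """log P(Binomial(n, q) = k), computed with lgamma to avoid underflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log(1.0 - q))

def erm_error_probability(n, alpha):
    """P(majority vote picks f_0) when Y_i ~ Bernoulli(1/2 + alpha), for odd n."""
    q = 0.5 + alpha
    return sum(math.exp(log_binom_pmf(k, n, q)) for k in range((n - 1) // 2 + 1))

def excess_risk_lower_bound(n, c=0.5):
    """p * ((1/2 + alpha)^2 - (1/2 - alpha)^2) = 2 * alpha * p with alpha = c / sqrt(n)."""
    alpha = c / math.sqrt(n)
    return 2.0 * alpha * erm_error_probability(n, alpha)

for n in (101, 401, 1601):
    print(n, excess_risk_lower_bound(n) * math.sqrt(n))  # roughly constant in n
```

Since the rescaled excess risk stays bounded away from zero as $n$ grows, the excess risk itself decays no faster than $n^{-1/2}$ in this construction.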
A similar simple lower bound can be constructed for ERM with random design.3
where
star(F, g) = {αf + (1 − α)g : f ∈ F, α ∈ [0, 1]}.
Note that fb need not be in F but is an average of two elements of F.
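A two-step procedure of this kind can be sketched for a finite class. The code below assumes the definition suggested by the surrounding text: first compute $\widehat{g}$ by least squares over $\mathcal{F}$, then compute $\widehat{f}$ by least squares over $\mathrm{star}(\mathcal{F}, \widehat{g})$. The function names and the representation of functions as value vectors are illustrative, not the notes' own implementation.

```python
def sq_dist(u, v):
    """Squared empirical distance ||u - v||_n^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def erm(candidates, y):
    """Least squares over a finite class of value vectors."""
    return min(candidates, key=lambda f: sq_dist(f, y))

def star_estimator(candidates, y):
    """Two-step procedure: g_hat = ERM over F, then least squares over star(F, g_hat).

    For each f the best mixing weight on the segment between g_hat and f has a
    closed form, clipped to [0, 1].
    """
    g_hat = erm(candidates, y)
    best = g_hat
    for f in candidates:
        direction = [a - b for a, b in zip(f, g_hat)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            continue
        weight = sum((yi - gi) * di for yi, gi, di in zip(y, g_hat, direction)) / denom
        weight = min(max(weight, 0.0), 1.0)
        point = [gi + weight * di for gi, di in zip(g_hat, direction)]
        if sq_dist(point, y) < sq_dist(best, y):
            best = point
    return best

# The two-function class from the lower bound above, evaluated on n = 4 copies of x.
f0, f1 = [0.0] * 4, [1.0] * 4
print(star_estimator([f0, f1], [1.0, 0.0, 1.0, 1.0]))  # each coordinate is 0.75
```

On the two-point example the output is the sample mean of the observations, an improper prediction that lies strictly between $f_0$ and $f_1$.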
Note: the method is, in general, different from single ERM over a convex hull of F, and
so it is not clear that a version of (19.192) holds [21]:
3 For more detailed discussion, we refer to [20].
4 Proof can be found in Lecué & Mendelson.
5 For a finite class, the above estimator was analyzed by J-Y. Audibert [1].
Lemma 34. For any $f \in \mathcal{F}$,
\[ \|f - Y\|_n^2 - \|\widehat{f} - Y\|_n^2 \ge \frac{1}{18} \|\widehat{f} - f\|_n^2. \tag{19.197} \]
and the same upper bounds hold as in the convex case, up to constants. The difference is
that the supremum is now in star(F, gb) ⊆ F − f ∗ + star(F − F) which is not significantly
larger than F in terms of entropy (unless F is finite, which can be handled separately).
Remarks:
2. the Star Estimator can be viewed as one step of Frank-Wolfe. More steps can improve
the constant.
Exercise: for any α > 0 and a set F ⊂ Rn , the covering numbers satisfy
\[ \log N(F, \|\cdot\|, 2\alpha) \le \log N(\mathrm{star}(F, 0), \|\cdot\|, 2\alpha) \le \log(\mathrm{diam}(F)/\alpha) + \log N(F, \|\cdot\|, \alpha) \]
Here εi are independent Rademacher, but the same results hold for any sub-Gaussian ran-
dom variables.
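Lemma 35. Let $V \subset \mathbb{R}^n$ be a finite set of vectors, $\mathrm{card}(V) = N$. Then for any $c > 0$,
\[ \mathbb{E}_\varepsilon \max_{v \in V} \left[ \langle \varepsilon, v \rangle - c\|v\|^2 \right] \le \frac{\log N}{2c}. \]
Furthermore,
\[ P\left( \max_{v \in V} \left[ \langle \varepsilon, v \rangle - c\|v\|^2 \right] \ge \frac{1}{2c} (\log N + \log(1/\delta)) \right) \le \delta \]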
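The expectation bound of Lemma 35 can be checked exactly for small $n$ by enumerating all sign vectors. The random finite set below is an illustrative assumption, as are the function names.

```python
import math
import random
from itertools import product

def offset_rademacher(vectors, c):
    """Exact E_eps max_v [<eps, v> - c * ||v||^2], enumerating all sign vectors."""
    n = len(vectors[0])
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(sum(e * a for e, a in zip(eps, v)) - c * sum(a * a for a in v)
                     for v in vectors)
    return total / 2 ** n

rng = random.Random(0)
n, N, c = 8, 30, 0.5
V = [tuple(rng.uniform(-3, 3) for _ in range(n)) for _ in range(N)]
print(offset_rademacher(V, c), "<=", math.log(N) / (2 * c))
```

Note that the bound does not depend on the norms of the vectors in $V$: the negative quadratic term compensates for large vectors, in contrast with the usual finite-class maximal inequality.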
We have shown in Lemma 33 that the offset Gaussian process cannot be more than a
constant multiple of the critical radius. But what if the bound of the above theorem is too
loose to be useful? To see that it attains the optimal balance of (18.185) in some cases,
consider the situation where, as in the discussion preceding (18.187), the Dudley entropy
integral is of the order of the single scale estimate
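\[ \gamma \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}, L^2(P_n), \gamma)}. \]
In this case, the optimal balance in (19.198) is
\[ \gamma \times \frac{1}{\sqrt{n}} \sqrt{\log N(\mathcal{F}, L^2(P_n), \gamma)} \asymp \frac{\log N(\mathcal{F}, L^2(P_n), \gamma)}{n}. \tag{19.199} \]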
Dividing and squaring, we do recover (18.185), an optimal rate. We conclude that the upper
bound of Theorem 7 recovers optimal rates for regression in the p ∈ (0, 2) regime and, more
generally, under the above-stated condition on the Dudley integral.
20. TALAGRAND’S INEQUALITY AND APPLICATIONS
For the last half of the course, we have only considered the expected suprema of empirical,
Rademacher, or Gaussian processes. We mentioned that high-probability statements follow
from different arguments. In this lecture, we provide the tools to study deviations of random
suprema above (or below) their expected values.
The following version of Talagrand’s inequality is due to Bousquet:
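Consider a particular case of a singleton $\mathcal{F} = \{f\}$. Then $Z = \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f$ (or the other form), $v = \sigma^2 = \mathrm{var}(f(X))$ because $\mathbb{E}Z = 0$. Then Theorem 8 says that
\[ P\left( \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f \ge \sigma\sqrt{\frac{2t}{n}} + \frac{t}{3n} \right) \le e^{-t} \]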
which is Bernstein’s inequality. Moreover, the constants match those in Bernstein’s inequal-
ity, which is remarkable.
Now, recall the definition of empirical Rademacher averages. In this lecture we will scale
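these averages by 1/n:
\[ \widehat{\mathcal{R}}(\mathcal{F}) = \mathbb{E}_\varepsilon \left[ \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \;\Big|\; X_1, \ldots, X_n \right]. \]
We have $\widehat{\mathcal{R}}(\mathcal{F}) \ge 0$ by Jensen’s inequality. Moreover, this function satisfies a self-bounding property [5, Ch. 6], which implies the following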
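and
\[ P\left( \widehat{\mathcal{R}}(\mathcal{F}) \le \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) - \sqrt{\frac{2t\, \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F})}{n}} \right) \le e^{-t} \]
This first statement has a similar form to that of Theorem 8 (after normalizing $Z$ by $n$) with $v$ replaced by $\mathbb{E}\widehat{\mathcal{R}}(\mathcal{F})$, a consequence of the self-bounding property of $\widehat{\mathcal{R}}(\mathcal{F})$.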
In particular, by using the inequality
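\[ \forall x, y, \lambda > 0, \quad \sqrt{xy} \le \frac{\lambda}{2} x + \frac{1}{2\lambda} y, \]
we have
\[ P\left( \widehat{\mathcal{R}}(\mathcal{F}) \ge 2\mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) + \frac{5t}{6n} \right) \le e^{-t} \]
and
\[ P\left( \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) \ge 2\widehat{\mathcal{R}}(\mathcal{F}) + \frac{2t}{n} \right) \le e^{-t}. \]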
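As a worked step, the second of these bounds can be derived from the lower-tail inequality stated earlier by taking $\lambda = 1$, $x = \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F})$ and $y = 2t/n$ in the elementary inequality. This is a sketch of the algebra only:

```latex
\[
\sqrt{\frac{2t\, \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F})}{n}}
\le \frac{1}{2}\, \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) + \frac{t}{n},
\]
so that, outside an event of probability at most $e^{-t}$,
\[
\widehat{\mathcal{R}}(\mathcal{F})
> \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) - \sqrt{\frac{2t\, \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F})}{n}}
\ge \frac{1}{2}\, \mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) - \frac{t}{n},
\qquad \text{i.e.} \qquad
\mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}) < 2\widehat{\mathcal{R}}(\mathcal{F}) + \frac{2t}{n}.
\]
```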
Finally, recall that symmetrization lemma states that for the supremum of the empirical
process Z in Theorem 8,
\[ \mathbb{E}Z \le 2\mathbb{E}\widehat{\mathcal{R}}(\mathcal{F}). \]
Together with Theorem 8 and Theorem 9, this yields (see e.g. [3, Thm 2.1])
Theorem 10. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Let $\sup_{f \in \mathcal{F}} \mathrm{var}(f(X)) \le \sigma^2$. Then for any $t > 0$, with probability at least $1 - 2e^{-t}$, for any $f \in \mathcal{F}$,
\[ \mathbb{E}f(X) - \frac{1}{n}\sum_{i=1}^n f(X_i) \le 6\widehat{\mathcal{R}}(\mathcal{F}) + \sigma\sqrt{\frac{2t}{n}} + \frac{11t}{n} \tag{20.200} \]
A few remarks. First, the constants here can be balanced differently (see [3, Thm 2.1]). Second, the same result holds with $\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}f(X)$ on the left-hand-side of (20.200).
Third, Theorem 8 can be replaced with McDiarmid’s inequality if one does not aim to take
advantage of small variance σ; however, some of the key results on fast rates in learning
theory do take advantage of this Bernstein-style bound.
Theorem 10 can be applied in a variety of situations. To start, since $\widehat{\mathcal{R}}$ does not depend
on the unknown distribution of X, all the terms (except for σ) on the right-hand-side of
(20.200) can be computed from the data. While the supremum of the empirical process on
the left-hand-side of (20.200) cannot be computed in general since the distribution of X is
not known, the expression provides a data-dependent estimate of this quantity.
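To illustrate how the right-hand side of (20.200) is computed from data, the sketch below estimates the empirical Rademacher average (with the 1/n scaling) by Monte Carlo for a hypothetical class of threshold functions. The class, the number of sign draws, and the crude choice $\sigma = 1$ (valid because the functions take values in $[-1, 1]$) are assumptions made for the example.

```python
import math
import random

def empirical_rademacher(values, num_draws, rng):
    """Monte Carlo estimate of E_eps sup_f (1/n) * sum_i eps_i * f(X_i).

    `values` holds one vector (f(X_1), ..., f(X_n)) per function in the class.
    """
    n = len(values[0])
    total = 0.0
    for _ in range(num_draws):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * v for e, v in zip(eps, f)) for f in values) / n
    return total / num_draws

def deviation_bound(r_hat, sigma, t, n):
    """Right-hand side of (20.200): 6 * R-hat + sigma * sqrt(2t / n) + 11t / n."""
    return 6.0 * r_hat + sigma * math.sqrt(2.0 * t / n) + 11.0 * t / n

rng = random.Random(0)
n = 200
xs = sorted(rng.random() for _ in range(n))
# Hypothetical class: thresholds x -> 1{x >= theta} at the data points, plus the zero function.
values = [[1.0 if x >= theta else 0.0 for x in xs] for theta in xs] + [[0.0] * n]
r_hat = empirical_rademacher(values, num_draws=200, rng=rng)
print(r_hat, deviation_bound(r_hat, sigma=1.0, t=3.0, n=n))
```

Only the sample enters the computation of `r_hat`; the variance proxy σ is the one quantity that has to be supplied from outside, as noted in the text.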
In the setting of prediction and model selection, we could consider a model Gλ such that
$G_\lambda \subseteq G_{\lambda'}$ for $\lambda \le \lambda'$, i.e. λ is a tunable parameter that controls complexity of the model
(e.g. width of a neural network). Inequality (20.200) can then be viewed as an upper bound
on the expected loss of any function in Gλ in terms of its empirical fit to data plus a penalty
term for model complexity, as given by the Rademacher averages. Moreover, this penalty
is data-driven.
The above theorems are also at the heart of proving localization results for random
design, both in the well-specified and misspecified settings. Let us only mention one conse-
quence (see [3, Thm 4.1]).
Theorem 11. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. Suppose for every $f \in \mathcal{F}$, it holds that $\mathbb{E}f(X)^2 \le B\, \mathbb{E}f(X)$. Then with probability at least $1 - 3e^{-t}$, for all $f \in \mathcal{F}$,
\[ \mathbb{E}f(X) \le 2 \cdot \frac{1}{n}\sum_{i=1}^n f(X_i) + c\delta_n^2 + \frac{c' t}{n} \tag{20.201} \]
where $\delta_n$ is the critical radius$^a$ of $\mathrm{star}(\mathcal{F}, 0)$ and $c, c'$ are constants that depend on $B$.
a To be precise, [3, Thm 4.1] includes a confidence term t/n in the computation of the critical radius.
\[ L(\widehat{f}) - L(f_{\mathcal{F}}) \le c\delta_n^2 + \frac{c' t}{n}. \tag{20.202} \]
This conclusion holds under the assumption
which, together with the Lipschitz condition on the loss implies the so-called Bernstein
condition
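\[ \mathbb{E}(\ell \circ f - \ell \circ f_{\mathcal{F}})^2 \le B\, \mathbb{E}(\ell \circ f - \ell \circ f_{\mathcal{F}}) = B(L(f) - L(f_{\mathcal{F}})) \]
Such a condition (or closely-related variants) is implied by, for instance, convexity of $\mathcal{F}$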
and uniform convexity of `, or by low-noise assumptions in classification settings. For the
case of square loss, (20.202) implies a random design oracle inequality in the misspecified
case. We now provide more details for the case of well-specified random design regression
and develop general tools for passing from fixed to random design.
Recall that in fixed design regression we aim to prove that for a given set of points x1 , . . . , xn ,
an estimator (such as constrained least squares) attains
\[
\| \hat{f} - f^* \|^2_{L_2(P_n)} \le \ldots
\]
where on the right-hand side we have either a quantity that goes to zero with n or oracle
risk as in the misspecified case. We would like to analyze random design regression where
X1 , . . . , Xn are i.i.d from P . Importantly, we also measure the risk through the L2 (P ) norm.
However,
\[
\mathbb{E} \| \hat{f} - f^* \|^2_{L_2(P_n)} \ne \mathbb{E} \| \hat{f} - f^* \|^2_{L_2(P)}
\]
since the algorithm fb depends on X1 , . . . , Xn , and so lifting the results from the fixed design
case is not straightforward.
Imagine, however, we could prove that with high probability, for all functions f ∈ F,
In that case, a guarantee for fixed-design regression would translate into a guarantee for
random design regression as long as fb ∈ F (for the Star Algorithm, just enlarge F appro-
priately). Furthermore, as long as ψ(n, F) decays with n at least as fast as the rate of fixed
design regression, we would be able to conclude that random design is not harder than fixed
design. Let’s see if this can be shown.
for a class G of nonnegative functions. In this part of the lecture, we sketch analysis for
uniformly bounded functions. This requirement is necessitated by the use of Theorems 8
and 10.
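Let $\bar{\delta}$ be such that for all $\delta \ge \bar{\delta}$,
\[
\mathbb{E}_\varepsilon \sup_{g \in \mathcal{G} :\, \frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2} \frac{1}{n} \sum_{i=1}^n \varepsilon_i g(X_i) \le \delta^2 / 2 \tag{21.204}
\]
conditionally on $X_1, \ldots, X_n$.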
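Alternatively, we can write (21.204) as
\[
\mathbb{E}_\varepsilon \sup_{g \in \mathcal{G} :\, \| \sqrt{g} \|_n \le \delta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i g(X_i) \le \delta^2 / 2 \tag{21.205}
\]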
The following result can be proved using Theorem 10 (see [6, Theorem 6.1]):
Lemma 36. Let $\mathcal{G}$ be a class of functions with values in $[0, 1]$. Then with probability at least $1 - e^{-t}$, for all $g \in \mathcal{G}$,
\[
\mathbb{E} g(X) \le \frac{2}{n} \sum_{i=1}^n g(X_i) + c \cdot \bar{\delta}^2 + \frac{c' \cdot (t + \log\log n)}{n} \tag{21.206}
\]
Applying this inequality for the class $\mathcal{G} = \{(f - f')^2 : f, f' \in \mathcal{F}\}$, assuming $\mathcal{F}$ is a class of $[0, 1]$-valued functions, yields
\[
\| f - f' \|^2_{L_2(P)} \le 2 \| f - f' \|^2_{L_2(P_n)} + c \cdot \bar{\delta}^2 + \frac{c' \cdot (t + \log\log n)}{n}. \tag{21.207}
\]
A few remarks. First, $\mathcal{G} = (\mathcal{F} - \mathcal{F})^2$ can be replaced by $(\mathcal{F} - f^*)^2$, even if $f^* \notin \mathcal{F}$, as long as the resulting class is uniformly bounded. Second, we observe that (21.204) is defined with a localization restriction $\frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2$ rather than $\frac{1}{n}\sum_{i=1}^n g(X_i)^2 \le \delta^2$ as in the previous lecture. Since functions are bounded by 1, the set
\[
\widehat{\mathcal{M}} := \Big\{ g : \frac{1}{n} \sum_{i=1}^n g(X_i) \le \delta^2 \Big\} \subseteq \{ \|g\|_n^2 \le \delta^2 \}
\]
and hence the set in (21.204) is smaller. Thus the fixed point (21.204) is potentially smaller
than the one defined in the previous lecture.
since square is 2-Lipschitz on [−1, 1]. Balancing this with δ 2 gives, up to constants, precisely
the critical radius of F, as in Definition 21 (modulo the use of Gaussian vs Rademacher
random variables). Interestingly, one can significantly improve upon this argument and
show that the localization radius for F 2 with the left-hand-side of (21.208) can be smaller
than that of F. In particular, a useful result is the following:
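Lemma 37. For any class $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$ of bounded functions, the critical radius in (21.204) for the class $\mathcal{G} = \mathcal{F}^2$ can be upper bounded by a solution to
\[
\frac{12}{\sqrt{n}} \int_{\delta/16}^{1} \sqrt{\log N(\mathcal{F}, L_\infty(P_n), u/2)}\, du \le \delta/4. \tag{21.209}
\]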
Proof. We start by upper bounding the left-hand side of (21.204). Observe that functions in $\mathcal{G}$ are nonnegative and bounded uniformly in $[0, 1]$. As discussed earlier, the restriction $\frac{1}{n}\sum_{i=1}^n g(X_i) \le \delta^2$ implies $\|g\|_n \le \delta$, and hence the left-hand side of (21.204) is upper bounded by
\[
\inf_{\alpha} \Big\{ 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{\delta} \sqrt{\log N(\mathcal{G} \cap \widehat{\mathcal{M}}, L_2(P_n), \varepsilon)}\, d\varepsilon \Big\}. \tag{21.210}
\]
f . Then
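\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \big(f(x_i)^2 - \tilde{f}(x_i)^2\big)^2 &= \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - \tilde{f}(x_i)\big)^2 \big(f(x_i) + \tilde{f}(x_i)\big)^2 \\
&\le \max_i \big(f(x_i) - \tilde{f}(x_i)\big)^2 \cdot \frac{1}{n}\sum_{i=1}^n \big(f(x_i) + \tilde{f}(x_i)\big)^2 \\
&\le \tau^2 \big(2 \|f\|_n^2 + 2 \|\tilde{f}\|_n^2\big) \\
&\le 4\tau^2\delta^2 := \varepsilon^2
\end{aligned}
\]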
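We conclude that
\[
\begin{aligned}
N(\mathcal{G} \cap \widehat{\mathcal{M}}, L_2(P_n), \varepsilon) &\le N(\mathcal{F} \cap \{\|f\|_n \le \delta\}, L_\infty(P_n), \varepsilon/(2\delta)) \\
&\le N(\mathcal{F}, L_\infty(P_n), \varepsilon/(2\delta))
\end{aligned}
\]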
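Substituting into (21.210), the upper bound on the right-hand side becomes
\[
\begin{aligned}
&\inf_{\alpha \ge 0} \Big\{ 4\alpha + \frac{12}{\sqrt{n}} \int_{\alpha}^{\delta} \sqrt{\log N(\mathcal{F}, L_\infty(P_n), \varepsilon/(2\delta))}\, d\varepsilon \Big\} \\
&\qquad \le \delta^2/4 + \delta \times \frac{12}{\sqrt{n}} \int_{\delta/16}^{1} \sqrt{\log N(\mathcal{F}, L_\infty(P_n), u/2)}\, du
\end{aligned}
\]
where we performed the change of variables $u = \varepsilon/\delta$ and chose $\alpha = \delta^2/16$. Using this in (21.204) and balancing with $\delta^2/2$ yields (21.209).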
A key outcome of the above lemma is that the critical radius of G = F 2 (or (F − F)2 )
given by (21.204) is much smaller than that of F. Note that whenever the Dudley integral
in (21.209) converges with δ = 0, the solution is $\delta \propto n^{-1/2}$ (up to log n factors) and hence
the remainder in (21.207) is of the order 1/n, a smaller order term as compared to the rate
of estimation for fixed design. The fact that the remainder term is of a lower order can
be shown, for instance, more generally under the polynomial growth of entropy, or in the
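parametric cases. For instance, for
\[
N(\mathcal{F}, L_\infty(P_n), \varepsilon) \le \Big(\frac{cn}{\varepsilon}\Big)^d,
\]
the localization radius of $\mathcal{G} = \mathcal{F}^2$ can be upper bounded as
\[
\bar{\delta}(\mathcal{G}) = C \sqrt{\frac{d}{n} \log \frac{cn}{d}}
\]
and for a finite class we immediately have
\[
\bar{\delta}(\mathcal{G}) \le C \sqrt{\frac{\log |\mathcal{F}|}{n}}.
\]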
We can also prove a general and useful result, albeit with extra log factors (due to its
generality). Following [31], we have
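Lemma 38. For any class $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$, the critical radius in (21.209) is at most
\[
\bar{\delta}(\mathcal{F}^2) \le C \log^2 n \cdot \bar{R}(\mathcal{F}),
\]
where
\[
\bar{R}(\mathcal{F}) = \sup_{x_1, \ldots, x_n} \widehat{R}(\mathcal{F}).
\]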
Proof. Substitute the following estimate for L∞ covering numbers in terms of the scale-
sensitive dimension (see e.g. [28]):
∞ cn
log N (F, L (Pn ), α) ≤ 2vc(F, cα) · log n · (21.211)
vc(F, cα) · α
and then use the following fact: for any $\alpha > \bar{R}(\mathcal{F})$,
\[
\mathrm{vc}(\mathcal{F}, \alpha) \le \frac{4n \bar{R}(\mathcal{F})^2}{\alpha^2}. \tag{21.212}
\]
This last inequality can be written in the more familiar form
\[
\sup_{\alpha > \bar{R}(\mathcal{F})} \alpha \sqrt{\frac{\mathrm{vc}(\mathcal{F}, \alpha)}{4n}} \le \bar{R}(\mathcal{F}), \tag{21.213}
\]
which bears similarity to Sudakov’s minoration. This inequality is proved by taking the
α-shattered set, replicating it $\lceil n/\mathrm{vc}(\mathcal{F}, \alpha) \rceil$ times, and using our previous argument about
Rademacher averages being large when there is a cube inside the set. We leave it as an
exercise.
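Back to the estimate, we have
\[
\frac{1}{\sqrt{n}} \int_{\delta/64}^{1/4} \sqrt{\log N(\mathcal{F}, L_\infty(P_n), \alpha)}\, d\alpha \lesssim \frac{\sqrt{\log n}}{\sqrt{n}} \int_{\delta/64}^{1/4} \sqrt{\mathrm{vc}(\mathcal{F}, c\alpha) \log \frac{cn}{\alpha}}\, d\alpha \tag{21.214}
\]
\[
\lesssim \sqrt{\log n}\; \bar{R}(\mathcal{F}) \int_{\delta/64}^{1/4} \frac{1}{\alpha} \sqrt{\log \frac{cn}{\alpha}}\, d\alpha \tag{21.215}
\]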
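Hence, for $\mathcal{G} = \mathcal{F}^2$, ignoring logarithmic factors, $\bar{\delta}(\mathcal{G})^2 \le \tilde{O}(n^{-1})$ when $\bar{R}(\mathcal{F}) \lesssim n^{-1/2}$
and $\bar{\delta}(\mathcal{G})^2 \le \tilde{O}(n^{-2/p})$ when $\bar{R}(\mathcal{F}) \lesssim n^{-1/p}$, which is smaller than the rate of estimation for
least squares.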
We conclude that rates of estimation for fixed design translate into rates for estimation
with random design, at least for bounded functions. It is worth emphasizing that the extra
factors one gains from comparing $\|f - f^*\|^2_{L_2(P)}$ to $2\|f - f^*\|^2_{L_2(P_n)}$ are typically of smaller
order than what one gets from denoising for fixed design. The next section provides further
motivation for why this happens, and presents an approach that does not rely on uniform
boundedness of functions.
21.2 Beyond boundedness: the small-ball method
This approach was pioneered by [18] and then developed by Mendelson in a series of papers
starting with [24]. Importantly, this approach does not rely on uniform boundedness of
functions as in the application of Talagrand’s inequality.
Roughly speaking, the realization is that whenever the population norm $\|f\|_{L_2(P)}$ is
large enough, it is highly unlikely that the random empirical norm $\|f\|_{L_2(P_n)}$ can be smaller
than a fraction of the population norm. Moreover, conditions for such a statement to be
true are rather weak and do not require uniform boundedness.
We first recall the Paley-Zygmund inequality (1932) stating that for a nonnegative random variable $Z$ with finite variance,
\[
P(Z \ge t\, \mathbb{E} Z) \ge (1 - t)^2 \frac{(\mathbb{E} Z)^2}{\mathbb{E} Z^2}
\]
for any $0 \le t \le 1$.
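A quick numeric sanity check of the Paley-Zygmund inequality may help fix ideas. The sketch below assumes $Z$ is exponential with rate 1 (a choice made only for this example), for which $\mathbb{E}Z = 1$, $\mathbb{E}Z^2 = 2$ and $P(Z \ge t) = e^{-t}$ are available in closed form; the helper name `paley_zygmund_bound` is hypothetical.

```python
import math


def paley_zygmund_bound(t, mean, second_moment):
    """Right-hand side (1 - t)^2 (EZ)^2 / EZ^2 of the Paley-Zygmund inequality."""
    return (1 - t) ** 2 * mean ** 2 / second_moment


# Assumed example: Z ~ Exponential(1), so EZ = 1, EZ^2 = 2, P(Z >= t EZ) = exp(-t).
ts = [k / 10 for k in range(11)]
exact = [math.exp(-t) for t in ts]
bound = [paley_zygmund_bound(t, 1.0, 2.0) for t in ts]
holds = all(p >= b for p, b in zip(exact, bound))
```

The bound is far from tight here, which is typical: its value lies in requiring only two moments.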
Let us use the following shorthand. We will write $\|f\|_2 = \|f\|_{L_2(P)} = (\mathbb{E} f(X)^2)^{1/2}$ and $\|f\|_4 = \|f\|_{L_4(P)} = (\mathbb{E} f(X)^4)^{1/4}$. Then
\[
P(|f(X)| \ge t \|f\|_2) = P\big(f(X)^2 \ge t^2 \|f\|_2^2\big) \ge (1 - t^2)^2 \frac{\|f\|_2^4}{\|f\|_4^4}
\]
Now, we make an assumption that for every f ∈ F,
for some c.
Under this L4 − L2 norm comparison, it holds that
\[
P(|f(X)| \ge t \|f\|_2) \ge (1 - t^2)^2 c,
\]
an “anti-concentration” inequality. More generally, the condition that there exists c and c0
such that for all f ∈ F,
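Lemma 39. Assume (21.216). Let $\mathcal{F}$ be star-shaped around 0. Then with probability at least $1 - e^{-c_1 n}$,
\[
\inf_{f \in \mathcal{F} :\, \|f\|_2 \ge \tilde{\delta}} \frac{\|f\|_n}{\|f\|_2} \ge c_2
\]
\[
\|f\|_2^2 \lesssim \|f\|_n^2 + \tilde{\delta}^2,
\]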
Proof. Let’s see how we can compare the empirical and population norms, uniformly over
F, given such a condition. First, let’s consider any function with norm $\|f\|_2 = 1$. Observe
that if we could show with high probability
\[
\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c_1\} \ge c_2 \tag{21.218}
\]
for some constants $c_1, c_2$, we would be done, since such a lower bound implies a constant
lower bound $\frac{1}{n}\sum_{i=1}^n f(X_i)^2 \ge c \|f\|_2^2 = c$. By rescaling and assuming star-shapedness,
we would extend the result to all functions in F (above some critical level for which we can
prove (21.218)).
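To spell out the step from the indicator bound to the norm bound (a short worked derivation; it uses only (21.218) with the same constants $c_1, c_2$ and the normalization $\|f\|_2 = 1$): since $f(X_i)^2 \ge c_1^2\, \mathbf{1}\{|f(X_i)| \ge c_1\}$ pointwise,

```latex
\[
\frac{1}{n}\sum_{i=1}^n f(X_i)^2
\;\ge\; c_1^2 \cdot \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c_1\}
\;\ge\; c_1^2 c_2
\;=\; c_1^2 c_2 \,\|f\|_2^2 .
\]
```

So the constant in the lower bound can be taken to be $c_1^2 c_2$.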
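For a given $c > 0$, we have
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c\} &= \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \Big( \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{|f(X_i)| \ge c\} \Big) \\
&\ge \mathbb{E}\mathbf{1}\{|f(X)| \ge 2c\} - \Big( \mathbb{E}\varphi(|f(X)|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big)
\end{aligned}
\]
for $\varphi(u) = 0$ on $(-\infty, c]$, $\varphi(u) = u/c - 1$ on $[c, 2c]$, and $\varphi(u) = 1$ on $[2c, \infty)$.
\[
\ge \inf_{f \in \mathcal{F}} P(|f(X)| \ge 2c \|f\|_2) - \sup_{f \in \mathcal{F},\, \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big)
\]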
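Now, using concentration (since $\varphi(|f|)$ are in $[0, 1]$), the random supremum
\[
\sup_{f \in \mathcal{F},\, \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big)
\]
can be upper bounded with probability at least $1 - e^{-2u^2}$ by its expectation
\[
\mathbb{E} \sup_{f \in \mathcal{F},\, \|f\|_2 = 1} \Big( \mathbb{E}\varphi(|f|) - \frac{1}{n}\sum_{i=1}^n \varphi(|f(X_i)|) \Big) + \frac{u}{\sqrt{n}}
\]
which, in turn, can be upper bounded via symmetrization and the contraction inequality (since
$\varphi$ is $1/c$-Lipschitz) by
\[
\frac{4}{c}\, \mathbb{E} \sup_{f \in \mathcal{F},\, \|f\|_2 = 1} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) + \frac{u}{\sqrt{n}}
\]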
By choosing $u = \sqrt{n} \cdot c''$, we can make the additive term an arbitrarily small constant $c''$.
Now, we see that (21.218) will hold with a non-zero constant $c_2$ as long as
\[
\mathbb{E} \sup_{f \in \mathcal{F},\, \|f\|_2 = 1} \frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i) \le c''
\]
21.3 Example: Random Projections and Johnson-Lindenstrauss lemma
The development here can be seen as a nonlinear generalization of the random projection
method and the Johnson–Lindenstrauss lemma. Let $\Gamma \in \mathbb{R}^{n \times d}$ be an appropriately scaled
random matrix. We then prove that for any fixed $v \in \mathbb{R}^d$, with high probability
\[
\frac{\|\Gamma v\|_2^2}{\|v\|_2^2} \ge 1 - \alpha
\]
where $\alpha \in (0, 1)$. A corresponding uniform statement over a set $V \subset \mathbb{R}^d$ asks that with
high probability,
\[
\inf_{v \in V} \frac{\|\Gamma v\|_2^2}{\|v\|_2^2} \ge 1 - \alpha.
\]
Statements of this form are very useful in statistics, signal processing, etc. The lower
isometry says that the energy of the signal is preserved under random measurement. Or,
the null space of the random matrix Γ is likely to miss (in a quantitative way) the set V . Of
course, if V is too large, it’s not possible to miss it, and so complexity of V (as quantified
by the measures we have studied) enters the picture.
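The fixed-vector statement can be checked numerically with a small simulation. The sketch below assumes a matrix Γ with i.i.d. N(0, 1/n) entries, which is one common scaling and not necessarily the one intended in the notes; the function name `random_projection_ratio` and the dimensions are hypothetical.

```python
import random


def random_projection_ratio(v, n, rng):
    """Return ||Gamma v||_2^2 / ||v||_2^2 for Gamma with i.i.d. N(0, 1/n) entries.

    Rows of Gamma are generated one at a time, so the matrix is never stored.
    """
    squared_norm = 0.0
    for _ in range(n):
        # <g, v> for a fresh standard Gaussian row g; the 1/n scaling is applied below.
        row_dot = sum(rng.gauss(0.0, 1.0) * vj for vj in v)
        squared_norm += row_dot ** 2
    return (squared_norm / n) / sum(vj * vj for vj in v)


rng = random.Random(0)
v = [rng.uniform(-1.0, 1.0) for _ in range(50)]
ratio = random_projection_ratio(v, n=2000, rng=rng)
```

For a single fixed `v` the ratio concentrates around 1 as `n` grows; the uniform statement over a set $V$ is the harder part and is where the complexity of $V$ enters.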
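The connection to today’s lecture can be seen by taking
\[
\Gamma = \frac{1}{\sqrt{n}} \begin{bmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{bmatrix}
\]
while $\|v\|_2^2 = \mathbb{E} \langle v, X \rangle^2$. Each $v \in V$ then corresponds to $f \in \mathcal{F}$ in our earlier notation.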
Of course, from the earlier calculations, we have that with high probability
\[
\| f - f' \|^2_{L_2(P)} \lesssim \bar{\delta}^2
\]
where $\bar{\delta}$ is the localization radius for $(\mathcal{F} - \mathcal{F})^2$ and can be upper bounded by $\sup_{x_{1:n}} \widehat{R}(\mathcal{F})^2$,
up to polylog factors. Alternatively, we can use the fixed point $\tilde{\delta}^2$ under the small ball
property.
We now switch gears and discuss the problem of classification with margin. Recall that
for a class of binary functions G = {g : X → {±1}}, we established learning and uniform
convergence results in terms of the ratio vc(G)/n. Yet, the VC dimension can be easily larger
than the sample size for neural networks (where it is related to the number of parameters)
and high-dimensional linear separators (e.g. kernels). Perhaps more importantly, we do not
usually work with a class G directly but rather with a real-valued class $\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R}\}$,
with the sign of the function determining the class label. That is, in applications of interest,
we work with $\mathrm{sign}(\mathcal{F}) = \{\mathrm{sign}(f) : f \in \mathcal{F}\}$. It is important to realize that a nearly-constant
f can generate a very complex sign(f), a situation we would like to avoid. The large-margin
approach below allows us to replace the complexity of sign(F) with that of F itself, as a class
of real-valued functions.
The proof below utilizes the same technique as that in Lemma 39 (not surprisingly,
when we look at the authors of [19] and [18]).
Let F be a class of R-valued functions. Consider a classification problem with binary
Y ∈ {±1}. Fix γ > 0 as a margin parameter.
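Define $\varphi : \mathbb{R} \to \mathbb{R}$ as
\[
\varphi(s) = \begin{cases} 1 & \text{if } s \le 0 \\ 1 - s/\gamma & \text{if } 0 < s < \gamma \\ 0 & \text{if } s \ge \gamma \end{cases}
\]
Then
\[
\mathbf{1}\{y f(x) \le 0\} \le \varphi(y f(x)) \le \mathbf{1}\{y f(x) \le \gamma\}.
\]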
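The sandwich between the two indicators, and the $1/\gamma$-Lipschitz property that the contraction step relies on, can be checked on a grid. This is a minimal sketch; the function name `ramp`, the value of `gamma`, and the grid are choices made for the example.

```python
def ramp(s, gamma):
    """Margin ramp: 1 for s <= 0, linear on (0, gamma), 0 for s >= gamma."""
    if s <= 0:
        return 1.0
    if s >= gamma:
        return 0.0
    return 1.0 - s / gamma


gamma = 0.5
grid = [k / 100 for k in range(-200, 201)]

# 1{s <= 0} <= ramp(s) <= 1{s <= gamma} at every grid point.
sandwich_ok = all(
    (1.0 if s <= 0 else 0.0) <= ramp(s, gamma) <= (1.0 if s <= gamma else 0.0)
    for s in grid
)

# Consecutive grid values differ by at most |a - b| / gamma (Lipschitz constant 1/gamma).
lipschitz_ok = all(
    abs(ramp(a, gamma) - ramp(b, gamma)) <= abs(a - b) / gamma + 1e-12
    for a, b in zip(grid, grid[1:])
)
```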
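Hence, with probability at least $1 - e^{-2u^2}$, for any $f \in \mathcal{F}$,
\[
\begin{aligned}
\mathbb{E}\mathbf{1}\{Y f(X) \le 0\} - \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\} &\le \sup_{f \in \mathcal{F}} \Big( \mathbb{E}\varphi(Y f(X)) - \frac{1}{n}\sum_{i=1}^n \varphi(Y_i f(X_i)) \Big) \\
&\le \mathbb{E} \sup_{f \in \mathcal{F}} \Big( \mathbb{E}\varphi(Y f(X)) - \frac{1}{n}\sum_{i=1}^n \varphi(Y_i f(X_i)) \Big) + \frac{u}{\sqrt{n}}
\end{aligned}
\]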
By a union bound over a discretization of (0, B], we can prove the following result [19, Thm
2]:
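Theorem 12. Let $\mathcal{F} = \{f : \mathcal{X} \to [-1, 1]\}$. For all $u > 0$, with probability at least $1 - 2e^{-2u^2}$, for all $f \in \mathcal{F}$, $\gamma \in (0, 1]$,
\[
\mathbb{E}\mathbf{1}\{Y f(X) \le 0\} \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{Y_i f(X_i) \le \gamma\} + \frac{8}{\gamma} R(\mathcal{F}) + \sqrt{\frac{\log\log(2/\gamma)}{n}} + \frac{u}{\sqrt{n}} \tag{22.219}
\]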
The key message of this theorem is that the upper bound on the expected error is in terms
of the complexity of F as a class of real-valued functions, rather than the complexity of sign(F).
The price for this is the margin parameter γ, which sets the resolution at which we view
predictions as being incorrect (or not confident enough). Given that we would like to have
a small left-hand side (for some estimator), the above bound suggests maximizing the margin
(i.e. minimizing the number of margin mistakes) while balancing this goal with the complexity
of the class. Methods such as support vector machines or boosting can be seen as directly
or indirectly having this goal.
Finally, suppose we apply the above theorem to some class $\mathcal{F}_B = \{f_\theta : \mathrm{compl}(\theta) \le B\}$,
where compl(θ) is some notion of complexity of the parameter. Suppose $\mathcal{F}_B \subseteq \mathcal{F}_{B'}$ for
$B \le B'$ and that $\sup_{f \in \mathcal{F}_B} \|f\|_\infty \le \psi(B)$. In other words, we allow the function range to
increase (linearly or otherwise) with increasing B. We can then apply a union bound to
obtain a statement for any $f \in \cup_{B > 0} \mathcal{F}_B$ in terms of the complexity of f, defined as the
smallest radius B such that $f \in \mathcal{F}_B$. We leave this as an exercise.
where $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$, $d_0 = d$, and σ is applied coordinate-wise.
In our setting, the architecture of a neural network corresponds to the choices of input
and intermediate dimensions. For the fixed architecture, the set of neural networks we
consider is
F = {fθ : compl(θ) ≤ B}
where compl(θ) is some notion of complexity of the weight matrices. That is, just as in
the case of a class of linear functions $\mathcal{F}_{\mathrm{lin}} = \{x \mapsto \langle w, x \rangle : \|w\| \le B\}$, we would like to
define a “ball” in the space of neural networks.
Note that many tuples (W1 , . . . , WL ) lead to the same function fθ . For example, take
ReLU activation, scale one layer up by 100, another down by 100. The function does not
change under this transformation. There are many transformations that leave the function
intact, and we would like to make sure compl does not assign different values of complexity
to different sets of parameters if they lead to the same function.
As an example, take the Frobenius norm of all the layers:
\[
\mathrm{compl}(\theta) = \sum_{j=1}^{L} \|W_j\|_F
\]
since this is a natural “generalization” of the corresponding Euclidean norm for $\mathcal{F}_{\mathrm{lin}}$. Unfortunately, this measure does not capture the scaling invariance of the layers. However, a
product of Frobenius norms would reflect the invariance (though it may not reflect many
other invariances)
\[
\mathrm{compl}(\theta) = \prod_{j=1}^{L} \|W_j\|_F
\]
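The rescaling example can be made concrete with a tiny two-layer ReLU network. The sketch below uses hypothetical weight matrices and helper names (`network`, `frob`, `scale`); it shows that scaling one layer up by 100 and the other down by 100 leaves the output unchanged, changes the sum of Frobenius norms, and preserves their product.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def frob(W):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in W for w in row))


def scale(W, a):
    return [[a * w for w in row] for row in W]


def network(W1, W2, x):
    """Two-layer network x -> W2 relu(W1 x)."""
    return matvec(W2, relu(matvec(W1, x)))


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, -1.0]]
x = [0.3, 0.7]

W1s, W2s = scale(W1, 100.0), scale(W2, 0.01)
out, out_scaled = network(W1, W2, x), network(W1s, W2s, x)

sum_before, sum_after = frob(W1) + frob(W2), frob(W1s) + frob(W2s)
prod_before, prod_after = frob(W1) * frob(W2), frob(W1s) * frob(W2s)
```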
Of course, it is not at all clear that the Rademacher averages of a unit ball defined with
respect to this complexity are non-vacuous. Remember that we relied heavily on linearity of
functions to analyze $\widehat{R}(\mathcal{F}_{\mathrm{lin}})$.
and the p = 2 case coincides with the Frobenius norm. The p = 1 case is termed nuclear
norm, or trace norm, or Ky Fan norm:
\[
\|A\|_{\mathrm{nuc}} = \sum_{i=1}^{\min(d_1, d_2)} \sigma_i = \mathrm{trace}\big(\sqrt{A^* A}\big).
\]
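Next, we describe entry-wise norms. We start with the sum of $\ell_2$ norms of columns:
\[
\|A\|_{2,1} = \sum_{j=1}^{d_2} \|A_{\cdot,j}\| = \sum_{j=1}^{d_2} \Big( \sum_{i=1}^{d_1} A_{i,j}^2 \Big)^{1/2}
\]
For general $p, q \ge 1$,
\[
\|A\|_{p,q} = \Bigg( \sum_{j=1}^{d_2} \Big( \sum_{i=1}^{d_1} |A_{i,j}|^p \Big)^{q/p} \Bigg)^{1/q}
\]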
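A direct transcription of the $(p, q)$ entry-wise norm for finite $p, q$ is sketched below. The function name `entrywise_norm` and the example matrix are hypothetical; the matrix is stored as a list of rows, so columns are indexed by the second coordinate.

```python
import math


def entrywise_norm(A, p, q):
    """(p, q) entry-wise norm: l_q norm of the vector of column-wise l_p norms."""
    d1, d2 = len(A), len(A[0])
    col_norms = [
        sum(abs(A[i][j]) ** p for i in range(d1)) ** (1.0 / p) for j in range(d2)
    ]
    return sum(c ** q for c in col_norms) ** (1.0 / q)


# Example matrix with columns (3, 4) and (0, 1).
A = [[3.0, 0.0], [4.0, 1.0]]
norm_21 = entrywise_norm(A, 2, 1)  # sum of column l2 norms: 5 + 1
norm_22 = entrywise_norm(A, 2, 2)  # coincides with the Frobenius norm
```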
Lemma 40. Let $\mathcal{F}_i$ be defined recursively as in (23.221), with a base function class $\mathcal{F}_1$
that contains the zero function. Assume σ is 1-Lipschitz and σ(0) = 0. Then
\[
\widehat{R}(\mathcal{F}_i) \le 2 B_i \widehat{R}(\mathcal{F}_{i-1}).
\]
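Proof.
\[
\widehat{R}(\mathcal{F}_i) = \mathbb{E} \sup_{\substack{\|w\|_1 \le B_i \\ f_j \in \mathcal{F}_{i-1}}} \sum_{t=1}^n \varepsilon_t \sum_j w_j \sigma(f_j(x_t)) = \mathbb{E} \sup_{\substack{\|w\|_1 \le B_i \\ f_j \in \mathcal{F}_{i-1}}} \sum_j w_j \sum_{t=1}^n \varepsilon_t \sigma(f_j(x_t))
\]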
Since 0 ∈ F and σ(0) = 0, it also holds that 0 ∈ Fi−1 . Hence both terms in the above max
are nonnegative and we can further upper bound the maximum by the sum of two terms,
which are equal in expectation:
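\[
\mathbb{E} \sup_{f \in \mathcal{F}_{i-1}} \Big| \sum_{t=1}^n \varepsilon_t \sigma(f(x_t)) \Big| \le 2\, \mathbb{E} \sup_{f \in \mathcal{F}_{i-1}} \sum_{t=1}^n \varepsilon_t \sigma(f(x_t)) \le 2 \widehat{R}(\mathcal{F}_{i-1}).
\]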
Observe that the restriction that the norm of incoming weights for every neuron is
bounded as $\|w\|_1 \le B_i$ is equivalent to constraining the rows of $W_i$, which can be written
as
\[
\|W_i^\top\|_{1,\infty} \le B_i.
\]
Hence, we have the following corollary:
\[
\widehat{R}(\mathcal{F}_L) \lesssim 2^L \prod_{i=1}^{L} B_i \cdot \sqrt{\frac{\log d}{n}},
\]
where we also assumed $\mathcal{F}_1 = \{x \mapsto \langle w, x \rangle : \|w\|_1 \le B_1\}$ and $\mathcal{X} \subseteq B_\infty^d$.
It is easy to see that, in general, the factor $2^L$ is superfluous in the above bound. Indeed,
consider a thin neural network $f(x) = w_L \sigma(\ldots \sigma(w_1 x) \ldots)$ with $w_1 \in \mathbb{R}^{1 \times d}$ and all $w_j$
for $j > 1$ nonnegative numbers. Take σ to be ReLU. Then by positive homogeneity of
ReLU,
\[
f(x) = \prod_{j > 1} w_j \cdot \sigma(\langle w_1, x \rangle)
\]
Theorem 13. Let σ be 1-homogeneous (that is, $\sigma(\alpha x) = \alpha \sigma(x)$ for all $x \in \mathbb{R}$, $\alpha \ge 0$).
Suppose $\mathcal{F}$ is a class of functions of the form (23.220) with $\|W_i\|_F \le B_i$. Then
\[
R(\mathcal{F}) \lesssim \sqrt{L} \prod_{i=1}^{L} B_i \cdot \frac{1}{\sqrt{n}}.
\]
Under additional mild assumptions, [12, Corollary 1] also shows a depth-independent upper
bound of order (up to log factors)
\[
\prod_{j=1}^{L} B_j \cdot \frac{1}{n^{1/4}}.
\]
24. BEYOND UNIFORM CONVERGENCE?
24.1 Perceptron
Recall Perceptron and its mistake bound in Lemma 28. Perceptron is an online method
that, given the next $x_t$, predicts the label $\hat{y}_t$ and corrects the hyperplane only in case of a
mistake. Given any sequence, the number of mistakes is at most $\gamma^{-2}$, where γ is the margin
of the sequence.
Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d. from a distribution on $\mathcal{X} \times \{\pm 1\}$, and suppose
$\mathcal{X} \subseteq B_2^d$. Consider the following procedure. Cycle through the data multiple times
until there is a pass with no more mistakes. The length T of the resulting sequence (T is a
multiple of n) is at most $n\gamma^{-2}$, corresponding to the case of one mistake per pass. Let $w_T$
be the final hyperplane output by this procedure. Clearly, it separates the data perfectly,
i.e. $\widehat{L}_{01}(w_T) = 0$ where
\[
\widehat{L}_{01}(w) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{Y_i \langle w, X_i \rangle \le 0\}.
\]
Therefore, the function $\hat{f}(x) = \mathrm{sign}(\langle w_T, x \rangle)$ is a particular ERM solution (one of many).
Can we say anything about future performance of wT on data from the same distribution?
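The cycling procedure can be sketched in a few lines. The synthetic data below (points in the unit disk, labels given by a hypothetical unit vector `w_star`, points within 0.1 of the boundary discarded) are assumptions made for the example, as are the names `cycle_perceptron` and `max_passes`.

```python
import math
import random


def cycle_perceptron(data, max_passes=10_000):
    """Cycle Perceptron over `data` until a full pass makes no mistakes.

    Returns the final weight vector and the total number of mistakes.
    """
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                # Mistake: move the hyperplane toward the correct side.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            return w, mistakes
    raise RuntimeError("no clean pass; data may not be linearly separable")


# Hypothetical separable data in the unit disk with margin at least 0.1.
rng = random.Random(0)
w_star = (math.cos(0.3), math.sin(0.3))
data = []
while len(data) < 200:
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    score = w_star[0] * x[0] + w_star[1] * x[1]
    if math.hypot(*x) <= 1 and abs(score) >= 0.1:
        data.append((x, 1 if score > 0 else -1))

gamma = min(y * (w_star[0] * x[0] + w_star[1] * x[1]) for x, y in data)
w_T, mistakes = cycle_perceptron(data)
train_errors = sum(1 for x, y in data if y * (w_T[0] * x[0] + w_T[1] * x[1]) <= 0)
```

The total number of mistakes across all passes is controlled by the margin of the sample, and the returned `w_T` has zero empirical 0-1 loss by construction.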
Lemma 41. Let $w_T$ be the output of Perceptron after no mistakes are made in a pass
over the i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \{\pm 1\}$, and let $\mathcal{X} \subseteq B_2^d$. Let γ be a
(random) margin of $n + 1$ data points drawn i.i.d. from the distribution. Then
\[
L_{01}(w_T) = \mathbb{E}\mathbf{1}\{Y \langle w_T, X \rangle \le 0\} \le \frac{1}{n+1} \times \mathbb{E}[\gamma^{-2}]
\]
Proof. Let us use the notation $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, $Z_i = (X_i, Y_i)$, $Z = (X, Y)$ and
$\ell(w, Z) = \mathbf{1}\{Y \langle w, X \rangle \le 0\}$. First,
\[
\mathbb{E}_S \mathbb{E}_Z\, \ell(w_T, Z) = \mathbb{E}_{S, Z_{n+1}} \Bigg[ \frac{1}{n+1} \sum_{t=1}^{n+1} \ell(w^{(-t)}, Z_t) \Bigg] \tag{24.222}
\]
where $w^{(-t)}$ is Perceptron’s final hyperplane after (hypothetically) cycling through data
$Z_1, \ldots, Z_{t-1}, Z_{t+1}, \ldots, Z_{n+1}$. That is, leave-one-out is an unbiased estimate of the expected loss.
Now consider cycling Perceptron on $Z_1, \ldots, Z_{n+1}$ until no more errors, and call the
output $\bar{w}$. Let $i_1, \ldots, i_m$ be indices on which Perceptron errs in any of the cycles. We
know $m \le \gamma^{-2}$. However, if index $t \notin \{i_1, \ldots, i_m\}$, then whether or not $Z_t$ was included
in the computation of $\bar{w}$ does not matter, and so $\bar{w} = w^{(-t)}$. Furthermore, $Z_t$ is correctly
classified by $w^{(-t)}$. Thus, at most $\gamma^{-2}$ terms in (24.222) can be nonzero.
to obtain a better result. This observation motivates two questions: (1) is there a version
of Theorem 12 that achieves the correct rate? and (2) does the Perceptron-based argument
magically avoid uniform convergence altogether? The answer to the first question is yes,
and it involves developing an L∗ -style bound, beyond the scope of this course. But the
answer to the second question is more subtle, and should be morally taken as a ‘no’. The
mechanism employed in the proof of the above lemma is a version of the so-called online-to-
batch conversion, where one first proves an online mistake or regret bound for an arbitrary
sequence and then uses the i.i.d. nature of the sequence to derive a result on expected
loss. However, the very fact that one can show an online mistake bound or an online regret
bound for an arbitrary sequence implies a stronger version of uniform convergence – uniform
convergence for martingales. We will describe this in detail in the last lecture.
For a large part of the course, we studied risk bounds for fb defined implicitly as an empirical
minimizer over some class of functions. Yet, in certain situations, an estimator of interest is
defined explicitly. This was the case, for instance, with linear unconstrained regression. In
that case, however, we opted for not using the closed-form solution since such an approach
would not be generalizable to nonlinear cases (or even to linear constrained regression). Here
we describe a classical approach that is convenient for analyzing closed-form estimators.
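We consider random design regression. To this end, let P be the law of (X, Y), and
$f^*(x) = \mathbb{E}[Y | X = x]$ be the regression function. We write $Y_i = f^*(X_i) + \xi_i$ for zero-mean $\xi_i$.
Let $\mathbf{X} = [X_1, \ldots, X_n]^\top \in \mathbb{R}^{n \times d}$ and $\mathbf{y} = [Y_1, \ldots, Y_n]^\top$. Given an estimator $\hat{f}(\cdot) = \hat{f}(\cdot; \mathbf{X}, \mathbf{y})$,
define
\[
B^2 = \mathbb{E}_X \big( f^*(X) - \mathbb{E}_{\mathbf{y}} \hat{f}(X) \big)^2, \qquad V = \mathbb{E}_{X, \mathbf{y}} \big( \hat{f}(X) - \mathbb{E}_{\mathbf{y}} \hat{f}(X) \big)^2. \tag{25.223}
\]
Both $B^2$ and $V$ are random variables (in $\mathbf{X}$), and it is easy to check that
\[
\mathbb{E} \|\hat{f} - f^*\|^2_{L_2(P)} = \mathbb{E}_{\mathbf{X}} \big[ B^2 \big] + \mathbb{E}_{\mathbf{X}} [V]. \tag{25.224}
\]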
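We then have
\[
B^2 = \mathbb{E}_X \Big( f^*(X) - \sum_{i=1}^n f^*(X_i)\, \omega_i(X) \Big)^2 \tag{25.226}
\]
and
\[
V = \mathbb{E}_{X, \xi} \Big( \sum_{i=1}^n \xi_i\, \omega_i(X) \Big)^2 \le \sigma_\xi^2 \sum_{i=1}^n \mathbb{E}_X (\omega_i(X))^2. \tag{25.227}
\]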
The form of (25.226) and (25.227) is particularly useful for analyzing “local methods”.
Indeed, let’s think of ωi (x) as the “relevance” of example (Xi , Yi ) to the given point x. In
this case, fb(x) in (25.225) aggregates the responses Yi according to these weights. The bias
term then asks whether the problem is easy if there is no noise ξi , and (25.226) has the
interpretation of the expected difference between the value of the true regression function at
X and its “reconstruction” from datapoints, assuming no noise. Smoothness of f ∗ helps to
upper bound this term. The variance term increases with the noise level $\sigma_\xi^2$ and the sum of
squared $L_2(P)$-norms of the weight functions. If these weight functions $\omega_i$ are sufficiently localized
around Xi , one can often compute simple upper bounds on the variance term.
\[
\omega_i(x) = \frac{K((x - x_i)/h)}{\sum_{j=1}^n K((x - x_j)/h)} \tag{25.228}
\]
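\[
\hat{f}(x) = \langle \hat{\theta}, x \rangle = \langle \mathbf{X}^\dagger \mathbf{y}, x \rangle = (\mathbf{X} x)^\top (\mathbf{X} \mathbf{X}^\top)^{-1} \mathbf{y}, \tag{25.229}
\]
This solution can be written as $\hat{f}(x) = \sum_{i=1}^n Y_i\, \omega_i(x)$, where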
To avoid confusion, we will use the lower-case x for a random x ∼ P . The bias is then
\[
B^2 = \mathbb{E}_x \langle P^\perp x, \theta^* \rangle^2 = \big\| \Sigma^{1/2} P^\perp \theta^* \big\|_2^2, \tag{25.231}
\]
25.4 Example: Kernel Ridge/Ridgeless Regression
Observe that the solution for Least Squares and Regularized Least Squares only depends
on inner products between data points Xi and Xj , i, j ∈ [n]. There are several ways to
motivate kernel methods, but the one we take here just replaces x with some feature map
$\varphi(x) : \mathcal{X} \to \mathbb{R}^D$ with D large or infinite. Let $\Phi \in \mathbb{R}^{n \times D}$ be the matrix with rows $\varphi(X_i)^\top$.
From the earlier discussion, the least squares solution in this high- or infinite-dimensional
space is simply
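\[
\hat{f}(x) = \langle \hat{\theta}, \varphi(x) \rangle = (\Phi \varphi(x))^\top (\Phi \Phi^\top)^{-1} \mathbf{y}. \tag{25.235}
\]
\[
\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{n} \| \mathbf{y} - f_n(\theta) \|^2 \tag{25.239}
\]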
where $f_n : \theta \mapsto (f(X_1, \theta), \ldots, f(X_n, \theta))$ is the evaluation of the function parametrized by
θ on the data. Despite potential non-convexity of the problem, we can aim to minimize
the squared loss by gradient flow (or gradient descent). Taking θ 0 as a starting point, the
evolution is given by
\[
\frac{d\theta_t}{dt} = \frac{1}{n} Df_n(\theta_t)^\top (\mathbf{y} - f_n(\theta_t)) \tag{25.240}
\]
where Dfn (θ) ∈ Rn×p is the Jacobian of fn . Let us linearize
\[
f_n(\theta_t) \approx f_n(\theta_0) + Df_n(\theta_0)(\theta_t - \theta_0)
\]
around θ 0 . This linearization can be a good approximation if θ does not move too far from
θ 0 and fn is “regular” enough. Since the linearization introduces different dynamics, we
use θ̄ t to denote it. We have
\[
\frac{d\bar{\theta}_t}{dt} = \frac{1}{n} Df_n(\theta_0)^\top \big( \mathbf{y} - f_n(\theta_0) - Df_n(\theta_0)(\bar{\theta}_t - \theta_0) \big) \tag{25.241}
\]
Under certain conditions (see e.g. [4, Thm 5.1]), parameters θ t stay close to θ̄ t , square
loss (25.239) decays exponentially fast to 0 under the dynamics of θ t , and function values
f (x, θ t ) are close to those of the linear model
we see that the data X1 , . . . , Xn are mapped to a feature space, with the feature matrix
being
Φ = Dfn (θ0 ),
and then we find the best parameter θ in this feature space.
Suppose we initialize the model in such a way that $f_n(\theta_0) = 0$ (or approximately 0) so
that we can drop it from the above expression. In this case, $\Phi \Phi^\top$ is the kernel matrix
If this kernel matrix is full-rank, the linearized gradient flow can be shown to converge to
the minimum-norm interpolant of the data:
\[
\bar{\theta}_\infty = \operatorname*{argmin}_{\theta} \big\{ \| \theta - \theta_0 \|_2 : Df_n(\theta_0)(\theta - \theta_0) = \mathbf{y} - f_n(\theta_0) \big\}
\]
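For a model that is exactly linear in θ, the linearized flow coincides with the true flow, and the convergence to the minimum-norm interpolant can be observed on a toy problem. The sketch below uses an explicit Euler discretization (an assumption of this example, since the text works with continuous-time flow); the function name, step size, and the one-sample, two-parameter data are hypothetical.

```python
def gradient_flow_linear(X, y, theta0, step=0.05, iters=2000):
    """Euler discretization of d(theta)/dt = (1/n) X^T (y - X theta)."""
    n, p = len(X), len(theta0)
    theta = list(theta0)
    for _ in range(iters):
        resid = [yi - sum(a * b for a, b in zip(row, theta)) for row, yi in zip(X, y)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta


# One sample, two parameters: every theta with theta_1 + 2 theta_2 = 5 interpolates.
X = [[1.0, 2.0]]
y = [5.0]
theta_inf = gradient_flow_linear(X, y, theta0=[0.0, 0.0])
# Starting from 0, the iterates stay in the row space of X, so the limit is the
# interpolant closest to theta0, namely x * y / ||x||^2 = (1, 2).
```

Starting from a different `theta0` would lead to a different interpolant, the one closest to that starting point, which matches the argmin characterization above.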
where we assumed bi = ±1. Now, suppose the weights wi ∼ N (0, Id /d), independently. As
the number of neurons m increases, the finite-width kernel Km converges (under conditions)
to an infinite-width NTK given by
Under the conditions which ensure that the nonlinear gradient flow (25.240) stays close
to the linearized gradient flow, θ t converges to a minimum-norm interpolant of the data
with respect to the NTK kernel (see e.g. [4]), and can be analyzed with the bias-variance
decomposition. These cases are among the few where we can provably analyze both opti-
mization and statistical properties of neural network models. Arguably, however, the linear
regime is not very interesting in practice.
Another aspect we have not discussed here is that of interpolation. See [4].
\[
x_{t+1} = f^*(x_t) + \eta_t, \qquad t = 1, \ldots, n
\]
where xt ∈ Rd and ηt are independent zero mean vectors. The function f ∗ is unknown, but
we assume it is a member of a known class F = {f : Rd → Rd }. Let us treat this problem
as a fixed-design regression problem, except that the outcomes are now vectors rather than
reals, and the sequence x1 , . . . , xn is a sequence of dependent random variables.
Consider the least squares solution:
\[
\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \| x_{t+1} - f(x_t) \|_2^2,
\]
where the norm is the Euclidean norm. This is a natural generalization of least squares to
vector-valued regression. As before, we denote
\[
\| f - g \|_n^2 = \frac{1}{n} \sum_{t=1}^n \| f(x_t) - g(x_t) \|_2^2
\]
Up until now, the statement is conditional on {η1 , . . . , ηn }. What happens if we take expec-
tations on both sides? On the left-hand side we have a denoising guarantee on the sequence.
On the right-hand side, we have a “dependent version” of offset Gaussian/Rademacher com-
plexity where xt is measurable with respect to σ(η1 , . . . , ηt−1 ). To analyze this object, we
first need to understand the simpler R-valued version without the offset: what is the behavior of
\[
\mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \varepsilon_t f(x_t)
\]
27. SEQUENTIAL COMPLEXITIES
We choose to study the random process generated by Rademacher random variables for
several reasons. First, just as in the classical case, conditioning on the data will lead to
a simpler object (binary tree) and, second, other noise processes can be reduced to the
Rademacher case, under moment assumptions on the noise. The development here is based
on [27], and we refer also to [26] for an introduction.
Let us elaborate on the first point. Note that $x_t$ being measurable with respect to
$\sigma(\varepsilon_1, \ldots, \varepsilon_{t-1})$ simply means $x_t$ is a function of $\varepsilon_1, \ldots, \varepsilon_{t-1}$ (in other words, it’s a predictable
process). Note that the collection $x_1, \ldots, x_n$ can be “summarized” as a depth-n binary tree
decorated with elements of $\mathcal{X}$ at the nodes. Indeed, $x_1 \in \mathcal{X}$ is a constant (root), $x_2 = x_2(\varepsilon_1)$
takes on two possible values depending on the sign of $\varepsilon_1$ (left or right), and so forth. It is
useful to think of $(x_1, \ldots, x_n)$ as a tree, even though it doesn’t bring any more information
into the picture. We shall denote the collection of n functions $x_i : \{\pm 1\}^{i-1} \to \mathcal{X}$ as
$\mathbf{x} = (x_1, \ldots, x_n)$ and simply call it an $\mathcal{X}$-valued tree. We shall refer to $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$
as a path in the tree. We will also talk about R-valued trees, such as $f \circ \mathbf{x}$ for $f : \mathcal{X} \to \mathbb{R}$.
Given a tree $\mathbf{x}$, we shall call
\[
\widehat{R}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}) = \mathbb{E} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \varepsilon_t f(x_t(\varepsilon_1, \ldots, \varepsilon_{t-1}))
\]
where $x_1, \ldots, x_n$ are constant values, we see that it is a special case of a tree with constant
levels $x_t(\varepsilon_1, \ldots, \varepsilon_{t-1}) = x_t$. Hence, sequential Rademacher complexity is a generalization of
the classical notion.
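For small depth, the sequential Rademacher complexity of a finite class on a given tree can be computed exactly by enumerating all $2^n$ paths. The sketch below is an illustration under assumptions of its own: the tree is represented as a function `tree(t, prefix)` with 0-indexed levels, the example tree is a dyadic binary-search tree on [0, 1], and the class consists of threshold functions placed at the leaves. None of these names come from the notes.

```python
from itertools import product


def sequential_rademacher(functions, tree, n):
    """Exact E_eps sup_f (1/n) sum_t eps_t f(x_t(eps_1..eps_{t-1})) by enumeration.

    `tree(t, prefix)` returns the node at level t (0-indexed) reached by the
    sign prefix of length t.
    """
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        xs = [tree(t, eps[:t]) for t in range(n)]
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) / n for f in functions)
    return total / 2 ** n


def dyadic_tree(t, prefix):
    """Binary-search tree on [0, 1]: start at 1/2, move right on +1, left on -1."""
    return 0.5 + sum(e * 2.0 ** -(s + 2) for s, e in enumerate(prefix))


def threshold(theta):
    return lambda x: 1 if x < theta else -1


n = 3
leaves = [dyadic_tree(n, eps) for eps in product((-1, 1), repeat=n)]
functions = [threshold(th) for th in leaves]

seq_value = sequential_rademacher(functions, dyadic_tree, n)

# Same class on a tree with constant levels recovers the classical average.
fixed_points = [0.25, 0.5, 0.75]
const_value = sequential_rademacher(functions, lambda t, prefix: fixed_points[t], n)
```

On the dyadic tree every sign pattern is realized along its own path by some threshold, so the supremum equals 1 on each path; on fixed points the thresholds realize only a few sign patterns and the value is strictly smaller.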
To ease the notation, we will write $x_t$ without explicit dependence on $\varepsilon$, or for brevity
write $x_t(\varepsilon)$ even though $x_t$ only depends on the prefix $\varepsilon_{1:t-1}$.
Observe that for any $f \in \mathcal{F}$, the variable
\[
\nu_f = \frac{1}{n} \sum_{t=1}^n \varepsilon_t f(x_t)
\]
Given that there is no difference with the classical case, one may wonder if we can just reduce
everything to the classical Rademacher averages. The answer is no, and the differences
already start to appear when we attempt to define covering numbers.
More precisely, since any tree $\mathbf{x}$ is defined by $2^n - 1$ values, one might wonder if we could
define a notion of pseudo-distance between $f$ and $f'$ as an $\ell_2$ distance on these $2^n - 1$ values.
It is easy to see that this is a huge overkill. Perhaps one of the key points to understand
here is: what is the equivalent of the projection $\mathcal{F}|_{x_1, \ldots, x_n}$ for the tree case? Spoiler: it’s
not F|x . The following turns out to be the right definition:
The key aspect of this definition is that $v \in V$ can be chosen based on the sequence
$\varepsilon \in \{\pm 1\}^n$. In other words, in contrast with the classical definition, for the same function f
different elements $v \in V$ can provide a cover on different paths. This results in the needed
reduction in the size of V.
As an example, take a set of $2^{n-1}$ functions that take a value of 1 on one of the $2^{n-1}$
leaves of $\mathbf{x}$ and zero everywhere else. Then the projection $\mathcal{F}|_{\mathbf{x}}$ is of size $2^{n-1}$ but the size
of the 0-cover is only 2 (exercise!), corresponding to our intuition that the class is simple
(as it only varies on the last example). Indeed, the size of the 0-cover is the analogue of the
size of $\mathcal{F}|_{x_1, \ldots, x_n}$ in the binary-valued case.
For real-valued functions, consider the following definition.
The size of the smallest α-cover of F on a tree x with respect to `2 will be denoted by
N2 (F, x, α).
Recall the definition of VC dimension and a shattered set. Here is the right sequential
analogue:
The largest depth d for which there exists a shattered $\mathcal{X}$-valued tree is called the Littlestone dimension and denoted by $\mathrm{ldim}(\mathcal{F})$.
To contrast with the classical definition, the path on which the signs should be realized is
given by the path itself. But it’s clear that the definition serves the same purpose: if $\mathbf{x}$ is
shattered by $\mathcal{F}$ then $\widehat{R}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}) = 1$. It is also easy to see that $\mathrm{vc}(\mathcal{F}) \le \mathrm{ldim}(\mathcal{F})$, and the
gap can be infinite.
The following is an analogue of the Sauer-Shelah-Vapnik-Chervonenkis lemma.
The largest depth d for which there exists an α-shattered $\mathcal{X}$-valued tree is called the sequential scale-sensitive dimension and denoted $\mathrm{ldim}(\mathcal{F}, \alpha)$.
We note that the above definitions reduce to the classical ones if we consider only trees
x with constant levels.
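Theorem 16. For any class of $[-1, 1]$-valued functions $\mathcal{F}$ and $\mathcal{X}$-valued tree $\mathbf{x}$ of depth
$n$,
\[
N_\infty(\mathcal{F}, \mathbf{x}, \alpha) \le \left( \frac{2en}{\alpha} \right)^{\mathrm{ldim}(\mathcal{F}, \alpha)}.
\]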
If the sequence $(X_1, \ldots, X_n)$ is i.i.d., the left-hand side is the expected supremum of the
empirical process. The present version provides a martingale generalization. Furthermore,
if we take the supremum over all joint distributions on the left-hand side, then the lower bound
also matches the upper bound, up to a constant.
The offset Rademacher complexity has been analyzed in [25].
28. ONLINE LEARNING
small for any sequence $(x_1, y_1), \ldots, (x_n, y_n)$. At least visually, this looks like oracle
inequalities for misspecified models. The distinguishing feature of this online framework is that
(a) data arrives sequentially, and (b) we aim to have low regret for any sequence without
assuming any generative process.
It is also worth noting that in the above protocol there is no separation of training and
test data: the online nature of the problem allows us to first test our current hypothesis
by making a prediction, then observe the outcome and incorporate the datum into our
dataset.
The expectation on the first term in (28.245) is with respect to the learner’s internal
randomization. More specifically, let $Q_t$ be the distribution on $\{\pm 1\}$ that the learner uses to
predict $\hat{y}_t \sim Q_t$. Let $q_t = E\,\hat{y}_t$ be the (conditional) mean of this distribution. For instance,
$q_t = 0$ would correspond to the learner tossing a fair coin.
A note about the protocol. The results below hold even if the sequence is chosen based
on the learner’s past predictions. However, in this case, $y_t$ may only depend on $q_t$ but not
on the realization $\hat{y}_t$. To simplify the presentation, let us just assume that the sequence
$(x_1, y_1), \ldots, (x_n, y_n)$ is fixed in advance (this turns out not to matter).
We will answer the following question: what is the best achievable Reg(F) for a given
F by any prediction strategy?
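Let us first rewrite $\mathbf{1}\{\hat{y}_t \neq y_t\} = (1 - \hat{y}_t y_t)/2$ and do the same for the oracle term.
Cancelling $1/2$, we have
\[
2\,\mathrm{Reg}(\mathcal{F}) = \frac{1}{n} \sum_{t=1}^n (-q_t y_t) - \inf_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^n (-y_t f(x_t)) \right] \tag{28.246}
\]
\[
= \sup_{f \in \mathcal{F}} \left[ \frac{1}{n} \sum_{t=1}^n y_t f(x_t) \right] - \frac{1}{n} \sum_{t=1}^n q_t y_t \tag{28.247}
\]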
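The rewriting can be sanity-checked numerically. The sketch below assumes, for illustration, that $\mathrm{Reg}(\mathcal{F})$ in (28.245) is the learner's expected average number of mistakes minus that of the best $f \in \mathcal{F}$, and it represents each $f$ by its $\pm 1$ predictions on the sequence. The function names and the random test data are hypothetical.

```python
import random

def regret_from_mistakes(q, y, preds):
    """Expected average mistakes of the learner minus those of the best f.

    q[t] is the mean of the learner's distribution Q_t on {-1, +1},
    so P(yhat_t = +1) = (1 + q[t]) / 2.
    """
    n = len(y)
    learner = 0.0
    for qt, yt in zip(q, y):
        p_plus = (1 + qt) / 2
        learner += (1 - p_plus) if yt == 1 else p_plus   # P(yhat_t != y_t)
    oracle = min(sum(ft != yt for ft, yt in zip(f, y)) for f in preds)
    return (learner - oracle) / n

def regret_from_correlations(q, y, preds):
    """Half of the right-hand side of (28.247)."""
    n = len(y)
    best = max(sum(yt * ft for yt, ft in zip(y, f)) for f in preds) / n
    return 0.5 * (best - sum(qt * yt for qt, yt in zip(q, y)) / n)

rng = random.Random(0)
n = 20
y = [rng.choice([-1, 1]) for _ in range(n)]                  # outcomes
q = [rng.uniform(-1, 1) for _ in range(n)]                   # learner's means
preds = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(5)]  # 5 functions

gap = abs(regret_from_mistakes(q, y, preds) - regret_from_correlations(q, y, preds))
print(gap < 1e-12)
```

The two computations agree because each indicator loss equals one half minus half a correlation, and the constant halves cancel between the learner's term and the oracle term.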
Now, consider a particular stochastic process for generating the data sequence: fix any
$\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, and on round $t$ let $x_t = \mathbf{x}_t(y_1, \ldots, y_{t-1})$ and $y_t = \epsilon_t$ be
an independent Rademacher random variable. This defines a stochastic process with $2^n$
possible sequences $(x_1, y_1), \ldots, (x_n, y_n)$. Now, clearly
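Hence, taking expectations in (28.247),
\[
2\,E\,\mathrm{Reg}(\mathcal{F}) = E \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{t=1}^n \epsilon_t f(x_t) \right]. \tag{28.248}
\]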
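Since the argument holds for any $\mathbf{x}$, we have proved that the optimal value of $\mathrm{Reg}(\mathcal{F})$ is
lower bounded by half of
\[
\bar{R}^{\mathrm{seq}}(\mathcal{F}) = \sup_{\mathbf{x}} \widehat{R}^{\mathrm{seq}}(\mathcal{F}, \mathbf{x}).
\]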
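The lower-bound argument can be illustrated by exact enumeration over all $2^n$ sign paths. The sketch below rests on assumptions that are not in the notes: the class is thresholds on the integers $1, \ldots, 7$, the tree is a binary-search tree of depth 3, and `majority_strategy` is an arbitrary hypothetical learner whose $q_t$ depends only on past labels.

```python
from itertools import product

def seq_rademacher(F, tree, n):
    """E_eps sup_f (1/n) sum_t eps_t f(x_t(eps)), averaged over all 2^n paths."""
    total = 0.0
    for eps in product([-1, 1], repeat=n):
        xs = [tree(eps[:t]) for t in range(n)]
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) for f in F) / n
    return total / 2 ** n

def expected_regret(F, tree, n, strategy):
    """Exact expected regret when x_t = tree(y_1..y_{t-1}) and y_t = eps_t."""
    total = 0.0
    for eps in product([-1, 1], repeat=n):
        xs = [tree(eps[:t]) for t in range(n)]
        q = [strategy(eps[:t]) for t in range(n)]   # q_t sees only past labels
        learner = sum((1 - qt * yt) / 2 for qt, yt in zip(q, eps)) / n
        oracle = min(sum((1 - yt * f(x)) / 2 for yt, x in zip(eps, xs))
                     for f in F) / n
        total += learner - oracle
    return total / 2 ** n

def bst_tree(prefix):
    """Binary-search tree on {1, ..., 7}: move to a smaller x after a +1 label."""
    x, step = 4, 2
    for e in prefix:
        x -= e * step
        step //= 2
    return x

def majority_strategy(past_labels):
    """Hypothetical learner: q_t is the mean of the labels seen so far."""
    return sum(past_labels) / len(past_labels) if past_labels else 0.0

n = 3
F = [(lambda x, th=th: 1 if x >= th else -1) for th in range(1, 9)]
print(seq_rademacher(F, bst_tree, n))
print(2 * expected_regret(F, bst_tree, n, majority_strategy))
```

Both printed values should agree up to floating-point error. In this example the tree is shattered by the thresholds, so the sequential Rademacher complexity equals 1 and no strategy can achieve expected regret below 1/2 on this process.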
It turns out that this lower bound is within a factor of 2 from optimal. Define the minimax
value
\[
V = \min_{\mathrm{Algo}} \max_{\{(x_t, y_t)\}_{t=1}^n} \mathrm{Reg}(\mathcal{F}).
\]
Similar results also hold for absolute value and other Lipschitz loss functions. For square
loss, the sequential Rademacher averages are replaced by offset sequential Rademacher
averages (again, as both upper and lower bounds).
In short, sequential complexities in online learning play a role similar to the role played
by i.i.d. complexities as studied in this course. However, quite a large number of questions
still remain open. But that’s a topic for a different course.
References
[1] J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in
Neural Information Processing Systems, pages 41–48, 2008.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds
and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[7] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study.
In Annales de l’IHP Probabilités et statistiques, volume 48, pages 1148–1185, 2012.
[8] L. Devroye, L. Györfi, and G. Lugosi. A probabilistic theory of pattern recognition.
Springer, 1996.
[11] E. Giné and J. Zinn. Some limit theorems for empirical processes. The Annals of
Probability, pages 929–989, 1984.
[13] S. B. Hopkins. Mean estimation with sub-gaussian rates in polynomial time. The
Annals of Statistics, 48(2):1193–1213, 2020.
[14] D. Hsu, S. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian
random vectors. Electronic Communications in Probability, 17:1–6, 2012.
[17] V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for
sample covariance operators. Bernoulli, 23(1):110–133, 2017.
[18] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random
matrix without concentration. International Mathematics Research Notices, 2015(23):
12991–13008, 2015.
[19] V. Koltchinskii, D. Panchenko, et al. Empirical margin distributions and bounding the
generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
[21] T. Liang, A. Rakhlin, and K. Sridharan. Learning with square loss: Localization
through offset Rademacher complexity. In Conference on Learning Theory, pages 1260–1285,
2015.
[22] G. Lugosi and S. Mendelson. Sub-gaussian estimators of the mean of a random vector.
The Annals of Statistics, 47(2):783–794, 2019.
[23] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random
matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
[25] A. Rakhlin and K. Sridharan. Online non-parametric regression. In Conference on
Learning Theory, pages 1232–1264, 2014.
[27] A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform
martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153,
2015.
[31] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In
Advances in Neural Information Processing Systems, pages 2199–2207, 2010.
[36] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for
statisticians, 2014.