Emp Proc Lecture Notes
Applications
Bodhisattva Sen

Contents

4 Chaining and uniform entropy
4.1 Dudley's bound for the supremum of a sub-Gaussian process
4.1.1 Dudley's bound when the metric space is separable
4.2 Maximal inequality with uniform entropy
4.3 Maximal inequalities with bracketing
4.4 Bracketing number for some function classes
8.4 Kernel density estimation
Abstract
1
see https://www.stat.berkeley.edu/~aditya/resources/FullNotes210BSpring2018.pdf
2
and Shayan Hundrieser, Marcel Klatt, and Thomas Staudt
3
Tobias W. Wegel, Erik Pudelko, Jan N. Dühmert, Meggie Marschner, Antonia Seifrid, Jana Böhm,
Robin Requadt, Oliver D. Gauselmann, Tobias Weber, Huaiqing Gou, Leo H. Lehmann, Michel Groppe
4
http://www.personal.soton.ac.uk/cz1y20/Reading_Group/ep-group.html
1 Introduction to empirical processes
In this chapter we introduce the main object of study (i.e., empirical processes), highlight
the main questions we would like to answer, give a few historically important statistical
applications that motivated the development of the field, and lay down some of the broad
questions that we plan to investigate in this course.
Empirical process theory began in the 1930’s and 1940’s with the study of the empirical
distribution function and the corresponding empirical process.
If X1, . . . , Xn are i.i.d. real-valued random variables5 (r.v.'s) with cumulative distribution function (c.d.f.) F, then the empirical distribution function (e.d.f.) Fn : R → [0, 1] is defined as

    Fn(x) := (1/n) Σ_{i=1}^n 1(−∞,x](Xi),  for x ∈ R.   (1)
In other words, for each x ∈ R, the quantity nFn (x) simply counts the number of Xi ’s that
are less than or equal to x. The e.d.f. is a natural unbiased (i.e., E[Fn (x)] = F (x) for all
x ∈ R) estimator of F . The corresponding empirical process is
    Gn(x) := √n (Fn(x) − F(x)),  for x ∈ R.   (2)
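As a concrete illustration, here is a short Python sketch that computes Fn on a grid and evaluates the empirical process Gn; the choice F = standard normal c.d.f. is only an assumption made for the example.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def edf(sample, x):
    """Empirical distribution function F_n evaluated at the points x."""
    return np.searchsorted(np.sort(sample), x, side="right") / len(sample)

# Toy illustration: take F to be the standard normal c.d.f. (an assumption).
F = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2))))

n = 1000
sample = rng.standard_normal(n)
grid = np.linspace(-3, 3, 13)
Gn = np.sqrt(n) * (edf(sample, grid) - F(grid))   # empirical process at the grid points
print("G_n on the grid:", np.round(Gn, 2))
print("sup over grid of |F_n - F|:", np.max(np.abs(edf(sample, grid) - F(grid))))
```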
Note that both Fn and Gn are stochastic processes6 (i.e., random functions) indexed
by the real line. By the strong law of large numbers (SLLN), for every fixed x ∈ R,

    Fn(x) → F(x)  almost surely, as n → ∞.
Two of the basic results in empirical process theory concerning Fn and Gn are the Glivenko-Cantelli and Donsker theorems. They strengthen the pointwise limit results above (the SLLN for Fn(x) and the CLT for Gn(x)) to statements that hold uniformly over all x simultaneously.
5
We will assume that all the random variables are defined on the probability space (Ω, A, P). Recall
the following definitions. A σ-field A (also called σ-algebra) on a set Ω is a collection of subsets of Ω that
includes the empty set, is closed under complementation, and is closed under countable unions and countable
intersections. The pair (Ω, A) is called a measurable space.
If C is an arbitrary class of subsets of S, there is a smallest σ-field in S containing C, denoted by σ(C) and
called the σ-field generated (or induced) by C.
A metric (or topological) space S will usually be endowed with its Borel σ-field B(S) — the σ-field
generated by its topology (i.e., the collection of all open sets in S). The elements of B(S) are called Borel
sets.
6
Fix a measurable space (S, S), an index set I, and a subset V ⊂ S^I. Then a function W : Ω → V is called an S-valued stochastic process on I with paths in V if and only if Wt : Ω → S is S-measurable for every t ∈ I.
Theorem 1.1 ([Glivenko, 1933], [Cantelli, 1933]).

    kFn − F k∞ := sup_{x∈R} |Fn(x) − F(x)| → 0  almost surely.
We can then define the continuity of a functional γ with respect to this norm: more precisely, we say that the functional γ is continuous at F in the sup-norm if, for all ε > 0, there exists
7
The above notion of weak convergence has not been properly defined yet. D([−∞, ∞]) denotes the space of cadlag functions on [−∞, ∞] (from the French "continue à droite, limite à gauche": right continuous at each point with left limit existing at each point).
Heuristically speaking, we would say that a sequence of stochastic processes {Zn } (as elements of
D([−∞, ∞])) converges in distribution to a stochastic process Z in D([−∞, ∞]) if
E[g(Zn )] → E[g(Z)], as n → ∞,
for any bounded and continuous function g : D([−∞, ∞]) → R (ignoring measurability issues).
8
In short, U is a zero-mean Gaussian process on [0, 1] with covariance function E[U(s)U(t)] = s ∧ t −
st, s, t ∈ [0, 1]. To be more precise, the Brownian bridge process U is characterized by the following three
properties:
1. U(0) = U(1) = 0. For every t ∈ (0, 1), U(t) is a random variable.
2. For every k ≥ 1 and t1 , . . . , tk ∈ (0, 1), the random vector (U(t1 ), . . . , U(tk )) has the Nk (0, Σ) distri-
bution.
3. The function t 7→ U(t) is (almost surely) continuous on [0, 1].
9
For example, given some integrable function g : R → R, we may be interested in the expectation functional γg defined via γg(F) := ∫ g(x) dF(x).
a δ > 0 such that kG − F k∞ ≤ δ implies that |γ(G) − γ(F)| ≤ ε. This notion is useful, because for any continuous functional, it reduces the consistency question for the plug-in estimator γ(Fn) to the issue of whether or not the r.v. kFn − F k∞ converges to zero.
In this course we are going to substantially generalize the two results, Theorems 1.1 and 1.2. But before we start on this endeavor, let us ask ourselves why we need generalizations of such results. The following subsection addresses this.
1.1 Notation
The need for generalizations of Theorems 1.1 and 1.2 became apparent in the 1950’s and
1960’s. In particular, it became apparent that when the observations take values in a more
general space X (such as Rd , or a Riemannian manifold, or some space of functions, etc.),
then the e.d.f. is not as natural. It becomes much more natural to consider the empirical
measure Pn indexed by some class of real-valued functions F defined on X .
Suppose now that X1, . . . , Xn are i.i.d. P on X. Then the empirical measure Pn is defined by

    Pn := (1/n) Σ_{i=1}^n δXi,
where δx denotes the Dirac measure at x. For each n ≥ 1, Pn denotes the random discrete
probability measure which puts mass 1/n at each of the n points X1 , . . . , Xn . Thus, for any
Borel set A ⊂ X ,
    Pn(A) := (1/n) Σ_{i=1}^n 1A(Xi) = #{i ≤ n : Xi ∈ A} / n.
For a real-valued function f on X , we write
    Pn(f) := ∫ f dPn = (1/n) Σ_{i=1}^n f(Xi).
The empirical process at f is defined as Gn(f) := √n (Pn(f) − P(f)), and the collection of random variables {Gn(f) : f ∈ F}, as f varies over F, is called the empirical process indexed by F. Note that the classical empirical process (in (2)) for real-valued r.v.'s can again be viewed as the special case of the general theory for which X = R, F = {1(−∞,x](·) : x ∈ R}.
The goal of empirical process theory is to study the properties of the approximation of P f by Pn f, uniformly in F. Traditionally, we would be concerned with probability estimates of the random quantity

    kPn − P kF := sup_{f∈F} |Pn f − P f|.   (3)
In particular, we will find appropriate conditions to answer the following two questions (which will extend Theorems 1.1 and 1.2):

1. For which classes F does kPn − P kF converge to zero (in probability or almost surely)?

2. For which classes F does the empirical process {Gn(f) : f ∈ F} converge, in a suitable sense, to a limiting Gaussian process?
Our main findings reveal that the answers (to the two questions above and more) depend crucially on the complexity11 or size of the underlying function class F. However, the scope of empirical process theory extends far beyond answering these two questions.
In the last 20 years there has been enormous interest in understanding the concentra-
tion12 properties of kPn − P kF about its mean. In particular, one may ask if we can obtain
finite sample (exponential) inequalities for the difference kPn − P kF − EkPn − P kF (when F
is uniformly bounded) in terms of the class of functions F and the common distribution P
of X1 , X2 , . . . , Xn . Talagrand’s inequality ([Talagrand, 1996a]) gives an affirmative answer
to this question; a result that is considered to be one of the most important and powerful
results in the theory of empirical processes in the last 30 years. We will cover this topic
towards the end of the course (if time permits).
11
We will consider different geometric (packing and covering numbers) and combinatorial (shattering and
combinatorial dimension) notions of complexity.
12
Often we need to show that a random quantity g(X1, . . . , Xn) is close to its mean µ(g) := E[g(X1, . . . , Xn)]. That is, we want a result of the form P(|g(X1, . . . , Xn) − µ(g)| ≥ ε) ≤ δ, for suitable ε and δ. Such results are known as concentration of measure inequalities. These results are fundamental for establishing performance guarantees of many algorithms.
The following section introduces the topic of M -estimation (also known as empirical
risk minimization), a field that naturally relies on the study of empirical processes.
Many problems in statistics and machine learning are concerned with estimators of the form
    θ̂n := arg max_{θ∈Θ} Pn[mθ] = arg max_{θ∈Θ} (1/n) Σ_{i=1}^n mθ(Xi).   (4)
The criterion θ ↦ Pn[mθ] (or its negative) can be thought of as the "empirical risk", and θ̂n denotes the empirical risk minimizer over θ ∈ Θ. Here are some examples:
2. Location estimators:
4. Regression estimators: Let {Xi = (Zi , Yi )}ni=1 denote i.i.d. from a regression model
and let
mθ (x) = mθ (z, y) := −(y − θ(z))2 ,
for a class θ ∈ Θ of real-valued functions from the domain of Z 13 . This gives the usual
least squares estimator over the class Θ. The choice mθ (z, y) = −|y − θ(z)| gives the
least absolute deviation estimator over Θ.
13
In the simplest setting we could parametrize θ(·) as θβ(z) := β⊤z, for β ∈ Rd, in which case Θ = {θβ(·) : β ∈ Rd}.
In these problems, the parameter of interest is

    θ0 := arg max_{θ∈Θ} P[mθ].
Perhaps the simplest general way to address this problem is to reason as follows. By the
law of large numbers, we can approximate the ‘risk’ for a fixed parameter θ by the empirical
risk which depends only on the data, i.e.,
P [mθ ] ≈ Pn [mθ ].
If Pn [mθ ] and P [mθ ] are uniformly close, then maybe their argmax’s θ̂n and θ0 are close. The
problem is now to quantify how close θ̂n is to θ0 as a function of the number of samples n, the dimension of the parameter space Θ, the dimension of the space X, etc. The resolution of this question leads naturally to the investigation of quantities such as the uniform deviation

    sup_{θ∈Θ} |Pn[mθ] − P[mθ]|.
The following two examples show the importance of controlling the above display in
the problem of M -estimation and classification.
We will assume that the class of functions F := {mθ(·) : θ ∈ Θ} is P-Glivenko-Cantelli. Write M(θ) := P[mθ] and Mn(θ) := Pn[mθ]. Fix δ > 0 and let

    ψ(δ) := M(θ0) − sup_{θ∈Θ: d(θ,θ0)≥δ} M(θ).
Observe that,

    P( d(θ̂n, θ0) ≥ δ ) ≤ P( Mn(θ0) ≤ sup_{θ∈Θ: d(θ,θ0)≥δ} Mn(θ) )
        ≤ P( sup_{θ∈Θ: d(θ,θ0)≥δ} { (Mn(θ) − M(θ)) − (Mn(θ0) − M(θ0)) } ≥ ψ(δ) ).
Empirical process results provide bounds for the above probability (some assumptions on the
relation between M and the metric d(·, ·) will be needed), e.g., we may assume that θ0 is a
well-separated maximizer, i.e., for every δ > 0, ψ(δ) > 0.
Note that one can further bound the above probability by

    P( sup_{θ∈Θ} |Mn(θ) − M(θ)| ≥ ψ(δ)/2 ),

which converges to zero whenever F is P-Glivenko-Cantelli.
Example 1.4 (Classification). Consider a pair of random objects X ≡ (Z, Y ) having some
joint distribution where Z takes values in a space Z and Y takes only two values: −1 or
+1. A classifier is a function g : Z → {−1, +1}. The error of the classifier is given by
L(g) := P(g(Z) 6= Y ).
The goal of classification is to construct a classifier with small error based on n i.i.d. obser-
vations X1 ≡ (Z1 , Y1 ), . . . , Xn ≡ (Zn , Yn ) having the same distribution as X = (Z, Y ) ∼ P .
For a classifier g, its empirical error (i.e., its error on the observed sample) is given by

    Ln(g) := (1/n) Σ_{i=1}^n I{g(Zi) ≠ Yi} = Pn[I{g(Z) ≠ Y}].
A natural strategy for classification is to select a class of classifiers C and then to choose the classifier in C which has the smallest empirical error on the observed sample, i.e.,

    ĝn := arg min_{g∈C} Ln(g).

Two natural questions arise:
1. Is L(ĝn ) comparable to inf g∈C L(g), i.e., is the error of ĝn comparable to the best
achievable error in the class C?
2. Is L(ĝn ) comparable to Ln (ĝn ), i.e., is the error of ĝn comparable to its “in-sample”
empirical error?
It is quite easy to relate these two questions to the size of sup_{g∈C} |Ln(g) − L(g)|. Indeed, if g* := arg min_{g∈C} L(g), then

    L(ĝn) − L(g*) = {L(ĝn) − Ln(ĝn)} + {Ln(ĝn) − Ln(g*)} + {Ln(g*) − L(g*)} ≤ 2 sup_{g∈C} |Ln(g) − L(g)|,

where we have used the fact that Ln(ĝn) ≤ Ln(g*) (which follows from the definition of ĝn).
Also,

    L(ĝn) = Ln(ĝn) + L(ĝn) − Ln(ĝn) ≤ Ln(ĝn) + sup_{g∈C} |Ln(g) − L(g)|.   (5)
Thus the key quantity for answering the above two questions is sup_{g∈C} |Ln(g) − L(g)|. It is now easy to see that this quantity is a special case of (3) when F is taken to be the class of all functions I{g(z) ≠ y} as g varies over C. The two inequalities above can sometimes be quite loose. Later, we shall see sharper inequalities which utilize a technique known as "localization"; see Section 8.3.
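The decomposition above is easy to check numerically. Below is a minimal Python sketch with an assumed toy data-generating mechanism and a finite grid of threshold classifiers playing the role of C; the "true" errors are themselves approximated by Monte Carlo, so the comparison is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: Z ~ Unif(0,1), P(Y = 1 | Z = z) = 0.1 + 0.8*1{z > 0.4},
# and C = a finite grid of threshold classifiers g_t(z) = sign(z - t).
thresholds = np.linspace(0, 1, 51)

def sample(num):
    z = rng.random(num)
    y = np.where(rng.random(num) < 0.1 + 0.8 * (z > 0.4), 1, -1)
    return z, y

def errors(z, y):
    # Misclassification rate of every threshold classifier at once.
    pred = np.where(z[:, None] > thresholds[None, :], 1, -1)
    return (pred != y[:, None]).mean(axis=0)

z, y = sample(500)                      # observed sample
L_hat = errors(z, y)                    # empirical errors L_n(g)
L_true = errors(*sample(200_000))       # Monte Carlo stand-in for the true errors L(g)

sup_dev = np.max(np.abs(L_hat - L_true))            # sup_g |L_n(g) - L(g)|
excess = L_true[np.argmin(L_hat)] - L_true.min()    # L(g_hat) - inf_g L(g)
print(f"sup deviation = {sup_dev:.3f}, excess risk of ERM = {excess:.3f} (<= 2 * sup deviation)")
```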
We will learn how to establish consistency, rates of convergence and the limiting dis-
tribution for M and Z-estimators; see [van der Vaart and Wellner, 1996, Chapters 3.1-3.4]
for more details.
We quite often write a statistic of interest as a functional on the sample paths of a stochastic
process in order to break the analysis of the statistic into two parts: the study of the conti-
nuity properties of the (measurable14 ) functional and the study of the stochastic process as
a random element15 in a space of functions. The method has its greatest appeal when many
different statistics can be written as functionals on the same process, as in the following
goodness-of-fit examples.
Consider the statistical problem of goodness-of-fit16 hypothesis testing where one ob-
serves an i.i.d. sample X1 , . . . , Xn from a distribution F on the real line and wants to test
the null hypothesis
H0 : F = F0  versus  H1 : F ≠ F0,

for a fixed (known) c.d.f. F0. The Kolmogorov-Smirnov test is based on computing

    Dn := √n sup_{x∈R} |Fn(x) − F0(x)|

and rejecting H0 when Dn is large. To calculate the p-value of this test, the null distribution
(i.e., the distribution of Dn under H0 ) needs to be determined.
14
Given two measurable spaces (S, S) and (T, T ), a mapping f : S → T is said to be S/T -measurable, or simply measurable, if f⁻¹(T ) ⊂ S, i.e., if f⁻¹(B) ∈ S for every B ∈ T .
Question: What is the asymptotic distribution of Dn , under H0 ?
An interesting property about the null distribution of Dn is that the null distribution
is the same whenever F0 is continuous17 . Thus we can compute the null distribution of Dn
assuming that F0 is the c.d.f. of a uniformly distributed random variable on [0, 1]. In other
words, the null distribution of Dn is the same as that of supt∈[0,1] |Un (t)| where
    Un(t) := √n (Fn(t) − t)  with  Fn(t) := (1/n) Σ_{i=1}^n 1{ξi ≤ t},  t ∈ [0, 1],
and ξ1 , . . . , ξn are i.i.d. Unif(0, 1) random variables. The function t 7→ Un (t) is called the
uniform empirical process.
The Kolmogorov-Smirnov test for H0 : F = F0 is only one of a large class of tests that
are based on some measure of distance between the e.d.f. Fn and F0 . Another such test is
the Cramér-von Mises statistic:
    Wn := n ∫ (Fn(x) − F0(x))² dF0(x).
All of these quantities have the property that their null distribution (i.e., when F = F0 ) is
the same for all continuous F0 . Thus one may assume that F0 is the uniform distribution
for computing their null distribution. And in this case, all these quantities can be written
in terms of the uniform empirical process Un .
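As an illustration, here is a short Python sketch (the function names are ours) that computes Dn and Wn from a Unif(0, 1) sample, exploiting the fact that under a continuous F0 their null distributions do not depend on F0, and approximates the null quantiles by simulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def ks_and_cvm(xi):
    """Kolmogorov-Smirnov and Cramer-von Mises statistics for Unif(0,1) data,
    expressed via the uniform empirical process U_n(t) = sqrt(n)(F_n(t) - t)."""
    n = len(xi)
    x = np.sort(xi)
    i = np.arange(1, n + 1)
    # sup_t |F_n(t) - t| is attained just before or after the jump points x_(i).
    D = np.sqrt(n) * np.max(np.maximum(i / n - x, x - (i - 1) / n))
    # Standard closed form for W_n = n * int (F_n(t) - t)^2 dt when F_0 = Unif(0,1).
    W = 1.0 / (12 * n) + np.sum((x - (2 * i - 1) / (2 * n)) ** 2)
    return D, W

# Simulated null distributions (F_0 continuous, so we may take F_0 = Unif(0,1)).
stats = np.array([ks_and_cvm(rng.random(200)) for _ in range(2000)])
print("approx. 95%% null quantiles: D_n ~ %.3f, W_n ~ %.3f"
      % tuple(np.quantile(stats, 0.95, axis=0)))
# For comparison, the limiting 95% quantiles are about 1.358 (KS) and 0.461 (CvM).
```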
Initially, the asymptotic distributions of these quantities were determined on a case by
case basis without a unified technique. Doob realized that it should be possible to obtain
these distributions using some basic properties of the uniform empirical process.
Remark 1.1 (Study of the uniform empirical process). By the multivariate CLT, for every
k ≥ 1 and 0 < t1 , . . . , tk < 1, the random vector (Un (t1 ), . . . , Un (tk )) converges in distribu-
tion to Nk (0, Σ) where Σ(i, j) := ti ∧tj −ti tj (here a∧b := min(a, b)). This limiting distribu-
tion Nk (0, Σ) is the same as the distribution of (U(t1 ), . . . , U(tk )) where U is the Brownian
bridge. Doob therefore conjectured that the uniform empirical process {Un (t) : t ∈ [0, 1]}
must converge in some sense to a Brownian Bridge {U(t) : t ∈ [0, 1]}. Hopefully, this notion
of convergence will be strong enough to yield that various functionals of Un (·) will converge
to the corresponding functionals of U(·).
17
Exercise (HW1): Prove this. Hint: you may use the quantile transformation.
In particular, if such a convergence holds, the null distribution of Dn = sup_{t∈[0,1]} |Un(t)| converges to that of sup_{t∈[0,1]} |U(t)|, by appealing to the continuous mapping theorem18. Similarly, we can obtain the asymptotic distribution of the other test statistics.
distribution of the other test statistics.
In fact, there are plenty of other examples where it is convenient to break the analysis
of a statistic into two parts: the study of the continuity properties of the functional and the
study of the underlying stochastic process. An important example of such a “continuous”
functional is the argmax of a stochastic process, which arises in the study of M-estimators (to be introduced in the following subsection).
Example 1.5. Suppose that X, X1 , . . . , Xn , . . . are i.i.d. P with c.d.f. G, having a Lebesgue
density g, and E(X 2 ) < ∞. Let µ = E(X). Consider the absolute deviations about the
sample mean,
    Mn := Pn|X − X̄n| = (1/n) Σ_{i=1}^n |Xi − X̄n|,
as an estimate of scale. This is an average of the dependent random variables |Xi − X̄n |.
Suppose that we want to find the almost sure (a.s.) limit and the asymptotic distribution19
of Mn (properly normalized).
a.s.
There are several routes available for showing that Mn → M := E|X − µ|, but the
method we will develop in this section proceeds as via empirical process theory. Since
a.s.
X̄n → µ, we know that for any δ > 0 we have X̄n ∈ [µ − δ, µ + δ] for all sufficiently large n
almost surely. Let us define, for δ > 0, the functions ft(x) := |x − t| (for t ∈ R) and the class Fδ := {ft : |t − µ| ≤ δ}. Then, whenever X̄n ∈ [µ − δ, µ + δ], we can write
Mn − M = Pn (fX̄n ) − P (fµ )
= (Pn − P )(fX̄n ) + P (fX̄n ) − P (fµ )
= In + IIn .
18
The continuous mapping theorem states that if random elements Yn →d Y and g is continuous, then g(Yn) →d g(Y).
19
This example was one of the illustrative examples considered by [Pollard, 1989].
Note that,

    |In| ≤ sup_{f∈Fδ} |(Pn − P)(f)| → 0  almost surely,   (6)
where ψ(t) := P(ft) = E|X − t|. We will argue later that Bn is asymptotically negligible using an equicontinuity argument. Let us consider An + Cn. It can be easily shown that

    ψ(t) = µ − 2 ∫_{−∞}^{t} x g(x) dx − t + 2tG(t),  and  ψ′(t) = 2G(t) − 1.
The sequence of processes {Zn(f) : f ∈ F} is said to be asymptotically equicontinuous at f0 if, for every η > 0 and ε > 0, there exists a neighborhood V of f0 such that

    lim sup_{n→∞} P( sup_{f∈V} |Zn(f) − Zn(f0)| > η ) < ε.
Exercise (HW1): Show that if (i) {fˆn }n≥1 is a sequence of (random) elements of F that
converges in probability to f0 , and (ii) {Zn (f ) : f ∈ F} is asymptotically equicontinuous at
f0 , then Zn (fˆn ) − Zn (f0 ) = oP (1). [Hint: Note that with probability tending to 1, fˆn will
belong to each V .]
Empirical process theory offers very efficient methods for establishing the asymptotic
equicontinuity of Gn over a class of functions F. The fact that F is a VC class of func-
tions with square-integrable envelope function will suffice to show the desired asymptotic
equicontinuity.
2 Size/complexity of a function class
A relatively simple way to measure the size of any set is to use covering numbers. Let (Θ, d)
be an arbitrary semi-metric space24 ; we will assume that Θ ⊂ Ξ and that d(·, ·) is defined
on the space Ξ. Let ε > 0.
Definition 2.1 (ε-cover). An ε-cover of the set Θ with respect to the semi-metric d is a set {θ1, . . . , θN} ⊂ Ξ25 such that for any θ ∈ Θ, there exists some v ∈ {1, . . . , N} with d(θ, θv) ≤ ε.
Equivalently, the ε-covering number N (ε, Θ, d) is the minimal number of balls B(x; ε) :=
{y ∈ Θ : d(x, y) ≤ ε} of radius ε needed to cover the set Θ.
Definition 2.3 (Metric entropy). The metric entropy of the set Θ with respect to the semi-
metric d is the logarithm of its covering number: log N (ε, Θ, d).
Note that a semi-metric space (Θ, d) is said to be totally bounded if the ε-covering
number is finite for every ε > 0. We can define a related measure of size that relates to the
number of disjoint balls of radius ε > 0 that can be placed into the set Θ.
Definition 2.4 (ε-packing). An ε-packing of the set Θ with respect to the semi-metric d is a set {θ1, . . . , θD} ⊆ Θ such that for all distinct v, v′ ∈ {1, . . . , D}, we have d(θv, θv′) > ε.
Equivalently, call a collection of points ε-separated if the distance between each pair of points
is larger than ε. Thus, the packing number D(ε, Θ, d) is the maximum number of ε-separated
points in Θ.
24
By a semi-metric space (Θ, d) we mean that, for any θ1, θ2, θ3 ∈ Θ, we have: (i) d(θ1, θ1) = 0; (ii) d(θ1, θ2) = d(θ2, θ1); and (iii) d(θ1, θ3) ≤ d(θ1, θ2) + d(θ2, θ3). Unlike for a metric, d(θ1, θ2) = 0 need not imply θ1 = θ2.
25
The elements {θ1 , . . . , θN } ⊂ Ξ need not belong to Θ themselves.
A minimal ε-cover or a maximal ε-packing need not be finite. In the proofs of the following results, we do not separate out the case when they are infinite (in which case there is nothing to show).
For every ε > 0 we have D(2ε, Θ, d) ≤ N(ε, Θ, d) ≤ D(ε, Θ, d); thus, packing and covering numbers have the same scaling in the radius ε.
Proof. Let us first show the second inequality. Suppose E = {θ1 , . . . , θD } ⊆ Θ is a maximal
packing. Then for every θ ∈ Θ \ E, there exists 1 ≤ i ≤ D such that d(θ, θi ) ≤ ε (for if this
does not hold for θ then we can construct a bigger packing set with θD+1 = θ). Hence E is
automatically an ε-covering. Since N (ε, Θ, d) is the minimal size of all possible coverings,
we have D(ε, Θ, d) ≥ N (ε, Θ, d).
We next prove the first inequality by contradiction. Suppose that there exists a 2ε-
packing {θ1 , . . . , θD } and an ε-covering {x1 , . . . , xN } such that D ≥ N + 1. Then by
pigeonhole, we must have θi and θj belonging to the same ε-ball B(xk, ε) for some i ≠ j and k. This means that the distance between θi and θj cannot be more than the diameter of the ball, i.e., d(θi, θj) ≤ 2ε, which leads to a contradiction since d(θi, θj) > 2ε for a 2ε-packing. Hence the size of any 2ε-packing is less than or equal to the size of any ε-covering.
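The sandwich D(2ε, Θ, d) ≤ N(ε, Θ, d) ≤ D(ε, Θ, d) is easy to check numerically for a finite point cloud. The sketch below uses a greedy construction on an assumed toy set Θ of random points in the unit square to produce computable lower and upper bounds on N(ε).

```python
import numpy as np

rng = np.random.default_rng(4)

def greedy_packing(points, eps):
    """Greedy eps-packing: keep a point if it is > eps away from all kept points.
    The result is maximal (no further point can be added), hence also an eps-cover."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return centers

theta = rng.random((500, 2))      # toy Theta: 500 points in the unit square
for eps in [0.3, 0.2, 0.1]:
    upper = len(greedy_packing(theta, eps))       # an eps-packing that covers, so N(eps) <= upper
    lower = len(greedy_packing(theta, 2 * eps))   # any 2*eps-packing has size <= N(eps)
    print(f"eps = {eps}:  {lower} <= N(eps, Theta, ||.||) <= {upper}")
```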
Remark 2.1. As shown in the preceding lemma, covering and packing numbers are closely
related, and we can use both in the following. Clearly, they become bigger as ε → 0.
Let k · k denote any norm on Rd . The following result gives the (order of) covering
number for any bounded set in Rd .
Lemma 2.7. For a bounded subset Θ ⊂ Rd there exist constants c < C, depending on Θ (and k · k) only, such that, for ε ∈ (0, 1),

    c (1/ε)^d ≤ N(ε, Θ, k · k) ≤ C (1/ε)^d.
Proof. If θ1, . . . , θD are ε-separated points in Θ, then the balls of radius ε/2 around the θi's are disjoint, and their union is contained in Θ0 := {θ ∈ Rd : kθ − Θk ≤ ε/2}. Thus, the sum D vd (ε/2)^d of the volumes of these balls, where vd is the volume of the unit ball, is bounded by Vol(Θ0), the volume of Θ0. This gives the upper bound of the lemma, as

    N(ε, Θ, k · k) ≤ D(ε, Θ, k · k) ≤ (2^d Vol(Θ0)/vd) (1/ε)^d.
Let θ1, . . . , θN be an ε-cover of Θ, i.e., the union of the balls of radius ε around them covers Θ. Thus the volume of Θ is bounded above by the sum of the volumes of the N balls, i.e., by N vd ε^d. This yields the lower bound of the lemma, as

    N(ε, Θ, k · k) ≥ (Vol(Θ)/vd) (1/ε)^d.
The following result gives an upper bound (which also happens to be optimal) on the
entropy numbers of the class of Lipschitz functions26 .
Lemma 2.8. Let F := {f : [0, 1] → [0, 1] | f is 1-Lipschitz}. Then, for some constant A > 0, we have

    log N(ε, F, k · k∞) ≤ A/ε,  for all ε > 0.
Proof. If ε > 1, there is nothing to prove, as then N(ε, F, k · k∞) = 1 (take the function f0 ≡ 0 and observe that for any f ∈ F, kf − f0k∞ ≤ 1 < ε). Let 0 < ε < 1. We will explicitly exhibit a 2ε-cover of F (under the k · k∞-metric) with cardinality less than exp(A/ε), for some A > 0. This will complete the proof, as N(2ε, F, k · k∞) will then automatically be less than exp(A/ε).
Let us define an ε-grid of the interval [0, 1], i.e., 0 = a0 < a1 < . . . < aN = 1 where ak := kε, for k = 1, . . . , N − 1; here N ≤ ⌊1/ε⌋ + 1 (where ⌊x⌋ denotes the greatest integer less than or equal to x). Let B1 := [a0, a1] and Bk := (ak−1, ak], k = 2, . . . , N. For each f ∈ F define f̃ : [0, 1] → R as

    f̃(x) := Σ_{k=1}^{N} ε ⌊f(ak)/ε⌋ 1_{Bk}(x).   (7)
Thus, f̃ is constant on each interval Bk and can only take values of the form iε, for i = 0, . . . , ⌊1/ε⌋. Observe that for x ∈ Bk (for some k ∈ {1, . . . , N}) we have

    |f(x) − f̃(x)| ≤ |f(x) − f(ak)| + |f(ak) − f̃(ak)| ≤ ε + ε,

where the first ε comes from the fact that f is 1-Lipschitz (and |x − ak| ≤ ε), and the second appears because of the approximation error in (7)27. Thus, kf − f̃k∞ ≤ 2ε.
Next, let us count the number of distinct f̃'s obtained as f varies over F. There are at most ⌊1/ε⌋ + 1 choices for f̃(a1). Further, note that for any f̃ (and any k = 2, . . . , N),

    |f̃(ak) − f̃(ak−1)| ≤ |f̃(ak) − f(ak)| + |f(ak) − f(ak−1)| + |f(ak−1) − f̃(ak−1)| ≤ 3ε.

Therefore, once a choice is made for f̃(ak−1), there are at most 7 choices left for the next value f̃(ak), k = 2, . . . , N.
26
Note that f : X → R is L-Lipschitz if |f(x) − f(y)| ≤ L kx − yk for all x, y ∈ X.
27
Note that, for x ∈ Bk, f̃(x) = f̃(ak) = ε⌊f(ak)/ε⌋ ≤ f(ak), and f(ak) − f̃(ak) = f(ak) − ε⌊f(ak)/ε⌋ ≤ ε.
Now consider the collection {f̃ : f ∈ F}. We see that this collection is a 2ε-cover of F, and the number of distinct functions in this collection is upper bounded by

    (⌊1/ε⌋ + 1) 7^{⌊1/ε⌋}.

Thus, N(2ε, F, k · k∞) is bounded by the right side of the above display, which completes the proof of the result.
Thus, the set of Lipschitz functions is much "larger" than a bounded set in Rd, since its metric entropy grows as 1/ε as ε → 0, as compared to log(1/ε) (cf. Lemma 2.7).
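A small experiment can illustrate the discretization map f ↦ f̃ used in the proof of Lemma 2.8. The sketch below (random 1-Lipschitz paths are an arbitrary choice) counts how many distinct f̃'s appear and compares with the counting bound from the proof.

```python
import numpy as np

rng = np.random.default_rng(5)

def discretize(f_vals, eps):
    # The map f -> f~ from the proof: round the grid values down to multiples of eps.
    return tuple(np.floor(f_vals / eps) * eps)

def random_lipschitz(grid):
    # A random 1-Lipschitz function [0,1] -> [0,1], evaluated on the grid.
    steps = rng.uniform(-1, 1, size=len(grid) - 1) * np.diff(grid)
    return np.clip(np.concatenate([[rng.random()], steps]).cumsum(), 0, 1)

for eps in [0.2, 0.1, 0.05]:
    grid = np.arange(0.0, 1.0 + eps, eps)
    observed = {discretize(random_lipschitz(grid), eps) for _ in range(20_000)}
    bound = (np.floor(1 / eps) + 1) * 7 ** np.floor(1 / eps)
    print(f"eps = {eps}: distinct f~'s seen = {len(observed)}, counting bound = {bound:.0f}")
```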
Exercise (HW1): For L > 0, let FL := {f : [0, 1] → R | f is L-Lipschitz}. Show that, for ε > 0, log N(ε, FL, k · k∞) ≥ a L/ε, for some constant a > 0. Then, using Lemma 2.8, show that log N(ε, FL, k · k∞) ≍ L/ε, for ε > 0 sufficiently small.
Definition 2.9 (ε-bracket). Given two functions l(·) and u(·), the bracket [l, u] is the set
of all functions f ∈ F with l(x) ≤ f (x) ≤ u(x), for all x ∈ X . An ε-bracket is a bracket
[l, u] with kl − uk < ε.
Definition 2.10 (Bracketing numbers). The bracketing number N[ ] (ε, F, k · k) is the min-
imum number of ε-brackets needed to cover F.
Definition 2.11 (Entropy with bracketing). The entropy with bracketing is the logarithm
of the bracketing number.
In the definition of the bracketing number, the upper and lower bounds u and l of the
brackets need not belong to F themselves but are assumed to have finite norms.
Example 2.12 (Distribution function). When F is equal to the collection of all indicator functions of the form ft(·) = 1(−∞,t](·), with t ranging over R, then the empirical process Gn(ft) is the classical empirical process √n(Fn(t) − F(t)) (here X1, . . . , Xn are i.i.d. P with c.d.f. F).
Consider brackets of the form [1(−∞,ti−1], 1(−∞,ti)] for a grid of points −∞ = t0 < t1 < · · · < tk = ∞ with the property F(ti−) − F(ti−1) < ε for each i = 1, . . . , k; here we assume that ε < 1. These brackets have L1(P)-size ε. Their total number k can be chosen smaller than 2/ε. Since P f² ≤ P f for every 0 ≤ f ≤ 1, the L2(P)-size of the brackets is bounded by √ε. Thus N[ ](√ε, F, L2(P)) ≤ 2/ε, whence the bracketing numbers (in L2(P)) are of the polynomial order 1/ε².
Exercise (HW1): Show that N (ε, F, k · k) ≤ N[ ] (2ε, F, k · k), for every ε > 0.
In general, there is no converse inequality. Thus, apart from the constant 1/2, bracket-
ing numbers are bigger than covering numbers. The advantage of a bracket is that it gives
pointwise control over a function: l(x) ≤ f (x) ≤ u(x), for every x ∈ X . In comparison an
Lr (P )-ball gives integrated, but not pointwise control.
Lemma 2.14. Let F = {mθ : θ ∈ Θ} be a class of functions satisfying

    |mθ1(x) − mθ2(x)| ≤ d(θ1, θ2) F(x),  for every x ∈ X,

for every θ1, θ2 ∈ Θ and some fixed function F. Then, for any norm k · k,

    N[ ](2ε kF k, F, k · k) ≤ N(ε, Θ, d).

Proof. Let θ1, . . . , θp be an ε-cover of Θ (under the metric d). Then the brackets [mθi − εF, mθi + εF], i = 1, . . . , p, cover F. These brackets are of size 2εkF k.
Exercise (HW1): Let F and G be classes of measurable functions. Then for any probability measure Q and any 1 ≤ r ≤ ∞,
3 Glivenko-Cantelli (GC) classes of functions

Suppose that X1, . . . , Xn are i.i.d. random variables defined on the space X with probability measure P. Let F be a class of measurable functions from X to R. The main object of study in this section is to obtain probability estimates of the random quantity

    kPn − P kF := sup_{f∈F} |Pn f − P f|.
The law of large numbers says that Pn f → P f almost surely, as soon as the expectation
P f exists. A class of functions is called Glivenko-Cantelli if this convergence is uniform in
the functions belonging to the class.
Remark 3.1 (On measurability). Note that if F is uncountable, kPn − P kF is the supre-
mum of an uncountable family of random variables. In general, the supremum of uncount-
ably many measurable functions is not necessarily measurable29 . However, there are many
situations when this is actually a countable supremum, e.g., in the case of the empirical
distribution function (because of right continuity and existence of left limits, kFn − F k∞ =
supx∈Q |Fn (x) − F (x)|, where Q is the set of rational numbers). Thus if F is countable or
if there exists F0 countable such that kPn − P kF = kPn − P kF0 a.s., then the measurability
problem for kPn − P kF disappears30 .
We will prove two Glivenko-Cantelli theorems. The first theorem uses bracketing entropy together with a simple approximation argument and the law of large numbers for real variables. The second theorem uses random L1-entropy numbers and is proved through symmetrization followed by a maximal inequality.
3.1 GC by bracketing
Theorem 3.2. Let F be a class of measurable functions such that N[ ](ε, F, L1(P)) < ∞ for every ε > 0. Then F is Glivenko-Cantelli.

Proof. Fix ε > 0. Choose finitely many ε-brackets [li, ui] whose union contains F and such that P(ui − li) < ε, for every i. Then, for every f ∈ F, there is a bracket [li, ui] containing f such that

    (Pn − P)f ≤ (Pn − P)ui + P(ui − f) ≤ (Pn − P)ui + ε.

Consequently,
    sup_{f∈F} (Pn − P)f ≤ max_i (Pn − P)ui + ε.
The right side converges almost surely to ε, by the strong law of large numbers for real variables. A similar argument (now using the lower brackets li and the SLLN) shows that lim inf_n inf_{f∈F} (Pn − P)f ≥ −ε almost surely. As
    sup_{f∈F} |(Pn − P)f| = max{ sup_{f∈F} (Pn − P)f, − inf_{f∈F} (Pn − P)f },
we see that lim sup_n kPn − P kF ≤ ε almost surely, for every ε > 0. Taking a sequence εm ↓ 0 yields the desired result.
Example 3.3 (Distribution function). The previous proof generalizes a well-known proof of the classical GC theorem for the e.d.f. on the real line. Indeed, the set of indicator functions of cells (−∞, c] possesses finite bracketing numbers for any underlying distribution; simply use the brackets [1(−∞,ti−1], 1(−∞,ti)] for a grid of points −∞ = t0 < t1 < . . . < tk = +∞ with the property P(ti−1, ti) < ε for each i.
Example 3.4 (Compact parameter set). Suppose F = {mθ : θ ∈ Θ}, where (Θ, d) is a compact metric space, θ ↦ mθ(x) is continuous for every x, and F has an integrable envelope F (i.e., |mθ(x)| ≤ F(x) for all θ and x, with P F < ∞). For a ball B ⊂ Θ, write u_B(x) := sup_{θ∈B} mθ(x) and l_B(x) := inf_{θ∈B} mθ(x). Given a sequence of balls Bk with common center a given θ and radii decreasing to 0, we have u_{Bk} − l_{Bk} ↓ mθ − mθ = 0 by the continuity, pointwise in x, and hence also in L1 by the dominated convergence theorem and the integrability of the envelope. Thus, given ε > 0, for every θ there exists a ball B around θ such that the bracket [l_B, u_B] has size at most ε. By the compactness of Θ, the collection of balls constructed in this way has a finite subcover. The corresponding brackets cover F. This construction shows that the bracketing numbers are finite, but it gives no control on their sizes.
An example of such a class would be the log-likelihood functions of a parametric model {pθ(x) : θ ∈ Θ}, where Θ ⊂ Rd is assumed to be compact32 and pθ(x) is assumed to be continuous in θ for Pθ0-a.e. x.
The goal for the remainder of this section is to prove the following theorem.
Theorem 3.5 (GC by entropy). Let F be a class of measurable functions with envelope F
such that P (F ) < ∞. Let FM be the class of functions f 1{F ≤ M } where f ranges over
F. Then kPn − P kF → 0 almost surely if and only if

    (1/n) log N(ε, FM, L1(Pn)) → 0 in probability,   (8)

for every ε > 0 and M > 033. In that case the convergence takes place in mean also.
Both the statement and the proof of the GC theorem with entropy are more complicated
than the previous bracketing theorem. However, the result gives a precise (necessary and
sufficient) characterization for a class of functions to be GC. Moreover, the sufficiency
condition for the GC property can be checked for many classes of functions by elegant
combinatorial arguments, as will be discussed later. The proof of the above result needs
numerous other concepts, which we introduce below.
3.2 Preliminaries
In this subsection we will introduce a variety of simple results that will be useful in proving
the Glivenko-Cantelli theorem with entropy. We will expand on each of these topics later
on, as they indeed form the foundations of empirical process theory.
Bounds on the tail probability of (the maximum of a bunch of) random variables form the
backbone of most of the results in empirical process theory (e.g., GC theorem, maximal
32
This is a stringent assumption in many situations. If we assume that mθ is convex/concave (in θ), then
it suffices to consider compact subsets of the parameter space (we will see such an example soon; also see
e.g., [Hjort and Pollard, 2011] for a more refined approach). In other situations we have to argue from first
principles that it is enough to restrict attention to compacts.
33
Furthermore, the random entropy condition is necessary.
inequalities needed to show asymptotic equicontinuity, etc.). In this subsection we will
review some basic results in this topic which will be used to prove the GC theorem with
entropy. We start with a very simple (but important and useful) result.
Lemma 3.6 (Markov’s inequality). Let Z ≥ 0 be a random variable. Then for any t > 0,
EZ
P(Z ≥ t) ≤ .
t
Proof. Observe that t1{Z ≥ t} ≤ Z, which on taking expectations yields the above result.
Lemma 3.7 (Chebyshev's inequality). Let Z be a random variable with finite variance. Then for any t > 0,

    P(|Z − EZ| ≥ t) ≤ Var(Z)/t².
To see why this can be weak, suppose Sn := Σ_{i=1}^n Xi for i.i.d. Xi's with Var(X1) = σ². By the central limit theorem,

    lim_{n→∞} P( Sn − ESn ≥ t√n ) = 1 − Φ(t/σ) ≤ (σ/(√(2π) t)) exp(−t²/(2σ²)),

where the last inequality uses a standard bound on the normal c.d.f. Thus, we see that although Chebyshev's inequality gets the order √n correct, the dependence on t²/σ² is not as predicted by the CLT (we expect an exponential decrease in t²/σ²).
Indeed, we can improve the above bound by assuming that Z has a moment generating
function. The main trick here is to use Markov’s inequality in a clever way: if λ > 0,
    P(Z − EZ > t) = P( e^{λ(Z−EZ)} > e^{λt} ) ≤ E[e^{λ(Z−EZ)}] / e^{λt}.   (9)
Now, we can derive bounds for the moment generating function Eeλ(Z−EZ) and optimize
over λ.
When dealing with Sn, we can extend the idea as follows. Observe that, with Z = Sn = Σ_{i=1}^n Xi,

    Ee^{λZ} = E( ∏_{i=1}^n e^{λXi} ) = ∏_{i=1}^n Ee^{λXi},  (by independence).   (10)
Lemma 3.8 (Exercise (HW1)). Let X be a random variable with EX = 0 and X ∈ [a, b] with probability 1 (w.p. 1). Then, for any λ > 0,

    E(e^{λX}) ≤ e^{λ²(b−a)²/8}.
The above lemma, combined with (9) and (10), implies the following Hoeffding tail inequality.

Lemma 3.9 (Hoeffding's inequality). Let X1, . . . , Xn be independent random variables with Xi ∈ [ai, bi] w.p. 1, for i = 1, . . . , n, and let Sn := Σ_{i=1}^n Xi. Then, for any t > 0,

    P( Sn − ESn ≥ t ) ≤ e^{−2t²/Σ_{i=1}^n (bi−ai)²}

and

    P( Sn − ESn ≤ −t ) ≤ e^{−2t²/Σ_{i=1}^n (bi−ai)²}.
Proof. For any λ > 0, by (9), (10), and Lemma 3.8 (applied to the variables Xi − EXi),

    P( Sn − ESn ≥ t ) ≤ e^{−λt} ∏_{i=1}^n E[e^{λ(Xi−EXi)}] ≤ e^{−λt} ∏_{i=1}^n e^{λ²(bi−ai)²/8} = exp( −λt + (1/8) λ² Σ_{i=1}^n (bi − ai)² ).
To get the best possible upper bound, we find the minimum of the right-hand side of the last inequality as a function of λ. Define g : R+ → R by g(λ) := −λt + (λ²/8) Σ_{i=1}^n (bi − ai)². Note that g is a quadratic function and achieves its minimum at λ = 4t / Σ_{i=1}^n (bi − ai)². Plugging this value of λ into the above bound we obtain the desired result. The bound for the lower tail is proved similarly.
Example 3.10 (Hoeffding's bound for i.i.d. random variables). Suppose that X1, . . . , Xn are i.i.d. such that X1 ∈ [a, b] w.p. 1. Then, for any given α ∈ (0, 1), a direct consequence of Lemma 3.9 is that

    Sn − E[Sn] ≥ √( n(b − a)²/2 · log(1/α) )   (11)

with probability at most α. In fact, by Hoeffding's inequality, we can obtain a 1 − α honest conservative (symmetric) confidence interval (around the sample mean X̄n) for the population mean µ := E[X1] as:

    [ X̄n − (b − a)/√(2n) · √(log(2/α)),  X̄n + (b − a)/√(2n) · √(log(2/α)) ].
Hoeffding’s inequality does not depend on the distribution of the Xi ’s (which is good),
but also does not incorporate the dependence on Var(Xi ) in the bound (which can result
in an inferior bound; e.g., consider Xi ∼ Bernoulli(p) where p is close to 0 or 1).
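A minimal simulation of the confidence interval from Example 3.10 (assuming Bernoulli data for concreteness, so [a, b] = [0, 1]); as the remark above suggests, the coverage is conservative because the variance is ignored.

```python
import numpy as np

rng = np.random.default_rng(6)

def hoeffding_ci(x, alpha, a=0.0, b=1.0):
    half_width = (b - a) * np.sqrt(np.log(2 / alpha) / (2 * len(x)))
    return x.mean() - half_width, x.mean() + half_width

p, n, alpha, trials = 0.3, 200, 0.05, 5000
covered = 0
for _ in range(trials):
    lo, hi = hoeffding_ci(rng.binomial(1, p, size=n), alpha)
    covered += (lo <= p <= hi)
print(f"empirical coverage = {covered / trials:.3f} (guaranteed >= {1 - alpha})")
```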
3.2.2 Sub-Gaussian random variables/processes
Suppose (T, d) is a semi-metric space and let {Xt, t ∈ T} be a stochastic process indexed by T satisfying

    P( |Xs − Xt| ≥ u ) ≤ 2 exp( −u²/(2 d(s, t)²) )  for all u > 0.   (12)
Such a stochastic process is called sub-Gaussian with respect to the semi-metric d. Any Gaussian process is sub-Gaussian for the standard deviation semi-metric d(s, t) = √(Var(Xs − Xt)).
Another example is the Rademacher process

    Xa := Σ_{i=1}^n ai εi,  a := (a1, . . . , an) ∈ Rn,

where ε1, . . . , εn are i.i.d. Rademacher random variables, i.e., P(εi = +1) = P(εi = −1) = 1/2. This process is sub-Gaussian with respect to the Euclidean distance d(a, b) := ka − bk, as the following tail bound shows: for every a ∈ Rn and x ≥ 0,

    P( |Σ_{i=1}^n ai εi| ≥ x ) ≤ 2 exp( −x²/(2kak²) ).
Proof. For any λ and a Rademacher variable ε, one has Ee^{λε} = (e^λ + e^{−λ})/2 ≤ e^{λ²/2}, where the last inequality follows after writing out the power series. Thus, by Markov's inequality, for any λ > 0,

    P( Σ_{i=1}^n ai εi ≥ x ) ≤ e^{−λx} Ee^{λ Σ_{i=1}^n ai εi} ≤ e^{(λ²/2)kak² − λx}.

The best upper bound is obtained for λ = x/kak² and is the exponential in the probability bound above. Combination with a similar bound for the lower tail yields the two-sided bound.
Here is a useful (and probably the earliest and simplest) 'maximal inequality', which is an application of Hoeffding's ideas.

Lemma 3.13. Suppose Y1, . . . , YN are random variables such that Ee^{λYi} ≤ e^{λ²σ²/2} for every λ > 0 and i = 1, . . . , N (e.g., Yi ∼ N(0, σ²)). Then,

    E max_{i=1,...,N} Yi ≤ σ √(2 log N).   (13)
Proof. For any λ > 0, by Jensen's inequality (applied to the convex function x ↦ e^{λx}),

    e^{λ E[max_i Yi]} ≤ E[e^{λ max_i Yi}] = E[max_i e^{λYi}] ≤ Σ_{i=1}^N E[e^{λYi}] ≤ N e^{λ²σ²/2}.

Taking logarithms yields

    E max_{i=1,...,N} Yi ≤ (log N)/λ + λσ²/2.
Optimizing with respect to λ (differentiating and then equating to 0) yields the result.
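A quick Monte Carlo sanity check of (13) in the Gaussian case (a sketch; the constants and sample sizes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)

sigma, reps = 1.0, 10_000
for N in [10, 100, 1000]:
    emp = rng.normal(0.0, sigma, size=(reps, N)).max(axis=1).mean()   # estimate of E[max_i Y_i]
    bound = sigma * np.sqrt(2 * np.log(N))
    print(f"N = {N:>4d}:  E[max_i Y_i] ~ {emp:.3f}   sigma*sqrt(2 log N) = {bound:.3f}")
```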
Lemma 3.13 can be easily extended to yield the following maximal inequality.
Lemma 3.16. Show that if the ξi's are linear combinations of Rademacher variables (i.e., ξi = Σ_{k=1}^n a_k^{(i)} εk, with a^{(i)} = (a_1^{(i)}, . . . , a_n^{(i)}) ∈ Rn), then: (i) E[e^{ξi²/(6ka^{(i)}k²)}] ≤ 2; (ii) by using ψ(x) = e^{x²}, show that, for N ≥ 2,

    E max_{1≤i≤N} |ξi| ≤ C √(log N) max_{1≤i≤N} ka^{(i)}k.   (14)

(iii) Further, show that one may take C = 2√6 for Rademacher linear combinations.
For a nondecreasing convex function ψ : [0, ∞) → [0, ∞) with ψ(0) = 0, the Orlicz norm of a random variable X is defined as kXkψ := inf{c > 0 : E[ψ(|X|/c)] ≤ 1}, with the understanding that kXkψ = ∞ if the infimum runs over an empty set. Let ψ(x) = exp(x²) − 1. Then show that kXkψ < ∞ if and only if X − EX is sub-Gaussian.
3.3 Symmetrization
Define the symmetrized empirical measure P_n^o by

    P_n^o f := (1/n) Σ_{i=1}^n εi f(Xi),  f ∈ F,

where ε1, . . . , εn are i.i.d. Rademacher random variables independent of X1, . . . , Xn. The quantity kP_n^o kF = sup_{f∈F} |(1/n) Σ_{i=1}^n εi f(Xi)| denotes the maximum correlation, taken over all functions f ∈ F, between the values (f(X1), . . . , f(Xn)) and the random signs (ε1, . . . , εn). The intuition here is: a function class is extremely large (in fact, "too large" for statistical purposes) if we can always find a function in the class that has a high correlation with a randomly drawn noise vector.
The advantage of symmetrization lies in the fact that the symmetrized process is typically easier to control than the original process, as we will find out in several places. For example, even though Σ_{i=1}^n (f(Xi) − P f) has only low order moments, Σ_{i=1}^n εi f(Xi) is sub-Gaussian conditionally on X1, . . . , Xn (by Hoeffding's inequality).
Lemma 3.17 (Symmetrization). For any class F of measurable functions,

    E kPn − P kF ≤ 2 E kP_n^o kF.

Proof. Let Y1, . . . , Yn be an independent copy of X1, . . . , Xn, independent also of the εi's. Since P f = E[f(Yi)], Jensen's inequality gives

    E kPn − P kF ≤ EX,Y sup_{f∈F} | (1/n) Σ_{i=1}^n [f(Xi) − f(Yi)] |.

Adding a minus sign in front of a term [f(Xi) − f(Yi)] has the effect of exchanging Xi and Yi. Because the Y's are independent copies of the X's, the expectation of any function g(X1, . . . , Xn, Y1, . . . , Yn) (= sup_{f∈F} |Σ_{i=1}^n [f(Xi) − f(Yi)]| here) remains unchanged under such exchanges; hence

    EX,Y sup_{f∈F} | (1/n) Σ_{i=1}^n ei [f(Xi) − f(Yi)] |

is the same for any n-tuple (e1, . . . , en) ∈ {−1, +1}^n. Deduce that

    E kPn − P kF ≤ Eε EX,Y sup_{f∈F} | (1/n) Σ_{i=1}^n εi [f(Xi) − f(Yi)] |.

Using the triangle inequality to separate the contributions of the X's and the Y's, and noting that both contributions equal E kP_n^o kF, completes the argument.
Remark 3.2. The symmetrization lemma is valid for any class F. In the proofs of
Glivenko-Cantelli and Donsker theorems, it will be applied not only to the original set of
functions of interest, but also to several classes constructed from such a set F (such as the
class Fδ of small differences). The next step in these proofs is to apply a maximal inequality
to the right side of the above theorem, conditionally on X1 , . . . , Xn .
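The symmetrization inequality is also easy to see in a small simulation; the sketch below uses the assumed toy setup P = Unif(0, 1) and F = indicators of half-lines evaluated on a grid.

```python
import numpy as np

rng = np.random.default_rng(8)

n, reps, grid = 100, 2000, np.linspace(0, 1, 201)
sup_emp, sup_sym = np.zeros(reps), np.zeros(reps)
for r in range(reps):
    x = rng.random(n)
    eps = rng.choice([-1.0, 1.0], size=n)
    indicators = (x[:, None] <= grid)                                       # f(X_i) for f in F
    sup_emp[r] = np.max(np.abs(indicators.mean(axis=0) - grid))             # ||P_n - P||_F
    sup_sym[r] = np.max(np.abs((eps[:, None] * indicators).mean(axis=0)))   # ||P_n^o||_F
print(f"E||P_n - P||_F ~ {sup_emp.mean():.4f}   2 E||P_n^o||_F ~ {2 * sup_sym.mean():.4f}")
```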
Lemma 3.18 (A more general version of the symmetrization lemma; Exercise (HW1)).
Suppose that P f = 0 for all f ∈ F. Let ε1 , . . . , εn be independent Rademacher random
variables independent of X1 , . . . , Xn . Let Φ : R+ → R+ be a nondecreasing convex function,
and let µ : F → R be a bounded functional such that {f + µ(f ) : f ∈ F} is pointwise
measurable. Then,
" n
# " n
# " n
#
1 X X X
E Φ εi f (Xi ) ≤E Φ f (Xi ) ≤E Φ 2 εi (f (Xi ) + µ(f )) .
2 i=1
F
i=1
F
i=1
F
3.4 Proof of Theorem 3.5

In this subsection we prove Theorem 3.5. We only show the sufficiency of the entropy condition. By the symmetrization result, measurability of the class F, and Fubini's theorem,
    E kPn − P kF ≤ 2 EX Eε k (1/n) Σ_{i=1}^n εi f(Xi) kF
                ≤ 2 EX Eε k (1/n) Σ_{i=1}^n εi f(Xi) kFM + 2 P[F 1{F > M}],
by the triangle inequality, for every M > 0. For sufficiently large M , the last term is
arbitrarily small. To prove convergence in mean, it suffices to show that the first term
converges to zero for fixed M . Fix X1 , . . . , Xn . If G is an η-net35 in L1 (Pn ) over FM , then
for any f ∈ FM , there exists g ∈ G such that
    (1/n) Σ_{i=1}^n εi f(Xi) ≤ (1/n) Σ_{i=1}^n εi g(Xi) + | (1/n) Σ_{i=1}^n εi [f(Xi) − g(Xi)] | ≤ k (1/n) Σ_{i=1}^n εi g(Xi) kG + η.
35
The set of centers of balls of radius η that cover T is called an η-net of T.
Thus,

    Eε k (1/n) Σ_{i=1}^n εi f(Xi) kFM ≤ Eε k (1/n) Σ_{i=1}^n εi g(Xi) kG + η.   (15)
The cardinality of G can be chosen equal to N(η, FM, L1(Pn)). Given X1, . . . , Xn, the symmetrized empirical process G_n^o is sub-Gaussian with respect to the L2(Pn)-norm. Using the maximal inequality in Lemma 3.15 with ψ(x) = exp(x²) (see (14)) shows that the preceding display does not exceed

    C √( log N(η, FM, L1(Pn)) ) (1/√n) sup_{g∈G} kgkn + η,

where kgk²n := (1/n) Σ_{i=1}^n g(Xi)² and C is a universal constant.
The k · kn-norms of the g ∈ G are bounded above by M (this can always be ensured by truncating g if required). By assumption, the square root of the entropy divided by √n tends to zero in probability. Thus the right side of the above display tends to η in probability. Since this argument is valid for every η > 0, it follows that the left side of (15) converges to zero in probability.
Next we show that EX[ Eε k (1/n) Σ_{i=1}^n εi f(Xi) kFM ] converges to 0. Since Eε k (1/n) Σ_{i=1}^n εi f(Xi) kFM is bounded by M and converges to 0 in probability, its expectation with respect to X1, . . . , Xn converges to zero by the dominated convergence theorem.
This concludes the proof that kPn − P kF → 0 in mean. That it also converges almost
surely follows from the fact that the sequence kPn − P kF is a reverse sub-martingale with
respect to a suitable filtration; see e.g., [van der Vaart and Wellner, 1996, Lemma 2.4.5] (by
martingale theory any nonnegative reverse sub-martingale converges almost surely to some
limit36 ).
3.5 Applications
Consider the setup of M -estimation as introduced in Section 1.2 and Example 1.3 where
we assume that Θ is a metric space with the metric d(·, ·). In this example we describe
the steps to prove the consistency of the M -estimator θ̂n := arg maxθ∈Θ Pn [mθ ], as defined
in (4). Formally, we want to show that
    d(θ̂n, θ0) → 0 in probability,  where θ0 := arg max_{θ∈Θ} P[mθ].
We will assume that the class of functions F := {mθ(·) : θ ∈ Θ} is P-Glivenko-Cantelli. We will further need to assume that θ0 is a well-separated maximizer, i.e., for every δ > 0,

    ψ(δ) := M(θ0) − sup_{θ∈Θ: d(θ,θ0)≥δ} M(θ) > 0,   (16)

where, as before, M(θ) := P[mθ] and Mn(θ) := Pn[mθ].
Observe that, exactly as in Example 1.3, if d(θ̂n, θ0) ≥ δ then sup_{θ∈Θ} |Mn(θ) − M(θ)| ≥ ψ(δ)/2. Therefore,

    P( d(θ̂n, θ0) ≥ δ ) ≤ P( sup_{θ∈Θ} |Mn(θ) − M(θ)| ≥ ψ(δ)/2 ) → 0,

by the P-Glivenko-Cantelli property of F.
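A toy illustration of this consistency argument, with the assumed choices P = Exp(1) and mθ(x) = −|x − θ| (so that θ0 is the population median, log 2):

```python
import numpy as np

rng = np.random.default_rng(9)

theta_grid = np.linspace(0, 3, 301)
theta0 = np.log(2)

def Mn(x, grid):
    # M_n(theta) = P_n[m_theta] = -(1/n) sum_i |X_i - theta|, evaluated on a grid.
    return -np.mean(np.abs(x[:, None] - grid[None, :]), axis=0)

for n in [100, 1000, 10_000]:
    x = rng.exponential(1.0, size=n)
    theta_hat = theta_grid[np.argmax(Mn(x, theta_grid))]
    print(f"n = {n:>6d}:  theta_hat = {theta_hat:.3f}   theta_0 = {theta0:.3f}")
```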
Lemma 3.19 (Exercise (HW1)). Suppose that Θ is a convex subset of Rd, and that θ ↦ mθ(x) is continuous and concave, for all x ∈ X. Suppose that E[Gε(X)] < ∞, where, for some ε > 0, Gε(x) := sup_{kθ−θ0k≤ε} |mθ(x)|, for x ∈ X. Then θ̂n → θ0 in probability.
Hint: Define α := ε/(ε + kθ̂n − θ0k) and θ̃n := α θ̂n + (1 − α) θ0, and compare Mn(θ̃n) with Mn(θ0).
Consider the fixed-design nonparametric regression model

    Yi = g0(zi) + Wi,  i = 1, . . . , n,

where the errors W1, . . . , Wn are independent with EWi = 0 and variance Var(Wi) ≤ σ0² < ∞, for i = 1, . . . , n. The covariates z1, . . . , zn ∈ Z are fixed, i.e., we consider the case of fixed design.
The function g0 : Z → R is unknown, but we assume that g0 ∈ G, where G is a given
class of regression functions. The unknown regression function can be estimated by the
least squares estimator (LSE) ĝn , which is defined (not necessarily uniquely) by
    ĝn := arg min_{g∈G} Σ_{i=1}^n (Yi − g(zi))².   (17)
Let

    Qn := (1/n) Σ_{i=1}^n δ_{zi}
denote the empirical measure of the design points. For g : Z → R, we write
    kgk²n := (1/n) Σ_{i=1}^n g²(zi),  kY − gk²n := (1/n) Σ_{i=1}^n (Yi − g(zi))²,  and  hW, gin := (1/n) Σ_{i=1}^n Wi g(zi).
Question: When can we say that kĝn − g0kn → 0 in probability?
Our starting point is the following "basic inequality":

    kĝn − g0k²n ≤ 2 hW, ĝn − g0 in,   (18)

which follows from simplifying the inequality kY − ĝnk²n ≤ kY − g0k²n (Hint: write kY − ĝnk²n = kY − g0 + g0 − ĝnk²n and expand).
We shall need to control the entropy, not of the whole class G itself, but of subclasses
Gn (R), which are defined as
Gn (R) = {g ∈ G : kg − g0 kn ≤ R}.
Theorem 3.20. Suppose that the errors are uniformly square-integrable in the sense that

    lim_{K→∞} lim sup_{n→∞} (1/n) Σ_{i=1}^n E[Wi² 1{|Wi| > K}] = 0,   (19)

and

    log N(δ, Gn(R), L1(Qn)) / n → 0,  for all δ > 0, R > 0.   (20)

Then, kĝn − g0kn → 0 in probability.
Proof. Let η, δ > 0 be given. We will show that P(kĝn − g0kn > δ) can be made arbitrarily small, for all n sufficiently large. Note that, by (18), for any R > δ we have

    P( kĝn − g0kn > δ ) ≤ P( sup_{g∈Gn(R)} hW, g − g0 in ≥ δ²/2 ) + P( kĝn − g0kn > R ).   (21)
We will first show that the second term on the right side can be made arbitrarily small by choosing R large. From (18), using the Cauchy-Schwarz inequality, it follows that

    kĝn − g0kn ≤ 2 ( (1/n) Σ_{i=1}^n Wi² )^{1/2},

so that, by Markov's inequality, P(kĝn − g0kn > R) ≤ 4σ0²/R², which is smaller than η for R large. Next, for any K > 0 and every g ∈ Gn(R), the Cauchy-Schwarz inequality gives

    hW, g − g0 in ≤ hW 1{|W|≤K}, g − g0 in + ( (1/n) Σ_{i=1}^n Wi² 1{|Wi|>K} )^{1/2} R.
An application of the Cauchy-Schwarz and Markov's inequalities bounds the second term on the right side of the above display:

    P( ( (1/n) Σ_{i=1}^n Wi² 1{|Wi|>K} )^{1/2} ≥ δ²/(4R) ) ≤ (4R/δ²)² E[ (1/n) Σ_{i=1}^n Wi² 1{|Wi|>K} ] ≤ η,
by choosing K = K(δ, η) sufficiently large and using (19). We bound the first (truncated) term in (21) by using Markov's inequality:

    P( sup_{g∈Gn(R)} hW 1{|W|≤K}, g − g0 in ≥ δ²/4 ) ≤ (4/δ²) E[ k hW 1{|W|≤K}, g − g0 in kGn(R) ].
The random variables Wi 1{|Wi |≤K} still have expectation zero if each Wi is symmetric,
which we shall assume to avoid digressions. If they are not symmetric, one can use different
truncation levels to the left and right to approximately maintain zero expectation. We will
now use Hoeffding's inequality (see Lemma 3.9). Thus, for any function g : Z → R and for all δ > 0,

    P( |hW 1{|W|≤K}, g − g0 in| ≥ δ ) ≤ 2 exp( −nδ² / (2K² kg − g0k²n) ).   (22)
The proof will now mimic the proof of Theorem 3.5. If G̃ is an ε-net in L1(Qn) over Gn(R), then

    E[ k hW 1{|W|≤K}, g − g0 in kGn(R) ] ≤ E[ k hW 1{|W|≤K}, g − g0 in kG̃ ] + Kε.   (23)
The cardinality of G̃ can be chosen equal to N(ε, Gn(R), L1(Qn)). Using the maximal inequality in Lemma 3.13 with ψ(x) = exp(x²) (note that (13) holds for every g ∈ G̃ by Lemma 3.21 and (22)) shows that the preceding display does not exceed a multiple of

    √( log N(ε, Gn(R), L1(Qn)) ) (K/√n) sup_{g∈G̃} kg − g0kn + Kε.
The norms kg − g0kn of g ∈ G̃ are bounded above by R. By assumption, the square root of the entropy divided by √n tends to zero. Thus the above display is less than (K + 1)ε for all large n. Since this argument is valid for every ε > 0, it follows that the left side of (23) can be made less than η.
Let X1, . . . , Xn be independent random variables taking values in X, let f : X^n → R, and let

    Z = f(X1, . . . , Xn)

be the random variable of interest (e.g., Z = Σ_{i=1}^n Xi when X = R). We seek upper bounds for

    P(Z > EZ + t)  and  P(Z < EZ − t),  for t > 0.
Let X1, . . . , Xn be independent random variables taking values in X. Let f : X^n → R and Z = f(X1, . . . , Xn) be the random variable of interest. Note that if we define

    Yk := E[Z | X1, . . . , Xk],  for k = 1, . . . , n,  and  Y0 := E[Z],

then (Yk)_{k=0}^n is a martingale (Doob's martingale) and, with the martingale differences ∆i := Yi − Yi−1, we can write Z − E[Z] = Σ_{i=1}^n ∆i.
Lemma 3.22 (Azuma-Hoeffding inequality). Suppose that the martingale differences are
bounded, i.e., |∆i | ≤ ci , for all i = 1, . . . , n. Then,
    Ee^{λ(Z−E(Z))} ≤ e^{λ² Σ_{i=1}^n ci² / 2}.
Definition 3.23 (Bounded difference property). We say that f : X^n → R satisfies the bounded difference property with constants c1, . . . , cn if, for every i = 1, . . . , n and all x1, . . . , xn, x'_i ∈ X,

    |f(x1, . . . , xi−1, xi, xi+1, . . . , xn) − f(x1, . . . , xi−1, x'_i, xi+1, . . . , xn)| ≤ ci.   (24)

In other words, if we change the i-th variable of f while keeping all the others fixed, the value of the function cannot change by more than ci.
The following theorem provides exponential tail bounds for the random variable Z − E(Z). It follows easily from the above lemma and is left as an exercise.

Theorem 3.24 (Bounded differences inequality). Suppose that f satisfies the bounded difference property: for each i = 1, . . . , n there is a constant ci such that

    |f(x1, . . . , xi−1, xi, xi+1, . . . , xn) − f(x1, . . . , xi−1, x'_i, xi+1, . . . , xn)| ≤ ci   (25)

for all x1, . . . , xn, x'_i ∈ X. Then, for every t > 0,

    P( |Z − E[Z]| ≥ t ) ≤ 2 exp( −2t² / Σ_{i=1}^n ci² ).
Proof. We will show that if f satisfies the bounded difference property with the constants ci's, then |∆i| ≤ ci, which together with Lemma 3.22 will yield the desired result. Recall that

    ∆i = Yi − Yi−1 = E[Z | X1, . . . , Xi] − E[Z | X1, . . . , Xi−1].

Fix X1 = x1, . . . , Xi−1 = xi−1, for any x1, . . . , xi−1 ∈ X. Then ∆i can be viewed solely as a function of Xi. We will study the range of ∆i as Xi = x varies over X. Observe that, as X1, . . . , Xn are independent,

    |∆i(x)| = | E[f(x1, . . . , xi−1, x, Xi+1, . . . , Xn)] − E[f(x1, . . . , xi−1, Xi, Xi+1, . . . , Xn)] |
            ≤ E| f(x1, . . . , xi−1, x, Xi+1, . . . , Xn) − f(x1, . . . , xi−1, Xi, Xi+1, . . . , Xn) | ≤ ci,

where the first inequality follows using Jensen's inequality, and the last inequality follows from the fact that f satisfies the bounded differences property (24).
Example 3.25 (Kernel density estimation). Let X1, . . . , Xn be i.i.d. from a distribution P on R (the argument can be easily generalized to Rd) with density φ. We want to estimate φ nonparametrically using the kernel density estimator (KDE) φ̂n : R → [0, ∞) defined as

    φ̂n(x) := (1/(n hn)) Σ_{i=1}^n K( (x − Xi)/hn ),  for x ∈ R,
i=1
where hn > 0 is the smoothing bandwidth and K is a nonnegative kernel (i.e., K ≥ 0 and
R
K(x)dx = 1). The L1 -error of the estimator φ̂n is
Z
Z ≡ f (X1 , . . . , Xn ) := |φ̂n (x) − φ(x)|dx.
The random variable Z not only provides a measure of the difference between φ̂n and φ but, as Z = 2 sup_A |P̂n(A) − P(A)| (Exercise (HW1): Show this), where the supremum is over all Borel sets A in R and P̂n denotes the distribution with density φ̂n, Z also captures the difference between P̂n and P in the total variation distance.
We can use Theorem 3.24 to get exponential tail bounds for Z. We will show that (25) holds with ci = 2/n, for all i = 1, . . . , n. It is easy to see that, for x1, . . . , xn, x'_i ∈ X,

    |f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x'_i, . . . , xn)| ≤ (1/(n hn)) ∫ | K((x − xi)/hn) − K((x − x'_i)/hn) | dx ≤ 2/n,

since K integrates to one. Theorem 3.24 then gives

    P( |Z − E[Z]| ≥ t ) ≤ 2 exp( −n t²/2 ),  for all t > 0,

which shows that Z concentrates around its expectation E[Z] at the rate n^{−1/2}. The remarkable thing is that this concentration property holds regardless of the choice of the bandwidth hn. Of course, in this case, it is difficult to actually compute what that expectation is.
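A small simulation of Example 3.25 (assuming P = N(0, 1) and a Gaussian kernel, both arbitrary choices): the mean of the L1 error depends strongly on the bandwidth, but its fluctuations are O(n^{-1/2}) for every bandwidth.

```python
import numpy as np

rng = np.random.default_rng(10)

def l1_error(n, h, grid):
    x = rng.standard_normal(n)
    kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    phi = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
    return np.trapz(np.abs(kde - phi), grid)   # Z = integral of |kde - phi|

grid, n, reps = np.linspace(-6, 6, 1201), 500, 200
for h in [0.05, 0.2, 1.0]:
    z = np.array([l1_error(n, h, grid) for _ in range(reps)])
    print(f"h = {h:4.2f}:  E[Z] ~ {z.mean():.3f}   sd(Z) ~ {z.std():.4f}   1/sqrt(n) = {1/np.sqrt(n):.4f}")
```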
Let us conclude this section with an important result on what the bounded differences concentration inequality implies for the supremum of the empirical process,

    Z := sup_{f∈F} | (1/n) Σ_{i=1}^n f(Xi) − E[f(X1)] |,

where F is a class of functions that are uniformly bounded by B (i.e., sup_{f∈F} kf k∞ ≤ B). Write Z = g(X1, . . . , Xn), where

    g(x1, . . . , xn) := sup_{f∈F} | (1/n) Σ_{i=1}^n f(xi) − E[f(X1)] |.
We shall show below that g satisfies the bounded differences property (24) with ci := 2B/n for i = 1, . . . , n. To see this, note that for any fixed f ∈ F,

    | (1/n) Σ_{j≠i} f(xj) + f(x'_i)/n − E[f(X1)] |
        = | (1/n) Σ_{j=1}^n f(xj) − E[f(X1)] + (f(x'_i) − f(xi))/n |
        ≤ | (1/n) Σ_{j=1}^n f(xj) − E[f(X1)] | + 2B/n
        ≤ g(x1, . . . , xn) + 2B/n,

where we have used the fact that, for every f ∈ F, |f(xi)| ≤ B and |f(x'_i)| ≤ B. Taking the supremum over f ∈ F gives g(x1, . . . , xi−1, x'_i, xi+1, . . . , xn) ≤ g(x1, . . . , xn) + 2B/n. Interchanging the roles of xi and x'_i, we deduce that (24) holds with ci := 2B/n, for i = 1, . . . , n. Then, Theorem 3.24 yields
    P( |Z − EZ| > t ) ≤ 2 exp( −nt²/(2B²) ),  for every t ≥ 0.
Setting δ := exp(−nt²/(2B²)), we can deduce that

    |Z − E[Z]| ≤ B √( (2/n) log(1/δ) )
holds with probability at least 1 − 2δ for every δ > 0. This inequality implies that E[Z] is
usually the dominating term for understanding the behavior of Z.
We may apply this to study the classical Glivenko-Cantelli problem. The following
theorem illustrates this.
Theorem 3.26. Suppose that X1 , . . . , Xn are i.i.d. random variables on R with distribution
P and c.d.f. F . Let Fn be the empirical d.f. of the data (see (1)). Then,
" r #
log(n + 1) 2
P kFn − F k∞ ≥ 8 + t ≤ e−nt /2 , forall t > 0. (27)
n
a.s.
Hence, kFn − F k∞ → 0.
Proof. The function class under consideration is F := {1(−∞,t](·) : t ∈ R}. Then, Z := kPn − P kF = kFn − F k∞. From the discussion in this subsection, we only have to upper bound E[Z]. This can be done via symmetrization, i.e., E[Z] ≤ 2 EX[ Eε[ sup_{f∈F} |(1/n) Σ_{i=1}^n εi f(Xi)| ] ], where ε1, . . . , εn are i.i.d. Rademachers independent of the Xi's. For a fixed (x1, . . . , xn) ∈ Rn, define

    ∆n(F; x1, . . . , xn) := {(f(x1), . . . , f(xn)) : f ∈ F}.
Observe that although F has uncountably many functions, for every (x1, . . . , xn) ∈ Rn, ∆n(F; x1, . . . , xn) can contain at most n + 1 distinct values39. Thus, sup_{f∈F} (1/n) Σ_{i=1}^n εi f(xi)
39
If we order xn1 as x(1) ≤ x(2) ≤ . . . ≤ x(n) , then they split the real line into at most n + 1 intervals
(including the two end-intervals (−∞, x(1) ) and [x(n) , ∞). Thus, for a given t ∈ R, the indicator 1(−∞,t] (x(i) )
takes the value one for all x(i) ≤ t, and the value zero for all other samples.
is at most the supremum of n + 1 such variables, and we can apply Lemma 3.16 to show that40

    E[ sup_{f∈F} (1/n) Σ_{i=1}^n εi f(Xi) ] ≤ 4 √( log(n + 1)/n ).
Although the exponential tail bound (27) is adequate for many purposes, it is far
from the tightest possible. Using alternative methods (using Dudley’s entropy bound in
Section 4), we provide a sharper result that removes the log(n + 1) factor.
40
Note that for x1, . . . , xn distinct,

    sup_{f∈F} (1/n) Σ_{i=1}^n εi f(xi) is equal in distribution to max_{j=1,...,n} (1/n) Σ_{i=1}^j εi,

and direct calculations would actually show that Eε[ sup_{f∈F} (1/n) Σ_{i=1}^n εi f(xi) ] = O(n^{−1/2}).
4 Chaining and uniform entropy
In this section we will introduce Dudley’s metric entropy bound (and the idea of chaining).
We will use this result to prove a maximal inequality (with uniform entropy) that will be
useful in deriving rates of convergence of statistical estimators (see Section 5). Further,
as we will see later, these derived maximal inequalities also play a crucial role in proving
functional extensions of the Donsker’s theorem (see Section 11). In fact, these maximal
inequalities are at the heart of the theory of empirical processes.
The proof of the main result in this section involves an idea called chaining. Before we start with chaining, let us recall our first maximal inequality (13) (see Lemma 3.13). Note
that the bound (13) can be tight in some situations. For example, this is the case when
Y1 , . . . , YN are i.i.d. N (0, σ 2 ). Because of this example, Lemma 3.13 cannot be improved
without imposing additional conditions on the Y1 , . . . , YN . It is also easy to construct
examples where (13) is quite weak. For example, if Yi = Y0 + σN Zi , for i = 1, . . . , N ,
iid √
for some Y0 ∼ N (0, σ 2 ) and Zi ∼ N (0, 1) and σN log N = o(1), then it is clear that
√
maxi=1,...,N |Yi | ≈ Y0 so that (13) will be loose by a factor of log N . In order to improve
on (13), we need to make assumptions on how close to each other the Yi ’s are. Dudley’s
entropy bound makes such an assumption explicit and provides improved upper bounds for
E[maxi=1,...,N |Yi |].
For generality, we will assume that we may have an infinite (possibly uncountable) collection
of random variables and we are interested in the expected supremum of the collection.
Suppose that (T, d) is a metric space and Xt is a stochastic process indexed by T . We
will state two versions of Dudley’s metric entropy bound and will prove one of these results
(the other has a similar proof). Let us first assume that EXt = 0, for all t ∈ T. We want to find upper bounds for

    E[ sup_{t∈T} Xt ]

that depend ONLY on the structure of the metric space (T, d). We shall first state Dudley's bound when the index set T is finite and subsequently improve it to the case when T is infinite.
Theorem 4.1 (Dudley’s entropy bound for finite T ). Suppose that {Xt : t ∈ T } is a mean
zero stochastic process such that for every s, t ∈ T and u ≥ 0,41
    P( |Xt − Xs| ≥ u ) ≤ 2 exp( −u²/(2d²(s, t)) ).   (28)
41
Note that if Xt, t ∈ T, have mean zero and are jointly Gaussian, then Xt − Xs is a mean zero normal random variable for every s, t ∈ T, so that (28) holds with d(s, t) := √(E(Xs − Xt)²).
Also, assume that (T, d) is a finite metric space. Then we have

    E[ sup_{t∈T} Xt ] ≤ C ∫_0^∞ √( log N(ε, T, d) ) dε,   (29)

where C > 0 is a universal constant.
Next we give a slightly different formulation of Dudley’s metric entropy bound for finite
T . However, before proceeding further, we shall give a result similar to Lemma 3.13 but
instead bound the maximum of the absolute values of sub-Gaussian random variables.
Proposition 4.2. Let T be a finite set and let {Xt, t ∈ T} be a stochastic process. Suppose that for every t ∈ T and u ≥ 0, the inequality

    P( |Xt| ≥ u ) ≤ 2 exp( −u²/(2σ²) )   (30)

holds. Here σ is a fixed positive real number. Then, for a universal positive constant C, we have

    E max_{t∈T} |Xt| ≤ C σ √( log(2|T|) ).   (31)
Proof. We can control E max_{t∈T} |Xt| by bounding the tail probability P(max_{t∈T} |Xt| ≥ u) for every u ≥ 0. For this, write

    P( max_{t∈T} |Xt| ≥ u ) = P( ∪_{t∈T} {|Xt| ≥ u} ) ≤ Σ_{t∈T} P( |Xt| ≥ u ) ≤ 2|T| exp( −u²/(2σ²) ).
This bound is good for large u but not so good for small u (it is quite bad for u = 0 for
example). It is therefore good to use it only for u ≥ u0 for some u0 to be specified later.
This gives42
    E max_{t∈T} |Xt| = ∫_0^∞ P( max_{t∈T} |Xt| ≥ u ) du
        = ∫_0^{u0} P( max_{t∈T} |Xt| ≥ u ) du + ∫_{u0}^∞ P( max_{t∈T} |Xt| ≥ u ) du
        ≤ u0 + 2|T| ∫_{u0}^∞ exp( −u²/(2σ²) ) du
        ≤ u0 + 2|T| ∫_{u0}^∞ (u/u0) exp( −u²/(2σ²) ) du = u0 + (2|T|/u0) σ² exp( −u0²/(2σ²) ).
42
Result: Suppose that Z ≥ 0. Then, E(Z) = ∫_0^∞ P(Z ≥ t) dt.
One can try to minimize the above term over u0. A simpler strategy is to realize that the large term here is 2|T|, so one can choose u0 to kill this term by setting

    exp( u0²/(2σ²) ) = 2|T|,  or  u0 = √2 σ √( log(2|T|) ).
This gives

    E max_{t∈T} |Xt| ≤ √2 σ √( log(2|T|) ) + σ / √( 2 log(2|T|) ) ≤ C σ √( log(2|T|) ),

which proves the result.
Theorem 4.3. Suppose (T, d) is a finite metric space and {Xt, t ∈ T} is a stochastic process such that (28) holds. Then, for a universal positive constant C, the following inequality holds for every t0 ∈ T:

    E max_{t∈T} |Xt − Xt0| ≤ C ∫_0^∞ √( log D(ε, T, d) ) dε ≲ ∫_0^∞ √( log N(ε, T, d) ) dε.   (32)

Here D(ε, T, d) denotes the ε-packing number of the space (T, d).
The following remarks mention some alternative forms of writing the inequality (32)
and also describe some implications.
Remark 4.1. Let D̃ denote the diameter of the metric space T (i.e., D̃ := max_{s,t∈T} d(s, t)). Then the packing number D(ε, T, d) clearly equals 1 for ε ≥ D̃ (it is impossible to have two points in T whose distance is strictly larger than ε when ε ≥ D̃). Therefore,

    ∫_0^∞ √( log D(ε, T, d) ) dε = ∫_0^{D̃} √( log D(ε, T, d) ) dε.
Moreover,
Z D̃ p Z D̃/2 p Z D̃ p
log D(, T, d)d = log D(, T, d)d + log D(, T, d)d
0 0 D̃/2
Z D̃/2 p Z D̃/2 q
≤ log D(, T, d)d + log D( + (D̃/2), T, d)d
0 0
Z D̃/2 p
≤2 log D(, T, d)d
0
because D( + (D̃/2), T, d) ≤ D(, T, d) for every . We can thus state Dudley’s bound as
Z D̃/2 p
E max |Xt − Xt0 | ≤ C log D(, T, d)d
t∈T 0
where the C above equals twice the constant C in (32). Similarly, again by splitting the above
integral in two parts (over 0 to D̃/4 and over D̃/4 to D̃/2), we can also state Dudley’s bound
as Z D̃/4 p
E max |Xt − Xt0 | ≤ C log D(, T, d)d.
t∈T 0
The constant C above now is 4 times the constant in (32).
44
Remark 4.2. The left hand side in (32) is bounded from below (by triangle inequality) by
E maxt∈T |Xt | − E|Xt0 |. Thus, (32) implies that
Z D̃/4 p
E max |Xt | ≤ E|Xt0 | + C log D(, T, d)d for every t0 ∈ T .
t∈T 0
We shall now give the proof of Theorem 4.3. The proof will be based on an idea called
chaining. Specifically, we shall split maxt∈T (Xt − Xt0 ) in chains and use the bound given
by Proposition 4.2 within the links of each chain.
Proof of Theorem 4.3. Recall that D̃ is the diameter of T . For n ≥ 1, let Tn be a maximal
D̃2−n -separated subset of T i.e., mins,t∈Tn :s6=t d(s, t) > D̃2−n and Tn has maximal cardi-
nality subject to the separation restriction. The cardinality of Tn is given by the packing
number D(D̃2−n , T, d). Because of the maximality,
Because T is finite and d(s, t) > 0 for all s 6= t, the set Tn will equal T when n is large. Let
N := min{n ≥ 1 : Tn = T }.
For each n ≥ 1, let πn : T → Tn denote the function which maps each point t ∈ T to the
point in Tn that is closest to T (if there are multiple closest points to T in Tn , then choose
one arbitrarily). In other words, πn (t) is chosen so that
Note that πN (t) = t. Finally let T0 := {t0 } and π0 (t) = t0 for all t ∈ T .
We now note that
N
X
Xt − Xt0 = Xπn (t) − Xπn−1 (t) for every t ∈ T . (35)
n=1
The sequence
t0 → π1 (t) → π2 (t) → · · · → πN −1 (t) → πN (t) = t
can be viewed as a chain from t0 to t. This is what gives the argument the name chaining.
By (35), we obtain
N
X N
X
max |Xt − Xt0 | ≤ max |Xπn (t) − Xπn−1 (t) | ≤ max |Xπn (t) − Xπn−1 (t) |
t∈T t∈T t∈T
n=1 n=1
45
so that
N
X
E max |Xt − Xt0 | ≤ E max |Xπn (t) − Xπn−1 (t) |. (36)
t∈T t∈T
n=1
Now to bound E maxt∈T |Xπn (t) − Xπn−1 (t) | for each 1 ≤ n ≤ N , we shall use the elementary
bound given by Proposition 4.2. For this, note first that by (28), we have
−u2
P |Xπn (t) − Xπn−1 (t) | ≥ u ≤ 2 exp .
2d2 (πn (t), πn−1 (t))
Now
d(πn (t), πn−1 (t)) ≤ d(πn (t), t) + d(πn−1 (t), t) ≤ D̃2−n + D̃2−(n−1) = 3D̃2−n .
Thus Proposition 4.2 can be applied with σ := 3D̃2−n so that we obtain (note that the
value of C might change from occurrence to occurrence)
3D̃ p
E max |Xπn (t) − Xπn−1 (t) | ≤ C log (2|Tn ||Tn−1 |)
t∈T 2n
r
−n −n
p
≤ C D̃2 2
log (2|Tn | ) ≤ C D̃2 log 2D(D̃2−n , T, d)
Note now that for ≤ D̃/4, the packing number D(, T, d) ≥ 2 so that
√ Z D̃/4 p
E max |Xt − Xt0 | ≤ 2 2C log D(, T, d)d
t∈T 0
46
4.1.1 Dudley’s bound when the metric space is separable
We shall next prove Dudley’s bound for the case of infinite T . This requires a technical
assumption called separability which will always be satisfied in our applications.
Definition 4.4 (Separable stochastic process). Let (T, d) be a metric space. The stochastic
process {Xt , t ∈ T } indexed by T is said to be separable if there exists a null set N and a
countable subset T̃ of T such that for all ω ∈
/ N and t ∈ T , there exists a sequence {tn } in
T̃ with limn→∞ d(tn , t) = 0 and limn→∞ Xtn (ω) = Xt (ω).
Note that the definition of separability requires that T̃ is a dense subset of T which
means that the metric space (T, d) is separable (a metric space is said to be separable if it
has a countable dense subset).
The following fact is easy to check: If (T, d) is a separable metric space and if Xt , t ∈ T ,
has continuous sample paths (almost surely), then Xt , t ∈ T is separable. The statement
that Xt , t ∈ T , has continuous sample paths (almost surely) means that there exists a null
set N such that for all ω ∈ / N , the function t 7→ Xt (ω) is continuous on T .
The following fact is also easy to check: If {Xt , t ∈ T } is a separable stochastic process,
then
sup |Xt − Xt0 | = sup |Xt − Xt0 | almost surely (37)
t∈T t∈T̃
Theorem 4.5. Let (T, d) be a separable metric space and let {Xt , t ∈ T } be a separable
stochastic process. Suppose that for every s, t ∈ T and u ≥ 0, we have
u2
P {|Xs − Xt | ≥ u} ≤ 2 exp − 2 .
2d (s, t)
Then for every t0 ∈ T , we have
Z D̃/4 p
E sup |Xt − Xt0 | ≤ C log D(, T, d)d (38)
t∈T 0
Proof of Theorem 4.5. Let T̃ be a countable subset of T such that (37) holds. We may
assume that T̃ contains t0 (otherwise simply add t0 to T̃ ). For each k ≥ 1, let T̃k be the
47
finite set obtained by taking the first k elements of T̃ (in an arbitrary enumeration of the
entries of T̃ ). We can ensure that T̃k contains t0 for every k ≥ 1.
Applying the finite index set version of Dudley’s theorem (Theorem 4.3) to {Xt , t ∈ T̃k },
we obtain
Z diam(T̃k )/4 q Z D̃/4 p
E max |Xt − Xt0 | ≤ C log D(, T̃k , d)d ≤ C log D(, T, d)d.
t∈T̃k 0 0
Note that the right hand side does not depend on k. Letting k → ∞ on the left hand side,
we use the Monotone Convergence Theorem to obtain
" # Z D̃/4 p
E sup |Xt − Xt0 | ≤ C log D(, T, d)d.
t∈T̃ 0
Remark 4.3. One may ask if there is a lower bound for E supt∈T Xt in terms of cov-
ering/packing numbers. A classical result in this direction is Sudakov’s lower bound which
states: For a zero-mean Gaussian process Xt defined on T , define the variance pseudometric
d2 (s, t) := Var(Xs − Xt ). Then,
h i p
E sup Xt ≥ sup log D(, T, d),
t∈T >0 2
It is natural to ask how Dudley’s bound can be useful for the theory of empirical
process. Indeed, Theorem 4.5 is enormously helpful in upper bounding the supremum of
the empirical process as indicated by the maximum inequality in the next subsection.
Recall our setup: We have data X1 , . . . , Xn i.i.d. P on X and a class of real valued functions
F defined on X . For any function f : X → R,
n
1X 2
kf k2n := f (Xi )
n
i=1
denotes the L2 (Pn )-seminorm. Further, recall that the empirical process under consideration
√
is the stochastic process indexed by F and defined as Gn (f ) = n(Pn − P )(f ), for f ∈ F.
Definition 4.6 (Uniform entropy bound). A class F of measurable functions with measur-
able envelope F satisfies the uniform entropy bound if and only if J(1, F, F ) < ∞ where
Z δ q
J(δ, F, F ) := sup log N (kF kQ,2 , F ∪ {0}, L2 (Q)) d, δ > 0. (39)
0 Q
Here the supremum if taken over all finitely discrete probability measures Q on X with
kF k2Q,2 := F 2 dQ > 0 and we have added the function f ≡ 0 to F. Finiteness of the
R
48
The uniform entropy integral may seem a formidable object, but we shall later see how
to bound it for concrete classes F. Of course, the class F must be totally bounded in L2 (Q)
to make the integrand in the integral bounded, and then still the integral might diverge
(evaluate to +∞). The integrand is a nonincreasing function of , and finiteness of the
R1
integral is therefore determined by its behaviour near = 0. Because 0 (1/)r d is finite if
r < 1 and infinite if r ≥ 1, convergence of the integral roughly means that, for ↓ 0,
1 2
sup log N (kF kQ,2 , F, L2 (Q)) << ,
Q
where << means smaller in order, or smaller up to an appropriate logarithmic term.
Proof. By symmetrization (see Theorem 3.17) it suffices to bound EkGon kF ; recall that
Gon (f ) = √1n ni=1 εi f (Xi ) where εi ’s are i.i.d Rademacher. Given X1 , . . . , Xn , the process
P
Gon is sub-Gaussian for the L2 (Pn )-seminorm k · kn (by Lemma 3.12), i.e.,
n n
!
X f (Xi ) X g(Xi ) 2 2
P εi √ − εi √ ≥ u X1 , . . . , Xn ≤ 2e−u /(2kf −gkn ) , ∀ f, g ∈ F, ∀ u ≥ 0.
n n
i=1 i=1
2 := sup 2 2
The value σn,2 f ∈F Pn f = supf ∈F kf kn is an upper bound for the squared radius of
F ∪{0} with respect to this norm. We add the function f ≡ 0 to F, so that the symmetrized
process is zero at some parameter. The maximal inequality (38) (with Xt0 = 0) gives
Z σn,2 p
o
Eε kGn kF . log N (, F ∪ {0}, L2 (Pn )) d, (41)
0
where Eε is the expectation with respect to the Rademacher variables, given fixed X1 , . . . , Xn
(note that log N (, F ∪ {0}, L2 (Pn )) = 0 for any > σn,2 ). Making a change of variable and
bounding the random entropy by a supremum we see that the right side is bounded by
Z σn,2 /kF kn p
log N (kF kn , F ∪ {0}, L2 (Pn )) dkF kn ≤ J(θn , F, F )kF kn .
0
Next, by taking the expectation over X1 , . . . , Xn we obtain the first inequality of the theo-
rem.
Since θn ≤ 1, we have that J(θn , F, F ) ≤ J(1, F, F ). Furthermore, by Jensen’s in-
equality applied to the root function, EkF kn ≤ E[n−1 ni=1 F 2 (Xi )] = kF kP,2 . This gives
p P
The above theorem shows that the order of magnitude of kGn kF is not bigger than
J(1, F, F ) times the order of kF kP,2 , which is the order of magnitude of the random variable
|Gn (F )| if the entropy integral is finite.
49
Example 4.8 (Supremum of the empirical process). Recall the setting of Section 3.7. Sup-
pose that F is a class of B-uniformly bounded functions such that
2ν
ν B
N (, F, k · kPn ) ≤ Cν(16e) .
We will see in Section 7 that if F is a function class with finite VC dimension ν, then
the above inequality holds. The goal is to study the expected supremum of the empirical
process over F, i.e., kPn − P kF . From our previous results, we have seen that by exploiting
concentration (see Section 3.7) and symmetrization results (see Theorem 3.17), the study of
kPn − P kF can be reduced to controlling the expectation Eε supf ∈F n1 ni=1 εi f (Xi ) . We
P
consider the random variable Zf := √1n ni=1 εi f (Xi ), and consider the stochastic process
P
{Zf : f ∈ F}. We have seen that by Lemma 3.12, the increment Zf − Zg is sub-Gaussian
with parameter kf − gk2n . Consequently, by Dudley’s entropy integral (see (41)), we have
n
" #
1X 24 B p
Z
Eε sup εi f (Xi ) ≤ √ log N (, F ∪ {0}, k · kPn ) d,
f ∈F n i=1 n 0
Z Bp r
1 0 ν
≤ c0 √ log[1 + c(B/)2ν ] d = c0 B ,
n 0 n
since the integral is finite43 ; here c, c0 and c00 are constants. Thus,
r
ν
EkPn − P kF . B .
n
Example 4.9. Suppose that F = {mθ (·) : θ ∈ Θ} is a parameterized class such that
F = −F, where Θ = B(0; 1) ⊂ Rd is the unit Euclidean ball in Rd . Suppose that the class
of functions is L-Lipschitz with respect to the Euclidean distance on Rd so that for all x,
Observe that the right side of (40) depends on the L2 (P )-norm of the envelope function
F , which may, in some situations, be large compared with the maximum L2 (P )-norm of
functions in F, namely, σ := supf ∈F kf kP,2 . In such a case, the following theorem will be
more useful.
Theorem 4.10 ([van der Vaart and Wellner, 2011], [Chernozhukov et al., 2014]44 ). Sup-
pose that 0 < kF kP,2 < ∞, and let σ 2 > 0 be any positive constant such that supf ∈F P f 2 ≤
43
Note that
Z B p Z 1 p √
Z 1 p
log[1 + c(B/)2ν ] d = B log[1 + c(1/)2ν ] d . B ν log(1/) d.
0 0 0
44
A version of this result was proved in [van der Vaart and Wellner, 2011] under the additional assumption
that the envelope F is bounded; the current version is due to [Chernozhukov et al., 2014]. We will skip the
proof of this result.
50
p
σ 2 ≤ kF k2P,2 . Let δ := σ/kF kP,2 . Define B := E[max1≤i≤n F 2 (Xi )]. Then,
h i BJ 2 (δ, F, F )
E kGn kF . J(δ, F, F )kF kP,2 + √ . (42)
δ2 n
As one might expect, there exists maximal inequalities that work with bracketing numbers
(as opposed to covering numbers). However, the bracketing result is more delicate and
difficult to prove45 . We will just state the result and illustrate an application of the result.
Recall that, apart from the constant 1/2, bracketing numbers are bigger than covering
numbers. The advantage of a bracket is that it gives pointwise control over a function:
l(x) ≤ f (x) ≤ u(x), for every x ∈ X . The maximal inequalities in the preceding subsection
(without bracketing) compensate this lack of pointwise control by considering entropy under
every measure Q, not just the law P of the observations. With bracketing we can obtain
analogous results using only bracketing numbers under P .
The preceding theorem does not take the size of the functions f into account. The
following theorem remedy this, which is however restricted to uniformly bounded classes.
Theorem 4.13 (Lemma 3.4.2 of [van der Vaart and Wellner, 1996]). For any class F of
measurable functions f : X → R such that P f 2 < δ 2 and kf k∞ ≤ M for every f ,
J[ ] (δ, F, L2 (P ))
h i
E kGn kF . J[ ] (δ, F, L2 (P )) 1 + √ M .
δ2 n
Here are some important results about the bracketing numbers of nonparametric classes of
functions. A good account on this is [van der Vaart and Wellner, 1996, Section 2.7].
1. Let C1α (E), for a bounded convex subset E of Rd with nonempty interior, be the set
of functions f : E → R with kf k∞ ≤ 1 and with degree of smoothness α (if α ≤ 1,
Hölder of order α and constant 1, and if α > 1, differentiable up to order bαc, the
45
In this subsection we just state a few results without proofs; see [van der Vaart and Wellner, 1996,
Chapter 2.14] for a more detailed discussion with complete proofs of these results.
51
greatest integer smaller than α, with all the partial derivatives of order α, Hölder of
order α−bαc and constant 1, and with all the partial derivatives bounded by 1. Then,
52
5 Rates of convergence of M -estimators
We want to find the rate δn of the convergence of θ̂n to θ0 in the metric d, i.e., d(θ̂n , θ0 ). A
rate of convergence46 of δn means that
We assume that Mn (θ) gets close to M (θ) as n increases and under this setting want to
know how close θ̂n is to θ0 .
If the metric d is chosen appropriately we may expect that the asymptotic criterion decreases
quadratically47 when θ moves away from θ0 :
for all θ ∈ Θ.
Consider the probability P d(θ̂n , θ0 ) > 2M δn for a large M . We want to understand
P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn = P θ̂n ∈ Sj .
46
Recall that a sequence of random variables {Zn } is said to be bounded in probability or OP (1) if
In other words, Zn = OP (1), if for any given > 0, ∃ T , N > 0 such that P(|Zn | > T ) < for all n ≥ N .
47
To get intuition about this condition assume that if M : Rd → R is twice continuously differentiable and
d(·, ·) is the Euclidean distance, then, for θ in a neighborhood of θ0 ,
1
M (θ) − M (θ0 ) = ∇M (θ0 )> (θ − θ0 ) + (θ − θ0 )> ∇2 M (θ̃0 )(θ − θ0 ) ≤ −ckθ − θ0 k2
2
where ∇M (θ0 ) = 0 (as θ0 is a maximizer of M (·)) and ∇2 M (θ0 ) (the Hessian matrix of M (·)) is assumed
to be negative definite, in which case we can find such a constant c > 0 (corresponding to the smallest
eigenvalue of ∇2 M (θ0 )); here θ̃0 is a point close to θ0 .
53
As θ̂n maximizes Mn (θ), it is obvious that
P θ̂n ∈ Sj ≤ P sup (Mn (θ) − Mn (θ0 )) ≥ 0 .
θ∈Sj
or supθ∈Sj [M (θ) − M (θ0 )] . −22j−2 δn2 . Thus, the event supθ∈Sj [Mn (θ) − Mn (θ0 )] ≥ 0 can
only happen if Mn and M are not too close. Let
We thus get
φ (2j δ )
n n
P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn . √ 2j 2
n2 δn
for every j. As a consequence,
1 X φn (2j δn )
P d(θ̂n , θ0 ) > 2M δn . √ .
n 22j δn2
j>M
The following assumption on φn (·) is usually made to simplify the expression above: there
exists 0 < α < 2 such that
54
then X
P d(θ̂n , θ0 ) > 2M δn ≤ c 2j(α−2) ,
j>M
for a constant c > 0 (which does not depend on n, M ). Let uM denote the right side of the
last display. It follows therefore that, under assumptions (46) and (106), we get
Theorem 5.1. Let (Θ, d) be a semi-metric space. Fix n ≥ 1. Let {Mn (θ) : θ ∈ Θ} be a
stochastic process and {M (θ) : θ ∈ Θ} be a deterministic process. Assume condition (43)
and that the function φn (·) satisfies (45) and (46). Then for every M > 0, we get d(θ̂n , θ0 ) ≤
2M δn with probability at least 1 − uM provided (106) holds. Here uM → 0 as M → ∞.
Suppose now that condition (43) holds only for θ in a neighborhood of θ0 and that (45)
holds only for small u. Then one can prove the following asymptotic result under the
P
additional condition that θ̂n is consistent (i.e., d(θ̂n , θ0 ) → 0).
Theorem 5.2 (Rate theorem). Let Θ be a semi-metric space. Let {Mn (θ) : θ ∈ Θ} be
a stochastic process and {M (θ) : θ ∈ Θ} be a deterministic process. Assume that (43) is
satisfied for every θ in a neighborhood of θ0 . Also, assume that for every n and sufficiently
small u condition (45) holds for some function φn satisfying (46), and that (106) holds. If
the sequence θ̂n satisfies Mn (θ̂n ) ≥ Mn (θ0 ) − OP (δn2 ) and if θ̂n is consistent in estimating
θ0 , then d(θ̂n , θ0 ) = OP (δn ).
Proof. The above result is Theorem 3.2.5 in [van der Vaart and Wellner, 1996] where you
can find its proof. The proof is very similar to the proof of Theorem 5.1. The crucial
observation is to realize that: for any η > 0,
X
P d(θ̂n , θ0 ) > 2M δn ≤ P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn + P 2d(θ̂n , θ0 ) > η .
j>M,2j−1 δn ≤η
The first term can be tackled as before while the second term goes to zero by the consistency
of θ̂n .
Remark 5.1. In the case of i.i.d. data and criterion functions of the form Mn (θ) = Pn [mθ ]
√
and M (θ) = P [mθ ], the centered and scaled process n(Mn − M )(θ) = Gn [mθ ] equals
the empirical process at mθ . Condition (45) involves the suprema of the empirical process
indexed by classes of functions
Thus, we need to find the existence of φn (·) such that EkGn kMu . φn (u).
55
Remark 5.2. Theorem 5.2 gives the correct rate in fair generality, the main problem being
to derive sharp bounds on the modulus of continuity of the empirical process. A simple,
but not necessarily efficient, method is to apply the maximal inequalities (with and without
bracketing). These yield bounds in terms of the uniform entropy integral J(1, Mu , Mu ) or
the bracketing integral J[ ] (kMu kP,2 , Mu , L2 (P )) of the class Mu given by
where Z 1 q
J(1, Mu , Mu ) = sup log N (kMu kQ,2 , Mu , L2 (Q)) d
0 Q
and
E kGn kMu . J[ ] (kMu k, Mu , L2 (P )),
where Z δ q
J[ ] (δ, Mu , L2 (P )) = log N[ ] (, Mu , L2 (P )) d.
0
Here Mu is the envelope function of the class Mu . In this case, we can take φ2n (u) = P [Mu2 ]
and this leads to a rate of convergence δn of at least the solution of
P [Mδ2n ] ∼ nδn4 .
Observe that the rate of convergence in this case is driven by the sizes of the envelope
functions as u ↓ 0, and the size of the classes is important only to guarantee a finite entropy
integral.
Remark 5.3. In genuinely infinite-dimensional situations, this approach could be less use-
ful, as it is intuitively clear that the precise entropy must make a difference for the rate of
convergence. In this situation, the maximal inequalities obtained in Section 4 may be used.
Remark 5.4. For a Euclidean parameter space, the first condition of the theorem is satisfied
if the map θ 7→ P mθ is twice continuously differentiable at the point of maximum θ0 with a
nonsingular second-derivative matrix.
Let X1 , . . . , Xn be i.i.d. random elements on X with a common law P , and let {mθ : θ ∈ Θ}
be a class of real-valued measurable maps. Suppose that Θ ⊂ Rd , and that, for every
θ1 , θ2 ∈ Θ (or just in a neighborhood of θ0 ),
56
for some measurable function F : X → R with P F 2 < ∞. Then the class of functions
Mδ := {mθ − mθ0 : kθ − θ0 k ≤ δ} has envelope function δF and bracketing number (see
Theorem 2.14) satisfying
d
Cδ
N[ ] (2kF kP,2 , Mδ , L2 (P )) ≤ N (, {θ : kθ − θ0 k ≤ δ}, k · k) ≤ ,
where the last inequality follows from Lemma 2.7 coupled with the fact that the -covering
number of δB (for any set B) is the /δ-covering number of B. In view of the maximal
inequality with bracketing (see Theorem 4.12),
Z δkF kP,2 q
EP kGn kMδ . log N[ ] (, Mδ , L2 (P )) d . δkF kP,2 .
0
√ √
Thus, we can take φn (δ) δ, and the inequality φn (δn ) ≤ nδn2 is solved by δn = 1/ n. We
conclude that the rate of convergence of θ̂n is n−1/2 as soon as P (mθ − mθ0 ) ≤ −ckθ − θ0 k2 ,
for every θ ∈ Θ in a neighborhood of θ0 .
Example 5.3 (Least absolute deviation regression). Given i.i.d. random vectors Z1 , . . . , Zn ,
and e1 , . . . , en in Rd and R, respectively, let
Yi = θ0> Zi + ei .
Argue the consistency of the least-absolute-deviation estimator from the convexity of the
map θ 7→ |y − θ> z|. Moreover, show that the map θ 7→ P |Y − θ> Z| is twice differentiable
at θ0 if the distribution of the errors has a positive density at its median (you may need to
assume that Z and e are integrable and E[ZZ > ] is positive definite). Furthermore, derive
the rate of convergence of θ̂n in this situation.
Example 5.4 (Analysis of the shorth). Suppose that X1 , . . . , Xn are i.i.d. P on R with
a differentiable density p with respect to the Lebesgue measure. Let FX be the distribution
57
function of X. Suppose that p is a unimodal (bounded) continuously differentiable symmetric
density with mode θ0 (with p0 (x) > 0 for x < θ0 and p0 (x) < 0 for x > θ0 ). We want to
estimate θ0 .
Exercise (HW2): Let
M(θ) := P mθ = P(|X − θ| ≤ 1) = FX (θ + 1) − FX (θ − 1)
where mθ (x) = 1[θ−1,θ+1] (x). Show that θ0 = argmaxθ∈R M(θ). Thus, θ0 is the center of an
interval of length 2 that contains the largest possible (population) fraction of data points.
We can estimate θ0 by
sup 1[θ−1,θ+1] − 1[θ0 −1,θ0 +1] ≤ 1[θ0 −1−δ,θ0 −1+δ] + 1[θ0 +1−δ,θ0 +1+δ] .
|θ−θ0 |≤δ
√
The L2 (P )-norm of these functions is bounded above by a constant times δ. Thus, the
√
conditions of the rate theorem are satisfied with φn (δ) = c δ for some constant c, leading to
a rate of convergence of n−1/3 . We will show later that n1/3 (θ̂n −θ0 ) converges in distribution
to a non-normal limit as n → ∞.
Example 5.5 (A toy change point problem). Suppose that we have i.i.d. data {Xi =
(Zi , Yi ) : i = 1, . . . , n} where Zi ∼ Unif(0, 1) and
Here, i ’s are the unobserved errors assumed to be i.i.d. N (0, σ 2 ). Further, for simplicity,
we assume that i is independent of Zi . The goal is to estimate the unknown parameter
θ0 ∈ (0, 1). A natural procedure is to consider the least squares estimator:
58
5.2.3 Persistency in high-dimensional regression
We assume that p = nα , α > 0, that is, there could be many more explanatory variables
than observations. We consider sets Bn restricted by the maximal number of non-zero
coefficients of their members, or by their l1 -radius. We study the following asymptotic
question: how ‘large’ may the set Bn be, so that it is still possible to select empirically a
predictor whose risk under P is close to that of the best predictor in the set?
We formulate this problem using a triangular array setup, i.e., we model the observa-
tions Zn1 , . . . , Znn as i.i.d. random vectors in Rpn +1 , having distribution Pn (that depends on
n). In the following we will hide the dependence on n and just write Z 1 , . . . , Z n . We will
consider Bn of the form
Bn,b := {β ∈ Rpn : kβk1 ≤ b}, (50)
where k · k1 denotes the l1 -norm. For any Z := (Y, X1 , . . . , Xp ) ∼ P , we will denote the
expected prediction error by
h p
X i h i
LP (β) := EP (Y − βj Xj )2 = EP (Y − β > X)2
j=1
for some sequence of {bn }n≥1 . We estimate the best linear predictor βn∗ from the sample by
n
1X i
β̂n := arg min LPn (β) = arg min (Y − β > X i )2 ,
β∈Bn,bn β∈Bn,bn n
i=1
59
Theorem 5.6. Suppose that pn = nα , where α > 0. Let
F (Z i ) := max |Xji Xki − EPn (Xji Xki )|, where we take X0i = Y i , for i = 1, . . . , n.
0≤j,k≤p
Suppose that EPn [F 2 (Z 1 )] ≤ M < ∞, for all n. Then for bn = o((n/ log n)1/4 ), β̂n is
persistent relative to Bn,bn .
Thus,
where we have used the fact that LPn (β̂n ) − LPn (βn∗ ) ≤ 0. To simply our notation, let
γ = (−1, β) ∈ Rpn +1 . Then LPn(β) = γ > ΣPn
γ and LPn (β) = γ > ΣPn γ where ΣPn =
EPn (Xj1 Xk1 ) and ΣPn = n1 ni=1 Xji Xki
P
. Thus,
0≤j,k≤pn 0≤j,k≤pn
|LPn (β) − LPn (β)| ≤ |γ > (ΣPn − ΣPn )γ| ≤ kΣPn − ΣPn k∞ kγk21 ,
1 Pn i i
where kΣPn − ΣPn k∞ = sup0≤j,k≤pn n i=1 Xj Xk − EPn (Xj1 Xk1 ) . Therefore,
P LPn (β̂n ) − LPn (βn∗ ) > ≤ P 2 sup |LPn (β) − LPn (β)| >
β∈Bn,bn
≤ P 2(bn + 1)2 kΣPn − ΣPn k∞ >
2(bn + 1)2 h i
≤ E kΣPn − ΣPn k∞ . (51)
Let F = {fj,k : 0 ≤ j, k ≤ pn } where fj,k (z) := xj xk −EPn (Xj1 Xk1 ) and z = (x0 , x1 , . . . , xpn ).
Observe that kΣPn −ΣPn k∞ = kPn −Pn kF . We will now use the following maximal inequality
with bracketing entropy (see Theorem 4.12):
√
Ek n(Pn − P )kF . J[ ] (kFn kPn ,2 , F ∪ {0}, L2 (Pn )),
60
6 Rates of convergence of infinite dimensional parameters
If Θ consists of all functions θ : [0, 1] → R, then obviously the minimum is 0, taken for any
function that interpolates the data points exactly: θ(zi ) = Yi for every i = 1, . . . , n. This
interpolation is typically not a good estimator, but overfits the data: it follows the given
data exactly even though these probably contain error. The interpolation very likely gives
a poor representation of the true regression function.
One way to rectify this problem is to consider minimization over a restricted class
of functions. For example, the minimization can be carried out over all functions with 2
derivatives, which are bounded above by 10 throughout the interval; here the numbers 2
and (particularly) 10 are quite arbitrary. To prevent overfitting the size of the derivatives
should not be too large, but can grow as we obtain more samples.
The method of sieves is an attempt to implement this. Sieves are subsets Θn ⊂ Θ,
typically increasing in n, that can approximate any given function θ0 that is considered
likely to be “true”. Given n observations the maximization is restricted to Θn , and as n
increases this “sieve” is taken larger. In this section we extend the rate theorem in the
previous section to sieved M -estimators, which include maximum likelihood estimators and
least-squares estimators.
We also generalize the notation and other assumptions. In the next theorem the em-
pirical criterion θ 7→ Pn mθ is replaced by a general stochastic process
θ 7→ Mn (θ).
It is then understood that each “estimator” θ̂n is a map defined on the same probability
space as Mn , with values in the index set Θn (which may be arbitrary set) of the process
Mn .
Corresponding to the criterion functions are centering functions θ 7→ Mn (θ) and “true
parameters” θn,0 . These may be the mean functions of the processes Mn and their point of
maximum, but this is not an assumption.
In this generality we also need not assume that Θn is a metric space, but measure the
“discrepancy” or “distance” between θ and the true “value” θn,0 by a map θ 7→ dn (θ, θn,0 )
from Θn to [0, ∞).
Theorem 6.1 (Rate of convergence). For each n, let Mn and Mn be stochastic processes
indexed by a set Θn ∪ {θn,0 }, and let θ 7→ dn (θ, θn,0 ) be an arbitrary map from Θn to [0, ∞).
61
Let δ̃n ≥ 0 and suppose that, for every n and δ > δ̃n ,
for increasing functions φn : [δ̃n , ∞) → R such that δ 7→ φn (δ)/δ α is decreasing for some
0 < α < 2. Let θn ∈ Θn and let δn satisfy
√
φn (δn ) ≤ nδn2 , δn2 ≥ Mn (θn,0 ) − Mn (θn ), δn ≥ δ̃n .
If the sequence θ̂n takes values in Θn and satisfies Mn (θ̂n ) ≥ Mn (θn ) − OP (δn2 ), then
Exercise (HW2): Complete the proof. Hint: The proof is similar to that of the previous
rate theorem. That all entities are now allowed to depend on n asks for notational changes
only, but the possible discrepancy between θn and θn,0 requires some care.
The theorem can be applied with θ̂n and θn,0 equal to the maximizers of θ 7→ Mn (θ)
over a sieve Θn and of θ 7→ Mn (θ) over a full parameter set Θ, respectively. Then (52)
requires that the centering functions fall off quadratically in the “distance” dn (θ, θn,0 ) as θ
moves away from the maximizing value θn,0 . We use δ̃n = 0, and the theorem shows that
the “distance” of θ̂n to θn,0 satisfies
62
6.1 Least squares regression on sieves
where Pn is the empirical measure on the design points z1 , . . . , zn . This criterion function
is not observable but is of simpler character than the sum of squares. Note that the second
term is assumed non-random, the randomness solely residing in the error terms.
Under the assumption that the error variables have mean zero, the mean of Mn (θ) is
Mn (θ) = −Pn (θ − θ0 )2 and can be used as a centering function. It satisfies, for every θ,
Thus, Theorem 6.1 applies with dn (θ, θ0 ) equal to the L2 (Pn )-distance on the set of regres-
sion functions. The modulus of continuity condition takes the form
n
1 X
φn (δ) ≥ E sup √ (θ − θ0 )(zi )i . (55)
Pn (θ−θ0 )2 ≤δ 2 ,θ∈Θn n
i=1
Theorem 6.2. If Y1 , . . . , Yn are independent random variables satisfying (16) for fixed
design points z1 , . . . , zn and errors 1 , . . . , n with mean 0, then the minimizer θ̂n over Θn
of the least squares criterion satisfies
63
Since the design points are non-random, the modulus (55) involves relatively simple
multiplier processes, to which the abstract maximal inequalities may apply directly. In par-
ticular, if the error variables are sub-Gaussian, then the stochastic process {n−1/2 ni=1 (θ −
P
θ0 )(zi )i : θ ∈ Θn } is sub-Gaussian with respect to the L2 (Pn )- semimetric on the set of
regression functions. Thus, using (41), we may choose
Z δ p
φn (δ) = log N (, Θn ∩ {θ : Pn (θ − θ0 )2 ≤ δ 2 }, L2 (Pn )) d.
0
Example 6.5 (Hölder smooth functions). For α > 0, we consider the class of all functions
on a bounded set X ⊂ Rd that possess uniformly bounded partial derivatives up to bαc and
whose highest partial derivates are ‘Lipschitz’ (actually Hölder) of order α − bαc49 .
Let X = [0, 1]d and let Θn = C1α ([0, 1]d ). Then, log N (, Θ, L2 (Pn )) ≤ log N (, Θ, k ·
√
k∞ ) . −d/α . Thus, for α > d/2 this leads to φn (δ) δ 1−d/(2α) and hence, φn (δ) ≤ δn2 n
48
Note that a -cover in the k · k∞ -norm (as in Lemma 2.8) also yields a a cover in the L2 (Pn )-seminorm.
49
i.e., for any vector k = (k1 , . . . , kd ) of d integers the differential operator
∂ k.
Dk = k
,
∂xk1 1 · · · ∂xdd
Pd
where k. = i=1 ki . Then for a function f : X → R, let
Dk f (x) − Dk f (y)
kf kα := max sup Dk f (x) + max sup ,
k. ≤bαc x k. =bαc x,y kx − ykα−bαc
α
where the supremum is taken over all x, y in the interior of X with x 6= y. Let CM (X ) be the set of all continu-
ous functions f : X → R with kf kα ≤ M . The following lemma, proved in [van der Vaart and Wellner, 1996,
α
Chapter 7], bounds the entropy number of the class CM (X ).
Lemma 6.6. Let X be a bounded, convex subset of Rd with nonempty interior. Then there exists a constant
K, depending only on α and d, and a constant K 0 , depending only on α, diam(X ) and d, such that
for every > 0, r ≥ 1, where λ(X 1 ) is the Lebesgue measure of the set {x : kx − X k ≤ 1} and Q is any
probability measure on Rd . Note that k · k∞ denotes the supremum norm.
64
can be solved to obtain the rate of convergence δn & n−α/(2α+d) . The rate relative to the
empirical L2 -norm is bounded above by
For θ0 ∈ C1α ([0, 1]d ) the second term vanishes; the first is known to be the minimax rate
over this set.
Note that this is a convex program in N = n(d + 1) variables, with a quadratic cost function
and a total of n(n − 1) linear constraints. Give the form of a LSE θ̂n .
Suppose now that C = [0, 1]d , and instead of minimizing (56) over the class of all
convex functions, we minimize over the class of all L-Lipschitz convex functions. Find the
rate of convergence of the LSE (over all L-Lipschitz convex functions).
In the standard nonparametric regression model, we assume the noise variables in (54) are
drawn in an i.i.d. manner from the N (0, σ 2 ) distribution, where σ > 0 is the unknown
standard deviation parameter. In this case, we can write i = σwi , where wi ∼ N (0, 1)
are i.i.d. We change our notation slightly and assume that f ∗ : Z → R is the unknown
regression function (i.e., f ∗ ≡ θ0 in (54)).
50
Any convex function f is subdifferentiable at each point in the (relative) interior of its domain C. More
precisely, at any interior point z ∈ C, there exists at least one vector ξ ∈ Rd such that
Any such vector is known as a subgradient, and each point z ∈ C can be associated with the set ∂f (z) of its
subgradients, which is known as the subdifferential of f at z. When f is actually differentiable at z, then the
above inequality holds if and only if ξ = ∇f (z), so that we have ∂f (z) = {∇f (z)}. See standard references
in convex analysis for more on this.
65
Our main result in this section yields a finite sample inequality for the L2 (Pn )-loss of
the constrained LSE
n
1X
fˆn ∈ argmin {Yi − f (zi )}2 ;
f ∈F n i=1
i.e., we study the error kfˆn − f ∗ k2n := n1 i=1 {fˆn (zi ) − f ∗ (zi )}2 . This error is expressed in
Pn
terms of a localized form of Gaussian complexity: it measures the complexity of the function
class F, locally in a neighborhood around the true regression function f ∗ . More precisely,
we define the set:
F ∗ := F − f ∗ = {f − f ∗ : f ∈ F} (57)
corresponding to an f ∗ -shifted version of the original function class F. For a given radius
δ > 0, the local Gaussian complexity around f ∗ at scale δ is given by
n
" #
∗ 1X
Gn (δ; F ) := Ew sup wi g(zi )
g∈F ∗ :kgkn ≤δ n i=1
where the expectation is w.r.t. the variables {wi }ni=1 which are i.i.d. N (0, 1).
A function class H is star-shaped if for any h ∈ H and α ∈ [0, 1], the rescaled function
αh also belongs to H. Recall the basic inequality for nonparametric least squares:
n
1 ˆ σX
kfn − f ∗ k2n ≤ wi {f (zi ) − f ∗ (zi )}. (58)
2 n
i=1
A central object in our analysis is the set of δ > 0 that satisfy the critical inequality
δ2
Gn (δ; F ∗ ) ≤ . (59)
2σ
It can be shown that the star-shaped condition ensures existence of the critical radius51 .
51
Let H be a star-shaped class of functions.
Lemma 6.7. For any star-shaped function class H, the function δ 7→ Gn (δ, H)/δ is nonincreasing on the
interval (0, ∞). Consequently, for any constant c > 0, the inequality Gn (δ, H) ≤ cδ 2 has a smallest positive
solution.
Proof. For a pair 0 < δ ≤ t, it suffices to show that δt Gn (t; H) ≤ Gn (δ; H). Given any function h ∈ H with
khkn ≤ t, we may define the rescaled function h̃ = δt h. By construction, we have kh̃kn ≤ δ; moreover, since
δ ≤ t, the star-shaped assumption on H guarantees that h̃ ∈ H. Thus, write
n n n
1 δX 1 X 1 X
wi h(zi ) = wi h̃(zi ) ≤ sup wi g(zi ) .
n t i=1 n i=1 g∈H:kgkn ≤δ n i=1
Taking the supremum over the set H ∩ {khkn ≤ t} on the left-hand side followed by expectations yields
δ
G (t; H) ≤ Gn (δ; H), which completes the proof of the first part. As Gn (δ; H)/δ is nonincreasing and cδ
t n
is nondecreasing (in δ) on (0, ∞), the inequality Gn (δ, H) ≤ cδ 2 has a smallest positive solution.
66
Theorem 6.8. Suppose that the shifted function class F ∗ is star-shaped, and let δn be any
positive solution to the critical inequality (59). Then for any t ≥ δn , the LSE fˆn satisfies
the bound
ntδn
P kfˆn − f ∗ k2n ≥ 16tδn ≤ e− 2σ2 .
Exercise (HW2): By integrating this tail bound, show that the mean-squared error in the
L2 (Pn )-semi-norm is upper bounded as
σ2
h i
ˆ ∗ 2 2
E kfn − f kn ≤ c δn +
n
for some universal constant c.
Proof. Recall the basic inequality (58). In terms of the shorthand notation ∆ ˆ := fˆn − f ∗ , it
ˆ 2 ≤ σ
can be written as 12 k∆k
P n ˆ ˆ ˆ ∗
n n i=1 wi ∆(zi ). By definition, the error function ∆ = fn − f
belongs to the shifted function class F ∗ . We will need the following lemma.
Lemma 6.9. Let H be an arbitrary star-shaped function class, and let δn > 0 satisfy the
inequality Gn (δ; H) ≤ δ 2 /(2σ). For a given scalar u ≥ δn , define the event
n
( )
σX
A(u) := ∃ g ∈ {h ∈ H : khkn ≥ u} : wi g(zi ) ≥ 2kgkn u . (60)
n
i=1
We will prove the main theorem using the lemma for the time being; we take H = F ∗
√ c
√ ntδ 2
− 2n
and u = tδn for some t ≥ δn , so that we can write P(A ( tδn )) ≥ 1 − e 2σ . Note that
ˆ 2 ≤ 16tδn ) = P k∆k
P(k∆k ˆ 2 ≤ 16tδn , k∆k
ˆ 2 < tδn + P k∆k ˆ 2 ≤ 16tδn , k∆kˆ 2 ≥ tδn
n n n n n
= P k∆kˆ 2n < tδn + P tδn ≤ k∆k ˆ 2n ≤ 16tδn
ˆ 2n < tδn + P tδn ≤ k∆k ˆ 2n ≤ 16tδn , Ac ( tδn )
p
≥ P k∆k
ˆ 2 < tδn + P tδn ≤ k∆k ˆ 2 , Ac ( tδn )
p
= P k∆k n n (61)
2
ntδn
≥ P Ac ( tδn ) ≥ 1 − e− 2σ2 ,
p
ˆ 2 ≥ tδn and
where the only nontrivial step is (61), which we explain next. Note that if k∆kn
√
Ac ( tδn ) holds, then
n
1X ˆ ˆ n tδn .
p
wi ∆(zi ) ≤ 2k∆k
n
i=1
√
Consequently, the basic inequality (58) implies that k∆kˆ 2 ≤ 4k∆k
ˆ n tδn , or equivalently,
n
ˆ 2 ≤ 16tδn . Thus, (61) holds, thereby completing the proof.
k∆k n
67
Proof of Lemma 6.9: Our first step is to reduce the problem to controlling a supremum
over a subset of functions satisfying the upper bound kg̃kn ≤ u. Suppose that there exists
some g ∈ H with kgkn ≥ u such that
n
σX
wi g(zi ) ≥ 2kgkn u. (62)
n
i=1
u u
Defining the function g̃ := kgk n
g, we observe that kg̃kn = u. Since g ∈ H and kgk n
∈ (0, 1],
the star-shaped assumption on H implies that g̃ ∈ H. Consequently, we have shown that if
there exists a function g satisfying inequality (62), which occurs whenever the event A(u)
is true, then there exists a function g̃ ∈ H with kg̃kn = u such that
n n
σX u σX
wi g̃(zi ) = wi g(zi ) ≥ 2u2 .
n kgkn n
i=1 i=1
Since the noise variables wi ∼ N (0, 1) are i.i.d., the variable nσ ni=1 wi g̃(zi ) is zero-mean and
P
Gaussian for each fixed g̃. Therefore, the variable Zn (u) corresponds to the supremum of a
Gaussian process. If we view this supremum as a function of the standard Gaussian vector
(w1 , . . . , wn ), then it can be verified that the associated Lipschitz constant52 is at most
√
σu/ n. Consequently, by the concentration of Lipschitz functions of Gaussian variables53 ,
52
The following lemma illustrates the Lipschitz nature of Gaussian complexity.
Lemma 6.10. Let {Wk }n k=1 be an i.i.d. sequence of N (0, 1) variables. Given a collection of vectors
A ⊂ Rn , define the random variable Z := supa∈A | n
P
k=1 ak W k |. Viewing Z as a function (w1 , . . . , wn ) 7→
f (w1 , . . . , wn ), we can verify that f is Lipschitz (with respect to Euclidean norm) with parameter
supa∈A∪(−A) kak2 .
To see this, let w = (w1 , . . . , wn ), w0 = (w10 , . . . , wn0 ) ∈ Rn . Suppose that there exists a∗ = (a∗1 , . . . , a∗n )
such that f (w) = supa∈A | n
P Pn ∗ Pn ∗
k=1 ak wk | = k=1 ak wk (or k=1 (−ak )wk , which case can also be handled
similarly). Then,
n
X n
X
f (w) − f (w0 ) ≤ a∗k wk − a∗k wk0 ≤ ka∗ k2 kw − w0 k2 ≤ sup kak2 kw − w0 k2 .
k=1 k=1 a∈A∪(−A)
The same argument holds with the roles of w and w0 switched which leads to the desired result:
53
Classical result on the concentration properties of Lipschitz functions of Gaussian variables:
Recall that a function f : Rn → R is L-Lipschitz with respect to the Euclidean norm k · k2 if
The following result guarantees that any such function is sub-Gaussian with parameter at most L.
68
we obtain the tail bound
ns2
P (Zn (u) ≥ E[Zn (u)] + s) ≤ e− 2u2 σ2 ,
Finally, by definition of Zn (u) and Gn (u; H), we have E[Zn (u)] = σGn (u; H). By Lemma 6.7,
the function v 7→ Gn (v; H)/v is nonincreasing, and since u ≥ δn by assumption, we have
Gn (u; H) Gn (δn ; H) δn
σ ≤σ ≤ ≤ δn ,
u δn 2
where the 2nd inequality used the critical condition (59). Putting together the pieces, we
have shown that E[Zn (u)] ≤ uδn . Combined with the tail bound (63), we obtain
nu2
P(Zn (u) ≥ 2u2 ) ≤ P(Zn (u) ≥ uδn + u2 ) ≤ P Zn (u) ≥ E[Zn (u)] + u2 ≤ e− 2σ2 ,
Exercise (HW2): Suppose that F ∗ is star-shaped. Show that for any δ ∈ (0, σ] such that
16 δ2
Z p
√ log N (t, F ∗ ∩ {h : khkn ≤ δ}, k · kn )dt ≤ (64)
n δ2 /(4σ) 4σ
satisfies the critical inequality (59) and hence the conclusion of of Theorem 6.8 holds.
Exercise (HW2) [Linear regression]: Consider the standard linear regression model Yi =
hθ∗ , zi i + wi , where θ∗ ∈ Rd , and fixed xi are d-dimensional covariates. Although this
example can be studied using direct linear algebraic arguments, we will use our general
theory in analysis this model. The usual LSE corresponds to optimizing over the class of
all linear functions
Flin := {fθ = hθ, ·i : θ ∈ Rd }. (65)
Let X ∈ Rn×d denote the design matrix with zi ∈ Rd as its i-th row. Let θ̂ be the LSE.
Show that
kX(θ̂ − θ∗ )k22 rank(X)
kfθ̂ − fθ∗ k2n = . σ2
n n
Theorem 6.11. Let X = (X1 , . . . , Xn ) be a vector of i.i.d. standard Gaussian variables, and let f : Rn → R
be L-Lipschitz with respect to the Euclidean norm. Then the variable f (X) − E[f (X)] is sub-Gaussian with
parameter at most L, and hence
2
− t
P |f (X) − E[f (X)]| ≥ t ≤ 2e 2L2 for all t ≥ 0.
Note that this result is truly remarkable: it guarantees that any L-Lipschitz function of a standard
Gaussian random vector, regardless of the dimension, exhibits concentration like a scalar Gaussian variable
with variance L2 . See Section 13.4 for more details about this result and a proof.
69
with high probability.
∗ = F
Hint: First note that the shifted class Flin lin for any choice of fθ∗ ∈ Flin . Moreover,
∗
Flin is convex and hence star-shaped around any point. To use (64) to find δn so that
∗ ∩ {h : khk ≤ δ}, k · k ). Show
Theorem 6.8 applies in this setting, we have to find N (t, Flin n n
that the required covering number can be bounded by (1 + 2δt )r where r := rank(X).
In our analysis thus far, we have assumed that the true regression function f ∗ belongs to
the function class F over which the constrained LSE is defined. In practice, this assumption
might be violated. In such settings, we expect the performance of the LSE to involve both
the estimation error that arises in Theorem 6.8, and some additional form of approximation
error, arising from the fact that f ∗ ∈
/ F. A natural way in which to measure approximation
error is in terms of the best approximation to f ∗ using functions from F — the error in this
best approximation is given by inf f ∈F kf − f ∗ k2n . Note that this error can only be achieved
by an “oracle” that has direct access to the samples {f ∗ (xi )}ni=1 . For this reason, results
that involve this form of approximation error are referred to as oracle inequalities. With
this setup, we have the following generalization of Theorem 6.8. We define
∂F := {f − g : f, g ∈ F}.
δ2
Gn (δ; ∂F) ≤ . (66)
2σ
ntδn
with probability greater than 1 − e− 2σ2 .
√
Proof. Recall the definition of A(u) in (60). We apply Lemma 6.9 with u = tδn and
√ ntδn
H = ∂F to conclude that P Ac ( tδn ) ≥ 1 − e− 2σ2 . We will assume below that the event
√
Ac ( tδn ) holds.
Given an arbitrary f˜ ∈ F, since f˜ is feasible and fˆ is optimal, we have
n n
1 X 1 X
{Yi − fˆ(zi )}2 ≤ {Yi − f˜(zi )}2 .
2n 2n
i=1 i=1
70
where ∆ˆ := fˆ − f ∗ and ∆
˜ := fˆ − f˜. It remains to analyze the term on the right-hand side
˜ We break our analysis into two cases.
involving ∆.
√
Case 1: First suppose that k∆k ˜ n ≤ tδn . Then,
where in the first inequality above we have used the triangle inequality, and the second
inequality follows from the fact that (a + b)2 ≤ 2(a2 + b2 ) (for a, b ∈ R).
√ √
Case 2: Suppose now that k∆k ˜ n > tδn . Note that ∆ ˜ ∈ ∂F and as the event Ac ( tδn )
holds, we get
n
σX ˜ p
˜ n.
wi ∆(zi ) ≤ 2 tδn k∆k
n
i=1
ntδn
Combining with the basic inequality (68), we find that, with probability at least 1 − e− 2σ2 ,
the squared error is bounded as
ˆ 2 = kf˜ − f ∗ k2 + 4 tδn k∆k ˜ n
p
k∆k n n
ˆ 2 ≤ 2kf˜ − f ∗ k2 + 36tδn .
k∆k n n
√
Combining the pieces we get that, under the event Ac ( tδn ), the above inequality holds for
any f˜ ∈ F. Thus, (67) holds.
Remark 6.1. We can, in fact, have a slightly more general form of (67) where the ‘oracle’
approximation term 2kf˜ − f ∗ k2n can be replaced by 1−γ
1+γ ˜
kf − f ∗ k2n for any γ ∈ (0, 1) (with
appropriate adjustments to the ‘estimation’ error term 36tδn ).
Note that the guarantee (67) is actually a family of bounds, one for each f ∈ F.
When f ∗ ∈ F, then we can set f = f ∗ , so that the bound (67) reduces to asserting that
kfˆ − f ∗ k2n . tδn with high probability, where δn satisfies the critical inequality (66). Thus,
up to constant factors, we recover Theorem 6.8 as a special case of Theorem 6.12. By
integrating the tail bound, we are guaranteed that
h i σ2
E kfˆ − f ∗ k2n . inf kf − f ∗ k2n + δn2 + . (69)
f ∈F n
71
The bound (69) guarantees that the LSE fˆ has prediction error that is at most a constant
multiple of the oracle error, plus a term proportional to δn2 . The term inf f ∈F kf − f ∗ k2n
can be viewed a form of approximation error that decreases as the function class F grows,
whereas the term δn2 is the estimation error that increases as F becomes more complex.
Consider the standard linear model Yi = fθ∗ (zi ) + σwi , where fθ (z) := hθ, zi is an unknown
iid
linear regression function, and wi ∼ N (0, 1) is an i.i.d. noise sequence. Here θ∗ ∈ Rd is
the unknown parameter. For some sparsity index s ∈ {1, 2, . . . , d}, consider the class of all
linear regression functions based on s-sparse vectors — namely, the class
n
X
θ̂ ≡ fθ̂ ∈ arg min {Yi − fθ (zi )i}2 , (70)
fθ ∈Fspar(s)
i=1
corresponding to performing least squares over the set of all regression vectors with at most
s non-zero coefficients. As a corollary of Theorem 6.12, we claim that the L2 (Pn )-error of
this estimator is upper bounded as
s log( ed
s )
kfθ̂ − fθ∗ k2n . inf kfθ̂ − fθ∗ k2n + σ 2 , (71)
θ∈Fspar(s) n
s log( ed )
with high probability; here δn2 = σ 2 n s . Consequently, up to constant factors, its error
is as good as the best s-sparse predictor plus the ‘estimation’ error term δn2 . Note that this
‘estimation’ error term grows linearly with the sparsity s, but only logarithmically in the
dimension d, so that it can be very small even when the dimension is exponentially larger
than the sample size n. In essence, this result guarantees that we pay a relatively small
price for not knowing in advance the best s-sized subset of coefficients to use.
In order to derive this result as a consequence of Theorem 6.12, we need to compute
the local Gaussian complexity Gn (δ; ∂Fspar (s)). Making note of the inclusion ∂Fspar (s) ⊂
Fspar (2s), we have Gn (δ; ∂Fspar (s)) ⊂ Gn (δ; Fspar (2s)). Now let S ⊂ {1, . . . , d} be an
arbitrary 2s-sized subset of indices, and let XS ∈ Rn×2s denote the submatrix with columns
indexed by S. We can then write
n
" #
1X
Gn (δ; Fspar (2s)) = Ew sup wi g(zi ) = Ew max Zn (S) ,
g∈Fspar (2s):kgkn ≤δ n i=1
|S|=2s
where
1 >
Zn (S) := sup w X S θS
kXS θS k2 n
θS ∈R2s : √
n
≤δ
72
as, for g ∈ Fspar (2s), g(z) ≡ gθ (z) = hθ, zi = hθS , zS i, if θ has nonzero entries in the subset
S ⊂ {1, . . . , d}, and kgk2n = n1 ni=1 hθ, zi i2 = n1 kXS θS k22 (here k · k2 denotes the usual
P
Euclidean norm).
Viewed as a function of the standard Gaussian vector w ∈ Rn , the variable Zn (S) is
√
Lipschitz with parameter at most δ/ n (by Lemma 6.10), from which Theorem 6.11 implies
the tail bound
−t2 δ 2 −nt2
P(Zn (S) ≥ E[Zn (S)] + tδ) ≤ e (2δ2 /n) = e 2 , for all t > 0. (72)
We now upper bound the expectation. Consider the singular value decomposition XS =
UDV> , where U ∈ Rn×2s and V ∈ Rd×2s are matrices of left and right singular vectors,
respectively, and D ∈ R2s×2s is a diagonal matrix of the singular values. Noting that
kXS θS k2 = kDV> θS k2 , we arrive at the upper bound
h 1 i δ h i
E[Zn (S)] ≤ E sup √ hU> w, βi ≤ √ E kU> wk2
β∈R2s :kβk2 ≤δ n n
>
where we have taken β = DV √ θS . Since w ∼ N (0, In ) and the matrix U has orthogonal
n √
columns, we have U> w ∼ N (0, I2s ), and therefore E kU> wk2 ≤ 2s. Combining this
upper bound with the earlier tail bound (72), an application of the union bound yields, for
all t > 0, " √ #
δ 2s d −nt2
P max Zn (S) ≥ √ + tδ ≤ e 2 .
|S|=2s n 2s
By integrating this tail bound, we find that
s
d
s
log ed
r
Gn (δ; Fspar (2s)) Ew max|S|=2s Zn (S) s log 2s s
= . + . ,
δ δ n n n
s log( ed )
so that the critical inequality (66) is satisfied for δn2 ' σ 2 n
s
, as claimed.
Although it is natural to take the objective (criterion) function we optimize (i.e., Mn (·) in
our previous notation) as Pn log p, for some technical reasons (explained below) we consider
a slightly modified function.
73
Let pn ∈ Pn . We will discuss the particular choice of pn later. By concavity of the
logarithm function we have
p̂n + pn 1 p̂n 1 pn + pn
Pn log ≥ Pn [ log + log 1] ≥ 0 = Pn log .
2pn 2 pn 2 2pn
Thus, defining the criterion functions mn,p (for p ∈ Pn ) as
p + pn
mn,p := log ,
2pn
we obtain Pn mn,p̂n ≥ Pn [mn,pn ]. We shall apply Theorem 6.1 with Mn (p) := Pn [mn,p ] to
obtain the rate of convergence of p̂n . We note that it is not true that p̂n maximizes the
map p 7→ Mn (p) over Pn . Inspection of the conditions of Theorem 6.1 shows that this is
not required for its application; it suffices that the criterion is bigger at the estimator than
at the value θn , which is presently taken equal to pn .
An immediate question that arises next is what discrepancy (or metric) do we use to
measure the difference between p̂n and p0 ? A natural metric while comparing densities is
the Hellinger distance:
Z 1/2
√ √ 2
h(p, q) = ( p − q) dµ ,
We will need to use a maximal inequality to control the fluctuations of the empirical process
in the class Mn,δ . The following result will be useful in this regard; it uses bracketing with
the “Bernstein norm”54 .
Theorem 6.13. For any class F of measurable functions f : X → R such that kf kP,B < δ
for every f ,
J[ ] (δ, F, k · kP,B )
EkGn kF . J[ ] (δ, F, k · kP,B ) 1 + √ .
δ2 n
Using mn,p rather than the more obvious choice log p is technically more convenient.
First it combines smoothly with the Hellinger distance h(p, q). The key is the following
pair of inequalities, which relate the “Bernstein norm” of the criterion functions mn,p to
the Hellinger distance of the densities p.
h i1/2
54
The “Bernstein norm” is defined as kf kP,B := 2P (e|f | − 1 − |f |) . This “Bernstein norm” turns
out to combine well with minimum contrast estimators (e.g., MLE), where the criterion is a logarithm of
another natural function, such as the log-likelihood. Actually, kf kP,B is not a true norm, but it can be used
in the same way to measure the size of brackets.
74
Lemma 6.14. For nonnegative functions p, q, pn , and p0 (assumed to be a density) such
that p0 /pn ≤ M and p ≤ q, we have
√ √
kmn,p − mn,pn kP0 ,B . M h(p, pn ), kmn,p − mn,q kP0 ,B . M h(p, q),
Proof. Note that e|x| − 1 − |x| ≤ 4(ex/2 − 1)2 , for every x ≥ − log 2. As mn,p ≥ − log 2 and
mn,pn = 0,
√ 2
2 p + pn
kmn,p − mn,pn k2P0 ,B . P0 emn,p /2 − 1 = P0 √ −1 .
2pn
Since p0 /pn ≤ M , the right side is bounded by 2M h2 (p + pn , 2pn ). Combination with the
preceding display gives the first inequality. If p ≤ q, then mn,q − mn,p is nonnegative. By
the same inequality for ex − 1 − x as before,
√ 2
(mn,q −mn,p )/2
2 q + pn
kmn,p − mn,q k2P0 ,B . P0 e − 1 = P0 √ −1 .
p + pn
Since the map p 7→ mn,p is monotone, the second inequality shows that a bracketing
partition of a class of densities p for the Hellinger distance induces a bracketing partition of
the class of criterion functions mn,p for the “Bernstein norm” of essentially the same size.
Thus, we can use a maximal inequality available for the classes of functions Mn,δ with the
entropy bounded by the Hellinger entropy of the class of densities.
Lemma 6.15. Let h denote the Hellinger distance on a class of densities P and set mn,p :=
log[(p + pn )/(2pn )]. If pn and p0 are probability densities with p0 /pn ≤ M pointwise, then
for every probability density p such that h(p, pn ) ≥ ch(pn , p0 ), for some constant c > 0.
Furthermore, for the class of functions Mn,δ := {mn,p − mn,pn : p ∈ Pn , h(p, pn ) ≤ δ},
√ !
√ M J[ ] (δ, Pn , h)
EkGn kMn,δ . M J[ ] (δ, Pn , h) 1 + √ .
δ2 n
√
Proof. Since log x ≤ 2( x − 1) for every x > 0,
!
q q 1/2
P0 log ≤ 2P0 1/2
−1
pn pn
!
1/2 1/2
q 1/2 1/2 p0 + pn
Z
1/2 1/2 1/2
= 2Pn 1/2
− 1 + 2 (q − pn )(p0 − pn ) 1/2
dµ.
pn pn
75
The first term in last display equals −h2 (q, pn ). The second term can be bounded by the
√
expression 2h(q, pn )h(p0 , pn )( M + 1) in view of the assumption on the quotient p0 /pn and
√
the Cauchy-Schwarz inequality. The sum is bounded by −h2 (q, pn )/2 if 2h(p0 , pn )( M +
1) ≤ h(q, pn )/2. The first statement of the theorem follows upon combining this with the
inequalities55 [Exercise (HW2)]
√
h(2p, p + q) ≤ h(p, q) ≤ (1 + 2) h(2p, p + q).
These inequalities are valid for every pair of densities p and q and show that the Hellinger
distance between p and q is equivalent to the Hellinger distance between p and (p + q)/2.
The maximal inequality is now a consequence of Theorem 6.13. Each of the functions
√
in Mn,δ has “Bernstein norm” bounded by a multiple of M δ, while a bracket [p1/2 , q 1/2 ]
of densities of size δ leads to a bracket [mn,p , mn,q ] of “Bernstein norm” of size a multiple
√
of M δ.
It follows that the conditions of Theorem 6.1 are satisfied with the Hellinger distance,
δ̃n = h(pn , p0 ), and
J[ ] (δ, Pn , h)
φn (δ) := J[ ] (δ, Pn , h) 1 + √ ,
δ2 n
where J[ ] (δ, Pn , h) is the Hellinger bracketing integral of the sieve Pn . (Usually this function
φn (·) has the property that φn (δ)/δ α is decreasing for some 0 < α < 2 as required by
√
Theorem 6.1.) The condition φn (δn ) . nδn is equivalent to
√
J[ ] (δn , Pn , h) ≤ nδn2 .
For the unsieved MLE the Hellinger integral is independent of n and any δn solving
the preceding display gives an upper bound on the rate. Under the condition that the
true density p0 can be approximated by a sequence pn ∈ Pn such that p0 /pn is uniformly
bounded, the sieved MLE that maximizes the likelihood over Pn has at least the rate δn
satisfying both
√
J[ ] (δn , Pn , h) ≤ nδn2 and δn & h(pn , p0 ).
Theorem 6.16. Given a random sample X1 , . . . , Xn from a density p0 let p̂n maximize the
likelihood p 7→ ni=1 p(Xi ) over an arbitrary set of densities Pn . Then
Q
h(p̂n , p0 ) = OP (δn )
The lower inequality follows from the concavity of the root function. The upper inequality is valid with
√ √
constant 2 if t ≥ s and with constant (1 + 2) as stated if t ≤ s.
76
where pn can be any sequence with pn ∈ Pn for every n and such that the functions x 7→
p0 (x)/pn (x) are uniformly bounded in x and n.
Example 6.17. Suppose the observations take their values in a compact interval [0, T ] in
the real line and are sampled from a density that is known to be nonincreasing. Conclude
that if P is the set of all nonincreasing probability densities bounded by a constant C, then
1
log N[ ] (, P, h) ≤ log N[ ] (, F, L2 (λ)) . .
√
where F of all non-increasing functions f : [0, T ] → [0, C]. The result follows from the
observations: (i) F has bracketing entropy for the L2 (λ)-norm of the order 1/ for any finite
measure λ on [0, T ], in particular the Lebesgue measure; (ii) if a density p is non-increasing,
√
then so is its root p; (iii) the Hellinger distance on the densities is the L2 (λ)-distance on
the root densities.
√
Thus J[ ] (δ, P, h) . δ, which yields a rate of convergence of at least δn = n−1/3 for
the MLE. The MLE is called the Grenander estimator.
77
7 Vapnik-C̆ervonenkis (VC) classes of sets/functions
Consider our canonical setting: X1 , . . . , Xn are i.i.d. P on some space X . In this section we
study classes of functions F (on X ) that satisfy certain combinatorial restrictions. These
classes at first sight may seem have nothing to do with entropy numbers, but indeed will
be shown to imply bounds on the covering numbers of the type
V
1
sup N (kF kQ,2 , F, L2 (Q)) ≤ K , 0 < < 1, some number V > 0,
Q
where F is the underlying function class with envelope F , and K is a universal constant.
Note that this has direct implications on the uniform entropy of such a class (see Defini-
tion 4.6) is of the order log(1/) and hence the uniform entropy integral converges, and is
of the order δ log(1/δ), as δ ↓ 0.
Classes of (indicator functions of) this type were first studied by Vapnik and C̆ervonenkis
in the 1970s, whence the name VC classes. There are many examples of VC classes, and
more examples can be constructed by operations as unions and sums. Furthermore, one can
combine VC classes in different sorts of ways (thereby, building larger classes of functions)
to ensure that the resulting larger classes also satisfy the uniform entropy condition (though
these larger classes may not necessarily be VC).
We first consider VC classes to sets. To motivate this study let us consider a boolean
class of functions F 56 , i.e., every f ∈ F takes values in {0, 1}. Thus,
F = {1C : C ∈ C},
Definition 7.2. The VC dimension V (C) of the class C is the largest n such that some set
of size n is shattered by C.
78
Let’s try rectangles with horizontal and vertical edges. In order to show that the VC dimension is 4 (in this
case), we need to show two things:
So, yes, there exists an arrangement of 4 points that can be shattered.
1. There exist 4 points that can be shattered.
2. No set of 5 points can be shattered.
It’s clear that capturing just 1 point and all 4 points are both trivial. The figure below shows how we
can capture 2 points and 3 points.
Suppose we have 5 points. A shattering must allow us to select all 5 points and allow us to select 4
points without the 5th.
Clearly, the more refined C is, higher the VC index. The VC dimension is infinite if C
Our minimum enclosing rectangle that allows us to select all five points is defined by only four points
shatters sets of arbitrarily large size. It is immediate from the definition that V (C) ≤ V if
– one for each edge. So, it is clear that the fifth point must lie either on an edge or on the inside of
the rectangle. This prevents us from selecting four points without the 57
and only if no set of size V + 1 fifth.is shattered.
Example 7.4. Let X = R and define the collection of sets C := {(−∞, c] : c ∈ R}. Consider
any two point set1 {x1 , x2 } ⊂ R, and assume without loss of generality, that x1 < x2 . It is
easy to verify that C can pick out the null set {} and the sets {x1 } and {x1 , x2 } but cannot
pick out {x2 }. Hence its VC dimension equals 1.
The collection of all cells (a, b] ∈ R shatters every two-point set but cannot pick out the
subset consisting of the smallest and largest points of any set of three points. Thus its VC
dimension equals 2.
Remark 7.1. With more effort, it can be seen that VC dimensions of the same type of sets
in Rd are d and 2d, respectively. For example, let X = R2 and define
Let us see what happens when n = 4. Draw a figure to see this when the points are not
co-linear. We can show that there exists 4 points such that all the possible subsets of these
four points are picked out by C; see the left panel of Figure 7.1.
Now if we have n = 5 points things change a bit; see the right panel of Figure 7.1.
If we have five points there is always one that stays “in the middle” of all the others, and
thus the complement set cannot be picked out by C. We immediately conclude that the VC
dimension of C is 4.
79
Example 7.5. Suppose that X = [0, 1], and let C be the class of all finite subsets of X .
Let P be the uniform (Lebesgue) distribution on [0, 1]. Clearly V (C) = ∞ and C is not a
VC class. Note that for any possible value of Pn we have Pn (A) = 1 for A = {X1 , . . . , Xn }
while P (A) = 0. Therefore kPn − P kC = 1 for all n, so C is not a Glivenko-Cantelli class
for P .
Exercise (HW3): Show that the class of all closed and convex sets in Rd does not have finite
VC dimension (Hint: Consider a set of n points on the boundary of the unit ball).
Sauer’s lemma58 (also known as Sauer-Shelah-Vapnik-C̆ervonenkis lemma), one of the
fundamental results on VC dimension, states that the number ∆n (C; x1 , . . . , xn ) of subsets
picked out by a VC class C, for n ≥ 1, satisfies:
V (C)
X n
max ∆n (C; x1 , . . . , xn ) ≤ , (73)
x1 ,...,xn j
j=0
where we use the notation nj = 0 if j > n. Observe that for n ≤ V (C), the right-hand side
of the above display equals 2n , i.e., the growth is exponential. However, it is easy to show59
that for n ≥ V (C),
V (C)
ne V (C)
X n
≤ . (74)
j V (C)
j=0
Consequently, the numbers on the left side grow polynomially (of order at most O(nV (C) ))
rather than an exponential number. Intuitively this means that a finite VC index implies
that C has an apparent simplistic structure.
The definition of VC dimension can be easily extended to a function class F in which every
function f is binary-valued, taking the values {0, 1} (say). In this case, we define, for every
58
See [van der Vaart and Wellner, 1996, pages 135–136] for a complete proof of the result.
59
In the following we just give a proof of the right-hand inequality of (74). Note that with Y ∼
Binomial(n, 1/2),
V (C) V (C)
! !
X n n
X n 1 n
= 2 = 2n P(Y ≤ V (C))
j=0
j j=0
j 2
h i
≤ 2 E[rY −V (C) ]
n
for r ≤ 1 as 1{Y − V (C) ≤ 0} ≤ rY −V (C) for r ≤ 1
n " n
! n #
n
n −V (C) 1 r −V (C) n Y
X j n 1 1 r
= 2 r + = r (1 + r) as E[r ] = r = +
2 2 j=0
j 2 2 2
V (C) n
n V (C)
= 1+ by choosing r = V (C)/n
V (C) n
V (C)
n
≤ eV (C) .
V (C)
80
x1 , . . . , x n ∈ X ,
F(x1 , . . . , xn ) := {(f (x1 ), . . . , f (xn )) : f ∈ F}. (75)
As functions in F are Boolean, F(x1 , . . . , xn ) is a subset of {0, 1}n .
Definition 7.6. Given such a function class F we say that the set {x1 , . . . , xn } is shattered
by F if
∆n (F; x1 , . . . , xn ) := |F(x1 , . . . , xn )| = 2n .
The VC dimension V (F) of F is defined as the largest integer n for which there is some
collection x1 , . . . , xn of n points that can be shattered by F.
Exercise (HW3): Consider the class of all two-sided intervals over the real line, i.e., F :=
{1(a,b] (·) : a < b ∈ R}. Show that ∆n (F; X1 , . . . , Xn ) ≤ (n + 1)2 a.s.
Exercise (HW3): For a scalar t ∈ R, consider the function ft (x) := 1{sin(tx) ≥ 0}, x ∈
[−1, 1]. Prove that the function class {ft : [−1, 1] → R : t ∈ R} has infinite VC dimension
(Note that this shows that VC dimension is not equivalent to the number of parameters in
a function class).
81
7.2 Covering number bound for VC classes of sets
Theorem 7.8. There exists a universal constant K such that for any VC class C of sets,
any probability measure Q, any r ≥ 1, and 0 < < 1,
rV (C)
V (C) 1
N (, C, Lr (Q)) ≤ K V (C)(4e) . (77)
Proof. See [van der Vaart and Wellner, 1996, Theorem 2.6.4].
In the following we will prove a slightly weaker version of the above result.
Theorem 7.9. For any VC class C of sets, any r ≥ 1, and 0 < < 1,60
c rc2 V (C)
1
sup N (, C, Lr (Q)) ≤ (78)
Q
Here c1 and c2 are universal positive constants and the supremum is over all probability
measures Q on X .
Proof. Fix 0 < < 1. Let X1 , . . . , Xn be i.i.d. Q. Let m := D(, C, L1 (Q)) be the -packing
number for the collection C in the norm L1 (Q). Thus, there exists C1 , . . . , Cm ∈ C which
satisfy
Q|1Ci − 1Cj | = Q(Ci 4Cj ) > , i 6= j.
Let F := {1C : C ∈ C}. We consider this function class view point as it is sometimes more
natural than working with the collection of sets C. Note that, {fi ≡ 1Ci }m
i=1 is a set of m
-separated functions in F in the L1 (Q)-metric, as, for i 6= j,
Z
< |fi − fj |dQ = Q{fi 6= fj } = Q(Ci 4Cj ) = P[X1 ∈ Ci 4Cj ].
P[fi (X1 ) = fj (X1 )] = 1 − P[fi (X1 ) 6= fj (X1 )] = 1 − P[X1 ∈ Ci 4Cj ] < 1 − ≤ e− .
In words, this means that the probability that fi and fj agree on every X1 , . . . , Xk is at
most e−k . By the union bound, we have
m −k m2 −k
P [(fi (X1 ), . . . , fi (Xk )) = (fj (X1 ), . . . , fj (Xk )) for some 1 ≤ i < j ≤ m] ≤ e ≤ e .
2 2
Recalling that F(x1 , . . . , xk ) = {(f (x1 ), . . . , f (xk )) : f ∈ F}, this immediately gives
m2 −k
P[|F(X1 , . . . , Xk )| ≥ m] ≥ 1 − e .
2
60
Note that N (, C, Lr (Q)) = 1 for all ≥ 1.
82
l m
Thus if we take k := 2 log m ≥ 2 log m , then, P[|F(X1 , . . . , Xk )| ≥ m] ≥ 1/2. Thus
for the choice of k above, there exists a subset {z1 , . . . , zk } of cardinality k such that
|F(z1 , . . . , zk )| ≥ m. We now apply the Sauer-Shelah-VC lemma and deduce that
V (C)
X k
m ≤ |F(z1 , . . . , zk )| ≤ max ∆k (C; x1 , . . . , xk ) ≤ . (79)
x1 ,...,xk j
j=1
so that
c2 V (C)
N (, C, Lr (Q)) = N (r , C, L1 (Q)) ≤ c1 −r .
This completes the proof.
Exercise (HW3): Suppose F is a Boolean class of functions with VC dimension V (F). Then,
for some constant C > 0,
" # r
V (F)
E sup |(Pn − P )f | ≤ C .
f ∈F n
Suppose X1 , . . . , Xn are i.i.d. real-valued observations having a common cdf F . Apply this
result to obtain a high probability upper bound on supx∈R |Fn (x) − F (x)|, i.e., show that
r
C 2 1
sup |Fn (x) − F (x)| ≤ √ + log
x∈R n n α
with probability at least 1 − α (for α ∈ (0, 1)).
83
Example 7.10 (Classification). Recall the problem of classification from Section 1.4 where
we observe i.i.d. data (Z1 , Y1 ), . . . , (Zn , Yn ) ∼ P with Zi ∈ Z and Yi ∈ {0, 1}. Let C be
a class of functions from Z to {0, 1} — the class of classifiers under consideration. The
empirical risk minimizer classifier is ĝn := argming∈C n1 ni=1 I{g(Zi ) 6= Yi }. It is usually
P
of interest to understand the test error of ĝn relative to the best test error in the class C,
i.e., L(ĝn ) − inf g∈C L(g) (here L(g) := P(g(Z) 6= Y ) is the misclassification error of g). If
g ∗ minimizes L(g) over g ∈ C, then we have seen in (5) that
where
F := {(z, y) 7→ I{g(z) 6= y} : g ∈ C}.
Using the bounded differences concentration inequality and the bound given by Example 4.8,
we obtain r r
∗ V (F) 8 1
L(ĝn ) − L(g ) ≤ C + log
n n α
with probability at least 1 − α (for α ∈ (0, 1)). The above display would be useful if we
could upper bound V (F) effectively. We will now show that V (F) ≤ V (C). To see this, it
is enough to argue that if F can shatter (z1 , y1 ), . . . , (zn , yn ), then C can shatter z1 , . . . , zn .
For this, let η1 , . . . , ηn be arbitrary points in {0, 1}. We need to obtain a function g ∈ C
such that g(zi ) = ηi , for i = 1, . . . , n. Define δ1 , . . . , δn by
As F can shatter (z1 , y1 ), . . . , (zn , yn ), there exists f ∈ F, say f (z, y) = I{g(z) 6= y} for
some g ∈ C, with f (zi , yi ) = δi , for i = 1, . . . , n. Then, g(zi ) = ηi 61 , for i = 1, . . . , n. This
proves that C shatters z1 , . . . , zn and completes the proof of the fact that V (F) ≤ V (C).
Thus, we obtain
r r
∗ V (C) 8 1
L(ĝn ) − L(g ) ≤ 2 sup |Ln (g) − L(g)| ≤ C + log
g∈C n n α
with probability at least 1 − α. In fact, this is one of the important results in the VC theory.
Let us start with a motivating application. Recall from Example 1.5 the class of functions
F = {ft : t ∈ R} where ft (x) = |x − t|. In Example 1.5 we needed to show asymptotic
61
First observe that f (zi , yi ) = 0 ⇔ g(zi ) = yi and f (zi , yi ) = 1 ⇔ g(zi ) 6= yi . Suppose that δi = 0, i.e.,
f (zi , yi ) = 0. Then, we must have, from the definition of δi , 0 = ηi I{yi = 0} = (1 − ηi )I{yi = 1}, which
implies that yi = ηi and thus, g(zi ) = yi = ηi . Similarly, suppose that δi = 1, i.e., f (zi , yi ) = 1. Then, we
must have, from the definition of δi , yi 6= ηi , and thus as g(zi ) 6= yi , we have g(zi ) = ηi . Thus, in both cases
we see that g(zi ) = ηi .
84
equicontinuity of a certain process which boiled down to controlling the modulus of conti-
nuity of the empirical process indexed by F as in (48). In particular, we may ask: “Is this
function class ‘nice’ is some sense so that results analogous to (80), and thus (48), hold?”.
The VC subgraph dimension of F is simply the VC dimension of the Boolean class
obtained by taking the indicators of the subgraphs of functions in F. To formally define
this, let us first define the notion of subgraph of a function.
Let V (F) be the VC dimension of the set of subgraphs of functions in F. Just as for
sets, the covering numbers of VC classes of functions grow at a polynomial rate.
Theorem 7.12. For a VC class of functions F with measurable envelope function F and
r ≥ 1, one has for any probability measure Q with kF kQ,r > 0,
rV (F )
2
N (kF kQ,r , F, Lr (Q)) ≤ K V (F) (4e)V (F ) , (80)
Proof. Let C be the set of all subgraphs Cf of functions f in F. Note that Q|f − g| = (Q ×
λ)(1Cf ∆Cg ) = (Q×λ)|1Cf −1Cg | where λ is the Lebesgue measure on R62 . Renormalize Q×λ
to a probability measure on the set {(x, t) : |t| ≤ F (x)} by defining P = (Q × λ)/(2kF kQ,1 ).
Thus, as P (Cf ∆Cg ) = P |1Cf − 1Cg | = 2kF1kQ,1 Q|f − g|,
for a universal constant K, for any probability measure Q with kF kQ,1 > 0. This completes
the proof for r = 1.
For r > 1 note that
Q|f − g|r ≤ Q[|f − g|(2F )r−1 ] = 2r−1 R[|f − g|]Q[F r−1 ], (82)
62
R
Fact: For any two real numbers a and b, we have the identity |a − b| = |I{t < a} − I{t < b}|dt.
85
for the probability measure R with density F r−1 /Q[F r−1 ] with respect to Q. We claim that
To prove this claim let N = N (2(/2)r R[F ], F, L1 (R)) and let f1 , . . . fN an 2(/2)r R[F ]-net
for the class F (under L1 (R)-norm). Therefore, given f ∈ F, there exists k ∈ {1, . . . , N }
such that kf − fk kL1 (R) ≤ 2(/2)r R[F ]. Hence, by (82),
Q|f − fk |r ≤ 2r−1 R[|f − fk |] Q[F r−1 ] ≤ 2r−1 2(/2)r R[F ] Q[F r−1 ] = r Q[F r ],
which implies that kf − fk kLr (Q) ≤ kF kQ,r . Thus, we have obtained a kF kQ,r -cover of F
in the Lr (Q)-norm, which proves the claim. Now, combining (83) with (81) yields
rV (F )
V (F ) 2
N (kF kQ,r , F, Lr (Q)) ≤ KV (F)(4e) ,
The preceding theorem shows that a VC class has a finite uniform entropy integral,
with much to spare. In fact we can show that if F is a class of measurable functions with
envelope F and VC subgraph dimension V (F) then the expected supremum of the empirical
process can be easily controlled63 .
63
Here in a maximal inequality for an important (VC) class of functions we will encounter soon.
Theorem 7.13. Let F be a measurable class of functions with a constant envelope U such that for A > e2
and V ≥ 2 and for every finitely supported probability measure Q
V
A
N (U, F, L2 (Q)) ≤ , 0 ≤ < 1.
Then, for all n, !
n r
X √ AU AU
E (f (Xi ) − P f ) ≤L nσ V log ∨ V U log (84)
i=1
F σ σ
Remark 7.2. If nσ 2 & V log(AU/σ) then the above result shows that
n
X p
E (f (Xi ) − P f ) . nσ 2 V log(AU/σ),
F
i=1
which means that if nσ 2 is not too small, then the ‘price’ one pays for considering the expectation of the
p
supremum of infinitely many sums instead of just one is the factor V log(AU/σ).
Proof. We assume without loss of generality that the class F contains the function 0 and that the functions
f are P -centered. It suffices to prove the inequality in the theorem for U = 1. By our symmetrization result
Pn Pn
(see Theorem 3.17), we have E i=1 f (Xi ) F ≤ 2E i=1 εi f (Xi ) F . By Dudley’s entropy bound (see
Theorem 4.1 and (41)), we have
√
n
√ Z kPn f 2 kF
r
1 X A
√ Eε εi f (Xi ) ≤K V log d,
n i=1
F 0
86
Exercise (HW3): Suppose F is a class of measurable functions with envelope F and VC
subgraph dimension V (F). Then, for some constant C > 0,
" # r
V (F)
E sup |(Pn − P )f | ≤ CkF kP,2 . (85)
f ∈F n
Exercise (HW3): Recall the setting of Example 5.5 which considered a change point problem.
Show that condition (45) (bound on the modulus of continuity of the empirical process)
needed to apply Theorem 5.2, to obtain the rate of convergence of the estimator, holds with
an appropriate function φn (·) (also see Remark 5.2).
where εi ’s are i.i.d. Rademacher variables independent of the variables Xj ’s and Eε indicates expectation
with respect to the εi ’s. It is easy to see that if log(C/c) ≥ 2 then
Z c
C 1/2 C 1/2
log dx ≤ 2c log .
0 x c
Now, since Pn [(f 2 − g 2 )2 ] ≤ 4Pn [(f − g)2 ] (as |f | ≤ 1 by assumption), which implies N (, F 2 , L2 (Pn )) ≤
N (/2, F, L2 (Pn )), we can estimate the last expectation again by the entropy bound (2.14), and get, with
the change of variables /2 = u,
√ "Z √ #
kPn f 2 kF
r
2 2 4K V A
EkPn f kF ≤ σ + √ E log du
n 0 u
Note that when the dominant term is the first, we only pay a logarithmic price for the fact that we are
taking the supremum over a countable or uncountable set. In fact the inequality is sharp in the range of
(σ, n) where the first term dominates.
87
7.4 Examples and Permanence Properties
The results of this subsection give basic methods for generating VC (subgraph) classes.
This is followed by a discussion of methods that allow one to build up new function classes
related to the VC property (from basic classes).
Although it is obvious, it is worth mentioning that a subclass of a VC class is itself
a VC class. The following lemma shows that various operations on VC classes (of sets)
preserve the VC structure.
(i) C c = {C c : C ∈ C} is VC;
(ii) C u D = {C ∩ D : C ∈ C, D ∈ D} is VC;
(iii) C t D = {C ∪ D : C ∈ C, D ∈ D} is VC;
(vi) C × D = {C × D : C ∈ C, D ∈ D} is VC in X × Y.
Proof. (i) The set C c picks out the points of a given set {x1 , . . . , xn } that C does not pick
out. Thus if C shatters a given set of points, so does C c . This proves (i) and shows that the
dimensions of C and C c are equal.
(ii) Fix n ≥ max{V (C), V (D)}. Let x1 , . . . , xn ∈ X be arbitrary. We have to study
the cardinality of the set {C ∩ D ∩ {x1 , . . . , xn } : C ∈ C, D ∈ D}. From the n points
x1 , . . . , xn , C can pick out O(nV (C) ) subsets. From each of these subsets, D can pick out
at most O(nV (D) ) further subsets64 . Thus C u D can pick out O(nV (C)+V (D) ) subsets. For
large n, this is certainly smaller than 2n . This proves (ii).
Next, (iii) follows from a combination of (i) and (ii), since C ∪ D = (C c ∩ Dc )c .
(iv) We will show that if φ(C) shatters a set of points in Y then C should also shatter
a set of points of the same cardinality in X , which will yield the desired result. Suppose
that φ(C) shatters {y1 , . . . , yn } ⊂ Y. Then each yi must be in the range of φ and there exist
x1 , . . . , xn such that φ is a bijection between x1 , . . . , xn and y1 , . . . , yn . We now claim that C
must shatter {x1 , . . . , xn }. To see this, let A := {xi1 , . . . , xik } for 1 ≤ i1 , . . . , ik ≤ n distinct,
with 0 ≤ k ≤ n. As φ(C) shatters {y1 , . . . , yn } ⊂ Y, φ(C) picks out there {yi1 , . . . , yik } and
thus there exists B ≡ φ(C) ∈ φ(C), where C ∈ C, such that B ∩{y1 , . . . , yn } = {yi1 , . . . , yik }.
64 PV (D) k
≤ O(nV (D) ), for any
Note that by Sauer’s lemma, for any 1 ≤ k ≤ n, ∆n (D; xi1 , . . . , xik ) ≤ j=0 j
{xi1 , . . . , xik } ⊂ {x1 , . . . , xn }.
88
As φ is a bijection, this means that C ∩ {x1 , . . . , xn } = {xi1 , . . . , xik }, and thus C picks out
there {xi1 , . . . , xik }.
For (v) the argument is analogous: if ψ −1 (C) shatters z1 , . . . , zn , then all xi := ψ(zi )
must be different and the restriction of ψ to z1 , . . . , zn is a bijection on its range. We now
claim that then C shatters x1 , . . . , xn 65 . Thus, as C has finite VC dimension, so has ψ −1 (C).
For (vi) note first that C × Y and X × D are VC classes66 . Then by (ii) so is their
intersection C × D.
Exercise (HW3) (Open and closed subgraphs): For a set F of measurable functions, define
“closed” and “open” subgraphs by {(x, t) : t ≤ f (x)} and {(x, t) : t < f (x)}, respectively.
Then the collection of “closed” subgraphs has the same VC-dimension as the collection of
“open” subgraphs. Consequently, “closed” and “open” are equivalent in the definition of a
VC-subgraph class.
(Define the sum over an empty set as zero.) There exists such a vector a with at least one
strictly positive coordinate. We will show that the subgraphs of F do not pick out the set
{(xi , ti ) : ai > 0}.
Suppose there exists f ∈ F such that Cf ∩ {(xi , ti )}ni=1 = {(xi , ti ) : ai > 0}. Then,
for i such that ai ≤ 0 we must have (xi , ti ) ∈
/ Cf , i.e., ti ≥ f (xi ). However, then the left
side of the above display would be strictly positive and the right side non-positive for this
65
Exercise (HW3): Show this.
66
Exercise (HW3): Show this.
67
We can find α ∈ Rn such that α> H = 0 and α> (t1 , . . . , tn ) = 0. Such an α exists as the columns of H
and (t1 , . . . , tn ) span at most an m + 1 = n − 1 dimensional subspace of Rn .
89
f . Conclude that the subgraphs of F do not pick out the set {(xi , ti ) : ai > 0}. Hence the
subgraphs of F shatter no set of n points.
P
Example 7.16. Let F be the set of all linear combinations λi fi of a given, finite set of
functions f1 , . . . , fk on X . Then F is a VC class and hence has a finite uniform entropy
integral. Furthermore, the same is true for the class of all sets {f > c} if f ranges over F
and c over R.
Lemma 7.17. The set of all translates {ψ(x − h) : h ∈ R} of a fixed monotone function
ψ : R → R is VC of dimension 1.
Proof. By the monotonicity, the subgraphs are linearly ordered by inclusion: if ψ is nonde-
creasing, then the subgraph of x 7→ ψ(x − h1 ) is contained in the subgraph of x 7→ ψ(x − h2 )
if h1 ≥ h2 . Any collection of sets with this property has VC dimension 1 by Proposi-
tion 7.1868 .
(i) F ∧ G = {f ∧ g : f ∈ F; g ∈ G} is VC subgraph;
(ii) F ∨ G is VC subgraph;
(iv) −F is VC;
(v) F + g := {f + g : f ∈ F} is VC subgraph;
(vi) F · g = {f g : f ∈ F} is VC subgraph;
Proposition 7.18. Suppose that C is a collection of at least two subsets of a set X . Show that V (C) = 1 if
either (a) C is linearly ordered by inclusion, or, (b) any two sets in C are disjoint.
Proof. Consider (a) first. Take points x1 , x2 ∈ X . We need to show that this set of 2 points cannot be
shattered. Suppose that {x1 , x2 } can be shattered. Let C1 pick out {x1 } and C2 pick out {x2 }. By (a), one
of these sets is contained in the other. Suppose C1 ⊂ C2 . But then {x1 } ⊂ C2 and this contradicts the fact
that C2 picks out {x2 }. On the other hand, at least one set of size 1 is shattered.
Next consider (b). As before, suppose that {x1 , x2 } can be shattered. Suppose C picks out {x1 } and D
picks out {x1 , x2 }. But then C and D are no longer disjoint.
90
Proof. The subgraphs of f ∧ g and f ∨ g are the intersection and union of the subgraphs of
f and g, respectively. Hence (i) and (ii) are consequences of Lemma 7.14.
For (iii) note that the sets {f > 0} are one-to-one images of the intersections of the
(open) subgraphs with the set X × {0}, i.e.,
{f > 0} = {x ∈ X : f (x) > 0} = φ {(x, t) ∈ X × R : f (x) > t} ∩ (X × {0}) .
It suffices to show that these sets are VC in (X ∩ {g > 0}) × R, (X ∩ {g < 0}) × R, and
(X ∩ {g = 0}) × R, respectively69 . Now, for instance, {i : (xi , ti ) ∈ C − } is the set of
indices of the points (xi , ti /g(xi )) picked out by the open supergraphs of F. These are the
complements of the closed subgraphs and hence form a VC class.
The subgraphs of the class F ◦ ψ are the inverse images of the subgraphs of functions
in F under the map (z, t) 7→ (ψ(z), t). Thus (v) of Lemma (7.14) implies (vii).
For (viii) suppose that the subgraphs of φ◦F shatter the set of points (x1 , t1 ), . . . , (xn , tn ).
Choose f1 , . . . , fm from F such that the subgraphs of the functions φ◦fj pick out all m = 2n
subsets. For each fixed i, define si = max{fj (xi ) : φ(fj (xi )) ≤ ti }. Then si < fj (xi ) if and
only if ti < φ(fj (xi )), for every i and j, and the subgraphs of f1 , . . . , fm shatter the points
(xi , si ).
Exercise (HW3): The class of all ellipsoids {x ∈ Rd : (x − µ)> A(x − µ) ≤ c}, for µ ranging
over Rd and A ranging over the nonnegative d × d matrices, is VC. [Hint: This follows by a
combination of Lemma 7.19(iii), Lemma 7.14(i) and Lemma 7.15. The third shows that the
set of functions x 7→ (x − µ)> A(x − µ) − c (a vector space with basis functions x 7→ c, x 7→ xi
and x 7→ xi xj ) is VC, and the first and second show that their positive (or negative) sets
are also VC.
69
Exercise (HW3): If X is the union of finitely many disjoint sets Xi , and Ci is a VC class of subsets of
Pm
Xi for each i, i = 1, . . . , m, then tm m
i=1 Ci is a VC class in ∪i=1 Xi of dimension i=1 V (Ci ).
91
Example 7.20. Let ft (x) = |x − t| where t ∈ R. Let F = {ft : t ∈ R}. Then this is
a VC class of functions as ft (x) = (x − t) ∨ [−(x − t)] and we can use Lemma 7.15 with
Lemma 7.19(i) to prove the result.
Lemma 7.21. Fix r ≥ 1. Suppose that F and G are classes of measurable functions with
envelopes F and G respectively. Then, for every 0 < < 1,
(ii) supQ N (2kF ·GkQ,2 , F ·G, L2 (Q)) ≤ supQ N (kF kQ,2 , F, L2 (Q)) supQ N (kGkQ,2 , G, L2 (Q)),
where F · G := {f g : f ∈ F, g ∈ G}, and the supremums are all taken over the
appropriate subsets of all finitely discrete probability measures.
Proof. Let us first prove (i). Find functions f1 , . . . , fn and g1 , . . . , gm such that
N (kF GkQ̃,2 , F · G, L2 (Q̃)) ≤ N (kF kQ∗ ,2 , F, L2 (Q∗ )) ≤ sup N (kF kQ,2 , F, L2 (Q)),
Q
where the supremum is taken over all finitely discrete probability measures Q for which
kF kQ,2 > 0. Since the right hand-side of the above display does not depend on Q̃, and since
Q̃ satisfies kF GkQ̃,2 > 0 but is otherwise arbitrary, we have that
92
where the supremums are taken over all finitely discrete probability measures Q but with
the left side taken over the subset for which kF GkQ,2 > 0 while the right side is taken over
the subset for which kF kQ,2 > 0.
We can similarly show that the uniform entropy numbers for the class G · F with
envelope F G is bounded by the uniform entropy numbers for G with envelope G. Since
|f1 g1 − f2 g2 | ≤ |f1 − f2 |G + |g1 − g2 |F for all f1 , f2 ∈ F and g1 , g2 ∈ G, part (i) in
conjunction with (86) imply that
sup N (2kF GkQ,2 , F · G, L2 (Q)) ≤ sup N (kF kQ,2 , F, L2 (Q)) × sup N (kGkQ,2 , G, L2 (Q)),
Q Q Q
where the supremums are all taken over the appropriate subsets of all finitely discrete
probability measures.
Lemma 7.22. (i) Let ρ(·) be a real-valued right continuous function of bounded variation
on R+ . The covering number of the class F of all functions on Rd of the form
x 7→ ρ(kAx + bk), with A ranging over all m × d matrices and b ∈ Rm satisfies the
bound
N (, F, Lr (Q)) ≤ K1 −V1 , (87)
for some K1 and V1 and for a constant envelope.
(ii) (Exercise (HW3)) Let λ(·) be a real-valued function of bounded variation on R. The
class of all functions on Rd of the form x 7→ λ(α> x + β), with α ranging over Rd and
β ranging over R, satisfies (87) for a constant envelope.
Proof. Let us prove (i). By Lemma 7.21 it is enough to treat the two monotone components
of ρ(·) separately. Assume, without loss of generality, that ρ(·) is bounded and nondecreas-
ing, with ρ(0) = 0. Define ρ−1 (·) as the usual left continuous inverse of ρ on the range
T = (0, sup ρ), i.e.,
ρ−1 (t) := inf{y : ρ(y) ≥ t}.
This definition ensures that
93
(A) Empirical d.f., X = R: Suppose that we consider the classical empirical d.f. of
real-valued random variables. Thus, F = {1(−∞,t] (·) : t ∈ R}. Then, letting Fn and F
denote the empirical and true distribution functions, [Dvoretzky et al., 1956] showed
that
√
P(k n(Fn − F )k∞ ≥ x) ≤ C exp(−2x2 )
(B) Empirical d.f., X = Rd : Now consider the classical empirical d.f. of i.i.d. random
vectors: Thus F = {1(−∞,t] (·) : t ∈ Rd }. Then [Kiefer, 1961] showed that for every
> 0 there exists a C such that
√
P(k n(Fn − F )k∞ ≥ x) ≤ C exp(−(2 − )x2 )
Example 7.23 (Projection pursuit). Projection pursuit (PP) is a type of statistical tech-
nique which involves finding the most “interesting” possible projections in multidimensional
94
data. Often, projections which deviate more from a normal distribution are considered to
be more interesting. The idea of projection pursuit is to locate the projection or projections
from high-dimensional space to low-dimensional space that reveal the most details about the
structure of the data set.
Suppose that X1 , X2 , . . . , Xn are i.i.d. P on Rd . The first step in PP is to estimate
the distribution of the low-dimensional projections. We address this question here. In
particular, we ask: “How large can d be (with n) so that we can still uniformly approximate
all the one-dimensional projections of P ?”
We answer the above questions below. For t ∈ R and γ ∈ S d−1 (S d−1 is the unit sphere
in Rd ), let
Fn (t; γ) = Pn [1(−∞,t] (γ · X)] = Pn (γ · X ≤ t),
First note that the sets in question in this example are half-spaces
Ht,γ := {x ∈ Rd : γ · x ≤ t}.
Note that
Dn = sup sup |Pn (Ht,γ ) − P (Ht,γ )| = kPn − P kH ,
t∈R γ∈S d−1
95
By Talagrand’s exponential bound (88),
√ D Dx2 d+1
P k n(Pn − P )kH ≥ x ≤ exp(−2x2 )
x d+1
√
for all n ≥ 1 and x > 0. Taking x = n yields
2 d+1
D D n
P kPn − P kH ≥ ) ≤ √ exp(−22 n)
n d+1
D
D2 n
= √ exp (d + 1) log − 22 n
n d+1
→ 0 as n → ∞,
if d/n → 071 .
Lemma 7.24 (Radon’s Lemma). Let X0 ⊂ Rd be a set of size d + 2. Then there exist two disjoint subsets
X1 , X2 of X0 such that conv(X1 ) ∩ conv(X2 ) 6= ∅ where conv(Xi ) denotes the convex hull of Xi .
Given Radon’s lemma, the proof of V (H) ≤ d + 1 is easy. We have to show that given any set X0 ∈ Rd
of size d + 2, there is a subset of X0 that cannot be picked out by the half-spaces. Using Radon’s lemma
with X0 yields two disjoint subsets X1 , X2 of X0 such that conv(X1 ) ∩ conv(X2 ) 6= ∅. We now claim that X1
cannot be picked out by using any half-space. Suppose that there is such a half-space H, i.e., H ∩ X0 = X1 .
Note that if a half-space picks out a set of points, then every point in its convex hull is also picked out.
Thus, conv(X1 ) ⊂ H. However, as conv(X1 ) ∩ conv(X2 ) 6= ∅, H ∩ conv(X2 ) 6= ∅ which implies that H also
contains at least one point from X2 , leading to a contradiction.
71
Exercise (HW3): Show this.
96
8 Talagrand’s concentration inequality for the suprema of
the empirical process
The main goal of this chapter is to motivate and formally state (without proof) Talagrand’s
inequality for the suprema of the empirical process. We will also see a few applications of
this result. If we have time, towards the end of the course, I will develop the tools necessary
and prove the main result. To fully appreciate the strength of the main result, we start with
a few important tail bounds for the sum of independent random variables. The following
discussion extends and improves Hoeffding’s inequality (Lemma 3.9).
In most of results in this chapter we only assume that the X -valued random variables
X1 , . . . , Xn are independent; they need not be identically distributed.
8.1 Preliminaries
2/
Pn 2 2/
Pn 2
P (Sn ≥ t) ≤ e−2t i=1 (bi −ai ) , and P (Sn ≤ −t) ≤ e−2t i=1 (bi −ai ) . (89)
A crucial ingredient in the proof of the above result was Lemma 3.8 which stated that for
2 2
a centered X ∈ [a, b] w.p.1 we have E[eλX ] ≤ eλ (b−a) /8 , for λ ≥ 0.
Note that if bi −ai is much larger than the standard deviation σi of Xi then, although the
tail probabilities prescribed by Hoeffding’s inequality for Sn are of the normal type72 , they
correspond to normal variables with the ‘wrong’ variance. The following result incorporates
the standard deviation of the random variable and is inspired by the moment generating
function of Poisson random variables73 .
Theorem 8.1. Let X be a centered random variable such that |X| ≤ c a.s, for some c < ∞,
and E[X 2 ] = τ 2 . Then
2
λX τ λc
E[e ] ≤ exp (e − 1 − λc) , for all λ > 0. (90)
c2
97
Pn
and Sn = i=1 Xi , we have
nσ 2 λc
λSn
E[e ] ≤ exp (e − 1 − λc) , for all λ > 0, (92)
c2
and the same inequality holds for −Sn .
This gives inequality (90). Inequality (92) follows from (90) by using the independence of
the Xi ’s. The above also applies to Yi = −Xi which yields the result for −Sn .
It is standard to derive tail probability bounds for a random variable based on a bound
for its moment generating function. We proceed to implement this idea and obtain four such
bounds, three of them giving rise, respectively, to the Bennett, Prokhorov and Bernstein
classical inequalities for sums of independent random variables and one where the bound on
the tail probability function is inverted. It is convenient to introduce the following notation:
Proof. Observe that by Markov’s inequality and the given bound E[eλZ ], we obtain
P(Z ≥ t) = inf P(eλZ ≥ eλt ) ≤ inf e−λt E[eλZ ] ≤ eν inf λ>0 {φ(−λ)−λt/ν} .
λ>0 λ>0
98
This proves the first inequality in (94). We can also show that (by checking the value of
the corresponding functions at t = 0 and then comparing derivatives)
3t 2t t2
h1 (t) ≥ log 1 + ≥ , for t > 0,
4 3 2 + 2t/3
thus completing the proof of the three inequalities in (94).
To prove (95), we begin by observing that (by Taylor’s theorem) (1−λ/3)(eλ −λ−1) ≤
λ2 /2, λ ≥ 0. Thus, if
νλ2
ϕ(λ) := , λ ∈ [0, 3),
2(1 − λ/3)
then inequality (93) yields
" #
P(Z ≥ t) ≤ inf e−λt E[eλZ ] ≤ exp inf (ϕ(λ) − λt) = exp − sup (λt − ϕ(λ)) = e−γ(t) ,
0≤λ<3 0≤λ<3 0≤λ<3
where we have used the fact that ν(eλ − 1 − λ) ≤ ϕ(λ) and γ(s) := supλ∈[0,3) (λs − ϕ(λ)), for
√
s > 0. Then it can be shown74 that γ −1 (x) = 2νx + x/3. Therefore, letting t = γ −1 (x)
(i.e., x = γ(t)) in the above display yields (95).
n n
E[Xi2 ] λ
2
λZ
Y
(λ/c)Xi
Y nσ λ
E[e ] = E[e ]≤ exp (e − 1 − λ) = exp (e − 1 − λ)
c2 c2
i=1 i=1
Pn
where σ 2 = n1 i=1 E[Xi2 ]. Thus, Z satisfies the hypothesis of Proposition 8.2 with ν :=
nσ 2 /c2 . Therefore we have the following exponential inequalities, which go by the names
of Bennet’s, Prokhorov’s and Bernstein’s75 (in that order).
x ≥ 0,
2
t2
− nσ2 h1 tc2 3t 2tc
P(Sn ≥ t) ≤ e c nσ ≤ exp − log 1 + ≤ exp − (96)
4c 3nσ 2 2nσ 2 + 2ct/3
74
Exercise (HW3): Complete this.
75
It is natural to ask whether Theorem 8.4 extends to unbounded random variables. In fact, Bernstein’s
inequality does hold for random variables Xi with finite exponential moments, i.e., such that E[eλ|Xi | ] < ∞,
for some λ > 0, as shown below.
Lemma 8.3 (Bernstein’s inequality). Let Xi , 1 ≤ i ≤ n, be centered independent random variables such
that, for all k ≥ 2 and all 1 ≤ i ≤ n,
k!
E|Xi |k ≤ σi2 ck−2 ,
2
and set σ 2 := n1 n
P 2 Pn
σ
i=1 i , S n := i=1 Xi . Then,
t2
P(Sn ≥ t) ≤ exp − , for t ≥ 0.
2nσ 2 + 2ct
99
and √
P Sn ≥ 2nσ 2 x + cx/3 ≤ e−x , x ≥ 0.
Bennett’s inequality is the sharpest, but Prokhorov’s and Bernstein’s inequalities are
easier to interpret. Prokhorov’s inequality exhibits two regimes for the tail probabilities of
Sn : if tc/(nσ 2 ) is small, then the logarithm is approximately 2tc/(3nσ 2 ), and the tail
2 2
probability is only slightly larger than e−t /(2nσ ) (which is Gaussian-like), whereas, if
tc/(nσ 2 ) is not small or moderate, then the exponent for the tail probability is of the
order of −[3t/(4c)] log[2tc/(3nσ 2 )] (which is ‘Poisson’-like76 ). Bernstein’s inequality keeps
the Gaussian-like regime for small values of tc/(nσ 2 ) but replaces the Poisson regime by
the larger, hence less precise, exponential regime.
Example 8.5 (Deviation bound with fixed probability). Let us try to shed some light on the
differences between Bernstein’s inequality (i.e., the rightmost side of (96)) and Hoeffding’s
inequality (see (89)). We can first attempt to find the value of t which makes the bound on
the rightmost side of (96) exactly equal to α, i.e., we want to solve the equation
t2
exp − = α.
2(nσ 2 + ct/3)
This leads to the quadratic equation
2tc 1 1
t2 − log − 2nσ 2 log = 0,
3 α α
whose nonnegative solution is given by
s
1 2
r
c2
c 1 1 1 2c 1
t = log + log + 2nσ 2 log ≤ σ 2n log + log .
3 α 9 α α α 3 α
√ √ √
where in the last inequality we used the fact q that a + b ≤ a + b for all a, b ≥ 0. Thus,
1 2c 1
Bernstein’s inequality implies that Sn ≤ σ 2n log α + 3 log α with probability at least 1−α.
Now if X1 , . . . , Xn are i.i.d. with mean zero, variance σ 2 and bounded in absolute value by
c, then this yields r
σ 1 2c 1
X̄n ≤ √ 2 log + log (97)
n α 3n α
with probability
q (w.p.) at least 1 − α; compare this the Hoeffding’s bound which yields
2 1
X̄n ≤ c n log α w.p. at least 1 − α; see (11). Note that if X̄n is normal, then X̄n will be
bounded by the first term in the right hand side of (97) w.p. at least 1 − α. Therefore the
above deviation bound agrees with the normal approximation bound except for the smaller
√
order term (which if of order 1/n; the leading term being of order 1/ n).
76
Note that if X has Poisson distribution with parameter a (i.e., EX = Var(X) = a) then
3t 2t
P(X − a ≥ t) ≤ exp − log 1 + , t ≥ 0.
4 3a
100
Example 8.6 (When Xi ’s are i.i.d. Bernoulli). Suppose that Xi ’s are i.i.d. Bernoulli with
probability of successqp ∈ (0, 1). Then, using (97), we see that using the Bernstein’s inequal-
q
ity yields that X̄n ≤ p(1−p)
n 2 log α1 + 3n
2
log 1 holds w.p. at least 1 − α; compare this with
q α
Hoeffding’s inequality which yields X̄n ≤ n2 log α1 w.p. at least 1 − α. Note that Bernstein’s
inequality is superior here if p(1 − p) is a fairly small. In particular, if Var(X1 ) = n1 (i.e.,
q q
p ≈ n1 ), then the two upper bounds reduce to n1 2 log α1 + 3n2
log α1 and n2 log α1 respectively,
showing that Bernstein’s inequality is so much better in this case.
Talagrand’s concentration inequality for the supremum of the empirical process [Talagrand, 1996a]
is one of the most useful results in modern empirical process theory, and also one of the
deepest results in the theory. This inequality may be thought of as a Bennett, Prokhorov
or Bernstein inequality uniform over an infinite collection of sums of independent random
variables, i.e., for the supremum of the empirical process. As such, it constitutes an expo-
nential inequality of the best possible kind. Below we state Bousquet’s version of the upper
half of Talagrand’s inequality.
−t2
3t 2tU
νn tU
− h1
P(Z ≥ EZ + t) ≤ e U2 νn
≤ exp − log 1 + ≤ exp (98)
4U 3νn 2νn + 2tU/3
and √
P Z ≥ EZ + 2νn x + U x/3 ≤ e−x , x ≥ 0. (99)
77
This is a consequence of the following: consider the class of functions F̃ = {f /U : f ∈ F} (thus any
f˜ ∈ F̃ satisfies kf˜k∞ ≤ 1). Let Z̃ := Z/U , σ̃ 2 := σ 2 /U 2 , and ν̃n := νn /U 2 . Then,
101
Notice the similarity between (98) and the Bennet, Prokhorov and Bernstein inequali-
ties in (96) in Theorem 8.4: in the case when F = {f }, with kf k∞ ≤ c, and E[f (Xi )] = 0, U
becomes c, and νn becomes nσ 2 , and the right-hand side of Talagrand’s inequality becomes
exactly the Bennet, Prokhorov and Bernstein inequalities. Clearly, Talagrand’s inequality
is essentially the best possible exponential bound for the empirical process.
Whereas the Bousquet-Talagrand upper bound for the moment generating function of
the supremum Z of an empirical process for λ ≥ 0 is best possible, there exist quite good
results for λ < 0, but these do not exactly reproduce the classical exponential bounds for
sums of independent random variables when specified to a single function. Here is the
strongest result available in this direction.
Theorem 8.8 ([Klein and Rio, 2005]). Under the same hypothesis and notation as in The-
orem 8.7, we have
Ṽn 3λ
log E[e−λ(Z̃−EZ̃) ] ≤ (e − 1 − 3λ), 0 ≤ λ < 1,
9
−t2
t 2tU
Vn 3tU
− h1
P(Z ≤ EZ − t) ≤ e 9U 2 Vn
≤ exp − log 1 + ≤ exp (100)
4U Vn 2Vn + 2tU
and
P Z ≤ EZ − 2Vn x − U x ≤ e−x ,
p
x ≥ 0. (101)
Remark 8.1. In order to get concrete exponential inequalities from Theorems 8.7 and 8.8,
we need to have good estimates of EZ and supf ∈F E[f 2 (Xi )]. We have already seen many
techniques to control EZ. In particular, (85) gives such a bound.
102
√ √
Note that for x > 2 n, this probability is zero (as Z ≤ 2n). For x > 2c1 , t := nx − EZ ≥
√ √
n(x−c1 ) > 0, and thus we can apply the last inequality in (98). Hence, for 2 n ≥ x > 2c1 ,
√
√ ( nx − EZ)2
P( nkFn − F k∞ ≥ x) ≤ exp − √ √
2(2c1 n + n/4) + 2( nx − EZ)/3
n(x − c1 )2 x2
≤ exp − ≤ exp − ,
c3 n 4c3
√
where we have used (i) for 2 n ≥ x the denominator in the exponential term is upper
√
bounded by 2(2c1 n + n/4) + 4n/3 which is in turn upper bounded by c3 n (for some c3 > 0);
(ii) for x > 2c1 , (x − c1 )2 > x2 /4 (as x − c1 ≥ x − x/2 = x/2). Thus, for some constants
c2 , c3 > 0 that depend only on d, we can show that for all x > 0,
√ 2
P( nkFn − F k∞ ≥ x) ≤ c2 e−x /(4c3 ) ;
note that c2 appears above as we can just upper bound the LHS by some c2 when x ∈ (0, 2c1 ].
Proof. Set Z := k ni=1 (f (Xi ) − P f )kF and set Z̃ := k ni=1 i f (Xi )kF . Note that Z̃
P P
is also the supremum of an empirical process: the variables are X̃i = (εi , Xi ), defined
103
on {−1, 1} × X , and the functions are f˜(, x) := f (x), for f ∈ F. Thus, Talagrand’s
inequalities apply to both Z and Z̃. Then, using the fact
q √ p √ 1
2x(nσ 2 + 2EZ̃) ≤ 2xnσ 2 + 2 xEZ̃ ≤ 2xnσ 2 + x + δEZ̃,
δ
for any δ > 0, the Klein-Rio version of Talagrand’s lower-tail inequality gives
q √ 1+δ
e−x ≥ P Z̃ ≤ EZ̃ − 2x(nσ 2 + 2EZ̃) − x ≥ P Z̃ ≤ (1 − δ)EZ̃ − 2xnσ 2 − x .
δ
Similarly, using (99),
√ 3+δ
2
P Z ≥ (1 + δ)EZ + 2xnσ + x ≤ e−x .
3δ
Recall also that E[Z] ≤ 2E[Z̃]. Then, we have on the intersection of the complement of the
events in the last two inequalities, for δ = 1/5 (say),
6 √ 16 12 √ 16
Z < E[Z] + 2xnσ 2 + x ≤ E[Z̃] + 2xnσ 2 + x
5 3 5 √ 3
12 5 5√ 15 16
< Z̃ + 2xnσ 2 + x + 2xnσ 2 + x
5 4 4 2 3
√ 70
= 3Z̃ + 4 2xnσ 2 + x;
3
i.e., this inequality holds with probability 1 − 2e−x .
Note that different values of δ produce different coefficients in the above theorem.
min P f (102)
f ∈F
in the cases when the distribution P is unknown and has to be estimated based on the data
X1 , . . . , Xn . Since the empirical measure Pn is a natural estimator of P , the true risk can
be estimated by the corresponding empirical risk, and the risk minimization problem has
to be replaced by the empirical risk minimization (ERM):
min Pn f. (103)
f ∈F
104
As is probably clear by now, many important methods of statistical estimation such as
maximum likelihood and more general M -estimation are versions of ERM.
E(f ) ≡ EP (f ) := P f − inf P h.
h∈F
Recall that we have already seen an important application of ERM in the problem of
classification in Example 7.10. Here is another important application.
Suppose that the true regression function is g∗ (z) := E[Y |Z = z], for z ∈ Z. In this case,
the excess risk of f (z, y) = (y − g(z))2 ∈ F (for some g ∈ G) is given by78
Let
fˆ ≡ fˆn ∈ arg min Pn f
f ∈F
be a solution of the ERM problem (103). The function fˆn is used as an approximation
of the solution of the true risk minimization problem (102) and its excess risk EP (fˆn ) is a
natural measure of accuracy of this approximation.
It is worth pointing out that a crucial difference between ERM and classical M -
estimation, as discussed in Sections 5 and 6, is that in the analysis of ERM we do not
(usually) assume that the data generating distribution P belongs to the class of models
considered (e.g., inf h∈F P h need not be 0). Moreover, in M -estimation, typically the focus
is on recovering a parameter of interest in the model (which is expressed as the population
M -estimator) whereas in ERM the focus is mainly on deriving optimal (upper and lower)
bounds for the excess risk EP (fˆn ).
It is of interest to find tight upper bounds on the excess risk79 of fˆ that hold with
a high probability. Such bounds usually depend on certain “geometric” properties of the
78
Exercise (HW3): Show this.
79
Note that we have studied upper bounds on the excess risk in the problem of classification in Exam-
ple 7.10.
105
function class F and on various measures of its “complexity” that determine the accuracy of
approximation of the true risk P f by the empirical risk Pn f in a neighborhood of a proper
size of the minimal set of the true risk.
In the following we describe a rather general approach to derivation of such bounds in
an abstract framework of ERM. We start with some definitions.
F(δ) := {f ∈ F : EP (f ) ≤ δ}.
Suppose, for simplicity, that the infimum of the risk P f is attained at f¯ ∈ F (the
argument can be easily modified if the infimum is not attained in the class). Denote
δ̂ := EP (fˆ).
Previously, we had used the last inequality to upper bound the excess risk in classification;
see Example 7.10. In this section we will use the implicit characterization of δ̂ in (105)
to improve our upper bound. This naturally leads us to the study of the following (local)
measure of empirical approximation:
" #
φn (δ) ≡ φn (F; δ) := E sup |(Pn − P )(f1 − f2 )| . (106)
f1 ,f2 ∈F (δ)
that holds uniformly in δ with a high probability. Then, with the same probability, the
excess risk δ̂ = EP (fˆ) will be bounded80 by the largest solution of the inequality
δ ≤ Un (δ). (108)
80
As δ̂ ≤ supf1 ,f2 ∈F (δ̂) |(Pn − P )(f1 − f2 )| ≤ Un (δ̂), δ̂ satisfies inequality (108).
106
By solving the above inequality one can obtain δn (F) (which satisfies (108)) such that
P(EP (fˆn ) > δn (F)) is small81 . Thus, constructing an upper bound on the excess risk
essentially reduces to solving a fixed point inequality of the type δ ≤ Un (δ).
Let us describe in more detail what we mean by the above intuition. There are many
different ways to construct upper bounds on the sup-norm of empirical processes. A very
general approach is based on Talagrand’s concentration inequalities. For example, if the
functions in F take values in the interval [0, 1], then82 by (99) we have, for t > 0,83
!
1 p t
P sup |(Pn − P )(f1 − f2 )| ≥ φn (δ) + √ 2t (2φn (δ) + D2 (δ)) + ≤ e−t . (109)
f1 ,f2 ∈F (δ) n 3n
√ √ √ √
Then, using the facts: (i) a+b≤ a+ b, and (ii) 2 ab ≤ a/K +Kb, for any a, b, K > 0,
we have
p p p √ t √
2t (D2 (δ) + 2φn (δ)) ≤ 2tD2 (δ) + 2 tφn (δ) ≤ D(δ) 2t + √ + nφn (δ).
n
where r !
t t
Ūn (δ; t) := 2 φn (δ) + D(δ) + . (111)
n n
This observation provides a way to construct a function Un (δ) such that (107) holds with
a high probability “uniformly” in δ — by first defining such a function at a discrete set of
values of δ and then extending it to all values by monotonicity. We will elaborate on this
shortly. Then, by solving the inequality (108) one can construct a bound on EP (fˆn ), which
holds with “high probability” and which is often of correct order of magnitude.
Let us now try to state a formal result in this direction. To simplify notation, assume
that the functions in F take values in [0, 1]. Let {δj }j≥0 be a decreasing sequence of
positive numbers with δ0 = 1 and let {tj }j≥0 be a sequence of positive numbers. Define
Un : (0, ∞) → R, via (111), as
107
Figure 2: Plot of the piecewise constant function Un (δ), for δ ≥ δn (F), along with the value
of kPn − P kF 0 (δj ) , for j = 0, 1, . . ., denoted by the ?’s.
It is easy to check that δn (F) ≤ Un (δn (F)). Obviously, the definitions of Un and δn (F)
depend on the choice of {δj }j≥0 and {tj }j≥0 (we will choose specific values of these quan-
tities later on). We start with the following simple inequality that provides a distribution
dependent upper bound on the excess risk EP (fˆn ).
Proof. It is enough to prove the result for any δ > δn (F); then the right continuity of the
distribution function of EP (fˆn ) would lead to the bound (114) for δ = δn (F).
So, fix δ > δn (F). Letting F 0 (δ) := {f1 − f2 : f1 , f2 ∈ F(δ)}, we know that
Denote n o
En,j := kPn − P kF 0 (δj ) ≤ Un (δj ) .
It follows from Bousquet’s version of Talagrand’s inequality (see (110)) that P(En,j ) ≥
1 − e−tj . Let
En := ∩j:δj ≥δ En,j .
Then
X
P(En ) = 1 − P(Enc ) ≥ 1 − e−tj . (116)
j:δj ≥δ
108
The above holds as: (i) Un (·) is a piecewise constant function (with possible jumps only
at δj ’s), (ii) the function σ 7→ kPn − P kF 0 (σ) is monotonically nondecreasing, and (iii)
kPn − P kF 0 (δj ) ≤ Un (δj ) on En , for j such that δ ≥ δj ; see Figure 8.3.1.
Claim: {δ̂ ≥ δ} ⊂ Enc . We prove the claim using the method of contradiction. Thus,
suppose that the above claim does not hold. Then, the event {δ̂ ≥ δ} ∩ En is non-empty.
On the event {δ̂ ≥ δ} ∩ En we have
where the first inequality follows from (115) and the second inequality holds via (117). This,
in particular, implies that
δ ≤ δ̂ ≤ δn (F),
where the last inequality follows from (118) and the maximality of δn (F) via (113). However
the above display contradicts the assumption that δ > δn (F). Therefore, we must have
{δ̂ ≥ δ} ⊂ Enc .
The claim now implies that P(EP (fˆn ) ≥ δ) = P(δ̂ ≥ δ) ≤ P(Enc ) ≤ −tj ,
P
j:δj ≥δ e
via (116), thereby completing the proof.
Although Theorem 8.15 yields a high probability bound on the excess risk of fˆn (i.e.,
EP (fˆn )), we still need to upper bound δn (F) for the result to be useful. We address this
next. We start with some notation. Given any ψ : (0, ∞) → R, denote by
ψ(s)
ψ † (σ) := sup . (119)
s≥σ s
We will show next that σnt ≥ δn (F) (for a special choice of {δj }j≥0 and {tj }j≥0 ) and
thus, by (8.15) and some algebraic simplification, we will obtain the following result. Given
85
Take σ1 < σ2 . Then
ψ(s) ψ(s)
ψ † (σ1 ) = sup ≥ sup = ψ † (σ2 ).
s≥σ1 s s≥σ2 s
109
a concrete application, our goal would be to find upper bounds on σnt ; see Section 8.3.2
where we illustrate this technique for finding a high probability bound on the excess risk in
bounded regression.
Theorem 8.16 (High probability bound on the excess risk of the ERM). For all t > 0,
P EP (fˆn ) > σnt ≤ Cq e−t . (122)
q
where Cq := q−1 ∨ e.
Proof. Fix t > 0 and let σ > σnt . We will show that P EP (fˆn ) > σ ≤ Cq e−t . Then, by
Since σ > σnt and the function Vnt is strictly decreasing, we have Vnt (σ) < Vnt (σnt ) ≤ 1, and
hence, for all δ > σ,
Un (δ)
≤ Vnt (σ) < 1.
δ
Therefore, δ > δn (F) := sup{s > 0 : 1 ≤ Uns(s) }, and thus, σ ≥ δn (F). Now, from
Theorem 8.15 it follows that
P EP (fˆn ) > σ ≤
X
e−tj ≤ Cq e−t
j:δj ≥σ
110
8.3.2 Excess risk in bounded regression
Recall the regression setting in Example 8.13. Given a function g : Z → T , the quantity
(` • g)(z, y) := `(y, g(z)) is interpreted as the loss suffered when g(z) is used to predict y.
The problem of optimal prediction can be viewed as a risk minimization:
E[`(Y, g(Z))] =: P (` • g)
over g : Z → T . We start with the regression problem with bounded response and with
quadratic loss. To be specific, assume that Y takes values in T = [0, 1] and `(y, u) := (y−u)2 .
Suppose that we are given a class of measurable real-valued functions G on Z. We denote
by F := {` • g : g ∈ G}. Suppose that the true regression function is g∗ (z) := E[Y |Z = z],
for z ∈ Z, which is not assumed to be in G. Recall that the excess risk EP (` • g) in this
problem is given by (104).
In order to apply Theorem 8.16 to find a high probability bound on the excess risk of
the ERM fˆ ≡ ` • ĝ (see (103)) in this problem, which is determined by σnt via (121), we have
to find upper bounds for Vnt (·) (which in turn depends on the functions φ†n and (D2 )† ).
p
which implies
P (` • g1 − ` • g2 )2 ≤ 4kg1 − g2 k2L2 (Π) .
111
where the last step follows from the triangle inequality: kg1 −g2 kL2 (Π) ≤ kg1 −ḡkL2 (Π) +
kg2 − ḡkL2 (Π) . Hence, by (124),
s
q
D2 (δ) √
(D2 )† (σ) = sup ≤ 4 2.
δ≥σ δ
Since `(y, ·) is Lipschitz with constant 2 on the interval [0, 1] one can use the contrac-
tion inequality88 to get
n
1 X
φn (δ) ≤ 8 E sup i (g − ḡ)(Zi ) := ψn (δ).
g∈G:kg−ḡk2 ≤2δ n
L2 (Π) i=1
Theorem 8.18. Let G be a convex class of functions from Z into [0, 1] and let ĝn denotes
the LSE of the regression function, i.e.,
n
1X
ĝn := argmin {Yi − g(Xi )}2 .
g∈G n
i=1
Then, there exist constants K > 0 such that for all t > 0,
] 1 t
2 2
P kĝn − g∗ kL2 (Π) ≥ inf kg − g∗ kL2 (Π) + ψn ( ) + K ≤ Cq e−t , (125)
g∈G 4q n
88
Ledoux-Talagrand contraction inequality (Theorem 4.12 of [Ledoux and Talagrand, 1991]): If ϕi : R →
R satisfies |ϕi (a) − ϕi (b)| ≤ L|a − b| for all a, b ∈ R, then
" n
# " n
#
1X 1X
E sup i ϕi (h(xi )) ≤ L E sup i h(xi ) .
h∈H n h∈H n
i=1 i=1
In the above application we take ϕi (u) = (Yi − u)2 for u ∈ [0, 1].
112
where for any ψ : (0, ∞) → R, ψ ] is defined as89
n o
ψ ] (ε) := inf σ > 0 : ψ † (σ) ≤ ε . (126)
Proof. Note that in this case, by (104), EP (ĝn ) = kĝn − g∗ k2L2 (Π) − inf g∈G kg − g∗ k2L2 (Π) . To
use Theorem 8.16 we need to upper bound the quantity σnt defined in (121). Recall the
definition of Vnt (σ) from (120). By the above observations 1-3, we have
" #
√
r
t † t t
Vn (σ) ≤ 2q ψn (σ) + 4 2 + (127)
nσ nσ
We are only left to show that σnt := inf{σ : Vnt (σ) ≤ 1} ≤ ψn] ( 4q
1
) + K nt , for a sufficiently
large K, which will be implied if we can show that Vnt ψn] ( 2q 1
) + K nt ≤ 1 (since then
ψn] ( 2q
1
) + K nt ∈ {σ : Vnt (σ) ≤ 1} and the result follows from the minimality of σnt ). Note
that, by the nonincreasing nature of each of the terms on the right hand side of (127),
" s #
√
1 t 1 t t
Vnt ψn] ( ) + K ≤ 2q ψn† (ψn] ( )) + 4 2 +
4q n 4q n(Kt/n) n(Kt/n)
" √ #
1 4 2 1
≤ 2q +√ + < 1,
4q K K
√
where K > 0 is chosen so that 4√ 2
K
+ 1
K < 1
2 (note that ψn† (ψn] ( 4q
1
)) ≤ 1
4q ).
Example 8.19 (Finite dimensional classes). Suppose that L ⊂ L2 (Π) is a finite dimensional
linear space with dim(L) = d < ∞. and let G ⊂ L be a convex class of functions taking
values in a bounded interval (for simplicity, [0, 1]). We would like to show that
d
t
2 2
P kĝn − g∗ kL2 (Π) ≥ inf kg − g∗ kL2 (Π) + +K ≤ Ce−t (128)
g∈G n n
with some constant C, K > 0.
It can be shown that90 that r
dδ
ψn (δ) ≤ c
n
with some constant c > 0. Hence,
r r
ψn (δ) d d
ψn† (σ) = sup ≤ sup c =c .
δ≥σ δ δ≥σ δn σn
89
Note that ψ ] can be thought of as the generalized inverse of ψ † . Thus, under the assumption that ψ † is
right-continuous, ψ † (σ) ≤ ε if and only if σ ≥ ψ ] (ε) (Exercise (HW3): Show this). Further note that with
this notation σnt = Vnt,] (1).
90
" Exercise (HW3): Suppose that# L is a finite dimensional subspace of L2 (P ) with dim(L) = d. Then
Pn q
1 d
E sup n i=1 i f (Xi ) ≤ r n
.
f ∈L:kf kL (P ) ≤r
2
113
As, ψn† (σ) ≤ ε implies σ ≥ ψn] (ε), taking σ := nd and q ≥ max{1, 1/(4c)}, we see that
s
d d 1 1 d
ψn† ≤c d ≤ ⇒ ψn] ( ) ≤ ,
n nn
4q 4q n
and Theorem 8.18 then implies (128); here C ≡ Cq is taken as in Theorem 8.16 and K as
in Theorem 8.18.
Exercise (HW3): Consider the setting of Example 8.19. Instead of using the refined analysis
using (105) (and Talagrand’s concentration inequality) as illustrated in this section, use the
bounded differences inequality to get a crude upper bound on the excess risk of the ERM
in this problem. Compare the obtained high probability bound to (128).
Exercise (HW3)[VC-subgraph classes]: Suppose that G is a convex VC-subgraph class of
functions g : Z → [0, 1] of VC-dimension V . Then, show that, the function ψn (δ) can be
upper bounded by: "r #
Vδ 1 V 1
ψn (δ) ≤ c log ∨ log .
n δ n δ
2
Show that ψn] (ε) ≤ nε
cV nε
2 log V . Finally, use Theorem 8.18 to obtain a high probability
O(n−1/(1+ρ) ). Finally, use Theorem 8.18 to obtain a high probability bound analogous
to (125).
114
where
h y − X i Z y − x Z
−d −d
ph (y) := h P K =h K p(x)dx = K(u)p(y − hu)du
h Rd h Rd
is a smoothed version of p. Convergence to zero of the second term can be argued based
only on smoothness assumptions on p: if p is uniformly continuous, then it is easily seen
that
sup sup |ph (y) − p(y)| → 0
h≤bn y∈Rd
for any sequence bn → 0. On the other hand, the first term is just
h y − X i
h−d (Pn − P ) K . (129)
h
For a fixed y ∈ Rd , it is easy to study the properties of the above display
using the CLT
−d y−Xi
as we are dealing with a sum of independent random variables h K h , i = 1, . . . , n.
However, it is natural to ask whether the KDE pbn,hn converges to p uniformly (a.s.) for a
sequence of bandwidths hn → 0 and, if so, what is the rate of convergence in that case? We
investigate this question using tools from empirical processes.
The KDE pbn,h (·) is indexed by the bandwidth h, and it is natural to consider pbn,h as
a process indexed by both y ∈ Rd and h > 0. This leads to studying the class of functions
y−x
d
F := x 7→ K : y ∈ R ,h > 0 .
h
It is fairly easy to give conditions on the kernel K so that the class F defined above satisfies
this does not suffice in view of the hfactor −d in (129). In fact, we need a rate of
ofih
a.s.
convergence for suph>0,y∈Rd (Pn − P ) K y−X h → 0. The following theorem gives such a
result92 .
91
For instance, it is satisfied for general d ≥ 1 whenever K(x) = φ(q(x)), with q(x) being a polynomial in
d variables and φ being a real-valued right continuous function of bounded variation.
92
To study variable bandwidth kernel estimators [Einmahl and Mason, 2005] derived the following result,
which can be proved with some extra effort using ideas from the proof of Theorem 8.21.
115
Theorem 8.21. Suppose that hn ↓ 0, nhdn /| log hn | → ∞, log log n/| log hn | → ∞ and
hdn ≤ čhd2n for some č > 0. Then
p
nhdn kb
pn,hn (·) − phn (·)k∞
lim sup p = C a.s.
n→∞ log h−1
n
Lemma 8.22 ([de la Peña and Giné, 1999, Theorem 1.1.5]). If Xi , i ∈ N, are i.i.d X -valued
random variables and F a class of measurable functions, then
j n
!
X X t
P max (f (Xi ) − P f ) > t ≤ 9P (f (Xi ) − P f ) > .
1≤j≤n F F 30
i=1 i=1
y − ·
d
Fk := K : hnk ≤ h ≤ hnk−1 , y ∈ R .
h
As
h y − X i Z y − x Z
E K2 = K2 p(x)dx = hd K 2 (u)p(y − uh)du ≤ hd kpk∞ kKk22 ,
h Rd h Rd
√
r
√ AUk
σk < Uk /2 and nk σk ≥ V Uk log . (check!) (132)
σk
116
Pnk
Letting Zk := E i=1 (f (Xi ) − Pf) , we can bound E[Zk ] by using Theorem 7.13
Fk
(see (84)), for k ≥ k0 , to obtain
nk
X p
E[Zk ] = E (f (Xi ) − P f ) ≤ Lσk nk log(AUk /σk )
Fk
i=1
for a constant c̃ > 1 and k ≥ k0 . Choosing x = c log(AUk /σk ) in (99), for some c > 0, we
see that
√ p √
E[Zk ] + 2νk x + Uk x/3 ≤ σk nk log(AUk /σk )(L + 2cc̃) + cUk log(AUk /σk )/3
p
≤ Cσk nk log(AUk /σk ),
for some constant C > 0, where we have again used (132). Therefore, by Theorem 8.7,
√
P Zk ≥ Cσk nk log(AUk /σk ) ≤ P(Zk ≥ E[Zk ] + 2νk x + Uk x/3) ≤ e−c log(AUk /σk ) .
p
Notice that p
30Cσk nk log(AUk /σk )
q >λ (check!)
nk−1 hdnk log h−1
nk
for some λ > 0, not depending on k. Therefore, choosing this λ the probability on the right
hand-side of (131) can be expressed as
Zk λ
> ≤ P Zk ≥ Cσk nk log(AUk /σk ) ≤ e−c log(AUk /σk ) .
p
P q
n hd log h−1 30
k−1 nk nk
Since
∞
X ∞
X ∞
X
e−c log(AUk /σk ) = c1 hcd/2
nk−1 ≤ c̃1 (č)−cd/2 < ∞,
k=k0 k=k0 k=k0
lemma we can see that P(Y > λ) = 0. This yields the desired result using the zero-one
law93 .
93
For a fixed λ ≥ 0, define the event A := {lim supn→∞ Yn > λ}. As this is a tail event, by the
zero-one law it has probability 0 or 1. We thus have that for each λ, P(Y > λ) ∈ {0, 1}. Defining
c := sup{λ : P(Y > λ) = 1}, we get that Y = c a.s. Note that c < ∞ as there exists λ > 0 such that
P(Y > λ) = 0, by the proof of Theorem 8.21.
117
9 Review of weak convergence in complete separable metric
spaces
In this chapter (and the next few) we will study weak convergence of stochastic processes.
Suppose that U1 , . . . , Un are i.i.d. Uniform(0, 1) random variables (i.e., U1 has distribution
function G(x) = x, for x ∈ [0, 1]) and let Gn denote the empirical distribution function of
U1 , . . . , Un . In this chapter, we will try to make sense of the (informal) statement:
√ d
n(Gn − G) → G, as n → ∞,
Before we study weak convergence of stochastic processes let us briefly recall the notion of
weak convergence of random vectors.
Let (Ω, A, P) be a probability space and let X be a random vector (or a measurable
map) in Rk (k ≥ 1) defined on the probability space (Ω, A, P)94 .
Definition 9.1. Let {Xn }n≥1 be a sequence of random vectors in Rk . Let Xn have distri-
bution function95 Fn and let Pn denote the distribution96 of Xn , i.e., Xn ∼ Pn . We say
that {Xn } converges in distribution (or weakly or in law) to a random vector X ∼ P (or to
the distribution P) with distribution function F if
    Fn(x) → F(x)   as n → ∞,
for every x ∈ R^k at which the limiting distribution function F(·) is continuous. This is usually denoted by Xn →d X or Pn →d P.
As the name suggests, weak convergence only depends on the induced laws of the
vectors and not on the probability spaces on which they are defined.
The whole point of weak convergence is to approximate the probabilities of events
related to Xn (i.e., Pn ), for n large, by that of the limiting random vector X (i.e., P ).
94 More formally, X : (Ω, A) → (R^k, B_k), where B_k is the Borel σ-field in R^k, is a map such that for any B ∈ B_k, X^{−1}(B) := {ω ∈ Ω : X(ω) ∈ B} ∈ A.
95 Thus, Fn(x) = P(Xn ≤ x), for all x ∈ R^k.
96 Thus, Pn : B_k → [0, 1] is a map such that Pn(B) := P(Xn ∈ B) = P(Xn^{−1}(B)) for all B ∈ B_k.
The multivariate central limit theorem (CLT) is a classic application of convergence in
distribution and it highlights the usefulness of such a concept.
The following theorem, referred to as the Portmanteau result, gives a number of equiv-
alent descriptions of weak convergence (most of these characterizations are only useful in
proofs). Indeed, the characterization (v) (of the following result) makes the intuitive notion
of convergence in distribution rigorous.
Theorem 9.2 (Portmanteau theorem). The following statements are equivalent:
(i) Xn →d X;
(ii) E[f(Xn)] → E[f(X)] for every function f : R^k → R that is continuous and bounded;
(iii) lim inf_{n→∞} Pn(G) ≥ P(G) for every open set G;
(iv) lim sup_{n→∞} Pn(F) ≤ P(F) for every closed set F;
(v) Pn(A) → P(A) for all P-continuity sets A (i.e., P(∂A) = 0, where ∂A denotes the boundary of A).
9.2 Weak convergence in metric spaces and the continuous mapping the-
orem
As before, let (Ω, A, P) be a probability space and let (T, T) be a measurable (metric) space. X : (Ω, A, P) → (T, T) is a random element if X^{−1}(B) ∈ A for all B ∈ T. Quite
often, when we talk about a random element X, we take T = B(T ), the Borel σ-field (on
the set T ), the smallest σ-field containing all open sets.
Question: How do we define the notion of weak convergence in this setup?
Although it is not straightforward to define weak convergence as in Definition 9.1
(for random vectors through their distribution functions), the equivalent definition (ii) in
Theorem 9.2 can be extended easily.
Let Cb (T ; T ) denote the set of all bounded, continuous, T /B(R)-measurable, real-valued
functions on T .
Definition 9.4. A sequence {Xn}n≥1 of random elements of T (defined on (Ω, A, P)) converges in distribution to a random element X, written Xn →d X, if and only if
    E[f(Xn)] → E[f(X)]  for every f ∈ Cb(T; T).   (134)
More generally, with Xn ∼ Pn and X ∼ P, we say that Pn converges weakly to P if
    Pn f → P f  for every f ∈ Cb(T; T).   (135)
Note that definition (135) of weak convergence is slightly more general than (134) as
it allows the Xn ’s to be defined on different probability spaces (and we let Xn ∼ Pn , for
n ≥ 1, and X ∼ P ). As in the previous subsection, we have the following equivalent
characterizations of weak convergence.
Suppose that T is a metric space with metric d(·, ·) and let T be the Borel σ-field of T
generated by the metric d(·, ·). Then, we have the following equivalent characterizations of
weak convergence.
(i) Xn →d X or Pn →d P;
(iv) Pn (A) → P (A) for all P -continuity sets A (i.e., P (∂A) = 0, where ∂A denotes the
boundary of A).
Proof. See Theorem 2.1 of [Billingsley, 1999] (or 3.25 of [Kallenberg, 2002]).
Such an abstract definition of weak convergence can only be useful if we have a contin-
uous mapping theorem (as before). The following result shows that essentially the “vanilla’
version of the continuous mapping theorem is true, and is a trivial consequence of the
definition of weak convergence.
Theorem 9.6 (Continuous mapping theorem). Let (T, B(T)) be a measurable (metric) space. Suppose that {Xn}n≥1 is a sequence of random elements of T converging in distribution to a random element X, i.e., Xn →d X. Let H : (T, B(T)) → (S, B(S)) be a continuous function, where (S, B(S)) is another measurable (metric) space. Then H(Xn) →d H(X) as random elements in S.
Proof. Let f ∈ Cb (S, B(S)). Then f ◦ H ∈ Cb (T, B(T )) and thus from the definition of weak
convergence of Xn ,
E[(f ◦ H)(Xn )] → E[(f ◦ H)(X)].
As the above convergence holds for all f ∈ Cb(S, B(S)), again using the definition of weak convergence, we can say that H(Xn) →d H(X).
Theorem 9.7. Let H : (T, B(T)) → (S, B(S)) be a measurable mapping. Write C for the set of points in T at which H is continuous. If Xn →d X for some X with P(X ∈ C) = 1, then H(Xn) →d H(X).
Proof. Let Xn ∼ Pn and X ∼ P . Fix ψ ∈ Cb (S; B(S)). Then the measurable, real-valued,
bounded function h := ψ ◦ H is continuous at all points in C. We will have to show that
Pn h → P h as n → ∞.
Consider any increasing sequence {fi } of bounded, continuous functions for which
fi ≤ h everywhere and fi ↑ h at each point of C. Accept for the moment that such a
sequence exists. Then, weak convergence of Pn to P implies that Pn fi → P fi, for each fixed i. Thus,
    lim inf_{n→∞} Pn h ≥ lim inf_{n→∞} Pn fi = P fi   for each fixed fi.
Letting i → ∞ and using the monotone convergence theorem (recall that fi ↑ h on C and P(X ∈ C) = 1) gives lim inf_{n→∞} Pn h ≥ P h. By substituting −h for h and going through the above argument again gives the desired result.
It only remains to show that we can construct such a sequence {fi }. These functions
must be chosen from the family
F = {f ∈ Cb (T ; B(T )) : f ≤ h}.
If we can find a countable subfamily of F, say {g1 , g2 , . . .}, whose pointwise supremum
equals h at each point of C, then setting fi := max{g1 , . . . , gi } will do the trick.
Without loss of generality suppose that h > 0 (a constant could be added to h to
achieve this). Let d(·, ·) be the metric in T . For each subset A ∈ B(T ), define the distance
function d(·, A) by
d(x, A) = inf{d(x, y) : y ∈ A}.
Note that d(·, A) is a continuous function (in fact, d(·, A) is uniformly continuous. Exercise
(HW4): Show this.), for each fixed A. For a positive integer m and a positive rational r
define
fm,r (x) := r ∧ [m d(x, {h ≤ r})].
Each fm,r is bounded and continuous; it is at most r if h(x) > r; it takes the value zero if
h(x) ≤ r. Thus, fm,r ∈ F.
Given a point x ∈ C and an ε > 0, choose a positive rational number r with h(x) − ε < r < h(x). Continuity of h at x keeps its value greater than r in some neighborhood of x. Consequently, d(x, {h ≤ r}) > 0 and fm,r(x) = r > h(x) − ε for all m large enough. Take {fm,r : m ∈ N+, r ∈ Q+} as the countable set {g1, g2, . . .}.
Till now we have restricted our attention to metric spaces (T, T ) endowed with their Borel
σ-field, i.e., T = B(T ). In this subsection we extend the continuous mapping theorem when
T need not equal B(T ).
Formally, we ask the following question. Suppose that Xn →d X as T-measurable random elements of a metric space T, and let H be a T/S-measurable map from T into another metric space S. If H is continuous at each point of a T-measurable set C with P(X ∈ C) = 1, does it follow that H(Xn) →d H(X), i.e., does Ef(H(Xn)) converge to Ef(H(X)) for every f ∈ Cb(S; S)?
We will now assume that T is a sub-σ-field of B(T ). Define
F = {f ∈ Cb (T ; T ) : f ≤ h}.
Last time we constructed the countable subfamily from functions of the form fm,r(x) := r ∧ [m d(x, {h ≤ r})]. Continuity of fm,r suffices for Borel measurability, but it needn't imply T-measurability. We must find a substitute for these functions.
Definition 9.8. Call a point x ∈ T completely regular (with respect to the metric d and the
σ-field T ) if to each neighborhood V of x there exists a uniformly continuous, T -measurable
function g with g(x) = 1 and g ≤ 1V .
Theorem 9.9 (Continuous mapping theorem). Let (T, T) be a measurable (metric) space. Suppose that {Xn}n≥1 is a sequence of random elements of T converging in distribution to a random element X, i.e., Xn →d X. Let H : (T, T) → (S, S) be a map that is continuous at each point of some separable, T-measurable set C of completely regular points with P(X ∈ C) = 1. Then H(Xn) →d H(X) as random elements in (S, S).
Idea of proof: (Step 1) Use the fact that each x ∈ C is completely regular to approximate
h(x) by a supremum of a sub-class of Cb (T ; T ) that is uniformly continuous. (Step 2) Then
use the separability of C to find a sequence of uniformly continuous functions in Cb (T ; T )
that increase to h for all x ∈ C. (Step 3) Use the same technique as in the initial part of
the proof of Theorem 9.7 to complete the proof.
Proof. Let Xn ∼ Pn and X ∼ P . Fix ψ ∈ Cb (S; B(S)). Then the T -measurable, real-
valued, bounded function h := ψ ◦ H is continuous at all points in C. We will have to show
that Pn h → P h.
Without loss of generality suppose that h > 0 (a constant could be added to h to
achieve this). Define
F = {f ∈ Cb (T ; T ) : f ≤ h; f is uniformly continuous}.
We first show that
    h(x) = sup{f(x) : f ∈ F},   for every x ∈ C.   (136)
To see this suppose that h is continuous at a point x that is completely regular. Choose r
with 0 < r < h(x) (r should ideally be close to h(x)). By continuity, there exists a δ > 0
such that h(y) > r on the closed ball B(x, δ). As x is completely regular, there exists a
uniformly continuous, T -measurable g such that g(x) = 1 and g ≤ 1B(x,δ) . Now, look at
the function f = rg. Observe that f ∈ F and f (x) = r. Thus (136) holds for all x ∈ C.
Separability of C will enable us to extract a suitable countable subfamily from F. Let
C0 be a countable dense subset of C. Let {g1 , g2 , . . .} be the set of all those functions of the
form r1B , with r rational, B a closed ball of rational radius centered at a point of C0 , and
r1B ≤ f for at least one f in F. For each gi choose one f satisfying the inequality gi ≤ f .
Denote it by fi. This picks out the required countable subfamily:
    sup_i fi(x) = h(x),   for every x ∈ C.   (137)
To see this, consider any point z ∈ C and any f ∈ F. For each rational number r such that f(z) > r > 0 choose a rational ε > 0 for which f > r at all points within a distance 2ε of z. Let B be the closed ball of radius ε centered at a point x ∈ C0 for which d(x, z) < ε. The function r1B lies completely below f; it must be one of the gi. The corresponding fi takes a value greater than r at z. Thus, assertion (137) follows.
The proof can now be completed using analogous steps as in the proof of Theorem 9.7.
Assume without loss of generality that fi ↑ h at each point of C. Then, lim inf_{n→∞} Pn h ≥ lim_{n→∞} Pn fi = P fi for each fixed i, and letting i → ∞ gives lim inf_{n→∞} Pn h ≥ P h as before. Replace h by −h (plus a big constant) to get the companion inequality for the lim sup.
Corollary 9.10. If Ef(Xn) → Ef(X) for each bounded, uniformly continuous, T-measurable f, and if X concentrates on a separable set of completely regular points, then Xn →d X.
The corollary follows directly from the decision to insist upon uniformly continuous separating functions in the definition of a completely regular point.
Till now we have mostly confined ourselves to finite dimensional spaces (although the dis-
cussion in the previous subsection is valid for any metric space). In this subsection we
briefly discuss an important, both historically and otherwise, infinite dimensional metric
space where our random elements take values. Let C[0, 1] denote the space of continuous
functions from [0, 1] to R. We endow C[0, 1] with the uniform metric
    ‖x − y‖∞ := sup_{t∈[0,1]} |x(t) − y(t)|,   for x, y ∈ C[0, 1].
Remark 9.1. Note that C[0, 1], equipped with the uniform metric, is separable97 (Exercise
(HW4): Show this.) and complete98 . C[0, 1] is complete because a uniform limit of contin-
uous functions is continuous [Browder, 1996, Theorem 3.24]. C[0, 1] is separable because
the set of polynomials on [0, 1] is dense in C[0, 1] by the Weierstrass approximation theo-
rem [Browder, 1996, Theorem 7.1], and the set of polynomials with rational coefficients is
countable and dense in the set of all polynomials, hence also dense in C[0, 1].
Question: Let X and Xn be random processes (elements) on [0, 1]. When can we say that Xn →d X?
Suppose that Xn →d X in C[0, 1]. For each t ∈ [0, 1] define the projection map πt : C[0, 1] → R as
    πt(x) = x(t),   for x ∈ C[0, 1].
Observe that πt is a uniformly continuous map as |πt(x) − πt(y)| = |x(t) − y(t)| ≤ ‖x − y‖∞ for all x, y ∈ C[0, 1]. Thus, by the continuous mapping theorem (see e.g., Theorem 9.7), we know that Xn(t) = πt(Xn) →d πt(X) = X(t) for all t ∈ [0, 1].
fd
We shall write Xn → X for convergence of the finite-dimensional distributions, in the
sense that
d
(Xn (t1 ), . . . , Xn (tk )) → (X(t1 ), . . . , X(tk )), t1 , . . . , tk ∈ [0, 1], k ∈ N. (138)
Indeed, by defining the projection map π(t1,...,tk) : C[0, 1] → R^k as π(t1,...,tk)(x) = (x(t1), . . . , x(tk)), for t1, . . . , tk ∈ [0, 1], we can show that if Xn →d X in C[0, 1] then Xn →fd X.
Remark 9.2. Although it can be shown that the distribution of a random process is deter-
mined by the family of finite-dimensional distributions99 , condition (138) is insufficient in
d
general for the convergence Xn → X. Hint: Consider the sequence of points xn ∈ C[0, 1],
for n ≥ 1, where
    xn(t) = nt·1[0,1/n](t) + (2 − nt)·1(1/n, 2/n](t),   t ∈ [0, 1].
Let x0 ≡ 0 ∈ C[0, 1]. We can show that δ_{xn} →fd δ_{x0}. But δ_{xn} does NOT converge weakly to δ_{x0} as xn ↛ x0.
97
A topological space is separable if it contains a countable dense subset; that is, there exists a sequence
{xn }∞n=1 of elements of the space such that every nonempty open subset of the space contains at least one
element of the sequence.
98
A metric space T is called complete (or a Cauchy space) if every Cauchy sequence of points in T has a
limit that is also in T or, alternatively, if every Cauchy sequence in T converges in T . Intuitively, a space is
complete if there are no “points missing” from it (inside or at the boundary).
99
We have the following result (a consequence of [Kallenberg, 2002, Proposition 3.2]; also
see [Billingsley, 1999, Chapter 1]).
d
Lemma 9.11. Suppose that X and Y are random elements in C[0, 1]. Then X = Y if and only if
d
(X(t1 ), . . . , X(tk )) = (Y (t1 ), . . . , Y (tk )), t1 , . . . , tk ∈ [0, 1], k ∈ N.
Alternatively, if δ_{xn} →d δ_{x0}, then ‖xn‖∞ := supt∈[0,1] xn(t) = 1 should converge weakly
to kx0 k∞ = 0 (which is obviously not true!). Note that here we have used the fact that
f : C[0, 1] → R, defined as f (x) = supt∈[0,1] x(t) is a continuous function (Exercise(HW4):
Show this.).
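To see the phenomenon concretely, here is a small numerical sketch (ours, not part of the notes; the function name and the chosen evaluation points are illustrative): the finite-dimensional values of xn vanish for every fixed t > 0 as n grows, while the supremum of xn equals 1 for every n.

```python
import numpy as np

def x_n(t, n):
    """Triangle of height 1 on [0, 2/n]: x_n(t) = nt on [0,1/n], 2 - nt on (1/n,2/n], 0 elsewhere."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 1.0 / n, n * t, np.where(t <= 2.0 / n, 2.0 - n * t, 0.0)) * (t >= 0)

ts = np.array([0.0, 0.1, 0.5, 1.0])       # fixed time points: fidi values -> 0
for n in [1, 10, 100, 1000]:
    print(n, x_n(ts, n), x_n(1.0 / n, n))  # last entry is the peak value, always 1
```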
The above discussion motivates the following concepts, which we will show are crucially
tied to the notion of weak convergence of stochastic processes.
Suppose that Xn ’s are random elements taking values in the metric space (T, B(T )).
Definition 9.12. We say that the sequence of random elements {Xn}n≥1 is tight if and only if for every ε > 0 there exists a compact set100 V ⊂ T such that
    P(Xn ∈ V) > 1 − ε,   for all n ≥ 1.
Exercise (HW4): Let S be a separable and complete metric space. Then every probability
measure on (S, B(S)) is tight. (Hint: See [Billingsley, 1999, Theorem 1.3])
Recall that a set A is relatively compact if its closure is compact, which is equivalent
to the condition that each sequence in A contains a convergent subsequence (the limit of
which may not lie in A). This motivates the following definition which can be thought of
as a ‘probabilistic’ version of relative compactness.
Definition 9.13. A sequence of random elements {Xn}n≥1 is said to be relatively compact in distribution if every subsequence has a further subsequence that converges in distribution.
Similarly, we can define the notion of relative compactness (in distribution) of a sequence
{Pn }n≥1 of probability measures.
Theorem 9.14 (Prohorov’s theorem). If {Xn }n≥1 is tight, then it is relatively compact in
distribution. In fact, the two notions are equivalent if T is separable and complete101 .
Proof. See Theorem 2.1 of [Billingsley, 1999] (or 3.25 of [Kallenberg, 2002]).
Prohorov’s theorem is probably the key result in this theory of classical weak conver-
gence and gives the basic connection between tightness and relative distributional compact-
ness.
100
Recall that in a metric space a set is compact if and only if it is complete and totally bounded.
101
A metric space M is called complete if every Cauchy sequence of points in M has a limit that is also in
M ; or, alternatively, if every Cauchy sequence in M converges in M .
9.3.2 Tightness and weak convergence in C[0, 1]
Lemma 9.15. Let X, X1, . . . , Xn, . . . be random elements in C[0, 1]. Then Xn →d X if and only if Xn →fd X and {Xn} is relatively compact in distribution.
Proof. If Xn →d X, then Xn →fd X (by the continuous mapping theorem; take f : C[0, 1] → R^k defined as f(x) = (x(t1), . . . , x(tk))), and {Xn} is trivially relatively compact in distribution.
Now assume that {Xn} satisfies the two conditions. If Xn does not converge in distribution to X, we may choose a bounded continuous function f : C[0, 1] → R and an ε > 0 such that |Ef(Xn) − Ef(X)| > ε along some subsequence N′ ⊂ N. By the relative compactness we may choose a further subsequence N″ ⊂ N′ and a process Y such that Xn →d Y along N″. But then Xn →fd Y along N″, and since also Xn →fd X, we must have X =d Y (a consequence of Lemma 9.11). Thus, Xn →d X along N″, and so Ef(Xn) → Ef(X) along the same subsequence, a contradiction. Thus, we conclude that Xn →d X.
Recall that Prohorov's theorem shows that tightness and relative compactness in distribution are equivalent. Also, recall that a set A is relatively compact if its closure is compact. Thus, in view of the above result, we need to find convenient criteria for tightness.
The modulus of continuity of an arbitrary function x(·) on [0, 1] is defined as
    w(x, h) := sup_{s,t∈[0,1]: |s−t|≤h} |x(s) − x(t)|,   for h > 0.
The function w(·, h) is continuous on C[0, 1] and hence a measurable function102, for each fixed h > 0.
We use the classical Arzelà–Ascoli compactness criterion to do this. The Arzelà–Ascoli theorem completely characterizes relative compactness in C[0, 1].
Theorem 9.16 (Arzelà–Ascoli). A set A ⊂ C[0, 1] is relatively compact if and only if
    sup_{x∈A} |x(0)| < ∞
and
    lim_{h→0} sup_{x∈A} w(x, h) = 0.   (139)
Lemma 9.17. The sequence {Xn }n≥1 is tight if and only if the following two conditions
hold:
102 This follows from the fact that |w(x, h) − w(y, h)| ≤ 2‖x − y‖∞.
(i) For every η > 0, there exist a ≥ 0 and n0 ∈ N such that
    P(|Xn(0)| ≥ a) ≤ η,   for all n ≥ n0.   (140)
(ii) For each ε > 0 and η > 0, there exist h ∈ (0, 1) and n0 ∈ N such that for all n ≥ n0,
    P(w(Xn, h) ≥ ε) ≤ η.   (141)
Proof. Suppose that {Xn}n≥1 is tight. Given η > 0, we can find a compact K such that P(Xn ∈ K) > 1 − η for all n ≥ 1. By the Arzelà–Ascoli theorem, we have K ⊂ {x ∈ C[0, 1] : |x(0)| < a} for large enough a and K ⊂ {x ∈ C[0, 1] : w(x, h) < ε} for small enough h, and so (i) and (ii) hold, with n0 = 1 in each case. Hence the necessity.
Assume then that {Pn}n≥1 satisfies (i) and (ii), with n0 = 1 (without loss of generality103), where Xn ∼ Pn. Given η > 0, choose a so that, if B = {x ∈ C[0, 1] : |x(0)| ≤ a}, then Pn(B) ≥ 1 − η for all n. Then choose hk, so that, if Bk := {x ∈ C[0, 1] : w(x, hk) < 1/k}, then Pn(Bk) ≥ 1 − η/2^k for all n. If K is the closure of A := B ∩ (∩k Bk), then Pn(K) ≥ 1 − 2η for all n. Since A satisfies the conditions of Theorem 9.16, K is compact. Therefore, {Pn}n≥1 is tight.
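On discretized sample paths the modulus w(x, h) is easy to evaluate, which is how conditions such as (141) can be explored numerically. The following is a small sketch (ours; the grid-based computation is only an approximation of the true supremum): it evaluates w(x, h) for a discretized Brownian-motion-like path, for which the modulus shrinks as h → 0.

```python
import numpy as np

def modulus(path, h, dt):
    """w(x, h) = sup_{|s-t|<=h} |x(s) - x(t)| for a path sampled on a grid of spacing dt."""
    k = int(np.floor(h / dt))
    return max(np.abs(path[lag:] - path[:-lag]).max() for lag in range(1, k + 1))

dt = 1e-4
t = np.arange(0, 1 + dt, dt)
bm = np.cumsum(np.sqrt(dt) * np.random.default_rng(2).standard_normal(t.size))
for h in [0.1, 0.01, 0.001]:
    print(h, round(modulus(bm, h, dt), 3))   # decreases as h decreases
```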
Remark 9.3. Note that condition (141) can be expressed in a more compact form: for each ε > 0, lim_{h→0} lim sup_{n→∞} P(w(Xn, h) ≥ ε) = 0.
Remark 9.4. Let us fix two metric spaces (K, d) and (S, ρ), where K is compact and S is
separable and complete (i.e., S is a Polish space). Everything done in this subsection can
be easily extended to the space C(K, S) of continuous functions from K to S, endowed with
the uniform metric
ρ̂(x, y) = sup ρ(xt , yt ).
t∈K
103 Since C[0, 1] is separable and complete, a single measure Q is tight (by the exercise above), and so by the necessity of (i) and (ii), for a given η > 0 there is an a such that Q({x ∈ C[0, 1] : |x(0)| ≥ a}) ≤ η, and for given ε > 0 and η > 0 there is an h > 0 such that Q({x ∈ C[0, 1] : w(x, h) > ε}) ≤ η. Therefore, we may ensure that the inequalities in (140) and (141) hold for the finitely many n preceding n0 by increasing a and decreasing h if necessary. Thus, we may assume that n0 is always 1.
9.4 Non-measurability of the empirical process
Suppose that U1, . . . , Un are i.i.d. Uniform(0,1) defined on a probability space (Ω, A, P). Let
Gn be the empirical c.d.f. of the data. Note that Gn (defined on [0, 1]) is NOT continuous.
Thus, the empirical process is not an element of C[0, 1]. It is standard to consider Gn as
an element of D[0, 1], the space of càdlàg functions (right continuous with left limits) on
[0, 1]. To understand the weak convergence of the empirical process we next study weak
convergence on D[0, 1].
The following example shows that if D[0, 1] is equipped with the Borel σ-field B(D[0, 1])
generated by the closed sets under the uniform metric, the empirical d.f. Gn will not be a
random element in D[0, 1], i.e., Gn is not A/B(D[0, 1])-measurable.
Example 9.19. Consider n = 1 so that G1 (t) = 1[0,t] (U1 ), t ∈ [0, 1] (visualize the random
function G1 over [0, 1]; G1 (t) = 1{t ≥ U1 }). Let Bs be the open ball in D[0, 1] with center
1[s,1] and radius 1/2, where s ∈ [0, 1]. For each subset A ⊂ [0, 1] define
EA := ∪s∈A Bs .
Observe that EA is an open set as an uncountable union of open sets is also open.
If G1 were A/B(D[0, 1])-measurable, the set
    {ω ∈ Ω : G1(ω) ∈ EA} = {ω ∈ Ω : U1(ω) ∈ A}
would belong to A. A probability measure could be defined on the class of all subsets of [0, 1]
by setting µ(A) := P(U1 ∈ A). This µ would be an extension of the uniform distribution
to all subsets of [0, 1]. Unfortunately such an extension cannot exist! Thus, we must give
up Borel measurability of G1 . The argument can be extended to n ≥ 1 (see Problem 1 in
[Pollard, 1984, Chapter IV]).
The above example shows that the Borel σ-field generated by the uniform metric on
D[0, 1] contains too many sets.
Exercise (HW4): Show that D[0, 1], equipped with the Borel σ-field B(D[0, 1]) generated
by the uniform metric, is NOT separable. [Hint: We can define fx (t) = 1[x,1] (t) for each x ∈
[0, 1], then kfx −fy k∞ = 1 whenever x 6= y. In particular, we have an uncountable collection
of disjoint open sets given by the balls B(fx , 1/2), and so the space is not countable.]
Too large a σ-field T makes it too difficult for a map into T to be a random element. We
must also guard against too small a T . Even though the metric space on T has lost the right
to have T equal to the Borel σ-field, it can still demand some degree of compatibility before
a fruitful weak convergence theory will result. If Cb (T ; T ) contains too few functions, the
approximation arguments underlying the continuous mapping theorem will fail. Without
that key theorem, weak convergence becomes a barren theory (see [Pollard, 1984, Chapter
IV]).
9.5 D[0, 1] with the ball σ-field
In the last subsection we saw that the uniform empirical distribution function Gn ∈ T :=
D[0, 1] is not measurable with respect to its Borel σ-field (under the uniform metric). There
is a simple alternative to the Borel σ-field that works for most applications in D[0, 1], when
the limiting distribution is continuous.
For each fixed t ∈ [0, 1], the map Gn(·, t) : Ω → R is a random variable. That is, if πt denotes the coordinate projection map that takes a function x in D[0, 1] onto its value at t, the composition πt ∘ Gn is A/B(R)-measurable.
Let P be the projection σ-field, i.e., the σ-field generated by the coordinate projection
maps. Recall that if f : T → R, then the σ-field generated by the function f , denoted by
σ(f), is the collection of all inverse images f^{−1}(B) of the sets B ∈ B(R), i.e., σ(f) := {f^{−1}(B) : B ∈ B(R)}.
Exercise(HW4): A stochastic process X on (Ω, A, P) with sample paths in D[0, 1], such as
an empirical process, is A/P-measurable provided πt ◦ X is A/B(R)-measurable for each
fixed t.
As every cadlag function on [0, 1] is bounded (Exercise(HW4)), the uniform distance
defines a metric on D[0, 1]. Here are a few facts about P (Exercise(HW4)):
• P coincides with the σ-field generated by the closed (or open) balls (this is called the
ball σ-field104 ).
The limit processes for many applications will always concentrate in a separable subset
of D[0, 1], usually C[0, 1], the set of all continuous real valued functions on [0, 1].
Exercise(HW4): Show that C[0, 1] is a closed, complete, separable, P-measurable subset of
D[0, 1]. Show that for C[0, 1], the projection σ-field coincides with the Borel σ-field.
Question: How do we establish convergence in distribution of a sequence {Xn } of random
elements of D[0, 1] to a limit process X?
Definition 9.20. For each finite subset S of [0, 1] write πS for the projection map from
D[0, 1] into RS that takes an x onto its vector of values {x(t) : t ∈ S}.
A natural first requirement is the convergence in distribution of πS Xn to πS X for each finite subset S ⊂ [0, 1]. The continuity and measurability of πS and the continuous mapping theorem make that a necessary condition.
For the sake of brevity, let us now shorten “finite-dimensional distributions” to “fidis” and “finite-dimensional projections” to “fidi projections”.
Theorem 9.21. Let X, X1 , X2 , . . . be random elements of D[0, 1] (under its uniform metric
and the projection σ-field). Suppose that P(X ∈ C) = 1 for some separable subset C of
D[0, 1]. The necessary and sufficient conditions for {Xn } to converge in distribution to X
are:
fd d
(i) Xn → X (the fidis of Xn converge to the fidis of X, i.e., πS Xn → πS X of each finite
subset S of [0, 1]);
(ii) to each ε > 0 and δ > 0 there corresponds a grid 0 = t0 < t1 < . . . < tm = 1 such that
    lim sup_{n→∞} P( max_i sup_{t∈Ji} |Xn(t) − Xn(ti)| > δ ) < ε,   (142)
where Ji denotes the interval between the adjacent grid points ti and ti+1.
Proof. Suppose that Xn →d X. As the projection map πS is both continuous and projection-measurable (by definition), the continuous mapping theorem shows that (i) holds.
Let , δ > 0 be given. To simplify the proof of (ii) suppose that the separable subset C is
C[0, 1] (continuity of the sample paths makes the choice of the grid easier). Let {s0 , s1 , . . .}
be a countable dense subset of [0, 1]. We will assume that s0 = 0 and s1 = 1.
For a fixed x ∈ C[0, 1], and given the ordered k + 1 points {s0 , . . . , sk } labeled as
0 = t0 < . . . < tk = 1 in [0, 1], define an interpolation Ak x of x as
F := {x ∈ D[0, 1] : kAk x − xk ≥ δ}
is closed. By an application of the Portmanteau theorem (note that this needs a proof,
and has not been discussed in class yet, as we are not dealing with the Borel σ-field; see
e.g., [Pollard, 1984, Example 17, Chapter IV]; also see [Billingsley, 1999, Theorem 6.3]),
The left-hand side bounds the limsup in (142).
Now let us show that (i) and (ii) imply Xn →d X. Let us retain the assumption that X has continuous sample paths. Choose any bounded, uniformly continuous, projection-measurable, real-valued function f on D[0, 1]. Given ε > 0, find δ > 0 such that |f(x) − f(y)| < ε whenever ‖x − y‖ ≤ δ. Write AT for the approximation map constructed from the grid in (ii) corresponding to this δ and ε.
The middle term in the last line converges to zero as n → ∞, because of the fidi convergence πT Xn →d πT X.
We now know the price we pay for wanting to make probability statements about func-
tionals that depend on the whole sample path of a stochastic process: with high probability
we need to rule out nasty behavior between the grid points.
Question: How do we control the left-hand side of (142)? It involves a probability of a
union of m events, which we may bound by the sum of the probabilities of those events,
    Σ_{i=0}^{m−1} P( sup_{t∈Ji} |Xn(t) − Xn(ti)| > δ ).
Then we can concentrate on what happens in an interval Ji between adjacent grid points.
For many stochastic processes, good behavior of the increment Xn(ti+1) − Xn(ti) forces
good behavior for the whole segment of sample path over Ji . Read [Pollard, 1984, Chapter
V] for further details and results on how to control the left-hand side of (142).
10 Weak convergence in non-separable metric spaces
We have already seen that the uniform empirical process is not Borel measurable. We have
also seen that we can change the Borel σ-field to the ball σ-field and develop a fruitful
notion of weak convergence.
Another alternative solution is to use a different metric that will make the empirical
process measurable, e.g., we can equip the space D[0, 1] with the Skorohod (metric) topology.
Remark 10.1. (D[0, 1] and the Skorohod metric) We say that two cadlag functions are close
to one another in the Skorohod metric if there is a reparameterization of the time-axis (a function from [0, 1] to itself) that is uniformly close to the identity function, and when applied to
one of the cadlag functions, brings it close to the other cadlag function. Heuristically, two
cadlag functions are close if their large jumps are close to one another and of similar size,
and if they are uniformly close elsewhere. See [Billingsley, 1999, Chapter 3] for the study
of D[0, 1] with the Skorohod metric.
However, for empirical processes that assume values in very general (and often more
complex spaces) such cute generalizations are not readily achievable and the easier way
out is to keep the topology simple and tackle the measurability issues. Of course, any
generalization of the notion of weak convergence must allow a powerful continuous mapping
theorem.
We will now try to develop a more general notion of weak convergence that can handle
non-measurability of the underlying functions. In this section, the underlying σ-field will
always be the Borel σ-field generated by the metric endowed with the space.
Suppose that D is a metric space with metric d(·, ·). Suppose that we have arbitrary
maps Xn : Ωn → D, defined on probability spaces (Ωn , An , Pn ). Because Ef (Xn ) need
no longer make sense (where f : D → R is a bounded continuous function), we replace
expectations by outer expectations.
We say that Xn converges in distribution to X, written Xn →d X, if
    E∗ f(Xn) → Ef(X)
105
Unfortunately, Fubini’s theorem is not valid for outer expectations (Note that for any map T it always
holds E∗1 E∗2 T ≤ E∗ T ). To overcome this problem, it is assumed that F is P -measurable. We will not define
P -measurability formally; see [van der Vaart and Wellner, 1996, Chapter 2.3] for more details.
106
We note that the weak convergence in the above sense is no longer tied with convergence of probability
measures, simply because Xn need not induce Borel probability measures on D.
for every bounded, continuous function f : D → R. Here we insist that the limit X be
Borel-measurable (and hence has a distribution). Note that although Ωn may depend on n,
we do not let this show up in the notation for E∗ and P∗ .
Remark 10.2. Even for Borel measurable maps Xn and X, the distance d(Xn , X) need
not be a random variable.
(iii) lim inf n→∞ P∗ (Xn ∈ G) ≥ P(X ∈ G) for every open set G.
(v) P∗ (Xn ∈ B) → P(X ∈ B) for all Borel set B with P(X ∈ ∂B) = 0.
Proof. See [van der Vaart and Wellner, 1996, Theorem 1.3.4].
Proof. See [van der Vaart and Wellner, 1996, Theorem 1.3.6].
Definition 10.7. A Borel-measurable random element X into a metric space is tight if for every ε > 0 there exists a compact set K such that P(X ∉ K) < ε.
Remark 10.3. (Separability and tightness) Let X : Ω → D be a random element. Tightness
is equivalent to there being a σ-compact set (countable union of compacts) that has probability
1 under X.
If there is a separable, measurable set with probability 1, then X is called separable.
Since a σ-compact set in a metric space is separable, separability is slightly weaker than
tightness. The two properties are the same if the metric space is complete107 .
Let us look back at our motivation for studying weak convergence of stochastic processes.
Given i.i.d. random elements X1 , . . . , Xn ∼ P taking values in X and a class of measurable
functions F, we want to study the stochastic process {√n(Pn − P)(f) : f ∈ F} (i.e., the empirical process). If we assume that
    sup_{f∈F} |f(x) − P f| < ∞,   for every x ∈ X,
then each map f ↦ f(x) − P f, x ∈ X, is a bounded function on F, and consequently
    Pn − P ∈ ℓ∞(F),
the space of all bounded real-valued functions on F (equipped with the supremum norm).
Remark 10.4. As C[0, 1] ⊂ D[0, 1] ⊂ ℓ∞[0, 1], we can consider convergence of a sequence of maps with values in C[0, 1] relative to C[0, 1], but also relative to D[0, 1], or ℓ∞[0, 1]. It can be shown that if D0 ⊂ D are metric spaces equipped with the same metric, and X and every Xn take their values in D0, then Xn →d X as maps in D0 if and only if Xn →d X as maps in D.
Thus, the classical theory of weak convergence in complete separable metric spaces needs to be extended. Given that the space ℓ∞(·) arises
naturally while studying the empirical process, we will devote this subsection to the study
of weak convergence in the space `∞ (T ).
It turns out that the theory of weak convergence on bounded stochastic processes
extends nicely if the limit laws are assumed to be tight Borel probability measures on
`∞ (T ). The following result characterizes weak convergence (to a tight limiting probability
measure) in `∞ (T ): convergence in distribution of a sequence of sample bounded processes
is equivalent to weak convergence of the finite dimensional probability laws together with
asymptotic equicontinuity, a condition that is expressed in terms of probability inequalities.
This reduces convergence in distribution in `∞ (T ) to maximal inequalities (we have spent
most of the earlier part of this course on proving such inequalities). We will often refer
to this theorem as the asymptotic equicontinuity criterion for convergence in law in `∞ (T ).
The history of this theorem goes back to Prohorov (for sample continuous processes) and
in the present generality it is due to [Hoffmann-Jørgensen, 1991] with important previous
contributions by [Dudley, 1978].
Theorem 10.8. Let Xn : Ωn → ℓ∞(T) (n ≥ 1) be arbitrary maps. Then Xn converges in distribution to a tight random element in ℓ∞(T) if and only if the following two conditions hold:
(i) The sequence (Xn(t1), . . . , Xn(tk)) converges in distribution in R^k (k ∈ N) for every finite set of points t1, . . . , tk ∈ T;
(ii) (asymptotic equicontinuity) there exists a semi-metric d(·, ·) for which (T, d) is totally bounded and for every ε > 0
    lim_{δ→0} lim sup_{n→∞} P∗( sup_{d(s,t)<δ; s,t∈T} |Xn(s) − Xn(t)| > ε ) = 0.   (147)
Proof. Let us first show that (i) and (ii) imply that Xn converges weakly to a tight random
element. The proof consists of two steps.
Step 1: By assumption (i) and Kolmogorov’s consistency theorem109 we can construct a
d
stochastic process {Xt : t ∈ T } on some probability space such that (Xn (t1 ), . . . , Xn (tk )) →
109
Kolmogorov’s consistency theorem shows that for any collection of “consistent” finite-dimensional
marginal distributions one can construct a stochastic process on some probability space that has these
distributions as its marginal distributions.
Theorem 10.9. Let T denote an arbitrary set (thought of as “time”). For each k ∈ N and finite sequence
of times t1 , . . . tk ∈ T , let νt1 ,...,tk denote a probability measure on Rk . Suppose that these measures satisfy
two consistency conditions:
• for all permutations π of {1, . . . , k} and measurable sets Fi ⊂ R,
(X(t1 ), . . . , X(tk )) for every finite set of points t1 , . . . , tk ∈ T . We need to verify that X
admits a version that is a tight random element in `∞ (T ). Let T0 be a countable d-dense
subset of T , and let Tk , k = 1, 2, . . . be an increasing sequence of finite subsets of T0 such
that ∪∞k=1 Tk = T0 . By the Portmanteau lemma (see part (iii) of Theorem 9.2) on the
equivalent conditions of weak convergence in Euclidean spaces we have
    P( max_{d(s,t)<δ; s,t∈Tk} |X(s) − X(t)| > ε ) ≤ lim inf_{n→∞} P( max_{d(s,t)<δ; s,t∈Tk} |Xn(s) − Xn(t)| > ε )
        ≤ lim inf_{n→∞} P( sup_{d(s,t)<δ; s,t∈T0} |Xn(s) − Xn(t)| > ε ),
where we consider the open set {x ∈ ℓ∞(T) : sup_{d(s,t)<δ; s,t∈Tk} |x(s) − x(t)| > ε}; by the monotone convergence theorem this remains true if Tk is replaced by T0.
By the asymptotic equicontinuity condition (147), there exists a sequence δr > 0 with
δr → 0 as r → ∞ such that
    P( sup_{d(s,t)≤δr, s,t∈T0} |X(s) − X(t)| > 2^{−r} ) ≤ 2^{−r}.
Then there exists a probability space (Ω, A, P) and a stochastic process X : T × Ω → R such that
    P(X(t1) ∈ F1, . . . , X(tk) ∈ Fk) = νt1,...,tk(F1 × · · · × Fk)
for all ti ∈ T, k ∈ N and measurable sets Fi ⊂ R, i.e., X has νt1,...,tk as its finite-dimensional distributions relative to times t1, . . . , tk.
110 Let {An}n≥1 be a sequence of events in some probability space. The Borel–Cantelli lemma states that if the sum of the probabilities of the events An is finite, i.e., Σ_{n=1}^∞ P(An) < ∞, then the probability that infinitely many of them occur is 0, i.e., P(lim sup_{n→∞} An) = 0, where lim sup An := ∩_{n=1}^∞ ∪_{k=n}^∞ Ak is the set of outcomes that occur infinitely many times within the infinite sequence of events {An}.
111
We will use the following lemma without proving it.
Step 2: We now prove Xn →d X in ℓ∞(T). First we recall a useful fact112: if f : ℓ∞(T) → R is bounded and continuous, and if K ⊂ ℓ∞(T) is compact, then for every ε > 0 there exists δ > 0 such that
    u ∈ K, ‖u − v‖T < δ   implies   |f(u) − f(v)| < ε.   (148)
Since (T, d) is totally bounded, for every τ > 0 there exists a finite set of points t1, . . . , tN(τ) which is τ-dense in (T, d), in the sense that T ⊆ ∪_{i=1}^{N(τ)} B(ti, τ), where B(t, τ) denotes the open ball of center t and radius τ. Then, for each t ∈ T we can choose πτ : T → {t1, . . . , tN(τ)} so that d(πτ(t), t) < τ. We then define processes Xn,τ, n ∈ N, and Xτ as
    Xn,τ(t) = Xn(πτ(t)),   Xτ(t) = X(πτ(t)),   t ∈ T.   (149)
These are approximations of Xn and X that take only a finite number N (τ ) of values.
Convergence of the finite dimensional distributions of Xn to those of X implies113 that
d
Xn,τ → Xτ in `∞ (T ). (150)
For a bounded continuous f : ℓ∞(T) → R and τ > 0 write
    |E∗f(Xn) − Ef(X)| ≤ |E∗f(Xn) − Ef(Xn,τ)| + |Ef(Xn,τ) − Ef(Xτ)| + |Ef(Xτ) − Ef(X)| =: In,τ + IIn,τ + IIIτ.
We have seen that lim_{n→∞} IIn,τ = 0 (by (150)) for each fixed τ > 0 and lim_{τ→0} IIIτ = 0 (as X is uniformly continuous a.s.114). Hence it only remains to show that lim_{τ→0} lim sup_{n→∞} In,τ = 0.
Lemma 10.10. Let X(t), t ∈ T , be a sample bounded stochastic process. Then the finite dimensional
distributions of X are those of a tight Borel probability measure on `∞ (T ) if and only if there exists on T a
semi-metric d for which (T, d) is totally bounded and such that X has a version with almost all its sample
paths uniformly continuous for d.
112 Suppose on the contrary that the assertion is false. Then there exist ε > 0 and sequences un ∈ K and vn ∈ ℓ∞(T) such that ‖un − vn‖T → 0 and |f(un) − f(vn)| ≥ ε. Since K is compact, there exists a subsequence un′ of un such that un′ has a limit u in K. Then vn′ → u and, by continuity of f, |f(un′) − f(vn′)| → |f(u) − f(u)| = 0, which is a contradiction.
113 To see this, let Ti = {t ∈ T : ti = πτ(t)}. Then {Ti} forms a partition of T and πτ(t) = ti whenever t ∈ Ti. Since the map Πτ : (a1, . . . , aN(τ)) ↦ Σ_{i=1}^{N(τ)} ai 1Ti(t) is continuous from R^{N(τ)} into ℓ∞(T), the finite dimensional convergence implies that for any bounded continuous function H : ℓ∞(T) → R,
    E[H(Xn,τ)] = E[H ∘ Πτ(Xn(t1), . . . , Xn(tN(τ)))] → E[H ∘ Πτ(X(t1), . . . , X(tN(τ)))] = E[H(Xτ)],
Given ε > 0 we choose K ⊂ ℓ∞(T) to be a compact set such that P(X ∈ Kᶜ) < ε/(6‖f‖∞). Let δ > 0 be such that (148) holds for K and ε/6. Then, we have
    |E∗f(Xn) − Ef(Xn,τ)| ≤ 2‖f‖∞ [ P(Xn,τ ∈ (K^{δ/2})ᶜ) + P(‖Xn − Xn,τ‖T ≥ δ/2) ]
        + sup{|f(u) − f(v)| : u ∈ K, ‖u − v‖T < δ},   (152)
where K^{δ/2} is the (δ/2)-open neighborhood of the set K under the sup norm, i.e., K^{δ/2} := {x ∈ ℓ∞(T) : inf_{u∈K} ‖x − u‖T < δ/2}.
The inequality in (152) can be checked as follows: If Xn,τ ∈ K δ/2 and kXn − Xn,τ kT < δ/2,
then there exists u ∈ K such that ku − Xn,τ kT < δ/2 and then
Now the asymptotic equicontinuity hypothesis (see (147); recall the definition of Xn,τ
in (149)) implies that there is a τ2 > 0 such that
    lim sup_{n→∞} P∗( ‖Xn − Xn,τ‖T ≥ δ/2 ) < ε/(6‖f‖∞),   for all τ < τ2,
which completes the proof of the part that (i) and (ii) imply the weak convergence of Xn
(to tight random element).
Now we show that if Xn converges weakly to a tight random element in `∞ (T ), (i) and
(ii) should hold. By the continuous mapping theorem it follows that (i) holds. The other
implication is a consequence of the “closed set” part of the Portmanteau theorem. First we
state a result which we will use (but prove later).
Theorem 10.11. Suppose that X ∈ `∞ (T ) induces a tight Borel measure. Then there exists
a semi-metric ρ on T for which (T, ρ) is totally bounded and such that X has a version with
114 To see that lim_{τ→0} IIIτ = 0: given ε > 0, choose a compact K ⊂ ℓ∞(T) with P(X ∈ Kᶜ) < ε/(6‖f‖∞). Let δ > 0 be such that (148) holds for K and ε/6. Let τ1 > 0 be such that P(‖Xτ − X‖T ≥ δ) < ε/(6‖f‖∞) for all τ < τ1; this can be done by virtue of (151). Then it follows that
    |Ef(Xτ) − Ef(X)| ≤ 2‖f‖∞ P({X ∈ Kᶜ} ∪ {‖Xτ − X‖T ≥ δ}) + sup{|f(u) − f(v)| : u ∈ K, ‖u − v‖T < δ}
        ≤ 2‖f‖∞ (ε/(6‖f‖∞) + ε/(6‖f‖∞)) + ε/6 = 5ε/6.
Hence lim_{τ→0} IIIτ = 0.
almost all sample paths in the space of all uniformly continuous real valued functions from
T to R.
Furthermore, if X is zero-mean Gaussian, then this semi-metric can be taken equal to
p
ρ(s, t) = Var(X(s) − X(t)).
Taking limits across the resulting inequality as δ → 0 yields the asymptotic equicontinuity
in view of the ρ-uniform continuity of the sample paths of X.
In view of this connection between the partitioning condition (ii), continuity and tight-
ness, we shall sometimes refer to this condition as the condition of asymptotic tightness or
asymptotic equicontinuity.
Remark 10.5. How do we control the left-hand side of (147)? It involves a probability,
which by Markov’s inequality can be bounded by
!
−1 E∗ sup |Xn (s) − Xn (t)| .
d(s,t)<δ
Thus we need good bounds on the expected suprema of the localized fluctuations.
Proof of Theorem 10.11. By tightness there exist compact sets K1, K2, . . . ⊂ ℓ∞(T) such that P(X ∈ ∪_{m=1}^∞ Km) = 1. Define, for s, t ∈ T,
    ρm(s, t) := sup_{x∈Km} |x(s) − x(t)|   and   ρ(s, t) := Σ_{m=1}^∞ 2^{−m} (ρm(s, t) ∧ 1).
We first show that this ρ makes (T, ρ) totally bounded. To show this, let ε > 0, and choose k so that Σ_{m=k+1}^∞ 2^{−m} < ε/4. By the compactness of Kk := ∪_{m=1}^k Km, there exist x1, . . . , xr, a finite subset of Kk, such that Kk ⊂ ∪_{i=1}^r B(xi; ε/4) (here B(x; ε) := {y ∈ ℓ∞(T) : ‖x − y‖T < ε} is the ball of radius ε around x), i.e., for each x ∈ Kk there exists i ∈ {1, . . . , r} such that
    ‖x − xi‖T ≤ ε/4.   (153)
Also, as Kk is compact, Kk is a bounded set. Thus, the subset A ⊂ R^r defined by A := {(x1(t), . . . , xr(t)) : t ∈ T} is bounded. Therefore, A is totally bounded and hence there exists a finite set Tε := {tj : 1 ≤ j ≤ N} ⊂ T such that, for every t ∈ T, there is a j ≤ N for which
    max_{i=1,...,r} |xi(t) − xi(tj)| ≤ ε/4.   (154)
Next we will show that Tε is ε-dense in T for the semi-metric ρ. Let t ∈ T and let j ≤ N be as in (154). For any m ≤ k, we have
    ρm(t, tj) = sup_{x∈Km} |x(t) − x(tj)| ≤ 3ε/4.   (155)
Note that (155) follows as for any x ∈ Km, there exists i ∈ {1, . . . , r} such that ‖x − xi‖T ≤ ε/4 (by (153)) and thus
    |x(t) − x(tj)| ≤ |x(t) − xi(t)| + |xi(t) − xi(tj)| + |xi(tj) − x(tj)| ≤ ε/4 + ε/4 + ε/4 = 3ε/4,
where we have used (154). Hence,
    ρ(t, tj) ≤ Σ_{m=1}^k 2^{−m} ρm(t, tj) + Σ_{m=k+1}^∞ 2^{−m} ≤ (3ε/4) Σ_{m=1}^k 2^{−m} + ε/4 ≤ ε.
ρ2 (tn0 , s) → 0, so that ρ2 (s, t) = 0 and |X(s, ω) − X(t, ω)| ≥ . Conclude that the path
t 7→ X(t, ω) can only fail to be ρ2 -continuous for ω for which there exist s, t ∈ T with
ρ2 (s, t) = 0, but X(s, ω) 6= X(t, ω). Let N be the set of ω for which there do exist such
s and t. Take a countable, ρ-dense subset A of {(s, t) ∈ T × T : ρ2 (s, t) = 0}. Since
t 7→ X(t, ω) is ρ-continuous, N is also the set of all ω such that there exist (s, t) ∈ A with
X(s, ω) 6= X(t, ω). From the definition of ρ2 , it is clear that for every fixed (s, t), the set of
ω such that X(s, ω) 6= X(t, ω) is a null set. Conclude that N is a null set. Hence, almost
all paths of X are ρ2 -continuous.
Remark 10.6. In the course of the proof of the preceding theorem we constructed a semi-
metric ρ such that the weak limit X has uniformly ρ-continuous sample paths, and such that
(T, ρ) is totally bounded. This is surprising: even though we are discussing stochastic pro-
cesses with values in the very large space `∞ (T ), the limit is concentrated on a much smaller
space of continuous functions. Actually this is a consequence of imposing the condition (ii)
and insisting that the limit X be a tight random element.
Exercise (HW4): Show that a sequence converges in this metric if it converges uniformly
on each Ti .
In case Ti := [−i, i]d ⊂ Rd (for d ≥ 1), the metric d induces the topology of uniform
convergence on compacta.
The space `∞ (T1 , T2 , . . .) is of interest in applications, but its weak convergence theory
is uneventful. Weak convergence of a sequence is equivalent to (weak) convergence in each
of the restrictions Ti ’s.
Proof. See [van der Vaart and Wellner, 1996, Theorem 1.6.1].
11 Donsker classes of functions
Suppose that X1 , . . . , Xn are i.i.d. random elements taking values in a set X having distri-
bution P and let Gn denote the corresponding empirical process (i.e., Gn ≡ √n(Pn − P)) indexed by a class F of real-valued measurable functions.
The Donsker property depends on the law P of the observations; to stress this we
also say “P -Donsker”. The definition implicitly assumes that the empirical process can be
viewed as a map into `∞ (F), i.e., that the sample paths f 7→ Gn f are bounded. By the
multivariate CLT, for any finite set of measurable functions fi with P fi2 < ∞,
    (Gn f1, . . . , Gn fk) →d (GP f1, . . . , GP fk),
where the vector on the right-hand side possesses a multivariate normal distribution with mean zero and covariances given by Cov(GP fi, GP fj) = P[fi fj] − P[fi]P[fj].
Remark 11.1. A stochastic process GP in `∞ (F) is called Gaussian if for every (f1 , . . . , fk ),
fi ∈ F, k ∈ N, the random vector (GP (f1 ), . . . , GP (fk )) is a multivariate normal vector.
Theorem 11.2. Let F be a class of measurable functions from X to R such that P[f²] < ∞, for every f ∈ F, and sup_{f∈F} |f(x) − Pf| < ∞ for every x ∈ X. Then the empirical process {Gn f : f ∈ F} converges weakly to a tight random element (i.e., F is P-Donsker) if and only if there exists a semi-metric d(·, ·) on F such that (F, d) is totally bounded and
    lim_{δ→0} lim sup_{n→∞} P∗( sup_{d(f,g)≤δ; f,g∈F} |Gn(f − g)| > ε ) = 0,   for every ε > 0.   (156)
A typical distance d is d(f, g) = kf − gkL2 (P ) , but this is not the only one.
For most function classes of interest, the bracketing numbers N[](ε, F, L2(P)) grow to infinity as ε ↓ 0. A sufficient condition for a class to be Donsker is that they do not grow too fast. The speed can be measured in terms of the bracketing integral. Recall that the bracketing entropy integral is defined as
    J[](δ, F, L2(P)) := ∫_0^δ √(log N[](ε, F ∪ {0}, L2(P))) dε.
If this integral is finite-valued, then the class F is P-Donsker. As the integrand is a decreasing function of ε, the convergence of the integral depends only on the size of the bracketing numbers as ε ↓ 0. Because ∫_0^1 ε^{−r} dε converges for r < 1 and diverges for r ≥ 1, the integral condition roughly requires that the entropies grow at a slower order than 1/ε².
Theorem 11.3 (Donsker theorem). Suppose that F is a class of measurable functions with
square-integrable (measurable) envelope F and such that J[ ] (1, F, L2 (P )) < ∞. Then F is
P -Donsker.
Proof. As N[](ε, F, L2(P)) is finite for every ε > 0, we know that (F, d) is totally bounded, where d(f, g) = ‖f − g‖_{L2(P)}. Let G be the collection of all differences f − g when f and g range over F. With a given set of ε-brackets {[li, ui]}_{i=1}^N over F we can construct 2ε-brackets over G by taking differences [li − uj, ui − lj] of upper and lower bounds. Therefore, the bracketing numbers N[](ε, G, L2(P)) are bounded by the squares of the bracketing numbers N[](ε/2, F, L2(P)). Taking a logarithm turns the square into a multiplicative factor 2 and hence the entropy integrals of F and G are proportional. The function G = 2F is an envelope for the class G.
Let Gδ := {f − g : f, g ∈ F, ‖f − g‖_{L2(P)} ≤ δ}. Hence, by a maximal inequality115, there exists a finite number a(δ) = δ/√(log N[](δ, Gδ, L2(P))) such that
    E∗[ sup_{f,g∈F: ‖f−g‖≤δ} |Gn(f − g)| ] ≲ J[](δ, Gδ, L2(P)) + √n P[G 1{G > a(δ)√n}]
        ≤ J[](δ, G, L2(P)) + √n P[G 1{G > a(δ)√n}].   (157)
√
The second term on the right is bounded by a(δ)−1 P [G2 1{G > a(δ) n}] and hence
converges to 0 as n → ∞ for every δ. The integral converges to zero as δ → 0. The theorem
now follows from the asymptotic equi-continuity condition (see Theorem 11.2), in view of
Markov’s inequality.
Example 11.5 (Classical Donsker’s theorem). When F is equal to the collection of all
indicator functions of the form ft = 1(−∞,t] , with t ranging over R, then the empirical
115 Here is a maximal inequality that uses bracketing entropy (see [van der Vaart, 1998, Lemma 19.34] for a proof):
Theorem 11.4. For any class F of measurable functions f : X → R such that Pf² < δ², for every f, we have, with a(δ) = δ/√(log N[](δ, F, L2(P))) and F an envelope function,
    E∗‖Gn‖_F = E∗[ sup_{f∈F} |Gn f| ] ≲ J[](δ, F, L2(P)) + √n P∗[F 1{F > √n a(δ)}].
process Gn ft is the classical empirical process √n(Fn(t) − F(t)) (here X1, . . . , Xn are i.i.d. P with c.d.f. F).
√
We saw previously that N[ ] ( , F, L2 (P )) ≤ 2/, whence the bracketing numbers are
of the polynomial order 1/2 . This means that this class of functions is very small, because
a function of the type log(1/) satisfies the entropy condition of Theorem 5.2 easily.
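The classical Donsker theorem can be checked numerically through the continuous mapping theorem: √n‖Fn − F‖∞ is a continuous functional of the empirical process, so its distribution should approach that of the supremum of the absolute value of a Brownian bridge (the Kolmogorov distribution). A small Monte Carlo sketch (ours; the sample size, number of replications, and series truncation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def ks_stat(n):
    """sup_t sqrt(n)|F_n(t) - t| for a Uniform(0,1) sample of size n."""
    u = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return np.sqrt(n) * np.maximum(i / n - u, u - (i - 1) / n).max()

def kolmogorov_cdf(x, terms=100):
    """P(sup_t |B(t)| <= x) for a Brownian bridge B (Kolmogorov distribution)."""
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1) ** (k - 1) * np.exp(-2 * (k * x) ** 2))

n, reps, x = 2000, 2000, 1.0
stats = np.array([ks_stat(n) for _ in range(reps)])
print(np.mean(stats <= x), kolmogorov_cdf(x))   # the two numbers should be close
```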
The following theorem shows that the bracketing numbers in the preceding Donsker theorem can be replaced by the uniform covering numbers116
    sup_Q N(ε‖F‖Q,2, F, L2(Q)).
Here the supremum is taken over all probability measures Q for which ‖F‖²Q,2 = Q[F²] > 0. Recall that the uniform entropy integral is defined as
    J(δ, F, F) := ∫_0^δ √( sup_Q log N(ε‖F‖Q,2, F, L2(Q)) ) dε.   (158)
Theorem 11.6. Let F be a suitably measurable117 class of measurable functions with measurable envelope F such that P∗[F²] < ∞ and J(1, F, F) < ∞. Then F is P-Donsker.
Proof. We first show that (F, d) is totally bounded, where d(f, g) = ‖f − g‖_{L2(P)}. The finiteness of the uniform entropy integral implies the finiteness of its integrand: sup_Q N(ε‖F‖Q,2, F, L2(Q)) < ∞, for every ε > 0, where the supremum is taken over all finitely discrete probability measures Q with Q[F²] > 0. We claim that this implies that F is totally bounded in L2(P). Let ε > 0 and suppose that f1, . . . , fN are functions in F such that P[(fi − fj)²] > ε²P[F²], for every i ≠ j. By the law of large numbers Pn[(fi − fj)²] → P[(fi − fj)²] for any i, j and Pn[F²] → P[F²], almost surely, as n → ∞. It follows that there exists some n and realization Pn of Pn such that Pn[(fi − fj)²] > ε²P[F²], for every i ≠ j, and 0 < Pn[F²] < 2P[F²]. Consequently Pn[(fi − fj)²] > ε²Pn[F²]/2, and hence N ≤ D(ε‖F‖Pn,2/√2, F, L2(Pn)) (recall the notion of packing number; see Definition 2.5). Because Pn is finitely discrete, the right side is bounded by the supremum over Q considered previously and hence bounded in n. In view of the definition of N this shows that D(ε‖F‖P,2, F, L2(P)) is finite for every ε > 0.
116
The uniform covering numbers are relative to a given envelope function F . This is fortunate, because the
covering numbers under different measures Q typically are more stable if standardized by the norm kF kQ,2
of the envelope function. In comparison, in the case of bracketing numbers we consider a single distribution
P.
117
The condition that the class F is pointwise-measurable (or “suitably measurable”) is satisfied in most
examples but cannot be omitted. It suffices that there exists a countable collection G of functions such that
each f is the pointwise limit of a sequence gm in G; see [van der Vaart and Wellner, 1996, Chapter 2.3].
To verify the asymptotic equi-continuity condition, it is enough to show that
    lim_{δ→0} lim sup_{n→∞} E[ √n ‖Pn − P‖_{Gδ} ] = 0,
which leads to J(ε, Gδ, 2F) ≤ C J(ε, F, F) for all ε > 0. Hence by the maximal inequality in Theorem 4.10 (with σ = δ, envelope 2F) and δ′ = δ/(2‖F‖P,2),
    E[‖Gn‖_{Gδ}] ≤ C ( J(δ′, F, F)‖F‖P,2 + Bn J²(δ′, F, F)/(δ′²√n) ),
where Bn = 2√(E[max_{1≤i≤n} F²(Xi)]). As F ∈ L2(P), we have118 Bn = o(√n). Given η > 0, by choosing δ small, we can make sure that J(δ′, F, F) < η and, for large n, Bn J²(δ′, F, F)/(δ′²√n) < η. Thus,
P
If θn → 0, then the right side converges to zero, in view of the dominated convergence
theorem.
Without loss of generality, assume that F ≥ 1, so that θn2 ≤ supf ∈Gδ Pn [f 2 ] (otherwise
replace F by F ∨ 1; this decreases the entropy integral). As supf ∈Gδ P [f 2 ] → 0, as δ → 0,
the desired conclusion follows if kPn f 2 − P f 2 kGδ converges in probability to zero. This
is certainly the case if the class G = (F − F)² is Glivenko–Cantelli. It can be shown that G = (F − F)², relative to the envelope (2F)², has a bounded uniform entropy integral. Thus the class is Glivenko–Cantelli.
The Donsker theorem we just stated involves a fixed class of functions F not depending on n. As will become clear in the next example, it is sometimes useful to have similar results
118
This follows from the following simple fact: For i.i.d. random variables ξ1 , ξ2 , . . . the following three
statements are equivalent: (1) E[|ξ1 |] < ∞; (2) max1≤i≤n |ξi |/n → 0 almost surely; (3) E[max1≤i≤n |ξi |] =
o(n).
for classes of functions Fn which depend on the sample size n. Suppose that
Fn := {fn,t : t ∈ T }
Suppose further that the classes Fn have envelope functions Fn satisfying the Lindeberg condition
    P[Fn²] = O(1),   and   P[Fn² 1{Fn > ε√n}] → 0 for every ε > 0.   (161)
Proof. We only give the proof using the bracketing entropy integral condition. For each δ > 0, applying Theorem 4.12 and using a similar idea as in (157) we obtain the bound
    E∗[ sup_{s,t∈T: P(fn,s − fn,t)² < δ²} |Gn(fn,s − fn,t)| ] ≲ J[](δ, Gn, L2(P)) + an(δ)^{−1} P[Gn² 1{Gn > an(δ)√n}],
where Gn := Fn − Fn is the class of differences, Gn := 2Fn is its envelope, and an(δ) = δ/√(log N[](δ, Gn, L2(P))). Because J[](δn, Fn, L2(P)) →
0 for every δn → 0, we must have that J[ ] (δ, Fn , L2 (P )) = O(1) for every δ > 0 and hence
an (δ) is bounded away from 0. Then the second term in the preceding display converges
to zero for every fixed δ > 0, by the Lindeberg condition. The first term can be made
arbitrarily small as n → ∞ by choosing δ small, by assumption.
Example 11.8 (The Grenander estimator). Suppose that X1 , . . . , Xn are i.i.d. P on [0, ∞)
with a non-increasing density function f and c.d.f. F (which is known to be concave). We
want to estimate the unknown density f under the restriction that f is non-increasing.
Grenander [Grenander, 1956] showed that we can find a nonparametric maximum likelihood estimator (NPMLE) f̂n in this problem, i.e., we can maximize the likelihood ∏_{i=1}^n g(Xi) over all non-increasing densities g on [0, ∞); f̂n turns out to be the left derivative of the least concave majorant of the empirical c.d.f. Fn; see [Robertson et al., 1988, Chapter 7.2]. Also, f̂n can be computed easily using the pool adjacent violators algorithm.
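As an aside, here is a minimal numerical sketch (ours, not from the notes; function names and the exponential test data are illustrative) of the estimator via its least-concave-majorant characterization: it computes the vertices of the LCM of the empirical c.d.f. with an upper-convex-hull pass and reads off the slopes, which give the piecewise-constant f̂n.

```python
import numpy as np

def grenander(x):
    """Grenander NPMLE of a non-increasing density on [0, inf):
    slopes of the least concave majorant (LCM) of the empirical c.d.f."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    pts = np.column_stack((np.concatenate(([0.0], x)), np.arange(n + 1) / n))  # (x_(i), i/n)
    hull = []                               # vertices of the LCM (upper convex hull)
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or below the chord (keeps concavity)
            if (y2 - y1) * (p[0] - x2) <= (p[1] - y2) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(tuple(p))
    hull = np.array(hull)
    slopes = np.diff(hull[:, 1]) / np.diff(hull[:, 0])   # non-increasing density values
    return hull[:, 0], slopes

# usage sketch: standard exponential data, estimate evaluated near x0 = 1
rng = np.random.default_rng(0)
knots, fhat = grenander(rng.exponential(size=500))
x0 = 1.0
print(fhat[np.searchsorted(knots, x0, side="right") - 1])   # compare with exp(-1) ~ 0.368
```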
Suppose that x0 ∈ (0, ∞) is an interior point in the support of P . Does fˆn (x0 ) → f (x0 )?
Indeed, this holds. In fact, it can be shown that n1/3 (fˆn (x0 ) − f (x0 )) = Op (1).
Let us find the limiting distribution of ∆n := n^{1/3}(f̂n(x0) − f(x0)). We will show that if f′(x0) < 0, then
    ∆n = n^{1/3}(f̂n(x0) − f(x0)) →d LCM′[Z](0),
where Z(t) := √f(x0) W(t) + (t²/2) f′(x0), t ∈ R, W is a standard two-sided Brownian motion with W(0) = 0, LCM[m] denotes the least concave majorant of a function m : R → R, and LCM′[m] its left derivative. Consider the localized process
    Zn(t) := n^{2/3}[ Fn(x0 + tn^{−1/3}) − Fn(x0) − tn^{−1/3} f(x0) ],   t ∈ R.
Observe that ∆n = LCM′[Zn](0). Here we have used the fact that for any function m : R → R and an affine function x ↦ a(x) := β0 + β1x, LCM[m + a] = LCM[m] + a.
The idea is to show that
    Zn →d Z in ℓ∞([−K, K]), for any K > 0,   (162)
and then apply the continuous mapping principle to deduce ∆n →d LCM′[Z](0).
Actually, a rigorous proof of the convergence of ∆n involves a little more than an application of a continuous mapping theorem. The convergence Zn →d Z is only under the metric of uniform convergence on compacta. A concave majorant near the origin might be determined by values of the process a long way from the origin; thus the convergence Zn →d Z by itself does not imply the convergence of LCM′[Zn](0) →d LCM′[Z](0). However, we will not
address this issue for the time being. An interested reader can see [Kim and Pollard, 1990,
Assertion, Page 217] for a rigorous treatment of this issue. We will try to show that (162)
holds.
Consider the class of functions
    gt(x) := 1_{(x0, x0+t]}(x) − t f(x0),   x ∈ [0, ∞), t ∈ R
(for t < 0 we interpret 1_{(x0, x0+t]} as −1_{(x0+t, x0]}). Let Fn := {fn,t := n^{1/6} g_{tn^{−1/3}} : t ∈ R}. It can be shown, appealing to Theorem 11.7, that n^{2/3}(Pn − P)[g_{tn^{−1/3}}(X)] = Gn[fn,t] →d √f(x0) W(t) in ℓ∞[−K, K], for every K > 0.
Further, using a Taylor series expansion, we can see that
    n^{2/3} P[g_{tn^{−1/3}}(X)] = n^{2/3}[ F(x0 + tn^{−1/3}) − F(x0) − n^{−1/3} t f(x0) ] → (t²/2) f′(x0),
uniformly on compacta. Combining these two facts we see that (162) holds.
12 Limiting distribution of M -estimators
Let
θ0 := arg max M (θ).
θ∈Θ
If Mn and M are uniformly close, then maybe their argmax’s θ̂n and θ0 are also close. A key
tool to studying such behavior of θ̂n is the argmax continuous mapping theorem which we
consider next. Before we present the result in a general setup let us discuss the main idea
behind the proof. For any given ε > 0, we have to bound the probability P(d(θ̂n, θ0) ≥ ε). The key step is to realize that
    P(d(θ̂n, θ0) ≥ ε) ≤ P( sup_{θ∈Θ: d(θ,θ0)≥ε} [Mn(θ) − Mn(θ0)] > 0 )
        ≤ P( sup_{θ∈Θ: d(θ,θ0)≥ε} [(Mn − M)(θ) − (Mn − M)(θ0)] > − sup_{d(θ,θ0)≥ε} [M(θ) − M(θ0)] ).   (163)
The (uniform) closeness of Mn and M (cf. condition (3) in Theorem 12.1 below) shows
that the left-hand side of (163) must converge to 0 (in probability), whereas if M has a
well-separated unique maximum121 (cf. condition (1) in Theorem 12.1) then the right-hand
side of (163) must exceed a positive number, thereby showing that P(d(θ̂n , θ0 ) ≥ ) → 0
as n → ∞. This was carried out in Subsection 3.5.1 while discussing the consistency of
M -estimators.
121
i.e., the function M (θ) should be strictly smaller than M (θ0 ) on the complement of every neighborhood
of the point θ0 .
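The consistency argument above is easy to see in a simulation. The following sketch (ours; the choice of criterion, the standard normal data, and the grid are illustrative) treats the median as an M-estimator with Mn(θ) = −Pn|X − θ| and M(θ) = −E|X − θ|: the sup-distance ‖Mn − M‖∞ over the grid shrinks with n, and so does the distance between the empirical argmax and θ0 = 0.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
grid = np.linspace(-2, 2, 401)

def M_n(theta, x):
    """Empirical criterion M_n(theta) = -P_n|X - theta| (maximized at the sample median)."""
    return -np.abs(x[None, :] - theta[:, None]).mean(axis=1)

def M(theta):
    """Population criterion M(theta) = -E|X - theta| for X ~ N(0,1); unique max at theta0 = 0."""
    Phi = 0.5 * (1 + erf(theta / np.sqrt(2)))
    phi = np.exp(-theta ** 2 / 2) / np.sqrt(2 * np.pi)
    return -(theta * (2 * Phi - 1) + 2 * phi)

for n in [100, 1_000, 10_000]:
    x = rng.standard_normal(n)
    mn = M_n(grid, x)
    print(n, np.abs(mn - M(grid)).max(), grid[np.argmax(mn)])  # sup-distance and argmax
```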
12.1 Argmax continuous mapping theorems
We state our first argmax continuous mapping theorem below which generalizes the above
discussed setup (so that it can also be used to derive asymptotic distributions of the M -
estimator). Our first result essentially says that the argmax functional is continuous at
functions M that have a well-separated unique maximum.
Theorem 12.1. Let H be a metric space and let {Mn (h), h ∈ H} and {M (h), h ∈ H} be
stochastic processes indexed by H. Suppose the following conditions hold:
1. ĥ is a random element of H which satisfies M(ĥ) > sup_{h∈Gᶜ} M(h) almost surely for every open set G containing ĥ; i.e., M has a unique “well-separated” point of maximum.
2. For each n, ĥn ∈ H is such that Mn(ĥn) ≥ sup_{h∈H} Mn(h) − oP(1).
3. Mn →d M in ℓ∞(H).
Then ĥn →d ĥ in H.
Proof. By the Portmanteau theorem 10.5, to prove ĥn →d ĥ it suffices to show that
    lim sup_{n→∞} P∗(ĥn ∈ F) ≤ P(ĥ ∈ F)   (164)
for every closed subset F of H. Fix a closed set F and note that
{ĥn ∈ F } ⊆ sup Mn (h) ≥ sup Mn (h) − oP (1) .
h∈F h∈H
Therefore,

    P*(ĥ_n ∈ F) ≤ P*( sup_{h∈F} M_n(h) − sup_{h∈H} M_n(h) + o_P(1) ≥ 0 ).

The map sup_{h∈F} M_n(h) − sup_{h∈H} M_n(h) converges in distribution to sup_{h∈F} M(h) − sup_{h∈H} M(h), as M_n →_d M in ℓ^∞(H), by the continuous mapping theorem. We thus have

    lim sup_{n→∞} P*(ĥ_n ∈ F) ≤ P( sup_{h∈F} M(h) ≥ sup_{h∈H} M(h) ),
where we have again used the Portmanteau theorem. The first assumption of the theorem implies that {sup_{h∈F} M(h) ≥ sup_{h∈H} M(h)} ⊆ {ĥ ∈ F} (note that F^c is open). This proves (164).
The idea behind the proof of the above theorem can be used to prove the following
stronger technical lemma.
Lemma 12.2. Let H be a metric space and let {M_n(h) : h ∈ H} and {M(h) : h ∈ H} be stochastic processes indexed by H. Let A and B be arbitrary subsets of H. Suppose the following conditions hold:

1. ĥ is a random element of H which satisfies M(ĥ) > sup_{h∈A∩G^c} M(h) almost surely for every open set G containing ĥ.

2. For each n, ĥ_n ∈ H is such that M_n(ĥ_n) ≥ sup_{h∈H} M_n(h) − o_P(1).
3. M_n →_d M in ℓ^∞(A ∪ B).

Then

    lim sup_{n→∞} P*(ĥ_n ∈ F ∩ A) ≤ P(ĥ ∈ F) + P(ĥ ∈ B^c)                     (165)
for every closed set F .
Observe that Theorem 12.1 is a special case of this lemma which corresponds to A =
B = H.
Proof of Lemma 12.2. The proof is very similar to that of Theorem 12.1. Observe first that

    {ĥ_n ∈ F ∩ A} ⊆ { sup_{h∈F∩A} M_n(h) − sup_{h∈B} M_n(h) + o_P(1) ≥ 0 }.

The term sup_{h∈F∩A} M_n(h) − sup_{h∈B} M_n(h) + o_P(1) converges in distribution to sup_{h∈F∩A} M(h) − sup_{h∈B} M(h) because M_n →_d M in ℓ^∞(A ∪ B). This therefore gives

    lim sup_{n→∞} P*(ĥ_n ∈ F ∩ A) ≤ P( sup_{h∈F∩A} M(h) − sup_{h∈B} M(h) ≥ 0 ).

Now if the event {sup_{h∈F∩A} M(h) ≥ sup_{h∈B} M(h)} holds and if ĥ ∈ B, then sup_{h∈F∩A} M(h) ≥ M(ĥ), which (by condition 1) can only happen if ĥ ∈ F. This means

    P( sup_{h∈F∩A} M(h) − sup_{h∈B} M(h) ≥ 0 ) ≤ P(ĥ ∈ B^c) + P(ĥ ∈ F),

which, combined with the previous display, proves (165).
We next prove a more applicable argmax continuous mapping theorem. The assumption that M_n →_d M in ℓ^∞(H) is too stringent. It is much more reasonable to assume that M_n →_d M in ℓ^∞(K) for every compact subset K of H. The next theorem proves that ĥ_n converges in law to ĥ under this weaker assumption.
As we will be restricting analysis to compact sets in the next theorem, we need to
assume that ĥn and ĥ lie in compact sets with arbitrarily large probability. This condition,
made precise below, will be referred to as the tightness condition:
For every ε > 0, there exists a compact set K ⊆ H such that

    lim sup_{n→∞} P*(ĥ_n ∉ K) ≤ ε   and   P(ĥ ∉ K) ≤ ε.                         (166)
Theorem 12.3 (Argmax continuous mapping theorem). Let H be a metric space and let
{Mn (h) : h ∈ H} and {M (h) : h ∈ H} be stochastic processes indexed by H. Suppose that
the following conditions hold:
1. M_n →_d M in ℓ^∞(K) for every compact subset K of H.

2. Almost all sample paths h ↦ M(h) are upper semicontinuous¹²² (u.s.c.) and possess a unique maximum at a random point ĥ.

3. For each n, ĥ_n is a random element of H such that M_n(ĥ_n) ≥ sup_{h∈H} M_n(h) − o_P(1).

4. The tightness condition (166) holds.

Then ĥ_n →_d ĥ in H.

Proof. Fix a compact set K ⊆ H. We first claim that, almost surely,

    M(ĥ) > sup_{h∈K∩G^c} M(h)

for every open set G containing ĥ. Suppose, for the sake of contradiction, that M(ĥ) = sup_{h∈K∩G^c} M(h) for some open set G containing ĥ. In that case, there exist h_m ∈ K ∩ G^c with M(h_m) → M(ĥ) as m → ∞. Because K ∩ G^c (intersection of a closed set with a compact set) is compact, a subsequence of {h_m} converges, which means that we can assume, without loss of generality, that h_m → h for some h ∈ K ∩ G^c. By the u.s.c. hypothesis, this implies that lim sup_{m→∞} M(h_m) ≤ M(h), which is the same as M(ĥ) ≤ M(h). This implies that ĥ is not a unique maximum (as ĥ ∈ G and h ∈ G^c, we note that ĥ ≠ h). This proves the claim.
We now use Lemma 12.2 with A = B = K (note that M_n →_d M in ℓ^∞(A ∪ B) = ℓ^∞(K)). This gives that for every closed set F, we have

    lim sup_{n→∞} P*(ĥ_n ∈ F) ≤ lim sup_{n→∞} P*(ĥ_n ∈ F ∩ K) + lim sup_{n→∞} P*(ĥ_n ∈ K^c)
                              ≤ P(ĥ ∈ F) + P(ĥ ∈ K^c) + lim sup_{n→∞} P*(ĥ_n ∈ K^c).

The term on the right hand side above can be made smaller than P(ĥ ∈ F) + ε for every ε > 0 by choosing K appropriately (using tightness). An application of the Portmanteau theorem now completes the proof.
¹²² Recall the definition of upper semicontinuity: f is u.s.c. at x_0 if lim sup_{n→∞} f(x_n) ≤ f(x_0) whenever x_n → x_0 as n → ∞.
As a simple consequence of Theorems 12.1 and 12.3, we can prove the following theorem, which is useful for checking consistency of M-estimators. Note that M_n →_d M for a deterministic process M is equivalent to M_n →_P M, and this latter statement is equivalent to sup_{h∈H} |M_n(h) − M(h)| converging to 0 in probability.
Theorem 12.4 (Consistency Theorem). Let Θ be a metric space. For each n ≥ 1, let {M_n(θ) : θ ∈ Θ} be a stochastic process. Also let {M(θ) : θ ∈ Θ} be a deterministic process.

1. Suppose sup_{θ∈Θ} |M_n(θ) − M(θ)| →_P 0 as n → ∞. Also suppose the existence of θ_0 ∈ Θ such that M(θ_0) > sup_{θ∉G} M(θ) for every open set G containing θ_0. Then any sequence of M-estimators θ̂_n (it is enough that M_n(θ̂_n) ≥ sup_{θ∈Θ} M_n(θ) − o_P(1)) converges in probability to θ_0.

2. Suppose sup_{θ∈K} |M_n(θ) − M(θ)| →_P 0 as n → ∞ for every compact subset K of Θ. Suppose also that the deterministic limit process M is upper semicontinuous and has a unique maximum at θ_0. If {θ̂_n} is tight, then θ̂_n converges to θ_0 in probability.
Remark 12.1. For M-estimators, we can apply the above theorem with M_n(θ) := (1/n) Σ_{i=1}^n m_θ(X_i) = P_n[m_θ] and M(θ) := P[m_θ]. In this case, the condition sup_{θ∈K} |M_n(θ) − M(θ)| →_P 0 is equivalent to {m_θ : θ ∈ K} being P-Glivenko-Cantelli.
Theorem 12.3 can also be used to prove asymptotic distribution results for M -estimators,
as illustrated in the following examples.
In this section we present one result that gives the asymptotic distribution of M -estimators
for the case of i.i.d. observations. The formulation is from [van der Vaart, 1998]. The limit distribution of the sequence √n(θ̂_n − θ_0) follows from the following theorem, where θ̂_n is an M-estimator of the finite dimensional parameter θ_0 (i.e., θ̂_n := arg max_{θ∈Θ} M_n(θ) where
Mn (θ) = Pn [mθ (X)]).
Example 12.5 (Parametric maximum likelihood estimators). Suppose X1 , . . . , Xn are i.i.d. from
an unknown density pθ0 belonging to a known class {pθ : θ ∈ Θ ⊆ Rk }. Let θ̂n denote the
maximum likelihood estimator of θ_0. A classical result is that, under some smoothness assumptions, √n(θ̂_n − θ_0) converges in distribution to N_k(0, I^{−1}(θ_0)) where I(θ_0) denotes the Fisher information matrix.
This result can be derived from the argmax continuous mapping theorem. The first step
is to observe that if θ ↦ p_θ(x) is sufficiently smooth at θ_0, then, for any h ∈ R^k,

    Σ_{i=1}^n log [ p_{θ_0+hn^{−1/2}}(X_i) / p_{θ_0}(X_i) ] = h^⊤ (1/√n) Σ_{i=1}^n ℓ̇_{θ_0}(X_i) − (1/2) h^⊤ I(θ_0) h + o_{P_{θ_0}}(1),        (167)
where ℓ̇_{θ_0}(x) := ∇_θ log p_θ(x)|_{θ=θ_0} denotes the score function. Condition (167) is known as the LAN (local asymptotic normality) condition. We shall prove the asymptotic normality of θ̂_n assuming the marginal convergence in (167) (for every fixed h) can be suitably strengthened to a process level result in ℓ^∞(K), for K ⊂ R^k compact. We apply the argmax continuous mapping theorem (Theorem 12.3) with H = R^k,

    M_n(h) := Σ_{i=1}^n log [ p_{θ_0+hn^{−1/2}}(X_i) / p_{θ_0}(X_i) ]   and   M(h) := h^⊤ ∆ − (1/2) h^⊤ I(θ_0) h,

where ∆ ∼ N_k(0, I(θ_0)). Then ĥ_n = √n(θ̂_n − θ_0) and ĥ = arg max_h M(h) = I^{−1}(θ_0)∆ ∼ N_k(0, I^{−1}(θ_0)). The argmax
theorem will then imply the result provided the conditions of the argmax theorem hold. The main condition is tightness of {ĥ_n}, which means that the rate of convergence of θ̂_n to θ_0 is n^{−1/2}.
The above idea can be easily extended to derive the asymptotic distributions of other √n-consistent estimators, e.g., non-linear regression, robust regression, etc. (see [van der Vaart, 1998, Chapter 5] for more details).
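As a quick numerical illustration of Example 12.5 (this sketch is not part of the notes; the exponential model, the sample size, and the number of replications are chosen purely for illustration), the normalized MLE error can be compared with the inverse Fisher information:

    import numpy as np

    rng = np.random.default_rng(0)

    # For X_1, ..., X_n i.i.d. Exp(theta0), with density theta e^{-theta x}, the MLE is
    # theta_hat = 1 / Xbar and the Fisher information is I(theta0) = 1 / theta0^2, so
    # sqrt(n) (theta_hat - theta0) should be approximately N(0, theta0^2).
    theta0, n, reps = 2.0, 2000, 5000
    samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    mle = 1.0 / samples.mean(axis=1)
    z = np.sqrt(n) * (mle - theta0)
    print(z.mean(), z.var(), theta0 ** 2)   # the variance should be close to I(theta0)^{-1} = theta0^2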
Under appropriate smoothness conditions on θ ↦ m_θ (with ṁ_{θ_0} denoting the pointwise derivative of θ ↦ m_θ at θ_0, assumed square integrable, and V the second derivative matrix of θ ↦ P[m_θ] at θ_0, assumed nonsingular), one has

    √n(θ̂_n − θ_0) = −V^{−1} G_n(ṁ_{θ_0}) + o_P(1) →_d N_d(0, V^{−1} P[ṁ_{θ_0} ṁ_{θ_0}^⊤] V^{−1}),

where G_n := √n(P_n − P).
A sketch of the argument is as follows. It suffices to show that

    M̃_n(h) := n P_n(m_{θ_0+hn^{−1/2}} − m_{θ_0}) →_d h^⊤ G(ṁ_{θ_0}) + (1/2) h^⊤ V h =: M(h)   in ℓ^∞({h : ‖h‖ ≤ K})

for every K > 0. Then the conclusion follows from the argmax continuous mapping theorem (Theorem 12.3) upon noticing that ĥ_n = √n(θ̂_n − θ_0) = arg max_h M̃_n(h) and ĥ = arg max_h M(h) = −V^{−1} G(ṁ_{θ_0}) ∼ N_d(0, V^{−1} P[ṁ_{θ_0} ṁ_{θ_0}^⊤] V^{−1}).
Now, observe that

    n P_n(m_{θ_0+hn^{−1/2}} − m_{θ_0}) = √n(P_n − P)[√n(m_{θ_0+hn^{−1/2}} − m_{θ_0})] + n P(m_{θ_0+hn^{−1/2}} − m_{θ_0}).

By the second order Taylor expansion of M(θ) := P[m_θ] about θ_0, the second term on the right side of the last display converges to (1/2) h^⊤ V h uniformly for ‖h‖ ≤ K. To handle the first term we use a Donsker theorem for classes changing with n. The classes

    F_n := { √n(m_{θ_0+hn^{−1/2}} − m_{θ_0}) : ‖h‖ ≤ K }

have envelopes F_n = F = ṁ_{θ_0} for all n, and since ṁ_{θ_0} ∈ L_2(P) the Lindeberg condition is satisfied easily. Furthermore, with

    f_{n,g} := √n(m_{θ_0+gn^{−1/2}} − m_{θ_0}),   f_{n,h} := √n(m_{θ_0+hn^{−1/2}} − m_{θ_0}),

the assumed Lipschitz behavior of θ ↦ m_θ gives P(f_{n,g} − f_{n,h})^2 ≤ ‖g − h‖^2 P[ṁ_{θ_0}^2], so that the mean-square continuity condition of that Donsker theorem also holds.
Finally, the bracketing entropy condition holds since, by way of the same entropy calculations as before, we have

    N_{[ ]}(2ε‖F‖_{P,2}, F_n, L_2(P)) ≤ (CK/ε)^d,   i.e.,   N_{[ ]}(ε, F_n, L_2(P)) ≲ (CK‖F‖_{P,2}/ε)^d.

Thus, J_{[ ]}(δ, F_n, L_2(P)) ≲ ∫_0^δ √(d log(CK/ε)) dε, and hence the bracketing entropy hypothesis of the Donsker theorem holds. We conclude that the first term converges weakly to h^⊤ G(ṁ_{θ_0}) in ℓ^∞({h : ‖h‖ ≤ K}); combined with the deterministic term, this gives M̃_n →_d M, and the desired result holds.
Example 12.7 (Analysis of the shorth). Recall the setup of Example 5.4. Suppose that
X1 , . . . , Xn are i.i.d. P on R with density p with respect to the Lebesgue measure. Let FX be
the distribution function of X. Suppose that p is a unimodal symmetric density with mode
θ_0 (with p'(x) > 0 for x < θ_0 and p'(x) < 0 for x > θ_0). We want to estimate θ_0.
Let

    M(θ) := P[m_θ] = P(|X − θ| ≤ 1) = F_X(θ + 1) − F_X(θ − 1),

where m_θ(x) = 1_{[θ−1,θ+1]}(x). We can show that θ_0 = arg max_{θ∈R} M(θ).
We can estimate θ_0 by θ̂_n := arg max_{θ∈R} M_n(θ), where M_n(θ) := P_n[m_θ].
We have already seen (in Example 5.4) that τ_n := n^{1/3}(θ̂_n − θ_0) = O_p(1). Let us here give a sketch of the limiting distribution of (the normalized version of) θ̂_n. Observe that

    τ_n = arg max_{h∈R} M_n(θ_0 + hn^{−1/3}) = arg max_{h∈R} n^{2/3}[M_n(θ_0 + hn^{−1/3}) − M_n(θ_0)].
The plan is to show that the localized (and properly normalized) stochastic process M̃_n(h) := n^{2/3}[M_n(θ_0 + hn^{−1/3}) − M_n(θ_0)] converges in distribution to “something” so that we can apply the argmax continuous mapping theorem (Theorem 12.3) to deduce the limiting behavior of τ_n.
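To see the cube-root phenomenon of Example 12.7 empirically, here is a small Monte Carlo sketch (not part of the notes); the grid-based maximization, the standard normal data, and all tuning constants are illustrative choices only.

    import numpy as np

    rng = np.random.default_rng(1)

    def shorth_estimate(x, grid_size=2000):
        # maximize M_n(theta) = P_n[|X - theta| <= 1] over a grid of candidate theta values
        x = np.sort(x)
        grid = np.linspace(x[0], x[-1], grid_size)
        counts = (np.searchsorted(x, grid + 1, side="right")
                  - np.searchsorted(x, grid - 1, side="left"))
        return grid[np.argmax(counts)]

    theta0 = 0.0   # mode of the standard normal density
    for n in [500, 5000, 50000]:
        est = np.array([shorth_estimate(rng.normal(theta0, 1.0, n)) for _ in range(200)])
        # n^{1/3}(theta_hat - theta0) should remain of the same order as n grows
        print(n, np.std(n ** (1 / 3) * (est - theta0)))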
13 Concentration Inequalities
Let X_1, . . . , X_n be random variables and let

    Z = f(X_1, . . . , X_n)

be the random variable of interest (e.g., Z = Σ_{i=1}^n X_i). In this section we try to understand: (a) under what conditions does Z concentrate around its mean EZ? (b) how large are typical deviations of Z from EZ? In particular, we seek upper bounds for

    P(Z ≥ EZ + t)   and   P(Z ≤ EZ − t),   for t > 0.
Various approaches have been used over the years to tackle such questions, including mar-
tingale methods, information theoretic methods, logarithmic Sobolev inequalities, etc. We
have already seen many concentration inequalities in this course, e.g., Hoeffding’s inequality,
Bernstein’s inequality, Talagrand’s concentration inequality, etc.
Let X_1, . . . , X_n be independent random variables taking values in X. Let f : X^n → R and let

    Z = f(X_1, . . . , X_n).

Write E_i for the conditional expectation given X_1, . . . , X_i (with E_0 = E), and set ∆_i := E_i Z − E_{i−1} Z, so that Z − EZ = Σ_{i=1}^n ∆_i. Since ∆_i is a function of X_1, . . . , X_i, for j > i we have

    E_i(∆_j ∆_i) = ∆_i E_i(∆_j) = 0,

and hence

    Var(Z) = Σ_{i=1}^n E[∆_i^2].                                               (169)

Until now we have not made any use of the fact that Z is a function of independent variables. Indeed,

    E_i Z = ∫_{X^{n−i}} f(X_1, . . . , X_i, x_{i+1}, . . . , x_n) dµ_{i+1}(x_{i+1}) · · · dµ_n(x_n),        (170)

where, for every j = 1, . . . , n, µ_j denotes the probability distribution of X_j.
Let E^{(i)}(Z) denote the expectation of Z with respect to the i-th variable X_i only, fixing the values of the other variables, i.e.,

    E^{(i)}(Z) = ∫_X f(X_1, . . . , X_{i−1}, x_i, X_{i+1}, . . . , X_n) dµ_i(x_i).
Theorem 13.1 (Efron–Stein inequality). With the above notation,

    Var(Z) ≤ Σ_{i=1}^n E[(Z − E^{(i)}Z)^2] =: v.                               (171)

Moreover, if X_1', . . . , X_n' are independent copies of X_1, . . . , X_n, and if we define, for every i = 1, . . . , n,

    Z_i' = f(X_1, . . . , X_{i−1}, X_i', X_{i+1}, . . . , X_n),                 (172)

then

    v = (1/2) Σ_{i=1}^n E[(Z − Z_i')^2].

Furthermore, v = Σ_{i=1}^n inf_{Z_i} E[(Z − Z_i)^2], where, for each i, the infimum is over all random variables Z_i that are functions of X^{(i)} := (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n) only; in particular, Var(Z) ≤ Σ_{i=1}^n E[(Z − Z_i)^2] for any such choice of the Z_i's.
Proof. Using (170) we may write ∆_i = E_i(Z − E^{(i)}Z). By Jensen's inequality, used conditionally, ∆_i^2 ≤ E_i[(Z − E^{(i)}Z)^2]. Now (169) yields the first claim.
To see the second claim, we use (conditionally) the fact that if X and X' are i.i.d. real-valued random variables, then Var(X) = E[(X − X')^2]/2. Since, conditionally on X^{(i)} := (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n), Z_i' is an independent copy of Z, we may write

    Var^{(i)}(Z) = (1/2) E^{(i)}[(Z − Z_i')^2],

where we have used the fact that the conditional distributions of Z and Z_i' are identical.
The last identity is obtained by recalling that, for any real-valued random variable X, Var(X) = inf_a E[(X − a)^2]. Using this fact conditionally, we have, for every i = 1, . . . , n,

    Var^{(i)}(Z) = inf_{Z_i} E^{(i)}[(Z − Z_i)^2],

where the infimum is over all functions Z_i of X^{(i)}; taking expectations and summing over i completes the proof.
Observe that in the case when Z = Σ_{i=1}^n X_i, the Efron–Stein inequality becomes an equality. Thus, the bound in the Efron–Stein inequality is, in a sense, not improvable.
It is easy to see that if f has the bounded difference property (see (24)) with constants c_1, . . . , c_n, then

    Var(Z) ≤ (1/2) Σ_{i=1}^n c_i^2.
Example 13.2 (Kernel density estimation). Recall Example 3.25. Let X_1, . . . , X_n be i.i.d. from a distribution P on R (the argument can be easily generalized to R^d) with density φ. The kernel density estimator (KDE) of φ is φ̂_n : R → [0, ∞) defined as

    φ̂_n(x) = (1/(n h_n)) Σ_{i=1}^n K((x − X_i)/h_n),

where h_n > 0 is the smoothing bandwidth and K is a nonnegative kernel (i.e., K ≥ 0 and ∫ K(x) dx = 1). The L_1-error of the estimator φ̂_n is

    Z := f(X_1, . . . , X_n) := ∫ |φ̂_n(x) − φ(x)| dx.

We have shown in (26) that f satisfies (24) with c_i = 2/n. Thus the difference Z − Z_i' is, deterministically, upper bounded in absolute value by 2/n, for all i. Thus, an application of the Efron–Stein inequality gives

    Var(Z) ≤ (n/2) (2/n)^2 = 2/n.
It is known that for every φ, √n E(Z_n) → ∞ (we write Z_n instead of Z to emphasize the dependence on n), which implies, by Chebyshev's inequality, for every ε > 0,

    P( |Z_n/E(Z_n) − 1| ≥ ε ) = P( |Z_n − E(Z_n)| ≥ ε E(Z_n) ) ≤ Var(Z_n) / (ε^2 [E(Z_n)]^2) → 0

as n → ∞. Thus, Z_n/E(Z_n) →_p 1, or in other words, Z_n is relatively stable. This means
that the random L1 -error essentially behaves like its expected value.
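The following small simulation sketch (not part of the notes) illustrates the Efron–Stein bound Var(Z) ≤ 2/n for the L_1-error; the Gaussian kernel, the N(0, 1) data, the bandwidth, and the grid-based integration are all illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)

    def l1_error(x, h, grid):
        # L1 distance between a Gaussian-kernel density estimate and the true N(0,1) density,
        # approximated by a Riemann sum on a grid
        n = len(x)
        kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2).sum(axis=1)
        kde /= n * h * np.sqrt(2 * np.pi)
        true = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
        return np.abs(kde - true).sum() * (grid[1] - grid[0])

    n, h = 500, 0.3
    grid = np.linspace(-5.0, 5.0, 1001)
    errors = np.array([l1_error(rng.standard_normal(n), h, grid) for _ in range(300)])
    print("Var(Z) =", errors.var(), "  Efron-Stein bound 2/n =", 2 / n)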
Recall that we can bound E(Z) using empirical process techniques. Let P_n'(A) = (1/n) Σ_{i=1}^n 1_A(X_i'), where X_1', . . . , X_n' (sometimes called a “ghost” sample) are independent copies of X_1, . . . , X_n. Let E' denote the expectation only with respect to X_1', . . . , X_n'. Then

    E[ sup_{A∈A} |P_n(A) − P(A)| ] = E[ sup_{A∈A} |E'[P_n(A) − P_n'(A)]| ]
        ≤ E[ sup_{A∈A} |P_n(A) − P_n'(A)| ] = (1/n) E[ sup_{A∈A} | Σ_{i=1}^n (1_A(X_i) − 1_A(X_i')) | ],
where we have used Jensen's inequality. Next we use symmetrization: if ε_1, . . . , ε_n are independent Rademacher variables, then the last term in the previous display can be bounded (from above) by (Exercise)

    (1/n) E[ sup_{A∈A} | Σ_{i=1}^n (1_A(X_i) − 1_A(X_i')) | ] ≤ (2/n) E[ sup_{A∈A} | Σ_{i=1}^n ε_i 1_A(X_i) | ].

Note that Σ_{i=1}^n ε_i 1_A(X_i) can be thought of as the sample covariance between the ε_i's and the 1_A(X_i)'s. Letting

    R_n := (1/n) E_ε[ sup_{A∈A} | Σ_{i=1}^n ε_i 1_A(X_i) | ],

where E_ε denotes the expectation with respect to the Rademacher variables only, we see that

    E[ sup_{A∈A} |P_n(A) − P(A)| ] ≤ 2 E[R_n].
Observe that R_n is a data dependent quantity, and does not involve the probability measure P. We want to show that R_n is concentrated around its mean E R_n. Define

    R_n^{(i)} := (1/n) E_ε[ sup_{A∈A} | Σ_{j≠i} ε_j 1_A(X_j) | ].

Then one can check (Exercise) that, for every 1 ≤ i ≤ n,

    0 ≤ nR_n − nR_n^{(i)} ≤ 1   and   Σ_{i=1}^n (nR_n − nR_n^{(i)}) ≤ nR_n.        (173)
Random variables with the property (173) are called self-bounding. These random variables have their variances bounded by their means (by the Efron–Stein inequality) and thus are automatically concentrated.
Recall that ∆_n(A, X^n) is the number of distinct sets of the form {X_1, . . . , X_n} ∩ A, A ∈ A. Conditionally on the data, R_n is the expected maximum of ∆_n(A, X^n) sub-Gaussian random variables, so by the maximal inequality,

    R_n ≤ √( log ∆_n(A, X^n) / (2n) ).
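For concreteness, here is a small Monte Carlo sketch (not part of the notes) of R_n for the class of half-lines A = {(−∞, a] : a ∈ R}; the data distribution, the sample size, and the number of Rademacher draws are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)

    def rademacher_average_halflines(x, n_mc=500):
        # Monte Carlo estimate of R_n = (1/n) E_eps sup_a | sum_i eps_i 1{X_i <= a} |.
        # For half-lines, the sets {i : X_i <= a} are exactly the prefixes of the sorted sample.
        n = len(x)
        order = np.argsort(x)
        total = 0.0
        for _ in range(n_mc):
            eps = rng.choice([-1.0, 1.0], size=n)
            prefix = np.concatenate(([0.0], np.cumsum(eps[order])))
            total += np.abs(prefix).max()
        return total / (n_mc * n)

    n = 2000
    x = rng.standard_normal(n)
    # For half-lines, Delta_n(A, X^n) <= n + 1, so R_n should be of order sqrt(log n / n)
    print(rademacher_average_halflines(x), np.sqrt(np.log(n + 1) / n))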
Let V = V(X^n, A) be the size of the largest subset of {X_1, . . . , X_n} shattered by A. Note that V is a random variable. In fact, V is self-bounding. Let V^{(i)} be defined similarly with all the points excluding the i-th point. Then (Exercise), for every 1 ≤ i ≤ n,

    0 ≤ V − V^{(i)} ≤ 1   and   Σ_{i=1}^n (V − V^{(i)}) ≤ V.

Thus, Σ_{i=1}^n (V − V^{(i)})^2 ≤ V, and so by the Efron–Stein inequality, Var(V) ≤ EV.
13.2 Concentration and logarithmic Sobolev inequalities
We start with a brief summary of some basic properties of the (Shannon) entropy of a
random variable. For simplicity, we will only consider discrete-valued random variables.
Definition 13.4 (Shannon entropy). Let X be a random variable taking values in a count-
able set X with probability mass function (p.m.f.) p(x) = P(X = x), x ∈ X . The Shannon
entropy (or just entropy) of X is defined as

    H(X) := E[− log p(X)] = − Σ_{x∈X} p(x) log p(x),                            (174)
where log denotes natural logarithm and 0 log 0 = 0. The entropy can be thought of as the
“uncertainty” in the random variable. Observe that the entropy is obviously nonnegative.
If (X, Y) is a pair of discrete random variables taking values in X × Y, then the joint entropy H(X, Y) of X and Y is defined as the entropy of the pair (X, Y).
For probability distributions P and Q on X with p.m.f.s p and q, recall that the relative entropy (or Kullback–Leibler divergence) is D(P‖Q) := Σ_{x∈X} p(x) log(p(x)/q(x)) when P is absolutely continuous with respect to Q (and +∞ otherwise). It can be shown that D(P‖Q) ≥ 0, and D(P‖Q) = 0 if and only if P = Q. This follows from observing that, when P is absolutely continuous with respect to Q, since log x ≤ x − 1 for all x > 0,

    D(P‖Q) = − Σ_{x∈X: p(x)>0} p(x) log( q(x)/p(x) ) ≥ − Σ_{x∈X: p(x)>0} p(x) ( q(x)/p(x) − 1 ) ≥ 0.
In particular, if X takes values in a finite set X with |X| = N, P denotes the distribution of X and Q is the uniform distribution on X, then D(P‖Q) = log N − H(X), and H(X) ≤ log N (as D(P‖Q) ≥ 0).
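These identities are easy to check numerically; the following sketch (not part of the notes, with an arbitrary illustrative p.m.f.) verifies D(P‖Q) ≥ 0, D(P‖Uniform) = log N − H(X) and H(X) ≤ log N.

    import numpy as np

    def entropy(p):
        # Shannon entropy with natural logarithm, using the convention 0 log 0 = 0
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    def kl(p, q):
        # relative entropy D(P || Q); +infinity if P is not absolutely continuous w.r.t. Q
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        if np.any((p > 0) & (q == 0)):
            return np.inf
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary p.m.f. on N = 4 points
    uniform = np.full(4, 0.25)
    print(kl(p, uniform), np.log(4) - entropy(p))     # the two numbers coincide
    print(entropy(p) <= np.log(4), kl(p, uniform) >= 0)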
Definition 13.6 (Conditional entropy). Consider a pair of random variables (X, Y). The conditional entropy H(X|Y) is defined as

    H(X|Y) := H(X, Y) − H(Y).

Observe that if we write p(x, y) = P(X = x, Y = y) and p(x|y) = P(X = x|Y = y), then

    H(X|Y) = − Σ_{x∈X, y∈Y} p(x, y) log p(x|y) = E[− log p(X|Y)].
Suppose that (X, Y) ∼ P_{X,Y}, X ∼ P_X and Y ∼ P_Y. Noting that D(P_{X,Y} ‖ P_X ⊗ P_Y) = H(X) − H(X|Y), the nonnegativity of the relative entropy implies

    H(X|Y) ≤ H(X).                                                             (175)

The chain rule for entropy says that for random variables X_1, . . . , X_n,

    H(X_1, . . . , X_n) = Σ_{i=1}^n H(X_i | X_1, . . . , X_{i−1}).              (176)

Theorem 13.7 (Han's inequality). Let X_1, . . . , X_n be discrete random variables and, for i = 1, . . . , n, write X^{(i)} := (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n). Then

    H(X_1, . . . , X_n) ≤ (1/(n−1)) Σ_{i=1}^n H(X^{(i)}).

Proof. For every i,

    H(X_1, . . . , X_n) = H(X^{(i)}) + H(X_i | X^{(i)}) ≤ H(X^{(i)}) + H(X_i | X_1, . . . , X_{i−1}),

where we have used (175) conditionally. Now, by summing over all i, using (176) and rearranging, we get the desired result.
Then we have the following result, proved in [Boucheron et al., 2013, Section 4.6].
Theorem 13.8 (Han's inequality for relative entropies). Let P = P_1 ⊗ · · · ⊗ P_n be a product distribution on X^n and let Q be an arbitrary distribution on X^n. For i = 1, . . . , n, let Q^{(i)} and P^{(i)} denote the distributions of (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n) under Q and P, respectively. Then

    D(Q‖P) ≥ (1/(n−1)) Σ_{i=1}^n D(Q^{(i)}‖P^{(i)}),

or equivalently,

    D(Q‖P) ≤ Σ_{i=1}^n [ D(Q‖P) − D(Q^{(i)}‖P^{(i)}) ].
We end this section with a result that will be useful in developing the entropy method,
explained in the next section.
Theorem 13.9 (The expected value minimizes the expected Bregman divergence). Let I ⊂ R be an open interval and let g : I → R be convex and differentiable. For any x, y ∈ I, the Bregman divergence of g from x to y is g(y) − g(x) − g'(x)(y − x). Let X be an I-valued random variable. Then,

    E[g(X) − g(EX) − g'(EX)(X − EX)] = min_{a∈I} E[g(X) − g(a) − g'(a)(X − a)].        (178)

Proof. Let a ∈ I. The expected Bregman divergence from a is E[g(X) − g(a) − g'(a)(X − a)]. The expected Bregman divergence from EX is

    E[g(X) − g(EX) − g'(EX)(X − EX)] = E[g(X)] − g(EX).

Thus, the difference between the expected Bregman divergence from a and the expected Bregman divergence from EX is

    g(EX) − g(a) − g'(a)(EX − a) ≥ 0,

as g is convex.
Definition 13.10 (Entropy of a nonnegative random variable). For a nonnegative random variable Y, define

    Ent(Y) := E[Φ(Y)] − Φ(E[Y]),

where Φ(x) = x log x for x > 0 and Φ(0) = 0. By Jensen's inequality, Ent(Y) ≥ 0.

Remark 13.1. Taking g(x) = x log x in (178) we obtain the following variational formula for entropy:

    Ent(Y) = inf_{u>0} E[ Y(log Y − log u) − (Y − u) ].                          (180)
Theorem 13.11 (Sub-additivity of the entropy). Let X_1, . . . , X_n be independent and let Z = f(X_1, . . . , X_n), where f ≥ 0. Then

    Ent(Z) ≤ Σ_{i=1}^n E[Ent^{(i)}(Z)],                                          (181)

where Ent^{(i)}(Z) denotes the entropy functional computed with respect to the i-th variable X_i only, the other variables being kept fixed.
Proof. (Han's inequality implies the sub-additivity property.) First observe that if the inequality is true for a random variable Z, then it is also true for cZ, where c > 0. Hence we may assume that E(Z) = 1. Now define the probability measure Q on X^n by its p.m.f. q given by

    q(x) = f(x) p(x),   for all x ∈ X^n,

where p denotes the (product) p.m.f. of (X_1, . . . , X_n). Then Ent(Z) = E[Z log Z] = D(Q‖P), which, by Theorem 13.8, does not exceed Σ_{i=1}^n [D(Q‖P) − D(Q^{(i)}‖P^{(i)})]. However, a straightforward calculation shows that this last sum equals Σ_{i=1}^n E[Ent^{(i)}(Z)], which proves (181).
Remark 13.2. The form of the above inequality should remind us of the Efron-Stein in-
equality. In fact, if we take Φ(x) = x2 , then the above display is the exact analogue of the
Efron-Stein inequality.
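For a sanity check of (181) in a toy case, here is a short exact computation (not part of the notes); the two fair coins and the particular nonnegative f below are arbitrary illustrative choices.

    import numpy as np
    from itertools import product

    def ent(values, probs):
        # Ent(Y) = E[Y log Y] - E[Y] log E[Y] for a discrete positive random variable
        values, probs = np.asarray(values, float), np.asarray(probs, float)
        m = np.sum(probs * values)
        return np.sum(probs * values * np.log(values)) - m * np.log(m)

    f = lambda x1, x2: 1.0 + 2.0 * x1 + 3.0 * x1 * x2   # any positive function of (X1, X2)

    points = list(product([0, 1], repeat=2))             # X1, X2 i.i.d. fair coins
    lhs = ent([f(*x) for x in points], [0.25] * 4)        # Ent(Z)

    rhs = 0.0                                             # sum_i E[Ent^{(i)}(Z)]
    for x2 in [0, 1]:
        rhs += 0.5 * ent([f(0, x2), f(1, x2)], [0.5, 0.5])
    for x1 in [0, 1]:
        rhs += 0.5 * ent([f(x1, 0), f(x1, 1)], [0.5, 0.5])

    print(lhs, "<=", rhs)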
Theorem 13.12 (Gaussian logarithmic Sobolev inequality). Let X = (X_1, . . . , X_n) be a vector of n i.i.d. standard normal random variables and let g : R^n → R be a continuously differentiable function. Then

    Ent(g^2(X)) ≤ 2 E[ ‖∇g(X)‖^2 ].
This result can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality; see e.g., [Boucheron et al., 2013, Theorem 5.4].
Theorem 13.13 (Gaussian concentration inequality). Let X = (X_1, . . . , X_n) be a vector of n i.i.d. standard normal random variables and let f : R^n → R be Lipschitz with constant L, i.e., |f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ R^n. Set Z := f(X). Then, for all t > 0,

    P(Z ≥ EZ + t) ≤ e^{−t^2/(2L^2)}   and   P(Z ≤ EZ − t) ≤ e^{−t^2/(2L^2)}.

Proof. By a standard density argument we may assume that f is differentiable with gradient uniformly bounded by L. Using Theorem 13.12 for the function e^{λf/2}, we obtain

    Ent(e^{λf(X)}) ≤ 2 E[ ‖∇ e^{λf(X)/2}‖^2 ] = (λ^2/2) E[ e^{λf(X)} ‖∇f(X)‖^2 ] ≤ (λ^2 L^2/2) E[ e^{λf(X)} ].
Define F(λ) := E[e^{λf(X)}] = E[e^{λZ}] and note that E[λZ e^{λZ}] = λF'(λ), so that Ent(e^{λZ}) = λF'(λ) − F(λ) log F(λ). Thus, we obtain the differential inequality (this is usually referred to as Herbst's argument)

    λF'(λ) − F(λ) log F(λ) ≤ (λ^2 L^2/2) F(λ).                                   (182)
To solve (182), divide both sides by the positive number λ^2 F(λ). Defining G(λ) := log F(λ), we observe that the left-hand side is just the derivative of G(λ)/λ. Thus, we obtain the inequality

    d/dλ [ G(λ)/λ ] ≤ L^2/2.
By l'Hospital's rule we note that lim_{λ→0} G(λ)/λ = F'(0)/F(0) = EZ. If λ > 0, by integrating the inequality between 0 and λ, we get G(λ)/λ ≤ EZ + λL^2/2, or in other words,

    F(λ) ≤ e^{λEZ + λ^2 L^2/2}.                                                  (183)
By Markov's inequality, for every t > 0,

    P(Z ≥ EZ + t) ≤ e^{−λ(EZ+t)} F(λ) ≤ e^{−λt + λ^2 L^2/2} = e^{−t^2/(2L^2)},

where λ = t/L^2 minimizes the obtained upper bound. Similarly, if λ < 0, we may integrate the obtained upper bound for the derivative of G(λ)/λ between λ and 0 to obtain the same bound as in (183), which implies the required bound for the left-tail inequality P(Z < EZ − t).
Remark 13.4. An important feature of this bound is that the right-hand side does not depend on the dimension n.
Example 13.14 (Supremum of a Gaussian process). Let (Xt )t∈T be an almost surely con-
tinuous centered Gaussian process indexed by a totally bounded set T . Let Z = supt∈T Xt .
If

    σ^2 := sup_{t∈T} E(X_t^2),

then

    P(|Z − E(Z)| ≥ u) ≤ 2 e^{−u^2/(2σ^2)}.
Let us first assume that T is a finite set (the extension to arbitrary totally bounded T is
based on a separability argument and monotone convergence; see [Boucheron et al., 2013,
Exercise 5.14]). We may assume, without loss of generality, that T = {1, . . . , n}. Let Γ
be the covariance matrix of the centered Gaussian vector X = (X1 , . . . , Xn ). Denote by
A the square root of the positive semidefinite matrix Γ. If Y = (Y1 , . . . , Yn ) is a vector of
i.i.d. standard normal random variables, then

    f(Y) = max_{i=1,...,n} (AY)_i

has the same distribution as max_{i=1,...,n} X_i. Hence we can apply the Gaussian concentration inequality once we bound the Lipschitz constant of f. By the Cauchy–Schwarz inequality, for all u, v ∈ R^n and i = 1, . . . , n,

    |(Au)_i − (Av)_i| = | Σ_j A_{ij}(u_j − v_j) | ≤ ( Σ_j A_{ij}^2 )^{1/2} ‖u − v‖,

and Σ_j A_{ij}^2 = Γ_{ii} = E(X_i^2) ≤ σ^2. Therefore, f is Lipschitz with constant σ and the tail bound follows from the Gaussian concentration inequality.
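The bound is easy to visualize by simulation; the following sketch (not part of the notes) uses a finite index set with an arbitrary covariance matrix, a Cholesky factor as the square root of Γ, and Monte Carlo tail frequencies, all of which are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(4)

    n = 50
    B = rng.standard_normal((n, n))
    Gamma = B @ B.T / n                                  # a covariance matrix for (X_1, ..., X_n)
    sigma2 = Gamma.diagonal().max()                      # sigma^2 = max_i E[X_i^2]
    A = np.linalg.cholesky(Gamma + 1e-10 * np.eye(n))    # a square root of Gamma

    Z = (A @ rng.standard_normal((n, 100000))).max(axis=0)   # draws of Z = max_i X_i
    for u in [1.0, 2.0, 3.0]:
        # empirical two-sided tail vs. the bound 2 exp(-u^2 / (2 sigma^2))
        print(u, np.mean(np.abs(Z - Z.mean()) >= u), 2 * np.exp(-u ** 2 / (2 * sigma2)))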
In the previous subsections we have used the log-Sobolev inequalities quite effectively to derive concentration results when the underlying distribution is Gaussian or Bernoulli. Here we try to extend these results to hold under more general distributional assumptions. Observe, however, that (181) holds for any distribution. We state a result in that direction; see e.g., [Boucheron et al., 2013, Theorem 6.6].
Theorem 13.15 (A modified logarithmic Sobolev inequality). Let φ(x) = e^x − x − 1. Then for all λ ∈ R,

    λE(Z e^{λZ}) − E(e^{λZ}) log E(e^{λZ}) ≤ Σ_{i=1}^n E[ e^{λZ} φ(−λ(Z − Z_i)) ],        (184)

where, for each i = 1, . . . , n, Z_i denotes an arbitrary measurable function of X^{(i)} = (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n).
Proof. We bound each term on the right-hand side of the sub-additivity of entropy (Theorem 13.11), applied to e^{λZ}. To do this we will apply (180) conditionally. If Y_i is a positive function of the random variables X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n, then

    E^{(i)}(Y log Y) − (E^{(i)}Y) log(E^{(i)}Y) ≤ E^{(i)}[ Y(log Y − log Y_i) − (Y − Y_i) ].

Applying the above inequality to the variables Y = e^{λZ} and Y_i = e^{λZ_i}, we obtain

    E^{(i)}(Y log Y) − (E^{(i)}Y) log(E^{(i)}Y) ≤ E^{(i)}[ e^{λZ} φ(−λ(Z − Z_i)) ].

Summing over i = 1, . . . , n and using Theorem 13.11 yields (184).
Theorem 13.16. Let Z_i, i = 1, . . . , n, be measurable functions of X^{(i)} with Z_i ≤ Z. Assume that Z is such that there exists a constant v > 0 for which, almost surely,

    Σ_{i=1}^n (Z − Z_i)^2 ≤ v.

Then, for all t > 0,

    P(Z > EZ + t) ≤ e^{−t^2/(2v)}.
Proof. The result follows from the modified logarithmic Sobolev inequality. Observe that for x > 0, φ(−x) ≤ x^2/2. Thus, for λ > 0 (recall that Z − Z_i ≥ 0), Theorem 13.15 implies

    λE(Z e^{λZ}) − E(e^{λZ}) log E(e^{λZ}) ≤ E[ e^{λZ} (λ^2/2) Σ_{i=1}^n (Z − Z_i)^2 ] ≤ (λ^2 v/2) E(e^{λZ}).
The obtained inequality has the same form as (182), and the proof may be finished in an identical way (using Herbst's argument).
If the condition Σ_{i=1}^n (Z − Z_i)^2 ≤ v holds almost surely both for

    Z_i = inf_{x_i'} f(X_1, . . . , X_{i−1}, x_i', X_{i+1}, . . . , X_n)   and   Z_i = sup_{x_i'} f(X_1, . . . , X_{i−1}, x_i', X_{i+1}, . . . , X_n),

one has the two-sided inequality

    P(|Z − EZ| > t) ≤ 2 e^{−t^2/(2v)}.
Example 13.17 (The largest eigenvalue of a symmetric matrix). Let A = (Xij )n×n be
a symmetric random matrix, with Xij , i ≤ j, being independent random variables (not
necessarily identically distributed) with |X_{ij}| ≤ 1. Let

    Z := λ_max(A) = sup_{u∈R^n: ‖u‖=1} u^⊤ A u

be the largest eigenvalue of A, and suppose that v is a unit vector such that Z = v^⊤ A v. Let A'_{ij} be the symmetric matrix obtained from A by replacing X_{ij} (and X_{ji}) by x'_{ij} ∈ [−1, 1] (and keeping the other entries fixed), and let Z'_{ij} := λ_max(A'_{ij}). Then,

    (Z − Z'_{ij})_+ ≤ ( v^⊤(A − A'_{ij})v )_+ ≤ 2|v_i v_j| |X_{ij} − x'_{ij}| ≤ 4|v_i v_j|,

and hence

    Σ_{1≤i≤j≤n} (Z − Z'_{ij})_+^2 ≤ 16 Σ_{1≤i≤j≤n} v_i^2 v_j^2 ≤ 16 ( Σ_{i=1}^n v_i^2 )^2 = 16.

This example shows that if we want to bound (Z − Z'_{ij})_+^2 individually, we get an upper bound of 4 and the usual bounded differences inequality does not lead to a good concentration bound. But this (stronger) version of the bounded difference inequality yields a much stronger result. Note that the above result applies to the adjacency matrix of a random graph if the edges are sampled independently.
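A quick simulation sketch (not part of the notes) makes the dimension-free concentration of the largest eigenvalue visible; the Uniform[−1, 1] entries and the matrix sizes are illustrative choices only.

    import numpy as np

    rng = np.random.default_rng(5)

    def lambda_max(n):
        # symmetric matrix with independent entries in [-1, 1] on and above the diagonal
        U = rng.uniform(-1.0, 1.0, size=(n, n))
        A = np.triu(U) + np.triu(U, 1).T
        return np.linalg.eigvalsh(A)[-1]

    for n in [50, 100, 200]:
        Z = np.array([lambda_max(n) for _ in range(1000)])
        # the spread of Z stays of constant order even though EZ grows with n
        print(n, Z.mean(), Z.std())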
The location of the distribution of the norm ‖G_n‖_F of the empirical process depends strongly on the complexity of the class of functions F. We have seen various bounds on its mean value E‖G_n‖_F in terms of entropy integrals. It turns out that the spread or “concentration” of the distribution hardly depends on the complexity of the class F. It is sub-Gaussian as soon as the class F is uniformly bounded, no matter its complexity.
Theorem 13.18. If F is a class of measurable functions f : X → R such that |f(x) − f(y)| ≤ 1 for every f ∈ F and every x, y ∈ X, then, for all t > 0,

    P*( | ‖G_n‖_F − E‖G_n‖_F | ≥ t ) ≤ 2 e^{−2t^2}   and   P*( | sup_{f∈F} G_n f − E sup_{f∈F} G_n f | ≥ t ) ≤ 2 e^{−2t^2}.
This theorem is a special case of the bounded difference inequality; both the norm ‖G_n‖_F and the supremum sup_{f∈F} G_n f satisfy the bounded difference property (24) with c_i equal to n^{−1/2} times the supremum over f of the range of f (in fact, we can take c_i = n^{−1/2}).
References
[Boucheron et al., 2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration
inequalities. Oxford University Press, Oxford. A nonasymptotic theory of independence,
With a foreword by Michel Ledoux.
[Cantelli, 1933] Cantelli, F. (1933). Sulla determinazione empirica delle leggi di probabilità.
Giorn. Ist. Ital. Attuari, 4:421–424.
[Chernozhukov et al., 2014] Chernozhukov, V., Chetverikov, D., and Kato, K. (2014).
Gaussian approximation of suprema of empirical processes. Ann. Statist., 42(4):1564–
1597.
[de la Peña and Giné, 1999] de la Peña, V. H. and Giné, E. (1999). Decoupling. Probabil-
ity and its Applications (New York). Springer-Verlag, New York. From dependence to
independence, Randomly stopped processes. U -statistics and processes. Martingales and
beyond.
[Dudley, 1978] Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann.
Probab., 6(6):899–929 (1979).
[Dvoretzky et al., 1956] Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic
minimax character of the sample distribution function and of the classical multinomial
estimator. Ann. Math. Statist., 27:642–669.
[Einmahl and Mason, 2005] Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth
consistency of kernel-type function estimators. Ann. Statist., 33(3):1380–1403.
[Giné and Nickl, 2016] Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-
dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathe-
matics, [40]. Cambridge University Press, New York.
[Glivenko, 1933] Glivenko, V. (1933). Sulla determinazione empirica delle leggi di proba-
bilità. Giorn. Ist. Ital. Attuari, 4:92–99.
[Greenshtein and Ritov, 2004] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-
dimensional linear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.
[Hjort and Pollard, 2011] Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers
of convex processes. arXiv preprint arXiv:1107.3806.
[Kiefer, 1961] Kiefer, J. (1961). On large deviations of the empiric D. F. of vector chance
variables and a law of the iterated logarithm. Pacific J. Math., 11:649–660.
[Kim and Pollard, 1990] Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann.
Statist., 18(1):191–219.
[Klein and Rio, 2005] Klein, T. and Rio, E. (2005). Concentration around the mean for
maxima of empirical processes. Ann. Probab., 33(3):1060–1077.
[Ledoux and Talagrand, 1991] Ledoux, M. and Talagrand, M. (1991). Probability in Banach
spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in
Mathematics and Related Areas (3)]. Springer-Verlag, Berlin. Isoperimetry and processes.
[Pollard, 1989] Pollard, D. (1989). Asymptotics via empirical processes. Statist. Sci.,
4(4):341–366. With comments and a rejoinder by the author.
[Robertson et al., 1988] Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order
restricted statistical inference. Wiley Series in Probability and Mathematical Statistics:
Probability and Mathematical Statistics. John Wiley & Sons, Ltd., Chichester.
[Talagrand, 1994] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical pro-
cesses. Ann. Probab., 22(1):28–76.
[van de Geer, 2000] van de Geer, S. A. (2000). Applications of empirical process theory,
volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, Cambridge.
[van der Vaart and Wellner, 2011] van der Vaart, A. and Wellner, J. A. (2011). A local
maximal inequality under uniform entropy. Electron. J. Stat., 5:192–203.
[van der Vaart, 1998] van der Vaart, A. W. (1998). Asymptotic statistics, volume 3 of
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, Cambridge.
[van der Vaart and Wellner, 1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak
convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New
York. With applications to statistics.