
A Gentle Introduction to Empirical Process Theory and Applications

Bodhisattva Sen

July 19, 2022

Contents

1 Introduction to empirical processes
1.1 Notation
1.2 M-estimation (or empirical risk minimization)
1.3 Why study weak convergence of stochastic processes?
1.4 Asymptotic equicontinuity

2 Size/complexity of a function class
2.1 Covering numbers
2.2 Bracketing numbers

3 Glivenko-Cantelli (GC) classes of functions
3.1 GC by bracketing
3.2 Preliminaries
3.2.1 Hoeffding's inequality for the sample mean
3.2.2 Sub-Gaussian random variables/processes
3.3 Symmetrization
3.4 Proof of GC by entropy
3.5 Applications
3.5.1 Consistency of M/Z-estimators
3.5.2 Consistency of least squares regression
3.6 Bounded differences inequality — a simple concentration inequality
3.7 Supremum of the empirical process for a bounded class of functions

4 Chaining and uniform entropy
4.1 Dudley's bound for the supremum of a sub-Gaussian process
4.1.1 Dudley's bound when the metric space is separable
4.2 Maximal inequality with uniform entropy
4.3 Maximal inequalities with bracketing
4.4 Bracketing number for some function classes

5 Rates of convergence of M-estimators
5.1 The rate theorem
5.2 Some examples
5.2.1 Euclidean parameter
5.2.2 A non-standard example
5.2.3 Persistency in high-dimensional regression

6 Rates of convergence of infinite dimensional parameters
6.1 Least squares regression on sieves
6.2 Least squares regression: a finite sample inequality
6.3 Oracle inequalities
6.3.1 Best sparse linear regression
6.4 Density estimation via maximum likelihood

7 Vapnik-Červonenkis (VC) classes of sets/functions
7.1 VC classes of Boolean functions
7.2 Covering number bound for VC classes of sets
7.3 VC classes of functions
7.4 Examples and Permanence Properties
7.5 Exponential tail bounds: some useful inequalities

8 Talagrand's concentration inequality for the suprema of the empirical process
8.1 Preliminaries
8.2 Talagrand's concentration inequality
8.3 Empirical risk minimization and concentration inequalities
8.3.1 A formal result on excess risk in ERM
8.3.2 Excess risk in bounded regression
8.4 Kernel density estimation

9 Review of weak convergence in complete separable metric spaces
9.1 Weak convergence of random vectors in R^d
9.2 Weak convergence in metric spaces and the continuous mapping theorem
9.2.1 When T = B(T), the Borel σ-field of T
9.2.2 The general continuous mapping theorem
9.3 Weak convergence in the space C[0, 1]
9.3.1 Tightness and relative compactness
9.3.2 Tightness and weak convergence in C[0, 1]
9.4 Non-measurability of the empirical process
9.5 D[0, 1] with the ball σ-field

10 Weak convergence in non-separable metric spaces
10.1 Bounded stochastic processes
10.2 Spaces of locally bounded functions

11 Donsker classes of functions
11.1 Donsker classes under bracketing condition
11.2 Donsker classes with uniform covering numbers
11.3 Donsker theorem for classes changing with sample size

12 Limiting distribution of M-estimators
12.1 Argmax continuous mapping theorems
12.2 Asymptotic distribution
12.3 A non-standard example

13 Concentration Inequalities
13.1 Efron-Stein inequality
13.2 Concentration and logarithmic Sobolev inequalities
13.3 The Entropy method
13.4 Gaussian concentration inequality
13.5 Bounded differences inequality revisited
13.6 Suprema of the empirical process: exponential inequalities
Abstract

This document provides an introduction to the theory of empirical processes. The standard references on this topic (e.g., [van der Vaart and Wellner, 1996]) usually develop all the abstract concepts in detail before they address the statistical applications. Although this is certainly the right approach to provide a rigorous treatment of the applications, I believe that this has somewhat hindered some graduate students from appreciating the usefulness of the topic. In this set of lecture notes, I try to address a few statistical applications at the end of every section and do not go into the rigorous treatment of certain topics to make the material more accessible to a broader audience. This document arose from the lecture notes that I delivered at Stanford in Spring 2017.

As most graduate students in statistics nowadays are not necessarily exposed to the theory of weak convergence of stochastic processes (e.g., in the space C[0, 1] or D[0, 1]), this document tries to give the reader a brief overview of this classical theory (in Section 9). I hope this will make the transition to the theory of weak convergence on abstract spaces smoother.

I would like to thank Aditya Guntuboyina for several helpful discussions and for sharing his lecture notes on this subject (indeed, the treatment of some of the topics in this document is taken from his lecture notes1). I am thankful to Axel Munk and Tobias Kley2 (who discussed some of these lecture notes in one of their seminar classes) and the students3 in their class, and to the members of the 'Empirical Processes Reading Group' coordinated by Chao Zheng4 at the University of Southampton, for pointing out numerous typos, inconsistencies, etc. in the notes. I am also thankful to Chaowen Zheng (University of York), Myoung-Jin Keay (South Dakota State), Huiyuan Wang (Peking University), and Zhen Huang (Columbia University) for pointing out further typos and inconsistencies.

Many of the examples given in this document are borrowed from the following books: [Giné and Nickl, 2016], [Koltchinskii, 2011], [Pollard, 1984], [Wainwright, 2019], [van de Geer, 2000], [van der Vaart, 1998], [van der Vaart and Wellner, 1996].

1. see https://www.stat.berkeley.edu/~aditya/resources/FullNotes210BSpring2018.pdf
2. and Shayan Hundrieser, Marcel Klatt, and Thomas Staudt
3. Tobias W. Wegel, Erik Pudelko, Jan N. Dühmert, Meggie Marschner, Antonia Seifrid, Jana Böhm, Robin Requadt, Oliver D. Gauselmann, Tobias Weber, Huaiqing Gou, Leo H. Lehmann, Michel Groppe
4. http://www.personal.soton.ac.uk/cz1y20/Reading_Group/ep-group.html

1 Introduction to empirical processes

In this chapter we introduce the main object of study (i.e., empirical processes), highlight
the main questions we would like to answer, give a few historically important statistical
applications that motivated the development of the field, and lay down some of the broad
questions that we plan to investigate in this course.
Empirical process theory began in the 1930’s and 1940’s with the study of the empirical
distribution function and the corresponding empirical process.
If X1, . . . , Xn are i.i.d. real-valued random variables5 (r.v.'s) with cumulative distribution function (c.d.f.) F, then the empirical distribution function (e.d.f.) Fn : R → [0, 1] is defined as

Fn(x) := (1/n) Σ_{i=1}^n 1_(−∞,x](Xi),  for x ∈ R.   (1)

In other words, for each x ∈ R, the quantity nFn (x) simply counts the number of Xi ’s that
are less than or equal to x. The e.d.f. is a natural unbiased (i.e., E[Fn (x)] = F (x) for all
x ∈ R) estimator of F . The corresponding empirical process is

Gn(x) := √n (Fn(x) − F(x)),  for x ∈ R.   (2)

Note that both Fn and Gn are stochastic processes6 (i.e., random functions) indexed
by the real line. By the strong law of large numbers (SLLN), for every x ∈ R, we can say
that
Fn(x) → F(x) almost surely (a.s.), as n → ∞.

Also, by the central limit theorem (CLT), for each x ∈ R, we have

Gn(x) →d N(0, F(x)(1 − F(x))),  as n → ∞.

Two of the basic results in empirical process theory concerning Fn and Gn are the Glivenko-Cantelli and Donsker theorems. These results strengthen the above two pointwise statements to statements that hold simultaneously over all x.
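To see the uniform statement in action, the following small simulation sketch (an illustration added here, not part of the original development; it assumes NumPy and SciPy are available and takes F to be the standard normal c.d.f.) computes ‖Fn − F‖∞ for growing n. Since Fn is piecewise constant, the supremum over x ∈ R is attained at the order statistics, so a finite maximum suffices; multiplying the result by √n would give a realization of sup_x |Gn(x)|.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sup_distance(sample, cdf):
    """Compute sup_x |F_n(x) - F(x)|; the supremum is attained at the order statistics."""
    x = np.sort(sample)
    n = len(x)
    F = cdf(x)
    upper = (np.arange(1, n + 1) / n - F).max()   # F_n(X_(i)) - F(X_(i))
    lower = (F - np.arange(0, n) / n).max()       # F(X_(i)) - F_n(X_(i)-)
    return max(upper, lower)

for n in [100, 1000, 10000, 100000]:
    print(n, sup_distance(rng.normal(size=n), norm.cdf))   # decreases with n, cf. Theorem 1.1
```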
5. We will assume that all the random variables are defined on the probability space (Ω, A, P). Recall the following definitions. A σ-field A (also called σ-algebra) on a set Ω is a collection of subsets of Ω that includes the empty set, is closed under complementation, and is closed under countable unions and countable intersections. The pair (Ω, A) is called a measurable space.
If C is an arbitrary class of subsets of S, there is a smallest σ-field in S containing C, denoted by σ(C) and called the σ-field generated (or induced) by C.
A metric (or topological) space S will usually be endowed with its Borel σ-field B(S) — the σ-field generated by its topology (i.e., the collection of all open sets in S). The elements of B(S) are called Borel sets.
6. Fix a measurable space (S, S), an index set I, and a subset V ⊂ S^I. Then a function W : Ω → V is called an S-valued stochastic process on I with paths in V if and only if Wt : Ω → S is S-measurable for every t ∈ I.

Theorem 1.1 ([Glivenko, 1933], [Cantelli, 1933]).

‖Fn − F‖∞ := sup_{x∈R} |Fn(x) − F(x)| → 0 almost surely.

Theorem 1.2 ([Donsker, 1952]).

Gn →d U(F) in D([−∞, ∞])7,

where U is the standard Brownian bridge process8 on [0, 1].

Question: Why are uniform convergence results interesting and important?


Let us motivate this by outlining a typical application of Theorem 1.1. In statistical settings, a typical use of the e.d.f. is to construct estimators of various quantities associated with the population c.d.f. Many such estimation problems can be formulated in terms of a functional γ that maps any c.d.f. F to a real number γ(F), i.e., F ↦ γ(F)9. Given a set of samples distributed according to F, the plug-in principle suggests replacing the unknown F with the e.d.f. Fn, thereby obtaining γ(Fn) as an estimate of γ(F).
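As a minimal sketch of the plug-in principle (added for illustration only; the choices g(x) = cos(x) and F = Exp(1), for which γg(F) = 1/2, are assumptions of this example), the expectation functional γg(F) = ∫ g dF is estimated by γg(Fn) = (1/n) Σ_{i=1}^n g(Xi):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(size=5000)   # X_1, ..., X_n i.i.d. from F = Exp(1)

plug_in = np.cos(x).mean()       # gamma_g(F_n) = (1/n) * sum cos(X_i)
exact = 0.5                      # gamma_g(F) = E[cos(X)] = 1/2 for Exp(1)
print(plug_in, exact)
```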
For any plug-in estimator γ(Fn), an important question is to understand when it is consistent, i.e., when does γ(Fn) converge to γ(F) in probability (or almost surely)? This question can be addressed in a unified manner for many functionals by defining a notion of continuity. Given a pair of c.d.f.'s F and G, let us measure the distance between them using the sup-norm

‖G − F‖∞ := sup_{x∈R} |G(x) − F(x)|.

We can then define the continuity of a functional γ with respect to this norm: more precisely, we say that the functional γ is continuous at F in the sup-norm if, for all ε > 0, there exists a δ > 0 such that ‖G − F‖∞ ≤ δ implies |γ(G) − γ(F)| ≤ ε. This notion is useful because, for any continuous functional, it reduces the consistency question for the plug-in estimator γ(Fn) to the issue of whether or not the r.v. ‖Fn − F‖∞ converges to zero.

7. The above notion of weak convergence has not been properly defined yet. D([−∞, ∞]) denotes the space of cadlag functions on [−∞, ∞] (from the French "continue à droite, limite à gauche": right continuous at each point with a left limit existing at each point). Heuristically speaking, we would say that a sequence of stochastic processes {Zn} (as elements of D([−∞, ∞])) converges in distribution to a stochastic process Z in D([−∞, ∞]) if E[g(Zn)] → E[g(Z)], as n → ∞, for any bounded and continuous function g : D([−∞, ∞]) → R (ignoring measurability issues).
8. In short, U is a zero-mean Gaussian process on [0, 1] with covariance function E[U(s)U(t)] = s ∧ t − st, for s, t ∈ [0, 1]. To be more precise, the Brownian bridge process U is characterized by the following three properties:
1. U(0) = U(1) = 0. For every t ∈ (0, 1), U(t) is a random variable.
2. For every k ≥ 1 and t1, . . . , tk ∈ (0, 1), the random vector (U(t1), . . . , U(tk)) has the Nk(0, Σ) distribution, where Σ(i, j) := ti ∧ tj − ti tj.
3. The function t ↦ U(t) is (almost surely) continuous on [0, 1].
9. For example, given some integrable function g : R → R, we may be interested in the expectation functional γg defined via γg(F) := ∫ g(x) dF(x).
In this course we are going to substantially generalize Theorems 1.1 and 1.2. But before we start on this endeavor, let us ask ourselves why we need generalizations of such results. The following subsection addresses this.

1.1 Notation

The need for generalizations of Theorems 1.1 and 1.2 became apparent in the 1950's and 1960's. In particular, it became clear that when the observations take values in a more general space X (such as R^d, a Riemannian manifold, or some space of functions, etc.), the e.d.f. is no longer as natural. It becomes much more natural to consider the empirical measure Pn indexed by some class F of real-valued functions defined on X.
Suppose now that X1, . . . , Xn are i.i.d. P on X. Then the empirical measure Pn is defined by

Pn := (1/n) Σ_{i=1}^n δ_{Xi},
where δx denotes the Dirac measure at x. For each n ≥ 1, Pn denotes the random discrete
probability measure which puts mass 1/n at each of the n points X1 , . . . , Xn . Thus, for any
Borel set A ⊂ X,

Pn(A) := (1/n) Σ_{i=1}^n 1_A(Xi) = #{i ≤ n : Xi ∈ A} / n.
For a real-valued function f on X, we write

Pn(f) := ∫ f dPn = (1/n) Σ_{i=1}^n f(Xi).

If F is a collection of real-valued functions defined on X, then {Pn(f) : f ∈ F} is the empirical measure indexed by F. Note that the empirical measure indexed by F is a direct generalization of the e.d.f. in (1): by considering

F := {1_(−∞,x](·) : x ∈ R},  we have  {Pn(f) : f ∈ F} ≡ {Fn(x) : x ∈ R}.
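A tiny numerical sketch of the empirical measure indexed by a (finite) class of functions (added here for illustration; P = Unif(0, 1) and the grid of thresholds are arbitrary choices): for F = {1_(−∞,t] : t in a grid} we have P f_t = t, and the maximal deviation over the class can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(size=2000)                        # X_1, ..., X_n i.i.d. P = Unif(0, 1)

ts = np.linspace(0.05, 0.95, 19)                  # finite class F = {f_t = 1_(-inf, t]}
Pn_f = np.array([(x <= t).mean() for t in ts])    # P_n f_t
P_f = ts                                          # P f_t = t
print(np.abs(Pn_f - P_f).max())                   # max_t |P_n f_t - P f_t|
```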

Let us assume that10

P f := ∫ f dP

exists for each f ∈ F. The empirical process Gn is defined by

Gn := √n (Pn − P),
10. We will use this operator notation for the integral of any function f with respect to P. Note that such a notation is helpful (and preferable over the expectation notation) as then we can even treat random (data dependent) functions.

and the collection of random variables {Gn (f ) : f ∈ F} as f varies over F is called the
empirical process indexed by F. Note that the classical empirical process (in (2)) for real-
valued r.v.’s can again be viewed as the special case of the general theory for which X = R,
F = {1(−∞,x] (·) : x ∈ R}.
The goal of empirical process theory is to study the properties of the approximation
of P f by Pn f , uniformly in F. Traditionally, we would be concerned with probability
estimates of the random quantity

‖Pn − P‖_F := sup_{f∈F} |Pn f − P f|   (3)

and probabilistic limit theorems for the processes

{√n (Pn − P)f : f ∈ F}.

In particular, we will find appropriate conditions to answer the following two questions
(which will extend Theorems 1.1 and 1.2):

1. Glivenko-Cantelli: Under what conditions on F does ‖Pn − P‖_F converge to zero almost surely (or in probability)? If this convergence holds, then we say that F is a P-Glivenko-Cantelli class of functions.

2. Donsker: Under what conditions on F does {Gn(f) : f ∈ F} converge as a process to some limiting object as n → ∞? If this convergence holds, then we say that F is a P-Donsker class of functions.

Our main findings reveal that the answers (to the two questions above and more) depend crucially on the complexity11 or size of the underlying function class F. However, the scope of empirical process theory goes well beyond answering the above two questions.
In the last 20 years there has been enormous interest in understanding the concentration12 properties of ‖Pn − P‖_F about its mean. In particular, one may ask if we can obtain finite sample (exponential) inequalities for the difference ‖Pn − P‖_F − E‖Pn − P‖_F (when F is uniformly bounded) in terms of the class of functions F and the common distribution P of X1, X2, . . . , Xn. Talagrand's inequality ([Talagrand, 1996a]) gives an affirmative answer to this question; a result that is considered to be one of the most important and powerful results in the theory of empirical processes in the last 30 years. We will cover this topic towards the end of the course (if time permits).
11. We will consider different geometric (packing and covering numbers) and combinatorial (shattering and combinatorial dimension) notions of complexity.
12. Often we need to show that a random quantity g(X1, . . . , Xn) is close to its mean µ(g) := E[g(X1, . . . , Xn)]. That is, we want a result of the form P(|g(X1, . . . , Xn) − µ(g)| ≥ ε) ≤ δ, for suitable ε and δ. Such results are known as concentration of measure. These results are fundamental for establishing performance guarantees of many algorithms.

The following section introduces the topic of M -estimation (also known as empirical
risk minimization), a field that naturally relies on the study of empirical processes.

1.2 M -estimation (or empirical risk minimization)

Many problems in statistics and machine learning are concerned with estimators of the form

θ̂n := argmax_{θ∈Θ} Pn[mθ] = argmax_{θ∈Θ} (1/n) Σ_{i=1}^n mθ(Xi),   (4)

where X, X1, . . . , Xn denote (i.i.d.) observations from P taking values in a space X. Here Θ denotes the parameter space and, for each θ ∈ Θ, mθ denotes a real-valued (loss) function on X. Such a quantity θ̂n is called an M-estimator as it is obtained by maximizing (or minimizing) an objective function. The map

θ ↦ −Pn mθ = −(1/n) Σ_{i=1}^n mθ(Xi)

can be thought of as the “empirical risk” and θ̂n denotes the empirical risk minimizer over
θ ∈ Θ. Here are some examples:

1. Maximum likelihood estimators: These correspond to mθ (x) = log pθ (x).

2. Location estimators:

(a) Median: corresponds to mθ (x) = |x − θ|.


(b) Mode: may correspond to mθ (x) = 1{|x − θ| ≤ 1}.

3. Nonparametric maximum likelihood: Suppose X1, . . . , Xn are i.i.d. from a density θ∗ on [0, ∞) that is known to be nonincreasing. Then take Θ to be the collection of all non-increasing densities on [0, ∞) and mθ(x) = log θ(x). The corresponding M-estimator is the MLE over all non-increasing densities. It can be shown that θ̂n exists and is unique; θ̂n is usually known as the Grenander estimator.

4. Regression estimators: Let {Xi = (Zi, Yi)}_{i=1}^n denote i.i.d. observations from a regression model and let

mθ(x) = mθ(z, y) := −(y − θ(z))^2,

for θ in a class Θ of real-valued functions on the domain of Z.13 This gives the usual least squares estimator over the class Θ. The choice mθ(z, y) = −|y − θ(z)| gives the least absolute deviation estimator over Θ.
13. In the simplest setting we could parametrize θ(·) as θβ(z) := β^⊤ z, for β ∈ R^d, in which case Θ = {θβ(·) : β ∈ R^d}.
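As a quick numerical sketch of (4) (added for illustration; the Cauchy sample and the use of SciPy's scalar minimizer are choices made only for this example), the location estimator from Example 2(a) can be computed by minimizing the empirical risk θ ↦ Pn|X − θ| directly, and it essentially reproduces the sample median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.standard_cauchy(size=1001)

emp_risk = lambda theta: np.abs(x - theta).mean()   # theta -> P_n |X - theta|
theta_hat = minimize_scalar(emp_risk).x             # the M-estimator (empirical risk minimizer)
print(theta_hat, np.median(x))                      # should essentially coincide (up to tolerance)
```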

In these problems, the parameter of interest is

θ0 := argmax_{θ∈Θ} P[mθ].

Perhaps the simplest general way to address this problem is to reason as follows. By the
law of large numbers, we can approximate the ‘risk’ for a fixed parameter θ by the empirical
risk which depends only on the data, i.e.,

P [mθ ] ≈ Pn [mθ ].

If Pn [mθ ] and P [mθ ] are uniformly close, then maybe their argmax’s θ̂n and θ0 are close. The
problem is now to quantify how close θ̂n is to θ0 as a function of the number of samples n, the
dimension of the parameter space Θ, the dimension of the space X , etc. The resolution of
this question leads naturally to the investigation of quantities such as the uniform deviation

sup_{θ∈Θ} |(Pn − P)[mθ]|.

The following two examples show the importance of controlling the above display in
the problem of M -estimation and classification.

Example 1.3 (Consistency of M-estimator). Consider the setup of M-estimation as introduced above, where we assume that Θ is a metric space with the metric d(·, ·). In this example we describe the steps to prove the consistency of the M-estimator θ̂n := argmax_{θ∈Θ} Pn[mθ], as defined in (4). Formally, we want to show that

d(θ̂n, θ0) → 0 in probability,  where θ0 := argmax_{θ∈Θ} P[mθ].

To simplify notation we define

Mn (θ) := Pn [mθ ] and M (θ) := P [mθ ], for all θ ∈ Θ.

We will assume that the class of functions F := {mθ(·) : θ ∈ Θ} is P-Glivenko-Cantelli. Fix δ > 0 and let

ψ(δ) := M(θ0) − sup_{θ∈Θ: d(θ,θ0)≥δ} M(θ).

Observe that,

P(d(θ̂n, θ0) ≥ δ) ≤ P( Mn(θ0) ≤ sup_{θ∈Θ: d(θ,θ0)≥δ} Mn(θ) )
                 ≤ P( sup_{θ∈Θ: d(θ,θ0)≥δ} { (Mn(θ) − M(θ)) − (Mn(θ0) − M(θ0)) } ≥ ψ(δ) ).

Empirical process results provide bounds for the above probability (some assumptions on the
relation between M and the metric d(·, ·) will be needed), e.g., we may assume that θ0 is a
well-separated maximizer, i.e., for every δ > 0, ψ(δ) > 0.

Note that one can further bound the above probability by

P( sup_{θ∈Θ} |Mn(θ) − M(θ)| ≥ ψ(δ)/2 ),

but this can sometimes be too loose.
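A toy numerical companion to Example 1.3 (a sketch under assumptions made only for this illustration: a Gaussian location model, mθ(x) = −(x − θ)^2, and Θ restricted to a grid in [−5, 5]) shows d(θ̂n, θ0) and the uniform deviation sup_θ |Mn(θ) − M(θ)| shrinking together as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 1.0
theta_grid = np.linspace(-5, 5, 1001)   # Theta restricted to a grid in [-5, 5]

for n in [100, 1000, 10000]:
    x = rng.normal(loc=theta0, scale=1.0, size=n)
    Mn = np.array([-np.mean((x - t) ** 2) for t in theta_grid])   # M_n(theta) = P_n[m_theta]
    M = -(1.0 + (theta_grid - theta0) ** 2)                       # M(theta) = -(1 + (theta - theta0)^2)
    theta_hat = theta_grid[Mn.argmax()]
    print(n, abs(theta_hat - theta0), np.abs(Mn - M).max())
```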

Example 1.4 (Classification). Consider a pair of random objects X ≡ (Z, Y ) having some
joint distribution where Z takes values in a space Z and Y takes only two values: −1 or
+1. A classifier is a function g : Z → {−1, +1}. The error of the classifier is given by

L(g) := P(g(Z) ≠ Y).

The goal of classification is to construct a classifier with small error based on n i.i.d. obser-
vations X1 ≡ (Z1 , Y1 ), . . . , Xn ≡ (Zn , Yn ) having the same distribution as X = (Z, Y ) ∼ P .
For a classifier g, its empirical error (i.e., its error on the observed sample) is given by

Ln(g) := (1/n) Σ_{i=1}^n I{g(Zi) ≠ Yi} = Pn[I{g(Z) ≠ Y}].

A natural strategy for classification is to select a class of classifiers C and then to choose
the classifier in C which has the smallest empirical error on the observed sample, i.e.,

ĝn := argmin_{g∈C} Ln(g).

How good a classifier is ĝn, i.e., how small is its error

L(ĝn) = P(ĝn(Z) ≠ Y | X1, . . . , Xn) = P(ĝn(Z) ≠ Y) = ∫_{(z,y): ĝn(z)≠y} dP(z, y)?

Two questions are relevant about L(ĝn ):

1. Is L(ĝn) comparable to inf_{g∈C} L(g), i.e., is the error of ĝn comparable to the best achievable error in the class C?

2. Is L(ĝn ) comparable to Ln (ĝn ), i.e., is the error of ĝn comparable to its “in-sample”
empirical error?

It is quite easy to relate these two questions to the size of sup_{g∈C} |Ln(g) − L(g)|. Indeed, if g* := argmin_{g∈C} L(g), then

L(ĝn) = L(g*) + {L(ĝn) − Ln(ĝn)} + {Ln(ĝn) − L(g*)}
      ≤ L(g*) + {L(ĝn) − Ln(ĝn)} + {Ln(g*) − L(g*)}
      ≤ L(g*) + 2 sup_{g∈C} |Ln(g) − L(g)|,

where we have used the fact that Ln(ĝn) ≤ Ln(g*) (which follows from the definition of ĝn). Also,

L(ĝn) = Ln(ĝn) + {L(ĝn) − Ln(ĝn)} ≤ Ln(ĝn) + sup_{g∈C} |Ln(g) − L(g)|.   (5)

Thus the key quantity in answering the above two questions is

sup_{g∈C} |Ln(g) − L(g)|.

It is now easy to see that the above quantity is a special case of (3) when F is taken to be the class of all functions I{g(z) ≠ y} as g varies over C. The two inequalities above can sometimes be quite loose. Later, we shall see sharper inequalities which utilize a technique known as "localization"; see Section 8.3.
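The next sketch simulates Example 1.4 for a finite class of threshold classifiers (everything here, namely the data-generating model, the class C, and the use of a large independent sample as a stand-in for the true error L(g), is an assumption of this illustration rather than part of the notes):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample(n):
    z = rng.uniform(-1, 1, size=n)
    y = np.where(z + 0.3 * rng.normal(size=n) > 0, 1, -1)   # noisy labels in {-1, +1}
    return z, y

ts = np.linspace(-1, 1, 201)                                 # C = {g_t(z) = sign(z - t)}
err = lambda z, y, t: np.mean(np.where(z > t, 1, -1) != y)   # misclassification rate of g_t

z, y = sample(2000)                                          # training sample
z_big, y_big = sample(200000)                                # large sample, proxy for L(g)

Ln = np.array([err(z, y, t) for t in ts])                    # empirical errors L_n(g_t)
L = np.array([err(z_big, y_big, t) for t in ts])             # (approximate) true errors L(g_t)

i_hat = Ln.argmin()                                          # index of the ERM classifier
print(L[i_hat] - L.min())                                    # excess risk L(g_hat) - inf_C L(g)
print(np.abs(Ln - L).max())                                  # sup_{g in C} |L_n(g) - L(g)|
```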

Closely related to M-estimators are Z-estimators, which are defined as solutions to a system of equations of the form Σ_{i=1}^n mθ(Xi) = 0, for θ ∈ Θ, an appropriate function class.

We will learn how to establish consistency, rates of convergence and the limiting distribution for M- and Z-estimators; see [van der Vaart and Wellner, 1996, Chapters 3.1-3.4] for more details.

1.3 Why study weak convergence of stochastic processes?

We quite often write a statistic of interest as a functional on the sample paths of a stochastic
process in order to break the analysis of the statistic into two parts: the study of the conti-
nuity properties of the (measurable14 ) functional and the study of the stochastic process as
a random element15 in a space of functions. The method has its greatest appeal when many
different statistics can be written as functionals on the same process, as in the following
goodness-of-fit examples.
Consider the statistical problem of goodness-of-fit16 hypothesis testing where one ob-
serves an i.i.d. sample X1 , . . . , Xn from a distribution F on the real line and wants to test
the null hypothesis
H0 : F = F0  versus  H1 : F ≠ F0,

where F0 is a fixed continuous d.f. For testing H0 : F = F0, Kolmogorov recommended working with the quantity

Dn := √n sup_{x∈R} |Fn(x) − F0(x)|

and rejecting H0 when Dn is large. To calculate the p-value of this test, the null distribution
(i.e., the distribution of Dn under H0 ) needs to be determined.
14. Given two measurable spaces (S, S) and (T, T), a mapping f : S → T is said to be S/T-measurable, or simply measurable, if f^{−1}(T) ⊂ S, i.e., if f^{−1}(B) := {s ∈ S : f(s) ∈ B} ∈ S, for every B ∈ T.
15. A random element of T is a map X : (Ω, A, P) → (T, T) such that X is A/T-measurable [think of (T, T) = (R, B(R)), in which case X is a random variable (or a random element of R)].
16. Many satisfactory goodness-of-fit tests were proposed by Cramér, von Mises, Kolmogorov and Smirnov. These tests are based on various divergences between the hypothesized c.d.f. F0 and the e.d.f. Fn.

Question: What is the asymptotic distribution of Dn , under H0 ?
An interesting property of Dn is that its null distribution is the same whenever F0 is continuous17. Thus we can compute the null distribution of Dn assuming that F0 is the c.d.f. of a uniformly distributed random variable on [0, 1]. In other words, the null distribution of Dn is the same as that of sup_{t∈[0,1]} |Un(t)|, where

Un(t) := √n (Fn(t) − t),  with  Fn(t) := (1/n) Σ_{i=1}^n 1{ξi ≤ t},  t ∈ [0, 1],

and ξ1, . . . , ξn are i.i.d. Unif(0, 1) random variables. The function t ↦ Un(t) is called the uniform empirical process.
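Because the null distribution of Dn is the same for every continuous F0, it can be approximated by Monte Carlo using uniform samples only. The sketch below (added for illustration; n = 200 and 5000 replications are arbitrary choices) estimates the 95% critical value of sup_t |Un(t)|:

```python
import numpy as np

rng = np.random.default_rng(6)

def Dn(u):
    """sqrt(n) * sup_t |F_n(t) - t| for a Unif(0, 1) sample u."""
    u = np.sort(u)
    n = len(u)
    up = (np.arange(1, n + 1) / n - u).max()
    lo = (u - np.arange(0, n) / n).max()
    return np.sqrt(n) * max(up, lo)

sims = np.array([Dn(rng.uniform(size=200)) for _ in range(5000)])
print(np.quantile(sims, 0.95))   # approximate 95% critical value under H_0
```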
The Kolmogorov-Smirnov test for H0 : F = F0 is only one of a large class of tests that
are based on some measure of distance between the e.d.f. Fn and F0 . Another such test is
the Cramér-von Mises statistic:

Wn := n ∫ (Fn(x) − F0(x))^2 dF0(x).

All of these quantities have the property that their null distribution (i.e., when F = F0 ) is
the same for all continuous F0 . Thus one may assume that F0 is the uniform distribution
for computing their null distribution. And in this case, all these quantities can be written
in terms of the uniform empirical process Un .
Initially, the asymptotic distributions of these quantities were determined on a case by
case basis without a unified technique. Doob realized that it should be possible to obtain
these distributions using some basic properties of the uniform empirical process.

Remark 1.1 (Study of the uniform empirical process). By the multivariate CLT, for every
k ≥ 1 and 0 < t1 , . . . , tk < 1, the random vector (Un (t1 ), . . . , Un (tk )) converges in distribu-
tion to Nk (0, Σ) where Σ(i, j) := ti ∧tj −ti tj (here a∧b := min(a, b)). This limiting distribu-
tion Nk (0, Σ) is the same as the distribution of (U(t1 ), . . . , U(tk )) where U is the Brownian
bridge. Doob therefore conjectured that the uniform empirical process {Un (t) : t ∈ [0, 1]}
must converge in some sense to a Brownian Bridge {U(t) : t ∈ [0, 1]}. Hopefully, this notion
of convergence will be strong enough to yield that various functionals of Un (·) will converge
to the corresponding functionals of U(·).

Donsker accomplished this by first establishing a rigorous theory of convergence of stochastic processes and then proving that the uniform empirical process converges to the Brownian bridge process.

Thus, we have Un →d U (in some sense to be defined carefully later), and thus

sup_{t∈[0,1]} |Un(t)| →d sup_{t∈[0,1]} |U(t)|,

17. Exercise (HW1): Prove this. Hint: you may use the quantile transformation.

by appealing to the continuous mapping theorem18 . Similarly, we can obtain the asymptotic
distribution of the other test statistics.
In fact, there are plenty of other examples where it is convenient to break the analysis
of a statistic into two parts: the study of the continuity properties of the functional and the
study of the underlying stochastic process. An important example of such a “continuous”
functional is the argmax of a stochastic process which arises in the the study of M -estimators
(to be introduced in the following subsection).

1.4 Asymptotic equicontinuity

A commonly recurring theme in statistics is that we want to prove consistency or asymptotic


normality of some statistic which is not a sum of independent random variables, but can
be related to some natural sum of random functions indexed by a parameter in a suitable
(metric) space. The following example illustrates the basic idea.

Example 1.5. Suppose that X, X1 , . . . , Xn , . . . are i.i.d. P with c.d.f. G, having a Lebesgue
density g, and E(X 2 ) < ∞. Let µ = E(X). Consider the absolute deviations about the
sample mean,

Mn := Pn|X − X̄n| = (1/n) Σ_{i=1}^n |Xi − X̄n|,

as an estimate of scale. This is an average of the dependent random variables |Xi − X̄n |.
Suppose that we want to find the almost sure (a.s.) limit and the asymptotic distribution19
of Mn (properly normalized).
There are several routes available for showing that Mn → M := E|X − µ| almost surely, but the method we will develop in this section proceeds via empirical process theory. Since X̄n → µ a.s., we know that for any δ > 0 we have X̄n ∈ [µ − δ, µ + δ] for all sufficiently large n, almost surely. Let us define, for δ > 0, the random functions

Mn(t) := Pn|X − t|,  for |t − µ| ≤ δ.

This is just the empirical measure indexed by the collection of functions

Fδ := {ft : |t − µ| ≤ δ}, where ft (x) := |x − t|.


Note that Mn ≡ Mn(X̄n). To show that Mn → M := E|X − µ| a.s., we write

Mn − M = Pn(f_{X̄n}) − P(fµ)
       = (Pn − P)(f_{X̄n}) + {P(f_{X̄n}) − P(fµ)}
       = In + IIn.
18. The continuous mapping theorem states that if random elements Yn →d Y and g is continuous, then g(Yn) →d g(Y).
19. This example was one of the illustrative examples considered by [Pollard, 1989].

Note that,

|In| ≤ sup_{f∈Fδ} |(Pn − P)(f)| → 0 almost surely,   (6)
if Fδ is P-Glivenko-Cantelli. As we will see, this collection of functions Fδ is a VC subgraph class of functions20 with an integrable envelope21 function, and hence empirical process theory can be used to establish the desired convergence.
The convergence of the second term IIn is easy: by the triangle inequality,

|IIn| = |P(f_{X̄n}) − P(fµ)| ≤ P|X̄n − µ| = |X̄n − µ| → 0 almost surely.

Exercise (HW1): Give an alternate direct (rigorous) proof of the above result (i.e., Mn → M := E|X − µ| a.s.).
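A quick simulation sanity check of this almost sure limit (not a proof; the standard normal choice, for which E|X − µ| = √(2/π), is an assumption of the sketch):

```python
import numpy as np

rng = np.random.default_rng(7)
for n in [100, 10_000, 1_000_000]:
    x = rng.normal(size=n)               # mu = 0 and E|X - mu| = sqrt(2/pi)
    Mn = np.abs(x - x.mean()).mean()     # P_n |X - X_bar_n|
    print(n, Mn, np.sqrt(2 / np.pi))
```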

The corresponding central limit theorem is trickier. Can we show that √n(Mn − M) converges to a normal distribution? This may still not be unreasonable to expect. After all, if X̄n were replaced by µ in the definition of Mn this would be an outcome of the CLT (assuming a finite variance for the Xi's), and X̄n is the natural estimate of µ. Note that

√n(Mn − M) = √n(Pn f_{X̄n} − P fµ)
           = √n(Pn − P)fµ + √n(Pn f_{X̄n} − Pn fµ)
           = Gn fµ + Gn(f_{X̄n} − fµ) + √n(ψ(X̄n) − ψ(µ))
           = An + Bn + Cn (say),

where ψ(t) := P (ft ) = E|X − t|. We will argue later that Bn is asymptotically negligible
using an equicontinuity argument. Let us consider An + Cn . It can be easily shown that
ψ(t) = µ − 2 ∫_{−∞}^t x g(x) dx − t + 2tG(t),  and  ψ′(t) = 2G(t) − 1.

The delta method now yields:

An + Cn = Gn fµ + √n(X̄n − µ)ψ′(µ) + oP(1) = Gn[fµ(X) + Xψ′(µ)] + oP(1).

The usual CLT now gives the limit distribution of An + Cn .


Exercise (HW1): Complete the details and derive the exact form of the limiting distribution.

Definition 1.6. Let {Zn(f) : f ∈ F} be a stochastic process indexed by a class F equipped with a semi-metric22 d(·, ·). Call {Zn}n≥1 asymptotically (or stochastically) equicontinuous at f0 if for each η > 0 and ε > 0 there exists a neighborhood V of f0 for which23

lim sup_{n→∞} P( sup_{f∈V} |Zn(f) − Zn(f0)| > η ) < ε.

20. We will formally define VC classes of functions later. Intuitively, these classes of functions have simple combinatorial properties.
21. An envelope function of a class F is any function x ↦ F(x) such that |f(x)| ≤ F(x), for every x ∈ X and f ∈ F.
22. A semi-metric has all the properties of a metric except that d(s, t) = 0 need not imply that s = t.
23. There might be measure-theoretic difficulties related to taking a supremum over an uncountable set of f values, but we shall ignore these for the time being.

Exercise (HW1): Show that if (i) {f̂n}n≥1 is a sequence of (random) elements of F that converges in probability to f0, and (ii) {Zn(f) : f ∈ F} is asymptotically equicontinuous at f0, then Zn(f̂n) − Zn(f0) = oP(1). [Hint: Note that with probability tending to 1, f̂n will belong to each V.]
Empirical process theory offers very efficient methods for establishing the asymptotic
equicontinuity of Gn over a class of functions F. The fact that F is a VC class of func-
tions with square-integrable envelope function will suffice to show the desired asymptotic
equicontinuity.

2 Size/complexity of a function class

Let F be a class of measurable real-valued functions defined on X. Whether a given class of functions F is "Glivenko-Cantelli" or "Donsker" depends on the size (or complexity) of the class. A finite class of square integrable functions is always Donsker, while at the other extreme the class of all square integrable, uniformly bounded functions is almost never Donsker.

2.1 Covering numbers

A relatively simple way to measure the size of any set is to use covering numbers. Let (Θ, d)
be an arbitrary semi-metric space24 ; we will assume that Θ ⊂ Ξ and that d(·, ·) is defined
on the space Ξ. Let ε > 0.

Definition 2.1 (ε-cover). An ε-cover of the set Θ with respect to the semi-metric d is a set {θ1, . . . , θN} ⊂ Ξ25 such that for any θ ∈ Θ, there exists some v ∈ {1, . . . , N} with d(θ, θv) ≤ ε.

Definition 2.2 (Covering number). The ε-covering number of Θ is

N(ε, Θ, d) := inf{N ∈ N : ∃ an ε-cover θ1, . . . , θN of Θ}.

Equivalently, the ε-covering number N (ε, Θ, d) is the minimal number of balls B(x; ε) :=
{y ∈ Θ : d(x, y) ≤ ε} of radius ε needed to cover the set Θ.

Definition 2.3 (Metric entropy). The metric entropy of the set Θ with respect to the semi-
metric d is the logarithm of its covering number: log N (ε, Θ, d).

Note that a semi-metric space (Θ, d) is said to be totally bounded if the ε-covering
number is finite for every ε > 0. We can define a related measure of size that relates to the
number of disjoint balls of radius ε > 0 that can be placed into the set Θ.

Definition 2.4 (ε-packing). An ε-packing of the set Θ with respect to the semi-metric d is a set {θ1, . . . , θD} ⊆ Θ such that for all distinct v, v′ ∈ {1, . . . , D}, we have d(θv, θv′) > ε.

Definition 2.5 (Packing number). The ε-packing number of Θ is

D(ε, Θ, d) := sup{D ∈ N : ∃ an ε-packing θ1, . . . , θD of Θ}.

Equivalently, call a collection of points ε-separated if the distance between each pair of points
is larger than ε. Thus, the packing number D(ε, Θ, d) is the maximum number of ε-separated
points in Θ.
24. By a semi-metric space (Θ, d) we mean, for any θ1, θ2, θ3 ∈ Θ, we have: (i) d(θ1, θ2) = 0 whenever θ1 = θ2; (ii) d(θ1, θ2) = d(θ2, θ1); and (iii) d(θ1, θ3) ≤ d(θ1, θ2) + d(θ2, θ3).
25. The elements {θ1, . . . , θN} ⊂ Ξ need not belong to Θ themselves.

A minimal ε-cover or a maximal ε-packing does not have to be finite. In the proofs of the following results, we do not separate out the case when they are infinite (in which case there is nothing to show).

Lemma 2.6. We have

D(2ε, Θ, d) ≤ N(ε, Θ, d) ≤ D(ε, Θ, d), for every ε > 0.

Thus, packing and covering numbers have the same scaling in the radius ε.

Proof. Let us first show the second inequality. Suppose E = {θ1 , . . . , θD } ⊆ Θ is a maximal
packing. Then for every θ ∈ Θ \ E, there exists 1 ≤ i ≤ D such that d(θ, θi ) ≤ ε (for if this
does not hold for θ then we can construct a bigger packing set with θD+1 = θ). Hence E is
automatically an ε-covering. Since N (ε, Θ, d) is the minimal size of all possible coverings,
we have D(ε, Θ, d) ≥ N (ε, Θ, d).
We next prove the first inequality by contradiction. Suppose that there exists a 2ε-packing {θ1, . . . , θD} and an ε-covering {x1, . . . , xN} such that D ≥ N + 1. Then by the pigeonhole principle, we must have θi and θj belonging to the same ε-ball B(xk, ε) for some i ≠ j and k. This means that the distance between θi and θj cannot be more than the diameter of the ball, i.e., d(θi, θj) ≤ 2ε, which leads to a contradiction since d(θi, θj) > 2ε for a 2ε-packing. Hence the size of any 2ε-packing is less than or equal to the size of any ε-covering.
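The two inequalities can be probed numerically. In the sketch below (an added illustration with Θ taken to be a finite random subset of [0, 1]^2 under the Euclidean metric), a greedily built maximal ε-packing is simultaneously an ε-cover, exactly as in the proof, and the printed sizes are consistent with D(2ε, Θ, d) ≤ N(ε, Θ, d) ≤ D(ε, Θ, d):

```python
import numpy as np

rng = np.random.default_rng(8)
theta = rng.uniform(size=(500, 2))   # Theta: 500 points in [0, 1]^2, Euclidean metric

def greedy_maximal_packing(points, eps):
    """Greedily keep points that are > eps away from all chosen centers.
    The result is an eps-packing; by maximality it is also an eps-cover of `points`."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return centers

eps = 0.1
pack_2eps = greedy_maximal_packing(theta, 2 * eps)   # a 2*eps-packing, so its size <= D(2*eps)
cover_eps = greedy_maximal_packing(theta, eps)       # an eps-cover, so its size >= N(eps)
print(len(pack_2eps), len(cover_eps))
```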

Remark 2.1. As shown in the preceding lemma, covering and packing numbers are closely
related, and we can use both in the following. Clearly, they become bigger as ε → 0.

Let ‖ · ‖ denote any norm on R^d. The following result gives the (order of the) covering number of any bounded set in R^d.

Lemma 2.7. For a bounded subset Θ ⊂ R^d there exist constants c < C depending on Θ (and ‖ · ‖) only such that, for ε ∈ (0, 1),

c (1/ε)^d ≤ N(ε, Θ, ‖ · ‖) ≤ C (1/ε)^d.

Proof. If θ1, . . . , θD are ε-separated points in Θ, then the balls of radius ε/2 around the θi's are disjoint, and their union is contained in Θ′ := {θ ∈ R^d : ‖θ − Θ‖ ≤ ε/2}. Thus, the sum D vd (ε/2)^d of the volumes of these balls, where vd is the volume of the unit ball, is bounded by Vol(Θ′), the volume of Θ′. This gives the upper bound of the lemma, as

N(ε, Θ, ‖ · ‖) ≤ D(ε, Θ, ‖ · ‖) ≤ (2^d Vol(Θ′) / vd) (1/ε)^d.

Let θ1, . . . , θN be an ε-cover of Θ, i.e., the union of the balls of radius ε around them covers Θ. Thus the volume of Θ is bounded above by the sum of the volumes of the N balls, i.e., by N vd ε^d. This yields the lower bound of the lemma, as

N(ε, Θ, ‖ · ‖) ≥ (Vol(Θ) / vd) (1/ε)^d.

The following result gives an upper bound (which also happens to be optimal) on the
entropy numbers of the class of Lipschitz functions26 .

Lemma 2.8. Let F := {f : [0, 1] → [0, 1] | f is 1-Lipschitz}. Then for some constant A, we have

log N(ε, F, ‖ · ‖∞) ≤ A (1/ε),  for all ε > 0.
Proof. If ε > 1, there is nothing to prove as then N(ε, F, ‖ · ‖∞) = 1 (take the function f0 ≡ 0 and observe that for any f ∈ F, ‖f − f0‖∞ ≤ 1 < ε).

Let 0 < ε < 1. We will explicitly exhibit an ε-cover of F (under the ‖ · ‖∞-metric) with cardinality less than exp(A/ε), for some A > 0. This will complete the proof as N(ε, F, ‖ · ‖∞) will then be automatically less than exp(A/ε).

Let us define an ε-grid of the interval [0, 1], i.e., 0 = a0 < a1 < . . . < aN = 1 where ak := kε, for k = 1, . . . , N − 1; here N ≤ ⌊1/ε⌋ + 1 (where ⌊x⌋ denotes the greatest integer less than or equal to x). Let B1 := [a0, a1] and Bk := (a_{k−1}, ak], k = 2, . . . , N. For each f ∈ F define f̃ : [0, 1] → R as

f̃(x) = Σ_{k=1}^N ε ⌊f(ak)/ε⌋ 1_{Bk}(x).   (7)

Thus, f̃ is constant on the interval Bk and can only take values of the form iε, for i = 0, . . . , ⌊1/ε⌋. Observe that for x ∈ Bk (for some k ∈ {1, . . . , N}) we have

|f(x) − f̃(x)| ≤ |f(x) − f(ak)| + |f(ak) − f̃(ak)| ≤ 2ε,

where the first ε comes from the fact that f is 1-Lipschitz, and the second appears because of the approximation error in (7)27. Thus, ‖f − f̃‖∞ ≤ 2ε.

Next, let us count the number of distinct f̃'s obtained as f varies over F. There are at most ⌊1/ε⌋ + 1 choices for f̃(a1). Further, note that for any f̃ (and any k = 2, . . . , N),

|f̃(ak) − f̃(a_{k−1})| ≤ |f̃(ak) − f(ak)| + |f(ak) − f(a_{k−1})| + |f(a_{k−1}) − f̃(a_{k−1})| ≤ 3ε.

Therefore once a choice is made for f̃(a_{k−1}) there are at most 7 choices left for the next value of f̃(ak), k = 2, . . . , N.
26. Note that f : X → R is L-Lipschitz if |f(x) − f(y)| ≤ L‖x − y‖ for all x, y ∈ X.
27. Note that, for x ∈ Bk, f̃(x) = f̃(ak) = ε⌊f(ak)/ε⌋ ≤ f(ak), and f(ak) − f̃(ak) = f(ak) − ε⌊f(ak)/ε⌋ ≤ ε.
Now consider the collection {f̃ : f ∈ F}. We see that this collection is a 2ε-cover of F and the number of distinct functions in this collection is upper bounded by

(⌊1/ε⌋ + 1) 7^{⌊1/ε⌋}.

Thus, N(2ε, F, ‖ · ‖∞) is bounded by the right side of the above display, which completes the proof of the result.
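The discretization used in the proof is easy to implement. The sketch below (added for illustration; the particular 1-Lipschitz function is an arbitrary choice) constructs f̃ from (7) on a fine grid and checks that ‖f − f̃‖∞ ≤ 2ε:

```python
import numpy as np

eps = 0.05
f = lambda x: np.abs(x - 0.4)                    # a 1-Lipschitz function [0, 1] -> [0, 1]

N = int(round(1 / eps))
a = np.linspace(0.0, 1.0, N + 1)                 # grid points a_0 < a_1 < ... < a_N
xs = np.linspace(0.0, 1.0, 10001)                # evaluation points
k = np.searchsorted(a, xs, side="left")          # index of the cell B_k containing x
f_tilde = eps * np.floor(f(a[k]) / eps)          # f~(x) = eps * floor(f(a_k) / eps) on B_k
print(np.max(np.abs(f(xs) - f_tilde)), 2 * eps)  # the error stays below 2*eps
```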

Thus, the set of Lipschitz functions is much "larger" than a bounded set in R^d, since its metric entropy grows as 1/ε as ε → 0, as compared to log(1/ε) (cf. Lemma 2.7).

Exercise (HW1): For L > 0, let FL := {f : [0, 1] → R | f is L-Lipschitz}. Show that, for ε > 0, log N(ε, FL, ‖ · ‖∞) ≥ a L/ε, for some constant a > 0. Then, using Lemma 2.8, show that log N(ε, FL, ‖ · ‖∞) ≍ L/ε, for ε > 0 sufficiently small.

2.2 Bracketing numbers

Let (F, ‖ · ‖) be a subset of a normed space of real functions f : X → R on some set X. We are mostly thinking of Lr(Q)-spaces for probability measures Q. We shall write N(ε, F, Lr(Q)) for covering numbers relative to the Lr(Q)-norm ‖f‖_{Q,r} = (∫ |f|^r dQ)^{1/r}.

Definition 2.9 (ε-bracket). Given two functions l(·) and u(·), the bracket [l, u] is the set
of all functions f ∈ F with l(x) ≤ f (x) ≤ u(x), for all x ∈ X . An ε-bracket is a bracket
[l, u] with ‖l − u‖ < ε.

Definition 2.10 (Bracketing numbers). The bracketing number N_{[ ]}(ε, F, ‖ · ‖) is the minimum number of ε-brackets needed to cover F.

Definition 2.11 (Entropy with bracketing). The entropy with bracketing is the logarithm
of the bracketing number.

In the definition of the bracketing number, the upper and lower bounds u and l of the
brackets need not belong to F themselves but are assumed to have finite norms.

Example 2.12 (Distribution function). When F is equal to the collection of all indicator functions of the form ft(·) = 1_(−∞,t](·), with t ranging over R, the empirical process Gn(ft) is the classical empirical process √n(Fn(t) − F(t)) (here X1, . . . , Xn are i.i.d. P with c.d.f. F).

Consider brackets of the form [1_(−∞,t_{i−1}], 1_(−∞,t_i)] for grid points −∞ = t0 < t1 < · · · < tk = ∞ with the property F(ti−) − F(t_{i−1}) < ε for each i = 1, . . . , k; here we assume that ε < 1. These brackets have L1(P)-size ε. Their total number k can be chosen smaller than 2/ε. Since P f^2 ≤ P f for every 0 ≤ f ≤ 1, the L2(P)-size of the brackets is bounded by √ε. Thus N_{[ ]}(√ε, F, L2(P)) ≤ 2/ε, whence the bracketing numbers are of the polynomial order 1/ε^2.
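A minimal sketch of this bracket construction (added for illustration; taking F to be the Exp(1) distribution and using a quantile grid of spacing ε/2 is one concrete way to enforce F(ti−) − F(t_{i−1}) < ε):

```python
import numpy as np
from scipy.stats import expon

eps = 0.05
probs = np.arange(0.0, 1.0, eps / 2)        # 0, eps/2, 2*(eps/2), ..., < 1
t = np.append(expon.ppf(probs), np.inf)     # t_0 = 0 < t_1 < ... < t_k = +inf
gaps = np.diff(np.append(probs, 1.0))       # F-mass of each bracket (t_{i-1}, t_i)
print(len(t) - 1, 2 / eps)                  # number of brackets vs. the 2/eps benchmark
print(bool(gaps.max() < eps))               # every bracket has L1(P)-size < eps
```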

Exercise (HW1): Show that N(ε, F, ‖ · ‖) ≤ N_{[ ]}(2ε, F, ‖ · ‖), for every ε > 0.
In general, there is no converse inequality. Thus, apart from the constant 1/2, bracket-
ing numbers are bigger than covering numbers. The advantage of a bracket is that it gives
pointwise control over a function: l(x) ≤ f (x) ≤ u(x), for every x ∈ X . In comparison an
Lr (P )-ball gives integrated, but not pointwise control.

Definition 2.13 (Envelope function). An envelope function of a class F is any function x ↦ F(x) such that |f(x)| ≤ F(x), for every x ∈ X and f ∈ F. The minimal envelope function is x ↦ sup_{f∈F} |f(x)|.

Consider a class of functions {mθ : θ ∈ Θ} indexed by a parameter θ in an arbitrary index set Θ with a metric d. Suppose that the dependence on θ is Lipschitz in the sense that

|mθ1(x) − mθ2(x)| ≤ d(θ1, θ2) F(x),

for some function F : X → R, for every θ1, θ2 ∈ Θ, and every x ∈ X. The bracketing numbers of this class are bounded by the covering numbers of Θ, as shown below.

Lemma 2.14. Let F = {mθ : θ ∈ Θ} be a class of functions satisfying the preceding display for every θ1 and θ2 and some fixed function F. Then, for any norm ‖ · ‖,

N_{[ ]}(2ε‖F‖, F, ‖ · ‖) ≤ N(ε, Θ, d).

Proof. Let θ1, . . . , θp be an ε-cover of Θ (under the metric d). Then the brackets [mθi − εF, mθi + εF], i = 1, . . . , p, cover F. The brackets are of size 2ε‖F‖.

Exercise (HW1): Let F and G be classes of measurable functions. Then for any probability measure Q and any 1 ≤ r ≤ ∞,

(i) N_{[ ]}(2ε, F + G, Lr(Q)) ≤ N_{[ ]}(ε, F, Lr(Q)) N_{[ ]}(ε, G, Lr(Q));

(ii) provided F and G are bounded by 1,

N_{[ ]}(2ε, F · G, Lr(Q)) ≤ N_{[ ]}(ε, F, Lr(Q)) N_{[ ]}(ε, G, Lr(Q)).

Here, by F + G we mean the class of functions {f + g : f ∈ F, g ∈ G}; similarly, F · G := {f · g : f ∈ F, g ∈ G}.

3 Glivenko-Cantelli (GC) classes of functions

Suppose that X1, . . . , Xn are independent random variables defined on the space X with probability measure P. Let F be a class of measurable functions from X to R. The main object of study in this section is to obtain probability estimates of the random quantity

‖Pn − P‖_F := sup_{f∈F} |Pn f − P f|.

The law of large numbers says that Pn f → P f almost surely, as soon as the expectation
P f exists. A class of functions is called Glivenko-Cantelli if this convergence is uniform in
the functions belonging to the class.

Definition 3.1. A class F of measurable functions f : X → R with P|f| < ∞ for every f ∈ F is called Glivenko-Cantelli28 (GC) if

‖Pn − P‖_F := sup_{f∈F} |Pn f − P f| → 0, almost surely.

Remark 3.1 (On measurability). Note that if F is uncountable, ‖Pn − P‖_F is the supremum of an uncountable family of random variables. In general, the supremum of uncountably many measurable functions is not necessarily measurable29. However, there are many situations when this is actually a countable supremum, e.g., in the case of the empirical distribution function (because of right continuity and existence of left limits, ‖Fn − F‖∞ = sup_{x∈Q} |Fn(x) − F(x)|, where Q is the set of rational numbers). Thus if F is countable, or if there exists a countable F0 such that ‖Pn − P‖_F = ‖Pn − P‖_{F0} a.s., then the measurability problem for ‖Pn − P‖_F disappears30.

To avoid such measurability problems we assume throughout that F is pointwise measurable, i.e., F contains a countable subset G such that for every f ∈ F there exists a sequence gm ∈ G with gm(x) → f(x) for every x ∈ X.31
In this section we prove two Glivenko-Cantelli theorems. The first theorem is the simplest and is based on entropy with bracketing. Its proof relies on finite approximation and the law of large numbers for real variables. The second theorem uses random L1-entropy numbers and is proved through symmetrization followed by a maximal inequality.

28. As the Glivenko-Cantelli property depends on the distribution P of the observations, we also say, more precisely, P-Glivenko-Cantelli. If the convergence is in mean or in probability rather than almost surely, we speak of "Glivenko-Cantelli in mean" or "in probability".
29. Take a non-measurable set E ⊂ X and for each e ∈ E let 1e(·) be the indicator function of the set {e}. Then the supremum of the (uncountable) family {1e(·) : e ∈ E} is the indicator function 1E(·) (of the set E), which is not measurable.
30. In general one can take ‖Pn − P‖*_F, the smallest measurable function that dominates ‖Pn − P‖_F, which exists. ‖ · ‖*_F works essentially as a norm, and one also has Pr*{‖Pn − P‖_F > t} = Pr{‖Pn − P‖*_F > t}, where Pr* is outer probability, i.e., the infimum of the probabilities of measurable sets that contain {‖Pn − P‖_F > t}; the same holds with E*. The calculus with Pr* is quite similar to the usual measure theory, but there are differences (no full Fubini); see, e.g., [van der Vaart and Wellner, 1996, Section 1.2].
31. Some examples of this situation are the collection of indicators of cells in Euclidean space, the collection of indicators of balls, and collections of functions that are separable for the supremum norm.

3.1 GC by bracketing

Theorem 3.2. Let F be a class of measurable functions such that N_{[ ]}(ε, F, L1(P)) < ∞ for every ε > 0. Then F is Glivenko-Cantelli.

Proof. Fix ε > 0. Choose finitely many ε-brackets [li, ui] whose union contains F and such that P(ui − li) < ε, for every i. Then, for every f ∈ F, there is a bracket such that

(Pn − P)f ≤ (Pn − P)ui + P(ui − f) ≤ (Pn − P)ui + ε.

Consequently,

sup_{f∈F} (Pn − P)f ≤ max_i (Pn − P)ui + ε.

The right side converges almost surely to ε by the strong law of large numbers for real variables. A similar argument also yields

(Pn − P)f ≥ (Pn − P)li + P(li − f) ≥ (Pn − P)li − ε
⇒ inf_{f∈F} (Pn − P)f ≥ min_i (Pn − P)li − ε.

Again by the SLLN, the right side converges almost surely to −ε, so lim inf_n inf_{f∈F} (Pn − P)f ≥ −ε almost surely. As

sup_{f∈F} |(Pn − P)f| = max{ sup_{f∈F} (Pn − P)f, − inf_{f∈F} (Pn − P)f },

we see that lim sup_n ‖Pn − P‖_F ≤ ε almost surely, for every ε > 0. Taking a sequence εm ↓ 0 yields the desired result.

Example 3.3 (Distribution function). The previous proof generalizes a well-known proof of the classical GC theorem for the e.d.f. on the real line. Indeed, the set of indicator functions of cells (−∞, c] possesses finite bracketing numbers for any underlying distribution; simply use the brackets [1_(−∞,t_{i−1}], 1_(−∞,t_i)] for a grid of points −∞ = t0 < t1 < · · · < tk = +∞ with the property P(t_{i−1}, ti) < ε for each i.

Example 3.4 (Pointwise compact class). Let F = {mθ(·) : θ ∈ Θ} be a collection of measurable functions with integrable envelope function F, indexed by a compact metric space Θ, such that the map θ ↦ mθ(x) is continuous for every x. Then the bracketing numbers of F are finite and hence F is Glivenko-Cantelli.

We can construct the brackets in the obvious way in the form [m_B, m^B], where B is an open ball and m_B and m^B are the infimum and supremum of mθ over θ ∈ B, respectively (i.e., m_B(x) = inf_{θ∈B} mθ(x), and m^B(x) = sup_{θ∈B} mθ(x)).

Given a sequence of balls Bk with common center a given θ and radii decreasing to 0, we have m^{Bk} − m_{Bk} ↓ mθ − mθ = 0 by the continuity, pointwise in x, and hence also in L1 by the dominated convergence theorem and the integrability of the envelope. Thus, given ε > 0, for every θ there exists a ball B around θ such that the bracket [m_B, m^B] has size at most ε. By the compactness of Θ, the collection of balls constructed in this way has a finite subcover. The corresponding brackets cover F. This construction shows that the bracketing numbers are finite, but it gives no control on their sizes.

An example of such a class would be the log-likelihood functions of a parametric model {pθ(x) : θ ∈ Θ}, where Θ ⊂ R^d is assumed to be compact32 and pθ(x) is assumed to be continuous in θ for Pθ0-a.e. x.

The goal for the remainder of this section is to prove the following theorem.

Theorem 3.5 (GC by entropy). Let F be a class of measurable functions with envelope F such that P(F) < ∞. Let FM be the class of functions f 1{F ≤ M} where f ranges over F. Then ‖Pn − P‖_F → 0 almost surely if and only if

(1/n) log N(ε, FM, L1(Pn)) → 0 in probability,   (8)

for every ε > 0 and M > 0.33 In that case the convergence takes place in mean also.

Both the statement and the proof of the GC theorem with entropy are more complicated
than the previous bracketing theorem. However, the result gives a precise (necessary and
sufficient) characterization for a class of functions to be GC. Moreover, the sufficiency
condition for the GC property can be checked for many classes of functions by elegant
combinatorial arguments, as will be discussed later. The proof of the above result needs
numerous other concepts, which we introduce below.

3.2 Preliminaries

In this subsection we will introduce a variety of simple results that will be useful in proving
the Glivenko-Cantelli theorem with entropy. We will expand on each of these topics later
on, as they indeed form the foundations of empirical process theory.

3.2.1 Hoeffding’s inequality for the sample mean

Bounds on the tail probability of (the maximum of a bunch of) random variables form the backbone of most of the results in empirical process theory (e.g., the GC theorem, maximal inequalities needed to show asymptotic equicontinuity, etc.). In this subsection we will review some basic results on this topic which will be used to prove the GC theorem with entropy. We start with a very simple (but important and useful) result.

32. This is a stringent assumption in many situations. If we assume that mθ is convex/concave (in θ), then it suffices to consider compact subsets of the parameter space (we will see such an example soon; also see, e.g., [Hjort and Pollard, 2011] for a more refined approach). In other situations we have to argue from first principles that it is enough to restrict attention to compacts.
33. Furthermore, the random entropy condition is necessary.

Lemma 3.6 (Markov's inequality). Let Z ≥ 0 be a random variable. Then for any t > 0,

P(Z ≥ t) ≤ EZ / t.

Proof. Observe that t 1{Z ≥ t} ≤ Z, which on taking expectations yields the above result.

The above lemma implies Chebyshev’s inequality.

Lemma 3.7 (Chebyshev's inequality). If Z has a finite variance Var(Z), then

P(|Z − EZ| ≥ t) ≤ Var(Z) / t^2.

The above lemma is probably the simplest concentration inequality. Thus, if X1, . . . , Xn are i.i.d. with finite variance σ^2, and if Z = Σ_{i=1}^n Xi =: Sn, then (by independence) Var(Z) = nσ^2, and

P(Sn − ESn ≥ t√n) ≤ σ^2 / t^2.

Thus, the typical deviations/fluctuations of the sum of n i.i.d. random variables are at most of order √n. However, by the central limit theorem (CLT), for fixed t > 0,

lim_{n→∞} P(Sn − ESn ≥ t√n) = 1 − Φ(t/σ) ≤ (σ / (√(2π) t)) exp(−t^2 / (2σ^2)),

where the last inequality uses a standard bound on the normal CDF. Thus, we see that although Chebyshev's inequality gets the order √n correct, the dependence on t^2/σ^2 is not as predicted by the CLT (we expect an exponential decrease in t^2/σ^2).
Indeed, we can improve the above bound by assuming that Z has a moment generating
function. The main trick here is to use Markov’s inequality in a clever way: if λ > 0,
  E[eλ(Z−EZ) ]
P(Z − EZ > t) = P eλ(Z−EZ) > eλt ≤ . (9)
eλt

Now, we can derive bounds for the moment generating function Eeλ(Z−EZ) and optimize
over λ.
When dealing with Sn , we can extend the idea as follows. Observe that,
n n
!
Y Y
λZ λXi
Ee = E e = EeλXi , (by independence). (10)
i=1 i=1

Now it suffices to find bounds for EeλXi .

25
Lemma 3.8 (Exercise (HW1)). Let X be a random variable with EX = 0 and X ∈ [a, b]
with probability 1 (w.p.1). Then, for any λ > 0,
2 (b−a)2 /8
E(eλX ) ≤ eλ .

The above lemma, combined with (9) and (10), implies the following Hoeffding’s tail
inequality.

Lemma 3.9 (Hoeffding’s inequality). Let X1 , . . . , Xn be independent bounded random vari-


ables such that Xi ∈ [ai , bi ] w.p.1. Then, we obtain,
2/
Pn 2
P (Sn − ESn ≥ t) ≤ e−2t i=1 (bi −ai ) ,

and
2/
Pn 2
P (Sn − ESn ≤ −t) ≤ e−2t i=1 (bi −ai ) .

Proof. For λ, t ≥ 0, Markov’s inequality and the independence of Xi implies:


  h i
P (Sn − ESn ≥ t) = P eλ(Sn −E[Sn ]) ≥ eλt ≤ e−λt E eλ(Sn −E[Sn ])
n n
λ2 (bi −ai )2
Y h i Y
−λt λ(Xi −E[Xi ]) −λt
= e E e ≤ e e 8

i=1 i=1
 n
X 
= exp − λt + 18 λ2 (bi − ai )2 .
i=1

To get the best possible upper bound, we find the minimum of the right hand side of the last
2 Pn
inequality as a function of λ. Define g : R+ → R such that g(λ) = −λt + λ8 2
i=1 (bi − ai ) .
Note that g is a quadratic function and achieves its minimum at λ = Pn (b4ti −ai )2 . Plugging
i=1
in this value of λ in the above bound we obtain the desired result. We can similarly prove
the tail bound for t < 0.

Example 3.10 (Hoeffding’s bound for i.i.d. random variables). Suppose that X1 , . . . , Xn
are i.i.d. such that X1 ∈ [a, b] w.p. 1. Then for any given α ∈ (0, 1) a direct consequence of
Lemma 3.9 show that r
n(b − a)2 1
Sn − E[Sn ] ≥ log (11)
2 α
with probability at most α. In fact, by Hoeffding’s inequality, we can obtain an 1 − α
honest conservative confidence (symmetric) interval (around the sample mean X̄n ) for the
population mean µ := E[X1 ] as:
" r r #
b−a 2 b−a 2
X̄n − √ log , X̄n + √ log .
2n α 2n α

Hoeffding’s inequality does not depend on the distribution of the Xi ’s (which is good),
but also does not incorporate the dependence on Var(Xi ) in the bound (which can result
in an inferior bound; e.g., consider Xi ∼ Bernoulli(p) where p is close to 0 or 1).

26
3.2.2 Sub-Gaussian random variables/processes

A sub-Gaussian distribution is a probability distribution with strong tail decay property.


Informally, the tails of a sub-Gaussian distribution are dominated by (i.e., decay at least as
fast as) the tails of a Gaussian.

Definition 3.11. X is said to be sub-Gaussian if ∃ constants C, v > 0 s.t. P(|X| > t) ≤


2
Ce−vt for every t > 0.

The following are equivalent characterizations of a sub-Gaussian random variable (Ex-


ercise: HW1)

• The distribution of X is sub-Gaussian.


2
• There exists a > 0 such that E[eaX ] < +∞.
2
• Laplace transform condition: ∃B, b > 0 such that ∀ λ ∈ R, Eeλ(X−E[X]) ≤ Beλ b .

• Moment condition: ∃ K > 0 such that for all p ≥ 1, (E|X|p )1/p ≤ K p.

Suppose (T, d) is a semi-metric space and let {Xt , t ∈ T } be a stochastic process indexed
by T satisfying

u2
 
P (|Xs − Xt | ≥ u) ≤ 2 exp − for all u > 0. (12)
2d(s, t)2

Such a stochastic process is called sub-Gaussian with respect to the semi-metric d. Any
p
Gaussian process is sub-Gaussian for the standard deviation semi-metric d(s, t) = Var(Xs − Xt ).
Another example is the Rademacher process
n
X
Xa := ai εi , a := (a1 , . . . , an ) ∈ Rn ,
i=1

for independent Rademacher variables ε1 , . . . , εn , i.e.,


1
P(εi = 1) = P(εi = −1) = ;
2
variables that are +1 or −1 with probability 1/2 each. The result follows directly from the
following lemma.

Lemma 3.12 (Hoeffding’s inequality for Rademacher variables). Let a = (a1 , . . . , an ) ∈ Rn


be a vector of constants and ε1 , . . . , εn be Rademacher random variables. Then
n
 X  2 2
P ai εi ≥ x ≤ 2e−x /(2kak ) ,
i=1

where kak denotes the Euclidean norm of a.

27
2
Proof. For any λ and Rademacher variable ε, one has Eeλε = (eλ + e−λ )/2 ≤ eλ /2 , where
the last inequality follows after writing out the power series. Thus, by Markov’s inequality,
for any λ > 0,
X n  Pn 2 2
P ai εi ≥ x ≤ e−λx Eeλ i=1 ai εi ≤ e(λ /2)kak −λx .
i=1

The best upper bound is obtained for λ = x/kak2 and is the exponential in the probability
of the lemma. Combination with a similar bound for the lower tail yields the probability
bound.

Here is a useful (and probably the earliest and simplest) ‘maximal inequality’ which is
an application of Hoeffding’s inequality.

Lemma 3.13. Suppose that Y1 , . . . , YN (not necessarily independent) are sub-Gaussian in


the sense that, for all λ > 0,
2 σ 2 /2
EeλYi ≤ eλ , for all i = 1, . . . , N.

Then,
p
E max Yi ≤ σ 2 log N . (13)
i=1,...,N

Proof. Observe that


N
2 σ 2 /2
X
eλE maxi=1,...,N Yi ≤ Eeλ maxi=1,...,N Yi ≤ EeλYi ≤ N eλ ,
i=1

where we have used Jensen’s inequality in the first step (taking the function eλx ). Taking
logarithms yields
λσ 2
 
log N
E max Yi ≤ + .
i=1,...,N λ 2
Optimizing with respect to λ (differentiating and then equating to 0) yields the result.

Example 3.14. Let A1 , . . . , AN ⊂ X and let X1 , . . . , Xn be i.i.d. random points in X . Let


n
1X
P (A) = P(Xi ∈ A) and Pn (A) = 1A (Xi ).
n
i=1

By Hoeffding’s inequality, for each A,


Pn n
2 /(8n)
Y
Eeλ(Pn (A)−P (A)) = Ee(λ/n) i=1 (1A (Xi )−P (A)) = Ee(λ/n)(1A (Xi )−P (A)) ≤ eλ .
i=1

Thus, by a simple maximal inequality (see Lemma 3.13),


r
log N
E max (Pn (A) − P (A)) ≤ .
i=1,...,N 2n

28
Lemma 3.13 can be easily extended to yield the following maximal inequality.

Lemma 3.15. Let ψ be a strictly increasing, convex, nonnegative function. If ξ1 , . . . , ξN


are random variables such that

E[ψ(|ξi |/ci )] ≤ L, for i = 1, . . . , N,

where L is a constant, then

E max |ξi | ≤ ψ −1 (LN ) max ci .


1≤i≤N 1≤i≤N

Proof. By the properties of ψ,


N
E max |ξi | |ξi | |ξi | |ξi |
      X  
ψ ≤ ψ E max ≤ Eψ max ≤ Eψ ≤ LN.
max ci ci ci ci
i=1

Applying ψ −1 to both sides we get the result.

Lemma 3.16. Show that if ξi ’s are linear combinations of Rademacher variables (i.e.,
(i) (i) (i) 2 (i) 2
ξi = nk=1 ak εk , a(i) = (a1 , . . . , an ) ∈ Rn ), then: (i) E[eξi /(6ka k ) ] ≤ 2; (ii) by using
P
2
ψ(x) = ex show that, for N ≥ 2,
p
E max |ξi | ≤ C log N max ka(i) k. (14)
1≤i≤N 1≤i≤N

(iii) Further, show that C = 2 6 for Rademacher linear combinations.

Exercise (HW1): (a) An Orlicz function is a convex, increasing function ψ : R+ → R+ with


0 ≤ ψ(0) < 1 (most authors actually require ψ(0) = 0).
Define the Orlicz norm kXkψ (seminorm actually, unless one identifies random variables
that are almost everywhere equal) by

kXkψ := inf{c > 0 : E[ψ(|X|/c)] ≤ 1},

with the understanding that kXkψ = ∞ if the infimum runs over an empty set. Let
ψ(x) = exp(x2 ) − 1. Then show that kXkψ < ∞ if and only if X − EX is sub-Gaussian.

3.3 Symmetrization

Symmetrization (or randomization) technique plays an essential role in empirical process


theory. The symmetrization replaces ni=1 (f (Xi )−P f ) by ni=1 εi f (Xi ) with i.i.d. Rademacher34
P P
Pn
random variables ε1 , . . . , εn independent of X1 , . . . , Xn . Note that i=1 εi f (Xi ) can be
thought of as the correlation between the vector (f (X1 ), . . . , f (Xn )) and the “noise vector”
(ε1 , . . . , εn ). Thus,
n n
1X 1X
εi f (Xi ) := sup εi f (Xi )
n F f ∈F n
i=1 i=1
34
Recall that a Rademacher random variable ε takes values ±1 with equal probability 1/2.

29
denotes the maximum correlation taken over all functions f ∈ F. The intuition here is: a
function class is extremely large — and, in fact, “too large” for statistical purposes — if we
can always find a function (in the class) that has a high correlation with a randomly drawn
noise vector.
The advantage of symmetrization lies in the fact that the symmetrized process is typ-
ically easier to control than the original process, as we will find out in several places. For
example, even though ni=1 (f (Xi ) − P f ) has only low order moments, ni=1 εi f (Xi ) is sub-
P P

Gaussian, conditionally on X1 , . . . , Xn . In what follows, Eε denotes the expectation with


respect to ε1 , ε2 , . . . only; likewise, EX denotes the expectation with respect to X1 , X2 , . . .
only.
The symmetrized empirical measure and process are defined by
n n
1X 1 X
f 7→ Pon f := εi f (Xi ), f 7→ Gon f := √ εi f (Xi ).
n n
i=1 i=1

The symmetrized empirical processes has mean function zero.


One main approach to proving empirical limit theorems is to pass from Pn − P to
Pon and next apply arguments conditionally on the original X’s. The idea is that, for
fixed X1 , . . . , Xn , the symmetrized empirical measure is a Rademacher process, hence a
sub-Gaussian process to which Dudley’s entropy result (see Chapter 4) can be applied.

Theorem 3.17 (Symmetrization). For any class of measurable functions F,


n
1X
EkPn − P kF ≤ 2 E εi f (Xi ) .
n F
i=1

Proof. Let Y1 , . . . , Yn be independent copies of X1 , . . . , Xn , and defined on the same prob-


ability space. For fixed values X1 , . . . , Xn ,
n n
1 X 1 X 
kPn − P kF = sup {f (Xi ) − Ef (Yi )} = sup EY {f (Xi ) − f (Yi )}
f ∈F n f ∈F n
i=1 i=1
n
1 h X i
≤ EY sup {f (Xi ) − f (Yi )}
n f ∈F i=1

where EY is the expectation with respect to Y1 , . . . , Yn , given fixed values of X1 , . . . , Xn ,


 
and we have used the fact that for a class of functions G, supg∈G E[g(Z)] ≤ E supg∈G |g(Z)|
for some random vector Z.
Taking the expectation with respect to X1 , . . . , Xn , we get
n
1X
EkPn − P kF ≡ EX kPn − P kF ≤ EX,Y [f (Xi ) − f (Yi )] .
n F
i=1

Adding a minus sign in front of a term [f (Xi ) − f (Yi )] has the effect of exchanging Xi
and Yi . Because the Y ’s are independent copies of the X’s, the expectation of any function

30
g(X1 , . . . , Xn , Y1 , . . . , Yn ) (= supf ∈F | ni=1 [f (Xi ) − f (Yi )]| here) remains unchanged under
P

permutations of its 2n arguments. Hence the expression


n
1 X
E ei [f (Xi ) − f (Yi )]
n F
i=1

is the same for any n-tuple (e1 , . . . , en ) ∈ {−1, +1}n . Deduce that
n
1X
EkPn − P kF ≤ Eε EX,Y εi [f (Xi ) − f (Yi )] .
n F
i=1

Using the triangle inequality to separate the contributions of the X’s and the Y ’s and noting
that they are both equal to EkPon kF .

Remark 3.2. The symmetrization lemma is valid for any class F. In the proofs of
Glivenko-Cantelli and Donsker theorems, it will be applied not only to the original set of
functions of interest, but also to several classes constructed from such a set F (such as the
class Fδ of small differences). The next step in these proofs is to apply a maximal inequality
to the right side of the above theorem, conditionally on X1 , . . . , Xn .
Lemma 3.18 (A more general version of the symmetrization lemma; Exercise (HW1)).
Suppose that P f = 0 for all f ∈ F. Let ε1 , . . . , εn be independent Rademacher random
variables independent of X1 , . . . , Xn . Let Φ : R+ → R+ be a nondecreasing convex function,
and let µ : F → R be a bounded functional such that {f + µ(f ) : f ∈ F} is pointwise
measurable. Then,
" n
# " n
# " n
#
1 X   X   X 
E Φ εi f (Xi ) ≤E Φ f (Xi ) ≤E Φ 2 εi (f (Xi ) + µ(f )) .
2 i=1
F
i=1
F
i=1
F

3.4 Proof of GC by entropy

In this subsection we prove Theorem 3.5. We only show the sufficiency of the entropy
condition. By the symmetrization result, measurability of the class F, and Fubini’s theorem,
n
1X
EkPn − P kF ≤ 2EX Eε εi f (Xi )
n F
i=1
n
1 X
≤ 2EX Eε εi f (Xi ) + 2P [F 1{F > M }],
n FM
i=1

by the triangle inequality, for every M > 0. For sufficiently large M , the last term is
arbitrarily small. To prove convergence in mean, it suffices to show that the first term
converges to zero for fixed M . Fix X1 , . . . , Xn . If G is an η-net35 in L1 (Pn ) over FM , then
for any f ∈ FM , there exists g ∈ G such that
n n n n
1X 1X 1X 1X
εi f (Xi ) ≤ εi g(Xi ) + εi [f (Xi ) − g(Xi )] ≤ εi g(Xi ) + η.
n n n n G
i=1 i=1 i=1 i=1
35
The set of centers of balls of radius η that cover T is called an -net of T .

31
Thus,
n n
1X 1X
Eε εi f (Xi ) ≤ Eε εi g(Xi ) + η. (15)
n FM n G
i=1 i=1

The cardinality of G can be chosen equal to N (η, FM , L1 (Pn )). Given X1 , . . . , Xn , the
symmetrized empirical process Gon is sub-Gaussian with respect to the L2 (Pn )-norm. Using
the maximal inequality in Lemma 3.15 with ψ(x) = exp(x2 ) (see (14)) shows that the
preceding display does not exceed
p 1
C log N (, FM , L1 (Pn )) √ sup kgkn + ,
n g∈G
Pn
where kgk2n := 2
i=1 g(Xi ) /n and C is a universal constant.
The k · kn norms of g ∈ G are bounded above by M (this can always be ensured by

truncating g if required). By assumption the square root of the entropy divided by n tends
to zero in probability. Thus the right side of the above display tends to  in probability.
Since this argument is valid for every  > 0, it follows that the left side of (15) converges
to zero in probability.
h h P ii
Next we show that EX Eε n1 ni=1 εi f (Xi ) F X1 , . . . , Xn converges to 0. Since
M
1 Pn
Eε n i=1 εi f (Xi ) is bounded by M and converges to 0 (in probability), its expectation
FM
with respect to X1 , . . . , Xn converges to zero by the dominated convergence theorem.
This concludes the proof that kPn − P kF → 0 in mean. That it also converges almost
surely follows from the fact that the sequence kPn − P kF is a reverse sub-martingale with
respect to a suitable filtration; see e.g., [van der Vaart and Wellner, 1996, Lemma 2.4.5] (by
martingale theory any nonnegative reverse sub-martingale converges almost surely to some
limit36 ).

3.5 Applications

3.5.1 Consistency of M /Z-estimators

Consider the setup of M -estimation as introduced in Section 1.2 and Example 1.3 where
we assume that Θ is a metric space with the metric d(·, ·). In this example we describe
the steps to prove the consistency of the M -estimator θ̂n := arg maxθ∈Θ Pn [mθ ], as defined
in (4). Formally, we want to show that
P
d(θ̂n , θ0 ) → 0 where θ0 := arg max P [mθ ].
θ∈Θ

To simplify notation we define

Mn (θ) := Pn [mθ ] and M (θ) := P [mθ ], for all θ ∈ Θ.


36
Result: If P F < ∞, then kPn − P kF converges almost surely and in L1 .

32
We will assume that the class of functions F := {mθ (·) : θ ∈ Θ} is P -Glivenko Cantelli. We
will further need to assume that θ0 is a well-separated maximizer, i.e., for every δ > 0,

M (θ0 ) > sup M (θ).


θ∈Θ:d(θ,θ0 )≥δ

Fix δ > 0 and let


ψ(δ) := M (θ0 ) − sup M (θ).
θ∈Θ:d(θ,θ0 )≥δ

Observe that,

{d(θ̂n , θ0 ) ≥ δ} ⇒ M (θ̂n ) ≤ sup M (θ)


θ∈Θ:d(θ,θ0 )≥δ

⇔ M (θ̂n ) − M (θ0 ) ≤ −ψ(δ)


⇒ M (θ̂n ) − M (θ0 ) + (Mn (θ0 ) − Mn (θ̂n )) ≤ −ψ(δ)
⇒ 2 sup |Mn (θ) − M (θ)| ≥ ψ(δ).
θ∈Θ

Therefore,  
 
P d(θ̂n , θ0 ) ≥ δ ≤ P sup |Mn (θ) − M (θ)| ≥ ψ(δ)/2 → 0
θ∈Θ

by the fact that F is P -Glivenko Cantelli.


The above result of course assumes that F is P -Glivenko Cantelli, which we can verify
by showing the sufficient conditions in Theorem 3.2 or Theorem 3.5 hold. For example,
for a pointwise compact space, as described in Example 3.4, this immediately yields the
consistency of θ̂n .
Note that the sufficient conditions needed to show that F is P -GC is hardly ever met
when Θ is not a compact set (observe that the finiteness of covering numbers necessarily
imply that the underlying set is totally bounded). We give a lemma below which replaces
compactness by a convexity assumption.

Lemma 3.19 (Exercise (HW1)). Suppose that Θ is a convex subset of Rd , and that θ 7→
mθ (x), is continuous and concave, for all x ∈ X . Suppose that E[G (X)] < ∞ where
P
G (x) := supkθ−θ0 k≤ |mθ (x)|, for x ∈ X . Then θ̂n → θ0 .


Hint: Define α := and θ̃n := αθ̂n + (1 − α)θ0 and compare Mn (θ̃n ) with Mn (θ0 ).
+kθ̂n −θ0 k

3.5.2 Consistency of least squares regression

Suppose that we have data

Yi = g0 (zi ) + Wi , for i = 1, . . . , n, (16)

where Yi ∈ R is the observed response variable, zi ∈ Z is a covariate, and Wi is the unob-


served error. The errors are assumed to be independent random variables with expectation

33
EWi = 0 and variance Var(Wi ) ≤ σ02 < ∞, for i = 1, . . . , n. The covariates z1 , . . . , zn are
fixed, i.e., we consider the case of fixed design.
The function g0 : Z → R is unknown, but we assume that g0 ∈ G, where G is a given
class of regression functions. The unknown regression function can be estimated by the
least squares estimator (LSE) ĝn , which is defined (not necessarily uniquely) by
n
X
ĝn = arg min (Yi − g(zi ))2 . (17)
g∈G
i=1

Let
n
1X
Qn := δ zi
n
i=1
denote the empirical measure of the design points. For g : Z → R, we write
n n n
1X 2 1X 1X
kgk2n := g (zi ), kY − gk2n := (Yi − g(zi ))2 , and hW, gin := Wi g(zi ).
n n n
i=1 i=1 i=1

P
Question: When can we say that kĝn − g0 kn → 0?
Our starting point is the following inequality:

kĝn − g0 k2n ≤ 2hW, ĝn − g0 in , (18)

which follows from simplifying the inequality kY −ĝn k2n ≤ kY −g0 k2n (Hint: write kY −ĝn k2n =
kY − g0 + g0 − ĝn k2n and expand).
We shall need to control the entropy, not of the whole class G itself, but of subclasses
Gn (R), which are defined as

Gn (R) = {g ∈ G : kg − g0 kn ≤ R}.

Thus, Gn (R) is the ball of radius R around g0 , intersected with G.

Theorem 3.20. Suppose that


n
1X  2 
lim lim sup E Wi 1{|Wi |>K} = 0, (19)
K→∞ n→∞ n
i=1

and
log N (δ, Gn (R), L1 (Qn ))
→ 0, for all δ > 0, R > 0. (20)
n
p
Then, kĝn − g0 kn → 0.

Proof. Let η, δ > 0 be given. We will show that P(kĝn − g0 kn > δ) can be made arbitrarily
small, for all n sufficiently large. Note that for any R > δ, we have

P(kĝn − g0 kn > δ) ≤ P(δ < kĝn − g0 kn < R) + P(kĝn − g0 kn > R).

34
We will first show that the second term on the right side can be made arbitrarily small by
choosing R large. From (18), using Cauchy-Schwarz inequality, it follows that
n
1 X 1/2
kĝn − g0 kn ≤ 2 Wi2 .
n
i=1

Thus, using Markov’s inequality,


n n
!
  1 X 1/2 4 1X 4σ02
P kĝn − g0 kn > R ≤ P 2 Wi2 >R ≤ EW i
2
≤ = η,
n R2 n R2
i=1 i=1

where R2 := 4σ02 /η. Now, using (18) again,


!
 
2
P δ < kĝn − g0 kn < R ≤ P sup 2hW, g − g0 in ≥ δ
g∈Gn (R)
! !
δ2 δ2
≤P sup hW 1{|W |≤K} , g − g0 in ≥ +P sup hW 1{|W |>K} , g − g0 in ≥ . (21)
g∈Gn (R) 4 g∈Gn (R) 4

An application of the Cauchy-Schwarz and Markov’s inequality bounds the second term
on the right side of the above display:
n n
!   !
1 X
2
1/2 δ2 4R 2 1X 2
P Wi 1{|Wi |>K} ≥ ≤ E Wi 1{|Wi |>K} ≤ η,
n 4R δ2 n
i=1 i=1

by choosing K = K(δ, η) sufficiently large and using (19). We bound the first term in (21)
by using Markov’s inequality:
!
δ2 4
P sup hW 1{|W |≤K} , g − g0 in ≥ ≤ 2 E hW 1{|W |≤K} , g − g0 in G (R) .
g∈Gn (R) 4 δ n

The random variables Wi 1{|Wi |≤K} still have expectation zero if each Wi is symmetric,
which we shall assume to avoid digressions. If they are not symmetric, one can use different
truncation levels to the left and right to approximately maintain zero expectation. We will
now use Hoeffding’s inequality (see Lemma 3.9). Thus, for any function g : Z → R and for
all δ > 0,
nδ 2
   
P |hW 1{|W |≤K} , g − g0 in | ≥ δ ≤ 2 exp − . (22)
2K 2 kg − g0 k2n
The proof will now mimic the proof of Theorem 3.5. If G̃ is an -net in L1 (Qn ) over Gn (R),
then
E hW 1{|W |≤K} , g − g0 in G (R) ≤ E hW 1{|W |≤K} , g − g0 in G̃ + K. (23)
n

The cardinality of G̃ can be chosen equal to N (, Gn (R), L1 (Qn )). Using the maximal
inequality in Lemma 3.13 with ψ(x) = exp(x2 ) (note that (13) holds for every g ∈ G̃ by

35
Lemma 3.2137 and (22)) shows that the preceding display is does not exceed a multiple of
p K
log N (, Gn (R), L1 (Qn )) √ sup kg − g0 kn + K.
n g∈G̃

The norms of g − g0 ∈ G̃ are bounded above by R. By assumption the square root of the

entropy divided by n tends to zero. Thus the above display is less than (K + 1) for all
large n. Since this argument is valid for every  > 0, it follows that the left side of (23) can
be made less than η.

Exercise (HW1): Assume the setup of (16) where G = {g : R → R | g in nondecreasing}.


Assume that g0 ∈ G is fixed, and supz∈R |g0 (z)| ≤ K for some (unknown) constant K. Using
the fact that
log N[ ] (, G[0,1] , L1 (Qn )) ≤ A−1 , for all  > 0,

where G[0,1] = {g : R → [0, 1] | g in nondecreasing} and A > 0 is a universal constant, show


p
that kĝn − g0 kn → 0 (here ĝn is the LSE defined in (17)).

3.6 Bounded differences inequality — a simple concentration inequality

We are interested in bounding the random fluctuations of (complicated) functions of many


independent random variables (e.g., the Hoeffding’s inequality in Lemma 3.9 accomplishes
this for the sum of independent bounded random variables). Let X1 , . . . , Xn be independent
random variables taking values in X . Let f : X n → R, and let

Z = f (X1 , . . . , Xn )

be the random variable of interest (e.g., Z = ni=1 Xi when X = R). We seek upper bounds
P

for
P(Z > EZ + t) and P(Z < EZ − t) for t > 0.

In this subsection we study a concentration inequality similar in spirit to the Hoeffding’s


inequality, but that holds for a general random variable Z = f (X1 , . . . , Xn ), under suitable
conditions.
37
The following result shows that sub-Gaussian tail bounds for a random variable W imply the finiteness
2
of EeDW for some D > 0.
2
Lemma 3.21. Let W be a random variable with P(|W | > x) ≤ Ke−Cx for every x, for constants K and
2
C. Then, EeDW ≤ KD/(C − D) + 1, for any D < C.

Proof. By Fubini’s theorem,


Z |W |2 Z ∞ √
Z ∞
2 KD
E(eDW − 1) = E DeDs ds = P(|W | > s)DeDs ds ≤ KD e−Cs eDs ds = .
0 0 0 C −D

36
Let X1 , . . . , Xn be independent random variables taking values in X . Let f : X n → R
and
Z = f (X1 , . . . , Xn )
be the random variable of interest. Note that if we define

Yk := E[Z|X1 , . . . , Xk ], for k = 1, . . . , n,

then {Yk }nk=0 is a martingale38 adapted to a filtration generated by {Xk }nk=1 .


Denote by Ei [·] := E[·|X1 , . . . , Xi ]. Thus, E0 (Z) = E(Z) and Ek (Z) = Yk , for k =
1, . . . , n. Writing
∆i := Ei [Z] − Ei−1 [Z],
we have
n
X
Z − EZ = ∆i .
i=1
We want to get exponential concentration inequalities for Z. As before, we start with
bounding the moment generating function Eeλ(Z−EZ) . We start with the Azuma-Hoeffding
inequality for sums of bounded martingale differences.

Lemma 3.22 (Azuma-Hoeffding inequality). Suppose that the martingale differences are
bounded, i.e., |∆i | ≤ ci , for all i = 1, . . . , n. Then,
2
Pn 2
Eeλ(Z−E(Z)) ≤ eλ i=1 ci /2 .

Proof. Observe that,


Pn h Pn−1 i
Eeλ(Z−E(Z)) = Eeλ i=1 ∆i = E En−1 (eλ( i=1 ∆i )+λ∆n )
h Pn−1 i
= E eλ( i=1 ∆i ) En−1 [eλ∆n ]
h Pn−1 i 2 2
≤ E eλ( i=1 ∆i ) eλ cn /2 (by Lemma 3.8)
···
Pn
2( 2
≤ eλ i=1 ci )/2 .

Definition 3.23 (Functions with bounded differences). We say that a function f : X n → R


has the bounded difference property if for some nonnegative constants c1 , . . . , cn ,

sup |f (x1 , . . . , xn ) − f (x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn )| ≤ ci , 1 ≤ i ≤ n. (24)


x1 ,...,xn ,x0i ∈X

Given a sequence {Yk }∞


38 ∞
k=1 of random variables adapted to a filtration {Fk }k=1 (e.g., Fk =
σ(X1 , . . . , Xk )), the pair {Yk , Fk }∞
k=1 is a martingale if, for all k ≥ 1,

E[|Yk |] < ∞, and E[Yk+1 |Fk ] = Yk .

37
In other words, if we change the i-th variable of f while keeping all the others fixed, the
value of the function cannot change by more than ci .

The following theorem provides exponential tail bounds for the random variable Z −
E(Z). It follows easily from the above lemma and is left as an exercise.

Theorem 3.24 (Bounded differences inequality or McDiarmid’s inequality). Suppose that


Z = f (X1 , . . . , Xn ) and f is such that (24) holds (i.e., f has the bounded differences prop-
erty), then
2
Pn 2
P(|Z − E(Z)| > t) ≤ 2e−2t / i=1 ci . (25)

Proof. We will show that if f satisfies the bounded difference property with the constants
ci ’s, then |∆i | ≤ ci , which together with Lemma 3.22 will yield the desired result. Recall
that

∆i ≡ ∆i (X1 , . . . , Xi ) := E[f (X1 , . . . , Xn )|X1 , . . . , Xi ] − E[f (X1 , . . . , Xn )|X1 , . . . , Xi−1 ].

Fix X1 = x1 , . . . , Xi−1 = xi−1 , for any x1 , . . . , xi−1 ∈ X . Then ∆i can be viewed solely as
a function of Xi . We will study the range of ∆i as Xi = x varies over X . Then, observe
that, as X1 , . . . , Xn are independent,

|∆i (x1 , . . . , xi−1 , x)|


= E[f (x1 , . . . , xi−1 , x, Xi+1 , . . . , Xn )] − E[f (x1 , . . . , xi−1 , Xi , Xi+1 , . . . , Xn )]
 
= E f (x1 , . . . , xi−1 , x, Xi+1 , . . . , Xn ) − f (x1 , . . . , xi−1 , Xi , Xi+1 , . . . , Xn )
h i
≤ E f (x1 , . . . , xi−1 , x, Xi+1 , . . . , Xn ) − f (x1 , . . . , xi−1 , Xi , Xi+1 , . . . , Xn )
" #
≤ E sup f (x1 , . . . , xi−1 , x, Xi+1 , . . . , Xn ) − f (x1 , . . . , xi−1 , x0 , Xi+1 , . . . , Xn )
x,x0 ∈X
≤ ci

where the first inequality follows using Jensen’s inequality, and the last inequality follows
from the fact that f satisfies the bounded differences property (24).

Theorem 3.24 can be seen as a quantification of the following qualitative statement


of Talagrand (see [Talagrand, 1996b, Page 2]): A random variable that depends on the
influence of many independent variables (but not too much on any of them) concentrates.
The numbers ci control the effect of the i-th variable on the function f .
The bounded differences inequality is quite useful and distribution-free and often close
to optimal. However, it does not incorporate the variance information.

Example 3.25 (Kernel density estimation). Let X1 , . . . , Xn are i.i.d. from a distribution
P on R (the argument can be easily generalized to Rd ) with density φ. We want to estimate

38
φ nonparametrically using the kernel density estimator (KDE) φ̂n : R → [0, ∞) defined as
n
x − Xi
 
1 X
φ̂n (x) = K , forx ∈ R,
nhn hn
i=1

where hn > 0 is the smoothing bandwidth and K is a nonnegative kernel (i.e., K ≥ 0 and
R
K(x)dx = 1). The L1 -error of the estimator φ̂n is
Z
Z ≡ f (X1 , . . . , Xn ) := |φ̂n (x) − φ(x)|dx.

The random variable Z not only provides a measure of the difference between φ̂n and φ but,
as Z = 2 supA |Pn (A) − P (A)| (Exercise (HW1): Show this) where the supremum is over
all Borel sets in R and Pn denotes the distribution corresponding to the KDE φ̂n ), Z also
captures the difference between Pn and P in the total variation distance.
We can use Theorem 3.24 to get exponential tail bounds for Z. We will show that (25)
holds with ci = 2/n, for all i = 1, . . . , n. It is easy to see that for x1 , . . . , xn , x0i ∈ X ,

|f (x1 , . . . , xn ) − f (x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn )|


x − xi x − x0i
   
1 2
Z
≤ K −K dx ≤ . (26)
nhn hn hn n
Thus, using Theorem 3.24 we have
2 /2 √ 2
P(|Z − E(Z)| > t) ≤ 2e−nt ⇒ P ( n|Z − E(Z)| > t) ≤ 2e−t /2 ,

which shows that Z concentrates around its expectation E[Z] at the rate n−1/2 . The remark-
able thing is that this concentration property holds regardless of the choice of bandwidth hn .
Of course, in this case, it is difficult to actually compute what that expectation is.

3.7 Supremum of the empirical process for a bounded class of functions

Let us conclude this section with an important result on what the bounded differences
concentration inequality implies for the supremum of the empirical process
n
1X
Z := sup f (Xi ) − E[f (X1 )] ,
f ∈F n i=1

where X1 , . . . , Xn are i.i.d. random objects taking values in X and F is a collection of


real-valued functions on X , when it is assumed that all functions in F are bounded by a
positive constant B, i.e.,

sup |f (x)| ≤ B for all f ∈ F.


x∈X

We shall argue that Z concentrates around its expectation. Let


n
1X
g(x1 , . . . , xn ) := f (xi ) − E[f (X1 )] .
n
i=1

39
We shall show below that g satisfies the bounded differences property (24) with ci := 2B/n
for i = 1, . . . , n. To see this, note that
1X f (x0i )
g(x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn ) = f (xi ) + − E[f (X1 )]
n n
j6=i
n
1X f (x0i ) f (xi )
= f (xj ) − E[f (X1 )] + −
n n n
j=1
n
1X 2B 2B
≤ f (xj ) − E[f (X1 )] + ≤ g(x1 , . . . , xn ) + ,
n n n
j=1

where we have used the fact that for every f ∈ F, |f (xi )| ≤ B and f (x0i )| ≤ B. Interchanging
the roles of xi and x0i , we can deduce that (24) holds with ci := 2B/n for i = 1, . . . , n. Then,
Theorem 3.24 yields
nt2
 
P(|Z − EZ| > t) ≤ 2 exp − 2 , for every t ≥ 0.
2B
 
nt2
Setting δ := exp − 2B 2 , we can deduce that

r
2 1
|Z − E[Z]| ≤ B log ,
n δ
holds with probability at least 1 − 2δ for every δ > 0. This inequality implies that E[Z] is
usually the dominating term for understanding the behavior of Z.
We may apply this to study the classical Glivenko-Cantelli problem. The following
theorem illustrates this.
Theorem 3.26. Suppose that X1 , . . . , Xn are i.i.d. random variables on R with distribution
P and c.d.f. F . Let Fn be the empirical d.f. of the data (see (1)). Then,
" r #
log(n + 1) 2
P kFn − F k∞ ≥ 8 + t ≤ e−nt /2 , forall t > 0. (27)
n
a.s.
Hence, kFn − F k∞ → 0.

Proof. The function class under consideration is F := {1(−∞,t] (·) : t ∈ R}. Then, Z :=
kPn − P kF = kFn − F k∞ . From the discussion in this subsection, we have to bound upper
bound E[Z]. This can be done via symmetrization, i.e., E[Z] ≤ 2EX [Eε [supf ∈F | n1 ni=1 εi f (Xi )|]],
P

where ε1 , . . . , en are i.i.d. Rademachers independent of the Xi ’s. For a fixed (x1 , . . . , xn ) ∈
Rn , define
∆n (F; x1 , . . . , xn ) := {(f (x1 ), . . . , f (xn )) : f ∈ F}.
Observe that although F has uncountable many functions, for every (x1 , . . . , xn ) ∈ Rn ,
∆n (F; x1 , . . . , xn ) can take at most n + 1 distinct values39 . Thus, supf ∈F n1 ni=1 εi f (xi )
P

39
If we order xn1 as x(1) ≤ x(2) ≤ . . . ≤ x(n) , then they split the real line into at most n + 1 intervals
(including the two end-intervals (−∞, x(1) ) and [x(n) , ∞). Thus, for a given t ∈ R, the indicator 1(−∞,t] (x(i) )
takes the value one for all x(i) ≤ t, and the value zero for all other samples.

40
is at most the supremum of n + 1 such variables, and we can apply Lemma 3.16 to show
that40
n
" # r
1X log(n + 1)
E sup εi f (Xi ) ≤ 4 .
f ∈F n i=1
n

This shows (27).


a.s.
Exercise (HW2): Show that this implies that kFn − F k∞ → 0.

Although the exponential tail bound (27) is adequate for many purposes, it is far
from the tightest possible. Using alternative methods (using Dudley’s entropy bound in
Section 4), we provide a sharper result that removes the log(n + 1) factor.

40
Note that for x1 , . . . , xn distinct
n j
1X d 1X
sup εi f (xi ) = max εi
f ∈F n i=1 j=1,...,n n
i=1
h Pn i
and direct calculations would actually show that the Eε supf ∈F 1
n i=1 εi f (xi ) = O(n−1/2 ).

41
4 Chaining and uniform entropy

In this section we will introduce Dudley’s metric entropy bound (and the idea of chaining).
We will use this result to prove a maximal inequality (with uniform entropy) that will be
useful in deriving rates of convergence of statistical estimators (see Section 5). Further,
as we will see later, these derived maximal inequalities also play a crucial role in proving
functional extensions of the Donsker’s theorem (see Section 11). In fact, these maximal
inequalities are at the heart of the theory of empirical processes.
The proof of the main result involves in this section an idea called chaining. Before we
start with chaining, let us recall our first maximal inequality (13) (see Lemma 3.13). Note
that the bound (13) can be tight in some situations. For example, this is the case when
Y1 , . . . , YN are i.i.d. N (0, σ 2 ). Because of this example, Lemma 3.13 cannot be improved
without imposing additional conditions on the Y1 , . . . , YN . It is also easy to construct
examples where (13) is quite weak. For example, if Yi = Y0 + σN Zi , for i = 1, . . . , N ,
iid √
for some Y0 ∼ N (0, σ 2 ) and Zi ∼ N (0, 1) and σN log N = o(1), then it is clear that

maxi=1,...,N |Yi | ≈ Y0 so that (13) will be loose by a factor of log N . In order to improve
on (13), we need to make assumptions on how close to each other the Yi ’s are. Dudley’s
entropy bound makes such an assumption explicit and provides improved upper bounds for
E[maxi=1,...,N |Yi |].

4.1 Dudley’s bound for the supremum of a sub-Gaussian process

For generality, we will assume that we may have an infinite (possibly uncountable) collection
of random variables and we are interested in the expected supremum of the collection.
Suppose that (T, d) is a metric space and Xt is a stochastic process indexed by T . We
will state two versions of Dudley’s metric entropy bound and will prove one of these results
(the other has a similar proof). Let us first assume that
EXt = 0, for all t ∈ T.
We want to find upper bounds for
E sup Xt
t∈T
that ONLY depends on structure of the metric space (T, d). We shall first state Dudley’s
bound when the index set T is finite and subsequently improve it to the case when T is
infinite.
Theorem 4.1 (Dudley’s entropy bound for finite T ). Suppose that {Xt : t ∈ T } is a mean
zero stochastic process such that for every s, t ∈ T and u ≥ 0,41
u2
 
P {|Xt − Xs | ≥ u} ≤ 2 exp − 2 . (28)
2d (s, t)
41
Note that if Xt , t ∈ T , have mean zero and are jointly Gaussian, then Xt − Xs is a mean zero normal
p
random variable for every s, t ∈ T so that (28) holds with d(s, t) := E(Xs − Xt )2 .

42
Also, assume that (T, d) is a finite metric space. Then, we have
h i Z ∞p
E sup Xt ≤ C log N (, T, d) d (29)
t∈T 0

where C > 0 is a constant.

Next we give a slightly different formulation of Dudley’s metric entropy bound for finite
T . However, before proceeding further, we shall give a result similar to Lemma 3.13 but
instead bound the maximum of the absolute values of sub-Gaussian random variables.

Proposition 4.2. Let T be a finite set and let {Xt , t ∈ T } be a stochastic process. Suppose
that for every t ∈ T and u ≥ 0, the inequality

u2
 
P(|Xt | ≥ u) ≤ 2 exp − 2 (30)

holds. Here σ is a fixed positive real number. Then, for a universal positive constant C, we
have
p
E max |Xt | ≤ Cσ log(2|T |). (31)
t∈T

Proof of Proposition 4.2. Because


Z ∞  
E max |Xt | = P max |Xt | ≥ u du,
t∈T 0 t∈T

we can control E maxt∈T |Xt | by bounding the tail probability P (maxt∈T |Xt | ≥ u) du for
every u ≥ 0. For this, write

u2
  X  
P max |Xt | ≥ u = P (∪t∈T {|Xt | ≥ u}) ≤ P (|Xt | ≥ u) ≤ 2|T | exp − 2 .
t∈T 2σ
t∈T

This bound is good for large u but not so good for small u (it is quite bad for u = 0 for
example). It is therefore good to use it only for u ≥ u0 for some u0 to be specified later.
This gives42
Z ∞  
E max |Xt | = P max |Xt | ≥ u du
t∈T 0 t∈T
Z u0   Z ∞  
= P max |Xt | ≥ u du + P max |Xt | ≥ u du
0 t∈T u0 t∈T
Z ∞  2

u
≤ u0 + 2|T | exp − 2 du
u0 2σ
Z ∞
u2 2|T | 2 u20
   
u
≤ u0 + 2|T | exp − 2 du = u0 + σ exp − 2 .
u0 u0 2σ u0 2σ
42
R∞
Result: Suppose that Z ≥ 0. Then, E(Z) = 0
P(Z ≥ t) dt.

43
One can try to minimize the above term over u0 . A simpler strategy is to realize that the
large term here is 2|T | so one can choose u0 to kill this term by setting
√ p
 2 
u0
exp 2
= 2|T | or u0 = 2σ log(2|T |).

This gives
√ p σ2 p
E max |Xt | ≤ 2σ log(2|T |) + p ≤ Cσ log(2|T |)
t∈T 2σ 2 log(2|T |)
which proves the result.

Theorem 4.3. Suppose (T, d) is a finite metric space and {Xt , t ∈ T } is a stochastic process
such that (28) hold. Then, for a universal positive constant C, the following inequality holds
for every t0 ∈ T :
Z ∞p Z ∞p
E max |Xt − Xt0 | ≤ C log D(, T, d) d . log N (, T, d) d. (32)
t∈T 0 0

Here D(, T, d) denotes the -packing number of the space (T, d).

The following remarks mention some alternative forms of writing the inequality (32)
and also describe some implications.
Remark 4.1. Let D̃ denote the diameter of the metric space T (i.e., D̃ = maxs,t∈T d(s, t)).
Then the packing number D(, T, d) clearly equals 1 for  ≥ D̃ (it is impossible to have two
points in T whose distance is strictly larger than  when  > D). Therefore,
Z ∞p Z D̃ p
log D(, T, d)d = log D(, T, d)d.
0 0
Moreover,
Z D̃ p Z D̃/2 p Z D̃ p
log D(, T, d)d = log D(, T, d)d + log D(, T, d)d
0 0 D̃/2
Z D̃/2 p Z D̃/2 q
≤ log D(, T, d)d + log D( + (D̃/2), T, d)d
0 0
Z D̃/2 p
≤2 log D(, T, d)d
0

because D( + (D̃/2), T, d) ≤ D(, T, d) for every . We can thus state Dudley’s bound as
Z D̃/2 p
E max |Xt − Xt0 | ≤ C log D(, T, d)d
t∈T 0

where the C above equals twice the constant C in (32). Similarly, again by splitting the above
integral in two parts (over 0 to D̃/4 and over D̃/4 to D̃/2), we can also state Dudley’s bound
as Z D̃/4 p
E max |Xt − Xt0 | ≤ C log D(, T, d)d.
t∈T 0
The constant C above now is 4 times the constant in (32).

44
Remark 4.2. The left hand side in (32) is bounded from below (by triangle inequality) by
E maxt∈T |Xt | − E|Xt0 |. Thus, (32) implies that
Z D̃/4 p
E max |Xt | ≤ E|Xt0 | + C log D(, T, d)d for every t0 ∈ T .
t∈T 0

We shall now give the proof of Theorem 4.3. The proof will be based on an idea called
chaining. Specifically, we shall split maxt∈T (Xt − Xt0 ) in chains and use the bound given
by Proposition 4.2 within the links of each chain.

Proof of Theorem 4.3. Recall that D̃ is the diameter of T . For n ≥ 1, let Tn be a maximal
D̃2−n -separated subset of T i.e., mins,t∈Tn :s6=t d(s, t) > D̃2−n and Tn has maximal cardi-
nality subject to the separation restriction. The cardinality of Tn is given by the packing
number D(D̃2−n , T, d). Because of the maximality,

max min d(s, t) ≤ D̃2−n . (33)


t∈T s∈Tn

Because T is finite and d(s, t) > 0 for all s 6= t, the set Tn will equal T when n is large. Let

N := min{n ≥ 1 : Tn = T }.

For each n ≥ 1, let πn : T → Tn denote the function which maps each point t ∈ T to the
point in Tn that is closest to T (if there are multiple closest points to T in Tn , then choose
one arbitrarily). In other words, πn (t) is chosen so that

d(t, πn (t)) = min d(t, s).


s∈Tn

As a result, from (33), we have

d(t, πn (t)) ≤ D̃2−n for all t ∈ T and n ≥ 1. (34)

Note that πN (t) = t. Finally let T0 := {t0 } and π0 (t) = t0 for all t ∈ T .
We now note that
N
X 
Xt − Xt0 = Xπn (t) − Xπn−1 (t) for every t ∈ T . (35)
n=1

The sequence
t0 → π1 (t) → π2 (t) → · · · → πN −1 (t) → πN (t) = t

can be viewed as a chain from t0 to t. This is what gives the argument the name chaining.
By (35), we obtain
N
X N
X
max |Xt − Xt0 | ≤ max |Xπn (t) − Xπn−1 (t) | ≤ max |Xπn (t) − Xπn−1 (t) |
t∈T t∈T t∈T
n=1 n=1

45
so that
N
X
E max |Xt − Xt0 | ≤ E max |Xπn (t) − Xπn−1 (t) |. (36)
t∈T t∈T
n=1

Now to bound E maxt∈T |Xπn (t) − Xπn−1 (t) | for each 1 ≤ n ≤ N , we shall use the elementary
bound given by Proposition 4.2. For this, note first that by (28), we have

−u2
 

P |Xπn (t) − Xπn−1 (t) | ≥ u ≤ 2 exp .
2d2 (πn (t), πn−1 (t))
Now

d(πn (t), πn−1 (t)) ≤ d(πn (t), t) + d(πn−1 (t), t) ≤ D̃2−n + D̃2−(n−1) = 3D̃2−n .

Thus Proposition 4.2 can be applied with σ := 3D̃2−n so that we obtain (note that the
value of C might change from occurrence to occurrence)

3D̃ p
E max |Xπn (t) − Xπn−1 (t) | ≤ C log (2|Tn ||Tn−1 |)
t∈T 2n
r  
−n −n
p
≤ C D̃2 2
log (2|Tn | ) ≤ C D̃2 log 2D(D̃2−n , T, d)

Plugging the above bound into (36), we deduce


N r
X D̃  
E max |Xt − Xt0 | ≤ C log 2D(D̃2−n , T, d)
t∈T 2n
n=1
XN Z D̃/2n p
≤ 2C log(2D(, T, d))d
n+1
n=1 D̃/2
Z D̃/2 p
=C log(2D(, T, d))d
D̃/2N +1
Z D̃/2 p
≤C log(2D(, T, d))d
0
Z D̃/4 p Z D̃/4 q
=C log(2D(, T, d))d + C log(2D( + (D̃/4), T, d))d
0 0
Z D̃/4 p
≤ 2C log(2D(, T, d))d.
0

Note now that for  ≤ D̃/4, the packing number D(, T, d) ≥ 2 so that

log(2D(, T, d)) ≤ log 2 + log D(, T, d) ≤ 2 log D(, T, d).

We have thus proved that

√ Z D̃/4 p
E max |Xt − Xt0 | ≤ 2 2C log D(, T, d)d
t∈T 0

which proves (32).

46
4.1.1 Dudley’s bound when the metric space is separable

We shall next prove Dudley’s bound for the case of infinite T . This requires a technical
assumption called separability which will always be satisfied in our applications.

Definition 4.4 (Separable stochastic process). Let (T, d) be a metric space. The stochastic
process {Xt , t ∈ T } indexed by T is said to be separable if there exists a null set N and a
countable subset T̃ of T such that for all ω ∈
/ N and t ∈ T , there exists a sequence {tn } in
T̃ with limn→∞ d(tn , t) = 0 and limn→∞ Xtn (ω) = Xt (ω).

Note that the definition of separability requires that T̃ is a dense subset of T which
means that the metric space (T, d) is separable (a metric space is said to be separable if it
has a countable dense subset).
The following fact is easy to check: If (T, d) is a separable metric space and if Xt , t ∈ T ,
has continuous sample paths (almost surely), then Xt , t ∈ T is separable. The statement
that Xt , t ∈ T , has continuous sample paths (almost surely) means that there exists a null
set N such that for all ω ∈ / N , the function t 7→ Xt (ω) is continuous on T .
The following fact is also easy to check: If {Xt , t ∈ T } is a separable stochastic process,
then
sup |Xt − Xt0 | = sup |Xt − Xt0 | almost surely (37)
t∈T t∈T̃

for every t0 ∈ T . Here T̃ is a countable subset of T which appears in the definition of


separability of Xt , t ∈ T .
In particular, the statement (37) implies that supt∈T |Xt − Xt0 | is measurable (note
that uncountable suprema are in general not guaranteed to be measurable; but this is not
an issue for separable processes).
We shall now state Dudley’s theorem for separable processes. This theorem does not
impose any cardinality restrictions on T (it holds for both finite and infinite T ).

Theorem 4.5. Let (T, d) be a separable metric space and let {Xt , t ∈ T } be a separable
stochastic process. Suppose that for every s, t ∈ T and u ≥ 0, we have

u2
 
P {|Xs − Xt | ≥ u} ≤ 2 exp − 2 .
2d (s, t)
Then for every t0 ∈ T , we have
Z D̃/4 p
E sup |Xt − Xt0 | ≤ C log D(, T, d)d (38)
t∈T 0

where D̃ is the diameter of the metric space (T, d).

Proof of Theorem 4.5. Let T̃ be a countable subset of T such that (37) holds. We may
assume that T̃ contains t0 (otherwise simply add t0 to T̃ ). For each k ≥ 1, let T̃k be the

47
finite set obtained by taking the first k elements of T̃ (in an arbitrary enumeration of the
entries of T̃ ). We can ensure that T̃k contains t0 for every k ≥ 1.
Applying the finite index set version of Dudley’s theorem (Theorem 4.3) to {Xt , t ∈ T̃k },
we obtain
Z diam(T̃k )/4 q Z D̃/4 p
E max |Xt − Xt0 | ≤ C log D(, T̃k , d)d ≤ C log D(, T, d)d.
t∈T̃k 0 0

Note that the right hand side does not depend on k. Letting k → ∞ on the left hand side,
we use the Monotone Convergence Theorem to obtain
" # Z D̃/4 p
E sup |Xt − Xt0 | ≤ C log D(, T, d)d.
t∈T̃ 0

The proof is now completed by (37).

Remark 4.3. One may ask if there is a lower bound for E supt∈T Xt in terms of cov-
ering/packing numbers. A classical result in this direction is Sudakov’s lower bound which
states: For a zero-mean Gaussian process Xt defined on T , define the variance pseudometric
d2 (s, t) := Var(Xs − Xt ). Then,
h i p
E sup Xt ≥ sup log D(, T, d),
t∈T >0 2

where D(, T, d) is the -packing number of (T, d).

It is natural to ask how Dudley’s bound can be useful for the theory of empirical
process. Indeed, Theorem 4.5 is enormously helpful in upper bounding the supremum of
the empirical process as indicated by the maximum inequality in the next subsection.

4.2 Maximal inequality with uniform entropy

Recall our setup: We have data X1 , . . . , Xn i.i.d. P on X and a class of real valued functions
F defined on X . For any function f : X → R,
n
1X 2
kf k2n := f (Xi )
n
i=1

denotes the L2 (Pn )-seminorm. Further, recall that the empirical process under consideration

is the stochastic process indexed by F and defined as Gn (f ) = n(Pn − P )(f ), for f ∈ F.
Definition 4.6 (Uniform entropy bound). A class F of measurable functions with measur-
able envelope F satisfies the uniform entropy bound if and only if J(1, F, F ) < ∞ where
Z δ q
J(δ, F, F ) := sup log N (kF kQ,2 , F ∪ {0}, L2 (Q)) d, δ > 0. (39)
0 Q
Here the supremum if taken over all finitely discrete probability measures Q on X with
kF k2Q,2 := F 2 dQ > 0 and we have added the function f ≡ 0 to F. Finiteness of the
R

previous integral will be referred to as the uniform entropy condition.

48
The uniform entropy integral may seem a formidable object, but we shall later see how
to bound it for concrete classes F. Of course, the class F must be totally bounded in L2 (Q)
to make the integrand in the integral bounded, and then still the integral might diverge
(evaluate to +∞). The integrand is a nonincreasing function of , and finiteness of the
R1
integral is therefore determined by its behaviour near  = 0. Because 0 (1/)r d is finite if
r < 1 and infinite if r ≥ 1, convergence of the integral roughly means that, for  ↓ 0,
 1 2
sup log N (kF kQ,2 , F, L2 (Q)) << ,
Q 
where << means smaller in order, or smaller up to an appropriate logarithmic term.

Theorem 4.7. If F is a class of measurable functions with measurable envelope function


F , then h i h i
E kGn kF . E J(θn , F, F )kF kn . J(1, F, F )kF kP,2 , (40)
where θn := supf ∈F kf kn /kF kn .

Proof. By symmetrization (see Theorem 3.17) it suffices to bound EkGon kF ; recall that
Gon (f ) = √1n ni=1 εi f (Xi ) where εi ’s are i.i.d Rademacher. Given X1 , . . . , Xn , the process
P

Gon is sub-Gaussian for the L2 (Pn )-seminorm k · kn (by Lemma 3.12), i.e.,
n n
!
X f (Xi ) X g(Xi ) 2 2
P εi √ − εi √ ≥ u X1 , . . . , Xn ≤ 2e−u /(2kf −gkn ) , ∀ f, g ∈ F, ∀ u ≥ 0.
n n
i=1 i=1
2 := sup 2 2
The value σn,2 f ∈F Pn f = supf ∈F kf kn is an upper bound for the squared radius of
F ∪{0} with respect to this norm. We add the function f ≡ 0 to F, so that the symmetrized
process is zero at some parameter. The maximal inequality (38) (with Xt0 = 0) gives
Z σn,2 p
o
Eε kGn kF . log N (, F ∪ {0}, L2 (Pn )) d, (41)
0

where Eε is the expectation with respect to the Rademacher variables, given fixed X1 , . . . , Xn
(note that log N (, F ∪ {0}, L2 (Pn )) = 0 for any  > σn,2 ). Making a change of variable and
bounding the random entropy by a supremum we see that the right side is bounded by
Z σn,2 /kF kn p
log N (kF kn , F ∪ {0}, L2 (Pn )) dkF kn ≤ J(θn , F, F )kF kn .
0

Next, by taking the expectation over X1 , . . . , Xn we obtain the first inequality of the theo-
rem.
Since θn ≤ 1, we have that J(θn , F, F ) ≤ J(1, F, F ). Furthermore, by Jensen’s in-
equality applied to the root function, EkF kn ≤ E[n−1 ni=1 F 2 (Xi )] = kF kP,2 . This gives
p P

the inequality on the right side of the theorem.

The above theorem shows that the order of magnitude of kGn kF is not bigger than
J(1, F, F ) times the order of kF kP,2 , which is the order of magnitude of the random variable
|Gn (F )| if the entropy integral is finite.

49
Example 4.8 (Supremum of the empirical process). Recall the setting of Section 3.7. Sup-
pose that F is a class of B-uniformly bounded functions such that
 2ν
ν B
N (, F, k · kPn ) ≤ Cν(16e) .

We will see in Section 7 that if F is a function class with finite VC dimension ν, then
the above inequality holds. The goal is to study the expected supremum of the empirical
process over F, i.e., kPn − P kF . From our previous results, we have seen that by exploiting
concentration (see Section 3.7) and symmetrization results (see Theorem 3.17), the study of
kPn − P kF can be reduced to controlling the expectation Eε supf ∈F n1 ni=1 εi f (Xi ) . We
 P 

consider the random variable Zf := √1n ni=1 εi f (Xi ), and consider the stochastic process
P

{Zf : f ∈ F}. We have seen that by Lemma 3.12, the increment Zf − Zg is sub-Gaussian
with parameter kf − gk2n . Consequently, by Dudley’s entropy integral (see (41)), we have
n
" #
1X 24 B p
Z
Eε sup εi f (Xi ) ≤ √ log N (, F ∪ {0}, k · kPn ) d,
f ∈F n i=1 n 0
Z Bp r
1 0 ν
≤ c0 √ log[1 + c(B/)2ν ] d = c0 B ,
n 0 n
since the integral is finite43 ; here c, c0 and c00 are constants. Thus,
r
ν
EkPn − P kF . B .
n
Example 4.9. Suppose that F = {mθ (·) : θ ∈ Θ} is a parameterized class such that
F = −F, where Θ = B(0; 1) ⊂ Rd is the unit Euclidean ball in Rd . Suppose that the class
of functions is L-Lipschitz with respect to the Euclidean distance on Rd so that for all x,

|mθ1 (x) − mθ2 (x)| ≤ Lkθ1 − θ2 k.


 q 
Exercise (HW2): Show that EkPn − P kF = O L nd .

Observe that the right side of (40) depends on the L2 (P )-norm of the envelope function
F , which may, in some situations, be large compared with the maximum L2 (P )-norm of
functions in F, namely, σ := supf ∈F kf kP,2 . In such a case, the following theorem will be
more useful.

Theorem 4.10 ([van der Vaart and Wellner, 2011], [Chernozhukov et al., 2014]44 ). Sup-
pose that 0 < kF kP,2 < ∞, and let σ 2 > 0 be any positive constant such that supf ∈F P f 2 ≤
43
Note that
Z B p Z 1 p √
Z 1 p
log[1 + c(B/)2ν ] d = B log[1 + c(1/)2ν ] d . B ν log(1/) d.
0 0 0

44
A version of this result was proved in [van der Vaart and Wellner, 2011] under the additional assumption
that the envelope F is bounded; the current version is due to [Chernozhukov et al., 2014]. We will skip the
proof of this result.

50
p
σ 2 ≤ kF k2P,2 . Let δ := σ/kF kP,2 . Define B := E[max1≤i≤n F 2 (Xi )]. Then,
h i BJ 2 (δ, F, F )
E kGn kF . J(δ, F, F )kF kP,2 + √ . (42)
δ2 n

4.3 Maximal inequalities with bracketing

As one might expect, there exists maximal inequalities that work with bracketing numbers
(as opposed to covering numbers). However, the bracketing result is more delicate and
difficult to prove45 . We will just state the result and illustrate an application of the result.
Recall that, apart from the constant 1/2, bracketing numbers are bigger than covering
numbers. The advantage of a bracket is that it gives pointwise control over a function:
l(x) ≤ f (x) ≤ u(x), for every x ∈ X . The maximal inequalities in the preceding subsection
(without bracketing) compensate this lack of pointwise control by considering entropy under
every measure Q, not just the law P of the observations. With bracketing we can obtain
analogous results using only bracketing numbers under P .

Definition 4.11 (Bracketing integral). The bracketing integral is defined as


Z δ q
J[ ] (δ, F, L2 (P )) := log N[ ] (, F ∪ {0}, L2 (P )) d < ∞, δ > 0.
0

Theorem 4.12. For a class F of measurable functions with envelope F ,


 
E kGn kF . J[ ] (kF kP,2 , F ∪ {0}, L2 (P )).

The preceding theorem does not take the size of the functions f into account. The
following theorem remedy this, which is however restricted to uniformly bounded classes.

Theorem 4.13 (Lemma 3.4.2 of [van der Vaart and Wellner, 1996]). For any class F of
measurable functions f : X → R such that P f 2 < δ 2 and kf k∞ ≤ M for every f ,

J[ ] (δ, F, L2 (P ))
h i  
E kGn kF . J[ ] (δ, F, L2 (P )) 1 + √ M .
δ2 n

4.4 Bracketing number for some function classes

Here are some important results about the bracketing numbers of nonparametric classes of
functions. A good account on this is [van der Vaart and Wellner, 1996, Section 2.7].

1. Let C1α (E), for a bounded convex subset E of Rd with nonempty interior, be the set
of functions f : E → R with kf k∞ ≤ 1 and with degree of smoothness α (if α ≤ 1,
Hölder of order α and constant 1, and if α > 1, differentiable up to order bαc, the
45
In this subsection we just state a few results without proofs; see [van der Vaart and Wellner, 1996,
Chapter 2.14] for a more detailed discussion with complete proofs of these results.

51
greatest integer smaller than α, with all the partial derivatives of order α, Hölder of
order α−bαc and constant 1, and with all the partial derivatives bounded by 1. Then,

log N[ ] (, C1α (E), Lr (Q)) ≤ K(1/)d/α

for all r ≥ 1,  > 0 and probability measure Q on Rd , where K is a constant that


depends only on α, diam(E) and d.

2. The class F of monotone functions R to [0, 1] satisfies

log N[ ] (, F, Lr (Q)) ≤ K/,

for all Q and all r ≥ 1, for K depending only on r.

52
5 Rates of convergence of M -estimators

Let (Θ, d) be a semimetric space. As usual, we are given i.i.d. observations X1 , X2 , . . . , Xn


from a probability distribution P on X . Let {Mn (θ) : θ ∈ Θ} denote a stochastic process
and let {M (θ) : θ ∈ Θ} denote a deterministic process. Suppose θ̂n maximizes Mn (θ) and
suppose θ0 maximizes M (θ), i.e.,

θ̂n = argmax Mn (θ), and θ0 = argmax M (θ).


θ∈Θ θ∈Θ

We want to find the rate δn of the convergence of θ̂n to θ0 in the metric d, i.e., d(θ̂n , θ0 ). A
rate of convergence46 of δn means that

δn−1 d(θ̂n , θ0 ) = OP (1).

We assume that Mn (θ) gets close to M (θ) as n increases and under this setting want to
know how close θ̂n is to θ0 .

5.1 The rate theorem

If the metric d is chosen appropriately we may expect that the asymptotic criterion decreases
quadratically47 when θ moves away from θ0 :

M (θ) − M (θ0 ) . −d2 (θ, θ0 ) (43)

for all θ ∈ Θ.
Consider the probability P d(θ̂n , θ0 ) > 2M δn for a large M . We want to understand


for which δn this probability becomes small as M grows large. Write


  X  
P d(θ̂n , θ0 ) > 2M δn = P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn .
j>M

Let us define the “shells” Sj := θ ∈ Θ : 2j−1 δn < d(θ, θ0 ) ≤ 2j δn so that




   
P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn = P θ̂n ∈ Sj .
46
Recall that a sequence of random variables {Zn } is said to be bounded in probability or OP (1) if

lim lim sup P(|Zn | > T ) = 0.


T →∞ n→∞

In other words, Zn = OP (1), if for any given  > 0, ∃ T , N > 0 such that P(|Zn | > T ) <  for all n ≥ N .
47
To get intuition about this condition assume that if M : Rd → R is twice continuously differentiable and
d(·, ·) is the Euclidean distance, then, for θ in a neighborhood of θ0 ,
1
M (θ) − M (θ0 ) = ∇M (θ0 )> (θ − θ0 ) + (θ − θ0 )> ∇2 M (θ̃0 )(θ − θ0 ) ≤ −ckθ − θ0 k2
2
where ∇M (θ0 ) = 0 (as θ0 is a maximizer of M (·)) and ∇2 M (θ0 ) (the Hessian matrix of M (·)) is assumed
to be negative definite, in which case we can find such a constant c > 0 (corresponding to the smallest
eigenvalue of ∇2 M (θ0 )); here θ̃0 is a point close to θ0 .

53
As θ̂n maximizes Mn (θ), it is obvious that
   
P θ̂n ∈ Sj ≤ P sup (Mn (θ) − Mn (θ0 )) ≥ 0 .
θ∈Sj

Now d(θ, θ0 ) > 2j−1 δn for θ ∈ Sj which implies, by (43), that

M (θ) − M (θ0 ) . −d2 (θ, θ0 ) . −22j−2 δn2 for θ ∈ Sj (44)

or supθ∈Sj [M (θ) − M (θ0 )] . −22j−2 δn2 . Thus, the event supθ∈Sj [Mn (θ) − Mn (θ0 )] ≥ 0 can
only happen if Mn and M are not too close. Let

Un (θ) := Mn (θ) − M (θ), for θ ∈ Θ.

It follows from (44) that


   
P sup [Mn (θ) − Mn (θ0 )] ≥ 0 ≤ P sup [Un (θ) − Un (θ0 )] & 22j−2 δn2
θ∈Sj θ∈Sj
!
≤P sup [Un (θ) − Un (θ0 )] & 22j−2 δn2
θ:d(θ,θ0 )≤2j δn
" #
1
. E sup (Un (θ) − Un (θ0 )) .
22j−2 δn2 θ:d(θ,θ0 )≤2j δn

Suppose that there is a function φn (·) such that


" #

E sup n(Un (θ) − Un (θ0 )) . φn (u) for every u > 0. (45)
θ:d(θ,θ0 )≤u

We thus get
  φ (2j δ )
n n
P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn . √ 2j 2
n2 δn
for every j. As a consequence,
  1 X φn (2j δn )
P d(θ̂n , θ0 ) > 2M δn . √ .
n 22j δn2
j>M

The following assumption on φn (·) is usually made to simplify the expression above: there
exists 0 < α < 2 such that

φn (cx) ≤ cα φn (x) for all c > 1 and x > 0. (46)

Under this assumption, we get


  φ (δ ) X
n n
P d(θ̂n , θ0 ) > 2M δn . √ 2 2j(α−2) .
nδn
j>M

2j(α−2) converges to zero as M → ∞. Observe that if we further


P
The quantity j>M
assume that

φn (δn ) . nδn2 , as n varies, (47)

54
then   X
P d(θ̂n , θ0 ) > 2M δn ≤ c 2j(α−2) ,
j>M

for a constant c > 0 (which does not depend on n, M ). Let uM denote the right side of the
last display. It follows therefore that, under assumptions (46) and (106), we get

d(θ̂n , θ0 ) ≤ 2M δn with probability at least 1 − uM , for all n.

Further note that uM → 0 as M → ∞. This gives us the following non-asymptotic rate of


convergence theorem.

Theorem 5.1. Let (Θ, d) be a semi-metric space. Fix n ≥ 1. Let {Mn (θ) : θ ∈ Θ} be a
stochastic process and {M (θ) : θ ∈ Θ} be a deterministic process. Assume condition (43)
and that the function φn (·) satisfies (45) and (46). Then for every M > 0, we get d(θ̂n , θ0 ) ≤
2M δn with probability at least 1 − uM provided (106) holds. Here uM → 0 as M → ∞.

Suppose now that condition (43) holds only for θ in a neighborhood of θ0 and that (45)
holds only for small u. Then one can prove the following asymptotic result under the
P
additional condition that θ̂n is consistent (i.e., d(θ̂n , θ0 ) → 0).

Theorem 5.2 (Rate theorem). Let Θ be a semi-metric space. Let {Mn (θ) : θ ∈ Θ} be
a stochastic process and {M (θ) : θ ∈ Θ} be a deterministic process. Assume that (43) is
satisfied for every θ in a neighborhood of θ0 . Also, assume that for every n and sufficiently
small u condition (45) holds for some function φn satisfying (46), and that (106) holds. If
the sequence θ̂n satisfies Mn (θ̂n ) ≥ Mn (θ0 ) − OP (δn2 ) and if θ̂n is consistent in estimating
θ0 , then d(θ̂n , θ0 ) = OP (δn ).

Proof. The above result is Theorem 3.2.5 in [van der Vaart and Wellner, 1996] where you
can find its proof. The proof is very similar to the proof of Theorem 5.1. The crucial
observation is to realize that: for any η > 0,
  X    
P d(θ̂n , θ0 ) > 2M δn ≤ P 2j−1 δn < d(θ̂n , θ0 ) ≤ 2j δn + P 2d(θ̂n , θ0 ) > η .
j>M,2j−1 δn ≤η

The first term can be tackled as before while the second term goes to zero by the consistency
of θ̂n .

Remark 5.1. In the case of i.i.d. data and criterion functions of the form Mn (θ) = Pn [mθ ]

and M (θ) = P [mθ ], the centered and scaled process n(Mn − M )(θ) = Gn [mθ ] equals
the empirical process at mθ . Condition (45) involves the suprema of the empirical process
indexed by classes of functions

Mu := {mθ − mθ0 : d(θ, θ0 ) ≤ u}.

Thus, we need to find the existence of φn (·) such that EkGn kMu . φn (u).

55
Remark 5.2. Theorem 5.2 gives the correct rate in fair generality, the main problem being
to derive sharp bounds on the modulus of continuity of the empirical process. A simple,
but not necessarily efficient, method is to apply the maximal inequalities (with and without
bracketing). These yield bounds in terms of the uniform entropy integral J(1, Mu , Mu ) or
the bracketing integral J[ ] (kMu kP,2 , Mu , L2 (P )) of the class Mu given by

E kGn kMu . J(1, Mu , Mu )[P (Mu2 )]1/2


 
(48)

where Z 1 q
J(1, Mu , Mu ) = sup log N (kMu kQ,2 , Mu , L2 (Q)) d
0 Q

and
 
E kGn kMu . J[ ] (kMu k, Mu , L2 (P )),

where Z δ q
J[ ] (δ, Mu , L2 (P )) = log N[ ] (, Mu , L2 (P )) d.
0

Here Mu is the envelope function of the class Mu . In this case, we can take φ2n (u) = P [Mu2 ]
and this leads to a rate of convergence δn of at least the solution of

P [Mδ2n ] ∼ nδn4 .

Observe that the rate of convergence in this case is driven by the sizes of the envelope
functions as u ↓ 0, and the size of the classes is important only to guarantee a finite entropy
integral.

Remark 5.3. In genuinely infinite-dimensional situations, this approach could be less use-
ful, as it is intuitively clear that the precise entropy must make a difference for the rate of
convergence. In this situation, the maximal inequalities obtained in Section 4 may be used.

Remark 5.4. For a Euclidean parameter space, the first condition of the theorem is satisfied
if the map θ 7→ P mθ is twice continuously differentiable at the point of maximum θ0 with a
nonsingular second-derivative matrix.

5.2 Some examples

5.2.1 Euclidean parameter

Let X1 , . . . , Xn be i.i.d. random elements on X with a common law P , and let {mθ : θ ∈ Θ}
be a class of real-valued measurable maps. Suppose that Θ ⊂ Rd , and that, for every
θ1 , θ2 ∈ Θ (or just in a neighborhood of θ0 ),

|mθ1 (x) − mθ2 (x)| ≤ F (x)kθ1 − θ2 k (49)

56
for some measurable function F : X → R with P F 2 < ∞. Then the class of functions
Mδ := {mθ − mθ0 : kθ − θ0 k ≤ δ} has envelope function δF and bracketing number (see
Theorem 2.14) satisfying
 d

N[ ] (2kF kP,2 , Mδ , L2 (P )) ≤ N (, {θ : kθ − θ0 k ≤ δ}, k · k) ≤ ,


where the last inequality follows from Lemma 2.7 coupled with the fact that the -covering
number of δB (for any set B) is the /δ-covering number of B. In view of the maximal
inequality with bracketing (see Theorem 4.12),

 
Z δkF kP,2 q
EP kGn kMδ . log N[ ] (, Mδ , L2 (P )) d . δkF kP,2 .
0
√ √
Thus, we can take φn (δ)  δ, and the inequality φn (δn ) ≤ nδn2 is solved by δn = 1/ n. We
conclude that the rate of convergence of θ̂n is n−1/2 as soon as P (mθ − mθ0 ) ≤ −ckθ − θ0 k2 ,
for every θ ∈ Θ in a neighborhood of θ0 .

Example 5.3 (Least absolute deviation regression). Given i.i.d. random vectors Z1 , . . . , Zn ,
and e1 , . . . , en in Rd and R, respectively, let

Yi = θ0> Zi + ei .

The least absolute-deviation estimator θ̂n minimizes the function


n
1X
θ 7→ |Yi − θ> Zi | = Pn mθ ,
n
i=1

where Pn is the empirical measure of Xi := (Zi , Yi ), and mθ (x) = |y − θ> z|.


Exercise (HW2): Show that the parameter θ0 is a point of minimum of the map θ 7→
P |Y − θ> Z| if the distribution of the error e1 has median zero. Furthermore, show that the
maps θ 7→ mθ satisfies condition (49):

|y − θ1> z| − |y − θ2> z| ≤ kθ1 − θ2 kkzk.

Argue the consistency of the least-absolute-deviation estimator from the convexity of the
map θ 7→ |y − θ> z|. Moreover, show that the map θ 7→ P |Y − θ> Z| is twice differentiable
at θ0 if the distribution of the errors has a positive density at its median (you may need to
assume that Z and e are integrable and E[ZZ > ] is positive definite). Furthermore, derive
the rate of convergence of θ̂n in this situation.

5.2.2 A non-standard example

Example 5.4 (Analysis of the shorth). Suppose that X1 , . . . , Xn are i.i.d. P on R with
a differentiable density p with respect to the Lebesgue measure. Let FX be the distribution

57
function of X. Suppose that p is a unimodal (bounded) continuously differentiable symmetric
density with mode θ0 (with p0 (x) > 0 for x < θ0 and p0 (x) < 0 for x > θ0 ). We want to
estimate θ0 .
Exercise (HW2): Let

M(θ) := P mθ = P(|X − θ| ≤ 1) = FX (θ + 1) − FX (θ − 1)

where mθ (x) = 1[θ−1,θ+1] (x). Show that θ0 = argmaxθ∈R M(θ). Thus, θ0 is the center of an
interval of length 2 that contains the largest possible (population) fraction of data points.
We can estimate θ0 by

θ̂n := argmax Mn (θ), where Mn (θ) = Pn [mθ ].


θ∈R
P
Show that θ̂n → θ0 ? The functions mθ (x) = 1[θ−1,θ+1] (x) are not Lipschitz in the parameter
θ ∈ Θ ≡ R. Nevertheless, the classes of functions Mδ satisfy the conditions of Theorem 5.2.
These classes have envelope function

sup 1[θ−1,θ+1] − 1[θ0 −1,θ0 +1] ≤ 1[θ0 −1−δ,θ0 −1+δ] + 1[θ0 +1−δ,θ0 +1+δ] .
|θ−θ0 |≤δ

The L2 (P )-norm of these functions is bounded above by a constant times δ. Thus, the

conditions of the rate theorem are satisfied with φn (δ) = c δ for some constant c, leading to
a rate of convergence of n−1/3 . We will show later that n1/3 (θ̂n −θ0 ) converges in distribution
to a non-normal limit as n → ∞.
Example 5.5 (A toy change point problem). Suppose that we have i.i.d. data {Xi =
(Zi , Yi ) : i = 1, . . . , n} where Zi ∼ Unif(0, 1) and

Yi = 1[0,θ0 ] (Zi ) + i , for i = 1, . . . , n.

Here, i ’s are the unobserved errors assumed to be i.i.d. N (0, σ 2 ). Further, for simplicity,
we assume that i is independent of Zi . The goal is to estimate the unknown parameter
θ0 ∈ (0, 1). A natural procedure is to consider the least squares estimator:

θ̂n := argmin Pn [(Y − 1[0,θ] (Z))2 ].


θ∈[0,1]

Exercise (HW2): Show that θ̂n := argmaxθ∈[0,1] Mn (θ) where

Mn (θ) := Pn [(Y − 1/2){1[0,θ] (Z) − 1[0,θ0 ] (Z)}].

Prove that Mn converges uniformly to

M (θ) := P [(Y − 1/2){1[0,θ] (Z) − 1[0,θ0 ] (Z)}].


P
Show that M (θ) = −|θ − θ0 |/2. As a consequence, show that θ̂n → θ0 .
p
To find the rate of convergence of θ̂n we consider the metric d(θ1 , θ2 ) := |θ1 − θ2 |.
Show that the conditions needed to apply Theorem 5.2 hold with this choice of d(·, ·). Using
Theorem 5.2 derive that n(θ̂n − θ0 ) = OP (1).

58
5.2.3 Persistency in high-dimensional regression

Let Z i := (Y i , X1i , . . . , Xpi ), i = 1, . . . , n, be i.i.d. random vectors, where Z i ∼ P . It is


desired to predict Y by j βj Xj , where (β1 , . . . , βp ) ∈ Bn ⊂ Rp , under a prediction loss.
P

We assume that p = nα , α > 0, that is, there could be many more explanatory variables
than observations. We consider sets Bn restricted by the maximal number of non-zero
coefficients of their members, or by their l1 -radius. We study the following asymptotic
question: how ‘large’ may the set Bn be, so that it is still possible to select empirically a
predictor whose risk under P is close to that of the best predictor in the set?
We formulate this problem using a triangular array setup, i.e., we model the observa-
tions Zn1 , . . . , Znn as i.i.d. random vectors in Rpn +1 , having distribution Pn (that depends on
n). In the following we will hide the dependence on n and just write Z 1 , . . . , Z n . We will
consider Bn of the form
Bn,b := {β ∈ Rpn : kβk1 ≤ b}, (50)

where k · k1 denotes the l1 -norm. For any Z := (Y, X1 , . . . , Xp ) ∼ P , we will denote the
expected prediction error by
h p
X i h i
LP (β) := EP (Y − βj Xj )2 = EP (Y − β > X)2
j=1

where X = (X1 , . . . , Xp ). The best linear predictor, where Z ∼ Pn , is given by

βn∗ := arg min LPn (β),


β∈Bn,bn

for some sequence of {bn }n≥1 . We estimate the best linear predictor βn∗ from the sample by
n
1X i
β̂n := arg min LPn (β) = arg min (Y − β > X i )2 ,
β∈Bn,bn β∈Bn,bn n
i=1

where Pn is the empirical measure of the Z i ’s.


We study the following asymptotic question: how ‘large’ may the set Bn,bn be, so that
it is still possible to select empirically a predictor whose risk under Pn is close to that of
the best predictor in the set?
We say that β̂n is persistent (relative to Bn,bn and Pn ) ([Greenshtein and Ritov, 2004])
if and only if
LPn (β̂n ) − LPn (βn∗ ) → 0.
P

This is certainly a weak notion of “risk-consistency” — we are only trying to consistently


estimate the expected predictor error. However, as we will see soon, this notion does not
require any modeling assumptions on the (joint) distribution of Z (in particular, we are not
assuming that there is a ‘true’ linear model). The following theorem is a version of Theorem
3 in [Greenshtein and Ritov, 2004].

59
Theorem 5.6. Suppose that pn = nα , where α > 0. Let

F (Z i ) := max |Xji Xki − EPn (Xji Xki )|, where we take X0i = Y i , for i = 1, . . . , n.
0≤j,k≤p

Suppose that EPn [F 2 (Z 1 )] ≤ M < ∞, for all n. Then for bn = o((n/ log n)1/4 ), β̂n is
persistent relative to Bn,bn .

Proof. From the definition of βn∗ and β̂n it follows that

LPn (β̂n ) − LPn (βn∗ ) ≥ 0, and LPn (β̂n ) − LPn (βn∗ ) ≤ 0.

Thus,

0 ≤ LPn (β̂n ) − LPn (βn∗ )


     
= LPn (β̂n ) − LPn (β̂n ) + LPn (β̂n ) − LPn (βn∗ ) + LPn (βn∗ ) − LPn (βn∗ )
≤ 2 sup |LPn (β) − LPn (β)|,
β∈Bn,bn

where we have used the fact that LPn (β̂n ) − LPn (βn∗ ) ≤ 0. To simply our notation, let
γ = (−1, β) ∈ Rpn +1 . Then LPn(β) = γ > ΣPn 
 γ and LPn (β) = γ > ΣPn γ where ΣPn =
EPn (Xj1 Xk1 ) and ΣPn = n1 ni=1 Xji Xki
P
. Thus,
0≤j,k≤pn 0≤j,k≤pn

|LPn (β) − LPn (β)| ≤ |γ > (ΣPn − ΣPn )γ| ≤ kΣPn − ΣPn k∞ kγk21 ,
1 Pn i i
where kΣPn − ΣPn k∞ = sup0≤j,k≤pn n i=1 Xj Xk − EPn (Xj1 Xk1 ) . Therefore,
 
P LPn (β̂n ) − LPn (βn∗ ) >  ≤ P 2 sup |LPn (β) − LPn (β)| > 

β∈Bn,bn
 
≤ P 2(bn + 1)2 kΣPn − ΣPn k∞ > 
2(bn + 1)2 h i
≤ E kΣPn − ΣPn k∞ . (51)

Let F = {fj,k : 0 ≤ j, k ≤ pn } where fj,k (z) := xj xk −EPn (Xj1 Xk1 ) and z = (x0 , x1 , . . . , xpn ).
Observe that kΣPn −ΣPn k∞ = kPn −Pn kF . We will now use the following maximal inequality
with bracketing entropy (see Theorem 4.12):

Ek n(Pn − P )kF . J[ ] (kFn kPn ,2 , F ∪ {0}, L2 (Pn )),

where Fn is an envelope of F. Note that Fn can be taken as F (defined in the statement


of the theorem). We can obviously cover F with the -brackets [fj,k − /2, fj,k + /2], for
every  > 0, and thus, N[ ] (, F, L2 (Pn )) ≤ (pn + 1)2 . Therefore, using (51) and the maximal
inequality above,

 2(bn + 1)2 2 log(pn + 1) √
p
∗ b2n α log n
P LPn (β̂n ) − LPn (βn ) >  . √ M. √ → 0,
 n n
as n → ∞, by the assumption on bn .

60
6 Rates of convergence of infinite dimensional parameters

If Θ is an infinite-dimensional set, such as a function space, then maximization of a criterion


over the full space may not always be a good idea. For instance, consider fitting a function
θ : [0, 1] → R to a set of observations (z1 , Y1 ), . . . , (zn , Yn ) by least squares, i.e., we minimize
n
1X
θ→
7 {Yi − θ(zi )}2 .
n
i=1

If Θ consists of all functions θ : [0, 1] → R, then obviously the minimum is 0, taken for any
function that interpolates the data points exactly: θ(zi ) = Yi for every i = 1, . . . , n. This
interpolation is typically not a good estimator, but overfits the data: it follows the given
data exactly even though these probably contain error. The interpolation very likely gives
a poor representation of the true regression function.
One way to rectify this problem is to consider minimization over a restricted class
of functions. For example, the minimization can be carried out over all functions with 2
derivatives, which are bounded above by 10 throughout the interval; here the numbers 2
and (particularly) 10 are quite arbitrary. To prevent overfitting the size of the derivatives
should not be too large, but can grow as we obtain more samples.
The method of sieves is an attempt to implement this. Sieves are subsets Θn ⊂ Θ,
typically increasing in n, that can approximate any given function θ0 that is considered
likely to be “true”. Given n observations the maximization is restricted to Θn , and as n
increases this “sieve” is taken larger. In this section we extend the rate theorem in the
previous section to sieved M -estimators, which include maximum likelihood estimators and
least-squares estimators.
We also generalize the notation and other assumptions. In the next theorem the em-
pirical criterion θ 7→ Pn mθ is replaced by a general stochastic process

θ 7→ Mn (θ).

It is then understood that each “estimator” θ̂n is a map defined on the same probability
space as Mn , with values in the index set Θn (which may be arbitrary set) of the process
Mn .
Corresponding to the criterion functions are centering functions θ 7→ Mn (θ) and “true
parameters” θn,0 . These may be the mean functions of the processes Mn and their point of
maximum, but this is not an assumption.
In this generality we also need not assume that Θn is a metric space, but measure the
“discrepancy” or “distance” between θ and the true “value” θn,0 by a map θ 7→ dn (θ, θn,0 )
from Θn to [0, ∞).

Theorem 6.1 (Rate of convergence). For each n, let Mn and Mn be stochastic processes
indexed by a set Θn ∪ {θn,0 }, and let θ 7→ dn (θ, θn,0 ) be an arbitrary map from Θn to [0, ∞).

61
Let δ̃n ≥ 0 and suppose that, for every n and δ > δ̃n ,

sup [Mn (θ) − Mn (θn,0 )] ≤ −cδ 2 , (52)


θ∈Θn :δ/2<dn (θ,θn,0 )≤δ

for some c > 0 (for all n ≥ 1) and


" #

E sup n (Mn − Mn )(θ) − (Mn − Mn )(θn,0 ) . φn (δ),
θ∈Θn :dn (θ,θn,0 )≤δ

for increasing functions φn : [δ̃n , ∞) → R such that δ 7→ φn (δ)/δ α is decreasing for some
0 < α < 2. Let θn ∈ Θn and let δn satisfy

φn (δn ) ≤ nδn2 , δn2 ≥ Mn (θn,0 ) − Mn (θn ), δn ≥ δ̃n .

If the sequence θ̂n takes values in Θn and satisfies Mn (θ̂n ) ≥ Mn (θn ) − OP (δn2 ), then

dn (θ̂n , θn,0 ) = OP (δn ).

Exercise (HW2): Complete the proof. Hint: The proof is similar to that of the previous
rate theorem. That all entities are now allowed to depend on n asks for notational changes
only, but the possible discrepancy between θn and θn,0 requires some care.
The theorem can be applied with θ̂n and θn,0 equal to the maximizers of θ 7→ Mn (θ)
over a sieve Θn and of θ 7→ Mn (θ) over a full parameter set Θ, respectively. Then (52)
requires that the centering functions fall off quadratically in the “distance” dn (θ, θn,0 ) as θ
moves away from the maximizing value θn,0 . We use δ̃n = 0, and the theorem shows that
the “distance” of θ̂n to θn,0 satisfies

d2n (θ̂n , θn,0 ) = OP (δn2 + Mn (θn,0 ) − Mn (θn )), (53)



for δn solving φn (δn ) ≤ nδn2 and for any θn ∈ Θn . Thus the rate δn is determined by the

“modulus of continuity” δ 7→ φn (δ) of the centered processes n(Mn −Mn ) over Θn and the
discrepancy Mn (θn,0 ) − Mn (θn ). The latter vanishes if θn = θn,0 but this choice of θn may
not be admissible (as θn must be an element of the sieve and θn,0 need not). A natural choice
of θn is to take θn as the closest element to θn,0 in Θn , e.g., θn := argminθ∈Θn dn (θ, θn,0 ).
Typically, small sieves Θn lead to a small modulus, hence fast δn in (53). On the other
hand, the discrepancy Mn (θn,0 ) − Mn (θn ) of a small sieve will be large. Thus, the two terms
in the right side of (53) may be loosely understood as a “variance” and a “squared bias”
term, which must be balanced to obtain a good rate of convergence. We note that in many
problems an un-sieved M -estimator actually performs well, so the trade-off should not be
understood too literally: it may work well to reduce the “bias” to zero.

62
6.1 Least squares regression on sieves

Suppose that we have data

Yi = θ0 (zi ) + i , for i = 1, . . . , n, (54)

where Yi ∈ R is the observed response variable, zi ∈ Z is a covariate, and i is the unobserved


error. The errors are assumed to be independent random variables with expectation Ei = 0
and variance Var(i ) ≤ σ02 < ∞, for i = 1, . . . , n. The covariates z1 , . . . , zn are fixed, i.e.,
we consider the case of fixed design. The function θ0 : Z → R is unknown, but we assume
that θ0 ∈ Θ, where Θ is a given class of regression functions.
The unknown regression function can be estimated by the sieved-least squares estimator
(LSE) θ̂n , which is defined (not necessarily uniquely) by
n
1X
θ̂n = arg min (Yi − θ(zi ))2 ,
θ∈Θn n
i=1

where Θn is a set of regression functions θ : Z → R. Inserting the expression for Yi and


calculating the square, we see that θ̂n maximizes
n
2X
Mn (θ) = (θ − θ0 )(zi )i − Pn (θ − θ0 )2 ,
n
i=1

where Pn is the empirical measure on the design points z1 , . . . , zn . This criterion function
is not observable but is of simpler character than the sum of squares. Note that the second
term is assumed non-random, the randomness solely residing in the error terms.
Under the assumption that the error variables have mean zero, the mean of Mn (θ) is
Mn (θ) = −Pn (θ − θ0 )2 and can be used as a centering function. It satisfies, for every θ,

Mn (θ) − Mn (θ0 ) = −Pn (θ − θ0 )2 .

Thus, Theorem 6.1 applies with dn (θ, θ0 ) equal to the L2 (Pn )-distance on the set of regres-
sion functions. The modulus of continuity condition takes the form
n
1 X
φn (δ) ≥ E sup √ (θ − θ0 )(zi )i . (55)
Pn (θ−θ0 )2 ≤δ 2 ,θ∈Θn n
i=1

Theorem 6.2. If Y1 , . . . , Yn are independent random variables satisfying (16) for fixed
design points z1 , . . . , zn and errors 1 , . . . , n with mean 0, then the minimizer θ̂n over Θn
of the least squares criterion satisfies

kθ̂n − θ0 kPn ,2 = OP (δn )



for δn satisfying δn ≥ kθ0 − Θn kPn ,2 and φn (δn ) ≤ nδn2 for φn in (55) such that δ 7→
φn (δ)/δ α is decreasing for some 0 < α < 2.

63
Since the design points are non-random, the modulus (55) involves relatively simple
multiplier processes, to which the abstract maximal inequalities may apply directly. In par-
ticular, if the error variables are sub-Gaussian, then the stochastic process {n−1/2 ni=1 (θ −
P

θ0 )(zi )i : θ ∈ Θn } is sub-Gaussian with respect to the L2 (Pn )- semimetric on the set of
regression functions. Thus, using (41), we may choose
Z δ p
φn (δ) = log N (, Θn ∩ {θ : Pn (θ − θ0 )2 ≤ δ 2 }, L2 (Pn )) d.
0

Example 6.3 (Bounded isotonic regression). Let Θn = Θ = {f : [0, 1] → [0, 1] : f is nondecreasing}.


By Theorem 2.7.5 of [van der Vaart and Wellner, 1996] we see that

log N (, Θ, L2 (Pn )) ≤ K−1 ,


√ Rδ √ √
where K > 0 is a universal constant. Thus, we can take φn (δ) = K 0 −1/2 d = 2 K δ.
√ √
Thus we solve δn = δn2 n to obtain the rate of convergence of δn = n−1/3 .

Example 6.4 (Lipschitz regression). Let Θ = Θn := {f : [0, 1] → [0, 1] | f is 1-Lipschitz}.



By Lemma 2.8, we see that φn (δ) can be taken48 to be δ which yields the rate of δn = n−1/3 .

Example 6.5 (Hölder smooth functions). For α > 0, we consider the class of all functions
on a bounded set X ⊂ Rd that possess uniformly bounded partial derivatives up to bαc and
whose highest partial derivates are ‘Lipschitz’ (actually Hölder) of order α − bαc49 .
Let X = [0, 1]d and let Θn = C1α ([0, 1]d ). Then, log N (, Θ, L2 (Pn )) ≤ log N (, Θ, k ·

k∞ ) . −d/α . Thus, for α > d/2 this leads to φn (δ)  δ 1−d/(2α) and hence, φn (δ) ≤ δn2 n
48
Note that a -cover in the k · k∞ -norm (as in Lemma 2.8) also yields a a cover in the L2 (Pn )-seminorm.
49
i.e., for any vector k = (k1 , . . . , kd ) of d integers the differential operator

∂ k.
Dk = k
,
∂xk1 1 · · · ∂xdd
Pd
where k. = i=1 ki . Then for a function f : X → R, let

Dk f (x) − Dk f (y)
kf kα := max sup Dk f (x) + max sup ,
k. ≤bαc x k. =bαc x,y kx − ykα−bαc
α
where the supremum is taken over all x, y in the interior of X with x 6= y. Let CM (X ) be the set of all continu-
ous functions f : X → R with kf kα ≤ M . The following lemma, proved in [van der Vaart and Wellner, 1996,
α
Chapter 7], bounds the entropy number of the class CM (X ).

Lemma 6.6. Let X be a bounded, convex subset of Rd with nonempty interior. Then there exists a constant
K, depending only on α and d, and a constant K 0 , depending only on α, diam(X ) and d, such that

log N (, C1α (X ), k · k∞ ) ≤ Kλ(X 1 )−d/α ,


log N[ ] (, C1α (X ), Lr (Q)) ≤ K 0 −d/α ,

for every  > 0, r ≥ 1, where λ(X 1 ) is the Lebesgue measure of the set {x : kx − X k ≤ 1} and Q is any
probability measure on Rd . Note that k · k∞ denotes the supremum norm.

64
can be solved to obtain the rate of convergence δn & n−α/(2α+d) . The rate relative to the
empirical L2 -norm is bounded above by

n−α/(2α+d) + kθ0 − Θn kPn,2 .

For θ0 ∈ C1α ([0, 1]d ) the second term vanishes; the first is known to be the minimax rate
over this set.

Exercise (HW2) (Convex regression): Suppose that θ0 : C → R is known to be a convex


function over its domain C, some convex and open subset of Rd . In this case, it is natural
to consider the LSE with a convexity constraint — namely
n
1X
θ̂n ∈ argmin (Yi − f (zi ))2 . (56)
f :C→R “convex” n i=1

As stated, this optimization problem is infinite-dimensional in nature. Fortunately, by


exploiting the structure of convex functions, it can be converted to an equivalent finite-
dimensional problem50 . Show that the above LSE can be computed by solving the opti-
mization problem:
n
1X
min (Yi − ui )2 s.t. ui + ξi> (zj − zi ) ≤ uj ∀i 6= j.
u1 ,...,un ∈R;ξ1 ,...,ξn ∈Rd n
i=1

Note that this is a convex program in N = n(d + 1) variables, with a quadratic cost function
and a total of n(n − 1) linear constraints. Give the form of a LSE θ̂n .
Suppose now that C = [0, 1]d , and instead of minimizing (56) over the class of all
convex functions, we minimize over the class of all L-Lipschitz convex functions. Find the
rate of convergence of the LSE (over all L-Lipschitz convex functions).

6.2 Least squares regression: a finite sample inequality

In the standard nonparametric regression model, we assume the noise variables in (54) are
drawn in an i.i.d. manner from the N (0, σ 2 ) distribution, where σ > 0 is the unknown
standard deviation parameter. In this case, we can write i = σwi , where wi ∼ N (0, 1)
are i.i.d. We change our notation slightly and assume that f ∗ : Z → R is the unknown
regression function (i.e., f ∗ ≡ θ0 in (54)).
50
Any convex function f is subdifferentiable at each point in the (relative) interior of its domain C. More
precisely, at any interior point z ∈ C, there exists at least one vector ξ ∈ Rd such that

f (z) + ξ > (x − z) ≤ f (x), for all x ∈ C.

Any such vector is known as a subgradient, and each point z ∈ C can be associated with the set ∂f (z) of its
subgradients, which is known as the subdifferential of f at z. When f is actually differentiable at z, then the
above inequality holds if and only if ξ = ∇f (z), so that we have ∂f (z) = {∇f (z)}. See standard references
in convex analysis for more on this.

65
Our main result in this section yields a finite sample inequality for the L2 (Pn )-loss of
the constrained LSE
n
1X
fˆn ∈ argmin {Yi − f (zi )}2 ;
f ∈F n i=1

i.e., we study the error kfˆn − f ∗ k2n := n1 i=1 {fˆn (zi ) − f ∗ (zi )}2 . This error is expressed in
Pn

terms of a localized form of Gaussian complexity: it measures the complexity of the function
class F, locally in a neighborhood around the true regression function f ∗ . More precisely,
we define the set:
F ∗ := F − f ∗ = {f − f ∗ : f ∈ F} (57)

corresponding to an f ∗ -shifted version of the original function class F. For a given radius
δ > 0, the local Gaussian complexity around f ∗ at scale δ is given by
n
" #
∗ 1X
Gn (δ; F ) := Ew sup wi g(zi )
g∈F ∗ :kgkn ≤δ n i=1

where the expectation is w.r.t. the variables {wi }ni=1 which are i.i.d. N (0, 1).
A function class H is star-shaped if for any h ∈ H and α ∈ [0, 1], the rescaled function
αh also belongs to H. Recall the basic inequality for nonparametric least squares:
n
1 ˆ σX
kfn − f ∗ k2n ≤ wi {f (zi ) − f ∗ (zi )}. (58)
2 n
i=1

A central object in our analysis is the set of δ > 0 that satisfy the critical inequality

δ2
Gn (δ; F ∗ ) ≤ . (59)

It can be shown that the star-shaped condition ensures existence of the critical radius51 .
51
Let H be a star-shaped class of functions.

Lemma 6.7. For any star-shaped function class H, the function δ 7→ Gn (δ, H)/δ is nonincreasing on the
interval (0, ∞). Consequently, for any constant c > 0, the inequality Gn (δ, H) ≤ cδ 2 has a smallest positive
solution.

Proof. For a pair 0 < δ ≤ t, it suffices to show that δt Gn (t; H) ≤ Gn (δ; H). Given any function h ∈ H with
khkn ≤ t, we may define the rescaled function h̃ = δt h. By construction, we have kh̃kn ≤ δ; moreover, since
δ ≤ t, the star-shaped assumption on H guarantees that h̃ ∈ H. Thus, write
n n n
1 δX 1 X 1 X
wi h(zi ) = wi h̃(zi ) ≤ sup wi g(zi ) .
n t i=1 n i=1 g∈H:kgkn ≤δ n i=1

Taking the supremum over the set H ∩ {khkn ≤ t} on the left-hand side followed by expectations yields
δ
G (t; H) ≤ Gn (δ; H), which completes the proof of the first part. As Gn (δ; H)/δ is nonincreasing and cδ
t n
is nondecreasing (in δ) on (0, ∞), the inequality Gn (δ, H) ≤ cδ 2 has a smallest positive solution.

66
Theorem 6.8. Suppose that the shifted function class F ∗ is star-shaped, and let δn be any
positive solution to the critical inequality (59). Then for any t ≥ δn , the LSE fˆn satisfies
the bound
ntδn
 
P kfˆn − f ∗ k2n ≥ 16tδn ≤ e− 2σ2 .

Exercise (HW2): By integrating this tail bound, show that the mean-squared error in the
L2 (Pn )-semi-norm is upper bounded as

σ2
h i  
ˆ ∗ 2 2
E kfn − f kn ≤ c δn +
n
for some universal constant c.

Proof. Recall the basic inequality (58). In terms of the shorthand notation ∆ ˆ := fˆn − f ∗ , it
ˆ 2 ≤ σ
can be written as 12 k∆k
P n ˆ ˆ ˆ ∗
n n i=1 wi ∆(zi ). By definition, the error function ∆ = fn − f
belongs to the shifted function class F ∗ . We will need the following lemma.

Lemma 6.9. Let H be an arbitrary star-shaped function class, and let δn > 0 satisfy the
inequality Gn (δ; H) ≤ δ 2 /(2σ). For a given scalar u ≥ δn , define the event
n
( )
σX
A(u) := ∃ g ∈ {h ∈ H : khkn ≥ u} : wi g(zi ) ≥ 2kgkn u . (60)
n
i=1

Then, for all u ≥ δn , we have


nu2
P(A(u)) ≤ e− 2σ2 .

We will prove the main theorem using the lemma for the time being; we take H = F ∗
√ c
√ ntδ 2
− 2n
and u = tδn for some t ≥ δn , so that we can write P(A ( tδn )) ≥ 1 − e 2σ . Note that
   
ˆ 2 ≤ 16tδn ) = P k∆k
P(k∆k ˆ 2 ≤ 16tδn , k∆k
ˆ 2 < tδn + P k∆k ˆ 2 ≤ 16tδn , k∆kˆ 2 ≥ tδn
n n n n n
   
= P k∆kˆ 2n < tδn + P tδn ≤ k∆k ˆ 2n ≤ 16tδn
   
ˆ 2n < tδn + P tδn ≤ k∆k ˆ 2n ≤ 16tδn , Ac ( tδn )
p
≥ P k∆k
   
ˆ 2 < tδn + P tδn ≤ k∆k ˆ 2 , Ac ( tδn )
p
= P k∆k n n (61)
  2
ntδn
≥ P Ac ( tδn ) ≥ 1 − e− 2σ2 ,
p

ˆ 2 ≥ tδn and
where the only nontrivial step is (61), which we explain next. Note that if k∆kn

Ac ( tδn ) holds, then
n
1X ˆ ˆ n tδn .
p
wi ∆(zi ) ≤ 2k∆k
n
i=1

Consequently, the basic inequality (58) implies that k∆kˆ 2 ≤ 4k∆k
ˆ n tδn , or equivalently,
n
ˆ 2 ≤ 16tδn . Thus, (61) holds, thereby completing the proof.
k∆k n

67
Proof of Lemma 6.9: Our first step is to reduce the problem to controlling a supremum
over a subset of functions satisfying the upper bound kg̃kn ≤ u. Suppose that there exists
some g ∈ H with kgkn ≥ u such that
n
σX
wi g(zi ) ≥ 2kgkn u. (62)
n
i=1

u u
Defining the function g̃ := kgk n
g, we observe that kg̃kn = u. Since g ∈ H and kgk n
∈ (0, 1],
the star-shaped assumption on H implies that g̃ ∈ H. Consequently, we have shown that if
there exists a function g satisfying inequality (62), which occurs whenever the event A(u)
is true, then there exists a function g̃ ∈ H with kg̃kn = u such that
n n
σX u σX
wi g̃(zi ) = wi g(zi ) ≥ 2u2 .
n kgkn n
i=1 i=1

We thus conclude that


n
σX
P(A(u)) ≤ P Zn (u) ≥ 2u2 ,

where Zn (u) := sup wi g̃(zi ) .
g̃∈H:kg̃kn ≤u n
i=1

Since the noise variables wi ∼ N (0, 1) are i.i.d., the variable nσ ni=1 wi g̃(zi ) is zero-mean and
P

Gaussian for each fixed g̃. Therefore, the variable Zn (u) corresponds to the supremum of a
Gaussian process. If we view this supremum as a function of the standard Gaussian vector
(w1 , . . . , wn ), then it can be verified that the associated Lipschitz constant52 is at most

σu/ n. Consequently, by the concentration of Lipschitz functions of Gaussian variables53 ,
52
The following lemma illustrates the Lipschitz nature of Gaussian complexity.

Lemma 6.10. Let {Wk }n k=1 be an i.i.d. sequence of N (0, 1) variables. Given a collection of vectors
A ⊂ Rn , define the random variable Z := supa∈A | n
P
k=1 ak W k |. Viewing Z as a function (w1 , . . . , wn ) 7→
f (w1 , . . . , wn ), we can verify that f is Lipschitz (with respect to Euclidean norm) with parameter
supa∈A∪(−A) kak2 .

To see this, let w = (w1 , . . . , wn ), w0 = (w10 , . . . , wn0 ) ∈ Rn . Suppose that there exists a∗ = (a∗1 , . . . , a∗n )
such that f (w) = supa∈A | n
P Pn ∗ Pn ∗
k=1 ak wk | = k=1 ak wk (or k=1 (−ak )wk , which case can also be handled
similarly). Then,
n
X n
X
f (w) − f (w0 ) ≤ a∗k wk − a∗k wk0 ≤ ka∗ k2 kw − w0 k2 ≤ sup kak2 kw − w0 k2 .
k=1 k=1 a∈A∪(−A)

The same argument holds with the roles of w and w0 switched which leads to the desired result:

|f (w) − f (w0 )| ≤ sup kak2 kw − w0 k2 .


a∈A∪(−A)

53
Classical result on the concentration properties of Lipschitz functions of Gaussian variables:
Recall that a function f : Rn → R is L-Lipschitz with respect to the Euclidean norm k · k2 if

|f (x) − f (y)| ≤ Lkx − yk2 , for all x, y ∈ Rn .

The following result guarantees that any such function is sub-Gaussian with parameter at most L.

68
we obtain the tail bound
ns2
P (Zn (u) ≥ E[Zn (u)] + s) ≤ e− 2u2 σ2 ,

valid for any s > 0. Setting, s = u2 yields,


nu2
P Zn (u) ≥ E[Zn (u)] + u2 ≤ e− 2σ2 .

(63)

Finally, by definition of Zn (u) and Gn (u; H), we have E[Zn (u)] = σGn (u; H). By Lemma 6.7,
the function v 7→ Gn (v; H)/v is nonincreasing, and since u ≥ δn by assumption, we have

Gn (u; H) Gn (δn ; H) δn
σ ≤σ ≤ ≤ δn ,
u δn 2
where the 2nd inequality used the critical condition (59). Putting together the pieces, we
have shown that E[Zn (u)] ≤ uδn . Combined with the tail bound (63), we obtain
nu2
P(Zn (u) ≥ 2u2 ) ≤ P(Zn (u) ≥ uδn + u2 ) ≤ P Zn (u) ≥ E[Zn (u)] + u2 ≤ e− 2σ2 ,


where we have used the fact that u2 ≥ uδn .

Exercise (HW2): Suppose that F ∗ is star-shaped. Show that for any δ ∈ (0, σ] such that

16 δ2
Z p
√ log N (t, F ∗ ∩ {h : khkn ≤ δ}, k · kn )dt ≤ (64)
n δ2 /(4σ) 4σ

satisfies the critical inequality (59) and hence the conclusion of of Theorem 6.8 holds.
Exercise (HW2) [Linear regression]: Consider the standard linear regression model Yi =
hθ∗ , zi i + wi , where θ∗ ∈ Rd , and fixed xi are d-dimensional covariates. Although this
example can be studied using direct linear algebraic arguments, we will use our general
theory in analysis this model. The usual LSE corresponds to optimizing over the class of
all linear functions
Flin := {fθ = hθ, ·i : θ ∈ Rd }. (65)

Let X ∈ Rn×d denote the design matrix with zi ∈ Rd as its i-th row. Let θ̂ be the LSE.
Show that
kX(θ̂ − θ∗ )k22 rank(X)
kfθ̂ − fθ∗ k2n = . σ2
n n
Theorem 6.11. Let X = (X1 , . . . , Xn ) be a vector of i.i.d. standard Gaussian variables, and let f : Rn → R
be L-Lipschitz with respect to the Euclidean norm. Then the variable f (X) − E[f (X)] is sub-Gaussian with
parameter at most L, and hence
2
 − t
P |f (X) − E[f (X)]| ≥ t ≤ 2e 2L2 for all t ≥ 0.

Note that this result is truly remarkable: it guarantees that any L-Lipschitz function of a standard
Gaussian random vector, regardless of the dimension, exhibits concentration like a scalar Gaussian variable
with variance L2 . See Section 13.4 for more details about this result and a proof.

69
with high probability.
∗ = F
Hint: First note that the shifted class Flin lin for any choice of fθ∗ ∈ Flin . Moreover,

Flin is convex and hence star-shaped around any point. To use (64) to find δn so that
∗ ∩ {h : khk ≤ δ}, k · k ). Show
Theorem 6.8 applies in this setting, we have to find N (t, Flin n n
that the required covering number can be bounded by (1 + 2δt )r where r := rank(X).

6.3 Oracle inequalities

In our analysis thus far, we have assumed that the true regression function f ∗ belongs to
the function class F over which the constrained LSE is defined. In practice, this assumption
might be violated. In such settings, we expect the performance of the LSE to involve both
the estimation error that arises in Theorem 6.8, and some additional form of approximation
error, arising from the fact that f ∗ ∈
/ F. A natural way in which to measure approximation
error is in terms of the best approximation to f ∗ using functions from F — the error in this
best approximation is given by inf f ∈F kf − f ∗ k2n . Note that this error can only be achieved
by an “oracle” that has direct access to the samples {f ∗ (xi )}ni=1 . For this reason, results
that involve this form of approximation error are referred to as oracle inequalities. With
this setup, we have the following generalization of Theorem 6.8. We define

∂F := {f − g : f, g ∈ F}.

Theorem 6.12. Assume that ∂F is star-shaped. Let δn > 0 be any solution to

δ2
Gn (δ; ∂F) ≤ . (66)

Then for any t ≥ δn , the LSE fˆ satisfies the bound

kfˆ − f ∗ k2n ≤ 2 inf kf − f ∗ k2n + 36tδn (67)


f ∈F

ntδn
with probability greater than 1 − e− 2σ2 .

Proof. Recall the definition of A(u) in (60). We apply Lemma 6.9 with u = tδn and
√ ntδn
H = ∂F to conclude that P Ac ( tδn ) ≥ 1 − e− 2σ2 . We will assume below that the event


Ac ( tδn ) holds.
Given an arbitrary f˜ ∈ F, since f˜ is feasible and fˆ is optimal, we have
n n
1 X 1 X
{Yi − fˆ(zi )}2 ≤ {Yi − f˜(zi )}2 .
2n 2n
i=1 i=1

Using the relation Yi = f ∗ (zi ) + σwi , some algebra yields


n
1 ˆ 2 1 σX ˜
k∆kn ≤ kf˜ − f ∗ k2n + wi ∆(zi ) , (68)
2 2 n
i=1

70
where ∆ˆ := fˆ − f ∗ and ∆
˜ := fˆ − f˜. It remains to analyze the term on the right-hand side
˜ We break our analysis into two cases.
involving ∆.

Case 1: First suppose that k∆k ˜ n ≤ tδn . Then,

ˆ 2n = kfˆ − f ∗ k2n = k(f˜ − f ∗ ) + ∆k


k∆k ˜ 2n
2
≤ kf˜ − f ∗ kn + tδn
 p

≤ 2kf˜ − f ∗ k2n + 2tδn (taking β = 1)

where in the first inequality above we have used the triangle inequality, and the second
inequality follows from the fact that (a + b)2 ≤ 2(a2 + b2 ) (for a, b ∈ R).
√ √
Case 2: Suppose now that k∆k ˜ n > tδn . Note that ∆ ˜ ∈ ∂F and as the event Ac ( tδn )
holds, we get
n
σX ˜ p
˜ n.
wi ∆(zi ) ≤ 2 tδn k∆k
n
i=1
ntδn
Combining with the basic inequality (68), we find that, with probability at least 1 − e− 2σ2 ,
the squared error is bounded as
ˆ 2 = kf˜ − f ∗ k2 + 4 tδn k∆k ˜ n
p
k∆k n n

≤ kf˜ − f ∗ k2n + 4 tδn k∆k ˆ n + kf˜ − f ∗ kn


p 
h tδ
ˆ 2 + 2 tδn + βkf˜ − f ∗ k2
i h i
n
≤ kf˜ − f ∗ k2n + 2 + βk∆k n n
β β
⇒ (1 − 2β)k∆k ˆ 2n ≤ (1 + 2β)kf˜ − f ∗ k2n + 4 tδn
β
where the second step follows from the triangle inequality and the next step follows from
multiple usage of the fact that 2ab ≤ βa2 + b2 /β (for a, b ∈ R and β > 0). Taking β = 1/6,
we have (1+2β)
(1−2β) = 2, and thus we get

ˆ 2 ≤ 2kf˜ − f ∗ k2 + 36tδn .
k∆k n n

Combining the pieces we get that, under the event Ac ( tδn ), the above inequality holds for
any f˜ ∈ F. Thus, (67) holds.

Remark 6.1. We can, in fact, have a slightly more general form of (67) where the ‘oracle’
approximation term 2kf˜ − f ∗ k2n can be replaced by 1−γ
1+γ ˜
kf − f ∗ k2n for any γ ∈ (0, 1) (with
appropriate adjustments to the ‘estimation’ error term 36tδn ).

Note that the guarantee (67) is actually a family of bounds, one for each f ∈ F.
When f ∗ ∈ F, then we can set f = f ∗ , so that the bound (67) reduces to asserting that
kfˆ − f ∗ k2n . tδn with high probability, where δn satisfies the critical inequality (66). Thus,
up to constant factors, we recover Theorem 6.8 as a special case of Theorem 6.12. By
integrating the tail bound, we are guaranteed that
h i σ2
E kfˆ − f ∗ k2n . inf kf − f ∗ k2n + δn2 + . (69)
f ∈F n

71
The bound (69) guarantees that the LSE fˆ has prediction error that is at most a constant
multiple of the oracle error, plus a term proportional to δn2 . The term inf f ∈F kf − f ∗ k2n
can be viewed a form of approximation error that decreases as the function class F grows,
whereas the term δn2 is the estimation error that increases as F becomes more complex.

6.3.1 Best sparse linear regression

Consider the standard linear model Yi = fθ∗ (zi ) + σwi , where fθ (z) := hθ, zi is an unknown
iid
linear regression function, and wi ∼ N (0, 1) is an i.i.d. noise sequence. Here θ∗ ∈ Rd is
the unknown parameter. For some sparsity index s ∈ {1, 2, . . . , d}, consider the class of all
linear regression functions based on s-sparse vectors — namely, the class

Fspar (s) := {fθ : θ ∈ Rd , kθk0 ≤ s},


Pd
where kθk0 := j=1 I(θj 6= 0) counts the number of non-zero coefficients in the vector
θ ∈ R . Disregarding computational considerations, a natural estimator of θ∗ is given by
d

n
X
θ̂ ≡ fθ̂ ∈ arg min {Yi − fθ (zi )i}2 , (70)
fθ ∈Fspar(s)
i=1

corresponding to performing least squares over the set of all regression vectors with at most
s non-zero coefficients. As a corollary of Theorem 6.12, we claim that the L2 (Pn )-error of
this estimator is upper bounded as
s log( ed
s )
kfθ̂ − fθ∗ k2n . inf kfθ̂ − fθ∗ k2n + σ 2 , (71)
θ∈Fspar(s) n
s log( ed )
with high probability; here δn2 = σ 2 n s . Consequently, up to constant factors, its error
is as good as the best s-sparse predictor plus the ‘estimation’ error term δn2 . Note that this
‘estimation’ error term grows linearly with the sparsity s, but only logarithmically in the
dimension d, so that it can be very small even when the dimension is exponentially larger
than the sample size n. In essence, this result guarantees that we pay a relatively small
price for not knowing in advance the best s-sized subset of coefficients to use.
In order to derive this result as a consequence of Theorem 6.12, we need to compute
the local Gaussian complexity Gn (δ; ∂Fspar (s)). Making note of the inclusion ∂Fspar (s) ⊂
Fspar (2s), we have Gn (δ; ∂Fspar (s)) ⊂ Gn (δ; Fspar (2s)). Now let S ⊂ {1, . . . , d} be an
arbitrary 2s-sized subset of indices, and let XS ∈ Rn×2s denote the submatrix with columns
indexed by S. We can then write
n
" #  
1X
Gn (δ; Fspar (2s)) = Ew sup wi g(zi ) = Ew max Zn (S) ,
g∈Fspar (2s):kgkn ≤δ n i=1
|S|=2s

where
1 >
Zn (S) := sup w X S θS
kXS θS k2 n
θS ∈R2s : √
n
≤δ

72
as, for g ∈ Fspar (2s), g(z) ≡ gθ (z) = hθ, zi = hθS , zS i, if θ has nonzero entries in the subset
S ⊂ {1, . . . , d}, and kgk2n = n1 ni=1 hθ, zi i2 = n1 kXS θS k22 (here k · k2 denotes the usual
P

Euclidean norm).
Viewed as a function of the standard Gaussian vector w ∈ Rn , the variable Zn (S) is

Lipschitz with parameter at most δ/ n (by Lemma 6.10), from which Theorem 6.11 implies
the tail bound
−t2 δ 2 −nt2
P(Zn (S) ≥ E[Zn (S)] + tδ) ≤ e (2δ2 /n) = e 2 , for all t > 0. (72)

We now upper bound the expectation. Consider the singular value decomposition XS =
UDV> , where U ∈ Rn×2s and V ∈ Rd×2s are matrices of left and right singular vectors,
respectively, and D ∈ R2s×2s is a diagonal matrix of the singular values. Noting that
kXS θS k2 = kDV> θS k2 , we arrive at the upper bound
h 1 i δ h i
E[Zn (S)] ≤ E sup √ hU> w, βi ≤ √ E kU> wk2
β∈R2s :kβk2 ≤δ n n
>
where we have taken β = DV √ θS . Since w ∼ N (0, In ) and the matrix U has orthogonal
n √
columns, we have U> w ∼ N (0, I2s ), and therefore E kU> wk2 ≤ 2s. Combining this
 

upper bound with the earlier tail bound (72), an application of the union bound yields, for
all t > 0, " √ #  
δ 2s d −nt2
P max Zn (S) ≥ √ + tδ ≤ e 2 .
|S|=2s n 2s
By integrating this tail bound, we find that
s
d
 s
log ed
  r
Gn (δ; Fspar (2s)) Ew max|S|=2s Zn (S) s log 2s s
= . + . ,
δ δ n n n
s log( ed )
so that the critical inequality (66) is satisfied for δn2 ' σ 2 n
s
, as claimed.

6.4 Density estimation via maximum likelihood

Let X1 , . . . , Xn be an i.i.d. sample from a density p0 that belongs to a set P of densities


with respect to a measure µ on some measurable space. In this subsection the parameter
is the density p0 itself (and we denoted a generic density by p instead of θ).
The sieved maximum likelihood estimator (MLE) p̂n based on X1 , . . . , Xn maximizes
the log-likelihood p 7→ Pn log p over a sieve Pn , i.e.,

p̂n = argmax Pn [log p].


p∈Pn

Although it is natural to take the objective (criterion) function we optimize (i.e., Mn (·) in
our previous notation) as Pn log p, for some technical reasons (explained below) we consider
a slightly modified function.

73
Let pn ∈ Pn . We will discuss the particular choice of pn later. By concavity of the
logarithm function we have
p̂n + pn 1 p̂n 1 pn + pn
Pn log ≥ Pn [ log + log 1] ≥ 0 = Pn log .
2pn 2 pn 2 2pn
Thus, defining the criterion functions mn,p (for p ∈ Pn ) as
p + pn
mn,p := log ,
2pn
we obtain Pn mn,p̂n ≥ Pn [mn,pn ]. We shall apply Theorem 6.1 with Mn (p) := Pn [mn,p ] to
obtain the rate of convergence of p̂n . We note that it is not true that p̂n maximizes the
map p 7→ Mn (p) over Pn . Inspection of the conditions of Theorem 6.1 shows that this is
not required for its application; it suffices that the criterion is bigger at the estimator than
at the value θn , which is presently taken equal to pn .
An immediate question that arises next is what discrepancy (or metric) do we use to
measure the difference between p̂n and p0 ? A natural metric while comparing densities is
the Hellinger distance:
Z 1/2
√ √ 2
h(p, q) = ( p − q) dµ ,

which is what we will use in this subsection.


We will apply Theorem 6.1 with θn,0 = θn = pn in our new notation. Thus, we will
obtain a rate of convergence for h(p̂n , pn ), which coupled with a satisfactory choice of pn
will yield a rate for h(p̂n , p0 ). It is then required that the centering function decreases
quadratically as p moves away from pn within the sieve, at least for h(p, pn ) > δ̃n (to be
defined later). For δ > 0, let

Mn,δ := {mn,p − mn,pn : p ∈ Pn , h(p, pn ) ≤ δ}.

We will need to use a maximal inequality to control the fluctuations of the empirical process
in the class Mn,δ . The following result will be useful in this regard; it uses bracketing with
the “Bernstein norm”54 .
Theorem 6.13. For any class F of measurable functions f : X → R such that kf kP,B < δ
for every f ,
J[ ] (δ, F, k · kP,B )
 
EkGn kF . J[ ] (δ, F, k · kP,B ) 1 + √ .
δ2 n
Using mn,p rather than the more obvious choice log p is technically more convenient.
First it combines smoothly with the Hellinger distance h(p, q). The key is the following
pair of inequalities, which relate the “Bernstein norm” of the criterion functions mn,p to
the Hellinger distance of the densities p.
h i1/2
54
The “Bernstein norm” is defined as kf kP,B := 2P (e|f | − 1 − |f |) . This “Bernstein norm” turns
out to combine well with minimum contrast estimators (e.g., MLE), where the criterion is a logarithm of
another natural function, such as the log-likelihood. Actually, kf kP,B is not a true norm, but it can be used
in the same way to measure the size of brackets.

74
Lemma 6.14. For nonnegative functions p, q, pn , and p0 (assumed to be a density) such
that p0 /pn ≤ M and p ≤ q, we have
√ √
kmn,p − mn,pn kP0 ,B . M h(p, pn ), kmn,p − mn,q kP0 ,B . M h(p, q),

where the constant in . does not depend on anything.

Proof. Note that e|x| − 1 − |x| ≤ 4(ex/2 − 1)2 , for every x ≥ − log 2. As mn,p ≥ − log 2 and
mn,pn = 0,
√ 2
 2 p + pn
kmn,p − mn,pn k2P0 ,B . P0 emn,p /2 − 1 = P0 √ −1 .
2pn

Since p0 /pn ≤ M , the right side is bounded by 2M h2 (p + pn , 2pn ). Combination with the
preceding display gives the first inequality. If p ≤ q, then mn,q − mn,p is nonnegative. By
the same inequality for ex − 1 − x as before,
√ 2

(mn,q −mn,p )/2
2 q + pn
kmn,p − mn,q k2P0 ,B . P0 e − 1 = P0 √ −1 .
p + pn

This is bounded by M h2 (q + pn , p + pn ) . M h2 (p, q) as before.

Since the map p 7→ mn,p is monotone, the second inequality shows that a bracketing
partition of a class of densities p for the Hellinger distance induces a bracketing partition of
the class of criterion functions mn,p for the “Bernstein norm” of essentially the same size.
Thus, we can use a maximal inequality available for the classes of functions Mn,δ with the
entropy bounded by the Hellinger entropy of the class of densities.

Lemma 6.15. Let h denote the Hellinger distance on a class of densities P and set mn,p :=
log[(p + pn )/(2pn )]. If pn and p0 are probability densities with p0 /pn ≤ M pointwise, then

P0 [mn,p − mn,pn ] . −h2 (p, pn ),

for every probability density p such that h(p, pn ) ≥ ch(pn , p0 ), for some constant c > 0.
Furthermore, for the class of functions Mn,δ := {mn,p − mn,pn : p ∈ Pn , h(p, pn ) ≤ δ},
√ !
√ M J[ ] (δ, Pn , h)
EkGn kMn,δ . M J[ ] (δ, Pn , h) 1 + √ .
δ2 n

Proof. Since log x ≤ 2( x − 1) for every x > 0,
!
q q 1/2
P0 log ≤ 2P0 1/2
−1
pn pn
!
1/2 1/2
q 1/2 1/2 p0 + pn
Z
1/2 1/2 1/2
= 2Pn 1/2
− 1 + 2 (q − pn )(p0 − pn ) 1/2
dµ.
pn pn

75
The first term in last display equals −h2 (q, pn ). The second term can be bounded by the

expression 2h(q, pn )h(p0 , pn )( M + 1) in view of the assumption on the quotient p0 /pn and

the Cauchy-Schwarz inequality. The sum is bounded by −h2 (q, pn )/2 if 2h(p0 , pn )( M +
1) ≤ h(q, pn )/2. The first statement of the theorem follows upon combining this with the
inequalities55 [Exercise (HW2)]

h(2p, p + q) ≤ h(p, q) ≤ (1 + 2) h(2p, p + q).

These inequalities are valid for every pair of densities p and q and show that the Hellinger
distance between p and q is equivalent to the Hellinger distance between p and (p + q)/2.
The maximal inequality is now a consequence of Theorem 6.13. Each of the functions

in Mn,δ has “Bernstein norm” bounded by a multiple of M δ, while a bracket [p1/2 , q 1/2 ]
of densities of size δ leads to a bracket [mn,p , mn,q ] of “Bernstein norm” of size a multiple

of M δ.

It follows that the conditions of Theorem 6.1 are satisfied with the Hellinger distance,
δ̃n = h(pn , p0 ), and
J[ ] (δ, Pn , h)
 
φn (δ) := J[ ] (δ, Pn , h) 1 + √ ,
δ2 n
where J[ ] (δ, Pn , h) is the Hellinger bracketing integral of the sieve Pn . (Usually this function
φn (·) has the property that φn (δ)/δ α is decreasing for some 0 < α < 2 as required by

Theorem 6.1.) The condition φn (δn ) . nδn is equivalent to

J[ ] (δn , Pn , h) ≤ nδn2 .

For the unsieved MLE the Hellinger integral is independent of n and any δn solving
the preceding display gives an upper bound on the rate. Under the condition that the
true density p0 can be approximated by a sequence pn ∈ Pn such that p0 /pn is uniformly
bounded, the sieved MLE that maximizes the likelihood over Pn has at least the rate δn
satisfying both

J[ ] (δn , Pn , h) ≤ nδn2 and δn & h(pn , p0 ).

Theorem 6.16. Given a random sample X1 , . . . , Xn from a density p0 let p̂n maximize the
likelihood p 7→ ni=1 p(Xi ) over an arbitrary set of densities Pn . Then
Q

h(p̂n , p0 ) = OP (δn )

for any δn satisfying



J[ ] (δn , Pn , h) ≤ nδn2 , and δn & h(pn , p0 )
55
The inequalities follow, because for any nonnegative reals s and t,
√ √ √ √ √ √ √
| 2s − s + t| ≤ | s − t| ≤ (1 + 2)| 2s − s + t|.

The lower inequality follows from the concavity of the root function. The upper inequality is valid with
√ √
constant 2 if t ≥ s and with constant (1 + 2) as stated if t ≤ s.

76
where pn can be any sequence with pn ∈ Pn for every n and such that the functions x 7→
p0 (x)/pn (x) are uniformly bounded in x and n.

Example 6.17. Suppose the observations take their values in a compact interval [0, T ] in
the real line and are sampled from a density that is known to be nonincreasing. Conclude
that if P is the set of all nonincreasing probability densities bounded by a constant C, then
1
log N[ ] (, P, h) ≤ log N[ ] (, F, L2 (λ)) . .


where F of all non-increasing functions f : [0, T ] → [0, C]. The result follows from the
observations: (i) F has bracketing entropy for the L2 (λ)-norm of the order 1/ for any finite
measure λ on [0, T ], in particular the Lebesgue measure; (ii) if a density p is non-increasing,

then so is its root p; (iii) the Hellinger distance on the densities is the L2 (λ)-distance on
the root densities.

Thus J[ ] (δ, P, h) . δ, which yields a rate of convergence of at least δn = n−1/3 for
the MLE. The MLE is called the Grenander estimator.

77
7 Vapnik-C̆ervonenkis (VC) classes of sets/functions

Consider our canonical setting: X1 , . . . , Xn are i.i.d. P on some space X . In this section we
study classes of functions F (on X ) that satisfy certain combinatorial restrictions. These
classes at first sight may seem have nothing to do with entropy numbers, but indeed will
be shown to imply bounds on the covering numbers of the type
 V
1
sup N (kF kQ,2 , F, L2 (Q)) ≤ K , 0 <  < 1, some number V > 0,
Q 
where F is the underlying function class with envelope F , and K is a universal constant.
Note that this has direct implications on the uniform entropy of such a class (see Defini-
tion 4.6) is of the order log(1/) and hence the uniform entropy integral converges, and is
of the order δ log(1/δ), as δ ↓ 0.
Classes of (indicator functions of) this type were first studied by Vapnik and C̆ervonenkis
in the 1970s, whence the name VC classes. There are many examples of VC classes, and
more examples can be constructed by operations as unions and sums. Furthermore, one can
combine VC classes in different sorts of ways (thereby, building larger classes of functions)
to ensure that the resulting larger classes also satisfy the uniform entropy condition (though
these larger classes may not necessarily be VC).
We first consider VC classes to sets. To motivate this study let us consider a boolean
class of functions F 56 , i.e., every f ∈ F takes values in {0, 1}. Thus,

F = {1C : C ∈ C},

where C is a collection of subsets of X . This naturally leads to the study of C.

Definition 7.1. Let C be a collection of subsets of a set X . Let {x1 , . . . , xn } ⊂ X be an


arbitrary set of n points. Say that C picks out a certain subset A of {x1 , . . . , xn } if A can
be expressed as C ∩ {x1 , . . . , xn } for some C ∈ C.
The collection C is said to shatter {x1 , . . . , xn } if each of its 2n subsets can be picked
out in this manner (note that an arbitrary set of n points possesses 2n subsets).

Definition 7.2. The VC dimension V (C) of the class C is the largest n such that some set
of size n is shattered by C.

Definition 7.3. The VC index ∆n (C; x1 , . . . , xn ) is defined as



∆n (C; x1 , . . . , xn ) = | C ∩ {x1 , . . . , xn } : C ∈ C |,

where |A| denotes the cardinality of the set A. Thus,

V (C) := sup n : max ∆n (C; x1 , . . . , xn ) = 2n .



x1 ,...,xn ∈X
56
Boolean classes F arise in the problem of classification (where F can be taken to consist of all functions
f of the form I{g(X) 6= Y }). They are also important for historical reasons: empirical process theory has
its origins in the study of the function class F = {1(−∞,t] (·) : t ∈ R}.

78
Let’s try rectangles with horizontal and vertical edges. In order to show that the VC dimension is 4 (in this
case), we need to show two things:
So, yes, there exists an arrangement of 4 points that can be shattered.
1. There exist 4 points that can be shattered.
2. No set of 5 points can be shattered.
It’s clear that capturing just 1 point and all 4 points are both trivial. The figure below shows how we
can capture 2 points and 3 points.
Suppose we have 5 points. A shattering must allow us to select all 5 points and allow us to select 4
points without the 5th.

So, yes, there exists an arrangement of 4 points that can be shattered.


Our minimum enclosing rectangle that allows us to select all five points is defined by only four points
Figure 1: The left panel illustrates how we can pick
2. No set of 5 points can be shattered.
out 2 points and 3 points (it’s clear that capturing just
– one for each edge. So, it is clear that the fifth point must lie either on an edge or on the inside of
1 point and all 4 points are both trivial) therebythe
showing
rectangle.that
This there
preventsexist 4 points
us from selecting that can without
four points be shattered.
the fifth.
Suppose we have 5Thepoints.right
A shattering
panel must allow us to
illustrates selectno
that all set
5 points
of 5and allow uscan
points to select 4
be shattered:
the minimum enclosing rectangle that
points without the 5th.
allows us to select all 5 points is defined by only four points — one for each edge. So, it is clear that the
fifth point must lie either on an edge or on the inside of the rectangle thereby preventing us from selecting
four points without the fifth. 1

Clearly, the more refined C is, higher the VC index. The VC dimension is infinite if C
Our minimum enclosing rectangle that allows us to select all five points is defined by only four points
shatters sets of arbitrarily large size. It is immediate from the definition that V (C) ≤ V if
– one for each edge. So, it is clear that the fifth point must lie either on an edge or on the inside of
the rectangle. This prevents us from selecting four points without the 57
and only if no set of size V + 1 fifth.is shattered.

Example 7.4. Let X = R and define the collection of sets C := {(−∞, c] : c ∈ R}. Consider
any two point set1 {x1 , x2 } ⊂ R, and assume without loss of generality, that x1 < x2 . It is
easy to verify that C can pick out the null set {} and the sets {x1 } and {x1 , x2 } but cannot
pick out {x2 }. Hence its VC dimension equals 1.
The collection of all cells (a, b] ∈ R shatters every two-point set but cannot pick out the
subset consisting of the smallest and largest points of any set of three points. Thus its VC
dimension equals 2.

Remark 7.1. With more effort, it can be seen that VC dimensions of the same type of sets
in Rd are d and 2d, respectively. For example, let X = R2 and define

C = {A ⊂ X : A = [a, b] × [c, d], for some a, b, c, d ∈ R}.

Let us see what happens when n = 4. Draw a figure to see this when the points are not
co-linear. We can show that there exists 4 points such that all the possible subsets of these
four points are picked out by C; see the left panel of Figure 7.1.
Now if we have n = 5 points things change a bit; see the right panel of Figure 7.1.
If we have five points there is always one that stays “in the middle” of all the others, and
thus the complement set cannot be picked out by C. We immediately conclude that the VC
dimension of C is 4.

A collection of measurable sets C is called a VC class if its dimension is finite. The


main result of this section is the remarkable fact that the covering numbers of any VC class
grow polynomially in 1/ as  → 0, of order dependent on the dimension of the class.
57
Some books define the VC index of the class C as the smallest n for which no set of size n is shattered
by C (i.e., V (C) + 1 in our notation).

79
Example 7.5. Suppose that X = [0, 1], and let C be the class of all finite subsets of X .
Let P be the uniform (Lebesgue) distribution on [0, 1]. Clearly V (C) = ∞ and C is not a
VC class. Note that for any possible value of Pn we have Pn (A) = 1 for A = {X1 , . . . , Xn }
while P (A) = 0. Therefore kPn − P kC = 1 for all n, so C is not a Glivenko-Cantelli class
for P .

Exercise (HW3): Show that the class of all closed and convex sets in Rd does not have finite
VC dimension (Hint: Consider a set of n points on the boundary of the unit ball).
Sauer’s lemma58 (also known as Sauer-Shelah-Vapnik-C̆ervonenkis lemma), one of the
fundamental results on VC dimension, states that the number ∆n (C; x1 , . . . , xn ) of subsets
picked out by a VC class C, for n ≥ 1, satisfies:
V (C)  
X n
max ∆n (C; x1 , . . . , xn ) ≤ , (73)
x1 ,...,xn j
j=0

where we use the notation nj = 0 if j > n. Observe that for n ≤ V (C), the right-hand side


of the above display equals 2n , i.e., the growth is exponential. However, it is easy to show59
that for n ≥ V (C),
V (C)  
ne V (C)
X n  
≤ . (74)
j V (C)
j=0

Consequently, the numbers on the left side grow polynomially (of order at most O(nV (C) ))
rather than an exponential number. Intuitively this means that a finite VC index implies
that C has an apparent simplistic structure.

7.1 VC classes of Boolean functions

The definition of VC dimension can be easily extended to a function class F in which every
function f is binary-valued, taking the values {0, 1} (say). In this case, we define, for every
58
See [van der Vaart and Wellner, 1996, pages 135–136] for a complete proof of the result.
59
In the following we just give a proof of the right-hand inequality of (74). Note that with Y ∼
Binomial(n, 1/2),
V (C) V (C)
! !
X n n
X n  1 n
= 2 = 2n P(Y ≤ V (C))
j=0
j j=0
j 2
h i
≤ 2 E[rY −V (C) ]
n
for r ≤ 1 as 1{Y − V (C) ≤ 0} ≤ rY −V (C) for r ≤ 1
n " n
!  n #
 n 
n −V (C) 1 r −V (C) n Y
X j n 1 1 r
= 2 r + = r (1 + r) as E[r ] = r = +
2 2 j=0
j 2 2 2
 V (C)  n
n V (C)
= 1+ by choosing r = V (C)/n
V (C) n
 V (C)
n
≤ eV (C) .
V (C)

80
x1 , . . . , x n ∈ X ,
F(x1 , . . . , xn ) := {(f (x1 ), . . . , f (xn )) : f ∈ F}. (75)
As functions in F are Boolean, F(x1 , . . . , xn ) is a subset of {0, 1}n .
Definition 7.6. Given such a function class F we say that the set {x1 , . . . , xn } is shattered
by F if
∆n (F; x1 , . . . , xn ) := |F(x1 , . . . , xn )| = 2n .
The VC dimension V (F) of F is defined as the largest integer n for which there is some
collection x1 , . . . , xn of n points that can be shattered by F.

When V (F) is finite, then F is said to be a VC class.


Example 7.7. Let us revisit the Glivenko-Cantelli (GC) theorem (Theorem 3.5) when we
have a binary-valued function class F. In particular, suppose that X1 , . . . , Xn are i.i.d. P
on X . A natural question is how does one verify condition (8) in practice? We need an
upper bound on N (, F, L1 (Pn )). Recall that under L1 (Pn ) the distance between f and g is
measured by
n
1X
kf − gkL1 (Pn ) := |f (Xi ) − g(Xi )|.
n
i=1
This notion of distance clearly only depends on the values of f and g at the data points
X1 , . . . , Xn . Therefore, the covering number of F in the L1 (Pn )-norm should be bounded
from above by the corresponding covering number of {(f (X1 ), . . . , f (Xn )) : f ∈ F}. It should
be obvious that N (, F, L1 (Pn )) is bounded from above by the cardinality of F(X1 , . . . , Xn ),
i.e.,
N (, F, L1 (Pn )) ≤ |F(X1 , . . . , Xn )| for every  > 0.
This is in fact a very crude upper bound although it can be quite useful in practice. For
example, in the classical GC theorem F := {1(−∞,t] (·) : t ∈ R}, and we can see that
|F(X1 , . . . , Xn )| ≤ (n + 1).
Since F(X1 , . . . , Xn ) is a subset of {0, 1}n , its maximum cardinality is 2n . But if
∆n (F; X1 , . . . , Xn ) is at the most a polynomial in n for every possible realization of X1 , . . . , Xn ,
then
1
log ∆n (F; X1 , . . . , Xn ) → 0 as n → ∞ a.s. (76)
n
which implies, by Theorem 3.5, that F is GC. Thus, if F is a boolean function class such
that (76) holds, then F is P -GC.

Exercise (HW3): Consider the class of all two-sided intervals over the real line, i.e., F :=
{1(a,b] (·) : a < b ∈ R}. Show that ∆n (F; X1 , . . . , Xn ) ≤ (n + 1)2 a.s.
Exercise (HW3): For a scalar t ∈ R, consider the function ft (x) := 1{sin(tx) ≥ 0}, x ∈
[−1, 1]. Prove that the function class {ft : [−1, 1] → R : t ∈ R} has infinite VC dimension
(Note that this shows that VC dimension is not equivalent to the number of parameters in
a function class).

81
7.2 Covering number bound for VC classes of sets

Theorem 7.8. There exists a universal constant K such that for any VC class C of sets,
any probability measure Q, any r ≥ 1, and 0 <  < 1,
 rV (C)
V (C) 1
N (, C, Lr (Q)) ≤ K V (C)(4e) . (77)


Proof. See [van der Vaart and Wellner, 1996, Theorem 2.6.4].

In the following we will prove a slightly weaker version of the above result.

Theorem 7.9. For any VC class C of sets, any r ≥ 1, and 0 <  < 1,60
 c rc2 V (C)
1
sup N (, C, Lr (Q)) ≤ (78)
Q 
Here c1 and c2 are universal positive constants and the supremum is over all probability
measures Q on X .

Proof. Fix 0 <  < 1. Let X1 , . . . , Xn be i.i.d. Q. Let m := D(, C, L1 (Q)) be the -packing
number for the collection C in the norm L1 (Q). Thus, there exists C1 , . . . , Cm ∈ C which
satisfy
Q|1Ci − 1Cj | = Q(Ci 4Cj ) > , i 6= j.
Let F := {1C : C ∈ C}. We consider this function class view point as it is sometimes more
natural than working with the collection of sets C. Note that, {fi ≡ 1Ci }m
i=1 is a set of m
-separated functions in F in the L1 (Q)-metric, as, for i 6= j,
Z
 < |fi − fj |dQ = Q{fi 6= fj } = Q(Ci 4Cj ) = P[X1 ∈ Ci 4Cj ].

By the above, we have

P[fi (X1 ) = fj (X1 )] = 1 − P[fi (X1 ) 6= fj (X1 )] = 1 − P[X1 ∈ Ci 4Cj ] < 1 −  ≤ e− .

By the independence of X1 , . . . , Xn we deduce then that for every k ≥ 1,

P[fi (X1 ) = fj (X1 ), . . . , fi (Xk ) = fj (Xk )] ≤ e−k .

In words, this means that the probability that fi and fj agree on every X1 , . . . , Xk is at
most e−k . By the union bound, we have
m −k m2 −k
 
P [(fi (X1 ), . . . , fi (Xk )) = (fj (X1 ), . . . , fj (Xk )) for some 1 ≤ i < j ≤ m] ≤ e ≤ e .
2 2
Recalling that F(x1 , . . . , xk ) = {(f (x1 ), . . . , f (xk )) : f ∈ F}, this immediately gives
m2 −k
P[|F(X1 , . . . , Xk )| ≥ m] ≥ 1 − e .
2
60
Note that N (, C, Lr (Q)) = 1 for all  ≥ 1.

82
l m
Thus if we take k := 2 log m ≥ 2 log m , then, P[|F(X1 , . . . , Xk )| ≥ m] ≥ 1/2. Thus
for the choice of k above, there exists a subset {z1 , . . . , zk } of cardinality k such that
|F(z1 , . . . , zk )| ≥ m. We now apply the Sauer-Shelah-VC lemma and deduce that
V (C)  
X k
m ≤ |F(z1 , . . . , zk )| ≤ max ∆k (C; x1 , . . . , xk ) ≤ . (79)
x1 ,...,xk j
j=1

We now split into two cases depending on whether k ≤ V (C) or k ≥ V (C).


Case 1: k ≤ V (C). Here (79) gives
 V (C)
V (C) 2
N (, C, L1 (Q)) ≤ D(, C, L1 (Q)) = m ≤ 2 ≤ ,

which proves (78).
Case 2: k ≥ V (C). Here (79) gives
 V (C)
ke
N (, C, L1 (Q)) = m ≤ ,
V (C)
4 log m
so that using the choice of k which satisfies k ≤  ,
ke 4e 8e 8e
m1/V (C) ≤ ≤ log m = log m1/(2V (C)) ≤ m1/(2V (C)) ,
V (C) V (C)  
where we have used log x ≤ x. This immediately gives
 2V (C)
8e
N (, C, L1 (Q)) ≤ D(, C, L1 (Q)) = m ≤ ,

which completes the proof of the result for r = 1.
For Lr (Q) with r > 1, note that

k1C − 1D kL1 (Q) = Q(C4D) = k1C − 1D krLr (Q) ,

so that
c2 V (C)
N (, C, Lr (Q)) = N (r , C, L1 (Q)) ≤ c1 −r .
This completes the proof.

Exercise (HW3): Suppose F is a Boolean class of functions with VC dimension V (F). Then,
for some constant C > 0,
" # r
V (F)
E sup |(Pn − P )f | ≤ C .
f ∈F n
Suppose X1 , . . . , Xn are i.i.d. real-valued observations having a common cdf F . Apply this
result to obtain a high probability upper bound on supx∈R |Fn (x) − F (x)|, i.e., show that
r
C 2 1
sup |Fn (x) − F (x)| ≤ √ + log
x∈R n n α
with probability at least 1 − α (for α ∈ (0, 1)).

83
Example 7.10 (Classification). Recall the problem of classification from Section 1.4 where
we observe i.i.d. data (Z1 , Y1 ), . . . , (Zn , Yn ) ∼ P with Zi ∈ Z and Yi ∈ {0, 1}. Let C be
a class of functions from Z to {0, 1} — the class of classifiers under consideration. The
empirical risk minimizer classifier is ĝn := argming∈C n1 ni=1 I{g(Zi ) 6= Yi }. It is usually
P

of interest to understand the test error of ĝn relative to the best test error in the class C,
i.e., L(ĝn ) − inf g∈C L(g) (here L(g) := P(g(Z) 6= Y ) is the misclassification error of g). If
g ∗ minimizes L(g) over g ∈ C, then we have seen in (5) that

L(ĝn ) − L(g ∗ ) ≤ 2 sup |Ln (g) − L(g)| = 2 sup |(Pn − P )f |


g∈C f ∈F

where
F := {(z, y) 7→ I{g(z) 6= y} : g ∈ C}.

Using the bounded differences concentration inequality and the bound given by Example 4.8,
we obtain r r
∗ V (F) 8 1
L(ĝn ) − L(g ) ≤ C + log
n n α
with probability at least 1 − α (for α ∈ (0, 1)). The above display would be useful if we
could upper bound V (F) effectively. We will now show that V (F) ≤ V (C). To see this, it
is enough to argue that if F can shatter (z1 , y1 ), . . . , (zn , yn ), then C can shatter z1 , . . . , zn .
For this, let η1 , . . . , ηn be arbitrary points in {0, 1}. We need to obtain a function g ∈ C
such that g(zi ) = ηi , for i = 1, . . . , n. Define δ1 , . . . , δn by

δi := ηi I{yi = 0} + (1 − ηi )I{yi = 1}.

As F can shatter (z1 , y1 ), . . . , (zn , yn ), there exists f ∈ F, say f (z, y) = I{g(z) 6= y} for
some g ∈ C, with f (zi , yi ) = δi , for i = 1, . . . , n. Then, g(zi ) = ηi 61 , for i = 1, . . . , n. This
proves that C shatters z1 , . . . , zn and completes the proof of the fact that V (F) ≤ V (C).
Thus, we obtain
r r
∗ V (C) 8 1
L(ĝn ) − L(g ) ≤ 2 sup |Ln (g) − L(g)| ≤ C + log
g∈C n n α

with probability at least 1 − α. In fact, this is one of the important results in the VC theory.

7.3 VC classes of functions

Let us start with a motivating application. Recall from Example 1.5 the class of functions
F = {ft : t ∈ R} where ft (x) = |x − t|. In Example 1.5 we needed to show asymptotic
61
First observe that f (zi , yi ) = 0 ⇔ g(zi ) = yi and f (zi , yi ) = 1 ⇔ g(zi ) 6= yi . Suppose that δi = 0, i.e.,
f (zi , yi ) = 0. Then, we must have, from the definition of δi , 0 = ηi I{yi = 0} = (1 − ηi )I{yi = 1}, which
implies that yi = ηi and thus, g(zi ) = yi = ηi . Similarly, suppose that δi = 1, i.e., f (zi , yi ) = 1. Then, we
must have, from the definition of δi , yi 6= ηi , and thus as g(zi ) 6= yi , we have g(zi ) = ηi . Thus, in both cases
we see that g(zi ) = ηi .

84
equicontinuity of a certain process which boiled down to controlling the modulus of conti-
nuity of the empirical process indexed by F as in (48). In particular, we may ask: “Is this
function class ‘nice’ is some sense so that results analogous to (80), and thus (48), hold?”.
The VC subgraph dimension of F is simply the VC dimension of the Boolean class
obtained by taking the indicators of the subgraphs of functions in F. To formally define
this, let us first define the notion of subgraph of a function.

Definition 7.11. The subgraph of a function f : X → R is a subset of X × R defined as

Cf := {(x, t) ∈ X × R : t < f (x)}.

A collection F of measurable functions on X is called a VC subgraph class, or just a VC


class, if the collection of all subgraphs of the functions in F (i.e., {Cf : f ∈ F }) forms a
VC class of sets (in X × R).

Let V (F) be the VC dimension of the set of subgraphs of functions in F. Just as for
sets, the covering numbers of VC classes of functions grow at a polynomial rate.

Theorem 7.12. For a VC class of functions F with measurable envelope function F and
r ≥ 1, one has for any probability measure Q with kF kQ,r > 0,
 rV (F )
2
N (kF kQ,r , F, Lr (Q)) ≤ K V (F) (4e)V (F ) , (80)


for a universal K and 0 <  < 1.

Proof. Let C be the set of all subgraphs Cf of functions f in F. Note that Q|f − g| = (Q ×
λ)(1Cf ∆Cg ) = (Q×λ)|1Cf −1Cg | where λ is the Lebesgue measure on R62 . Renormalize Q×λ
to a probability measure on the set {(x, t) : |t| ≤ F (x)} by defining P = (Q × λ)/(2kF kQ,1 ).
Thus, as P (Cf ∆Cg ) = P |1Cf − 1Cg | = 2kF1kQ,1 Q|f − g|,

N (2kF kQ,1 , F, L1 (Q)) = N (, C, L1 (P )).

Then by the result for VC classes of sets stated in Theorem 7.8,


 V (F )
4e
N (2kF kQ,1 , F, L1 (Q)) ≤ KV (F) , (81)


for a universal constant K, for any probability measure Q with kF kQ,1 > 0. This completes
the proof for r = 1.
For r > 1 note that

Q|f − g|r ≤ Q[|f − g|(2F )r−1 ] = 2r−1 R[|f − g|]Q[F r−1 ], (82)
62
R
Fact: For any two real numbers a and b, we have the identity |a − b| = |I{t < a} − I{t < b}|dt.

85
for the probability measure R with density F r−1 /Q[F r−1 ] with respect to Q. We claim that

N (kF kQ,r , F, Lr (Q)) ≤ N (2(/2)r R[F ], F, L1 (R)). (83)

To prove this claim let N = N (2(/2)r R[F ], F, L1 (R)) and let f1 , . . . fN an 2(/2)r R[F ]-net
for the class F (under L1 (R)-norm). Therefore, given f ∈ F, there exists k ∈ {1, . . . , N }
such that kf − fk kL1 (R) ≤ 2(/2)r R[F ]. Hence, by (82),

Q|f − fk |r ≤ 2r−1 R[|f − fk |] Q[F r−1 ] ≤ 2r−1 2(/2)r R[F ] Q[F r−1 ] = r Q[F r ],

which implies that kf − fk kLr (Q) ≤ kF kQ,r . Thus, we have obtained a kF kQ,r -cover of F
in the Lr (Q)-norm, which proves the claim. Now, combining (83) with (81) yields
 rV (F )
V (F ) 2
N (kF kQ,r , F, Lr (Q)) ≤ KV (F)(4e) ,


which completes the proof.

The preceding theorem shows that a VC class has a finite uniform entropy integral,
with much to spare. In fact we can show that if F is a class of measurable functions with
envelope F and VC subgraph dimension V (F) then the expected supremum of the empirical
process can be easily controlled63 .
63
Here in a maximal inequality for an important (VC) class of functions we will encounter soon.

Theorem 7.13. Let F be a measurable class of functions with a constant envelope U such that for A > e2
and V ≥ 2 and for every finitely supported probability measure Q
 V
A
N (U, F, L2 (Q)) ≤ , 0 ≤  < 1.

Then, for all n, !
n r
X √ AU AU
E (f (Xi ) − P f ) ≤L nσ V log ∨ V U log (84)
i=1
F σ σ

where L is a universal constant and σ is such that supf ∈F P (f − P f )2 ≤ σ 2 .

Remark 7.2. If nσ 2 & V log(AU/σ) then the above result shows that
n
X p
E (f (Xi ) − P f ) . nσ 2 V log(AU/σ),
F
i=1

which means that if nσ 2 is not too small, then the ‘price’ one pays for considering the expectation of the
p
supremum of infinitely many sums instead of just one is the factor V log(AU/σ).

Proof. We assume without loss of generality that the class F contains the function 0 and that the functions
f are P -centered. It suffices to prove the inequality in the theorem for U = 1. By our symmetrization result
Pn Pn
(see Theorem 3.17), we have E i=1 f (Xi ) F ≤ 2E i=1 εi f (Xi ) F . By Dudley’s entropy bound (see
Theorem 4.1 and (41)), we have

n
√ Z kPn f 2 kF
r
1 X A
√ Eε εi f (Xi ) ≤K V log d,
n i=1
F 0 

86
Exercise (HW3): Suppose F is a class of measurable functions with envelope F and VC
subgraph dimension V (F). Then, for some constant C > 0,
" # r
V (F)
E sup |(Pn − P )f | ≤ CkF kP,2 . (85)
f ∈F n

Exercise (HW3): Recall the setting of Example 5.5 which considered a change point problem.
Show that condition (45) (bound on the modulus of continuity of the empirical process)
needed to apply Theorem 5.2, to obtain the rate of convergence of the estimator, holds with
an appropriate function φn (·) (also see Remark 5.2).
where εi ’s are i.i.d. Rademacher variables independent of the variables Xj ’s and Eε indicates expectation
with respect to the εi ’s. It is easy to see that if log(C/c) ≥ 2 then
Z c
C 1/2  C 1/2
log dx ≤ 2c log .
0 x c

Since A/kPn f 2 kF ≥ e2 (as |f | ≤ 1 by assumption), we conclude that


n
s
1 X √ p A
√ Eε εi f (Xi ) ≤ 2K V kPn f 2 kF log p .
n i=1
F kP n f kF
2

x(− log x) on (0, e−1 ), this yields


p
By the concavity of the function
n
s
1 X √ p A √
√ E εi f (Xi ) ≤ 2K V EkPn f 2 kF log p =: 2K V B.
n i=1
F EkPn f 2 kF

Next, notice that


n n
1 X 2 X
EkPn f 2 kF ≤ σ 2 + E (f 2 (Xi ) − P f 2 ) ≤ σ2 + E εi f 2 (Xi ) .
n i=1 F n i=1 F

Now, since Pn [(f 2 − g 2 )2 ] ≤ 4Pn [(f − g)2 ] (as |f | ≤ 1 by assumption), which implies N (, F 2 , L2 (Pn )) ≤
N (/2, F, L2 (Pn )), we can estimate the last expectation again by the entropy bound (2.14), and get, with
the change of variables /2 = u,
√ "Z √ #
kPn f 2 kF
r
2 2 4K V A
EkPn f kF ≤ σ + √ E log du
n 0 u

which, by the previous computations gives



8K V B
EkPn f 2 kF ≤ σ 2 + √
n
Replacing this into the definition of B shows that B satisfies the inequality (check!)
 √ 
8K V B A
B 2 ≤ σ2 + √ log .
n σ
√ √
This implies that B is dominated by the largest root of the quadratic function B 2 − 8K V B log(A/σ)/ n −
σ 2 log(A/σ) which yields the desired result for a suitable constant L.

Note that when the dominant term is the first, we only pay a logarithmic price for the fact that we are
taking the supremum over a countable or uncountable set. In fact the inequality is sharp in the range of
(σ, n) where the first term dominates.

87
7.4 Examples and Permanence Properties

The results of this subsection give basic methods for generating VC (subgraph) classes.
This is followed by a discussion of methods that allow one to build up new function classes
related to the VC property (from basic classes).
Although it is obvious, it is worth mentioning that a subclass of a VC class is itself
a VC class. The following lemma shows that various operations on VC classes (of sets)
preserve the VC structure.

Lemma 7.14. Let C and D be VC classes of sets in a set X and φ : X → Y and ψ : Z → X


be fixed functions. Then:

(i) C c = {C c : C ∈ C} is VC;

(ii) C u D = {C ∩ D : C ∈ C, D ∈ D} is VC;

(iii) C t D = {C ∪ D : C ∈ C, D ∈ D} is VC;

(iv) φ(C) is VC if φ is one-to-one;

(v) ψ −1 (C) is VC;

(vi) C × D = {C × D : C ∈ C, D ∈ D} is VC in X × Y.

Proof. (i) The set C c picks out the points of a given set {x1 , . . . , xn } that C does not pick
out. Thus if C shatters a given set of points, so does C c . This proves (i) and shows that the
dimensions of C and C c are equal.
(ii) Fix n ≥ max{V (C), V (D)}. Let x1 , . . . , xn ∈ X be arbitrary. We have to study
the cardinality of the set {C ∩ D ∩ {x1 , . . . , xn } : C ∈ C, D ∈ D}. From the n points
x1 , . . . , xn , C can pick out O(nV (C) ) subsets. From each of these subsets, D can pick out
at most O(nV (D) ) further subsets64 . Thus C u D can pick out O(nV (C)+V (D) ) subsets. For
large n, this is certainly smaller than 2n . This proves (ii).
Next, (iii) follows from a combination of (i) and (ii), since C ∪ D = (C c ∩ Dc )c .
(iv) We will show that if φ(C) shatters a set of points in Y then C should also shatter
a set of points of the same cardinality in X , which will yield the desired result. Suppose
that φ(C) shatters {y1 , . . . , yn } ⊂ Y. Then each yi must be in the range of φ and there exist
x1 , . . . , xn such that φ is a bijection between x1 , . . . , xn and y1 , . . . , yn . We now claim that C
must shatter {x1 , . . . , xn }. To see this, let A := {xi1 , . . . , xik } for 1 ≤ i1 , . . . , ik ≤ n distinct,
with 0 ≤ k ≤ n. As φ(C) shatters {y1 , . . . , yn } ⊂ Y, φ(C) picks out there {yi1 , . . . , yik } and
thus there exists B ≡ φ(C) ∈ φ(C), where C ∈ C, such that B ∩{y1 , . . . , yn } = {yi1 , . . . , yik }.
64 PV (D) k
≤ O(nV (D) ), for any

Note that by Sauer’s lemma, for any 1 ≤ k ≤ n, ∆n (D; xi1 , . . . , xik ) ≤ j=0 j
{xi1 , . . . , xik } ⊂ {x1 , . . . , xn }.

88
As φ is a bijection, this means that C ∩ {x1 , . . . , xn } = {xi1 , . . . , xik }, and thus C picks out
there {xi1 , . . . , xik }.
For (v) the argument is analogous: if ψ −1 (C) shatters z1 , . . . , zn , then all xi := ψ(zi )
must be different and the restriction of ψ to z1 , . . . , zn is a bijection on its range. We now
claim that then C shatters x1 , . . . , xn 65 . Thus, as C has finite VC dimension, so has ψ −1 (C).
For (vi) note first that C × Y and X × D are VC classes66 . Then by (ii) so is their
intersection C × D.

Exercise (HW3) (Open and closed subgraphs): For a set F of measurable functions, define
“closed” and “open” subgraphs by {(x, t) : t ≤ f (x)} and {(x, t) : t < f (x)}, respectively.
Then the collection of “closed” subgraphs has the same VC-dimension as the collection of
“open” subgraphs. Consequently, “closed” and “open” are equivalent in the definition of a
VC-subgraph class.

Lemma 7.15. Any finite-dimensional vector space F of measurable functions f : X → R


is VC subgraph of dimension smaller than or equal to dim(F) + 1.

Proof. By assumption, there exists m := dim(F) functions f1 , . . . , fm : X → R such that


m
nX o
F := αj fj (x) : αj ∈ R .
i=1

Take any collection of n = dim(F) + 2 points (x1 , t1 ), . . . , (xn , tn ) in X × R. We will show


that the subgraphs of F do not shatter these n points. Let H ∈ Rn×m be the matrix with
elements (fj (xi )), for i = 1, . . . , n and j = 1, . . . , m. By assumption, the vectors in

H := {(f (x1 ) − t1 , . . . , f (xn ) − tn ) : f ∈ F} = {Hc − (t1 , . . . , tn ) : c ∈ Rm },

are contained in a dim(F) + 1 = (n − 1)-dimensional subspace67 of Rn . Any vector a 6= 0


that is orthogonal to this subspace satisfies
X X
ai (f (xi ) − ti ) = (−ai )(f (xi ) − ti ), for every f ∈ F.
i:ai >0 i:ai ≤0

(Define the sum over an empty set as zero.) There exists such a vector a with at least one
strictly positive coordinate. We will show that the subgraphs of F do not pick out the set
{(xi , ti ) : ai > 0}.
Suppose there exists f ∈ F such that Cf ∩ {(xi , ti )}ni=1 = {(xi , ti ) : ai > 0}. Then,
for i such that ai ≤ 0 we must have (xi , ti ) ∈
/ Cf , i.e., ti ≥ f (xi ). However, then the left
side of the above display would be strictly positive and the right side non-positive for this
65
Exercise (HW3): Show this.
66
Exercise (HW3): Show this.
67
We can find α ∈ Rn such that α> H = 0 and α> (t1 , . . . , tn ) = 0. Such an α exists as the columns of H
and (t1 , . . . , tn ) span at most an m + 1 = n − 1 dimensional subspace of Rn .

89
f . Conclude that the subgraphs of F do not pick out the set {(xi , ti ) : ai > 0}. Hence the
subgraphs of F shatter no set of n points.
P
Example 7.16. Let F be the set of all linear combinations λi fi of a given, finite set of
functions f1 , . . . , fk on X . Then F is a VC class and hence has a finite uniform entropy
integral. Furthermore, the same is true for the class of all sets {f > c} if f ranges over F
and c over R.

Lemma 7.17. The set of all translates {ψ(x − h) : h ∈ R} of a fixed monotone function
ψ : R → R is VC of dimension 1.

Proof. By the monotonicity, the subgraphs are linearly ordered by inclusion: if ψ is nonde-
creasing, then the subgraph of x 7→ ψ(x − h1 ) is contained in the subgraph of x 7→ ψ(x − h2 )
if h1 ≥ h2 . Any collection of sets with this property has VC dimension 1 by Proposi-
tion 7.1868 .

Lemma 7.19. Let F and G be VC subgraph classes of functions on a set X and g : X →


R, φ : R → R, and ψ : Z → X fixed functions. Then,

(i) F ∧ G = {f ∧ g : f ∈ F; g ∈ G} is VC subgraph;

(ii) F ∨ G is VC subgraph;

(iii) {F > 0} := {{f > 0} : f ∈ F} is VC;

(iv) −F is VC;

(v) F + g := {f + g : f ∈ F} is VC subgraph;

(vi) F · g = {f g : f ∈ F} is VC subgraph;

(vii) F ◦ ψ = {f (ψ) : f ∈ F} is VC subgraph;

(viii) φ ◦ F is VC subgraph for monotone φ.


68

Proposition 7.18. Suppose that C is a collection of at least two subsets of a set X . Show that V (C) = 1 if
either (a) C is linearly ordered by inclusion, or, (b) any two sets in C are disjoint.

Proof. Consider (a) first. Take points x1 , x2 ∈ X . We need to show that this set of 2 points cannot be
shattered. Suppose that {x1 , x2 } can be shattered. Let C1 pick out {x1 } and C2 pick out {x2 }. By (a), one
of these sets is contained in the other. Suppose C1 ⊂ C2 . But then {x1 } ⊂ C2 and this contradicts the fact
that C2 picks out {x2 }. On the other hand, at least one set of size 1 is shattered.
Next consider (b). As before, suppose that {x1 , x2 } can be shattered. Suppose C picks out {x1 } and D
picks out {x1 , x2 }. But then C and D are no longer disjoint.

90
Proof. The subgraphs of f ∧ g and f ∨ g are the intersection and union of the subgraphs of
f and g, respectively. Hence (i) and (ii) are consequences of Lemma 7.14.
For (iii) note that the sets {f > 0} are one-to-one images of the intersections of the
(open) subgraphs with the set X × {0}, i.e.,

{f > 0} = {x ∈ X : f (x) > 0} = φ {(x, t) ∈ X × R : f (x) > t} ∩ (X × {0}) .

Here φ : X × {0} → X defined as φ(x, 0) = x is one-one. Thus the class {F > 0} is VC by


(ii) and (iv) of Lemma 7.14.
(iv) The subgraphs of the class −F are the images of the open supergraphs of F
under the map (x, t) 7→ (x, −t). The open supergraphs are the complements of the closed
subgraphs, which are VC by the previous exercise. Now (iv) follows from the previous
lemma.
For (v) it suffices to note that the subgraphs of the class F + g shatter a given set of
points (x1 , t1 ), . . . , (xn , tn ) if and only if the subgraphs of F shatter the set (xi , ti − g(xi )).
The subgraph of the function f g is the union of the sets

C + := {(x, t) : t < f (x)g(x), g(x) > 0},


C − := {(x, t) : t < f (x)g(x), g(x) < 0},
C 0 := {(x, t) : t < 0, g(x) = 0},

It suffices to show that these sets are VC in (X ∩ {g > 0}) × R, (X ∩ {g < 0}) × R, and
(X ∩ {g = 0}) × R, respectively69 . Now, for instance, {i : (xi , ti ) ∈ C − } is the set of
indices of the points (xi , ti /g(xi )) picked out by the open supergraphs of F. These are the
complements of the closed subgraphs and hence form a VC class.
The subgraphs of the class F ◦ ψ are the inverse images of the subgraphs of functions
in F under the map (z, t) 7→ (ψ(z), t). Thus (v) of Lemma (7.14) implies (vii).
For (viii) suppose that the subgraphs of φ◦F shatter the set of points (x1 , t1 ), . . . , (xn , tn ).
Choose f1 , . . . , fm from F such that the subgraphs of the functions φ◦fj pick out all m = 2n
subsets. For each fixed i, define si = max{fj (xi ) : φ(fj (xi )) ≤ ti }. Then si < fj (xi ) if and
only if ti < φ(fj (xi )), for every i and j, and the subgraphs of f1 , . . . , fm shatter the points
(xi , si ).

Exercise (HW3): The class of all ellipsoids {x ∈ Rd : (x − µ)> A(x − µ) ≤ c}, for µ ranging
over Rd and A ranging over the nonnegative d × d matrices, is VC. [Hint: This follows by a
combination of Lemma 7.19(iii), Lemma 7.14(i) and Lemma 7.15. The third shows that the
set of functions x 7→ (x − µ)> A(x − µ) − c (a vector space with basis functions x 7→ c, x 7→ xi
and x 7→ xi xj ) is VC, and the first and second show that their positive (or negative) sets
are also VC.
69
Exercise (HW3): If X is the union of finitely many disjoint sets Xi , and Ci is a VC class of subsets of
Pm
Xi for each i, i = 1, . . . , m, then tm m
i=1 Ci is a VC class in ∪i=1 Xi of dimension i=1 V (Ci ).

91
Example 7.20. Let ft (x) = |x − t| where t ∈ R. Let F = {ft : t ∈ R}. Then this is
a VC class of functions as ft (x) = (x − t) ∨ [−(x − t)] and we can use Lemma 7.15 with
Lemma 7.19(i) to prove the result.

Sometimes, the result of simple operations on VC classes of functions (e.g., addition of


two such classes) can produce a function classes that is not itself VC. However, the resulting
function class might still have a uniform polynomial bound on covering numbers (as in (80))
and hence are very easy to work with. The following results provide a few such examples.

Lemma 7.21. Fix r ≥ 1. Suppose that F and G are classes of measurable functions with
envelopes F and G respectively. Then, for every 0 <  < 1,

(i) N (2kF + GkQ,r , F + G, Lr (Q)) ≤ N (kF kQ,r , F, Lr (Q)) · N (kGkQ,r , G, Lr (Q));

(ii) supQ N (2kF ·GkQ,2 , F ·G, L2 (Q)) ≤ supQ N (kF kQ,2 , F, L2 (Q)) supQ N (kGkQ,2 , G, L2 (Q)),
where F · G := {f g : f ∈ F, g ∈ G}, and the supremums are all taken over the
appropriate subsets of all finitely discrete probability measures.

Proof. Let us first prove (i). Find functions f1 , . . . , fn and g1 , . . . , gm such that

min kf − fi krQ,r ≤ r kF krQ,r , ∀f ∈ F, and min kg − gj krQ,r ≤ r kGkrQ,r , ∀g ∈ G.


i j

Now, given f + g ∈ F + G, we can find i and j such that

kf + g − fi − gj kQ,r ≤ kf − fi kQ,r + kg − gj kQ,r ≤ kF kQ,r + kGkQ,r ≤ 2kF + GkQ,r ,

which completes the proof.


Let us prove (ii) now. Fix  > 0 and a finitely discrete probability measure Q̃ with
kF GkQ̃,2 > 0 (which also implies that kGkQ̃,2 > 0), and let dQ∗ := G2 dQ̃/kGk2Q̃,2 . Clearly,
Q∗ is a finitely discrete probability measure with kF kQ∗ ,2 > 0. Let f1 , f2 ∈ F satisfying
kf1 − f2 kQ∗ ,2 ≤ kF kQ∗ ,2 . Then

kf1 − f2 kQ∗ ,2 k(f1 − f2 )GkQ̃,2


≥ = ,
kF kQ∗ ,2 kF GkQ̃,2

and thus, if we let F · G := {f G : f ∈ F},

N (kF GkQ̃,2 , F · G, L2 (Q̃)) ≤ N (kF kQ∗ ,2 , F, L2 (Q∗ )) ≤ sup N (kF kQ,2 , F, L2 (Q)),
Q

where the supremum is taken over all finitely discrete probability measures Q for which
kF kQ,2 > 0. Since the right hand-side of the above display does not depend on Q̃, and since
Q̃ satisfies kF GkQ̃,2 > 0 but is otherwise arbitrary, we have that

sup N (kF GkQ,2 , F · G, L2 (Q)) ≤ sup N (kF kQ,2 , F, L2 (Q)), (86)


Q Q

92
where the supremums are taken over all finitely discrete probability measures Q but with
the left side taken over the subset for which kF GkQ,2 > 0 while the right side is taken over
the subset for which kF kQ,2 > 0.
We can similarly show that the uniform entropy numbers for the class G · F with
envelope F G is bounded by the uniform entropy numbers for G with envelope G. Since
|f1 g1 − f2 g2 | ≤ |f1 − f2 |G + |g1 − g2 |F for all f1 , f2 ∈ F and g1 , g2 ∈ G, part (i) in
conjunction with (86) imply that

sup N (2kF GkQ,2 , F · G, L2 (Q)) ≤ sup N (kF kQ,2 , F, L2 (Q)) × sup N (kGkQ,2 , G, L2 (Q)),
Q Q Q

where the supremums are all taken over the appropriate subsets of all finitely discrete
probability measures.

Lemma 7.22. (i) Let ρ(·) be a real-valued right continuous function of bounded variation
on R+ . The covering number of the class F of all functions on Rd of the form
x 7→ ρ(kAx + bk), with A ranging over all m × d matrices and b ∈ Rm satisfies the
bound
N (, F, Lr (Q)) ≤ K1 −V1 , (87)
for some K1 and V1 and for a constant envelope.

(ii) (Exercise (HW3)) Let λ(·) be a real-valued function of bounded variation on R. The
class of all functions on Rd of the form x 7→ λ(α> x + β), with α ranging over Rd and
β ranging over R, satisfies (87) for a constant envelope.

Proof. Let us prove (i). By Lemma 7.21 it is enough to treat the two monotone components
of ρ(·) separately. Assume, without loss of generality, that ρ(·) is bounded and nondecreas-
ing, with ρ(0) = 0. Define ρ−1 (·) as the usual left continuous inverse of ρ on the range
T = (0, sup ρ), i.e.,
ρ−1 (t) := inf{y : ρ(y) ≥ t}.
This definition ensures that

{y : ρ(y) ≥ t} = {y : y ≥ ρ−1 (t)}, for t ∈ T.

Exercise (HW3): Complete the proof now.

7.5 Exponential tail bounds: some useful inequalities

Suppose that we have i.i.d. data X1 , . . . , Xn on a set X having distribution P and F is


a VC class of measurable real-valued functions on X . We end this chapter with a brief
discussion of some useful and historically important results on exponential tail bounds for
the empirical process indexed by VC classes of functions. One of the classical results in this
direction is the exponential tail bounds for the supremum distance between the empirical
distribution and the true distribution function; see [Dvoretzky et al., 1956].

93
(A) Empirical d.f., X = R: Suppose that we consider the classical empirical d.f. of
real-valued random variables. Thus, F = {1(−∞,t] (·) : t ∈ R}. Then, letting Fn and F
denote the empirical and true distribution functions, [Dvoretzky et al., 1956] showed
that

P(k n(Fn − F )k∞ ≥ x) ≤ C exp(−2x2 )

for all n ≥ 1, x ≥ 0 where C is an absolute constant. [Massart, 1990] showed that


C = 2 works, confirming a long-standing conjecture of Z. W. Birnbaum.
This result strengthens the GC theorem by quantifying the rate of convergence as n
tends to infinity. It also estimates the tail probability of the Kolmogorov-Smirnov
statistic.

(B) Empirical d.f., X = Rd : Now consider the classical empirical d.f. of i.i.d. random
vectors: Thus F = {1(−∞,t] (·) : t ∈ Rd }. Then [Kiefer, 1961] showed that for every
 > 0 there exists a C such that

P(k n(Fn − F )k∞ ≥ x) ≤ C exp(−(2 − )x2 )

for all n ≥ 1, x > 0.

(C) Empirical measure, X general: Let F = {1C : C ∈ C} be such that


 V
K
sup N (, F, L1 (Q)) ≤ ,
Q 

where we assume that V ≥ 1 and K ≥ 1. Then [Talagrand, 1994] proved that


 √  D  Dx2 V
P k n(Pn − P )kC ≥ x ≤ exp(−2x2 ) (88)
x V

for all n ≥ 1 and x > 0, where D ≡ D(K) depends on K only.

(D) Empirical measure, X general: Let F be a class of functions such that f : X →


[0, 1] for every f ∈ F, and F satisfies
 V
K
sup N (, F, L2 (Q)) ≤ ;
Q 

e.g., when F is a VC class V = 2V (F). Then [Talagrand, 1994] proved that


 √   Dx V
P k n(Pn − P )kF ≥ x ≤ √ exp(−2x2 )
V
for all n ≥ 1 and x > 0.

Example 7.23 (Projection pursuit). Projection pursuit (PP) is a type of statistical tech-
nique which involves finding the most “interesting” possible projections in multidimensional

94
data. Often, projections which deviate more from a normal distribution are considered to
be more interesting. The idea of projection pursuit is to locate the projection or projections
from high-dimensional space to low-dimensional space that reveal the most details about the
structure of the data set.
Suppose that X1 , X2 , . . . , Xn are i.i.d. P on Rd . The first step in PP is to estimate
the distribution of the low-dimensional projections. We address this question here. In
particular, we ask: “How large can d be (with n) so that we can still uniformly approximate
all the one-dimensional projections of P ?”
We answer the above questions below. For t ∈ R and γ ∈ S d−1 (S d−1 is the unit sphere
in Rd ), let
Fn (t; γ) = Pn [1(−∞,t] (γ · X)] = Pn (γ · X ≤ t),

denote the empirical distribution function of γ · X1 , . . . , γ · Xn . Let

F (t; γ) = P [1(−∞,t] (γ · X)] = P(γ · X1 ≤ t).

Question: Under what conditions on d = dn → ∞ as n → ∞, do we have


P
Dn := sup sup |Fn (t; γ) − F (t; γ)| → 0?
t∈R γ∈S d−1

First note that the sets in question in this example are half-spaces

Ht,γ := {x ∈ Rd : γ · x ≤ t}.

Note that
Dn = sup sup |Pn (Ht,γ ) − P (Ht,γ )| = kPn − P kH ,
t∈R γ∈S d−1

where H := {Ht,γ : t ∈ R, γ ∈ S d−1 }.


The key to answering the question raised in this example is one of the exponential
bounds applied to the collection H, the half-spaces in Rd . The collection H is a VC collection
of sets with V (H) = d + 170 .
70
We have to prove two inequalities: V (H) ≥ d + 1 and V (H) ≤ d + 1. To prove the first inequality, we
need to exhibit a particular set of size d + 1 that is shattered by H. Proving the second inequality is a bit
more tricky: we need to show that for all sets of size d + 2, there is labelling that cannot be realized using
half-spaces.
Let us first prove V (H) ≥ d + 1. Consider the set X0 = {0, e1 , . . . , ed } which consists of the origin
along with the vectors in the standard basis of Rd (also let e0 = 0 ∈ Rd ). Let A := {ei1 , . . . , eim } be a
subset of X0 , where m ≥ 0 and {i1 , . . . , im } ⊆ {0, . . . , d}. We will show that A is picked out by H. Let
√ √
γ = (γ1 , . . . , γd ) ∈ S d−1 be such that γj = −1/ d if j ∈ {i1 , . . . , im } and γj = 1/ d otherwise. Let t = 0

if e0 ∈ A and t = −1/ d if e0 ∈ / A. Thus, for x ∈ A, γ · x ≤ t and for x ∈ X0 \ A, γ · x > t. Therefore,
{x ∈ Rd : γ · x ≤ t} ∩ X0 = A, which shows that A is picked out.
To prove V (H) ≤ d + 1, we need the following result from convex geometry; see e.g., https://en.
wikipedia.org/wiki/Radon%27s_theorem.

95
By Talagrand’s exponential bound (88),
 √  D  Dx2 d+1
P k n(Pn − P )kH ≥ x ≤ exp(−2x2 )
x d+1

for all n ≥ 1 and x > 0. Taking x =  n yields
 2 d+1
 D D n
P kPn − P kH ≥ ) ≤ √ exp(−22 n)
 n d+1
D
  D2 n  
= √ exp (d + 1) log − 22 n
 n d+1
→ 0 as n → ∞,

if d/n → 071 .

Lemma 7.24 (Radon’s Lemma). Let X0 ⊂ Rd be a set of size d + 2. Then there exist two disjoint subsets
X1 , X2 of X0 such that conv(X1 ) ∩ conv(X2 ) 6= ∅ where conv(Xi ) denotes the convex hull of Xi .

Given Radon’s lemma, the proof of V (H) ≤ d + 1 is easy. We have to show that given any set X0 ∈ Rd
of size d + 2, there is a subset of X0 that cannot be picked out by the half-spaces. Using Radon’s lemma
with X0 yields two disjoint subsets X1 , X2 of X0 such that conv(X1 ) ∩ conv(X2 ) 6= ∅. We now claim that X1
cannot be picked out by using any half-space. Suppose that there is such a half-space H, i.e., H ∩ X0 = X1 .
Note that if a half-space picks out a set of points, then every point in its convex hull is also picked out.
Thus, conv(X1 ) ⊂ H. However, as conv(X1 ) ∩ conv(X2 ) 6= ∅, H ∩ conv(X2 ) 6= ∅ which implies that H also
contains at least one point from X2 , leading to a contradiction.
71
Exercise (HW3): Show this.

96
8 Talagrand’s concentration inequality for the suprema of
the empirical process

The main goal of this chapter is to motivate and formally state (without proof) Talagrand’s
inequality for the suprema of the empirical process. We will also see a few applications of
this result. If we have time, towards the end of the course, I will develop the tools necessary
and prove the main result. To fully appreciate the strength of the main result, we start with
a few important tail bounds for the sum of independent random variables. The following
discussion extends and improves Hoeffding’s inequality (Lemma 3.9).
In most of results in this chapter we only assume that the X -valued random variables
X1 , . . . , Xn are independent; they need not be identically distributed.

8.1 Preliminaries

Recall Hoeffding’s inequality: Let X1 , . . . , Xn be independent and centered random variables


such that Xi ∈ [ai , bi ] w.p.1 and let Sn := ni=1 Xi . Then, for any t ≥ 0,
P

2/
Pn 2 2/
Pn 2
P (Sn ≥ t) ≤ e−2t i=1 (bi −ai ) , and P (Sn ≤ −t) ≤ e−2t i=1 (bi −ai ) . (89)

A crucial ingredient in the proof of the above result was Lemma 3.8 which stated that for
2 2
a centered X ∈ [a, b] w.p.1 we have E[eλX ] ≤ eλ (b−a) /8 , for λ ≥ 0.
Note that if bi −ai is much larger than the standard deviation σi of Xi then, although the
tail probabilities prescribed by Hoeffding’s inequality for Sn are of the normal type72 , they
correspond to normal variables with the ‘wrong’ variance. The following result incorporates
the standard deviation of the random variable and is inspired by the moment generating
function of Poisson random variables73 .

Theorem 8.1. Let X be a centered random variable such that |X| ≤ c a.s, for some c < ∞,
and E[X 2 ] = τ 2 . Then
 2 
λX τ λc
E[e ] ≤ exp (e − 1 − λc) , for all λ > 0. (90)
c2

As a consequence, if Xi , 1 ≤ i ≤ n, are centered, independent and a.s. bounded by c < ∞


in absolute value, then setting
n
2 1X
σ := E[Xi2 ], (91)
n
i=1
72 2
Recall that if the Xi ’s are i.i.d. and centered  with variance σ , by the CLT for fixed t > 0,
√ t
 σ

t 2
limn→∞ P (Sn ≥ t n) = 1 − Φ σ ≤ √2πt exp − 2σ2 , where the last inequality uses a standard bound on
the normal CDF.
73
Recall that if X has Poisson distribution with parameter a (i.e., EX = Var(X) = a) then E[eλ(X−a) ] =
λ
e−a(λ+1) ∞ a /k! = ea(e −1−λ) .
λk k
P
k=0 e

97
Pn
and Sn = i=1 Xi , we have

nσ 2 λc
 
λSn
E[e ] ≤ exp (e − 1 − λc) , for all λ > 0, (92)
c2
and the same inequality holds for −Sn .

Proof. Since E(X) = 0, expansion of the exponential gives


∞ ∞
λX
X λk EX k X λk EX k 
E[e ]=1+ ≤ exp .
k! k!
k=2 i=2

Since |EX k | ≤ ck−2 τ 2 , for all k ≥ 2, this exponent can be bounded by


∞ ∞ ∞
X λk EX k X (λc)k−2 τ 2 X (λc)k τ 2 λc
≤ λ2 τ 2 = = (e − 1 − λc).
k! k! c2 k! c2
k=2 k=2 k=2

This gives inequality (90). Inequality (92) follows from (90) by using the independence of
the Xi ’s. The above also applies to Yi = −Xi which yields the result for −Sn .

It is standard to derive tail probability bounds for a random variable based on a bound
for its moment generating function. We proceed to implement this idea and obtain four such
bounds, three of them giving rise, respectively, to the Bennett, Prokhorov and Bernstein
classical inequalities for sums of independent random variables and one where the bound on
the tail probability function is inverted. It is convenient to introduce the following notation:

φ(x) = e−x − 1 + x, for x ∈ R


h1 (t) = (1 + t) log(1 + t) − t, for t ≥ 0.

Proposition 8.2. Let Z be a random variable whose moment-generating function satisfies


the bound
E(eλZ ) ≤ exp ν(eλ − 1 − λ) ,

λ > 0, (93)

for some ν > 0. Then, for all t ≥ 0,


t2
   
3t  2t 
P(Z ≥ t) ≤ e−νh1 (t/ν) ≤ exp − log 1 + ≤ exp − (94)
4 3ν 2ν + 2t/3
and  √ 
P Z ≥ 2νx + x/3 ≤ e−x , x ≥ 0. (95)

Proof. Observe that by Markov’s inequality and the given bound E[eλZ ], we obtain

P(Z ≥ t) = inf P(eλZ ≥ eλt ) ≤ inf e−λt E[eλZ ] ≤ eν inf λ>0 {φ(−λ)−λt/ν} .
λ>0 λ>0

It can be checked that for z > −1 (think of z = t/ν)

inf {φ(−λ) − λz} = z − (1 + z) log(1 + z) = −h1 (z).


λ∈R

98
This proves the first inequality in (94). We can also show that (by checking the value of
the corresponding functions at t = 0 and then comparing derivatives)
3t  2t  t2
h1 (t) ≥ log 1 + ≥ , for t > 0,
4 3 2 + 2t/3
thus completing the proof of the three inequalities in (94).
To prove (95), we begin by observing that (by Taylor’s theorem) (1−λ/3)(eλ −λ−1) ≤
λ2 /2, λ ≥ 0. Thus, if
νλ2
ϕ(λ) := , λ ∈ [0, 3),
2(1 − λ/3)
then inequality (93) yields
  " #
P(Z ≥ t) ≤ inf e−λt E[eλZ ] ≤ exp inf (ϕ(λ) − λt) = exp − sup (λt − ϕ(λ)) = e−γ(t) ,
0≤λ<3 0≤λ<3 0≤λ<3

where we have used the fact that ν(eλ − 1 − λ) ≤ ϕ(λ) and γ(s) := supλ∈[0,3) (λs − ϕ(λ)), for

s > 0. Then it can be shown74 that γ −1 (x) = 2νx + x/3. Therefore, letting t = γ −1 (x)
(i.e., x = γ(t)) in the above display yields (95).

Let Xi , 1 ≤ i ≤ n, be independent centered random variables a.s. bounded by c < ∞


in absolute value. Let Sn := ni=1 Xi and define Z := Sn /c. Then,
P

n n
E[Xi2 ] λ
   2 
λZ
Y
(λ/c)Xi
Y nσ λ
E[e ] = E[e ]≤ exp (e − 1 − λ) = exp (e − 1 − λ)
c2 c2
i=1 i=1
Pn
where σ 2 = n1 i=1 E[Xi2 ]. Thus, Z satisfies the hypothesis of Proposition 8.2 with ν :=
nσ 2 /c2 . Therefore we have the following exponential inequalities, which go by the names
of Bennet’s, Prokhorov’s and Bernstein’s75 (in that order).

Theorem 8.4. Let Xi , 1 ≤ i ≤ n, be independent centered random variables a.s. bounded


by c < ∞ in absolute value. Set σ 2 = ni=1 E[Xi2 ]/n and Sn := ni=1 Xi . Then, for all
P P

x ≥ 0,
2
t2
      
− nσ2 h1 tc2 3t  2tc 
P(Sn ≥ t) ≤ e c nσ ≤ exp − log 1 + ≤ exp − (96)
4c 3nσ 2 2nσ 2 + 2ct/3
74
Exercise (HW3): Complete this.
75
It is natural to ask whether Theorem 8.4 extends to unbounded random variables. In fact, Bernstein’s
inequality does hold for random variables Xi with finite exponential moments, i.e., such that E[eλ|Xi | ] < ∞,
for some λ > 0, as shown below.

Lemma 8.3 (Bernstein’s inequality). Let Xi , 1 ≤ i ≤ n, be centered independent random variables such
that, for all k ≥ 2 and all 1 ≤ i ≤ n,
k!
E|Xi |k ≤ σi2 ck−2 ,
2
and set σ 2 := n1 n
P 2 Pn
σ
i=1 i , S n := i=1 Xi . Then,
t2
 
P(Sn ≥ t) ≤ exp − , for t ≥ 0.
2nσ 2 + 2ct

99
and  √ 
P Sn ≥ 2nσ 2 x + cx/3 ≤ e−x , x ≥ 0.

Bennett’s inequality is the sharpest, but Prokhorov’s and Bernstein’s inequalities are
easier to interpret. Prokhorov’s inequality exhibits two regimes for the tail probabilities of
Sn : if tc/(nσ 2 ) is small, then the logarithm is approximately 2tc/(3nσ 2 ), and the tail
2 2
probability is only slightly larger than e−t /(2nσ ) (which is Gaussian-like), whereas, if
tc/(nσ 2 ) is not small or moderate, then the exponent for the tail probability is of the
order of −[3t/(4c)] log[2tc/(3nσ 2 )] (which is ‘Poisson’-like76 ). Bernstein’s inequality keeps
the Gaussian-like regime for small values of tc/(nσ 2 ) but replaces the Poisson regime by
the larger, hence less precise, exponential regime.

Example 8.5 (Deviation bound with fixed probability). Let us try to shed some light on the
differences between Bernstein’s inequality (i.e., the rightmost side of (96)) and Hoeffding’s
inequality (see (89)). We can first attempt to find the value of t which makes the bound on
the rightmost side of (96) exactly equal to α, i.e., we want to solve the equation

t2
 
exp − = α.
2(nσ 2 + ct/3)
This leads to the quadratic equation
2tc 1 1
t2 − log − 2nσ 2 log = 0,
3 α α
whose nonnegative solution is given by
s 
1 2
r
c2

c 1 1 1 2c 1
t = log + log + 2nσ 2 log ≤ σ 2n log + log .
3 α 9 α α α 3 α
√ √ √
where in the last inequality we used the fact q that a + b ≤ a + b for all a, b ≥ 0. Thus,
1 2c 1
Bernstein’s inequality implies that Sn ≤ σ 2n log α + 3 log α with probability at least 1−α.
Now if X1 , . . . , Xn are i.i.d. with mean zero, variance σ 2 and bounded in absolute value by
c, then this yields r
σ 1 2c 1
X̄n ≤ √ 2 log + log (97)
n α 3n α
with probability
q (w.p.) at least 1 − α; compare this the Hoeffding’s bound which yields
2 1
X̄n ≤ c n log α w.p. at least 1 − α; see (11). Note that if X̄n is normal, then X̄n will be
bounded by the first term in the right hand side of (97) w.p. at least 1 − α. Therefore the
above deviation bound agrees with the normal approximation bound except for the smaller

order term (which if of order 1/n; the leading term being of order 1/ n).
76
Note that if X has Poisson distribution with parameter a (i.e., EX = Var(X) = a) then
 
3t  2t 
P(X − a ≥ t) ≤ exp − log 1 + , t ≥ 0.
4 3a

100
Example 8.6 (When Xi ’s are i.i.d. Bernoulli). Suppose that Xi ’s are i.i.d. Bernoulli with
probability of successqp ∈ (0, 1). Then, using (97), we see that using the Bernstein’s inequal-
q
ity yields that X̄n ≤ p(1−p)
n 2 log α1 + 3n
2
log 1 holds w.p. at least 1 − α; compare this with
q α
Hoeffding’s inequality which yields X̄n ≤ n2 log α1 w.p. at least 1 − α. Note that Bernstein’s
inequality is superior here if p(1 − p) is a fairly small. In particular, if Var(X1 ) = n1 (i.e.,
q q
p ≈ n1 ), then the two upper bounds reduce to n1 2 log α1 + 3n2
log α1 and n2 log α1 respectively,
showing that Bernstein’s inequality is so much better in this case.

8.2 Talagrand’s concentration inequality

Talagrand’s concentration inequality for the supremum of the empirical process [Talagrand, 1996a]
is one of the most useful results in modern empirical process theory, and also one of the
deepest results in the theory. This inequality may be thought of as a Bennett, Prokhorov
or Bernstein inequality uniform over an infinite collection of sums of independent random
variables, i.e., for the supremum of the empirical process. As such, it constitutes an expo-
nential inequality of the best possible kind. Below we state Bousquet’s version of the upper
half of Talagrand’s inequality.

Theorem 8.7 (Talagrand’s inequality, [Talagrand, 1996a, Bousquet, 2003]). Let Xi , i =


1, . . . , n, be independent X -valued random variables. Let F be a countable family of measur-
able real-valued functions on X such that kf k∞ ≤ U < ∞ and E[f (X1 )] = . . . = E[f (Xn )] =
0, for all f ∈ F. Let
n
X n
X
Z := sup f (Xi ) or Z = sup f (Xi )
f ∈F i=1 f ∈F i=1

and let the parameters σ 2 and νn be defined as


n
1X
U 2 ≥ σ2 ≥ sup E[f 2 (Xi )] and νn := 2U E[Z] + nσ 2 .
n f ∈F i=1

Then77 , for all t ≥ 0,

−t2
  
3t 2tU 
 
νn tU   
− h1
P(Z ≥ EZ + t) ≤ e U2 νn
≤ exp − log 1 + ≤ exp (98)
4U 3νn 2νn + 2tU/3

and  √ 
P Z ≥ EZ + 2νn x + U x/3 ≤ e−x , x ≥ 0. (99)
77
This is a consequence of the following: consider the class of functions F̃ = {f /U : f ∈ F} (thus any
f˜ ∈ F̃ satisfies kf˜k∞ ≤ 1). Let Z̃ := Z/U , σ̃ 2 := σ 2 /U 2 , and ν̃n := νn /U 2 . Then,

log E[eλ(Z̃−EZ̃) ] ≤ ν̃n (eλ − 1 − λ), λ ≥ 0.

101
Notice the similarity between (98) and the Bennet, Prokhorov and Bernstein inequali-
ties in (96) in Theorem 8.4: in the case when F = {f }, with kf k∞ ≤ c, and E[f (Xi )] = 0, U
becomes c, and νn becomes nσ 2 , and the right-hand side of Talagrand’s inequality becomes
exactly the Bennet, Prokhorov and Bernstein inequalities. Clearly, Talagrand’s inequality
is essentially the best possible exponential bound for the empirical process.
Whereas the Bousquet-Talagrand upper bound for the moment generating function of
the supremum Z of an empirical process for λ ≥ 0 is best possible, there exist quite good
results for λ < 0, but these do not exactly reproduce the classical exponential bounds for
sums of independent random variables when specified to a single function. Here is the
strongest result available in this direction.

Theorem 8.8 ([Klein and Rio, 2005]). Under the same hypothesis and notation as in The-
orem 8.7, we have

Ṽn 3λ
log E[e−λ(Z̃−EZ̃) ] ≤ (e − 1 − 3λ), 0 ≤ λ < 1,
9

where Ṽn = Vn /U 2 and


n
X
Vn := 2U E[Z] + sup E[f 2 (Xi )].
f ∈F i=1

Then, for all t ≥ 0,

−t2
  
t 2tU 
 
Vn 3tU   
− h1
P(Z ≤ EZ − t) ≤ e 9U 2 Vn
≤ exp − log 1 + ≤ exp (100)
4U Vn 2Vn + 2tU

and  
P Z ≤ EZ − 2Vn x − U x ≤ e−x ,
p
x ≥ 0. (101)

Remark 8.1. In order to get concrete exponential inequalities from Theorems 8.7 and 8.8,
we need to have good estimates of EZ and supf ∈F E[f 2 (Xi )]. We have already seen many
techniques to control EZ. In particular, (85) gives such a bound.

Example 8.9 (Dvoretzky-Kiefer-Wolfowitz). A first question we may ask is whether Ta-


lagrand’s inequality recovers, up to constants, the DKW inequality. Let F be a distribu-
tion function in Rd and let Fn be the distribution function corresponding to n i.i.d. vari-
ables with distribution F . Let Z := nkFn − F k∞ We can take the envelope of the class
F := {1(−∞,x] (·) : x ∈ Rd } to be 1 (i.e., U = 1), and σ 2 = 1/4. F is VC (with V (F) = d)
and inequality (85) gives

E[Z] = n EkFn − F k∞ ≤ c1 n,

where c1 depends only on d. Here, νn ≤ 2c1 n+n/4. We have to upper-bound the probability
√ √ √
P( nkFn − F k∞ ≥ x) = P(Z ≥ nx) = P(Z − EZ ≥ nx − EZ).

102
√ √
Note that for x > 2 n, this probability is zero (as Z ≤ 2n). For x > 2c1 , t := nx − EZ ≥
√ √
n(x−c1 ) > 0, and thus we can apply the last inequality in (98). Hence, for 2 n ≥ x > 2c1 ,

√ ( nx − EZ)2
 
P( nkFn − F k∞ ≥ x) ≤ exp − √ √
2(2c1 n + n/4) + 2( nx − EZ)/3
n(x − c1 )2 x2
   
≤ exp − ≤ exp − ,
c3 n 4c3

where we have used (i) for 2 n ≥ x the denominator in the exponential term is upper

bounded by 2(2c1 n + n/4) + 4n/3 which is in turn upper bounded by c3 n (for some c3 > 0);
(ii) for x > 2c1 , (x − c1 )2 > x2 /4 (as x − c1 ≥ x − x/2 = x/2). Thus, for some constants
c2 , c3 > 0 that depend only on d, we can show that for all x > 0,
√ 2
P( nkFn − F k∞ ≥ x) ≤ c2 e−x /(4c3 ) ;

note that c2 appears above as we can just upper bound the LHS by some c2 when x ∈ (0, 2c1 ].

Example 8.10 (Data-driven inequalities). In many statistical applications, it is of impor-


tance to have data-dependent “confidence sets” for the random quantity kPn − P kF . This
quantity is a natural measure of the accuracy of the approximation of an unknown distribu-
tion by the empirical distribution Pn . However, kPn − P kF itself depends on the unknown
distribution P and is not directly available.
To obtain such data dependent bounds on kPn − P kF we have to replace the unknown
quantities EkPn − P kF , σ 2 and U by suitable estimates or bounds. Suppose for the sake of
simplicity, σ 2 and U are known, and the only problem is to estimate or bound the expec-
tation EkPn − P kF . We have discussed so far how to bound the expectation EkPn − P kF .
However, such bounds typically depend on other unknown constants and may not be sharp.
Talagrand’s inequalities (99) and (101), and symmetrization allow us to replace EkPn −P kF
by a completely data-based surrogate. In the following we give such a (finite-sample) high-
probability upper bound on kPn − P kF ; see [Giné and Nickl, 2016, Section 3.4.2] for more
on this topic.

Theorem 8.11. Let F be a countable collection of real-valued measurable functions on X


with absolute values bounded by 1/2. Let X1 , . . . , Xn be i.i.d. X with a common probability
law P . Let ε1 , . . . , en be i.i.d. Rademacher random variables independent from the sequence
{Xi } and let σ 2 ≥ supf ∈F P f 2 . Then, for all n and x ≥ 0,
n
r !
1X 2σ 2 x 70
P kPn − P kF ≥ 3 i f (Xi ) +4 + x ≤ 2e−x .
n F n 3n
i=1

Proof. Set Z := k ni=1 (f (Xi ) − P f )kF and set Z̃ := k ni=1 i f (Xi )kF . Note that Z̃
P P

is also the supremum of an empirical process: the variables are X̃i = (εi , Xi ), defined

103
on {−1, 1} × X , and the functions are f˜(, x) := f (x), for f ∈ F. Thus, Talagrand’s
inequalities apply to both Z and Z̃. Then, using the fact
q √ p √ 1
2x(nσ 2 + 2EZ̃) ≤ 2xnσ 2 + 2 xEZ̃ ≤ 2xnσ 2 + x + δEZ̃,
δ
for any δ > 0, the Klein-Rio version of Talagrand’s lower-tail inequality gives
 q   √ 1+δ

e−x ≥ P Z̃ ≤ EZ̃ − 2x(nσ 2 + 2EZ̃) − x ≥ P Z̃ ≤ (1 − δ)EZ̃ − 2xnσ 2 − x .
δ
Similarly, using (99),
 √ 3+δ

2
P Z ≥ (1 + δ)EZ + 2xnσ + x ≤ e−x .

Recall also that E[Z] ≤ 2E[Z̃]. Then, we have on the intersection of the complement of the
events in the last two inequalities, for δ = 1/5 (say),
6 √ 16 12 √ 16
Z < E[Z] + 2xnσ 2 + x ≤ E[Z̃] + 2xnσ 2 + x
5  3 5 √ 3
12 5 5√ 15 16
< Z̃ + 2xnσ 2 + x + 2xnσ 2 + x
5 4 4 2 3
√ 70
= 3Z̃ + 4 2xnσ 2 + x;
3
i.e., this inequality holds with probability 1 − 2e−x .

Note that different values of δ produce different coefficients in the above theorem.

8.3 Empirical risk minimization and concentration inequalities

Let X, X1 , . . . , Xn , . . . be i.i.d. random variables defined on a probability space and taking


values in a measurable space X with common distribution P . In this section we highlight
the usefulness of concentration inequalities, especially Talagrand’s inequality, in empirical
risk minimization (ERM); see [Koltchinskii, 2011] for a thorough study of this topic.
Let F be a class of measurable functions f : X → R. In what follows, the values of
a function f ∈ F will be interpreted as “losses” associated with certain “actions” (e.g.,
F = {f (x) ≡ f (z, y) = (y − β > z)2 : β ∈ Rd } and X = (Z, Y ) ∼ P ).
We will be interested in the problem of risk minimization:

min P f (102)
f ∈F

in the cases when the distribution P is unknown and has to be estimated based on the data
X1 , . . . , Xn . Since the empirical measure Pn is a natural estimator of P , the true risk can
be estimated by the corresponding empirical risk, and the risk minimization problem has
to be replaced by the empirical risk minimization (ERM):

min Pn f. (103)
f ∈F

104
As is probably clear by now, many important methods of statistical estimation such as
maximum likelihood and more general M -estimation are versions of ERM.

Definition 8.12. The excess risk of f ∈ F is defined as

E(f ) ≡ EP (f ) := P f − inf P h.
h∈F

Recall that we have already seen an important application of ERM in the problem of
classification in Example 7.10. Here is another important application.

Example 8.13 (Regression). Suppose that we observe X1 ≡ (Z1 , Y1 ), . . . , Xn ≡ (Zn , Yn )


i.i.d. X ≡ (Z, Y ) ∼ P on X ≡ Z × T , T ⊂ R, and the goal is to study the relationship
between Y and Z. We study regression with quadratic loss `(y, u) := (y − u)2 given a class
of of measurable functions G from Z to T ; the distribution of Z will be denoted by Π. This
problem can be thought of as a special case of ERM with

F := {(` • g)(z, y) ≡ (y − g(z))2 : g ∈ G}.

Suppose that the true regression function is g∗ (z) := E[Y |Z = z], for z ∈ Z. In this case,
the excess risk of f (z, y) = (y − g(z))2 ∈ F (for some g ∈ G) is given by78

EP (f ) = EP (` • g) = kg − g∗ k2L2 (Π) − inf kh − g∗ k2L2 (Π) . (104)


h∈G

If G is such that g∗ ∈ G then EP (` • g) = kg − g∗ k2L2 (Π) , for all g ∈ G.

Let
fˆ ≡ fˆn ∈ arg min Pn f
f ∈F

be a solution of the ERM problem (103). The function fˆn is used as an approximation
of the solution of the true risk minimization problem (102) and its excess risk EP (fˆn ) is a
natural measure of accuracy of this approximation.
It is worth pointing out that a crucial difference between ERM and classical M -
estimation, as discussed in Sections 5 and 6, is that in the analysis of ERM we do not
(usually) assume that the data generating distribution P belongs to the class of models
considered (e.g., inf h∈F P h need not be 0). Moreover, in M -estimation, typically the focus
is on recovering a parameter of interest in the model (which is expressed as the population
M -estimator) whereas in ERM the focus is mainly on deriving optimal (upper and lower)
bounds for the excess risk EP (fˆn ).
It is of interest to find tight upper bounds on the excess risk79 of fˆ that hold with
a high probability. Such bounds usually depend on certain “geometric” properties of the
78
Exercise (HW3): Show this.
79
Note that we have studied upper bounds on the excess risk in the problem of classification in Exam-
ple 7.10.

105
function class F and on various measures of its “complexity” that determine the accuracy of
approximation of the true risk P f by the empirical risk Pn f in a neighborhood of a proper
size of the minimal set of the true risk.
In the following we describe a rather general approach to derivation of such bounds in
an abstract framework of ERM. We start with some definitions.

Definition 8.14. The δ-minimal set of the risk is defined as

F(δ) := {f ∈ F : EP (f ) ≤ δ}.

The L2 -diameter of the δ-minimal set is denoted by

D(δ) ≡ DP (F; δ) := sup {P [(f1 − f2 )2 ]}1/2 .


f1 ,f2 ∈F (δ)

Suppose, for simplicity, that the infimum of the risk P f is attained at f¯ ∈ F (the
argument can be easily modified if the infimum is not attained in the class). Denote

δ̂ := EP (fˆ).

Then fˆ, f¯ ∈ F(δ̂) and Pn fˆ ≤ Pn f¯. Therefore,

δ̂ = EP (fˆ) = P (fˆ − f¯) ≤ Pn (fˆ − f¯) + (P − Pn )(fˆ − f¯)


≤ sup |(Pn − P )(f1 − f2 )| (105)
f1 ,f2 ∈F (δ̂)
≤ sup |(Pn − P )(f1 − f2 )|.
f1 ,f2 ∈F

Previously, we had used the last inequality to upper bound the excess risk in classification;
see Example 7.10. In this section we will use the implicit characterization of δ̂ in (105)
to improve our upper bound. This naturally leads us to the study of the following (local)
measure of empirical approximation:
" #
φn (δ) ≡ φn (F; δ) := E sup |(Pn − P )(f1 − f2 )| . (106)
f1 ,f2 ∈F (δ)

Idea: Imagine there exists a nonrandom upper bound

Un (δ) ≥ sup |(Pn − P )(f1 − f2 )| (107)


f1 ,f2 ∈F (δ)

that holds uniformly in δ with a high probability. Then, with the same probability, the
excess risk δ̂ = EP (fˆ) will be bounded80 by the largest solution of the inequality

δ ≤ Un (δ). (108)
80
As δ̂ ≤ supf1 ,f2 ∈F (δ̂) |(Pn − P )(f1 − f2 )| ≤ Un (δ̂), δ̂ satisfies inequality (108).

106
By solving the above inequality one can obtain δn (F) (which satisfies (108)) such that
P(EP (fˆn ) > δn (F)) is small81 . Thus, constructing an upper bound on the excess risk
essentially reduces to solving a fixed point inequality of the type δ ≤ Un (δ).
Let us describe in more detail what we mean by the above intuition. There are many
different ways to construct upper bounds on the sup-norm of empirical processes. A very
general approach is based on Talagrand’s concentration inequalities. For example, if the
functions in F take values in the interval [0, 1], then82 by (99) we have, for t > 0,83

P( sup_{f1,f2∈F(δ)} |(Pn − P)(f1 − f2)| ≥ φn(δ) + (1/√n) √(2t(2φn(δ) + D²(δ))) + t/(3n) ) ≤ e^{−t}.   (109)

Then, using the facts: (i) √(a + b) ≤ √a + √b, and (ii) 2√(ab) ≤ a/K + Kb, for any a, b, K > 0,
we have

√(2t(D²(δ) + 2φn(δ))) ≤ √(2tD²(δ)) + 2√(tφn(δ)) ≤ D(δ)√(2t) + t/√n + √n φn(δ).

Thus, from (109), for all t > 0, we have84

P( sup_{f1,f2∈F(δ)} |(Pn − P)(f1 − f2)| ≥ Ūn(δ; t) ) ≤ e^{−t},   (110)

where

Ūn(δ; t) := 2( φn(δ) + D(δ)√(t/n) + t/n ).   (111)

This observation provides a way to construct a function Un(δ) such that (107) holds with
high probability “uniformly” in δ: one first defines such a function at a discrete set of
values of δ and then extends it to all values by monotonicity. We will elaborate on this
shortly. Then, by solving the inequality (108) one can construct a bound on EP(f̂n) which
holds with “high probability” and which is often of the correct order of magnitude.
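To make the fixed-point idea concrete, here is a minimal numerical sketch (the functions phi and D below are made-up stand-ins, not quantities derived in these notes) that computes the largest solution of δ ≤ Un(δ) over a grid, with Un of the form (111):

import numpy as np

n, t = 1000, 3.0
phi = lambda d: 0.5 * np.sqrt(d / n)     # hypothetical local empirical-approximation bound phi_n(delta)
D = lambda d: 2.0 * np.sqrt(d)           # hypothetical L2-diameter D(delta) of the delta-minimal set
U = lambda d: 2.0 * (phi(d) + D(d) * np.sqrt(t / n) + t / n)   # the form of (111)

# Largest delta in (0, 1] with delta <= U(delta): scan a fine grid from above.
grid = np.linspace(1.0, 1e-6, 200000)
ok = grid[grid <= U(grid)]
delta_n = ok[0] if ok.size else 0.0
print("approximate delta_n(F):", delta_n)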

8.3.1 A formal result on excess risk in ERM

Let us now try to state a formal result in this direction. To simplify notation, assume
that the functions in F take values in [0, 1]. Let {δj }j≥0 be a decreasing sequence of
positive numbers with δ0 = 1 and let {tj }j≥0 be a sequence of positive numbers. Define
Un : (0, ∞) → R, via (111), as

Un (δ) := Ūn (δj ; tj ), for δ ∈ (δj+1 , δj ], (112)


81 We will formalize this later.
82 This assumption just simplifies a few mathematical expressions; there is nothing sacred about the interval [0, 1], we could have done it for any constant compact interval.
83 According to the notation of (99), we can take σ² = D²(δ), and then νn = 2nφn(F; δ) + nD²(δ).
84 This form of the concentration inequality is usually called Bousquet’s version of Talagrand’s inequality.

Figure 2: Plot of the piecewise constant function Un(δ), for δ ≥ δn(F), along with the value of ‖Pn − P‖_{F′(δj)}, for j = 0, 1, . . ., denoted by the ∗’s.

and Un (δ) := Un (1) for δ > 1. Denote

δn (F) := sup{δ ∈ (0, 1] : δ ≤ Un (δ)}. (113)

It is easy to check that δn (F) ≤ Un (δn (F)). Obviously, the definitions of Un and δn (F)
depend on the choice of {δj }j≥0 and {tj }j≥0 (we will choose specific values of these quan-
tities later on). We start with the following simple inequality that provides a distribution
dependent upper bound on the excess risk EP (fˆn ).

Theorem 8.15. For all δ ≥ δn(F),

P( EP(f̂n) > δ ) ≤ Σ_{j: δj ≥ δ} e^{−tj}.   (114)

Proof. It is enough to prove the result for any δ > δn (F); then the right continuity of the
distribution function of EP (fˆn ) would lead to the bound (114) for δ = δn (F).
So, fix δ > δn(F). Letting F′(δ) := {f1 − f2 : f1, f2 ∈ F(δ)}, we know that

EP(f̂) = δ̂ ≤ sup_{f∈F′(δ̂)} |(Pn − P)(f)| ≡ ‖Pn − P‖_{F′(δ̂)}.   (115)

Denote

En,j := { ‖Pn − P‖_{F′(δj)} ≤ Un(δj) }.

It follows from Bousquet’s version of Talagrand’s inequality (see (110)) that P(En,j) ≥
1 − e^{−tj}. Let

En := ∩_{j: δj ≥ δ} En,j.

Then

P(En) = 1 − P(Enᶜ) ≥ 1 − Σ_{j: δj ≥ δ} e^{−tj}.   (116)

On the event En, for all σ ≥ δ, we have

‖Pn − P‖_{F′(σ)} ≤ Un(σ).   (117)
The above holds as: (i) Un(·) is a piecewise constant function (with possible jumps only
at the δj’s), (ii) the function σ ↦ ‖Pn − P‖_{F′(σ)} is monotonically nondecreasing, and (iii)
‖Pn − P‖_{F′(δj)} ≤ Un(δj) on En, for j such that δj ≥ δ; see Figure 2.
Claim: {δ̂ ≥ δ} ⊂ Enc . We prove the claim using the method of contradiction. Thus,
suppose that the above claim does not hold. Then, the event {δ̂ ≥ δ} ∩ En is non-empty.
On the event {δ̂ ≥ δ} ∩ En we have

δ̂ ≤ kPn − P kF 0 (δ̂) ≤ Un (δ̂), (118)

where the first inequality follows from (115) and the second inequality holds via (117). This,
in particular, implies that
δ ≤ δ̂ ≤ δn (F),

where the last inequality follows from (118) and the maximality of δn (F) via (113). However
the above display contradicts the assumption that δ > δn (F). Therefore, we must have
{δ̂ ≥ δ} ⊂ Enc .
The claim now implies that P(EP(f̂n) ≥ δ) = P(δ̂ ≥ δ) ≤ P(Enᶜ) ≤ Σ_{j: δj ≥ δ} e^{−tj},
via (116), thereby completing the proof.

Although Theorem 8.15 yields a high probability bound on the excess risk of fˆn (i.e.,
EP (fˆn )), we still need to upper bound δn (F) for the result to be useful. We address this
next. We start with some notation. Given any ψ : (0, ∞) → R, define

ψ†(σ) := sup_{s≥σ} ψ(s)/s.   (119)

Note that ψ † is a nonincreasing function85 .
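As a simple illustration (an example of ours, not from the notes): if ψ(s) = c√s for some c > 0, then ψ†(σ) = sup_{s≥σ} c/√s = c/√σ, which is indeed nonincreasing in σ; for the “generalized inverse” ψ♯ introduced in (126) below this gives ψ♯(ε) = c²/ε².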


The study of ψ† is naturally motivated by the study of the function Un(δ)/δ and when it
crosses the value 1; cf. (113). As Un(δ)/δ may have multiple crossings of 1, we “regularize”
Un(δ)/δ by studying Vnt(δ) defined below (which can be thought of as a well-behaved monotone
version of Un†). For q > 1 and t > 0, denote

Vnt(σ) := 2q[ φn†(σ) + √( (t/(nσ)) (D²)†(σ) ) + t/(nσ) ],  for σ > 0.   (120)

Note that Vnt is a strictly decreasing function of σ on (0, ∞). Let

σnt ≡ σnt(F) := inf{σ > 0 : Vnt(σ) ≤ 1}.   (121)

We will show next that σnt ≥ δn(F) (for a special choice of {δj}j≥0 and {tj}j≥0) and
thus, by Theorem 8.15 and some algebraic simplification, we will obtain the following result. Given
a concrete application, our goal would be to find upper bounds on σnt; see Section 8.3.2
where we illustrate this technique for finding a high probability bound on the excess risk in
bounded regression.

85 Take σ1 < σ2. Then ψ†(σ1) = sup_{s≥σ1} ψ(s)/s ≥ sup_{s≥σ2} ψ(s)/s = ψ†(σ2).

Theorem 8.16 (High probability bound on the excess risk of the ERM). For all t > 0,

P( EP(f̂n) > σnt ) ≤ Cq e^{−t},   (122)

where Cq := (q/(q−1)) ∨ e.

Proof. Fix t > 0 and let σ > σnt. We will show that P( EP(f̂n) > σ ) ≤ Cq e^{−t}. Then, by
taking a limit as σ ↓ σnt, we obtain (122).

Define, for j ≥ 0,

δj := q^{−j}  and  tj := t δj/σ.
Recall the definitions of Un (δ) and δn (F) (in (112) and (113)) using the above choice of the
sequences {δj}j≥0 and {tj}j≥0. Then, for all δ ≥ σ, using (112),86

Un(δ)/δ = 2( φn(δj)/δ + (D(δj)/δ)√(tδj/(σn)) + tδj/(σnδ) )   if δ ∈ (δj+1, δj]
        ≤ 2q( φn(δj)/δj + √( t D²(δj)/(δj σn) ) + t/(σn) )    as δ > δj+1 = δj/q ⇒ 1/δ < q/δj
        ≤ 2q( sup_{s≥σ} φn(s)/s + √( (t/(σn)) sup_{s≥σ} D²(s)/s ) + t/(σn) )   as δj ≥ δ ≥ σ
        = 2q( φn†(σ) + √( (t/(σn)) (D²)†(σ) ) + t/(σn) ) = Vnt(σ).

Since σ > σnt and the function Vnt is strictly decreasing, we have Vnt(σ) < Vnt(σnt) ≤ 1, and
hence, for all δ ≥ σ,

Un(δ)/δ ≤ Vnt(σ) < 1.

Therefore, δ > δn(F) = sup{s ∈ (0, 1] : s ≤ Un(s)} (recall (113)), and thus, σ ≥ δn(F). Now, from
Theorem 8.15 it follows that

P( EP(f̂n) > σ ) ≤ Σ_{j: δj ≥ σ} e^{−tj} ≤ Cq e^{−t},

where the last step follows from some algebra87.


86 For δ > δ0 ≡ 1, the following sequence of displays also holds with j = 0.
87 Exercise (HW3): Show this. Hint: we can write
Σ_{j: δj ≥ σ} e^{−tj} = Σ_{j: δj ≥ σ} e^{−tδj/σ} ≤ Σ_{j≥0} e^{−tq^j} = · · · ≤ (q/(q−1)) e^{−t}, for t ≥ 1.
8.3.2 Excess risk in bounded regression

Recall the regression setting in Example 8.13. Given a function g : Z → T , the quantity
(` • g)(z, y) := `(y, g(z)) is interpreted as the loss suffered when g(z) is used to predict y.
The problem of optimal prediction can be viewed as a risk minimization:

E[`(Y, g(Z))] =: P (` • g)

over g : Z → T . We start with the regression problem with bounded response and with
quadratic loss. To be specific, assume that Y takes values in T = [0, 1] and `(y, u) := (y−u)2 .
Suppose that we are given a class of measurable real-valued functions G on Z. We denote
by F := {` • g : g ∈ G}. Suppose that the true regression function is g∗ (z) := E[Y |Z = z],
for z ∈ Z, which is not assumed to be in G. Recall that the excess risk EP (` • g) in this
problem is given by (104).
In order to apply Theorem 8.16 to find a high probability bound on the excess risk of
the ERM fˆ ≡ ` • ĝ (see (103)) in this problem, which is determined by σnt via (121), we have
to find upper bounds for Vnt (·) (which in turn depends on the functions φ†n and (D2 )† ).
As a first step we relate the excess risk of any f ≡ ℓ • g ∈ F to g ∈ G. The following
lemma provides an easy way to bound the excess risk of f from below in the case of a convex
class G, an assumption we make in the sequel.

Lemma 8.17. If G is a convex class of functions, then

2EP(ℓ • g) ≥ ‖g − ḡ‖²_{L2(Π)},

where ḡ := argmin_{g∈G} ‖g − g∗‖²_{L2(Π)} is assumed to exist.

Below we make some observations that will be crucial to find σnt .

1. It follows from Lemma 8.17 that

F(δ) = {f ∈ F : EP(f) ≤ δ} ⊂ {ℓ • g : g ∈ G, ‖g − ḡ‖²_{L2(Π)} ≤ 2δ}.   (123)

2. For any two functions g1, g2 ∈ G and all z ∈ Z, y ∈ [0, 1], we have

|(ℓ • g1)(z, y) − (ℓ • g2)(z, y)| = |(y − g1(z))² − (y − g2(z))²|
  = |g1(z) − g2(z)| |2y − g1(z) − g2(z)| ≤ 2 |g1(z) − g2(z)|,

which implies

P[(ℓ • g1 − ℓ • g2)²] ≤ 4‖g1 − g2‖²_{L2(Π)}.

Recalling that D(δ) := sup_{f1,f2∈F(δ)} {P[(f1 − f2)²]}^{1/2}, we have

D(δ) ≤ 2 sup{ ‖g1 − g2‖_{L2(Π)} : gk ∈ G, ‖gk − ḡ‖²_{L2(Π)} ≤ 2δ for k = 1, 2 } ≤ 2(2√(2δ)),   (124)

where the last step follows from the triangle inequality: ‖g1 − g2‖_{L2(Π)} ≤ ‖g1 − ḡ‖_{L2(Π)} +
‖g2 − ḡ‖_{L2(Π)}. Hence, by (124),

√((D²)†(σ)) = √( sup_{δ≥σ} D²(δ)/δ ) ≤ 4√2.

3. By the symmetrization inequality (recall that we use ε1, . . . , εn to denote i.i.d. Rademacher
variables independent of the observed data), and letting F′(δ) := {f1 − f2 : f1, f2 ∈ F(δ)},
and using (123),

φn(δ) = E‖Pn − P‖_{F′(δ)} ≤ 2 E[ sup_{f∈F′(δ)} | (1/n) Σ_{i=1}^n εi f(Xi) | ]
  ≤ 2 E[ sup_{gk∈G: ‖gk−ḡ‖²_{L2(Π)} ≤ 2δ} | (1/n) Σ_{i=1}^n εi (ℓ • g1 − ℓ • g2)(Xi) | ]
  ≤ 4 E[ sup_{g∈G: ‖g−ḡ‖²_{L2(Π)} ≤ 2δ} | (1/n) Σ_{i=1}^n εi (ℓ • g − ℓ • ḡ)(Xi) | ].

Since ℓ(y, ·) is Lipschitz with constant 2 on the interval [0, 1] one can use the contraction
inequality88 to get

φn(δ) ≤ 8 E[ sup_{g∈G: ‖g−ḡ‖²_{L2(Π)} ≤ 2δ} | (1/n) Σ_{i=1}^n εi (g − ḡ)(Zi) | ] =: ψn(δ).

As a result, we get (recall (119))

φn†(σ) ≤ ψn†(σ).

The following result is now a corollary of Theorem 8.16.

Theorem 8.18. Let G be a convex class of functions from Z into [0, 1] and let ĝn denote
the LSE of the regression function, i.e.,

ĝn := argmin_{g∈G} (1/n) Σ_{i=1}^n {Yi − g(Zi)}².

Then, there exists a constant K > 0 such that for all t > 0,

P( ‖ĝn − g∗‖²_{L2(Π)} ≥ inf_{g∈G} ‖g − g∗‖²_{L2(Π)} + ψn♯(1/(4q)) + K t/n ) ≤ Cq e^{−t},   (125)
88 Ledoux-Talagrand contraction inequality (Theorem 4.12 of [Ledoux and Talagrand, 1991]): If ϕi : R → R satisfies |ϕi(a) − ϕi(b)| ≤ L|a − b| for all a, b ∈ R, then

E[ sup_{h∈H} | (1/n) Σ_{i=1}^n εi ϕi(h(xi)) | ] ≤ L E[ sup_{h∈H} | (1/n) Σ_{i=1}^n εi h(xi) | ].

In the above application we take ϕi(u) = (Yi − u)² for u ∈ [0, 1].
where for any ψ : (0, ∞) → R, ψ♯ is defined as89

ψ♯(ε) := inf{ σ > 0 : ψ†(σ) ≤ ε }.   (126)

Proof. Note that in this case, by (104), EP(ĝn) = ‖ĝn − g∗‖²_{L2(Π)} − inf_{g∈G} ‖g − g∗‖²_{L2(Π)}. To
use Theorem 8.16 we need to upper bound the quantity σnt defined in (121). Recall the
definition of Vnt(σ) from (120). By the above observations 1-3, we have

Vnt(σ) ≤ 2q[ ψn†(σ) + 4√2 √(t/(nσ)) + t/(nσ) ].   (127)

We are only left to show that σnt := inf{σ : Vnt(σ) ≤ 1} ≤ ψn♯(1/(4q)) + K t/n, for a sufficiently
large K, which will be implied if we can show that Vnt( ψn♯(1/(4q)) + K t/n ) ≤ 1 (since then
ψn♯(1/(4q)) + K t/n ∈ {σ : Vnt(σ) ≤ 1} and the result follows from the minimality of σnt). Note
that, by the nonincreasing nature of each of the terms on the right hand side of (127),

Vnt( ψn♯(1/(4q)) + K t/n ) ≤ 2q[ ψn†(ψn♯(1/(4q))) + 4√2 √( t/(n·(Kt/n)) ) + t/(n·(Kt/n)) ]
  ≤ 2q[ 1/(4q) + 4√2/√K + 1/K ] < 1,

where K > 0 is chosen (depending on q) so that 4√2/√K + 1/K < 1/(4q) (note that ψn†(ψn♯(1/(4q))) ≤ 1/(4q)).

Example 8.19 (Finite dimensional classes). Suppose that L ⊂ L2(Π) is a finite dimensional
linear space with dim(L) = d < ∞, and let G ⊂ L be a convex class of functions taking
values in a bounded interval (for simplicity, [0, 1]). We would like to show that

P( ‖ĝn − g∗‖²_{L2(Π)} ≥ inf_{g∈G} ‖g − g∗‖²_{L2(Π)} + K(d/n + t/n) ) ≤ C e^{−t}   (128)

with some constants C, K > 0.
It can be shown90 that

ψn(δ) ≤ c √(dδ/n)

with some constant c > 0. Hence,

ψn†(σ) = sup_{δ≥σ} ψn(δ)/δ ≤ sup_{δ≥σ} c √(d/(δn)) = c √(d/(σn)).
89 Note that ψ♯ can be thought of as the generalized inverse of ψ†. Thus, under the assumption that ψ† is right-continuous, ψ†(σ) ≤ ε if and only if σ ≥ ψ♯(ε) (Exercise (HW3): Show this). Further note that with this notation σnt = Vnt,♯(1).
90 Exercise (HW3): Suppose that L is a finite dimensional subspace of L2(P) with dim(L) = d. Then

E[ sup_{f∈L: ‖f‖_{L2(P)} ≤ r} | (1/n) Σ_{i=1}^n εi f(Xi) | ] ≤ r √(d/n).
As ψn†(σ) ≤ ε implies σ ≥ ψn♯(ε), taking σ := 16c²q² d/n we see that

ψn†(16c²q² d/n) ≤ c √( d/(16c²q² d) ) = 1/(4q)  ⇒  ψn♯(1/(4q)) ≤ 16c²q² d/n,

and Theorem 8.18 then implies (128) with K := max{16c²q², K₀}, where K₀ is the constant K
appearing in Theorem 8.18 and C ≡ Cq is taken as in Theorem 8.16.
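As a quick numerical sanity check of the d/n rate in (128), here is a small simulation sketch; the design (Z uniform on [0,1]^d, bounded noise, well-specified linear g∗) is an illustrative assumption of ours and is not part of the notes.

import numpy as np

rng = np.random.default_rng(0)

def mean_excess_risk(n, d, reps=200):
    # Monte Carlo estimate of E||g_hat - g*||^2_{L2(Pi)} for least squares over a
    # d-dimensional linear class; the model is well-specified, so inf_g ||g - g*||^2 = 0.
    theta_star = np.full(d, 0.5 / d)
    Sigma = np.full((d, d), 0.25) + np.diag(np.full(d, 1.0 / 3.0 - 0.25))  # E[Z Z'] for Z ~ Unif[0,1]^d
    vals = []
    for _ in range(reps):
        Z = rng.uniform(size=(n, d))
        Y = Z @ theta_star + rng.uniform(-0.2, 0.2, size=n)        # bounded responses
        theta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]
        diff = theta_hat - theta_star
        vals.append(diff @ Sigma @ diff)                           # ||g_hat - g*||^2_{L2(Pi)}
    return np.mean(vals)

for n in [200, 800, 3200]:
    print(n, mean_excess_risk(n, d=5))    # should shrink roughly like 1/n for fixed d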

Exercise (HW3): Consider the setting of Example 8.19. Instead of using the refined analysis
using (105) (and Talagrand’s concentration inequality) as illustrated in this section, use the
bounded differences inequality to get a crude upper bound on the excess risk of the ERM
in this problem. Compare the obtained high probability bound to (128).
Exercise (HW3) [VC-subgraph classes]: Suppose that G is a convex VC-subgraph class of
functions g : Z → [0, 1] of VC-dimension V. Then, show that the function ψn(δ) can be
upper bounded by:

ψn(δ) ≤ c[ √( (Vδ/n) log(1/δ) ) ∨ (V/n) log(1/δ) ].

Show that ψn♯(ε) ≤ (cV/(nε²)) log(nε²/V). Finally, use Theorem 8.18 to obtain a high probability
bound analogous to (125).


Exercise (HW3) [Nonparametric classes]: In the case when the metric entropy of the class
G (random, uniform, bracketing, etc.; e.g., if log N(ε, G, L2(Pn)) ≤ A ε^{−2ρ}) is bounded by
O(ε^{−2ρ}) for some ρ ∈ (0, 1) (assuming that the envelope of G is 1), we typically have
ψn♯(ε) ≤ O(n^{−1/(1+ρ)}). Finally, use Theorem 8.18 to obtain a high probability bound analogous
to (125).

8.4 Kernel density estimation

Let X, X1, X2, . . . , Xn be i.i.d. P on Rd, d ≥ 1. Suppose P has density p with respect to
the Lebesgue measure on Rd, and ‖p‖∞ < ∞. Let K : Rd → R be any measurable function
that integrates to one, i.e.,

∫_{Rd} K(y) dy = 1,

and ‖K‖∞ < ∞. Then the kernel density estimator (KDE) of p is given by

p̂n,h(y) = (1/(nh^d)) Σ_{i=1}^n K((y − Xi)/h) = h^{−d} Pn[ K((y − ·)/h) ],  for y ∈ Rd.

Here h is called the smoothing bandwidth. Choosing a suitable bandwidth sequence hn → 0
and assuming that the density p is continuous, one can obtain a strongly consistent estimator
p̂n,h(y) ≡ p̂n,hn(y) of p(y), for any y ∈ Rd.
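For concreteness, a minimal computational sketch of p̂n,h (the Gaussian kernel, d = 1 and the simulated data are illustrative choices, not prescribed by the notes):

import numpy as np

rng = np.random.default_rng(1)
n, h = 500, 0.2
X = rng.normal(size=n)                         # i.i.d. sample; true density p = N(0,1)

def kde(y, X, h):
    # p_hat_{n,h}(y) = (1/(n h^d)) * sum_i K((y - X_i)/h) with the Gaussian kernel, d = 1
    u = (np.asarray(y, dtype=float)[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

ys = np.linspace(-3, 3, 7)
print(np.round(kde(ys, X, h), 3))              # compare with the N(0,1) density at these points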
It is natural to write the difference p̂n,h(y) − p(y) as the sum of a random term and a
deterministic term:

p̂n,h(y) − p(y) = {p̂n,h(y) − ph(y)} + {ph(y) − p(y)},

where

ph(y) := h^{−d} P[ K((y − X)/h) ] = h^{−d} ∫_{Rd} K((y − x)/h) p(x) dx = ∫_{Rd} K(u) p(y − hu) du

is a smoothed version of p. Convergence to zero of the second term can be argued based
only on smoothness assumptions on p: if p is uniformly continuous, then it is easily seen
that

sup_{h≤bn} sup_{y∈Rd} |ph(y) − p(y)| → 0

for any sequence bn → 0. On the other hand, the first term is just

h^{−d} (Pn − P)[ K((y − X)/h) ].   (129)
For a fixed y ∈ Rd, it is easy to study the properties of the above display using the CLT,
as we are dealing with a sum of independent random variables h^{−d} K((y − Xi)/h), i = 1, . . . , n.
However, it is natural to ask whether the KDE pbn,hn converges to p uniformly (a.s.) for a
sequence of bandwidths hn → 0 and, if so, what is the rate of convergence in that case? We
investigate this question using tools from empirical processes.
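Before developing the theory, here is a small Monte Carlo sketch of the uniform deviation sup_y |p̂n,h(y) − ph(y)| in the same illustrative Gaussian setup as above (for the Gaussian kernel and N(0,1) data, ph is the N(0, 1 + h²) density); Theorem 8.21 below says that this deviation, multiplied by √(nh^d/log(1/h)), remains bounded.

import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(-4, 4, 801)

def sup_dev(n, h):
    X = rng.normal(size=n)
    u = (grid[:, None] - X[None, :]) / h
    p_hat = (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h
    p_h = np.exp(-0.5 * grid ** 2 / (1 + h ** 2)) / np.sqrt(2 * np.pi * (1 + h ** 2))  # smoothed density
    return np.max(np.abs(p_hat - p_h))

for n in [1000, 4000, 16000]:
    h = n ** (-0.2)                                        # an illustrative bandwidth sequence h_n -> 0
    dev = sup_dev(n, h)
    print(n, round(dev, 4), round(dev * np.sqrt(n * h / np.log(1.0 / h)), 3))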
The KDE p̂n,h(·) is indexed by the bandwidth h, and it is natural to consider p̂n,h as
a process indexed by both y ∈ Rd and h > 0. This leads to studying the class of functions

F := { x ↦ K((y − x)/h) : y ∈ Rd, h > 0 }.

It is fairly easy to give conditions on the kernel K so that the class F defined above satisfies

N(ε‖K‖∞, F, L2(Q)) ≤ (A/ε)^V,  for every probability measure Q and 0 < ε < 1,   (130)

for some constants V ≥ 2 and A ≥ e²; see e.g., Lemma 7.22.91 While it follows immediately
from the GC theorem that

sup_{h>0, y∈Rd} | (Pn − P)[ K((y − X)/h) ] | →a.s. 0,

this does not suffice in view of the factor h^{−d} in (129). In fact, we need a rate of
convergence for sup_{h>0, y∈Rd} | (Pn − P)[ K((y − X)/h) ] |. The following theorem gives such a
result92.
91 For instance, it is satisfied for general d ≥ 1 whenever K(x) = φ(q(x)), with q(x) being a polynomial in d variables and φ being a real-valued right continuous function of bounded variation.
92 To study variable bandwidth kernel estimators [Einmahl and Mason, 2005] derived the following result, which can be proved with some extra effort using ideas from the proof of Theorem 8.21.

Theorem 8.20. For any c > 0, with probability 1,

lim sup_{n→∞} sup_{c log n/n ≤ h ≤ 1} [ √(nh) ‖p̂n,h(·) − ph(·)‖∞ / √( log(1/h) ∨ log log n ) ] =: K(c) < ∞.

Theorem 8.20 implies that for any sequences 0 < an < bn ≤ 1 satisfying bn → 0 and n an/log n → ∞, with probability 1,

sup_{an ≤ h ≤ bn} ‖p̂n,h − ph‖∞ = O( √( (log(1/an) ∨ log log n)/(n an) ) ),

which in turn implies that lim_{n→∞} sup_{an ≤ h ≤ bn} ‖p̂n,h − ph‖∞ = 0 a.s.

Theorem 8.21. Suppose that hn ↓ 0, nhn^d/|log hn| → ∞, |log hn|/log log n → ∞ and
hn^d ≤ č h_{2n}^d for some č > 0. Then

lim sup_{n→∞} [ √(nhn^d) ‖p̂n,hn(·) − phn(·)‖∞ / √(log hn^{−1}) ] = C  a.s.,

where C < ∞ is a constant that depends only on the VC characteristics of F.

Proof. We will use the following result:

Lemma 8.22 ([de la Peña and Giné, 1999, Theorem 1.1.5]). If Xi, i ∈ N, are i.i.d. X-valued
random variables and F a class of measurable functions, then

P( max_{1≤j≤n} ‖ Σ_{i=1}^j (f(Xi) − Pf) ‖_F > t ) ≤ 9 P( ‖ Σ_{i=1}^n (f(Xi) − Pf) ‖_F > t/30 ).

For k ≥ 0, let n_k := 2^k. Let λ > 0, to be chosen later. The monotonicity of {hn}
(hence of hn^d log hn^{−1} once hn is small enough) and Lemma 8.22 imply (for k ≥ 1)

P( max_{n_{k−1} < n ≤ n_k} √( nhn^d / log hn^{−1} ) ‖p̂n,hn − phn‖∞ > λ )

  = P( max_{n_{k−1} < n ≤ n_k} (nhn^d log hn^{−1})^{−1/2} sup_{y∈Rd} | Σ_{i=1}^n [ K((y − Xi)/hn) − E K((y − X)/hn) ] | > λ )

  ≤ P( (n_{k−1} h_{n_k}^d log h_{n_k}^{−1})^{−1/2} × max_{1≤n≤n_k} sup_{y∈Rd, h_{n_k} ≤ h ≤ h_{n_{k−1}}} | Σ_{i=1}^n [ K((y − Xi)/h) − E K((y − X)/h) ] | > λ )

  ≤ 9 P( (n_{k−1} h_{n_k}^d log h_{n_k}^{−1})^{−1/2} × sup_{y∈Rd, h_{n_k} ≤ h ≤ h_{n_{k−1}}} | Σ_{i=1}^{n_k} [ K((y − Xi)/h) − E K((y − X)/h) ] | > λ/30 ).   (131)

We will study the subclasses

Fk := { K((y − ·)/h) : h_{n_k} ≤ h ≤ h_{n_{k−1}}, y ∈ Rd }.

As

E[ K²((y − X)/h) ] = ∫_{Rd} K²((y − x)/h) p(x) dx = h^d ∫_{Rd} K²(u) p(y − uh) du ≤ h^d ‖p‖∞ ‖K‖₂²,

for the class Fk we can take

Uk := 2‖K‖∞,  and  σk² := h_{n_{k−1}}^d ‖p‖∞ ‖K‖₂².

Since h_{n_k} ↓ 0 and nhn^d / log hn^{−1} → ∞, there exists k₀ < ∞ such that for all k ≥ k₀,

σk < Uk/2  and  √(n_k) σk ≥ Uk √( V log(A Uk/σk) ).  (check!)   (132)
Letting Zk := ‖ Σ_{i=1}^{n_k} (f(Xi) − Pf) ‖_{Fk}, we can bound E[Zk] by using Theorem 7.13
(see (84)), for k ≥ k₀, to obtain

E[Zk] = E‖ Σ_{i=1}^{n_k} (f(Xi) − Pf) ‖_{Fk} ≤ L σk √( n_k log(A Uk/σk) )

for a suitable constant L > 0. Thus, using (132),

νk := n_k σk² + 2 Uk E[Zk] ≤ c̃ n_k σk²

for a constant c̃ > 1 and k ≥ k₀. Choosing x = c log(A Uk/σk) in (99), for some c > 0, we
see that

E[Zk] + √(2 νk x) + Uk x/3 ≤ σk √( n_k log(A Uk/σk) ) (L + √(2cc̃)) + c Uk log(A Uk/σk)/3
  ≤ C σk √( n_k log(A Uk/σk) ),

for some constant C > 0, where we have again used (132). Therefore, by Theorem 8.7,

P( Zk ≥ C σk √( n_k log(A Uk/σk) ) ) ≤ P( Zk ≥ E[Zk] + √(2 νk x) + Uk x/3 ) ≤ e^{−c log(A Uk/σk)}.

Notice that

30 C σk √( n_k log(A Uk/σk) ) / √( n_{k−1} h_{n_k}^d log h_{n_k}^{−1} ) ≤ λ  (check!)

for some λ > 0 not depending on k. Therefore, choosing this λ, the probability on the right
hand-side of (131) can be bounded as

P( Zk / √( n_{k−1} h_{n_k}^d log h_{n_k}^{−1} ) > λ/30 ) ≤ P( Zk ≥ C σk √( n_k log(A Uk/σk) ) ) ≤ e^{−c log(A Uk/σk)}.

Since

Σ_{k=k₀}^∞ e^{−c log(A Uk/σk)} = c₁ Σ_{k=k₀}^∞ h_{n_{k−1}}^{cd/2} < ∞

for a constant c₁ > 0 (the last sum is finite because |log hn|/log log n → ∞ implies that, for all k
large enough, h_{n_{k−1}}^{cd/2} ≤ (log n_{k−1})^{−2} = ((k − 1) log 2)^{−2}), we get, summarizing,

Σ_{k=1}^∞ P( max_{n_{k−1} < n ≤ n_k} √( nhn^d / log hn^{−1} ) ‖p̂n,hn − phn‖∞ > λ ) < ∞.
Let Yn := √( nhn^d / log hn^{−1} ) ‖p̂n,hn − phn‖∞. Letting Y := lim sup_{n→∞} Yn, and using the Borel-Cantelli
lemma we can see that P(Y > λ) = 0. This yields the desired result using the zero-one
law93.
law93 .

93
For a fixed λ ≥ 0, define the event A := {lim supn→∞ Yn > λ}. As this is a tail event, by the
zero-one law it has probability 0 or 1. We thus have that for each λ, P(Y > λ) ∈ {0, 1}. Defining
c := sup{λ : P(Y > λ) = 1}, we get that Y = c a.s. Note that c < ∞ as there exists λ > 0 such that
P(Y > λ) = 0, by the proof of Theorem 8.21.

9 Review of weak convergence in complete separable metric
spaces

In this chapter (and the next few) we will study weak convergence of stochastic processes.
Suppose that U1 , . . . , Un are i.i.d. Uniform(0, 1) random variables (i.e., U1 has distribution
function G(x) = x, for x ∈ [0, 1]) and let Gn denote the empirical distribution function of
U1 , . . . , Un . In this chapter, we will try to make sense of the (informal) statement:
√n(Gn − G) →d 𝔾,  as n → ∞,

where 𝔾 is a standard Brownian bridge process on [0, 1].


This will give us the right background and understanding to appreciate the Donsker
theorem for the empirical process (indexed by an arbitrary class of functions). As we have
seen in Chapter 1, weak convergence coupled with the continuous mapping theorem can
help us study (in a unified fashion) the limiting distribution of many random variables that
can be expressed as functionals of the empirical process.
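To see what the display is aiming at, here is a small simulation sketch (sample size and number of replications are arbitrary illustrative choices): the covariance of √n(Gn(s) − s) and √n(Gn(t) − t) should be close to the Brownian bridge covariance min(s, t) − st.

import numpy as np

rng = np.random.default_rng(3)
n, reps = 2000, 5000
s, t = 0.3, 0.7

U = rng.uniform(size=(reps, n))
Zs = np.sqrt(n) * ((U <= s).mean(axis=1) - s)      # sqrt(n)(Gn(s) - s) over many replications
Zt = np.sqrt(n) * ((U <= t).mean(axis=1) - t)

print("empirical covariance:", round(np.cov(Zs, Zt)[0, 1], 4))
print("Brownian bridge covariance min(s,t) - s*t:", min(s, t) - s * t)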

9.1 Weak convergence of random vectors in Rd

Before we study weak convergence of stochastic processes let us briefly recall the notion of
weak convergence of random vectors.
Let (Ω, A, P) be a probability space and let X be a random vector (or a measurable
map) in Rk (k ≥ 1) defined on the probability space (Ω, A, P)94 .

Definition 9.1. Let {Xn }n≥1 be a sequence of random vectors in Rk . Let Xn have distri-
bution function95 Fn and let Pn denote the distribution96 of Xn , i.e., Xn ∼ Pn . We say
that {Xn } converges in distribution (or weakly or in law) to a random vector X ∼ P (or to
the distribution P ) with distribution function F if

Fn (x) → F (x), as n → ∞, (133)

for every x ∈ Rk at which the limiting distribution function F(·) is continuous. This is
usually denoted by Xn →d X or Pn →d P.

As the name suggests, weak convergence only depends on the induced laws of the
vectors and not on the probability spaces on which they are defined.
The whole point of weak convergence is to approximate the probabilities of events
related to Xn (i.e., Pn ), for n large, by that of the limiting random vector X (i.e., P ).
94
More formally, X : (Ω, A) → (Rk , Bk ), where Bk is the Borel σ-field in Rd , is a map such that for any
B ⊂ Bk , X −1 (B) := {ω ∈ Ω : X(ω) ∈ B} ∈ A.
95
Thus, Fn (x) = P(Xn ≤ x), for all x ∈ Rd .
96
Thus, Pn : Bk → [0, 1] is a map such that Pn (B) := P(X ∈ B) = P(X −1 (B)) for all B ∈ Bk .

The multivariate central limit theorem (CLT) is a classic application of convergence in
distribution and it highlights the usefulness of such a concept.
The following theorem, referred to as the Portmanteau result, gives a number of equiv-
alent descriptions of weak convergence (most of these characterizations are only useful in
proofs). Indeed, the characterization (v) (of the following result) makes the intuitive notion
of convergence in distribution rigorous.

Theorem 9.2 (Portmanteau theorem). Let X, X1 , . . . , Xn be random vectors in Rk . Let


Xn ∼ Pn , for n ≥ 1, and X ∼ P . Then, the following are equivalent:
(i) Xn →d X or Pn →d P, i.e., (133) holds;

(ii) ∫ f dPn =: Pn f → P f := ∫ f dP, as n → ∞, for every f : Rk → R which is continuous
and bounded.

(iii) lim inf n→∞ Pn (G) ≥ P (G) for all open G ⊂ Rk ;

(iv) lim supn→∞ Pn (F ) ≤ P (F ) for all closed F ⊂ Rk ;

(v) Pn (A) → P (A) for all P -continuity sets A (i.e., P (∂A) = 0, where ∂A denotes the
boundary of A).

Proof. See Lemma 2.2 of [van der Vaart, 1998].

Quite often we are interested in the distribution of a one-dimensional (or m-dimensional,


where m ≤ k usually) function of Xn . The following result, the continuous mapping theorem
is extremely useful and, in some sense, justifies the study of weak convergence.

Theorem 9.3 (Continuous mapping). Suppose that g : Rk → Rm is continuous at every
point of a set C such that P(X ∈ C) = 1. If Xn →d X then g(Xn) →d g(X).
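A tiny numerical illustration of this result (an example of ours, not from the notes): by the CLT, √n X̄n →d N(0, 1) for centered unit-variance data, so taking g(x) = x² the theorem gives n X̄n² →d χ²₁.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 500, 20000
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(reps, n))   # mean 0, variance 1
T = n * X.mean(axis=1) ** 2                                     # g(sqrt(n) Xbar) with g(x) = x^2

# The chi-square(1) quantiles at these levels are about 0.455, 1.323, 2.706, 3.841.
print(np.round(np.quantile(T, [0.50, 0.75, 0.90, 0.95]), 3))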

9.2 Weak convergence in metric spaces and the continuous mapping the-
orem

As before, let (Ω, A, P) be a probability space and let (T, T ) be a measurable (metric)
space. A map X : (Ω, A, P) → (T, T ) is a random element if X⁻¹(B) ∈ A for all B ∈ T . Quite
often, when we talk about a random element X, we take T = B(T ), the Borel σ-field (on
the set T ), the smallest σ-field containing all open sets.
Question: How do we define the notion of weak convergence in this setup?
Although it is not straight forward to define weak convergence as in Definition 9.1
(for random vectors through their distribution functions), the equivalent definition (ii) in
Theorem 9.2 can be extended easily.
Let Cb (T ; T ) denote the set of all bounded, continuous, T /B(R)-measurable, real-valued
functions on T .

Definition 9.4. A sequence {Xn}n≥1 of random elements of T (defined on (Ω, A, P)) converges
in distribution to a random element X, written Xn →d X, if and only if

Ef(Xn) → Ef(X)  for each f ∈ Cb(T; T ).   (134)

Similarly, a sequence {Pn}n≥1 of probability measures on T converges weakly to P, written
Pn →d P, if and only if

Pn f → P f  for each f ∈ Cb(T; T ),   (135)

where Pn f := ∫ f dPn and P f := ∫ f dP.

Note that definition (135) of weak convergence is slightly more general than (134) as
it allows the Xn ’s to be defined on different probability spaces (and we let Xn ∼ Pn , for
n ≥ 1, and X ∼ P ). As in the previous subsection, we have the following equivalent
characterizations of weak convergence.

9.2.1 When T = B(T ), the Borel σ-field of T

Suppose that T is a metric space with metric d(·, ·) and let T be the Borel σ-field of T
generated by the metric d(·, ·). Then, we have the following equivalent characterizations of
weak convergence.

Theorem 9.5. (Portmanteau theorem) Let X, X1 , . . . , Xn be random elements taking val-


ues in a metric space (T, B(T )). Let Xn ∼ Pn , for n ≥ 1, and X ∼ P . Then, the following
are equivalent:

(i) Xn →d X or Pn →d P;

(ii) lim inf n→∞ Pn (G) ≥ P (G) for all open G ⊂ T ;

(iii) lim supn→∞ Pn (F ) ≤ P (F ) for all closed F ⊂ T ;

(iv) Pn (A) → P (A) for all P -continuity sets A (i.e., P (∂A) = 0, where ∂A denotes the
boundary of A).

Proof. See Theorem 2.1 of [Billingsley, 1999] (or 3.25 of [Kallenberg, 2002]).

Such an abstract definition of weak convergence can only be useful if we have a contin-
uous mapping theorem (as before). The following result shows that essentially the “vanilla’
version of the continuous mapping theorem is true, and is a trivial consequence of the
definition of weak convergence.

Theorem 9.6 (Continuous mapping theorem). Let (T, B(T)) be a measurable (metric)
space. Suppose that {Xn}n≥1 is a sequence of random elements of T converging in distribution
to a random element X, i.e., Xn →d X. Let H : (T, B(T)) → (S, B(S)) be a continuous
function, where (S, B(S)) is another measurable (metric) space. Then H(Xn) →d H(X) as
random elements in S.

Proof. Let f ∈ Cb(S, B(S)). Then f ◦ H ∈ Cb(T, B(T)) and thus from the definition of weak
convergence of Xn,

E[(f ◦ H)(Xn)] → E[(f ◦ H)(X)].

As the above convergence holds for all f ∈ Cb(S, B(S)), again using the definition of weak
convergence, we can say that H(Xn) →d H(X).

Exercise (HW4): Show that for any sequence of points yn ∈ S, n ≥ 1 (where S is a metric
space with Borel σ-field B(S)), yn → y0 if and only if δ_{yn} →d δ_{y0}, where δ· is the Dirac delta
measure.
As we will see later, requiring that H be continuous on the entire set T is asking for
too much. The following result, which can be thought of as an extension of the above result
(and is similar in flavor to Theorem 9.3), requires that H be continuous on the set where the
limiting random element X lives (or on the support of the limiting probability measure).

Theorem 9.7. Let H : (T, B(T)) → (S, B(S)) be a measurable mapping. Write C for the
set of points in T at which H is continuous. If Xn →d X for which P(X ∈ C) = 1, then
H(Xn) →d H(X).

Proof. Let Xn ∼ Pn and X ∼ P . Fix ψ ∈ Cb (S; B(S)). Then the measurable, real-valued,
bounded function h := ψ ◦ H is continuous at all points in C. We will have to show that
Pn h → P h as n → ∞.
Consider any increasing sequence {fi } of bounded, continuous functions for which
fi ≤ h everywhere and fi ↑ h at each point of C. Accept for the moment that such a
sequence exists. Then, weak convergence of Pn to P implies that

Pn fi → P fi for each fixed fi .

Thus,
lim inf Pn h ≥ lim inf Pn fi = P fi for each fixed fi .
n→∞ n→∞

Invoking monotone convergence as i → ∞ on the right-hand side yields

lim inf Pn h ≥ lim P fi = P h.


n→∞ i→∞

By substituting −h for h and going through the above argument again gives the desired
result.

It only remains to show that we can construct such a sequence {fi }. These functions
must be chosen from the family

F = {f ∈ Cb (T ; B(T )) : f ≤ h}.

If we can find a countable subfamily of F, say {g1 , g2 , . . .}, whose pointwise supremum
equals h at each point of C, then setting fi := max{g1 , . . . , gi } will do the trick.
Without loss of generality suppose that h > 0 (a constant could be added to h to
achieve this). Let d(·, ·) be the metric in T . For each subset A ∈ B(T ), define the distance
function d(·, A) by
d(x, A) = inf{d(x, y) : y ∈ A}.
Note that d(·, A) is a continuous function (in fact, d(·, A) is uniformly continuous. Exercise
(HW4): Show this.), for each fixed A. For a positive integer m and a positive rational r
define
fm,r (x) := r ∧ [m d(x, {h ≤ r})].
Each fm,r is bounded and continuous; it is at most r if h(x) > r; it takes the value zero if
h(x) ≤ r. Thus, fm,r ∈ F.

Given a point x ∈ C and an ε > 0, choose a positive rational number r with h(x) − ε <
r < h(x). Continuity of h at x keeps its value greater than r in some neighborhood of x.
Consequently, d(x, {h ≤ r}) > 0 and fm,r(x) = r > h(x) − ε for all m large enough. Take
{fm,r : m ∈ N₊, r ∈ Q₊} as the countable set {g1, g2, . . .}.

9.2.2 The general continuous mapping theorem

Till now we have restricted our attention to metric spaces (T, T ) endowed with their Borel
σ-field, i.e., T = B(T ). In this subsection we extend the continuous mapping theorem when

T need not equal B(T ).
d
Formally, we ask the following question. Suppose that Xn → X as T -measurable
random elements of a metric space T , and let H be a T /S-measurable map from T into
another metric space S. If H is continuous at each point of an T -measurable set C with
d
P(X ∈ C) = 1, does it follow that H(Xn ) → H(X), i.e., does Ef (H(Xn )) converge to
Ef (H(X)) for every f ∈ Cb (S; S)?
We will now assume that T is a sub-σ-field of B(T ). Define

F = {f ∈ Cb (T ; T ) : f ≤ h}.

Last time we constructed the countable subfamily that took the form

fm,r (x) = r ∧ [md(x, {h ≤ r}].

Continuity of fm,r suffices for Borel measurability, but it needn’t imply T -measurability.
We must find a substitute for these functions.

Definition 9.8. Call a point x ∈ T completely regular (with respect to the metric d and the
σ-field T ) if to each neighborhood V of x there exists a uniformly continuous, T -measurable
function g with g(x) = 1 and g ≤ 1V .

Theorem 9.9 (Continuous mapping theorem). Let (T, T ) be a measurable (metric) space.
Suppose that {Xn}n≥1 is a sequence of random elements of T converging in distribution to
a random element X, i.e., Xn →d X. If H : (T, T ) → (S, S) is continuous
at each point of some separable, T -measurable set C of completely regular points such that
P(X ∈ C) = 1, then H(Xn) →d H(X) as random elements in (S, S).

Idea of proof: (Step 1) Use the fact that each x ∈ C is completely regular to approximate
h(x) by a supremum of a sub-class of Cb (T ; T ) that is uniformly continuous. (Step 2) Then
use the separability of C to find a sequence of uniformly continuous functions in Cb (T ; T )
that increase to h for all x ∈ C. (Step 3) Use the same technique as in the initial part of
the proof of Theorem 9.7 to complete the proof.

Proof. Let Xn ∼ Pn and X ∼ P . Fix ψ ∈ Cb (S; B(S)). Then the T -measurable, real-
valued, bounded function h := ψ ◦ H is continuous at all points in C. We will have to show
that Pn h → P h.
Without loss of generality suppose that h > 0 (a constant could be added to h to
achieve this). Define

F = {f ∈ Cb (T ; T ) : f ≤ h; f is uniformly continuous}.

At those completely regular points x of T where h is continuous, the supremum of F equals


h, i.e.,
h(x) = sup f (x). (136)
f ∈F

To see this suppose that h is continuous at a point x that is completely regular. Choose r
with 0 < r < h(x) (r should ideally be close to h(x)). By continuity, there exists a δ > 0
such that h(y) > r on the closed ball B(x, δ). As x is completely regular, there exists a
uniformly continuous, T -measurable g such that g(x) = 1 and g ≤ 1B(x,δ) . Now, look at
the function f = rg. Observe that f ∈ F and f (x) = r. Thus (136) holds for all x ∈ C.
Separability of C will enable us to extract a suitable countable subfamily from F. Let
C0 be a countable dense subset of C. Let {g1 , g2 , . . .} be the set of all those functions of the
form r1B , with r rational, B a closed ball of rational radius centered at a point of C0 , and
r1B ≤ f for at least one f in F. For each gi choose one f satisfying the inequality gi ≤ f .
Denote it by fi . This picks out the required countable subfamily:

sup fi = sup f on C. (137)


i f ∈F

To see this, consider any point z ∈ C and any f ∈ F. For each rational number r such that
f (z) > r > 0 choose a rational  for which f > r at all points within a distance 2 of z.
Let B be the closed ball of radius  centered at a point x ∈ C0 for which d(x, z) < . The
function r1B lies completely below f ; it must be one of the gi . The corresponding fi takes
a value greater than r at z. Thus, assertion (137) follows.
The proof can now be completed using analogous steps as in the proof of Theorem 9.7.
Assume without loss of generality that fi ↑ h at each point of C. Then,

lim inf Pn h ≥ lim inf Pn fi for each i


n→∞ n→∞
d
= P fi because Pn → P
→ Ph as i → ∞, by monotone convergence.

Replace h by −h (plus a big constant) to get the companion inequality for the lim sup.

Corollary 9.10. If Ef (Xn ) → Ef (X) for each bounded, uniformly continuous, T -measurable
d
f , and if X concentrates on a separable set of completely regular points, then Xn → X.

The corollary follows directly from the decision to insist upon uniform continuous
separating functions in the definition of a completely regular point.

9.3 Weak convergence in the space C[0, 1]

Till now we have mostly confined ourselves to finite dimensional spaces (although the dis-
cussion in the previous subsection is valid for any metric space). In this subsection we
briefly discuss an important, both historically and otherwise, infinite dimensional metric
space where our random elements take values. Let C[0, 1] denote the space of continuous
functions from [0, 1] to R. We endow C[0, 1] with the uniform metric:

ρ(x, y) := ‖x − y‖∞ := sup_{t∈[0,1]} |x(t) − y(t)|,  for x, y ∈ C[0, 1].

Remark 9.1. Note that C[0, 1], equipped with the uniform metric, is separable97 (Exercise
(HW4): Show this.) and complete98 . C[0, 1] is complete because a uniform limit of contin-
uous functions is continuous [Browder, 1996, Theorem 3.24]. C[0, 1] is separable because
the set of polynomials on [0, 1] is dense in C[0, 1] by the Weierstrass approximation theo-
rem [Browder, 1996, Theorem 7.1], and the set of polynomials with rational coefficients is
countable and dense in the set of all polynomials, hence also dense in C[0, 1].

Question: Let X and Xn be random processes (elements) on [0, 1]. When can we say that
Xn →d X?

Suppose that Xn →d X in C[0, 1]. For each t ∈ [0, 1] define the projection map
πt : C[0, 1] → R as

πt(x) = x(t),  for x ∈ C[0, 1].

Observe that πt is a uniformly continuous map as |πt(x) − πt(y)| = |x(t) − y(t)| ≤ ‖x − y‖∞
for all x, y ∈ C[0, 1]. Thus, by the continuous mapping theorem (see e.g., Theorem 9.7), we
know that Xn(t) = πt(Xn) →d πt(X) = X(t) for all t ∈ [0, 1].

We shall write Xn →fd X for convergence of the finite-dimensional distributions, in the
sense that

(Xn(t1), . . . , Xn(tk)) →d (X(t1), . . . , X(tk)),  t1, . . . , tk ∈ [0, 1], k ∈ N.   (138)

Indeed, by defining the projection map π_{(t1,...,tk)} : C[0, 1] → Rk as π_{(t1,...,tk)}(x) = (x(t1), . . . , x(tk)),
for t1, . . . , tk ∈ [0, 1], we can show that if Xn →d X in C[0, 1] then Xn →fd X.

Remark 9.2. Although it can be shown that the distribution of a random process is determined
by the family of finite-dimensional distributions99, condition (138) is insufficient in
general for the convergence Xn →d X. Hint: Consider the sequence of points xn ∈ C[0, 1],
for n ≥ 1, where

xn(t) = nt 1_{[0, n⁻¹]}(t) + (2 − nt) 1_{(n⁻¹, 2n⁻¹]}(t).

Let x0 ≡ 0 ∈ C[0, 1]. We can show that δ_{xn} →fd δ_{x0}. But δ_{xn} does NOT converge weakly to
δ_{x0} as xn ↛ x0.
97
A topological space is separable if it contains a countable dense subset; that is, there exists a sequence
{xn }∞n=1 of elements of the space such that every nonempty open subset of the space contains at least one
element of the sequence.
98
A metric space T is called complete (or a Cauchy space) if every Cauchy sequence of points in T has a
limit that is also in T or, alternatively, if every Cauchy sequence in T converges in T . Intuitively, a space is
complete if there are no “points missing” from it (inside or at the boundary).
99
We have the following result (a consequence of [Kallenberg, 2002, Proposition 3.2]; also
see [Billingsley, 1999, Chapter 1]).
d
Lemma 9.11. Suppose that X and Y are random elements in C[0, 1]. Then X = Y if and only if
d
(X(t1 ), . . . , X(tk )) = (Y (t1 ), . . . , Y (tk )), t1 , . . . , tk ∈ [0, 1], k ∈ N.

The above result holds more generally than C[0, 1].

Alternatively, if δ_{xn} →d δ_{x0}, then ‖xn‖∞ := sup_{t∈[0,1]} xn(t) = 1 should converge weakly
to ‖x0‖∞ = 0 (which is obviously not true!). Note that here we have used the fact that
f : C[0, 1] → R, defined as f(x) = sup_{t∈[0,1]} x(t), is a continuous function (Exercise (HW4):
Show this.).
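A quick numerical check of this remark (the evaluation grid is an arbitrary choice of ours): xn(t) → 0 for every fixed t, so every fidi of δ_{xn} converges to the corresponding fidi of δ_{x0}, yet ‖xn‖∞ = 1 for every n.

import numpy as np

def x_n(t, n):
    # x_n(t) = n t 1[0, 1/n](t) + (2 - n t) 1(1/n, 2/n](t)
    t = np.asarray(t, dtype=float)
    return n * t * (t <= 1.0 / n) + (2.0 - n * t) * ((t > 1.0 / n) & (t <= 2.0 / n))

t_grid = np.linspace(0.0, 1.0, 10001)
for n in [1, 10, 100, 1000]:
    vals = x_n(t_grid, n)
    print(n, "x_n(0.5) =", float(x_n(0.5, n)), "  sup_t x_n(t) =", vals.max())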

The above discussion motivates the following concepts, which we will show are crucially
tied to the notion of weak convergence of stochastic processes.

9.3.1 Tightness and relative compactness

Suppose that Xn ’s are random elements taking values in the metric space (T, B(T )).

Definition 9.12. We say that the sequence of random elements {Xn }n≥1 is tight if and
only if for every  > 0 there exists a compact set100 V ⊂ T such that

P(Xn ∈ V ) > 1 − , for all n ≥ 1.

Exercise (HW4): Let S be a separable and complete metric space. Then every probability
measure on (S, B(S)) is tight. (Hint: See [Billingsley, 1999, Theorem 1.3])
Recall that a set A is relatively compact if its closure is compact, which is equivalent
to the condition that each sequence in A contains a convergent subsequence (the limit of
which may not lie in A). This motivates the following definition which can be thought of
as a ‘probabilistic’ version of relative compactness.

Definition 9.13. A sequence of random elements {Xn }n≥1 is said to be relatively compact
in distribution if every subsequence has a further sequence that converges in distribution.
Similarly, we can define the notion of relative compactness (in distribution) of a sequence
{Pn }n≥1 of probability measures.

Theorem 9.14 (Prohorov’s theorem). If {Xn }n≥1 is tight, then it is relatively compact in
distribution. In fact, the two notions are equivalent if T is separable and complete101 .

Proof. See Theorem 2.1 of [Billingsley, 1999] (or 3.25 of [Kallenberg, 2002]).

Prohorov’s theorem is probably the key result in this theory of classical weak conver-
gence and gives the basic connection between tightness and relative distributional compact-
ness.
100
Recall that in a metric space a set is compact if and only if it is complete and totally bounded.
101
A metric space M is called complete if every Cauchy sequence of points in M has a limit that is also in
M ; or, alternatively, if every Cauchy sequence in M converges in M .

9.3.2 Tightness and weak convergence in C[0, 1]
Lemma 9.15. Let X, X1, . . . , Xn, . . . be random elements in C[0, 1]. Then Xn →d X if and
only if Xn →fd X and {Xn} is relatively compact in distribution.

Proof. If Xn →d X, then Xn →fd X (by the continuous mapping theorem; take f : C[0, 1] →
Rk defined as f(x) = (x(t1), . . . , x(tk))), and {Xn} is trivially relatively compact in distribution.

Now assume that {Xn} satisfies the two conditions. If Xn does not converge in distribution to X, we may choose a
bounded continuous function f : C[0, 1] → R and an ε > 0 such that |Ef(Xn) − Ef(X)| > ε
along some subsequence N′ ⊂ N. By the relative compactness we may choose a further
subsequence N′′ ⊂ N′ and a process Y such that Xn →d Y along N′′. But then Xn →fd Y along
N′′, and since also Xn →fd X, we must have X =d Y (a consequence of Lemma 9.11). Thus,
Xn →d X along N′′, and so Ef(Xn) → Ef(X) along the same sequence, a contradiction.
Thus, we conclude that Xn →d X.

Recall that Prohorov’s theorem shows that tightness and relative compactness in distri-
bution are equivalent. Also, recall that a set A is relatively compact if its closure is compact.
Thus, in view of the above result, we need to find convenient criteria for tightness.
The modulus of continuity of an arbitrary function x(·) on [0, 1] is defined as

w(x, h) := sup_{|s−t|≤h} |x(s) − x(t)|,  h > 0.

The function w(·, h) is continuous in C[0, 1] and hence a measurable function102 , for each
fixed h > 0.
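A direct numerical rendering of w(x, h), approximated over a grid (the piecewise-linear random-walk path below is just an illustrative element of C[0, 1]):

import numpy as np

rng = np.random.default_rng(5)

def modulus(x, h, t):
    # w(x, h) = sup_{|s - u| <= h} |x(s) - x(u)|, approximated over the grid t
    w = 0.0
    for i in range(len(t)):
        window = np.abs(t - t[i]) <= h
        w = max(w, float(np.max(np.abs(x[window] - x[i]))))
    return w

m = 1000
t = np.linspace(0.0, 1.0, m + 1)
steps = rng.choice([-1.0, 1.0], size=m)
x = np.concatenate([[0.0], np.cumsum(steps)]) / np.sqrt(m)     # scaled random-walk path

for h in [0.2, 0.1, 0.05, 0.01]:
    print(h, round(modulus(x, h, t), 3))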
We use the classical Arzelá-Ascoli compactness criterion to do this. The Arzelá-Ascoli
theorem completely characterizes relative compactness in C[0, 1].

Theorem 9.16 (Arzelà-Ascoli). A set A is relatively compact in C[0, 1] if and only if

sup_{x∈A} |x(0)| < ∞

and

lim_{h→0} sup_{x∈A} w(x, h) = 0.   (139)

The functions in A are by definition equicontinuous at t0 ∈ [0, 1] if, as t → t0 ,


supx∈A |x(t) − x(t0 )| → 0 (cf. the notion of asymptotic equicontinuity introduced in Defi-
nition 1.6); and (139) defines uniform equicontinuity (over [0,1]) of the functions in A. We
can use it now for characterization of tightness.

Lemma 9.17. The sequence {Xn }n≥1 is tight if and only if the following two conditions
hold:
102
This follows from the fact that |w(x, h) − w(y, h)| ≤ 2kx − yk∞ .

(i) For every η > 0, there exist a ≥ 0 and n0 ∈ N such that

P(|Xn(0)| ≥ a) ≤ η  for all n ≥ n0.   (140)

(ii) For each ε > 0 and η > 0, there exist h ∈ (0, 1) and n0 ∈ N such that

P(w(Xn, h) ≥ ε) ≤ η  for all n ≥ n0.   (141)

Proof. Suppose that {Xn }n≥1 is tight. Given η > 0, we can find a compact K such that
P(Xn ∈ K) > 1 − η for all n ≥ 1. By the Arzelà-Ascoli theorem, we have K ⊂ {x ∈ C[0, 1] :
|x(0)| < a} for large enough a and K ⊂ {x ∈ C[0, 1] : w(x, h) < ε} for small enough h, and
so (i) and (ii) hold, with n0 = 1 in each case. Hence the necessity.
Assume then that {Pn }n≥1 satisfies (i) and (ii), with n0 = 1 (without loss of general-
ity103 ),
where Xn ∼ Pn . Given η > 0, choose a so that, if B = {x ∈ C[0, 1] : |x(0)| ≤ a},
then Pn (B) ≥ 1 − η for all n. Then choose hk , so that, if Bk := {x ∈ C[0, 1] : w(x, hk ) <
1/k}, then Pn (Bk ) ≥ 1 − η/2k for all n. If K is the closure of A := B ∩ (∩k Bk ), then
Pn (K) ≥ 1 − 2η for all n. Since A satisfies the conditions of Theorem 9.16, K is compact.
Therefore, {Pn }n≥1 is tight.

Remark 9.3. Note that condition (141) can be expressed in a more compact form:

lim_{h→0} lim sup_{n→∞} P(w(Xn, h) > ε) = 0.

The following result now gives an equivalent characterization of weak convergence in


C[0, 1]. It is a trivial consequence of what we have established so far.
Theorem 9.18. Let X, X1, X2, . . . be random elements in C[0, 1]. Then, Xn →d X if and
only if Xn →fd X and

lim_{h→0} lim sup_{n→∞} E[w(Xn, h) ∧ 1] = 0.

Remark 9.4. Let us fix two metric spaces (K, d) and (S, ρ), where K is compact and S is
separable and complete (i.e., S is a Polish space). Everything done in this subsection can
be easily extended to the space C(K, S) of continuous functions from K to S, endowed with
the uniform metric
ρ̂(x, y) = sup ρ(xt , yt ).
t∈K
103
Since C[0, 1] is separable and complete, a single measure Q is tight (by ), and so by the necessity of (i)
and (ii), for a given η > 0 there is an a such that Q({x ∈ C[0, 1] : |x(0)| ≥ a}) ≤ η, and for given  > 0 and
η > 0 there is a h > 0 such that Q({x ∈ C[0, 1] : w(x, h) > }) ≤ η. Therefore, we may ensure that the
inequalities in (140) and (141) hold for the finitely many n preceding n0 by increasing a and decreasing h if
necessary. Thus, we may assume that n0 is always 1.

9.4 Non-measurablilty of the empirical process

Suppose that U1 , . . . , Un be i.i.d. Uniform(0,1) defined on a probability space (Ω, A, P). Let
Gn be the empirical c.d.f. of the data. Note that Gn (defined on [0, 1]) is NOT continuous.
Thus, the empirical process is not an element of C[0, 1]. It is standard to consider Gn as
an element of D[0, 1], the space of càdlàg functions (right continuous with left limits) on
[0, 1]. To understand the weak convergence of the empirical process we next study weak
convergence on D[0, 1].
The following example shows that if D[0, 1] is equipped with the Borel σ-field B(D[0, 1])
generated by the closed sets under the uniform metric, the empirical d.f. Gn will not be a
random element in D[0, 1], i.e., Gn is not A/B(D[0, 1])-measurable.

Example 9.19. Consider n = 1 so that G1 (t) = 1[0,t] (U1 ), t ∈ [0, 1] (visualize the random
function G1 over [0, 1]; G1 (t) = 1{t ≥ U1 }). Let Bs be the open ball in D[0, 1] with center
1[s,1] and radius 1/2, where s ∈ [0, 1]. For each subset A ⊂ [0, 1] define

EA := ∪s∈A Bs .

Observe that EA is an open set as an uncountable union of open sets is also open.
If G1 were A/B(D[0, 1])-measurable, the set

{ω ∈ Ω : G1(ω) ∈ EA} = {ω ∈ Ω : U1(ω) ∈ A}

would belong to A. A probability measure could be defined on the class of all subsets of [0, 1]
by setting µ(A) := P(U1 ∈ A). This µ would be an extension of the uniform distribution
to all subsets of [0, 1]. Unfortunately such an extension cannot exist! Thus, we must give
up Borel measurability of G1 . The argument can be extended to n ≥ 1 (see Problem 1 in
[Pollard, 1984, Chapter IV]).

The above example shows that the Borel σ-field generated by the uniform metric on
D[0, 1] contains too many sets.
Exercise (HW4): Show that D[0, 1], equipped with the Borel σ-field B(D[0, 1]) generated
by the uniform metric, is NOT separable. [Hint: We can define fx (t) = 1[x,1] (t) for each x ∈
[0, 1], then kfx −fy k∞ = 1 whenever x 6= y. In particular, we have an uncountable collection
of disjoint open sets given by the balls B(fx , 1/2), and so the space is not countable.]
Too large a σ-field T makes it too difficult for a map into T to be a random element. We
must also guard against too small a T . Even though the metric space on T has lost the right
to have T equal to the Borel σ-field, it can still demand some degree of compatibility before
a fruitful weak convergence theory will result. If Cb (T ; T ) contains too few functions, the
approximation arguments underlying the continuous mapping theorem will fail. Without
that key theorem, weak convergence becomes a barren theory (see [Pollard, 1984, Chapter
IV]).

9.5 D[0, 1] with the ball σ-field

In the last subsection we saw that the uniform empirical distribution function Gn ∈ T :=
D[0, 1] is not measurable with respect to its Borel σ-field (under the uniform metric). There
is a simple alternative to the Borel σ-field that works for most applications in D[0, 1], when
the limiting distribution is continuous.
For each fixed t ∈ [0, 1], the map Gn (·, t) : Ω → R is a random variable. That is, if πt
denotes the coordinate projection map that takes a function x in D[0, 1] onto its value at
t, the composition πt ◦ Gn is A/B(R)-measurable.
Let P be the projection σ-field, i.e., the σ-field generated by the coordinate projection
maps. Recall that if f : T → R, then the σ-field generated by the function f , denoted by
σ(f ), is the collection of all inverse images f −1 (B) of the sets B ∈ B(R), i.e.,

σ(f ) := {f −1 (B) : B ∈ B(R)}.

Exercise(HW4): A stochastic process X on (Ω, A, P) with sample paths in D[0, 1], such as
an empirical process, is A/P-measurable provided πt ◦ X is A/B(R)-measurable for each
fixed t.
As every cadlag function on [0, 1] is bounded (Exercise(HW4)), the uniform distance

d(x, y) := ‖x − y‖ = sup_{t∈[0,1]} |x(t) − y(t)|

defines a metric on D[0, 1]. Here are a few facts about P (Exercise(HW4)):

• P coincides with the σ-field generated by the closed (or open) balls (this is called the
ball σ-field104 ).

• Every point in D[0, 1] is completely regular.

The limit processes for many applications will always concentrate in a separable subset
of D[0, 1], usually C[0, 1], the set of all continuous real valued functions on [0, 1].
Exercise(HW4): Show that C[0, 1] is a closed, complete, separable, P-measurable subset of
D[0, 1]. Show that for C[0, 1], the projection σ-field coincides with the Borel σ-field.
Question: How do we establish convergence in distribution of a sequence {Xn } of random
elements of D[0, 1] to a limit process X?

Definition 9.20. For each finite subset S of [0, 1] write πS for the projection map from
D[0, 1] into RS that takes an x onto its vector of values {x(t) : t ∈ S}.

Certainly, we need the finite dimensional projections {πS Xn } to converge in distribu-


tion, as random vectors in Rs , to the finite-dimensional projection of πS X, for each finite
104
For separable spaces, the ball σ-field is the same as the Borel σ-field; see Section 6 of [Billingsley, 1999]
for a detailed description and the properties of the ball σ-field.

subset S ∈ [0, 1]. The continuity and measurability of πS and the continuous mapping
theorem makes that a necessary condition.
For the sake of brevity, let us now shorten “finite-dimensional distributions” to “fidis”
and “finite-dimensional projections” to fidi projections.

Theorem 9.21. Let X, X1 , X2 , . . . be random elements of D[0, 1] (under its uniform metric
and the projection σ-field). Suppose that P(X ∈ C) = 1 for some separable subset C of
D[0, 1]. The necessary and sufficient conditions for {Xn } to converge in distribution to X
are:
(i) Xn →fd X (the fidis of Xn converge to the fidis of X, i.e., πS Xn →d πS X for each finite
subset S of [0, 1]);

(ii) to each ε > 0 and δ > 0 there corresponds a grid 0 = t0 < t1 < . . . < tm = 1 such that

lim sup_{n→∞} P( max_i sup_{Ji} |Xn(t) − Xn(ti)| > δ ) < ε;   (142)

where Ji denotes the interval [ti, ti+1), for i = 0, 1, . . . , m − 1.

d
Proof. Suppose that Xn → X. As the projection map πS is both continuous and projection-
measurable (by definition), the continuous mapping theorem shows that (i) holds.
Let ε, δ > 0 be given. To simplify the proof of (ii) suppose that the separable subset C is
C[0, 1] (continuity of the sample paths makes the choice of the grid easier). Let {s0, s1, . . .}
be a countable dense subset of [0, 1]. We will assume that s0 = 0 and s1 = 1.
For a fixed x ∈ C[0, 1], and given the ordered k + 1 points {s0, . . . , sk} labeled as
0 = t0 < . . . < tk = 1 in [0, 1], define an interpolation Ak x of x as

Ak x(t) = x(ti),  for ti ≤ t < ti+1,

and Ak x(1) = x(1).


For a fixed x ∈ C[0, 1], the distance kAk x − xk converges to zero as k increases, by
virtue of the uniform continuity of x. Thus, we can assure the existence of some k for which

P(‖Ak X − X‖ ≥ δ) < ε.   (143)


Choose and fix such a k. As kAk x − xk varies continuously with x (show this!), the set

F := {x ∈ D[0, 1] : kAk x − xk ≥ δ}

is closed. By an application of the Portmanteau theorem (note that this needs a proof,
and has not been discussed in class yet, as we are not dealing with the Borel σ-field; see
e.g., [Pollard, 1984, Example 17, Chapter IV]; also see [Billingsley, 1999, Theorem 6.3]),

lim sup P(Xn ∈ F ) ≤ P(X ∈ F ).


n→∞

The left-hand side bounds the limsup in (142).
d
Now let us show that (i) and (ii) imply Xn → X. Let us retain the assumption that
X has continuous sample paths. Choose any bounded, uniformly continuous, projection-
measurable, real valued function f on D[0, 1]. Given  > 0, find δ > 0 such that |f (x) −
f (y)| <  whenever kx − yk ≤ δ. Write AT for the approximation map constructed from
the grid in (ii) corresponding to this δ and , i.e.,

lim sup P(kAT Xn − Xn k > δ) < .


n→∞

Without loss of generality we may assume that the Ak of (143) equals AT .


Now, let us write the composition f ◦ AT as g ◦ πT , where g is a bounded, continuous
function on D[0, 1]. Thus,

|Ef(Xn) − Ef(X)|
  ≤ E|f(Xn) − f(AT Xn)| + |Ef(AT Xn) − Ef(AT X)| + E|f(AT X) − f(X)|
  ≤ ε + 2‖f‖ P(‖Xn − AT Xn‖ > δ) + |Eg(πT Xn) − Eg(πT X)|
    + ε + 2‖f‖ P(‖X − AT X‖ ≥ δ).

The middle term in the last line converges to zero as n → ∞, because of the fidi convergence
πT Xn →d πT X.

We now know the price we pay for wanting to make probability statements about func-
tionals that depend on the whole sample path of a stochastic process: with high probability
we need to rule out nasty behavior between the grid points.
Question: How do we control the left-hand side of (142)? It involves a probability of a
union of m events, which we may bound by the sum of the probabilities of those events,
m−1
X  
P sup |Xn (t) − Xn (ti )| > δ .
i=0 Ji

Then we can concentrate on what happens in an interval Ji between adjacent grid points.
For many stochastic processes, good behavior of the increment Xn (ti+1 ) − Xn (Ti ) forces
good behavior for the whole segment of sample path over Ji . Read [Pollard, 1984, Chapter
V] for further details and results on how to control the left-hand side of (142).

10 Weak convergence in non-separable metric spaces

We have already seen that the uniform empirical process is not Borel measurable. We have
also seen that we can change the Borel σ-field to the ball σ-field and develop a fruitful
notion of weak convergence.
Another alternative solution is to use a different metric that will make the empirical
process measurable, e.g., we can equip the space D[0, 1] with the Skorohod (metric) topology.

Remark 10.1. (D[0, 1] and the Skorohod metric) We say that two cadlag functions are close
to one another in the Skorohod metric if there is a reparameterization of the time-axis, (a
function [0, 1] to itself ) that is uniformly close to the identity function, and when applied to
one of the cadlag functions, brings it close to the other cadlag function. Heuristically, two
cadlag functions are close if their large jumps are close to one another and of similar size,
and if they are uniformly close elsewhere. See [Billingsley, 1999, Chapter 3] for the study
of D[0, 1] with the Skorohod metric.

However, for empirical processes that assume values in very general (and often more
complex spaces) such cute generalizations are not readily achievable and the easier way
out is to keep the topology simple and tackle the measurability issues. Of course, any
generalization of the notion of weak convergence must allow a powerful continuous mapping
theorem.
We will now try to develop a more general notion of weak convergence that can handle
non-measurability of the underlying functions. In this section, the underlying σ-field will
always be the Borel σ-field generated by the metric endowed with the space.
Suppose that D is a metric space with metric d(·, ·). Suppose that we have arbitrary
maps Xn : Ωn → D, defined on probability spaces (Ωn , An , Pn ). Because Ef (Xn ) need
no longer make sense (where f : D → R is a bounded continuous function), we replace
expectations by outer expectations.

Definition 10.1 (Outer expectation). For an arbitrary map X : Ω → D, define


E*f(X) := inf{ EU : U : Ω → R measurable, U ≥ f(X), EU exists }.   (144)

For most purposes outer expectations behave like usual expectations105 .

Definition 10.2. Then we say that a sequence of arbitrary maps Xn : Ωn → D converges


in distribution106 to a random element X if and only if

E∗ f (Xn ) → Ef (X)
105
Unfortunately, Fubini’s theorem is not valid for outer expectations (Note that for any map T it always
holds E∗1 E∗2 T ≤ E∗ T ). To overcome this problem, it is assumed that F is P -measurable. We will not define
P -measurability formally; see [van der Vaart and Wellner, 1996, Chapter 2.3] for more details.
106
We note that the weak convergence in the above sense is no longer tied with convergence of probability
measures, simply because Xn need not induce Borel probability measures on D.

for every bounded, continuous function f : D → R. Here we insist that the limit X be
Borel-measurable (and hence has a distribution). Note that although Ωn may depend on n,
we do not let this show up in the notation for E∗ and P∗ .

Definition 10.3. An arbitrary sequence of maps Xn : Ωn → D converges in probability to
X if

P*( d(Xn, X) > ε ) → 0  for all ε > 0.   (145)

This will be denoted by Xn →P X.

Definition 10.4. The sequence of maps Xn : Ωn → D converges almost surely to X if there
exists a sequence of (measurable) random variables ∆n such that

d(Xn, X) ≤ ∆n  and  ∆n →a.s. 0.   (146)

This will be denoted by Xn →a.s.* X.

Remark 10.2. Even for Borel measurable maps Xn and X, the distance d(Xn , X) need
not be a random variable.

Theorem 10.5 (Portmanteau). For arbitrary maps Xn : Ωn → D and every random


element X with values in D, the following statements are equivalent:

(i) E∗ f (Xn ) → Ef (X) for all real-valued bounded continuous functions f .

(ii) E∗ f (Xn ) → Ef (X) for all real-valued bounded Lipschitz functions f .

(iii) lim inf n→∞ P∗ (Xn ∈ G) ≥ P(X ∈ G) for every open set G.

(iv) lim supn→∞ P∗ (Xn ∈ F ) ≤ P(X ∈ F ) for every closed set F .

(v) P∗ (Xn ∈ B) → P(X ∈ B) for all Borel set B with P(X ∈ ∂B) = 0.

Proof. See [van der Vaart and Wellner, 1996, Theorem 1.3.4].

Theorem 10.6 (Continuous mapping). Suppose that H : D → S (S is a metric space) is


a map which is continuous at every x ∈ D0 ⊂ D. Also, suppose that Xn : Ωn → D are
arbitrary maps and X is a random element with values in D0 such that H(X) is a random
d d
element in S. If Xn → X, then H(Xn ) → H(X).

Proof. See [van der Vaart and Wellner, 1996, Theorem 1.3.6].

Definition 10.7. A Borel-measurable random element X into a metric space is tight if for
every  > 0 there exists a compact set K such that

P(X ∉ K) < ε.

Remark 10.3. (Separability and tightness) Let X : Ω → D be a random element. Tightness
is equivalent to there being a σ-compact set (countable union of compacts) that has probability
1 under X.
If there is a separable, measurable set with probability 1, then X is called separable.
Since a σ-compact set in a metric space is separable, separability is slightly weaker than
tightness. The two properties are the same if the metric space is complete107 .

10.1 Bounded stochastic processes

Let us look back at our motivation for studying weak convergence of stochastic processes.
Given i.i.d. random elements X1 , . . . , Xn ∼ P taking values in X and a class of measurable

functions F, we want to study the stochastic process { n(Pn − P )(f ) : f ∈ F} (i.e., the
empirical process). If we assume that

supf ∈F |f (x) − P f | < ∞, for all x ∈ X ,

then the maps from F to R defined as

f 7→ f (x) − P f, x ∈ X

are bounded functionals over F, and therefore, so is f 7→ (Pn − P )(f ). Thus,

Pn − P ∈ `∞ (F),

where `∞ (F) is the space of bounded real-valued functions on F, a Banach space108 if we


equip it with the supremum norm k · kF .

Remark 10.4. As C[0, 1] ⊂ D[0, 1] ⊂ `∞ [0, 1], we can consider convergence of a sequence of maps with values in C[0, 1] relative to C[0, 1], but also relative to D[0, 1] or `∞ [0, 1]. It can be shown that if D0 ⊂ D are arbitrary metric spaces equipped with the same metric, and X (as well as every Xn ) takes its values in D0 , then Xn →d X as maps in D0 if and only if Xn →d X as maps in D.

A stochastic process X = {Xt : t ∈ T } is a collection of random variables Xt : Ω → R,


indexed by an arbitrary set T and defined on the same probability space (Ω, A, P). For each
fixed ω ∈ Ω, the map t 7→ Xt (ω) is called a sample path, and it is helpful to think of X
as a random function, whose realizations are the sample paths, rather than a collection of
random variables. If every sample path is a bounded function, then X can be viewed as a
map X : Ω → `∞ (T ), where `∞ (T ) denotes the set of all bounded real-valued functions on
T . The space `∞ (T ) is equipped with the supremum norm k · kT . Unless T is finite, the
Banach space `∞ (T ) is not separable. Thus the classical theory of convergence in law on
107
See [Billingsley, 1999, Theorem 1.3] for a proof.
108
A Banach space is a complete normed vector space.

complete separable metric spaces needs to be extended. Given that the space `∞ (·) arises
naturally while studying the empirical process, we will devote this subsection to the study
of weak convergence in the space `∞ (T ).
It turns out that the theory of weak convergence on bounded stochastic processes
extends nicely if the limit laws are assumed to be tight Borel probability measures on
`∞ (T ). The following result characterizes weak convergence (to a tight limiting probability
measure) in `∞ (T ): convergence in distribution of a sequence of sample bounded processes
is equivalent to weak convergence of the finite dimensional probability laws together with
asymptotic equicontinuity, a condition that is expressed in terms of probability inequalities.
This reduces convergence in distribution in `∞ (T ) to maximal inequalities (we have spent
most of the earlier part of this course on proving such inequalities). We will often refer
to this theorem as the asymptotic equicontinuity criterion for convergence in law in `∞ (T ).
The history of this theorem goes back to Prohorov (for sample continuous processes) and
in the present generality it is due to [Hoffmann-Jø rgensen, 1991] with important previous
contributions by [Dudley, 1978].

Theorem 10.8. A sequence of arbitrary maps Xn : Ωn → `∞ (T ) converges weakly to a


tight random element if and only if both of the following conditions hold:

(i) The sequence (Xn (t1 ), . . . , Xn (tk )) converges in distribution in Rk (k ∈ N) for every
finite set of points t1 , . . . , tk ∈ T ;

(ii) (asymptotic equicontinuity) there exists a semi-metric d(·, ·) for which (T, d) is totally bounded and for every ε > 0

limδ→0 lim supn→∞ P∗( supd(s,t)<δ; s,t∈T |Xn (s) − Xn (t)| > ε ) = 0.    (147)

Proof. Let us first show that (i) and (ii) imply that Xn converges weakly to a tight random
element. The proof consists of two steps.
Step 1: By assumption (i) and Kolmogorov's consistency theorem109 we can construct a stochastic process {Xt : t ∈ T } on some probability space such that (Xn (t1 ), . . . , Xn (tk )) →d
109
Kolmogorov’s consistency theorem shows that for any collection of “consistent” finite-dimensional
marginal distributions one can construct a stochastic process on some probability space that has these
distributions as its marginal distributions.

Theorem 10.9. Let T denote an arbitrary set (thought of as “time”). For each k ∈ N and finite sequence
of times t1 , . . . tk ∈ T , let νt1 ,...,tk denote a probability measure on Rk . Suppose that these measures satisfy
two consistency conditions:
• for all permutations π of {1, . . . , k} and measurable sets Fi ⊂ R,

νπ(t1 ),...,π(tk ) (Fπ(1) × . . . × Fπ(k) ) = νt1 ,...,tk (F1 × . . . × Fk );

• for all measurable sets Fi ⊂ R,

νt1 ,...,tk (F1 × . . . × Fk ) = νt1 ,...,tk ,tk+1 (F1 × . . . × Fk × R).

(X(t1 ), . . . , X(tk )) for every finite set of points t1 , . . . , tk ∈ T . We need to verify that X
admits a version that is a tight random element in `∞ (T ). Let T0 be a countable d-dense
subset of T , and let Tk , k = 1, 2, . . . be an increasing sequence of finite subsets of T0 such
that ∪∞k=1 Tk = T0 . By the Portmanteau lemma (see part (iii) of Theorem 9.2) on the
equivalent conditions of weak convergence in Euclidean spaces we have
   
P( maxd(s,t)<δ; s,t∈Tk |X(s) − X(t)| > ε ) ≤ lim infn→∞ P( maxd(s,t)<δ; s,t∈Tk |Xn (s) − Xn (t)| > ε )
≤ lim infn→∞ P( supd(s,t)<δ; s,t∈T0 |Xn (s) − Xn (t)| > ε ).

Taking k → ∞ on the far left side, we conclude that

P( supd(s,t)<δ; s,t∈T0 |X(s) − X(t)| > ε ) ≤ lim infn→∞ P( supd(s,t)<δ; s,t∈T0 |Xn (s) − Xn (t)| > ε ),

where we consider the open set {x ∈ `∞ (T ) : supd(s,t)<δ; s,t∈Tk |x(s) − x(t)| > ε} and by the monotone convergence theorem this remains true if Tk is replaced by T0 .
By the asymptotic equicontinuity condition (147), there exists a sequence δr > 0 with
δr → 0 as r → ∞ such that
P( supd(s,t)≤δr ; s,t∈T0 |X(s) − X(t)| > 2−r ) ≤ 2−r .

These probabilities sum to a finite number over r ∈ N. Hence by the Borel-Cantelli


lemma110 , there exists r(ω) < ∞ a.s. such that for almost all ω,

supd(s,t)≤δr ; s,t∈T0 |X(s; ω) − X(t; ω)| ≤ 2−r , for all r > r(ω).

Hence, X(t; ω) is a d-uniformly continuous function of t for almost every ω. As T is to-


tally bounded, X(t; ω) is also bounded. The extension to T by uniform continuity of the
restriction of X on T0 (only the ω set where X is uniformly continuous needs be con-
sidered) produces a version of X whose trajectories are all uniformly continuous in (T, d)
and, in particular, the law of X admits a tight extension to the Borel σ-algebra of `∞ (T )111 .

Then there exists a probability space (Ω, A, P) and a stochastic process X : T × Ω → R such that

P(Xt1 ∈ F1 , . . . , Xtk ∈ Fk ) = νt1 ,...,tk (F1 × . . . × Fk )

for all ti ∈ T , k ∈ N and measurable set Fi ⊂ Rn , i.e., X has νt1 ,...,tk as its finite-dimensional
distributions relative to times t1 , . . . , tk .

110
Let {An }n≥1 be a sequence of some events in some probability space. The Borel-Cantelli lemma states
that: If the sum of the probabilities of the events An is finite, i.e., ∑_{n=1}^∞ P(An ) < ∞, then the probability that infinitely many of them occur is 0, i.e., P(lim supn→∞ An ) = 0, where lim sup An := ∩_{n=1}^∞ ∪_{k=n}^∞ Ak is
the set of outcomes that occur infinitely many times within the infinite sequence of events {An }.
111
We will use the following lemma without proving it.

Step 2: We now prove Xn →d X in `∞ (T ). First we recall a useful fact112 : If f : `∞ (T ) → R is bounded and continuous, and if K ⊂ `∞ (T ) is compact, then for every ε > 0 there exists δ > 0 such that

ku − vkT < δ, u ∈ K, v ∈ `∞ (T ) ⇒ |f (u) − f (v)| < ε.    (148)

Since (T, d) is totally bounded, for every τ > 0 there exists a finite set of points t1 , . . . , tN (τ ) which is τ -dense in (T, d) in the sense that T ⊆ ∪_{i=1}^{N (τ )} B(ti , τ ), where
B(t, τ ) denotes the open ball of center t and radius τ . Then, for each t ∈ T we can choose
πτ : T → {t1 , . . . , tN (τ ) } so that d(πτ (t), t) < τ . We then define processes Xn,τ , n ∈ N, and
Xτ as
Xn,τ (t) = Xn (πτ (t)), Xτ (t) = X(πτ (t)), t ∈ T. (149)

These are approximations of Xn and X that take only a finite number N (τ ) of values.
Convergence of the finite dimensional distributions of Xn to those of X implies113 that
Xn,τ →d Xτ in `∞ (T ).    (150)

Moreover, the uniform continuity of the sample paths of X implies

limτ →0 kX − Xτ kT = 0 a.s.    (151)

Now let f : `∞ (T ) → R be a bounded continuous function. We have,

|E∗ f (Xn ) − Ef (X)| ≤ |E∗ f (Xn ) − Ef (Xn,τ )| + |Ef (Xn,τ ) − Ef (Xτ )|


+ |Ef (Xτ ) − Ef (X)|
≤ In,τ + IIn,τ + IIIτ .

We have seen that limn→∞ IIn,τ = 0 (by (150)) for each fixed τ > 0 and limτ →0 IIIτ = 0 (as
X is uniformly continuous a.s.114 ). Hence it only remains to show that limτ →0 lim supn→∞ In,τ = 0.

Lemma 10.10. Let X(t), t ∈ T , be a sample bounded stochastic process. Then the finite dimensional
distributions of X are those of a tight Borel probability measure on `∞ (T ) if and only if there exists on T a
semi-metric d for which (T, d) is totally bounded and such that X has a version with almost all its sample
paths uniformly continuous for d.

112
Suppose on the contrary that the assertion is false. Then there exist  > 0 and sequences un ∈ K and
vn ∈ T such that d(un ; vn ) → 0 and |f (un ) − f (vn )| ≥ . Since K is compact, there exists a subsequence
un0 of un such that un0 has a limit u in K. Then vn0 → u and, by continuity of f , |f (un0 ) − f (vn0 )| →
|f (u) − f (u)| = 0, which is a contradiction.
113
To see this, let Ti = {t ∈ T : ti = πτ (t)}. Then {Ti } forms a partition of T and πτ (t) = ti whenever t ∈ Ti . Since the map Πτ : (a1 , . . . , aN (τ ) ) 7→ ∑_{i=1}^{N (τ )} ai 1Ti (·) is continuous from RN (τ ) into `∞ (T ), the finite dimensional convergence implies that for any bounded continuous function H : `∞ (T ) → R,

E[H(Xn,τ )] = E[H ◦ Πτ (Xn (t1 ), . . . , Xn (tN (τ ) ))] → E[H ◦ Πτ (X(t1 ), . . . , X(tN (τ ) ))] = E[H(Xτ )],

which proves (150).


114
A more detailed proof goes as follows. Given  > 0, let K ⊂ `∞ (T ) be a compact set such that

Given ε > 0 we choose K ⊂ `∞ (T ) to be a compact set such that P(X ∈ K c ) < ε/(6kf k∞ ). Let δ > 0 be such that (148) holds for K and ε/6. Then, we have

|E∗ f (Xn ) − Ef (Xn,τ )| ≤ 2kf k∞ [ P( Xn,τ ∈ (K δ/2 )c ) + P( kXn − Xn,τ kT ≥ δ/2 ) ]
+ sup{|f (u) − f (v)| : u ∈ K, ku − vkT < δ},    (152)

where K δ/2 is the (δ/2)-open neighborhood of the set K under the sup norm, i.e.,

K δ/2 := {v ∈ `∞ (T ) : infu∈K ku − vkT < δ/2}.

The inequality in (152) can be checked as follows: If Xn,τ ∈ K δ/2 and kXn − Xn,τ kT < δ/2,
then there exists u ∈ K such that ku − Xn,τ kT < δ/2 and then

ku − Xn kT ≤ ku − Xn,τ kT + kXn − Xn,τ kT < δ.

Now the asymptotic equicontinuity hypothesis (see (147); recall the definition of Xn,τ
in (149)) implies that there is a τ2 > 0 such that
  
lim supn→∞ P∗( kXn − Xn,τ kT ≥ δ/2 ) < ε/(6kf k∞ ), for all τ < τ2 .

Further, finite-dimensional convergence yields (by the Portmanteau theorem)


    
lim sup P∗ Xn,τ ∈ (K δ/2 )c ≤ P Xτ ∈ (K δ/2 )c < .
n→∞ 6kf k∞

Hence we conclude from (152), that for all τ < min{τ1 , τ2 },

lim supn→∞ |E∗ f (Xn ) − Ef (Xn,τ )| < ε,

which completes the proof of the part that (i) and (ii) imply the weak convergence of Xn
(to tight random element).
Now we show that if Xn converges weakly to a tight random element in `∞ (T ), (i) and
(ii) should hold. By the continuous mapping theorem it follows that (i) holds. The other
implication is a consequence of the “closed set” part of the Portmanteau theorem. First we
state a result which we will use (but prove later).

Theorem 10.11. Suppose that X ∈ `∞ (T ) induces a tight Borel measure. Then there exists
a semi-metric ρ on T for which (T, ρ) is totally bounded and such that X has a version with
P(X ∈ K c ) < ε/(6kf k∞ ). Let δ > 0 be such that (148) holds for K and ε/6. Let τ1 > 0 be such that P(kXτ − XkT ≥ δ) < ε/(6kf k∞ ) for all τ < τ1 ; this can be done by virtue of (151). Then it follows that

|Ef (Xτ ) − Ef (X)| ≤ 2kf k∞ P( {X ∈ K c } ∪ {kXτ − XkT ≥ δ} ) + sup{|f (u) − f (v)| : u ∈ K, ku − vkT < δ}
≤ 2kf k∞ ( ε/(6kf k∞ ) + ε/(6kf k∞ ) ) + ε/6 ≤ ε.

Hence limτ →0 IIIτ = 0.

almost all sample paths in the space of all uniformly continuous real valued functions from
T to R.
Furthermore, if X is zero-mean Gaussian, then this semi-metric can be taken equal to
ρ(s, t) = √( Var(X(s) − X(t)) ).

Now, if Xn converges weakly to a tight random element X in `∞ (T ), then by Theo-


rem 10.11, there is a semi-metric ρ on T which makes (T, ρ) totally bounded and such that
X has (a version with) ρ-uniformly continuous sample paths. Thus for the closed set Fδ,ε defined by

Fδ,ε := { x ∈ `∞ (T ) : supρ(s,t)≤δ |x(s) − x(t)| ≥ ε },

we have (by the Portmanteau theorem)


lim supn→∞ P∗( supρ(s,t)≤δ |Xn (s) − Xn (t)| ≥ ε ) = lim supn→∞ P∗ (Xn ∈ Fδ,ε )
≤ P(X ∈ Fδ,ε ) = P( supρ(s,t)≤δ |X(s) − X(t)| ≥ ε ).

Taking limits across the resulting inequality as δ → 0 yields the asymptotic equicontinuity
in view of the ρ-uniform continuity of the sample paths of X.

In view of this connection between the partitioning condition (ii), continuity and tight-
ness, we shall sometimes refer to this condition as the condition of asymptotic tightness or
asymptotic equicontinuity.

Remark 10.5. How do we control the left-hand side of (147)? It involves a probability,
which by Markov’s inequality can be bounded by
ε−1 E∗( supd(s,t)<δ |Xn (s) − Xn (t)| ).

Thus we need good bounds on the expected suprema of the localized fluctuations.

Proof of Theorem 10.11. Recall that X ∼ P being tight is equivalent to the existence of a σ-compact set (a countable union of compacts) that has probability 1 under X. To see this, given ε = 1/m, there exists a compact set Em ⊂ `∞ (T ) such that P (Em ) > 1 − 1/m. Take Km = ∪_{i=1}^m Ei and let K = ∪_{m=1}^∞ Km . Then {Km } is an increasing sequence of compacts in `∞ (T ). We will show that the semi-metric ρ on T defined by

ρ(s, t) = ∑_{m=1}^∞ 2−m (1 ∧ ρm (s, t)),   where ρm (s, t) := supx∈Km |x(s) − x(t)|,   s, t ∈ T,

makes (T, ρ) totally bounded. To show this, let ε > 0, and choose k so that ∑_{m=k+1}^∞ 2−m < ε/4. By the compactness of Kk , there exists x1 , . . . , xr , a finite subset of

Kk , such that Kk ⊂ ∪_{i=1}^r B(xi ; ε/4) (here B(x; ε) := {y ∈ `∞ (T ) : kx − ykT < ε} is the ball of radius ε around x), i.e., for each x ∈ Kk = ∪_{m=1}^k Km there exists i ∈ {1, . . . , r} such that

kx − xi kT ≤ ε/4.    (153)
Also, as Kk is compact, Kk is a bounded set. Thus, the subset A ⊂ Rr defined by
{(x1 (t), . . . , xr (t)) : t ∈ T } is bounded. Therefore, A is totally bounded and hence there
exists a finite set Tε := {tj : 1 ≤ j ≤ N } such that, for every t ∈ T , there is a j ≤ N for which

maxi=1,...,r |xi (t) − xi (tj )| ≤ ε/4.    (154)

Next we will show that Tε is ε-dense in T for the semi-metric ρ. Let t ∈ T . For any m ≤ k,
we have
ρm (t, tj ) = supx∈Km |x(t) − x(tj )| ≤ 3ε/4.    (155)
Note that (155) follows as for any x ∈ Km , there exists i ∈ {1, . . . , r} such that kx − xi kT ≤ ε/4 (by (153)) and thus,

|x(t) − x(tj )| ≤ |x(t) − xi (t)| + |xi (t) − xi (tj )| + |xi (tj ) − x(tj )| ≤ ε/4 + ε/4 + ε/4 = 3ε/4,
where we have used (154). Hence,
ρ(t, tj ) ≤ ∑_{m=1}^k 2−m ρm (t, tj ) + ∑_{m=k+1}^∞ 2−m ≤ (3ε/4) ∑_{m=1}^k 2−m + ε/4 ≤ ε.

Thus, we have proved that (T, ρ) is totally bounded.


Furthermore, the functions x ∈ K are uniformly ρ-continuous, since, if x ∈ Km , then
|x(s) − x(t)| ≤ ρm (s, t) ≤ 2m ρ(s, t) for all s, t ∈ T with ρ(s, t) ≤ 1 (which is always the
case). Since P (K) = 1, the identity function of (`∞ (T ), B, P ) yields a version of X with
almost all of its sample paths in K, hence in U C(T, ρ).
Now let ρ2 be the standard deviation semi-metric. Since every uniformly ρ-continuous
function has a unique continuous extension to the ρ-completion of T , which is compact, it is
no loss of generality to assume that T is ρ-compact. Let us also assume that every sample
path of X is ρ-continuous.
An arbitrary sequence {tn }n≥1 in T has a ρ-converging subsequence tn0 → t. By the ρ-
continuity of the sample paths, X(tn0 ) → X(t) almost surely. Since every X(t) is Gaussian,
this implies convergence of means and variances, whence ρ2 (tn0 , t)2 = E(X(tn0 )−X(t))2 → 0
by a convergence lemma. Thus tn0 → t also for ρ2 , and hence T is ρ2 -compact.
Suppose that a sample path t 7→ X(t, ω) is not ρ2 -continuous. Then there exists an
 > 0 and a t ∈ T such that ρ2 (tn , t) → 0, but |X(tn , ω) − X(t, ω)| ≥  for every n.
By the ρ-compactness and continuity, there exists a subsequence such that ρ(tn0 , s) → 0
and X(tn0 , ω) → X(s, ω) for some s ∈ T . By the argument of the preceding paragraph,

ρ2 (tn0 , s) → 0, so that ρ2 (s, t) = 0 and |X(s, ω) − X(t, ω)| ≥ . Conclude that the path
t 7→ X(t, ω) can only fail to be ρ2 -continuous for ω for which there exist s, t ∈ T with
ρ2 (s, t) = 0, but X(s, ω) 6= X(t, ω). Let N be the set of ω for which there do exist such
s and t. Take a countable, ρ-dense subset A of {(s, t) ∈ T × T : ρ2 (s, t) = 0}. Since
t 7→ X(t, ω) is ρ-continuous, N is also the set of all ω such that there exist (s, t) ∈ A with
X(s, ω) 6= X(t, ω). From the definition of ρ2 , it is clear that for every fixed (s, t), the set of
ω such that X(s, ω) 6= X(t, ω) is a null set. Conclude that N is a null set. Hence, almost
all paths of X are ρ2 -continuous.

Remark 10.6. In the course of the proof of the preceding theorem we constructed a semi-
metric ρ such that the weak limit X has uniformly ρ-continuous sample paths, and such that
(T, ρ) is totally bounded. This is surprising: even though we are discussing stochastic pro-
cesses with values in the very large space `∞ (T ), the limit is concentrated on a much smaller
space of continuous functions. Actually this is a consequence of imposing the condition (ii)
and insisting that the limit X be a tight random element.

10.2 Spaces of locally bounded functions

Let T1 ⊂ T2 ⊂ . . . be arbitrary sets and T = ∪_{i=1}^∞ Ti . Think of Ti = [−i, i], in which case T = R. The space `∞ (T1 , T2 , . . .) is defined as the set of all functions z : T → R that are uniformly bounded on every Ti (but not necessarily on T ).
Exercise (HW4): Show that `∞ (T1 , T2 , . . .) is a complete metric space with respect to the
metric
d(z1 , z2 ) := ∑_{i=1}^∞ (kz1 − z2 kTi ∧ 1) 2−i .

Exercise (HW4): Show that a sequence converges in this metric if it converges uniformly
on each Ti .
In case Ti := [−i, i]d ⊂ Rd (for d ≥ 1), the metric d induces the topology of uniform
convergence on compacta.
The space `∞ (T1 , T2 , . . .) is of interest in applications, but its weak convergence theory
is uneventful. Weak convergence of a sequence is equivalent to (weak) convergence in each
of the restrictions Ti ’s.

Theorem 10.12. Let Xn : Ωn → `∞ (T1 , T2 , . . .) be arbitrary maps, n ≥ 1. Then the


sequence {Xn }n≥1 converges weakly to a tight limit if and only if for every i ∈ N, Xn|Ti :
Ωn → `∞ (Ti ) converges weakly to a tight limit.

Proof. See [van der Vaart and Wellner, 1996, Theorem 1.6.1].

11 Donsker classes of functions

Suppose that X1 , . . . , Xn are i.i.d. random elements taking values in a set X having distribution P and let Gn denote the corresponding empirical process (i.e., Gn ≡ √n(Pn − P )) indexed by a class F of real-valued measurable functions.

Definition 11.1. A class F of measurable functions f : X → R is Donsker if the empirical


process {Gn f : f ∈ F} indexed by F converges in distribution in the space `∞ (F) to a tight
random element.

The Donsker property depends on the law P of the observations; to stress this we
also say “P -Donsker”. The definition implicitly assumes that the empirical process can be
viewed as a map into `∞ (F), i.e., that the sample paths f 7→ Gn f are bounded. By the
multivariate CLT, for any finite set of measurable functions fi with P fi2 < ∞,
(Gn f1 , . . . , Gn fk ) →d (GP f1 , . . . , GP fk ),

where the vector on the right-hand side possesses a multivariate normal distribution with
mean zero and covariances given by

E[GP f GP g] = P (f g) − (P f )(P g).

Remark 11.1. A stochastic process GP in `∞ (F) is called Gaussian if for every (f1 , . . . , fk ),
fi ∈ F, k ∈ N, the random vector (GP (f1 ), . . . , GP (fk )) is a multivariate normal vector.

As an immediate consequence of Theorem 10.8, we have the following result.

Theorem 11.2. Let F be a class of measurable functions from X to R such that P [f 2 ] < ∞,
for every f ∈ F, and

supf ∈F |f (x) − P f | < ∞, for all x ∈ X .

Then the empirical process {Gn f : f ∈ F} converges weakly to a tight random element (i.e.,
F is P -Donsker) if and only if there exists a semi-metric d(·, ·) on F such that (F, d) is
totally bounded and
limδ→0 lim supn→∞ P∗( supd(f,g)≤δ; f,g∈F |Gn (f − g)| > ε ) = 0, for every ε > 0.    (156)

A typical distance d is d(f, g) = kf − gkL2 (P ) , but this is not the only one.

11.1 Donsker classes under bracketing condition

For most function classes of interest, the bracketing numbers N[ ] (ε, F, L2 (P )) grow to infinity as ε ↓ 0. A sufficient condition for a class to be Donsker is that they do not grow too fast. The speed can be measured in terms of the bracketing integral. Recall that the bracketing entropy integral is defined as

J[ ] (δ, F, L2 (P )) := ∫_0^δ √( log N[ ] (ε, F ∪ {0}, L2 (P )) ) dε.

If this integral is finite-valued, then the class F is P -Donsker. As the integrand is a decreasing function of ε, the convergence of the integral depends only on the size of the bracketing numbers for ε ↓ 0. Because ∫_0^1 ε−r dε converges for r < 1 and diverges for r ≥ 1, the integral condition roughly requires that the entropies grow at a slower order than 1/ε2 .

Theorem 11.3 (Donsker theorem). Suppose that F is a class of measurable functions with
square-integrable (measurable) envelope F and such that J[ ] (1, F, L2 (P )) < ∞. Then F is
P -Donsker.

Proof. As N[ ] (, F, L2 (P )) is finite for every  > 0, we know that (F, d) is totally bounded,
where d(f, g) = kf − gkL2 (P ) . Let G be the collection of all differences f − g when f and g
range over F. With a given set of -brackets {[li , ui ]}Ni=1 over F we can construct 2-brackets
over G by taking differences [li − uj , ui − lj ] of upper and lower bounds. Therefore, the
bracketing numbers N[ ] (, G, L2 (P )) are bounded by the squares of the bracketing numbers
N[ ] (/2, F, L2 (P )). Taking a logarithm turns the square into a multiplicative factor 2 and
hence the entropy integrals of F and G are proportional. The function G = 2F is an
envelope for the class G.
Let Gδ := {f − g : f, g ∈ F, kf − gkL2 (P ) ≤ δ}. Hence, by a maximal inequality115 , there exists a finite number a(δ) = δ/√( log N[ ] (δ, Gδ , L2 (P )) ) such that

E∗ [ supf,g∈F ; kf −gk≤δ |Gn (f − g)| ] ≲ J[ ] (δ, Gδ , L2 (P )) + √n P [G 1{G > a(δ)√n}]
≤ J[ ] (δ, G, L2 (P )) + √n P [G 1{G > a(δ)√n}].    (157)


The second term on the right is bounded by a(δ)−1 P [G2 1{G > a(δ) n}] and hence
converges to 0 as n → ∞ for every δ. The integral converges to zero as δ → 0. The theorem
now follows from the asymptotic equi-continuity condition (see Theorem 11.2), in view of
Markov’s inequality.

Example 11.5 (Classical Donsker’s theorem). When F is equal to the collection of all
indicator functions of the form ft = 1(−∞,t] , with t ranging over R, then the empirical
115
Here is a maximal inequality that uses bracketing entropy (see [van der Vaart, 1998, Lemma 19.34] for
a proof):

Theorem 11.4. For any class F of measurable functions f : X → R such that P f 2 < δ 2 , for every f , we have, with a(δ) = δ/√( log N[ ] (δ, F, L2 (P )) ), and F an envelope function,

E∗ kGn kF = E∗ [ supf ∈F |Gn f | ] ≲ J[ ] (δ, F, L2 (P )) + √n P ∗ [F 1{F > √n a(δ)}].


process Gn ft is the classical empirical process √n(Fn (t) − F (t)) (here X1 , . . . , Xn are i.i.d. P with c.d.f. F ).
We saw previously that N[ ] (ε, F, L2 (P )) ≤ 2/ε2 , whence the bracketing numbers are of the polynomial order 1/ε2 . This means that this class of functions is very small, because a function of the type log(1/ε) satisfies the entropy condition of Theorem 5.2 easily.
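To make Example 11.5 concrete, here is a small numerical illustration (a sketch, not part of the original notes; the sample size, number of replications and the use of scipy's kstwobign are my own choices). It simulates supt |Gn ft | = √n supt |Fn (t) − F (t)| for uniform data and compares its quantiles with those of supt |B(t)|, B a Brownian bridge, as predicted by the classical Donsker theorem.

# Sketch: classical empirical process of Example 11.5 for U(0,1) data.
# We simulate sup_t sqrt(n)|F_n(t) - t| and compare it with the Kolmogorov
# limit law, i.e., the distribution of the sup of a standard Brownian bridge.
import numpy as np
from scipy.stats import kstwobign   # distribution of sup_t |B(t)|

rng = np.random.default_rng(0)
n, n_rep = 500, 2000
sups = np.empty(n_rep)
for b in range(n_rep):
    x = np.sort(rng.uniform(size=n))
    upper = np.arange(1, n + 1) / n - x     # F_n - F just after each order statistic
    lower = x - np.arange(0, n) / n         # F - F_n just before each order statistic
    sups[b] = np.sqrt(n) * max(upper.max(), lower.max())

for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(sups, q), kstwobign.ppf(q))

The simulated quantiles should be close to the limiting ones, reflecting the weak convergence of the empirical process to the Brownian bridge.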

11.2 Donsker classes with uniform covering numbers

The following theorem shows that the bracketing numbers in the preceding Donsker theorem
can be replaced by the uniform covering numbers116

supQ N (εkF kQ,2 , F, L2 (Q)).

Here the supremum is taken over all probability measures Q for which kF k2Q,2 = Q[F 2 ] > 0.
Recall that the uniform entropy integral is defined as

J(δ, F, F ) = ∫_0^δ √( supQ log N (εkF kQ,2 , F, L2 (Q)) ) dε.    (158)

Theorem 11.6. (Donsker theorem). Let F be a pointwise-measurable117 class of measurable


functions with (measurable) envelope F such that P [F 2 ] < ∞. If J(1, F, F ) < ∞ then F is
P -Donsker.

Proof. We first show that (F, d) is totally bounded, where d(f, g) = kf − gkL2 (P ) . The finiteness of the uniform entropy integral implies the finiteness of its integrand: supQ N (εkF kQ,2 , F, L2 (Q)) < ∞, for every ε > 0, where the supremum is taken over all finitely discrete probability measures Q with Q[F 2 ] > 0. We claim that this implies that F is totally bounded in L2 (P ). Let ε > 0 and suppose that f1 , . . . , fN are functions in F such that P [(fi − fj )2 ] > ε2 P [F 2 ], for every i ≠ j. By the law of large numbers Pn [(fi − fj )2 ] → P [(fi − fj )2 ] for any i, j and Pn [F 2 ] → P F 2 , almost surely, as n → ∞. It follows that there exists some n and realization Pn of Pn such that Pn [(fi − fj )2 ] > ε2 P [F 2 ], for every i ≠ j, and 0 < Pn [F 2 ] < 2P [F 2 ]. Consequently Pn [(fi − fj )2 ] > ε2 Pn [F 2 ]/2, and hence N ≤ D(εkF kPn ,2 /√2, F, L2 (Pn )) (recall the notion of packing number; see Definition 2.5). Because Pn is finitely discrete, the right side is bounded by the supremum over Q considered previously and hence bounded in n. In view of the definition of N this shows that D(εkF kP,2 , F, L2 (P )) is finite for every ε > 0.
116
The uniform covering numbers are relative to a given envelope function F . This is fortunate, because the
covering numbers under different measures Q typically are more stable if standardized by the norm kF kQ,2
of the envelope function. In comparison, in the case of bracketing numbers we consider a single distribution
P.
117
The condition that the class F is pointwise-measurable (or “suitably measurable”) is satisfied in most
examples but cannot be omitted. It suffices that there exists a countable collection G of functions such that
each f is the pointwise limit of a sequence gm in G; see [van der Vaart and Wellner, 1996, Chapter 2.3].

To verify the asymptotic equi-continuity condition, it is enough to show that

limδ→0 lim supn→∞ E[√n kPn − P kGδ ] = 0,

where Gδ := {f − g : f, g ∈ F, kf − gkL2 (P ) ≤ δ}. The class Gδ has envelope 2F , and since


Gδ ⊂ {f − g : f, g ∈ F}, we have

supQ N (εk2F kQ,2 , Gδ , k · kQ,2 ) ≤ supQ N 2 (εkF kQ,2 , F, k · kQ,2 ),

which leads to J(ε, Gδ , 2F ) ≤ CJ(ε, F, F ) for all ε > 0. Hence by the maximal inequality in Theorem 4.10 (with σ = δ, envelope 2F ) and δ′ = δ/(2kF kP,2 ),

E[kGn kGδ ] ≤ C ( J(δ′, F, F ) kF kP,2 + Bn J 2 (δ′, F, F )/(δ′2 √n) ),

where Bn = 2 √( E[max1≤i≤n F 2 (Xi )] ). As F ∈ L2 (P ), we have118 Bn = o(√n). Given η > 0, by choosing δ small, we can make sure that J(δ′, F, F ) < η and, for large n, Bn J 2 (δ′, F, F )/(δ′2 √n) < η. Thus,

lim supn→∞ E[kGn kGδ ] ≤ (CkF kP,2 + 1)η,

so that the desired conclusion follows.


Alternate proof: The entropy integral of the class F − F (with envelope 2F ) is bounded
by a multiple of J(δ, F, F ). Application of Theorem 4.7 followed by the Cauchy-Schwarz
inequality yields, with θn2 := supf ∈Gδ Pn [f 2 ]/kF k2n ,
E[ kGn kGδ ] ≲ E[ J(θn , Gδ , 2F ) kF kn ] ≲ E∗ [J 2 (θn , Gδ , 2F )]1/2 kF kP,2 .

If θn →P 0, then the right side converges to zero, in view of the dominated convergence
theorem.
Without loss of generality, assume that F ≥ 1, so that θn2 ≤ supf ∈Gδ Pn [f 2 ] (otherwise
replace F by F ∨ 1; this decreases the entropy integral). As supf ∈Gδ P [f 2 ] → 0, as δ → 0,
the desired conclusion follows if kPn f 2 − P f 2 kGδ converges in probability to zero. This
is certainly the case if the class G = (F − F)2 is Glivenko-Cantelli. It can be shown that G = (F − F)2 , relative to the envelope (2F )2 , has a bounded uniform entropy integral. Thus
the class is Glivenko-Cantelli.

11.3 Donsker theorem for classes changing with sample size

The Donsker theorem we just stated involves a fixed class of functions F not depending on
n. As will become clear in the next example, it is sometimes useful to have similar results
118
This follows from the following simple fact: For i.i.d. random variables ξ1 , ξ2 , . . . the following three
statements are equivalent: (1) E[|ξ1 |] < ∞; (2) max1≤i≤n |ξi |/n → 0 almost surely; (3) E[max1≤i≤n |ξi |] =
o(n).

for classes of functions Fn which depend on the sample size n. Suppose that

Fn := {fn,t : t ∈ T }

is a sequence of function classes (indexed by T ) where each fn,t is a measurable function


from X to R. We want to treat the weak convergence of the stochastic processes Zn defined
as
Zn (t) = Gn fn,t , t∈T (159)

as elements of `∞ (T ). We know that weak convergence in `∞ (T ) is equivalent to marginal


convergence and asymptotic equi-continuity. The marginal convergence to a Gaussian pro-
cess follows under the conditions of the Lindeberg-Feller CLT119 . Sufficient conditions for
equi-continuity can be given in terms of the entropies of the classes Fn .
We will assume that there is a semi-metric ρ for the index set T for which (T, ρ) is
totally bounded, and such that

supρ(s,t)<δn P (fn,s − fn,t )2 → 0 for every δn → 0.    (160)

Suppose further that the classes Fn have envelope functions Fn satisfying the Lindeberg
condition

P [Fn2 ] = O(1), and P [Fn2 1{Fn > ε√n} ] → 0 for every ε > 0.    (161)

Theorem 11.7. Suppose that Fn = {fn,t : t ∈ T } is a class of measurable functions indexed


by (T, ρ) which is totally bounded. Suppose that (160) and (161) hold. If J[ ] (δn , Fn , L2 (P )) →
0 for every δn → 0, or J(δn , Fn , Fn ) → 0 for every δn → 0 and all the classes Fn are P -
measurable120 , then the processes {Zn (t) : t ∈ T } defined by (159) converge weakly to a tight
Gaussian process Z provided that the sequence of covariance functions

Kn (s, t) = P (fn,s fn,t ) − P (fn,s )P (fn,t )

converges pointwise on T × T . If K(s, t), s, t ∈ T , denotes the limit of the covariance


functions, then it is a covariance function and the limit process Z is a mean zero Gaussian
process with covariance function K.
119
Lindeberg-Feller CLT: For each n ∈ N, suppose that Wn,1 , . . . , Wn,kn are independent random vectors
with finite variances such that
∑_{i=1}^{kn} E[kWn,i k2 1{kWn,i k > ε}] → 0, for every ε > 0, and ∑_{i=1}^{kn} Cov(Wn,i ) → Σ.

Then the sequence ∑_{i=1}^{kn} (Wn,i − EWn,i ) converges in distribution to a normal N (0, Σ) distribution.
120
We will not define P -measurability formally; see [van der Vaart and Wellner, 1996, Chapter 2.3] for
more details. However if F is point-wise measurable then F is P -measurable for every P .

Proof. We only give the proof using the bracketing entropy integral condition. For each
δ > 0, applying Theorem 4.12 and using a similar idea as in (157) we obtain the bound
E∗ [ sups,t∈T ; P (fn,s −fn,t )2 <δ 2 |Gn (fn,s − fn,t )| ] ≲ J[ ] (δ, Gn , L2 (P )) + an (δ)−1 P [G2n 1{Gn > an (δ)√n}],

where Gn := Fn − Fn is the class of differences, Gn := 2Fn is its envelope, and an (δ) = δ/√( log N[ ] (δ, Gn , L2 (P )) ). Because J[ ] (δn , Fn , L2 (P )) →
0 for every δn → 0, we must have that J[ ] (δ, Fn , L2 (P )) = O(1) for every δ > 0 and hence
an (δ) is bounded away from 0. Then the second term in the preceding display converges
to zero for every fixed δ > 0, by the Lindeberg condition. The first term can be made
arbitrarily small as n → ∞ by choosing δ small, by assumption.

Example 11.8 (The Grenander estimator). Suppose that X1 , . . . , Xn are i.i.d. P on [0, ∞)
with a non-increasing density function f and c.d.f. F (which is known to be concave). We
want to estimate the unknown density f under the restriction that f is non-increasing.
Grenander [Grenander, 1956] showed that we can find a nonparametric maximum likeli-
hood estimator (NPMLE) fˆn in this problem, i.e., we can maximize the likelihood ni=1 g(Xi )
Q

over all non-increasing densities g on [0, ∞).


Let Fn be the empirical distribution function of the data. It can be shown that fˆn is unique and that fˆn is the left derivative of the least concave majorant (LCM) of Fn , i.e.,

fˆn = LCM′[Fn ];

see [Robertson et al., 1988, Chapter 7.2]. Also, fˆn can be computed easily using the pool
adjacent violators algorithm.
Suppose that x0 ∈ (0, ∞) is an interior point in the support of P . Does fˆn (x0 ) → f (x0 )?
Indeed, this holds. In fact, it can be shown that n1/3 (fˆn (x0 ) − f (x0 )) = Op (1).
Let us find the limiting distribution of ∆n := n1/3 (fˆn (x0 ) − f (x0 )). We will show that
if f 0 (x0 ) < 0, then
∆n = n1/3 (fˆn (x0 ) − f (x0 )) →d LCM′[Z](0),

where Z(s) := √(f (x0 )) W(s) + s2 f ′(x0 )/2 and W is a two-sided standard Brownian motion starting at 0.
Let us define the stochastic process
 
Zn (t) := n2/3 [ Fn (x0 + tn−1/3 ) − Fn (x0 ) − f (x0 ) t n−1/3 ], for t ≥ −x0 n−1/3 .

Observe that ∆n = LCM 0 [Zn ](0). Here we have used the fact that for any function m :
R → R and an affine function x 7→ a(x) := β0 + β1 x, LCM [m + a] = LCM [m] + a.
The idea is to show that
Zn →d Z in `∞ ([−K, K]), for any K > 0,    (162)

and then apply the continuous mapping principle to deduce ∆n →d LCM′[Z](0).
Actually, a rigorous proof of the convergence of ∆n involves a little more than an application of a continuous mapping theorem. The convergence Zn →d Z is only under the metric of uniform convergence on compacta. A concave majorant near the origin might be determined by values of the process a long way from the origin; thus the convergence Zn →d Z by itself does not imply the convergence LCM′[Zn ](0) →d LCM′[Z](0). However, we will not address this issue for the time being. An interested reader can see [Kim and Pollard, 1990, Assertion, Page 217] for a rigorous treatment of this issue. We will try to show that (162) holds.
holds.
Consider the class of functions

gθ (x) = 1(−∞,x0 +θ] (x) − 1(−∞,x0 ] (x) − f (x0 )θ,

where θ ∈ R. Note that

Zn (t) = n2/3 Pn [gtn−1/3 (X)]


= n2/3 (Pn − P )[gtn−1/3 (X)] + n2/3 P [gtn−1/3 (X)].

Let Fn := {fn,t := n1/6 gtn−1/3 : t ∈ R}. It can be shown, appealing to Theorem 11.7, that n2/3 (Pn − P )[gtn−1/3 (X)] = Gn [fn,t ] →d √(f (x0 )) W(t) in `∞ [−K, K], for every K > 0. Further, using a Taylor series expansion, we can see that

n2/3 P [gtn−1/3 (X)] = n2/3 [ F (x0 + tn−1/3 ) − F (x0 ) − n−1/3 t f (x0 ) ] → (t2 /2) f ′(x0 ),

uniformly on compacta. Combining these two facts we see that (162) holds.
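Since fˆn = LCM′[Fn ] is computable by pooling adjacent violators, Example 11.8 is easy to experiment with. The following sketch (the exponential model, the point x0 , the sample size and the number of replications are assumptions made only for illustration) computes the Grenander estimator from the least concave majorant of Fn and looks at the spread of n1/3 (fˆn (x0 ) − f (x0 )).

# Sketch: Grenander estimator as the left derivative of the LCM of F_n,
# and the n^{1/3} scaling of Example 11.8, for Exp(1) data and x0 = 1.
import numpy as np

def grenander(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    xs = np.concatenate(([0.0], x))      # knots of F_n (support starts at 0)
    ys = np.arange(n + 1) / n            # F_n at the knots
    hx, hy = [xs[0]], [ys[0]]
    for xi, yi in zip(xs[1:], ys[1:]):
        hx.append(xi); hy.append(yi)
        # pool adjacent violators: slopes of the concave majorant must be non-increasing
        while len(hx) >= 3 and \
              (hy[-1] - hy[-2]) * (hx[-2] - hx[-3]) > (hy[-2] - hy[-3]) * (hx[-1] - hx[-2]):
            del hx[-2], hy[-2]
    hx = np.array(hx)
    slopes = np.diff(hy) / np.diff(hx)   # fhat_n on each interval (hx[j-1], hx[j]]
    def fhat(t):
        j = int(np.clip(np.searchsorted(hx, t, side="left"), 1, len(hx) - 1))
        return slopes[j - 1]             # left derivative at t
    return fhat

rng = np.random.default_rng(1)
n, x0, reps = 2000, 1.0, 200
errs = [grenander(rng.exponential(size=n))(x0) - np.exp(-x0) for _ in range(reps)]
print("sd of n^(1/3)*(fhat(x0) - f(x0)):", n ** (1 / 3) * np.std(errs))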

12 Limiting distribution of M -estimators

Let X1 , . . . , Xn be i.i.d. P observations taking values in a space X . Let Θ denote a parameter


space (assumed to be a metric space with metric d(·, ·)) and, for each θ ∈ Θ, let mθ denote
a real-valued function on X . Consider the map
θ 7→ Mn (θ) := Pn [mθ (X)] ≡ (1/n) ∑_{i=1}^n mθ (Xi ),

and let θ̂n denote the maximizer of Mn (θ) over θ ∈ Θ, i.e.,

θ̂n = arg maxθ∈Θ Mn (θ).

Such a quantity θ̂n is called an M -estimator. We study the (limiting) distribution of M -


estimators (properly standardized) in this section.
The statistical properties of θ̂n depend crucially on the behavior of the criterion
function Mn (θ) as n → ∞. For example, we may ask: is θ̂n converging to some θ0 ∈ Θ, as
n → ∞? A natural way to tackle the question is as follows: We expect that for each θ ∈ Θ,
Mn (θ) will be close to its population version

M (θ) := P [mθ (X)], θ ∈ Θ.

Let
θ0 := arg maxθ∈Θ M (θ).

If Mn and M are uniformly close, then maybe their argmax's θ̂n and θ0 are also close. A key tool for studying such behavior of θ̂n is the argmax continuous mapping theorem which we consider next. Before we present the result in a general setup let us discuss the main idea behind the proof. For any given ε > 0, we have to bound the probability P(d(θ̂n , θ0 ) ≥ ε). The key step is to realize that

P(d(θ̂n , θ0 ) ≥ ε) ≤ P( supθ∈Θ:d(θ,θ0 )≥ε [Mn (θ) − Mn (θ0 )] > 0 )
≤ P( supθ∈Θ:d(θ,θ0 )≥ε [(Mn − M )(θ) − (Mn − M )(θ0 )] > − supd(θ,θ0 )≥ε [M (θ) − M (θ0 )] ).    (163)

The (uniform) closeness of Mn and M (cf. condition (3) in Theorem 12.1 below) shows that the left-hand side of the inequality inside the last probability in (163) must converge to 0 (in probability), whereas if M has a well-separated unique maximum121 (cf. condition (1) in Theorem 12.1) then the right-hand side of that inequality is a fixed positive number; together these show that P(d(θ̂n , θ0 ) ≥ ε) → 0 as n → ∞. This was carried out in Subsection 3.5.1 while discussing the consistency of M -estimators.
121
i.e., the function M (θ) should be strictly smaller than M (θ0 ) on the complement of every neighborhood
of the point θ0 .

12.1 Argmax continuous mapping theorems

We state our first argmax continuous mapping theorem below which generalizes the above
discussed setup (so that it can also be used to derive asymptotic distributions of the M -
estimator). Our first result essentially says that the argmax functional is continuous at
functions M that have a well-separated unique maximum.

Theorem 12.1. Let H be a metric space and let {Mn (h), h ∈ H} and {M (h), h ∈ H} be
stochastic processes indexed by H. Suppose the following conditions hold:

1. ĥ is a random element of H which satisfies

M (ĥ) > suph∉G M (h) a.s.,

for every open set G containing ĥ; i.e., M has a unique “well-separated” point of
maximum.

2. For each n, let ĥn ∈ H satisfy

Mn (ĥn ) ≥ suph∈H Mn (h) − oP (1).

3. Mn →d M in `∞ (H).

Then ĥn →d ĥ in H.

Proof. By the Portmanteau theorem 10.5, to prove ĥn →d ĥ it suffices to show that

lim supn→∞ P∗ {ĥn ∈ F } ≤ P{ĥ ∈ F }    (164)

for every closed subset F of H. Fix a closed set F and note that

{ĥn ∈ F } ⊆ { suph∈F Mn (h) ≥ suph∈H Mn (h) − oP (1) }.

Therefore,

P∗( ĥn ∈ F ) ≤ P∗( suph∈F Mn (h) − suph∈H Mn (h) + oP (1) ≥ 0 ).

The map suph∈F Mn (h) − suph∈H Mn (h) converges in distribution to suph∈F M (h) − suph∈H M (h), as Mn →d M in `∞ (H), by the continuous mapping theorem. We thus have

lim supn→∞ P∗( ĥn ∈ F ) ≤ P( suph∈F M (h) ≥ suph∈H M (h) ),

where we have again used the Portmanteau theorem. The first assumption of the theorem
implies that {suph∈F M (h) ≥ suph∈H M (h)} ⊆ {ĥ ∈ F } (note that F c is open). This
proves (164).

The idea behind the proof of the above theorem can be used to prove the following
stronger technical lemma.

Lemma 12.2. Let H be a metric space and let {Mn (h) : h ∈ H} and {M (h) : h ∈ H}
be stochastic processes indexed by H. Let A and B be arbitrary subsets of H. Suppose the
following conditions hold:

1. ĥ is a random element of H which satisfies M (ĥ) > suph∈A∩Gc M (h) almost surely
for every open set G containing ĥ.

2. For each n, let ĥn ∈ H be such that Mn (ĥn ) ≥ suph∈H Mn (h) − oP (1).
d
→ M in `∞ (A ∪ B).
3. Mn −

Then

lim supn→∞ P∗( ĥn ∈ F ∩ A ) ≤ P( ĥ ∈ F ) + P( ĥ ∈ B c )    (165)

for every closed set F .

Observe that Theorem 12.1 is a special case of this lemma which corresponds to A =
B = H.

Proof of Lemma 12.2. The proof is very similar to that of Theorem 12.1. Observe first that
{ ĥn ∈ F ∩ A } ⊆ { suph∈F ∩A Mn (h) − suph∈B Mn (h) + oP (1) ≥ 0 }.

The term suph∈F ∩A Mn (h) − suph∈B Mn (h) + oP (1) converges in distribution to suph∈F ∩A M (h) − suph∈B M (h) because Mn →d M in `∞ (A ∪ B). This therefore gives

lim supn→∞ P∗( ĥn ∈ F ∩ A ) ≤ P( suph∈F ∩A M (h) − suph∈B M (h) ≥ 0 ).

Now if the event {suph∈F ∩A M (h) ≥ suph∈B M (h)} holds and if ĥ ∈ B, then suph∈F ∩A M (h) ≥
M (ĥ) which can only happen if ĥ ∈ F . This means
 
P( suph∈F ∩A M (h) − suph∈B M (h) ≥ 0 ) ≤ P(ĥ ∈ B c ) + P(ĥ ∈ F ),

which completes the proof.

We next prove a more applicable argmax continuous mapping theorem. The assumption that Mn →d M in `∞ (H) is too stringent. It is much more reasonable to assume that Mn →d M in `∞ (K) for every compact subset K of H. The next theorem proves that ĥn converges in law to ĥ under this weaker assumption.
As we will be restricting analysis to compact sets in the next theorem, we need to
assume that ĥn and ĥ lie in compact sets with arbitrarily large probability. This condition,
made precise below, will be referred to as the tightness condition:

For every ε > 0, there exists a compact set K ⊆ H such that

lim supn→∞ P∗( ĥn ∉ K ) ≤ ε and P( ĥ ∉ K ) ≤ ε.    (166)

Theorem 12.3 (Argmax continuous mapping theorem). Let H be a metric space and let
{Mn (h) : h ∈ H} and {M (h) : h ∈ H} be stochastic processes indexed by H. Suppose that
the following conditions hold:
1. Mn →d M in `∞ (K) for every compact subset K of H.

2. Almost all sample paths h 7→ M (h) are upper semicontinuous122 (u.s.c.) and possess
a unique maximum at a random point ĥ.

3. For each n, let ĥn be a random element of H such that Mn (ĥn ) ≥ suph∈H Mn (h) −
oP (1).

4. The tightness condition (166) holds.


Then ĥn →d ĥ in H.

Proof. Let K be an arbitrary compact subset of H. We first claim that

M (ĥ) > suph∈K∩Gc M (h)

for every open set G containing ĥ. Suppose, for the sake of contradiction, that M (ĥ) =
suph∈K∩Gc M (h) for some open set G containing ĥ. In that case, there exist hm ∈ K ∩ Gc
with M (hm ) → M (ĥ) as m → ∞. Because K ∩ Gc (intersection of a closed set with a compact set) is compact, a subsequence of {hm } converges which means that we can assume, without loss of generality, that hm → h for some h ∈ K ∩ Gc . By the u.s.c. hypothesis, this implies that lim supm→∞ M (hm ) ≤ M (h), which is the same as M (ĥ) ≤ M (h). This implies
that ĥ is not a unique maximum (as ĥ ∈ G and h ∈ Gc , we note that ĥ 6= h). This proves
the claim.
We now use Lemma 12.2 with A = B = K (note that Mn →d M in `∞ (A ∪ B) = `∞ (K)). This gives that for every closed set F , we have

lim supn→∞ P∗( ĥn ∈ F ) ≤ lim supn→∞ P∗( ĥn ∈ F ∩ K ) + lim supn→∞ P∗( ĥn ∈ K c )
≤ P( ĥ ∈ F ) + P( ĥ ∈ K c ) + lim supn→∞ P∗( ĥn ∈ K c ).

The term on the right hand side above can be made smaller than P( ĥ ∈ F ) + ε for every ε > 0 by choosing K appropriately (using tightness). An application of the Portmanteau
theorem now completes the proof.
122
Recall the definition of upper semicontinuity: f is u.s.c. at x0 if lim supn→∞ f (xn ) ≤ f (x0 ) whenever
xn → x0 as n → ∞.

As a simple consequence of Theorems 12.1 and 12.3, we can prove the following theorem which is useful for checking consistency of M -estimators. Note that Mn →d M for a deterministic process M is equivalent to Mn →P M ; this latter statement is equivalent to suph∈H |Mn (h) − M (h)| converging to 0 in probability.

Theorem 12.4 (Consistency Theorem). Let Θ be a metric space. For each n ≥ 1, let
{Mn (θ) : θ ∈ Θ} be a stochastic process. Also let {M (θ) : θ ∈ Θ} be a deterministic process.

1. Suppose supθ∈Θ |Mn (θ) − M (θ)| →P 0 as n → ∞. Also suppose the existence of θ0 ∈ Θ such that M (θ0 ) > supθ∉G M (θ) for every open set G containing θ0 . Then any sequence of M -estimators θ̂n (assuming that Mn (θ̂n ) ≥ supθ∈Θ Mn (θ) − oP (1) is enough) converges in probability to θ0 .
2. Suppose supθ∈K |Mn (θ) − M (θ)| →P 0 as n → ∞ for every compact subset K of Θ.
Suppose also that the deterministic limit process M is upper semicontinuous and has
a unique maximum at θ0 . If {θ̂n } is tight, then θ̂n converges to θ0 in probability.

Remark 12.1. For M -estimators, we can apply the above theorem with Mn (θ) := ∑_{i=1}^n mθ (Xi )/n and M (θ) := P [mθ ]. In this case, the condition supθ∈K |Mn (θ) − M (θ)| →P 0 is equivalent to {mθ : θ ∈ K} being P -Glivenko-Cantelli.

Theorem 12.3 can also be used to prove asymptotic distribution results for M -estimators,
as illustrated in the following examples.

12.2 Asymptotic distribution

In this section we present one result that gives the asymptotic distribution of M -estimators
for the case of i.i.d. observations. The formulation is from [van der Vaart, 1998]. The limit

distribution of the sequence n(θ̂n − θ0 ) follows from the following theorem, where θ̂n is
an M -estimator of the finite dimensional parameter θ0 (i.e., θ̂n := arg maxθ∈Θ Mn (θ) where
Mn (θ) = Pn [mθ (X)]).

Example 12.5 (Parametric maximum likelihood estimators). Suppose X1 , . . . , Xn are i.i.d. from
an unknown density pθ0 belonging to a known class {pθ : θ ∈ Θ ⊆ Rk }. Let θ̂n denote the
maximum likelihood estimator of θ0 . A classical result is that, under some smoothness as-

sumptions, √n(θ̂n − θ0 ) converges in distribution to Nk (0, I −1 (θ0 )) where I(θ0 ) denotes the
Fisher information matrix.
This result can be derived from the argmax continuous mapping theorem. The first step
is to observe that if θ 7→ pθ (x) is sufficiently smooth at θ0 , then, for any h ∈ Rk ,
∑_{i=1}^n log [ pθ0 +hn−1/2 (Xi ) / pθ0 (Xi ) ] = h⊤ (1/√n) ∑_{i=1}^n `˙θ0 (Xi ) − (1/2) h⊤ I(θ0 ) h + oPθ0 (1)    (167)

where `˙θ0 (x) := ∇θ log pθ (x) denotes the score function. Condition (167) is known as the
LAN (local asymptotic normality) condition. We shall prove the asymptotic normality of θ̂n
assuming the marginal convergence of (167) (for every fixed h) can be suitably strengthened
to a process level result in `∞ (K), for K ⊂ Rk compact. We apply the argmax continuous
mapping theorem (Theorem 12.3) with H = Rk ,
Mn (h) := ∑_{i=1}^n log [ pθ0 +hn−1/2 (Xi ) / pθ0 (Xi ) ]   and   M (h) := h⊤ ∆ − (1/2) h⊤ I(θ0 )h,

where ∆ ∼ Nk (0, I(θ0 )). Then ĥn = √n(θ̂n − θ0 ) and ĥ ∼ Nk (0, I −1 (θ0 )). The argmax
theorem will then imply the result provided the conditions of the argmax theorem hold. The
main condition is tightness of {ĥn } which means that the rate of convergence of θ̂n to θ0 is
n−1/2 .

The above idea can be easily extended to derive the asymptotic distributions of other √n-consistent estimators, e.g., non-linear regression, robust regression, etc. (see [van der Vaart, 1998,
Chapter 5] for more details).
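As a quick numerical companion to Example 12.5 (a sketch; the exponential model, θ0 and the simulation sizes are assumptions made only for illustration): for the model pθ (x) = θe−θx the MLE is θ̂n = 1/X̄n and I(θ) = 1/θ2 , so √n(θ̂n − θ0 ) should be approximately N (0, θ02 ).

# Sketch for Example 12.5: sqrt(n)*(MLE - theta0) is approximately N(0, I(theta0)^{-1})
# in the Exp(theta) model, where the MLE is 1/mean(X) and I(theta) = 1/theta^2.
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 5000
samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
z = np.sqrt(n) * (1.0 / samples.mean(axis=1) - theta0)
print("simulated variance:", z.var(), "  I(theta0)^{-1} =", theta0 ** 2)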

Theorem 12.6. Suppose that x 7→ mθ (x) is a measurable function for each θ ∈ Θ ⊂ Rd


for an open set Θ, that θ 7→ mθ (x) is differentiable at θ0 ∈ Θ for P -almost every x with
derivative ṁθ0 (x), and that

|mθ1 (x) − mθ2 (x)| ≤ F (x)kθ1 − θ2 k (168)

holds for all θ1 , θ2 in a neighborhood of θ0 , where F ∈ L2 (P ). Also suppose that M (θ) =


P [mθ ] has a second order Taylor expansion
P [mθ ] − P [mθ0 ] = (1/2) (θ − θ0 )⊤ V (θ − θ0 ) + o(kθ − θ0 k2 ),
where θ0 is a point of maximum of M and V is symmetric and nonsingular (negative definite since M has a maximum at θ0 ). If Mn (θ̂n ) ≥ supθ Mn (θ) − oP (n−1 ) and θ̂n →P θ0 , then

√n(θ̂n − θ0 ) = −V −1 Gn (ṁθ0 ) + oP (1) →d Nd (0, V −1 P [ṁθ0 ṁθ0⊤ ] V −1 ),

where Gn := √n(Pn − P ).

Proof. We will show that

M̃n (h) := nPn (mθ0 +hn−1/2 − mθ0 ) →d h⊤ G(ṁθ0 ) + (1/2) h⊤ V h =: M(h) in `∞ ({h : khk ≤ K})
for every K > 0. Then the conclusion follows from the argmax continuous mapping theorem (Theorem 12.3) upon noticing that

ĥ = argmaxh M(h) = −V −1 G(ṁθ0 ) ∼ Nd (0, V −1 P (ṁθ0 ṁθ0⊤ ) V −1 ).

Now, observe that
nPn (mθ0 +hn−1/2 − mθ0 ) = √n(Pn − P )[√n(mθ0 +hn−1/2 − mθ0 )] + nP (mθ0 +hn−1/2 − mθ0 ).

By the second order Taylor expansion of M (θ) := P [mθ ] about θ0 , the second term of the right side of the last display converges to (1/2)h⊤ V h uniformly for khk ≤ K. To handle the first term we use the Donsker theorem for classes changing with the sample size (Theorem 11.7). The classes

Fn := { √n(mθ0 +hn−1/2 − mθ0 ) : khk ≤ K }

have envelopes Fn = F = ṁθ0 for all n, and since ṁθ0 ∈ L2 (P ) the Lindeberg condition is
satisfied easily. Furthermore, with
fn,g = √n(mθ0 +gn−1/2 − mθ0 ),   fn,h = √n(mθ0 +hn−1/2 − mθ0 ),

by the dominated convergence theorem the covariance functions satisfy

P (fn,g fn,h ) − P (fn,g )P (fn,h ) → P (g⊤ ṁθ0 ṁθ0⊤ h) = g⊤ E[G(ṁθ0 ) G(ṁθ0 )⊤ ] h.

Finally, the bracketing entropy condition holds since, by way of the same entropy calculations used earlier for classes that are Lipschitz in the parameter, we have

N[ ] (2εkF kP,2 , Fn , L2 (P )) ≤ (CK/ε)^d ,   i.e.,   N[ ] (ε, Fn , L2 (P )) ≲ (CKkF kP,2 /ε)^d .

Thus, J[ ] (δ, Fn , L2 (P )) ≲ ∫_0^δ √( d log(CK/ε) ) dε, and hence the bracketing entropy hypothesis of the Donsker theorem holds. We conclude that the first term converges weakly to h⊤ G(ṁθ0 ) in `∞ ({h : khk ≤ K}), and the desired result holds.

12.3 A non-standard example

Example 12.7 (Analysis of the shorth). Recall the setup of Example 5.4. Suppose that
X1 , . . . , Xn are i.i.d. P on R with density p with respect to the Lebesgue measure. Let FX be
the distribution function of X. Suppose that p is a unimodal symmetric density with mode
θ0 (with p0 (x) > 0 for x < θ0 and p0 (x) < 0 for x > θ0 ). We want to estimate θ0 .
Let
M(θ) := P [mθ ] = P(|X − θ| ≤ 1) = FX (θ + 1) − FX (θ − 1)
where mθ (x) = 1[θ−1,θ+1] (x). We can show that θ0 = argmaxθ∈R M(θ).
We can estimate θ0 by

θ̂n := argmax Mn (θ), where Mn (θ) = Pn mθ .


θ∈R

We have already seen that (in Example 5.4) τn := n1/3 (θ̂n − θ0 ) = Op (1). Let us here give
a sketch of the limiting distribution of (the normalized version of ) θ̂n . Observe that

τn = argmaxh∈R Mn (θ0 + hn−1/3 ) = argmaxh∈R n2/3 [Mn (θ0 + hn−1/3 ) − Mn (θ0 )].

The plan is to show that the localized (and properly normalized) stochastic process M̃n (h) :=
n2/3 [Mn (θ0 +hn−1/3 )−Mn (θ0 )] converges in distribution to “something” so that we can apply
the argmax continuous mapping theorem (Theorem 12.3) to deduce the limiting behavior of
τn . Notice that,

M̃n (h) := n2/3 Pn [mθ0 +hn−1/3 − mθ0 ]


= n2/3 (Pn − P )[mθ0 +hn−1/3 − mθ0 ] + n2/3 P [mθ0 +hn−1/3 − mθ0 ],

where the second term is


n2/3 [M(θ0 + hn−1/3 ) − M(θ0 )] = n2/3 M′(θ0 ) h n−1/3 + (1/2) n2/3 M″(θ∗ ) h2 n−2/3
→ (1/2) M″(θ0 ) h2 = (1/2) [p′(θ0 + 1) − p′(θ0 − 1)] h2 ,
uniformly in |h| ≤ K, for any constant K. Note that, as M is differentiable, M′(θ0 ) =
p(θ0 + 1) − p(θ0 − 1) = 0. Thus, we want to study the empirical process Gn indexed by the
collection of functions Fn := {n1/6 (mθ0 +hn−1/3 − mθ0 ) : |h| ≤ K}. Here we can apply a
Donsker theorem for a family of functions depending on n, for example Theorem 11.7. Thus
we need to check that (160) and (161) hold. Observe that

P [(fn,s − fn,t )2 ] − [P (fn,s − fn,t )]2
= n1/3 P [(mθ0 +sn−1/3 − mθ0 +tn−1/3 )2 ] − o(1)
= n1/3 { P [1[θ0 −1+sn−1/3 , θ0 −1+tn−1/3 ] ] + P [1[θ0 +1+sn−1/3 , θ0 +1+tn−1/3 ] ] } + o(1)   if t > s
→ [p(θ0 − 1) + p(θ0 + 1)] |s − t|.

Thus, we can conclude that


n2/3 (Pn − P )[mθ0 +hn−1/3 − mθ0 ] →d aZ(h),

where a2 := p(θ0 + 1) + p(θ0 − 1) and Z is a standard two-sided Brownian motion process


starting from 0 (show this!). We can now use the argmax continuous mapping theorem to
conclude that
τn = n1/3 (θ̂n − θ0 ) →d argmaxh [aZ(h) − bh2 ],

where b := −M″(θ0 )/2.
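The cube-root asymptotics of Example 12.7 can also be seen in a small simulation (a sketch; the N (0, 1) model, so θ0 = 0, and the simulation sizes are assumptions made only for illustration). Since Mn (θ) is piecewise constant in θ and can only change at the points Xi ± 1, it suffices to maximize over those candidates.

# Sketch for Example 12.7: theta_hat maximizes P_n[theta-1, theta+1]; for N(0,1)
# data theta0 = 0 and n^{1/3}(theta_hat - theta0) stays bounded in probability.
import numpy as np

def shorth_center(x):
    xs = np.sort(x)
    cand = np.concatenate((xs - 1.0, xs + 1.0))            # M_n can only jump here
    counts = (np.searchsorted(xs, cand + 1.0, side="right")
              - np.searchsorted(xs, cand - 1.0, side="left"))
    return cand[np.argmax(counts)]

rng = np.random.default_rng(3)
reps = 500
for n in (200, 2000, 20000):
    err = np.array([shorth_center(rng.standard_normal(n)) for _ in range(reps)])
    print(n, "n^(1/3) * mean|theta_hat - theta0|:", n ** (1 / 3) * np.mean(np.abs(err)))

The scaled error stays of the same order as n grows, in line with the n1/3 rate.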

13 Concentration Inequalities

We are interested in bounding the random fluctuations of (complicated) functions of many


independent random variables. Let X1 , . . . , Xn be independent random variables taking
values in X . Let f : X n → R, and let

Z = f (X1 , . . . , Xn )

be the random variable of interest (e.g., Z = ∑_{i=1}^n Xi ). In this section we try to understand:

(a) under what conditions does Z concentrate around its mean EZ? (b) how large are typical
deviations of Z from EZ? In particular, we seek upper bounds for

P(Z > EZ + t) and P(Z < EZ − t) for t > 0.

Various approaches have been used over the years to tackle such questions, including mar-
tingale methods, information theoretic methods, logarithmic Sobolev inequalities, etc. We
have already seen many concentration inequalities in this course, e.g., Hoeffding’s inequality,
Bernstein’s inequality, Talagrand’s concentration inequality, etc.
Let X1 , . . . , Xn be independent random variables taking values in X . Let f : X n → R
and let
Z = f (X1 , . . . , Xn )

be the random variable of interest. Recall the martingale representation Z − EZ = ∑_{i=1}^n ∆i and the notation introduced in Section 3.6. Note that

Var(Z) = E[ ( ∑_{i=1}^n ∆i )2 ] = ∑_{i=1}^n E(∆2i ) + 2 ∑_{i=1}^n ∑_{j>i} E(∆i ∆j ).

Now if j > i, Ei ∆j = Ei (Ej Z) − Ei (Ej−1 Z) = Ei (Z) − Ei (Z) = 0, so

Ei (∆j ∆i ) = ∆i Ei (∆j ) = 0.

Thus, we obtain that


Var(Z) = E[ ( ∑_{i=1}^n ∆i )2 ] = ∑_{i=1}^n E(∆2i ).    (169)

13.1 Efron-Stein inequality

Until now we have not made any use of the fact that Z is a function of independent variables.
Indeed,

Ei Z = ∫_{X^{n−i}} f (X1 , . . . , Xi , xi+1 , . . . , xn ) dµi+1 (xi+1 ) · · · dµn (xn ),
where, for every j = 1, . . . , n, µj denotes the probability distribution of Xj .

Let E(i) (Z) denote the expectation of Z with respect to the i-th variable Xi only, fixing
the values of the other variables, i.e.,
E(i) (Z) = ∫_X f (X1 , . . . , Xi−1 , xi , Xi+1 , . . . , Xn ) dµi (xi ).

Using Fubini’s theorem,


Ei (E(i) Z) = Ei−1 Z. (170)

Theorem 13.1 (Efron-Stein inequality). With the notation above,

Var(Z) ≤ ∑_{i=1}^n E[ (Z − E(i) (Z))2 ] = ∑_{i=1}^n E[ Var(i) (Z) ] =: v.    (171)

Moreover, if X1′ , . . . , Xn′ are independent copies of X1 , . . . , Xn , and if we define, for every i = 1, . . . , n,

Zi′ = f (X1 , . . . , Xi−1 , Xi′ , Xi+1 , . . . , Xn ),    (172)

then
v = (1/2) ∑_{i=1}^n E[(Z − Zi′ )2 ].

Also, for every i, letting

Zi := fi (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )

for an arbitrary measurable function fi , we have

v = inf_{Zi} ∑_{i=1}^n E[(Z − Zi )2 ].

Proof. Using (170) we may write ∆i = Ei (Z − E(i) Z). By Jensen’s inequality, used condi-
tionally, ∆2i ≤ Ei [(Z − E(i) Z)2 ]. Now (169) yields the desired bounds.
To see the second claim, we use (conditionally) the fact that if X and X ′ are i.i.d. real valued random variables, then Var(X) = E[(X − X ′ )2 ]/2. Since conditionally on X (i) := (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ), Zi′ is an independent copy of Z, we may write

Var(i) (Z) = (1/2) E(i) [(Z − Zi′ )2 ],

where we have used the fact that the conditional distributions of Z and Zi′ are identical.
The last identity is obtained by recalling that, for any real-valued random variable X, Var(X) = infa E[(X − a)2 ]. Using this fact conditionally, we have, for every i = 1, . . . , n,

Var(i) (Z) = inf_{Zi} E(i) [(Z − Zi )2 ].

Observe that in the case when Z = ∑_{i=1}^n Xi , the Efron-Stein inequality becomes an equality. Thus, the bound in the Efron-Stein inequality is, in a sense, not improvable.
It is easy to see that if f has the bounded difference property (see (24)) with constants c1 , . . . , cn , then

Var(Z) ≤ (1/2) ∑_{i=1}^n c2i .

Example 13.2 (Kernel density estimation). Recall Example 3.25. Let X1 , . . . , Xn be i.i.d. from a distribution P on R (the argument can be easily generalized to Rd ) with density φ. The kernel density estimator (KDE) of φ is φ̂n : R → [0, ∞) defined as

φ̂n (x) = (1/(nhn )) ∑_{i=1}^n K( (x − Xi )/hn ),

where hn > 0 is the smoothing bandwidth and K is a non-negative kernel (i.e., K ≥ 0 and ∫ K(x)dx = 1). The L1 -error of the estimator φ̂n is Z := f (X1 , . . . , Xn ) := ∫ |φ̂n (x) − φ(x)| dx. We have shown in (26) that f satisfies (24) with ci = 2/n. Thus the difference Z − Zi′ , deterministically, is upper bounded by 2/n, for all i. Thus, an application of the Efron-Stein inequality gives

Var(Z) ≤ (n/2)(2/n)2 = 2/n.
2 n n

It is known that for every φ, √n E(Zn ) → ∞ (we write Zn instead of Z to emphasize the dependence on n), which implies, by Chebyshev's inequality, for every ε > 0,

P( |Zn /E(Zn ) − 1| ≥ ε ) = P( |Zn − E(Zn )| ≥ ε E(Zn ) ) ≤ Var(Zn ) / ( ε2 [E(Zn )]2 ) → 0

as n → ∞. Thus, Zn /E(Zn ) →P 1, or in other words, Zn is relatively stable. This means that the random L1 -error essentially behaves like its expected value.
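This relative stability is easy to observe numerically. The sketch below (standard normal data, a Gaussian kernel, bandwidth hn = n−1/5 and a fixed integration grid are my own choices, not from the text) computes the L1 -error Zn over repeated samples; its standard deviation stays below the Efron-Stein bound √(2/n), while its mean is much larger.

# Sketch for Example 13.2: concentration of the L1-error of a KDE.
# Var(Z_n) <= 2/n by Efron-Stein, so sd(Z_n) <= sqrt(2/n), while E(Z_n) is much larger.
import numpy as np

def l1_error(x, h, grid):
    # Gaussian-kernel KDE on a grid, compared with the true N(0,1) density.
    kde = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    kde = kde.sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
    truth = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
    return np.trapz(np.abs(kde - truth), grid)

rng = np.random.default_rng(4)
n, reps = 500, 300
grid = np.linspace(-5.0, 5.0, 400)
h = n ** (-1 / 5)
z = np.array([l1_error(rng.standard_normal(n), h, grid) for _ in range(reps)])
print("mean:", z.mean(), " sd:", z.std(), " Efron-Stein bound on sd:", np.sqrt(2 / n))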

Example 13.3. Let A be a collection of subsets of X , and let X_1, . . . , X_n be n i.i.d. random points in X with distribution P. Let Z = sup_{A∈A} |P_n(A) − P(A)|. By the Efron-Stein inequality (show this),

Var(Z) ≤ 2/n,

regardless of the richness of the collection of sets A and the distribution P.

Recall that we can bound E(Z) using empirical process techniques. Let P_n′(A) = (1/n) ∑_{i=1}^n 1_A(X_i′), where X_1′, . . . , X_n′ (sometimes called a “ghost” sample) are independent copies of X_1, . . . , X_n. Let E′ denote the expectation only with respect to X_1′, . . . , X_n′. Then

E sup_{A∈A} |P_n(A) − P(A)| = E sup_{A∈A} |E′[P_n(A) − P_n′(A)]|
    ≤ E sup_{A∈A} |P_n(A) − P_n′(A)| = (1/n) E sup_{A∈A} |∑_{i=1}^n (1_A(X_i) − 1_A(X_i′))|,

where we have used Jensen’s inequality. Next we use symmetrization: if ε_1, . . . , ε_n are independent Rademacher variables, then the last term in the previous display can be bounded from above by (Exercise)

(1/n) E sup_{A∈A} |∑_{i=1}^n (1_A(X_i) − 1_A(X_i′))| ≤ (2/n) E sup_{A∈A} |∑_{i=1}^n ε_i 1_A(X_i)|.

Note that ∑_{i=1}^n ε_i 1_A(X_i) can be thought of as the sample covariance between the ε_i’s and the 1_A(X_i)’s. Letting

R_n = (1/n) E_ε sup_{A∈A} |∑_{i=1}^n ε_i 1_A(X_i)|,

where E_ε denotes the expectation with respect to the Rademacher variables, we see that

E sup_{A∈A} |P_n(A) − P(A)| ≤ 2 E R_n.
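To make the symmetrization bound concrete, the sketch below approximates both E sup_{A∈A} |P_n(A) − P(A)| and 2 E R_n by Monte Carlo for the class of half-lines A = {(−∞, t] : t ∈ R} and uniform data; the class, the distribution, and the sample sizes are assumptions chosen for the demonstration.

```python
import numpy as np

# Check E sup_A |P_n(A) - P(A)| <= 2 E R_n for half-lines and Uniform(0,1) data.
rng = np.random.default_rng(2)
n, reps = 100, 2_000

sup_dev, rad = [], []
idx = np.arange(1, n + 1)
for _ in range(reps):
    X = rng.uniform(size=n)
    Xs = np.sort(X)
    # sup over half-lines of |P_n(A) - P(A)| is the Kolmogorov-Smirnov statistic
    sup_dev.append(max(np.max(idx / n - Xs), np.max(Xs - (idx - 1) / n)))
    # empirical Rademacher average: sup over half-lines of |sum_i eps_i 1{X_i <= t}| / n
    # depends only on partial sums of the eps taken in the sorted order of the X_i
    eps = rng.choice([-1.0, 1.0], size=n)
    csum = np.concatenate(([0.0], np.cumsum(eps[np.argsort(X)])))
    rad.append(np.max(np.abs(csum)) / n)

print(f"E sup_A |P_n(A) - P(A)| ≈ {np.mean(sup_dev):.4f}")
print(f"2 E R_n                 ≈ {2 * np.mean(rad):.4f}")
```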

Observe that R_n is a data-dependent quantity, and does not involve the probability measure P. We want to show that R_n is concentrated around its mean E R_n. Defining

R_n^{(i)} = (1/n) E_ε sup_{A∈A} |∑_{j≠i} ε_j 1_A(X_j)|,

one can show that (Exercise)

0 ≤ n(R_n − R_n^{(i)}) ≤ 1, and ∑_{i=1}^n n(R_n − R_n^{(i)}) ≤ n R_n.        (173)

By the Efron-Stein inequality,

Var(nR_n) ≤ E[∑_{i=1}^n (n(R_n − R_n^{(i)}))²] ≤ E(nR_n).

Random variables with the property (173) are called self-bounding. These random variables have their variances bounded by their means and thus are automatically concentrated.
Recall that ∆_n(A, X_n) is the number of different sets of the form {X_1, . . . , X_n} ∩ A where A ∈ A; then R_n is the maximum of ∆_n(A, X_n) sub-Gaussian random variables. By the maximal inequality,

R_n ≤ √(log ∆_n(A, X_n) / (2n)).

Let V = V(X_n, A) be the size of the largest subset of {X_1, . . . , X_n} shattered by A. Note that V is a random variable. In fact, V is self-bounding. Let V^{(i)} be defined similarly with all the points excluding the i-th point. Then (Exercise), for every 1 ≤ i ≤ n,

0 ≤ V − V^{(i)} ≤ 1 and ∑_{i=1}^n (V − V^{(i)}) ≤ V.

Thus, ∑_{i=1}^n (V − V^{(i)})² ≤ V, and so by the Efron-Stein inequality, Var(V) ≤ E V.

13.2 Concentration and logarithmic Sobolev inequalities

We start with a brief summary of some basic properties of the (Shannon) entropy of a random variable. For simplicity, we will only consider discrete-valued random variables.

Definition 13.4 (Shannon entropy). Let X be a random variable taking values in a countable set X with probability mass function (p.m.f.) p(x) = P(X = x), x ∈ X . The Shannon entropy (or just entropy) of X is defined as

H(X) := E[− log p(X)] = − ∑_{x∈X} p(x) log p(x)        (174)

where log denotes the natural logarithm and 0 log 0 := 0. The entropy can be thought of as the “uncertainty” in the random variable. Observe that the entropy is obviously nonnegative.

If X and Y are a pair of discrete random variables taking values in X × Y, then the joint entropy H(X, Y) of X and Y is defined as the entropy of the pair (X, Y).

Definition 13.5 (Kullback-Leibler divergence). Let P and Q be two probability distributions with p.m.f.’s p and q. Then the Kullback-Leibler divergence or relative entropy of P and Q is

D(P‖Q) = ∑_{x∈X} p(x) log(p(x)/q(x))

if P is absolutely continuous with respect to Q, and infinite otherwise.

It can be shown that D(P‖Q) ≥ 0, and D(P‖Q) = 0 if and only if P = Q. This follows from the fact that, if P is absolutely continuous with respect to Q, then, since log x ≤ x − 1 for all x > 0,

D(P‖Q) = − ∑_{x∈X : p(x)>0} p(x) log(q(x)/p(x)) ≥ − ∑_{x∈X : p(x)>0} p(x) (q(x)/p(x) − 1) ≥ 0.

Observe that if X ∼ P takes N values, then by taking Q to be the uniform distribution on those N values,

D(P‖Q) = log N − H(X), and H(X) ≤ log N (as D(P‖Q) ≥ 0).
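A two-line numerical check of this identity is given below; the randomly generated p.m.f. is an assumption made only for illustration.

```python
import numpy as np

# Check D(P||Q) = log N - H(X) when Q is uniform on the support of P.
rng = np.random.default_rng(7)
N = 6
p = rng.random(N)
p /= p.sum()                               # a random p.m.f. on N points
H = -np.sum(p * np.log(p))                 # Shannon entropy
D = np.sum(p * np.log(p * N))              # KL divergence to the uniform p.m.f. 1/N
print(f"H(X) = {H:.4f} <= log N = {np.log(N):.4f}")
print(f"D(P||Q) = {D:.4f},  log N - H(X) = {np.log(N) - H:.4f}")
```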

Definition 13.6 (Conditional entropy). Consider a pair of random variables (X, Y ). The
conditional entropy H(X|Y ) is defined as

H(X|Y ) = H(X, Y ) − H(Y ).

Observe that if we write p(x, y) = P(X = x, Y = y) and p(x|y) = P(X = x|Y = y), then

H(X|Y) = − ∑_{x∈X, y∈Y} p(x, y) log p(x|y) = E[− log p(X|Y)].

It is easy to see that H(X|Y ) ≥ 0.

Suppose that (X, Y) ∼ P_{X,Y}, X ∼ P_X and Y ∼ P_Y. Noting that D(P_{X,Y}‖P_X ⊗ P_Y) = H(X) − H(X|Y), the nonnegativity of the relative entropy implies

H(X|Y) ≤ H(X).        (175)

The chain rule for entropy says that for random variables X1 , . . . , Xn ,

H(X1 , . . . , Xn ) = H(X1 ) + H(X2 |X1 ) + . . . + H(Xn |X1 , . . . , Xn−1 ). (176)

Let X = (X_1, . . . , X_n) be a vector of n random variables (not necessarily independent) and let X^{(i)} = (X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n) (the vector obtained by leaving out X_i). We have the following result.

Theorem 13.7 (Han’s inequality). Let X_1, . . . , X_n be discrete random variables. Then,

H(X_1, . . . , X_n) ≤ (1/(n−1)) ∑_{i=1}^n H(X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n).        (177)

Proof. Observe that

H(X1 , . . . , Xn ) = H(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + H(Xi |X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )


≤ H(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + H(Xi |X1 , . . . , Xi−1 )

where we have used (175) conditionally. Now, by summing over all i and using (176), we
get the desired result.
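Since Han’s inequality involves only marginal entropies, it is easy to verify numerically. The sketch below draws a random joint p.m.f. on {0, 1}³ and compares the two sides of (177); the random p.m.f. is an assumption made purely for illustration.

```python
import numpy as np

# Numerical check of Han's inequality (177) for a random joint p.m.f. on {0,1}^3.
rng = np.random.default_rng(3)
n = 3
p = rng.random(2 ** n)
p /= p.sum()
p = p.reshape((2,) * n)                    # joint p.m.f. of (X_1, ..., X_n)

def H(q):
    q = q.ravel()
    q = q[q > 0]
    return -np.sum(q * np.log(q))

H_full = H(p)
H_leave_one_out = [H(p.sum(axis=i)) for i in range(n)]   # marginal without X_i
print(f"H(X_1,...,X_n)                     = {H_full:.4f}")
print(f"(1/(n-1)) * sum of leave-one-out H = {sum(H_leave_one_out) / (n - 1):.4f}")
```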

Next we derive an inequality (which may be regarded as a version of Han’s inequality for relative entropies) that is fundamental in deriving a “sub-additivity” inequality (see Theorem 13.11), which, in turn, is at the basis of many exponential concentration inequalities.
Let X be a countable set, and let P and Q be probability measures on X^n such that P = P_1 ⊗ . . . ⊗ P_n is a product measure. We denote the elements of X^n by x = (x_1, . . . , x_n) and write x^{(i)} = (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) for the (n − 1)-vector obtained by leaving out the i-th component of x. Denote by Q^{(i)} and P^{(i)} the marginal distributions of Q and P, and let p^{(i)} and q^{(i)} denote the corresponding p.m.f.’s, i.e.,

q^{(i)}(x^{(i)}) = ∑_{y∈X} q(x_1, . . . , x_{i−1}, y, x_{i+1}, . . . , x_n), and
p^{(i)}(x^{(i)}) = p_1(x_1) · · · p_{i−1}(x_{i−1}) p_{i+1}(x_{i+1}) · · · p_n(x_n).

Then we have the following result, proved in [Boucheron et al., 2013, Section 4.6].

Theorem 13.8 (Han’s inequality for relative entropies).

D(Q‖P) ≥ (1/(n−1)) ∑_{i=1}^n D(Q^{(i)}‖P^{(i)}),

or equivalently,

D(Q‖P) ≤ ∑_{i=1}^n [D(Q‖P) − D(Q^{(i)}‖P^{(i)})].

We end this section with a result that will be useful in developing the entropy method,
explained in the next section.

Theorem 13.9 (The expected value minimizes the expected Bregman divergence). Let I ⊂ R be an open interval and let g : I → R be convex and differentiable. For x, y ∈ I, the Bregman divergence of g from x to y is g(y) − g(x) − g′(x)(y − x). Let X be an I-valued random variable. Then,

E[g(X) − g(EX)] = inf_{a∈I} E[g(X) − g(a) − g′(a)(X − a)].        (178)

Proof. Let a ∈ I. The expected Bregman divergence from a is E[g(X) − g(a) − g′(a)(X − a)]. The expected Bregman divergence from EX is

E[g(X) − g(EX) − g′(EX)(X − EX)] = E[g(X) − g(EX)].

Thus, the difference between the expected Bregman divergence from a and the expected Bregman divergence from EX is

E[g(X) − g(a) − g′(a)(X − a)] − E[g(X) − g(EX)]
    = E[−g(a) − g′(a)(X − a) + g(EX)]
    = g(EX) − g(a) − g′(a)(EX − a) ≥ 0,

as g is convex (the first-order condition for convexity gives g(EX) ≥ g(a) + g′(a)(EX − a)).

13.3 The Entropy method

Definition 13.10 (Entropy). The entropy of a random variable Y ≥ 0 is

Ent(Y ) = EΦ(Y ) − Φ(EY ) (179)

where Φ(x) = x log x for x > 0 and Φ(0) = 0. By Jensen’s inequality, Ent(Y ) ≥ 0.

Remark 13.1. Taking g(x) = x log x in (178) we obtain the following variational formula for the entropy:

Ent(Y) = inf_{u>0} E[Y(log Y − log u) − (Y − u)].        (180)

Let X_1, . . . , X_n be independent and let Z = f(X_1, . . . , X_n), where f ≥ 0. Denote

Ent^{(i)}(Z) := E^{(i)}[Φ(Z)] − Φ(E^{(i)} Z).

Theorem 13.11 (Sub-additivity of the Entropy). Let X_1, . . . , X_n be independent and let Z = f(X_1, . . . , X_n), where f ≥ 0. Then

Ent(Z) ≤ ∑_{i=1}^n E[Ent^{(i)}(Z)].        (181)

Proof. (Han’s inequality implies the sub-additivity property.) First observe that if the inequality is true for a random variable Z, then it is also true for cZ, where c > 0. Hence we may assume that E(Z) = 1. Now define the probability measure Q on X^n by its p.m.f. q given by

q(x) = f(x) p(x), for all x ∈ X^n,

where p denotes the p.m.f. of X = (X_1, . . . , X_n). Let P be the distribution induced by p. Then,

EΦ(Z) − Φ(EZ) = E[Z log Z] = D(Q‖P),

which, by Theorem 13.8, does not exceed ∑_{i=1}^n [D(Q‖P) − D(Q^{(i)}‖P^{(i)})]. However, straightforward calculations show that

∑_{i=1}^n [D(Q‖P) − D(Q^{(i)}‖P^{(i)})] = ∑_{i=1}^n E[Ent^{(i)}(Z)],

and the statement follows.
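Because both sides of (181) are finite sums when the X_i take finitely many values, the inequality can be checked exactly. The sketch below does so for independent fair coin flips and an arbitrarily chosen nonnegative f; the choice of f and of n are assumptions made only for the demonstration.

```python
import numpy as np
from itertools import product

# Exact check of the sub-additivity of entropy (181) for Z = f(X_1, ..., X_n)
# with X_i i.i.d. uniform on {0, 1} and an (assumed) nonnegative test function f.
n = 4

def f(x):
    return np.exp(sum(x) / 2.0)

def Phi(y):
    return np.where(y > 0, y * np.log(y), 0.0)

pts = list(product([0, 1], repeat=n))          # all 2^n equally likely outcomes
Z = np.array([f(x) for x in pts])
Ent_Z = Phi(Z).mean() - Phi(Z.mean())

# sum_i E[Ent^{(i)}(Z)]: entropy with respect to the i-th coordinate only,
# averaged over the remaining coordinates.
Zgrid = Z.reshape((2,) * n)
rhs = 0.0
for i in range(n):
    Ei_Phi = Phi(Zgrid).mean(axis=i)           # E^{(i)}[Phi(Z)]
    Phi_Ei = Phi(Zgrid.mean(axis=i))           # Phi(E^{(i)} Z)
    rhs += (Ei_Phi - Phi_Ei).mean()

print(f"Ent(Z) = {Ent_Z:.6f}  <=  sum_i E[Ent^(i)(Z)] = {rhs:.6f}")
```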

Remark 13.2. The form of the above inequality should remind us of the Efron-Stein inequality. In fact, if we take Φ(x) = x², then the above display is the exact analogue of the Efron-Stein inequality.

Remark 13.3 (A logarithmic Sobolev inequality on the hypercube). Let X = (X_1, . . . , X_n) be uniformly distributed over {−1, +1}^n. If f : {−1, +1}^n → R and Z = f(X), then

Ent(Z²) ≤ (1/2) ∑_{i=1}^n E[(Z − Z̄_i)²],

where Z̄_i := f(X_1, . . . , X_{i−1}, −X_i, X_{i+1}, . . . , X_n) is obtained by flipping the sign of the i-th coordinate; equivalently, the right-hand side equals ∑_{i=1}^n E[(Z − Z_i′)²] with Z_i′ defined as in (172).
The proof uses the sub-additivity of the entropy and calculus for the case n = 1; see e.g., [Boucheron et al., 2013, Theorem 5.1]. In particular, it implies the Efron-Stein inequality.
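This inequality, too, can be verified exactly on a small hypercube. The sketch below does so for an arbitrary (assumed) function f on {−1, +1}^4; Z̄_i is the value of f after flipping the i-th coordinate.

```python
import numpy as np
from itertools import product

# Exact check of the hypercube log-Sobolev inequality of Remark 13.3.
n = 4

def f(x):
    return max(x[0] + 2 * x[1], x[2] * x[3])   # an arbitrary test function

def Phi(y):
    return np.where(y > 0, y * np.log(y), 0.0)

pts = np.array(list(product([-1, 1], repeat=n)), dtype=float)
Z = np.array([f(x) for x in pts])
Ent_Z2 = Phi(Z ** 2).mean() - Phi((Z ** 2).mean())

# (1/2) sum_i E[(Z - Zbar_i)^2], where Zbar_i flips the i-th coordinate.
rhs = 0.0
for i in range(n):
    flipped = pts.copy()
    flipped[:, i] *= -1
    Z_flip = np.array([f(x) for x in flipped])
    rhs += 0.5 * np.mean((Z - Z_flip) ** 2)

print(f"Ent(Z^2) = {Ent_Z2:.5f}  <=  {rhs:.5f}")
```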

13.4 Gaussian concentration inequality

Theorem 13.12 (Gaussian logarithmic Sobolev inequality). Let X = (X_1, . . . , X_n) be a vector of n i.i.d. standard normal random variables and let g : R^n → R be a continuously differentiable function. Then

Ent(g²(X)) ≤ 2E[‖∇g(X)‖²].

This result can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality; see e.g., [Boucheron et al., 2013, Theorem 5.4].

Theorem 13.13 (Gaussian Concentration: the Tsirelson-Ibragimov-Sudakov inequality). Let X = (X_1, . . . , X_n) be a vector of n i.i.d. standard normal random variables. Let f : R^n → R be an L-Lipschitz function, i.e., there exists a constant L > 0 such that for all x, y ∈ R^n,

|f(x) − f(y)| ≤ L‖x − y‖.

Then, for all λ ∈ R,

log E[e^{λ(f(X) − Ef(X))}] ≤ λ²L²/2,

and for all t > 0,

P(f(X) − Ef(X) ≥ t) ≤ e^{−t²/(2L²)}.

Proof. By a standard density argument we may assume that f is differentiable with gradient uniformly bounded by L. The Gaussian log-Sobolev inequality may be used with

g(x) = e^{λf(x)/2}, where λ ∈ R.

Applying Theorem 13.12 to this g, we obtain

Ent(e^{λf(X)}) ≤ 2E[‖∇e^{λf(X)/2}‖²] = (λ²/2) E[e^{λf(X)} ‖∇f(X)‖²] ≤ (λ²L²/2) E[e^{λf(X)}].

If F(λ) := E e^{λZ} is the moment generating function of Z = f(X), then

Ent(e^{λZ}) = λE(Z e^{λZ}) − E(e^{λZ}) log E(e^{λZ}) = λF′(λ) − F(λ) log F(λ).

Thus, we obtain the differential inequality (this is usually referred to as Herbst’s argument)

λF′(λ) − F(λ) log F(λ) ≤ (λ²L²/2) F(λ).        (182)

To solve (182), divide both sides by the positive number λ²F(λ). Defining G(λ) := log F(λ), we observe that the left-hand side is just the derivative of G(λ)/λ. Thus, we obtain the inequality

d/dλ [G(λ)/λ] ≤ L²/2.

By l’Hospital’s rule we note that lim_{λ→0} G(λ)/λ = F′(0)/F(0) = EZ. If λ > 0, by integrating the inequality between 0 and λ, we get G(λ)/λ ≤ EZ + λL²/2, or in other words,

F(λ) ≤ e^{λEZ + λ²L²/2}.        (183)

Finally, by Markov’s inequality,

P(Z > EZ + t) ≤ inf_{λ>0} F(λ) e^{−λEZ−λt} ≤ inf_{λ>0} e^{λ²L²/2 − λt} = e^{−t²/(2L²)},

where λ = t/L² minimizes the obtained upper bound. Similarly, if λ < 0, we may integrate the obtained upper bound for the derivative of G(λ)/λ between λ and 0 to obtain the same bound as in (183), which implies the required bound for the left-tail inequality P(Z < EZ − t).

Remark 13.4. An important feature of the problem is that the right-hand side does not
depend on the dimension n.

Example 13.14 (Supremum of a Gaussian process). Let (X_t)_{t∈T} be an almost surely continuous centered Gaussian process indexed by a totally bounded set T. Let Z = sup_{t∈T} X_t. If

σ² := sup_{t∈T} E(X_t²),

then

P(|Z − E(Z)| ≥ u) ≤ 2e^{−u²/(2σ²)}.

Let us first assume that T is a finite set (the extension to arbitrary totally bounded T is based on a separability argument and monotone convergence; see [Boucheron et al., 2013, Exercise 5.14]). We may assume, without loss of generality, that T = {1, . . . , n}. Let Γ be the covariance matrix of the centered Gaussian vector X = (X_1, . . . , X_n). Denote by A the square root of the positive semidefinite matrix Γ. If Y = (Y_1, . . . , Y_n) is a vector of i.i.d. standard normal random variables, then

f(Y) = max_{i=1,...,n} (AY)_i

has the same distribution as max_{i=1,...,n} X_i. Hence we can apply the Gaussian concentration inequality once we bound the Lipschitz constant of f. By the Cauchy-Schwarz inequality, for all u, v ∈ R^n and i = 1, . . . , n,

|(Au)_i − (Av)_i| = |∑_j A_{ij}(u_j − v_j)| ≤ (∑_j A_{ij}²)^{1/2} ‖u − v‖.

Since ∑_j A_{ij}² = Var(X_i), we get

|f(u) − f(v)| ≤ max_{i=1,...,n} |(Au)_i − (Av)_i| ≤ σ‖u − v‖.

Therefore, f is Lipschitz with constant σ and the tail bound follows from the Gaussian concentration inequality.
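The sketch below simulates this example for an equicorrelated Gaussian vector and compares the empirical tail of Z = max_i X_i with the bound 2e^{−u²/(2σ²)}; the dimension, the correlation ρ = 0.5, and the threshold u are assumptions chosen for the demonstration.

```python
import numpy as np

# Concentration of the maximum of a correlated Gaussian vector (Example 13.14).
rng = np.random.default_rng(4)
n, rho, reps = 50, 0.5, 100_000
Gamma = rho * np.ones((n, n)) + (1 - rho) * np.eye(n)
A = np.linalg.cholesky(Gamma)              # a square root of Gamma
sigma2 = np.max(np.diag(Gamma))            # sup_t E(X_t^2) = 1 here

Z = (A @ rng.standard_normal((n, reps))).max(axis=0)   # max_i (AY)_i per replication
u = 2.0
empirical = np.mean(np.abs(Z - Z.mean()) >= u)
bound = 2 * np.exp(-u ** 2 / (2 * sigma2))
print(f"P(|Z - EZ| >= {u}) ≈ {empirical:.4f}  <=  bound {bound:.4f}")
```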

13.5 Bounded differences inequality revisited

In the previous subsections we have used log-Sobolev inequalities quite effectively to derive concentration results when the underlying distribution is Gaussian or Bernoulli. Here we try to extend these results to more general distributional assumptions. Observe, however, that (181) holds for any distribution. We state a result in that direction; see e.g., [Boucheron et al., 2013, Theorem 6.6].

Theorem 13.15 (A modified logarithmic Sobolev inequality). Let φ(x) = e^x − x − 1. Then for all λ ∈ R,

λE(Z e^{λZ}) − E(e^{λZ}) log E(e^{λZ}) ≤ ∑_{i=1}^n E[e^{λZ} φ(−λ(Z − Z_i))],        (184)

where Z_i := f_i(X^{(i)}) for an arbitrary function f_i : X^{n−1} → R.

Proof. We bound each term on the right-hand side of the sub-additivity of the entropy (181). To do this we apply (180) conditionally. If Y_i is a positive function of the random variables X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n, then, for any positive random variable Y,

E^{(i)}(Y log Y) − (E^{(i)}Y) log(E^{(i)}Y) ≤ E^{(i)}[Y(log Y − log Y_i) − (Y − Y_i)].

Applying the above inequality to the variables Y = e^{λZ} and Y_i = e^{λZ_i}, we obtain

E^{(i)}(Y log Y) − (E^{(i)}Y) log(E^{(i)}Y) ≤ E^{(i)}[e^{λZ} φ(−λ(Z − Z_i))].

Summing over i, taking expectations, and using Theorem 13.11 (applied to e^{λZ}) yields (184).

Theorem 13.16 (A stronger form of the bounded differences inequality). Let X_1, . . . , X_n be n independent random variables, Z = f(X_1, . . . , X_n), and let Z_i denote the X^{(i)}-measurable random variable defined by

Z_i = inf_{x_i′} f(X_1, . . . , X_{i−1}, x_i′, X_{i+1}, . . . , X_n).        (185)

Assume that Z is such that there exists a constant v > 0 for which, almost surely,

∑_{i=1}^n (Z − Z_i)² ≤ v.

Then, for all t > 0,

P(Z − EZ > t) ≤ e^{−t²/(2v)}.

Proof. The result follows from the modified logarithmic Sobolev inequality. Observe that for x ≥ 0, φ(−x) ≤ x²/2; here Z − Z_i ≥ 0 and we take λ > 0. Thus, Theorem 13.15 implies

λE(Z e^{λZ}) − E(e^{λZ}) log E(e^{λZ}) ≤ E[e^{λZ} (λ²/2) ∑_{i=1}^n (Z − Z_i)²] ≤ (λ²v/2) E(e^{λZ}).

The obtained inequality has the same form as (182), and the proof may be finished in an identical way (using Herbst’s argument).

Remark 13.5. As a consequence, if the condition ∑_{i=1}^n (Z − Z_i)² ≤ v is satisfied both for Z_i = inf_{x_i′} f(X_1, . . . , X_{i−1}, x_i′, X_{i+1}, . . . , X_n) and for Z_i = sup_{x_i′} f(X_1, . . . , X_{i−1}, x_i′, X_{i+1}, . . . , X_n), one has the two-sided inequality

P(|Z − EZ| > t) ≤ 2e^{−t²/(2v)}.

Example 13.17 (The largest eigenvalue of a symmetric matrix). Let A = (X_{ij})_{n×n} be a symmetric random matrix, with X_{ij}, i ≤ j, being independent random variables (not necessarily identically distributed) with |X_{ij}| ≤ 1. Let

Z = λ_1 := sup_{u: ‖u‖=1} uᵀAu,

and suppose that v is such that Z = vᵀAv. Let A_{ij}′ be the symmetric matrix obtained from A by replacing X_{ij} (and X_{ji}) by x_{ij}′ ∈ [−1, 1] (and keeping the other entries fixed). Then,

(Z − Z_{ij})_+ ≤ (vᵀAv − vᵀA_{ij}′v) 1_{Z>Z_{ij}} = (vᵀ(A − A_{ij}′)v) 1_{Z>Z_{ij}} ≤ (2 v_i v_j (X_{ij} − x_{ij}′))_+ ≤ 4|v_i v_j|.

Therefore, using Z_{ij} as defined in (185),

∑_{1≤i≤j≤n} (Z − Z_{ij})² = ∑_{1≤i≤j≤n} inf_{x_{ij}′: |x_{ij}′|≤1} (Z − Z_{ij})²_+ ≤ ∑_{1≤i≤j≤n} 16|v_i v_j|² ≤ 16 (∑_{i=1}^n v_i²)² = 16.

Thus, by Theorem 13.16 we get

P(Z − EZ > t) ≤ e^{−t²/32}.

This example shows that if we want to bound (Z − Z_{ij}′)²_+ individually, we only get an upper bound of 4, and the usual bounded differences inequality does not lead to a good concentration bound. But this (stronger) version of the bounded differences inequality yields a much stronger result. Note that the above result applies to the adjacency matrix of a random graph if the edges are sampled independently.
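A short simulation of this example is given below: it draws symmetric matrices with independent Rademacher entries (an assumed choice satisfying |X_{ij}| ≤ 1) and checks that the fluctuations of the largest eigenvalue are dimension-free, as predicted by the bound with v = 16.

```python
import numpy as np

# Concentration of the largest eigenvalue of a symmetric random matrix
# with independent entries bounded by 1 (Example 13.17).
rng = np.random.default_rng(5)
n, reps = 100, 2_000

lam = np.empty(reps)
for r in range(reps):
    U = rng.choice([-1.0, 1.0], size=(n, n))
    A = np.triu(U) + np.triu(U, 1).T          # symmetric with +/-1 entries
    lam[r] = np.linalg.eigvalsh(A)[-1]        # largest eigenvalue

t = 4.0
print(f"sd(lambda_1) ≈ {lam.std():.3f} (does not grow with n)")
print(f"P(lambda_1 - E lambda_1 > {t}) ≈ {np.mean(lam - lam.mean() > t):.4f}"
      f"  <=  bound exp(-t^2/32) = {np.exp(-t ** 2 / 32):.4f}")
```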

13.6 Suprema of the empirical process: exponential inequalities

The location of the distribution of the norm ‖G_n‖_F of the empirical process depends strongly on the complexity of the class of functions F. We have seen various bounds on its mean value E‖G_n‖_F in terms of entropy integrals. It turns out that the spread or “concentration” of the distribution hardly depends on the complexity of the class F. It is sub-Gaussian as soon as the class F is uniformly bounded, no matter its complexity.
Theorem 13.18. If F is a class of measurable functions f : X → R such that |f(x) − f(y)| ≤ 1 for every f ∈ F and every x, y ∈ X , then, for all t > 0,

P∗(|‖G_n‖_F − E‖G_n‖_F| ≥ t) ≤ 2e^{−2t²} and
P∗(|sup_{f∈F} G_n f − E sup_{f∈F} G_n f| ≥ t) ≤ 2e^{−2t²}.

This theorem is a special case of the bounded difference inequality; both the norm ‖G_n‖_F and the supremum sup_{f∈F} G_n f satisfy the bounded difference property (24) with c_i equal to n^{−1/2} times the supremum over f ∈ F of the range of f (in fact, we can take c_i = n^{−1/2}).
References

[Billingsley, 1999] Billingsley, P. (1999). Convergence of probability measures. Wiley Series


in Probability and Statistics: Probability and Statistics. John Wiley & Sons, Inc., New
York, second edition. A Wiley-Interscience Publication.

[Boucheron et al., 2013] Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration
inequalities. Oxford University Press, Oxford. A nonasymptotic theory of independence,
With a foreword by Michel Ledoux.

[Bousquet, 2003] Bousquet, O. (2003). Concentration inequalities for sub-additive functions


using the entropy method. In Stochastic inequalities and applications, volume 56 of Progr.
Probab., pages 213–247. Birkhäuser, Basel.

[Browder, 1996] Browder, A. (1996). Mathematical analysis. Undergraduate Texts in Math-


ematics. Springer-Verlag, New York. An introduction.

[Cantelli, 1933] Cantelli, F. (1933). Sulla determinazione empirica delle leggi di probabilità.
Giorn. Ist. Ital. Attuari, 4:421–424.

[Chernozhukov et al., 2014] Chernozhukov, V., Chetverikov, D., and Kato, K. (2014).
Gaussian approximation of suprema of empirical processes. Ann. Statist., 42(4):1564–
1597.

[de la Peña and Giné, 1999] de la Peña, V. H. and Giné, E. (1999). Decoupling. Probabil-
ity and its Applications (New York). Springer-Verlag, New York. From dependence to
independence, Randomly stopped processes. U -statistics and processes. Martingales and
beyond.

[Donsker, 1952] Donsker, M. D. (1952). Justification and extension of Doob’s heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statistics, 23:277–281.

[Dudley, 1978] Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann.
Probab., 6(6):899–929 (1979).

[Dvoretzky et al., 1956] Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic
minimax character of the sample distribution function and of the classical multinomial
estimator. Ann. Math. Statist., 27:642–669.

[Einmahl and Mason, 2005] Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth
consistency of kernel-type function estimators. Ann. Statist., 33(3):1380–1403.

[Giné and Nickl, 2016] Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-
dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathe-
matics, [40]. Cambridge University Press, New York.

[Glivenko, 1933] Glivenko, V. (1933). Sulla determinazione empirica delle leggi di proba-
bilità. Giorn. Ist. Ital. Attuari, 4:92–99.

[Greenshtein and Ritov, 2004] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-
dimensional linear predictor selection and the virtue of overparametrization. Bernoulli,
10(6):971–988.

[Grenander, 1956] Grenander, U. (1956). On the theory of mortality measurement. I. Skand.


Aktuarietidskr., 39:70–96.

[Hjort and Pollard, 2011] Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers
of convex processes. arXiv preprint arXiv:1107.3806.

[Hoffmann-Jørgensen, 1991] Hoffmann-Jørgensen, J. (1991). Stochastic processes on Polish spaces, volume 39 of Various Publications Series (Aarhus). Aarhus Universitet, Matematisk Institut, Aarhus.

[Kallenberg, 2002] Kallenberg, O. (2002). Foundations of modern probability. Probability


and its Applications (New York). Springer-Verlag, New York, second edition.

[Kiefer, 1961] Kiefer, J. (1961). On large deviations of the empiric D. F. of vector chance
variables and a law of the iterated logarithm. Pacific J. Math., 11:649–660.

[Kim and Pollard, 1990] Kim, J. and Pollard, D. (1990). Cube root asymptotics. Ann.
Statist., 18(1):191–219.

[Klein and Rio, 2005] Klein, T. and Rio, E. (2005). Concentration around the mean for
maxima of empirical processes. Ann. Probab., 33(3):1060–1077.

[Koltchinskii, 2011] Koltchinskii, V. (2011). Oracle inequalities in empirical risk mini-


mization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics.
Springer, Heidelberg. Lectures from the 38th Probability Summer School held in Saint-
Flour, 2008, École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer
School].

[Ledoux and Talagrand, 1991] Ledoux, M. and Talagrand, M. (1991). Probability in Banach
spaces, volume 23 of Ergebnisse der Mathematik und ihrer Grenzgebiete (3) [Results in
Mathematics and Related Areas (3)]. Springer-Verlag, Berlin. Isoperimetry and processes.

[Massart, 1990] Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz


inequality. Ann. Probab., 18(3):1269–1283.

[Pollard, 1984] Pollard, D. (1984). Convergence of stochastic processes. Springer Series in


Statistics. Springer-Verlag, New York.

[Pollard, 1989] Pollard, D. (1989). Asymptotics via empirical processes. Statist. Sci.,
4(4):341–366. With comments and a rejoinder by the author.

[Robertson et al., 1988] Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order
restricted statistical inference. Wiley Series in Probability and Mathematical Statistics:
Probability and Mathematical Statistics. John Wiley & Sons, Ltd., Chichester.

[Talagrand, 1994] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical pro-
cesses. Ann. Probab., 22(1):28–76.

[Talagrand, 1996a] Talagrand, M. (1996a). New concentration inequalities in product


spaces. Invent. Math., 126(3):505–563.

[Talagrand, 1996b] Talagrand, M. (1996b). A new look at independence. Ann. Probab.,


24(1):1–34.

[van de Geer, 2000] van de Geer, S. A. (2000). Applications of empirical process theory,
volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge
University Press, Cambridge.

[van der Vaart and Wellner, 2011] van der Vaart, A. and Wellner, J. A. (2011). A local
maximal inequality under uniform entropy. Electron. J. Stat., 5:192–203.

[van der Vaart, 1998] van der Vaart, A. W. (1998). Asymptotic statistics, volume 3 of
Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, Cambridge.

[van der Vaart and Wellner, 1996] van der Vaart, A. W. and Wellner, J. A. (1996). Weak
convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New
York. With applications to statistics.

[Wainwright, 2019] Wainwright, M. J. (2019). High-dimensional statistics, volume 48 of


Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, Cambridge. A non-asymptotic viewpoint.

