Convergence of Stochastic Processes
Close inspection of the proof would reveal a disregard for a number of measure-theoretic niceties. A more careful treatment may be found in Appendix C. For our present purpose it would suffice if we assumed T countable; the proof is impeccable for stochastic processes sharing a countable index set. We could replace suprema over all intervals (−∞, t] by suprema over intervals with a rational endpoint.
For fixed t, P_n(−∞, t] is an average of the n independent random variables {ξ_i ≤ t}, each having expected value P(−∞, t] and variance P(−∞, t] − (P(−∞, t])², which is less than one. By Tchebychev's inequality,

    IP{|P_n(−∞, t] − P(−∞, t]| ≤ ½ε} ≥ ½   if n ≥ 8ε⁻².
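The bound admits a quick numerical sanity check. In the sketch below (my illustration, not from the text: the Uniform(0, 1) distribution, the point t = 0.3, and ε = 0.2 are arbitrary choices), n ≥ 8ε⁻² observations are drawn repeatedly and the empirical measure P_n(−∞, t] lands within ½ε of P(−∞, t] in at least half the trials, as Tchebychev's inequality guarantees.

```python
import random

random.seed(0)

def empirical_cdf_at(sample, t):
    """P_n(-oo, t]: the fraction of observations not exceeding t."""
    return sum(x <= t for x in sample) / len(sample)

eps = 0.2
n = int(8 / eps ** 2) + 1      # ensures n >= 8 * eps**-2, as in the text
t, true_p = 0.3, 0.3           # for Uniform(0, 1), P(-oo, t] = t

trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.random() for _ in range(n)]
    if abs(empirical_cdf_at(sample, t) - true_p) <= eps / 2:
        hits += 1

coverage = hits / trials       # Tchebychev guarantees at least 1/2
```

In practice the coverage is far higher than the guaranteed ½, since Tchebychev's inequality is crude.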
Apply the Symmetrization Lemma with Z = P_n − P and Z′ = P′_n − P, the class 𝒥 as index set, α = ½ε, and β = ½:

    (10) IP{‖P_n − P‖ > ε} ≤ 2 IP{‖P_n − P′_n‖ > ½ε}   if n ≥ 8ε⁻².
SECOND SYMMETRIZATION.
The difference P_n − P′_n depends on 2n observations. The double sample size creates a minor nuisance, at least notationally. It can be avoided by a second symmetrization trick, at the cost of a further diminution of the constants.

Independently of the observations ξ_1, …, ξ_n, ξ′_1, …, ξ′_n from which the empirical measures are constructed, generate independent sign random variables σ_1, …, σ_n for which IP{σ_i = +1} = IP{σ_i = −1} = ½. The symmetric random variables {ξ_i ≤ t} − {ξ′_i ≤ t}, for i = 1, …, n and −∞ < t < ∞,

Call a probability measure P on 𝒜 tight if for every ε > 0
there exists a compact set K(ε) of completely regular points such that PK(ε) > 1 − ε.
Call a sequence {P_n} of probability measures on 𝒜 uniformly tight if for every ε > 0 there exists a compact set K(ε) of completely regular points such that liminf P_nG > 1 − ε for every open, 𝒜-measurable G containing K(ε).
Problem 7 justifies the implicit assumption of 𝒜-measurability for the K(ε) in the definition of tightness; every compact set of completely regular points can be written as a countable intersection of open, 𝒜-measurable sets.
If G is replaced by K(ε), the uniform tightness condition becomes a slightly tidier, but stronger, condition. It is, however, more natural to retain the open G. If P_n ⇝ P and P is tight then, by virtue of the results proved in Example 17, the liminf condition for open G is satisfied; it might not be satisfied if G were replaced by K(ε). More importantly, one does not need the stronger condition to get weakly convergent subsequences, as will be shown in the next theorem.
For the proof of the theorem we shall make use of a property of compact sets:

    If {x_n} is a Cauchy sequence in a metric space, and if d(x_n, K) → 0
    for some fixed compact set K, then {x_n} converges to a point of K.

This follows easily from one of a set of alternative characterizations of compactness in metric spaces. As we shall be making free use of these characterizations in later chapters, a short digression on the topic will not go amiss.
To prove the assertion we have only to choose, according to the definition of d(x_n, K), points {y_n} in K for which d(x_n, y_n) → 0. From {y_n} we can extract a subsequence converging to a point y in K. For if no subsequence of {y_n} converged to a point of K, then around each x in K we could put an open neighborhood G_x that excluded y_n for all large enough values of n. This would imply that {y_n} is eventually outside the union of the finite collection of G_x sets covering the compact K, a contradiction. The corresponding subsequence of {x_n} also converges to y. The Cauchy property forces {x_n} to follow the subsequence in converging to y.
A set with the property that every sequence has a convergent subsequence (with limit point in the set) is said to be sequentially compact. Every compact set is sequentially compact. This leads to another characterization of compactness:

    A sequentially compact set is complete (every Cauchy sequence
    converges to a point of the set) and totally bounded (for every
    positive ε, the set can be covered by a finite union of closed balls
    of radius less than ε).
For clearly a Cauchy sequence in a sequentially compact K must converge to the same limit as its convergent subsequence. And if K were not totally bounded, there would be some positive ε for which no finite collection of balls of radius ε could cover K. We could extract a sequence {x_n} in K with x_{n+1} at least ε away from each of x_1, …, x_n, for every n. No subsequence of {x_n} could converge, in defiance of sequential compactness.
For us the last link in the chain of characterizations will be the most important:
A complete, totally bounded subset of a metric space is compact.
Suppose, to the contrary, that {G_i} is an open cover of a totally bounded set K for which no finite union of {G_i} sets covers K. We can cover K by a finite union of closed balls of radius ½, though. There must be at least one such ball, B_1 say, for which K ∩ B_1 has no finite {G_i} subcover. Cover K ∩ B_1 by finitely many closed balls of radius ¼. For at least one of these balls, B_2 say, K ∩ B_1 ∩ B_2 has no finite {G_i} subcover. Continuing in this way we discover a sequence of closed balls {B_n} of radii {2⁻ⁿ} for which K ∩ B_1 ∩ ⋯ ∩ B_n has no finite {G_i} subcover. Choose a point x_n from this (necessarily non-empty) intersection. The sequence {x_n} is Cauchy. If K were also complete, {x_n} would converge to some x in K. Certainly x would belong to some G_i, which would necessarily contain B_n for n large enough. A single G_i is about as finite a subcover as one could wish for. Completeness would indeed force {G_i} to have a finite subcover for K. End of digression.
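The total-boundedness half of the characterization can be illustrated numerically (a sketch under my own assumptions; the unit square and ε = 0.25 are arbitrary choices, not from the text). The greedy construction below keeps any point lying more than ε from all previously chosen centers; because chosen centers are pairwise more than ε apart, total boundedness forces the selection to stop after finitely many steps, and the closed ε-balls about the centers then cover the sample.

```python
import math
import random

random.seed(1)

def greedy_eps_net(points, eps):
    """Select centers so that every point lies within eps of some center.

    Chosen centers are pairwise more than eps apart, so for a totally
    bounded set the selection must terminate with finitely many centers.
    """
    centers = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# A sample of the unit square, a compact (hence totally bounded) set.
cloud = [(random.random(), random.random()) for _ in range(5000)]
eps = 0.25
net = greedy_eps_net(cloud, eps)

# Every sampled point is covered by a closed ball of radius eps about a center.
covered = all(any(math.dist(p, c) <= eps for c in net) for p in cloud)
```

The same greedy selection run on a set that is not totally bounded would never terminate, which is exactly the ε-separated sequence used in the argument above.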
29 Compactness Theorem. Every uniformly tight sequence of probability measures contains a subsequence that converges weakly to a tight borel measure.
PROOF. Write {P_n} for the uniformly tight sequence, and K_k for the compact set K(ε_k), for a fixed sequence {ε_k} that converges to zero. We may assume that {K_k} is an increasing sequence of sets.

The proof will use a coupling to represent a subsequence of {P_n} by an almost surely convergent sequence of random elements. The limit of these random elements will concentrate on the union of the compact K_k sets; it will induce the tight borel measure on 𝒳 to which the subsequence of {P_n} will converge weakly.
Complete regularity of each point in K_k allows us to cover K_k by a collection of open, 𝒜-measurable sets, each of diameter less than ε_k. Invoke compactness to extract a finite subcover, {U_{ki}: 1 ≤ i ≤ m(k)}.
Lighten the notation by assuming that n(k) = k. (If you suspect these notational tricks for avoiding an orgy of subsequencing, feel free to rewrite the argument using, by now, triple subscripting.) As in the proof of the Representation Theorem, this allows us to construct a random element X_k with distribution P_k, by means of an auxiliary random variable ξ that has a Uniform(0, 1) distribution independent of η:

    For each atom A of 𝒜_k, if η falls in the corresponding Ā and
    ξ ≤ 1 − ε_k, distribute X_k on A according to the conditional
    distribution P_k(·|A). If ξ > 1 − ε_k, distribute X_k with whatever
    conditional distribution is necessary to bring its overall distribution
    up to P_k.
We have coupled each P_k with lebesgue measure on the unit square. To emphasize that X_k depends on η, ξ, and the randomization necessary to generate observations on P_k(·|A), write it as X_k(·, η, ξ). Notice that the same η and ξ figure in the construction of every X_k.
It will suffice for us to prove that {X_k(ω, η, ξ)} converges to a point X(ω, η, ξ) of K_k for every ω and every pair (η, ξ) lying in a region of probability at least (1 − ε_k)², a result stronger than mere almost sure convergence to a point in the union of the compact sets {K_k}. Problem 16 provides the extra details needed to deduce borel measurability of X.
For each m greater than k, let G_{km} be the smallest open, 𝒜_m-measurable set containing K_k. Uniform tightness tells us that

    λG_{km} ≥ liminf P_nG_{km} > 1 − ε_k,

which implies IP{η ∈ Ḡ_{km}} > 1 − ε_k. Define Ḡ_k as the intersection of the decreasing sequence of sets {Ḡ_{km}} for m = k, k + 1, …. The overbar here is slightly misleading, because Ḡ_k need not belong to 𝒜̄. But it is a borel subset of (0, 1). Countable additivity of lebesgue measure allows us to deduce that IP{η ∈ Ḡ_k} ≥ 1 − ε_k. Notice how we have gotten around the lack of countable additivity on 𝒜̄ by pulling the construction back into a more familiar measure space.
Whenever η falls in Ḡ_k and ξ ≤ 1 − ε_k, which occurs with probability at least (1 − ε_k)², the random elements X_k, X_{k+1}, … crowd together into a shrinking neighborhood of a point of K_k. There exists a decreasing sequence {A_m} with:

(i) A_m is an atom of 𝒜_m;
(ii) A_m is contained in Ḡ_{km};
(iii) X_m(ω, η, ξ) lies in A_m.

Properties (i) and (iii) are consequences of the method of construction for X_m; property (ii) holds because Ḡ_k is a subset of Ḡ_{km}. The set G_{km}, being the
smallest open, 𝒜_m-measurable set containing K_k, must be contained within the union of those U_{ki} that intersect K_k. The atom A_m must lie wholly within one such U_{ki}, a set of diameter less than ε_k. So whenever η falls in Ḡ_k and ξ ≤ 1 − ε_k, the sequence {X_m} satisfies:

(i) d(X_m(ω, η, ξ), X_n(ω, η, ξ)) ≤ ε_k for k ≤ m ≤ n;
(ii) d(X_m(ω, η, ξ), K_k) ≤ ε_k for k ≤ m.

As explained at the start of the digression, this forces convergence to a point X(ω, η, ξ) of K_k. □
NOTES
Any reader uncomfortable with the metric space ideas used in this chapter
might consult Simmons (1963, especially Chapters 2 and 5).
The advantages of equipping a metric space with a σ-field different from the borel σ-field were first exploited by Dudley (1966a, 1967a), who developed a weak convergence theory for measures living on the σ-field generated by the closed balls. The measurability problem for empirical processes (Example 2) was noted by Chibisov (1965); he opted for the Skorohod metric. Pyke and Shorack (1968) suggested another way out: X_n ⇝ X should mean IPf(X_n) → IPf(X) for all those bounded, continuous f that make f(X_n) and f(X) measurable. They noted the equivalence of this definition to the definition based on the Skorohod metric, for random elements of D[0, 1] converging to a process with continuous sample paths.
Separability has a curious role in the theory. With it, the closed balls generate the borel σ-field (Problem 6); but this can also hold without separability (Talagrand 1978). Borel measures usually have separable support (Dudley 1967a, 1976, Lecture 5).
Alexandroff (1940, 1941, 1943) laid the foundation for a theory of weak convergence on abstract spaces, not necessarily topological. Prohorov (1956) reset the theory in complete, separable metric space, where most probabilistic and statistical applications can flourish. He and LeCam (1957) proved different versions of the Compactness Theorem, whose form (but not the proof) I have borrowed from Dudley (1966a). Weak convergence of baire measures on general topological spaces was thoroughly investigated by Varadarajan (1965). Topsøe (1970) put together a weak convergence theory for borel measures; he used the liminf property for semicontinuous functions (Example 17) to define weak convergence. These two authors made clear the need for added regularity conditions on the limit measure and separation properties on the topology. One particularly nice combination, a completely regular topology and a τ-additive limit measure, corresponds closely to my assumption that limit measures concentrate on separable sets of completely regular points.
The best references to the weak convergence theory for borel measures on metric spaces remain Billingsley (1968, 1971) and Parthasarathy (1967). Dudley's (1976) lecture notes offer an excellent condensed exposition of both the mathematical theory and the statistical applications.
Example 11 is usually attributed to Wichura (1971), although Hájek (1965) used a similar approximation idea to prove convergence for random elements of C[0, 1].
Skorohod (1956) hit upon the idea of representing sequences that converge in distribution by sequences that converge almost surely, for the case of random elements of complete, separable metric spaces. The proof in Section 3 is adapted from Dudley (1968). He paid more attention to some of the points glossed over in my proof; for example, he showed how to construct a probability space supporting all the {X_n}. Here, and in Section 5, one needs the existence theorem for product measures on infinite-product spaces. Pyke (1969, 1970) has been a most persuasive advocate of this method for proving theorems about weak convergence. Many of the applications now belong to the folklore.
The uniformity result of Example 19 comes from Ranga Rao (1962); Billingsley and Topsøe (1967) and Topsøe (1970) perfected the idea. Not surprisingly, the original proofs of this type of result made direct use of the dissection technique of Lemma 15. Prohorov (1956) defined the Prohorov metric; Dudley (1966b) defined the bounded Lipschitz metric.
Strassen (1965) invoked convexity arguments to establish the coupling characterization of the Prohorov metric (Example 26). My proof comes essentially from Dudley (1968), via Dudley (1976, Lecture 18), who introduced the idea of building couplings between discrete measures by application of the marriage lemma. The Allocation Lemma can also be proved by the max-flow-min-cut theorem (an elementary result from graph theory; for a proof see Bollobás (1979)). The conditions of my Lemma ensure that the minimum capacity of a cut will correspond to the total column mass. Appendix B of Jacobs (1978) contains an exposition of this approach, following Hansel and Troallic (1978). Major (1978) has described more refined forms of coupling.
PROBLEMS
[1] Suppose the empirical process U_2 were measurable with respect to the borel σ-field on D[0, 1] generated by the uniform metric. For each subset A of (1, 2) define J_A as the open set of functions in D[0, 1] with jumps at some pair of distinct points t_1 and t_2 in [0, 1] with t_1 + t_2 in A. Define a non-atomic measure on the class of all subsets of (1, 2) by setting μ(A) = IP{U_2 ∈ J_A}. This contradicts the continuum hypothesis (Oxtoby 1971, Section 5). Manufacture from μ an extension of the uniform distribution to all subsets of (1, 2) if you would like to offend the axiom of choice as well. Extend the argument to larger sample sizes.
[2] Write 𝒜 for the σ-field on a set 𝒳 generated by a family {f_i} of real-valued functions on 𝒳. That is, 𝒜 is the smallest σ-field containing f_i⁻¹B for each i and each borel set B. Prove that a map X from (Ω, ℰ) into 𝒳 is ℰ\𝒜-measurable if and only if the composition f_i ∘ X is ℰ\ℬ(IR)-measurable for each i.
[3] Every function in D[0, 1] is bounded: |x(t_n)| → ∞ as n → ∞ would violate either the right continuity or the existence of the left limit at some cluster point of the sequence {t_n}.
[4] Write 𝒫 for the projection σ-field on D[0, 1] and ℬ₀ for the σ-field generated by the closed balls of the uniform metric. Write π_t for the projection map that takes an x in D[0, 1] onto its value x(t).
(a) Prove that each π_t is ℬ₀-measurable. [Express {x: π_t x > α} as a countable union of closed balls B(x_n, n), where x_n equals α plus (n + n⁻¹) times the indicator function of [t, t + n⁻¹).] Deduce that ℬ₀ contains 𝒫.
(b) Prove that the σ-field 𝒫 contains each closed ball B(x₀, r). [Express the ball as an intersection of sets {z: |π_t z − π_t x₀| ≤ r} with t rational.] Deduce that 𝒫 contains ℬ₀.
[5] Let {G_i} be a family of open sets whose union covers a separable subset C of a metric space. Adapt the argument of Lemma 7 to prove that C is contained in the union of some countable subfamily of the {G_i}. [This is Lindelöf's theorem.]
[6] Every separable, open subset of a metric space can be written as a countable union of closed balls. [Rational radii, centered at points of the countable dense set.] The closed balls generate the borel σ-field on a separable metric space.
[7] Every closed, separable set of completely regular points belongs to 𝒜. [Cover it with open, 𝒜-measurable sets of small diameter. Use Lindelöf's theorem to extract a countable subcover. The union of these sets belongs to 𝒜. Represent the

    (10) |IPf(X) − IPf(X + Y)| ≤ ε + 2‖f‖ IP{|Y| ≥ δ}.
The inequality lets us deduce convergence in distribution of a sequence of random vectors from convergence of slightly perturbed sequences.

11 Lemma. Let {X_n}, X and Y be random vectors for which X_n + σY ⇝ X + σY for each fixed positive σ. Then X_n ⇝ X.
PROOF. Remember (Corollary 5) we have only to check that IPf(X_n) → IPf(X) for each bounded, uniformly continuous f. Apply inequality (10) with X replaced by X_n and Y replaced by σY:

    sup_n |IPf(X_n) − IPf(X_n + σY)| ≤ ε + 2‖f‖ IP{|Y| ≥ δσ⁻¹}.

Similarly,

    |IPf(X) − IPf(X + σY)| ≤ ε + 2‖f‖ IP{|Y| ≥ δσ⁻¹}.

Choose σ small enough to make both right-hand sides less than 2ε, then invoke the known convergence of {IPf(X_n + σY)} to IPf(X + σY) to deduce that limsup |IPf(X_n) − IPf(X)| ≤ 4ε. □
Now, instead of thinking of the σY as a perturbation of the random vectors, treat it as a means for smoothing the function f. This can be arranged by choosing, independently of X and the {X_n}, a random vector Y having a smooth density function with respect to lebesgue measure; for convenience, take Y to have a N(0, I_k) distribution. Integrate out first with respect to the distribution of Y, then with respect to the distribution of X:
    IPf(X + σY) = IPf_σ(X),

where

    f_σ(x) = ∫ (2π)^(−k/2) f(x + σy) exp(−½|y|²) dy
           = ∫ (2πσ²)^(−k/2) f(z) exp(−½|z − x|²/σ²) dz.
The function f_σ has been smoothed by convolution. Dominated convergence justifies repeated differentiation under the last integral sign to prove that f_σ belongs to the class 𝒞^∞(IR^k) of all bounded real functions on IR^k having bounded, continuous partial derivatives of all orders.
12 Theorem. If IPf(X_n) → IPf(X) for every f in 𝒞^∞(IR^k) then X_n ⇝ X.

PROOF. Convergence holds for every f_σ produced by convolution smoothing. Apply Lemma 11. □
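The convolution smoothing can be carried out numerically. In the sketch below (an illustration; the discontinuous test function f, the value σ = 0.05, and the quadrature grid are my choices, not from the text), f_σ(x) = IPf(x + σY) is computed by the trapezoidal rule; away from the jump of f the smoothed function nearly agrees with f, while at the jump it takes the intermediate value ½.

```python
import math

def f(x):
    """A bounded but discontinuous test function: the indicator of (-oo, 0]."""
    return 1.0 if x <= 0 else 0.0

def f_sigma(x, sigma, half_width=8.0, steps=4000):
    """f_sigma(x) = IP f(x + sigma * Y), Y ~ N(0, 1), by the trapezoidal rule."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        y = -half_width + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * f(x + sigma * y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2 * math.pi)

sigma = 0.05
v_left = f_sigma(-1.0, sigma)   # well left of the jump: close to f(-1) = 1
v_mid = f_sigma(0.0, sigma)     # at the jump: the intermediate value 1/2
v_right = f_sigma(1.0, sigma)   # well right of the jump: close to f(1) = 0
```

Shrinking σ makes f_σ track f ever more closely at continuity points, which is why convergence of IPf_σ(X_n) for the smoothed functions already controls the distributions themselves.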
For the remainder of the section assume that k = 1. That is, consider only real random variables. As the results of Section 5 will show, no great generality will be lost thereby: a trick with multidimensional characteristic functions will reduce problems of convergence of random vectors to their one-dimensional analogues.
For expectations of smooth functions of X, the effect of small perturbations can be expressed in terms of moments by applying Taylor's theorem. Suppose f belongs to 𝒞^∞(IR). Then, ignoring the niceties of convergence, we can write

    f(x + y) = f(x) + yf′(x) + ½y²f″(x) + ⋯.

Suppose the random variable X is incremented by an independent amount Y. Then, again ignoring problems of convergence and finiteness, deduce

    (13) IPf(X + Y) = IPf(X) + IP(Y)IPf′(X) + ½IP(Y²)IPf″(X) + ⋯.
Try to mimic the effect of the increment Y by a different increment W, also independent of X. As long as IP(Y) = IP(W) and IP(Y²) = IP(W²), the expectations IPf(X + Y) and IPf(X + W) should differ only by terms involving third or higher moments of Y and W. These higher-order terms should amount to very little provided both Y and W are small; the effect of substituting W for Y should be small in that case.
This method of substitution can be applied repeatedly for a random variable Z made up of a lot of little independent increments. We can replace the increments one after another by new independent random variables. If at each substitution we match up the first and second moments, as above, the overall effect on IPf(Z) should involve only a sum of quantities of third or higher order. In the next section this approach, with normally distributed replacement increments, will establish the Liapounoff and Lindeberg forms of the Central Limit Theorem.
To make these approximation ideas more precise we need to bound the remainder terms in the informal expansion (13). Because only the first two moments of the increments are to be matched, a Taylor expansion to quadratic terms will suffice. Existence of third derivatives for f will help to control the error terms.
Assume f belongs to the class 𝒞³(IR) of all bounded real functions on IR having bounded continuous derivatives up to third order. Then the remainder term in the Taylor expansion

    (14) f(x + y) = f(x) + yf′(x) + ½y²f″(x) + R(x, y)

can be expressed as

    R(x, y) = ⅙y³f‴(x + θy),

with θ (depending on x and y) between 0 and 1. Write ‖f‴‖ for the supremum of |f‴(·)|. Then

    (15) |R(x, y)| ≤ ⅙‖f‴‖|y|³.

Set C equal to ⅙‖f‴‖. Then from (14) and (15),

    |IPf(X + Y) − IPf(X) − IP(Y)IPf′(X) − ½IP(Y²)IPf″(X)|
        ≤ IP|R(X, Y)|
        ≤ C IP(|Y|³).
Apply the same argument with Y replaced by the increment W, which is also independent of X. Because IP(Y) = IP(W) and IP(Y²) = IP(W²), when the resulting expansion for IPf(X + W) is subtracted from the expansion for IPf(X + Y), most of the terms cancel out, leaving

    (16) |IPf(X + Y) − IPf(X + W)| ≤ IP|R(X, Y)| + IP|R(X, W)|
                                   ≤ C IP(|Y|³) + C IP(|W|³).

This inequality is sharp enough for the proof of a limit theorem for sums of independent random variables with third moments.
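Inequality (16) is easy to check by simulation (a sketch; the particular f, X, Y, and W below are my own illustrative choices, not from the text). Take Y uniform on (−a, a) and W normal N(0, a²/3), so IP(Y) = IP(W) = 0 and IP(Y²) = IP(W²); with f = sin, ‖f‴‖ = 1 and C = ⅙. The Monte Carlo estimate of |IPf(X + Y) − IPf(X + W)| then stays well below C IP|Y|³ + C IP|W|³.

```python
import math
import random

random.seed(2)

f = math.sin                 # a bounded C^3 function with sup|f'''| = 1
C = 1.0 / 6.0                # C = (1/6) * sup|f'''|, as in (15)

a = 1.0
N = 200_000
s = a / math.sqrt(3)         # sd of W, so that IP(W^2) = a**2 / 3 = IP(Y^2)

gap_sum = 0.0
for _ in range(N):
    x = random.gauss(1.0, 1.0)     # X ~ N(1, 1), an arbitrary choice
    y = random.uniform(-a, a)      # Y: mean 0, variance a^2/3
    w = random.gauss(0.0, s)       # W: same first two moments as Y
    gap_sum += f(x + y) - f(x + w)

gap = abs(gap_sum / N)             # estimate of |IPf(X+Y) - IPf(X+W)|

m3_y = a ** 3 / 4                              # IP|Y|^3 for Uniform(-a, a)
m3_w = s ** 3 * 2 * math.sqrt(2 / math.pi)     # IP|W|^3 for N(0, s^2)
bound = C * (m3_y + m3_w)                      # right-hand side of (16)
```

Sharing the same draw of X between the two expectations reduces the Monte Carlo noise without changing either expectation.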
III.4. The Central Limit Theorem
A sum of a large number of small, independent random variables is approximately normally distributed; that is roughly what the Central Limit Theorem asserts. The rigorous formulations of the theorem set forth conditions for convergence in distribution of sums of independent random variables to a standard normal distribution. We shall prove two versions of the theorem.
To begin with, consider a sum Z = ξ_1 + ⋯ + ξ_k of independent random variables with finite third moments. Write σ_j² for IPξ_j². Standardize, if necessary, to ensure that IPξ_j = 0 for each j and σ_1² + ⋯ + σ_k² = 1. Independently of the {ξ_j}, choose independent N(0, σ_j²)-distributed random variables {η_j} for j = 1, …, k. Start replacing the {ξ_j} by the {η_j}, beginning at the right-hand end. Define

    S_j = ξ_1 + ⋯ + ξ_{j−1} + η_{j+1} + ⋯ + η_k.

Notice that S_k + ξ_k = Z and that S_1 + η_1 has a N(0, 1) distribution.
Choose a 𝒞³(IR) function f, as in Section 3. Theorem 12 has shown that convergence for expectations of infinitely differentiable functions of random variables is enough to establish convergence in distribution; convergence for functions in 𝒞³(IR) is more than enough. We need to show that IPf(Z) is close to IPf(N(0, 1)).
Apply inequality (16) with X = S_j, Y = ξ_j, and W = η_j. Because S_j + η_j = S_{j−1} + ξ_{j−1} for j = 2, …, k,

    (17) |IPf(Z) − IPf(N(0, 1))| ≤ Σ_j |IPf(S_j + ξ_j) − IPf(S_j + η_j)|
                                 ≤ Σ_j IP|R(S_j, ξ_j)| + Σ_j IP|R(S_j, η_j)|
                                 ≤ C Σ_j IP|ξ_j|³ + C Σ_j IP|η_j|³.
With this bound in hand, the proof of the first version of the Central Limit
Theorem presents no difficulty.
18 Liapounoff Central Limit Theorem. For each n let Z_n be a sum of independent random variables ξ_{n1}, ξ_{n2}, …, ξ_{nk(n)} with zero means and variances that sum to one. If the Liapounoff condition

    (19) Σ_j IP|ξ_{nj}|³ → 0 as n → ∞

is satisfied, then Z_n ⇝ N(0, 1).
PROOF. Choose and fix an f in 𝒞³(IR). Check that IPf(Z_n) → IPf(N(0, 1)). The replacement normal random variables are denoted by η_{n1}, …, η_{nk(n)}. The sum η_{n1} + ⋯ + η_{nk(n)} has a N(0, 1) distribution. Write σ_{nj}² for the variance of ξ_{nj}, and Δ_n for the sum on the left-hand side of (19). With subscripting n's attached, the bound (17) becomes

    |IPf(Z_n) − IPf(N(0, 1))| ≤ CΔ_n + C Σ_j σ_{nj}³ IP|N(0, 1)|³.

By Jensen's inequality, σ_{nj}³ = (IPξ_{nj}²)^(3/2) ≤ IP|ξ_{nj}|³, which shows that the sum contributed by the normal increments is less than Δ_n IP|N(0, 1)|³. Two calls upon (19) as n → ∞ complete the proof. □
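A quick simulation illustrates the conclusion (my illustration; uniform summands are an arbitrary choice satisfying the hypotheses). The empirical distribution function of Z_n, a sum of n independent Uniform(−c, c) summands with zero means and variances summing to one, is compared with the standard normal distribution function at a few points.

```python
import math
import random

random.seed(4)

def std_normal_cdf(x):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n = 50
c = math.sqrt(3.0 / n)       # Uniform(-c, c) has variance 1/n
trials = 10_000

def z_n():
    """Sum of n independent summands: zero means, variances summing to one."""
    return sum(random.uniform(-c, c) for _ in range(n))

draws = [z_n() for _ in range(trials)]

# Largest discrepancy between the empirical distribution of Z_n and N(0, 1)
# at a handful of points.
max_err = max(
    abs(sum(z <= x for z in draws) / trials - std_normal_cdf(x))
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0)
)
```

Even for moderate n the discrepancy is within Monte Carlo noise, which is consistent with the third-moment bound (17).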
The Liapounoff condition (19) imposes the unnecessary constraint of finite third moments upon the summands. Liapounoff himself was able to weaken this to a condition on the (2 + δ)th moments, for some δ > 0. The remainder term R(x, y) in the Taylor expansion (14) does not increase as fast as |y|³ though:

    (20) |R(x, y)| = |f(x + y) − f(x) − yf′(x) − ½y²f″(x)|
                  = |½y²f″(x + θ′y) − ½y²f″(x)|
                  ≤ ‖f″‖y² for all x and y.

The new bound improves upon (15) for large |y|, but not for small |y|. To have it both ways, apply (15) if |y| ≤ ε and (20) otherwise. Increase C to the maximum of ⅙‖f‴‖ and ‖f″‖. Then the bound on the expected remainder is sharpened to

    (21) IP|R(X, Y)| ≤ C IP|Y|³{|Y| ≤ ε} + C IPY²{|Y| > ε}
                    ≤ εC IP(Y²) + C IPY²{|Y| > ε}.
22 Lindeberg Central Limit Theorem. For each n let Z_n be a sum of independent random variables ξ_{n1}, ξ_{n2}, …, ξ_{nk(n)} with zero means and variances that sum to one. If, for each fixed ε > 0, the Lindeberg condition

    (23) Σ_j IPξ_{nj}²{|ξ_{nj}| > ε} → 0 as n → ∞

is satisfied, then Z_n ⇝ N(0, 1).
PROOF. Use the same notation as in the proof of Theorem 18. Denote the left-hand side of (23) by L_n(ε). Stop one line earlier than before in the application of (17):

    |IPf(Z_n) − IPf(N(0, 1))| ≤ Σ_j IP|R(S_{nj}, ξ_{nj})| + Σ_j IP|R(S_{nj}, η_{nj})|.

From (21), the first sum is less than

    C Σ_j [ε IP(ξ_{nj}²) + IPξ_{nj}²{|ξ_{nj}| > ε}],

which equals Cε + CL_n(ε) because the variances sum to one. For the second sum retain the bound

    C Σ_j σ_{nj}³ IP|N(0, 1)|³

but, in place of Jensen's inequality, use

    Σ_j σ_{nj}³ ≤ max_j σ_{nj} Σ_j σ_{nj}² ≤ max_j (ε² + IPξ_{nj}²{|ξ_{nj}| > ε})^(1/2)
               ≤ (ε² + L_n(ε))^(1/2).
The strange-looking Lindeberg condition is not as artificial as one might think. For example, consider the standardized summands for a sequence {Y_i} of independent, identically distributed random variables with zero means and unit variances: ξ_{nj} = n^(−1/2)Y_j for j = 1, 2, …, n. In this case

    L_n(ε) = n IP n^(−1)Y_1²{|Y_1| ≥ n^(1/2)ε},

which tends to zero as n tends to infinity, by dominated convergence, because Y_1² is integrable. It is even more comforting to know that the Lindeberg condition comes very close to being a necessary condition (Feller 1971, Section XV.6) for the Central Limit Theorem to hold.
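For one concrete iid case the Lindeberg sum can be evaluated in closed form (my illustration; the double-exponential distribution is an arbitrary choice with unit variance, not from the text). With Y having density exp(−|y|/B)/(2B) and B = 2^(−1/2), integration by parts over both tails gives IPY²{|Y| ≥ a} = exp(−a/B)(a² + 2aB + 2B²), so L_n(ε) can be watched as it decreases to zero.

```python
import math

B = 1 / math.sqrt(2)   # double-exponential scale chosen so the variance 2*B*B is one

def lindeberg_sum(n, eps):
    """L_n(eps) = IP Y^2 {|Y| >= sqrt(n) * eps} for the standardized iid
    summands xi_nj = n**-0.5 * Y_j, with Y double-exponential of scale B.

    Integrating y**2 * exp(-y / B) / B over [a, oo) (both tails combined)
    gives exp(-a / B) * (a*a + 2*a*B + 2*B*B), where a = sqrt(n) * eps.
    """
    a = math.sqrt(n) * eps
    return math.exp(-a / B) * (a * a + 2 * a * B + 2 * B * B)

eps = 0.1
values = [lindeberg_sum(n, eps) for n in (10, 100, 1_000, 10_000, 100_000)]
# With a = 0 the formula returns the full second moment, 2*B*B = 1.
full_variance = lindeberg_sum(0, eps)
```

The sequence of values is strictly decreasing and eventually negligible, exactly as dominated convergence predicts for any Y with a finite second moment.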
24 Example (A Central Limit Theorem for the Sample Median). Let M_n be the median of a sample {Y_1, Y_2, …, Y_n} from a distribution function G with median M. For simplicity, suppose the sample size n is odd, n = 2N + 1, so that M_n equals the (N + 1)st order statistic of the sample. Suppose also that the underlying distribution function G has a positive derivative γ at its median. To prove that

    n^(1/2)(M_n − M) ⇝ N(0, ¼γ⁻²),

it suffices to check pointwise convergence of the distribution functions

    IP{n^(1/2)(M_n − M) ≤ x}
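The limit in Example 24 can be checked by simulation (my illustration; sampling from the exponential(1) distribution is an arbitrary choice). There M = log 2 and γ = G′(M) = ½, so the limit N(0, ¼γ⁻²) is standard normal; the scaled sample medians should show mean near 0 and variance near 1.

```python
import math
import random
import statistics

random.seed(3)

def sample_median(n):
    """The (N + 1)st order statistic of an odd-sized sample, n = 2N + 1."""
    draws = sorted(random.expovariate(1.0) for _ in range(n))
    return draws[n // 2]

n = 201                  # odd sample size
M = math.log(2)          # median of the exponential(1) distribution
gamma = 0.5              # G'(M) = exp(-log 2) = 1/2
reps = 3000

scaled = [math.sqrt(n) * (sample_median(n) - M) for _ in range(reps)]
mean_hat = statistics.fmean(scaled)
var_hat = statistics.pvariance(scaled)
target_var = 1 / (4 * gamma ** 2)    # the limit variance from Example 24
```

Both the simulated mean and variance match the predicted N(0, 1) limit to within sampling error, already at this moderate sample size.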