
Chapter 2

Weak Convergence

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued.
Chapters 2 and 3 extend this notion by defining what it means for a sequence of random
variables to have a limit. As it turns out, there is more than one sensible way to do this.
Chapters 2 and 4 (and, to a lesser extent, Chapter 3) lay the theoretical groundwork for
nearly all of the statistical topics that will follow. While the material in Chapter 2 is
essential, readers may wish to skip Chapter 3 on a first reading. As is common throughout
the book, some of the proofs here have been relegated to the exercises.

2.1 Modes of Convergence

Whereas the limit of a sequence of real numbers is unequivocally expressed by Definition
1.32, in the case of random variables there are several ways to define the convergence of a
sequence. This section discusses three such definitions, or modes, of convergence; Section 3.1
presents a fourth. Because it is often easier to understand these concepts in the univariate
case than the multivariate case, we only consider univariate random variables here, deferring
the analogous multivariate topics to Section 2.3.

2.1.1 Convergence in Probability

What does it mean for the sequence X1 , X2 , . . . of random variables to converge to, say,
the random variable X? Under what circumstances should one write Xn → X? We begin
by considering a definition of convergence that requires that Xn and X be defined on the
same sample space. For this form of convergence, called convergence in probability, the
absolute difference |Xn − X|, itself a random variable, should be arbitrarily close to zero
with probability arbitrarily close to one. More precisely, we make the following definition.
Definition 2.1 Let {Xn}n≥1 and X be defined on the same probability space. We
say that Xn converges in probability to X, written Xn →P X, if for any ε > 0,

P (|Xn − X| < ε) → 1 as n → ∞. (2.1)

It is very common that the X in Definition 2.1 is a constant, say X ≡ c. In such cases, we
simply write Xn →P c. When we replace X by c in Definition 2.1, we do not need to concern
ourselves with the question of whether X is defined on the same sample space as Xn because
any constant may be defined as a random variable on any sample space. In the most common
statistical usage of convergence to a constant c, we take c to be some parameter θ and Xn
to be an estimator of θ:

Definition 2.2 If Xn →P θ, Xn is said to be consistent (or weakly consistent) for θ.

As the name suggests, weak consistency is weaker than (i.e., implied by) a condition called
"strong consistency," which will be defined in Chapter 3. "Consistency," used without the
word “strong” or “weak,” generally refers to weak consistency. Throughout this book, we
shall refer repeatedly to (weakly) consistent estimators, whereas strong consistency plays a
comparatively small role.

Example 2.3 Suppose that X1, X2, . . . are independent and identically distributed
(i.i.d.) uniform (0, θ) random variables, where θ is an unknown positive constant.
For n ≥ 1, let X(n) be defined as the largest value among X1 through Xn: that
is, X(n) = max_{1≤i≤n} Xi. Then we may show that X(n) is a consistent estimator
of θ as follows:

By Definition 2.1, we wish to show that for an arbitrary ε > 0, P (|X(n) − θ| <
ε) → 1 as n → ∞. In this particular case, we can evaluate P (|X(n) − θ| < ε)
directly by noting that X(n) cannot possibly be larger than θ, so that

P (|X(n) − θ| < ε) = P (X(n) > θ − ε) = 1 − P (X(n) ≤ θ − ε).

The maximum X(n) is less than some constant if and only if each of the random
variables X1, . . . , Xn is less than that constant. Therefore, since the Xi are i.i.d.,

P (X(n) ≤ θ − ε) = [P (X1 ≤ θ − ε)]^n = [1 − (ε/θ)]^n if 0 < ε < θ, and 0 if ε ≥ θ.

Since 1 − (ε/θ) is strictly less than 1, we conclude that no matter what positive
value ε takes, P (X(n) ≤ θ − ε) → 0 as desired.
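
A quick Monte Carlo check can make Definition 2.1 concrete here. The following sketch (assuming NumPy is available; θ = 2, ε = 0.05, and 2000 replications are arbitrary illustrative choices, not from the text) estimates P (|X(n) − θ| < ε) for increasing n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, eps = 2.0, 0.05  # illustrative choices, not from the text
for n in [10, 100, 1000, 10000]:
    # 2000 Monte Carlo replications of X_(n) = max of n Uniform(0, theta) draws
    x_max = rng.uniform(0, theta, size=(2000, n)).max(axis=1)
    print(n, np.mean(np.abs(x_max - theta) < eps))
```

The printed frequencies approach 1, matching the exact value 1 − [1 − (ε/θ)]^n computed in the example.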

2.1.2 Probabilistic Order Notation

There are probabilistic analogues of the o and O notations of Section 1.3 that apply to
random variable sequences instead of real number sequences.

Definition 2.4 We write Xn = oP (Yn) if Xn/Yn →P 0.
In particular, oP (1) is shorthand notation for a sequence of random variables that converges
to zero in probability, as illustrated in Equation (2.2) below.
Definition 2.5 We write Xn = OP (Yn) if for every ε > 0, there exist M and N such
that

P (|Xn/Yn| < M ) > 1 − ε for all n > N .

As a special case of Definition 2.5, we refer to any OP (1) sequence as a bounded in probability
sequence:
Definition 2.6 We say that X1, X2, . . . is bounded in probability if Xn = OP (1), i.e.,
if for every ε > 0, there exist M and N such that P (|Xn| < M ) > 1 − ε for n > N .
Definition 2.6 is primarily useful because of the properties of bounded in probability se-
quences established in Exercise 2.2.

Example 2.7 In Example 2.3, we showed that if X1, X2, . . . are independent and
identically distributed uniform (0, θ) random variables, then

max_{1≤i≤n} Xi →P θ as n → ∞.

Equivalently, we may say that

max_{1≤i≤n} Xi = θ + oP (1) as n → ∞. (2.2)

It is also technically correct to write

max_{1≤i≤n} Xi = θ + OP (1) as n → ∞, (2.3)

though Statement (2.3) is less informative than Statement (2.2). On the other
hand, we will see in Example 6.1 that Statement (2.3) may be sharpened
considerably, and made more informative than Statement (2.2), by writing

max_{1≤i≤n} Xi = θ + OP (1/n) as n → ∞.
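
To see why the OP (1/n) statement is plausible before Example 6.1 makes it precise, one can simulate n(θ − max_{1≤i≤n} Xi) and watch its quantiles stabilize rather than grow. A rough sketch (assuming NumPy; all numeric choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0
for n in [10, 100, 1000, 10000]:
    x_max = rng.uniform(0, theta, size=(2000, n)).max(axis=1)
    scaled = n * (theta - x_max)  # bounded in probability if the O_P(1/n) claim holds
    print(n, np.round(np.quantile(scaled, [0.5, 0.95]), 2))
```

The 50th and 95th percentiles of n(θ − X(n)) settle down as n grows, which is exactly what bounded in probability means for the scaled sequence.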

Using the oP notation defined above, it is possible to rewrite Taylor’s theorem 1.18 in a
form involving random variables. This theorem will prove to be useful in later chapters; for
instance, it is used to prove the result known as the delta method in Section 5.1.1.
Theorem 2.8 Suppose that Xn →P θ0 for a sequence of random variables X1, X2, . . .
and a constant θ0. Furthermore, suppose that f (x) has d derivatives at the point
θ0. Then there is a random variable Yn such that

f (Xn) = f (θ0) + (Xn − θ0)f′(θ0) + · · · + [(Xn − θ0)^d / d!] [f^(d)(θ0) + Yn] (2.4)

and Yn = oP (1) as n → ∞.
The proof of Theorem 2.8 is a useful example of an "epsilon-delta" proof (named for the ε
and δ in Definition 1.11).
Proof: Let

Yn = (d! / (Xn − θ0)^d) [f (Xn) − f (θ0) − (Xn − θ0)f′(θ0) − · · · − ((Xn − θ0)^(d−1) / (d−1)!) f^(d−1)(θ0)] − f^(d)(θ0) if Xn ≠ θ0,

and Yn = 0 if Xn = θ0.
Then Equation (2.4) is trivially satisfied. We will show that Yn = oP (1), which means Yn →P 0,
by demonstrating that for an arbitrary ε > 0, there exists N such that P (|Yn| < ε) > 1 − ε
for all n > N. By Taylor's Theorem 1.18, there exists some δ > 0 such that |Xn − θ0| < δ
implies |Yn| < ε (that is, the event {ω : |Xn(ω) − θ0| < δ} is contained in the event {ω :
|Yn(ω)| < ε}). Furthermore, because Xn →P θ0, we know that there exists some N such that
P (|Xn − θ0| < δ) > 1 − ε for all n > N. Putting these facts together, we conclude that for
all n > N,

P (|Yn| < ε) ≥ P (|Xn − θ0| < δ) > 1 − ε,
which proves the result.
In later chapters, we will generally write simply

f (Xn) = f (θ0) + (Xn − θ0)f′(θ0) + · · · + [(Xn − θ0)^d / d!] [f^(d)(θ0) + oP (1)] as n → ∞ (2.5)

when referring to the result of Theorem 2.8. A technical quibble with Expression (2.5) is
that it suggests that any random variable Yn satisfying (2.4) must also be oP (1). This is not
quite true: Since Yn may be defined arbitrarily in the event that Xn = θ0 and still satisfy
(2.4), if

P (Xn = θ0) > c for all n

for some positive constant c, then some Yn that is not oP (1) may still satisfy (2.4). However,
as long as one remembers what Theorem 2.8 says, there is little danger in using Expression (2.5).
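
As a concrete check on Theorem 2.8, one can solve Equation (2.4) for Yn in a simple case and verify numerically that it is oP (1). In the sketch below (assuming NumPy; the choices f = log, θ0 = 1, d = 2, and i.i.d. standard exponential draws are all illustrative, not from the text), f′(1) = 1 and f′′(1) = −1:

```python
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(1.0, size=(2000, n)).mean(axis=1)  # Xn ->P theta0 = 1
    # Solve (2.4) for Yn with f = log and d = 2:
    # log(Xn) = 0 + (Xn - 1)*1 + ((Xn - 1)^2 / 2)*(-1 + Yn)
    yn = 2 * (np.log(xbar) - (xbar - 1)) / (xbar - 1) ** 2 + 1
    print(n, np.mean(np.abs(yn) > 0.05))  # estimates P(|Yn| > 0.05), tending to 0
```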

2.1.3 Convergence in Distribution

As the name suggests, convergence in distribution (also known as convergence in law) has
to do with convergence of the distribution functions (or “laws”) of random variables. Given
a random variable X, the distribution function of X is the function

F (x) = P (X ≤ x). (2.6)

Any distribution function F (x) is nondecreasing and right-continuous, and it has limits
limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. Conversely, any function F (x) with these proper-
ties is a distribution function for some random variable.
It is not enough to define convergence in distribution as simple pointwise convergence of a
sequence of distribution functions; there are technical reasons that such a simplistic definition
fails to capture any useful concept of convergence of random variables. These reasons are
illustrated by the following two examples.
Example 2.9 Let Xn be normally distributed with mean 0 and variance n. Then
the distribution function of Xn is Fn(x) = Φ(x/√n), where Φ(z) denotes the
standard normal distribution function. Because Φ(0) = 1/2, we see that for any
fixed x, Fn(x) → 1/2 as n → ∞. But the function that is constant at 1/2 is not
a distribution function. This example shows that not all convergent sequences of
distribution functions have limits that are distribution functions.

Example 2.10 By any sensible definition of convergence, 1/n should converge to 0
as n → ∞. But consider the distribution functions Fn (x) = I{x ≥ 1/n} and
F (x) = I{x ≥ 0} corresponding to the constant random variables 1/n and 0. We
do not have pointwise convergence of Fn (x) to F (x), since Fn (0) = 0 for all n but
F (0) = 1. However, Fn (x) → F (x) is true for all x 6= 0. Not coincidentally, the
point x = 0 where convergence of Fn (x) to F (x) fails is the only point at which
the function F (x) is not continuous.

To write a sensible definition of convergence in distribution, Example 2.9 demonstrates that
we should require that the limit of distribution functions be a distribution function itself,
say F (x), while Example 2.10 suggests that we should exclude points where F (x) is not
continuous. We therefore arrive at the following definition:
Definition 2.11 Suppose that X has distribution function F (x) and that Xn has
distribution function Fn(x) for each n. Then we say Xn converges in distribution
to X, written Xn →d X, if Fn(x) → F (x) as n → ∞ for all x at which F (x) is
continuous. Convergence in distribution is sometimes called convergence in law
and written Xn →L X.

The notation of Definition 2.11 may be stretched a bit; sometimes the expressions on either
side of the →d symbol may be distribution functions or other notations indicating certain
distributions, rather than actual random variables as in the definition. The meaning is
always clear even if the notation is not consistent.

However, one common mistake should be avoided at all costs: If →d (or →P or any other "limit
arrow") indicates that n → ∞, then n must never appear on the right side of the arrow. See
Expression (2.8) in Example 2.12 for an example of how this rule is sometimes violated.

Example 2.12 The Central Limit Theorem for i.i.d. sequences: Let X1, . . . , Xn be
independent and identically distributed (i.i.d.) with mean µ and finite variance
σ². Then by a result that will be covered in Chapter 4 (but which is perhaps
already known to the reader),

√n ((1/n) Σ_{i=1}^n Xi − µ) →d N (0, σ²), (2.7)

where N (0, σ²) denotes a normal distribution with mean 0 and variance σ².
[N (0, σ²) is not actually a random variable; this is an example of "stretching
the →d notation" referred to above.]

Because Equation (2.7) may be interpreted as saying that the sample mean X̄n
has approximately a N (µ, σ²/n) distribution, it may seem tempting to "rewrite"
Equation (2.7) as

(1/n) Σ_{i=1}^n Xi →d N (µ, σ²/n). (2.8)

Resist the temptation to do this! As pointed out above, n should never appear
on the right side of a limit arrow (as long as that limit arrow expresses the idea
that n is tending to ∞).

By the result of Exercise 2.2, the limit statement (2.7) implies that the left side
of that statement is OP (1). We may therefore write (after dividing through by
√n and adding µ)

(1/n) Σ_{i=1}^n Xi = µ + OP (1/√n) as n → ∞. (2.9)

Unlike Expression (2.8), Equation (2.9) is perfectly legal; and although it is less
specific than Expression (2.7), it expresses at a glance the √n-rate of convergence
of the sample mean to µ.
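
A simulation makes the contrast between (2.8) and (2.9) visible: the raw error X̄n − µ shrinks, while √n(X̄n − µ) keeps a stable spread. A sketch (assuming NumPy; standard exponential draws, for which µ = σ = 1, are an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.0
for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(1.0, size=(2000, n)).mean(axis=1)
    # spread of the raw error versus the sqrt(n)-scaled error
    print(n, round((xbar - mu).std(), 4), round((np.sqrt(n) * (xbar - mu)).std(), 4))
```

The first column of output shrinks like 1/√n while the second hovers near σ = 1, which is the content of Equation (2.9).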

Unlike Xn →P X, the expression Xn →d X does not require Xn − X to be a random variable;
in fact, Xn →d X is possible even if Xn and X are not defined on the same sample space.
Even if Xn and X do have a joint distribution, it is easy to construct an example in which
Xn →d X but Xn does not converge to X in probability: Take Z1 and Z2 to be independent
and identically distributed standard normal random variables, then let Xn = Z1 for all n
and X = Z2. Since Xn and X have exactly the same distribution by construction, Xn →d X
in this case. However, since Xn − X is a N (0, 2) random variable for all n, we do not have
Xn →P X.

We conclude that Xn →d X cannot possibly imply Xn →P X (but see Theorem 2.14 for a
special case in which it does). However, the implication in the other direction is always true:

Theorem 2.13 If Xn →P X, then Xn →d X.
Proof: Let Fn(x) and F (x) denote the distribution functions of Xn and X, respectively.
Assume that Xn →P X. We need to show that Fn(t) → F (t), where t is any point of continuity
of F (x).

Choose any ε > 0. Whenever Xn ≤ t, it must be true that either X ≤ t + ε or |Xn − X| > ε.
This implies that

Fn(t) ≤ F (t + ε) + P (|Xn − X| > ε).

Similarly, whenever X ≤ t − ε, either Xn ≤ t or |Xn − X| > ε, implying

F (t − ε) ≤ Fn(t) + P (|Xn − X| > ε).

We conclude that for arbitrary n and ε > 0,

F (t − ε) − P (|Xn − X| > ε) ≤ Fn(t) ≤ F (t + ε) + P (|Xn − X| > ε). (2.10)

Taking both the lim inf_n and the lim sup_n of the above inequalities, we conclude [since
Xn →P X implies P (|Xn − X| > ε) → 0] that

F (t − ε) ≤ lim inf_n Fn(t) ≤ lim sup_n Fn(t) ≤ F (t + ε)

for all ε. Since t is a continuity point of F (x), letting ε → 0 implies

F (t) = lim inf_n Fn(t) = lim sup_n Fn(t),

so we conclude Fn(t) → F (t) and the theorem is proved.


We remarked earlier that Xn →d X could not possibly imply Xn →P X because the latter
expression requires that Xn and X be defined on the same sample space for every n. However,
a constant c may be considered to be a random variable defined on any sample space; thus,
it is reasonable to ask whether Xn →d c implies Xn →P c. The answer is yes:

Theorem 2.14 Xn →d c if and only if Xn →P c.

Proof: We only need to prove that Xn →d c implies Xn →P c, since the other direction is a
special case of Theorem 2.13. If F (x) is the distribution function I{x ≥ c} of the constant
random variable c, then c + ε and c − ε are points of continuity of F (x) for any ε > 0.
Therefore, Xn →d c implies that Fn(c − ε) → F (c − ε) = 0 and Fn(c + ε) → F (c + ε) = 1 as
n → ∞. We conclude that

P (−ε < Xn − c ≤ ε) = Fn(c + ε) − Fn(c − ε) → 1,

which means Xn →P c.
When we speak of convergence of random variables to a constant in this book, most com-
monly we refer to convergence in probability, which (according to Theorem 2.14) is equivalent
to convergence in distribution. On the other hand, when we speak of convergence to a ran-
dom variable, we nearly always refer to convergence in distribution. Therefore, in a sense,
Theorem 2.14 makes convergence in distribution the most important form of convergence in
this book. This type of convergence is often called “weak convergence”.

2.1.4 Convergence in Mean

The third and final mode of convergence in this chapter is useful primarily because it is
sometimes easy to verify and thus gives a quick way to prove convergence in probability, as
Theorem 2.17 below implies.

Definition 2.15 Let a be a positive constant. We say that Xn converges in ath mean
to X, written Xn →a X, if

E |Xn − X|^a → 0 as n → ∞. (2.11)

Two specific cases of Definition 2.15 deserve special mention. When a = 1, we normally omit
mention of the a and simply refer to the condition E |Xn − X| → 0 as convergence in mean.
Convergence in mean is not equivalent to E Xn → E X: For one thing, E Xn → E X is
possible without any regard to the joint distribution of Xn and X, whereas E |Xn − X| → 0
clearly requires that Xn − X be a well-defined random variable.
Even more important than a = 1 is the special case a = 2:
Definition 2.16 We say that Xn converges in quadratic mean to X, written Xn →qm X,
if

E |Xn − X|² → 0 as n → ∞.

Convergence in quadratic mean is important for two reasons. First, it is often quite easy
to check; in Exercise 2.6, you are asked to prove that Xn →qm c if and only if E Xn → c
and Var Xn → 0 for some constant c. Second, quadratic mean convergence (indeed, ath
mean convergence for any a > 0) is stronger than convergence in probability, which means
that weak consistency of an estimator may be established by checking that it converges in
quadratic mean. This latter property is a corollary of the following result:

Theorem 2.17 (a) For a constant c, Xn →qm c if and only if E Xn → c and Var Xn → 0.

(b) For fixed a > 0, Xn →a X implies Xn →P X.

Proof: Part (a) is the subject of Exercise 2.6. Part (b) relies on Markov's inequality (1.35),
which states that

P (|Xn − X| ≥ ε) ≤ (1/ε^a) E |Xn − X|^a (2.12)

for an arbitrary fixed ε > 0. If Xn →a X, then by definition the right hand side of inequality
(2.12) goes to zero as n → ∞, so the left side also goes to zero and we conclude that Xn →P X
by definition.

Example 2.18 Any unbiased estimator is consistent if its variance goes to zero. This
fact follows directly from Theorem 2.17(a) and (b). As an example, consider a
sequence of independent and identically distributed random variables X1, X2, . . .
with mean µ and finite variance σ². The sample mean

X̄n = (1/n) Σ_{i=1}^n Xi

has mean µ and variance σ²/n. Therefore, X̄n is unbiased and its variance goes
to zero, so we conclude that it is consistent; i.e., X̄n →P µ. This fact is the Weak
Law of Large Numbers (see Theorem 2.19) for the case of random variables with
finite variance.
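
The logic of Example 2.18 is easy to check by simulation: estimate E |X̄n − µ|² and compare it with the exact value σ²/n. A sketch (assuming NumPy; Bernoulli(1/2) draws, with µ = 1/2 and σ² = 1/4, are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 0.5, 0.25
for n in [10, 100, 1000]:
    xbar = rng.binomial(1, 0.5, size=(2000, n)).mean(axis=1)
    # Monte Carlo estimate of E|Xbar_n - mu|^2 next to its exact value sigma^2/n
    print(n, round(np.mean((xbar - mu) ** 2), 5), sigma2 / n)
```

Since E |X̄n − µ|² = σ²/n → 0, X̄n →qm µ, and Theorem 2.17(b) then gives consistency.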

Exercises for Section 2.1


Exercise 2.1 For each of the three cases below, prove that Xn →P 1:

(a) Xn = 1 + nYn, where Yn is a Bernoulli random variable with mean 1/n.

(b) Xn = Yn / log n, where Yn is a Poisson random variable with mean Σ_{i=1}^n (1/i).

(c) Xn = (1/n) Σ_{i=1}^n Yi², where the Yi are independent standard normal random
variables.

Exercise 2.2 This exercise deals with bounded in probability sequences; see Defini-
tion 2.6.

(a) Prove that if Xn →d X for some random variable X, then Xn is bounded in
probability.

Hint: You may use the fact that any interval of real numbers must contain a
point of continuity of F (x). Also, recall that F (x) → 1 as x → ∞.

(b) Prove that if Xn is bounded in probability and Yn →P 0, then Xn Yn →P 0.

Hint: For fixed ε > 0, argue that there must be M and N such that P (|Xn| <
M ) > 1 − ε/2 and P (|Yn| < ε/M ) > 1 − ε/2 for all n > N. What is then the
smallest possible value of P (|Xn| < M and |Yn| < ε/M )? Use this result to prove
Xn Yn →P 0.

Exercise 2.3 The Poisson approximation to the binomial:

(a) Suppose that Xn is a binomial random variable with n trials, where the
probability of success on each trial is λ/n. Let X be a Poisson random variable
with the same mean as Xn, namely λ. Prove that Xn →d X.

Hint: Argue that it suffices to show that P (Xn = k) → P (X = k) for all
nonnegative integers k. Then use Stirling's formula (1.19).

(b) Part (a) can be useful in approximating binomial probabilities in cases where
the number of trials is large but the success probability is small: Simply consider
a Poisson random variable with the same mean as the binomial variable. Assume
that Xn is a binomial random variable with parameters n and 2/n. Create a plot
on which you plot P (X10 = k) for k = 0, . . . , 10. On the same set of axes, plot
the same probabilities for X20, X50, and the Poisson variable we'll denote by X∞.
Try looking at the same plot but with the probabilities transformed using the
logit (log-odds) transformation logit(t) = log(t) − log(1 − t). Which plot makes
it easier to characterize the trend you observe?
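
For part (b), something like the following sketch could serve as a starting point (assuming NumPy, SciPy, and Matplotlib are available; this is one of many reasonable ways to draw the two plots, not a prescribed solution):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson

k = np.arange(0, 11)

def logit(t):
    return np.log(t) - np.log(1 - t)

fig, (raw, trans) = plt.subplots(1, 2, figsize=(10, 4))
for n in (10, 20, 50):
    p = binom.pmf(k, n, 2 / n)  # X_n is binomial(n, 2/n), mean 2
    raw.plot(k, p, marker="o", label=f"n = {n}")
    trans.plot(k, logit(p), marker="o", label=f"n = {n}")
p_inf = poisson.pmf(k, 2)  # the Poisson limit X_infinity, mean 2
raw.plot(k, p_inf, "kx--", label="Poisson(2)")
trans.plot(k, logit(p_inf), "kx--", label="Poisson(2)")
raw.set_title("P(X = k)")
trans.set_title("logit P(X = k)")
raw.legend()
plt.show()
```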

Exercise 2.4 Suppose that X1, . . . , Xn are independent and identically distributed
Uniform(0, 1) random variables. For a real number t, let

Gn(t) = Σ_{i=1}^n I{Xi ≤ t}.

(a) What is the distribution of Gn(t) if 0 < t < 1?

(b) Suppose c > 0. Find the distribution of a random variable X such that
Gn(c/n) →d X. Justify your answer.

(c) How does your answer to part (b) change if X1, . . . , Xn are from a stan-
dard exponential distribution instead of a uniform distribution? The standard
exponential distribution function is F (t) = 1 − e−t.
Exercise 2.5 For each of the three examples in Exercise 2.1, does Xn →qm 1? Justify
your answers.

Exercise 2.6 Prove Theorem 2.17(a).

Exercise 2.7 The converse of Theorem 2.17(b) is not true. Construct a counterex-
ample in which Xn →P 0 but E Xn = 1 for all n (by Theorem 2.17, if E Xn = 1,
then Xn cannot converge in quadratic mean to 0).

Hint: The mean of a random variable may be strongly influenced by a large
value that occurs with small probability (and if this probability goes to zero,
then the mean can be influenced in this way without destroying convergence in
probability).
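
To see the hint in action numerically, one construction consistent with it takes Xn = n with probability 1/n and 0 otherwise (a sketch assuming NumPy; the construction is an illustration, and proving it works is still the exercise):

```python
import numpy as np

rng = np.random.default_rng(5)
# Xn = n with probability 1/n, else 0, so E Xn = n * (1/n) = 1 for every n,
# while P(|Xn| > eps) = 1/n -> 0.
for n in [10, 100, 1000]:
    xn = n * (rng.random(20000) < 1 / n)
    print(n, round(xn.mean(), 3), np.mean(np.abs(xn) > 0.01))
```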

Exercise 2.8 Prove or disprove this statement: If there exists M such that P (|Xn| <
M ) = 1 for all n, then Xn →P c implies Xn →qm c.

Exercise 2.9 (a) Prove that if 0 < a < b, then convergence in bth mean is stronger
than convergence in ath mean; i.e., Xn →b X implies Xn →a X.

Hint: Use Exercise 1.40 with α = b/a.

(b) Prove by counterexample that the conclusion of part (a) is not true in general
if 0 < b < a.

2.2 Consistent Estimates of the Mean

For a sequence of random vectors X1, X2, . . ., we denote the nth sample mean by

X̄n = (1/n) Σ_{i=1}^n Xi.

We begin with a formal statement of the weak law of large numbers for an independent
and identically distributed sequence. Later, we discuss some cases in which the sequence of
random vectors is not independent and identically distributed.

Chapter 3

Strong Convergence

There are multiple ways to define the convergence of a sequence of random variables. Chapter
2 introduced convergence in probability, convergence in distribution, and convergence in
quadratic mean. We now consider a fourth mode of convergence, almost sure convergence
or convergence with probability one. We will see that almost sure convergence implies both
convergence in probability and convergence in distribution, which is why we sometimes use
the term “strong” for almost sure convergence and “weak” for the other two.
The terms “weak” and “strong” do not indicate anything about their importance; indeed,
the “weak” modes of convergence are used much more frequently in asymptotic statistics
than the strong mode. Because weak convergence dominates the remainder of this book
beginning with Chapter 4, a reader may safely skip much of the material in the current
chapter if time is limited; however, the quantile function and the Dominated Convergence
Theorem of Section 3.3 are used elsewhere, and at least these topics should be reviewed
before moving on. Due to the technical nature of the material of this chapter, the exercises
are almost exclusively devoted to proofs.

3.1 Strong Consistency Defined

A random variable like Xn or X is a function on a sample space, say Ω. Suppose that we
fix a particular element of that space, say ω0, so we obtain the real numbers Xn(ω0) and
X(ω0 ). If Xn (ω0 ) → X(ω0 ) as n → ∞ in the sense of Definition 1.1, then ω0 is contained in
the event

S = {ω ∈ Ω : Xn (ω) → X(ω)}. (3.1)

If the probability of S—that is, E I{Xn → X}—equals 1, then we say that Xn converges
almost surely to X:
Definition 3.1 Suppose X and X1, X2, . . . are random variables defined on the same
sample space Ω (and as usual P denotes the associated probability measure). If

P ({ω ∈ Ω : Xn(ω) → X(ω)}) = 1,

then Xn is said to converge almost surely (or with probability one) to X, denoted
Xn →a.s. X or Xn → X a.s. or Xn → X w.p. 1.
In other words, convergence with probability one means exactly what it sounds like: The
probability that Xn converges to X equals one. Later, in Theorem 3.3, we will formulate
an equivalent definition of almost sure convergence that makes it much easier to see why
it is such a strong form of convergence of random variables. Yet the intuitive simplicity of
Definition 3.1 makes it the standard definition.
As in the case of convergence in probability, we may replace the limiting random variable
X by any constant c, in which case we write Xn →a.s. c. In the most common statistical usage
of convergence to a constant, the random variable Xn is some estimator of a particular
parameter, say θ:

Definition 3.2 If Xn →a.s. θ, Xn is said to be strongly consistent for θ.
As the names suggest, strong consistency implies consistency (also known as weak consis-
tency), a fact we now explore in more depth.

3.1.1 Strong Consistency versus Consistency

As before, suppose that X and X1, X2, . . . are random variables defined on the same sample
space, Ω. For given n and ε > 0, define the events

An = {ω ∈ Ω : |Xk(ω) − X(ω)| < ε for all k ≥ n} (3.2)

and

Bn = {ω ∈ Ω : |Xn(ω) − X(ω)| < ε}. (3.3)
First, note that An must be contained in Bn and that both An and Bn imply that Xn is close
to X as long as ε is small. Therefore, both P (An) → 1 and P (Bn) → 1 seem like reasonable
ways to define the convergence of Xn to X. Indeed, as we have already seen in Definition
2.1, convergence in probability means precisely that P (Bn) → 1 for any ε > 0.
Yet what about the sets An? One fact is immediate: Since An ⊂ Bn, we must have
P (An) ≤ P (Bn). Therefore, P (An) → 1 implies P (Bn) → 1. In other words, if we were to
take P (An) → 1 for all ε > 0 to be the definition of a new form of convergence of random
sequences, then this form of convergence would be stronger than (i.e., it would imply) con-
vergence in probability. By now, the reader may already have guessed that this new form of
convergence is actually equivalent to almost sure convergence:
Theorem 3.3 With An defined as in Equation (3.2), P (An) → 1 for any ε > 0 if and
only if Xn →a.s. X.

Proving Theorem 3.3 is the subject of Exercise 3.1. The following corollary now follows from
the preceding discussion:

Corollary 3.4 If Xn →a.s. X, then Xn →P X.
The converse of Corollary 3.4 is not true, as the following example illustrates.

[Figure 3.1 here: rows of intervals on (0, 1], each row splitting (0, 1] into 2m + 1 equal pieces: J1; J2 through J4; J5 through J9; J10 through J16; J17 through J25; and so on.]

Figure 3.1: Example 3.5, in which P (Jn) → 0 as n → ∞, which means that I{Jn} →P 0.
However, the intervals Jn repeatedly cover the entire interval (0, 1], so the subset of (0, 1] on
which I{Jn} converges to 0 is empty!

Example 3.5 Take Ω to be the half-open interval (0, 1], and for any interval J ⊂ Ω,
say J = (a, b], take P (J) = b − a to be the length of that interval. Define a
sequence of intervals J1, J2, . . . as follows (see Figure 3.1):

J1 = (0, 1]
J2 through J4 = (0, 1/3], (1/3, 2/3], (2/3, 1]
J5 through J9 = (0, 1/5], (1/5, 2/5], (2/5, 3/5], (3/5, 4/5], (4/5, 1]
...
J(m²+1) through J((m+1)²) = (0, 1/(2m+1)], . . . , (2m/(2m+1), 1]
...

Note in particular that P (Jn) = 1/(2m + 1), where m = ⌊√(n − 1)⌋ is the largest
integer not greater than √(n − 1). Now, define Xn = I{Jn} and take 0 < ε < 1.
Then P (|Xn − 0| < ε) is the same as 1 − P (Jn). Since P (Jn) → 0, we conclude
Xn →P 0 by definition.

However, it is not true that Xn →a.s. 0. Since every ω ∈ Ω is contained in infinitely
many Jn, the set An defined in Equation (3.2) is empty for all n. Alternatively,
consider the set S = {ω : Xn(ω) → 0}. For any ω, Xn(ω) has no limit because
Xn(ω) = 1 and Xn(ω) = 0 both occur for infinitely many n. Thus S is empty.
This is not convergence with probability one; it is convergence with probability
zero!
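
The structure of Example 3.5 can be traced in code. The sketch below (assuming NumPy; the fixed sample point ω = 0.3 is arbitrary) evaluates Xn(ω) = I{Jn}(ω) for n up to 2000 and shows that the indicator returns to 1 once in every "row" of intervals, so Xn(ω) cannot converge:

```python
import numpy as np

def X(n, omega):
    # Indicator of J_n at omega, following the rows in Figure 3.1:
    # row m holds J_{m^2 + 1}, ..., J_{(m+1)^2}, each of width 1/(2m + 1).
    m = int(np.floor(np.sqrt(n - 1)))
    j = n - m * m - 1  # 0-based position of J_n within row m
    width = 1.0 / (2 * m + 1)
    return int(j * width < omega <= (j + 1) * width)

omega = 0.3  # arbitrary fixed point of (0, 1]
hits = [n for n in range(1, 2001) if X(n, omega) == 1]
print(len(hits), hits[:8])  # one hit per row: X_n(omega) = 1 infinitely often
```

Meanwhile P (Jn) = 1/(2m + 1) → 0, so the same sequence still converges to zero in probability.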

3.1.2 Multivariate Extensions

We may extend Definition 3.1 to the multivariate case in a completely straightforward way:

Definition 3.6 Xn is said to converge almost surely (or with probability one) to X
(Xn →a.s. X) if

P (Xn → X as n → ∞) = 1.

Alternatively, since the proof of Theorem 3.3 applies to random vectors as well
as random variables, we say Xn →a.s. X if for any ε > 0,

P (‖Xk − X‖ < ε for all k ≥ n) → 1 as n → ∞. (3.4)

We saw in Theorems 2.27 and 2.34 that continuous functions preserve both convergence in
probability and convergence in distribution. Yet these facts were quite difficult to prove.
Fortunately, the analogous result for almost sure convergence follows immediately from the
results of Chapter 1. Similarly, unlike with convergence in distribution, there is no problem
“stacking” random sequences into vectors while preserving almost sure convergence. The
following theorem is really just a corollary of earlier results (specifically, Theorem 1.16 and
Lemma 1.33).

Theorem 3.7 (a) Suppose that f : S → R^ℓ is a continuous function defined on some
subset S ⊂ R^k, Xn is a k-component random vector, and the range of X and of
each Xn is contained in S with probability 1. If Xn →a.s. X, then f (Xn) →a.s. f (X).

(b) Xn →a.s. X if and only if Xnj →a.s. Xj for all j.

We conclude this section with a simple diagram summarizing the implications among the
modes of convergence defined so far. In the diagram, a double arrow like ⇒ means “implies”.
Note that the picture changes slightly when convergence is to a constant c rather than a
random vector X.
Xn →qm X                                Xn →qm c
    ⇓                                       ⇓
Xn →a.s. X ⇒ Xn →P X ⇒ Xn →d X          Xn →a.s. c ⇒ Xn →P c ⇔ Xn →d c

Exercises for Section 3.1


Exercise 3.1 Let S be the set defined in equation (3.1), so Xn →a.s. X is equivalent to
P (S) = 1 by definition.

(a) Let An be defined as in Equation (3.2). Prove that

ω0 ∈ ∪_{n=1}^∞ An for all ε > 0

if and only if ω0 ∈ S.

Hint: Use Definition 1.1.

(b) Prove Theorem 3.3.

Hint: Note that the sets An are increasing in n, so that by the lower continuity
of any probability measure (which you may assume without proof), limn P (An)
exists and is equal to P (∪_{n=1}^∞ An).
Exercise 3.2 The diagram at the end of this section suggests that neither Xn →a.s. X
nor Xn →qm X implies the other. Construct two counterexamples, one to show that
Xn →a.s. X does not imply Xn →qm X and the other to show that Xn →qm X does not
imply Xn →a.s. X.

3.2 The Strong Law of Large Numbers

Some of the results in this section are presented for univariate random variables and some are
presented for random vectors. Take note of the use of bold print to denote vectors. Nearly
