
Chapter 2

Weak Convergence

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued.
Chapters 2 and 3 extend this notion by defining what it means for a sequence of random
variables to have a limit. As it turns out, there is more than one sensible way to do this.
Chapters 2 and 4 (and, to a lesser extent, Chapter 3) lay the theoretical groundwork for
nearly all of the statistical topics that will follow. While the material in Chapter 2 is
essential, readers may wish to skip Chapter 3 on a first reading. As is common throughout
the book, some of the proofs here have been relegated to the exercises.

2.1 Modes of Convergence

Whereas the limit of a sequence of real numbers is unequivocally expressed by Definition
1.32, in the case of random variables there are several ways to define the convergence of a
sequence. This section discusses three such definitions, or modes, of convergence; Section 3.1
presents a fourth. Because it is often easier to understand these concepts in the univariate
case than the multivariate case, we only consider univariate random variables here, deferring
the analogous multivariate topics to Section 2.3.

2.1.1 Convergence in Probability

What does it mean for the sequence X1 , X2 , . . . of random variables to converge to, say,
the random variable X? Under what circumstances should one write Xn → X? We begin
by considering a definition of convergence that requires that Xn and X be defined on the
same sample space. For this form of convergence, called convergence in probability, the
absolute difference |Xn − X|, itself a random variable, should be arbitrarily close to zero
with probability arbitrarily close to one. More precisely, we make the following definition.
Definition 2.1 Let {Xn}n≥1 and X be defined on the same probability space. We
say that Xn converges in probability to X, written Xn →P X, if for any ε > 0,

P (|Xn − X| < ε) → 1 as n → ∞. (2.1)

It is very common that the X in Definition 2.1 is a constant, say X ≡ c. In such cases, we
simply write Xn →P c. When we replace X by c in Definition 2.1, we do not need to concern
ourselves with the question of whether X is defined on the same sample space as Xn because
any constant may be defined as a random variable on any sample space. In the most common
statistical usage of convergence to a constant c, we take c to be some parameter θ and Xn
to be an estimator of θ:

Definition 2.2 If Xn →P θ, Xn is said to be consistent (or weakly consistent) for θ.

As the name suggests, weak consistency is weaker than (i.e., implied by) a condition called
"strong consistency," which will be defined in Chapter 3. "Consistency," used without the
word “strong” or “weak,” generally refers to weak consistency. Throughout this book, we
shall refer repeatedly to (weakly) consistent estimators, whereas strong consistency plays a
comparatively small role.

Example 2.3 Suppose that X1, X2, . . . are independent and identically distributed
(i.i.d.) uniform (0, θ) random variables, where θ is an unknown positive constant.
For n ≥ 1, let X(n) be defined as the largest value among X1 through Xn: that
is, X(n) = max_{1≤i≤n} Xi. Then we may show that X(n) is a consistent estimator
of θ as follows:

By Definition 2.1, we wish to show that for an arbitrary ε > 0, P (|X(n) − θ| <
ε) → 1 as n → ∞. In this particular case, we can evaluate P (|X(n) − θ| < ε)
directly by noting that X(n) cannot possibly be larger than θ, so that

P (|X(n) − θ| < ε) = P (X(n) > θ − ε) = 1 − P (X(n) ≤ θ − ε).

The maximum X(n) is less than some constant if and only if each of the random
variables X1, . . . , Xn is less than that constant. Therefore, since the Xi are i.i.d.,

P (X(n) ≤ θ − ε) = [P (X1 ≤ θ − ε)]^n = [1 − (ε/θ)]^n if 0 < ε < θ, and 0 if ε ≥ θ.

Since 1 − (ε/θ) is strictly less than 1, we conclude that no matter what positive
value ε takes, P (X(n) ≤ θ − ε) → 0 as desired.
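
A quick Monte Carlo check can make Definition 2.1 concrete here. The following sketch (assuming NumPy is available; θ = 2, ε = 0.05, and 2000 replications are arbitrary illustrative choices, not from the text) estimates P (|X(n) − θ| < ε) for increasing n:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, eps = 2.0, 0.05  # illustrative choices, not from the text
for n in [10, 100, 1000, 10000]:
    # 2000 Monte Carlo replications of X_(n) = max of n Uniform(0, theta) draws
    x_max = rng.uniform(0, theta, size=(2000, n)).max(axis=1)
    print(n, np.mean(np.abs(x_max - theta) < eps))
```

The printed frequencies approach 1, matching the exact value 1 − [1 − (ε/θ)]^n computed in the example.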

2.1.2 Probabilistic Order Notation

There are probabilistic analogues of the o and O notations of Section 1.3 that apply to
random variable sequences instead of real number sequences.

Definition 2.4 We write Xn = oP (Yn) if Xn/Yn →P 0.
In particular, oP (1) is shorthand notation for a sequence of random variables that converges
to zero in probability, as illustrated in Equation (2.2) below.
Definition 2.5 We write Xn = OP (Yn) if for every ε > 0, there exist M and N such
that

P (|Xn/Yn| < M ) > 1 − ε for all n > N .

As a special case of Definition 2.5, we refer to any OP (1) sequence as a bounded in probability
sequence:
Definition 2.6 We say that X1, X2, . . . is bounded in probability if Xn = OP (1), i.e.,
if for every ε > 0, there exist M and N such that P (|Xn| < M ) > 1 − ε for n > N .
Definition 2.6 is primarily useful because of the properties of bounded in probability se-
quences established in Exercise 2.2.

Example 2.7 In Example 2.3, we showed that if X1, X2, . . . are independent and
identically distributed uniform (0, θ) random variables, then

max_{1≤i≤n} Xi →P θ as n → ∞.

Equivalently, we may say that

max_{1≤i≤n} Xi = θ + oP (1) as n → ∞. (2.2)

It is also technically correct to write

max_{1≤i≤n} Xi = θ + OP (1) as n → ∞, (2.3)

though Statement (2.3) is less informative than Statement (2.2). On the other
hand, we will see in Example 6.1 that Statement (2.3) may be sharpened
considerably, and made more informative than Statement (2.2), by writing

max_{1≤i≤n} Xi = θ + OP (1/n) as n → ∞.
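
To see why the OP (1/n) statement is plausible before Example 6.1 makes it precise, one can simulate n(θ − max_{1≤i≤n} Xi) and watch its quantiles stabilize rather than grow. A rough sketch (assuming NumPy; all numeric choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0
for n in [10, 100, 1000, 10000]:
    x_max = rng.uniform(0, theta, size=(2000, n)).max(axis=1)
    scaled = n * (theta - x_max)  # bounded in probability if the O_P(1/n) claim holds
    print(n, np.round(np.quantile(scaled, [0.5, 0.95]), 2))
```

The 50th and 95th percentiles of n(θ − X(n)) settle down as n grows, which is exactly what bounded in probability means for the scaled sequence.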

Using the oP notation defined above, it is possible to rewrite Taylor’s theorem 1.18 in a
form involving random variables. This theorem will prove to be useful in later chapters; for
instance, it is used to prove the result known as the delta method in Section 5.1.1.
Theorem 2.8 Suppose that Xn →P θ0 for a sequence of random variables X1, X2, . . .
and a constant θ0. Furthermore, suppose that f (x) has d derivatives at the point
θ0. Then there is a random variable Yn such that

f (Xn) = f (θ0) + (Xn − θ0)f′(θ0) + · · · + [(Xn − θ0)^d / d!] [f^(d)(θ0) + Yn] (2.4)

and Yn = oP (1) as n → ∞.
The proof of Theorem 2.8 is a useful example of an "epsilon-delta" proof (named for the ε
and δ in Definition 1.11).
Proof: Let

Yn = (d! / (Xn − θ0)^d) [f (Xn) − f (θ0) − (Xn − θ0)f′(θ0) − · · · − ((Xn − θ0)^(d−1) / (d−1)!) f^(d−1)(θ0)] − f^(d)(θ0) if Xn ≠ θ0,

and Yn = 0 if Xn = θ0.
Then Equation (2.4) is trivially satisfied. We will show that Yn = oP (1), which means Yn →P 0,
by demonstrating that for an arbitrary ε > 0, there exists N such that P (|Yn| < ε) > 1 − ε
for all n > N. By Taylor's Theorem 1.18, there exists some δ > 0 such that |Xn − θ0| < δ
implies |Yn| < ε (that is, the event {ω : |Xn(ω) − θ0| < δ} is contained in the event {ω :
|Yn(ω)| < ε}). Furthermore, because Xn →P θ0, we know that there exists some N such that
P (|Xn − θ0| < δ) > 1 − ε for all n > N. Putting these facts together, we conclude that for
all n > N,

P (|Yn| < ε) ≥ P (|Xn − θ0| < δ) > 1 − ε,
which proves the result.
In later chapters, we will generally write simply

f (Xn) = f (θ0) + (Xn − θ0)f′(θ0) + · · · + [(Xn − θ0)^d / d!] [f^(d)(θ0) + oP (1)] as n → ∞ (2.5)

when referring to the result of Theorem 2.8. A technical quibble with Expression (2.5) is
that it suggests that any random variable Yn satisfying (2.4) must also be oP (1). This is not
quite true: Since Yn may be defined arbitrarily in the event that Xn = θ0 and still satisfy
(2.4), if

P (Xn = θ0) > c for all n

for some positive constant c, then some Yn that is not oP (1) may still satisfy (2.4). However,
as long as one remembers what Theorem 2.8 says, there is little danger in using Expression (2.5).
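
As a concrete check on Theorem 2.8, one can solve Equation (2.4) for Yn in a simple case and verify numerically that it is oP (1). In the sketch below (assuming NumPy; the choices f = log, θ0 = 1, d = 2, and i.i.d. standard exponential draws are all illustrative, not from the text), f′(1) = 1 and f′′(1) = −1:

```python
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(1.0, size=(2000, n)).mean(axis=1)  # Xn ->P theta0 = 1
    # Solve (2.4) for Yn with f = log and d = 2:
    # log(Xn) = 0 + (Xn - 1)*1 + ((Xn - 1)^2 / 2)*(-1 + Yn)
    yn = 2 * (np.log(xbar) - (xbar - 1)) / (xbar - 1) ** 2 + 1
    print(n, np.mean(np.abs(yn) > 0.05))  # estimates P(|Yn| > 0.05), tending to 0
```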

2.1.3 Convergence in Distribution

As the name suggests, convergence in distribution (also known as convergence in law) has
to do with convergence of the distribution functions (or “laws”) of random variables. Given
a random variable X, the distribution function of X is the function

F (x) = P (X ≤ x). (2.6)

Any distribution function F (x) is nondecreasing and right-continuous, and it has limits
limx→−∞ F (x) = 0 and limx→∞ F (x) = 1. Conversely, any function F (x) with these proper-
ties is a distribution function for some random variable.
It is not enough to define convergence in distribution as simple pointwise convergence of a
sequence of distribution functions; there are technical reasons that such a simplistic definition
fails to capture any useful concept of convergence of random variables. These reasons are
illustrated by the following two examples.
Example 2.9 Let Xn be normally distributed with mean 0 and variance n. Then
the distribution function of Xn is Fn(x) = Φ(x/√n), where Φ(z) denotes the
standard normal distribution function. Because Φ(0) = 1/2, we see that for any
fixed x, Fn(x) → 1/2 as n → ∞. But the function that is constant at 1/2 is not
a distribution function. This example shows that not all convergent sequences of
distribution functions have limits that are distribution functions.

Example 2.10 By any sensible definition of convergence, 1/n should converge to 0
as n → ∞. But consider the distribution functions Fn (x) = I{x ≥ 1/n} and
F (x) = I{x ≥ 0} corresponding to the constant random variables 1/n and 0. We
do not have pointwise convergence of Fn (x) to F (x), since Fn (0) = 0 for all n but
F (0) = 1. However, Fn (x) → F (x) is true for all x 6= 0. Not coincidentally, the
point x = 0 where convergence of Fn (x) to F (x) fails is the only point at which
the function F (x) is not continuous.

To write a sensible definition of convergence in distribution, Example 2.9 demonstrates that
we should require that the limit of distribution functions be a distribution function itself,
say F (x), while Example 2.10 suggests that we should exclude points where F (x) is not
continuous. We therefore arrive at the following definition:
Definition 2.11 Suppose that X has distribution function F (x) and that Xn has
distribution function Fn(x) for each n. Then we say Xn converges in distribution
to X, written Xn →d X, if Fn(x) → F (x) as n → ∞ for all x at which F (x) is
continuous. Convergence in distribution is sometimes called convergence in law
and written Xn →L X.

The notation of Definition 2.11 may be stretched a bit; sometimes the expressions on either
side of the →d symbol may be distribution functions or other notations indicating certain
distributions, rather than actual random variables as in the definition. The meaning is
always clear even if the notation is not consistent.

However, one common mistake should be avoided at all costs: If →d (or →P or any other "limit
arrow") indicates that n → ∞, then n must never appear on the right side of the arrow. See
Expression (2.8) in Example 2.12 for an example of how this rule is sometimes violated.

Example 2.12 The Central Limit Theorem for i.i.d. sequences: Let X1, . . . , Xn be
independent and identically distributed (i.i.d.) with mean µ and finite variance
σ². Then by a result that will be covered in Chapter 4 (but which is perhaps
already known to the reader),

√n ((1/n) Σ_{i=1}^n Xi − µ) →d N (0, σ²), (2.7)

where N (0, σ²) denotes a normal distribution with mean 0 and variance σ².
[N (0, σ²) is not actually a random variable; this is an example of "stretching
the →d notation" referred to above.]

Because Equation (2.7) may be interpreted as saying that the sample mean X̄n
has approximately a N (µ, σ²/n) distribution, it may seem tempting to "rewrite"
Equation (2.7) as

(1/n) Σ_{i=1}^n Xi →d N (µ, σ²/n). (2.8)

Resist the temptation to do this! As pointed out above, n should never appear
on the right side of a limit arrow (as long as that limit arrow expresses the idea
that n is tending to ∞).

By the result of Exercise 2.2, the limit statement (2.7) implies that the left side
of that statement is OP (1). We may therefore write (after dividing through by
√n and adding µ)

(1/n) Σ_{i=1}^n Xi = µ + OP (1/√n) as n → ∞. (2.9)

Unlike Expression (2.8), Equation (2.9) is perfectly legal; and although it is less
specific than Expression (2.7), it expresses at a glance the √n-rate of convergence
of the sample mean to µ.
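
A simulation makes the contrast between (2.8) and (2.9) visible: the raw error X̄n − µ shrinks, while √n(X̄n − µ) keeps a stable spread. A sketch (assuming NumPy; standard exponential draws, for which µ = σ = 1, are an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.0
for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(1.0, size=(2000, n)).mean(axis=1)
    # spread of the raw error versus the sqrt(n)-scaled error
    print(n, round((xbar - mu).std(), 4), round((np.sqrt(n) * (xbar - mu)).std(), 4))
```

The first column of output shrinks like 1/√n while the second hovers near σ = 1, which is the content of Equation (2.9).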

Unlike Xn →P X, the expression Xn →d X does not require Xn − X to be a random variable;
in fact, Xn →d X is possible even if Xn and X are not defined on the same sample space.
Even if Xn and X do have a joint distribution, it is easy to construct an example in which
Xn →d X but Xn does not converge to X in probability: Take Z1 and Z2 to be independent
and identically distributed standard normal random variables, then let Xn = Z1 for all n
and X = Z2. Since Xn and X have exactly the same distribution by construction, Xn →d X
in this case. However, since Xn − X is a N (0, 2) random variable for all n, we do not have
Xn →P X.

We conclude that Xn →d X cannot possibly imply Xn →P X (but see Theorem 2.14 for a
special case in which it does). However, the implication in the other direction is always true:

Theorem 2.13 If Xn →P X, then Xn →d X.
Proof: Let Fn(x) and F (x) denote the distribution functions of Xn and X, respectively.
Assume that Xn →P X. We need to show that Fn(t) → F (t), where t is any point of continuity
of F (x).

Choose any ε > 0. Whenever Xn ≤ t, it must be true that either X ≤ t + ε or |Xn − X| > ε.
This implies that

Fn(t) ≤ F (t + ε) + P (|Xn − X| > ε).

Similarly, whenever X ≤ t − ε, either Xn ≤ t or |Xn − X| > ε, implying

F (t − ε) ≤ Fn(t) + P (|Xn − X| > ε).

We conclude that for arbitrary n and ε > 0,

F (t − ε) − P (|Xn − X| > ε) ≤ Fn(t) ≤ F (t + ε) + P (|Xn − X| > ε). (2.10)

Taking both the lim inf_n and the lim sup_n of the above inequalities, we conclude [since
Xn →P X implies P (|Xn − X| > ε) → 0] that

F (t − ε) ≤ lim inf_n Fn(t) ≤ lim sup_n Fn(t) ≤ F (t + ε)

for all ε. Since t is a continuity point of F (x), letting ε → 0 implies

F (t) = lim inf_n Fn(t) = lim sup_n Fn(t),

so we conclude Fn(t) → F (t) and the theorem is proved.


We remarked earlier that Xn →d X could not possibly imply Xn →P X because the latter
expression requires that Xn and X be defined on the same sample space for every n. However,
a constant c may be considered to be a random variable defined on any sample space; thus,
it is reasonable to ask whether Xn →d c implies Xn →P c. The answer is yes:

Theorem 2.14 Xn →d c if and only if Xn →P c.

Proof: We only need to prove that Xn →d c implies Xn →P c, since the other direction is a
special case of Theorem 2.13. If F (x) is the distribution function I{x ≥ c} of the constant
random variable c, then c + ε and c − ε are points of continuity of F (x) for any ε > 0.
Therefore, Xn →d c implies that Fn(c − ε) → F (c − ε) = 0 and Fn(c + ε) → F (c + ε) = 1 as
n → ∞. We conclude that

P (−ε < Xn − c ≤ ε) = Fn(c + ε) − Fn(c − ε) → 1,

which means Xn →P c.
When we speak of convergence of random variables to a constant in this book, most com-
monly we refer to convergence in probability, which (according to Theorem 2.14) is equivalent
to convergence in distribution. On the other hand, when we speak of convergence to a ran-
dom variable, we nearly always refer to convergence in distribution. Therefore, in a sense,
Theorem 2.14 makes convergence in distribution the most important form of convergence in
this book. This type of convergence is often called “weak convergence”.

2.1.4 Convergence in Mean

The third and final mode of convergence in this chapter is useful primarily because it is
sometimes easy to verify and thus gives a quick way to prove convergence in probability, as
Theorem 2.17 below implies.

Definition 2.15 Let a be a positive constant. We say that Xn converges in ath mean
to X, written Xn →a X, if

E |Xn − X|^a → 0 as n → ∞. (2.11)

Two specific cases of Definition 2.15 deserve special mention. When a = 1, we normally omit
mention of the a and simply refer to the condition E |Xn − X| → 0 as convergence in mean.
Convergence in mean is not equivalent to E Xn → E X: For one thing, E Xn → E X is
possible without any regard to the joint distribution of Xn and X, whereas E |Xn − X| → 0
clearly requires that Xn − X be a well-defined random variable.
Even more important than a = 1 is the special case a = 2:
Definition 2.16 We say that Xn converges in quadratic mean to X, written Xn →qm X,
if

E |Xn − X|² → 0 as n → ∞.

Convergence in quadratic mean is important for two reasons. First, it is often quite easy
to check; in Exercise 2.6, you are asked to prove that Xn →qm c if and only if E Xn → c
and Var Xn → 0 for some constant c. Second, quadratic mean convergence (indeed, ath
mean convergence for any a > 0) is stronger than convergence in probability, which means
that weak consistency of an estimator may be established by checking that it converges in
quadratic mean. This latter property is a corollary of the following result:

Theorem 2.17 (a) For a constant c, Xn →qm c if and only if E Xn → c and Var Xn → 0.

(b) For fixed a > 0, Xn →a X implies Xn →P X.

Proof: Part (a) is the subject of Exercise 2.6. Part (b) relies on Markov's inequality (1.35),
which states that

P (|Xn − X| ≥ ε) ≤ (1/ε^a) E |Xn − X|^a (2.12)

for an arbitrary fixed ε > 0. If Xn →a X, then by definition the right hand side of inequality
(2.12) goes to zero as n → ∞, so the left side also goes to zero and we conclude that Xn →P X
by definition.

Example 2.18 Any unbiased estimator is consistent if its variance goes to zero. This
fact follows directly from Theorem 2.17(a) and (b). As an example, consider a
sequence of independent and identically distributed random variables X1, X2, . . .
with mean µ and finite variance σ². The sample mean

X̄n = (1/n) Σ_{i=1}^n Xi

has mean µ and variance σ²/n. Therefore, X̄n is unbiased and its variance goes
to zero, so we conclude that it is consistent; i.e., X̄n →P µ. This fact is the Weak
Law of Large Numbers (see Theorem 2.19) for the case of random variables with
finite variance.
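
The logic of Example 2.18 is easy to check by simulation: estimate E |X̄n − µ|² and compare it with the exact value σ²/n. A sketch (assuming NumPy; Bernoulli(1/2) draws, with µ = 1/2 and σ² = 1/4, are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2 = 0.5, 0.25
for n in [10, 100, 1000]:
    xbar = rng.binomial(1, 0.5, size=(2000, n)).mean(axis=1)
    # Monte Carlo estimate of E|Xbar_n - mu|^2 next to its exact value sigma^2/n
    print(n, round(np.mean((xbar - mu) ** 2), 5), sigma2 / n)
```

Since E |X̄n − µ|² = σ²/n → 0, X̄n →qm µ, and Theorem 2.17(b) then gives consistency.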

Exercises for Section 2.1


Exercise 2.1 For each of the three cases below, prove that Xn →P 1:

(a) Xn = 1 + nYn, where Yn is a Bernoulli random variable with mean 1/n.

(b) Xn = Yn / log n, where Yn is a Poisson random variable with mean Σ_{i=1}^n (1/i).

(c) Xn = (1/n) Σ_{i=1}^n Yi², where the Yi are independent standard normal random
variables.

Exercise 2.2 This exercise deals with bounded in probability sequences; see Defini-
tion 2.6.

(a) Prove that if Xn →d X for some random variable X, then Xn is bounded in
probability.

Hint: You may use the fact that any interval of real numbers must contain a
point of continuity of F (x). Also, recall that F (x) → 1 as x → ∞.

(b) Prove that if Xn is bounded in probability and Yn →P 0, then Xn Yn →P 0.

Hint: For fixed ε > 0, argue that there must be M and N such that P (|Xn| <
M ) > 1 − ε/2 and P (|Yn| < ε/M ) > 1 − ε/2 for all n > N. What is then the
smallest possible value of P (|Xn| < M and |Yn| < ε/M )? Use this result to prove
Xn Yn →P 0.

Exercise 2.3 The Poisson approximation to the binomial:

(a) Suppose that Xn is a binomial random variable with n trials, where the
probability of success on each trial is λ/n. Let X be a Poisson random variable
with the same mean as Xn, namely λ. Prove that Xn →d X.

Hint: Argue that it suffices to show that P (Xn = k) → P (X = k) for all
nonnegative integers k. Then use Stirling's formula (1.19).

(b) Part (a) can be useful in approximating binomial probabilities in cases where
the number of trials is large but the success probability is small: Simply consider
a Poisson random variable with the same mean as the binomial variable. Assume
that Xn is a binomial random variable with parameters n and 2/n. Create a plot
on which you plot P (X10 = k) for k = 0, . . . , 10. On the same set of axes, plot
the same probabilities for X20, X50, and the Poisson variable we'll denote by X∞.
Try looking at the same plot but with the probabilities transformed using the
logit (log-odds) transformation logit(t) = log(t) − log(1 − t). Which plot makes
it easier to characterize the trend you observe?
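
For part (b), something like the following sketch could serve as a starting point (assuming NumPy, SciPy, and Matplotlib are available; this is one of many reasonable ways to draw the two plots, not a prescribed solution):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson

k = np.arange(0, 11)

def logit(t):
    return np.log(t) - np.log(1 - t)

fig, (raw, trans) = plt.subplots(1, 2, figsize=(10, 4))
for n in (10, 20, 50):
    p = binom.pmf(k, n, 2 / n)  # X_n is binomial(n, 2/n), mean 2
    raw.plot(k, p, marker="o", label=f"n = {n}")
    trans.plot(k, logit(p), marker="o", label=f"n = {n}")
p_inf = poisson.pmf(k, 2)  # the Poisson limit X_infinity, mean 2
raw.plot(k, p_inf, "kx--", label="Poisson(2)")
trans.plot(k, logit(p_inf), "kx--", label="Poisson(2)")
raw.set_title("P(X = k)")
trans.set_title("logit P(X = k)")
raw.legend()
plt.show()
```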

Exercise 2.4 Suppose that X1, . . . , Xn are independent and identically distributed
Uniform(0, 1) random variables. For a real number t, let

Gn(t) = Σ_{i=1}^n I{Xi ≤ t}.

(a) What is the distribution of Gn(t) if 0 < t < 1?

(b) Suppose c > 0. Find the distribution of a random variable X such that
Gn(c/n) →d X. Justify your answer.

(c) How does your answer to part (b) change if X1, . . . , Xn are from a stan-
dard exponential distribution instead of a uniform distribution? The standard
exponential distribution function is F (t) = 1 − e−t.
Exercise 2.5 For each of the three examples in Exercise 2.1, does Xn →qm 1? Justify
your answers.

Exercise 2.6 Prove Theorem 2.17(a).

Exercise 2.7 The converse of Theorem 2.17(b) is not true. Construct a counterex-
ample in which Xn →P 0 but E Xn = 1 for all n (by Theorem 2.17, if E Xn = 1,
then Xn cannot converge in quadratic mean to 0).

Hint: The mean of a random variable may be strongly influenced by a large
value that occurs with small probability (and if this probability goes to zero,
then the mean can be influenced in this way without destroying convergence in
probability).
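
To see the hint in action numerically, one construction consistent with it takes Xn = n with probability 1/n and 0 otherwise (a sketch assuming NumPy; the construction is an illustration, and proving it works is still the exercise):

```python
import numpy as np

rng = np.random.default_rng(5)
# Xn = n with probability 1/n, else 0, so E Xn = n * (1/n) = 1 for every n,
# while P(|Xn| > eps) = 1/n -> 0.
for n in [10, 100, 1000]:
    xn = n * (rng.random(20000) < 1 / n)
    print(n, round(xn.mean(), 3), np.mean(np.abs(xn) > 0.01))
```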

Exercise 2.8 Prove or disprove this statement: If there exists M such that P (|Xn| <
M ) = 1 for all n, then Xn →P c implies Xn →qm c.

Exercise 2.9 (a) Prove that if 0 < a < b, then convergence in bth mean is stronger
than convergence in ath mean; i.e., Xn →b X implies Xn →a X.

Hint: Use Exercise 1.40 with α = b/a.

(b) Prove by counterexample that the conclusion of part (a) is not true in general
if 0 < b < a.

2.2 Consistent Estimates of the Mean

For a sequence of random vectors X1, X2, . . ., we denote the nth sample mean by

X̄n = (1/n) Σ_{i=1}^n Xi.

We begin with a formal statement of the weak law of large numbers for an independent
and identically distributed sequence. Later, we discuss some cases in which the sequence of
random vectors is not independent and identically distributed.

Chapter 3

Strong Convergence

There are multiple ways to define the convergence of a sequence of random variables. Chapter
2 introduced convergence in probability, convergence in distribution, and convergence in
quadratic mean. We now consider a fourth mode of convergence, almost sure convergence
or convergence with probability one. We will see that almost sure convergence implies both
convergence in probability and convergence in distribution, which is why we sometimes use
the term “strong” for almost sure convergence and “weak” for the other two.
The terms “weak” and “strong” do not indicate anything about their importance; indeed,
the “weak” modes of convergence are used much more frequently in asymptotic statistics
than the strong mode. Because weak convergence dominates the remainder of this book
beginning with Chapter 4, a reader may safely skip much of the material in the current
chapter if time is limited; however, the quantile function and the Dominated Convergence
Theorem of Section 3.3 are used elsewhere, and at least these topics should be reviewed
before moving on. Due to the technical nature of the material of this chapter, the exercises
are almost exclusively devoted to proofs.

3.1 Strong Consistency Defined

A random variable like Xn or X is a function on a sample space, say Ω. Suppose that we
fix a particular element of that space, say ω0, so we obtain the real numbers Xn(ω0) and
X(ω0 ). If Xn (ω0 ) → X(ω0 ) as n → ∞ in the sense of Definition 1.1, then ω0 is contained in
the event

S = {ω ∈ Ω : Xn (ω) → X(ω)}. (3.1)

If the probability of S—that is, E I{Xn → X}—equals 1, then we say that Xn converges
almost surely to X:
Definition 3.1 Suppose X and X1, X2, . . . are random variables defined on the same
sample space Ω (and as usual P denotes the associated probability measure). If

P ({ω ∈ Ω : Xn(ω) → X(ω)}) = 1,

then Xn is said to converge almost surely (or with probability one) to X, denoted
Xn →a.s. X or Xn → X a.s. or Xn → X w.p. 1.
In other words, convergence with probability one means exactly what it sounds like: The
probability that Xn converges to X equals one. Later, in Theorem 3.3, we will formulate
an equivalent definition of almost sure convergence that makes it much easier to see why
it is such a strong form of convergence of random variables. Yet the intuitive simplicity of
Definition 3.1 makes it the standard definition.
As in the case of convergence in probability, we may replace the limiting random variable
X by any constant c, in which case we write Xn →a.s. c. In the most common statistical usage
of convergence to a constant, the random variable Xn is some estimator of a particular
parameter, say θ:

Definition 3.2 If Xn →a.s. θ, Xn is said to be strongly consistent for θ.
As the names suggest, strong consistency implies consistency (also known as weak consis-
tency), a fact we now explore in more depth.

3.1.1 Strong Consistency versus Consistency

As before, suppose that X and X1, X2, . . . are random variables defined on the same sample
space, Ω. For given n and ε > 0, define the events

An = {ω ∈ Ω : |Xk(ω) − X(ω)| < ε for all k ≥ n} (3.2)

and

Bn = {ω ∈ Ω : |Xn(ω) − X(ω)| < ε}. (3.3)
First, note that An must be contained in Bn and that both An and Bn imply that Xn is close
to X as long as ε is small. Therefore, both P (An) → 1 and P (Bn) → 1 seem like reasonable
ways to define the convergence of Xn to X. Indeed, as we have already seen in Definition
2.1, convergence in probability means precisely that P (Bn) → 1 for any ε > 0.
Yet what about the sets An? One fact is immediate: Since An ⊂ Bn, we must have
P (An) ≤ P (Bn). Therefore, P (An) → 1 implies P (Bn) → 1. In other words, if we were to
take P (An) → 1 for all ε > 0 to be the definition of a new form of convergence of random
sequences, then this form of convergence would be stronger than (i.e., it would imply) con-
vergence in probability. By now, the reader may already have guessed that this new form of
convergence is actually equivalent to almost sure convergence:
Theorem 3.3 With An defined as in Equation (3.2), P (An) → 1 for any ε > 0 if and
only if Xn →a.s. X.

Proving Theorem 3.3 is the subject of Exercise 3.1. The following corollary now follows from
the preceding discussion:

Corollary 3.4 If Xn →a.s. X, then Xn →P X.
The converse of Corollary 3.4 is not true, as the following example illustrates.

[Figure 3.1 here: rows of intervals on (0, 1], each row splitting (0, 1] into 2m + 1 equal pieces: J1; J2 through J4; J5 through J9; J10 through J16; J17 through J25; and so on.]

Figure 3.1: Example 3.5, in which P (Jn) → 0 as n → ∞, which means that I{Jn} →P 0.
However, the intervals Jn repeatedly cover the entire interval (0, 1], so the subset of (0, 1] on
which I{Jn} converges to 0 is empty!

Example 3.5 Take Ω to be the half-open interval (0, 1], and for any interval J ⊂ Ω,
say J = (a, b], take P (J) = b − a to be the length of that interval. Define a
sequence of intervals J1, J2, . . . as follows (see Figure 3.1):

J1 = (0, 1]
J2 through J4 = (0, 1/3], (1/3, 2/3], (2/3, 1]
J5 through J9 = (0, 1/5], (1/5, 2/5], (2/5, 3/5], (3/5, 4/5], (4/5, 1]
...
J(m²+1) through J((m+1)²) = (0, 1/(2m+1)], . . . , (2m/(2m+1), 1]
...

Note in particular that P (Jn) = 1/(2m + 1), where m = ⌊√(n − 1)⌋ is the largest
integer not greater than √(n − 1). Now, define Xn = I{Jn} and take 0 < ε < 1.
Then P (|Xn − 0| < ε) is the same as 1 − P (Jn). Since P (Jn) → 0, we conclude
Xn →P 0 by definition.

However, it is not true that Xn →a.s. 0. Since every ω ∈ Ω is contained in infinitely
many Jn, the set An defined in Equation (3.2) is empty for all n. Alternatively,
consider the set S = {ω : Xn(ω) → 0}. For any ω, Xn(ω) has no limit because
Xn(ω) = 1 and Xn(ω) = 0 both occur for infinitely many n. Thus S is empty.
This is not convergence with probability one; it is convergence with probability
zero!
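
The structure of Example 3.5 can be traced in code. The sketch below (assuming NumPy; the fixed sample point ω = 0.3 is arbitrary) evaluates Xn(ω) = I{Jn}(ω) for n up to 2000 and shows that the indicator returns to 1 once in every "row" of intervals, so Xn(ω) cannot converge:

```python
import numpy as np

def X(n, omega):
    # Indicator of J_n at omega, following the rows in Figure 3.1:
    # row m holds J_{m^2 + 1}, ..., J_{(m+1)^2}, each of width 1/(2m + 1).
    m = int(np.floor(np.sqrt(n - 1)))
    j = n - m * m - 1  # 0-based position of J_n within row m
    width = 1.0 / (2 * m + 1)
    return int(j * width < omega <= (j + 1) * width)

omega = 0.3  # arbitrary fixed point of (0, 1]
hits = [n for n in range(1, 2001) if X(n, omega) == 1]
print(len(hits), hits[:8])  # one hit per row: X_n(omega) = 1 infinitely often
```

Meanwhile P (Jn) = 1/(2m + 1) → 0, so the same sequence still converges to zero in probability.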

3.1.2 Multivariate Extensions

We may extend Definition 3.1 to the multivariate case in a completely straightforward way:

Definition 3.6 Xn is said to converge almost surely (or with probability one) to X
(Xn →a.s. X) if

P (Xn → X as n → ∞) = 1.

Alternatively, since the proof of Theorem 3.3 applies to random vectors as well
as random variables, we say Xn →a.s. X if for any ε > 0,

P (‖Xk − X‖ < ε for all k ≥ n) → 1 as n → ∞. (3.4)

We saw in Theorems 2.27 and 2.34 that continuous functions preserve both convergence in
probability and convergence in distribution. Yet these facts were quite difficult to prove.
Fortunately, the analogous result for almost sure convergence follows immediately from the
results of Chapter 1. Similarly, unlike with convergence in distribution, there is no problem
“stacking” random sequences into vectors while preserving almost sure convergence. The
following theorem is really just a corollary of earlier results (specifically, Theorem 1.16 and
Lemma 1.33).

Theorem 3.7 (a) Suppose that f : S → R^ℓ is a continuous function defined on some
subset S ⊂ R^k, Xn is a k-component random vector, and the range of X and of
each Xn is contained in S with probability 1. If Xn →a.s. X, then f (Xn) →a.s. f (X).

(b) Xn →a.s. X if and only if Xnj →a.s. Xj for all j.

We conclude this section with a simple diagram summarizing the implications among the
modes of convergence defined so far. In the diagram, a double arrow like ⇒ means “implies”.
Note that the picture changes slightly when convergence is to a constant c rather than a
random vector X.
Xn →qm X                                Xn →qm c
    ⇓                                       ⇓
Xn →a.s. X ⇒ Xn →P X ⇒ Xn →d X          Xn →a.s. c ⇒ Xn →P c ⇔ Xn →d c

Exercises for Section 3.1


Exercise 3.1 Let S be the set defined in equation (3.1), so Xn →a.s. X is equivalent to
P (S) = 1 by definition.

(a) Let An be defined as in Equation (3.2). Prove that

ω0 ∈ ∪_{n=1}^∞ An for all ε > 0

if and only if ω0 ∈ S.

Hint: Use Definition 1.1.

(b) Prove Theorem 3.3.

Hint: Note that the sets An are increasing in n, so that by the lower continuity
of any probability measure (which you may assume without proof), limn P (An)
exists and is equal to P (∪_{n=1}^∞ An).
Exercise 3.2 The diagram at the end of this section suggests that neither Xn →a.s. X
nor Xn →qm X implies the other. Construct two counterexamples, one to show that
Xn →a.s. X does not imply Xn →qm X and the other to show that Xn →qm X does not
imply Xn →a.s. X.

3.2 The Strong Law of Large Numbers

Some of the results in this section are presented for univariate random variables and some are
presented for random vectors. Take note of the use of bold print to denote vectors. Nearly
