Math5846_chapter6
UNSW Sydney
OPEN LEARNING
Chapter 6
Convergence of Random Variables
Outline:
6.1 Introduction
6.2 Convergence in Probability
6.3 Weak Law of Large Numbers
6.4 Convergence in Distribution
6.5 Central Limit Theorem
6.6 Applications of the Central Limit Theorem
6.7 Delta Method
6.8 Supplementary Material
6.1 Introduction
The previous chapter dealt with the problem of finding the density function (or probability function) of a transformation of one or two random variables.
An example is the problem of finding the exact density function of the sum of
100 independent Uniform(0, 1) random variables.
This is a very tedious and complicated problem.
In this chapter, we will focus on some key convergence results useful in
statistics.
Some of these (such as the law of large numbers and the central limit
theorem) relate to sums or averages of random variables.
These results are particularly useful because sums and averages are typically
used as summary statistics in quantitative research.
6.2 Convergence in Probability
Definition
The sequence of random variables X1, X2, . . . converges in probability to a constant c if, for any ε > 0,
lim_{n→∞} P(|Xn − c| > ε) = 0.
A common shorthand is Xn →P c.
Memorise this!
Example
Let X1, X2, . . . be independent Uniform(0, θ) variables and let Yn = max{X1, . . . , Xn}. The density of Yn is
fYn(y) = n y^{n−1} / θ^n,  0 < y < θ.
Show that Yn →P θ.
Example
Solution:
Note that Yn →P θ means lim_{n→∞} P(|Yn − θ| > ε) = 0.
For ε > θ, P(|Yn − θ| > ε) = 0 for all n ≥ 1, since Yn always lies in (0, θ) and so the event |Yn − θ| > ε cannot occur, or it can occur only with probability zero.
For 0 < ε ≤ θ, integrating the density gives P(Yn ≤ y) = (y/θ)^n for 0 < y < θ, so
P(|Yn − θ| > ε) = P(Yn < θ − ε) = ((θ − ε)/θ)^n → 0 as n → ∞,
since 0 ≤ (θ − ε)/θ < 1.
In either case, lim_{n→∞} P(|Yn − θ| > ε) = 0.
∴ Yn →P θ.
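This convergence can also be seen numerically. Below is a minimal simulation sketch (not part of the original notes; the values of θ, ε and the sample sizes are illustrative choices) estimating P(|Yn − θ| > ε) by Monte Carlo and comparing it with the exact value ((θ − ε)/θ)^n.

```python
# Illustrative sketch: Monte Carlo estimate of P(|Y_n - theta| > eps) for
# Y_n = max(X_1, ..., X_n) with X_i ~ Uniform(0, theta). Parameter values
# are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
theta, eps, reps = 2.0, 0.1, 20_000

for n in [5, 20, 100, 500]:
    y_n = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    est = np.mean(np.abs(y_n - theta) > eps)
    exact = ((theta - eps) / theta) ** n   # P(Y_n < theta - eps)
    print(f"n={n:4d}  simulated={est:.4f}  exact={exact:.4f}")
```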
Example
For n = 1, 2, . . . , let Yn ∼ N(μ, σn²) and suppose lim_{n→∞} σn = 0.
Show that Yn →P μ.
Example
Solution:
We need to show that lim_{n→∞} P(|Yn − μ| > ε) = 0. For any ε > 0,
P(|Yn − μ| > ε) = P( |Yn − μ|/σn > ε/σn )
 = P( (Yn − μ)/σn < −ε/σn ) + P( (Yn − μ)/σn > ε/σn )
 = ∫_{−∞}^{−ε/σn} (1/√(2π)) e^{−y²/2} dy + ∫_{ε/σn}^{∞} (1/√(2π)) e^{−y²/2} dy
 → 0 as n → ∞,
since ε/σn → ∞ and (Yn − μ)/σn ∼ N(0, 1). Thus Yn →P μ.
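A quick numerical illustration (a sketch added here, not in the original notes): by the symmetry of the normal density, P(|Yn − μ| > ε) = 2Φ(−ε/σn), which can be evaluated directly for a shrinking sequence σn.

```python
# Illustrative sketch: P(|Y_n - mu| > eps) = 2 * Phi(-eps / sigma_n) for
# Y_n ~ N(mu, sigma_n^2); the probability vanishes as sigma_n -> 0.
from scipy.stats import norm

eps = 0.1
for sigma_n in [1.0, 0.5, 0.1, 0.05, 0.01]:
    prob = 2 * norm.cdf(-eps / sigma_n)
    print(f"sigma_n={sigma_n:5.2f}  P(|Y_n - mu| > {eps}) = {prob:.6g}")
```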
6.3 Weak Law of Large Numbers
Weak Law of Large Numbers
Suppose X1, X2, . . . are independent, each with mean μ and variance 0 < σ² < ∞. If
X̄n = (1/n) Σ_{i=1}^n Xi,
then
X̄n →P μ.
That is,
for all ε > 0, lim_{n→∞} P(|X̄n − μ| > ε) = 0.
Proof.
By Chebychev's Inequality with ε = k √(Var(X̄n)), i.e. k = ε/√(Var(X̄n)), we have
P( |X̄n − E(X̄n)| > k √(Var(X̄n)) ) ≤ 1/k²
P( |X̄n − E(X̄n)| > ε ) ≤ Var(X̄n)/ε²
P( |X̄n − μ| > ε ) ≤ σ²/(n ε²)
(since E(X̄n) = μ and Var(X̄n) = σ²/n)
→ 0 as n → ∞.
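The Weak Law can be illustrated by simulation. The sketch below is not part of the original notes; Exponential(1) variables, with μ = 1, are an assumed illustrative choice.

```python
# Illustrative sketch of the Weak Law of Large Numbers: the estimated
# probability that the sample mean of i.i.d. Exponential(1) variables
# (mu = 1) deviates from mu by more than eps shrinks as n grows.
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.05, 2000

for n in [10, 100, 1000, 10_000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    est = np.mean(np.abs(xbar - mu) > eps)
    print(f"n={n:6d}  estimated P(|Xbar_n - mu| > {eps}) = {est:.3f}")
```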
6.4 Convergence in Distribution
Convergence in probability captures the concept of a sequence of random variables approaching a fixed constant. Convergence in distribution, defined next, instead concerns the limiting behaviour of the distribution functions.
Definition
Let X1, X2, . . . be a sequence of random variables. We say that Xn converges in distribution to X if
lim_{n→∞} FXn(x) = FX(x)
at every point x where FX is continuous.
A common shorthand is
Xn →D X.
The proof of the following theorem is omitted in these notes but may be found in advanced texts.
Slutzky’s Theorem
Let X1, X2, . . . be a sequence of random variables that converges in distribution to X, i.e.,
Xn →D X,
and let Y1, Y2, . . . be a sequence of random variables that converges in probability to a constant c. Then
1. Xn + Yn →D X + c,
2. Xn Yn →D c X.
Example
Suppose that X1, X2, . . . converges in distribution to X ∼ N(0, 1), i.e., Xn →D N(0, 1), and suppose that n Yn ∼ Bin(n, 1/2). What are the limiting distributions of Xn + Yn and Xn Yn?
Solution: As shown in the supplementary material, Yn = Z̄n →P 1/2, where the Zi are independent Bernoulli(1/2) variables. So by Slutzky's Theorem,
Xn + Yn →D X + 1/2 ∼ N(1/2, 1)  and  Xn Yn →D X/2 ∼ N(0, 1/4).
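A simulation sketch of this example (added here, not in the original notes; the finite n and replication count are illustrative choices) takes Xn exactly N(0, 1) and n Yn ∼ Bin(n, 1/2), and checks the moments of Xn + Yn and Xn Yn against their limits.

```python
# Illustrative sketch of Slutzky's Theorem: with X_n ~ N(0, 1) and
# Y_n = Bin(n, 1/2) / n, the sum X_n + Y_n should be close in distribution
# to N(1/2, 1) for large n, and X_n * Y_n close to N(0, 1/4).
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 200_000
x = rng.standard_normal(reps)            # X_n, already N(0, 1)
y = rng.binomial(n, 0.5, size=reps) / n  # Y_n -> 1/2 in probability

print(f"X_n + Y_n: mean={np.mean(x + y):.3f} (limit 0.5), "
      f"var={np.var(x + y):.3f} (limit 1.0)")
print(f"X_n * Y_n: mean={np.mean(x * y):.3f} (limit 0.0), "
      f"var={np.var(x * y):.3f} (limit 0.25)")
```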
6.5 Central Limit Theorem
For a general random sample X1 , X2 , . . . , Xn it is often of interest to make
probability statements about the sample mean X̄.
Central Limit Theorem
Suppose X1, X2, . . . are independent and identically distributed random variables with common mean μ = E(Xi) and common variance σ² = Var(Xi) < ∞.
For each n ≥ 1, let X̄n = (1/n) Σ_{i=1}^n Xi. Then
(X̄n − μ)/(σ/√n) →D Z,
where Z ∼ N(0, 1). It is common to write
(X̄n − μ)/(σ/√n) →D N(0, 1).
Memorise this!
Note that
E(X̄n) = μ and Var(X̄n) = σ²/n.
(See the supplementary material for a derivation.)
So the Central Limit Theorem states that the limiting distribution of any standardised average of independent, identically distributed random variables (with finite variance) is the standard Normal or N(0, 1) distribution.
This is an important aspect of the Central Limit Theorem and is particularly
useful in practice.
In fact, the Central Limit Theorem is the single most useful result in
statistics, and it forms the basis of most of the statistical inference tools
used by researchers today.
Proof.
The method of proof will be to show that the moment generating function of (X̄n − μ)/(σ/√n) converges, as n → ∞, to the moment generating function of a N(0, 1) random variable.
That is, if
mn(u) = E[ exp{ u (X̄n − μ)/(σ/√n) } ]
is the moment generating function of (X̄n − μ)/(σ/√n), then
lim_{n→∞} mn(u) = e^{u²/2}.
Proof. - continued
So, we have
mn(u) = E[ exp{ (u/(σ√n)) Σ_{i=1}^n Xi − uμ√n/σ } ]
 = exp(−uμ√n/σ) E[ exp{ (u/(σ√n)) Σ_{i=1}^n Xi } ].
Proof. - continued
The Xi's all have mean μ and variance σ², and so each has moment generating function
mXi(t) = E(e^{t Xi}) = 1 + t E(Xi) + (t²/2) E(Xi²) + . . . = 1 + μ t + (1/2)(σ² + μ²) t² + . . . .
Proof. - continued
∴ E[ exp{ (u/(σ√n)) Σ_{i=1}^n Xi } ]
 = E[ Π_{i=1}^n exp{ (u/(σ√n)) Xi } ]
 = Π_{i=1}^n E[ exp{ (u/(σ√n)) Xi } ]   (since the Xi's are independent)
 = Π_{i=1}^n mXi( u/(σ√n) )
 = [ 1 + μ u/(σ√n) + (1/2)(σ² + μ²) u²/(σ² n) + . . . ]^n.
Proof. - continued
Thus
ln mn(u) = −uμ√n/σ + n ln[ 1 + μu/(σ√n) + (1/2)(σ² + μ²) u²/(σ² n) + . . . ]
 = −uμ√n/σ + n{ μu/(σ√n) + (1/2)(σ² + μ²) u²/(σ² n) + . . .
   − (1/2)[ μu/(σ√n) + (1/2)(σ² + μ²) u²/(σ² n) + . . . ]² + . . . }
 (since ln(1 + x) = x − x²/2 + x³/3 − . . .)
 = n{ (1/2)(σ² + μ²) u²/(σ² n) − (1/2) μ² u²/(σ² n)
   + (terms involving 1/n to powers larger than 1) }
 = u²/2 + (terms which → 0 as n → ∞).
∴ lim_{n→∞} mn(u) = e^{u²/2}.
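The convergence of mn(u) can be checked numerically for a concrete distribution. The sketch below is not in the original notes; it uses Xi ∼ Exponential(1) as an assumed example, for which μ = σ = 1 and mXi(t) = 1/(1 − t), so that mn(u) = e^{−u√n}(1 − u/√n)^{−n}.

```python
# Illustrative sketch: exact m_n(u) for standardised means of Exponential(1)
# variables, converging to the N(0, 1) MGF exp(u^2 / 2).
import math

u = 1.0
limit = math.exp(u ** 2 / 2)
for n in [10, 100, 1000, 100_000]:
    m_n = math.exp(-u * math.sqrt(n)) * (1 - u / math.sqrt(n)) ** (-n)
    print(f"n={n:6d}  m_n({u}) = {m_n:.6f}   limit = {limit:.6f}")
```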
The Central Limit Theorem stated above provides the limiting distribution of
(X̄ − μ)/(σ/√n).
Since Σ_{i=1}^n Xi = n X̄, the Central Limit Theorem also applies to the sum of a sequence of random variables.
The next result provides alternative forms of the Central Limit Theorem.
Results
Suppose X1, X2, . . . are independent and identically distributed random variables with common mean μ = E(Xi) and common variance σ² = Var(Xi) < ∞.
Then the Central Limit Theorem may also be stated in the following alternative forms:
1. √n (X̄ − μ) →D N(0, σ²),
2. (Σ_i Xi − nμ)/(σ√n) →D N(0, 1),
3. (Σ_i Xi − nμ)/√n →D N(0, σ²).
6.6 Applications of the Central Limit Theorem
In this section, we provide some applications of the Central Limit Theorem.
6.6.1 Probability Calculations about a Sample Mean
Say we are interested in a random variable X.
Thanks to the Central Limit Theorem (CLT), we know that, for large enough n, the average of a sample from any random variable (with finite variance) is approximately normally distributed.
So, if we know μ and σ for this random variable, we can calculate approximately any probability we like about X̄.
Example
It is known that, in 1995, adult women in Australia had an average weight of about 67 kg, and the variance of this quantity is about 256.
Suppose we take a random sample of n = 10 adult women. What is the probability that their average weight X̄ exceeds 80 kg?
Example
Solution:
From the Central Limit Theorem,
(X̄ − μ)/(σ/√n) →D N(0, 1),
so we can say that, approximately,
X̄ ∼ N(μ, σ²/n) = N(67, 256/10) = N(67, 25.6).
So, using Chapter 3 methods to calculate normal probabilities,
P(X̄ > 80) = P( (X̄ − 67)/√25.6 > (80 − 67)/√25.6 )
 ≈ P(Z > 2.569351)
 ≈ 0.00509446.
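The same calculation in code (an illustrative sketch, not part of the original notes):

```python
# Illustrative sketch: P(Xbar > 80) under the CLT approximation
# Xbar ~ N(67, 256 / 10) = N(67, 25.6).
from math import sqrt
from scipy.stats import norm

z = (80 - 67) / sqrt(25.6)
print(f"z = {z:.6f}, P(Xbar > 80) = {norm.sf(z):.8f}")
# prints z = 2.569351 and a probability of about 0.00509446
```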
Example
Cadmium is a naturally occurring heavy metal found in drinking water at low levels. The Australian Drinking Water Guidelines recommend drinking water contain no more than 0.05 mg/L of cadmium due to health considerations.
Water in a tested dam is high in cadmium, with a mean level of 0.06 mg/L; individual measurements have standard deviation 0.006 mg/L, and three samples are taken. What is the chance that the unsatisfactory cadmium levels are detected, i.e., the chance that the average of the three samples is higher than 0.05 mg/L?
Example
Solution:
Let X be the cadmium level for a randomly chosen sample, and we will assume that it is normally distributed.
We are given that μ = 0.06, σ = 0.006, and n = 3.
That is,
X̄3 ∼ N( 0.06, (0.006/√3)² ).
So,
P(X̄3 > 0.05) = P( (X̄ − 0.06)/(0.006/√3) > (0.05 − 0.06)/(0.006/√3) )
 = P(Z > −2.88675)   (where Z ∼ N(0, 1))
 ≈ 0.998.
6.6.2 How Well Does the Central Limit Theorem Work?
The normal approximation to the sample mean is an asymptotic
approximation; that is, it is an approximation obtained by considering
ever-increasing values of n.
Recall that in our proof of the central limit theorem, we said that
ln mn(u) = u²/2 + (terms which → 0 as n → ∞).
When we say that the distribution of X̄ is normal, we are ignoring all the terms in ln mn(u) that go to zero as n increases.
The first (and usually the largest) of the terms we ignored is
( u³/(6 σ³ n^{3/2}) ) [ E(X³) − 3μσ² − μ³ ] = κ1 u³/(6 n^{1/2}),
where κ1 is defined as the "skewness" of the distribution (a measure of the asymmetry in the density function). See the supplementary material on skewness.
So this first term gets smaller as n gets larger, and it is small when the skewness of the distribution is small.
The second term we ignored in the central limit theorem proof, which has a coefficient of n^{−1}, is a function of both the skewness and the "kurtosis" of the distribution (a measure of how long-tailed the density of X is).
From a further study of the distribution of the sample mean for different choices of X, we can work out the following rough rule of thumb: for most distributions met in practice, n > 30 is large enough for X̄ to be approximately normal.
It should be noted however that more "pathological" distributions exist, for which n > 30 is not sufficient to ensure approximate normality of X̄. For any n, in theory, one can always produce an X such that X̄ is not close to normal, for example, X ∼ Poisson(1/n).
If it is reasonable to assume that the distribution under consideration has little skewness and is not "long-tailed" (i.e., it does not have high kurtosis), then the Central Limit Theorem will work well for even smaller n.
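These points can be illustrated by simulation. The sketch below is not in the original notes; the Exponential(1) distribution, which has skewness κ1 = 2, is an assumed illustrative choice showing the skewness of X̄n decaying at the κ1/√n rate of the ignored first term.

```python
# Illustrative sketch: the sample skewness of Xbar_n for skewed
# Exponential(1) data decays like kappa_1 / sqrt(n) with kappa_1 = 2,
# so larger n is needed before the normal approximation is accurate.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
reps = 100_000
for n in [5, 30, 100]:
    xbar = rng.exponential(size=(reps, n)).mean(axis=1)
    print(f"n={n:4d}  skewness of Xbar_n = {skew(xbar):.3f}  "
          f"(theory 2/sqrt(n) = {2 / np.sqrt(n):.3f})")
```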
6.6.3 Normal Approximation to the Binomial Distribution
The Central Limit Theorem also allows us to approximate some common distributions by the normal. An example is the binomial distribution: if X ∼ Bin(n, p), then
(X − np)/√(np(1 − p)) →D N(0, 1).
Proof.
Let X1, . . . , Xn be a set of independent Bernoulli random variables with parameter p. Then
X = Σ_i Xi  ⟹  X/n = X̄n,
where X̄n = (1/n) Σ_i Xi is the sample mean of the Xi's. By the Central Limit Theorem,
lim_{n→∞} P( (X/n − μ)/(σ/√n) ≤ x ) = P(Z ≤ x),
where Z ∼ N(0, 1), μ = E(Xi) = p, and σ² = Var(Xi) = p(1 − p).
The practical ramifications are that probabilities involving binomial random
variables with large n can be approximated by normal probabilities.
Normal Approximation to Binomial Distribution with Continuity Correction
Suppose X ∼ Bin(n, p). Then
P(X ≤ x) ≈ P( Z ≤ (x − np + 1/2)/√(np(1 − p)) ),
where Z ∼ N(0, 1).
The continuity correction is based on the fact that a discrete random variable is being approximated by a continuous random variable.
The continuity correction consists of subtracting 0.5 from any lower bound and adding 0.5 to any upper bound.
Example
Adam tosses 25 pieces of toast off a roof and ten land butter side up.
Is this evidence that toast lands butter side down more often than
butter side up? That is, is P (X ≤ 10) unusually small?
Solution
Firstly, let X be the number of pieces of toast that land butter side up. We assume the toast is equally likely to land butter side up or butter side down, so
X ∼ Binomial(25, 0.5).
Example
Solution - continued
Computing P(X ≤ 10) exactly from the binomial distribution is tedious by hand. Instead, we use the fact that, by the Central Limit Theorem,
Z = (X − np)/√(np(1 − p)) →D N(0, 1),
where np = 25 × 1/2 = 12.5, np(1 − p) = 25 × 1/2 × 1/2 = 6.25, and √(np(1 − p)) = √6.25 = 2.5.
Therefore,
P(X ≤ 10) = P( (X − np)/√(np(1 − p)) ≤ (10 − np)/√(np(1 − p)) )
 = P( Z ≤ (10 − 12.5)/2.5 )
 = P(Z ≤ −1) = Φ(−1)
 ≈ 0.1586553.
Solution - continued
With the continuity correction,
P(X ≤ 10) ≈ P( Z ≤ (10 − 12.5 + 0.5)/2.5 )
 = P(Z ≤ −0.80) = Φ(−0.80)
 ≈ 0.2118554.
Compare this with the exact answer obtained from the binomial distribution:
P(X ≤ 10) = 0.2121781.
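These three numbers are easy to verify in code (an illustrative sketch, not part of the original notes):

```python
# Illustrative sketch: exact Binomial(25, 0.5) probability versus the normal
# approximation with and without the continuity correction.
from scipy.stats import binom, norm

n, p, x = 25, 0.5, 10
mean, sd = n * p, (n * p * (1 - p)) ** 0.5        # 12.5 and 2.5
print(f"exact:           {binom.cdf(x, n, p):.7f}")               # 0.2121781
print(f"no correction:   {norm.cdf((x - mean) / sd):.7f}")        # 0.1586553
print(f"with correction: {norm.cdf((x - mean + 0.5) / sd):.7f}")  # 0.2118554
```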
How large does n need to be for the normal approximation to the binomial
distribution to be reasonable?
Recall that how well the central limit theorem works depends on the
skewness of the distribution of X.
This means that how well the normal approximation to the binomial works is
a function of p, and it is a better approximation as p approaches 0.5.
A useful rule of thumb is that the normal approximation to the
binomial will work well when n is large enough that both np > 5 and
n(1 − p) > 5.
This rule of thumb means that we do not need a very large value of n for this
large sample approximation to work well.
6.6.5 Normal Approximation to the Poisson Distribution
Normal Approximation to the Poisson Distribution
Suppose X ∼ Poisson(λ). Then
lim_{λ→∞} P( (X − λ)/√λ ≤ x ) = P(Z ≤ x),
where Z ∼ N(0, 1).
Example
Suppose X ∼ Poisson(100). Then
P(X = x) = e^{−100} 100^x / x!,  x = 0, 1, 2, . . . .
To calculate P(80 ≤ X ≤ 120), we would need to evaluate Σ_{x=80}^{120} e^{−100} 100^x / x!.
This isn't easy!
Example
Solution:
We have X ∼ Poisson(100). So by the Central Limit Theorem,
P(80 ≤ X ≤ 120) = P( (80 − λ)/√λ ≤ (X − λ)/√λ ≤ (120 − λ)/√λ )
 = P( (80 − 100)/10 ≤ (X − 100)/10 ≤ (120 − 100)/10 )
 ≈ P(−2 ≤ Z ≤ 2), where Z ∼ N(0, 1)
 = Φ(2) − Φ(−2) = 0.9544997.
Example
Solution - continued:
So by the Central Limit Theorem and the continuity correction,
P(80 ≤ X ≤ 120) ≈ P( (80 − λ − 0.5)/√λ ≤ Z ≤ (120 − λ + 0.5)/√λ )
 = P(−2.05 ≤ Z ≤ 2.05), where Z ∼ N(0, 1)
 = Φ(2.05) − Φ(−2.05) = 0.9596356.
Example
Solution - continued:
The exact solution is
P(80 ≤ X ≤ 120) = Σ_{x=80}^{120} e^{−100} 100^x / x! = 0.9546815.
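Again, the approximation can be checked in code (an illustrative sketch, not part of the original notes):

```python
# Illustrative sketch: exact Poisson(100) probability of {80 <= X <= 120}
# versus the normal approximation with and without continuity correction.
from scipy.stats import norm, poisson

lam = 100
exact = poisson.cdf(120, lam) - poisson.cdf(79, lam)
plain = norm.cdf(2) - norm.cdf(-2)
corrected = norm.cdf(2.05) - norm.cdf(-2.05)
print(f"exact     = {exact:.7f}")      # 0.9546815
print(f"plain CLT = {plain:.7f}")      # 0.9544997
print(f"corrected = {corrected:.7f}")  # 0.9596356
```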
6.7 Delta Method
The Central Limit Theorem provides a large sample approximation to the
distribution of X̄n .
It turns out that sequences of transformed random variables, such as g(X̄n) for a smooth function g, also converge in distribution to a normal random variable.
The general technique for establishing such results has become known as the delta method.
The reason for this name is a bit mysterious, although it seems to be related to the notation (e.g. δ) used in Taylor series expressions.
We are interested in the distribution of g(X̄n) for some function g.
Delta Method
Let Y1, Y2, . . . be a sequence of random variables such that
√n (Yn − θ) →D N(0, σ²).
Suppose the function g is differentiable at θ and g′(θ) ≠ 0. Then
√n { g(Yn) − g(θ) } →D N(0, σ² g′(θ)²).
Proof.
Sketch of proof: a Taylor series expansion about θ gives
g(Yn) ≈ g(θ) + g′(θ)(Yn − θ),
so that
√n { g(Yn) − g(θ) } ≈ g′(θ) √n (Yn − θ) →D N(0, σ² g′(θ)²).
Example
Let X1, X2, . . . be a sequence of independent and identically distributed random variables with mean two and variance seven. Find the asymptotic distribution of (X̄n)³.
Example
Solution:
To obtain the asymptotic distribution of (X̄n)³, we first need to find the asymptotic distribution of X̄n and then apply the Delta method.
By the Central Limit Theorem, we know that
√n (X̄n − 2) →D N(0, (√7)²).
Let g(X̄n) = (X̄n)³. Applying the Delta method with g(x) = x³ leads to g′(x) = 3x², so g′(2) = 12 ≠ 0.
That is, the Delta method gives
√n { (X̄n)³ − 2³ } = √n { g(X̄n) − g(2) } →D N(0, 7 × (g′(2))²) = N(0, 7 × 9 × 16) = N(0, 1008).
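A simulation sketch of this result (not in the original notes; the notes do not specify the distribution of the Xi, so N(2, 7) is used as one concrete choice with mean two and variance seven):

```python
# Illustrative sketch of the Delta method: the variance of
# sqrt(n) * (Xbar_n^3 - 2^3) should be close to 1008 for large n.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 1000, 20_000
x = rng.normal(loc=2.0, scale=np.sqrt(7.0), size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) ** 3 - 8.0)
print(f"simulated variance = {t.var():.0f}  (Delta method limit: 1008)")
```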
6.8 Supplementary Material
Supplementary Material - Chebychev’s Inequality
Chebychev’s Inequality
For any random variable Y and any k > 0,
P( |Y − E(Y)| > k √(Var(Y)) ) ≤ 1/k².
Supplementary Material - E(X̄n) and Var(X̄n)
E(X̄n) = E( (1/n) Σ_{i=1}^n Xi )
 = (1/n) Σ_{i=1}^n E(Xi)   (since expectation is a linear operator)
 = (1/n) × n × μ
 = μ.
Var(X̄n) = Var( (1/n) Σ_{i=1}^n Xi )
 = (1/n²) Σ_{i=1}^n Var(Xi)   (since the Xi are independent)
 = (1/n²) × n × σ²
 = σ²/n.
return to notes
Supplementary Material - Distribution of the Maximum of Uniform(a, b)
Let Y = max{X1, X2, . . . , Xn}, where X1, X2, . . . , Xn are distributed as Uniform(a, b). We know that Y ≤ x if and only if every sample element is less than or equal to x. That is,
P(Y ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x)
 = Π_{i=1}^n P(Xi ≤ x)   (by independence)
 = [ FXi(x) ]^n.
Special cases:
For Uniform(0, 1),
fY(y) = n y^{n−1} if 0 < y < 1, and 0 otherwise,
which is the Beta(n, 1) density, with normalising constant
Γ(n + 1)/( Γ(n) Γ(1) ) = n!/(n − 1)! = n.
For Uniform(0, θ),
fY(y) = n y^{n−1}/θ^n if 0 < y < θ, and 0 otherwise.
return to notes
Supplementary Material
Let Z1, Z2, . . . , Zn be independent and identically distributed Bernoulli(1/2) random variables.
We know that
1. E(Zi) = 1/2 = μ,
2. Var(Zi) = 1/2 × 1/2 = 1/4 = σ².
Let
n Yn = Σ_{i=1}^n Zi = n Z̄n ∼ Binomial(n, 1/2),
since it is the sum of independent and identically distributed Bernoulli(1/2) variables. Therefore, by the Weak Law of Large Numbers,
Yn = Z̄n →P 1/2.
return to notes
Supplementary Material - Skewness
Suppose the random variable X has mean μ and variance σ². The skewness of X is defined as
κ1 = E[ (X − μ)³ ]/σ³ = [ E(X³) − 3μσ² − μ³ ]/σ³.
Skewness measures the asymmetry of the density function; a symmetric distribution has skewness zero.
return to notes
Supplementary Material - Kurtosis
kur(X) = E[ (X − μ)⁴ ]/σ⁴.
Kurtosis is a measure of how outlier-prone a distribution is. The kurtosis of the normal
distribution is 3.
Distributions that are more outlier-prone than the normal distribution have kurtosis
greater than 3; distributions that are less outlier-prone have kurtosis less than 3.
return to notes