ORF309 Limit Theorems
Mark Cerenzia∗
1 Variance
We introduced the expectation of a random variable as a quantity representing the “typical size”
or “best guess” of a random variable X. At one extreme, if X is deterministic, it is in fact equal to
its expectation, i.e., X ≡ EX. At the other extreme, when X is "very random," it will often be far from EX. Thus, to capture "how random" a random variable is, the following quantity measures the discrepancy from its mean.
Definition 1.1. The variance of a random variable X is the mean squared distance from its mean:

Var(X) = σ_X^2 := E(X − EX)^2.

The quantity σ_X := √Var(X) is the standard deviation. More generally, the covariance of random variables X, Y is defined by

Cov(X, Y) = σ_{X,Y} := E[(X − EX)(Y − EY)].
We say that X, Y are uncorrelated if Cov(X, Y) = 0. This is implied by (but not equivalent to) the independence of X, Y, since E[g(X)h(Y)] = E[g(X)] · E[h(Y)] whenever X, Y are independent.
Of course, one could imagine many other measures of the dispersion of the values of X around its mean, but the variance is special in its many nice properties.
Lemma 1.1. The variance satisfies the following:

• If a, b ∈ R, then Var(aX + b) = a^2 Var(X).

• Var(X) ≥ 0.

• Var(X) = 0 if and only if X is deterministic, and thus X ≡ EX.

• Var(X) = E(X^2) − (EX)^2.

• Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). Hence, if X, Y are uncorrelated or independent, then Var(X + Y) = Var(X) + Var(Y).
Much of the previous lemma may be proven relying on the relationship Var(X) = Cov(X, X), the fact that Cov(X, c) = 0 for a constant c, and finally the fact that (X, Y) ↦ Cov(X, Y) is bilinear, i.e.,

Cov(Σ_{i=1}^n a_i X_i, Σ_{j=1}^m b_j Y_j) = Σ_{i=1}^n Σ_{j=1}^m a_i b_j Cov(X_i, Y_j).
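As a quick numerical sanity check of these properties, here is a small Python/NumPy sketch added for illustration (the distributions and sample size are arbitrary choices, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10**6                                        # number of simulated samples
    X = rng.exponential(scale=2.0, size=N)           # an arbitrary random variable X
    Y = rng.normal(loc=1.0, scale=3.0, size=N)       # independent of X

    a, b = 2.5, -1.0
    print(np.var(a * X + b), a**2 * np.var(X))       # Var(aX + b) vs a^2 Var(X)
    print(np.var(X), np.mean(X**2) - np.mean(X)**2)  # Var(X) vs E[X^2] - (EX)^2
    print(np.var(X + Y), np.var(X) + np.var(Y))      # independent => Var(X + Y) = Var(X) + Var(Y)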
Example 1.1. If τ ∼ Exp(λ), we can readily calculate E[τ] = 1/λ and Var(τ) = 1/λ^2. Further, if γ ∼ Γ(k, λ), then we saw γ can be realized as a sum of k-many independent Exp(λ) distributed random variables, and thus we immediately get Var(γ) = k/λ^2.
Similarly, if X1 , X2 , . . . form a Bernoulli process, then Var(Xi ) = p(1 − p). Hence, we have
Var(Sn ) = n · p(1 − p), where Sn := X1 + · · · + Xn .
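These formulas are easy to confirm numerically; the following sketch (added for illustration, with λ, k, n, p chosen arbitrarily) compares sample variances to 1/λ^2, k/λ^2, and np(1 − p):

    import numpy as np

    rng = np.random.default_rng(1)
    lam, k, n, p, N = 2.0, 5, 100, 0.3, 10**6

    tau = rng.exponential(scale=1/lam, size=N)        # tau ~ Exp(lam)
    gamma = rng.gamma(shape=k, scale=1/lam, size=N)   # gamma ~ Gamma(k, lam)
    S_n = rng.binomial(n, p, size=N)                  # S_n = X_1 + ... + X_n with X_i ~ Bernoulli(p)

    print(np.var(tau), 1/lam**2)                      # Var(tau)   = 1/lam^2
    print(np.var(gamma), k/lam**2)                    # Var(gamma) = k/lam^2
    print(np.var(S_n), n * p * (1 - p))               # Var(S_n)   = n p (1 - p)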
In particular, if each X_i indicates whether an event A occurs in the ith repetition of an experiment (so p = P(A)), then S_n/n is the fraction of times A occurs in n repetitions. Notice also that the expectation of S_n/n is given by E[S_n/n] = P(A). For small n, S_n/n can be quite far from this expected value P(A); however, as the number of experiments n grows, we will see the mass of the distribution of S_n/n "concentrates" around its mean P(A).
The Law of Large Numbers guarantees that S_n/n "converges" to P(A) as n → ∞. More generally, this theorem says that if X_1, X_2, . . . are i.i.d. random variables with the same distribution as some given random variable X, then the empirical average (1/n) Σ_{k=1}^n X_k over n experiments "converges" to E[X] with probability 1.
Notice that the result above suggests the random variables S_n/n concentrate around the deterministic constant E[X], so establishing it will require making precise the idea that S_n/n becomes less and less random; the previous section introduced the variance as exactly such a measure of randomness. However, we were quite vague about both the technical assumptions and what we mean by "converge." We will be more precise on these points as we develop two versions of the law of large numbers.
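Before making these notions precise, here is a small simulation sketch (added for illustration; Exp(1) samples, for which E[X] = 1, are an arbitrary choice) showing the running average S_n/n settling down near E[X]:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.exponential(scale=1.0, size=10**5)              # i.i.d. Exp(1), so EX = 1
    running_avg = np.cumsum(X) / np.arange(1, X.size + 1)   # S_n / n for n = 1, ..., 10^5

    for n in (10, 100, 1000, 10**4, 10**5):
        print(n, running_avg[n - 1])                        # drifts toward EX = 1 as n grows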
2 The Law of Large Numbers
In this subsection, we sketch the proof of a version of the Weak Law of Large Numbers that takes "convergence" to be in the sense of mean square (for statisticians) or L^2-convergence (for mathematicians), i.e.,

E(S_n/n − EX)^2 → 0
as n → ∞. But since E[Sn /n] = EX, we immediately see that mean square convergence corresponds
to the variance vanishing:
E(S_n/n − EX)^2 = Var(S_n/n) = (1/n^2) Σ_{k=1}^n Var(X_k) = Var(X)/n → 0 as n → ∞,

using the variance properties from Lemma 1.1.
This establishes the Weak Law in the mean square sense. To upgrade to convergence with probability 1 (the Strong Law of Large Numbers), a natural route is the following criterion:

if Σ_{n=1}^∞ E(S_n/n − EX)^2 < +∞, then Σ_{n=1}^∞ (S_n/n − EX)^2 < +∞ w.p.1,

and thus we must have (S_n/n − EX)^2 → 0, and thus too |S_n/n − EX| → 0, w.p.1, as required.
Unfortunately, the first condition of the last display can never hold in our framework. Indeed, the astute reader will remember that E(S_n/n − EX)^2 = Var(S_n/n) = Var(X)/n, so the series appearing in the last display is proportional to the harmonic series Σ_{k=1}^∞ 1/k = +∞, which is famous for its divergence!
However, the series Σ_{k=1}^∞ 1/k^2 converges, so a flash of inspiration suggests that we square the terms of our series, i.e., we should sum the terms E(S_n/n − EX)^4 instead. This approach will require more assumptions, namely, EX^4 < +∞, and a little more work! First note that we can write

S_n/n − EX = (1/n) Σ_{k=1}^n (X_k − EX) = (1/n) Σ_{k=1}^n Z_k,

where we set Z_k := X_k − EX.
Expanding the fourth power and taking expectations gives

E(S_n/n − EX)^4 = (1/n^4) Σ_{k,ℓ,r,s=1}^n E[Z_k Z_ℓ Z_r Z_s] = (n · E[Z_1^4] + 3n(n − 1) · Var(X)^2) / n^4,

where the last equality needs to be checked. Since EZ_k = 0 and recalling that E[U · V] = EU · EV for independent random variables U, V, the only terms that survive are those for which k = ℓ = r = s (only n such terms) or for which the indices form two distinct pairs (only (4 choose 2) · n(n − 1)/2 = 3n(n − 1) such terms). Note the latter terms are of the form E[Z_k^2 Z_ℓ^2] = (E(X − EX)^2)^2 = Var(X)^2. This explains the last equality, and since the righthand side is of order 1/n^2, we can repeat the argument technique above by summing E(S_n/n − EX)^4 instead of E(S_n/n − EX)^2, which completes the proof!
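The key point, that E(S_n/n − EX)^4 is of order 1/n^2, can be checked by simulation; in the following sketch (added for illustration) the X_k are Uniform(0, 1), so EX = 1/2:

    import numpy as np

    rng = np.random.default_rng(3)
    trials = 10000
    for n in (10, 40, 160, 640):
        X = rng.uniform(0, 1, size=(trials, n))        # i.i.d. Uniform(0,1) samples, EX = 1/2
        fourth = np.mean((X.mean(axis=1) - 0.5) ** 4)  # Monte Carlo estimate of E(S_n/n - EX)^4
        print(n, fourth, fourth * n**2)                # n^2 * E(S_n/n - EX)^4 stays roughly constant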
Example 2.1. A real number is called a normal number if the limiting relative frequency of every digit d ∈ {0, . . . , 9} in its decimal expansion is 1/10. Intuitively, no digit occurs more frequently than any other. We will use the strong law of large numbers to immediately prove Borel's normal number theorem: if X is uniform in (0, 1), then X is a normal number with probability 1.
The amazing part of this theorem is that only a few specific numbers have been shown to be normal (concatenating the natural numbers or the prime numbers is known to produce a normal number, the latter case being a theorem of Copeland-Erdős), and according to Wikipedia, the classic numbers √2, π, e are believed to be normal, but there is still no proof of this fact!
To prove the normal number theorem, write the decimal expansion of X as X = 0.X_1 X_2 X_3 . . . = Σ_{k=1}^∞ X_k/10^k. Fix d ∈ {0, . . . , 9}. Then the sequence Y_n := 1_{X_n = d}, n ≥ 1, are i.i.d. with mean μ = EY_n = P(X_n = d) = 1/10. Hence, the strong law of large numbers applies to let us conclude that the empirical average (1/n) Σ_{k=1}^n Y_k, which is also the relative frequency of d among the first n digits of the decimal expansion, converges almost surely and in mean square to 1/10, as required.
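The digits of a Uniform(0, 1) random number behave like i.i.d. uniform draws from {0, . . . , 9}, so the conclusion is easy to see in simulation (a sketch added for illustration; the digit d = 7 is arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    d, n = 7, 10**6
    digits = rng.integers(0, 10, size=n)   # i.i.d. uniform digits X_1, ..., X_n
    Y = (digits == d)                      # Y_k = 1_{X_k = d}
    print(Y.mean())                        # relative frequency of d; close to 1/10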
3 The Central Limit Theorem

We will see from the Central Limit Theorem that the hump or "bell" in the probability mass described above can always be described by the function exp(−x^2/2).
Definition 3.1. A continuous random variable X is said to be normally or Gaussian distributed with mean μ ∈ R and variance σ^2, written X ∼ N(μ, σ^2), if its probability density function is given by

f_X(x) = (1/√(2πσ^2)) · exp(−(x − μ)^2/(2σ^2)), x ∈ R.
We refer to X as standard normal if it has mean zero µ = 0 and unit variance σ 2 = 1.
But how do we zero in on this shape? On the one hand, Var(S_n) = n · Var(X), which grows linearly with n (this is "zooming in"); on the other hand, by the law of large numbers, we know that the distribution of S_n/n = (1/n) Σ_{k=1}^n X_k concentrates around the constant EX, reflected by the fact that Var(S_n/n) = Var(X)/n vanishes (this is "zooming out"). The appropriate scaling to zoom in or out just the right amount to retain the hump or "bell" shape exp(−x^2/2) is found by noticing that Var(S_n/√n) = Var(X) is constant. Hence, the scaling (1/√n) Σ_{k=1}^n X_k plays a prominent role in the next statement.
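These three scalings can be compared numerically; the sketch below (added for illustration) uses Exp(1) summands, so Var(X) = 1, and exploits the fact from Example 1.1 that S_n ∼ Γ(n, 1):

    import numpy as np

    rng = np.random.default_rng(5)
    trials = 10**5
    for n in (10, 100, 1000):
        S = rng.gamma(shape=n, scale=1.0, size=trials)   # S_n for Exp(1) summands (S_n ~ Gamma(n, 1))
        print(n, np.var(S), np.var(S / n), np.var(S / np.sqrt(n)))
        # roughly:  n,        1/n,          1 (constant)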
Theorem 3.1 (Central Limit Theorem). Let X_1, X_2, . . . be i.i.d. random variables with mean μ ∈ R and variance 0 < σ^2 < ∞. Write the running empirical average as X̄_n := S_n/n = (1/n) Σ_{k=1}^n X_k. Writing μ_n := μ_{S_n} = ES_n = n · μ and σ_n := σ_{S_n} = √Var(S_n) = √n · σ, consider the z-score (intuitively, we are zooming in by a factor √n to see how the empirical average "fluctuates" about its mean)

Z_n := (S_n − μ_n)/σ_n = √n (X̄_n − μ)/σ.
Then we have
Probability distribution of Z_n → Z ∼ N(0, 1) as n → ∞.
Put another way, no longer scaling out σ^2, if Y ∼ N(0, σ^2), then for any real numbers a < b,

P(a < √n (X̄_n − μ) ≤ b) → P(a < Y ≤ b) = (1/√(2πσ^2)) ∫_a^b exp(−x^2/(2σ^2)) dx as n → ∞.
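The statement can be tested directly by simulation; in the sketch below (added for illustration) the X_k are Exp(1), so μ = σ = 1, and the endpoints a, b are arbitrary:

    import math
    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 400, 10**5
    mu, sigma = 1.0, 1.0                                    # Exp(1) has mean 1 and variance 1
    a, b = -1.0, 0.5

    Xbar = rng.gamma(shape=n, scale=1.0, size=trials) / n   # empirical averages of n Exp(1) samples
    Zish = np.sqrt(n) * (Xbar - mu)
    lhs = np.mean((a < Zish) & (Zish <= b))                 # estimate of P(a < sqrt(n)(Xbar_n - mu) <= b)

    def Phi(z):
        # standard normal cdf via erf
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    rhs = Phi(b / sigma) - Phi(a / sigma)                   # P(a < Y <= b) for Y ~ N(0, sigma^2)
    print(lhs, rhs)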
Definition 3.2. The standard deviation of a random variable X is defined as σ = √Var(X).
Remark. In 1733, de Moivre introduced the 68-95-99.7 rule, a really useful fact to keep in mind when working with normal distributions. The rule states that 68% of the mass falls within one standard deviation of the mean, 95% within two standard deviations of the mean, and 99.7% within three standard deviations of the mean. More precisely, if X ∼ N(μ, σ^2), we have

P(μ − kσ < X < μ + kσ) ≈ 68.27% for k = 1, 95.45% for k = 2, 99.73% for k = 3.
One can additionally use the symmetry of the normal distribution about its mean to further reason about how mass is distributed near the mean.
Of course, we also know that the calculation ∫_{−∞}^{∞} exp(−(x − μ)^2/(2σ^2)) dx = √(2πσ^2) holds, since f_X must be a pdf. However, general integrals involving exp(−x^2/2), like the one appearing in the statement of the central limit theorem, cannot be evaluated explicitly. One must settle either for estimates using the 68-95-99.7 rule or for numerical tables/computer functions (see "erf") that provide (very good) approximations of such integrals.
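For instance, Python's standard math.erf gives the normal cdf directly and reproduces the 68-95-99.7 figures (a short sketch added for illustration):

    import math

    def normal_cdf(x, mu=0.0, sigma=1.0):
        # P(X <= x) for X ~ N(mu, sigma^2), expressed via erf
        return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

    for k in (1, 2, 3):
        # P(mu - k*sigma < X < mu + k*sigma); the answer does not depend on mu, sigma
        print(k, normal_cdf(k) - normal_cdf(-k))   # ~0.6827, 0.9545, 0.9973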
The central limit theorem is perhaps the most famous example of the concept of “universality”
from physics: a wide variety of random phenomena exhibit the normal distribution. Put another
way, the normal distribution is “attracting” in some sense, arising in just about any random model
you can think of as long as you know where to look. For a small list: the weight or height of a man or woman, measurement errors, molecules, growth distributions of plants/animals/organs, etc. For example, the first of these is the result of many environmental and genetic factors each contributing a small amount, so we expect that polling the weights or heights of a random sample of a given gender (mixing genders here can cause issues with this approximation) will obey a normal distribution.
We turn to some examples.
Example 3.1. The rates of return for a stock are i.i.d. R_1, R_2, . . ., each equally likely to assume the two values .30 or −.25. An investor notes that the expected rate of return on the ith trading day is positive, ER_i = .025, so she invests c dollars in it. After the first trading day, the value of her shares becomes c · (1 + R_1), which she reasons she expects will be c · 1.025. Continuing to reinvest, the value of her shares becomes

c · (1 + R_1) · · · (1 + R_n)

on the nth trading day. By independence, she expects her investment growth will be c · (1.025)^n, which is exponential in n, so her returns should be to the moon.
Unfortunately, there is a very high probability (essential certitude) that the value can become arbitrarily small. Define X_i := ln(1 + R_i) and write μ = EX_i = −0.127 < 0, σ^2 = Var(X_i) = 0.0597 > 0. Letting δ ∈ (0, 1) be small, we have

P(c · (1 + R_1) · · · (1 + R_n) ≤ δ · c) = P(X_1 + · · · + X_n ≤ ln δ) = P((S_n − nμ)/(σ√n) ≤ (ln δ − nμ)/(σ√n)) ≈ (1/√(2π)) ∫_{−∞}^{(ln δ − nμ)/(σ√n)} exp(−x^2/2) dx,
where the last approximation is accurate, by the central limit theorem, for n large enough. Since μ is negative, the upper limit of integration gets arbitrarily large, and thus this probability can get as close to one as we like regardless of how small δ > 0 is! Indeed, if we take δ = .10 and compute with the relevant values μ, σ^2, one can see from numerical tables that the right-hand side will be over .99 after n = 50 trading days. Put another way, with near certainty, the value of the shares held by the investor will have fallen to 10% of her original investment after 50 days despite the positive EV (expected value). This suggests a more depressing conclusion: the initial investment will eventually disappear!
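Plugging in δ = .10, n = 50, and the values of μ and σ^2 quoted above, the normal approximation can be evaluated with erf instead of tables (a sketch added for illustration; the numerical values are taken from the example as stated):

    import math

    mu, sigma2 = -0.127, 0.0597            # values as quoted in the example
    delta, n = 0.10, 50
    sigma = math.sqrt(sigma2)

    z = (math.log(delta) - n * mu) / (sigma * math.sqrt(n))   # upper limit of integration
    prob = 0.5 * (1 + math.erf(z / math.sqrt(2)))             # P(share value <= delta * c)
    print(z, prob)                                            # prob comes out above .99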
Example 3.2 (Stirling's Formula). Let X_1, X_2, . . . be i.i.d. Poisson random variables with rate λ = 1. Recall that μ := EX_i = λ = 1 and σ^2 := Var(X_i) = λ = 1. As usual, write S_n = Σ_{k=1}^n X_k. We saw using the theory of Poisson processes that S_n also has a Poisson distribution, with rate λ = n. Thus, we have

P(S_n = n) = e^{−n} · n^n / n!.
However, we can rewrite this probability as follows: since S_n is integer-valued with mean n and standard deviation √n,

P(S_n = n) = P(n − 1 < S_n ≤ n) = P(−1/√n < (S_n − n)/√n ≤ 0) ≈ Φ(0) − Φ(−1/√n) ≈ (1/√(2π)) · (1/√n),

where Φ denotes the standard normal cdf. Combining this with the exact formula above yields e^{−n} · n^n / n! ≈ 1/√(2πn), i.e., Stirling's formula n! ≈ √(2πn) (n/e)^n.
We remark that this example is a bit circular, since the first proof of the central limit theorem (the de Moivre-Laplace theorem, which treats the special case where the X_i form a Bernoulli process) relied on Stirling's formula. Nevertheless, most modern proofs of the CLT do not explicitly use this formula, and further, this argument provides a probabilistic interpretation of the formula, thus also serving as a mnemonic device.
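A quick numerical check of the resulting approximation n! ≈ √(2πn) (n/e)^n (a sketch added for illustration):

    import math

    for n in (1, 5, 10, 50, 100):
        stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
        print(n, math.factorial(n) / stirling)   # ratio tends to 1 (roughly 1 + 1/(12n))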