Section 6 - The Law of Large Numbers and The Central Limit Theorem
In this section we study two of the most important results in probability and statistics: the law of
large numbers and the central limit theorem.
Proof. We focus on the discrete case. Consider the random variable Z = Y − X. For all ω ∈ Ω we
have Z(ω) = Y (ω) − X(ω) ≥ 0 by the hypothesis. Thus SZ = {Z(ω) : ω ∈ Ω} ⊆ [0, ∞), and so
E[Y ] − E[X] = E[Z] = Σ_{z∈SZ} z · P(Z = z) ≥ 0.
The first equality uses linearity of expectation, the second is by definition of E[Z], the final inequality
holds since P(Z = z) ≥ 0 and z ≥ 0 for all z ∈ SZ . It follows that E[X] ≤ E[Y ].
Definition 6.2. Let A ⊆ Ω be an event. The indicator variable of the event A is the random variable
1A : Ω → R defined for all ω ∈ Ω by
1A (ω) = 1, if ω ∈ A, and 1A (ω) = 0, if ω ∉ A.
Remark 6.3. Note that 1A ∼ Berp with p = P (A), giving E[1A ] = P(A).
We are now ready to prove Markov’s inequality, a fundamental result. It is very intuitive – a non-negative random variable rarely takes values that are much larger than its expectation.
Theorem 6.4 (Markov’s inequality). Let X : Ω → R be a non-negative random variable with well-
defined expectation. Then, given any t > 0, we have
P(X ≥ t) ≤ E[X]/t.
Proof. Let A ⊆ Ω denote the event A := {ω ∈ Ω : X(ω) ≥ t}. We claim that t · 1A (ω) ≤ X(ω) for all ω ∈ Ω. To see this, we split according to whether ω ∈ A or ω ∉ A.
• If ω ∉ A then t · 1A (ω) = 0 ≤ X(ω). (We used that X is non-negative here, as this gave X(ω) ≥ 0.)
• If ω ∈ A then t · 1A (ω) = t ≤ X(ω), by the definition of A.
Taking expectations and using monotonicity and linearity of expectation, this gives t · P(A) = E[t · 1A ] ≤ E[X], and dividing by t > 0 yields P(X ≥ t) ≤ E[X]/t.
Remark 6.5. .
• Markov’s inequality is used very often in mathematics.
• Markov’s inequality is only really useful if t > E[X], as otherwise the trivial bound P(X ≥ t) ≤ 1 is better.
Example 6.6. Let X ∼ bin100,0.1 . Then E [X] = 100 · 0.1 = 10. By Markov’s inequality, we have
P (X ≥ 50) ≤ 10/50 = 0.2.
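As a sanity check, the Markov bound can be compared with the exact tail probability. A minimal sketch in Python (using only the standard library; the distribution and threshold are those of this example):

```python
from math import comb

# X ~ bin(100, 0.1): compare the exact tail P(X >= 50) with Markov's bound E[X]/50.
n, p, t = 100, 0.1, 50
exact_tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))
markov_bound = (n * p) / t  # = 10/50 = 0.2
```

The exact tail turns out to be astronomically smaller than 0.2, illustrating that Markov's inequality is crude but completely general.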
Example 6.7. Let X be a non-negative random variable with P (X ≥ 10) = 0.3. Then, by Markov’s
inequality, E [X] ≥ 0.3 · 10 = 3.
Example 6.8. On a social network, an average user has 300 friends. Equivalently, if we select a
random person on the network and let X equal the number of their friends then E[X] = 300. Markov’s
inequality then gives P(X ≥ 900) ≤ 1/3, and so at most a third of the users have 900 friends or more.
One of the most important applications of Markov’s inequality is to prove Chebyshev’s inequality,
which shows that a random variable with small variance is typically close to its expectation.
Theorem 6.9 (Chebyshev’s inequality). Let X be a random variable with well-defined expectation
and variance. Then, for all ε > 0, we have
P(|X − E[X]| ≥ ε) ≤ Var(X)/ε².
Proof. Note that if ω ∈ Ω then |X(ω) − E[X]| ≥ ε if and only if |X(ω) − E[X]|² ≥ ε². This gives
P(|X − E[X]| ≥ ε) = P(|X − E[X]|² ≥ ε²).
Now apply Markov’s inequality to the non-negative random variable (X − E[X])², with t = ε², to get
P(|X − E[X]| ≥ ε) = P(|X − E[X]|² ≥ ε²) ≤ E[(X − E[X])²]/ε² = Var(X)/ε².
The last equality here holds by definition of variance.
Remark 6.10. Since Var(X) = σX², if we take ε = ασX with α > 0, Chebyshev’s inequality gives
P(|X − E[X]| ≥ ασX ) ≤ α⁻².
It follows that |X − E[X]| is typically not much larger than σX (see Remark 5.32).
Example 6.11. Let X ∼ bin50,0.4 . Then E [X] = 20 and Var(X) = 12 so P (|X − 20| ≥ 5) ≤ 12/25.
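Here too the bound 12/25 = 0.48 can be compared with the exact probability; a short sketch using only the standard library:

```python
from math import comb

# X ~ bin(50, 0.4): E[X] = 20, Var(X) = 12.
n, p = 50, 0.4
pmf = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)

# Exact P(|X - 20| >= 5), i.e. P(X <= 15) + P(X >= 25).
exact = sum(pmf(k) for k in range(n + 1) if abs(k - 20) >= 5)
chebyshev_bound = 12 / 25  # Var(X) / 5^2 = 0.48
```

The exact probability is well below the Chebyshev bound, which is typical: Chebyshev's inequality is rarely tight.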
2 Check that the theorem is false if X can take negative values.
Example 6.12 (Binomial distribution). Let X ∼ binn,p for n ∈ N, p ∈ (0, 1). By Chebyshev’s
inequality, for all ε > 0, we have
P(|X − np| ≥ εn) ≤ Var(X)/(εn)² = np(1 − p)/(εn)² = p(1 − p)/(ε²n).
Hence, lim_{n→∞} P(|X − np| ≥ εn) = 0.
Example 6.13. We toss a coin which appears as tails with some unknown probability p ∈ (0, 1) a
total number of n times. Let Ŝn be the number of times that tails appears. Then, Ŝn ∼ binn,p . How
large must n be to guarantee that p̂n = Ŝn /n satisfies |p̂n − p| ≤ 0.01 with probability at least 0.95?
This question is ideal for applying Chebyshev’s inequality. First note that as Ŝn ∼ binn,p we have
Var(Ŝn ) = np(1 − p). Secondly, by Proposition 5.39 (ii) we have Var(Ŝn /n) = n−2 Var(Ŝn ). Therefore
Var(Ŝn /n) = Var(Ŝn )/n² = np(1 − p)/n² ≤ 1/(4n),
where the last inequality holds since p(1 − p) ≤ 1/4 for p ∈ [0, 1]. Thus
P(|p̂n − p| ≥ 0.01) = P(|Ŝn − np| ≥ 0.01n) ≤ 1/(4n · (0.01)²) = 2500/n.
Hence the desired guarantee holds provided 2500/n ≤ 1 − 0.95 = 0.05, that is, whenever n ≥ 50,000.
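A quick simulation suggests that n = 50,000 tosses is in fact rather conservative. A sketch, with an arbitrary choice p = 0.3 playing the role of the unknown probability:

```python
import random

random.seed(1)
p, n, trials = 0.3, 50_000, 100  # p is the "unknown" tail probability

within = 0
for _ in range(trials):
    tails = sum(random.random() < p for _ in range(n))
    if abs(tails / n - p) <= 0.01:
        within += 1

fraction = within / trials  # fraction of runs with |p_hat - p| <= 0.01
```

In practice essentially every run lands within 0.01 of p, since Chebyshev's inequality is far from tight here.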
Example 6.14 (Mean and variance estimation). We return to the estimators for mean µ and variance
σ 2 of an unknown distribution using independent random samples X̂1 , . . . , X̂n discussed in Section 5.6.
We introduced the sample average µ̂n = (X̂1 + · · · + X̂n )/n and the estimator for sample fluctuations σ̂n² := (1/n) Σᵢ₌₁ⁿ (X̂i − µ̂n )². We proved that E[µ̂n ] = µ, E[σ̂n²] = (1 − 1/n)σ², and that the variances of both µ̂n and σ̂n² tend to 0 as n → ∞. By Chebyshev’s inequality, it follows that for any ε > 0,
lim_{n→∞} P(|µ̂n − µ| ≥ ε) = 0
and
lim_{n→∞} P(|σ̂n² − (1 − 1/n)σ²| ≥ ε) = 0, which implies lim_{n→∞} P(|σ̂n² − σ²| ≥ ε) = 0.
An infinite collection of random variables X1 , X2 , . . . is independent and identically distributed 3 if:
• all Xi follow the same distribution, that is FXi (t) = FXj (t) for all i, j ∈ N and t ∈ R;
• every finite subcollection Xi1 , . . . , Xik is independent.
3 It is very common to write i.i.d. to abbreviate this condition.
Theorem 6.15 (Law of large numbers). Suppose that X1 , X2 , . . . are independent and identically
distributed random variables with well-defined expectations and variances, where E[Xi ] = µ and
Var(Xi ) ≤ c for all i ∈ N. Then, setting Sn = X1 + · · · + Xn , for any ε > 0 we have
lim_{n→∞} P(|Sn /n − µ| ≥ ε) = 0.
Proof. By independence, Var(Sn ) = Var(X1 ) + · · · + Var(Xn ) ≤ cn, using that Var(Xi ) ≤ c for i ∈ {1, . . . , n}. Since E[Sn ] = nµ by linearity, applying Chebyshev’s inequality to Sn gives
P(|Sn /n − µ| ≥ ε) = P(|Sn − nµ| ≥ εn) ≤ Var(Sn )/(εn)² ≤ cn/(ε²n²) = c/(nε²),
which tends to 0 as n → ∞.
Remark 6.16. We have thought of the expectation E[X] of a random variable X as a sort of average,
or typical value. The law of large numbers gives strong support to this view: if we take many
independent samples according to X (e.g. n repeated dice rolls) and average the results, the result is
extremely likely to be close to E[X].
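A short sketch of this effect for dice rolls (the sample sizes are arbitrary choices):

```python
import random

random.seed(0)

def dice_average(n):
    """Average of n independent fair dice rolls; the expectation is 3.5."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# The average stabilises near 3.5 as n grows, as the law of large numbers predicts.
averages = {n: dice_average(n) for n in (100, 10_000, 1_000_000)}
```

For n = 100 the average can easily be off by 0.1 or more, while for n = 1,000,000 it is almost always within a few thousandths of 3.5.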
Monte Carlo simulation. Here is a useful way to use randomness (and the law of large numbers) to approximate quantities. We are given a set S and a random variable Y which takes values in S, and would like to estimate P(Y ∈ A) for some set A ⊆ S.
A natural way to try to do this is to take independent random variables Y1 , . . . , Yn which all follow
the same distribution as Y and to use the ratio |{1 ≤ i ≤ n : Yi ∈ A}|/n to estimate P(Y1 ∈ A). But
does this work (most of the time)?
Yes, it does. Apply Theorem 6.15 to the random variables Xi = 1{Yi ∈A} , for i = 1, . . . , n. Then for
any ε > 0, we find
lim_{n→∞} P( | |{1 ≤ i ≤ n : Yi ∈ A}|/n − P(Y1 ∈ A) | > ε ) = 0, (1)
[Figures 1 and 2 omitted.]
Figure 1: The area of the yellow region between the curve f (x) and the x-axis represents ∫_a^b f (x) dx.
Figure 2: 11 of the 20 points lie between f (x) and the x-axis, so we approximate the integral by (11/20) · M (b − a).
In particular, area(B) = ∫_a^b f (x) dx. It follows that P(Y1 ∈ B) = (M (b − a))⁻¹ · ∫_a^b f (x) dx. As justified in (1), with probability close to 1 this gives
|{1 ≤ i ≤ n : Yi ∈ B}|/n ≈ P(Y1 ∈ B) = (∫_a^b f (x) dx)/(M (b − a)), and hence ∫_a^b f (x) dx ≈ M (b − a) · |{1 ≤ i ≤ n : Yi ∈ B}|/n.
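This recipe is easy to implement. A minimal sketch (the test integral ∫_0^π sin x dx = 2 and the bound M = 1 are illustrative choices, not part of the notes):

```python
import random
from math import sin, pi

random.seed(42)

def monte_carlo_integral(f, a, b, M, n=200_000):
    """Estimate the integral of f over [a, b] by sampling uniform points in
    [a, b] x [0, M] and counting the fraction that falls below the curve."""
    below = sum(random.uniform(0, M) <= f(random.uniform(a, b)) for _ in range(n))
    return M * (b - a) * below / n

estimate = monte_carlo_integral(sin, 0.0, pi, 1.0)  # true value: 2
```

With n = 200,000 samples the estimate is typically within a few thousandths of the true value.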
Example 6.18 (Finding π). A nice application of Monte Carlo simulation is in approximating π.
Let Y1 , Y2 , . . . , Yn be independent random variables with the uniform distribution on [−1, 1]2 =
{(x, y) : −1 ≤ x, y ≤ 1}. In other words, the two components of each Yi are independent ran-
dom variables with the continuous uniform distribution on [−1, 1]. For a set A ⊆ [−1, 1]2 , we
have P (Y1 ∈ A) = area(A)/4, where area(A) denotes the area of the set A 4 . As P (|Y1 | ≤ 1) =
P Y1 ∈ {(x, y) ∈ [−1, 1]2 : x2 + y 2 ≤ 1} = π/4, the law of large numbers gives
lim_{n→∞} P( | |{1 ≤ i ≤ n : |Yi | ≤ 1}|/n − π/4 | > ε ) = 0.
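A sketch of this estimator (the sample size and seed are arbitrary):

```python
import random

random.seed(7)
n = 400_000

# Count sample points from the square [-1, 1]^2 that land in the unit disc.
inside = sum(random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
             for _ in range(n))
pi_estimate = 4 * inside / n
```

The law of large numbers guarantees that the estimate concentrates around π; the typical error here is of order a few thousandths.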
Chebyshev’s inequality shows that a random variable X with X ∼ binn,p typically differs from E[X] = np by about σX = √Var(X) = √(np(1 − p)). In this subsection we study the De Moivre–Laplace theorem, which gives a finer description of the behaviour of X, showing that X is well-approximated by a normal distribution. The central limit theorem, a later generalisation of De Moivre–Laplace, is widely considered the most important result in probability theory and statistics.
4 The definition of area is clear for rectangles, circles and other familiar objects. For more general sets, such a definition is the content of second and third year modules.
[Figure 3 omitted.]
Figure 3: Mass function of X ∼ bin100,0.5 . The smooth curve is x ↦ √(1/(50π)) exp(−(x − 50)²/50). The area of the shaded region equals P(47 ≤ X ≤ 55) and is approximated by the integral of the smooth curve from 46 to 55, or, more precisely, by the integral from 46.5 to 55.5.
Theorem 6.19 (De Moivre–Laplace). Let p ∈ (0, 1). Given n ∈ N let Xn ∼ binn,p . Then for any
t ∈ R,
lim_{n→∞} P( (Xn − np)/√(np(1 − p)) ≤ t ) = P(N ≤ t) = Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx,
where N ∼ N (0, 1) follows the standard normal distribution.
Proof. Omitted.
The De Moivre–Laplace theorem is very useful as it allows us to estimate P(Xn ≤ k) for Xn ∼ binn,p
by a calculation for the standard normal distribution. Given such k, let
x(n, k) := (k − np)/√(np(1 − p)).
Theorem 6.19 then suggests the approximation
P(Xn ≤ k) ≈ Φ(x(n, k)). (2)
We note that if |x(n, k)| is not too large, the so-called midpoint rule 6 is a little more accurate:
P(Xn ≤ k) ≈ Φ( x(n, k) + 1/(2√(np(1 − p))) ) = P( np + √(np(1 − p)) · N ≤ k + 1/2 ). (3)
This is called the continuity correction and is supported by Figure 3 and the following example.
Example 6.20 (Dice). We roll a fair dice 600 times, and let X denote the number of times ‘6’ appears.
We are interested in the event {90 ≤ X ≤ 100}.
As √(np(1 − p)) = √(600 · (1/6) · (5/6)) = 9.1287 . . ., (2) gives
P(90 ≤ X ≤ 100) = P(X ≤ 100) − P(X ≤ 89) ≈ Φ(0) − Φ(−11/9.1287 . . .) = 0.3858 . . .
Similarly, the estimate (3) involving the continuity correction yields
P(90 ≤ X ≤ 100) = P(X ≤ 100) − P(X ≤ 89) ≈ Φ(0.0547 . . .) − Φ(−1.1502 . . .) = 0.3968 . . . .
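Both approximations are easy to reproduce numerically. A sketch using the error function from the standard library, via Φ(t) = (1 + erf(t/√2))/2:

```python
from math import erf, sqrt

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(t / sqrt(2)))

n, p = 600, 1 / 6
mu, sigma = n * p, sqrt(n * p * (1 - p))  # 100 and 9.1287...

def binom_cdf_approx(k, correction=False):
    x = (k - mu) / sigma
    if correction:
        x += 1 / (2 * sigma)  # the continuity correction from (3)
    return Phi(x)

plain = binom_cdf_approx(100) - binom_cdf_approx(89)                  # ~ 0.3858
corrected = binom_cdf_approx(100, True) - binom_cdf_approx(89, True)  # ~ 0.3968
```

The corrected value lies closer to the exact binomial probability.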
Example 6.21 (Greedy manager). A hotel has 250 rooms. Statistically, the probability of a no-show
is 0.15 per room and night. Looking to take advantage here, the hotel manager decides to overbook
and tolerate a probability of 0.03 to exceed capacities. How many rooms can the manager offer?
Let n be the number of bookings and X be the number of guests showing up. Then, X ∼ binn,0.85 .
Overbooking happens if and only if X ≥ 251, and this event shall have probability at most 3%. Using
the De Moivre-Laplace theorem with continuity correction, that is (3), we obtain
P(X ≥ 251) = 1 − P(X ≤ 250) ≈ 1 − Φ( x(n, 250) + 1/(2√(0.1275n)) ).
From Table 2, we can read off that the right hand side is smaller than 0.03 if (and only if)
x(n, 250) + 1/(2√(0.1275n)) ≥ 1.89.
As x(n, 250) = (250 − 0.85n)/√(0.1275n), this corresponds to n ≤ 281. Thus, the hotel manager can accept at most 281 bookings.
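The threshold n = 281 can be recovered by a direct search. A sketch, using the same normal approximation with continuity correction:

```python
from math import erf, sqrt

def Phi(t):
    return 0.5 * (1 + erf(t / sqrt(2)))

def p_overbooked(n, capacity=250, p_show=0.85):
    """Approximate P(more than `capacity` guests show up) among n bookings,
    via De Moivre-Laplace with continuity correction."""
    sigma = sqrt(n * p_show * (1 - p_show))
    x = (capacity - p_show * n) / sigma + 1 / (2 * sigma)
    return 1 - Phi(x)

# Largest number of bookings keeping the overbooking probability at most 3%.
max_bookings = max(n for n in range(250, 320) if p_overbooked(n) <= 0.03)  # 281
```

Since the overbooking probability increases with n, the search simply finds the last n below the 3% threshold.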
Example 6.22 (Bribes). A city has N = 1,000,000 residents, and two candidates, A and B, run for mayor. Every citizen flips a fair coin to reach a decision, and so the two candidates are equally likely to win. (A tie is possible, but very unlikely.)
Now suppose candidate A bribes 1, 000 voters. Does this increase the candidate’s chances significantly?
Assume that the remaining 999, 000 people still flip fair coins and denote by X the total number of votes
candidate A receives from these people. Then X follows the binomial distribution with parameters
999, 000 and 1/2. By (2), we have
P(A wins) = P(X ≥ 499,001) = 1 − P(X ≤ 499,000) = 1 − P( (X − 499,500)/499.7499 . . . ≤ −500/499.7499 . . . ) ≈ 1 − Φ(−1) = Φ(1) = 0.841 . . .
Thus, only 1, 000 bribes are enough to boost the candidate’s chances significantly!
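A sketch of this computation:

```python
from math import erf, sqrt

def Phi(t):
    return 0.5 * (1 + erf(t / sqrt(2)))

# Votes from the 999,000 honest voters: X ~ bin(999000, 1/2).
n, p = 999_000, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))  # 499500 and 499.7499...

# A wins iff X >= 499001, i.e. X > 499000.
p_win = 1 - Phi((499_000 - mu) / sigma)   # ~ Phi(1) ~ 0.841
```

The 500-vote head start is almost exactly one standard deviation of X, which is why Φ(1) appears.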
The central limit theorem generalises the De Moivre-Laplace theorem to arbitrary distributions with
finite variance.
Theorem 6.23 (Central limit theorem). Let X1 , X2 , . . . be independent and identically distributed
random variables with well-defined expectations and variances, where µ := E [Xi ] and σ 2 := Var(Xi ) >
0. Then, setting Sn = X1 + · · · + Xn , for all t ∈ R we have
lim_{n→∞} P( (Sn − nµ)/(σ√n) ≤ t ) = P(N ≤ t) = Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx, (4)
where N ∼ N (0, 1) follows the standard normal distribution.
Proof. Omitted, as all proofs are quite involved. See Statistics or Fourier Analysis in later years.
Remark 6.24. .
• The central limit theorem establishes the normal distribution as a key probability distribution.
It is remarkable that the same limit appears in (4), regardless of the distribution you start with
(i.e. the distribution of the random variables Xi ).
• The theorem also explains why the normal distribution appears so often in the real world: if a random quantity is determined as a sum of many (roughly) independent contributions, then it can be approximated by a random variable following the normal distribution with suitable expectation and variance 7 .
Example 6.25. Your friend rolls a fair dice N = 10, 000 times and claims to have obtained a total
sum of 35, 854. Should you believe this?
Let X1 , . . . , XN denote the results of the individual dice rolls and let S denote the total sum. Then X1 , . . . , XN all follow the uniform distribution on {1, . . . , 6} and these random variables are independent. We have E[X1 ] = 3.5 and Var(X1 ) = 35/12, as computed in Section 5. By linearity of expectation, E[S] = Σᵢ₌₁¹⁰⁰⁰⁰ E[Xi ] = 35,000, and your friend claims a deviation of at least 854, which has probability
P(|S − E[S]| ≥ 854) = P( |S − N · E[X1 ]|/(√N · σX1 ) ≥ 854/(100 · √(35/12)) ) ≈ 2 · (1 − Φ(√(12/35) · 8.54)).
Theorem 6.23 justifies the approximation. This is a tiny probability, so the claim is extremely unlikely.
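The size of this probability is easy to evaluate numerically; a sketch:

```python
from math import erf, sqrt

def Phi(t):
    return 0.5 * (1 + erf(t / sqrt(2)))

N, mu, var = 10_000, 3.5, 35 / 12
sigma_S = sqrt(N * var)            # standard deviation of the total sum S

z = 854 / sigma_S                  # ~ 5.0 standard deviations
p_deviation = 2 * (1 - Phi(z))     # two-sided tail, under the CLT approximation
```

A deviation of 854 is about five standard deviations, so the approximate probability is below one in a million.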
Confidence intervals. Let X̂1 , . . . , X̂n be independent random variables following the same distribution with mean µ and variance σ². We consider these random variables as observed data and would like to estimate the unknown mean µ. The natural guess is to use the sample average
µ̂n = (X̂1 + · · · + X̂n )/n. (5)
Beyond this single guess, given a confidence level α ∈ (0, 1), we would like to find an interval Iα = Iα (X̂1 , . . . , X̂n ), computed from the data, such that
P(µ ∈ Iα ) ≥ α. (6)
Here it is the interval Iα that is randomly chosen (based on X̂1 , . . . , X̂n ), and not µ (which is fixed, as µ is the mean of the distribution). Thus, in words, (6) says that the random interval Iα covers µ with probability at least α.
7 A nice example here is human height, which is roughly the sum of the lengths of many bones.
Here, we only construct approximate confidence intervals. Let us first assume that the value of σ is
known to us. Given the samples X̂1 , . . . , X̂n , take µ̂n as in (5) and set Iα to be
Iα = [µ̂n − zα σ/√n, µ̂n + zα σ/√n],
where we still need to select the value of zα . This approximately satisfies (6), as by Theorem 6.23
P(µ ∈ Iα ) = P( µ ∈ [µ̂n − zα σ/√n, µ̂n + zα σ/√n] ) = P( −zα ≤ (Σᵢ₌₁ⁿ X̂i − nµ)/(σ√n) ≤ zα )
≈ Φ(zα ) − Φ(−zα ) = 2Φ(zα ) − 1 = α,
provided we choose zα so that Φ(zα ) = (1 + α)/2 (for instance, z0.95 = 1.96 from Table 2).
It turns out that, if the unknown distribution is N (µ, σ 2 ), then Iα gives an exact confidence interval,
that is, P (µ ∈ Iα ) = α independently of the value of n.
Typically σ is unknown in applications. Then one can either use an upper bound for σ valid for all possible values of µ, or estimate σ from the data, adding a second level of approximation.
Example 6.26 (Swiss babies). Between 1901 and 2016, n = 9, 569, 478 babies were born in Switzer-
land, 4, 907, 770 of them were boys. We assume that the number of boys born followed the binomial
distribution with parameters n and p where p is unknown. In other words, we are in the situation of
the previous example with Bernoulli random variables X̂1 , X̂2 , . . . , X̂n with parameter p. The observed probability is p̂n = µ̂n = 0.512856 . . . 8 We do not know σ = √(p(1 − p)), but we can use the bound σ ≤ 1/2, valid for all p. 9 An approximate 95%-confidence interval is given by
I0.95 = [p̂n − 1.96/(2√n), p̂n + 1.96/(2√n)] = [p̂n − 0.000316 . . . , p̂n + 0.000316 . . .] = [0.512539 . . . , 0.51317 . . .].
We caution that this does not mean that I0.95 covers p with probability at least (roughly) 0.95. Indeed, there is no longer any randomness here, since p is fixed (but unknown) and I0.95 is also fixed above. Thus I0.95 either contains p or it does not. We can only say the following: if p ∉ I0.95 , then the probability that the observed interval I0.95 took such an extreme value was at most (roughly) 0.05.
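The numbers of this example can be checked directly; a sketch:

```python
from math import sqrt

n, boys = 9_569_478, 4_907_770
p_hat = boys / n                   # 0.512856...

# Half-width z_0.95 * sigma / sqrt(n) with the worst-case bound sigma <= 1/2.
half_width = 1.96 / (2 * sqrt(n))  # ~ 0.000316
interval = (p_hat - half_width, p_hat + half_width)
```

With almost ten million samples the interval is very narrow, and in particular lies entirely above 1/2.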
8 The ratio is similar in the UK and in other western societies.
9 As the true value of p should be close to 1/2, this bound is very precise.
Most important takeaways in this chapter. You should
• know the statements of Markov’s inequality, Chebyshev’s inequality, the law of large numbers
and the De Moivre-Laplace theorem.
• be familiar with the concept of confidence intervals, their interpretation and how to find them
in simple examples.
Table 2: Table for Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−x²/2} dx, t ≥ 0.
t 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999