
Two Proofs of the Central Limit Theorem

Yuval Filmus
January/February 2010
In this lecture, we describe two proofs of a central theorem of mathematics,
namely the central limit theorem. One will be using cumulants, and the other
using moments. Actually, our proofs won't be entirely formal, but we will
explain how to make them formal.
1 Central Limit Theorem
What is the central limit theorem? The theorem says that under rather general
circumstances, if you sum independent random variables and normalize them
accordingly, then in the limit (when you sum lots of them) you'll get a
normal distribution.
For reference, here is the density of the normal distribution $\mathcal{N}(\mu, \sigma^2)$ with
mean $\mu$ and variance $\sigma^2$:
\[ \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \]
We now state a very weak form of the central limit theorem. Suppose that
$X_i$ are independent, identically distributed random variables with zero mean
and variance $\sigma^2$. Then
\[ \frac{X_1 + \cdots + X_n}{\sqrt{n}} \Longrightarrow \mathcal{N}(0, \sigma^2). \]
Note that if the variables do not have zero mean, we can always normalize
them by subtracting the expectation from them.
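To make the statement concrete, here is a minimal numerical sketch (assuming
NumPy is available); the Uniform(-1, 1) summands and the sample sizes are
arbitrary illustrative choices.

```python
import numpy as np
from math import erf, sqrt

# Sum n i.i.d. Uniform(-1, 1) variables (mean 0, variance 1/3) and divide by
# sqrt(n); the CLT predicts the result is approximately N(0, 1/3) for large n.
rng = np.random.default_rng(0)
n, trials = 200, 50_000
samples = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1) / np.sqrt(n)

# Compare Pr[a <= Y_n <= b] with the corresponding normal probability.
a, b = -0.5, 0.5
empirical = np.mean((samples >= a) & (samples <= b))
sigma = sqrt(1.0 / 3.0)
normal_cdf = lambda x: 0.5 * (1.0 + erf(x / (sigma * sqrt(2.0))))
print(empirical, normal_cdf(b) - normal_cdf(a))   # the two numbers should be close
```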
The meaning of $Y_n \Longrightarrow Y$ is as follows: for each interval $[a, b]$,
\[ \Pr[a \le Y_n \le b] \longrightarrow \Pr[a \le Y \le b]. \]
This mode of convergence is called convergence in distribution.
The exact form of convergence is not just a technical nicety: the normalized
sums do not converge uniformly to a normal distribution. This means that the
tails of the distribution converge more slowly than its center. Estimates for
the speed of convergence are given by the Berry-Esseen theorem and Chernoff's
bound.
The central limit theorem is true under wider conditions. We will be able to
prove it for independent variables with bounded moments, and even more
general versions are available. For example, limited dependency can be
tolerated (we will give a number-theoretic example). Moreover, random
variables not having moments (i.e. $E[X^n]$ does not converge for all $n$)
are sometimes well-behaved enough to induce convergence. Other problematic
random variables will converge, under a different normalization, to an
$\alpha$-stable distribution (look it up!).
2 Normal Distribution and Meaning of CLT
The normal distribution satisfies a nice convolution identity:
\[ X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2),\quad X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2) \implies X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2). \]
Moreover, we can scale a normally distributed variable:
\[ X \sim \mathcal{N}(\mu, \sigma^2) \implies cX \sim \mathcal{N}(c\mu, c^2\sigma^2). \]
Even more exciting, we can recover the normal distribution from these
properties. The equation $\mathcal{N}(0,1) + \mathcal{N}(0,1) = \sqrt{2}\,\mathcal{N}(0,1)$ in essence defines
$\mathcal{N}(0,1)$ (up to scaling), from which the entire ensemble can be recovered.
These properties point at why we should expect the normalized sums in the
central limit theorem to converge to a normal variable. Indeed, suppose the
convergence is to a hypothetical distribution $T$. From the equations
\[ \frac{X_1 + \cdots + X_n}{\sqrt{n}} \Longrightarrow T, \qquad \frac{X_1 + \cdots + X_{2n}}{\sqrt{2n}} \Longrightarrow T \]
we would expect $T + T = \sqrt{2}\,T$, so $T$ must be normal. Therefore the real
content of the central limit theorem is that convergence does take place. The
exact form of the basin of attraction is deducible beforehand; the only
question is whether summing up lots of independent variables and normalizing
them accordingly would get us closer and closer to the only possible limit, a
normal distribution with the limiting mean and variance.
3 Moment Generating Function
The main tool we are going to use is the so-called moment generating
function, defined as follows for a random variable $X$:
\[ M_X(t) = E[e^{tX}]. \]
Expanding the Taylor series of $e^{tX}$, we discover the reason it's called
the moment generating function:
\[ M_X(t) = \sum_{n=0}^{\infty} \frac{E[X^n]}{n!}\, t^n. \]
The moment generating function is thus just the exponential generating
function for the moments of $X$. In particular,
\[ M_X^{(n)}(0) = E[X^n]. \]
So far we've assumed that the moment generating function exists, i.e. the
implied integral $E[e^{tX}]$ actually converges for some $t \neq 0$. Later on
(in the section on characteristic functions) we will discuss what can be done
otherwise. For now, we will simply assume that the random variable $X$ has
moments of all orders, and it follows that $M_X(t)$ is well-defined (the
diligent reader will prove this using monotonicity of the $p$-norm
$\|\cdot\|_p$).
The moment generating function satisfies the following very useful
identities, concerning convolution (sum of independent variables) and scaling
(multiplication by a constant):
\[ M_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = M_X(t)\,M_Y(t), \qquad M_{cX}(t) = E[e^{tcX}] = M_X(ct). \]
For the first identity, $X$ and $Y$ must be independent, of course.
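As a quick sanity check, here is a minimal Monte Carlo sketch of these two
identities (assuming NumPy); the exponential and uniform summands, and the
values of $c$ and $t$, are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
X = rng.exponential(1.0, N)      # arbitrary independent choices for illustration
Y = rng.uniform(0.0, 1.0, N)
c, t = 0.5, 0.3                  # small t, so E[e^{tX}] is finite and well estimated

mgf = lambda Z, s: np.mean(np.exp(s * Z))   # empirical moment generating function

# Convolution identity: M_{X+Y}(t) = M_X(t) M_Y(t) for independent X and Y.
print(mgf(X + Y, t), mgf(X, t) * mgf(Y, t))
# Scaling identity: M_{cX}(t) = M_X(ct).
print(mgf(c * X, t), mgf(X, c * t))
```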
4 Example: Bernoulli and Poisson
A Bernoulli random variable Ber(p) is 1 with probability p and 0 otherwise.
A binomial random variable Bin(n, p) is the sum of n independent Ber(p)
variables.
Let us find the moment generating functions of Ber(p) and Bin(n, p). For a
Bernoulli random variable, it is very simple:
\[ M_{\mathrm{Ber}(p)} = (1-p) + p e^t = 1 + (e^t - 1)p. \]
A binomial random variable is just the sum of many Bernoulli variables, and so
\[ M_{\mathrm{Bin}(n,p)} = \bigl( 1 + (e^t - 1)p \bigr)^n. \]
Now suppose $p = \lambda/n$, and consider the asymptotic behavior of Bin(n, p):
\[ M_{\mathrm{Bin}(n,\lambda/n)} = \Bigl( 1 + \frac{(e^t - 1)\lambda}{n} \Bigr)^n \longrightarrow e^{\lambda(e^t - 1)}. \]
As the reader might know, $\mathrm{Bin}(n, p) \Longrightarrow \mathrm{Poisson}(\lambda)$, where the Poisson
random variable is defined by
\[ \Pr[\mathrm{Poisson}(\lambda) = n] = e^{-\lambda} \frac{\lambda^n}{n!}. \]
Let us calculate the moment generating function of $\mathrm{Poisson}(\lambda)$:
\[ M_{\mathrm{Poisson}(\lambda)}(t) = e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n e^{tn}}{n!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}. \]
This is hardly surprising. In the section about characteristic functions we
show how to transform this calculation into a bona fide proof (we comment
that this result is also easy to prove directly using Stirling's formula).
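The limit above can also be checked symbolically; here is a small sketch
assuming SymPy, where the symbols and the way the Poisson sum is written are
choices made for convenience.

```python
import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
n = sp.symbols('n', positive=True)
k = sp.symbols('k', integer=True, nonnegative=True)

# MGF of Bin(n, lam/n) as derived above; its limit should be the Poisson MGF.
M_bin = (1 + lam * (sp.exp(t) - 1) / n) ** n
print(sp.limit(M_bin, n, sp.oo))              # expected: exp(lam*(exp(t) - 1))

# Poisson MGF computed directly from the probability mass function.
M_pois = sp.exp(-lam) * sp.summation((lam * sp.exp(t)) ** k / sp.factorial(k),
                                     (k, 0, sp.oo))
print(sp.simplify(M_pois))                    # expected: exp(lam*(exp(t) - 1))
```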
5 Cumulants
We are now almost ready to present our first proof. We first define the
cumulant generating function of a random variable $X$:
\[ K_X(t) = \log M_X(t). \]
This somewhat strange definition makes more sense once we notice that
$M_X(t) = 1 + O(t)$, so that it makes sense to take its logarithm. In fact,
using the Taylor series of $\log(1+x)$,
\[ \log(1+t) = t - \frac{t^2}{2} + \cdots \]
we can expand $K_X(t)$ as a power series, which begins as follows:
\[ K_X(t) = \Bigl( E[X]\,t + \frac{E[X^2]}{2}\, t^2 + \cdots \Bigr) - \frac{(E[X]\,t + \cdots)^2}{2} + \cdots = E[X]\,t + \frac{E[X^2] - E[X]^2}{2}\, t^2 + \cdots \]
Hence the first two coefficients of $K_X(t)$ (as an exponential generating
function, that is, disregarding the $1/n!$ factors) are the expectation and
the variance. We call these coefficients cumulants. Formally, we can define
the $n$th cumulant $K_n[X]$ as follows:
\[ K_n[X] = K_X^{(n)}(0). \]
In particular, we have just shown that
\[ K_0[X] = 0, \qquad K_1[X] = E[X], \qquad K_2[X] = V[X]. \]
In general, using the Taylor series of $\log(1+x)$, we can express $K_n[X]$
as a polynomial in the moments. Conversely, using the Taylor series of $e^x$
we can express the moments as polynomials in the cumulants. This provides an
example of Moebius inversion (in lattices!), which alas we do not explain
here. The moral is that we can rephrase the proof below completely in terms
of moments, although it wouldn't make as much sense!
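For readers who want to see the first few of these polynomials, here is a
small sketch assuming SymPy; the formal moment symbols m1, ..., m6 and the
truncation order are arbitrary.

```python
import sympy as sp

t = sp.symbols('t')
m = sp.symbols('m1:7')   # formal moments E[X], ..., E[X^6]; the zeroth moment is 1

# Formal moment generating function, truncated at order 6.
M = 1 + sum(m[i] * t**(i + 1) / sp.factorial(i + 1) for i in range(6))
K = sp.expand(sp.series(sp.log(M), t, 0, 5).removeO())

# The cumulants are n! times the Taylor coefficients of K = log M.
for n in range(1, 5):
    print(n, sp.expand(K.coeff(t, n) * sp.factorial(n)))
# n = 1: m1;  n = 2: m2 - m1**2 (the variance);  n = 3, 4: higher cumulants.
```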
We are finally ready to give the proof, which is extremely simple. First
notice that the formulas for scaling and convolution extend to cumulant
generating functions as follows:
\[ K_{X+Y}(t) = K_X(t) + K_Y(t), \qquad K_{cX}(t) = K_X(ct). \]
Now suppose $X_1, \ldots$ are independent random variables with zero mean.
Hence
\[ K_{\frac{X_1 + \cdots + X_n}{\sqrt{n}}}(t) = K_{X_1}\Bigl(\frac{t}{\sqrt{n}}\Bigr) + \cdots + K_{X_n}\Bigl(\frac{t}{\sqrt{n}}\Bigr). \]
Rephrased in terms of the cumulants,
\[ K_m\Bigl[\frac{X_1 + \cdots + X_n}{\sqrt{n}}\Bigr] = \frac{K_m[X_1] + \cdots + K_m[X_n]}{n^{m/2}}. \]
Note that $K_1[X_k] = 0$, so the first cumulant doesn't blow up. The second
cumulant, the variance, is simply averaged. What happens to all the higher
cumulants? If the cumulants are bounded by some constant $C$, then for $m > 2$,
\[ K_m\Bigl[\frac{X_1 + \cdots + X_n}{\sqrt{n}}\Bigr] \le \frac{nC}{n^{m/2}} \longrightarrow 0. \]
In other words, all the higher cumulants disappear in the limit! Thus the
cumulants of the normalized sums tend to the cumulants of some fixed
distribution, which must be the normal distribution!
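This vanishing can be watched numerically. Below is a sketch assuming SciPy,
whose kstat function estimates cumulants from a sample; the centered
exponential summands and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import kstat

rng = np.random.default_rng(2)

def normalized_sum(n, trials=100_000):
    # Centered Exp(1) summands: mean 0, variance 1, nonzero higher cumulants.
    X = rng.exponential(1.0, size=(trials, n)) - 1.0
    return X.sum(axis=1) / np.sqrt(n)

for n in (1, 10, 100):
    Y = normalized_sum(n)
    # kstat(Y, m) is an unbiased estimator of the m-th cumulant of Y.
    print(n, [round(kstat(Y, m), 3) for m in (2, 3, 4)])
# The second cumulant stays near 1, while the third and fourth cumulants
# shrink roughly like 1/sqrt(n) and 1/n, as the bound above predicts.
```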
In order to get convinced that the limit cumulant generating function, which
is of the form $\frac{\sigma^2}{2} t^2$, indeed corresponds to a normal
distribution, we can explicitly calculate the cumulant generating function of
a normal variable (this is a simple exercise). Conversely, note that the
limiting distribution does not depend on the distributions we started with.
In particular, if we start with normally distributed $X_i$, notice that
$(X_1 + \cdots + X_n)/\sqrt{n}$ will always be normal. This argument shows
that $\frac{\sigma^2}{2} t^2$ must be the cumulant generating function of
$\mathcal{N}(0, \sigma^2)$!
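For completeness, here is the simple exercise carried out by completing the
square, for $X \sim \mathcal{N}(0, \sigma^2)$:
\[ M_X(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{tx - x^2/(2\sigma^2)}\, dx = e^{\sigma^2 t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \sigma^2 t)^2/(2\sigma^2)}\, dx = e^{\sigma^2 t^2/2}, \]
so that $K_X(t) = \log M_X(t) = \frac{\sigma^2}{2} t^2$, as claimed.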
Let's see what we proved and what's missing. We proved that the cumulant
generating function of the normalized sum tends to the cumulant generating
function of a normal distribution with zero mean and the correct (limiting)
variance, all under the assumption that the cumulants are bounded. This is
satisfied whenever the moments are bounded, for example when all variables
are identically distributed. However, we're really interested in proving
convergence in distribution. The missing ingredient is Lévy's continuity
theorem, which will be explained (without proof) in the next section.
6 Characteristic Functions
In this section we both indicate how to complete the proof of the central
limit theorem, and explain what to do when the moment generating function is
not well defined.
The moment generating function is not always defined, since the implicit
integral $E[e^{tX}]$ need not converge, in general. However, the following
trick will ensure that the integral will always converge: simply make $t$
imaginary!
This prompts the definition of the characteristic function
\[ \varphi_X(t) = E[e^{itX}]. \]
Here the integrand is bounded so the integral always converges (since
$E[1] = 1$). The astute reader will notice that $\varphi_X$ is just the
Fourier transform of $X$ (in an appropriate sense). Therefore, we would
expect that convergence in terms of characteristic functions implies
convergence in distribution, since the inverse Fourier transform is
continuous. This is just the contents of Lévy's continuity theorem! Hence, to
make our proof completely formal, all we need to do is make the argument $t$
imaginary instead of real.
The classical proof of the central limit theorem in terms of characteristic
functions argues directly using the characteristic function, i.e. without
taking logarithms. Suppose that the independent random variables $X_i$ with
zero mean and variance $\sigma^2$ have bounded third moments. Thus
\[ \varphi_{X_i}(t) = 1 - \frac{\sigma^2}{2} t^2 + O(t^3). \]
Using the identities for the moment generating function,
\[ \varphi_{\frac{X_1 + \cdots + X_n}{\sqrt{n}}} = \Bigl( 1 - \frac{\sigma^2}{2n} t^2 + O\Bigl(\frac{t^3}{n^{3/2}}\Bigr) \Bigr)^n \longrightarrow e^{-\frac{\sigma^2}{2} t^2}. \]
The right-hand side is just the characteristic function of a normal variable,
so the proof is concluded with an application of Lévy's continuity theorem.
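A numerical sketch of this convergence, assuming NumPy; the centered
Bernoulli summands, the value of $p$, and the grid of $t$ values are
arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 50_000
p = 0.25
# Centered Bernoulli(p) summands: mean 0, variance p(1 - p).
X = (rng.random(size=(trials, n)) < p).astype(float) - p
Y = X.sum(axis=1) / np.sqrt(n)

sigma2 = p * (1 - p)
for t in (0.5, 1.0, 2.0):
    empirical = np.mean(np.exp(1j * t * Y))     # empirical characteristic function
    print(t, round(empirical.real, 4), round(np.exp(-sigma2 * t**2 / 2), 4))
# The real parts should be close to exp(-sigma^2 t^2 / 2); the imaginary
# parts are close to zero and omitted here.
```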
7 Moments of the Normal Distribution
The next proof we are going to describe has the advantage of providing a
combinatorial explanation for the values of the moments of a normal
distribution. In this section, we will calculate these very moments.
Calculating the moments of a normal distribution is easy. The only thing
needed is integration by parts. We will concentrate on the case of zero mean
and unit variance. Notice that
\[ \int_{-\infty}^{\infty} x^n e^{-x^2/2}\, dx = \Bigl[ \frac{x^{n+1}}{n+1}\, e^{-x^2/2} \Bigr]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \frac{x^{n+1}}{n+1} \bigl( -x e^{-x^2/2} \bigr)\, dx = \frac{1}{n+1} \int_{-\infty}^{\infty} x^{n+2} e^{-x^2/2}\, dx. \]
In terms of moments, we get the recurrence relation
\[ M_n = \frac{M_{n+2}}{n+1} \implies M_{n+2} = (n+1) M_n. \]
Since $M_0 = 1$ and $M_1 = 0$ we get that all odd moments are zero (this
happens because the distribution is symmetric), and the even moments are
\[ M_n = (n-1) M_{n-2} = \cdots = (n-1)(n-3) \cdots 1. \]
The next proof will explain what these numbers stand for.
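These values can be double-checked symbolically; the following sketch,
assuming SymPy, integrates $x^n$ against the standard normal density.

```python
import sympy as sp

x = sp.symbols('x')
density = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)    # standard normal density

for n in range(1, 9):
    Mn = sp.integrate(x**n * density, (x, -sp.oo, sp.oo))
    expected = sp.factorial2(n - 1) if n % 2 == 0 else 0
    print(n, Mn, expected)
# Odd moments vanish; even moments are 1, 3, 15, 105, i.e. (n-1)(n-3)...1.
```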
8 Proof using Moments
In this proof we will bravely compute the limit of the moments of
$Y_n = (X_1 + \cdots + X_n)/\sqrt{n}$. For simplicity, we assume that the
variables $X_i$ are independent with zero mean, unit variance and bounded
moments. The proof can be adapted to the case of varying variances.
It is easy to see that $M_1[Y_n] = 0$. What about the second moment?
\[ M_2[Y_n] = E\Bigl[ \frac{(X_1 + \cdots + X_n)^2}{n} \Bigr] = \sum_i \frac{E[X_i^2]}{n} + \sum_{i \neq j} \frac{E[X_i X_j]}{n} = 1. \]
Calculation was easy since $E[X_i^2] = 1$ whereas $E[X_i X_j] = E[X_i]E[X_j] = 0$.
Now let's try the third moment, assuming $M_3[X_i] \le C_3$:
\[ M_3[Y_n] = E\Bigl[ \frac{(X_1 + \cdots + X_n)^3}{n^{3/2}} \Bigr] = \sum_i \frac{E[X_i^3]}{n^{3/2}} + 3 \sum_{i \neq j} \frac{E[X_i^2 X_j]}{n^{3/2}} + \sum_{i \neq j \neq k} \frac{E[X_i X_j X_k]}{n^{3/2}} \le \frac{n C_3}{n^{3/2}} = \frac{C_3}{\sqrt{n}}. \]
Thus the third moment tends to zero.
The fourth moment brings forth a more interesting calculation:
\[ M_4[Y_n] = E\Bigl[ \frac{(X_1 + \cdots + X_n)^4}{n^2} \Bigr] = \sum_i \frac{E[X_i^4]}{n^2} + 4 \sum_{i \neq j} \frac{E[X_i^3 X_j]}{n^2} + 3 \sum_{i \neq j} \frac{E[X_i^2 X_j^2]}{n^2} + 6 \sum_{i \neq j \neq k} \frac{E[X_i^2 X_j X_k]}{n^2} + \sum_{i \neq j \neq k \neq l} \frac{E[X_i X_j X_k X_l]}{n^2} = O(n^{-1}) + 3\, \frac{n(n-1)}{n^2} \longrightarrow 3. \]
What a mess! In fact, what we were doing is classifying all terms of length
4. These come in several kinds (replacing $X_i$ with $i$):

    i i i i
    i i i j   (4)
    i i j j   (3)
    i i j k   (6)
    i j k l

For example, the interesting term $iijj$ comes in these varieties:

    i i j j
    i j i j
    i j j i
All terms which contain a singleton (a variable appearing only once) equal
zero and can be dropped. Of the remaining terms, the one corresponding to
$X_i^4$ is asymptotically nil, and the one corresponding to $X_i^2 X_j^2$ is
asymptotically 3, since the extra condition $i \neq j$ becomes insignificant
in the limit.
We can now explain the general case. In the calculation of the $m$th moment,
we need to deal with terms of length $m$. We can identify each term with a
form similar to the ones given earlier, as follows. We go over all the
factors, and let the $r$th unique factor get the appellation $r$ (this is
just what we did before, with $i, j, k, l$ for $1, 2, 3, 4$). Each term
containing a singleton is identically zero. The contribution of a term with
$t$ variables with multiplicities $m_1, \ldots, m_t$ is at most
\[ \frac{n^t}{n^{m/2}}\, C_{m_1} \cdots C_{m_t}, \]
where $C_s$ is a bound on $E[X_i^s]$. Thus the term is asymptotically nil if
$t < m/2$. If $t \ge m/2$, then since each $m_i \ge 2$ (otherwise the term is
identically nil) and $m_1 + \cdots + m_t = m$, we see that in fact $t = m/2$
and every $m_i = 2$. In that case, the contribution of the term is
\[ \frac{n(n-1) \cdots (n - m/2 + 1)}{n^{m/2}} \longrightarrow 1, \]
since the random variables have unit variance. Thus the $m$th moment
converges to the number of such terms! (Note that the number of
asymptotically nil terms is finite, so their total contribution vanishes.)
If $m$ is odd then clearly there are no such terms, hence the moment is
asymptotically zero. If $m$ is even then the number of terms can be evaluated
combinatorially. The term must begin with $X_1$. There are $m-1$ possible
positions for the other $X_1$. Given its position, the first available slot
must be $X_2$. Its counterpart must be in one of the $m-3$ available
positions, and so on. Thus the number of possible terms is
\[ (m-1)(m-3) \cdots 1, \]
which is just the formula we obtained in the previous section!
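The counting argument can be replayed by brute force; here is a short sketch
using only the standard library that enumerates the pairings of $m$ positions
and compares their number with $(m-1)(m-3) \cdots 1$.

```python
def pairings(positions):
    # Enumerate all ways to split the given positions into unordered pairs.
    if not positions:
        yield []
        return
    first, rest = positions[0], positions[1:]
    for i in range(len(rest)):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, rest[i])] + tail

def double_factorial(k):
    # (m-1)(m-3)...1 when called with k = m - 1; returns 1 for k <= 0.
    return 1 if k <= 0 else k * double_factorial(k - 2)

for m in (2, 4, 6, 8):
    count = sum(1 for _ in pairings(list(range(m))))
    print(m, count, double_factorial(m - 1))   # e.g. 8 -> 105 = 7 * 5 * 3 * 1
```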
9 Bonus: Number of Prime Divisors
In this section we adapt the proof of the previous section to a
number-theoretic setting, by calculating the average number of prime factors
of a number.
Let $c(n)$ be the number of prime factors of $n$ (if $n$ is divisible by a
prime power $p^d$ then we count it only once; counting it $d$ times will lead
to very similar results). We will show that $c(n)$ is asymptotically normally
distributed. By that we mean that if $X_n$ is a uniformly random integer
between 1 and $n$, then $c(X_n)$ is close to a normal distribution, with
parameters that we will calculate.
The quantity $c(n)$ can be written in the following form, where $[P]$ is the
Iverson bracket (equal to 1 if $P$ holds):
\[ c(n) = \sum_p [p \mid n]. \]
The sum is taken over all primes. Thus, if we denote $I_p = [p \mid X_n]$, we
get
\[ c(X_n) = \sum_{p \le n} I_p. \]
The random variables $I_p$ are almost independent. Bounding their dependence
will allow us to conclude that $c(X_n)$ is asymptotically normally
distributed.
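Here is a small empirical sketch of this claim, assuming NumPy and SymPy
(whose primefactors returns the distinct prime factors); the bound $10^7$ and
the sample size are arbitrary. The sample mean and variance of $c(X_n)$ are
compared with the estimates derived below, $\log\log n + B_1$ and roughly
$\log\log n$.

```python
import math
import numpy as np
from sympy import primefactors

rng = np.random.default_rng(4)
n = 10**7
sample = rng.integers(1, n + 1, size=20_000)

# c(k) = number of distinct prime factors of k.
c = np.array([len(primefactors(int(k))) for k in sample])

loglog = math.log(math.log(n))
print("mean    ", round(c.mean(), 3), "vs log log n + B1 =", round(loglog + 0.2615, 3))
print("variance", round(c.var(), 3),  "vs roughly log log n =", round(loglog, 3))
```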
Notice that the $I_p$ are all Bernoulli variables. Thus for $m_i > 0$ we have
\[ E[I_{p_1}^{m_1} \cdots I_{p_t}^{m_t}] = \frac{\lfloor n/(p_1 \cdots p_t) \rfloor}{n} = \frac{1}{p_1 \cdots p_t} + O(1/n). \]
We now begin calculating the moments of $c(X_n)$. The expectation
$E[c(X_n)]$ is given by
\[ \sum_{p \le n} E[I_p] = \sum_{p \le n} \frac{1}{n} \Bigl\lfloor \frac{n}{p} \Bigr\rfloor = \sum_{p \le n} \frac{1}{p} + O\Bigl(\frac{1}{\log n}\Bigr) = \log\log n + B_1 + O\Bigl(\frac{1}{\log n}\Bigr). \]
We have used the well known estimate
$\sum_{p \le n} 1/p = \log\log n + B_1 + O\bigl(\frac{1}{\log n}\bigr)$, which
can be recovered (non-rigorously and without the constant and the error term)
by using the approximate parametrization $p_t = t \log t$. The constant
$B_1 \approx 0.2615$ is Mertens' constant.
The second moment is given by
\[ \sum_{p \le n} E[I_p^2] + \sum_{p \neq q \le n} E[I_p I_q] = \sum_{p \le n} \frac{1}{p} + \sum_{p \neq q \le n} \frac{1}{pq} = \sum_{p \le n} \frac{1}{p} \Bigl( 1 + \sum_{q \le n} \frac{1}{q} - \frac{1}{p} \Bigr) = E[c(X_n)]^2 + \sum_{p \le n} \frac{1}{p} - \sum_{p \le n} \frac{1}{p^2}. \]
Since the series $\sum 1/n^2$ converges, we get that the variance is
\[ V[c(X_n)] = E[c(X_n)] + O(1) = \log\log n + O(1). \]
In fact, if we want we can be more accurate:
\[ V[c(X_n)] = \log\log n + B_1 - \sum_p \frac{1}{p^2} + O\Bigl(\frac{1}{\log n}\Bigr), \]
since the error in computing the infinite sum $\sum_p p^{-2}$ is only $O(1/n)$.
Before we calculate any further moments, we would like to normalize the
indicators. Let $J_p = I_p - E[I_p]$. Expanding the expectation of the
product, with $k_s$ being the power of $E[I_{p_s}]$, we get
\[ E\Bigl[ \prod_{s=1}^{t} J_{p_s}^{m_s} \Bigr] = \sum_{k_1=0}^{m_1} \cdots \sum_{k_t=0}^{m_t} (-1)^{\sum_{s=1}^{t} k_s} \prod_{s=1}^{t} \binom{m_s}{k_s} \prod_{s=1}^{t} E[I_{p_s}]^{k_s}\, E\Bigl[ \prod_{s=1}^{t} I_{p_s}^{m_s - k_s} \Bigr] = \sum_{k_1=0}^{m_1} \cdots \sum_{k_t=0}^{m_t} (-1)^{\sum_{s=1}^{t} k_s} \prod_{s=1}^{t} \binom{m_s}{k_s} \prod_{s=1}^{t} p_s^{-k_s} \prod_{k_s < m_s} p_s^{-1} + O(1/n). \]
If $m_s = 1$ then the two summands cancel exactly, and so the entire
expectation is $O(1/n)$. Therefore we can assume that all $m_s \ge 2$.
When calculating the moments, we are going to sum over all $t$-tuples of
different primes smaller than $n$. Letting $m = \sum_{s=1}^{t} m_s$, we are
going to divide by $V[c(X_n)]^{m/2}$, so every term that is asymptotically
smaller is negligible. Therefore, a natural next step is to estimate the
following sum:
\[ \sum_{p_1 \neq \cdots \neq p_t} \prod_{s=1}^{t} p_s^{-\alpha_s} \le \sum_{p_1, \ldots, p_t} \prod_{s=1}^{t} p_s^{-\alpha_s} = O\bigl( E[c(X_n)]^{|\{s : \alpha_s = 1\}|} \bigr). \]
This is negligible unless the number of $\alpha_s$'s which equal 1 is at
least $m/2$.
Considering now the previous expression for the expectation of a product of
$J_p$'s, unless $t \ge m/2$ the term is negligible. If $t \ge m/2$ then since
$m_s \ge 2$, we see that in fact $t = m/2$ and $m_s = 2$; in particular, $m$
must be even. By considering the $m = 2$ case, we get that the contribution
of any such term tends to $V[c(X_n)]^{m/2}$ (the loss due to the requirement
that the primes be different tends to zero). As in the proof of the central
limit theorem in the previous section, there are $(m-1)(m-3) \cdots 1$ such
terms, and so all the moments tend to the moments of a normal distribution
with the stated expectation and variance. Hence the number of prime factors
is asymptotically normal.
We leave the reader to verify that even if we count prime factors according
to multiplicity, the result is still asymptotically normal, albeit with a slightly
larger expectation and variance. The main point is that the total expected
contribution of the higher powers to the sum is O(1).
10 Sources
Both proofs are taken from the nice book The Semicircle Law, Free Random
Variables, and Entropy. This book is actually about free probability, which is
an example of non-commutative probability. What can be non-commutative
about probability?
To answer this intriguing question, let us notice that so far, we have
virtually identified a distribution with its moments. Now suppose $X$ is a
random matrix. The expectations $E[\operatorname{Tr} X^n]$ give the empirical
distribution of the eigenvalues of $X$, since denoting the eigenvalues by
$\lambda_i$,
\[ E[\operatorname{Tr} X^n] = E\Bigl[ \sum_i \lambda_i^n \Bigr]. \]
Let us now define expectation to always include the trace. We identify a
distribution with the moments $E[X^n]$ of a random variable having this
distribution. Independent random matrices $X, Y$ do not necessarily satisfy
$E[XY] = E[YX]$. However, if $X$ and $Y$ are very large then we do
asymptotically have $E[XY] = E[X]E[Y]$. This, along with similar facts, is
expressed by saying that independent random matrices are asymptotically free.
By presenting a large random matrix as a sum of many random matrices (using
properties like the additivity of the normal distribution, according to which
the entries might be distributed), we get, using very similar methods, the
semicircle law, which gives the asymptotic empirical distribution of
eigenvalues; the limiting density is a semicircle!
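A minimal numerical sketch of the semicircle law, assuming NumPy; the matrix
size, the Gaussian entries and the scaling are arbitrary choices made only
for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1000
A = rng.normal(size=(N, N))
# Symmetrize and scale so that the spectrum concentrates on [-2, 2].
H = (A + A.T) / np.sqrt(2 * N)

eig = np.linalg.eigvalsh(H)
# Compare a histogram of the eigenvalues with the semicircle density
# (1 / (2 pi)) * sqrt(4 - x^2) on [-2, 2].
hist, edges = np.histogram(eig, bins=10, range=(-2, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
semicircle = np.sqrt(4 - centers**2) / (2 * np.pi)
for c, h, s in zip(centers, hist, semicircle):
    print(f"{c:+.1f}  empirical {h:.3f}  semicircle {s:.3f}")
```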
The counterpart of the cumulant generating function is called the
R-transform. Similar relations between moments and cumulants are true also in
the free setting, but they are different, having something to do with
non-crossing partitions (in our case, the relevant lattice is that of all
partitions). For all this and more, consult the book or the works of
Alexandru Nica and Roland Speicher.
We comment that old-style proofs of the semicircle law massage an exact
expression for the density of the eigenvalues. This corresponds to proving
the central limit theorem for binomial random variables by using the exact
probabilities and Stirling's formula.
Finally, the number-theoretic example was suggested by a lecture of Balázs
Szegedy (the theorem itself appears in classic books on number theory like
Hardy and Wright).