
MATH/STAT 235A Probability Theory

Lecture Notes, Fall 2011


Dan Romik
Department of Mathematics, UC Davis
December 15, 2011
Contents

Chapter 1: Introduction
1.1 What is probability theory?
1.2 The algebra of events

Chapter 2: Probability spaces
2.1 Basic definitions
2.2 Properties and examples

Chapter 3: Random variables
3.1 Random variables and their distributions
3.2 Distribution functions
3.3 Examples

Chapter 4: Random vectors and independence
4.1 Random vectors
4.2 Multi-dimensional distribution functions
4.3 Independence

Chapter 5: The Borel-Cantelli lemmas

Chapter 6: A brief excursion into measure theory

Chapter 7: Expected values
7.1 Construction of the expectation operator
7.2 Properties
7.3 Convergence theorems for expectations
7.4 Computing expected values
7.5 Expectation and independent random variables
7.6 Moments

Chapter 8: Laws of large numbers
8.1 Weak laws of large numbers
8.2 Strong laws of large numbers

Chapter 9: Applications and further examples
9.1 The Weierstrass approximation theorem
9.2 Infinite expectations and triangular arrays
9.3 Random series of independent samples

Chapter 10: Stirling's formula and the de Moivre-Laplace theorem

Chapter 11: Convergence in distribution
11.1 Definition
11.2 Examples
11.3 Compactness and tightness

Chapter 12: Characteristic functions
12.1 Definition and basic properties
12.2 Examples
12.3 The inversion formula
12.4 The continuity theorem
12.5 Moments

Chapter 13: Central limit theorems
13.1 The case of i.i.d. r.v.'s
13.2 Generalizations
13.2.1 Triangular arrays
13.2.2 The Erdős-Kac theorem
13.2.3 The Euclidean algorithm

Chapter 14: Additional topics
14.1 The Kolmogorov 0-1 law
14.2 The Kolmogorov three-series theorem
14.3 The Poisson limit law

Exercises
Note to the reader. These notes are based to a large extent on Chapters 1–3 of the textbook Probability: Theory and Examples, 4th Ed., by Rick Durrett. References to Durrett's book appear throughout the text as [Dur2010]. References to the earlier 3rd edition appear as [Dur2004].
Part I: Foundations

Chapter 1: Introduction

1.1 What is probability theory?

In this course we'll learn about probability theory. But what exactly is probability theory? Like some other mathematical fields (but unlike some others), it has a dual role:

- It is a rigorous mathematical theory with definitions, lemmas, theorems, proofs, etc.
- It is a mathematical model that purports to explain or model real-life phenomena.

We will concentrate on the rigorous mathematical aspects, but we will try not to forget the connections to the intuitive notion of real-life probability. These connections will enhance our intuition, and they make probability an extremely useful tool in all the sciences. And they make the study of probability much more fun, too! A note of caution is in order, though: mathematical models are only as good as the assumptions they are based on. So probability can be used, and it can be (and quite frequently is) abused...

Example 1.1. The theory of differential equations is another mathematical theory which has the dual role of a rigorous theory and an applied mathematical model. On the other hand, number theory, complex analysis and algebraic topology are examples of fields which are not normally used to model real-life phenomena.
1.2 The algebra of events

A central notion in probability is that of the algebra of events (we'll clarify later what the word "algebra" means in this context). We begin with an informal discussion. We imagine that probability is a function, denoted P, that takes as its argument an "event" (i.e., occurrence of something in a real-life situation involving uncertainty) and returns a real number in [0, 1] representing how likely this event is to occur. For example, if a fair coin is tossed 10 times and we denote the results of the tosses by X_1, X_2, ..., X_10 (where each of the X_i is 0 or 1, signifying tails or heads), then we can write statements like

\[ P(X_i = 0) = \frac{1}{2}, \qquad (1 \le i \le 10), \]

\[ P\left( \sum_{i=1}^{10} X_i = 4 \right) = \binom{10}{4} \frac{1}{2^{10}}. \]
Note that if A and B represent events (meaning, for the purposes of the present informal discussion, objects that have a well-defined probability), then we expect that the phrases "A did not occur", "A and B both occurred" and "at least one of A and B occurred" also represent events. We can use notation borrowed from mathematical logic and denote these new events by ¬A, A ∧ B, and A ∨ B, respectively. Thus, the set of events is not just a set: it is a set with some extra structure, namely the ability to perform negation, conjunction and disjunction operations on its elements. Such a set is called an "algebra" in some contexts.
But what if the coin is tossed an infinite number of times? In other words, we now imagine an infinite sequence X_1, X_2, X_3, ... of (independent) coin toss results. We want to be able to ask questions such as

\[ P(\text{infinitely many of the } X_i\text{'s are } 0) = \,? \]

\[ P\left( \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} X_k = \frac{1}{2} \right) = \,? \]

\[ P\left( \sum_{k=1}^{\infty} \frac{2X_k - 1}{k} \ \text{converges} \right) = \,? \]

Do such questions make sense? (And if they do, can you guess what the answers are?) Maybe it is not enough to have an informal discussion to answer this...
Example 1.2. (a) An urn initially contains a white ball and a black ball. A ball is drawn out at random from the urn, then added back and another white ball is added to the urn. This procedure is repeated infinitely many times, so that after step n the urn contains 1 black ball and n + 1 white balls. For each n ≥ 1, let A_n denote the event that at step n the black ball was drawn. Now let A_∞ denote the event

    A_∞ = "in total, the black ball was selected infinitely many times"

(i.e., the event that infinitely many of the events A_n occurred).

(b) While this experiment takes place, an identical copy of the experiment is taking place in the next room. The random selections in the two neighboring rooms have no connection to each other, i.e., they are independent. For each n ≥ 1, let B_n be the event that at step n the black ball was drawn out of the copy experiment's urn. Now let B_∞ denote the event

    B_∞ = "in total, the black ball was selected infinitely many times in the second copy of the experiment"

(in other words, the event that infinitely many of the events B_n occurred).

(c) For each n ≥ 1, let C_n be the event that both A_n and B_n occurred, i.e.

    C_n = "at step n, the black ball was selected simultaneously in both experiments",

and let C_∞ denote the event "C_n occurred for infinitely many values of n".
Theorem 1.3. We have

\[ P(A_\infty) = P(B_\infty) = 1, \qquad P(C_\infty) = 0. \]

Proof. These claims are consequences of the Borel-Cantelli lemmas which we will learn about later in the course. Here is a sketch of the proof that P(C_∞) = 0 (remember, this is still an informal discussion, so our "proof" is really more of an exploration of what formal assumptions are needed to make the claim hold). For each n we have

\[ P(A_n) = P(B_n) = \frac{1}{n+1}, \]

since at time n each of the urns contains n + 1 balls, only one of which is black. Moreover, the choices in both rooms are made independently, so we have

\[ P(C_n) = P(A_n \wedge B_n) = P(A_n) P(B_n) = \frac{1}{(n+1)^2}. \]

It turns out that to prove that P(C_∞) = 0, the only relevant bit of information is that the infinite series Σ_{n=1}^∞ P(C_n) is a convergent series; the precise values of the probabilities are
irrelevant. Indeed, we can try to do various manipulations on the definition of the event C_∞, as follows:

    C_∞ = "infinitely many of the C_n's occurred"
        = "for all N ≥ 1, the event C_n occurred for some n ≥ N".

For any N ≥ 1, denote the event "C_n occurred for some n ≥ N" by D_N. Then

    C_∞ = "for all N ≥ 1, D_N occurred"
        = D_1 ∧ D_2 ∧ D_3 ∧ ...   (infinite conjunction...?!)
        = ⋀_{N=1}^∞ D_N   (shorthand notation for infinite conjunction).
In particular, in order for the event C_∞ to happen, D_N must happen for any fixed value of N (for example, D_100 must happen, D_101 must happen, etc.). It follows that C_∞ is at most as likely to happen as any of the D_N's; in other words we have

\[ P(C_\infty) \le P(D_N), \qquad (N \ge 1). \]
Now, what can we say about P(D_N)? Looking at the definition of D_N, we see that it too can be written as an infinite disjunction of events, namely

    D_N = C_N ∨ C_{N+1} ∨ C_{N+2} ∨ ...   (infinite disjunction)
        = ⋁_{n=N}^∞ C_n   (shorthand for infinite disjunction).

If this were a finite disjunction, we could say that the likelihood for at least one of the events to happen is at most the sum of the likelihoods (for example, the probability that it will rain next weekend is at most the probability that it will rain next Saturday, plus the probability that it will rain next Sunday; of course it might rain on both days, so the sum of the probabilities can be strictly greater than the probability of the disjunction). What can we say for an infinite disjunction? Since this is an informal discussion, it is impossible to answer this without being more formal about the precise mathematical model and its assumptions. As it turns out, the correct thing to do (in the sense that it leads to the most interesting and natural mathematical theory) is to assume that this fact that holds for finite disjunctions also holds for infinite ones. Whether this has any relevance to "real life" is a different question! If we make this assumption, we get for each N ≥ 1 the bound

\[ P(D_N) \le \sum_{n=N}^{\infty} P(C_n). \]

But now recall that the infinite series of probabilities Σ_{n=1}^∞ P(C_n) converges. Therefore, for any ε > 0, we can find an N for which the tail Σ_{n=N}^∞ P(C_n) of the series is less than ε. For such an N, we get that P(D_N) < ε, and therefore that P(C_∞) < ε. This is true for any ε > 0, so it follows that P(C_∞) = 0.
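The quantitative heart of this argument can be checked empirically: since P(C_n) = 1/(n+1)², the expected number of steps at which both rooms draw the black ball simultaneously is Σ_{n=1}^∞ 1/(n+1)² = π²/6 − 1 ≈ 0.645, a finite number. Here is a short Python simulation of the two-urn experiment (only an illustrative sketch; the function name and the choice of 100 steps and 10000 trials are ours):

```python
import random

def simulate_joint_black_draws(num_steps, num_trials, seed=0):
    """Monte Carlo estimate of the expected number of steps n <= num_steps
    at which BOTH urns yield their black ball (the events C_n).

    At step n each urn holds n + 1 balls, exactly one of them black, and
    the two rooms draw independently, so P(C_n) = 1/(n+1)^2."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_trials):
        for n in range(1, num_steps + 1):
            black_a = rng.randrange(n + 1) == 0  # black drawn in room A
            black_b = rng.randrange(n + 1) == 0  # black drawn in room B
            if black_a and black_b:
                total += 1
    return total / num_trials

# The exact expected count, truncated at 100 steps, is sum 1/(n+1)^2 ~ 0.635.
expected = sum(1.0 / (n + 1) ** 2 for n in range(1, 101))
estimate = simulate_joint_black_draws(100, 10000)
```

The estimate should come out within a few hundredths of the truncated expectation. By contrast, Σ_n P(A_n) = Σ_n 1/(n+1) diverges, which is exactly why Theorem 1.3 gives P(A_∞) = 1 rather than 0.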
Chapter 2: Probability spaces

2.1 Basic definitions

We now formalize the concepts introduced in the previous lecture. It turns out that it's easiest to deal with events as subsets of a large set called the probability space, instead of as abstract logical statements. The logical operations of negation, conjunction and disjunction are replaced by the set-theoretic operations of taking the complement, intersection or union, but the intuitive meaning attached to those operations is the same.

Definition 2.1 (Algebra). If Ω is a set, an algebra of subsets of Ω is a collection F of subsets of Ω that satisfies the following axioms:

    (A1)  Ω ∈ F,
    (A2)  A ∈ F  ⟹  A^c ∈ F,
    (A3)  A, B ∈ F  ⟹  A ∪ B ∈ F.

A word synonymous with algebra in this context is field.

Definition 2.2 (σ-algebra). A σ-algebra (also called a σ-field) is an algebra F that satisfies the additional axiom

    (A4)  A_1, A_2, A_3, ... ∈ F  ⟹  ⋃_{n=1}^∞ A_n ∈ F.
Example 2.3. If Ω is any set, then {∅, Ω} is a σ-algebra; in fact it is the smallest possible σ-algebra of subsets of Ω. Similarly, the power set P(Ω) of all subsets of Ω is a σ-algebra, and is (obviously) the largest σ-algebra of subsets of Ω.

Definition 2.4 (Measurable space). A measurable space is a pair (Ω, F) where Ω is a set and F is a σ-algebra of subsets of Ω.
Definition 2.5 (Probability measure). Given a measurable space (Ω, F), a probability measure on (Ω, F) is a function P : F → [0, 1] that satisfies the properties:

    (P1)  P(∅) = 0,  P(Ω) = 1,
    (P2)  A_1, A_2, ... ∈ F pairwise disjoint  ⟹  P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
Definition 2.6 (Probability space). A probability space is a triple (Ω, F, P), where (Ω, F) is a measurable space, and P is a probability measure on (Ω, F).

Intuitively, we think of Ω as representing the set of possible outcomes of a probabilistic experiment, and refer to it as the sample space. The σ-algebra F is the σ-algebra of events, namely those subsets of Ω which have a well-defined probability (as we shall see later, it is not always possible to assign well-defined probabilities to all sets of outcomes). And P is the notion or measure of probability on our sample space.

Probability theory can be described loosely as the study of probability spaces (this is of course a gross oversimplification...). A more general mathematical theory called measure theory studies measure spaces, which are like probability spaces except that the measures can take values in [0, ∞] instead of [0, 1], and the total measure of the space is not necessarily equal to 1 (such measures are referred to as σ-additive nonnegative measures). Measure theory is an important and non-trivial theory, and studying it requires a separate concentrated effort. We shall content ourselves with citing and using some of its most basic results. For proofs and more details, refer to Chapter 1 and the measure theory appendix in [Dur2010] or to a measure theory textbook.
2.2 Properties and examples

Lemma 2.7. If (Ω, F, P) is a probability space, then we have:

(i) Monotonicity: If A, B ∈ F, A ⊆ B, then P(A) ≤ P(B).

(ii) Sub-additivity: If A_1, A_2, ... ∈ F, then P(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).

(iii) Continuity from below: If A_1, A_2, ... ∈ F are such that A_1 ⊆ A_2 ⊆ A_3 ⊆ ..., then

    P(⋃_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

(iv) Continuity from above: If A_1, A_2, ... ∈ F are such that A_1 ⊇ A_2 ⊇ A_3 ⊇ ..., then

    P(⋂_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).
Exercise 2.8. Prove Lemma 2.7.
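Continuity from below can be watched concretely on a small example. The geometric probability space in the Python sketch below (masses p(k) = 2^{-k} on {1, 2, 3, ...}) is our own illustration, not one taken from the text; exact rational arithmetic makes the monotone convergence visible:

```python
from fractions import Fraction

def P_initial_segment(n):
    """P(A_n) for the increasing events A_n = {1, ..., n} under the
    probability mass function p(k) = 2^{-k} on {1, 2, 3, ...}."""
    return sum(Fraction(1, 2 ** k) for k in range((1), n + 1))

# A_1 is contained in A_2, etc., and their union is the whole space,
# which has measure 1, so continuity from below forces P(A_n) -> 1.
probs = [P_initial_segment(n) for n in (1, 5, 10, 20)]
```

Here P(A_n) = 1 − 2^{−n} exactly, so the gap to the limiting value halves at every step.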
Example 2.9. Discrete probability spaces. Let Ω be a countable set and let p : Ω → [0, 1] be a function such that

    Σ_{ω∈Ω} p(ω) = 1.

This corresponds to the intuitive notion of a probabilistic experiment with a finite or countably infinite number of outcomes, where each individual outcome ω has a probability p(ω) of occurring. We can put such an elementary or discrete experiment in our more general framework by defining the σ-algebra of events F to be the set of all subsets of Ω, and defining the probability measure P by

    P(A) = Σ_{ω∈A} p(ω),  A ∈ F.

If Ω is a finite set, a natural probability measure to consider is the uniform measure, defined by

    P(A) = |A| / |Ω|.
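This example translates almost verbatim into code. In the Python sketch below, the helper make_discrete_measure and the fair-die sample space are our own illustrations of the definitions above:

```python
from fractions import Fraction

def make_discrete_measure(p):
    """Given a probability mass function p (a dict from outcomes to masses
    summing to 1), return the induced measure P(A) = sum of p over A."""
    assert sum(p.values()) == 1
    def P(A):
        return sum(p[w] for w in A)
    return P

# A fair six-sided die: the uniform measure on Omega = {1, ..., 6}.
omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}
P = make_discrete_measure(p)

evens = {2, 4, 6}
P(evens)  # Fraction(1, 2), i.e. |A| / |Omega| = 3/6
```

Using exact Fractions makes the defining identities of a probability measure, such as additivity over disjoint events, checkable without floating-point noise.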
Example 2.10. Choosing a random number uniformly in (0, 1). The archetypical example of a non-elementary probability space (i.e., one which does not fall within the scope of the previous example) is the experiment of choosing a random number uniformly in the interval (0, 1). How do we know that it makes sense to speak of such an experiment? We don't, yet. But let us imagine what constructing such an experiment might entail. We are looking for a hypothetical probability space (Ω, F, P), in which the sample space Ω is simply (0, 1), F is some σ-algebra of subsets of (0, 1), and P is a probability measure that corresponds to our notion of a uniform choice of a random number. One plausible way to formalize this is to require that intervals of the form (a, b) ⊆ (0, 1) be considered as events, and that the probability for our uniform number to fall in such an interval should be equal to its length b − a. In other words, we shall require that

    (a, b) ∈ F,  (0 ≤ a < b ≤ 1),

and that

    P((a, b)) = b − a,  (0 ≤ a < b ≤ 1).    (1)

How do we generate a σ-algebra of subsets of (0, 1) that contains all the intervals? We already saw that the set of all subsets of (0, 1) will work. But that is too large! If we take all subsets, we will see in an exercise later that it will be impossible to construct the probability measure P to satisfy our requirements. So let's try to build the smallest possible σ-algebra. One way (which can perhaps be described as the "bottom-up" approach) would be to start with the intervals, then take all countable unions of such and add them to our collection of sets, then add all countable intersections of such sets, then add all countable unions, etc. Will this work? In principle it can be made to work, but this is a bit difficult and requires knowing something about transfinite induction. Fortunately there is a more elegant (but somewhat more abstract and less intuitive) way of constructing the minimal σ-algebra, which is outlined in the next exercise below, and can be thought of as the "top-down" approach. The resulting σ-algebra of subsets of (0, 1) is called the Borel σ-algebra; its elements are called Borel sets.

What about the probability measure P? Here we will simply cite a result from measure theory that says that the measure we are looking for exists, and is unique. This is not too difficult to prove, but doing so would take us a bit too far off course.
Theorem 2.11. Let B be the σ-algebra of Borel sets on (0, 1), the minimal σ-algebra containing all the sub-intervals of (0, 1), proved to exist in the exercise below. There exists a unique measure P on the measurable space ((0, 1), B) satisfying (1), called Lebesgue measure on (0, 1).
Exercise 2.12 (The σ-algebra generated by a set of subsets of Ω). (i) Let Ω be a set, and let {F_i}_{i∈I} be some collection of σ-algebras of subsets of Ω, indexed by some index set I. Prove that the intersection of all the F_i's (i.e., the collection of subsets of Ω that are elements of all the F_i's) is also a σ-algebra.

(ii) Let Ω be a set, and let A be a collection of subsets of Ω. Prove that there exists a unique σ-algebra σ(A) of subsets of Ω that satisfies the following two properties:

1. A ⊆ σ(A) (in words, σ(A) contains all the elements of A).

2. σ(A) is the minimal σ-algebra satisfying property 1 above, in the sense that if F is any other σ-algebra that contains all the elements of A, then σ(A) ⊆ F.

Hint for (ii): Let (F_i)_{i∈I} be the collection of all σ-algebras of subsets of Ω that contain A. This is a non-empty collection, since it contains for example P(Ω), the set of all subsets of Ω. Any σ-algebra σ(A) that satisfies the two properties above is necessarily a subset of any of the F_i's, hence it is also contained in the intersection of all the F_i's, which is a σ-algebra by part (i) of the exercise.
Definition 2.13. If A is a collection of subsets of a set Ω, the σ-algebra σ(A) discussed above is called the σ-algebra generated by A.
Example 2.14. The space of infinite coin toss sequences. Another archetypical experiment in probability theory is that of a sequence of independent fair coin tosses, so let's try to model this experiment with a suitable probability space. If for convenience we represent the result of each coin toss as a binary value of 0 or 1, then the sample space Ω is simply the set of infinite sequences of 0's and 1's, namely

    Ω = { (x_1, x_2, x_3, ...) : x_i ∈ {0, 1}, i = 1, 2, ... } = {0, 1}^N.
What about the σ-algebra F? We will take the same approach as we did in the previous example, which is to require certain natural sets to be events, and to take as our σ-algebra the σ-algebra generated by these "elementary" events. In this case, surely, for each n ≥ 1, we would like the set

    A_n(1) := { x = (x_1, x_2, ...) : x_n = 1 }    (2)

to be an event (in words, this represents the event "the coin toss x_n came out Heads"). Therefore we take F to be the σ-algebra generated by the collection of sets of this form.

Finally, the probability measure P should conform to our notion of a sequence of independent fair coin tosses. Generalizing the notation in (2), for a ∈ {0, 1} define

    A_n(a) = { x = (x_1, x_2, ...) : x_n = a }.

Then P should satisfy

    P(A_n(a)) = 1/2,

representing the fact that the n-th coin toss is unbiased. But more generally, for any n ≥ 1 and (a_1, a_2, ..., a_n) ∈ {0, 1}^n, since the first n coin tosses are independent, P should satisfy

    P( A_1(a_1) ∩ A_2(a_2) ∩ ... ∩ A_n(a_n) ) = 1/2^n.    (3)
As in the example of Lebesgue measure discussed above, the fact that a probability measure P on (Ω, F) that satisfies (3) exists and is unique follows from some slightly non-trivial facts from measure theory, and we will take it on faith for the time being. Below we quote the relevant theorem from measure theory, which generalizes the setting discussed in this example to the more general situation of a product of probability spaces.
Theorem 2.15 (Products of probability spaces). Let { (Ω_n, F_n, P_n) }_{n=1}^∞ be a sequence of probability spaces. Denote Ω = Π_{n=1}^∞ Ω_n (the cartesian product of the outcome sets), and let F be the σ-algebra of subsets of Ω generated by sets which are of the form

    { (x_1, x_2, ...) : x_n ∈ A }

for some n ≥ 1 and set A ∈ F_n. Then there exists a unique probability measure P on (Ω, F) such that for any n ≥ 1 and any finite sequence

    (A_1, A_2, ..., A_n) ∈ F_1 × F_2 × ... × F_n,

the equation

    P( { (x_1, x_2, ...) : x_1 ∈ A_1, x_2 ∈ A_2, ..., x_n ∈ A_n } ) = Π_{k=1}^n P_k(A_k)

holds.
Exercise 2.16. Explain why the infinite sequence of coin tosses experiment is a special case of a product of probability spaces, and why the existence and uniqueness of a probability measure satisfying (3) follows from Theorem 2.15.

In an upcoming homework exercise we will show an alternative way of proving the existence of the probability space of infinite coin toss sequences using Lebesgue measure on (0, 1).
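Each event in (3) depends on only finitely many coordinates of the infinite sequence, so its probability can be estimated by sampling just a finite prefix of tosses. A Python sketch (the function name and sample sizes are our own choices for illustration):

```python
import random

def estimate_cylinder_probability(pattern, num_samples, seed=1):
    """Monte Carlo estimate of P(x_1 = a_1, ..., x_n = a_n) on the space of
    infinite fair coin toss sequences. The event depends only on the first
    n coordinates, so each trial needs only n fair bits."""
    rng = random.Random(seed)
    n = len(pattern)
    hits = 0
    for _ in range(num_samples):
        prefix = tuple(rng.randrange(2) for _ in range(n))  # first n tosses
        if prefix == tuple(pattern):
            hits += 1
    return hits / num_samples

# Equation (3) predicts probability 1/2^3 = 0.125 for the pattern (1, 0, 1).
est = estimate_cylinder_probability((1, 0, 1), 100000)
```

Each estimate should land within sampling error of the value 1/2^n that (3) predicts for a pattern of length n.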
Chapter 3: Random variables

3.1 Random variables and their distributions

As we have seen, a probability space is an abstract concept that represents our intuitive notion of a probabilistic experiment. Such an experiment however can be very long (even infinite) and contain a lot of information. To make things more manageable, we consider numerical-valued functions on probability spaces, which we call random variables. However, not just any function will do: a random variable has to relate in a nice way to the measurable space structure, so that we will be able to ask questions like "what is the probability that this random variable takes a value less than 8", etc. This leads us to the following definitions.
Definition 3.1. If (Ω_1, F_1) and (Ω_2, F_2) are two measurable spaces, a function X : Ω_1 → Ω_2 is called measurable if for any set E ∈ F_2, the set

    X^{-1}(E) = { ω ∈ Ω_1 : X(ω) ∈ E }

is in F_1.

Definition 3.2. If (Ω, F, P) is a probability space, a real-valued function X : Ω → R is called a random variable if it is a measurable function when considered as a function from the measurable space (Ω, F) to the measurable space (R, B), where B is the Borel σ-algebra on R, namely the σ-algebra generated by the intervals.
Exercise 3.3. Let (Ω_1, F_1) and (Ω_2, F_2) be two measurable spaces such that F_2 is the σ-algebra generated by a collection A of subsets of Ω_2. Prove that a function X : Ω_1 → Ω_2 is measurable if and only if X^{-1}(A) ∈ F_1 for all A ∈ A.

It follows that the random variables are exactly those real-valued functions on Ω for which the question

    "What is the probability that a < X < b?"

has a well-defined answer for all a < b. This observation makes it easier in practice to check if a given function is a random variable or not, since working with intervals is much easier than with the rather unwieldy (and mysterious, until you get used to them) Borel sets.
What can we say about the behavior of a random variable X defined on a probability space (Ω, F, P)? All the information is contained in a new probability measure μ_X on the measurable space (R, B) that is induced by X, defined by

    μ_X(A) = P(X^{-1}(A)) = P( { ω ∈ Ω : X(ω) ∈ A } ).

The number μ_X(A) is the probability that X "falls in A" (or "takes its value in A").

Exercise 3.4. Verify that μ_X is a probability measure on (R, B). This measure is called the distribution of X, or sometimes referred to more fancifully as the law of X. In some textbooks it is denoted L_X.
Definition 3.5. If X and Y are two random variables (possibly defined on different probability spaces), we say that X and Y are identically distributed (or equal in distribution) if μ_X = μ_Y (meaning that μ_X(A) = μ_Y(A) for any Borel set A ⊆ R). We denote this

    X =_d Y.
How can we check if two random variables are identically distributed? Once again, working with Borel sets can be difficult, but since the Borel sets are generated by the intervals, a simpler criterion involving just this generating family of sets exists. The following lemma is a consequence of basic facts in measure theory, which can be found in the Measure Theory appendix in [Dur2010].

Lemma 3.6. Two probability measures μ_1, μ_2 on the measurable space (R, B) are equal if and only if they are equal on the generating set of intervals, namely if

    μ_1((a, b)) = μ_2((a, b))  for all −∞ < a < b < ∞.
3.2 Distribution functions

Instead of working with distributions of random variables (which are probability measures on the measurable space (R, B), and themselves quite unwieldy objects), we will encode them in a simpler object called a distribution function (sometimes referred to as a cumulative distribution function, or c.d.f.).

Definition 3.7. The cumulative distribution function (or c.d.f., or just distribution function) of a random variable X defined on a probability space (Ω, F, P) is the function F_X : R → [0, 1] defined by

    F_X(x) = P(X ≤ x) = P(X^{-1}((−∞, x])) = P( { ω ∈ Ω : X(ω) ≤ x } ),  (x ∈ R).

Note that we have introduced here a useful notational device that will be used again many times in the following sections: if A is a Borel set, we will often write {X ∈ A} as shorthand for the set { ω ∈ Ω : X(ω) ∈ A }. In words, we may refer to this as "the event that X falls in A". When discussing its probability, we may omit the curly braces and simply write P(X ∈ A). Of course, one should always remember that on the formal level this is just the set-theoretic inverse image of a set by a function!
Theorem 3.8 (Properties of distribution functions). If F = F_X is a distribution function, then it has the following properties:

(i) F is nondecreasing.

(ii) lim_{x→∞} F(x) = 1, lim_{x→−∞} F(x) = 0.

(iii) F is right-continuous, i.e., F(x+) := lim_{y↓x} F(y) = F(x).

(iv) F(x−) := lim_{y↑x} F(y) = P(X < x).

(v) P(X = x) = F(x) − F(x−).

(vi) If G = F_Y is another distribution function of a random variable Y, then X and Y are equal in distribution if and only if F ≡ G.

Proof. Exercise (recommended), or see page 9 of [Dur2010].
Definition 3.9. A function F : R → [0, 1] satisfying properties (i)–(iii) in the previous theorem is called a cumulative distribution function, or just distribution function.

Theorem 3.10. If F is a distribution function, then there exists a random variable X such that F = F_X.
This fact has a measure-theoretic proof similar to the proof of Theorem 2.11, but fortunately in this case there is a more probabilistic proof that relies only on the existence of Lebesgue measure. (This is one of many examples of probabilistic ideas turning out to be useful to prove facts in analysis and measure theory.) This involves the probabilistic concept of a quantile (a generalization of the concepts of percentile and median that we frequently hear about in news reports).

Definition 3.11. If X is a random variable on a probability space (Ω, F, P) and 0 < p < 1 is a number, then a real number x is called a p-quantile of X if the inequalities

    P(X ≤ x) ≥ p,
    P(X ≥ x) ≥ 1 − p

hold.
Note that the question of whether x is a p-quantile of X can be answered just by knowing the distribution function F_X of X: since P(X ≤ x) = F_X(x) and P(X ≥ x) = 1 − F_X(x−), we can write the conditions above as

    F_X(x−) ≤ p ≤ F_X(x).

Lemma 3.12. A p-quantile for X always exists. Moreover, the set of p-quantiles of X is equal to the (possibly degenerate) closed interval [a_p, b_p], where

    a_p = sup{ x : F_X(x) < p },
    b_p = inf{ x : F_X(x) > p }.
Exercise 3.13. Prove Lemma 3.12.
Proof of Theorem 3.10. Let ((0, 1), B, P) be the unit interval with Lebesgue measure, representing the experiment of drawing a uniform random number in (0, 1). We shall construct our random variable X on this space. Inspired by the discussion of quantiles above, we define

    X(p) = sup{ y : F(y) < p },  (0 < p < 1).

If F were the distribution function of a random variable, then X(p) would be its (minimal) p-quantile.

Note that properties (i) and (ii) of F imply that X(p) is defined and finite for any p and that it is a monotone nondecreasing function on (0, 1). In particular, it is measurable, so it is in fact a random variable on the probability space ((0, 1), B, P). We need to show that F is its distribution function. We will show that for each p ∈ (0, 1) and x ∈ R, we have that X(p) ≤ x if and only if p ≤ F(x). This will imply that for every x ∈ R we have the equality of sets

    { p : X(p) ≤ x } = { p : p ≤ F(x) },

and, applying P to both sides of this equation, we will get

    F_X(x) = P(X ≤ x) = P( { p : p ≤ F(x) } ) = P( (0, F(x)] ) = F(x)

(since P is Lebesgue measure).

To prove the claim, note that if p ≤ F(x) then all elements of the set { y : F(y) < p } satisfy y ≤ x, and therefore the supremum X(p) of this set also satisfies X(p) ≤ x. Conversely, if p > F(x), then, since F is right-continuous, we have p > F(x + δ) for some δ > 0. It follows that X(p) ≥ x + δ > x (since x + δ is in the set { y : F(y) < p }).

The function X defined in the proof above is sometimes referred to as the (lower) quantile function of the distribution F. Note that if F is a strictly increasing function then X is simply its ordinary (set-theoretic) inverse function.
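The proof above is constructive, and is the basis of what is usually called inverse transform sampling: push uniform numbers p through the quantile function X to produce samples with distribution function F. The Python sketch below illustrates this; the bisection helper is our own simplification and assumes F is continuous and strictly increasing on the bracketing interval (so that X is just the inverse function, as remarked above):

```python
import math
import random

def quantile_function(F, lo, hi, tol=1e-9):
    """Return the quantile function X(p) = sup{y : F(y) < p} of a
    distribution function F, computed by bisection on [lo, hi].
    A sketch: assumes F is continuous and strictly increasing there."""
    def X(p):
        a, b = lo, hi
        while b - a > tol:
            m = (a + b) / 2
            if F(m) < p:
                a = m  # F(m) < p: the sup lies to the right of m
            else:
                b = m
        return a
    return X

# Exponential distribution: F(x) = 1 - e^{-x} for x >= 0 (Example 3.16).
F = lambda x: 0.0 if x < 0 else 1.0 - math.exp(-x)
X = quantile_function(F, lo=0.0, hi=50.0)

# Feeding uniform p in (0, 1) through X yields samples whose c.d.f. is F,
# exactly as in the proof of Theorem 3.10.
rng = random.Random(0)
samples = [X(rng.random()) for _ in range(2000)]
empirical = sum(1 for s in samples if s <= 1.0) / len(samples)
```

For the exponential distribution one can of course solve F(x) = p directly, giving X(p) = −log(1 − p) in closed form; the bisection version is only meant to mirror the generality of the construction in the proof.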
3.3 Examples

Example 3.14. Indicator random variables. If A is an event in a probability space (Ω, F, P), its indicator random variable is the r.v. 1_A defined by

    1_A(ω) = { 0  if ω ∉ A,
               1  if ω ∈ A.

The above discussion shows that to specify the behavior of a random variable, it is enough to specify its distribution function. Another useful concept is that of a density function.
If F = F
X
is a distribution function such that for some nonnegative function f : R R we
have
F(x) =
_
x

f(y) dy, (y R), (4)


then we say that X has a density function f. Note that f determines F but is itself only
determined by F up to small changes that do not aect the integral in (4) (in measure-
theoretic terminology we say that f is determined up to a set of measure 0). For example,
changing the valuef in a nite number of points results in a density function that is equally
valid for computing F.
Example 3.15. Uniform random variables. We say that X is a uniform random variable on (0, 1) if it has the distribution function

    F(x) = { 0  if x ≤ 0,
             x  if 0 ≤ x ≤ 1,
             1  if x ≥ 1.

Such a r.v. has as its density function the function

    f(x) = { 1  if 0 ≤ x ≤ 1,
             0  otherwise.

More generally, if a < b we say that X is a uniform random variable on the interval (a, b) if it has the (respective) distribution and density functions

    F(x) = { 0              if x ≤ a,
             (x − a)/(b − a) if a ≤ x ≤ b,
             1              if x ≥ b,

    f(x) = { 1/(b − a)  if a ≤ x ≤ b,
             0          otherwise.
Example 3.16. Exponential distribution.

    F(x) = { 0           if x ≤ 0,
             1 − e^{−x}  if x ≥ 0,

    f(x) = { 0       if x < 0,
             e^{−x}  if x ≥ 0.
Example 3.17. Standard normal distribution. The normal (or Gaussian) distribution is given in terms of its density function

    f(x) = (1/√(2π)) e^{−x²/2}.

The cumulative distribution function is denoted by

    Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy.

This integral cannot be evaluated explicitly in terms of more familiar functions, but is an important special function of mathematics nonetheless.
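Although Φ has no elementary closed form, it can be expressed through the error function as Φ(x) = (1 + erf(x/√2))/2, which standard libraries provide. A quick sketch (the use of `math.erf` is our choice, not part of the notes):

```python
import math

def Phi(x):
    # Standard normal CDF via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(Phi(0.0))    # 0.5, by symmetry of the density
print(Phi(1.96))   # approximately 0.975, the familiar 95% two-sided cutoff
```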
Chapter 4: Random vectors and independence
4.1 Random vectors
A random variable is a real-valued measurable function defined on a probability space (Ω, F, P) (when we talk of R as a measurable space, it will always be taken to mean with the σ-algebra of Borel sets). Similarly, we now wish to talk about vector-valued measurable functions on a probability space, i.e., functions taking values in R^d. First, we need to identify a good σ-algebra of subsets of R^d. Risking some confusion, we will still call it the Borel σ-algebra and denote it by B, or sometimes by B(R^d).

Definition 4.1. The Borel σ-algebra on R^d is defined in one of the following equivalent ways:

(i) It is the σ-algebra generated by boxes of the form

    (a_1, b_1) × (a_2, b_2) × ... × (a_d, b_d).

(ii) It is the σ-algebra generated by the balls in R^d.

(iii) It is the σ-algebra generated by the open sets in R^d.

(iv) It is the minimal σ-algebra of subsets of R^d such that the coordinate functions π_i : R^d → R defined by

    π_i(x) = x_i,    i = 1, 2, ..., d,

are all measurable (where measurability is with respect to the Borel σ-algebra on the target space R).
Exercise 4.2. Check that the definitions above are indeed all equivalent.
Definition 4.3. A random (d-dimensional) vector (or vector random variable) X = (X_1, X_2, ..., X_d) on a probability space (Ω, F, P) is a function X : Ω → R^d that is measurable (as a function between the measurable spaces (Ω, F) and (R^d, B)).

Lemma 4.4. X = (X_1, ..., X_d) is a random vector if and only if X_i is a random variable for each i = 1, ..., d.
Proof. If X is a random vector then each of its coordinates X_i = π_i ∘ X is a composition of two measurable functions and therefore (check!) measurable. Conversely, if X_1, ..., X_d are random variables then for any box E = (a_1, b_1) × (a_2, b_2) × ... × (a_d, b_d) ⊆ R^d we have

    X^{−1}(E) = ⋂_{i=1}^{d} X_i^{−1}((a_i, b_i)) ∈ F.

Therefore by Definition 4.1 and Exercise 3.3, X is a random vector.
Exercise 4.5. (i) Prove that any continuous function f : R^m → R^n is measurable (when each of the spaces is equipped with the respective Borel σ-algebra).
(ii) Prove that the composition g ∘ f of measurable functions f : (Ω_1, F_1) → (Ω_2, F_2) and g : (Ω_2, F_2) → (Ω_3, F_3) (where (Ω_i, F_i) are measurable spaces for i = 1, 2, 3) is a measurable function.
(iii) Deduce that the sum X_1 + ... + X_d of random variables is a random variable.
Exercise 4.6. Prove that if X_1, X_2, ... is a sequence of random variables (all defined on the same probability space), then the functions

    inf_n X_n,    sup_n X_n,    limsup_n X_n,    liminf_n X_n

are all random variables. Note: Part of the question is to generalize the notion of random variable to a function taking values in R̄ = R ∪ {−∞, +∞}, or you can solve it first with the additional assumption that all the X_i's are uniformly bounded by some constant M.
4.2 Multi-dimensional distribution functions
If X = (X_1, ..., X_d) is a d-dimensional random vector, we define its distribution to be the probability measure

    μ_X(A) = P(X^{−1}(A)) = P(ω : X(ω) ∈ A),    A ∈ B(R^d),

similarly to the one-dimensional case. The measure μ_X is also called the joint distribution (or joint law) of the random variables X_1, ..., X_d.
Once again, to avoid having to work with measures, we introduce the concept of a d-dimensional distribution function.
Definition 4.7. The d-dimensional distribution function of a d-dimensional random vector X = (X_1, ..., X_d) (also called the joint distribution function of X_1, ..., X_d) is the function F_X : R^d → [0, 1] defined by

    F_X(x_1, x_2, ..., x_d) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_d ≤ x_d)
                            = μ_X( (−∞, x_1] × (−∞, x_2] × ... × (−∞, x_d] ).
Theorem 4.8 (Properties of distribution functions). If F = F_X is a distribution function of a d-dimensional random vector, then it has the following properties:

(i) F is nondecreasing in each coordinate.

(ii) For any 1 ≤ i ≤ d, lim_{x_i → −∞} F(x) = 0.

(iii) lim_{x → (∞,...,∞)} F(x) = 1.

(iv) F is right-continuous, i.e., F(x+) := lim_{y ↓ x} F(y) = F(x), where here y ↓ x means that y_i ↓ x_i in each coordinate.

(v) For 1 ≤ i ≤ d and a < b, denote by Δ^x_{a,b} the differencing operator in the variable x, which takes a function f of the real variable x (and possibly also dependent on other variables) and returns the value

    Δ^x_{a,b} f = f(b) − f(a).

Then, for any real numbers a_1 < b_1, a_2 < b_2, ..., a_d < b_d, we have

    Δ^{x_1}_{a_1,b_1} Δ^{x_2}_{a_2,b_2} ... Δ^{x_d}_{a_d,b_d} F ≥ 0.
Proof. See Chapter 1 in [Dur2010] or Appendix A.2 in [Dur2004].
Theorem 4.9. Any function F satisfying the properties in Theorem 4.8 above is a distribution function of some random vector X.
Proof. See Chapter 1 in [Dur2010] or Appendix A.2 in [Dur2004].
4.3 Independence
Definition 4.10. Events A, B ∈ F in a probability space (Ω, F, P) are called independent if

    P(A ∩ B) = P(A)P(B).

More generally, a family A = (A_i)_{i∈I} of events in a probability space (Ω, F, P) is called an independent family if for any finite subset A_{i_1}, A_{i_2}, ..., A_{i_k} of distinct events in the family we have that

    P(A_{i_1} ∩ A_{i_2} ∩ ... ∩ A_{i_k}) = ∏_{j=1}^{k} P(A_{i_j}).

Definition 4.11. Random variables X, Y on a probability space (Ω, F, P) are called independent if

    P(X ∈ E, Y ∈ F) = P(X ∈ E)P(Y ∈ F)

for any Borel sets E, F ⊆ R. In other words, any two events representing possible statements about the behaviors of X and Y are independent events.
Definition 4.12. If Ω is a set and X : Ω → R is a function, the family of subsets of Ω defined by

    σ(X) = { X^{−1}(A) : A ∈ B(R) }

is a σ-algebra (check!) called the σ-algebra generated by X. It is easy to check that it is the minimal σ-algebra with which Ω can be equipped so as to make X into a random variable.

Definition 4.13. If (Ω, F, P) is a probability space, two σ-algebras A, B ⊆ F are called independent if any two events A ∈ A, B ∈ B are independent events.

It follows from the above definitions that r.v.'s X, Y are independent if and only if the σ-algebras σ(X), σ(Y) generated by them are independent σ-algebras.
Definition 4.14. If (Ω, F, P) is a probability space and (F_i)_{i∈I} is some family of sub-σ-algebras of F (i.e., σ-algebras that are subsets of F), we say that (F_i)_{i∈I} is an independent family of σ-algebras if for any i_1, i_2, ..., i_k ∈ I and events A_1 ∈ F_{i_1}, A_2 ∈ F_{i_2}, ..., A_k ∈ F_{i_k}, the events A_1, ..., A_k are independent.
Definition 4.15. A family (X_i)_{i∈I} of random variables defined on some common probability space (Ω, F, P) is called an independent family of random variables if the σ-algebras σ(X_i), i ∈ I, form an independent family of σ-algebras.

Unraveling these somewhat abstract definitions, we see that (X_i)_{i∈I} is an independent family of r.v.'s if and only if we have

    P(X_{i_1} ∈ A_1, ..., X_{i_k} ∈ A_k) = ∏_{j=1}^{k} P(X_{i_j} ∈ A_j)

for all indices i_1, ..., i_k ∈ I and Borel sets A_1, ..., A_k ∈ B(R).
Theorem 4.16. If (F_i)_{i∈I} is a family of sub-σ-algebras of the σ-algebra of events F in a probability space, and for each i ∈ I the σ-algebra F_i is generated by a family A_i of subsets of Ω, and each family A_i is closed under taking the intersection of two sets (such a family is called a π-system), then the family (F_i)_{i∈I} is independent if and only if for each i_1, ..., i_k ∈ I, any finite sequence of events A_1 ∈ A_{i_1}, A_2 ∈ A_{i_2}, ..., A_k ∈ A_{i_k} is independent.

Proof. This uses Dynkin's π-λ theorem from measure theory. See [Dur2010], Theorem 2.1.3, p. 39 or [Dur2004], Theorem (4.2), p. 24.
As a corollary, we get a convenient criterion for checking when the coordinates of a
random vector are independent.
Lemma 4.17. If X_1, ..., X_d are random variables defined on a common probability space, then they are independent if and only if for all x_1, ..., x_d ∈ R we have that

    F_{X_1,...,X_d}(x_1, ..., x_d) = F_{X_1}(x_1) F_{X_2}(x_2) ... F_{X_d}(x_d).
Exercise 4.18. (i) We say that a Riemann-integrable function f : R^d → [0, ∞) is a (d-dimensional, or joint) density function for a random vector X = (X_1, ..., X_d) if

    F_X(x_1, ..., x_d) = ∫_{−∞}^{x_1} ∫_{−∞}^{x_2} ... ∫_{−∞}^{x_d} f(u_1, ..., u_d) du_d ... du_1,    x_1, ..., x_d ∈ R.

Show that if f is a density for X and can be written in the form

    f(x_1, ..., x_d) = f_1(x_1) f_2(x_2) ... f_d(x_d),

then X_1, ..., X_d are independent.
(ii) Show that if X_1, ..., X_d are random variables taking values in a countable set S, then in order for X_1, ..., X_d to be independent it is enough that for all x_1, ..., x_d ∈ S we have

    P(X_1 = x_1, ..., X_d = x_d) = P(X_1 = x_1) ... P(X_d = x_d).
Chapter 5: The Borel-Cantelli lemmas
Let (Ω, F, P) be a probability space, and let A_1, A_2, A_3, ... ∈ F be a sequence of events. We define the following events derived from the sequence (A_n)_n:

    limsup A_n = ⋂_{N=1}^{∞} ⋃_{n=N}^{∞} A_n,

    liminf A_n = ⋃_{N=1}^{∞} ⋂_{n=N}^{∞} A_n.

What is the meaning of these events? If we think of the sequence A_n as representing a sequence of (not necessarily independent) probabilistic experiments, then we can translate the first event into words as

    limsup A_n = the event that for all N ≥ 1 there is an n ≥ N such that A_n occurred
               = the event that infinitely many of the A_n's occurred.

For this reason, the event limsup A_n is often denoted by {A_n infinitely often} or {A_n i.o.}.
The definition of the event liminf A_n can similarly be given meaning by writing

    liminf A_n = the event that there exists an N ≥ 1 such that A_n occurred for all n ≥ N
               = the event that all but finitely many of the A_n's occurred.

Exercise 5.1. Prove that for any ω ∈ Ω we have

    1_{limsup A_n}(ω) = limsup_{n→∞} 1_{A_n}(ω),

    1_{liminf A_n}(ω) = liminf_{n→∞} 1_{A_n}(ω).
Theorem 5.2 (Borel-Cantelli lemmas).
(i) If ∑_{n=1}^{∞} P(A_n) < ∞ then P(A_n i.o.) = 0.
(ii) If ∑_{n=1}^{∞} P(A_n) = ∞ and (A_n)_{n=1}^{∞} are independent then P(A_n i.o.) = 1.
Proof. We essentially already proved part (i) in the first lecture, but here is a more general repetition of the same argument.

    P(A_n i.o.) = P( ⋂_{N=1}^{∞} ⋃_{n=N}^{∞} A_n ) ≤ inf_{N≥1} P( ⋃_{n=N}^{∞} A_n ) ≤ inf_{N≥1} ∑_{n=N}^{∞} P(A_n).

Since we assumed that ∑_{n=1}^{∞} P(A_n) converges, this last expression is equal to 0.
Proof of (ii): Consider the complementary event that the A_n's did not occur for infinitely many values of n. Using De Morgan's laws, we get

    P({A_n i.o.}^c) = P( ( ⋂_{N=1}^{∞} ⋃_{n=N}^{∞} A_n )^c ) = P( ⋃_{N=1}^{∞} ⋂_{n=N}^{∞} A_n^c ) ≤ ∑_{N=1}^{∞} P( ⋂_{n=N}^{∞} A_n^c ).

So, to show that this is 0 (under the assumptions that ∑_{n=1}^{∞} P(A_n) = ∞ and that the events are independent), we show that P( ⋂_{n=N}^{∞} A_n^c ) = 0 for all N ≥ 1. Since the events are independent, the probability of the intersection is the product of the probabilities, so we need to show that

    ∏_{n=N}^{∞} (1 − P(A_n)) = 0,

or equivalently (taking minus the logarithm) that

    −∑_{n=N}^{∞} log(1 − P(A_n)) = ∞.

But −log(1 − x) ≥ x for all x > 0, so this follows from the assumption that the series of probabilities diverges.
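The dichotomy in the proof can be seen numerically: for independent events, the probability that no A_n with n ≥ 2 occurs is the product ∏(1 − P(A_n)), which tends to 0 exactly when ∑ P(A_n) diverges. A small sketch (the two test sequences 1/n and 1/n² are our choice):

```python
def prob_none_occur(p_terms):
    # For independent events, P(no A_n occurs) = product of (1 - P(A_n)).
    prod = 1.0
    for p in p_terms:
        prod *= 1.0 - p
    return prod

N = 100_000
divergent = prob_none_occur(1.0 / n for n in range(2, N))      # sum 1/n diverges
convergent = prob_none_occur(1.0 / n**2 for n in range(2, N))  # sum 1/n^2 converges

print(divergent)    # tends to 0 as N grows: infinitely many A_n occur a.s.
print(convergent)   # stays near 1/2: the events can fail to occur infinitely often
```

Both products telescope here, so the numerical answers can be checked by hand: the first equals 1/(N − 1), the second (N)/(2(N − 1)).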
Chapter 6: A brief excursion into measure theory
Here we briefly mention some of the basic definitions and results from measure theory, and point out how we used them in the previous lectures. The relevant material is covered in Appendix A.1 in [Dur2010] (Appendix A.1 and A.2 in [Dur2004]). It is not required reading, but if you read it you are perhaps more likely to attain a good understanding of the material that is required...

Definition 6.1. (i) A π-system is a collection P of subsets of a set Ω that is closed under intersection of two sets, i.e., if A, B ∈ P then A ∩ B ∈ P.
(ii) A λ-system is a collection L of subsets of a set Ω such that: 1. Ω ∈ L; 2. If A, B ∈ L and A ⊆ B then B \ A ∈ L; 3. If (A_n)_{n=1}^{∞} are all in L and A_n ↑ A then A ∈ L.
The following is a somewhat technical result that turns out to be quite useful:

Theorem 6.2 (Dynkin's π-λ theorem). If P is a π-system and L is a λ-system that contains P then σ(P) ⊆ L.

Lemma 6.3 (Uniqueness theorem). If the values of two probability measures μ_1 and μ_2 coincide on a collection P of sets, and P is a π-system (closed under finite intersection), then μ_1 and μ_2 coincide on the generated σ-algebra σ(P).
The uniqueness theorem implies for example that to check if random variables X and Y
are equal in distribution, it is enough to check that they have the same distribution functions.
Both of the above results are used in the proof of the following important theorem in
measure theory:
Theorem 6.4 (Caratheodory's extension theorem). Let μ be an almost-probability-measure defined on an algebra A of subsets of a set Ω. That is, it satisfies all the axioms of a probability measure except that it is defined on an algebra and not a σ-algebra; σ-additivity is satisfied whenever the countable union of disjoint sets in the algebra is also an element of the algebra. Then μ has a unique extension to a probability measure on the σ-algebra σ(A) generated by A.
Caratheodory's extension theorem is the main tool used in measure theory for constructing measures: one always starts out by defining the measure on some relatively small family of sets and then extending to the generated σ-algebra (after verifying σ-additivity, which often requires using topological arguments, e.g., involving compactness). Applications include:

- Existence and uniqueness of Lebesgue measure on (0, 1), R and R^d.
- Existence and uniqueness of probability measures associated with a given distribution function on R (sometimes called Lebesgue-Stieltjes measures on R). We proved existence instead by starting with Lebesgue measure and using quantile functions.
- Existence and uniqueness of Lebesgue-Stieltjes measures on R^d (i.e., measures associated with a d-dimensional joint distribution function). Here there is no good concept analogous to quantile functions, although there are other ways to construct such measures explicitly using ordinary Lebesgue measures.
- Product measures: this corresponds to probabilistic experiments which consist of several independent smaller experiments.

Note that Durrett's book also talks about measures that are not probability measures, i.e., where the total measure of the space is not 1 and may even be infinite. In this setting, the theorems above can be formulated in greater generality.
End of Part I
Part II: Laws of Large Numbers
Chapter 7: Expected values
7.1 Construction of the expectation operator
We wish to define the notion of the expected value, or expectation, of a random variable X, which will be denoted EX (or E(X)). In measure theory this is denoted ∫ X dP and is called the Lebesgue integral. It is one of the most important concepts in all of mathematical analysis! So time invested in understanding it is time well-spent.
The idea is simple. For bounded random variables, we want the expectation to satisfy three properties: First, the expectation of an indicator variable 1_A, where A is an event, should be equal to P(A). Second, the expectation operator should be linear, i.e., should satisfy E(aX + bY) = aEX + bEY for real numbers a, b and r.v.'s X, Y. Third, it should be monotone, i.e., if X ≤ Y (meaning X(ω) ≤ Y(ω) for all ω ∈ Ω) then EX ≤ EY.
For unbounded random variables, we will also require some kind of continuity, but let's treat the bounded case first. It turns out that these properties determine the expectation/Lebesgue integral operator uniquely. Different textbooks may have some variation in how they construct it, but the existence and uniqueness are really the essential facts.
Theorem 7.1. Let (Ω, F, P) be a probability space. Let B denote the class of bounded random variables. There exists a unique operator E that takes a r.v. X ∈ B and returns a number in R, and satisfies:
1. If A ∈ F then E(1_A) = P(A).
2. If X, Y ∈ B and a, b ∈ R then E(aX + bY) = aE(X) + bE(Y).
3. If X, Y ∈ B and X ≤ Y then E(X) ≤ E(Y).
Sketch of proof. Call X a simple function if it is of the form X = ∑_{i=1}^{n} a_i 1_{B_i}, where a_1, ..., a_n ∈ R and B_1, ..., B_n are disjoint events. For such r.v.'s define E(X) = ∑_i a_i P(B_i). Show that the linearity and monotonicity properties hold; so far uniqueness clearly holds, since we had no choice in how to define E(X) for such functions if we wanted the properties above to hold. Now for a general bounded r.v. X with |X| ≤ M, for any ε > 0 it is possible to approximate X from below and above by simple functions Y ≤ X ≤ Z such that E(Z − Y) < ε. This suggests defining

    E(X) = sup{ E(Y) : Y is a simple function such that Y ≤ X }.    (5)

By approximation, the construction is shown to still satisfy the properties in the Theorem and to be unique, since E(X) is squeezed between E(Y) and E(Z), and these can be made arbitrarily close to each other.
We can now extend the definition of the expectation operator to non-negative random variables. In that case we still define EX by eq. (5). This can be thought of as a kind of continuity-from-below axiom that is added to properties 1-3 above, although we shall see that it can be reformulated in several equivalent ways. Note that now EX may sometimes be infinite.
Finally, for a general random variable X, we decompose X as a difference of two non-negative r.v.'s by writing

    X = X_+ − X_−,

where X_+ = max(X, 0) is called the positive part of X and X_− = max(−X, 0) is called the negative part of X.
We say that X has an expectation if the two numbers EX_−, EX_+ are not both ∞. In this case we define

    EX = EX_+ − EX_−.

This is a number in R ∪ {−∞, ∞}. If both EX_−, EX_+ are < ∞, or in other words if E|X| < ∞ (since |X| = X_+ + X_−), we say that X has finite expectation or is integrable.
Theorem 7.2. Suppose X, Y ≥ 0 or X, Y ≤ 0 or E|X|, E|Y| < ∞. Then:
1. If X is a simple function then the definition (5) coincides with the original definition, namely E(∑_i a_i 1_{B_i}) = ∑_i a_i P(B_i).
2. E(aX + bY + c) = aEX + bEY + c for any real numbers a, b, c, where in the case where E(X) = E(Y) = ±∞, we require a, b to have the same sign in order for the right-hand side of this identity to be well-defined.
3. If X ≤ Y then EX ≤ EY.

Proof. See [Dur2010], Section 1.4.
Remark. See
https://en.wikipedia.org/wiki/Lebesgue_integration#Intuitive_interpretation
for a nice description of the difference in approaches between the more familiar Riemann integral and the Lebesgue integral.
7.2 Properties
1. Expectation is invariant under almost-sure equivalence: If X ≤ Y almost surely, meaning P(X ≤ Y) = 1, then by the definition we have EX ≤ EY, since any simple function Z such that Z ≤ X can be replaced with another simple function Z′ such that Z′ ≤ Y and Z = Z′ almost surely. It follows also that if X = Y almost surely then EX = EY.
2. Triangle inequality: |EX| ≤ E|X|.

Proof.

    |EX| = |EX_+ − EX_−| ≤ EX_+ + EX_− = E|X|.
3. Markov's inequality (called Chebyshev's inequality in [Dur2010]): for a nonnegative r.v. X and t > 0,

    P(X ≥ t) ≤ EX / t.

Proof. Use monotonicity twice to deduce:

    P(X ≥ t) = E(1_{X≥t}) ≤ E( (1/t) X 1_{X≥t} ) ≤ EX / t.
4. Variance: If X has finite expectation, we define its variance to be

    V(X) = E(X − EX)².

If V(X) < ∞, by expanding the square it is easy to rewrite the variance as

    V(X) = E(X²) − (EX)².

We denote σ(X) = √V(X) and call this quantity the standard deviation of X. Note that if a ∈ R then V(aX) = a²V(X) and σ(aX) = |a|σ(X).
5. Chebyshev's inequality:

    P(|X − EX| ≥ t) ≤ V(X)/t².

Proof. Apply Markov's inequality to Y = (X − EX)².
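Chebyshev's inequality can be checked against an exact computation for a concrete distribution. A sketch with exact rational arithmetic (the fair die is our choice of example):

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]    # a fair die, each value with probability 1/6
p = Fraction(1, 6)

mean = sum(p * v for v in values)                  # EX = 7/2
var = sum(p * (v - mean) ** 2 for v in values)     # V(X) = 35/12

t = 2
tail = sum(p for v in values if abs(v - mean) >= t)   # exact P(|X - EX| >= t)
bound = var / t**2                                     # Chebyshev bound V(X)/t^2

print(tail, bound)   # tail = 1/3, bound = 35/48: the inequality holds, with slack
```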
6. Cauchy-Schwarz inequality:

    E|XY| ≤ (EX² EY²)^{1/2}.

Equality holds if and only if X and Y are linearly dependent, i.e., aX + bY ≡ 0 holds for some a, b ∈ R not both zero.

Proof. Consider the function

    p(t) = E(|X| + t|Y|)² = t² EY² + 2t E|XY| + EX².

Since p(t) = at² + bt + c is a quadratic polynomial in t that satisfies p(t) ≥ 0 for all t, its discriminant b² − 4ac must be non-positive. This gives

    (E|XY|)² − EX² EY² ≤ 0,

as claimed. The condition for equality is left as an exercise.
7. Jensen's inequality: A function φ : R → R is called convex if it satisfies

    φ(λx + (1 − λ)y) ≤ λφ(x) + (1 − λ)φ(y)

for all x, y ∈ R and λ ∈ [0, 1]. If φ is convex then

    φ(EX) ≤ E(φ(X)).

Proof. See homework.
8. L^p-norm monotonicity: If 0 < r ≤ s then

    (E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s}.    (6)

Proof. Apply Jensen's inequality to the r.v. |X|^r with the convex function φ(x) = x^{s/r}.
7.3 Convergence theorems for expectations
We want to study notions of continuity for the expectation operator. If X_n → X as n → ∞, under what conditions do we have that E(X_n) → EX? First we have to decide what X_n → X actually means. We define two notions of convergence of a sequence of random variables to a limit.

Definition 7.3. Let X, X_1, X_2, ... be random variables all defined on the same probability space. We say that X_n converges in probability to X, and denote X_n →_P X as n → ∞, if for all ε > 0 we have that

    P(|X_n − X| > ε) → 0 as n → ∞.
Definition 7.4. With X, X_1, X_2, ... as before, we say that X_n converges almost surely to X (or converges to X with probability 1), and denote X_n →_{a.s.} X as n → ∞, if

    P(X_n → X) = P({ω ∈ Ω : X(ω) = lim_{n→∞} X_n(ω)}) = 1.
Exercise 7.5. Show that {ω ∈ Ω : X(ω) = lim_{n→∞} X_n(ω)} is an event and therefore has a well-defined probability. In other words, represent it in terms of countable union, intersection and complementation operations on simple sets that are known to be events. Hint: Use the definition of a limit.
Lemma 7.6. Almost sure convergence is a stronger notion of convergence than convergence in probability. In other words, if X_n →_{a.s.} X then X_n →_P X, but the converse is not true.
Exercise 7.7. Prove Lemma 7.6. For the counterexample showing that convergence in probability does not imply almost sure convergence, consider the following sequence of random variables defined on the space ((0, 1), B, Lebesgue measure):

    1_{(0,1)},
    1_{(0,1/2)}, 1_{(1/2,1)},
    1_{(0,1/4)}, 1_{(1/4,2/4)}, 1_{(2/4,3/4)}, 1_{(3/4,1)},
    1_{(0,1/8)}, 1_{(1/8,2/8)}, 1_{(2/8,3/8)}, 1_{(3/8,4/8)}, 1_{(4/8,5/8)}, 1_{(5/8,6/8)}, 1_{(6/8,7/8)}, 1_{(7/8,1)},
    ...
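This "typewriter" sequence can be enumerated directly: the n-th indicator has small expectation (the interval lengths shrink), yet any fixed ω lands in one interval from every block. A sketch (the indexing of blocks by k = 0, 1, 2, ... and the choice ω = 0.3 are ours):

```python
def interval(n):
    # Enumerate the typewriter sequence: block k consists of the 2^k dyadic
    # intervals (j / 2^k, (j + 1) / 2^k), j = 0, ..., 2^k - 1.
    k = 0
    while n >= 2 ** k:
        n -= 2 ** k
        k += 1
    return (n / 2 ** k, (n + 1) / 2 ** k)

omega = 0.3
# Blocks k = 0, ..., 9 occupy indices 0, ..., 1022 (1 + 2 + ... + 512 = 1023 terms).
hits = [i for i in range(1023) if interval(i)[0] < omega < interval(i)[1]]

# Interval lengths tend to 0, so P(X_n = 1) -> 0 (convergence in probability to 0),
# yet omega lies in exactly one interval per block, so X_n(omega) = 1 infinitely often.
print(len(hits))   # 10 hits: one per block among the first 1023 terms
```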
We can now formulate the fundamental convergence theorems for Lebesgue integration.
Theorem 7.8 (Bounded convergence theorem). If X_n is a sequence of r.v.'s such that |X_n| ≤ M for all n, and X_n → X in probability, then EX_n → EX.

Proof. Fix ε > 0. Then

    |EX_n − EX| ≤ E|X_n − X| = E[ |X_n − X| 1_{|X_n−X|>ε} ] + E[ |X_n − X| 1_{|X_n−X|≤ε} ]
                ≤ 2M P(|X_n − X| > ε) + ε,

which converges to ε as n → ∞. Since ε was an arbitrary positive number, this implies that |EX_n − EX| → 0, as claimed.
Theorem 7.9 (Fatou's lemma). If X_n ≥ 0 then liminf_{n→∞} EX_n ≥ E(liminf_{n→∞} X_n).

To see that the inequality in the lemma can fail to be an equality, let U ~ U(0, 1), and define X_n = n 1_{U≤1/n}. Clearly liminf_{n→∞} X_n = lim_{n→∞} X_n ≡ 0 almost surely, but E(X_n) = 1 for all n.
Proof. Let Y = liminf_{n→∞} X_n. Note that Y can be written as

    Y = sup_{n≥1} inf_{m≥n} X_m

(this is a general fact about the lim inf of a sequence of real numbers), or Y = sup_n Y_n, where we denote

    Y_n = inf_{m≥n} X_m.

We have Y_n ≤ X_n, and as n → ∞, Y_n → Y (in fact Y_n ↑ Y) almost surely. Therefore EY_n ≤ EX_n, so liminf_n EY_n ≤ liminf_n EX_n, and therefore it is enough to show that

    liminf_{n→∞} EY_n ≥ EY.

But for any M we have that

    Y_n ∧ M →_{a.s.} Y ∧ M as n → ∞,

and this is a sequence of uniformly bounded r.v.'s, therefore by the bounded convergence theorem we get that

    E(Y_n) ≥ E(Y_n ∧ M) → E(Y ∧ M) as n → ∞.

We therefore get that liminf_n E(Y_n) ≥ E(Y ∧ M) for any M > 0, which implies the result because of the following exercise.
Exercise 7.10. Let Y ≥ 0 be a random variable. Prove that

    E(Y) = sup_{M>0} E(Y ∧ M).
Theorem 7.11 (Monotone convergence theorem). If 0 ≤ X_n ↑ X as n → ∞ then EX_n ↑ EX.

Proof.

    EX = E[liminf_n X_n] ≤ liminf_n EX_n ≤ limsup_n EX_n ≤ limsup_n EX = EX.
Theorem 7.12 (Dominated convergence theorem). If X_n → X almost surely, |X_n| ≤ Y for all n ≥ 1 and EY < ∞, then EX_n → EX.

Proof. Apply Fatou's lemma separately to Y + X_n and to Y − X_n.
7.4 Computing expected values
Lemma 7.13. If X is a discrete r.v., that is, takes values in some countable set S, then

    EX = ∑_{s∈S} s P(X = s)

when the right-hand side is well-defined, i.e., when at least one of the numbers

    E(X_−) = ∑_{s∈S, s<0} (−s) P(X = s),    E(X_+) = ∑_{s∈S, s>0} s P(X = s)

is finite. It follows that for any function g : R → R, we also have

    E(g(X)) = ∑_{s∈S} g(s) P(X = s).
Proof. If S is finite then X is a simple function, and can be written X = ∑_{s∈S} s 1_{X=s}, so this follows from the definition of E(·) for simple functions. If S is infinite this follows (check!) from the convergence theorems in the previous section by considering approximations to X of the form ∑_{s∈S, |s|<M} s 1_{X=s}.
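Lemma 7.13 is how expectations of discrete distributions are computed in practice: sum g(s) P(X = s) over the support. A quick sketch with exact rational arithmetic (the fair die and the helper name `expect` are our choices):

```python
from fractions import Fraction

pmf = {s: Fraction(1, 6) for s in range(1, 7)}   # fair die: P(X = s) = 1/6

def expect(pmf, g=lambda s: s):
    # E(g(X)) = sum over s of g(s) * P(X = s), per Lemma 7.13.
    return sum(g(s) * p for s, p in pmf.items())

print(expect(pmf))                   # EX = 7/2
print(expect(pmf, lambda s: s**2))   # E(X^2) = 91/6
```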
Lemma 7.14. If X is a r.v. with a density function f_X, then

    E(X) = ∫_{−∞}^{∞} x f_X(x) dx

when the right-hand side is well-defined, i.e., when at least one of the numbers

    E(X_−) = −∫_{−∞}^{0} x f_X(x) dx,    E(X_+) = ∫_{0}^{∞} x f_X(x) dx

is finite. Similarly, for any reasonable function g we have

    E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx.
Proof. Fix ε > 0, and approximate X by a discrete r.v. Y, e.g.,

    Y = ∑_{k=−∞}^{∞} kε 1_{kε<X≤(k+1)ε}.

Then |E(X) − E(Y)| ≤ E|X − Y| ≤ ε. By the previous lemma we have

    E(Y) = ∑_{k=−∞}^{∞} kε P(kε < X ≤ (k+1)ε) = ∑_{k=−∞}^{∞} kε ∫_{kε}^{(k+1)ε} f_X(x) dx,

so the result for EX follows by letting ε → 0. For general functions g(X) repeat this argument, and invoke the relevant convergence theorem to deduce that E(g(Y)) → E(g(X)) as ε → 0.
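Lemma 7.14 can be checked numerically by comparing a Riemann sum for ∫ g(x) f(x) dx against a known mean. A sketch for the exponential density of Example 3.16 (the truncation point, step count, and midpoint rule are our choices):

```python
import math

def f(x):
    # The exponential density of Example 3.16: f(x) = e^{-x} for x >= 0.
    return math.exp(-x) if x >= 0 else 0.0

def riemann_expectation(g, hi=40.0, steps=100_000):
    # Midpoint Riemann sum approximating E(g(X)) = integral of g(x) f(x) dx,
    # truncated at x = hi (the exponential tail beyond 40 is negligible).
    dx = hi / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += g(x) * f(x) * dx
    return total

print(riemann_expectation(lambda x: x))       # E(X) = 1 for the exponential
print(riemann_expectation(lambda x: x * x))   # E(X^2) = 2
```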
7.5 Expectation and independent random variables
Theorem 7.15. (i) If X, Y are independent r.v.'s then E(XY) = EX · EY.
(ii) X, Y are independent if and only if E[g(X)h(Y)] = Eg(X) Eh(Y) for all bounded measurable functions g, h : R → R.

Sketch of proof. Part (i) follows either by approximation of X, Y using simple functions, or using Fubini's theorem, which you can read about in Section 1.7 of [Dur2010] (note that Fubini's theorem in turn is proved by approximation using simple functions, so these two seemingly different approaches are really equivalent).
For part (ii), the "only if" follows from part (i) together with the observation that if X, Y are independent then so are g(X), h(Y). For the "if" part, observe that the function 1_{(a,b)} is a bounded measurable function, so in particular the condition E[g(X)h(Y)] = Eg(X)Eh(Y) includes the information that P(X ∈ I, Y ∈ J) = P(X ∈ I)P(Y ∈ J) for any two finite intervals I, J, which we already know is enough to imply independence.
Theorem 7.16. (i) If X_1, X_2, ..., X_n are independent r.v.'s then E(X_1 · ... · X_n) = ∏_{k=1}^{n} EX_k.
(ii) X_1, ..., X_n are independent if and only if E( ∏_{k=1}^{n} g_k(X_k) ) = ∏_{k=1}^{n} E g_k(X_k) for all bounded measurable functions g_1, ..., g_n : R → R.
The fact that expectation is multiplicative for independent random variables implies an important fact about the variance of a sum of independent r.v.'s. Let X, Y be independent r.v.'s with finite variance. Then we get immediately that

    V(X + Y) = E[(X − EX) + (Y − EY)]²
             = E(X − EX)² + E(Y − EY)² + 2E[(X − EX)(Y − EY)]
             = V(X) + V(Y) + 0 = V(X) + V(Y).

More generally, if X, Y are not necessarily independent, then we can define the covariance of X and Y by

    Cov(X, Y) = E[(X − EX)(Y − EY)].

We then get the more general formula

    V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y).

Repeating this computation with a sum of n variables instead of just two, we get the following formula for the variance of a sum of r.v.'s.
Lemma 7.17. If X_1, ..., X_n are r.v.'s with finite variance, then

    V( ∑_{k=1}^{n} X_k ) = ∑_{k=1}^{n} V(X_k) + 2 ∑_{1≤i<j≤n} Cov(X_i, X_j).
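The two-variable identity V(X + Y) = V(X) + V(Y) + 2 Cov(X, Y) holds exactly for the moments of any distribution, including the empirical distribution of a data set, so it can be verified on a sample. A sketch (the particular correlated sample, built with `random`, is our choice; we use population, divide-by-n, moments so the identity is exact up to rounding):

```python
import random

rng = random.Random(42)
X = [rng.random() for _ in range(1000)]
Y = [x + rng.gauss(0, 0.1) for x in X]   # deliberately correlated with X

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Population covariance of the empirical distribution: E(UV) - E(U)E(V).
    return mean([a * b for a, b in zip(u, v)]) - mean(u) * mean(v)

S = [a + b for a, b in zip(X, Y)]
lhs = cov(S, S)                                  # V(X + Y)
rhs = cov(X, X) + cov(Y, Y) + 2 * cov(X, Y)      # V(X) + V(Y) + 2 Cov(X, Y)
print(lhs, rhs)   # agree up to floating-point rounding
```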
Lemma 7.18 (Properties of the covariance). If X, Y are r.v.'s with finite variance, then:
1. Cov(X, Y) = Cov(Y, X) = E(XY) − E(X)E(Y).
2. Cov(X, X) = V(X).
3. Cov(aX_1 + bX_2, Y) = a Cov(X_1, Y) + b Cov(X_2, Y).
4. Cov(X, aY_1 + bY_2) = a Cov(X, Y_1) + b Cov(X, Y_2).
5. If X, Y are independent then Cov(X, Y) = 0.
6. |Cov(X, Y)| ≤ σ(X)σ(Y), with equality if and only if X and Y are linearly dependent.

Proof. Properties 1-5 are obvious. Property 6 follows by applying the Cauchy-Schwarz inequality to the r.v.'s X − EX and Y − EY.
If Cov(X, Y) = 0 we say that X and Y are uncorrelated, or orthogonal. This is a weaker condition than being independent, but because of the way the variance of a sum of r.v.'s behaves, it is still often useful for deriving bounds, as we shall see.
Define the correlation coefficient of X and Y by

    ρ(X, Y) = Cov(X, Y) / (σ(X)σ(Y)).

This measures the correlation in units of the standard deviations of X and Y, so it does not depend on the choice of scale. From property 6 in the above lemma, we get that

    −1 ≤ ρ(X, Y) ≤ 1,

with equality on either side if and only if X and Y are linearly dependent.
7.6 Moments
For an integer k ≥ 0, the k-th moment of a random variable X is the number E(X^k). The k-th moment around a point c ∈ R is the number E(X − c)^k. If c is not mentioned it is understood that the moment is around 0. The k-th central moment is the k-th moment around EX (when it exists), i.e., E(X − EX)^k. In this terminology, the variance is the second central moment.
The sequence of moments (usually around 0, or around EX) often contains important information about the behavior of X and is an important computational and theoretical tool. Important special distributions often turn out to have interesting sequences of moments. Also note that by the monotonicity of the L^p norms (inequality (6) in Section 7.2), the set of values r ≥ 0 such that E(|X|^r) exists (one can also talk about r-th moments for non-integer r, but that is much less commonly discussed) is an interval containing 0.
A nice characterization of the variance is that it is the minimal second moment. To compute the second moment around a point t, we can write

    E(X − t)² = E[(X − EX) − (t − EX)]²
              = E(X − EX)² + E(t − EX)² − 2(t − EX)E(X − EX)
              = V(X) + (t − EX)² ≥ V(X).

So the function t ↦ E(X − t)² is a quadratic polynomial that attains its minimum at t = EX, and the value of the minimum is V(X). In words, the identity

    E(X − t)² = V(X) + (t − EX)²

says that the second moment around t is equal to the second moment around the mean EX plus the square of the distance between t and the mean. Note that this is analogous to (and mathematically equivalent to) the Huygens-Steiner theorem (also called the Parallel Axis theorem; see Wikipedia) from mechanics, which says that the moment of inertia of a body with unit mass around a given axis L is equal to the moment of inertia around the line parallel to L passing through the center of mass of the body, plus the square of the distance between the two lines. Indeed, the "moment" terminology seems to originate in this physical context.
Chapter 8: Laws of large numbers
Let X_1, X_2, X_3, ... be a sequence of independent and identically distributed random variables. A common abbreviation for "independent and identically distributed" is i.i.d. What this can mean is that we are taking repeated independent samples from some distribution, for example when doing a poll or an experiment in quantum mechanics (when doing a poll, it might be better to have the samples not be independent; after all, does it really make sense to call up the same person twice? But if the population is large the effect of having the samples be independent is negligible, and it can make the analysis of the results a bit simpler).
Assume that EX_1, the mean of X_1 (therefore of every X_n), is defined and finite, and denote it by μ. In a real-life situation, we might not know what the mean is, and wish to estimate it. So we look at the sum of the first n samples,

    S_n = ∑_{k=1}^{n} X_k,

and use it to form the empirical average,

    X̄_n = S_n / n = (1/n) ∑_{k=1}^{n} X_k.

The natural question is whether we can expect the empirical average to be close to the true mean μ = EX_1 as n → ∞, and in what sense of "close". A theorem or statement to that effect, making some assumptions, is called a law of large numbers. This can be generalized to a large extent to cases when the X_n's are not identically distributed, or not independent, or both, etc. Thus, the Law of Large Numbers is not a single theorem in the normal sense of the word but rather a class of theorems, or even a "principle" or "meta-theorem" that you might hear a probabilist refer to in a somewhat vague, metaphorical way.
8.1 Weak laws of large numbers
Theorem 8.1 (Weak Law of Large Numbers (WLLN)). If E[X
1
[ < then
S
n
n
P

n
.
44
Proof in the case of finite variance. In the most general case, proving this requires some work; we shall deduce it from its stronger cousin, the Strong Law of Large Numbers. However, if we assume that \sigma^2 = V(X_1) < \infty, the proof is extremely easy! It suffices to note that in this case V(S_n) = n\sigma^2 (this is true even when the X_n's are not independent but only uncorrelated), so V(S_n/n) = \sigma^2/n. From Chebyshev's inequality we then have that for any \epsilon > 0,

  P(|n^{-1} S_n - \mu| > \epsilon) \le \frac{\sigma^2}{n \epsilon^2} \xrightarrow[n\to\infty]{} 0.    (7)
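As a quick numerical illustration of (7) (an added sketch, not part of the original notes), the code below estimates P(|S_n/n - \mu| > \epsilon) by Monte Carlo for Exp(1) samples, where \mu = \sigma^2 = 1, and compares it with the Chebyshev bound \sigma^2/(n\epsilon^2); the distribution, sample sizes, and \epsilon are arbitrary choices.

```python
import random

def deviation_prob(n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|S_n/n - mu| > eps) for Exp(1) samples (mu = sigma^2 = 1)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        s = sum(rng.expovariate(1.0) for _ in range(n))
        if abs(s / n - 1.0) > eps:
            count += 1
    return count / trials

for n in [10, 100, 1000]:
    p_hat = deviation_prob(n, eps=0.2)
    bound = min(1.0, 1.0 / (n * 0.2 ** 2))  # sigma^2 / (n * eps^2), capped at 1
    print(f"n={n:5d}  estimated prob = {p_hat:.3f}  Chebyshev bound = {bound:.3f}")
```

The estimated probability decays much faster than the bound; Chebyshev is crude, but it is all the proof needs.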
8.2 Strong laws of large numbers

Our goal in this section will be to prove:

Theorem 8.2 (Strong Law of Large Numbers (SLLN)). If E|X_1| < \infty then

  \frac{S_n}{n} \xrightarrow{a.s.} \mu \quad \text{as } n \to \infty.

As in the case of the weak law, it turns out that this is easier to prove when making more restrictive assumptions about the distribution of X_1, and specifically about the existence of moments. So we will prove it several times, successively weakening our assumptions until we get to the most general (but least easy to prove) result.
Proof in the case of a finite fourth moment. To prove that S_n/n \to \mu a.s., we want to prove that P(|S_n/n - \mu| > \epsilon \text{ i.o.}) = 0 for every \epsilon > 0. Looking at the bound (7), we see that if the bound \sigma^2/(n\epsilon^2) were to form the n-th general term of a convergent series, then by the Borel-Cantelli lemma this would be enough to get the desired consequence. This is not the case, but if we assume that X_1 has a finite fourth moment, then we can do a similar trick that will give a convergent series. In that case, using Markov's inequality we get that

  P(|S_n/n - \mu| > \epsilon) = P\left( \Big( \sum_{k=1}^n (X_k - \mu) \Big)^4 > n^4 \epsilon^4 \right) \le \frac{E(S_n - n\mu)^4}{n^4 \epsilon^4}.    (8)
Denote \tilde{X}_k = X_k - \mu, and T_n = S_n - n\mu = \sum_{k=1}^n \tilde{X}_k. To bound E(T_n^4), note that we can write

  T_n^4 = \sum_{k=1}^n \tilde{X}_k^4 + \binom{4}{1} \sum_{1 \le i \ne j \le n} \tilde{X}_i^3 \tilde{X}_j + \frac{1}{2} \binom{4}{2} \sum_{1 \le i \ne j \le n} \tilde{X}_i^2 \tilde{X}_j^2
        + \frac{1}{2} \binom{4}{2,1,1} \sum_{1 \le i,j,k \le n \text{ distinct}} \tilde{X}_i^2 \tilde{X}_j \tilde{X}_k + \frac{1}{4!} \binom{4}{1,1,1,1} \sum_{1 \le i,j,k,\ell \le n \text{ distinct}} \tilde{X}_i \tilde{X}_j \tilde{X}_k \tilde{X}_\ell

(where \binom{4}{2,1,1} = 4!/(2!\,1!\,1!) and \binom{4}{1,1,1,1} = 4!/(1!\,1!\,1!\,1!) are multinomial coefficients). Now take the expectations on both sides, and use the fact that E(\tilde{X}_k) = 0 and that the \tilde{X}_k's are independent, to get

  E(T_n^4) = n m_4 + 3 n(n-1) m_2^2,

where we denote m_2 = E(\tilde{X}_k^2) = V(X_k) and m_4 = E(\tilde{X}_k^4) < \infty. This gives us the bound we wanted! In fact, all that matters is that E(T_n^4) \le C n^2 for some constant C > 0. Combining this with (8), we get that

  P(|S_n/n - \mu| > \epsilon) \le \frac{C}{n^2 \epsilon^4}.

In particular, we have that \sum_{n=1}^\infty P(|S_n/n - \mu| > \epsilon) < \infty, so by the Borel-Cantelli lemma, the probability of the event

  A_\epsilon := \{ |S_n/n - \mu| > \epsilon \text{ i.o.} \}

is 0. This is true for all \epsilon > 0, therefore the probability of

  \{ S_n/n \nrightarrow \mu \} = \bigcup_{\epsilon > 0} A_\epsilon = \bigcup_{n=1}^\infty A_{1/n}

is also 0, since it is contained in a countable union of events of probability 0.
Proof in the case of finite variance. We are slowly refining our techniques to weaken the assumptions required to prove the SLLN. The following nice proof manages to harness the Chebyshev variance bound (7) after all to deduce the theorem in the case of r.v.'s with finite variance. Observe that while the series of terms on the right-hand side of (7) diverges, if we restrict it to the subsequence n_k = k^2, it becomes a convergent series:

  \sum_{k=1}^\infty \frac{\sigma^2}{k^2 \epsilon^2} < \infty.

This implies, again by the Borel-Cantelli lemma, that

  P( |S_{n_k}/n_k - \mu| > \epsilon \text{ i.o.} ) = 0

for all \epsilon > 0 (the "i.o." here refers to the running index k, not n). This implies as before that

  P\left( \frac{S_{n_k}}{n_k} \xrightarrow[k\to\infty]{} \mu \right) = 1.

So, while we have not shown almost sure convergence of the empirical averages to the true mean for all values of n, at least we have done so for the subsequence n_k = k^2. But this subsequence is relatively dense. In particular, if we show that in between elements of the subsequence n_k = k^2, the empirical average cannot fluctuate too wildly, then the theorem would follow. This will again follow from a combination of variance-based (i.e., Chebyshev) bounds and the Borel-Cantelli lemma. Fix some k, and take an integer n satisfying n_k \le n < n_{k+1}. How likely is the n-th empirical average to deviate significantly from the n_k-th empirical average? Using the notation T_n = S_n - n\mu as before, we have

  P\left( \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \epsilon \right)
    = P\left( \left| \frac{n_k T_n - n T_{n_k}}{n\, n_k} \right| > \epsilon \right)
    \le P\left( \left| \frac{T_n - T_{n_k}}{n_k} \right| > \epsilon/2 \right) + P\left( |T_n| \frac{n - n_k}{n\, n_k} > \epsilon/2 \right)    (by the triangle inequality)
    \le \frac{4 V(T_n - T_{n_k})}{\epsilon^2 n_k^2} + \frac{4 V(T_n)(n - n_k)^2}{\epsilon^2 n^2 n_k^2}    (by Chebyshev's inequality)
    \le \frac{10 k \sigma^2}{\epsilon^2 k^4} + \frac{20 k^4 \sigma^2}{\epsilon^2 k^8} < \frac{20 \sigma^2}{\epsilon^2 k^3},

where we have denoted \sigma^2 = V(X_1).
The estimate that we obtained is valid for a single n. We are actually interested in the maximal fluctuation of the empirical averages S_n/n from S_{n_k}/n_k, when n ranges in the interval [n_k, n_{k+1}). By a simple subadditivity argument, usually referred to as a "union bound" (i.e., bounding the probability of a union of events by the sum of the probabilities, which is the most naive bound you can imagine), we easily get that

  P\left( \max_{n_k \le n < n_{k+1}} \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \epsilon \right)
    = P\left( \bigcup_{n_k \le n < n_{k+1}} \left\{ \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \epsilon \right\} \right)
    \le \sum_{n=n_k}^{n_{k+1}-1} P\left( \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \epsilon \right)
    < \frac{20 \sigma^2 \cdot 2k}{\epsilon^2 k^3} = \frac{40 \sigma^2}{\epsilon^2 k^2}.

Once again, we have obtained as our bound the general term of a convergent series! Denoting

  A_{k,\epsilon} = \left\{ \max_{n_k \le n < n_{k+1}} \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \epsilon \right\},
by the Borel-Cantelli lemma this implies that for any \epsilon > 0, P(A_{k,\epsilon} \text{ i.o.}) = 0 (with the "i.o." qualifier again referring to the running index k). What about the chance that this will happen for some \epsilon > 0? After all, there are so many \epsilon's out there... But no, we also get that

  P\left( \bigcup_{\epsilon > 0} \{ A_{k,\epsilon} \text{ i.o.} \} \right) = P\left( \bigcup_{m=1}^\infty \{ A_{k,1/m} \text{ i.o. (w.r.t. } k) \} \right) = 0,

by subadditivity for countable unions and the fact that A_{k,\epsilon} \subseteq A_{k,\epsilon'} if \epsilon > \epsilon'.

Finally, combining our two main results, namely that the two events

  E_1 = \left\{ \frac{S_{n_k}}{n_k} \to \mu \right\},
  E_2 = \left( \bigcup_{m=1}^\infty \left\{ \max_{n_k \le n < n_{k+1}} \left| \frac{T_n}{n} - \frac{T_{n_k}}{n_k} \right| > \frac{1}{m} \text{ for infinitely many } k\text{'s} \right\} \right)^c

both have probability 1, we get that their intersection E_1 \cap E_2 also has probability 1. But we have the event inclusion

  E_1 \cap E_2 \subseteq \left\{ \frac{S_n}{n} \xrightarrow[n\to\infty]{} \mu \right\}

(if the conditions in both the events E_1, E_2 occur, then S_n/n must converge to \mu), so this latter event also has probability 1.

The above proof was a little involved, but is based on a few simple ideas: 1. Use Chebyshev bounds together with the Borel-Cantelli lemma. 2. Break up the event \{S_n/n \nrightarrow \mu\} in a clever way into events which can be managed using this technique, namely (in this case) the weaker convergence along the subsequence n_k = k^2, and separately the control of the fluctuations of the empirical averages in each range [n_k, n_{k+1}) between successive elements of the subsequence.

An extra benefit of the above proof is that it doesn't use the full power of the assumption that the X_n's are independent. In fact, everything is based on variance computations of sums of the X_k's, so the proof works equally well for uncorrelated random variables!

Theorem 8.3 (SLLN for uncorrelated r.v.'s with finite variance). If X_1, X_2, \ldots is a sequence of uncorrelated and identically distributed r.v.'s with finite variance, then

  \frac{1}{n} \sum_{k=1}^n X_k \xrightarrow[n\to\infty]{a.s.} EX_1.
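As an illustration of the strong law (an added sketch, not part of the original notes), the code below follows one sample path of the running averages S_k/k for i.i.d. Uniform(0,1) samples; by the SLLN the path should settle near the mean 1/2. The seed and sample size are arbitrary.

```python
import random

def running_averages(n, seed=0):
    """Return the sequence S_k / k, k = 1..n, along one path of i.i.d. Uniform(0,1) samples."""
    rng = random.Random(seed)
    avgs, s = [], 0.0
    for k in range(1, n + 1):
        s += rng.random()
        avgs.append(s / k)
    return avgs

avgs = running_averages(100_000)
for k in [10, 1000, 100_000]:
    print(f"k = {k:6d}   S_k/k = {avgs[k - 1]:.4f}")
```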
Finally, we turn to the proof of the full SLLN for an i.i.d. sequence (Theorem 8.2), assuming only a finite first moment. Etemadi's 1981 proof presented in [Dur2010] is based on a similar idea of proving convergence first for a subsequence, and introduces another useful technique, that of truncation.

Proof of Theorem 8.2. First, observe that it is enough to prove the theorem in the case where X_1 \ge 0, since in the general case we may decompose each X_n into its positive and negative parts, and this gives two i.i.d. sequences of nonnegative r.v.'s. The validity of the SLLN for each of the two sequences implies the result for their difference.

Second, for each n \ge 1 let Y_n = X_n 1_{\{X_n \le n\}} (X_n truncated at n), and denote T_n = Y_1 + Y_2 + \ldots + Y_n. Since EX_1 < \infty, by a homework exercise we know that \sum_{n=1}^\infty P(X_n > n) < \infty, which by the Borel-Cantelli lemma implies that P(X_n > n \text{ i.o.}) = P(X_n \ne Y_n \text{ i.o.}) = 0. It follows that the event

  \left\{ \sup_n |S_n - T_n| = \infty \right\}

has probability 0, and therefore the even smaller event

  \left\{ \limsup_{n\to\infty} \left| \frac{S_n}{n} - \frac{T_n}{n} \right| > 0 \right\} = \left\{ \limsup_{n\to\infty} \left| \left( \frac{S_n}{n} - \mu \right) - \frac{T_n - E(T_n)}{n} \right| > 0 \right\}

has probability 0 (the equivalence between these last two events follows from the observation that E(Y_n) \to \mu by the monotone/dominated convergence theorems, and therefore also E(T_n)/n \to \mu as n \to \infty, since (E(T_n)/n)_{n=1}^\infty is the sequence of arithmetic averages of the E(Y_n)'s). So we have shown that to prove the theorem, it is enough to prove that

  P\left( \frac{T_n - E T_n}{n} \to 0 \right) = 1.
Now, to prove this, as before we establish a.s. convergence first along a subsequence (n_k), this time taking n_k = \lfloor \alpha^k \rfloor where \alpha > 1. (Here, \lfloor x \rfloor denotes the floor function, namely the largest integer \le x.) This is done by again combining the Borel-Cantelli lemma with a Chebyshev variance bound, except that showing that the sum of the bounds gives a convergent series requires more work than before. We have

  P\left( \left| \frac{T_{n_k}}{n_k} - \frac{E(T_{n_k})}{n_k} \right| > \epsilon \right) \le \frac{V(T_{n_k})}{\epsilon^2 n_k^2} = \frac{1}{\epsilon^2 n_k^2} \sum_{m=1}^{n_k} V(Y_m).

Therefore

  \sum_{k=1}^\infty P\left( \left| \frac{T_{n_k}}{n_k} - \frac{E(T_{n_k})}{n_k} \right| > \epsilon \right)
    \le \sum_{k=1}^\infty \frac{1}{\epsilon^2 n_k^2} \sum_{m=1}^{n_k} V(Y_m)
    = \frac{1}{\epsilon^2} \sum_{m=1}^\infty V(Y_m) \sum_{k \,:\, n_k \ge m} \frac{1}{n_k^2}
    \le \frac{1}{\epsilon^2} \sum_{m=1}^\infty V(Y_m) \cdot \frac{1}{m^2} (1 + \alpha^{-2} + \alpha^{-4} + \alpha^{-6} + \ldots)
    = \frac{1}{\epsilon^2 (1 - \alpha^{-2})} \sum_{m=1}^\infty \frac{V(Y_m)}{m^2}    (9)

(in the middle inequality we used that the successive n_k's with n_k \ge m grow roughly by factors of \alpha, the smallest of them being \ge m; the rounding caused by the floor function only affects harmless constants).
Lemma 8.4. We have

  \sum_{m=1}^\infty \frac{V(Y_m)}{m^2} \le 4\, E X_1 < \infty.

Exercise 8.5. Prove this. You can use the formula E(Z) = \int_0^\infty P(Z > x)\, dx that's valid for any nonnegative r.v. Z.
It follows that the infinite sum on the left-hand side of (9) converges, and therefore by the Borel-Cantelli lemma, we have that

  P\left( \left| \frac{T_{n_k}}{n_k} - \frac{E(T_{n_k})}{n_k} \right| > \epsilon \text{ infinitely often (w.r.t. } k\text{)} \right) = 0.

Since this is true for all \epsilon > 0, using the standard trick we get that

  P\left( \frac{T_{n_k}}{n_k} - \frac{E(T_{n_k})}{n_k} \xrightarrow[k\to\infty]{} 0 \right) = P\left( \frac{T_{n_k}}{n_k} \to \mu \right) = 1.

The last step is now to show that convergence along this subsequence forces (T_n - ET_n)/n to behave well between successive n_k's. Observe that since the Y_m's are nonnegative, T_n is nondecreasing in n, so if n_k \le n < n_{k+1} then

  \frac{T_{n_k}}{n_{k+1}} \le \frac{T_n}{n} \le \frac{T_{n_{k+1}}}{n_k}.

Since n_k = \lfloor \alpha^k \rfloor, in the limit this gives that almost surely the bounds

  \frac{1}{\alpha}\, \mu \le \liminf_{n\to\infty} \frac{T_n}{n} \le \limsup_{n\to\infty} \frac{T_n}{n} \le \alpha\, \mu

hold. Since \alpha > 1 was arbitrary, the intersection of these events for \alpha = 1 + 1/d, d = 1, 2, \ldots, implies that

  P\left( \frac{T_n}{n} \to \mu \right) = 1,

which finishes the proof.
Solution to Exercise 8.5.

  \sum_{m=1}^\infty \frac{V(Y_m)}{m^2}
    \le \sum_{m=1}^\infty \frac{1}{m^2} E(Y_m^2)
    = \sum_{m=1}^\infty \frac{1}{m^2} E\left( X_m^2 1_{\{X_m \le m\}} \right)
    = \sum_{m=1}^\infty \frac{1}{m^2} \int_0^\infty P\left( X_m^2 1_{\{X_m \le m\}} > t \right) dt
    = \sum_{m=1}^\infty \frac{1}{m^2} \int_0^\infty P\left( \sqrt{t} \le X_m \le m \right) dt
    = \sum_{m=1}^\infty \frac{1}{m^2} \int_0^{m^2} P\left( \sqrt{t} \le X_1 \le m \right) dt
    \le \sum_{m=1}^\infty \frac{1}{m^2} \int_0^{m^2} P\left( X_1 \ge \sqrt{t} \right) dt
    = \int_0^\infty \left( \sum_{m \ge \sqrt{t}} \frac{1}{m^2} \right) P\left( X_1 \ge \sqrt{t} \right) dt
    \le \int_0^\infty \frac{2}{\sqrt{t}}\, P\left( X_1 \ge \sqrt{t} \right) dt
    = 4 \int_0^\infty P(X_1 \ge u)\, du = 4\, E(X_1).
Chapter 9: Applications and further examples

9.1 The Weierstrass approximation theorem

As an application of the WLLN (or, rather, of Chebyshev's inequality), we prove the following theorem in analysis, which seems to have no connection to probability whatsoever.

Theorem 9.1 (The Weierstrass approximation theorem). If f : [0, 1] \to \mathbb{R} is a continuous function, then f can be uniformly approximated by polynomials. That is, for any \epsilon > 0 there exists a polynomial p such that \|f - p\| := \max_{x \in [0,1]} |f(x) - p(x)| < \epsilon.

Proof. Let f : [0, 1] \to \mathbb{R} be a continuous function. Define the sequence of Bernstein polynomials of f by

  B_n(x) = B_n^f(x) = \sum_{k=0}^n f\left( \frac{k}{n} \right) \binom{n}{k} x^k (1 - x)^{n-k}.

We claim that \|B_n - f\| \to 0 as n \to \infty, which will prove the theorem. We will prove this by giving a probabilistic interpretation to B_n. Indeed, fix x \in [0, 1], and let X_1, X_2, \ldots be a sequence of i.i.d. r.v.'s with the Bernoulli distribution Binom(1, x). Denote S_n = \sum_{k=1}^n X_k. Then it is easy to see that

  B_n(x) = E_x\left[ f\left( \frac{S_n}{n} \right) \right],

where the notation E_x just means taking the expectation, while at the same time highlighting the fact that the distribution of S_n depends on the parameter x.

Now the idea is that since the law of large numbers implies that S_n/n is with high probability close to its mean x, and f is a continuous function, then f(S_n/n) should with high probability be close to f(x), and therefore also the average value of f(S_n/n), namely B_n(x), should be close to f(x), which is what we want. However, we want to make this claim uniformly in x, so instead of invoking the WLLN (Theorem 8.1) we have to go back and look "under the hood" at Chebyshev's inequality which we used to prove it. We estimate |B_n(x) - f(x)| as follows. Fix some arbitrary \epsilon > 0. Let \delta > 0 be such that for any u, v \in [0, 1], if |u - v| < \delta then |f(u) - f(v)| < \epsilon (this is guaranteed to exist because f is
uniformly continuous on [0, 1]). Then

  |B_n(x) - f(x)| = \left| E_x f\left( \frac{S_n}{n} \right) - f(x) \right| \le E_x \left| f\left( \frac{S_n}{n} \right) - f(x) \right|
    = E_x\left[ \left| f\left( \frac{S_n}{n} \right) - f(x) \right| 1_{\{ |S_n/n - x| > \delta \}} \right] + E_x\left[ \left| f\left( \frac{S_n}{n} \right) - f(x) \right| 1_{\{ |S_n/n - x| \le \delta \}} \right].

In this last expression, each of the two expectations is small for a different reason. The second expectation is bounded by \epsilon, since on the event that |S_n/n - x| \le \delta, we have that |f(S_n/n) - f(x)| < \epsilon. To bound the first expectation, denote M = \|f\| := \max_{x \in [0,1]} |f(x)| < \infty. Then, by bounding the difference of f-values by 2M and then using Chebyshev's inequality, we get

  E_x\left[ \left| f\left( \frac{S_n}{n} \right) - f(x) \right| 1_{\{ |S_n/n - x| > \delta \}} \right] \le 2M \cdot P_x\left( \left| \frac{S_n}{n} - x \right| > \delta \right) \le \frac{2M\, \sigma^2(X_1)}{n \delta^2} = \frac{2M\, x(1-x)}{n \delta^2}.

This bound converges to 0, not just for a single x but (fortunately for us) uniformly in x \in [0, 1], since it is bounded from above by M/(2n\delta^2) (because x(1-x) \le 1/4). So we have shown that

  \|B_n - f\| = \max_{x \in [0,1]} |B_n(x) - f(x)| \le \epsilon + \frac{M}{2 n \delta^2}.

It follows that \limsup_{n\to\infty} \|B_n - f\| \le \epsilon, and since \epsilon was an arbitrary positive number the result follows.
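The Bernstein construction is easy to experiment with numerically; the sketch below (an added illustration, not part of the original notes) evaluates B_n^f for f(x) = |x - 1/2| and prints the maximum error over a grid, which shrinks as n grows.

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial of f at x in [0, 1]."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 200 for i in range(201)]
for n in [10, 100, 1000]:
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(f"n = {n:5d}   max |B_n - f| on grid = {err:.4f}")
```

For this f, which has a corner at 1/2, the error decays only like 1/sqrt(n), matching the Chebyshev-based estimate in the proof; for twice continuously differentiable f the rate improves to O(1/n).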
9.2 Infinite expectations and triangular arrays

After treating the classical case of an i.i.d. sequence with finite expectations, let's turn to slightly more exotic situations. First, what happens if the expectation is infinite? For the strong law, the following result shows that we have no hope of having convergence in a meaningful sense.

Theorem 9.2 (Converse to SLLN). If X_1, X_2, \ldots is an i.i.d. sequence of r.v.'s with E|X_1| = \infty, then

  P\left( \lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n X_k \text{ exists and is finite} \right) = 0.

Proof. See homework.
What about the weak law? It turns out that a weak law can still hold in certain situations with infinite expectations, although one has to consider a generalized empirical average where the sum of the first n samples is divided by a quantity growing faster than n. Rather than develop a complete theory, we will consider the particular example of the St. Petersburg Lottery. In this example, the winning in a single round of the lottery is an integer-valued random variable with the following distribution:

  P(X = 2^k) = 2^{-k}, \qquad k = 1, 2, 3, \ldots

Let X_1, X_2, \ldots be an i.i.d. sequence with the same distribution, and let S_n = \sum_{k=1}^n X_k. How much should you agree to pay to be allowed to play this lottery n times? (Even the seemingly simple case n = 1 of this question has been a subject of quite some debate by economists! See https://en.wikipedia.org/wiki/St._Petersburg_paradox). In other words, how big should we expect S_n to be, with probability close to 1?

Theorem 9.3.

  \frac{S_n}{n \log_2 n} \xrightarrow[n\to\infty]{P} 1.

In other words, to be allowed to play the game n times when n is large (and assuming the payoff is in dollars), it may be considered reasonable to pay exactly (or even better, slightly less than) \log_2 n dollars per round played. For example, if n = 1024 you would be paying $10 per round.
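A numerical illustration (added here; not part of the original notes): a single round with P(X = 2^k) = 2^{-k} can be sampled by flipping a fair coin until the first tails, and S_n can then be compared with n log_2 n.

```python
import math
import random

def st_petersburg_round(rng):
    """Sample X with P(X = 2^k) = 2^(-k), k >= 1: count fair coin flips until tails."""
    k = 1
    while rng.random() < 0.5:
        k += 1
    return 2 ** k

rng = random.Random(1)
for n in [1000, 100_000]:
    s = sum(st_petersburg_round(rng) for _ in range(n))
    print(f"n = {n:6d}   S_n / (n log2 n) = {s / (n * math.log2(n)):.3f}")
```

The ratio hovers around 1 but converges only in probability, and quite slowly; occasional huge payoffs make individual runs noisy.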
Proof. The proof uses a truncation idea similar to the one we saw in the proof of the SLLN, except that for each n we will truncate the first n variables at a level which is a function of n. Denote b_n = n \log_2 n, Y_{n,k} = X_k 1_{\{X_k < b_n\}}, T_n = \sum_{k=1}^n Y_{n,k}, and a_n = E(T_n). We will prove that

  \frac{S_n - a_n}{b_n} \xrightarrow[n\to\infty]{P} 0.    (10)

First we check that this is enough, by estimating a_n:

  a_n = \sum_{k=1}^n E(Y_{n,k}) = \sum_{k=1}^n E(X_k 1_{\{X_k < b_n\}}) = \sum_{k=1}^n \left( \tfrac{1}{2} \cdot 2 + \tfrac{1}{4} \cdot 4 + \tfrac{1}{8} \cdot 8 + \ldots + \tfrac{1}{2^{m_n}} \cdot 2^{m_n} \right),

where m_n is the largest integer such that 2^{m_n} \le b_n, or in other words m_n = \lfloor \log_2 n + \log_2 \log_2 n \rfloor, which gives

  a_n = \sum_{k=1}^n \lfloor \log_2 n + \log_2 \log_2 n \rfloor = n \log_2 n + O(n \log_2 \log_2 n),

so a_n indeed behaves like n \log_2 n up to first-order asymptotics. Now to prove (10), note that for any \epsilon > 0,

  P\left( \left| \frac{S_n - a_n}{b_n} \right| > \epsilon \right) \le P(T_n \ne S_n) + P\left( \left| \frac{T_n - a_n}{b_n} \right| > \epsilon \right).    (11)

In this bound, the first term is bounded by

  \sum_{k=1}^n P(Y_{n,k} \ne X_k) = \sum_{k=1}^n P(X_k \ge b_n) \le \sum_{k=1}^n \frac{2}{b_n} = \frac{2}{\log_2 n} \xrightarrow[n\to\infty]{} 0.

To bound the second term, use Chebyshev's inequality and the fact that V(Y_{n,k}) \le E(Y_{n,k}^2) to write

  P\left( \left| \frac{T_n - a_n}{b_n} \right| > \epsilon \right) \le \frac{V(T_n)}{\epsilon^2 b_n^2} \le \frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^n E(Y_{n,k}^2)
    = \frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^n \left( \tfrac{1}{2} \cdot 2^2 + \tfrac{1}{4} \cdot 4^2 + \tfrac{1}{8} \cdot 8^2 + \ldots + \tfrac{1}{2^{m_n}} \cdot 2^{2 m_n} \right)
    \le \frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^n 2 \cdot 2^{m_n} \le \frac{1}{\epsilon^2 b_n^2} \sum_{k=1}^n 2 b_n = \frac{2}{\epsilon^2 \log_2 n} \xrightarrow[n\to\infty]{} 0.

We conclude that the left-hand side of (11) converges to 0 as n \to \infty, which finishes the proof.
Another twist on laws of large numbers comes when we replace the notion of an i.i.d. sequence by the more general notion of a triangular array. In this case, for each n we have a sequence X_{n,1}, X_{n,2}, \ldots, X_{n,n} of independent, but not necessarily identically distributed, random variables, and we denote S_n = \sum_{k=1}^n X_{n,k}; this is the sum of the samples in the n-th "experiment". Here, for each n there could be a separate experiment involving n different r.v.'s, and the r.v.'s X_{n,k} and X_{m,j} for n \ne m are not even assumed to be defined on the same probability space, let alone to be independent of each other.

Again, instead of giving general conditions for a law of large numbers to hold, consider the following example of the so-called coupon collector's problem: A brand of breakfast cereals comes with a small toy chosen uniformly at random from a set of n possible kinds of toys. A collector will buy more boxes of cereals until she has collected all n different toys. Denote by T_n the number of boxes she ends up buying. What can we say about the size of T_n? Fortunately, we can represent it as a sum of independent r.v.'s, by writing

  T_n = X_{n,1} + X_{n,2} + X_{n,3} + \ldots + X_{n,n},
where

  X_{n,1} = number of boxes purchased to get one kind of toy (= 1),
  X_{n,2} = number of boxes purchased after having one toy to get a different kind,
  X_{n,3} = number of boxes purchased after having two kinds of toys to get a third kind,
  ...
  X_{n,n} = number of boxes purchased after having n - 1 kinds of toys to get the last kind.

Clearly these r.v.'s are independent. Furthermore, X_{n,k} is a geometric r.v. with parameter p_{n,k} = (n - k + 1)/n. This gives us that

  E(T_n) = \sum_{k=1}^n E(X_{n,k}) = \sum_{k=1}^n \frac{n}{n - k + 1} = n \left( \frac{1}{n} + \frac{1}{n-1} + \ldots + \frac{1}{2} + \frac{1}{1} \right) = n H_n,

where H_n = \sum_{k=1}^n 1/k is the n-th harmonic number, and

  V(T_n) = \sum_{k=1}^n V(X_{n,k}) = \sum_{k=1}^n \frac{k-1}{n} \left( \frac{n}{n - k + 1} \right)^2 \le n^2 \sum_{k=1}^n \frac{1}{k^2} \le 2 n^2

(in this example, we only need a bound for V(T_n), but it is possible also to get more precise asymptotics for this quantity). It follows using Chebyshev's inequality that for each \epsilon > 0,

  P\left( \left| \frac{T_n - n H_n}{n \log n} \right| > \epsilon \right) \le \frac{V(T_n)}{\epsilon^2 n^2 (\log n)^2} \le \frac{2}{\epsilon^2 (\log n)^2} \xrightarrow[n\to\infty]{} 0,

so (T_n - n H_n)/(n \log n) converges in probability to 0, and therefore we get that

  \frac{T_n}{n \log n} \xrightarrow[n\to\infty]{P} 1.

The lesson to be learned from the above examples is that some naturally-occurring problems in real life lead to more complicated situations than can be modeled with an i.i.d. sequence with finite mean; but often such problems can be analyzed anyway using the same ideas and techniques that we developed. In probability textbooks you can find a general treatment of various conditions under which a triangular array of independent random variables satisfies a (strong or weak) law of large numbers.
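The coupon collector asymptotics are easy to check by simulation (an added illustration, not part of the original notes): sample T_n directly and compare its average with n H_n and n log n.

```python
import math
import random

def collect_all(n, rng):
    """Number of boxes bought until all n toy types have appeared."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(0)
n = 500
avg = sum(collect_all(n, rng) for _ in range(200)) / 200
H_n = sum(1 / k for k in range(1, n + 1))
print(f"average T_n = {avg:.1f}   n*H_n = {n * H_n:.1f}   n*log(n) = {n * math.log(n):.1f}")
```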
9.3 Random series of independent samples

We now look at the related topic of infinite series of independent r.v.'s. When can such a series be said to converge? A key technical result that will help us answer this question in some cases is the following beautiful inequality due to Kolmogorov.

Theorem 9.4 (Kolmogorov's maximal inequality). Assume that X_1, X_2, \ldots, X_n are independent r.v.'s with finite variances, and let S_k = \sum_{j=1}^k X_j. Then

  P\left( \max_{1 \le k \le n} |S_k - E(S_k)| > t \right) \le \frac{V(S_n)}{t^2} = \frac{\sum_{k=1}^n \sigma^2(X_k)}{t^2}.

Before we start the proof, note that the bound on the right-hand side is the usual variance bound that follows from Chebyshev's inequality; except that the event on the left-hand side whose probability this quantity bounds is a much bigger event than the usual deviation event \{|S_n - E(S_n)| > t\} for which we know the Chebyshev bound holds!
Proof. We may assume without loss of generality that E(X_k) = 0 for all k. Denote

  A = \left\{ \max_{1 \le k \le n} |S_k| > t \right\},

and define events A_1, A_2, \ldots, A_n by

  A_k = \left\{ |S_k| > t, \ \max_{1 \le j < k} |S_j| \le t \right\}.

In words, A_k is the event that the sequence of cumulative sums (S_j)_{j=1}^n exceeded t in absolute value for the first time at time k. Note that these events are disjoint and their union is the event A whose probability we are trying to bound. As a consequence, we can lower-bound the variance V(S_n) = E(S_n^2) of S_n, as follows:

  E(S_n^2) \ge E(S_n^2 1_A) = \sum_{k=1}^n E(S_n^2 1_{A_k}) = \sum_{k=1}^n E\left[ (S_k + (S_n - S_k))^2 1_{A_k} \right]
    = \sum_{k=1}^n \left( E(S_k^2 1_{A_k}) + E\left[ (S_n - S_k)^2 1_{A_k} \right] + 2 E\left[ (S_k 1_{A_k})(S_n - S_k) \right] \right).

In this last expression, the terms E[(S_k 1_{A_k})(S_n - S_k)] are equal to 0, since S_k 1_{A_k} is a random variable that depends only on X_1, \ldots, X_k and hence is independent of S_n - S_k, which depends only on the values of X_{k+1}, \ldots, X_n; this causes the expectation of their product to be equal to the product of the expectations, which is 0. Furthermore, the middle terms E[(S_n - S_k)^2 1_{A_k}] are all nonnegative, and each of the first terms E(S_k^2 1_{A_k}) satisfies

  E(S_k^2 1_{A_k}) \ge E(t^2 1_{A_k}) = t^2 P(A_k),

since on the event A_k we know that S_k^2 is at least t^2 (look again at the definition of A_k). Combining these observations, we get that

  V(S_n) \ge t^2 \sum_{k=1}^n P(A_k) = t^2 P(A),

which is exactly the claim that was to be proved.
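A Monte Carlo sanity check (added here; not part of the original notes): for i.i.d. +-1 steps we have V(S_n) = n, so the inequality predicts P(max_{k<=n} |S_k| > t) <= n/t^2. The parameters below are arbitrary.

```python
import random

def max_partial_sum_exceeds(n, t, trials=5000, seed=0):
    """Estimate P(max_{1<=k<=n} |S_k| > t) for i.i.d. +/-1 steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0
        for _ in range(n):
            s += rng.choice((-1, 1))
            if abs(s) > t:
                hits += 1
                break
    return hits / trials

n, t = 100, 25
p_hat = max_partial_sum_exceeds(n, t)
print(f"estimated P = {p_hat:.4f}   Kolmogorov bound V(S_n)/t^2 = {n / t**2:.4f}")
```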
As a corollary, we get a result on convergence of random series.

Theorem 9.5. Let X_1, X_2, \ldots be a sequence of independent r.v.'s such that E(X_n) = 0 for all n, and assume that \sum_{n=1}^\infty V(X_n) < \infty. Then the random series \sum_{n=1}^\infty X_n converges almost surely.

Proof. Denote as usual S_n = \sum_{k=1}^n X_k. We have the following equality of events:

  \left\{ \sum_{n=1}^\infty X_n \text{ converges} \right\} = \left\{ \left( \sum_{n=1}^N X_n \right)_{N \ge 1} \text{ is a Cauchy sequence} \right\} = \bigcap_{\epsilon > 0} \bigcup_{N \ge 1} \bigcap_{n \ge N} \left\{ |S_n - S_N| < \epsilon \right\}.

Or, put differently, we can look at the complement of this event and represent it as

  \left\{ \sum_{n=1}^\infty X_n \text{ does not converge} \right\} = \bigcup_{\epsilon > 0} \bigcap_{N \ge 1} \bigcup_{n \ge N} \left\{ |S_n - S_N| \ge \epsilon \right\} = \bigcup_{\epsilon > 0} \bigcap_{N \ge 1} \left\{ \sup_{n \ge N} |S_n - S_N| \ge \epsilon \right\}.

This form is exactly suitable for an application of Kolmogorov's maximal inequality, except that here we have an infinite sequence of partial sums instead of a finite maximum.
However, by the "continuity from below" property of probability measures, we see that it does not matter. More precisely, for any \epsilon > 0 and N \ge 1, we have

  P\left( \sup_{n \ge N} |S_n - S_N| \ge \epsilon \right) = \lim_{M\to\infty} P\left( \sup_{N \le n \le M} |S_n - S_N| \ge \epsilon \right) \le \lim_{M\to\infty} \frac{V(S_M - S_N)}{\epsilon^2} \le \frac{1}{\epsilon^2} \sum_{n=N}^\infty V(X_n).

Therefore also

  P\left( \bigcap_{N \ge 1} \left\{ \sup_{n \ge N} |S_n - S_N| \ge \epsilon \right\} \right) \le \inf_{N \ge 1} P\left( \sup_{n \ge N} |S_n - S_N| \ge \epsilon \right) \le \inf_{N \ge 1} \frac{1}{\epsilon^2} \sum_{n=N}^\infty V(X_n) = 0,

because of our assumption that the sum of the variances converges. Finally, this is true for all \epsilon > 0, so by the usual trick of replacing an uncountably-infinite union by a countable one (provided that the particular form of the event in question warrants this!), we get the claim that

  P\left( \sum_{n=1}^\infty X_n \text{ does not converge} \right) = 0.
One could ask whether the sufficient condition given by the theorem above is also necessary (it is). More generally, what happens for random variables with non-zero expectations? What happens for r.v.'s with infinite expectations, or infinite variances? Kolmogorov formulated a general theorem that gives a necessary and sufficient condition for a series of independent random variables to converge almost surely.

Theorem 9.6 (The Kolmogorov three-series theorem). If X_1, X_2, \ldots is a sequence of independent random variables, then the random series \sum_{n=1}^\infty X_n converges almost surely if and only if the following three conditions hold:

1. \sum_{n=1}^\infty P(|X_n| > 1) < \infty.
2. The series \sum_{n=1}^\infty E(X_n 1_{\{|X_n| \le 1\}}) converges.
3. \sum_{n=1}^\infty V(X_n 1_{\{|X_n| \le 1\}}) < \infty.

If one of the conditions does not hold, then the series \sum X_n diverges almost surely.

We postpone the proof of Theorem 9.6 until later; we will prove it in Section 14.2 as an application of a generalized version of the central limit theorem. Note that since the convergence of the series \sum X_n is equivalent to the convergence of \sum a X_n for any constant a \ne 0, the value 1 chosen for the truncation of X_n in the theorem is arbitrary and can be replaced by any other positive constant.
Example 9.7. Let (c_n)_{n=1}^\infty be a sequence of real numbers. Consider the series with random signs associated with the sequence (c_n), which we denote \sum_{n=1}^\infty \pm c_n, and which more precisely represents the series

  \sum_{n=1}^\infty c_n X_n,

where X_1, X_2, \ldots is a sequence of i.i.d. r.v.'s taking the values -1, +1 with respective probabilities 1/2, 1/2. By Theorem 9.5 it follows that if \sum_{n=1}^\infty c_n^2 < \infty then the series \sum_{n=1}^\infty \pm c_n converges almost surely. From Theorem 9.6, one can check easily that this condition is also necessary; in other words, the series with random signs converges a.s. if and only if the series of squares \sum c_n^2 converges. Thus, for example, the harmonic series with random signs \sum \pm \frac{1}{n} converges a.s., but the analogous series of square root reciprocals \sum \pm \frac{1}{\sqrt{n}} diverges a.s. Compare this to the series with alternating signs \sum_n \frac{(-1)^n}{n} and \sum_n \frac{(-1)^n}{\sqrt{n}}, both of which are known to converge! Try to develop your intuition by thinking about the reasons why the behavior for series with alternating signs and that for random signs is not the same.
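A quick simulation (an added illustration, not part of the original notes) contrasts the two cases: the partial sums of \sum \pm 1/n settle down, while those of \sum \pm 1/\sqrt{n} keep wandering.

```python
import random

def random_sign_partial_sums(coeffs, seed=0):
    """Partial sums of sum_n (+/-1) * c_n with i.i.d. fair random signs."""
    rng = random.Random(seed)
    sums, s = [], 0.0
    for c in coeffs:
        s += c * rng.choice((-1.0, 1.0))
        sums.append(s)
    return sums

N = 100_000
harmonic = random_sign_partial_sums([1 / n for n in range(1, N + 1)])
sqrt_recip = random_sign_partial_sums([1 / n ** 0.5 for n in range(1, N + 1)])
# Spread of the last 10% of partial sums: tiny in the convergent case, large in the divergent one.
for name, sums in [("1/n", harmonic), ("1/sqrt(n)", sqrt_recip)]:
    tail = sums[-N // 10:]
    print(f"{name:10s} spread of last 10% = {max(tail) - min(tail):.4f}")
```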
End of Part II

Part III: The Central Limit Theorem

Chapter 10: Stirling's formula and the de Moivre-Laplace theorem

We start with a motivating example that was also historically the first instance in which the phenomenon that came to be known as the Central Limit Theorem was observed. Let X_1, X_2, \ldots be an i.i.d. sequence of Binom(1, p) random variables, and let S_n = \sum_{k=1}^n X_k, a r.v. with distribution Binom(n, p).

Theorem 10.1 (The de Moivre-Laplace theorem). For any t \in \mathbb{R},

  P\left( \frac{S_n - np}{\sqrt{np(1-p)}} \le t \right) \xrightarrow[n\to\infty]{} \Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\, dx.

Since this is such a concrete example, the proof will simply require us to estimate a sum of the form \sum_{0 \le k \le t} \binom{n}{k} p^k (1-p)^{n-k}. Knowing how to estimate such sums is a useful skill in its own right. Since the binomial coefficients are involved, we also need some preparation related to Stirling's formula.
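The theorem can be checked numerically without any asymptotics (an added illustration, not part of the original notes): compute the exact standardized binomial CDF and compare it with \Phi(t) = \frac{1}{2}(1 + \mathrm{erf}(t/\sqrt{2})).

```python
from math import comb, erf, sqrt

def binom_cdf_standardized(n, p, t):
    """Exact P((S_n - np)/sqrt(np(1-p)) <= t) for S_n ~ Binom(n, p)."""
    cutoff = n * p + t * sqrt(n * p * (1 - p))
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1) if k <= cutoff)

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))

p, t = 0.3, 1.0
for n in [10, 100, 1000]:
    print(f"n = {n:5d}   exact = {binom_cdf_standardized(n, p, t):.4f}   Phi(t) = {Phi(t):.4f}")
```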
Lemma 10.2. The limit C = \lim_{n\to\infty} \frac{n!}{\sqrt{n}\,(n/e)^n} exists.

Proof.

  \log n! = \sum_{k=1}^n \log k = \sum_{k=1}^n \int_1^k \frac{dx}{x} = \int_1^n \frac{n - \lfloor x \rfloor}{x}\, dx
    = \int_1^n \frac{n + \frac{1}{2} + (x - \frac{1}{2} - \lfloor x \rfloor) - x}{x}\, dx = (n + 1/2) \log n - n + 1 + \int_1^n \frac{\{x\} - \frac{1}{2}}{x}\, dx
    = (n + 1/2) \log n - n + 1 + \int_1^\infty \frac{\{x\} - \frac{1}{2}}{x}\, dx + o(1),

where \{x\} = x - \lfloor x \rfloor denotes the fractional part, and the last integral converges because \int_1^t (\{x\} - \frac{1}{2})\, dx is bounded and 1/x goes to 0 as x \to \infty.

Note that an easy consequence of Lemma 10.2 is that \binom{2n}{n} = (1 + o(1))\, \frac{2^{2n}}{C \sqrt{n/2}}. We shall now use this to find the value of C.
Lemma 10.3. Let f : \mathbb{R} \to \mathbb{R} be an (n+1) times continuously-differentiable function. Then for all x \in \mathbb{R}, we have

  f(x) = f(0) + f'(0)\, x + \frac{f''(0)}{2}\, x^2 + \ldots + \frac{f^{(n)}(0)}{n!}\, x^n + R_n(x),

where

  R_n(x) = \frac{1}{n!} \int_0^x f^{(n+1)}(t)\, (x - t)^n\, dt.

Proof. This follows by induction on n, using integration by parts.
Lemma 10.4. C = \sqrt{2\pi}.

Proof. Apply Lemma 10.3 with f(x) = (1 + x)^{2n+1} to compute \frac{1}{2^{2n+1}} R_n(1):

  \frac{1}{2^{2n+1}} R_n(1) = \frac{1}{2^{2n+1}} \cdot \frac{1}{n!} \int_0^1 (2n+1)(2n) \cdots (n+1)\, (1 + t)^n (1 - t)^n\, dt
    = \frac{2 \binom{2n}{n} (n + \frac{1}{2})}{2^{2n+1}} \int_0^1 (1 - t^2)^n\, dt
    = \binom{2n}{n} \frac{\sqrt{n}}{2^{2n}} \left( 1 + \frac{1}{2n} \right) \int_0^{\sqrt{n}} \left( 1 - \frac{u^2}{n} \right)^n du
    \xrightarrow[n\to\infty]{} \frac{\sqrt{2}}{C} \int_0^\infty e^{-u^2}\, du = \frac{\sqrt{2}}{C} \cdot \frac{\sqrt{\pi}}{2} = \frac{\sqrt{2\pi}}{2C}.

The convergence of the integrals is justified by the fact that (1 - u^2/n)^n \le e^{-u^2} for all 0 \le u \le \sqrt{n}, and (1 - u^2/n)^n \to e^{-u^2} as n \to \infty, uniformly on compact intervals. To finish the proof, note that

  \frac{1}{2^{2n+1}} R_n(1) = \sum_{n < k \le 2n+1} \binom{2n+1}{k} 2^{-(2n+1)} = \frac{1}{2}

(this is the probability that a Binom(2n + 1, 1/2) random variable takes a value > n). Therefore C = \sqrt{2\pi}, as claimed.
Corollary 10.5 (Stirling's formula). \lim_{n\to\infty} \frac{n!}{\sqrt{2\pi n}\,(n/e)^n} = 1.

Note that the proof is based on computing P(S_{2n+1} > n) in two different ways, when S_{2n+1} \sim Binom(2n+1, 1/2). This is just the special case p = 1/2, t = 0 of Theorem 10.1! In this very special case, by symmetry the probability is equal to 1/2; on the other hand, Lemma 10.3 enables us to relate this to the asymptotic behavior of n! and to (half of) the gaussian integral \int_{-\infty}^\infty e^{-x^2}\, dx. The evaluation of the constant C in Stirling's formula is the part that is attributed to James Stirling. The form that appears in Lemma 10.2 is due to Abraham de Moivre (1733).
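As a sanity check (an added illustration, not part of the original notes), the ratio n!/(\sqrt{2\pi n}\,(n/e)^n) can be computed directly; working in logarithms avoids overflow.

```python
from math import exp, lgamma, log, pi

def stirling_ratio(n):
    """n! divided by sqrt(2*pi*n) * (n/e)^n, computed via log-gamma."""
    log_fact = lgamma(n + 1)  # log(n!)
    log_stirling = 0.5 * log(2 * pi * n) + n * (log(n) - 1)
    return exp(log_fact - log_stirling)

for n in [1, 10, 100, 1000]:
    print(f"n = {n:5d}   ratio = {stirling_ratio(n):.6f}")
```

The ratio approaches 1 from above, consistent with the refinement n! = sqrt(2*pi*n) (n/e)^n (1 + 1/(12n) + ...).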
With this preparation, it is now possible to apply the same technique to prove Theorem 10.1. Instead of the function f(x) = (1+x)^{2n+1}, take the function g(x) = ((1-p) + px)^n = \sum_{k=0}^n \binom{n}{k} p^k (1-p)^{n-k} x^k, and compute the remainder R_k(1) of the Taylor expansion of g, where k \approx np + t\sqrt{np(1-p)}. This should converge to 1 - \Phi(t), and indeed, this follows without too much difficulty from Lemma 10.3. The computation is left as an exercise. We also sketch another way of proving Theorem 10.1, by directly approximating the probabilities \binom{n}{k} p^k (1-p)^{n-k} by Gaussian densities.
Sketch of Proof of Theorem 10.1. Denote q = 1 - p. For a large n, let k be approximately equal to np + t\sqrt{npq}, and use Stirling's formula to estimate the probability P(S_n = k), as follows:

  P(S_n = k) = \binom{n}{k} p^k q^{n-k} = (1 + o(1)) \frac{\sqrt{2\pi n}\,(n/e)^n\, p^k q^{n-k}}{\sqrt{2\pi k}\,(k/e)^k \sqrt{2\pi(n-k)}\,((n-k)/e)^{n-k}}
    = \frac{1 + o(1)}{\sqrt{2\pi npq}} \left( \frac{np}{k} \right)^k \left( \frac{nq}{n-k} \right)^{n-k}
    = \frac{1 + o(1)}{\sqrt{2\pi npq}} \left( 1 + t\sqrt{\frac{q}{np}} \right)^{-k} \left( 1 - t\sqrt{\frac{p}{nq}} \right)^{-(n-k)}.

Taking the logarithm of the product of the last two factors, using the facts that k \approx np + t\sqrt{npq}, n - k \approx nq - t\sqrt{npq}, and that \log(1 + x) = x - x^2/2 + O(x^3) as x \to 0, we see that

  \log\left[ \left( 1 + t\sqrt{\frac{q}{np}} \right)^{-k} \left( 1 - t\sqrt{\frac{p}{nq}} \right)^{-(n-k)} \right]
    = -(np + t\sqrt{npq}) \log\left( 1 + t\sqrt{\frac{q}{np}} \right) - (nq - t\sqrt{npq}) \log\left( 1 - t\sqrt{\frac{p}{nq}} \right)
    = -(np + t\sqrt{npq}) \left( t\sqrt{\frac{q}{np}} - \frac{t^2 q}{2np} \right) - (nq - t\sqrt{npq}) \left( -t\sqrt{\frac{p}{nq}} - \frac{t^2 p}{2nq} \right) + O\left( \frac{t^3}{\sqrt{n}} \right)
    = -t\sqrt{npq} - t^2 q + \frac{t^2 q}{2} + t\sqrt{npq} - t^2 p + \frac{t^2 p}{2} + O\left( \frac{t^3}{\sqrt{n}} \right)
    = -\frac{t^2}{2} + O\left( \frac{t^3}{\sqrt{n}} \right).

It follows that

  P(S_n = k) = \frac{1 + o(1)}{\sqrt{2\pi npq}}\, e^{-t^2/2}.

In other words, the individual probabilities for S_n approximate a normal density! From here, it is not too hard to show that the probability

  P\left( a \le \frac{S_n - np}{\sqrt{npq}} \le b \right) = \sum_{np + a\sqrt{npq}\, \le\, k\, \le\, np + b\sqrt{npq}} P(S_n = k)

is approximately a Riemann sum for the integral (2\pi)^{-1/2} \int_a^b e^{-x^2/2}\, dx = \Phi(b) - \Phi(a). In fact, this is true since for a, b fixed and k ranging between np + a\sqrt{npq} and np + b\sqrt{npq}, the error concealed by the o(1) term is uniformly small (smaller than any \epsilon > 0, say, when n is sufficiently large), since this error term originates with three applications of Stirling's approximation formula (for n!, for k!, and for (n-k)!) followed by the second-order Taylor expansion of the log function above.

One lesson that can be learned from this proof is that doing computations for specific distributions can be messy! So we might be better off looking for more general, and therefore more conceptual, techniques for proving convergence to the normal distribution, that require less explicit computations; fortunately such techniques exist, and will lead us to the much more general central limit theorem.
Chapter 11: Convergence in distribution

11.1 Definition

Since we will be talking about convergence of the distribution of random variables to the normal distribution, it makes sense to develop the general theory of convergence of distributions to a limiting distribution.

Definition 11.1. Let (F_n)_{n=1}^\infty be a sequence of distribution functions. We say that F_n converges to a limiting distribution function F, and denote this by F_n \Longrightarrow F, if F_n(x) \to F(x) as n \to \infty for any x \in \mathbb{R} which is a continuity point of F. If X, (X_n)_{n=1}^\infty are random variables, we say that X_n converges in distribution to X (or, interchangeably, converges in distribution to F_X) if F_{X_n} \Longrightarrow F_X.
This definition, which may seem unnatural at first sight, will become more reasonable after we prove the following lemma.

Lemma 11.2. The following are equivalent:

1. X_n \Longrightarrow X.
2. E f(X_n) \to E f(X) as n \to \infty for any bounded continuous function f : \mathbb{R} \to \mathbb{R}.
3. There exist a r.v. Y and a sequence (Y_n)_{n=1}^\infty of r.v.'s, all defined on some probability space (\Omega, \mathcal{F}, P), such that Y_n \to Y a.s., Y is equal in distribution to X, and each Y_n is equal in distribution to the respective X_n.
Proof. Proof that $2 \Rightarrow 1$: Assume that $Ef(X_n) \xrightarrow[n\to\infty]{} Ef(X)$ for any bounded continuous function $f : \mathbb{R} \to \mathbb{R}$, and fix $x \in \mathbb{R}$. For any $t \in \mathbb{R}$ and $\epsilon > 0$, define a function $g_{t,\epsilon} : \mathbb{R} \to \mathbb{R}$ by
$$g_{t,\epsilon}(u) = \begin{cases} 1 & u < t, \\ \dfrac{t - u + \epsilon}{\epsilon} & t \le u \le t + \epsilon, \\ 0 & u > t + \epsilon. \end{cases}$$
Then we have that
$$E\left(g_{x-\epsilon,\epsilon}(X_n)\right) \le F_{X_n}(x) = E\left(1_{(-\infty,x]}(X_n)\right) \le E\left(g_{x,\epsilon}(X_n)\right).$$
Letting $n \to \infty$ gives the chain of inequalities
$$F_X(x - \epsilon) \le E\left(g_{x-\epsilon,\epsilon}(X)\right) \le \liminf_{n\to\infty} F_{X_n}(x) \le \limsup_{n\to\infty} F_{X_n}(x) \le E\left(g_{x,\epsilon}(X)\right) \le F_X(x + \epsilon).$$
Now if $x$ is a point of continuity of $F_X$, letting $\epsilon \downarrow 0$ gives that $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$.
Proof that $3 \Rightarrow 2$: this follows immediately by applying the bounded convergence theorem to the sequence $f(Y_n)$.

Proof that $1 \Rightarrow 3$: Take $(\Omega, \mathcal{F}, P) = ((0,1), \mathcal{B}(0,1), \mathrm{Leb})$. For each $n \ge 1$, let
$$Y_n(x) = \sup\{ y : F_{X_n}(y) < x \}$$
be the lower quantile function of $X_n$, as discussed in a previous lecture, and similarly let $Y(x) = \sup\{ y : F_X(y) < x \}$ be the lower quantile function of $X$. Then as we previously showed, we have $F_Y \equiv F_X$ and $F_{Y_n} \equiv F_{X_n}$ for all $n$. It remains to show that $Y_n(x) \to Y(x)$ for almost all $x \in (0,1)$. In fact, we show that this is true for all but a countable set of $x$'s. Denote $Y^*(x) = \inf\{ y : F_X(y) > x \}$ (the upper quantile function of $X$). As we have seen, we always have $Y(x) \le Y^*(x)$, and $Y(x) = Y^*(x)$ for all $x \in (0,1)$ except on a countable set of $x$'s (the exceptional $x$'s correspond to intervals where $F_X$ is constant; these intervals are disjoint and each one contains a rational point).

Let $x \in (0,1)$ be such that $Y(x) = Y^*(x)$. This means that for any $y < Y(x)$ we have $F_X(y) < x$, and for any $z > Y(x)$ we have $F_X(z) > x$. Now, take a $y < Y(x)$ which is a continuity point of $F_X$. Then $F_{X_n}(y) \to F_X(y)$ as $n \to \infty$, so also $F_{X_n}(y) < x$ for sufficiently large $n$, which means (by the definition of $Y_n$) that $Y_n(x) \ge y$ for such large $n$. This establishes that $\liminf_{n\to\infty} Y_n(x) \ge Y(x)$, since we have continuity points $y < Y(x)$ that are arbitrarily close to $Y(x)$.

Similarly, take a $z > Y(x)$ which is a continuity point of $F_X$. Then $F_{X_n}(z) \to F_X(z)$ as $n \to \infty$, so also $F_{X_n}(z) > x$ for large $n$, which implies that $Y_n(x) \le z$. Again, by taking continuity points $z > Y(x)$ that are arbitrarily close to $Y(x)$ we get that $\limsup_{n\to\infty} Y_n(x) \le Y(x)$. Combining these last two results shows that $Y_n(x) \to Y(x)$, which was what we wanted.
11.2 Examples

1. Normal convergence: We showed that if $X_1, X_2, \ldots$ are i.i.d. $\mathrm{Binom}(1, p)$ r.v.'s and $S_n = \sum_{k=1}^n X_k$, then
$$\frac{S_n - nE(X_1)}{\sqrt{n}\,\sigma(X_1)} \Rightarrow N(0,1).$$
Similarly, using explicit computations (see the homework) it is not too difficult to see that this is also true when $X_1 \sim \mathrm{Poisson}(1)$, $X_1 \sim \mathrm{Exp}(1)$, and in other specific examples. The central limit theorem generalizes this claim to any i.i.d. sequence with finite variance.

2. Waiting for rare events: If for each $0 < p < 1$ we have a r.v. $X_p \sim \mathrm{Geom}_0(p)$, then $P(X_p \ge n) = (1-p)^{n-1}$. It follows that
$$P(pX_p > x) = (1-p)^{\lceil x/p \rceil} \xrightarrow[p\downarrow 0]{} e^{-x}, \qquad (x > 0),$$
so $pX_p \Rightarrow \mathrm{Exp}(1)$ as $p \downarrow 0$.

3. Pólya's urn: Let $X_n$ be the number of white balls in the Pólya urn experiment after starting with one white ball and one black ball and performing the experiment for $n$ steps (so that there are $n+2$ balls). In a homework exercise we showed that $X_n$ is a discrete uniform r.v. on $\{1, 2, \ldots, n+1\}$. It follows easily that the proportion of white balls in the urn converges in distribution:
$$\frac{X_n}{n+2} \Rightarrow U(0,1).$$

4. Gumbel distribution: If $X_1, X_2, \ldots$ are i.i.d. $\mathrm{Exp}(1)$ random variables, and $M_n = \max(X_1, \ldots, X_n)$, we showed in a homework exercise that
$$P(M_n - \log n \le x) \xrightarrow[n\to\infty]{} e^{-e^{-x}}, \qquad x \in \mathbb{R}.$$
It follows that $M_n - \log n \Rightarrow F$, where $F(x) = \exp\left(-e^{-x}\right)$ is called the Gumbel distribution.
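These limits are easy to observe empirically. The following sketch (the sample sizes are illustrative assumptions) simulates example 4 and compares the empirical distribution function of $M_n - \log n$ with the Gumbel c.d.f.:

```python
import math
import random

random.seed(0)
n, trials = 1000, 2000

# Simulate M_n - log n, where M_n is the max of n i.i.d. Exp(1) variables
samples = []
for _ in range(trials):
    m = max(random.expovariate(1.0) for _ in range(n))
    samples.append(m - math.log(n))

# Compare the empirical c.d.f. at a few points with F(x) = exp(-e^{-x})
for x in (-1.0, 0.0, 1.0):
    empirical = sum(s <= x for s in samples) / trials
    gumbel = math.exp(-math.exp(-x))
    print(x, round(empirical, 3), round(gumbel, 3))
```

The empirical values track the Gumbel c.d.f. to within the Monte Carlo error of a few percent.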
11.3 Compactness and tightness
Theorem 11.3 (Hellys selection theorem). If (F
n
)

n=1
is a sequence of distribution functions,
then there is a subsequence F
n
k
and a right-continuous, nondecreasing function H : R [0, 1]
such that
F
n
k
(x)
n
H(x)
holds for any x R which is a continuity point of H.
67
Note. The subsequential limit $H$ need not be a distribution function, since it may not satisfy the properties $\lim_{x\to-\infty} H(x) = 0$ or $\lim_{x\to\infty} H(x) = 1$. For example, taking $F_n = F_{X_n}$, where $X_n \sim U[-n, n]$, we see that $F_n(x) \to 1/2$ for all $x \in \mathbb{R}$. For a more interesting example, take $G_n = (F_n + F_{Z_n})/2$, where the $F_n$ are as in the previous example and $Z_n$ is some sequence of r.v.'s that converges in distribution.
Proof. First, note that we can find a subsequence $(n_k)_{k=1}^\infty$ such that $F_{n_k}(r)$ converges to a limit $G(r)$ at least for any rational number $r$. This is done by combining the compactness of the interval $[0,1]$ (which implies that for any specific $a \in \mathbb{R}$ we can always take a subsequence to make the sequence of numbers $F_n(a)$ converge to a limit) with a diagonal argument (for some enumeration $r_1, r_2, r_3, \ldots$ of the rationals, first take a subsequence to force convergence at $r_1$; then take a subsequence of that subsequence to force convergence at $r_2$, etc.; now form a subsequence whose $k$-th term is the $k$-th term of the $k$-th subsequence in this series).

Now, use $G(\cdot)$, which is defined only on the rationals and not necessarily right-continuous (but is nondecreasing), to define a function $H : \mathbb{R} \to \mathbb{R}$ by
$$H(x) = \inf\{ G(r) : r \in \mathbb{Q},\ r > x \}.$$
This function is clearly nondecreasing, and is also right-continuous, since we have
$$\lim_{x_n \downarrow x} H(x_n) = \inf\{ G(r) : r \in \mathbb{Q},\ r > x_n \text{ for some } n \} = \inf\{ G(r) : r \in \mathbb{Q},\ r > x \} = H(x).$$

Finally, let $x$ be a continuity point of $H$. To show that $F_{n_k}(x) \to H(x)$, fix some $\epsilon > 0$ and let $r_1, r_2, s$ be rationals such that $r_1 < r_2 < x < s$ and
$$H(x) - \epsilon < H(r_1) \le H(r_2) \le H(x) \le H(s) < H(x) + \epsilon.$$
Then since $F_{n_k}(r_2) \to G(r_2) \ge H(r_1)$, and $F_{n_k}(s) \to G(s) \le H(s)$, it follows that for sufficiently large $k$ we have
$$H(x) - \epsilon < F_{n_k}(r_2) \le F_{n_k}(x) \le F_{n_k}(s) < H(x) + \epsilon.$$
Therefore
$$H(x) - \epsilon \le \liminf_{k\to\infty} F_{n_k}(x) \le \limsup_{k\to\infty} F_{n_k}(x) \le H(x) + \epsilon,$$
and since $\epsilon$ was arbitrary this proves the claim.
Theorem 11.3 can be thought of as a kind of compactness property for probability distributions, except that the subsequential limit guaranteed to exist by the theorem is not a distribution function. To ensure that we get a distribution function, it turns out that a certain property called tightness has to hold.

Definition 11.4. A sequence $(\mu_n)_{n=1}^\infty$ of probability measures on $(\mathbb{R}, \mathcal{B})$ is called tight if for any $\epsilon > 0$ there exists an $M > 0$ such that
$$\liminf_{n\to\infty} \mu_n([-M, M]) \ge 1 - \epsilon.$$
A sequence of distribution functions $(F_n)_{n=1}^\infty$ is called tight if the associated probability measures determined by the $F_n$ form a tight sequence, or, more explicitly, if for any $\epsilon > 0$ there exists an $M > 0$ such that
$$\limsup_{n\to\infty} \left( 1 - F_n(M) + F_n(-M) \right) < \epsilon.$$
A sequence of random variables is called tight if the sequence of their distribution functions is tight.
Theorem 11.5. If $(F_n)_{n=1}^\infty$ is a tight sequence of distribution functions, then there exist a subsequence $(F_{n_k})_{k=1}^\infty$ and a distribution function $F$ such that $F_{n_k} \Rightarrow F$. In fact, any subsequential limit $H$ as guaranteed to exist in the previous theorem is a distribution function.

Exercise 11.6. Prove that the converse is also true, i.e., if a sequence is not tight then it must have at least one subsequential limit $H$ (in the sense of the subsequence converging to $H$ at any continuity point of $H$) that is not a proper distribution function. In particular, it is worth noting that a sequence that converges in distribution is tight.
Proof. Let $H$ be a nondecreasing, right-continuous function that arises as a subsequential limit-in-distribution of a subsequence $F_{n_k}$, which we know exists by Theorem 11.3. To show that $H$ is a distribution function, fix $\epsilon > 0$, and let $M > 0$ be the constant guaranteed to exist in the definition of tightness. Let $x < -M$ be a continuity point of $H$. We have
$$H(x) = \lim_{k\to\infty} F_{n_k}(x) \le \limsup_{k\to\infty} F_{n_k}(-M) \le \limsup_{k\to\infty} \left( F_{n_k}(-M) + \left(1 - F_{n_k}(M)\right) \right) < \epsilon,$$
so this shows that $\lim_{x\to-\infty} H(x) = 0$. Similarly, let $x > M$ be a continuity point of $H$. Then
$$H(x) = \lim_{k\to\infty} F_{n_k}(x) \ge \liminf_{k\to\infty} F_{n_k}(M) \ge \liminf_{k\to\infty} \left( F_{n_k}(M) - F_{n_k}(-M) \right) > 1 - \epsilon,$$
which shows that $\lim_{x\to\infty} H(x) = 1$.
The condition of tightness is not very restrictive, and in practical situations it is usually quite easy to verify. The following lemma gives an example that is relevant for our purposes.

Lemma 11.7. If $X_1, X_2, \ldots$ are r.v.'s such that $EX_n = 0$ and $V(X_n) < C$ for all $n$, then $(X_n)_n$ is a tight sequence.

Proof. Use Chebyshev's inequality:
$$P(|X_n| > M) \le \frac{V(X_n)}{M^2} \le \frac{C}{M^2},$$
so, if $\epsilon > 0$ is given, taking $M = \sqrt{C/\epsilon}$ ensures that the left-hand side is bounded by $\epsilon$.
Chapter 12: Characteristic functions

12.1 Definition and basic properties

A main tool in our proof of the central limit theorem will be that of characteristic functions. The basic idea will be to show that
$$E\left[ g\left( \frac{S_n - n\mu}{\sigma\sqrt{n}} \right) \right] \xrightarrow[n\to\infty]{} E\,g(N(0,1))$$
for a sufficiently large family of functions $g$. It turns out that the family of functions of the form
$$g_t(x) = e^{itx}, \qquad (t \in \mathbb{R}),$$
is ideally suited for this purpose. (Here and throughout, $i = \sqrt{-1}$.)

Definition 12.1. The characteristic function of a r.v. $X$, denoted $\varphi_X$, is defined by
$$\varphi_X(t) = E\left( e^{itX} \right) = E(\cos(tX)) + iE(\sin(tX)), \qquad (t \in \mathbb{R}).$$

Note that we are taking the expectation of a complex-valued random variable (which is really a kind of two-dimensional random vector). However, the main properties of the expectation operator (linearity, the triangle inequality, etc.) that hold for real-valued random variables also hold for complex-valued ones, so this will not pose too much of a problem.
Here are some simple properties of characteristic functions. For simplicity we denote $\varphi = \varphi_X$ where there is no risk of confusion.

1. $\varphi(0) = E e^{i\cdot 0\cdot X} = 1$.

2. $\varphi(-t) = E e^{-itX} = \overline{E\left(e^{itX}\right)} = \overline{\varphi(t)}$ (where $\overline{z}$ denotes the complex conjugate of a complex number $z$).

3. $|\varphi(t)| \le E\left|e^{itX}\right| = 1$ by the triangle inequality.

4. $|\varphi(t) - \varphi(s)| \le E\left|e^{itX} - e^{isX}\right| = E\left|e^{isX}\left(e^{i(t-s)X} - 1\right)\right| = E\left|e^{i(t-s)X} - 1\right|$. Note also that $E\left|e^{iuX} - 1\right| \to 0$ as $u \to 0$ by the bounded convergence theorem. It follows that $\varphi$ is a uniformly continuous function on $\mathbb{R}$.

5. $\varphi_{aX}(t) = E e^{iatX} = \varphi_X(at)$, $(a \in \mathbb{R})$.

6. $\varphi_{X+b}(t) = E e^{it(X+b)} = e^{ibt}\varphi_X(t)$, $(b \in \mathbb{R})$.

7. Important: If $X, Y$ are independent then
$$\varphi_{X+Y}(t) = E\left( e^{it(X+Y)} \right) = E\left( e^{itX} e^{itY} \right) = E\left( e^{itX} \right) E\left( e^{itY} \right) = \varphi_X(t)\,\varphi_Y(t).$$
Note that this is the main reason why characteristic functions are such a useful tool for studying the distribution of a sum of independent random variables.
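Property 7 is easy to check numerically. This sketch (a Monte Carlo illustration; the distributions, sample size, and evaluation point $t$ are arbitrary choices, not part of the notes) estimates the empirical characteristic functions of two independent samples and of their sum:

```python
import cmath
import random

random.seed(1)
N = 200_000
t = 1.3  # an arbitrary evaluation point

# Independent samples of X ~ U(0,1) and Y ~ Exp(1)
xs = [random.random() for _ in range(N)]
ys = [random.expovariate(1.0) for _ in range(N)]

def ecf(samples, t):
    """Empirical characteristic function: the average of e^{itx} over the sample."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

phi_x = ecf(xs, t)
phi_y = ecf(ys, t)
phi_sum = ecf([x + y for x, y in zip(xs, ys)], t)

print(phi_sum)        # empirical phi_{X+Y}(t)
print(phi_x * phi_y)  # product phi_X(t) * phi_Y(t); close by property 7
```

Up to Monte Carlo error of order $N^{-1/2}$, the two printed complex numbers agree, exactly as the factorization $\varphi_{X+Y} = \varphi_X \varphi_Y$ predicts.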
A note on terminology. If $X$ has a density function $f$, then the characteristic function can be computed as
$$\varphi_X(t) = \int_{-\infty}^{\infty} f_X(x) e^{itx}\,dx.$$
In all other branches of mathematics, this would be called the Fourier transform of $f$ (well, more or less: it is really the inverse Fourier transform, but it becomes the Fourier transform if we replace $t$ by $-t$, so that is almost the same thing). So the concept of a characteristic function generalizes the Fourier transform. If $\mu$ is the distribution measure of $X$, some authors write
$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx}\,d\mu(x)$$
(which is an example of a Lebesgue-Stieltjes integral) and call this the Fourier-Stieltjes transform (or just the Fourier transform) of the measure $\mu$.

12.2 Examples

No study of characteristic functions is complete without dirtying your hands a little to compute the characteristic function in some important cases. The following exercise is highly recommended.

Exercise 12.2. Compute the characteristic functions of the following distributions.

1. Coin flips: Compute $\varphi_X$ when $P(X = 1) = P(X = -1) = 1/2$ (this comes out slightly more symmetrical than the usual Bernoulli r.v. for which $P(X = 0) = P(X = 1) = 1/2$).
2. Symmetric random walk: Compute $\varphi_{S_n}$, where $S_n = \sum_{k=1}^n X_k$ is the sum of $n$ i.i.d. copies of the coin flip distribution above.

3. Poisson distribution: $X \sim \mathrm{Poisson}(\lambda)$.

4. Uniform distribution: $X \sim U[a,b]$, and in particular $X \sim U[-1,1]$, which is especially symmetric and useful in applications.

5. Exponential distribution: $X \sim \mathrm{Exp}(\lambda)$.

6. Symmetrized exponential: A r.v. $Z$ with density function $f_Z(x) = \frac12 e^{-|x|}$. Note that this is the distribution of the exponential distribution after being symmetrized in either of two ways: (i) we showed that if $X, Y \sim \mathrm{Exp}(1)$ are independent then $X - Y$ has density $\frac12 e^{-|x|}$; (ii) alternatively, it is the distribution of an exponential variable with a random sign, namely $\epsilon X$ where $X \sim \mathrm{Exp}(1)$ and $\epsilon$ is a random sign (same as the coin flip distribution mentioned above) that is independent of $X$.

The normal distribution has the nice property that its characteristic function is equal, up to a constant, to its density function.

Lemma 12.3. If $Z \sim N(0,1)$ then
$$\varphi_Z(t) = e^{-t^2/2}.$$
Proof.
$$\varphi_Z(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{itx} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-t^2/2} e^{-(x-it)^2/2}\,dx = e^{-t^2/2} \left( \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x-it)^2/2}\,dx \right).$$
As Durrett suggests in his "physics proof" (p. 92 in [Dur2010], 91 in [Dur2004]), the expression in parentheses is 1, since it is the integral of a normal density with mean $it$ and variance 1. This is a nonsensical argument, of course ($it$ being an imaginary number), but the claim is true, easy, and is proved in any complex analysis course using contour integration.

Alternatively, let $S_n = \sum_{k=1}^n X_k$, where $X_1, X_2, \ldots$ are i.i.d. coin flips with $P(X_k = 1) = P(X_k = -1) = 1/2$. We know from the de Moivre-Laplace theorem (Theorem 10.1) that
$$S_n/\sqrt{n} \Rightarrow N(0,1),$$
so that
$$\varphi_{S_n/\sqrt{n}}(t) = E\left( e^{itS_n/\sqrt{n}} \right) \xrightarrow[n\to\infty]{} \varphi_Z(t), \qquad (t \in \mathbb{R}),$$
since the function $x \mapsto e^{itx}$ is bounded and continuous. On the other hand, from the exercise above it is easy to compute that $\varphi_{S_n}(t) = \cos^n(t)$, which implies that
$$\varphi_{S_n/\sqrt{n}}(t) = \cos^n\left( \frac{t}{\sqrt{n}} \right) = \left( 1 - \frac{t^2}{2n} + O\left( \frac{t^4}{n^2} \right) \right)^n \xrightarrow[n\to\infty]{} e^{-t^2/2}.$$
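The limit $\cos^n(t/\sqrt{n}) \to e^{-t^2/2}$ in the last display can be confirmed directly (a quick numerical sketch; the values of $t$ and $n$ are arbitrary):

```python
import math

t = 2.0
# cos^n(t / sqrt(n)) should approach exp(-t^2 / 2) as n grows
for n in (10, 100, 10_000, 1_000_000):
    print(n, math.cos(t / math.sqrt(n)) ** n)
print("limit:", math.exp(-t * t / 2))
```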
As a consequence, let $X \sim N(0, \sigma_1^2)$ and $Y \sim N(0, \sigma_2^2)$ be independent, and let $Z = X + Y$. Then
$$\varphi_X(t) = e^{-\sigma_1^2 t^2/2}, \qquad \varphi_Y(t) = e^{-\sigma_2^2 t^2/2},$$
so $\varphi_Z(t) = e^{-(\sigma_1^2 + \sigma_2^2)t^2/2}$. This is the same as $\varphi_W(t)$, where $W \sim N(0, \sigma_1^2 + \sigma_2^2)$. It would be nice if we could deduce from this that $Z \sim N(0, \sigma_1^2 + \sigma_2^2)$ (we already proved this fact in a homework exercise, but it's always nice to have several proofs of a result, especially an important one like this one). This naturally leads us to an important question about characteristic functions, which we consider in the next section.
12.3 The inversion formula

A fundamental question about characteristic functions is whether they contain all the information about a distribution, or in other words whether knowing the characteristic function determines the distribution uniquely. This question is answered (affirmatively) by the following theorem, which is a close cousin of the standard inversion formula from analysis for the Fourier transform.

Theorem 12.4 (The inversion formula). If $X$ is a r.v. with distribution $\mu_X$, then for any $a < b$ we have
$$\lim_{T\to\infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-iat} - e^{-ibt}}{it}\,\varphi_X(t)\,dt = \mu_X((a,b)) + \frac12 \mu_X(\{a, b\}) = P(a < X < b) + \frac12 P(X = a) + \frac12 P(X = b).$$
Proof. Throughout the proof, denote $\varphi(t) = \varphi_X(t)$ and $\mu = \mu_X$. For convenience, we use the notation of Lebesgue-Stieltjes integration with respect to the measure $\mu$, remembering that this really means taking the expectation of some function of the r.v. $X$. Denote
$$I_T = \int_{-T}^{T} \frac{e^{-iat} - e^{-ibt}}{it}\,\varphi(t)\,dt = \int_{-T}^{T} \int_{-\infty}^{\infty} \frac{e^{-iat} - e^{-ibt}}{it}\,e^{itx}\,d\mu(x)\,dt. \tag{12}$$
Since $\frac{e^{-iat} - e^{-ibt}}{it} = \int_a^b e^{-ity}\,dy$ is a bounded function of $t$ (it is bounded in absolute value by $b - a$), it follows by Fubini's theorem that we can change the order of integration, so
$$I_T = \int_{-\infty}^{\infty} \int_{-T}^{T} \frac{e^{-iat} - e^{-ibt}}{it}\,e^{itx}\,dt\,d\mu(x) = \int_{-\infty}^{\infty} \left( \int_{-T}^{T} \frac{\sin(t(x-a))}{t}\,dt - \int_{-T}^{T} \frac{\sin(t(x-b))}{t}\,dt \right) d\mu(x)$$
$$= \int_{-\infty}^{\infty} \left( R(x-a, T) - R(x-b, T) \right) d\mu(x),$$
where we denote $R(\theta, T) = \int_{-T}^{T} \sin(\theta t)/t\,dt$. Note that in the notation of expectations this can be written as $I_T = E\left( R(X-a, T) - R(X-b, T) \right)$. This can be simplified somewhat; in fact, observe also that
$$R(\theta, T) = 2\,\mathrm{sgn}(\theta) \int_0^{|\theta|T} \frac{\sin x}{x}\,dx = 2\,\mathrm{sgn}(\theta)\,S(|\theta|T),$$
where we denote $S(x) = \int_0^x \frac{\sin u}{u}\,du$, and $\mathrm{sgn}(\theta)$ is $1$ if $\theta > 0$, $-1$ if $\theta < 0$ and $0$ if $\theta = 0$. By a standard convergence test for integrals, the improper integral $\int_0^{\infty} \frac{\sin u}{u}\,du = \lim_{x\to\infty} S(x)$ converges; denote its value by $C/4$. Thus, we have shown that $R(\theta, T) \to 2\,\mathrm{sgn}(\theta)\,C/4$ as $T \to \infty$, hence that
$$R(x-a, T) - R(x-b, T) \xrightarrow[T\to\infty]{} \begin{cases} C & a < x < b, \\ C/2 & x = a \text{ or } x = b, \\ 0 & x < a \text{ or } x > b. \end{cases}$$
Furthermore, the function $R(x-a, T) - R(x-b, T)$ is bounded in absolute value by $4\sup_{x\ge 0} S(x)$. It follows that we can apply the bounded convergence theorem in (12) to get that
$$I_T \xrightarrow[T\to\infty]{} C\,E\left(1_{a<X<b}\right) + (C/2)\,E\left(1_{X=a} + 1_{X=b}\right) = C\,\mu((a,b)) + (C/2)\,\mu(\{a, b\}). \tag{13}$$
This is just what we claimed, minus the fact that $C = 2\pi$. This fact is a well-known integral evaluation from complex analysis. We can also deduce it in a self-contained manner, by applying what we proved to a specific measure and specific values of $a$ and $b$ for which we can evaluate the limit in (12) directly. This is not entirely easy to do, but one possibility, involving an additional limiting argument, is outlined in the next exercise; see also Exercise 1.7.5 on p. 35 in [Dur2010] (Exercise 6.6, p. 470 in Appendix A.6 of [Dur2004]) for a different approach to finding the value of $C$.

Exercise 12.5. (Recommended for aspiring analysts...) For each $\sigma > 0$, let $X_\sigma$ be a r.v. with distribution $N(0, \sigma^2)$, and therefore with density $f_{X_\sigma}(x) = (\sigma\sqrt{2\pi})^{-1} e^{-x^2/2\sigma^2}$ and characteristic function $\varphi_{X_\sigma}(t) = e^{-\sigma^2 t^2/2}$. For fixed $\sigma$, apply Theorem 12.4 in its weak form given by (13) (that is, without the knowledge of the value of $C$), with parameters $X = X_\sigma$, $a = -1$ and $b = 1$, to deduce the identity
$$\frac{C}{\sigma\sqrt{2\pi}} \int_{-1}^{1} e^{-x^2/2\sigma^2}\,dx = \int_{-\infty}^{\infty} \frac{2\sin t}{t}\,e^{-\sigma^2 t^2/2}\,dt.$$
Now multiply both sides by $\sigma$ and take the limit as $\sigma \to \infty$. For the left-hand side this should give in the limit (why?) the value $2C/\sqrt{2\pi}$. For the right-hand side this should give $2\sqrt{2\pi}$. Justify these claims and compare the two numbers to deduce that $C = 2\pi$.
The following theorem shows that the inversion formula can be written as a simpler connection between the characteristic function and the density function of a random variable, in the case when the characteristic function is integrable.

Theorem 12.6. If $\int_{-\infty}^{\infty} |\varphi_X(t)|\,dt < \infty$, then $X$ has a bounded and continuous density function $f_X$, and the density and characteristic function are related by
$$\varphi_X(t) = \int_{-\infty}^{\infty} f_X(x) e^{itx}\,dx, \qquad f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_X(t) e^{-itx}\,dt.$$
In the lingo of Fourier analysis, this is known as the inversion formula for Fourier transforms.

Proof. This is a straightforward corollary of Theorem 12.4. See p. 95 in either [Dur2010] or [Dur2004].
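As a numerical illustration of Theorem 12.6 (a sketch only; the truncation level $T$ and step count are arbitrary choices), one can recover the standard normal density at a point from $\varphi_Z(t) = e^{-t^2/2}$ by discretizing the inversion integral:

```python
import cmath
import math

def phi(t):
    """Characteristic function of N(0,1), from Lemma 12.3."""
    return math.exp(-t * t / 2)

def density_via_inversion(x, T=10.0, steps=10_000):
    """Approximate f(x) = (1/2pi) * integral of phi(t) e^{-itx} dt by a midpoint sum,
    truncating the integral to [-T, T] (justified here since phi decays very fast)."""
    h = 2 * T / steps
    total = 0.0 + 0.0j
    for k in range(steps):
        t = -T + (k + 0.5) * h
        total += phi(t) * cmath.exp(-1j * t * x) * h
    return (total / (2 * math.pi)).real

print(density_via_inversion(0.0))   # should be close to 1/sqrt(2*pi)
print(1 / math.sqrt(2 * math.pi))
```

The recovered value matches the normal density $f(0) = (2\pi)^{-1/2}$ to high accuracy, since the truncation error is of order $e^{-T^2/2}$.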
12.4 The continuity theorem

Theorem 12.7. Let $(X_n)_{n=1}^\infty$ be r.v.'s. Then:

(i) If $X_n \Rightarrow X$ for some r.v. $X$, then $\varphi_{X_n}(t) \to \varphi_X(t)$ for all $t \in \mathbb{R}$.

(ii) If the limit $\varphi(t) = \lim_{n\to\infty} \varphi_{X_n}(t)$ exists for all $t \in \mathbb{R}$, and $\varphi$ is continuous at 0, then $\varphi \equiv \varphi_X$ for some r.v. $X$, and $X_n \Rightarrow X$.

Proof. Part (i) follows immediately from the fact that convergence in distribution implies that $Eg(X_n) \to Eg(X)$ for any bounded continuous function $g$. It remains to prove the less trivial claim in part (ii). Assume that $\varphi_{X_n}(t) \to \varphi(t)$ for all $t \in \mathbb{R}$ and that $\varphi$ is continuous at 0. First, we show that the sequence $(X_n)_{n=1}^\infty$ is tight. Fixing an $M > 0$, we can bound the probability $P(|X_n| > M)$ as follows:
$$P(|X_n| > M) = E\left( 1_{|X_n|>M} \right) \le E\left[ 2\left( 1 - \frac{M}{2|X_n|} \right) 1_{|X_n|>M} \right] \le E\left[ 2\left( 1 - \frac{\sin(2X_n/M)}{2X_n/M} \right) 1_{|X_n|>M} \right].$$
But this last expression can be related to the behavior of the characteristic function near 0. Denote $\delta = 2/M$. Reverting again to the Lebesgue-Stieltjes integral notation, we have
$$E\left[ 2\left( 1 - \frac{\sin(2X_n/M)}{2X_n/M} \right) 1_{|X_n|>M} \right] = 2\int_{|x|>2/\delta} \left( 1 - \frac{\sin(\delta x)}{\delta x} \right) d\mu_{X_n}(x)$$
$$\le 2\int_{-\infty}^{\infty} \left( 1 - \frac{\sin(\delta x)}{\delta x} \right) d\mu_{X_n}(x) = \int_{-\infty}^{\infty} \left( \frac{1}{\delta} \int_{-\delta}^{\delta} \left( 1 - e^{itx} \right) dt \right) d\mu_{X_n}(x).$$
Now use Fubini's theorem to get that this bound can be written as
$$\frac{1}{\delta} \int_{-\delta}^{\delta} \int_{-\infty}^{\infty} \left( 1 - e^{itx} \right) d\mu_{X_n}(x)\,dt = \frac{1}{\delta} \int_{-\delta}^{\delta} \left( 1 - \varphi_{X_n}(t) \right) dt \xrightarrow[n\to\infty]{} \frac{1}{\delta} \int_{-\delta}^{\delta} \left( 1 - \varphi(t) \right) dt$$
(the convergence follows from the bounded convergence theorem). So we have shown that
$$\limsup_{n\to\infty} P(|X_n| > M) \le \frac{1}{\delta} \int_{-\delta}^{\delta} \left( 1 - \varphi(t) \right) dt.$$
But, because of the assumption that $\varphi(t) \to \varphi(0) = 1$ as $t \to 0$, it follows that if $\delta$ is sufficiently small then $\frac{1}{\delta} \int_{-\delta}^{\delta} (1 - \varphi(t))\,dt < \epsilon$, where $\epsilon > 0$ is arbitrary; so this establishes the tightness claim.

Finally, to finish the proof, let $(n_k)_{k=1}^\infty$ be a subsequence (guaranteed to exist by tightness) such that $X_{n_k} \Rightarrow Y$ for some r.v. $Y$. Then $\varphi_{X_{n_k}}(t) \to \varphi_Y(t) = \varphi(t)$ as $k \to \infty$ for all $t \in \mathbb{R}$, so $\varphi \equiv \varphi_Y$. This determines the distribution of $Y$, which means that the limit in distribution is the same no matter which convergent-in-distribution subsequence of the sequence $(X_n)_n$ we take. But this implies that $X_n \Rightarrow Y$ (why? The reader is invited to verify this last claim; it is best to use the definition of convergence in distribution in terms of expectations of bounded continuous functions).
12.5 Moments

The final step in our lengthy preparation for the proof of the central limit theorem will be to tie the behavior of the characteristic function $\varphi_X(t)$ near $t = 0$ to the moments of $X$. Note that, computing formally without regard to rigor, we can write
$$\varphi_X(t) = E\left( e^{itX} \right) = E\left( \sum_{n=0}^{\infty} \frac{i^n t^n X^n}{n!} \right) = \sum_{n=0}^{\infty} \frac{i^n E X^n}{n!}\,t^n.$$
So it appears that the moments of $X$ appear as (roughly) the coefficients in the Taylor expansion of $\varphi_X$ around $t = 0$. However, for the CLT we don't want to assume anything beyond the existence of the second moment, so a (slightly) more delicate estimate is required.

Lemma 12.8.
$$\left| e^{ix} - \sum_{m=0}^{n} \frac{(ix)^m}{m!} \right| \le \min\left( \frac{|x|^{n+1}}{(n+1)!},\ \frac{2|x|^n}{n!} \right).$$
Proof. Start with the identity
$$R_n(x) := e^{ix} - \sum_{m=0}^{n} \frac{(ix)^m}{m!} = \frac{i^{n+1}}{n!} \int_0^x (x-s)^n e^{is}\,ds,$$
which follows from Lemma 10.3 that we used in the proof of Stirling's formula. Taking the absolute value and using the fact that $|e^{is}| = 1$ gives
$$|R_n(x)| \le \frac{1}{n!} \left| \int_0^x |x-s|^n\,ds \right| = \frac{|x|^{n+1}}{(n+1)!}. \tag{14}$$
To get a bound that is better-behaved for large $x$, note that
$$R_n(x) = R_{n-1}(x) - \frac{(ix)^n}{n!} = R_{n-1}(x) - \frac{i^n}{(n-1)!} \int_0^x (x-s)^{n-1}\,ds = \frac{i^n}{(n-1)!} \int_0^x (x-s)^{n-1} \left( e^{is} - 1 \right) ds.$$
So, since $|e^{is} - 1| \le 2$, we get that
$$|R_n(x)| \le \frac{2}{(n-1)!} \left| \int_0^x |x-s|^{n-1}\,ds \right| = \frac{2|x|^n}{n!}. \tag{15}$$
Combining (14) and (15) gives the claim.
Now let $X$ be a r.v. with $E|X|^n < \infty$. Letting $x = tX$ in Lemma 12.8, taking expectations and using the triangle inequality, we get that
$$\left| \varphi_X(t) - \sum_{m=0}^{n} \frac{i^m E X^m}{m!}\,t^m \right| \le E\left[ \min\left( \frac{|t|^{n+1} |X|^{n+1}}{(n+1)!},\ \frac{2|t|^n |X|^n}{n!} \right) \right]. \tag{16}$$
Note that in this minimum of two terms, when $t$ is very small the first term gives a better bound, but when taking expectations we need the second term to ensure that the expectation is finite if $X$ is only assumed to have a finite $n$-th moment.

Theorem 12.9. If $X$ is a r.v. with mean $\mu = EX$ and $V(X) < \infty$ then
$$\varphi_X(t) = 1 + i\mu t - \frac{E X^2}{2}\,t^2 + o(t^2) \qquad \text{as } t \to 0.$$

Proof. By (16) above, we have
$$\frac{1}{t^2} \left| \varphi_X(t) - \left( 1 + i\mu t - \frac{E X^2}{2}\,t^2 \right) \right| \le E\left[ \min\left( |t|\,|X|^3/6,\ X^2 \right) \right].$$
As $t \to 0$, the right-hand side converges to 0 by the dominated convergence theorem.
Chapter 13: Central limit theorems

13.1 The case of i.i.d. r.v.'s

We are now ready to prove:

Theorem 13.1 (The central limit theorem). Let $X_1, X_2, \ldots$ be an i.i.d. sequence of r.v.'s with finite variance. Denote $\mu = EX_1$, $\sigma = \sigma(X_1)$ and $S_n = \sum_{k=1}^n X_k$. Then as $n \to \infty$ we have the convergence in distribution
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \Rightarrow N(0,1).$$

Proof. For convenience, denote $\hat{X}_k = (X_k - \mu)/\sigma$ and $\hat{S}_n = \sum_{k=1}^n \hat{X}_k$. Then
$$\varphi_{\hat{S}_n/\sqrt{n}}(t) = \varphi_{\hat{S}_n}(t/\sqrt{n}) = \prod_{k=1}^{n} \varphi_{\hat{X}_k}(t/\sqrt{n}) = \left( \varphi_{\hat{X}_1}(t/\sqrt{n}) \right)^n.$$
Note that $E\hat{X}_1 = 0$ and $V(\hat{X}_1) = E\hat{X}_1^2 = 1$. Therefore by Theorem 12.9, $\varphi_{\hat{X}_1}$ satisfies
$$\varphi_{\hat{X}_1}(u) = 1 - \frac{u^2}{2} + o(u^2)$$
as $u \to 0$. It follows that
$$\varphi_{\hat{S}_n/\sqrt{n}}(t) = \left( 1 - \frac{t^2}{2n} + o\left( \frac{t^2}{n} \right) \right)^n \xrightarrow[n\to\infty]{} e^{-t^2/2}$$
for any $t \in \mathbb{R}$. Using the continuity theorem (Theorem 12.7) and our previous computations, it follows that $\hat{S}_n/\sqrt{n} \Rightarrow N(0, 1)$, as claimed.
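Theorem 13.1 is easy to see in simulation. This sketch (the distribution, $n$, and trial count are arbitrary illustrative choices) standardizes sums of i.i.d. $\mathrm{Exp}(1)$ variables, for which $\mu = \sigma = 1$, and checks the empirical c.d.f. at 0 against $\Phi(0) = 1/2$:

```python
import math
import random

random.seed(2)
n, trials = 200, 3000

standardized = []
for _ in range(trials):
    s = sum(random.expovariate(1.0) for _ in range(n))
    standardized.append((s - n) / math.sqrt(n))  # (S_n - n*mu)/(sigma*sqrt(n)), mu = sigma = 1

# Empirical probability of {standardized sum <= 0}; should approach Phi(0) = 1/2
frac = sum(z <= 0 for z in standardized) / trials
print(frac)
```

Even though the summands are quite skewed, the standardized sums are already close to symmetric at $n = 200$, which is the universality the theorem asserts.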
13.2 Generalizations
The CLT can be generalized in many ways. None of the assumptions (independence, identical
distributions, even nite variance) are entirely necessary. A central paradigm of probability
theory is that any random quantity that arises as a sum of many small contributions that are
either independent or not too strongly dependent, will converge to the normal distribution in
some asymptotic limit. Thousands of examples exist, but there is no single all-encompassing
theorem that includes all of them as a special case. Rather, probabilists have a toolbox of
80
tricks and techniques that they try to apply in order to prove normal convergence in any
given situation. characteristic functions are among the more useful techniques. Another
important technique, the so-called moment method, involves the direct use of moments:
If we can show that E(W
k
n
) E(Z
k
), where (W
n
)

n=1
is the (normalized) sequence being
studied, and Z N(0, 1), then by Theorem 3.3.12 in [Dur2010] (p. 105) or Theorem (3.12)
in [Dur2004] (p. 109), that implies that W
n
= N(0, 1).
We now discuss several examples of interesting generalizations of CLT.
13.2.1 Triangular arrays

Theorem 13.2 (Lindeberg-Feller CLT for triangular arrays). Let $(X_{n,k})_{1\le k\le n<\infty}$ be a triangular array of r.v.'s. Denote $S_n = \sum_{k=1}^n X_{n,k}$ (the sum of the $n$-th row). Assume that:

1. For each $n$, the r.v.'s $(X_{n,k})_{k=1}^n$ are independent.

2. $EX_{n,k} = 0$ for all $n, k$.

3. $V(S_n) = \sigma_n^2 \to \sigma^2 < \infty$ as $n \to \infty$.

4. For all $\epsilon > 0$, $\lim_{n\to\infty} \sum_{k=1}^n E\left( X_{n,k}^2\,1_{|X_{n,k}|>\epsilon} \right) = 0$.

Then $S_n \Rightarrow N(0, \sigma^2)$ as $n \to \infty$.

Proof. See [Dur2010], p. 110-111 or [Dur2004], p. 115-116. The proof uses the characteristic function technique and is a straightforward extension of the proof for the i.i.d. case.
Example 13.3 (Record times and cycles in permutations). Let $X_1, X_2, \ldots$ be i.i.d. $U(0,1)$ r.v.'s. Let $A_n$ be the event that $X_n = \max(X_1, \ldots, X_n)$ (in this case, we say that $n$ is a record time). Let $S_n = \sum_{k=1}^n 1_{A_k}$ be the number of record times up to time $n$. We saw in a homework exercise that the $A_k$'s are independent events and $P(A_k) = 1/k$. This implies that $E(S_n) = \sum_{k=1}^n \frac{1}{k} = H_n$ (the $n$-th harmonic number) and $V(S_n) = \sum_{k=1}^n \frac{k-1}{k^2}$. Note that both $E(S_n)$ and $V(S_n)$ are approximately equal to $\log n$, with an error term that is $O(1)$. Now taking $X_{n,k} = \left(1_{A_k} - k^{-1}\right)/\sqrt{V(S_n)}$ in Theorem 13.2, it is easy to check that the assumptions of the theorem hold. It follows that
$$\frac{S_n - H_n}{\sigma(S_n)} \Rightarrow N(0,1).$$
Equivalently, because of the asymptotic behavior of $E(S_n)$ and $V(S_n)$ it is also true that
$$\frac{S_n - \log n}{\sqrt{\log n}} \Rightarrow N(0,1).$$
Note: $S_n$ describes the distribution of another interesting statistic on random permutations. It is not too difficult to show by induction (using an amusing construction often referred to as the Chinese restaurant process) that if $\pi_n$ is a uniformly random permutation on $n$ elements, then the number of cycles in $\pi_n$ is a random variable which is equal in distribution to $S_n$.
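A quick simulation (the values of $n$ and the trial count are illustrative assumptions) confirms that the number of records up to time $n$ concentrates around $H_n \approx \log n$:

```python
import random

random.seed(3)
n, trials = 10_000, 400

def count_records(n):
    """Number of indices k at which X_k = max(X_1, ..., X_k), for i.i.d. U(0,1) draws."""
    best, records = -1.0, 0
    for _ in range(n):
        x = random.random()
        if x > best:
            best, records = x, records + 1
    return records

mean = sum(count_records(n) for _ in range(trials)) / trials
H_n = sum(1 / k for k in range(1, n + 1))  # the n-th harmonic number, E(S_n)
print(mean, H_n)
```

The sample mean lands within a fraction of a unit of $H_n$, while the fluctuations around it are on the much smaller scale $\sqrt{\log n}$, consistent with the CLT above.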
13.2.2 The Erdős-Kac theorem

Theorem 13.4 (Erdős-Kac theorem (1940)). Let $g(m)$ denote the number of prime divisors of an integer $m$ (for example, $g(28) = 2$). For each $n \ge 1$, let $X_n$ be a uniformly random integer chosen in $\{1, 2, \ldots, n\}$, and let $Y_n = g(X_n)$ be the number of prime divisors of $X_n$. Then we have
$$\frac{Y_n - \log\log n}{\sqrt{\log\log n}} \Rightarrow N(0,1).$$
In other words, for any $x \in \mathbb{R}$ we have
$$\frac{1}{n} \#\left\{ 1 \le k \le n : g(k) \le \log\log n + x\sqrt{\log\log n} \right\} \xrightarrow[n\to\infty]{} \Phi(x).$$

Proof. See [Dur2010], p. 114-117 or [Dur2004], p. 119-124. The proof uses the moment method.

Note that $Y_n$ can be written in the form $\sum_{p \le n} 1_{p \mid X_n}$, namely the sum over all primes $p \le n$ of the indicator of the event that $X_n$ is divisible by $p$. The probability that $X_n$ is divisible by $p$ is roughly $1/p$, at least if $p$ is significantly smaller than $n$. Therefore we can expect $Y_n$ to be on average around
$$\sum_{\text{prime } p \le n} \frac{1}{p},$$
a sum that is known (thanks to Euler) to behave roughly like $\log\log n$. The Erdős-Kac theorem is intuitively related to the observation that these indicators $1_{p \mid X_n}$ for different $p$'s are also close to being independent (a fact which follows from the Chinese remainder theorem). Of course, they are only approximately independent, and making these observations precise is the challenge in proving the theorem. In fact, many famous open problems in number theory (even the Riemann Hypothesis, widely considered to be the most important open problem in mathematics) can be formulated in terms of a statement about approximate independence (in some loose sense) of some arithmetic sequence relating to the prime numbers.
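The quantities in the Erdős-Kac theorem can be computed exactly for small ranges (a sketch; the cutoff $n = 50000$ is an arbitrary choice small enough for trial division):

```python
import math

def g(m):
    """Number of distinct prime divisors of m, by trial division (so g(28) = g(2^2 * 7) = 2)."""
    count, d = 0, 2
    while d * d <= m:
        if m % d == 0:
            count += 1
            while m % d == 0:
                m //= d
        d += 1
    if m > 1:  # leftover prime factor
        count += 1
    return count

n = 50_000
avg = sum(g(k) for k in range(2, n + 1)) / n
print(avg, math.log(math.log(n)))
```

The average is indeed close to $\log\log n$ (the gap of about a quarter is the constant-order correction in $\sum_{p\le n} 1/p$, invisible on the $\sqrt{\log\log n}$ scale of the theorem).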
13.2.3 The Euclidean algorithm

As a final example from number theory, consider the following problem: for some $n \ge 1$, choose $X_n$ and $Y_n$ independently and uniformly at random in $\{1, 2, \ldots, n\}$, and compute their greatest common divisor (g.c.d.) using the Euclidean algorithm. Let $N_n$ be the number of division (with remainder) steps that were required. For example, if $X = 58$ and $Y = 24$ then the application of the Euclidean algorithm would result in the sequence of steps
$$(58, 24) \to (24, 10) \to (10, 4) \to (4, 2) \to (2, 0),$$
so 4 division operations were required (and the g.c.d. is 2).

Theorem 13.5 (CLT for the number of steps in the Euclidean algorithm; D. Hensley (1992)). There exists a constant $\sigma^* > 0$ (which has a very complicated definition) such that
$$\frac{N_n - \frac{12\log 2}{\pi^2}\log n}{\sigma^*\sqrt{\log n}} \Rightarrow N(0,1).$$

Hensley's theorem was in recent years significantly generalized, and the techniques extended, by Brigitte Vallée, a French mathematician. The fact that the average value of $N_n$ is approximately $(12\log 2/\pi^2)\log n$ was previously known from work of Heilbronn and Dixon in 1969-1970, using ideas dating back to Gauss, who discovered the probability distribution now called the Gauss measure. This is the probability distribution on $(0,1)$ with density $\frac{1}{\log 2}\cdot\frac{1}{1+x}$, which Gauss found (but did not prove!) describes the limiting distribution of the ratio of a pair of independent $U(0,1)$ random variables after many iterations of the division-with-remainder step in the Euclidean algorithm.
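The step count $N_n$ is straightforward to compute. This sketch reproduces the worked example above and estimates the average over random pairs (the range $n = 10^6$ and trial count are arbitrary illustrative choices):

```python
import math
import random

def euclid_steps(x, y):
    """Number of division-with-remainder steps the Euclidean algorithm takes on (x, y)."""
    steps = 0
    while y != 0:
        x, y = y, x % y
        steps += 1
    return steps

print(euclid_steps(58, 24))  # the example above: 4 steps

random.seed(4)
n, trials = 10**6, 5000
avg = sum(euclid_steps(random.randint(1, n), random.randint(1, n))
          for _ in range(trials)) / trials
print(avg, 12 * math.log(2) / math.pi**2 * math.log(n))  # Heilbronn-Dixon leading term
```

The empirical average sits near $(12\log 2/\pi^2)\log n \approx 11.6$, with the $O(1)$ correction and Monte Carlo noise accounting for the small discrepancy.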
Chapter 14: Additional topics

14.1 The Kolmogorov 0-1 law

In this section we prove a well-known and easy result due to Kolmogorov, the Kolmogorov 0-1 law. This will be useful in the next section.

Theorem 14.1 (Kolmogorov 0-1 law). Let $X_1, X_2, \ldots$ be a sequence of independent random variables in some probability space $(\Omega, \mathcal{F}, P)$. An event $A \in \mathcal{F}$ is called a tail event if
$$A \in \sigma(X_n, X_{n+1}, \ldots)$$
for all $n$. That is, one can tell whether $A$ occurred or not by looking at the sequence $X_1, X_2, \ldots$, and the occurrence of $A$ is unaffected by changing a finite number of the $X_k$'s. (Note that the set of tail events is a $\sigma$-algebra, called the tail $\sigma$-algebra of the sequence, and denoted $\mathcal{T}$; formally, $\mathcal{T} = \bigcap_{n\ge1} \sigma(X_n, X_{n+1}, \ldots)$.) Then for any tail event $A$ we have that $P(A) = 0$ or $P(A) = 1$.

Proof. It will be enough to prove the somewhat strange statement that the tail event $A$ is independent of itself, since in that case we'll have
$$P(A) = P(A \cap A) = P(A)P(A) = P(A)^2,$$
which would imply the claim, since the only solutions of the equation $x^2 = x$ are $x = 0$ and $x = 1$. The fact that a tail event is independent of itself is a consequence of the following two claims:

1. If $A \in \sigma(X_1, \ldots, X_n)$ and $B \in \sigma(X_{n+1}, X_{n+2}, \ldots)$ then $A$ and $B$ are independent. This follows by observing that $\bigcup_{k\ge1} \sigma(X_{n+1}, X_{n+2}, \ldots, X_{n+k})$ is a $\pi$-system generating $\sigma(X_{n+1}, X_{n+2}, \ldots)$, and for $B$ in the $\pi$-system the claim is trivial.

2. If $A \in \sigma(X_1, X_2, \ldots)$ and $B \in \mathcal{T}$ then $A$ and $B$ are independent. The fact that this is true when $A \in \sigma(X_1, \ldots, X_n)$ for some $n$ follows from claim 1. But $\bigcup_{n\ge1} \sigma(X_1, \ldots, X_n)$ is a $\pi$-system generating $\sigma(X_1, X_2, \ldots)$, so the same is true also for $A$ in the generated $\sigma$-algebra.

Example 14.2. The events $\{\lim_{n\to\infty} X_n \text{ exists}\}$ and $\{\sum_{n=1}^{\infty} X_n \text{ converges}\}$ are tail events.
14.2 The Kolmogorov three-series theorem
As a further application of the Lindeberg-Feller central limit theorem, we prove the Kol-
mogorov three-series theorem (Theorem 9.6).
Proof of Theorem 9.6. Denote Y
n
= X
n
1
[Xn[1
. Assume that the eponymous three series

n=1
P([X
n
[ > 1) =

n=1
P(X
n
,= Y
n
), (17)

n=1
E(Y
n
), (18)

n=1
V(Y
n
) (19)
all converge. Applying Theorem 9.5 to the random variables Y
n
E(Y
n
), from the assumption
that (19) converges we infer that the series

n=1
_
Y
n
E(Y
n
)
_
converges almost surely. Since
(18) converges, this also means that

n
Y
n
converges almost surely. Since (17) converges
we get using the rst Borel-Cantelli lemma that the event X
n
,= Y
n
i.o. has probability 0.
Outside this event, the series

n
X
n
converges if and only if

n
Y
n
converges, so

n
X
n
also converges almost surely.
Next, for the converse claim, assume that

n
X
n
converges with positive probability. By
the Kolmogorov 0-1 law, it therefore converges a.s. We need to show that the series (17),
(18), (19) converge.
First, note that (17) must converge, since if it doesnt, the second Borel-Cantelli lemma
implies that the event [X
n
[ > 1 i.o. has probability one,which would imply that

X
n
diverges almost surely.
Second, assume by contradiction that (19) diverges, or equivalently that

$$v_n := \sum_{k=1}^n V(Y_k) \to \infty \quad \text{as } n \to \infty.$$

We apply Theorem 13.2 to the triangular array

$$X_{n,k} = v_n^{-1/2}\big(Y_k - E Y_k\big).$$

Assumptions 1, 2 and 3 of the theorem are trivially satisfied (with $\sigma^2 = \sum_{k=1}^n V(X_{n,k}) = 1$ for all $n$). For assumption 4, note that $|X_{n,k}| \le 2 v_n^{-1/2}$, so $X_{n,k}^2 1_{\{|X_{n,k}| > \varepsilon\}} = 0$ for all $k$ if $v_n > 4/\varepsilon^2$, which holds for $n$ large enough. Thus, the assumptions of the theorem are valid and hence we obtain the conclusion that $S_n = \sum_{k=1}^n X_{n,k} \Rightarrow N(0,1)$ as $n \to \infty$. Define a sequence (of numbers, not random variables)

$$t_n = -v_n^{-1/2} \sum_{k=1}^n E Y_k = S_n - v_n^{-1/2} \sum_{k=1}^n Y_k,$$

and note that on the event $\{\sum_n X_n \text{ converges}\}$ (in which case clearly $\sum_n Y_n$ also converges), which we assumed has probability 1, $S_n - t_n = v_n^{-1/2} \sum_{k=1}^n Y_k \to 0$ as $n \to \infty$. This is easily seen to be in contradiction to the convergence in distribution of $S_n$ to a limiting distribution of a non-constant random variable. So we have shown that (19) converges.
Finally, having shown that (19) converges, we conclude from Theorem 9.5 that the series $\sum_n (Y_n - E(Y_n))$ converges almost surely. We already saw that on the event where $\sum_n X_n$ converges, which we assumed has positive probability, also $\sum_n Y_n$ converges, and therefore the series $\sum_n E(Y_n) = \sum_n \big( Y_n - (Y_n - E(Y_n)) \big)$ also converges.
14.3 The Poisson limit law

The Poisson distribution is the one-parameter family of distributions Poisson($\lambda$), $\lambda > 0$, where $X \sim \text{Poisson}(\lambda)$ is a discrete random variable satisfying

$$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

This distribution arises naturally as measuring the number of successful coin tossing experiments when one performs a large number of independent coin tosses with equal bias, as long as the expected number of successes remains bounded away from 0 and $\infty$. That is, if $X_1, X_2, \ldots, X_n$ denote i.i.d. Bernoulli random variables with bias $p = p(n)$ (which is a function of $n$), and $S_n = \sum_{k=1}^n X_k$, then if $n\,p(n) \to \lambda$ as $n \to \infty$, it is a simple exercise to check that

$$S_n \Rightarrow \text{Poisson}(\lambda). \tag{20}$$
As in the case of the Central Limit Theorem, it turns out that the i.i.d. condition can be significantly weakened, so in fact the Poisson distribution arises almost universally as the limit law for counting numbers of "rare events", i.e., sums of indicator functions of many independent or weakly correlated events each of which has a small chance of success, in the asymptotic regime in which the expected number of successes converges to a constant. For this reason, the Poisson distribution is considered an excellent model for many real-life phenomena, e.g.:

- The number of typographical errors in a web page or book chapter.
- The number of radioactive particles emitted from a chunk of radioactive material in a unit of time.
- The number of cars passing a point on a remote desert road in an hour.
- The number of incandescent light-bulbs burning out in your house each month.
- Etc. (try to think of more examples from science or your daily life...)

Our goal is to prove the following generalization of (20), analogous to the Lindeberg-Feller extension of the Central Limit Theorem. The technique used in the proof is also similar and uses characteristic functions.
Theorem 14.3. Let $(X_{n,m})_{1 \le m \le n}$ be a triangular array of random variables such that for each $n \ge 1$, the random variables $X_{n,1}, \ldots, X_{n,n}$ are independent, and each $X_{n,m}$ satisfies $P(X_{n,m} = 1) = p_{n,m}$, $P(X_{n,m} = 0) = 1 - p_{n,m}$. Denote $S_n = \sum_{m=1}^n X_{n,m}$. Assume that the biases $(p_{n,m})_{n,m}$ satisfy the following conditions:

(i) $E S_n = \sum_{m=1}^n p_{n,m} \to \lambda \in (0, \infty)$ as $n \to \infty$.

(ii) $\max_{1 \le m \le n} p_{n,m} \to 0$ as $n \to \infty$.

Then $S_n \Rightarrow \text{Poisson}(\lambda)$.
Proof. Let $\varphi_{n,m}(t) = \varphi_{X_{n,m}}(t) = E(\exp(itX_{n,m})) = (1 - p_{n,m}) + p_{n,m} e^{it}$ denote the characteristic function of $X_{n,m}$. The ch. f. of $S_n$ is then

$$\varphi_{S_n}(t) = \prod_{m=1}^n \varphi_{n,m}(t) = \prod_{m=1}^n \big( (1 - p_{n,m}) + p_{n,m} e^{it} \big).$$
Our goal is to show that $\varphi_{S_n}(t) \to \varphi_Z(t)$ where $Z \sim \text{Poisson}(\lambda)$. From a past homework computation we know that

$$\varphi_Z(t) = \sum_{k=0}^\infty e^{-\lambda} \frac{\lambda^k}{k!} e^{ikt} = \exp(\lambda(e^{it} - 1)) = \lim_{n\to\infty} \prod_{m=1}^n \exp\big(p_{n,m}(e^{it} - 1)\big) \quad \text{(by assumption (i))},$$

so let us investigate how closely $\prod_{m=1}^n \varphi_{n,m}(t)$ is approximated by $\prod_{m=1}^n \exp(p_{n,m}(e^{it} - 1))$. Note that for any $0 \le p \le 1$, we have

$$\big| \exp(p(e^{it} - 1)) \big| = \exp(p \,\mathrm{Re}(e^{it} - 1)) \le \exp(0) = 1,$$

$$\big| 1 - p + p e^{it} \big| = \big| 1 + p(e^{it} - 1) \big| \le 1 \quad \text{(a convex combination of } 1 \text{ and } e^{it}\text{)}.$$

We therefore get using the exercise below that

$$\Big| \prod_{m=1}^n \exp\big(p_{n,m}(e^{it} - 1)\big) - \prod_{m=1}^n \big(1 + p_{n,m}(e^{it} - 1)\big) \Big| \le \sum_{m=1}^n \Big| \exp\big(p_{n,m}(e^{it} - 1)\big) - \big(1 + p_{n,m}(e^{it} - 1)\big) \Big|$$

$$\le 10 \sum_{m=1}^n p_{n,m}^2 |e^{it} - 1|^2 \le 40 \Big( \max_{1 \le m \le n} p_{n,m} \Big) \sum_{m=1}^n p_{n,m} \xrightarrow[n\to\infty]{} 0.$$

This is exactly what was needed to finish the proof.
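Theorem 14.3, in its simplest instance (a Binomial$(n, \lambda/n)$ row, as in (20)), can be checked numerically. The sketch below is a Python illustration of my own, not part of the notes; it computes the total variation distance between Binomial$(n, \lambda/n)$ and Poisson$(\lambda)$ and watches it shrink as $n$ grows:

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """log of the Binomial(n, p) probability of k successes."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def tv_binom_poisson(n, lam):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    diff = 0.0
    poiss_mass = 0.0
    for k in range(n + 1):
        b = exp(log_binom_pmf(k, n, p))
        q = exp(k * log(lam) - lam - lgamma(k + 1))   # Poisson(lam) pmf at k
        diff += abs(b - q)
        poiss_mass += q
    # Poisson mass above n contributes the remaining |0 - q_k| terms.
    return 0.5 * (diff + (1.0 - poiss_mass))

d10, d100, d1000 = (tv_binom_poisson(n, 3.0) for n in (10, 100, 1000))
print(d10, d100, d1000)   # decreasing in n
```

Le Cam's inequality (not proved in these notes) bounds this distance by $n p^2 = \lambda^2/n$, consistent with the decay the script shows.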
Exercise 14.4. (i) Let $z_1, \ldots, z_n, w_1, \ldots, w_n$ be complex numbers such that $|z_m|, |w_m| \le 1$ for all $1 \le m \le n$. Prove that

$$\Big| \prod_{m=1}^n z_m - \prod_{m=1}^n w_m \Big| \le \sum_{m=1}^n |z_m - w_m|.$$

(ii) Prove that if $z$ is a complex number with $|z| \le 2$ then

$$|\exp(z) - (1 + z)| \le 10 |z|^2.$$

End of Part III
Exercises
1. (a) If $(\Omega, \mathcal{F}, P)$ is a probability space and $A, B \in \mathcal{F}$ are events such that $P(B) \ne 0$, the conditional probability of $A$ given $B$ is denoted $P(A|B)$ and defined by

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

Prove the total probability formula: if $A, B_1, B_2, \ldots, B_k \in \mathcal{F}$ are such that $\Omega$ is the disjoint union of $B_1, \ldots, B_k$ and $P(B_i) \ne 0$ for $1 \le i \le k$, then

$$P(A) = \sum_{i=1}^k P(B_i) P(A|B_i). \tag{TPF}$$

(b) An urn initially contains one white ball and one black ball. At each step of the experiment, a ball is drawn at random from the urn, then put back and another ball of the same color is added. Prove that the number of white balls that are in the urn after $N$ steps is a uniform random number in $\{1, 2, \ldots, N+1\}$. That is, the event that the number of white balls after step $N$ is equal to $k$ has probability $1/(N+1)$ for each $1 \le k \le N+1$. (Note: The idea is to use (TPF), but there is no need to be too formal about constructing the relevant probability space - you can assume an intuitive notion of probabilities.)
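For readers who want to check the claim of part (b) before proving it, here is a small Python computation of my own (not part of the notes) that iterates the total probability formula step by step, in exact rational arithmetic:

```python
from fractions import Fraction

def white_distribution(N):
    """P(number of white balls after N steps = k), k = 1..N+1, computed
    exactly by propagating probabilities one draw at a time (TPF)."""
    dist = {1: Fraction(1)}          # start: 1 white ball (and 1 black)
    for step in range(N):
        total = step + 2             # balls in the urn before this draw
        new = {}
        for w, pr in dist.items():
            # draw a white ball (prob w/total): white count increases by one
            new[w + 1] = new.get(w + 1, 0) + pr * Fraction(w, total)
            # draw a black ball: white count is unchanged
            new[w] = new.get(w, 0) + pr * Fraction(total - w, total)
        dist = new
    return dist

d = white_distribution(10)
print(d)   # every value is 1/11, matching the claim
```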
2. If $\Omega = \{1, 2, 3\}$, list all the possible $\sigma$-algebras of subsets of $\Omega$.
3. Let $(\Omega, \mathcal{F})$ be a measurable space. A pre-probability measure is a function $P : \mathcal{F} \to [0,1]$ that satisfies

$$P(\emptyset) = 0, \qquad P(\Omega) = 1. \tag{P1}$$

If a pre-probability measure $P$ satisfies

$$A_1, A_2, \ldots \in \mathcal{F} \text{ are pairwise disjoint} \implies P\Big( \bigcup_{n=1}^\infty A_n \Big) = \sum_{n=1}^\infty P(A_n), \tag{P2}$$

then we say that it is $\sigma$-additive. If it satisfies

$$A_1, \ldots, A_n \in \mathcal{F} \text{ are pairwise disjoint} \implies P\Big( \bigcup_{k=1}^n A_k \Big) = \sum_{k=1}^n P(A_k), \tag{P3}$$

then we say that it is additive. Recall that we defined a probability measure to be a pre-probability measure that is $\sigma$-additive.

We say that a pre-probability measure satisfies the continuity properties if it satisfies

$$(A_n)_{n=1}^\infty \subseteq \mathcal{F},\ A_n \subseteq A_{n+1}\ \forall n \implies P\Big( \bigcup_{n=1}^\infty A_n \Big) = \lim_{n\to\infty} P(A_n), \tag{CONT1}$$

$$(A_n)_{n=1}^\infty \subseteq \mathcal{F},\ A_n \supseteq A_{n+1}\ \forall n \implies P\Big( \bigcap_{n=1}^\infty A_n \Big) = \lim_{n\to\infty} P(A_n). \tag{CONT2}$$

Prove that a probability measure satisfies the continuity properties, and that an additive pre-probability measure that satisfies the first continuity property (CONT1) (continuity from below) is $\sigma$-additive, and is therefore a probability measure.
4. (a) A coin has some bias $p \in (0,1)$, so when tossed it comes up Heads with probability $p$, or Tails with probability $1-p$. Suppose the coin is tossed $N$ times independently, and let $A_{N,k}$ denote the event that the result came up Heads exactly $k$ times. Refresh your memory concerning why the Binomial Distribution Formula, which says that

$$P(A_{N,k}) = \binom{N}{k} p^k (1-p)^{N-k}, \tag{BINOM}$$

is true. You may submit a short written explanation to test your understanding, but it is not required.

(b) A group of $N$ prisoners is locked up in a prison, each in a separate cell with no ability to communicate with the other prisoners. Each cell contains a mysterious on/off electrical switch. One evening the warden visits each of the prisoners and presents them with the following dilemma: During the night each prisoner must choose whether to leave his switch in the "on" or "off" position. If at midnight exactly one of the switches is in the "on" position, all the prisoners will be set free in the morning; otherwise they will all be executed!

The prisoners cannot coordinate their actions, but they are all rational, know calculus and probability theory, and each is equipped with a random number generator. Find the strategy that the prisoners will take to maximize their chance of survival, and compute what that chance is, as a function of $N$ and in the limit when $N$ is very large. For extra fun, try to guess in advance how big or small you expect the survival likelihood to be, and see how your guess measures up to the actual result.
5. (a) Let $\Omega$ be a set, and let $\mathcal{S} = \{\mathcal{F}_i\}_{i \in I}$ be some collection of $\sigma$-algebras of subsets of $\Omega$, indexed by some index set $I$ (note that $\mathcal{S}$ is a set of subsets of subsets of $\Omega$ - try to avoid dizziness!). Prove that the intersection of all the $\mathcal{F}_i$'s (i.e., the collection of subsets of $\Omega$ that are elements of all the $\mathcal{F}_i$'s) is also a $\sigma$-algebra.

(b) Let $\Omega$ be a set, and let $\mathcal{A}$ be a collection of subsets of $\Omega$. Prove that there exists a unique $\sigma$-algebra $\sigma(\mathcal{A})$ of subsets of $\Omega$ that satisfies the following two properties:

(i) $\mathcal{A} \subseteq \sigma(\mathcal{A})$ (in words, $\sigma(\mathcal{A})$ contains all the elements of $\mathcal{A}$).

(ii) $\sigma(\mathcal{A})$ is the minimal $\sigma$-algebra satisfying property (i) above, in the sense that if $\mathcal{F}$ is any other $\sigma$-algebra that contains all the elements of $\mathcal{A}$, then $\sigma(\mathcal{A}) \subseteq \mathcal{F}$.

The $\sigma$-algebra $\sigma(\mathcal{A})$ is called the $\sigma$-algebra generated by $\mathcal{A}$.

Hint for (b). Let $(\mathcal{F}_i)_{i \in I}$ be the collection of all $\sigma$-algebras of subsets of $\Omega$ that contain $\mathcal{A}$. This is a non-empty collection, since it contains for example $\mathcal{P}(\Omega)$, the set of all subsets of $\Omega$. Any $\sigma$-algebra $\sigma(\mathcal{A})$ that satisfies the two properties above is necessarily a subset of any of the $\mathcal{F}_i$'s, hence it is also contained in the intersection of all the $\mathcal{F}_i$'s, which is a $\sigma$-algebra by part (a) of the question.
6. (a) Let $X$ be a random variable with distribution function $F_X$ and piecewise continuous density function $f_X$. Let $[a,b] \subseteq \mathbb{R}$ be an interval (possibly infinite) such that

$$P(X \in [a,b]) = 1,$$

and let $g : [a,b] \to \mathbb{R}$ be a monotone (strictly) increasing and differentiable function. Prove that the random variable $Y = g(X)$ (this is the function on $\Omega$ defined by $Y(\omega) = g(X(\omega))$, in other words the composition of the two functions $g$ and $X$) has density function

$$f_Y(x) = \begin{cases} \dfrac{f_X(g^{-1}(x))}{g'(g^{-1}(x))} & x \in (g(a), g(b)), \\ 0 & \text{otherwise.} \end{cases}$$

(b) If $\lambda > 0$, we say that a random variable $X$ has the exponential distribution with parameter $\lambda$ if

$$F_X(x) = \begin{cases} 0 & x < 0, \\ 1 - e^{-\lambda x} & x \ge 0, \end{cases}$$

and denote this $X \sim \text{Exp}(\lambda)$. Find an algorithm to produce a random variable with $\text{Exp}(\lambda)$ distribution using a random number generator that produces uniform random numbers in $(0,1)$. In other words, if $U \sim U(0,1)$, find a function $g : (0,1) \to \mathbb{R}$ such that the random variable $X = g(U)$ has distribution $\text{Exp}(\lambda)$.

(c) We say that a non-negative random variable $X \ge 0$ has the lack of memory property if it satisfies

$$P(X \ge t \mid X \ge s) = P(X \ge t - s) \quad \text{for all } 0 < s < t.$$

Prove that exponential random variables have the lack of memory property.

(d) Prove that any non-negative random variable that has the lack of memory property has the exponential distribution with some parameter $\lambda > 0$. (This is easier if one assumes that the function $G(x) = P(X \ge x)$ is differentiable on $[0, \infty)$, so you can make this assumption if you fail to find a more general argument.)
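(Spoiler for part (b).) One standard approach is the inverse-c.d.f. method: solving $F_X(x) = u$ suggests the candidate $g(u) = -\log(1-u)/\lambda$. The Python sketch below (my own, not part of the notes) checks its tail probabilities on a deterministic grid of $u$-values:

```python
from math import log, exp

def exp_from_uniform(u, lam):
    """Candidate g(u) = -log(1 - u)/lam, obtained by inverting the
    c.d.f. 1 - exp(-lam * x) of the Exp(lam) distribution."""
    return -log(1.0 - u) / lam

lam, N = 2.0, 100_000
# Deterministic check on the grid u = (i + 1/2)/N: the fraction of grid
# points with g(u) > t should approximate P(X > t) = exp(-lam * t).
xs = [exp_from_uniform((i + 0.5) / N, lam) for i in range(N)]
for t in (0.5, 1.0, 2.0):
    frac = sum(x > t for x in xs) / N
    print(t, frac, exp(-lam * t))
```

Feeding genuinely uniform random numbers into the same `g` produces Exp($\lambda$) samples, which is exactly what the exercise asks for.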
7. (a) Prove the inclusion-exclusion principle: If $A_1, \ldots, A_n$ are events in a probability space $(\Omega, \mathcal{F}, P)$, then

$$P\Big( \bigcup_{k=1}^n A_k \Big) = s_1 - s_2 + s_3 - s_4 + s_5 - \ldots + (-1)^{n-1} s_n,$$

where

$$s_1 = P(A_1) + P(A_2) + \ldots + P(A_n) = \sum_{k=1}^n P(A_k),$$

$$s_2 = \sum_{1 \le k_1 < k_2 \le n} P(A_{k_1} \cap A_{k_2}),$$

$$s_3 = \sum_{1 \le k_1 < k_2 < k_3 \le n} P(A_{k_1} \cap A_{k_2} \cap A_{k_3}),$$

$$\vdots$$

$$s_d = \sum_{1 \le k_1 < \ldots < k_d \le n} P(A_{k_1} \cap A_{k_2} \cap \ldots \cap A_{k_d}),$$

$$\vdots$$

$$s_n = P(A_1 \cap A_2 \cap \ldots \cap A_n).$$

(b) $N$ letters addressed to different people are inserted at random into $N$ envelopes that are labelled with the names and addresses of the $N$ recipients, such that all $N!$ possible matchings between the letters and envelopes are equally likely. What is the probability of the event that no letter will arrive at its intended destination? Compute this probability for any $N$, and in the limit when $N \to \infty$.
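(Spoiler warning for 7(b).) The inclusion-exclusion computation collapses to $\sum_{k=0}^N (-1)^k / k!$. The Python check below (my own code, not from the notes) confirms this against brute-force enumeration for small $N$ and exhibits the limit:

```python
from itertools import permutations
from math import factorial, exp

def p_no_fixed_point_exact(N):
    """Probability that a uniform permutation of N letters has no fixed
    point, by brute-force enumeration of all N! matchings."""
    hits = sum(all(sigma[i] != i for i in range(N))
               for sigma in permutations(range(N)))
    return hits / factorial(N)

def p_no_fixed_point_formula(N):
    """The inclusion-exclusion answer: sum_{k=0}^{N} (-1)^k / k!."""
    return sum((-1) ** k / factorial(k) for k in range(N + 1))

for N in range(1, 8):
    assert abs(p_no_fixed_point_exact(N) - p_no_fixed_point_formula(N)) < 1e-12
print(p_no_fixed_point_formula(12), 1 / exp(1))   # the limit is 1/e
```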
8. Let $F$ be the distribution function

$$F(x) = \begin{cases} 0 & x < 0, \\ \frac{1}{3} + \frac{1}{6} x & 0 \le x < 1, \\ \frac{1}{2} & 1 \le x < 2, \\ 1 - \frac{1}{4} e^{-2x} & x \ge 2. \end{cases}$$

Compute the lower and upper quantile functions of $F$, defined by

$$X^-(p) = \sup\{x : F(x) < p\}, \qquad X^+(p) = \inf\{x : F(x) > p\}, \qquad (0 < p < 1).$$

A recommended way is to plot $F$ on paper and then figure out the quantiles by eyeballing. Of course, the answer should be spelled out in precise formulas.
9. A drunken archer shoots at a target hanging on a wall 1 unit of distance away. Since he is drunk, his arrow ends up going in a random direction at an angle chosen uniformly in $(-\pi/2, \pi/2)$ (an angle of 0 means he will hit the target precisely) until it hits the wall. Ignoring gravity and the third dimension, compute the distribution function (and density function if it exists) of the random distance from the hitting point of the arrow to the target.
10. (a) Let $(\Omega_1, \mathcal{F}_1, P_1)$ be a probability space, let $(\Omega_2, \mathcal{F}_2)$ be a measurable space, and let $f : \Omega_1 \to \Omega_2$ be a measurable function. Verify that the function $P_2 : \mathcal{F}_2 \to [0,1]$ defined by

$$P_2(A) = P_1(f^{-1}(A))$$

is a probability measure. This probability measure is called the push-forward measure of $P_1$ under $f$.

(b) For a real number $x$, denote the integer part of $x$ by

$$\lfloor x \rfloor = \sup\{n \in \mathbb{Z} : n \le x\},$$

and denote the fractional part of $x$ by

$$\{x\} = x - \lfloor x \rfloor.$$

Let $((0,1), \mathcal{B}, P)$ be the unit interval with the $\sigma$-algebra of Borel subsets and the Lebesgue probability measure, corresponding to the experiment of choosing a uniform random number in $(0,1)$. Define a sequence of functions $R_1, R_2, \ldots : (0,1) \to \mathbb{R}$ by

$$R_n(x) = \begin{cases} 0 & 0 \le \{2^{n-1} x\} < 1/2, \\ 1 & 1/2 \le \{2^{n-1} x\} < 1. \end{cases}$$

For any $n \in \mathbb{N}$ and $a_1, a_2, \ldots, a_n \in \{0,1\}$, denote by $B_n(a_1, \ldots, a_n)$ the set

$$B_n(a_1, \ldots, a_n) = \{x \in (0,1) : R_1(x) = a_1, R_2(x) = a_2, \ldots, R_n(x) = a_n\}.$$

Find a good explicit description for this set ("the set of all $x$'s such that ..."), and deduce from it that

$$P(B_n(a_1, \ldots, a_n)) = \frac{1}{2^n}.$$

(c) Define a function $f : (0,1) \to \{0,1\}^{\mathbb{N}}$ by

$$f(x) = (R_1(x), R_2(x), R_3(x), \ldots).$$

Prove that $f$ is a measurable function when the space $\{0,1\}^{\mathbb{N}}$ is equipped with the $\sigma$-algebra generated by the sets

$$A_n(1) = \{(x_1, x_2, \ldots) \in \{0,1\}^{\mathbb{N}} : x_n = 1\}.$$

(d) Prove that the push-forward of Lebesgue measure under $f$ is the probability measure corresponding to the random experiment of an infinite sequence of fair coin tosses.
11. Let $X$ be an exponential r.v. with parameter $\lambda$, i.e., $F_X(x) = (1 - e^{-\lambda x}) 1_{[0,\infty)}(x)$. Define random variables

$$Y = \lfloor X \rfloor := \sup\{n \in \mathbb{Z} : n \le X\} \quad \text{(the integer part of } X\text{)},$$

$$Z = \{X\} := X - \lfloor X \rfloor \quad \text{(the fractional part of } X\text{)}.$$

(a) Compute the (1-dimensional) distributions of $Y$ and $Z$ (in the case of $Y$, since it's a discrete random variable it is most convenient to describe the distribution by giving the individual probabilities $P(Y = n)$, $n = 0, 1, 2, \ldots$; for $Z$ one should compute either the distribution function or density function).

(b) Show that $Y$ and $Z$ are independent. (Hint: check that $P(Y = n, Z \le t) = P(Y = n) P(Z \le t)$ for all $n$ and $t$.)
12. (a) Let $X, Y$ be independent r.v.'s. Define $U = \min(X,Y)$, $V = \max(X,Y)$. Find expressions for the distribution functions $F_U$ and $F_V$ in terms of the distribution functions of $X$ and $Y$.

(b) Assume that $X \sim \text{Exp}(\lambda)$, $Y \sim \text{Exp}(\mu)$ (and are independent as before). Prove that $\min(X,Y)$ has distribution $\text{Exp}(\lambda + \mu)$. Try to give an intuitive explanation in terms of the kind of real-life phenomena that the exponential distribution is intended to model (e.g., measuring the time for a light-bulb to burn out, or for a radioactive particle to be emitted from a chunk of radioactive material).

(c) Let $X_1, X_2, \ldots$ be a sequence of independent r.v.'s, all of them having distribution $\text{Exp}(1)$. For each $n \ge 1$ denote

$$M_n = \max(X_1, X_2, \ldots, X_n) - \log n.$$

Compute for each $n$ the distribution function of $M_n$, and find the limit (if it exists)

$$F(x) = \lim_{n\to\infty} F_{M_n}(x).$$
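(Spoiler for 12(c).) A direct computation gives $F_{M_n}(x) = (1 - e^{-x}/n)^n$ for $x \ge -\log n$, which suggests the limit $\exp(-e^{-x})$ (the Gumbel distribution). A quick numerical check in Python (my own, not part of the notes):

```python
from math import exp

def F_Mn(x, n):
    """Distribution function of M_n = max(X_1..X_n) - log n for i.i.d. Exp(1):
    P(M_n <= x) = (1 - e^{-x}/n)^n, valid when x + log n >= 0."""
    return (1.0 - exp(-x) / n) ** n

def F_limit(x):
    """Candidate limit: the Gumbel distribution function exp(-e^{-x})."""
    return exp(-exp(-x))

for x in (-1.0, 0.0, 1.0, 3.0):
    print(x, F_Mn(x, 10), F_Mn(x, 10_000), F_limit(x))
```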
13. If $X, Y$ are r.v.'s with a joint density $f_{X,Y}$, the identity

$$P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\, dx\, dy$$

holds for all reasonable sets $A \subseteq \mathbb{R}^2$ (in fact, for all Borel-measurable sets, but that requires knowing what that integral means for a set such as $\mathbb{R}^2 \setminus \mathbb{Q}^2$...). In particular, if $X, Y$ are independent and have respective densities $f_X$ and $f_Y$, so $f_{X,Y}(x,y) = f_X(x) f_Y(y)$, then

$$F_{X+Y}(t) = P(X + Y \le t) = \int_{-\infty}^\infty \int_{-\infty}^{t-x} f_X(x) f_Y(y)\, dy\, dx.$$

Differentiating with respect to $t$ gives (assuming without justification that it is allowed to differentiate under the integral):

$$f_{X+Y}(t) = \int_{-\infty}^\infty f_X(x) f_Y(t-x)\, dx.$$

Use this formula to compute the distribution of $X+Y$ when $X$ and $Y$ are independent r.v.'s with the following (pairs of) distributions:

(a) $X \sim U[0,1]$, $Y \sim U[0,2]$.

(b) $X \sim \text{Exp}(1)$, $Y \sim \text{Exp}(1)$.

(c) $X \sim \text{Exp}(1)$, $Y \sim \text{Exp}(1)$.
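The convolution formula can also be evaluated numerically when a closed form is not obvious. The sketch below (Python, my own; midpoint-rule quadrature) checks case (b): for $X, Y \sim \text{Exp}(1)$ the integral collapses to $\int_0^t e^{-x} e^{-(t-x)}\,dx = t e^{-t}$, the Gamma(2,1) density.

```python
from math import exp

def conv_density(t, f, g, lo, hi, steps=100_000):
    """Numerically evaluate (f * g)(t) = integral of f(x) g(t - x) dx over
    [lo, hi], using the midpoint rule."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) * g(t - (lo + (i + 0.5) * h))
               for i in range(steps)) * h

f_exp = lambda x: exp(-x) if x >= 0 else 0.0   # Exp(1) density

for t in (0.5, 1.0, 2.5):
    # the exact answer t * e^{-t} alongside the numerical convolution
    print(t, conv_density(t, f_exp, f_exp, 0.0, t), t * exp(-t))
```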
14. (a) Let $(A_n)_{n=1}^\infty$ be a sequence of events in a probability space. Show that

$$1_{\limsup A_n} = \limsup_{n\to\infty} 1_{A_n}.$$

(The lim-sup on the left refers to the lim-sup operation on events; on the right it refers to the lim-sup of a sequence of functions; the identity is an identity of real-valued functions on $\Omega$, i.e., should be satisfied for each individual point $\omega$ in the sample space). Similarly, show (either separately or by relying on the first claim) that

$$1_{\liminf A_n} = \liminf_{n\to\infty} 1_{A_n}.$$

(b) Let $U$ be a uniform random variable in $(0,1)$. For each $n \ge 1$ define an event $A_n$ by

$$A_n = \{U < 1/n\}.$$

Note that $\sum_{n=1}^\infty P(A_n) = \infty$. However, compute $P(A_n \text{ i.o.})$ and show that the conclusion of the second Borel-Cantelli lemma does not hold (of course, one of the assumptions of the lemma also doesn't hold, so there's no contradiction).
15. If $P, Q$ are two probability measures on a measurable space $(\Omega, \mathcal{F})$, we say that $P$ is absolutely continuous with respect to $Q$, and denote this $P \ll Q$, if for any $A \in \mathcal{F}$, if $Q(A) = 0$ then $P(A) = 0$.

Prove that $P \ll Q$ if and only if for any $\varepsilon > 0$ there exists a $\delta > 0$ such that if $A \in \mathcal{F}$ and $Q(A) < \delta$ then $P(A) < \varepsilon$.

Hint. Apply a certain famous lemma.

Note. The intuitive meaning of the relation $P \ll Q$ is as follows: suppose there is a probabilistic experiment, and we are told that one of the measures $P$ or $Q$ governs the statistical behavior of the outcome, but we don't know which one. (This is a situation that arises frequently in real-life applications of probability and statistics.) All we can do is perform the experiment, observe the result, and make a guess. If $P \ll Q$, any event which is observable with positive probability according to $P$ also has positive $Q$-probability, so we can never rule out $Q$ as the correct measure, although we may get an event with $Q(A) > 0$ and $P(A) = 0$ that enables us to rule out $P$. If we also have the symmetric relation $Q \ll P$, then we can't rule out either of the measures.
16. A function $\varphi : (a,b) \to \mathbb{R}$ is called convex if for any $x, y \in (a,b)$ and $\lambda \in [0,1]$ we have

$$\varphi(\lambda x + (1-\lambda) y) \le \lambda \varphi(x) + (1-\lambda) \varphi(y).$$

(a) Prove that an equivalent condition for $\varphi$ to be convex is that for any $x < z < y$ in $(a,b)$ we have

$$\frac{\varphi(z) - \varphi(x)}{z - x} \le \frac{\varphi(y) - \varphi(z)}{y - z}.$$

Deduce using the mean value theorem that if $\varphi$ is twice continuously differentiable and satisfies $\varphi'' \ge 0$ then it is convex.

(b) Prove Jensen's inequality, which says that if $X$ is a random variable such that $P(X \in (a,b)) = 1$ and $\varphi : (a,b) \to \mathbb{R}$ is convex, then

$$\varphi(EX) \le E(\varphi(X)).$$

Hint. Start by proving the following property of a convex function: If $\varphi$ is convex then at any point $x_0 \in (a,b)$, $\varphi$ has a supporting line, that is, a linear function $y(x) = ax + b$ such that $y(x_0) = \varphi(x_0)$ and such that $\varphi(x) \ge y(x)$ for all $x \in (a,b)$ (to prove its existence, use the characterization of convexity from part (a) to show that the left-sided derivative of $\varphi$ at $x_0$ is less than or equal to the right-sided derivative at $x_0$; the supporting line is a line passing through the point $(x_0, \varphi(x_0))$ whose slope lies between these two numbers). Now take the supporting line function at $x_0 = EX$ and see what happens.
17. If $X$ is a random variable satisfying $a \le X \le b$, prove that

$$V(X) \le \frac{(b-a)^2}{4},$$

and identify when equality holds.
18. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. (independent and identically distributed) random variables with distribution $U(0,1)$. Define events $A_1, A_2, \ldots$ by

$$A_n = \{X_n = \max(X_1, X_2, \ldots, X_n)\}$$

(if $A_n$ occurred, we say that $n$ is a record time).

(a) Prove that $A_1, A_2, \ldots$ are independent events.

Hint. For each $n \ge 1$, let $\pi_n$ be the random permutation of $(1, 2, \ldots, n)$ obtained by forgetting the values of $(X_1, \ldots, X_n)$ and only retaining their respective order. In other words, define

$$\pi_n(k) = \#\{1 \le j \le n : X_j \le X_k\}.$$

By considering the joint density $f_{X_1,\ldots,X_n}$ (a uniform density on the $n$-dimensional unit cube), show that $\pi_n$ is a uniformly random permutation of $n$ elements, i.e. $P(\pi_n = \sigma) = 1/n!$ for any permutation $\sigma \in S_n$. Deduce that the event $A_n = \{\pi_n(n) = n\}$ is independent of $\pi_{n-1}$ and therefore is independent of the previous events $(A_1, \ldots, A_{n-1})$, which are all determined by $\pi_{n-1}$.

(b) Define

$$R_n = \sum_{k=1}^n 1_{A_k} = \#\{1 \le k \le n : k \text{ is a record time}\}, \qquad (n = 1, 2, \ldots).$$

Compute $E(R_n)$ and $V(R_n)$. Deduce that if $(m_n)_{n=1}^\infty$ is a sequence of positive numbers such that $m_n \to \infty$, however slowly, then the number $R_n$ of record times up to time $n$ satisfies

$$P\Big( |R_n - \log n| > m_n \sqrt{\log n} \Big) \xrightarrow[n\to\infty]{} 0.$$
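Part (a) implies that $R_n$ is a sum of independent Bernoulli$(1/k)$ indicators, so $E(R_n) = \sum_{k=1}^n \frac{1}{k}$ and $V(R_n) = \sum_{k=1}^n \frac{1}{k}\big(1 - \frac{1}{k}\big)$. The exact Python cross-check below (my own, not part of the notes) compares this with brute-force averaging over all $n!$ equally likely relative orders:

```python
from fractions import Fraction
from itertools import permutations

def record_moments(n):
    """E(R_n), V(R_n) assuming the indicators 1_{A_k} are independent
    Bernoulli(1/k) variables (P(A_k) = 1/k by symmetry of the k! orderings)."""
    e = sum(Fraction(1, k) for k in range(1, n + 1))
    v = sum(Fraction(1, k) * (1 - Fraction(1, k)) for k in range(1, n + 1))
    return e, v

def record_moments_bruteforce(n):
    """The same moments by averaging the record count over all n! orders."""
    e = v = Fraction(0)
    perms = list(permutations(range(n)))
    for sigma in perms:
        r = sum(sigma[k] == max(sigma[: k + 1]) for k in range(n))
        e += r
        v += r * r
    e /= len(perms)
    v = v / len(perms) - e * e
    return e, v

assert record_moments(6) == record_moments_bruteforce(6)
print(record_moments(6))   # E(R_6) is the harmonic number H_6 = 49/20
```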
19. Compute $E(X)$ and $V(X)$ when $X$ is a random variable having each of the following distributions:

(a) $X \sim \text{Binomial}(n, p)$.

(b) $X \sim \text{Poisson}(\lambda)$, i.e., $P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, $(k = 0, 1, 2, \ldots)$.

(c) $X \sim \text{Geom}(p)$, i.e., $P(X = k) = p(1-p)^{k-1}$, $(k = 1, 2, \ldots)$.

(d) $X \sim U\{1, 2, \ldots, n\}$ (the discrete uniform distribution on $\{1, 2, \ldots, n\}$).

(e) $X \sim U(a, b)$ (the uniform distribution on the interval $(a,b)$).

(f) $X \sim \text{Exp}(\lambda)$.
20. (a) If $X, Y$ are independent r.v.'s taking values in $\mathbb{Z}$, show that

$$P(X + Y = n) = \sum_{k=-\infty}^{\infty} P(X = k) P(Y = n - k) \qquad (n \in \mathbb{Z})$$

(compare this formula with the convolution formula in the case of r.v.'s with density).

(b) Use this to show that if $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$ are independent then $X + Y \sim \text{Poisson}(\lambda + \mu)$. (Recall that for a parameter $\lambda > 0$, we say that $X \sim \text{Poisson}(\lambda)$ if $P(X = k) = e^{-\lambda} \lambda^k / k!$ for $k = 0, 1, 2, \ldots$).

(c) Use the same discrete convolution formula to prove directly that if $X \sim \text{Bin}(n, p)$ and $Y \sim \text{Bin}(m, p)$ are independent then $X + Y \sim \text{Bin}(n+m, p)$. You may make use of the combinatorial identity (known as the Vandermonde identity or Chu-Vandermonde identity)

$$\sum_{j=0}^k \binom{n}{j} \binom{m}{k-j} = \binom{n+m}{k}, \qquad (n, m \ge 0,\ 0 \le k \le n+m).$$

As a bonus, try to find a direct combinatorial proof for this identity. An amusing version of the answer can be found at:
https://en.wikipedia.org/wiki/Vandermonde%27s_identity
21. Prove that if $X$ is a random variable that is independent of itself, then $X$ is a.s. constant, i.e., there is a constant $c \in \mathbb{R}$ such that $P(X = c) = 1$.
22. (a) If $X \ge 0$ is a nonnegative r.v. with distribution function $F$, show that

$$E(X) = \int_0^\infty P(X \ge x)\, dx.$$

(b) Prove that if $X_1, X_2, \ldots$ is a sequence of independent and identically distributed (i.i.d.) r.v.'s, then

$$P(|X_n| \ge n \text{ i.o.}) = \begin{cases} 0 & \text{if } E|X_1| < \infty, \\ 1 & \text{if } E|X_1| = \infty. \end{cases}$$

(c) Deduce the following converse to the Strong Law of Large Numbers in the case of undefined expectations: If $X_1, X_2, \ldots$ are i.i.d. and $EX_1$ is undefined (meaning that $EX_1^+ = EX_1^- = \infty$) then

$$P\Big( \lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^n X_k \text{ does not exist} \Big) = 1.$$
23. Let $X$ be a r.v. with finite variance, and define a function $M(t) = E|X - t|$, the mean absolute deviation of $X$ from $t$. The goal of this question is to show that the function $M(t)$, like its easier to understand and better-behaved cousin, $E(X-t)^2$ (the "moment of inertia" around $t$, which by the Huygens-Steiner theorem is simply a parabola in $t$, taking its minimum value of $V(X)$ at $t = EX$), also has some unexpectedly nice properties.

(a) Prove that $M(t) \ge |t - EX|$.

(b) Prove that $M(t)$ is a convex function.

(c) Prove that

$$\int_{-\infty}^{\infty} \big( M(t) - |t - EX| \big)\, dt = V(X)$$

(see hints below). Deduce in particular that $M(t) - |t - EX| \xrightarrow[t\to\pm\infty]{} 0$ (again under the assumption that $V(X) < \infty$). If it helps, you may assume that $X$ has a density $f_X$.

(d) Prove that if $t_0$ is a (not necessarily unique) minimum point of $M(t)$, then $t_0$ is a median (that is, a 0.5-quantile) of $X$.

(e) Optionally, draw (or, at least, imagine) a diagram showing the graphs of the two functions $M(t)$ and $|t - EX|$ illustrating schematically the facts (a)-(d) above.

Hints. For (c), assume first (without loss of generality - why?) that $EX = 0$. Divide the integral into two integrals, on the positive real axis and the negative real axis. For each of the two integrals, by decomposing $|X - t|$ into a sum of its positive and negative parts and using the fact that $EX = 0$ in a clever way, show that one may replace the integrand $(E|X - t| - |t|)$ by a constant multiple of either $E(X - t)^+$ or $E(X - t)^-$, and proceed from there.

For (d), first, develop your intuition by plotting the function $M(t)$ in a couple of cases, for example when $X \sim \text{Binom}(1, 1/2)$ and when $X \sim \text{Binom}(2, 1/2)$. Second, if $t_0 < t_1$, plot the graph of the function $x \mapsto \frac{|x - t_1| - |x - t_0|}{t_1 - t_0}$, and deduce from this a formula for $M'(t_0+)$ and (by considering $t_1 < t_0$ instead) a similar formula for $M'(t_0-)$, the right- and left-sided derivatives of $M$ at $t_0$, respectively. On the other hand, think how the condition that $t_0$ is a minimum point of $M(t)$ can be expressed in terms of these one-sided derivatives.
24. (a) Let $\Gamma(t)$ denote the Euler gamma function, defined by

$$\Gamma(t) = \int_0^\infty e^{-x} x^{t-1}\, dx, \qquad (t > 0).$$

Show that the special value $\Gamma(1/2) = \sqrt{\pi}$ of the gamma function is equivalent to the integral evaluation $\sqrt{2\pi} = \int_{-\infty}^{\infty} e^{-x^2/2}\, dx$ (which is equivalent to the standard normal density being a density function).

(b) Prove that the Euler gamma function satisfies for all $t > 0$ the identity

$$\Gamma(t + 1) = t\, \Gamma(t).$$

(This identity immediately implies the fact that $\Gamma(n+1) = n!$ for integer $n \ge 0$.)

(c) Find a formula for the values of $\Gamma(\cdot)$ at half-integers, that is,

$$\Gamma\Big( n + \frac{1}{2} \Big) = \ ?, \qquad (n \ge 0).$$
25. Compute $E X^n$ when $n \ge 0$ is an integer and $X$ has each of the following distributions:

(a) $X \sim U(a, b)$.

(b) $X \sim \text{Exp}(\lambda)$.

(c) $X \sim \text{Gamma}(\alpha, \lambda)$, i.e. $f_X(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} e^{-\lambda x} x^{\alpha - 1}$, $(x > 0)$.

(d) $X \sim \text{Beta}(a, b)$, i.e. $f_X(x) = \frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}$, $(0 < x < 1)$, where

$$B(a, b) = \int_0^1 u^{a-1} (1-u)^{b-1}\, du = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

is the Euler beta function.

(e) $X \sim N(0, 1)$. In this case, identify $E X^n$ combinatorially as the number of matchings of a set of size $n$ into pairs (for example, if a university dorm has only 2-person housing units, then when $n$ is even this is the number of ways to divide $n$ students into pairs of roommates; no importance is given to the ordering of the pairs).

(f) $X \sim N(1, 1)$. In this case, identify $E X^n$ combinatorially as the number of involutions (permutations which are self-inverse) of a set of $n$ elements. To count the involutions, it is a good idea to divide them into classes according to how many fixed points they have. (Note: the expression for $E(X^n)$ may not have a very simple form.)
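For part (e), the even moments of $N(0,1)$ are the double factorials $(n-1)!! = 1 \cdot 3 \cdots (n-1)$, which indeed count pairings of $n$ items (the first item chooses its partner in $n-1$ ways, the next unpaired item in $n-3$ ways, and so on). A numerical cross-check in Python (my own; the quadrature parameters are arbitrary):

```python
from math import exp, pi, sqrt

def normal_moment_numeric(n, lim=10.0, steps=50_000):
    """E X^n for X ~ N(0,1), by midpoint-rule integration of x^n phi(x)
    over [-lim, lim]; the tail beyond lim is negligible for small n."""
    h = 2 * lim / steps
    total = 0.0
    for i in range(steps):
        x = -lim + (i + 0.5) * h
        total += x ** n * exp(-x * x / 2)
    return total * h / sqrt(2 * pi)

def matchings(n):
    """Number of ways to pair up n labeled items: (n-1)!! for even n, else 0."""
    if n % 2:
        return 0
    out = 1
    for k in range(1, n, 2):
        out *= k
    return out

for n in range(9):
    print(n, normal_moment_numeric(n), matchings(n))
```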
26. Let $f : [0,1] \to \mathbb{R}$ be a continuous function. Prove that

$$\int_0^1 \int_0^1 \ldots \int_0^1 f\Big( \frac{x_1 + x_2 + \ldots + x_n}{n} \Big)\, dx_1\, dx_2 \ldots dx_n \xrightarrow[n\to\infty]{} f(1/2).$$

Hint. Interpret the left-hand side as an expected value; use the laws of large numbers.
27. A bowl contains $n$ spaghetti noodles arranged in a chaotic fashion. Bob performs the following experiment: he picks two random ends of noodles from the bowl (chosen uniformly from the $2n$ possible ends), ties them together, and places them back in the bowl. Then he picks at random two more ends (from the remaining $2n - 2$), ties them together and puts them back, and so on until no more loose ends are left.

Let $L_n$ denote the number of spaghetti loops at the end of this process (a "loop" is a chain of one or more spaghettis whose ends are tied to each other to form a cycle). Compute $E(L_n)$ and $V(L_n)$. Find a sequence of numbers $(b_n)_{n=1}^\infty$ such that

$$\frac{L_n}{b_n} \xrightarrow{P} 1,$$

if such a sequence exists.
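A useful observation for the expectation: at the step when $2k$ loose ends remain, a loop is closed exactly when the second end picked is the other end of the chain already in hand, which has probability $1/(2k-1)$, so by linearity $E(L_n) = \sum_{k=1}^n \frac{1}{2k-1}$. The Python simulation below (my own sketch; seed and run counts are arbitrary) agrees with this:

```python
import random

def expected_loops(n):
    """E(L_n) = sum_{k=1}^{n} 1/(2k-1), from the step-by-step argument above."""
    return sum(1.0 / (2 * k - 1) for k in range(1, n + 1))

def simulate_loops(n, rng):
    """One run of Bob's experiment; returns the number of loops formed."""
    # Each end carries the label of its chain; two ends share a label
    # exactly when they are the two loose ends of the same chain.
    ends = list(range(n)) + list(range(n))   # two ends per noodle
    loops = 0
    while ends:
        a = ends.pop(rng.randrange(len(ends)))
        b = ends.pop(rng.randrange(len(ends)))
        if a == b:
            loops += 1                        # tied one chain's ends together
        else:
            ends = [a if c == b else c for c in ends]   # merge the two chains
    return loops

rng = random.Random(42)
n, runs = 20, 5_000
avg = sum(simulate_loops(n, rng) for _ in range(runs)) / runs
print(avg, expected_loops(n))   # close to each other
```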
28. Martians communicate in a binary language with two symbols, 0 and 1. A text of length $n$ symbols written in the Martian language looks like a sequence $X_1, X_2, \ldots, X_n$ of i.i.d. random symbols, each of which is 1 with probability $p$ and 0 with probability $1-p$. Here, $p \in (0,1)$ is a parameter (the "Martian bias").

Define the entropy function $H(p)$ by

$$H(p) = -p \log_2 p - (1-p) \log_2 (1-p).$$

Prove the following result that effectively says that if $n$ is large, then with high probability a Martian text of length $n$ can be encoded into an ordinary (man-made) computer file of length approximately $n \cdot H(p)$ computer bits (note that if $p \ne 1/2$ then this is smaller than $n$, meaning that the text can be compressed by a linear factor):

Theorem. Let $X_1, X_2, X_3, \ldots$ be a sequence of i.i.d. Martian symbols (i.e., Bernoulli variables with bias $p$). Denote by $T_n = (X_1, \ldots, X_n)$ the Martian text comprising the first $n$ symbols. For any $\varepsilon > 0$, if $n$ is sufficiently large, the set $\{0,1\}^n$ of possible texts of length $n$ can be partitioned into two disjoint sets,

$$\{0,1\}^n = A_n \cup B_n,$$

such that the following statements hold:

(a) $P(T_n \in B_n) < \varepsilon$.

(b) $2^{n(H(p) - \varepsilon)} \le |A_n| \le 2^{n(H(p) + \varepsilon)}$.

Notes: The texts in $B_n$ can be thought of as the "exceptional" sequences: they are the Martian texts of length $n$ that are rarely observed. The texts in $A_n$ are called "typical" sequences. Because of the two-sided bounds the theorem gives on the number of typical sequences, it follows that we can encode them in a computer file of size approximately $nH(p)$ bits, provided we prepare in advance a "code" that translates the typical sequences to computer files of the appropriate size (this can be done algorithmically, for example by making a list of all the typical sequences sorted in lexicographic order, and matching them to successive binary strings of length $(H(p) + \varepsilon)n$).

Hint. To prove the theorem, let $P_n$ be the random variable given by

$$P_n = \prod_{k=1}^n \big( p^{X_k} (1-p)^{1 - X_k} \big).$$

Note that $P_n$ measures the probability of the sequence that was observed up to time $n$. (Somewhat unusually, in this problem the probability itself is thought of as a random variable.) Try to represent $P_n$ in terms of cumulative sums of a sequence of i.i.d. random variables. Apply the Weak Law of Large Numbers to that sequence, and see where that gets you.
29. Prove the following one-sided version of Chebyshev's inequality: For any r.v. $X$ and $t \ge 0$,

$$P(X - EX \ge t) \le \frac{\sigma^2(X)}{t^2 + \sigma^2(X)}.$$

Hint. Assume without loss of generality that $EX = 0$. For any $a > 0$, we have that $P(X \ge t) \le P((X + a)^2 \ge (a + t)^2)$. Bound this using known methods and then look for the value of $a$ that gives the best bound.
30. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. r.v.'s with distribution $\text{Exp}(1)$. Prove that

$$P\Big( \limsup_{n\to\infty} \frac{X_n}{\log n} = 1 \Big) = 1.$$
31. Let $A = (X_{i,j})_{i,j=1}^n$ be a random $n \times n$ matrix of i.i.d. random signs (i.e., random variables such that $P(X_{i,j} = 1) = P(X_{i,j} = -1) = 1/2$). Compute $\text{Var}(\det(A))$.
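Before computing $\text{Var}(\det(A))$ in general, it can help to guess the answer from small cases. The Python enumeration below (my own, not part of the notes) exhausts all $2^9$ equally likely sign matrices for $n = 3$:

```python
from itertools import product

def det3(m):
    """3x3 determinant, expanded directly along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Exhaust all 2^9 sign matrices; each is equally likely.
vals = [det3(((m[0], m[1], m[2]), (m[3], m[4], m[5]), (m[6], m[7], m[8])))
        for m in product((-1, 1), repeat=9)]
mean = sum(vals) / len(vals)
var = sum(v * v for v in vals) / len(vals) - mean ** 2
print(mean, var)   # mean 0 (by symmetry); variance 6 = 3! for n = 3
```

Comparing the $n = 2$ and $n = 3$ cases suggests the general pattern, which the determinant expansion over permutations then confirms.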
32. (a) Read, in Durrett's book (p. 63 in the 3rd edition) or on Wikipedia, the statement and proof of Kronecker's lemma.

(b) Deduce from this lemma, using results we learned in class, the following rate of convergence result for the Strong Law of Large Numbers in the case of a finite variance: If $X_1, X_2, \ldots$ is an i.i.d. sequence such that $EX_1 = 0$, $V(X_1) < \infty$, and $S_n = \sum_{k=1}^n X_k$, then for any $\varepsilon > 0$,

$$\frac{S_n}{n^{1/2 + \varepsilon}} \xrightarrow{\text{a.s.}} 0.$$
Notes. When $X_1$ is a random sign, i.e., a random variable that takes the values $-1, +1$ with respective probabilities $1/2, 1/2$, the sequence of cumulative sums $(S_n)_{n=1}^\infty$ is often called a (symmetric) random walk on $\mathbb{Z}$, since it represents the trajectory of a walker starting from 0 and taking a sequence of independent jumps in a random (positive or negative) direction. An interesting question concerns the rate at which the random walk can drift away from its starting point. By the SLLN, it follows that almost surely, $S_n = o(n)$, so the distance of the random walk from the origin almost surely has sub-linear growth. By the exercise above, the stronger result $S_n = o(n^{1/2+\varepsilon})$ also holds for all $\varepsilon$. This is close to optimal, since by the Central Limit Theorem which we will discuss soon, one cannot hope to show that $S_n = o(n^{1/2})$. In fact, the true rate of growth is given by the following famous theorem, whose proof is a (somewhat complicated) elaboration on the techniques we have discussed.

Theorem (The Law of the Iterated Logarithm (A. Y. Khinchin, 1924)).

$$P\Big( \limsup_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = 1 \Big) = 1.$$

Therefore, by symmetry, also

$$P\Big( \liminf_{n\to\infty} \frac{S_n}{\sqrt{2n \log\log n}} = -1 \Big) = 1.$$

It follows in particular that, almost surely, the random walk will cross the origin infinitely many times.
33. Prove that if $F$ and $(F_n)_{n=1}^\infty$ are distribution functions, $F$ is continuous, and for any $t \in \mathbb{R}$ we have $F_n(t) \to F(t)$ as $n \to \infty$, then the convergence is uniform in $t$.
34. Let $\varphi(x) = (2\pi)^{-1/2} e^{-x^2/2}$ be the standard normal density function.

(a) If $X_1, X_2, \ldots$ are i.i.d. Poisson(1) random variables and $S_n = \sum_{k=1}^n X_k$ (so $S_n \sim \text{Poisson}(n)$), show that if $n$ is large and $k$ is an integer such that $k \approx n + x\sqrt{n}$ then

$$P(S_n = k) \approx \frac{1}{\sqrt{n}}\, \varphi(x).$$

Hint. Use the fact that $\log(1 + u) = u - u^2/2 + O(u^3)$ as $u \to 0$.

(b) Find $\lim_{n\to\infty} e^{-n} \sum_{k=0}^n \frac{n^k}{k!}$.
(c) If X
1
, X
2
, . . . are i.i.d. Exp(1) random variables and denote S
n
=

n
k=1
X
k
(so
S
n
Gamma(n, 1)),

S
n
= (S
n
n)/

n. Show that if n is large and x R is xed


then the density of

S
n
satises
f

Sn
(x) (x).
4. Prove that if X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²) are independent r.v.'s, then

    X + Y ~ N(μ_1 + μ_2, σ_1² + σ_2²).

Hint. First, show why it is enough to prove the following statement: if U, V ~ N(0, 1)
are independent and a² + b² = 1, then W = aU + bV ~ N(0, 1). Then, to prove this,
introduce another auxiliary variable Z = −bU + aV, and consider the two-dimensional
transformation φ : (U, V) → (W, Z). Apply the formula

    f_{(W,Z)}(w, z) = (1 / |J_φ(φ^{−1}(w, z))|) · f_{(U,V)}(φ^{−1}(w, z))

for the density of a transformed random 2-d vector to get the joint density of W, Z.
35. (a) Prove that if X, (X_n)_{n=1}^∞ are random variables such that X_n → X in probability,
then X_n ⇒ X (convergence in distribution).
(b) Prove that if X_n ⇒ c where c ∈ R is a constant, then X_n → c in probability.
(c) Prove that if Z, (X_n)_{n=1}^∞, (Y_n)_{n=1}^∞ are random variables such that X_n ⇒ Z and
X_n − Y_n → 0 in probability, then Y_n ⇒ Z.
36. (a) Let X, (X_n)_{n=1}^∞ be integer-valued r.v.'s. Show that X_n ⇒ X if and only if
P(X_n = k) → P(X = k) for any k ∈ Z.
(b) If λ > 0 is a fixed number, and for each n, Z_n is a r.v. with distribution
Binomial(n, λ/n), show that

    Z_n ⇒ Poisson(λ).
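By part (a), the convergence in (b) is equivalent to pointwise convergence of the probability mass functions, which can be watched directly (an illustration, not a proof; λ = 3 and the truncation at k = 50 are arbitrary):

```python
import math

def binom_pmf(n, k, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.0
tvs = []
for n in (10, 100, 10_000):
    p = lam / n
    # Total variation distance between Binomial(n, lam/n) and Poisson(lam);
    # truncating the sum at k = 50 loses a negligible amount of mass for lam = 3.
    tv = 0.5 * sum(abs(binom_pmf(n, k, p) - poisson_pmf(lam, k))
                   for k in range(51))
    tvs.append(tv)
    print(n, tv)   # the distance shrinks as n grows
```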
37. Let f(x) = (2π)^{−1/2} e^{−x²/2} be the density function of the standard normal distribution,
and let Φ(x) = ∫_{−∞}^x f(u) du be its c.d.f. Prove the inequalities

    (1/(x + x^{−1})) f(x) ≤ 1 − Φ(x) ≤ (1/x) f(x),   (x > 0).   (21)

Note that for large x this gives a very accurate two-sided bound for the tail of the
normal distribution. In fact, it can be shown that

    1 − Φ(x) = f(x) · 1/(x + 1/(x + 2/(x + 3/(x + 4/(x + ⋯))))),

which gives a relatively efficient method of estimating Φ(x).
Hint. To prove the upper bound in (21), use the fact that for t > x we have
e^{−t²/2} ≤ (t/x) e^{−t²/2}. For the lower bound, use the identity

    d/dx ( e^{−x²/2} / x ) = −(1 + 1/x²) e^{−x²/2}

to compute ∫_x^∞ (1 + u^{−2}) e^{−u²/2} du. On the other hand, show that this integral is bounded
from above by (1 + x^{−2}) ∫_x^∞ e^{−u²/2} du.
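Both the bounds (21) and the continued fraction can be checked numerically (illustration only; the truncation depth 30 is an arbitrary choice, and the continued fraction converges fastest for large x):

```python
import math

def phi_density(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_tail(x):
    # 1 - Phi(x), via the complementary error function: erfc(x/sqrt(2))/2.
    return 0.5 * math.erfc(x / math.sqrt(2))

def tail_cf(x, depth=30):
    # The continued fraction 1 - Phi(x) = f(x)/(x + 1/(x + 2/(x + 3/(x + ...)))),
    # evaluated bottom-up with `depth` levels.
    t = 0.0
    for k in range(depth, 0, -1):
        t = k / (x + t)
    return phi_density(x) / (x + t)

for x in (0.5, 1.0, 2.0, 4.0):
    lower = phi_density(x) / (x + 1 / x)
    upper = phi_density(x) / x
    tail = normal_tail(x)
    assert lower <= tail <= upper          # the inequalities (21)
    print(x, lower, tail, upper, tail_cf(x))
```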
38. (a) Let X_1, X_2, … be a sequence of independent r.v.'s that are uniformly distributed
on {1, …, n}. Define

    T_n = min{ k : X_k = X_m for some m < k }.

If the X_j's represent the birthdays of some sequence of people on a planet in which
the calendar year has n days, then T_n represents the number of people in the list who
have to declare their birthdays before two people are found to have the same birthday.
Show that

    P(T_n > k) = ∏_{m=1}^{k−1} (1 − m/n),   (k ≥ 2),

and use this to prove that

    T_n/√n ⇒ F_birthday,

where F_birthday is the distribution function defined by

    F_birthday(x) = 0 for x < 0,   and   F_birthday(x) = 1 − e^{−x²/2} for x ≥ 0

(note: this is not the same as the normal distribution!)
(b) Take n = 365. Assuming that the approximation F_{T_n/√n} ≈ F_birthday is good for such
a value of n, estimate the minimal number of students that have to be put into a
classroom so that the probability that two of them have the same birthday exceeds 50%.
(Ignore leap years, and assume for simplicity that birthdays are distributed uniformly
throughout the year; in practice this is not entirely true.)
39. Consider the following two-step experiment: first, we choose a uniform random
variable U ~ U(0, 1). Then, conditioned on the event U = u, we perform a sequence of n
coin tosses with bias u; i.e., we have a sequence X_1, X_2, …, X_n such that, conditioned
on the event U = u, the X_k's are independent and have distribution Binom(1, u).
(Note: without this conditioning, the X_k's are not independent!)
Let S_n = ∑_{k=1}^n X_k. Assume that we know that S_n = k, but don't know the value
of U. What is our subjective estimate of the probability distribution of U given this
information? Show that the conditional distribution of U given that S_n = k is the beta
distribution Beta(k + 1, n − k + 1). In other words, show that

    P(U ≤ x | S_n = k) = (1 / B(k + 1, n − k + 1)) ∫_0^x u^k (1 − u)^{n−k} du,   (0 ≤ x ≤ 1).

Note: This problem was whimsically suggested by Laplace in the 18th century as
a way to estimate the probability that the sun will rise tomorrow, given the knowledge
that it has risen in the last n days. (Of course, this assumes the unlikely theological
scenario whereby at the dawn of history, a U(0, 1) random number U was drawn, and that
subsequently, every day an independent experiment was performed with probability U
of success, such that if the experiment is successful then the sun rises.)
Hint. Use the following density version of the total probability formula: if A is an
event and X is a random variable with density f_X, then

    P(A) = ∫_R f_X(u) P(A | X = u) du.

Note that we have not defined what it means to condition on a 0-probability event
(this is a somewhat delicate subject that we will not discuss in this quarter), but
don't worry about it; it is possible to use the formula in computations anyway and get
results.
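The claimed posterior can be checked by brute-force rejection sampling: simulate the two-step experiment many times, keep only the runs with S_n = k, and look at the retained values of U (a Monte Carlo illustration, not part of the exercise; n = 10 and k = 7 are arbitrary):

```python
import random

random.seed(1)
n, k = 10, 7
accepted = []
for _ in range(200_000):
    u = random.random()                              # U ~ U(0, 1)
    s = sum(random.random() < u for _ in range(n))   # S_n | U = u ~ Binom(n, u)
    if s == k:
        accepted.append(u)

# The conditional law of U given S_n = k should be Beta(k+1, n-k+1),
# whose mean is (k+1)/(n+2).
post_mean = sum(accepted) / len(accepted)
print(len(accepted), post_mean, (k + 1) / (n + 2))
```

Incidentally, the acceptance rate is close to 1/(n+1) regardless of k, reflecting the fact that the marginal distribution of S_n is uniform on {0, 1, …, n}.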
40. Let Z_1, Z_2, … be a sequence of i.i.d. random variables with the standard normal N(0, 1)
distribution. For each n, define the random vector

    X_n = (X_{n,1}, …, X_{n,n}) = (1 / (∑_{i=1}^n Z_i²)^{1/2}) (Z_1, …, Z_n).

(a) The distribution of the random vector X_n is called the uniform distribution on the
(n − 1)-dimensional sphere S^{n−1} = {x ∈ R^n : ||x|| = 1}. Explain why this makes
intuitive sense, and if possible explain rigorously what conditions this distribution
satisfies that justify describing it by this name.
(b) Show that √n · X_{n,1} ⇒ N(0, 1) as n → ∞.
Hint. Use the law of large numbers.
(c) For each n ≥ 1, find the density function of the coordinate X_{n,1}. Optionally, use
this to give an alternative solution to part (b) above.
Hint. Do it first for n = 2 and n = 3, and generalize using ideas from multivariate
calculus. For n = 3, you should find that X_{3,1} ~ U[−1, 1], a geometric fact which was
known to Archimedes.
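The normalization construction above doubles as a practical recipe for sampling uniformly from a high-dimensional sphere, and part (b) can be watched empirically (a Monte Carlo illustration; the dimension and sample size are arbitrary):

```python
import math
import random

random.seed(0)

def uniform_on_sphere(n):
    # Normalize a vector of n i.i.d. N(0,1) coordinates; by the rotation
    # invariance of the standard Gaussian, the result is uniform on S^{n-1}.
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    r = math.sqrt(sum(v * v for v in z))
    return [v / r for v in z]

n, trials = 200, 5_000
samples = [math.sqrt(n) * uniform_on_sphere(n)[0] for _ in range(trials)]
mean = sum(samples) / trials
var = sum(s * s for s in samples) / trials - mean * mean
print(mean, var)   # should be close to 0 and 1, as part (b) predicts
```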
41. Compute the characteristic functions for the following distributions.
(a) Poisson distribution: X ~ Poisson(λ).
(b) Geometric distribution: X ~ Geom(p) (assume a geometric that starts at 1).
(c) Uniform distribution: X ~ U[a, b], and in particular X ~ U[−1, 1], which is
especially symmetric and useful in applications.
(d) Exponential distribution: X ~ Exp(λ).
(e) Symmetrized exponential: a r.v. Z with density function f_Z(x) = (1/2) e^{−|x|}.
Note that this is the distribution of the exponential distribution after being
symmetrized in either of two ways: (i) we showed that if X, Y ~ Exp(1) are
independent then X − Y has density (1/2) e^{−|x|}; (ii) alternatively, it is the
distribution of an exponential variable with a random sign, namely εX, where
X ~ Exp(1) and ε is a random sign (same as the coin flip distribution mentioned
above) that is independent of X.
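Answers to such computations are easy to sanity-check by numerically integrating E[e^{itX}]. As one example (illustration only; the grid size is arbitrary), for part (c) with X ~ U[−1, 1] the answer should come out to sin(t)/t:

```python
import cmath
import math

def cf_uniform_pm1(t, steps=2000):
    # Midpoint-rule approximation of phi_X(t) = E[e^{itX}]
    #                                         = (1/2) * int_{-1}^{1} e^{itx} dx.
    h = 2.0 / steps
    total = 0j
    for j in range(steps):
        x = -1.0 + (j + 0.5) * h
        total += cmath.exp(1j * t * x)
    return 0.5 * h * total

for t in (0.5, 1.0, 3.0):
    numeric = cf_uniform_pm1(t)
    exact = math.sin(t) / t        # the closed form expected in part (c)
    print(t, numeric.real, exact)  # imaginary part vanishes by symmetry
```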
42. (a) If X is a r.v., show that Re(φ_X) (the real part of φ_X) and |φ_X|² = φ_X · φ̄_X are also
characteristic functions (i.e., construct r.v.'s Y and Z such that φ_Y(t) = Re(φ_X(t)) and
φ_Z(t) = |φ_X(t)|²).
(b) Show that X is equal in distribution to −X if and only if φ_X is a real-valued
function.
43. (a) Let Z_1, Z_2, … be a sequence of independent r.v.'s such that the random series
X = ∑_{n=1}^∞ Z_n converges a.s. Prove that

    φ_X(t) = ∏_{n=1}^∞ φ_{Z_n}(t),   (t ∈ R).

(b) Let X be a uniform r.v. in (0, 1), and let Y_1, Y_2, … be the (random) bits in its
binary expansion, i.e., each Y_n is either 0 or 1, and the equation

    X = ∑_{n=1}^∞ Y_n 2^{−n}   (22)

holds. Show that Y_1, Y_2, … are i.i.d. unbiased coin tosses (i.e., taking values 0, 1 with
probabilities 1/2, 1/2).
(c) Compute the characteristic function φ_Z of Z = 2X − 1 (which is uniform in (−1, 1)).
Use (22) to represent this in terms of the characteristic functions of the Y_n's (note that
the series (22) converges absolutely, so here there is no need to worry about almost
sure convergence). Deduce the infinite product identity

    sin(t)/t = ∏_{n=1}^∞ cos(t/2^n),   (t ∈ R).   (23)

(d) Substitute t = π/2 in (23) to get the identity

    2/π = (√2 / 2) · (√(2 + √2) / 2) · (√(2 + √(2 + √2)) / 2) ⋯
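Both the product identity (23) and the nested-radical form in part (d) (Viète's formula) converge very fast and can be verified numerically (illustration only; the truncation depths are arbitrary):

```python
import math

def cos_product(t, terms=50):
    # Partial product of identity (23): prod_{n=1}^{N} cos(t / 2^n).
    p = 1.0
    for n in range(1, terms + 1):
        p *= math.cos(t / 2 ** n)
    return p

for t in (1.0, math.pi / 2, 5.0):
    print(t, cos_product(t), math.sin(t) / t)   # the columns agree

# Viete's formula from part (d): 2/pi as a product of nested radicals.
a, viete = 0.0, 1.0
for _ in range(20):
    a = math.sqrt(2 + a)   # builds sqrt(2), sqrt(2+sqrt(2)), ...
    viete *= a / 2
print(viete, 2 / math.pi)
```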
44. Let X be a r.v. From the inversion formula it follows without much difficulty (see
p. 95 in Durrett's book, 3rd or 4th eds.) that if φ_X is integrable, then X has a density
f_X, and the density and characteristic function are related by

    φ_X(t) = ∫_{−∞}^∞ f_X(x) e^{itx} dx,
    f_X(x) = (1/2π) ∫_{−∞}^∞ φ_X(t) e^{−itx} dt

(this shows the duality between the Fourier transform and its inverse). Use this and
the answer to a previous exercise to conclude that if X is a r.v. with the Cauchy
distribution (i.e., X has density f_X(x) = 1/(π(1 + x²))), then its characteristic function
is given by

    φ_X(t) = e^{−|t|}.

Deduce from this that if X, Y are independent Cauchy r.v.'s then any weighted average
αX + (1 − α)Y, where 0 ≤ α ≤ 1, is also a Cauchy r.v. (As a special case, it follows by
induction that if X_1, …, X_n are i.i.d. Cauchy r.v.'s, then their average (X_1 + ⋯ + X_n)/n
is also a Cauchy r.v., which was a claim we made without proof earlier in the course.)
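The striking conclusion that averaging does not tame the Cauchy distribution is easy to observe by simulation (a Monte Carlo illustration, not part of the exercise; the sample sizes are arbitrary):

```python
import math
import random

random.seed(7)

def std_cauchy():
    # Inverse-CDF sampling: F^{-1}(u) = tan(pi*(u - 1/2)) for the
    # standard Cauchy density 1/(pi*(1 + x^2)).
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 20, 50_000
means = [sum(std_cauchy() for _ in range(n)) / n for _ in range(trials)]

# If averaging helped (as it does for finite-variance laws), the empirical CDF
# of the means would concentrate near 0; instead it should match the standard
# Cauchy CDF F(x) = 1/2 + arctan(x)/pi.
for x in (-1.0, 0.0, 1.0):
    empirical = sum(m <= x for m in means) / trials
    print(x, empirical, 0.5 + math.atan(x) / math.pi)
```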