
Probability Part A

David Steinsaltz¹
University of Oxford

¹ University lecturer at the Department of Statistics, University of Oxford
Part A Probability
David Steinsaltz – 16 lectures HT 2012
[email protected]
Website: http://www.steinsaltz.me.uk/probA/probA.html

Overview
The first half of the course takes further the probability theory that was developed in the
first year. The aim is to build up a range of techniques that will be useful in dealing with
mathematical models involving uncertainty. The second half of the course is concerned
with Markov chains in discrete time and Poisson processes in one dimension, both
developing the relevant theory and giving examples of applications.

Synopsis
Continuous random variables. Jointly continuous random variables, independence, con-
ditioning, bivariate distributions, functions of one or more random variables. Moment
generating functions and applications. Characteristic functions, definition only. Exam-
ples to include some of those which may have later applications in Statistics. Basic ideas
of what it means for a sequence of random variables to converge in probability, in distri-
bution and in mean square. Chebychev and Markov inequalities. The weak law of large
numbers and central limit theorem for independent identically distributed variables with
a second moment. Statements of the continuity and uniqueness theorems for moment
generating functions. Discrete-time Markov chains: definition, transition matrix, n-step
transition probabilities, communicating classes, absorption, irreducibility, calculation of
hitting probabilities and mean hitting times, recurrence and transience. Invariant dis-
tributions, mean return time, positive recurrence, convergence to equilibrium (proof not
examinable). Examples of applications in areas such as: genetics, branching processes,
Markov chain Monte Carlo. Poisson processes in one dimension: exponential spacings,
Poisson counts, thinning and superposition.

A word about stories and proofs


Probability is a highly theoretical area of mathematics, but also has many applications.
We will spend a significant amount of time talking about some of the standard probabil-
ity “stories”. You need to learn these. For instance, you need to know that the binomial
distribution is connected to the story of successes in independent trials with the same
probability of success, and understand how to interpret that story in the context of a
given problem.
At the same time, there are some deep mathematical concepts that need to be
understood. Proofs are crucial to understanding these concepts. The question of whether
a proof is “examinable” should not be uppermost in your mind. Proofs are tools for
clarifying your understanding of how the assumptions of a theorem are linked to the
conclusions, and why they are (or perhaps are not) essential.
To this end, there will be exercises that ask you to generalise standard proofs to
new settings, in ways that will force you to grapple with the techniques. In many cases
it will be clear that these proofs are too long to fit into the context of an exam paper.
Some of you may be tempted to neglect these as not essential. I advise you to resist this
temptation.

Exercises
There are four problem sheets, each covering two weeks of the course. There are no
explicit review exercises, but you may want to look back at your Mods sheets, to make
sure you still remember the material.
Doing problems is the most important way to learn a mathematical subject, and for
no subject is that more important than for probability. Some important principles:
• Problems vary in difficulty. There are intended to be questions that will challenge
the very best students in the course. An average student should find that there are
multiple questions or parts of questions that he or she cannot do. Since questions
are organised in part by topic, and since different students find different topics
difficult, you will not necessarily find them to be ordered by difficulty. So if you
can’t do some questions, don’t panic! Do what you can, think about the question
a while, and plan to discuss it in tutorials.
• There is a variety of problems:
(i). Straightforward application of new techniques;
(ii). Extensions of techniques to novel situations;
(iii). Interpretation of applications, often involving extensive descriptions of an ide-
alised real-world setting. These are particularly important in probability;
(iv). Proofs and theoretical exercises. See the above section.
• Some tutors will have you turn in questions ahead of the relevant lectures. This is
not optimal, but seems unavoidable. If this is your situation, you’ll have to do the
best you can from reading.
• Problem sheets are not the same as exam questions. Problem sheets are part of the
learning process; exam questions are merely for evaluation of your learning. Exam
questions are designed to be done completely in no more than 40 minutes. You
should be spending about 8 hours a week on each of your 16-hour lecture courses,
and most of that time will be spent doing exercises. Thus, 12–15 hours on a two-
week sheet is not excessive. The hardest question on a sheet will be much harder,
and take much longer to do, than the hardest exam question.

Reading
16 hours of lectures are not enough to cover all of the material in any depth. The
lectures are intended to provide a framework for understanding what you will then learn
in depth through reading and working through problems.
Your reading should begin with the lecture notes, but should not end there. You
will not be able to master the material by reading one source, and lecture notes are, by
their nature, not worked out nearly as carefully as a book.
The official suggested supplemental reading, covering the entire syllabus:
• G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes (3rd
edition, OUP, 2001). Chapters 4, 6.1—6.5, 6.8.
• G. R. Grimmett and D. R. Stirzaker, One Thousand Exercises in Probability (OUP,
2001).
• G. R. Grimmett and D J A Welsh, Probability: An Introduction (OUP, 1986).
Chapters 6, 7.4, 8, 11.1—11.3.
• J. R. Norris, Markov Chains (CUP, 1997). Chapter 1.
• D. R. Stirzaker, Elementary Probability (Second edition, CUP, 2003). Chapters
7—9 excluding 9.9.
These are some suggestions, but there is nothing compulsory about them. The
material in this course is absolutely standard, and as a consequence there are hundreds
of books covering it in different ways. You will learn best if you browse various books to
find a style of presentation (and of thought) that makes the most sense to you. Here is
a list of some other probability texts with my personal comments. This has no official
sanction!
• Sheldon Ross — A first course in probability theory. Loads of good examples.
Text a bit dry and plodding. His Introduction to Probability Models (now
in its 9th edition) has some of the same pros and cons for Markov chains, Poisson
process, and a lot more. The range of exercises is breathtaking.
• David Williams — Weighing the odds. A recent text by an inspiring author,
with an integrated treatment of probability and statistics. Fast and exciting.
• Kai Lai Chung — Elementary probability theory. Slower exposition than the
above. Good on counting.
• William Feller — An introduction to probability theory and its applica-
tions. The classic book, in 2 volumes. Volume 1 is just discrete distributions
(including Markov chains), volume 2 is general probability, including convergence
of distributions. Generations of probabilists have been inspired by this book, and
it still can’t be beat for its treatment of renewal processes, densities, and random
walks. Deep exercises involving many interesting applications. Good treatment of
convergence in distribution.
• Jim Pitman — Probability. Covers the more elementary parts of the course. Lots
of entertaining examples worked through in the text, and exercises with solutions.
Encyclopaedic array of exercises. Terrific text. Particularly good on conditional
expectations and applying the normal approximation. Also great section on the
bivariate normal distribution. Very intuitive treatment of the Poisson process.
• Henk Tijms – Understanding probability. Full of anecdotal details – the first
half is designed to be read as fun and to act as motivation for the mathematical
details in the second half.
• Grinstead and Snell — Introduction to Probability. This is available in its
entirety on-line. Long and thorough, but gentle.
• P. Billingsley — Convergence of Probability Measures. Above the level of
this course, but this is the text to go to for those who are interested in really
understanding how convergence of probability distributions really works.
• Kemeny and Snell — Finite Markov Chains. Another classic (1960), though
the only thing that’s outdated is the typeface. (Well, the applications are meager.)
• James Norris — Markov Chains. Officially recommended text of the course. Not
bad. The section on discrete Markov chains is not all that long, and perhaps less
accessible than it could be.
Contents

1 Distributions and random variables 1


1.1 The science of probability . . . . . . . . . . . . . . . . . . . . 1
1.2 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Finite state space . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Countable state space . . . . . . . . . . . . . . . . . . 4
1.3.3 Uniform distribution on a bounded subset of Euclidean space . . . . 4
1.3.4 General distribution on the real line . . . . . . . . . . 5
1.3.5 Densities on the real line . . . . . . . . . . . . . . . . 7
1.3.6 Examples of densities . . . . . . . . . . . . . . . . . . 9
1.4 Some difficulties in probability . . . . . . . . . . . . . . . . . 12
1.5 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Examples of random variables . . . . . . . . . . . . . . . . . . 14
1.6.1 Coin flipping . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.2 Projection . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.3 Indicators . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7 A word about random variables and probability nomenclature 15
1.8 Functions of random variables . . . . . . . . . . . . . . . . . . 16
1.8.1 The discrete case . . . . . . . . . . . . . . . . . . . . . 16
1.8.2 General random variables with injective functions . . . 17
1.8.3 Continuous random variables and invertible g . . . . . 19
1.8.4 Non-invertible g . . . . . . . . . . . . . . . . . . . . . 23
1.9 Limits of random variables . . . . . . . . . . . . . . . . . . . . 25

2 Joint distributions 27
2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . 29

vi
2.4 Joint densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Example: Uniform distribution on a disk . . . . . . . 32
2.4.2 Independent standard normal random variables . . . . 33
2.4.3 Multivariate normal distribution . . . . . . . . . . . . 34
2.4.4 The convolution formula . . . . . . . . . . . . . . . . . 38
2.5 Mixing discrete and continuous random variables . . . . . . . 39
2.5.1 Independence . . . . . . . . . . . . . . . . . . . . . . . 40

3 Expectations, moments, and inequalities 42


3.1 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Essential properties of expectations . . . . . . . . . . . . . . . 44
3.3 Variance and covariance . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 Covariance matrix . . . . . . . . . . . . . . . . . . . . 51
3.4 Tail bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Are expectations the fundamental objects in probability? . . 58

4 Conditioning 60
4.1 Conditioning on non-null events . . . . . . . . . . . . . . . . . 60
4.2 Conditional distributions: The non-null case . . . . . . . . . . 63
4.3 Conditioning on a random variable: The discrete case . . . . 65
4.4 Conditional distributions: The continuous case . . . . . . . . 69
4.5 The Borel paradox . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Bivariate normal distribution: The regression line . . . . . . . 73
4.7 Linear regression on normal random variables . . . . . . . . . 77
4.8 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . 78
4.9 Mixtures of distributions . . . . . . . . . . . . . . . . . . . . . 80
4.9.1 X continuous, Y discrete . . . . . . . . . . . . . . . . 81
4.9.2 X discrete, Y continuous . . . . . . . . . . . . . . . . 82
4.9.3 X and Y continuous . . . . . . . . . . . . . . . . . . . 86

5 Generating functions and their relatives 87


5.1 Generating functions . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Moment generating functions . . . . . . . . . . . . . . . . . . 92
5.2.1 Properties of moment generating functions . . . . . . 92
5.2.2 Behaviour near 0 . . . . . . . . . . . . . . . . . . . . . 94
5.2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.4 Tail bounds from moment generating functions . . . . 97
5.2.5 Uniqueness . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.6 Uniqueness and moments . . . . . . . . . . . . . . . . 100
5.3 Characteristic functions . . . . . . . . . . . . . . . . . . . . . 101
5.4 Properties of characteristic functions . . . . . . . . . . . . . . 101

6 Limit theorems in probability I 103


6.1 Convergence in distribution . . . . . . . . . . . . . . . . . . . 104
6.2 The Continuity Theorem for mgfs . . . . . . . . . . . . . . . . 109
6.3 Convergence in probability . . . . . . . . . . . . . . . . . . . 109
6.4 Law of large numbers . . . . . . . . . . . . . . . . . . . . . . 110
6.4.1 Weak Law of Large Numbers . . . . . . . . . . . . . . 111
6.5 Convergence in mean square . . . . . . . . . . . . . . . . . . . 112
6.6 Strong convergence . . . . . . . . . . . . . . . . . . . . . . . . 113
6.6.1 Convergence in probability doesn’t imply strong con-
vergence . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.6.2 Strong Law of Large Numbers . . . . . . . . . . . . . 113
6.6.3 Strong convergence is stronger than weak convergence 114

7 Limit theorems in probability II: The central limit theorem 115


7.1 The Central Limit Theorem . . . . . . . . . . . . . . . . . . . 115
7.2 Basic applications of the CLT . . . . . . . . . . . . . . . . . . 117
7.2.1 The Normal approximation to the Binomial . . . . . . 117
7.2.2 Continuity correction . . . . . . . . . . . . . . . . . . 120
7.2.3 Poisson distribution . . . . . . . . . . . . . . . . . . . 122
7.2.4 CLT for real data . . . . . . . . . . . . . . . . . . . . . 124
7.3 Using the Normal approximation for statistical inference . . . 128
7.3.1 An example: Average incomes . . . . . . . . . . . . . 128

8 Proof of the Central Limit Theorem 133


8.1 Many paths up the mountain . . . . . . . . . . . . . . . . . . 133
8.1.1 De Moivre’s proof . . . . . . . . . . . . . . . . . . . . 133
8.1.2 Maximum entropy . . . . . . . . . . . . . . . . . . . . 134
8.1.3 Fixed point . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Continuity Theorem . . . . . . . . . . . . . . . . . . . . . . . 135
8.3 Proof of the CLT . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.4 Truncation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

9 Markov chains: Introduction 141

10 Stationary distributions for Markov chains 142

11 Convergence to stationarity of Markov chains 143


12 Recurrence, Transience, and Hitting Times 144
12.1 Basic facts about recurrence and transience . . . . . . . . . . 144
12.2 Probability generating functions for passage times . . . . . . 146
12.3 Demonstrating recurrence and transience: The example of
the random walk on Z . . . . . . . . . . . . . . . . . . . . . . 147
12.3.1 The simple random walk . . . . . . . . . . . . . . . . . 147
12.3.2 General p . . . . . . . . . . . . . . . . . . . . . . . . . 150

13 Expected hitting times 151


13.1 Null and positive recurrence . . . . . . . . . . . . . . . . . . . 151
13.2 Fun with eigenvectors and matrix inverses . . . . . . . . . . . 157
13.2.1 Alternative representation of the stationary distribution . . 158
13.2.2 Hitting probabilities . . . . . . . . . . . . . . . . . . . 158
13.2.3 Absorption probabilities . . . . . . . . . . . . . . . . . 159
13.2.4 Expected hitting times and total occupation . . . . . . 163

14 Applications of Markov chains 167


14.1 Markov chain Monte Carlo . . . . . . . . . . . . . . . . . . . 167
14.1.1 The Metropolis Algorithm . . . . . . . . . . . . . . . . 168

15 Poisson process 170


15.1 Point processes . . . . . . . . . . . . . . . . . . . . . . . . . . 170
15.2 The Poisson process on R+ . . . . . . . . . . . . . . . . . . . 171
15.2.1 Local definition of the Poisson process . . . . . . . . . 171
15.2.2 Global definition of the Poisson process . . . . . . . . 172
15.2.3 Defining the Interarrival process . . . . . . . . . . . . 172
15.2.4 Equivalence of the definitions . . . . . . . . . . . . . . 173
15.2.5 The Poisson process as Markov process . . . . . . . . 175

16 Poisson process: Examples and extensions 176


16.1 Some basic calculations . . . . . . . . . . . . . . . . . . . . . 177
16.2 Thinning and merging . . . . . . . . . . . . . . . . . . . . . . 179
16.3 Poisson process and the uniform distribution . . . . . . . . . 181

A Assignments I
A.1 Part A Probability Problem sheet 1:
Random Variables, Expectations, Conditioning . . . . . . . . II
A.2 Part A Probability Problem sheet 2:
Generating functions and convergence . . . . . . . . . . . . . IV
A.3 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . VI
A.4 Asymptotics of Markov chains and Poisson process . . . . . . VIII
Lecture 1

Distributions and random variables

Summary: We review the definitions of probability and random variables.


Some of these are placed in a slightly more general context, or using more
flexible notation.
Key concepts: probability triple, probability space, probability distribution,
cumulative distribution function, events, probability mass function, density,
change of variables

1.1 The science of probability


What is probability theory? It’s a fairly modern branch of mathematics —
though that is only judged relative to the antiquity claimed for geometry,
algebra, and the like.
As a topic in applied mathematics, probability is the mathematical the-
ory of random events. But unlike geometry, say, which models (foundation-
ally, at least) space, there is considerable disagreement about whether this
fundamental object exists at all, and if so what it is. The applications are
pervasive, since any scientific modelling problem can involve randomness.
As a topic in pure mathematics, probability is closely allied to several
areas of analysis, though point-set topology and set theory play important
roles, as do the theory of dynamical systems, Fourier analysis, and complex
analysis.


1.2 Probability spaces


The basic object of probability theory is often referred to as the probability
triple:
(Ω, F, P).
Here Ω is a set, called the sample space which may or may not have some
additional topological or geometric structure. F is a collection of subsets
of Ω, called the events. And P is the probability distribution — also
called simply the probability — which maps events to nonnegative real
numbers, according to the rules:

(i) P(Ω) = 1;

(ii) P(A ∪ B) = P(A) + P(B) when A and B are disjoint.

Intuitively, we think of Ω as the collection of all the things that “could


happen”, and P assigns to events (subsets) a number indicating the propor-
tion of the total space Ω that is in that event. Rule (i) says that “everything”
corresponds to the number 1; rule (ii) says that if A and B don’t overlap,
the fraction of the space in A and B together is simply the sum of propor-
tions. Philosophers argue (not without reason!) over what these proportions
should be thought of as representing in the real world. We usually imagine
some outcome ωactual that really happens in the world. We want to answer
the question: how likely is it that ωactual is in a certain part A of the space?
But what does it mean to put a number on “how likely”?
If we think of probabilities as long-term frequency of occurrence, then
rule (i) says that the fraction of times that something occurs is 1, and rule
(ii) says that if A and B can’t both occur, then the fraction of times that
either one of them occurs is the sum of the fractions of times that each of
them occurs.
Rule (ii) isn’t sufficient for our purposes. We can add probabilities for
three or four disjoint events, but we run into trouble when we have an infinite
sample space. For instance, we might have the sample space {1, 2, 3, . . . },
and define a probability with P({n}) = 2^{−n}. Must it be the case that

P({1, 3, 5, 7, . . . }) = 2^{−1} + 2^{−3} + 2^{−5} + 2^{−7} + · · · = 2/3 ?

It makes sense, but it’s not obvious from our axioms. So we replace the
additivity axiom (ii) by countable additivity (ii′):

(ii′) Given countably many disjoint events A1, A2, . . . ,

P(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ P(A_n).
What sort of collection of sets is F? It should include all the events we’re
interested in assigning probabilities to. If Ω is a finite set, or even a countable
set, then F could simply be all of the subsets of Ω.1 If Ω is uncountable —
for instance R — then this won’t work. Fortunately, it turns out that we
can define an F that includes all the events we’re likely to want to assign
probabilities to; namely,
• All open and closed sets are in F. (So, in particular, if we’re working
in R we can assign probabilities to events like [a, b] and (a, b] and any
finite collection of points.)

• If A1 , A2 , . . . are elements of F, then their union and intersection is also


in F. Complements of sets in F are also in F. These properties define
what we call a σ-algebra of sets. (This definition is non-examinable.)
In other words, if we can define the probability that some events occur
then we can define the probability that at least one event occurs, and
the probability that all of them occur. If we can define the probability
that an event occurs then we can define the probability that it doesn’t
occur.
Does such an F exist? Can we define probabilities consistently on such
an F? The answer is yes, and the technicalities form the basic material
of Measure Theory (taught at Oxford as the Part A Integration course).
While it is helpful to understand the technicalities of measure theory, and
even necessary to progress beyond a certain point in probability theory,
there is nothing illegitimate or even non-rigorous about simply accepting
the existence of the objects in question as another axiom. Occasionally we
will make reference to further results from measure theory which we simply
use to inspire or justify various definitions.
We will usually ignore the choice of F, and in fact won’t mention it at
all, simply taking it for granted that we can consistently define probabilities
on all the events we care about.
¹ But it doesn’t have to be! In more advanced probability theory (not in this course)
we find that changing to a smaller σ-algebra can be a useful tool.

We call sets in F measurable, without worrying about exactly which


sets they are. We will also simply refer to them as events.

1.3 Examples
1.3.1 Finite state space
If Ω is a finite set, all probabilities can be defined by a collection of
nonnegative numbers p_x for x ∈ Ω, where ∑_{x∈Ω} p_x = 1. We can then define

P(A) = ∑_{x∈A} p_x.

1.3.2 Countable state space


This is no different in principle from the finite case. Probabilities can be
defined by a collection of nonnegative numbers p_x for x ∈ Ω, where
∑_{x∈Ω} p_x = 1. We can then define for any A ⊂ Ω

P(A) = ∑_{x∈A} p_x.
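
For a quick numerical sanity check, here is a minimal Python sketch of such a countable distribution with p_n = 2^{−n}; the truncation level N is an arbitrary choice, and the probability of the odd numbers comes out close to 2/3, as in the example of section 1.2.

import random  # not needed here, but harmless; the sketch is purely arithmetic

# Minimal sketch of a countable state space {1, 2, ...} with p_n = 2^{-n}.
# The support is truncated at N, which introduces an error of about 2^{-N}.
N = 60  # truncation level (arbitrary choice)
p = {n: 2.0 ** (-n) for n in range(1, N + 1)}

def prob(A):
    """P(A) = sum of p_n over n in A."""
    return sum(p[n] for n in A if n in p)

print(prob(range(1, N + 1, 2)))  # odd numbers: approximately 2/3
print(prob(p.keys()))            # whole space: approximately 1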

1.3.3 Uniform distribution on a bounded subset of Euclidean space

Let Ω be a bounded subset of R^n. Define for a subset A of Ω

P(A) = Vol(A) / Vol(Ω).

This is obviously a nonnegative quantity, and bounded by 1 because the


volume of a subset can’t exceed the volume of the set that contains it. Vol-
umes are additive so these probabilities are additive. Are they countably
additive? At this point we have to recognise that we don’t have a rigor-
ous theory of “volumes”; that’s what we get from measure theory. Again,
though, there is no problem as long as we hew close to the sets that are
familiar to our geometric intuition. We need measure theory to see how far
we can push this intuition; although, as we shall see, even starting from very
basic probability models we are very quickly led to “geometric” sets whose
volume is hard to define.
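
The ratio Vol(A)/Vol(Ω) can be estimated by simulation even when A is geometrically awkward. A minimal Python sketch, taking Ω to be the square [−1, 1]² and A the unit disk (so the true answer is π/4 ≈ 0.785):

import random

def estimate_uniform_prob(in_A, n_samples=100_000):
    """Estimate P(A) = Vol(A)/Vol(Omega) for the uniform distribution on the
    square Omega = [-1, 1]^2, by sampling points uniformly from Omega."""
    hits = sum(in_A(random.uniform(-1, 1), random.uniform(-1, 1))
               for _ in range(n_samples))
    return hits / n_samples

# A = unit disk inside the square; Vol(A)/Vol(Omega) = pi/4
print(estimate_uniform_prob(lambda x, y: x * x + y * y <= 1))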

1.3.4 General distribution on the real line


Let Ω = R. A distribution on the real line is equivalent to a cumulative
distribution function F (abbreviated cdf), defined to be a function F :
R → [0, 1] with the following properties:
(cdf1) F is monotone nondecreasing;

(cdf2) F is right-continuous;

(cdf3) limx→−∞ F (x) = 0;

(cdf4) limx→+∞ F (x) = 1.


Then we can define P{ω ≤ x} = P((−∞, x]) = F (x); it follows that
P((a, b]) = F (b) − F (a).
If F has a discontinuity at the point x, that means that P({x}) is nonzero.
(In fact, it is exactly the size of the jump.) Such a point with nonzero
probability is called an atom. We will usually be concerned with either
purely discrete distributions (distributions on a finite or countable sample
space), for which we will rarely have cause to think about cdfs; or with purely
continuous distributions — that is, distributions that have no atoms — so
cdfs will mainly be continuous functions. But mixed discrete-continuous
distributions arise quite naturally in many problems.
It is obvious that for any probability P the function F(x) := P((−∞, x])
is nondecreasing. The other three properties are consequences of an almost-
obvious lemma, which we will term the Monotone Event Lemma:
Lemma 1.1. (i) Let A1 ⊂ A2 ⊂ · · · be an infinite sequence of increasing
events. Then

P(⋃_{i=1}^∞ A_i) = lim_{n→∞} P(A_n).

(ii) Let A1 ⊃ A2 ⊃ · · · be an infinite sequence of decreasing events. Then

P(⋂_{i=1}^∞ A_i) = lim_{n→∞} P(A_n).

In other words, if event i implies event i + 1, then P(at least one occurs)
is the limit of the probability of the last (largest) one. For any
finite collection this is trivial, since ⋃_{i=1}^n A_i = A_n; all we need to show is
that this carries over to an infinite union. Similarly for intersections. (Note
that this result was given as an exercise in Mods probability.)

Proof. (i) Let B_i := A_i \ A_{i−1} for i ≥ 2 and B_1 := A_1. Then the B_i are disjoint,
A_n = ⋃_{i=1}^n B_i, and ⋃_{i=1}^∞ A_i = ⋃_{i=1}^∞ B_i. By countable additivity

P(⋃_{i=1}^∞ A_i) = P(⋃_{i=1}^∞ B_i)
 = ∑_{i=1}^∞ P(B_i)
 = lim_{n→∞} ∑_{i=1}^n P(B_i)
 = lim_{n→∞} P(A_n).

(ii) Let B_i := A_i \ A_{i+1} for i ≥ 1 and B_0 := A_1^c. Then the B_i are disjoint,
⋃_{i=0}^{n−1} B_i = A_n^c, and ⋃_{i=0}^∞ B_i = (⋂_{i=1}^∞ A_i)^c. So

1 − P(⋂_{i=1}^∞ A_i) = P(⋃_{i=0}^∞ B_i)
 = ∑_{i=0}^∞ P(B_i)
 = lim_{n→∞} ∑_{i=0}^{n−1} P(B_i)
 = lim_{n→∞} P(⋃_{i=0}^{n−1} B_i)
 = lim_{n→∞} (1 − P(A_n)).

It follows that for any sequence x_i ↓ x,

lim_{i→∞} F(x_i) = lim_{i→∞} P((−∞, x_i]) = P(⋂_{i=1}^∞ (−∞, x_i]) = P((−∞, x]) = F(x),

proving that F is right-continuous. Similarly, it follows that F(−∞) = 0
and F(∞) = 1.
On the other hand, if we take a sequence x_i ↑ x (so, approaching x from
below — e.g. x_i = x − 1/i), then the union of the sets (−∞, x_i] is the open
interval (−∞, x), and

lim_{i→∞} F(x_i) = lim_{i→∞} P((−∞, x_i])
 = P(⋃_{i=1}^∞ (−∞, x_i]) = P((−∞, x)) = F(x) − P({x}).        (1.1)

We also write this limit from the left as F(x−). Thus, the difference
between F(x) and the limit from the left F(x−) is P({x}), the atom of
probability at the single point x.
The fact that any function satisfying these properties is the cdf of a
probability measure on R is not surprising, although the fact that it really
holds for any such function is a theorem of measure theory. As long as
we restrict ourselves to functions that are piecewise differentiable there’s no
problem. Observe that an increasing function can only have countably many
discontinuities. Piecewise differentiable is a somewhat stronger assumption
— in particular, the jumps form a discrete set.

1.3.5 Densities on the real line


Most of the cdfs we are interested in are not merely continuous, they are
differentiable. (It takes some work, actually, to see how a continuous nonde-
creasing function could be non-differentiable, and not just at isolated points.)
The derivative of the cdf is called the probability density function, or
just density, or pdf.

Definition 1.2. A density on R is a function f : R → R⁺ such that

(i) f is nonnegative;

(ii) ∫_{−∞}^{∞} f(z) dz = 1.

For many purposes the density is a more intuitive object than the dis-
tribution function. The density gives you a picture of the real line, showing
the hot spots for the distribution — where the density is high — and the
cold spots, where the density is low or 0. If you were to take a large num-
ber of independent samples from the distribution and stack them up on the
X-axis at the point corresponding to their value — in other words, draw a
histogram of the samples — it would look like the density.
It’s usually a good idea to think intuitively that there is a probability
f (x)dx of finding ω in a tiny interval (x, x + dx].
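
To make this concrete, here is a minimal Python sketch (with the Exp(1) distribution and a small interval chosen arbitrarily) comparing the fraction of samples landing in (x, x + dx] with f(x) dx:

import random, math

lam = 1.0           # rate of the exponential distribution (arbitrary choice)
x, dx = 0.7, 0.05   # a small interval (x, x + dx]
n = 200_000

samples = [random.expovariate(lam) for _ in range(n)]
frac = sum(x < s <= x + dx for s in samples) / n

print(frac)                            # empirical probability of the interval
print(lam * math.exp(-lam * x) * dx)   # f(x) dx, which should be close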
[Figure 1.1: Examples of cumulative distribution functions. (a) cdf for a fair die; (b) cdf for the traffic light distribution; (c) cdf for the uniform distribution; (d) cdf for the standard normal distribution.]



Continuous distributions allow for an important simplification: The


probability of any event composed of finitely many points, or indeed
countably infinitely many, is always 0. It is obvious from the definition
that a single point has probability 0. The extension to countably many
points follows by countable additivity (cf. (ii’) of section 1.2). Thus,
when talking about a continuous random variable X we will move with-
out comment between events like {0 ≤ X ≤ 1} and {0 < X < 1}.
Similarly, there will be no distinction required between a uniform dis-
tribution on the closed interval [0, 1] and a uniform distribution on the
open interval (0, 1). In section 1.8 we will want to consider random vari-
ables defined as functions of other random variables. If X is a continuous
random variable and Y = g(X), it won’t matter if g is badly behaved,
or even undefined, at countably many points of the range of Y . We can
simply pretend that Y never takes on those values.

1.3.6 Examples of densities


Uniform
The uniform distribution on an interval (a, b) has density

f(z) = 1/(b − a) for a < z < b,   and 0 otherwise.        (1.2)

Normal
The normal distribution (also called the Gaussian distribution) is a 2-parameter
family. The parameters are commonly called µ (mean) and σ 2 (variance),
and written N(µ, σ 2 ). The density is
f(z) = (1/(σ√(2π))) e^{−(z−µ)²/2σ²}.        (1.3)

The N(0, 1) distribution is also called standard normal. Its density is
typically written φ(z) = (2π)^{−1/2} e^{−z²/2}.

Exponential
A one-parameter family of distributions on (0, ∞) with density

f (z) = λe−λz for z > 0. (1.4)



This is commonly used as a model for waiting times, where it corresponds


to a “memoryless” waiting time, where the remaining time is independent
of the time already past. More about this in Example 1.1 and Lecture 4.

Gamma
This is a generalisation of the exponential distribution. It is a two-parameter
family with parameters termed rate and shape. Denoting these by λ and r,
the density is
γ_{r,λ}(z) = (λ^r / Γ(r)) z^{r−1} e^{−λz} for z > 0.        (1.5)

Here Γ(r) is the function defined by Γ(r) = ∫_0^∞ z^{r−1} e^{−z} dz. It is the analytic
continuation of the factorial function, in that Γ(r) = (r − 1)! when r is an
integer.
The exponential distribution is the same as the Gamma distribution
with r = 1, and the sum of r independent exponential random variables
with exponential(λ) distribution is Gamma(r, λ) distributed (when r is an
integer).
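
A quick Monte Carlo check of this last statement, as a minimal Python sketch (the choices r = 3 and λ = 2 are arbitrary): the sum of r independent Exp(λ) variables has sample mean close to r/λ, and the empirical density near a point matches the Gamma(r, λ) density.

import random, math

r, lam = 3, 2.0   # shape and rate (arbitrary choices)
n = 200_000
sums = [sum(random.expovariate(lam) for _ in range(r)) for _ in range(n)]

print(sum(sums) / n, r / lam)   # sample mean vs r/lambda

z, dz = 1.0, 0.02
empirical = sum(z < s <= z + dz for s in sums) / (n * dz)
density = lam ** r * z ** (r - 1) * math.exp(-lam * z) / math.gamma(r)
print(empirical, density)       # these should be close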

Log-normal
This is the distribution of a positive random variable whose logarithm is
normal. It is commonly used to model generic data of the sort that we
might typically model with a normal distribution, but which is nonnegative.
f(z) = (1/(zσ√(2π))) e^{−(ln z−µ)²/2σ²} for z > 0.        (1.6)

Pareto
This is a fat-tailed distribution, frequently applied in economics to model
incomes, sizes of cities, insurance claims, and many other phenomena. It
has two parameters: A minimum value C > 0, and an exponent α > 0
describing the rate of decline of the tails. It is “fat-tailed” because the rate
of decline of the density is a power of x rather than being exponential, or
like the Gaussian or Poisson much faster than exponential. The density is
0 for z < C and
f(z) = αC^α / z^{α+1} for z ≥ C.        (1.7)

Example 1.1: Hazard rate



Suppose we have a machine whose failure rate at time t is a


certain function h(t); by this we mean that the probability that
a functioning machine fails in the next instant of length δt is
h(t)δt + o(δt), where o(δt) means a quantity that is asymptoti-
cally smaller than δt. (Formally, we want to say that

h(t) = lim_{δ↓0} δ^{−1} P{T ≤ t + δ | T > t},

but we won’t be talking formally about conditional probability


until Lecture 4.) Compute the cdf and pdf for the distribution
of the time T at which the machine fails.
Solution: Split up the interval [0, t] into n equal pieces of length t/n. Then
the probability of surviving to time t is

∏_{i=0}^{n−1} P{no failure in [it/n, (i + 1)t/n]} = ∏_{i=0}^{n−1} (1 − h(it/n)(t/n) − o(1/n))
 = ∏_{i=0}^{n−1} exp{−h(it/n)(t/n) − o(1/n)}
 = exp{o(1) − (t/n) ∑_{i=0}^{n−1} h(it/n)}.

Note that (t/n) ∑_{i=0}^{n−1} h(it/n) is a Riemann sum approximation to
∫_0^t h(s) ds. Letting n → ∞, then, leads to

1 − F(t) = P{T > t} = exp(−∫_0^t h(s) ds).

Consequently

f(t) = F′(t) = h(t) exp(−∫_0^t h(s) ds).
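
The formula can be checked by simulation. Here is a minimal Python sketch, assuming the hazard h(t) = 2t (an arbitrary choice, for which the survival probability is exp(−t²)); each small interval of length dt produces a failure with probability approximately h(t)dt, exactly as in the derivation above.

import random, math

def simulate_failure_time(h, dt=1e-3, t_max=10.0):
    """Simulate a failure time with hazard rate h(t): in each interval of
    length dt, a failure occurs with probability approximately h(t) * dt."""
    t = 0.0
    while t < t_max:
        if random.random() < h(t) * dt:
            return t
        t += dt
    return t_max  # censored at t_max (very unlikely for this hazard)

h = lambda t: 2.0 * t   # assumed hazard; survival is exp(-t^2)
times = [simulate_failure_time(h) for _ in range(20_000)]

t0 = 1.0
print(sum(T > t0 for T in times) / len(times))   # empirical P(T > t0)
print(math.exp(-t0 ** 2))                        # exp(-integral of h) = e^{-1}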

Fact 1.3. Let F satisfy the assumptions (cdf1)–(cdf4), and assume that
there is a bi-infinite increasing sequence x_i (i ∈ Z), with x_i < x_{i+1}, such
that F is differentiable on each interval (x_i, x_{i+1}), and has a jump of size j_i
at x_i (where j_i may be 0). Let J = ∑_{i=−∞}^{∞} j_i, and let f : R → R⁺ be defined
to be F′(x) (if F is differentiable at x), or 0 otherwise. Let P_C be defined
to be the probability with density f/(1 − J), and P_D the probability defined
on the discrete set of points {x_i}, with P_D({x_i}) = j_i/J. Then F is the cdf
of the mixed distribution P = J·P_D + (1 − J)·P_C.
That is, it is the cdf of a random variable X that may be generated by
the following two-step process: First, with probability J we decide that X
is discrete, and with probability (1 − J) it is continuous. If it is discrete,
then we sample X from the discrete distribution PD ; otherwise, we sample
X from the continuous distribution PC .

Example 1.2: Traffic light

A traffic light cycles once per minute. That is, the time from the
beginning of one green to the beginning of the next is exactly
60 seconds. It spends 30 seconds of each cycle being red in
the North-South direction, and 30 seconds red in the East-West
direction. When it is red in one direction, it is green or yellow
in the other. Cars arriving at a green or yellow light go right
through, while those arriving at a red wait for the light to change
to green. Model the random time that a car waits, assuming it
arrives at a time uniformly distributed in the cycle.
Solution: Half the time the car arrives at a green light, and so
has 0 waiting time. If it arrives at a red light, the waiting time is
uniform on (0, 1/2) minute. The cdf is sketched in Figure 1.1(b).
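
Simulating the waiting time shows the mixed character of the distribution: an atom of size 1/2 at zero, plus a uniform part on (0, 1/2). A minimal Python sketch (times in minutes, with the green phase taken to be the first half of the cycle):

import random

def waiting_time():
    """Arrival uniform in a 1-minute cycle: green during [0, 1/2), red during
    [1/2, 1). At green the car passes at once; at red it waits for the next green."""
    u = random.random()
    return 0.0 if u < 0.5 else 1.0 - u

waits = [waiting_time() for _ in range(100_000)]
print(sum(w == 0.0 for w in waits) / len(waits))   # atom at 0: about 1/2
print(sum(w <= 0.25 for w in waits) / len(waits))  # F(1/4) = 1/2 + 1/4 = 0.75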


1.4 Some difficulties in probability


Why do we need to learn more mathematics to solve probability problems?
Often in mathematics, the questions that require more sophisticated
techniques to solve also require more sophisticated techniques to pose.
That is sometimes true in probability, but often it is not.
Here are some examples of questions that we can understand in terms of
what you already know, but which we need more sophisticated concepts to
be able to solve. Whereas in the first lectures we started by looking at a single
probability space, and trying to compute interesting probabilities by counting,
here we are going to be interested in comparing different probability

spaces, taking limits of probabilities, and taking limits of random variables


within a probability space. As is so often the case in mathematics, a well-
chosen approximation or an appropriate upper bound can be far more useful
than an exact formula. We will be developing techniques for:

• Approximating probabilities, comparing probabilities, and describing


limits of distributions;

• Understanding how probabilities change when we introduce partial


information — the topic called conditioning;

• Performing calculations — often relying on simulations — on compli-


cated probability distributions that are not explicitly computable;

• Describing successions of random events developing in time.

Thinking about probability requires at least as much rigour as thinking


about Riemann surfaces or commutative rings. In some ways even more,
because intuitive concepts that we carry over from everyday life — while
enormously useful in solving problems — can lead us into contradictions
and paradoxes, as we will see in section 4.5.

1.5 Random variables


The most common way of generating interesting new probability distribu-
tions is by way of maps on probability spaces. Suppose we are given a
probability triple (Ω, F, P), and that there is a map X : Ω → R. Then we
can define an image probability triple (X, F′, P′), where F′ is simply all the
images of sets in F, and for B ∈ F′

P′(B) := P(X^{−1}(B)) = P({ω : X(ω) ∈ B}).

That is, we have a probability distribution whose sample space is R, and


the probability of a set B is the probability defined by P that X is in B.
We call such a map a random variable, as long as the image distribution
looks the way we want a probability distribution on X to look. What does
this mean? Usually we’re thinking of X as R (or a subset of R). Remember
that we want to be able to have probabilities for open and closed intervals in
R. That means that the inverse images of intervals under the map X need
to be in F. Since mathematicians like to reduce their assumptions down to
the minimum necessary, we end up with the definition:

Definition 1.4. A real-valued random variable is a map X : Ω → R such


that for every x ∈ R the sets {ω : X(ω) ≤ x} and {ω : X(ω) < x} are in F.
We can think of a point in the sample space as a capsule containing all the
information (in the model) about the random outcome, and each random
variable as the answer to a specific question about the outcome.

1.6 Examples of random variables


1.6.1 Coin flipping
Let Ω = {−1, +1}n and P the uniform distribution on Ω; that is, Ω consists
of n-tuples of −1’s and +1’s. We can define lots of different random variables
on Ω. For instance,
• The total number of −1’s.
• Xi = ith coordinate.
• S_k = ∑_{i=1}^k X_i — the simple random walk.

• Rk = number of runs of length k in the sequence. (A run is a subse-


quence of length k of all −1 or all +1.)
The distributions, and particularly the joint distributions, are often quite
complicated, but we will learn methods for analysing such distributions.
As an example, we compute the distribution P_i of X_i. This is a
probability distribution on the sample space {−1, +1}, so it is completely
defined by the probability P_i({−1}), the probability assigned by the distri-
bution of X_i to the point {−1}. Under the uniform distribution
P(A) = #A / #Ω.
If Ai is the event {Xi = −1} (which is an informal way of writing {ω :
Xi (ω) = −1}), it contains the outcomes of the form (x1 , . . . , xi−1 , −1, xi+1 , . . . , xn ),
where xj is −1 or +1. Since there are 2 possibilities for each of n − 1 open
places, there are 2n−1 outcomes in Ai , and Pi ({−1}) = P(Ai ) = 2n−1 /2n =
1/2. Similarly Pi ({+1}) = 1/2.
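
A minimal Python sketch of this sample space (with n = 10 chosen arbitrarily), checking that P_i({−1}) = 1/2 and that the total number of −1’s has the Bin(n, 1/2) distribution:

import random
from math import comb

n, trials, i = 10, 50_000, 3   # n coordinates, number of samples, coordinate to check
count_minus, count_five = 0, 0
for _ in range(trials):
    omega = [random.choice((-1, +1)) for _ in range(n)]
    count_minus += (omega[i] == -1)
    count_five += (omega.count(-1) == 5)

print(count_minus / trials)      # about 1/2
print(count_five / trials)       # about P(Bin(10, 1/2) = 5)
print(comb(n, 5) * 2.0 ** (-n))  # = 252/1024 = 0.246...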

1.6.2 Projection
Let Ω = R^k. Points ω of Ω have the form (ω1, . . . , ωk), where the ωi are in R.
We can naturally define the random variables Xi (ω1 , . . . , ωk ) = ωi .

1.6.3 Indicators
Let A be any event. The random variable 1A is defined by
1_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.

It seems trivial, but it is surprisingly useful.

1.7 A word about random variables and probabil-


ity nomenclature
In formal terms, the distribution of the random variable X is a probability
whose sample space is R. Thus, in section 1.6.1, if X is the total number
of −1’s, then the distribution of X is a probability triple with sample space
R whose probability P′ is concentrated only on the subset {0, 1, 2, . . . , n}.
That is, P′({0, 1, 2, . . . , n}) = 1. In fact, the distribution is Bin(n, 1/2); that
is, P′({k}) = (n choose k) · 2^{−n}.
We are usually much looser in describing events defined by a random
variable. We write {sentence describing X} as a shorthand for the event

{ω : sentence is true about X(ω)},

and then we make probability statements about P {sentence describing X}.


Thus we will talk about events like

{X > 3},

{X and Y are both even},


{at least three heads in a row}, or
{height > 180cm and weight < 70kg}.
It is crucial — and often difficult at the beginning — to remember the
distinction between a random variable and its distribution. A single random
variable is just another way of talking about a distribution. For instance,
you might say: We measure the height of a student. The true height is 180
cm, and the measurement error is normally distributed with expectation 0 cm
and SD 1 cm. Let X be the measured height. Then the random variable X is just
another way of talking about the N(180, 1) distribution. Random variable
language is convenient for associating our abstract probability objects with

objects that we are modelling in the real world. (The word comes from
statistics, where a variable is a quantity observed or measured in a study,
such as height in a health survey, or 1 or 0 to denote whether this subject
was in the treatment or control group. A random variable is a probabilistic
model for a statistical variable.)
If we then let Y = (X − 180)² — the square of the error — and talk
only about Y, then we are using transformations and the methods of section 1.8
as a modelling tool for arriving at the χ² distribution (see Example 1.7).
Things get more complicated when we have multiple random variables
— we then have a joint distribution, as we will discuss in Lecture 2. Fun-
damental is to recognise that random variables may have the same distri-
bution without being the same random variable. As a very simple example
in coin-flipping (section 1.6.1) the random variables Xi all have the same
distribution Pi ({−1}) = Pi ({+1}) = 1/2, but they are all different random
variables.
Another example: Let X and Y be independent standard normal random
variables, and let Zθ = X sin θ + Y cos θ for any θ ∈ R. We show in section
(2.4.2) that Zθ also has the standard normal distribution for any θ. On
the other hand, Zθ and Zθ′ are different random variables — they need not
have the same value — unless θ − θ′ is a multiple of 2π.
A word that is sometimes used for the distribution of a random variable,
thought of as an object in its own right, is law: The law of Zθ is standard
normal, or we say that Zθ and Zθ′ “have the same law”. We also sometimes
write L(Z) for the law of the random variable Z. Thus, we might write
L(Zθ) = L(Zθ′).

1.8 Functions of random variables


It is a trivial consequence of the definition that functions of random variables
are also random variables. . . as long as the inverse images of events are also
events. We won’t go into the exact conditions here — it is a question of basic
measure theory — except to say that it’s the sort of problem that goes away
(mostly) if you ignore it. That is, nearly all functions that one ordinarily
might think of are “measurable”. In particular, all continuous functions and
piecewise continuous functions are acceptable.

1.8.1 The discrete case


Suppose X is a discrete random variable, and g : R → R is any function.
Then Y := g(X) is a random variable, and it is straightforward to de-

scribe its distribution. The discrete random variable X takes on a finite or


countably infinite set of values X ⊂ R, and its distribution is defined by a
probability mass function px for x ∈ X. Then the set of possible values of
Y is Y := g(X) = {g(x) : x ∈ X}.
Let q_y (y ∈ Y) be the probability mass function of Y. Then

q_y = P{Y = y}
 = P{g(X) = y}
 = P{X ∈ {x : g(x) = y}}
 = ∑_{x : g(x)=y} P{X = x}
 = ∑_{x : g(x)=y} p_x.
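
A concrete instance of this formula, as a minimal Python sketch: X uniform on {−2, −1, 0, 1, 2} and g(x) = x², so the p_x are grouped according to the value of g.

from collections import defaultdict

p = {x: 1 / 5 for x in (-2, -1, 0, 1, 2)}   # pmf of X (uniform, arbitrary choice)
g = lambda x: x * x

q = defaultdict(float)       # q_y = sum of p_x over all x with g(x) = y
for x, px in p.items():
    q[g(x)] += px

print(dict(q))   # {4: 0.4, 1: 0.4, 0: 0.2}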

1.8.2 General random variables with injective functions


Suppose X is a real-valued random variable with cdf FX , and g : R → R
a strictly monotone function, so the inverse g −1 is defined on Y := g(R).
Define the random variable Y := g(X).
Suppose g is increasing. Then Y has cdf F_Y defined for y ∈ Y by

F_Y(y) := P{Y ≤ y} = P{g(X) ≤ y} = P{X ≤ g^{−1}(y)} = F_X(g^{−1}(y)).        (1.8)

If y ∉ Y then P{Y ≤ y} = P{Y ≤ y′}, where y′ = sup(Y ∩ (−∞, y)). Thus,
we fill in the gaps in Y by extending F_Y as a constant. To put it slightly
differently, we may replace equation (1.8) by

F_Y(y) = sup_{x : g(x) ≤ y} F_X(x).        (1.9)

On the other hand, if g is decreasing, then for y ∈ Y

F_Y(y) := P{Y ≤ y} = P{g(X) ≤ y} = P{X ≥ g^{−1}(y)}
 = 1 − lim_{x↑g^{−1}(y)} F_X(x)        (1.10)
 = 1 − F_X(g^{−1}(y)) + P({g^{−1}(y)}) = 1 − sup_{x < g^{−1}(y)} F_X(x).

When g has jumps, we extend to y ∉ Y by

F_Y(y) = 1 − sup_{x : g(x) > y} F_X(x).        (1.11)

[Figure 1.2: Change of variables for monotone g. (a) Increasing g: P(Y ≤ y) = P(X ≤ x) = F_X(x), where x = g^{−1}(y). (b) Decreasing g: P(Y ≤ y) = P(X ≥ x) = 1 − F_X(x) + P(X = x).]

Example 1.3: Discontinuous map

Let

g(x) := x if x ≤ 0,   and   x + 1 if x > 0.

Let X be uniformly distributed on (−1, 1). Find the cdf of Y =
g(X) and Z = −g(X).
Solution: The range of g is g(R) = (−∞, 0] ∪ (1, ∞). On this set,

g^{−1}(y) = y if y ≤ 0,   and   y − 1 if y > 1.

Then

F_Y(y) = F_X(g^{−1}(y))
 = 0 if y < −1,
 = (y + 1)/2 if −1 < y ≤ 0,
 = 1/2 if 0 < y ≤ 1,
 = y/2 if 1 < y ≤ 2,
 = 1 if 2 < y.

1.8.3 Continuous random variables and invertible g


If the real-valued random variable X has a density fX , then, of course, we
don’t have the problem of dealing with the probability of a single point.
If g is monotone and differentiable then we can compute the density fY of
Y = g(X) directly. Again, we write g −1 for the inverse function, such that
g(g −1 (y)) = y and g −1 (g(x)) = x.
The basic idea is that fY (y) expresses the probability of Y being near y,
in the sense that fY (y)δy is the probability of y ≤ Y ≤ y+δy for infinitesimal
δy. If Y is near y then X must be near g −1 (y). So you might think that
fY (y) = fX (g −1 (y)), which seems superficially similar to the formula for the
discrete case in section 1.8.1. This would be wrong, because an interval of
width δx near x = g −1 (y) is identified with an interval of width g 0 (x) δx
near y under the action of g. Thus, the density is smaller where g 0 (x) is
larger, and vice versa.
Formally, if g has positive derivative g′, then we apply (1.8) to obtain

f_Y(y) = (d/dy) F_Y(y)
 = (d/dy) F_X(g^{−1}(y))
 = F_X′(g^{−1}(y)) · (d/dy) g^{−1}(y)    by the Chain Rule
 = f_X(g^{−1}(y)) / g′(g^{−1}(y)).

If g has negative derivative g′, then we apply (1.10) to obtain

f_Y(y) = (d/dy) F_Y(y)
 = (d/dy) [1 − F_X(g^{−1}(y))]
 = −F_X′(g^{−1}(y)) · (d/dy) g^{−1}(y)    by the Chain Rule
 = f_X(g^{−1}(y)) / (−g′(g^{−1}(y))).

Putting these together yields the Change of Variables formula:

Theorem 1.5. If X is a random variable with density fX on R, and Y =


g(X) for a function g that is differentiable with nonzero derivative, then Y
has density
f_Y(y) = f_X(g^{−1}(y)) / |g′(g^{−1}(y))|,        (1.12)
where g −1 (y) is the unique x such that g(x) = y.

Example 1.4: Deriving the exponential distribution

Suppose X has uniform distribution on (0, 1), and λ is a positive


real number. Let Y = −λ−1 ln X. What is the distribution of
Y?
Solution: X has cdf and density given by

F_X(x) = 0 if x ≤ 0,   x if 0 < x ≤ 1,   1 if 1 < x;
f_X(x) = 1 if 0 < x ≤ 1,   and 0 otherwise.

Since g(x) := −λ^{−1} ln x is decreasing, with g^{−1}(y) = e^{−λy}, we
may apply formula (1.10) to compute the cdf F_Y for Y, which is

F_Y(y) = 1 − F_X(e^{−λy})
 = 1 − e^{−λy} if 0 < e^{−λy} ≤ 1 (that is, if y ≥ 0),
 = 0 if e^{−λy} > 1 (that is, if y < 0).

This, of course, is recognisable as the cdf of an exponential dis-
tribution with parameter λ.
We can also apply Theorem 1.5 with g′(x) = −1/(λx). Thus g′(g^{−1}(y)) =
−λ^{−1} e^{λy}, so

f_Y(y) = f_X(e^{−λy}) / (λ^{−1} e^{λy})
 = λ e^{−λy} if y ≥ 0,   and 0 if y < 0.


Example 1.5: Linear transformations of random variables

Let X be a random variable with density fX , and Y = σX + µ


for constants σ and µ, with σ ≠ 0. Compute the density of Y.
Apply this to the case when X is a standard normal distribution.
Solution: We have g(x) = σx + µ, so g^{−1}(y) = (y − µ)/σ, and
g′(x) = σ for all x. Then

f_Y(y) = |σ|^{−1} f_X((y − µ)/σ).

If X is a standard normal random variable then f_X(x) = (2π)^{−1/2} e^{−x²/2},
and the distribution of Y = σX + µ is normal with mean µ and
variance σ². Its density is then

f_Y(y) = (1/(|σ|√(2π))) e^{−(y−µ)²/2σ²}.

Example 1.6: Simulating a general random variable

Suppose we have a mechanism for generating random variables


with uniform distribution on (0, 1). How can we use this to
generate a random variable with cdf F , where F is an arbitrary
function satisfying (cdf1)–(cdf4).
Solution: Consider first the case where F is continuous and
monotonically increasing. Then there is for each x ∈ (0, 1) a
unique y = F −1 (x) such that F (y) = x. Let X be uniform on
(0, 1) and let Y = F −1 (X). If g = F −1 , then g −1 (y) is the unique
x such that F −1 (x) = y, which by definition is x = F (y).
Then by (1.8) Y has cdf

FY (y) = FX (F (y)) = F (y) for all y such that F (y) ∈ [0, 1],

where FX is the cdf of the uniform distribution on (0, 1), which


is FX (x) = x for x ∈ [0, 1]. Thus Y has cdf F .
If F is not strictly increasing there will be multiple x — hence a
whole interval of x — such that F (x) = y for a given y. If F has
jumps, there will be some y for which there is no solution to F (x) =
y. For such cases we define instead of F −1 the pseudoinverse

g(x) := sup{z : F (z) < x}.

Then g is strictly increasing on (0, 1), but it may have jumps


(precisely at those x which are the image of multiple y). Define
Y = g(X). Then by (1.9),

F_Y(y) = sup_{x : g(x) ≤ y} F_X(x)
 = sup{x : sup{z : F(z) < x} ≤ y}
 = sup{x : F(y) ≥ x}
 = F(y).

(Note on the third step: If sup{z : F(z) < x} ≤ y then F(y) ≥ x.
After all, if F(y) < x then by right-continuity there is some
ε > 0 such that F(y + ε) < x, so y + ε would be in the set. On
the other hand, if F(y) ≥ x then any z with F(z) < x must satisfy
z < y, implying that sup{z : F(z) < x} ≤ y. Thus F(y) ≥ x if and
only if sup{z : F(z) < x} ≤ y.)
We could also deal with discontinuities in F by following the idea
in Fact 1.3: We split X into a continuous piece and a discrete
piece (which lives on the locations of the jumps of F ). With
probability equal to the sum of the jumps J we randomly select
one of the discontinuities of F ; with probability (1 − J) we se-
lect from the continuous part of F , according to the procedure
defined above.
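
A minimal Python sketch of this recipe: the pseudoinverse g(x) = sup{z : F(z) < x} is approximated numerically by bisection (the bracketing interval and the particular cdf — the traffic-light cdf of Example 1.2 — are arbitrary choices for the illustration).

import random

def pseudoinverse(F, x, lo=-1e6, hi=1e6, iters=80):
    """Approximate g(x) = sup{z : F(z) < x} by bisection, assuming the
    interval (lo, hi) brackets the support of the distribution."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) < x:
            lo = mid          # mid is still in {z : F(z) < x}
        else:
            hi = mid
    return lo

def sample_from_cdf(F):
    """Inverse-transform sampling: Y = g(X) with X uniform on (0, 1)."""
    return pseudoinverse(F, random.random())

# Traffic-light cdf of Example 1.2: F(y) = 0 for y < 0, 1/2 + y on [0, 1/2), 1 beyond.
F = lambda y: 0.0 if y < 0 else min(1.0, 0.5 + y)
waits = [sample_from_cdf(F) for _ in range(50_000)]
print(sum(abs(w) < 1e-6 for w in waits) / len(waits))   # atom at 0: about 1/2
print(sum(w <= 0.25 for w in waits) / len(waits))       # F(1/4) = 0.75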

1.8.4 Non-invertible g
When g is differentiable (except perhaps at a discrete set of points) then
we can still apply the same idea. As in the discrete case, we now may have
multiple points x which map onto the same point y, so that an infinitesimal
interval around y corresponds to multiple infinitesimal intervals where X
could be and still produce Y near y.

Theorem 1.6. If X is a random variable with density fX on R, and Y =


g(X) for a function g that is differentiable with nonzero derivative except at
most at a discrete set of points, then Y has density
f_Y(y) = ∑_{x : g(x)=y} f_X(x) / |g′(x)|        (1.13)

at points where g 0 (x) exists and is nonzero.

Note that there is no problem with having a discrete set of points at


which the density is infinite or undefined, since this doesn’t affect the inte-
gral.
This result is illustrated in Figure 1.3.

[Figure 1.3: Change of variables from the random variable X to Y = g(X). The figure shows the interval (y, y+∆y) on the Y-axis and its three preimage intervals, so that P(y ≤ Y ≤ y+∆y) = P(x₁ ≤ X ≤ x₁+∆x₁) + P(x₂ ≤ X ≤ x₂+∆x₂) + P(x₃ ≤ X ≤ x₃+∆x₃).]


We illustrate the computation of P{y ≤ Y ≤ y + ∆y}. Note that this
small interval of Y values is the image of three different intervals of X
values, belonging to three different xi such that g(xi ) = y. Those xi with
larger gradient (value of |g 0 (xi )|) have correspondingly smaller ∆xi . When
g 0 (xi ) < 0 then ∆xi is negative.

Proof. Let ξ1 < ξ2 < · · · < ξk be the points at which g 0 (x) = 0 or g is


nondifferentiable. Let yi = g(ξi ). Then if (y, y + δy) is an interval that is
small enough not to include any yi , the event {y ≤ Y ≤ y + δy} is a disjoint
union of events of the form {xi ≤ X ≤ xi +δxi }, where g is strictly monotone
on each of the intervals (xi , xi + δxi ). The result then follows directly from
Theorem 1.5.
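
A numerical sanity check of Theorem 1.6, as a minimal Python sketch: take X standard normal and g(x) = x², so each y > 0 has the two preimages ±√y and the formula gives f_Y(y) = φ(√y)/√y.

import random, math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def f_Y(y):
    """Density of Y = X^2 by Theorem 1.6: sum over the preimages x = +/- sqrt(y),
    each contributing f_X(x) / |g'(x)| with g'(x) = 2x."""
    x = math.sqrt(y)
    return phi(x) / (2 * x) + phi(-x) / (2 * x)

n = 200_000
samples = [random.gauss(0, 1) ** 2 for _ in range(n)]

y, dy = 0.5, 0.01
empirical = sum(y < s <= y + dy for s in samples) / (n * dy)
print(empirical, f_Y(y))   # these should be close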

Example 1.7: The χ2 distribution

The χ2 (chi-squared) distribution with d degrees of freedom is


defined to be the distribution of the random variable X = Z_1² +
· · · + Z_d², where Z_1, Z_2, . . . , Z_d are i.i.d. standard normal random
variables. Compute the density of this distribution.
Solution: Exercise.


1.9 Limits of random variables


If we have infinitely many random variables X1 , X2 , . . . such that limn→∞ Xn (ω)
exists for every ω, then this limit — call it X∞(ω) — is itself a function on Ω.
Is it a random variable? It is; the proof is in any book on measure-theoretic
probability.

Example 1.8: Uniform distribution

Suppose we have a box with K identical tickets, labelled with


numbers 0 through K−1. We draw repeatedly, with replacement,
generating random variables X1 , X2 , . . . i.i.d. uniform on the set
{0, . . . , K − 1}. Define

X = ∑_{i=1}^∞ X_i K^{−i}.

What is P{X < 1/π}?


Solution: For any real number x ∈ [0, 1], let 0.b₁b₂ . . . be the
base-K representation of the number x which does not end in an
infinite sequence of the digit K − 1:

x = ∑_{i=1}^∞ b_i K^{−i}.

Let J = min{j : X_j ≠ b_j}. Then the event {X < x} is the same
as the event {X_J < b_J}. Thus, we may write

P{X < x} = ∑_{j=1}^∞ P{J = j and X_j < b_j}
 = ∑_{j=1}^∞ P{X_1 = b_1, . . . , X_{j−1} = b_{j−1}, X_j ∈ {0, 1, . . . , b_j − 1}}
 = ∑_{j=1}^∞ P{X_1 = b_1} · · · P{X_{j−1} = b_{j−1}} P{X_j ∈ {0, 1, . . . , b_j − 1}}    by independence
 = ∑_{j=1}^∞ (1/K)^{j−1} (b_j/K)
 = ∑_{j=1}^∞ b_j K^{−j}
 = x.
By uniqueness of cdf’s, this also means that X has the uniform
distribution on the interval [0, 1].
Note that, although X is defined in terms of an infinite sequence
of random digits, the event {X < x} will always be determined
by a finite sequence of digits — however many are needed to
obtain the first discrepancy between Xj and bj . The number of
digits is finite always, but treating them as part of an infinite se-
quence allows us to escape the need for any fixed (and arbitrary)
bound on the number of digits.
Note too that we may define other distributions on the interval
[0, 1] by taking any non-uniform distribution on the digits. We
will not prove this fact here, but no distribution on [0, 1] defined
by an alternative digit distribution has a density. This is one
of the simplest ways of defining a distribution that is neither
discrete nor continuous. The cdf of such a distribution is thus
a function that is monotone, with F (0) = 0 and F (1) = 1, but
whose derivative is 0 at almost every point. Making formal sense
of these statements of course requires measure theory. 
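A hedged simulation sketch of Example 1.8 (Python; an illustrative addition in which the infinite digit expansion is truncated at 30 digits, far below the resolution of the comparison):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_digits, n_samples = 10, 30, 100_000

# digits[i, j] is the (j+1)-th base-K digit of the i-th sample
digits = rng.integers(0, K, size=(n_samples, n_digits))
weights = K ** -(np.arange(1, n_digits + 1, dtype=float))
X = digits @ weights          # X = sum_j X_j K^{-j}, truncated at n_digits terms

for x in (0.1, 1 / np.pi, 0.5, 0.9):
    print(f"P(X < {x:.4f}) ~ {np.mean(X < x):.4f}  (uniform: {x:.4f})")
```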
Lecture 2

Joint distributions

Summary: We extend our definitions of random variables to describe mul-


tiple random variables simultaneously. We define independent random vari-
ables. We sketch a proof for the change of variables formula for joint den-
sities. Derive the convolution formula for sums of independent continuous
random variables.
Key concepts: joint distribution, independence, marginal distribution, joint
density, multivariate normal distribution, convolution
The whole point of the theory of random variables is to give us a platform
for describing joint distributions. Talking about a single random variable is
the same as talking about a probability distribution. In principle, we can
describe probabilities on multidimensional spaces — probabilities of multi-
ple real-valued quantities simultaneously — using similar tools to those we
have considered so far, by writing down probability mass functions and den-
sities. In practice, this is rarely the most useful way of thinking about joint
distributions. Instead, we usually want to describe general random joint
distributions in terms of marginal distributions and two kinds of elementary
transformations:

• Functional transformations of independent random variables.

• Mixtures of conditional distributions. (These will be discussed in


Lecture 4.)

2.1 Basic concepts


When we define more than one random variable on the same probability
space, we immediately have a joint distribution. The joint distribution of


two or more random variables is simply a description of joint probabilities


on events of the form

P{X1 ∈ A1 , . . . , Xn ∈ An }.

(Intuitively it is clear what the joint distribution of countably infinitely


many random variables would be, following the same template. We will on
occasion have recourse to this concept, but will not address the technical
complexities that arise in defining it rigorously.) The distributions of the
individual random variables are called marginal distributions.
For any joint distribution there will be many different ways of defining a
probability triple (Ω, P, F) and a collection of random variables which have
that joint distribution. There is no reason to give priority to one of these
over the others. This is an essential point of flexibility: Any computation
that can be done in one representation will give the same result in any
other representation. Nonetheless, we often think of one representation of
the joint distribution as canonical, namely the representation in which the
probability space Ω is a product space, and the random variables are the
projection maps onto the individual coordinates.
You should already be familiar with joint densities for continuous random
variables, and joint probability mass functions for discrete random variables
(that is, a list of probabilities P{X = x, Y = y}). These are special tools,
which are applicable in limited situations; even when they can be applied,
they are less often useful than you might think. It’s usually best to think
about joint distributions as a convenient way of describing interesting events
that we refer back to the space on which the random variables were originally
defined.

2.2 Independence
Events E_1, . . . , E_n are said to be independent if
$$P(E_1 \cap \cdots \cap E_n) = P(E_1) \cdots P(E_n), \qquad (2.1)$$
and the same product identity holds for every subcollection of the events.

Random variables X1 , . . . , Xn are said to be independent if the events


{X1 ∈ A1 }, . . . , {Xn ∈ An } are independent for any subsets Ai ⊂ R for
which these probabilities are defined. That is,

P{X1 ∈ A1 , . . . , Xn ∈ An } = P{X1 ∈ A1 } · P{X2 ∈ A2 } · · · · · P{Xn ∈ An }.


(2.2)

An infinite collection of events or of random variables is said to be indepen-


dent if any finite subcollection is independent.
It doesn’t take much theory to describe the joint distribution of indepen-
dent random variables. Once we have described the marginal distributions,
the joint distribution follows automatically.
Theorem 2.1. If (Ωi , Pi , Fi ) are any probability spaces, then it is possible
to define a space (Ω, P, F) with Ω the Cartesian product of the Ωi , such that
the projections are independent random variables.
This result, a simple version of the Kolmogorov Extension Theorem, in-
volves measure theory. It may be found in standard textbooks.
Since we are not concerned with measure theory in these lectures, we
will simply take this as given. We will commonly make statements like
“Let X1 , X2 , . . . be i.i.d. (independent identically distributed) random
variables with distribution such-and-such;” or, given a random variable X,
we might say, “Let Y be an independent copy of X.”

2.3 Marginal distributions


When (X1 , . . . , Xk ) are jointly defined random variables, the distribution of
each individual Xi is called a marginal distribution. If the random vari-
ables take on a discrete set of values — call them {0, 1, 2, . . . } for simplicity
— the marginal distribution of X1 is given by

$$P\{X_1 = j\} = \sum_{j_2=0}^{\infty} \cdots \sum_{j_k=0}^{\infty} P\{X_1 = j,\ X_2 = j_2, \ldots, X_k = j_k\};$$

and similarly for the other marginals.


If the joint distribution is continuous, with joint density f , then the
marginal of X1 has a density f1 given by
$$f_1(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x, x_2, x_3, \ldots, x_k)\, dx_2 \cdots dx_k.$$

Example 2.1: Uniform density on the unit disk

We define the uniform distribution on a region S of Rk as the


distribution with density
$$f_S(x) = \begin{cases} \dfrac{1}{\mathrm{Vol}(S)} & \text{if } x \in S, \\ 0 & \text{if } x \notin S. \end{cases}$$

This represents the intuition of picking a point uniformly at ran-


dom from inside S. All regions of S have probability proportional
to their size, and points outside S have 0 probability.
If S is the unit disk in the plane, we have
$$f_S(x, y) = \begin{cases} \dfrac{1}{\pi} & \text{if } x^2 + y^2 < 1, \\ 0 & \text{if } x^2 + y^2 \ge 1. \end{cases}$$

We have two natural random variables X and Y , defined as the


coordinate projections. These are clearly not independent. For
instance, P{X > 0.9} > 0 and P{Y > 0.9} > 0, but P{Y >
0.9 and X > 0.9} = 0, so it is not the product.
The marginal distribution of X has density
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi}\, dy = \frac{2}{\pi}\sqrt{1-x^2} \quad \text{for } -1 < x < 1.$$
The marginal distribution of Y is identical.
Independent random variables X and Y with this same density
would have the joint density
$$f'(x, y) = \frac{4}{\pi^2}\sqrt{1-x^2}\,\sqrt{1-y^2}.$$


2.4 Joint densities


One way of defining multiple random variables is by way of a joint density.
If f : Rk → R+ is an integrable function, then we can define a probability
triple whose sample space is Rk by
$$P(A) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \mathbf{1}_A(x_1, \ldots, x_k)\, f(x_1, \ldots, x_k)\, dx_1 \cdots dx_k,$$

for appropriate events A. To be a joint density, the function f must be


nonnegative, and its integral over all of Rk must be 1.
We can think of the projections onto the individual coordinates as defining k random variables. In particular,

Theorem 2.2. Suppose (X1 , . . . , Xk ) are random variables with joint den-
sity fX : U → R+ , where U is an open region in Rk . Suppose g : U → Rk
is a finite-to-one map which is differentiable1 with nonzero Jacobian deter-
minant. Define random variables (Y1 , . . . , Yk ) = g(X1 , . . . , Xk ). Then these
have a joint density fY given by
$$f_Y(y_1, \ldots, y_k) = \sum_{\substack{(x_1, \ldots, x_k):\\ g(x_1, \ldots, x_k) = (y_1, \ldots, y_k)}} \frac{f_X(x_1, \ldots, x_k)}{|D_{(x_1, \ldots, x_k)}\, g|}, \qquad (2.3)$$

where |D(x1 ,...,xk ) g| is the absolute value of the determinant of the Jacobian
matrix of g at (x1 , . . . , xk ).

Note that when k is 2, and we write g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)), we have
$$|Dg| = \left| \frac{\partial g_1}{\partial x_1}\frac{\partial g_2}{\partial x_2} - \frac{\partial g_1}{\partial x_2}\frac{\partial g_2}{\partial x_1} \right|.$$
As in the single-variable case, the density may not be defined at all points
of Rk ; it may be infinite, or simply undefined. We use the restriction to a
region U to exclude the problematic points. (Problematic points include those at which the function g is not differentiable. If we are excluding single points — or, indeed, lower-dimensional sets — this doesn’t affect the integral.) A standard example is the map onto polar coordinates, described in section 2.4.1.

Proof. When g is invertible, this is simply the change-of-variables formula


from analysis. We give an intuitive derivation; it can be made rigorous
by standard methods, available in most real-analysis texts. The proba-
bility of finding Y in an infinitesimal box B := (y1 , y1 + dy1 ) × (y2 , y2 +
dy2 ) × · · · × (yk , yk + dyk ) can be described in two ways: By definition, it
is f_Y(y_1, \ldots, y_k)\,dy_1 \cdots dy_k. It is also the probability of finding X in the infinitesimal parallelogram that maps onto B, called g^{-1}(B). This parallelogram has one corner at x = g^{-1}(y), and a volume |dx_1 \cdots dx_k| given by
$$dy_1 \cdots dy_k = \left|\frac{dy_1 \cdots dy_k}{dx_1 \cdots dx_k}\right| \cdot |dx_1 \cdots dx_k| = |D_x g| \cdot |dx_1 \cdots dx_k|.$$
[Footnote 1, attached to “differentiable” in Theorem 2.2: For our purposes we can take “differentiable” to mean that all partial derivatives exist and are continuous. If you follow the Part A multivariable calculus lectures you’ll learn that this is a slightly stronger condition than being differentiable as a function from R^k to R^k.]

In other words, we are using the fact that the Jacobian determinant gives
the ratio of volumes of infinitesimal boxes under the transformation. So
$$f_Y(y_1, \ldots, y_k)\, dy_1 \cdots dy_k = f_X(x_1, \ldots, x_k)\, \frac{dy_1 \cdots dy_k}{|D_{(x_1, \ldots, x_k)}\, g|},$$

which proves the result.


When g is not invertible, the assumptions still imply that there is a
finite set of pre-images of y, and g is invertible locally around all but a
lower-dimensional collection of points.

2.4.1 Example: Uniform distribution on a disk


We can also define random variables
$$R(x, y) = \sqrt{x^2 + y^2}, \quad\text{and}\quad \Theta(x, y) = \begin{cases} \arctan(y/x) & \text{if } x \ge 0,\ y \ge 0, \\ 2\pi + \arctan(y/x) & \text{if } x \ge 0,\ y < 0, \\ \pi + \arctan(y/x) & \text{if } x < 0. \end{cases}$$

(Here we are taking the definition of arctan z ∈ [−π/2, π/2]. This is just the
standard assignment of angles in [0, 2π) to points in the plane.)
Take g(x, y) := (R(x, y), Θ(x, y)). Note that Θ is undefined for (x, y) =
(0, 0). So our region U is the complement of the origin. On U the function
is 1-1 onto the strip (0, ∞) × [0, 2π). A point (r, θ) in the strip has a single
inverse image, given by (r cos θ, r sin θ). The Jacobian determinant is given by
$$\begin{vmatrix} \dfrac{\partial R}{\partial x} & \dfrac{\partial R}{\partial y} \\[1mm] \dfrac{\partial \Theta}{\partial x} & \dfrac{\partial \Theta}{\partial y} \end{vmatrix} = \begin{vmatrix} \dfrac{x}{\sqrt{x^2+y^2}} & \dfrac{y}{\sqrt{x^2+y^2}} \\[1mm] \dfrac{-y}{x^2+y^2} & \dfrac{x}{x^2+y^2} \end{vmatrix} = (x^2 + y^2)^{-1/2}.$$
If g(x, y) = (r, θ), then |D_{(x,y)} g| = r^{-1}.


Points (r, θ) with r ≥ 1 have density 0 at their inverse image; points with θ ∉ [0, 2π) or r ≤ 0 have no inverse image; points

with 0 < r < 1 and 0 ≤ θ < 2π have an inverse image with density 1/π.
Then the pair (R, Θ) has density
$$f_{R,\Theta}(r, \theta) = \begin{cases} 0 & \text{if } r \ge 1 \text{ or } \theta \notin [0, 2\pi); \\ \dfrac{1/\pi}{r^{-1}} = \dfrac{r}{\pi} & \text{if } 0 < r < 1 \text{ and } 0 \le \theta < 2\pi. \end{cases}$$

Note that this density may be written trivially as a product of a function


of r and a function of θ. That means that R and Θ are independent. The
marginal distribution of R has density 2r on the interval (0, 1), and the
marginal distribution of Θ is uniform on [0, 2π).
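As a sanity check on this factorisation, the following sketch (Python; rejection sampling from the square is my own choice of method, not part of the notes) draws uniform points on the disk, converts to (R, Θ), and compares a few empirical probabilities with P{R ≤ r} = r² and with the product form of the joint cdf. It is only a crude numerical check, not a proof of independence.

```python
import numpy as np

rng = np.random.default_rng(2)

# Uniform points on the unit disk by rejection from the square [-1, 1]^2
pts = rng.uniform(-1, 1, size=(400_000, 2))
pts = pts[pts[:, 0]**2 + pts[:, 1]**2 < 1]
x, y = pts[:, 0], pts[:, 1]

r = np.hypot(x, y)
theta = np.arctan2(y, x) % (2 * np.pi)   # angles assigned in [0, 2*pi)

# Marginal of R: P(R <= r) should be r^2 (density 2r on (0, 1))
for q in (0.25, 0.5, 0.75):
    print(f"P(R <= {q}) ~ {np.mean(r <= q):.3f}  (exact {q**2:.3f})")

# A crude independence check: the joint cdf factorises at a test point
p_joint = np.mean((r <= 0.5) & (theta <= np.pi))
print(f"P(R <= 0.5, Theta <= pi) ~ {p_joint:.3f}  vs product {0.25 * 0.5:.3f}")
```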

2.4.2 Independent standard normal random variables


Let X and Y be independent standard normal random variables. Their joint
density is
$$f(x, y) = \frac{1}{2\pi}\, e^{-(x^2+y^2)/2}.$$

Define g = (R, Θ) as in section 2.4.1. Then the random variables (R, Θ)
have density
$$f_{R,\Theta}(r, \theta) = \frac{1}{2\pi}\, r e^{-r^2/2}.$$

Thus R and Θ are independent; Θ is uniformly distributed on the circle
[0, 2π) and R has density r e^{-r²/2}.
A couple of observations: We say that the distribution of 2 indepen-
dent standard normal random variables is rotationally symmetric. In fact,
this holds for any number of independent standard normal random variables: the radius R = \sqrt{Z_1^2 + \cdots + Z_k^2} is independent of the direction
(Z1 /R, . . . , Zk /R), and the latter is a uniform random point on the (k − 1)-
dimensional sphere. (It’s hard to define a uniform random point on the
sphere with the methods we have, so we won’t do it formally.)
Another way of putting this spherical symmetry is that an orthogonal linear transformation of a collection of independent standard normal random
variables is also a collection of independent standard normal random vari-
ables, which we will show in the next section. It is an interesting fact that
this property uniquely defines the normal distribution. That is, if X and Y
are independent random variables, and two orthogonal linear combinations
of X and Y are also independent, then X and Y are normal. (See [Fel71,
Section III.6].)

2.4.3 Multivariate normal distribution


The random variables (X1 , . . . , Xk ) are said to have a multivariate normal
distribution if every linear combination a1 X1 + · · · + ak Xk has a normal
distribution. It is clear that for any k × k matrix M , and any vector of
constants (µi ), if Z1 , . . . , Zk are i.i.d. standard normal random variables,
then
$$\begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix} := M \begin{pmatrix} Z_1 \\ \vdots \\ Z_k \end{pmatrix} + \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} \qquad (2.4)$$
is multivariate normal, since sums of independent normal random variables
are normal. It is slightly less obvious (we will sketch a proof in section ??)
that all multivariate normal random variables may be written in this form.
Let us consider the case k = 2. We specify five parameters: µX , µY , σX ,
σY , ρ. (These are the expectations and the standard deviations of X and
Y , and the correlation of X and Y , terms which should be familiar already,
but which are defined in Lecture 3.) Here µX , µY ∈ R, σX , σY ∈ R+ , and
ρ ∈ (−1, 1).
We can define a bivariate normal pair (X, Y ) by

$$X = \sigma_X Z_1 + \mu_X, \qquad Y = \rho\,\sigma_Y Z_1 + \sqrt{1-\rho^2}\,\sigma_Y Z_2 + \mu_Y. \qquad (2.5)$$

We can compute a joint density for (X, Y ). The mapping


$$g(z_1, z_2) = \bigl(\sigma_X z_1 + \mu_X,\ \rho\,\sigma_Y z_1 + \sqrt{1-\rho^2}\,\sigma_Y z_2 + \mu_Y\bigr)$$
is bijective (it is linear), and the Jacobian determinant is $\sqrt{1-\rho^2}\,\sigma_X \sigma_Y$.
We have the inverse image
$$g^{-1}(x, y) = \left( \frac{x - \mu_X}{\sigma_X},\ \frac{y - \mu_Y - \rho(x - \mu_X)\sigma_Y/\sigma_X}{\sigma_Y\sqrt{1-\rho^2}} \right);$$

the independent normal random variables have joint density


$$f_0(z_1, z_2) = \varphi(z_1)\,\varphi(z_2) = \frac{1}{2\pi}\, e^{-(z_1^2 + z_2^2)/2}.$$

Thus the bivariate normal random variable has joint density
$$\begin{aligned}
f_\rho(x, y) &= \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2}\left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 + \left(\frac{y - \mu_Y - \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)}{\sigma_Y\sqrt{1-\rho^2}}\right)^2 \right] \right\} \\
&= \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)}\left[ \frac{(x-\mu_X)^2}{\sigma_X^2} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} \right] \right\} \qquad (2.6)
\end{aligned}$$

Notice that the final form is symmetric in X and Y .


More generally, the multivariate normal random variables defined in (2.4)
have joint distribution
$$f(x_1, \ldots, x_k) = \frac{1}{(2\pi)^{k/2}\,|\det M|}\, e^{-\|M^{-1}(x - \mu)\|^2/2}, \qquad (2.7)$$

where k(y1 , . . . , yk )k2 = y12 + · · · + yk2 . Note, in particular, that when M is


an orthogonal matrix and µi = 0, the k random variables (X1 , . . . , Xk ) are
also independent standard normal.
This formula is rarely used (and you don’t need to learn it for the exam!)
For many problems it is easier to think of a multivariate normal distribution
not as a collection of variables with density given by (2.7), but as the linear
image of independent standard normal random variables.

Example 2.2: Bivariate normal in a quadrant

Let (X, Y ) be the bivariate normal random variables defined


above. Compute P{X > 0, Y > 0}.

Solution: Take µ_X = µ_Y = 0. Then X > 0 ⟺ Z_1 > 0, and then Y > 0 ⟺ Z_2 > −\frac{\rho}{\sqrt{1-\rho^2}} Z_1. In other words, (Z_1, Z_2) lies in the infinite sector defined by π/2 > θ > −arctan\bigl(ρ/\sqrt{1-\rho^2}\bigr) in polar
coordinates. Since θ is uniform on [0, 2π) (as shown in section
2.4.2), we conclude that
$$P\{X > 0,\ Y > 0\} = \frac{1}{4} + \frac{1}{2\pi}\arctan\frac{\rho}{\sqrt{1-\rho^2}}.$$

Note that when ρ = 0 the probability is 1/4, which we know must


be true by independence. When ρ is positive the probability
is greater than 1/4, which again makes sense, since X and Y


are positively correlated: both being positive at the same time
should be more likely.
Of course, the event we chose was special. If we wanted a number
for, say, P{X > 1, Y > 1}, we would probably do better to
use the joint density and do a numerical approximation on a
computer. 
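A simulation is a quick way to sanity-check the quadrant formula (and to catch sign errors). In the sketch below (Python; ρ = 0.6 and µ_X = µ_Y = 0, σ_X = σ_Y = 1 are arbitrary illustrative choices, not part of the notes) the pair (X, Y) is built exactly as in (2.5), and the empirical frequency of {X > 0, Y > 0} is compared with 1/4 + arcsin(ρ)/(2π), which equals the arctan expression above.

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.6, 1_000_000

z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
x = z1                                        # sigma_X = 1, mu_X = 0
y = rho * z1 + np.sqrt(1 - rho**2) * z2       # sigma_Y = 1, mu_Y = 0

empirical = np.mean((x > 0) & (y > 0))
exact = 0.25 + np.arcsin(rho) / (2 * np.pi)
print(f"empirical {empirical:.4f}, formula {exact:.4f}")
```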

Example 2.3: Continuous random variables with no joint


density

Let X have uniform distribution on (0, 1) and Y = X 2 . Describe


the joint distribution of (X, Y), and compute P{X < 1/2, Y < 1/2}
and P{X > 0.6, Y > 0.6}.
Solution: There is no canonical way of writing this joint dis-
tribution. A convenient and formally correct way of describing
it is to state that the sample space is Ω = (0, 1) and P the uni-
form distribution; and X is the inclusion map taking ω ∈ (0, 1)
to itself, and Y the map ω 7→ ω 2 . Alternatively, we can say
Ω′ is the subset of the square (0, 1) × (0, 1) corresponding to the graph G = {(x, y) : y = x²}, and for A ⊂ Ω′ we define P′(A) = P(X(A)), where X : Ω′ → R is the projection on the first coordinate, so X((x, y)) = x. The third possibility is that we say Ω″ = (0, 1) × (0, 1) is the sample space, but the probability distribution is P″(A) = P′(A ∩ G) — that is, the sample space
is the whole square, but the probability is defined purely by its
intersection with the curve G. We say the probability “lives on”
the subset G.
These are three slightly different ways of saying exactly the same
thing. (This is more formal detail than you really need for such
a simple example, but it’s worth thinking through occasionally
how a basic example fits in with our general framework.)
The computation of probabilities on this set is illustrated in Fig-
ure 2.1. If A = {(x, y) : x < 1/2, y < 1/2}, then its intersection with G is {(x, y) : x < 1/2, y = x²}, so P″(A) is the same as P{X < 1/2} = 1/2, where X has uniform distribution on (0, 1).

Figure 2.1: Computing the probabilities for events in Example 2.3. The
red square is the event {X < 1/2, Y < 1/2}, and the green square is the event
{X > 0.6, Y > 0.6}. Note that the probability is determined solely by the
intersection of the event with the curve G = {(x, y) : y = x2 }, shown here
in black.

If B = {(x, y) : x > 0.6, y > 0.6}, then its intersection with G is {(x, y) : y > 0.6, x = √y} = {(x, y) : x > √0.6, y = x²}, so P″(B) is the same as P{X > √0.6} = 1 − √0.6 ≈ 0.225.
The marginal distribution of X is, of course, U (0, 1). Since Y =
X 2 , the change-of-variables formula tells us that Y is continuous,
with density

$$f_Y(y) = \frac{f_X(\sqrt{y})}{2\sqrt{y}} = \frac{1}{2\sqrt{y}} \quad \text{for } 0 < y < 1$$

(and 0 elsewhere).
Are we sure that (X, Y ) doesn’t have a density? Suppose there
were a joint density f_{X,Y}. For any (x, y) ∉ G the density would have to be 0. Thus the marginal density of X would be ∫₀¹ f_{X,Y}(x, y) dy = 0, since we are integrating a function that is 0 everywhere but
at a single point. 
Fact 2.3. If (X_1, . . . , X_n) are random variables such that P{(X_1, . . . , X_n) ∈ A} = 1 for some (n−1)-dimensional set A, then the joint distribution cannot
be described by a joint density. That is, the random variables are not jointly
continuous.

Example 2.4: No joint density, normal example

Let X and Y be independent standard normal random variables


and Z = X + Y. Then Z is normal with expectation 0 and variance 2. Since (X, Y, Z) ∈ {(x, y, z) : x + y = z} with probability
1, there is no joint density.
We could describe (X, Y, Z) as multivariate normal with covari-
ance matrix (see section 3.3.1)
$$\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix},$$
which is a singular matrix. 

2.4.4 The convolution formula


If X and Y are jointly defined random variables, we are often interested in
the distribution of Z := X + Y . If X and Y have a joint density fX,Y , we
compute the density of Z by first taking (x, y) 7→ (x, x + y) and then the
marginal (x, z) 7→ z. If g(x, y) = (x, x + y), this is an invertible map with
Jacobian determinant always equal to 1, and g −1 (x, z) = (x, z − x). Thus,
the density of (X, Z) is
fX,Z (x, z) = fX,Y (x, z − x).
The density of Z is then the second marginal density, computed as
$$f_Z(z) = \int_{-\infty}^{\infty} f_{X,Z}(x, z)\, dx = \int_{-\infty}^{\infty} f_{X,Y}(x, z - x)\, dx.$$

If X and Y are independent, the distribution of Z is called the con-


volution of the distribution of X and the distribution of Y . If they have
densities fX and fY , then Z has density
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx. \qquad (2.8)$$

Example 2.5: Gamma distribution

Compute the convolution of the exponential distribution with


parameter λ with itself.

Solution: Let f be the exponential density f (x) = λe−λx for


x > 0, and f (x) = 0 for x ≤ 0. The convolution has density
$$(f * f)(z) = \int_{-\infty}^{\infty} f(x) f(z - x)\, dx = \int_0^z \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda(z-x)}\, dx = \lambda^2 e^{-\lambda z} \int_0^z dx = \lambda^2 z\, e^{-\lambda z}.$$
This is the Gamma density with shape parameter 2 and rate
parameter λ. 
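The Gamma answer is easy to check by simulating X + Y; a minimal sketch (Python, with λ = 2 chosen arbitrarily as an illustration) compares the empirical cdf with F(z) = 1 − e^{−λz}(1 + λz), the integral of λ²ze^{−λz}.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n = 2.0, 500_000

# numpy's exponential is parametrised by the scale 1/lambda
z = rng.exponential(1 / lam, n) + rng.exponential(1 / lam, n)

for t in (0.5, 1.0, 2.0):
    exact = 1 - np.exp(-lam * t) * (1 + lam * t)   # Gamma(2, lam) cdf
    print(f"P(Z <= {t}) ~ {np.mean(z <= t):.4f}  (exact {exact:.4f})")
```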

2.5 Mixing discrete and continuous random vari-


ables
There is nothing about our framework that requires the individual random
variables be of the same sort: All continuous or all discrete. Furthermore,
just because two random variables are continuous (defined by a density), it
need not follow that they are jointly continuous (defined by a joint density).
This is not a problem, but it means that we cannot write down a canonical
version of the joint distribution as a product space with a probability mass
function or a density. Instead, we will generally rely on transformations and
mixtures, as described at the beginning of this lecture. We will describe this
at greater length in Lecture 3.

Example 2.6: Mixed discrete and continuous

Let X1 , . . . , Xn be i.i.d. standard normal, and fix z ∈ R. Let


$$Y := \#\{i : X_i < z\} = \sum_{i=1}^{n} \mathbf{1}_{\{X_i < z\}}.$$

Then Y has binomial distribution with parameters (n, p), where p = Φ(z) = ∫_{−∞}^{z} (2π)^{−1/2} e^{−x²/2} dx. What is the joint distribu-
tion of (X1 , Y )? This is best described in terms of conditional
distributions, so we postpone it to Example 4.14. 

2.5.1 Independence
If X and Y are jointly continuous then we can tell that they are independent
by looking at the joint density: X and Y are independent if and only if this
is a product of a function of x and a function of y. If X and Y are discrete
then we simply need to check the identity P{X = x, Y = y} = P{X =
x} · P{Y = y}. But what do we do when one is continuous and the other
discrete? What is it that we need to check?
Suppose X is continuous and Y discrete. Returning to first principles, we
see that X and Y are independent when the events {X ∈ A} and {Y ∈ B}
are independent for all appropriate sets A and B. Of course, we don’t need
to check this for all sets, just for enough sets to fully specify the distribution.
In this case, it will suffice to take A to be any interval of the form (−∞, x]
and B only singletons. (There are certainly other choices possible; this
choice is often a convenient one.) That is, X and Y are independent if and
only if for any real x and y,

P{X ≤ x, Y = y} = F_X(x) · P{Y = y},

where FX is the cdf of X.


We follow our standard paradigm of seeing joint distributions as arising
from defining multiple random variables on the same sample space in the
following example.

Example 2.7: Independence for discrete-continuous pair

Let Ω = [0, 10) and P the uniform distribution. Let Y = ⌊ω⌋


(the integer part of ω) and X = ω − Y (the fractional part of ω).
Are X and Y independent?
Solution: We compute first the marginal distributions: Y is a
discrete random variable taking values in {0, 1, . . . , 9}. For any
y ∈ {0, 1, . . . , 9}, Y (ω) = y exactly when ω ∈ [y, y + 1), so

P{Y = y} = P [y, y + 1) = 1/10.

That is, Y is uniformly distributed. Similarly, X takes values


in [0, 1) and for x ∈ [0, 1), X(ω) ≤ x if and only if ω ∈ A :=
[0, x] ∪ [1, 1 + x] ∪ · · · ∪ [9, 9 + x]. This set has total length 10x,
so P (A) = 10x/10 = x, so X is also uniformly distributed. To
test independence we consider, for 0 ≤ x < 1 and y ∈ {0, 1, . . . , 9},
$$P\{X \le x \text{ and } Y = y\} = P([y, y + x]) = \frac{x}{10} = P\{X \le x\} \cdot P\{Y = y\}.$$
So X and Y are indeed independent.
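The same verification can be done empirically. The sketch below (Python, an illustrative addition) draws ω uniformly from [0, 10), splits it into integer and fractional parts, and checks the product rule at a few points (x, y).

```python
import numpy as np

rng = np.random.default_rng(5)
omega = rng.uniform(0, 10, 1_000_000)
Y = np.floor(omega)          # integer part, uniform on {0, ..., 9}
X = omega - Y                # fractional part, uniform on [0, 1)

for x, y in [(0.3, 2), (0.7, 9)]:
    joint = np.mean((X <= x) & (Y == y))
    prod = np.mean(X <= x) * np.mean(Y == y)
    print(f"x={x}, y={y}: joint {joint:.4f}, product {prod:.4f}")
```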


Lecture 3

Expectations, moments, and


inequalities

Summary: Applications of expectation, variance and covariance. Proof


of Markov’s and Chebyshev’s inequalities. Linearity of expectations, with
applications to estimating probabilities of complex events. Using moments
to bound tail probabilities.
Key concepts: expectation, covariance, variance, correlation, Chebyshev
inequality, Markov Inequality, tail bounds, moments, moment bounds

3.1 Expectation
The expectation of a real-valued random variable is the average value,
where all possible values are weighted by their probabilities.

Definition 3.1. We define the expectation of a nonnegative real-valued ran-


dom variable X with cdf F to be
$$E[X] := \int_0^{\infty} (1 - F(x))\, dx, \qquad (3.1)$$

if this quantity is finite. For a random variable X that takes on negative


values (as well as positive), we define
$$E[X] := E[X_+] - E[X_-] = \int_0^{\infty} (1 - F(x))\, dx - \int_{-\infty}^{0} F(x)\, dx. \qquad (3.2)$$

This is defined only if both expectations are finite. In that case, we say X
is integrable; otherwise, we say it is nonintegrable.


Note that we use the standard notation x+ = max{x, 0} and x− =


max{−x, 0}, for real numbers x.
When E[X+ ] is infinite and E[X− ] is finite, we sometimes say that E[X] =
∞. Similarly, when E[X− ] is infinite and E[X+ ] is finite, we sometimes say
that E[X] = −∞.

Theorem 3.2. If the random variable X has density f, then X is integrable if and only if ∫_{−∞}^{∞} |x| f(x) dx < ∞, and
$$E[X] = \int_{-\infty}^{\infty} x f(x)\, dx. \qquad (3.3)$$

If the random variable X takes on a countable set of values x_i with probabilities p_i, then X is integrable if and only if ∑_{i=1}^{∞} |x_i| p_i < ∞. In that case,
$$E[X] = \sum_{i=1}^{\infty} p_i x_i. \qquad (3.4)$$

Proof. This proof is non-examinable. The continuous case is just integration


by parts: We consider a density concentrated on the positive real line. Write
$$\int_{-\infty}^{\infty} x f(x)\, dx = \int_0^{\infty}\!\!\int_0^{x} f(x)\, dt\, dx = \int_0^{\infty}\!\!\int_t^{\infty} f(x)\, dx\, dt = \int_0^{\infty} (1 - F(t))\, dt,$$
changing the order of integration (Fubini’s Theorem) in the middle step.

For random variables with positive and negative values, we apply this result
to the positive and negative parts separately.
Assume first that the set of values taken on by X are nonnegative and

discrete, so we may assume 0 ≤ x1 < x2 < · · · . Then

$$\begin{aligned}
\sum_{i=1}^{\infty} p_i x_i &= \sum_{i=1}^{\infty} p_i \sum_{j=1}^{i} (x_j - x_{j-1}) \quad \text{where we take } x_0 = 0 \\
&= \sum_{j=1}^{\infty} \sum_{i=j}^{\infty} p_i\, (x_j - x_{j-1}) \quad \text{(reversing the order of summation)} \\
&= \sum_{j=1}^{\infty} \bigl(1 - F(x_{j-1})\bigr)(x_j - x_{j-1}) \\
&= \sum_{j=1}^{\infty} \int_{x_{j-1}}^{x_j} (1 - F(x))\, dx \quad \text{since } F(x) = F(x_{j-1}) \text{ on the interval } (x_{j-1}, x_j) \\
&= \int_0^{\infty} (1 - F(x))\, dx.
\end{aligned}$$

The extension to general random variables X which take on countably many


values may be accomplished by approximating X by a random variable
taking values on a discrete set. The details are a bit advanced for this
context.

3.2 Essential properties of expectations


(E1) For any event A, E[1A ] = P(A).

(E2) If X and Y are random variables and P{X ≥ Y } = 1, then E[X] ≥


E[Y ].

(E3) For any constant c and any random variable X, E[cX] = cE[X].

(E4) For any random variables X and Y , E[X + Y ] = E[X] + E[Y ].

(E5) If X and Y are independent, then E[XY ] = E[X]E[Y ].

(E6) If X has density f and g : R → R so that g(X) is a random variable, then E[g(X)] = ∫_{−∞}^{∞} f(x) g(x) dx. More generally, if (X_1, . . . , X_k) have joint density f and g : R^k → R so that g(X_1, . . . , X_k) is a random variable, then
$$E[g(X_1, \ldots, X_k)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \ldots, x_k)\, f(x_1, \ldots, x_k)\, dx_1 \cdots dx_k. \qquad (3.5)$$

(E7) If X is any integrable random variable then
$$\bigl| E[X] \bigr| \le E\bigl[|X|\bigr]. \qquad (3.6)$$

(E8) If X is a nonnegative random variable with cdf F, and g : R → R_+ a non-negative non-decreasing differentiable function with lim_{x→∞} (1 − F(x)) g(x) = 0, then
$$E[g(X)] = \int_{-\infty}^{\infty} \bigl(1 - F(x)\bigr)\, g'(x)\, dx. \qquad (3.7)$$

The first three properties are almost trivial. The linearity of expecta-
tions (E4) is absolutely essential; proving it in full generality requires some
measure theory. For discrete random variables it is quite simple:
$$\begin{aligned}
E[X + Y] &= \sum_z P\{X + Y = z\}\, z \\
&= \sum_z \sum_x P\{X = x,\ Y = z - x\}\,(x + z - x) \\
&= \sum_y \sum_x P\{X = x,\ Y = y\}\,(x + y) \\
&= \sum_x \sum_y P\{X = x,\ Y = y\}\, x + \sum_y \sum_x P\{X = x,\ Y = y\}\, y \\
&= \sum_x P\{X = x\}\, x + \sum_y P\{Y = y\}\, y \\
&= E[X] + E[Y].
\end{aligned}$$

Similarly, if they have a joint density f , then by (3.5)


$$\begin{aligned}
E[X + Y] &= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (x + y)\, f(x, y)\, dx\, dy \\
&= \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x, y)\, dy \right) dx + \int_{-\infty}^{\infty} y \left( \int_{-\infty}^{\infty} f(x, y)\, dx \right) dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\, dx + \int_{-\infty}^{\infty} y f_Y(y)\, dy.
\end{aligned}$$

Property (E6) is straightforward in a measure-theoretic context. For


invertible functions g it is a simple consequence of the change-of-variables
formula. For g non-negative and non-decreasing, Y = g(X) has cdf

$$F_Y(y) = P\{Y \le y\} = P\{X \le g^{-1}(y)\} = F(g^{-1}(y)).$$

So, making the change of variables y = g(x), so that dy/dx = g′(x),
$$E[Y] = \int_0^{\infty} \bigl[1 - F(g^{-1}(y))\bigr]\, dy = \int_{-\infty}^{\infty} \bigl(1 - F(x)\bigr)\, g'(x)\, dx.$$

Property (E5) is straightforward for distributions X and Y that are


either discrete or continuous. The general proof, again, becomes simple in a
measure-theory context. For these and other proofs see a book like [Ros06].
Property (E7) is proved by writing X = X_+ − X_−, where X_+ = X 1_{\{X ≥ 0\}} and X_− = −X 1_{\{X ≤ 0\}}. Then X_+ and X_− are both nonnegative random variables and |X| = X_+ + X_−, so
$$\bigl| E[X] \bigr| = \bigl| E[X_+] - E[X_-] \bigr| \le E[X_+] + E[X_-] = E\bigl[ |X| \bigr].$$

Example 3.1: Applications of linearity

Let Y have binomial distribution with parameters (n, p). Com-


pute E[Y ].
Solution: The hard way is direct:
$$\begin{aligned}
E[Y] &= \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}\, k \\
&= (1-p)^n \sum_{k=0}^{n} \binom{n}{k}\, k\, r^k \quad\text{where } r = \frac{p}{1-p} \\
&= nr(1-p)^n \sum_{k=1}^{n} \binom{n-1}{k-1} r^{k-1} \\
&= nr(1-p)^n \sum_{k=0}^{n-1} \binom{n-1}{k} r^k \\
&= nr(1-p)^n (1+r)^{n-1} \quad\text{by the binomial formula} \\
&= nr(1-p)^n (1-p)^{-n+1} \quad\text{since } 1+r = (1-p)^{-1} \\
&= np.
\end{aligned}$$

Alternatively, Y is the number of successes in n trials with prob-


ability of success p. If we take independent random variables
X1 , . . . , Xn to be 1 or 0 with probability p or 1 − p respectively,
to represent the indicator of success in the individual trials, then
Y = X1 + · · · + Xn . Since E[Xi ] = p, we immediately have
E[Y ] = np.

Let (X1 , X2 ) be independent uniformly distributed in the interval


(0, 1), and Y = X1 + X2 . Compute E[X1 + X2 ].
Solution: The hard way: Y has density
$$f_Y(y) = \int_0^1 \mathbf{1}_{\{0 \le y - x \le 1\}}\, dx = \begin{cases} y & \text{if } 0 \le y < 1, \\ 2 - y & \text{if } 1 \le y < 2. \end{cases}$$
So
$$E[Y] = \int_0^1 y \cdot y\, dy + \int_1^2 y\,(2 - y)\, dy = \frac{y^3}{3}\Big|_0^1 + \Big[y^2 - \frac{y^3}{3}\Big]_1^2 = \frac{1}{3} + \Big(4 - \frac{8}{3}\Big) - \Big(1 - \frac{1}{3}\Big) = 1.$$
The easy way: E[X_1] = E[X_2] = ∫_0^1 x dx = 1/2. So E[Y] = E[X_1] + E[X_2] = 1.

3.3 Variance and covariance


Definition 3.3. The variance of a random variable X is
$$\mathrm{Var}(X) := E\bigl[(X - E[X])^2\bigr].$$
The standard deviation of a random variable X is
$$\mathrm{SD}(X) := \sqrt{\mathrm{Var}(X)}.$$
The covariance of random variables X and Y is
$$\mathrm{Cov}(X, Y) := E\bigl[(X - E[X])(Y - E[Y])\bigr].$$

Random variables with Cov(X, Y ) = 0 are said to be uncorrelated.



Theorem 3.4. Basic facts about variance and covariance: Var(X) ≥


0, with equality if and only if X = E[X] with probability 1. (We call such a
random variable, which is not really random, deterministic.)

Var(X) = E[X 2 ] − E[X]2


Cov(X, Y ) = E[XY ] − E[X]E[Y ].

We also have the Cauchy-Schwarz inequality:

Cov(X, Y )2 ≤ Var(X) Var(Y ). (3.8)

Equality holds if and only if there are constants c, b such that X = cY + b


with probability 1.

Proof. By linearity of expectations,
$$\mathrm{Var}(X) = E\bigl[X^2 - 2X E[X] + E[X]^2\bigr] = E[X^2] - 2E[X]E[X] + E[X]^2 = E[X^2] - E[X]^2,$$
$$\mathrm{Cov}(X, Y) = E\bigl[XY - X E[Y] - Y E[X] + E[X]E[Y]\bigr] = E[XY] - E[X]E[Y].$$

We prove the following slightly more general version of the Cauchy-


Schwarz inequality: For any random variables X and Y with E[X 2 ] and
E[Y 2 ] both finite,
E[XY ]2 ≤ E[X 2 ]E[Y 2 ]. (3.9)
(The version (3.8) follows when we replace X and Y by X − E[X] and
Y − E[Y ].)
For any real number c,

0 ≤ E[(X − cY )2 ] = E[X 2 ] + c2 E[Y 2 ] − 2cE[XY ].

Taking c = E[XY ]/E[Y 2 ] we get

$$0 \le E[X^2] - \frac{E[XY]^2}{E[Y^2]},$$

proving (3.9). Equality holds if and only if X − cY = 0 with probability


1.

Example 3.2: Indicators

Let X := 1A and Y := 1B be the indicators of events A and B


respectively. Compute the expectation, variance and covariance.
Solution: X(ω) = 1 precisely when ω ∈ A, and it is 0 otherwise.
Thus E[X] = 1 · P{X = 1} = P(A). Similarly, E[1B ] = P(B).
To compute the variance, we note that X 2 = X and Y 2 = Y . So

$$\mathrm{Var}(\mathbf{1}_A) = E[\mathbf{1}_A] - E[\mathbf{1}_A]^2 = P(A) - P(A)^2 = P(A)\bigl(1 - P(A)\bigr). \qquad (3.10)$$
Similarly, XY = 1A 1B = 1 on A ∩ B, and 0 elsewhere. Thus

Cov(1A , 1B ) = P(A ∩ B) − P(A)P(B). (3.11)

Definition 3.5. The correlation between two nondeterministic random vari-


ables X and Y is defined to be

$$\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \qquad (3.12)$$

The correlation is always between −1 and +1, and is equal to ±1 if and only
if X = cY + b with probability 1, for constants b, c. (The correlation is then
the sign of c.)

Fact 3.6. For any random variables X1 , . . . , Xn ,


$$\mathrm{Var}(X_1 + \cdots + X_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{i \ne j} \mathrm{Cov}(X_i, X_j). \qquad (3.13)$$

If X1 , . . . , Xn are independent, then

Var(X1 + · · · + Xn ) = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ). (3.14)

If they are i.i.d. , then Var(X1 + · · · + Xn ) = n Var(Xi ).



Proof. This is an immediate consequence of linearity: Write ξi := Xi −E[Xi ].


Then Var(Xi ) = E[ξi2 ] and Cov(Xi , Xj ) = E[ξi ξj ]. So

$$\begin{aligned}
\mathrm{Var}(X_1 + \cdots + X_n) &= E\Bigl[\bigl((X_1 + \cdots + X_n) - E[X_1 + \cdots + X_n]\bigr)^2\Bigr] \\
&= E\Bigl[\bigl((X_1 - E[X_1]) + \cdots + (X_n - E[X_n])\bigr)^2\Bigr] \\
&= E\bigl[(\xi_1 + \cdots + \xi_n)^2\bigr] \\
&= E\Bigl[\sum_{i=1}^{n} \xi_i^2 + \sum_{1 \le i \ne j \le n} \xi_i \xi_j\Bigr] \\
&= \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{i \ne j} \mathrm{Cov}(X_i, X_j).
\end{aligned}$$

Example 3.3: Uncorrelated and Independent

It is a trivial consequence of property (E5) that independent


random variables are uncorrelated. What about the converse?
It is easy to come up with counterexamples. For instance, sup-
pose X and Y take the values {−1, 0, +1} with the following
probabilities:
                 Y
            −1     0    +1
      −1     0    1/4    0
 X     0    1/4    0    1/4
      +1     0    1/4    0
Then X and Y have the same marginal distribution: P({−1}) = P({+1}) = 1/4 and P({0}) = 1/2. So E[X] = E[Y] = 0. Also
XY is always 0, since either X or Y is 0 in all outcomes with
positive probability. So Cov(X, Y ) = 0. But X and Y are not
independent. For instance, P{X = 0} = P{Y = 0} = 1/2, but
P{X = 0, Y = 0} = 0.
And here is a continuous example: Let Z be a standard normal
random variable, and W = +1 or −1, each with probability 1/2, independent of Z. Let Y = Z · W. Then Y and Z are

uncorrelated, since E[Y Z] = E[W Z 2 ] = E[W ]E[Z 2 ] = 0, and


E[Y ] = E[Z] = 0. But they are not independent, since |Y | = |Z|.


3.3.1 Covariance matrix


If we have a sequence of random variables X1 , . . . , Xn it will often be conve-
nient to think of the pairwise covariances of the random variables as a matrix. We define the covariance matrix to be the matrix C = (C_{ij})_{1≤i,j≤n}, where C_{ij} = Cov(X_i, X_j).
If we think of the expectation of a matrix of random variables as being
the matrix of expectations, we can write X = (X1 , . . . , Xn )∗ as the column
vector of the Xi ’s (writing ∗ for the transpose) and µ for the column vector
E[X]. Then
$$C = E\bigl[(X - \mu)(X - \mu)^*\bigr]. \qquad (3.15)$$
The following fact is left as an exercise. It is an immediate consequence
of the spectral theorem for symmetric matrices.

Fact 3.7. If X is a vector random variables with expectations µ and covari-


ance matrix C, then C has a nonnegative-definite square-root D. That is,
D is an n × n matrix such that D2 = C. (We also write D = C 1/2 .) Then
the following are equivalent

• C is nonsingular;

• The random variables Xi are linearly independent; that is, there are no
constants a1 , . . . , an such that a1 X1 + · · · + an Xn = 0 with probability
1.

• D is nonsingular;

• D−1 X is a vector of random variables (Y1 , . . . , Yn ) that are uncorre-


lated and have expectation 0 and variance 1.

If X are multivariate normal then C 1/2 is precisely the matrix M de-


scribed in section 2.4.3. If C is nonsingular then C −1/2 (X − µ) is a vector
of i.i.d. standard normal random variables.
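A short sketch of the whitening step in Fact 3.7 (Python; the specific matrix C and the use of the spectral decomposition to compute C^{1/2} are illustrative choices, not part of the notes): we build X = DZ + µ and confirm that D^{−1}(X − µ) has approximately the identity covariance.

```python
import numpy as np

rng = np.random.default_rng(6)
C = np.array([[4.0, 1.0], [1.0, 2.0]])        # a nonsingular covariance matrix
mu = np.array([1.0, -2.0])

# Nonnegative-definite square root D = C^{1/2} from the spectral theorem
w, V = np.linalg.eigh(C)
D = V @ np.diag(np.sqrt(w)) @ V.T
assert np.allclose(D @ D, C)

# X = D Z + mu has covariance C; D^{-1}(X - mu) should be uncorrelated, variance 1
Z = rng.standard_normal((2, 100_000))
X = D @ Z + mu[:, None]
Y = np.linalg.solve(D, X - mu[:, None])
print(np.round(np.cov(X), 2))   # approximately C
print(np.round(np.cov(Y), 2))   # approximately the identity
```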

3.4 Tail bounds


Quite often we compute probabilities when we want an upper bound on the
probability of an extreme event: “How likely is it that the treatment and

control group would have had such different mortality rates if the treatment
really had no effect?” “How likely is it that this stock portfolio will lose
more than £1000 this year?” “How likely is it that floods will go above
the current levees some time in the next 100 years?” Such probabilities of
extreme events in the distribution are called tail probabilities, and bounds of the sort P{Z > z} ≤ ε(z), for an explicit function ε, are called tail bounds.
The basis of most tail bounds is an almost-trivial fact called Markov’s
inequality.

Theorem 3.8. Let X be a nonnegative-valued random variable. Then for


any z > 0
$$P\{X \ge z\} \le \frac{E[X]}{z}. \qquad (3.16)$$
As long as the expectation is finite, this bound goes to 0 as z → ∞. The
reason this is useful is that for complicated random variables expectations
are often easier to compute than probabilities — usually because of linearity.
(In fact, historians say that the notion of expected value arose in the West
much earlier than the notion of probability; cf. [Das95].)
The proof depends on the obvious fact that if there is a high probability
of a high value of X, then the expected value must be correspondingly large.

Proof. Let Xz = X1{X≥z} . Then X ≥ Xz ≥ z1{X≥z} . By properties (E1),


(E2), and (E3),

E[X] ≥ E[Xz ] ≥ E[z1{X≥z} ] ≥ zP X ≥ z .

Dividing through by z yields the desired result.

Markov’s inequality is sharp, in the sense that it is possible to have


random variables for which equality holds. For z ≥ 1 we simply take X to
be the random variable taking on just the values 0 and z, with P{X = 0} =
1 − 1/z and P{X = z} = 1/z. Then E[X] = 1 and P{X ≥ z} = 1/z. So
there is no better conclusion that we can draw based on just the expected
value of X.
If we want better bounds, we need more information about the random
variable. An obvious consequence of Markov’s inequality is:

Corollary 3.9. Let X be a random variable and g : R → R+ an increasing


function. Then
$$P\{X \ge z\} \le \frac{E[g(X)]}{g(z)}.$$

Proof. Since g is monotonic, we have
$$P\{X \ge z\} = P\{g(X) \ge g(z)\} \le \frac{E[g(X)]}{g(z)}.$$

Suppose now Y is an arbitrary random variable. Applying this bound to


the positive random variable X = |Y − E[Y ]|, and the function g(x) = x2 ,
we obtain the celebrated Chebyshev’s Theorem

Theorem 3.10 (Chebyshev’s Theorem). For any random variable Y with


finite variance, and any ε > 0,
$$P\bigl\{|Y - E[Y]| \ge \varepsilon\bigr\} \le \frac{\mathrm{Var}(Y)}{\varepsilon^2}. \qquad (3.17)$$
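To get a feeling for how conservative these inequalities are, the sketch below (Python, an illustrative aside rather than part of the syllabus) compares Markov's and Chebyshev's bounds with the exact upper tail of a Binomial(100, 1/2) random variable.

```python
from math import comb

n, p = 100, 0.5
mean, var = n * p, n * p * (1 - p)

def exact_upper_tail(z):
    # P{X >= z} for X ~ Binomial(n, 1/2)
    return sum(comb(n, k) for k in range(z, n + 1)) / 2**n

for z in (60, 70, 80):
    markov = mean / z                      # Markov on the nonnegative variable X
    chebyshev = var / (z - mean) ** 2      # Chebyshev on |X - E[X]|
    print(f"z={z}: exact {exact_upper_tail(z):.2e}, "
          f"Markov {markov:.3f}, Chebyshev {chebyshev:.3f}")
```

Both bounds are far above the true tail, but Chebyshev's decays much faster as z moves away from the mean, which is exactly the point of using the variance rather than just the expectation.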
We use the variance most commonly because it is easy to compute. If we
want a tail bound that falls off very quickly with z we need to compute the
expectation of a function that grows very quickly. This is one motivation
for defining the moments of a random variable:

Definition 3.11. The k-th moment of a random variable X is E[X k ], if


this exists. If µ = E[X] is the first moment, the k-th central moment is
µk := E[(X − µ)k ], if this exists.

Of course, if k is even then X k is nonnegative, so the k-th moment and


central moment exist, though it may be infinite. The first central moment
is of course always 0.
It’s a consequence of Jensen’s inequality (a crucial relation for more
advanced probability, found in most textbooks, but which isn’t part of the
syllabus of this course) that for k > j, E[|X|j ] ≤ E[|X|k ]j/k . In particular,
if a k-th moment of |X| is finite, then all lower moments of X exist and are
finite.
A calculation of a 2k-th moment (or a bound on a 2k-th moment) gives
you a bound on tail probabilities of |X| that falls off like z −2k . This is
a useful observation because X 2k is a very convenient random variable to
compute expectations of, and it is naturally positive. (Odd moments are less
used for this reason; we need to take absolute values, which is inconvenient.)

Example 3.4: Tail bounds for the simple random walk



Let Xi be i.i.d. , taking on the values +1 and −1 with probability


1/2 each. Let Sn = X1 + · · · + Xn . Compute the even-order
moments, and use these to find a bound on P{Sn > z}.
Solution: Observe that
$$E[X_i^k] = \begin{cases} 1 & \text{if } k \text{ is even}, \\ 0 & \text{if } k \text{ is odd}. \end{cases}$$
Any product of different Xi ’s has expectation 0, unless the differ-
ent i’s come in pairs, so each index that appears in the product
appears an even number of times. (Note that this implies that
all odd-order moments are 0.)
We first compute the second moment:
$$E[S_n^2] = \sum_{i=1}^{n} E[X_i^2] + \sum_{i \ne j} E[X_i X_j] = n,$$

since all products Xi Xj for i 6= j have 0 expectation. So E[Sn2 ] =


n.
In general, E[Sn2k ] is simply the number of ways of making an or-
dered choice of 2k indices (with repetitions allowed) from {1, 2, . . . , n}
so that each one appears an even number of times. We can imag-
ine picking the indices in two steps: First, we make an unordered
choice. This is the same as the number of ways of picking k in-
dices (with repetition) from n, which is a standard combinatorial
object, known to be equal to $\binom{n+k-1}{k}$. (If you haven’t seen it, the
ingenious “Stars and bars” argument for deriving this formula is
in many discrete probability texts; for instance, Proposition 6.2
of [Ros08].)
Then we choose an order. We need to count the number of ways
of ordering 2k objects, consisting of j types, such that elements
of the same type are indistinguishable, such that there are ri
objects of type i. This is (2k)!/r1 ! · · · rj !. This is not very easy
to work with, since the groups have size at least 2. So this factor
is no bigger than (2k)!/2^k, giving us a bound
$$E[S_n^{2k}] \le \frac{n(n+1)\cdots(n+k-1)\,(2k)!}{k!\,2^k}.$$
(This is an example of the principle that exact formulas are often
less useful than good approximations.)

This yields the tail bound
$$P\bigl\{|S_n| > z\bigr\} \le \frac{n(n+1)\cdots(n+k-1)\,(2k)!}{k!\,2^k}\; z^{-2k}$$
for every positive integer k. Just as an example, suppose n =
100, and we want to estimate the probability that Sn ≥ 40. We
get

Moment 2 4 6 8 10 12 14 16 18 20
Tail bound 0.125 0.047 0.030 0.027 0.032 0.046 0.079 0.159 0.364 0.943

Since each of these is a bound on the same probability, we are


free to choose the lowest one, which comes from the 8th moment. Thus P{|S_100| ≥ 40} ≤ 0.027.
How good is this bound? We can compute this probability ex-
actly, observing that the event {|S100 | ≥ 40} is exactly the event
that {#i : Xi = +1 is ≥ 70 or ≤ 30}. This we can compute
from the binomial distribution to be exactly
$$2 \sum_{i=0}^{30} \binom{100}{i}\, 2^{-100} = 8 \times 10^{-5}.$$

So it’s not a very good bound. We’ll see better ones in section
5.2. On the other hand, it’s good to have any bound at all. If the
distribution had been made just a bit more complicated — for
instance, Xi uniformly distributed on [−1, 1] — we could still
have derived these bounds, but the exact complication would
have been nearly impossible. 

Example 3.5: Counting records

We observe a succession of i.i.d. random variables X1 , X2 , . . .


with a continuous cdf F . We count Xk as a “record” if it is larger
than all X1 , . . . , Xk−1 . Let Rn be the number of records in the
first n observations. What are the expectation and variance of
Rn ? What is a number r such that we can be 99% sure that
there will be no more than r records in 1 million observations?

Solution: Let Ak be the event that Xk is a record, and let


ξk = 1Ak be the indicator of Ak . Since the distribution of the
Xi is continuous the probability of a tie is 0, and
$$E[\xi_k] = P(A_k) = \frac{1}{k}.$$
How do we know this? Since X1 , . . . , Xk are independent, all
orders are equally likely. And of all the orders, one out of k
of them will have the largest at position k, so the probability
is 1/k. Similarly, regardless of what numbers are at positions
j + 1, . . . , k, there is a record at Xj only if Xj is the maximum of
X1 , . . . , Xj ; one out of j orders will have this being true. Thus

$$E[\xi_k \xi_j] = P(A_k \cap A_j) = \frac{1}{jk},$$

meaning that ξ_j and ξ_k are independent. Thus Cov(ξ_j, ξ_k) = 0 for j ≠ k, and Var(ξ_k) = 1/k − 1/k². Since R_n = ∑_{k=1}^{n} ξ_k,
$$E[R_n] = \sum_{k=1}^{n} \frac{1}{k} \approx \log n + \gamma \quad (\text{where } \gamma \approx 0.577\ldots \text{ is Euler's constant}),$$
$$\mathrm{Var}(R_n) = \sum_{k=1}^{n} \Bigl(\frac{1}{k} - \frac{1}{k^2}\Bigr) \approx \log n + \gamma - \frac{\pi^2}{6}.$$

We want to find r such that P{Rn ≥ r + 1} ≤ 0.01. We can


bound this by
 Var(Rn )
P{Rn ≥ r+1} ≤ P |Rn −E[Rn ]| ≥ |r+1−E[Rn ]| ≤ .
(r − E[Rn ])2

We have Var(R_n) = 12.748… and E[R_n] = 14.393, so if we want this to be smaller than 0.01 we need
$$\frac{12.748}{(r + 1 - 14.393)^2} \le 0.01,$$
so that
$$r \ge \sqrt{1274.8} + 13.393 = 49.1.$$
Thus, the probability of {Rn ≤ 50} is at least 0.99. (We will
derive a better approximation in section ??.) 
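For a more modest n the record calculation is easy to check by simulation; a sketch (Python, with n = 1000 and 2000 trials chosen purely for speed):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials = 1000, 2000

x = rng.random((trials, n))
# A record at position k means x[:, k] exceeds the running maximum of earlier values
running_max = np.maximum.accumulate(x, axis=1)
records = np.ones((trials, n), dtype=bool)        # the first observation is always a record
records[:, 1:] = x[:, 1:] > running_max[:, :-1]
R = records.sum(axis=1)

k = np.arange(1, n + 1)
print(f"E[R_n]:   simulated {R.mean():.2f}, exact {np.sum(1 / k):.2f}")
print(f"Var(R_n): simulated {R.var():.2f}, exact {np.sum(1 / k - 1 / k**2):.2f}")
```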

Example 3.6: Runs

Let X_1, . . . , X_n be i.i.d. with P{X_i = −1} = P{X_i = +1} = 1/2.


Let Rk = number of runs of length k in the sequence; that is, the
number of subsequences of length k of all the same value. Find
n such that the probability is at least 0.99 of having at least one
run of length 10.
Solution: We compute the expectation and variance, using in-
dicators. Let ξi be the indicator of the event that a run starts
at position i (for 1 ≤ i ≤ n − k + 1); that is, {Xi = Xi+1 = · · · =
X_{i+k−1}}. Then E[ξ_i] = 2^{−k+1}, and E[R_k] = (n − k + 1) 2^{−k+1}, since R_k = ∑_{i=1}^{n−k+1} ξ_i. In order to compute the variance we need
to compute the variances of the ξi , and the covariance of each
pair.
Using the formula (3.10),

Var(ξ_i) = P{run starts at X_i} − P{run starts at X_i}² = 2^{−k+1} − 2^{−2k+2}.

Using the formula (3.11),

Cov(ξ_i, ξ_j) = P{run starts at X_i and run starts at X_j} − 2^{−2k+2}.

Suppose i < j. What is the probability that a run starts at Xi


and at Xj ? If i < j ≤ i + k − 1 then this is equivalent to there
being a run of length k + j − i starting at Xi . If j ≥ i + k then
this is equivalent to there being two independent runs of length
k starting at Xi and Xj . Thus for i < j,
$$\mathrm{Cov}(\xi_i, \xi_j) = \begin{cases} 2^{-k-j+i+1} - 2^{-2k+2} & \text{if } j \le i + k - 1, \\ 0 & \text{if } j \ge i + k. \end{cases}$$

(Note that the covariance is 0 when j = i + k − 1 as well. The


events that a run of length k starts at Xi and that a run of length
k starts at Xi+k−1 make an interesting example of independent
events that do depend on one random variable in common.)
Applying (3.13) we have
$$\begin{aligned}
\mathrm{Var}(R_k) &= \sum_{i=1}^{n-k+1} \mathrm{Var}(\xi_i) + 2 \sum_{i=1}^{n-k} \sum_{j=i+1}^{n-k+1} \mathrm{Cov}(\xi_i, \xi_j) \\
&= (n-k+1)\bigl(2^{-k+1} - 2^{-2k+2}\bigr) + 2 \sum_{i=1}^{n-2k+3} \sum_{\ell=1}^{k-2} \bigl(2^{-k-\ell+1} - 2^{-2k+2}\bigr) \\
&\qquad + 2 \sum_{i=n-2k+4}^{n-k} \sum_{\ell=1}^{n-k+1-i} \bigl(2^{-k-\ell+1} - 2^{-2k+2}\bigr) \\
&= 2^{-k+2}(n-k+1) - 2^{-k+2}(n-2k+5) - (3n-3k+1)\,2^{-2k+2}.
\end{aligned}$$

Note that Xi has up to k −2 successors with nonzero correlation.


Thus, the sums from i = 1 to i = n − 2k + 3 have the full
complement of successors, and produce the same sum, and the
rest of the i’s are matched with successively smaller numbers of
terms. The last step requires playing around with the sum of
geometric series.
Trying a few numbers, we see that for n = 101691, Var(R_k) = 394.4802, and E[R_k] = 198.6152. It follows from Chebyshev’s inequality that
$$P\{R_k = 0\} \le \frac{\mathrm{Var}(R_k)}{E[R_k]^2} = 0.01.$$
So with n = 101691 observations the probability of seeing at least one run of length 10 is at least 0.99.
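At least the expected number of runs is easy to confirm by simulation for smaller n and k; a sketch (Python, with n = 2000 and k = 5 as illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, trials = 2000, 5, 500

x = rng.choice([-1, 1], size=(trials, n))

def count_runs(row, k):
    # number of positions i with X_i = X_{i+1} = ... = X_{i+k-1}
    same = np.ones(len(row) - k + 1, dtype=bool)
    for j in range(1, k):
        same &= row[j:len(row) - k + 1 + j] == row[:len(row) - k + 1]
    return same.sum()

R = np.array([count_runs(row, k) for row in x])
formula = (n - k + 1) * 2.0 ** (-k + 1)
print(f"simulated E[R_k] = {R.mean():.2f}, formula (n-k+1)*2^(-k+1) = {formula:.2f}")
```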

3.5 Are expectations the fundamental objects in


probability?
The theory of probability started with expectations. Before anyone had
written down a probability computation, merchants were figuring out “fair
prices” for investments in risky ventures. (For an engaging account of this
prehistory of probability theory, see [Das95].)
Modern probability theory has returned to its origins. So far, we have
been thinking of probability distributions as functions on sets (events): A
probability takes a set to a number between 0 and 1, with certain properties.
In the language of random variables, the distribution of the random variable
X associates the event A with the number P{X ∈ A}. It turns out, for most

purposes, to be more useful to think about a distribution as a function on


functions: A probability (thought of as a distribution of the random variable
X) associates a function h : R → R with the number E[h(X)].
Notice that this is a larger class of objects. The probabilities of events
may naturally be thought of as expectations of indicator functions of events.
These are discontinuous, suggesting that they might not actually be the most
natural class of functions to focus on. Starting in Lecture 5 we will start
studying probability distributions in terms of the expectations of classes of
smooth functions.
Lecture 4

Conditioning

Summary: Conditioning on non-null events. Bayes’ rule. Multiplication


rule and consequences. Conditional expectations: discrete case. Conditional
distributions and continuous expectations: continuous case. Borel paradox.
Bivariate normal distributions: application to linear regression. Conditional
variance.
Key concepts: Conditional expectation. Conditional distribution. Condi-
tional density.

4.1 Conditioning on non-null events


Definition 4.1. For two events A and B, with P(A) > 0, the conditional
probability of B given A is
$$P(B|A) := \frac{P(A \cap B)}{P(A)}. \qquad (4.1)$$
The conditional probability is the proportion of the probability of A
that is also in B. It is a mathematical formalisation of the notion of the
probability that B happens given the information that A happens.
Three obvious but important consequences:
Multiplication rule For any events A and B,

P(A ∩ B) = P(A)P(B|A).

Law of total probability For any events A and B,

$$P(B) = P(A)\,P(B|A) + P(A^c)\,P(B|A^c). \qquad (4.2)$$


Bayes’ rule For any events A and B,


$$P(B|A) = \frac{P(A|B)}{P(A)}\, P(B) \qquad (4.3)$$

The multiplication rule implies that independence is equivalent to P(B|A) =


P(B), which makes intuitive sense: if A and B are independent, knowing
that A happens doesn’t change the probability of B.
The law of total probability may be generalised to any partition of Ω:
If A_1, . . . , A_k are disjoint events with nonzero probability and ⋃_{i=1}^{k} A_i = Ω, then
$$P(B) = \sum_{i=1}^{k} P(A_i)\, P(B|A_i). \qquad (4.4)$$
We can also use a countable partition:

$$P(B) = \sum_{i=1}^{\infty} P(A_i)\, P(B|A_i). \qquad (4.5)$$

Note that this means in particular that the probability of an event is always
some kind of average of the probabilities conditioned on all the possible
events in a partition.
Bayes’ rule, though apparently trivial, solves an important problem:
Think of A as a piece of scientific evidence, and B as a fact about the
world. Then the Bayes tells us how to update our belief of the probability
that B is true, based on the new evidence. The principle is that the prior
probability P (B) gets multiplied by the Bayes factor P(A|B)/P(A), which
tells us how relatively likely the evidence A is if B is true, compared to its
overall likelihood. If B makes A much more likely than it would have been
otherwise, then A is strong evidence for B. We also use the law of total
probability to rewrite the Bayes factor:
$$P(B|A) = \frac{P(A|B)}{P(A|B)\,P(B) + P(A|B^c)\,P(B^c)}\, P(B). \qquad (4.6)$$

Example 4.1: Three cards

We have 3 cards: One is red on both sides, one is black on both


sides, one is red on one side and black on the other. The dealer
picks a card at random and places it on the table. It is red. He

offers to bet you even money that the other side is red. (That
is, he wins if it’s red, you win if it’s black.) Is this a fair bet?
Solution: Let A be the event that the side you see is red; B the
event that this is the card with two red sides. Then P(B) = 1/3; P(A) = 1/2; and if we have the card with two red sides the side you see will certainly be red, so P(A|B) = 1. By Bayes’ rule
$$P(B|A) = \frac{1}{3} \cdot \frac{1}{1/2} = \frac{2}{3}.$$
So it is not a fair bet. 

Example 4.2: Disease screening

There is an infection which is present (without symptoms) in


0.1% of the population. A doctor decides that all of her patients
should have the new test for this infection, which is known to be 99% accurate: 99% of infected patients (correctly) get a positive result, and 99% of uninfected patients (correctly) get a negative result. (The first of these numbers is called the sensitivity of the test; the second the specificity. In general there is no reason why these need to be the same.)
One of the doctor’s patients has the test, and gets a positive
result. What is the probability that the patient really is infected?
Solution: Let A be the event {positive test}, and B the event
{patient infected}. We want to compute P(B|A). The 99% sen-
sitivity implies that P(A|B) = 0.99, and the probability of a
positive test for an uninfected patient is P(A|B^c) = 0.01 because
of the 99% specificity. The prior P(B) is the background rate
0.001. And

P(A) = P(B)P(A|B) + P(B^c)P(A|B^c) = 0.001 · 0.99 + 0.999 · 0.01 = 0.01098.

Thus
$$P(B|A) = 0.001 \cdot \frac{0.99}{0.01098} = 0.090.$$
0.01098
So the probability of being infected has been substantially raised
— by a factor of nearly 100 — from the baseline rate. In this

sense, the test is strong evidence of infection. But it is still quite


low — less than one chance in 11 — because the baseline is so
low. 
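Since this kind of Bayes computation comes up constantly, it is worth writing once as a function. A minimal sketch (Python; the function and argument names are mine, not part of the notes):

```python
def posterior(prior, sensitivity, specificity):
    """P(infected | positive test) by Bayes' rule, as in equation (4.6)."""
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# The numbers from Example 4.2: prior 0.1%, sensitivity 99%, specificity 99%
print(posterior(prior=0.001, sensitivity=0.99, specificity=0.99))  # ~ 0.090
```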

4.2 Conditional distributions: The non-null case


Let X be a random variable, and B an event with P(B) > 0. The conditional
distribution of X given B is the distribution with cdf
F_{X|B}(x) = P(X ≤ x | B) = P(B)^{-1} P(B ∩ {X ≤ x}).


We can also define the conditional expectation of X given B as the


expectation of the conditional distribution. That is,

$$E[X \mid B] = \sum_x x\, P\{X = x \mid B\} = \sum_x x\, \frac{P(\{X = x\} \cap B)}{P(B)}. \qquad (4.7)$$
The law of total probability leads to a law of total expectation. Remember that sets B_i form a partition of Ω if they are disjoint and ⋃_i B_i = Ω.

If B_1, . . . , B_k are a partition of Ω with P(B_i) > 0, and X a random variable, then
$$E[X] = \sum_{i=1}^{k} P(B_i)\, E[X \mid B_i]. \qquad (4.8)$$
(This holds as well if the partition is countably infinite.)
Independence: The random variable X is independent of the event B
if all events {X ≤ x} are independent of B. Equivalently, X is indepen-
dent of B if the conditional distribution of X given B is the same as the
unconditional distribution of X.

Example 4.3: Two dice

We roll two six-sided dice. Y is the sum of the two rolls. B is the
event Y ≤ 5. C is the event Y is even. Compute the conditional
distribution of Y given each of these events.
Solution: As a reminder, the probabilities for Y are
y          2     3     4     5     6     7     8     9     10    11    12
P{Y = y}  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

So P(B) = 10/36, and

y              2     3     4     5    6  7  8  9  10  11  12
P{Y = y | B}  1/10  2/10  3/10  4/10  0  0  0  0   0   0   0

P(C) = 1/2, so

y              2     3   4     5   6     7   8     9   10    11   12
P{Y = y | C}  1/18   0  3/18   0  5/18   0  5/18   0  3/18    0  1/18

The conditional expectation of Y given B is


$$E[Y \mid Y \le 5] = 2 \cdot \frac{1}{10} + 3 \cdot \frac{2}{10} + 4 \cdot \frac{3}{10} + 5 \cdot \frac{4}{10} = 4.$$


Example 4.4: Minimum of exponentials

Let X and Y be independent exponential random variables with


parameters µ and λ. Let Z = min{X, Y }. Compute the dis-
tribution of Z, and the conditional distribution of Z given the
event {X ≤ Y } (which is the same as the event {X = Z}).
Solution: We have P{X > x} = e−µx and P{Y > y} = e−λy .
Thus
$$P\{Z > z\} = P\{\min\{X, Y\} > z\} = P\{X > z \text{ and } Y > z\} = e^{-\mu z} \cdot e^{-\lambda z} = e^{-(\mu + \lambda) z}.$$
So Z has the exponential distribution with parameter µ + λ. To
compute the conditional distribution we use the joint density
f (x, y) = µλe−µx−λy for x > 0 and y > 0 (and f (x, y) = 0
otherwise) to compute
$$P\{X \le Y\} = \int_0^{\infty}\!\!\int_0^{\infty} \mathbf{1}_{\{x \le y\}}\, f(x, y)\, dy\, dx = \mu\lambda \int_0^{\infty}\!\!\int_x^{\infty} e^{-\mu x - \lambda y}\, dy\, dx = \frac{\mu}{\mu + \lambda}.$$

We want to look at
$$\begin{aligned}
P\{Z > z \mid X \le Y\} &= P\{X \le Y\}^{-1}\, P\{Z > z,\ X \le Y\} \\
&= \frac{\mu + \lambda}{\mu}\, P\{z < X \le Y\} \\
&= \frac{\mu + \lambda}{\mu}\, \mu\lambda \int_z^{\infty}\!\!\int_x^{\infty} e^{-\mu x - \lambda y}\, dy\, dx \\
&= (\mu + \lambda) \int_z^{\infty} e^{-(\mu + \lambda) x}\, dx \\
&= e^{-(\mu + \lambda) z}.
\end{aligned}$$

Thus, somewhat surprisingly, min{X, Y } is independent of the


event {X ≤ Y }. 
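This independence is surprising enough to merit a quick simulation check. The sketch below (Python, with µ = 1 and λ = 2 as illustrative parameters) compares the conditional and unconditional tail probabilities of Z.

```python
import numpy as np

rng = np.random.default_rng(9)
mu, lam, n = 1.0, 2.0, 1_000_000

x = rng.exponential(1 / mu, n)
y = rng.exponential(1 / lam, n)
z = np.minimum(x, y)
on_event = x <= y

print(f"P(X <= Y) ~ {on_event.mean():.3f}  (exact {mu / (mu + lam):.3f})")
for t in (0.1, 0.3, 0.6):
    exact = np.exp(-(mu + lam) * t)
    print(f"P(Z > {t})          ~ {np.mean(z > t):.3f}  (exact {exact:.3f})")
    print(f"P(Z > {t} | X <= Y) ~ {np.mean(z[on_event] > t):.3f}")
```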

4.3 Conditioning on a random variable: The discrete case
In modern probability, we move from conditional probabilities to making
conditional expectations the fundamental object. The full generality of this
concept is naturally measure-theoretic, so we will confine ourselves
to some special cases. In this section, we focus on the discrete setting.
In an important sense, the expected value of a random variable X is a
“best guess” about the value of X, given no information. The conditional
expectation of X given Y is the best guess about the value of X given
only the information about which value Y takes.

Definition 4.2. Let X and Y be two random variables, where Y takes on


only a finite or countable set of possible values. The conditional expec-
tation of X given Y is a random variable that is a function of Y . For any
outcome ω on which Y = y,

E[X | Y](ω) = E[X | Y = y].

If A is an event and Y a random variable, the conditional probability


of A given Y is the conditional expectation of the indicator function: P(A | Y) = E[1_A | Y].

So E[X|Y ] is itself a random quantity, which is the average value of X


over all circumstances where Y has taken on the value that it has.

Example 4.5: Two dice, continued

Let Y be the sum of the spots on two fair six-sided die rolls, and
X the number of spots on the first roll. Compute E[Y | X].
Solution: (This is the long way. We'll do this again in Example
4.6.) We make a table of the joint distribution:
P{X=x, Y=y}:
  x \ y    2     3     4     5     6     7     8     9     10    11    12   Marg.
  1      1/36  1/36  1/36  1/36  1/36  1/36    0     0     0     0     0    1/6
  2        0   1/36  1/36  1/36  1/36  1/36  1/36    0     0     0     0    1/6
  3        0     0   1/36  1/36  1/36  1/36  1/36  1/36    0     0     0    1/6
  4        0     0     0   1/36  1/36  1/36  1/36  1/36  1/36    0     0    1/6
  5        0     0     0     0   1/36  1/36  1/36  1/36  1/36  1/36    0    1/6
  6        0     0     0     0     0   1/36  1/36  1/36  1/36  1/36  1/36   1/6
  Marg.  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The last column is the marginal distribution of X, which is just
the sum of the other numbers in the row. The last row is the
marginal distribution of Y , which is just the sum of the other
numbers in the column.
We divide through by the marginal of X to get the conditional
probabilities of Y given X. The conditional expectation is then
the expected value computed from the distribution in each row:
P{Y=y | X=x}:
  x \ y    2    3    4    5    6    7    8    9    10   11   12   E[Y|X=x]
  1       1/6  1/6  1/6  1/6  1/6  1/6   0    0    0    0    0      4.5
  2        0   1/6  1/6  1/6  1/6  1/6  1/6   0    0    0    0      5.5
  3        0    0   1/6  1/6  1/6  1/6  1/6  1/6   0    0    0      6.5
  4        0    0    0   1/6  1/6  1/6  1/6  1/6  1/6   0    0      7.5
  5        0    0    0    0   1/6  1/6  1/6  1/6  1/6  1/6   0      8.5
  6        0    0    0    0    0   1/6  1/6  1/6  1/6  1/6  1/6     9.5

We see that in each case E[Y | X = x] = x + 3.5. Thus we may
write the random variable E[Y | X] as

E[Y | X] = X + 3.5.

Now compute E[X | Y].



Solution: This time we divide through by the marginal of Y to


get the conditional distribution of X given Y = y.

P{X=x | Y=y}:
  x \ y      2    3    4    5    6    7    8    9    10   11   12
  1          1   1/2  1/3  1/4  1/5  1/6   0    0    0    0    0
  2          0   1/2  1/3  1/4  1/5  1/6  1/5   0    0    0    0
  3          0    0   1/3  1/4  1/5  1/6  1/5  1/4   0    0    0
  4          0    0    0   1/4  1/5  1/6  1/5  1/4  1/3   0    0
  5          0    0    0    0   1/5  1/6  1/5  1/4  1/3  1/2   0
  6          0    0    0    0    0   1/6  1/5  1/4  1/3  1/2   1
  E[X|Y=y]   1   1.5   2   2.5   3   3.5   4   4.5   5   5.5   6

We see that in each case E[X | Y = y] = y/2. Thus

E[X | Y] = Y/2.
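Both formulas can be confirmed by enumeration (an illustrative Python sketch added here; the variable names are ours):

```python
# Enumeration check that E[Y | X] = X + 3.5 and E[X | Y] = Y/2 for two dice.
from fractions import Fraction
from itertools import product
from collections import defaultdict

by_x, by_y = defaultdict(list), defaultdict(list)
for a, b in product(range(1, 7), repeat=2):
    by_x[a].append(a + b)     # values of Y on the event {X = a}
    by_y[a + b].append(a)     # values of X on the event {Y = a + b}

print({x: Fraction(sum(v), len(v)) for x, v in sorted(by_x.items())})  # x + 7/2
print({y: Fraction(sum(v), len(v)) for y, v in sorted(by_y.items())})  # y/2
```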

Conditional expectations satisfy many of the same rules as unconditional


expectations, and these make the computations a lot simpler.
 
(CE1) Law of Total Expectation: E[E[X | Y]] = E[X].

(CE2) If X1, X2 and Y are random variables and P{X1 ≥ X2} = 1, then
E[X1 | Y] ≥ E[X2 | Y].

(CE3) For any constant c and any random variables X and Y, E[cX | Y] = c E[X | Y].

(CE4) If X1, X2 and Y are random variables, E[X1 + X2 | Y] = E[X1 | Y] + E[X2 | Y].

(CE5) If X and Y are independent, then E[X | Y] = E[X]. (In other words,
it is a deterministic constant equal to E[X].)

(CE6) For any random variables X and Y where X may be written as g(Y),
E[X | Y] = X.

Example 4.6: Two dice, continued again

Let Y be the sum of the spots on two fair six-sided die rolls, and
X the number of spots on the first roll. Compute E[Y | X].
Solution: Define X1 = X, X2 = Y −X (the spots on the second
roll). These are independent. Then
   
E[Y | X] = E[X1 + X2 | X]
         = E[X1 | X] + E[X2 | X]    by rule (CE4)
         = X1 + E[X2]               by rules (CE5) and (CE6)
         = X + 3.5.

Note that E[Y ] = 7, which is the same as E[X + 3.5].

Now compute E[X|Y ].


Solution: The pairs (X1, Y) and (X2, Y) have the same joint
distribution, so E[X1 | Y] = E[X2 | Y]. So

2 E[X1 | Y] = E[X1 | Y] + E[X2 | Y]
            = E[Y | Y]    by rule (CE4)
            = Y           by rule (CE6).

Note that E[X] = 3.5, which is the same as E[Y /2]. 

Example 4.7: Conditioning on an indicator

Let A be an event and X a random variable. What is E[X | 1_A]?


Solution: This must be a function of 1A . That is, it takes
on one value on outcomes which are in A and the other value
on outcomes not in A. These values must be the conditional
expectation of X given A, and the conditional expectation of X
given A^c respectively. We may write this as

E[X | 1_A] = E[X | A]·1_A + E[X | A^c]·1_{A^c}.



Example 4.8: Random lifetimes

A lightbulb factory makes bulbs which last a random length of


time, which has exponential distribution. The factory makes
equal numbers of bulbs every day of the week (five days a week).
Bulbs made on Mondays last an average of 1 year, while those
made on Tuesdays last 2 years, Wednesdays 3 years, Thursdays
4 years, Fridays 5 years. Compute the expectation and variance
of bulb lifetimes.
Solution: Let T be the lifetime of a random bulb, and L the
expected lifetime of the bulb. The conditional distribution of T
given L is Exp(1/L). We have

E[T | L] = L,
E[T² | L] = 2L².

(Remember that when T has the Exp(λ) distribution, E[T] = λ^{-1}
and E[T²] = 2λ^{-2}.) L is uniformly distributed over {1, 2, 3, 4, 5}.
The law of total expectation tells us that

E[T] = E[L] = 3,
E[T²] = E[2L²] = (2/5)(1² + 2² + 3² + 4² + 5²) = 22,
Var(T) = E[T²] − E[T]² = 22 − 9 = 13.

Note that an exponential distribution with expectation 3 would


have variance 9. We say the mixed distribution is overdis-
persed relative to the exponential, which is usually an effect
of mixing. 
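A quick simulation (added for illustration; the sample size and seed are arbitrary) shows the overdispersion directly:

```python
# Monte Carlo check of Example 4.8: mean about 3 and variance about 13,
# compared with variance 9 for a single Exp distribution with mean 3.
import random

random.seed(0)
n = 500_000
lifetimes = []
for _ in range(n):
    L = random.choice([1, 2, 3, 4, 5])           # expected lifetime that day
    lifetimes.append(random.expovariate(1 / L))  # Exp with mean L

m = sum(lifetimes) / n
print(m, sum((t - m) ** 2 for t in lifetimes) / n)  # roughly 3 and 13
```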

4.4 Conditional distributions: The continuous case


How do we define conditional distributions when the variable we are condi-
tioning on is continuous? If X has density f_X and A is any event, then we
can't compute

P(A | X = x) = P(A ∩ {X = x}) / P{X = x},    (NO GOOD!)

because the numerator and denominator are both 0. Instead, we define

P(A | X = x) = lim_{δ↓0} P(A ∩ {|X − x| ≤ δ}) / P{|X − x| ≤ δ}.    (4.9)
Of course, we can apply this to the conditional distribution of a discrete
variable Y conditioned on continuous X:

P(Y = y | X = x) = lim_{δ↓0} P{Y = y and |X − x| ≤ δ} / P{|X − x| ≤ δ}.    (4.10)

As discussed in section 4.9, we would understand this formula as saying
that we have a function p_x(y) = P{Y = y | X = x} which describes the
distribution of Y for each possible value of the parameter X.
When X and Y are both continuous the problem becomes more delicate.
Suppose (X, Y) has joint density f_{X,Y}. We define the conditional density of
Y given X = x to be

f_Y(y | X = x) := f_{X,Y}(x, y) / f_X(x) = lim_{δy↓0} lim_{δx↓0} (2δy)^{-1} P{|X − x| ≤ δx and |Y − y| ≤ δy} / P{|X − x| ≤ δx}.    (4.11)
The conditional expectation of Y given X = x is then the expectation with
respect to the conditional density

E[Y | X = x] = ∫_{−∞}^{∞} y · f_{X,Y}(x, y)/f_X(x) dy.    (4.12)
Finally, the conditional expectation of Y given X is a random variable that
is a function of X, defined to take the value E[Y |X = x] whenever X = x.
Conditional expectations satisfy the same rules as in the discrete setting.
We also still have Bayes' rule:

f_X(x | Y = y) = ( f_Y(y | X = x) / f_Y(y) ) · f_X(x).    (4.13)

This follows simply from the definition, since

f_X(x | Y = y) f_Y(y) = f_{X,Y}(x, y) = f_Y(y | X = x) f_X(x).

4.5 The Borel paradox


Suppose we pick a random point on the earth’s surface. What is the proba-
bility that the latitude is above 45◦ N, conditioned on the point being chosen
to lie on 0◦ longitude?

Our first idea might be: The uniform distribution assigns probabilities in
proportion to area. Consider a tiny strip of width δ around the great
semicircle corresponding to longitude θ = 0. It has very nearly equal areas
all around, approaching uniformity as δ goes to 0. Thus, the conditional
distribution of the random latitude Φ must be uniform on [−90, 90]; in
particular, the probability below −45 (equally, by symmetry, above +45) is precisely 1/4.
Problem: There is nothing special about longitude Θ = 0; the same
should be true for conditioning on any other value of the longitude. But if the
latitude is uniformly distributed when conditioned on any value of Θ, then
it must be uniformly distributed unconditionally, and P{Φ < −45} = 1/4.
But we can compute the unconditional probability directly: The total area
of the earth is 4πR², where R is the radius. Latitude is between −90° (at the
south pole) and +90° (at the north pole). For φ ∈ (−90, 90) the area of the
region with latitude ≤ φ is a spherical cap with area 2πR²(1 + sin(φπ/180)).
If we let Φ be the random latitude, since uniform probability is proportional
to area, we have the cdf and density
F_Φ(φ) = (1/2)(1 + sin(πφ/180)),
f_Φ(φ) = F′_Φ(φ) = (π/360) cos(πφ/180).
In particular, the probability we are looking for is P{Φ < −45} = (2 − √2)/4 ≈ 0.146.
What this paradox points up is that the intuitive notion “the latitude
of a random point on the earth, conditioned on being on the circle with
longitude θ” doesn’t actually make sense. You can’t say you’ll look at the
distribution of φ in the subset of (θ, φ) corresponding to θ = 0 (or whatever).
The whole set has probability 0, so we can’t talk about the proportion of
probability in that set where φ has a certain value. Instead, taking the ratio
of joint density to marginal density means that we are taking a limit as δ ↓ 0
of
P{Φ ≤ 45 and 0 ≤ Θ < δ}
.
P{0 ≤ Θ < δ}
And whereas the limit set {(φ, θ) : θ = 0} is a circle of constant longitude, which seems to imply
a uniform distribution of Φ, all of the sets on the way to the limit of the
form {(φ, θ) : θ ∈ [0, δ)} are wedges, which are fatter around the equator
than near the poles. The first approach is perfectly legitimate, and since
longitude and latitude are indeed independent we can infer that the joint
density of (Φ, Θ) is

f_{Φ,Θ}(φ, θ) = (π/(360·360)) cos(πφ/180),    φ ∈ (−90, 90), θ ∈ [0, 360).

If we have a region on the sphere, expressed in terms of θ and φ measured


in degrees, we integrate this density over the region and multiply by 4πR2
to obtain the area of the region.
On the other hand, if we set up a coordinate system in which regions
of the form {(φ, θ) : φ ∈ [−90, 90], θ ∈ (θ0 − δ, θ0 + δ)} are narrow strips
of constant width, then the conditional distribution of Φ given Θ = 0 will
indeed be uniform; however, conditioning on other values of Θ must give
different distributions, so that the average conditional probability of {Φ <
−45} will be 0.146.

Example 4.9: Uniform distribution on a disk: The Borel


paradox

Let (X, Y ) be a uniform random point on the disk x2 + y 2 < 1.


Compute the conditional density of Y given X = x, and the
conditional expectation of X and X 2 .
Solution: The joint density is fX,Y (x, y) = 1/π for x2 + y 2 < 1,
0 otherwise. The marginal density of X is

f_X(x) = (1/π) ∫_{−∞}^{∞} 1_{x²+y²<1} dy = (2/π)√(1 − x²).

So the conditional density is

f_Y(y | X = x) = 1_{x²+y²<1} / (2√(1 − x²))
             = 1/(2√(1 − x²))   if −√(1 − x²) < y < √(1 − x²),
               0                otherwise.

In other words, conditioned on X = x, Y is uniform on the
interval (−√(1 − x²), +√(1 − x²)).
Suppose instead we consider the random variables Θ and Z,
where Θ is the angle in [0, π) made by the line through the point
and the origin, to the x axis. (So that Θ or Θ + π is the usual
angle component in polar coordinates.) And Z is √(x² + y²) · sgn y.
(So these are polar coordinates with the radius allowed to be
positive or negative.) Compute the conditional density of Z given
Θ = θ.

Solution: Write R = |Z| for the unsigned radius. The joint density is
f_{R,Θ}(r, θ) = 2r/π for 0 ≤ r < 1
and 0 ≤ θ < π. The marginal density of Θ is 1/π on [0, π) (so
uniform), and the conditional density is

f_R(r | Θ = θ) = 2r.

This does not depend on θ, reflecting the fact that R and Θ are
independent.
Notice how this resolves the Borel paradox. The problem there
was that we assumed it meant something to say “pick a uniform
point on the disk, conditioned on being on the centre line”. In
fact, conditioning for continuous random variables depends on a
limit, and it depends on how we take the limit: In other words,
it only makes sense in terms of a complete set of coordinates.
If I think of my random point on the middle line as being Y
conditioned on X = 0, then I am looking at the distribution
of Y conditioned on points in a narrow strip, whose width goes
to 0. This converges to the uniform distribution. On the other
hand, when I think of it as being Z conditioned on Θ = π/2, I
am looking at the distribution of Z conditioned on points in a
wedge, which has a nonuniform distribution for all wedges, and
converges to a nonuniform distribution.
The law of total expectation still holds:

P{R < 1/2} = E[P{R < 1/2 | Θ}] = (1/π) ∫_0^π P{R < 1/2 | Θ = θ} dθ = 1/4.
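The strip-versus-wedge distinction shows up clearly in simulation (a Python sketch added for illustration; the tolerance eps and sample sizes are arbitrary choices):

```python
# The Borel paradox on the disk: conditioning on a thin vertical strip
# |X| < eps gives (nearly) uniform Y, while conditioning on a thin wedge of
# directions near the vertical axis gives the density 2|y| instead.
import math, random

random.seed(2)
eps, target = 0.02, 10_000
strip, wedge = [], []
while len(strip) < target or len(wedge) < target:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y >= 1:
        continue
    if abs(x) < eps and len(strip) < target:
        strip.append(y)
    if abs(abs(math.atan2(y, x)) - math.pi / 2) < eps and len(wedge) < target:
        wedge.append(y)

print(sum(abs(y) < 0.5 for y in strip) / len(strip))   # about 0.5  (uniform)
print(sum(abs(y) < 0.5 for y in wedge) / len(wedge))   # about 0.25 (density 2|y|)
```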


4.6 Bivariate normal distribution: The regression line
Suppose we have a population of husband-wife pairs, such that the heights
of husbands and wives are jointly normal. The mean height of the men is
180cm and the mean height of the women is 170cm. The SD for men is 6cm
and for women is 5cm. The correlation of height between husbands and
wives is 0.5. Let us try to answer the following questions:

(i) Given that a husband has height 190cm, what is the expected height
of his wife?

(ii) Given that a husband has height 190cm, what is the probability that
his wife is below average height?
(iii) Suppose we know for a randomly chosen couple that the average height
of the husband and wife is 180cm. What is the expected height of the
wife?
(iv) What is the probability that a randomly chosen man is taller than a
randomly chosen woman?
(v) What is the probability that a randomly chosen husband is taller than
his wife?
Let (X, Y ) be the heights of a randomly chosen husband and wife. We
have

E[X] = 180, E[Y ] = 170,


Var(X) = 36, Var(Y ) = 25,
Cov(X, Y ) = Cor(X, Y ) SD(X) SD(Y ) = 15.

(i) We want to begin by computing E[Y |X]. The most direct way of doing
this is to split up Y into a piece that is a multiple of X and a piece that is
independent of X. We write Y = αX + (Y − αX), and try to solve for α
to make Z := Y − αX independent of X. We know from section 2.4.3 that
this will be true if Cov(X, Y − αX) = 0. By linearity,

0 = Cov(X, Y) − α Cov(X, X) = 15 − 36α  =⇒  α = 5/12.

Thus E[Z] = E[Y] − (5/12) E[X] = 95, and

E[Y | X] = E[(5/12)X + Z | X] = (5/12) E[X | X] + E[Z | X] = (5/12) X + 95.

(ii) We need the conditional distribution of Y conditioned on X = 190.


We could resort to the joint density (2.6) here. We have

f_{X,Y}(x, y) = (1/(60π√0.75)) exp( −(x−180)²/54 − (y−170)²/37.5 + (x−180)(y−170)/45 ).

The marginal density of X is

f_X(x) = (1/(6√(2π))) e^{−(x−180)²/72},

so the conditional density is

f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x)
             = (1/(5√0.75·√(2π))) exp( −(x−180)²/216 − (y−170)²/37.5 + (x−180)(y−170)/45 ),

f_{Y|X=190}(y) = (1/(5√0.75·√(2π))) exp( −100/216 − (y−170)²/37.5 + (2/9)(y−170) )
               = (1/(√18.75·√(2π))) exp( −(1/(2·18.75)) (y − 170 − 25/6)² ).

That is, the conditional distribution of Y given that X = 190 is N(174 1/6, 18.75).
We are trying to compute P{Y < 170 | X = 190}. We now standardise: conditionally
on X = 190, Y has mean ≈ 174.167 and SD √18.75 ≈ 4.33, so Z := (Y − 174.167)/4.33
is standard normal. We have

P{Y < 170 | X = 190} = P{ (Y − 174.167)/4.33 < −4.167/4.33 } = P{Z < −0.962} = 0.168.
(You could use either a normal-distribution table, the Matlab command normcdf
— cf. http://www.mathworks.co.uk/help/toolbox/stats/normcdf.html
— or this applet: http://www-stat.stanford.edu/~naras/jsm/FindProbability.html.)
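The same number can be computed in Python; this sketch (added here, and assuming SciPy is installed) uses scipy.stats.norm.cdf in the role of normcdf:

```python
# Conditional probability P{Y < 170 | X = 190} via the normal cdf.
from scipy.stats import norm

mu, sd = 170 + 25 / 6, 18.75 ** 0.5   # conditional mean and SD of Y given X = 190
print(norm.cdf(170, loc=mu, scale=sd))  # about 0.168
```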
More directly, we could represent (X, Y ) in the form
X = 6 Z1 + 180,

Y = 2.5 Z1 + √18.75 Z2 + 170,

as in (2.5). If X = 190 then Z1 = 10/6, and Y = 174.167 + √18.75 Z2, where
Z2 has conditional distribution equal to its unconditional distribution (since
Z2 is independent of X), which gets us immediately to the conditional dis-
tribution Y ∼ N(174 1/6, 18.75) (conditioned on X = 190).

(iii) Let W = (X + Y)/2. Then µ_W = 175 and σ²_W = (σ²_X + σ²_Y +
2 Cov(X, Y))/4 = 91/4. The correlation is computed by

Cov(Y, W) = Cov(Y, (X + Y)/2) = (1/2) Cov(Y, X) + (1/2) Var(Y) = 20,

so the correlation is

ρ_{Y,W} = Cor(Y, W) = 20 / (5·√(91/4)) = 8/√91 ≈ 0.839.

We follow now the same procedure as in part (i). The general version of this
procedure is stated below, in the formula (4.14), yielding

E[Y | W] = ρ_{Y,W} σ_Y (W − µ_W)/σ_W + µ_Y,

so with W = 180,

E[Y | W = 180] = (8/√91) · 5 · 5/√(91/4) + 170 = 400/91 + 170 ≈ 174.4cm.

(iv) Let X, Y′ be independent random choices from men's heights and
women's heights. These are independent normal random variables, so W =
X − Y′ is also normal. Its expectation is 10cm and its variance (by in-
dependence) is σ²_W = Var(X) + Var(Y′) = 61. We are trying to compute
P{W > 0}. We proceed to standardise W.

P{W > 0} = P{ (W − 10)/σ_W > (0 − 10)/σ_W }
         = P{ Z > −10/√61 }    where Z is standard normal
         = P{ Z > −1.28 }
         = 0.900.

(v) This is exactly the same as (iv), except that instead of independent
X, Y′ we take X, Y to be the heights of a randomly chosen husband-wife pair.
Since they are jointly normal, W = X − Y is still normal. Its expectation
is 10cm, but its variance has changed. It is

σ²_W = Var(X) + Var(Y) − 2 Cov(X, Y) = 36 + 25 − 2·15 = 31.

Then

P{W > 0} = P{ (W − 10)/σ_W > (0 − 10)/σ_W }
         = P{ Z > −10/√31 }    where Z is standard normal
         = P{ Z > −1.80 }
         = 0.964.
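These answers can be checked roughly by simulating couples' heights (a sketch added for illustration, assuming NumPy; the window of "husbands close to 190cm" is an arbitrary device for approximating conditioning):

```python
# Simulating the husband-wife height model and checking (i), (ii) and (v).
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.multivariate_normal([180, 170], [[36, 15], [15, 25]], size=1_000_000).T

near_190 = np.abs(X - 190) < 0.5
print(Y[near_190].mean())          # about 174.2  (i)
print((Y[near_190] < 170).mean())  # about 0.17   (ii)
print((X > Y).mean())              # about 0.96   (v)
```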

4.7 Linear regression on normal random variables


Theorem 4.3 (Regression equation). If (X, Y) are jointly normal with
µ_X = E[X], µ_Y = E[Y], σ²_X = Var(X), σ²_Y = Var(Y), and ρ = Cor(X, Y),
then

E[Y | X] = ρ σ_Y (X − µ_X)/σ_X + µ_Y
         = (Cov(X, Y)/σ²_X) X + ( µ_Y − Cov(X, Y) µ_X / σ²_X ).    (4.14)

The conditional variance of Y is σ²_Y (1 − ρ²).

The regression equation gives the best approximation to Y as a linear


function of X, in the sense of minimising the average squared error. (You
have encountered this equation in a slightly different context in Mods Statis-
tics.)
It says that if we want to predict Y from X, the best we can do is to see
how many SDs X is above its mean, multiply by the correlation, and guess
that Y is about that many of its SDs above its mean.
The conditional variance is sometimes described by saying “ρ2 of the
variance in Y is explained by X.” That is, σY2 describes our uncertainty
about the value of Y , in the sense that if we had to guess what Y was, and
we did the best we could without any information by guessing µY , we’d be
off on average (in the mean-square sense) by σY2 . On the other hand, if we
are allowed to use the value of X to make a guess, and we use the regression
equation, our mean-square error will be reduced to (1 − ρ2 )σY2 .

Proof. Observe that if (4.14) is true for variables X and Y, then it remains
true if we add any constants to X and Y. Thus we may assume wlog that
µ_X = µ_Y = 0. Let β := Cov(X, Y)/σ²_X. Let Z := Y − βX. Then

Cov(X, Z) = Cov(X, Y) − Cov(X, βX) = Cov(X, Y) − (Cov(X, Y)/Var(X)) Cov(X, X) = 0.

Since X, Z are jointly normal, they are independent, and

E[Y | X] = E[Z + βX | X] = E[Z | X] + β E[X | X] = βX,

which is equivalent to (4.14).



4.8 Conditional variance


If we think of variance as being a summary of how spread out the distribution
of Y is, then it is natural to ask how much of that spread remains once the value
of another random variable X is known.

Definition 4.4. For any random variables X, Y the conditional variance
of Y given X is

Var(Y | X) := E[ (Y − E[Y | X])² | X ] = E[Y² | X] − E[Y | X]².    (4.15)

In other words, it is the random variable whose value when X = x is the


variance of the conditional distribution of Y given X = x.

In other words, Var(Y |X = x) is the remaining variance in Y once we


know that X has the value x.

Example 4.10: Conditional variance: Lightbulbs

The DLM Lightbulb Factory produces equal numbers of bulbs


each day of the week, Monday through Friday. The lifetimes
of the bulbs (measured in years) are exponentially distributed,
but the parameter depends on the day of manufacture: Bulbs
manufactured on Monday last on average 1/2 year (λ = 2), bulbs
manufactured on other days last on average 2 years (λ = 1/2).
Let T be the lifetime of a randomly chosen bulb, and X the day
on which it was made (1 =Monday, 2 =Tuesday, etc.). Compute
the variance of T and the conditional variance of T conditioned
on X. Is it true that the variance of T is the expected value of
the conditional variance?
Solution: Conditioned on {X = 1} the distribution of T is
Exp(2), and conditioned on {X = i} for i ≠ 1 the distribution
of T is Exp(1/2). Then E[T | X = 1] = 1/2 and E[T | X = i] = 2;
and E[T² | X = 1] = 1/2 and E[T² | X = i] = 8. Thus

E[T] = E[T | X = 1] P{X = 1} + E[T | X ≠ 1] P{X ≠ 1} = (1/2)·(1/5) + 2·(4/5) = 1.7,
E[T²] = E[T² | X = 1] P{X = 1} + E[T² | X ≠ 1] P{X ≠ 1} = (1/2)·(1/5) + 8·(4/5) = 6.5.

Thus Var(T) = E[T²] − E[T]² = 6.5 − 2.89 = 3.61. The condi-
tional variance is Var(T | X = 1) = 1/4 and Var(T | X = i) = 4,
so

Var(T | X) = (1/4)·1_{X=1} + 4·1_{X≠1} = 4 − (15/4)·1_{X=1}.

The expectation of the conditional variance is (1/4)·(1/5) + 4·(4/5) = 3.25,
which is less than the variance of T. 
It may seem surprising that the variance is larger than the expectation
of the conditional variance, but in fact this must be the case. Consider the
extreme case where T is a function of X, so there is no uncertainty about T
once we know X, which means that the conditional variance of T given X is
0. But there remains some variance of T , which derives from the variability
of X. We can turn this into a formula, which says that the variance of Y
has two pieces: the average conditional variance of Y , and the variance in
the conditional expectation of Y .
Theorem 4.5 (Law of total variance). For any random variables X, Y

Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).    (4.16)

Proof. We write

Var(Y) = E[Y²] − E[Y]²
       = E[E[Y² | X]] − E[E[Y | X]]²
       = E[ E[Y² | X] − E[Y | X]² ] + E[E[Y | X]²] − E[E[Y | X]]²
       = E[Var(Y | X)] + Var(E[Y | X]).

Example 4.11: Law of total variance: Lightbulbs

Continuing Example 4.10, verify the Law of Total Variance.


Solution: We have already computed Var(T) = 3.61 and E[Var(T | X)] =
3.25. The remaining piece is

Var(E[T | X]) = Var(2 − (3/2)·1_{X=1}) = (9/4) Var(1_{X=1}) = (9/4)(1/5 − 1/25) = 0.36.

We see that the sums come out correctly: 3.25 + 0.36 = 3.61. 
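A short simulation (added for illustration; sample size and seed are arbitrary) splits the variance into its two pieces directly:

```python
# Law of total variance for the lightbulb model: Var(T) ~ 3.25 + 0.36 = 3.61.
import random
from statistics import mean, pvariance

random.seed(3)
by_day = {day: [] for day in range(1, 6)}
for _ in range(400_000):
    day = random.randint(1, 5)
    mean_life = 0.5 if day == 1 else 2.0
    by_day[day].append(random.expovariate(1 / mean_life))

lifetimes = [t for ts in by_day.values() for t in ts]
within = mean(pvariance(ts) for ts in by_day.values())      # ~ E[Var(T|X)]
between = pvariance([mean(ts) for ts in by_day.values()])   # ~ Var(E[T|X])
print(pvariance(lifetimes), within, between)
```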

Example 4.12: Random sums

Suppose we play a game as follows: We flip a coin until the first


time that tails comes up. Let N be the number of coin flips.
Then we roll N 6-sided dice, and get a number of pounds equal
to the total of all the dice. What is a fair price for playing this
game, and what is the variance of the winnings?
Solution: We first note that N has geometric distribution with
parameter 1/2, so E[N] = 2 and Var(N) = 2. Let X_i be the
number that comes up on die number i. Then the total winnings
are S = X1 + · · · + XN. Conditioned on N = n, this is a sum
of n i.i.d. random variables with expectation 3.5 and variance
σ² = (1/6)(1² + 2² + 3² + 4² + 5² + 6²) − 3.5² = 2.917. Thus the
conditional expectation is 3.5N and the conditional variance is
σ²N.
Thus

E[S] = E[3.5N] = 3.5 E[N] = 7,

Var(S) = E[Var(S | N)] + Var(E[S | N]) = E[σ²N] + Var(3.5N) = 2σ² + 3.5²·2 = 30.3.
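A Monte Carlo check of the game (added for illustration; the number of trials is arbitrary):

```python
# Random-sum game: the winnings have mean 7 and variance about 30.3.
import random

random.seed(4)
wins = []
for _ in range(300_000):
    flips = 1
    while random.random() < 0.5:   # keep flipping while heads comes up
        flips += 1
    wins.append(sum(random.randint(1, 6) for _ in range(flips)))

m = sum(wins) / len(wins)
print(m, sum((w - m) ** 2 for w in wins) / len(wins))  # roughly 7 and 30.3
```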

4.9 Mixtures of distributions


Quite commonly we run the conditioning process in reverse: We have a
random variable X, and specify the distribution of Y by saying what the
distribution of Y is conditioned on X. This is straightforward when X is
discrete, as in Example 4.8, where L was uniform on {1, 2, 3, 4, 5} and T had
the conditional distribution Exp(1/L). In such a case we can simply consider
each conditional distribution separately. In fact, there is a sense in which
you could say that the joint distribution of a discrete random variable X
and a continuous random variable Y can only be understood as a collection
of densities, one for each possible value of X.
When X is continuous it becomes slightly more complicated. We con-
sider two cases separately:

4.9.1 X continuous, Y discrete


For simplicity, we assume that our discrete set of points is N = {0, 1, 2, . . . }.
Suppose X has density f, and we wish to specify

p_{X=x}(y) = P(Y = y | X = x).

Then we have the marginal distribution of Y defined by the mixed proba-
bility mass function

p(y) = P(Y = y) = ∫_{−∞}^{∞} p_{X=x}(y) f(x) dx.

More generally, if we want to know the probability that (X, Y) ∈ A, for A
a set of pairs, we may think of A as being stratified by N; that is,

A = ⋃_{i=0}^{∞} A_i × {i},    where A_i := {x : (x, i) ∈ A}.

Then

P(A) = ∑_{i=0}^{∞} ∫_{A_i} p_{X=x}(i) f(x) dx = ∑_{i=0}^{∞} P(Y = i | X ∈ A_i) P(X ∈ A_i).    (4.17)

Example 4.13: Random Poisson

Let X be exponential with parameter 1, and let Y be Poisson


with parameter X. Find P{X < Y }.
Solution: The density of X is f (x) = e−x . Conditioned on
{X = x} the probability mass function of Y is
p_{X=x}(y) = P(Y = y | X = x) = e^{−x} x^y / y!.

The event of interest is A = {X < Y} = ⋃_{y=0}^{∞} {(x, y) : 0 < x < y}. Thus

P(A) = ∑_{y=0}^{∞} ∫_0^y f(x) p_{X=x}(y) dx = ∑_{y=0}^{∞} ∫_0^y (e^{−2x} x^y / y!) dx.

This has no simple algebraic simplification, but it obviously can


be computed numerically. (It comes out to about 0.355.) 
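One way of doing the numerical computation (a sketch added here, assuming SciPy for the quadrature; the truncation point of the sum is an arbitrary choice):

```python
# Numerical evaluation of P(X < Y): truncate the sum over y, integrate each term.
import math
from scipy.integrate import quad

def term(y):
    integrand = lambda x: math.exp(-2 * x) * x ** y / math.factorial(y)
    return quad(integrand, 0, y)[0]

print(sum(term(y) for y in range(1, 60)))  # about 0.355
```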

Example 4.14: Mixed discrete and continuous

(Continuation of Example 2.6.) Let X1 , . . . , Xn be i.i.d. stan-


dard normal, and fix z ∈ R. Let

Y := #{ i : X_i < z } = ∑_{i=1}^{n} 1_{X_i < z}.

What is the joint distribution of (X1 , Y )? Compute the condi-


tional expectation E[Y |X1 ].
Solution: Let p = Φ(z). We have

Y = 1_{X1 < z} + Y′,    where Y′ = ∑_{i=2}^{n} 1_{X_i < z}.

Since Y′ ∼ Bin(n − 1, p) is independent of X1, it follows that conditioned on
X1 ≥ z, Y ∼ Bin(n − 1, p), whereas conditioned on X1 < z, Y − 1 ∼ Bin(n − 1, p).
That is,

p_{X1=x}(y) = C(n−1, y) p^y (1 − p)^{n−1−y}        if x ≥ z and 0 ≤ y ≤ n − 1,
              C(n−1, y−1) p^{y−1} (1 − p)^{n−y}    if x < z and 1 ≤ y ≤ n,
              0                                    otherwise,

where C(n, k) denotes the binomial coefficient.

The conditional expectation is

E[Y | X1] = E[ 1_{X1<z} + Y′ | X1 ]
          = E[ 1_{X1<z} | X1 ] + E[ Y′ | X1 ]
          = 1_{X1<z} + (n − 1)p,    since Y′ and X1 are independent.
Note that this is a random variable that is a function of X1 . 

4.9.2 X discrete, Y continuous


In this case we have the probability mass function p_x for X, and a conditional
density f_x(y) conditioned on X = x. The marginal distribution of Y is given
by the mixed density

f_Y(y) = ∑_x p_x f_x(y).    (4.18)

More generally, if we want to know the probability that (X, Y) ∈ A, for A
a set of pairs, we may think of A as being stratified by R; that is,

A = ⋃_{y∈R} A_y × {y}.

(We simply define A_y := {x : (x, y) ∈ A}.) Then

P(A) = ∫_{−∞}^{∞} ∑_{x∈A_y} p_x f_x(y) dy.    (4.19)

These two approaches to describing the joint distribution of (X, Y) —
describing it as a family of continuous distributions of Y stratified by a
discrete collection of possible values of X, or as a family of discrete distribu-
tions of X stratified by the continuous range of values of Y — are perfectly
equivalent, though in a given problem we may find that one or the other is
more natural or easier to carry through.
We may translate from one to the other by using Bayes’ rule.

Theorem 4.6 (Bayes' rule for continuous and discrete distributions). Sup-
pose (X, Y) has a joint distribution with X discrete and Y continuous. Let
f_{Y|X=x}(y) be the conditional density of Y given X = x, and let p_{X|Y=y}(x)
be the conditional probability mass function of X given Y = y. Let p(x)
be the unconditional probability mass function of X, and let f_Y(y) be the
unconditional density of Y. Then

p_{X|Y=y}(x) = f_{Y|X=x}(y) p(x) / f_Y(y),
f_{Y|X=x}(y) = p_{X|Y=y}(x) f_Y(y) / p(x).    (4.20)

Example 4.15: Uniform-binomial

Suppose U is picked uniformly on [0, 1], and then a very large


number of red and black balls are placed into a box, such that
the proportion of red balls is U . We then sample n balls at
random from the box. Find the distribution of Sn = # red balls
sampled.

Solution: p_k(u) := P{S_n = k | U = u} = C(n, k) u^k (1 − u)^{n−k}. So
by the law of total probability

P{S_n = k} = ∫_0^1 p_k(u) du = ∫_0^1 C(n, k) u^k (1 − u)^{n−k} du.

There are two ways we can compute this. The integral ∫_0^1 u^k (1 − u)^{n−k} du
is called the beta integral, and may be computed fairly
directly — see, for example, section II.2 of [Fel68]. It comes out
to be [(n + 1) C(n, k)]^{−1}, so the probability comes out to 1/(n + 1).
Note again that there is no canonical way of writing down the
joint distribution of (U, Sn ). Instead, we have adopted the mix-
ing approach: We start with U , whose distribution is known
and apply (4.17) to the conditional distribution of the discrete
variable Sn conditioned on any value of U .
We may also start with Sn , which we now know has uniform dis-
tribution on {0, 1, . . . , n} and think of it as describing a stratified
collection of densities. Conditioned on {Sn = k}, we may apply
Bayes' rule to see that the conditional density of U is

f_k(u) = (1 · u^k (1 − u)^{n−k}) / ∫_0^1 x^k (1 − x)^{n−k} dx = (n + 1) C(n, k) u^k (1 − u)^{n−k}.
(This distribution is called a beta distribution.) From here we
could recover the density of U by means of (4.18) as

f(u) = ∑_{k=0}^{n} P{S_n = k} f_k(u)
     = ∑_{k=0}^{n} (n + 1)^{−1} (n + 1) C(n, k) u^k (1 − u)^{n−k}
     = ∑_{k=0}^{n} C(n, k) u^k (1 − u)^{n−k}
     = (u + 1 − u)^n = 1,

by the Binomial Theorem.
Alternatively, we can take the transformation approach: Start
from a collection of independent random variables, and repre-
sent (U, Sn ) as a function of this collection. Here is one ap-
proach, which actually makes the problem quite a bit easier:

Let U0 , U1 , . . . , Un be i.i.d. random variables uniform on (0, 1).


Define U = U0 and define ball i to be red if Ui < U0 . Then
this happens with probability U , and the balls are indepen-
dent. Thus, the event {k balls are red} is the same as the event
{there are k of the Ui ’s less than U0 }. But since they are i.i.d. ,
U0 has an equal chance of being anywhere on the list. There are
n + 1 places in total, so they all have probability 1/(n + 1). 
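The uniform distribution of S_n over {0, 1, . . . , n} is easy to see in a simulation (added for illustration; n and the number of trials are arbitrary):

```python
# Checking that S_n is uniform on {0, ..., n} when the success probability U
# is itself uniform on (0, 1); with n = 6 each value should appear about 1/7
# of the time.
import random
from collections import Counter

random.seed(5)
n, trials = 6, 200_000
counts = Counter()
for _ in range(trials):
    u = random.random()
    counts[sum(random.random() < u for _ in range(n))] += 1

print([round(counts[k] / trials, 3) for k in range(n + 1)])
```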

Example 4.16: Mixed discrete and continuous

(Continuation of Example 4.14.) Let X1 , . . . , Xn be i.i.d. stan-


dard normal, and fix z ∈ R. Let

Y := #{ i : X_i < z } = ∑_{i=1}^{n} 1_{X_i < z}.

What is the conditional distribution of X1 given Y ? Compute


the conditional expectation E[X1 |Y ].
Solution: We use Bayes' rule. We have by (4.20),

f_{X1|Y=y}(x) = p_{Y|X1=x}(y) f_{X1}(x) / p(y)

             = [ C(n−1, y) p^y (1−p)^{n−1−y} · (2π)^{−1/2} e^{−x²/2} ] / [ C(n, y) p^y (1−p)^{n−y} ]      if x ≥ z and 0 ≤ y ≤ n − 1,
               [ C(n−1, y−1) p^{y−1} (1−p)^{n−y} · (2π)^{−1/2} e^{−x²/2} ] / [ C(n, y) p^y (1−p)^{n−y} ]  if x < z and 1 ≤ y ≤ n,
               0                                                                                         otherwise;

             = ((n−y)/(n(1−p))) (2π)^{−1/2} e^{−x²/2}    if x ≥ z and 0 ≤ y ≤ n − 1,
               (y/(np)) (2π)^{−1/2} e^{−x²/2}            if x < z and 1 ≤ y ≤ n,
               0                                         otherwise.

Notice that this is a mixture of a standard normal distribution


conditioned on being < z (chosen with probability y/n) and a
standard normal distribution conditioned on being ≥ z (chosen
with probability 1 − y/n).

Now let

c− := (1/p) ∫_{−∞}^{z} x · (1/√(2π)) e^{−x²/2} dx = −e^{−z²/2} / (p√(2π)),

c+ := (1/(1−p)) ∫_{z}^{∞} x · (1/√(2π)) e^{−x²/2} dx = e^{−z²/2} / ((1−p)√(2π)),

where both integrals are done directly, since x e^{−x²/2} has antiderivative −e^{−x²/2}.

Then E[X1 | Y] = c−·(Y/n) + c+·(1 − Y/n). 

4.9.3 X and Y continuous


Suppose X has density f_X, and we wish to specify the conditional density

f_{Y|X=x}(y) = h(y; x).

Then we have the joint density

f_{X,Y}(x, y) = f_X(x) h(y; x),

and the marginal density

f_Y(y) = ∫_{−∞}^{∞} f_X(x) h(y; x) dx.    (4.21)
Lecture 5

Generating functions and


their relatives

Summary: Probability generating function (pgf ); Moment generating func-


tion (mgf ). Examples of computing generating functions. Uniqueness the-
orem for mgf and cf. Applications to normal distribution. Characteristic
function (cf ).
Key concepts: Moment generating functions. Characteristic functions.
Uniqueness theorem.

5.1 Generating functions


Definition 5.1. For X a discrete random variable taking values in {0, 1, 2, . . . },
with P{X = k} = p_k, we define the probability generating function
(pgf) as a function g_X : [0, 1] → [0, 1] defined by

g_X(z) := E[z^X] = ∑_{k=0}^{∞} p_k z^k.    (5.1)

Properties of probability generating functions:


(PGF1) gX (1) = 1.
(PGF2) gX is infinitely differentiable (in fact, real analytic) on (0, 1).
(PGF3) We may be able to extend g_X (analytically) to z > 1. In this case,

sup{ z : g_X(z) < ∞ } = ( lim sup_{k→∞} p_k^{1/k} )^{−1}.


(PGF4) The k-th derivative g_X^{(k)}(z) is increasing as a function of z. Its limit
as z ↑ 1 is finite if and only if X has a finite k-th moment, in which case

g_X^{(k)}(1) = E[X(X − 1) · · · (X − k + 1)].    (5.2)

(PGF5) If X, Y are independent, then gX+Y (z) = gX (z)gY (z) for all z ∈ (0, 1).

(PGF6) Uniqueness: If gX (z) = gY (z) for all z ∈ (0, 1) (in fact, for z in any
interval), then X =d Y . In other words, distributions are determined
by their generating function.

Note: We write X =d Y to mean that X and Y have the same dis-


tribution; that is, P{X ∈ A} = P{Y ∈ A} for any event A. This is also
sometimes written as L(X) = L(Y ), read as “the law of X is the same as
the law of Y ”. This terminology is less common nowadays; here the “law”
of a random variable is its distribution, understood as a probability on R,
independent of its instantiation as a random variable.
Properties (PGF1), (PGF2), and (PGF3) are simply properties of power
series with bounded coefficients. Property (PGF5) follows simply from the
fact that z^X and z^Y are also independent, so

g_{X+Y}(z) = E[z^{X+Y}] = E[z^X z^Y] = E[z^X] E[z^Y] = g_X(z) g_Y(z).

The derivative rule (PGF4) is almost obvious. For any fixed K,

d/dz [ ∑_{k=0}^{K} p_k z^k ] = ∑_{k=0}^{K} k p_k z^{k−1}.

Thus, if we can exchange the limit and the derivative (effectively exchanging
two different limits), we get

∑_{k=0}^{∞} k p_k z^{k−1} = lim_{K→∞} d/dz [ ∑_{k=0}^{K} p_k z^k ] = d/dz [ ∑_{k=0}^{∞} p_k z^k ] = g′(z).

The same thing works for higher derivatives.



Caution: Exchanging the order of limits can go badly wrong!


Consider, for instance, the bivariate series am,n = arctan(m/n). For
each fixed m we have limn→∞ am,n = 0, so limm→∞ limn→∞ am,n = 0
too. On the other hand, for fixed n we have limm→∞ am,n = π/2, so
limn→∞ limm→∞ am,n = π/2. In order for it to work, we need some
sort of uniform convergence in both parameters. In this case, it works
as long as ∑_{k=0}^{∞} k p_k < ∞.
Why? Suppose the sum converges. Then we can pick K such that
∑_{k>K} k p_k < ε, and using the bound |x^k − y^k| ≤ k|x − y| max{x, y}^{k−1},

d/dz [ ∑_{k>K} p_k z^k ] = lim_{δ→0} δ^{−1} ( ∑_{k>K} p_k (z + δ)^k − ∑_{k>K} p_k z^k )
                       ≤ lim_{δ→0} ∑_{k=K+1}^{∞} p_k · k    (since z and z + δ may be assumed ≤ 1)
                       < ε.

So by choosing K large enough, we may ignore all but a finite number


of terms.

What about uniqueness? It’s pretty straightforward for generating func-


tions. (Less so, as we will see, for moment generating functions.) After
all, two power series can’t be the same unless all the coefficients are the
same. Formally: Suppose the distribution of X is different from the distri-
bution of Y. Let K be the smallest k such that P{X = k} ≠ P{Y = k}, and let
0 < z0 ≤ |P{X = K} − P{Y = K}|/2. Then

|g_X(z0) − g_Y(z0)| = | ∑_{k=0}^{∞} (P{X = k} − P{Y = k}) z0^k |
                    = z0^K | P{X = K} − P{Y = K} + ∑_{j=1}^{∞} (P{X = K+j} − P{Y = K+j}) z0^j |
                    ≥ z0^K ( |P{X = K} − P{Y = K}| − z0/(1 − z0) )
                    > 0.

Thus gX and gY must be different for z sufficiently close to 0.



Example 5.1: Poisson pgf

Let X have Poisson(λ) distribution. Then



g_X(z) = ∑_{k=0}^{∞} e^{−λ} (λ^k/k!) z^k = e^{−λ} ∑_{k=0}^{∞} (λz)^k/k! = e^{λ(z−1)}.

Note that if Y is independent of X and has Poisson(µ) distribu-


tion, then
gX+Y (z) = gX (z)gY (z) = e(λ+µ)(z−1) ,
which is the pgf of a Poisson distribution with parameter λ + µ.
Since pgf’s are unique, this proves that X + Y has the Poisson
distribution. 

Example 5.2: Binomial pgf

Let X have Binomial(n, p) distribution. X can be thought of as


a sum of n i.i.d. Bernoulli random variables with pgf pz + (1 − p).
So

g_X(z) = (pz + (1 − p))^n = (1 + p(z − 1))^n.


Example 5.3: Sum of random number of i.i.d. random vari-


ables

Let Y = ∑_{i=1}^{N} X_i, where the X_i are i.i.d. with pgf g_X, and N is
independent of these with pgf g_N. Then

g_Y(z) = E[z^Y]
       = E[ E[z^{X1+···+XN} | N] ]
       = E[ g_X(z)^N ]
       = g_N(g_X(z)).

Example 5.4: Galton-Watson Process

Consider the following “branching process”: We start with a


certain number X0 of individuals in the population. Individual i
has a random number ξi of offspring, with P{ξi = k} = p_k. The ξi
are i.i.d. Thus the number of individuals in generation 1 is
X1 = ∑_{i=1}^{X0} ξi. The same process repeats itself in the next generation,
and so on. Let g(z) be the pgf of the offspring distribution, and
Gn (z) the generating function of Xn . There are X1 individuals in
generation 1, and each of them has some (independent random)
number of offspring in generation n. The number of offspring
that any of these generation 1 individuals has in generation n
has the same distribution as Xn−1 . Thus, we are in the setting
of Example 5.3, and

Gn (z) = g(Gn−1 (z)).

Iterating this, we get Gn (z) = g(g(· · · (g(z)) · · · )).


Note that g is convex and increasing (since all derivatives are
nonnegative), with g(1) = 1 and g 0 (1) = µ.
Take X0 = 1. Let en = P{Xn = 0} be the probability that the
population is extinct in generation n, and let e∞ = limn→∞ en
be the probability that the population is eventually extinct. This
leads to the following facts:

• E[Xn ] = µn , where µ = E[X1 ];


• e∞ is the smallest fixed point of g;
• If µ ≤ 1 then e∞ = 1; if µ > 1 then e∞ < 1.

We have E[Xn+1 ] = E[E[Xn+1 |Xn ]] = E[µXn ] = µE[Xn ], which


proves the first property.
Assume g(0) > 0. (Otherwise the probability of extinction is 0,
since the probability of having no children is 0.) Now observe
that en = Gn (0), so en+1 = g(en ). Let z0 be the smallest fixed
point of g. Since g is increasing and e0 = 0 < z0 , we have en+1 =
g(en ) < g(z0 ) = z0 , so en can never exceed z0 . For 0 ≤ z < z0 ,

furthermore, we have g(z) > z. Thus e0 < e1 < e2 < · · · , and en


must converge to a limit e∞ . Since e∞ must be a fixed point of
g, it follows that e∞ = z0 .
We know that 1 is a fixed point of g, and since g is convex it can
hit the line y = x at most twice. If g 0 (1) > 1, then the curve
of g is below the line for z close to 1. Thus there must be a
second intersection with {y = x} in (0, 1). On the other hand, if
g 0 (1) ≤ 1 the curve of g comes into 1 from above the line, so has
no fixed point below 1. 
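The fixed-point iteration e_{n+1} = g(e_n) is easy to run numerically (a sketch added for illustration; the choice of a Poisson(1.5) offspring distribution is an arbitrary supercritical example, not from the notes):

```python
# Extinction probability of a Galton-Watson process by iterating the pgf.
import math

def g(z, mu=1.5):
    return math.exp(mu * (z - 1))   # pgf of the Poisson(mu) distribution

e = 0.0
for _ in range(200):
    e = g(e)
print(e)   # about 0.417, the smallest fixed point of g
```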

5.2 Moment generating functions


Definition 5.2. If X is a real-valued random variable, its moment gen-
erating function (or mgf ) MX : R → R+ ∪ {∞} is a function defined
by
M_X(t) := E[e^{tX}].    (5.3)
Note that MX (t) may be ∞ for all nonzero t. For instance, any distri-
bution with density (k − 1)x−k on [1, ∞) has no finite mgf. Similarly, the
log-normal distribution has infinite mgf.

5.2.1 Properties of moment generating functions


(MGF1) MX (0) = 1.

(MGF2) If Y = aX + b for constants a and b, then M_Y(t) = e^{bt} M_X(at).

(MGF3) The set of t such that M_X(t) is finite is an interval that includes 0.
M_X(t) is finite for some t > 0 if and only if there exist t0 > 0 and
C > 0 such that P{X > x} ≤ Ce^{−t0 x}.

(MGF4) Suppose M_X(t) < ∞ for |t| < t0, where t0 > 0. Then M_X is in-
finitely differentiable, and the k-th derivative M_X^{(k)}(t) is increasing as
a function of t. All moments of X are finite, and

M_X^{(k)}(0) = E[X^k].    (5.4)

(MGF5) − lim supx→∞ x−1 ln P{|X| > x} = t0 := sup{|t| : MX (t) < ∞}. We
have t0 = ∞ if and only if limx→∞ x−1 ln P{|X| > x} = −∞. This is
equivalent to limk→∞ k −1 E[|X|k ]1/k = 0.

(MGF6) If X, Y are independent, then MX+Y (t) = MX (t)MY (t) for all t such
that MX (t) and MY (t) are both finite.

(MGF7) If X and Y are random variables such that for any y,
M_X(t | Y = y) := E[e^{tX} | Y = y] < ∞, then M_X(t) = E[M_X(t | Y)].

(MGF8) Uniqueness: If MX (t) = MY (t) for all |t| < t0 , for some positive t0 ,
then X =d Y . In other words, distributions are determined by their
moment generating functions.

Property (MGF1) is trivial, as are (MGF6) and (MGF7).


For (MGF2),

M_Y(t) = E[e^{t(aX+b)}] = E[e^{tb} · e^{atX}] = e^{tb} M_X(at).

One direction of (MGF3) is a direct consequence of Markov’s inequality


(see section 5.2.4). For the "if" part, assume P{X > x} ≤ Ce^{−t0 x}. Then for
0 < t < t0, using (3.7),

E[e^{tX}] = t ∫_{−∞}^{∞} e^{tx} P(X > x) dx ≤ t ∫_{−∞}^{0} e^{tx} dx + Ct ∫_{0}^{∞} e^{−(t0−t)x} dx = 1 + Ct/(t0 − t) < ∞.

Property (MGF4) is essentially the same as (PGF4): Consider the case
where X has a density f. Then

d^k/dt^k M_X(t) = d^k/dt^k ∫_{−∞}^{∞} e^{tx} f(x) dx
               = ∫_{−∞}^{∞} (∂^k/∂t^k) e^{tx} f(x) dx
               = ∫_{−∞}^{∞} x^k e^{tx} f(x) dx,

where the second line requires that the convergence of the derivative be uni-
form enough that we can exchange the derivative and the integral. We won't
go into the details of this, except to say that the fact that ∫_{−∞}^{∞} e^{t′x} f(x) dx is
finite for some t′ > t is more than enough to guarantee this. It also allows

us to exchange the limits in the following:

lim_{t→0} d^k/dt^k M_X(t) = lim_{t→0} ∫_{−∞}^{∞} x^k e^{tx} f(x) dx
                         = ∫_{−∞}^{∞} lim_{t→0} x^k e^{tx} f(x) dx
                         = ∫_{−∞}^{∞} x^k f(x) dx
                         = E[X^k].

The Uniqueness Theorem for mgfs is really a consequence of the cor-


responding result for characteristic functions (see section 5.3), which itself
follows from Fourier analysis and the principles layed out in section 6.1. The
proof (which is not part of the syllabus of this course) involves showing that
the cdf of a distribution may be approximated by linear combinations of
expectations of functions of the form eitX . The details may be bound in
section 5.9 of [GS01].

5.2.2 Behaviour near 0


Property (MGF4) tells us something interesting about the value of the mgf
at t = 0. It will also be important for us to understand the behaviour of the
mgf in a neighbourhood of 0.
Lemma 5.3. Let X be a random variable with E[X] = 0. Suppose MX (t)
is finite for |t| < t0 , for some positive t0 . Then as t → 0
log M_X(t) = (1/2) t² Var(X) + O(t³).    (5.5)

That is, there are constants C and ε > 0 (depending on the distribution of
X) such that

| log M_X(t) − (1/2) t² Var(X) | ≤ C|t|³    for |t| ≤ ε.    (5.6)
Proof. By property (MGF4), MX (t) is infinitely differentiable on (−t0 , t0 );
we may assume wlog that t0 ≤ 1. In particular, then, for |t| ≤ ε := t0/2, the
third derivative of M_X is bounded on [−ε, ε]. Let K = sup_{|t|≤ε} |M‴_X(t)|/6.

By Taylor's Theorem, taking 0 as the centre for the series expansion, for all
|t| ≤ ε,

M_X(t) = M_X(0) + M′_X(0) t + (1/2) M″_X(0) t² + (1/6) M‴_X(s) t³,

where s ∈ [−ε, ε]. We know that

M_X(0) = 1,
M′_X(0) = E[X] = 0,
M″_X(0) = E[X²] = Var(X),
|(1/6) M‴_X(s)| ≤ K.
We also have the Taylor expansion for log(1 + x) around x = 0 yielding
log(1 + x) = x + E(x),
where the error term E(x) is bounded by x²/(2(1 − |x|)²) ≤ 2x² if we take
|x| < 1/2. Thus

log M_X(t) = log( 1 + (σ²/2) t² + (1/6) M‴_X(s) t³ )
           = (σ²/2) t² + E0,

where the error term E0 is bounded by

| (1/6) M‴_X(s) t³ | + E( (σ²/2) t² + K t³ ) ≤ C|t|³,

where C depends only on K and σ.

5.2.3 Examples
Bernoulli and binomial

Let X_i be i.i.d. random variables with P{X_i = 1} = 1 − P{X_i = 0} = p. Let
S = ∑_{i=1}^{n} X_i. Then

M_X(t) = pe^t + (1 − p) = 1 + p(e^t − 1).


The mgf for the binomial(n, p) distribution is

M_S(t) = ( 1 + p(e^t − 1) )^n.

Uniform random variable


Let U be uniformly distributed on (a, b). Then

M_U(t) = (1/(b − a)) ∫_a^b e^{ut} du = (e^{bt} − e^{at}) / (t(b − a)).

In particular, if b = 1 and a = −1 then M_U(t) = t^{−1} sinh(t).

Exponential random variable


Let X be exponentially distributed, with parameter λ. Then for t < λ,

M_X(t) = ∫_0^∞ λ e^{−λx} e^{tx} dx = λ/(λ − t).

Gamma distribution
Let X have density f(x) = (λ^r/Γ(r)) x^{r−1} e^{−λx}. Then for t < λ,

M_X(t) = ∫_0^∞ (λ^r/Γ(r)) x^{r−1} e^{(t−λ)x} dx = ( λ/(λ − t) )^r.

Uniqueness of mgf’s then shows immediately that the sum of independent


gamma-distributed random variables with the same λ is also gamma dis-
tributed.

Normal distribution
Let X be a N(µ, σ 2 ) random variable. Since the tails fall off faster than
exponentially, the mgf exists for all t. Then X = σZ + µ, where Z is
standard normal. We have

M_Z(t) = ∫_{−∞}^{∞} e^{tz} · (1/√(2π)) e^{−z²/2} dz
       = e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(z−t)²/2} dz
       = e^{t²/2}.

Thus, by (MGF2), M_X(t) = e^{µt + σ²t²/2}.
In lecture 7 we will see that sums of i.i.d. random variables, properly
normalised, converge to a normal distribution. Is it possible to add a finite
number of non-normal random variables to get a normally distributed ran-
dom variable? Suppose it were possible. Suppose X1 , . . . , Xn are i.i.d. , and
X1 + · · · + Xn has standard normal distribution. Let MX be the mgf of the
Xi. It must be finite for all t, so we have

M_X(t)^n = e^{t²/2}.

But then M_X(t) = e^{t²/(2n)}. By uniqueness, X also has a normal distribution.

5.2.4 Tail bounds from moment generating functions


One of the most important applications of moment generating functions is
to derive exponential tail bounds. Quite often we are interested in showing
that a certain random variable has a “small enough” probability of exceeding
a certain threshold. We can use Chebyshev’s inequality, but then the bound
on P{X > t} only declines like t−2 . When we have sums of independent
random variables, we expect the tails to fall of like a normal cdf; that is, like
2
e−t ; that is, very rapidly indeed. There are moment bounds (see Example
3.4), but these can be hard to compute, and you don’t know which one
to use. But then, once you have all the moments, you have a moment
generating function (except: see Example 5.6), and you can put all the
moments together into one much better (because exponential) bound.
The principle is, the bigger the function g for which you can calculate
(or estimate) E[g(X)], the better the tail bounds you can get on X.
By a direct application of Markov's inequality, when t is positive,

for all x,    P(X > x) = P(e^{tX} > e^{tx}) ≤ E[e^{tX}]/e^{tx} = e^{−tx} M_X(t).    (5.7)
This is sometimes called Chernoff’s bound.

Example 5.5: Tail bounds for sums of i.i.d. uniform

Let U1 , . . . , Un be i.i.d. uniformly distributed on (−1, 1). Let


Z = (U1 + · · · + Un )/n. Find a bound for P{|Z| > z}.
Solution: We know from section 5.2.3 that Z has mgf

M_Z(t) = M_U(t/n)^n = (t/n)^{−n} sinh^n(t/n).

Thus, we have for all t > 0,

P(Z > z) ≤ (t/n)^{−n} sinh^n(t/n) · e^{−tz}.

We could compute this numerically, but we can also use the
bound

sinh t = t + t³/3! + t⁵/5! + · · · ≤ t e^{t²/6}.

So we have

P(Z > z) ≤ e^{t²/(6n) − tz}.


This holds for every t > 0. If we choose t = 3zn (chosen to minimise
the bound), we get the bound

P(Z > z) ≤ e^{−(3/2) z² n}.


The Chebyshev bound, on the other hand, would be merely

P(Z > z) ≤ Var(Z)/z² = 1/(3z²n),

using the fact that E[Z] = 0 and Var(Z) = n^{−1} Var(U) = 1/(3n).
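The gap between the two bounds, and between each bound and the true tail, can be seen numerically (a sketch added for illustration; n, z and the number of trials are arbitrary choices):

```python
# Comparing exp(-1.5 z^2 n) with the Chebyshev bound 1/(3 z^2 n) and a
# Monte Carlo estimate of P(Z > z).
import math, random

random.seed(6)
n, z, trials = 30, 0.3, 200_000
hits = sum(sum(random.uniform(-1, 1) for _ in range(n)) / n > z
           for _ in range(trials))

print(hits / trials)               # Monte Carlo, about 0.002
print(math.exp(-1.5 * z * z * n))  # exponential bound, about 0.017
print(1 / (3 * z * z * n))         # Chebyshev bound, about 0.12
```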


5.2.5 Uniqueness
(The remainder of this section is conceptually important, but not exam-
inable.) Random variables X and Y have the same distribution when
E[g(X)] = E[g(Y )] for any bounded function g. Why? If the distribu-
tions are the same, they must give the same expectations. But if they give
the same expectations, that holds in particular for g = 1A for any A ⊂ R
— that is, they assign the same probabilities to all possible events.
But this allows us a significant amount of freedom. We already know
that distributions may be defined by cdfs. That is, equality of distributions
follows if E[g(X)] = E[g(Y )] just for functions of the form g = 1(−∞,x] . It
turns out that there are lots of other classes of functions that could be used
to define distributions. The basic rule is:

A class of functions G is sufficient to prove identity of distri-


butions (that is, the uniqueness theorem holds: X and Y have
the same distribution if and only if E[g(X)] = E[g(Y)] for all g ∈ G)
when all bounded continuous functions may be uniformly approx-
imated on bounded intervals by linear combinations of elements
of G, where the linear combination is itself bounded.
Why? Consider any bounded function g, and suppose we have gn linear
combinations of elements of G with sup |gn | ≤ M and sup−n≤x≤n |g(x) −
gn (x)| ≤ 1/n for all n. By linearity of expectations, E[gn (X)] = E[gn (Y )].

Then by the triangle inequality

|E[g(X)] − E[g(Y)]| ≤ |E[g(X)] − E[g_n(X)]| + |E[g_n(X)] − E[g_n(Y)]| + |E[g_n(Y)] − E[g(Y)]|
                    = |E[g(X)] − E[g_n(X)]| + |E[g_n(Y)] − E[g(Y)]|
                    ≤ 2 sup_{x∈[−n,n]} |g(x) − g_n(x)| + 2 sup_{|x|>n} ( |g_n(x)| + |g(x)| ) P{|X| > n}
                    ≤ 2/n + M P{|X| > n}
                    → 0    as n → ∞.

Since this is true for all n, it follows that E[g(X)] = E[g(Y )].
So we know E[g(X)] = E[g(Y )] for bounded continuous functions g.
Why does that imply equality of distributions? If distributions are equal
then this equality of expectations should hold for all functions g — or, at
least, all g for which the expectation makes sense, a class of functions that
we are not being very careful about defining here. In particular, we want
equality to hold for indicators of events, so that P{X ∈ A} = P{Y ∈ A} for
events A. Let’s consider this problem for A = [a, b]. Suppose we know that
E[g(X)] = E[g(Y)] for all bounded continuous g. Define

h_n(x) = 0            if x ≤ a,
         n(x − a)     if a < x < a + 1/n,
         1            if a + 1/n ≤ x ≤ b − 1/n,
         n(b − x)     if b − 1/n < x < b,
         0            if x ≥ b.

Then

1_{[a+1/n, b−1/n]}(x) ≤ h_n(x) ≤ 1_{[a,b]}(x);

but since h_n is continuous, E[h_n(X)] = E[h_n(Y)]. Thus

|P{a ≤ X ≤ b} − P{a ≤ Y ≤ b}| = |E[1_{[a,b]}(X)] − E[1_{[a,b]}(Y)]|
    = |E[1_{[a,b]}(X)] − E[h_n(X)] + E[h_n(X)] − E[h_n(Y)] + E[h_n(Y)] − E[1_{[a,b]}(Y)]|
    ≤ P{ a < X < a + 1/n or b − 1/n < X < b } + P{ a < Y < a + 1/n or b − 1/n < Y < b }
    → 0    as n → ∞.

Thus, the difference must be 0.


What classes of functions work? Some examples:

• Indicators of intervals (whose linear combinations are called simple


functions; if you follow the Integration course you learn that continu-
ous functions may be uniformly approximated by simple functions);
• Smooth functions with bounded derivative;
• Sines and cosines (this is Fourier analysis): Hence functions of the
form g(x) = eitx .
• Polynomials? Not quite. Continuous functions may be uniformly ap-
proximated by polynomials, but the approximations won’t stay bounded.
We need some control over the tails of the distributions, as described
in section 5.2.6

5.2.6 Uniqueness and moments


In thinking about the uniqueness property, we do best to think of the mgf
as one slice of a complex function. The restriction to the real axis is the
mgf, while the restriction to the imaginary axis is the characteristic function.
From the definition, it is clear that if the mgf exists in an interval around 0,
then the power series has positive radius of convergence. Thus, the function
is analytic on a disk, so they must agree anywhere that they can be analyti-
cally continued to. In particular, this includes a strip around the imaginary
axis, so characteristic functions exist and are equal. Thus, uniqueness for
the mgf is equivalent to uniqueness for the characteristic function.
This means that the distribution of a random variable is determined by
its mgf. Of course, the mgf is determined by the moments. The following
may then seem like a reasonable conclusion: Let X and Y be random vari-
ables which have finite moments of all orders, with E[X k ] = E[Y k ] for all k.
Then X and Y have the same distribution.
This is not true!

Example 5.6: Equal moments don’t imply equal distribu-


tion

Let X be a log-normally distributed random variable, with density

f_X(x) = (2π)^{−1/2} x^{−1} e^{−(ln x)²/2}    for x > 0.

Let Y be a random variable with density

fY (y) = fX (y) (1 + sin(2π ln y)) for y > 0.



For any integer k ≥ 0, we have (substituting first ln y = t and then t = s + k),

∫_0^∞ y^k f_X(y) sin(2π ln y) dy = (2π)^{−1/2} ∫_{−∞}^{∞} e^{−t²/2 + kt} sin(2πt) dt
                               = (2π)^{−1/2} e^{k²/2} ∫_{−∞}^{∞} e^{−s²/2} sin(2πs) ds
                               = 0,

by symmetry. Thus E[X k ] = E[Y k ] for all k, but the distribu-


tions are clearly different. (This is from section VII.3 of [Fel71].)
What went wrong? The tails of the density fall off faster than
any polynomial — hence the finiteness of moments of all orders
— but slower than any exponential. Thus MX (t) = ∞ for all
t > 0. In fact, it is a theorem that a distribution is determined by
its moments µ_n = E[X^n] precisely when ∑ µ_{2n}^{−1/(2n)} = ∞, which
is pretty close to saying that µ_n^{1/n} doesn't increase faster than
linearly in n. 

5.3 Characteristic functions


For most mathematical purposes characteristic functions are the most natu-
ral generating function-type objects for using with continuous random vari-
ables. In this course we will not go much beyond the definition. . .

Definition 5.4. If X is a real-valued random variable, its characteristic


function φX is defined to be

φ_X(t) := E[e^{itX}] = E[cos(tX)] + i E[sin(tX)].    (5.8)

5.4 Properties of characteristic functions


(CF1) φX (0) = 1.

(CF2) If Y = aX + b for constants a and b, then φ_Y(t) = e^{ibt} φ_X(at).

(CF3) φX (t) exists and is finite for all t.

(CF4) If X, Y are independent, then φ_{X+Y}(t) = φ_X(t) φ_Y(t) for all t.

(CF5) If X and Y are random variables, then with φ_X(t | Y = y) :=
E[e^{itX} | Y = y], we have φ_X(t) = E[φ_X(t | Y)].

(CF6) Uniqueness: If φ_X(t) = φ_Y(t) for all t ∈ R, then X =_d Y. In other
words, distributions are determined by their characteristic functions.

Characteristic functions are mathematically more natural objects, in


many respects, than moment generating functions. They have the advantage
of existing for all t. Of course, they form part of the same complex function;
as long as the mgf is finite for a given t and −t, the radius of convergence
is at least t, so the characteristic function will be just be the same function
with it substituted for t.
Lecture 6

Limit theorems in
probability I

Summary: Strong and weak convergence of random variables. Convergence


in distribution. Continuity theorems. Convergence in probability. Weak law
of large numbers. Convergence in mean square.
Key concepts: Convergence in probability, convergence in distribution,
convergence in mean square. Weak law of large numbers.

Fundamentally, there are two kinds of convergence in probability theory:


• Strong convergence: This is a random version of your usual defi-
nition of convergence of sequences. Given a probability space (Ω, P ),
and random variables X1 , X2 , · · · : Ω → R, and another random vari-
able Y : Ω → R — so all the random variables live on the same space
Ω — we say that Xi → Y strongly or almost surely if

P( lim_{i→∞} X_i = Y ) = 1.    (6.1)

This is sometimes written X_i → Y a.s.
• Weak convergence: This is a more abstract — and fundamentally
new — concept of convergence of probability distributions, seen as a
sequence of elements in a metric space of all probability distributions.
Given probability spaces (Ωi , Pi ) and random variables Xi : Ωi → R,
and another random variable Y : Ω → R, we say that Xi converges to
Y weakly or in distribution if

lim_{i→∞} P(X_i ∈ A) = P(Y ∈ A)    (6.2)

for a sufficient collection of subsets A ⊂ R. We write this as X_i →_d Y.

We are being deliberately vague about what sets are allowed in the def-
inition of convergence in distribution. The details are in section 6.1.
Note that strong convergence is fundamentally about random variables
— it is crucial how they are realised as functions on a particular probability
space — whereas weak convergence refers only to their distributions. We
can talk about convergence of probability distributions with no reference to
random variables. Of course, they may happen to conveniently be repre-
sented on the same probability space — all the (Ωi , Pi ) may be the same —
in which case it becomes possible to talk about comparing weak and strong
convergence.

6.1 Convergence in distribution


What sets can be used as the A in the definition of convergence in distribu-
tion? We begin by showing by some examples that we have a problem if we
demand that the convergence of (6.2) should hold for all events A.

Example 6.1: Discrete approximations

Let ξ_1, ξ_2, . . . be i.i.d. random variables taking on the values 0 and 1, each with probability 1/2, and define

    X_n := Σ_{i=1}^n 2^{−i} ξ_i.

The sequence (X_n) is a Cauchy sequence, since |X_n − X_m| ≤ 2^{−N} whenever m, n ≥ N. Thus Y := lim_{n→∞} X_n always exists. It's not hard to see that Y must have the uniform distribution on [0, 1], since its probability of being in any interval [q · 2^{−k}, (q + 1) 2^{−k}) is 2^{−k}.
Intuitively, it also seems clear that X_n ought to converge in distribution to the uniform distribution. But (6.2) will not be satisfied for all possible A. For instance, if we take

    A := { q · 2^{−k} : k ∈ N, q = 0, 1, . . . , 2^k − 1 },

then P{X_n ∈ A} = 1 for all n, but P{Y ∈ A} = 0, since A comprises only a countable number of points. 

Example 6.2: Deterministic convergence to a point mass

Let X_i be the random variable that takes on the value 1/i with probability 1, and Y the random variable that takes on the value 0 with probability 1. Then it seems reasonable to suppose that X_i →_d Y. But if we take A to be the set {0}, or the interval (−∞, 0], we see that P{X_i ∈ A} = 0 for all i, but P{Y ∈ A} = 1.

We clearly run into problems when we try to define convergence of random variables in terms of sets whose boundaries carry positive probability under the limit distribution. This leads us to the following definition:
Definition 6.1. Let X1 , X2 , . . . and Y be random variables, with cumulative
distribution functions F1 , F2 , . . . and F∞ . We say that the random variables
Xn converge to Y in distribution if limn→∞ Fn (x) = F∞ (x) for all x at
which F∞ is continuous.
It is then obvious that when Y is a continuous random variable, X_n →_d Y if and only if lim_{n→∞} P{X_n ∈ A} = P{Y ∈ A} for every interval A.
As we have already discussed, it turns out that thinking about convergence in terms of convergence of probabilities isn't the best approach — it leads to these awkward problems of continuity. These can be dealt with, but a better way of thinking about convergence is to think of the probability P{X_i ∈ A} as the expectation of an indicator: E[f(X_i)] where f = 1_A.
Once we do that, we see that there are a lot of different classes of sets we can use that are mathematically more convenient. The important principle is: X_n converges to Y in distribution if E[f(X_n)] → E[f(Y)] for enough functions f. What's enough? The more we can restrict the required class of functions, the easier it becomes to prove convergence.
The basic principle is the same as before: a class G of bounded functions is sufficient if you can approximate all the indicators of intervals uniformly by functions in G. Of course, if G is sufficient, and G′ is another class that can approximate all functions of G, then G′ is sufficient. It's also much better to work with classes of continuous functions, since that avoids the problem of worrying about where the distributions are continuous.

Think of it this way: suppose E[F(X_n)] → E[F(Y)] for every F ∈ F, and for any bounded continuous G and any real numbers K and ε > 0 there is some F ∈ F such that

    sup_{|x| ≤ K} |F(x) − G(x)| < ε.

If X_n and Y were never bigger than K, then this would clearly imply that E[G(X_n)] → E[G(Y)]. But by choosing K big enough we can make the probabilities of the problematic events P{|X_n| > K} and P{|Y| > K} as small as we like.
Fact 6.2. Let X1 , X2 , . . . and Y be random variables, with cumulative dis-
tribution functions F1 , F2 , . . . and F∞ . The following are equivalent:
• limn→∞ Fn (x) = F∞ (x) for all x at which F∞ is continuous.
• limn→∞ P{Xn ∈ (−∞, x]} = P{Y ∈ (−∞, x]} as long as P{Y = x} =
0.
• limn→∞ E[f (Xn )] = E[f (Y )] when f is a function of the form 1(−∞,x]
for some real number x with P{Y = x} = 0.
• limn→∞ E[f (Xn )] = E[f (Y )] for any bounded continuous f : R → R.
• limn→∞ E[f (Xn )] = E[f (Y )] where f is a function of the form f (x) =
etx for any t in some neighbourhood of 0.
• limn→∞ E[f (Xn )] = E[f (Y )] where f is a function of the form f (x) =
eitx for any real t.
Other possible classes of functions which can define convergence in distri-
bution are the smooth functions, the differentiable functions, the Lipschitz
functions, and so on.

Example 6.3: Convergence to a discrete distribution

Let X_n be a random variable that takes on the values 1/n and 1 − 1/n with probabilities p and 1 − p. Show that X_n →_d X, where X is a Bernoulli random variable with parameter p.
Solution: The cdfs of X and X_n are

    F_X(x) = 0 if x < 0,   p if 0 ≤ x < 1,   1 if 1 ≤ x;
    F_{X_n}(x) = 0 if x < 1/n,   p if 1/n ≤ x < 1 − 1/n,   1 if 1 − 1/n ≤ x.

The continuity points of F_X are all points other than 0 and 1. If x < 0 then F_X(x) = F_{X_n}(x) = 0 for all n; similarly, if x ≥ 1 then F_X(x) = F_{X_n}(x) = 1 for all n. If 0 < x < 1, let ε = min{x, 1 − x}. Then for any n > 1/ε, F_{X_n}(x) = p, so lim_{n→∞} F_{X_n}(x) = F_X(x) = p.
Note that for all n, F_{X_n}(0) = 0, whereas F_X(0) = p, so there is no convergence of cdfs at the discontinuity point x = 0.
We could also show directly that E[f(X_n)] → E[f(X)] for any bounded continuous f. Since f is continuous, f(0) = lim_{n→∞} f(1/n) and f(1) = lim_{n→∞} f(1 − 1/n). Thus

    E[f(X_n)] = p f(1/n) + (1 − p) f(1 − 1/n) → p f(0) + (1 − p) f(1) = E[f(X)].

Note that this demonstration is actually somewhat more straightforward than the one based on cdfs. 

Example 6.4: Convergence to a discrete distribution

Let X_n be a random variable that takes on the values 0 and 1 with probabilities p − p/n and 1 − p + p/n. Show that X_n →_d X, where X is a Bernoulli random variable with parameter p.
Solution: The cdfs of X and X_n are

    F_X(x) = 0 if x < 0,   p if 0 ≤ x < 1,   1 if 1 ≤ x;
    F_{X_n}(x) = 0 if x < 0,   p − p/n if 0 ≤ x < 1,   1 if 1 ≤ x.

The continuity points of F_X are all points other than 0 and 1. If x < 0 then F_X(x) = F_{X_n}(x) = 0 for all n; similarly, if x ≥ 1 then F_X(x) = F_{X_n}(x) = 1 for all n. If 0 < x < 1 then

    F_{X_n}(x) = p − p/n → p = F_X(x).

Note that in this case F_{X_n}(0) = p − p/n → F_X(0), so in fact we have convergence at all points.

We could also show directly that E[f(X_n)] → E[f(X)] for any bounded f; here we don't even need f to be continuous:

    E[f(X_n)] = (p − p/n) f(0) + (1 − p + p/n) f(1) → p f(0) + (1 − p) f(1) = E[f(X)].

Note that in this case convergence holds for all bounded f, not just continuous ones. 

Note the basic principle in the above two examples: Xn converges to X


in distribution if they take on the same values with similar probabilities, or
similar values with the same probabilities — or some combination of these,
of course.

Example 6.5: Discrete approximations, continued

Let ξ_1, ξ_2, . . . be i.i.d. random variables taking on the values 0 and 1, each with probability 1/2, and define

    X_n := Σ_{i=1}^n 2^{−i} ξ_i.

Show that X_n converges in distribution to a uniform distribution.
Solution: Since the cdf of the uniform distribution is continuous, this means we need to show that lim_{n→∞} P{X_n ≤ t} = t for t ∈ (0, 1). Let t = 0.b_1 b_2 . . . be the binary representation of t; that is, t = Σ_{i=1}^∞ b_i 2^{−i}, where each b_i is 0 or 1. (In order to make this unique, we specify that a terminating expansion ends with all 0's, rather than all 1's.)
Let N be a random variable, defined to be

    N := min{i : b_i ≠ ξ_i}.

That is, N is the first place where the bits ξ_i disagree with the binary digits b_i of t. Note that each ξ_i has probability 1/2 of matching b_i, so N has a geometric distribution with parameter 1/2. Of course, if the first disagreement comes at a place where b_N = 0 (so ξ_N = 1), then X_n > t for n ≥ N; if b_N = 1 and ξ_N = 0 then X_n < t. And if N > n then

X_n ≤ t anyway. So

    P{X_n ≤ t} = P{N > n} + Σ_{1 ≤ i ≤ n : b_i = 1} P{N = i} = 2^{−n} + Σ_{i=1}^n b_i 2^{−i},

so

    |P{X_n ≤ t} − t| = | 2^{−n} − Σ_{i=n+1}^∞ b_i 2^{−i} | ≤ 2^{−n} → 0.
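The construction in this example is easy to check numerically. Here is a minimal simulation sketch (in Python, assuming numpy is available; it is not part of the notes and nothing in it is examinable): it draws the bits ξ_i, forms X_n, and compares P{X_n ≤ t} with t for a few values of t.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 10, 100_000
    xi = rng.integers(0, 2, size=(reps, n))               # i.i.d. fair bits xi_1, ..., xi_n
    Xn = (xi * 2.0 ** -np.arange(1, n + 1)).sum(axis=1)   # X_n = sum of 2^{-i} xi_i
    for t in (0.25, 0.5, 0.9):
        print(t, (Xn <= t).mean())                        # should be within about 2^{-n} of t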

6.2 The Continuity Theorem for mgfs


The last two conditions are particularly useful, so we state them as a theo-
rem.
Theorem 6.3 (Continuity Theorem). Let X_1, X_2, . . . be random variables with moment generating functions M_1, M_2, . . . defined on an interval [−t_0, t_0], and Y a random variable with mgf M_Y defined on the same interval. Then X_n → Y in distribution if and only if

    lim_{n→∞} M_n(t) = M_Y(t) for all t ∈ [−t_0, t_0].

The same holds true with characteristic functions in place of the moment generating functions.
This may seem surprising — we can check convergence of expectations
for all bounded continuous f just by checking the exponential functions —
but this is quite similar to the fact discussed in Lecture 5 that a distribution
is determined entirely by its moment generating function or characteristic
function.

6.3 Convergence in probability


Another important notion of convergence that lies between strong and weak
convergence is called convergence in probability. This is like strong
convergence in that it requires that the random variables Y and (Xi )∞i=1
live on the same probability space, but like weak convergence in that the
convergence need not be along paths.

Definition 6.4. We say that the random variables X_i : Ω → R converge in probability to the random variable Y : Ω → R if

    lim_{i→∞} P{|X_i − Y| > ε} = 0 for all ε > 0.        (6.3)

Equivalently,

    ∀ε > 0 ∃n s.t. P{|X_i − Y| > ε} < ε for all i > n.        (6.4)

Theorem 6.5. If X_i converges in probability to X, then X_i converges in distribution to X.

Proof. We begin by noting the trivial fact that

    X + |X_n − X| ≥ X_n ≥ X − |X_n − X|.

Fix any x ∈ R where the cdf F_X is continuous, and choose ε > 0. Then whenever X > x + ε and |X_n − X| ≤ ε, it must be that X_n > x. Thus

    F_{X_n}(x) = P{X_n ≤ x}
              ≤ P{X ≤ x + ε or |X_n − X| > ε}
              ≤ P{X ≤ x + ε} + P{|X_n − X| > ε}
              → F_X(x + ε).

Similarly,

    1 − F_{X_n}(x) = P{X_n > x}
                   ≤ P{X > x − ε or |X_n − X| > ε}
                   ≤ P{X > x − ε} + P{|X_n − X| > ε}
                   → 1 − F_X(x − ε).

Thus

    F_X(x − ε) ≤ lim_{n→∞} F_{X_n}(x) ≤ F_X(x + ε).

Since this holds for every ε > 0 and F_X is continuous at x, letting ε ↓ 0 gives lim_{n→∞} F_{X_n}(x) = F_X(x).

6.4 Law of large numbers


The expectation of a random variable X is the average value of all values
that the random variable can take on, weighted by the probability of having
each value. You might then expect that we can get close to the expectation
by averaging many different samples of X. Formal mathematical statements

of this idea are called “laws of large numbers”. There are two basic versions:
the Weak Law of Large Numbers (WLLN — this says the average will be
close to the expectation with high probability) and the Strong Law of Large
Numbers (SLLN — this says that as we take more and more samples, the
means converge to the expectation). Only the weak law is included in the
syllabus of this course.

6.4.1 Weak Law of Large Numbers


Theorem 6.6. Let X_1, X_2, . . . be i.i.d. random variables with expectation µ and finite variance. Then the means S_n := n^{−1}(X_1 + · · · + X_n) converge in probability to µ.

Proof. Let σ² be the variance of X_i. Then

    E[S_n] = n^{−1} Σ_{i=1}^n E[X_i] = µ,
    Var(S_n) = n^{−2} Σ_{i=1}^n Var(X_i) = σ²/n        by (3.14).

Applying Chebyshev's inequality, for any ε > 0

    P{|S_n − µ| > ε} ≤ Var(S_n)/ε² = σ²/(nε²),

which goes to 0 as n → ∞, no matter what ε is.

Note that we have actually proved a result that is stronger than the statement of the theorem in two ways. First, we didn't really use independence, only the fact that the X_i are uncorrelated. Second, we didn't just prove the asymptotic result that S_n converges to µ; we actually have an explicit bound on the probability of |S_n − µ| being bigger than ε for each n.
In fact, the WLLN still holds even if the covariances are nonzero, as long
as the covariance between Xi and Xj goes to 0 as i and j get far apart. We
leave this as an exercise.
The WLLN even still holds if Xi has infinite variance, just as long as the
expectation is finite; but in this case we may not have any general bound
on the tail probabilities.
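A quick way to see the Chebyshev bound in action is a small simulation. The sketch below (Python with numpy assumed; not part of the course) compares the empirical value of P{|S_n − µ| > ε} with the bound σ²/(nε²) for i.i.d. Exponential(1) variables, for which µ = σ² = 1.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, eps, reps = 1.0, 1.0, 0.1, 20_000
    for n in (10, 100, 1000):
        Sn = rng.exponential(1.0, size=(reps, n)).mean(axis=1)   # sample means S_n
        empirical = np.mean(np.abs(Sn - mu) > eps)               # P{|S_n - mu| > eps}
        chebyshev = sigma2 / (n * eps ** 2)                      # bound from the proof
        print(n, empirical, min(chebyshev, 1.0))

The bound is crude (for n = 1000 it gives 0.1, while the true probability is closer to 0.002), but it does go to 0, which is all the weak law needs.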

6.5 Convergence in mean square


If X_n →_d X, does it follow that E[X_n] → E[X]? No: consider the random variable X_n taking on the values 0 and n, with probabilities 1 − 1/n and 1/n respectively, and X ≡ 0. Then

    P{X_n ≤ x} = 0 if x < 0,   1 − 1/n if 0 ≤ x < n,   1 if x ≥ n,

which converges to 0 for x < 0 and 1 for x ≥ 0. So X_n →_d X, but E[X_n] = 1 for each n, so it doesn't converge to 0 = E[X]. In fact, X_n converges to X in probability, since P{|X_n − X| > ε} = 1/n for ε ∈ (0, 1).
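This counterexample is also easy to see in a simulation (a sketch in Python with numpy, not part of the notes): almost every simulated value of X_n is 0, yet the sample mean stays near 1, matching E[X_n] = 1.

    import numpy as np

    rng = np.random.default_rng(2)
    for n in (10, 100, 10_000):
        Xn = np.where(rng.random(100_000) < 1.0 / n, n, 0)   # n with probability 1/n, else 0
        print(n, (Xn > 0.5).mean(), Xn.mean())               # P{X_n > eps} ~ 1/n, mean ~ 1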
We introduce a stronger version of convergence in probability, which does
imply convergence of expectations and variances.
Definition 6.7. We say that a sequence of random variables X_n converges in mean square to a random variable X (all defined on the same probability space) if

    lim_{n→∞} E[(X_n − X)²] = 0.        (6.5)
Fact 6.8. If X_n converge to X in mean square then
• X_n converge to X in probability;
• if E[X] < ∞ then lim_{n→∞} E[X_n] = E[X];
• if E[X²] < ∞ then lim_{n→∞} Var(X_n) = Var(X).

Proof. The first part is left as an exercise.
We have (by property (E7) of expectations)

    |E[X_n] − E[X]| ≤ E[|X_n − X|] ≤ E[(X_n − X)²]^{1/2} → 0.

(Remember: for any random variable Z we have E[Z²] − E[Z]² = Var(Z) ≥ 0, so E[|Z|] ≤ E[Z²]^{1/2}.) Similarly,

    |E[X_n²] − E[X²]| = |E[(X_n − X)² + 2X(X_n − X)]|
                      ≤ E[(X_n − X)²] + 2 E[X²]^{1/2} E[(X_n − X)²]^{1/2}
                      → 0

by the Cauchy–Schwarz inequality (3.9).

6.6 Strong convergence


(This section is not examinable. For more discussion, see section 7.6 of
[GS01].)

6.6.1 Convergence in probability doesn’t imply strong con-


vergence
As we will discuss in section 6.6.3, strong convergence implies weak conver-
gence. That the converse is not true is obvious, since the random variables
need not even be defined on the same space, but it may not be obvious how
strong convergence can fail when random variables converge in probability.
The point is that convergence in probability implies that the exceptional set — the set of outcomes ω ∈ Ω where |X_n − X| is big (bigger than ε) — gets small for large n, but it could still be that every ω keeps coming back into that set, at longer and longer intervals. A simple example is the following: let X_n be independent random variables taking on the values 0 and 1, with P{X_n = 1} = 1/n, and X the random variable that is 0 always. It's clear that X_n →_p 0, since for any 0 < ε < 1 we have P{|X_n − 0| > ε} = 1/n. But lim_{n→∞} X_n = X only if X_n is eventually 0; that is, there exists N such that X_n = 0 for all n ≥ N. Let B_N be the event {X_n = 0 for all n ≥ N}. If X_n(ω) converges to X(ω) then ω must be in one of the events B_N. Thus the event {X_n converges} is contained in B_1 ∪ B_2 ∪ · · ·. But by independence

    P(B_N) = P{X_N = X_{N+1} = · · · = 0}
           = Π_{n=N}^∞ P{X_n = 0}
           = lim_{K→∞} Π_{n=N}^K (1 − 1/n)
           = lim_{K→∞} (N − 1)/K
           = 0.

Since the B_N all have probability 0, their union has probability 0 as well.
Since the BN all have probability 0, their union has probability 0 as well.

6.6.2 Strong Law of Large Numbers


As long as the random variables are i.i.d. and have finite variance, the prob-
lem described in section 6.6.1 cannot arise: The means really do converge
to the expectation pointwise, not just in probability.

Theorem 6.9. Let X_1, X_2, . . . be i.i.d. random variables with E[X_i] = µ and E[X_i²] < ∞, and S_n = n^{−1}(X_1 + · · · + X_n). Then with probability 1, lim_{n→∞} S_n = µ.

In the form stated here, in contrast to the weak law, existence of finite expectations is not enough: we assume finite variance. Independence also cannot simply be replaced by uncorrelatedness.

6.6.3 Strong convergence is stronger than weak convergence


Our main theoretical concern in the coming lectures is weak convergence, on
which the most important results are the Weak Law of Large Numbers
and the Central Limit Theorem. But it is important to understand
the difference between the two definitions. As already mentioned, weak
convergence is defined in a broader context than strong convergence.

Fact 6.10. If X, X_n : Ω → R are random variables defined on the same probability space, with X_n → X almost surely, then X_n → X in probability.

Proof. Choose any ε > 0. We need to show that lim_{n→∞} P{|X_n − X| > ε} = 0. We know that for every¹ ω ∈ Ω,

    lim_{n→∞} X_n(ω) = X(ω),

which means that there is a random N = N(ω) such that |X_n − X| ≤ ε for all n ≥ N. Thus

    P{|X_n − X| > ε} ≤ P{N > n}.

Note that the events {N > n} are decreasing events, so by the Monotone Event Lemma (Lemma 1.1)

    lim_{n→∞} P{|X_n − X| > ε} ≤ lim_{n→∞} P{N > n} = P( ∩_{n=1}^∞ {N > n} ) = P{N = ∞} = 0.

¹ Actually, there could be a probability-0 set of ω for which X_n(ω) doesn't converge; but we can simply throw out this null set.
Lecture 7

Limit theorems in
probability II: The central
limit theorem

7.1 The Central Limit Theorem


It would be hard to overestimate the importance of the normal approxima-
tion in probability theory. It is a crucial tool, its proofs and extensions have
inspired the development of much of the theory over a century or more, and
as an archetypal result it inspires a vast range of other approximation and
convergence theorems.
In a certain sense elementary probability theory can seem almost vacu-
ous: It tells us how to compute probabilities from other probabilities, but
up until this point distributions have been almost arbitrary. The Central
Limit Theorem (CLT) tells us to expect a single 2-parameter family of dis-
tributions to be applicable to a huge range of problems: Essentially, any
problem where the quantity of interest is a sum (or average) of a large num-
ber of small independent pieces. Applications include measurement errors;
biological variation of measures like height, weight, and intelligence; varia-
tion of the means in statistical samples; counts of random defects in items
coming off an assembly line. Two of the most important discrete distribu-
tions that you have learned, the binomial and Poisson, turn out to be well
approximated by normal distributions when the parameters get large.
Let X1 , . . . , Xn be random variables and Sn = X1 + · · · + Xn . Obviously
this is too general to say anything about the distribution of Sn , but heuris-
tically Sn will always be approximately normal if the following conditions


hold:

• Most pairs Xi , Xj are approximately independent;

• No individual Xi contributes too much to the sum. This may be


expressed by saying the tails of the Xi are not too fat. It suffices to
have a moderate-sized bound on the third moments E[|Xi |3 ], while a
bound on E[Xi2 ] isn’t quite enough, unless the distributions are the
same.

To make a theorem out of this heuristic we will need to impose some


precise conditions. In section ?? we will describe some more general ver-
sions of the CLT. These are not an examinable part of the course, but it is
important to understand that the CLT is not simply the particular result
that we prove here, but is a general principle that holds in great generality.
We will assume that n is large, that the random variables Xi are i.i.d.,
and that they have finite variance. If µ = E[Xi ] and σ 2 = Var(Xi ) then
E[Sn ] = nµ and Var(Sn ) = nσ 2 , so the relevant normal distribution is
N(nµ, nσ 2 ). (This is part of the heuristic of the CLT: We start by having
reason to think that a sum is approximately normal. Then it is a matter of
deciding which normal distribution it is, which is a matter of computing the
expectation and variance.)
We will express our theorem slightly differently. The easiest way to make
a mathematically precise statement is not in terms of approximation, but in
terms of convergence in distribution. We can’t formulate this if the means
and variances are growing with n. Instead, we normalise1 — subtracting the
mean and dividing by the SD to make a random variable with expectation

0 and variance 1. Thus Yn := (Sn − nµ)/σ n should be approximately
standard normal for each n, and we can have Yn converging in distribution.

Theorem 7.1 (Central Limit Theorem). Let X1 , X2 , . . . be a sequence of


i.i.d. random variables with expectation µ and variance σ 2 . Let
    Y_n := (X_1 + · · · + X_n − nµ) / (σ√n).        (7.1)

Then Yn converges in distribution to N(0, 1).


1
Apologies for two completely different uses of the concept normal in a single paragraph.
“Normal” here does not refer to the normal distribution, but to the generic sense of
“normalise” by which we mean to bring something into a standard shape and size.

The conclusion of the CLT may be described in different ways. Writing φ(z) = (1/√(2π)) e^{−z²/2} for the standard normal density, and Φ(z) = ∫_{−∞}^z φ(y) dy for the standard normal cdf, we have

    lim_{n→∞} P{Y_n ≤ z} = Φ(z);
    lim_{n→∞} P{nµ + aσ√n ≤ X_1 + · · · + X_n ≤ nµ + bσ√n} = Φ(b) − Φ(a);
    for bounded continuous F,  lim_{n→∞} E[F(Y_n)] = ∫_{−∞}^∞ F(z) φ(z) dz.

This theorem will be proved in Lecture 8. In this lecture we will interpret


the theorem and show some of the ways it is applied.

7.2 Basic applications of the CLT


7.2.1 The Normal approximation to the Binomial
An important special case is when we take Xi to be 0 or 1 with probability
p or 1 − p respectively — the indicators of Bernoulli trials. Then Sn :=
X1 + · · · + Xn is the number of successes in n trials, which has Binomial
distribution with parameters (n, p).
S_n has expectation np and variance np(1 − p). The CLT then implies that Bin(n, p) ≈ N(np, np(1 − p)) for large values of n. Note that S_n/n is the proportion of successes in n trials, and this has approximately the N(p, p(1 − p)/n) distribution. In other words, the observed proportion will be close to p, but will be off by a small multiple of the SD, which shrinks as σ/√n, where σ = √(p(1 − p)).
How large does n need to be? As in the case of the Poisson distribution, discussed in section 7.2.3, a minimum requirement is that the mean be substantially larger than the SD; in other words, np ≫ √(np(1 − p)), so that n ≫ 1/p. (The condition is symmetric, so we also need n ≫ 1/(1 − p).) This fits with our rule of thumb that n needs to be bigger when the distribution of X is skewed, which is the case when p is close to 0 or 1.
In Figure 7.1 we see that when p = 0.5 the normal approximation is quite good, even when n is only 10; on the other hand, when p = 0.1 we have a good normal approximation when n = 100, but not when n is 25. (Note, by the way, that Binom(25, 0.1) is approximately Po(2.5), so this is closely related to the observations we made in section 7.2.3.)
We then have the normal approximation for the binomial distribution:
[Figure 7.1: Normal approximations to Binom(n, p), for p = 0.5 and p = 0.1 with n = 3, 10, 25, 100. The shaded region is the implied approximate probability of the Binomial variable being < 0 or > n.]

    lim_{n→∞} P{ np + a√(np(1 − p)) ≤ S_n ≤ np + b√(np(1 − p)) } = ∫_a^b (1/√(2π)) e^{−z²/2} dz.        (7.2)
For example, Figure 7.2 compares a Bin(300, 0.5) and a N(150, 75)
which both have the same mean and variance. The figure shows that the
distributions are very similar.

[Figure 7.2: Comparison of a Bin(300, 0.5) and a N(150, 75) distribution.]

In general,

If X ∼ Bin(n, p) then

    µ = np,
    σ² = npq,    where q = 1 − p.

For large n, with p not too small or too large,

    X is approximately N(np, npq).

Rule of thumb: n ≥ 5/min{p, 1 − p}. (This makes the expectation bigger than about 3 times the SD.)

7.2.2 Continuity correction


Suppose X ∼ Bin(12, 0.5); what is P(4 ≤ X ≤ 7)?

For this distribution we have

µ = np = 6
σ 2 = npq = 3

So we can use a N(6, 3) distribution as an approximation.

Unfortunately, it’s not quite so simple. We have to take into account the
fact that we are using a continuous distribution to approximate a discrete
distribution. This is done using a continuity correction. The continuity
correction appropriate for this example is illustrated in the figure below

In this example, P (4 ≤ X ≤ 7) transforms to P (3.5 < X < 7.5)

    P(3.5 < X < 7.5) = P( (3.5 − 6)/√3 < (X − 6)/√3 < (7.5 − 6)/√3 )
                     = P(−1.443 < Z < 0.866)   where Z ∼ N(0, 1)
                     = 0.732

The exact answer is 0.733 so in this case the approximation is very good.

[Figure: the continuity correction for P(4 ≤ X ≤ 7): the discrete range {4, . . . , 7} is replaced by the interval (3.5, 7.5).]
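For anyone who wants to check the numbers, the comparison can be done in a couple of lines (a sketch using Python with scipy, which is not part of the course):

    from scipy.stats import binom, norm

    exact = binom.cdf(7, 12, 0.5) - binom.cdf(3, 12, 0.5)     # P(4 <= X <= 7) exactly
    approx = norm.cdf(7.5, loc=6, scale=3 ** 0.5) - norm.cdf(3.5, loc=6, scale=3 ** 0.5)
    print(exact, approx)                                       # roughly 0.733 and 0.732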

Example 7.1: CLT for dice

Suppose we roll 100 standard 6-sided dice. Let X be the sum


of the outcomes of all the rolls. Estimate the probability that
X ≥ 400.
Solution: Let ξ_i be the outcome of the i-th roll. We have

    µ = E[ξ_i] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5,
    σ = √(Var(ξ_i)) = √( (2.5² + 1.5² + 0.5² + 0.5² + 1.5² + 2.5²)/6 ) ≈ 1.71.

Using the continuity correction, we are computing

    P{X ≥ 399.5} = P{ (X − nµ)/(σ√n) ≥ (399.5 − 350)/17.1 }
                 ≈ P{Z ≥ 2.89}
                 ≈ 0.00193.
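A Monte Carlo check of this answer is straightforward (a sketch in Python with numpy; the 200,000 repetitions are an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(3)
    totals = rng.integers(1, 7, size=(200_000, 100)).sum(axis=1)   # sum of 100 dice, many times
    print((totals >= 400).mean())       # Monte Carlo estimate; the CLT gave about 0.0019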


λ Standardised Z Normal probability


1 -1.5 0.067
4 -2.25 0.012
10 -3.32 0.00045
20 -4.58 0.0000023

Table 7.1: Probability below −0.5 in the normal approximation to Poisson


random variables with different parameters λ.

7.2.3 Poisson distribution


Suppose Xi are drawn from a Poisson distribution with parameter µ. The
variance is then also µ. We know that X1 +· · ·+Xn has Poisson distribution
with parameter nµ. The CLT tells us, then, that for n large enough, the
P o(nµ) distribution is very close to the N (nµ, nµ) distribution; or, in other
words, P o(λ) is approximately the same as N (λ, λ) for λ large. How large
should it be?
The Poisson distribution is shown in Figure 7.3 for different values of λ, together with the approximating normal density curve. One way of seeing the failure of the approximation for small λ is to note that when λ is not much bigger than √λ — "much bigger" meaning a factor of 2.5 or so, so λ < 6.2 — the normal curve will have substantial probability below −0.5. Since this is supposed to approximate the probability of the corresponding Poisson distribution below 0, this manifestly represents a failure. For instance, when λ = 1, the Po(1) distribution is supposed to be approximated by N(1, 1), implying

    P{Po(1) < 0} ≈ P{N(1, 1) < −0.5} ≈ P{N(0, 1) < −1.5} = 0.067.

In general, the threshold −0.5 corresponds to Z = (−0.5 − λ)/√λ. The corresponding values for other parameters are given in Table 7.1.

Example 7.2: Radioactive emission

A radioactive source emits particles at an average rate of 25


particles per second. What is the probability that in 1 second
the count is less than 27 particles?
Solution: The number of particles emitted in a period of time
should have a Poisson distribution. (To understand why, see

[Figure 7.3: Normal approximations to Po(λ), for λ = 1, 4, 10, 20. The shaded region is the implied approximate probability of the Poisson variable being < 0.]

Lecture 15.) Let X be the number of particles emitted in 1 second, so X ∼ Po(25).
The CLT says this distribution should be approximately the same as N(25, 25).
Again, we need to make a continuity correction, so P(X < 27) is understood as P(X ≤ 26.5). Then

    P(X ≤ 26.5) = P( (X − 25)/5 ≤ (26.5 − 25)/5 )
                ≈ P(Z ≤ 0.3)   where Z ∼ N(0, 1)
                = 0.6179.

[Figure: standardising X ∼ N(25, 25) to Z = (X − 25)/5 ∼ N(0, 1); P(X < 26.5) = P(Z < 0.3).]
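The exact Poisson probability is just as easy to compute, which gives a sense of how good the approximation is (a sketch using Python with scipy, not part of the course):

    from scipy.stats import poisson, norm

    print(poisson.cdf(26, 25))               # exact P(X < 27) = P(X <= 26) for X ~ Po(25)
    print(norm.cdf(26.5, loc=25, scale=5))   # normal approximation with continuity correction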

7.2.4 CLT for real data


We show how the CLT is applied to understand the mean of samples from
real data. It permits us to apply our Z and t tests for testing population
means and computing confidence intervals for the population mean (as well
as for differences in means) even to data that are not normally distributed.
(Caution: Remember that t is an improvement over Z only when the number

of samples being averaged is small. Unfortunately, the CLT itself may not
apply in such a case.) We have already applied this idea when we did the Z
test for proportions, and the CLT was also hidden in our use of the χ2 test.

Quebec births
We begin with an example that is well suited to fast convergence. We
have a list of 5,113 numbers, giving the number of births recorded each
day in the Canadian province of Quebec over a period of 14 years, from 1
January, 1977 through 31 December, 1990. (The data are available at the
Time Series Data Library http://www.robjhyndman.com/TSDL/, under the
rubric “demography”.) A histogram of the data is shown in Figure 7.4(a).
The mean number of births is µ = 251, and the SD is σ = 41.9.
Suppose we were interested in the average number of daily births, but
couldn’t observe data for all of the days. How many days would we need
to observe to get a reasonable estimate? Obviously, if we observed just a
single day’s data, we would be seeing a random pick from the histogram
7.4(a), which could be far off of the true value. (Typically, it would be off
by about the SD, which is 41.9.) Suppose we sample the data from n days,
obtaining counts x1 , . . . , xn , which average to x̄. How far off might this be?
The normal approximation tells us that a 95% confidence interval for µ will be x̄ ± 1.96 · 41.9/√n. For instance, if n = 10, and we find the mean of our 10 samples to be 245, then a 95% confidence interval will be (219, 271).
If there had been 100 samples, the confidence interval would be (237, 253).
Put differently, the average of 10 samples will lie within 26 of the true mean
95% of the time, while the average of 100 samples will lie within 8 of the
true mean 95% of the time.
This computation depends upon n being large enough to apply the CLT.
Is it? One way of checking is to perform a simulation: We let a computer
pick 1000 random samples of size n, compute the means, and then look at
the distribution of those 1000 means. The CLT predicts that they should
have a certain normal distribution, so we can compare them and see. If
n = 1, the result will look exactly like Figure 7.4(a), where the curve in red
is the appropriate normal approximation predicted by the CLT. Of course,
there is no reason why the distribution should be normal for n = 1. We see
that for n = 2 the true distribution is still quite far from normal, but by
n = 10 the normal is already starting to fit fairly closely, and by n = 100
the fit has become extremely good.
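A rough sketch of that simulation is given below (Python with numpy assumed; the hypothetical array births stands for the 5,113 daily counts, which are not reproduced in these notes):

    import numpy as np

    rng = np.random.default_rng(4)

    def simulate_means(births, n, reps=1000):
        """Means of `reps` random samples of size n drawn from the daily counts."""
        samples = rng.choice(births, size=(reps, n), replace=True)
        return samples.mean(axis=1)

    # Compare a histogram of simulate_means(births, 10) with the
    # N(251, 41.9**2 / 10) density, as in Figure 7.4(d).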
Suppose we sample 100 days at random. What is the probability that the total number of births is at least 25500?

[Figure 7.4: Normal approximations to averages of n samples from the Quebec birth data, for n = 1, 2, 5, 10, 50, 100.]

Let S = Σ_{i=1}^{100} X_i. Then S is approximately normally distributed with mean 25100 and SD 41.9 · √100 = 419. We compute by standardising:

    P{S > 25500} = P{ (S − 25100)/419 > (25500 − 25100)/419 } = P{Z > 0.95},

where Z = (S −25100)/419. By the CLT, Z has approximately the standard


normal distribution, so we can look up its probabilities on the table, and see
that P {Z > 0.95} = 1 − P {Z ≤ 0.95} = 0.171.

S comes in whole number values, so shouldn't we have made the cutoff 25500.5? Or should it be 25499.5? If we want the probability that S is strictly bigger than 25500, then the cutoff should be 25500.5. If we want the probability that S is at least 25500, then the cutoff should be 25499.5. If we don't have a specific preference, then 25500 is a reasonable compromise. Of course, in this case, it only makes a difference of about 0.002 in the value of Z, which is negligible.

This is of course the same as the probability that the average number of births is at least 255. We could also compute this by reasoning that X̄ = S/100 is approximately normally distributed with mean 251 and SD 41.9/√100 = 4.19. Thus

    P{X̄ > 255} = P{ (X̄ − 251)/4.19 > (255 − 251)/4.19 } = P{Z > 0.95},

which comes out to the same thing.

California incomes
A standard example of a highly skewed distribution — hence a poor can-
didate for applying the CLT — is household income. The mean is much
greater than the median, since there are a small number of extremely high
incomes. It is intuitively clear that the average of incomes must be hard to
predict. Suppose you were sampling 10,000 Americans at random — a very
large sample — whose average income is £30,000. If your sample happens
to include Bill Gates, with annual income of, let us say, £3 billion, then his
income will be ten times as large as the total income of the entire remainder
of the sample. Even if everyone else has zero income, the sample mean will

be at least £300,000. The distribution of the mean will not converge, or will
converge only very slowly, if it can be substantially affected by the presence
or absence of a few very high-earning individuals in the sample.
Figure 7.5(a) is a histogram of household incomes, in thousands of US
dollars, in the state of California in 1999, based on the 2000 US census (see
www.census.gov). We have simplified somewhat, since the final category is
“more than $200,000”, which we have treated as being the range $200,000 to
$300,000. (Remember that histograms are on a density scale, with the area
of a box corresponding to the number of individuals in that range. Thus,
the last three boxes all correspond to about 3.5% of the population, despite
their different heights.) The mean income is about µ = $62, 000, while the
median is $48,000. The SD of the incomes is σ = $55, 000.
Figures 7.5(b)—7.5(f) show the effect of averaging 2,5,10,50, and 100
randomly chosen incomes, together with a normal distribution (in green) as
predicted by the CLT, with mean µ and variance σ 2 /n. We see that the
convergence takes a little longer than it did with the more balanced birth
data of Figure 7.4 — averaging just 10 incomes is still quite skewed — but
by the time we have reached the average of 100 incomes the match to the
predicted normal distribution is remarkably good.

7.3 Normal approximation for statistical inference


There are many implications of the Central Limit Theorem. We can use it
to estimate the probability of obtaining a total of at least 400 in 100 rolls of
a fair six-sided die, for instance, or the probability of a subject in an ESP
experiment, guessing one of four patterns, obtaining 30 correct guesses out
of 100 purely by chance. These were discussed in lecture 6 of the first set
of lectures. It suggests an explanation for why height and weight, and any
other quantity that is affected by many small random factors, should end
up being normally distributed.
Here we discuss one crucial application: The CLT allows us to com-
pute normal confidence intervals and apply the Z test to data that are not
themselves normally distributed.

7.3.1 An example: Average incomes


Suppose we take a random sample of 400 households in Oxford, and find
that they have an average income of £36,200, with an SD of £26,400. What
can we infer about the average income of all households in Oxford?

Answer: Although the distribution of incomes is not normal — and if we weren't sure of that, we could see it from the fact that the SD is not much smaller than the mean — the average of 400 incomes will be approximately normally distributed. The SE for the mean is £26400/√400 = £1320, so a 95% confidence interval for the average income in the population will be £36200 ± 1.96 · £1320 ≈ (£33600, £38800). A 99% confidence interval is £36200 ± 2.6 · £1320 ≈ (£32800, £39600).
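There is nothing here beyond arithmetic, but for completeness, a short sketch of the computation (plain Python, using only the numbers quoted above):

    n, xbar, sd = 400, 36200, 26400
    se = sd / n ** 0.5                            # standard error of the mean: 26400/20 = 1320
    print(xbar - 1.96 * se, xbar + 1.96 * se)     # 95% interval, about (33600, 38800)
    print(xbar - 2.6 * se, xbar + 2.6 * se)       # 99% interval, about (32800, 39600)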

Example 7.3: ESP Experiment

This example is adapted from [FPP98].


In the 1970s, the psychologist Charles Tart tried to test whether
people might have the power to see into the future. His “Aquar-
ius machine” was a device that would flash four different lights
in random orders. Subjects would press buttons to predict which
of the 4 lights will come on next.
15 different subjects each ran a trial of 500 guesses, so 7500 guesses in total. They produced 2006 correct guesses and 5494 incorrect. What should we conclude?
We might begin by hypothesising that without any power to predict the future, a subject has just a 1/4 chance of guessing right each time, independent of any other outcomes. Thus, the number of correct guesses X has the Bin(7500, 1/4) distribution. This has mean µ = 7500/4 = 1875, and standard deviation √(7500 × 0.25 × 0.75) = 37.5. The result is thus above the mean:
There were more correct guesses than would be expected. Might
we plausibly say that the difference from the expectation is just
chance variation?
We want to know how likely a result this extreme would be if
X really has this binomial distribution. We could compute this

directly from the binomial distribution as

    P(X ≥ 2006) = Σ_{x=2006}^{7500} (7500 choose x) (0.25)^x (0.75)^{7500−x}
                = (7500 choose 2006) (0.25)^{2006} (0.75)^{5494}
                  + (7500 choose 2007) (0.25)^{2007} (0.75)^{5493} + · · ·
                  + (7500 choose 7500) (0.25)^{7500} (0.75)^0.

This is not only a lot of work, it is also not very illuminating.


More useful is to treat X as a continuous variable that is ap-
proximately normal.
We sketch the relevant normal curve in Figure 7.6. This is the normal distribution with mean 1875 and SD 37.5. Because of the continuity correction, the probability we are looking for is P(X > 2005.5). We convert x = 2005.5 into standard units:

    z = (x − µ)/σ = (2005.5 − 1875)/37.5 = 3.48.
(Note that with such a large SD, the continuity correction makes
hardly any difference.) We have then P (X > 2005.5) ≈ P (Z >
3.48) = 0.000250, where Z has standard normal distribution.2
Since most of the probability of the standard normal distribution
is between −2 and 2, and nearly all between −3 and 3, we know
this is a small probability. From a table, or using the appropriate
computer command we see that

P (X > 2005.5) = P (Z > 3.48) = 1 − P (Z < 3.48) = 0.000250.

This may be compared to the exact binomial probability 0.000274.




² Reminder: To calculate the normal tail probability you could use a table — they are found in many probability textbooks — the Matlab command normcdf (see http://www.mathworks.co.uk/help/toolbox/stats/normcdf.html), the R command pnorm, or an online applet http://www-stat.stanford.edu/~naras/jsm/FindProbability.html.
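The two numbers quoted in the example can be reproduced directly (a sketch using Python with scipy; this is only a numerical companion to the calculation above):

    from scipy.stats import binom, norm

    exact = binom.sf(2005, 7500, 0.25)              # P(X >= 2006) under Bin(7500, 1/4)
    approx = norm.sf(2005.5, loc=1875, scale=37.5)  # continuity-corrected normal tail
    print(exact, approx)                            # about 0.000274 and 0.000250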

[Figure 7.5: Normal approximations to averages of n samples from the California income data, for n = 1, 2, 5, 10, 50, 100. The green curve shows a normal density with mean µ and variance σ²/n.]

[Figure 7.6: Normal approximation to the distribution of correct guesses in the Aquarius experiment, under the binomial hypothesis. The solid line is the mean, and the dotted lines show 1 and 2 SDs away from the mean.]
Lecture 8

Proof of the Central Limit


Theorem

We prove the CLT only for the case µ = 0 and σ = 1. Note that for general random variables X_i, the normalised random variables X̃_i := (X_i − µ)/σ have expectation 0 and variance 1. The CLT for sums of X_i follows immediately from the corresponding CLT for sums of X̃_i.

8.1 Many paths up the mountain


There are many different proofs of the Central Limit Theorem, correspond-
ing to many different special properties of the normal distribution which
are the “reason” why sums of i.i.d. random variables become normal in the
limit.

8.1.1 De Moivre’s proof


The first proof of a version of the CLT was by A. de Moivre in 1733. 1
He proved the convergence of the binomial distribution with p = 1/2 to the normal, a version of (7.2). Essentially, the "special property" of the Gaussian distribution is that its density φ satisfies (up to an additive constant)

    log φ(z) = −z²/2 = − Σ_{i=0}^{z−1} (2i + 1)/2.
¹ De Moivre's proof and an entertaining discussion are presented in [Dav62]. The full text of the 1756 third edition of De Moivre's Doctrine of Chances is available at http://www.archive.org/details/doctrineofchance00moiv. An accessible but complete version of this proof is in section 2.3 of [Pit93].


Consider (7.2) in the case a = 0 < b and n = 2k even. Then P{Y_{2k} = k + i} = (2k choose k+i) 2^{−2k}. Observe that

    P{Y_{2k} = k + i + 1} / P{Y_{2k} = k + i} = (k − i)/(k + i + 1) ≈ 1 − (2i + 1)/k ≈ exp(−(2i + 1)/k)

for 0 ≤ i ≪ k. Thus

    P{Y_{2k} = k + j} ≈ 2^{−2k} (2k choose k) exp( − Σ_{i=0}^{j−1} (2i + 1)/k ) = 2^{−2k} (2k choose k) exp(−j²/k).

Using the fact that 2^{−2k} (2k choose k) ∼ (πk)^{−1/2} (from Stirling's formula), a bit of manipulation then turns

    P{k ≤ Y_{2k} ≤ k + b√(k/2)} ≈ Σ_{0 ≤ j ≤ b√(k/2)} 2^{−2k} (2k choose k) exp(−j²/k)

into a Riemann sum converging to (2π)^{−1/2} ∫_0^b e^{−z²/2} dz.

8.1.2 Maximum entropy


We can define the “entropy” of a distribution as a measure of how “spread
out” the distribution is. Thus, among discrete distributions on {1, . . . , n}
the uniform distribution has maximum entropy, and on the interval [0, 1]
or the closed circle the uniform continuous distribution has maximum en-
tropy. For a continuous distribution on all of R there is no such thing as a uniform distribution, but the entropy, which is given by −∫_{−∞}^∞ f(z) log f(z) dz for a density f, is maximised (over all distributions with a given variance) by a normal density. Thus, we might say that the normal distribution is the closest thing to a uniform distribution that is possible on an unbounded domain. If you imagine taking steps at random around a circle, you might expect that, if the steps are sufficiently random, the distribution will converge to uniform on the circle. In the same way, adding a distribution to itself will converge to the normal distribution on R. If X has finite variance then (X_1 + · · · + X_n − nµ)/√n has the same variance and increasing entropy; with some work, it can be shown that it must converge to the maximum entropy distribution, namely the normal distribution. (For a detailed look at this sort of argument, see [Joh04].)

8.1.3 Fixed point


Suppose there were some distribution with variance 1 such that normalised averages of i.i.d. copies of other random variables with variance 1 converge to this distribution. The only possible candidate is the standard normal distribution. Why? Suppose X_1, . . . , X_n are i.i.d. standard normal. Then Y_n = (X_1 + · · · + X_n)/√n is also standard normal. Thus Y_n couldn't possibly converge to any distribution other than standard normal.
To put this more formally, consider the mapping on distributions with expectation 0 that takes a distribution P to the distribution of (X_1 + X_2)/√2, where X_1, X_2 are i.i.d. with distribution P. Then normal distributions are fixed points of this mapping; what is more, they are the only fixed points of this mapping. (This result was already mentioned in section 2.4.2. It is not itself part of the course.)
Furthermore, in some sense that we won't make precise, this mapping is a contraction. That is, no matter what random variables X and Y I take, (X_1 + X_2)/√2 is closer to (Y_1 + Y_2)/√2 than X is to Y. By iteration, then, we get normalised averages of any starting distribution getting closer and closer to each other, and in particular to the one fixed point, which is the normal distribution. We will focus on this approach to the proof.²

8.2 Continuity Theorem


We begin with a very simple proof of the CLT using moment generating
functions. The advantage of this proof is that it is both short and elementary.
The disadvantages are multiple: First, it is “elementary” only in the sense
that the profound part — the proof of the Continuity Theorem for mgfs —
has been bracketed out. We have not covered it in this course. Second, it
does not cover the full range of cases that we are concerned with: We need to
assume that the distribution has exponential tails, ruling out such natural
examples as the Pareto distribution (section 1.3.6). Third, and perhaps
most important, the proof is purely abstract and asymptotic: It offers little
insight about what it is about the normal distribution that makes it the
limit, nor does it give any information about how far the distribution of Yn
is from normal for a given n.
2
This principle is very commonly used in mathematics: If a map H has Lipschitz
constant < 1, then there is a unique fixed point of H, and successive iterations of H
converge to that fixed point. This is called the Banach Fixed Point Theorem. It should
remind you of Picard’s iterative method for proving the existence of solutions to ODEs,
which you learned in Michaelmas Term.

We prove the following weak form of the Central Limit Theorem:

Theorem 8.1. Let X1 , X2 , . . . be i.i.d. random variables whose moment


generating function MX (t) is defined for all |t| < t0 . Then the normalised
sums Yn defined by (7.1) converge to the standard normal distribution.

Proof. Remember that we are assuming that E[X_i] = 0 and Var(X_i) = 1. By properties (MGF2) and (MGF6), we can compute the mgf of Y_n as

    M_{Y_n}(t) = M_X(t/√n)^n.

Using Lemma 5.3 we see that there is a constant C such that for n sufficiently large

    | log M_X(t/√n) − σ²(t/√n)²/2 | ≤ C |t/√n|³.

Thus

    log M_{Y_n}(t) = n log M_X(t/√n) = t²/2 + Error,

where |Error| ≤ C|t|³ n^{−1/2}, which goes to 0 as n → ∞. Thus, for each t with |t| < t_0,

    lim_{n→∞} M_{Y_n}(t) = e^{t²/2} = M_Z(t),

where Z is standard normal. Thus the CLT follows from Theorem 6.3.
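It can be instructive to watch this convergence of mgfs numerically. The sketch below (Python with numpy; the choice of centred Exponential(1) summands is just an illustration, not part of the notes) uses the exact formula M_X(t) = e^{−t}/(1 − t) for X = E − 1 with E ∼ Exp(1), so that M_{Y_n}(t) = M_X(t/√n)^n, and compares it with e^{t²/2}.

    import numpy as np

    t = 0.5
    for n in (10, 100, 10_000):
        s = t / np.sqrt(n)
        print(n, (np.exp(-s) / (1 - s)) ** n)    # M_{Y_n}(t) for centred Exp(1) summands
    print("limit e^{t^2/2} =", np.exp(t ** 2 / 2))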

8.3 Proof of the CLT


We prove the following fact: Let φ : R → R be any function that is three-times differentiable, such that the third derivative of φ is bounded. Let K = sup |φ‴|.
We are given i.i.d. random variables X_1, X_2, . . . with expectation 0 and variance 1. Let Z_1, Z_2, . . . be i.i.d. random variables independent of all the X's, also with expectation 0 and variance 1, and define Y*_n := (Z_1 + · · · + Z_n)/√n. For 1 ≤ i ≤ n define

    Ỹ_{i,n} := n^{−1/2} (X_1 + · · · + X_{i−1} + Z_{i+1} + · · · + Z_n).

In other words, we interpolate between the sum of the X's and the sum of the Z's by replacing X's by Z's one at a time. We have

    Y_n = Ỹ_{n,n} + X_n/√n,        Y*_n = Ỹ_{1,n} + Z_1/√n.

Note, too, that Ỹ_{i,n} + Z_i/√n = Ỹ_{i−1,n} + X_{i−1}/√n.
Suppose first that τ := max{E[|X_i|³], E[|Z_i|³]} < ∞. We want to show that the distributions of Y_n and of Y*_n are close together. We do this by forming a telescoping sum

    | E[φ(Y_n)] − E[φ(Y*_n)] | = | Σ_{i=1}^n ( E[φ(Ỹ_{i,n} + X_i/√n)] − E[φ(Ỹ_{i,n} + Z_i/√n)] ) |        (8.1)
                               ≤ Σ_{i=1}^n | E[φ(Ỹ_{i,n} + X_i/√n)] − E[φ(Ỹ_{i,n} + Z_i/√n)] |
by the triangle inequality. We now apply Taylor's Theorem to the function φ, expanded around the centre Ỹ_{i,n}, to obtain

    E[φ(Ỹ_{i,n} + X_i/√n)] = E[ φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(Ỹ_{i,n}) + (X_i³/6n^{3/2}) φ‴(y) ],        (8.2)

where y is between Ỹ_{i,n} and Ỹ_{i,n} + X_i/√n.
Observe now that Ỹ_{i,n} is a function of X's and Z's other than X_i; thus Ỹ_{i,n} and X_i and Z_i are all independent, and the expectation of any function of Ỹ_{i,n} multiplied by any function of X_i is the product of the expectations, yielding

    E[φ(Ỹ_{i,n} + X_i/√n)] = E[φ(Ỹ_{i,n})] + n^{−1/2} E[X_i] E[φ′(Ỹ_{i,n})] + (2n)^{−1} E[X_i²] E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[X_i³ φ‴(y)]
                          = E[φ(Ỹ_{i,n})] + (2n)^{−1} E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[X_i³ φ‴(y)],

since E[X_i] = 0 and E[X_i²] = 1. An identical expression may be written with Z_i in place of X_i, so

    E[φ(Ỹ_{i,n} + Z_i/√n)] = E[φ(Ỹ_{i,n})] + (2n)^{−1} E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[Z_i³ φ‴(ŷ)],

where ŷ is between Ỹ_{i,n} and Ỹ_{i,n} + Z_i/√n. Combining these expressions yields

    | E[φ(Ỹ_{i,n} + X_i/√n)] − E[φ(Ỹ_{i,n} + Z_i/√n)] |
        = (1/6) n^{−3/2} | E[X_i³ φ‴(y)] − E[Z_i³ φ‴(ŷ)] |        (8.3)
        ≤ (1/6) n^{−3/2} ( E[|X_i|³] K + E[|Z_i|³] K )
        ≤ (1/3) n^{−3/2} τ K.

By (8.1) it follows that

    | E[φ(Y_n)] − E[φ(Y*_n)] | ≤ Σ_{i=1}^n (1/3) n^{−3/2} τ K = (1/3) n^{−1/2} τ K,

which goes to 0 as n → ∞.
Now suppose that the Z_i have standard normal distribution. Then Y*_n also has standard normal distribution for every n, so

    lim_{n→∞} | E[φ(Y_n)] − E[φ(Z)] | = 0

where Z ∼ N(0, 1), for any function φ with bounded third derivative. Since all bounded continuous functions may be approximated by functions with bounded third derivative, it follows that Y_n →_d Z.
This completes the proof except for one small defect: We had to assume
that Xi has finite third moment τ . What do we do if E[|Xi |3 ] = ∞? This
requires a trick called “truncation”, that you will certainly encounter often
if you go on to more advanced probability theory, but is not strictly part of
this course.

8.4 Truncation
(Not examinable.) Look back at the crucial step (8.2). We've expanded the Taylor series to the third order, because the error term (X_i/√n)³ is going to be small enough to go to 0 in the limit when multiplied by n. This is not true, however, if X_i is too big. In that case, we would be better off stopping at the second-order term:

    E[φ(Ỹ_{i,n} + X_i/√n)] = E[ φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(y′) ].        (8.4)

We have here two different expressions for the same quantity, (8.2) being good when X_i is small (or not extremely large) and (8.4) being superior when X_i is extremely large. Pick some large number M to be the cutoff between when we will use the first expression and when we will use the second expression. These expressions are actually equal (depending on different choices of y and y′ to make them equal), so if I multiply one of them by 1_{X_i>M} and the other by 1_{X_i≤M} and add these two together, we get

again the same quantity:

    E[φ(Ỹ_{i,n} + X_i/√n)]
      = E[ φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(y′) 1_{X_i>M}
           + ( (X_i²/2n) φ″(Ỹ_{i,n}) + (X_i³/6n^{3/2}) φ‴(y) ) 1_{X_i≤M} ]
      = E[φ(Ỹ_{i,n})] + (2n)^{−1} E[φ″(Ỹ_{i,n})]        (8.5)
        + E[ ( φ″(y′) − φ″(Ỹ_{i,n}) ) (X_i²/2n) 1_{X_i>M} ]
        + E[ (X_i³/6n^{3/2}) φ‴(y) 1_{X_i≤M} ]

Then taking the difference from the expression we already had for φ(Ỹ_{i,n} + Z_i/√n) (which certainly does have finite moments of all orders if Z_i is Gaussian), we get

    | E[φ(Ỹ_{i,n} + X_i/√n)] − E[φ(Ỹ_{i,n} + Z_i/√n)] |
        = | (1/6) n^{−3/2} ( E[X_i³ 1_{X_i≤M} φ‴(y)] − E[Z_i³ φ‴(ŷ)] )
            + (2n)^{−1} E[ ( φ″(y′) − φ″(Ỹ_{i,n}) ) X_i² 1_{X_i>M} ] |        (8.6)
        ≤ (K/(6n^{3/2})) ( M³ + E[|Z_i|³] ) + (K/n) E[X_i² 1_{X_i>M}]
if we take K to be a bound on the second and third derivatives of φ.
Summing as in (8.1) we get

    | E[φ(Y_n)] − E[φ(Z)] | ≤ (K/(6n^{1/2})) ( M³ + E[|Z_i|³] ) + K E[X_i² 1_{X_i>M}].

The problem is, this does not obviously go to 0 as n → ∞. Rather,

    lim_{n→∞} | E[φ(Y_n)] − E[φ(Z)] | ≤ K E[X_i² 1_{X_i>M}],

which is not 0, in general.³ However, we still have the freedom to choose M as large as we like. Suppose we know that

    lim_{M→∞} E[X_i² 1_{X_i>M}] = 0.        (8.7)
3
Strictly speaking, we can’t even speak of bounding the “limit” here, since the limit
may not exist. To be formally correct we would replace the limit here by lim sup.

Then for any ε > 0 we can choose M so as to make

    lim_{n→∞} | E[φ(Y_n)] − E[φ(Z)] | ≤ Kε.

Since this is true for every ε, it must in fact be true that the limit is 0.
So is (8.7) true? The answer is yes. If we think of the sequence of random
variables WM := Xi2 1{Xi >M } , they clearly converge to 0 as M → ∞: No
matter what Xi is, there is some M bigger. Is it true that if a sequence
of random variables converge to a constant, that the expectation of these
random variables converges to that constant? The answer to this is clearly
no, in general, but yes in many circumstances. There is a whole collection
of theorems in measure theory that tell us when the expectation of a limit
of random variables is the limit of the expectations, but this is beyond the
scope of this course. The relevant theorem in this case is called the Monotone
Convergence Theorem. It says that the limit of the expectations is the
expectation of the limit if the sequence of random variables is monotone
(increasing or decreasing). Since WM clearly is decreasing — a larger M
either leaves W the same or drops it down to 0 — the result follows. (A
straightforward source if you’re interested in reading more about this — and
in general for the basics of measure-theoretic probability — is chapter 4 of
[Ros06]. A more extensive account of these ideas is in chapter 23 of [Bil95].)
Lecture 9

Markov chains: Introduction

Lecture 10

Stationary distributions for


Markov chains

Lecture 11

Convergence to stationarity
of Markov chains

Lecture 12

Recurrence, Transience, and


Hitting Times

12.1 Basic facts about recurrence and transience


Definition 12.1. A state x of a Markov chain (X_i) is recurrent if

    P{X_i = x for some i ≥ 1 | X_0 = x} = 1.

The state is transient if

    P{X_i = x for some i ≥ 1 | X_0 = x} < 1.

Definition 12.2. The first return time of a state x in a Markov chain x = X_0, X_1, . . . is T_x := min{i ≥ 1 : X_i = x}. The k-th return time of x, for k ≥ 2, is defined recursively as T_x^{(k)} := min{i > T_x^{(k−1)} : X_i = x}, where the minimum is taken to be ∞ if the set of return times is empty. The number of returns to x is a random variable (possibly ∞), namely lim_{t→∞} #{1 ≤ i ≤ t : X_i = x}; it is also max{k : T_x^{(k)} < ∞}.
The hitting times of x are defined the same way as the return times, except that X_0 is not necessarily assumed to be equal to x; if X_0 = x, then the first hitting time is 0. The first hitting time is also called the first passage time when X_0 ≠ x. We write T_{yx} for the first passage time from y to x, that is, given a start in y.
Note: There is no formal problem with allowing a positive random variable to take on the value ∞. We have E[T] = ∞ if P{T = ∞} > 0, and the probability generating function of T is naturally g_T(s) = E[s^T] = E[s^T 1_{T<∞}] for 0 ≤ s < 1, so that lim_{s↑1} g_T(s) = P{T < ∞}.


Proposition 12.3. The number of returns to state x (which is the same as the maximum k such that the k-th return time is finite) is geometrically distributed, with parameter p = 1 − P{X_i = x for some i ≥ 1 | X_0 = x}.

Proof. Let T_x^{(k)} be the k-th return time to x. By the strong Markov property, T_x^{(k+1)} − T_x^{(k)} has the same distribution as T_x^{(1)} (given that the chain starts in x), independent of T_x^{(k)}. But that means that, conditioned on T_x^{(k)} = t for any finite t,

    P{T_x^{(k+1)} < ∞ | T_x^{(k)} = t} = P{T_x^{(1)} < ∞} = 1 − p.

Proposition 12.4. The following are equivalent:

(i) x is recurrent;
(ii) P{X_i = x for infinitely many i | X_0 = x} = 1;
(iii) E[# returns to x | X_0 = x] = ∞;
(iv) Σ_{n=1}^∞ P^n(x, x) = ∞.

Proof. Writing ξ_n := 1_{X_n = x} and N_x^∞ for the total number of returns to x, we have (given X_0 = x)

    E[# returns to x] = E[N_x^∞] = lim_{t→∞} E[ Σ_{n=1}^t ξ_n ] = Σ_{n=1}^∞ P^n(x, x).

Proposition 12.5. Recurrence and transience are properties of communicating classes. That is, if x ↔ y then x is recurrent if and only if y is recurrent.

Proof. Suppose x ↔ y and x is recurrent. Let ℓ and m be positive integers such that P^ℓ(x, y) > 0 and P^m(y, x) > 0. Let p = min{P^ℓ(x, y), P^m(y, x)}. By Chapman–Kolmogorov,

    P^{i+m+ℓ}(y, y) = Σ_{z∈X} Σ_{z′∈X} P^m(y, z) P^i(z, z′) P^ℓ(z′, y) ≥ P^m(y, x) P^i(x, x) P^ℓ(x, y) ≥ p² P^i(x, x).

Thus

    Σ_{n=1}^∞ P^n(y, y) ≥ Σ_{i=1}^∞ P^{i+m+ℓ}(y, y) ≥ p² Σ_{i=1}^∞ P^i(x, x) = ∞,

so y is also recurrent, by part (iv) of Proposition 12.4.

12.2 Probability generating functions for passage


times
For x, y ∈ X, let

g_xy(s) := E[s^{Ty} | X0 = x].

By the law of total expectation,

g_xy(s) = Σ_{z∈X} E[s^{Ty} | X1 = z] P{X1 = z | X0 = x}
        = E[s^{Ty} | X1 = y] P(x, y) + Σ_{z≠y} E[s^{Ty} | X1 = z] P(x, z)
        = s ( P(x, y) + Σ_{z≠y} P(x, z) g_zy(s) ).                        (12.1)

In principle, we can use this relation to compute the pgf’s, but this is
generally impractical, except when symmetry makes the recurrence partic-
ularly simple.

Example 12.1: pgf for the random walk hitting time

Let Xn be the random walk on Z, taking steps up with probabil-
ity p and down with probability 1 − p. Let g_xy(s) be the pgf of the
first-passage time from x to y, and let g := g_01. The strong Markov
property together with the one-dimensional structure of the space means
that the random variables T_x − T_{x−1} (setting T0 = 0), the first-
passage times from x − 1 to x, are i.i.d. Thus g_xy(s) = g(s)^{y−x}
for y > x, and by (12.1)

g(s) = s ( p + (1 − p) g_{−1,1}(s) ) = s ( p + (1 − p) g(s)² ).

We can solve this quadratic explicitly, choosing the root for which
lim_{s↓0} g(s) = 0, to get

g(s) = ( 1 − sqrt(1 − 4p(1 − p)s²) ) / ( 2(1 − p)s ).            (12.2)

The first-passage time from 1 back to 0 comes from exchanging p and
1 − p:

g_10(s) = ( 1 − sqrt(1 − 4p(1 − p)s²) ) / ( 2ps ).               (12.3)
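As a quick numerical sanity check (an addition, not part of the original notes), formula (12.2) can be compared with a Monte Carlo estimate of E[s^T] for the first-passage time T from 0 to 1. The following Python sketch does this; the values of p, s, the number of trials and the cut-off cap are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
p, s, trials, cap = 0.6, 0.9, 20_000, 10_000   # illustrative choices

def passage_time_to_1(p, cap):
    # Run a p-biased walk from 0 until it first hits +1, or give up after cap steps.
    pos, t = 0, 0
    while pos < 1 and t < cap:
        pos += 1 if rng.random() < p else -1
        t += 1
    return t if pos == 1 else np.inf

T = np.array([passage_time_to_1(p, cap) for _ in range(trials)])
empirical = np.mean(np.where(np.isfinite(T), s ** T, 0.0))   # estimate of E[s^T 1{T < infinity}]
formula = (1 - np.sqrt(1 - 4 * p * (1 - p) * s ** 2)) / (2 * (1 - p) * s)
print(empirical, formula)   # the two values should agree to about two decimal places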
2ps


However, we can use the pgf in the abstract to derive important prop-
erties related to recurrence and transience. Suppose the chain is irreducible
and aperiodic. We note first that the probability of ever reaching y starting
from x is lims↑1 gxy (s), and the chain is recurrent if and only if lims↑1 gxy (s) =
1 for all x and y. If there are only finitely many y that may be reached in
one step from any particular x, then we may substitute into (12.1) to get
g_xy(1) = Σ_{z∈X} P(x, z) g_zy(1).

Thus, the vector (P{Xn eventually reaches y | X0 = x})_{x∈X} is a column eigen-
vector of P with eigenvalue 1, bounded by 1. The constant vector 1 is, as
always, one solution. If the solution is unique, then it can only be the
constant 1, and the chain is recurrent.
We won’t give a proof here, but when there is more than one bounded
eigenvector with eigenvalue 1, the “correct” solution is the minimal one.
That is, we normalise the vectors to have the x coordinate equal to 1 (since
the probability of eventually hitting x starting from x is 1), and take the
smaller vector. In particular, the chain is transient if and only if P has a
non-constant but bounded column eigenvector.

12.3 Demonstrating recurrence and transience: The


example of the random walk on Z
12.3.1 The simple random walk
We consider first the case p = 12 .

Direct computation
If we want to show that a Markov chain is recurrent (or not), the most
straightforward thing to do is to compute the distribution of the first re-
turn time. If limt→∞ P{Txx > t} = 0, then Tx is finite with probability
1. By symmetry, for the simple (unbiased) random walk T00 has the same
distribution as T_{−1,0} + 1. For any k ≥ 0, we know that

P{X_{2k} = 0 | X0 = −1} = 0,    P{X_{2k+1} = 0 | X0 = −1} = 2^{−2k−1} (2k+1 choose k),

since a visit to 0 at time 2k + 1 occurs precisely when the 2k + 1 steps include exactly
k + 1 upward steps and k downward steps. Obviously this is an upper bound

for the probability of {T_{−1,0} = 2k + 1}, since the process could be back at 0
for the second, third, etc. time. We also have

P{X_{2k} ≥ 0 | X0 = −1} = P{X_{2k} > 0 | X0 = −1}
                        = 2^{−2k} Σ_{j=k+1}^{2k} (2k choose j)                     (12.4)
                        = 1/2 − (1/2) P{X_{2k} = −1 | X0 = −1}
                        = 1/2 − (1/2) · 2^{−2k} (2k choose k),

using the symmetry of the binomial distribution. (See Figure 12.1.)


[Figure 12.1: Computing P{X_{2k} > 0}. The bars show the distribution of X_{2k}
given X0 = −1; the bars above 0, together with half of the bar at −1, account for
probability 1/2.]

Consider a random walk path started from −1. After the walk has hit
0, at any future even time it is either > 0 or < 0, so trivially
 
P{T0 ≤ 2k | X0 = −1} = P{T0 ≤ 2k − 1 and X_{2k} > 0 | X0 = −1}
                     + P{T0 ≤ 2k and X_{2k} < 0 | X0 = −1}.

Not so trivially, by the strong Markov property the process started from T0
is just like a simple random walk started from 0, so it has equal chances of
ending up above or below 0 at any later time, and

P{T0 ≤ 2k | X0 = −1} = 2 P{T0 ≤ 2k − 1 and X_{2k} > 0 | X0 = −1}.

But if X_{2k} > 0, the process must have passed through 0 on the way, so
T0 ≤ 2k − 1 automatically. Hence

P{T0 ≤ 2k | X0 = −1} = 2 P{X_{2k} > 0 | X0 = −1} = 1 − 2^{−2k} (2k choose k).

Thus

P{T0 > 2k | X0 = −1} = 2^{−2k} (2k choose k) → 0 as k → ∞,
which means that T0 is finite with probability 1, and the random walk is
recurrent.

Using the pgf


We computed in (12.2) and (12.3) the pgfs of the first-passage times, which for
the case p = 1/2 are both equal to

(1/s)(1 − sqrt(1 − s²)).

This converges to 1 as s ↑ 1, so the first-passage time to 0 — and hence the first
return time T0, which is one step plus such a passage time — is finite with
probability 1, proving that the walk is recurrent.

Using the return probabilities


As already noted, it is much simpler to compute the probability of Xn = 0
for each n. Combined with Proposition 12.4 this allows us to show that
the chain is recurrent. We know that the chain is recurrent if and only if
Σ_{n=1}^∞ P^n(0, 0) = ∞. This reduces to showing

Σ_{k=1}^∞ P^{2k}(0, 0) = Σ_{k=1}^∞ 2^{−2k} (2k choose k) = ∞.

By Stirling's approximation (see almost any elementary probability text for
this; for instance, section 3.10 of [GS01]), we know that

k! ∼ sqrt(2πk) (k/e)^k,

so

(2k choose k) = (2k)!/(k!)² ∼ sqrt(2π·2k)(2k/e)^{2k} / ( 2πk (k/e)^{2k} ) = 2^{2k}/sqrt(πk).

Thus 2^{−2k} (2k choose k) ∼ (πk)^{−1/2}. Since Σ_k k^{−1/2} = ∞, it follows that the chain is
recurrent.
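To get a feel for how slowly this sum diverges, here is a small Python illustration (an addition, not from the notes). It builds the return probabilities from the recursion a_{k+1} = a_k (2k+1)/(2k+2), with a_1 = 1/2, and compares the partial sums with the Stirling approximation; the cut-off N is an arbitrary illustrative choice.

import numpy as np

N = 20_000
a = np.zeros(N + 1)                  # a[k] will hold P^{2k}(0,0) = 2^{-2k} (2k choose k)
a[1] = 0.5
for k in range(1, N):
    a[k + 1] = a[k] * (2 * k + 1) / (2 * k + 2)
partial = np.cumsum(a[1:])
stirling = np.cumsum(1.0 / np.sqrt(np.pi * np.arange(1, N + 1)))
for n in (10, 100, 1000, 10_000):
    print(n, partial[n - 1], stirling[n - 1])
# Both partial sums grow like 2*sqrt(n/pi): unbounded (hence recurrence), but very slowly.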

12.3.2 General p
It is easy to see that a biased random walk on Z is transient. For instance, we
know from the strong law of large numbers that lim_{n→∞} n^{−1}Xn = E[X1] =
2p − 1. Thus, for p > 1/2 it must be that Xn → +∞ as n → ∞, and in particular
it cannot return to 0 infinitely often; similarly if p < 1/2 we have Xn → −∞.
We could also calculate for p < 1/2 that the moment generating function
of Xn is

Mn(θ) = ( pe^θ + (1 − p)e^{−θ} )^n = e^{−nλ},

where λ = −ln( pe^θ + (1 − p)e^{−θ} ), which is strictly positive for suitably small θ > 0.
Thus, by Markov's inequality applied to e^{θXn},

P^n(0, 0) ≤ P{Xn ≥ 0} ≤ e^{−nλ},

which has finite sum. Reversing the signs does the same job for p > 1/2.
What about the random walk on the nonnegative integers which is re-
flected at 0? (That is, P (0, 1) = 1 and P (x, x + 1) = p = 1 − P (x, x − 1) for
x ≥ 1.) As we would expect, this is transient for p > 1/2, and recurrent for
p ≤ 1/2. This follows immediately from the pgf calculation of (12.3). Since the
first return time to 0 is one step plus a first-passage time from 1 to 0, its pgf is

g_00(s) = s · g_10(s) = (1/(2p)) ( 1 − sqrt(1 − 4p(1 − p)s²) ),

so

P{T0 < ∞ | X0 = 0} = g_00(1) = min{p, 1 − p}/p,

which is 1 precisely when p ≤ 1/2.
Lecture 13

Expected hitting times

13.1 Null and positive recurrence


Does a Markov chain on an infinite state space converge to a stationary
distribution? Obviously not if the chain is transient, since the probability
of being in any fixed finite subset of the state space converges to 0.
What about a recurrent Markov chain? Let’s look at our favourite ex-
ample, the random walk on {0, 1, 2, . . . } taking steps up with probability p
and down with probability 1 − p. We can add a holding probability at 0
to make it aperiodic, so P (0, 1) = 1 − P (0, 0) = p. First of all, is there a
stationary distribution at all? A stationary distribution π must satisfy

π_{i+1} = (p/(1 − p)) πi    for i ≥ 0.

If p > 1/2 then the πi are increasing, so they couldn't have a finite sum.
Therefore there is no (finite) stationary distribution.
If p = 1/2 then the πi are all equal, so again the sum is infinite, and
there is no stationary distribution.
If p < 1/2 then the distribution πi = (p/(1 − p))^i (1 − 2p)/(1 − p) is
stationary.
The case p = 1/2 seems peculiar here: It is recurrent, but has no sta-
tionary distribution. This case has another peculiarity: The return time to
0 is finite with probability 1, but its expectation is

E[T0] = 1 + p E[T_{1,0}] = 1 + p/(1 − 2p) = (1 − p)/(1 − 2p),

which is infinite when p = 1/2. (Here E[T_{1,0}] = lim_{s↑1} g_10'(s) = 1/(1 − 2p) is the
expected first-passage time from 1 to 0, read off from the pgf (12.3).)
This is no coincidence.


Theorem 13.1. An irreducible Markov chain has a stationary distribution
if and only if the first return time to any state has finite expectation. If the
stationary distribution π exists, then it is unique, and for any i, j ∈ X,

lim_{n→∞} n^{−1} Σ_{k=1}^n P{Xk = j | X0 = i} = πj.

If the chain is aperiodic then for any i, j ∈ X,

lim_{n→∞} P{Xn = j | X0 = i} = πj.

Proof. (Proof not examinable.) The basic idea is that we can use the results
we already have for finite chains. The condition that the first return time
has finite expectation implies that the chain spends “almost all” of its time
in a finite subset of the state space.
We provide here an intuitive proof, based on the Strong Law of Large
Numbers. This has been mentioned but not proved in this course. You may
seek out either a proof of the Strong Law or a proof of the present result
which does not depend on it, both of which are in chapter 6 of [GS01].
Suppose µx := E[Txx] < ∞. Start the chain in state y. Then the k-th
arrival time in x may be expressed as T_yx^(k) = T_yx + τ1 + · · · + τ_{k−1}, where the
τ's are i.i.d. with the same distribution as Txx. Let Nx(n) be the number of
times the process visits state x up to time n; so Nx(n) = max{k : T_yx^(k) ≤ n}
when the process starts in y. For T_yx^(k+1) > n ≥ T_yx^(k) we have Nx(n) = k, so

k / T_yx^(k+1) ≤ Nx(n)/n ≤ k / T_yx^(k).

By the Strong Law of Large Numbers the upper and lower bounds both
converge to 1/µx as k → ∞. So Nx(n)/n → 1/µx as n → ∞. By Lemma 13.3 it
follows that (1/µx)_{x∈X} is the stationary distribution. Furthermore, if Nx(n)/n → 1/µx
with probability 1, then (since Nx(n)/n ≤ 1) it must certainly be true that

E[Nx(n)/n] = E[Nx(n)]/n → 1/µx as n → ∞.

The proof that the probabilities actually converge in the aperiodic case
is slightly tricky although it seems intuitively obvious, and could be skipped
on a first reading. It’s worth thinking about though, because it provides a
good paradigm of how we bootstrap up from results on finite state spaces
to results on infinite state spaces.

For any ε ∈ (0, 1/2) we may find a finite subset X_ε ⊂ X such that P_ε :=
Σ_{i∈X_ε} πi ≥ 1 − ε. Define a new chain X̃n which lists only the states of Xn
that are in X_ε. Since this is an ergodic chain on a finite state space, we have

lim_{n→∞} P{X̃n = j | X̃0 = i} = πj / P_ε,

which is the stationary distribution of the restricted chain. It follows (with
a little bit of technical work) that the conditional distribution of Xn, con-
ditioned on being in the finite set X_ε, converges to πj / P_ε. Thus for n
sufficiently large — that is, n > N for some N —

(πj − ε) P{Xn ∈ X_ε} ≤ P{Xn = j | X0 = i} ≤ πj/(1 − ε) + ε/|X_ε|        (13.1)

for j ∈ X_ε, where |X_ε| is the number of elements in X_ε.


Then for any x ∉ X_ε and any K > 0 and n > N,

P{X_{n+K} ∉ X_ε | X0 = i} = Σ_{x∉X_ε} P{X_{n+K} = x | X0 = i}
  ≤ Σ_{x∉X_ε} Σ_{j∈X_ε} P{Xn = j | X0 = i} P^K(j, x) + Σ_{x∉X_ε} Σ_{j∉X_ε} P{Xn = j | X0 = i} P^K(j, x)
  ≤ Σ_{x∉X_ε} Σ_{j∈X_ε} ( πj/(1 − ε) + ε/|X_ε| ) P^K(j, x) + Σ_{j∉X_ε} P{Xn = j | X0 = i} · 1
  ≤ Σ_{x∉X_ε} πx/(1 − ε) + ε + P{Xn ∉ X_ε | X0 = i}
  ≤ ε/(1 − ε) + ε + P{Xn ∉ X_ε | X0 = i},

using the fact that π is a stationary distribution and Σ_{x∉X_ε} πx ≤ ε.
Think now about the probabilities pk := P{Xk ∉ X_ε}. We know their
long-term average is at most ε, and no two of them (for k > N) can differ by more
than 2ε + ε/(1 − ε) ≤ 4ε. It follows that

P{Xn ∈ X_ε} = 1 − pn ≥ 1 − 5ε.


Returning to our bounds (13.1), we have for n > N ,


 πj 
(πj − )(1 − 5) ≤ P Xn = j X0 = i ≤ +  .
1 −  |X |

Since  may be made as small as we like, this completes the proof.



Lemma 13.2. Suppose for each i, j ∈ X we have

lim_{n→∞} P{Xn = j | X0 = i} = πj.

Then (πj) is the unique stationary distribution for the chain.

Proof. We have for any states i, ℓ ∈ X,

Σ_{j∈X} πj P(j, ℓ) = Σ_{j∈X} lim_{n→∞} P{Xn = j | X0 = i} P(j, ℓ)
                  = lim_{n→∞} Σ_{j∈X} P{Xn = j | X0 = i} P(j, ℓ)
                  = lim_{n→∞} P{X_{n+1} = ℓ | X0 = i}
                  = π_ℓ.

(Note that the exchange of limit and summation isn't any problem if P(j, ℓ)
is nonzero for only finitely many j.)

Lemma 13.3. Suppose we have µj ∈ [0, 1] such that for each i, j ∈ X we
have, conditioned on X0 = i,

lim_{n→∞} n^{−1} (# visits to j by time n) = µj    with probability 1.

Then πj = µj is the unique stationary distribution for the chain.

Proof. Fix states i, ℓ ∈ X, and define

Nj(n) = #{1 ≤ t ≤ n : Xt = j},
N_{j,ℓ}(n) = #{1 ≤ t ≤ n : X_{t−1} = j, Xt = ℓ}.

Then Nj(n) is the number of visits the chain makes to state j by time
n, so by our assumption n^{−1} Nj(n) → µj = πj as n → ∞.
Now, consider that every time the chain is in state j it has probability

P(j, ℓ) of transitioning to state ℓ, independent of every other time. Thus,

Σ_{j∈X} πj P(j, ℓ) = Σ_{j∈X} lim_{n→∞} n^{−1} Nj(n) P(j, ℓ)
                  = Σ_{j∈X} lim_{n→∞} n^{−1} Nj(n) · lim_{n→∞} N_{j,ℓ}(n)/Nj(n)
                  = Σ_{j∈X} lim_{n→∞} n^{−1} N_{j,ℓ}(n)
                  = lim_{n→∞} n^{−1} Σ_{j∈X} N_{j,ℓ}(n)
                  = lim_{n→∞} n^{−1} N_ℓ(n)
                  = π_ℓ.

Those paying attention to the technical details will notice that we have
been a bit sloppy here in exchanging the order of limits from the second to
the third line. What we need is to show that for any ε > 0 we can find n
large enough so that

Σ_{j∈X} Nj(n) | N_{j,ℓ}(n)/Nj(n) − P_{j,ℓ} | < εn.

Conditioned on Nj(n) = ν, the random variables N_{j,ℓ}(n) − P_{j,ℓ} ν are inde-
pendent. We then use the exponential bounds from the moment generating
function to show that this sum is eventually less than εn, conditioned on
Nj(n) large enough.

Definition 13.4. A recurrent irreducible Markov chain is said to be posi-
tive recurrent if one of the following equivalent conditions holds:

• There is a stationary probability distribution;

• The return time to some state has finite expectation;

• The return time to every state has finite expectation.

Example 13.1: Queueing model



Suppose we have a shop with a single cashier, and assume for sim-
plicity that a customer takes exactly 1 unit of time to be served.
Suppose there are ξn arrivals in the n-th unit of time, where the
ξn are i.i.d., with P{ξn = k} = ρk. Let µ = E[ξn] = Σ_k kρk.
Let Xn be the number of customers in the queue at time n. Then
(Xn) is a Markov chain with state space {0, 1, 2, . . . }, with
X_{n+1} = ξ_{n+1} if Xn = 0, and X_{n+1} = Xn + ξ_{n+1} − 1 if Xn ≥ 1.
That means

P(x, y) = ρ_y          if x = 0,
          ρ_{y−x+1}    if x ≥ 1 and y ≥ x − 1,
          0            otherwise.

Let gρ(s) be the probability generating function of the ξi.
When is this queueing chain recurrent? We observe first that
Xn ≥ Σ_{i=1}^n (ξi − 1). As in the case of the biased random walk, if
µ > 1 then lim_{n→∞} Xn/n ≥ µ − 1 > 0, so X must be transient.
Suppose now that µ ≤ 1, and let f(x) := P{T0 < ∞ | X0 = x},
the probability that the process reaches 0 eventually, starting
from x. Recurrence is equivalent to f(x) = 1 for all x. Note
that, since the process moves down only in steps of size 1, we
have T0 = T_{x−1} + (T_{x−2} − T_{x−1}) + · · · + (T0 − T1). By the
symmetry of the process (and the strong Markov property) the
differences are independent and have the same distribution as
T_{1,0} (the first-passage time from 1 to 0).
Let f(i) = P{T_{x−i} < ∞}, the probability that the queue started
from x reaches x − i in finite time. Then f(i) = f(1)^i for 0 ≤
i ≤ x, and we have the recursion

f(1) = Σ_{i=0}^∞ ρi f(1)^i = gρ( f(1) ).

That is, f (1) must be a fixed point of gρ . As in our discussion of


the Galton-Watson branching process (Example 5.4), we observe
that g is a convex increasing smooth function with g(0) > 0 and
g(1) = 1. It can have at most one other fixed point on (0, 1). If
there were another fixed point x < 1, then by the Mean Value
Theorem there would be a point x0 ∈ (x, 1) such that gρ'(x0) = 1;
since the derivative of gρ is increasing, it follows that gρ'(1) > 1.
But gρ'(1) = µ, which has been assumed ≤ 1. Thus the only
fixed point is 1, and so f(1) = 1. Hence f(i) = 1 for all i, and
the process is recurrent.
We consider the question of positive or null recurrence in section
13.2. 
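A simulation sketch of this queue (an addition, not part of the original notes): it takes the arrival distribution (ρk) to be Poisson with mean µ purely for illustration, and shows the qualitative difference between µ < 1, µ = 1 and µ > 1.

import numpy as np

rng = np.random.default_rng(0)

def simulate_queue(mu, steps):
    # Queue length X_n: one customer served per time slot, Poisson(mu) arrivals per slot.
    x, visits_to_0 = 0, 0
    for _ in range(steps):
        arrivals = rng.poisson(mu)
        x = arrivals if x == 0 else x + arrivals - 1
        visits_to_0 += (x == 0)
    return x, visits_to_0

for mu in (0.8, 1.0, 1.2):
    final, zeros = simulate_queue(mu, 100_000)
    print(f"mu={mu}: final queue length {final}, visits to 0: {zeros}")
# mu=0.8: frequent returns to 0 (positive recurrent); mu=1.0: returns, but rarely
# (null recurrent); mu=1.2: the queue length grows roughly linearly (transient).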

Note that we have proved a result that is in many respects stronger than
the convergence of the Markov chain in distribution — the convergence of
the average occupation time to the stationary distribution — but in fact
it is not simply stronger. That is, convergence of the occupation times
Nx(n)/n → πx as n → ∞ does not necessarily imply that P{Xn = x} → πx.
After all, we have not even assumed that the chain is aperiodic.
In principle, our linear algebra proof still works: The largest eigenvalue
is still ≤ 1, and 1 being an eigenvalue is equivalent to positive recurrence.
But we can also argue as follows: Consider the Markov chain observed only
when it is in the finite set Y ⊂ X, with ε = Σ_{y∈X\Y} πy. Call it X^Y_n. We know
that X^Y_n converges in distribution to a stationary distribution, which must
be πx/(1 − ε). Let Y(n) = #{m ≤ n : Xm ∉ Y}. Then Xn = X^Y_{n−Y(n)} when
Xn ∈ Y. We see then that, conditioned on being in Y, the distribution of Xn
converges to π (conditioned on being in Y). Taking Y ever larger then would
complete the proof. (There are a few subtleties that we would have to take
care of to make this into a rigorous proof. In particular, conditioning
on spending a fixed amount of time outside of Y up to time n can change
the distribution of Xn.)

Theorem 13.5. If (Xn) is any aperiodic positive recurrent Markov chain, then it has
a unique stationary distribution π, and for any i, j ∈ X,

lim_{n→∞} P{Xn = j | X0 = i} = πj.

13.2 Fun with eigenvectors and matrix inverses


We have already shown in section 13.1 that transience/null recurrence/positive
recurrence may be determined by reference to the row-eigenvector of P with
eigenvalue 1. We know that

• If the chain is transient, there is no row-eigenvector with eigenvalue 1
that is a probability distribution (nonnegative with finite sum).

• If the chain is null recurrent, there is a unique (up to constants) row-
eigenvector πx with eigenvalue 1, and Σ πx = ∞ (so it is not a distri-
bution).

• If the chain is positive recurrent, there is a unique (up to constants)
row-eigenvector πx with eigenvalue 1, and Σ πx < ∞ (so it can be normalised
to give the stationary distribution).

13.2.1 Alternative representation of the stationary distribu-


tion
We have already shown that the stationary distribution of a positive recur-
rent chain may be represented as πx = 1/µx , where µx is the expected return
time to x. We show now that for any x, y ∈ X,

πx = E[# visits to x up to time Ty | X0 = y] / E[Ty | X0 = y].

Let ξyx(k) be the indicator of the event {Xk = x and Ty ≥ k}, and
ηyx = Σ_{k=1}^∞ ξyx(k). Then for any z ∈ X

E[ξyx(k)] P(x, z) = P{Xk = x and X_{k+1} = z and Ty ≥ k + 1},

so

E[ηyx] P(x, z) = Σ_{k=1}^∞ E[ξyx(k)] P(x, z) = E[#{1 ≤ k < Ty : Xk = x and X_{k+1} = z}].

Summing over x ∈ X we get

Σ_{x∈X} E[ηyx] P(x, z) = E[ηyz].

Thus (E[ηyx])_{x∈X} is a row-eigenvector with eigenvalue 1. By uniqueness, it
must be a multiple of (πx). Taking x = y, we see that the multiplier must
be 1/E[Ty | X0 = y].
be 1/E[Ty X0 = y].

13.2.2 Hitting probabilities


For any fixed y, let f(x) = P{Xn = y for some n ≥ 0 | X0 = x}. Then
(f(x))_{x∈X} is a column eigenvector with eigenvalue 1, satisfying f(y) = 1.
The constant vector 1 is always an eigenvector. We have already seen that
the (irreducible) chain determined by P is transient if and only if there is a
positive eigenvector f(x) ∈ [0, 1], satisfying f(x) = Σ_{y∈X} P(x, y) f(y), which
is not constant.

13.2.3 Absorption probabilities


In this section we follow somewhat the approach of chapter 6 of [TK98].

Let Y ⊂ X be a finite set of absorbing states. For any fixed y ∈ Y, the


probability of eventual absorption in y is a hitting probability, just like any
other described in section 13.2.2, hence is an eigenvector with eigenvalue 1.
There will be one dimension of the eigenspace for each absorbing state, and
the probabilities corresponding to absorption in the state y will correspond
to solving for f (y) = 1.

Example 13.2:

Let X1 , X2 , . . . be a Markov chain on {1, 2, 3, 4}, started in state


1, and with transition matrix

P = [ 0.2  0.4  0.2  0.2
      0.5  0.3  0    0.2
      0    0    1    0
      0    0    0    1  ].

So states 3 and 4 are both absorbing. What is the probability


that the process is eventually absorbed in 3?
The eigenvectors will satisfy the equations

v1 = 0.2v1 + 0.4v2 + 0.2v3 + 0.2v4


v2 = 0.5v1 + 0.3v2 + 0.2v4 .

We find the solutions are spanned by the vectors

( 7/18, 5/18, 1, 0 )^T    and    ( 11/18, 13/18, 0, 1 )^T.

The first vector represents the probabilities of absorption in 3,


while the second represents the probabilities of absorption in 4.
(Since every run is eventually absorbed, the vectors add up to
1.) Starting from 1, then, the probability of ending up in state
3 is 7/18, whereas if we started from 2 the probability would be
only 5/18. 

Example 13.3: Gambler’s ruin

Suppose a gambler with an initial fortune of £K plays a succes-


sion of independent games in which there is a probability p of
winning £1 and probability 1 − p of losing £1. He stops when
he has either lost his entire fortune or has increased his fortune
to M . Compute the probability that he wins.
Solution: The transition matrix is

P = [ 1     0     0    0   ···  0   0
      1−p   0     p    0   ···  0   0
      0     1−p   0    p   ···  0   0
      ⋮     ⋮     ⋮    ⋮    ⋱   ⋮   ⋮
      0     0     0    0   ···  0   p
      0     0     0    0   ···  0   1  ].

The eigenvector satisfies (for 1 ≤ i ≤ M − 1)

vi = (1 − p)vi−1 + pvi+1 .

These have solutions

vi = ((1 − p)/p)^i    and    vi = 1.

We need to solve for the solution which has v0 = 0 and vM = 1.
For p ≠ 1/2 this will be satisfied by

P{absorption at M} = ( ((1 − p)/p)^i − 1 ) / ( ((1 − p)/p)^M − 1 ).

Note that when M → ∞, this expression goes to 0 if p < 1/2;
this represents the recurrence of the chain, since it says that it
always hits 0 before hitting ∞. On the other hand, if p > 1/2 the
same argument says that there is positive probability

1 − ((1 − p)/p)^K

that the process reaches ∞ before reaching 0; in other words,
the chain is transient.
If p = 1/2 then vi = (1/2)v_{i−1} + (1/2)v_{i+1}. In addition to the constant
solution, this has the solution vi = i/M. Thus

P{absorption at M} = i/M.
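As a check (an addition, not in the original notes), the hitting equations can be solved numerically and compared with the closed-form expression above; the values of p, M and K below are illustrative assumptions.

import numpy as np

p, M, K = 0.45, 10, 4      # illustrative values

# Solve v_i = (1-p) v_{i-1} + p v_{i+1} for i = 1,...,M-1, with v_0 = 0 and v_M = 1.
A = np.zeros((M - 1, M - 1))
b = np.zeros(M - 1)
for i in range(1, M):
    A[i - 1, i - 1] = 1.0
    if i - 1 >= 1:
        A[i - 1, i - 2] = -(1 - p)
    if i + 1 <= M - 1:
        A[i - 1, i] = -p
    else:
        b[i - 1] = p           # boundary term from v_M = 1
v = np.linalg.solve(A, b)

r = (1 - p) / p
closed_form = (r ** K - 1) / (r ** M - 1)
print(v[K - 1], closed_form)   # the two values agree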

Problems about solving linear equations can be expressed in terms of
inverting a matrix. But what matrix should we invert? The matrix P is
clearly not invertible.
Suppose we order the states so that the absorbing states are at the end.
Then the transition matrix can be written in block form

P = [ Q   R
      0   I  ].

That is, the rows corresponding to the absorbing states have transition prob-
abilities 0 to any other states, and 1 to themselves, which means that the
lower-right square is an identity matrix. Squaring this matrix yields

P² = [ Q²   QR + R
       0    I       ].

Iterating this gives us

P^n = [ Q^n   Q^{n−1}R + Q^{n−2}R + · · · + QR + R
        0     I                                    ].

Note that the final columns represent the probability of the chain being in
the corresponding absorbing states. These probabilities simply accumulate,
as Q^n → 0 when n → ∞. Now we use the identity (essentially just the matrix version
of the sum of a geometric series)

(I − Q)( Q^{n−1}R + Q^{n−2}R + · · · + QR + R ) = (I − Q^n) R.


Since Q^n → 0 as n → ∞, we have

Proposition 13.6. Given a Markov chain with m transient
states and k absorbing states, let Q be the m × m matrix of transitions
among the transient states, and R the m × k matrix of transitions from
transient to absorbing states. Then the probability of the chain ending up
absorbed into the j-th absorbing state is the j-th component of

q^T (I − Q)^{−1} R,

where q^T is the m-component row vector giving the initial distribution on the
transient states.

Of course, the sequence of operations we would perform to invert the


matrix I − Q are exactly the operations we would perform to solve the
linear system of equations. On the other hand, we have very good standard
algorithms for inverting matrices. (In R we use the command solve, which
is not what you would immediately think of. . . ) This formula is useful for
theoretical purposes, and for numerical examples — particularly if you have
a large number of states — but not for a class of models with unspecified
parameters, like Example 13.3.

Example 13.4: Example 13.2 continued

Let X1 , X2 , . . . be a Markov chain on {1, 2, 3, 4}, started in state


1, and with transition matrix

P = [ 0.2  0.4  0.2  0.2
      0.5  0.3  0    0.2
      0    0    1    0
      0    0    0    1  ].

So states 3 and 4 are both absorbing. What is the probability
that the process is eventually absorbed in 3?
We have

Q = [ 0.2  0.4        R = [ 0.2  0.2
      0.5  0.3 ],           0    0.2 ].

So

I − Q = [ 0.8   −0.4
          −0.5   0.7 ],

and

(I − Q)^{−1} = (1/18) [ 35  20
                        25  40 ].

Multiplying this by R, we get

(I − Q)^{−1} R = (1/18) [ 7   11
                          5   13 ].

The first column represents the probabilities of absorption in 3,


while the second represents the probabilities of absorption in 4.
Starting from 1, then, the probability of ending up in state 3 is
7/18, whereas if we started from 2 the probability would be only
5/18. 
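The same computation takes only a few lines in Python with numpy (a sketch added here, not part of the notes); numpy.linalg.solve plays the role of the R command solve mentioned above.

import numpy as np

Q = np.array([[0.2, 0.4],
              [0.5, 0.3]])
R = np.array([[0.2, 0.2],
              [0.0, 0.2]])
# Rows = starting state (1 or 2), columns = absorbing state (3 or 4).
# Solving (I - Q) B = R is the same as forming (I - Q)^{-1} R.
B = np.linalg.solve(np.eye(2) - Q, R)
print(B)    # [[0.3889, 0.6111], [0.2778, 0.7222]], i.e. 7/18, 11/18 and 5/18, 13/18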

13.2.4 Expected hitting times and total occupation


For any fixed y ∈ X, let µxy be the expected time to hit y, starting from
x. We take µyy = 0. Either by a direct conditioning argument or by
differentiating the pgf formula (12.1), we see that
µ_xy = 1 + Σ_{z∈X} P(x, z) µ_zy    for x ≠ y.

Write 1 for the column vector of all 1's. Then µ := µ_{·y} satisfies the system
of linear equations

(I − P)µ = 1    in the coordinates x ≠ y.
Again, we can turn this into a matrix-inversion problem. We have

µ = (I − Q)^{−1} 1,                                    (13.2)

where Q is the square matrix that results from eliminating the row and
column corresponding to the state y. Where does this formula come from?
We could show this by linear algebra, but we can also think of it as follows:
Suppose we modify the chain to make y an absorbing state. Then the
matrix Q represents the transitions among all the other (transient) states,
and (Qn )ij is the probability that a chain started in state i is at state j at
time n, without ever having visited y (since it would have been absorbed
then). If we write ξj (n) := 1{Xn =j} , then when the chain is started in state
i,
E[ξj (n)] = (Qn )ij .

The total number of visits to state j before absorption is Σ_{n=0}^∞ ξj(n), and
the total time until absorption is

T_iy = Σ_{j≠y} Σ_{n=0}^∞ ξj(n).

Putting these together, we have

µi = E[T_iy] = E[ Σ_{j≠y} Σ_{n=0}^∞ ξj(n) ]
             = Σ_{j≠y} Σ_{n=0}^∞ E[ξj(n)]
             = Σ_{j≠y} ( Σ_{n=0}^∞ Q^n )_{ij}
             = Σ_{j≠y} ( (I − Q)^{−1} )_{ij}
             = ( (I − Q)^{−1} 1 )_i.


Example 13.5: Return times for random walk on a ring

Consider the random walk on a ring of 5 points, so

P = [ 0    1/2  0    0    1/2
      1/2  0    1/2  0    0
      0    1/2  0    1/2  0
      0    0    1/2  0    1/2
      1/2  0    0    1/2  0   ].

Suppose we start at 0. What is the expected time to return to
0?
Solution: The expected return time is one more than the expected
time to reach 0 starting from 1 (or, by symmetry, from 4). Eliminating
the first row and column, we get

Q = [ 0    1/2  0    0
      1/2  0    1/2  0
      0    1/2  0    1/2
      0    0    1/2  0   ].

We get

(I − Q)^{−1} = [ 1.6  1.2  0.8  0.4
                 1.2  2.4  1.6  0.8
                 0.8  1.6  2.4  1.2
                 0.4  0.8  1.2  1.6 ].
So the total expected time to reach 0 is 4, so the expected return
time is 5. 
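A quick numpy check of this example using formula (13.2) (a sketch added here, not part of the original notes):

import numpy as np

# Random walk on a ring of 5 points; Q is obtained by deleting the row and
# column of state 0 from the transition matrix.
Q = 0.5 * np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]])
mu = np.linalg.solve(np.eye(4) - Q, np.ones(4))   # expected times to hit 0 from states 1..4
print(mu)          # [4., 6., 6., 4.]
print(1 + mu[0])   # expected return time to 0: one step to a neighbour, then 4 more on average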

Example 13.6: General random walk

Let (ξi)_{i=1}^∞ be i.i.d. random variables taking values in Z, with
P{ξi = x} = ρx, and define Yn := Σ_{i=1}^n ξi. Suppose µ = E[ξi] =
Σ_x xρx exists and is finite; suppose ξi has finite variance σ², and
it has a finite fourth moment, so τ := E[(ξi − µ)^4] is finite as
well. Suppose the chain is irreducible and aperiodic.
Let T = min{n : Yn < 0}. We know from Example 13.1 that T
is finite almost surely if and only if µ ≤ 0. When does T have
finite expectation?
A very important algebraic identity is

(Yn − nµ)^4 = ( Σ_{i=1}^n (ξi − µ) )^4
            = Σ_{i=1}^n (ξi − µ)^4 + 3 Σ_{i≠j} (ξi − µ)²(ξj − µ)²
              + (cross terms in each of which some factor ξi − µ appears to the first power).

Note that all the terms in which some ξi − µ appears only to the first power
have expectation 0. Thus

E[(Yn − nµ)^4] = nτ + 3n(n − 1)σ^4.



Suppose µ < 0. If Yn < 0 then certainly T ≤ n. Thus

P{T > n} ≤ P{Yn ≥ 0} ≤ P{Yn − nµ ≥ −nµ}
         ≤ P{(Yn − nµ)^4 ≥ n^4 µ^4}
         ≤ ( nτ + 3n(n − 1)σ^4 ) / ( n^4 µ^4 )
         ≤ ( (τ + 3σ^4)/µ^4 ) n^{−2}.

We have then

E[T] = Σ_{n=0}^∞ P{T > n} ≤ 1 + Σ_{n=1}^∞ ( (τ + 3σ^4)/µ^4 ) n^{−2} < ∞.

Notice how putting bounds on higher moments of the increments


— in this case fourth moments — allows us better control over
the paths of the Markov chain. What would have gone wrong if
we had simply assumed finite variance?
Of course, the chain can be recurrent only if it returns to 0
with probability 1 from above and from below. That clearly will
happen if and only if µ = 0. The chain can never be positive
recurrent. 

Example 13.7: Queueing model, continued

We introduced in Example 13.1 a queueing model, and showed


that it was recurrent if and only if the expected number of ar-
rivals in unit time was µ ≤ 1. We wish to consider whether the
queueing model is null or positive recurrent.
We define as before the random walk Yn = Σ_{i=1}^n (ξi − 1), such
that Xn ≥ Yn and min{n : Yn = 0} = min{n : Xn = 0}. From
the result of Example 13.6, it follows that the process is positive
recurrent if µ < 1 and null recurrent if µ = 1. 
Lecture 14

Applications of Markov chains

14.1 Markov chain Monte Carlo


What does it mean to know a probability distribution? There are various
ways that we can uniquely define a probability distribution: If it’s discrete,
we can list all the probabilities of the individual outcomes. If it’s continuous,
we can give a density or a cumulative distribution function. We can give
a probability generating function for a discrete distribution, or a moment
generating function for any distribution with sufficiently thin tails, or a
characteristic function for any distribution.
But one operational definition of knowing a probability distribution (for
a single random variable or for some collection of joint random variables) —
certainly for the modern age of cheap computing — is that we can simulate a
selection from this random variable on a computer, to any desired accuracy.
After all, any property of a random variable X (or a complicated process
involving a joint distribution of some random variables) may be defined in
terms of the expectation of some test function of X: E[φ(X)]. If we know
how to simulate samples from X, then we can estimate E[φ(X)] for any
reasonable function φ by taking n samples X1, . . . , Xn and computing the
Monte Carlo estimate

(1/n)( φ(X1) + · · · + φ(Xn) ).
Which distributions can we simulate? Starting from a sequence of inde-
pendent random bits (or independent random digits uniform in {0, 1, . . . , 9}),
we can simulate a uniform random variable U on [0, 1] to any desired degree


of accuracy. If F : R → [0, 1] is any potential cdf, we can simulate a random


variable with cdf F by computing F −1 (U ).
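For instance (an illustration added here, not in the original notes), taking F to be the Exp(λ) cdf gives the familiar inverse-transform recipe; the rate and sample size below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(2)
lam = 1.5
u = rng.random(100_000)              # U uniform on (0,1)
x = -np.log(1 - u) / lam             # F^{-1}(u) for F(x) = 1 - exp(-lam * x)
print(x.mean(), 1 / lam)             # the sample mean is close to 1/lambda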
In principle, this allows us to simulate any distribution; but sometimes
the cdf is not feasible to compute. As an example, consider the Ising model
that we introduced in section ??. This is a distribution defined on Ω =
{−1, +1}^{N²}, where N² is thought of as the N × N square lattice. We define
the weight function

w(ξ) = exp( J Σ_{i∼j} ξ(i)ξ(j) ),

for a fixed constant J, where the sum is taken over pairs of grid points i and
j which are nearest neighbours; and then define the probability of A ⊂ Ω by

P(A) = Σ_{ξ∈A} w(ξ) / Σ_{ξ∈Ω} w(ξ).

If we wanted to compute a probability of an event like {more than 60% of the ξ(j) are +1},
we'd just have to compute all the weights and add them up. Unfortu-
nately, we can't actually compute any of these probabilities for any mod-
erately large N, because there are 2^{N²} outcomes whose weights need to be
computed. The fastest computer in the world would need longer than the
age of the universe to run through all the possibilities even for N = 10.
What can we do? It seems we can’t simulate from this distribution
without knowing the probabilities. But it turns out that this is not true.
What we do is to define a Markov chain that has the desired distribution P
as its stationary distribution. Then we run the Markov chain long enough
that we know it is close to stationarity. (How long to run these Markov
chains is a hot area of research. There is also a technique called coupling
from the past [PW97] which allows us, in some cases, to run the chain for
a random number of steps, such that the state we finish with is a sample
from exactly the stationary distribution.) So how do we find such a Markov
chain?

14.1.1 The Metropolis Algorithm


Here is an approach, called the Metropolis algorithm, based on a principle
of trial and rejection: We start with any Markov chain at all on the desired
sample space, with transition matrix P and initial distribution q. This
should be a Markov chain that is easy to simulate. Given a state Xn = x, we
then sample a new state from our transition probabilities P (x, ·). Suppose
we pick a state y. If πy P (y, x) ≥ πx P (x, y), then Xn+1 = y; if πy P (y, x) <
πx P (x, y), then we flip a coin with probability πy P (y, x)/πx P (x, y) of coming
up heads. If heads comes up (that is, when P is symmetric, with probability πy/πx) we take
X_{n+1} = y; otherwise, we reject the move, and stay with X_{n+1} = x.
It's clear that this defines a new Markov chain, whose transition proba-
bilities are

P̃(x, y) = P(x, y) min{ πy P(y, x) / (πx P(x, y)), 1 }    for y ≠ x,

with the remaining probability being made up of P̃(x, x). We see then that

πx P̃(x, y) = min{ πy P(y, x), πx P(x, y) } = πy P̃(y, x).

Thus P̃(x, y) is reversible, with reversing distribution π. Note that to run
this algorithm we don't need to know the probabilities πx, but only the
ratios πx/πy.
For the Ising model, we can take P(x, y) to be defined by picking one of
the N² sites uniformly at random, and then flipping its sign. Then P(x, y) =
P(y, x). Given Xn = ξ, we pick a site j uniformly, and then let

p = exp( −2J ξ(j) Σ_{i:i∼j} ξ(i) ).

If p ≥ 1 then we flip the sign of ξ(j). If p < 1 then we decide with probability
p to flip the sign.
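A minimal Python sketch of this single-site Metropolis update for the Ising model (an addition, not part of the original notes). The lattice size N, the coupling J, the number of sweeps and the use of free boundary conditions are all illustrative assumptions; how long to run the chain is exactly the question discussed above.

import numpy as np

rng = np.random.default_rng(3)
N, J, sweeps = 20, 0.3, 200
xi = rng.choice([-1, 1], size=(N, N))        # arbitrary starting configuration

for _ in range(sweeps * N * N):
    i, j = rng.integers(N, size=2)           # pick a site uniformly at random
    s = 0                                    # sum of the neighbouring spins (free boundaries)
    if i > 0:     s += xi[i - 1, j]
    if i < N - 1: s += xi[i + 1, j]
    if j > 0:     s += xi[i, j - 1]
    if j < N - 1: s += xi[i, j + 1]
    p = np.exp(-2 * J * xi[i, j] * s)        # ratio w(flipped)/w(current)
    if p >= 1 or rng.random() < p:
        xi[i, j] = -xi[i, j]                 # accept: flip the spin

print(xi.mean())    # average magnetisation of the (approximate) sample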
Lecture 15

Poisson process

15.1 Point processes


A joint n-dimensional distribution may be thought of as a model for a random
point in R^n. A point process is a model for choosing a set of points. Some
examples of phenomena that may be modelled this way are:

• The pattern of faults on a silicon chip;

• The accumulation of mutations within an evolutionary tree;

• Appearance of a certain pattern of pixels in a photograph;

• The times when a radioactive sample emits a particle;

• The arrival times of customers at a bank;

• Times when customers register financial transactions.

Note that a point process — thought of as a random subset of Rn (or


some more general space) — may also be identified with a counting process
N mapping regions of Rn to natural numbers. For any region A, N (A) is
the (random) number of points in A.
A Poisson process is the simplest possible point process. It has the
following properties (which we will formalise later on):

• Points in disjoint regions are selected independently;

• The number of points in a region is proportional purely to the size of


the region;


• There are finitely many points in any bounded region, and points are
all distinct.

We will be focusing in this course on one-dimensional point processes,


so random collections of points in R. Note that the counting process is
uniquely determined by its restriction to half-infinite intervals: We write

Nt := N (−∞, t] = # points ≤ t.

If we imagine starting from 0 and progressing to the right, the succession of


points may be identified with a succession of “interarrival times”.

15.2 The Poisson process on R+


The Poisson process on R+ is the simplest nontrivial point process. We have
three equivalent ways of representing a point process on R+ :

Random set of points ←→ Interarrival times ←→ Counting process Nt

15.2.1 Local definition of the Poisson process


A Poisson arrival process with parameter λ is an integer-valued stochastic
process N (t) that satisfies the following properties:

(PPI.1) N (0) = 0.

(PPI.2) Independent increments: If 0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · ≤ sk < tk ,


then the random variables (N(ti) − N(si))_{i=1}^k are independent.

(PPI.3) Constant rate: P{N (t + h) − N (t) = 1} = λh + o(h). That is,

lim_{h↓0} h^{−1} P{N(t + h) − N(t) = 1} = λ.

(PPI.4) No clustering: P{N (t + h) − N (t) ≥ 2} = o(h).

The corresponding point process has a discrete set of points at those t


where N (t) jumps. That is, {t : N (t) = N (t−) + 1}.

15.2.2 Global definition of the Poisson process


A Poisson arrival process with parameter λ is an integer-valued stochastic
process N (t) that satisfies the following properties:

(PPII.1) N (0) = 0.

(PPII.2) Independent increments: If 0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · ≤ sk < tk ,


then the random variables (N(ti) − N(si))_{i=1}^k are independent.

(PPII.3) Poisson distribution:

P{N(t + s) − N(s) = n} = e^{−λt} (λt)^n / n!.

The corresponding point process has a discrete set of points at those t


where N (t) jumps. That is, {t : N (t) = N (t−) + 1}.

15.2.3 Defining the Interarrival process


A Poisson process with parameter λ may be defined by letting τ1 , τ2 , . . .
be i.i.d. random variables with exponential distribution with parameter λ.
Then the point process is made up of the cumulative sums Tk := Σ_{i=1}^k τi,
and the counting process is

N(t) = #{k : 0 ≤ Tk ≤ t}.
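A short simulation sketch based on this interarrival definition (an addition, not part of the original notes); the rate λ, time t and number of repetitions are illustrative. It checks that N(t) has mean and variance close to λt, as the global definition requires.

import numpy as np

rng = np.random.default_rng(4)
lam, t, reps = 2.0, 3.0, 50_000

counts = np.empty(reps, dtype=int)
for r in range(reps):
    # Generate comfortably more Exp(lam) interarrival times than could fall in [0, t].
    arrivals = np.cumsum(rng.exponential(1 / lam, size=int(10 * lam * t) + 20))
    counts[r] = np.searchsorted(arrivals, t, side="right")    # N(t) = number of arrivals <= t

print(counts.mean(), lam * t)    # mean of N(t) is close to lambda * t
print(counts.var(), lam * t)     # and so is its variance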



[Figure 15.1: Representations of a Poisson process. T1, T2, . . . are the loca-
tions of the points (the arrival times). τ1, τ2, . . . are the i.i.d. interarrival
times. The green line represents the counting process N(t), which increases by 1
at each arrival time.]

15.2.4 Equivalence of the definitions


Proposition 15.1. The local, global, and interarrival definitions define the
same stochastic process.

Proof. We show that the local definition is equivalent to the other defini-
tions. We start with the local definition and show that it implies the global
definition. The first two conditions are the same. Consider an interval (s, s + t],
and suppose the process satisfies the local definition. Choose a large integer
K, and define

ξi = N( s + (i/K)t ) − N( s + ((i − 1)/K)t ).

Then N(t + s) − N(s) = Σ_{i=1}^K ξi, and ξi is close to being Bernoulli with
parameter λt/K, so N(t + s) − N(s) should be close to the Binom(K, λt/K)
distribution, which we know converges to Poisson(λt). Formally, we can
assume that the ξi are all 0 or 1, since

P{ξi ≥ 2 for some 1 ≤ i ≤ K} ≤ K · o(t/K) → 0 as K → ∞.

The new ξi, which are no more than 1, have mgf

Mξ(θ) = 1 + (λt/K)(1 + o(1))(e^θ − 1).

Thus the mgf of N(t + s) − N(s) is

( Mξ(θ) )^K = ( 1 + (λt/K)(1 + o(1))(e^θ − 1) )^K → e^{λt(e^θ − 1)} as K → ∞,

which is the mgf of a Poi(λt) random variable.
which is the mgf of a Poi(λt) random variable.
Now assume the global definition. Since there are a finite number of
points in any finite interval, we may list them in order, giving us a sequence
of random variables T1 , T2 , . . . , and the interarrival times τ1 , τ2 , . . . . We
need to show that these are independent with Exp(λ) distribution. It will
suffice to show that for any positive numbers t1 , t2 , . . . , tk ,
P{τk > tk | τ1 = t1, . . . , τ_{k−1} = t_{k−1}} = e^{−λ tk}.


(This means that, independent of τi for i ≤ k−1, τk has Exp(λ) distribution.)


Let t = t1 + · · · + t_{k−1}. By property (PPII.2), the numbers and locations
of the points on [0, t] and on (t, t + tk] are independent.¹ The event {τk >
tk, τ1 = t1, . . . , τ_{k−1} = t_{k−1}} is identical to the event {N(tk + t) − N(t) =
0, τ1 = t1, . . . , τ_{k−1} = t_{k−1}}, so by independence the probability is simply
the probability of a Poi(λ tk) random variable being 0, which is

P{N(tk + t) − N(t) = 0} = e^{−λ tk}.


Finally, suppose the process is defined by the interarrival definition. It’s


trivial that N (0) = 0. The independent increment condition is satisfied
because of the memoryless property of the exponential distribution. (For-
mally, we can show that, conditioned on any placement of all the points
on the interval [0, t2 ], the next point still has distribution t + τ , where τ is
Exp(λ). Then we proceed by induction on the number of intervals.) Finally,
P{N (t + h) − N (t) ≥ 1} = 1 − e−λh = λh + o(h), while P{N (t + h) − N (t) ≥
2} ≤ P{τ < h}2 , where τ is an Exp(λ) random variable, so this is on the
order of λ2 h2 .
¹ If we're going to be completely rigorous, we would need to take an infinitesimal in-
terval around each ti , and show that the events of τi being in all of these intervals are
independent.

15.2.5 The Poisson process as Markov process


We note that the Poisson process in one dimension is also a Markov process.
It is the simplest version of a Markov process in continuous time. We do
not formally define the Markov property in continuous time, but intuitively
it must be that the past contains no information about the future of the
process that is not in the current position. For a counting process, the
“future” is merely the time when it will make the next jump (and the next
one after that, and so on), while the past is simply how long since the last
jump, and the one before that, and so on. So the time to the next jump
must be independent of the time since the last jump, which can only happen
if the distribution of the time is the “memoryless” exponential distribution.
Thus, if a counting process is to be Markov, it must be something like a
Poisson process. The only exception is that it would be possible to change
the arrival rate, depending on the cumulative count of the arrivals.
More generally, a continuous-time Markov process on a discrete state
space is defined by linking a Markov chain of transitions among the states
with independent exponentially distributed waiting times between transi-
tions, with the rate parameter of the waiting times determined by the state
currently occupied. But this goes beyond the scope of this course.
Lecture 16

Poisson process: Examples and extensions

Part of learning about any probability distribution or process is to know


what the standard situations are where that distribution is considered to be
the default model. The one-dimensional Poisson process is the default model
for a process of identical events happening at a constant rate, such that none
of the events influences the timing of the others. Standard examples:

• Arrival times of customers in a shop.

• Calls coming into a telephone exchange (the traditional version) or


service requests coming into an internet server.

• Particle emissions from a radioactive material.

• Times when surgical accidents occur in a hospital.

Some of these examples we would expect to be not exactly like a Pois-


son process, particularly as regards homogeneity: Customers might be more
likely to come in the morning than in the afternoon, or accidents might be
more likely to occur in the hospital late at night than in mid-afternoon.
Generalising to Poisson-like processes where the arrival rates may change
with time, for instance, is fairly straightforward. As with many other mod-
elling problems, we might best think of the Poisson process as a kind of
“null model”: Probably too simplistic to be really accurate, but providing
a baseline for starting the modelling process, from which we can consider
whether more detailed realism is worth the effort.


16.1 Some basic calculations

Example 16.1: Queueing at the bank

Customers arrive at a supermarket throughout the day according


to a Poisson process, at a rate of 2 per minute. Calculate:

(i) The probability that the first customer to arrive after 12


noon arrives after 12:02;
(ii) The probability that exactly 4 customers arrive between
12:03 and 12:06;
(iii) The probability that there are at least 3 customers to arrive
between 12:00 and 12:01.
(iv) The expected time between the 4th customer arrival after
noon and the 7th.
(v) The distribution of the time between the 4th customer ar-
rival after noon and the 7th.

Solution:

(i) The number of arrivals in a 2-minute period has Poi(4)


distribution. So the probability that this number is 0 is
e^{−4} ≈ 0.018. Alternatively, we can think in terms of wait-
ing times. The waiting time has Exp(2) distribution. Re-
gardless of how long since the last arrival at noon, the re-
maining waiting time is still Exp(2). So the probability that
this is at least 2 minutes is e−4 .
(ii) The number N3 arriving in 3 minutes has Poi(6) distribu-
tion. The probability this is exactly 4 is e^{−6} 6^4/4! ≈ 0.134.
(iii) The number is Poi(2). Then

P{N1 ≥ 3} = 1 − P{N1 ≤ 2} = 1 − e^{−2}(1 + 2/1! + 2²/2!) ≈ 0.323.

(iv) The expected time between arrivals is 1/2 minute. Thus


the expected sum of three interarrival times is 1.5 minutes.
(v) This is the sum of three independent Exp(2) random vari-
ables, so has a Gamma distribution with shape 3 and rate 2, with density 4t²e^{−2t}.
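For reference (an addition, not part of the original notes), the numerical answers in (i)-(iii) can be checked with a few lines of Python:

from math import exp, factorial

def poisson_pmf(k, m):
    return exp(-m) * m ** k / factorial(k)

print(exp(-4))                                         # (i)   about 0.018
print(poisson_pmf(4, 6))                               # (ii)  about 0.134
print(1 - sum(poisson_pmf(k, 2) for k in range(3)))    # (iii) about 0.323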

Example 16.2: The waiting time paradox

Central stations of the Moscow subway have (or had, when I


visited a dozen or so years ago) an electronic sign that shows
the time since the last train arrived on the line. That is, it is a
timer that resets itself to 0 every time a train arrives, and then
counts the seconds up to the next arrival. Suppose trains arrive
on average once every 2 minutes, and arrivals are distributed as
a Poisson process. A woman comes every day, boards the next
train, and writes down the number on the timer as the train
arrives. What is the long-term average of these numbers?
Solution: You might think that the average should be 2 min-
utes. After all, she is observing an interarrival time between
trains, and the expected interarrival time is 2 minutes. It is true
that if she stood there all day and wrote down the numbers on
the timer when the trains arrive, the average should converge to
2 minutes. But she isn’t averaging all the interarrival times, only
the ones that span the moments when she comes to the station.
Isn’t that the same thing?
No! The intervals that span her arrival times are size-biased.
Imagine the train arrival times being marked on the time-line.
Now the woman picks a time to come into the station at random,
independent of the train arrivals. (We are implicitly assuming
that she uses no information about the actual arrival times. This
would be true if she comes at the same time every day, or at a
random time independent of the trains.) This is like dropping
a point at random onto the real line. There will be some wide
intervals and some narrow intervals, and the point will naturally
be more likely to end up in one of the wider intervals. In fact,
if the interarrival times have density f(t), the size-biased inter-
arrival times will have density t f(t) / ∫_0^∞ s f(s) ds. In the Poisson
process, the interarrival times have exponential distribution, so
the density of the observed interarrival times is

t · λe^{−λt} / ∫_0^∞ s λe^{−λs} ds = λ² t e^{−λt}.

An easier way to see this is to think of the time that the woman
waits for the train, and the time since the previous train at the
moment when she enters the station. By the memoryless prop-
erty of the exponential distribution, the waiting time until the
next train, at the moment when she enters, is still precisely ex-
ponential with mean 2 minutes. By the symmetry of the Poisson
process, it's clear that if we go backwards looking for the pre-
vious train, the time will also be exponential with mean 2 min-
utes, and the two times will be independent. Thus, the interar-
rival times observed by the woman will actually be the sum of
two independent exponentials with mean 2 minutes, so will have a gamma
distribution with shape parameter 2 and mean 4 minutes. The long-run
average of the numbers she writes down is therefore 4 minutes, not 2. 
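A simulation sketch of the paradox (an addition, not part of the original notes): trains arrive as a Poisson process with rate 1/2 per minute, and we record the length of the interarrival interval covering a fixed inspection time. The inspection time and number of repetitions are illustrative choices.

import numpy as np

rng = np.random.default_rng(5)
rate, inspect, reps = 0.5, 100.0, 20_000      # rate 1/2 per minute, inspect at t = 100 minutes

observed = np.empty(reps)
for r in range(reps):
    gaps = rng.exponential(1 / rate, size=200)    # far more interarrival times than needed
    times = np.cumsum(gaps)                       # train arrival times
    k = np.searchsorted(times, inspect)           # index of the first train after the inspection time
    prev = times[k - 1] if k > 0 else 0.0
    observed[r] = times[k] - prev                 # the gap that covers the inspection time

print(observed.mean())    # close to 4 minutes: twice the mean interarrival time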

Example 16.3: Genetic recombination model

STILL TO COME. 

16.2 Thinning and merging


Consider the following problems: A hospital intensive care unit admits 4
patients per day, according to a Poisson process. One patient in twenty, on
average, develops a dangerous infection. What is the probability that there
will be more than 2 dangerous infections in the course of a week?
Or: The casualty department takes in victims of accidents at the rate of
4 per hour through the night, and heart attack/stroke victims at the rate
of 2 per hour, each of them according to a Poisson process. What is the
distribution of the total number of patients that arrive during an 8-hour
shift? What can we say about the distribution of patient arrivals, ignoring
their cause?
Theorem 16.1. Suppose we have a Poisson process with parameter λ. De-
note the arrival times T1 < T2 < · · · . We thin the process by the fol-
lowing process: We are given a probability distribution on {1, . . . , K} (so,
we have numbers pi = P({i}) for i = 1, . . . , K). Each arrival time is as-
signed to category i with probability pi . The assignments are independent.
Let T1^(i) < T2^(i) < · · · be the arrival times in category i. Then these are
independent Poisson processes, with rate parameter λpi for process #i.
Conversely, suppose we have independent Poisson processes T1^(i) < T2^(i) <
· · · with rate parameters λi. We form the merged process T1 < T2 < · · ·
by taking the union of all the times, ignoring which process they come from.
Then the merged process is also a Poisson process, with parameter Σ λi.
Then the merged process is also a Poisson process, with parameter λi .
Proof. Thinning: We use the local definition of the Poisson process. Start
from the process T1 < T2 < . . . . Let T1^(i) < T2^(i) < · · · be the i-th thinned
process. Clearly Ni(0) is still 0. If we look at the numbers of events occurring
on disjoint intervals, they are being thinned from independent random vari-
ables. Since a function applied to independent random variables produces
independent random variables, we still have independent increments. We
have

P{Ni(t + h) − Ni(t) = 1}
  = P{N(t + h) − N(t) = 1} · P{assign category i}
  + P{N(t + h) − N(t) ≥ 2} · P{assign category i to exactly one}
  = (λh + o(h)) pi + o(h)
  = pi λ h + o(h).

And by the same approach, we see that P{Ni(t + h) − Ni(t) ≥ 2} = o(h).
Independence is slightly less obvious. In general, if you take a fixed num-
ber of points and allocate them to categories, the numbers in the different
categories will not be independent. The key is that there is not a fixed num-
ber of points; moving from left to right, there is always the same chance λdt
of getting an event at the next moment, and these may be allocated to any
of the categories, independent of the points already allocated. A rigorous
proof is easiest with the global definition. Consider N1 (t), N2 (t), . . . , NK (t)
for fixed t. These may be generated by the following process: Let N (t) be
Poi(λt), and let (N1 (t), N2 (t), . . . , NK (t)) be multinomial with parameters
(N (t); (pi )). That is, supposing N (t) = n, allocate points to bins 1, . . . , K
according to the probabilities p1 , . . . , pK . Then

P{N1(t) = n1, . . . , NK(t) = nK}
  = P{N(t) = n} P{N1(t) = n1, . . . , NK(t) = nK | N(t) = n}       (with n = n1 + · · · + nK)
  = e^{−λt} (λt)^n/n! · n!/(n1! · · · nK!) · p1^{n1} · · · pK^{nK}
  = Π_{i=1}^K e^{−λi t} (λi t)^{ni}/ni!                            (where λi := λ pi)
  = Π_{i=1}^K P{Ni(t) = ni}.

Since counts involving distinct intervals are clearly independent, this com-
pletes the proof.
Merging: This is simpler, since we don’t have to prove independence.
The total number of arrivals on disjoint intervals are the sums of the num-
bers of category i arrivals; the sums of independent random variables are
also independent. Furthermore, the sum of independent Poisson random
variables is also Poisson, which completes the proof, by the global definition
of the Poisson process.
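A short simulation sketch of thinning (an addition, not part of the original notes), using exactly the multinomial construction from the proof; λ, t and the category probabilities are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
lam, t, probs, reps = 4.0, 10.0, np.array([0.5, 0.3, 0.2]), 50_000

counts = np.empty((reps, 3), dtype=int)
for r in range(reps):
    n = rng.poisson(lam * t)                      # total number of points on [0, t]
    counts[r] = rng.multinomial(n, probs)         # allocate each point independently

print(counts.mean(axis=0), lam * t * probs)       # category means match the thinned rates
print(np.corrcoef(counts[:, 0], counts[:, 1])[0, 1])   # correlation close to 0, as independence predicts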

Thus, in the questions originally posed, the arrivals of patients who


develop dangerous infections (assuming they are independent) is a Poisson
process with rate 4/20 = 0.2. The number of such patients in the course of
a week is then Poi(1.4), so the probability that this is > 2 is

1.4 1.42
 
−1.4
1−e 1+ + = 0.167.
1 2
The casualty department takes in two independent Poisson streams of
patients with total rate 6, so it is a Poisson process with parameter 6. The
number of patients in 8 hours has Poi(48) distribution.

16.3 Poisson process and the uniform distribution


The presentation here is based on section V.4 of [TK98].

Example 16.4: Discounted income

Companies evaluate their future income stream in terms of present


value: Earning £1 next year is not worth as much as £1 today;
put simply, £1 income t years in the future is worth e−θt £ today,
where θ is the interest rate.
A company makes deals at random times, according to a Poisson
process with parameter λ. For simplicity, let us say that each
deal is worth £1. What is the expectation and variance of the
total present value of all its future deals? 

The problem here is that, while we know the distribution of the number
of deals over a span [0, t], the quantity of interest depends on the precise
times.

Theorem 16.2. Let T1 < T2 < · · · be the arrival process of a Poisson


process. For any s < t, conditioned on the event {N (t) − N (s) = n}, the
subset of points on the interval (s, t] is jointly distributed as n independent
points uniform on (s, t].

Proof. Intuitively this is clear: The process is uniform, so there can’t be a


higher probability density at one point than at another. And finding a point
in (t, t + δt] doesn’t affect the locations of any other points.
We can prove this formally by calculating that for any s < u < t,
conditioned on N (t) − N (s) = n, the number of points in (s, u] (that is,
N (u)−N (s)) has binomial distribution with parameters n and (u−s)/(t−s).
That is, the number of points in the subinterval has exactly the same dis-
tribution as you would find by allocating the n points independently and
uniformly. Then we need to argue that the distribution of the number of
points in every subinterval determines the joint distribution. The details are
left as an exercise.

Solution to the Discounted income problem: Consider the present


value of all deals in the interval [0, t]. Conditioned on there being n deals,
these are independent and uniformly distributed on [0, t]. The expected
present value of a single deal is then
∫_0^t t^{−1} e^{−θs} ds = (θt)^{−1}(1 − e^{−θt}),

and the variance of a single deal's value is

σ²(t) := (2θt)^{−1}(1 − e^{−2θt}) − (θt)^{−2}(1 − e^{−θt})²
       = (1/(2θ²t)) [ (θ − 2t^{−1}) + 4t^{−1}e^{−θt} − (θ + 2t^{−1})e^{−2θt} ].

Thus, the present value up to time t, call it Vt, has conditional expectation
and variance

E[Vt | N(t)] = N(t)(θt)^{−1}(1 − e^{−θt}),
Var(Vt | N(t)) = N(t)σ²(t).

(The formula for the conditional variance depends on independence.) So we
have

E[Vt] = E[ E[Vt | N(t)] ] = (λ/θ)(1 − e^{−θt}).

Of course, as t → ∞ this will converge to λ/θ. For the variance, we use the
formula Var(V) = E[Var(V | X)] + Var(E[V | X]), so that

Var(Vt) = E[N(t)]σ²(t) + ( (θt)^{−1}(1 − e^{−θt}) )² Var(N(t))
        = (λ/(2θ²)) [ (θ − 2t^{−1}) + 4t^{−1}e^{−θt} − (θ + 2t^{−1})e^{−2θt} ] + θ^{−2}(1 − e^{−θt})² λ t^{−1}
        → λ/(2θ)   as t → ∞.
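A simulation sketch (an addition, not part of the original notes) checking both limits, using the uniform representation of Theorem 16.2; the values of λ, θ and t are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)
lam, theta, t, reps = 3.0, 0.5, 40.0, 40_000

V = np.empty(reps)
for r in range(reps):
    n = rng.poisson(lam * t)
    deal_times = rng.uniform(0, t, size=n)     # given N(t) = n, deal times are i.i.d. uniform on [0, t]
    V[r] = np.exp(-theta * deal_times).sum()   # present value of all deals up to time t

print(V.mean(), lam / theta)          # approximately 6
print(V.var(), lam / (2 * theta))     # approximately 3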

Bibliography

[Bil95] Patrick Billingsley. Probability and Measure. John Wiley & Sons
Inc., New York, third edition, 1995.

[Das95] Lorraine Daston. Classical Probability in the Enlightenment.


Princeton University Press, 1995.

[Dav62] F. N. David. Games, gods, and gambling: a history of probability


and statistical ideas. Charles Griffin, London, 1962.

[Fel68] William Feller. An Introduction to Probability and its Applications,


volume 1. John Wiley & Sons, New York, 3rd edition, 1968.

[Fel71] William Feller. An Introduction to Probability and its Applications,


volume 2. John Wiley & Sons, New York, 1971.

[FPP98] David Freedman, Robert Pisani, and Roger Purves. Statistics.


Norton, 3rd edition, 1998.

[GS01] Geoffrey Grimmett and David Stirzaker. Probability and Random


Processes. Oxford University Press, Oxford, 2001.

[Joh04] Oliver Johnson. Information theory and the Central Limit Theo-
rem. Imperial College Press, 2004.

[Pit93] Jim Pitman. Probability. Springer-Verlag, New York, 1993.

[PW97] James Propp and David Wilson. Coupling from the past: A user’s
guide. In Microsurveys in Discrete Probability. DIMACS, 1997.

[Ros06] Jeffrey S. Rosenthal. A First Look at Rigorous Probability Theory.


World Scientific, second edition, 2006.

[Ros08] Sheldon Ross. A First course in Probability. Prentice-Hall, 8th


edition, 2008.

[TK98] Howard M. Taylor and Samuel Karlin. An Introduction to Stochastic Modeling. Academic Press, third edition, 1998.
Appendix A

Assignments

A.1 Part A Probability Problem sheet 1:
Random Variables, Expectations, Conditioning
(1) (a) The Gamma(r, λ) distribution is the distribution on R+ with density
\[
f_{r,\lambda}(x) = \frac{\lambda^r x^{r-1}}{\Gamma(r)}\, e^{-\lambda x}.
\]
r is called the shape parameter, and λ is the rate parameter. Show that the sum of two independent Gamma distributed random variables with the same rate parameter has another Gamma distribution.
(b) The χ2 distribution with n degrees of freedom is defined to be the distribution of X12 + · · · + Xn2 ,
where X1 , . . . , Xn are i.i.d. standard normal random variables. Show that the χ2 distribution
belongs to the Gamma family, and compute the parameters.
(2) Let (Z1 , Z2 ) be independent random variables with standard normal distribution, and let X =
2Z1 + Z2 and Y = Z1 + Z2 .
(a) What are the marginal distributions of X and Y ? What is the joint distribution of (X, Y )?
(b) Compute E[Y | X = 5].
(3) Let X1 and X2 be independent exponential random variables with parameters µ and λ respectively.
Let Y = min(X1 , X2 ) and Z = max(X1 , X2 ). Let W = Z − Y , and I = 1{X1 >X2 } .
(a) Compute the joint densities of (Y, Z), (Y, W ), and (Z, W ). Which pairs are independent?
(b) Compute the marginal distributions of Y , Z, W , and I.
(c) Compute the conditional density of Z given Y = y.
(d) Which of Y , Z, or W are independent of I?
(e) Under what conditions are three of the four random variables (Y, Z, W, I) jointly independent?
(4) The bivariate normal distribution with parameters µX , µY , σX , σY , and ρ is described in section
2.4.3. Show that these parameters are in fact the expectations of X and Y , the SDs of X and Y ,
and the correlation. You may show this by working from the representation of X and Y as linear
combinations of independent standard normal random variables. You could also try showing that
you get the same result by integrating the bivariate joint density (2.6).
(5) Random variables X and Y have joint density f(x, y). Let Z = Y/X. Show that Z has density
\[
f_Z(z) = \int_{-\infty}^{\infty} |x|\, f(x, xz)\, dx.
\]
Suppose that X and Y are independent standard normal random variables. Show that Z has density
\[
f_Z(z) = \frac{1}{\pi(1+z^2)} \quad \text{for } -\infty < z < \infty.
\]
(6) We have a coin such that P{heads} = Q, where Q is a fixed quantity that was uniformly chosen from
the interval (0, 1).
(a) We flip the coin n times. Compute the expectation and variance of the number of heads.
(b) Let Xi be the indicator of {flip #i is a head} (that is, Xi = 1 if the i-th flip is heads, and Xi = 0 if it is tails). Let Y = total number of heads. Show that E[X1 | Y] = Y/n and E[Y | Xi] = Xi + (Xi + 1)(n − 1)/3.
(c) We flip the coin repeatedly until the first time heads comes up. Let T be the number of flips.
Show that T is finite with probability 1, but that E[T ] = ∞.
(7) The second-moment method: Let X be a random variable with E[X] > 0. Show that
\[
P\{X > 0\} \ge 1 - \frac{\operatorname{Var}(X)}{E[X]^2}.
\]
Optional more challenging problem: If X is nonnegative, improve this bound to
\[
P\{X > 0\} \ge \frac{E[X]^2}{E[X^2]},
\]
and show that this is indeed always a better bound. (Hint: Let Y = 1_{\{X \neq 0\}}. Show that E[X] = E[XY], and use the fact that E[XY]^2 ≤ E[X^2] E[Y^2], which is in the lecture notes as (3.8).)
(8) The birthday problem: Suppose there are 365 possible birthdays, and we have n individuals. Each
person’s birthday is an independent uniform random choice from the 365 possibilities. Let Bk be the
number of groups of k individuals who all have the same birthday. (Groups can overlap. Thus, if
we have 3 people with birthday Jan. 1 and no other matches, then B3 = 1, B2 = 3, and Bk = 0 for
k > 3.)
(a) Compute P{B2 > 0}. How big must n be so that the probability that at least two people in the population have the same birthday is at least 1/2? (This is the classic birthday problem.)
(b) Compute E[B2 ] and Var(B2 ), and E[B3 ] and Var(B3 ).
(c) Use Chebyshev's inequality to find a number n such that the probability that at least three people in the group have the same birthday is at least 1/2.
(d) Bonus Question: Compute E[Bk] and Var(Bk) for general k. (This requires some careful combinatorics.)
A.2 Part A Probability Problem sheet 2:
Generating functions and convergence
(1) Let λn be a sequence of positive real numbers such that limn→∞ λn = ∞. Let Xn be a random variable with Poisson distribution with parameter λn, and Yn := (Xn − λn)/√λn. Using moment generating functions, show that Yn converges in distribution to a standard normal distribution. If λn = nλ, what distribution does Zn := (Xn − nλ)/√n converge to? Explain why this is a special case of the Central Limit Theorem.
(2) (a) Compute the moment generating function of a random variable with Gamma(r, λ) distribution.
Use this to repeat the result from the previous sheet, that the sum of independent Gamma
random variables with the same rate parameter is also Gamma distributed.
(b) An insurance company has N claims in a year, where N has a Poisson distribution with parameter µ. The individual claims are i.i.d. with Gamma(r, λ) distribution, independent of N. Let Y be the total of all the claims in a year. Compute the moment generating function of Y.
(c) Suppose µ = 20, r = 10, λ = 0.1. Show that E[Y ] = 2000, and the probability of Y > 4000 is
less than 0.0011.
(3) Prove that if random variables Xn converge to X in mean square, then they also converge in probability.
(4) (Grimmett and Welsh 8.5.1) Let X1 , X2 , . . . be a sequence of i.i.d. random variables uniform on
(0, x). Let Zn := max{X1 , . . . , Xn } and Un := n(x − Zn ).
(a) Show that Zn → x in probability as n → ∞.
(b) Show that √Zn → √x in probability as n → ∞.
(c) Show that Un converges in distribution to an exponential distribution as n → ∞, and find the
parameter.
(5) Let X be the number of successes in n Bernoulli trials with probability p of success. Use the CLT to
approximate P{|X/n − p| ≤ 0.01} when
(a) n = 100 and p = 1/2;
(b) n = 1000 and p = 1/2;
(c) n = 1000 and p = 1/10.
In each case, compare the bound to the Chebyshev bound used in the proof of the Weak Law of
Large Numbers; and to the exact probability.
(6) An investor purchases 100 shares of Consolidated Confidence for £10 on January 1. It trades on
200 days during the year, and on day i the price changes by Xi % of its current value, where the
Xi are i.i.d. with uniform distribution on the interval [−2, 2]. The investor sells the shares at the
price Y at the end of the year. Use the Central Limit Theorem to compute (approximately) the probability that the investor makes a profit. You may use the fact that ∫ ln x dx = x ln x − x + C and ∫ ln² x dx = x ln² x − 2x ln x + 2x + C. Hint: Consider the random variable Z = ln Y.
(7) Law of large numbers for non-i.i.d. random variables
(a) Suppose X1 , X2 , . . . are independent random variables, where E[Xi ] = µi and Var(Xi ) = σi2 . Let
Sn = X1 + · · · + Xn and Mn = µ1 + · · · + µn . Assume for all i that σi2 < K for some constant
K. Show that for every ε > 0
\[
(*)\qquad \lim_{n\to\infty} P\left\{\,\left|\frac{S_n}{n} - \frac{M_n}{n}\right| < \varepsilon \right\} = 1.
\]

(b) (More challenging): Suppose now that the random variables are not independent, but that
| Cov(Xi , Xi+j )| ≤ κ(j), for some function κ(j). In other words, if κ is decreasing, this says
that random variables that are far apart become less correlated. Show that (∗) still holds if
limn→∞ κ(n) = 0.
A.3 Markov chains

A.4 Asymptotics of Markov chains and Poisson process
