Part A Probability Lecture Notes
David Steinsaltz
University of Oxford
University Lecturer at the Department of Statistics, University of Oxford
Part A Probability
David Steinsaltz – 16 lectures HT 2012
[email protected]
Website: http://www.steinsaltz.me.uk/probA/probA.html
Overview
The first half of the course takes further the probability theory that was developed in the
first year. The aim is to build up a range of techniques that will be useful in dealing with
mathematical models involving uncertainty. The second half of the course is concerned
with Markov chains in discrete time and Poisson processes in one dimension, both
developing the relevant theory and giving examples of applications.
Synopsis
Continuous random variables. Jointly continuous random variables, independence, con-
ditioning, bivariate distributions, functions of one or more random variables. Moment
generating functions and applications. Characteristic functions, definition only. Exam-
ples to include some of those which may have later applications in Statistics. Basic ideas
of what it means for a sequence of random variables to converge in probability, in distri-
bution and in mean square. Chebychev and Markov inequalities. The weak law of large
numbers and central limit theorem for independent identically distributed variables with
a second moment. Statements of the continuity and uniqueness theorems for moment
generating functions. Discrete-time Markov chains: definition, transition matrix, n-step
transition probabilities, communicating classes, absorption, irreducibility, calculation of
hitting probabilities and mean hitting times, recurrence and transience. Invariant dis-
tributions, mean return time, positive recurrence, convergence to equilibrium (proof not
examinable). Examples of applications in areas such as: genetics, branching processes,
Markov chain Monte Carlo. Poisson processes in one dimension: exponential spacings,
Poisson counts, thinning and superposition.
Exercises
There are four problem sheets, each covering two weeks of the course. There are no
explicit review exercises, but you may want to look back at your Mods sheets, to make
sure you still remember the material.
Doing problems is the most important way to learn a mathematical subject, and for
no subject is that more important than for probability. Some important principles:
• Problems vary in difficulty. There are intended to be questions that will challenge
the very best students in the course. An average student should find that there are
multiple questions or parts of questions that he or she cannot do. Since questions
are organised in part by topic, and since different students find different topics
difficult, you will not necessarily find them to be ordered by difficulty. So if you
can’t do some questions, don’t panic! Do what you can, think about the question
a while, and plan to discuss it in tutorials.
• There is a variety of problems:
(i). Straightforward application of new techniques;
(ii). Extensions of techniques to novel situations;
(iii). Interpretation of applications, often involving extensive descriptions of an ide-
alised real-world setting. These are particularly important in probability;
(iv). Proofs and theoretical exercises. See the above section.
• Some tutors will have you turn in questions ahead of the relevant lectures. This is
not optimal, but seems unavoidable. If this is your situation, you’ll have to do the
best you can from reading.
• Problem sheets are not the same as exam questions. Problem sheets are part of the
learning process; exam questions are merely for evaluation of your learning. Exam
questions are designed to be done completely in no more than 40 minutes. You
should be spending about 8 hours a week on each of your 16-hour lecture courses,
and most of that time will be spent doing exercises. Thus, 12–15 hours on a two-
week sheet is not excessive. The hardest question on a sheet will be much harder,
and take much longer to do, than the hardest exam question.
Reading
16 hours of lectures are not enough to cover all of the material in any depth. The
lectures are intended to provide a framework for understanding what you will then learn
in depth through reading and problem-solving.
Your reading should begin with the lecture notes, but should not end there. You
will not be able to master the material by reading one source, and lecture notes are, by
their nature, not worked out nearly as carefully as a book.
The official suggested supplemental reading, covering the entire syllabus:
• G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes (3rd
edition, OUP, 2001). Chapters 4, 6.1—6.5, 6.8.
• G. R. Grimmett and D. R. Stirzaker, One Thousand Exercises in Probability (OUP,
2001).
• G. R. Grimmett and D J A Welsh, Probability: An Introduction (OUP, 1986).
Chapters 6, 7.4, 8, 11.1—11.3.
• J. R. Norris, Markov Chains (CUP, 1997). Chapter 1.
• D. R. Stirzaker, Elementary Probability (Second edition, CUP, 2003). Chapters
7—9 excluding 9.9.
These are some suggestions, but there is nothing compulsory about them. The
material in this course is absolutely standard, and as a consequence there are hundreds
of books covering it in different ways. You will learn best if you browse various books to
find a style of presentation (and of thought) that makes the most sense to you. Here is
a list of some other probability texts with my personal comments. This has no official
sanction!
• Sheldon Ross — A first course in probability theory. Loads of good examples.
Text a bit dry and plodding. His Introduction to Probability Models (now
in its 9th edition) has some of the same pros and cons for Markov chains, Poisson
process, and a lot more. The range of exercises is breathtaking.
• David Williams — Weighing the odds. A recent text by an inspiring author,
with an integrated treatment of probability and statistics. Fast and exciting.
• Kai Lai Chung — Elementary probability theory. Slower exposition than the
above. Good on counting.
• William Feller — An introduction to probability theory and its applica-
tions. The classic book, in 2 volumes. Volume 1 is just discrete distributions
(including Markov chains), volume 2 is general probability, including convergence
of distributions. Generations of probabilists have been inspired by this book, and
it still can’t be beat for its treatment of renewal processes, densities, and random
walks. Deep exercises involving many interesting applications. Good treatment of
convergence in distribution.
• Jim Pitman — Probability. Covers the more elementary parts of the course. Lots
of entertaining examples worked through in the text, and exercises with solutions.
Encyclopaedic array of exercises. Terrific text. Particularly good on conditional
expectations and applying the normal approximation. Also great section on the
bivariate normal distribution. Very intuitive treatment of the Poisson process.
• Henk Tijms – Understanding probability. Full of anecdotal details – the first
half is designed to be read as fun and to act as motivation for the mathematical
details in the second half.
• Grinstead and Snell — Introduction to Probability. This is available in its
entirety on-line. Long and thorough, but gentle.
• P. Billingsley — Convergence of Probability Measures. Above the level of
this course, but this is the text to go to for those who are interested in really
understanding how convergence of probability distributions really works.
• Kemeny and Snell — Finite Markov Chains. Another classic (1960), though
the only thing that’s outdated is the typeface. (Well, the applications are meager.)
• James Norris — Markov Chains. Officially recommended text of the course. Not
bad. The section on discrete Markov chains is not all that long, and perhaps less
accessible than it could be.
Contents
2 Joint distributions 27
2.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Joint densities . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Example: Uniform distribution on a disk . . . . . . . 32
2.4.2 Independent standard normal random variables . . . . 33
2.4.3 Multivariate normal distribution . . . . . . . . . . . . 34
2.4.4 The convolution formula . . . . . . . . . . . . . . . . . 38
2.5 Mixing discrete and continuous random variables . . . . . . . 39
2.5.1 Independence . . . . . . . . . . . . . . . . . . . . . . . 40
4 Conditioning 60
4.1 Conditioning on non-null events . . . . . . . . . . . . . . . . . 60
4.2 Conditional distributions: The non-null case . . . . . . . . . . 63
4.3 Conditioning on a random variable: The discrete case . . . . 65
4.4 Conditional distributions: The continuous case . . . . . . . . 69
4.5 The Borel paradox . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Bivariate normal distribution: The regression line . . . . . . . 73
4.7 Linear regression on normal random variables . . . . . . . . . 77
4.8 Conditional variance . . . . . . . . . . . . . . . . . . . . . . . 78
4.9 Mixtures of distributions . . . . . . . . . . . . . . . . . . . . . 80
4.9.1 X continuous, Y discrete . . . . . . . . . . . . . . . . 81
4.9.2 X discrete, Y continuous . . . . . . . . . . . . . . . . 82
4.9.3 X and Y continuous . . . . . . . . . . . . . . . . . . . 86
A Assignments I
A.1 Part A Probability Problem sheet 1:
Random Variables, Expectations, Conditioning . . . . . . . . II
A.2 Part A Probability Problem sheet 2:
Generating functions and convergence . . . . . . . . . . . . . IV
A.3 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . VI
A.4 Asymptotics of Markov chains and Poisson process . . . . . . VIII
Lecture 1
(i) P(Ω) = 1;
It makes sense, but it's not obvious from our axioms. So we replace the
additivity axiom (ii) by countable additivity (ii′):
What sort of collection of sets is F? It should include all the events we’re
interested in assigning probabilities to. If Ω is a finite set, or even a countable
set, then F could simply be all of the subsets of Ω. If Ω is uncountable —
for instance R — then this won’t work. Fortunately, it turns out that we
can define an F that includes all the events we’re likely to want to assign
probabilities to; namely,
• All open and closed sets are in F. (So, in particular, if we’re working
in R we can assign probabilities to events like [a, b] and (a, b] and any
finite collection of points.)
1.3 Examples
1.3.1 Finite state space
If Ω is a finite set, all probabilities can be defined by a collection of nonnegative numbers p_x for x ∈ Ω, where ∑_{x∈Ω} p_x = 1. We can then define

P(A) = ∑_{x∈A} p_x.
(cdf2) F is right-continuous;
In other words, if event i implies event i + 1, then P(at least one occurs)
is the limit of the probability that the last (largest) one occurs. For any
finite collection this is trivial, since ⋃_{i=1}^n A_i = A_n; all we need to show is
that this carries over to an infinite union. Similarly for intersections. (Note
that this result was given as an exercise in Mods probability.)
(ii) Let B_i := A_i \ A_{i+1} and B_0 := A_1ᶜ. Then the B_i are disjoint, ⋃_{i=0}^n B_i = A_nᶜ, and ⋃_{i=0}^∞ B_i = (⋂_{i=1}^∞ A_i)ᶜ. So

1 − P(⋂_{i=1}^∞ A_i) = P(⋃_{i=0}^∞ B_i)
                     = ∑_{i=0}^∞ P(B_i)
                     = lim_{n→∞} ∑_{i=0}^n P(B_i)
                     = lim_{n→∞} P(⋃_{i=0}^n B_i)
                     = lim_{n→∞} (1 − P(A_n)).
We also write this limit from the left of F at x as F(x−). Thus, the difference
between F(x) and the limit from the left F(x−) is P({x}), the atom of
probability at the single point x.
The fact that any function satisfying these properties is the cdf of a
probability measure on R is not surprising, although the fact that it really
holds for any such function is a theorem of measure theory. As long as
we restrict ourselves to functions that are piecewise differentiable there’s no
problem. Observe that an increasing function can only have countably many
discontinuities. Piecewise differentiable is a somewhat stronger assumption
— in particular, the jumps form a discrete set.
(i) f is nonnegative;
(ii) ∫_{−∞}^{∞} f(z) dz = 1.
For many purposes the density is a more intuitive object than the dis-
tribution function. The density gives you a picture of the real line, showing
the hot spots for the distribution — where the density is high — and the
cold spots, where the density is low or 0. If you were to take a large num-
ber of independent samples from the distribution and stack them up on the
X-axis at the point corresponding to their value — in other words, draw a
histogram of the samples — it would look like the density.
It’s usually a good idea to think intuitively that there is a probability
f (x)dx of finding ω in a tiny interval (x, x + dx].
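This intuition is easy to check by simulation. A minimal sketch in Python (not part of the original notes; it assumes numpy and matplotlib are available), comparing a histogram of standard normal samples with the density φ:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)      # many independent N(0,1) draws

# Histogram of the samples, normalised to integrate to 1
plt.hist(samples, bins=100, density=True, alpha=0.5, label="histogram")

# Overlay the standard normal density phi(z) = (2*pi)^(-1/2) exp(-z^2/2)
z = np.linspace(-4, 4, 400)
plt.plot(z, np.exp(-z**2 / 2) / np.sqrt(2 * np.pi), label="density")
plt.legend()
plt.show()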
[Figure 1.1: cdfs — (a) cdf for a fair die; (b) cdf for the traffic light distribution; (c) cdf for the uniform distribution; (d) cdf for the standard normal distribution.]
Normal
The normal distribution (also called the Gaussian distribution) is a 2-parameter
family. The parameters are commonly called µ (mean) and σ² (variance),
and written N(µ, σ²). The density is

f(z) = (1/(√(2π) σ)) e^{−(z−µ)²/2σ²}.   (1.3)

The N(0, 1) distribution is also called standard normal. Its density is typically written φ(z) = (2π)^{−1/2} e^{−z²/2}.
Exponential
A one-parameter family of distributions on (0, ∞) with density

f(z) = λ e^{−λz} for z > 0.   (1.4)
Gamma
This is a generalisation of the exponential distribution. It is a two-parameter
family with parameters termed rate and shape. Denoting these by λ and r,
the density is
γ_{r,λ}(z) = (λ^r/Γ(r)) z^{r−1} e^{−λz} for z > 0.   (1.5)

Here Γ(r) is the function defined by Γ(r) = ∫_0^∞ z^{r−1} e^{−z} dz. It is the analytic
continuation of the factorial function, in that Γ(r) = (r − 1)! when r is an
integer.
The exponential distribution is the same as the Gamma distribution
with r = 1, and the sum of r independent exponential random variables
with exponential(λ) distribution is Gamma(r, λ) distributed (when r is an
integer).
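This relationship is easy to check by simulation. A minimal Python sketch (not part of the original notes; the choices r = 3 and λ = 2 are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
r, lam = 3, 2.0
n = 200_000

# Sum of r independent exponential(lam) random variables
sums = rng.exponential(scale=1/lam, size=(n, r)).sum(axis=1)

# Direct Gamma(r, lam) samples (numpy parametrises by scale = 1/lam)
gammas = rng.gamma(shape=r, scale=1/lam, size=n)

print(sums.mean(), gammas.mean())   # both close to r/lam = 1.5
print(sums.var(), gammas.var())     # both close to r/lam**2 = 0.75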
Log-normal
This is the distribution of a positive random variable whose logarithm is
normal. It is commonly used to model generic data of the sort that we
might typically model with a normal distribution, but which is nonnegative. The density is

f(z) = (1/(z √(2π) σ)) e^{−(ln z − µ)²/2σ²} for z > 0.   (1.6)
Pareto
This is a fat-tailed distribution, frequently applied in economics to model
incomes, sizes of cities, insurance claims, and many other phenomena. It
has two parameters: A minimum value C > 0, and an exponent α > 0
describing the rate of decline of the tails. It is “fat-tailed” because the rate
of decline of the density is a power of x rather than being exponential, or
like the Gaussian or Poisson much faster than exponential. The density is
0 for z < C and

f(z) = αC^α/z^{α+1} for z ≥ C.   (1.7)
Note that n^{−1} ∑_{i=0}^{n−1} h(i/n) is a Riemann sum approximation to
∫_0^t h(s) ds. Letting n → ∞, then, leads to

1 − F(t) = P{T > t} = exp(−∫_0^t h(s) ds).

Consequently

f(t) = F′(t) = h(t) exp(−∫_0^t h(s) ds).
Fact 1.3. Let F satisfy the assumptions (cdf1)–(cdf4), and assume that
there is a bi-infinite increasing sequence x_i (i ∈ ℤ), with x_i < x_{i+1}, such
that F is differentiable on each interval (x_i, x_{i+1}), and has a jump of size j_i
at x_i (where j_i may be 0). Let J = ∑_{i=−∞}^∞ j_i, and let f : ℝ → ℝ₊ be defined
to be F′(x) (if F is differentiable at x), or 0 otherwise. Let P_C be defined
to be the probability with density f/(1 − J), and P_D the probability defined
on the discrete set of points {x_i}, with P_D({x_i}) = j_i/J. Then F is the cdf
of the mixed distribution P = J P_D + (1 − J) P_C.

That is, it is the cdf of a random variable X that may be generated by
the following two-step process: First, with probability J we decide that X
is discrete, and with probability (1 − J) it is continuous. If it is discrete,
then we sample X from the discrete distribution P_D; otherwise, we sample
X from the continuous distribution P_C.
A traffic light cycles once per minute. That is, the time from the
beginning of one green to the beginning of the next is exactly
60 seconds. It spends 30 seconds of each cycle being red in
the North-South direction, and 30 seconds red in the East-West
direction. When it is red in one direction, it is green or yellow
in the other. Cars arriving at a green or yellow light go right
through, while those arriving at a red wait for the light to change
to green. Model the random time that a car waits, assuming it
arrives at a time uniformly distributed in the cycle.
Solution: Half the time the car arrives at a green light, and so
has 0 waiting time. If it arrives at a red light, the waiting time is
uniform on (0, 1/2) minute. The cdf is sketched in Figure 1.1(b).
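A short simulation of this model (a sketch; the minute is the unit of time, and the red phase is placed, arbitrarily, in the second half of the cycle):

import numpy as np

rng = np.random.default_rng(2)
arrival = rng.uniform(0, 1, size=100_000)   # uniform arrival time in the cycle

# Green/yellow on [0, 1/2); a car arriving at time t in the red
# phase [1/2, 1) waits 1 - t minutes for the next green.
wait = np.where(arrival < 0.5, 0.0, 1.0 - arrival)

print((wait == 0).mean())       # about 1/2: the atom at 0
print((wait <= 0.25).mean())    # about 3/4, since F(w) = 1/2 + w on [0, 1/2]
print(wait.max())               # never more than 1/2 minute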
1.6.2 Projection
Let Ω = ℝ^k. Points ω of Ω have the form (ω₁, . . . , ω_k), where the ω_i are in ℝ.
We can naturally define the random variables Xi (ω1 , . . . , ωk ) = ωi .
1.6.3 Indicators
Let A be any event. The random variable 1_A is defined by

1_A(ω) = 1 if ω ∈ A; 0 if ω ∉ A.
Random variables are mathematical models for objects in the real world. (The word comes from
statistics, where a variable is a quantity observed or measured in a study,
such as height in a health survey, or 1 or 0 to denote whether this subject
was in the treatment or control group. A random variable is a probabilistic
model for a statistical variable.)
If we then let Y = (X − 180)² — the square of the error — and talk
only about Y, then we are using transformations and the methods of section 1.8
as a modelling tool for arriving at the χ2 distribution (see Example 1.7).
Things get more complicated when we have multiple random variables
— we then have a joint distribution, as we will discuss in Lecture 2. Fun-
damental is to recognise that random variables may have the same distri-
bution without being the same random variable. As a very simple example
in coin-flipping (section 1.6.1) the random variables Xi all have the same
distribution Pi ({−1}) = Pi ({+1}) = 1/2, but they are all different random
variables.
Another example: Let X and Y be independent standard normal random
variables, and let Zθ = X sin θ + Y cos θ for any θ ∈ R. We show in section
(2.4.2) that Zθ also has the standard normal distribution for any θ. On
the other hand, Z_θ and Z_{θ′} are different random variables — they need not
have the same value — unless θ − θ′ is a multiple of 2π.
A word that is sometimes used for the distribution of a random variable,
thought of as an object in its own right, is law: The law of Zθ is standard
normal, or we say that Z_θ and Z_{θ′} “have the same law”. We also sometimes
write L(Z) for the law of the random variable Z. Thus, we might write
L(Z_θ) = L(Z_{θ′}).
F_Y(y) = P{g(X) ≤ y} = P{X ≤ g^{−1}(y)} = F_X(g^{−1}(y)).   (1.8)

If y ∉ 𝒴, where 𝒴 := g(ℝ) is the range of g, then P{Y ≤ y} = P{Y ≤ y′}, where y′ = sup(𝒴 ∩ (−∞, y)). Thus,
we fill in the gaps in 𝒴 by extending F_Y as a constant. To put it slightly
differently, we may replace equation (1.8) by

F_Y(y) = sup_{x: g(x)≤y} F_X(x).   (1.9)

When g has jumps, we extend to y ∉ 𝒴 in the same way, as illustrated below.

[Figure: the cdf of Y = g(X) for monotone g. For increasing g, P(Y ≤ y) = P(X ≤ x) = F_X(x) with x = g^{−1}(y); for decreasing g, P(Y ≤ y) = P(X ≥ x) = 1 − F_X(x) + P(X = x).]
Let

g(x) := x if x ≤ 0;  x + 1 if x > 0.

Let X be uniformly distributed on (−1, 1). Find the cdf of Y = g(X) and Z = −g(X).

Solution: The range of g is g(ℝ) = (−∞, 0] ∪ (1, ∞). On this set,

g^{−1}(y) = y if y ≤ 0;  y − 1 if y > 1.

Then

F_Y(y) = F_X(g^{−1}(y)) =
  0 if y < −1,
  (y + 1)/2 if −1 < y ≤ 0,
  1/2 if 0 < y ≤ 1,
  y/2 if 1 < y ≤ 2,
  1 if 2 < y.
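A quick Monte Carlo check of this cdf (a sketch; the printed frequencies should be close to F_Y(−0.5) = 0.25, F_Y(0.5) = 0.5, F_Y(1.5) = 0.75 and F_Y(2.5) = 1):

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100_000)
y = np.where(x <= 0, x, x + 1)          # Y = g(X) for the piecewise g above

for t in [-0.5, 0.5, 1.5, 2.5]:
    print(t, (y <= t).mean())           # empirical estimate of F_Y(t)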
If g is increasing,

f_Y(y) = (d/dy) F_Y(y)
       = (d/dy) F_X(g^{−1}(y))
       = F_X′(g^{−1}(y)) · (d/dy) g^{−1}(y)   by the Chain Rule
       = f_X(g^{−1}(y)) / g′(g^{−1}(y)).

If g is decreasing,

f_Y(y) = (d/dy) F_Y(y)
       = (d/dy) (1 − F_X(g^{−1}(y)))
       = −F_X′(g^{−1}(y)) · (d/dy) g^{−1}(y)   by the Chain Rule
       = f_X(g^{−1}(y)) / (−g′(g^{−1}(y))).
If X is a standard normal random variable then f_X(x) = (2π)^{−1/2} e^{−x²/2},
and the distribution of Y = σX + µ is normal with mean µ and
variance σ². Its density is then

f_Y(y) = (1/(√(2π) σ)) e^{−(y−µ)²/2σ²}.
F_Y(y) = F_X(F(y)) = F(y) for all y such that F(y) ∈ [0, 1],
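This identity is the basis of inverse-transform sampling: if X is uniform on (0, 1), then Y = F^{−1}(X) has cdf F. A minimal sketch for the exponential distribution, where F(y) = 1 − e^{−λy} inverts to F^{−1}(u) = −ln(1 − u)/λ:

import numpy as np

rng = np.random.default_rng(4)
lam = 2.0
u = rng.uniform(size=100_000)
y = -np.log(1 - u) / lam        # apply the inverse cdf to uniform samples

print(y.mean())                 # close to 1/lam = 0.5
print((y > 1.0).mean())         # close to exp(-2) = 0.135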
1.8.4 Non-invertible g
When g is differentiable (except perhaps at a discrete set of points) then
we can still apply the same idea. As in the discrete case, we now may have
multiple points x which map onto the same point y, so that an infinitesimal
interval around y corresponds to multiple infinitesimal intervals where X
could be and still produce Y near y.
[Figure: a non-invertible g, where y has several inverse images: P(y ≤ Y ≤ y+∆y) = P(x₁ ≤ X ≤ x₁+∆x₁) + P(x₂ ≤ X ≤ x₂+∆x₂) + P(x₃ ≤ X ≤ x₃+∆x₃).]
by independence

= ∑_{j=1}^∞ (b_j/10) · 10^{−(j−1)}
= ∑_{j=1}^∞ b_j/10^j
= x.
By uniqueness of cdf’s, this also means that X has the uniform
distribution on the interval [0, 1].
Note that, although X is defined in terms of an infinite sequence
of random digits, the event {X < x} will always be determined
by a finite sequence of digits — however many are needed to
obtain the first discrepancy between Xj and bj . The number of
digits is finite always, but treating them as part of an infinite se-
quence allows us to escape the need for any fixed (and arbitrary)
bound on the number of digits.
Note too that we may define other distributions on the interval
[0, 1] by taking any non-uniform distribution on the digits. We
will not prove this fact here, but no distribution on [0, 1] defined
by an alternative digit distribution has a density. This is one
of the simplest ways of defining a distribution that is neither
discrete nor continuous. The cdf of such a distribution is thus
a function that is monotone, with F (0) = 0 and F (1) = 1, but
whose derivative is 0 at almost every point. Making formal sense
of these statements of course requires measure theory.
Lecture 2
Joint distributions
2.1 Basic concepts
The joint distribution of random variables (X₁, . . . , Xₙ) is determined by the probabilities

P{X₁ ∈ A₁, . . . , Xₙ ∈ Aₙ}.
2.2 Independence
Events E₁, . . . , Eₙ are said to be independent if

P(E₁ ∩ ⋯ ∩ Eₙ) = P(E₁) ⋯ P(Eₙ).   (2.1)
Theorem 2.2. Suppose (X₁, . . . , X_k) are random variables with joint density f_X : U → ℝ₊, where U is an open region in ℝ^k. Suppose g : U → ℝ^k
is a finite-to-one map which is differentiable with nonzero Jacobian determinant. Define random variables (Y₁, . . . , Y_k) = g(X₁, . . . , X_k). Then these
have a joint density f_Y given by

f_Y(y₁, . . . , y_k) = ∑_{(x₁,...,x_k): g(x₁,...,x_k)=(y₁,...,y_k)} f_X(x₁, . . . , x_k) / |D_{(x₁,...,x_k)} g|,   (2.3)

where |D_{(x₁,...,x_k)} g| is the absolute value of the determinant of the Jacobian
matrix of g at (x₁, . . . , x_k).

Note that when k is 2, and we write g(x₁, x₂) = (g₁(x₁, x₂), g₂(x₁, x₂)),
we have

|Dg| = |∂g₁/∂x₁ · ∂g₂/∂x₂ − ∂g₁/∂x₂ · ∂g₂/∂x₁|.
As in the single-variable case, the density may not be defined at all points
of Rk ; it may be infinite, or simply undefined. We use the restriction to a
region U to exclude the problematic points. (Problematic points include
those at which the function g is not differentiable. If we are excluding single
points — or, indeed, lower-dimensional sets — it doesn’t affect the integral.)
A standard example is the map onto polar coordinates, described in section
2.4.1.
|dy₁ ⋯ dy_k| = |(dy₁ ⋯ dy_k)/(dx₁ ⋯ dx_k)| · |dx₁ ⋯ dx_k| = |D_x g| · |dx₁ ⋯ dx_k|.

In other words, we are using the fact that the Jacobian determinant gives
the ratio of volumes of infinitesimal boxes under the transformation. So

f_Y(y₁, . . . , y_k) dy₁ ⋯ dy_k = f_X(x₁, . . . , x_k) (dy₁ ⋯ dy_k)/|D_{(x₁,...,x_k)} g|,
(Here we are taking the definition of arctan z ∈ [−π/2, π/2]. This is just the
standard assignment of angles in [0, 2π) to points in the plane.)
Take g(x, y) := (R(x, y), Θ(x, y)). Note that Θ is undefined for (x, y) =
(0, 0). So our region U is the complement of the origin. On U the function
is 1-1 onto the strip (0, ∞) × [0, 2π). A point (r, θ) in the strip has a single
inverse image, given by (r cos θ, r sin θ). The Jacobian determinant is given by

|∂R/∂x  ∂R/∂y; ∂Θ/∂x  ∂Θ/∂y| = |x/√(x²+y²)  y/√(x²+y²); −y/(x²+y²)  x/(x²+y²)| = (x² + y²)^{−1/2}.
Points (r, θ) with 0 < r < 1 and 0 ≤ θ < 2π have an inverse image with density 1/π.
Then the pair (R, Θ) has density

f_{R,Θ}(r, θ) = 0 if r ≥ 1 or θ ∉ [0, 2π);
f_{R,Θ}(r, θ) = (1/π)/r^{−1} = r/π if 0 < r < 1 and 0 ≤ θ < 2π.
X = σ_X Z₁ + µ_X,
Y = ρ σ_Y Z₁ + √(1 − ρ²) σ_Y Z₂ + µ_Y.   (2.5)

f_ρ(x, y) = (1/(2π σ_X σ_Y √(1−ρ²))) exp{ −(1/2) [ ((x−µ_X)/σ_X)² + ((y − µ_Y − ρ(σ_Y/σ_X)(x−µ_X)) / (σ_Y √(1−ρ²)))² ] }
          = (1/(2π σ_X σ_Y √(1−ρ²))) exp{ −(1/(2(1−ρ²))) [ (x−µ_X)²/σ_X² + (y−µ_Y)²/σ_Y² − 2ρ(x−µ_X)(y−µ_Y)/(σ_X σ_Y) ] }.   (2.6)
Figure 2.1: Computing the probabilities for events in Example 2.3. The
red square is the event {X < 12 , Y < 12 }, and the green square is the event
{X > 0.6, Y > 0.6}. Note that the probability is determined solely by the
intersection of the event with the curve G = {(x, y) : y = x2 }, shown here
in black.
(and 0 elsewhere).
Are we sure that (X, Y) doesn't have a density? Suppose there
were a joint density f_{X,Y}. For any (x, y) ∉ G the density would
have to be 0. Thus the marginal density of X would be
∫_0^1 f_{X,Y}(x, y) dy = 0, since we are integrating a function that is 0 everywhere but
at a single point.
Fact 2.3. If (X₁, . . . , Xₙ) are random variables such that P{(X₁, . . . , Xₙ) ∈ A} = 1 for some (n−1)-dimensional set A, then the joint distribution cannot
be described by a joint density. That is, the random variables are not jointly
continuous.
2.5.1 Independence
If X and Y are jointly continuous then we can tell that they are independent
by looking at the joint density: X and Y are independent if and only if this
is a product of a function of x and a function of y. If X and Y are discrete
then we simply need to check the identity P{X = x, Y = y} = P{X =
x} · P{Y = y}. But what do we do when one is continuous and the other
discrete? What is it that we need to check?
Suppose X is continuous and Y discrete. Returning to first principles, we
see that X and Y are independent when the events {X ∈ A} and {Y ∈ B}
are independent for all appropriate sets A and B. Of course, we don’t need
to check this for all sets, just for enough sets to fully specify the distribution.
In this case, it will suffice to take A to be any interval of the form (−∞, x]
and B only singletons. (There are certainly other choices possible; this
choice is often a convenient one.) That is, X and Y are independent if and
only if for any real x and y,
P{X ≤ x, Y = y} = F_X(x) · P{Y = y},
Lecture 3
3.1 Expectation
The expectation of a real-valued random variable is the average value,
where all possible values are weighted by their probabilities. Writing X₊ := max(X, 0) and X₋ := max(−X, 0), we define E[X] := E[X₊] − E[X₋].
This is defined only if both expectations are finite. In that case, we say X
is integrable; otherwise, we say it is nonintegrable.
Theorem 3.2. If the random variable X has density f, then X is integrable
if and only if ∫_{−∞}^{∞} |x| f(x) dx < ∞, and

E[X] = ∫_{−∞}^{∞} x f(x) dx.   (3.3)

If the random variable X takes on a countable set of values x_i with probabilities p_i, then X is integrable if and only if ∑_{i=1}^∞ |x_i| p_i < ∞. In that
case,

E[X] = ∑_{i=1}^∞ p_i x_i.   (3.4)
For random variables with positive and negative values, we apply this result
to the positive and negative parts separately.
Assume first that the set of values taken on by X is nonnegative and increasing: 0 ≤ x₁ < x₂ < ⋯ . Then
∑_{i=1}^∞ p_i x_i = ∑_{i=1}^∞ p_i ∑_{j=1}^{i} (x_j − x_{j−1})   where we take x₀ = 0
= ∑_{j=1}^∞ ∑_{i=j}^∞ p_i (x_j − x_{j−1})   (reversing the order of summation)
= ∑_{j=1}^∞ (1 − F(x_{j−1}))(x_j − x_{j−1})
= ∑_{j=1}^∞ ∫_{x_{j−1}}^{x_j} (1 − F(x)) dx   since F(x) = F(x_{j−1}) on the interval (x_{j−1}, x_j)
= ∫_0^∞ (1 − F(x)) dx.
(E3) For any constant c and any random variable X, E[cX] = cE[X].
The first three properties are almost trivial. The linearity of expecta-
tions (E4) is absolutely essential; proving it in full generality requires some
measure theory. For discrete random variables it is quite simple:
E[X + Y] = ∑_z P{X + Y = z} z
= ∑_z ∑_x P{X = x, Y = z − x}(x + (z − x))
= ∑_y ∑_x P{X = x, Y = y}(x + y)
= ∑_x ∑_y P{X = x, Y = y} x + ∑_y ∑_x P{X = x, Y = y} y
= ∑_x P{X = x} x + ∑_y P{Y = y} y
= E[X] + E[Y].
So

E[Y] = ∫_0^1 y · y dy + ∫_1^2 y · (2 − y) dy = [y³/3]₀¹ + [y² − y³/3]₁² = 1/3 + (4 − 8/3) − (1 − 1/3) = 1.

The easy way: E[X₁] = E[X₂] = ∫_0^1 x dx = 1/2. So E[Y] = E[X₁] + E[X₂] = 1.
Var(X) := E[(X − E[X])²].

0 ≤ E[X²] − E[XY]²/E[Y²],   (3.10)

Similarly, XY = 1_A 1_B = 1 on A ∩ B, and 0 elsewhere. Thus

Cor(X, Y) = Cov(X, Y)/(SD(X) SD(Y)) = Cov(X, Y)/√(Var(X) Var(Y)).   (3.12)
The correlation is always between −1 and +1, and is equal to ±1 if and only
if X = cY + b with probability 1, for constants b, c. (The correlation is then
the sign of c.)
• C is nonsingular;
• The random variables Xi are linearly independent; that is, there are no
constants a1 , . . . , an such that a1 X1 + · · · + an Xn = 0 with probability
1.
• D is nonsingular;
control group would have had such different mortality rates if the treatment
really had no effect?” “How likely is it that this stock portfolio will lose
more than £1000 this year?” “How likely is it that floods will go above
the current levees some time in the next 100 years?” Such probabilities of
extreme events in the distribution are called tail probabilities, and bounds
of the sort P{Z > z} ≤ ε(z) are called tail bounds.
The basis of most tail bounds is an almost-trivial fact called Markov’s
inequality.
Moment 2 4 6 8 10 12 14 16 18 20
Tail bound 0.125 0.047 0.030 0.027 0.032 0.046 0.079 0.159 0.364 0.943
So it’s not a very good bound. We’ll see better ones in section
5.2. On the other hand, it’s good to have any bound at all. If the
distribution had been made just a bit more complicated — for
instance, Xi uniformly distributed on [−1, 1] — we could still
have derived these bounds, but the exact computation would
have been nearly impossible.
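The recipe behind a table like this — Markov's inequality applied to an even power, P{|S| ≥ t} ≤ E[S^k]/t^k — is mechanical enough to carry out numerically. A sketch (the sum of 100 random signs and the threshold t = 30 are illustrative choices, not the exact setup of the example above):

import numpy as np

rng = np.random.default_rng(7)
n, t = 100, 30.0
s = rng.choice([-1, 1], size=(100_000, n)).sum(axis=1)   # sums of n random signs

for k in [2, 4, 6, 8, 10]:
    bound = (np.abs(s) ** k).mean() / t**k   # Monte Carlo estimate of E[S^k]/t^k
    print(k, bound)
print((np.abs(s) >= t).mean())   # the true tail probability is far smaller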
E[ξ_k ξ_j] = P(A_k ∩ A_j) = 1/(jk),

so that

r ≥ √1274.8 + 13.393 = 49.1.

Thus, the probability of {R_n ≤ 50} is at least 0.99. (We will
derive a better approximation in section ??.)
P(X = 0) ≤ Var(R_k)/E[R_k]² = 0.01.
Lecture 4
Conditioning
P(A ∩ B) = P(A)P(B|A).
Note that this means in particular that the probability of an event is always
some kind of average of the probabilities conditioned on all the possible
events in a partition.
Bayes’ rule, though apparently trivial, solves an important problem:
Think of A as a piece of scientific evidence, and B as a fact about the
world. Then Bayes' rule tells us how to update our belief of the probability
that B is true, based on the new evidence. The principle is that the prior
probability P (B) gets multiplied by the Bayes factor P(A|B)/P(A), which
tells us how relatively likely the evidence A is if B is true, compared to its
overall likelihood. If B makes A much more likely than it would have been
otherwise, then A is strong evidence for B. We also use the law of total
probability to rewrite the Bayes factor:
P(B|A) = [P(A|B) / (P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ))] · P(B).   (4.6)
offers to bet you even money that the other side is red. (That
is, he wins if it’s red, you win if it’s black.) Is this a fair bet?
Solution: Let A be the event that the side you see is red; B the
event that this is the card with two red sides. Then P(B) = 31 ;
P(A) = 12 ; and if we have the card with two red sides the side
you see will certainly be red, so P(A|B) = 1. By Bayes’ rule
P(B|A) = (1/3) · 1/(1/2) = 2/3.
So it is not a fair bet.
Thus

P(B|A) = 0.001 · 0.99/0.01098 = 0.090.

So the probability of being infected has been substantially raised
— by a factor of nearly 100 — from the baseline rate.
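The arithmetic of this example, in a few lines (a sketch; the prevalence 0.001 and sensitivity 0.99 are the values used above, and the false-positive rate 0.01 is inferred from the total probability 0.01098):

# Bayes' rule for a diagnostic test, in the form (4.6)
p_b = 0.001                # prior probability of infection, P(B)
p_a_given_b = 0.99         # P(A|B): probability of a positive test if infected
p_a_given_not_b = 0.01     # P(A|B^c): false-positive rate

p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)   # law of total probability
posterior = p_a_given_b * p_b / p_a

print(p_a)         # 0.01098
print(posterior)   # about 0.090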
We roll two six-sided dice. Y is the sum of the two rolls. B is the
event Y ≤ 5. C is the event Y is even. Compute the conditional
distribution of Y given each of these events.
Solution: As a reminder, the probabilities for Y are

y          2     3     4     5     6     7     8     9     10    11    12
P{Y=y}     1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

y          2     3     4     5     6  7  8  9  10  11  12
P{Y=y|B}   1/10  2/10  3/10  4/10  0  0  0  0  0   0   0

P(C) = 1/2, so

y          2     3  4     5  6     7  8     9  10    11  12
P{Y=y|C}   1/18  0  3/18  0  5/18  0  5/18  0  3/18  0   1/18
We want to look at

P(Z > z | X ≤ Y) = P(X ≤ Y)^{−1} P(Z > z, X ≤ Y)
= ((µ+λ)/µ) P(z < X ≤ Y)
= ((µ+λ)/µ) ∫_z^∞ ∫_x^∞ µλ e^{−µx−λy} dy dx
= (µ+λ) ∫_z^∞ e^{−(µ+λ)x} dx
= e^{−(µ+λ)z}.
Let Y be the sum of the spots on two fair six-sided die rolls, and
X the number of spots on the first roll. Compute E[Y|X].

Solution: (This is the long way. We'll do this again in example
4.5.) We make a table of the joint distribution P{Y = y, X = x}:

x\y    2     3     4     5     6     7     8     9     10    11    12    Mar.
1      1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0     0     1/6
2      0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0     1/6
3      0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     1/6
4      0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     1/6
5      0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     1/6
6      0     0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  1/6
Mar.   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The last column is the marginal distribution of X, which is just
the sum of the other numbers in the row. The last row is the
marginal distribution of Y, which is just the sum of the other
numbers in the column.
We divide through by the marginal of X to get the conditional
probabilities of Y given X. The conditional expectation is then
the expected value computed from the distribution in each row, P{Y = y | X = x}:

x\y    2    3    4    5    6    7    8    9    10   11   12   E[Y|X=x]
1      1/6  1/6  1/6  1/6  1/6  1/6  0    0    0    0    0    4.5
2      0    1/6  1/6  1/6  1/6  1/6  1/6  0    0    0    0    5.5
3      0    0    1/6  1/6  1/6  1/6  1/6  1/6  0    0    0    6.5
4      0    0    0    1/6  1/6  1/6  1/6  1/6  1/6  0    0    7.5
5      0    0    0    0    1/6  1/6  1/6  1/6  1/6  1/6  0    8.5
6      0    0    0    0    0    1/6  1/6  1/6  1/6  1/6  1/6  9.5
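The same table can be produced by brute-force enumeration (a sketch):

from fractions import Fraction
from itertools import product

# Joint distribution of (X, Y): X is the first roll, Y the sum of both
joint = {}
for a, b in product(range(1, 7), repeat=2):
    joint[(a, a + b)] = joint.get((a, a + b), Fraction(0)) + Fraction(1, 36)

for x in range(1, 7):
    px = sum(p for (a, y), p in joint.items() if a == x)           # marginal of X
    ey = sum(y * p for (a, y), p in joint.items() if a == x) / px  # E[Y | X = x]
    print(x, ey)   # prints x + 7/2 for each x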
By symmetry, E[X|Y] = Y/2.

(CE3) For any constant c and any random variables X and Y, E[cX|Y] = c E[X|Y].

(CE5) If X and Y are independent, then E[X|Y] = E[X]. (In other words,
it's a deterministic constant equal to E[X].)

(CE6) For any random variables X and Y where X may be written as g(Y),
E[X|Y] = X.
Let Y be the sum of the spots on two fair six-sided die rolls, and
X the number of spots on the first roll. Compute E[Y|X].

Solution: Define X₁ = X, X₂ = Y − X (the spots on the second
roll). These are independent. Then

E[Y|X] = E[X₁ + X₂ | X]
       = E[X₁|X] + E[X₂|X]   by rule (CE4)
       = X₁ + E[X₂]   by rules (CE5) and (CE6)
       = X + 3.5.
E[T|L] = L,
E[T²|L] = 2L².

E[T] = E[L] = 3,
E[T²] = E[2L²] = (2/5)(1² + 2² + 3² + 4² + 5²) = 22,
Var(T) = E[T²] − E[T]² = 13.
Our first idea might be: The uniform distribution assigns probabilities in
proportion to area. Consider a tiny strip of width δ around the great
semicircle corresponding to longitude φ = 0. It has very nearly equal areas
all around, approaching uniformity as δ goes to 0. Thus, the conditional
distribution of the random latitude Φ must be uniform on [−90, 90]; in
particular, the probability below −45 is precisely 1/4.
Problem: There is nothing special about longitude Θ = 0; the same
should be true for conditioning on any other value of the longitude. But if the
latitude is uniformly distributed when conditioned on any value of Θ, then
it must be uniformly distributed unconditionally, and P{Φ < −45} = 1/4.
But we can compute the unconditional probability directly: The total area
of the earth is 4πR², where R is the radius. Latitude is between −90° (at the
south pole) and +90° (at the north pole). For φ ∈ (−90, 90) the area of the
region with latitude ≤ φ is a spherical cap with area 2πR²(1 + sin(πφ/180)).
If we let Φ be the random latitude, since uniform probability is proportional
to area, we have the cdf and density

F_Φ(φ) = (1/2)(1 + sin(πφ/180)),
f_Φ(φ) = F_Φ′(φ) = (π/360) cos(πφ/180).

In particular, the probability we are looking for is P{Φ < −45} = (2 − √2)/4 ≈ 0.146.
What this paradox points up is that the intuitive notion “the latitude
of a random point on the earth, conditioned on being on the circle with
longitude θ” doesn’t actually make sense. You can’t say you’ll look at the
distribution of φ in the subset of (θ, φ) corresponding to θ = 0 (or whatever).
The whole set has probability 0, so we can’t talk about the proportion of
probability in that set where φ has a certain value. Instead, taking the ratio
of joint density to marginal density means that we are taking a limit as δ ↓ 0
of
P{Φ < −45 and 0 ≤ Θ < δ} / P{0 ≤ Θ < δ}.
And whereas limit sets like {(φ, θ) : θ = 0} are circles, which seem to imply
a uniform distribution of Φ, all of the sets on the way to the limit of the
form {(φ, θ) : θ ∈ [0, δ)} are wedges, which are fatter around the equator
than near the poles. The first approach is perfectly legitimate, and since
longitude and latitude are indeed independent we can infer that the joint
density of (Φ, Θ) is
f_{Φ,Θ}(φ, θ) = (π/(360 · 360)) cos(πφ/180).
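The wedge effect is easy to see by simulation (a sketch: points uniform on the sphere, with latitude examined only on a thin band of longitudes — the conditional distribution matches the cos-weighted density, not the uniform one):

import numpy as np

rng = np.random.default_rng(8)
m = 2_000_000

# Uniform points on the sphere: longitude uniform on [0, 360),
# latitude (in degrees) with the cosine-weighted density above
theta = rng.uniform(0, 360, size=m)
phi = np.degrees(np.arcsin(rng.uniform(-1, 1, size=m)))

wedge = phi[theta < 5]        # condition on a thin wedge of longitudes
print((wedge < -45).mean())   # about (2 - sqrt(2))/4 = 0.146, not 1/4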
f_X(x) = (1/π) ∫_{−∞}^{∞} 1{x²+y²<1} dy = (2/π)√(1 − x²).

In other words, conditioned on X = x, Y is uniform on the
interval (−√(1−x²), +√(1−x²)).
Suppose instead we consider the random variables Θ and Z,
where Θ is the angle in [0, π) made by the line through the point
and the origin, to the x axis. (So that Θ or Θ + π is the usual angle component in polar coordinates.) And Z is √(X² + Y²) sgn Y.
(So these are polar coordinates with the radius allowed to be positive or negative.) Compute the conditional density of Z given
Θ = θ.
f_R(r | Θ = θ) = 2r.
This does not depend on θ, reflecting the fact that R and Θ are
independent.
Notice how this resolves the Borel paradox. The problem there
was that we assumed it meant something to say “pick a uniform
point on the disk, conditioned on being on the centre line”. In
fact, conditioning for continuous random variables depends on a
limit, and it depends on how we take the limit: In other words,
it only makes sense in terms of a complete set of coordinates.
If I think of my random point on the middle line as being Y
conditioned on X = 0, then I am looking at the distribution
of Y conditioned on points in a narrow strip, whose width goes
to 0. This converges to the uniform distribution. On the other
hand, when I think of it as being Z conditioned on Θ = π/2, I
am looking at the distribution of Z conditioned on points in a
wedge, which has a nonuniform distribution for all wedges, and
converges to a nonuniform distribution.
The law of total expectation still holds:

P{R < 1/2} = E[P{R < 1/2 | Θ}] = (1/π) ∫_0^π P{R < 1/2 | Θ = θ} dθ = 1/4.
(i) Given that a husband has height 190cm, what is the expected height
of his wife?
(ii) Given that a husband has height 190cm, what is the probability that
his wife is below average height?
(iii) Suppose we know for a randomly chosen couple that the average height
of the husband and wife is 180cm. What is the expected height of the
wife?
(iv) What is the probability that a randomly chosen man is taller than a
randomly chosen woman?
(v) What is the probability that a randomly chosen husband is taller than
his wife?
Let (X, Y) be the heights of a randomly chosen husband and wife. We
have (in cm) E[X] = 180, E[Y] = 170, Var(X) = 36, Var(Y) = 25, and Cov(X, Y) = 15.
(i) We want to begin by computing E[Y |X]. The most direct way of doing
this is to split up Y into a piece that is a multiple of X and a piece that is
independent of X. We write Y = αX + (Y − αX), and try to solve for α
to make Z := Y − αX independent of X. We know from section 2.4.3 that
this will be true if Cov(X, Y − αX) = 0. By linearity,
0 = Cov(X, Y) − α Cov(X, X) = 15 − 36α  ⟹  α = 5/12.

Thus E[Z] = E[Y] − (5/12) E[X] = 95, and

E[Y|X] = E[(5/12)X + Z | X] = (5/12) E[X|X] + E[Z|X] = (5/12) X + 95.
That is, the conditional distribution of Y given that X = 190 is N(174 1/6, 18.75).
We are trying to compute P{Y < 170}. We now standardise Y: we have
E[Y] ≈ 174.167 and SD(Y) ≈ 4.33, so Z := (Y − 174.167)/4.33 is standard
normal. We have

P{Y < 170 | X = 190} = P{(Y − 174.167)/4.33 < −4.167/4.33} = P{Z < −0.962} = 0.168.
(You could use either a normal-distribution table, the Matlab command normcdf
— cf. http://www.mathworks.co.uk/help/toolbox/stats/normcdf.html
— or this applet http://www-stat.stanford.edu/~naras/jsm/FindProbability.html.)
More directly, we could represent (X, Y) in the form

X = 6Z₁ + 180,
Y = 2.5Z₁ + √18.75 Z₂ + 170,

as in (2.5). If X = 190 then Z₁ = 10/6, and Y = 174.167 + √18.75 Z₂, where
Z₂ has conditional distribution equal to its unconditional distribution (since
Z₂ is independent of X), which gets us immediately to the conditional distribution Y ~ N(174 1/6, 18.75) (conditioned on X = 190).
We follow now the same procedure as in part (i). The general version of this
procedure is stated below, in the formula (4.14), yielding

E[Y|W] = ρ_{Y,W} σ_Y (W − µ_W)/σ_W + µ_Y
       = (8/√91) · 5 · 5/√(91/4) + 170
       = 400/91 + 170
       ≈ 174.4 cm.
(v) This is exactly the same as (iv), except that instead of independent
X, Y′ we take X, Y to be the heights of a randomly chosen husband-wife pair.
Since they are jointly normal, W = X − Y is still normal. Its expectation
is 10cm, but its variance has changed. It is

σ_W² = Var(X) + Var(Y) − 2 Cov(X, Y) = 36 + 25 − 2 · 15 = 31.

Then

P{W > 0} = P{(W − 10)/σ_W > (0 − 10)/σ_W}
         = P{Z > −10/√31}   where Z is standard normal
         = P{Z > −1.80}
         = 0.964.
Proof. Observe that if (4.14) is true for variables X and Y, then it remains
true if we add any constants to X and Y. Thus we may assume wlog that
µ_X = µ_Y = 0. Let β := Cov(X, Y)/σ_X². Let Z := Y − βX. Then

Cov(X, Z) = Cov(X, Y) − Cov(X, βX) = Cov(X, Y) − (Cov(X, Y)/Var(X)) Cov(X, X) = 0.
Then

P(A) = ∑_{i=0}^∞ ∫_{A_i} p_{X=x}(i) f(x) dx = ∑_{i=0}^∞ P{Y = i | X ∈ A_i} P{X ∈ A_i}.   (4.17)
Theorem 4.6 (Bayes' rule for continuous and discrete distributions). Suppose (X, Y) has a joint distribution with X discrete and Y continuous. Let
f_{Y|X=x}(y) be the conditional density of Y given X = x, and let p_{X|Y=y}(x)
be the conditional probability mass function of X given Y = y. Let p(x)
be the unconditional probability mass function of X, and let f_Y(y) be the
unconditional density of Y. Then

p_{X|Y=y}(x) = f_{Y|X=x}(y) p(x) / f_Y(y),
f_{Y|X=x}(y) = p_{X|Y=y}(x) f_Y(y) / p(x).   (4.20)
Now let

c₋ := (1/p) ∫_{−∞}^{z} x · (1/√(2π)) e^{−x²/2} dx
    = −(1/p) ∫_{z²/2}^{∞} (1/√(2π)) e^{−u} du   (substituting u = x²/2)
    = −e^{−z²/2}/(p√(2π)),

c₊ := (1/(1−p)) ∫_{z}^{∞} x · (1/√(2π)) e^{−x²/2} dx
    = e^{−z²/2}/((1−p)√(2π)).

Then E[X₁|Y] = c₋ n_y + c₊ (1 − n_y).
(PGF4) The k-th derivative g_X^{(k)}(z) is increasing as a function of z. Its limit as z ↑ 1 is
finite if and only if X has a finite k-th moment, in which case

lim_{z↑1} g_X^{(k)}(z) = E[X(X − 1) ⋯ (X − k + 1)].   (5.2)

(PGF5) If X, Y are independent, then g_{X+Y}(z) = g_X(z) g_Y(z) for all z ∈ (0, 1).

(PGF6) Uniqueness: If g_X(z) = g_Y(z) for all z ∈ (0, 1) (in fact, for z in any
interval), then X =_d Y. In other words, distributions are determined
by their generating functions.
Thus, if we can exchange the limit and the derivative (effectively exchanging
two different limits), we get

∑_{k=0}^∞ k p_k z^{k−1} = lim_{K→∞} (d/dz) ∑_{k=0}^{K} p_k z^k = (d/dz) ∑_{k=0}^∞ p_k z^k = g′(z).
⋯ ≥ z₀^K |P{X = K} − P{Y = K}| − z₀^{K+1}/(1 − z₀)
> 0.
Let Y = ∑_{i=1}^{N} X_i, where the X_i are i.i.d. with pgf g_X, and N is
independent of these with pgf g_N. Then

g_Y(z) = E[z^Y]
       = E[E[z^{X₁+⋯+X_N} | N]]
       = E[g_X(z)^N]
       = g_N(g_X(z)).
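A Monte Carlo check of this identity (a sketch; the choices X_i ~ Bernoulli(p), with g_X(z) = 1 − p + pz, and N ~ Poisson(µ), with g_N(z) = e^{µ(z−1)}, are illustrative):

import numpy as np

rng = np.random.default_rng(9)
p, mu, z = 0.3, 2.0, 0.6

n = rng.poisson(mu, size=500_000)
y = rng.binomial(n, p)          # Y = sum of N independent Bernoulli(p) variables

print((z ** y).mean())                        # Monte Carlo estimate of E[z^Y]
print(np.exp(mu * ((1 - p + p * z) - 1)))     # g_N(g_X(z))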
(MGF3) The set of t such that M_X(t) is finite is an interval that includes 0.
M_X(t) is finite for some t > 0 if and only if there exist t₀ > 0 and
C > 0 such that P{X > x} ≤ C e^{−t₀ x}.
(MGF4) Suppose M_X(t) < ∞ for |t| < t₀, where t₀ > 0. Then M_X is infinitely differentiable, and the k-th derivative M_X^{(k)}(t) is increasing as
a function of t. All moments of X are finite, and

M_X^{(k)}(0) = E[X^k].   (5.4)
(MGF5) −lim sup_{x→∞} x^{−1} ln P{|X| > x} = t₀ := sup{|t| : M_X(t) < ∞}. We
have t₀ = ∞ if and only if lim_{x→∞} x^{−1} ln P{|X| > x} = −∞. This is
equivalent to lim_{k→∞} k^{−1} E[|X|^k]^{1/k} = 0.
(MGF6) If X, Y are independent, then MX+Y (t) = MX (t)MY (t) for all t such
that MX (t) and MY (t) are both finite.
(MGF8) Uniqueness: If MX (t) = MY (t) for all |t| < t0 , for some positive t0 ,
then X =d Y . In other words, distributions are determined by their
moment generating functions.
where the second line requires that the convergence of the derivative be uniform enough that we can exchange the derivative and the integral. We won't
go into the details of this, except to say that the fact that ∫_{−∞}^{∞} e^{t′x} f(x) dx is
finite for some t′ > t is more than enough to guarantee this.
By Taylor’s Theorem, taking 0 is the centre for the series expansion, for all
t ≤ ,
1 1
MX (t) = MX (0) + M0X (0)t + M00X (0)t2 + M000 (s)t3 ,
2 6 X
5.2. MOMENT GENERATING FUNCTIONS 95
5.2.3 Examples
Bernoulli and binomial
Gamma distribution
Let X have density f(x) = (λ^r/Γ(r)) x^{r−1} e^{−λx}. Then for t < λ,

M_X(t) = ∫_0^∞ (λ^r/Γ(r)) x^{r−1} e^{(t−λ)x} dx = (λ/(λ − t))^r.
Normal distribution
Let X be a N(µ, σ 2 ) random variable. Since the tails fall off faster than
exponentially, the mgf exists for all t. Then X = σZ + µ, where Z is
standard normal. We have

M_Z(t) = ∫_{−∞}^{∞} e^{tz} · (1/√(2π)) e^{−z²/2} dz
       = (1/√(2π)) e^{t²/2} ∫_{−∞}^{∞} e^{−(z−t)²/2} dz
       = e^{t²/2}.

Thus, by (MGF2), M_X(t) = e^{µt + σ²t²/2}.
In lecture 7 we will see that sums of i.i.d. random variables, properly
normalised, converge to a normal distribution. Is it possible to add a finite
number of non-normal random variables to get a normally distributed ran-
dom variable? Suppose it were possible. Suppose X1 , . . . , Xn are i.i.d. , and
X1 + · · · + Xn has standard normal distribution. Let MX be the mgf of the
X_i. It must be finite for all t, so we have

M_X(t)ⁿ = e^{t²/2}.

But then M_X(t) = e^{t²/2n}. By uniqueness, X also has normal distribution.
5.2.5 Uniqueness
(The remainder of this section is conceptually important, but not exam-
inable.) Random variables X and Y have the same distribution when
E[g(X)] = E[g(Y )] for any bounded function g. Why? If the distribu-
tions are the same, they must give the same expectations. But if they give
the same expectations, that holds in particular for g = 1A for any A ⊂ R
— that is, they assign the same probabilities to all possible events.
But this allows us a significant amount of freedom. We already know
that distributions may be defined by cdfs. That is, equality of distributions
follows if E[g(X)] = E[g(Y )] just for functions of the form g = 1(−∞,x] . It
turns out that there are lots of other classes of functions that could be used
to define distributions. The basic rule is:
|E[g(X)] − E[g(Y)]| ≤ |E[g(X)] − E[g_n(X)]| + |E[g_n(X)] − E[g_n(Y)]| + |E[g_n(Y)] − E[g(Y)]|
= |E[g(X)] − E[g_n(X)]| + |E[g_n(Y)] − E[g(Y)]|
≤ 2 sup_{x∈[−n,n]} |g(x) − g_n(x)| + 2 sup_{|x|>n} (|g_n(x)| + |g(x)|) P{|X| > n}
≤ 2/n + M P{|X| > n}
→ 0 as n → ∞.
Since this is true for all n, it follows that E[g(X)] = E[g(Y )].
So we know E[g(X)] = E[g(Y )] for bounded continuous functions g.
Why does that imply equality of distributions? If distributions are equal
then this equality of expectations should hold for all functions g — or, at
least, all g for which the expectation makes sense, a class of functions that
we are not being very careful about defining here. In particular, we want
equality to hold for indicators of events, so that P{X ∈ A} = P{Y ∈ A} for
events A. Let’s consider this problem for A = [a, b]. Suppose we know that
E[g(X)] = E[g(Y)] for all bounded continuous g. Define

h_n(x) = 0 if x ≤ a;
         n(x − a) if a < x < a + 1/n;
         1 if a + 1/n ≤ x ≤ b − 1/n;
         n(b − x) if b − 1/n < x < b;
         0 if x ≥ b.
Then

1_{[a+1/n, b−1/n]} ≤ h_n(x) ≤ 1_{[a,b]};

but since h_n is continuous, E[h_n(X)] = E[h_n(Y)]. Thus

|P{X ∈ [a, b]} − P{Y ∈ [a, b]}| ≤ P{a < X < a + 1/n or b − 1/n < X < b} + P{a < Y < a + 1/n or b − 1/n < Y < b} → 0 as n → ∞
(CF4) If X, Y are independent, then φ_{X+Y}(t) = φ_X(t) φ_Y(t) for all t.
(CF6) Uniqueness: If φ_X(t) = φ_Y(t) for all t ∈ (a, b), for some a < b, then X =_d Y. In other words, distributions are determined by
their characteristic functions.
Lecture 6
Limit theorems in probability I
This is sometimes written X_i → Y a.s. (as i → ∞).
• Weak convergence: This is a more abstract — and fundamentally
new — concept of convergence of probability distributions, seen as a
sequence of elements in a metric space of all probability distributions.
Given probability spaces (Ωi , Pi ) and random variables Xi : Ωi → R,
and another random variable Y : Ω → R, we say that Xi converges to
Y weakly or in distribution if

lim_{i→∞} P{X_i ∈ A} = P{Y ∈ A}   (6.2)

for a sufficient collection of subsets A ⊂ ℝ. We write this as X_i →_d Y.
We are being deliberately vague about what sets are allowed in the def-
inition of convergence in distribution. The details are in section 6.1.
Note that strong convergence is fundamentally about random variables
— it is crucial how they are realised as functions on a particular probability
space — whereas weak convergence refers only to their distributions. We
can talk about convergence of probability distributions with no reference to
random variables. Of course, they may happen to conveniently be repre-
sented on the same probability space — all the (Ωi , Pi ) may be the same —
in which case it becomes possible to talk about comparing weak and strong
convergence.
Let X_i be the random variable that takes on the value 1/i with
probability 1, and Y the random variable that takes on the value
0 with probability 1. Then it seems reasonable to suppose that
X_i →_d Y. But if we take A to be the set {0}, or the interval
(−∞, 0], we see that P{X_i ∈ A} = 0 for all i, but P{Y ∈ A} = 1.
We clearly run into problems when we try to define convergence of ran-
dom variables in terms of a set whose boundary is a discontinuity point of
the limit random variable. This leads us to the following definition:
Definition 6.1. Let X1 , X2 , . . . and Y be random variables, with cumulative
distribution functions F1 , F2 , . . . and F∞ . We say that the random variables
Xn converge to Y in distribution if limn→∞ Fn (x) = F∞ (x) for all x at
which F∞ is continuous.
It is then obvious that when Y is a continuous random variable, X_n →_d Y
if and only if lim_{n→∞} P{X_n ∈ A} = P{Y ∈ A} for any interval A.
As we already discussed, it turns out that thinking about convergence in terms of convergence of probabilities isn't the best approach —
it leads to these awkward problems of continuity. These can be dealt with,
but a better way of thinking about convergence is to think of the probability
P{X_i ∈ A} as being the expectation of an indicator function: E[f(X_i)] where
f = 1_A.
Once we do that, we see that there are a lot of different classes of sets we
can use that are mathematically more convenient. The important principle
is: X_n converges to Y in distribution if E[f(X_n)] → E[f(Y)] as n → ∞ for enough
functions f. What's enough? The more we can restrict the required class of
functions f . What’s enough? The more we can restrict the required class of
functions, the easier it becomes to prove convergence.
The basic principle is the same as before: a class G of bounded
functions is sufficient if you can approximate all the indicators of intervals
uniformly by functions in G. Of course, if G is sufficient, and G′ is another
class that can approximate all functions of G, then G′ is sufficient. It's also
much better to work with classes of continuous functions, since that avoids
the problem of worrying about where the distributions are continuous.
Think of it this way: Suppose E[F(X_n)] → E[F(Y)] for any F ∈ 𝔽,
and for any bounded continuous G and any real numbers K and ε > 0 there
is some F ∈ 𝔽 such that

sup_{|x|≤K} |F(x) − G(x)| < ε.

If X_n and Y were never bigger than K, then this would clearly imply that E[G(X_n)] → E[G(Y)]. But by choosing K big enough we can make the probabilities of
these problematic events P{|X_n| > K} and P{|Y| > K} as small as we like.
Fact 6.2. Let X1 , X2 , . . . and Y be random variables, with cumulative dis-
tribution functions F1 , F2 , . . . and F∞ . The following are equivalent:
• lim_{n→∞} F_n(x) = F_∞(x) for all x at which F_∞ is continuous.
• lim_{n→∞} P{X_n ∈ (−∞, x]} = P{Y ∈ (−∞, x]} as long as P{Y = x} = 0.
• lim_{n→∞} E[f(X_n)] = E[f(Y)] when f is a function of the form 1_{(−∞,x]} for some real number x with P{Y = x} = 0.
• lim_{n→∞} E[f(X_n)] = E[f(Y)] for any bounded continuous f : ℝ → ℝ.
• lim_{n→∞} E[f(X_n)] = E[f(Y)] where f is a function of the form f(x) = e^{tx} for any t in some neighbourhood of 0.
• lim_{n→∞} E[f(X_n)] = E[f(Y)] where f is a function of the form f(x) = e^{itx} for any real t.
Other possible classes of functions which can define convergence in distri-
bution are the smooth functions, the differentiable functions, the Lipschitz
functions, and so on.
We could also show directly that E[f(X_n)] → E[f(X)] for
any bounded continuous f. Since f is continuous, f(0) = lim_{n→∞} f(1/n)
and f(1) = lim_{n→∞} f(1 − 1/n). Thus

E[f(X_n)] = (p − p/n) f(0) + (1 − p + p/n) f(1) → p f(0) + (1 − p) f(1) = E[f(X)].

Note that in this case convergence holds for all f, not just continuous.
N := min{i : b_i ≠ ξ_i}.

That is, it is the first place in the binary expansion where X and
t disagree. Note that each ξ_i has probability 1/2 of matching b_i,
so N has geometric distribution with parameter 1/2. Of course,
if they disagree first at a place where b_N = 0, then X is bigger;
if b_N = 1 and ξ_N = 0 then X is smaller. And if N > n then
X_n ≤ t anyway. So

P{X_n ≤ t} = P{N > n} + ∑_{1≤i≤n: b_i=1} P{N = i},

so

|P{X_n ≤ t} − t| = |2^{−n} − ∑_{i=n+1}^∞ b_i 2^{−i}| ≤ 2^{−n} → 0 as n → ∞.
The same holds true with characteristic function in place of the moment
generating function.
This may seem surprising — we can check convergence of expectations
for all bounded continuous f just by checking the exponential functions —
but this is quite similar to the fact discussed in Lecture 5 that a distribution
is determined entirely by its moment generating function or characteristic
function.
Equivalently,

∀ε > 0 ∃n s.t. P{|X_i − Y| > ε} < ε for all i > n.   (6.4)

Fix any x ∈ ℝ where the cdf F_X is continuous, and choose ε > 0. Then
whenever X > x + ε and |X_n − X| ≤ ε, it must be that X_n > x. Thus

F_{X_n}(x) = P{X_n ≤ x}
≤ P{X ≤ x + ε or |X_n − X| > ε}
≤ P{X ≤ x + ε} + P{|X_n − X| > ε}
→ F_X(x + ε) as n → ∞.

Similarly,

1 − F_{X_n}(x) = P{X_n > x}
≤ P{X > x − ε or |X_n − X| > ε}
→ 1 − F_X(x − ε) as n → ∞.

Thus

F_X(x − ε) ≤ lim_{n→∞} F_{X_n}(x) ≤ F_X(x + ε).
of this idea are called “laws of large numbers”. There are two basic versions:
the Weak Law of Large Numbers (WLLN — this says the average will be
close to the expectation with high probability) and the Strong Law of Large
Numbers (SLLN — this says that as we take more and more samples, the
means converge to the expectation). Only the weak law is included in the
syllabus of this course.
P{|S_n − µ| > ε} ≤ Var(S_n)/ε² = n^{−1}σ²/ε²,

which goes to 0 as n → ∞, no matter what ε is.
Note that we have actually proved a result that is stronger than the
statement of the theorem in two ways: First, we didn’t really use indepen-
dence, only the fact that the Xi are uncorrelated. Second, we didn’t just
prove the asymptotic result, that Sn converges to µ eventually, we actually
have a bound on the probability of |Sn − µ| being bigger than .
In fact, the WLLN still holds even if the covariances are nonzero, as long
as the covariance between Xi and Xj goes to 0 as i and j get far apart. We
leave this as an exercise.
The WLLN even still holds if Xi has infinite variance, just as long as the
expectation is finite; but in this case we may not have any general bound
on the tail probabilities.
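The weak law is easy to watch in action (a sketch, with X_i uniform on [0, 1], so µ = 1/2 and σ² = 1/12):

import numpy as np

rng = np.random.default_rng(10)
eps = 0.01

for n in [100, 1_000, 10_000]:
    # 1000 independent copies of the sample mean of n observations
    means = rng.uniform(size=(1_000, n)).mean(axis=1)
    print(n, (np.abs(means - 0.5) > eps).mean())   # P{|mean - mu| > eps} shrinks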
which converges to 0 for x < 0 and 1 for x ≥ 0. So X_n →_d X, but
E[X_n] = 1 for each n, so it doesn't converge to 0 = E[X]. In fact, X_n
converges to X in probability, since P{|X_n − X| > ε} = 1/n for ε ∈ (0, 1).
We introduce a stronger version of convergence in probability, which does
imply convergence of expectations and variances.
Definition 6.7. We say that a sequence of random variables Xn converges
in mean square to a random variable X (all defined on the same probability
space) if
lim_{n→∞} E[(X_n − X)²] = 0.   (6.5)
Fact 6.8. If Xn converge to X in mean square then
• Xn converge to X in probability;
• if E[X] < ∞ then limn→∞ E[Xn ] = E[X];
• if E[X 2 ] < ∞ then limn→∞ Var(Xn ) = Var(X).
Proof. The first part is left as an exercise.
We have (by property (E7) of expectations)

|E[X_n] − E[X]| ≤ E[|X_n − X|] ≤ E[(X_n − X)²]^{1/2} → 0 as n → ∞.

(Remember: For any random variable Z we have E[Z²] − E[Z]² = Var(Z) ≥ 0, so E[Z] ≤ E[Z²]^{1/2}.) Similarly,

|E[X_n²] − E[X²]| = |E[(X_n − X)² + 2X(X_n − X)]|
                  ≤ E[(X_n − X)²] + 2E[X²]^{1/2} E[(X_n − X)²]^{1/2}
                  → 0 as n → ∞

by the Cauchy-Schwarz inequality (3.9).
For the strong law, in contrast to the weak law, the existence of finite expectations
is not enough; we need finite variance. Independence also cannot
be replaced by uncorrelatedness.
Proof. Choose any ε > 0. We need to show that lim_{n→∞} P{|X_n − X| > ε} = 0.
We know that for every¹ ω ∈ Ω, X_n(ω) → X(ω),
which means that there is a random N = N(ω) such that |X_n − X| ≤ ε for
all n ≥ N. Thus

P{|X_n − X| > ε} ≤ P{N > n}.

Note that the events {N > n} are decreasing events, so by the Monotone
Event Lemma (Lemma 1.1)

lim_{n→∞} P{|X_n − X| > ε} ≤ lim_{n→∞} P{N > n} = P(∩_{n=1}^{∞} {N > n}) = P{N = ∞} = 0.
¹ Actually, there could be a set of ω with probability 0 on which X_n(ω) doesn't
converge; but we can simply throw out this null set.
Lecture 7
Limit theorems in
probability II: The central
limit theorem
hold:
[Figure 7.1: Normal approximations to Binom(n, p). Shaded region is the implied approximate probability of the Binomial variable < 0 or > n.]
lim_{n→∞} P{np + a√(np(1−p)) ≤ S_n ≤ np + b√(np(1−p))} = ∫_a^b (1/√(2π)) e^{−z²/2} dz.   (7.2)
For example, Figure 7.2 compares the Bin(300, 0.5) distribution with the N(150, 75)
distribution, which has the same mean and variance. The figure shows that the
distributions are very similar.
[Figure 7.2: The Bin(300, 0.5) probabilities P(X = x) (left) and the N(150, 75) density (right), plotted against x.]
In general, if X ∼ Bin(n, p), then

µ = np,  σ² = npq where q = 1 − p,

and X is approximately N(np, npq). In the example here, with n = 12 and p = 1/2,

µ = np = 6,  σ² = npq = 3.
Unfortunately, it’s not quite so simple. We have to take into account the
fact that we are using a continuous distribution to approximate a discrete
distribution. This is done using a continuity correction. The continuity
correction appropriate for this example is illustrated in the figure below
P(3.5 < X < 7.5) = P((3.5 − 6)/√3 < (X − 6)/√3 < (7.5 − 6)/√3)
= P(−1.443 < Z < 0.866) where Z ∼ N(0, 1)
= 0.732.
The exact answer is 0.733 so in this case the approximation is very good.
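In R (a sketch, taking X ∼ Bin(12, 1/2) as in the example above; pnorm and pbinom are the standard normal and binomial cdfs), the two answers can be checked directly:

    approx.prob <- pnorm((7.5 - 6) / sqrt(3)) - pnorm((3.5 - 6) / sqrt(3))
    exact.prob  <- pbinom(7, 12, 0.5) - pbinom(3, 12, 0.5)   # P{4 <= X <= 7} exactly
    c(approx.prob, exact.prob)                               # about 0.732 and 0.733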
[Figure: the continuity correction for P(3.5 < X < 7.5), marking the cutoffs 3.5 and 7.5 on the histogram of the values 0, 1, . . . , 12.]
In general, the threshold −0.5 corresponds to Z = (−0.5 − λ)/√λ. The
corresponding values for other parameters are given in Table 7.1.
[Figure: Normal approximations to the Poisson distribution, for (a) λ = 1, (b) λ = 4, (c) λ = 10, (d) λ = 20.]
Z = (X − 25)/5, so the continuity-corrected threshold 26.5 corresponds to z = (26.5 − 25)/5 = 0.3.
of samples being averaged is small. Unfortunately, the CLT itself may not
apply in such a case.) We have already applied this idea when we did the Z
test for proportions, and the CLT was also hidden in our use of the χ2 test.
Quebec births
We begin with an example that is well suited to fast convergence. We
have a list of 5,113 numbers, giving the number of births recorded each
day in the Canadian province of Quebec over a period of 14 years, from 1
January, 1977 through 31 December, 1990. (The data are available at the
Time Series Data Library http://www.robjhyndman.com/TSDL/, under the
rubric “demography”.) A histogram of the data is shown in Figure 7.4(a).
The mean number of births is µ = 251, and the SD is σ = 41.9.
Suppose we were interested in the average number of daily births, but
couldn’t observe data for all of the days. How many days would we need
to observe to get a reasonable estimate? Obviously, if we observed just a
single day’s data, we would be seeing a random pick from the histogram
7.4(a), which could be far off the true value. (Typically, it would be off
by about the SD, which is 41.9.) Suppose we sample the data from n days,
obtaining counts x1 , . . . , xn , which average to x̄. How far off might this be?
The normal approximation tells us that a 95% confidence interval for µ will
be x̄ ± 1.96 · 41.9/√n. For instance, if n = 10, and we find the mean of
our 10 samples to be 245, then a 95% confidence interval will be (219, 271).
If there had been 100 samples, the confidence interval would be (237, 253).
Put differently, the average of 10 samples will lie within 26 of the true mean
95% of the time, while the average of 100 samples will lie within 8 of the
true mean 95% of the time.
This computation depends upon n being large enough to apply the CLT.
Is it? One way of checking is to perform a simulation: We let a computer
pick 1000 random samples of size n, compute the means, and then look at
the distribution of those 1000 means. The CLT predicts that they should
have a certain normal distribution, so we can compare them and see. If
n = 1, the result will look exactly like Figure 7.4(a), where the curve in red
is the appropriate normal approximation predicted by the CLT. Of course,
there is no reason why the distribution should be normal for n = 1. We see
that for n = 2 the true distribution is still quite far from normal, but by
n = 10 the normal is already starting to fit fairly closely, and by n = 100
the fit has become extremely good.
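The simulation just described is straightforward to reproduce. A sketch (assuming the 5,113 daily counts have been downloaded into a vector births; that name, and the choices of n and of 1000 repetitions, are ours):

    set.seed(3)
    n <- 10
    xbar <- replicate(1000, mean(sample(births, n)))
    hist(xbar, freq = FALSE, main = paste("n =", n))
    curve(dnorm(x, mean = mean(births), sd = sd(births) / sqrt(n)),
          add = TRUE, col = "red")   # the normal density predicted by the CLT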
Suppose we sample 100 days at random. What is the probability that
the total number of births is at least 25,500? Let S = Σ_{i=1}^{100} X_i. Then S
is approximately normally distributed, with mean 100 · 251 = 25,100 and SD
√100 · 41.9 = 419, so P{S ≥ 25,500} ≈ P{Z > (25,500 − 25,100)/419} = P{Z > 0.95}.
[Figure 7.4: Histograms of the means of 1000 random samples of size n from the Quebec births data, for (a) n = 1, (b) n = 2, (c) n = 5, (d) n = 10, and larger n, with the normal density predicted by the CLT.]
This is of course the same as the probability that the average number of
births is at least 255. We could also compute this by reasoning that X̄ =
S/100 is normally distributed with mean 251 and SD 41.9/√100 = 4.19.
Thus,

P{X̄ > 255} = P{(X̄ − 251)/4.19 > (255 − 251)/4.19} = P{Z > 0.95} ≈ 0.171.
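The last step can be done with the R command pnorm mentioned in the footnote below:

    1 - pnorm(0.95)   # 0.1711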
California incomes
A standard example of a highly skewed distribution — hence a poor can-
didate for applying the CLT — is household income. The mean is much
greater than the median, since there are a small number of extremely high
incomes. It is intuitively clear that the average of incomes must be hard to
predict. Suppose you were sampling 10,000 Americans at random — a very
large sample — whose average income is £30,000. If your sample happens
to include Bill Gates, with annual income of, let us say, £3 billion, then his
income will be ten times as large as the total income of the entire remainder
of the sample. Even if everyone else has zero income, the sample mean will
be at least £300,000. The distribution of the mean will not converge, or will
converge only very slowly, if it can be substantially affected by the presence
or absence of a few very high-earning individuals in the sample.
Figure 7.5(a) is a histogram of household incomes, in thousands of US
dollars, in the state of California in 1999, based on the 2000 US census (see
www.census.gov). We have simplified somewhat, since the final category is
“more than $200,000”, which we have treated as being the range $200,000 to
$300,000. (Remember that histograms are on a density scale, with the area
of a box corresponding to the number of individuals in that range. Thus,
the last three boxes all correspond to about 3.5% of the population, despite
their different heights.) The mean income is about µ = $62, 000, while the
median is $48,000. The SD of the incomes is σ = $55, 000.
Figures 7.5(b)–7.5(f) show the effect of averaging 2, 5, 10, 50, and 100
randomly chosen incomes, together with a normal distribution (in green) as
predicted by the CLT, with mean µ and variance σ 2 /n. We see that the
convergence takes a little longer than it did with the more balanced birth
data of Figure 7.4 — averaging just 10 incomes is still quite skewed — but
by the time we have reached the average of 100 incomes the match to the
predicted normal distribution is remarkably good.
² Reminder: To calculate the normal tail probability you could use a table — they are
found in many probability textbooks — or the Matlab command normcdf (see http://www.
mathworks.co.uk/help/toolbox/stats/normcdf.html), or the R command pnorm, or an
online applet: http://www-stat.stanford.edu/~naras/jsm/FindProbability.html.
[Figure 7.5: Histograms of the averages of n randomly chosen California household incomes, for (a) n = 1, (b) n = 2, (c) n = 5, (d) n = 10, and larger n, with the normal density predicted by the CLT.]
We prove the CLT only for the case µ = 0 and σ = 1. Note that for general
random variables X_i, the normalised random variables X̃_i := (X_i − µ)/σ have
expectation 0 and variance 1. The CLT for sums of X_i follows immediately
from the corresponding CLT for sums of X̃_i.
for 0 ≤ i ≪ k. Thus

P{Y_{2k} = k + j} ≈ 2^{−2k} \binom{2k}{k} exp(−Σ_{i=0}^{j−1} (2i+1)/k) = 2^{−2k} \binom{2k}{k} exp(−j²/k).

Using the fact that 2^{−2k} \binom{2k}{k} ∼ (πk)^{−1/2} (from Stirling's formula), a bit of
manipulation then turns

P{0 ≤ Y_{2k} − k ≤ b√(k/2)} ≈ Σ_{0≤j≤b√(k/2)} 2^{−2k} \binom{2k}{k} exp(−j²/k)

into a Riemann sum converging to (1/√(2π)) ∫_0^b e^{−z²/2} dz.
Using Lemma 5.3 we see that there is a constant C such that for n sufficiently
large

|log M_X(t/√n) − ((t/√n)²/2) σ²| ≤ C |t/√n|³.

Thus

log M_{Y_n}(t) = n log M_X(t/√n) = t²/2 + Error,

where |Error| ≤ C|t|³ n^{−1/2}, which goes to 0 as n → ∞. Thus, for each t with
|t| < t₀,

lim_{n→∞} M_{Y_n}(t) = e^{t²/2} = M_Z(t),

where Z is standard normal. Thus the CLT follows from Theorem 6.3.
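A numeric sketch of this convergence (using Rademacher steps X = ±1 with probability 1/2 each, an arbitrary choice with µ = 0 and σ = 1, for which M_X(θ) = cosh θ):

    t <- 1.5
    for (n in c(10, 100, 1000, 10000)) {
      cat(n, n * log(cosh(t / sqrt(n))), "\n")   # approaches t^2/2 = 1.125
    }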
In other words, we interpolate between the sum of the X's and the sum of
the Z's by replacing X's by Z's one at a time. We have

Y_n = Ỹ_{n,n} + X_n/√n,   Y_n* = Ỹ_{0,n} + Z_n/√n.

Note, too, that Ỹ_{i,n} + Z_i/√n = Ỹ_{i−1,n} + X_{i−1}/√n.
Suppose first that τ := max{E[|X_i|³], E[|Z_i|³]} < ∞. We want to show
that the distributions of Y_n and of Y_n* are close together. We do this by
forming a telescoping sum

|E φ(Y_n) − E φ(Y_n*)| = |Σ_{i=1}^n (E φ(Ỹ_{i,n} + X_i/√n) − E φ(Ỹ_{i,n} + Z_i/√n))|
≤ Σ_{i=1}^n |E φ(Ỹ_{i,n} + X_i/√n) − E φ(Ỹ_{i,n} + Z_i/√n)|   (8.1)
by the triangle inequality. We now apply Taylor’s Theorem to the function
φ, expanded around the centre Ỹ_{i,n}, to obtain

E φ(Ỹ_{i,n} + X_i/√n) = E[φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(Ỹ_{i,n}) + (X_i³/6n^{3/2}) φ‴(y)],   (8.2)

where y is between Ỹ_{i,n} and Ỹ_{i,n} + X_i/√n.
Observe now that Ỹ_{i,n} is a function of X's and Z's other than X_i; thus
Ỹ_{i,n} and X_i and Z_i are all independent, and the expectation of any function
of Ỹ_{i,n} multiplied by any function of X_i is the product of the expectations,
yielding

E φ(Ỹ_{i,n} + X_i/√n) = E φ(Ỹ_{i,n}) + n^{−1/2} E[X_i] E[φ′(Ỹ_{i,n})] + (2n)^{−1} E[X_i²] E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[X_i³ φ‴(y)]
= E φ(Ỹ_{i,n}) + (2n)^{−1} E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[X_i³ φ‴(y)].

(We have used E[X_i] = 0 and E[X_i²] = 1.)
An identical expression may be written with Z_i in place of X_i, so

E φ(Ỹ_{i,n} + Z_i/√n) = E φ(Ỹ_{i,n}) + (2n)^{−1} E[φ″(Ỹ_{i,n})] + (1/6) n^{−3/2} E[Z_i³ φ‴(ŷ)],

where ŷ is between Ỹ_{i,n} and Ỹ_{i,n} + Z_i/√n. Combining these expressions
yields

|E φ(Ỹ_{i,n} + X_i/√n) − E φ(Ỹ_{i,n} + Z_i/√n)|
= (1/6) n^{−3/2} |E[X_i³ φ‴(y)] − E[Z_i³ φ‴(ŷ)]|   (8.3)
≤ (1/6) n^{−3/2} (E[|X_i|³] K + E[|Z_i|³] K)
≤ (1/3) n^{−3/2} τ K.
Summing over i = 1, . . . , n in (8.1), we get |E φ(Y_n) − E φ(Y_n*)| ≤ (1/3) n^{−1/2} τ K, which goes to 0 as n → ∞.
Now suppose that the Z_i have standard normal distribution. Then Y_n*
also has standard normal distribution for every n, so

lim_{n→∞} |E φ(Y_n) − E φ(Z)| = 0

where Z ∼ N(0, 1), for any function φ with bounded third derivative. Since
all bounded continuous functions may be approximated by functions with
bounded third derivative, it follows that Y_n converges in distribution to Z.
This completes the proof except for one small defect: We had to assume
that Xi has finite third moment τ . What do we do if E[|Xi |3 ] = ∞? This
requires a trick called “truncation”, that you will certainly encounter often
if you go on to more advanced probability theory, but is not strictly part of
this course.
8.4 Truncation
(Not examinable) Look back at the crucial step (8.2). We've expanded out
the Taylor series to the third order, because the error term (X_i/√n)³ is
going to be small enough to go to 0 in the limit when multiplied by n. This
is not true, however, if X_i is too big. In that case, we would be better off
stopping at the second-order term:
E φ(Ỹ_{i,n} + X_i/√n) = E[φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(y′)].   (8.4)
We have here two different expressions for the same quantity, (8.2) being
good when X_i is small (or not extremely large) and (8.4) being superior
when X_i is extremely large. Pick some large number M to be the cutoff
between when we will use the first expression and when we will use the second.
These expressions are actually equal (with different
choices of y and y′ to make them equal), so if we multiply one of them by
1_{X_i>M} and the other by 1_{X_i≤M} and add these two together, we get
E φ(Ỹ_{i,n} + X_i/√n)
= E[φ(Ỹ_{i,n}) + (X_i/√n) φ′(Ỹ_{i,n}) + (X_i²/2n) φ″(y′) 1_{X_i>M}
  + ((X_i²/2n) φ″(Ỹ_{i,n}) + (X_i³/6n^{3/2}) φ‴(y)) 1_{X_i≤M}]
= E φ(Ỹ_{i,n}) + (2n)^{−1} E[φ″(Ỹ_{i,n})]   (8.5)
  + E[(φ″(y′) − φ″(Ỹ_{i,n})) (X_i²/2n) 1_{X_i>M}]
  + E[(X_i³/6n^{3/2}) φ‴(y) 1_{X_i≤M}].
Then taking the difference with the expression we already had for E φ(Ỹ_{i,n} + Z_i/√n)
(which certainly does have finite moments of all orders if Z_i is Gaussian),
we get

|E φ(Ỹ_{i,n} + X_i/√n) − E φ(Ỹ_{i,n} + Z_i/√n)|
= |(1/6) n^{−3/2} (E[X_i³ 1_{X_i≤M} φ‴(y)] − E[Z_i³ φ‴(ŷ)])
  + (2n)^{−1} E[(φ″(y′) − φ″(Ỹ_{i,n})) X_i² 1_{X_i>M}]|   (8.6)
≤ (K/(6n^{3/2})) (M³ + E[|Z_i|³]) + (K/n) E[X_i² 1_{X_i>M}]
if we take K to be a bound on the second and third derivatives of φ.
Summing as in (8.1) we get

|E φ(Y_n) − E φ(Z)| ≤ (K/(6n^{1/2})) (M³ + E[|Z_i|³]) + K E[X_i² 1_{X_i>M}].
The problem is, this does not obviously go to 0 as n → ∞. Rather, letting
n → ∞ with M fixed kills only the first term, leaving

lim sup_{n→∞} |E φ(Y_n) − E φ(Z)| ≤ K E[X_i² 1_{X_i>M}].

If it is true that

lim_{M→∞} E[X_i² 1_{X_i>M}] = 0,   (8.7)

then for any ε > 0 we may choose M to make the right-hand side smaller than ε.
Since this is true for every ε, it must in fact be true that the limit is 0.
So is (8.7) true? The answer is yes. If we think of the sequence of random
variables WM := Xi2 1{Xi >M } , they clearly converge to 0 as M → ∞: No
matter what X_i is, there is some M bigger. Is it true that if a sequence
of random variables converges to a constant, then the expectations of these
random variables converge to that constant? The answer to this is clearly
no, in general, but yes in many circumstances. There is a whole collection
of theorems in measure theory that tell us when the expectation of a limit
of random variables is the limit of the expectations, but this is beyond the
scope of this course. The relevant theorem in this case is called the Monotone
Convergence Theorem. It says that the limit of the expectations is the
expectation of the limit if the sequence of random variables is monotone
(increasing or decreasing). Since WM clearly is decreasing — a larger M
either leaves W the same or drops it down to 0 — the result follows. (A
straightforward source if you’re interested in reading more about this — and
in general for the basics of measure-theoretic probability — is chapter 4 of
[Ros06]. A more extensive account of these ideas is in chapter 23 of [Bil95].)
Lecture 9
Lecture 10
Lecture 11
Convergence to stationarity
of Markov chains
Lecture 12
Proof.

E[# returns to x] = E[N_x^∞] = lim_{t→∞} E[Σ_{n=1}^t ξ_n] = Σ_{n=1}^{∞} P^n(x, x).
Thus

Σ_{n=1}^{∞} P^n(y, y) ≥ Σ_{i=1}^{∞} P^{i+m+ℓ}(y, y) ≥ p² Σ_{i=1}^{∞} P^i(x, x) = ∞,

so y is also recurrent, by part (iv) of Proposition 12.4.
g_{xy}(s) = s (P(x, y) + Σ_{z≠y} P(x, z) g_{zy}(s)).   (12.1)
In principle, we can use this relation to compute the pgf’s, but this is
generally impractical, except when symmetry makes the recurrence partic-
ularly simple.
We can solve this equation explicitly, using the fact that lim_{s↓0} g(s) = 0, to get

g(s) = (1/(2(1−p)s)) (1 − √(1 − 4p(1−p)s²)).   (12.2)

The pgf of the first passage time from 1 back to 0 comes from exchanging p and
1 − p:

g_{10}(s) = (1/(2ps)) (1 − √(1 − 4p(1−p)s²)).   (12.3)
12.3. DEMONSTRATING RECURRENCE AND TRANSIENCE: THE EXAMPLE OF THE RANDO
However, we can use the pgf in the abstract to derive important prop-
erties related to recurrence and transience. Suppose the chain is irreducible
and aperiodic. We note first that the probability of ever reaching y starting
from x is lims↑1 gxy (s), and the chain is recurrent if and only if lims↑1 gxy (s) =
1 for all x and y. If there are only finitely many states that may be reached in
one step from any particular x, then we may substitute into (12.1) to get

g_{xy}(1) = Σ_{z∈X} P(x, z) g_{zy}(1).
Direct computation
If we want to show that a Markov chain is recurrent (or not), the most
straightforward thing to do is to compute the distribution of the first re-
turn time. If limt→∞ P{Txx > t} = 0, then Tx is finite with probability
1. By symmetry, for the simple (unbiased) random walk T00 has the same
distribution as T−1,0 + 1. For any k ≥ 0, we know that
P{X_{2k} = 0 | X_0 = −1} = 0,   P{X_{2k+1} = 0 | X_0 = −1} = 2^{−2k−1} \binom{2k+1}{k}.

But this cannot simply be taken as the probability of {T_{−1,0} = 2k + 2}, since the process could be back at 0
for the second, third, etc. time.

[Figure: illustration of the reflection argument for paths started from −1.]
Consider a random walk path started from −1. After the walk has hit
0, at any future even time it is either > 0 or < 0, so trivially

P{T_0 ≤ 2k | X_0 = −1} = P{T_0 ≤ 2k − 1 and X_{2k} > 0 | X_0 = −1}
  + P{T_0 ≤ 2k and X_{2k} < 0 | X_0 = −1}.

Not so trivially, by the strong Markov property the process started from T_0
is just like a simple random walk started from 0, so it has equal chances of
ending up above or below 0 at any later time, and

P{T_0 ≤ 2k | X_0 = −1} = 2 P{T_0 ≤ 2k − 1 and X_{2k} > 0 | X_0 = −1}.
But if X_{2k} > 0, the process must have passed through 0 on the way, so
T_0 ≤ 2k − 1 automatically. Hence

P{T_0 ≤ 2k | X_0 = −1} = 2 P{X_{2k} > 0 | X_0 = −1} = 1 − 2^{−2k} \binom{2k}{k}.

Thus

P{T_0 > 2k} = 2^{−2k} \binom{2k}{k} → 0 as k → ∞,

which means that T_0 is finite with probability 1, and the random walk is
recurrent.
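The formula is easy to check by simulation (a sketch; the cap on the walk length and the number of repetitions are arbitrary choices):

    # First passage time from -1 to 0 for the simple random walk
    set.seed(4)
    first.passage <- function(maxstep = 10000) {
      x <- -1
      for (n in 1:maxstep) {
        x <- x + sample(c(-1, 1), 1)
        if (x == 0) return(n)
      }
      Inf   # not yet returned within maxstep steps
    }
    tt <- replicate(10000, first.passage())
    k <- 5
    c(exact = choose(2 * k, k) / 4^k,   # 2^{-2k} binom(2k, k) = 0.246
      simulated = mean(tt > 2 * k))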
12.3.2 General p
It is easy to see that a biased random walk on Z is transient. For instance, we
know from the strong law of large numbers that lim_{n→∞} n^{−1} X_n = E[X_1] =
2p − 1. Thus, for p > 1/2 it must be that X_n → +∞ as n → ∞, and in particular
it cannot return to 0 infinitely often; similarly, if p < 1/2 we have X_n → −∞.
We could also calculate, for p < 1/2, that the moment generating function
of X_n is

M_n(θ) = (pe^θ + (1−p)e^{−θ})^n = e^{−nλ},

where λ > 0 if we pick θ > 0 small enough that pe^θ + (1−p)e^{−θ} = e^{−λ} < 1.
Since P{X_n = 0} ≤ M_n(θ), the return probabilities have finite sum. Reversing
the signs does the same job for p > 1/2.
What about the random walk on the nonnegative integers which is re-
flected at 0? (That is, P (0, 1) = 1 and P (x, x + 1) = p = 1 − P (x, x − 1) for
x ≥ 1.) As we would expect, this is transient for p > 1/2, and recurrent for
p ≤ 1/2. This follows immediately from the pgf calculation of (12.3). We
have the pgf of the first return time to 0 being

g_{00}(s) = (1/(2p)) (1 − √(1 − 4p(1−p)s²)),

so

P{T_0 < ∞ | X_0 = 0} = g_{00}(1) = min{p, 1−p}/p,

which is 1 precisely when p ≤ 1/2.
Lecture 13
Proof. (Proof not examinable.) The basic idea is that we can use the results
we already have for finite chains. The condition that the first return time
has finite expectation implies that the chain spends “almost all” of its time
in a finite subset of the state space.
We provide here an intuitive proof, based on the Strong Law of Large
Numbers. This has been mentioned but not proved in this course. You may
seek out either a proof of the Strong Law or a proof of the present result
which does not depend on it, both of which are in chapter 6 of [GS01].
Suppose µ_x := E[T_{xx}] < ∞. Start the chain in state y. Then the k-th
arrival time in x may be expressed as T_{yx}^{(k)} = T_{yx} + τ_1 + · · · + τ_{k−1}, where the
τ's are i.i.d. with the same distribution as T_{xx}. Let N_x(n) be the number of
times the process visits state x up to time n; so N_x(n) = max{k : T_{yx}^{(k)} ≤ n}
when the process starts in y. For T_{yx}^{(k+1)} > n ≥ T_{yx}^{(k)} we have N_x(n) = k, so

k/T_{yx}^{(k+1)} ≤ N_x(n)/n ≤ k/T_{yx}^{(k)}.
By the Strong Law of Large Numbers the upper and lower bounds both
converge to 1/µ_x as k → ∞. So N_x(n)/n → 1/µ_x as n → ∞. By Lemma 13.3 it
follows that π_x := 1/µ_x defines a stationary distribution. Furthermore, if N_x(n)/n →
1/µ_x with probability 1, then it must certainly be true that

E[N_x(n)/n] = E[N_x(n)]/n → 1/µ_x as n → ∞.
The proof that the probabilities actually converge in the aperiodic case
is slightly tricky although it seems intuitively obvious, and could be skipped
on a first reading. It’s worth thinking about though, because it provides a
good paradigm of how we bootstrap up from results on finite state spaces
to results on infinite state spaces.
For any ε ∈ (0, 1/2) we may find a finite subset X_ε ⊂ X such that P_ε :=
Σ_{i∈X_ε} π_i ≥ 1 − ε. Define a new chain X̃_n which lists only the states of X_n
that are in X_ε. Since this is an ergodic chain on a finite state space, we have

lim_{n→∞} P{X̃_n = j | X̃_0 = 0} = π_j/P_ε,

which is the stationary distribution of the restricted chain. It follows (with
a little bit of technical work) that the conditional distribution of X_n, conditioned
on being in the finite set X_ε, converges to π_j/P_ε. Thus for n
sufficiently large — that is, n > N for some N —

(π_j − ε) P{X_n ∈ X_ε} ≤ P{X_n = j | X_0 = i} ≤ π_j/(1 − ε|X_ε|) + ε,   (13.1)

and

P{X_n ∈ X_ε} = 1 − p_n ≥ 1 − 5ε.
(Note that the exchange of limit and summation isn't any problem if P(j, ℓ)
is nonzero for only finitely many ℓ.)
Those paying attention to the technical details will notice that we have
been a bit sloppy here in exchanging the order of limits from the second to
the third line. What we need is to show that for any ε > 0 we can find n
large enough so that

Σ_{j∈X} N_j(n) |N_{j,ℓ}(n)/N_j(n) − P_{j,ℓ}| < εn.

Conditioned on N_j(n) = ν, the random variables N_{j,ℓ}(n) − P_{j,ℓ}ν are independent.
We then use the exponential bounds from the moment generating
function to show that this sum is eventually less than εn, conditioned on
N_j(n) being large enough.
Suppose we have a shop with a single cashier, and assume for simplicity
that a customer takes exactly 1 unit of time to be served.
Suppose there are ξ_n arrivals in the n-th unit of time, where the
ξ_n are i.i.d., with P{ξ_n = k} = ρ_k. Let µ = E[ξ_n] = Σ_k kρ_k, and
let X_n be the number of customers in the queue at time n.
Then (X_n) is a Markov chain with state space {0, 1, 2, . . . }, and
X_{n+1} = ξ_{n+1} if X_n = 0, and X_{n+1} = X_n + ξ_{n+1} − 1 if X_n ≥ 1.
That means

P(x, y) = ρ_y          if x = 0,
P(x, y) = ρ_{y−x+1}    if x ≥ 1 and y ≥ x − 1,
P(x, y) = 0            otherwise.
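A simulation sketch of this queue (taking ξ_n ∼ Poisson(µ), an arbitrary illustrative choice of the distribution (ρ_k)) shows the dichotomy between µ < 1 and µ > 1:

    simulate.queue <- function(mu, steps = 10000) {
      x <- numeric(steps)   # x[1] = 0: start with an empty queue
      for (n in 1:(steps - 1)) {
        xi <- rpois(1, mu)
        x[n + 1] <- if (x[n] == 0) xi else x[n] + xi - 1
      }
      x
    }
    set.seed(5)
    mean(simulate.queue(0.8) == 0)   # mu < 1: the queue keeps returning to empty
    tail(simulate.queue(1.2), 1)     # mu > 1: the queue length drifts upward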
Note that we have proved a result that is in many respects stronger than
the convergence of the Markov chain in distribution — the convergence of
the average occupation time to the stationary distribution — but in fact
it is not simply stronger. That is, convergence of the occupation times
N_x(n)/n → π_x does not necessarily imply that P{X_n = x} → π_x.
After all, we have not even assumed that the chain is aperiodic.
In principle, our linear algebra proof still works: the largest eigenvalue
is still ≤ 1, and 1 being an eigenvalue is equivalent to positive recurrence.
But we can also argue as follows: consider the Markov chain observed only
when it is in the finite set Y ⊂ X, with ε = Σ_{y∈X\Y} π_y. Call it X_n^Y. We know
that X_n^Y converges in distribution to a stationary distribution, which must
be π_x/(1 − ε). Let Y(n) = #{m ≤ n : X_m ∉ Y}. Then X_n = X_{n−Y(n)}^Y when
X_n ∈ Y. We see then that, conditioned on being in Y, the distribution of X_n
converges to π (conditioned on being in Y). Taking Y ever larger would then
complete the proof. (There are a few subtleties that we would have to take
care of to make this into a rigorous proof — in particular, conditioning
on spending a fixed amount of time outside of Y up to time n can change
the distribution of X_n.)
Theorem 13.5. If (X_n) is any positive recurrent Markov chain, then it has
a unique stationary distribution π, and for any i, j ∈ X

lim_{n→∞} P{X_n = j | X_0 = i} = π_j.
Let ξ_{yx}(k) be the indicator of the event {X_k = x and T_y ≥ k}, and
η_{yx} = Σ_{k=1}^{∞} ξ_{yx}(k). Then for any z ∈ X,

E[η_{yx}] P(x, z) = Σ_{k=1}^{∞} E[ξ_{yx}(k)] P(x, z) = E[#{k < T_y : X_k = x and X_{k+1} = z}].
Example 13.2:
v_i = (1 − p) v_{i−1} + p v_{i+1}.

Solving this recurrence shows that, for p > 1/2, the walk started from K has probability

1 − ((1−p)/p)^K

that the process reaches ∞ before reaching 0; in other words,
the chain is transient.
That is, the rows corresponding to the absorbing states have transition probabilities
0 to any other states, and 1 to themselves, which means that the
lower square is an identity matrix. Squaring this matrix yields

P² = [ Q²   QR + R ]
     [ 0    I      ],

and in general

P^n = [ Q^n   Q^{n−1}R + Q^{n−2}R + · · · + QR + R ]
      [ 0     I                                    ].
Note that the final columns represent the probability of the chain being in
the corresponding absorbing states. These probabilities simply accumulate,
as Q^n → 0. Now we use the identity (essentially just the matrix version
of the sum of a geometric series)

(I − Q)(Q^{n−1}R + Q^{n−2}R + · · · + QR + R) = (I − Q^n) R.
Since Q^n → 0 as n → ∞, we have

lim_{n→∞} (Q^{n−1}R + · · · + QR + R) = (I − Q)^{−1} R,

so the absorption probabilities for a chain started with initial distribution q are given by q^T (I − Q)^{−1} R.
So

I − Q = [  0.8  −0.4 ]
        [ −0.5   0.7 ],

and

(I − Q)^{−1} = (1/18) [ 35  20 ]
                      [ 25  40 ].

Multiplying this by R, we get

(I − Q)^{−1} R = (1/18) [ 7  11 ]
                        [ 5  13 ].
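Numerically (a sketch: Q is read off from the I − Q displayed above; the matrix R is not printed on this page, so the one below is inferred so that the rows of the full transition matrix sum to 1 and the displayed answer is reproduced):

    Q <- matrix(c(0.2, 0.4,
                  0.5, 0.3), nrow = 2, byrow = TRUE)
    R <- matrix(c(0.2, 0.2,
                  0.0, 0.2), nrow = 2, byrow = TRUE)   # inferred, for illustration
    solve(diag(2) - Q) %*% R   # matches (1/18) * [7 11; 5 13]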
Write 1 for the column vector of all 1's. Then µ := µ_{·y} satisfies the system
of linear equations

(I − P)µ = 1.
Again, we can turn this into a matrix-inversion problem. We have
µ = (I − Q)^{−1} 1,   (13.2)
where Q is the square matrix that results from eliminating the row and
column corresponding to the state y. Where does this formula come from?
We could show this by linear algebra, but we can also think of it as follows:
Suppose we modify the chain to make y an absorbing state. Then the
matrix Q represents the transitions among all the other (transient) states,
and (Qn )ij is the probability that a chain started in state i is at state j at
time n, without ever having visited y (since it would have been absorbed
then). If we write ξj (n) := 1{Xn =j} , then when the chain is started in state
i,
E[ξj (n)] = (Qn )ij .
The total number of visits to state j before absorption is Σ_{n=0}^{∞} ξ_j(n), and
the total time until absorption is

T_{iy} = Σ_{j≠y} Σ_{n=0}^{∞} ξ_j(n),

so

E[T_{iy}] = Σ_{j≠y} Σ_{n=0}^{∞} (Q^n)_{ij} = ((I − Q)^{−1} 1)_i.
For example, for the simple random walk on the cycle {0, 1, 2, 3, 4} with y = 0, eliminating the row and column for state 0 leaves

Q = [ 0    1/2  0    0   ]
    [ 1/2  0    1/2  0   ]
    [ 0    1/2  0    1/2 ]
    [ 0    0    1/2  0   ].
We get

(I − Q)^{−1} = [ 1.6  1.2  0.8  0.4 ]
               [ 1.2  2.4  1.6  0.8 ]
               [ 0.8  1.6  2.4  1.2 ]
               [ 0.4  0.8  1.2  1.6 ].

So the total expected time to reach 0 from state 1 is 1.6 + 1.2 + 0.8 + 0.4 = 4, and the expected return
time to 0 is 1 + 4 = 5.
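In R (a sketch, with Q as reconstructed above for the walk on the cycle {0, 1, 2, 3, 4} with state 0 removed):

    Q <- matrix(c(0,   1/2, 0,   0,
                  1/2, 0,   1/2, 0,
                  0,   1/2, 0,   1/2,
                  0,   0,   1/2, 0), nrow = 4, byrow = TRUE)
    solve(diag(4) - Q, rep(1, 4))   # mean times to reach 0: 4 6 6 4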
Let (ξ_i)_{i=1}^{∞} be i.i.d. random variables taking values in Z, with
P{ξ_i = x} = ρ_x, and define Y_n := Σ_{i=1}^n ξ_i. Suppose µ = E[ξ_i] =
Σ_x xρ_x exists and is finite; suppose ξ_i has finite variance σ², and
it has a finite fourth moment, so τ := E[(ξ_i − µ)^4] is finite as
well. Suppose the chain is irreducible and aperiodic.
Let T = min{n : Y_n < 0}. We know from Example 13.1 that T
is finite almost surely if and only if µ ≤ 0. When does T have
finite expectation?
A very important algebraic identity is

(Y_n − nµ)^4 = (Σ_{i=1}^n (ξ_i − µ))^4
= Σ_{i=1}^n (ξ_i − µ)^4 + 3 Σ_{i≠j} (ξ_i − µ)²(ξ_j − µ)² + 4 Σ_{i≠j} (ξ_i − µ)(ξ_j − µ)³
  + 6 Σ_{i≠j≠k} (ξ_i − µ)(ξ_j − µ)(ξ_k − µ)² + Σ_{i≠j≠k≠ℓ} (ξ_i − µ)(ξ_j − µ)(ξ_k − µ)(ξ_ℓ − µ).
Note that all the terms in which some factor ξ_i − µ appears only to the first power
have expectation 0. Thus

E[(Y_n − nµ)^4] = nτ + 3n(n−1)σ^4.

If µ < 0, then on the event {T > n} we have Y_n ≥ 0, so that |Y_n − nµ| ≥ n|µ|, and
Markov's inequality applied to (Y_n − nµ)^4 gives

P{T > n} ≤ (nτ + 3n(n−1)σ^4)/(n^4 µ^4) ≤ ((τ + 3σ^4)/µ^4) n^{−2}.

We have then

E[T] = Σ_{n=0}^{∞} P{T > n} ≤ 1 + Σ_{n=1}^{∞} ((τ + 3σ^4)/µ^4) n^{−2} < ∞.
Applications of Markov
chains
with the remaining probability being made up of P̃(x, x).
If p ≥ 1 then we flip the sign of ξ(j). If p < 1 then we decide with probability
p to flip the sign.
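This accept/reject rule is the standard Metropolis recipe. A generic, self-contained sketch (the target below, specified through its log density log.pi, is an arbitrary stand-in, not the distribution from the lecture):

    # One Metropolis step: propose flipping one sign, accept with probability min(1, p)
    metropolis.step <- function(xi, log.pi) {
      j <- sample(length(xi), 1)
      prop <- xi
      prop[j] <- -prop[j]
      log.p <- log.pi(prop) - log.pi(xi)        # log of the ratio p
      if (log(runif(1)) < log.p) prop else xi   # accepts automatically when p >= 1
    }
    log.pi <- function(xi) sum(xi) / 2          # illustrative target only
    xi <- rep(1, 10)
    for (n in 1:1000) xi <- metropolis.step(xi, log.pi)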
Lecture 15
Poisson process
• There are finitely many points in any bounded region, and points are
all distinct.
(PPI.1) N (0) = 0.
(PPII.1) N (0) = 0.
P{N(t + s) − N(s) = n} = e^{−λt} (λt)^n / n!.
[Figure: a sample path of the counting process N(t), with arrival times T_1, T_2, . . . , T_9 and interarrival times τ_1, . . . , τ_9.]
Proof. We show that the local definition is equivalent to the other definitions.
We start with the local definition and show that it implies the global
definition. The first two conditions are the same. Consider an interval [s, s + t],
and suppose the process satisfies the local definition. Choose a large integer
K, and define

ξ_i = N(s + (i/K)t) − N(s + ((i−1)/K)t).

Then N(t + s) − N(s) = Σ_{i=1}^K ξ_i, and ξ_i is close to being Bernoulli with
parameter λt/K, so N(t + s) − N(s) should be close to the Binom(K, λt/K)
distribution, which we know converges to Poisson(λt). Formally, we can
Solution:
P{N_1 ≥ 3} = 1 − P{N_1 ≤ 2} = 1 − e^{−2} (1 + 2/1! + 2²/2!) = 0.323.
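(In R: 1 - ppois(2, 2) returns 0.3233.)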
(t · λe^{−λt}) / (∫_0^∞ sλe^{−λs} ds) = λ² t e^{−λt}.
An easier way to see this is to think of the time that the woman
waits for the train, and the time since the previous train at the
moment when she enters the station. By the memoryless prop-
erty of the exponential distribution, the waiting time until the
next train, at the moment when she enters, is still precisely ex-
ponential with parameter 2. By the symmetry of the Poisson
process, it’s clear that if we go backwards looking for the pre-
vious train, the time will also be exponential with parameter
2, and the two times will be independent. Thus, the interar-
rival times observed by the woman will actually be the sum of
two independent Exp(2) random variables, so will have gamma
distribution, with rate parameter 2 and shape parameter 2.
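A simulation sketch confirms this (rate 2 as in the example; the fixed observation time and the sample counts are arbitrary choices):

    set.seed(6)
    covering.interval <- function(t0 = 50, rate = 2) {
      arrivals <- cumsum(rexp(300, rate))   # comfortably passes t0
      i <- which(arrivals > t0)[1]
      arrivals[i] - if (i == 1) 0 else arrivals[i - 1]
    }
    gaps <- replicate(20000, covering.interval())
    mean(gaps)   # about 1, the Gamma(2, rate 2) mean, not the Exp(2) mean 0.5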
STILL TO COME.
by taking the union of all the times, ignoring which process they come from.
Then the merged process is also a Poisson process, with parameter Σ_i λ_i.
Proof. Thinning: We use the local definition of the Poisson process. Start
from the process T_1 < T_2 < · · ·. Let T_1^{(i)} < T_2^{(i)} < · · · be the i-th thinned
process. Clearly N_i(0) is still 0. If we look at the numbers of events occurring
on disjoint intervals, they are being thinned from independent random vari-
ables. Since a function applied to independent random variables produces
independent random variables, we still have independent increments. We
have
P{N_i(t + h) − N_i(t) = 1}
= P{N(t + h) − N(t) = 1} · P{assign category i}
  + P{N(t + h) − N(t) ≥ 2} · P{assign category i to exactly one}
= (λh + o(h)) p_i + o(h)
= p_i λh + o(h).

And by the same approach, we see that P{N_i(t + h) − N_i(t) ≥ 2} = o(h).
Independence is slightly less obvious. In general, if you take a fixed num-
ber of points and allocate them to categories, the numbers in the different
categories will not be independent. The key is that there is not a fixed num-
ber of points; moving from left to right, there is always the same chance λdt
of getting an event at the next moment, and these may be allocated to any
of the categories, independent of the points already allocated. A rigorous
proof is easiest with the global definition. Consider N1 (t), N2 (t), . . . , NK (t)
for fixed t. These may be generated by the following process: Let N (t) be
Poi(λt), and let (N1 (t), N2 (t), . . . , NK (t)) be multinomial with parameters
(N (t); (pi )). That is, supposing N (t) = n, allocate points to bins 1, . . . , K
according to the probabilities p1 , . . . , pK . Then
P N1 (t) = n1 , . . . , NK (t) = nK
= P N (t) = n P N1 (t) = n1 , . . . , NK (t) = nK N (t) = n
λn n!
= e−λt · pn1 · · · pnKK
n! n1 ! · · · nK ! 1
K
Y λni
= e−λi t
ni !
i=1
YK
= P Ni (t) = ni
i=1
Since counts involving distinct intervals are clearly independent, this completes the proof.
Merging: This is simpler, since we don't have to prove independence.
The total numbers of arrivals in disjoint intervals are the sums of the numbers
of category-i arrivals; the sums of independent random variables are
also independent. Furthermore, the sum of independent Poisson random
variables is also Poisson, which completes the proof, by the global definition
of the Poisson process.
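A simulation sketch of thinning (the rate λ = 3, the horizon, and the category probabilities are arbitrary choices; it also uses the fact, from the next section, that given N(t) the points are i.i.d. uniform):

    set.seed(7)
    lambda <- 3; tmax <- 100; p <- c(0.5, 0.3, 0.2)
    n <- rpois(1, lambda * tmax)
    times <- sort(runif(n, 0, tmax))
    cats <- sample(1:3, n, replace = TRUE, prob = p)
    table(cats) / tmax   # empirical rates, close to p * lambda = 1.5, 0.9, 0.6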
1 − e^{−1.4} (1 + 1.4/1 + 1.4²/2) = 0.167.
The casualty department takes in two independent Poisson streams of
patients with total rate 6, so it is a Poisson process with parameter 6. The
number of patients in 8 hours has Poi(48) distribution.
The problem here is that, while we know the distribution of the number
of deals over a span [0, t], the quantity of interest depends on the precise
times.
Of course, as t → ∞ this will converge to λ/θ. For the variance, we use the
formula Var(V) = E[Var(V|X)] + Var(E[V|X]), so that

Var(V_t) = E[N(t)] σ²(t) + ((1 − e^{−θt})/(θt))² Var(N(t))
= λ · (1/(2θ²)) ((θ − 2t^{−1}) + 4e^{−θt} − (θ + 2t^{−1}) e^{−2θt}) + θ^{−2} (1 − e^{−θt})² λ t^{−1}
→ λ/(2θ) as t → ∞.
Bibliography
[Bil95] Patrick Billingsley. Probability and Measure. John Wiley & Sons
Inc., New York, third edition, 1995.
[Joh04] Oliver Johnson. Information theory and the Central Limit Theo-
rem. Imperial College Press, 2004.
[PW97] James Propp and David Wilson. Coupling from the past: A user’s
guide. In Microsurveys in Discrete Probability. DIMACS, 1997.
Assignments
A.1 Part A Probability Problem sheet 1:
Random Variables, Expectations, Conditioning
(1) (a) The Gamma(r, λ) distribution is the distribution on R+ with density
f_{r,λ}(x) = (λ^r x^{r−1} / Γ(r)) e^{−λx}.
r is called the shape parameter, and λ is the rate parameter. Show that the sum of two indepen-
dent Gamma distributed random variables with the same rate parameter has another Gamma
distribution.
(b) The χ2 distribution with n degrees of freedom is defined to be the distribution of X12 + · · · + Xn2 ,
where X1 , . . . , Xn are i.i.d. standard normal random variables. Show that the χ2 distribution
belongs to the Gamma family, and compute the parameters.
(2) Let (Z1 , Z2 ) be independent random variables with standard normal distribution, and let X =
2Z1 + Z2 and Y = Z1 + Z2 .
(a) What are the marginal distributions of X and Y ? What is the joint distribution of (X, Y )?
(b) Compute E[Y | X = 5].
(3) Let X1 and X2 be independent exponential random variables with parameters µ and λ respectively.
Let Y = min(X1 , X2 ) and Z = max(X1 , X2 ). Let W = Z − Y , and I = 1{X1 >X2 } .
(a) Compute the joint densities of (Y, Z), (Y, W ), and (Z, W ). Which pairs are independent?
(b) Compute the marginal distributions of Y , Z, W , and I.
(c) Compute the conditional density of Z given Y = y.
(d) Which of Y , Z, or W are independent of I?
(e) Under what conditions are three of the four random variables (Y, Z, W, I) jointly independent?
(4) The bivariate normal distribution with parameters µX , µY , σX , σY , and ρ is described in section
2.4.3. Show that these parameters are in fact the expectations of X and Y , the SDs of X and Y ,
and the correlation. You may show this by working from the representation of X and Y as linear
combinations of independent standard normal random variables. You could also try showing that
you get the same result by integrating the bivariate joint density (2.6).
(5) Random variables X and Y have joint density f (x, y). Let Z = Y /X. Show that Z has density
f_Z(z) = ∫_{−∞}^{∞} |x| f(x, xz) dx.
Suppose that X and Y are independent standard normal random variables. Show that Z has density
f_Z(z) = 1/(π(1 + z²)) for −∞ < z < ∞.
(6) We have a coin such that P{heads} = Q, where Q is a fixed quantity that was uniformly chosen from
the interval (0, 1).
(a) We flip the coin n times. Compute the expectation and variance of the number of heads.
(b) Let Xi be the indicator of {flip #i is a head} (that is, Xi = 1 if the i-th flip is heads, and Xi = 0 if
it is tails). Let Y =total number of heads. Show that E[X1 |Y ] = Y /n and E[Y |Xi ] = (Xi +1)n/3.
(c) We flip the coin repeatedly until the first time heads comes up. Let T be the number of flips.
Show that T is finite with probability 1, but that E[T ] = ∞.
(7) The second-moment method: Let X be a random variable with E[X] ≥ 0. Show that

P{X > 0} ≥ 1 − Var(X)/E[X]².

Show also that

P{X > 0} ≥ E[X]²/E[X²],

and show that this is indeed always a better bound. (Hint: Let Y = 1_{X≠0}. Show that E[X] =
E[XY], and use the fact that E[XY]² ≤ E[X²]E[Y²], which is in the lecture notes as (3.8).)
(8) The birthday problem: Suppose there are 365 possible birthdays, and we have n individuals. Each
person’s birthday is an independent uniform random choice from the 365 possibilities. Let Bk be the
number of groups of k individuals who all have the same birthday. (Groups can overlap. Thus, if
we have 3 people with birthday Jan. 1 and no other matches, then B3 = 1, B2 = 3, and Bk = 0 for
k > 3.)
(a) Compute P{B_2 > 0}. How big must n be so that the probability that at least two people in the
population have the same birthday is at least 1/2? (This is the classic birthday problem.)
(b) Compute E[B2 ] and Var(B2 ), and E[B3 ] and Var(B3 ).
(c) Use Chebyshev's inequality to find a number n such that the probability that at least three people in
the group have the same birthday is at least 1/2.
(d) Bonus Question: Compute E[B_k] and Var(B_k) for general k. (This requires some careful combinatorics.)
A.2 Part A Probability Problem sheet 2:
Generating functions and convergence
(1) Let λ_n be a sequence of positive real numbers such that lim_{n→∞} λ_n = ∞. Let X_n be a random
variable with Poisson distribution with parameter λ_n, and Y_n := (X_n − λ_n)/√λ_n. Using moment
generating functions, show that Y_n converges in distribution to a standard normal distribution. If
λ_n = nλ, what distribution does Z_n := (X_n − nλ)/√n converge to? Explain why this is a special
case of the Central Limit Theorem.
(2) (a) Compute the moment generating function of a random variable with Gamma(r, λ) distribution.
Use this to repeat the result from the previous sheet, that the sum of independent Gamma
random variables with the same rate parameter is also Gamma distributed.
(b) An insurance company has N claims in a year, where N has a Poisson distribution with parameter
µ. The individual claims each have i.i.d. Gamma(r, λ) independent of N . Let Y be the total of
all the claims in a year. Compute the moment generating function of Y .
(c) Suppose µ = 20, r = 10, λ = 0.1. Show that E[Y ] = 2000, and the probability of Y > 4000 is
less than 0.0011.
(3) Prove that if random variables Xn converge to X in mean square, then they also converge in proba-
bility.
(4) (Grimmett and Welsh 8.5.1) Let X_1, X_2, . . . be a sequence of i.i.d. random variables uniform on
(0, x). Let Z_n := max{X_1, . . . , X_n} and U_n := n(x − Z_n).
(a) Show that Z_n → x in probability as n → ∞.
(b) Show that √Z_n → √x in probability as n → ∞.
(c) Show that U_n converges in distribution to an exponential distribution as n → ∞, and find the
parameter.
(5) Let X be the number of successes in n Bernoulli trials with probability p of success. Use the CLT to
approximate P{|X/n − p| ≤ 0.01} when
(a) n = 100 and p = 1/2;
(b) n = 1000 and p = 1/2;
(c) n = 1000 and p = 1/10.
In each case, compare the bound to the Chebyshev bound used in the proof of the Weak Law of
Large Numbers; and to the exact probability.
(6) An investor purchases 100 shares of Consolidated Confidence for £10 on January 1. It trades on
200 days during the year, and on day i the price changes by X_i % of its current value, where the
X_i are i.i.d. with uniform distribution on the interval [−2, 2]. The investor sells the shares at the
price Y at the end of the year. Use the Central Limit Theorem to compute (approximately) the
probability that the investor makes a profit. You may use the fact that ∫ ln x dx = x ln x − x + C and
∫ ln² x dx = x ln² x − 2x ln x + 2x + C. Hint: Consider the random variable Z = ln Y.
(7) Law of large numbers for non-i.i.d. random variables
(a) Suppose X1 , X2 , . . . are independent random variables, where E[Xi ] = µi and Var(Xi ) = σi2 . Let
Sn = X1 + · · · + Xn and Mn = µ1 + · · · + µn . Assume for all i that σi2 < K for some constant
K. Show that for every ε > 0

(∗) lim_{n→∞} P{|S_n/n − M_n/n| < ε} = 1.
(b) (More challenging): Suppose now that the random variables are not independent, but that
| Cov(Xi , Xi+j )| ≤ κ(j), for some function κ(j). In other words, if κ is decreasing, this says
that random variables that are far apart become less correlated. Show that (∗) still holds if
limn→∞ κ(n) = 0.
A.3 Markov chains