
Probability

J.R. Norris
December 13, 2017

Contents

1 Mathematical models for randomness
1.1 A general definition
1.2 Equally likely outcomes
1.3 Throwing a die
1.4 Balls from a bag
1.5 A well-shuffled pack of cards
1.6 Distribution of the largest digit
1.7 Coincident birthdays

2 Counting the elements of a finite set
2.1 Multiplication rule
2.2 Permutations
2.3 Subsets
2.4 Increasing and non-decreasing functions
2.5 Ordered partitions

3 Stirling’s formula
3.1 Statement of Stirling’s formula
3.2 Asymptotics for log(n!)
3.3 Proof of Stirling’s formula (non-examinable)

4 Basic properties of probability measures
4.1 Countable subadditivity
4.2 Continuity
4.3 Inclusion-exclusion formula
4.4 Bonferroni inequalities

5 More counting using inclusion-exclusion
5.1 Surjections
5.2 Derangements

6 Independence
6.1 Definition
6.2 Pairwise independence does not imply independence
6.3 Independence and product spaces

7 Some natural discrete probability distributions
7.1 Bernoulli distribution
7.2 Binomial distribution
7.3 Multinomial distribution
7.4 Geometric distribution
7.5 Poisson distribution

8 Conditional probability
8.1 Definition
8.2 Law of total probability
8.3 Bayes’ formula
8.4 False positives for a rare condition
8.5 Knowledge changes probabilities in surprising ways
8.6 Simpson’s paradox

9 Random variables
9.1 Definitions
9.2 Doing without measure theory
9.3 Number of heads

10 Expectation
10.1 Definition
10.2 Properties of expectation
10.3 Variance and covariance
10.4 Zero covariance does not imply independence
10.5 Calculation of some expectations and variances
10.6 Conditional expectation
10.7 Inclusion-exclusion via expectation

11 Inequalities
11.1 Markov’s inequality
11.2 Chebyshev’s inequality
11.3 Cauchy–Schwarz inequality
11.4 Jensen’s inequality
11.5 AM/GM inequality

12 Random walks
12.1 Definitions
12.2 Gambler’s ruin
12.3 Mean time to absorption

13 Generating functions
13.1 Definition
13.2 Examples
13.3 Generating functions and moments
13.4 Sums of independent random variables
13.5 Random sums
13.6 Counting with generating functions

14 Branching processes
14.1 Definition
14.2 Mean population size
14.3 Generating function for the population size
14.4 Conditioning on the first generation
14.5 Extinction probability
14.6 Example

15 Some natural continuous probability distributions
15.1 Uniform distribution
15.2 Exponential distribution
15.3 Gamma distribution
15.4 Normal distribution

16 Continuous random variables
16.1 Definitions
16.2 Transformation of one-dimensional random variables
16.3 Calculation of expectations using density functions

17 Properties of the exponential distribution
17.1 Exponential distribution as a limit of geometrics
17.2 Memoryless property of the exponential distribution

18 Multivariate density functions
18.1 Definitions
18.2 Independence
18.3 Marginal densities
18.4 Convolution of density functions
18.5 Transformation of multi-dimensional random variables

19 Simulation of random variables
19.1 Construction of a random variable from its distribution function
19.2 Rejection sampling

20 Moment generating functions
20.1 Definition
20.2 Examples
20.3 Uniqueness and the continuity theorem

21 Limits of sums of independent random variables
21.1 Weak law of large numbers
21.2 Strong law of large numbers (non-examinable)
21.3 Central limit theorem
21.4 Sampling error via the central limit theorem

22 Geometric probability
22.1 Bertrand’s paradox
22.2 Buffon’s needle

23 Gaussian random variables
23.1 Definition
23.2 Moment generating function
23.3 Construction
23.4 Density function
23.5 Bivariate normals

1 Mathematical models for randomness
1.1 A general definition
Let Ω be a set. Let F be a set of subsets of Ω. We say that F is a σ-algebra if Ω ∈ F and,
for all A ∈ F and every sequence (An : n ∈ N) in F,

\[ A^c \in \mathcal{F}, \qquad \bigcup_{n=1}^{\infty} A_n \in \mathcal{F}. \]

Thus F is non-empty and closed under countable set operations. Assume that F is indeed
a σ-algebra. A function P : F → [0, 1] is called a probability measure if P(Ω) = 1 and, for
every sequence (An : n ∈ N) of disjoint elements of F,

\[ \mathbb{P}\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} \mathbb{P}(A_n). \]

Assume that P is indeed a probability measure. The triple (Ω, F, P) is then called a
probability space.
In the case where Ω is countable, we will take F to be the set of all subsets of Ω unless
otherwise stated. The elements of Ω are called outcomes and the elements of F are called
events. In a probability model, Ω will be an abstraction of a real set of outcomes, and
F will model observable events. Then P(A) is interpreted as the probability of the event
A. In some probability models, for example choosing a random point in the interval [0, 1],
the probability of each individual outcome is 0. This is one reason we need to specify
probabilities of events rather than outcomes.

1.2 Equally likely outcomes


Let Ω be a finite set and let F denote the set of all subsets of Ω. Write |Ω| for the number
of elements of Ω. Define P : F → [0, 1] by

P(A) = |A|/|Ω|.

In classical probability an appeal to symmetry is taken as justification to consider this as
a model for a randomly chosen element of Ω.
This simple set-up will be the only one we consider for now. At some point, you may
wish to check that (Ω, F, P) is a probability space so this is a case of the general set-up.
This essentially comes down to the fact that, for disjoint sets A and B, we have

|A ∪ B| = |A| + |B|.

1.3 Throwing a die
The throw of a die has six possible outcomes. We use symmetry to justify the following
model for a throw of the die:

Ω = {1, 2, 3, 4, 5, 6}, P(A) = |A|/6 for A ⊆ Ω.

Thus, for example, P({2, 4, 6}) = 1/2.

1.4 Balls from a bag


A bag contains n balls, indistinguishable by touch. We draw k balls all at once from the
bag without looking. Considering the balls as labelled by {1, . . . , n}, we have then selected
a subset of {1, . . . , n} of size k. By symmetry, we suppose that all choices are equally likely.
Then Ω is the set of subsets of {1, . . . , n} of size k, so

\[ |\Omega| = \binom{n}{k} = \frac{n!}{k!(n-k)!} = \frac{n(n-1) \cdots (n-k+1)}{k(k-1) \cdots 1} \]

and, for each individual outcome ω,

P({ω}) = 1/|Ω|.

1.5 A well-shuffled pack of cards


There are 52 playing cards in a pack. In shuffling we seek to make every possible order
equally likely. Thus an idealized well-shuffled pack is modelled by the set Ω of permutations
of {1, . . . , 52}. Note that
|Ω| = 52!
Let us calculate the probability of the event A that the first two cards are aces. There are
4 choices for the first ace, then 3 for the second, and then the rest can come in any order.
So
|A| = 4 × 3 × 50!
and
P(A) = |A|/|Ω| = 12/(52 × 51) = 1/221.

1.6 Distribution of the largest digit


From a stream of random digits 0, 1, . . . , 9 we examine the first n. For k ≤ 9, what is the
probability that the largest of these n digits is k?
We model the set of possible outcomes by

\[ \Omega = \{0, 1, \dots, 9\}^n. \]

In the absence of further information, and prompted by the characterization as ‘random
digits’, we assume that all outcomes are equally likely. Consider the event $A_k$ that none of
the n digits exceeds k and the event $B_k$ that the largest digit is k. Then

\[ |\Omega| = 10^n, \qquad |A_k| = (k+1)^n, \qquad |B_k| = |A_k \setminus A_{k-1}| = (k+1)^n - k^n. \]

So
\[ \mathbb{P}(B_k) = \frac{(k+1)^n - k^n}{10^n}. \]
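
The formula for P(B_k) can be checked against a direct enumeration of Ω for small n. The following is a minimal Python sketch under that assumption; the function names are our own and are not part of the notes.

```python
from fractions import Fraction
from itertools import product


def largest_digit_pmf(n):
    """Exact P(B_k) = ((k+1)^n - k^n) / 10^n for k = 0, ..., 9."""
    return [Fraction((k + 1) ** n - k ** n, 10 ** n) for k in range(10)]


def largest_digit_pmf_bruteforce(n):
    """Same distribution, obtained by listing all 10^n equally likely outcomes."""
    counts = [0] * 10
    for digits in product(range(10), repeat=n):
        counts[max(digits)] += 1
    return [Fraction(c, 10 ** n) for c in counts]
```

The brute-force version is only practical for small n, since it visits every one of the 10^n outcomes.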

1.7 Coincident birthdays


There are n people in a room. What is the probability that two of them have the same
birthday? We will assume that no-one was born on 29th February. We model the set of
possible sequences of birthdays by

\[ \Omega = \{1, \dots, 365\}^n. \]

We will work on the assumption that all outcomes are equally likely. In fact this is empirically
false, but we are free to choose the model and have no further information. Consider
the event A that all n birthdays are different. Then

\[ |\Omega| = 365^n, \qquad |A| = 365 \times 364 \times \cdots \times (365 - n + 1). \]

So the probability that two birthdays coincide is

\[ p(n) = \mathbb{P}(A^c) = 1 - \mathbb{P}(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}. \]

In fact
\[ p(22) = 0.476, \qquad p(23) = 0.507 \]
to 3 significant figures so, as soon as n ≥ 23, it is more likely than not that two people
have the same birthday.

2 Counting the elements of a finite set
We have seen the need to count numbers of permutations and numbers of subsets of a
given size. Now we take a systematic look at some methods of counting.

2.1 Multiplication rule


A non-empty set Ω is finite if there exists n ∈ N and a bijection

{1, . . . , n} → Ω.

Then n is called the cardinality of Ω. We will tend to refer to n simply as the size of Ω.
The set $\{1, \dots, n_1\} \times \cdots \times \{1, \dots, n_k\}$ has size $n_1 \times \cdots \times n_k$. More generally, suppose we
are given a sequence of sets $\Omega_1, \dots, \Omega_k$ and bijections

\[ f_1 : \{1, \dots, n_1\} \to \Omega_1, \qquad f_i : \Omega_{i-1} \times \{1, \dots, n_i\} \to \Omega_i, \quad i = 2, \dots, k. \]

Then we can define maps

\[ g_i : \{1, \dots, n_1\} \times \cdots \times \{1, \dots, n_i\} \to \Omega_i, \quad i = 1, \dots, k \]

by
\[ g_1 = f_1, \qquad g_i(m_1, \dots, m_i) = f_i(g_{i-1}(m_1, \dots, m_{i-1}), m_i). \]
It can be checked by induction on i that these are all bijections. In particular, we see that

\[ |\Omega_k| = n_1 \times n_2 \times \cdots \times n_k. \]

We will tend to think of the bijections $f_1, \dots, f_k$ in terms of choices. Suppose we can
describe each element of Ω uniquely by choosing first from $n_1$ possibilities, then, for each
of these, from $n_2$ possibilities, and so on, with the final choice from $n_k$ possibilities. Then
implicitly we have in mind bijections $f_1, \dots, f_k$, as above, with $\Omega_k = \Omega$, so

\[ |\Omega| = n_1 \times n_2 \times \cdots \times n_k. \]

2.2 Permutations
The bijections of {1, . . . , n} to itself are called permutations of {1, . . . , n}. We may obtain
all these permutations by choosing successively the image of 1, then that of 2, and so on,
finishing with the image of n. There are respectively n, n − 1, . . . , 1 choices at each stage, corresponding
to the numbers of values which have not been taken. Hence the number of permutations
of {1, . . . , n} is n!. This is then also the number of bijections between any two given sets of
size n.

2.3 Subsets
Fix k ≤ n. Let N denote the number of subsets of {1, . . . , n} which have k elements. We
can count the permutations of {1, . . . , n} as follows. First choose a subset S1 of {1, . . . , n}
of size k and set S2 = {1, . . . , n} \ S1 . Then choose bijections {1, . . . , k} → S1 and
{k + 1, . . . , n} → S2 . These choices determine a unique permutation of {1, . . . , n} and we
obtain all such permutations in this way. Hence

N × k! × (n − k)! = n!

and so
\[ N = \frac{n!}{k!(n-k)!} = \binom{n}{k}. \]
More generally, suppose we are given integers $n_1, \dots, n_k \ge 0$ with $n_1 + \cdots + n_k = n$.
Let M denote the number of ways to partition {1, . . . , n} into k disjoint subsets $S_1, \dots, S_k$
with $|S_1| = n_1, \dots, |S_k| = n_k$. The argument just given generalizes to show that

\[ M \times n_1! \times \cdots \times n_k! = n! \]

so
\[ M = \frac{n!}{n_1! \cdots n_k!} = \binom{n}{n_1 \, \dots \, n_k}. \]

2.4 Increasing and non-decreasing functions


An increasing function from {1, . . . , k} to {1, . . . , n} is uniquely determined by its range,
which is a subset of {1, . . . , n} of size k, and we obtain all such subsets in this way.
Hence the number of such increasing functions is $\binom{n}{k}$.

There is a bijection from the set of non-decreasing functions f : {1, . . . , m} → {1, . . . , n}
to the set of increasing functions g : {1, . . . , m} → {1, . . . , m + n − 1} given by

\[ g(i) = i + f(i) - 1. \]

Hence the number of such non-decreasing functions is $\binom{m+n-1}{m}$.
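
For small m and n the bijection can be verified by brute force. The sketch below is one way to do this in Python; functions are represented as tuples (f(1), . . . , f(m)), and all names are our own.

```python
from itertools import combinations, product
from math import comb


def increasing_functions(k, n):
    """Increasing maps {1..k} -> {1..n}, one for each k-subset (its range)."""
    return list(combinations(range(1, n + 1), k))


def nondecreasing_functions(m, n):
    """Non-decreasing maps {1..m} -> {1..n}, found by filtering all maps."""
    return [f for f in product(range(1, n + 1), repeat=m)
            if all(f[i] <= f[i + 1] for i in range(m - 1))]


def shift(f):
    """The map g(i) = i + f(i) - 1 from the text, for i = 1, ..., m."""
    return tuple(i + fi - 1 for i, fi in enumerate(f, start=1))


def count_nondecreasing(m, n):
    """Closed form: binomial coefficient (m + n - 1 choose m)."""
    return comb(m + n - 1, m)
```

Applying `shift` to every non-decreasing function should produce each increasing function into {1, . . . , m + n − 1} exactly once.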

2.5 Ordered partitions


An ordered partition of m of size n is a sequence (m1 , . . . , mn ) of non-negative integers
such that m1 + · · · + mn = m. We count these using stars and bars. Put m1 stars in a
row, followed by a bar, then m2 more stars, another bar, and so on, finishing with mn
stars. There are m stars and n − 1 bars, so m + n − 1 ordered symbols altogether, which
we label by {1, . . . , m + n − 1}. Each ordered partition corresponds uniquely to a subset
of {1, . . . , m + n − 1} of size m, namely the subset of stars. Hence the number of ordered
partitions of m of size n is $\binom{m+n-1}{m}$. This is a more colourful version of the same trick used
to count non-decreasing functions.
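
The stars-and-bars correspondence can be turned into a small enumeration routine. The Python sketch below chooses the positions of the n − 1 bars rather than the m stars, which is the complementary choice and gives the same count; the function names are illustrative.

```python
from itertools import combinations
from math import comb


def ordered_partitions(m, n):
    """All sequences (m_1, ..., m_n) of non-negative integers summing to m."""
    total = m + n - 1  # number of symbols: m stars and n - 1 bars
    result = []
    for bars in combinations(range(total), n - 1):
        # Sentinel bars at -1 and total; stars are the gaps between bars.
        edges = (-1,) + bars + (total,)
        result.append(tuple(edges[i + 1] - edges[i] - 1 for i in range(n)))
    return result


def count_ordered_partitions(m, n):
    """Closed form: binomial coefficient (m + n - 1 choose m)."""
    return comb(m + n - 1, m)
```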

3 Stirling’s formula
Stirling’s formula is important in giving a computable asymptotic equivalent for factorials,
which enter many expressions for probabilities. Having stated the formula, we first prove
a cruder asymptotic, then prove the formula itself.

3.1 Statement of Stirling’s formula


In the limit n → ∞ we have
\[ n! \sim \sqrt{2\pi} \, n^{n+1/2} e^{-n}. \]
Recall that we write $a_n \sim b_n$ to mean that $a_n / b_n \to 1$.

3.2 Asymptotics for log(n!)


Set
\[ l_n = \log(n!) = \log 2 + \cdots + \log n. \]
Write ⌊x⌋ for the integer part of x. Then, for x ≥ 1,

\[ \log \lfloor x \rfloor \le \log x \le \log \lfloor x + 1 \rfloor. \]

Integrate over the interval [1, n] to obtain

\[ l_{n-1} \le \int_1^n \log x \, dx \le l_n. \]

An integration by parts gives

\[ \int_1^n \log x \, dx = n \log n - n + 1 \]

so
\[ n \log n - n + 1 \le l_n \le (n+1) \log(n+1) - n. \]
The ratio of the left-hand side to n log n tends to 1 as n → ∞, and the same is true of the
right-hand side, so we deduce that

log(n!) ∼ n log n.
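
Both the bounds on log(n!) and the statement of Stirling's formula can be examined numerically. The following Python sketch uses `math.lgamma(n + 1)` as a floating-point value of log(n!); the function names are our own, and the comparison is only as accurate as floating-point arithmetic allows.

```python
import math


def log_factorial(n):
    """log(n!) computed via the log-gamma function."""
    return math.lgamma(n + 1)


def stirling_log(n):
    """Logarithm of sqrt(2*pi) * n^(n + 1/2) * e^(-n)."""
    return 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n


def stirling_ratio(n):
    """n! divided by its Stirling approximation; should tend to 1."""
    return math.exp(log_factorial(n) - stirling_log(n))


def crude_ratio(n):
    """log(n!) / (n log n); tends to 1, but only slowly."""
    return log_factorial(n) / (n * math.log(n))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, stirling_ratio(n), crude_ratio(n))
```

Working with logarithms throughout avoids overflow, since n! itself exceeds the floating-point range already for moderate n.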

3.3 Proof of Stirling’s formula (non-examinable)


The following identity may be verified by integration by parts
\[ \int_a^b f(x) \, dx = \frac{f(a) + f(b)}{2} (b - a) - \frac{1}{2} \int_a^b (x - a)(b - x) f''(x) \, dx. \]

Take f = log to obtain
\[ \int_n^{n+1} \log x \, dx = \frac{\log n + \log(n+1)}{2} + \frac{1}{2} \int_n^{n+1} (x - n)(n + 1 - x) \frac{1}{x^2} \, dx. \]
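Next, apply this with n replaced by k and sum over k = 1, . . . , n − 1 to obtain
\[ n \log n - n + 1 = \log(n!) - \frac{1}{2} \log n + \sum_{k=1}^{n-1} a_k \]
where
\[ a_k = \frac{1}{2} \int_0^1 \frac{x(1-x)}{(k+x)^2} \, dx \le \frac{1}{2k^2} \int_0^1 x(1-x) \, dx = \frac{1}{12k^2}. \]
Set
\[ A = \exp\left\{ 1 - \sum_{k=1}^{\infty} a_k \right\}. \]
We rearrange our equation for log(n!) and then take the exponential to obtain
\[ n! = A n^{n+1/2} e^{-n} \exp\left\{ \sum_{k=n}^{\infty} a_k \right\}. \]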

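It follows that
\[ n! \sim A n^{n+1/2} e^{-n} \]
and, from this asymptotic, we deduce that
\[ 2^{-2n} \binom{2n}{n} \sim \frac{\sqrt{2}}{A \sqrt{n}}. \]
We will complete the proof by showing that
\[ 2^{-2n} \binom{2n}{n} \sim \frac{1}{\sqrt{n\pi}} \]
so $A = \sqrt{2\pi}$, as required. Set
\[ I_n = \int_0^{\pi/2} \cos^n \theta \, d\theta. \]
Then $I_0 = \pi/2$ and $I_1 = 1$. For n ≥ 2 we can integrate by parts to obtain $I_n = \frac{n-1}{n} I_{n-2}$.
Then
\[ I_{2n} = \frac{1}{2} \cdot \frac{3}{4} \cdots \frac{2n-1}{2n} \cdot \frac{\pi}{2} = 2^{-2n} \binom{2n}{n} \frac{\pi}{2}, \]
\[ I_{2n+1} = \frac{2}{3} \cdot \frac{4}{5} \cdots \frac{2n}{2n+1} = \left( 2^{-2n} \binom{2n}{n} \right)^{-1} \frac{1}{2n+1}. \]
But $I_n$ is decreasing in n and $I_n / I_{n-2} \to 1$, so also $I_{2n} / I_{2n+1} \to 1$, and so
\[ \left( 2^{-2n} \binom{2n}{n} \right)^2 \sim \frac{2}{(2n+1)\pi} \sim \frac{1}{n\pi}. \]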

4 Basic properties of probability measures
Let (Ω, F, P) be a probability space. Recall that the probability measure P is a function
P : F → [0, 1] with P(Ω) = 1 which is countably additive, that is, has the property that,
for all sequences (An : n ∈ N) of disjoint sets in F,

\[ \mathbb{P}\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} \mathbb{P}(A_n). \]

4.1 Countable subadditivity


For all sequences (An : n ∈ N) in F, we have

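\[ \mathbb{P}\left( \bigcup_{n=1}^{\infty} A_n \right) \le \sum_{n=1}^{\infty} \mathbb{P}(A_n). \]

Thus, on dropping the requirement of disjointness, we get an inequality instead of an
equality. To see this, define a sequence of disjoint sets in F by

\[ B_1 = A_1, \qquad B_n = A_n \setminus (A_1 \cup \cdots \cup A_{n-1}), \quad n = 2, 3, \dots. \]

Then $B_n \subseteq A_n$ for all n, so $\mathbb{P}(B_n) \le \mathbb{P}(A_n)$. On the other hand, $\bigcup_{n=1}^{\infty} B_n = \bigcup_{n=1}^{\infty} A_n$ so, by
countable additivity,

\[ \mathbb{P}\left( \bigcup_{n=1}^{\infty} A_n \right) = \mathbb{P}\left( \bigcup_{n=1}^{\infty} B_n \right) = \sum_{n=1}^{\infty} \mathbb{P}(B_n) \le \sum_{n=1}^{\infty} \mathbb{P}(A_n). \]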

4.2 Continuity
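For all sequences (An : n ∈ N) in F such that $A_n \subseteq A_{n+1}$ for all n and $\bigcup_{n=1}^{\infty} A_n = A$, we
have
\[ \lim_{n \to \infty} \mathbb{P}(A_n) = \mathbb{P}(A). \]
To see this, define $B_n$ as above and note that $\bigcup_{k=1}^{n} B_k = A_n$ for all n and $\bigcup_{n=1}^{\infty} B_n = A$.
Then, by countable additivity,
\[ \mathbb{P}(A_n) = \mathbb{P}\left( \bigcup_{k=1}^{n} B_k \right) = \sum_{k=1}^{n} \mathbb{P}(B_k) \to \sum_{n=1}^{\infty} \mathbb{P}(B_n) = \mathbb{P}\left( \bigcup_{n=1}^{\infty} B_n \right) = \mathbb{P}(A). \]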

4.3 Inclusion-exclusion formula


For all sequences (A1 , . . . , An ) in F, we have
\[ \mathbb{P}\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} \mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}). \]

For n = 2 and n = 3, this says simply that

P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 ∩ A2 )

and

P(A1 ∪ A2 ∪ A3 ) = P(A1 ) + P(A2 ) + P(A3 )


− P(A1 ∩ A2 ) − P(A1 ∩ A3 ) − P(A2 ∩ A3 )
+ P(A1 ∩ A2 ∩ A3 ).

The general formula can also be written as


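\[ \begin{aligned}
\mathbb{P}\left( \bigcup_{i=1}^{n} A_i \right) = {} & \sum_{i=1}^{n} \mathbb{P}(A_i) \\
& - \sum_{1 \le i_1 < i_2 \le n} \mathbb{P}(A_{i_1} \cap A_{i_2}) \\
& + \sum_{1 \le i_1 < i_2 < i_3 \le n} \mathbb{P}(A_{i_1} \cap A_{i_2} \cap A_{i_3}) \\
& - \cdots + (-1)^{n+1} \mathbb{P}(A_1 \cap A_2 \cap \cdots \cap A_n).
\end{aligned} \]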

The case n = 2 is an easy consequence of additivity. We apply this case to obtain

P(A1 ∪ · · · ∪ An ) = P(A1 ∪ · · · ∪ An−1 ) + P(An ) − P(B1 ∪ · · · ∪ Bn−1 )

where Bk = Ak ∩ An . On using the formula for the case n − 1 for the terms on the right-
hand side, and rearranging, we obtain the formula for n. We omit the details. Hence the
general case follows by induction. We will give another proof using expectation later.
Note in particular the special case of inclusion-exclusion for the case of equally likely
outcomes, which we write in terms of the sizes of sets. Let A1 , . . . , An be subsets of a finite
set Ω. Then
\[ |A_1 \cup \cdots \cup A_n| = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} |A_{i_1} \cap \cdots \cap A_{i_k}|. \]
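
The counting form of inclusion-exclusion translates directly into code. Here is a minimal Python sketch, assuming the sets are given as a non-empty list of Python `set` objects; the function name is ours.

```python
from itertools import combinations


def union_size_by_inclusion_exclusion(sets):
    """Size of the union, computed from the sizes of all k-fold intersections."""
    n = len(sets)
    total = 0
    for k in range(1, n + 1):
        sign = (-1) ** (k + 1)
        # Inner sum over all index choices i_1 < ... < i_k.
        for chosen in combinations(sets, k):
            total += sign * len(set.intersection(*chosen))
    return total
```

The double loop visits all 2^n − 1 non-empty index subsets, so this is a check of the identity rather than an efficient way to compute a union.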

4.4 Bonferroni inequalities


If we truncate the sum in the inclusion-exclusion formula at the kth term, then the truncated
sum is an overestimate if k is odd, and is an underestimate if k is even. Put another
way, truncation gives an overestimate if the first term omitted is negative and an underestimate
if the first term omitted is positive. Thus, for n = 2,

P(A1 ∪ A2 ) 6 P(A1 ) + P(A2 )

while, for n = 3,

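\[ \begin{aligned}
& \mathbb{P}(A_1) + \mathbb{P}(A_2) + \mathbb{P}(A_3) - \mathbb{P}(A_1 \cap A_2) - \mathbb{P}(A_1 \cap A_3) - \mathbb{P}(A_2 \cap A_3) \\
& \qquad \le \mathbb{P}(A_1 \cup A_2 \cup A_3) \le \mathbb{P}(A_1) + \mathbb{P}(A_2) + \mathbb{P}(A_3).
\end{aligned} \]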

The case n = 2 is clear. In the formula

P(A1 ∪ · · · ∪ An ) = P(A1 ∪ · · · ∪ An−1 ) + P(An ) − P(B1 ∪ · · · ∪ Bn−1 )

used in the proof of inclusion-exclusion, suppose we substitute for P(A1 ∪ · · · ∪ An−1 ) using
inclusion-exclusion truncated at the kth term and for P(B1 ∪ · · · ∪ Bn−1 ) using inclusion-exclusion
truncated at the (k − 1)th term. Then we obtain on the right the inclusion-exclusion
formula truncated at the kth term. Suppose inductively that the Bonferroni
inequalities hold for n − 1 and that k is odd. Then k − 1 is even. So, on the right-hand
side, the substitution results in an overestimate. Similarly, if k is even, we get an
underestimate. Hence the inequalities hold for all n ≥ 2 by induction.
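
The alternating over- and underestimates can be observed on a small example in the equally likely setting, where probabilities are proportional to set sizes. The Python sketch below truncates the counting form of the inclusion-exclusion sum; the function name and the `terms` parameter are our own.

```python
from itertools import combinations


def truncated_inclusion_exclusion(sets, terms):
    """Inclusion-exclusion sum for |A_1 u ... u A_n|, keeping k = 1..terms."""
    total = 0
    for k in range(1, terms + 1):
        sign = (-1) ** (k + 1)
        total += sign * sum(len(set.intersection(*chosen))
                            for chosen in combinations(sets, k))
    return total


if __name__ == "__main__":
    example = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}, {1, 6}]
    exact = len(set().union(*example))
    for k in range(1, len(example) + 1):
        print(k, truncated_inclusion_exclusion(example, k), exact)
```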

5 More counting using inclusion-exclusion
Sometimes it is easier to count intersections of sets than unions. We give two examples of
this where the inclusion-exclusion formula can be used to advantage.

5.1 Surjections
The inclusion-exclusion formula gives an expression for the number of surjections from
{1, . . . , n} to {1, . . . , m}. Write Ω for the set of all functions from {1, . . . , n} to {1, . . . , m}
and consider the subsets

\[ A_i = \{ \omega \in \Omega : i \notin \{\omega(1), \dots, \omega(n)\} \}. \]

Then $(A_1 \cup \cdots \cup A_m)^c$ is the set of surjections. We have

\[ |\Omega| = m^n, \qquad |A_{i_1} \cap \cdots \cap A_{i_k}| = (m - k)^n \]

for distinct $i_1, \dots, i_k$. By inclusion-exclusion

\[ |A_1 \cup \cdots \cup A_m| = \sum_{k=1}^{m} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le m} |A_{i_1} \cap \cdots \cap A_{i_k}| = \sum_{k=1}^{m} (-1)^{k+1} \binom{m}{k} (m - k)^n \]

where we have used the fact that there are $\binom{m}{k}$ terms in the inner sum, all with the same
value. Hence the number of surjections from {1, . . . , n} to {1, . . . , m} is
\[ \sum_{k=0}^{m-1} (-1)^k \binom{m}{k} (m - k)^n. \]
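
As a sanity check, the formula for the number of surjections can be compared with a direct count for small n and m. The following Python sketch does this; the function names are illustrative.

```python
from itertools import product
from math import comb


def count_surjections(n, m):
    """Surjections {1..n} -> {1..m} by the inclusion-exclusion formula."""
    return sum((-1) ** k * comb(m, k) * (m - k) ** n for k in range(m))


def count_surjections_bruteforce(n, m):
    """Direct count: functions whose image contains all m values."""
    return sum(1 for f in product(range(m), repeat=n) if len(set(f)) == m)
```

When n < m there are no surjections, and the alternating sum correctly evaluates to 0.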

5.2 Derangements
A permutation of {1, . . . , n} is called a derangement if it has no fixed points. Using
inclusion-exclusion, we can calculate the probability that a random permutation is a de-
rangement. Write Ω for the set of permutations and A for the subset of derangements. For
i ∈ {1, . . . , n}, consider the event

Ai = {ω ∈ Ω : ω(i) = i}

and note that
\[ A = \left( \bigcup_{i=1}^{n} A_i \right)^c. \]

For $i_1 < \cdots < i_k$, each element of the intersection $A_{i_1} \cap \cdots \cap A_{i_k}$ corresponds to a permutation
of $\{1, \dots, n\} \setminus \{i_1, \dots, i_k\}$. So

\[ |A_{i_1} \cap \cdots \cap A_{i_k}| = (n - k)! \]

and so
\[ \mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}) = \frac{(n - k)!}{n!}. \]
Then, by inclusion-exclusion,
\[ \begin{aligned}
\mathbb{P}\left( \bigcup_{i=1}^{n} A_i \right) &= \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} \mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}) \\
&= \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \times \frac{(n - k)!}{n!} = \sum_{k=1}^{n} (-1)^{k+1} \frac{1}{k!}.
\end{aligned} \]

Here we have used the fact that there are $\binom{n}{k}$ terms in the inner sum, all having the same
value (n − k)!/n!. So we find
\[ \mathbb{P}(A) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \frac{1}{k!} = \sum_{k=0}^{n} (-1)^k \frac{1}{k!}. \]

Note that, as n → ∞, the proportion of permutations which are derangements tends to
the limit $e^{-1} = 0.3678\dots$.
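
The derangement probability can be computed exactly and compared both with a brute-force count and with the limit 1/e. A minimal Python sketch, with function names of our own choosing:

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def derangement_probability(n):
    """P(A) = sum_{k=0}^{n} (-1)^k / k!, as an exact fraction."""
    return sum(Fraction((-1) ** k, factorial(k)) for k in range(n + 1))


def derangement_probability_bruteforce(n):
    """Proportion of permutations of {0, ..., n-1} with no fixed point."""
    count = sum(1 for p in permutations(range(n))
                if all(p[i] != i for i in range(n)))
    return Fraction(count, factorial(n))
```

The convergence to 1/e is very fast, because the terms of the alternating series decay like 1/k!.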

6 Independence
6.1 Definition
Events A, B are said to be independent if

P(A ∩ B) = P(A)P(B).

More generally, we say that the events in a sequence (A1 , . . . , An ) or (An : n ∈ N) are
independent if, for all k ≥ 2 and all sequences of distinct indices $i_1, \dots, i_k$,

P(Ai1 ∩ · · · ∩ Aik ) = P(Ai1 ) × · · · × P(Aik ).

6.2 Pairwise independence does not imply independence


Suppose we toss a fair coin twice. Take as probability space

Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}

with all outcomes equally likely. Consider the events

A1 = {(0, 0), (0, 1)}, A2 = {(0, 0), (1, 0)}, A3 = {(0, 1), (1, 0)}.

Then P(A1 ) = P(A2 ) = P(A3 ) = 1/2 and

P(A1 ∩ A2 ) = P(A1 ∩ A3 ) = P(A2 ∩ A3 ) = 1/4

so all pairs A1 , A2 and A1 , A3 and A2 , A3 are independent. However

P(A1 ∩ A2 ∩ A3 ) = 0 ≠ P(A1 )P(A2 )P(A3 )

so the triple A1 , A2 , A3 is not independent.
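
This example is small enough to check mechanically. The Python sketch below encodes the four outcomes and the three events and tests the product rule for any chosen collection of events; the helper names `prob` and `factorises` are our own.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = [(0, 0), (0, 1), (1, 0), (1, 1)]
A1 = {(0, 0), (0, 1)}
A2 = {(0, 0), (1, 0)}
A3 = {(0, 1), (1, 0)}


def prob(event):
    """Equally likely outcomes: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(OMEGA))


def factorises(events):
    """True if P(intersection of the events) equals the product of the P's."""
    intersection = set(OMEGA).intersection(*events)
    product = Fraction(1)
    for event in events:
        product *= prob(event)
    return prob(intersection) == product


if __name__ == "__main__":
    for pair in combinations([A1, A2, A3], 2):
        print(factorises(pair))
    print(factorises([A1, A2, A3]))
```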

6.3 Independence and product spaces


Independence is a natural property of certain events when we consider equally likely outcomes
in a product space
\[ \Omega = \Omega_1 \times \cdots \times \Omega_n. \]
Consider a sequence of events A1 , . . . , An of the form

Ai = {(ω1 , . . . , ωn ) ∈ Ω : ωi ∈ Bi }

for some sets Bi ⊆ Ωi . Thus Ai depends only on ωi . Then

P(Ai ) = |Ai |/|Ω| = |Bi |/|Ωi |

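and
\[ A_1 \cap \cdots \cap A_n = \{ (\omega_1, \dots, \omega_n) \in \Omega : \omega_1 \in B_1, \dots, \omega_n \in B_n \} \]
so
\[ \mathbb{P}(A_1 \cap \cdots \cap A_n) = \frac{|B_1 \times \cdots \times B_n|}{|\Omega|} = \mathbb{P}(A_1) \times \cdots \times \mathbb{P}(A_n). \]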
The same argument shows that

P(Ai1 ∩ · · · ∩ Aik ) = P(Ai1 ) × · · · × P(Aik )

for any distinct indices i1 , . . . , ik , by switching some of the sets Bi to be Ωi . Hence the
events A1 , . . . , An are independent.

7 Some natural discrete probability distributions
The word ‘distribution’ is used interchangeably with ‘probability measure’, especially when
in later sections we are describing the probabilities associated with some random variable.
A probability measure µ on (Ω, F) is said to be discrete if there is a countable set S ⊆ Ω
and a function (px : x ∈ S) such that, for all events A,
\[ \mu(A) = \sum_{x \in A \cap S} p_x. \]

We consider only the case where {x} ∈ F for all x ∈ S. Then px = µ({x}). We refer to
(px : x ∈ S) as the mass function for µ.

7.1 Bernoulli distribution


The Bernoulli distribution of parameter p ∈ [0, 1] is the probability measure on {0, 1} given
by
p0 = 1 − p, p1 = p.
We use this to model the number of heads obtained on tossing a biased coin once.

7.2 Binomial distribution


The binomial distribution B(N, p) of parameters N ∈ Z+ and p ∈ [0, 1] is the probability
measure on {0, 1, . . . , N } given by
\[ p_k = p_k(N, p) = \binom{N}{k} p^k (1 - p)^{N-k}. \]

We use this to model the number of heads obtained on tossing a biased coin N times.

7.3 Multinomial distribution


More generally, the multinomial distribution M (N, p1 , . . . , pk ) of parameters N ∈ Z+ and
(p1 , . . . , pk ) is the probability measure on ordered partitions (n1 , . . . , nk ) of N given by
\[ p_{(n_1, \dots, n_k)} = \binom{N}{n_1 \, \dots \, n_k} p_1^{n_1} \times \cdots \times p_k^{n_k}. \]

Here p1 , . . . , pk are non-negative parameters, with p1 + · · · + pk = 1, and n1 + · · · + nk = N .


We use this to model the number of balls in each of k boxes, when we assign N balls
independently to the boxes, so that each ball lands in box i with probability pi .

7.4 Geometric distribution
The geometric distribution of parameter p is the probability measure on Z+ = {0, 1, . . . }
given by
\[ p_k = p(1 - p)^k. \]
We use this to model the number of tails obtained on tossing a biased coin until the first
head appears.
The probability measure on N = {1, 2, . . . } given by

\[ p_k = p(1 - p)^{k-1} \]

is also sometimes called the geometric distribution of parameter p. This models the number
of tosses of a biased coin up to the first head. You should always be clear which version of
the geometric distribution is intended.

7.5 Poisson distribution


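The Poisson distribution P (λ) of parameter λ ∈ (0, ∞) is the probability measure on Z+
given by
\[ p_k = p_k(\lambda) = e^{-\lambda} \frac{\lambda^k}{k!}. \]
Note that, for λ fixed and N → ∞,
\[ p_k(N, \lambda/N) = \binom{N}{k} \left( \frac{\lambda}{N} \right)^k \left( 1 - \frac{\lambda}{N} \right)^{N-k} = \frac{N(N-1) \cdots (N-k+1)}{N^k} \left( 1 - \frac{\lambda}{N} \right)^{N-k} \frac{\lambda^k}{k!}. \]

Now
\[ \frac{N(N-1) \cdots (N-k+1)}{N^k} \to 1, \qquad \left( 1 - \frac{\lambda}{N} \right)^{N} \to e^{-\lambda}, \qquad \left( 1 - \frac{\lambda}{N} \right)^{-k} \to 1 \]
so
\[ p_k(N, \lambda/N) \to p_k(\lambda). \]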
Hence the Poisson distribution arises as the limit as N → ∞ of the binomial distribution
with parameters N and p = λ/N .
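
The convergence of the binomial mass function to the Poisson mass function can be seen numerically. Here is a small Python sketch; the function names and the cut-off `kmax` are our own choices, and the comparison is over k = 0, . . . , kmax only.

```python
import math


def binomial_pmf(k, N, p):
    """p_k(N, p) = C(N, k) p^k (1 - p)^(N - k)."""
    return math.comb(N, k) * p ** k * (1 - p) ** (N - k)


def poisson_pmf(k, lam):
    """p_k(lambda) = e^(-lambda) lambda^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def max_gap(N, lam, kmax=10):
    """Largest |p_k(N, lambda/N) - p_k(lambda)| over k = 0, ..., kmax."""
    return max(abs(binomial_pmf(k, N, lam / N) - poisson_pmf(k, lam))
               for k in range(kmax + 1))


if __name__ == "__main__":
    for N in (10, 100, 1000, 10000):
        print(N, max_gap(N, 2.0))
```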

8 Conditional probability
8.1 Definition
Given events A, B with P(B) > 0, the conditional probability of A given B is defined by
\[ \mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}. \]
For fixed B, we can define a new function P̃ : F → [0, 1] by
P̃(A) = P(A|B).
Then P̃(Ω) = P(B)/P(B) = 1 and, for any sequence (An : n ∈ N) of disjoint sets in F,

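\[ \tilde{\mathbb{P}}\left( \bigcup_{n=1}^{\infty} A_n \right) = \frac{\mathbb{P}\left( \left( \bigcup_{n=1}^{\infty} A_n \right) \cap B \right)}{\mathbb{P}(B)} = \frac{\mathbb{P}\left( \bigcup_{n=1}^{\infty} (A_n \cap B) \right)}{\mathbb{P}(B)} = \sum_{n=1}^{\infty} \frac{\mathbb{P}(A_n \cap B)}{\mathbb{P}(B)} = \sum_{n=1}^{\infty} \tilde{\mathbb{P}}(A_n). \]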

So P̃ is a probability measure, the conditional probability measure given B.

8.2 Law of total probability


Let (Bn : n ∈ N) be a sequence of disjoint events of positive probability, whose union is Ω.
Then, for all events A,
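\[ \mathbb{P}(A) = \sum_{n=1}^{\infty} \mathbb{P}(A|B_n) \mathbb{P}(B_n). \]
For, by countable additivity, we have

\[ \mathbb{P}(A) = \mathbb{P}\left( A \cap \left( \bigcup_{n=1}^{\infty} B_n \right) \right) = \mathbb{P}\left( \bigcup_{n=1}^{\infty} (A \cap B_n) \right) = \sum_{n=1}^{\infty} \mathbb{P}(A \cap B_n) = \sum_{n=1}^{\infty} \mathbb{P}(A|B_n) \mathbb{P}(B_n). \]

The condition of positive probability may be omitted provided we agree to interpret
P(A|Bn )P(Bn ) as 0 whenever P(Bn ) = 0.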

8.3 Bayes’ formula


Let (Bn : n ∈ N) be a sequence of disjoint events whose union is Ω, and let A be an event
of positive probability. Then
\[ \mathbb{P}(B_n|A) = \frac{\mathbb{P}(A|B_n) \mathbb{P}(B_n)}{\sum_{k=1}^{\infty} \mathbb{P}(A|B_k) \mathbb{P}(B_k)}. \]

We make the same convention as above when P(Bk ) = 0. The formula follows directly
from the definition of conditional probability, using the law of total probability.
This formula is the basis of Bayesian statistics. We hold a prior view of the probabilities
of the events Bn , and we have a model giving us the conditional probability of the event
A given each possible Bn . Then Bayes’ formula tells us how to calculate the posterior
probabilities for the Bn , given that the event A occurs.

8.4 False positives for a rare condition
A rare medical condition A affects 0.1% of the population. A test is performed on a
randomly chosen person, which is known empirically to give a positive result for 98% of
people affected by the condition and 1% of those unaffected. Suppose the test is positive.
What is the probability that the chosen person has condition A?
We use Bayes’ formula
\[ \mathbb{P}(A|P) = \frac{\mathbb{P}(P|A) \mathbb{P}(A)}{\mathbb{P}(P|A) \mathbb{P}(A) + \mathbb{P}(P|A^c) \mathbb{P}(A^c)} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.01 \times 0.999} = 0.089\dots. \]
The implied probability model is some large finite set Ω = {1, . . . , N }, representing the
whole population, with subsets A and P such that
\[
|A| = \frac{1}{1000} N, \qquad |A \cap P| = \frac{98}{100} |A|, \qquad |A^c \cap P| = \frac{1}{100} |A^c|.
\]
We used Bayes’ formula to work out what proportion of the set P is contained in A. You
may prefer to do it directly.
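
The calculation above can be checked with exact rational arithmetic. The following is a minimal sketch; the function name `posterior` and its parameter names are chosen here for illustration and do not appear in the notes.

```python
from fractions import Fraction

def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' formula."""
    true_pos = sensitivity * prior                    # P(P|A) P(A)
    false_pos = false_positive_rate * (1 - prior)     # P(P|A^c) P(A^c)
    return true_pos / (true_pos + false_pos)

# Numbers from the example: 0.1% prevalence, 98% detection, 1% false positives.
p = posterior(Fraction(1, 1000), Fraction(98, 100), Fraction(1, 100))
print(p, float(p))
```

The exact value is 98/1097, so fewer than one in ten positive tests corresponds to an affected person, even though the test itself is fairly accurate.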

8.5 Knowledge changes probabilities in surprising ways


Consider the three statements:
(a) I have two children the elder of whom is a boy,
(b) I have two children one of whom is a boy,
(c) I have two children one of whom is a boy born on a Thursday.
What is the probability that both children are boys? Since we have no further information,
we will assume all outcomes are equally likely. Write BG for the event that the elder is a
boy and the younger a girl. Write GT for the event that the elder is a girl and the younger
a boy born on a Thursday, and write T N for the event that the elder is a boy born on a
Thursday and the younger a boy born on another day. Then
(a) P(BB|BG ∪ BB) = 1/2,
(b) P(BB|BB ∪ BG ∪ GB) = 1/3,
(c) P(N T ∪ T N ∪ T T |N T ∪ T N ∪ T T ∪ T G ∪ GT ) = 13/27.
In (c) we used
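\[
P(NT \cup TN \cup TT) = \frac{6}{14}\cdot\frac{1}{14} + \frac{1}{14}\cdot\frac{6}{14} + \frac{1}{14}\cdot\frac{1}{14},
\]
\[
P(NT \cup TN \cup TT \cup TG \cup GT) = \frac{6}{14}\cdot\frac{1}{14} + \frac{1}{14}\cdot\frac{6}{14} + \frac{1}{14}\cdot\frac{1}{14} + \frac{1}{14}\cdot\frac{7}{14} + \frac{7}{14}\cdot\frac{1}{14}.
\]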
Thus, learning about the gender of one child biases the probabilities for the other. Also,
learning a seemingly irrelevant additional fact pulls the probabilities back towards evens.
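
The three conditional probabilities can also be found by brute-force enumeration. The sketch below assumes, as in the text, that all 196 combinations of sex and weekday for the two children are equally likely; the label 3 for Thursday and the helper name `cond_prob` are arbitrary choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

# Each child is a (sex, weekday) pair; weekday 3 stands for Thursday (arbitrary label).
children = [(s, d) for s in "BG" for d in range(7)]
outcomes = list(product(children, repeat=2))  # (elder, younger): 196 equally likely outcomes

def cond_prob(event, given):
    """P(event | given) under the uniform distribution on outcomes."""
    hits = [w for w in outcomes if given(w)]
    return Fraction(sum(1 for w in hits if event(w)), len(hits))

def both_boys(w):
    return w[0][0] == "B" and w[1][0] == "B"

def thursday_boy(child):
    return child == ("B", 3)

pa = cond_prob(both_boys, lambda w: w[0][0] == "B")                           # statement (a)
pb = cond_prob(both_boys, lambda w: w[0][0] == "B" or w[1][0] == "B")         # statement (b)
pc = cond_prob(both_boys, lambda w: thursday_boy(w[0]) or thursday_boy(w[1])) # statement (c)
print(pa, pb, pc)
```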

8.6 Simpson’s paradox
We first note a version of the law of total probability for conditional probabilities. Let A
and B be events with P(B) > 0. Let (Ωn : n ∈ N) be a sequence of disjoint events whose
union is Ω. Then
\[
P(A|B) = \sum_{n=1}^{\infty} P(A|B \cap \Omega_n) P(\Omega_n|B).
\]

This is simply the law of total probability for the conditional probability P̃(A) = P(A|B).
For we have

\[
\tilde{P}(A|\Omega_n) = \frac{\tilde{P}(A \cap \Omega_n)}{\tilde{P}(\Omega_n)} = \frac{P(A \cap \Omega_n \cap B)}{P(\Omega_n \cap B)} = P(A|B \cap \Omega_n).
\]

Here is an example of Simpson’s paradox. The interval [0, 1] can be made into a
probability space such that P((a, b]) = b − a whenever 0 ≤ a ≤ b ≤ 1. Fix ε ∈ (0, 1/4) and
consider the events

A = (ε/2, 1/2 + ε/2], B = (1/2 − ε/2, 1 − ε/2], Ω1 = (0, 1/2], Ω2 = (1/2, 1].

Then
\[
P(A|B) = 2\varepsilon < \frac{1}{2} = P(A)
\]
but
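\[
P(A|B \cap \Omega_1) = 1 > 1 - \varepsilon = P(A|\Omega_1) \quad\text{and}\quad P(A|B \cap \Omega_2) = \frac{\varepsilon}{1-\varepsilon} > \varepsilon = P(A|\Omega_2). \tag{1}
\]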
We cannot conclude from the fact that ‘B attracts A on Ω1 and on Ω2 ’ that ‘B attracts A
on Ω’.
Note that
\[
P(\Omega_1) = P(\Omega_2) = \frac{1}{2}, \qquad P(\Omega_1|B) = \varepsilon, \qquad P(\Omega_2|B) = 1 - \varepsilon.
\]
According to the laws of total probability, these are the correct weights to combine the
conditional probabilities from (1) to give
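\[
P(A) = (1-\varepsilon) \times \frac{1}{2} + \varepsilon \times \frac{1}{2} = \frac{1}{2}, \qquad P(A|B) = 1 \times \varepsilon + \frac{\varepsilon}{1-\varepsilon}(1-\varepsilon) = 2\varepsilon
\]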
in agreement with our prior calculations. In this example, conditioning on B significantly
alters the weights. It is this which leads to the apparently paradoxical outcome.
More generally, Simpson’s paradox refers to any instance where a positive (or negative)
association between events, when conditioned by the elements of a partition, is reversed
when the same events are considered without conditioning.
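
The interval example can be verified numerically. The sketch below takes ε = 1/10 (any value in (0, 1/4) would do) and represents each half-open interval (lo, hi] by its endpoints; the helper names `length` and `cond` are assumptions introduced for this illustration.

```python
from fractions import Fraction

def length(*intervals):
    """Length of the intersection of half-open intervals (lo, hi]."""
    lo = max(i[0] for i in intervals)
    hi = min(i[1] for i in intervals)
    return max(hi - lo, Fraction(0))

def cond(a, *given):
    """P(a | intersection of the given intervals) for the uniform measure on (0, 1]."""
    return length(a, *given) / length(*given)

eps = Fraction(1, 10)
half = Fraction(1, 2)
A = (eps / 2, half + eps / 2)
B = (half - eps / 2, 1 - eps / 2)
O1 = (Fraction(0), half)          # Omega_1
O2 = (half, Fraction(1))          # Omega_2
whole = (Fraction(0), Fraction(1))

print(cond(A, B), cond(A, whole))      # B lowers the probability of A overall
print(cond(A, B, O1), cond(A, O1))     # but raises it on Omega_1
print(cond(A, B, O2), cond(A, O2))     # and on Omega_2
```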

9 Random variables
9.1 Definitions
Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → R such
that, for all x ∈ R,
{X ≤ x} = {ω ∈ Ω : X(ω) ≤ x} ∈ F.
For any event A, the indicator function 1A is a random variable. Here
\[
1_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{if } \omega \in A^c. \end{cases}
\]

Given a random variable X, the distribution function of X is the function FX : R → [0, 1]
given by
FX (x) = P(X ≤ x).
Here P(X ≤ x) is the usual shorthand for P({ω ∈ Ω : X(ω) ≤ x}). Note that FX (x) → 0
as x → −∞ and FX (x) → 1 as x → ∞. Also, FX is non-decreasing and right-continuous,
that is, for all x ∈ R and all h > 0,
FX (x) ≤ FX (x + h), FX (x + h) → FX (x) as h → 0.
More generally, a random variable in R^n is a function X = (X1 , . . . , Xn ) : Ω → R^n such
that, for all x1 , . . . , xn ∈ R,
{X1 ≤ x1 , . . . , Xn ≤ xn } ∈ F.
It is straightforward to check this condition is equivalent to the condition that X1 , . . . , Xn
are random variables in R. For such a random variable X, the joint distribution function
of X is the function FX : R^n → [0, 1] given by
FX (x1 , . . . , xn ) = P(X1 ≤ x1 , . . . , Xn ≤ xn ).
We return for a while to the scalar case n = 1. We say X is discrete if it takes only
countably many values. Given a discrete random variable X, with values in a countable
set S, we obtain a discrete distribution µX on R by setting
µX (B) = P(X ∈ B).
Then µX has mass function given by
px = µX ({x}) = P(X = x), x ∈ S.
We call µX the distribution of X and (px : x ∈ S) the mass function of X. We say that
discrete random variables X1 , . . . , Xn (with values in S1 , . . . , Sn say) are independent if,
for all x1 ∈ S1 , . . . , xn ∈ Sn ,
P(X1 = x1 , . . . , Xn = xn ) = P(X1 = x1 ) × · · · × P(Xn = xn ).

9.2 Doing without measure theory
The condition that {X ≤ x} ∈ F for all x is a measurability condition. It guarantees that
P(X ≤ x) is well defined. While this is obvious for a countable probability space when F is
the set of all subsets, in general it requires some attention. For example, in Section 10.1, we
implicitly use the fact that, for a sequence of non-negative random variables (Xn : n ∈ N),
the sum ∑n Xn (over n ≥ 1) is also a non-negative random variable, that is to say, for all x
\[
\Big\{ \sum_{n=1}^{\infty} X_n \le x \Big\} \in \mathcal{F}.
\]

This is not hard to show, using the fact that F is a σ-algebra, but we do not give details
and we will not focus on such questions in this course.

9.3 Number of heads


Consider the probability space
\[
\Omega = \{0, 1\}^N
\]
with mass function (pω : ω ∈ Ω) given by
\[
p_\omega = \prod_{k=1}^{N} p^{\omega_k} (1-p)^{1-\omega_k},
\]
modelling a sequence of N tosses of a biased coin. We can define random variables
X1 , . . . , XN on Ω by
Xk (ω) = ωk .
Then X1 , . . . , XN all have Bernoulli distribution of parameter p. For x1 , . . . , xN ∈ {0, 1},
\[
P(X_1 = x_1, \ldots, X_N = x_N) = \prod_{k=1}^{N} p^{x_k} (1-p)^{1-x_k} = P(X_1 = x_1) \times \cdots \times P(X_N = x_N)
\]

so the random variables X1 , . . . , XN are independent. Define a further random variable


SN on Ω by SN = X1 + · · · + XN , that is,
SN (ω) = X1 (ω) + · · · + XN (ω) = ω1 + · · · + ωN .
Then, for k = 0, 1, . . . , N ,
\[
|\{S_N = k\}| = \binom{N}{k}
\]
and pω = p^k (1 − p)^(N−k) for all ω ∈ {SN = k}, so
\[
P(S_N = k) = \binom{N}{k} p^k (1-p)^{N-k}
\]
showing that SN has binomial distribution of parameters N and p. We write SN ∼ B(N, p)
for short.
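
As a sanity check, the distribution of SN can be computed by summing the mass function over the whole sample space {0, 1}^N and compared with the binomial formula. This is a small illustrative sketch with N = 5 and p = 1/3 chosen arbitrarily; the names `mass`, `dist` and `binom` are not from the notes.

```python
from fractions import Fraction
from itertools import product
from math import comb

N, p = 5, Fraction(1, 3)

def mass(omega):
    """Mass p_omega of a sequence omega of N tosses (1 = head, 0 = tail)."""
    out = Fraction(1)
    for w in omega:
        out *= p if w == 1 else 1 - p
    return out

# Distribution of S_N obtained by enumerating all 2^N outcomes.
dist = [Fraction(0)] * (N + 1)
for omega in product((0, 1), repeat=N):
    dist[sum(omega)] += mass(omega)

# Binomial mass function B(N, p) for comparison.
binom = [comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(N + 1)]
print(dist == binom, sum(dist))
```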

10 Expectation
10.1 Definition
Let (Ω, F, P) be a probability space. Recall that a random variable is a function X : Ω → R
such that {X ≤ x} ∈ F for all x ∈ R. A non-negative random variable is a function
X : Ω → [0, ∞] such that {X ≤ x} ∈ F for all x ≥ 0. Note that we do not allow random
variables to take the values ±∞ but we do allow non-negative random variables to take
the value ∞. Write F^+ for the set of non-negative random variables.
Theorem 10.1. There is a unique map
E : F^+ → [0, ∞]
with the following properties:
(a) E(1A ) = P(A) for all A ∈ F,
(b) E(λX) = λE(X) for all λ ∈ [0, ∞) and all X ∈ F^+ ,
(c) E(∑n Xn ) = ∑n E(Xn ) for all sequences (Xn : n ∈ N) in F^+ .

In (b), we apply the usual rule that 0 × ∞ = 0. The map E is called the expectation.
Proof for Ω countable. By choosing an enumeration, we reduce to the case where Ω =
{1, . . . , N } or Ω = N. We give details for Ω = N. Note that we can write any X ∈ F^+
in the form X = ∑n Xn , where Xn (ω) = X(n)1{n} (ω). So, for any map E : F^+ → [0, ∞]
with the given properties,
\[
E(X) = \sum_n E(X_n) = \sum_n X(n) P(\{n\}) = \sum_\omega X(\omega) P(\{\omega\}). \tag{2}
\]

Hence there is at most one such map. On the other hand, if we use (2) to define E, then
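\[
E(1_A) = \sum_\omega 1_A(\omega) P(\{\omega\}) = P(A)
\]
and
\[
E(\lambda X) = \sum_\omega \lambda X(\omega) P(\{\omega\}) = \lambda E(X)
\]
and
\[
E\Big(\sum_n X_n\Big) = \sum_\omega \sum_n X_n(\omega) P(\{\omega\}) = \sum_n \sum_\omega X_n(\omega) P(\{\omega\}) = \sum_n E(X_n)
\]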

so E satisfies (a), (b) and (c).


We will allow ourselves to use the theorem in its general form.
A random variable X is said to be integrable if E|X| < ∞, and square-integrable if
E(X^2) < ∞. We define the expectation of an integrable random variable X by setting
E(X) = E(X^+) − E(X^−)
where X^± = max{±X, 0}.

10.2 Properties of expectation
For non-negative random variables X, Y , by taking X1 = X, X2 = Y and Xn = 0 for n > 2
in the countable additivity property (c), we obtain
E(X + Y ) = E(X) + E(Y ).
This shows in particular that E(X) ≤ E(X + Y ) and hence
E(X) ≤ E(Y ) whenever X ≤ Y.
Also, for any non-negative random variable X and all n ∈ N (see Section 11.1 below)
P(X ≥ 1/n) ≤ nE(X)
so
P(X = 0) = 1 whenever E(X) = 0.
For integrable random variables X, Y , we have
\[
(X+Y)^+ + X^- + Y^- = (X+Y)^- + X^+ + Y^+
\]
so
\[
E((X+Y)^+) + E(X^-) + E(Y^-) = E((X+Y)^-) + E(X^+) + E(Y^+)
\]
and so
E(X + Y ) = E(X) + E(Y ).
Let X be a discrete random variable, taking values (xn : n ∈ N) say.
Then, if X is non-negative or X is integrable, we can compute the expectation using the
formula
\[
E(X) = \sum_n x_n P(X = x_n).
\]
To see this, for X non-negative, we can write X = ∑n Xn , where Xn (ω) = xn 1{X=xn } (ω).
Then the formula follows by countable additivity. For X integrable, we subtract the
formulas for X^+ and X^− . Similarly, for any discrete random variable X and any
non-negative function f ,
\[
E(f(X)) = \sum_n f(x_n) P(X = x_n). \tag{3}
\]
For independent discrete random variables X, Y , and for non-negative functions f and
g, we have
\[
\begin{aligned}
E(f(X) g(Y)) &= \sum_{x,y} f(x) g(y) P(X = x, Y = y) \\
&= \sum_{x,y} f(x) g(y) P(X = x) P(Y = y) \\
&= \sum_x f(x) P(X = x) \sum_y g(y) P(Y = y) = E(f(X)) E(g(Y)).
\end{aligned}
\]

This formula remains valid without the assumption that f and g are non-negative,
provided only that f (X) and g(Y ) are integrable. Indeed, it remains valid without the
assumption that X and Y are discrete, but we will not prove this.

10.3 Variance and covariance


The variance of an integrable random variable X of mean µ is defined by
\[
\operatorname{var}(X) = E((X - \mu)^2).
\]
Note that
\[
(X - \mu)^2 = X^2 - 2\mu X + \mu^2
\]
so
\[
\operatorname{var}(X) = E(X^2 - 2X\mu + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - E(X)^2.
\]
Also, by taking f (x) = (x − µ)^2 in (3), for X integrable and discrete,
\[
\operatorname{var}(X) = \sum_n (x_n - \mu)^2 P(X = x_n).
\]

The covariance of square-integrable random variables X, Y of means µ, ν is defined by


cov(X, Y ) = E((X − µ)(Y − ν)).
Note that
(X − µ)(Y − ν) = XY − µY − νX + µν
so
cov(X, Y ) = E(XY ) − µE(Y ) − νE(X) + µν = E(XY ) − E(X)E(Y ).
For independent integrable random variables X, Y , we have E(XY ) = E(X)E(Y ), so
cov(X, Y ) = 0.
For square-integrable random variables X, Y , we have
var(X + Y ) = var(X) + 2 cov(X, Y ) + var(Y ).
To see this, we note that
(X + Y − µ − ν)^2 = (X − µ)^2 + 2(X − µ)(Y − ν) + (Y − ν)^2
and take expectations. In particular, if X, Y are independent, then
var(X + Y ) = var(X) + var(Y ).
The correlation of square-integrable random variables X, Y of positive variance is defined by
\[
\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)} \sqrt{\operatorname{var}(Y)}}.
\]
Note that corr(X, Y ) does not change when X or Y is multiplied by a positive constant.
The Cauchy–Schwarz inequality, which will be discussed in Section 11.3, shows that
corr(X, Y ) ∈ [−1, 1] for all X, Y .

10.4 Zero covariance does not imply independence
It is a common mistake to confuse the condition E(XY ) = E(X)E(Y ) with independence.
Here is an example to illustrate the difference. Given independent Bernoulli random vari-
ables X1 , X2 , X3 , all with parameter 1/2, consider the random variables

Y1 = 2X1 − 1, Y2 = 2X2 − 1, Z1 = Y1 X3 , Z2 = Y2 X3 .

Then

E(Y1 ) = E(Y2 ) = 0, E(Z1 ) = E(Y1 )E(X3 ) = 0, E(Z2 ) = E(Y2 )E(X3 ) = 0

so
E(Z1 Z2 ) = E(Y1 Y2 X3 ) = 0 = E(Z1 )E(Z2 ).
On the other hand {Z1 = 0} = {Z2 = 0} = {X3 = 0}, so

P(Z1 = 0, Z2 = 0) = 1/2 ≠ 1/4 = P(Z1 = 0)P(Z2 = 0)

which shows that Z1 , Z2 are not independent.
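
The example is small enough to check by enumerating the eight equally likely values of (X1 , X2 , X3 ). In the sketch below, the helper names `Z` and `E` are introduced only for this check.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product((0, 1), repeat=3))  # values of (X1, X2, X3), each with probability 1/8
prob = Fraction(1, 8)

def Z(x):
    """Return (Z1, Z2) for the outcome x = (x1, x2, x3)."""
    y1, y2 = 2 * x[0] - 1, 2 * x[1] - 1
    return y1 * x[2], y2 * x[2]

def E(f):
    """Expectation of f(Z1, Z2)."""
    return sum(prob * f(*Z(x)) for x in outcomes)

cov = E(lambda z1, z2: z1 * z2) - E(lambda z1, z2: z1) * E(lambda z1, z2: z2)
p_joint = E(lambda z1, z2: int(z1 == 0 and z2 == 0))
p_prod = E(lambda z1, z2: int(z1 == 0)) * E(lambda z1, z2: int(z2 == 0))
print(cov, p_joint, p_prod)  # zero covariance, yet the joint probability is not the product
```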

10.5 Calculation of some expectations and variances


For an event A with P(A) = p, we have, using E(X^2) − E(X)^2 for the variance,

E(1A ) = p, var(1A ) = p(1 − p).

Sometimes a non-negative integer-valued random variable X can be written conveniently
as a sum of indicator functions of events An , with P(An ) = pn say. Then its expectation is
given by
\[
E(X) = E\Big(\sum_n 1_{A_n}\Big) = \sum_n p_n.
\]

For such random variables X, we always have
\[
X = \sum_{n=1}^{\infty} 1_{\{X \ge n\}}
\]
so
\[
E(X) = \sum_{n=1}^{\infty} P(X \ge n). \tag{4}
\]

More generally, the following calculation can be justified using Fubini’s theorem for any
non-negative random variable X
\[
E(X) = E \int_0^{\infty} 1_{\{x \le X\}} \, dx = \int_0^{\infty} E(1_{\{X \ge x\}}) \, dx = \int_0^{\infty} P(X \ge x) \, dx.
\]

The expectation and variance of a binomial random variable SN ∼ B(N, p) are given
by
E(SN ) = N p, var(SN ) = N p(1 − p).
This can be seen by writing SN as the sum of N independent Bernoulli random variables.
If G is a geometric random variable of parameter p ∈ (0, 1], then
\[
P(G \ge n) = \sum_{k=n}^{\infty} (1-p)^k p = (1-p)^n
\]
so we can use (4) to obtain
E(G) = (1 − p)/p.
For a Poisson random variable X of parameter λ, we have
E(X) = λ, var(X) = λ.
To see this, we calculate
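\[
E(X) = \sum_{n=0}^{\infty} n P(X = n) = \sum_{n=0}^{\infty} n e^{-\lambda} \frac{\lambda^n}{n!} = \lambda \sum_{n=1}^{\infty} e^{-\lambda} \frac{\lambda^{n-1}}{(n-1)!} = \lambda
\]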

and
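\[
E(X(X-1)) = \sum_{n=0}^{\infty} n(n-1) P(X = n) = \sum_{n=0}^{\infty} n(n-1) e^{-\lambda} \frac{\lambda^n}{n!} = \lambda^2 e^{-\lambda} \sum_{n=2}^{\infty} \frac{\lambda^{n-2}}{(n-2)!} = \lambda^2
\]
so
\[
\operatorname{var}(X) = E(X^2) - E(X)^2 = E(X(X-1)) + E(X) - E(X)^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.
\]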
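
These values are easy to confirm numerically by truncating the series. The following sketch uses λ = 2.5 and p = 0.3, both arbitrary illustrative choices, and cuts the sums off at a point where the remaining terms are negligible in floating point.

```python
from math import exp, factorial, isclose

# Poisson(lam): mean and variance from the truncated mass function.
lam = 2.5
pmf = [exp(-lam) * lam**n / factorial(n) for n in range(100)]
mean = sum(n * q for n, q in enumerate(pmf))
var = sum(n * n * q for n, q in enumerate(pmf)) - mean**2

# Geometric(p) on {0, 1, 2, ...}: compare the tail-sum formula (4) with the direct sum.
p = 0.3
tail_sum = sum((1 - p) ** n for n in range(1, 500))           # sum of P(G >= n)
direct = sum(k * (1 - p) ** k * p for k in range(500))        # sum of k P(G = k)

print(mean, var, tail_sum, direct, (1 - p) / p)
```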

10.6 Conditional expectation


Given a non-negative random variable X and an event B with P(B) > 0, we define the
conditional expectation of X given B by
\[
E(X|B) = \frac{E(X 1_B)}{P(B)}.
\]
Then, for a sequence of disjoint events (Ωn : n ∈ N) with union Ω, we have the law of total
expectation

\[
E(X) = \sum_{n=1}^{\infty} E(X|\Omega_n) P(\Omega_n).
\]
To see this, note that X = ∑n Xn (summing over n ≥ 1), where Xn = X1Ωn so, by countable additivity,
\[
E(X) = \sum_{n=1}^{\infty} E(X_n) = \sum_{n=1}^{\infty} E(X|\Omega_n) P(\Omega_n).
\]

10.7 Inclusion-exclusion via expectation
Note the identity
\[
\prod_{i=1}^{n} (1 - x_i) = \sum_{k=0}^{n} (-1)^k \sum_{1 \le i_1 < \cdots < i_k \le n} (x_{i_1} \times \cdots \times x_{i_k}).
\]

Given events A1 , . . . , An , fix ω ∈ Ω and set xi = 1Ai (ω). Then


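\[
\prod_{i=1}^{n} (1 - x_i) = 1_{A_1^c \cap \cdots \cap A_n^c}(\omega) = 1 - 1_{A_1 \cup \cdots \cup A_n}(\omega)
\]
and
\[
x_{i_1} \times \cdots \times x_{i_k} = 1_{A_{i_1} \cap \cdots \cap A_{i_k}}(\omega).
\]
Hence
\[
1_{A_1 \cup \cdots \cup A_n} = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < \cdots < i_k \le n} 1_{A_{i_1} \cap \cdots \cap A_{i_k}}.
\]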

On taking expectations we obtain the inclusion-exclusion formula.

11 Inequalities
11.1 Markov’s inequality
Let X be a non-negative random variable and let λ ∈ (0, ∞). Then
P(X ≥ λ) ≤ E(X)/λ.
To see this, we note the following inequality of random variables
λ1{X≥λ} ≤ X
and take expectations to obtain λP(X ≥ λ) ≤ E(X).

11.2 Chebyshev’s inequality


Let X be an integrable random variable with mean µ and let λ ∈ (0, ∞). Then
P(|X − µ| ≥ λ) ≤ var(X)/λ^2 .
To see this, note the inequality
λ^2 1{|X−µ|≥λ} ≤ (X − µ)^2
and take expectations to obtain λ^2 P(|X − µ| ≥ λ) ≤ var(X).
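
To get a feel for how sharp these bounds are, one can compare them with exact tail probabilities. The sketch below does this for a B(10, 1/2) random variable, chosen only as a convenient example; the helper `upper_tail` and the threshold `lam = 3` are assumptions of this illustration.

```python
from fractions import Fraction
from math import comb

N, p = 10, Fraction(1, 2)
pmf = {k: comb(N, k) * p**k * (1 - p) ** (N - k) for k in range(N + 1)}
mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

def upper_tail(level):
    """Exact value of P(S >= level)."""
    return sum(q for k, q in pmf.items() if k >= level)

# Markov: P(S >= l) <= E(S)/l for every l > 0.
markov_ok = all(upper_tail(l) <= mean / l for l in range(1, N + 1))

# Chebyshev: P(|S - mean| >= lam) <= var(S)/lam^2.
lam = 3
cheb_lhs = sum(q for k, q in pmf.items() if abs(k - mean) >= lam)
cheb_rhs = var / lam**2
print(markov_ok, cheb_lhs, cheb_rhs)
```

In this example the exact probability 7/64 is well below the Chebyshev bound 5/18, which illustrates that these inequalities are general-purpose rather than tight.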

11.3 Cauchy–Schwarz inequality


For all random variables X, Y , we have
\[
E(|XY|) \le \sqrt{E(X^2)} \sqrt{E(Y^2)}.
\]
Proof of Cauchy–Schwarz. It suffices to consider the case where X, Y ≥ 0 and E(X^2) < ∞
and E(Y^2) < ∞. Since XY ≤ (X^2 + Y^2)/2, we then have
\[
E(XY) \le \frac{E(X^2) + E(Y^2)}{2} < \infty
\]
and the case where E(X^2) = E(Y^2) = 0 is clear. Suppose then, without loss of generality,
that E(Y^2) > 0. For t ∈ R, we have
0 ≤ (X − tY )^2 = X^2 − 2tXY + t^2 Y^2
so
0 ≤ E(X^2) − 2tE(XY ) + t^2 E(Y^2).
We minimize the right-hand side by taking t = E(XY )/E(Y^2) and rearrange to obtain
E(XY )^2 ≤ E(X^2)E(Y^2).

It is instructive to examine how the equality
\[
E(XY) = \sqrt{E(X^2)} \sqrt{E(Y^2)}
\]
can occur for square-integrable random variables X, Y . Let us exclude the trivial case
where P(Y = 0) = 1. Then E(Y^2) > 0 and the above calculation shows that equality can
occur only when
E((X − tY )^2) = 0
where t = E(XY )/E(Y^2), that is to say when P(X = λY ) = 1 for some λ ∈ R.

11.4 Jensen’s inequality


Let X be an integrable random variable with values in an open interval I and let f be a
convex function on I. Then
f (E(X)) ≤ E(f (X)).
To remember the sense of the inequality, consider the case I = R and f (x) = x^2 and recall
that
E(X^2) − E(X)^2 = var(X) ≥ 0.
Recall that f is said to be convex on I if, for all x, y ∈ I and all t ∈ [0, 1],
f (tx + (1 − t)y) ≤ tf (x) + (1 − t)f (y).
Thus, the graph of the function from (x, f (x)) to (y, f (y)) lies below the chord. If f is twice
differentiable, then f is convex if and only if f ''(x) ≥ 0 for all x ∈ I. We will use the
following property of convex functions: for all points m ∈ I, there exist a, b ∈ R such that
am + b = f (m) and ax + b ≤ f (x) for all x ∈ I.
To see this, note that, for all x, y ∈ I with x < m < y, we can find t ∈ (0, 1) such that
m = tx + (1 − t)y. Then, on rearranging the convexity inequality, we obtain
\[
\frac{f(m) - f(x)}{m - x} \le \frac{f(y) - f(m)}{y - m}.
\]
So there exists a ∈ R such that
\[
\frac{f(m) - f(x)}{m - x} \le a \le \frac{f(y) - f(m)}{y - m}
\]
for all such x, y. Then f (x) ≥ a(x − m) + f (m) for all x ∈ I, so we may take b = f (m) − am.
Proof of Jensen’s inequality. Take m = E(X) and note the inequality of random variables
aX + b ≤ f (X).
On taking expectations, we obtain
f (E(X)) = f (m) = am + b ≤ E(f (X)).

It is again interesting to examine the case of equality

f (E(X)) = E(f (X))

especially in the case where for m = E(X) there exist a, b ∈ R such that

f (m) = am + b, f (x) > ax + b for all x ≠ m.

Then equality forces the non-negative random variable f (X)−(aX +b) to have expectation
0, so P(f (X) = aX + b) = 1 and so P(X = m) = 1.

11.5 AM/GM inequality


Let f be a convex function defined on an open interval I, and let x1 , . . . , xn ∈ I. Then
\[
f\Big(\frac{1}{n} \sum_{k=1}^{n} x_k\Big) \le \frac{1}{n} \sum_{k=1}^{n} f(x_k).
\]

To see this, consider a random variable which takes the values x1 , . . . , xn all with equal
probability and apply Jensen’s inequality.
In the special case when I = (0, ∞) and f = − log, we obtain for x1 , . . . , xn ∈ (0, ∞)

\[
\Big(\prod_{k=1}^{n} x_k\Big)^{1/n} \le \frac{1}{n} \sum_{k=1}^{n} x_k.
\]

Thus the geometric mean is always less than or equal to the arithmetic mean.
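
A quick numerical illustration of the last step, with an arbitrary list of positive numbers: Jensen's inequality for f = − log, applied to the uniform distribution on the list, is the same statement as the AM/GM inequality after exponentiating.

```python
from math import log, prod

xs = [0.5, 2.0, 3.0, 8.0]   # arbitrary positive numbers
n = len(xs)

am = sum(xs) / n             # arithmetic mean
gm = prod(xs) ** (1 / n)     # geometric mean

# Jensen with f = -log: f(mean of xs) <= mean of f(xs).
jensen_lhs = -log(am)
jensen_rhs = sum(-log(x) for x in xs) / n
print(gm, am, jensen_lhs, jensen_rhs)
```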

12 Random walks
12.1 Definitions
A random process (Xn : n ∈ N) is a sequence of random variables. An integer-valued
random process (Xn : n ∈ N) is called a random walk if it has the form

Xn = x + Y1 + · · · + Yn

for some sequence of independent identically distributed random variables (Yn : n ≥ 1).
We consider only the case of simple random walk, when the steps are all of size 1.

12.2 Gambler’s ruin


We think of the random walk (Xn : n ≥ 0) as a model for the fortune of a gambler,
who makes a series of bets of the same unit stake, which is either returned double, with
probability p ∈ (0, 1) or is lost, with probability q = 1 − p. Suppose that the gambler’s
initial fortune is x and he continues to play until his fortune reaches a > x or he runs out
of money. Set
hx = Px (Xn hits a before 0).
The subscript x indicates the initial position. Note that, for y = x ± 1, conditional on
X1 = y, (Xn : n ≥ 1) is a simple random walk starting from y. Hence, by the law of total
probability,
\[
h_x = q h_{x-1} + p h_{x+1}, \qquad x = 1, \ldots, a - 1.
\]
We look for solutions to this recurrence relation satisfying the boundary conditions h0 = 0
and ha = 1.
First, consider the case p = 1/2. Then
\[
h_x - h_{x-1} = h_{x+1} - h_x
\]
for all x, so we must have
hx = x/a.
Suppose then that p ≠ 1/2. We look for solutions of the recurrence relation of the form
hx = λ^x . Then
pλ^2 − λ + q = 0
so λ = 1 or λ = q/p. Then A + B(q/p)^x gives a general family of solutions and we can
choose A and B to satisfy the boundary conditions. This requires
A + B = 0, A + B(q/p)^a = 1
so
\[
B = -A = \frac{1}{(q/p)^a - 1}
\]
and
\[
h_x = \frac{(q/p)^x - 1}{(q/p)^a - 1}.
\]
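
The closed form can be compared with a direct solution of the recurrence. The sketch below solves the recurrence by a shooting argument (start from h0 = 0, h1 = 1, propagate forwards, then rescale so that ha = 1), which works because the recurrence is linear. The function names are illustrative only.

```python
from fractions import Fraction

def ruin_closed_form(x, a, p):
    """h_x from the formulas in the text; p is a Fraction in (0, 1)."""
    q = 1 - p
    if p == q:
        return Fraction(x, a)
    r = q / p
    return (r**x - 1) / (r**a - 1)

def ruin_by_recurrence(a, p):
    """Solve h_x = q h_{x-1} + p h_{x+1} with h_0 = 0, h_a = 1 by shooting from h_1 = 1."""
    q = 1 - p
    h = [Fraction(0), Fraction(1)]
    for x in range(1, a):
        h.append((h[x] - q * h[x - 1]) / p)   # rearranged recurrence gives h_{x+1}
    return [v / h[a] for v in h]              # rescale so that h_a = 1

a = 10
for p in (Fraction(1, 2), Fraction(2, 5)):
    numeric = ruin_by_recurrence(a, p)
    exact = [ruin_closed_form(x, a, p) for x in range(a + 1)]
    print(p, numeric == exact, float(exact[5]))
```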

12.3 Mean time to absorption
Denote by T the number of steps taken by the random walk until it first hits 0 or a. Set

τx = Ex (T )

so τx is the mean time to absorption starting from x. We condition again on the first step,
using the law of total expectation to obtain, for x = 1, . . . , a − 1,

τx = 1 + pτx+1 + qτx−1 .

This time the boundary conditions are τ0 = τa = 0.


For p = 1/2, try a solution of the form Ax^2 . Then we require
\[
A x^2 = 1 + \frac{1}{2} A (x+1)^2 + \frac{1}{2} A (x-1)^2 = A x^2 + 1 + A
\]
so A = −1. Then, by symmetry, we see that
τx = x(a − x).
In the case p ≠ 1/2, we try Cx as a solution to the recurrence relation. Then
Cx = 1 + pC(x + 1) + qC(x − 1)
so C = 1/(q − p). The general solution then has the form
\[
\tau_x = \frac{x}{q - p} + A + B \Big(\frac{q}{p}\Big)^x
\]
and we determine A and B using the boundary conditions to obtain
\[
\tau_x = \frac{x}{q - p} - \frac{a}{q - p} \cdot \frac{(q/p)^x - 1}{(q/p)^a - 1}.
\]
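
One way to check these formulas is to substitute them back into the recurrence and the boundary conditions using exact arithmetic. In the sketch below the function names `mean_time` and `satisfies_recurrence` are illustrative and not from the notes.

```python
from fractions import Fraction

def mean_time(x, a, p):
    """tau_x from the formulas in the text; p is a Fraction in (0, 1)."""
    q = 1 - p
    if p == q:
        return Fraction(x * (a - x))
    r = q / p
    return Fraction(x) / (q - p) - Fraction(a) / (q - p) * (r**x - 1) / (r**a - 1)

def satisfies_recurrence(a, p):
    """Check tau_0 = tau_a = 0 and tau_x = 1 + p tau_{x+1} + q tau_{x-1} for 0 < x < a."""
    q = 1 - p
    tau = [mean_time(x, a, p) for x in range(a + 1)]
    interior = all(tau[x] == 1 + p * tau[x + 1] + q * tau[x - 1] for x in range(1, a))
    return tau[0] == 0 and tau[a] == 0 and interior

print(satisfies_recurrence(10, Fraction(1, 2)), satisfies_recurrence(10, Fraction(2, 5)))
print(mean_time(5, 10, Fraction(1, 2)))   # fair game started halfway
```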

13 Generating functions
13.1 Definition
Let X be a random variable with values in Z+ = {0, 1, 2, . . . }. The generating function of
X is the power series given by
\[
G_X(t) = E(t^X) = \sum_{n=0}^{\infty} P(X = n) t^n.
\]
Then GX (1) = 1 so the power series has radius of convergence at least 1. By standard
results on power series, GX defines a function on (−1, 1) with derivatives of all orders, and
we can recover the probabilities for X by
\[
P(X = n) = \frac{1}{n!} \Big(\frac{d}{dt}\Big)^n \Big|_{t=0} G_X(t).
\]

13.2 Examples
For a Bernoulli random variable X of parameter p, we have
GX (t) = (1 − p) + pt.
For a geometric random variable X of parameter p,

\[
G_X(t) = \sum_{n=0}^{\infty} (1-p)^n p \, t^n = \frac{p}{1 - (1-p)t}.
\]

For a Poisson random variable X of parameter λ,


\[
G_X(t) = \sum_{n=0}^{\infty} e^{-\lambda} \frac{\lambda^n}{n!} t^n = e^{-\lambda} e^{\lambda t} = e^{-\lambda + \lambda t}.
\]

13.3 Generating functions and moments


The expectation E(X^n) is called the nth moment of X. Provided the radius of convergence
exceeds 1, we can differentiate term by term at t = 1 to obtain
\[
G_X'(1) = \sum_{n=1}^{\infty} n P(X = n) = E(X),
\]
\[
G_X''(1) = \sum_{n=2}^{\infty} n(n-1) P(X = n) = E(X(X-1))
\]
and so on. Note that for a Poisson random variable X of parameter λ we have
\[
G_X'(1) = \lambda, \qquad G_X''(1) = \lambda^2
\]

in agreement with the values for E(X) and E(X(X − 1)) computed in Section 10.5.
When the radius of convergence equals 1, we can differentiate term by term at all |t| < 1
to obtain
\[
G_X'(t) = \sum_{n=1}^{\infty} n P(X = n) t^{n-1}.
\]
So, in all cases,
\[
\lim_{t \uparrow 1} G_X'(t) = E(X)
\]

and we can obtain E(X(X − 1)) as a limit similarly.


For example, consider a random variable X in N = {1, 2, . . . } with probabilities
\[
P(X = n) = \frac{1}{n(n+1)}.
\]

Then, for |t| < 1, we have



\[
G_X'(t) = \sum_{n=1}^{\infty} \frac{t^{n-1}}{n+1}
\]
and, as t ↑ 1,
\[
G_X'(t) \to E(X) = \sum_{n=1}^{\infty} \frac{1}{n+1} = \infty.
\]

13.4 Sums of independent random variables


Let X, Y be independent random variables with values in Z+ . Set

pn = P(X = n), qn = P(Y = n).

Then X + Y is also a random variable with values in Z+ . We have

\[
\{X + Y = n\} = \bigcup_{k=0}^{n} \{X = k, Y = n - k\}
\]

and, by independence,
\[
P(X = k, Y = n - k) = p_k q_{n-k}.
\]
So the probabilities for X + Y are given by the convolution
\[
P(X + Y = n) = \sum_{k=0}^{n} p_k q_{n-k}.
\]

Generating functions are convenient in turning convolutions into products. Thus



\[
G_{X+Y}(t) = \sum_{n=0}^{\infty} P(X + Y = n) t^n = \sum_{n=0}^{\infty} \sum_{k=0}^{n} p_k q_{n-k} t^n = \sum_{k=0}^{\infty} p_k t^k \sum_{n=k}^{\infty} q_{n-k} t^{n-k} = G_X(t) G_Y(t).
\]

This can also be seen directly

\[
G_{X+Y}(t) = E(t^{X+Y}) = E(t^X t^Y) = E(t^X) E(t^Y) = G_X(t) G_Y(t).
\]

Let X, X1 , X2 , . . . be independent random variables in Z+ all having the same
distribution. Set S0 = 0 and for n ≥ 1 set
Sn = X1 + · · · + Xn .
Then, for all n ≥ 0,
\[
G_{S_n}(t) = E(t^{S_n}) = E(t^{X_1} \times \cdots \times t^{X_n}) = E(t^{X_1}) \times \cdots \times E(t^{X_n}) = (G_X(t))^n.
\]

In the case where X has Bernoulli distribution of parameter p, we have computed
GX (t) = 1 − p + pt and we know that Sn ∼ B(n, p). Hence the generating function for
B(n, p) is given by
\[
G_{S_n}(t) = (1 - p + pt)^n.
\]
In the case where X has Poisson distribution of parameter λ, we have computed
\[
G_X(t) = e^{-\lambda + \lambda t}
\]
and we know that Sn ∼ P (nλ). The relation between the generating functions of Sn and X thus checks with
the known form of the generating function for P (nλ).
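
The correspondence between convolutions and products can be seen concretely for the Bernoulli case. The sketch below convolves a Bernoulli(1/3) mass function with itself four times, compares the result with B(4, 1/3), and checks the product rule for generating functions at one arbitrary point t = 1/2. The helper names `convolve` and `gf` are assumptions of this illustration.

```python
from fractions import Fraction
from math import comb

def convolve(p, q):
    """Mass function of X + Y for independent X, Y with mass functions p, q (lists indexed by value)."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for k, pk in enumerate(p):
        for j, qj in enumerate(q):
            out[k + j] += pk * qj
    return out

def gf(p, t):
    """Generating function of the mass function p evaluated at t."""
    return sum(pn * t**n for n, pn in enumerate(p))

bern = [Fraction(2, 3), Fraction(1, 3)]   # Bernoulli of parameter 1/3
dist = [Fraction(1)]                      # mass function of S_0 = 0
for _ in range(4):
    dist = convolve(dist, bern)           # after four steps: mass function of S_4

binom = [comb(4, k) * Fraction(1, 3) ** k * Fraction(2, 3) ** (4 - k) for k in range(5)]
t = Fraction(1, 2)
print(dist == binom, gf(dist, t) == gf(bern, t) ** 4)
```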

13.5 Random sums


Let N be a further random variable in Z+ , independent of the sequence (Xn : n ∈ N).
Consider the random sum
\[
S_N(\omega) = \sum_{i=1}^{N(\omega)} X_i(\omega).
\]
Then, for n ≥ 0,
\[
E(t^{S_N} | N = n) = (G_X(t))^n
\]
so SN has generating function
\[
G_{S_N}(t) = E(t^{S_N}) = \sum_{n=0}^{\infty} E(t^{S_N} | N = n) P(N = n) = \sum_{n=0}^{\infty} (G_X(t))^n P(N = n) = F(G_X(t))
\]

where F is the generating function of N .

13.6 Counting with generating functions


Here is an example of a counting problem where generating functions are helpful. Consider
the set Pn of integer-valued paths x = (x0 , x1 , . . . , x2n ) such that

\[
x_0 = x_{2n} = 0, \qquad |x_i - x_{i+1}| = 1, \qquad x_i \ge 0 \quad \text{for all } i.
\]

Set
Cn = |Pn |.
Note that, for all n ≥ 1 and x ∈ Pn , we have x1 = 1 and y = (x_1 − 1, . . . , x_{2k−1} − 1) ∈ P_{k−1}
and z = (x_{2k} , . . . , x_{2n} ) ∈ P_{n−k} , where k = min{i ≥ 1 : x_{2i} = 0}. This gives a bijection from
Pn to the union over k = 1, . . . , n of the sets P_{k−1} × P_{n−k} . So we get a convolution-type identity
\[
C_n = \sum_{k=1}^{n} C_{k-1} C_{n-k}.
\]

Consider the generating function



\[
c(t) = \sum_{n=0}^{\infty} C_n t^n.
\]
Note that
\[
C_n \le \binom{2n}{n} \le 2^{2n},
\]
so the radius of convergence of this power series is at least 1/4.
Then
\[
c(t) = 1 + \sum_{n=1}^{\infty} \sum_{k=1}^{n} C_{k-1} C_{n-k} t^n = 1 + t \sum_{k=1}^{\infty} C_{k-1} t^{k-1} \sum_{n=k}^{\infty} C_{n-k} t^{n-k} = 1 + t c(t)^2.
\]

So, for t ∈ (0, 1/4),
\[
c(t) = \frac{1 - \sqrt{1 - 4t}}{2t}
\]
where the other root (1 + √(1 − 4t))/(2t) can be excluded because we know that c(0) = C0 = 1 and c
is continuous on [0, 1/4). Then, using the binomial expansion, we conclude that
\[
C_n = \frac{1}{n+1} \binom{2n}{n}.
\]

The numbers Cn are the Catalan numbers. They appear in many other counting problems.
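
For small n the three descriptions of Cn (the path count, the convolution identity and the closed form) can be compared directly. The sketch below is illustrative; the function names are not from the notes, and the path count uses dynamic programming over the current height rather than listing paths.

```python
from math import comb

def catalan_by_recurrence(n_max):
    """C_0, ..., C_{n_max} from C_n = sum_{k=1}^{n} C_{k-1} C_{n-k}, with C_0 = 1."""
    c = [1]
    for n in range(1, n_max + 1):
        c.append(sum(c[k - 1] * c[n - k] for k in range(1, n + 1)))
    return c

def catalan_closed_form(n):
    """C_n = binom(2n, n) / (n + 1); the division is exact."""
    return comb(2 * n, n) // (n + 1)

def count_paths(n):
    """Number of paths of length 2n from 0 to 0 with steps +1 or -1 that never go below 0."""
    heights = {0: 1}   # heights[h] = number of admissible partial paths ending at height h
    for _ in range(2 * n):
        nxt = {}
        for h, ways in heights.items():
            for step in (-1, 1):
                if h + step >= 0:
                    nxt[h + step] = nxt.get(h + step, 0) + ways
        heights = nxt
    return heights.get(0, 0)

print(catalan_by_recurrence(6))
print([catalan_closed_form(n) for n in range(7)])
print([count_paths(n) for n in range(7)])
```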

14 Branching processes
14.1 Definition
A branching process or Galton–Watson process is a random process (Xn : n ≥ 0) with the
following structure:
\[
X_0 = 1, \qquad X_{n+1} = \sum_{k=1}^{X_n} Y_{k,n} \quad \text{for all } n \ge 0.
\]
Here (Yk,n : k ≥ 1, n ≥ 0) is a sequence of independent identically distributed random
variables in Z+ . A branching process models the evolution of the number of individuals
in a population, where the kth individual in generation n has Yk,n offspring in generation
n + 1. We call the distribution of X1 the offspring distribution.

14.2 Mean population size


Set µ = E(X1 ). Then, for all n ≥ 1,
E(Xn ) = µ^n .
This is true for n = 1. Suppose inductively it is true for n. Note that
\[
E(X_{n+1} | X_n = m) = E\Big(\sum_{k=1}^{m} Y_{k,n}\Big) = m\mu
\]
so by the law of total expectation
\[
E(X_{n+1}) = \sum_{m=0}^{\infty} E(X_{n+1} | X_n = m) P(X_n = m) = \mu \sum_{m=0}^{\infty} m P(X_n = m) = \mu E(X_n) = \mu^{n+1}
\]

and the induction proceeds.

14.3 Generating function for the population size


Set F (t) = E(t^{X1}) and set Fn (t) = E(t^{Xn}) for n ≥ 0. Then F0 (t) = t and, for n ≥ 0, we
can apply the calculation of the generating function in Section 13.5 to the random sum
defining X_{n+1} to obtain
\[
F_{n+1}(t) = F_n(F(t))
\]
so, by induction, Fn = F ◦ · · · ◦ F , the n-fold composition of F with itself.

14.4 Conditioning on the first generation
Fix m ≥ 1. On the event {X1 = m}, for n ≥ 0, we have
\[
X_{n+1} = \sum_{j=1}^{m} X_n^{(j)}
\]
where X_0^{(j)} = 1 and, for n ≥ 1,
\[
X_n^{(j)} = \sum_{k = S_{n-1}^{(j-1)} + 1}^{S_{n-1}^{(j)}} Y_{k,n}
\]
and
\[
S_{n-1}^{(j)} = X_{n-1}^{(1)} + \cdots + X_{n-1}^{(j)}.
\]
Note that, conditional on {X1 = m}, the processes (X_n^{(1)} : n ≥ 0), . . . , (X_n^{(m)} : n ≥ 0) are
independent Galton–Watson processes with the same offspring distribution as (Xn : n ≥ 0).

14.5 Extinction probability


Set
q = P(Xn = 0 for some n ≥ 0), qn = P(Xn = 0).
Then qn → q as n → ∞ by continuity of probability. Also
\[
q_{n+1} = F_{n+1}(0) = F(F_n(0)) = F(q_n).
\]
This can also be seen by conditioning on the first generation. For
\[
P(X_{n+1} = 0 | X_1 = m) = P(X_n^{(1)} = 0, \ldots, X_n^{(m)} = 0 | X_1 = m) = q_n^m
\]
so, by the law of total probability,
\[
q_{n+1} = \sum_{m=0}^{\infty} P(X_{n+1} = 0 | X_1 = m) P(X_1 = m) = F(q_n).
\]

Theorem 14.1. The extinction probability q is the smallest non-negative solution of the
equation q = F (q). Moreover, provided P(X1 = 1) < 1, we have

q < 1 if and only if µ > 1.

Proof. Note that F is continuous and non-decreasing on [0, 1] and F (1) = 1. On letting
n → ∞ in the equation qn+1 = F (qn ), we see that q = F (q). On the other hand, let us
denote the smallest non-negative solution of this equation by t. Then q0 = 0 ≤ t. Suppose

inductively for n ≥ 0 that qn ≤ t, then q_{n+1} = F (qn ) ≤ F (t) = t and the induction
proceeds. Hence q ≤ t and, since q is itself a non-negative solution, q = t.
We have µ = lim F '(t) as t ↑ 1. If µ > 1, then we must have F (t) < t for all t ∈ (0, 1)
sufficiently close to 1 by the mean value theorem. Since F (0) = P(X1 = 0) ≥ 0, there is
then a solution to t = F (t) in [0, 1) by the intermediate value theorem. Hence q < 1.
It remains to consider the case µ ≤ 1. If P(X1 ≤ 1) = 1 then F (t) = 1 − µ + µt and we
exclude the case µ = 1 by the condition P(X1 = 1) < 1, so q = 1. On the other hand, if
P(X1 ≥ 2) > 0 then, by term-by-term differentiation, F '(t) < µ ≤ 1 for all t ∈ (0, 1) so, by
the mean value theorem, there is no solution to t = F (t) in [0, 1), so q = 1.
We emphasize the fact that, if there is any variability in the number of offspring, and
the average number of offspring is 1, then the population is sure to die out.

14.6 Example
Consider the Galton–Watson process (Xn : n > 0) with

P(X1 = 0) = 1/3, P(X1 = 2) = 2/3.

The offspring generating function F is given by


\[
F(t) = \frac{1}{3} + \frac{2}{3} t^2
\]
so the extinction probability q is the smallest non-negative solution to

0 = 2t2 − 3t + 1 = (2t − 1)(t − 1).

Hence q = 1/2.
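
The value q = 1/2 can also be reached by iterating the relation q_{n+1} = F (qn ) from Section 14.5, starting from q0 = 0. The sketch below does this in floating point; the number of iterations (200) is an arbitrary choice that is more than enough for convergence here.

```python
def F(t):
    """Offspring generating function for P(X1 = 0) = 1/3, P(X1 = 2) = 2/3."""
    return 1 / 3 + 2 / 3 * t**2

q_history = [0.0]                      # q_0 = P(X_0 = 0) = 0
for _ in range(200):
    q_history.append(F(q_history[-1]))  # q_{n+1} = F(q_n)

q = q_history[-1]
print(q_history[:4], q)
```

The sequence increases towards the smaller root 1/2 and never approaches the other fixed point t = 1, in line with Theorem 14.1.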
Recall the construction of (Xn : n ≥ 0) from a sequence of independent random variables
(Yk,n : k ≥ 1, n ≥ 0). Set T0 = 0 and Tn = X0 + · · · + X_{n−1} for n ≥ 1 and let
T be the sum of Xn over all n ≥ 0. Set S0 = 1 and define recursively for n ≥ 0 and k = 1, . . . , Xn ,
\[
S_{T_n + k} = S_{T_n} + Y_{1,n} + \cdots + Y_{k,n} - k.
\]
Then S_{T_0} = S0 = 1 = X0 . Suppose inductively for n ≥ 0 that S_{T_n} = Xn . Then
\[
S_{T_{n+1}} = S_{T_n} + Y_{1,n} + \cdots + Y_{X_n,n} - X_n = X_{n+1}
\]
and the induction proceeds. Moreover T = min{m ≥ 0 : Sm = 0}. Now (Sm : m ≤ T ) is a
random walk starting from 1, which jumps up by 1 with probability 2/3 and down by 1
with probability 1/3, until it hits 0. The extinction probability for the branching process
is then the probability that the walk ever hits 0.

15 Some natural continuous probability distributions
A non-negative function f on R is a probability density function if
\[
\int_{\mathbb{R}} f(x) \, dx = 1.
\]
Given such a function f , we can define a unique probability measure µ on R such that, for
all x ∈ R,
\[
\mu((-\infty, x]) = \int_{-\infty}^{x} f(y) \, dy.
\]
In fact this measure µ is given on the σ-algebra B of Borel sets B by
\[
\mu(B) = \int_B f(x) \, dx.
\]

To be precise, we should require that f be Borel measurable. This is a weak assumption of
regularity, which holds for all piecewise continuous functions and many more besides. The
Borel σ-algebra B contains all the intervals in R. In fact it may be defined as the smallest
σ-algebra with this property. You are not expected to keep track of these subtleties in this
course and will miss no essential point by ignoring them.

15.1 Uniform distribution


For a, b ∈ R with a < b, the density function
\[
f(x) = (b - a)^{-1} 1_{[a,b]}(x)
\]
defines the uniform distribution on [a, b], written U [a, b] for short.

15.2 Exponential distribution


For λ ∈ (0, ∞), the density function on [0, ∞)
\[
f(x) = \lambda e^{-\lambda x}
\]
defines the exponential distribution of parameter λ, written E(λ).

15.3 Gamma distribution


More generally, for α, λ ∈ (0, ∞), the density function on [0, ∞)
\[
f(x) = \lambda^{\alpha} x^{\alpha - 1} e^{-\lambda x} / \Gamma(\alpha)
\]
defines the gamma distribution of parameters α and λ, written Γ(α, λ). Here
\[
\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x} \, dx.
\]

The substitution y = λx shows that
\[
\int_0^{\infty} \lambda^{\alpha} x^{\alpha - 1} e^{-\lambda x} \, dx = \Gamma(\alpha)
\]
so f is indeed a probability density function. When α is an integer, we can integrate by
parts α − 1 times to see that Γ(α) = (α − 1)!

15.4 Normal distribution


The density function
\[
f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
\]
defines the standard normal distribution, written N (0, 1). More generally, for µ ∈ R and
σ^2 ∈ (0, ∞), the density function
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}
\]
defines the normal distribution of mean µ and variance σ^2 , written N (µ, σ^2 ).


To check that f is indeed a probability density function, we can use Fubini’s theorem
and a transformation to polar coordinates to compute
\[
\Big(\int_{\mathbb{R}} e^{-x^2/2} \, dx\Big)^2 = \int_{\mathbb{R}^2} e^{-(x^2 + y^2)/2} \, dx \, dy = \int_0^{2\pi} \int_0^{\infty} e^{-r^2/2} r \, dr \, d\theta = 2\pi.
\]

16 Continuous random variables
16.1 Definitions
Recall that each real-valued random variable X has a distribution function FX , given by

FX (x) = P(X ≤ x).

Then FX : R → [0, 1] is non-decreasing and right-continuous, with FX (x) → 0 as x → −∞
and FX (x) → 1 as x → ∞. A random variable X is said to be continuous if its distribution
function FX is continuous. We say that X is absolutely continuous if there exists a density
function fX such that, for all x ∈ R,
\[
F_X(x) = \int_{-\infty}^{x} f_X(y) \, dy. \tag{5}
\]
We will say in this case that X has density function fX . In fact we then have, for all Borel
sets B,
\[
P(X \in B) = \int_B f_X(x) \, dx.
\]
By continuity of probability, the left limit of FX at x is given by

\[
F_X(x-) = \lim_{n \to \infty} F_X(x - 1/n) = \lim_{n \to \infty} P(X \le x - 1/n) = P(X < x)
\]
so FX has a discontinuity at x if and only if P(X = x) > 0. If X has a density function
fX , then P(X = x) = 0 for all x, so X is continuous. The converse is false, as may be seen
by considering the uniform distribution on the Cantor set.
On the other hand, if $F_X$ is differentiable with piecewise continuous derivative $f$ then,
by the Fundamental Theorem of Calculus, for all $a, b \in \mathbb{R}$ with $a \le b$,
\[ \mathbb{P}(a \le X \le b) = F_X(b) - F_X(a) = \int_a^b f(x) \, dx \]
so $X$ has density function $f$. Conversely, suppose that $X$ has a density function $f_X$. Then,
for all $x \in \mathbb{R}$ and all $h > 0$,
\[ \left| \frac{F_X(x+h) - F_X(x)}{h} - f_X(x) \right| = \left| \frac{1}{h} \int_x^{x+h} (f_X(y) - f_X(x)) \, dy \right| \le \sup_{x \le y \le x+h} |f_X(y) - f_X(x)| \]
with a similar estimate for $h < 0$. Hence, if $f_X$ is continuous at $x$, then $F_X$ is differentiable
at $x$ with derivative $f_X(x)$.
Given a real-valued random variable $X$, the distribution of $X$ is the Borel probability
measure $\mu_X$ on $\mathbb{R}$ given by
\[ \mu_X(B) = \mathbb{P}(X \in B). \]

16.2 Transformation of one-dimensional random variables
Let $X$ be a random variable with values in some open interval $I$ and having a piecewise
continuous density function $f_X$ on $I$. Let $\phi$ be a function on $I$ having a continuous derivative
and such that $\phi'(x) \ne 0$ for all $x \in I$. Set $y = \phi(x)$ and consider the new random variable
$Y = \phi(X)$ in the interval $\phi(I)$. Then $Y$ has a density function $f_Y$ on $\phi(I)$, given by
\[ f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right|. \]
To see this, consider first the case where $\phi$ is increasing. Then $Y \le y$ if and only if
$X \le \psi(y)$, where $\psi = \phi^{-1} : \phi(I) \to I$. So
\[ F_Y(y) = F_X(\psi(y)). \]
By the chain rule, $F_Y$ then has a piecewise continuous derivative $f_Y$, given by
\[ f_Y(y) = f_X(\psi(y)) \psi'(y) = f_X(x) \frac{dx}{dy} \]
which is then a density function for $Y$ on $\phi(I)$. Since $\phi'$ does not vanish, there remains the
case where $\phi$ is decreasing. Then $Y \le y$ if and only if $X \ge \psi(y)$, so $F_Y(y) = 1 - F_X(\psi(y))$
and a similar argument applies.
Here is an example. Suppose that $X \sim U[0, 1]$, by which we mean that $\mu_X$ is the uniform
distribution on $[0, 1]$. We may assume that $X$ takes values in $(0, 1]$, since $\mathbb{P}(X = 0) = 0$.
Take $\phi = -\log$. Then $Y = \phi(X)$ takes values in $[0, \infty)$ and
\[ \mathbb{P}(Y > y) = \mathbb{P}(-\log X > y) = \mathbb{P}(X < e^{-y}) = e^{-y} \]
so $Y \sim E(1)$, that is, $\mu_Y$ is the exponential distribution of parameter 1.
Here is a second example. Let $Z \sim N(0, 1)$ and fix $\mu \in \mathbb{R}$ and $\sigma \in (0, \infty)$. Set
\[ X = \mu + \sigma Z. \]
Then $X \sim N(\mu, \sigma^2)$. To see this, we note that, for $a, b \in \mathbb{R}$ with $a \le b$, we have $X \in [a, b]$
if and only if $Z \in [a', b']$, where $a' = (a - \mu)/\sigma$, $b' = (b - \mu)/\sigma$. So
\[ \mathbb{P}(X \in [a, b]) = \mathbb{P}(Z \in [a', b']) = \int_{a'}^{b'} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)} \, dx \]
where we made the substitution $z = (x - \mu)/\sigma$ to obtain the final equality.

16.3 Calculation of expectations using density functions


Let $X$ be a random variable having a density function $f_X$ and let $g$ be a non-negative
function on $\mathbb{R}$. Then the expectation $\mathbb{E}(g(X))$ may be computed by
\[ \mathbb{E}(g(X)) = \int_{\mathbb{R}} g(x) f_X(x) \, dx \]
and the same formula holds when $g$ is real-valued provided $g(X)$ is integrable. We omit
the proof of this formula but note that for $g = 1_{(-\infty, x]}$ it is a restatement of (5).
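For $X \sim U[a, b]$, we have
\[ \mathbb{E}(X) = \frac{1}{b-a} \int_a^b x \, dx = \frac{a+b}{2}. \]
For $X \sim E(\lambda)$, by integration by parts,
\[ \mathbb{E}(X) = \int_0^\infty x \lambda e^{-\lambda x} \, dx = \int_0^\infty e^{-\lambda x} \, dx = \frac{1}{\lambda}. \]
For $X \sim N(0, 1)$, by symmetry,
\[ \mathbb{E}(X) = \int_{\mathbb{R}} x \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = 0 \]
and, by integration by parts,
\[ \operatorname{var}(X) = \mathbb{E}(X^2) = \int_{\mathbb{R}} x^2 \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = \int_{\mathbb{R}} x \cdot \frac{1}{\sqrt{2\pi}} x e^{-x^2/2} \, dx = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = 1. \]
For $X \sim N(\mu, \sigma^2)$, we can write $X = \mu + \sigma Z$ with $Z \sim N(0, 1)$. So
\[ \mathbb{E}(X) = \mu, \quad \operatorname{var}(X) = \sigma^2. \]
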
17 Properties of the exponential distribution
17.1 Exponential distribution as a limit of geometrics
Let $T$ be an exponential random variable of parameter $\lambda$. Set
\[ T_n = \lfloor nT \rfloor. \]
Then $T_n$ takes values in $\mathbb{Z}^+$ and
\[ \mathbb{P}(T_n \ge k) = \mathbb{P}(T \ge k/n) = e^{-\lambda k/n} \]
so $T_n$ is geometric of parameter $p_n = 1 - e^{-\lambda/n}$. Now, as $n \to \infty$, we have
\[ p_n \sim \lambda/n, \quad T_n/n \to T. \]
So $T$ is a limit of rescaled geometric random variables of small parameter.
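
As a numerical illustration of this limit, here is a minimal Python sketch. The function names `geometric_parameter` and `rescaled_floor` are illustrative choices, not notation from these notes. It shows $n p_n$ approaching $\lambda$ and $T_n/n$ approaching a sampled value of $T$.

```python
import math
import random


def geometric_parameter(lam: float, n: int) -> float:
    """Parameter p_n = 1 - exp(-lam / n) of the geometric variable T_n."""
    return 1.0 - math.exp(-lam / n)


def rescaled_floor(t: float, n: int) -> float:
    """Return T_n / n = floor(n * t) / n for a given value t of T."""
    return math.floor(n * t) / n


if __name__ == "__main__":
    lam = 2.0
    rng = random.Random(0)
    t = rng.expovariate(lam)  # one sample of T ~ E(lam)
    for n in (1, 10, 100, 1000):
        # n * p_n should approach lam, and T_n / n should approach t from below
        print(n, n * geometric_parameter(lam, n), rescaled_floor(t, n), t)
```

Since $0 \le T - T_n/n < 1/n$, the convergence $T_n/n \to T$ holds for every outcome, not just in distribution.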

17.2 Memoryless property of the exponential distribution


Let $T$ be an exponential random variable of parameter $\lambda$. Then
\[ \mathbb{P}(T > t) = \int_t^\infty \lambda e^{-\lambda\tau} \, d\tau = e^{-\lambda t} \]
so, for all $s, t > 0$,
\[ \mathbb{P}(T > s + t \mid T > s) = \mathbb{P}(T > t). \]
This is called the memoryless property because, thinking of $T$ as a random time, conditional
on $T$ exceeding a given time $s$, the time $T - s$ still to wait is also exponential of parameter
$\lambda$.
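
The memoryless property can be checked empirically. The sketch below is an illustration only, with `conditional_tail` a hypothetical helper name. It estimates $\mathbb{P}(T > s + t \mid T > s)$ from simulated exponential samples, to be compared with $e^{-\lambda t}$.

```python
import math
import random


def conditional_tail(lam, s, t, n_samples, rng):
    """Estimate P(T > s + t | T > s) for T ~ E(lam) from n_samples draws."""
    exceed_s = 0
    exceed_both = 0
    for _ in range(n_samples):
        x = rng.expovariate(lam)
        if x > s:
            exceed_s += 1
            if x > s + t:
                exceed_both += 1
    return exceed_both / exceed_s


if __name__ == "__main__":
    rng = random.Random(1)
    estimate = conditional_tail(1.0, 0.5, 1.0, 50_000, rng)
    print(estimate, math.exp(-1.0))  # the two values should be close
```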
In fact this property characterizes the exponential distribution. For it implies that
\[ \mathbb{P}(T > s + t) = \mathbb{P}(T > s) \mathbb{P}(T > t) \]
for all $s, t > 0$. Then, for all $m, n \in \mathbb{N}$,
\[ \mathbb{P}(T > mt) = \mathbb{P}(T > t)^m \]
and
\[ \mathbb{P}(T > m/n)^n = \mathbb{P}(T > m) = \mathbb{P}(T > 1)^m. \]
We exclude the trivial cases where $T$ is identically 0 or $\infty$. Then $\mathbb{P}(T > 1) \in (0, 1)$, so
$\lambda = -\log \mathbb{P}(T > 1) \in (0, \infty)$. Then $\mathbb{P}(T > t) = e^{-\lambda t}$ for all rationals $t > 0$, and this
extends to all $t > 0$ because $\mathbb{P}(T > t)$ and $e^{-\lambda t}$ are both non-increasing in $t$.

18 Multivariate density functions
18.1 Definitions
A random variable $X$ in $\mathbb{R}^n$ is said to have density function $f_X$ if the joint distribution
function $F_X$ is given by
\[ F_X(x_1, \dots, x_n) = \int_{-\infty}^{x_1} \dots \int_{-\infty}^{x_n} f_X(y_1, \dots, y_n) \, dy_1 \dots dy_n. \]
It then follows that
\[ \mathbb{P}(X \in B) = \int_B f_X(x) \, dx \]
for all Borel sets $B$ in $\mathbb{R}^n$, in particular for all open and all closed sets. Moreover, for any
non-negative Borel function $g$, we have
\[ \mathbb{E}(g(X)) = \int_{\mathbb{R}^n} g(x) f_X(x) \, dx. \]
This formula remains valid for real-valued Borel functions $g$, provided $\mathbb{E}(|g(X)|) < \infty$.
We omit the proofs of these facts.

18.2 Independence
We say that random variables $X_1, \dots, X_n$ are independent if, for all $x_1, \dots, x_n \in \mathbb{R}$,
\[ \mathbb{P}(X_1 \le x_1, \dots, X_n \le x_n) = \mathbb{P}(X_1 \le x_1) \times \dots \times \mathbb{P}(X_n \le x_n). \]

Theorem 18.1. Let $X = (X_1, \dots, X_n)$ be a random variable in $\mathbb{R}^n$.

(a) Suppose that $X_1, \dots, X_n$ are independent and have density functions $f_1, \dots, f_n$
respectively. Then $X$ has density function $f_X$ given by
\[ f_X(x_1, \dots, x_n) = \prod_{i=1}^n f_i(x_i). \]

(b) On the other hand, suppose that $X$ has density function $f_X$ which factorizes as in (a)
for some non-negative functions $f_1, \dots, f_n$. Then $X_1, \dots, X_n$ are independent and
have density functions which are proportional to $f_1, \dots, f_n$ respectively.
Proof. Under the hypothesis of (a), we have, for $B = (-\infty, x_1] \times \dots \times (-\infty, x_n]$,
\[ \mathbb{P}(X \in B) = \mathbb{P}\left( \bigcap_{i=1}^n \{X_i \le x_i\} \right) = \prod_{i=1}^n \mathbb{P}(X_i \le x_i) = \prod_{i=1}^n \int_{-\infty}^{x_i} f_i(y_i) \, dy_i = \int_B \prod_{i=1}^n f_i(y_i) \, dy. \]
Hence $X$ has the claimed density function.

On the other hand, under the hypothesis of (b), since
\[ \prod_{i=1}^n \int_{\mathbb{R}} f_i(x_i) \, dx_i = \int_{\mathbb{R}^n} f_X(x) \, dx = 1 \]
we may assume, by moving suitable constant factors between the functions $f_i$, that they
all individually integrate to 1. Consider a set $B$ of the form $B_1 \times \dots \times B_n$. Then
\[ \mathbb{P}\left( \bigcap_{i=1}^n \{X_i \in B_i\} \right) = \mathbb{P}(X \in B) = \int_B f_X(x) \, dx = \prod_{i=1}^n \int_{B_i} f_i(x_i) \, dx_i. \tag{6} \]
On taking $B_j = \mathbb{R}$ for all $j \ne i$, we see that
\[ \mathbb{P}(X_i \in B_i) = \int_{B_i} f_i(x_i) \, dx_i \]
showing that $X_i$ has density $f_i$ and, returning to the general formula (6), that
\[ \mathbb{P}\left( \bigcap_{i=1}^n \{X_i \in B_i\} \right) = \prod_{i=1}^n \mathbb{P}(X_i \in B_i) \]
so $X_1, \dots, X_n$ are independent.

18.3 Marginal densities


If an $\mathbb{R}^n$-valued random variable $X = (X_1, \dots, X_n)$ has a density function $f_X$ then each of
the component random variables has a density, which is given by integrating out the other
variables. Thus $X_1$ has density
\[ f_{X_1}(x_1) = \int_{\mathbb{R}^{n-1}} f_X(x_1, x_2, \dots, x_n) \, dx_2 \dots dx_n. \]
These are called the marginal density functions. When the component random variables
are independent, we can recover $f_X$ as the product of the marginals, but this fails otherwise.

18.4 Convolution of density functions


Given two probability density functions $f, g$ on $\mathbb{R}$, we define their convolution $f * g$ by
\[ f * g(x) = \int_{\mathbb{R}} f(x - y) g(y) \, dy. \]
If $X, Y$ are independent random variables having densities $f_X, f_Y$ then $X + Y$ has density
$f_X * f_Y$. To see this we compute for $z \in \mathbb{R}$
\begin{align*}
\mathbb{P}(X + Y \le z) &= \int_{\mathbb{R}^2} 1_{\{x+y \le z\}} f_X(x) f_Y(y) \, dx \, dy = \int_{\mathbb{R}} \left( \int_{-\infty}^{z-y} f_X(x) f_Y(y) \, dx \right) dy \\
&= \int_{\mathbb{R}} \left( \int_{-\infty}^z f_X(w - y) f_Y(y) \, dw \right) dy = \int_{-\infty}^z \left( \int_{\mathbb{R}} f_X(w - y) f_Y(y) \, dy \right) dw
\end{align*}
where we made the substitution $w = x + y$ in the inner integral for the third equality and
interchanged the order of integration for the fourth.
Consider the case where $X, Y$ are independent $U[0, 1]$ random variables. Then $X + Y$
has density given by
\[ f_X * f_Y(x) = \int_{\mathbb{R}} 1_{[0,1]}(x - y) 1_{[0,1]}(y) \, dy = \int_0^1 1_{[x-1,x]}(y) \, dy = \begin{cases} x, & \text{if } x \in [0, 1], \\ 2 - x, & \text{if } x \in [1, 2]. \end{cases} \]
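
The triangular density just obtained can be checked numerically. The following Python sketch approximates the convolution integral by a midpoint rule. The helper names and the step count are illustrative assumptions, and the quadrature is a crude check rather than a recommended method.

```python
def uniform_density(x):
    """Density of U[0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def convolve_at(f, g, x, lo, hi, steps=10_000):
    """Midpoint-rule approximation of (f * g)(x), integrating y over [lo, hi].

    The interval [lo, hi] is assumed to contain the support of g.
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        total += f(x - y) * g(y)
    return total * h


if __name__ == "__main__":
    for x in (0.3, 1.0, 1.6):
        # Expected values from the formula above: x on [0, 1], 2 - x on [1, 2]
        print(x, convolve_at(uniform_density, uniform_density, x, 0.0, 1.0))
```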

18.5 Transformation of multi-dimensional random variables


Theorem 18.2. Let $X$ be a random variable with values in some domain $D$ in $\mathbb{R}^d$ and
having a density function $f_X$ on $D$. Suppose that $\phi$ maps $D$ bijectively to another domain
$\phi(D)$ in $\mathbb{R}^d$ and suppose that $\phi$ has a continuous derivative on $D$ with
\[ \det \phi'(x) \ne 0 \]
for all $x \in D$. Set $y = \phi(x)$ and consider the new random variable $Y = \phi(X)$. Then $Y$
has a density function $f_Y$ on $\phi(D)$, given by
\[ f_Y(y) = f_X(x) |J| \]
where $J$ is the Jacobian, given by
\[ J = \det \left( \left( \frac{\partial x_i}{\partial y_j} \right)_{i,j=1}^d \right). \]
We omit the proof of this result, but see Section 16.2 for the one-dimensional case. Note
that the Jacobian factor is computed from the inverse transformation, where $x$ is given as
a function of $y$.
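Here is an example. Let $(X, Y)$ be a standard normal random variable in $\mathbb{R}^2$, which
we will consider as defined in $D = \mathbb{R}^2 \setminus \{(x, 0) : x \ge 0\}$. Set $R = |(X, Y)| \in (0, \infty)$ and
let $\Theta \in (0, 2\pi)$ be the angle from the positive $x$-axis to the vector $(X, Y)$. The inverse
transformation is given by
\[ x = r\cos\theta, \quad y = r\sin\theta \]
so
\[ J = \det \begin{pmatrix} \partial x/\partial r & \partial x/\partial\theta \\ \partial y/\partial r & \partial y/\partial\theta \end{pmatrix} = \det \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix} = r \]
and $(R, \Theta)$ has density function on $(0, \infty) \times (0, 2\pi)$ given by
\[ f_{R,\Theta}(r, \theta) = f_{X,Y}(x, y) |J| = \frac{1}{2\pi} r e^{-r^2/2}. \]
Hence $R$ has density $r e^{-r^2/2}$ on $(0, \infty)$, $\Theta$ is uniform on $(0, 2\pi)$, and $R$ and $\Theta$ are independent.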
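
The conclusions of this example can be probed by simulation. The sketch below is an illustration only, with hypothetical helper names. It draws standard normal pairs, converts them to polar coordinates, and estimates $\mathbb{P}(R > 1)$, $\mathbb{P}(\Theta < \pi/2)$ and the probability of both events. Integrating the density of $R$ gives $\mathbb{P}(R > r) = e^{-r^2/2}$, so under independence the three values should be close to $e^{-1/2}$, $1/4$ and $e^{-1/2}/4$.

```python
import math
import random


def polar_sample(rng):
    """Draw (X, Y) with independent N(0, 1) coordinates and return (R, Theta)."""
    x = rng.gauss(0.0, 1.0)
    y = rng.gauss(0.0, 1.0)
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2.0 * math.pi)  # angle measured in [0, 2*pi)
    return r, theta


def empirical_checks(n_samples, rng):
    """Estimate P(R > 1), P(Theta < pi/2) and P(R > 1, Theta < pi/2)."""
    r_big = angle_small = both = 0
    for _ in range(n_samples):
        r, theta = polar_sample(rng)
        big = r > 1.0
        small = theta < math.pi / 2.0
        r_big += big
        angle_small += small
        both += big and small
    return r_big / n_samples, angle_small / n_samples, both / n_samples


if __name__ == "__main__":
    print(empirical_checks(20_000, random.Random(4)))
    print(math.exp(-0.5), 0.25, math.exp(-0.5) / 4.0)
```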

19 Simulation of random variables
We sometimes wish to generate a simulated sample $X_1, \dots, X_n$ from a given distribution.
We will discuss two ways to do this, relevant in different contexts, based on the reasonable
assumption that we can simulate well a sequence $(U_n : n \in \mathbb{N})$ of independent $U[0, 1]$
random variables.

19.1 Construction of a random variable from its distribution function
Suppose we wish to simulate a random variable $X$ which takes values in an open interval
$I$, where it has a positive density function $f$. The associated distribution function $F$ then
maps $I$ bijectively to $(0, 1)$, so it has an inverse map $G : (0, 1) \to I$. For $U \sim U[0, 1]$, since
$\mathbb{P}(U = 0) = \mathbb{P}(U = 1) = 0$, we may consider $U$ as a random variable in $(0, 1)$. Set
\[ X = G(U) \]
then
\[ \{X \le x\} = \{G(U) \le x\} = \{U \le F(x)\} \]
so
\[ \mathbb{P}(X \le x) = \mathbb{P}(U \le F(x)) = F(x). \]
Hence we have constructed a random variable with the given distribution function $F$. To
obtain a sequence of independent such random variables, we can apply the same transformation
to the sequence $(U_n : n \in \mathbb{N})$.
In fact, for any distribution function $F$, if we define $G : (0, 1) \to \mathbb{R}$ by
\[ G(u) = \inf\{x \in \mathbb{R} : u \le F(x)\} \]
then, for $u \in (0, 1)$ and $x \in \mathbb{R}$, we have $G(u) \le x$ if and only if $u \le F(x)$. We omit the
proof of this in general. Then the same argument as above shows that $X = G(U)$ has distribution
function $F$.
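
The construction can be sketched in Python for two cases. For the exponential distribution $E(\lambda)$, $F(x) = 1 - e^{-\lambda x}$ has the explicit inverse $G(u) = -\log(1 - u)/\lambda$. For a discrete distribution, the generalized inverse $G(u) = \inf\{x : u \le F(x)\}$ reduces to a scan of cumulative probabilities. The function names are illustrative, and Python's `random.random()` is assumed to be an adequate stand-in for the sequence $(U_n)$.

```python
import math
import random


def exponential_inverse_cdf(u, lam):
    """G(u) for F(x) = 1 - exp(-lam * x) on (0, infinity)."""
    return -math.log(1.0 - u) / lam


def sample_exponential(lam, n, rng):
    """Apply G to n independent uniform variables."""
    return [exponential_inverse_cdf(rng.random(), lam) for _ in range(n)]


def generalized_inverse(u, values, probs):
    """G(u) = inf{x : u <= F(x)} for mass probs[i] at sorted values[i]."""
    cumulative = 0.0
    for x, p in zip(values, probs):
        cumulative += p
        if u <= cumulative:
            return x
    return values[-1]  # guard against rounding when u is very close to 1


if __name__ == "__main__":
    rng = random.Random(2)
    xs = sample_exponential(2.0, 10_000, rng)
    print(sum(xs) / len(xs))  # should be close to 1 / lam = 0.5
    print(generalized_inverse(0.5, [0, 1, 2], [0.25, 0.5, 0.25]))
```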

19.2 Rejection sampling


Suppose we wish to simulate a random variable $X$ in $\mathbb{R}^d$ whose density function $f$ has the
form
\[ f(x) = 1_A(x)/|A| \]
where $A$ is a subset of the unit cube and $|A|$ denotes the volume of $A$, which we assume
to be positive. For this it is convenient to assume that the sequence $(U_n : n \in \mathbb{N})$ consists
of independent $d$-dimensional uniform random variables. We can obtain these from a
sequence $(U_{k,n} : k \in \{1, \dots, d\}, n \in \mathbb{N})$ of independent $U[0, 1]$ random variables by setting
$U_n = (U_{1,n}, \dots, U_{d,n})$ for all $n$. Then set
\[ X = U_N \]
where
\[ N = \min\{n \ge 1 : U_n \in A\}. \]
Then, by the law of total probability, for any Borel set $B \subseteq [0, 1]^d$,
\[ \mathbb{P}(X \in B) = \sum_{n=1}^\infty \mathbb{P}(X \in B \mid N = n) \mathbb{P}(N = n). \]
Now
\[ \mathbb{P}(X \in B \mid N = n) = \frac{\mathbb{P}(U_n \in B \cap A \text{ and } U_1, \dots, U_{n-1} \notin A)}{\mathbb{P}(U_n \in A \text{ and } U_1, \dots, U_{n-1} \notin A)} = \frac{|B \cap A|}{|A|} \]
so
\[ \mathbb{P}(X \in B) = \frac{|B \cap A|}{|A|} \sum_{n=1}^\infty \mathbb{P}(N = n) = \frac{|B \cap A|}{|A|} = \int_B f(x) \, dx \]
as required. Note that an algorithm to implement this construction requires us only to
determine sequentially whether each proposed value $U_n$ lies in $A$.
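
A minimal Python sketch of this algorithm follows, taking $A$ to be the disc of radius $1/2$ centred at $(1/2, 1/2)$ in the unit square. Both the choice of $A$ and the function names are assumptions made for illustration. The function returns the accepted point $U_N$ together with $N$, so that the number of proposals can be inspected.

```python
import random


def sample_uniform_on(in_set, d, rng):
    """Return (U_N, N), where N = min{n >= 1 : U_n in A}."""
    n = 0
    while True:
        n += 1
        u = tuple(rng.random() for _ in range(d))  # proposal U_n in [0, 1]^d
        if in_set(u):
            return u, n


def in_disc(u):
    """Membership test for the disc of radius 1/2 centred at (1/2, 1/2)."""
    x, y = u
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25


if __name__ == "__main__":
    rng = random.Random(3)
    draws = [sample_uniform_on(in_disc, 2, rng) for _ in range(5_000)]
    # N is geometric with success probability |A| = pi / 4, so its mean is 4 / pi
    print(sum(n for _, n in draws) / len(draws))
```

Only the membership test `in_set` depends on $A$, which matches the remark that the algorithm needs only to decide whether each proposal lies in $A$.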
A similar approach can be used to simulate a random variable $(X_1, \dots, X_{d-1})$ taking
values in $[0, 1]^{d-1}$ and which has a bounded density function $f$. Let $\lambda$ be an upper bound
for $f$ and define
\[ A = \{(x_1, \dots, x_{d-1}, x_d) \in [0, 1]^d : x_d \le f(x_1, \dots, x_{d-1})/\lambda\}. \]
Construct $(X_1, \dots, X_{d-1}, X_d)$ from $A$ as above. Then, for any Borel set $B \subseteq [0, 1]^{d-1}$,
\[ |(B \times [0, 1]) \cap A| = \int_B \frac{f(x_1, \dots, x_{d-1})}{\lambda} \, dx_1 \dots dx_{d-1} \]
so
\[ \mathbb{P}((X_1, \dots, X_{d-1}) \in B) = \frac{|(B \times [0, 1]) \cap A|}{|A|} = \int_B f(x_1, \dots, x_{d-1}) \, dx_1 \dots dx_{d-1}. \]
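
For $d = 2$ this becomes the familiar accept/reject loop. The sketch below assumes, purely as an example, the density $f(x) = 2x$ on $[0, 1]$ with bound $\lambda = 2$. The name `sample_from_density` is hypothetical.

```python
import random


def sample_from_density(f, bound, rng):
    """Sample from a density f on [0, 1] with f <= bound.

    Proposes points (x, height) uniformly in the unit square and accepts the
    first one lying in A = {height <= f(x) / bound}, returning its x-coordinate.
    """
    while True:
        x = rng.random()
        height = rng.random()
        if height <= f(x) / bound:
            return x


if __name__ == "__main__":
    rng = random.Random(6)
    xs = [sample_from_density(lambda x: 2.0 * x, 2.0, rng) for _ in range(10_000)]
    print(sum(xs) / len(xs))  # the density 2x on [0, 1] has mean 2/3
```

Since $|A| = 1/\lambda$, each proposal is accepted with probability $1/\lambda$, so a tighter bound $\lambda$ wastes fewer proposals.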

20 Moment generating functions
20.1 Definition
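Let $X$ be a random variable. The moment generating function of $X$ is the function $M_X$
on $\mathbb{R}$ given by
\[ M_X(\lambda) = \mathbb{E}(e^{\lambda X}). \]
Note that $M_X(0) = 1$ but it is not guaranteed that $M_X(\lambda) < \infty$ for any $\lambda \ne 0$.
For independent random variables $X, Y$, we have
\[ M_{X+Y}(\lambda) = \mathbb{E}(e^{\lambda(X+Y)}) = \mathbb{E}(e^{\lambda X} e^{\lambda Y}) = \mathbb{E}(e^{\lambda X}) \mathbb{E}(e^{\lambda Y}) = M_X(\lambda) M_Y(\lambda). \]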

20.2 Examples
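For $X \sim E(\beta)$ and $\lambda < \beta$, we have
\[ M_X(\lambda) = \int_0^\infty e^{\lambda x} \beta e^{-\beta x} \, dx = \frac{\beta}{\beta - \lambda} \]
and $M_X(\lambda) = \infty$ for $\lambda \ge \beta$. More generally, for $X \sim \Gamma(\alpha, \beta)$ and $\lambda < \beta$,
\[ M_X(\lambda) = \frac{1}{\Gamma(\alpha)} \int_0^\infty e^{\lambda x} \beta^\alpha x^{\alpha-1} e^{-\beta x} \, dx = \left( \frac{\beta}{\beta - \lambda} \right)^\alpha \]
where we recognise in the integral a multiple of the $\Gamma(\alpha, \beta - \lambda)$ density function.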


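For $X \sim N(0, 1)$ and all $\lambda \in \mathbb{R}$,
\[ M_X(\lambda) = \int_{\mathbb{R}} e^{\lambda x} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = e^{\lambda^2/2} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} e^{-(x-\lambda)^2/2} \, dx = e^{\lambda^2/2} \]
where for the last equality we recognise the integral of the $N(\lambda, 1)$ density function.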
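
The middle equality in this computation comes from completing the square in the exponent. Written out as a worked step:

\[ \lambda x - \frac{x^2}{2} = \frac{\lambda^2}{2} - \frac{(x - \lambda)^2}{2}, \quad \text{so} \quad e^{\lambda x} e^{-x^2/2} = e^{\lambda^2/2} e^{-(x-\lambda)^2/2}. \]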
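More generally, for $X \sim N(\mu, \sigma^2)$, we can write $X = \mu + \sigma Z$ with $Z \sim N(0, 1)$, so
\[ M_X(\lambda) = \mathbb{E}(e^{\lambda X}) = e^{\lambda\mu} \mathbb{E}(e^{\lambda\sigma Z}) = e^{\mu\lambda + \sigma^2\lambda^2/2}. \]
We say that a random variable $X$ has the Cauchy distribution if it has the following
density function
\[ f(x) = \frac{1}{\pi(1 + x^2)}. \]
Then, for all $\lambda \ne 0$,
\[ M_X(\lambda) = \int_{\mathbb{R}} e^{\lambda x} \frac{1}{\pi(1 + x^2)} \, dx = \infty. \]
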
20.3 Uniqueness and the continuity theorem
Moment generating functions provide a convenient way to characterize probability distri-
butions and to show their convergence, because of the following two results, whose proofs
we omit.

Theorem 20.1 (Uniqueness of moment generating functions). Let $X$ and $Y$ be random
variables having the same moment generating function $M$ and suppose that $M(\lambda) < \infty$ for
some $\lambda \ne 0$. Then $X$ and $Y$ have the same distribution function.

Theorem 20.2 (Continuity theorem for moment generating functions). Let $X$ be a random
variable and let $(X_n : n \in \mathbb{N})$ be a sequence of random variables. Suppose that $M_{X_n}(\lambda) \to M_X(\lambda)$
for all $\lambda \in \mathbb{R}$ and $M_X(\lambda) < \infty$ for some $\lambda \ne 0$. Then $X_n$ converges to $X$ in
distribution.

Here, we say that $X_n$ converges to $X$ in distribution if $F_{X_n}(x) \to F_X(x)$ as $n \to \infty$ at
all points $x \in \mathbb{R}$ where $F_X$ is continuous.
We can use uniqueness of moment generating functions to show that, if $X, X_1, \dots, X_n$
are independent $E(\beta)$ random variables, then $S_n = X_1 + \dots + X_n \sim \Gamma(n, \beta)$. For
\[ M_{S_n}(\lambda) = \prod_{k=1}^n M_{X_k}(\lambda) = M_X(\lambda)^n = \left( \frac{\beta}{\beta - \lambda} \right)^n \]
which we have seen is the moment generating function for the $\Gamma(n, \beta)$ distribution.
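
This identification can be checked by simulation. The sketch below compares the empirical distribution function of a sum of independent exponentials with the $\Gamma(n, \beta)$ distribution function. For integer $n$ it uses the standard closed form $1 - e^{-\beta x} \sum_{k=0}^{n-1} (\beta x)^k / k!$, a fact that is assumed here rather than derived in these notes. The function names are illustrative.

```python
import math
import random


def sum_of_exponentials(n, beta, rng):
    """One sample of S_n = X_1 + ... + X_n with X_k independent E(beta)."""
    return sum(rng.expovariate(beta) for _ in range(n))


def gamma_cdf_integer_shape(x, n, beta):
    """P(S <= x) for S ~ Gamma(n, beta) with integer shape n."""
    if x <= 0.0:
        return 0.0
    y = beta * x
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k  # term is now y**k / k!
        total += term
    return 1.0 - math.exp(-y) * total


if __name__ == "__main__":
    rng = random.Random(5)
    n, beta, x = 5, 2.0, 2.5
    samples = [sum_of_exponentials(n, beta, rng) for _ in range(10_000)]
    empirical = sum(1 for s in samples if s <= x) / len(samples)
    print(empirical, gamma_cdf_integer_shape(x, n, beta))
```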
The condition that $M(\lambda) < \infty$ for some $\lambda \ne 0$ is necessary for uniqueness. For, if
$X$ is a Cauchy random variable, then $M_X = M_{2X}$ but $X$ and $2X$ do not have the same
distribution.
A version of these theorems holds also for random variables in $\mathbb{R}^n$, where the moment
generating function of such a random variable $X$ is the function on $\mathbb{R}^n$ given by
\[ M_X(\lambda) = \mathbb{E}(e^{\lambda^T X}). \]
In this case the condition that $M_X(\lambda) < \infty$ for some $\lambda \ne 0$ is replaced by the requirement
that $M_X$ be finite on some open set.

21 Limits of sums of independent random variables
21.1 Weak law of large numbers
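Theorem 21.1. Let $(X_n : n \in \mathbb{N})$ be a sequence of independent identically distributed
integrable random variables. Set $S_n = X_1 + \dots + X_n$ and $\mu = \mathbb{E}(X_1)$. Then, for all $\varepsilon > 0$,
as $n \to \infty$,
\[ \mathbb{P}\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) \to 0. \]
Proof for finite second moment. Assume further that
\[ \operatorname{var}(X_1) = \sigma^2 < \infty. \]
Note that
\[ \mathbb{E}(S_n/n) = \mu, \quad \operatorname{var}(S_n/n) = \sigma^2/n. \]
Then, by Chebyshev’s inequality, for all $\varepsilon > 0$, as $n \to \infty$,
\[ \mathbb{P}\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) \le \varepsilon^{-2} \sigma^2/n \to 0. \]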

21.2 Strong law of large numbers (non-examinable)


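Theorem 21.2. Let $(X_n : n \in \mathbb{N})$ be a sequence of independent identically distributed
integrable random variables. Set $S_n = X_1 + \dots + X_n$ and $\mu = \mathbb{E}(X_1)$. Then
\[ \mathbb{P}(S_n/n \to \mu \text{ as } n \to \infty) = 1. \]

Proof for finite fourth moment. Assume further that
\[ \mathbb{E}(X_1^4) < \infty. \]
Set $Y_n = X_n - \mu$. Then $(Y_n : n \in \mathbb{N})$ is a sequence of independent identically distributed
random variables with $\mathbb{E}(Y_1) = 0$ and $\mathbb{E}(Y_1^4) < \infty$, and
\[ \frac{S_n}{n} - \mu = \frac{Y_1 + \dots + Y_n}{n}. \]
Hence, it will suffice to consider the case where $\mu = 0$.
Note that
\[ S_n^4 = \sum_{i=1}^n X_i^4 + \binom{4}{2} \sum_{1 \le i < j \le n} X_i^2 X_j^2 + R \]
where $R$ is a sum of terms of the following forms: $X_i X_j X_k X_l$ or $X_i X_j X_k^2$ or $X_i X_j^3$ for
$i, j, k, l$ distinct. Since $\mu = 0$, by independence, we have $\mathbb{E}(R) = 0$. By Cauchy–Schwarz,
\[ \mathbb{E}(X_1^2 X_2^2) \le \sqrt{\mathbb{E}(X_1^4)} \sqrt{\mathbb{E}(X_2^4)} = \mathbb{E}(X_1^4). \]
Hence
\[ \mathbb{E}(S_n^4) = n\mathbb{E}(X_1^4) + 3n(n-1)\mathbb{E}(X_1^2 X_2^2) \le 3n^2 \mathbb{E}(X_1^4) \]
and so
\[ \mathbb{E}\left( \sum_{n=1}^\infty \left( \frac{S_n}{n} \right)^4 \right) = \sum_{n=1}^\infty \mathbb{E}\left( \left( \frac{S_n}{n} \right)^4 \right) \le 3\mathbb{E}(X_1^4) \sum_{n=1}^\infty \frac{1}{n^2} < \infty. \]
Hence $\sum_{n=1}^\infty (S_n/n)^4 < \infty$ with probability 1, so the terms of this series tend to 0 and $\mathbb{P}(S_n/n \to 0) = 1$.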
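The following general argument allows us to deduce the weak law from the strong law.
Let $(X_n : n \in \mathbb{N})$ be a sequence of random variables and let $\varepsilon > 0$. Consider the events
\[ A_n = \bigcap_{m=n}^\infty \{|X_m| \le \varepsilon\}, \quad B_n = \{|X_n| \le \varepsilon\}, \quad A = \{|X_n| \le \varepsilon \text{ for all sufficiently large } n\}. \]
Then $A_n \subseteq A_{n+1}$ and $A_n \subseteq B_n$. Also, $\bigcup_{n=1}^\infty A_n = A$ and $\{X_n \to 0\} \subseteq A$. So
\[ \mathbb{P}(A_n) \le \mathbb{P}(B_n), \quad \mathbb{P}(A_n) \to \mathbb{P}(A), \quad \mathbb{P}(X_n \to 0) \le \mathbb{P}(A). \]
Hence, if $\mathbb{P}(X_n \to 0) = 1$ then $\mathbb{P}(|X_n| > \varepsilon) \to 0$ as $n \to \infty$.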

21.3 Central limit theorem


Theorem 21.3. Let $(X_n : n \in \mathbb{N})$ be a sequence of independent identically distributed
square-integrable random variables. Set $S_n = X_1 + \dots + X_n$ and set $\mu = \mathbb{E}(X_1)$ and
$\sigma^2 = \operatorname{var}(X_1)$. Then, for all $x \in \mathbb{R}$, as $n \to \infty$,
\[ \mathbb{P}\left( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \right) \to \Phi(x) \]
where $\Phi$ is the standard normal distribution function, given by
\[ \Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy. \]
A less precise but more intuitive way to state the conclusion is that, for large $n$, we
have approximately $S_n \sim N(n\mu, n\sigma^2)$. This formulation is useful in applications where the
theorem is used to justify calculating as if $S_n$ were exactly $N(n\mu, n\sigma^2)$.
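
The statement can be illustrated by simulation. The sketch below assumes $X_k \sim U[0, 1]$, for which $\mu = 1/2$ and $\sigma^2 = 1/12$. That choice, the function names, and the use of `math.erf` to evaluate $\Phi$ are all illustrative. It compares the empirical distribution function of the standardized sum with $\Phi$ at a point.

```python
import math
import random


def standard_normal_cdf(x):
    """Phi(x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def standardized_sum(n, rng):
    """(S_n - n*mu) / (sigma * sqrt(n)) for a sum of n independent U[0, 1]."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s = sum(rng.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))


def empirical_cdf_at(x, n, runs, rng):
    """Fraction of simulated standardized sums that are <= x."""
    hits = sum(1 for _ in range(runs) if standardized_sum(n, rng) <= x)
    return hits / runs


if __name__ == "__main__":
    rng = random.Random(7)
    print(empirical_cdf_at(1.0, 30, 5_000, rng), standard_normal_cdf(1.0))
```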

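Proof for finite exponential moment. Assume further that $M_X(\delta) < \infty$ and $M_X(-\delta) < \infty$
for some $\delta > 0$. It will suffice to deal with the case where $\mu = 0$ and $\sigma^2 = 1$. For then the
general case follows by considering the random variables $Y_n = (X_n - \mu)/\sigma$. Set
\[ R(x) = \frac{1}{2} x^3 \int_0^1 e^{tx} (1 - t)^2 \, dt. \]
By integration by parts, we see that
\[ e^x = 1 + x + \frac{x^2}{2} + R(x). \]
Note that, for $|\lambda| \le \delta/2$ and $t \in [0, 1]$, we have $e^{t\lambda x} \le e^{\delta|x|/2}$, so
\[ |R(\lambda x)| \le \frac{|\lambda x|^3}{3!} e^{\delta|x|/2} \le \left( \frac{2|\lambda|}{\delta} \right)^3 \frac{(\delta|x|/2)^3}{3!} e^{\delta|x|/2} \le \left( \frac{2|\lambda|}{\delta} \right)^3 e^{\delta|x|} \]
and so
\[ |R(\lambda X)| \le \left( \frac{2|\lambda|}{\delta} \right)^3 (e^{\delta X} + e^{-\delta X}). \]
Hence
\[ |\mathbb{E}(R(\lambda X))| \le \left( \frac{2|\lambda|}{\delta} \right)^3 (M_X(\delta) + M_X(-\delta)) = o(|\lambda|^2) \]
as $\lambda \to 0$. On taking expectations in the identity
\[ e^{\lambda X} = 1 + \lambda X + \frac{\lambda^2 X^2}{2} + R(\lambda X) \]
we obtain
\[ M_X(\lambda) = 1 + \frac{\lambda^2}{2} + \mathbb{E}(R(\lambda X)). \]
Hence, for all $\lambda \in \mathbb{R}$, as $n \to \infty$,
\begin{align*}
M_{S_n/\sqrt{n}}(\lambda) &= \mathbb{E}(e^{\lambda(X_1 + \dots + X_n)/\sqrt{n}}) = M_X(\lambda/\sqrt{n})^n \\
&= \left( 1 + \frac{\lambda^2}{2n}(1 + o(1)) \right)^n \to e^{\lambda^2/2} = M_Z(\lambda)
\end{align*}
where $Z \sim N(0, 1)$. The result then follows from the continuity theorem for moment
generating functions.
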
21.4 Sampling error via the central limit theorem
A referendum is held in a large population. A proportion p of the population are inclined
to vote ‘Yes’, the rest being inclined to vote ‘No’. A random sample of N individuals is
interviewed prior to the referendum and asked for their voting intentions. How large should
N be chosen in order to predict the percentage of ‘Yes’ voters with an accuracy of ±4%
with probability exceeding 0.99?
If we suppose that the interviewees are chosen uniformly at random and with replacement,
and all answer truthfully, then the proportion $\hat{p}_N$ of ‘Yes’ voters revealed by the
sample is given by
\[ \hat{p}_N = \frac{S_N}{N} \]
where $S_N \sim B(N, p)$. Note that
\[ \mathbb{E}(S_N) = Np, \quad \operatorname{var}(S_N) = Npq \]
where $q = 1 - p$. Since $N$ will be chosen large, we will use the approximation to the distribution
of $S_N$ given by the central limit theorem. Thus $S_N$ is approximated in distribution
by
\[ Np + \sqrt{Npq} \, Z \]
where $Z \sim N(0, 1)$. Hence
\[ \mathbb{P}(|\hat{p}_N - p| > \varepsilon) \approx \mathbb{P}\left( \left| \frac{\sqrt{Npq} \, Z}{N} \right| > \varepsilon \right) = \mathbb{P}\left( |Z| > \varepsilon \sqrt{\frac{N}{pq}} \right). \]
By symmetry of the $N(0, 1)$ distribution, for $z > 0$,
\[ \mathbb{P}(|Z| > z) = 2\mathbb{P}(Z > z) = 2(1 - \Phi(z)) \]
so $\mathbb{P}(|Z| > z) = 0.01$ for $z = 2.58$. We want $\varepsilon = 0.04$ so, choosing $p = 1/2$ for the worst
variance, we require
\[ 2 \times \frac{4}{100} \sqrt{N} = 2.58 \]
which gives $N \approx 1040$.
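
The arithmetic can be reproduced in a few lines of Python. In the sketch below, `sample_size_from_z` evaluates $N = z^2 pq / \varepsilon^2$ for a given $z$, and `required_sample_size` replaces the rounded value $2.58$ by the quantile computed with `statistics.NormalDist`. Both function names are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_from_z(z, eps, p=0.5):
    """Solve eps * sqrt(N / (p * q)) = z for N, with q = 1 - p."""
    return z * z * p * (1.0 - p) / (eps * eps)


def required_sample_size(eps, confidence, p=0.5):
    """Smallest integer N meeting the normal-approximation requirement."""
    # z satisfies P(|Z| > z) = 1 - confidence
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(sample_size_from_z(z, eps, p))


if __name__ == "__main__":
    print(sample_size_from_z(2.58, 0.04))    # the calculation in the text
    print(required_sample_size(0.04, 0.99))  # using an unrounded quantile
```

The unrounded quantile is slightly below $2.58$, so the second figure is a little smaller than the first. Both rest on the same normal approximation.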

22 Geometric probability
Some nice problems can be formulated in terms of points or lines chosen uniformly at
random in a given geometric object. Their solutions often benefit from observations of
symmetry.

22.1 Bertrand’s paradox


Suppose we are given a circle and draw a random chord. What is the probability that the
length of the chord exceeds the length L of the sides of an equilateral triangle inscribed in
the circle?
Here is one answer. We can construct a random chord C1 by drawing a straight line
between two points A, B chosen uniformly at random on the circle. Write ADE for the
inscribed equilateral triangle with vertex at A. Then |C1 | > L if and only if B lies on the
shorter arc from D to E. By symmetry, this event has probability 1/3.
Alternatively, we can construct a random chord C2 by choosing a point P inside the
circle and requiring that P be the midpoint of C2 . This determines C2 uniquely. Denote
the centre of the circle by O, the endpoints of C2 by D, E and the endpoint of the radius
through P by Q. If |C2 | = L, then OQD and OQE are congruent equilateral triangles. So
P is also the midpoint of OQ. Hence |C2 | > L if and only if |OP | < R/2. By scaling, this
event has probability 1/4.
These alternative answers are not a paradox but do show that there is more than one
natural distribution for a random chord. Arguably, neither is the most natural. We could
instead draw a line in a plane and think of dropping the circle randomly onto the plane.
Conditional on the circle hitting the line, it is natural to suppose that the distribution of
the height H of its centre above the line is uniform on [−R, R], where R is the radius of
the circle. Denote the chord obtained by C3. By the calculation for C2, we have |C3| > L
if and only if |H| < R/2. So for this model the probability is 1/2.
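
The three models can be compared by simulation. The sketch below assumes a circle of radius $R = 1$, so that $L = \sqrt{3}$. The function names are illustrative. The midpoint model uses a simple rejection step to pick a uniform point in the disc, and the chord lengths come from elementary geometry: $2|\sin((a - b)/2)|$ for endpoints at angles $a, b$, and $2\sqrt{1 - h^2}$ for a chord at distance $h$ from the centre.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side length L of the inscribed triangle when R = 1


def chord_two_endpoints(rng):
    """Model C1: endpoints at two independent uniform angles."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * abs(math.sin((a - b) / 2.0))


def chord_random_midpoint(rng):
    """Model C2: midpoint uniform in the disc (found by rejection)."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        d2 = x * x + y * y
        if d2 < 1.0:
            return 2.0 * math.sqrt(1.0 - d2)


def chord_random_height(rng):
    """Model C3: signed distance H from centre to line uniform on [-1, 1]."""
    h = rng.uniform(-1.0, 1.0)
    return 2.0 * math.sqrt(1.0 - h * h)


def probability_longer_than_side(chord, n_samples, rng):
    """Estimate P(|C| > L) for the given chord model."""
    return sum(1 for _ in range(n_samples) if chord(rng) > SIDE) / n_samples


if __name__ == "__main__":
    rng = random.Random(8)
    for model in (chord_two_endpoints, chord_random_midpoint, chord_random_height):
        print(model.__name__, probability_longer_than_side(model, 20_000, rng))
```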

22.2 Buffon’s needle


A needle of length $\ell$ is tossed at random onto a floor marked with parallel lines spaced a
distance $L$ apart. We assume that $\ell \le L$. What is the probability $p$ that the needle crosses
one of the lines?
Think of the direction of the lines as horizontal and the perpendicular direction as
vertical. Write $Y$ for the distance of the lower endpoint of the needle from the nearest line
above it and write $\Theta$ for the angle made by the needle with the positive horizontal
direction. A reasonable model is then to take $Y \sim U[0, L]$ and $\Theta \sim U[0, \pi]$, with $Y$ and $\Theta$
independent. Since $\ell \le L$, with probability 1, the needle can only cross one line and this
happens if and only if $Y \le \ell\sin\Theta$. Hence
\[ p = \frac{1}{\pi L} \int_0^\pi \int_0^L 1_{\{y \le \ell\sin\theta\}} \, dy \, d\theta = \frac{1}{\pi L} \int_0^\pi \ell\sin\theta \, d\theta = \frac{2\ell}{\pi L}. \]

The appearance of $\pi$ in the probability for this simple experiment turns it, in principle,
into a means to estimate $\pi$. Consider the function $f$ on $(0, \infty)$ given by
\[ f(x) = \frac{2\ell}{xL}. \]
Then $f(p) = \pi$ and $f'(p) = -2\ell/(p^2 L) = -\pi/p$. Suppose we throw $n$ needles on the floor
and denote by $\hat{p}_n$ the proportion which land on a line. The central limit theorem gives an
approximation in distribution
\[ \hat{p}_n \approx p + \sqrt{p(1-p)/n} \, Z \]
where $Z \sim N(0, 1)$. Set
\[ \hat{\pi}_n = f(\hat{p}_n) = \frac{2\ell}{\hat{p}_n L}. \]
We use Taylor’s theorem for the approximation
\[ \hat{\pi}_n \approx \pi + (\hat{p}_n - p) f'(p). \]
Then, combining the two approximations,
\[ \hat{\pi}_n \approx \pi - \pi \sqrt{\frac{1-p}{np}} \, Z. \]

Note that the approximate variance $\pi^2(1-p)/(np)$ of $\hat{\pi}_n$ is decreasing in $p$. We take $\ell = L$
for the minimal variance $\pi^2(\pi/2 - 1)/n$. Now
\[ \mathbb{P}(|Z| > 2.58) = 0.01 \]
so to obtain
\[ \mathbb{P}(|\hat{\pi}_n - \pi| \le 0.001) \ge 0.99 \]
we need $2.58\pi\sqrt{(\pi/2 - 1)/n} \approx 0.001$, that is $n \approx 3.75 \times 10^7$. It is not a very efficient way
to estimate $\pi$.
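
Both the experiment and the sample-size calculation are easy to sketch in Python. The function names below are illustrative. The simulation itself uses `math.pi` to sample $\Theta$, so it illustrates the statistics of the estimator rather than an independent way of computing $\pi$.

```python
import math
import random


def buffon_pi_estimate(n, ell, spacing, rng):
    """Throw n needles of length ell <= spacing and return 2*ell / (p_hat * spacing)."""
    crossings = 0
    for _ in range(n):
        y = rng.uniform(0.0, spacing)      # distance to the nearest line above
        theta = rng.uniform(0.0, math.pi)  # angle with the horizontal
        if y <= ell * math.sin(theta):
            crossings += 1
    # Assumes at least one crossing, which is overwhelmingly likely for large n
    return 2.0 * ell * n / (crossings * spacing)


def throws_needed(tolerance, z=2.58):
    """Solve z * pi * sqrt((pi/2 - 1) / n) = tolerance for n (case ell = L)."""
    return (z * math.pi / tolerance) ** 2 * (math.pi / 2.0 - 1.0)


if __name__ == "__main__":
    rng = random.Random(11)
    print(buffon_pi_estimate(50_000, 1.0, 1.0, rng))
    print(throws_needed(0.001))  # roughly 3.75e7, as in the text
```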

23 Gaussian random variables
23.1 Definition
A random variable $X$ in $\mathbb{R}$ is Gaussian if
\[ X = \mu + \sigma Z \]
for some $\mu \in \mathbb{R}$ and some $\sigma \in [0, \infty)$, where $Z \sim N(0, 1)$. We write $X \sim N(\mu, \sigma^2)$. If
$\sigma > 0$, then $X$ has a density on $\mathbb{R}$ given by
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}. \]
A random variable $X$ in $\mathbb{R}^n$ is Gaussian if $u^T X$ is Gaussian for all $u \in \mathbb{R}^n$. For such $X$, if
$a$ is an $m \times n$ matrix and $b \in \mathbb{R}^m$, then $aX + b$ is also Gaussian. To see this, note that, for
all $v \in \mathbb{R}^m$,
\[ v^T(aX + b) = u^T X + v^T b \]
where $u = a^T v$, so $v^T(aX + b)$ is Gaussian.

23.2 Moment generating function


Given a Gaussian random variable $X$ in $\mathbb{R}^n$, set
\[ \mu = \mathbb{E}(X), \quad V = \operatorname{var}(X) = \mathbb{E}((X - \mu)(X - \mu)^T). \]
Then, by linearity of expectation,
\[ \mathbb{E}(u^T X) = u^T \mu, \quad 0 \le \operatorname{var}(u^T X) = u^T V u \]
so
\[ u^T X \sim N(u^T \mu, u^T V u). \]
Note that $V$ is an $n \times n$ matrix, which is necessarily symmetric and non-negative definite.
The moment generating function of $X$ is the function $M_X$ on $\mathbb{R}^n$ given by
\[ M_X(\lambda) = \mathbb{E}(e^{\lambda^T X}). \]
By taking $u = \lambda$, from the known form of the moment generating function in the scalar
case, we see that
\[ M_X(\lambda) = e^{\lambda^T \mu + \lambda^T V \lambda/2}. \]
By uniqueness of moment generating functions, this shows that the distribution of $X$ is
determined by its mean $\mu$ and covariance matrix $V$. So we write $X \sim N(\mu, V)$.

23.3 Construction
Given independent $N(0, 1)$ random variables $Z_1, \dots, Z_n$, we can define a Gaussian random
variable in $\mathbb{R}^n$ by
\[ Z = (Z_1, \dots, Z_n)^T. \]
To check that $Z$ is indeed Gaussian, we compute for $u = (u_1, \dots, u_n)^T \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$
\[ \mathbb{E}(e^{\lambda u^T Z}) = \mathbb{E}\left( \prod_{i=1}^n e^{\lambda u_i Z_i} \right) = \prod_{i=1}^n e^{\lambda^2 u_i^2/2} = e^{\lambda^2 |u|^2/2} \]
which shows by uniqueness of moment generating functions that $u^T Z \sim N(0, |u|^2)$. Now
\[ \mathbb{E}(Z) = 0, \quad \operatorname{cov}(Z_i, Z_j) = \delta_{ij} \]
so $Z \sim N(0, I_n)$ where $I_n$ is the $n \times n$ identity matrix.
More generally, given $\mu \in \mathbb{R}^n$ and a non-negative definite $n \times n$ matrix $V$, we can define
a random variable $X$ in $\mathbb{R}^n$ by
\[ X = \mu + \sigma Z \]
where $\sigma$ is the non-negative definite square root of $V$. Then $X$ is Gaussian and
\[ \mathbb{E}(X) = \mu, \quad \operatorname{var}(X) = \mathbb{E}(\sigma Z (\sigma Z)^T) = \sigma \mathbb{E}(Z Z^T) \sigma = \sigma I_n \sigma = V \]
so $X \sim N(\mu, V)$.

23.4 Density function


In the case where $V$ is positive definite, we can invert the transformation $x = \mu + \sigma z$ by
$z = \sigma^{-1}(x - \mu)$. Then
\[ |z|^2 = z^T z = (x - \mu)^T V^{-1} (x - \mu), \quad J = \det(\sigma^{-1}) = (\det V)^{-1/2} \]
so $X = \mu + \sigma Z$ has a density function on $\mathbb{R}^n$ given by
\[ f_X(x) = f_Z(z) |J| = (2\pi)^{-n/2} (\det V)^{-1/2} e^{-(x-\mu)^T V^{-1} (x-\mu)/2} \]
which is thus a density function for all $N(\mu, V)$ random variables.
More generally, by an orthogonal change of basis, we may suppose that we have a
decomposition $\mathbb{R}^n = \mathbb{R}^m \times \mathbb{R}^{n-m}$ such that
\[ \mu = \begin{pmatrix} \lambda \\ \nu \end{pmatrix}, \quad V = \begin{pmatrix} U & 0 \\ 0 & 0 \end{pmatrix} \]
where $U$ is a positive definite $m \times m$ matrix. Then
\[ X = \begin{pmatrix} Y \\ \nu \end{pmatrix} \]
where $Y$ is a random variable in $\mathbb{R}^m$ with density function
\[ f_Y(y) = (2\pi)^{-m/2} (\det U)^{-1/2} e^{-(y-\lambda)^T U^{-1} (y-\lambda)/2}. \]

23.5 Bivariate normals
In the case $n = 2$, the Gaussian distributions are characterized by five parameters. Let
$X = (X_1, X_2)$ be a Gaussian random variable in $\mathbb{R}^2$ and set
\[ \mu_k = \mathbb{E}(X_k), \quad \sigma_k^2 = \operatorname{var}(X_k), \quad \rho = \operatorname{corr}(X_1, X_2). \]
Then $\mu_1, \mu_2 \in \mathbb{R}$ and $\sigma_1^2, \sigma_2^2 \in [0, \infty)$ and $\rho \in [-1, 1]$. The covariance matrix $V$ is given by
\[ V = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \]
Note that, for any such matrix $V$ and for all $x = (x_1, x_2)^T \in \mathbb{R}^2$,
\[ x^T V x = (1-\rho)(\sigma_1^2 x_1^2 + \sigma_2^2 x_2^2) + \rho(\sigma_1 x_1 + \sigma_2 x_2)^2 = (1+\rho)(\sigma_1^2 x_1^2 + \sigma_2^2 x_2^2) - \rho(\sigma_1 x_1 - \sigma_2 x_2)^2 \ge 0 \]
so $V$ is non-negative definite for all $\rho \in [-1, 1]$. This shows that all combinations of
parameter values in the given ranges are possible.
In the case $\rho = 0$, excluding the trivial cases $\sigma_1 = 0$ and $\sigma_2 = 0$, $X$ has density function
\[ f_X(x) = (2\pi)^{-1} (\det V)^{-1/2} e^{-(x-\mu)^T V^{-1} (x-\mu)/2} = \prod_{k=1}^2 (2\pi\sigma_k^2)^{-1/2} e^{-(x_k-\mu_k)^2/(2\sigma_k^2)} \]
so $X_1$ and $X_2$ are independent, with $X_k \sim N(\mu_k, \sigma_k^2)$.


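More generally, we have
\[ \operatorname{cov}(X_1, X_2 - aX_1) = \operatorname{cov}(X_1, X_2) - a \operatorname{var}(X_1) = \rho\sigma_1\sigma_2 - a\sigma_1^2. \]
If we take $Y = X_2 - aX_1$ with $a = \rho\sigma_2/\sigma_1$, then $\operatorname{cov}(X_1, Y) = 0$. But
\[ \begin{pmatrix} X_1 \\ Y \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -a & 1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \]
so $(X_1, Y)$ is Gaussian, and so $X_1$ and $Y$ are independent by the argument of the preceding
paragraph. Hence $X_2$ has a decomposition
\[ X_2 = aX_1 + Y \]
in which $X_1$ and $Y$ are independent Gaussian random variables.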
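
This decomposition gives a direct way to simulate a bivariate normal pair from two independent $N(0, 1)$ variables. The sketch below assumes $\sigma_1 > 0$ and uses $\mathbb{E}(Y) = \mu_2 - a\mu_1$ and $\operatorname{var}(Y) = \sigma_2^2(1 - \rho^2)$, which follow from $Y = X_2 - aX_1$ but are not stated in the text. The function names are illustrative.

```python
import math
import random


def sample_bivariate_normal(mu1, mu2, sigma1, sigma2, rho, rng):
    """Return (X1, X2) built as X2 = a*X1 + Y with X1, Y independent Gaussians."""
    a = rho * sigma2 / sigma1  # requires sigma1 > 0
    x1 = mu1 + sigma1 * rng.gauss(0.0, 1.0)
    # Y has mean mu2 - a*mu1 and variance sigma2^2 * (1 - rho^2)
    y = (mu2 - a * mu1) + sigma2 * math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return x1, a * x1 + y


def sample_correlation(pairs):
    """Empirical correlation coefficient of a list of (x1, x2) pairs."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n
    v1 = sum((p[0] - m1) ** 2 for p in pairs) / n
    v2 = sum((p[1] - m2) ** 2 for p in pairs) / n
    return c12 / math.sqrt(v1 * v2)


if __name__ == "__main__":
    rng = random.Random(9)
    pairs = [sample_bivariate_normal(1.0, -1.0, 1.0, 2.0, 0.6, rng) for _ in range(20_000)]
    print(sample_correlation(pairs))  # should be close to rho = 0.6
```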
