Probability P
J.R. Norris
December 13, 2017
Contents

1 Mathematical models for randomness
1.1 A general definition
1.2 Equally likely outcomes
1.3 Throwing a die
1.4 Balls from a bag
1.5 A well-shuffled pack of cards
1.6 Distribution of the largest digit
1.7 Coincident birthdays

3 Stirling's formula
3.1 Statement of Stirling's formula
3.2 Asymptotics for log(n!)
3.3 Proof of Stirling's formula (non-examinable)

6 Independence
6.1 Definition
6.2 Pairwise independence does not imply independence
6.3 Independence and product spaces

8 Conditional probability
8.1 Definition
8.2 Law of total probability
8.3 Bayes' formula
8.4 False positives for a rare condition
8.5 Knowledge changes probabilities in surprising ways
8.6 Simpson's paradox

9 Random variables
9.1 Definitions
9.2 Doing without measure theory
9.3 Number of heads

10 Expectation
10.1 Definition
10.2 Properties of expectation
10.3 Variance and covariance
10.4 Zero covariance does not imply independence
10.5 Calculation of some expectations and variances
10.6 Conditional expectation
10.7 Inclusion-exclusion via expectation

11 Inequalities
11.1 Markov's inequality
11.2 Chebyshev's inequality
11.3 Cauchy–Schwarz inequality
11.4 Jensen's inequality
11.5 AM/GM inequality

12 Random walks
12.1 Definitions
12.2 Gambler's ruin
12.3 Mean time to absorption

13 Generating functions
13.1 Definition
13.2 Examples
13.3 Generating functions and moments
13.4 Sums of independent random variables
13.5 Random sums
13.6 Counting with generating functions

14 Branching processes
14.1 Definition
14.2 Mean population size
14.3 Generating function for the population size
14.4 Conditioning on the first generation
14.5 Extinction probability
14.6 Example

22 Geometric probability
22.1 Bertrand's paradox
22.2 Buffon's needle
1 Mathematical models for randomness
1.1 A general definition
Let Ω be a set. Let F be a set of subsets of Ω. We say that F is a σ-algebra if Ω ∈ F and, for all A ∈ F and every sequence (An : n ∈ N) in F,
A^c ∈ F,  ⋃_{n=1}^∞ An ∈ F.
Thus F is non-empty and closed under countable set operations. Assume that F is indeed
a σ-algebra. A function P : F → [0, 1] is called a probability measure if P(Ω) = 1 and, for every sequence (An : n ∈ N) of disjoint elements of F,
P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
Assume that P is indeed a probability measure. The triple (Ω, F, P) is then called a
probability space.
In the case where Ω is countable, we will take F to be the set of all subsets of Ω unless
otherwise stated. The elements of Ω are called outcomes and the elements of F are called
events. In a probability model, Ω will be an abstraction of a real set of outcomes, and
F will model observable events. Then P(A) is interpreted as the probability of the event
A. In some probability models, for example choosing a random point in the interval [0, 1],
the probability of each individual outcome is 0. This is one reason we need to specify
probabilities of events rather than outcomes.
1.2 Equally likely outcomes
When Ω is finite and, by symmetry or otherwise, all outcomes are judged equally likely, we take
P(A) = |A|/|Ω|.
This defines a probability measure, since for disjoint sets A and B we have
|A ∪ B| = |A| + |B|.
1.3 Throwing a die
The throw of a die has six possible outcomes. We use symmetry to justify the following model for a throw of the die: take Ω = {1, 2, 3, 4, 5, 6} and
P({ω}) = 1/|Ω| = 1/6 for each ω ∈ Ω.
1.6 Distribution of the largest digit
Consider a sequence of n random digits, modelled on the sample space
Ω = {0, 1, . . . , 9}^n.
In the absence of further information, and prompted by the characterization as ‘random digits’, we assume that all outcomes are equally likely. Consider the event Ak that none of the n digits exceeds k and the event Bk that the largest digit is k. Then |Ak| = (k + 1)^n, so P(Ak) = (k + 1)^n/10^n, and Bk = Ak \ Ak−1. So
P(Bk) = ((k + 1)^n − k^n)/10^n.
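For small n these probabilities are easy to confirm by enumerating all outcomes. A quick sketch in Python (ours, not part of the original notes):

```python
from itertools import product

n = 3  # three random digits
outcomes = list(product(range(10), repeat=n))  # all 10^n equally likely outcomes

for k in range(10):
    # B_k: the largest digit equals k
    b_k = sum(1 for w in outcomes if max(w) == k) / len(outcomes)
    formula = ((k + 1) ** n - k ** n) / 10 ** n
    assert abs(b_k - formula) < 1e-12
```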
1.7 Coincident birthdays
Consider the birthdays of n people, modelled on the sample space
Ω = {1, . . . , 365}^n.
We will work on the assumption that all outcomes are equally likely. In fact this is empirically false, but we are free to choose the model and have no further information. Consider the event A that all n birthdays are different. Then
P(A) = (365 × 364 × ··· × (365 − n + 1))/365^n.
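The product formula for P(A) is easy to evaluate; a short sketch (ours, not from the notes) recovering the familiar threshold at n = 23:

```python
def p_all_different(n, days=365):
    # P(A) = (days/days) * ((days-1)/days) * ... * ((days-n+1)/days)
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

# With 23 people, coincident birthdays are more likely than not
assert p_all_different(22) > 0.5 > p_all_different(23)
```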
2 Counting the elements of a finite set
We have seen the need to count numbers of permutations and numbers of subsets of a
given size. Now we take a systematic look at some methods of counting.
Suppose that Ω is a finite set. Then, for some integer n ≥ 0, there is a bijection
{1, . . . , n} → Ω.
Then n is called the cardinality of Ω. We will tend to refer to n simply as the size of Ω.
The set {1, . . . , n1} × ··· × {1, . . . , nk} has size n1 × ··· × nk. More generally, suppose we are given a sequence of sets Ω1, . . . , Ωk and bijections f1 : {1, . . . , n1} → Ω1 and fi : Ωi−1 × {1, . . . , ni} → Ωi for i = 2, . . . , k. Define maps
gi : {1, . . . , n1} × ··· × {1, . . . , ni} → Ωi,  i = 1, . . . , k
by
g1 = f1,  gi(m1, . . . , mi) = fi(gi−1(m1, . . . , mi−1), mi).
It can be checked by induction on i that these are all bijections. In particular, we see that
|Ωk| = n1 × n2 × ··· × nk.
Thus, whenever the elements of a set Ω are determined by a sequence of k choices, with ni possibilities at the ith stage, we have
|Ω| = n1 × n2 × ··· × nk.
2.2 Permutations
The bijections of {1, . . . , n} to itself are called permutations of {1, . . . , n}. We may obtain
all these permutations by choosing successively the image of 1, then that of 2, and finally
the image of n. There are respectively n, n − 1, . . . , 1 choices at each stage, corresponding
to the numbers of values which have not been taken. Hence the number of permutations
of {1, . . . , n} is n!. This is then also the number of bijections between any two given sets of
size n.
2.3 Subsets
Fix k ≤ n. Let N denote the number of subsets of {1, . . . , n} which have k elements. We
can count the permutations of {1, . . . , n} as follows. First choose a subset S1 of {1, . . . , n}
of size k and set S2 = {1, . . . , n} \ S1 . Then choose bijections {1, . . . , k} → S1 and
{k + 1, . . . , n} → S2 . These choices determine a unique permutation of {1, . . . , n} and we
obtain all such permutations in this way. Hence
N × k! × (n − k)! = n!
and so
N = n!/(k!(n − k)!) = \binom{n}{k}.
More generally, suppose we are given integers n1, . . . , nk ≥ 0 with n1 + ··· + nk = n.
Let M denote the number of ways to partition {1, . . . , n} into k disjoint subsets S1 , . . . , Sk
with |S1 | = n1 , . . . , |Sk | = nk . The argument just given generalizes to show that
M × n1 ! × · · · × nk ! = n!
so
M = n!/(n1! ··· nk!) = \binom{n}{n1, . . . , nk}.
Each non-decreasing function f : {1, . . . , m} → {1, . . . , n} determines an increasing function g : {1, . . . , m} → {1, . . . , m + n − 1}, where
g(i) = i + f(i) − 1,
and this correspondence is bijective. Hence the number of such non-decreasing functions is \binom{m+n−1}{m}.
3 Stirling’s formula
Stirling’s formula is important in giving a computable asymptotic equivalent for factorials,
which enter many expressions for probabilities. Stirling's formula states that
n! ∼ √(2πn) n^n e^{−n} as n → ∞,
where an ∼ bn means that an/bn → 1. We first prove a cruder asymptotic for log(n!), then prove the formula itself.
Write ln = log(n!) = log 2 + ··· + log n. Since log is non-decreasing,
∫_1^n log x dx ≤ ln ≤ ∫_1^{n+1} log x dx
so
n log n − n + 1 ≤ ln ≤ (n + 1) log(n + 1) − n.
The ratio of the left-hand side to n log n tends to 1 as n → ∞, and the same is true of the right-hand side, so we deduce that
log(n!) ∼ n log n.
Take f = log to obtain
∫_n^{n+1} log x dx = (log n + log(n + 1))/2 + (1/2) ∫_n^{n+1} (x − n)(n + 1 − x) (1/x²) dx.
Next, sum over n to obtain
n log n − n + 1 = log(n!) − (1/2) log n + Σ_{k=1}^{n−1} ak
where
ak = (1/2) ∫_0^1 x(1 − x)/(k + x)² dx ≤ (1/(2k²)) ∫_0^1 x(1 − x) dx = 1/(12k²).
Set
A = exp{1 − Σ_{k=1}^∞ ak}.
We rearrange our equation for log(n!) and then take the exponential to obtain
n! = A n^{n+1/2} e^{−n} exp{Σ_{k=n}^∞ ak}.
It follows that
n! ∼ A n^{n+1/2} e^{−n}
and, from this asymptotic, we deduce that
2^{−2n} \binom{2n}{n} ∼ √2/(A√n).
We will complete the proof by showing that
2^{−2n} \binom{2n}{n} ∼ 1/√(nπ)
so A = √(2π), as required. Set
In = ∫_0^{π/2} cos^n θ dθ.
Then I0 = π/2 and I1 = 1. For n ≥ 2 we can integrate by parts to obtain
In = ((n − 1)/n) I_{n−2}.
Then
I_{2n} = (1/2)(3/4) ··· ((2n − 1)/(2n)) (π/2) = 2^{−2n} \binom{2n}{n} (π/2),
I_{2n+1} = (2/3)(4/5) ··· ((2n)/(2n + 1)) = [2^{−2n} \binom{2n}{n}]^{−1} (1/(2n + 1)).
But In is decreasing in n and In/I_{n−2} → 1, so also I_{2n}/I_{2n+1} → 1, and so
[2^{−2n} \binom{2n}{n}]² ∼ 2/((2n + 1)π) ∼ 1/(nπ)
which gives the claimed asymptotic.
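A numerical check of the formula (our sketch, not part of the notes): the ratio n!/(√(2πn) n^n e^{−n}) tends to 1, and the relative error behaves like 1/(12n).

```python
import math

for n in [1, 5, 10, 50]:
    approx = math.sqrt(2 * math.pi * n) * n ** n * math.exp(-n)
    ratio = math.factorial(n) / approx
    # The next-order correction is about 1/(12n), so the ratio sits just above 1
    assert 1.0 < ratio < 1.0 + 1.0 / (11 * n)
```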
4 Basic properties of probability measures
Let (Ω, F, P) be a probability space. Recall that the probability measure P is a function
P : F → [0, 1] with P(Ω) = 1 which is countably additive, that is, has the property that,
for all sequences (An : n ∈ N) of disjoint sets in F,
P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
Given a sequence (An : n ∈ N) in F, we can always construct a disjoint sequence (Bn : n ∈ N) with the same union by setting
B1 = A1,  Bn = An \ (A1 ∪ ··· ∪ An−1),  n = 2, 3, . . . .
4.2 Continuity
For all sequences (An : n ∈ N) in F such that An ⊆ An+1 for all n and ⋃_{n=1}^∞ An = A, we have
lim_{n→∞} P(An) = P(A).
To see this, define Bn as above and note that ⋃_{k=1}^n Bk = An for all n and ⋃_{n=1}^∞ Bn = A.
Then, by countable additivity,
P(An) = P(⋃_{k=1}^n Bk) = Σ_{k=1}^n P(Bk) → Σ_{n=1}^∞ P(Bn) = P(⋃_{n=1}^∞ Bn) = P(A).
The inclusion-exclusion formula states that, for events A1, . . . , An,
P(A1 ∪ ··· ∪ An) = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} P(A_{i1} ∩ ··· ∩ A_{ik})
= Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + ··· + (−1)^{n+1} P(A1 ∩ A2 ∩ ··· ∩ An).
For n = 2 and n = 3, this says simply that
P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)
and
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3).
To prove the formula for general n, note that
P(A1 ∪ ··· ∪ An) = P(A1 ∪ ··· ∪ An−1) + P(An) − P(B1 ∪ ··· ∪ Bn−1)
where Bk = Ak ∩ An. On using the formula for the case n − 1 for the terms on the right-hand side, and rearranging, we obtain the formula for n. We omit the details. Hence the general case follows by induction. We will give another proof using expectation later.
Note in particular the special case of inclusion-exclusion for the case of equally likely
outcomes, which we write in terms of the sizes of sets. Let A1 , . . . , An be subsets of a finite
set Ω. Then
|A1 ∪ ··· ∪ An| = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} |A_{i1} ∩ ··· ∩ A_{ik}|.
The Bonferroni inequalities state that, when the inclusion-exclusion formula is truncated at the kth term, the result is an overestimate of P(A1 ∪ ··· ∪ An) if k is odd and an underestimate if k is even. For n = 2 this gives P(A1 ∪ A2) ≤ P(A1) + P(A2), while, for n = 3,
P(A1 ∪ A2 ∪ A3) ≤ P(A1) + P(A2) + P(A3)
and
P(A1 ∪ A2 ∪ A3) ≥ P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3).
In the identity
P(A1 ∪ ··· ∪ An) = P(A1 ∪ ··· ∪ An−1) + P(An) − P(B1 ∪ ··· ∪ Bn−1)
used in the proof of inclusion-exclusion, suppose we substitute for P(A1 ∪ ··· ∪ An−1) using inclusion-exclusion truncated at the kth term and for P(B1 ∪ ··· ∪ Bn−1) using inclusion-exclusion truncated at the (k − 1)th term. Then we obtain on the right the inclusion-exclusion formula truncated at the kth term. Suppose inductively that the Bonferroni inequalities hold for n − 1 and that k is odd. Then k − 1 is even. So, on the right-hand side, the substitution results in an overestimate. Similarly, if k is even, we get an underestimate. Hence the inequalities hold for all n ≥ 2 by induction.
5 More counting using inclusion-exclusion
Sometimes it is easier to count intersections of sets than unions. We give two examples of
this where the inclusion-exclusion formula can be used to advantage.
5.1 Surjections
The inclusion-exclusion formula gives an expression for the number of surjections from {1, . . . , n} to {1, . . . , m}. Write Ω for the set of all functions from {1, . . . , n} to {1, . . . , m} and consider the subsets
Ai = {ω ∈ Ω : i ∉ {ω(1), . . . , ω(n)}}.
Then |A_{i1} ∩ ··· ∩ A_{ik}| = (m − k)^n for distinct indices i1 < ··· < ik, and the surjections are exactly the elements of Ω \ (A1 ∪ ··· ∪ Am). Hence, by inclusion-exclusion, the number of surjections is
m^n − Σ_{k=1}^m (−1)^{k+1} \binom{m}{k} (m − k)^n = Σ_{k=0}^m (−1)^k \binom{m}{k} (m − k)^n.
5.2 Derangements
A permutation of {1, . . . , n} is called a derangement if it has no fixed points. Using
inclusion-exclusion, we can calculate the probability that a random permutation is a de-
rangement. Write Ω for the set of permutations and A for the subset of derangements. For
i ∈ {1, . . . , n}, consider the event
Ai = {ω ∈ Ω : ω(i) = i}.
For i1 < · · · < ik , each element of the intersection Ai1 ∩ · · · ∩ Aik corresponds to a permu-
tation of {1, . . . , n} \ {i1 , . . . , ik }. So
|A_{i1} ∩ ··· ∩ A_{ik}| = (n − k)!
and so
P(A_{i1} ∩ ··· ∩ A_{ik}) = (n − k)!/n!.
Then, by inclusion-exclusion,
P(⋃_{i=1}^n Ai) = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} P(A_{i1} ∩ ··· ∩ A_{ik}) = Σ_{k=1}^n (−1)^{k+1} \binom{n}{k} (n − k)!/n! = Σ_{k=1}^n (−1)^{k+1} 1/k!.
Here we have used the fact that there are \binom{n}{k} terms in the inner sum, all having the same value (n − k)!/n!. So we find
P(A) = 1 − Σ_{k=1}^n (−1)^{k+1} 1/k! = Σ_{k=0}^n (−1)^k 1/k!.
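The alternating series can be checked against brute-force enumeration, and it converges rapidly to e^{−1} ≈ 0.3679 (a sketch of ours, not from the notes):

```python
from itertools import permutations
from math import factorial, exp

def p_derangement(n):
    # P(no fixed point) = sum_{k=0}^{n} (-1)^k / k!
    return sum((-1) ** k / factorial(k) for k in range(n + 1))

# Brute force for small n
for n in range(1, 7):
    count = sum(1 for s in permutations(range(n)) if all(s[i] != i for i in range(n)))
    assert abs(count / factorial(n) - p_derangement(n)) < 1e-12

# Rapid convergence to 1/e
assert abs(p_derangement(10) - exp(-1)) < 1e-7
```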
6 Independence
6.1 Definition
Events A, B are said to be independent if
P(A ∩ B) = P(A)P(B).
More generally, we say that the events in a sequence (A1, . . . , An) or (An : n ∈ N) are independent if, for all k ≥ 2 and all sequences of distinct indices i1, . . . , ik,
P(A_{i1} ∩ ··· ∩ A_{ik}) = P(A_{i1}) × ··· × P(A_{ik}).
6.2 Pairwise independence does not imply independence
Model two fair coin tosses by equally likely outcomes on Ω = {0, 1}² and consider the events
A1 = {(0, 0), (0, 1)},  A2 = {(0, 0), (1, 0)},  A3 = {(0, 1), (1, 0)}.
Then P(Ai) = 1/2 for each i and P(Ai ∩ Aj) = 1/4 for all i ≠ j, so the events are pairwise independent. But A1 ∩ A2 ∩ A3 = ∅, so
P(A1 ∩ A2 ∩ A3) = 0 ≠ 1/8 = P(A1)P(A2)P(A3)
and the three events are not independent.
6.3 Independence and product spaces
Let Ω1, . . . , Ωn be finite sets and take equally likely outcomes on the product space Ω = Ω1 × ··· × Ωn. Given subsets Bi ⊆ Ωi, consider the events
Ai = {(ω1, . . . , ωn) ∈ Ω : ωi ∈ Bi}.
Then P(Ai) = |Bi|/|Ωi| and
A1 ∩ ··· ∩ An = {(ω1, . . . , ωn) ∈ Ω : ω1 ∈ B1, . . . , ωn ∈ Bn}
so
P(A1 ∩ ··· ∩ An) = |B1 × ··· × Bn|/|Ω| = P(A1) × ··· × P(An).
The same argument shows that
P(A_{i1} ∩ ··· ∩ A_{ik}) = P(A_{i1}) × ··· × P(A_{ik})
for any distinct indices i1, . . . , ik, by switching some of the sets Bi to be Ωi. Hence the events A1, . . . , An are independent.
7 Some natural discrete probability distributions
The word ‘distribution’ is used interchangeably with ‘probability measure’, especially when
in later sections we are describing the probabilities associated with some random variable.
A probability measure µ on (Ω, F) is said to be discrete if there is a countable set S ⊆ Ω and a collection of non-negative numbers (px : x ∈ S) such that, for all events A,
µ(A) = Σ_{x ∈ A∩S} px.
We consider only the case where {x} ∈ F for all x ∈ S. Then px = µ({x}). We refer to (px : x ∈ S) as the mass function for µ.
The binomial distribution B(N, p) with parameters N ∈ N and p ∈ [0, 1] is the probability measure on {0, 1, . . . , N} given by
pk = \binom{N}{k} p^k (1 − p)^{N−k}.
We use this to model the number of heads obtained on tossing a biased coin N times.
7.4 Geometric distribution
The geometric distribution of parameter p is the probability measure on Z+ = {0, 1, . . . }
given by
pk = p(1 − p)^k.
We use this to model the number of tails obtained on tossing a biased coin until the first
head appears.
The probability measure on N = {1, 2, . . . } given by
pk = p(1 − p)^{k−1}
is also sometimes called the geometric distribution of parameter p. This models the number
of tosses of a biased coin up to the first head. You should always be clear which version of
the geometric distribution is intended.
The Poisson distribution of parameter λ ∈ (0, ∞) is the probability measure on Z+ = {0, 1, . . . } given by
pk(λ) = e^{−λ} λ^k/k!.
Write pk(N, p) = \binom{N}{k} p^k (1 − p)^{N−k} for the binomial probabilities. Now
pk(N, λ/N) = (N(N − 1) ··· (N − k + 1)/N^k) (λ^k/k!) (1 − λ/N)^N (1 − λ/N)^{−k}
and
N(N − 1) ··· (N − k + 1)/N^k → 1,  (1 − λ/N)^N → e^{−λ},  (1 − λ/N)^{−k} → 1
so
pk(N, λ/N) → pk(λ).
Hence the Poisson distribution arises as the limit as N → ∞ of the binomial distribution with parameters N and p = λ/N.
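The convergence is fast enough to observe numerically; a sketch (ours, not from the notes), comparing binomial and Poisson mass functions at λ = 2:

```python
from math import comb, exp, factorial

lam = 2.0

def binom_pmf(N, p, k):
    return comb(N, k) * p ** k * (1 - p) ** (N - k)

def poisson_pmf(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

for k in range(6):
    # p_k(N, lam/N) -> p_k(lam) as N -> infinity
    assert abs(binom_pmf(100, lam / 100, k) - poisson_pmf(lam, k)) < 1e-2
    assert abs(binom_pmf(10_000, lam / 10_000, k) - poisson_pmf(lam, k)) < 1e-3
```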
8 Conditional probability
8.1 Definition
Given events A, B with P(B) > 0, the conditional probability of A given B is defined by
P(A|B) = P(A ∩ B)/P(B).
For fixed B, we can define a new function P̃ : F → [0, 1] by
P̃(A) = P(A|B).
Then P̃(Ω) = P(B)/P(B) = 1 and, for any sequence (An : n ∈ N) of disjoint sets in F,
P̃(⋃_{n=1}^∞ An) = P((⋃_{n=1}^∞ An) ∩ B)/P(B) = P(⋃_{n=1}^∞ (An ∩ B))/P(B) = Σ_{n=1}^∞ P(An ∩ B)/P(B) = Σ_{n=1}^∞ P̃(An)
so P̃ is also a probability measure.
8.2 Law of total probability
Let (Bn : n ∈ N) be a sequence of disjoint events with ⋃_n Bn = Ω. Then, for any event A,
P(A) = Σ_{n=1}^∞ P(A|Bn)P(Bn)
with the convention that a term is interpreted as 0 whenever P(Bn) = 0. This follows from countable additivity, since A = ⋃_n (A ∩ Bn).
8.3 Bayes' formula
In the setting above, for each k,
P(Bk|A) = P(A|Bk)P(Bk) / Σ_{n=1}^∞ P(A|Bn)P(Bn).
We make the same convention as above when P(Bk) = 0. The formula follows directly from the definition of conditional probability, using the law of total probability.
This formula is the basis of Bayesian statistics. We hold a prior view of the probabilities
of the events Bn , and we have a model giving us the conditional probability of the event
A given each possible Bn . Then Bayes’ formula tells us how to calculate the posterior
probabilities for the Bn , given that the event A occurs.
8.4 False positives for a rare condition
A rare medical condition A affects 0.1% of the population. A test is performed on a
randomly chosen person, which is known empirically to give a positive result for 98% of
people affected by the condition and 1% of those unaffected. Suppose the test is positive.
What is the probability that the chosen person has condition A?
We use Bayes' formula:
P(A|P) = P(P|A)P(A) / (P(P|A)P(A) + P(P|A^c)P(A^c)) = (0.98 × 0.001)/(0.98 × 0.001 + 0.01 × 0.999) = 0.089 . . . .
The implied probability model is some large finite set Ω = {1, . . . , N}, representing the whole population, with subsets A and P such that
|A| = N/1000,  |A ∩ P| = (98/100)|A|,  |A^c ∩ P| = (1/100)|A^c|.
We used Bayes’ formula to work out what proportion of the set P is contained in A. You
may prefer to do it directly.
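The direct count over a hypothetical population of one million (our sketch) gives the same answer as Bayes' formula:

```python
N = 1_000_000
affected = N // 1000                    # 0.1% have condition A
pos_affected = 98 * affected // 100     # 98% of those test positive
pos_unaffected = (N - affected) // 100  # 1% of the rest test positive

p_a_given_pos = pos_affected / (pos_affected + pos_unaffected)
assert abs(p_a_given_pos - 0.0893) < 1e-3  # matches 0.089...
```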
8.6 Simpson’s paradox
We first note a version of the law of total probability for conditional probabilities. Let A
and B be events with P(B) > 0. Let (Ωn : n ∈ N) be a sequence of disjoint events whose
union is Ω. Then
P(A|B) = Σ_{n=1}^∞ P(A|B ∩ Ωn) P(Ωn|B).
This is simply the law of total probability for the conditional probability P̃(A) = P(A|B).
For we have
P̃(A|Ωn) = P̃(A ∩ Ωn)/P̃(Ωn) = P(A ∩ Ωn ∩ B)/P(Ωn ∩ B) = P(A|B ∩ Ωn).
Here is an example of Simpson’s paradox. The interval [0, 1] can be made into a
probability space such that P((a, b]) = b − a whenever 0 ≤ a ≤ b ≤ 1. Fix ε ∈ (0, 1/4) and
consider the events
A = (ε/2, 1/2 + ε/2], B = (1/2 − ε/2, 1 − ε/2], Ω1 = (0, 1/2], Ω2 = (1/2, 1].
Then
P(A|B) = 2ε < 1/2 = P(A)
but
P(A|B ∩ Ω1) = 1 > 1 − ε = P(A|Ω1)  and  P(A|B ∩ Ω2) = ε/(1 − ε) > ε = P(A|Ω2).  (1)
We cannot conclude from the fact that ‘B attracts A on Ω1 and on Ω2 ’ that ‘B attracts A
on Ω’.
Note that
P(Ω1) = P(Ω2) = 1/2,  P(Ω1|B) = ε,  P(Ω2|B) = 1 − ε.
According to the law of total probability, these are the correct weights to combine the conditional probabilities from (1) to give
P(A) = (1 − ε) × 1/2 + ε × 1/2 = 1/2,  P(A|B) = 1 × ε + (ε/(1 − ε)) × (1 − ε) = 2ε
in agreement with our prior calculations. In this example, conditioning on B significantly
alters the weights. It is this which leads to the apparently paradoxical outcome.
More generally, Simpson’s paradox refers to any instance where a positive (or negative)
association between events, when conditioned by the elements of a partition, is reversed
when the same events are considered without conditioning.
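The interval example can be verified mechanically, since every probability here is the length of an intersection of intervals (a sketch of ours, taking ε = 0.1):

```python
eps = 0.1

A  = (eps / 2, 0.5 + eps / 2)
B  = (0.5 - eps / 2, 1 - eps / 2)
O1 = (0.0, 0.5)
O2 = (0.5, 1.0)

def p(*events):
    # Probability of the intersection = length of the overlap of the intervals
    lo = max(e[0] for e in events)
    hi = min(e[1] for e in events)
    return max(0.0, hi - lo)

# B attracts A within each half of [0, 1] ...
assert p(A, B, O1) / p(B, O1) > p(A, O1) / p(O1)
assert p(A, B, O2) / p(B, O2) > p(A, O2) / p(O2)
# ... yet B repels A overall
assert p(A, B) / p(B) < p(A)
```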
9 Random variables
9.1 Definitions
Let (Ω, F, P) be a probability space. A random variable is a function X : Ω → R such that, for all x ∈ R,
{X ≤ x} = {ω ∈ Ω : X(ω) ≤ x} ∈ F.
For any event A, the indicator function 1A is a random variable. Here
1A(ω) = 1 if ω ∈ A,  1A(ω) = 0 if ω ∈ A^c.
9.2 Doing without measure theory
The condition that {X ≤ x} ∈ F for all x is a measurability condition. It guarantees that P(X ≤ x) is well defined. While this is obvious for a countable probability space when F is the set of all subsets, in general it requires some attention. For example, in Section 10.1, we implicitly use the fact that, for a sequence of non-negative random variables (Xn : n ∈ N), the sum Σ_{n=1}^∞ Xn is also a non-negative random variable, that is to say, for all x,
{Σ_{n=1}^∞ Xn ≤ x} ∈ F.
This is not hard to show, using the fact that F is a σ-algebra, but we do not give details
and we will not focus on such questions in this course.
10 Expectation
10.1 Definition
Let (Ω, F, P) be a probability space. Recall that a random variable is a function X : Ω → R such that {X ≤ x} ∈ F for all x ∈ R. A non-negative random variable is a function X : Ω → [0, ∞] such that {X ≤ x} ∈ F for all x ≥ 0. Note that we do not allow random variables to take the values ±∞ but we do allow non-negative random variables to take the value ∞. Write F^+ for the set of non-negative random variables.
Theorem 10.1. There is a unique map
E : F + → [0, ∞]
with the following properties:
(a) E(1A ) = P(A) for all A ∈ F,
(b) E(λX) = λE(X) for all λ ∈ [0, ∞) and all X ∈ F + ,
(c) E(Σ_n Xn) = Σ_n E(Xn) for all sequences (Xn : n ∈ N) in F^+.
In (b), we apply the usual rule that 0 × ∞ = 0. The map E is called the expectation.
Proof for Ω countable. By choosing an enumeration, we reduce to the case where Ω = {1, . . . , N} or Ω = N. We give details for Ω = N. Note that we can write any X ∈ F^+ in the form X = Σ_n Xn, where Xn(ω) = X(n) 1_{{n}}(ω). So, for any map E : F^+ → [0, ∞] with the given properties,
E(X) = Σ_n E(Xn) = Σ_n X(n) P({n}) = Σ_ω X(ω) P({ω}).  (2)
Hence there is at most one such map. On the other hand, if we use (2) to define E, then
E(1A) = Σ_ω 1A(ω) P({ω}) = P(A)
and
E(λX) = Σ_ω λX(ω) P({ω}) = λE(X)
and
E(Σ_n Xn) = Σ_ω Σ_n Xn(ω) P({ω}) = Σ_n Σ_ω Xn(ω) P({ω}) = Σ_n E(Xn)
where the interchange of sums is justified because all terms are non-negative. Hence E defined by (2) has the required properties.
10.2 Properties of expectation
For non-negative random variables X, Y, by taking X1 = X, X2 = Y and Xn = 0 for n > 2 in the countable additivity property (c), we obtain
E(X + Y) = E(X) + E(Y).
This shows in particular that E(X) ≤ E(X + Y) and hence
E(X) ≤ E(Y) whenever X ≤ Y.
Also, for any non-negative random variable X and all n ∈ N (see Section 11.1 below)
P(X ≥ 1/n) ≤ nE(X)
so
P(X = 0) = 1 whenever E(X) = 0.
A random variable X is called integrable if E(X^+) < ∞ and E(X^−) < ∞, where X^+ = max(X, 0) and X^− = max(−X, 0); then we set E(X) = E(X^+) − E(X^−). For integrable random variables X, Y, we have
(X + Y)^+ + X^− + Y^− = (X + Y)^− + X^+ + Y^+
so
E((X + Y)^+) + E(X^−) + E(Y^−) = E((X + Y)^−) + E(X^+) + E(Y^+)
and so
E(X + Y) = E(X) + E(Y).
Let X be a discrete random variable, taking values (xn : n ∈ N) say. Then, if X is non-negative or X is integrable, we can compute the expectation using the formula
E(X) = Σ_n xn P(X = xn).
To see this, for X non-negative, we can write X = Σ_n Xn, where Xn(ω) = xn 1_{X=xn}(ω). Then the formula follows by countable additivity. For X integrable, we subtract the formulas for X^+ and X^−. Similarly, for any discrete random variable X and any non-negative function f,
E(f(X)) = Σ_n f(xn) P(X = xn).  (3)
For independent discrete random variables X, Y, and for non-negative functions f and g, we have
E(f(X) g(Y)) = Σ_{x,y} f(x) g(y) P(X = x, Y = y) = Σ_{x,y} f(x) g(y) P(X = x) P(Y = y) = (Σ_x f(x) P(X = x)) (Σ_y g(y) P(Y = y)) = E(f(X)) E(g(Y)).
This formula remains valid without the assumption that f and g are non-negative,
provided only that f (X) and g(Y ) are integrable. Indeed, it remains valid without the
assumption that X and Y are discrete, but we will not prove this.
10.4 Zero covariance does not imply independence
It is a common mistake to confuse the condition E(XY ) = E(X)E(Y ) with independence.
Here is an example to illustrate the difference. Given independent Bernoulli random vari-
ables X1 , X2 , X3 , all with parameter 1/2, consider the random variables
Y1 = 2X1 − 1,  Y2 = 2X2 − 1,  Z1 = Y1X3,  Z2 = Y2X3.
Then E(Z1) = E(Y1)E(X3) = 0 and E(Z2) = E(Y2)E(X3) = 0, while Z1Z2 = Y1Y2X3² = Y1Y2X3, so
E(Z1Z2) = E(Y1Y2X3) = E(Y1)E(Y2)E(X3) = 0 = E(Z1)E(Z2).
On the other hand {Z1 = 0} = {Z2 = 0} = {X3 = 0}, so
P(Z1 = 0, Z2 = 0) = 1/2 ≠ 1/4 = P(Z1 = 0)P(Z2 = 0)
and Z1, Z2 are not independent.
A random variable X with values in Z+ = {0, 1, . . . } satisfies X = Σ_{n=1}^∞ 1_{X≥n}, so
E(X) = Σ_{n=1}^∞ P(X ≥ n).  (4)
More generally, the following calculation can be justified using Fubini's theorem for any non-negative random variable X:
E(X) = E(∫_0^∞ 1_{x≤X} dx) = ∫_0^∞ E(1_{X≥x}) dx = ∫_0^∞ P(X ≥ x) dx.
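For a geometric variable, formula (4) can be confirmed by truncating the sums (our sketch, not from the notes; the truncation error at N = 1000 is negligible):

```python
p = 0.3
q = 1 - p
N = 1000  # truncation point; q**N is astronomically small

mean_direct = sum(n * p * q ** n for n in range(N))  # sum of n * P(X = n)
tail_sum = sum(q ** n for n in range(1, N))          # sum of P(X >= n) = q**n
assert abs(mean_direct - tail_sum) < 1e-9
assert abs(mean_direct - q / p) < 1e-9               # E(X) = (1 - p)/p
```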
The expectation and variance of a binomial random variable SN ∼ B(N, p) are given
by
E(SN ) = N p, var(SN ) = N p(1 − p).
This can be seen by writing SN as the sum of N independent Bernoulli random variables.
If G is a geometric random variable of parameter p ∈ (0, 1], then
P(G ≥ n) = Σ_{k=n}^∞ (1 − p)^k p = (1 − p)^n
so, by (4),
E(G) = Σ_{n=1}^∞ (1 − p)^n = (1 − p)/p.
If X is a Poisson random variable of parameter λ, then E(X) = λ and
E(X(X − 1)) = Σ_{n=0}^∞ n(n − 1) P(X = n) = Σ_{n=0}^∞ n(n − 1) e^{−λ} λ^n/n! = λ² e^{−λ} Σ_{n=2}^∞ λ^{n−2}/(n − 2)! = λ²
so
var(X) = E(X²) − E(X)² = E(X(X − 1)) + E(X) − E(X)² = λ² + λ − λ² = λ.
10.7 Inclusion-exclusion via expectation
Note the identity
∏_{i=1}^n (1 − xi) = Σ_{k=0}^n (−1)^k Σ_{1≤i1<···<ik≤n} x_{i1} × ··· × x_{ik}.
Given events A1, . . . , An and ω ∈ Ω, take xi = 1_{Ai}(ω). Then
∏_{i=1}^n (1 − xi) = 1 − 1_{A1∪···∪An}(ω)
and
x_{i1} × ··· × x_{ik} = 1_{A_{i1}∩···∩A_{ik}}(ω).
Hence
1_{A1∪···∪An} = Σ_{k=1}^n (−1)^{k+1} Σ_{1≤i1<···<ik≤n} 1_{A_{i1}∩···∩A_{ik}}.
On taking expectations, we recover the inclusion-exclusion formula.
11 Inequalities
11.1 Markov’s inequality
Let X be a non-negative random variable and let λ ∈ (0, ∞). Then
P(X ≥ λ) ≤ E(X)/λ.
To see this, we note the following inequality of random variables:
λ 1_{X≥λ} ≤ X
and take expectations to obtain λP(X ≥ λ) ≤ E(X).
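A quick check of the bound on a toy distribution (ours, not part of the notes):

```python
xs = [0, 1, 2, 10]
ps = [0.4, 0.3, 0.2, 0.1]
ex = sum(x * q for x, q in zip(xs, ps))  # E(X) = 1.7

for c in [0.5, 1, 2, 5, 10]:
    tail = sum(q for x, q in zip(xs, ps) if x >= c)  # P(X >= c)
    assert tail <= ex / c + 1e-12                    # Markov: P(X >= c) <= E(X)/c
```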
It is instructive to examine how the equality
E(XY) = √(E(X²)) √(E(Y²))
can occur for square-integrable random variables X, Y. Let us exclude the trivial case where P(Y = 0) = 1. Then E(Y²) > 0 and the above calculation shows that equality can occur only when
E((X − tY)²) = 0
where t = E(XY)/E(Y²), that is to say when P(X = λY) = 1 for some λ ∈ R.
It is again interesting to examine the case of equality
E(f(X)) = f(E(X)),
especially in the case where, for m = E(X), there exist a, b ∈ R such that f(m) = am + b and f(x) > ax + b for all x ∈ I \ {m}. Then equality forces the non-negative random variable f(X) − (aX + b) to have expectation 0, so P(f(X) = aX + b) = 1 and so P(X = m) = 1.
Jensen's inequality extends to finite means: for a convex function f on I and x1, . . . , xn ∈ I,
f((x1 + ··· + xn)/n) ≤ (f(x1) + ··· + f(xn))/n.
To see this, consider a random variable which takes the values x1, . . . , xn all with equal probability and apply Jensen's inequality.
In the special case when I = (0, ∞) and f = − log, we obtain, for x1, . . . , xn ∈ (0, ∞),
(∏_{k=1}^n xk)^{1/n} ≤ (1/n) Σ_{k=1}^n xk.
Thus the geometric mean is always less than or equal to the arithmetic mean.
12 Random walks
12.1 Definitions
A random process (Xn : n ∈ N) is a sequence of random variables. An integer-valued random process (Xn : n ∈ N) is called a random walk if it has the form
Xn = x + Y1 + ··· + Yn
for some sequence of independent identically distributed random variables (Yn : n ≥ 1). We consider only the case of simple random walk, when the steps are all of size 1.
12.2 Gambler's ruin
Consider a simple random walk which at each step goes up by 1 with probability p and down by 1 with probability q = 1 − p, and which stops when it first hits 0 or a. For x = 0, 1, . . . , a, write hx for the probability that the walk, starting from x, hits a before 0. Conditioning on the first step gives, for x = 1, . . . , a − 1,
hx = p hx+1 + q hx−1
with boundary conditions h0 = 0 and ha = 1. In the case p = 1/2, the unique solution is linear:
hx = x/a.
Suppose then that p ≠ 1/2. We look for solutions of the recurrence relation of the form
hx = λ^x. Then
pλ² − λ + q = 0
so λ = 1 or λ = q/p. Then A + B(q/p)^x gives a general family of solutions and we can choose A and B to satisfy the boundary conditions. This requires
A + B = 0,  A + B(q/p)^a = 1
so
B = −A = 1/((q/p)^a − 1)
and
hx = ((q/p)^x − 1)/((q/p)^a − 1).
12.3 Mean time to absorption
Denote by T the number of steps taken by the random walk until it first hits 0 or a. Set
τx = Ex (T )
so τx is the mean time to absorption starting from x. We condition again on the first step,
using the law of total expectation to obtain, for x = 1, . . . , a − 1,
τx = 1 + pτx+1 + qτx−1
with boundary conditions τ0 = τa = 0. In the case p = 1/2, solving the recurrence with these boundary conditions gives
τx = x(a − x).
For p ≠ 1/2, we look for a particular solution of the form τx = Cx. Then
Cx = 1 + pC(x + 1) + qC(x − 1)
which holds when C = 1/(q − p). Adding a general solution A + B(q/p)^x of the homogeneous recurrence and choosing A and B to satisfy the boundary conditions, we obtain
τx = x/(q − p) − (a/(q − p)) ((q/p)^x − 1)/((q/p)^a − 1).
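Both closed forms can be verified by substituting them back into their recurrences (a sketch of ours, with p = 0.6 and a = 10 chosen arbitrarily):

```python
p, a = 0.6, 10
q = 1 - p
r = q / p

def h(x):
    # Probability of hitting a before 0, started from x
    return (r ** x - 1) / (r ** a - 1)

def tau(x):
    # Mean time to absorption, started from x
    return x / (q - p) - (a / (q - p)) * h(x)

assert h(0) == 0.0 and abs(h(a) - 1.0) < 1e-12
assert tau(0) == 0.0 and abs(tau(a)) < 1e-9
for x in range(1, a):
    assert abs(h(x) - (p * h(x + 1) + q * h(x - 1))) < 1e-12
    assert abs(tau(x) - (1 + p * tau(x + 1) + q * tau(x - 1))) < 1e-9
```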
13 Generating functions
13.1 Definition
Let X be a random variable with values in Z+ = {0, 1, 2, . . . }. The generating function of
X is the power series given by
GX(t) = E(t^X) = Σ_{n=0}^∞ P(X = n) t^n.
Then GX(1) = 1, so the power series has radius of convergence at least 1. By standard results on power series, GX defines a function on (−1, 1) with derivatives of all orders, and we can recover the probabilities for X by
P(X = n) = (1/n!) (d^n/dt^n) GX(t) |_{t=0}.
13.2 Examples
For a Bernoulli random variable X of parameter p, we have
GX (t) = (1 − p) + pt.
For a geometric random variable X of parameter p,
GX(t) = Σ_{n=0}^∞ (1 − p)^n p t^n = p/(1 − (1 − p)t).
For a Poisson random variable X of parameter λ,
GX(t) = Σ_{n=0}^∞ e^{−λ} (λ^n/n!) t^n = e^{λ(t−1)},
and so on. Note that for a Poisson random variable X of parameter λ we have
G′X(1) = λ,  G″X(1) = λ²
in agreement with the values for E(X) and E(X(X − 1)) computed in Section 10.5.
When the radius of convergence equals 1, we can differentiate term by term at all t < 1 to obtain
G′X(t) = Σ_{n=1}^∞ n P(X = n) t^{n−1}.
Let X and Y be independent random variables with values in Z+, with mass functions (pk) and (qk). Then
{X + Y = n} = ⋃_{k=0}^n {X = k, Y = n − k}
and, by independence,
P(X = k, Y = n − k) = pk q_{n−k}.
So the probabilities for X + Y are given by the convolution
P(X + Y = n) = Σ_{k=0}^n pk q_{n−k}.
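The convolution is easy to compute for finitely supported mass functions, and one can also check numerically the product rule for generating functions, G_{X+Y}(t) = GX(t)GY(t), which is a consequence of independence (our sketch; the two mass functions are made up):

```python
px = [0.5, 0.3, 0.2]  # mass function of X on {0, 1, 2} (made up)
qy = [0.6, 0.4]       # mass function of Y on {0, 1}   (made up)

# P(X + Y = n) = sum_k p_k * q_{n-k}
conv = [0.0] * (len(px) + len(qy) - 1)
for j, pj in enumerate(px):
    for k, qk in enumerate(qy):
        conv[j + k] += pj * qk

def G(pmf, t):
    return sum(p * t ** n for n, p in enumerate(pmf))

for t in [0.0, 0.3, 0.7, 1.0]:
    # Generating function of the sum = product of generating functions
    assert abs(G(conv, t) - G(px, t) * G(qy, t)) < 1e-12
assert abs(sum(conv) - 1.0) < 1e-12
```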
In particular, the generating function of the sum is the product:
G_{X+Y}(t) = Σ_{n=0}^∞ P(X + Y = n) t^n = GX(t) GY(t).
This can also be seen directly:
G_{X+Y}(t) = E(t^{X+Y}) = E(t^X t^Y) = E(t^X) E(t^Y) = GX(t) GY(t)
by independence. The same applies to a sum of n independent random variables
Sn = X1 + ··· + Xn
whose generating function is then the product G_{X1}(t) ··· G_{Xn}(t).
Set
Cn = |Pn|.
Note that, for all n ≥ 1 and x ∈ Pn, we have x1 = 1 and y = (x1 − 1, . . . , x_{2k−1} − 1) ∈ P_{k−1} and z = (x_{2k}, . . . , x_{2n}) ∈ P_{n−k}, where k = min{i ≥ 1 : x_{2i} = 0}. This gives a bijection from Pn to ⋃_{k=1}^n P_{k−1} × P_{n−k}. So we get a convolution-type identity
Cn = Σ_{k=1}^n C_{k−1} C_{n−k}.
Set c(t) = Σ_{n=0}^∞ Cn t^n. Note that Cn ≤ \binom{2n}{n} ≤ 2^{2n}, so the radius of convergence of this power series is at least 1/4.
Then
c(t) = 1 + Σ_{n=1}^∞ Σ_{k=1}^n C_{k−1} C_{n−k} t^n = 1 + t Σ_{k=1}^∞ C_{k−1} t^{k−1} Σ_{n=k}^∞ C_{n−k} t^{n−k} = 1 + t c(t)².
The numbers Cn are the Catalan numbers. They appear in many other counting problems.
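The recurrence generates the sequence 1, 1, 2, 5, 14, 42, . . . , and it agrees with the well-known closed form Cn = \binom{2n}{n}/(n + 1), which is not derived in these notes (a sketch of ours):

```python
from math import comb

C = [1]
for n in range(1, 12):
    # C_n = sum_{k=1}^{n} C_{k-1} * C_{n-k}
    C.append(sum(C[k - 1] * C[n - k] for k in range(1, n + 1)))

for n, c in enumerate(C):
    assert c == comb(2 * n, n) // (n + 1)  # closed form (exact integer arithmetic)
```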
14 Branching processes
14.1 Definition
A branching process or Galton–Watson process is a random process (Xn : n ≥ 0) with the following structure:
X0 = 1,  X_{n+1} = Σ_{k=1}^{Xn} Y_{k,n}  for all n ≥ 0,
where (Y_{k,n} : k ≥ 1, n ≥ 0) is an array of independent identically distributed random variables with values in Z+. We think of Xn as the size of the nth generation and of Y_{k,n} as the number of offspring of the kth individual in generation n.
Write µ for the mean of the offspring distribution. By the formula for random sums, E(X_{n+1}) = µ E(Xn), so, by induction,
E(Xn) = µ^n.
14.4 Conditioning on the first generation
Fix m ≥ 1. On the event {X1 = m}, for n ≥ 0, we have
X_{n+1} = Σ_{j=1}^m X_n^{(j)}
where X_0^{(j)} = 1 and, for n ≥ 1,
X_n^{(j)} = Σ_{k = S_{n−1}^{(j−1)}+1}^{S_{n−1}^{(j)}} Y_{k,n}
with
S_{n−1}^{(j)} = X_{n−1}^{(1)} + ··· + X_{n−1}^{(j)}.
Note that, conditional on {X1 = m}, the processes (X_n^{(1)} : n ≥ 0), . . . , (X_n^{(m)} : n ≥ 0) are independent Galton–Watson processes with the same offspring distribution as (Xn : n ≥ 0).
Write F(t) = E(t^{X1}) for the generating function of the offspring distribution and set qn = P(Xn = 0). Then (qn) is non-decreasing, with limit q, the extinction probability, and, by conditioning on the first generation, qn+1 = F(qn).
Theorem 14.1. The extinction probability q is the smallest non-negative solution of the equation q = F(q). Moreover, provided P(X1 = 1) < 1, we have q < 1 if and only if µ > 1.
Proof. Note that F is continuous and non-decreasing on [0, 1] and F(1) = 1. On letting n → ∞ in the equation qn+1 = F(qn), we see that q = F(q). On the other hand, let us denote the smallest non-negative solution of this equation by t. Then q0 = 0 ≤ t. Suppose inductively for n ≥ 0 that qn ≤ t; then qn+1 = F(qn) ≤ F(t) = t and the induction proceeds. Hence q = t.
We have µ = lim_{t↑1} F′(t). If µ > 1, then we must have F(t) < t for all t ∈ (0, 1) sufficiently close to 1, by the mean value theorem. Since F(0) = P(X1 = 0) ≥ 0, there is then a solution to t = F(t) in [0, 1) by the intermediate value theorem. Hence q < 1.
It remains to consider the case µ ≤ 1. If P(X1 ≤ 1) = 1 then F(t) = 1 − µ + µt and we exclude the case µ = 1 by the condition P(X1 = 1) < 1, so q = 1. On the other hand, if P(X1 ≥ 2) > 0 then, by term-by-term differentiation, F′(t) < µ ≤ 1 for all t ∈ (0, 1) so, by the mean value theorem, there is no solution to t = F(t) in [0, 1), so q = 1.
We emphasize the fact that, if there is any variability in the number of offspring, and
the average number of offspring is 1, then the population is sure to die out.
14.6 Example
Consider the Galton–Watson process (Xn : n ≥ 0) with offspring distribution given by
P(Y = 0) = 1/3,  P(Y = 2) = 2/3.
Then F(t) = 1/3 + 2t²/3, and the equation t = F(t) reduces to 2t² − 3t + 1 = 0, with roots 1/2 and 1. Hence q = 1/2.
Recall the construction of (Xn : n ≥ 0) from an array of independent random variables (Y_{k,n} : k ≥ 1, n ≥ 0). Set T0 = 0 and Tn = X0 + ··· + X_{n−1} for n ≥ 1 and T = Σ_{n=0}^∞ Xn. Set S0 = 1 and define recursively, for n ≥ 0 and k = 1, . . . , Xn,
S_{Tn+k} = S_{Tn+k−1} + Y_{k,n} − 1.
Then S_{T0} = X0 and, if S_{Tn} = Xn, then
S_{T_{n+1}} = S_{Tn} + Σ_{k=1}^{Xn} (Y_{k,n} − 1) = X_{n+1}
and the induction proceeds. Moreover T = min{m ≥ 0 : Sm = 0}. Now (Sm : m ≤ T) is a random walk starting from 1, which jumps up by 1 with probability 2/3 and down by 1 with probability 1/3, until it hits 0. The extinction probability for the branching process is then the probability that the walk ever hits 0.
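The extinction probability can also be recovered numerically by iterating q_{n+1} = F(q_n) from q_0 = 0, as in the proof of Theorem 14.1. Here we take F(t) = 1/3 + 2t²/3, the offspring generating function consistent with the walk just described (offspring 0 with probability 1/3, offspring 2 with probability 2/3); this is our sketch, not part of the notes.

```python
def F(t):
    # Offspring generating function: P(Y = 0) = 1/3, P(Y = 2) = 2/3
    return 1 / 3 + (2 / 3) * t * t

q = 0.0
for _ in range(200):
    q = F(q)  # q_n = P(X_n = 0) increases to the extinction probability

assert abs(q - 0.5) < 1e-9  # smallest root of 2t^2 - 3t + 1 = 0
```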
15 Some natural continuous probability distributions
A non-negative function f on R is a probability density function if
∫_R f(x) dx = 1.
Given such a function f , we can define a unique probability measure µ on R such that, for
all x ∈ R, Z x
µ((−∞, x]) = f (y)dy.
−∞
In fact this measure µ is given on the σ-algebra B of Borel sets B by
Z
µ(B) = f (x)dx.
B
The substitution y = λx shows that
∫_0^∞ λ^α x^{α−1} e^{−λx} dx = ∫_0^∞ y^{α−1} e^{−y} dy = Γ(α)
so f(x) = λ^α x^{α−1} e^{−λx}/Γ(α), for x > 0, is a probability density function, that of the gamma distribution of parameters α and λ.
16 Continuous random variables
16.1 Definitions
Recall that each real-valued random variable X has a distribution function FX, given by FX(x) = P(X ≤ x). We say that X is a continuous random variable if there is a non-negative function fX on R such that, for all x ∈ R,
FX(x) = ∫_{−∞}^x fX(y) dy.  (5)
We will say in this case that X has density function fX. In fact we then have, for all Borel sets B,
P(X ∈ B) = ∫_B fX(x) dx.
By continuity of probability, the left limit of FX at x is given by
FX(x−) = P(X < x).
If FX is continuous with a piecewise continuous derivative f, then, by the fundamental theorem of calculus, FX(x) = ∫_{−∞}^x f(y) dy, so X has density function f. Conversely, suppose that X has a density function fX. Then,
for all x ∈ R and all h > 0,
Z x+h
FX (x + h) − FX (x) 1
− fX (x) = (fX (y) − fX (x))dy 6 sup |fX (y) − fX (x)|
h h x x6y6x+h
16.2 Transformation of one-dimensional random variables
Let X be a random variable with values in some open interval I and having a piecewise
continuous density function fX on I. Let φ be a function on I having a continuous derivative
and such that φ′(x) ≠ 0 for all x ∈ I. Set y = φ(x) and consider the new random variable
Y = φ(X) in the interval φ(I). Then Y has a density function fY on φ(I), given by

fY(y) = fX(x) |dx/dy|.
To see this, consider first the case where φ is increasing. Then Y ≤ y if and only if
X ≤ ψ(y), where ψ = φ−1 : φ(I) → I. So

FY(y) = FX(ψ(y)).

By the chain rule, FY then has a piecewise continuous derivative fY, given by

fY(y) = fX(ψ(y)) ψ′(y) = fX(x) dx/dy,

which is then a density function for Y on φ(I). Since φ′ does not vanish, the only remaining
case is that φ is decreasing. Then Y ≤ y if and only if X ≥ ψ(y), so FY(y) = 1 − FX(ψ(y))
and a similar argument applies.
Here is an example. Suppose that X ∼ U [0, 1], by which we mean that µX is the uniform
distribution on [0, 1]. We may assume that X takes values in (0, 1], since P(X = 0) = 0.
Take φ = − log. Then Y = φ(X) takes values in [0, ∞) and
P(Y > y) = P(−log X > y) = P(X < e^{−y}) = e^{−y}
so Y ∼ E(1), that is, µY is the exponential distribution of parameter 1.
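This computation gives a standard way to simulate exponential random variables from uniform ones. A quick empirical check (the variable names are mine; we use 1 − U, which is also U[0,1], to avoid log 0):

```python
import math, random

random.seed(0)
n = 100_000
# If U ~ U[0,1] then -log U ~ E(1); 1 - random.random() lies in (0, 1]
samples = [-math.log(1.0 - random.random()) for _ in range(n)]

mean = sum(samples) / n                     # E(Y) = 1 for Y ~ E(1)
tail = sum(s > 1.0 for s in samples) / n    # P(Y > 1) = e^{-1} ≈ 0.368
print(mean, tail)
```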
Here is a second example. Let Z ∼ N (0, 1) and fix µ ∈ R and σ ∈ (0, ∞). Set
X = µ + σZ.
Then X ∼ N(µ, σ²). To see this, we note that, for a, b ∈ R with a ≤ b, we have X ∈ [a, b]
if and only if Z ∈ [a′, b′], where a′ = (a − µ)/σ, b′ = (b − µ)/σ. So

P(X ∈ [a, b]) = P(Z ∈ [a′, b′]) = ∫_{a′}^{b′} (1/√(2π)) e^{−z²/2} dz = ∫_a^b (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx
and the same formula holds when g is real-valued provided g(X) is integrable. We omit
the proof of this formula but note that for g = 1(−∞,x] it is a restatement of (5).
For X ∼ U[a, b], we have

E(X) = (1/(b − a)) ∫_a^b x dx = (a + b)/2.

For X ∼ N(µ, σ²), we have

E(X) = µ,  var(X) = σ².
17 Properties of the exponential distribution
17.1 Exponential distribution as a limit of geometrics
Let T be an exponential random variable of parameter λ. Set
Tn = ⌊nT⌋.

Then Tn is a geometric random variable: P(Tn ≥ k) = P(T ≥ k/n) = (e^{−λ/n})^k for k ≥ 0,
so Tn has parameter pn = 1 − e^{−λ/n}. Moreover pn ∼ λ/n and Tn/n → T.
and

P(T > m/n)^n = P(T > m) = P(T > 1)^m, so P(T > m/n) = P(T > 1)^{m/n}.
We exclude the trivial cases where T is identically 0 or ∞. Then P(T > 1) ∈ (0, 1), so
λ = −log P(T > 1) ∈ (0, ∞). Then P(T > t) = e^{−λt} for all rationals t ≥ 0, and this
extends to all t ≥ 0 because P(T > t) and e^{−λt} are both non-increasing in t.
18 Multivariate density functions
18.1 Definitions
A random variable X in Rn is said to have density function fX if the joint distribution
function FX is given by
FX(x1, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX(y1, . . . , yn) dy1 . . . dyn.
For a non-negative Borel function g on Rn, we have

E(g(X)) = ∫_{Rn} g(x) fX(x) dx.

Moreover, this formula remains valid for real-valued Borel functions g, provided E(|g(X)|) <
∞. We omit proof of these facts.
18.2 Independence
We say that random variables X1, . . . , Xn are independent if, for all x1, . . . , xn ∈ R,

P(X1 ≤ x1, . . . , Xn ≤ xn) = P(X1 ≤ x1) . . . P(Xn ≤ xn).

(a) Suppose that X1, . . . , Xn are independent, with density functions f1, . . . , fn. Then
X = (X1, . . . , Xn) has density function

fX(x) = f1(x1) . . . fn(xn).

(b) On the other hand, suppose that X has density function fX which factorizes as in (a)
for some non-negative functions f1, . . . , fn. Then X1, . . . , Xn are independent and
have density functions which are proportional to f1, . . . , fn respectively.
Proof. Under the hypothesis of (a), we have, for B = (−∞, x1] × · · · × (−∞, xn],

P(X ∈ B) = P(∩_{i=1}^n {Xi ≤ xi}) = ∏_{i=1}^n P(Xi ≤ xi) = ∏_{i=1}^n ∫_{−∞}^{xi} fi(yi) dyi = ∫_B ∏_{i=1}^n fi(yi) dy.
On the other hand, under the hypothesis of (b), since

∏_{i=1}^n ∫_R fi(xi) dxi = ∫_{Rn} fX(x) dx = 1

we may assume, by moving suitable constant factors between the functions fi, that they
all individually integrate to 1. Consider a set B of the form B1 × · · · × Bn. Then

P(∩_{i=1}^n {Xi ∈ Bi}) = P(X ∈ B) = ∫_B fX(x) dx = ∏_{i=1}^n ∫_{Bi} fi(xi) dxi.   (6)

Taking Bj = R for all j ≠ i, we see that

P(Xi ∈ Bi) = ∫_{Bi} fi(xi) dxi,

showing that Xi has density fi and, returning to the general formula (6), that

P(∩_{i=1}^n {Xi ∈ Bi}) = ∏_{i=1}^n P(Xi ∈ Bi)

so X1, . . . , Xn are independent.
In general, each component Xi has density function obtained by integrating out the other
variables:

fXi(xi) = ∫_{R^{n−1}} fX(x1, . . . , xn) dx1 . . . dxi−1 dxi+1 . . . dxn.

These are called the marginal density functions. When the component random variables
are independent, we can recover fX as the product of the marginals, but this fails otherwise.
where we made the substitution w = x + y in the inner integral for the third equality and
interchanged the order of integration for the fourth.
Consider the case where X, Y are independent U[0, 1] random variables. Then X + Y
has density given by

fX ∗ fY(x) = ∫_R 1[0,1](x − y) 1[0,1](y) dy = ∫_0^1 1[x−1,x](y) dy =
    x,      if x ∈ [0, 1],
    2 − x,  if x ∈ [1, 2].
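The triangular shape of this density is easy to confirm by simulation. A small sketch (the checkpoint t = 1/2 is arbitrary), using the fact that the corresponding distribution function is t²/2 on [0, 1]:

```python
import random

random.seed(2)
n = 200_000
s = [random.random() + random.random() for _ in range(n)]

# Distribution function of the triangular density: F(t) = t^2/2 on [0, 1]
p_half = sum(v <= 0.5 for v in s) / n
print(p_half)   # ≈ 0.5**2 / 2 = 0.125
```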
Now let X be a random variable in Rn taking values in an open set D, with density
function fX on D, and let φ be an injective map on D with continuous partial derivatives
such that

det φ′(x) ≠ 0

for all x ∈ D. Set y = φ(x) and consider the new random variable Y = φ(X). Then Y
has a density function fY on φ(D), given by

fY(y) = fX(x) |J|.
We omit proof of this result, but see Section 16.2 for the one-dimensional case. Note
that the Jacobian factor is computed from the inverse transformation, where x is given as
a function of y.
Here is an example. Let (X, Y ) be a standard normal random variable in R2 , which
we will consider as defined in D = R2 \ {(x, 0) : x > 0}. Set R = |(X, Y )| ∈ (0, ∞) and
let Θ ∈ (0, 2π) be the angle from the positive x-axis to the vector (X, Y ). The inverse
transformation is given by
x = r cos θ,  y = r sin θ

so

J = det [ ∂x/∂r  ∂x/∂θ ; ∂y/∂r  ∂y/∂θ ] = det [ cos θ  −r sin θ ; sin θ  r cos θ ] = r

and (R, Θ) has density function on (0, ∞) × (0, 2π) given by

fR,Θ(r, θ) = fX,Y(x, y) |J| = (1/(2π)) r e^{−r²/2}.

Hence R has density r e^{−r²/2} on (0, ∞), Θ is uniform on (0, 2π), and R and Θ are
independent.
19 Simulation of random variables
We sometimes wish to generate a simulated sample X1 , . . . , Xn from a given distribution.
We will discuss two ways to do this, relevant in different contexts, based on the reasonable
assumption that we can simulate well a sequence (Un : n ∈ N) of independent U [0, 1]
random variables.
Given a Borel set A ⊆ [0, 1]^d with |A| > 0, set X = UN, where

N = min{n ≥ 1 : Un ∈ A}.

Then, by the law of total probability, for any Borel set B ⊆ [0, 1]^d,

P(X ∈ B) = Σ_{n=1}^∞ P(X ∈ B | N = n) P(N = n).

Now

P(X ∈ B | N = n) = P(Un ∈ B ∩ A and U1, . . . , Un−1 ∉ A) / P(Un ∈ A and U1, . . . , Un−1 ∉ A) = |B ∩ A| / |A|

so

P(X ∈ B) = (|B ∩ A|/|A|) Σ_{n=1}^∞ P(N = n) = |B ∩ A|/|A| = ∫_B f(x) dx
Construct (X1, . . . , Xd−1, Xd) from A as above. Then, for any Borel set B ⊆ [0, 1]^{d−1},

|(B × [0, 1]) ∩ A| = ∫_B (f(x1, . . . , xd−1)/λ) dx1 . . . dxd−1

so

P((X1, . . . , Xd−1) ∈ B) = |(B × [0, 1]) ∩ A| / |A| = ∫_B f(x1, . . . , xd−1) dx1 . . . dxd−1.
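This rejection idea translates directly into code. A minimal sketch for a density on [0, 1] bounded by λ (the density f(x) = 2x and all names below are illustrative, not from the text):

```python
import random

def sample_from_density(f, lam):
    """Rejection sampling on [0, 1]: keep the first uniform point of the unit
    square lying in A = {(x, t) : t <= f(x)/lam}."""
    while True:
        x, t = random.random(), random.random()
        if t <= f(x) / lam:
            return x

random.seed(3)
f = lambda x: 2 * x              # example density on [0, 1], bounded by lam = 2
xs = [sample_from_density(f, 2.0) for _ in range(100_000)]
mean_x = sum(xs) / len(xs)
print(mean_x)                    # E(X) = ∫ x · 2x dx = 2/3
```

On average 1/|A| = λ proposals are needed per accepted sample, so a small bound λ makes the method more efficient.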
20 Moment generating functions
20.1 Definition
Let X be a random variable. The moment generating function of X is the function MX
on R given by
MX (λ) = E(eλX ).
Note that MX(0) = 1 but it is not guaranteed that MX(λ) < ∞ for any λ ≠ 0.
For independent random variables X, Y, we have

MX+Y(λ) = E(e^{λX} e^{λY}) = E(e^{λX}) E(e^{λY}) = MX(λ) MY(λ).
20.2 Examples
For X ∼ E(β) and λ < β, we have

MX(λ) = ∫_0^∞ e^{λx} β e^{−βx} dx = β/(β − λ)

and MX(λ) = ∞ for λ ≥ β. More generally, for X ∼ Γ(α, β) and λ < β,

MX(λ) = (1/Γ(α)) ∫_0^∞ e^{λx} β^α x^{α−1} e^{−βx} dx = (β/(β − λ))^α.
We say that a random variable X has the Cauchy distribution if it has density function

f(x) = 1/(π(1 + x²)).

Then, for all λ ≠ 0,

MX(λ) = ∫_R e^{λx} (1/(π(1 + x²))) dx = ∞.
20.3 Uniqueness and the continuity theorem
Moment generating functions provide a convenient way to characterize probability distri-
butions and to show their convergence, because of the following two results, whose proofs
we omit.
Theorem 20.2 (Continuity theorem for moment generating functions). Let X be a random
variable and let (Xn : n ∈ N) be a sequence of random variables. Suppose that MXn(λ) →
MX(λ) for all λ ∈ R and MX(λ) < ∞ for some λ ≠ 0. Then Xn converges to X in
distribution.
For example, if X1, . . . , Xn are independent E(β) random variables, then, for λ < β,

MX1+···+Xn(λ) = MX1(λ) · · · MXn(λ) = (β/(β − λ))^n,

which we have seen is the moment generating function for the Γ(n, β) distribution.
The condition that M(λ) < ∞ for some λ ≠ 0 is necessary for uniqueness. For, if
X is a Cauchy random variable, then MX = M2X but X and 2X do not have the same
distribution.
A version of these theorems holds also for random variables in Rn , where the moment
generating function of such a random variable X is the function on Rn given by
MX(λ) = E(e^{λᵀX}).

In this case the condition that MX(λ) < ∞ for some λ ≠ 0 is replaced by the requirement
that MX be finite on some open set.
21 Limits of sums of independent random variables
21.1 Weak law of large numbers
Theorem 21.1. Let (Xn : n ∈ N) be a sequence of independent identically distributed
integrable random variables. Set Sn = X1 + · · · + Xn and µ = E(X1 ). Then, for all ε > 0,
as n → ∞,

P(|Sn/n − µ| > ε) → 0.
Proof for finite second moment. Assume further that
var(X1 ) = σ 2 < ∞.
Note that
E(Sn /n) = µ, var(Sn /n) = σ 2 /n.
Then, by Chebyshev's inequality, for all ε > 0, as n → ∞,

P(|Sn/n − µ| > ε) ≤ ε^{−2} σ²/n → 0.
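The Chebyshev bound in this proof can be compared with simulation. A small sketch with U[0, 1] summands, so µ = 1/2 and σ² = 1/12 (the parameter choices are mine):

```python
import random

random.seed(4)
n, eps, trials = 1000, 0.05, 2000
mu, sigma2 = 0.5, 1 / 12        # mean and variance of a U[0,1] summand

deviations = 0
for _ in range(trials):
    m = sum(random.random() for _ in range(n)) / n
    if abs(m - mu) > eps:
        deviations += 1

empirical = deviations / trials
chebyshev = sigma2 / (n * eps ** 2)   # the bound used in the proof
print(empirical, "<=", chebyshev)
```

The empirical deviation rate is far below the bound; Chebyshev's inequality is valid but typically far from tight.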
The strong law of large numbers states that, under the same hypotheses,

P(Sn/n → µ as n → ∞) = 1.

Proof for finite fourth moment. Assume further that

E(X1^4) < ∞.

Set Yn = Xn − µ. Then

Sn/n − µ = (Y1 + · · · + Yn)/n.

Hence, it will suffice to consider the case where µ = 0.
Note that

Sn^4 = Σ_{i=1}^n Xi^4 + 6 Σ_{1≤i<j≤n} Xi² Xj² + R
where R is a sum of terms of the forms Xi Xj Xk Xl, Xi Xj Xk² and Xi Xj³, for
i, j, k, l distinct. Since µ = 0, by independence, we have E(R) = 0. By the Cauchy–Schwarz
inequality,

E(X1² X2²) ≤ √(E(X1^4)) √(E(X2^4)) = E(X1^4).
Hence

E(Sn^4) = n E(X1^4) + 3n(n − 1) E(X1² X2²) ≤ 3n² E(X1^4)
and so

E(Σ_{n=1}^∞ (Sn/n)^4) = Σ_{n=1}^∞ E((Sn/n)^4) ≤ 3 E(X1^4) Σ_{n=1}^∞ 1/n² < ∞.

In particular, Σ_{n=1}^∞ (Sn/n)^4 is finite with probability 1, so its terms tend to 0.
Hence P(Sn/n → 0) = 1.
The following general argument allows us to deduce the weak law from the strong law.
Let (Xn : n ∈ N) be a sequence of random variables and let ε > 0. Consider the events

An = ∩_{m=n}^∞ {|Xm| ≤ ε},  Bn = {|Xn| ≤ ε},  A = {|Xn| ≤ ε for all sufficiently large n}.
Proof for finite exponential moment. Assume further that MX(δ) < ∞ and MX(−δ) < ∞
for some δ > 0. It will suffice to deal with the case where µ = 0 and σ² = 1; the
general case follows by considering the random variables Yn = (Xn − µ)/σ. Set

R(x) = (x³/2) ∫_0^1 e^{tx} (1 − t)² dt.

By integration by parts, we see that

e^x = 1 + x + x²/2 + R(x).
Note that, for |λ| ≤ δ/2 and t ∈ [0, 1], we have e^{tλx} ≤ e^{δ|x|/2}, so

|R(λx)| ≤ (|λx|³/3!) e^{δ|x|/2} = (2|λ|/δ)³ ((δ|x|/2)³/3!) e^{δ|x|/2} ≤ (2|λ|/δ)³ e^{δ|x|}

and so

|R(λX)| ≤ (2|λ|/δ)³ (e^{δX} + e^{−δX}).

Hence

|E(R(λX))| ≤ (2|λ|/δ)³ (MX(δ) + MX(−δ)) = o(|λ|²)
δ
as λ → 0. On taking expectations in the identity
e^{λX} = 1 + λX + λ²X²/2 + R(λX)

we obtain

MX(λ) = 1 + λ²/2 + E(R(λX)).
Hence, for all λ ∈ R, as n → ∞,
MSn/√n(λ) = E(e^{λ(X1+···+Xn)/√n}) = MX(λ/√n)^n = (1 + (λ²/(2n))(1 + o(1)))^n → e^{λ²/2} = MZ(λ)
where Z ∼ N (0, 1). The result then follows from the continuity theorem for moment
generating functions.
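The convergence MSn/√n → MZ can be seen concretely by simulating standardized sums. A sketch with U[0, 1] summands (the parameter choices are mine): the proportion of standardized sums below 1 should approach Φ(1) ≈ 0.8413.

```python
import math, random

random.seed(5)
n, trials = 30, 50_000
mu, sigma = 0.5, math.sqrt(1 / 12)    # mean and sd of a U[0,1] summand

def standardized_sum():
    """(Sn - n*mu) / (sigma * sqrt(n)) for one sample of size n."""
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

p_hat = sum(standardized_sum() <= 1.0 for _ in range(trials)) / trials
phi1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # Phi(1), the N(0,1) distribution function at 1
print(p_hat, phi1)
```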
21.4 Sampling error via the central limit theorem
A referendum is held in a large population. A proportion p of the population are inclined
to vote ‘Yes’, the rest being inclined to vote ‘No’. A random sample of N individuals is
interviewed prior to the referendum and asked for their voting intentions. How large should
N be chosen in order to predict the percentage of ‘Yes’ voters with an accuracy of ±4%
with probability exceeding 0.99?
If we suppose that the interviewees are chosen uniformly at random and with replace-
ment, and all answer truthfully, then the proportion p̂N of ‘Yes’ voters revealed by the
sample is given by
p̂N = SN/N
where SN ∼ B(N, p). Note that
E(SN ) = N p, var(SN ) = N pq
where q = 1 − p. Since N will be chosen large, we will use the approximation to the distri-
bution of SN given by the central limit theorem. Thus SN is approximated in distribution
by

Np + √(Npq) Z

where Z ∼ N(0, 1). Hence

P(|p̂N − p| > ε) ≈ P(√(Npq) |Z|/N > ε) = P(|Z| > ε√N/√(pq)).
For the standard normal distribution, P(|Z| > z) = 0.01 for z = 2.58. We want ε = 0.04 so,
choosing p = 1/2 for the worst variance, we require

2 × (4/100) × √N = 2.58

which gives N = 1040.
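The arithmetic, together with a Monte Carlo sanity check of the resulting sample size (the seed and trial count are arbitrary):

```python
import random

z, eps = 2.58, 0.04
N = (z / (2 * eps)) ** 2        # worst case p = 1/2: 2 * eps * sqrt(N) = z
print(round(N))                 # 1040

# Check: with N = 1040 and p = 1/2, the deviation probability should be about 0.01
random.seed(6)
trials, Nint = 4000, 1040
fail = 0
for _ in range(trials):
    p_hat = sum(random.random() < 0.5 for _ in range(Nint)) / Nint
    if abs(p_hat - 0.5) > eps:
        fail += 1
fail_rate = fail / trials
print(fail_rate)
```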
22 Geometric probability
Some nice problems can be formulated in terms of points or lines chosen uniformly at
random in a given geometric object. Their solutions often benefit from observations of
symmetry.
The appearance of π in the probability for this simple experiment turns it, in principle,
into a means to estimate π. Consider the function f on (0, ∞) given by
f(x) = 2ℓ/(xL).
Then f(p) = π and f′(p) = −2ℓ/(p²L) = −π/p. Suppose we throw n needles on the floor
and denote by p̂n the proportion which land on a line. The central limit theorem gives an
approximation in distribution

p̂n ≈ p + √(p(1 − p)/n) Z

where Z ∼ N(0, 1). Setting π̂n = f(p̂n) and linearizing, π̂n ≈ π + f′(p)(p̂n − p), which has
standard deviation approximately (π/p)√(p(1 − p)/n). Taking ℓ = L, so that p = 2/π, to obtain

P(|π̂n − π| ≤ 0.001) ≥ 0.99

we need 2.58 π √((π/2 − 1)/n) ≈ 0.001, that is n ≈ 3.75 × 10⁷. It is not a very efficient way
to estimate π.
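The experiment itself is easy to simulate. A sketch with ℓ = L = 1, so that p = 2/π, sampling the needle's centre distance to the nearest line and its angle (the variable names are mine):

```python
import math, random

random.seed(7)
n = 1_000_000
crossings = 0
for _ in range(n):
    d = random.uniform(0.0, 0.5)          # distance from needle centre to nearest line (L = 1)
    theta = random.uniform(0.0, math.pi)  # angle of the needle to the lines (l = 1)
    if d <= 0.5 * math.sin(theta):        # the needle crosses a line
        crossings += 1

p_hat = crossings / n
pi_hat = 2 / p_hat     # f(p_hat) with l = L, since p = 2/pi
print(pi_hat)
```

Even with a million throws the estimate is only good to a couple of decimal places, consistent with the inefficiency noted above.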
23 Gaussian random variables
23.1 Definition
A random variable X in R is Gaussian if
X = µ + σZ
for some µ ∈ R and some σ ∈ [0, ∞), where Z ∼ N (0, 1). We write X ∼ N (µ, σ 2 ). If
σ > 0, then X has a density on R given by
fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}.
A random variable X in Rn is Gaussian if uᵀX is Gaussian for all u ∈ Rn. For such X, if
a is an m × n matrix and b ∈ Rm, then aX + b is also Gaussian. To see this, note that, for
all v ∈ Rm,

vᵀ(aX + b) = uᵀX + vᵀb

where u = aᵀv, so vᵀ(aX + b) is Gaussian.
Set µ = E(X) and V = var(X) = (cov(Xi, Xj))_{i,j}. Then, for all u ∈ Rn,

E(uᵀX) = uᵀµ,  0 ≤ var(uᵀX) = uᵀVu

so

uᵀX ∼ N(uᵀµ, uᵀVu).
Note that V is an n × n matrix, which is necessarily symmetric and non-negative definite.
The moment generating function of X is the function MX on Rn given by

MX(λ) = E(e^{λᵀX}).

By taking u = λ, from the known form of the moment generating function in the scalar
case, we see that

MX(λ) = e^{λᵀµ + λᵀVλ/2}.
By uniqueness of moment generating functions, this shows that the distribution of X is
determined by its mean µ and covariance matrix V. So we write X ∼ N(µ, V).
23.3 Construction
Given independent N (0, 1) random variables Z1 , . . . , Zn , we can define a Gaussian random
variable in Rn by
Z = (Z1 , . . . , Zn )T .
To check that Z is indeed Gaussian, we compute, for u = (u1, . . . , un)ᵀ ∈ Rn and λ ∈ R,

E(e^{λuᵀZ}) = E(∏_{i=1}^n e^{λ ui Zi}) = ∏_{i=1}^n e^{λ² ui²/2} = e^{λ²|u|²/2}

which shows by uniqueness of moment generating functions that uᵀZ ∼ N(0, |u|²). Now

E(Z) = 0,  cov(Zi, Zj) = δij
so Z ∼ N (0, In ) where In is the n × n identity matrix.
More generally, given µ ∈ Rn and a non-negative definite n × n matrix V , we can define
a random variable X in Rn by
X = µ + σZ
where σ is the non-negative definite square root of V. Then X is Gaussian and

E(X) = µ,  var(X) = E(σZ(σZ)ᵀ) = σ E(ZZᵀ) σ = σ In σ = V

so X ∼ N(µ, V).
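This construction translates directly into code. A sketch using NumPy (assumed available), computing the non-negative definite square root σ via the spectral decomposition; the function name and test parameters are mine:

```python
import numpy as np

rng = np.random.default_rng(8)

def sample_gaussian(mu, V, n):
    """Sample n points from N(mu, V) as X = mu + sigma Z, where sigma is the
    non-negative definite square root of V (via V = U diag(w) U^T)."""
    w, U = np.linalg.eigh(V)
    sigma = U @ np.diag(np.sqrt(np.clip(w, 0, None))) @ U.T
    Z = rng.standard_normal((n, len(mu)))    # independent N(0,1) coordinates
    return mu + Z @ sigma.T

mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.6], [0.6, 1.0]])
X = sample_gaussian(mu, V, 200_000)
print(X.mean(axis=0))    # ≈ mu
print(np.cov(X.T))       # ≈ V
```

A Cholesky factor of V would serve equally well when V is positive definite; the spectral square root also handles the degenerate non-negative definite case.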
23.5 Bivariate normals
In the case n = 2, the Gaussian distributions are characterized by five parameters. Let
X = (X1, X2) be a Gaussian random variable in R² and set

µk = E(Xk),  σk² = var(Xk),  ρσ1σ2 = cov(X1, X2).

Then µ1, µ2 ∈ R and σ1², σ2² ∈ [0, ∞) and ρ ∈ [−1, 1]. The covariance matrix V is given by

V = [ σ1²     ρσ1σ2
      ρσ1σ2   σ2²  ].
Note that, for any such matrix V and for all x = (x1, x2)ᵀ ∈ R²,

xᵀVx = (1 − ρ)(σ1²x1² + σ2²x2²) + ρ(σ1x1 + σ2x2)² = (1 + ρ)(σ1²x1² + σ2²x2²) − ρ(σ1x1 − σ2x2)² ≥ 0
so V is non-negative definite for all ρ ∈ [−1, 1]. This shows that all combinations of
parameter values in the given ranges are possible.
In the case ρ = 0, excluding the trivial cases σ1 = 0 and σ2 = 0, X has density function

fX(x) = (2π)^{−1} (det V)^{−1/2} e^{−(x−µ)ᵀV^{−1}(x−µ)/2} = ∏_{k=1}^2 (2πσk²)^{−1/2} e^{−(xk−µk)²/(2σk²)}
so X1 and X2 are independent. Suppose now that σ1 > 0 and set a = ρσ2/σ1 and
Y = X2 − aX1. Then cov(X1, Y) = ρσ1σ2 − aσ1² = 0 and (X1, Y) is Gaussian, and so X1
and Y are independent by the argument of the preceding paragraph. Hence X2 has a
decomposition

X2 = aX1 + Y