Lecture Notes for Introductory Probability
Janko Gravner
Mathematics Department
University of California
Davis, CA 95616
gravner@math.ucdavis.edu
January 5, 2014
These notes were started in January 2009 with help from Christopher Ng, a student in
Math 135A and 135B classes at UC Davis, who typeset the notes he took during my lectures.
This text is not a treatise in elementary probability and has no lofty goals; instead, its aim is
to help a student achieve the proficiency in the subject required for a typical exam and basic
real-life applications. Therefore, its emphasis is on examples, which are chosen without much
redundancy. A reader should strive to understand every example given and be able to design
and solve a similar one. Problems at the end of chapters and on sample exams (the solutions
to all of which are provided) have been selected from actual exams, hence should be used as a
test for preparedness.
I have only one tip for studying probability: you cannot do it half-heartedly. You have to
devote to this class several hours per week of concentrated attention to understand the subject
enough so that standard problems become routine. If you think that coming to class and reading
the examples while also doing something else is enough, you're in for an unpleasant surprise on
the exams.
This text will always be available free of charge to UC Davis students. Please contact me if
you spot any mistake. I am thankful to Marisano James for numerous corrections and helpful
suggestions.
Introduction
The theory of probability has always been associated with gambling and many of the most accessible
examples still come from that activity. You should be familiar with the basic tools of the
gambling trade: a coin, a (six-sided) die, and a full deck of 52 cards. A fair coin gives you Heads (H) or Tails (T) with equal probability, a fair die will give you 1, 2, 3, 4, 5, or 6 with equal probability, and a shuffled deck of cards means that any ordering of cards is equally likely.
Example 1.1. Here are typical questions that we will be asking and that you will learn how to
answer. This example serves as an illustration and you should not expect to understand how to
get the answer yet.
Start with a shuffled deck of cards and distribute all 52 cards to 4 players, 13 cards to each.
What is the probability that each player gets an Ace? Next, assume that you are a player and
you get a single Ace. What is the probability now that each player gets an Ace?
Answers. If any ordering of cards is equally likely, then any position of the four Aces in the deck is also equally likely. There are (52 choose 4) possibilities for the positions (slots) for the 4 Aces. Out of these, the number of positions that give each player an Ace is 13^4: pick the first slot among the cards that the first player gets, then the second slot among the second player's cards, then the third and the fourth slot. Therefore, the answer is

13^4 / (52 choose 4) ≈ 0.1055.
After you see that you have a single Ace, the probability goes up: the previous answer needs to be divided by the probability that you get a single Ace, which is 13 (39 choose 3) / (52 choose 4) ≈ 0.4388. The answer then becomes

13^4 / (13 (39 choose 3)) ≈ 0.2404.
Here is how you can quickly estimate the second probability during a card game: give the second Ace to a player, the third to a different player (probability about 2/3), and then the last to the third player (probability about 1/3), for the approximate answer (2/3)(1/3) = 2/9 ≈ 0.22.
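These answers can be checked exactly with Python's math.comb; this is a quick verification sketch, not part of the original text.

```python
from math import comb

# Exact computation of the two probabilities in Example 1.1.
p_all = 13 ** 4 / comb(52, 4)                 # each player gets an Ace
p_single = 13 * comb(39, 3) / comb(52, 4)     # you get exactly one Ace
p_cond = p_all / p_single                     # conditional probability
print(round(p_all, 4), round(p_single, 4), round(p_cond, 4))
# 0.1055 0.4388 0.2404
```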
History of probability
Although gambling dates back thousands of years, the birth of modern probability is considered
to be a 1654 letter from the Flemish aristocrat and notorious gambler Chevalier de Méré to the
mathematician and philosopher Blaise Pascal. In essence the letter said:
I used to bet even money that I would get at least one 6 in four rolls of a fair die. The probability of this is 4 times the probability of getting a 6 in a single roll, i.e., 4/6 = 2/3; clearly I had an advantage and indeed I was making money. Now I bet even money that within 24 rolls of two dice I get at least one double 6. This has the same advantage (24/6^2 = 2/3), but now I am losing money. Why?
As Pascal discussed in his correspondence with Pierre de Fermat, de Méré's reasoning was faulty; after all, if the number of rolls were 7 in the first game, the logic would give the nonsensical probability 7/6. We'll come back to this later.
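A two-line computation (a sketch, not part of the original text) shows the correct probabilities via the complement, and why the first bet wins money while the second loses it:

```python
# De Mere's two bets: P(no 6 in n rolls) is (5/6)^n, and
# P(no double 6 in n rolls of two dice) is (35/36)^n.
p_one_die = 1 - (5 / 6) ** 4        # at least one 6 in 4 rolls
p_two_dice = 1 - (35 / 36) ** 24    # at least one double 6 in 24 rolls
print(round(p_one_die, 4), round(p_two_dice, 4))  # 0.5177 0.4914
```

The first probability is just above 1/2 and the second just below, which explains de Méré's experience at the table.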
Example 1.2. In a family with 4 children, what is the probability of a 2:2 boy-girl split?
One common wrong answer: 1/5, as the five possibilities for the number of boys (0 through 4) are not equally likely.
Another common guess: close to 1, as this is the most balanced possibility. This represents the mistaken belief that symmetry in probabilities should very likely result in symmetry in
the outcome. A related confusion supposes that events that are probable (say, have probability
around 0.75) occur nearly certainly.
If an experiment has n equally likely outcomes, m of which make up the event E, then

(1.1)    P(E) = m/n.
Example 1.3. A fair die has 6 outcomes; take E = {2, 4, 6}. Then P(E) = 1/2.
What does the answer in Example 1.3 mean? Every student of probability should spend some time thinking about this. The fact is that it is very difficult to attach a meaning to P(E) if we roll a die a single time or a few times. The most straightforward interpretation is that for a very large number of rolls about half of the outcomes will be even. Note that this requires at least the concept of a limit! This relative frequency interpretation of probability will be explained in detail much later. For now, take formula (1.1) as the definition of probability.
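The relative frequency interpretation can be illustrated by simulation; the following sketch (not part of the original text) rolls a fair die N times and prints the fraction of even outcomes, which should approach P(E) = 1/2.

```python
import random

# Relative frequency of E = {2, 4, 6} over N rolls of a fair die.
random.seed(1)  # fixed seed so the run is reproducible
for N in (100, 10_000, 1_000_000):
    evens = sum(1 for _ in range(N) if random.randint(1, 6) % 2 == 0)
    ratio = evens / N
    print(N, ratio)
```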
Combinatorics
Example 2.1. Toss three fair coins. What is the probability of exactly one Heads (H)?
There are 8 equally likely outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. Out
of these, 3 have exactly one H. That is, E = {HTT, THT, TTH}, and P (E) = 3/8.
Example 2.2. Let us now compute the probability of a 2:2 boy-girl split in a four-children
family. We have 16 outcomes, which we will assume are equally likely, although this is not quite
true in reality. We list the outcomes below, although we will soon see that there is a better way.
BBBB BBBG BBGB BBGG
BGBB BGBG BGGB BGGG
GBBB GBBG GBGB GBGG
GGBB GGBG GGGB GGGG

We conclude that

P(2:2 split) = 6/16 = 3/8,

P(1:3 split or 3:1 split) = 8/16 = 1/2,

P(4:0 split or 0:4 split) = 2/16 = 1/8.
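The listing above can be delegated to the computer; this sketch (an illustration, not part of the original text) enumerates the 16 equally likely outcomes with itertools and counts the splits.

```python
from itertools import product

# Enumerate all 2^4 equally likely boy/girl sequences and count splits
# by the number of boys in each outcome.
counts = {b: 0 for b in range(5)}
for outcome in product("BG", repeat=4):
    counts[outcome.count("B")] += 1

p_22 = counts[2] / 16
p_13_or_31 = (counts[1] + counts[3]) / 16
p_40_or_04 = (counts[0] + counts[4]) / 16
print(p_22, p_13_or_31, p_40_or_04)  # 0.375 0.5 0.125
```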
Example 2.3. Roll two dice. What is the most likely sum?
Outcomes are ordered pairs (i, j), 1 ≤ i ≤ 6, 1 ≤ j ≤ 6.

sum:              2  3  4  5  6  7  8  9  10  11  12
no. of outcomes:  1  2  3  4  5  6  5  4   3   2   1

Our answer is 7, and P(sum = 7) = 6/36 = 1/6.
How to count?
Listing all outcomes is very inefficient, especially if their number is large. We will, therefore,
learn a few counting techniques, starting with a trivial, but conceptually important fact.
Basic principle of counting. If an experiment consists of two stages and the first stage has m outcomes, while the second stage has n outcomes regardless of the outcome at the first stage, then the experiment as a whole has mn outcomes.
Example 2.4. Roll a die 4 times. What is the probability that you get different numbers?
At least at the beginning, you should divide every solution into the following three steps:
Step 1: Identify the set of equally likely outcomes. In this case, this is the set of all ordered 4-tuples of numbers 1, . . . , 6. That is, {(a, b, c, d) : a, b, c, d ∈ {1, . . . , 6}}.

Step 2: Compute the number of outcomes. In this case, it is 6^4.

Step 3: Compute the number of good outcomes. In this case it is 6 · 5 · 4 · 3. Why? We have 6 options for the first roll, 5 options for the second roll since its number must differ from the number on the first roll, 4 options for the third roll since its number must not appear on the first two rolls, etc. Note that the set of possible outcomes changes from stage to stage (roll to roll in this case), but their number does not!
The answer then is

6 · 5 · 4 · 3 / 6^4 = 5/18 ≈ 0.2778.
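The three steps can be checked by letting the computer enumerate all outcomes; this is a verification sketch, not part of the original text.

```python
from itertools import product

# Example 2.4 two ways: the counting formula versus brute-force
# enumeration of all 6^4 equally likely outcomes.
formula = (6 * 5 * 4 * 3) / 6 ** 4
good = sum(1 for rolls in product(range(1, 7), repeat=4)
           if len(set(rolls)) == 4)   # all four numbers different
print(formula, good / 6 ** 4)  # both equal 5/18
```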
Permutations
Assume you have n objects. The number of ways to fill n ordered slots with them is

n · (n - 1) · · · 2 · 1 = n!,

while the number of ways to fill k ≤ n ordered slots is

n(n - 1) · · · (n - k + 1) = n!/(n - k)!.
Example 2.6. Shuffle a standard deck of 52 cards. Then

P(top card is an Ace) = 1/13 = 4 · 51!/52!,

P(the cards of each suit are together) = 4! (13!)^4/52! ≈ 4.5 · 10^-28,

P(all hearts are together) = 40! 13!/52! ≈ 6 · 10^-11.

To compute the last probability, for example, collect all hearts into a block; a good event is specified by ordering 40 items (the block of hearts plus 39 other cards) and ordering the hearts within their block.
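The factorial expressions above are easy to evaluate exactly; the following is a checking sketch, not part of the original text.

```python
from math import factorial as f

# Shuffled-deck probabilities computed from the factorial formulas.
p_top_ace = 4 * f(51) / f(52)                 # = 1/13
p_suits_together = f(4) * f(13) ** 4 / f(52)  # each suit in one block
p_hearts_together = f(40) * f(13) / f(52)     # all hearts in one block
print(p_top_ace, p_hearts_together, p_suits_together)
```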
Before we go on to further examples, let us agree that when the text says, without further elaboration, that a random choice is made, this means that all available choices are equally likely. Also, in the next problem (and in statistics in general), sampling with replacement refers to choosing, at random, an object from a population, noting its properties, putting the object back into the population, and then repeating. Sampling without replacement omits the putting back part.
Example 2.7. A bag has 6 pieces of paper, each with one of the letters, E, E, P , P , P , and
R, on it. Pull 6 pieces at random out of the bag (1) without, and (2) with replacement. What
is the probability that these pieces, in order, spell P EP P ER?
There are two problems to solve. For sampling without replacement:
1. An outcome is an ordering of the pieces of paper E1 E2 P1 P2 P3 R.
2. The number of outcomes thus is 6!.
3. The number of good outcomes is 3!2!.
The probability is 3! 2!/6! = 1/60.

For sampling with replacement, an outcome is a sequence of 6 draws, with each of the 6 pieces possible every time, so the number of outcomes is 6^6. The number of good outcomes is 3^3 · 2^2 · 1, as each P in PEPPER may be any of the 3 P pieces, each E any of the 2 E pieces, and the R only the single R piece. The probability is

3^3 · 2^2 / 6^6 = 1/432.
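Both answers can be confirmed by brute-force enumeration; this sketch (not part of the original text) labels the six pieces 0 through 5 so that equal letters stay distinguishable.

```python
from itertools import permutations, product
from fractions import Fraction

letters = ["E", "E", "P", "P", "P", "R"]
target = tuple("PEPPER")

# (1) Without replacement: all 6! orderings of the six distinct pieces.
good = sum(1 for p in permutations(range(6))
           if tuple(letters[i] for i in p) == target)
p_without = Fraction(good, 720)

# (2) With replacement: 6^6 equally likely sequences of drawn pieces.
good_rep = sum(1 for s in product(range(6), repeat=6)
               if tuple(letters[i] for i in s) == target)
p_with = Fraction(good_rep, 6 ** 6)
print(p_without, p_with)  # 1/60 1/432
```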
Example 2.8. Sit 3 men and 3 women at random (1) in a row of chairs and (2) around a table.
Compute P (all women sit together). In case (2), also compute P (men and women alternate).
In case (1), the answer is

4! 3!/6! = 1/5.
For case (2), pick a man, say John Smith, and sit him first. Then, we reduce to a row problem with 5! outcomes; the number of good outcomes is 3! · 3!. The answer is 3/10. For the last question, the seats for the men and women are fixed after John Smith takes his seat and so the answer is 3! 2!/5! = 1/10.
Example 2.9. A group consists of 3 Norwegians, 4 Swedes, and 5 Finns, and they sit at random
around a table. What is the probability that all groups end up sitting together?
The answer is 3! 4! 5! 2!/11! ≈ 0.000866. Pick, say, a Norwegian (Arne) and sit him down. Here is how you count the good events. There are 3! choices for ordering the group of Norwegians (and then sit them down on one of the two sides of Arne, depending on the ordering). Then, there are 4! choices for arranging the Swedes and 5! choices for arranging the Finns. Finally, there are 2! choices to order the two blocks of Swedes and Finns.
Combinations
Let (n choose k) be the number of different subsets with k elements of a set with n elements. Then,

(n choose k) = n(n - 1) · · · (n - k + 1)/k! = n!/(k!(n - k)!).

To understand why the above is true, first choose a subset, then order its elements in a row to fill k ordered slots with elements from the set with n objects. Then, distinct choices of a subset and its ordering will end up as distinct orderings. Therefore,

(n choose k) · k! = n(n - 1) · · · (n - k + 1).
We call (n choose k) "n choose k" or a binomial coefficient (as it appears in the binomial theorem:

(x + y)^n = Σ_{k=0}^{n} (n choose k) x^k y^{n-k}).

Also, note that

(n choose 0) = (n choose n) = 1  and  (n choose k) = (n choose n - k).
The multinomial coefficients are more general and are defined next.
The multinomial coefficient (n choose n1, . . . , nr) counts the ways to divide a set of n elements into r distinguishable subsets of sizes n1, . . . , nr (with n1 + · · · + nr = n):

(n choose n1, . . . , nr) = (n choose n1)(n - n1 choose n2)(n - n1 - n2 choose n3) · · · (n - n1 - · · · - n_{r-1} choose nr) = n!/(n1! n2! · · · nr!).
To understand the slightly confusing word distinguishable, just think of painting n1 elements red, then n2 different elements blue, etc. These colors distinguish among the different subsets.
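The two expressions for the multinomial coefficient can be checked numerically; the sizes 3, 4, 5 below are an arbitrary choice for illustration, not from the text.

```python
from math import comb, factorial

# Multinomial coefficient for n = 12 split as 3 + 4 + 5, computed two ways.
n1, n2, n3 = 3, 4, 5
n = n1 + n2 + n3
product_form = comb(n, n1) * comb(n - n1, n2) * comb(n - n1 - n2, n3)
factorial_form = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
print(product_form, factorial_form)  # 27720 27720
```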
Example 2.10. A fair coin is tossed 10 times. What is the probability that we get exactly 5 Heads?

P(exactly 5 Heads) = (10 choose 5)/2^10 ≈ 0.2461,

as one needs to choose the positions of the five Heads among 10 slots to fix a good outcome.
Example 2.11. We have a bag that contains 100 balls, 50 of them red and 50 blue. Select 5
balls at random. What is the probability that 3 are blue and 2 are red?
The number of outcomes is (100 choose 5) and all of them are equally likely, which is a reasonable interpretation of "select 5 balls at random." The answer is

P(3 are blue and 2 are red) = (50 choose 3)(50 choose 2)/(100 choose 5) ≈ 0.3189.
Why should this probability be less than a half? The probability that 3 are blue and 2 are red is equal to the probability that 3 are red and 2 are blue, and the two cannot both exceed 1/2, as their sum cannot be more than 1. It cannot be exactly 1/2 either, because other possibilities (such as all 5 chosen balls red) have probability greater than 0.
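A quick check of this hypergeometric computation with math.comb (a sketch, not part of the original text):

```python
from math import comb

# Example 2.11: 3 blue and 2 red out of 5 drawn from 50 blue and 50 red.
p_3blue_2red = comb(50, 3) * comb(50, 2) / comb(100, 5)
p_3red_2blue = comb(50, 2) * comb(50, 3) / comb(100, 5)
print(round(p_3blue_2red, 4))  # 0.3189
```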
Example 2.12. Here we return to Example 1.1 and solve it more slowly. Shuffle a standard deck of 52 cards and deal 13 cards to each of the 4 players.
What is the probability that each player gets an Ace? We will solve this problem in two
ways to emphasize that you often have a choice in your set of equally likely outcomes.
The first way uses an outcome to be an ordering of 52 cards:
1. There are 52! equally likely outcomes.
2. Let the first 13 cards go to the first player, the second 13 cards to the second player, etc. Pick a slot within each of the four segments of 13 slots for an Ace. There are 13^4 possibilities to choose these four slots for the Aces.
3. The number of choices to fill these four positions with (four different) Aces is 4!.
4. Order the rest of the cards in 48! ways.
The probability, then, is

13^4 · 4! · 48! / 52!.
The second way, via a small leap of faith, assumes that each set of the four positions of the
four Aces among the 52 shuffled cards is equally likely. You may choose to believe this intuitive
fact or try to write down a formal proof: the number of permutations that result in a given set
F of four positions is independent of F . Then:
1. The outcomes are the positions of the 4 Aces among the 52 slots for the shued cards of
the deck.
2. The number of outcomes is (52 choose 4).

3. The number of good outcomes is 13^4, as we need to choose one slot among the 13 cards that go to the first player, etc.
The probability, then, is

13^4 / (52 choose 4) ≈ 0.1055.
Let us also compute P (one person has all four Aces). Doing the problem the second way,
we get
1. The number of outcomes is (52 choose 4).

2. To fix a good outcome, pick one player ((4 choose 1) choices) and pick four slots for the Aces for that player ((13 choose 4) choices).

The answer, then, is

(4 choose 1)(13 choose 4) / (52 choose 4) ≈ 0.0106,

a lot smaller than the probability of each player getting an Ace.
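Both ways of counting, and the all-four-Aces probability, can be checked directly (a sketch, not part of the original text):

```python
from math import comb, factorial

# The two counting methods of Example 2.12 agree, and a single player
# holding all four Aces is much less likely.
first_way = 13 ** 4 * factorial(4) * factorial(48) / factorial(52)
second_way = 13 ** 4 / comb(52, 4)
p_one_player_all = 4 * comb(13, 4) / comb(52, 4)
print(round(second_way, 4), round(p_one_player_all, 4))  # 0.1055 0.0106
```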
Example 2.13. Roll a die 12 times. P (each number appears exactly twice)?
1. An outcome consists of filling each of the 12 slots (for the 12 rolls) with an integer between 1 and 6 (the outcome of the roll).
2. The number of outcomes, therefore, is 6^12.
3. To fix a good outcome, pick two slots for 1, then pick two slots for 2, etc., with

(12 choose 2)(10 choose 2) · · · (2 choose 2)

choices.

The probability, then, is

(12 choose 2)(10 choose 2) · · · (2 choose 2) / 6^12.
example,

P(5 white, 4 blue, 3 green, 2 yellow rooms) = (14 choose 5)(9 choose 4)(5 choose 3)(2 choose 2) / 4^14.
Example 2.15. A middle row on a plane seats 7 people. Three of them order chicken (C) and
the remaining four pasta (P). The flight attendant returns with the meals, but has forgotten who ordered what and discovers that they are all asleep, so she puts the meals in front of them
at random. What is the probability that they all receive correct meals?
A reformulation makes the problem clearer: we are interested in P(3 people who ordered C get C). Let us label the people 1, . . . , 7 and assume that 1, 2, and 3 ordered C. The outcome is a selection of the 3 people from the 7 who receive C; the number of outcomes is (7 choose 3) = 35, and there is a single good outcome. So, the answer is 1/35. Similarly,

P(no one who ordered C gets C) = (4 choose 3) / (7 choose 3) = 4/35,

P(a single person who ordered C gets C) = 3 (4 choose 2) / (7 choose 3) = 18/35,

P(two persons who ordered C get C) = (3 choose 2) · 4 / (7 choose 3) = 12/35.
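The four cases partition all outcomes, which can be checked by enumeration (a sketch, not part of the original text):

```python
from itertools import combinations

# Example 2.15 by enumeration: choose which 3 of the 7 passengers
# receive chicken; passengers 1, 2, 3 ordered it.
ordered_c = {1, 2, 3}
counts = [0, 0, 0, 0]   # counts[m] = outcomes where m C-orderers get C
for receives_c in combinations(range(1, 8), 3):
    counts[len(ordered_c & set(receives_c))] += 1
print(counts)  # [4, 18, 12, 1], out of 35 outcomes in total
```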
Problems
1. A California licence plate consists of a sequence of seven symbols: number, letter, letter,
letter, number, number, number, where a letter is any one of 26 letters and a number is one
among 0, 1, . . . , 9. Assume that all licence plates are equally likely. (a) What is the probability
that all symbols are different? (b) What is the probability that all symbols are different and the first number is the largest among the numbers?
2. A tennis tournament has 2n participants, n Swedes and n Norwegians. First, n people are
chosen at random from the 2n (with no regard to nationality) and then paired randomly with
the other n people. Each pair proceeds to play one match. An outcome is a set of n (ordered)
pairs, giving the winner and the loser in each of the n matches. (a) Determine the number of
outcomes. (b) What do you need to assume to conclude that all outcomes are equally likely?
(c) Under this assumption, compute the probability that all Swedes are the winners.
3. A group of 18 Scandinavians consists of 5 Norwegians, 6 Swedes, and 7 Finns. They are seated
at random around a table. Compute the following probabilities: (a) that all the Norwegians
sit together, (b) that all the Norwegians and all the Swedes sit together, and (c) that all the
Norwegians, all the Swedes, and all the Finns sit together.
4. A bag contains 80 balls numbered 1, . . . , 80. Before the game starts, you choose 10 different
numbers from amongst 1, . . . , 80 and write them on a piece of paper. Then 20 balls are selected
(without replacement) out of the bag at random. (a) What is the probability that all your
numbers are selected? (b) What is the probability that none of your numbers is selected? (c)
What is the probability that exactly 4 of your numbers are selected?
5. A full deck of 52 cards contains 13 hearts. Pick 8 cards from the deck at random (a) without
replacement and (b) with replacement. In each case compute the probability that you get no
hearts.
Solutions to problems

1. (a) (10 · 9 · 8 · 7 · 26 · 25 · 24)/(10^4 · 26^3), (b) the answer in (a) multiplied by 1/4, as, given four different numbers, each of them is equally likely to come first.

2. (a) Divide into two groups (winners and losers), then pair them: (2n choose n) · n!. Alternatively, pair the first player, then the next available player, etc., and, then, at the end choose the winners and the losers: (2n - 1)(2n - 3) · · · 3 · 1 · 2^n. (Of course, these two expressions are the same.) (b) All players are of equal strength, equally likely to win or lose any match against any other player. (c) The number of good events is n!, the choice of a Norwegian paired with each Swede, so the probability is n!/((2n choose n) · n!) = 1/(2n choose n).

3. (a) 13! 5!/17!, (b) 8! 5! 6!/17!, (c) 2! 7! 6! 5!/17!.

4. (a) (70 choose 10)/(80 choose 20), (b) (70 choose 20)/(80 choose 20), (c) (10 choose 4)(70 choose 16)/(80 choose 20).

5. (a) (39 choose 8)/(52 choose 8), (b) (3/4)^8.
Axioms of Probability
The question here is: how can we mathematically define a random experiment? What we have are outcomes (which tell you exactly what happens), events (sets containing certain outcomes), and probability (which attaches to every event the likelihood that it happens). We need to agree on which properties these objects must have in order to compute with them and develop a theory.

When we have finitely many equally likely outcomes all is clear and we have already seen many examples. However, as is common in mathematics, infinite sets are much harder to deal with. For example, we will soon see what it means to choose a random point within a unit circle. On the other hand, we will also see that there is no way to choose at random a positive integer (remember that at random means all choices are equally likely, unless otherwise specified).
Finally, a probability space is a triple (Ω, F, P). The first object Ω is an arbitrary set, representing the set of outcomes, sometimes called the sample space.

The second object F is a collection of events, that is, a set of subsets of Ω. Therefore, an event A ∈ F is necessarily a subset of Ω. Can we just say that each A ⊆ Ω is an event? In this course you can assume so without worry, although there are good reasons for not assuming so in general! I will give the definition of what properties F needs to satisfy, but this is only for illustration and you should take a course in measure theory to understand what is really going on. Namely, F needs to be a σ-algebra, which means (1) ∅ ∈ F, (2) A ∈ F implies A^c ∈ F, and (3) A1, A2, · · · ∈ F implies ∪_{i=1}^∞ Ai ∈ F.

What is important is that you can take the complement A^c of an event A (i.e., A^c happens when A does not happen), unions of two or more events (i.e., A1 ∪ A2 happens when either A1 or A2 happens), and intersections of two or more events (i.e., A1 ∩ A2 happens when both A1 and A2 happen). We call events A1, A2, . . . pairwise disjoint if Ai ∩ Aj = ∅ whenever i ≠ j; that is, at most one of these events can occur.
Finally, the probability P is a number attached to every event A and satisfies the following three axioms:

Axiom 1. For every event A, P(A) ≥ 0.

Axiom 2. P(Ω) = 1.

Axiom 3. If A1, A2, . . . is a sequence of pairwise disjoint events, then

P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
Whenever we have an abstract definition such as this one, the first thing to do is to look for
examples. Here are some.
Example 3.1. Ω = {1, 2, 3, 4, 5, 6},

P(A) = (number of elements in A)/6.
The random experiment here is rolling a fair die. Clearly, this can be generalized to any finite set with equally likely outcomes.
Example 3.2. Ω = {1, 2, . . .} and you have numbers p1, p2, . . . ≥ 0 with p1 + p2 + . . . = 1. For any A ⊆ Ω,

P(A) = Σ_{i∈A} pi.

For example, toss a fair coin repeatedly until the first Heads. Your outcome is the number of tosses. Here, pi = 1/2^i.

Note that the pi cannot be chosen to be equal, as you cannot make the sum of infinitely many equal numbers be 1.
Example 3.3. Pick a point from inside the unit circle centered at the origin. Here, Ω = {(x, y) : x^2 + y^2 < 1} and

P(A) = (area of A)/π.

It is important to observe that if A is a singleton (a set whose only element is a single point), then P(A) = 0. This means that we cannot attach the probability to outcomes: you hit a single point (or even a line) with probability 0, but a "fatter" set with positive area you hit with positive probability.

Another important theoretical remark: this is a case where A cannot be an arbitrary subset of the circle, as for some sets area cannot be defined!
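A Monte Carlo sketch of this example (not part of the original text): for the quarter disk A = {x > 0, y > 0}, the area formula gives P(A) = (π/4)/π = 1/4, and simulation agrees.

```python
import random

# Sample points uniformly in the unit disk by rejection sampling and
# estimate P(x > 0 and y > 0), which should be close to 1/4.
random.seed(0)
n, hits, accepted = 100_000, 0, 0
while accepted < n:
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    if x * x + y * y < 1:        # keep only points inside the disk
        accepted += 1
        if x > 0 and y > 0:
            hits += 1
print(hits / n)  # close to 0.25
```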
Here are some properties that follow from the axioms. The first is finite additivity:

(C1) If A1, . . . , An are pairwise disjoint events, then P(A1 ∪ · · · ∪ An) = P(A1) + · · · + P(An).

(C2) P(A^c) = 1 - P(A).

Proof. Apply (C1) to A1 = A, A2 = A^c: the two are disjoint and their union is Ω, so P(A) + P(A^c) = P(Ω) = 1.
(C3) 0 ≤ P(A) ≤ 1.

Proof. Use that P(A^c) ≥ 0 in (C2).
(C6)

P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3)
    - P(A1 ∩ A2) - P(A1 ∩ A3) - P(A2 ∩ A3)
    + P(A1 ∩ A2 ∩ A3),

and more generally

P(A1 ∪ · · · ∪ An) = Σ_{i=1}^{n} P(Ai) - Σ_{1≤i<j≤n} P(Ai ∩ Aj) + Σ_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) - · · · + (-1)^{n-1} P(A1 ∩ · · · ∩ An).

This is called the inclusion-exclusion formula and is commonly used when it is easier to compute probabilities of intersections than of unions.
Proof. We prove this only for n = 3. Let p1 = P(A1 ∩ A2^c ∩ A3^c), p2 = P(A1^c ∩ A2 ∩ A3^c), p3 = P(A1^c ∩ A2^c ∩ A3), p12 = P(A1 ∩ A2 ∩ A3^c), p13 = P(A1 ∩ A2^c ∩ A3), p23 = P(A1^c ∩ A2 ∩ A3), and p123 = P(A1 ∩ A2 ∩ A3). Again, note that all these sets are pairwise disjoint and that the right-hand side of (C6) is

(p1 + p12 + p13 + p123) + (p2 + p12 + p23 + p123) + (p3 + p13 + p23 + p123)
    - (p12 + p123) - (p13 + p123) - (p23 + p123)
    + p123
 = p1 + p2 + p3 + p12 + p13 + p23 + p123 = P(A1 ∪ A2 ∪ A3).
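Inclusion-exclusion is easy to check numerically; this sketch (not part of the original text) uses a universe of 12 equally likely points and three arbitrarily chosen events.

```python
from itertools import combinations

# Compare P(union) with the inclusion-exclusion sum over all
# nonempty subsets of the three events.
events = [set(range(0, 8)), set(range(4, 10)), {1, 5, 9, 11}]

def prob(s):
    return len(s) / 12

lhs = prob(events[0] | events[1] | events[2])
rhs = 0.0
for r in (1, 2, 3):
    for idx in combinations(range(3), r):
        inter = set.intersection(*(events[i] for i in idx))
        rhs += (-1) ** (r - 1) * prob(inter)
print(lhs, rhs)
```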
Example 3.4. Pick an integer in [1, 1000] at random. Compute the probability that it is
divisible neither by 12 nor by 15.
The sample space consists of the 1000 integers between 1 and 1000. Let Ar be the subset consisting of the integers divisible by r. The cardinality of Ar is ⌊1000/r⌋. Another simple fact is that Ar ∩ As = A_lcm(r,s), where lcm stands for the least common multiple. Our probability equals

1 - P(A12 ∪ A15) = 1 - P(A12) - P(A15) + P(A12 ∩ A15)
 = 1 - P(A12) - P(A15) + P(A60)
 = 1 - 83/1000 - 66/1000 + 16/1000 = 0.867.
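Direct counting confirms the computation (a sketch, not part of the original text):

```python
# Count the integers in [1, 1000] divisible neither by 12 nor by 15.
count = sum(1 for m in range(1, 1001) if m % 12 != 0 and m % 15 != 0)
print(count / 1000)  # 0.867
```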
Example 3.5. Sit 3 men and 4 women at random in a row. What is the probability that either
all the men or all the women end up sitting together?
Here, A1 = {all women sit together}, A2 = {all men sit together}, A1 ∩ A2 = {both women and men sit together}, and so the answer is

P(A1 ∪ A2) = P(A1) + P(A2) - P(A1 ∩ A2) = (4! 4!)/7! + (5! 3!)/7! - (2! 3! 4!)/7!.
Example 3.6. A group of 3 Norwegians, 4 Swedes, and 5 Finns is seated at random around a
table. Compute the probability that at least one of the three groups ends up sitting together.
Define AN = {Norwegians sit together} and similarly AS, AF. We have

P(AN) = 3! 9!/11!, P(AS) = 4! 8!/11!, P(AF) = 5! 7!/11!,

P(AN ∩ AS) = 3! 4! 6!/11!, P(AN ∩ AF) = 3! 5! 5!/11!, P(AS ∩ AF) = 4! 5! 4!/11!,

P(AN ∩ AS ∩ AF) = 3! 4! 5! 2!/11!.

Therefore,

P(AN ∪ AS ∪ AF) = (3! 9! + 4! 8! + 5! 7! - 3! 4! 6! - 3! 5! 5! - 4! 5! 4! + 3! 4! 5! 2!)/11!.
Example 3.7. Matching problem. A large company with n employees has a scheme according
to which each employee buys a Christmas gift and the gifts are then distributed at random to
the employees. What is the probability that someone gets his or her own gift?
Note that this is different from asking, assuming that you are one of the employees, for the probability that you get your own gift, which is 1/n.

Let Ai = {ith employee gets his or her own gift}. Then, what we are looking for is

P(∪_{i=1}^{n} Ai).
We have

P(Ai) = 1/n (for all i),

P(Ai ∩ Aj) = (n - 2)!/n! = 1/(n(n - 1)) (for all i < j),

P(Ai ∩ Aj ∩ Ak) = (n - 3)!/n! = 1/(n(n - 1)(n - 2)) (for all i < j < k),

. . .

P(A1 ∩ · · · ∩ An) = 1/n!.

Therefore, by inclusion-exclusion,

P(∪_{i=1}^{n} Ai) = n · 1/n - (n choose 2) · 1/(n(n - 1)) + (n choose 3) · 1/(n(n - 1)(n - 2)) - · · · + (-1)^{n-1} · 1/n!
 = 1 - 1/2! + 1/3! - · · · + (-1)^{n-1} · 1/n!
 → 1 - 1/e ≈ 0.6321 (as n → ∞).
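The partial sums converge to 1 - 1/e very quickly, as this sketch (not part of the original text) shows:

```python
from math import e, factorial

# Matching problem: P(someone gets his or her own gift) for n employees.
def p_someone_matched(n):
    return sum((-1) ** (i - 1) / factorial(i) for i in range(1, n + 1))

for n in (3, 5, 10):
    print(n, round(p_someone_matched(n), 4))  # 0.6667, 0.6333, 0.6321
```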
Example 3.8. Birthday Problem. Assume that there are k people in the room. What is the probability that there are two who share a birthday? We will ignore leap years, assume all birthdays are equally likely, and generalize the problem a little: from n possible birthdays, sample k times with replacement.

P(a shared birthday) = 1 - P(no shared birthdays) = 1 - n(n - 1) · · · (n - k + 1)/n^k.
When n = 365, the lowest k for which the above exceeds 0.5 is, famously, k = 23. Some values are given in the following table.

k     prob. for n = 365
10    0.1169
23    0.5073
41    0.9032
57    0.9901
70    0.9992
Occurrences of this problem are quite common in various contexts, so we give another example. Each day, the Massachusetts lottery chooses a four-digit number at random, with leading 0s allowed. Thus, this is sampling with replacement from among n = 10,000 choices each day. On February 6, 1978, the Boston Evening Globe reported that

During [the lottery's] 22 months' existence [...], no winning number has ever been repeated. [David] Hughes, the expert [and a lottery official] doesn't expect to see duplicate winners until about half of the 10,000 possibilities have been exhausted.
What would an informed reader make of this? Assuming k = 660 days, the probability of no repetition works out to be about 2.19 · 10^-10, making it a remarkably improbable event. What happened was that Mr. Hughes, not understanding the Birthday Problem, did not check for repetitions, confident that there would not be any. Apologetic lottery officials announced later that there indeed were repetitions.
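The table values and the lottery figure both come from the same product formula; here is a checking sketch (not part of the original text).

```python
# Birthday problem: P(at least one shared value) when sampling k times
# with replacement from n equally likely values.
def p_shared(n, k):
    p_none = 1.0
    for i in range(k):
        p_none *= (n - i) / n
    return 1 - p_none

for k in (10, 23, 41, 57, 70):
    print(k, round(p_shared(365, k), 4))

# The Massachusetts lottery: no repetition in 660 draws from 10,000.
print(1 - p_shared(10_000, 660))  # about 2.2e-10
```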
Example 3.9. Coupon Collector Problem. Within the context of the previous problem, assume that k ≥ n and compute P(all n birthdays are represented).

More often, this is described in terms of cereal boxes, each of which contains one of n different cards (coupons), chosen at random. If you buy k boxes, what is the probability that you have a complete collection?

When k = n, our probability is n!/n^n. In general, let Ai = {ith birthday is not represented}; then we need to compute

1 - P(∪_{i=1}^{n} Ai).
Now,

P(Ai) = (n - 1)^k/n^k (for all i),

P(Ai ∩ Aj) = (n - 2)^k/n^k (for all i < j),

. . .

P(A1 ∩ · · · ∩ An) = 0,

and so, by inclusion-exclusion,

P(all n birthdays are represented) = Σ_{i=0}^{n} (-1)^i (n choose i) (1 - i/n)^k.
This must be n!/n^n for k = n, and 0 when k < n, neither of which is obvious from the formula. Neither will you, for large n, get anything close to the correct numbers when k ≥ n if you try to compute the probabilities by computer, due to the very large size of summands with alternating signs and the resulting rounding errors. We will return to this problem later for a much more efficient computational method, but some numbers are in the two tables below. Another remark for those who know a lot of combinatorics: you will perhaps notice that the above probability is (n!/n^k) S(k, n), where S(k, n) is the Stirling number of the second kind.
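For small n the inclusion-exclusion formula is perfectly usable; this sketch (not part of the original text) checks it against the k = n and k < n cases.

```python
from math import comb, factorial

# Coupon collector: P(all n values represented in k samples).
def p_complete(n, k):
    return sum((-1) ** i * comb(n, i) * (1 - i / n) ** k
               for i in range(n + 1))

print(round(p_complete(6, 13), 4))                       # 0.5139
print(round(p_complete(6, 6), 4),
      round(factorial(6) / 6 ** 6, 4))                   # both 0.0154
```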
k     prob. for n = 6
13    0.5139
23    0.9108
36    0.9915

k      approx. prob. for n = 365
1607   0.01
1854   0.10
2287   0.50
2972   0.90
3828   0.99
4669   0.999
Example 3.10. Roll a die 12 times. Compute the probability that one number occurs exactly 6 times and two other numbers occur exactly 3 times each.

To fix a good outcome:

1. Pick the number that occurs 6 times: 6 choices.

2. Pick the two numbers that occur 3 times each: (5 choose 2) choices.

3. Pick slots (rolls) for the number that occurs 6 times: (12 choose 6) choices.

4. Pick slots for one of the numbers that occur 3 times: (6 choose 3) choices.

Therefore, our probability is

6 (5 choose 2)(12 choose 6)(6 choose 3) / 6^12.
Example 3.11. You have 10 pairs of socks in the closet. Pick 8 socks at random. For every i, compute the probability that you get i complete pairs of socks.

There are (20 choose 8) outcomes. To count the number of good outcomes:

1. Pick i pairs of socks from the 10: (10 choose i) choices.

2. Pick the 8 - 2i pairs from which a single sock is chosen: (10 - i choose 8 - 2i) choices; then pick one sock from each of these pairs: 2^(8-2i) choices.

The probability, then, is

(10 choose i)(10 - i choose 8 - 2i) 2^(8-2i) / (20 choose 8).
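As a sanity check (a sketch, not part of the original text), the five cases i = 0, . . . , 4 must have probabilities summing to 1.

```python
from math import comb
from fractions import Fraction

# Example 3.11: P(i complete pairs) in a draw of 8 socks from 10 pairs.
def p_pairs(i):
    good = comb(10, i) * comb(10 - i, 8 - 2 * i) * 2 ** (8 - 2 * i)
    return Fraction(good, comb(20, 8))

probs = [p_pairs(i) for i in range(5)]
print(sum(probs))  # 1
```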
Example 3.12. Poker Hands. In the definitions, the word value refers to A, K, Q, J, 10, 9, 8, 7, 6, 5, 4, 3, 2. This sequence orders the cards in descending consecutive values, with one exception: an Ace may be regarded as 1 for the purposes of making a straight (but note that, for example, K, A, 2, 3, 4 is not a valid straight sequence; A can only begin or end a straight).
From the lowest to the highest, here are the hands:
(a) one pair: two cards of the same value plus 3 cards with different values

J J 9 Q 4

(b) two pairs: two pairs plus another card of a different value

J J 9 9 3

(c) three of a kind: three cards of the same value plus two with different values

Q Q Q 9 3

(d) straight: five cards with consecutive values

5 4 3 2 A

(e) flush: five cards of the same suit

K 9 7 6 3

(f) full house: a three of a kind and a pair

J J J 3 3

(g) four of a kind: four cards of the same value

K K K K 10

(h) straight flush: five cards of the same suit with consecutive values

A K Q J 10
Here are the probabilities:

hand             no. combinations                                 approx. prob.
one pair         13 (4 choose 2)(12 choose 3) 4^3 = 1,098,240     0.422569
two pairs        (13 choose 2)(4 choose 2)^2 · 44 = 123,552       0.047539
three of a kind  13 (4 choose 3)(12 choose 2) 4^2 = 54,912        0.021128
straight         10 · 4^5 = 10,240                                0.003940
flush            4 (13 choose 5) = 5,148                          0.001981
full house       13 (4 choose 3) · 12 (4 choose 2) = 3,744        0.001441
four of a kind   13 · 48 = 624                                    0.000240
straight flush   10 · 4 = 40                                      0.000015
Note that the probabilities of a straight and a flush above include the possibility of a straight flush.
Let us see how some of these are computed. First, the number of all outcomes is (52 choose 5) = 2,598,960. Then, for example, for the three of a kind, the number of good outcomes may be obtained by listing the number of choices:

1. Choose a value for the triple: 13 choices.

2. Choose the values of the other two cards: (12 choose 2) choices.

3. Pick three cards from the four of the same chosen value: (4 choose 3) choices.

4. Pick a card (i.e., the suit) from each of the two remaining values: 4^2 choices.
One could do the same for one pair:

1. Pick a value for the pair: 13 choices.

2. Pick the other three values: (12 choose 3) choices.

3. Pick two cards from the value of the pair: (4 choose 2) choices.

4. Pick a card from each of the three remaining values: 4^3 choices.

Finally, the probability of getting none of the listed hands is obtained by picking 5 values that are not consecutive and suits that are not all the same:

((13 choose 5) - 10)(4^5 - 4) / (52 choose 5) ≈ 0.5012.
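The counts in the table can be recomputed from the formulas; this is a checking sketch, not part of the original text.

```python
from math import comb

# Selected poker-hand counts out of comb(52, 5) = 2,598,960 hands.
total = comb(52, 5)
one_pair = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3
three_kind = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2
no_hand = (comb(13, 5) - 10) * (4 ** 5 - 4)   # no pair, straight, or flush
print(one_pair, three_kind, round(no_hand / total, 4))
# 1098240 54912 0.5012
```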
Example 3.13. Assume that 20 Scandinavians, 10 Finns and 10 Danes, are to be distributed at random into 10 rooms, 2 per room. What is the probability that exactly 2i rooms are mixed, i = 0, . . . , 5?

This is an example when careful thinking about what the outcomes should be really pays off. Consider the following model for distributing the Scandinavians into rooms. First arrange them at random into a row of 20 slots S1, S2, . . . , S20. Assume that room 1 takes the people in slots S1, S2, so let us call these two slots R1. Similarly, room 2 takes the people in slots S3, S4, so let us call these two slots R2, etc.
Now, it is clear that we only need to keep track of the distribution of the 10 Ds (Danes) into the 20 slots, corresponding to the positions of the 10 Danes. Any such distribution constitutes an outcome and all outcomes are equally likely. Their number is (20 choose 10).

To get 2i (for example, 4) mixed rooms, start by choosing 2i (ex., 4) out of the 10 rooms which are going to be mixed; there are (10 choose 2i) choices. You also need to decide into which slot in each of the 2i chosen mixed rooms the D goes, for 2^(2i) choices. Once these two choices are made, you still have 10 - 2i (ex., 6) Ds to distribute into 5 - i (ex., 3) rooms, as there are two Ds in each of these rooms. For this, you need to choose the 5 - i (ex., 3) rooms from the remaining 10 - 2i (ex., 6), for (10 - 2i choose 5 - i) choices, and this choice fixes a good outcome.

The final answer, therefore, is

(10 choose 2i) 2^(2i) (10 - 2i choose 5 - i) / (20 choose 10).
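As a check (a sketch, not part of the original text), the six cases i = 0, . . . , 5 must exhaust all outcomes, so their probabilities sum to 1.

```python
from math import comb
from fractions import Fraction

# Example 3.13: P(exactly 2i mixed rooms), i = 0, ..., 5.
def p_mixed(i):
    good = comb(10, 2 * i) * 2 ** (2 * i) * comb(10 - 2 * i, 5 - i)
    return Fraction(good, comb(20, 10))

probs = [p_mixed(i) for i in range(6)]
print(sum(probs))  # 1
print(float(max(probs)))
```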
Problems
1. Roll a single die 10 times. Compute the following probabilities: (a) that you get at least one 6; (b) that you get at least one 6 and at least one 5; (c) that you get three 1s, two 2s, and five 3s.
2. Three married couples take seats around a table at random. Compute P (no wife sits next to
her husband).
3. A group of 20 Scandinavians consists of 7 Swedes, 3 Finns, and 10 Norwegians. A committee of five people is chosen at random from this group. What is the probability that at least one of the three nations is not represented on the committee?
4. Choose each digit of a 5-digit number at random from the digits 1, . . . , 9. Compute the probability that no digit appears more than twice.
5. Roll a fair die 10 times. (a) Compute the probability that at least one number occurs exactly
6 times. (b) Compute the probability that at least one number occurs exactly once.
6. A lottery ticket consists of two rows, each containing 3 numbers from 1, 2, . . . , 50. The drawing consists of choosing 5 different numbers from 1, 2, . . . , 50 at random. A ticket wins if its first row contains at least two of the numbers drawn and its second row contains at least two of the numbers drawn. The four examples below represent the four types of tickets:

Ticket 1:  1 2 3    Ticket 2:  1 2 3    Ticket 3:  1 2 3    Ticket 4:  1 2 3
           4 5 6               1 2 3               2 3 4               3 4 5
For example, if the numbers 1, 3, 5, 6, 17 are drawn, then Ticket 1, Ticket 2, and Ticket 4 all
win, while Ticket 3 loses. Compute the winning probabilities for each of the four tickets.
Solutions to problems
1. (a) 1 − (5/6)^10. (b) 1 − 2 · (5/6)^10 + (2/3)^10. (c) C(10, 3) · C(7, 2)/6^10.
2. The complement is the union of the three events Ai = {couple i sits together}, i = 1, 2, 3. Moreover,

P(A1) = P(A2) = P(A3) = 2/5,
P(A1 ∩ A2) = P(A1 ∩ A3) = P(A2 ∩ A3) = (3! · 2! · 2!)/5! = 1/5,
P(A1 ∩ A2 ∩ A3) = (2! · 2! · 2! · 2!)/5! = 2/15.

For P(A1 ∩ A2), for example, pick a seat for husband 3. In the remaining row of 5 seats, pick the ordering for couple 1, couple 2, and wife 3, then the ordering of seats within each of couple 1 and couple 2. Now, by inclusion-exclusion,

P(A1 ∪ A2 ∪ A3) = 3 · (2/5) − 3 · (1/5) + 2/15 = 11/15,

and our answer is 1 − 11/15 = 4/15.
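The answer 4/15 can be confirmed by brute force: fixing one person's seat leaves 5! = 120 equally likely circular seatings (a verification sketch we add, not part of the notes).

```python
from itertools import permutations

partner = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}  # three couples among people 0..5
good = 0
for rest in permutations(range(1, 6)):           # person 0 fixed in seat 0
    seats = (0,) + rest
    # check every circularly adjacent pair of seats for a couple
    if not any(partner[seats[i]] == seats[(i + 1) % 6] for i in range(6)):
        good += 1
print(good)  # 32 of 120 seatings, i.e., probability 32/120 = 4/15
```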
3. Let A1 = the event that the Swedes are not represented, A2 = the event that the Finns are not represented, and A3 = the event that the Norwegians are not represented. Then

P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3)
= (C(13, 5) + C(17, 5) + C(10, 5) − C(10, 5) − 0 − C(7, 5) + 0)/C(20, 5).
4. The number of bad outcomes is 9 · C(5, 3) · 8^2 + 9 · C(5, 4) · 8 + 9. The first term is the number of numbers in which a digit appears exactly 3 times: choose the digit, choose the 3 positions filled by it, then fill each of the remaining two positions with one of the other 8 digits. The second term is the number of numbers in which a digit appears exactly 4 times, and the last term is the number of numbers in which a digit appears 5 times. The answer then is

1 − (9 · C(5, 3) · 8^2 + 9 · C(5, 4) · 8 + 9)/9^5.
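The count of bad outcomes can be checked by enumerating all 9^5 = 59049 numbers directly (our sketch, not from the notes):

```python
from itertools import product
from math import comb

# a number is "bad" if some digit appears more than twice
bad = sum(1 for digits in product(range(1, 10), repeat=5)
          if any(digits.count(d) > 2 for d in digits))
formula = 9 * comb(5, 3) * 8**2 + 9 * comb(5, 4) * 8 + 9
print(bad, formula)  # both 6129
```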
5. (a) Let Ai be the event that the number i appears exactly 6 times. As the Ai are pairwise disjoint,

P(A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5 ∪ A6) = 6 · C(10, 6) · 5^4/6^10.
(b) Now, let Ai be the event that the number i appears exactly once. By inclusion-exclusion,

P(A1 ∪ A2 ∪ A3 ∪ A4 ∪ A5 ∪ A6)
= 6 · P(A1) − C(6, 2) · P(A1 ∩ A2) + C(6, 3) · P(A1 ∩ A2 ∩ A3) − C(6, 4) · P(A1 ∩ . . . ∩ A4) + C(6, 5) · P(A1 ∩ . . . ∩ A5) − C(6, 6) · P(A1 ∩ . . . ∩ A6)
= 6 · 10 · 5^9/6^10 − C(6, 2) · 10 · 9 · 4^8/6^10 + C(6, 3) · 10 · 9 · 8 · 3^7/6^10 − C(6, 4) · 10 · 9 · 8 · 7 · 2^6/6^10 + C(6, 5) · 10 · 9 · 8 · 7 · 6 · 1^5/6^10 − 0.
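The inclusion-exclusion count in (b) can be verified exactly by summing multinomial coefficients over all face-count vectors (a check we add, not part of the notes):

```python
from itertools import product
from math import comb, factorial

def multinomial(counts):
    # number of 10-roll sequences with the given face counts
    res = factorial(sum(counts))
    for c in counts:
        res //= factorial(c)
    return res

# exact count of outcomes in which some face appears exactly once
direct = sum(multinomial(c) for c in product(range(11), repeat=6)
             if sum(c) == 10 and 1 in c)

# the inclusion-exclusion count: k faces appearing exactly once
ie = sum((-1)**(k + 1) * comb(6, k)
         * (factorial(10) // factorial(10 - k)) * (6 - k)**(10 - k)
         for k in range(1, 7))
print(direct == ie)  # True
```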
6. Below, a hit is shorthand for a chosen number.

P(ticket 1 wins) = P(two hits on each line) + P(two hits on one line, three on the other)
= (3 · 3 · 44 + 2 · 3)/C(50, 5) = 402/C(50, 5),

and

P(ticket 2 wins) = P(two hits among 1, 2, 3) + P(three hits among 1, 2, 3)
= (3 · C(47, 3) + C(47, 2))/C(50, 5) = 49726/C(50, 5),

and

P(ticket 3 wins) = P(2, 3 both hit) + P(1, 4 both hit and exactly one of 2, 3 hit)
= (C(48, 3) + 2 · C(46, 2))/C(50, 5) = 19366/C(50, 5),

and, finally,

P(ticket 4 wins) = P(3 hit, at least one additional hit on each line) + P(1, 2, 4, 5 all hit, 3 not hit)
= (4 · C(45, 2) + 4 · 45 + 1 + 45)/C(50, 5) = 4186/C(50, 5).
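Since every ticket uses only the numbers 1 through 6, the winning draws can be counted exactly by grouping the C(50, 5) draws according to which of 1, . . . , 6 they contain (our sketch, not from the notes):

```python
from itertools import combinations
from math import comb

tickets = {1: ({1, 2, 3}, {4, 5, 6}), 2: ({1, 2, 3}, {1, 2, 3}),
           3: ({1, 2, 3}, {2, 3, 4}), 4: ({1, 2, 3}, {3, 4, 5})}
counts = dict.fromkeys(tickets, 0)

for k in range(6):                          # k small numbers among the 5 drawn
    for small in combinations(range(1, 7), k):
        weight = comb(44, 5 - k)            # ways to choose the remaining numbers
        for t, rows in tickets.items():
            if all(len(row & set(small)) >= 2 for row in rows):
                counts[t] += weight

print(counts)
```

The four counts come out to 402, 49726, 19366, and 4186, matching the numerators above.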
Example 4.1. Assume that you have a bag with 11 cubes, 7 of which have a fuzzy surface and
4 are smooth. Out of the 7 fuzzy ones, 3 are red and 4 are blue; out of 4 smooth ones, 2 are red
and 2 are blue. So, there are 5 red and 6 blue cubes. Other than color and fuzziness, the cubes
have no other distinguishing characteristics.
You plan to pick a cube out of the bag at random, but forget to wear gloves. Before you
start your experiment, the probability that the selected cube is red is 5/11. Now, you reach into
the bag, grab a cube, and notice it is fuzzy (but you do not take it out or note its color in any
other way). Clearly, the probability should now change to 3/7!
Your experiment clearly has 11 outcomes. Consider the events R, B, F , S that the selected
cube is red, blue, fuzzy, and smooth, respectively. We observed that P (R) = 5/11. For the
probability of a red cube, conditioned on it being fuzzy, we do not have notation, so we introduce
it here: P(R|F) = 3/7. Note that this also equals

P(R ∩ F)/P(F) = P(the selected cube is red and fuzzy)/P(the selected cube is fuzzy).
This conveys the idea that, with additional information, the probability must be adjusted. This is common in real life. Say bookies estimate your basketball team's chances of winning a particular game to be 0.6, 24 hours before the game starts. Two hours before the game starts, however, it becomes known that your team's star player is out with a sprained ankle. You cannot expect that the bookies' odds will remain the same and they change, say, to 0.3. Then, the game starts and at half-time your team leads by 25 points. Again, the odds will change, say to 0.8. Finally, when complete information (that is, the outcome of your experiment, the game in this case) is known, all probabilities are trivial, 0 or 1.
For the general definition, take events A and B, and assume that P(B) > 0. The conditional probability of A given B equals

P(A|B) = P(A ∩ B)/P(B).
Example 4.2. Here is a question asked on Wall Street job interviews. (This is the original formulation; the macabre tone is not unusual for such interviews.)

Let's play a game of Russian roulette. You are tied to your chair. Here's a gun, a revolver. Here's the barrel of the gun, six chambers, all empty. Now watch me as I put two bullets into the barrel, into two adjacent chambers. I close the barrel and spin it. I put a gun to your head and pull the trigger. Click. Lucky you! Now I'm going to pull the trigger one more time. Which would you prefer: that I spin the barrel first or that I just pull the trigger?
Assume that the barrel rotates clockwise after the hammer hits and is pulled back. You are
given the choice between an unconditional and a conditional probability of death. The former, if the barrel is spun again, remains 1/3. The latter, if the trigger is pulled without the extra spin, equals the probability that the hammer clicked on an empty slot which is next to a bullet in the counterclockwise direction, and this equals 1/4.
For a fixed condition B, and acting on events A, the conditional probability Q(A) = P(A|B) satisfies the three axioms in Chapter 3. (This is routine to check and the reader who is more theoretically inclined might view it as a good exercise.) Thus, Q is another probability assignment
and all consequences of the axioms are valid for it.
Example 4.3. Toss two fair coins, blindfolded. Somebody tells you that you tossed at least
one Heads. What is the probability that both tosses are Heads?
Here A = {both H}, B = {at least one H}, and

P(A|B) = P(A ∩ B)/P(B) = P(both H)/P(at least one H) = (1/4)/(3/4) = 1/3.
Example 4.4. Toss a coin 10 times. If you know (a) that exactly 7 Heads are tossed, (b) that at least 7 Heads are tossed, what is the probability that your first toss is Heads?

For (a),

P(first toss H | exactly 7 H's) = (C(9, 6) · (1/2)^10)/(C(10, 7) · (1/2)^10) = 7/10.

Why is this not surprising? Conditioned on 7 Heads, they are equally likely to occur on any given 7 tosses. If you choose 7 tosses out of 10 at random, the first toss is included in your choice with probability 7/10.
For (b), the answer is, after canceling the common factor 1/2^10,

(C(9, 6) + C(9, 7) + C(9, 8) + C(9, 9))/(C(10, 7) + C(10, 8) + C(10, 9) + C(10, 10)) = 65/88 ≈ 0.7386.
Clearly, the answer should be a little larger than before, because this condition is more advantageous for Heads.
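Both conditional probabilities can be checked directly on the 2^10 equally likely toss sequences (a sketch of ours, not in the notes):

```python
from itertools import product

outcomes = list(product("HT", repeat=10))  # 1024 equally likely sequences

def cond_prob(event, given):
    # conditional probability by counting within the conditioning event
    relevant = [o for o in outcomes if given(o)]
    return sum(1 for o in relevant if event(o)) / len(relevant)

pa = cond_prob(lambda o: o[0] == "H", lambda o: o.count("H") == 7)
pb = cond_prob(lambda o: o[0] == "H", lambda o: o.count("H") >= 7)
print(pa, pb)  # 0.7 and 65/88 ≈ 0.7386
```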
Conditional probabilities are sometimes given, or can be easily determined, especially in
sequential random experiments. Then, we can use
P(A1 ∩ A2) = P(A1) P(A2|A1),
P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2),
etc.
Example 4.5. An urn contains 10 black and 10 white balls. Draw 3 (a) without replacement,
and (b) with replacement. What is the probability that all three are white?
We already know how to do part (a): the answer is

C(10, 3)/C(20, 3).

To do this problem another way, imagine drawing the balls sequentially. Then, we are computing the probability of the intersection of the three events: P(1st ball is white, 2nd ball is white, and 3rd ball is white). The relevant probabilities are:

1. P(1st ball is white) = 1/2,
2. P(2nd ball is white | 1st ball is white) = 9/19,
3. P(3rd ball is white | 1st and 2nd balls are white) = 8/18.

Our probability is then the product, (1/2) · (9/19) · (8/18).

This approach is particularly easy in case (b), where the previous colors of the selected balls do not affect the probabilities at subsequent stages. The answer, therefore, is (1/2)^3.
Theorem 4.1. First Bayes formula. Assume that F1, . . . , Fn are pairwise disjoint and that F1 ∪ . . . ∪ Fn = Ω, that is, exactly one of them always happens. Then, for an event A,

P(A) = P(F1) P(A|F1) + P(F2) P(A|F2) + . . . + P(Fn) P(A|Fn).

Proof.

P(F1) P(A|F1) + P(F2) P(A|F2) + . . . + P(Fn) P(A|Fn) = P(A ∩ F1) + . . . + P(A ∩ Fn)
= P((A ∩ F1) ∪ . . . ∪ (A ∩ Fn))
= P(A ∩ (F1 ∪ . . . ∪ Fn))
= P(A ∩ Ω) = P(A).
Example 4.6. Toss a fair coin. If you toss Heads, roll 1 die; if you toss Tails, roll 2 dice. Compute the probability that you roll exactly one 6.

Here you condition on the outcome of the coin toss, which could be Heads (event F) or Tails (event F^c). If A = {exactly one 6}, then P(A|F) = 1/6, P(A|F^c) = 2 · (1/6) · (5/6) = 10/36, and P(F) = P(F^c) = 1/2, and so

P(A) = P(F) P(A|F) + P(F^c) P(A|F^c) = 2/9.
Example 4.7. Roll a die, then select at random, without replacement, as many cards from the
deck as the number shown on the die. What is the probability that you get at least one Ace?
Here Fi = {number shown on the die is i}, for i = 1, . . . , 6. Clearly, P(Fi) = 1/6. If A is the event that you get at least one Ace, then P(A|F1) = 1/13 and, in general,

P(A|Fi) = 1 − C(48, i)/C(52, i),

so, by the first Bayes formula, P(A) = (1/6) · Σ_{i=1}^{6} (1 − C(48, i)/C(52, i)).
Theorem 4.2. Second Bayes formula. Let F1, . . . , Fn and A be as in Theorem 4.1. Then

P(Fj|A) = P(Fj ∩ A)/P(A) = P(A|Fj) P(Fj) / (P(A|F1) P(F1) + . . . + P(A|Fn) P(Fn)).
An event Fj is often called a hypothesis, P (Fj ) its prior probability, and P (Fj |A) its posterior
probability.
Example 4.9. We have a fair coin and an unfair coin, which always comes out Heads. Choose
one at random, toss it twice. It comes out Heads both times. What is the probability that the
coin is fair?
The relevant events are F = {fair coin}, U = {unfair coin}, and B = {both tosses H}. Then P(F) = P(U) = 1/2 (as each coin is chosen with equal probability). Moreover, P(B|F) = 1/4 and P(B|U) = 1. Our probability then is

P(F|B) = ((1/2) · (1/4)) / ((1/2) · (1/4) + (1/2) · 1) = 1/5.
Example 4.10. A factory has three machines, M1 , M2 and M3 , that produce items (say,
lightbulbs). It is impossible to tell which machine produced a particular item, but some are
defective. Here are the known numbers:
machine
M1
M2
M3
You pick an item, test it, and find it is defective. What is the probability that it was made by machine M2?
The best way to think about this random experiment is as a two-stage procedure. First you
choose a machine with the probabilities given by the proportion. Then, that machine produces
an item, which you then proceed to test. (Indeed, this is the same as choosing the item from a
large number of them and testing it.)
Let D be the event that an item is defective and let Mi also denote the event that the
item was made by machine i. Then, P (D|M1 ) = 0.001, P (D|M2 ) = 0.002, P (D|M3 ) = 0.003,
P (M1 ) = 0.2, P (M2 ) = 0.3, P (M3 ) = 0.5, and so
P(M2|D) = (0.002 · 0.3) / (0.001 · 0.2 + 0.002 · 0.3 + 0.003 · 0.5) ≈ 0.26.
Example 4.11. Assume 10% of people have a certain disease. A test gives the correct diagnosis
with probability of 0.8; that is, if the person is sick, the test will be positive with probability 0.8,
but if the person is not sick, the test will be positive with probability 0.2. A random person from
the population has tested positive for the disease. What is the probability that he is actually
sick? (No, it is not 0.8!)
Let us dene the three relevant events: S = {sick}, H = {healthy}, T = {tested positive}.
Now, P (H) = 0.9, P (S) = 0.1, P (T |H) = 0.2 and P (T |S) = 0.8. We are interested in
P(S|T) = P(T|S) P(S) / (P(T|S) P(S) + P(T|H) P(H)) = 8/26 ≈ 31%.
Note that the prior probability P(S) is very important! Without a very good idea about what it is, a positive test result is difficult to evaluate: a positive test for HIV would mean something very different for a random person as opposed to somebody who gets tested because of risky behavior.
Example 4.12. O. J. Simpson's first trial, 1995. The famous sports star and media personality O. J. Simpson was on trial in Los Angeles for murder of his wife and her boyfriend. One of the many issues was whether Simpson's history of spousal abuse could be presented by the prosecution at the trial; that is, whether this history was probative, i.e., it had some evidentiary value, or whether it was merely prejudicial and should be excluded. Alan Dershowitz, a famous
professor of law at Harvard and a consultant for the defense, was claiming the latter, citing the
statistics that < 0.1% of men who abuse their wives end up killing them. As J. F. Merz and
J. C. Caulkins pointed out in the journal Chance (Vol. 8, 1995, pg. 14), this was the wrong
probability to look at!
We need to start with the fact that a woman is murdered. These numbered 4,936 in 1992, out of which 1,430 were killed by partners. In other words, if we let
A = {the (murdered) woman was abused by the partner},
M = {the woman was murdered by the partner},
then we estimate the prior probabilities P (M ) = 0.29, P (M c ) = 0.71, and what we are interested
in is the posterior probability P (M |A). It was also commonly estimated at the time that
about 5% of the women had been physically abused by their husbands. Thus, we can say that
P (A|M c ) = 0.05, as there is no reason to assume that a woman murdered by somebody else
was more or less likely to be abused by her partner. The final number we need is P(A|M). Dershowitz states that a considerable number of wife murderers had previously assaulted them, although some did not. So, we will (conservatively) say that P(A|M) = 0.5. (The two-stage experiment then is: choose a murdered woman at random; at the first stage, she is
murdered by her partner, or not, with stated probabilities; at the second stage, she is among
the abused women, or not, with probabilities depending on the outcome of the rst stage.) By
Bayes formula,
P (M |A) =
P (M )P (A|M )
29
=
0.8.
c
c
P (M )P (A|M ) + P (M )P (A|M )
36.1
The law literature studiously avoids quantifying concepts such as probative value and reasonable
doubt. Nevertheless, we can probably say that 80% is considerably too high, compared to the
prior probability of 29%, to use as a sole argument that the evidence is not probative.
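The last two examples share the same two-hypothesis structure. The small helper below (a sketch of ours; the function name is our own) reproduces both posteriors:

```python
def posterior(prior, p_obs_if_true, p_obs_if_false):
    # second Bayes formula with hypotheses H and H^c
    num = prior * p_obs_if_true
    return num / (num + (1 - prior) * p_obs_if_false)

sick = posterior(0.1, 0.8, 0.2)      # Example 4.11: P(S|T)
murder = posterior(0.29, 0.5, 0.05)  # Example 4.12: P(M|A)
print(round(sick, 4), round(murder, 4))  # 0.3077 0.8033
```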
Independence
Events A and B are independent if P (A B) = P (A)P (B) and dependent (or correlated )
otherwise.
Assuming that P (B) > 0, one could rewrite the condition for independence,
P (A|B) = P (A),
so the probability of A is unaected by knowledge that B occurred. Also, if A and B are
independent,
P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A) P(B) = P(A)(1 − P(B)) = P(A) P(B^c),

so A and B^c are also independent: knowing that B has not occurred also has no influence on
the probability of A. Another fact to notice immediately is that disjoint events with nonzero
probability cannot be independent: given that one of them happens, the other cannot happen
and thus its probability drops to zero.
Quite often, independence is an assumption, and it is the most important concept in probability.
Example 4.13. Pick a random card from a full deck. Let A = {card is an Ace} and R =
{card is red}. Are A and R independent?
We have P(A) = 1/13, P(R) = 1/2, and, as there are two red Aces, P(A ∩ R) = 2/52 = 1/26. The two events are independent: the proportion of Aces among red cards is the same as the proportion among all cards.
Now, pick two cards out of the deck sequentially without replacement. Are F = {first card is an Ace} and S = {second card is an Ace} independent?

Now P(F) = P(S) = 1/13 and P(S|F) = 3/51 ≠ 1/13, so F and S are not independent.
Example 4.14. Toss 2 fair coins and let F = {Heads on 1st toss}, S = {Heads on 2nd toss}.
These are independent. You will notice that here the independence is in fact an assumption.
How do we dene independence of more than two events? We say that events A1 , A2 , . . . , An
are independent if
P(Ai1 ∩ . . . ∩ Aik) = P(Ai1) P(Ai2) · · · P(Aik),

where 1 ≤ i1 < i2 < . . . < ik ≤ n are arbitrary indices. The occurrence of any combination of events does not influence the probability of others. Again, it can be shown that, in such a collection of independent events, we can replace any Ai by its complement Ai^c and the events remain independent.
Example 4.15. Roll a four sided fair die, that is, choose one of the numbers 1, 2, 3, 4 at
random. Let A = {1, 2}, B = {1, 3}, C = {1, 4}. Check that these are pairwise independent
(each pair is independent), but not independent.
Indeed, P(A) = P(B) = P(C) = 1/2 and

P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/4,

so the three events are pairwise independent. However,

P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A) P(B) P(C),

so they are not independent.
Example 4.16. You roll a die; if you roll a 6, you win. Otherwise, a fair coin is tossed; if it comes out Heads, you lose. If the game is not decided, it is repeated with a new roll and toss. What is your winning probability?

One way is to sum over the round in which you win:

P(W) = 1/6 + (5/6) · (1/2) · (1/6) + ((5/6) · (1/2))^2 · (1/6) + . . . ,

and then we sum the geometric series. Important note: we have implicitly assumed independence between the coin and the die, as well as between different tosses and rolls. This is very common in problems such as this!
You can avoid the nuisance, however, by the following trick. Let
D = {game is decided on 1st round},
W = {you win}.
The events D and W are independent, which one can certainly check by computation, but, in
fact, there is a very good reason to conclude so immediately. The crucial observation is that,
provided that the game is not decided in the 1st round, you are thereafter facing the same game
with the same winning probability; thus
P (W |Dc ) = P (W ).
In other words, Dc and W are independent and then so are D and W , and therefore
P (W ) = P (W |D).
This means that one can solve this problem by computing the relevant probabilities for the 1st round:

P(W|D) = P(W ∩ D)/P(D) = (1/6) / (1/6 + 5/(6 · 2)) = 2/7,

which is our answer.
Example 4.17. Craps. Many casinos allow you to bet even money on the following game. Two
dice are rolled and the sum S is observed.
- If S ∈ {7, 11}, you win immediately.
- If S ∈ {2, 3, 12}, you lose immediately.
- If S ∈ {4, 5, 6, 8, 9, 10}, the pair of dice is rolled repeatedly until one of the following happens:
  - S repeats, in which case you win.
  - 7 appears, in which case you lose.
What is the winning probability?
Let us look at all possible ways to win:

1. You win on the first roll with probability 8/36.
2. Otherwise,

- you roll a 4 (probability 3/36), then win with probability (3/36)/(3/36 + 6/36) = 3/(3 + 6) = 1/3, as from then on only rolls of 4 and 7 matter;
- you roll a 5 (probability 4/36), then win with probability 4/(4 + 6) = 2/5;
- you roll a 6 (probability 5/36), then win with probability 5/(5 + 6) = 5/11;
- you roll an 8 (probability 5/36), then win with probability 5/(5 + 6) = 5/11;
- you roll a 9 (probability 4/36), then win with probability 4/(4 + 6) = 2/5;
- you roll a 10 (probability 3/36), then win with probability 3/(3 + 6) = 1/3.

Using the first Bayes formula, the winning probability is

8/36 + 2 · ((3/36) · (1/3) + (4/36) · (2/5) + (5/36) · (5/11)) = 244/495 ≈ 0.4929.
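The craps answer can be reproduced with exact fractions; the key observation, as in the text, is that once a point is set only the point and 7 matter on later rolls (our sketch, not from the notes):

```python
from fractions import Fraction

p = {s: Fraction(6 - abs(s - 7), 36) for s in range(2, 13)}  # P(sum = s)

win = p[7] + p[11]                      # immediate win
for point in (4, 5, 6, 8, 9, 10):
    # roll the point, then repeat it before a 7 appears
    win += p[point] * p[point] / (p[point] + p[7])
print(win)  # 244/495
```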
Bernoulli trials
Assume n independent experiments, each of which is a success with probability p and, thus,
a failure with probability 1 − p.
In a sequence of n Bernoulli trials, P(exactly k successes) = C(n, k) p^k (1 − p)^{n−k}.
This is because the successes can occur on any subset S of k trials out of n, i.e., on any S ⊆ {1, . . . , n} with cardinality k. These possibilities are disjoint, as exactly k successes cannot occur on two different such sets. There are C(n, k) such subsets; if we fix such an S, then successes must occur on the k trials in S and failures on all n − k trials not in S; the probability that this happens, by the assumed independence, is p^k (1 − p)^{n−k}.
Example 4.18. A machine produces items which are independently defective with probability p. Let us compute a few probabilities:

1. P(exactly two items among the first 6 are defective) = C(6, 2) p^2 (1 − p)^4.

2. P(at least one item among the first 6 is defective) = 1 − P(no defects) = 1 − (1 − p)^6.

3. P(at least 2 items among the first 6 are defective) = 1 − (1 − p)^6 − 6 p (1 − p)^5.

4. P(exactly 100 items are made before 6 defective are found) equals

P(100th item defective, exactly 5 items among the first 99 defective) = p · C(99, 5) p^5 (1 − p)^94.
Example 4.19. Problem of Points. This involves finding the probability of n successes before m failures in a sequence of Bernoulli trials. Let us call this probability p_{n,m}.
p_{n,m} = P(in the first m + n − 1 trials, the number of successes is at least n)
= Σ_{k=n}^{n+m−1} C(n + m − 1, k) p^k (1 − p)^{n+m−1−k}.
The problem is solved, but it needs to be pointed out that, computationally, this is not the best formula. It is much more efficient to use the recursive formula obtained by conditioning on the outcome of the first trial. Assume m, n ≥ 1. Then,

p_{n,m} = P(first trial is a success) · P(n − 1 successes before m failures)
+ P(first trial is a failure) · P(n successes before m − 1 failures)
= p · p_{n−1,m} + (1 − p) · p_{n,m−1}.
Together with the boundary conditions, valid for m, n ≥ 1,

p_{n,0} = 0, p_{0,m} = 1,

this allows for very speedy and precise computations for large m and n.
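Both the binomial sum and the recursion are short to code, and they agree (a sketch we add, under the assumptions above):

```python
from functools import lru_cache
from math import comb

def p_direct(n, m, p):
    # P(n successes before m failures): binomial sum over the first m+n-1 trials
    t = n + m - 1
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(n, t + 1))

@lru_cache(maxsize=None)
def p_rec(n, m, p=0.5):
    # recursion p_{n,m} = p*p_{n-1,m} + (1-p)*p_{n,m-1} with the boundary conditions
    if n == 0:
        return 1.0
    if m == 0:
        return 0.0
    return p * p_rec(n - 1, m, p) + (1 - p) * p_rec(n, m - 1, p)

print(p_direct(4, 3, 0.5), p_rec(4, 3))  # both 0.34375
```

For p = 1/2, n = 4, m = 3 both give 22/64 = 0.34375, the number that appears in the next example.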
Example 4.20. Best of 7. Assume that two equally matched teams, A and B, play a series of games and that the first team that wins four games is the overall winner of the series. As it happens, team A lost the first game. What is the probability it will win the series? Assume that the games are Bernoulli trials with success probability 1/2.
We have

P(A wins the series) = P(4 successes before 3 failures) = Σ_{k=4}^{6} C(6, k) (1/2)^6 = (15 + 6 + 1)/2^6 ≈ 0.3438.
Here, we assume that the games continue even after the winner of the series is decided, which we can do without affecting the probability.
Example 4.21. Banach Matchbox Problem. A mathematician carries two matchboxes, each originally containing n matches. Each time he needs a match, he is equally likely to take it from either box. What is the probability that, upon reaching for a box and finding it empty, there are exactly k matches still in the other box? Here, 0 ≤ k ≤ n.
Let A1 be the event that matchbox 1 is the one discovered empty and that, at that instant,
matchbox 2 contains k matches. Before this point, he has used n + n − k matches, n from matchbox 1 and n − k from matchbox 2. This means that he has reached for box 1 exactly n times in (n + n − k) trials and for the last time at the (n + 1 + n − k)th trial. Therefore, our probability is

2 P(A1) = 2 · (1/2) · C(2n − k, n) · (1/2^{2n−k}) = C(2n − k, n)/2^{2n−k}.
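A quick check of this answer (a sketch of ours, not in the notes): for each n, the probabilities over k = 0, . . . , n must add up to 1.

```python
from fractions import Fraction
from math import comb

def matchbox_pmf(n):
    # P(k matches left in the other box) = C(2n-k, n) / 2^(2n-k)
    return [Fraction(comb(2 * n - k, n), 2**(2 * n - k)) for k in range(n + 1)]

for n in (1, 5, 25):
    assert sum(matchbox_pmf(n)) == 1
print(matchbox_pmf(2))  # [Fraction(3, 8), Fraction(3, 8), Fraction(1, 4)]
```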
Example 4.22. Each day, you independently decide, with probability p, to flip a fair coin. Otherwise, you do nothing. (a) What is the probability of getting exactly 10 Heads in the first 20 days? (b) What is the probability of getting 10 Heads before 5 Tails?
For (a), the probability of getting Heads is p/2 independently each day, so the answer is
C(20, 10) (p/2)^10 (1 − p/2)^10.

For (b), you can disregard the days on which you do not flip to get

Σ_{k=10}^{14} C(14, k) (1/2)^14.
Example 4.23. You roll a die and your score is the number shown on the die. Your friend rolls five dice and his score is the number of 6's shown. Compute (a) the probability of the event A that the two scores are equal and (b) the probability of the event B that your friend's score is strictly larger than yours.
In both cases we will condition on your friend's score; this works a little better in case (b) than conditioning on your score. Let Fi, i = 0, . . . , 5, be the event that your friend's score is i. Then, P(A|Fi) = 1/6 if i ≥ 1 and P(A|F0) = 0. Then, by the first Bayes formula,

P(A) = Σ_{i=1}^{5} P(Fi) · (1/6) = (1/6) (1 − P(F0)) = (1/6) (1 − 5^5/6^5) ≈ 0.0997.

Moreover, P(B|Fi) = (i − 1)/6 for i ≥ 1, and so

P(B) = Σ_{i=1}^{5} P(Fi) · (i − 1)/6
= (1/6) Σ_{i=1}^{5} i · P(Fi) − (1/6) Σ_{i=1}^{5} P(Fi)
= (1/6) Σ_{i=1}^{5} i · C(5, i) (1/6)^i (5/6)^{5−i} − (1/6) (1 − 5^5/6^5)
= (1/6) · (5/6) − (1/6) (1 − 5^5/6^5) ≈ 0.0392.

The last equality can be obtained by computation, but we will soon learn why the sum has to equal 5/6.
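Both answers can be confirmed by enumerating all 6^6 = 46656 equally likely combinations of your roll and your friend's five rolls (our sketch, not from the notes):

```python
from itertools import product

equal = larger = 0
for rolls in product(range(1, 7), repeat=6):
    mine, friend = rolls[0], sum(d == 6 for d in rolls[1:])
    equal += friend == mine
    larger += friend > mine

total = 6**6
print(round(equal / total, 4), round(larger / total, 4))  # 0.0997 0.0392
```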
Problems
1. Consider the following game. Pick one card at random from a full deck of 52 cards. If you
pull an Ace, you win outright. If not, then you look at the value of the card (K, Q, and J count
as 10). If the number is 7 or less, you lose outright. Otherwise, you select (at random, without
replacement) that number of additional cards from the deck. (For example, if you picked a 9
the first time, you select 9 more cards.) If you get at least one Ace, you win. What are your chances of winning this game?
2. An item is defective (independently of other items) with probability 0.3. You have a method of testing whether the item is defective, but it does not always give you the correct answer. If the tested item is defective, the method detects the defect with probability 0.9 (and says it is good with probability 0.1). If the tested item is good, then the method says it is defective with probability 0.2 (and gives the right answer with probability 0.8).
A box contains 3 items. You have tested all of them and the tests detect no defects. What
is the probability that none of the 3 items is defective?
3. A chocolate egg either contains a toy or is empty. Assume that each egg contains a toy with
probability p, independently of other eggs. You have 5 eggs; open the first one and see if it has a toy inside, then do the same for the second one, etc. Let E1 be the event that you get at least
4 toys and let E2 be the event that you get at least two toys in succession. Compute P (E1 ) and
P (E2 ). Are E1 and E2 independent?
4. You have 16 balls, 3 blue, 4 green, and 9 red. You also have 3 urns. For each of the 16 balls,
you select an urn at random and put the ball into it. (Urns are large enough to accommodate any
number of balls.) (a) What is the probability that no urn is empty? (b) What is the probability
that each urn contains 3 red balls? (c) What is the probability that each urn contains all three
colors?
5. Suppose 5 coins are tossed. Then, the coins that come out Heads are left alone, while each
coin that comes out Tails is tossed again once. Call this experiment acceptable if at the end
all 5 coins show Heads. Repeat the experiment until it is acceptable. Compute the probability
that, at the final repetition of the experiment, only 5 tosses were performed (i.e., the first five tosses came out Heads).
6. Assume that you have an n-element set U and that you select r independent random subsets A1, . . . , Ar ⊆ U. All Ai are chosen so that all 2^n choices are equally likely. Compute (in a simple closed form) the probability that the Ai are pairwise disjoint.
Solutions to problems
1. Let W be the event that you win and let Fi be the event that the value of your first card is i. You win outright if the first card is an Ace; otherwise, you win if at least one Ace is among the additional cards you select, so

P(W|F8) = 1 − C(47, 8)/C(51, 8),
P(W|F9) = 1 − C(47, 9)/C(51, 9),
P(W|F10) = 1 − C(47, 10)/C(51, 10),

and so,

P(W) = 4/52 + (4/52) · (1 − C(47, 8)/C(51, 8)) + (4/52) · (1 − C(47, 9)/C(51, 9)) + (16/52) · (1 − C(47, 10)/C(51, 10)).

(The weight 16/52 appears because 10, J, Q, and K all count as 10.)
2. Let F = {none is defective} and A = {test indicates that none is defective}. By independence and the first Bayes formula,

P(F|A) = P(A ∩ F)/P(A) = (0.7 · 0.8)^3 / (0.7 · 0.8 + 0.3 · 0.1)^3 = (56/59)^3.
4. (a) Let Ai be the event that the i-th urn is empty. By inclusion-exclusion,

P(A1 ∪ A2 ∪ A3) = 3 · (2/3)^16 − 3 · (1/3)^16 = (2^16 − 1)/3^15,

and

P(no urns are empty) = 1 − P(A1 ∪ A2 ∪ A3) = 1 − (2^16 − 1)/3^15.
(b) We can ignore the other balls, since only the red balls matter here. Hence, the result is

(9!/(3! 3! 3!)) / 3^9 = 9!/(8 · 3^12).
(c) As

P(at least one urn lacks blue) = 3 · (2/3)^3 − 3 · (1/3)^3,
P(at least one urn lacks green) = 3 · (2/3)^4 − 3 · (1/3)^4,
P(at least one urn lacks red) = 3 · (2/3)^9 − 3 · (1/3)^9,

we have, by independence of the three colors,

P(each urn contains all three colors) = (1 − (3 · (2/3)^3 − 3 · (1/3)^3)) · (1 − (3 · (2/3)^4 − 3 · (1/3)^4)) · (1 − (3 · (2/3)^9 − 3 · (1/3)^9)).
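Each bracket in this product can be checked by exact enumeration, treating one color at a time (a sketch we add, not part of the notes):

```python
from itertools import product

def p_no_urn_empty(n_balls):
    # exact: throw n_balls into 3 urns uniformly at random, count outcomes hitting all urns
    hits = sum(1 for a in product(range(3), repeat=n_balls) if set(a) == {0, 1, 2})
    return hits / 3**n_balls

formula = lambda n: 1 - (3 * (2/3)**n - 3 * (1/3)**n)
assert all(abs(p_no_urn_empty(n) - formula(n)) < 1e-12 for n in (3, 4, 9))
# the colors are independent, so the answer to (c) is the product
print(p_no_urn_empty(3) * p_no_urn_empty(4) * p_no_urn_empty(9))
```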
5. The answer is the conditional probability

P(Heads on first 5 tosses | experiment acceptable) = P(Heads on first 5 tosses)/P(experiment acceptable).

The numerator equals 1/2^5. The denominator equals the probability that each of the five coins, when tossed twice, results in at least one Heads. Therefore, the denominator is (3/4)^5, and thus the answer is (2/3)^5 ≈ 0.1317.
6. This is the same as choosing an r × n matrix in which every entry is independently 0 or 1 with probability 1/2 and ending up with at most one 1 in every column. Since the columns are independent, this gives ((1 + r) 2^{−r})^n.
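For small n and r, the closed form can be verified by enumerating all (2^n)^r choices of subsets, encoded as bitmasks (our sketch, not from the notes):

```python
from itertools import combinations, product

def p_disjoint(n, r):
    total = ok = 0
    for subsets in product(range(2**n), repeat=r):  # each Ai as an n-bit mask
        total += 1
        ok += all(x & y == 0 for x, y in combinations(subsets, 2))
    return ok / total

for n, r in ((3, 2), (2, 3), (4, 2)):
    assert abs(p_disjoint(n, r) - ((1 + r) / 2**r)**n) < 1e-12
print(p_disjoint(3, 2))  # 27/64 = 0.421875
```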
(b) Each of the numbers 1, 2, 3 appears exactly twice, while the number 4 appears four
times.
Solution:

C(10, 2) · C(8, 2) · C(6, 2)/6^10 = 10!/(2^3 · 4! · 6^10).
Then,

P(1, 2, and 3 each appear at least once)
= P((A1 ∪ A2 ∪ A3)^c)
= 1 − P(A1) − P(A2) − P(A3) + P(A1 ∩ A2) + P(A2 ∩ A3) + P(A1 ∩ A3) − P(A1 ∩ A2 ∩ A3)
= 1 − 3 · (5/6)^10 + 3 · (4/6)^10 − (3/6)^10.
2^5 · 4!/9!.
(b) Compute the probability that at most one wife does not sit next to her husband.
Solution:

Let A be the event that all wives sit next to their husbands and let B be the event that exactly one wife does not sit next to her husband. We know that P(A) = 2^5 · 4!/9! from part (a). Moreover, B = B1 ∪ B2 ∪ B3 ∪ B4 ∪ B5, where Bi is the event that wife i does not sit next to husband i while the remaining four couples sit together. The Bi are disjoint and their probabilities are all the same, so we need to determine P(B1). Treat each of the four intact couples as a block: arranging the resulting 6 units (4 blocks, wife 1, and husband 1) around the table and orienting each couple gives 5! · 2^4 arrangements, of which 2 · 4! · 2^4 have wife 1 and husband 1 adjacent. Therefore,

P(B1) = (5! − 2 · 4!) · 2^4/9! = 3 · 4! · 2^4/9!.

Our answer, then, is

P(A) + 5 · P(B1) = 2^5 · 4!/9! + 5 · 3 · 4! · 2^4/9!.
3. Consider the following game. The player rolls a fair die. If he rolls 3 or less, he loses
immediately. Otherwise he selects, at random, as many cards from a full deck as the
number that came up on the die. The player wins if all four Aces are among the selected
cards.
(a) Compute the winning probability for this game.
Solution:
Let W be the event that the player wins. Let Fi be the event that he rolls i, where i = 1, . . . , 6; P(Fi) = 1/6.

Since we lose if we roll a 1, 2, or 3, P(W|F1) = P(W|F2) = P(W|F3) = 0. Moreover,

P(W|F4) = 1/C(52, 4),
P(W|F5) = C(5, 4)/C(52, 4),
P(W|F6) = C(6, 4)/C(52, 4).

Therefore,

P(W) = (1/6) · (1/C(52, 4)) · (1 + C(5, 4) + C(6, 4)).
(b) Smith tells you that he recently played this game once and won. What is the probability that he rolled a 6 on the die?

Solution:

P(F6|W) = P(W|F6) P(F6)/P(W) = C(6, 4)/(1 + C(5, 4) + C(6, 4)) = 15/21 = 5/7.
4. A chocolate egg either contains a toy or is empty. Assume that each egg contains a toy with probability p ∈ (0, 1), independently of other eggs. Each toy is, with equal probability, red, white, or blue (again, independently of other toys). You buy 5 eggs. Let E1 be the event that you get at most 2 toys and let E2 be the event that you get at least one red and at least one white and at least one blue toy (so that you have a complete collection).
(a) Compute P (E1 ). Why is this probability very easy to compute when p = 1/2?
Solution:
A random variable is a number whose value depends upon the outcome of a random experiment. Mathematically, a random variable X is a real-valued function on Ω, the space of outcomes:

X : Ω → R.

Sometimes, when convenient, we also allow X to take the value +∞ or, more rarely, −∞, but this will not occur in this chapter. The crucial theoretical property that X should have is that, for each interval B, the set of outcomes for which X ∈ B is an event, so we are able to talk about its probability, P(X ∈ B). Random variables are traditionally denoted by capital letters to distinguish them from deterministic quantities.
Example 5.1. Here are some examples of random variables.
1. Toss a coin 10 times and let X be the number of Heads.
2. Choose a random point in the unit square {(x, y) : 0 x, y 1} and let X be its distance
from the origin.
3. Choose a random person in a class and let X be the height of the person, in inches.
4. Let X be value of the NASDAQ stock index at the closing of the next business day.
A discrete random variable X has finitely or countably many values xi, i = 1, 2, . . ., and p(xi) = P(X = xi), i = 1, 2, . . ., is called the probability mass function of X. Sometimes X is added as a subscript of its p. m. f.: p = pX.
A probability mass function p has the following properties:
1. For all i, p(xi ) > 0 (we do not list values of X which occur with probability 0).
2. Σ_i p(xi) = 1.
Example. Choose 5 numbers at random, without replacement, from {1, . . . , 20}, and let X be the largest number chosen. Determine the p. m. f. of X and the probability that at least one of the chosen numbers is 15 or more.
The possible values are 5, . . . , 20. To determine the p. m. f., note that we have \binom{20}{5} outcomes and, then,
P(X = i) = \binom{i−1}{4} / \binom{20}{5}.
Finally,
P(at least one number 15 or more) = P(X ≥ 15) = Σ_{i=15}^{20} P(X = i) = 1 − \binom{14}{5} / \binom{20}{5}.
Example 5.4. An urn contains 11 balls: 3 white, 3 red, and 5 blue. Take out 3 balls at
random, without replacement. You win $1 for each red ball you select and lose $1 for each
white ball you select. Determine the p. m. f. of X, the amount you win.
The number of outcomes is \binom{11}{3} = 165. X can have values −3, −2, −1, 0, 1, 2, and 3. Let us start
with 0. This can occur with one ball of each color or with 3 blue balls:
P(X = 0) = (3 · 3 · 5 + \binom{5}{3}) / \binom{11}{3} = 55/165.
To get X = 1, we can have 2 red and 1 white, or 1 red and 2 blue:
P(X = 1) = P(X = −1) = (\binom{3}{2}\binom{3}{1} + \binom{3}{1}\binom{5}{2}) / \binom{11}{3} = 39/165.
The probability that X = 1 is the same because of symmetry between the roles that the red
and the white balls play. Next, to get X = 2 we must have 2 red balls and 1 blue:
P(X = 2) = P(X = −2) = \binom{3}{2} · 5 / \binom{11}{3} = 15/165.
Finally, a single outcome (3 red balls) produces X = 3:
P(X = 3) = P(X = −3) = 1 / \binom{11}{3} = 1/165.
All the seven probabilities should add to 1, which can be used either to check the computations
or to compute the seventh probability (say, P (X = 0)) from the other six.
Assume that X is a discrete random variable with possible values xi, i = 1, 2, . . .. Then, the
expected value, also called expectation, average, or mean, of X is
EX = Σ_i xi P(X = xi) = Σ_i xi p(xi).
For any function g, the expectation of g(X) is Eg(X) = Σ_i g(xi) P(X = xi).
In computations, bear in mind that variance cannot be negative! Furthermore, the only way
that a random variable has 0 variance is when it is equal to its expectation μ = EX with probability
1 (so it is not really random at all): P(X = μ) = 1. Here is the summary:
The variance of a random variable X is Var(X) = E(X − EX)² = E(X²) − (EX)².
Var(X) ≈ 0.7810.
5.1 Discrete uniform random variable
This is a random variable with values x1 , . . . , xn , each with equal probability 1/n. Such a random
variable is simply the random choice of one among n numbers.
Properties:
1. EX = (x1 + . . . + xn)/n.
2. Var(X) = (x1² + . . . + xn²)/n − ((x1 + . . . + xn)/n)².
Example 5.7. Let X be the number shown on a rolled fair die. Compute EX, E(X 2 ), and
Var(X).
This is a standard example of a discrete uniform random variable and
EX = (1 + 2 + . . . + 6)/6 = 7/2,
E(X²) = (1² + 2² + . . . + 6²)/6 = 91/6,
Var(X) = 91/6 − (7/2)² = 35/12.
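As a quick check, the die computation above can be reproduced numerically; a minimal sketch in Python (the helper name is ours, not from the notes):

```python
from fractions import Fraction

def discrete_uniform_stats(values):
    """Mean and variance of a uniform random choice among the given values."""
    values = list(values)
    n = len(values)
    mean = sum(Fraction(v) for v in values) / n
    second_moment = sum(Fraction(v) ** 2 for v in values) / n
    return mean, second_moment - mean ** 2

ex, var = discrete_uniform_stats(range(1, 7))  # fair die
print(ex, var)  # 7/2 and 35/12
```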
5.2 Bernoulli random variable
This is also called an indicator random variable. Assume that A is an event with probability p.
Then, IA , the indicator of A, is given by
IA = 1 if A happens, and IA = 0 otherwise.
Other notations for IA include 1A and χA. Although simple, such random variables are very
important as building blocks for more complicated random variables.
Properties:
1. EIA = p.
2. Var(IA) = p(1 − p).
For the variance, note that IA² = IA, so that E(IA²) = E(IA) = p.
5.3 Binomial random variable
A Binomial(n,p) random variable counts the number of successes in n independent trials, each
of which is a success with probability p.
Properties:
1. Probability mass function: P(X = i) = \binom{n}{i} p^i (1 − p)^(n−i), for i = 0, . . . , n.
2. EX = np.
3. Var(X) = np(1 − p).
The expectation and variance formulas will be proved in Chapter 8. For now, take them on
faith.
Example 5.8. Let X be the number of Heads in 50 tosses of a fair coin. Determine EX,
Var(X) and P(X ≤ 10). As X is Binomial(50, 1/2), EX = 25, Var(X) = 12.5, and
P(X ≤ 10) = Σ_{i=0}^{10} \binom{50}{i} (1/2)^50.
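The tail sum above is easy to evaluate exactly by summing the p. m. f.; a short sketch (the function name is ours):

```python
from math import comb

def binomial_cdf(n, p, k):
    """P(X <= k) for X Binomial(n, p), summing the p.m.f. directly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(binomial_cdf(50, 0.5, 10))  # about 1.19e-5: ten or fewer Heads is very unlikely
```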
Example 5.9. Denote by d the dominant gene and by r the recessive gene at a single locus.
Then dd is called the pure dominant genotype, dr is called the hybrid, and rr the pure recessive
genotype. The two genotypes with at least one dominant gene, dd and dr, result in the phenotype
of the dominant gene, while rr results in a recessive phenotype.
Assuming that both parents are hybrid and have n children, what is the probability that at
least two will have the recessive phenotype? Each child, independently, gets one of the genes at
random from each parent.
For each child, independently, the probability of the rr genotype is 1/4. If X is the number of
rr children, then X is Binomial(n, 1/4). Therefore,
P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − (3/4)^n − n · (1/4) · (3/4)^(n−1).
5.4 Poisson random variable
A random variable is Poisson(λ), with parameter λ > 0, if it has the probability mass function
given below.
Properties:
1. P(X = i) = (λ^i / i!) e^(−λ), for i = 0, 1, 2, . . ..
2. EX = λ.
3. Var(X) = λ.
Here is how we compute the expectation:
EX = Σ_{i=1}^∞ i · (λ^i/i!) e^(−λ) = λ e^(−λ) Σ_{i=1}^∞ λ^(i−1)/(i − 1)! = λ e^(−λ) e^λ = λ,
and the variance is computed by a similar manipulation.
Theorem. Poisson approximation to Binomial. Fix λ > 0 and an integer i ≥ 0, and let X be Binomial(n, λ/n). Then, as n → ∞,
P(X = i) → (λ^i/i!) e^(−λ).
Indeed,
P(X = i) = \binom{n}{i} (λ/n)^i (1 − λ/n)^(n−i)
= (n(n − 1) · · · (n − i + 1)/i!) · (λ^i/n^i) · (1 − λ/n)^n · (1 − λ/n)^(−i)
= (λ^i/i!) · (n(n − 1) · · · (n − i + 1)/n^i) · (1 − λ/n)^n · (1 − λ/n)^(−i)
→ (λ^i/i!) · 1 · e^(−λ) · 1,
as n → ∞.
The Poisson approximation is quite good: one can prove that the error made by computing
a probability using the Poisson approximation instead of its exact Binomial expression (in the
context of the above theorem) is no more than
min(1, λ) · p.
Example 5.10. Suppose that the probability that a person is killed by lightning in a year is,
independently, 1/(500 million). Assume that the US population is 300 million.
1. Compute P(3 or more people will be killed by lightning next year) exactly.
If X is the number of people killed by lightning, then X is Binomial(n, p) with n = 300 million and p = 1/(500 million), and the answer is
1 − (1 − p)^n − np(1 − p)^(n−1) − \binom{n}{2} p²(1 − p)^(n−2) ≈ 0.02311530.
2. Approximate the above probability.
As λ = np = 3/5, X is approximately Poisson(3/5), and the answer is
1 − e^(−λ) − λ e^(−λ) − (λ²/2) e^(−λ) ≈ 0.02311529.
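The closeness of the exact Binomial answer and its Poisson approximation in parts 1 and 2 can be verified directly; a quick sketch:

```python
from math import comb, exp

n, p = 300_000_000, 1 / 500_000_000
lam = n * p  # 0.6

# Exact Binomial computation of P(X >= 3)
exact = 1 - (1 - p)**n - n*p*(1 - p)**(n - 1) - comb(n, 2) * p**2 * (1 - p)**(n - 2)

# Poisson(0.6) approximation of the same probability
approx = 1 - exp(-lam) * (1 + lam + lam**2 / 2)

print(exact, approx)  # both about 0.023115
```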
3. Approximate P(two or more people are killed by lightning within the first 6 months of
next year).
This highlights the interpretation of λ as a rate. If lightning deaths occur at the rate of 3/5
a year, they should occur at half that rate in 6 months. Indeed, assuming that lightning
deaths occur as a result of independent factors in disjoint time intervals, we can imagine
that they operate on different people in disjoint time intervals. Thus, doubling the time
interval is the same as doubling the number n of people (while keeping p the same), and
then np also doubles. Consequently, halving the time interval keeps the same p, but has half as
many trials, so np changes to 3/10 and so λ = 3/10 as well. The answer is
1 − e^(−λ) − λ e^(−λ) ≈ 0.0369.
4. Approximate P(in exactly 3 of the next 10 years exactly 3 people are killed by lightning
within the first 6 months).
In every year, the probability of exactly 3 such deaths is approximately (λ³/3!) e^(−λ), where, again,
λ = 3/10. Assuming year-to-year independence, the number of years with exactly 3 people
killed is approximately Binomial(10, (λ³/3!) e^(−λ)). The answer, then, is
\binom{10}{3} ((λ³/3!) e^(−λ))³ (1 − (λ³/3!) e^(−λ))⁷ ≈ 4.34 · 10⁻⁶.
5. Compute the expected number of years, among the next 10, in which 2 or more people are
killed by lightning within the first 6 months.
By the same logic as above and the formula for Binomial expectation, the answer is
10 (1 − e^(−λ) − λ e^(−λ)) ≈ 0.3694,
again with λ = 3/10.
Example 5.11. Poisson distribution and law. Assume a crime has been committed. It is
known that the perpetrator has certain characteristics, which occur with a small frequency p
(say, 10⁻⁸) in a population of size n (say, 10⁸). A person who matches these characteristics
has been found at random (e.g., at a routine traffic stop or by airport security) and, since p is
so small, charged with the crime. There is no other evidence. We will also assume that the
authorities stop looking for another suspect once the arrest has been made. What should the
defense be?
Let us start with a mathematical model of this situation. Assume that N is the number of
people with given characteristics. This is a Binomial random variable but, given the assumptions,
we can easily assume that it is Poisson with λ = np. Choose a person from among these N, label
that person by C, the criminal. Then, choose at random another person, A, who is arrested.
The question is whether C = A, that is, whether the arrested person is guilty. Mathematically,
we can formulate the problem as follows:
P(C = A | N ≥ 1) = P(C = A, N ≥ 1) / P(N ≥ 1).
We need to condition as the experiment cannot even be performed when N = 0. Now, by the
first Bayes' formula,
P(C = A, N ≥ 1) = Σ_{k=1}^∞ P(C = A, N ≥ 1 | N = k) P(N = k) = Σ_{k=1}^∞ P(C = A | N = k) P(N = k)
and
P(C = A | N = k) = 1/k,
so
P(C = A, N ≥ 1) = Σ_{k=1}^∞ (1/k) (λ^k/k!) e^(−λ).
The answer, therefore, is
P(C = A | N ≥ 1) = (e^(−λ) / (1 − e^(−λ))) Σ_{k=1}^∞ λ^k/(k · k!).
There is no closed-form expression for the sum, but it can be easily computed numerically. The
defense may claim that the probability of innocence, 1 − (the above probability), is about 0.2330
when λ = 1, presumably enough for a reasonable doubt.
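There is indeed no closed form, but the series converges very fast, so the numerical evaluation is painless; a sketch (for λ = 1):

```python
from math import exp, factorial

def prob_guilty(lam, terms=60):
    """P(C = A | N >= 1) = e^(-lam)/(1 - e^(-lam)) * sum_{k>=1} lam^k/(k * k!)."""
    s = sum(lam**k / (k * factorial(k)) for k in range(1, terms))
    return exp(-lam) / (1 - exp(-lam)) * s

print(1 - prob_guilty(1.0))  # probability of innocence, about 0.2330
```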
This model was in fact tested in court, in the famous People v. Collins case, a 1968 jury
trial in Los Angeles. In this instance, it was claimed by the prosecution (on flimsy grounds)
that p = 1/12,000,000 and n would have been the number of adult couples in the LA area, say
n = 5,000,000. The jury convicted the couple charged for robbery on the basis of the prosecutor's claim that, due to low p, the chances of there being another couple [with the specified
characteristics, in the LA area] must be one in a billion. The Supreme Court of California
reversed the conviction and gave two reasons. The first reason was insufficient foundation for
the estimate of p. The second reason was that the probability that another couple with matching
characteristics existed was, in fact,
P(N ≥ 2 | N ≥ 1) = (1 − e^(−λ) − λ e^(−λ)) / (1 − e^(−λ)),
much larger than the prosecutor claimed; namely, for λ = 5/12 it is about 0.1939. This is about
twice the (more relevant) probability of innocence, which, for this λ, is about 0.1015.
5.5 Geometric random variable
A Geometric(p) random variable X counts the number of trials required for the first success in
independent trials with success probability p.
Properties:
1. Probability mass function: P(X = n) = p(1 − p)^(n−1), where n = 1, 2, . . ..
2. EX = 1/p.
3. Var(X) = (1 − p)/p².
4. P(X > n) = Σ_{k=n+1}^∞ p(1 − p)^(k−1) = (1 − p)^n.
5. Memoryless property: P(X > n + k | X > k) = (1 − p)^(n+k)/(1 − p)^k = (1 − p)^n = P(X > n).
We omit the proofs of the second and third formulas, which reduce to manipulations with
geometric series.
Example 5.12. Let X be the number of tosses of a fair coin required for the first Heads. What
are EX and Var(X)?
As X is Geometric(1/2), EX = 2 and Var(X) = 2.
Example 5.13. You roll a die, your opponent tosses a coin. If you roll 6 you win; if you do
not roll 6 and your opponent tosses Heads you lose; otherwise, this round ends and the game
repeats. On the average, how many rounds does the game last?
P(game decided on round 1) = 1/6 + (5/6)(1/2) = 7/12,
and so the number of rounds N is Geometric(7/12), and
EN = 12/7.
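A quick simulation confirms EN = 12/7 ≈ 1.714; a sketch (helper name ours):

```python
import random

def rounds_until_decided(rng):
    """Play the die-vs-coin game until someone wins; return the number of rounds."""
    n = 0
    while True:
        n += 1
        if rng.randint(1, 6) == 6:   # you roll a 6: you win
            return n
        if rng.random() < 0.5:       # otherwise, opponent tosses Heads: you lose
            return n

rng = random.Random(0)
trials = 100_000
avg = sum(rounds_until_decided(rng) for _ in range(trials)) / trials
print(avg)  # close to 12/7 = 1.714...
```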
54
Problems
1. Roll a fair die repeatedly. Let X be the number of 6's in the first 10 rolls and let Y be the
number of rolls needed to obtain a 3. (a) Write down the probability mass function of X. (b)
Write down the probability mass function of Y. (c) Find an expression for P(X ≥ 6). (d) Find
an expression for P(Y > 10).
2. A biologist needs at least 3 mature specimens of a certain plant. The plant needs a year
to reach maturity; once a seed is planted, any plant will survive for the year with probability
1/1000 (independently of other plants). The biologist plants 3000 seeds. A year is deemed a
success if three or more plants from these seeds reach maturity.
(a) Write down the exact expression for the probability that the biologist will indeed end up
with at least 3 mature plants.
(b) Write down a relevant approximate expression for the probability from (a). Justify briey
the approximation.
(c) The biologist plans to do this year after year. What is the approximate probability that he
has at least 2 successes in 10 years?
(d) Devise a method to determine the number of seeds the biologist should plant in order to get
at least 3 mature plants in a year with probability at least 0.999. (Your method will probably
require a lengthy calculation; do not try to carry it out with pen and paper.)
3. You are dealt one card at random from a full deck and your opponent is dealt 2 cards
(without any replacement). If you get an Ace, he pays you $10, if you get a King, he pays you
$5 (regardless of his cards). If you have neither an Ace nor a King, but your card is red and
your opponent has no red cards, he pays you $1. In all other cases you pay him $1. Determine
your expected earnings. Are they positive?
4. You and your opponent both roll a fair die. If you both roll the same number, the game
is repeated, otherwise whoever rolls the larger number wins. Let N be the number of times
the two dice have to be rolled before the game is decided. (a) Determine the probability mass
function of N . (b) Compute EN . (c) Compute P (you win). (d) Assume that you get paid
$10 for winning in the rst round, $1 for winning in any other round, and nothing otherwise.
Compute your expected winnings.
5. Each of the 50 students in class belongs to exactly one of the four groups A, B, C, or D. The
membership numbers for the four groups are as follows: A: 5, B: 10, C: 15, D: 20. First, choose
one of the 50 students at random and let X be the size of that student's group. Next, choose
one of the four groups at random and let Y be its size. (Recall: all random choices are with
equal probability, unless otherwise specified.) (a) Write down the probability mass functions for
X and Y . (b) Compute EX and EY . (c) Compute Var(X) and Var(Y ). (d) Assume you have
s students divided into n groups with membership numbers s1 , . . . , sn , and again X is the size
of the group of a randomly chosen student, while Y is the size of the randomly chosen group.
Let EY = μ and Var(Y) = σ². Express EX with s, n, μ, and σ.
6. Refer to Example 4.7 for the description of the Craps game. In many casinos, one can make side
bets on the player's performance in a particular instance of this game. To describe an example,
say Alice is the player and Bob makes the so-called Don't pass side bet. Then Bob wins $1 if
Alice loses. If Alice wins, Bob loses $1 (i.e., wins −$1), with one exception: if Alice rolls 12 on
the first roll, then Bob wins or loses nothing. Let X be the winning dollar amount on Bob's
Don't pass bet. Find the probability mass function of X, and its expectation and variance.
Solutions
1. (a) X is Binomial(10, 1/6):
P(X = i) = \binom{10}{i} (1/6)^i (5/6)^(10−i),
where i = 0, 1, 2, . . . , 10.
(b) Y is Geometric(1/6):
P(Y = i) = (1/6) (5/6)^(i−1),
where i = 1, 2, . . ..
(c)
P(X ≥ 6) = Σ_{i=6}^{10} \binom{10}{i} (1/6)^i (5/6)^(10−i).
(d)
P(Y > 10) = (5/6)^10.
2. (a) The random variable X, the number of mature plants, is Binomial(3000, 1/1000).
P(X ≥ 3) = 1 − P(X ≤ 2)
= 1 − (0.999)^3000 − 3000 · (0.001) · (0.999)^2999 − \binom{3000}{2} (0.999)^2998 (0.001)².
(b) By the Poisson approximation with λ = 3000 · (1/1000) = 3,
P(X ≥ 3) ≈ 1 − e^(−3) − 3e^(−3) − (9/2) e^(−3).
(c) Denote the probability in (b) by s. Then, the number of years in which the biologist succeeds is
approximately Binomial(10, s) and the answer is
1 (1 s)10 10s(1 s)9 .
(d) Solve
(1 + λ + λ²/2) e^(−λ) = 0.001
for λ and then let n = 1000λ. The equation above can be solved by rewriting it as
λ = log 1000 + log(1 + λ + λ²/2)
and then solved by iteration. The result is λ ≈ 11.229, so the biologist should plant 11,229 seeds.
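The fixed-point iteration in (d) converges quickly; a sketch:

```python
from math import ceil, log

lam = 1.0
for _ in range(100):  # fixed-point iteration: lam = log(1000*(1 + lam + lam^2/2))
    lam = log(1000 * (1 + lam + lam**2 / 2))

print(lam, ceil(1000 * lam))  # lam about 11.2289, so plant 11229 seeds
```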
3. Let X be your earnings.
P(X = 10) = 4/52, P(X = 5) = 4/52,
P(X = 1) = (22/52) \binom{26}{2}/\binom{51}{2} = 11/102,
P(X = −1) = 1 − 2/13 − 11/102,
and so
EX = 10/13 + 5/13 + 11/102 − 1 + 2/13 + 11/102 = 4/13 + 11/51 > 0.
4. (a) N is Geometric(5/6):
P(N = n) = (1/6)^(n−1) (5/6),
where n = 1, 2, 3, . . ..
(b) EN = 6/5.
(c) By symmetry, P(you win) = 1/2.
(d) You get paid $10 with probability 5/12, $1 with probability 1/12, and nothing otherwise, so your
expected winnings are 10 · (5/12) + 1 · (1/12) = 51/12.
5. (a)
x           5      10     15     20
P(X = x)    0.1    0.2    0.3    0.4
P(Y = x)    0.25   0.25   0.25   0.25
(d)
EX = Σ_{i=1}^n si · (si/s) = (n/s) · (1/n) Σ_{i=1}^n si² = (n/s) EY² = (n/s) (Var(Y) + (EY)²) = (n/s)(σ² + μ²).
6. From Example 4.17, we have P(X = −1) = 244/495 ≈ 0.4929 and P(X = 0) = 1/36, so that
P(X = 1) = 1 − P(X = −1) − P(X = 0) = 949/1980 ≈ 0.4793. Then EX = P(X = 1) − P(X = −1) = −3/220 ≈ −0.0136 and Var(X) = 1 − P(X = 0) − (EX)² ≈ 0.972.
Continuous Random Variables
A random variable X is continuous if there exists a nonnegative function f so that, for every
interval B,
P(X ∈ B) = ∫_B f(x) dx.
The function f = fX is called the density of X. In particular,
P(X ∈ [a, b]) = P(a ≤ X ≤ b) = ∫_a^b f(x) dx,
P(X = a) = 0,
P(X ≤ b) = P(X < b) = ∫_{−∞}^b f(x) dx.
The distribution function of X is
F(x) = P(X ≤ x) = ∫_{−∞}^x f(s) ds,
and, for a function g, the expectation of g(X) is
Eg(X) = ∫_{−∞}^∞ g(x) f(x) dx.
Example 6.1. Assume that X has density
f(x) = cx if 0 < x < 4, and 0 otherwise.
Determine c, EX, and Var(X).
As the density must integrate to 1, c = 1/8. Then,
EX = ∫_0^4 (x²/8) dx = 8/3
and
E(X²) = ∫_0^4 (x³/8) dx = 8.
So, Var(X) = 8 − 64/9 = 8/9.
Example 6.2. Assume that X has density
fX(x) = 3x² if x ∈ [0, 1], and 0 otherwise.
Compute the density fY of Y = 1 − X⁴.
On y ∈ (0, 1),
FY(y) = P(Y ≤ y) = P(1 − X⁴ ≤ y) = P(1 − y ≤ X⁴) = P((1 − y)^(1/4) ≤ X) = ∫_{(1−y)^(1/4)}^1 3x² dx.
It follows that
fY(y) = (d/dy) FY(y) = 3 ((1 − y)^(1/4))² · (1/4) (1 − y)^(−3/4) = 3/(4 (1 − y)^(1/4)),
for y ∈ (0, 1), and fY(y) = 0 otherwise. Observe that it is immaterial how fY(y) is defined at
y = 0 and y = 1, because those two values contribute nothing to any integral.
As with discrete random variables, we now look at some famous densities.
6.1 Uniform random variable
Such a random variable represents the choice of a random number in an interval [α, β]. For [α, β] = [0, 1],
this is ideally the output of a computer random number generator.
Properties:
1. Density: f(x) = 1/(β − α) if x ∈ [α, β], and 0 otherwise.
2. EX = (α + β)/2.
3. Var(X) = (β − α)²/12.
Example 6.3. Assume that X is uniform on [0, 1]. What is P(X ∈ Q)? What is the probability
that the binary expansion of X starts with 0.010?
As Q is countable, it has an enumeration, say, Q = {q1, q2, . . . }. By Axiom 3 of Chapter 3:
P(X ∈ Q) = P(∪_i {X = qi}) = Σ_i P(X = qi) = 0.
Note that you cannot do this for sets that are not countable or you would prove that P(X ∈
R) = 0, while we, of course, know that P(X ∈ R) = P(Ω) = 1. As X is, with probability 1,
irrational, its binary expansion is uniquely defined, so there is no ambiguity about what the
second question means.
Divide [0, 1) into 2^n intervals of equal length. If the binary expansion of a number x ∈ [0, 1)
is 0.x1x2 . . ., the first n binary digits determine which of the 2^n subintervals x belongs to: if you
know that x belongs to an interval I based on the first n − 1 digits, then nth digit 1 means that
x is in the right half of I and nth digit 0 means that x is in the left half of I. For example, if
the expansion starts with 0.010, the number is in [0, 1/2], then in [1/4, 1/2], and then finally in [1/4, 3/8].
Our answer is 1/8, but, in fact, we can make a more general conclusion. If X is uniform on
[0, 1], then any of the 2^n possibilities for its first n binary digits are equally likely. In other
words, the binary digits of X are the result of an infinite sequence of independent fair coin
tosses. Choosing a uniform random number on [0, 1] is thus equivalent to tossing a fair coin
infinitely many times.
Example 6.4. A uniform random number X divides [0, 1] into two segments. Let R be the
ratio of the smaller versus the larger segment. Compute the density of R.
As R has values in (0, 1), the density fR(r) is nonzero only for r ∈ (0, 1) and we will deal
only with such r's. Then,
FR(r) = P(R ≤ r) = P(X ≤ 1/2, X/(1 − X) ≤ r) + P(X > 1/2, (1 − X)/X ≤ r)
= P(X ≤ 1/2, X ≤ r/(r + 1)) + P(X > 1/2, X ≥ 1/(r + 1))
= P(X ≤ r/(r + 1)) + P(X ≥ 1/(r + 1))   (since r/(r + 1) ≤ 1/2 and 1/(r + 1) ≥ 1/2)
= r/(r + 1) + 1 − 1/(r + 1)
= 2r/(r + 1),
and so
fR(r) = (d/dr) FR(r) = 2/(r + 1)².
We have computed the density of R, but we will use this example to make an additional
point. Let S = min{X, 1 − X} be the smaller of the two segments and L = max{X, 1 − X}
the larger. Clearly, R = S/L. Is ER = ES/EL? To check that this equation does not hold, we
compute
ER = ∫_0^1 r · 2/(r + 1)² dr = 2 log 2 − 1 ≈ 0.3863.
Moreover, we can compute ES by
ES = ∫_0^1 min{x, 1 − x} dx = 1/4,
or by checking (by a short computation, which we omit) that S is uniform on [0, 1/2]. Finally,
as S + L = 1,
EL = 1 − ES = 3/4.
Thus, ES/EL = 1/3 ≠ ER.
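A simulation makes the distinction between ER and ES/EL concrete; a sketch:

```python
import random

rng = random.Random(1)
n = 200_000
total = 0.0
for _ in range(n):
    x = rng.random()
    s, l = min(x, 1 - x), max(x, 1 - x)  # smaller and larger segment
    total += s / l

er = total / n
print(er)  # close to 2*log(2) - 1 = 0.3863, not to ES/EL = 1/3
```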
6.2 Exponential random variable
A random variable is Exponential(λ), with parameter λ > 0, if it has the density
given below. This is a distribution for the waiting time for some random event, for
example, for a lightbulb to burn out or for the next earthquake of at least some given magnitude.
Properties:
1. Density: f(x) = λ e^(−λx) if x ≥ 0, and f(x) = 0 if x < 0.
2. EX = 1/λ.
3. Var(X) = 1/λ².
4. P(X ≥ x) = e^(−λx).
5. Memoryless property: P(X ≥ x + y | X ≥ y) = e^(−λx).
The last property means that, if the event has not occurred by some given time (no matter
how large), the distribution of the remaining waiting time is the same as it was at the beginning.
There is no aging.
Proofs of these properties are integration exercises and are omitted.
Example 6.5. Assume that a lightbulb lasts on average 100 hours. Assuming exponential
distribution, compute the probability that it lasts more than 200 hours and the probability that
it lasts less than 50 hours.
Let X be the waiting time for the bulb to burn out. Then, X is Exponential with λ = 1/100
and
P(X ≥ 200) = e^(−2) ≈ 0.1353,
P(X ≤ 50) = 1 − e^(−1/2) ≈ 0.3935.
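Both numbers are immediate from the formula P(X ≥ x) = e^(−λx); a sketch (function name ours):

```python
from math import exp

lam = 1 / 100  # rate: mean lifetime 100 hours

def survival(x, lam):
    """P(X >= x) for an Exponential(lam) lifetime."""
    return exp(-lam * x)

print(survival(200, lam))      # e^-2, about 0.1353
print(1 - survival(50, lam))   # 1 - e^-0.5, about 0.3935
```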
6.3 Normal random variable
A random variable is Normal with parameters μ ∈ R and σ² > 0, denoted N(μ, σ²), if its density is given below.
Properties:
1. Density: f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)), where x ∈ (−∞, ∞).
2. EX = μ.
3. Var(X) = σ².
To show that
∫_{−∞}^∞ f(x) dx = 1
is a tricky exercise in integration, as is the computation of the variance. Assuming that the
integral of f is 1, we can use symmetry to prove that EX must be μ:
EX = ∫_{−∞}^∞ x f(x) dx = ∫_{−∞}^∞ (x − μ) f(x) dx + μ ∫_{−∞}^∞ f(x) dx
= (1/(σ√(2π))) ∫_{−∞}^∞ (x − μ) e^(−(x−μ)²/(2σ²)) dx + μ
= (1/(σ√(2π))) ∫_{−∞}^∞ z e^(−z²/(2σ²)) dz + μ
= μ,
where the last integral was obtained by the change of variable z = x − μ and is zero because the
function integrated is odd.
Example 6.6. Let X be an N(μ, σ²) random variable and let Y = αX + β, with α > 0. How is
Y distributed?
If X is a measurement with error, αX + β amounts to changing the units and so Y should
still be normal. Let us see if this is the case. We start by computing the distribution function
of Y,
FY(y) = P(Y ≤ y) = P(αX + β ≤ y) = P(X ≤ (y − β)/α) = ∫_{−∞}^{(y−β)/α} fX(x) dx,
and then the density,
fY(y) = (1/α) fX((y − β)/α) = (1/(ασ√(2π))) e^(−(y − αμ − β)²/(2α²σ²)).
Therefore, Y is Normal with expectation αμ + β and variance α²σ². In particular,
Z = (X − μ)/σ
has EZ = 0 and Var(Z) = 1. Such an N(0, 1) random variable is called standard Normal. Its
distribution function FZ(z) is denoted by Φ(z). Note that
fZ(z) = (1/√(2π)) e^(−z²/2),
Φ(z) = FZ(z) = (1/√(2π)) ∫_{−∞}^z e^(−x²/2) dx.
The integral for Φ(z) cannot be computed as an elementary function, so approximate values
are given in tables. Nowadays, this is largely obsolete, as computers can easily compute Φ(z)
very accurately for any given z. You should also note that it is enough to know these values for
z > 0, as in this case, by using the fact that fZ(x) is an even function,
Φ(−z) = ∫_{−∞}^{−z} fZ(x) dx = ∫_z^∞ fZ(x) dx = 1 − ∫_{−∞}^z fZ(x) dx = 1 − Φ(z).
Example 6.7. Assume that X is N(μ, σ²). Then,
P(|X − μ| ≥ σ) = P(|(X − μ)/σ| ≥ 1) = P(|Z| ≥ 1) = 2P(Z ≥ 1) = 2(1 − Φ(1)) ≈ 0.3173.
Similarly,
P(|X − μ| ≥ 2σ) = 2(1 − Φ(2)) ≈ 0.0455,
P(|X − μ| ≥ 3σ) = 2(1 − Φ(3)) ≈ 0.0027.
Example 6.8. Assume that X is Normal with mean = 2 and variance 2 = 25. Compute
the probability that X is between 1 and 4.
Here is the computation:
P(1 ≤ X ≤ 4) = P((1 − 2)/5 ≤ (X − 2)/5 ≤ (4 − 2)/5)
= P(−0.2 ≤ Z ≤ 0.4)
= P(Z ≤ 0.4) − P(Z ≤ −0.2)
= Φ(0.4) − (1 − Φ(0.2))
≈ 0.2347.
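As remarked above, Φ is easy to compute accurately; in Python it can be expressed through math.erf, which makes computations like the one above routine. A sketch:

```python
from math import erf, sqrt

def phi(z):
    """Standard Normal distribution function Phi(z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 2, 5
p = phi((4 - mu) / sigma) - phi((1 - mu) / sigma)
print(p)  # about 0.2347
```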
Let Sn be a Binomial(n, p) random variable. Recall that its mean is np and its variance
np(1 − p). If we pretend that Sn is Normal, then (Sn − np)/√(np(1 − p)) is standard Normal, i.e., N(0, 1).
The following theorem says that this is approximately true if p is fixed (e.g., 0.5) and n is large
(e.g., n = 100).
Theorem 6.1. De Moivre–Laplace Central Limit Theorem.
Let Sn be Binomial(n, p), where p is fixed and n is large. Then, (Sn − np)/√(np(1 − p)) is approximately N(0, 1); more
precisely,
P((Sn − np)/√(np(1 − p)) ≤ x) → Φ(x)
as n → ∞, for every x ∈ R. A way to rewrite this limit is
Σ_{k : 0 ≤ k ≤ np + x√(np(1−p))} \binom{n}{k} p^k (1 − p)^(n−k) → (1/√(2π)) ∫_{−∞}^x e^(−s²/2) ds
as n → ∞, for every x ∈ R. Indeed it can be, and originally was, proved this way, with a lot of
computational work.
An important issue is the quality of the Normal approximation to the Binomial. One can
prove that the difference between the Binomial probability (in the above theorem) and its limit
is at most
0.5 (p² + (1 − p)²) / √(n p (1 − p)).
A commonly cited rule of thumb is that this is a decent approximation when np(1 − p) ≥ 10;
however, if we take p = 1/3 and n = 45, so that np(1 − p) = 10, the bound above is about 0.0878,
too large for many purposes. Various corrections have been developed to diminish the error,
but they are, in my opinion, obsolete by now. In the situation when the above upper bound
on the error is too high, we should simply compute directly with the Binomial distribution and
not use the Normal approximation. (We will assume that the approximation is adequate in the
examples below.) Remember that, when n is large and p is small, say n = 100 and p = 1/100, the
Poisson approximation (with λ = np) is much better!
Example 6.9. A roulette wheel has 38 slots: 18 red, 18 black, and 2 green. The ball ends at
one of these at random. You are a player who plays a large number of games and makes an even
bet of $1 on red in every game. After n games, what is the probability that you are ahead?
Answer this for n = 100 and n = 1000.
Let Sn be the number of times you win. This is a Binomial(n, 9/19) random variable.
P(ahead) = P(Sn > n/2)
= P((Sn − np)/√(np(1 − p)) > (n/2 − np)/√(np(1 − p)))
≈ P(Z > ((1/2 − p)√n)/√(p(1 − p))).
For n = 100, we get
P(Z > 5/√90) ≈ 0.2990,
and, for n = 1000,
P(Z > 5/3) ≈ 0.0478.
For comparison, the true probabilities are 0.2650 and 0.0448, respectively.
Example 6.10. What would the answer to the previous example be if the game were fair, i.e.,
you bet even money on the outcome of a fair coin toss each time?
Then, p = 1/2 and
P(ahead) ≈ P(Z > 0) = 0.5,
as n → ∞.
Example 6.11. How many times do you need to toss a fair coin to get at least 100 heads with
probability at least 90%?
Let n be the number of tosses that we are looking for. For Sn, which is Binomial(n, 1/2), we
need to find n so that
P(Sn ≥ 100) ≥ 0.9.
We will use below that n > 200, as otherwise the probability would be at most about 1/2 (recall
the previous example). Here is the computation:
P(Sn ≥ 100) = P((Sn − n/2)/((1/2)√n) ≥ (100 − n/2)/((1/2)√n))
≈ P(Z ≥ (200 − n)/√n)
= P(Z ≥ −(n − 200)/√n)
= P(Z ≤ (n − 200)/√n)
= Φ((n − 200)/√n)
= 0.9.
Now, according to the tables, Φ(1.28) ≈ 0.9, thus we need to solve (n − 200)/√n = 1.28, that is,
n − 1.28√n − 200 = 0.
This is a quadratic equation in √n, with the positive solution
√n = (1.28 + √(1.28² + 800))/2 ≈ 14.80.
Rounding up the number n we get from above, we conclude that n = 219. (In fact, the probability
of getting at most 99 heads changes from about 0.1108 to about 0.0990 as n changes from 217
to 218.)
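Both the Normal-approximation answer and the exact probabilities in the parenthetical remark are easy to reproduce; a sketch:

```python
from math import comb, sqrt

# Normal approximation: solve n - 1.28*sqrt(n) - 200 = 0 for sqrt(n)
root = (1.28 + sqrt(1.28**2 + 800)) / 2
n_approx = root**2  # about 218.9, so take n = 219

def p_at_most(n, k):
    """P(Sn <= k) for Sn Binomial(n, 1/2), computed exactly."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n

print(n_approx)            # about 218.9
print(p_at_most(217, 99))  # about 0.1108
print(p_at_most(218, 99))  # about 0.0990
```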
Problems
1. A random variable X has the density function
f(x) = c(x + √x) if x ∈ [0, 1], and 0 otherwise.
(a) Determine c. (b) Compute E(1/X). (c) Determine the probability density function of
Y = X².
2. The density function of a random variable X is given by
f(x) = a + bx if 0 ≤ x ≤ 2, and 0 otherwise.
We also know that E(X) = 7/6. (a) Compute a and b. (b) Compute Var(X).
3. After your complaint about their service, a representative of an insurance company promised
to call you between 7 and 9 this evening. Assume that this means that the time T of the call
is uniformly distributed in the specied interval.
(a) Compute the probability that the call arrives between 8:00 and 8:20.
(b) At 8:30, the call still hasn't arrived. What is the probability that it arrives in the next 10
minutes?
(c) Assume that you know in advance that the call will last exactly 1 hour. From 9 to 9:30,
there is a game show on TV that you wanted to watch. Let M be the amount of time of the
show that you miss because of the call. Compute the expected value of M .
4. Toss a fair coin twice. You win $1 if at least one of the two tosses comes out heads.
(a) Assume that you play this game 300 times. What is, approximately, the probability that
you win at least $250?
(b) Approximately how many times do you need to play so that you win at least $250 with
probability at least 0.99?
5. Roll a die n times and let M be the number of times you roll 6. Assume that n is large.
(a) Compute the expectation EM .
(b) Write down an approximation, in terms of n and Φ, of the probability that M differs from
its expectation by less than 10%.
(c) How large should n be so that the probability in (b) is larger than 0.99?
Solutions
1. (a) As
1 = c ∫_0^1 (x + √x) dx = c (1/2 + 2/3) = (7/6) c,
it follows that c = 6/7.
(b)
E(1/X) = (6/7) ∫_0^1 (1/x)(x + √x) dx = 18/7.
(c)
FY(y) = P(Y ≤ y) = P(X ≤ √y) = (6/7) ∫_0^{√y} (x + √x) dx,
and so
fY(y) = (3/7)(1 + y^(−1/4)) if y ∈ (0, 1), and 0 otherwise.
2. (a) From ∫_0^2 f(x) dx = 1 we get 2a + 2b = 1 and from ∫_0^2 x f(x) dx = 7/6 we get 2a + (8/3)b = 7/6.
The two equations give a = b = 1/4.
(b) E(X²) = ∫_0^2 x² f(x) dx = 5/3 and so Var(X) = 5/3 − (7/6)² = 11/36.
3. (a) 1/6.
(b) Let T be the time of the call, from 7pm, in minutes; T is uniform on [0, 120]. Thus,
P(T ≤ 100 | T ≥ 90) = 1/3.
(c) We have
M = 0 if 0 ≤ T ≤ 60, M = T − 60 if 60 ≤ T ≤ 90, and M = 30 if 90 ≤ T.
Then,
EM = (1/120) ∫_60^90 (t − 60) dt + (1/120) ∫_90^120 30 dt = 11.25.
4. (a) P(win a single game) = 3/4. If you play n times, the number X of games you win is
Binomial(n, 3/4). If Z is N(0, 1), then
P(X ≥ 250) ≈ P(Z ≥ (250 − (3/4)n)/√(n · (3/4) · (1/4))).
For (a), n = 300 and the above expression is P(Z ≥ 10/3) ≈ 0.0004.
For (b), you need to find n so that the above expression is 0.99 or so that
P(Z ≤ (250 − (3/4)n)/√(n · (3/4) · (1/4))) = 0.01.
The argument must be negative, hence
(250 − (3/4)n)/√(n · (3/4) · (1/4)) = −2.33.
If x = √(3n), this becomes x² − 2.33x − 1000 = 0, and solving the quadratic equation gives x ≈ 32.81, n > (32.81)²/3, n ≥ 359.
5. (a) M is Binomial(n, 1/6), so EM = n/6.
(b) The probability in question is
P(|M − n/6| < 0.1 · (n/6)) ≈ P(|Z| < (0.1 · n/6)/√(n · (1/6) · (5/6))) = 2Φ(0.1 √(n/5)) − 1.
(c) We need
2Φ(0.1 √(n/5)) − 1 ≥ 0.99, that is, Φ(0.1 √(n/5)) ≥ 0.995,
and so 0.1 √(n/5) ≥ 2.58, i.e., n ≥ 5 · (25.8)² ≈ 3329.
Joint Distributions and Independence
Discrete Case
Assume that you have a pair (X, Y ) of discrete random variables X and Y . Their joint probability
mass function is given by
p(x, y) = P (X = x, Y = y)
so that
P ((X, Y ) A) =
p(x, y).
(x,y)A
The marginal probability mass functions are the p. m. f.'s of X and Y, given by
P(X = x) = Σ_y P(X = x, Y = y) = Σ_y p(x, y),
P(Y = y) = Σ_x P(X = x, Y = y) = Σ_x p(x, y).
Example 7.1. An urn has 2 red, 5 white, and 3 green balls. Select 3 balls at random and let
X be the number of red balls and Y the number of white balls. Determine (a) joint p. m. f. of
(X, Y ), (b) marginal p. m. f.s, (c) P (X Y ), and (d) P (X = 2|X Y ).
The joint p. m. f. is given by P(X = x, Y = y) for all possible x and y. In our case, x can
be 0, 1, or 2 and y can be 0, 1, 2, or 3. The values can be given by the formula
P(X = x, Y = y) = \binom{2}{x} \binom{5}{y} \binom{3}{3−x−y} / \binom{10}{3},
where we use the convention that \binom{a}{b} = 0 if b < 0 or b > a. Here is the table of values:

y\x        0        1        2        P(Y = y)
0          1/120    6/120    3/120    10/120
1          15/120   30/120   5/120    50/120
2          30/120   20/120   0        50/120
3          10/120   0        0        10/120
P(X = x)   56/120   56/120   8/120    1
The last row and column entries are the respective column and row sums and, therefore,
determine the marginal p. m. f.'s. To answer (c), we merely add the relevant probabilities,
P(X ≥ Y) = (1 + 6 + 3 + 30 + 5)/120 = 45/120 = 3/8.
The answer to (d) then follows:
P(X = 2 | X ≥ Y) = P(X = 2)/P(X ≥ Y) = (8/120)/(3/8) = 8/45.
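The whole table can be generated and checked mechanically; a sketch (the helper name is ours):

```python
from fractions import Fraction
from math import comb

def joint(x, y):
    """P(X = x, Y = y): x red (of 2), y white (of 5), the rest green (of 3)."""
    if 3 - x - y < 0:
        return Fraction(0)
    return Fraction(comb(2, x) * comb(5, y) * comb(3, 3 - x - y), comb(10, 3))

total = sum(joint(x, y) for x in range(3) for y in range(4))
p_x_ge_y = sum(joint(x, y) for x in range(3) for y in range(4) if x >= y)
print(total, p_x_ge_y)  # 1 and 3/8
```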
71
Continuous Case
We say that (X, Y ) is a jointly continuous pair of random variables if there exists a joint density
f(x, y) ≥ 0 so that
P((X, Y) ∈ S) = ∫∫_S f(x, y) dx dy.
Example. Let (X, Y) have joint density
f(x, y) = c x²y if x² ≤ y ≤ 1, and 0 otherwise.
For (a), c is determined by
∫_{−1}^1 dx ∫_{x²}^1 c x²y dy = c · (4/21) = 1,
and so
c = 21/4.
For (b), let S be the region between the graphs y = x2 and y = x, for x (0, 1). Then,
P(X ≥ Y) = P((X, Y) ∈ S) = ∫_0^1 dx ∫_{x²}^x (21/4) x²y dy = 3/20.
Both probabilities in (c) and (d) are 0 because a two-dimensional integral over a line is 0.
If f is the joint density of (X, Y), then the two marginal densities, which are the densities of X
and Y, are computed by integrating out the other variable:
fX(x) = ∫_{−∞}^∞ f(x, y) dy,
fY(y) = ∫_{−∞}^∞ f(x, y) dx.
Indeed, for an interval A,
P(X ∈ A) = ∫_A dx ∫_{−∞}^∞ f(x, y) dy.
The marginal densities formulas follow from the definition of density. With some advanced
calculus expertise, the following can be checked.
Two jointly continuous random variables X and Y are independent exactly when the joint
density is the product of the marginal ones:
f (x, y) = fX (x) fY (y),
for all x and y.
For the example above,
fX(x) = ∫_{x²}^1 (21/4) x²y dy = (21/8) x²(1 − x⁴),
where x ∈ [−1, 1], and 0 otherwise, and
fY(y) = ∫_{−√y}^{√y} (21/4) x²y dx = (7/2) y^(5/2),
where y ∈ [0, 1], and 0 otherwise. The two random variables X and Y are clearly not independent, as f(x, y) ≠ fX(x)fY(y).
Example 7.7. Let (X, Y ) be a random point in a square of length 1 with the bottom left corner
at the origin. Are X and Y independent?
f(x, y) = 1 if (x, y) ∈ [0, 1] × [0, 1], and 0 otherwise.
The marginal densities are fX(x) = 1, if x ∈ [0, 1], and fY(y) = 1, if y ∈ [0, 1], with both 0
otherwise. Therefore, X and Y are independent.
Example 7.8. Let (X, Y) be a random point in the triangle {(x, y) : 0 ≤ y ≤ x ≤ 1}. Are X
and Y independent?
Now
f(x, y) = 2 if 0 ≤ y ≤ x ≤ 1, and 0 otherwise.
The marginal densities are fX(x) = 2x, if x ∈ [0, 1], and fY(y) = 2(1 − y),
if y ∈ [0, 1], and 0 otherwise. So, X and Y are no longer distributed uniformly and no longer
independent.
We can make a more general conclusion from the last two examples. Assume that (X, Y) is
a jointly continuous pair of random variables, uniform on a compact set S ⊂ R². If they are to
be independent, their marginal densities have to be constant, thus uniform on some sets, say A
and B, and then S = A × B. (If A and B are both intervals, then S = A × B is a rectangle,
which is the most common example of independence.)
Example 7.9. Mr. and Mrs. Smith agree to meet at a specied location between 5 and 6
p.m. Assume that they both arrive there at a random time between 5 and 6 and that their
arrivals are independent. (a) Find the density for the time one of them will have to wait for
the other. (b) Mrs. Smith later tells you she had to wait; given this information, compute the
probability that Mr. Smith arrived before 5:30.
Let X be the time when Mr. Smith arrives and let Y be the time when Mrs. Smith arrives,
with the time unit 1 hour. The assumptions imply that (X, Y) is uniform on [0, 1] × [0, 1].

For (a), let T = |X − Y|, which has possible values in [0, 1]. So, fix t ∈ [0, 1] and compute (drawing a picture will also help)

\[ P(T \le t) = P(|X - Y| \le t) = P(-t \le X - Y \le t) = P(X - t \le Y \le X + t) = 1 - (1-t)^2 = 2t - t^2, \]

and so fT(t) = 2 − 2t, for t ∈ [0, 1], and 0 otherwise.
For (b), we need to compute

\[ P(X \le 0.5 \mid X > Y) = \frac{P(X \le 0.5,\, X > Y)}{P(X > Y)} = \frac{1/8}{1/2} = \frac{1}{4}. \]
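Both answers in this example are easy to check numerically. The following short Monte Carlo sketch (an added illustration, not part of the original notes) estimates P(T ≤ 0.5) = 2(0.5) − 0.5² = 0.75 and the conditional probability from part (b):

```python
import random

random.seed(1)
n = 200_000
count_t = 0        # T = |X - Y| <= 0.5
count_cond = 0     # X <= 0.5 and X > Y
count_given = 0    # X > Y  (Mrs. Smith had to wait)
for _ in range(n):
    x, y = random.random(), random.random()
    if abs(x - y) <= 0.5:
        count_t += 1
    if x > y:
        count_given += 1
        if x <= 0.5:
            count_cond += 1

p_t = count_t / n                  # near P(T <= 0.5) = 0.75
p_cond = count_cond / count_given  # near 1/4
print(p_t, p_cond)
```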
Example 7.10. Assume that X and Y are independent, that X is uniform on [0, 1], and that
Y has density fY (y) = 2y, for y [0, 1], and 0 elsewhere. Compute P (X + Y 1).
The assumptions determine the joint density of (X, Y):

\[ f(x, y) = \begin{cases} 2y, & (x, y) \in [0,1] \times [0,1], \\ 0, & \text{otherwise.} \end{cases} \]

To compute the probability in question we compute either

\[ \int_0^1 dx \int_0^{1-x} 2y\,dy \quad \text{or} \quad \int_0^1 2y\,dy \int_0^{1-y} dx, \]

each of which equals 1/3.
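The answer 1/3 can be checked by simulation (an added illustration, not part of the notes): a random variable with density 2y on [0, 1] can be sampled by the inverse transform Y = √U, since F_Y(y) = y² gives F_Y⁻¹(u) = √u.

```python
import random

random.seed(2)
n = 200_000
# X uniform on [0,1]; Y = sqrt(U) has density 2y on [0,1] (inverse transform)
hits = sum(1 for _ in range(n)
           if random.random() + random.random() ** 0.5 <= 1)
est = hits / n
print(est)  # should be close to 1/3
```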
Bob's call has expectation 40 minutes. Assume T1 and T2 are independent exponential random variables. What is the probability that Alice's call will come first?
We need to compute P (T1 < T2 ). Assuming our unit is 10 minutes, we have, for t1 , t2 > 0,
\[ f_{T_1}(t_1) = e^{-t_1} \quad \text{and} \quad f_{T_2}(t_2) = \frac{1}{4}\,e^{-t_2/4}, \]

and so

\[ P(T_1 < T_2) = \int_0^{\infty} dt_1\, e^{-t_1} \int_{t_1}^{\infty} \frac{1}{4}\,e^{-t_2/4}\,dt_2 = \int_0^{\infty} e^{-5t_1/4}\,dt_1 = \frac{4}{5}. \]
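The answer 4/5 is easy to confirm by simulating the exponential race (added check, not from the notes). Note that Python's `random.expovariate` takes the rate, so rates 1 and 1/4 correspond to expectations 1 and 4 in units of 10 minutes:

```python
import random

random.seed(3)
n = 200_000
# T1 ~ Exponential(rate 1), T2 ~ Exponential(rate 1/4)
wins = sum(1 for _ in range(n)
           if random.expovariate(1.0) < random.expovariate(0.25))
est = wins / n
print(est)  # should be close to 4/5
```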
Example 7.12. Buffon's needle problem. Parallel lines at a distance 1 are drawn on a large sheet of paper. Drop a needle of length ℓ onto the sheet. Compute the probability that it intersects one of the lines.

Let D be the distance from the center of the needle to the nearest line and let Θ be the acute angle relative to the lines. We will, reasonably, assume that D and Θ are independent and uniform on their respective intervals 0 ≤ D ≤ 1/2 and 0 ≤ Θ ≤ π/2. Then,

\[ P(\text{the needle intersects a line}) = P\left(\frac{D}{\sin\Theta} < \frac{\ell}{2}\right) = P\left(D < \frac{\ell}{2}\sin\Theta\right). \]
Case 1: ℓ ≤ 1. Then, the probability equals

\[ \frac{\int_0^{\pi/2} \frac{\ell}{2}\sin\theta\,d\theta}{\pi/4} = \frac{\ell/2}{\pi/4} = \frac{2\ell}{\pi}. \]

Case 2: ℓ > 1. Now, the curve d = (ℓ/2) sin θ intersects d = 1/2 at θ = arcsin(1/ℓ). The probability equals

\[ \frac{4}{\pi}\left[\int_0^{\arcsin\frac{1}{\ell}} \frac{\ell}{2}\sin\theta\,d\theta + \frac{1}{2}\left(\frac{\pi}{2} - \arcsin\frac{1}{\ell}\right)\right] = \frac{2\ell}{\pi}\left(1 - \sqrt{1 - \frac{1}{\ell^2}}\right) + 1 - \frac{2}{\pi}\arcsin\frac{1}{\ell}. \]
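The Buffon needle probability is a classic target for Monte Carlo. This added sketch samples (D, Θ) exactly as in the derivation above and compares the hit frequency with 2ℓ/π in the short-needle case ℓ = 1:

```python
import math
import random

def buffon(ell, trials, rng):
    """Fraction of needle drops (length ell, lines 1 apart) that cross a line."""
    hits = 0
    for _ in range(trials):
        d = rng.random() / 2                 # center-to-nearest-line distance
        theta = rng.random() * math.pi / 2   # acute angle to the lines
        if d < (ell / 2) * math.sin(theta):
            hits += 1
    return hits / trials

rng = random.Random(4)
est = buffon(1.0, 200_000, rng)
print(est, 2 / math.pi)  # estimate vs. exact value 2*l/pi for l = 1
```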
A similar approach works for cases with more than two random variables. Let us do an
example for illustration.
Example 7.13. Assume X1, X2, X3 are uniform on [0, 1] and independent. What is P(X1 + X2 + X3 ≤ 1)?
The joint density is

\[ f_{X_1,X_2,X_3}(x_1,x_2,x_3) = f_{X_1}(x_1)\,f_{X_2}(x_2)\,f_{X_3}(x_3) = \begin{cases} 1, & (x_1,x_2,x_3) \in [0,1]^3, \\ 0, & \text{otherwise.} \end{cases} \]

Therefore,

\[ P(X_1 + X_2 + X_3 \le 1) = \iiint_{x_1+x_2+x_3 \le 1} dx_1\,dx_2\,dx_3 = \iiint_{0 \le s_1 \le s_2 \le s_3 \le 1} ds_1\,ds_2\,ds_3 = \frac{1}{6}, \]

where the second integral is obtained by the change of variables s1 = x1, s2 = x1 + x2, s3 = x1 + x2 + x3 (with Jacobian 1) and equals 1/6 because the cube [0, 1]³ splits into 3! congruent pieces, one for each ordering of the coordinates.
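The value 1/6 is the volume of a simplex, which the following added snippet estimates by rejection sampling:

```python
import random

random.seed(5)
n = 300_000
# fraction of uniform points in the cube that land in x1 + x2 + x3 <= 1
hits = sum(1 for _ in range(n)
           if random.random() + random.random() + random.random() <= 1)
est = hits / n
print(est)  # volume of the simplex, exactly 1/6
```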
For a random variable N with values in {1, 2, 3, ...}, a useful way to compute the expectation is

\[ EN = \sum_{n=1}^{\infty} n\,P(N = n) = \sum_{n=1}^{\infty}\sum_{k=1}^{n} P(N = n) = \sum_{k=1}^{\infty}\sum_{n=k}^{\infty} P(N = n) = \sum_{k=1}^{\infty} P(N \ge k). \]

Here, P(N ≥ k) = 1/k!, and so

\[ EN = \sum_{k=1}^{\infty} \frac{1}{k!} = e - 1. \]
Conditional distributions

The conditional p. m. f. of X given Y = y is, in the discrete case, given simply by

\[ p_X(x \mid Y = y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}. \]
In the jointly continuous case, we define the conditional density of X given Y = y by

\[ f_X(x \mid Y = y) = \frac{f(x, y)}{f_Y(y)}. \]

Observe that when fY(y) = 0, ∫ f(x, y) dx = 0, and so f(x, y) = 0 for every x. So, we have a 0/0 expression, which we define to be 0.

Here is a physicist's proof of why this should be the conditional density formula:

\[ P(X = x + dx \mid Y = y + dy) = \frac{P(X = x + dx, Y = y + dy)}{P(Y = y + dy)} = \frac{f(x,y)\,dx\,dy}{f_Y(y)\,dy} = \frac{f(x,y)}{f_Y(y)}\,dx = f_X(x \mid Y = y)\,dx. \]
Example 7.14. Let (X, Y) be a random point in the triangle {(x, y) : x, y ≥ 0, x + y ≤ 1}. Compute fX(x | Y = y).
The joint density f(x, y) equals 2 on the triangle. For a given y ∈ [0, 1], we know that, if Y = y, X is between 0 and 1 − y. Moreover,

\[ f_Y(y) = \int_0^{1-y} 2\,dx = 2(1 - y). \]

Therefore,

\[ f_X(x \mid Y = y) = \begin{cases} \frac{1}{1-y}, & 0 \le x \le 1 - y, \\ 0, & \text{otherwise.} \end{cases} \]
In other words, given Y = y, X is distributed uniformly on [0, 1 y], which is hardly surprising.
Example 7.15. Suppose (X, Y) has joint density

\[ f(x, y) = \begin{cases} \frac{21}{4}\,x^2 y, & x^2 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]

Compute fX(x | Y = y).

We compute first

\[ f_Y(y) = \frac{21}{4}\,y \int_{-\sqrt{y}}^{\sqrt{y}} x^2\,dx = \frac{7}{2}\,y^{5/2}, \]

for y ∈ [0, 1]. Therefore,

\[ f_X(x \mid Y = y) = \frac{f(x,y)}{f_Y(y)} = \frac{\frac{21}{4}\,x^2 y}{\frac{7}{2}\,y^{5/2}} = \frac{3}{2}\,x^2 y^{-3/2}, \]

where −√y ≤ x ≤ √y.
Suppose we are asked to compute P(X ≥ Y | Y = y). This makes no literal sense because the probability P(Y = y) of the condition is 0. We reinterpret this expression as

\[ P(X \ge y \mid Y = y) = \int_y^{\sqrt{y}} f_X(x \mid Y = y)\,dx, \]

which equals

\[ \int_y^{\sqrt{y}} \frac{3}{2}\,x^2 y^{-3/2}\,dx = \frac{1}{2}\,y^{-3/2}\left(y^{3/2} - y^3\right) = \frac{1}{2}\left(1 - y^{3/2}\right). \]
Problems
1. Let (X, Y) be a random point in the square {(x, y) : −1 ≤ x, y ≤ 1}. Compute the conditional probability P(X ≥ 0 | Y ≤ 2X). (It may be a good idea to draw a picture and use elementary geometry rather than calculus.)
2. Roll a fair die 3 times. Let X be the number of 6s obtained and Y the number of 5s.
(a) Compute the joint probability mass function of X and Y .
(b) Are X and Y independent?
3. X and Y are independent random variables and they have the same density function

\[ f(x) = \begin{cases} c\,(2 - x), & x \in (0, 1), \\ 0, & \text{otherwise.} \end{cases} \]

(a) Determine c. (b) Compute P(Y ≤ 2X) and P(Y < 2X).
4. Let X and Y be independent random variables, both uniformly distributed on [0, 1]. Let
Z = min(X, Y ) be the smaller value of the two.
(a) Compute the density function of Z.
(b) Compute P(X ≤ 0.5 | Z ≤ 0.5).
(c) Are X and Z independent?
5. The joint density of (X, Y) is given by

\[ f(x, y) = \begin{cases} 3x, & \text{if } 0 \le y \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
(a) Compute the conditional density of Y given X = x.
(b) Are X and Y independent?
Solutions to problems
1. After noting the relevant areas (the right half of the square has area 2, and the part of it above the line y = 2x is a triangle of area (1/2)·(1/2)·1 = 1/4),

\[ P(X \ge 0 \mid Y \le 2X) = \frac{P(X \ge 0,\, Y \le 2X)}{P(Y \le 2X)} = \frac{\frac{1}{4}\left(2 - \frac{1}{2}\cdot\frac{1}{2}\cdot 1\right)}{\frac{1}{2}} = \frac{7}{8}. \]
2. (a) The joint probability mass function of X and Y is given by the following table; each entry is P(X = x, Y = y), with denominator 216 = 6³, together with the marginal probabilities:

x\y        0         1         2        3        P(X = x)
0          64/216    48/216    12/216   1/216    125/216
1          48/216    24/216    3/216    0        75/216
2          12/216    3/216     0        0        15/216
3          1/216     0         0        0        1/216
P(Y = y)   125/216   75/216    15/216   1/216

(b) No: for example, P(X = 3, Y = 1) = 0, while P(X = 3) P(Y = 1) = (1/216)·(75/216) > 0.
3. (a) From

\[ 1 = c \int_0^1 (2 - x)\,dx = \frac{3}{2}\,c, \]

it follows that c = 2/3.
(b) We have (the two probabilities coincide because (X, Y) is jointly continuous)

\[ P(Y \le 2X) = P(Y < 2X) = \frac{4}{9} \int_0^1 dy \int_{y/2}^1 (2 - x)(2 - y)\,dx = \frac{4}{9} \int_0^1 (2 - y)\,dy \int_{y/2}^1 (2 - x)\,dx \]
\[ = \frac{4}{9} \int_0^1 (2 - y)\left[2\left(1 - \frac{y}{2}\right) - \frac{1}{2}\left(1 - \frac{y^2}{4}\right)\right] dy = \frac{4}{9} \int_0^1 (2 - y)\left(\frac{3}{2} - y + \frac{y^2}{8}\right) dy \]
\[ = \frac{4}{9} \int_0^1 \left(3 - \frac{7}{2}\,y + \frac{5}{4}\,y^2 - \frac{y^3}{8}\right) dy = \frac{4}{9}\left[3 - \frac{7}{4} + \frac{5}{12} - \frac{1}{32}\right] = \frac{157}{216}. \]
4. (a) For z ∈ [0, 1], P(Z > z) = P(X > z, Y > z) = (1 − z)², so fZ(z) = 2(1 − z), for z ∈ [0, 1], and 0 otherwise.

(b) As Z ≤ X, the event {X ≤ 0.5} is contained in {Z ≤ 0.5}, so

P(X ≤ 0.5 | Z ≤ 0.5) = P(X ≤ 0.5)/P(Z ≤ 0.5) = (1/2)/(3/4) = 2/3.

(c) No: Z ≤ X, so, for instance, P(Z > 0.5 | X ≤ 0.5) = 0 ≠ P(Z > 0.5).
5. (a) Assume that x ∈ [0, 1]. As

\[ f_X(x) = \int_0^x 3x\,dy = 3x^2, \]

we have

\[ f_Y(y \mid X = x) = \frac{f(x, y)}{f_X(x)} = \frac{3x}{3x^2} = \frac{1}{x}, \]

for 0 ≤ y ≤ x; that is, given X = x, Y is uniform on [0, x].

(b) No: the conditional density of Y given X = x depends on x, so f(x, y) ≠ fX(x) fY(y).
1. The random variable X has density function

\[ f(x) = \begin{cases} c\,(x + x^2), & x \in [0, 1], \\ 0, & \text{otherwise.} \end{cases} \]

(a) Determine c.
(b) Compute E(1/X).
(c) Determine the probability density function of Y = X².
2. A certain country holds a presidential election, with two candidates running for office. Not satisfied with their choice, each voter casts a vote independently at random, based on the outcome of a fair coin flip. At the end, there are 4,000,000 valid votes, as well as 20,000 invalid votes.
(a) Using a relevant approximation, compute the probability that, in the final count of valid votes only, the numbers for the two candidates will differ by less than 1000 votes.
(b) Each invalid vote is double-checked independently with probability 1/5000. Using a relevant approximation, compute the probability that at least 3 invalid votes are double-checked.
3. Toss a fair coin 5 times. Let X be the total number of Heads among the first three tosses and Y the total number of Heads among the last three tosses. (Note that, if the third toss comes out Heads, it is counted both into X and into Y.)
(a) Write down the joint probability mass function of X and Y.
(b) Are X and Y independent? Explain.
(c) Compute the conditional probability P(X ≥ 2 | X ≥ Y).
4. Every working day, John comes to the bus stop exactly at 7am. He takes the first bus that arrives. The arrival of the first bus is an exponential random variable with expectation 20 minutes.
Also, every working day, and independently, Mary comes to the same bus stop at a random
time, uniformly distributed between 7 and 7:30.
(a) What is the probability that tomorrow John will wait for more than 30 minutes?
(b) Assume day-to-day independence. Consider Mary late if she comes after 7:20. What is the
probability that Mary will be late on 2 or more working days among the next 10 working days?
(c) What is the probability that John and Mary will meet at the station tomorrow?
1. The random variable X has density function

\[ f(x) = \begin{cases} c\,(x + x^2), & x \in [0, 1], \\ 0, & \text{otherwise.} \end{cases} \]

(a) Determine c.

Solution:

Since

\[ 1 = c \int_0^1 (x + x^2)\,dx = c\left(\frac{1}{2} + \frac{1}{3}\right) = \frac{5}{6}\,c, \]

we have c = 6/5.
(b) Compute E(1/X).

Solution:

\[ E\left(\frac{1}{X}\right) = \frac{6}{5}\int_0^1 \frac{1}{x}\,(x + x^2)\,dx = \frac{6}{5}\int_0^1 (1 + x)\,dx = \frac{6}{5}\left(1 + \frac{1}{2}\right) = \frac{9}{5}. \]
(c) Determine the probability density function of Y = X².

Solution:

The values of Y are in [0, 1], so we will assume that y ∈ [0, 1]. Then,

\[ F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(X \le \sqrt{y}) = \frac{6}{5}\int_0^{\sqrt{y}} (x + x^2)\,dx, \]

and so

\[ f_Y(y) = \frac{d}{dy}\,F_Y(y) = \frac{6}{5}\left(\sqrt{y} + y\right)\frac{1}{2\sqrt{y}} = \frac{3}{5}\left(1 + \sqrt{y}\right). \]
2. A certain country holds a presidential election, with two candidates running for office. Not satisfied with their choice, each voter casts a vote independently at random, based on the outcome of a fair coin flip. At the end, there are 4,000,000 valid votes, as well as 20,000 invalid votes.

(a) Using a relevant approximation, compute the probability that, in the final count of valid votes only, the numbers for the two candidates will differ by less than 1000 votes.
Solution:

Let Sn be the vote count for candidate 1. Thus, Sn is Binomial(n, p), where n = 4,000,000 and p = 1/2. Then, n − Sn is the vote count for candidate 2, and, as √n · (1/2) = 1000,

\[ P(|S_n - (n - S_n)| \le 1000) = P(-1000 \le 2S_n - n \le 1000) = P\left(-0.5 \le \frac{S_n - \frac{n}{2}}{\sqrt{n}\cdot\frac{1}{2}} \le 0.5\right) \]
\[ \approx P(-0.5 \le Z \le 0.5) = 2\,P(Z \le 0.5) - 1 = 2\,\Phi(0.5) - 1 \approx 2 \cdot 0.6915 - 1 = 0.383. \]
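For concreteness (an added note, not part of the original solution), Φ can be evaluated without a normal table via the error function, using the identity Φ(z) = (1 + erf(z/√2))/2:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, expressed through the error function."""
    return (1 + erf(z / sqrt(2))) / 2

answer = 2 * phi(0.5) - 1  # P(-0.5 <= Z <= 0.5)
print(answer)              # approximately 0.3829
```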
(b) Each invalid vote is double-checked independently with probability 1/5000. Using a relevant approximation, compute the probability that at least 3 invalid votes are double-checked.

Solution:

Now, let Sn be the number of double-checked votes, which is Binomial(20000, 1/5000) and thus approximately Poisson(4). Then,

\[ P(S_n \ge 3) = 1 - P(S_n = 0) - P(S_n = 1) - P(S_n = 2) \approx 1 - e^{-4} - 4e^{-4} - \frac{4^2}{2}\,e^{-4} = 1 - 13e^{-4}. \]
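How good is the Poisson approximation here? The following added snippet compares it with the exact Binomial(20000, 1/5000) tail:

```python
from math import comb, exp

n, p = 20_000, 1 / 5000
# Exact binomial tail P(S >= 3) = 1 - P(0) - P(1) - P(2)
exact = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3))
# Poisson(4) approximation from the solution above
approx = 1 - 13 * exp(-4)
print(exact, approx)  # both about 0.762
```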
3. Toss a fair coin 5 times. Let X be the total number of Heads among the first three tosses and Y the total number of Heads among the last three tosses. (Note that, if the third toss comes out Heads, it is counted both into X and into Y.)

(a) Write down the joint probability mass function of X and Y.

Solution:

P(X = x, Y = y) is given by the table

x\y    0       1       2       3
0      1/32    2/32    1/32    0
1      2/32    5/32    4/32    1/32
2      1/32    4/32    5/32    2/32
3      0       1/32    2/32    1/32
(b) Are X and Y independent? Explain.

Solution:

No. For example, P(X = 0, Y = 3) = 0, yet P(X = 0) P(Y = 3) = (1/8)·(1/8) > 0.
(c) Compute the conditional probability P(X ≥ 2 | X ≥ Y).

Solution:

Adding the relevant entries of the table (in units of 1/32),

\[ P(X \ge 2 \mid X \ge Y) = \frac{P(X \ge 2,\, X \ge Y)}{P(X \ge Y)} = \frac{1 + 4 + 5 + 1 + 2 + 1}{1 + 2 + 1 + 5 + 4 + 1 + 5 + 2 + 1} = \frac{14}{22} = \frac{7}{11}. \]
4. Every working day, John comes to the bus stop exactly at 7am. He takes the first bus that arrives. The arrival of the first bus is an exponential random variable with expectation 20 minutes.

Also, every working day, and independently, Mary comes to the same bus stop at a random time, uniformly distributed between 7 and 7:30.

(a) What is the probability that tomorrow John will wait for more than 30 minutes?

Solution:

Assume that the time unit is 10 minutes. Let T be the arrival time of the bus. It is Exponential with parameter λ = 1/2. Then,

\[ f_T(t) = \frac{1}{2}\,e^{-t/2}, \]

for t ≥ 0, and

\[ P(T > 3) = e^{-3/2}. \]
(b) Assume day-to-day independence. Consider Mary late if she comes after 7:20. What
is the probability that Mary will be late on 2 or more working days among the next
10 working days?
Solution:

Let X be Mary's arrival time. It is uniform on [0, 3]. Therefore,

\[ P(X \ge 2) = \frac{1}{3}. \]

The number of late days among the 10 days is Binomial(10, 1/3) and, therefore,

\[ P(\text{2 or more late working days among 10 working days}) = 1 - P(0\text{ late}) - P(1\text{ late}) = 1 - \left(\frac{2}{3}\right)^{10} - 10\cdot\frac{1}{3}\left(\frac{2}{3}\right)^{9}. \]
(c) What is the probability that John and Mary will meet at the station tomorrow?
Solution:

John and Mary meet if Mary arrives while John is still waiting, that is, if X ≤ T. We have

\[ f_{(T,X)}(t, x) = \frac{1}{6}\,e^{-t/2}, \]

for x ∈ [0, 3] and t ≥ 0. Therefore,

\[ P(X \le T) = \frac{1}{3}\int_0^3 dx \int_x^{\infty} \frac{1}{2}\,e^{-t/2}\,dt = \frac{1}{3}\int_0^3 e^{-x/2}\,dx = \frac{2}{3}\left(1 - e^{-3/2}\right). \]
Given a pair of random variables (X, Y) with joint density f and another function g of two variables,

\[ E\,g(X, Y) = \iint g(x, y)\,f(x, y)\,dx\,dy; \]

if, instead, (X, Y) is a discrete pair with joint probability mass function p, then

\[ E\,g(X, Y) = \sum_{x,y} g(x, y)\,p(x, y). \]
Example 8.1. Assume that two among the 5 items are defective. Put the items in a random order and inspect them one by one. Let X be the number of inspections needed to find the first defective item and Y the number of additional inspections needed to find the second defective item. Compute E|X − Y|.

The joint p. m. f. of (X, Y) is given by the following table, which lists P(X = i, Y = j), together with |i − j| in parentheses, whenever the probability is nonzero:
i\j    1         2         3         4
1      .1 (0)    .1 (1)    .1 (2)    .1 (3)
2      .1 (1)    .1 (0)    .1 (1)    0
3      .1 (2)    .1 (1)    0         0
4      .1 (3)    0         0         0

The answer is E|X − Y| = 0.1 · (0 + 1 + 2 + 3 + 1 + 0 + 1 + 2 + 1 + 3) = 1.4.
Example 8.2. Let (X, Y) be a random point in the triangle {(x, y) : x, y ≥ 0, x + y ≤ 1}, with joint density 2 on the triangle. Then

\[ EX = EY = \int_0^1 x \cdot 2(1 - x)\,dx = 2\left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{3}. \]
Moreover,

\[ E(XY) = \int_0^1 dx \int_0^{1-x} xy \cdot 2\,dy = \int_0^1 x(1-x)^2\,dx = \int_0^1 (1-u)\,u^2\,du = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}, \]

using the substitution u = 1 − x in the last step.
In the continuous case, linearity follows from

\[ E(X + Y) = \iint (x + y)\,f(x, y)\,dx\,dy = \iint x\,f(x, y)\,dx\,dy + \iint y\,f(x, y)\,dx\,dy = EX + EY. \]

To prove this property for arbitrary n (and the continuous case), one can simply proceed by induction.

By the way we defined expectation, the third property is not immediately obvious. However, it is clear that Z ≥ 0 implies EZ ≥ 0 and, applying this to Z = Y − X, together with linearity, establishes monotonicity.
We emphasize again that linearity holds for arbitrary random variables, which do not need to be independent! This is very useful. For example, we can often write a random variable X as a sum of (possibly dependent) indicators, X = I1 + ... + In. An instance of this method is called the indicator trick.
Example 8.3. Assume that an urn contains 10 black, 7 red, and 5 white balls. Select 5 balls
(a) with and (b) without replacement and let X be the number of red balls selected. Compute
EX.
Let Ii be the indicator of the event that the ith ball is red, that is,

\[ I_i = I_{\{i\text{th ball is red}\}} = \begin{cases} 1, & \text{if the } i\text{th ball is red}, \\ 0, & \text{otherwise.} \end{cases} \]

In both cases, X = I1 + I2 + I3 + I4 + I5.

In (a), X is Binomial(5, 7/22), so we know that EX = 5 · (7/22), but we will not use this knowledge. Instead, it is clear that

EI1 = P(1st ball is red) = 7/22 = EI2 = ... = EI5.

By linearity, then, EX = 5 · (7/22).

In (b), X is Hypergeometric:

\[ P(X = i) = \frac{\binom{7}{i}\binom{15}{5-i}}{\binom{22}{5}}, \quad i = 0, \ldots, 5. \]

However, the indicator trick works exactly as before (the fact that the Ii are now dependent does not matter) and so the answer is also exactly the same, EX = 5 · (7/22).
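The claim that the two sampling schemes share the same expectation is easy to test empirically. This added sketch draws 5 balls with replacement (`random.choice`) and without (`random.sample`) and compares the average number of red balls with 5 · 7/22:

```python
import random

balls = ["black"] * 10 + ["red"] * 7 + ["white"] * 5

random.seed(8)
n = 100_000
total_with = total_without = 0
for _ in range(n):
    total_with += sum(random.choice(balls) == "red" for _ in range(5))
    total_without += sum(b == "red" for b in random.sample(balls, 5))

print(total_with / n, total_without / n, 5 * 7 / 22)  # all about 1.59
```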
Example 8.4. Matching problem, revisited. Assume n people buy n gifts, which are then
assigned at random, and let X be the number of people who receive their own gift. What is
EX?
This is another problem very well suited for the indicator trick. Let
Ii = I{person i receives own gift} .
Then,
X = I1 + I2 + . . . + In .
Moreover,

EIi = P(person i receives own gift) = 1/n,

and so, by the indicator trick, EX = n · (1/n) = 1.
Example 8.5. Five married couples are seated at random around a round table with 10 seats. Let X be the number of wives who sit next to their husbands; compute EX.

Now, let

Ii = I{wife i sits next to her husband}.

Then, X = I1 + ... + I5 and, since the husband of wife i is equally likely to occupy any of the 9 remaining seats, 2 of which are adjacent to hers,

EIi = 2/9,

so that

EX = 10/9.
Example 8.6. Coupon collector problem, revisited. Sample from n cards, with replacement, indefinitely. Let N be the number of cards you need to sample for a complete collection, i.e., to get all different cards represented. What is EN?

Let Ni be the number of additional cards you need to get the ith new card, after you have received the (i − 1)st new card.

Then, N1, the number of cards needed to receive the first new card, is trivial, as the first card you buy is new: N1 = 1. Afterward, N2, the number of additional cards needed to get the second new card, is Geometric with success probability (n − 1)/n. After that, N3, the number of additional cards needed to get the third new card, is Geometric with success probability (n − 2)/n. In general, Ni is Geometric with success probability (n − i + 1)/n, i = 1, ..., n, and

N = N1 + ... + Nn,

so that

\[ EN = \sum_{i=1}^{n} \frac{n}{n - i + 1} = n\left(1 + \frac{1}{2} + \frac{1}{3} + \ldots + \frac{1}{n}\right). \]
Now, we have

\[ \sum_{i=2}^{n} \frac{1}{i} \le \int_1^n \frac{1}{x}\,dx \le \sum_{i=1}^{n-1} \frac{1}{i}, \]

by comparing the integral with the Riemann sums at the left and right endpoints in the division of [1, n] into [1, 2], [2, 3], ..., [n − 1, n], and so

\[ \log n \le \sum_{i=1}^{n} \frac{1}{i} \le \log n + 1, \]

which gives

\[ \lim_{n \to \infty} \frac{EN}{n \log n} = 1. \]
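The formula EN = n(1 + 1/2 + ... + 1/n) can be verified by direct simulation of the collection process (an added illustration, here with n = 10):

```python
import random

def collect(n, rng):
    """Number of draws (with replacement) until all n cards are seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(9)
n, trials = 10, 20_000
mean = sum(collect(n, rng) for _ in range(trials)) / trials
harmonic = sum(1 / i for i in range(1, n + 1))
print(mean, n * harmonic)  # both close to 10 * H_10 = 29.29
```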
Example 8.7. Assume that an urn contains 10 black, 7 red, and 5 white balls. Select 5 balls
(a) with replacement and (b) without replacement, and let W be the number of white balls
selected, and Y the number of different colors. Compute EW and EY.
As W is the number of white balls among the 5 selected, the indicator trick gives EW = 5 · (5/22) in either case.

Let Ib, Ir, and Iw be the indicators of the events that, respectively, black, red, and white balls are represented. Clearly,

Y = Ib + Ir + Iw,

and so, in the case with replacement,

\[ EY = \left(1 - \frac{12^5}{22^5}\right) + \left(1 - \frac{15^5}{22^5}\right) + \left(1 - \frac{17^5}{22^5}\right) \approx 2.5289, \]

while, in the case without replacement,

\[ EY = \left(1 - \frac{\binom{12}{5}}{\binom{22}{5}}\right) + \left(1 - \frac{\binom{15}{5}}{\binom{22}{5}}\right) + \left(1 - \frac{\binom{17}{5}}{\binom{22}{5}}\right) \approx 2.6209. \]
The expectation of a product of independent random variables is the product of their expectations: if X and Y are independent, then

\[ E[g(X)\,h(Y)] = E\,g(X) \cdot E\,h(Y). \]

Proof. For the continuous case,

\[ E[g(X)\,h(Y)] = \iint g(x)\,h(y)\,f(x,y)\,dx\,dy = \iint g(x)\,h(y)\,f_X(x)\,f_Y(y)\,dx\,dy = \int g(x)\,f_X(x)\,dx \int h(y)\,f_Y(y)\,dy = E\,g(X) \cdot E\,h(Y). \]
Example 8.8. Let us return to a random point (X, Y) in the triangle {(x, y) : x, y ≥ 0, x + y ≤ 1}. We computed that E(XY) = 1/12 and that EX = EY = 1/3. The two random variables have E(XY) ≠ EX · EY; thus, they cannot be independent. Of course, we already knew that they were not independent.
If, instead, we pick a random point (X, Y) in the square {(x, y) : 0 ≤ x, y ≤ 1}, X and Y are independent and, therefore, E(XY) = EX · EY = 1/4.
Finally, pick a random point (X, Y) in the diamond of radius 1, that is, in the square with corners at (0, 1), (1, 0), (0, −1), and (−1, 0). Clearly, we have, by symmetry,

EX = EY = 0,

but also, as the density is 1/2 on the diamond,

\[ E(XY) = \frac{1}{2}\int_{-1}^{1} dx \int_{-1+|x|}^{1-|x|} xy\,dy = \frac{1}{2}\int_{-1}^{1} x\,dx \int_{-1+|x|}^{1-|x|} y\,dy = \frac{1}{2}\int_{-1}^{1} x \cdot 0\,dx = 0, \]

since the inner integral vanishes by symmetry. This is an example where E(XY) = EX · EY even though X and Y are not independent.
The conditional expectation of Y given X = x is

\[ E(Y \mid X = x) = \sum_y y\,P(Y = y \mid X = x) \quad \text{(discrete case)}, \]

\[ E(Y \mid X = x) = \int y\,f_Y(y \mid X = x)\,dy \quad \text{(continuous case)}. \]

Observe that E(Y | X = x) is a function of x; let us call it g(x) for a moment. We denote g(X) by E(Y | X). This is the expectation of Y provided the value of X is known; note again that this is an expression dependent on X and so we can compute its expectation. Here is what we get.
Theorem 8.3. Tower property.

The formula E(E(Y|X)) = EY holds; less mysteriously, in the discrete case,

\[ EY = \sum_x E(Y \mid X = x)\,P(X = x), \]

and, in the continuous case,

\[ EY = \int E(Y \mid X = x)\,f_X(x)\,dx. \]
Proof. To verify this in the discrete case, we write out the expectation inside the sum:

\[ \sum_x \sum_y y\,\frac{P(X = x, Y = y)}{P(X = x)}\,P(X = x) = \sum_x \sum_y y\,P(X = x, Y = y) = \sum_y y\,P(Y = y) = EY. \]
Example 8.9. Once again, consider a random point (X, Y) in the triangle {(x, y) : x, y ≥ 0, x + y ≤ 1}. Given that X = x, Y is distributed uniformly on [0, 1 − x] and so

\[ E(Y \mid X = x) = \frac{1}{2}(1 - x). \]

By definition, E(Y | X) = (1/2)(1 − X), and the expectation of (1/2)(1 − X) must, therefore, equal the expectation of Y. Indeed, it does: both equal 1/3, as EX = EY = 1/3.
Example 8.10. Roll a die and then toss as many coins as shown up on the die. Compute the
expected number of Heads.
Let X be the number on the die and let Y be the number of Heads. Fix an x {1, 2, . . . , 6}.
Given that X = x, Y is Binomial(x, 1/2). In particular,

\[ E(Y \mid X = x) = \frac{x}{2}, \]

and, therefore,

\[ E(\text{number of Heads}) = EY = \sum_{x=1}^{6} \frac{x}{2}\,P(X = x) = \sum_{x=1}^{6} \frac{x}{2}\cdot\frac{1}{6} = \frac{21}{12} = \frac{7}{4}. \]
Example 8.11. Here is another job interview question. You die and are presented with three
doors. One of them leads to heaven, one leads to one day in purgatory, and one leads to two
days in purgatory. After your stay in purgatory is over, you go back to the doors and pick again,
but the doors are reshuffled each time you come back, so you must in fact choose a door at
random each time. How long is your expected stay in purgatory?
Code the doors 0, 1, and 2, with the obvious meaning, and let N be the number of days in
purgatory.
Conditioning on the first choice of door,

\[ EN = \frac{1}{3}\cdot 0 + \frac{1}{3}(1 + EN) + \frac{1}{3}(2 + EN), \]

and solving this equation gives

EN = 3.
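The equation for EN can be confirmed by simulating the doors directly (an added illustration; door 0 is heaven, doors 1 and 2 cost that many days):

```python
import random

random.seed(11)
trials = 100_000
total = 0
for _ in range(trials):
    days = 0
    while True:
        door = random.randrange(3)  # 0: heaven, 1: one day, 2: two days
        if door == 0:
            break
        days += door
    total += days
print(total / trials)  # should be close to EN = 3
```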
Covariance
Let X, Y be random variables. We define the covariance of (or between) X and Y as

\[ \mathrm{Cov}(X, Y) = E\big((X - EX)(Y - EY)\big) = E\big(XY - (EX)\,Y - (EY)\,X + EX \cdot EY\big) = E(XY) - EX\cdot EY - EY\cdot EX + EX\cdot EY. \]

To summarize, the most useful formula is

\[ \mathrm{Cov}(X, Y) = E(XY) - EX \cdot EY. \]
Note immediately that, if X and Y are independent, then Cov(X, Y ) = 0, but the converse
is false.
Let X and Y be indicator random variables, so X = I_A and Y = I_B, for two events A and B. Then, EX = P(A), EY = P(B), E(XY) = E(I_{A∩B}) = P(A ∩ B), and so

\[ \mathrm{Cov}(X, Y) = P(A \cap B) - P(A)P(B) = P(A)\big[P(B \mid A) - P(B)\big]. \]
If P (B|A) > P (B), we say the two events are positively correlated and, in this case, the covariance
is positive; if the events are negatively correlated all inequalities are reversed. For general random
variables X and Y , Cov(X, Y ) > 0 intuitively means that, on the average, increasing X will
result in larger Y .
For arbitrary random variables X1, ..., Xn,

\[ E\left(\sum_{i=1}^{n} X_i\right)^{\!2} = \sum_{i=1}^{n} E X_i^2 + \sum_{i \ne j} E(X_i X_j), \]

and

\[ \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{i \ne j} \mathrm{Cov}(X_i, X_j). \]

The first formula follows from expanding the square,

\[ \left(\sum_{i=1}^{n} X_i\right)^{\!2} = \sum_{i=1}^{n} X_i^2 + \sum_{i \ne j} X_i X_j, \]

and linearity of expectation. The second formula follows from the first:
\[ \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = E\left[\sum_{i=1}^{n} X_i - E\left(\sum_{i=1}^{n} X_i\right)\right]^2 = E\left[\sum_{i=1}^{n} (X_i - E X_i)\right]^2 = \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{i \ne j} \mathrm{Cov}(X_i, X_j). \]
Example 8.12. Let Sn, the number of successes in n independent trials with success probability p, be Binomial(n, p). The crucial observation is that Sn = Σ_{i=1}^n Ii, where Ii is the indicator I{ith trial is a success}; moreover, the Ii are independent. Then, ESn = np and

\[ \mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(I_i) = \sum_{i=1}^{n} \left(E I_i - (E I_i)^2\right) = n(p - p^2) = np(1 - p). \]
Example 8.13. Matching problem, revisited yet again. Recall that X is the number of people
who get their own gift. We will compute Var(X).
Recall that X = Σ_{i=1}^n Ii, where Ii = I{person i receives own gift}, so that EIi = 1/n and

\[ E(X^2) = n \cdot \frac{1}{n} + \sum_{i \ne j} E(I_i I_j) = 1 + n(n-1)\cdot\frac{1}{n(n-1)} = 2. \]

The E(Ii Ij) above is the probability that the ith person and the jth person both get their own gifts and, thus, equals 1/(n(n − 1)). We conclude that Var(X) = E(X²) − (EX)² = 2 − 1 = 1. (In fact, X is, for large n, very close to Poisson with λ = 1.)
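Both EX = 1 and Var(X) = 1 are easy to confirm by shuffling a random permutation and counting fixed points (an added check, with n = 10):

```python
import random

random.seed(12)
n, trials = 10, 100_000
counts = []
perm = list(range(n))
for _ in range(trials):
    random.shuffle(perm)
    counts.append(sum(i == perm[i] for i in range(n)))  # fixed points

mean = sum(counts) / trials
var = sum(c * c for c in counts) / trials - mean ** 2
print(mean, var)  # both should be close to 1
```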
Example 8.14. Roll a die 10 times. Let X be the number of 6s rolled and Y be the number
of 5s rolled. Compute Cov(X, Y ).
Observe that X = Σ_{i=1}^{10} Ii, where Ii = I{ith roll is 6}, and Y = Σ_{j=1}^{10} Jj, where Jj = I{jth roll is 5}. Then, EX = EY = 10/6 = 5/3. Moreover, E(Ii Jj) equals 0 if i = j (because both 5 and 6 cannot be rolled on the same roll), and E(Ii Jj) = 1/6² if i ≠ j (by independence of different rolls). Therefore,

\[ E(XY) = \sum_{i=1}^{10}\sum_{j=1}^{10} E(I_i J_j) = \sum_{i \ne j} \frac{1}{6^2} = \frac{10 \cdot 9}{36} = \frac{5}{2}, \]

and

\[ \mathrm{Cov}(X, Y) = \frac{5}{2} - \left(\frac{5}{3}\right)^2 = -\frac{5}{18}. \]
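The negative covariance −5/18 ≈ −0.278 reflects that a roll producing a 6 cannot also produce a 5; an added simulation:

```python
import random

random.seed(13)
trials = 200_000
sx = sy = sxy = 0
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(10)]
    x, y = rolls.count(6), rolls.count(5)
    sx += x
    sy += y
    sxy += x * y

cov = sxy / trials - (sx / trials) * (sy / trials)
print(cov)  # should be close to -5/18
```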
If X, X1, X2, ... are independent and identically distributed random variables with finite expectation and variance, then (X1 + ... + Xn)/n converges to EX in the sense that, for any fixed ε > 0,

\[ P\left(\left|\frac{X_1 + \ldots + X_n}{n} - EX\right| \ge \epsilon\right) \to 0, \]

as n → ∞.
In particular, if Sn is the number of successes in n independent trials, each of which is a success with probability p, then, as we have observed before, Sn = I1 + ... + In, where Ii = I{success at trial i}. So, for every ε > 0,

\[ P\left(\left|\frac{S_n}{n} - p\right| \ge \epsilon\right) \to 0, \]

as n → ∞. Thus, the proportion of successes converges to p in this sense.
Theorem 8.7. Markov inequality. If X ≥ 0 is a random variable and a > 0, then

\[ P(X \ge a) \le \frac{1}{a}\,EX. \]
Theorem 8.8. Chebyshev inequality. If EX = μ and Var(X) = σ² are both finite and k > 0, then

\[ P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}. \]
Example 8.16. If EX = 1 and Var(X) = 1, then P(X ≥ 10) ≤ P(|X − 1| ≥ 9) ≤ 1/81.

In the same way, if Var(X) = 0.1, then P(|X − EX| ≥ 0.5) ≤ 0.1/0.5² = 2/5.

As the previous two examples show, the Chebyshev inequality is useful if either σ is small or k is large.
The Chebyshev inequality follows from the Markov inequality applied to (X − μ)²:

\[ P(|X - \mu| \ge k) = P\big((X - \mu)^2 \ge k^2\big) \le \frac{1}{k^2}\,E(X - \mu)^2 = \frac{1}{k^2}\,\mathrm{Var}(X). \]

Applied to (X1 + ... + Xn)/n, whose expectation is μ and whose variance is nσ²/n², this gives

\[ P\left(\left|\frac{X_1 + \ldots + X_n}{n} - \mu\right| \ge \epsilon\right) \le \frac{n\sigma^2}{n^2\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0, \]

as n → ∞.

A careful examination of the proof above will show that it remains valid if ε depends on n, but goes to 0 slower than 1/√n. This suggests that (X1 + ... + Xn)/n converges to EX at the rate of about 1/√n.
Assume that X, X1, X2, ... are independent, identically distributed random variables, with finite μ = EX and σ² = Var(X). Then,

\[ P\left(\frac{X_1 + \ldots + X_n - n\mu}{\sigma\sqrt{n}} \le x\right) \to P(Z \le x), \]

as n → ∞, where Z is standard Normal.
We will not prove this theorem in full detail, but will later give good indication as to why
it holds. Observe, however, that it is a remarkable theorem: the random variables Xi have an
arbitrary distribution (with given expectation and variance) and the theorem says that their
sum approximates a very particular distribution, the normal one. Adding many independent
copies of a random variable erases all information about its distribution other than expectation
and variance!
On the other hand, the convergence is not very fast; the current version of the celebrated Berry–Esseen theorem states that an upper bound on the difference between the two probabilities in the Central limit theorem is

\[ 0.4785 \cdot \frac{E|X - \mu|^3}{\sigma^3\sqrt{n}}. \]
Example 8.18. Assume that Xn are independent and uniform on [0, 1]. Let Sn = X1 + ... + Xn. (a) Compute approximately P(S200 ≤ 90). (b) Using the approximation, find n so that P(Sn ≥ 50) ≥ 0.99.

We know that EXi = 1/2 and Var(Xi) = 1/3 − 1/4 = 1/12. For (a),

\[ P(S_{200} \le 90) = P\left(\frac{S_{200} - 100}{\sqrt{200\cdot\frac{1}{12}}} \le \frac{90 - 100}{\sqrt{200\cdot\frac{1}{12}}}\right) \approx P\left(Z \le -\sqrt{6}\right) = 1 - P\left(Z \le \sqrt{6}\right) \approx 1 - 0.993 = 0.007. \]
For (b), we rewrite

\[ P\left(\frac{S_n - \frac{n}{2}}{\sqrt{n\cdot\frac{1}{12}}} \ge \frac{50 - \frac{n}{2}}{\sqrt{n\cdot\frac{1}{12}}}\right) = 0.99, \]

or

\[ P\left(Z \le \frac{\frac{n}{2} - 50}{\sqrt{\frac{n}{12}}}\right) = 0.99. \]

Using the fact that Φ(z) = 0.99 for (approximately) z = 2.326, we get the equation

\[ \frac{\frac{n}{2} - 50}{\sqrt{\frac{n}{12}}} = 2.326, \]

that is, n − 1.345√n − 100 = 0, and so n ≈ 115.
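The value n = 115 can be sanity-checked by simulation (an added illustration): the probability that 115 independent uniforms sum to at least 50 should indeed be about 0.99.

```python
import random

random.seed(14)
trials = 100_000
n = 115
hits = sum(1 for _ in range(trials)
           if sum(random.random() for _ in range(n)) >= 50)
print(hits / trials)  # should be about 0.99
```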
Example 8.19. A casino charges $1 for entrance. For promotion, they offer to the first 30,000 guests the following game. Roll a fair die:

if you roll 6, you get free entrance and $2;
if you roll 5, you get free entrance;

otherwise, you pay the entrance fee as usual.

Compute the amount of money s such that, with probability about 0.9, the casino gives away at most s dollars in this promotion.

Let Xi be the amount the ith guest gains: Xi = 3 with probability 1/6 (the $1 entrance plus $2), Xi = 1 with probability 1/6 (the free entrance), and Xi = 0 otherwise. Then,

\[ E X_i = 3\cdot\frac{1}{6} + 1\cdot\frac{1}{6} = \frac{2}{3} \]

and

\[ \mathrm{Var}(X_i) = \frac{1}{6}\cdot 9 + \frac{1}{6}\cdot 1 - \left(\frac{2}{3}\right)^2 = \frac{5}{3} - \frac{4}{9} = \frac{11}{9}. \]

Therefore, with L = X1 + ... + Xn and n = 30,000,

\[ P(L \le s) = P\left(\frac{L - \frac{2}{3}n}{\sqrt{n\cdot\frac{11}{9}}} \le \frac{s - \frac{2}{3}n}{\sqrt{n\cdot\frac{11}{9}}}\right) \approx P\left(Z \le \frac{s - \frac{2}{3}n}{\sqrt{n\cdot\frac{11}{9}}}\right) = 0.9, \]

which gives

\[ \frac{s - \frac{2}{3}n}{\sqrt{n\cdot\frac{11}{9}}} \approx 1.28, \]

and, finally,

\[ s \approx \frac{2}{3}\,n + 1.28\sqrt{n\cdot\frac{11}{9}} \approx 20{,}245. \]
Problems
1. An urn contains 2 white and 4 black balls. Select three balls in three successive steps without
replacement. Let X be the total number of white balls selected and Y the step in which you
selected the rst black ball. For example, if the selected balls are white, black, black, then
X = 1, Y = 2. Compute E(XY ).
2. The joint density of (X, Y) is

\[ f(x, y) = \begin{cases} 3x, & \text{if } 0 \le y \le x \le 1, \\ 0, & \text{otherwise.} \end{cases} \]

Compute Cov(X, Y).
3. Five married couples are seated at random in a row of 10 seats.
(a) Compute the expected number of women that sit next to their husbands.
(b) Compute the expected number of women that sit next to at least one man.
4. There are 20 birds that sit in a row on a wire. Each bird looks left or right with equal
probability. Let N be the number of birds not seen by any neighboring bird. Compute EN .
5. Recall that a full deck of cards contains 52 cards, 13 cards of each of the four suits. Distribute
the cards at random to 13 players, so that each gets 4 cards. Let N be the number of players
whose four cards are of the same suit. Using the indicator trick, compute EN .
6. Roll a fair die 24 times. Compute, using a relevant approximation, the probability that the
sum of numbers exceeds 100.
Solutions to problems
1. First we determine the joint p. m. f. of (X, Y). We have

P(X = 0, Y = 1) = P(bbb) = 1/5,
P(X = 1, Y = 1) = P(bwb or bbw) = 2/5,
P(X = 1, Y = 2) = P(wbb) = 1/5,
P(X = 2, Y = 1) = P(bww) = 1/15,
P(X = 2, Y = 2) = P(wbw) = 1/15,
P(X = 2, Y = 3) = P(wwb) = 1/15,

so that

\[ E(XY) = 1\cdot\frac{2}{5} + 2\cdot\frac{1}{5} + 2\cdot\frac{1}{15} + 4\cdot\frac{1}{15} + 6\cdot\frac{1}{15} = \frac{8}{5}. \]
2. We have

\[ EX = \int_0^1 dx \int_0^x x \cdot 3x\,dy = \int_0^1 3x^3\,dx = \frac{3}{4}, \]

\[ EY = \int_0^1 dx \int_0^x y \cdot 3x\,dy = \int_0^1 \frac{3}{2}\,x^3\,dx = \frac{3}{8}, \]

\[ E(XY) = \int_0^1 dx \int_0^x xy \cdot 3x\,dy = \int_0^1 \frac{3}{2}\,x^4\,dx = \frac{3}{10}, \]

so that

\[ \mathrm{Cov}(X, Y) = \frac{3}{10} - \frac{3}{4}\cdot\frac{3}{8} = \frac{3}{160}. \]
3. (a) Let Ii = I{wife i sits next to her husband}. By treating each couple as a single block,

\[ E I_i = \frac{9!\, \cdot\, 2!}{10!} = \frac{1}{5}, \]

and so

EM = EI1 + ... + EI5 = 5 · EI1 = 1.
(b) Let the number be N. Let Ii = I{woman i sits next to a man}. Then, by dividing into cases, whereby the woman either sits on one of the two end chairs or on one of the eight middle chairs,

\[ E I_i = \frac{2}{10}\cdot\frac{5}{9} + \frac{8}{10}\cdot\left(1 - \frac{4}{9}\cdot\frac{3}{8}\right) = \frac{7}{9}, \]

and so

EN = EI1 + ... + EI5 = 5 · EI1 = 35/9.
4. For each of the two birds at either end, the probability that it is not seen is 1/2, while for any other bird this probability is 1/4. By the indicator trick,

\[ EN = 2\cdot\frac{1}{2} + 18\cdot\frac{1}{4} = \frac{11}{2}. \]
5. Let

Ii = I{player i has four cards of the same suit},

so that N = I1 + ... + I13. Observe that the number of ways to select 4 cards from a 52-card deck is C(52, 4), while the number of ways to select 4 cards of the same suit is 4 · C(13, 4). Therefore,

\[ E I_i = \frac{4\binom{13}{4}}{\binom{52}{4}} \]

and

\[ EN = \frac{4\binom{13}{4}}{\binom{52}{4}}\cdot 13. \]
6. With S24 the sum of the 24 rolls, ES24 = 24 · (7/2) = 84 and Var(S24) = 24 · (35/12) = 70, so that, using the continuity correction ({S24 ≥ 100} = {S24 > 99.5} for the integer-valued sum),

\[ P(S_{24} \ge 100) = P\left(\frac{S_{24} - 84}{\sqrt{70}} \ge \frac{99.5 - 84}{\sqrt{70}}\right) \approx P(Z \ge 1.85) = 1 - \Phi(1.85) \approx 0.032. \]
(b) Assume that you play this game 500 times. What is, approximately, the probability that
you win at least $135?
(c) Again, assume that you play this game 500 times. Compute (approximately) the amount
of money x such that your winnings will be at least x with probability 0.5. Then, do the same
with probability 0.9.
6. Two random variables X and Y are independent and have the same probability density
function
\[ g(x) = \begin{cases} c\,(1 + x), & x \in [0, 1], \\ 0, & \text{otherwise.} \end{cases} \]
(a) Find the value of c. Here and in (b): use

\[ \int_0^1 x^n\,dx = \frac{1}{n+1}, \quad \text{for } n > -1. \]
(b) Compute the probability that all suits are represented among the first four cards selected.

Solution:

\[ \frac{13^4}{\binom{52}{4}}. \]
(c) Compute the expected number of different suits among the first four cards selected.

Solution:

If X is the number of suits represented, then X = I♠ + I♦ + I♥ + I♣, where I♠ = I{♠ is represented}, etc. Then,

\[ E I_{\spadesuit} = 1 - \frac{\binom{39}{4}}{\binom{52}{4}}, \]

and so

\[ EX = 4\left(1 - \frac{\binom{39}{4}}{\binom{52}{4}}\right). \]
(d) Compute the expected number of cards you have to select to get the first hearts card.

Solution:

Label the non-♥ cards 1, ..., 39 and let Ii = I{card i is selected before any ♥ card}. Then, EIi = 1/14 for any i, as card i is equally likely to occupy any position among the 14 cards consisting of card i and the 13 hearts. If N is the number of cards you have to select to get the first hearts card, then N = 1 + I1 + ... + I39 and

\[ EN = 1 + \frac{39}{14} = \frac{53}{14}. \]
(c) Compute the probability that the two Swedes have exactly one person sitting between them.

Solution:

The two Swedes may occupy chairs 1, 3; or 2, 4; or 3, 5; ...; or 9, 11. There are exactly 9 possibilities, so the answer is

\[ \frac{9}{\binom{11}{2}} = \frac{9}{55}. \]
3. You have two fair coins. Toss the rst coin three times and let X be the number of Heads.
Then, toss the second coin X times, that is, as many times as you got Heads in the rst
coin toss. Let Y be the number of Heads in the second coin toss. (For example, if X = 0,
Y is automatically 0; if X = 2, toss the second coin twice and count the number of Heads
to get Y .)
(a) Determine the joint probability mass function of X and Y, that is, write down a formula for P(X = i, Y = j) for all relevant i and j.

Solution:

\[ P(X = i, Y = j) = P(X = i)\,P(Y = j \mid X = i) = \binom{3}{i}\frac{1}{2^3}\cdot\binom{i}{j}\frac{1}{2^i}, \]

for 0 ≤ j ≤ i ≤ 3.
(b) Compute P(X ≥ 2 | Y = 1).

Solution:

This equals

\[ \frac{P(X = 2, Y = 1) + P(X = 3, Y = 1)}{P(X = 1, Y = 1) + P(X = 2, Y = 1) + P(X = 3, Y = 1)} = \frac{5}{9}. \]
4. Assume that 2, 000, 000 single male tourists visit Las Vegas every year. Assume also that
each of these tourists independently gets married while drunk with probability 1/1, 000, 000.
(a) Write down the exact probability that exactly 3 male tourists will get married while
drunk next year.
Solution:

With X equal to the number of such drunk marriages, X is Binomial(n, p) with p = 1/1,000,000 and n = 2,000,000, so we have

\[ P(X = 3) = \binom{n}{3}\,p^3 (1 - p)^{n-3}. \]
(b) Compute the expected number of such drunk marriages in the next 10 years.
Solution:
As X is binomial, its expected value is np = 2, so the answer is 10 EX = 20.
(c) Write down a relevant approximate expression for the probability in (a).
Solution:
We use that X is approximately Poisson with λ = 2, so the answer is

\[ \frac{2^3}{3!}\,e^{-2} = \frac{4}{3}\,e^{-2}. \]
(d) Write down an approximate expression for the probability that there will be no such drunk marriage during at least one of the next 3 years.

Solution:

This equals 1 − P(at least one such marriage in each of the next 3 years), which equals

\[ 1 - \left(1 - e^{-2}\right)^3. \]
5. Toss a fair coin twice. You win $2 if both tosses come out Heads, lose $1 if no toss comes out Heads, and win or lose nothing otherwise.

(a) What is the expected number of games you need to play to win once?

Solution:

The probability of winning in a single game is 1/4, so the number of games needed is Geometric(1/4), with expectation 4.
(b) Assume that you play this game 500 times. What is, approximately, the probability that you win at least $135?

Solution:

Let X be the winnings in one game and X1, X2, ..., Xn the winnings in successive games, with Sn = X1 + ... + Xn. Then, we have

\[ EX = 2\cdot\frac{1}{4} - 1\cdot\frac{1}{4} = \frac{1}{4} \]

and

\[ \mathrm{Var}(X) = 4\cdot\frac{1}{4} + 1\cdot\frac{1}{4} - \left(\frac{1}{4}\right)^2 = \frac{19}{16}. \]

Thus, with n = 500,

\[ P(S_n \ge 135) = P\left(\frac{S_n - \frac{n}{4}}{\sqrt{n\cdot\frac{19}{16}}} \ge \frac{135 - \frac{n}{4}}{\sqrt{n\cdot\frac{19}{16}}}\right) \approx P\left(Z \ge \frac{10}{\sqrt{500\cdot\frac{19}{16}}}\right). \]
(c) Again, assume that you play this game 500 times. Compute (approximately) the amount of money x such that your winnings will be at least x with probability 0.5. Then, do the same with probability 0.9.

Solution:

For probability 0.5, the answer is exactly ESn = n/4 = 125. For probability 0.9, we approximate

\[ P(S_n \ge x) \approx P\left(Z \ge \frac{x - 125}{\sqrt{500\cdot\frac{19}{16}}}\right) = P\left(Z \le \frac{125 - x}{\sqrt{500\cdot\frac{19}{16}}}\right) = \Phi\left(\frac{125 - x}{\sqrt{500\cdot\frac{19}{16}}}\right), \]

where we have used that x < 125. Then, we use that Φ(z) = 0.9 at z ≈ 1.28, leading to the equation

\[ \frac{125 - x}{\sqrt{500\cdot\frac{19}{16}}} = 1.28 \]

and, therefore,

\[ x = 125 - 1.28\sqrt{500\cdot\frac{19}{16}}. \]
6. (a) As

\[ 1 = c \int_0^1 (1 + x)\,dx = \frac{3}{2}\,c, \]

we have c = 2/3.

(b) We compute

\[ EX = \frac{2}{3}\int_0^1 x(1 + x)\,dx = \frac{5}{9}, \qquad E(X^2) = \frac{2}{3}\int_0^1 x^2(1 + x)\,dx = \frac{7}{18}, \]

so that Var(X) = 7/18 − (5/9)² = 13/162 and, as X and Y are independent, Var(X + Y) = 2 · (13/162) = 13/81.
(c) Find P(X + Y < 1) and P(X + Y ≤ 1). Here and in (d): when you get to a single integral involving powers, stop.

Solution:

The two probabilities are both equal to

\[ \left(\frac{2}{3}\right)^2 \int_0^1 dx \int_0^{1-x} (1 + x)(1 + y)\,dy = \left(\frac{2}{3}\right)^2 \int_0^1 (1 + x)\left((1 - x) + \frac{(1 - x)^2}{2}\right) dx. \]
CONVERGENCE IN PROBABILITY
Convergence in probability
One of the goals of probability theory is to extricate a useful deterministic quantity out of a random situation. This is typically possible when a large number of random effects cancel each other out, so some limit is involved. In this chapter we consider the following setting: given a sequence of random variables, Y1, Y2, ..., we want to show that, when n is large, Yn is approximately f(n), for some simple deterministic function f(n). The meaning of "approximately" is what we now make clear.
Here Rn denotes the longest run of Heads among the first n tosses of a fair coin. For the upper bound, the event {Rn ≥ k} is contained in the union, over i = 1, ..., n, of the events that tosses i, ..., i + k − 1 all come out Heads, each of which has probability 2^{−k}; hence

\[ P(R_n \ge k) \le \frac{n}{2^{k-1}}, \]

with a factor of 2 to spare, which will absorb the rounding of k to an integer below.
For the lower bound, divide the string of size n into disjoint blocks of size k. There are ⌊n/k⌋ such blocks (if n is not divisible by k, simply throw away the leftover smaller block at the end). Then, Rn ≥ k as soon as one of the blocks consists of Heads only; different blocks are independent. Therefore,

\[ P(R_n < k) \le \left(1 - \frac{1}{2^k}\right)^{\lfloor n/k \rfloor} \le \exp\left(-\frac{1}{2^k}\left\lfloor\frac{n}{k}\right\rfloor\right). \]
We will show that, as n \to \infty,

\frac{R_n}{\log_2 n} \to 1 \quad \text{in probability;}

that is, for every \epsilon > 0,

(1) \quad P(R_n \ge (1+\epsilon)\log_2 n) \to 0,

(2) \quad P(R_n \le (1-\epsilon)\log_2 n) \to 0,

as n \to \infty. This is enough, as

P\left( \left| \frac{R_n}{\log_2 n} - 1 \right| \ge \epsilon \right) = P\left( \frac{R_n}{\log_2 n} \ge 1+\epsilon \ \text{ or } \ \frac{R_n}{\log_2 n} \le 1-\epsilon \right)
= P\left( \frac{R_n}{\log_2 n} \ge 1+\epsilon \right) + P\left( \frac{R_n}{\log_2 n} \le 1-\epsilon \right)
= P(R_n \ge (1+\epsilon)\log_2 n) + P(R_n \le (1-\epsilon)\log_2 n).
A little fussing in the proof comes from the fact that (1 \pm \epsilon)\log_2 n are not integers. This is common in such problems. To prove (1), we plug k = \lceil (1+\epsilon)\log_2 n \rceil into the upper bound to get

P(R_n \ge (1+\epsilon)\log_2 n) \le n\cdot\frac{1}{2^{(1+\epsilon)\log_2 n - 1}} = n\cdot\frac{2}{n^{1+\epsilon}} = \frac{2}{n^{\epsilon}} \to 0,
as n \to \infty. On the other hand, to prove (2) we need to plug k = \lfloor (1-\epsilon)\log_2 n \rfloor + 1 into the lower bound,

P(R_n \le (1-\epsilon)\log_2 n) \le P(R_n < k)
\le \exp\left( -\frac{1}{2^k}\left\lfloor \frac{n}{k} \right\rfloor \right)
\le \exp\left( -\frac{1}{2^k}\left( \frac{n}{k} - 1 \right) \right)
\le \exp\left( -\frac{1}{2\,n^{1-\epsilon}}\left( \frac{n}{(1-\epsilon)\log_2 n + 1} - 1 \right) \right)
\to 0,

as the exponent is of order -n^{\epsilon}/\log_2 n \to -\infty.
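The result above can be checked empirically. The following sketch (assuming a fair coin simulated with Python's `random`) computes the longest run of Heads among about a million tosses:

```python
import random

# Empirical check that R_n, the longest run of Heads in n fair coin
# tosses, is close to log2(n): for n = 2^20, log2(n) = 20.
random.seed(7)
n = 1 << 20
longest = run = 0
for _ in range(n):
    if random.random() < 0.5:
        run += 1
        longest = max(longest, run)
    else:
        run = 0
print(longest, longest / 20.0)
```

The observed longest run is typically within a few units of log2(n) = 20, so the ratio R_n / log2(n) is close to 1.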
The most basic tool in proving convergence in probability is the Chebyshev inequality: if X is a random variable with EX = \mu and Var(X) = \sigma^2, then

P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2},
for any k > 0. We proved this inequality in the previous chapter and we will use it to prove the
next theorem.
Theorem 9.1. Connection between variance and convergence in probability.

Assume that Y_n are random variables and that a is a constant such that

EY_n \to a, \qquad Var(Y_n) \to 0,

as n \to \infty. Then,

Y_n \to a,

as n \to \infty, in probability.
Proof. Fix an \epsilon > 0. If n is so large that

|EY_n - a| < \epsilon/2,

then

P(|Y_n - a| > \epsilon) \le P(|Y_n - EY_n| > \epsilon/2) \le \frac{4\,Var(Y_n)}{\epsilon^2} \to 0,

as n \to \infty. Note that the second inequality in the computation is the Chebyshev inequality.
This is most often applied to sums of random variables. Let

S_n = X_1 + \ldots + X_n,

where X_i are random variables with finite expectation and variance. Then, without any independence assumption,

ES_n = EX_1 + \ldots + EX_n

and

E(S_n^2) = \sum_{i=1}^{n} E(X_i^2) + \sum_{i \ne j} E(X_i X_j),

Var(S_n) = \sum_{i=1}^{n} Var(X_i) + \sum_{i \ne j} Cov(X_i, X_j).
Recall that

Cov(X_i, X_j) = E(X_i X_j) - EX_i\,EX_j

and

Var(aX) = a^2\,Var(X).
Moreover, if Xi are independent,
Var(X1 + . . . + Xn ) = Var(X1 ) + . . . + Var(Xn ).
Continuing with the review, let us reformulate and prove again the most famous convergence in
probability theorem. We will use the common abbreviation i. i. d. for independent identically
distributed random variables.
Theorem 9.2. Weak law of large numbers. Let X, X_1, X_2, \ldots be i. i. d. random variables with EX = \mu and Var(X) = \sigma^2 < \infty. Let S_n = X_1 + \ldots + X_n. Then, as n \to \infty,

\frac{S_n}{n} \to \mu

in probability.
Proof. Let Y_n = \frac{S_n}{n}. Then, EY_n = \mu and

Var(Y_n) = \frac{1}{n^2}\,Var(S_n) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \to 0,

as n \to \infty, so the claim follows from Theorem 9.1.
It is important to note that the expected value of the capital at the end of the year is maximized when x = 1, but by using this strategy you will eventually lose everything. Let X_n be your capital at the end of year n. Define the average growth rate of your investment as

\lambda = \lim_{n\to\infty} \frac{1}{n}\log\frac{X_n}{x_0},

so that X_n \approx x_0 e^{\lambda n}. We will express \lambda in terms of x; in particular, we will show that it is a nonrandom quantity.
Let Ii = I{stock goes up in year i} . These are independent indicators with EIi = 0.8.
X_n = X_{n-1}(1-x)\cdot 1.06 + X_{n-1}\,x\cdot 1.5\,I_n = X_{n-1}\left(1.06(1-x) + 1.5x\,I_n\right),

and so we can unroll the recurrence to get

X_n = x_0\left(1.06(1-x) + 1.5x\right)^{S_n}\left(1.06(1-x)\right)^{n-S_n},

where S_n = I_1 + \ldots + I_n. Therefore,

\frac{1}{n}\log\frac{X_n}{x_0} = \frac{S_n}{n}\log(1.06 + 0.44x) + \left(1 - \frac{S_n}{n}\right)\log(1.06(1-x)) \to 0.8\log(1.06 + 0.44x) + 0.2\log(1.06(1-x)),
in probability, by the weak law of large numbers. A little calculus shows that this expression is maximized at x = \frac{7}{22}.
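The maximizing fraction can be confirmed numerically. A minimal grid search over the growth-rate expression above:

```python
import math

# Grid search for the investment fraction x that maximizes the growth
# rate lambda(x) = 0.8*log(1.06 + 0.44x) + 0.2*log(1.06*(1 - x)).
def growth(x):
    return 0.8 * math.log(1.06 + 0.44 * x) + 0.2 * math.log(1.06 * (1 - x))

xstar = max((i / 10000 for i in range(9999)), key=growth)
print(xstar)   # close to 7/22 ~ 0.3182
```

The grid maximizer agrees with the calculus answer x = 7/22, and the optimal growth rate is about 0.081 per year.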
Example 9.3. Distribute n balls independently at random into n boxes. Let N_n be the number of empty boxes. Show that \frac{1}{n}N_n converges in probability and identify the limit.
Note that

N_n = I_1 + \ldots + I_n,
where I_i = I_{\{i\text{th box is empty}\}}, but we cannot use the weak law of large numbers as the I_i are not independent. Nevertheless,

EI_i = \left(\frac{n-1}{n}\right)^n = \left(1 - \frac{1}{n}\right)^n,

and so

EN_n = n\left(1 - \frac{1}{n}\right)^n.
Moreover,

E(N_n^2) = EN_n + \sum_{i \ne j} E(I_i I_j),

with

E(I_i I_j) = P(\text{boxes } i \text{ and } j \text{ are both empty}) = \left(1 - \frac{2}{n}\right)^n,

so that

Var(N_n) = E(N_n^2) - (EN_n)^2 = n\left(1 - \frac{1}{n}\right)^n + n(n-1)\left(1 - \frac{2}{n}\right)^n - n^2\left(1 - \frac{1}{n}\right)^{2n}.
Let Y_n = \frac{N_n}{n}. Then

EY_n = \left(1 - \frac{1}{n}\right)^n \to e^{-1},

as n \to \infty, and

Var(Y_n) = \frac{1}{n}\left(1 - \frac{1}{n}\right)^n + \frac{n-1}{n}\left(1 - \frac{2}{n}\right)^n - \left(1 - \frac{1}{n}\right)^{2n} \to 0 + e^{-2} - e^{-2} = 0,

as n \to \infty. Therefore,

Y_n = \frac{N_n}{n} \to e^{-1},

as n \to \infty, in probability.
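The limit 1/e is easy to see in a simulation. A minimal sketch:

```python
import random

# Throw n balls into n boxes at random; the fraction of empty boxes
# should be close to 1/e ~ 0.3679.
random.seed(3)
n = 100000
occupied = [False] * n
for _ in range(n):
    occupied[random.randrange(n)] = True
ratio = occupied.count(False) / n
print(ratio)
```

For n of this size the fraction of empty boxes is already within about 0.01 of e^{-1}.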
Problems
1. Assume that n married couples (amounting to 2n people) are seated at random on 2n seats
around a round table. Let T be the number of couples that sit together. Determine ET and
Var(T ).
2. There are n birds that sit in a row on a wire. Each bird looks left or right with equal probability. Let N be the number of birds not seen by any neighboring bird. Determine, with proof, the constant c so that, as n \to \infty, \frac{1}{n}N \to c in probability.
3. Recall the coupon collector problem: sample from n cards, with replacement, indefinitely, and let N be the number of cards you need to draw so that each of the n different cards is represented. Find a sequence a_n so that, as n \to \infty, N/a_n converges to 1 in probability.
4. Kings and Lakers are playing a best-of-seven playoff series, which means they play until one team wins four games. Assume the Kings win every game independently with probability p.
(There is no difference between home and away games.) Let N be the number of games played.
Compute EN and Var(N ).
5. An urn contains n red and m black balls. Select balls from the urn one by one without replacement. Let X be the number of red balls selected before any black ball, and let Y be the number of red balls between the first and the second black one. Compute EX and EY.
Solutions to problems
1. Let I_i be the indicator of the event that the ith couple sits together. Then, T = I_1 + \ldots + I_n. Moreover,

EI_i = \frac{2}{2n-1}, \qquad E(I_i I_j) = \frac{2^2(2n-3)!}{(2n-1)!} = \frac{4}{(2n-1)(2n-2)},

for any i and j \ne i. Thus,

ET = \frac{2n}{2n-1}

and

E(T^2) = ET + n(n-1)\cdot\frac{4}{(2n-1)(2n-2)} = \frac{2n}{2n-1} + \frac{2n}{2n-1} = \frac{4n}{2n-1},

so

Var(T) = \frac{4n}{2n-1} - \frac{4n^2}{(2n-1)^2} = \frac{4n(n-1)}{(2n-1)^2}.
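A simulation sketch (for n = 5 couples) can be used to check ET = 2n/(2n − 1) = 10/9:

```python
import random

# Seat n = 5 couples at random around 2n seats and estimate ET, the
# expected number of couples sitting together; compare with 2n/(2n-1).
random.seed(5)
n, trials = 5, 100000
total = 0
for _ in range(trials):
    people = list(range(2 * n))        # persons 2i, 2i+1 form couple i
    random.shuffle(people)             # people[j] sits in seat j
    seat = [0] * (2 * n)
    for j, person in enumerate(people):
        seat[person] = j
    for i in range(n):
        d = abs(seat[2 * i] - seat[2 * i + 1])
        if d == 1 or d == 2 * n - 1:   # adjacent seats on the circle
            total += 1
est = total / trials
print(est)   # ET = 10/9 ~ 1.111
```

The empirical average agrees with the formula within Monte Carlo error.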
2. Let I_i indicate the event that bird i is not seen by any other bird. Then, EI_i is \frac{1}{2} if i = 1 or i = n, and \frac{1}{4} otherwise. It follows that

EN = 1 + \frac{n-2}{4} = \frac{n+2}{4}.
Furthermore, I_i and I_j are independent if |i - j| \ge 3 (two birds that have two or more birds between them are observed independently). Thus, Cov(I_i, I_j) = 0 if |i - j| \ge 3. As I_i and I_j are indicators, Cov(I_i, I_j) \le 1 for any i and j. For the same reason, Var(I_i) \le 1. Therefore,

Var(N) = \sum_{i} Var(I_i) + \sum_{i \ne j} Cov(I_i, I_j) \le n + 4n = 5n.

Clearly, if M = \frac{1}{n}N, then EM = \frac{1}{n}EN \to \frac{1}{4} and Var(M) = \frac{1}{n^2}Var(N) \to 0. It follows that c = \frac{1}{4}.
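The limit c = 1/4 can be sanity-checked by simulation. Bird i is unseen exactly when its left neighbor (if any) looks left and its right neighbor (if any) looks right:

```python
import random

# n birds on a wire each look left or right; bird i is unseen iff no
# neighbor looks at it. Check that N/n is close to 1/4.
random.seed(11)
n, trials = 1000, 2000
unseen = 0
for _ in range(trials):
    looks_right = [random.random() < 0.5 for _ in range(n)]
    for i in range(n):
        seen = (i > 0 and looks_right[i - 1]) or \
               (i < n - 1 and not looks_right[i + 1])
        if not seen:
            unseen += 1
avg = unseen / (trials * n)
print(avg)   # close to 1/4
```

The boundary birds are unseen with probability 1/2 and interior ones with probability 1/4, matching the computation above.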
3. Let N_i be the number of coupons needed to get i different coupons after having i - 1 different ones. Then N = N_1 + \ldots + N_n, and the N_i are independent Geometric with success probability \frac{n-i+1}{n}. So,

EN_i = \frac{n}{n-i+1}, \qquad Var(N_i) = \frac{n(i-1)}{(n-i+1)^2},
and, therefore,

EN = n\left(1 + \frac{1}{2} + \ldots + \frac{1}{n}\right),

Var(N) = \sum_{i=1}^{n} \frac{n(i-1)}{(n-i+1)^2} \le n^2\left(1 + \frac{1}{2^2} + \ldots + \frac{1}{n^2}\right) < \frac{\pi^2}{6}\,n^2 < 2n^2.

If a_n = n\log n, then

\frac{1}{a_n}EN \to 1 \quad \text{and} \quad \frac{1}{a_n^2}Var(N) \to 0,

as n \to \infty, so that

\frac{1}{a_n}N \to 1

in probability.
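A short simulation sketch of the coupon collector, averaging N/(n log n) over several runs:

```python
import math
import random

# Coupon collector: the number of draws N needed to see all n cards,
# compared with a_n = n*log(n).
random.seed(2)
n, runs = 2000, 50
total_ratio = 0.0
for _ in range(runs):
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        draws += 1
    total_ratio += draws / (n * math.log(n))
avg = total_ratio / runs
print(avg)   # close to 1 (slightly above for finite n)
```

For finite n the average sits a little above 1 (the harmonic sum is log n plus a constant), consistent with the convergence statement.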
4. Let I_i be the indicator of the event that the ith game is played. Then, EI_1 = EI_2 = EI_3 = EI_4 = 1,

EI_5 = 1 - p^4 - (1-p)^4,

EI_6 = 1 - p^5 - 5p^4(1-p) - 5p(1-p)^4 - (1-p)^5,

EI_7 = \binom{6}{3}p^3(1-p)^3.

Add the seven expectations to get EN. To compute E(N^2), we use the fact that I_i I_j = I_i if i > j, so that E(I_i I_j) = EI_i. So,

E(N^2) = \sum_{i=1}^{7} EI_i + 2\sum_{i > j} E(I_i I_j) = \sum_{i=1}^{7} EI_i + 2\sum_{i=1}^{7} (i-1)EI_i = \sum_{i=1}^{7} (2i-1)EI_i,
and the final result can be obtained by plugging in the EI_i and by the standard formula

Var(N) = E(N^2) - (EN)^2.
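For p = 1/2 the formulas give EN = 4 + 7/8 + 5/8 + 5/16 = 5.8125, which a direct simulation of the series confirms:

```python
import random

# Best-of-seven series with p = 1/2: the formula gives
# EN = 4 + 7/8 + 5/8 + 5/16 = 5.8125.
random.seed(4)

def series_length(p=0.5):
    wins = losses = games = 0
    while wins < 4 and losses < 4:
        games += 1
        if random.random() < p:
            wins += 1
        else:
            losses += 1
    return games

trials = 100000
est = sum(series_length() for _ in range(trials)) / trials
print(est)   # close to 5.8125
```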
5. Imagine the balls ordered in a row, where the ordering specifies the sequence in which they are selected. Let I_i be the indicator of the event that the ith red ball is selected before any black ball. Then, EI_i = \frac{1}{m+1}, the probability that, in a random ordering of the ith red ball and all m black balls, the red one comes first. As X = I_1 + \ldots + I_n, EX = \frac{n}{m+1}.

Now, let J_i be the indicator of the event that the ith red ball is selected between the first and the second black one. Then, EJ_i is the probability that the red ball is second in the ordering of the above m + 1 balls, so EJ_i = EI_i, and EY = EX.
The moment generating function of a random variable X is \phi(t) = E(e^{tX}). For example, if X is Exponential(1),

\phi(t) = \int_0^{\infty} e^{tx}e^{-x}\,dx = \frac{1}{1-t},

only when t < 1. Otherwise, the integral diverges and the moment generating function does not exist. Have in mind that the moment generating function is meaningful only when the integral (or the sum) converges.
Here is where the name comes from: by writing the Taylor expansion in place of e^{tX} and exchanging the sum and the integral (which can be done in many cases),

E(e^{tX}) = E\left[1 + tX + \frac{1}{2}t^2X^2 + \frac{1}{3!}t^3X^3 + \ldots\right] = 1 + tE(X) + \frac{1}{2}t^2E(X^2) + \frac{1}{3!}t^3E(X^3) + \ldots
The expectation of the k-th power of X, m_k = E(X^k), is called the k-th moment of X. In combinatorial language, \phi(t) is the exponential generating function of the sequence m_k. Note also that

\frac{d}{dt}E(e^{tX})\Big|_{t=0} = EX, \qquad \frac{d^2}{dt^2}E(e^{tX})\Big|_{t=0} = E(X^2),
which lets us compute the expectation and variance of a random variable once we know its
moment generating function.
Example 10.2. Compute the moment generating function for a Poisson() random variable.
By definition,

\phi(t) = \sum_{n=0}^{\infty} e^{tn}\,\frac{\lambda^n}{n!}e^{-\lambda} = e^{-\lambda}\sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} = e^{-\lambda + \lambda e^t} = e^{\lambda(e^t - 1)}.
Example 10.3. Compute the moment generating function for a standard Normal random
variable.
By definition,

\phi_X(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{tx}e^{-x^2/2}\,dx = e^{\frac{1}{2}t^2}\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(x-t)^2}\,dx = e^{\frac{1}{2}t^2},

where, in the exponent, we have used

tx - \frac{1}{2}x^2 = -\frac{1}{2}(-2tx + x^2) = -\frac{1}{2}\left((x-t)^2 - t^2\right).
Lemma 10.1. If X1 , X2 , . . . , Xn are independent and Sn = X1 + . . . + Xn , then
Sn (t) = X1 (t) . . . Xn (t).
If Xi is identically distributed as X, then
Sn (t) = (X (t))n .
Proof. This follows from multiplicativity of expectation for independent random variables:
E[etSn ] = E[etX1 etX2 . . . etXn ] = E[etX1 ] E[etX2 ] . . . E[etXn ].
Example 10.4. Compute the moment generating function of a Binomial(n, p) random variable.

Here we have S_n = \sum_{k=1}^{n} I_k, where the indicators I_k = I_{\{\text{success on } k\text{th trial}\}} are independent, so that

\phi_{S_n}(t) = (pe^t + 1 - p)^n.
Why are moment generating functions useful? One reason is the computation of large deviations. Let S_n = X_1 + \ldots + X_n, where the X_i are independent and identically distributed as X, with expectation EX = \mu and moment generating function \phi. At issue is the probability that S_n is far away from its expectation n\mu, more precisely, P(S_n > an), where a > \mu. We can, of course, use the Chebyshev inequality to get an upper bound of order \frac{1}{n}. It turns out that this probability is, for large n, much smaller; the theorem below gives an upper bound that is a much better estimate.
Theorem 10.2. Large deviation bound. Assume that \phi(t) is finite for some t > 0. For any a > \mu,

P(S_n \ge an) \le \exp(-n\,I(a)),

where

I(a) = \sup\{at - \log\phi(t) : t > 0\} > 0.
Proof. For any t > 0, using the Markov inequality,

P(S_n \ge an) = P(e^{tS_n - tan} \ge 1) \le E[e^{tS_n - tan}] = e^{-tan}\phi(t)^n = \exp\left(-n(at - \log\phi(t))\right).

Note that t > 0 is arbitrary, so we can optimize over t to get what the theorem claims. We need to show that I(a) > 0 when a > \mu. For this, note that \psi(t) = at - \log\phi(t) satisfies \psi(0) = 0 and, assuming that one can differentiate inside the integral sign (which one can in this case, but proving this requires abstract analysis beyond our scope),

\psi'(t) = a - \frac{E(Xe^{tX})}{\phi(t)} = a - \frac{\phi'(t)}{\phi(t)},

and, then,

\psi'(0) = a - \mu > 0,

so that \psi(t) > 0 for small enough positive t.
For example, let S_n be the sum of n independent fair die rolls, so that \mu = 3.5 and

\phi(t) = \frac{1}{6}\sum_{i=1}^{6} e^{it},

and we need to compute I(4.5), which, by definition, is the maximum, over t > 0, of the function

4.5\,t - \log\phi(t),

whose graph is in the figure below.
[Figure: the graph of 4.5\,t - \log\phi(t) for t between 0 and 0.9; the function rises from 0 to a maximum of about 0.178 near t \approx 0.37, then decreases.]
It would be nice if we could solve this problem by calculus, but unfortunately we cannot (which is very common in such problems), so we resort to numerical calculations. The maximum is at t \approx 0.37105 and, as a result, I(4.5) is a little larger than 0.178. This gives the upper bound

P(S_n \ge 4.5\,n) \le e^{-0.178\,n},

which is about 0.17 for n = 10, 1.83\cdot 10^{-8} for n = 100, and 4.16\cdot 10^{-78} for n = 1000. The bound \frac{35}{12n} for the same probability, obtained by the Chebyshev inequality, is much too large for large n.
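The "numerical calculations" mentioned above amount to a one-dimensional maximization, sketched here with a simple grid search:

```python
import math

# Numerical maximization of 4.5*t - log(phi(t)) for a fair die,
# where phi(t) = (e^t + ... + e^{6t})/6.
def objective(t):
    phi = sum(math.exp(k * t) for k in range(1, 7)) / 6
    return 4.5 * t - math.log(phi)

tstar = max((i / 10000 for i in range(1, 10000)), key=objective)
print(tstar, objective(tstar))   # t ~ 0.37105, I(4.5) ~ 0.178
```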
Another reason why moment generating functions are useful is that they characterize the
distribution and convergence of distributions. We will state the following theorem without proof.
Theorem 10.3. Assume that the moment generating functions for random variables X, Y, and X_n are finite for all t.

1. If \phi_X(t) = \phi_Y(t) for all t, then P(X \le x) = P(Y \le x) for all x.

2. If \phi_{X_n}(t) \to \phi_X(t) for all t, and P(X \le x) is continuous in x, then P(X_n \le x) \to P(X \le x) for all x.
Example 10.6. Show that the sum of independent Poisson random variables is Poisson.
Here is the situation. We have n independent random variables X_1, \ldots, X_n, such that X_i is Poisson(\lambda_i), with moment generating function \phi_{X_i}(t) = e^{\lambda_i(e^t - 1)}. Therefore,

\phi_{X_1 + \ldots + X_n}(t) = \phi_{X_1}(t)\cdots\phi_{X_n}(t) = e^{(\lambda_1 + \ldots + \lambda_n)(e^t - 1)},
and so X1 + . . . + Xn is Poisson(1 + . . . + n ). Similarly, one can also prove that the sum of
independent Normal random variables is Normal.
We will now reformulate and prove the Central Limit Theorem in a special case, when the moment generating function is finite. This assumption is not needed and the theorem may be applied as it was in the previous chapter.
Theorem 10.4. Assume that X is a random variable, with EX = \mu and Var(X) = \sigma^2, and assume that \phi_X(t) is finite for all t. Let S_n = X_1 + \ldots + X_n, where X_1, \ldots, X_n are i. i. d. and distributed as X. Let

T_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.
Then, for every x,

P(T_n \le x) \to P(Z \le x),

as n \to \infty, where Z is a standard Normal random variable.
Proof. Let Y = \frac{X - \mu}{\sigma} and Y_i = \frac{X_i - \mu}{\sigma}. Then the Y_i are i. i. d. with EY_i = 0, Var(Y_i) = 1, and

T_n = \frac{Y_1 + \ldots + Y_n}{\sqrt{n}}.

As \phi_Y(s) = 1 + \frac{s^2}{2} + o(s^2) near s = 0,

\phi_{T_n}(t) = \left(\phi_Y\left(\frac{t}{\sqrt{n}}\right)\right)^n \approx \left(1 + \frac{t^2}{2n}\right)^n \to e^{\frac{t^2}{2}},

the moment generating function of a standard Normal random variable, and the conclusion follows from Theorem 10.3.
Problems
1. A player selects three cards at random from a full deck and collects as many dollars as the
number of red cards among the three. Assume 10 people play this game once and let X be the
number of their combined winnings. Compute the moment generating function of X.
2. Compute the moment generating function of a uniform random variable on [0, 1].
3. This exercise was the original motivation for the study of large deviations by the Swedish probabilist Harald Cramér, who was working as an insurance company consultant in the 1930s. Assume that the insurance company receives a steady stream of payments, amounting to c (a deterministic number) per day. Also, every day, they receive a certain amount in claims; assume this amount is Normal with expectation \mu and variance \sigma^2. Assume also day-to-day independence of the claims. Regulators require that, within a period of n days, the company must be able to cover its claims by the payments received in the same period, or else. Intimidated by the fierce regulators, the company wants to fail to satisfy their requirement with probability less than some small number \delta. The parameters n, \mu, and \sigma are fixed, but c is a quantity the company controls. Determine c.
4. Assume that S_n is Binomial(n, p). For every a > p, determine, by calculus, the large deviation bound for P(S_n \ge an).
5. The only information you have about a random variable X is that its moment generating function satisfies the inequality \phi(t) \le e^{t^4} for all t \ge 0. Find an upper bound for P(X \ge 32).
6. Using the central limit theorem for a sum of Poisson random variables, compute

\lim_{n\to\infty} e^{-n}\sum_{i=0}^{n} \frac{n^i}{i!}.
Solutions to problems
1. Compute the moment generating function for a single game, then raise it to the 10th power:

\phi(t) = \left( \frac{1}{\binom{52}{3}} \left( \binom{26}{3} + \binom{26}{2}\binom{26}{1}e^{t} + \binom{26}{1}\binom{26}{2}e^{2t} + \binom{26}{3}e^{3t} \right) \right)^{10}.
2. Answer: \phi(t) = \int_0^1 e^{tx}\,dx = \frac{e^t - 1}{t} for t \ne 0, and \phi(0) = 1.
3. By the large deviation bound, P(S_n \ge cn) \le e^{-In}, where, as the daily claims are Normal(\mu, \sigma^2) with \log\phi(t) = \mu t + \frac{1}{2}\sigma^2 t^2, I is the maximum over t > 0 of

ct - \mu t - \frac{1}{2}\sigma^2 t^2,

and the maximum equals

I = \frac{1}{2}\left(\frac{c - \mu}{\sigma}\right)^2.

Finally, we solve the equation

e^{-In} = \delta,

to get

c = \mu + \sigma\sqrt{\frac{2\log\frac{1}{\delta}}{n}}.
4. Here \phi(t) = pe^t + 1 - p, and we need to maximize at - \log(pe^t + 1 - p) over t > 0. Setting the derivative to zero gives e^t = \frac{a(1-p)}{p(1-a)}, and plugging this back in yields

I(a) = a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}.
5. For every t > 0, P(X \ge 32) \le e^{-32t}\phi(t) \le e^{t^4 - 32t}. Next, we compute the minimum of h(t) = t^4 - 32t over t > 0. As h'(t) = 4t^3 - 32, we see that the minimum is achieved when t = 2, with h(2) = 16 - 64 = -48. Therefore, P(X \ge 32) \le e^{-48} \approx 1.4\cdot 10^{-21}.
6. Let S_n be the sum of n i. i. d. Poisson(1) random variables. Thus, S_n is Poisson(n) and ES_n = n. By the Central Limit Theorem, P(S_n \le n) \to \frac{1}{2}, and P(S_n \le n) is exactly the expression in question. So, the answer is \frac{1}{2}.
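The limit can be checked numerically; computing the sum in log space (via `math.lgamma`) avoids the underflow of e^{-n} for large n:

```python
import math

# P(Poisson(n) <= n), computed stably in log space; by the CLT for sums
# of Poisson(1) variables it tends to 1/2 as n grows.
def poisson_cdf_at_mean(n):
    return sum(math.exp(i * math.log(n) - n - math.lgamma(i + 1))
               for i in range(n + 1))

val = poisson_cdf_at_mean(1000)
print(val)   # close to 1/2
```

For n = 1000 the value is already within about 0.01 of 1/2.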
7. We estimate P(S_n \le \sqrt{n}), for a large n, in the three cases. For (a), EX_i = 0 and Var(X_i) = 2/3; thus, by the Central Limit Theorem, the probability converges to P(Z \le \sqrt{3/2}). For (b), EX_i = 0 and Var(X_i) = 1, and the probability converges to P(Z \le 1). (In both cases, Z is standard Normal.) Thus we know that, in the limit, the probability in (b) is smaller than the one in (a). For (c), EX_i = 1/2 and so, by the large deviation bound, P(S_n \le 0.25\,n) is exponentially small in n, and then so is P(S_n \le \sqrt{n}) (as \sqrt{n} \le 0.25\,n for n \ge 16). Thus, for large n, the probability in (c) is smaller than the one in (b), which in turn is smaller than the one in (a).
Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)};

Conditional probability mass function: if (X, Y) has joint probability mass function p, then p_X(x|Y = y) = \frac{p(x, y)}{p_Y(y)} = P(X = x|Y = y);

Conditional density: if (X, Y) has joint density f, then f_X(x|Y = y) = \frac{f(x, y)}{f_Y(y)}.
For example, if X and Y are independent Poisson, with EX = \lambda_1 and EY = \lambda_2, then

P(X = k|X + Y = n) = \frac{P(X = k, X + Y = n)}{P(X + Y = n)} = \frac{P(X = k)P(Y = n - k)}{P(X + Y = n)} = \frac{\frac{\lambda_1^k}{k!}e^{-\lambda_1}\cdot\frac{\lambda_2^{n-k}}{(n-k)!}e^{-\lambda_2}}{\frac{(\lambda_1 + \lambda_2)^n}{n!}e^{-(\lambda_1 + \lambda_2)}} = \binom{n}{k}\left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^k\left(\frac{\lambda_2}{\lambda_1 + \lambda_2}\right)^{n-k}.
Therefore, conditioned on X + Y = n, X is Binomial(n, \frac{\lambda_1}{\lambda_1 + \lambda_2}).
Example 11.2. Let T1 , T2 be two independent Exponential() random variables and let S1 =
T1 , S2 = T1 + T2 . Compute fS1 (s1 |S2 = s2 ).
First,

P(S_1 \le s_1, S_2 \le s_2) = P(T_1 \le s_1, T_1 + T_2 \le s_2) = \int_0^{s_1} dt_1 \int_0^{s_2 - t_1} f_{T_1, T_2}(t_1, t_2)\,dt_2,

so that the joint density of (S_1, S_2) is

f(s_1, s_2) = \begin{cases} \lambda^2 e^{-\lambda s_2} & \text{if } 0 \le s_1 \le s_2, \\ 0 & \text{otherwise.} \end{cases}

As

f_{S_2}(s_2) = \int_0^{s_2} \lambda^2 e^{-\lambda s_2}\,ds_1 = \lambda^2 s_2 e^{-\lambda s_2},

we conclude that

f_{S_1}(s_1|S_2 = s_2) = \frac{\lambda^2 e^{-\lambda s_2}}{\lambda^2 s_2 e^{-\lambda s_2}} = \frac{1}{s_2},

for 0 \le s_1 \le s_2; that is, given S_2 = s_2, S_1 is uniform on [0, s_2].
Let N be the number of additional rolls. This number is Geometric, if we know U, so let us condition on the value of U. Given U = n, a single roll succeeds with probability \frac{7-n}{6}, so we know that

E(N|U = n) = \frac{6}{7-n},

and so, by Bayes' formula for expectation,

E(N) = \sum_{n=1}^{6} E(N|U = n)\,P(U = n) = \sum_{n=1}^{6} \frac{6}{7-n}\cdot\frac{1}{6} = \sum_{n=1}^{6} \frac{1}{7-n} = 1 + \frac{1}{2} + \frac{1}{3} + \ldots + \frac{1}{6} = 2.45.
Now, let U be a uniform random variable on [0, 1], that is, the result of a call of a random number generator. Once we know U, generate additional independent uniform random variables (still on [0, 1]), X_1, X_2, \ldots, until we get one that equals or exceeds U. Let N be the number of additional calls of the generator, that is, the smallest n for which X_n \ge U. Determine the p. m. f. of N and EN.
Given that U = u, N is Geometric(1 - u). Thus, for k = 1, 2, \ldots,

P(N = k|U = u) = u^{k-1}(1 - u)

and so

P(N = k) = \int_0^1 P(N = k|U = u)\,du = \int_0^1 u^{k-1}(1-u)\,du = \frac{1}{k} - \frac{1}{k+1} = \frac{1}{k(k+1)}.
In fact, a slick alternate derivation shows that P(N = k) does not depend on the distribution of the random variables (which we assumed to be uniform), as long as it is continuous, so that there are no "ties" (i.e., no two random variables are equal). Namely, the event \{N = k\} happens exactly when X_k is the largest and U is the second largest among X_1, X_2, \ldots, X_k, U. All orders, by diminishing size, of these k + 1 random numbers are equally likely, so the probability that X_k and U are the first and the second is \frac{1}{k+1}\cdot\frac{1}{k}.

It follows that

EN = \sum_{k=1}^{\infty} k\cdot\frac{1}{k(k+1)} = \sum_{k=1}^{\infty} \frac{1}{k+1} = \infty,

which also follows from

EN = \int_0^1 E(N|U = u)\,du = \int_0^1 \frac{1}{1-u}\,du = \infty.
As we see from this example, random variables with infinite expectation are more common and natural than one might suppose.
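The p. m. f. computed above is easy to verify by simulating the procedure directly:

```python
import random

# Draw U uniform on [0,1], then call the generator until some X_n >= U;
# check P(N = 1) = 1/2 and P(N = 2) = 1/6 from P(N = k) = 1/(k(k+1)).
random.seed(8)
trials = 100000
counts = {}
for _ in range(trials):
    u = random.random()
    k = 1
    while random.random() < u:   # X_k < u: keep calling the generator
        k += 1
    counts[k] = counts.get(k, 0) + 1
p1, p2 = counts[1] / trials, counts[2] / trials
print(p1, p2)
```

The empirical frequencies match 1/(1·2) = 1/2 and 1/(2·3) = 1/6.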
Example 11.4. The number N of customers entering a store on a given day is Poisson(\lambda). Each of them buys something independently with probability p. Compute the probability that exactly k people buy something.
Let X be the number of people who buy something. Why should X be Poisson? Approximate: let n be the (large) number of people in the town and \epsilon the probability that any particular one of them enters the store on a given day. Then, by the Poisson approximation, with \lambda = n\epsilon, N is Binomial(n, \epsilon) \approx Poisson(\lambda) and X is Binomial(n, \epsilon p) \approx Poisson(\lambda p). A more straightforward way to see this is as follows:
P(X = k) = \sum_{n=k}^{\infty} P(X = k|N = n)P(N = n)
= \sum_{n=k}^{\infty} \binom{n}{k}p^k(1-p)^{n-k}\,\frac{\lambda^n}{n!}e^{-\lambda}
= e^{-\lambda}\sum_{n=k}^{\infty} \frac{n!}{k!(n-k)!}\,p^k(1-p)^{n-k}\,\frac{\lambda^n}{n!}
= \frac{e^{-\lambda}(\lambda p)^k}{k!}\sum_{n=k}^{\infty} \frac{((1-p)\lambda)^{n-k}}{(n-k)!}
= \frac{e^{-\lambda}(\lambda p)^k}{k!}\,e^{(1-p)\lambda}
= \frac{e^{-\lambda p}(\lambda p)^k}{k!}.
Therefore,

m_k = E(N_k) = \sum_{n=k-1}^{\infty} \left(n + 1 + m_k(1-p)\right)P(N_{k-1} = n) = m_{k-1} + 1 + m_k(1-p),

which, solved for m_k, gives

m_k = \frac{1}{p} + \frac{m_{k-1}}{p}.

This recursion can be unrolled:

m_1 = \frac{1}{p},

m_2 = \frac{1}{p} + \frac{1}{p^2},

\vdots

m_k = \frac{1}{p} + \frac{1}{p^2} + \ldots + \frac{1}{p^k}.
In fact, we can even compute the moment generating function of N_k by a different conditioning. Let F_a, a = 0, \ldots, k-1, be the event that the tosses begin with a Heads, followed by Tails, and let F_k be the event that the first k tosses are Heads. One of F_0, \ldots, F_k must happen, therefore, by Bayes' formula,

E[e^{tN_k}] = \sum_{a=0}^{k} E[e^{tN_k}|F_a]\,P(F_a).

If F_k happens, then N_k = k; otherwise, a + 1 tosses are wasted and one has to start over with the same conditions as at the beginning. Therefore,

E[e^{tN_k}] = \sum_{a=0}^{k-1} e^{t(a+1)}p^a(1-p)\,E[e^{tN_k}] + e^{tk}p^k,

and solving for E[e^{tN_k}] gives

E[e^{tN_k}] = \frac{p^k e^{tk}}{1 - (1-p)\sum_{a=0}^{k-1} e^{t(a+1)}p^a} = \frac{p^k e^{tk}(1 - pe^t)}{1 - pe^t - (1-p)e^t(1 - p^k e^{tk})},

for small enough t > 0. In particular, m_k = \frac{d}{dt}E[e^{tN_k}]\big|_{t=0}.
Example 11.6. Gambler's ruin. Fix a probability p \in (0, 1). Play a sequence of games; in each game you (independently) win $1 with probability p and lose $1 with probability 1 - p. Assume that your initial capital is i dollars and that you play until you either reach a predetermined amount N, or you lose all your money. For example, if you play a fair game, p = \frac{1}{2}, while, if you bet on Red at roulette, p = \frac{9}{19}. You are interested in the probability P_i that you leave the game happy with your desired amount N.
Another interpretation is that of a simple random walk on the integers. Start at i, 0 \le i \le N, and make steps in discrete time units: each time (independently) move rightward by 1 (i.e., add 1 to your position) with probability p and move leftward by 1 (i.e., subtract 1 from your position) with probability 1 - p. In other words, if the position of the walker at time n is S_n, then

S_n = i + X_1 + \ldots + X_n,

where the X_k are i. i. d. and P(X_k = 1) = p, P(X_k = -1) = 1 - p. This random walk is one of the
very basic random (or, if you prefer a Greek word, stochastic) processes. The probability Pi is
the probability that the walker visits N before a visit to 0.
We condition on the first step X_1 the walker makes, i.e., the outcome of the first bet. Then, by Bayes' formula,
Pi = P (visit N before 0|X1 = 1)P (X1 = 1) + P (visit N before 0|X1 = 1)P (X1 = 1)
= Pi+1 p + Pi1 (1 p),
which gives us a recurrence relation, which we can rewrite as

P_{i+1} - P_i = \frac{1-p}{p}\left(P_i - P_{i-1}\right).
As P_0 = 0,

P_2 - P_1 = \frac{1-p}{p}\,P_1,

P_3 - P_2 = \frac{1-p}{p}(P_2 - P_1) = \left(\frac{1-p}{p}\right)^2 P_1,

\vdots

P_i - P_{i-1} = \left(\frac{1-p}{p}\right)^{i-1} P_1, \quad \text{for } i = 1, \ldots, N.

We conclude that

P_i = \left(1 + \frac{1-p}{p} + \left(\frac{1-p}{p}\right)^2 + \ldots + \left(\frac{1-p}{p}\right)^{i-1}\right)P_1.
Therefore,

P_i = \begin{cases} \dfrac{1 - \left(\frac{1-p}{p}\right)^i}{1 - \frac{1-p}{p}}\,P_1 & \text{if } p \ne \frac{1}{2}, \\[1ex] i\,P_1 & \text{if } p = \frac{1}{2}, \end{cases}

and, as P_N = 1, we can solve for P_1 to get

P_i = \begin{cases} \dfrac{1 - \left(\frac{1-p}{p}\right)^i}{1 - \left(\frac{1-p}{p}\right)^N} & \text{if } p \ne \frac{1}{2}, \\[1ex] \dfrac{i}{N} & \text{if } p = \frac{1}{2}. \end{cases}
For example, if N = 10 and p = 0.6, then P_5 \approx 0.8836; if N = 1000 and p = \frac{9}{19}, then P_{900} \approx 2.6561\cdot 10^{-5}.
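The closed-form probability is easy to cross-check against a direct simulation of the walk:

```python
import random

# Gambler's ruin: closed-form P_i versus direct simulation
# for i = 5, N = 10, p = 0.6.
def ruin_win_prob(i, N, p):
    if p == 0.5:
        return i / N
    r = (1 - p) / p
    return (1 - r ** i) / (1 - r ** N)

exact = ruin_win_prob(5, 10, 0.6)

random.seed(6)
trials, wins = 20000, 0
for _ in range(trials):
    x = 5
    while 0 < x < 10:
        x += 1 if random.random() < 0.6 else -1
    wins += x == 10
sim = wins / trials
print(exact, sim)   # both close to 0.8836
```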
Example 11.7. Bold Play. Assume that the only game available to you is a game in which you can place even bets at any amount, and that you win each of these bets with probability p. Your initial capital is x \in [0, N], a real number, and again you want to increase it to N before going broke. Your bold strategy (which can be proved to be the best) is to bet everything unless you are close enough to N that a smaller amount will do:

1. Bet x if x \le \frac{N}{2}.

2. Bet N - x if x \ge \frac{N}{2}.
We can, without loss of generality, fix our monetary unit so that N = 1. We now define

P(x) = P(\text{reach } 1 \text{ before reaching } 0).
By conditioning on the outcome of your first bet,

P(x) = \begin{cases} p\,P(2x) & \text{if } x \in \left[0, \frac{1}{2}\right], \\ p\cdot 1 + (1-p)\,P(2x-1) & \text{if } x \in \left[\frac{1}{2}, 1\right]. \end{cases}
For each positive integer n, this is a linear system for P\left(\frac{k}{2^n}\right), k = 0, \ldots, 2^n, which can be solved. For example:

When n = 1, P\left(\frac{1}{2}\right) = p.

When n = 2, P\left(\frac{1}{4}\right) = p^2, P\left(\frac{3}{4}\right) = p + (1-p)p.

When n = 3, P\left(\frac{1}{8}\right) = p^3, P\left(\frac{3}{8}\right) = p\,P\left(\frac{3}{4}\right) = p^2 + p^2(1-p), P\left(\frac{5}{8}\right) = p + p^2(1-p), P\left(\frac{7}{8}\right) = p + p(1-p) + p(1-p)^2.
It is easy to verify that P(x) = x, for all x, if p = \frac{1}{2}. Moreover, it can be computed that P(0.9) \approx 0.8794 for p = \frac{9}{19}, which is not too different from a fair game. The figure below displays the graphs of the functions P(x) for p = 0.1, 0.25, \frac{9}{19}, and \frac{1}{2}.
[Figure: the graphs of P(x) on [0, 1] for p = 0.1, 0.25, \frac{9}{19}, and \frac{1}{2}.]
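The value P(0.9) ≈ 0.8794 for p = 9/19 can be reproduced by simulating bold play directly:

```python
import random

# Simulate bold play starting from x = 0.9 with p = 9/19; the success
# probability P(0.9) should be close to 0.8794.
random.seed(9)
p = 9 / 19
trials, wins = 200000, 0
for _ in range(trials):
    x = 0.9
    while 1e-12 < x < 1 - 1e-12:
        bet = min(x, 1 - x)                  # the bold strategy
        x += bet if random.random() < p else -bet
    wins += x >= 0.5                         # ended at 1 rather than 0
est = wins / trials
print(est)
```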
A few remarks for the more mathematically inclined: The function P(x) is continuous, but nowhere differentiable on [0, 1]. It is thus a highly irregular function despite the fact that it is strictly increasing. In fact, P(x) is the distribution function of a certain random variable Y, that is, P(x) = P(Y \le x). This random variable with values in (0, 1) is defined by its binary expansion

Y = \sum_{j=1}^{\infty} D_j\,\frac{1}{2^j},

where its binary digits D_j are independent and equal to 1 with probability 1 - p and, thus, 0 with probability p.
Theorem 11.1. Expectation and variance of sums with a random number of terms.

Assume that X, X_1, X_2, \ldots are i. i. d. random variables with EX = \mu and Var(X) = \sigma^2, and let N be a nonnegative integer random variable, independent of all the X_i. Let

S = \sum_{i=1}^{N} X_i.

Then

ES = \mu\,EN, \qquad Var(S) = \sigma^2\,EN + \mu^2\,Var(N).
Proof. Let S_n = X_1 + \ldots + X_n. We have

E[S|N = n] = ES_n = n\,EX_1 = n\mu.

Then,

ES = \sum_{n=0}^{\infty} n\mu\,P(N = n) = \mu\,EN.

Similarly,

E(S^2) = \sum_{n=0}^{\infty} E[S^2|N = n]P(N = n) = \sum_{n=0}^{\infty} E(S_n^2)P(N = n) = \sum_{n=0}^{\infty} \left(Var(S_n) + (ES_n)^2\right)P(N = n) = \sum_{n=0}^{\infty} \left(n\sigma^2 + n^2\mu^2\right)P(N = n) = \sigma^2\,EN + \mu^2\,E(N^2).

Therefore,

Var(S) = E(S^2) - (ES)^2 = \sigma^2\,EN + \mu^2\,E(N^2) - \mu^2(EN)^2 = \sigma^2\,EN + \mu^2\,Var(N).
Example 11.8. Toss a fair coin until you toss Heads for the rst time. Each time you toss
Tails, roll a die and collect as many dollars as the number on the die. Let S be your total
winnings. Compute ES and Var(S).
This fits into the above context, with X_i, the numbers rolled on the die, and N, the number of Tails tossed before the first Heads. We know that

EX_1 = \frac{7}{2}, \qquad Var(X_1) = \frac{35}{12}, \qquad EN = 1, \qquad Var(N) = 2.

Plug in to get ES = \frac{7}{2} and Var(S) = \frac{35}{12}\cdot 1 + \left(\frac{7}{2}\right)^2\cdot 2 = \frac{329}{12}.
Example 11.9. We now take another look at Example 8.11. We will rename the number of days in purgatory as S, to fit it better into the present context, and call the three doors 0, 1, and 2. Let N be the number of times your choice of door is not door 0; then N is Geometric(\frac{1}{3}) minus 1. Any time you do not pick door 0, you pick door 1 or 2 with equal probability. Therefore, each X_i is 1 or 2 with probability \frac{1}{2} each. (Note that the X_i are not 0, 1, or 2 with probability \frac{1}{3} each!)

It follows that

EN = 3 - 1 = 2, \qquad Var(N) = \frac{1 - \frac{1}{3}}{\left(\frac{1}{3}\right)^2} = 6,

and

EX_1 = \frac{3}{2}, \qquad Var(X_1) = \frac{1^2 + 2^2}{2} - \frac{9}{4} = \frac{1}{4}.

Therefore, ES = EN\cdot EX_1 = 3, which, of course, agrees with the answer in Example 8.11. Moreover,

Var(S) = \frac{1}{4}\cdot 2 + \frac{9}{4}\cdot 6 = 14.
Problems
1. Toss an unfair coin with probability p \in (0, 1) of Heads n times. By conditioning on the outcome of the last toss, compute the probability that you get an even number of Heads.
2. Let X1 and X2 be independent Geometric(p) random variables. Compute the conditional
p. m. f. of X1 given X1 + X2 = n, n = 2, 3, . . .
3. Assume that the joint density of (X, Y) is

f(x, y) = \frac{1}{y}e^{-y}, \quad \text{for } 0 < x < y,

and 0 otherwise. Compute E(X^2|Y = y).
If you press B, you will have to wait three minutes and then be able to press one of the buttons again. Assume that you cannot see the buttons, so each time you press one of them at random. Compute the expected time of your confinement.
5. Assume that a Poisson number with expectation 10 of customers enters the store. For
promotion, each of them receives an in-store credit uniformly distributed between 0 and 100
dollars. Compute the expectation and variance of the amount of credit the store will give.
6. Generate a random number \Lambda uniformly on [0, 1]; once you observe the value of \Lambda, say \Lambda = \lambda, generate a Poisson random variable N with expectation \lambda. Before you start the random experiment, what is the probability that N \ge 3?
7. A coin has probability p of Heads. Alice flips it first, then Bob, then Alice, etc., and the winner is the first to flip Heads. Compute the probability that Alice wins.
Solutions to problems
1. Let p_n be the probability of an even number of Heads in n tosses. Conditioning on the last toss,

p_n = p(1 - p_{n-1}) + (1-p)p_{n-1} = p + (1-2p)p_{n-1},

and so

p_n - \frac{1}{2} = (1-2p)\left(p_{n-1} - \frac{1}{2}\right),

and then

p_n = \frac{1}{2} + C(1-2p)^n.

As p_0 = 1, we get C = \frac{1}{2} and, finally,

p_n = \frac{1}{2} + \frac{1}{2}(1-2p)^n.
2. We have, for i = 1, \ldots, n-1,

P(X_1 = i|X_1 + X_2 = n) = \frac{p(1-p)^{i-1}\,p(1-p)^{n-i-1}}{\sum_{k=1}^{n-1} p(1-p)^{k-1}\,p(1-p)^{n-k-1}} = \frac{1}{n-1}.
3. The conditional density of X given Y = y is f_X(x|Y = y) = \frac{1}{y}, for 0 < x < y (i.e., uniform on [0, y]), and so the answer is E(X^2|Y = y) = \frac{y^2}{3}.
4. Let I be the indicator of the event that you press A, and X the time of your confinement in minutes. Then,

EX = E(X|I = 0)P(I = 0) + E(X|I = 1)P(I = 1) = \frac{1}{2}(3 + EX) + \frac{1}{2}\left(\frac{1}{3}\cdot 2 + \frac{2}{3}(5 + EX)\right),

and the answer is EX = 21.
5. Let N be the number of customers and X the amount of credit, X = \sum_{i=1}^{N} X_i, where the X_i are independent uniform on [0, 100]. So, EX_i = 50 and Var(X_i) = \frac{100^2}{12}. Then, EX = 50\,EN = 500 and Var(X) = \frac{100^2}{12}\cdot 10 + 50^2\cdot 10.
6. The answer is

P(N \ge 3) = \int_0^1 P(N \ge 3|\Lambda = \lambda)\,d\lambda = \int_0^1 \left(1 - \left(1 + \lambda + \frac{\lambda^2}{2}\right)e^{-\lambda}\right)d\lambda.

7. If P is the probability that Alice wins, then, conditioning on the first two flips, P = p + (1-p)^2 P, so that P = \frac{1}{2-p}.
\frac{1}{n}D_n converges to c in probability, as n \to \infty.
2. Consider the following game, which will also appear in problem 4. Toss a coin with probability
p of Heads. If you toss Heads, you win $2, if you toss Tails, you win $1.
(a) Assume that you play this game n times and let Sn be your combined winnings. Compute
the moment generating function of Sn , that is, E(etSn ).
(b) Keep the assumptions from (a). Explain how you would find an upper bound for the probability that S_n is more than 10% larger than its expectation. Do not compute.
(c) Now you roll a fair die and you play the game as many times as the number you roll. Let Y
be your total winnings. Compute E(Y ) and Var(Y ).
3. The joint density of X and Y is

f(x, y) = \frac{e^{-x/y}\,e^{-y}}{y},

for x > 0 and y > 0, and 0 otherwise. Compute E(X|Y = y).
Since

D_n = \sum_{i=1}^{n} I_i,

where

I_i = I_{\{i\text{th player gets four different suits}\}}

and

EI_i = \frac{n^4}{\binom{4n}{4}},

the answer is

ED_n = \frac{n^5}{\binom{4n}{4}}.

Moreover, for i \ne j,

E(I_i I_j) = \frac{n^4(n-1)^4}{\binom{4n}{4}\binom{4n-4}{4}}.

Therefore,

E(D_n^2) = \sum_{i \ne j} E(I_i I_j) + \sum_{i} EI_i = \frac{n^5(n-1)^5}{\binom{4n}{4}\binom{4n-4}{4}} + \frac{n^5}{\binom{4n}{4}}

and

Var(D_n) = \frac{n^5(n-1)^5}{\binom{4n}{4}\binom{4n-4}{4}} + \frac{n^5}{\binom{4n}{4}} - \left(\frac{n^5}{\binom{4n}{4}}\right)^2.
converges to c in probability as n \to \infty.

Solution:

Let Y_n = \frac{1}{n}D_n. Then

EY_n = \frac{n^4}{\binom{4n}{4}} = \frac{6n^3}{(4n-1)(4n-2)(4n-3)} \to \frac{6}{4^3} = \frac{3}{32},

as n \to \infty. Moreover,

Var(Y_n) = \frac{1}{n^2}Var(D_n) = \frac{n^3(n-1)^5}{\binom{4n}{4}\binom{4n-4}{4}} + \frac{n^3}{\binom{4n}{4}} - \left(\frac{n^4}{\binom{4n}{4}}\right)^2 \to \left(\frac{3}{32}\right)^2 + 0 - \left(\frac{3}{32}\right)^2 = 0,

as n \to \infty, so the statement holds with c = \frac{3}{32}.
2. Consider the following game, which will also appear in problem 4. Toss a coin with
probability p of Heads. If you toss Heads, you win $2, if you toss Tails, you win $1.
(a) Assume that you play this game n times and let Sn be your combined winnings.
Compute the moment generating function of Sn , that is, E(etSn ).
Solution:

E(e^{tS_n}) = \left(pe^{2t} + (1-p)e^{t}\right)^n.
(b) Keep the assumptions from (a). Explain how you would find an upper bound for the probability that S_n is more than 10% larger than its expectation. Do not compute.
Solution:
As EX_1 = 2p + 1 - p = 1 + p, ES_n = n(1 + p), and we need to find an upper bound for P(S_n > 1.1\,n(1+p)). When 1.1(1+p) \ge 2, i.e., p \ge \frac{9}{11}, this is an impossible event, so the probability is 0. When p < \frac{9}{11}, the bound is

P(S_n > 1.1\,n(1+p)) \le e^{-I(1.1 + 1.1p)\,n},

where I(1.1 + 1.1p) > 0 and is given by

I(1.1 + 1.1p) = \sup\{(1.1 + 1.1p)t - \log(pe^{2t} + (1-p)e^t) : t > 0\}.
(c) Now you roll a fair die and you play the game as many times as the number you roll.
Let Y be your total winnings. Compute E(Y ) and Var(Y ).
Solution:

Let Y = X_1 + \ldots + X_N, where the X_i are independently and identically distributed with P(X_1 = 2) = p and P(X_1 = 1) = 1 - p, and P(N = k) = \frac{1}{6}, for k = 1, \ldots, 6. We know that

EY = EN\cdot EX_1, \qquad Var(Y) = Var(X_1)\,EN + (EX_1)^2\,Var(N),

and

EN = \frac{7}{2}, \qquad Var(N) = \frac{35}{12}.

Moreover,

EX_1 = 2p + 1 - p = 1 + p \quad \text{and} \quad E(X_1^2) = 4p + 1 - p = 1 + 3p,

so that

Var(X_1) = 1 + 3p - (1+p)^2 = p - p^2.

The answer is

EY = \frac{7}{2}(1+p), \qquad Var(Y) = \frac{7}{2}(p - p^2) + \frac{35}{12}(1+p)^2.
3. The joint density of X and Y is

f(x, y) = \frac{e^{-x/y}\,e^{-y}}{y},

for x > 0 and y > 0, and 0 otherwise. Compute E(X|Y = y).
Solution:

We have

E(X|Y = y) = \int_0^{\infty} x\,f_X(x|Y = y)\,dx.

As

f_Y(y) = \int_0^{\infty} \frac{e^{-x/y}\,e^{-y}}{y}\,dx = \frac{e^{-y}}{y}\int_0^{\infty} e^{-x/y}\,dx = \frac{e^{-y}}{y}\left[-y\,e^{-x/y}\right]_{x=0}^{x=\infty} = e^{-y}, \quad \text{for } y > 0,

we have

f_X(x|Y = y) = \frac{f(x, y)}{f_Y(y)} = \frac{e^{-x/y}}{y}, \quad \text{for } x, y > 0,

and so

E(X|Y = y) = \int_0^{\infty} \frac{x}{y}\,e^{-x/y}\,dx = y\int_0^{\infty} ze^{-z}\,dz = y.
4. Consider the following game again: Toss a coin with probability p of Heads. If you toss
Heads, you win $2, if you toss Tails, you win $1. Assume that you start with no money
and you have to quit the game when your winnings match or exceed the dollar amount
n. (For example, assume n = 5 and you have $3: if your next toss is Heads, you collect
$5 and quit; if your next toss is Tails, you play once more. Note that, at the amount you
quit, your winnings will be either n or n + 1.) Let pn be the probability that you will quit
with winnings exactly n.
(a) What is p1 ? What is p2 ?
Solution:
We have

p_1 = 1 - p \quad \text{and} \quad p_2 = (1-p)^2 + p.

Also, p_0 = 1.
(b) Write down the recursive equation which expresses p_n in terms of p_{n-1} and p_{n-2}.

Solution:

We have

p_n = p\,p_{n-2} + (1-p)\,p_{n-1}.
The recursion has the characteristic equation \lambda^2 = (1-p)\lambda + p, with solutions

\lambda = \frac{1 - p \pm \sqrt{(1-p)^2 + 4p}}{2} = \frac{(1-p) \pm (1+p)}{2} = 1 \text{ or } -p.

This gives

p_n = a + b(-p)^n,

with

a + b = 1, \quad a - bp = 1 - p.

We get

a = \frac{1}{1+p}, \quad b = \frac{p}{1+p},

and then

p_n = \frac{1}{1+p} + \frac{p}{1+p}(-p)^n.
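The closed form can be verified exactly against the recursion using rational arithmetic (here with the illustrative choice p = 3/10):

```python
from fractions import Fraction

# Exact check (p = 3/10) that the closed form
# p_n = 1/(1+p) + p/(1+p)*(-p)^n satisfies the recursion
# p_n = p*p_{n-2} + (1-p)*p_{n-1} with p_0 = 1, p_1 = 1 - p.
p = Fraction(3, 10)
rec = [Fraction(1), 1 - p]
for n in range(2, 21):
    rec.append(p * rec[n - 2] + (1 - p) * rec[n - 1])
closed = [1 / (1 + p) + p / (1 + p) * (-p) ** n for n in range(21)]
print(rec == closed)   # True
```

Both sequences agree term by term, since they share the same initial conditions and satisfy the same linear recursion.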
Example 12.1. Take your favorite book. Start, at step 0, by choosing a random letter. Pick
one of the five random procedures described below and perform it at each time step n = 1, 2, . . .
1. Pick another random letter.
2. Choose a random occurrence of the letter obtained at the previous ((n - 1)st) step, then
pick the letter following it in the text. Use the convention that the letter that follows the
last letter in the text is the first letter in the text.
3. At step 1 use procedure (2), while for n ≥ 2 choose a random occurrence of the two letters
obtained, in order, in the previous two steps, then pick the following letter.
4. Choose a random occurrence of all previously chosen letters, in order, then pick the following letter.
5. At step n, perform procedure (1) with probability 1/n and procedure (2) with
probability 1 - 1/n.
Repeated iteration of procedure (1) merely gives the familiar independent experiments: selection of letters is done with replacement and, thus, the letters at different steps are independent.
Procedure (2), however, is different: the probability mass function for the letter at the next
time step depends on the letter at this step and nothing else. If we call our current letter our
state, then we transition into a new state chosen with the p. m. f. that depends only on our
current state. Such processes are called Markov.
Procedure (3) is not Markov at first glance. However, it becomes such via a natural redefinition of state: keep track of the last two letters; call an ordered pair of two letters a state.
Procedure (4) can be made Markov in a contrived fashion, that is, by keeping track, at the
current state, of the entire history of the process. There is, however, no natural way of making
this process Markov and, indeed, there is something different about this scheme: it ceases being
random after many steps are performed, as the sequence of the chosen letters occurs just once
in the book.
Procedure (5) is Markov, but what distinguishes it from (2) is that the p. m. f. depends
not only on the current state, but also on time. That is, the process is Markov, but not
time-homogeneous. We will only consider time-homogeneous Markov processes.
In general, a Markov chain is given by
a state space, a countable set S of states, which are often labeled by the positive integers
1, 2, . . . ; and
transition probabilities, a (possibly infinite) matrix of numbers Pij , where i and j range
over all states in S, given by

(12.1)   Pij = P(Xn+1 = j|Xn = i),

with

Σ_j Pij = 1,

for all i. In other words, every row of the matrix is a p. m. f. Clearly, by (12.1), every transition
matrix is a stochastic matrix (as Xn+1 must be in some state). The opposite also holds: given
any stochastic matrix, one can construct a Markov chain on positive integer states with the
same transition matrix, by using the entries as transition probabilities as in (12.1).
Geometrically, a Markov chain is often represented as an oriented graph on S (possibly with
self-loops) with an oriented edge going from i to j whenever a transition from i to j is possible,
i.e., whenever Pij > 0; such an edge is labeled by Pij.
Example 12.2. A random walker moves on the set {0, 1, 2, 3, 4}. She moves to the right (by
1) with probability p, and to the left with probability 1 - p, except when she is at 0 or at 4.
These two states are absorbing: once there, the walker does not move. The transition matrix is

P =
[ 1      0      0      0    0
  1-p    0      p      0    0
  0      1-p    0      p    0
  0      0      1-p    0    p
  0      0      0      0    1 ]

and the matching graphical representation is below.

[Figure: the states 0, . . . , 4 in a row, with arrows labeled p pointing right, arrows labeled 1 - p pointing left, and self-loops labeled 1 at the absorbing states 0 and 4.]
Example 12.3. Same as the previous example except that now 0 and 4 are reflecting. From
0, the walker always moves to 1, while from 4 she always moves to 3. The transition matrix
changes to

P =
[ 0      1      0      0    0
  1-p    0      p      0    0
  0      1-p    0      p    0
  0      0      1-p    0    p
  0      0      0      1    0 ]
Example 12.4. Random walk on a graph. Assume that a graph with undirected edges is
given by its adjacency matrix, which is a binary matrix with the i, jth entry 1 exactly when i
is connected to j. At every step, a random walker moves to a randomly chosen neighbor. For
example, the adjacency matrix of the graph

[Figure: a graph on four vertices, in which vertices 2 and 4 are each connected to the other three vertices, while 1 and 3 are connected only to 2 and 4.]

is

[ 0  1  0  1
  1  0  1  1
  0  1  0  1
  1  1  1  0 ]

and the transition matrix, obtained by dividing each row by the corresponding vertex degree, is

P =
[ 0    1/2  0    1/2
  1/3  0    1/3  1/3
  0    1/2  0    1/2
  1/3  1/3  1/3  0 ]
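The normalization step is mechanical and easy to automate; a sketch (the adjacency matrix below is the one from the example as reconstructed here, so treat the specific graph as an assumption):

```python
# Build the random-walk transition matrix from an adjacency matrix by
# dividing each row by the vertex degree (the row sum).
A = [[0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [1, 1, 1, 0]]
P = [[a / sum(row) for a in row] for row in A]
for row in P:
    assert abs(sum(row) - 1) < 1e-12  # each row is a p.m.f.
print(P[1])  # vertex 2 has degree 3: [1/3, 0, 1/3, 1/3]
```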
Example 12.5. The general two-state Markov chain. There are two states 1 and 2 with
transitions:
1 → 1 with probability α;
1 → 2 with probability 1 - α;
2 → 1 with probability β;
2 → 2 with probability 1 - β.
The transition matrix has two parameters α, β ∈ [0, 1]:

P =
[ α  1-α
  β  1-β ]
Example 12.6. Changeovers. Keep track of two-toss blocks in an infinite sequence of independent coin tosses with probability p of Heads. The states represent (previous flip, current flip)
and are (in order) HH, HT, TH, and TT. The resulting transition matrix is

[ p  1-p  0  0
  0  0    p  1-p
  p  1-p  0  0
  0  0    p  1-p ]
Example 12.7. Simple random walk on Z. The walker moves to the right by 1 with probability
p and to the left by 1 with probability 1 - p. The state space is doubly infinite and so is the
transition matrix:

[ . . .
  . . .  1-p  0    p    0  0  . . .
  . . .  0    1-p  0    p  0  . . .
  . . .  0    0    1-p  0  p  . . .
                              . . . ]
Example 12.8. Birth-death chain. This is a general model in which a population may change
by at most 1 at each time step. Assume that the size of the population is x. The birth probability
px is the transition probability to x + 1, the death probability qx is the transition probability to x - 1, and
rx = 1 - px - qx is the transition probability to x. Clearly, q0 = 0. The transition matrix is

[ r0  p0  0   0   0  . . .
  q1  r1  p1  0   0  . . .
  0   q2  r2  p2  0  . . .
  . . . ]
We begin our theory by studying the n-step probabilities

P^n_ij = P(Xn = j|X0 = i) = P(Xn+m = j|Xm = i).

Note that P^0 = I, the identity matrix, and P^1 = P. Note also that the condition X0 = i
simply specifies a particular non-random initial state.
Consider an oriented path of length n from i to j, that is, i, k1, . . . , k_{n-1}, j, for some states
k1, . . . , k_{n-1}. One can compute the probability of following this path by multiplying all transition
probabilities, i.e., P_{i k1} P_{k1 k2} · · · P_{k_{n-1} j}. To compute P^n_ij, one has to sum these products over all
paths of length n from i to j. The next theorem writes this in a familiar and neater fashion.
Theorem 12.1. Connection between n-step probabilities and matrix powers:
P^n_ij is the i, jth entry of the nth power of the transition matrix.
Proof. Call the transition matrix P and temporarily denote the n-step transition matrix by
P^(n). Then, for m, n ≥ 0,

P^(n+m)_ij = P(Xn+m = j|X0 = i)
= Σ_k P(Xn+m = j, Xn = k|X0 = i)
= Σ_k P(Xn+m = j|Xn = k) P(Xn = k|X0 = i)
= Σ_k P^(m)_kj P^(n)_ik.

The second equality decomposes the probability according to where the chain is at time n, the
third uses the Markov property, and the fourth time-homogeneity. Thus,

P^(m+n) = P^(n) P^(m),

and, then, by induction

P^(n) = P^(1) · · · P^(1) = P^n.
The fact that the matrix powers of the transition matrix give the n-step probabilities makes
linear algebra useful in the study of finite-state Markov chains.
Example 12.9. For the two-state Markov chain

P =
[ α  1-α
  β  1-β ]

we have

P^2 =
[ α^2 + (1-α)β     α(1-α) + (1-α)(1-β)
  βα + (1-β)β      β(1-α) + (1-β)^2 ]
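Theorem 12.1 is easy to check numerically; a minimal sketch for the two-state chain, with assumed values α = 0.8 and β = 0.3 (these numbers are not from the notes):

```python
# n-step probabilities as matrix powers, for the general two-state chain.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

alpha, beta = 0.8, 0.3
P = [[alpha, 1 - alpha], [beta, 1 - beta]]
P2 = matmul(P, P)

# Entry (1,1) of P^2 should be alpha^2 + (1 - alpha) * beta
assert abs(P2[0][0] - (alpha**2 + (1 - alpha) * beta)) < 1e-12
print(P2)
```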
More generally, if the initial state X0 has distribution given by αi = P(X0 = i), then

P(Xn = j) = Σ_i P(Xn = j|X0 = i) P(X0 = i) = Σ_i αi P^n_ij.
Going back to the random walk on the graph from Example 12.4, assume that the initial
distribution is uniform. As

P =
[ 0    1/2  0    1/2
  1/3  0    1/3  1/3
  0    1/2  0    1/2
  1/3  1/3  1/3  0 ]

we have

[P(X2 = 1) P(X2 = 2) P(X2 = 3) P(X2 = 4)] = [1/4 1/4 1/4 1/4] P^2 = [2/9 5/18 2/9 5/18].
Problems
1. Three white and three black balls are distributed in two urns, with three balls per urn. The
state of the system is the number of white balls in the first urn. At each step, we draw at
random a ball from each of the two urns and exchange their places (the ball that was in the first
urn is put into the second and vice versa). (a) Determine the transition matrix for this Markov
chain. (b) Assume that initially all white balls are in the first urn. Determine the probability
that this is also the case after 6 steps.
2. A Markov chain on states 0, 1, 2 has the transition matrix

P =
[ 1/2  1/3  1/6
  0    1/3  2/3
  5/6  0    1/6 ]
3. Every day you flip one of two coins: coin 1 has probability 0.7 of Heads and coin 2 has
probability 0.6 of Heads. On day 0, you choose one of the two coins at random (each with
probability 1/2). Thereafter, if you flip Heads, you flip coin 1 the next day;
otherwise you flip coin 2 the next day. (a) Compute the probability that you flip coin 1 on day
3. (b) Compute the probability that you flip coin 1 on days 3, 6, and 14. (c) Compute the
probability that you flip Heads on days 3 and 6.
4. A walker moves on two positions a and b. She begins at a at time 0 and is at a the next time
as well. Subsequently, if she is at the same position for two consecutive time steps, she changes
position with probability 0.8 and remains at the same position with probability 0.2; in all other
cases she decides the next position by a flip of a fair coin. (a) Interpret this as a Markov chain
on a suitable state space and write down the transition matrix P. (b) Determine the probability
that the walker is at position a at time 10.
Solutions to problems
1. The states are 0, 1, 2, 3. For (a),

P =
[ 0    1    0    0
  1/9  4/9  4/9  0
  0    4/9  4/9  1/9
  0    0    1    0 ]

For (b), the answer is the fourth entry of

[0 0 0 1] P^6.
3. The state Xn of our Markov chain, 1 or 2, is the coin we flip on day n. (a) Let

P =
[ 0.7  0.3
  0.6  0.4 ]

Then,

[P(X3 = 1) P(X3 = 2)] = [1/2 1/2] P^3 = [0.6665 0.3335],

so the answer to (a) is 0.6665.
4. (a) The states are the four ordered pairs aa, ab, ba, and bb, which we will code as 1, 2, 3, and
4. Then,

P =
[ 0.2  0.8  0    0
  0    0    0.5  0.5
  0.5  0.5  0    0
  0    0    0.8  0.2 ]

The answer to (b) is the sum of the first and the third entries of

[1 0 0 0] P^9.

The power is 9 instead of 10 because the initial time for the chain (when it is at state aa) is
time 1 for the walker.
Markov Chains: Classification of States
We say that a state j is accessible from state i, i → j, if P^n_ij > 0 for some n ≥ 0. This means
that there is a possibility of reaching j from i in some number of steps. If j is not accessible
from i, then P^n_ij = 0 for all n ≥ 0, and thus the chain started from i never visits j:

P(∪_{n=0}^∞ {Xn = j} | X0 = i) ≤ Σ_{n=0}^∞ P(Xn = j|X0 = i) = 0.

Also, note that, for accessibility, the size of the entries of P does not matter; all that matters is which
are positive and which are 0. For computational purposes, one should also observe that, if the
chain has m states, then j is accessible from i if and only if (P + P^2 + . . . + P^m)_ij > 0.
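Since only the pattern of positive entries matters, accessibility can also be decided by a plain graph search instead of matrix powers. A sketch (the matrix below is the absorbing random walk of Example 12.2 with an assumed p = 0.5):

```python
# Breadth-first search over the oriented graph of positive transition entries.
from collections import deque

def accessible(P, i):
    """Set of states j with i -> j (including i itself, via n = 0 steps)."""
    seen, queue = {i}, deque([i])
    while queue:
        k = queue.popleft()
        for j, q in enumerate(P[k]):
            if q > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

p = 0.5  # assumed value; only positivity matters here
P = [[1, 0, 0, 0, 0],
     [1 - p, 0, p, 0, 0],
     [0, 1 - p, 0, p, 0],
     [0, 0, 1 - p, 0, p],
     [0, 0, 0, 0, 1]]
print(accessible(P, 2))  # from the middle state every state is reachable
print(accessible(P, 0))  # state 0 is absorbing: only {0}
```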
If i is accessible from j and j is accessible from i, then we say that i and j communicate,
i ↔ j. It is easy to check that this is an equivalence relation:
1. i ↔ i;
2. i ↔ j implies j ↔ i; and
3. i ↔ j and j ↔ k together imply i ↔ k.
The only nontrivial part is (3) and, to prove it, let us assume i → j and j → k. This means that
there exists an n ≥ 0 so that P^n_ij > 0 and an m ≥ 0 so that P^m_jk > 0. Now, one can get from i
to k in n + m steps by going first to j in n steps and then from j to k in m steps, so that

P^{n+m}_ik ≥ P^n_ij P^m_jk > 0.

Formally,

P^{n+m}_ik = Σ_ℓ P^n_iℓ P^m_ℓk ≥ P^n_ij P^m_jk.
Any of the states 1, 2, 3, 4 is accessible from any of the five states, but 5 is not accessible from 1, 2, 3, 4.
So, we have two classes: {1, 2, 3, 4} and {5}. The chain is not irreducible.
Example 13.2. Consider the chain on states 1, 2, 3 and

P =
[ 1/2  1/2  0
  1/2  1/4  1/4
  0    1/3  2/3 ]

As 1 ↔ 2 and 2 ↔ 3, this is an irreducible chain.
Example 13.3. Consider the chain on states 1, 2, 3, 4, and

P =
[ 1/2  1/2  0    0
  1/2  1/2  0    0
  0    0    1/4  3/4
  0    0    0    1 ]

This chain has three classes {1, 2}, {3} and {4}, hence, it is not irreducible.
For any state i, denote

fi = P(ever reenter i|X0 = i).

We call a state i recurrent if fi = 1, and transient if fi < 1.
Example 13.4. Back to the previous example. Obviously, 4 is recurrent, as it is an absorbing
state. The only possibility of returning to 3 is to do so in one step, so we have f3 = 1/4, and 3 is
transient. Moreover, f1 = 1 because, in order to never return to 1, we need to go to state 2 and
stay there forever. We stay at 2 for n steps with probability

(1/2)^n → 0,

as n → ∞, so the probability of staying at 2 forever is 0 and, consequently, f1 = 1. By similar
logic, f2 = 1. We will soon develop better methods to determine recurrence and transience.
Starting from any state, a Markov chain visits a recurrent state infinitely many times or not
at all. Let us now compute, in two different ways, the expected number of visits to i (i.e., the
number of times, including time 0, when the chain is at i). First, we observe that, at every visit to i, the
probability of never visiting i again is 1 - fi, therefore,

P(exactly n visits to i|X0 = i) = fi^{n-1}(1 - fi).

This formula says that the number of visits to i is a Geometric(1 - fi) random variable and so
its expectation is

E(number of visits to i|X0 = i) = 1/(1 - fi).

A second way to compute this expectation is by using the indicator trick: with In = I_{Xn = i},

E(number of visits to i|X0 = i) = E(Σ_{n=0}^∞ In | X0 = i) = Σ_{n=0}^∞ P(Xn = i|X0 = i) = Σ_{n=0}^∞ P^n_ii.

Thus,

1/(1 - fi) = Σ_{n=0}^∞ P^n_ii;

in particular, i is recurrent exactly when

Σ_{n=0}^∞ P^n_ii = ∞.
Proposition 13.4. If i is recurrent and i → j, then j is also recurrent. Therefore, in any class,
either all states are recurrent or all are transient. In particular, if the chain is irreducible, then
either all states are recurrent or all are transient.
In light of this proposition, we can classify each class, as well as an irreducible Markov chain,
as recurrent or transient.
Proof. By the previous proposition, we know that also j → i. We will now give two arguments
for the recurrence of j.
We could use the same logic as before: starting from j, the chain must visit i with probability
1 (or else the chain starting at i has a positive probability of never returning to i, by visiting j); then
it returns to i infinitely many times and, at each of those times, it has an independent chance
of getting to j at a later time, so it must do so infinitely often.
For another argument, we know that there exist k, m ≥ 0 so that P^k_ij > 0 and P^m_ji > 0. Furthermore, for any n ≥ 0, one way to get from j to j in m + n + k steps is by going from j to i in m
steps, then from i to i in n steps, and then from i to j in k steps; thus,

P^{m+n+k}_jj ≥ P^m_ji P^n_ii P^k_ij.

If Σ_{n=0}^∞ P^n_ii = ∞, then Σ_{n=0}^∞ P^{m+n+k}_jj = ∞ and, finally, Σ_{n=0}^∞ P^n_jj = ∞. In short, if i is recurrent,
then so is j.
Proposition 13.5. Any recurrent class is a closed subset of states.
Proof. Let S0 be a recurrent class, i ∈ S0 and j ∉ S0. We need to show that Pij = 0. Assume
the converse, Pij > 0. As j does not communicate with i, the chain never reaches i from j, i.e.,
i is not accessible from j. But this is a contradiction to Proposition 13.3.
For finite Markov chains, these propositions make it easy to determine recurrence and transience: if a class is closed, it is recurrent, but, if it is not closed, it is transient.
Example 13.5. Assume that the states are 1, 2, 3, 4, and

P =
[ 0  0  1/2  1/2
  1  0  0    0
  0  1  0    0
  0  1  0    0 ]

By inspection, every state is accessible from every other state and so this chain is irreducible.
Therefore, every state is recurrent.
Example 13.6. Consider the chain on states 1, . . . , 6 and

P =
[ 0    1    0    0    0    0
  0.4  0.6  0    0    0    0
  0.3  0    0.4  0.2  0.1  0
  0    0    0    0.3  0.7  0
  0    0    0    0.5  0    0.5
  0    0    0    0.3  0    0.7 ]

[Figure: the chain drawn as a graph on the six states, with edges labeled by the transition probabilities above.]

We observe that 3 can only be reached from 3, therefore, 3 is in a class of its own. States 1 and
2 can reach each other and no other state, so they form a class together. Furthermore, 4, 5, 6
all communicate with each other. Thus, the division into classes is {1, 2}, {3}, and {4, 5, 6}. As
it is not closed, {3} is a transient class (in fact, it is clear that f3 = 0.4). On the other hand,
{1, 2} and {4, 5, 6} are both closed and, therefore, recurrent.
Example 13.7. Recurrence of a simple random walk on Z. Recall that such a walker moves
from x to x + 1 with probability p and to x - 1 with probability 1 - p. We will assume that
p ∈ (0, 1) and denote the chain Sn = S^(1)_n. (The superscript indicates the dimension. We will
make use of this in subsequent examples in which the walker will move in higher dimensions.)
As such a walk is irreducible, we only have to check whether state 0 is recurrent or transient, so
we assume that the walker begins at 0. First, we observe that the walker will be at 0 at a later
time only if she makes an equal number of left and right moves. Thus, for n = 1, 2, . . .,

P^{2n-1}_00 = 0

and

P^{2n}_00 = (2n choose n) p^n (1 - p)^n.

To determine how the latter probabilities behave for large n, we use Stirling's formula,

n! ∼ n^n e^{-n} √(2πn)

(the symbol ∼ means that the quotient of the two quantities converges to 1 as n → ∞).
Therefore,

(2n choose n) = (2n)!/(n!)^2 ∼ 2^{2n}/√(πn),

and, therefore,

P^{2n}_00 ∼ (4p(1 - p))^n / √(πn).

If p = 1/2, then P^{2n}_00 ∼ 1/√(πn), so that

Σ_{n=0}^∞ P^{2n}_00 = ∞,

and the random walk is recurrent. If p ≠ 1/2, then 4p(1 - p) < 1, so that

Σ_{n=0}^∞ P^{2n}_00 < ∞,
and the random walk is transient. In this case, what is the probability f0 that the chain ever
reenters 0? We need to recall the Gambler's ruin probabilities,

P(Sn reaches N before 0|S0 = 1) = (1 - (1-p)/p) / (1 - ((1-p)/p)^N).

As N → ∞, the probability

P(Sn reaches 0 before N|S0 = 1) = 1 - P(Sn reaches N before 0|S0 = 1)

converges to

P(Sn ever reaches 0|S0 = 1) = { 1 if p < 1/2;  (1-p)/p if p > 1/2. }
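The asymptotics P^{2n}_00 ∼ (4p(1 - p))^n/√(πn) can be checked numerically; a sketch (the values of p and the choice n = 200 are arbitrary assumptions):

```python
# Compare the exact return probability C(2n, n) p^n (1-p)^n with its
# Stirling approximation (4p(1-p))^n / sqrt(pi*n).
from math import comb, sqrt, pi

def p2n(n, p):
    # Probability that the walk is back at 0 after 2n steps
    return comb(2 * n, n) * p**n * (1 - p)**n

ratios = []
for p in (0.5, 0.7):
    exact = p2n(200, p)
    approx = (4 * p * (1 - p))**200 / sqrt(pi * 200)
    ratios.append(exact / approx)
print(ratios)  # both ratios are close to 1
```

For p = 0.7 both quantities are tiny (the series converges), yet the ratio is still near 1, exactly as the ∼ notation promises.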
For the two-dimensional simple symmetric random walk S^(2)_n, we condition on the number
N of horizontal steps among the first 2n steps:

P(S^(2)_{2n} = (0, 0)) = Σ_{k=0}^{n} P(N = 2k) P(S^(1)_{2k} = 0) P(S^(1)_{2(n-k)} = 0).

In order not to obscure the computation, we will not show the full details from now on; filling
in the missing pieces is an excellent computational exercise.
First, as the walker chooses to go horizontally or vertically with equal probability, N ≈ 2n/2 =
n with overwhelming probability and so we can assume that k ≈ n/2. Taking this into account,

P(S^(1)_{2k} = 0) ≈ √(2/(πn)),

P(S^(1)_{2(n-k)} = 0) ≈ √(2/(πn)).

Therefore,

P(S^(2)_{2n} = (0, 0)) ≈ (2/(πn)) Σ_{k=0}^{n} P(N = 2k) = (2/(πn)) P(N is even) ∼ 1/(πn),
so that

Σ_{n=0}^∞ P(S^(2)_{2n} = (0, 0)) = ∞,

and we have demonstrated that this chain is still recurrent, albeit barely. In fact, there is an
easier slick proof that does not generalize to higher dimensions. Rotate the walk by 45 degrees:
each of the two coordinates in the rotated frame then changes by ±1 at every step, independently
of the other, and, if we ignore the half of the points that are never visited, each coordinate becomes
the same walk as S^(1)_n. In particular, S^(2)_n is at the origin exactly when both rotated coordinates
are, which demonstrates that

P(S^(2)_{2n} = (0, 0)) = (P(S^(1)_{2n} = 0))^2 ∼ 1/(πn).
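The rotation identity P(S^(2)_{2n} = (0,0)) = (C(2n, n) 2^{-2n})^2 can be verified by brute-force dynamic programming; a sketch for small n:

```python
# Exact return probability of the 2-d symmetric walk by dynamic programming,
# compared with the square of the 1-d return probability.
from math import comb

def p_origin_2d(steps):
    probs = {(0, 0): 1.0}
    for _ in range(steps):
        nxt = {}
        for (x, y), q in probs.items():
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                key = (x + dx, y + dy)
                nxt[key] = nxt.get(key, 0.0) + q / 4
        probs = nxt
    return probs.get((0, 0), 0.0)

for n in (1, 2, 3, 4):
    lhs = p_origin_2d(2 * n)
    rhs = (comb(2 * n, n) / 4**n) ** 2
    assert abs(lhs - rhs) < 1e-12
print("identity verified for 2n up to 8")
```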
Example 13.9. Is the simple symmetric random walk on Z^3 recurrent? Now, imagine a squirrel
running around in a 3-dimensional maze. The process S^(3)_n moves from a point (x, y, z) to one
of the six neighbors (x ± 1, y, z), (x, y ± 1, z), (x, y, z ± 1) with equal probability. To return to
(0, 0, 0), it has to make an even number of steps in each of the three coordinates. We will
condition on the number N of steps in the z direction. This time N ≈ 2n/3 and, thus,

P(S^(3)_{2n} = (0, 0, 0)) = Σ_{k=0}^{n} P(N = 2k) P(S^(1)_{2k} = 0) P(S^(2)_{2(n-k)} = (0, 0))
≈ (3√3 / (2 π^{3/2} n^{3/2})) P(N is even)
∼ 3√3 / (4 π^{3/2} n^{3/2}).
Therefore,

Σ_{n=0}^∞ P(S^(3)_{2n} = (0, 0, 0)) < ∞,

and the three-dimensional random walk is transient, so the squirrel may never return home. The
probability f0 = P(return to 0), thus, is not 1, but can we compute it? One approximation is
obtained by using

1/(1 - f0) = Σ_{n=0}^∞ P(S^(3)_{2n} = (0, 0, 0)) = 1 + 1/6 + . . . ,
but this series converges slowly and its terms are difficult to compute. Instead, one can use the
remarkable formula, derived by Fourier analysis,

1/(1 - f0) = (1/(2π)^3) ∫_{(-π,π)^3} dx dy dz / (1 - (1/3)(cos(x) + cos(y) + cos(z))),

which gives, to four decimal places,

f0 ≈ 0.3405.
Problems
1. For the following transition matrices, determine the classes and which states are recurrent
and which transient.

P1 =
[ 0    1/2  1/2  0    0
  1/2  0    1/2  0    0
  1/2  1/2  0    0    0
  0    1/2  0    1/4  1/4
  0    0    0    1/2  1/2 ]

P2 =
[ 0    0    0    0    1
  0    0    0    0    1
  1/3  1/3  0    1/3  0
  0    0    1    0    0
  0    0    0    1    0 ]

P3 =
[ 1/2  0    1/2  0    0
  1/2  1/2  0    0    0
  1/4  1/4  1/2  0    0
  0    0    0    1/2  1/2
  0    0    0    1/2  1/2 ]

P4 =
[ 1/2  1/2  0    0    0
  1/2  1/2  0    0    0
  0    0    1    0    0
  0    0    1/2  0    1/2
  1    0    0    0    0 ]
2. Assume that a Markov chain Xn has states 0, 1, 2, . . . and transitions from each i > 0 to i + 1
with probability 1 - 1/(2i^γ) and to 0 with probability 1/(2i^γ), where γ > 0 is a fixed parameter.
Moreover, from 0 it transitions to 1 with probability 1. (a) Is this chain irreducible? (b) Assume that X0 = 0 and let R be the
first return time to 0 (i.e., the first time after the initial time the chain is back at the origin).
Determine for which γ

1 - f0 = P(no return) = lim_{n→∞} P(R > n) = 0.

(c) Depending on γ, determine whether the chain is recurrent.
Solutions to problems
1. Assume that the states are 1, . . . , 5. For P1 : {1, 2, 3} recurrent, {4, 5} transient. For P2 :
irreducible, so all states are recurrent. For P3 : {1, 2, 3} recurrent, {4, 5} recurrent. For P4 :
{1, 2} recurrent, {3} recurrent (absorbing), {4} transient, {5} transient.
2. (a) The chain is irreducible. (b) If R > n, then the chain, after moving to 1, makes n - 1
consecutive steps to the right, so

P(R > n) = Π_{i=1}^{n-1} (1 - 1/(2i^γ)).

The product converges to 0 if and only if its logarithm converges to -∞, and that holds if and
only if the series

Σ_{i=1}^∞ 1/(2i^γ)

diverges, which is when γ ≤ 1. (c) For γ ≤ 1, the chain is recurrent; otherwise, it is transient.
3. For (a), the walker makes one step and then proceeds from i + 1 or i - 1 with equal probability,
so that

Ei = 1 + (1/2)(E_{i+1} + E_{i-1}),

with E0 = EN = 0. For (b), the homogeneous equation is the same as the one in the Gambler's
ruin, so its general solution is linear: Ci + D. We look for a particular solution of the form Bi^2,
and we get Bi^2 = 1 + (1/2)(B(i^2 + 2i + 1) + B(i^2 - 2i + 1)) = 1 + Bi^2 + B, so B = -1. By plugging
in the boundary conditions we can solve for C and D to get D = 0, C = N. Therefore,

Ei = i(N - i).

For (c), after a step the walker proceeds either from 1 or -1 and, by symmetry, the expected
time to get to 0 is the same for both. So, for every N,

ER ≥ 1 + E1 = 1 + 1·(N - 1) = N,

and so ER = ∞.
Branching Processes
In this chapter we will consider a random model for population growth in the absence of spatial or
any other resource constraints. So, consider a population of individuals which evolves according
to the following rule: in every generation n = 0, 1, 2, . . ., each individual produces a random
number of offspring in the next generation, independently of other individuals. The probability
mass function for offspring is often called the offspring distribution and is given by

pi = P(number of offspring = i),

for i = 0, 1, 2, . . .. We will assume that p0 < 1 and p1 < 1 to eliminate the trivial cases. This
model was introduced by F. Galton in the late 1800s to study the disappearance of family names;
in this case pi is the probability that a man has i sons.
We will start with a single individual in generation 0 and generate the resulting random
family tree. This tree is either finite (when some generation produces no offspring at all) or
infinite; in the former case, we say that the branching process dies out and, in the latter case,
that it survives.
We can look at this process as a Markov chain where Xn is the number of individuals in
generation n. Let us start with the following observations:
If Xn reaches 0, it stays there, so 0 is an absorbing state.
If p0 > 0, then P(Xn+1 = 0|Xn = k) > 0, for all k.
Therefore, by Proposition 13.5, all states other than 0 are transient if p0 > 0; the population must either die out or increase to infinity. If p0 = 0, then the population cannot
decrease and each generation increases with probability at least 1 - p1; therefore, it must
increase to infinity.
It is possible to write down the transition probabilities for this chain, but they have a rather
complicated explicit form, as

P(Xn+1 = i|Xn = k) = P(W1 + W2 + . . . + Wk = i),

where W1, . . . , Wk are independent random variables, each with the offspring distribution. This
suggests using moment generating functions, which we will indeed do. Recall that we are assuming that X0 = 1.
Let

πn = P(Xn = 0)

be the probability that the population is extinct by generation (which we also think of as time) n.
The probability π0 that the branching process dies out is, then, the limit of these probabilities:

π0 = P(the process dies out) = P(Xn = 0 for some n) = lim_{n→∞} P(Xn = 0) = lim_{n→∞} πn.
Note that π0 = 0 if p0 = 0. Our main task will be to compute π0 for general probabilities pk. We
start, however, with computing the expectation and variance of the population at generation n.
Let μ and σ^2 be the expectation and the variance of the offspring distribution, that is,

μ = EX1 = Σ_{k=0}^∞ k pk,

and

σ^2 = Var(X1).

Let mn = E(Xn) and vn = Var(Xn). Now, Xn+1 is the sum of a random number Xn of
independent random variables, each with the offspring distribution. Thus, we have, by Theorem
11.1,

m_{n+1} = μ mn,

and

v_{n+1} = mn σ^2 + vn μ^2.

Together with the initial conditions m0 = 1, v0 = 0, the two recursive equations determine mn and
vn. We can very quickly solve the first recursion to get mn = μ^n and so

v_{n+1} = μ^n σ^2 + μ^2 vn.

When μ = 1, mn = 1 and vn = n σ^2. When μ ≠ 1, the recursion has the general solution
vn = A μ^n + B μ^{2n}. The constant A must satisfy

A μ^{n+1} = σ^2 μ^n + A μ^{n+2},

so that

A = σ^2 / (μ(1 - μ)).

From v0 = 0 we get B = -A, and so

vn = { σ^2 μ^{n-1} (1 - μ^n)/(1 - μ)   if μ ≠ 1;
       n σ^2                            if μ = 1. }
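The formulas for mn and vn can be checked exactly by composing the offspring generating function with itself, since the coefficients of the composition give the p. m. f. of X2. A sketch, using as an assumed offspring distribution p0 = 1/6, p1 = 1/2, p3 = 1/3 (the one from Problem 1 below), for which μ = 3/2 and σ^2 = 5/4:

```python
# Verify m_2 = mu^2 and v_2 = sigma^2 * mu^(2-1) * (1 - mu^2)/(1 - mu)
# by composing the offspring generating function with itself.
def polymul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def padd(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0.0) + (b[i] if i < len(b) else 0.0)
            for i in range(n)]

def compose(f, g):
    # Coefficients of the polynomial f(g(s)).
    out, power = [0.0], [1.0]
    for c in f:
        out = padd(out, [c * v for v in power])
        power = polymul(power, g)
    return out

phi = [1/6, 1/2, 0.0, 1/3]   # assumed offspring p.m.f. as coefficients
phi2 = compose(phi, phi)     # p.m.f. of X_2 as coefficients
mean2 = sum(k * c for k, c in enumerate(phi2))
var2 = sum(k * k * c for k, c in enumerate(phi2)) - mean2**2

mu, sigma2 = 1.5, 1.25
print(mean2, mu**2)                      # both 2.25
print(var2, sigma2 * mu * (1 + mu))      # both 4.6875
```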
In particular, if μ < 1, then P(Xn ≥ 1) ≤ EXn = μ^n → 0 as n → ∞: if the individuals have fewer than one offspring on average, the branching process dies out.
Now, let φ be the moment generating function of the offspring distribution. It is more
convenient to replace e^t in our original definition with s, so that

φ(s) = φ_{X1}(s) = E(s^{X1}) = Σ_{k=0}^∞ pk s^k.

In combinatorics, this would be exactly the generating function of the sequence pk. Then, the
moment generating function of Xn is

φ_{Xn}(s) = E[s^{Xn}] = Σ_{k=0}^∞ P(Xn = k) s^k.

We will assume that 0 ≤ s ≤ 1 and observe that, for such s, this power series converges. Let us
get a recursive equation for φ_{Xn} by conditioning on the population count in generation n - 1:

φ_{Xn}(s) = E[s^{Xn}] = Σ_{k=0}^∞ E[s^{Xn} | X_{n-1} = k] P(X_{n-1} = k) = Σ_{k=0}^∞ φ(s)^k P(X_{n-1} = k) = φ_{X_{n-1}}(φ(s)).

So, φ_{Xn} is the nth iterate of φ,

φ_{X2}(s) = φ(φ(s)),  φ_{X3}(s) = φ(φ(φ(s))), . . .

and we can also write

φ_{Xn}(s) = φ(φ_{X_{n-1}}(s)).

Next, we take a closer look at the properties of φ. Clearly,

φ(0) = p0 > 0

and

φ(1) = Σ_{k=0}^∞ pk = 1.
Moreover,

φ′(1) = μ,

and, finally,

φ′(s) = Σ_{k=1}^∞ k pk s^{k-1} ≥ 0,

so that φ is increasing (and, similarly, convex) on [0, 1]. As πn = P(Xn = 0) = φ_{Xn}(0), we also
have πn = φ(π_{n-1}) and, as n → ∞, πn increases to π0, the smallest solution of φ(s) = s on [0, 1],
for every n.
If μ is barely larger than 1, the probability π0 of extinction is quite close to 1. In the context
of family names, this means that the ones with already a large number of representatives in the
population are at a distinct advantage, as the probability that they die out by chance is much
lower than that of those with only a few representatives. Thus, common family names become
ever more common, especially in societies that have used family names for a long time. The
most famous example of this phenomenon is in Korea, where three family names (Kim, Lee, and
Park in English transcriptions) account for about 45% of the population.
Example 14.2. Assume that

pk = p^k (1 - p),  k = 0, 1, 2, . . . .

This means that the offspring distribution is Geometric(1 - p) minus 1. Thus,

μ = 1/(1 - p) - 1 = p/(1 - p)

and

φ(s) = Σ_{k=0}^∞ s^k p^k (1 - p) = (1 - p)/(1 - ps).

Solving φ(s) = s gives the solutions s = 1 and s = (1 - p)/p, so that, when μ > 1, that is,
p > 1/2, the extinction probability is

π0 = (1 - p)/p.
Example 14.3. Assume that the offspring distribution is Binomial(3, 1/2), that is, p0 = 1/8,
p1 = 3/8, p2 = 3/8, p3 = 1/8. Then μ = 3/2 > 1, and π0 is given by

φ(s) = 1/8 + (3/8)s + (3/8)s^2 + (1/8)s^3 = s,

with solutions s = 1, √5 - 2, and -√5 - 2. The one that lies in (0, 1), √5 - 2 ≈ 0.2361, is the
probability π0.
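Since πn = φ(π_{n-1}), iterating φ from 0 both computes the extinction-by-generation-n probabilities and converges to π0; a sketch for this example:

```python
# pi_n = phi(pi_{n-1}) increases to the extinction probability pi_0.
from math import sqrt

def phi(s):
    return 1/8 + 3/8 * s + 3/8 * s**2 + 1/8 * s**3

s = 0.0
for _ in range(200):
    s = phi(s)
print(s, sqrt(5) - 2)  # the iterates converge to sqrt(5) - 2 = 0.2360...
```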
Problems
1. For a branching process with offspring distribution given by p0 = 1/6, p1 = 1/2, p3 = 1/3, determine
(a) the expectation and variance of X9, the population at generation 9, (b) the probability that
the branching process dies out by generation 3, but not by generation 2, and (c) the probability that
the process ever dies out. Then, assume that you start 5 independent copies of this branching
process at the same time (equivalently, change X0 to 5), and (d) compute the probability that
the process ever dies out.
2. Assume that the offspring distribution of a branching process is Poisson with parameter λ.
(a) Determine the expected combined population through generation 10. (b) Determine, with
the aid of a computer if necessary, the probability that the process ever dies out, for λ = 1/2, λ = 1,
and λ = 2.
3. Assume that the offspring distribution of a branching process is given by p1 = p2 = p3 = 1/3.
Note that p0 = 0. Solve the following problem for a = 1, 2, 3. Let Yn be the proportion of
individuals in generation n (out of the total number of Xn individuals) from families of size
a. (A family consists of individuals that are offspring of the same parent from the previous
generation.) Compute the limit of Yn as n → ∞.
Solutions to problems
1. For (a), compute μ = 3/2 and σ^2 = EX1^2 - μ^2 = 7/2 - 9/4 = 5/4, so that EX9 = μ^9 = (3/2)^9
and Var(X9) = σ^2 μ^8 (1 - μ^9)/(1 - μ) = (5/2)(3/2)^8 ((3/2)^9 - 1). For (b) and (c), use

φ(s) = 1/6 + (1/2)s + (1/3)s^3.

For (b),

P(X3 = 0) - P(X2 = 0) = φ(φ(φ(0))) - φ(φ(0)) ≈ 0.0462.

For (c), we solve φ(s) = s, i.e., 0 = 2s^3 - 3s + 1 = (s - 1)(2s^2 + 2s - 1), and so

π0 = (√3 - 1)/2 ≈ 0.3660.

For (d), the answer is π0^5.
2. (a) The expected population at generation n is λ^n, so the expected combined population
through generation 10 is 1 + λ + . . . + λ^{10}, which equals (λ^{11} - 1)/(λ - 1) when λ ≠ 1 and
11 when λ = 1. (b) Here φ(s) = e^{λ(s-1)}, so π0 solves e^{λ(s-1)} = s. As μ = λ, π0 = 1 for
λ = 1/2 and λ = 1. For λ = 2, this equation cannot be solved analytically, but we can numerically obtain the solution
to get π0 ≈ 0.2032.
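The numerical solution is a one-line fixed-point iteration, again using πn = φ(π_{n-1}):

```python
# Extinction probability for the Poisson(2) offspring distribution,
# obtained by iterating s -> exp(lam * (s - 1)) from s = 0.
from math import exp

lam = 2.0
s = 0.0
for _ in range(500):
    s = exp(lam * (s - 1))
print(round(s, 4))  # 0.2032
```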
3. Assuming X_{n-1} = k, the number of families at time n is also k. Each of these has, independently, a members with probability pa. If k is large (which it will be for large n, as
the branching process cannot die out), then, with overwhelming probability, the number of
individuals from families of size a is about a pa k, while Xn is about μ k. Then, the proportion Yn is about
a pa/μ, which works out to be 1/6, 1/3, and 1/2, for a = 1, 2, and 3.
Markov Chains: Limiting Probabilities

Example 15.1. Assume that a Markov chain on states 1, 2, 3 is given by the transition matrix

P =
[ 0.7  0.2  0.1
  0.4  0.6  0
  0    1    0 ]
Recall that the n-step transition probabilities are given by the powers of P. If one looks at some
large powers of P, one finds that the rows of P^n become nearly identical for large n.
Let

Ri = inf{n ≥ 1 : Xn = i}

be the first time, after time 0, that the chain is at i ∈ S. Also, let

f_i^{(n)} = P(Ri = n|X0 = i)

be the p. m. f. of Ri when the starting state is i itself (in which case we may call Ri the return
time). We can connect these to the familiar quantity

fi = P(ever reenter i|X0 = i) = Σ_{n=1}^∞ f_i^{(n)},

so that i is recurrent exactly when Σ_{n=1}^∞ f_i^{(n)} = 1. Then, we define

mi = E[Ri|X0 = i] = Σ_{n=1}^∞ n f_i^{(n)}.

If the above series converges, i.e., mi < ∞, then we say that i is positive recurrent. It can be
shown that positive recurrence is also a class property: a state shares it with all members of its
class. Thus, an irreducible chain is positive recurrent if each of its states is.
It is not hard to show that a finite irreducible chain is positive recurrent. In this case, there
must exist an m ≥ 1 and an ε > 0 so that i can be reached from any j in at most m steps with
probability at least ε. Then, P(Ri ≥ n) ≤ (1 - ε)^{⌊n/m⌋}, which goes to 0 geometrically fast.
We now state the key theorems. Some of these have rather involved proofs (although none
is exceptionally difficult), which we will merely sketch or omit altogether.
Theorem 15.1. Proportion of the time spent at i.
Assume that the chain is irreducible and positive recurrent. Let Nn(i) be the number of visits to
i in the time interval from 0 through n. Then,

Nn(i)/n → 1/mi,

as n → ∞, in probability.
Proof. The idea is quite simple: once the chain visits i, it returns, on average, once per mi time
steps, hence the proportion of time spent there is 1/mi. We skip a more detailed proof.
Theorem 15.2. Existence and uniqueness of the invariant distribution.
An irreducible positive recurrent Markov chain has a unique invariant distribution π, which is
given by

πi = 1/mi.

In fact, an irreducible chain is positive recurrent if and only if a stationary distribution exists.
The formula for π should not be a surprise: if the probability that the chain is at i is always
πi, then one should expect the proportion of time spent at i, which we already know to be
1/mi, to be equal to πi. We will not, however, go deeper into the proof.
Theorem 15.3. Convergence to the invariant distribution.
If a Markov chain is irreducible, aperiodic, and positive recurrent, then, for every i, j ∈ S,

lim_{n→∞} P^n_ij = πj.

Recall that P^n_ij = P(Xn = j|X0 = i) and note that the limit is independent of the initial
state. Thus, the rows of P^n are more and more similar to the row vector π as n becomes large.
The most elegant proof of this theorem uses coupling, an important idea first developed by
a young French probabilist Wolfgang Doeblin in the late 1930s. (Doeblin's life is a romantic,
and quite tragic, story. An immigrant from Germany, he died as a soldier in the French army
in 1940, at the age of 25. He made significant mathematical contributions during his army
service.) Start with two independent copies of the chain: two particles moving from state to
state according to the transition probabilities, one started from i, the other using the initial
distribution π. Under the stated assumptions, they will eventually meet. Afterwards, the two
particles move together in unison, that is, they are coupled. Thus, the difference between the
two probabilities at time n is bounded above by twice the probability that coupling does not
happen by time n, which goes to 0. We will not go into greater detail, but, as we will see in the
next example, aperiodicity is necessary.
Example 15.6. A deterministic cycle with a = 3 has the transition matrix

P =
[ 0  1  0
  0  0  1
  1  0  0 ]

Then

P^2 =
[ 0  0  1
  1  0  0
  0  1  0 ]

P^3 = I, P^4 = P, etc. Although the chain does spend 1/3 of the time at each state (as it is easy to
check), the transition probabilities are a periodic sequence of 0's and 1's and do not converge.
Our final theorem is mostly a summary of the results for the special, and for us the most
common, case.
Theorem 15.4. Convergence theorem for a finite state space S.
Assume that a Markov chain with a finite state space is irreducible. Then it has a unique
invariant distribution π, given by πi = 1/mi, and, if the chain is in addition aperiodic, then, for
all i and j,

P^n_ij → πj, as n → ∞.
Example 15.7. We begin with our first example, Example 15.1. That is clearly an irreducible
and aperiodic (note that P11 > 0) chain. The invariant distribution [π1, π2, π3] is given by

0.7π1 + 0.4π2 = π1
0.2π1 + 0.6π2 + π3 = π2
0.1π1 = π3

which, together with π1 + π2 + π3 = 1, gives

π1 = 20/37 ≈ 0.5405,  π2 = 15/37 ≈ 0.4054,  π3 = 2/37 ≈ 0.0541.
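A sketch verifying this numerically: by Theorem 15.4, the rows of P^n should approach [20/37, 15/37, 2/37].

```python
# Power iteration: every row of P^n converges to the invariant distribution.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.7, 0.2, 0.1],
     [0.4, 0.6, 0.0],
     [0.0, 1.0, 0.0]]
Pn = P
for _ in range(60):
    Pn = matmul(Pn, P)

target = [20 / 37, 15 / 37, 2 / 37]
print([round(x, 4) for x in Pn[0]])  # [0.5405, 0.4054, 0.0541]
```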
Example 15.8. The general two-state Markov chain. Here S = {1, 2} and

P = [1 − α    α  ]
    [  β    1 − β]

and we assume that 0 < α, β < 1. The invariant distribution solves

(1 − α)π_1 + βπ_2 = π_1
απ_1 + (1 − β)π_2 = π_2
π_1 + π_2 = 1,

which gives

π_1 = β/(α + β),  π_2 = α/(α + β).

In the long run, what proportion of time is the chain at 2, while at the previous time it
was at 1? Answer: π_1 P_12 = αβ/(α + β), as it needs to be at 1 at the previous time and then make a
transition to 2 (again, the answer does not depend on the starting state).
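As a quick sanity check, one can iterate the two-state chain numerically for sample values of α and β (written a and b below; the particular numbers are illustrative, not from the text).

```python
# Two-state chain P = [[1-a, a], [b, 1-b]]: the invariant distribution is
# [b/(a+b), a/(a+b)], and the long-run proportion of (1 -> 2) transitions
# is pi_1 * a.
a, b = 0.3, 0.2                    # any 0 < a, b < 1
P = [[1 - a, a], [b, 1 - b]]

pi = [1.0, 0.0]                    # start at state 1; the limit is the same
for _ in range(2000):
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]

print(round(pi[0], 6), round(b / (a + b), 6))  # both 0.4
print(round(pi[0] * a, 6))                     # 1 -> 2 proportion: 0.12
```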
Example 15.9. In this example we will see how to compute the average length of time a chain
remains in a subset of states, once the subset is entered. Assume that a machine can be in 4
states labeled 1, 2, 3, and 4. In states 1 and 2 the machine is up, working properly. In states 3
and 4 the machine is down, out of order. Suppose that the transition matrix is
P = [1/4  1/4  1/2   0 ]
    [ 0   1/4  1/2  1/4]
    [1/4  1/4  1/4  1/4]
    [1/4  1/4   0   1/2].
(a) Compute the average length of time the machine remains up after it goes up. (b) Compute
the proportion of time that the system is up, but down at the next time step (this is called the
breakdown rate).
The invariant distribution is π_1 = 9/48, π_2 = 12/48, π_3 = 14/48, π_4 = 13/48. For (b), the
breakdown rate is

π_1 (P_13 + P_14) + π_2 (P_23 + P_24) = 9/32.

For (a), the proportion of time the machine is up is π_1 + π_2 = 21/48 and, by stationarity, the
rate at which the chain enters the up states equals the breakdown rate 9/32. Therefore, the
average length of time the machine remains up is (21/48)/(9/32) = 14/9.
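The fractions in this example can be checked numerically; a short sketch using the matrix displayed above:

```python
# Verify the Example 15.9 computation. States 1, 2 are "up", 3, 4 are "down".
P = [[0.25, 0.25, 0.5,  0.0],
     [0.0,  0.25, 0.5,  0.25],
     [0.25, 0.25, 0.25, 0.25],
     [0.25, 0.25, 0.0,  0.5]]

pi = [0.25] * 4
for _ in range(3000):                       # power iteration
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
print([round(x * 48) for x in pi])          # [9, 12, 14, 13] (in 48ths)

# (b) breakdown rate: up now, down at the next step.
rate = pi[0] * (P[0][2] + P[0][3]) + pi[1] * (P[1][2] + P[1][3])
print(round(rate * 32))                     # 9, i.e., rate = 9/32

# (a) mean length of an up stretch = P(up) / (rate of entering "up").
print(round((pi[0] + pi[1]) / rate, 4))     # 14/9 = 1.5556
```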
If the transition matrix for an irreducible Markov chain with a nite state space S is doubly
stochastic, its (unique) invariant measure is uniform over S.
Proof. Assume that S = {1, . . . , m}, as usual. If [1, . . . , 1] is the row vector with m 1's, then
[1, . . . , 1]P is exactly the vector of column sums, which is again [1, . . . , 1] since P is doubly
stochastic. This vector is therefore preserved by right multiplication by P, and so is
(1/m)[1, . . . , 1], which specifies the uniform p. m. f. on S.
Example 15.10. Simple random walk on a circle. Pick a probability p ∈ (0, 1). Assume that
a points labeled 0, 1, . . . , a − 1 are arranged on a circle clockwise. From i, the walker moves to
i + 1 (with a identified with 0) with probability p and to i − 1 (with −1 identified with a − 1)
with probability 1 − p.
The transition matrix is

P = [ 0     p     0   . . .   0   1−p]
    [1−p    0     p   . . .   0    0 ]
    [ 0    1−p    0   . . .   0    0 ]
    [            . . .                ]
    [ p     0     0   . . .  1−p   0 ]

(that is, P_{i,i+1} = p and P_{i,i−1} = 1 − p, with indices taken modulo a, and all other entries 0)
and is doubly stochastic. Moreover, the chain is aperiodic if a is odd and otherwise periodic
with period 2. Therefore, the proportion of time the walker spends at any state is 1/a, which is
also the limit of P^n_ij for all i and j if a is odd. If a is even, then P^n_ij = 0 when (i − j) and n have
different parity, while, if they are of the same parity, P^n_ij → 2/a.
Assume that we change the transition probabilities a little: assume that, only when the
walker is at 0, she stays at 0 with probability r ∈ (0, 1), moves to 1 with probability (1 − r)p,
and to a − 1 with probability (1 − r)(1 − p). The other transition probabilities are unchanged.
Clearly, now the chain is aperiodic for any a, but the transition matrix is no longer doubly
stochastic. What happens to the invariant distribution?
The walker spends a longer time at 0; if we stop the clock while she stays at 0, the chain
is the same as before and spends an equal proportion of time at all states. It follows that
our perturbed chain spends the same proportion of time at all states except 0, where it spends
a Geometric(1 − r) time at every visit. Therefore, π_0 is larger by a factor of 1/(1 − r) than the
other π_i. Thus, the row vector of invariant probabilities is

π = [1/(1−r), 1, . . . , 1] / (1/(1−r) + a − 1)
  = [1/(1 + (1−r)(a−1)), (1−r)/(1 + (1−r)(a−1)), . . . , (1−r)/(1 + (1−r)(a−1))].
Thus, we can still determine the invariant distribution if only the self-transition probabilities Pii
are changed.
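The perturbed-circle formula can be checked numerically; a sketch with illustrative values of a, p, and r:

```python
# Perturbed random walk on a circle of a states, holding probability r at 0:
# the invariant distribution should be pi_0 = 1/(1+(1-r)(a-1)) and
# pi_i = (1-r)/(1+(1-r)(a-1)) for i != 0.
a, p, r = 5, 0.3, 0.4
P = [[0.0] * a for _ in range(a)]
for i in range(1, a):
    P[i][(i + 1) % a] = p
    P[i][(i - 1) % a] = 1 - p
P[0][0] = r
P[0][1] = (1 - r) * p
P[0][a - 1] = (1 - r) * (1 - p)

pi = [1.0 / a] * a
for _ in range(5000):                      # power iteration
    pi = [sum(pi[i] * P[i][j] for i in range(a)) for j in range(a)]

Z = 1 + (1 - r) * (a - 1)
print(round(pi[0], 6), round(1 / Z, 6))        # both 0.294118
print(round(pi[1], 6), round((1 - r) / Z, 6))  # both 0.176471
```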
Problems
1. Consider the chain in Problem 2 of Chapter 12. (a) Determine the invariant distribution. (b)
Determine lim_{n→∞} P^n_{10}. Why does this limit exist?
2. Consider the chain in Problem 4 of Chapter 12 with the same initial state. Determine the
proportion of time the walker spends at a.
3. Roll a fair die n times and let Sn be the sum of the numbers you roll. Determine, with proof,
lim_{n→∞} P(S_n mod 13 = 0).
4. Peter owns two pairs of running shoes. Each morning he goes running. He is equally likely
to leave from his front or back door. Upon leaving the house, he chooses a pair of running shoes
at the door from which he leaves or goes running barefoot if there are no shoes there. On his
return, he is equally likely to enter at either door and leaves his shoes (if any) there. (a) What
proportion of days does he run barefoot? (b) What proportion of days is there at least one pair
of shoes at the front door (before he goes running)? (c) Now, assume also that the pairs of shoes
are green and red and that he chooses a pair at random if he has a choice. What proportion of
mornings does he run in green shoes?
5. Prof. Messi does one of three service tasks every year, coded as 1, 2, and 3. The assignment
changes randomly from year to year as a Markov chain with transition matrix
P = [7/10  1/5  1/10]
    [1/5   3/5  1/5 ]
    [1/10  2/5  1/2 ].
Determine the proportion of years that Messi has the same assignment as the previous two years.
6. Consider the Markov chain with states 0, 1, 2, 3, 4, which transitions from state i > 0 to one
of the states 0, . . . , i − 1 with equal probability, and from 0 to 4 with probability 1.
Show that all P^n_ij converge as n → ∞ and determine the limits.
Solutions to problems
1. The chain is irreducible and aperiodic. Moreover, (a) π = [10/21, 5/21, 6/21], and (b)
lim_{n→∞} P^n_{10} = π_0 = 10/21; the limit exists by the convergence theorem.

2. The chain is irreducible and aperiodic. Moreover, π = [5/41, 8/41, 8/41, 20/41], and the
proportion of time the walker spends at a is π_1 + π_3 = 13/41.
3. Consider S_n mod 13. This is a Markov chain with states 0, 1, . . . , 12, whose transition matrix
has first row

[0, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6, 0, . . . , 0]

(to get the next row, shift cyclically to the right). This is a doubly stochastic matrix, and the
chain is irreducible and aperiodic, so π_i = 1/13, for all i. So the answer is 1/13.
4. Consider the Markov chain with states 0, 1, 2, given by the number of pairs of shoes at the
front door. Then

P = [3/4  1/4   0 ]
    [1/4  1/2  1/4]
    [ 0   1/4  3/4].

This matrix is doubly stochastic, so the invariant distribution is uniform: π = [1/3, 1/3, 1/3].
(a) He runs barefoot when he leaves through the front door in state 0, or through the back door
in state 2, so the answer is π_0 · (1/2) + π_2 · (1/2) = 1/3. (b) The answer is π_1 + π_2 = 2/3.
(c) By the symmetry between the green and the red pair, the answer is half the proportion of
mornings on which he runs in shoes, (1 − 1/3)/2 = 1/3.

5. The invariant distribution is π = [6/17, 7/17, 4/17]. The answer is

π_1 P_11² + π_2 P_22² + π_3 P_33² = (6/17)(7/10)² + (7/17)(3/5)² + (4/17)(1/2)² = 19/50.
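Problem 5 can be checked numerically; this sketch assumes the transition matrix P = [[7/10, 1/5, 1/10], [1/5, 3/5, 1/5], [1/10, 2/5, 1/2]] (the same one used in the solution above).

```python
# Messi's assignment chain: invariant distribution and the proportion of
# years with the same assignment as in the previous two years.
P = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

pi = [1 / 3] * 3
for _ in range(3000):                      # power iteration
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print([round(x * 17, 4) for x in pi])      # [6.0, 7.0, 4.0], i.e., [6,7,4]/17

ans = sum(pi[i] * P[i][i] ** 2 for i in range(3))
print(round(ans, 4))                       # 19/50 = 0.38
```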
2. Consider the Markov chain with states 1, 2, 3, 4, 5, given by the following transition matrix:

P = [1/2   0   1/2   0    0 ]
    [1/4  1/2  1/4   0    0 ]
    [1/2   0   1/2   0    0 ]
    [ 0    0    0   1/2  1/2]
    [ 0    0    0   1/2  1/2].
Specify the classes and determine whether they are transient or recurrent.
3. In a branching process the number of descendants is determined as follows. An individual
first tosses a coin that comes out Heads with probability p. If this coin comes out Tails, the
individual has no descendants. If the coin comes out Heads, the individual has 1 or 2 descendants,
each with probability 1/2.

(a) Compute π_0, the probability that the branching process eventually dies out. Your answer
will, of course, depend on the parameter p.

(b) Write down the expression for the probability that the branching process is still alive at
generation 3. Do not simplify.

4. A random walker is in one of the four states, 0, 1, 2, or 3. If she is at i at some time, she
makes the following transition. With probability 1/2 she moves from i to (i + 1) mod 4 (that is, if
she is at 0 she moves to 1, from 1 she moves to 2, from 2 to 3, and from 3 to 0). With probability
1/2, she moves to a random state among the four states, each chosen with equal probability.
(a) Show that this chain has a unique invariant distribution and compute it. (Take a good look
at the transition matrix before you start solving this).
(b) After the walker makes many steps, compute the proportion of time she spends at 1. Does
the answer depend on the chain's starting point?
(c) After the walker makes many steps, compute the proportion of times she is at the same state
as at the previous time.
P = [0.7   0   0.3   0 ]
    [0.5   0   0.5   0 ]
    [ 0   0.4   0   0.6]
    [ 0   0.2   0   0.8].
(b) Today is Wednesday and it is raining. It also rained yesterday. Explain how you
would compute the probability that it will rain on Saturday. Do not carry out the
computation.
Solution: If Wednesday is time 0, then Saturday is time 3. The initial state is given by
the row [1, 0, 0, 0] and it will rain on Saturday if we end up at state 1 or 2. Therefore,
our solution is
[1, 0, 0, 0] P³ [1, 1, 0, 0]^T,

that is, the sum of the first two entries of [1, 0, 0, 0] P³.
(c) Under the same assumption as in (b), explain how you would approximate the probability
of rain on a day exactly a year from today. Carefully justify your answer, but do not carry out
the computation.
Solution:
The matrix P is irreducible, since the chain makes the following transitions with
positive probability: (R, R) → (R, N) → (N, N) → (N, R) → (R, R). It is also
aperiodic because the transition (R, R) → (R, R) has positive probability. Therefore,
the probability can be approximated by π_1 + π_2, where [π_1, π_2, π_3, π_4] is the unique
solution to [π_1, π_2, π_3, π_4] P = [π_1, π_2, π_3, π_4] and π_1 + π_2 + π_3 + π_4 = 1.
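Both computations can be carried out numerically. The sketch below assumes the state order (R, R), (N, R), (R, N), (N, N), with the second coordinate being today's weather; this matches the solution's statement that rain corresponds to states 1 and 2.

```python
# Weather chain: (b) rain on Saturday starting from (R,R); (c) long-run
# probability of rain, approximated by iterating the chain.
P = [[0.7, 0.0, 0.3, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.4, 0.0, 0.6],
     [0.0, 0.2, 0.0, 0.8]]

def step(v):
    return [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]

v = [1.0, 0.0, 0.0, 0.0]          # Wednesday: state (R,R)
for _ in range(3):                 # Saturday is 3 steps later
    v = step(v)
saturday = v[0] + v[1]
print(round(saturday, 4))          # 0.523

for _ in range(2000):              # approximate pi_1 + pi_2
    v = step(v)
print(round(v[0] + v[1], 4))       # 0.4
```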
2. Consider the Markov chain with states 1, 2, 3, 4, 5, given by the following transition
matrix:

P = [1/2   0   1/2   0    0 ]
    [1/4  1/2  1/4   0    0 ]
    [1/2   0   1/2   0    0 ]
    [ 0    0    0   1/2  1/2]
    [ 0    0    0   1/2  1/2].
Specify all classes and determine whether they are transient or recurrent.
Solution:
Answer:
{2} is transient;
{1, 3} is recurrent;
{4, 5} is recurrent.
3. In a branching process, the number of descendants is determined as follows. An individual
first tosses a coin that comes out Heads with probability p. If this coin comes out Tails,
the individual has no descendants. If the coin comes out Heads, the individual has 1 or 2
descendants, each with probability 1/2.

(a) Compute π_0, the probability that the branching process eventually dies out. Your
answer will, of course, depend on the parameter p.

Solution:

The probability mass function for the number of descendants is

(  0      1     2 )
(1 − p   p/2   p/2),

and so

E(number of descendants) = p/2 + p = 3p/2.

If 3p/2 ≤ 1, i.e., p ≤ 2/3, then π_0 = 1. Otherwise, we need to compute φ(s) and solve
φ(s) = s. We have

φ(s) = 1 − p + (p/2) s + (p/2) s².

Then,

s = 1 − p + (p/2) s + (p/2) s²,
0 = p s² + (p − 2) s + 2(1 − p),

whose solutions are s = 1 and s = 2(1 − p)/p. Therefore,

π_0 = 2(1 − p)/p, if p > 2/3.
(b) Write down the expression for the probability that the branching process is still alive
at generation 3. Do not simplify.

Solution:

The answer is 1 − φ(φ(φ(0))) and we compute

φ(0) = 1 − p,
φ(φ(0)) = 1 − p + (p/2)(1 − p) + (p/2)(1 − p)²,
1 − φ(φ(φ(0))) = 1 − (1 − p + (p/2)(1 − p + (p/2)(1 − p) + (p/2)(1 − p)²)
                  + (p/2)(1 − p + (p/2)(1 − p) + (p/2)(1 − p)²)²).
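The extinction probability can be double-checked numerically: iterating s → φ(s) from 0 converges to the smallest fixed point of φ, which for p > 2/3 should equal 2(1 − p)/p. The value p = 0.8 below is an illustrative choice.

```python
# Extinction probability of the branching process, for a p > 2/3.
p = 0.8

def phi(s):
    return 1 - p + (p / 2) * s + (p / 2) * s ** 2

s = 0.0
for _ in range(10000):             # iterate s -> phi(s) from 0
    s = phi(s)

print(round(s, 6), round(2 * (1 - p) / p, 6))   # both 0.5
```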
4. A random walker is at one of the four states, 0, 1, 2, or 3. If she is at i at some time, she
makes the following transition. With probability 1/2 she moves from i to (i + 1) mod 4
(that is, if she is at 0 she moves to 1, from 1 she moves to 2, from 2 to 3, and from 3 to
0). With probability 1/2, she moves to a random state among the four states, each chosen
with equal probability.

(a) Show that this chain has a unique invariant distribution and compute it. (Take a
good look at the transition matrix before you start solving this.)

Solution:

The transition matrix is

P = [1/8  5/8  1/8  1/8]
    [1/8  1/8  5/8  1/8]
    [1/8  1/8  1/8  5/8]
    [5/8  1/8  1/8  1/8].

The chain is irreducible (in fact, all entries of P are positive), so the invariant distribution
is unique. As P is doubly stochastic, the invariant distribution is uniform:
π = [1/4, 1/4, 1/4, 1/4].
16 Reversibility
Assume that you have an irreducible and positive recurrent chain, started at its unique invariant
distribution π. Recall that this means that π is the p. m. f. of X_0 and of all other X_n as well.
Now suppose that, for every n, X_0, X_1, . . . , X_n have the same joint p. m. f. as their time-reversal
X_n, X_{n−1}, . . . , X_0. Then we call the chain reversible; sometimes it is, equivalently, also
said that its invariant distribution π is reversible. This means that a recorded simulation of a
reversible chain looks the same if the movie is run backwards.
Is there a condition for reversibility that can be easily checked? The first thing to observe
is that for the chain started at π, reversible or not, the time-reversed chain has the Markov
property. This is not completely intuitively clear, but can be checked:
P(X_k = i | X_{k+1} = j, X_{k+2} = i_{k+2}, . . . , X_n = i_n)
= P(X_k = i, X_{k+1} = j, X_{k+2} = i_{k+2}, . . . , X_n = i_n) / P(X_{k+1} = j, X_{k+2} = i_{k+2}, . . . , X_n = i_n)
= (π_i P_ij P_{j,i_{k+2}} · · · P_{i_{n−1},i_n}) / (π_j P_{j,i_{k+2}} · · · P_{i_{n−1},i_n})
= π_i P_ij / π_j,

which is an expression dependent only on i and j. For reversibility, this expression must be
the same as the forward transition probability P(X_{k+1} = i | X_k = j) = P_ji. Conversely, if both
the original and the time-reversed chain have the same transition probabilities (and we already
know that the two start at the same invariant distribution and that both are Markov), then
their p. m. f.'s must agree. We have proved the following useful result.
Theorem 16.1. Reversibility condition.
A Markov chain with invariant measure π is reversible if and only if

π_i P_ij = π_j P_ji,

for all states i and j.
Another useful fact is that once reversibility is checked, invariance is automatic.
Proposition 16.2. Reversibility implies invariance. If a probability mass function π_i satisfies
the condition in the previous theorem, then it is invariant.
Proof. We need to check that, for every j, π_j = Σ_i π_i P_ij, and here is how we do it:

Σ_i π_i P_ij = Σ_i π_j P_ji = π_j Σ_i P_ji = π_j.
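The detailed-balance condition is very easy to check by machine; here is a small helper, illustrated on the two-state chain (always reversible) and on the deterministic 3-cycle (invariant but not reversible).

```python
# Check the detailed-balance condition pi_i P_ij = pi_j P_ji.
def is_reversible(pi, P, tol=1e-12):
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

a, b = 0.3, 0.2
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]
print(is_reversible(pi, P))                 # True

Q = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]       # deterministic cycle
print(is_reversible([1/3, 1/3, 1/3], Q))    # False: uniform pi is invariant,
                                            # but the chain is not reversible
```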
We now proceed to describe random walks on weighted graphs, the most easily recognizable
examples of reversible chains. Assume that every undirected edge between vertices i and j in a
complete graph has weight wij = wji ; we think of edges with zero weight as not present at all.
When at i, the walker goes to j with probability proportional to wij , so that
P_ij = w_ij / Σ_k w_ik.
What makes such random walks easy to analyze is the existence of a simple reversible measure.
Let

s = Σ_{i,k} w_ik

be the sum of all the weights and let

π_i = (Σ_k w_ik) / s.

To see why this is a reversible distribution, compute

π_i P_ij = ((Σ_k w_ik) / s) · (w_ij / Σ_k w_ik) = w_ij / s,

which is symmetric in i and j, i.e., equal to π_j P_ji.
In particular, for the simple random walk on a graph, where every edge has weight 1, this becomes

π_i = (degree of i) / (2 · (number of all edges)).
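A quick numerical check of the reversible measure for a walk on a weighted graph; the particular symmetric weight matrix below is an illustrative choice.

```python
# Random walk on a weighted graph: P_ij = w_ij / sum_k w_ik, and
# pi_i = (sum_k w_ik) / s should be invariant, where s is the total weight.
w = [[0, 2, 1, 0],
     [2, 0, 3, 1],
     [1, 3, 0, 1],
     [0, 1, 1, 0]]                 # symmetric: w[i][j] == w[j][i]

n = len(w)
s = sum(sum(row) for row in w)
P = [[w[i][j] / sum(w[i]) for j in range(n)] for i in range(n)]
pi = [sum(w[i]) / s for i in range(n)]

piP = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print([round(x, 6) for x in pi])
print([round(x, 6) for x in piP])  # the same vector: pi is invariant
```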
Example 16.2. [The original figure, not reproduced here, shows a graph on six vertices with
degrees 3, 4, 2, 3, 3, and 3.] By the formula above, the invariant distribution for the random
walk on this graph is

π_1 = 3/18, π_2 = 4/18, π_3 = 2/18, π_4 = 3/18, π_5 = 3/18, π_6 = 3/18.

One could instead verify invariance directly, and it is straightforward to do so. Note that this
chain is irreducible, but not aperiodic (it has period 2).
Example 16.3. Markov chain Monte Carlo. Assume that we have a very large probability
space, say some subset of S = {0, 1}^V, where V is a large set of n sites. Assume also that
we have a probability measure on S given via the energy (sometimes called the Hamiltonian)
function E : S → R. The probability of any configuration ω ∈ S is

π(ω) = (1/Z) · e^{−E(ω)/T}.

Here, T > 0 is the temperature, a parameter, and Z is the normalizing constant that makes
Σ_{ω∈S} π(ω) = 1. Such distributions frequently occur in statistical physics and are often called
Maxwell-Boltzmann distributions. They have numerous other applications, however, especially
in optimization problems, and have yielded an optimization technique called simulated annealing.
If T is very large, the role of the energy is diminished and the states are almost equally likely.
On the other hand, if T is very small, the large energy states have a much lower probability than
the small energy ones, thus the system is much more likely to be found in the close to minimal
energy states. If we want to find states with small energy, we merely choose some small T and
generate at random, according to π, some states, and we have a reasonable answer. The only
problem is that, although E is typically a simple function, π is very difficult to evaluate exactly,
as Z is some enormous sum. (There are a few celebrated cases, called exactly solvable systems,
in which exact computations are difficult, but possible.)
Instead of generating a random state directly, we design a Markov chain which has π as
its invariant distribution. It is common that the convergence to π is quite fast and that the
necessary number of steps of the chain to get close to π is some small power of n. This is in
startling contrast to the size of S, which is typically exponential in n. However, the convergence
slows down at a rate exponential in T^{−1} when T is small.
We will illustrate this on the Knapsack problem. Assume that you are a burglar and have
just broken into a jewelry store. You see a large number n of items, with weights wi and values
v_i. Your backpack (knapsack) has a weight limit b. You are faced with the question of how to fill
your backpack, that is, you have to maximize the combined value of the items you will carry out,

V = V(η_1, . . . , η_n) = Σ_{i=1}^n v_i η_i,

subject to the constraints that η_i ∈ {0, 1} and that the combined weight does not exceed the
backpack capacity,

Σ_{i=1}^n w_i η_i ≤ b.
Assume that η is the state at time t, i.e., X_t = η. Pick a coordinate i, uniformly at random.
Let η^i be the same as η except that its ith coordinate is flipped: η^i_i = 1 − η_i. (This means
that the status of the ith item is changed from in to out or from out to in.) If η^i is not feasible,
then X_{t+1} = η and the state is unchanged. Otherwise, evaluate the difference in energy
E(η^i) − E(η) and proceed as follows:

if E(η^i) − E(η) ≤ 0, then make the transition to η^i, X_{t+1} = η^i;
otherwise, X_{t+1} = η^i with probability e^{−(1/T)(E(η^i)−E(η))}, and X_{t+1} = η with the
remaining probability.
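A minimal sketch of this scheme for the Knapsack problem, taking the energy to be E(η) = −V(η), which is the natural choice here. The item weights, values, capacity, and temperature below are illustrative, not data from the text.

```python
import math
import random

random.seed(0)
w = [3, 5, 4, 7, 2, 6]          # item weights (illustrative)
v = [4, 8, 5, 9, 2, 7]          # item values (illustrative)
b = 12                          # backpack capacity
T = 0.5                         # temperature

def value(eta):
    return sum(vi for vi, e in zip(v, eta) if e)

def weight(eta):
    return sum(wi for wi, e in zip(w, eta) if e)

eta = [0] * len(w)              # the empty knapsack is feasible
best = 0
for _ in range(20000):
    i = random.randrange(len(w))
    flip = eta[:]
    flip[i] = 1 - flip[i]       # change the status of the ith item
    if weight(flip) > b:
        continue                # infeasible: state unchanged
    dE = value(eta) - value(flip)          # E = -V
    if dE <= 0 or random.random() < math.exp(-dE / T):
        eta = flip
    best = max(best, value(eta))

print(best)                     # best feasible value found (optimum here: 19)
```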
Problems
1. Determine the invariant distribution for the random walk in Examples 12.4 and 12.10.
2. A total of m white and m black balls are distributed into two urns, with m balls per urn. At
each step, a ball is randomly selected from each urn and the two balls are interchanged. The
state of this Markov chain can, thus, be described by the number of black balls in urn 1. Guess
the invariant measure for this chain and prove that it is reversible.
3. Each day, your opinion on a particular political issue is either positive, neutral, or negative.
If it is positive today, then it is neutral or negative tomorrow with equal probability. If it is
neutral or negative, it stays the same with probability 0.5, and, otherwise, it is equally likely to
be either of the other two possibilities. Is this a reversible Markov chain?
4. A king moves on a standard 8 × 8 chessboard. Each time, it makes one of the available legal
moves (to a horizontally, vertically or diagonally adjacent square) at random. (a) Assuming
that the king starts at one of the four corner squares of the chessboard, compute the expected
number of steps before it returns to the starting position. (b) Now you have two kings, they both
start at the same corner square and move independently. What is, now, the expected number
of steps before they simultaneously occupy the starting position?
Solutions to problems
1. Answer: π = [1/5, 3/10, 1/5, 3/10].
2. Guess:

π_i = (m choose i)² / (2m choose m), i = 0, . . . , m.

The nonzero transition probabilities are

P_{i,i−1} = i²/m², P_{i,i+1} = (m − i)²/m², P_{i,i} = 2i(m − i)/m²,

and the detailed-balance condition π_i P_{i,i+1} = π_{i+1} P_{i+1,i} follows from the identity
(m choose i)(m − i) = (m choose i+1)(i + 1).
3. The transition matrix is

P = [ 0   1/2  1/2]
    [1/4  1/2  1/4]
    [1/4  1/4  1/2].

One way to check reversibility is to compute the invariant distribution [π_1, π_2, π_3], form the
diagonal matrix D with π_1, π_2, π_3 on the diagonal, and check that DP is symmetric. We get
π_1 = 1/5, π_2 = 2/5, π_3 = 2/5, and DP is, indeed, symmetric, so the chain is reversible.
4. This is a random walk on a graph with 64 vertices (squares) and degrees 3 (4 corner squares),
5 (24 side squares), and 8 (36 remaining squares). If i is a corner square,

π_i = 3 / (3 · 4 + 5 · 24 + 8 · 36) = 3/420,

so the answer to (a) is 420/3 = 140. In (b), you have two independent chains, so π_(i,j) = π_i π_j
and the answer is (420/3)² = 19600.
17 Three Applications
Parrondo's Paradox

This famous paradox was constructed by the Spanish physicist J. Parrondo. We will consider
three games A, B and C with five parameters: probabilities p, p_1, p_2, and γ, and an integer
period M ≥ 2. These parameters are, for now, general so that the description of the games is
more transparent. We will choose particular values once we are finished with the analysis.

We will call a game losing if, after playing it for a long time, a player's capital becomes more
and more negative, i.e., the player loses more and more money.

Game A is very simple; in fact it is an asymmetric one-dimensional simple random walk.
Win $1, i.e., add +1 to your capital, with probability p, and lose a dollar, i.e., add −1 to your
capital, with probability 1 − p. This is clearly a losing game if p < 1/2.
In game B, the winning probabilities depend on whether your current capital is divisible by
M. If it is, you add +1 with probability p_1, and −1 with probability 1 − p_1, and, if it is not,
you add +1 with probability p_2 and −1 with probability 1 − p_2. We will determine below when
this is a losing game.
Now consider game C, in which you, at every step, play A with probability γ and B with
probability 1 − γ. Is it possible that A and B are losing games, while C is winning?
The surprising answer is yes! However, this should not be so surprising as in game B your
winning probabilities depend on the capital you have and you can manipulate the proportion
of time your capital spends at unfavorable amounts by playing the combination of the two
games.
We now provide a detailed analysis. As mentioned, game A is easy. To analyze game B, take
a simple random walk which makes a +1 step with probability p_2 and a −1 step with probability
1 − p_2. Assume that you start this walk at some x, 0 < x < M. Then, by the Gambler's ruin
computation (Example 11.6),
(17.1)    P(the walk hits M before 0) = (1 − ((1−p_2)/p_2)^x) / (1 − ((1−p_2)/p_2)^M).
Starting from a multiple of M , the probability that you increase your capital by M before either
decreasing it by M or returning to the starting point is
(17.2)    p_1 · (1 − (1−p_2)/p_2) / (1 − ((1−p_2)/p_2)^M).
(You have to make a step to the right and then use the formula (17.1) with x = 1.) Similarly, from
a multiple of M , the probability that you decrease your capital by M before either increasing it
by M or returning to the starting point is

(17.3)    (1 − p_1) · ((1−p_2)/p_2)^{M−1} · (1 − (1−p_2)/p_2) / (1 − ((1−p_2)/p_2)^M).

(Now you have to move one step to the left and then use 1 − (probability in (17.1) with
x = M − 1).)
The main trick is to observe that game B is losing if (17.2) < (17.3). Why? Observe your
capital at multiples of M: if, starting from kM, the probability that the next (different) multiple
of M you visit is (k − 1)M exceeds the probability that it is (k + 1)M, then the game is losing,
and that is exactly when (17.2) < (17.3). After some algebra, this condition reduces to
(17.4)    ((1 − p_1)(1 − p_2)^{M−1}) / (p_1 p_2^{M−1}) > 1.
Now, game C is the same as game B with p_1 and p_2 replaced by q_1 = γp + (1 − γ)p_1 and
q_2 = γp + (1 − γ)p_2, yielding a winning game if
(17.5)    ((1 − q_1)(1 − q_2)^{M−1}) / (q_1 q_2^{M−1}) < 1.
This is easily achieved with a large enough M as soon as p_2 < 1/2 and q_2 > 1/2, but even for
M = 3 one can choose

p = 5/11, p_1 = 1/121, p_2 = 10/11, γ = 1/2,

to get 6/5 in (17.4) and 217/300 in (17.5).
A renewal theorem. Assume that f_1, . . . , f_N ≥ 0 satisfy Σ_{k=1}^N f_k = 1. Let
μ = Σ_{k=1}^N k f_k and define the sequence u_n by

u_n = 0 if n < 0,
u_0 = 1,
u_n = Σ_{k=1}^N f_k u_{n−k} if n > 0.

Assume that the greatest common divisor of the set {k : f_k > 0} is 1. Then,

lim_{n→∞} u_n = 1/μ.
Example 17.1. Roll a fair die forever and let S_m be the sum of outcomes of the first m rolls.
Let p_n = P(S_m ever equals n). Estimate p_{10,000}.
To connect this to Markov chains, consider the chain on states 0, 1, . . . , N − 1 with transition
matrix

P = [ f_1                          1 − f_1                          0    . . .   0 ]
    [ f_2/(1 − f_1)                   0      (1 − f_1 − f_2)/(1 − f_1)   . . .   0 ]
    [ f_3/(1 − f_1 − f_2)             0            0                    . . .   0 ]
    [ . . .                                                                        ]
    [ f_N/(1 − f_1 − · · · − f_{N−1})  0            0                    . . .   0 ],

where each row x has the entry f_{x+1}/(1 − f_1 − · · · − f_x) in column 0 and the entry
(1 − f_1 − · · · − f_{x+1})/(1 − f_1 − · · · − f_x) in column x + 1.
This is called a renewal chain: it moves to the right (from x to x + 1) on the nonnegative
integers, except for renewals, i.e., jumps to 0. At N − 1, the jump to 0 is certain (note that the
matrix entry P_{N−1,0} is 1, since the sum of the f_k's is 1).
The chain is irreducible (you can get to N − 1 from anywhere, from N − 1 to 0, and from
0 to anywhere) and we will see shortly that it is also aperiodic. If X_0 = 0 and R_0 is the first
return time to 0, then P(R_0 = k) clearly equals f_1, if k = 1. For k = 2 it equals

(1 − f_1) · f_2/(1 − f_1) = f_2,

for k = 3 it equals

(1 − f_1) · (1 − f_1 − f_2)/(1 − f_1) · f_3/(1 − f_1 − f_2) = f_3,

and, continuing in the same way, P(R_0 = k) = f_k for all k ≥ 1.
In particular, the promised aperiodicity follows, as the chain can return to 0 in k steps
whenever f_k > 0.
Moreover, the expected return time to 0 is

m_00 = Σ_{k=1}^N k f_k = μ.
The next observation is that the probability P^n_00 that the chain is at 0 in n steps is given by
the recursion

(17.6)    P^n_00 = Σ_{k=1}^n P(R_0 = k) P^{n−k}_00.
To see this, observe that you must return to 0 at some time not exceeding n in order to end up
at 0; either you return for the first time at time n or you return at some previous time k and,
then, you have to be back at 0 in n − k steps.
The above formula (17.6) is true for every Markov chain. In this case, however, we note that
the first return time to 0 is, certainly, at most N, so we can always sum to N with the proviso
that P^{n−k}_00 = 0 when k > n. So, from (17.6) we get

P^n_00 = Σ_{k=1}^N f_k P^{n−k}_00.
The recursion for P^n_00 is thus the same as the recursion for u_n. The initial conditions are also
the same and we conclude that u_n = P^n_00. It follows from the convergence theorem (Theorem
15.3) that

lim_{n→∞} u_n = lim_{n→∞} P^n_00 = 1/m_00 = 1/μ,

which proves the renewal theorem. In Example 17.1, p_n = u_n with f_k = 1/6 for k = 1, . . . , 6,
so that μ = 7/2 and p_{10,000} ≈ 2/7 ≈ 0.2857.
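The renewal recursion itself gives a direct numerical check of Example 17.1: with f_k = 1/6 for k = 1, . . . , 6, the sequence u_n = p_n converges to 1/μ = 2/7.

```python
# Renewal recursion for the fair-die sums: u_n -> 2/7.
f = {k: 1/6 for k in range(1, 7)}
u = [1.0]                                   # u_0 = 1
for n in range(1, 2000):
    u.append(sum(f[k] * u[n - k] for k in f if k <= n))
print(round(u[-1], 6))                      # 2/7 = 0.285714
```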
likely to win in the horse race? There are several ways of solving these problems (a particularly
elegant one uses the so called Optional Stopping Theorem for martingales), but we will use
Markov chains.
The Markov chain X_n we will utilize has as its state space all patterns of length ℓ. Each
time, the chain transitions into the pattern obtained by appending 1 (with probability p) or 0
(with probability 1 − p) at the right end of the current pattern and by deleting the symbol at
the left end of the current pattern. That is, the chain simply keeps track of the last ℓ symbols
in a sequence of tosses.

There is a slight problem before we have ℓ tosses. For now, assume that the chain starts
with some particular sequence of ℓ tosses, chosen in some way.

We can immediately figure out the invariant distribution for this chain. At any time n ≥ ℓ
and for any pattern A with k 1's and ℓ − k 0's,

P(X_n = A) = p^k (1 − p)^{ℓ−k},

as the chain is generated by independent coin tosses! Therefore, the invariant distribution of
X_n assigns to A the probability

π_A = p^k (1 − p)^{ℓ−k}.
Now, if we have two patterns B and A, denote by N_{B→A} the number of additional
tosses we need to get A provided that the tosses so far have ended in B. Here, if A is a subpattern
of B, this does not count: we have to actually complete A with the additional tosses, although we
can use a part of B. For example, if B = 111001 and A = 110, and the next tosses are 10, then
N_{B→A} = 2, and, if the next tosses are 001110, then N_{B→A} = 6.

Also denote

E(B → A) = E(N_{B→A}).

Our initial example can, therefore, be formulated as follows: compute

E(∅ → 1011101),

where ∅ denotes the empty pattern. The convergence theorem for Markov chains guarantees
that, for every A,

E(A → A) = 1/π_A.
The hard part of our problem is over. We now show how to analyze the waiting game by
example. We know that

E(1011101 → 1011101) = 1/π_{1011101}.

However, starting with 1011101, we can only use the overlap 101 to help us get back to 1011101,
so that

E(1011101 → 1011101) = E(101 → 1011101).
To get from ∅ to 1011101, we have to get first to 101 and then from there to 1011101, so that

E(∅ → 1011101) = E(∅ → 101) + E(101 → 1011101).

We have reduced the problem to 101, and we iterate our method:

E(∅ → 101) = E(∅ → 1) + E(1 → 101)
           = E(∅ → 1) + E(101 → 101)
           = E(1 → 1) + E(101 → 101)
           = 1/π_1 + 1/π_101.

The final result is

E(∅ → 1011101) = 1/π_{1011101} + 1/π_{101} + 1/π_1
              = 1/(p^5 (1 − p)^2) + 1/(p^2 (1 − p)) + 1/p.
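For a fair coin (p = 1/2) the formula gives 2^7 + 2^3 + 2 = 138. This can be cross-checked against an independent computation of expected hitting times on the chain whose state is the length of the longest pattern prefix matched so far; a sketch:

```python
p = 0.5   # Heads probability

def expected_wait(pattern):
    """Expected tosses until `pattern` first appears, by value iteration on
    the prefix-matching chain (state = longest matched prefix length)."""
    L = len(pattern)

    def next_state(k, bit):
        s = pattern[:k] + bit
        for m in range(min(len(s), L), 0, -1):
            if s.endswith(pattern[:m]):
                return m
        return 0

    e = [0.0] * (L + 1)           # e[L] = 0: the pattern has been reached
    for _ in range(5000):
        new = [0.0] * (L + 1)
        for k in range(L):
            new[k] = (1 + p * e[next_state(k, "1")]
                      + (1 - p) * e[next_state(k, "0")])
        e = new
    return e[0]

print(round(expected_wait("101")))       # 10 = 2^3 + 2
print(round(expected_wait("1011101")))   # 138 = 2^7 + 2^3 + 2
```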
The same method applies to the horse race. If A and B are the two competing patterns, let N
be the first time one of them appears, and decompose the waiting time for A as

N_{∅→A} = N + I_{{B appears first}} · N′_{B→A},

where N′_{B→A} is the additional number of tosses we need to get to A after we reach B for the first
time. In words, to get to A we either stop at N or go further starting from B, but the second
case happens only when B appears before A. Taking expectations,

E(∅ → A) = E(N) + P(B appears first) · E(B → A).
Problems
1. Start at 0 and perform the following random walk on the integers. At each step, flip 3 fair
coins and make a jump forward equal to the number of Heads (you stay where you are if you
flip no Heads). Let p_n be the probability that you ever hit n. Compute lim_{n→∞} p_n. (It is not
2/3!)
2. Suppose that you have three patterns A = 0110, B = 1010, C = 0010. Compute the
probability that A appears first among the three in a sequence of fair coin tosses.
Solutions to problems
1. The size S of the step has the p. m. f. given by P(S = 0) = 1/8, P(S = 1) = 3/8,
P(S = 2) = 3/8, P(S = 3) = 1/8. Thus,

p_n = (1/8) p_n + (3/8) p_{n−1} + (3/8) p_{n−2} + (1/8) p_{n−3},

and so

p_n = (3/7) p_{n−1} + (3/7) p_{n−2} + (1/7) p_{n−3}.

This is the renewal recursion with f_1 = 3/7, f_2 = 3/7, f_3 = 1/7 (the distribution of the step
conditioned on being nonzero), so that μ = 3/7 + 2 · 3/7 + 3 · 1/7 = 12/7 and

lim_{n→∞} p_n = 1/μ = 7/12.
2. Using the method from the text, one computes

E(∅ → A) = 18, E(∅ → B) = 20, E(∅ → C) = 18,
E(B → A) = 16, E(C → A) = 16,
E(A → B) = 16, E(C → B) = 16,
E(A → C) = 16, E(B → C) = 16.

If N is the first time any of the three patterns appears and p_A, p_B, p_C are the probabilities
that A, B, or C, respectively, appears first, then

18 = E(N) + 16 (p_B + p_C),
20 = E(N) + 16 (p_A + p_C),
18 = E(N) + 16 (p_A + p_B),

which, together with p_A + p_B + p_C = 1, gives E(N) = 8 and p_A = 3/8, p_B = 1/4,
p_C = 3/8. Thus, A appears first with probability 3/8.
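The race probabilities can be sanity-checked by straightforward simulation (the estimates should come out near 3/8, 1/4, 3/8):

```python
import random

random.seed(3)
patterns = {"A": "0110", "B": "1010", "C": "0010"}

def winner():
    s = ""
    while True:
        s += random.choice("01")
        for name, pat in patterns.items():
            if s.endswith(pat):
                return name

trials = 100000
counts = {k: 0 for k in patterns}
for _ in range(trials):
    counts[winner()] += 1
print({k: round(c / trials, 3) for k, c in counts.items()})
# roughly {'A': 0.375, 'B': 0.25, 'C': 0.375}
```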
18 Poisson Process
A Poisson process with rate λ > 0 is a counting process N(t), t ≥ 0, giving the number of
events that occur in the time interval [0, t], with the following properties:

1. N(0) = 0;

2. it has independent increments: if (s_1, t_1] ∩ (s_2, t_2] = ∅, then N(t_1) − N(s_1) and
N(t_2) − N(s_2) are independent; and

3. the number of events in any interval of length t is Poisson(λt).

In particular,

P(N(t + s) − N(s) = k) = e^{−λt} (λt)^k / k!,  k = 0, 1, 2, . . . ,

E(N(t + s) − N(s)) = λt.

Moreover, as h → 0,

P(N(h) = 1) = e^{−λh} λh ≈ λh,
P(N(h) ≥ 2) = O(h²) ≪ h.
Thus, in small time intervals, a single event happens with probability proportional to the length
of the interval; this is why λ is called the rate.
A definition such as the above should be followed by the question of whether the object in
question exists: we may be wishing for contradictory properties. To demonstrate the existence, we
will outline two constructions of the Poisson process. Yes, it is unique, but it would require
some probabilistic sophistication to prove this, as would the proof (or even the formulation) of
convergence in the first construction we are about to give. Nevertheless, it is very useful, as it
makes many properties of the Poisson process almost instantly understandable.
Construction by tossing a low-probability coin very fast. Pick a large n and assume
that you have a coin with (low) Heads probability λ/n. Toss the coin at times which are positive
integer multiples of 1/n (that is, very fast) and let N_n(t) be the number of Heads in [0, t]. Clearly,
the number of Heads in any interval (s, t] is Binomial with the number of trials about
n(t − s) and success probability λ/n; thus, it converges to Poisson(λ(t − s)), as n → ∞. Moreover,
N_n has independent increments for any n and hence the same holds in the limit. We should
note that the Heads probability does not need to be exactly λ/n; instead, it suffices that this
probability converges to λ when multiplied by n. Similarly, we do not need all integer multiples
of 1/n; it is enough that their number in [0, t], divided by n, converges to t in probability for any
fixed t.
An example of a property that follows immediately is the following. Let S_k be the time of
the kth (say, 3rd) event (which is a random time) and let N_k(t) be the number of additional
events within time t after time S_k. Then, N_k(t) is another Poisson process, with the same rate
λ, as starting to count the Heads afresh after the kth Heads gives us the same process as if we
counted them from the beginning: we can restart a Poisson process at the time of the kth
event. In fact, we can do so at any stopping time, a random time T with the property that
T = t depends only on the behavior of the Poisson process up to time t (i.e., depends on the
past, but not on the future). The Poisson process, restarted at a stopping time, has the same
properties as the original process started at time 0; this is called the strong Markov property.

As each N_k is a Poisson process, N_k(0) = 0, so two events in the original Poisson process
N(t) do not happen at the same time.
Let T_1, T_2, . . . be the interarrival times, where T_n is the time elapsed between the (n − 1)st
and the nth event. A typical example would be the times between consecutive buses arriving at
a station.
Proposition 18.1. Distribution of interarrival times:
T_1, T_2, . . . are independent and Exponential(λ).

Proof. We have

P(T_1 > t) = P(N(t) = 0) = e^{−λt},

which proves that T_1 is Exponential(λ). Moreover, for any s > 0 and any t > 0,

P(T_2 > t | T_1 = s) = P(no events in (s, s + t] | T_1 = s) = P(N(t) = 0) = e^{−λt},
as events in (s, s + t] are not influenced by what happens in [0, s]. So, T_2 is independent of T_1
and Exponential(λ). Similarly, we can establish that T_3 is independent of T_1 and T_2 with the
same distribution, and so on.
Construction by exponential interarrival times. We can use the above Proposition 18.1
for another construction of a Poisson process, which is convenient for simulations. Let T_1, T_2, . . .
be i. i. d. Exponential(λ) random variables and let S_n = T_1 + . . . + T_n be the waiting time for
the nth event. We define N(t) to be the largest n so that S_n ≤ t.
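This construction translates directly into a simulation; a minimal sketch that checks E N(t) ≈ λt (the values of λ and t are illustrative):

```python
import random

random.seed(2)
lam, t, trials = 2.0, 5.0, 20000

def N(t):
    """One sample of N(t), built from Exponential(lam) interarrival times."""
    s, n = 0.0, 0
    while True:
        s += random.expovariate(lam)
        if s > t:
            return n
        n += 1

mean = sum(N(t) for _ in range(trials)) / trials
print(round(mean, 2))              # close to lam * t = 10
```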
We know that E S_n = n/λ, but we can also derive its density; the distribution is called
Gamma(n, λ). We start with

P(S_n > t) = P(N(t) < n) = Σ_{j=0}^{n−1} e^{−λt} (λt)^j / j!,
and then differentiate:

f_{S_n}(t) = −(d/dt) P(S_n > t)
= −Σ_{j=0}^{n−1} (1/j!) (−λ e^{−λt} (λt)^j + e^{−λt} j λ (λt)^{j−1})
= λ e^{−λt} Σ_{j=0}^{n−1} ((λt)^j / j! − (λt)^{j−1} / (j − 1)!)
= λ e^{−λt} (λt)^{n−1} / (n − 1)!,

as the sum telescopes.
Example 18.1. Consider a Poisson process with rate λ. Compute (a) E(time of the 10th
event), (b) P(the 10th event occurs 2 or more time units after the 9th event), (c) P(the 10th
event occurs later than time 20), and (d) P(2 events in [1, 4] and 3 events in [3, 5]).
The answer to (a) is 10/λ, by Proposition 18.1. The answer to (b) is e^{−2λ}, as one can restart
the Poisson process at any event. The answer to (c) is P(S_10 > 20) = P(N(20) < 10), so we
can either write the integral

P(S_10 > 20) = ∫_{20}^{∞} λ e^{−λt} (λt)^9 / 9! dt,

or use

P(N(20) < 10) = Σ_{j=0}^{9} e^{−20λ} (20λ)^j / j!.
For (d), condition on the number of events in [3, 4]:

P(2 events in [1, 4] and 3 events in [3, 5])
= Σ_{k=0}^{∞} P(2 events in [1, 4] and 3 events in [3, 5] | k events in [3, 4]) · P(k events in [3, 4])
= Σ_{k=0}^{2} P(2 − k events in [1, 3]) · P(3 − k events in [4, 5]) · P(k events in [3, 4])
= Σ_{k=0}^{2} (e^{−2λ} (2λ)^{2−k} / (2 − k)!) · (e^{−λ} λ^{3−k} / (3 − k)!) · (e^{−λ} λ^k / k!)
= e^{−4λ} ((1/3) λ^5 + λ^4 + (1/2) λ^3).
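Part (d) can be sanity-checked by Monte Carlo simulation with λ = 1, for which the answer is e^{−4}(1/3 + 1 + 1/2) = (11/6)e^{−4} ≈ 0.0336:

```python
import math
import random

random.seed(4)
lam, trials, hits = 1.0, 200000, 0

def count(events, lo, hi):
    return sum(1 for x in events if lo < x <= hi)

for _ in range(trials):
    events, s = [], 0.0                 # one process on (0, 5]
    while True:
        s += random.expovariate(lam)
        if s > 5.0:
            break
        events.append(s)
    if count(events, 1, 4) == 2 and count(events, 3, 5) == 3:
        hits += 1

exact = (11 / 6) * math.exp(-4)
print(round(hits / trials, 4), round(exact, 4))   # both about 0.0336
```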
Proof. This is a consequence of the same property for Poisson random variables.
Theorem 18.3. Thinning of a Poisson process.
Each event in a Poisson process N(t) with rate λ is independently a Type I event with probability
p; the remaining events are Type II. Let N_1(t) and N_2(t) be the numbers of Type I and Type II
events in [0, t]. These are independent Poisson processes, with rates λp and λ(1 − p).
The most substantial part of this theorem is independence, as the other claims follow from
the thinning properties of Poisson random variables (Example 11.4).
Proof. We argue by discrete approximation. At each integer multiple of 1/n, we toss two independent coins: coin A has Heads probability λp/n, and coin B has Heads probability (λ(1 − p)/n)/(1 − λp/n). Then call discrete Type I events the locations with coin A Heads; discrete Type II(a) events the locations with coin A Tails and coin B Heads; and discrete Type II(b) events the locations with coin B Heads. A location is a discrete event if it is either a Type I or a Type II(a) event.

One can easily compute that a location k/n is a discrete event with probability λ/n. Moreover,
given that a location is a discrete event, it is Type I with probability p. Therefore, the process
of discrete events and its division into Type I and Type II(a) events determines the discrete
versions of the processes in the statement of the theorem. Now, discrete Type I and Type II(a)
events are not independent (for example, both cannot occur at the same location), but discrete
Type I and Type II(b) events are independent (as they depend on different coins). The proof will be concluded by showing that discrete Type II(a) and Type II(b) events have the same limit as n → ∞. The Type I and Type II events will then be independent as limits of independent discrete processes.
To prove the claimed asymptotic equality, observe first that the discrete schemes (a) and (b) result in a different outcome at a location k/n exactly when two events occur there: a discrete Type I event and a discrete Type II(b) event. The probability that the two discrete Type II schemes differ at k/n is thus at most C/n², for some constant C. This causes the expected number of such double points in [0, t] to be at most Ct/n. Therefore, by the Markov inequality, an upper bound for the probability that there is at least one double point in [0, t] is also Ct/n. This probability goes to zero, as n → ∞, for any fixed t and, consequently, the discrete (a) and (b) schemes indeed result in the same limit.
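The theorem can also be checked empirically. The following sketch (rates, thinning probability, and seed are arbitrary choices) simulates a thinned process and verifies the means of N1(t) and N2(t) and that their sample covariance is near zero.

```python
import random

rng = random.Random(3)
lam, p, t, reps = 2.0, 0.3, 4.0, 20000
n1s, n2s = [], []
for _ in range(reps):
    s, n1, n2 = 0.0, 0, 0
    while True:
        s += rng.expovariate(lam)     # next event of the rate-lam process
        if s > t:
            break
        if rng.random() < p:          # independent Type I coin for each event
            n1 += 1
        else:                         # otherwise the event is Type II
            n2 += 1
    n1s.append(n1)
    n2s.append(n2)

m1 = sum(n1s) / reps                  # should be near lam * p * t = 2.4
m2 = sum(n2s) / reps                  # should be near lam * (1 - p) * t = 5.6
cov = sum(a * b for a, b in zip(n1s, n2s)) / reps - m1 * m2   # near 0
```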
Example 18.2. Customers arrive at a store at a rate of 10 per hour. Each is either male or female with probability 1/2. Assume that you know that exactly 10 women entered within some hour (say, 10 to 11am). (a) Compute the probability that exactly 10 men also entered. (b) Compute the probability that at least 20 customers have entered.

Male and female arrivals are independent Poisson processes, each with parameter (1/2) · 10 = 5, so the answer to (a) is

e^{−5} 5^{10}/10!.
The answer to (b) is

Σ_{k=10}^{∞} P(k men entered) = Σ_{k=10}^{∞} e^{−5} 5^k/k! = 1 − Σ_{k=0}^{9} e^{−5} 5^k/k!.
Example 18.3. Assume that cars arrive at a rate of 10 per hour and that each car will pick up a hitchhiker with probability 1/10. You are second in line. What is the probability that you will have to wait for more than 2 hours?

Cars that pick up hitchhikers are a Poisson process with rate 10 · (1/10) = 1. As you are second in line, your waiting time is S2, so the answer is P(S2 > 2) = P(N(2) ≤ 1) = e^{−2} + 2e^{−2} = 3e^{−2} ≈ 0.406.
Assume that we have two independent Poisson processes, N1(t) with rate λ1 and N2(t) with rate λ2. The probability that n events occur in the first process before m events occur in the second process is

Σ_{k=n}^{n+m−1} C(n + m − 1, k) (λ1/(λ1 + λ2))^k (λ2/(λ1 + λ2))^{n+m−1−k}.
We can easily extend this idea to more than two independent Poisson processes; we will not make a formal statement, but instead illustrate with a few examples below.
Proof. Start with a Poisson process with rate λ1 + λ2, then independently decide for each event whether it belongs to the first process, with probability λ1/(λ1 + λ2), or to the second process, with probability λ2/(λ1 + λ2). The obtained processes are independent and have the correct rates. The probability we are interested in is the probability that, among the first m + n − 1 events in the combined process, n or more events belong to the first process, which is the binomial probability in the statement.
Example 18.4. Assume that λ1 = 5, λ2 = 1. Then,

P(5 events in the first process before 1 in the second) = (5/6)^5

and

P(5 events in the first process before 2 in the second) = Σ_{k=5}^{6} C(6, k) (5/6)^k (1/6)^{6−k} = 11 · 5^5/6^6.
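The binomial formula above is easy to code. This sketch evaluates it and reproduces both answers of Example 18.4; the helper name `first_beats_second` is hypothetical, not from the text.

```python
from math import comb

def first_beats_second(n, m, lam1, lam2):
    """P(n events in process 1 before m events in process 2): among the first
    n + m - 1 events of the combined process, at least n belong to process 1."""
    p = lam1 / (lam1 + lam2)
    total = n + m - 1
    return sum(comb(total, k) * p**k * (1 - p) ** (total - k)
               for k in range(n, total + 1))

a = first_beats_second(5, 1, 5, 1)   # equals (5/6)^5
b = first_beats_second(5, 2, 5, 1)   # equals 11 * 5^5 / 6^6
```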
Example 18.5. You have three friends, A, B, and C. Each will call you after an Exponential amount of time with expectation 30 minutes, 1 hour, and 2.5 hours, respectively. You will go out with the first friend that calls. What is the probability that you go out with A?
We could evaluate the triple integral, but we will avoid that. Interpret each call as the first event in the appropriate one of three Poisson processes with rates 2, 1, and 2/5, assuming the time unit to be one hour. (Recall that the rates are the inverses of the expectations.)

We will solve the general problem with rates λ1, λ2, and λ3. Start with a rate λ1 + λ2 + λ3 Poisson process and distribute the events with probabilities λ1/(λ1 + λ2 + λ3), λ2/(λ1 + λ2 + λ3), and λ3/(λ1 + λ2 + λ3), respectively. The probability that A calls first is clearly λ1/(λ1 + λ2 + λ3), which in our case works out to be

2/(2 + 1 + 2/5) = 10/17.
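A Monte Carlo check of this answer (a sketch; the seed and replication count are arbitrary): draw the three Exponential call times directly and count how often A's is the smallest.

```python
import random

rng = random.Random(5)
rates = {"A": 2.0, "B": 1.0, "C": 0.4}    # calls per hour; C's rate is 1/2.5
reps = 100000
wins = 0
for _ in range(reps):
    calls = {f: rng.expovariate(r) for f, r in rates.items()}
    if min(calls, key=calls.get) == "A":  # A's call time is the smallest
        wins += 1
p_A = wins / reps    # should be near 2 / (2 + 1 + 2/5) = 10/17
```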
Our next theorem illustrates what we can say about previous event times if we either know
that their number by time t is k or we know that the kth one happens exactly at time t.
Theorem 18.5. Uniformity of previous event times.
(1) Given that N(t) = k, the conditional distribution of the k event times in [0, t] is the same as that of k independent uniform random variables on [0, t], put in increasing order. (2) Given that Sk = t, the conditional distribution of the previous k − 1 event times is the same as that of k − 1 independent uniform random variables on [0, t], put in increasing order.

Example 18.6. Passengers arrive at a bus station as a Poisson process with rate λ, and a bus departs at time T. Let W be the combined waiting time of all passengers that arrive before the departure.

(a) Assume that T is deterministic. Given N(T) = k, the k arrival times are distributed as k independent uniform random variables on [0, T], so each of the k passengers waits T/2 on average. Therefore,

EW = E(N(T)) · (T/2) = λT²/2.
(b) Now two buses depart, one at T and one at S < T . What is EW now?
We have two independent Poisson processes in the time intervals [0, S] and [S, T], so the answer is

λS²/2 + λ(T − S)²/2.
(c) Now assume that T, the only bus departure time, is Exponential(μ), independent of the passengers' arrivals.

This time,

EW = ∫_0^∞ (λt²/2) fT(t) dt = (λ/2) E(T²) = (λ/2)(Var(T) + (ET)²) = (λ/2)(2/μ²) = λ/μ².
(d) Finally, two buses depart as the first two events in a rate μ Poisson process.

This makes

EW = 2λ/μ².
Example 18.7. You have two machines. Machine 1 has lifetime T1, which is Exponential(λ1), and Machine 2 has lifetime T2, which is Exponential(λ2). Machine 1 starts at time 0 and Machine 2 starts at a time T.

(a) Assume that T is deterministic. Compute the probability that M1 is the first to fail.

We could compute this via a double integral (which is a good exercise!), but instead we proceed thus:

P(T1 < T2 + T) = P(T1 < T) + P(T1 ≥ T, T1 < T2 + T)
= P(T1 < T) + P(T1 < T2 + T | T1 ≥ T) P(T1 ≥ T)
= 1 − e^{−λ1 T} + P(T1 − T < T2 | T1 ≥ T) e^{−λ1 T}
= 1 − e^{−λ1 T} + P(T1 < T2) e^{−λ1 T}
= 1 − e^{−λ1 T} + (λ1/(λ1 + λ2)) e^{−λ1 T}.
The key observation above is that P(T1 − T < T2 | T1 ≥ T) = P(T1 < T2). Why does this hold? We can simply quote the memoryless property of the Exponential distribution, but it is instructive to make a short argument using Poisson processes. Embed the failure times into appropriate Poisson processes. Then, T1 ≥ T means that no events in the first process occur during the time interval [0, T]. Under this condition, T1 − T is the time of the first event of the same process restarted at T, but this restarted process is not influenced by what happened before T, so the condition (which in addition does not influence T2) drops out.
(b) Answer the same question when T is Exponential(μ) (and, of course, independent of the machines). Now, by the same logic,

P(T1 < T2 + T) = P(T1 < T) + P(T1 ≥ T, T1 < T2 + T)
= λ1/(λ1 + μ) + (μ/(λ1 + μ)) · (λ1/(λ1 + λ2)).
Example 18.8. Impatient hitchhikers. Two people, Alice and Bob, are hitchhiking. Cars that would pick up a hitchhiker arrive as a Poisson process with rate λC. Alice is first in line for a ride. Moreover, after an Exponential(λA) time, Alice quits, and after an Exponential(λB) time, Bob quits. Compute the probability that Alice is picked up before she quits and compute the same for Bob.

Embed each quitting time into an appropriate Poisson process, call these the A and B processes, and call the car arrivals the C process. Clearly, Alice gets picked if the first event in the combined A and C process is a C event:

P(Alice gets picked) = λC/(λA + λC).
Moreover,

P(Bob gets picked)
= P({at least 2 C events before a B event}
  ∪ {at least one A event before either a B or a C event,
     and then at least one C event before a B event})
= P(at least 2 C events before a B event)
  + P(at least one A event before either a B or a C event,
      and then at least one C event before a B event)
  − P(at least one A event before either a B or a C event,
      and then at least two C events before a B event)
= (λC/(λB + λC))² + (λA/(λA + λB + λC)) (λC/(λB + λC)) − (λA/(λA + λB + λC)) (λC/(λB + λC))²
= (λA + λC) λC / ((λA + λB + λC)(λB + λC)).
This leaves us with an excellent hint that there may be a shorter way and, indeed, there is: Bob gets picked exactly when Alice leaves the front of the line before Bob quits (that is, an A or C event occurs before a B event) and then, by the memoryless property, a car stops before Bob quits (a C event before a B event). Therefore,

P(Bob gets picked) = ((λA + λC)/(λA + λB + λC)) · (λC/(λB + λC)).
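One can confirm numerically that the long inclusion-exclusion expression collapses to the short product, and check the product by simulation; this is only a sketch, and the rates are arbitrary choices.

```python
import random

# arbitrary rates for the check
lA, lB, lC = 1.3, 0.7, 2.1
s = lA + lB + lC
q = lC / (lB + lC)

# long form from the inclusion-exclusion computation above
long_form = q**2 + (lA / s) * q - (lA / s) * q**2
# short form: an A-or-C event precedes the first B event, and then
# a C event precedes the next B event
short_form = ((lA + lC) / s) * q

# Monte Carlo: classify the successive events of the combined process
rng = random.Random(11)
reps, hits = 200000, 0
for _ in range(reps):
    if rng.random() * s < lB:
        continue            # a B event came first: Bob quit while waiting
    # Alice is now gone (picked up or quit), so A events no longer matter;
    # Bob is picked iff the next B-or-C event is a C event
    if rng.random() * (lB + lC) < lC:
        hits += 1
mc = hits / reps
```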
18
POISSON PROCESS
206
Problems
1. An office has two clerks. Three people, A, B, and C, enter simultaneously. A and B begin service with the two clerks, while C waits for the first available clerk. Assume that each service time is Exponential(λ). (a) Compute the probability that A is the last to finish the service. (b) Compute the expected time before C is finished (i.e., C's combined waiting and service time).
2. A car wash has two stations, 1 and 2, with Exponential(λ1) and Exponential(λ2) service times. A car enters at station 1. Upon completing the service at station 1, the car proceeds to station 2, provided station 2 is free; otherwise, the car has to wait at station 1, blocking the entrance of other cars. The car exits the wash after the service at station 2 is completed. When you arrive at the wash there is a single car at station 1. Compute the expected time before you exit.
3. A system has two server stations, 1 and 2, with Exponential(λ1) and Exponential(λ2) service times. Whenever a new customer arrives, any customer in the system immediately departs. Customer arrivals are a rate λ Poisson process, and a new arrival enters the system at station 1, then goes to station 2. (a) What proportion of customers complete their service? (b) What proportion of customers stay in the system for more than 1 time unit, but do not complete the service?
4. A machine needs frequent maintenance to stay on. The maintenance times occur as a Poisson process with rate λ. Once the machine receives no maintenance for a time interval of length h, it breaks down. It then needs to be repaired, which takes an Exponential(μ) time, after which it goes back on. (a) After the machine is started, find the probability that the machine will break down before receiving its first maintenance. (b) Find the expected time for the first breakdown. (c) Find the proportion of time the machine is on.
5. Assume that certain events (say, power surges) occur as a Poisson process with rate 3 per hour.
These events cause damage to a certain system (say, a computer), thus, a special protecting unit
has been designed. That unit now has to be removed from the system for 10 minutes for service.
(a) Assume that a single event occurring in the service period will cause the system to crash.
What is the probability that the system will crash?
(b) Assume that the system will survive a single event, but two events occurring in the service
period will cause it to crash. What is, now, the probability that the system will crash?
(c) Assume that a crash will not happen unless there are two events within 5 minutes of each
other. Compute the probability that the system will crash.
(d) Solve (b) by assuming that the protective unit will be out of the system for a time which is
exponentially distributed with expectation 10 minutes.
Solutions to problems
1. (a) This is the probability that two events happen in a rate λ Poisson process before a single event in an independent rate λ process, that is, (1/2)² = 1/4. (b) First, C has to wait for the first event in two combined Poisson processes, which is a single process with rate 2λ, and then for the service time; the answer is

1/(2λ) + 1/λ = 3/(2λ).
2. Your total time is (the time the other car spends at station 1) + (the time you spend at station 2) + (maximum of the time the other car spends at station 2 and the time you spend at station 1). If T1 and T2 are Exponential(λ1) and Exponential(λ2), then you need to compute

E(T1) + E(T2) + E(max{T1, T2}).

Now use

max{T1, T2} = T1 + T2 − min{T1, T2}

and the fact that min{T1, T2} is Exponential(λ1 + λ2), to get

2/λ1 + 2/λ2 − 1/(λ1 + λ2).
3. (a) A customer needs to complete the service at both stations before a new one arrives, thus the answer is

(λ1/(λ1 + λ)) · (λ2/(λ2 + λ)).
(b) Let T1 and T2 be the customer's times at stations 1 and 2. The event will happen if either: T1 > 1, no newcomers during time [0, 1], and a newcomer during time [1, T1]; or T1 < 1, T1 + T2 > 1, no newcomers by time 1, and a newcomer during time [1, T1 + T2].

For the first case, nothing will happen by time 1, which has probability e^{−(λ+λ1)}. Then, after time 1, a newcomer has to appear before the end of the service time at station 1, which has probability λ/(λ1 + λ).

For the second case, conditioned on T1 = t < 1, the probability is

e^{−λ} e^{−λ2(1−t)} · λ/(λ2 + λ).

Integrating this against the density of T1 gives

(λ/(λ2 + λ)) ∫_0^1 e^{−λ} e^{−λ2(1−t)} λ1 e^{−λ1 t} dt = (λλ1/((λ2 + λ)(λ2 − λ1))) (e^{−(λ+λ1)} − e^{−(λ+λ2)}).
The answer is therefore

(λ/(λ1 + λ)) e^{−(λ+λ1)} + (λλ1/((λ2 + λ)(λ2 − λ1))) (e^{−(λ+λ1)} − e^{−(λ+λ2)}).
4. (a) The answer is e^{−λh}. (b) Let W be the waiting time for the first maintenance such that the next maintenance is at least time h in the future, and let T1 be the time of the first maintenance. Then, provided t < h,

E(W | T1 = t) = t + EW,

as the process is restarted at time t. Therefore,

EW = ∫_0^h (t + EW) λe^{−λt} dt = ∫_0^h t λe^{−λt} dt + EW ∫_0^h λe^{−λt} dt,

which gives

EW = (1 − λh e^{−λh} − e^{−λh}) / (λ e^{−λh}).

The answer to (b) is EW + h (the machine waits for h more time units before it breaks down). The answer to (c) is

(EW + h) / (EW + h + 1/μ).
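A simulation sketch of part (b); λ = h = 1 and the seed are arbitrary choices. Run the maintenance process until a gap exceeds h and record the breakdown time.

```python
import math
import random

def breakdown_time(lam, h, rng):
    """Run the rate-lam maintenance process until an interarrival gap
    exceeds h; the machine then breaks down h units after the last
    maintenance, so return (time of last maintenance) + h."""
    t = 0.0
    while True:
        gap = rng.expovariate(lam)
        if gap > h:
            return t + h
        t += gap

rng = random.Random(4)
lam, h, reps = 1.0, 1.0, 50000
mc = sum(breakdown_time(lam, h, rng) for _ in range(reps)) / reps
ew = (1 - lam * h * math.exp(-lam * h) - math.exp(-lam * h)) / (lam * math.exp(-lam * h))
exact = ew + h   # expected time of the first breakdown; equals e - 1 here
```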
5. Take the time unit to be one hour, so that the rate is 3 and the 10-minute service period has length 1/6. The answer to (a) is

P(N(1/6) ≥ 1) = 1 − e^{−1/2},

and to (b)

P(N(1/6) ≥ 2) = 1 − (3/2) e^{−1/2}.
For (c), if there are 0 or 1 events in the 10 minutes, there will be no crash, while 3 or more events in the 10 minutes will certainly cause a crash. The final possibility is exactly two events, in which case the crash will happen with probability

P(|U1 − U2| < 1/2),

where U1 and U2 are independent uniform random variables on [0, 1]. By drawing a picture, this probability can be computed to be 3/4. Therefore, with X = N(1/6),

P(crash) = P(X > 2) + P(crash | X = 2) P(X = 2)
= 1 − e^{−1/2} − (1/2) e^{−1/2} − ((1/2)²/2) e^{−1/2} + (3/4) · ((1/2)²/2) e^{−1/2}
= 1 − (49/32) e^{−1/2}.
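Both the 3/4 and the final answer can be checked in a few lines (a sketch; the seed is arbitrary).

```python
import math
import random

rng = random.Random(2)
reps = 200000
close = sum(abs(rng.random() - rng.random()) < 0.5 for _ in range(reps))
p_close = close / reps     # should be near P(|U1 - U2| < 1/2) = 3/4

# crash probability with X ~ Poisson(1/2)
p2 = (0.5**2 / 2) * math.exp(-0.5)                      # P(X = 2)
p_gt2 = 1 - math.exp(-0.5) * (1 + 0.5 + 0.5**2 / 2)    # P(X > 2)
crash = p_gt2 + 0.75 * p2   # equals 1 - (49/32) e^{-1/2}
```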
Finally, for (d), we need to calculate the probability that two events in a rate 3 Poisson process
occur before an event occurs in a rate 6 Poisson process. This probability is

(3/(3 + 6))² = 1/9.
(a) Compute the proportion of time the walker spends at 0, after she makes many steps. Does this proportion depend on the walker's starting vertex?
(b) Compute the proportion of time the walker is at an odd state (1, 3, or 5) while, previously, she was at an even state (0, 2, or 4).
(c) Now assume that the walker starts at 0. What is the expected number of steps she will take before she is back at 0?
5. In a branching process, an individual has two descendants with probability 3/4 and no descendants with probability 1/4. The process starts with a single individual in generation 0.
(a) Compute the expected number of individuals in generation 2.
(b) Compute the probability that the process ever becomes extinct.
6. Customers arrive at two service stations, labeled 1 and 2, as a Poisson process with rate λ. Assume that the time unit is one hour. Whenever a new customer arrives, any previous customer is immediately ejected from the system. A new arrival enters the service at station 1, then goes to station 2.
(a) Assume that the service time at each station is exactly 2 hours. What proportion of entering customers will complete the service (before they are ejected)?
(b) Assume that the service time at each station now is exponential with expectation 2 hours. What proportion of entering customers will now complete the service?
(c) Keep the service time assumption from (b). A customer arrives, but he is now given special treatment: he will not be ejected unless at least three new customers arrive during his service. Compute the probability that this special customer is allowed to complete his service.
EN = ∫_0^1 1/(1 − x/2) dx = −2 log(1 − x/2) |_0^1 = 2 log 2.
(b) Compute the expected sum S of the numbers your friend selects.

Solution:

Given X = x and N = n, your friend selects n − 1 numbers uniformly in [0, x/2] and one number uniformly in [x/2, 1]. Therefore,

E[S | X = x, N = n] = (n − 1) x/4 + (1/2)(1 + x/2),

so that

E[S | X = x] = (1/(1 − x/2) − 1) x/4 + (1/2)(1 + x/2) = 1/(2 − x),

and

ES = ∫_0^1 1/(2 − x) dx = log 2.
2. You are a member of a sports team and your coach has instituted the following policy. You begin with zero warnings. After every game the coach evaluates whether you've had a discipline problem during the game; if so, he gives you a warning. After you receive two warnings (not necessarily in consecutive games), you are suspended for the next game and your warnings count goes back to zero. After the suspension, the rules are the same as at the beginning. You figure you will receive a warning after each game you play independently with probability p ∈ (0, 1).
(a) Let the state of the Markov chain be your warning count after a game. Write down
the transition matrix and determine whether this chain is irreducible and aperiodic.
Compute its invariant distribution.
Solution:
The transition matrix, with the state being your warning count after a game, is

    P = [ 1 − p    p     0
            0    1 − p   p
            1      0     0 ].

The chain is irreducible, and it is aperiodic because P00 = 1 − p > 0. Its invariant distribution is

π = [1/(p + 2), 1/(p + 2), p/(p + 2)].
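A direct check that π is invariant, for a sample value of p (p = 0.3 is an arbitrary choice):

```python
# transition matrix for warning counts 0, 1, 2 and a sample p
p = 0.3
P = [[1 - p, p, 0.0],
     [0.0, 1 - p, p],
     [1.0, 0.0, 0.0]]
pi = [1 / (p + 2), 1 / (p + 2), p / (p + 2)]

# pi is invariant iff pi P = pi
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```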
(b) Write down an expression for the probability that you are suspended for both games
10 and 15. Do not evaluate.
Solution:
You must have 2 warnings after game 9 and then again 2 warnings after game 14; as the suspension in game 10 resets the count to 0 with probability 1, the answer is

(P^9)_{02} · (P^4)_{02},

where (P^n)_{02} is the (0, 2) entry of the nth power of the transition matrix.
(c) Let sn be the probability that you are suspended in the nth game. Compute lim_{n→∞} sn.

Solution:

As the chain is irreducible and aperiodic,

lim_{n→∞} sn = lim_{n→∞} P(Xn−1 = 2) = p/(p + 2).
3. A random walker on the nonnegative integers starts at 0 and then at each step adds either
2 or 3 to her position, each with probability 1/2.
(a) Compute the probability that the walker is at 2n + 3 after making n steps.
Solution:
The walker has to make n − 3 2-steps and 3 3-steps, so the answer is

C(n, 3) (1/2)^n.
(b) Let pn be the probability that the walker ever hits n. Compute lim_{n→∞} pn.

Solution:

The step distribution is aperiodic, as the greatest common divisor of 2 and 3 is 1, so the limit is the reciprocal of the expected step size:

lim_{n→∞} pn = 1/((1/2) · 2 + (1/2) · 3) = 2/5.
4. A random walker is at one of six vertices, labeled 0, 1, 2, 3, 4, and 5, of the graph in the
picture. At each time, she moves to a randomly chosen vertex connected to her current
position by an edge. (All choices are equally likely and she never stays at the same position
for two successive steps.)
(a) Compute the proportion of time the walker spends at 0, after she makes many steps.
Does this proportion depend on the walker's starting vertex?
Solution:
Independently of the starting vertex, the proportion is π0, where π = [π0, π1, π2, π3, π4, π5] is the unique invariant distribution. (Unique because of irreducibility.) This chain is reversible, with invariant distribution proportional to the vertex degrees:

π = (1/14) [4, 2, 2, 2, 3, 1].

Therefore, the answer is π0 = 4/14 = 2/7.
(b) Compute the proportion of times the walker is at an odd state (1, 3, or 5) while, previously, she was at an even state (0, 2, or 4).
Solution:
The answer is
π0 (p03 + p05) + π2 p21 + π4 (p43 + p41) = (2/7)(2/4) + (1/7)(1/2) + (3/14)(2/3) = 5/14.
(c) Now assume that the walker starts at 0. What is the expected number of steps she will take before she is back at 0?
Solution:
The answer is

1/π0 = 14/4 = 7/2.

5. (a) Compute the expected number of individuals in generation 2.

Solution:

The expected number of descendants of a single individual is

2 · (3/4) + 0 · (1/4) = 3/2,

so the expected number of individuals in generation 2 is

(3/2)² = 9/4.
(b) Compute the probability that the process ever goes extinct.

Solution:

As the generating function of the number of descendants is

φ(s) = 1/4 + (3/4) s²,

the extinction probability is the smallest solution in [0, 1] of s = φ(s), that is, of (3/4)s² − s + 1/4 = 0, which gives 1/3.
6. Customers arrive at two service stations, labeled 1 and 2, as a Poisson process with rate λ. Assume that the time unit is one hour. Whenever a new customer arrives, any previous customer is immediately ejected from the system. A new arrival enters the service at station 1, then goes to station 2.
(a) Assume that the service time at each station is exactly 2 hours. What proportion of
entering customers will complete the service (before they are ejected)?
Solution:
The answer is
P(customer served) = P(no arrival in 4 hours) = e^{−4λ}.
(b) Assume that the service time at each station now is exponential with expectation 2
hours. What proportion of entering customers will now complete the service?
Solution:
Now,

P(customer served) = P(two arrivals in a rate 1/2 Poisson process before one arrival in a rate λ Poisson process)
= ((1/2)/(λ + 1/2))² = 1/(1 + 2λ)².
(c) Keep the service time assumption from (b). A customer arrives, but he is now given special treatment: he will not be ejected unless at least three new customers arrive during his service. Compute the probability that this special customer is allowed to complete his service.
Solution:

The special customer is served exactly when 2 or more arrivals in the rate 1/2 Poisson process occur before three arrivals in the rate λ Poisson process. Equivalently, among the first 4 arrivals in the rate λ + 1/2 combined Poisson process, 2 or more belong to the rate 1/2 process. The answer is

1 − (λ/(λ + 1/2))^4 − 4 (λ/(λ + 1/2))^3 · ((1/2)/(λ + 1/2)) = 1 − (λ^4 + 2λ^3)/(λ + 1/2)^4.