OERprobability 2020
Richard F. Bass
Patricia Alonso Ruiz, Fabrice Baudoin, Maria Gordina,
Phanuel Mariano, Oleksii Mostovyi, Ambar Sengupta,
Alexander Teplyaev, Emiliano Valdez
© Copyright 2013 Richard F. Bass
© Copyright 2017 University of Connecticut, Department of Mathematics
© Copyright 2018 University of Connecticut, Department of Mathematics
© Copyright 2020 University of Connecticut, Department of Mathematics
Acknowledgments. The authors are very grateful to Joe Chen, Tom Laetsch, Tom Roby, and Linda Westrick for their important contributions to this project, and to all the University of Connecticut students in Math 3160 for their tireless work in the course.
We gratefully acknowledge the generous support from the University of Connecticut Open
and Affordable Initiative, the Davis Educational Foundation, the UConn Co-op Bookstore,
and the University of Connecticut Libraries.
Preface
This textbook has been created as a part of the University of Connecticut Open and Afford-
able Initiative, which in turn was a response to the Connecticut State Legislature Special Act
No. 15-18 (House Bill 6117), An Act Concerning the Use of Digital Open-Source Textbooks
in Higher Education. At the University of Connecticut this initiative was supported by the
UConn Bookstore and the University of Connecticut Libraries. Generous external support
was provided by the Davis Educational Foundation.
Even before this initiative, our department had a number of freely available and internal
resources for Math 3160, our basic undergraduate probability course. This included lecture
notes prepared by Richard Bass, the Board of Trustees Distinguished Professor of Mathe-
matics. Therefore, it was natural to extend the lecture notes into a complete textbook for
the course. Two aspects of the course were taken into account. On the one hand, the course
is taken by many students who are interested in financial and actuarial careers. On the other
hand, this course has multivariable calculus as a prerequisite, which is not common for most
of the undergraduate probability courses taught at other US universities. The 2018 edition
of the textbook has 4 parts divided into 15 chapters. The first 3 parts consist of required
material for Math 3160, and the 4th part contains optional material for this course.
This textbook has been used in the classroom for three semesters at UConn and has received
overwhelmingly positive feedback from students. However, we are still working on improving
the text, and will be grateful for comments and suggestions.
May 2018.
Contents

Acknowledgments
Preface
CHAPTER 1

Combinatorics
1.1.1. Basic counting principle. The first basic counting principle is to multiply.
Namely, if there are n possible outcomes of doing something and m outcomes of doing
another thing, then there are m · n possible outcomes of performing both actions.
Basic counting principle
Suppose that two experiments are to be performed. Then if experiment 1 can result
in any one of m possible outcomes, and if for each outcome of experiment 1 there are
n possible outcomes of experiment 2, then there are m · n possible outcomes of the two
experiments together.
Example 1.1. Suppose we have 4 shirts of 4 different colors and 3 pants of different colors.
How many different outfits are there? For each shirt there are 3 different colors of pants, so
altogether there are 4 × 3 = 12 possibilities.
Example 1.2. How many different license plate numbers with 3 letters followed by 3
numbers are possible?
The English alphabet has 26 different letters, therefore there are 26 possibilities for the first
place, 26 for the second, 26 for the third, 10 for the fourth, 10 for the fifth, and 10 for the
sixth. We multiply to get $26^3 \cdot 10^3$.
1.1.2. Permutations. How many ways can one arrange letters a, b, c? We can list all
possibilities, namely,
abc acb bac bca cab cba.
There are 3 possibilities for the first position. Once we have chosen the letter in the first
position, there are 2 possibilities for the second position, and once we have chosen the first
two letters, there is only 1 choice left for the third. So there are 3 × 2 × 1 = 6 = 3!
arrangements. In general, if there are n distinct letters, there are n! different arrangements
of these letters.
Example 1.3. What is the number of possible batting orders (in baseball) with 9 players?
Applying the formula for the number of permutations we get 9! = 362880.
Example 1.4. How many ways can one arrange 4 math books, 3 chemistry books, 2 physics
books, and 1 biology book on a bookshelf so that all the math books are together, all the
chemistry books are together, and all the physics books are together?
We can arrange the math books in 4! ways, the chemistry books in 3! ways, the physics
books in 2! ways, and the biology book in 1! = 1 way. But we also have to decide which
set of books go on the left, which next, and so on. That is the same as the number of ways
of arranging of four objects (such as the letters M, C, P, B), and there are 4! ways of doing
that. We multiply to get the answer 4! · (4! · 3! · 2! · 1!) = 6912.
Example 1.5. How many ways can one arrange the letters a, a, b, c? Let us label them
first as A, a, b, c. There are 4! = 24 ways to arrange these letters. But we have repeats: we
could have Aa or aA which are the same. So we have a repeat for each possibility, and so
the answer should be 4!/2! = 12.
If there were 3 a's, 4 b's, and 2 c's, we would have
$$\frac{9!}{3!\,4!\,2!} = 1260.$$
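If you want to double-check this count numerically, here is a short brute-force sketch (assuming a Python interpreter is available):

    from itertools import permutations
    from math import factorial

    letters = "aaabbbbcc"   # 3 a's, 4 b's, 2 c's
    distinct = len(set(permutations(letters)))   # distinct orderings, counted by brute force
    formula = factorial(9) // (factorial(3) * factorial(4) * factorial(2))
    print(distinct, formula)   # both print 1260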
What we just did is called finding the number of permutations. These are permutations
of a given set of objects (elements) unlike the example with the licence plate numbers where
we could choose the same letter as many times as we wished.
Permutations
The number of permutations of n objects is equal to
n! := 1 · ... · n,
with the usual convention 0! = 1.
Example 1.6. How many ways can we choose 3 letters out of 5? If the letters are a, b, c, d, e
and order matters, then there would be 5 choices for the first position, 4 for the second, and
3 for the third, for a total of 5 × 4 × 3. Suppose now the letters selected were a, b, c. If order
does not matter, in our counting we will have the letters a, b, c six times, because there are
3! ways of arranging three letters. The same is true for any choice of three letters. So we
should have 5 × 4 × 3/3!. We can rewrite this as
$$\frac{5\cdot 4\cdot 3}{3!} = \frac{5!}{3!\,2!} = 10.$$
This is often written as $\binom{5}{3}$, read "5 choose 3". Sometimes this is written $C_{5,3}$ or ${}_5C_3$.
The number of different groups of k objects chosen from a total of n objects is equal to
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$
Note that this is true when the order of selection is irrelevant; if the order of selection is relevant, then there are
$$n\cdot(n-1)\cdots(n-k+1) = \frac{n!}{(n-k)!}$$
such ordered selections.
Example 1.7. How many ways can one choose a committee of 3 out of 10 people? Applying the formula for the number of combinations we get $\binom{10}{3} = 120$.
Example 1.8. Suppose there are 8 men and 8 women. How many ways can we choose a committee that has 2 men and 2 women? We can choose 2 men in $\binom{8}{2}$ ways and 2 women in $\binom{8}{2}$ ways. The number of possible committees is then the product
$$\binom{8}{2}\cdot\binom{8}{2} = 28\cdot 28 = 784.$$
Example 1.9. Suppose one has 9 people and one wants to divide them into one committee of 3, one committee of 4, and a last one of 2. There are $\binom{9}{3}$ ways of choosing the first committee. Once that is done, there are 6 people left and there are $\binom{6}{4}$ ways of choosing the second committee. Once that is done, the remainder must go in the third committee. So the answer is
$$\frac{9!}{3!\,6!}\cdot\frac{6!}{4!\,2!} = \frac{9!}{3!\,4!\,2!}.$$
In general, $\binom{n}{k} = \binom{n}{n-k}$. Indeed, the left-hand side gives the number of different groups of k objects chosen from a total of n objects, which is the same as choosing the n − k objects not to be in the group, and that is the number given by the right-hand side.
The number of ways to divide n objects into one group of $n_1$ objects, one group of $n_2$ objects, . . ., and an rth group of $n_r$ objects, where $n = n_1 + \cdots + n_r$, is equal to
$$\binom{n}{n_1, \ldots, n_r} = \frac{n!}{n_1!\,n_2!\cdots n_r!}.$$
Finally, for (e) we can choose a committee of 3 out of 10 in $\binom{10}{3}$ ways, so the answer is
$$\binom{10}{3} - \binom{4}{3} - \binom{6}{3}.$$
Example 1.12. First, suppose one has 8 copies of o and two copies of |. How many ways
can one arrange these symbols in order? There are 10 spots, and we want to select 8 of them in which we place the o's. So we have $\binom{10}{8}$.
Example 1.13. Next, suppose one has 8 indistinguishable balls. How many ways can one
put them in 3 boxes? Let us use sequences of os and | s to represent an arrangement of balls
in these 3 boxes; any such sequence that has | at each side, 2 other | s, and 8 os represents a
way of arranging balls into boxes. For example, if one has
| o o | o o o | o o o |,
this would represent 2 balls in the first box, 3 in the second, and 3 in the third. Altogether there are 8 + 4 symbols; the first is a | as is the last, so there are 10 symbols that can be either | or o. Also, 8 of them must be o. How many ways out of 10 spaces can one pick 8 of them into which to put an o? We just did that, so the answer is $\binom{10}{8}$.
Example 1.14. Now, to finish, suppose we have $8,000 to invest in 3 mutual funds. Each mutual fund requires you to make investments in increments of $1,000. How many ways can we do this? This is the same as putting 8 indistinguishable balls in 3 boxes, and we know the answer is $\binom{10}{8}$.
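As a quick numerical check of this count (a sketch, assuming Python is available), one can enumerate all ways to split the 8 thousand-dollar increments among 3 funds and compare with $\binom{10}{8}$:

    from math import comb

    # count triples (a, b, c) of nonnegative integers with a + b + c = 8
    brute = sum(1 for a in range(9) for b in range(9 - a))
    print(brute, comb(10, 8))   # both print 45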
1.2.1. Generalized counting principle. Here we expand on the basic counting prin-
ciple formulated in Section 1.1.1. One can visualize this principle by using the box method
below. Suppose we have two experiments to be performed, namely, one experiment can result
in n outcomes, and the second experiment can result in m outcomes. Each box represents
the number of possible outcomes in that experiment.
Example 1.15. There are 20 teachers and 100 students in a school. How many ways can
we pick a teacher and student of the year? Using the box method we get 20 × 100 = 2000.
Example 1.17 (Example 1.2 revisited). Recall that for 6-place license plates, with the first
three places occupied by letters and the last three by numbers, we have 26 · 26 · 26 · 10 · 10 · 10
choices. What if no repetition is allowed? We can use the counting principle or the box
method to get 26 · 25 · 24 · 10 · 9 · 8.
Example 1.18. How many functions defined on k points are possible if each function can only take the values 0 or 1? The counting principle or the box method on the points 1, . . . , k gives us $2^k$ possible functions. This is the generalized counting principle with $n_1 = n_2 = \cdots = n_k = 2$.
1.2.2. Permutations. Now we give more examples of permutations, and we start with a general result on the number of possible permutations; see also Combinations (multinomial coefficients) on page 6.
Permutations revisited
The number of different permutations of n objects of which n1 are alike, n2 are alike,
..., nr are alike is equal to
$$\frac{n!}{n_1! \cdots n_r!}.$$
Example 1.19. How many ways can one arrange 5 math books, 6 chemistry books, 7
physics books, and 8 biology books on a bookshelf so that all the math books are together,
all the chemistry books are together, and all the physics books are together?
We can arrange the math books in 5! ways, the chemistry in 6! ways, the physics in 7! ways,
and biology books in 8! ways. We also have to decide which set of books go on the left, which
next, and so on. That is the same as the number of ways of arranging the letters M,C,P,
and B, and there are 4! ways of doing that. So the total is 4! · (5! · 6! · 7! · 8!) ways.
Example 1.20. How many ways can one arrange the letters a, a, b, b, c, c?
Let us first re-label the letters as A, a, B, b, C, c. Then there are 6! = 720 ways to arrange these letters. But we have repeats (for example, Aa and aA), which produce the same arrangement of the original letters. Dividing by the number of repeats for each of the pairs A, a, B, b, and C, c, the answer is
$$\frac{6!}{(2!)^3} = 90.$$
Example 1.21. How many different letter arrangements can be formed from the word
PEPPER?
There are three copies of P and two copies of E, and one of R. So the answer is
$$\frac{6!}{3!\,2!\,1!} = 60.$$
Example 1.22. Suppose there are 4 Czech tennis players, 4 U.S. players, and 3 Russian
players, in how many ways could they be arranged if we do not distinguish players from the same country? By the formula above we get $\frac{11!}{4!\,4!\,3!}$.
Example 1.23. Suppose there are 9 men and 8 women. How many ways can we choose a
committee that has 2 men and 3 women?
We can choose 2 men in $\binom{9}{2}$ ways and 3 women in $\binom{8}{3}$ ways. The number of committees is then the product
$$\binom{9}{2}\cdot\binom{8}{3}.$$
Example 1.24. Suppose somebody has n friends, of whom k are to be invited to a meeting.
(1) How many choices are there for such a meeting if two of the friends will not attend together?
(2) How many choices are there if two of the friends will only attend together?
We use a similar reasoning for both questions.
(1) We can divide all possible groups into two (disjoint) parts: one consists of the groups that contain neither of these two friends, and the other of the groups that contain exactly one of them. There are $\binom{n-2}{k}$ groups in the first part and $\binom{n-2}{k-1}$ in the second. For the latter we also need to account for the choice of one out of the two incompatible friends. So altogether we have
$$\binom{n-2}{k} + \binom{2}{1}\cdot\binom{n-2}{k-1}.$$
(2) Again, we split all possible groups into two parts: one for the groups which contain neither of the two inseparable friends, and the other for the groups which contain both of them. Then the answer is
$$\binom{n-2}{k} + 1\cdot 1\cdot\binom{n-2}{k-2}.$$
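Both counts can be verified by brute force for small values; here is a sketch (assuming Python) with n = 5 friends, k = 3 invitees, and friends 0 and 1 as the special pair:

    from itertools import combinations
    from math import comb

    n, k = 5, 3
    groups = list(combinations(range(n), k))
    not_together = sum(1 for g in groups if not (0 in g and 1 in g))   # part (1)
    only_together = sum(1 for g in groups if (0 in g) == (1 in g))     # part (2)
    print(not_together, comb(n - 2, k) + 2 * comb(n - 2, k - 1))   # 7 7
    print(only_together, comb(n - 2, k) + comb(n - 2, k - 2))      # 4 4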
Second proof: we will use (mathematical) induction on n. For n = 1 the left-hand side is x + y, and the right-hand side is
$$\sum_{k=0}^{1}\binom{1}{k}x^k y^{1-k} = \binom{1}{0}x^0y^1 + \binom{1}{1}x^1y^0 = y + x = x + y,$$
so the statement holds for n = 1. Suppose now that the statement holds for n = N; we would like to show it for n = N + 1. We have
$$(x+y)^{N+1} = (x+y)(x+y)^{N} = (x+y)\sum_{k=0}^{N}\binom{N}{k}x^k y^{N-k}$$
$$= x\sum_{k=0}^{N}\binom{N}{k}x^k y^{N-k} + y\sum_{k=0}^{N}\binom{N}{k}x^k y^{N-k}$$
$$= \sum_{k=0}^{N}\binom{N}{k}x^{k+1} y^{N-k} + \sum_{k=0}^{N}\binom{N}{k}x^k y^{N-k+1}$$
$$= \sum_{k=1}^{N+1}\binom{N}{k-1}x^{k} y^{N-k+1} + \sum_{k=0}^{N}\binom{N}{k}x^k y^{N-k+1},$$
where we replaced k by k − 1 in the first sum. Then we see that
$$(x+y)^{N+1} = \binom{N}{N}x^{N+1}y^{0} + \sum_{k=1}^{N}\left[\binom{N}{k-1}+\binom{N}{k}\right]x^k y^{N-k+1} + \binom{N}{0}x^{0}y^{N+1}$$
$$= x^{N+1} + \sum_{k=1}^{N}\left[\binom{N}{k-1}+\binom{N}{k}\right]x^k y^{N-k+1} + y^{N+1} = \sum_{k=0}^{N+1}\binom{N+1}{k}x^k y^{N+1-k}.$$
Here we used Example 1.26.
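The binomial theorem can also be checked numerically for particular values of n, x and y; the following sketch (assuming Python) compares both sides for a few sample inputs:

    from math import comb

    for n, x, y in [(1, 2.0, 3.0), (5, -1.5, 0.5), (8, 0.3, 0.7)]:
        lhs = (x + y) ** n
        rhs = sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))
        assert abs(lhs - rhs) < 1e-9
    print("both sides agree on the sample values")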
Pascal's triangle: row n (for n = 0, 1, 2, . . .) lists the binomial coefficients $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{n}$. Each entry is the sum of the two entries directly above it, that is,
$$\binom{n+1}{k} = \binom{n}{k-1} + \binom{n}{k},$$
which can be proven either using the same argument or a formula for binomial coefficients.
Example 1.27. Expand $(x + y)^3$. This can be done by applying Theorem 1.1 (the binomial theorem) to get $(x + y)^3 = y^3 + 3xy^2 + 3x^2y + x^3$.
Example 1.29. We have 10 flags: 5 of them are blue, 3 are red, and 2 are yellow. The flags are indistinguishable except for their color. Then there are $\frac{10!}{5!\,3!\,2!}$ different ways to order them on a flag pole.
Example 1.30 (Example 1.13 revisited). Suppose one has n indistinguishable balls. How many ways can one put them in k boxes, assuming n > k?
As in Example 1.13 we use sequences of o's and |'s to represent an arrangement of balls in boxes: any such sequence has a | at each end, k − 1 other copies of |, and n copies of o. How many different ways can we arrange these symbols, if we have to start with | and end with |? Between the two end bars we are arranging n + k − 1 symbols, of which exactly n are o's. So the question can be re-formulated as: how many ways out of n + k − 1 spaces can one pick n of them into which to put an o? This gives $\binom{n+k-1}{n}$. Note that this counts all possible ways, including the ones where some of the boxes are empty.
Suppose now we want to distribute the n balls into the k boxes so that none of the boxes is empty. Line up the n balls, represented by o's; instead of putting them into boxes we place |'s in the spaces between them. Note that we should have a | on each end, as every ball has to go into a box, so we are left with k − 1 copies of | to be placed among the n balls. This means that we have n − 1 spaces between the balls, and we need to pick k − 1 of them in which to place a |. So we can reformulate the problem as choosing k − 1 places out of n − 1, and the answer is $\binom{n-1}{k-1}$.
We can check that for n = 3 and k = 2 we indeed have 4 ways of distributing three balls into two boxes, and only 2 ways if every box has to contain at least one ball.
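The two counts from this example can be checked by listing the distributions explicitly; a small sketch (assuming Python) for n = 3 and k = 2:

    from math import comb

    n, k = 3, 2
    all_ways = [(a, n - a) for a in range(n + 1)]            # balls in box 1, balls in box 2
    nonempty = [w for w in all_ways if all(c >= 1 for c in w)]
    print(len(all_ways), comb(n + k - 1, n))    # 4 4
    print(len(nonempty), comb(n - 1, k - 1))    # 2 2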
1.3. Exercises
Exercise 1.1. Suppose a license plate must consist of 7 numbers or letters. How many
license plates are there if
Exercise 1.2. A school of 50 students has awards for the top math, English, history and
science student in the school.
(A) How many ways can these awards be given if each student can only win one award?
(B) How many ways can these awards be given if students can win multiple awards?
Exercise 1.4. There is a school class of 25 people made up of 11 guys and 14 girls.
Exercise 1.5. If a student council contains 10 people, how many ways are there to elect a
president, a vice president, and a 3 person prom committee from the group of 10 students?
Exercise 1.6. Suppose you are organizing your textbooks on a book shelf. You have three
chemistry books, 5 math books, 2 history books and 3 English books.
(A) How many ways can you order the textbooks if you must have math books first, English
books second, chemistry third, and history fourth?
(B) How many ways can you order the books if each subject must be ordered together?
Exercise 1.7. If you buy a Powerball lottery ticket, you can choose 5 numbers between
1 and 59 (picked on white balls) and one number between 1 and 35 (picked on a red ball).
How many ways can you
Exercise 1.8. A couple wants to invite their friends to be in their wedding party. The
groom has 8 possible groomsmen and the bride has 11 possible bridesmaids. The wedding
party will consist of 5 groomsmen and 5 bridesmaids.
Exercise 1.9. There are 52 cards in a standard deck of playing cards. The poker hand
consists of five cards. How many poker hands are there?
Exercise 1.10. There are 30 people in a communications class. Each student must inter-
view one another for a class project. How many total interviews will there be?
Exercise 1.11. Suppose a college basketball tournament consists of 64 teams playing head
to head in a knockout style tournament. There are 6 rounds, the round of 64, round of 32,
round of 16, round of 8, the final four teams, and the finals. Suppose you are filling out a
bracket, such as this, which specifies which teams will win each game in each round.
Exercise 1.12. We need to choose a group of 3 women and 3 men out of 5 women and 6
men. In how many ways can we do it if 2 of the men refuse to be chosen together?
Exercise 1.13. Find the coefficient in front of x4 in the expansion of (2x2 + 3y)4 .
Exercise 1.14. In how many ways can you choose 2 or less (maybe none!) toppings for
your ice-cream sundae if 6 different toppings are available? (You can use combinations here,
but you do not have to. Next, try to find a general formula to compute in how many ways
you can choose k or less toppings if n different toppings are available.)
n+k−1
Exercise∗ 1.4. Show that there are
k−1
distinct non-positive integer-valued vectors
(x1 , ..., xk ) satisfying
x1 + ... + xk = n, xi > 0 for all i = 1, ..., k.
Exercise∗ 1.5. Consider a smooth function f of n variables. How many different partial derivatives of order k does f possess?
1.4. Selected solutions
Solution to Exercise 1.1(A): in each of the seven places we can put any of the 26 letters, giving $26^7$ possible letter combinations.
Solution to Exercise 1.1(B): in each of the first three places we can place any of the 10 digits, and in each of the last four places we can put any of the 26 letters, giving a total of $10^3 \cdot 26^4$.
Solution to Exercise 1.1(C): if we cannot repeat a letter or a number on a license plate, then the number of license plates becomes
$$\binom{5}{3}\cdot\binom{54}{2}\cdot\binom{1}{1} + \binom{5}{4}\cdot\binom{54}{1}\cdot\binom{1}{1} + 1.$$
Solution to Exercise 1.8(A):
$$\binom{8}{5}\cdot\binom{11}{5}.$$
Solution to Exercise 1.8(B):
$$\binom{6}{5}\cdot\binom{11}{5} + \binom{2}{1}\cdot\binom{6}{4}\cdot\binom{11}{5}.$$
Solution to Exercise 1.11: First notice that the 64 teams play 63 total games: 32 games
in the first round, 16 in the second round, 8 in the 3rd round, 4 in the regional finals, 2
in the final four, and then the national championship game. That is, 32+16+8+4+2+1=
63. Since there are 63 games to be played, and you have two choices at each stage in your
bracket, there are $2^{63}$ different ways to fill out the bracket. That is,
$$2^{63} = 9{,}223{,}372{,}036{,}854{,}775{,}808.$$
Using the binomial theorem
$$(x+y)^n = \sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k}$$
with x = y = 1 we see that
$$2^n = (1+1)^n = \sum_{k=0}^{n}\binom{n}{k}\cdot 1^k\cdot 1^{n-k} = \sum_{k=0}^{n}\binom{n}{k},$$
and with x = −1, y = 1 that
$$0 = (-1+1)^n = \sum_{k=0}^{n}\binom{n}{k}\cdot(-1)^k\cdot 1^{n-k} = \sum_{k=0}^{n}\binom{n}{k}(-1)^k.$$
Solution to Exercise∗ 1.2: we can prove the statement using mathematical induction on k. For k = 1 we have
$$(x_1)^n = \sum_{n_1=n}\binom{n}{n_1}x_1^{n_1} = x_1^n,$$
and for k = 2
$$(x_1+x_2)^n = \sum_{\substack{(n_1,n_2)\\ n_1+n_2=n}}\binom{n}{n_1,n_2}x_1^{n_1}\cdot x_2^{n_2} = \sum_{n_1=0}^{n}\binom{n}{n_1}x_1^{n_1}\cdot x_2^{n-n_1},$$
which is the binomial formula itself. Now suppose the multinomial formula holds for k = K (induction hypothesis), that is,
$$(x_1+\cdots+x_K)^n = \sum_{\substack{(n_1,\ldots,n_K)\\ n_1+\cdots+n_K=n}}\binom{n}{n_1,\ldots,n_K}\cdot x_1^{n_1}\cdots x_K^{n_K};$$
we want to show that
$$(x_1+\cdots+x_{K+1})^n = \sum_{\substack{(n_1,\ldots,n_{K+1})\\ n_1+\cdots+n_{K+1}=n}}\binom{n}{n_1,\ldots,n_{K+1}}\cdot x_1^{n_1}\cdots x_{K+1}^{n_{K+1}}.$$
Note that by the binomial formula
$$(x_K+x_{K+1})^{n_K} = \sum_{m=0}^{n_K}\binom{n_K}{m}\cdot x_K^{m}\, x_{K+1}^{n_K-m},$$
therefore, applying the induction hypothesis to the K quantities $x_1,\ldots,x_{K-1}, x_K+x_{K+1}$,
$$(x_1+\cdots+x_{K+1})^n = \sum_{\substack{(n_1,\ldots,n_K)\\ n_1+\cdots+n_K=n}}\binom{n}{n_1,\ldots,n_K}\cdot x_1^{n_1}\cdots x_{K-1}^{n_{K-1}}\sum_{m=0}^{n_K}\binom{n_K}{m}x_K^{m}\,x_{K+1}^{n_K-m}.$$
Moreover,
$$\binom{n}{n_1,\ldots,n_K}\binom{n_K}{m} = \binom{n}{n_1,\ldots,n_{K-1},m,n_K-m}, \qquad n_1+\cdots+n_{K-1}+m+(n_K-m) = n.$$
Indeed,
$$\binom{n}{n_1,\ldots,n_K}\binom{n_K}{m} = \frac{n!}{n_1!\,n_2!\cdots n_{K-1}!\,n_K!}\cdot\frac{n_K!}{m!\,(n_K-m)!} = \frac{n!}{n_1!\,n_2!\cdots n_{K-1}!\,m!\,(n_K-m)!} = \binom{n}{n_1,\ldots,n_{K-1},m,n_K-m}.$$
Thus
$$(x_1+\cdots+x_{K+1})^n = \sum_{\substack{(n_1,\ldots,n_K)\\ n_1+\cdots+n_K=n}}\sum_{m=0}^{n_K}\binom{n}{n_1,\ldots,n_{K-1},m,n_K-m}\cdot x_1^{n_1}\cdots x_{K-1}^{n_{K-1}}\,x_K^{m}\,x_{K+1}^{n_K-m}.$$
Relabeling the exponents ($m_i = n_i$ for $i < K$, $m_K = m$, $m_{K+1} = n_K - m$), this is
$$(x_1+\cdots+x_{K+1})^n = \sum_{\substack{(m_1,\ldots,m_{K+1})\\ m_1+\cdots+m_{K+1}=n}}\binom{n}{m_1,\ldots,m_K,m_{K+1}}\cdot x_1^{m_1}\cdots x_{K-1}^{m_{K-1}}\,x_K^{m_K}\,x_{K+1}^{m_{K+1}},$$
which is the multinomial formula for k = K + 1, completing the induction.
CHAPTER 2

The probability set-up

We will have a sample space, denoted by S (sometimes Ω), that consists of all possible outcomes. For example, if we roll two dice, the sample space would be all possible pairs made up of the numbers one through six. An event is a subset of S.
Another example is to toss a coin 2 times, and let
S = {HH, HT, TH, TT};
or to let S be the possible orders in which 5 horses finish in a horse race; or S the possible prices of some stock at closing time today; or S = [0, ∞), the age at which someone dies; or S the points in a circle, the possible places a dart can hit. We should also keep in mind that the same setting can be described using different sample sets. For example, in the two solutions in Example 1.30 we used two different sample sets.
We extend this definition so that $\bigcup_{i=1}^{n} A_i$ is the union of the sets $A_1, \ldots, A_n$, and similarly $\bigcap_{i=1}^{n} A_i$ is their intersection. An exercise is to show that
De Morgan's laws
$$\left(\bigcup_{i=1}^{n} A_i\right)^c = \bigcap_{i=1}^{n} A_i^c \qquad\text{and}\qquad \left(\bigcap_{i=1}^{n} A_i\right)^c = \bigcup_{i=1}^{n} A_i^c.$$
Typically we will take F to be all subsets of S, and so (i)-(iii) are automatically satisfied.
The only times we won’t have F be all subsets is for technical reasons or when we talk about
conditional expectation.
Proposition 2.1
(1) P(∅) = 0.
(2) If $E_1, \ldots, E_n \in \mathcal{F}$ are pairwise disjoint, then $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$.
(3) $P(E^c) = 1 - P(E)$ for any $E \in \mathcal{F}$.
(4) If $E \subset F$, then $P(E) \leq P(F)$, for any $E, F \in \mathcal{F}$.
(5) $P(E \cup F) = P(E) + P(F) - P(E \cap F)$ for any $E, F \in \mathcal{F}$.  (2.1.1)
Proof. To show (1), choose $E_i = \emptyset$ for each i. These are clearly disjoint, so
$$P(\emptyset) = P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i) = \sum_{i=1}^{\infty} P(\emptyset).$$
If $P(\emptyset)$ were strictly positive, then the last term would be infinity, contradicting the fact that probabilities are between 0 and 1. So the probability of ∅ must be zero.
Part (2) follows if we let $E_{n+1} = E_{n+2} = \cdots = \emptyset$. Then the events $\{E_i\}_{i=1}^{\infty}$, $E_i \in \mathcal{F}$, are still pairwise disjoint, $\bigcup_{i=1}^{\infty} E_i = \bigcup_{i=1}^{n} E_i$, and by (1) we have
$$\sum_{i=1}^{\infty} P(E_i) = \sum_{i=1}^{n} P(E_i).$$
To prove (3), use $S = E \cup E^c$. By (2), $P(S) = P(E) + P(E^c)$. By axiom (2), $P(S) = 1$, so (3) follows.
To prove (4), write $F = E \cup (F \cap E^c)$, so $P(F) = P(E) + P(F \cap E^c) \geq P(E)$ by (2) and axiom (1).
Similarly, to prove (5), we have P(E∪F ) = P(E)+P(E c ∩F ) and P(F ) = P(E∩F )+P(E c ∩F ).
Solving the second equation for P(E c ∩ F ) and substituting in the first gives the desired
result.
It is common for a probability space to consist of finitely many points, all with equally likely probabilities. For example, in tossing a fair coin, we have S = {H, T}, with $P(H) = P(T) = \frac{1}{2}$. Similarly, in rolling a fair die, the probability space consists of {1, 2, 3, 4, 5, 6}, each point having probability $\frac{1}{6}$.
Example 2.1. What is the probability that if we roll 2 dice, the sum is 7?
There are 36 possibilities, of which 6 have a sum of 7: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1). Since they are all equally likely, the probability is $\frac{6}{36} = \frac{1}{6}$.
Example 2.2. What is the probability that in a poker hand (5 cards out of 52) we get
exactly 4 of a kind?
We have four suits: clubs, diamonds, hearts and spades. Each suit includes an ace, a king,
queen and jack, and ranks two through ten.
For example, the probability of 4 aces and 1 king is
$$\frac{\binom{4}{4}\binom{4}{1}}{\binom{52}{5}}.$$
The probability of 4 jacks and one 3 is the same. There are 13 ways to pick the rank that we have 4 of, and then 12 ways to pick the rank we have one of, so the answer is
$$13 \cdot 12 \cdot \frac{\binom{4}{4}\binom{4}{1}}{\binom{52}{5}}.$$
Example 2.3. What is the probability that in a poker hand we get exactly 3 of a kind (and the other two cards are of different ranks)?
For example, the probability of 3 aces, 1 king and 1 queen is
$$\frac{\binom{4}{3}\binom{4}{1}\binom{4}{1}}{\binom{52}{5}}.$$
We have 13 choices for the rank we have 3 of, and $\binom{12}{2}$ choices for the other two ranks, so the answer is
$$13\,\binom{12}{2}\,\frac{\binom{4}{3}\binom{4}{1}\binom{4}{1}}{\binom{52}{5}}.$$
Example 2.4. In a class of 30 people, what is the probability everyone has a different birthday? (We assume each day is equally likely and that it is not a leap year.)
Let the first person have a birthday on some day. The probability that the second person has a different birthday will be $\frac{364}{365}$. The probability that the third person has a different birthday from the first two people is $\frac{363}{365}$. So the answer is
$$\frac{364}{365}\cdot\frac{363}{365}\cdots\frac{336}{365}.$$
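Evaluating this product is easy to do by machine; a short sketch (assuming Python) gives the numerical value:

    prob = 1.0
    for i in range(30):
        prob *= (365 - i) / 365
    print(round(prob, 4))   # approximately 0.2937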
2.2.1. Sets revisited. A visual way to represent set operations is given by the Venn
diagrams.
Example 2.5. Roll two dice. We can describe the sample set S as ordered pairs of numbers
1, 2, ..., 6, that is, S has 36 elements. Examples of events are
E = the two dice come up equal and even = {(2, 2) , (4, 4) , (6, 6)} ,
F = the sum of the two dice is 8 = {(2, 6) , (3, 5) , (4, 4) , (5, 3) , (6, 2)} ,
E ∪ F = {(2, 2) , (2, 6) , (3, 5) , (4, 4) , (5, 3) , (6, 2) , (6, 6)} ,
E ∩ F = {(4, 4)} ,
F c = all 31 pairs that do not include {(2, 6) , (3, 5) , (4, 4) , (5, 3) , (6, 2)} .
Example 2.6. Let S = [0, ∞) be the space of all possible ages at which someone can die.
Possible events are
A = person dies before reaching 30 = [0, 30).
Ac = [30, ∞) = person dies after turning 30.
A ∪ Ac = S,
B = a person lives either less than 15 or more than 45 years = [0, 15) ∪ (45, ∞).
Example 2.7 (Coin tosses). In this case S = {H, T}, where H stands for heads and T stands for tails. We say that a coin is fair if, when we toss it, each side is equally likely.
Example 2.8. Rolling a fair die, the probability of getting an even number is
$$P(\{\text{even}\}) = P(2) + P(4) + P(6) = \frac{1}{2}.$$
Let us see how we can use properties of probability in Proposition 2.1 to solve problems.
Example 2.9. UConn Basketball is playing Kentucky this year, and from past experience the following is known:
• UConn has a 0.5 chance of winning the home game;
• UConn has a 0.4 chance of winning the away game;
• there is a 0.3 chance that UConn wins both games.
What is the probability that UConn loses both games?
Let us denote by $A_1$ the event of a home game win, and by $A_2$ an away game win. Then, from past experience we know that $P(A_1) = 0.5$, $P(A_2) = 0.4$ and $P(A_1 \cap A_2) = 0.3$. Notice that the event that UConn loses both games can be expressed as $A_1^c \cap A_2^c$. Thus we want to find $P(A_1^c \cap A_2^c)$. Using De Morgan's laws and (3) in Proposition 2.1 we have
$$P(A_1^c \cap A_2^c) = P((A_1 \cup A_2)^c) = 1 - P(A_1 \cup A_2).$$
The inclusion-exclusion identity (2.1.1) tells us
$$P(A_1 \cup A_2) = 0.5 + 0.4 - 0.3 = 0.6,$$
and hence $P(A_1^c \cap A_2^c) = 1 - 0.6 = 0.4$.
The inclusion-exclusion identity is actually true for any finite number of events. To illustrate
this, we give next the formula in the case of three events.
Proposition 2.2 (Inclusion-exclusion identity)
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A\cap B) - P(A\cap C) - P(B\cap C) + P(A\cap B\cap C).$$
To show this formula rigorously we can start by considering an event E consisting of exactly
one element, and use axioms of probability to see that this formula holds for such an E.
Then we can represent any event E as a disjoint union of one-element events to prove the
statement.
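One can also convince oneself of the three-event identity on a small example with equally likely outcomes; here is a sketch (assuming Python) on a toy sample space:

    S = set(range(12))
    A, B, C = {0, 1, 2, 3, 4}, {3, 4, 5, 6}, {2, 4, 6, 8, 10}

    def P(E):
        return len(E) / len(S)   # equally likely outcomes

    lhs = P(A | B | C)
    rhs = (P(A) + P(B) + P(C)
           - P(A & B) - P(A & C) - P(B & C)
           + P(A & B & C))
    print(lhs, rhs)   # the two values agree (0.75 each)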
Many experiments can be modeled by considering a set of balls from which some will be
withdrawn. There are two basic ways of withdrawing, namely with or without replacement.
Example 2.11. Three balls are randomly withdrawn without replacement from a bowl
containing 6 white and 5 black balls. What is the probability that one ball is white and the
other two are black?
This is a good example of the situation where a choice of the sample space might be different.
First solution: we model the experiment so that the order in which the balls are drawn is important. That is, we can describe the sample space S as ordered triples of the letters W and B. Then
$$P(E) = \frac{\#(WBB) + \#(BWB) + \#(BBW)}{11\cdot 10\cdot 9} = \frac{6\cdot 5\cdot 4 + 5\cdot 6\cdot 4 + 5\cdot 4\cdot 6}{990} = \frac{120+120+120}{990} = \frac{4}{11}.$$
Second solution: we model the experiment so that the order in which the balls are drawn is not important. In this case
$$P(E) = \frac{(\text{one ball is white})\,(\text{two balls are black})}{\binom{11}{3}} = \frac{\binom{6}{1}\binom{5}{2}}{\binom{11}{3}} = \frac{4}{11}.$$
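Either solution can be confirmed by enumerating all ordered draws; a brute-force sketch (assuming Python):

    from itertools import permutations
    from fractions import Fraction

    balls = ["W"] * 6 + ["B"] * 5
    draws = list(permutations(range(11), 3))     # all ordered draws of 3 distinct balls
    favorable = sum(1 for d in draws if sum(balls[i] == "W" for i in d) == 1)
    print(Fraction(favorable, len(draws)))       # 4/11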
2.3. Exercises
Exercise 2.1. Consider a box that contains 3 balls: 1 red, 1 green, and 1 yellow.
(A) Consider an experiment that consists of taking 1 ball from the box, placing it back in
the box, and then drawing a second ball from the box. List all possible outcomes.
(B) Repeat the experiment but now, after drawing the first ball, the second ball is drawn
from the box without replacing the first. List all possible outcomes.
Exercise 2.2. Suppose that A and B are pairwise disjoint events for which P(A) = 0.2
and P(B) = 0.4.
Exercise 2.3. Forty percent of the students at a certain college are members neither of
an academic club nor of a Greek organization. Fifty percent are members of an academic
club and thirty percent are members of a Greek organization. What is the probability that
a randomly chosen student is
Exercise 2.4. In a city, 60% of the households subscribe to newspaper A, 50% to newspaper
B, 40% to newspaper C, 30% to A and B, 20% to B and C, and 10% to A and C. None
subscribe to all three.
(A) What percentage subscribe to exactly one newspaper?(Hint: Draw a Venn diagram)
(B) What percentage subscribe to at most one newspaper?
Exercise 2.5. There are 52 cards in a standard deck of playing cards. There are 4 suits:
hearts, spades, diamonds, and clubs (♥♠♦♣). Hearts and diamonds are red while spades
and clubs are black. In each suit there are 13 ranks: the numbers 2, 3 . . . , 10, the three face
cards, Jack, Queen, King, and the Ace. Note that Ace is not a face card. A poker hand
consists of five cards. Find the probability of randomly drawing the following poker hands.
Exercise 2.6. Find the probability of randomly drawing the following poker hands.
(A) A one pair, which consists of two cards of the same rank and three other distinct ranks.
(e.g. 22Q59)
(B) A two pair, which consists of two cards of the same rank, two cards of another rank,
and another card of yet another rank. (e.g.JJ779)
(C) A three of a kind, which consists of a three cards of the same rank, and two others of
distinct rank (e.g. 4449K).
(D) A flush, which consists of all five cards of the same suit (e.g. HHHHH, SSSSS, DDDDD, or CCCCC).
(E) A full house, which consists of a pair and a three of a kind (e.g. 88844). (Hint: Note that 88844 is a different hand than 44488.)
Exercise 2.7. Suppose a standard deck of cards is modified with the additional rank of
Super King and the additional suit of Swords so now each card has one of 14 ranks and one
of 5 suits. What is the probability of
Exercise 2.8. A pair of fair dice is rolled. What is the probability that the first die lands
on a strictly higher value than the second die?
Exercise 2.9. In a seminar attended by 8 students, what is the probability that at least
two of them have birthday in the same month?
Exercise 2.10. Nine balls are randomly withdrawn without replacement from an urn that
contains 10 blue, 12 red, and 15 green balls. What is the probability that
Exercise 2.11. Suppose 4 valedictorians from different high schools are accepted to the
8 Ivy League universities. What is the probability that each of them chooses to go to a
different Ivy League university?
Exercise 2.12. Two dice are thrown. Let E be the event that the sum of the dice is even, and F be the event that at least one of the dice lands on 2. Describe the events $E \cap F$ and $E \cup F$.
Exercise 2.13. If there are 8 people in a room, what is the probability that no two of
them celebrate their birthday in the same month?
Exercise 2.14. Box I contains 3 red and 2 black balls. Box II contains 2 red and 8 black
balls. A coin is tossed. If H, then a ball from box I is chosen; if T, then from box II.
2.4. Selected solutions

Solution to Exercise 2.1(A): since every ball can be drawn first and every ball can be drawn second, there are $3^2 = 9$ possibilities: RR, RG, RY, GR, GG, GY, YR, YG, and YY (we let the first letter denote the color of the first ball drawn and the second letter the color of the second).
Solution to Exercise 2.1(B): in this case, the color of the second ball cannot match the color of the first, so there are 6 possibilities: RG, RY, GR, GY, YR, and YG.
Solution to Exercise 2.2(A): Since A ∩ B = ∅, B ⊆ Ac hence P(B ∩ Ac ) = P(B) = 0.4.
Solution to Exercise 2.2(B): by De Morgan's laws and property (3) of Proposition 2.1,
$$P(A \cap B^c) = P\left((A^c \cup B)^c\right) = 1 - P(A^c \cup B) = 1 - \left(P(A^c) + P(B) - P(A^c \cap B)\right) = 1 - (0.8 + 0.4 - 0.4) = 0.2.$$
Thus, $P(A \cap B^c) = 0.2$.
Solution to Exercise 2.4(A): we use these percentages to produce a Venn diagram for the three newspapers.
Solution to Exercise 2.5(A): $\dfrac{\binom{26}{5}}{\binom{52}{5}}$.

Solution to Exercise 2.5(B): $\dfrac{\binom{4}{2}\cdot\binom{4}{3}}{\binom{52}{5}}$.

Solution to Exercise 2.5(C): $\dfrac{\binom{12}{5}}{\binom{52}{5}} + \dfrac{\binom{40}{5}}{\binom{52}{5}}$.

Solution to Exercise 2.6(A): $\dfrac{13\cdot\binom{4}{2}\cdot\binom{12}{3}\cdot\binom{4}{1}\cdot\binom{4}{1}\cdot\binom{4}{1}}{\binom{52}{5}}$.

Solution to Exercise 2.6(B): $\dfrac{\binom{13}{2}\cdot\binom{4}{2}\cdot\binom{4}{2}\cdot\binom{44}{1}}{\binom{52}{5}}$.

Solution to Exercise 2.6(C): $\dfrac{13\cdot\binom{4}{3}\cdot\binom{12}{2}\cdot\binom{4}{1}\cdot\binom{4}{1}}{\binom{52}{5}}$.

Solution to Exercise 2.6(D): $\dfrac{4\cdot\binom{13}{5}}{\binom{52}{5}}$.

Solution to Exercise 2.6(E): $\dfrac{13\cdot 12\cdot\binom{4}{3}\cdot\binom{4}{2}}{\binom{52}{5}}$.
Solution to Exercise 2.7(A): $\dfrac{1}{70}$.

Solution to Exercise 2.7(B): $\dfrac{\binom{14}{3}\cdot\binom{5}{2}\cdot\binom{5}{2}\cdot\binom{5}{2}}{\binom{70}{6}}$.

Solution to Exercise 2.7(C): $\dfrac{14\cdot\binom{5}{3}\cdot 13\cdot\binom{5}{2}\cdot 12\cdot\binom{5}{1}}{\binom{70}{6}}$.
Solution to Exercise 2.8: we can simply list all possibilities
(6, 1) , (6, 2) , (6, 3), (6, 4) , (6, 5) 5 possibilities
(5, 1) , (5, 2), (5, 3) , (5, 4) 4 possibilities
(4, 1) , (4, 2), (4, 3) 3 possibilities
(3, 1) , (3, 2) 2 possibilities
(2, 1) 1 possibility
for a total of 15 possibilities. Thus the probability is $\frac{15}{36}$.
Solution to Exercise 2.9:
$$1 - \frac{12\cdot 11\cdot 10\cdot 9\cdot 8\cdot 7\cdot 6\cdot 5}{12^8}.$$

Solution to Exercise 2.10(A): $\dfrac{\binom{10}{2}\cdot\binom{12}{5}\cdot\binom{15}{2}}{\binom{37}{9}}$.

Solution to Exercise 2.10(B):
$$1 - \frac{\binom{27}{9}}{\binom{37}{9}} - \frac{\binom{10}{1}\cdot\binom{27}{8}}{\binom{37}{9}}.$$
Solution to Exercise∗ 2.1: to prove Proposition 2.2 we will use Equation (2.1.1) several times, as well as a distributive law for sets. First, we apply Equation (2.1.1) to the two sets A and B ∪ C to see that
$$P(A \cup B \cup C) = P(A) + P(B \cup C) - P(A \cap (B \cup C)). \qquad (2.4.1)$$
Next, the distributive law
$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$
can be seen by using Venn diagrams. Now we can apply Equation (2.1.1) to the sets (A ∩ B) and (A ∩ C); since
$$(A \cap B) \cap (A \cap C) = A \cap B \cap C,$$
this gives
$$P(A \cap (B \cup C)) = P(A \cap B) + P(A \cap C) - P(A \cap B \cap C).$$
Use this, together with Equation (2.1.1) applied to B and C, in Equation (2.4.1) to finish the proof.
CHAPTER 3
Independence
Suppose we have a probability space of a sample space S, σ-field F and probability P defined
on F.
Definition (Independence)
Two events E and F are independent if $P(E \cap F) = P(E)\,P(F)$.
Example 3.1. Suppose you flip two coins. The outcome of heads on the second is inde-
pendent of the outcome of tails on the first. To be more precise, if A is tails for the first
coin and B is heads for the second, and we assume we have fair coins (although this is not
necessary), we have $P(A \cap B) = \frac{1}{4} = \frac{1}{2}\cdot\frac{1}{2} = P(A)\,P(B)$.
Example 3.2. Suppose you draw a card from an ordinary deck. Let E be the event that you drew an ace and F the event that you drew a spade. Here $\frac{1}{52} = P(E \cap F) = \frac{1}{13}\cdot\frac{1}{4} = P(E)\,P(F)$.
Proposition 3.1
If E and F are independent, then E and F c are independent.
Proof.
For example, for three events, E, F , and G, they are independent if E and F are independent,
E and G are independent, F and G are independent, and P(E ∩ F ∩ G) = P(E)P(F )P(G).
Example 3.3 (Pairwise but not jointly independent events). Throw two fair dice. Consider
three events
Example 3.4. What is the probability that exactly 3 threes will show if you roll 10 dice?
The probability that the 1st, 2nd, and 4th dice will show a three and the other 7 will not is $\left(\frac{1}{6}\right)^3\left(\frac{5}{6}\right)^7$. Independence is used here: the probability is $\frac{1}{6}\cdot\frac{1}{6}\cdot\frac{5}{6}\cdot\frac{1}{6}\cdot\frac{5}{6}\cdots\frac{5}{6}$. The probability that the 4th, 5th, and 6th dice will show a three and the other 7 will not is the same thing. So to answer our original question, we take $\left(\frac{1}{6}\right)^3\left(\frac{5}{6}\right)^7$ and multiply it by the number of ways of choosing 3 dice out of 10 to be the ones showing threes. There are $\binom{10}{3}$ ways of doing that.
This is a particular example of what are known as Bernoulli trials or the binomial distribu-
tion.
Proof. The probability that there are k successes is the number of ways of putting k objects in n slots (which is $\binom{n}{k}$) times the probability that there will be k successes and n − k failures in exactly a given order. So the probability is $\binom{n}{k}p^k(1-p)^{n-k}$.
The name binomial for this distribution comes from the simple observation that if we denote by
$$E_k := \{k \text{ successes in } n \text{ independent trials}\},$$
then $S = \bigcup_{k=0}^{n} E_k$ is a disjoint decomposition of the sample space, so
$$P(S) = P\left(\bigcup_{k=0}^{n} E_k\right) = \sum_{k=0}^{n} P(E_k) = \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k} = (p + (1-p))^n = 1$$
by the binomial formula. This shows we indeed have the second axiom of probability for the binomial distribution. We denote by Binom(n, p) the binomial distribution with parameters n and p.
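The fact that these probabilities add up to 1 is also easy to confirm numerically; a sketch (assuming Python), here with n = 10 and p = 1/6:

    from math import comb

    n, p = 10, 1 / 6
    total = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))
    print(total)   # 1.0 up to floating-point rounding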
3.2.1. Examples.
Example 3.5. A card is drawn from an ordinary deck of cards (52 cards). Consider the
events F := a face is drawn, R := a red color is drawn.
These are independent events because, for one card, being a face does not affect it being red:
there are 12 faces, 26 red cards, and 6 cards that are red faces. Thus,
$$P(F)\,P(R) = \frac{12}{52}\cdot\frac{26}{52} = \frac{3}{26},$$
$$P(F \cap R) = \frac{6}{52} = \frac{3}{26}.$$
Example 3.6. Suppose that two unfair coins are flipped: the first coin has heads probability 0.5001 and the second has heads probability 0.5002. The events $A_T :=$ the first coin lands tails and $B_H :=$ the second coin lands heads are independent. Why? The sample space S = {HH, HT, TH, TT} has 4 elements, all of them with different probabilities, given as products. The events correspond to $A_T$ = {TH, TT} and $B_H$ = {HH, TH} respectively, and the computation of the probabilities is given by
$$P(A_T \cap B_H) = P(\{TH\}) = 0.4999\cdot 0.5002 = P(A_T)\,P(B_H).$$
Example 3.7. An urn contains 10 balls, 4 red and 6 blue. A second urn contains 16 red
balls and an unknown number of blue balls. A single ball is drawn from each urn and the
probability that both balls are the same color is 0.44. How many blue balls are there in the
second urn?
Let us define the events $R_i :=$ a red ball is drawn from urn i and $B_i :=$ a blue ball is drawn from urn i, and let x denote the (unknown) number of blue balls in urn 2, so that the second urn has 16 + x balls in total. Using the fact that the events $R_1 \cap R_2$ and $B_1 \cap B_2$ are disjoint, and that draws from the two urns are independent (check this!), we have
$$0.44 = P\left((R_1 \cap R_2)\cup(B_1 \cap B_2)\right) = P(R_1 \cap R_2) + P(B_1 \cap B_2) = P(R_1)P(R_2) + P(B_1)P(B_2) = \frac{4}{10}\cdot\frac{16}{x+16} + \frac{6}{10}\cdot\frac{x}{x+16}.$$
Solving this equation for x we get x = 4.
Example 3.8. Suppose that we roll 10 dice. What is the probability that at most 4 of
them land a two?
We can regard this experiment as rolling a single die ten times in a row. One possibility is that the first, second, third, and tenth rolls land a two, while the rest land something else. Since the trials are independent, the probability of this event is
$$\frac{1}{6}\cdot\frac{1}{6}\cdot\frac{1}{6}\cdot\frac{5}{6}\cdot\frac{5}{6}\cdot\frac{5}{6}\cdot\frac{5}{6}\cdot\frac{5}{6}\cdot\frac{5}{6}\cdot\frac{1}{6} = \left(\frac{1}{6}\right)^4\cdot\left(\frac{5}{6}\right)^6.$$
Note that the probability that the 10th, 9th, 8th, and 7th dice land a two and the other 6
do not is the same as the previous one. To answer our original question, we thus need to
consider the number of ways of choosing 0, 1, 2, 3 or 4 trials out of 10 to be the ones showing
a two. This means
$$P(\text{exactly 0 dice land a two}) = \binom{10}{0}\cdot\left(\frac{1}{6}\right)^0\cdot\left(\frac{5}{6}\right)^{10} = \left(\frac{5}{6}\right)^{10},$$
$$P(\text{exactly 1 die lands a two}) = \binom{10}{1}\cdot\frac{1}{6}\cdot\left(\frac{5}{6}\right)^{9},$$
$$P(\text{exactly 2 dice land a two}) = \binom{10}{2}\cdot\left(\frac{1}{6}\right)^2\cdot\left(\frac{5}{6}\right)^{8},$$
$$P(\text{exactly 3 dice land a two}) = \binom{10}{3}\cdot\left(\frac{1}{6}\right)^3\cdot\left(\frac{5}{6}\right)^{7},$$
$$P(\text{exactly 4 dice land a two}) = \binom{10}{4}\cdot\left(\frac{1}{6}\right)^4\cdot\left(\frac{5}{6}\right)^{6}.$$
The answer to the question is the sum of these five numbers.
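This sum can be evaluated directly, and a simulation gives a consistent value; a sketch (assuming Python):

    from math import comb
    import random

    exact = sum(comb(10, k) * (1 / 6) ** k * (5 / 6) ** (10 - k) for k in range(5))

    random.seed(0)
    trials = 100_000
    hits = sum(1 for _ in range(trials)
               if sum(random.randint(1, 6) == 2 for _ in range(10)) <= 4)
    print(exact, hits / trials)   # both are approximately 0.9845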
3.3. Exercises
Exercise 3.1. Let A and B be two independent events such that P (A ∪ B) = 0.64 and
P (A) = 0.4. What is P(B)?
Exercise 3.2. In a class, there are 4 male math majors, 6 female math majors, and 6 male
actuarial science majors. How many female actuarial science majors must be present in the class so that sex and major are independent when a student is selected at random?
Exercise 3.3. Following Proposition 3.1, prove that E and F are independent if and only if E and $F^c$ are independent.
Exercise 3.4. Suppose we toss a fair coin twice, and let E be the event that both tosses
give the same outcome, F that the first toss is a heads, and G is that the second toss is
heads. Show that E, F and G are pairwise independent, but not jointly independent.
Exercise 3.5. Two dice are simultaneously rolled. For each pair of events defined below, determine whether they are independent or not.
(a) A1 = {the sum is 7}, B1 = {the first die lands a 3}.
(b) A2 = {the sum is 9}, B2 = {the second die lands a 3}.
(c) A3 = {the sum is 9}, B3 = {the first die lands even}.
(d) A4 = {the sum is 9}, B4 = {the first die is less than the second}.
(e) A5 = {two dice are equal}, B5 = {the sum is 8}.
(f) A6 = {two dice are equal}, B6 = {the first die lands even}.
(g) A7 = {two dice are not equal}, B7 = {the first die is less than the second}.
Exercise 3.6. Are the events A1 , B1 and B3 from Exercise 3.5 independent?
Exercise 3.7. A hockey team has a 0.45 chance of losing each game. Assuming that the games are independent of each other, what is the probability that the team loses 3 of the next 5 games?
Exercise 3.8. You make successive independent flips of a coin that lands on heads with
probability p. What is the probability that the 3rd heads appears on the 7th flip?
Hint: express your answers in terms of p; do not assume p = 1/2.
Exercise 3.9. Suppose you toss a fair coin repeatedly and independently. If it comes up
heads, you win a dollar, and if it comes up tails, you lose a dollar. Suppose you start with
$M . What is the probability you will get up to $N before you go broke? Give the answer
in terms of M and N , assuming 0 < M < N .
Exercise 3.10. Suppose that we roll n dice. What is the probability that at most k of
them land a two?
CHAPTER 4

Conditional probability
Suppose there are 200 men, of which 100 are smokers, and 100 women, of which 20 are
smokers. What is the probability that a person chosen at random will be a smoker? The
answer is 120/300. Now, let us ask, what is the probability that a person chosen at random
is a smoker given that the person is a woman? One would expect the answer to be 20/100
and it is.
What we have computed is
$$\frac{20}{100} = \frac{20/300}{100/300},$$
which is the same as the probability that a person chosen at random is a woman and a smoker divided by the probability that a person chosen at random is a woman.
With this in mind, we give the following definition.
Definition 4.1 (Conditional probability)
If $P(F) > 0$, we define the probability of E given F as
$$P(E \mid F) := \frac{P(E \cap F)}{P(F)}.$$
Example 4.1. Suppose you roll two dice. What is the probability the sum is 8?
There are five ways this can happen {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}, so the probability is
5/36. Let us call this event A. What is the probability that the sum is 8 given that the
first die shows a 3? Let B be the event that the first die shows a 3. Then $P(A \cap B)$ is the probability that the first die shows a 3 and the sum is 8, or 1/36. P(B) = 1/6, so
$$P(A \mid B) = \frac{1/36}{1/6} = \frac{1}{6}.$$
Example 4.2. Suppose a box has 3 red marbles and 2 black ones. We select 2 marbles.
What is the probability that second marble is red given that the first one is red?
Let A be the event that the second marble is red, and B the event that the first one is red. P(B) = 3/5, while $P(A \cap B)$ is the probability that both are red, that is, the probability that we chose 2 red out of 3 and 0 black out of 2. Then $P(A \cap B) = \binom{3}{2}\binom{2}{0}\big/\binom{5}{2} = \frac{3}{10}$, and so
$$P(A \mid B) = \frac{3/10}{3/5} = \frac{1}{2}.$$
Example 4.3. A family has 2 children. Given that one of the children is a boy, what is
the probability that the other child is also a boy?
Let B be the event that at least one child is a boy, and A the event that both children are boys. The possibilities are bb, bg, gb, gg, each with probability 1/4. $P(A \cap B) = P(bb) = 1/4$ and $P(B) = P(\{bb, bg, gb\}) = 3/4$. So the answer is $\frac{1/4}{3/4} = \frac{1}{3}$.
Example 4.4. Suppose the test for HIV is 99% accurate in both directions and 0.3% of the
population is HIV positive. If someone tests positive, what is the probability they actually
are HIV positive?
Let D be the event that a person is HIV positive, and T the event that the person tests positive. Then
$$P(D \mid T) = \frac{P(D \cap T)}{P(T)} = \frac{(0.99)(0.003)}{(0.99)(0.003) + (0.01)(0.997)} \approx 23\%.$$
A short reason why this surprising result holds is that the error in the test is much greater
than the percentage of people with HIV. A little longer answer is to suppose that we have
1000 people. On average, 3 of them will be HIV positive and 10 will test positive. So the
chances that someone has HIV given that the person tests positive is approximately 3/10.
The reason that it is not exactly 0.3 is that there is some chance someone who is positive
will test negative.
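The arithmetic in this example is easy to reproduce; a short sketch (assuming Python):

    prevalence = 0.003
    sensitivity = 0.99       # P(test positive | HIV positive)
    false_positive = 0.01    # P(test positive | HIV negative)

    p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
    print(sensitivity * prevalence / p_positive)   # approximately 0.23, i.e. about 23%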
Suppose you know $P(E \mid F)$ and you want to find $P(F \mid E)$. Recall that
$$P(F \cap E) = P(E \mid F)\,P(F),$$
and so
$$P(F \mid E) = \frac{P(F \cap E)}{P(E)} = \frac{P(E \mid F)\,P(F)}{P(E)}.$$
Example 4.5. Suppose 36% of families own a dog, 30% of families own a cat, and 22%
of the families that have a dog also have a cat. A family is chosen at random and found to
have a cat. What is the probability they also own a dog?
Let D be the families that own a dog, and C the families that own a cat. We are given
P(D) = 0.36, P(C) = 0.30, P(C | D) = 0.22. We want to know P(D | C). We know
P(D | C) = P(D ∩ C)/P(C). To find the numerator, we use P(D ∩ C) = P(C | D)P(D) =
(0.22)(0.36) = 0.0792. So P(D | C) = 0.0792/0.3 = 0.264 = 26.4%.
Example 4.6. Suppose 30% of the women in a class received an A on the test and 25% of the men received an A. The class is 60% women. Given that a person chosen at random received an A, what is the probability this person is a woman?
Let A be the event of receiving an A, W the event of being a woman, and M the event of being a man. We are given $P(A \mid W) = 0.30$, $P(A \mid M) = 0.25$, $P(W) = 0.60$, and we want $P(W \mid A)$. From the definition,
$$P(W \mid A) = \frac{P(W \cap A)}{P(A)}.$$
As in the previous example,
P(W ∩ A) = P(A | W )P(W ) = (0.30)(0.60) = 0.18.
To find P(A), we write
P(A) = P(W ∩ A) + P(M ∩ A).
Since the class is 40% men,
P(M ∩ A) = P(A | M )P(M ) = (0.25)(0.40) = 0.10.
So
P(A) = P(W ∩ A) + P(M ∩ A) = 0.18 + 0.10 = 0.28.
Finally,
$$P(W \mid A) = \frac{P(W \cap A)}{P(A)} = \frac{0.18}{0.28}.$$
Proof. We use the definition of conditional probability and the fact that
Here is another example related to conditional probability, although this is not an example
of Bayes’ rule. This is known as the Monty Hall problem after the host of the TV show in
the 60s called Let’s Make a Deal.
Example 4.7. There are three doors, behind one a nice car, behind each of the other
two a goat eating a bale of straw. You choose a door. Then Monty Hall opens one of the
other doors, which shows a bale of straw. He gives you the opportunity of switching to the
remaining door. Should you do it?
Let us suppose you choose door 1, since the same analysis applies whichever door you chose.
The first strategy is to stick with door 1. With probability 1/3 you chose the car. Monty
Hall shows you one of the other doors, but that doesn’t change your probability of winning.
The second strategy is to change. Let us say the car is behind door 1, which happens with
probability 1/3. Monty Hall shows you one of the other doors, say door 2. There will be a
goat, so you switch to door 3, and lose. The same argument applies if he shows you door 3.
Suppose the car is behind door 2. He will show you door 3, since he doesn’t want to give
away the car. You switch to door 2 and win. This happens with probability 1/3. The same
argument applies if the car is behind door 3.
So you win with probability 2/3 and lose with probability 1/3. Thus strategy 2 is much
superior.
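The switching strategy can also be tested by simulation; here is a Monte Carlo sketch (assuming Python) in which you always pick door 1:

    import random

    random.seed(0)
    trials = 100_000
    switch_wins = 0
    for _ in range(trials):
        car = random.randint(1, 3)
        choice = 1
        opened = next(d for d in (2, 3) if d != car)          # Monty opens a goat door you did not pick
        switched = next(d for d in (1, 2, 3) if d not in (choice, opened))
        switch_wins += (switched == car)
    print(switch_wins / trials)   # close to 2/3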
Example 4.8 (Gambler’s ruin). Suppose you play the game by tossing a fair coin re-
peatedly and independently. If it comes up heads, you win a dollar, and if it comes up tails,
you lose a dollar. Suppose you start with $50. What’s the probability you will get to $200
without first getting ruined (running out of money)?
It is easier to solve a slightly harder problem. The game can be described as having proba-
bility 1/2 of winning 1 dollar and a probability 1/2 of losing 1 dollar. A player begins with
a given number of dollars, and intends to play the game repeatedly until the player either
goes broke or increases his holdings to N dollars.
For any given amount n of current holdings, the conditional probability of reaching N dollars
before going broke is independent of how we acquired the n dollars, so there is a unique
probability P (N | n) of reaching N on the condition that we currently hold n dollars. Of
course, for any finite N we see that P (N | 0) = 0 and P (N | N ) = 1. The problem is to
determine the values of P (N | n) for n between 0 and N .
We are considering this setting for N = 200, and we would like to find P (200 | 50). Denote
y (n) := P (200 | n), which is the probability you get to 200 without first getting ruined if
you start with n dollars. We saw that y(0) = 0 and y(200) = 1. Suppose the player has
n dollars at the moment, the next round will leave the player with either n + 1 or n − 1
dollars, both with probability 1/2. Thus the current probability of winning is the same as a
weighted average of the probabilities of winning in player’s two possible next states. So we
have
$$y(n) = \tfrac{1}{2}\,y(n+1) + \tfrac{1}{2}\,y(n-1).$$
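A natural candidate is the linear function y(n) = n/N (the recursion says the second differences of y vanish, so y must be linear); the following sketch (assuming Python, with the candidate formula supplied by us rather than derived in the text) checks that it satisfies the recursion and the boundary conditions y(0) = 0, y(N) = 1:

    N = 200
    y = [n / N for n in range(N + 1)]
    ok = all(abs(y[n] - (y[n + 1] + y[n - 1]) / 2) < 1e-12 for n in range(1, N))
    print(ok, y[0], y[N], y[50])   # True 0.0 1.0 0.25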
Example 4.9. Suppose we are in the same situation, but you are allowed to go arbitrarily
far in debt. Let z (n) be the probability you ever get to $200 if you start with n dollars.
What is a formula for z (n)?
Just as above, we see that z satisfies the recursive equation
$$z(n) = \tfrac{1}{2}\,z(n+1) + \tfrac{1}{2}\,z(n-1).$$
What we need to determine now are the boundary conditions. Now that the gambler can go into debt, the condition that if we start with 0 we never get to $200 (that is, that the probability of getting $200 is 0) is no longer true. Following Equation 4.1.1 with $z(0) \neq 0$ we see that
$$z(n) = an + b.$$
We would like to find a and b now. Recall that this function is a probability, so for any n we have $0 \leq z(n) \leq 1$. This is possible only if a = 0, that is,
$$z(1) = z(0),$$
so
$$z(n) = z(0)$$
for any n. We know that z(200) = 1, therefore z(n) = 1 for every n.
In other words, one is certain to get to $200 eventually (provided, of course, that one is
allowed to go into debt).
Example 4.10. Landon is 80% sure he forgot his textbook either at the Union or in
Monteith. He is 40% sure that the book is at the union, and 40% sure that it is in Monteith.
Given that Landon already went to Monteith and noticed his textbook is not there, what is
the probability that it is at the Union?
We denote by U the event that the textbook is at the Union, and by M the event that the textbook is in Monteith. Notice that $U \subseteq M^c$ and hence $U \cap M^c = U$. Thus,
$$P(U \mid M^c) = \frac{P(U \cap M^c)}{P(M^c)} = \frac{P(U)}{1 - P(M)} = \frac{4/10}{6/10} = \frac{2}{3}.$$
Example 4.11. Sarah and Bob draw 13 cards each from a standard deck of 52. Given
that Sarah has exactly two aces, what is the probability that Bob has exactly one ace?
Let A be the event that Sarah has two aces, and let B be the event that Bob has exactly one ace. In order to compute $P(B \mid A)$, we need to calculate P(A) and $P(A \cap B)$. On the one hand, Sarah could have any of $\binom{52}{13}$ possible hands. Of these hands, $\binom{4}{2}\cdot\binom{48}{11}$ will have exactly two aces, so that
$$P(A) = \frac{\binom{4}{2}\cdot\binom{48}{11}}{\binom{52}{13}}.$$
On the other hand, the number of ways in which Sarah can pick a hand and Bob another (different) one is $\binom{52}{13}\cdot\binom{39}{13}$. The number of ways in which A and B can occur simultaneously is $\binom{4}{2}\cdot\binom{48}{11}\cdot\binom{2}{1}\cdot\binom{37}{12}$, and hence
$$P(A \cap B) = \frac{\binom{4}{2}\cdot\binom{48}{11}\cdot\binom{2}{1}\cdot\binom{37}{12}}{\binom{52}{13}\cdot\binom{39}{13}}.$$
Applying the definition of conditional probability we finally get
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{\binom{4}{2}\cdot\binom{48}{11}\cdot\binom{2}{1}\cdot\binom{37}{12}}{\binom{52}{13}\cdot\binom{39}{13}}\bigg/\frac{\binom{4}{2}\cdot\binom{48}{11}}{\binom{52}{13}} = \frac{\binom{2}{1}\cdot\binom{37}{12}}{\binom{39}{13}}.$$
Example 4.12. A total of 500 married couples are polled about their salaries with the
following results
                               husband makes less than $25K    husband makes more than $25K
 wife makes less than $25K                 212                              198
 wife makes more than $25K                  36                               54

(a) We can find the probability that a husband earns less than $25K as follows:
$$P(\text{husband makes less than \$25K}) = \frac{212}{500} + \frac{36}{500} = \frac{248}{500} = 0.496.$$
(b) Now we find the probability that a wife earns more than $25K, given that the husband earns that much as well:
$$P(\text{wife makes more than \$25K} \mid \text{husband makes more than \$25K}) = \frac{54/500}{(198+54)/500} = \frac{54}{252} \approx 0.214.$$
(c) Finally, we find the probability that a wife earns more than $25K, given that the husband makes less than $25K:
$$P(\text{wife makes more than \$25K} \mid \text{husband makes less than \$25K}) = \frac{36/500}{248/500} \approx 0.145.$$
From the definition of conditional probability we can deduce some useful relations.
Proposition 4.2
Proof. We already saw (i), which is a rewriting of the definition of conditional probability $P(F \mid E) = \frac{P(E \cap F)}{P(E)}$. Let us prove (ii): we can write E as the union of the pairwise disjoint sets $E \cap F$ and $E \cap F^c$. Using (i) we have
$$P(E) = P(E \cap F) + P(E \cap F^c) = P(E \mid F)\,P(F) + P(E \mid F^c)\,P(F^c).$$
Finally, writing F = E in the previous equation and using that $P(E \mid E^c) = 0$, we obtain (iii).
Example 4.13. Phan wants to take either a Biology course or a Chemistry course. His
adviser estimates that the probability of scoring an A in Biology is $\frac{4}{5}$, while the probability of scoring an A in Chemistry is $\frac{1}{7}$. If Phan decides randomly, by a coin toss, which course to take, what is his probability of scoring an A in Chemistry?
We denote by B the event that Phan takes Biology, by C the event that Phan takes Chemistry, and by A the event that the score is an A. Then, since $P(B) = P(C) = \frac{1}{2}$, we have
$$P(A \cap C) = P(C)\,P(A \mid C) = \frac{1}{2}\cdot\frac{1}{7} = \frac{1}{14}.$$
The identity $P(E \cap F) = P(E)\,P(F \mid E)$ from Proposition 4.2(i) can be generalized to any number of events in what is sometimes called the multiplication rule:
$$P(E_1 \cap E_2 \cap \cdots \cap E_n) = P(E_1)\,P(E_2 \mid E_1)\,P(E_3 \mid E_1 \cap E_2)\cdots P(E_n \mid E_1 \cap \cdots \cap E_{n-1}).$$
Example 4.14. An urn has 5 blue balls and 8 red balls. Each ball that is selected is
returned to the urn along with an additional ball of the same color. Suppose that 3 balls are
drawn in this way.
(a) What is the probability that the three balls are blue?
In this case, we can define the events $B_1, B_2, B_3$, where $B_i$ is the event that the ith ball drawn is blue. Applying the multiplication rule yields
$$P(B_1 \cap B_2 \cap B_3) = P(B_1)\,P(B_2 \mid B_1)\,P(B_3 \mid B_1 \cap B_2) = \frac{5}{13}\cdot\frac{6}{14}\cdot\frac{7}{15}.$$
(b) What is the probability that only 1 ball is blue?
Denoting by $R_i$ the event that the ith ball drawn is red, we have
$$P(\text{only 1 blue ball}) = P(B_1 \cap R_2 \cap R_3) + P(R_1 \cap B_2 \cap R_3) + P(R_1 \cap R_2 \cap B_3) = 3\,\frac{5\cdot 8\cdot 9}{13\cdot 14\cdot 15}.$$
Also the identity (ii) in Proposition 4.2 can be generalized by partitioning the sample space
S into several pairwise disjoint sets F1 , . . . , Fn (instead of simply F and F c ).
Proposition 4.4 (Law of total probability)
Let $F_1, \ldots, F_n \subseteq S$ be mutually exclusive and exhaustive events, i.e. $S = \bigcup_{i=1}^{n} F_i$. Then, for any event $E \in \mathcal{F}$ it holds that
$$P(E) = \sum_{i=1}^{n} P(E \mid F_i)\,P(F_i).$$
4.2.2. Generalized Bayes’ rule. The following example describes the type of problems
treated in this section.
Example 4.15. An insurance company classifies insured policyholders into accident prone
or non-accident prone. Their current risk model works with the following probabilities.
The probability that an accident prone insured has an accident within a year is 0.4.
The probability that a non-accident prone insured has an accident within a year is 0.2.
The probability that a policyholder is accident prone is 0.3.
(a) What is the probability that a policyholder will have an accident within a year?
Denote by A1 the event that a policyholder will have an accident within a year, and by A
the event that a policyholder is accident prone. Applying Proposition 4.2(ii) we have
P(A1) = P(A1 | A)P(A) + P(A1 | A^c)(1 − P(A)) = 0.4 · 0.3 + 0.2 · (1 − 0.3) = 0.26.
(b) Suppose now that the policyholder has had an accident within one year. What is the
probability that he or she is accident prone?
Use Bayes' formula to see that
P(A | A1) = P(A ∩ A1)/P(A1) = P(A)P(A1 | A)/P(A1) = (0.3 · 0.4)/0.26 = 6/13.
Using the law of total probability from Proposition 4.4 one can generalize Bayes’ rule, which
appeared in Proposition 4.1.
Proposition 4.5 (Generalized Bayes’ rule)
Let F1, . . . , Fn ⊆ S be mutually exclusive and exhaustive events, i.e. S = F1 ∪ · · · ∪ Fn.
Then, for any event E ⊆ S and any j = 1, . . . , n it holds that
P(Fj | E) = P(E | Fj)P(Fj) / Σ_{i=1}^{n} P(E | Fi)P(Fi).
Example 4.16. Suppose a factory has machines I, II, and III that produce iSung phones.
The factory’s record shows that
Machines I, II and III produce, respectively, 2%, 1%, and 3% defective iSungs.
Out of the total production, machines I, II, and III produce, respectively, 35%, 25% and
40% of all iSungs.
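The question posed in this example is not reproduced here. As an illustrative sketch only (the particular question is our own choice), the following Python lines apply the generalized Bayes' rule to these numbers to find the probability that a defective iSung came from each machine.

# Production shares and defect rates taken from the example.
share = {"I": 0.35, "II": 0.25, "III": 0.40}
defect_rate = {"I": 0.02, "II": 0.01, "III": 0.03}

# Law of total probability: P(defective).
p_defective = sum(share[m] * defect_rate[m] for m in share)

# Generalized Bayes' rule: P(machine | defective) for each machine.
posterior = {m: share[m] * defect_rate[m] / p_defective for m in share}

print(round(p_defective, 4))                      # 0.0215
print({m: round(p, 3) for m, p in posterior.items()})
# {'I': 0.326, 'II': 0.116, 'III': 0.558}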
Example 4.17. In a multiple choice test, a student either knows the answer to a question
or she/he will randomly guess it. If each question has m possible answers and the student
knows the answer to a question with probability p, what is the probability that the student
actually knows the answer to a question, given that he/she answers correctly?
Denote by K the event that a student knows the answer, and by C the event that a student
answers correctly. Applying Bayes’ rule we have
P(K | C) = P(C | K)P(K) / [P(C | K)P(K) + P(C | K^c)P(K^c)] = (1 · p) / (1 · p + (1/m)(1 − p)) = mp / (1 + (m − 1)p).
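As a quick sanity check (added as an illustration), the snippet below compares the closed form mp/(1 + (m − 1)p) with a direct Bayes computation for a few illustrative values of m and p.

def posterior_knows(m, p):
    """P(student knows the answer | answer is correct), by Bayes' rule."""
    p_correct_given_knows = 1.0
    p_correct_given_guess = 1.0 / m
    numerator = p_correct_given_knows * p
    denominator = numerator + p_correct_given_guess * (1 - p)
    return numerator / denominator

for m, p in [(4, 0.5), (5, 0.2), (2, 0.9)]:
    closed_form = m * p / (1 + (m - 1) * p)
    assert abs(posterior_knows(m, p) - closed_form) < 1e-12
    print(m, p, round(closed_form, 4))
# e.g. m = 4, p = 0.5 gives 0.8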
4.3. Exercises
Exercise 4.1. Two dice are rolled. Consider the events A = {sum of two dice equals 3},
B = {sum of two dice equals 7 }, and C = {at least one of the dice shows a 1}.
Exercise 4.2. Suppose you roll two standard, fair, 6-sided dice. What is the probability
that the sum is at least 9 given that you rolled at least one 6?
Exercise 4.3. A box contains 1 green ball and 1 red ball, and a second box contains 2
green and 3 red balls. First a box is chosen and afterwards a ball is withdrawn from the chosen
box. Both boxes are equally likely to be chosen. Given that a green ball has been withdrawn,
what is the probability that the first box was chosen?
Exercise 4.4. Suppose that 60% of UConn students will be exposed to the flu at random.
If you are exposed and did not get a flu shot, then the probability that you will get the flu
(after being exposed) is 80%. If you did get a flu shot, then the probability that you will get
the flu (after being exposed) is only 15%.
(a) What is the probability that a person who got a flu shot will get the flu?
(b) What is the probability that a person who did not get a flu shot will get the flu?
Exercise 4.5. Color blindness is a sex-linked condition, and 5% of men and 0.25% of
women are color blind. The population of the United States is 51% female. What is the
probability that a color-blind American is a man?
Exercise 4.6. Two factories supply light bulbs to the market. Bulbs from factory X work
for over 5000 hours in 99% of cases, whereas bulbs from factory Y work for over 5000 hours
in 95% of cases. It is known that factory X supplies 60% of the total bulbs available in the
market.
(a) What is the probability that a purchased bulb will work for longer than 5000 hours?
(b) Given that a light bulb works for more than 5000 hours, what is the probability that it
was supplied by factory Y ?
(c) Given that a light bulb does not work for more than 5000 hours, what is the
probability that it was supplied by factory X?
Exercise 4.7. A factory production line is manufacturing bolts using three machines, A,
B and C. Of the total output, machine A is responsible for 25%, machine B for 35% and
machine C for the rest. It is known from previous experience with the machines that 5%
of the output from machine A is defective, 4% from machine B and 2% from machine C. A
bolt is chosen at random from the production line and found to be defective. What is the
probability that it came from Machine A?
Exercise 4.8. A multiple choice exam has 4 choices for each question. The student has
studied enough so that the probability they will know the answer to a question is 0.5, the
probability that the student will be able to eliminate one choice is 0.25, otherwise all 4
choices seem equally plausible. If they know the answer they will get the question correct.
If not they have to guess from the 3 or 4 choices. As the teacher you would like the test to
measure what the student knows, and not how well they can guess. If the student answers
a question correctly what is the probability that they actually know the answer?
Exercise 4.9. A blood test indicates the presence of Amyotrophic lateral sclerosis (ALS)
95% of the time when ALS is actually present. The same test indicates the presence of ALS
0.5% of the time when the disease is not actually present. One percent of the population
actually has ALS. Calculate the probability that a person actually has ALS given that the
test indicates the presence of ALS.
Exercise 4.10. A survey conducted in a college found that 40% of the students watch
show A, and 17% of the students who follow show A also watch show B. In addition, 20%
of the students watch show B.
(1) What is the probability that a randomly chosen student follows both shows?
(2) What is the conditional probability that the student follows show A given that
she/he follows show B?
Exercise 4.11. Use Bayes’ formula to solve the following problem. An airport has problems
with birds. If the weather is sunny, the probability that there are birds on the runway is
1/2; if it is cloudy, but dry, the probability is 1/3; and if it is raining, then the probability
is 1/4. The probability of each type of the weather is 1/3. Given that the birds are on the
runway, what is the probability
Exercise 4.12. Suppose you toss a fair coin repeatedly and independently. If it comes up
heads, you win a dollar, and if it comes up tails, you lose a dollar. Suppose you start with
$20. What is the probability you will get to $150 before you go broke? (See Example 4.8 for
a solution).
Exercise∗ 4.1. Suppose we play the gambler's ruin game from Example 4.8 not with a fair coin,
but rather in such a way that you win a dollar with probability p, and you lose a dollar with
probability 1 − p, 0 < p < 1. Find the probability of reaching N dollars before going broke
if we start with n dollars.
Exercise∗ 4.2. Suppose F is an event, and define PF (E) := P (E | F ). Show that the
conditional probability PF is a probability function, that is, it satisfies the axioms of proba-
bility.
Exercise∗ 4.3. Show directly that Proposition 2.1 holds for the conditional probability
PF . In particular, for any events E and F,
P(E^c | F) = 1 − P(E | F).
4.4. Selected solutions
Solution to Exercise 4.1(C): Note that P(A) = 2/36 ≠ P(A | C), so A and C are not independent.
Similarly, P(B) = 6/36 ≠ P(B | C), so B and C are not independent.
Solution to Exercise 4.2: denote by E the event that there is at least one 6 and by
F the event that the sum is at least 9. We want to find P(F | E). Begin by noting that
there are 36 possible rolls of these two dice and all of them are equally likely. We can see
that 11 different rolls of these two dice will result in at least one 6, so P(E) = 11/36. There
are 7 different rolls that will result in at least one 6 and a sum of at least 9, namely
{(6, 3), (6, 4), (6, 5), (6, 6), (3, 6), (4, 6), (5, 6)}, so P(E ∩ F) = 7/36. This tells us that
P(F | E) = P(E ∩ F)/P(E) = (7/36)/(11/36) = 7/11.
Solution to Exercise 4.3: denote by Bi the event that the ith box is chosen. Since both
are equally likely, P(B1) = P(B2) = 1/2. In addition, we know that P(G | B1) = 1/2 and
P(G | B2) = 2/5. Applying Bayes' rule yields
P(B1 | G) = P(G | B1)P(B1) / [P(G | B1)P(B1) + P(G | B2)P(B2)] = (1/4) / (1/4 + 1/5) = 5/9.
Solution to Exercise 4.4(A): Suppose we look at students who have gotten the flu shot.
Denote by E the event that a student is exposed to the flu, and by F the event that a student
gets the flu. We know that P(E) = 0.6 and P(F | E) = 0.15. This means that P(E ∩ F ) =
(0.6) (0.15) = 0.09, and it is clear that P(E c ∩ F ) = 0. Since P(F ) = P(E ∩ F ) + P(E c ∩ F ),
we see that P(F ) = 0.09.
Solution to Exercise 4.4(B): Suppose we look at students who have not gotten the flu
shot. Let E be the event that a student is exposed to the flu, and let F be the event
that a student gets the flu. We know that P(E) = 0.6 and P(F | E) = 0.8. This means
that P(E ∩ F ) = (0.6) (0.8) = 0.48, and it is clear that P(E c ∩ F ) = 0. Since P(F ) =
P(E ∩ F ) + P(E c ∩ F ), we see that P(F ) = 0.48.
Solution to Exercise 4.5: denote by M the event an American is a man, by C the event
an American is color blind. Then
P(M | C) = P(C | M)P(M) / [P(C | M)P(M) + P(C | M^c)P(M^c)]
         = (0.05)(0.49) / [(0.05)(0.49) + (0.0025)(0.51)] ≈ 0.9505.
Solution to Exercise 4.6(A) and (B): let H be the event a bulb works over 5000 hours, X be the
event that a bulb comes from factory X, and Y be the event a bulb comes from factory Y .
Then by the law of total probability
P(H) = P(H | X)P(X) + P(H | Y)P(Y) = (0.99)(0.6) + (0.95)(0.4) = 0.974.
By Bayes' rule,
P(Y | H) = P(H | Y)P(Y) / P(H) = (0.95)(0.4) / 0.974 ≈ 0.39.
Solution to Exercise 4.6(C): We again use the result from part (a):
P(X | H^c) = P(H^c | X)P(X) / P(H^c) = P(H^c | X)P(X) / (1 − P(H))
           = (1 − 0.99)(0.6) / (1 − 0.974) = (0.01)(0.6) / 0.026 ≈ 0.23.
Solution to Exercise 4.7: denote by D the event that a bolt is defective, by A the event that
a bolt is from machine A, by B the event that a bolt is from machine B, and by C the event
that a bolt is from machine C. Then by Bayes' theorem
P(A | D) = P(D | A)P(A) / [P(D | A)P(A) + P(D | B)P(B) + P(D | C)P(C)]
         = (0.05)(0.25) / [(0.05)(0.25) + (0.04)(0.35) + (0.02)(0.4)] ≈ 0.362.
Solution to Exercise 4.8: Let C be the event a student gives the correct answer, K be
the event a student knows the correct answer, E be the event a student can eliminate one
incorrect answer, and G be the event a student has to guess the answer. Using Bayes'
theorem we have
P(K | C) = P(C | K)P(K) / P(C)
         = P(C | K)P(K) / [P(C | K)P(K) + P(C | E)P(E) + P(C | G)P(G)]
         = (1 · 1/2) / (1 · 1/2 + (1/3) · (1/4) + (1/4) · (1/4)) = 24/31 ≈ 0.774,
that is, approximately 77.4% of the students who answer correctly actually know the answer.
Solution to Exercise 4.9: Let + denote the event that a test result is positive, and D
the event that the disease is present. Then
P(D | +) = P(+ | D)P(D) / [P(+ | D)P(D) + P(+ | D^c)P(D^c)]
         = (0.95)(0.01) / [(0.95)(0.01) + (0.005)(0.99)] ≈ 0.657.
Solution to Exercise∗ 4.2: it is clear that PF(E) = P(E | F) is between 0 and 1 since the
right-hand side of the identity defining PF is. To see the second axiom, observe that
PF(S) = P(S | F) = P(S ∩ F)/P(F) = P(F)/P(F) = 1.
Now take {Ei}_{i=1}^{∞}, Ei ∈ F, to be pairwise disjoint; then
PF(∪_{i=1}^{∞} Ei) = P(∪_{i=1}^{∞} Ei | F) = P((∪_{i=1}^{∞} Ei) ∩ F)/P(F)
                 = P(∪_{i=1}^{∞} (Ei ∩ F))/P(F) = Σ_{i=1}^{∞} P(Ei ∩ F)/P(F) = Σ_{i=1}^{∞} PF(Ei).
In this we used the distributive law for sets (E ∪ F) ∩ G = (E ∩ G) ∪ (F ∩ G) and the fact
that {Ei ∩ F}_{i=1}^{∞} are pairwise disjoint as well.
CHAPTER 5
Discrete random variables
Example 5.1. If one rolls a die, let X denote the outcome, i.e. taking values 1, 2, 3, 4, 5, 6.
Example 5.2. If one rolls a die, let Y be 1 if an odd number is showing, and 0 if an even
number is showing.
Example 5.3. If one tosses 10 coins, let X be the number of heads showing.
For a discrete random variable X, we define the probability mass function (PMF) or
the density of X by
pX (x) := P(X = x),
where P(X = x) is a standard abbreviation for
P(X = x) = P(X^{−1}(x)).
Let X be the number showing if we roll a die. The expected number to show up on a roll of
a die should be 1 · P(X = 1) + 2 · P(X = 2) + · · · + 6 · P(X = 6) = 3.5. More generally, we
define
For a discrete random variable X we define the expected value or expectation or mean
of X as
EX := Σ_{x : pX(x) > 0} x pX(x),
provided this sum converges absolutely. In this case we say that the expectation of X
is well-defined.
We need absolute convergence of the sum so that the expectation does not depend on
the order in which we take the sum to define it. We know from calculus that we need
to be careful about the sums of conditionally convergent series, though in most of the
examples we deal with this will not be a problem. Note that pX (x) is nonnegative for
all x, but x itself can be negative or positive, so in general the terms in the sum might
have different signs.
Example 5.5. If we toss a coin and X is 1 if we have heads and 0 if we have tails, what
is the expectation of X?
We start with the mass function for X:
pX(x) = 1/2 if x = 1, pX(x) = 1/2 if x = 0, and pX(x) = 0 for all other values of x.
Then EX = 1 · (1/2) + 0 · (1/2) = 1/2.
This turns out to sum to 1. To see this, recall the formula for a geometric series
1 + x + x² + x³ + · · · = 1/(1 − x).
If we differentiate this, we get
1 + 2x + 3x² + · · · = 1/(1 − x)².
We have
EX = 1 · (1/4) + 2 · (1/8) + 3 · (1/16) + · · ·
   = (1/4)[1 + 2 · (1/2) + 3 · (1/4) + · · ·]
   = (1/4) · 1/(1 − 1/2)² = 1.
Example 5.8. Consider a discrete random variable taking only positive integers as values
with P(X = n) = 1/(n(n + 1)). What is the expectation EX?
First observe that this is indeed a probability since we can use telescoping partial sums to
show that
Σ_{n=1}^{∞} 1/(n(n + 1)) = 1.
Then
EX = Σ_{n=1}^{∞} n · P(X = n) = Σ_{n=1}^{∞} n · 1/(n(n + 1)) = Σ_{n=1}^{∞} 1/(n + 1) = +∞,
so the expectation of X is infinite.
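To see the divergence concretely, the following Python sketch (added as an illustration) computes partial sums of Σ n · P(X = n) = Σ 1/(n + 1); they keep growing, roughly like log N.

import math

def partial_expectation(N):
    # Partial sum of E X: sum over n of n * P(X = n) = sum of 1/(n + 1).
    return sum(1 / (n + 1) for n in range(1, N + 1))

for N in (10, 1_000, 100_000, 1_000_000):
    print(N, round(partial_expectation(N), 3), round(math.log(N), 3))
# The partial sums grow without bound, so E X = +infinity.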
If we list all possible values of a discrete random variable X as {xi}_{i∈N}, then we can write
EX = Σ_{x : pX(x) > 0} x pX(x) = Σ_{i=1}^{∞} xi pX(xi).
Proof. For each i ∈ N we denote by Si the event {ω ∈ S : X(ω) = xi}. Then {Si}_{i∈N}
is a partition of the space S into disjoint sets. Note that since S is finite, each set Si is
finite too; moreover, only a finite number of the sets Si are non-empty. Then
EX = Σ_{i=1}^{∞} xi p(xi) = Σ_{i=1}^{∞} xi P(X = xi) = Σ_{i=1}^{∞} xi ( Σ_{ω∈Si} P({ω}) )
   = Σ_{i=1}^{∞} Σ_{ω∈Si} xi P({ω}) = Σ_{i=1}^{∞} Σ_{ω∈Si} X(ω)P({ω})
   = Σ_{ω∈S} X(ω)P({ω}).
If X and Y are discrete random variables defined on the same sample space S and
a ∈ R, then
(i) E [X + Y ] = EX + EY ,
(ii) E [aX] = aEX,
as long as all expectations are well-defined.
Proof of (i). Let Z = X + Y and denote by {zk} the possible values of Z. Then
EZ = Σ_{k=1}^{∞} zk P(Z = zk) = Σ_{k=1}^{∞} zk ( Σ_{i=1}^{∞} P(Z = zk, X = xi) )
   = Σ_{k=1}^{∞} ( Σ_{i=1}^{∞} zk P(X = xi, Y = zk − xi) )
   = Σ_{k=1}^{∞} Σ_{i=1}^{∞} Σ_{j=1}^{∞} zk P(X = xi, Y = zk − xi, Y = yj).
For a ∈ R we have
E[aX] = Σ_{ω∈S} (aX(ω)) P({ω}) = a Σ_{ω∈S} X(ω) P({ω}) = aEX.
Using induction on the number of random variables, linearity holds for any finite collection of
random variables X1, X2, . . . , Xn.
Corollary
If X1 , X2 , . . . , Xn are random variables, then
E (X1 + X2 + · · · + Xn ) = EX1 + EX2 + · · · + EXn .
Example 5.9. Suppose we roll a die and let X be the value that is showing. We want to
find the expectation EX 2 (second moment).
Let Y = X², so that P(Y = 1) = 1/6, P(Y = 4) = 1/6, etc., and
EX² = EY = (1)(1/6) + (4)(1/6) + · · · + (36)(1/6).
We can also write this as
EX² = (1²)(1/6) + (2²)(1/6) + · · · + (6²)(1/6),
which suggests that a formula for EX² is Σ_x x² P(X = x). This turns out to be correct.
The only possibility where things could go wrong is if more than one value of X leads to
the same value of X². For example, suppose P(X = −2) = 1/8, P(X = −1) = 1/4, P(X = 1) = 3/8,
and P(X = 2) = 1/4. Then if Y = X², P(Y = 1) = 5/8 and P(Y = 4) = 3/8, so
EX² = (1)(5/8) + (4)(3/8) = (−1)²(1/4) + (1)²(3/8) + (−2)²(1/8) + (2)²(1/4).
But even in this case EX² = Σ_x x² P(X = x).
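The following short Python check (added as an illustration) computes EX² both ways for the probability mass function above: via the mass function of Y = X² and via Σ_x x² pX(x).

from collections import defaultdict

# The PMF from the example: values of X and their probabilities.
pmf_x = {-2: 1/8, -1: 1/4, 1: 3/8, 2: 1/4}

# Method 1: build the PMF of Y = X**2, then compute E Y.
pmf_y = defaultdict(float)
for x, p in pmf_x.items():
    pmf_y[x**2] += p
e_y = sum(y * p for y, p in pmf_y.items())

# Method 2: E[X**2] = sum of x**2 * p_X(x) over the values of X.
e_x2 = sum(x**2 * p for x, p in pmf_x.items())

print(dict(pmf_y))        # {4: 0.375, 1: 0.625}
print(e_y, e_x2)          # both equal 2.125 = 17/8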
Theorem 5.2
For a discrete random variable X taking values {xi}_{i=1}^{∞} and a real-valued function g
defined on this set, we have
Eg(X) = Σ_{i=1}^{∞} g(xi) P(X = xi) = Σ_{i=1}^{∞} g(xi) p(xi).
In particular, if g is the constant function g(x) = c, then
Eg(X) = Σ_{i=1}^{∞} c p(xi) = c Σ_{i=1}^{∞} p(xi) = c · 1 = c.
Definition (Moments)
For a positive integer m, the expectation EX^m is called the m-th moment of X. The variance
of X is defined as Var X := E[(X − EX)²], provided the relevant expectations are well-defined.
The variance measures how much spread there is about the expected value.
Example 5.11. We toss a fair coin and let X = 1 if we get heads, X = −1 if we get tails.
Then EX = 0, so X − EX = X, and then Var X = EX² = (1)²(1/2) + (−1)²(1/2) = 1.
Example 5.12. We roll a die and let X be the value that shows. We have previously
calculated EX = 7/2. So X − EX takes the values
−5/2, −3/2, −1/2, 1/2, 3/2, 5/2,
each with probability 1/6, and
Var X = (−5/2)²(1/6) + (−3/2)²(1/6) + (−1/2)²(1/6) + (1/2)²(1/6) + (3/2)²(1/6) + (5/2)²(1/6) = 35/12.
Using the fact that the expectation of a constant is the constant we get an alternate expres-
sion for the variance.
Proposition 5.2 (Variance)
Suppose X is a random variable with finite first and second moments. Then
Var X = EX 2 − (EX)2 .
5.2.1. Discrete random variables. Recall that we defined a discrete random variable
in Definition 5.1 as the one taking countably many values. A random variable is a function
X : S −→ R, and we can think of it as a numerical value that is random. When we perform
an experiment, many times we are interested in some quantity (a function) related to the
outcome, instead of the outcome itself. That means we want to attach a numerical value to
each outcome. Below are more examples of such variables.
Example 5.14. Let X be the amount of liability (damages) a driver causes in a year. In
this case, X can be any dollar amount. Thus X can attain any value in [0, ∞).
Example 5.15. Toss a coin 3 times. Let X be the number of heads that appear, so that
X can take the values 0, 1, 2, 3. What are the associated probabilities to each value?
P(X = 0) = P({(T, T, T)}) = 1/2³ = 1/8,
P(X = 1) = P({(T, T, H), (T, H, T), (H, T, T)}) = 3/8,
P(X = 2) = P({(T, H, H), (H, H, T), (H, T, H)}) = 3/8,
P(X = 3) = P({(H, H, H)}) = 1/8.
Example 5.16. Toss a coin n times. Let X be the number of heads that occur. This
random variable can take the values 0, 1, 2, . . . , n. From the binomial formula we see that
P(X = k) = \binom{n}{k} / 2^n.
Example 5.17. Suppose we toss a fair coin, and we let X be 1 if we have H and X be 0
if we have T . The probability mass function of this random variable is
pX(0) = 1/2, pX(1) = 1/2, and pX(x) = 0 otherwise.
Often the probability mass function (PMF) will already be given and we can then use it to
compute probabilities.
Example 5.18. The PMF of a random variable X taking values in N ∪ {0} is given by
pX(i) = e^{−λ} λ^i / i!, i = 0, 1, 2, . . . ,
where λ is a positive real number.
(a) Find P (X = 0). By definition of the PMF we have
P(X = 0) = pX(0) = e^{−λ} λ^0 / 0! = e^{−λ}.
(b) Find P (X > 2). Note that
P(X > 2) = 1 − P(X ≤ 2)
         = 1 − P(X = 0) − P(X = 1) − P(X = 2)
         = 1 − pX(0) − pX(1) − pX(2)
         = 1 − e^{−λ} − λe^{−λ} − λ²e^{−λ}/2.
5.2.2. Expectation. We defined the expectation in Definition 5.2 in the case when X
is a discrete random variable X taking values {xi }i∈N . Then for a random variable X with
the PMF pX (x) the expectation is given by
E[X] = Σ_{x : pX(x) > 0} x pX(x) = Σ_{i=1}^{∞} xi pX(xi).
Example 5.19. Suppose again that we have a coin, and let X(H) = 0 and X (T ) = 1.
What is EX if the coin is not necessarily fair?
EX = 0 · pX (0) + 1 · pX (1) = P(T ).
Example 5.20. Let X be the outcome when we roll a fair die. What is EX?
EX = 1 · (1/6) + 2 · (1/6) + · · · + 6 · (1/6) = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 7/2 = 3.5.
Note that in the last example X can never be 3.5. This means that the expectation may not
be a value attained by X. It serves the purpose of giving an average value for X.
Example 5.21. Let X be the number of insurance claims a person makes in a year. Assume
that X can take the values 0, 1, 2, 3, . . . with P(X = 0) = 2/3, P(X = 1) = 2/9, . . . , P(X = n) = 2/3^{n+1}.
Find the expected number of claims this person makes in a year.
Note that X has an infinite but countable number of values, hence it is a discrete random
variable. We have that pX(i) = 2/3^{i+1}. We compute using the definition of expectation,
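The actual computation is not reproduced here. As a quick numerical check (added here, assuming the PMF above), the partial sums of Σ n · 2/3^{n+1} settle at 1/2, so the expected number of claims is 1/2.

# Numerical check (illustration): E X = sum over n of n * 2 / 3**(n+1).
# The tail is tiny, so a couple of hundred terms are plenty.
expectation = sum(n * 2 / 3 ** (n + 1) for n in range(1, 200))
print(expectation)   # 0.4999999..., i.e. E X = 1/2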
Example 5.22. Let S = {1, 2, 3, 4, 5, 6} and assume that X(1) = X(2) = 1, X(3) =
X(4) = 3, and X(5) = X(6) = 5.
(1) Using the initial definition, the random variable X takes the values 1, 3, 5 and pX(1) =
pX(3) = pX(5) = 1/3. Then
EX = 1 · (1/3) + 3 · (1/3) + 5 · (1/3) = 9/3 = 3.
(2) Using the equivalent definition, we list all of S = {1, 2, 3, 4, 5, 6} and then
EX = X(1)P({1}) + · · · + X(6)P({6}) = 1 · (1/6) + 1 · (1/6) + 3 · (1/6) + 3 · (1/6) + 5 · (1/6) + 5 · (1/6) = 3.
5.2.3. The cumulative distribution function (CDF). We implicitly used this char-
acterization of a random variable, and now we define it.
Definition 5.3 (Cumulative distribution function)
F(x₀) := P(X ≤ x₀) = Σ_{x ≤ x₀} pX(x).
[Figure: graph of a cumulative distribution function, plotted for x between −1 and 4.]
Note that EX² = 0.5, while (EX)² = 0.01 since EX = 0.3 − 0.2 = 0.1. Thus in general
EX² ≠ (EX)².
In general, there is a formula for E[g(X)], where g is a function; it uses the fact that g(X)
takes the value g(x) whenever X = x. We recall Theorem 5.2: if X is a discrete random variable
that takes the values xi, i ≥ 1, with probabilities pX(xi), respectively, then for any real-valued
function g we have that
E[g(X)] = Σ_{i=1}^{∞} g(xi) pX(xi).
Note that
EX² = Σ_{i=1}^{∞} xi² pX(xi)
will be useful.
Example 5.26. Let us revisit the previous example. Let X denote a random variable such
that
P (X = −1) = 0.2
P (X = 0) = 0.5
P (X = 1) = 0.3.
Then
EY = EX² = Σ_{i=1}^{∞} xi² pX(xi) = (−1)²(0.2) + 0²(0.5) + 1²(0.3) = 0.5.
5.2.5. Variance. The variance of a random variable is a measure of how spread out the
values of X are. The expectation of a random variable is a quantity that helps us differentiate
between random variables, but it does not tell us how spread out its values are. For example,
consider
X = 0 with probability 1,
Y = −1 or 1, each with probability 1/2,
Z = −100 or 100, each with probability 1/2.
What are the expected values? They are 0, 0 and 0. But there is much greater spread in Z
than in Y, and in Y than in X. Thus expectation is not enough to detect spread, or variation.
Example 5.27. Find Var(X) if X represents the outcome when a fair die is rolled.
Recall the formula Var X = EX² − (EX)² from Proposition 5.2.
Previously we found that EX = 7/2. Thus we only need to find the second moment
EX² = 1² · (1/6) + · · · + 6² · (1/6) = 91/6.
Using our formula we have that
Var(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 35/12.
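A small Python check of this computation (added for illustration):

from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                       # fair die

ex = sum(x * p for x in faces)           # E X = 7/2
ex2 = sum(x**2 * p for x in faces)       # E X^2 = 91/6
var = ex2 - ex**2                        # Var X = E X^2 - (E X)^2

print(ex, ex2, var)                      # 7/2 91/6 35/12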
5.3. Exercises
Exercise 5.1. Three balls are randomly chosen with replacement from an urn containing
5 blue, 4 red, and 2 yellow balls. Let X denote the number of red balls chosen.
Exercise 5.2. Two cards are chosen from a standard deck of 52 cards. Suppose that
you win $2 for each heart selected, and lose $1 for each spade selected. Other suits (clubs
or diamonds) bring neither win nor loss. Let X denote your winnings. Determine the
probability mass function of X.
Exercise 5.3. A financial regulator from the FED will evaluate two banks this week. For
each evaluation, the regulator will choose with equal probability between two different stress
tests. Failing under test one costs a bank a 10K fee, whereas failing test two costs 5K. The
probability that the first bank fails any test is 0.4. Independently, the second bank will fail
any test with 0.5 probability. Let X denote the total amount of fees the regulator can obtain
after having evaluated both banks. Determine the cumulative distribution function of X.
Exercise 5.4. Five buses carry students from Hartford to campus. Each bus carries,
respectively, 50, 55, 60, 65, and 70 students. One of these students and one bus driver are
picked at random.
(a) What is the expected number of students sitting in the same bus that carries the ran-
domly selected student?
(b) Let Y be the number of students in the same bus as the randomly selected driver. Is
E[Y ] larger than the expectation obtained in the previous question?
Exercise 5.5. Two balls are chosen randomly from an urn containing 8 white balls, 4
black, and 2 orange balls. Suppose that we win $2 for each black ball selected and we lose
$1 for each white ball selected. Let X denote our winnings.
Exercise 5.6. A card is drawn at random from a standard deck of playing cards. If it is
a heart, you win $1. If it is a diamond, you have to pay $2. If it is any other card, you win
$3. What is the expected value of your winnings?
Exercise 5.7. The game of roulette consists of a small ball and a wheel with 38 numbered
pockets around the edge that includes the numbers 1 − 36, 0 and 00. As the wheel is spun,
the ball bounces around randomly until it settles down in one of the pockets.
(A) Suppose you bet $1 on a single number and random variable X represents the (mone-
tary) outcome (the money you win or lose). If the bet wins, the payoff is $35 and you
get your money back. If you lose the bet then you lose your $1. What is the expected
profit on a 1 dollar bet?
(B) Suppose you bet $1 on the numbers 1 − 18 and random variable X represents the
(monetary) outcome (the money you win or lose). If the bet wins, the payoff is $1
and you get your money back. If you lose the bet then you lose your $1. What is the
expected profit on a 1 dollar bet?
Exercise 5.8. An insurance company finds that Mark has an 8% chance of getting into a
car accident in the next year. If Mark has any kind of accident then the company guarantees
to pay him $10, 000. The company has decided to charge Mark a $200 premium for this one
year insurance policy.
(A) Let X be the amount of profit or loss from this insurance policy in the next year for the
insurance company. Find EX, the expected return for the insurance company. Should
the insurance company charge more or less on its premium?
(B) What amount should the insurance company charge Mark in order to guarantee an
expected return of $100?
Exercise 5.9. A random variable X has the following probability mass function: pX(0) = 1/3,
pX(1) = 1/6, pX(2) = 1/4, pX(3) = 1/4. Find its expected value, variance, and standard
deviation, and plot its CDF.
Exercise 5.10. Suppose X is a random variable such that E [X] = 50 and Var(X) = 12.
Calculate the following quantities.
(A) E[X²],
(B) E[3X + 2],
(C) E[(X + 2)²],
(D) Var[−X],
(E) SD(2X).
Exercise 5.11. Does there exist a random variable X such that E [X] = 4 and E [X 2 ] = 10?
Why or why not? (Hint: look at its variance)
Exercise 5.12. A box contains 25 peppers of which 5 are red and 20 green. Four peppers
are randomly picked from the box. What is the expected number of red peppers in this
sample of four?
Solution to Exercise 5.2: The random variable X can take the values −2, −1, 0, 1, 2, 4.
Moreover,
P(X = −2) = P(2♠) = \binom{13}{2} / \binom{52}{2},
P(X = −1) = P(1♠ and 1(♦ or ♣)) = 13 · 26 / \binom{52}{2},
P(X = 0) = P(2(♦ or ♣)) = \binom{26}{2} / \binom{52}{2},
P(X = 1) = P(1♥ and 1♠) = 13 · 13 / \binom{52}{2},
P(X = 2) = P(1♥ and 1(♦ or ♣)) = P(X = −1),
P(X = 4) = P(2♥) = P(X = −2).
Thus the probability mass function is given by pX(x) = P(X = x) for x = −2, −1, 0, 1, 2, 4
and pX(x) = 0 otherwise.
Solution to Exercise 5.3: The random variable X can take the values 0, 5, 10, 15 and 20
depending on which test was applied to each bank, and if the bank fails the evaluation or
not. Denote by Bi the event that the ith bank fails and by Ti the event that test i is applied.
Then
[Figure: graph of the CDF of X, plotted for values between −5 and 25.]
Solution to Exercise 5.4: Let X denote the number of students in the bus that carries
the randomly selected student.
(a) In total there are 300 students, hence P(X = 50) = 50/300, P(X = 55) = 55/300, P(X = 60) =
60/300, P(X = 65) = 65/300 and P(X = 70) = 70/300. The expected value of X is thus
E[X] = 50 · (50/300) + 55 · (55/300) + 60 · (60/300) + 65 · (65/300) + 70 · (70/300) ≈ 60.8333.
(b) Since each bus driver is equally likely to be picked,
E[Y] = (50 + 55 + 60 + 65 + 70)/5 = 60,
which is slightly less than the previous one.
Solution to Exercise 5.5: The random variable X can take the values 4, 2, 1, 0, −1, −2, and
P(X = 4) = P({BB}) = \binom{4}{2} / \binom{14}{2} = 6/91,
P(X = 0) = P({OO}) = \binom{2}{2} / \binom{14}{2} = 1/91,
P(X = 2) = P({BO}) = \binom{4}{1}\binom{2}{1} / \binom{14}{2} = 8/91,
P(X = −1) = P({WO}) = \binom{8}{1}\binom{2}{1} / \binom{14}{2} = 16/91,
P(X = 1) = P({BW}) = \binom{4}{1}\binom{8}{1} / \binom{14}{2} = 32/91,
P(X = −2) = P({WW}) = \binom{8}{2} / \binom{14}{2} = 28/91.
Solution to Exercise 5.6:
EX = 1 · (1/4) + (−2) · (1/4) + 3 · (1/2) = 5/4.
Solution to Exercise 5.7(A): The expected profit is EX = 35 · (1/38) − 1 · (37/38) = −$0.0526.
Solution to Exercise 5.7(B): If you win, then your profit will be $1. If you lose, then you
lose your $1 bet. The expected profit is EX = 1 · (18/38) − 1 · (20/38) = −$0.0526.
Solution to Exercise 5.8(A): If Mark has no accident then the company makes a profit
of 200 dollars. If Mark has an accident they have to pay him 10,000 dollars, but regardless
they received 200 dollars from him as a yearly premium. We have
EX = (0.92)(200) + (0.08)(200 − 10,000) = 184 − 784 = −600.
On average the company will lose $600. Thus the company should charge more.
Solution to Exercise 5.8(B): Let P be the premium. Then in order to guarantee an
expected return of 100 we need (0.92)P + (0.08)(P − 10,000) = P − 800 = 100, hence P = 900.
Solution to Exercise 5.9: The expected value is
EX = 0 · (1/3) + 1 · (1/6) + 2 · (1/4) + 3 · (1/4) = 34/24 = 17/12.
[Figure: graph of the CDF of X for Exercise 5.9, plotted for x between −1 and 4.]
Solution to Exercise 5.11: Using the hint, let's compute the variance of this random
variable, which would be Var(X) = E[X²] − (EX)² = 10 − 4² = −6. But we know a random
variable cannot have a negative variance. Thus no such random variable exists.
CHAPTER 6
Some discrete distributions

Bernoulli distribution
A random variable X such that P(X = 1) = p and P(X = 0) = 1 − p is said to
be a Bernoulli random variable with parameter p. Note EX = p and EX² = p, so
Var X = p − p² = p(1 − p).
If X is a binomial random variable with parameters n and p, then
EX = Σ_{i=0}^{n} i \binom{n}{i} p^i (1 − p)^{n−i} = Σ_{i=1}^{n} i \binom{n}{i} p^i (1 − p)^{n−i}.
Then
EX = Σ_{i=1}^{n} i · n!/(i!(n − i)!) · p^i (1 − p)^{n−i}
   = np Σ_{i=1}^{n} (n − 1)!/((i − 1)!((n − 1) − (i − 1))!) · p^{i−1} (1 − p)^{(n−1)−(i−1)}
   = np Σ_{i=0}^{n−1} (n − 1)!/(i!((n − 1) − i)!) · p^i (1 − p)^{(n−1)−i}
   = np Σ_{i=0}^{n−1} \binom{n−1}{i} p^i (1 − p)^{(n−1)−i} = np,
since the binomial probabilities in the last sum add up to 1.
Now
EYiYj = 1 · P(YiYj = 1) + 0 · P(YiYj = 0) = P(Yi = 1, Yj = 1) = P(Yi = 1)P(Yj = 1) = p²,
using independence of the random variables {Yi}_{i=1}^{n}. Expanding (Y1 + · · · + Yn)² yields n² terms,
of which n are of the form Yk². So we have n² − n terms of the form YiYj with i ≠ j. Hence
Var X = EX² − (EX)² = np + (n² − n)p² − (np)² = np(1 − p).
Later we will see that the variance of the sum of independent random variables is the sum
of the variances, so we could quickly get Var X = np(1 − p). Alternatively, one can compute
E(X 2 ) − EX = E(X(X − 1)) using binomial coefficients and derive the variance of X from
that.
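As an added illustration, the following Python snippet checks EX = np and Var X = np(1 − p) by summing directly over the binomial probability mass function for one choice of n and p.

from math import comb

n, p = 10, 0.3
pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]

mean = sum(i * pmf[i] for i in range(n + 1))
second_moment = sum(i**2 * pmf[i] for i in range(n + 1))
variance = second_moment - mean**2

print(round(mean, 10), n * p)                  # 3.0  3.0
print(round(variance, 10), n * p * (1 - p))    # 2.1  2.1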
Poisson distribution
A random variable X has the Poisson distribution with parameter λ if
P(X = i) = e^{−λ} λ^i / i!, i = 0, 1, 2, . . . .
Proposition 6.2
Suppose X is a Poisson random variable with parameter λ, then
EX = λ,
Var X = λ.
Example 6.1. Suppose on average there are 5 homicides per month in a given city. What
is the probability there will be at most 1 in a certain month?
If X denotes the number of homicides, then we are given that EX = 5. Since the expectation
for a Poisson is λ, then λ = 5. Therefore P(X = 0) + P(X = 1) = e−5 + 5e−5 .
Example 6.2. Suppose on average there is one large earthquake per year in California.
What is the probability that next year there will be exactly 2 large earthquakes? Here λ = EX = 1,
so P(X = 2) = e^{−1}/2.
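The Poisson probabilities in these two examples are easy to evaluate numerically; the following Python lines (added as an illustration) do so.

from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

# Example 6.1: at most one homicide in a month, lambda = 5.
print(poisson_pmf(0, 5) + poisson_pmf(1, 5))   # about 0.0404

# Example 6.2: exactly two large earthquakes, lambda = 1.
print(poisson_pmf(2, 1))                       # exp(-1)/2, about 0.1839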
This proposition shows that the Poisson distribution models binomials when the probability
of a success is small. The number of misprints on a page, the number of automobile accidents,
the number of people entering a store, etc. can all be modeled by a Poisson distribution.
Proof. For simplicity, let us suppose that λ = npn for n > λ. In the general case we
can use λn = npn → λ as n → ∞. We write
P(Xn = i) = n!/(i!(n − i)!) · pn^i (1 − pn)^{n−i}
          = [n(n − 1) · · · (n − i + 1)/i!] · (λ/n)^i (1 − λ/n)^{n−i}
          = [n(n − 1) · · · (n − i + 1)/n^i] · (λ^i/i!) · (1 − λ/n)^n / (1 − λ/n)^i.
As n → ∞,
n(n − 1) · · · (n − i + 1)/n^i → 1,   (1 − λ/n)^i → 1,   and   (1 − λ/n)^n → e^{−λ},
so that P(Xn = i) → e^{−λ} λ^i / i!.
Geometric distribution
A random variable X has the geometric distribution with parameter p, 0 < p < 1, if
P(X = i) = (1 − p)^{i−1} p for i = 1, 2, . . . . This is indeed a probability mass function, since
Σ_{i=1}^{∞} P(X = i) = Σ_{i=1}^{∞} (1 − p)^{i−1} p = p · 1/(1 − (1 − p)) = 1.
In Bernoulli trials, if we let X be the first time we have a success, then X will be a geometric
random variable. For example, if we toss a coin over and over and X is the first time we
get a heads, then X will have a geometric distribution. To see this, to have the first success
occur on the k th trial, we have to have k − 1 failures in the first k − 1 trials and then a
success. The probability of that is (1 − p)k−1 p.
Proposition 6.4
If X is a geometric random variable with parameter p, 0 < p < 1, then
EX = 1/p,
Var X = (1 − p)/p²,
FX(k) = P(X ≤ k) = 1 − (1 − p)^k.
1/(1 − r)² = Σ_{n=0}^{∞} n r^{n−1},
which we can show by differentiating the formula for the geometric series 1/(1 − r) = Σ_{n=0}^{∞} r^n.
Then
EX = Σ_{i=1}^{∞} i · P(X = i) = Σ_{i=1}^{∞} i (1 − p)^{i−1} p = (1/(1 − (1 − p))²) · p = 1/p.
Then the variance is
Var X = E(X − EX)² = E(X − 1/p)² = Σ_{i=1}^{∞} (i − 1/p)² · P(X = i).
To compute it we use
EX² = Σ_{i=1}^{∞} i² · P(X = i) = Σ_{i=1}^{∞} i² (1 − p)^{i−1} p = ((1 + (1 − p))/(1 − (1 − p))³) · p = (2 − p)/p².
Thus
Var X = EX² − (EX)² = (2 − p)/p² − (1/p)² = (1 − p)/p².
The cumulative distribution function (CDF) can be found by using the geometric series sum
formula:
1 − FX(k) = P(X > k) = Σ_{i=k+1}^{∞} P(X = i) = Σ_{i=k+1}^{∞} (1 − p)^{i−1} p = p · (1 − p)^k/(1 − (1 − p)) = (1 − p)^k.
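A short numerical check of Proposition 6.4 (added here as an illustration), truncating the infinite sums at a large index:

p = 0.3
N = 10_000                       # truncation point; the remaining tail is negligible

pmf = [(1 - p) ** (i - 1) * p for i in range(1, N + 1)]

mean = sum(i * q for i, q in enumerate(pmf, start=1))
second = sum(i * i * q for i, q in enumerate(pmf, start=1))
var = second - mean**2
cdf5 = sum(pmf[:5])              # P(X <= 5)

print(round(mean, 6), 1 / p)                     # 3.333333
print(round(var, 6), (1 - p) / p**2)             # 7.777778
print(round(cdf5, 6), 1 - (1 - p) ** 5)          # 0.83193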
Negative binomial distribution
A negative binomial random variable X represents the number of trials until r successes, so that
P(X = n) = \binom{n−1}{r−1} p^r (1 − p)^{n−r}, n = r, r + 1, . . . .
To get the above formula, note that to have the rth success in the nth trial, we must have exactly
r − 1 successes in the first n − 1 trials and then a success in the nth trial.
Hypergeometric distribution
A random variable X has the hypergeometric distribution with parameters m, n and N if
P(X = i) = \binom{m}{i}\binom{N − m}{n − i} / \binom{N}{n}.
This comes up in sampling without replacement: if there are N balls, of which m are of one
color and the other N − m are of another, and we choose n balls at random without replacement,
then X is the number of balls of the first color, and P(X = i) is the probability of having i balls
of the first color.
Another situation where the hypergeometric distribution comes up is when the probability of a
success changes on each draw, since each draw decreases the population; in other words, when we
consider sampling without replacement from a finite population. Then N is the population
size, m is the number of success states in the population, n is the number of draws, that is,
the quantity drawn in each trial, and i is the number of observed successes.
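As an added illustration, here is a small Python function for the hypergeometric PMF, evaluated with the numbers of Exercise 6.12 below (N = 20 eggs, m = 3 with a double yolk, n = 5 drawn).

from math import comb

def hypergeom_pmf(i, m, n, N):
    """P(X = i): i successes when drawing n items without replacement
    from a population of N items containing m success states."""
    return comb(m, i) * comb(N - m, n - i) / comb(N, n)

# Probability of at least 2 double-yolk eggs among the 5 drawn.
p_at_least_2 = sum(hypergeom_pmf(i, 3, 5, 20) for i in (2, 3))
print(round(p_at_least_2, 4))   # about 0.1404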
Example 6.3. A company prices its hurricane insurance using the following assumptions:
Using the company’s assumptions, find the probability that there are fewer than 3 hurricanes
in a 20-year period.
Denote by X the number of hurricanes in a 20-year period. From the assumptions we see
that X ∼ Binom (20, 0.05), therefore
P(X < 3) = P(X ≤ 2)
         = \binom{20}{0}(0.05)^0(0.95)^{20} + \binom{20}{1}(0.05)^1(0.95)^{19} + \binom{20}{2}(0.05)^2(0.95)^{18}
         ≈ 0.9245.
Example 6.4. Phan has a 0.6 probability of making a free throw. Suppose each free throw
is independent of the other. If he attempts 10 free throws, what is the probability that he
makes at least 2 of them?
If X ∼ Binom (10, 0.6), then
P(X ≥ 2) = 1 − P(X = 0) − P(X = 1)
         = 1 − \binom{10}{0}(0.6)^0(0.4)^{10} − \binom{10}{1}(0.6)^1(0.4)^9
         ≈ 0.998.
6.2.2. The Poisson distribution. Recall that a Poisson distribution models well events
that have a low probability and the number of trials is high. For example, the probability of
a misprint is small and the number of words in a page is usually a relatively large number
compared to the number of misprints.
Example 6.5. Levi receives an average of two texts every 3 minutes. If we assume that
the number of texts is Poisson distributed, what is the probability that he receives five or
more texts in a 9-minute period?
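The worked solution is not reproduced here. As a hedged sketch: since Levi averages 2 texts per 3 minutes, the number of texts in 9 minutes is Poisson with mean λ = 6 (assuming the Poisson model of the example), and P(X ≥ 5) = 1 − Σ_{k=0}^{4} e^{−6} 6^k/k!; the snippet below evaluates this.

from math import exp, factorial

lam = 6  # 2 texts per 3 minutes, so on average 6 texts in 9 minutes
p_less_than_5 = sum(exp(-lam) * lam**k / factorial(k) for k in range(5))
print(round(1 - p_less_than_5, 4))   # P(X >= 5), about 0.7149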
Example 6.6. Let X1, ..., Xk be independent Poisson random variables, each with expec-
tation λ1. What is the distribution of the random variable Y := X1 + ... + Xk?
The distribution of Y is Poisson with expectation λ = kλ1. To show this, we use
Proposition 6.3 and (6.1) to choose n = mk Bernoulli random variables with parameter
pn = kλ1/n = λ1/m = λ/n to approximate the Poisson random variables. If we sum
them all together, the limit as n → ∞ gives us a Poisson distribution with expectation
lim_{n→∞} npn = λ. However, we can re-arrange the same n = mk Bernoulli random variables
in k groups, each group having m Bernoulli random variables. Then the limit gives us the
distribution of X1 + ... + Xk. This argument can be made rigorous, but this is beyond the
scope of this course. Note that we do not show here that we have convergence in distribution.
Example 6.7. Let X1 , . . . , Xk be independent Poisson random variables, each with ex-
pectation λ1 , . . . , λk , respectively. What is the distribution of the random variable Y =
X1 + ... + Xk ?
The distribution of Y is Poisson with expectation λ = λ1 + ... + λk . To show this, we again
use Proposition 6.3 and (6.1) with parameter pn = λ/n. If n is large, we can separate these n
Bernoulli random variables in k groups, each having ni ≈ λi n/λ Bernoulli random variables.
The result follows since lim_{n→∞} ni/n = λi/λ for each i = 1, . . . , k.
This entire set-up, which is quite common, involves what is called independent identically
distributed Bernoulli random variables (i.i.d. Bernoulli r.v.).
Example 6.8. Can we use binomial approximation to find the mean and the variance of
a Poisson random variable?
Yes, and this is really simple. Recall again from Proposition 6.3 and (6.1) that we can
approximate a Poisson random variable Y with parameter λ by binomial random variables
Binom(n, pn), where pn = λ/n. Each such binomial random variable is a sum of n independent
Bernoulli random variables with parameter pn. Therefore
EY = lim_{n→∞} npn = lim_{n→∞} n · (λ/n) = λ,
Var(Y) = lim_{n→∞} npn(1 − pn) = lim_{n→∞} n · (λ/n) · (1 − λ/n) = λ.
6.2.3. Table of distributions. The following table summarizes the discrete distribu-
tions we have seen in this chapter. Here N stands for the set of positive integers, and
N0 = N ∪ {0} is the set of nonnegative integers.
6.3. Exercises
Exercise 6.1. A UConn student claims that she can distinguish Dairy Bar ice cream from
Friendly’s ice cream. As a test, she is given ten samples of ice cream (each sample is either
from the Dairy Bar or Friendly’s) and asked to identify each one. She is right eight times.
What is the probability that she would be right exactly eight times if she guessed randomly
for each sample?
Exercise 6.2. A pharmaceutical company conducted a study on a new drug that is sup-
posed to treat patients suffering from a certain disease. The study concluded that the drug
did not help 25% of those who participated in the study. What is the probability that of 6
randomly selected patients, 4 will recover?
Exercise 6.3. 20% of all students are left-handed. A class of size 20 meets in a room with
18 right-handed desks and 5 left-handed desks. What is the probability that every student
will have a suitable desk?
Exercise 6.4. A ball is drawn from an urn containing 4 blue and 5 red balls. After the
ball is drawn, it is replaced and another ball is drawn. Suppose this process is done 7 times.
(a) What is the probability that exactly 2 red balls were drawn in the 7 draws?
(b) What is the probability that at least 3 blue balls were drawn in the 7 draws?
Exercise 6.5. The expected number of typos on a page of the new Harry Potter book is
0.2. What is the probability that the next page you read contains
(a) 0 typos?
(b) 2 or more typos?
(c) Explain what assumptions you used.
Exercise 6.6. The monthly average number of car crashes in Storrs, CT is 3.5. What is
the probability that there will be
(a) at least 2 accidents in the next month?
(b) at most 1 accident in the next month?
(c) Explain what assumptions you used.
Exercise 6.7. Suppose that, some time in a distant future, the average number of bur-
glaries in New York City in a week is 2.2. Approximate the probability that there will
be
(a) no burglaries in the next week;
(b) at least 2 burglaries in the next week.
Exercise 6.8. The number of accidents per working week in a particular shipyard is Poisson
distributed with mean 0.5. Find the probability that:
(a) In a particular week there will be at least 2 accidents.
Exercise 6.9. Jennifer is baking cookies. She mixes 400 raisins and 600 chocolate chips
into her cookie dough and ends up with 500 cookies.
(a) Find the probability that a randomly picked cookie will have three raisins in it.
(b) Find the probability that a randomly picked cookie will have at least one chocolate chip
in it.
(c) Find the probability that a randomly picked cookie will have no more than two bits in
it (a bit is either a raisin or a chocolate chip).
Exercise 6.10. A roulette wheel has 38 numbers on it: the numbers 0 through 36 and a
00. Suppose that Lauren always bets that the outcome will be a number between 1 and 18
(including 1 and 18).
(a) What is the probability that Lauren will lose her first 6 bets.
(b) What is the probability that Lauren will first win on her sixth bet?
Exercise 6.11. In the US, albinism occurs in about one in 17,000 births. Estimate the
probabilities that there is no albino person, at least one albino, or more than one albino at a
football game with 5,000 attendees. Use the Poisson approximation to the binomial.
Exercise 6.12. An egg carton contains 20 eggs, of which 3 have a double yolk. To make a
pancake, 5 eggs from the carton are picked at random. What is the probability that at least
2 of them have a double yolk?
Exercise 6.13. Around 30,000 couples married this year in CT. Approximate the proba-
bility that in at least one of these couples
(a) both partners have birthday on January 1st.
(b) both partners celebrate birthday in the same month.
Exercise 6.14. A telecommunications company has discovered that users are three times
as likely to make two-minute calls as to make four-minute calls. The length of a typical call
(in minutes) has a Poisson distribution. Find the expected length (in minutes) of a typical
call.
Solution to Exercise 6.1: This should be modeled using a binomial random variable
X, since there is a sequence of trials with the same probability of success in each one. If
she guesses randomly for each sample, the probability that she will be right each time is 1/2.
Therefore,
P(X = 8) = \binom{10}{8} (1/2)^8 (1/2)^2 = 45/2^{10}.
Solution to Exercise 6.2: \binom{6}{4} (0.75)^4 (0.25)^2.
Solution to Exercise 6.3: For each student to have the kind of desk he or she prefers, there
must be no more than 18 right-handed students and no more than 5 left-handed students, so
the number of left-handed students must be between 2 and 5 (inclusive). This means that
we want the probability that there will be 2, 3, 4, or 5 left-handed students. We use the
binomial distribution and get
Σ_{i=2}^{5} \binom{20}{i} (1/5)^i (4/5)^{20−i}.
Solution to Exercise 6.9(A): This calls for a Poisson random variable R. The average
number of raisins per cookie is 0.8, so we take this as our λ. We are asking for P(R = 3),
which is e^{−0.8} (0.8)³/3! ≈ 0.0383.
Solution to Exercise 6.9(B): This calls for a Poisson random variable C. The average
number of chocolate chips per cookie is 1.2, so we take this as our λ. We are asking for
P(C ≥ 1), which is 1 − P(C = 0) = 1 − e^{−1.2} (1.2)⁰/0! ≈ 0.6988.
Solution to Exercise 6.9(C): This calls for a Poisson random variable B. The average
number of bits per cookie is 0.8 + 1.2 = 2, so we take this as our λ. We are asking for
P(B ≤ 2), which is P(B = 0) + P(B = 1) + P(B = 2) = e^{−2} 2⁰/0! + e^{−2} 2¹/1! + e^{−2} 2²/2! ≈ 0.6767.
Solution to Exercise 6.10(A): (1 − 18/38)^6.
Solution to Exercise 6.10(B): (1 − 18/38)^5 · (18/38).
Solution to Exercise 6.11: Let X denote the number of albinos at the game. We have that
X ∼ Binom(5000, p) with p = 1/17000 ≈ 0.000059. The binomial distribution gives us
P(X = 0) = (16999/17000)^{5000} ≈ 0.745,
P(X ≥ 1) = 1 − P(X = 0) = 1 − (16999/17000)^{5000} ≈ 0.255,
P(X > 1) = P(X ≥ 1) − P(X = 1) = 1 − (16999/17000)^{5000} − 5000 · (1/17000) · (16999/17000)^{4999} ≈ 0.035633.
Approximating the distribution of X by a Poisson random variable Y with parameter λ = 5000/17000 = 5/17 gives
P(Y = 0) = exp(−5/17) ≈ 0.745,
P(Y ≥ 1) = 1 − P(Y = 0) = 1 − exp(−5/17) ≈ 0.255,
P(Y > 1) = P(Y ≥ 1) − P(Y = 1) = 1 − exp(−5/17) − (5/17) exp(−5/17) ≈ 0.035638.
Solution to Exercise 6.12: Let X be the random variable that denotes the number of
eggs with double yolk in the set of chosen 5. Then X ∼ Hyp(20, 3, 5) and we have that
P(X ≥ 2) = P(X = 2) + P(X = 3) = \binom{3}{2}\binom{17}{3} / \binom{20}{5} + \binom{3}{3}\binom{17}{2} / \binom{20}{5}.
denotes the number of married couples where this is the case, we can approximate the
CHAPTER 7
Continuous distributions
7.1.1. Definition, PDF, CDF. We start with the definition of a continuous random
variable.
Definition (Continuous random variables)
A random variable X is said to be continuous if there is a nonnegative function f = fX,
called the (probability) density function (PDF) of X, such that
P(a ≤ X ≤ b) = ∫_a^b f(x) dx for all a ≤ b, and ∫_{−∞}^{∞} f(x) dx = 1.
Example 7.1. Suppose we are given that f(x) = c/x³ for x > 1 and 0 otherwise. Since
∫_{−∞}^{∞} f(x) dx = 1 and
∫_{−∞}^{∞} f(x) dx = c ∫_1^∞ (1/x³) dx = c/2,
we have c = 2.
PMF or PDF?
Probability mass function (PMF) and (probability) density function (PDF) are two
names for the same notion in the case of discrete random variables. We say PDF or
simply a density function for a general random variable, and we use PMF only for
discrete random variables.
We can define the CDF for any random variable, not just continuous ones, by setting F(y) :=
P(X ≤ y). Recall that we introduced it in Definition 5.3 for discrete random variables. In
that case it is not particularly useful, although it does serve to unify discrete and continuous
random variables. In the continuous case, the fundamental theorem of calculus tells us,
provided f satisfies some conditions, that
f(y) = F′(y).
By analogy with the discrete case, we define the expectation of a continuous random variable.
For a continuous random variable X with the density function f we define its expec-
tation by
EX = ∫_{−∞}^{∞} x f(x) dx
if this integral is absolutely convergent. In this case we call X integrable.
Proof. First observe that if a continuous random variable X is nonnegative, then its
density satisfies f(x) = 0 for x < 0. In particular, F(y) = 0 for y ≤ 0, though the latter is not needed
for our proof. Thus for such a random variable
EX = ∫_0^∞ x f(x) dx.
Suppose n ∈ N; then we define Xn(ω) to be k/2^n if k/2^n ≤ X(ω) < (k + 1)/2^n, for k ∈ N ∪ {0}.
This means that we are approximating X from below by the largest multiple of 2^{−n} that is
still below the value of X. Each Xn is discrete, and the Xn increase to X for each ω ∈ S.
Consider the sequence {EXn}_{n=1}^{∞}. This sequence is an increasing sequence of positive num-
bers, and therefore it has a limit, possibly infinite. We want to show that it is finite and
equal to EX.
We have
EXn = Σ_{k=1}^{∞} (k/2^n) P(Xn = k/2^n)
    = Σ_{k=1}^{∞} (k/2^n) P(k/2^n ≤ X < (k + 1)/2^n)
    = Σ_{k=1}^{∞} (k/2^n) ∫_{k/2^n}^{(k+1)/2^n} f(x) dx
    = Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} (k/2^n) f(x) dx.
If x ∈ [k/2^n, (k + 1)/2^n), then x differs from k/2^n by at most 1/2^n, and therefore
0 ≤ ∫_{k/2^n}^{(k+1)/2^n} x f(x) dx − ∫_{k/2^n}^{(k+1)/2^n} (k/2^n) f(x) dx
  = ∫_{k/2^n}^{(k+1)/2^n} (x − k/2^n) f(x) dx ≤ (1/2^n) ∫_{k/2^n}^{(k+1)/2^n} f(x) dx.
Note that
Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} x f(x) dx = ∫_0^∞ x f(x) dx
and
Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} (1/2^n) f(x) dx = (1/2^n) Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} f(x) dx = (1/2^n) ∫_0^∞ f(x) dx = 1/2^n.
Therefore
0 ≤ EX − EXn = ∫_0^∞ x f(x) dx − Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} (k/2^n) f(x) dx
  = Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} x f(x) dx − Σ_{k=1}^{∞} ∫_{k/2^n}^{(k+1)/2^n} (k/2^n) f(x) dx
  = Σ_{k=1}^{∞} ( ∫_{k/2^n}^{(k+1)/2^n} x f(x) dx − ∫_{k/2^n}^{(k+1)/2^n} (k/2^n) f(x) dx )
  ≤ Σ_{k=1}^{∞} (1/2^n) ∫_{k/2^n}^{(k+1)/2^n} f(x) dx = 1/2^n → 0 as n → ∞.
We will not prove the following, but it is an interesting exercise: if Xm is any sequence of
discrete random variables that increase up to X, then lim_{m→∞} EXm will have the same value
EX.
This fact is useful to show linearity: if X and Y are positive random variables with finite
expectations, then we can take Xm discrete increasing up to X and Ym discrete increasing
up to Y. Then Xm + Ym is discrete and increases up to X + Y, so we have
E(X + Y) = lim_{m→∞} E(Xm + Ym) = lim_{m→∞} EXm + lim_{m→∞} EYm = EX + EY.
Note that we cannot easily reuse the approximations to X, Y and X + Y from the previous
proof in this argument, since Xm + Ym might not be an approximation of the same kind.
If X is not necessarily positive, we can show a similar result; we will not do the details.
Similarly to the discrete case, we have
Proposition 7.2
Suppose X is a continuous random variable with density fX and g is a real-valued
function, then
Eg(X) = ∫_{−∞}^{∞} g(x) fX(x) dx
as long as the expectation of the random variable g (X) makes sense.
As in the discrete case, this allows us to define moments, and in particular the variance
Var X := E[(X − EX)²].
As an example of these calculations, let us look at the uniform distribution.
Uniform distribution
We say that a random variable X has a uniform distribution on [a, b] if fX(x) = 1/(b − a)
for a ≤ x ≤ b and fX(x) = 0 otherwise.
EX = ∫_{−∞}^{∞} x fX(x) dx = ∫_a^b x · 1/(b − a) dx
   = (1/(b − a)) ∫_a^b x dx
   = (1/(b − a)) (b²/2 − a²/2) = (a + b)/2.
This is what one would expect. To calculate the variance, we first calculate
EX² = ∫_{−∞}^{∞} x² fX(x) dx = ∫_a^b x² · 1/(b − a) dx = (a² + ab + b²)/3.
We then do some algebra to obtain
Var X = EX² − (EX)² = (b − a)²/12.
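A quick numerical check of these two formulas (added as an illustration), integrating the density of a Uniform[a, b] distribution with a midpoint Riemann sum:

a, b = 2.0, 7.0
n = 100_000
dx = (b - a) / n

# Midpoint Riemann sums for E X and E X^2 with density 1/(b - a) on [a, b].
ex = ex2 = 0.0
for i in range(n):
    x = a + (i + 0.5) * dx
    ex += x * dx / (b - a)
    ex2 += x * x * dx / (b - a)

print(round(ex, 6), (a + b) / 2)                    # 4.5
print(round(ex2 - ex**2, 6), (b - a) ** 2 / 12)     # 2.083333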
We need to use the fact that ∫_{−∞}^{∞} f(x) dx = 1 and E[X²] = 1/6. The first one gives us
1 = ∫_0^1 (ax + b) dx = a/2 + b,
and the second one gives us
1/6 = ∫_0^1 x²(ax + b) dx = a/4 + b/3.
Solving these equations gives us
a = −2 and b = 2.
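If the sympy library is available, the same two equations can be set up and solved symbolically; the short sketch below (added as an illustration) confirms a = −2 and b = 2.

import sympy as sp

x, a, b = sp.symbols("x a b")
eq1 = sp.Eq(sp.integrate(a * x + b, (x, 0, 1)), 1)                            # total mass is 1
eq2 = sp.Eq(sp.integrate(x**2 * (a * x + b), (x, 0, 1)), sp.Rational(1, 6))   # E[X^2] = 1/6
print(sp.solve([eq1, eq2], [a, b]))   # {a: -2, b: 2}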
7.3. Exercises
Exercise 7.2. UConn students have designed the new U-phone. They have determined
that the lifetime of a U-Phone is given by the random variable X (measured in hours), with
probability density function
f(x) = 10/x² for x > 10, and f(x) = 0 for x ≤ 10.
(A) Find the probability that the u-phone will last more than 20 hours.
(B) What is the cumulative distribution function of X? That is, find FX (x) = P (X 6 x).
(C) Use part (b) to help you find P (X > 35)?
Compute E [X].
Exercise 7.4. An insurance company insures a large number of homes. The insured value,
X, of a randomly selected home is assumed to follow a distribution with density function
f(x) = 3/x⁴ for x > 1, and f(x) = 0 otherwise.
Given that a randomly selected home is insured for at least 1.5, calculate the probability
that it is insured for less than 2.
Exercise 7.7. Suppose you order a pizza from your favorite pizzeria at 7:00 pm, knowing
that the time it takes for your pizza to be ready is uniformly distributed between 7:00 pm
and 7:30 pm.
(A) What is the probability that you will have to wait longer than 10 minutes for your
pizza?
(B) If at 7:15pm, the pizza has not yet arrived, what is the probability that you will have
to wait at least an additional 10 minutes?
Exercise 7.8. The grade of deterioration X of a machine part has a continuous distribution
on the interval (0, 10) with probability density function fX (x), where fX (x) is proportional
to x5 on the interval. The reparation costs of this part are modeled by a random variable Y
that is given by Y = 3X 2 . Compute the expected cost of reparation of the machine part.
Exercise 7.9. A bus arrives at some (random) time uniformly distributed between 10 : 00
and 10 : 20, and you arrive at a bus stop at 10 : 05.
(A) What is the probability that you have to wait at least 5 minutes until the bus comes?
(B) What is the probability that you have to wait at least 5 minutes, given that when you
arrive today to the station the bus was not there yet (you are lucky today)?
Exercise∗ 7.1. For a continuous random variable X with finite first and second moments
prove that
E(aX + b) = aEX + b,
Var(aX + b) = a² Var X,
for any a, b ∈ R.
Exercise∗ 7.2. Let X be a continuous random variable with probability density function
fX(x) = (x/4) e^{−x/2} 1_{[0,∞)}(x),
where the indicator function is defined as
1_{[0,∞)}(x) = 1 if 0 ≤ x < ∞, and 0 otherwise.
Check that fX is a valid probability density function, and find E (X) if it exists.
Exercise∗ 7.3. Let X be a continuous random variable with probability density function
fX(x) = (4 ln x / x³) 1_{[1,∞)}(x),
where the indicator function is defined as
1_{[1,∞)}(x) = 1 if 1 ≤ x < ∞, and 0 otherwise.
Check that fX is a valid probability density function, and find E (X) if it exists.
7.4. Selected solutions
Solution to Exercise 7.8: First of all we need to find the PDF of X. So far we know that
f(x) = c · x/5 for 0 ≤ x ≤ 10, and f(x) = 0 otherwise.
Since
∫_0^{10} c (x/5) dx = 10c,
we have c = 1/10. Now, applying Proposition 7.2 we get
EY = E[3X²] = ∫_0^{10} 3x² · (x/50) dx = ∫_0^{10} (3x³/50) dx = 150.
Solution to Exercise 7.9(A): The probability that you have to wait at least 5 minutes
until the bus comes is 1/2. Note that with probability 1/4 you have to wait less than 5 minutes,
and with probability 1/4 you have already missed the bus.
Solution to Exercise 7.9(B): The conditional probability is 2/3.
CHAPTER 8
Normal distribution
A continuous random variable is a standard normal (written N (0, 1)) if it has density
fZ(x) = (1/√(2π)) e^{−x²/2}.
A synonym for normal is Gaussian. The first thing to do is to show that this is a (probability)
density.
Theorem
fZ (x) is a valid PDF, that is, it is a nonnegative function such that
(1/√(2π)) ∫_{−∞}^{∞} e^{−x²/2} dx = 1.
Suppose Z ∼ N (0, 1). Then
EZ = 0,
Var Z = 1.
Proof. Let I = ∫_0^∞ e^{−x²/2} dx. Then
I² = ∫_0^∞ ∫_0^∞ e^{−x²/2} e^{−y²/2} dx dy.
Changing to polar coordinates,
I² = ∫_0^{π/2} ∫_0^∞ r e^{−r²/2} dr dθ = π/2.
So I = √(π/2), hence ∫_{−∞}^{∞} e^{−x²/2} dx = √(2π) as it should be.
Note that
∫_{−∞}^{∞} x e^{−x²/2} dx = 0
by symmetry, so EZ = 0. For the variance of Z, we use integration by parts as follows:
EZ² = (1/√(2π)) ∫ x² e^{−x²/2} dx = (1/√(2π)) ∫ x · x e^{−x²/2} dx.
Therefore Var Z = EZ 2 = 1.
Note that these integrals are improper, so our arguments are somewhat informal, but they
are easy to make rigorous.
Definition (General normal distribution)
We say that X has a normal distribution with mean µ and variance σ², written X ∼ N(µ, σ²),
if X = µ + σZ, where Z ∼ N(0, 1) and σ > 0. Then
EX = µ,
Var X = σ².
For any a, b ∈ R the random variable aX + b is a normal random variable.
The distribution function of a standard normal random variable N(0, 1) is often denoted by Φ(x),
so that
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy.
Tables of Φ(x) are often given only for x > 0. One can use the symmetry of the density
function to see that
Φ(−x) = 1 − Φ(x);
this follows from
Φ(−x) = P(Z ≤ −x) = (1/√(2π)) ∫_{−∞}^{−x} e^{−y²/2} dy
       = (1/√(2π)) ∫_x^∞ e^{−y²/2} dy = P(Z ≥ x)
       = 1 − P(Z < x) = 1 − Φ(x).
Proposition 8.3
For a standard normal random variable Z we have the following bound: for x > 0,
P(Z ≥ x) = 1 − Φ(x) ≤ (1/√(2π)) (1/x) e^{−x²/2}.
Proof. Indeed,
P(Z ≥ x) = (1/√(2π)) ∫_x^∞ e^{−y²/2} dy
         ≤ (1/√(2π)) ∫_x^∞ (y/x) e^{−y²/2} dy = (1/√(2π)) (1/x) e^{−x²/2}.
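The bound is easy to compare with exact values of 1 − Φ(x); the Python lines below (an added illustration) use the error-function identity 1 − Φ(x) = erfc(x/√2)/2.

from math import erfc, exp, pi, sqrt

for x in (1.0, 2.0, 3.0, 4.0):
    tail = 0.5 * erfc(x / sqrt(2))                 # exact 1 - Phi(x)
    bound = exp(-x * x / 2) / (sqrt(2 * pi) * x)   # bound from Proposition 8.3
    print(x, f"{tail:.6f}", f"{bound:.6f}")
# The bound always lies above the exact tail and becomes relatively tight as x grows.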
Example 8.3. Suppose X is normal with mean 6. If P(X > 16) = 0.0228, then what is
the standard deviation of X?
Recall that we saw in Proposition 8.2 that (X − µ)/σ = Z ∼ N(0, 1), and then
P(X > 16) = 0.0228 ⇐⇒ P((X − 6)/σ > (16 − 6)/σ) = 0.0228
            ⇐⇒ P(Z > 10/σ) = 0.0228
            ⇐⇒ 1 − P(Z ≤ 10/σ) = 0.0228
            ⇐⇒ 1 − Φ(10/σ) = 0.0228
            ⇐⇒ Φ(10/σ) = 0.9772.
Using the standard normal table we see that Φ(2) = 0.9772, thus we have that
2 = 10/σ,
and hence σ = 5.
8.3. Exercises
Exercise 8.2. The heights of maple trees at age 10 are estimated to be normally distributed
with mean 200 cm and variance 64 cm². What is the probability that a maple tree at age 10
grows to more than 210 cm?
Exercise 8.3. The peak temperature T , in degrees Fahrenheit, on a July day in Antarctica
is a Normal random variable with a variance of 225. With probability .5, the temperature
T exceeds 10 degrees.
(a) What is P(T > 32), the probability the temperature is above freezing?
(b) What is P (T < 0)?
Exercise 8.5. Suppose X is a normal random variable with mean 5. If P (X > 0) = 0.8888,
approximately what is Var (X)?
Exercise 8.6. The shoe size of a UConn basketball player is normally distributed with
mean 12 inches and variance 4 inches. Ten percent of all UConn basketball players have a
shoe size greater than c inches. Find the value of c.
Exercise 8.7. The length of the forearm of a UConn football player is normally distributed
with mean 12 inches. If ten percent of the football team players have a forearm whose length
is greater than 12.5 inches, find out the approximate standard deviation of the forearm length
of a UConn football player.
Exercise 8.8. Companies C and A earn each an annual profit that is normally distributed
with the same positive mean µ. The standard deviation of C’s annual profit is one third of
its mean. In a certain year, the probability that A makes a loss (i.e. a negative profit) is 0.8
times the probability that C does. Assuming that A’s annual profit has a standard deviation
of 10, compute (approximately) the standard deviation of C’s annual profit.
Exercise 8.9. Let Z ∼ N(0, 1), that is, a standard normal random variable. Find the probability density of X = Z². Hint: first find the (cumulative) distribution function F_X(x) = P(X ≤ x) in terms of Φ(x) = F_Z(x). Then use the fact that the probability density function can be found by f_X(x) = F_X′(x), and use the known density function of Z.
8.4. Selected solutions

Solution to Exercise 8.1(A):

P(X > 5) = P(Z > (5 − 10)/6) = P(Z > −5/6)
         = P(Z < 5/6) = Φ(5/6) ≈ Φ(0.83) ≈ 0.797.

Solution to Exercise 8.1(B): 2Φ(1) − 1 = 0.6827.

Solution to Exercise 8.1(C): 1 − Φ(0.3333) = 0.3695.
Solution to Exercise 8.2: We have µ = 200 and σ = √64 = 8. Then

P(X > 210) = P(Z > (210 − 200)/8) = P(Z > 1.25)
           = 1 − Φ(1.25) = 0.1056.

Solution to Exercise 8.3(A): We have σ = √225 = 15. Since P(T > 10) = 0.5, we must have µ = 10, since the PDF of the normal distribution is symmetric about its mean. Then

P(T > 32) = P(Z > (32 − 10)/15) = 1 − Φ(1.47) = 0.0708.
Solution to Exercise 8.4(A): First we need to figure out what µ and σ are. Note that

P(X ≤ 80,000) = 0.33 ⟺ P(Z < (80,000 − µ)/σ) = 0.33
              ⟺ Φ((80,000 − µ)/σ) = 0.33,

and since Φ(0.44) = 0.67, then Φ(−0.44) = 0.33. Then we must have

(80,000 − µ)/σ = −0.44.

Similarly, since

P(X > 120,000) = 0.33 ⟺ 1 − P(X ≤ 120,000) = 0.33
               ⟺ 1 − Φ((120,000 − µ)/σ) = 0.33
               ⟺ Φ((120,000 − µ)/σ) = 0.67,

and again Φ(0.44) = 0.67, then

(120,000 − µ)/σ = 0.44.
Solution to Exercise 8.7: Let X denote the forearm length of a UConn football player and let σ denote its standard deviation. From the problem we know that

P(X > 12.5) = P((X − 12)/σ > 0.5/σ) = 1 − Φ(0.5/σ) = 0.1.

From the table we get 0.5/σ ≈ 1.29, hence σ ≈ 0.39.

Solution to Exercise 8.8: Let A and C denote the respective annual profits, and µ their common mean. From the problem we know P(A < 0) = 0.8 P(C < 0) and σ_C = µ/3. Since the profits are normally distributed, Φ(−µ/10) = 0.8 Φ(−3), which implies

Φ(µ/10) = 0.2 + 0.8 Φ(3) ≈ 0.998.

From the table we thus get µ/10 ≈ 2.88, and hence the standard deviation of C's annual profit is µ/3 ≈ 9.6.
Solution to Exercise 8.9: see Example 10.2.
CHAPTER 9
This approximation is good if np(1 − p) > 10 and gets better the larger this quantity gets. This means that if either p or 1 − p is small, then this is valid for large n. Recall that by Proposition 6.1, np is the same as ESn and np(1 − p) is the same as Var Sn. So the ratio (Sn − ESn)/√(Var Sn) has mean 0 and variance 1, the same as a standard N(0, 1).

Note that here p stays fixed as n → ∞, unlike in the case of the Poisson approximation, as we described in Proposition 6.3.
Sketch of the proof. This is usually not covered in this course, so we only explain one (of many) ways to show why this holds. We would like to compare the distribution of Sn with the distribution of the normal variable X ∼ N(np, np(1 − p)). The random variable X has the density

(1/√(2πnp(1 − p))) e^{−(x−np)²/(2np(1−p))}.

The idea behind this proof is that we are interested in approximating the binomial distribution by the normal distribution in the region where the binomial distribution differs significantly from zero, that is, in the region around the mean np. We consider P(Sn = k), and we assume that k does not deviate too much from np. We measure deviations by some small number of standard deviations, the standard deviation being √(np(1 − p)). Therefore we see that k − np should be of order √n. This is not much of a restriction, since once k deviates from np by many standard deviations, P(Sn = k) becomes very small and can be approximated by zero. In what follows we assume that k and n − k are of order n.

We use Stirling's formula in the following form:

m! ∼ √(2πm) e^{−m} m^m,
where by ∼ we mean that the two quantities are asymptotically equal, that is, their ratio tends to 1 as m → ∞. Then for large n, k and n − k,

P(Sn = k) = (n!/(k!(n − k)!)) p^k (1 − p)^{n−k}
          ∼ (√(2πn) e^{−n} n^n / (√(2πk) e^{−k} k^k · √(2π(n − k)) e^{−(n−k)} (n − k)^{n−k})) p^k (1 − p)^{n−k}
          = (np/k)^k (n(1 − p)/(n − k))^{n−k} √(n/(2πk(n − k))).

Now we can use the identities

ln(np/k) = −ln(1 + (k − np)/(np)),
ln(n(1 − p)/(n − k)) = −ln(1 − (k − np)/(n(1 − p))).

Then we can use ln(1 + y) ∼ y − y²/2 + y³/3, y → 0, to see that

ln[(np/k)^k (n(1 − p)/(n − k))^{n−k}] = k ln(np/k) + (n − k) ln(n(1 − p)/(n − k))
  ∼ k(−(k − np)/(np) + (1/2)((k − np)/(np))² − (1/3)((k − np)/(np))³)
    + (n − k)((k − np)/(n(1 − p)) + (1/2)((k − np)/(n(1 − p)))² + (1/3)((k − np)/(n(1 − p)))³)
  ∼ −(k − np)²/(2np(1 − p)).

Thus

(np/k)^k (n(1 − p)/(n − k))^{n−k} ∼ e^{−(k−np)²/(2np(1−p))}.

Now we use our assumption that k − np should be of order √n to see that

k − np ≈ √n,
n − k ≈ n(1 − p) − √n,
k(n − k) ≈ n² p(1 − p),

so

√(n/(2πk(n − k))) ∼ 1/√(2πnp(1 − p)).
Example 9.1. Suppose a fair coin is tossed 100 times. What is the probability there will be more than 60 heads?

First observe that np = 50 and √(np(1 − p)) = 5. Then we have

P(Sn > 60) = P((Sn − 50)/5 > 2) ≈ P(Z > 2) ≈ 0.0228.
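For readers who want to see how close this approximation is, the following sketch (not part of the original text, and assuming SciPy is available) compares the exact binomial tail with the normal approximation, with and without the continuity correction discussed later in this chapter.

```python
# Illustrative check of Example 9.1 (a sketch; assumes SciPy).
from scipy.stats import binom, norm

n, p = 100, 0.5
exact = 1 - binom.cdf(60, n, p)              # P(Sn > 60), exact
approx = 1 - norm.cdf(2)                     # P(Z > 2) from the normal approximation
approx_cc = 1 - norm.cdf((60.5 - 50) / 5)    # with the continuity correction of Example 9.4
print(round(exact, 4), round(approx, 4), round(approx_cc, 4))
```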
Example 9.2. Suppose a die is rolled 180 times. What is the probability a 3 will be showing more than 50 times?

Here p = 1/6, so np = 30 and √(np(1 − p)) = 5. Then P(Sn > 50) ≈ P(Z > 4), which is less than e^{−4²/2}.
Example 9.3. Suppose a drug is supposed to be 75% effective. It is tested on 100 people. What is the probability more than 70 people will be helped?

Here Sn is the number of successes, n = 100, and p = 0.75. We have

P(Sn > 70) = P((Sn − 75)/√(300/16) > −1.154)
           ≈ P(Z > −1.154) ≈ 0.87,

where the last number came from Table 1.
When b − a is small, there is a correction that makes things more accurate: namely, replace a by a − 1/2 and b by b + 1/2. This correction never hurts, and it is sometimes necessary. For example, in tossing a coin 100 times, there is positive probability that there are exactly 50 heads, while without the correction, the answer given by the normal approximation would be 0.
Example 9.4. We toss a coin 100 times. What is the probability of getting 49, 50, or 51 heads?

We write P(49 ≤ Sn ≤ 51) = P(48.5 ≤ Sn ≤ 51.5) and then continue as above. In this case we again have

p = 0.5,  µ = np = 50,  σ² = np(1 − p) = 25,  σ = √(np(1 − p)) = 5.

The normal approximation can be done in three different ways:

P(49 ≤ Sn ≤ 51) ≈ P(49 ≤ 50 + 5Z ≤ 51) = Φ(0.2) − Φ(−0.2) = 2Φ(0.2) − 1 ≈ 0.15852

or

P(48 < Sn < 52) ≈ P(48 < 50 + 5Z < 52) = Φ(0.4) − Φ(−0.4) = 2Φ(0.4) − 1 ≈ 0.31084

or

P(48.5 < Sn < 51.5) ≈ P(48.5 < 50 + 5Z < 51.5) = Φ(0.3) − Φ(−0.3) = 2Φ(0.3) − 1 ≈ 0.23582.

Here all three answers are approximate, and the third one, 0.23582, is the most accurate among these three. We also can compute the precise answer using the binomial formula:

P(49 ≤ Sn ≤ 51) = Σ_{k=49}^{51} (100 choose k) (1/2)^{100}
                = 37339688790147532337148742857/158456325028528675187087900672
                ≈ 0.2356465655973331958...
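The three approximations and the exact value can also be reproduced numerically; the short sketch below (not part of the original text) assumes SciPy is available.

```python
# Sketch comparing the three normal approximations of Example 9.4 with the exact
# binomial probability (illustrative only; assumes SciPy).
from scipy.stats import binom, norm

exact = sum(binom.pmf(k, 100, 0.5) for k in (49, 50, 51))
no_correction   = 2 * norm.cdf(0.2) - 1   # using the interval [49, 51]
too_wide        = 2 * norm.cdf(0.4) - 1   # using the interval (48, 52)
with_correction = 2 * norm.cdf(0.3) - 1   # using the interval (48.5, 51.5)
print(round(exact, 5), round(no_correction, 5), round(too_wide, 5), round(with_correction, 5))
```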
Continuity correction
If a continuous distribution such as the normal distribution is used to approximate a
discrete one such as the binomial distribution, a continuity correction should be used.
For example, if X is a binomial random variable that represents the number of successes in n independent trials with probability of success p in any trial, and Y is a normal random variable with the same mean and the same variance as X, then for any integer k the probability P(X ≤ k) is well approximated by P(Y ≤ k) if np(1 − p) is not too small. It is better approximated by P(Y ≤ k + 1/2), as explained at the end of this section. The role of 1/2 is clear if we start by looking at the normal distribution first, and see how we use it to approximate the binomial distribution.

The fact that this approximation is better is based on a couple of considerations. One is that a discrete random variable can take on only discrete values such as integers, while a continuous random variable used to approximate it can take on any values within an interval around these specified values. Hence, when using the normal distribution to approximate the binomial, more accurate approximations are likely to be obtained if a continuity correction is used.

The second reason is that for a continuous distribution such as the normal, the probability of taking on any particular value of a random variable is zero. On the other hand, for the discrete random variable each possible value carries positive probability, so we spread the value k over the interval from k − 1/2 to k + 1/2. For instance, to approximate P(3 ≤ X ≤ 5) we write

P(3 ≤ X ≤ 5) = P(2.5 ≤ X ≤ 5.5)

and then use the normal approximation P(2.5 ≤ Y ≤ 5.5).
Below is a table on how to use the continuity correction for the normal approximation Y to a binomial random variable X:

Binomial        Normal
P(X = k)        P(k − 1/2 ≤ Y ≤ k + 1/2)
P(X ≤ k)        P(Y ≤ k + 1/2)
P(X < k)        P(Y ≤ k − 1/2)
P(X ≥ k)        P(Y ≥ k − 1/2)
P(X > k)        P(Y ≥ k + 1/2)
9.1. Exercises
Exercise 9.1. Suppose that we roll 2 dice 180 times. Let E be the event that we roll two
fives no more than once.
(a) Find the exact probability of E.
(b) Approximate P(E) using the normal distribution.
(c) Approximate P(E) using the Poisson distribution.
Exercise 9.2. About 10% of the population is left-handed. Use the normal distribution
to approximate the probability that in a class of 150 students,
(a) at least 25 of them are left-handed.
(b) between 15 and 20 are left-handed.
Exercise 9.3. A teacher purchases a box with 50 markers of colors selected at random. The probability that a marker is black is 0.6, independently of all other markers. Knowing that the probability of there being more than N black markers is greater than 0.2 and the probability of there being more than N + 1 black markers is less than 0.2, use the normal approximation to calculate N.
CHAPTER 10

Some continuous distributions

Exponential distribution
A continuous random variable has an exponential distribution with parameter λ > 0 if its density is f(x) = λe^{−λx} if x > 0 and 0 otherwise. Then

(10.1.1)   P(X > a) = ∫_a^∞ λe^{−λx} dx = e^{−λa},

F_X(a) = 1 − P(X > a) = 1 − e^{−λa},

and we can use integration by parts to see that EX = 1/λ and Var X = 1/λ². Examples where an exponential random variable is a good model are the length of a telephone call, the length of time before someone arrives at a bank, and the length of time before a light bulb burns out.
Exponentials are memoryless, that is,

P(X > s + t | X > t) = P(X > s),

or, given that the light bulb has burned 5 hours, the probability it will burn 2 more hours is the same as the probability a new light bulb will burn 2 hours. Here is how we can prove this:
P(X > s + t | X > t) = P(X > s + t)/P(X > t)
                     = e^{−λ(s+t)}/e^{−λt} = e^{−λs}
                     = P(X > s),

where we used Equation (10.1.1) for a = t and a = s + t.
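A quick simulation can make the memoryless property concrete; the sketch below is illustrative only (not part of the original text) and assumes NumPy is available. The particular values of λ, s, and t are arbitrary choices for the demonstration.

```python
# Simulation illustrating memorylessness of the exponential distribution (a sketch; assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
lam, s, t = 0.5, 2.0, 5.0
x = rng.exponential(scale=1 / lam, size=1_000_000)

cond = (x > s + t).sum() / (x > t).sum()   # estimate of P(X > s + t | X > t)
uncond = (x > s).mean()                    # estimate of P(X > s)
print(round(cond, 3), round(uncond, 3), round(np.exp(-lam * s), 3))  # all approximately equal
```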
Gamma distribution

A continuous random variable has a gamma distribution with parameters α and θ if its density is

f(x) = α e^{−αx} (αx)^{θ−1} / Γ(θ)

if x > 0 and 0 otherwise, where Γ(θ) = ∫_0^∞ e^{−y} y^{θ−1} dy is the gamma function. We denote such a distribution by Γ(α, θ).

Note that Γ(1) = ∫_0^∞ e^{−y} dy = 1, and using induction on n and integration by parts one can see that

Γ(n) = (n − 1)!,

so we say that the gamma function interpolates the factorial.
While an exponential random variable is used to model the time for something to occur, a gamma random variable is the time for θ events to occur. A gamma random variable Γ(1/2, n/2), with parameters 1/2 and n/2, is known as a χ²_n, a chi-squared random variable with n degrees of freedom. Recall that in Exercise 8.9 we had a different description of a χ² random variable, namely Z² with Z ∼ N(0, 1). Gamma and χ² random variables come up frequently in statistics.
Beta distribution

A continuous random variable has a beta distribution if its density is

f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1},  0 < x < 1,

where B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx.
What is interesting about the Cauchy random variable is that it does not have a finite mean,
that is, E|X| = ∞.
Densities of functions of continuous random variables

Often it is important to be able to compute the density of Y = g(X), where X is a continuous random variable. This is explained later in Theorem 11.1.

Example 10.1 (Log of a uniform random variable). Suppose X is uniform on the interval [0, 1] and Y = −log X, so that Y > 0. If x > 0, then

F_Y(x) = P(Y ≤ x) = P(X ≥ e^{−x}) = 1 − F_X(e^{−x}),

so, using the chain rule,

f_Y(x) = (d/dx) F_Y(x) = −f_X(e^{−x})(−e^{−x}).

Since f_X(x) = 1 for x ∈ [0, 1], this gives f_Y(x) = e^{−x}, so Y is exponential with parameter 1.
Example 10.2 (χ², revisited). As in Exercise 8.9 we consider Y = Z², where Z ∼ N(0, 1). Then

F_Y(x) = P(Y ≤ x) = P(Z² ≤ x) = P(−√x ≤ Z ≤ √x)
       = P(Z ≤ √x) − P(Z ≤ −√x) = F_Z(√x) − F_Z(−√x).

Taking the derivative and using the chain rule we see

f_Y(x) = (d/dx) F_Y(x) = f_Z(√x) (1/(2√x)) − f_Z(−√x) (−1/(2√x)).

Recall that f_Z(y) = (1/√(2π)) e^{−y²/2}; doing some algebra, we end up with

f_Y(x) = (1/√(2π)) x^{−1/2} e^{−x/2},

which is Γ(1/2, 1/2). As we pointed out before, this is also a χ² distributed random variable with one degree of freedom.
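The identification of the density of Z² with the chi-squared density with one degree of freedom can be checked numerically; the sketch below is illustrative (not part of the original text) and assumes SciPy and NumPy are available.

```python
# Numerical check of Example 10.2 (a sketch; assumes SciPy/NumPy).
import numpy as np
from scipy.stats import norm, chi2

x = np.linspace(0.1, 5, 5)
formula = x**(-0.5) * np.exp(-x / 2) / np.sqrt(2 * np.pi)    # density derived above
print(np.allclose(formula, chi2.pdf(x, df=1)))               # True: it is the chi-square(1) density

# Monte Carlo check: the distribution of Z^2 matches chi-square(1).
z = norm.rvs(size=200_000, random_state=0)
print(round(np.mean(z**2 <= 1.5), 3), round(chi2.cdf(1.5, df=1), 3))
```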
Example 10.4. Suppose that the length of a phone call in minutes is an exponential random variable with average length 10 minutes.

(1) What is the probability of your phone call being more than 10 minutes? Here λ = 1/10, thus

P(X > 10) = e^{−(1/10)·10} = e^{−1} ≈ 0.368.

(2) Between 10 and 20 minutes? We have that

P(10 < X < 20) = F(20) − F(10) = e^{−1} − e^{−2} ≈ 0.233.
Example 10.5. Suppose the life of an Uphone has an exponential distribution with mean life of 4 years. Let X denote the life of an Uphone (or the time until it dies). Given that the Uphone has lasted 3 years, what is the probability that it will last 5 more years?

In this case λ = 1/4, so

P(X > 5 + 3 | X > 3) = P(X > 8)/P(X > 3)
                     = e^{−(1/4)·8}/e^{−(1/4)·3} = e^{−(1/4)·5} = P(X > 5).

Recall that an exponential random variable is memoryless, so our answer is consistent with this property of X.
10.3. Exercises
Exercise 10.1. Suppose that the time required to replace a car's windshield can be represented by an exponentially distributed random variable with parameter λ = 1/2.

(a) What is the probability that it will take at least 3 hours to replace a windshield?
(b) What is the probability that it will take at least 5 hours to replace a windshield given that it hasn't been finished after 2 hours?
Exercise 10.2. The number of years a Uphone functions is exponentially distributed with parameter λ = 1/8. If Pat buys a used Uphone, what is the probability that it will be working after an additional 8 years?
Exercise 10.3. Suppose that the time (in minutes) required to check out a book at the library can be represented by an exponentially distributed random variable with parameter λ = 2/11.

(a) What is the probability that it will take at least 5 minutes to check out a book?
(b) What is the probability that it will take at least 11 minutes to check out a book given that you have already waited for 6 minutes?
Exercise 10.6. Let X be a uniform random variable over [0, 1]. Define a new random
variable Y = eX . Find the probability density function of Y , fY (y).
Exercise 10.7. An insurance company insures a large number of homes. The insured value, X, of a randomly selected home is assumed to follow a distribution with density function

f_X(x) = 8/x³ for x > 2, and 0 otherwise.

(A) Given that a randomly selected home is insured for at most 4, calculate the probability that it is insured for less than 3.
(B) Given that a randomly selected home is insured for at least 3, calculate the probability that it is insured for less than 4.
Solution to Exercise 10.1(B): There are two ways to do this. The longer one is to explic-
itly find P(X > 5 | X > 2). The shorter one is to remember that the exponential distribution
is memoryless and to observe that P(X > t + 3 | X > t) = P (X > 3), so the answer is the
same as the answer to part (a).
Solution to Exercise 10.2: e−1
Solution to Exercise 10.3(A): Recall that by Equation (10.1.1), P(X > a) = e^{−λa}, and therefore

P(X > 5) = e^{−10/11}.

Solution to Exercise 10.3(B): We use the memoryless property:

P(X > 11 | X > 6) = P(X > 6 + 5 | X > 6) = P(X > 5) = e^{−10/11}.
Solution to Exercise 10.4: Since E[X] = 1/λ = 1, we know that λ = 1. Then its PDF and CDF are

f_X(x) = e^{−x}, x ≥ 0,
F_X(x) = 1 − e^{−x}, x ≥ 0.

Thus

F_Y(y) = P(Y ≤ y) = P(e^X ≤ y) = P(X ≤ log y) = F_X(log y),

and so

F_Y(y) = 1 − e^{−log y} = 1 − 1/y, when log y ≥ 0;

taking the derivative we get

f_Y(y) = dF_Y(y)/dy = 1/y² when y ≥ 1.

Solution to Exercise 10.5: Since X is exponential with parameter 1, its PDF and CDF are

f_X(x) = e^{−x}, x ≥ 0,
F_X(x) = 1 − e^{−x}, x ≥ 0.

Thus

F_Y(y) = P(Y ≤ y) = P(X/c ≤ y) = P(X ≤ cy) = F_X(cy),

and so

F_Y(y) = 1 − e^{−cy}, when cy ≥ 0.
Solution to Exercise 10.6: Since X is uniform over [0, 1], its PDF and CDF are

f_X(x) = 1, 0 ≤ x ≤ 1,
F_X(x) = x, 0 ≤ x ≤ 1.

Thus

F_Y(y) = P(Y ≤ y) = P(e^X ≤ y) = P(X ≤ log y) = F_X(log y),

and so

F_Y(y) = log y, when 0 ≤ log y ≤ 1;

taking derivatives we get

f_Y(y) = dF_Y(y)/dy = 1/y, when 1 ≤ y ≤ e.
Solution to Exercise 10.7(A): Using the definition of conditional probability,

P(X < 3 | X < 4) = P(X < 3)/P(X < 4).

Since

P(X < 4) = ∫_2^4 8/x³ dx = [−4/x²]_2^4 = −1/4 + 1 = 3/4

and

P(X < 3) = P(2 < X < 3) = ∫_2^3 8/x³ dx = [−4/x²]_2^3 = −4/9 + 1 = 5/9,

the answer is

(5/9)/(3/4) = 20/27 ≈ 0.74074074.
Solution to Exercise 10.7(B): Using the definition of conditional probability,

P(X < 4 | X > 3) = P(3 < X < 4)/P(X > 3).

Since

P(X > 3) = ∫_3^∞ 8/x³ dx = [−4/x²]_3^∞ = 4/9,
P(3 < X < 4) = ∫_3^4 8/x³ dx = [−4/x²]_3^4 = 7/36,

the answer is

(7/36)/(4/9) = 7/16 = 0.4375.

Part 3

Multivariate distributions
Part 3
Multivariate distributions
We are often interested in considering several random variables that might be related to each other. For example, we can be interested in several characteristics of a randomly chosen object, such as gene mutations and a certain disease for a person, different kinds of preventive measures for an infection, etc. Each of these characteristics can be thought of as a random variable, and we are interested in their dependence. In this chapter we study joint distributions of several random variables.

We consider collections of random variables (X1, X2, ..., Xn), which are known as random vectors. We start by looking at two random variables, though the approach can be easily extended to more variables.
Joint PMF for discrete random variables

The joint probability mass function of two discrete random variables X and Y is defined as

p_{XY}(x, y) = P(X = x, Y = y).

Recall that here the comma means and, or the intersection of two events. If X takes values {x_i}_{i=1}^∞ and Y takes values {y_j}_{j=1}^∞, then (X, Y), as a map from the probability space (S, F, P), has range contained in the set {(x_i, y_j)}_{i,j=1}^∞. Note that p_{XY} is indeed a probability mass function, as

Σ_{i,j=1}^∞ p_{XY}(x_i, y_j) = 1.
Two random variables X and Y are jointly continuous if there exists a nonnegative function f_{XY}: R × R → R such that, for any set A ⊆ R × R, we have

P((X, Y) ∈ A) = ∬_A f_{XY}(x, y) dx dy.

The function f_{XY}(x, y) is called the joint probability density function (PDF) of X and Y. In particular,

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{XY}(x, y) dy dx,

P(X < Y) = ∬_{{x<y}} f_{X,Y}(x, y) dy dx.
Example 11.1. If the density f_{X,Y}(x, y) = c e^{−x} e^{−2y} for 0 < x < ∞ and x < y < ∞, what is c?

We use the fact that a density must integrate to 1. So

∫_0^∞ ∫_x^∞ c e^{−x} e^{−2y} dy dx = 1.
The multivariate distribution function (CDF) of (X, Y) is defined by F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y). In the continuous case, this is

F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) dv du,

and so we have

f(x, y) = ∂²F/∂x∂y (x, y).

The extension to n random variables is entirely similar.
Marginal PDFs

Suppose f_{X,Y}(x, y) is a joint PDF of X and Y; then the marginal densities of X and of Y are given by

f_X(x) = ∫_{−∞}^∞ f_{X,Y}(x, y) dy,   f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx.

If X and Y are independent, their joint distribution function factors, F_{X,Y}(x, y) = F_X(x) F_Y(y), and one can conclude from this by taking partial derivatives that the joint density function factors into the product of the density functions. Going the other way, one can also see that if the joint density factors, then one has independence of the random variables.

Proposition 11.1 (The joint density function factors for independent random variables)
Example 11.4 (Buffon's needle problem). Suppose one has a floor made out of wood planks and one drops a needle onto it. What is the probability the needle crosses one of the cracks? Suppose the needle is of length L and the wood planks are D across.

Let X be the distance from the midpoint of the needle to the nearest crack and let Θ be the angle the needle makes with the vertical. Then X and Θ are independent random variables: X is uniform on [0, D/2] and Θ is uniform on [0, π/2]. A little geometry shows that the needle will cross a crack if L/2 > X/cos Θ. We have f_{X,Θ} = 4/(πD), and so we have to integrate this constant over the set where X < L cos Θ/2, 0 ≤ Θ ≤ π/2, and 0 ≤ X ≤ D/2. The integral is

∫_0^{π/2} ∫_0^{L cos θ/2} (4/(πD)) dx dθ = 2L/(πD).
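Buffon's needle is a classical example where a short Monte Carlo experiment matches the computed probability; the sketch below is illustrative only (not part of the original text), assumes NumPy is available, and assumes L ≤ D so that the formula 2L/(πD) applies.

```python
# Monte Carlo illustration of Buffon's needle (a sketch; assumes NumPy and L <= D).
import numpy as np

rng = np.random.default_rng(1)
L, D, n = 1.0, 2.0, 1_000_000
x = rng.uniform(0, D / 2, n)            # distance from the midpoint to the nearest crack
theta = rng.uniform(0, np.pi / 2, n)    # angle with the vertical
crosses = x < (L / 2) * np.cos(theta)
print(round(crosses.mean(), 4), round(2 * L / (np.pi * D), 4))   # the two numbers agree closely
```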
Suppose X and Y are independent continuous random variables. Then the density of X + Y is given by the following convolution formula:

f_{X+Y}(a) = ∫ f_X(a − y) f_Y(y) dy.
Proof. If X and Y are independent, then the joint probability density function factors, and therefore

P(X + Y ≤ a) = ∬_{{x+y≤a}} f_{X,Y}(x, y) dx dy
             = ∬_{{x+y≤a}} f_X(x) f_Y(y) dx dy
             = ∫_{−∞}^∞ ∫_{−∞}^{a−y} f_X(x) f_Y(y) dx dy
             = ∫ F_X(a − y) f_Y(y) dy.

Differentiating with respect to a, we have the convolution formula for the density of X + Y:

f_{X+Y}(a) = ∫ f_X(a − y) f_Y(y) dy.
Example 11.8. The analogue for discrete random variables is easier. If X and Y take only nonnegative integer values, we have

P(X + Y = r) = Σ_{k=0}^r P(X = k, Y = r − k) = Σ_{k=0}^r P(X = k) P(Y = r − k).

In the case where X is a Poisson random variable with parameter λ and Y is a Poisson random variable with parameter µ, we see that X + Y is a Poisson random variable with parameter λ + µ. To check this, use the above formula to get

P(X + Y = r) = Σ_{k=0}^r P(X = k) P(Y = r − k)
             = Σ_{k=0}^r e^{−λ} (λ^k/k!) e^{−µ} (µ^{r−k}/(r − k)!)
             = e^{−(λ+µ)} (1/r!) Σ_{k=0}^r (r choose k) λ^k µ^{r−k}
             = e^{−(λ+µ)} (λ + µ)^r / r!,

using the binomial theorem.
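The discrete convolution formula above can also be checked numerically: convolving two Poisson probability mass functions reproduces the Poisson(λ + µ) mass function. The sketch below is illustrative (not part of the original text) and assumes SciPy and NumPy are available; the values of λ and µ are arbitrary.

```python
# Numerical check of Example 11.8 by discrete convolution (a sketch; assumes SciPy/NumPy).
import numpy as np
from scipy.stats import poisson

lam, mu, rmax = 2.0, 3.0, 40
px = poisson.pmf(np.arange(rmax + 1), lam)
py = poisson.pmf(np.arange(rmax + 1), mu)
conv = np.convolve(px, py)[: rmax + 1]          # P(X + Y = r) for r = 0, ..., rmax
print(np.allclose(conv, poisson.pmf(np.arange(rmax + 1), lam + mu), atol=1e-10))  # True
```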
Note that it is not always the case that the sum of two independent random variables will
be a random variable of the same type.
144 11. MULTIVARIATE DISTRIBUTIONS
Example 11.9. If X and Y are independent normals, then −Y is also a normal with
E(−Y ) = −EY and Var(−Y ) = (−1)2 Var Y = Var Y , and so X − Y is also normal.
We now extend the notion of conditioning an event on another event to random variables. Suppose we have observed the value of a random variable Y, and we need to update the density function of another random variable X whose value we are still to observe. For this we use the conditional density function of X given Y.

We start by considering the case when X and Y are discrete random variables.

Definition (Conditional discrete random variable)

Suppose X and Y are discrete random variables. The conditional probability mass function of X given Y is

p_{X|Y=y}(x) = p_{XY}(x, y)/p_Y(y)

wherever p_Y(y) ≠ 0.

Analogously, we define conditional random variables in the continuous case.

Definition (Conditional continuous random variable)

Suppose X and Y are jointly continuous. The conditional probability density function (PDF) of X given Y is given by

f_{X|Y=y}(x) = f_{XY}(x, y)/f_Y(y).
First we formalize what we saw in the one-dimensional case. Recall that g(x) is called a strictly increasing function if x1 < x2 implies g(x1) < g(x2); similarly we can define a strictly decreasing function. We say g is strictly monotone on [a, b] if it is either strictly increasing or strictly decreasing on this interval.

Finally, we will use the fact that for a strictly monotone function g, for any point y in the range of the function there is a unique x such that g(x) = y. That is, we have a well-defined inverse g^{−1} on the range of the function g.
Proof. Without loss of generality we can assume that g is strictly increasing. If X has a density f_X and Y = g(X), then

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)).

Taking the derivative, we use the chain rule and recall that the derivative of g^{−1}(y) is given by

(g^{−1}(y))′ = 1/g′(x) = 1/g′(g^{−1}(y)).

Here we use that y = g(x), x = g^{−1}(y), and the assumption that g(x) is an increasing function.
The higher-dimensional case is very analogous. Note that the function below is assumed to
be one-to-one, and therefore it is invertible. Note that in the one-dimensional case strictly
monotone functions are one-to-one.
Theorem (PDF for a function of two random variables)

Suppose X and Y are two jointly continuous random variables. Let (U, V) = g(X, Y) = (g1(X, Y), g2(X, Y)), where g: R² → R² is a continuous one-to-one function with continuous partial derivatives. Denote by h = g^{−1}, so h(U, V) = (h1(U, V), h2(U, V)) = (X, Y). Then U and V are jointly continuous and their joint PDF, f_{UV}(u, v), is defined on the range of (U, V) and is given by

f_{UV}(u, v) = f_{XY}(h1(u, v), h2(u, v)) |J_h(u, v)|,

where J_h is the Jacobian determinant of h.

Proof. The proof is based on the change of variables theorem from multivariable calculus, and it is analogous to (11.4.1). Note that to reconcile this formula with some of the applications, we might find the following property of the Jacobian of a map and its inverse useful:

J_h ∘ g = (J_g)^{−1}.
Example 11.10. Suppose X1 is N(0, 1), X2 is N(0, 4), and X1 and X2 are independent. Let Y1 = 2X1 + X2, Y2 = X1 − 3X2. Then y1 = g1(x1, x2) = 2x1 + x2, y2 = g2(x1, x2) = x1 − 3x2, so

J = | 2   1 |
    | 1  −3 |  = −7.

In general, J might depend on x, and hence on y. Some algebra leads to x1 = (3/7)y1 + (1/7)y2 and x2 = (1/7)y1 − (2/7)y2. Since X1 and X2 are independent,

f_{X1,X2}(x1, x2) = f_{X1}(x1) f_{X2}(x2) = (1/√(2π)) e^{−x1²/2} (1/√(8π)) e^{−x2²/8}.

Therefore

f_{Y1,Y2}(y1, y2) = (1/√(2π)) e^{−((3/7)y1 + (1/7)y2)²/2} (1/√(8π)) e^{−((1/7)y1 − (2/7)y2)²/8} · (1/7).
Example 11.11. Suppose we roll two dice with sides 1, 1, 2, 2, 3, 3. Let X be the largest value obtained on any of the two dice, and let Y be the sum of the two dice. Find the joint PMF of X and Y.

First we make a table of all the possible outcomes. Note that individually X = 1, 2, 3 and Y = 2, 3, 4, 5, 6. The table of joint probabilities P(X = x, Y = y) is

X\Y     2     3     4     5     6
 1     1/9    0     0     0     0
 2      0    2/9   1/9    0     0
 3      0     0    2/9   2/9   1/9
Consider two random variables X and Y with joint density f(x, y) = c e^{−x} e^{−2y} for 0 < x < ∞, 0 < y < ∞, and 0 otherwise.

(a) Find c that makes this a valid PDF. The region that we integrate over is the first quadrant, therefore

1 = ∫_0^∞ ∫_0^∞ c e^{−x} e^{−2y} dx dy = c ∫_0^∞ e^{−2y} [−e^{−x}]_0^∞ dy
  = c ∫_0^∞ e^{−2y} dy = c [−e^{−2y}/2]_0^∞ = c/2,

and so c = 2.

(b) Find P(X < Y). We start by describing the region corresponding to this event, namely D = {(x, y) | 0 < x < y, 0 < y < ∞}, and set up the double integral for the probability of this event:

P(X < Y) = ∬_D f(x, y) dA = ∫_0^∞ ∫_0^y 2 e^{−x} e^{−2y} dx dy = 1/3.

(c) Set up the double integral representing P(X > 1, Y < 1):

P(X > 1, Y < 1) = ∫_0^1 ∫_1^∞ 2 e^{−x} e^{−2y} dx dy = (1 − e^{−2}) e^{−1}.

(d) Find the marginal f_X(x). We have

f_X(x) = ∫_{−∞}^∞ f(x, y) dy = ∫_0^∞ 2 e^{−x} e^{−2y} dy
       = 2 e^{−x} [−e^{−2y}/2]_0^∞ = 2 e^{−x} (0 + 1/2) = e^{−x}.

A similar computation gives f_Y(y) = 2 e^{−2y}, so the marginals are both exponential. Thus f_{XY} = f_X f_Y, and therefore yes, X and Y are independent!
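The integrals in parts (a)–(c) above can be reproduced with a numerical double integral; the sketch below is illustrative only (not part of the original text) and assumes SciPy and NumPy are available.

```python
# Numerical verification of parts (a)-(c) above (a sketch; assumes SciPy/NumPy).
import numpy as np
from scipy.integrate import dblquad

f = lambda y, x: 2 * np.exp(-x) * np.exp(-2 * y)   # dblquad integrates over y first, then x

total, _ = dblquad(f, 0, np.inf, 0, np.inf)                       # should be 1, so c = 2
p_x_less_y, _ = dblquad(f, 0, np.inf, lambda x: x, np.inf)        # P(X < Y) = 1/3
p_c, _ = dblquad(f, 1, np.inf, 0, 1)                              # P(X > 1, Y < 1)
print(round(total, 4), round(p_x_less_y, 4), round(p_c, 4),
      round((1 - np.exp(-2)) * np.exp(-1), 4))
```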
All the concepts and some of the techniques we introduced for two random variables can
be extended to more than two random variables. For example, we can define joint PMF,
PDF, CDF, independence for three or more random variables. While many of the explicit
expressions can be less tractable, the case of normal variables is tractable. First we comment
on independence of several random variables.
Recall that we distinguished between jointly independent events (Definition 3.1) and pairwise
independent events.
Definition
A set of n random variables {X1 , . . . , Xn } is pairwise independent if every pair of
random variables is independent.
As for events, if the set of random variables is pairwise independent, it is not necessarily
mutually independent as defined next.
Definition

A set of n random variables {X1, ..., Xn} is mutually independent if for any sequence of numbers {x1, ..., xn}, the events {X1 ≤ x1}, ..., {Xn ≤ xn} are mutually independent events.

This definition is equivalent to the following condition on the joint cumulative distribution function F_{X1,...,Xn}(x1, ..., xn). Namely, a set of n random variables {X1, ..., Xn} is mutually independent if

(11.6.1)   F_{X1,...,Xn}(x1, ..., xn) = F_{X1}(x1) ··· F_{Xn}(xn)

for all x1, ..., xn. Note that we do not need to require that the probability distribution factorizes for all possible subsets, as in the case of n events. This is not required because Equation (11.6.1) implies factorization for any subset of 1, ..., n.
The following statement will be easier to prove later once we have the appropriate mathe-
matical tools.
Proposition 11.2

If X and Y are independent normal random variables, then X + Y is again a normal random variable.

In particular, if X ∼ N(µx, σx²) and Y ∼ N(µy, σy²), then X + Y ∼ N(µx + µy, σx² + σy²) and X − Y ∼ N(µx − µy, σx² + σy²). In general, for two independent Gaussian X and Y we have cX + dY ∼ N(cµx + dµy, c²σx² + d²σy²).
Example 11.15. Suppose T ∼ N(95, 25) and H ∼ N(65, 36) represent the grades of T. and H. in their probability class.

(a) What is the probability that their average grade will be less than 90?
(b) What is the probability that H. will have scored higher than T.?
(c) Answer question (b) if T ∼ N(90, 64) and H ∼ N(70, 225).

In (c) we can use Proposition 11.2 to see that H − T ∼ N(−20, 289), and so

P(H > T) = P(H − T > 0)
         = 1 − P(H − T < 0)
         = 1 − P(Z ≤ (0 − (−20))/17)
         ≈ 1 − Φ(1.18) ≈ 0.11900.
Secondly, we need to invert the map g, that is, solve for x1, x2 in terms of y1, y2:

x1 = √(y2),
x2 = √(y1 − y2).
11.7. Exercises
Exercise 11.1. Suppose that 2 balls are chosen without replacement from an urn consisting
of 5 white and 8 red balls. Let X equal 1 if the first ball selected is white and zero otherwise.
Let Y equal 1 if the second ball selected is white and zero otherwise.
(A) Find the probability mass function of X, Y .
(B) Find E(XY ).
(C) Is it true that E(XY ) = (EX)(EY )?
(D) Are X, Y independent?
Exercise 11.2. Suppose you roll two fair dice. Find the probability mass function of X
and Y , where X is the largest value obtained on any die, and Y is the sum of the values.
Exercise 11.3. Suppose the joint density function of X and Y is f(x, y) = 1/4 for 0 < x < 2 and 0 < y < 2.

(A) Find P(1/2 < X < 1, 2/3 < Y < 4/3).
(B) Find P(XY < 2).
(C) Find the marginal distributions f_X(x) and f_Y(y).
Exercise 11.5. Suppose X and Y are independent random variables and that X is exponential with λ = 1/4 and Y is uniform on (2, 5). Calculate the probability that 2X + Y < 8.
Exercise 11.9. Suppose that gross weekly ticket sales for UConn basketball games are
normally distributed with mean $2,200,000 and standard deviation $230,000. What is the
probability that the total gross ticket sales over the next two weeks exceeds $4,600,000?
Exercise 11.10. Suppose the joint density function of the random variables X1 and X2 is

f(x1, x2) = 4 x1 x2 if 0 < x1 < 1, 0 < x2 < 1, and 0 otherwise.
Let Y1 = 2X1 + X2 and Y2 = X1 − 3X2 . What is the joint density function of Y1 and Y2 ?
Exercise 11.11. Suppose the joint density function of the random variables X1 and X2 is

f(x1, x2) = (3/2)(x1² + x2²) if 0 < x1 < 1, 0 < x2 < 1, and 0 otherwise.

Let Y1 = X1 − 2X2 and Y2 = 2X1 + 3X2. What is the joint density function of Y1 and Y2?
Exercise 11.12. We roll two dice. Let X be the minimum of the two numbers that appear, and let Y be the maximum. Find the joint probability mass function of (X, Y), that is, P(X = i, Y = j) for i, j = 1, ..., 6. Find the marginal probability mass functions of X and Y. Finally, find the conditional probability mass function of X given that Y = 5, that is, P(X = i | Y = 5), for i = 1, ..., 6.
Exercise∗ 11.1. Let X and Y be independent exponential random variables with param-
eters λX and λY respectively. Find the cumulative distribution function and the probability
density function of U = X/Y .
Exercise∗ 11.2. Let X and Y be random variables uniformly distributed in the triangle
{(x, y) : x > 0, y > 0, x + y < 1}. Find the cumulative distribution function and the
probability density function of U = X/Y .
11.8. Selected solutions

Solution to Exercise 11.2: First we need to figure out what values X, Y can attain. Note that X can be any of 1, 2, 3, 4, 5, 6, but Y is the sum and can only be as low as 2 and as high as 12. First we make a table of all possibilities for (X, Y) given the values of the dice. Recall X is the largest of the two, and Y is the sum of them. The possible outcomes are given by the following table of P(X = x, Y = y), where each entry should be divided by 36:

X\Y    2   3   4   5   6   7   8   9  10  11  12
 1     1   0   0   0   0   0   0   0   0   0   0
 2     0   2   1   0   0   0   0   0   0   0   0
 3     0   0   2   2   1   0   0   0   0   0   0
 4     0   0   0   2   2   2   1   0   0   0   0
 5     0   0   0   0   2   2   2   2   1   0   0
 6     0   0   0   0   0   2   2   2   2   2   1
Solution to Exercise 11.3(A): We integrate the PDF over the rectangle 1/2 < x < 1, 2/3 < y < 4/3 and get

∫_{1/2}^1 ∫_{2/3}^{4/3} (1/4) dy dx = (1/4)(1 − 1/2)(4/3 − 2/3) = 1/12.
Solution to Exercise 11.3(B): We need to find the region that is within 0 < x, y < 2 and satisfies y < 2/x. (Try to draw the region.) We get two regions from this: one with bounds 0 < x < 1, 0 < y < 2, and the other with bounds 1 < x < 2, 0 < y < 2/x. Then

P(XY < 2) = ∫_0^1 ∫_0^2 (1/4) dy dx + ∫_1^2 ∫_0^{2/x} (1/4) dy dx
          = 1/2 + ∫_1^2 1/(2x) dx
          = 1/2 + (ln 2)/2.
Solution to Exercise 11.5: Draw the region {2X + Y < 8}, which corresponds to 0 ≤ x, 2 < y < 5, and y < 8 − 2x. Drawing a picture of the region, we get the corresponding bounds 2 < y < 5 and 0 < x < 4 − y/2, so that

P(2X + Y < 8) = ∫_2^5 ∫_0^{4−y/2} (1/12) e^{−x/4} dx dy
              = (1/3) ∫_2^5 (1 − e^{y/8−1}) dy
              = 1 − (8/3)(e^{−3/8} − e^{−3/4}).
Solution to Exercise 11.6(A): We have

f_X(x) = 5x⁴ for 0 ≤ x ≤ 1, and 0 otherwise,
f_Y(y) = (10/3) y (1 − y³) for 0 ≤ y ≤ 1, and 0 otherwise.
Solution to Exercise 11.9: If W = X1 + X2 is the total of the gross ticket sales over the next two weeks, then W is normal with mean 2,200,000 + 2,200,000 = 4,400,000 and variance 230,000² + 230,000². Thus the standard deviation of W is √(230,000² + 230,000²) ≈ 325,269.12. Hence

P(W > 4,600,000) = P(Z > (4,600,000 − 4,400,000)/325,269.12)
                 = P(Z > 0.6149)
                 ≈ 1 − Φ(0.61) ≈ 0.27.
Solution to Exercise 11.10 (the Jacobian computation):

J(x1, x2) = | 2   1 |
            | 1  −3 |  = −7.

Solution to Exercise 11.11 (the Jacobian computation):

J(x1, x2) = | 1  −2 |
            | 2   3 |  = 7.
Solution to Exercise 11.12: The joint probability mass function is P(X = i, Y = j) = 1/36 if i = j, 1/18 if i < j, and 0 if i > j. For example, the last two rows of the table are

        j:   1    2    3    4    5      6
 i = 5:      0    0    0    0   1/36   1/18
 i = 6:      0    0    0    0    0     1/36

The marginal probability mass functions are

P(X = i) = (13 − 2i)/36 for i = 1, ..., 6, that is, 11/36, 9/36, 7/36, 5/36, 3/36, 1/36,
P(Y = j) = (2j − 1)/36 for j = 1, ..., 6, that is, 1/36, 3/36, 5/36, 7/36, 9/36, 11/36.

The conditional probability mass function of X given that Y = 5, P(X = i | Y = 5), for i = 1, ..., 6, is

P(X = i | Y = 5) = 2/9 for i = 1, 2, 3, 4,   P(X = 5 | Y = 5) = 1/9,   P(X = 6 | Y = 5) = 0.
Expectations

As we discussed earlier, for two random variables X and Y and a function g(x, y) we can consider g(X, Y) as a random variable, and therefore, in the discrete case,

(12.1.1)   E g(X, Y) = Σ_{x,y} g(x, y) p(x, y),

with an analogous integral formula, (12.1.2), in the jointly continuous case. Applying this to g(x, y) = x + y, we can write E(X + Y) as a sum of two integrals. If we now set g(x, y) = x, we see the first integral on the right is EX, and similarly the second is EY. Therefore

E(X + Y) = EX + EY.
Proposition 12.1
If X and Y are two independent random variables, then for any functions h and k we
have
E[h(X)k(Y )] = Eh(X) · Ek(Y ).
In particular, E(XY ) = (EX)(EY ).
Proof. We only prove the statement in the case where X and Y are jointly continuous; the case of discrete random variables is left as Exercise∗ 12.4. By Equation (12.1.2) with g(x, y) = h(x)k(y), and recalling that the joint density function factors by independence of X and Y, the double integral splits into a product of two single integrals, which gives the result.
Note that we can easily extend Proposition 12.1 to any number of independent random
variables.
Definition 12.1

The covariance of two random variables X and Y is defined by

Cov(X, Y) = E[(X − EX)(Y − EY)].

As with the variance, Cov(X, Y) = E(XY) − (EX)(EY). It follows that if X and Y are independent, then E(XY) = (EX)(EY), and then Cov(X, Y) = 0.
Proposition 12.2
Suppose X, Y and Z are random variables and a and c are constants. Then
(1) Cov (X, X) = Var (X).
(2) if X and Y are independent, then Cov (X, Y ) = 0.
(3) Cov (X, Y ) = Cov (Y, X).
(4) Cov (aX, Y ) = a Cov (X, Y ).
(5) Cov (X + c, Y ) = Cov (X, Y ).
(6) Cov (X + Y, Z) = Cov (X, Z) + Cov (Y, Z).
More generally,

Cov(Σ_{i=1}^m a_i X_i, Σ_{j=1}^n b_j Y_j) = Σ_{i=1}^m Σ_{j=1}^n a_i b_j Cov(X_i, Y_j).
Note
Var(aX + bY )
= E[((aX + bY ) − E(aX + bY ))2 ]
= E[(a(X − EX) + b(Y − EY ))2 ]
= E[a2 (X − EX)2 + 2ab(X − EX)(Y − EY ) + b2 (Y − EY )2 ]
= a2 Var X + 2ab Cov(X, Y ) + b2 Var Y.
Proposition 12.3

If X and Y are independent, then Var(X + Y) = Var X + Var Y.

Proof. We have

Var(X + Y) = Var X + Var Y + 2 Cov(X, Y) = Var X + Var Y,

since Cov(X, Y) = 0 for independent random variables.
Example 12.1. Recall that a binomial random variable is the sum of n independent Bernoulli random variables with parameter p. Consider the sample mean

X̄ := Σ_{i=1}^n X_i / n,

where the {X_i}_{i=1}^n are independent and all have the same distribution. Then EX̄ = EX1 = p and Var X̄ = Var X1 / n = p(1 − p)/n.
Here the conditional density is defined by Equation (11.3) in Section 11.3. We can think of E[X | Y = y] as the mean value of X when Y is fixed at y. Note that unlike the expectation of a random variable, which is a number, the conditional expectation E[X | Y] (obtained by letting y vary with Y) is a random variable, with randomness inherited from Y, not X.
X\Y    0     1
 0    0.2   0.7
 1     0    0.1
Example 12.3. Suppose X, Y are independent exponential random variables with parameter λ = 1. Set up a double integral that represents E[X²Y].
Proof. We only prove (4). First observe that for any random variables X and Y we can define the normalized random variables as follows:

U := (X − EX)/√(Var X),
V := (Y − EY)/√(Var Y).

Then EU = EV = 0 and EU² = EV² = 1.

Note that we can use (3) since U = aX + b, where a = 1/√(Var X) > 0 and b = −EX/√(Var X), and similarly V = cY + d, where c = 1/√(Var Y) > 0 and d = −EY/√(Var Y), so by (3)

ρ(U, V) = ρ(X, Y).

Therefore

ρ(X, Y) = ρ(U, V) = Cov(U, V)/√(Var(U) Var(V)) = Cov(U, V).

Note that

Cov(U, V) = E(UV) − E(U)E(V) = E(UV).

Therefore

(12.3.1)   ρ(X, Y) = E(UV).

Now recall that

ab ≤ (a² + b²)/2

for any real numbers a and b. This follows from the fact that (a − b)² ≥ 0. Moreover, ab = (a² + b²)/2 if and only if a = b.

Applying this with a = U and b = V to Equation (12.3.1), we see that

ρ(X, Y) = E(UV) ≤ E(U²/2 + V²/2) = 1.

Note that the same is true for −X and Y, that is, −ρ(X, Y) = ρ(−X, Y) ≤ 1, so that ρ(X, Y) ≥ −1.
Example 12.5. Suppose X, Y are random variables whose joint PDF is given by

f(x, y) = 1/y for 0 < y < 1, 0 < x < y, and 0 otherwise.
Proposition 12.4

For the conditional expectation of X given Y = y it holds that
(i) for any a, b ∈ R, E[aX + b | Y = y] = a E[X | Y = y] + b;
(ii) Var(X | Y = y) = E[X² | Y = y] − (E[X | Y = y])².
Example 12.6. Let X and Y be random variables with the joint PDF

f_{XY}(x, y) = (1/18) e^{−(x+y)/6} if 0 < y < x, and 0 otherwise.

In order to find Var(X | Y = 2), we need to compute the conditional PDF of X given Y = 2, i.e.

f_{X|Y=2}(x | 2) = f_{XY}(x, 2)/f_Y(2).

To this purpose, we compute first the marginal of Y:

f_Y(y) = ∫_y^∞ (1/18) e^{−(x+y)/6} dx = (1/3) e^{−y/6} [−e^{−x/6}]_y^∞ = (1/3) e^{−y/3} for y > 0.

Then we have

f_{X|Y=2}(x | 2) = (1/6) e^{(2−x)/6} if x > 2, and 0 otherwise.

Now it only remains to find E[X | Y = 2] and E[X² | Y = 2]. Applying integration by parts twice we have

E[X² | Y = 2] = ∫_2^∞ (x²/6) e^{(2−x)/6} dx
             = [−x² e^{(2−x)/6} − 12x e^{(2−x)/6} − 72 e^{(2−x)/6}]_2^∞ = 4 + 24 + 72 = 100.
12.4. Exercises
Exercise 12.1. Suppose the joint distribution for X and Y is given by the joint probability
mass function shown below:
Y \X 0 1
0 0 0.3
1 0.5 0.2
Exercise 12.2. Let X and Y be random variables whose joint probability density function is given by

f(x, y) = x + y for 0 < x < 1, 0 < y < 1, and 0 otherwise.
Exercise 12.3. Let X be normally distributed with mean 1 and variance 9. Let Y be exponentially distributed with λ = 2. Suppose X and Y are independent. Find E[(X − 1)² Y]. (Hint: use properties of expectations.)
Exercise∗ 12.2. Show that if random variables X and Y are uncorrelated, then Var (X + Y ) =
Var (X) + Var (Y ). Note that this is a more general statement than Proposition 12.3 since
independent variables are uncorrelated.
Exercise∗ 12.3. Suppose U and V are independent random variables taking values 1 and
−1 with probability 1/2. Show that X := U + V and Y := U − V are dependent but
uncorrelated random variables.
Exercise∗ 12.4. Use Equation (12.1.1) to prove Proposition 12.1 in the case when X and
Y are independent discrete random variables.
Solution to Exercise 12.1: Adding the marginal distributions to the table gives

Y\X     0     1    p_Y
 0      0    0.3   0.3
 1     0.5   0.2   0.7
p_X    0.5   0.5

Then

EXY = (0·0)·0 + (0·1)·0.5 + (1·0)·0.3 + (1·1)·0.2 = 0.2,
EX = 0·0.5 + 1·0.5 = 0.5,
EY = 0·0.3 + 1·0.7 = 0.7.

Solution to Exercise 12.3: Using independence and properties of expectations,

E[(X − 1)² Y] = E[(X − 1)²] · E[Y] = Var(X) · (1/λ) = 9/2 = 4.5.
CHAPTER 13

Moment generating functions

For a binomial random variable X with parameters n and p, the moment generating function is

m_X(t) = E e^{tX} = Σ_{k=0}^n e^{tk} (n choose k) p^k (1 − p)^{n−k}
       = Σ_{k=0}^n (n choose k) (p e^t)^k (1 − p)^{n−k} = (p e^t + (1 − p))^n

by the binomial formula.
Example 13.6 (General normal). Suppose X ∼ N(µ, σ²); then we can write X = µ + σZ, and therefore

m_X(t) = E e^{tX} = E e^{tµ} e^{tσZ} = e^{tµ} m_Z(tσ) = e^{tµ} e^{(tσ)²/2} = e^{tµ + t²σ²/2}.
Proposition 13.1

Suppose X1, ..., Xn are mutually independent random variables, and the random variable Y is defined by Y = X1 + ... + Xn. Then

m_Y(t) = m_{X1}(t) · ... · m_{Xn}(t).

Proof. By independence,

m_Y(t) = E(e^{tX1} · ... · e^{tXn}) = E e^{tX1} · ... · E e^{tXn} = m_{X1}(t) · ... · m_{Xn}(t).
Proposition 13.2

Suppose for two random variables X and Y we have m_X(t) = m_Y(t) < ∞ for all t in an interval; then X and Y have the same distribution.

We will not prove this, but this statement is essentially the uniqueness of the Laplace transform L. Recall that the Laplace transform of a function f(x) is defined for all positive real numbers s > 0 by

(Lf)(s) := ∫_0^∞ f(x) e^{−sx} dx.

Thus if X is a continuous random variable with a PDF such that f_X(x) = 0 for x < 0, then

m_X(t) = ∫_0^∞ e^{tx} f_X(x) dx = (Lf_X)(−t).
Proposition 13.1 allows us to show some of the properties of sums of independent random variables we proved or stated before.

One problem with the moment generating function is that it might be infinite. One way to get around this, at the cost of considerable work, is to use the characteristic function ϕ_X(t) = E e^{itX}, where i = √(−1). This is always finite, and it is the analogue of the Fourier transform.
Definition (Joint MGF)
This example is an illustration of why m_X(t) is called the moment generating function. Namely, we can use it to find all the moments of X by differentiating m(t) and then evaluating at t = 0. Note that

m_X′(t) = (d/dt) E(e^{tX}) = E((d/dt) e^{tX}) = E(X e^{tX}).

Now evaluate at t = 0 to get

m_X′(0) = E(X e^{0·X}) = E[X].

Similarly

m_X″(t) = (d/dt) E(X e^{tX}) = E(X² e^{tX}),

so that

m_X″(0) = E(X² e^0) = E[X²].
Example 13.10. Suppose X is a discrete random variable and has the MGF

m_X(t) = (1/7) e^{2t} + (3/7) e^{3t} + (2/7) e^{5t} + (1/7) e^{8t}.

What is the PMF of X? Find EX.

Note that this MGF does not match any of the known MGFs directly. Reading off from the MGF we guess

(1/7) e^{2t} + (3/7) e^{3t} + (2/7) e^{5t} + (1/7) e^{8t} = Σ_{i=1}^4 e^{t x_i} p(x_i),

so that p(2) = 1/7, p(3) = 3/7, p(5) = 2/7 and p(8) = 1/7.

To find E[X] we can use Proposition 13.3 by taking the derivative of the moment generating function as follows:

m′(t) = (2/7) e^{2t} + (9/7) e^{3t} + (10/7) e^{5t} + (8/7) e^{8t},

so that

E[X] = m′(0) = 2/7 + 9/7 + 10/7 + 8/7 = 29/7.
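Differentiating a moment generating function and evaluating at zero is easy to check symbolically; the sketch below is illustrative (not part of the original text) and assumes SymPy is available.

```python
# Symbolic check of Example 13.10 (a sketch; assumes SymPy).
import sympy as sp

t = sp.symbols('t')
m = sp.Rational(1, 7)*sp.exp(2*t) + sp.Rational(3, 7)*sp.exp(3*t) \
    + sp.Rational(2, 7)*sp.exp(5*t) + sp.Rational(1, 7)*sp.exp(8*t)

EX = sp.diff(m, t).subs(t, 0)        # first moment m'(0)
EX2 = sp.diff(m, t, 2).subs(t, 0)    # second moment m''(0)
print(EX, EX2)                        # 29/7 and 145/7
```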
13.3. Exercises
Exercise 13.1. Suppose that you have a fair 4-sided die, and let X be the random variable
representing the value of the number rolled.
(a) Write down the moment generating function for X.
(b) Use this moment generating function to compute the first and second moments of X.
Exercise 13.2. Let X be a random variable whose probability density function is given by

f_X(x) = e^{−2x} + (1/2) e^{−x} for x > 0, and 0 otherwise.
Exercise 13.3. Suppose that a mathematician determines that the revenue the UConn Dairy Bar makes in a week is a random variable, X, with moment generating function

m_X(t) = 1/(1 − 2500t)⁴.

Find the standard deviation of the revenue the UConn Dairy Bar makes in a week.
Exercise 13.4. Let X and Y be two independent random variables with respective moment generating functions

m_X(t) = 1/(1 − 5t) if t < 1/5,   m_Y(t) = 1/(1 − 5t)² if t < 1/5.

Find E[(X + Y)²].
Exercise 13.5. Suppose X and Y are independent Poisson random variables with param-
eters λx , λy , respectively. Find the distribution of X + Y .
Exercise 13.6. True or false? If X ∼ Exp (λx ) and Y ∼ Exp (λy ) then X + Y ∼
Exp (λx + λy ). Justify your answer.
Exercise∗ 13.1. Suppose the moment generating function m_X(t) is defined on an interval (−a, a). Show that

m_X(t/n)ⁿ −−−→ e^{tEX} as n → ∞.

Exercise∗ 13.2. Suppose the moment generating function m_X(t) is defined on an interval (−a, a). Assume also that EX = 0 and Var(X) = 1. Show that

m_X(t/√n)ⁿ −−−→ e^{t²/2} as n → ∞.
13.4. Selected solutions

Solution to Exercise 13.1:

m_X(t) = E e^{tX} = (1/4) e^{1·t} + (1/4) e^{2·t} + (1/4) e^{3·t} + (1/4) e^{4·t}
       = (1/4)(e^t + e^{2t} + e^{3t} + e^{4t}),
m_X′(t) = (1/4)(e^t + 2e^{2t} + 3e^{3t} + 4e^{4t}),
m_X″(t) = (1/4)(e^t + 4e^{2t} + 9e^{3t} + 16e^{4t}),

so

EX = m_X′(0) = (1 + 2 + 3 + 4)/4 = 5/2

and

EX² = m_X″(0) = (1 + 4 + 9 + 16)/4 = 15/2.
Solution to Exercise 13.2:

m_X(t) = E e^{tX} = ∫_0^∞ e^{tx} (e^{−2x} + (1/2) e^{−x}) dx
       = [ (1/(t − 2)) e^{(t−2)x} + (1/(2(t − 1))) e^{(t−1)x} ]_{x=0}^{x=∞}
       = (0 − 1/(t − 2)) + (0 − 1/(2(t − 1)))
       = 1/(2 − t) + 1/(2(1 − t)) = (4 − 3t)/(2(2 − t)(1 − t)),   for t < 1.

Then

m_X′(t) = 1/(2 − t)² + 1/(2(1 − t)²),
m_X″(t) = 2/(2 − t)³ + 1/(1 − t)³.
Solution to Exercise 13.4: First recall that if we let W = X + Y, then using that X, Y are independent we see that

m_W(t) = m_{X+Y}(t) = m_X(t) m_Y(t) = 1/(1 − 5t)³.

Recall that E[W²] = m_W″(0), which we can find from

m_W′(t) = 15/(1 − 5t)⁴,
m_W″(t) = 300/(1 − 5t)⁵,

thus

E[W²] = m_W″(0) = 300/(1 − 0)⁵ = 300.
Solution to Exercise 13.5: Since X ∼ Pois(λx) and Y ∼ Pois(λy), their moment generating functions are m_X(t) = e^{λx(e^t − 1)} and m_Y(t) = e^{λy(e^t − 1)}. Then, by independence,

m_{X+Y}(t) = m_X(t) m_Y(t) = e^{λx(e^t − 1)} e^{λy(e^t − 1)} = e^{(λx + λy)(e^t − 1)}.

Thus X + Y ∼ Pois(λx + λy).
Solution to Exercise 13.6: We will use Proposition 13.2. Namely, we first find the MGF of X + Y and compare it to the MGF of a random variable V ∼ Exp(λx + λy). The MGF of V is

m_V(t) = (λx + λy)/(λx + λy − t) for t < λx + λy.

By independence of X and Y,

m_{X+Y}(t) = m_X(t) m_Y(t) = (λx/(λx − t)) · (λy/(λy − t)),

but

(λx + λy)/(λx + λy − t) ≠ (λx/(λx − t)) · (λy/(λy − t)),

and hence the statement is false.
Solution to Exercise∗ 13.1: It is enough to show that

n ln m_X(t/n) −−−→ tEX as n → ∞.

Setting s = 1/n, we have

lim_{n→∞} n ln m_X(t/n) = lim_{s→0} ln(m_X(st))/s
  = lim_{s→0} t m_X′(st)/m_X(st)          (L'Hôpital's rule)
  = t m_X′(0)/m_X(0) = tEX/E1 = tEX.

Solution to Exercise∗ 13.2: It is enough to show that

n ln m_X(t/√n) −−−→ t²/2 as n → ∞.

Setting s = 1/√n, we have

lim_{n→∞} n ln m_X(t/√n) = lim_{n→∞} ln m_X(t/√n)/(1/n) = lim_{s→0} ln(m_X(st))/s²
  = lim_{s→0} (t m_X′(st)/m_X(st))/(2s)    (L'Hôpital's rule)
  = (t/2) lim_{s→0} m_X′(st)/s             (since m_X(0) = 1)
  = (t/2) lim_{s→0} t m_X″(st)             (L'Hôpital's rule)
  = t² m_X″(0)/2 = t²/2,

since by Proposition 13.3 we have m_X″(0) = EX² = 1, as we are given that Var X = 1 and EX = 0.
CHAPTER 14

Limit laws and modes of convergence

Suppose we have a probability space consisting of a sample space Ω, a σ-field F, and a probability P defined on F. We consider a collection of random variables Xi defined on the same probability space.

Definition 14.1 (Sums of i.i.d. random variables)

We say the sequence of random variables {Xi}_{i=1}^∞ are i.i.d. (independent and identically distributed) if they are (mutually) independent and all have the same distribution. We call Sn = Σ_{i=1}^n Xi the partial sum process.
In the case of continuous or discrete random variables, having the same distribution means
that they all have the same probability density.
Theorem 14.1 (Strong law of large numbers (SLLN))

Suppose {Xi}_{i=1}^∞ is a sequence of i.i.d. random variables with E|Xi| < ∞, and let µ = EXi. Then

Sn/n −−−→ µ as n → ∞.

The convergence here means that the sample mean Sn(ω)/n → µ for every ω ∈ Ω except possibly for a set of ω's of probability 0.

The proof of Theorem 14.1 is quite hard, and we prove a weaker version, the weak law of large numbers (WLLN).
Theorem 14.2 (Weak law of large numbers (WLLN))

Suppose {Xi}_{i=1}^∞ is a sequence of i.i.d. random variables with finite first and second moments, that is, E|X1| and Var X1 are finite. Then for every a > 0,

P(|Sn/n − EX1| > a) −−−→ 0 as n → ∞.

It is not even that easy to give an example of random variables that satisfy the WLLN but not the SLLN. Before proving the WLLN, we need an inequality called Markov's inequality. We cover this in more detail later, see Proposition 15.3.
Proposition 14.1 (Markov's inequality)

If Y ≥ 0 and A > 0, then P(Y > A) ≤ EY/A.

Proof. We only consider the case of continuous random variables, the case of discrete random variables being similar. We have

P(Y > A) = ∫_A^∞ f_Y(y) dy ≤ ∫_A^∞ (y/A) f_Y(y) dy
         ≤ (1/A) ∫_{−∞}^∞ y f_Y(y) dy = EY/A.
Proof of Theorem 14.2. Recall ESn = nEX1, and by independence Var Sn = n Var X1, so Var(Sn/n) = Var X1/n. We have

P(|Sn/n − EX1| > a) = P(|Sn/n − E(Sn/n)| > a)
                    = P((Sn/n − E(Sn/n))² > a²)
                    ≤ E[(Sn/n − E(Sn/n))²]/a²        (Markov)
                    = Var(Sn/n)/a² = (Var X1/n)/a² → 0.

The inequality step follows from Markov's inequality (Proposition 14.1) with A = a² and Y = (Sn/n − E(Sn/n))².
Theorem 14.3 (Central limit theorem (CLT))

Suppose {Xi}_{i=1}^∞ is a sequence of i.i.d. random variables such that EXi² < ∞. Let µ = EXi and σ² = Var Xi. Then, for Z ∼ N(0, 1),

P(a ≤ (Sn − nµ)/(σ√n) ≤ b) −−−→ P(a ≤ Z ≤ b) as n → ∞.
Example 14.1 (Theorem 9.1). If the Xi are i.i.d. Bernoulli random variables, so that Sn is a binomial, this is just the normal approximation to the binomial as in Theorem 9.1.

Example 14.2. Suppose in the CLT we have Xi ∼ Pois(λ); then Sn ∼ Pois(λn), so we see that (Pois(λn) − nλ)/√(λn) is close in distribution to N(0, 1). This is what is behind the normal approximation to the Poisson distribution, which can be illustrated by looking at the Poisson distributions below.

[Figure: comparison of the shapes of the Poisson distributions for λ = 64, λ = 100 and λ = 144.]
Example 14.3. Suppose we roll a die 3600 times. Let Xi be the number showing on the i-th roll. We know Sn/n will be close to 3.5. What is the (approximate) probability it differs from 3.5 by more than 0.05?

We want to estimate

P(|Sn/n − 3.5| > 0.05).

We rewrite this as

P(|Sn − nEX1| > (0.05)(3600)) = P( |Sn − nEX1|/(√n √(Var X1)) > 180/(60 √(35/12)) ).

Note that 180/(60√(35/12)) ≈ 1.7566, so this probability can be approximated by P(|Z| > 1.7566) ≈ 2(1 − Φ(1.7566)) ≈ 0.079.
Example 14.4. Suppose the lifetime of a human has expectation 72 and variance 36. What is the (approximate) probability that the average of the lifetimes of 100 people exceeds 73?

We want to estimate

P(Sn/n > 73) = P(Sn > 7300)
             = P( (Sn − nEX1)/(√n √(Var X1)) > (7300 − (100)(72))/(√100 √36) )
             ≈ P(Z > 1.667) ≈ 0.047.
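A short simulation shows the CLT estimate of Example 14.4 in action; the sketch below is illustrative only (not part of the original text), assumes NumPy is available, and uses a gamma distribution as an arbitrary stand-in for the lifetime distribution, since the CLT estimate depends only on the mean and variance.

```python
# Monte Carlo illustration of Example 14.4 (a sketch; assumes NumPy).
import numpy as np

rng = np.random.default_rng(2)
mean, var, n, trials = 72.0, 36.0, 100, 100_000
shape, scale = mean**2 / var, var / mean          # gamma parameters with the right moments
samples = rng.gamma(shape, scale, size=(trials, n))
averages = samples.mean(axis=1)
print(round((averages > 73).mean(), 3))           # close to the CLT value 0.047
```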
Sketch of the proof of Theorem 14.3. A typical proof of this theorem uses characteristic functions ϕ_X(t) = E e^{itX} (we mentioned them before Definition 13.1), which are defined for all t; but if the moment generating functions are defined on an interval (−a, a), a similar argument can be carried out with them (see Exercise∗ 13.2).

14.2.∗ Convergence of random variables

In the limit laws we discussed earlier we used different modes of convergence of sequences of random variables, which we now discuss separately.
Definition (Convergence in probability)

Let {Xn}_{n∈N} be a sequence of random variables on a probability space (Ω, F, P), and let X be a random variable on the same probability space. We say that the sequence {Xn}_{n∈N} converges in probability to X if for any ε > 0 we have

P(|Xn − X| > ε) −−−→ 0 as n → ∞.

Example 14.5 (Weak law of large numbers). The convergence used in the weak law of large numbers (Theorem 14.2) is convergence in probability. Namely, it says that the sample mean Sn/n converges to EX1 in probability.

Definition (Almost sure convergence)

Let {Xn}_{n∈N} be a sequence of random variables on a probability space (Ω, F, P), and let X be a random variable on the same probability space. We say that the sequence {Xn}_{n∈N} converges almost surely (or almost everywhere, or with probability 1) to X if

P(Xn −−−→ X as n → ∞) = 1,

and we denote it by Xn −−−→ X a.s. as n → ∞.
This mode of convergence is a slightly modified version of the concept of pointwise convergence of functions. Recall that a random variable is a function from the sample space to R. Pointwise convergence would require that

Xn(ω) −−−→ X(ω) as n → ∞

for all ω ∈ Ω. This is usually too much to assume, and this is why we have almost sure convergence. Namely, define the event

A := {ω ∈ Ω : Xn(ω) −−−→ X(ω) as n → ∞};

almost sure convergence requires only that P(A) = 1.

Example 14.6 (Strong law of large numbers). This is the type of convergence appearing in the Strong law of large numbers (Theorem 14.1). Note that we mentioned this right after the theorem, and now we can say that the conclusion of the SLLN is that the sample mean Sn/n converges to EX1 almost surely.
Definition (Convergence in distribution)

Let {Xn}_{n∈N} be a sequence of random variables on a probability space (Ω, F, P), and denote by Fn(x) their distribution functions. Let X be a random variable with distribution function F_X(x). We say that the sequence {Xn}_{n∈N} converges in distribution (or converges weakly) to X if

Fn(x) −−−→ F_X(x) as n → ∞

for all x ∈ R at which F_X(x) is continuous, and we denote it by Xn →d X.

Note that convergence in distribution only involves the distributions of the random variables. In particular, the random variables need not even be defined on the same probability space; moreover, we do not even need the random variables themselves to define convergence in distribution. This is in contrast to the other modes of convergence we have introduced.

Example 14.7 (Central limit theorem). The Central limit theorem (Theorem 14.3) says that (Sn − nµ)/(σ√n) converges to the standard normal in distribution.

Example 14.8. Suppose Xn ∼ Binom(n, λ/n), n > λ > 0. Then Xn →d Pois(λ) as n → ∞.
Definition (Convergence in the pth mean)

Let {Xn}_{n∈N} be a sequence of random variables on a probability space (Ω, F, P), and let X be a random variable on the same probability space. For p ≥ 1 we say that the sequence {Xn}_{n∈N} converges in the pth mean (or in the L^p-norm) to X if

lim_{n→∞} E|Xn − X|^p = 0,

provided that E|Xn|^p and E|X|^p exist for all n, and we denote it by Xn →^{L^p} X as n → ∞.

In particular, for p = 1 we say that Xn converges in mean, and for p = 2 we say that Xn converges in mean square. While we have not used this type of convergence in this course, it is used widely in probability and statistics. One of the tools used for this type of convergence is Jensen's inequality.
Relations between different modes of convergence
1. For p > q > 1, convergence in the pth mean implies convergence in qth mean.
2. Convergence in mean implies convergence in probability.
3. Almost sure convergence implies convergence in probability.
4. Convergence in probability implies convergence in distribution.
Example 14.9. If 10 fair dice are rolled, find the approximate probability that the sum obtained is between 30 and 40, inclusive.

We will use the ±0.5 continuity correction because these are discrete random variables. Let Xi denote the value of the i-th die. Recall that

E(Xi) = 7/2,   Var(Xi) = 35/12.

Take X = X1 + ··· + X10 to be their sum. To apply the CLT we have

nµ = 10 · 7/2 = 35,
σ√n = √(350/12),

thus using the continuity correction we have

P(29.5 ≤ X ≤ 40.5) = P( (29.5 − 35)/√(350/12) ≤ (X − 35)/√(350/12) ≤ (40.5 − 35)/√(350/12) )
                   ≈ P(−1.0184 ≤ Z ≤ 1.0184)
                   = Φ(1.0184) − Φ(−1.0184)
                   = 2Φ(1.0184) − 1 ≈ 0.692.
Example 14.10. Your instructor has 1000 probability final exams that need to be graded. The times required to grade the exams are i.i.d. random variables with mean 20 minutes and standard deviation 4 minutes. Approximate the probability that your instructor will be able to grade at least 25 exams in the first 450 minutes of work.

Denote by Xi the time it takes to grade exam i. Then X = X1 + ··· + X25 is the time it takes to grade the first 25 exams. We want P(X ≤ 450). To apply the CLT we have

nµ = 25 · 20 = 500,
σ√n = 4√25 = 20.

Thus

P(X ≤ 450) = P( (X − 500)/20 ≤ (450 − 500)/20 )
           ≈ P(Z ≤ −2.5)
           = 1 − Φ(2.5) = 0.006.
14.4. Exercises
Exercise 14.1. In a 162-game season, find the approximate probability that a team with
a 0.5 chance of winning will win at least 87 games.
Exercise 14.2. An individual student's MATH 3160 final exam score at UConn is a random variable with mean 75 and variance 25. How many students would have to take the examination to ensure with probability at least 0.9 that the class average would be within 5 of 75?
Exercise 14.3. Let X1 , X2 , ..., X100 be independent exponential random variables with
parameter λ = 1. Use the central limit theorem to approximate
100
!
X
P Xi > 90 .
i=1
Exercise 14.4. Suppose an insurance company has 10,000 automobile policy holders. The
expected yearly claim per policy holder is $240, with a standard deviation of $800. Approx-
imate the probability that the total yearly claim is greater than $2,500,000.
Exercise 14.5. Suppose that the checkout time at the UConn dairy bar has a mean of 5
minutes and a standard deviation of 2 minutes. Estimate the probability to serve at least
36 customers during a 3-hour and a half shift.
Exercise 14.6. Shabazz Napier is a basketball player in the NBA. His expected number
of points per game is 15 with a standard deviation of 5 points per game. The NBA season
is 82 games long. Shabazz is guaranteed a ten million dollar raise next year if he can score a
total of 1300 points this season. Approximate the probability that Shabazz will get a raise
next season.
Exercise∗ 14.1. Assuming that the moment generating functions are defined on an interval
(−a, a), use Exercise∗ 13.1 to give another proof of the WLLN.
14.5. Selected solutions
Solution to Exercise 14.1: Let Xi be 1 if the team wins the ith game and 0 if the
team loses. This is a Bernoulli random variable with p = 0.5. Thus µ = p = 0.5 and
σ 2 = p (1 − p) = (0.5)2 . Then
X = Σ_{i=1}^{162} Xi
is the number of games won in the season. Using the CLT with
nµ = 162 · 0.5 = 81 and σ√n = 0.5√162 ≈ 6.36,
then
P( Σ_{i=1}^{162} Xi ≥ 87 ) = P(X ≥ 86.5) = P( (X − 81)/6.36 ≥ (86.5 − 81)/6.36 ) ≈ P(Z ≥ 0.86) ≈ 0.1949.
Solution to Exercise 14.2: The class average has mean 75 and standard deviation 5/√n, so by the CLT
P(|X̄ − 75| ≤ 5) ≈ P(|Z| ≤ 5/(5/√n)) = P(|Z| ≤ √n) ≥ 0.9,
which requires √n ≥ 1.645, and therefore n = 3.
Solution to Exercise 14.3: Since λ = 1 then EXi = 1 and Var(Xi ) = 1. Use the CLT
with
nµ = 100 · 1 = 100 and σ√n = 1 · √100 = 10.
P( Σ_{i=1}^{100} Xi ≥ 90 ) = P( (Σ_{i=1}^{100} Xi − 100 · 1)/(1 · √100) ≥ (90 − 100 · 1)/(1 · √100) )
≈ P(Z ≥ −1) ≈ 0.8413.
Solution to Exercise 14.5: Let Xi be the time it takes to check out customer i. Then
X = X1 + · · · + X36
is the time it takes to check out 36 customers. We want to estimate P(X ≤ 210). Use the
CLT with
nµ = 36 · 5 = 180 and σ√n = 2√36 = 12.
Thus
P(X ≤ 210) = P( (X − 180)/12 ≤ (210 − 180)/12 ) ≈ P(Z ≤ 2.5) = Φ(2.5) ≈ 0.9938.
Solution to Exercise 14.6: Let Xi be the number of points scored in game i and let
X = X1 + · · · + X82 be the season total. To apply the CLT we have nµ = 82 · 15 = 1230 and
σ√n = 5√82 ≈ 45.28. Thus
P(X ≥ 1300) = P( (X − 1230)/45.28 ≥ (1300 − 1230)/45.28 ) ≈ P(Z ≥ 1.55) = 1 − Φ(1.55) ≈ 1 − 0.9394 = 0.0606.
CHAPTER 15
Probability inequalities
We already used several types of inequalities, and in this chapter we give a more systematic
description of the inequalities and bounds used in probability and statistics.
Boole's inequality (or the union bound) states that for any at most countable collection of
events, the probability that at least one of the events happens is no greater than the sum of
the probabilities of the events in the collection.
Proposition 15.1 (Boole's inequality). For any events E1, E2, . . . we have
P( ∪_i Ei ) ≤ Σ_i P(Ei).
Proof. We only give a proof for a finite collection of events, and we use mathematical
induction on the number of events.
For n = 1 we see that
P(E1) ≤ P(E1).
Suppose that for some n and any collection of events E1 , ..., En we have
P( ∪_{i=1}^{n} Ei ) ≤ Σ_{i=1}^{n} P(Ei).
Recall that by (2.1.1) for any events A and B we have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
We apply it to A = ∪_{i=1}^{n} Ei and B = E_{n+1}; using the associativity of the union,
∪_{i=1}^{n+1} Ei = A ∪ B, we get that
P( ∪_{i=1}^{n+1} Ei ) = P( ∪_{i=1}^{n} Ei ) + P(E_{n+1}) − P( ( ∪_{i=1}^{n} Ei ) ∩ E_{n+1} ).
Since the last probability is nonnegative,
P( ∪_{i=1}^{n+1} Ei ) ≤ P( ∪_{i=1}^{n} Ei ) + P(E_{n+1}).
Thus using the induction hypothesis we see that
P( ∪_{i=1}^{n+1} Ei ) ≤ Σ_{i=1}^{n} P(Ei) + P(E_{n+1}) = Σ_{i=1}^{n+1} P(Ei).
Boole's inequality is the first of a family of bounds known as the Bonferroni inequalities. For
events E1, . . . , En define
S1 := Σ_{i=1}^{n} P(Ei),
S2 := Σ_{1≤i<j≤n} P(Ei ∩ Ej),
Sk := Σ_{1≤i1<···<ik≤n} P(E_{i1} ∩ · · · ∩ E_{ik}), k = 3, . . . , n.
The Bonferroni inequalities state that for odd k,
P( ∪_{i=1}^{n} Ei ) ≤ Σ_{j=1}^{k} (−1)^{j−1} Sj,
while for even k the inequality is reversed.
We omit the proof which starts with considering the case k = 1 for which we need to show
P( ∪_{i=1}^{n} Ei ) ≤ Σ_{j=1}^{1} (−1)^{j−1} Sj = S1 = Σ_{i=1}^{n} P(Ei),
which is Boole's inequality. When k = 2,
P( ∪_{i=1}^{n} Ei ) ≥ Σ_{j=1}^{2} (−1)^{j−1} Sj = S1 − S2 = Σ_{i=1}^{n} P(Ei) − Σ_{1≤i<j≤n} P(Ei ∩ Ej),
which for n = 2 is the inclusion-exclusion identity (Proposition 2.2).
15.2. Markov's inequality

Proposition 15.3 (Markov's inequality). Suppose X ≥ 0 is a random variable and a > 0. Then
P(X ≥ a) ≤ EX / a.

Proof. We only give the proof for a continuous random variable; the case of a discrete
random variable is similar. Suppose X is a positive continuous random variable. We can
write
EX = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{0}^{∞} x f_X(x) dx ≥ ∫_{a}^{∞} x f_X(x) dx
   ≥ ∫_{a}^{∞} a f_X(x) dx = a ∫_{a}^{∞} f_X(x) dx = a P(X ≥ a).
Therefore
a P(X ≥ a) ≤ EX,
which is what we wanted to prove.
Example 15.2. First we observe that Boole's inequality can be interpreted in terms of the
expected number of events that occur. Suppose (S, F, P) is a probability space, and E1, . . . , En ∈ F
are events. Define
Xi := 1 if Ei occurs, and Xi := 0 otherwise.
Then X := X1 + · · · + Xn is the number of events that occur, and
P( ∪_{i=1}^{n} Ei ) = P(X ≥ 1) ≤ EX = P(E1) + · · · + P(En),
which gives another proof of Boole's inequality.
Example 15.3. Suppose X ∼ Binom (n, p). We would like to use Markov’s inequality to
find an upper bound on P (X > qn) for p < q < 1.
Note that X is a nonnegative random variable and EX = np. By Markov’s inequality, we
have
P(X ≥ qn) ≤ EX / (qn) = p/q.
15.3. Chebyshev's inequality
Here we revisit Chebyshev's inequality (Proposition 14.1), which we used previously. This result
shows that the difference between a random variable and its expectation is controlled by its
variance. Informally, it shows how far the random variable is from its mean on average.
Proposition 15.4 (Chebyshev's inequality). Suppose X is a random variable with EX and
Var(X) finite. Then for any b > 0,
P(|X − EX| ≥ b) ≤ Var(X) / b².

Proof. Let Y := (X − EX)², which is a nonnegative random variable. By Markov's inequality
(Proposition 15.3) applied with a = b²,
P(Y ≥ b²) ≤ EY / b².
Note that the events {|X − EX| ≥ b} and {Y ≥ b²} are the same, and EY = Var(X), which
gives the result.
Example 15.4. Consider again X ∼ Binom(n, p). We now will use Chebyshev's inequality
to find an upper bound on P(X ≥ qn) for p < q < 1.
Recall that EX = np and Var(X) = np(1 − p). By Chebyshev's inequality with b = (q − p)n > 0 we have
P(X ≥ qn) ≤ P(|X − np| ≥ (q − p)n) ≤ np(1 − p) / ((q − p)n)² = p(1 − p) / ((q − p)² n).

15.4. Chernoff bounds

Chernoff bounds state that for any random variable X and any a,
P(X ≥ a) ≤ min_{t>0} e^{−ta} Ee^{tX}   and   P(X ≤ a) ≤ min_{t<0} e^{−ta} Ee^{tX}.

Proof.
Note that e^{tX} is a positive random variable for any t ∈ R. Therefore we can apply
Markov's inequality (Proposition 15.3) to see that
P(X ≥ a) = P( e^{tX} ≥ e^{ta} ) ≤ Ee^{tX} / e^{ta},  t > 0,
P(X ≤ a) = P( e^{tX} ≥ e^{ta} ) ≤ Ee^{tX} / e^{ta},  t < 0.
Recall that EetX is the moment generating function mX (t), and so we have
mX (t)
P (X > a) 6 , t > 0,
eta
mX (t)
P (X 6 a) 6 , t < 0.
eta
Taking the minimum over appropriate t we get the result.
Example 15.5. Consider again X ∼ Binom (n, p). We now will use Chernoff bounds for
P (X > qn) for p < q < 1. Recall that in Example 13.2 we found the moment generating
function for X as follows
m_X(t) = ( pe^t + (1 − p) )^n.
Thus a Chernoff bound gives
P(X ≥ qn) ≤ min_{t>0} e^{−tqn} ( pe^t + (1 − p) )^n.
To find the minimum of g(t) = e^{−tqn}( pe^t + (1 − p) )^n we can take its derivative and,
using the only critical point of this function, we can see that the minimum on (0, ∞) is
achieved at t* such that
e^{t*} = q(1 − p) / ((1 − q)p),
and so
g(t*) = ( q(1 − p)/((1 − q)p) )^{−qn} ( p · q(1 − p)/((1 − q)p) + (1 − p) )^n
      = ( q(1 − p)/((1 − q)p) )^{−qn} ( (1 − p)/(1 − q) )^n
      = ( p/q )^{qn} ( (1 − p)/(1 − q) )^{(1−q)n}.
Therefore
P(X ≥ qn) ≤ ( p/q )^{qn} ( (1 − p)/(1 − q) )^{(1−q)n}.
To summarize, for X ∼ Binom(n, p) and p < q < 1 we have obtained:
Markov's inequality:     P(X ≥ qn) ≤ p/q,
Chebyshev's inequality:  P(X ≥ qn) ≤ p(1 − p) / ((q − p)² n),
Chernoff bound:          P(X ≥ qn) ≤ ( p/q )^{qn} ( (1 − p)/(1 − q) )^{(1−q)n}.
Clearly the right-hand sides are very different: Markov’s inequality gives a bound indepen-
dent of n, and the Chernoff bound is the strongest with exponential convergence to 0 as
n → ∞. For example, for p = 1/2 and q = 3/4 we have
Markov's inequality:     P(X ≥ 3n/4) ≤ 2/3,
Chebyshev's inequality:  P(X ≥ 3n/4) ≤ 4/n,
Chernoff bound:          P(X ≥ 3n/4) ≤ (16/27)^{n/4}.
Markov's inequality:     P(X ≥ 3n/4) ≤ 1/2,
Chebyshev's inequality:  P(X ≥ 3n/4) ≤ 2/n,
Chernoff bound:          P(X ≥ 3n/4) ≤ 2^{−n/2}.
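For readers who want to see how the three bounds compare with the exact binomial tail probability, here is a small Python sketch (an illustration, not part of the original text) for the case p = 1/2 and q = 3/4; the helper names are made up for this example.

```python
import math

def binomial_tail(n, p, k):
    # Exact P(X >= k) for X ~ Binom(n, p)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def bounds(n, p, q):
    markov = p / q
    chebyshev = p * (1 - p) / ((q - p) ** 2 * n)
    chernoff = (p / q) ** (q * n) * ((1 - p) / (1 - q)) ** ((1 - q) * n)
    return markov, chebyshev, chernoff

p, q = 0.5, 0.75
for n in (20, 100):
    m, c, ch = bounds(n, p, q)
    exact = binomial_tail(n, p, math.ceil(q * n))
    print(f"n={n}: Markov={m:.3f}, Chebyshev={c:.3f}, "
          f"Chernoff={ch:.3e}, exact={exact:.3e}")
```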
15.5. Cauchy–Schwarz inequality

The Cauchy–Schwarz inequality states that for random variables X and Y with EX² < ∞ and
0 < EY² < ∞,
(EXY)² ≤ EX² · EY².

Proof. Consider g(s) := E(X − sY)² ≥ 0 as a function of s ∈ R. Expanding the square,
g(s) = EY² s² − 2 EXY s + EX² = EY² ( s − EXY/EY² )² + EX² − (EXY)² / EY²,
and since g(s) ≥ 0 for all s, taking s = EXY/EY² shows that
EX² − (EXY)² / EY² ≥ 0,
which is what we needed to show.
To deal with the last claim, let U := ( X − (EXY/EY²) Y )². If U > 0 with probability one,
then g(EXY/EY²) = EU > 0, and hence EX² − (EXY)²/EY² > 0. Conversely, if
EX² − (EXY)² / EY² = 0,
then EU = g(EXY/EY²) = 0, which can only be true if
X − (EXY / EY²) Y = 0,
that is, X is a scalar multiple of Y.
Example 15.7. We can use the Cauchy–Schwarz inequality to prove one of the properties
of the correlation coefficient in Proposition 12.3.2. Namely, suppose X and Y are random
variables; then |ρ(X, Y)| ≤ 1. Moreover, |ρ(X, Y)| = 1 if and only if there are constants
a, b ∈ R such that X = a + bY.
We will use normalized random variables as before, namely,
U := (X − EX) / √(Var X),   V := (Y − EY) / √(Var Y).
Then EU = EV = 0, EU 2 = EV 2 = 1. We can use the Cauchy-Schwartz inequality for U
and V to see that
|E UV| ≤ √( EU² · EV² ) = 1
and the identity holds if and only if U = aV for some a ∈ R.
Recall Equation (12.3.1)
ρ (X, Y ) = E (U V ) ,
which gives the bound we need. Note that if U = aV , then
(X − EX) / √(Var X) = a (Y − EY) / √(Var Y),
therefore
X = a √(Var X) (Y − EY)/√(Var Y) + EX = a (√(Var X)/√(Var Y)) Y − a (√(Var X)/√(Var Y)) EY + EX,
which completes the proof.
15.6. Jensen's inequality

Recall that a function g : R → R is convex on [a, b] if for each x, y ∈ [a, b] and each λ ∈ [0, 1]
we have
g(λx + (1 − λ)y) ≤ λ g(x) + (1 − λ) g(y).
Note that for a convex function g this property holds for any convex linear combination of
points in [a, b], that is,
g(λ1 x1 + · · · + λn xn) ≤ λ1 g(x1) + · · · + λn g(xn)
whenever x1, . . . , xn ∈ [a, b], λi ≥ 0 and λ1 + · · · + λn = 1.
If g is twice differentiable, then we have a simple test for convexity, namely, g is convex if
g″(x) ≥ 0 for all x ∈ [a, b]. Geometrically, if g is convex, then the line segment between any
two points on the graph of g lies entirely above the graph, as we show formally below. A
function g is concave if −g is convex. Typical examples of convex functions are g(x) = x² and
g(x) = e^x. Examples of concave functions are g(x) = −x² and g(x) = log x. Convex and
concave functions are always continuous.
Suppose a < c < b and let g : [a, b] → R be convex. Then there exist A, B ∈ R such
that g(c) = Ac + B and, for all x ∈ [a, b], g(x) ≥ Ax + B.
Proposition 15.7 (Jensen's inequality). Suppose g is convex and X is a random variable for
which EX and Eg(X) are well defined. Then
g(EX) ≤ Eg(X).
For a concave function g the inequality is reversed: Eg(X) ≤ g(EX).

Proof. Take the supporting line at c = EX, so that g(EX) = A · EX + B and AX + B ≤ g(X);
under the assumptions E|g(X)| < ∞, and therefore Eg(X) is well defined. Now we can use
AX + B ≤ g(X) to see that
g(EX) = A EX + B = E(AX + B) ≤ Eg(X).
Example 15.8. (The arithmetic mean–geometric mean inequality) Let a1, . . . , an > 0 and let
X be a discrete random variable taking each of these values with equal probability, that is,
f_X(ak) = 1/n for k = 1, . . . , n.
Note that the function g(x) = − log x is a convex function on (0, ∞). Jensen's inequality
gives that
− log( (1/n) Σ_{k=1}^{n} ak ) = − log(EX) ≤ E(− log X) = −(1/n) Σ_{k=1}^{n} log ak.
Exponentiating this we get
(1/n) Σ_{k=1}^{n} ak ≥ ( Π_{k=1}^{n} ak )^{1/n}.
Example 15.9. Suppose p ≥ 1; then the function g(x) = |x|^p is convex, and therefore
E|X|^p ≥ |EX|^p
for any random variable X such that EX is defined. In particular,
EX² ≥ (EX)²,
and therefore EX² − (EX)² ≥ 0.
15.7. Exercises

Exercise 15.1. A device requires n connections to be made, and each connection is made
incorrectly with probability p (the connections are not assumed to be independent). Give a
lower bound on the probability that all n connections are made correctly.

Exercise 15.2. Suppose X ∼ Exp(λ). Using Markov's inequality, estimate P(X ≥ a) for
a > 0 and compare it with the exact value of this probability.
Exercise 15.3. Suppose X ∼ Exp (λ). Using Chebyshev’s inequality estimate P (|X − EX| > b)
for b > 0.
Exercise 15.4. Suppose X ∼ Exp (λ). Using Chernoff bounds estimate P (X > a) for
a > EX and compare it with the exact value of this probability.
Exercise 15.5. Suppose X > 0 is a random variable such that Var(X) > 0. Decide which
of the two quantities is larger.
(A) EX 3 or (EX)3 ?
(B) EX 3/2 or (EX)3/2 ?
(C) EX 2/3 or (EX)2/3 ?
(D) E log(X + 1) or log(EX + 1)?
(E) EeX or eEX ?
(F) Ee−X or e−EX ?
15.8. Selected solutions
Solution to Exercise 15.1: Let Ei denote the event that connection i is made correctly, so
P(Ei^c) = p. We do not assume anything beyond this (such as whether these events are dependent),
so we will use Boole's inequality to estimate this probability. The event we are
interested in is ∩_{i=1}^{n} Ei, and
P( ∩_{i=1}^{n} Ei ) = 1 − P( ( ∩_{i=1}^{n} Ei )^c ) = 1 − P( ∪_{i=1}^{n} Ei^c )
≥ 1 − Σ_{i=1}^{n} P(Ei^c) = 1 − np,
where the inequality is Boole's inequality.
Solution to Exercise 15.2: We have EX = 1/λ. By Markov's inequality,
P(X ≥ a) ≤ EX / a = 1/(aλ),
while the exact value is
P(X ≥ a) = ∫_{a}^{∞} λ e^{−λx} dx = e^{−λa} ≤ 1/(aλ).
Solution to Exercise 15.3: We have EX = 1/λ and Var X = 1/λ2 . By Chebyshev’s
inequality we have
P(|X − EX| ≥ b) ≤ Var X / b² = 1/(b²λ²).
Solution to Exercise 15.4: Recall first that
m_X(t) = λ/(λ − t) for t < λ.
Using Chernoff bounds, we see
P(X ≥ a) ≤ min_{t>0} e^{−ta} m_X(t) = min_{0<t<λ} e^{−ta} λ/(λ − t).
To find the minimum of e^{−ta} λ/(λ − t) as a function of t, we can find the critical point and see that
it is t = λ − 1/a > 0, since we assume that a > EX = 1/λ. Using this value for t we get
P(X ≥ a) ≤ e^{−(λ−1/a)a} · λ/(λ − (λ − 1/a)) = aλ e^{1−λa},
which exceeds the exact value e^{−λa} by the factor aλe.
Solution to Exercise 15.5(A): EX³ > (EX)³ since (x³)″ = 6x > 0 for x > 0.
Solution to Exercise 15.5(B): EX^{3/2} > (EX)^{3/2} since (x^{3/2})″ = 3/(4√x) > 0 for x > 0.
Solution to Exercise 15.5(C): EX^{2/3} < (EX)^{2/3} since (x^{2/3})″ = −2/(9x^{4/3}) < 0 for x > 0.
Solution to Exercise 15.5(D): E log(X + 1) < log(EX + 1) since (log(x))00 = −1/x2 < 0
for x > 0.
Solution to Exercise 15.5(E): EeX > eEX since (ex )00 = ex > 0 for any x.
Solution to Exercise 15.5(F): Ee−X > e−EX since (e−x )00 = e−x > 0 for any x.
Part 4
Applications of probability
CHAPTER 16∗

Applications in Insurance and Actuarial Science
16.1 Introduction
Suppose that for a period, you face the risk of losing something that is unpredictable, and
denote this potential loss by a random variable X. This loss may be the result of damages
or injuries from (a) an automobile accident, (b) fire, theft, storm or hurricane at home, (c)
premature death of head of the household, or (d) hospitalization due to an illness. Insurance
allows you to exchange facing this potential loss for a fixed price or premium. It is one
of the responsibilities of an actuary to assess the fair price given the nature of the risk.
Actuarial science is a discipline that deals with events that are uncertain and their economic
consequences; the concepts of probability and statistics provide for indispensable tools in
measuring and managing these uncertainties.
16.2 The Pareto distribution

The Pareto distribution is commonly used to describe and model insurance losses. One
reason is its flexibility to handle positive skewness or long distribution tails. It is possible
for insurance losses to become extremely large, although such losses may be considered rare
events. While there are several versions of the Pareto family of distributions, we consider the
cumulative distribution function of X that follows a Type II Pareto:
(16.0.1)    F_X(x) = 1 − ( θ/(x + θ) )^α,  for x > 0,
where α > 0 is the shape or tail parameter and θ > 0 is the scale parameter. If X follows
such distribution, we write X ∼ Pareto(α, θ).
By taking the derivative of (16.0.1), it can be shown that the probability density function of
X is given by
(16.0.2)    f_X(x) = (α/θ) ( θ/(x + θ) )^{α+1},  for x > 0.
Figure 16.0.1 depicts shapes of the density plot with varying parameter values of α and θ.
[Figure 16.0.1: density curves f_X(x) for 0 ≤ x ≤ 200; one panel varies α over 0.5, 1, 1.5, 3, 5, and the other fixes α = 3 and varies θ over 100, 200, 300, 500, 1000.]
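The distribution function (16.0.1) and the density (16.0.2) are simple enough to evaluate directly. The following short Python sketch (illustrative only; the function names are ours, not from the text) implements them and evaluates a couple of values that appear later in the chapter.

```python
def pareto_cdf(x, alpha, theta):
    # F_X(x) = 1 - (theta / (x + theta))**alpha, x > 0   (equation 16.0.1)
    return 1.0 - (theta / (x + theta)) ** alpha

def pareto_pdf(x, alpha, theta):
    # f_X(x) = (alpha/theta) * (theta / (x + theta))**(alpha + 1)   (equation 16.0.2)
    return (alpha / theta) * (theta / (x + theta)) ** (alpha + 1)

# Spot checks with parameter values used later in the chapter
print(round(pareto_cdf(125, 3.25, 225), 4))     # compare with P(X <= 125) in Exercise 16.2
print(round(1 - pareto_cdf(100, 3.0, 1000), 4))  # P(X > 100) for a Pareto(3, 1000) loss
```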
16.3 Insurance coverage modifications

In some instances, the potential loss that you face may be huge and unlimited. In this case,
the cost of the insurance coverage may be burdensome. There are possible modifications to
your insurance coverage so that the burden may be reduced. We introduce three possible
modifications: (a) deductibles, (b) limits or caps, and (c) coinsurance.
These coverage modifications are a form of loss sharing between you, who is called the poli-
cyholder or insured, and the insurance company, which is also called the insurer. The effect
is a reduced premium to the policyholder, and at the same time, because the policyholder
shares in the loss, there is a perceived notion that this may alter the behavior of the poli-
cyholder. For instance, in the case of automobile insurance, the policyholder may be more
careful about his or her driving behavior. Note that it is also possible to have an insurance
coverage which is a combination of these three modifications.
16.3.1 Deductibles
In an excess-of-loss insurance contract, the insurance company agrees to reimburse the pol-
icyholder for losses beyond a pre-specified amount d. This amount d is referred to as the
deductible of the contract. Given the loss is X, this amount is then shared between the
policyholder, who is responsible for the first d amount, and the insurance company, which
pays the excess if any. Thus, the policyholder is responsible for min(X, d) and the insurance
company pays the excess, which is then equal to
(16.0.5)    X_I = X − min(X, d) =
                0,      if X ≤ d,
                X − d,  if X > d.
In general, we keep this notation, XI , to denote the portion of X that the insurance company
agrees to pay. The expected value of this can be expressed as
(16.0.6) E(XI ) = E(X) − E[min(X, d)],
where E[min(X, d)] is sometimes called the limited expected value. For any non-negative
random variable X, it can be shown that
(16.0.7)    E[min(X, d)] = ∫_0^d [1 − F_X(x)] dx.
This result can be proved as follows. Starting with
E[min(X, d)] = ∫_0^d x f_X(x) dx + ∫_d^∞ d · f_X(x) dx = ∫_0^d x f_X(x) dx + d[1 − F_X(d)],
and applying integration by parts with u = x and dv = −fX (x)dx so that v = 1 − FX (x),
we have
E[min(X, d)] = −x[1 − F_X(x)] |_0^d + ∫_0^d [1 − F_X(x)] dx + d[1 − F_X(d)]
             = −d[1 − F_X(d)] + ∫_0^d [1 − F_X(x)] dx + d[1 − F_X(d)]
             = ∫_0^d [1 − F_X(x)] dx.
Example 16.3.1 Show that the limited expected value, with a deductible d, for a Pareto(α, θ)
distribution has the following expression
(16.0.8)    E[min(X, d)] = ( θ/(α − 1) ) [ 1 − ( θ/(d + θ) )^{α−1} ],
provided α 6= 1.
Example 16.3.2 Suppose insurance loss X follows a Pareto(α, θ) distribution with α > 3, and
its mean and variance are 100 and 200000/3, respectively. An excess-of-loss contract is to be
designed so that the insurer's expected payment per loss is 80. Calculate the corresponding
deductible d.

Solution: First, we find the parameters of the distribution. From (16.0.3) and (16.0.4), we
have two equations in two unknowns:
θ/(α − 1) = 100   and   αθ²/((α − 1)(α − 2)) = 200000/3.
This leads us to θ = 100(α − 1), which results in a quadratic equation in α: 3α² − 23α + 40 = 0.
There are two possible solutions for α: either α = 5 or α = 8/3. Since we are given α > 3,
we use α = 5 so that θ = 400. Calculating the insurer’s expected payment, we have from
(16.0.6) and (16.0.8) the following:
E(X_I) = 100 ( 400/(400 + d) )^4 = 80.
Solving for the deductible, we have d = 22.95.
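As a numerical check of (16.0.7) and (16.0.8), the following Python sketch (not from the original text; the helper names are ours, and the parameter choices α = 5, θ = 400, d = 22.95 are taken from the solution above) compares the closed form of the limited expected value with a direct numerical integration, and recovers the insurer's expected payment of about 80.

```python
def pareto_sf(x, alpha, theta):
    # 1 - F_X(x) for the Type II Pareto in (16.0.1)
    return (theta / (x + theta)) ** alpha

def limited_ev_numeric(d, alpha, theta, steps=100_000):
    # E[min(X, d)] via (16.0.7): integrate 1 - F_X from 0 to d (midpoint rule)
    h = d / steps
    return sum(pareto_sf((i + 0.5) * h, alpha, theta) for i in range(steps)) * h

def limited_ev_closed(d, alpha, theta):
    # Closed form (16.0.8), valid for alpha != 1
    return theta / (alpha - 1) * (1 - (theta / (d + theta)) ** (alpha - 1))

alpha, theta, d = 5, 400, 22.95
print(round(limited_ev_numeric(d, alpha, theta), 4))
print(round(limited_ev_closed(d, alpha, theta), 4))
# Insurer's expected payment E(X) - E[min(X, d)], compare with 80:
print(round(theta / (alpha - 1) - limited_ev_closed(d, alpha, theta), 2))
```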
16.3.2 Policy limits

An insurance contract with a policy limit, or cap, L is called a limited contract. For this
contract, the insurance company is responsible for payment of X provided the loss does not
exceed L, and if it does, the maximum payment it will pay is L. We then have
(16.0.9)    X_I = min(X, L) =
                X,  if X ≤ L,
                L,  if X > L.
Example 16.3.3 An insurance company offers a limited contract against loss X that follows
a Pareto distribution. The parameters are assumed to be α = 3.5 and θ = 250. Calculate
the company’s expected payment for this loss if the limit L = 5000.
Example 16.3.4 Loss X follows an exponential distribution with mean µ. You are given the
limited expected values E[min(X, 500)] = 216.17 and E[min(X, 1000)] = 245.42. Determine µ.

Solution: For an exponential with mean parameter µ, its density can be expressed as f_X(x) =
(1/µ)e^{−x/µ}, for x > 0. This gives a distribution function equal to F_X(x) = 1 − e^{−x/µ}. From
(16.0.7), we can show that
E[min(X, L)] = µ( 1 − e^{−L/µ} ).
From the given, this leads us to the following two equations: 216.17 = µ(1 − e^{−500/µ}) and
245.42 = µ(1 − e^{−1000/µ}). If we let k = e^{−500/µ} and divide the two equations, we get
216.17/245.42 = (1 − k)/(1 − k²)  ⟹  1 + k = 245.42/216.17  ⟹  k = 0.1353102,
provided k ≠ 1; since k = e^{−500/µ} < 1, this is justified, and therefore k = 0.1353102. This gives
µ = 249.9768 ≈ 250.
16.3.3 Coinsurance
In an insurance contract with coinsurance, the insurance company agrees to pay a fixed and
pre-specified proportion of k for each loss. This proportion k must be between 0 and 100%.
The company’s payment for each loss is
(16.0.10) XI = kX,
where 0 < k ≤ 1. Therefore, the expected value of this payment is just a fixed proportion k
of the average of the loss.
Example 16.3.5 An insurance company offers a contract with coinsurance of 80%. Assume
loss X follows a Pareto distribution with θ = 400 and P(X > 100) = 0.5120.
(a) Calculate the company’s expected payment for each loss.
(b) Suppose the company agrees to replace the contract with an excess-of-loss coverage.
Find the deductible d so that the company has the same expected payment.
16.3.4 Combination
It is not uncommon to find insurance contracts that combine the three different coverage
modifications described in the previous sections. In particular, consider the general situation
where we have a combination of a deductible d, policy limit L, and coinsurance k. In this
case, the insurance contract will pay for a proportion k of the loss X, in excess of the
deductible d, subject to the policy limit of L. The company’s payment for each loss can be
written as
(16.0.11)    X_I =
                0,          if X ≤ d,
                k(X − d),   if d < X ≤ L,
                k(L − d),   if X > L.
This leads us to the average reimbursement as E(XI ) = kµe−d/µ = 0.90 ∗ 800 ∗ e−25/800 =
697.85. In effect, the policyholder can expect to be reimbursed (697.85/800) = 87% of each
loss.
16.4 Loss frequency

In practice, the insurance company pays for damages or injuries to the insured only if the specified
insured event happens. For example, in automobile insurance, the insurance company
will pay only in the event of an accident. It is therefore important to consider also the
probability that an accident occurs. This refers to the loss frequency.
For simplicity, start with a Bernoulli random variable I, which indicates whether an accident, or
some other insured event, occurs. Assume that I follows a Bernoulli distribution with p
denoting the probability that the event happens, i.e., P(I = 1) = p. If the insured event happens,
the amount of loss is X, typically a continuous random variable. This amount of loss is
referred to as the loss severity. Ignoring any possible coverage modifications, the insurance
company will pay 0, if the event does not happen, and X, if the event happens. In effect,
the insurance claim can be written as the random variable
(16.0.13)    Y = X · I =
                0,  if I = 0 (event does not happen),
                X,  if I = 1 (event happens).
The random variable Y is a mixed random variable and will be called the insurance claim. It
has a probability mass at 0 and a continuous distribution for positive values. By conditioning
on the Bernoulli I, it can be shown that the cumulative distribution function of Y has the
expression
(16.0.14)    F_Y(y) =
                1 − p,                 if y = 0,
                (1 − p) + p F_X(y),    if y > 0.
Denote the mean of X by µX and its standard deviation by σX > 0. It can be shown that
the expected value of Y has the expression
(16.0.15) E(Y ) = pµX
and the variance is
(16.0.16)    Var(Y) = p(1 − p) µ_X² + p σ_X².
Example 16.4.1 Consider an insurance contract that covers an event with probability 0.25
that it happens, and the loss severity X has a Pareto(3, 1000) distribution.
(a) Calculate the probability that the insurance contract will pay an amount less than
or equal to 500.
(b) Calculate the probability that the insurance contract will pay an amount larger than
750.
(c) Calculate the expected value and variance of the insurance claim.
For part (b), we use the complement of the cumulative distribution function:
P(Y > 600) = 1 − [ (1 − 0.25) + 0.25 F_X(600) ] = 0.25 − 0.25 [ 1 − (5/8)³ ] = 0.0610.
For the claim severity X, the mean is µ_X = 500 and the variance is σ_X² = (3/2)(1000²). Therefore,
from (16.0.15) and (16.0.16), we have
E(Y) = 0.25(500) = 125
and
Var(Y) = 0.25(0.75)(500²) + 0.25 · (3/2)(1000²) = 421,875.
Figure 16.0.2 shows the graph of the cumulative distribution function of Y using the param-
eters of this example.
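The mixed distribution function (16.0.14) and the mean formula (16.0.15) can be evaluated directly; the following Python sketch (illustrative, with made-up function names) does so for the parameters of Example 16.4.1 and reproduces P(Y > 600) and E(Y) computed above.

```python
p, alpha, theta = 0.25, 3.0, 1000.0

def pareto_cdf(x, a, t):
    # F_X(x) from (16.0.1)
    return 1.0 - (t / (x + t)) ** a

def claim_cdf(y):
    # F_Y(y) from (16.0.14): an atom of size 1 - p at 0, Pareto body above 0
    if y < 0:
        return 0.0
    return (1.0 - p) + p * pareto_cdf(y, alpha, theta)

print(round(1.0 - claim_cdf(600), 4))      # P(Y > 600), compare with 0.0610 above
print(round(p * theta / (alpha - 1), 2))   # E(Y) = p * mu_X from (16.0.15)
```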
(a) Calculate the probability that the insurance claim will be below 150.
(b) Calculate the expected value of the insurance claim.
(c) Calculate the standard deviation of the insurance claim.
Solution: We have
P(Y ≤ 150) = P(Y = 0) + 0.20 ∗ P(X ≤ 150)
= (1 − 0.20) + 0.20(0.30 + 0.50) = 0.96
For part (b), first we find E(X) = 182.50 so that E(Y) = 0.20 ∗ 182.50 = 36.5. For part (c),
we find σ_X² = 59381.25 so that
Var(Y) = 0.20(0.80) ∗ 182.5² + 0.20 ∗ 59381.25 = 17205.25.
Therefore, the standard deviation is √17205.25 = 131.17.
16.5 The concept of risk pooling

Insurance is based on the idea of pooling several individuals willing to exchange their risks.
Consider the general situation where there are n individuals in the pool. Assume that each
individual faces the same loss distribution but the eventual losses from each of them will be
denoted by Y1, Y2, . . . , Yn. In addition, assume these individual losses are independent. The
total loss arising from this pool is the sum of all these individual losses, Sn = Y1 + Y2 + · · · + Yn.
To support funding this total loss, each individual agrees to contribute an equal amount of
(16.0.17)    Pn = (1/n) Sn = (1/n) Σ_{k=1}^{n} Yk,
which is the average loss.
Notice that the contribution above is still a random loss. Indeed, in the absence of risk
pooling, where there is a single individual in the pool, that individual is responsible for his
or her own loss. However, when there is a sufficiently large number of individuals, this
average contribution becomes more predictable, as shown below.
Assume that the expected value and variance of each loss are given, respectively, by
E(Yk ) = µ and Var(Yk ) = σ 2 .
For the pool of n individuals, the mean of the average loss is then
E(Pn) = (1/n) Σ_{k=1}^{n} E(Yk) = µ,
which is exactly equal to the mean of each individual loss. However, what is interesting is
the variability is reduced as shown below:
Var(Pn) = (1/n²) Σ_{k=1}^{n} Var(Yk) = σ²/n.
The variance is further reduced as the number of individuals in the pool increases. As
discussed in Chapter 14, one version of the law of large numbers, the SLLN (Strong Law
of Large Numbers), says that Pn → µ as n → ∞. In words, the unpredictable loss
for a single individual becomes much more predictable when averaged over a large pool. This is
sometimes referred to as the basic law of insurance. The origins of insurance can be traced back
to the idea of pooling the contributions of several for the indemnification of losses against the
misfortunes of the few.
In principle, the insurance company acts as a third party that formally makes this arrange-
ment. The company forms a group of such homogeneous and independent risks, and is
responsible for collecting the individual contributions, called the premium, as well as dis-
bursing payments when losses occur. There are additional responsibilities of the insurance
company such as ensuring enough capital to cover future claims, however, such are beyond
the scope here.
Figure 16.0.3 depicts the basic law of insurance. In all situations, the mean of the average
loss is the same. The variability, however, is reduced with an increasing number
of policyholders. As you probably suspect from these graphs, the average loss is about 1000,
which is sometimes referred to as the actuarially fair premium. However, please bear in mind
that there are conditions for the basic law of insurance to effectively work:
• Losses must be unpredictable.
• The individual risks must be independent.
• The individuals insured must be considered homogeneous, that is, they share com-
mon risk characteristics.
• The number of individuals in the pool must be sufficiently large.
Example 16.5.1 A company insures a group of 200 homogeneous and independent poli-
cyholders against an event that happens with probability 0.18. If the event happens and
therefore a loss occurs, the amount of loss X has an exponential distribution with mean
1250.
(a) Calculate the mean of the total loss arising from the group.
(b) Calculate the variance of the total loss arising from the group.
(c) Using the Central Limit Theorem, estimate the probability that the total loss will
exceed 60,000.
Figure 16.0.3. Distributions of the average loss as the number of policyholders increases (panels for n = 1, 10, 100, 1,000, 10,000, and 100,000).
Solution: The mean of each individual claim is E(Yk) = p µ_X = 0.18 · 1250 = 225, and
Var(Yk) = p(1 − p)µ_X² + pσ_X² = 0.18 ∗ 0.82 ∗ 1250² + 0.18 ∗ 1250² = 511875.
The mean and variance of the total loss are, respectively, given by
E(S200) = 200 ∗ 225 = 45000 and Var(S200) = 200 ∗ 511875 = 102375000.
The probability that the total loss will exceed 60,000 can be estimated as follows:
P(S200 > 60000) ≈ P( Z > (60000 − 45000)/√102375000 ) = P(Z > 1.48),
where Z denotes a standard normal random variable. From Table 1 on page 114, we get
0.06944.
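The CLT estimate can be compared against a direct simulation of the pool. The following Python sketch (not part of the original text; the trial count and seed are arbitrary choices) simulates 200 policyholders with claim probability 0.18 and exponential claim sizes of mean 1250, and estimates P(S200 > 60000).

```python
import random

# Monte Carlo sketch of Example 16.5.1 (illustrative check).
random.seed(1)
n, p, mean_loss = 200, 0.18, 1250.0
trials = 20_000

exceed = 0
for _ in range(trials):
    total = sum(random.expovariate(1.0 / mean_loss)
                for _ in range(n) if random.random() < p)
    if total > 60_000:
        exceed += 1

print("simulated P(S200 > 60000):", round(exceed / trials, 4))
print("CLT approximation from the text: 0.06944")
```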
16.6 Exercises
Exercise 16.2. Suppose insurance loss X has a Pareto(α, θ) distribution. You are given
θ = 225 and P(X ≤ 125) = 0.7621. Calculate the probability that loss will exceed 200, given
it exceeds 125.
Exercise 16.3. Find an expression for the limited expected value, with a deductible d, for
a Pareto(α, θ) distribution when α = 1.
Exercise 16.4. An insurance company pays for a random loss X subject to a deductible
amount of d, where 0 < d < 10. The loss amount is modeled as a continuous random variable
with density function
1
fX (x) = x, for 0 < x ≤ 10.
50
Given that the company makes a payment, the probability that it pays no more than 5 is
0.4092. Calculate d.
Exercise 16.5 An insurer offers a limited contract against loss X that follows an exponential
distribution with mean 500. The limit of the contract is L = 1000. Calculate the probability
that the loss will not reach the limit L.
Exercise 16.6 A company insures a loss X with a coinsurance of 90%. Loss X follows a
distribution that is uniform on (0, u), for u > 0. The expected payment for each loss for
the company is 900. Calculate the expected amount of each loss the buyer of this policy is
responsible for.
Exercise 16.7 An insurer offers a comprehensive medical insurance contract with a de-
ductible of 25, subject to a coinsurance of 90% and a policy limit of 2000. Medical loss X
follows an exponential distribution with mean 800.
(a) Calculate the expected reimbursement the policyholder will receive in the event of
a loss.
(b) The insurer wants to reduce this expected reimbursement to 575 by adjusting only
the level of deductible d. What amount d is needed to achieve this?
Exercise 16.8 Consider an insurance contract that covers an event with probability 0.10
that it happens, and the loss severity X has a Pareto(2.5, 400) distribution.
(a) Calculate the probability that the insurance claim will exceed 150.
(b) Given the claim exceeds 150, calculate the probability that it is less than or equal
to 400.
Exercise 16.9 A company insures a group of 150 homogeneous and independent policyhold-
ers against an event that happens with probability 0.10. If the event happens, the amount
of loss X has a uniform distribution on (0, 1000]. The company collects a premium of 55
from each policyholder. Estimate the probability that the total premium collected will not
be sufficient to support the total loss.
16.7 Selected solutions

Exercise 16.1: In writing the expressions for the first two moments, rewrite the density
function as
f_X(x) = (α/θ) ( 1 + x/θ )^{−α−1}
and use the substitution u = 1 + (x/θ) in the integrals.
Exercise 16.2: From the given, we find α = 3.25. This gives us 0.5320 for the answer.
Exercise 16.3: From (16.0.1), when α = 1, we have F_X(x) = 1 − (θ/(x + θ)). Use (16.0.7) to
arrive at the following expression for the limited expected value: E[min(X, d)] = θ ln(1 + d/θ).
Exercise 16.4: Observe that the insurer will pay only beyond the deductible d. So, it will
pay no more than 5 if the loss does not reach (d + 5). The given probability is therefore
P(X ≤ d + 5|X > d) = 0.4092. It can be shown that FX (x) = x2 /100. This gives d = 1.5.
Exercise 16.6: The expected loss is 1000, for which the insurer is responsible for 900. Therefore,
the policyholder is responsible for 100, on average.
Exercise 16.7: The expected reimbursement is E(X_I) = kµ[e^{−d/µ} − e^{−L/µ}] = 0.90 ∗ 800 ∗
[e^{−25/800} − e^{−2000/800}] = 638.75. To arrive at the desired d, we have d = −µ · log( 575/(0.90 ∗
800) + e^{−2000/800} ) = 101.63.
Exercise 16.8: To calculate P(Y > 150), we can use a direct application of (16.0.14) or use
the law of total probability, i.e., P(Y > 150) = P(Y > 150|I = 0)P(I = 0) + P(Y > 150|I = 1)P(I = 1)
= P(Y > 150|I = 1)P(I = 1). The first term vanishes because if the event does not happen,
then there is 0 insurance claim. Thus, using the property of the Pareto distribution, we have
P(Y > 150) = 0.10 ∗ (1 + (150/400))^{−2.5} = 0.0451. For part (b), the required probability is
P(Y ≤ 400 | Y > 150) = P(150 < Y ≤ 400)/P(Y > 150) = [P(Y > 150) − P(Y > 400)]/P(Y > 150)
= 1 − P(Y > 400)/P(Y > 150) = 1 − [0.10 ∗ (1 + (400/400))^{−2.5}]/[0.10 ∗ (1 + (150/400))^{−2.5}] = 0.608.
Exercise 16.9: The mean of the total losses is E(S150 ) = 150 ∗ 50 = 7500 and the variance
is Var(S150 ) = 150 ∗ 30833.33 = 4625000. Total premium collected is 55 ∗ 150 = 8250. The
required probability then is
P(S150 > 8250) ≈ P( Z > (8250 − 7500)/√4625000 )
= P(Z > 0.35) = 1 − P(Z ≤ 0.35) = 1 − 0.63307 = 0.36693.
CHAPTER 17∗

Applications of probability in finance

17.1. Coin toss games
17.1.1. The simple coin toss game. Suppose, as in Example 4.8, that we toss a fair
coin repeatedly and independently. If it comes up heads, we win a dollar, and if it comes
up tails, we lose a dollar. Unlike in Chapter 3 , we now can describe the solution using
sums of independent random variables. We will use the partial sums process introduced in
Definition 14.1 in Chapter 14
Sn = Σ_{i=1}^{n} Xi,
where X1, X2, X3, . . . are i.i.d. random variables with the distribution P(Xi = 1) = P(Xi =
−1) = 1/2. Then Sn represents the total change in the number of dollars that we have after n
coin tosses: if we started with $M , we will have M + Sn dollars after n tosses. The name
process is used because the amount changes over time, and partial sums is used because we
compute Sn before we know what is the final outcome of the game. The process Sn is also
commonly called the simple random walk.
The Central Limit Theorem tells us that Sn is approximately distributed as a normal random
variable with mean 0 and variance n, that is,
Mn = M + Sn ∼ M + √n Z ∼ N(M, n),
and these random variables have the distribution function F(x) = Φ( (x − M)/√n ).
© Copyright 2018 Oleksii Mostovyi, Alexander Teplyaev Typesetting date: September 2, 2020
17.1.2. The coin toss game stopped at zero. Suppose the game is modified so that
it is stopped when the amount of money reaches zero. Can we compute the probability
distribution function of Mn , the amount of money after n coin tosses?
A useful trick, called the Reflection Principle, tells us that the probability to have x dollars
after n coin tosses is
P(M + Sn = x) − P(M + Sn = −x) if x > 0
To derive this formula, we again denote by Mn the amount of money we have after n coin
tosses. Then
P(Mn = x) = P(M + Sn = x, M + Sk > 0 for all k = 1, 2, ..., n)
= P(M + Sn = x) − P(M + Sn = x, M + Sk = 0 for some k = 1, 2, ..., n)
= P(M + Sn = x) − P(M + Sn = −x, M + Sk = 0 for some k = 1, 2, ..., n)
= P(M + Sn = x) − P(M + Sn = −x).
This, together with the Central Limit Theorem, implies that the cumulative probability
distribution function of Mn can be approximated by
F(x) =
    Φ( (x − M)/√n ) + Φ( (−x − M)/√n ),  if x > 0,
    0,                                    otherwise.
The following graph shows the approximate shape of this function.
Note that this function is discontinuous as the jump at zero represents the probability that
we have lost all the money by the time n, that is,
P(Mn = 0) ≈ 2Φ( −M/√n ).
If we consider the limit n → ∞, then P(Mn = 0) → 2Φ(0) = 1 as n → ∞.
This proves that in this game all the money will be eventually lost with probability one. In
fact, this conclusion is similar to the conclusion in Example 4.9.
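A short simulation illustrates the formula P(Mn = 0) ≈ 2Φ(−M/√n). The Python sketch below (illustrative; the starting capital M = 5, the horizon n = 100, and the seed are arbitrary choices) plays the stopped game many times and compares the empirical ruin frequency with the approximation.

```python
import math
import random

def phi(z):
    # Standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(2)
M, n, trials = 5, 100, 20_000
ruined = 0
for _ in range(trials):
    money = M
    for _ in range(n):
        money += random.choice((1, -1))
        if money == 0:          # the game stops once the money reaches zero
            ruined += 1
            break

print("simulated P(Mn = 0):", round(ruined / trials, 4))
print("approximation 2*Phi(-M/sqrt(n)):", round(2 * phi(-M / math.sqrt(n)), 4))
```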
17.1.3. The coin toss game with borrowing at zero. Suppose now that the game is
modified so that each time when we hit zero, instead of stopping, we borrow $1 and continue
playing. Another form of the Reflection Principle implies that the probability to have x
dollars is
P(Mn = x) = P(M + Sn = x) + P(M + Sn = −x) if x > 0.
This formula is easy to explain because in this game the amount of money can be expressed
as Mn = |M + Sn |. The Central Limit Theorem tells us that the cumulative probability
distribution function of Mn can be approximated by
F(x) =
    Φ( (x − M)/√n ) − Φ( (−x − M)/√n ),  if x > 0,
    0,                                    otherwise.
The following graph shows the approximate shape of this function.
17.1.5. Expected playing time. Suppose we play the same simple coin toss game as
in the previous section, and we would like to compute the expected number of coin tosses
needed to complete the game. If we denote this expected number by T (x), we will have a
system of N − L + 1 linear equations
ET(x) =
    0,                                          if x = L,
    1 + (1/2)( ET(x + 1) + ET(x − 1) ),         if L < x < N,
    0,                                          if x = N.
These equations have a unique solution given by the formula
ET (x) = (x − L)(N − x)
and the final answer: the expected number of coin tosses is
(17.1.2) ET(M) = (M − L)(N − M).
The following graph shows ET(x) = (x − L)(N − x), the expected number of coin tosses
to win $N = $60 before reaching as low as $L = $10 in the simple coin toss game.
[Graph of y = ET(x) = (x − L)(N − x) for the beginning amount M = x between 10 and 60.]
17.1.6. Doubling the money coin toss game. Let us now consider a game in which
we begin with $M dollars, toss a fair coin repeatedly and independently. If it comes up
heads, we double our money, and if it comes up tails, we lose half of our money. If we start
with $M , what is the probability that we will get up to $N before we go as low as $L?
To answer this question, we first should notice that our money Mn after n coin tosses is
given as a partial product process Mn = M · Y1 · Y2 · . . . · Yn, where Y1, Y2, Y3, . . . are i.i.d.
random variables with the distribution P(Yi = 2) = P(Yi = 1/2) = 1/2. If again we write
y(x) = P ( winning $N before reaching $L), then
y(x) =
    0,                             if x = L,
    (1/2)( y(2x) + y(x/2) ),       if L < x < N,
    1,                             if x = N.
This function is linear if we change to the logarithmic variable log(x), which gives us the
answer:
P(winning $N before reaching $L) ≈ log(M/L) / log(N/L).
This answer is approximate because, according to the rules, we can only have capital amounts
represented by numbers M · 2^k, where k is an integer, and L, M, N may be only approximately
equal to such numbers. The exact answer is
(17.1.3)    P(winning $N before reaching $L | M = x) = l / (l + w),
where l is the number of straight losses needed to reach $L from $M and w is the number of
straight wins needed to reach $N from $M. Equation (17.1.3) is again the general formula
for Gambler's Ruin problems, the same as in Equation (17.1.1).
The following graph shows the probability to win $N = $256 before reaching as low as
$L = $1 in a game when Mn+1 = 2Mn or Mn+1 = Mn/2 with probability 1/2 at each step.
[Graph of the winning probability as a function of the beginning amount M = x, for 0 ≤ x ≤ 256.]
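The approximation log(M/L)/log(N/L) can also be checked by simulation. The following Python sketch (not part of the original text; M = 16 and the other constants are illustrative) plays the doubling game repeatedly with L = 1 and N = 256.

```python
import math
import random

def win_probability(M, L=1.0, N=256.0, trials=20_000, seed=3):
    # Simulate the doubling game: the money doubles or halves with probability
    # 1/2 at each step, until it reaches L or N.  (Illustrative sketch.)
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        money = M
        while L < money < N:
            money = money * 2 if rng.random() < 0.5 else money / 2
        wins += money >= N
    return wins / trials

M = 16.0
print("simulated          :", win_probability(M))
print("log(M/L)/log(N/L)  :", math.log(M / 1.0) / math.log(256.0 / 1.0))
```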
17.2. Exercises on simple coin toss games
Exercise 17.3. Consider the game in which Mn = M e^{σ Sn}. Describe the rules of this game.
Exercise 17.4. In the game in Exercise 17.3, find EMn, EMn², Var(Mn).
Exercise 17.5. In the game in Exercise 17.3, how are Mn and Mk related?
Hint: assume n > k and write Mn+1 = (Mn+1/Mn) · Mn. Also consider Mn = (Mn/Mk) · Mk.
Exercise 17.7. In the game in Exercise 17.3, find the probability to win $N before reaching
as low as $L.
Exercise 17.8. In the game in Exercise 17.7, find the expected playing time.
Exercise 17.9. Following Exercise 17.3, use the normal approximation (as in the Central
Limit Theorem) to find an approximate distribution of Mn . Then use this distribution to
find approximate values of EMn , EMn2 , Var(Mn ).
Exercise 17.10. Following Exercise 17.6, use the normal approximation (as in the Central
Limit Theorem) to find the approximate value of Cov(Mn , Mk ).
Exercise 17.11. Compare quantities in Exercises 17.4 and 17.9: which ones are larger and
which ones are smaller? In which case a normal approximation gets better for larger n, and
in which case it gets worse? If n → ∞, how does σ need to behave in order to have an
accurate normal approximation?
Problem 17.1. Consider the following game: a fair die is thrown once, and the player can
either stop the game and receive an amount of money equal to the outcome of the die, or the
player can decide to throw the die a second time, and then receive an amount of money
equal to the outcome of the die on this second throw. Compute the maximal expected value
of the payoff and the corresponding optimal strategy.
Problem 17.2. Compute the maximal expected value of the payoff and the corresponding
optimal strategy in the following game. A fair die is thrown 3 times.
• After each throw except for the 3rd one, the player can either stop the game or
continue.
• If the player decides to stop, then he/she receives the amount of money which equals
the current outcome of the die (between 1 and 6).
• If the game is continued up to and including the 3rd throw, the player receives the
amount of money which equals the outcome of the die on the 3rd throw.
Problem 17.3.
(1) Compute the maximal expected value of the payoff and the corresponding optimal
strategy in the same game as in Problem 17.2, but when up to 4, or 5, or 6 throws
are allowed.
(2) Compute the maximal expected value of the payoff and the corresponding optimal
strategy in the same game as in Problem 17.2, when an unlimited number of throws
are allowed.
Problem 17.4. Let us consider a game where at each round, if you bet $x, you get $2x if
you win and $0 if you lose. Let us also suppose that at each round the probability of
winning equals the probability of losing, and is equal to 1/2. Additionally, let us assume
that the outcomes of the rounds are independent.
In such settings, let us consider the following doubling strategy. Starting from a bet of $1 in
the first round, you stop if you win, or you bet twice as much if you lose. In such settings,
if you win for the first (and only) time in the nth round, your cumulative winning is $2^n.
Show that
E[cumulative winning] = ∞.
This is called the St. Petersburg paradox. The paradox lies in the observation that one would
not pay an infinite amount to play such a game.
Notice that if the game is stopped at the nth round, the dollar amount you spent in the
previous rounds is
2^0 + · · · + 2^{n−2} = 2^{n−2} · (1 − (1/2)^{n−1}) / (1 − 1/2) = 2^{n−1} − 1.
Therefore, the dollar difference between the total amount won and the total amount spent is
2^{n−1} − (2^{n−1} − 1) = 1,
and does not depend on n. This seems to specify a riskless strategy of winning $1. However,
and does not depend on n. This seems to specify a riskless strategy of winning $1. However,
if one introduces a credit constraint, i.e., if a player can only spend $M, for some fixed positive
number M , then even if M is large, the expected winning becomes finite, and one cannot
safely win $1 anymore.
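A simulation makes the effect of the credit constraint concrete. In the Python sketch below (illustrative, not part of the original text; the credit limits 15, 255, and 4095 correspond to being able to afford 4, 8, and 12 rounds), the doubling strategy is played many times and the average net winnings are reported; with a cap in place the strategy is no longer a sure way to win $1.

```python
import random

def net_winnings(credit, rng):
    # Play the doubling strategy of Problem 17.4 with a total credit limit.
    bet, spent = 1, 0
    while spent + bet <= credit:
        spent += bet
        if rng.random() < 0.5:        # win: receive twice the bet
            return 2 * bet - spent
        bet *= 2                      # lose: double the bet
    return -spent                     # ran out of credit before winning

rng = random.Random(4)
trials = 200_000
for credit in (15, 255, 4095):
    avg = sum(net_winnings(credit, rng) for _ in range(trials)) / trials
    print(f"credit ${credit}: average net winnings = {avg:.3f}")
```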
Problem 17.5. In the context of Problem 17.4, let G denote the cumulative winning.
Instead of computing the expectation of G, Daniel Bernoulli proposed to compute the
expectation of the logarithm of G. Show that
E[log₂(G)] = log₂(g) < ∞
and find g.
Problem 17.6. Let us suppose that a random variable X, which corresponds to the dollar
amount of winning in some lottery, has the following distribution:
P[X = n] = 1/(C n²),  n ∈ N,
where C = Σ_{n=1}^{∞} 1/n², which in particular is finite. Clearly, X is finite-valued (with probability
one). Show that nevertheless E[X] = ∞.
As a historical remark, note that here C = ζ(2), where ζ(s) = Σ_{n=1}^{∞} 1/n^s is the Riemann zeta
function (or Euler–Riemann zeta function) of a complex variable s. It was first proven by
Euler in 1735 that ζ(2) = π²/6.
Problem 17.7. Let us suppose that a one-year interest rate is determined at the beginning
of each year. In this case r0, r1, . . . , rN−1 are such interest rates, where only r0 is non-random.
Thus $1 of investment at time zero is worth (1 + r0) at the end of year 1,
(1 + r0)(1 + r1) at the end of year 2, (1 + r0) · · · (1 + rk−1) at the end of year k, and so forth.
Let us suppose that r0 = 0.1 and (ri)_{i=1,...,N−1} are independent random variables with the
following Bernoulli distribution (under the so-called risk-neutral measure): ri = 0.15 or 0.05 with
probability 1/2 each.
Compute the price at time 0 of the security that pays $1 at time N. Note that such a security
is called a zero-coupon bond.
Hint: let D_N denote the discount factor, i.e.,
D_N = 1 / ( (1 + r0) · · · (1 + rN−1) ).
We need to evaluate Ẽ[D_N].
Problem 17.8. In the settings of Problem 17.7, the simply compounded yield for the zero-coupon
bond with maturity N is the number y such that
Ẽ[D_N] = 1 / (1 + y)^N.
Calculate y.
Problem 17.9. In the settings of Problem 17.7, the continuously compounded yield for the
zero-coupon bond with maturity N is the number y such that
Ẽ[D_N] = e^{−yN}.
Find y.
17.4. Problems on the Black–Scholes–Merton pricing formula
The following problems are related to the Black-Scholes-Merton pricing formula. Let us
suppose that X is a standard normal random variable and
(17.4.1)    S(T) = S(0) exp( (r − σ²/2)T + σ√T X ),
is the price of the stock at time T , where r is the interest rate, σ is the volatility, and S(0)
is the initial value. Here T , r, σ, and S(0) are constants.
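The lognormal form of (17.4.1) is convenient to simulate. The Python sketch below (illustrative, not part of the original text; the parameter values S(0) = 100, r = 0.05, σ = 0.2, T = 1 are arbitrary) draws samples of S(T) and checks the identity E[S(T)] = S(0)e^{rT}.

```python
import math
import random

def sample_stock_price(S0, r, sigma, T, rng):
    # S(T) from (17.4.1) with X a standard normal random variable
    X = rng.gauss(0.0, 1.0)
    return S0 * math.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * X)

rng = random.Random(5)
S0, r, sigma, T = 100.0, 0.05, 0.2, 1.0
trials = 200_000
mean_ST = sum(sample_stock_price(S0, r, sigma, T, rng) for _ in range(trials)) / trials

print("simulated E[S(T)]:", round(mean_ST, 2))
print("S(0) * exp(rT)   :", round(S0 * math.exp(r * T), 2))
```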
Problem 17.11. In the framework of the Black–Scholes–Merton model, i.e., with the stock
price process given by (17.4.1) with r = 0, let us consider
(17.4.3)    E[ S(t)^{1/3} ].
Find t̂ ∈ [0, 1] and evaluate E[ S(t̂)^{1/3} ] such that
E[ S(t̂)^{1/3} ] = max_{t ∈ [0,1]} E[ S(t)^{1/3} ].
Note that max_{t ∈ [0,1]} E[ S(t)^{1/3} ] is closely related to the payoff of the American cube root option
with maturity 1 and t̂ to the optimal policy.
Problem 17.12. In the framework of the Black–Scholes–Merton model, i.e., with the stock
price process given by (17.4.1), let us consider
(17.4.4)    max_{t ∈ [0,1]} E[ e^{−rt} (S(t) − K)^+ ].
Similarly to Problem 17.11, max_{t ∈ [0,1]} E[ e^{−rt} (S(t) − K)^+ ] is closely related to the payoff of the
American call option with maturity 1 and t̂ to the optimal policy.
Solution to Exercise 17.4: Since Mn = M e^{σSn} and the steps Xi are independent,
EMn = M ( (e^σ + e^{−σ})/2 )^n,
EMn² = M² ( (e^{2σ} + e^{−2σ})/2 )^n,
Var(Mn) = M² [ ( (e^{2σ} + e^{−2σ})/2 )^n − ( (e^σ + e^{−σ})/2 )^{2n} ].

Solution to Exercise 17.9: Using the normal approximation Sn ≈ √n Z,
EMn ≈ M e^{nσ²/2},
EMn² ≈ M² e^{2nσ²},
Var(Mn) ≈ M² ( e^{2nσ²} − e^{nσ²} ).
Solution to Exercise 17.11: Comparing the two sets of formulas,
EMn = M ( (e^σ + e^{−σ})/2 )^n < M e^{nσ²/2},
EMn² = M² ( (e^{2σ} + e^{−2σ})/2 )^n < M² e^{2nσ²},
Var(Mn) = M² [ ( (e^{2σ} + e^{−2σ})/2 )^n − ( (e^σ + e^{−σ})/2 )^{2n} ] < M² ( e^{2nσ²} − e^{nσ²} ).
The normal approximations get better if σ is small, but get worse if n is large. The standard
optimal regime is n → ∞ and nσ² → 1, which means σ ∼ 1/√n.
Sketch of the solution to Problem 17.1: The strategy is to select a value x and say
that the player stops if this value is exceeded after the first throw, and goes to the second
throw if this value is not exceeded. We know that the average value of one throw is
(6 + 1)/2 = 3.5 without any strategy. The probability to exceed x is (6 − x)/6, and the conditional
expectation of the payoff is (7 + x)/2 if x is exceeded. So the expected payoff is
((7 + x)/2) · ((6 − x)/6) + (7/2) · (x/6).
This gives the optimal strategy x = 3 and the maximal expected payoff EP₂ = 4.25.
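The optimal cutoff can also be confirmed by brute force over the six possible thresholds. The short Python sketch below (illustrative, not part of the original text) computes the expected payoff for each cutoff x.

```python
# Brute-force check of the optimal cutoff in Problem 17.1: stop after the
# first throw if its outcome exceeds x, otherwise take the second throw.
def expected_payoff(x):
    second_throw_mean = 3.5
    total = 0.0
    for first in range(1, 7):
        total += first if first > x else second_throw_mean
    return total / 6

for x in range(0, 7):
    print(x, round(expected_payoff(x), 4))
# The maximum, 4.25, is attained at x = 3.
```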
Sketch of the solution to Problem 17.2: The player should use the cutoff x₁ = 4
after the first throw, and the cutoff x₂ = 3 after the second throw, following Problem 17.1. The
expected payoff of the game which allows up to three throws is
EP₃ = ((7 + 4)/2) · ((6 − 4)/6) + (17/4) · (4/6) = 14/3 ≈ 4.6666.
Sketch of the solution to Problem 17.8: Direct computations give y = ( 1/Ẽ[D_N] )^{1/N} − 1.
Sketch of the solution to Problem 17.11: Let us fix t ∈ [0, 1]. From Jensen's inequality
(Proposition 15.7), we get E[ S(t)^{1/3} ] ≤ ( E[S(t)] )^{1/3}. The inequality is strict for t > 0, by
strict concavity of x ↦ x^{1/3}. The equality is achieved at t = 0. Therefore t̂ = 0, and
E[ S(t̂)^{1/3} ] = S(0)^{1/3}.
Sketch of the solution to Problem 17.12: For every t ∈ [0, 1], we have
e^{−rt}(S(t) − K)^+ = ( S(t)e^{−rt} − Ke^{−rt} )^+ ≤ ( S(t)e^{−rt} − Ke^{−r} )^+
and E[ S(t)e^{−rt} ] = S(0). Now, using convexity of x ↦ (x − K)^+ and applying Jensen's
inequality (Proposition 15.7) for conditional expectation, we deduce that
E[ ( S(t)e^{−rt} − Ke^{−r} )^+ ] < E[ ( S(1)e^{−r} − Ke^{−r} )^+ ]
for every t ∈ [0, 1). We conclude that
t̂ = 1   and   E[ e^{−rt̂}( S(t̂) − K )^+ ] = E[ e^{−r}( S(1) − K )^+ ].
Table of probability distributions

Discrete random variables. For each distribution we list the parameters, the probability mass function P[X = k], the mean E[X], the variance Var(X), and the moment generating function E[e^{tX}], t ∈ R. Here C(n, k) denotes the binomial coefficient n!/(k!(n − k)!).

Bernoulli, Bern(p), p ∈ [0, 1]: P[X = k] = p^k (1 − p)^{1−k} for k ∈ {0, 1}; E[X] = p; Var(X) = p(1 − p); MGF (1 − p) + pe^t.

Binomial, Bin(n, p), n ∈ N, p ∈ [0, 1]: P[X = k] = C(n, k) p^k (1 − p)^{n−k} for k = 0, 1, . . . , n; E[X] = np; Var(X) = np(1 − p); MGF [(1 − p) + pe^t]^n.

Poisson, Pois(λ), λ > 0: P[X = k] = e^{−λ} λ^k / k! for k = 0, 1, 2, . . .; E[X] = λ; Var(X) = λ; MGF exp(λ(e^t − 1)).

Geometric, Geo(p), p ∈ (0, 1): P[X = k] = (1 − p)^{k−1} p for k ≥ 1, and 0 otherwise; E[X] = 1/p; Var(X) = (1 − p)/p²; MGF pe^t / (1 − (1 − p)e^t) for t < − log(1 − p).

Negative binomial, NB(r, p), r ∈ N, p ∈ (0, 1): P[X = k] = C(k − 1, r − 1) p^r (1 − p)^{k−r} for k ≥ r, and 0 otherwise; E[X] = r/p; Var(X) = r(1 − p)/p²; MGF ( pe^t / (1 − (1 − p)e^t) )^r for t < − log(1 − p).

Hypergeometric, parameters n, m, N ∈ N₀: P[X = k] = C(m, k) C(N − m, n − k) / C(N, n); E[X] = nm/N; Var(X) and MGF: (not tested).