Probability & Statistics
with Applications to Computing
Alex Tsun
Acknowledgements
This textbook would not have been possible without the following people:
• Mitchell Estberg (Head TA CSE 312 at UW): For helping organize the effort to put this together,
formatting, and revising a vast majority of this content. The countless hours you put in for the course
and for me are much appreciated. Neither this class nor this book would have been possible without
your contributions!
• Pemi Nguyen (TA CSE 312 at UW): For helping to add examples, motivation, and contributing
to a lot of this content. For constantly going above and beyond to ensure a great experience for the
students, the staff, and me. Your bar for quality and dedication to these notes, the course, and the
students is unrivaled.
• Cooper Chia, William Howard-Snyder, Shreya Jayaraman, Aleks Jovcic, Muxi (Scott)
Ni, Luxi Wang (TAs, CSE 312 at UW): For each typesetting several sections and adding your own
thoughts and intuition. Thank you for being the best teaching staff I could have asked for!
• Joshua Fan (UW/Cornell): You are an amazing co-TA who is extremely dedicated to your students.
Thank you for your help in developing this content, and for recording several videos!
• Matthew Taing (UW): You are an extremely caring TA, dedicated to making learning an enjoyable
experience. Thank you for your help and suggestions throughout the development of this content.
• Martin Tompa (Professor at UW Allen School): Thank you for taking a chance on me to give me my
first TA experience, and for supporting me through my career to graduate school and beyond. Thank
you especially for helping me attain my first teaching position, and for your advice and mentorship.
• Anna Karlin (Professor at UW Allen School): Thank you for making my CSE 312 TA experiences at
UW amazing, and for giving me much freedom and flexibility to create content and lead during those
times. Thank you also for your significant help and guidance during my first teaching position.
• Lisa Yan, David Varodayan, Chris Piech (Instructors at Stanford): I learned a lot from TAing
for each of you, especially getting to compare and contrast this course at two different universities. I’d like
to think I took the “best of both worlds” at Stanford and the University of Washington. Thank you
for your help, guidance, and inspiration!
• My Family: Thank you for supporting me and encouraging me to pursue my passions. I would not
be where I am or the person I am without you.
Notes
Information
This book was written in Summer 2020 during an offering of “CSE 312: Foundations of Computing
II”, which is essentially probability and statistics for computer scientists. The curriculum was based off of
this course as well as Stanford University’s “CS 109: Probability for Computer Scientists”. I strongly believe
coding applications (which are included in Chapter 9) are essential to teach to show why this class is a core
CS requirement, but also because they help keep the students engaged and excited. This textbook is currently
being used at the University of Washington (Autumn 2020).
Resources
• Course Videos (YouTube Playlist): Mostly under 5 minutes long, serves generally as a quick review
of each section.
• Course Slides (Google Drive): Contains Google Slides presentations for each section, used in the videos.
• Course Website (UW CSE 312): Taught at the University of Washington during Summer 2020 and
Autumn 2020 quarters by Alex Tsun and Professor Anna Karlin.
https://courses.cs.washington.edu/courses/cse312/20su/
• This Textbook: Available online free here.
• Key Theorems and Definitions: At the end of this book.
• Distributions (2 pages): At the end of this book.
Assumed Prerequisites
We assume the student has experience in the following topics:
• Multivariable calculus (at least up to partial derivatives and double integrals). We won’t really use
much calculus beyond taking derivatives and integrals, so a surface-level knowledge is fine.
• Discrete mathematics (introduction to logic and proofs). We’ll especially use set theory, but this will
be covered in Chapter 0: Prerequisites of this book.
• Programming experience (at least one or two introductory classes, in any language). We will teach
Python, but assume knowledge of fundamental ideas such as variables, conditionals, loops, and
arrays. This will be crucial in studying and coding up the CS applications of Chapter 9.
About the Author
Alex Tsun grew up in the Bay Area, with a family full of software engineers (parents and older brother). He
completed Bachelor’s degrees in computer science, statistics, and theoretical mathematics at the University
of Washington in 2018, before attending Stanford University for his Master’s degree in AI and Theoretical
CS. During his six years as a student, he served as a TA for this course a total of 13 times. After graduating
in June 2020, he returned to UW to be the instructor for the course CSE 312 during Summer 2020.
Contents
0. Prerequisites
0.1 Intro to Set Theory
0.2 Set Operations
0.3 Sum and Product Notation
1. Combinatorial Theory
1.1 So You Think You Can Count?
1.2 More Counting
1.3 No More Counting Please
2. Discrete Probability
2.1 Intro to Discrete Probability
2.2 Conditional Probability
2.3 Independence
Chapter 0. Prerequisites
This chapter focuses on set theory, which makes up the building blocks of probability. To even define a
probability space, we need this notion of a set. While it is assumed that a discrete mathematics course was
taken, we will focus on reviewing this particular topic. We also cover summation and product notation,
which we will use frequently for compactness and conciseness of notation.
Chapter 0. Prerequisites
0.1: Intro to Set Theory
Slides (Google Drive) Video (YouTube)
There is only one set of cardinality 0 (containing no elements), the empty set, denoted by ∅ = {}.
Example(s)
Find the cardinality of each of the following sets: {apple, orange, watermelon}; {1, 1, 1, 1, 1}; [0, 1];
{1, 2, 3, ...}; {∅, {1}, {2}, {1, 2}}; and {∅, {1}, {1, 1}, {1, 1, 1}, ...}.
Solution To calculate the cardinality of a set, we have to determine the number of elements in the set.
1. For the set {apple, orange, watermelon}, we have three distinct elements, so the cardinality is 3. That
is, |{apple, orange, watermelon}| = 3.
2. For {1, 1, 1, 1, 1}, there are five 1’s, but recall that sets don’t contain duplicates, so actually this set only
contains 1, and is equal to the set {1}. This means that its cardinality is 1; that is, |{1, 1, 1, 1, 1}| = 1.
3. For the set [0, 1], all the values between 0 and 1 (inclusive), we have an infinite number of elements.
This means that the cardinality of this set is infinity; that is, |[0, 1]| = ∞.
4. For the set {1, 2, 3, ...}, the set of all positive integers, we have an infinite number of elements. This
means that the cardinality of this set is infinity; that is, |{1, 2, 3, ...}| = ∞.
5. For the set {∅, {1}, {2}, {1, 2}} (a set of sets), there are four distinct elements that are each a different
set. This means that the cardinality is 4; that is, |{∅, {1}, {2}, {1, 2}}| = 4.
6. Finally, for the set {∅, {1}, {1, 1}, {1, 1, 1}, ...}, we do have an infinite number of sets, each of which
is an element. But are these distinct? Upon further consideration, all the sets containing various
numbers of 1’s are equivalent, as duplicates don’t matter. So the only distinct elements are the empty
set and the set {1}. So the cardinality is 2; that is, |{∅, {1}, {1, 1}, {1, 1, 1}, ...}| = |{∅, {1}}| = 2.
Example(s)
Let us define A = {1, 3}, B = {3, 1}, C = {1, 2} and D = {∅, {1}, {2}, {1, 2}, 1, 2}.
Determine whether the following are true or false:
• 1 ∈ A
• 1 ⊆ A
• {1} ⊆ A
• {1} ∈ A
• 3 ∉ C
• A ∈ B
• A ⊆ B
• C ∈ D
• C ⊆ D
• ∅ ∈ D
• ∅ ⊆ D
• A = B
• ∅ ⊆ ∅
• ∅ ∈ ∅
Solution
• 1 ∈ A. True, because 1 is an element in A.
• 1 ⊆ A. False, because 1 is a value, not a set, so it cannot be a subset of a set.
• {1} ⊆ A. True, because every element of the set {1} is an element of A.
• {1} ∈ A. False, because {1} is a set, and A contains no sets as elements.
• 3 ∉ C. True, because the value 3 is not one of the elements of C.
• A ∈ B. False, because A is a set, and there are no elements of B which are sets, so A ∉ B.
• A ⊆ B. True, because every element of A is an element of B.
• C ∈ D. True, because C is an element of D.
• C ⊆ D. True, because each of the elements of C is also an element of D.
• ∅ ∈ D. True, because the empty set is an element of D.
• ∅ ⊆ D. True: by definition, the empty set is a subset of any set. This is because if this were not the
case, there would have to be an element of ∅ which was not in D. But there are no elements in ∅, so
the statement is true (vacuously).
• A = B. True: A ⊆ B, as every element of A is an element of B, and B ⊆ A, as every element of B is an
element of A. Since this relationship holds in both directions, we have A = B.
• ∅ ⊆ ∅. True, because the empty set is a subset of every set (vacuously).
• ∅ ∈ ∅. False, because the empty set contains no elements, so the empty set cannot be an element of it.
Chapter 0. Prerequisites
0.2: Set Operations
Slides (Google Drive) Video (YouTube)
Example(s)
1. If we were talking about the set of fruits a supermarket might sell S, we might have S =
{apple, watermelon, pear, strawberry} and U = {all fruits}. We might want to know which
fruits the supermarket doesn’t sell, which would be denoted S^C (defined later). This requires a
universal set of all fruits that we can check with to see which are missing from S.
2. If we were talking about the set of kinds of cars Bill Gates owns, that might be the set T. There
must be a universal set U of possible kinds of cars that exist, if we wanted to list out which
ones he was missing, T^C.
The union of A and B is denoted A ∪ B. It contains elements in A or B, or both (without duplicates).
So x ∈ A ∪ B if and only if x ∈ A or x ∈ B.
The image below shows in red the union of A and B: A ∪ B. The outer rectangle is the universal set U.
11
12 Probability & Statistics with Applications to Computing 0.2
The intersection of A and B is denoted A ∩ B. It contains the elements that are in both A and B. So
x ∈ A ∩ B if and only if x ∈ A and x ∈ B.
The image below shows in red the intersection of A and B: A ∩ B. The outer rectangle is the universal set U.
The set difference of A with B is denoted A \ B. It contains elements of A which are not in B. So
x ∈ A \ B if and only if x ∈ A and x ∉ B.
The image below shows in red the set difference of A with B: A \ B. The outer rectangle is the universal set
U.
Example(s)
Let A = {1, 3}, B = {2, 3, 4}, and U = {1, 2, 3, 4, 5}. Solve for: A ∩ B, A ∪ B, B \ A, A \ B, (A ∪ B)^C,
A^C, B^C, and A^C ∩ B^C.
Solution
• A ∩ B = {3}, since 3 is the only element in both A and B.
• A ∪ B = {1, 2, 3, 4}, as these are all the elements in either A or B. Note we dropped the duplicate 3,
since sets cannot contain duplicates.
• B \ A = {2, 4}, as these are the elements of B which are not in A.
• A \ B = {1}, as this is the only element of A which is not an element of B.
• (A ∪ B)^C = {5}, as by definition (A ∪ B)^C = U \ (A ∪ B), and 5 is the only element of U which is not
an element of A ∪ B.
• A^C = {2, 4, 5}, as by definition A^C = U \ A, and these are the elements of U which are not elements
of A.
• B^C = {1, 5}, as by definition B^C = U \ B, and these are the elements of U which are not elements of
B.
• A^C ∩ B^C = {5}, because the only element in both A^C and B^C is 5 (see the above).
Chapter 0. Prerequisites
0.3: Sum and Product Notation
Slides (Google Drive) Video (YouTube)
1 + 2 + 3 + ··· + 10 = Σ_{i=1}^{10} i
Note that i is just a dummy variable. We could have also used j, k, or any other letter. What if we wanted
to sum numbers that weren’t consecutive integers?
As long as there is some pattern, we can write it compactly! For example, how could we write 16 + 25 +
36 + · · · + 81? In the first equation below (0.3.1), j takes on the values from 4 to 9, and the square of each of
these values will be summed together. Note that this is equivalent to k taking on the values of 1 to 6, and
adding 3 to each of the values before squaring and summing them up (0.3.2).
16 + 25 + 36 + ··· + 81 = Σ_{j=4}^{9} j^2    (0.3.1)
                        = Σ_{k=1}^{6} (k + 3)^2    (0.3.2)
If you know what a for-loop is (from computer science), this is exactly the following (in Java or C++).
This first loop represents the first sum with dummy variable j.
int sum = 0;
for (int j = 4; j <= 9; j++) {
    sum += (j * j);
}
This second loop represents the second sum with dummy variable k, and is equivalent to the first.
int sum = 0;
for (int k = 1; k <= 6; k++) {
    sum += ((k + 3) * (k + 3));
}
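Since the coding applications in Chapter 9 use Python, here is the same computation as a quick Python sketch (our own translation of the loops above, not from the original notes); note that range(4, 10) runs j from 4 through 9.

total = sum(j * j for j in range(4, 10))                 # 16 + 25 + ... + 81
total_shifted = sum((k + 3) ** 2 for k in range(1, 7))   # k = 1, ..., 6
print(total, total_shifted)                              # both print 271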
Furthermore, if S is a set, and f : S → R is a function defined on S, then the following notation sums
f(x) over all elements x ∈ S:

Σ_{x∈S} f(x)
Note that the sum over no terms (the empty set) is defined as 0.
Example(s)
Write out the following sums:
• Σ_{k=3}^{7} k^10
• Σ_{y∈S} (2y + 5), for S = {3, 6, 8, 11}
• Σ_{t=6}^{8} 4
• Σ_{z=2}^{1} sin(z)
• Σ_{x∈T} 13x, for T = {−1, −3, 5}.
Solution
• For Σ_{k=3}^{7} k^10, we raise each value of k from 3 to 7 to the power of 10 and sum them together. That
is:

Σ_{k=3}^{7} k^10 = 3^10 + 4^10 + 5^10 + 6^10 + 7^10

• Then, if we let S = {3, 6, 8, 11}, for Σ_{y∈S} (2y + 5), we multiply each value y in S by 2, add 5, and
then sum the results together. That is:

Σ_{y∈S} (2y + 5) = (2 · 3 + 5) + (2 · 6 + 5) + (2 · 8 + 5) + (2 · 11 + 5)

• For the sum of a constant, Σ_{t=6}^{8} 4, we add the constant 4 for each value t = 6, 7, 8. This is equivalent
to just adding 4 together three times:

Σ_{t=6}^{8} 4 = 4 + 4 + 4

• Then, for a range with no values, the sum is defined as 0. For Σ_{z=2}^{1} sin(z), because there are no
values from 2 to 1, we have:

Σ_{z=2}^{1} sin(z) = 0
• Finally, if we let T = {−1, −3, 5}, for Σ_{x∈T} 13x, we multiply each value of x in T by 13 and then sum
them up:

Σ_{x∈T} 13x = 13(−1) + 13(−3) + 13(5)
            = 13(−1 + (−3) + 5)
            = 13 Σ_{x∈T} x
Notice that we can actually factor out the 13; that is, we could sum all values of x 2 T first, and then
multiply by 13. This is one of a few properties of summations we can see below!
Further, the associative and distributive properties hold for sums. If you squint hard enough, you can kind
of see why they’re true! We’ll also see some examples below too, since the notation can be confusing at first.
We have the associative property (0.3.3) and distributive properties (0.3.4, 0.3.5) for sums.

Σ_{x∈A} f(x) + Σ_{x∈A} g(x) = Σ_{x∈A} (f(x) + g(x))    (0.3.3)

Σ_{x∈A} αf(x) = α Σ_{x∈A} f(x)    (0.3.4)

(Σ_{x∈A} f(x)) (Σ_{y∈B} g(y)) = Σ_{x∈A} Σ_{y∈B} f(x)g(y)    (0.3.5)
The last property is like FOIL - if you multiply (x + x^2 + x^3)(1/y + 1/y^2) (left-hand side), for example, you
would have to sum over every possible combination x/y + x/y^2 + x^2/y + x^2/y^2 + x^3/y + x^3/y^2 (right-hand
side).
The proof of these are left to the reader, but see the examples below for some intuition!
Example(s)
“Prove” the following by writing out the sums:
• Σ_{i=5}^{7} i + Σ_{i=5}^{7} i^2 = Σ_{i=5}^{7} (i + i^2)
• Σ_{j=3}^{5} 2j = 2 Σ_{j=3}^{5} j
• (Σ_{i=1}^{2} f(a_i)) (Σ_{j=1}^{3} g(b_j)) = Σ_{i=1}^{2} Σ_{j=1}^{3} f(a_i)g(b_j)
Solution
Σ_{i=5}^{7} i + Σ_{i=5}^{7} i^2 = (5 + 6 + 7) + (5^2 + 6^2 + 7^2) = (5 + 5^2) + (6 + 6^2) + (7 + 7^2) = Σ_{i=5}^{7} (i + i^2)

Σ_{j=3}^{5} 2j = 2 · 3 + 2 · 4 + 2 · 5 = 2(3 + 4 + 5) = 2 Σ_{j=3}^{5} j
(Σ_{i=1}^{2} f(a_i)) (Σ_{j=1}^{3} g(b_j)) = (f(a_1) + f(a_2)) (g(b_1) + g(b_2) + g(b_3))
= f(a_1)g(b_1) + f(a_1)g(b_2) + f(a_1)g(b_3) + f(a_2)g(b_1) + f(a_2)g(b_2) + f(a_2)g(b_3)
= Σ_{i=1}^{2} Σ_{j=1}^{3} f(a_i)g(b_j)
Further, if S is a set, and f : S → R is a function defined on S, then the following notation multiplies
f(x) over all elements x ∈ S:

Π_{x∈S} f(x)
Note that the product over no terms is defined as 1 (not 0 like it was for sums).
Example(s)
Write out the following products:
• Π_{a=4}^{7} a
• Π_{x∈S} 8, for S = {3, 6, 8, 11}
• Π_{z=2}^{1} sin(z)
• Π_{b=2}^{5} 9^(1/b)
Solution
• For Π_{a=4}^{7} a, we multiply each value a in the range 4 to 7 and have:

Π_{a=4}^{7} a = 4 · 5 · 6 · 7
• Then, if we let S = {3, 6, 8, 11}, for Π_{x∈S} 8, we multiply 8 for each value in the set S and have:

Π_{x∈S} 8 = 8 · 8 · 8 · 8

• Then for Π_{z=2}^{1} sin(z), we have the empty product, because there are no values in the range 2 to 1,
so we have:

Π_{z=2}^{1} sin(z) = 1

• Finally, for Π_{b=2}^{5} 9^(1/b), we multiply the values of 9^(1/b) for each b from 2 to 5, to get:

Π_{b=2}^{5} 9^(1/b) = 9^(1/2) · 9^(1/3) · 9^(1/4) · 9^(1/5)
                    = 9^(1/2 + 1/3 + 1/4 + 1/5)
                    = 9^(Σ_{b=2}^{5} 1/b)
Also, if you were to do the same examples as we did for sums, replacing Σ with Π, you just multiply instead
of add! They are almost identical, except the empty sum is 0 and the empty product is 1.
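Python’s built-ins happen to follow the same conventions, which makes for a quick sanity check (a small sketch of ours; math.prod requires Python 3.8+):

import math

print(sum([]))        # 0 -- the empty sum
print(math.prod([]))  # 1 -- the empty product

# The 9^(1/b) example above: a product of powers is 9 raised to the sum of the exponents.
lhs = math.prod(9 ** (1 / b) for b in range(2, 6))
rhs = 9 ** sum(1 / b for b in range(2, 6))
print(math.isclose(lhs, rhs))  # True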
Chapter 1. Combinatorial Theory
1.1: So You Think You Can Count?
Slides (Google Drive) Video (YouTube)
Before we jump into probability, we must first learn a little bit of combinatorics, or more informally, counting.
You might wonder how this is relevant to probability, and we’ll see how very soon. You might also think
that counting is for kindergarteners, but it is actually a lot harder than you think!
To motivate us, let’s consider how easy or difficult it is for a robber to randomly guess your PIN code. Every
debit card has a PIN code that their owners use to withdraw cash from ATMs or to complete transactions.
How secure are these PINs, and how safe can we feel?
If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is
no overlap), then the number of possible outcomes of the experiment is:
N +M
We’ll see some examples of the Sum Rule combined with the Product Rule (next), which can be a bit
more complex!
For example, suppose we have 3 different tops and 4 different bottoms, and an outfit consists of one top
and one bottom. Well, we can consider this by first picking out a top. Once we have our top, we have 4
choices for our bottom. This means we have 4 choices of bottom for each top, of which we have 3. So, we
have a total of 4 + 4 + 4 = 3 × 4 = 12 outfit choices.
We could also do this in reverse and first pick out a bottom. Once we have our bottom, we have 3 choices
for our top. This means we have 3 choices of top for each bottom, which we have 4 of. So, we still have a
total of 3 + 3 + 3 + 3 = 4 × 3 = 12 outfit choices. (This makes sense - the number of outfits should be the
same no matter how I count!)
What if we also wanted to add socks to the outfit, and we had 2 different pairs of socks? Then, for each of
the 12 choices outlined above, we now have 2 choices of sock. This brings us to a total of 24 possible outfits.
This could be calculated more directly rather than drawing out each of these unique outfits, by multiplying
our choices: 3 tops × 4 bottoms × 2 socks = 24 outfits.
If this still sounds “simple” to you or you just want to practice, see the examples below! There are some
pretty interesting scenarios we can count, and they are more difficult than you might expect.
Example(s)
Flamingos Fanny and Freddy have three offspring: Happy, Glee, and Joy. These five flamingos are
to be distributed to seven different zoos so that no zoo gets both a parent and a child :(. It is not
required that every zoo gets a flamingo. In how many different ways can this be done?
Solution There are two disjoint (mutually exclusive) cases we can consider that cover every possibility. We
can use the sum rule to add them up since they don’t overlap!
1. Case 1: The parents end up in the same zoo. There are 7 choices of zoo they could end up
at. Then, the three offspring can go to any of the 6 other zoos, for a total of 7 × 6 × 6 × 6 = 7 × 6^3
possibilities (by the product rule).
2. Case 2: The parents end up in different zoos. There are 7 choices for Fanny and 6 for Freddy.
Then, the three offspring can go to any of the 5 other zoos, for a total of 7 × 6 × 5^3 possibilities.
The result, by the sum rule, is 7 × 6^3 + 7 × 6 × 5^3. (Note: This may not be the only way to solve this problem.
Often, counting problems have two or more approaches, and it is instructive to try different methods to get
the same answer. If they differ, at least one of them is wrong, so try to find out which one and why!)
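Counting arguments like this are also easy to sanity-check by brute force when the numbers are small. Here is a sketch of ours that enumerates all 7^5 assignments of the five flamingos to zoos (the index assignments are our own choice, not from the text):

from itertools import product

parents = [0, 1]        # Fanny, Freddy
children = [2, 3, 4]    # Happy, Glee, Joy
count = 0
for zoos in product(range(7), repeat=5):   # zoos[i] = zoo assigned to flamingo i
    # valid iff no zoo holds both a parent and a child
    if not any(zoos[p] == zoos[c] for p in parents for c in children):
        count += 1
print(count, 7 * 6**3 + 7 * 6 * 5**3)      # both print 6762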
1.1.3 Permutations
Back to the example of the debit card. There are 10 possible digits for each of the 4 digits of a PIN. So
how many possible 4-digit PINs are there? This can be solved as 10 × 10 × 10 × 10 = 10^4 = 10,000. So,
there is a one in ten thousand chance that a robber can guess your PIN code (randomly).
Let’s consider a stronger case where you must use each digit exactly once, so the PIN is exactly 10 digits
long. How many such PINs exist?
Well, we have 10 choices for the first digit, 9 choices for the second digit, and so forth, until we only have 2
choices for the ninth digit, and 1 choice for the tenth digit. This means there are 3,628,800 possible PINs in
this scenario, as follows:

10 × 9 × ··· × 2 × 1 = Π_{i=1}^{10} i = 3,628,800
This formula/pattern seems like it would appear often! Wouldn’t it be great if there were a shorthand for
this?
Definition 1.1.3: Permutation
The number of orderings of N distinct objects is called a permutation, and is mathematically defined
as:

N! = N × (N − 1) × (N − 2) × ... × 3 × 2 × 1 = Π_{j=1}^{N} j

N! is read as “N factorial”. It is important to note that 0! = 1, since there is one way to arrange 0
objects.
Example(s)
A standard 52-card deck consists of one of each combination of: 13 different ranks (Ace, 2, 3, ..., 10,
Jack, Queen, King) and 4 different suits (clubs, diamonds, hearts, spades), since 13 × 4 = 52. In how
many ways can a 52-card deck be dealt to thirteen players, four cards to each, so that every player has
one card of each suit?
Solution This is a great example where we can try two equivalent approaches. Each person usually has
different preferences, and sometimes one way is significantly easier to understand than another. Read them
both, understand why they both make sense and are equal, and figure out which approach is more intuitive
for you!
Let’s assign each player one at a time. The first player has 13 choices for the club, 13 for the heart, 13 for
the diamond, and 13 for the spade, for a total of 13^4 ways. The second player has 12^4 choices (since there
are only 12 of each suit remaining). And so on, so the answer is 13^4 × 12^4 × 11^4 × ... × 2^4 × 1^4.
Alternatively, we can assign each suit one at a time. For the clubs suit, there are 13! ways to distribute them
to the 13 different players. Then, the diamonds suit can be assigned in 13! ways as well, and same for the
other two suits. By the product rule, the total number of ways is (13!)^4. Check that this different order of
assigning cards gave the same answer as earlier! (Expand the factorials.)
Example(s)
A group of n families, each with m members, are to be lined up for a photograph. In how many ways
can the nm people be arranged if members of a family must stay together?
Solution We first choose the ordering of the families, of which there are n!. Then, in the first family, we have
m! ways to arrange them. The second family also has m! ways to be arranged. And so on. By the product
rule, the number of orderings is n! × (m!)^n.
Now consider this question: how many 10-digit PINs have at least one digit repeated? One approach might
be to count how many PINs don’t satisfy this property, and subtract it from the total number of PINs. This
strategy is called complementary counting, as we are counting the size of the complement of the set of
interest. The number of possible 10-digit PINs, with no stipulations, is 10^10 (from the product rule,
multiplying 10 choices with itself for each of 10 positions). Then, we found above that the 10-digit PINs
with no repeats have 10! possibilities (each digit used exactly once). Well, consider that the 10-digit PINs
with at least one repeat will be all other possibilities (they could have one, two, or more repeats but
certainly won’t have none). This means that we can count this by taking the difference of all the possible
10-digit PINs and those with no repeats. That is:

10^10 − 10!
Let U be a (finite) universal set, and S a subset of interest. Let S^C = U \ S denote the set difference
(complement of S). Then,

|S| = |U| − |S^C|

Informally, to find the number of ways to do something, we could count the number of ways NOT
to do that thing, and subtract it from the total. That is, the complement of the subset of interest is
also of interest!
1.1.5 Exercises
1. Suppose we have 6 people who want to line up in a row for a picture, but two of them, A and B, refuse
to sit next to each other. How many ways can they sit in a row?
Solution: There are two equivalent approaches. The first approach is to solve it directly. However,
depending on where A sits, B has a different number of options (whether A sits at the end or in the
middle). So we have two disjoint (non-overlapping) cases:
(a) Case 1: A sits at one of the two end seats. Then, A has 2 choices for where to sit, and B has 4.
(See this diagram where A sits at the right end: _ _ _ _ _ A.) Then, there are 4! ways for the
remaining people to sit, for a total of 2 × 4 × 4! ways.
(b) Case 2: A sits in one of the middle 4 seats. Then, A has 4 choices of seat, but B only has three
choices for where to sit. (See this diagram where A sits in a middle seat: _ _ A _ _ _.) Again,
there are 4! ways to seat the rest, for a total of 4 × 3 × 4! ways.
The alternative approach is complementary counting. We can count the total orderings, of which
there are 6!, and subtract the cases where A and B do sit next to each other. There’s a trick we can
do to guarantee this: let’s treat A and B as a single entity. Then, along with the remaining 4 people,
there are only 5 entities. We order the entities in 5! ways, but also multiply by 2! since we could have
the seating AB or BA. Hence, the number of ways they do sit together is 2 × 5! = 240, and the ways
they do not sit together is 6! − 240 = 720 − 240 = 480.
Decide which approach you liked better - oftentimes, one method will be easier than another!
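Both approaches are easy to verify by brute force; here is a small sketch of ours that enumerates all 6! seatings:

from itertools import permutations

count = 0
for order in permutations("ABCDEF"):
    if abs(order.index("A") - order.index("B")) > 1:   # A and B are not adjacent
        count += 1
print(count)   # 480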
Chapter 1. Combinatorial Theory
1.2: More Counting
Slides (Google Drive) Video (YouTube)
1.2.1 k-Permutations
Last time, we learned the foundational techniques for counting (the sum and product rule), and the factorial
notation which arises frequently. Now, we’ll learn even more “shortcuts”/“notations” for common counting
situations, and tackle more complex problems.
We’ll start with a simpler situation than most of the exercises from last time. How many 3-color mini
rainbows can be made out of 7 available colors, with all 3 being different colors?
We choose an outer color, then a middle color and then an inner color. There are 7 possibilities for the outer
layer, 6 for the middle and 5 for the inner (since we cannot have duplicates). Since order matters, we find
that the total number of possibilities is 210, from the following calculation:
7 · 6 · 5 = (7 · 6 · 5)/1 · (4 · 3 · 2 · 1)/(4 · 3 · 2 · 1)    [multiply numerator and denominator by 4! = 4 · 3 · 2 · 1]
         = 7!/4!    [def of factorial]
         = 7!/(7 − 3)!
Notice that we are “picking” 3 out of 7 available colors - so order matters. This may not seem useful, but
imagine if there were 835 colors and we wanted a rainbow with 135 di↵erent colors. You would have to
multiply 135 numbers, rather than just three!
If we want to arrange only k out of n distinct objects, the number of ways to do so is P(n, k) (read
as “n pick k”), where

P(n, k) = n · (n − 1) · (n − 2) · ... · (n − k + 1) = n!/(n − k)!
A permutation of n objects is an arrangement of each object (where order matters), so a k-
permutation is an arrangement of k members of a set of n members (where order matters). The
number of k-permutations of n objects is just P(n, k).
Example(s)
Suppose we have 13 chairs (in a row) with 9 TAs, and Professors Sunny, Rainy, Windy, and Cloudy
to be seated. What is the number of seatings where every professor has a TA to his/her immediate
left and right?
Solution This is quite a tricky problem if we don’t choose the right setup. Imagine we first seat the 9 TAs
- there are 9! ways to do this. Then, there are 8 spots between them, so that if we place a professor there,
they’re guaranteed to have a TA to their immediate left and right. We can’t place more than one professor
in a spot. Out of the 8 spots, we pick 4 of them for the professors to sit in (order matters, since the professors
are different people). So the answer by the product rule is 9! · P(8, 4).
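As a side note, Python’s math module (version 3.8+) computes “n pick k” directly, which makes answers like these quick to evaluate; a small sketch:

import math

print(math.perm(7, 3))                        # 210, the number of 3-color mini-rainbows
print(math.factorial(9) * math.perm(8, 4))    # the TA/professor seating count above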
Recall that there were P(7, 3) = 210 possible mini-rainbows. But as we see from these rainbows, each
“smeared” color is counted 3! = 6 times. So, to get our answer, we take the 210 mini-rainbows and divide
by 6 to account for the overcounting since in this case, order doesn’t matter.
The answer is,

210/6 = P(7, 3)/3! = 7!/(3!(7 − 3)!)
If we want to choose (order doesn’t matter) only k out of n distinct objects, the number of ways to
do so is C(n, k) (read as “n choose k”), where

C(n, k) = P(n, k)/k! = n!/(k!(n − k)!)

A k-combination is a selection of k objects from a collection of n objects, in which the order does
not matter. The number of k-combinations of n objects is just C(n, k). C(n, k) is also called a binomial
coefficient - we’ll see why in the next section.
Notice, we can show from this that there is symmetry in the definition of binomial coefficients:

C(n, k) = n!/(k!(n − k)!) = n!/((n − k)!k!) = C(n, n − k)
Looking at these, we can see that the color choices in each row are complementary. Intuitively, choosing
1 color is the same as choosing 4 − 1 = 3 colors that we don’t want - and vice versa. This explains the
symmetry in binomial coefficients!
Example(s)
There are 6 AI professors and 7 theory professors taking part in an escape room. If 4 AI
professors and 4 theory professors are to be chosen and divided into 4 pairs (one AI professor with
one theory professor per pair), how many pairings are possible?
Solution We first choose 4 out of 6 AI professors, with order not mattering, and 4 out of 7 theory professors,
again with order not mattering. There are C(6, 4) · C(7, 4) ways to do this by the product rule. Then, for the
first theory professor, we have 4 choices of AI professor to match with; for the second theory professor, we
only have 3 choices, and so on. So we multiply by 4! to pair them off, and we get C(6, 4) · C(7, 4) · 4!. You
may have counted it differently, but check if your answer matches!
But if we want to rearrange the letters in “POOPOO”, we have indistinct letters (two types - P and O).
How do we approach this?
One approach is to choose where the 2 P’s go, and then the O’s have to go in the remaining 4 spots
(C(4, 4) = 1 way). Or, we can choose where the 4 O’s go, and then the remaining P’s are set (C(2, 2) = 1 way).
Either way, we get,

C(6, 2) · C(4, 4) = C(6, 4) · C(2, 2) = 6!/(2!4!)

Another interpretation of this formula is that we are first arranging the 6 letters as if they were distinct:
P1 O1 O2 P2 O3 O4. Then, we divide by 4! and 2! to account for 4 duplicate O’s and 2 duplicate P’s.
What if we got even more complex, let’s say three different letters? For example, rearranging the word
“BABYYYBAY”. There are 3 B’s, 2 A’s, and 4 Y’s, for a total of 9 letters. We can choose where the 3 B’s
should go out of the 9 spots: C(9, 3) (order doesn’t matter since all the B’s are identical). Then out of the
remaining 6 spots, we should choose 2 for the A’s: C(6, 2). Finally, out of the 4 remaining spots, we put the
4 Y’s there: C(4, 4) = 1. By the product rule, our answer is

C(9, 3) · C(6, 2) · C(4, 4) = (9!/(3!6!)) · (6!/(2!4!)) · (4!/(4!0!)) = 9!/(3!2!4!)

Note that we could have chosen to assign the Y’s first instead: out of 9 positions, we choose 4 to be Y:
C(9, 4). Then from the 5 remaining spots, choose where the 2 A’s go: C(5, 2), and the last three spots must
be B’s: C(3, 3) = 1. This gives us the equivalent answer

C(9, 4) · C(5, 2) · C(3, 3) = (9!/(4!5!)) · (5!/(2!3!)) · (3!/(3!0!)) = 9!/(3!2!4!)
This shows once again that there are many correct ways to count something. This type of problem also
frequently appears, and so we have a special notation (called a multinomial coefficient)
(9 choose 3, 2, 4) = 9!/(3!2!4!)
Note the order of the bottom three numbers does not matter (since the multiplication in the denominator is
commutative), and that the bottom numbers must add up to the top number.
If we have k types of objects (n total), with n1 of the first type, n2 of the second, ..., and nk of the
k-th type, then the number of distinct arrangements of the n objects is the multinomial coefficient:

(n choose n1, n2, ..., nk) = n!/(n1! n2! ··· nk!)

Above, we had k = 3 types of objects (B, A, Y) with n1 = 3 (number of B’s), n2 = 2 (number of A’s), and
n3 = 4 (number of Y’s), for an answer of (9 choose n1, n2, n3) = 9!/(3!2!4!).
Example(s)
How many ways can we arrange the letters of a 7-letter word which has 3 G’s, 2 O’s, 1 D, and 1 Y?
Solution There are n = 7 letters. There are only k = 4 distinct letters - {G, O, D, Y}.
n1 = 3 - there are 3 G’s.
n2 = 2 - there are 2 O’s.
n3 = 1 - there is 1 D.
n4 = 1 - there is 1 Y.
This gives us the number of possible arrangements:

(7 choose 3, 2, 1, 1) = 7!/(3!2!1!1!)
It is important to note that even though the 1’s are “useless” since 1! = 1, we still must write every number
on the bottom since they have to add to the top number.
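Multinomial counts are simple to verify by collecting the distinct permutations of a small word. In this sketch we use “GGGOODY” as a hypothetical stand-in (any word with 3 G’s, 2 O’s, 1 D, 1 Y works):

from itertools import permutations
from math import factorial

word = "GGGOODY"   # hypothetical word with the right letter counts
distinct = len(set(permutations(word)))
formula = factorial(7) // (factorial(3) * factorial(2) * factorial(1) * factorial(1))
print(distinct, formula)   # both print 420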
How many ways can we give 5 (indistinguishable) candies to these 3 (distinguishable) kids? Here are three
possible distributions of candy:
Notice that the second and third pictures show different possible distributions, since the kids are
distinguishable (different). Any idea on how we can tackle this problem?
The idea here is that we will count something equivalent. Let’s say there are 5 “stars” for the 5 candies
and 2 “bars” for the dividers (dividing 3 kids). For instance, this distribution of candies corresponds to this
arrangement of 5 stars and 2 bars:
Here is another example of the correspondence between a distribution of candies and the arrangement of
stars and bars:
For each candy distribution, there is exactly one corresponding way to arrange the stars and bars. Conversely,
for each arrangement of stars and bars, there is exactly one candy distribution it represents.
Hence, the number of ways to distribute 5 candies to the 3 kids is the number of arrangements of 5 stars
and 2 bars.
This is simply

C(7, 2) = C(7, 5) = 7!/(2!5!)
Amazing right? We just reduced this candy distribution problem to reordering letters!
In general, the number of ways to distribute n indistinguishable balls into k distinguishable bins is

C(n + k − 1, k − 1) = C(n + k − 1, n)

since we set up n stars for the n balls, and k − 1 bars dividing the k bins.
Example(s)
How many ways can we assign 20 students to 4 different professors? Assume the students are
indistinguishable to the professors, who only care how many students they have, and not which ones.
Solution This is actually the perfect setup for stars and bars. We have 20 stars (students) and 3 bars
(dividing the 4 professors), and so our answer is C(23, 3) = C(23, 20).
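Here is a brute-force check of ours: count the nonnegative solutions to x1 + x2 + x3 + x4 = 20 directly and compare against C(23, 3) (math.comb requires Python 3.8+).

from math import comb

count = 0
for x1 in range(21):                      # students for professor 1
    for x2 in range(21 - x1):             # professor 2
        for x3 in range(21 - x1 - x2):    # professor 3; professor 4 gets the rest
            count += 1
print(count, comb(23, 3))                 # both print 1771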
1.2.5 Exercises
1. There are 40 seats and 40 students in a classroom. Suppose that the front row contains 10 seats, and
there are 5 students who must sit in the front row in order to see the board clearly. How many seating
arrangements are possible with this restriction?
Solution: Again, there may be many correct approaches. We can first choose which 5 out of the
10 seats in the front row we want to use, and we have C(10, 5) ways of doing this. Then, assign those 5
students to these seats, for which there are 5! ways. Finally, assign the other 35 students in any way,
for 35! ways. By the product rule, there are C(10, 5) · 5! · 35! ways to do so.
2. If we roll a fair 3-sided die 11 times, what is the number of ways that we can get 4 1’s, 5 2’s, and 2 3’s?
Solution: We can write the outcomes as a sequence of length 11, each digit of which is 1, 2 or
3. Hence, the number of ways to get 4 1’s, 5 2’s, and 2 3’s, is the number of orderings of 11112222233,
which is (11 choose 4, 5, 2) = 11!/(4!5!2!).
3. These two problems are almost identical, but have drastically different approaches to them. These are
both extremely hard/tricky problems, though they may look deceivingly simple. These are probably
the two coolest problems I’ve encountered in counting, as they do have elegant solutions!
(a) How many 7-digit phone numbers are such that the numbers are strictly increasing (digits must
go up)? (e.g., 014-5689, 134-6789, etc.)
(b) How many 7-digit phone numbers are such that the numbers are monotone increasing (digits can
stay the same or go up)? (e.g., 011-5566, 134-6789, etc.) Hint: Reduce this to stars and bars.
Solution:
(a) We choose 7 out of 10 digits, which has C(10, 7) possibilities, and then once we do, there is only 1 valid
ordering (must put them in increasing order). Hence, the answer is simply C(10, 7). This question
has a deceivingly simple solution, as many students (including myself at one point) would have
started by choosing the first digit. But the choices for the next digit depend on the first digit.
And so on for the third. This leads to a complicated, nearly unsolvable mess!
(b) This is a very difficult problem to frame in terms of stars and bars. We need to map one phone
number to exactly one ordering of stars and bars, and vice versa. Consider letting the 9 bars be
an increase from one digit to the next, and 7 stars for the 7 digits. This is extremely complicated,
so we’ll give 3 examples of what we mean.
i. The phone number 011-5566 is represented as *|**||||**|**|||. We start a counter at 0; we
see a digit first (a star), so we mark down 0. Then we see a bar, which tells us to increase
our counter to 1. Then, two more digits (stars), which say to mark down two 1’s. Then, 4 bars,
which tell us to increase the count from 1 to 5. Then two stars for the next two 5’s, and a bar to
increase to 6. Then, two stars indicate to put down two 6’s. Then, we increment the count to 9 but
don’t put down any more digits.
ii. The phone number 134-6789 is represented as |*||*|*||*|*|*|*. We start a counter at 0,
and we see a bar first, so we increase the count to 1. Then a star tells us to actually write down
1 as our first digit. The two bars tell us to increase the count from 1 to 3. The star says to mark a
3 down now. Then, a bar to increase to 4. Then a star to write down 4. Two bars to increase
to 6. And so on.
iii. The stars and bars ordering ||||*|****||*||* represents the phone number 455-5579. We
start a counter at 0. We see 4 bars, so we increment to 4. The star says to mark down a 4.
Then we increment the count by 1 to 5 due to the next bar. Then, we mark 5 down 4 times (4 stars).
Then we increment the count by 2, put down a 7, and repeat to put down a 9.
Hence there is a bijection between these phone numbers and arrangements of 7 stars and 9 bars.
So the number of satisfying phone numbers is C(16, 7) = C(16, 9).
Chapter 1. Combinatorial Theory
1.3: No More Counting Please
Slides (Google Drive) Video (YouTube)
In this section, we don’t really have a nice successive ordering where one topic leads to the next as we did
earlier. This section serves as a place to put all the final miscellaneous but useful concepts in counting.
(x + y)^2 = (x + y)(x + y)
          = xx + xy + yx + yy    [FOIL]
          = x^2 + 2xy + y^2
But, let’s say that we wanted to do this for a binomial raised to some higher power, say (x + y)4 . There
would be a lot more terms, but we could use a similar approach.
But what are the terms exactly that are included in this expression? And how could we combine the
like-terms?
Notice that each term will be a mixture of x’s and y’s. In fact, each term will be of the form x^k y^(n−k) (in
this case n = 4). This is because there will be exactly n x’s or y’s in each term, so if there are k x’s, then
there must be n − k y’s. That is, we will have terms of the form x^4, x^3 y, x^2 y^2, x y^3, y^4, with most
appearing more than once.
For a specific k though, how many times does x^k y^(n−k) appear? For example, in the above case, take k = 1;
then note that xyyy = yxyy = yyxy = yyyx = x y^3, so x y^3 will appear with a coefficient of 4 in the final
simplified form (just like for (x + y)^2 the term xy appears with a coefficient of 2). Does this look familiar? It
should remind you yet again of rearranging words with duplicate letters!
Now, we can generalize this: the number of terms that simplify to x^k y^(n−k) will be equal to the number
of ways to choose exactly k of the binomials to give us x (and let the remaining n − k give us y). Alternatively,
we need to arrange k x’s and n − k y’s. To think of this in the above example with k = 1 and n = 4, we
considered which of the four binomials would give us the single x (the first, second, third, or fourth), for
a total of C(4, 1) = 4.
Let’s consider k = 2 in the above example. We want to know how many terms are equivalent to x^2 y^2. Well,
we then have xxyy = yxxy = yyxx = xyxy = yxyx = xyyx = x^2 y^2, so there are six ways, and the coefficient
on the simplified term x^2 y^2 will be C(4, 2) = 6.
Notice that we are essentially choosing which of the binomials gives us an x such that k of the n binomials
do. That is, the coefficient for x^k y^(n−k), where k ranges from 0 to n, is simply C(n, k). This is why it is also
called a binomial coefficient.
The Binomial Theorem states that for any numbers x, y and nonnegative integer n:

(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^(n−k)

This essentially states that in the expansion of the left side, the coefficient of the term with x raised to the
power of k and y raised to the power of n − k will be C(n, k), and we know this because we are considering
the number of ways to choose k of the n binomials in the expression to give us x.
This can also be proved by induction, but this is left as an exercise for the reader.
Example(s)
Calculate the coefficient of a^45 b^14 in the expansion of (4a^3 − 5b^2)^22.
Solution Let x = 4a^3 and y = −5b^2. Then, we are looking for the coefficient of x^15 y^7 (because x^15
gives us a^45 and y^7 gives us b^14), which is C(22, 15). So we have the term

C(22, 15) x^15 y^7 = C(22, 15) (4a^3)^15 (−5b^2)^7 = (−C(22, 15) · 4^15 · 5^7) a^45 b^14

and our answer is −C(22, 15) · 4^15 · 5^7.
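A short sketch of ours to evaluate this coefficient numerically (math.comb requires Python 3.8+):

from math import comb

coeff = comb(22, 15) * 4**15 * (-5) ** 7   # C(22,15) * 4^15 * (-5)^7
print(coeff)                               # a large negative integer, since (-5)^7 < 0
print(coeff == -comb(22, 15) * 4**15 * 5**7)   # True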
1.3.2 Inclusion-Exclusion
Say we did an anonymous survey where we asked whether students in CSE312 like ice cream, and found
that 43 people liked ice cream. Then we did another anonymous survey where we asked whether students in
CSE312 liked donuts, and found that 20 people liked donuts. With this information can we determine how
many people like ice cream or donuts (or both)?
Let A be the set of people who like ice cream, and B the set of people who like donuts. The sum rule from 1.1
said that, if A, B were mutually exclusive (it wasn’t possible to like both donuts and ice cream: A ∩ B = ∅),
then we could just add them up: |A ∪ B| = |A| + |B| = 43 + 20 = 63. But this is not the case, since it is
possible to like both. We can’t quite figure this out yet without knowing how many people overlapped:
the size of A ∩ B.
So, we did another anonymous survey in which we asked whether students in CSE312 like both ice cream
and donuts, and found that only 7 people like both. Now, do we have enough information to determine how
many students like ice cream or donuts?
Yes! Knowing that 43 people like ice cream and 7 people like both ice cream and donuts, we can conclude
that 36 people like ice cream but don’t like donuts. Similarly, knowing that 20 people like donuts and 7
people like both ice cream and donuts, we can conclude that 13 people like donuts but don’t like ice cream.
This leaves us with the following picture, where A is the students who like ice cream. B is the students who
like donuts (this implies |A \ B| = 7 is the number of students who like both):
|A| = 43
|B| = 20
|A ∩ B| = 7
Now, to go back to the question of how many students like either ice cream or donuts, we can just add up
the 36 people that just like ice cream, the 7 people that like both ice cream and donuts, and the 13 people
that just like donuts, and get 36 + 7 + 13 = 56. Alternatively, we could consider this as adding up the 43
people who like ice cream (including both the 36 who just like ice cream and the 7 who like both)
and the 20 people who like donuts (including the 13 who just like donuts and the 7 who like both), and then
subtracting the 7 who like both since they were counted twice. That is, 43 + 20 − 7 = 56. That leaves us
with:

|A ∪ B| = 36 + 7 + 13 = 56 = 43 + 20 − 7 = |A| + |B| − |A ∩ B|

Recall that |A ∪ B| is the students who like donuts or ice cream (the union of the two sets). In general, for
any two sets A and B:

|A ∪ B| = |A| + |B| − |A ∩ B|

For n sets, the size of the union A1 ∪ A2 ∪ ··· ∪ An is: singles − doubles + triples − quads + ...,
where singles are the sizes of all the single sets (C(n, 1) terms), doubles are the sizes of all the intersections
of two sets (C(n, 2) terms), triples are the sizes of all the intersections of three sets (C(n, 3) terms), quads are
all the intersections of four sets, and so forth.
Example(s)
How many numbers in the set [360] = {1, 2, . . . , 360} are divisible by:
1. 4, 6, and 9.
2. 4, 6 or 9.
3. neither 4, 6, nor 9.
Solution
1. This is just the multiples of lcm(4, 6, 9) = 36, of which there are 360/36 = 10.
2. Let Di be the set of numbers in [360] which are divisible by i, for i = 4, 6, 9. Hence, the number of
numbers which are divisible by 4, 6, or 9 is |D4 ∪ D6 ∪ D9|. We can apply inclusion-exclusion (singles
− doubles + triples), noting that each intersection contains exactly the multiples of the corresponding lcm:
|D4 ∪ D6 ∪ D9| = 90 + 60 + 40 − 30 − 10 − 20 + 10 = 140
(multiples of 4, 6, 9; minus multiples of 12, 36, 18; plus multiples of 36).
3. By complementary counting, the answer is 360 − |D4 ∪ D6 ∪ D9| = 360 − 140 = 220.
Many times it may be possible to avoid this ugly mess using complementary counting, but sometimes it isn’t.
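A one-pass brute-force check of ours over [360] confirms all three counts:

div_all = sum(1 for n in range(1, 361) if n % 36 == 0)   # divisible by 4, 6, AND 9
div_any = sum(1 for n in range(1, 361) if n % 4 == 0 or n % 6 == 0 or n % 9 == 0)
print(div_all, div_any, 360 - div_any)                   # 10 140 220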
The floor function ⌊x⌋ returns the largest integer ≤ x (i.e., it rounds down).
The ceiling function ⌈x⌉ returns the smallest integer ≥ x (i.e., it rounds up). Note the difference is just
whether the bracket is on top (ceiling) or bottom (floor).
Example(s)
What are ⌊2.5⌋, ⌈2.5⌉, ⌊−2.5⌋, and ⌈−2.5⌉?
Solution
⌊2.5⌋ = 2 and ⌈2.5⌉ = 3. Be careful with negative numbers: ⌊−2.5⌋ = −3 (the largest integer ≤ −2.5),
and ⌈−2.5⌉ = −2.
If there are n pigeons we want to put into k pigeonholes (where n > k), then at least one pigeonhole must
contain at least 2 pigeons.
More generally, if there are n pigeons we want to put into k pigeonholes, then at least one pigeonhole
must contain at least ⌈n/k⌉ pigeons.
This fact or rule may seem trivial to you, but the hard part of pigeonhole problems is knowing how to apply
it. See the examples below!
Example(s)
Show that there exists a number made up of only 1’s (e.g., 1111 or 11) which is divisible by 333.
Solution Consider the sequence of 334 numbers x1, x2, x3, ..., x334 where xi is the number made of exactly
i 1’s (e.g., x2 = 11, x5 = 11,111, etc.). We’ll use the notation xi = 1^i to mean i 1’s concatenated together.
The number of possible remainders when dividing by 333 is 333: {0, 1, 2, ..., 332}, so by the pigeonhole
principle, since 334 > 333, two numbers xi and xj have the same remainder (suppose i < j without loss
of generality) when divided by 333. The number xj − xi is of the form 1^(j−i) 0^i; that is, j − i 1’s followed
by i 0’s (e.g., x5 − x2 = 11111 − 11 = 11100 = 1^3 0^2). This number must be divisible by 333 because
xi ≡ xj (mod 333) implies (xj − xi) ≡ 0 (mod 333).
Now, keep deleting zeros (by dividing by 10) until there aren’t any more left - this doesn’t affect whether or
not 333 goes in, since neither 2 nor 5 divides 333. Now we’re left with a number divisible by 333 made up of
all ones (1^(j−i) to be exact)!
Note that 333 was not special - we could have used any number that wasn’t divisible by 2 nor 5.
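The pigeonhole argument only shows existence; a tiny search of ours finds the smallest such number directly:

x = 0
for i in range(1, 1000):
    x = 10 * x + 1          # x is now the number made of exactly i 1's
    if x % 333 == 0:
        print(i)            # prints 9: 111,111,111 = 333 * 333,667
        break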
Example(s)
Show that in a group of n people (who may be friends with any number of other people), two must
have the same number of friends.
Solution Each of the n people has some number of friends among the others, between 0 and n − 1. But 0
and n − 1 can’t both occur (if someone is friends with all n − 1 others, then no one can have 0 friends), so
at most n − 1 distinct values actually occur among the n people. By the pigeonhole principle, two people
must have the same number of friends.
1.3.4 Combinatorial Proofs
Suppose we want to prove the identity C(n − 1, k − 1) + C(n − 1, k) = C(n, k). One approach is algebraic:

C(n − 1, k − 1) + C(n − 1, k) = (n − 1)!/((k − 1)!(n − k)!) + (n − 1)!/(k!(n − 1 − k)!)    [def of binomial coef]
                             = ···    [lots of algebra]
                             = n!/(k!(n − k)!)
                             = C(n, k)
However, those · · · may be tedious and take a lot of algebra we don’t want to do.
So, let’s consider another approach. A combinatorial proof is one where you prove two quantities are equal
by imagining a situation/something to count. Then, you argue that the left side and right side are two
equivalent ways to count the same thing, and hence must be equal. We’ve seen earlier often how there are
multiple approaches to counting!
In this case, let’s consider the set of numbers [n] = {1, 2, ..., n}. We will argue that the LHS and RHS both
count the number of subsets of size k.
1. LHS: C(n, k) is literally the number of subsets of size k, since we just want to choose any k items out of n
(order doesn’t matter).
2. RHS: We take a slightly more convoluted approach, splitting on cases depending on whether or not
the number 1 was included in the subset.
Case 1: Our subset of size k includes the number 1. Then we need to choose k − 1 of the
remaining n − 1 numbers (n numbers excluding 1 is n − 1 numbers) to make a subset of size k which
includes 1. There are C(n − 1, k − 1) ways to do this.
Case 2: Our subset of size k does not include the number 1. Then we need to choose k
numbers from the remaining n − 1 numbers. There are C(n − 1, k) ways to do this. So, in total we have
C(n − 1, k − 1) + C(n − 1, k) possible subsets of size k.
Since the left side and right side count the same thing, they must be equal! Note that we dreamed up
this situation and you may wonder how we did - this just comes from practicing many types of counting
problems. You’ll get used to it!
Example(s)
Prove the following identities by a combinatorial argument (counting the same thing in two ways):
1. C(n, m) · C(m, k) = C(n, k) · C(n − k, m − k), for integers 0 ≤ k ≤ m ≤ n.
2. Σ_{k=0}^{n} C(n, k) = 2^n.
Solution
1. We’ll show that both sides count, from a group of n people, the number of committees of size m, and
within that committee a subcommittee of size k.
Left-hand side: We first choose m people to be on the committee from n total; there are C(n, m)
ways to do so. Then, within those m, we choose k to be on a specialized subcommittee; there are C(m, k)
ways to do so. By the product rule, the number of ways to assign these is C(n, m) · C(m, k).
Right-hand side: We first choose which k to be on the subcommittee of size k; there are C(n, k) ways
to do so. From the remaining n − k people, we choose m − k to be on the committee (but not the
subcommittee); there are C(n − k, m − k) ways to do so. By the product rule, the number of ways to
assign these is C(n, k) · C(n − k, m − k).
Since the LHS and RHS both count the same thing, they must be equal.
2. We’ll argue that both sides count the number of subsets of the set [n] = {1, 2, . . . , n}.
Left-hand side: Each element we can have in our subset or not. For the first element, we have
2 choices (in or out). For the second element, we also have 2 choices (in or out). And so on. So the
number of subsets is 2^n.
Right-hand side: The subset can be of any size ranging from 0 to n, so we have a sum. Now
how many subsets are there of size exactly k? There are C(n, k), because we choose k out of n to have in
our set (and order doesn’t matter in sets)! Hence, the number of subsets is Σ_{k=0}^{n} C(n, k).
Since the LHS and RHS both count the same thing, they must be equal.
It’s cool to note we can also prove this with the binomial theorem setting x = 1 and y = 1 - try
this out! It takes just one line!
1.3.5 Exercises
1. These problems involve using the pigeonhole principle. How many cards must you draw from a standard
52-card deck (4 suits and 13 cards of each suit) until you are guaranteed to have:
(a) A single pair? (e.g., AA, 99, JJ)
(b) Two (different) pairs? (e.g., AAKK, 9933, 44QQ)
(c) A full house (a triple and a pair)? (e.g., AAAKK, 99922, 555JJ)
(d) A straight (5 in a row, with the lowest being A,2,3,4,5 and the highest being 10,J,Q,K,A)?
(e) A flush (5 cards of the same suit)? (e.g., 5 hearts, 5 diamonds)
(f) A straight flush (5 cards which are both a straight and a flush)?
Solution:
(a) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. So the answer is 14.
(b) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. But then we could draw the other two of that pair as well to get 16 still without two pairs.
So the answer is 17.
(c) The worst that could happen is to draw all pairs (26 cards). Then the next is guaranteed to cause
a triple. So the answer is 27.
(d) The worst that could happen is to draw all the A - 4, 6 - 9, and J - K. After drawing these
11 · 4 = 44 cards, we could still fail to have a straight. Finally, getting a 5 or 10 would give us a
straight. So the answer is 45.
(e) The worst that could happen is to draw 4 of each suit (16 cards), and still not have a flush. So
the answer is 17.
(f) Same as straight, 45.
Chapter 2. Discrete Probability
2.1: Intro to Discrete Probability
Slides (Google Drive) Video (YouTube)
We’re just about to learn about the axioms (rules) of probability, and see how all that counting stuff from
chapter 1 was relevant at all. This should align with your current understanding of probability (I only assume
you might be able to tell me the probability I roll an even number on a fair six-sided die at this point), and
formalize it.
We’ll be using a lot of set theory from here on out, so review that in Chapter 0 if you need to!
2.1.1 Definitions
Definition 2.1.1: Sample Space
The sample space, denoted Ω, is the set of all possible outcomes of an experiment.
Example(s)
What is the sample space of each of the following experiments: a single coin flip, two coin flips, and the
roll of a fair six-sided die?
Solution
1. The sample space of a single coin flip is: Ω = {H, T} (heads or tails).
2. The sample space of two coin flips is: Ω = {HH, HT, TH, TT}.
3. The sample space of the roll of a die is: Ω = {1, 2, 3, 4, 5, 6}.
An event is any subset E ⊆ Ω of the sample space.
Example(s)
List out the set of outcomes making up each of the following events: getting at least one head in two coin
flips, and rolling an even number on a six-sided die.
Solution
1. Getting at least one head in two coin flips: E = {HH, HT, TH}
2. Rolling an even number: E = {2, 4, 6}
Events E and F are mutually exclusive if E ∩ F = ∅ (i.e., they can’t simultaneously happen).
Example(s)
Say E is the event of rolling an even number: E = {2, 4, 6}, and F is the event of rolling an odd
number: F = {1, 3, 5}. Are E and F mutually exclusive?
Solution Yes: E ∩ F = ∅, since no roll is both even and odd, so E and F are mutually exclusive.
Example(s)
Let’s consider another example in which our experiment is the rolling of two fair 4-sided dice,
one which is blue (D1) and one which is red (D2) (so they are distinguishable, or effectively, order
matters). We can represent each element in the sample set as an ordered pair (D1, D2), where
D1, D2 ∈ {1, 2, 3, 4} represent the respective values rolled by the blue and red dice.
The sample space Ω is the set of all possible ordered pairs of values that could be rolled by
the dice (|Ω| = 4 · 4 = 16 by the product rule). Let’s consider some events:
1. A = {(1, 1), (1, 2), (1, 3), (1, 4)}, the event that the blue die, D1, is a 1.
2. B = {(2, 4), (3, 3), (4, 2)}, the event that the sum of the two rolls is 6 (D1 + D2 = 6).
3. C = {(2, 1), (4, 2)}, the event that the value on the blue die is twice the value on the red die
(D1 = 2 · D2).
All of these events and the sample space are shown below:
Solution Now, let’s consider whether A and B are mutually exclusive. Well, they do not overlap, as we can
see that A ∩ B = ∅, so yes, they are mutually exclusive.
B and C are not mutually exclusive, since there is a case in which they can happen at the same time:
B ∩ C = {(4, 2)} ≠ ∅, so they are not mutually exclusive.
Again, to summarize, we learned that Ω was the sample space (set of all outcomes of an experiment), and
that an event E ⊆ Ω is any subset of outcomes we might care about.
The three axioms of probability are:
1. (Non-negativity) For any event E ⊆ Ω, P(E) ≥ 0.
2. (Normalization) P(Ω) = 1.
3. (Countable Additivity) If E and F are mutually exclusive, then P(E ∪ F) = P(E) + P(F).
The word “axiom” means: things that we take for granted and assume to be true without proof.
Corollaries:
1. (Complementation) P(E^C) = 1 − P(E).
2. (Monotonicity) If E ⊆ F, then P(E) ≤ P(F).
3. (Inclusion-Exclusion) P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
The word “corollary” means: results that follow almost immediately from a previous result (in this
case, the axioms).
Explanation of Axioms
1. Non-negativity is simply because we cannot consider an event to have a negative probability. It just
wouldn’t make sense. A probability of 1/6 would mean that on average, something would happen 1
out of every 6 trials. What would a probability of −1/4 even mean?
2. Normalization is based on the fact that when we run an experiment, there must be some outcome, and
all possible outcomes are in the sample space. So, we say the probability of observing some outcome
from the sample space is 1.
3. Countable additivity is because if two events are mutually exclusive, they don’t overlap at all; that is,
they don’t share any outcomes. This means that the union of them will contain the same outcomes as
each together, so the probability of their union is the sum of their individual probabilities. (This
is like the sum rule of counting.)
Explanation of Corollaries
1. Complementation is based on the fact that the sample space is all the possible outcomes. This means
that E^C = Ω \ E, so P(E^C) = 1 − P(E). (This is like complementary counting.)
2. Monotonicity is because if E is a subset of F, then all outcomes in the event E are in the event F.
This means that all the outcomes that contribute to the probability of E contribute to the probability
of F, so the probability of F is greater than or equal to that of E (since probabilities are non-negative).
3. Inclusion-Exclusion follows because if E and F have some intersection, this would be counted twice
by adding their probabilities, so we have to subtract it once to only count it once and not overcount.
(This is like inclusion-exclusion for counting.)
Proof of Corollaries. The proofs of these corollaries only depend on the 3 axioms which we assume to be
true.
1. Since E and E^C = Ω \ E are mutually exclusive,
P(E) + P(E^C) = P(E ∪ E^C)   [axiom 3]
= P(Ω)   [E ∪ E^C = Ω]
= 1   [axiom 2]
Now just subtract P(E) from both sides.
2. Since E ⊆ F, consider the sets E and F \ E. Then,
P(F) = P(E ∪ (F \ E))   [draw a picture of E inside event F]
= P(E) + P(F \ E)   [mutually exclusive, axiom 3]
≥ P(E) + 0   [since P(F \ E) ≥ 0 by axiom 1]
If Ω is a sample space such that each of the unique outcomes in Ω is equally likely, then for any event E ⊆ Ω:
P(E) = |E| / |Ω|
Proof of Equally Likely Outcomes Formula. If outcomes are equally likely, then for any outcome in the sample space ω ∈ Ω, we have P(ω) = 1/|Ω| (since there are |Ω| total outcomes). Then, if we list the |E| outcomes that make up event E, we can write
E = {ω1, ω2, ..., ω|E|}
Every set is the union of the (mutually exclusive) singleton sets containing each element (e.g., {1, 2, 3} = {1} ∪ {2} ∪ {3}), and so by countable additivity, we get
P(E) = P(⋃_{i=1}^{|E|} {ωi}) = Σ_{i=1}^{|E|} P({ωi})   [countable additivity axiom]
= Σ_{i=1}^{|E|} 1/|Ω|   [equally likely outcomes]
= |E| / |Ω|   [sum a constant |E| times]
The notation in the first line is like summation or product notation: just union all the sets {ω1} ∪ {ω2} ∪ ... ∪ {ω|E|}.
Example(s)
If we flip two fair coins independently, what is the probability we get at least one head?
Solution Since the sample space Ω = {HH, HT, TH, TT} is such that all outcomes are equally likely, and the event of getting at least one head is E = {HH, HT, TH}, we can say that
P(E) = |E| / |Ω| = 3/4
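Because the outcomes are equally likely, this is just two counting problems, which we can sanity-check by brute-force enumeration. Here is a minimal Python sketch (not part of the original notes) that lists Ω and counts the outcomes in E:

from itertools import product

omega = list(product("HT", repeat=2))  # the 4 equally likely outcomes
E = [w for w in omega if "H" in w]     # event: at least one head
print(len(E) / len(omega))             # 0.75 = 3/4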
Example(s)
Consider the example of rolling the blue and red fair 4-sided dice again (above), a blue die D1 and a red die D2. What is the probability that the two dice's rolls sum to 6?
Solution We called that event B = {(2, 4), (3, 3), (4, 2)}. What is the probability of the event B happening? Well, the 16 possible outcomes that make up all the elements of Ω are each equally likely, because each die has an equal chance of landing on any of the 4 numbers. So P(B) = |B| / |Ω| = 3/16.
2.1.4 Exercises
1. If there are 5 people named A, B, C, D, and E, and they are randomly arranged in a row (with each
ordering equally likely), what is the probability that A and B are placed next to each other?
Solution: The size of the sample space is the number of ways to arrange 5 people in a row, which is |Ω| = 5! = 120. The size of the event E is the number of ways to have A and B sit next to each other. We did a similar problem in 1.1: the answer is 2! · 4! = 48 (why?). Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = 48/120 = 2/5
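Since the orderings are equally likely, we can verify this answer by enumerating all 5! orderings; this is a quick Python sketch (not part of the original notes):

from itertools import permutations

omega = list(permutations("ABCDE"))  # all 5! = 120 equally likely orderings
# Event E: A and B are adjacent in the row.
E = [w for w in omega if abs(w.index("A") - w.index("B")) == 1]
print(len(E), len(E) / len(omega))   # 48, 0.4 = 48/120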
2. Suppose I draw 4 cards from a standard 52-card deck. What is the probability they are all aces (there
are exactly 4 aces in a deck)?
Solution: There are two ways to define our sample space, one where order matters, and one where
it doesn’t. These two approaches are equivalent.
(a) If order matters, then |Ω| = P(52, 4) = 52 · 51 · 50 · 49, the number of ways to pick 4 cards out of 52 in order. The size of the event E is the number of ways to pick all 4 aces (with order mattering), which is P(4, 4) = 4 · 3 · 2 · 1. Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = P(4, 4) / P(52, 4) = (4 · 3 · 2 · 1) / (52 · 51 · 50 · 49)
(b) If order does not matter, then |Ω| = C(52, 4), since we just care which 4 out of 52 cards we get. Then, there is only C(4, 4) = 1 way to get all 4 aces, and, since the outcomes are equally likely,
P(E) = |E| / |Ω| = C(4, 4) / C(52, 4) = (P(4, 4)/4!) / (P(52, 4)/4!) = (4 · 3 · 2 · 1) / (52 · 51 · 50 · 49)
Notice how it did not matter whether order mattered or not, but we had to be consistent! The 4!
accounting for the ordering of the 4 cards gets cancelled out :).
3. Given 3 different spades (S) and 3 different hearts (H), shuffle them. Compute P(E), where E is the event that the suits of the shuffled cards are in alternating order (e.g., SHSHSH or HSHSHS).
Solution: The size of the sample space is the number of ways to order the 6 (distinct) cards: |Ω| = 6!. The number of ways to organize the three spades is 3!, and the same for the three hearts. Once we do that, we either lead with spades or hearts, so we get 2 · (3!)² for the size of our event E. Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = 2 · (3!)² / 6! = 72/720 = 1/10
Note that all of these exercises are just counting two things! We count the size of the sample space, then
the event space and divide them. It is very important to acknowledge that we can only do this when the
outcomes are equally likely.
You can see how we can get even more fun and complicated problems - the three exercises above displayed
counting problems on the “easier side”. The reason we didn’t give “harder” problems is because computing
probability in the case of equally likely outcomes reduces to doing two counting problems (counting |E| and
|⌦|, where computing |⌦| is generally easier than computing |E|). Just use the techniques from Chapter 1
to do this!
Chapter 2. Discrete Probability
2.2: Conditional Probability
Slides (Google Drive) Video (YouTube)
Let's go back to the example of students in CSE312 liking donuts and ice cream. Recall we defined event A as liking ice cream and event B as liking donuts. Then, remember we had 36 students that only like ice cream (A ∩ B^C), 7 students that like donuts and ice cream (A ∩ B), and 13 students that only like donuts (B ∩ A^C). Let's also say that we have 14 students that don't like either (A^C ∩ B^C). Together, these four disjoint groups make up the whole sample space.
Now, what if we asked the question: what's the probability that someone likes ice cream, given that we know they like donuts? We can approach this with the knowledge that 20 of the students like donuts (13 who don't like ice cream and 7 who do). What this question is getting at is: given the knowledge that someone likes donuts, what is the chance that they also like ice cream? Well, 7 of the 20 who like donuts like ice cream, so we are left with the probability 7/20. We write this as P(A | B) (read "the probability of A given B") and in this case we have the following:
P(A | B) = 7/20
= |A ∩ B| / |B|   [|B| = 20 people like donuts, |A ∩ B| = 7 people like both]
= (|A ∩ B| / |Ω|) / (|B| / |Ω|)   [divide top and bottom by |Ω|, which is equivalent]
= P(A ∩ B) / P(B)   [if we have equally likely outcomes]
This intuition (which worked only in the special case of equally likely outcomes) leads us to the definition of conditional probability (defined whenever P(B) > 0):
P(A | B) = P(A ∩ B) / P(B)
An equivalent and useful formula we can derive (by multiplying both sides by the denominator P(B) and switching the sides of the equation) is:
P(A ∩ B) = P(A | B) P(B)
Note that, in general, P(A | B) ≠ P(B | A); this is a common misconception we can dispel with some examples. In the above example with ice cream, we showed already that P(A | B) = 7/20, but P(B | A) = 7/43 (of the 36 + 7 = 43 students who like ice cream, only 7 also like donuts), and these are not equal.
Consider another example where W is the event that you are wet and S is the event you are swimming. Then, the probability you are wet given you are swimming is P(W | S) = 1, as if you are swimming you are certainly wet. But the probability you are swimming given you are wet is P(S | W) ≠ 1, because there are numerous other reasons you could be wet that don't involve swimming (being in the rain, showering, etc.).
(Bayes' Theorem) Let A and B be events with nonzero probability. Then:
P(A | B) = P(B | A) P(A) / P(B)
Note that in the above P (A) is called the prior, which is our belief without knowing anything about
event B. P (A | B) is called the posterior, our belief after learning that event B occurred.
This theorem is important because it allows us to "reverse the conditioning"! Notice that both P(A | B) and P(B | A) appear in this equation on opposite sides. So if we know P(A) and P(B), and can more easily calculate one of P(A | B) or P(B | A), we can use Bayes' Theorem to derive the other.
Proof of Bayes Theorem. Recall the (alternate) definition of conditional probability from above:
P(A ∩ B) = P(A | B) P(B)   (2.2.6)
P(B ∩ A) = P(B | A) P(A)   (2.2.7)
But, because A ∩ B = B ∩ A (since these are the outcomes in both events A and B, and the order of intersection does not matter), P(A ∩ B) = P(B ∩ A), so (2.2.6) and (2.2.7) are equal and we have (by setting the right-hand sides equal):
P(A | B) P(B) = P(B | A) P(A)
and dividing both sides by P(B):
P(A | B) = P(B | A) P(A) / P(B)
Wow, I wish I was alive back then and had this important (and easy to prove) theorem named after me!
Example(s)
We'll investigate two slightly different questions whose answers don't seem like they should be different, but are. Suppose a family has two children (who, at birth, were each equally likely to be male or female). Let's say a telemarketer calls home and one of the two children picks up.
1. If the child who responded was male, and says “Let me get my older sibling”, what is the
probability that both children are male?
2. If the child who responded was male, and says “Let me get my other sibling”, what is the
probability that both children are male?
Solution There are four equally likely outcomes, MM, MF, FM, and FF (where M represents male and F
represents female). Let A be the event both children are male.
1. In this part, we’re given that the younger sibling is male. So we can rule out 2 of the 4 outcomes
above and we’re left with MF and MM. Out of these two, in one of these cases we get MM, and so our
desired probability is 1/2.
More formally, let this event be B, which happens with probability 2/4 (2 out of 4 equally likely outcomes). Then,
P(A | B) = P(A ∩ B) / P(B) = (1/4) / (2/4) = 1/2
since P(A ∩ B) is the probability both children are male, which happens in 1 out of 4 equally likely scenarios. This is because the older sibling's sex is independent of the younger sibling's, so knowing the younger sibling is male doesn't change the probability of the older sibling being male (which is what we computed just now).
2. In this part, we’re given that at least one sibling is male. That is, out of the 4 outcomes, we can only
rule out the FF option. Out of the remaining options MM, MF, and FM, only one has both siblings
being male. Hence, the probability desired is 1/3. You can do a similar more formal argument like we
did above!
See how a slight wording change changed the answer?
We’ll see a disease testing example later, which requires the next section first. If you test positive for a
disease, how concerned should you be? The result may surprise you!
(Partition) We say events E1, ..., En partition the sample space Ω if they are mutually exclusive (Ei ∩ Ej = ∅ whenever i ≠ j) and together cover the sample space (E1 ∪ E2 ∪ ... ∪ En = Ω).
You can see that partition is a very appropriate word here! Picture four events E1, ..., E4 that don't overlap and cover the sample space; the two events E and E^C do the same thing! This is useful when you know exactly one of a few things will happen. For example, in the chemistry example below, there might be only three teachers, and you will be assigned to exactly one of them: at most one because you can't have two teachers (mutually exclusive), and at least one because everyone is assigned some teacher (together they cover the sample space).
Now, suppose we have some event F which intersects with various events that form a partition of Ω; say F overlaps E1, E2, and E3, but not E4. Then F is composed of its intersection with each of E1, E2, and E3, and so we can split F up into smaller pieces. This means that we can write the following (the chunk F ∩ E1, plus the chunk F ∩ E2, plus the chunk F ∩ E3):
P(F) = P(F ∩ E1) + P(F ∩ E2) + P(F ∩ E3)
Note that F and E4 do not intersect, so F ∩ E4 = ∅. For completeness, we can include E4 in the above equation, because P(F ∩ E4) = 0. So, in all we have:
P(F) = P(F ∩ E1) + P(F ∩ E2) + P(F ∩ E3) + P(F ∩ E4)
(Law of Total Probability (LTP)) In general, if E1, ..., En partition Ω, then for any event F:
P(F) = P(F ∩ E1) + ... + P(F ∩ En) = Σ_{i=1}^n P(F ∩ Ei)
and, writing each term via the definition of conditional probability,
P(F) = P(F | E1) P(E1) + ... + P(F | En) P(En) = Σ_{i=1}^n P(F | Ei) P(Ei)
That is, to compute the probability of an event F overall, suppose we have n disjoint cases E1, ..., En for which we can (easily) compute the probability of F in each of these cases (P(F | Ei)). Then, take the weighted average of these probabilities, using the probabilities P(Ei) as weights (the probability of being in each case).
Example(s)
Let's consider an example in which we are trying to determine the probability that we fail chemistry. Let's call the event F failing, and consider the three events E1 for getting the Mean Teacher, E2 for getting the Nice Teacher, and E3 for getting the Hard Teacher, which partition the sample space. The following table gives the relevant probabilities:

                                         Mean Teacher E1   Nice Teacher E2   Hard Teacher E3
Probability of Teaching You, P(Ei):           6/8               1/8               1/8
Probability of Failing You, P(F | Ei):         1                 0                1/2

Solve for the probability of failing.
Solution Before doing anything, how are you liking your chances? There is a high probability (6/8) of getting
the Mean Teacher, and she will certainly fail you. Therefore, you should be pretty sad.
Now let’s do the computation. Notice that the first row sums to 1, as it must, since events E1 , E2 , E3
partition the sample space (you have exactly one of the three teachers). Using the Law of Total Probability
(LTP), we have the following:
P(F) = Σ_{i=1}^3 P(F | Ei) P(Ei) = P(F | E1) P(E1) + P(F | E2) P(E2) + P(F | E3) P(E3)
= 1 · (6/8) + 0 · (1/8) + (1/2) · (1/8) = 13/16
Notice that to get the probability of failing, what we did was: consider the probability of failing in each of the 3 cases, and take a weighted average using the probability of each case. This is exactly what the law of total probability lets us do! You might consider using the LTP when you know the probability of your desired event in each of several disjoint cases that cover the sample space.
Example(s)
Misfortune struck us and we ended up failing chemistry class. What is the probability that we had
the Hard Teacher given that we failed?
Solution First, this probability should be low intuitively, because if you failed, it was most likely due to the Mean Teacher (you are much more likely to get her, AND she has a fail rate of 100%). Start by writing out in a formula what you want to compute; in our case, it is P(E3 | F) (getting the Hard Teacher given that we failed). We know P(F | E3) and we want to solve for P(E3 | F). This is a hint to use Bayes' Theorem, since we can reverse the conditioning! Using that with the numbers from the table and the previous question:
P(E3 | F) = P(F | E3) P(E3) / P(F)   [Bayes' theorem]
= ((1/2) · (1/8)) / (13/16)
= 1/13
Let events E1, ..., En partition the sample space Ω, and let F be another event. Then:
P(E1 | F) = P(F | E1) P(E1) / P(F)   [by Bayes' theorem]
= P(F | E1) P(E1) / Σ_{i=1}^n P(F | Ei) P(Ei)   [by the law of total probability]
In particular, in the case of a simple partition of Ω into E and E^C, if E is an event with nonzero probability, then:
P(E | F) = P(F | E) P(E) / P(F)   [by Bayes' theorem]
= P(F | E) P(E) / (P(F | E) P(E) + P(F | E^C) P(E^C))   [by the law of total probability]
2.2.5 Exercises
1. Suppose the llama flu disease has become increasingly common, and now 0.1% of the population has it (1 in 1000 people). Suppose there is a test for it which is 98% accurate (e.g., 2% of the time it will give the wrong answer). Given that you tested positive, what is the probability you have the disease?
Before any computation, think about what you think the answer might be.
Solution: Let L be the event you have the llama flu, and T be the event you test positive (T^C is the event you test negative). You are asked for P(L | T). We do know P(T | L) = 0.98, because if you have the llama flu, the probability that you test positive is 98%. This gives us the hint to use Bayes' Theorem!
We get that
We get that
P(L | T) = P(T | L) P(L) / P(T)
We are given P(T | L) = 0.98 and P(L) = 0.001, but how can we get P(T), the probability of testing positive? Well, that depends on whether you have the disease or not. When you have two or more cases (L and L^C), that's a hint to use the LTP! So we can write
P(T) = P(T | L) P(L) + P(T | L^C) P(L^C)
Again, interpret this as a weighted average of the probability of testing positive whether you had llama flu, P(T | L), or not, P(T | L^C), weighting by the probability you are in each of these cases, P(L) and P(L^C). We know P(L^C) = 0.999 since P(L^C) = 1 − P(L) (complementation). But what about P(T | L^C)? This is the probability of testing positive given that you don't have llama flu, which is 0.02 or 2% (due to the 98% accuracy). Putting this all together, we get:
P(L | T) = P(T | L) P(L) / P(T)   [Bayes' theorem]
= P(T | L) P(L) / (P(T | L) P(L) + P(T | L^C) P(L^C))   [LTP]
= (0.98 · 0.001) / (0.98 · 0.001 + 0.02 · 0.999)
≈ 0.046756
Not even a 5% chance we have the disease, what a relief! But wait, how can that be? The test is so accurate, and it said you were positive? This is because the prior probability of having the disease, P(L), was so low at 0.1% (actually this is pretty high for a disease rate). If you think about it, the posterior probability we computed, P(L | T), is 47× larger than the prior probability P(L) (P(L | T)/P(L) ≈ 0.047/0.001 = 47), so the test did make it a lot more likely we had the disease after all!
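The arithmetic above is easy to get wrong by hand, so here is a small Python sketch (not part of the original notes) that mirrors the Bayes-plus-LTP computation exactly:

p_L = 0.001            # prior P(L)
p_T_given_L = 0.98     # P(T | L)
p_T_given_notL = 0.02  # P(T | L^C)

p_T = p_T_given_L * p_L + p_T_given_notL * (1 - p_L)  # LTP
print(p_T_given_L * p_L / p_T)                        # Bayes: ~0.046756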
2. Suppose we have four fair dice: one with three sides, one with four sides, one with five sides, and one with six sides (the numbering of an n-sided die is 1, 2, ..., n). We pick one of the four dice, each with equal probability, and roll the same die three times. We get all 4's. What is the probability we chose the 5-sided die to begin with?
Solution: Let Di be the event we chose the i-sided die, for i = 3, 4, 5, 6. Notice that these four events partition the sample space of which die we picked, each having probability 1/4. Let 444 denote the event that all three rolls come up 4. Then:
P(D5 | 444) = P(444 | D5) P(D5) / P(444)   [by Bayes' theorem]
= P(444 | D5) P(D5) / (P(444 | D3) P(D3) + P(444 | D4) P(D4) + P(444 | D5) P(D5) + P(444 | D6) P(D6))   [by LTP]
= ((1/5³) · (1/4)) / (0 · (1/4) + (1/4³) · (1/4) + (1/5³) · (1/4) + (1/6³) · (1/4))
= (1/125) / (1/64 + 1/125 + 1/216)
= 1728/6103 ≈ 0.2831
Note that we compute P(444 | Di) by noting there's only one outcome where we get (4, 4, 4) out of the i³ equally likely outcomes. This is true except when i = 3, where it's not possible to roll all 4's.
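We can also check this answer by simulation. The following Python sketch (not part of the original notes) repeats the experiment many times and estimates the conditional probability as a ratio of counts:

import random

trials, all_fours, from_d5 = 1_000_000, 0, 0
for _ in range(trials):
    sides = random.choice([3, 4, 5, 6])                  # pick a die uniformly
    rolls = [random.randint(1, sides) for _ in range(3)]
    if rolls == [4, 4, 4]:                               # condition on the event 444
        all_fours += 1
        from_d5 += (sides == 5)
print(from_d5 / all_fours)  # ~0.2831 = 1728/6103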
Chapter 2. Discrete Probability
2.3: Independence
Slides (Google Drive) Video (YouTube)
Now, suppose that we shuffle a standard 52-card deck and draw the top three cards. Let's define:
1. A to be the event that we get the Ace of spades as our first card.
2. B to be the event that we get the 10 of clubs as our second card.
3. C to be the event that we get the 4 of diamonds as our third card.
What is the probability that all three of these events happen? We can write this as P(A, B, C) (sometimes we use commas as an alternative to using the intersection symbol, so this is equivalent to P(A ∩ B ∩ C)). Note that this is equivalent to P(C, B, A) or P(B, C, A), since the order of intersection does not matter.
Intuitively, you might say that this probability is (1/52) · (1/51) · (1/50), and you would be correct.
1. The first factor comes from the fact that there are 52 cards that could be drawn, and only one ace of
spades. That is, we computed P (A).
2. The second factor comes from the fact that there are 51 cards after we draw the first card and only
one 10 of clubs. That is, we computed P (B | A).
3. The final factor comes from the fact that there are 50 cards left after we draw the first two and only
one 4 of diamonds. That is, we computed P (C | A, B).
To summarize, we said that
P(A, B, C) = P(A) · P(B | A) · P(C | A, B) = (1/52) · (1/51) · (1/50)
(Chain Rule) Let A1, ..., An be events with nonzero probabilities. Then:
P(A1, ..., An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ··· P(An | A1, ..., A_{n−1})
In the case of two events, A, B (this is just the alternate form of the definition of conditional probability from 2.2):
P(A, B) = P(A) P(B | A)
An easy way to remember this, is if we want to observe n events, we can observe one event at a
time, and condition on those that we’ve done thus far. And most importantly, since the order of
intersection doesn’t matter, you can actually decompose this into any of n! orderings. Make sure
you “do” one event at a time, conditioning on the intersection of ALL past events like we did above.
Proof of Chain Rule. Remember that the definition of conditional probability says P(A ∩ B) = P(A) P(B | A). We'll use this repeatedly to break down P(A1, ..., An). Sometimes it is easier to use commas, and sometimes it is easier to use the intersection sign ∩; for this proof, we'll use the intersection sign. We'll prove this for four events, and you'll see how it can be easily extended to any number of events!
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1 ∩ A2 ∩ A3) P(A4 | A1 ∩ A2 ∩ A3)
= P(A1 ∩ A2) P(A3 | A1 ∩ A2) P(A4 | A1 ∩ A2 ∩ A3)
= P(A1) P(A2 | A1) P(A3 | A1 ∩ A2) P(A4 | A1 ∩ A2 ∩ A3)
Note how we keep "chaining" and applying the definition of conditional probability repeatedly!
Example(s)
Consider the following 3-stage process. We roll a 6-sided die (numbered 1-6) and call the outcome X. Then, we roll an X-sided die (numbered 1-X) and call the outcome Y. Finally, we roll a Y-sided die (numbered 1-Y) and call the outcome Z. What is P(Z = 5)?
Solution There are only three values the triplet (X, Y, Z) could have taken on so that Z takes on the value 5: (6, 6, 5), (6, 5, 5), and (5, 5, 5). So
P(Z = 5) = P(X = 6, Y = 6, Z = 5) + P(X = 6, Y = 5, Z = 5) + P(X = 5, Y = 5, Z = 5)   [cases]
= (1/6)(1/6)(1/6) + (1/6)(1/6)(1/5) + (1/6)(1/5)(1/5)   [chain rule 3x]
How did we use the chain rule? Let's see, for example, the last term:
P(X = 5, Y = 5, Z = 5) = P(X = 5) P(Y = 5 | X = 5) P(Z = 5 | X = 5, Y = 5)
Here P(X = 5) = 1/6 because we rolled a 6-sided die, and P(Y = 5 | X = 5) = 1/5 since we rolled an X = 5-sided die. Finally, P(Z = 5 | X = 5, Y = 5) = P(Z = 5 | Y = 5) = 1/5 since we rolled a Y = 5-sided die. Note we didn't need to know X = 5 once we knew Y = 5!
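Chain-rule computations like this one are easy to verify with a quick Monte Carlo simulation; here is a sketch (not part of the original notes):

import random

def z_value():
    x = random.randint(1, 6)     # roll a 6-sided die
    y = random.randint(1, x)     # roll an X-sided die
    return random.randint(1, y)  # roll a Y-sided die, return Z

trials = 1_000_000
print(sum(z_value() == 5 for _ in range(trials)) / trials)
# ~0.0169 = 1/216 + 1/180 + 1/150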
2.3.2 Independence
Let's say we flip a fair coin 3 times independently (whatever that means): what is the probability of getting all heads? You may be inclined to say (1/2)³ = 1/8, because the probability of getting heads each time is just 1/2. However, we haven't learned such a rule to compute the joint probability P(H1 ∩ H2 ∩ H3), except the chain rule.
Using only what we've learned, we could consider equally likely outcomes. There are 2³ = 8 possible outcomes when flipping a coin three times (by the product rule), and only one of those (HHH) makes up the event we care about: H1 ∩ H2 ∩ H3. Since the outcomes are equally likely,
P(H1 ∩ H2 ∩ H3) = |H1 ∩ H2 ∩ H3| / |Ω| = |{HHH}| / 2³ = 1/8
We'd love a rule to say P(H1 ∩ H2 ∩ H3) = P(H1) · P(H2) · P(H3) = (1/2) · (1/2) · (1/2) = 1/8, and it turns out this is true when the events are independent!
But first, let’s consider the smaller case: does P (A, B) = P (A) P (B) in general? No! How do we know this
though? Well recall that by the chain rule, we know that:
P (A, B) = P (A) P (B | A)
So, unless P (B | A) = P (B) the equality does not hold. However, when this equality does hold, it is a special
case, which brings us to independence.
Events A and B are independent if any of the following equivalent statements hold:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Intuitively, what it means for P(A | B) = P(A) is that, given that we know B happened, the probability of observing A is the same as if we didn't know anything. So, event B has no influence on whether A happens, and vice versa.
What about independence of more than just two events? We call this concept “mutual independence” (but
most of the time we don’t even say the word “mutual”). You might think that for events A1 , A2 , A3 , A4 to
be (mutually) independent, by extension of the definition for two events, we would just need
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1) P(A2) P(A3) P(A4)
But it turns out, we need this property to hold for any subset of the 4 events. For example, the following must also be true (among others):
P(A1 ∩ A3) = P(A1) P(A3),   P(A2 ∩ A3 ∩ A4) = P(A2) P(A3) P(A4)
For all 2^n subsets of the n events (2^4 = 16 in our case), the probability of the intersection must simply be the product of the individual probabilities.
As you can see, it would be quite annoying to check even if three events were (mutually) independent.
Luckily, most of the time we are told to assume that several events are (mutually) independent and we get
all of those statements to be true for free. We are rarely asked to demonstrate/prove mutual independence.
We say n events A1, A2, ..., An are (mutually) independent if, for any subset I ⊆ [n] = {1, 2, ..., n}, we have
P(⋂_{i∈I} Ai) = ∏_{i∈I} P(Ai)
This is very similar to the last formula, P(A, B) = P(A) P(B), in the definition of independence for two events, just extended to multiple events. It must hold for any subset of the n events, and so this equation is actually saying 2^n equations are true!
Example(s)
Suppose we have the following network, in which each circle represents a node in the network (A, B, C, and D), and the links AB, BD, AC, and CD successfully work with probabilities p, q, r, and s, respectively. That is, for example, the probability of successful communication from A to B is p. Each link is independent of the others, though.
Now, let’s consider the question, what is the probability that A and D can successfully communicate?
Solution There are two ways in which it can communicate: (1) in the top path via B or (2) in the bottom
path via C. Let’s define the event top to be successful communication in the top path and the event bottom
to be successful communication in the bottom path. Let’s first consider the probabilities of each of these
being successful communication. For the top to be a valid path, both links AB and BD must work, so:
P(top) = P(AB ∩ BD) = P(AB) P(BD) = pq   [by independence]
Similarly:
P(bottom) = P(AC ∩ CD) = P(AC) P(CD) = rs   [by independence]
So, to calculate the probability of successful communication between A and D, we can take the union of top
and bottom (we just need at least one of the two to work), and so we have:
P(top ∪ bottom) = P(top) + P(bottom) − P(top ∩ bottom)   [by inclusion-exclusion]
= P(top) + P(bottom) − P(top) P(bottom)   [by independence]
= pq + rs − pqrs
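To double-check the formula, we can simulate the network. The link reliabilities below are arbitrary values chosen for illustration (they are not from the original notes); the sketch compares the Monte Carlo estimate against pq + rs − pqrs:

import random

p, q, r, s = 0.9, 0.8, 0.7, 0.6  # assumed link reliabilities, for illustration
trials, success = 1_000_000, 0
for _ in range(trials):
    top = random.random() < p and random.random() < q     # links AB, BD
    bottom = random.random() < r and random.random() < s  # links AC, CD
    success += (top or bottom)
print(success / trials, p*q + r*s - p*q*r*s)  # both ~0.8376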
Recall from the chain rule example earlier that P(Z = 5 | X = 5, Y = 5) = P(Z = 5 | Y = 5). This is actually another form of independence, called conditional independence! That is, given that Y = 5, the events X = 5 and Z = 5 are independent (the equation looks exactly like P(Z = 5 | X = 5) = P(Z = 5), except with extra conditioning on Y = 5 on both sides).
Events A and B are conditionally independent given an event C if any of the following equivalent
statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Recall the definition of A and B being (unconditionally) independent below:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Notice that this is very similar to the definition of independence. There is no di↵erence, except we
have just added in conditioning on C to every probability.
Example(s)
Suppose there is a coin C1 with P (head) = 0.3 and a coin C2 with P (head) = 0.9. We pick one
randomly with equal probability and will flip that coin 3 times independently. What is the probability
we get all heads?
Solution Let us call HHH the event of getting three heads, C1 the event of picking the first coin, and C2
the event of getting the second coin. Then we have the following:
P(HHH) = P(HHH | C1) P(C1) + P(HHH | C2) P(C2)   [by the law of total probability]
= (P(H | C1))³ P(C1) + (P(H | C2))³ P(C2)   [by conditional independence]
= (0.3)³ · (1/2) + (0.9)³ · (1/2) = 0.378
It is important to note that getting heads on the first and second flip are NOT independent. The probability
of heads on the second, given that we got heads on the first flip, is much higher since we are more likely
to have chosen coin C2 . However, given which coin we are flipping, the flips are conditionally independent.
Hence, we can write P(HHH | C1) = P(H | C1)³.
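A quick simulation of this two-stage experiment (a sketch, not part of the original notes) confirms the answer:

import random

def three_heads():
    p_head = random.choice([0.3, 0.9])  # pick C1 or C2 with equal probability
    return all(random.random() < p_head for _ in range(3))

trials = 1_000_000
print(sum(three_heads() for _ in range(trials)) / trials)  # ~0.378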
2.3.4 Exercises
1. Corrupted by their power, the judges running the popular game show America's Next Top Mathematician have been taking bribes from many of the contestants. During each of two episodes, a given contestant is either allowed to stay on the show or is kicked off. If the contestant has been bribing
the judges, she will be allowed to stay with probability 1. If the contestant has not been bribing
the judges, she will be allowed to stay with probability 1/3, independent of what happens in earlier
episodes. Suppose that 1/4 of the contestants have been bribing the judges. The same contestants
bribe the judges in both rounds.
(a) If you pick a random contestant, what is the probability that she is allowed to stay during the
first episode?
(b) If you pick a random contestant, what is the probability that she is allowed to stay during both
episodes?
(c) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she gets kicked off during the second episode?
(d) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she was bribing the judge?
Solution:
(a) Let Si be the event a contestant stays in the ith episode, and B be the event a contestant is
bribing the judges. Then, by the law of total probability,
P(S1) = P(S1 | B) P(B) + P(S1 | B^C) P(B^C) = 1 · (1/4) + (1/3) · (3/4) = 1/2
(b) By the LTP again, conditioning on whether or not the contestant bribes the judges (and using the fact that, given B or given B^C, the events S1 and S2 are conditionally independent):
P(S1 ∩ S2) = P(S1 ∩ S2 | B) P(B) + P(S1 ∩ S2 | B^C) P(B^C) = 1 · 1 · (1/4) + (1/3) · (1/3) · (3/4) = 1/3
Again, it's important to note that staying on the first and second episode are NOT independent. If we know she stayed on the first episode, then it is more likely she stays on the second (since she's more likely to be bribing the judges). However, conditioned on whether or not she is bribing the judges, S1 and S2 are independent.
(c) By the definition of conditional probability,
P(S2^C | S1) = P(S1 ∩ S2^C) / P(S1)
The denominator is our answer to (a), and the numerator can be computed in the same way as (b):
P(S1 ∩ S2^C) = 1 · 0 · (1/4) + (1/3) · (2/3) · (3/4) = 1/6
so P(S2^C | S1) = (1/6) / (1/2) = 1/3.
(d) By Bayes' theorem, P(B | S1) = P(S1 | B) P(B) / P(S1) = (1 · (1/4)) / (1/2) = 1/2.
2. A parallel system functions whenever at least one of its components works. Consider a parallel system
of n components, and suppose that each component works with probability p, independently of the others.
(a) What is the probability that the system functions?
(b) If the system is functioning, what is the probability that component 1 is working?
(c) If the system is functioning and component 2 is working, what is the probability that component
1 is working?
Solution:
(a) Let Ci be the event component i is functioning, for i = 1, . . . , n. Let F be the event the system
functions. Then,
P(F) = 1 − P(F^C)
= 1 − P(⋂_{i=1}^n Ci^C)   [def of parallel system: it fails only if every component fails]
= 1 − ∏_{i=1}^n P(Ci^C)   [independence]
= 1 − (1 − p)^n   [the probability any one component fails is 1 − p]
(b) By Bayes' theorem, and since the system surely functions when component 1 works (P(F | C1) = 1):
P(C1 | F) = P(F | C1) P(C1) / P(F) = 1 · p / (1 − (1 − p)^n)
(c) Given that component 2 is working, the system is guaranteed to function, so conditioning on F adds no information:
P(C1 | F, C2) = P(C1 | C2) = P(C1) = p
by independence of the components.
Chapter 3. Discrete Random Variables
3.1: Random Variables
Slides (Google Drive) Video (YouTube)
Suppose we flip a fair coin twice, independently. The sample space of this experiment is:
Ω = {HH, HT, TH, TT}
Sometimes, though, we don't care about the order (HT vs TH), but just the fact that we got one head and one tail. So we can define a random variable as a numeric function of the outcome.
For example, we can define X to be the number of heads in the two independent flips of a fair coin. Then X is a function, X : Ω → ℝ, which takes outcomes ω ∈ Ω and maps them to a number. For example, for the outcome HH, we have X(HH) = 2, since there are two heads. See the rest below!
X(HH) = 2
X(HT) = 1
X(TH) = 1
X(TT) = 0
Suppose we conduct an experiment with sample space Ω. A random variable (rv) is a numeric function of the outcome, X : Ω → ℝ. That is, it maps outcomes ω ∈ Ω to numbers: ω ↦ X(ω). The set of possible values X can take on is its range/support, denoted ΩX.
If ΩX is finite or countably infinite (typically integers or a subset), X is a discrete random variable (drv). Else, if ΩX is uncountably large (the size of the real numbers), X is a continuous random variable.
Example(s)
Below are some descriptions of random variables. Find their ranges and classify each as a discrete random variable (DRV) or continuous random variable (CRV):
• X, the number of heads in n independent flips of a fair coin.
• N, the number of people born.
• F, the number of independent flips of a fair coin up to and including the first head.
• B, the number of seconds you wait for the next bus.
• C, the temperature (in degrees Celsius) of a glass of liquid water.
• The range of X is ΩX = {0, 1, ..., n}, because there could be anywhere from 0 to n heads flipped. It is a discrete random variable because there are finitely many (n + 1) values that it takes on.
• The range of N is ⌦N = {0, 1, 2 . . . } because there is no upper bound on the number of people that
can be born. This is countably infinite as it is a subset of all the integers, so it is a discrete random
variable.
• The range of F is ⌦F = {1, 2, . . . } because it will take at least 1 flip to flip a head or it could always
be tails and never flip a head (although the chance is low). This is still countable as a subset of all the
integers, so it is a discrete random variable.
• The range of B is ΩB = [0, ∞), as there could be partial seconds waited, and it could be anywhere from 0 seconds to a bus never coming. This is a continuous random variable because there are uncountably many values in this range.
• The range of C is ⌦C = (0, 100) because the temperature can be any real number in this range. It
cannot be 0 or below because that would be frozen (ice), nor can it be 100 or above because this would
be boiling (steam). This is a continuous random variable.
Returning to the two-coin-flip example, where X is the number of heads, each value of X has a probability:
pX(0) = P(X = 0) = 1/4,   pX(1) = P(X = 1) = 2/4,   pX(2) = P(X = 2) = 1/4
this is because the number of outcomes for X = 0 is 1 of the 4, the number of outcomes for X = 1 is 2 of the 4, and the number of outcomes for X = 2 is 1 of the 4.
The probability mass function (PMF) of a discrete random variable X assigns probabilities to the possible values of the random variable. That is, pX : ΩX → [0, 1], where:
pX(k) = P(X = k)
Note that the events {X = k} for k ∈ ΩX form a partition of Ω, since each outcome ω ∈ Ω is mapped to exactly one number. Hence,
Σ_{z∈ΩX} pX(z) = 1
Notice here the only thing consistent is pX, as it's the PMF of X. The value inside is a dummy variable, just like we can write f(x) = x² or f(t) = t². To reinforce this, I will constantly use different letters for dummy variables.
3.1.3 Expectation
We have this idea of a random variable, which is actually neither random nor a variable (it's a deterministic function X : Ω → ΩX). However, the way I like to think about it is: it is a random quantity which we do not know the value of yet. You might want to know what you might expect it to equal on average. For example, X could be the random variable which represents the number of babies born in Seattle per day. On average, X might be equal to 250, and we would write that its average/mean/expectation/expected value is E[X] = 250.
Let’s go back to the coin example though to define expectation. Your intuition might tell you that the
expected number of heads in 2 flips of a fair coin would be 1 (you would be correct).
Since X was the random variable defined to be the number of heads in 2 flips of a fair coin, we denote this
E [X]. Think of this as the average value of X.
More specifically, imagine if we repeated the two-coin-flip experiment 4 times. Then we would "expect" to get HH, HT, TH, and TT each once. Then, we can divide the total number of heads by the number of trials (4) to get 1:
(2 + 1 + 1 + 0)/4 = 2 · (1/4) + 1 · (1/4) + 1 · (1/4) + 0 · (1/4) = 1
Notice that:
2 · (1/4) + 1 · (1/4) + 1 · (1/4) + 0 · (1/4) = X(HH)P(HH) + X(HT)P(HT) + X(TH)P(TH) + X(TT)P(TT)
= Σ_{ω∈Ω} X(ω) P(ω)
This is the sum of the random variable's value for each outcome multiplied by the probability of that outcome (a weighted average).
Another way of writing this is by multiplying every value that X takes on (in its range) with the probability of that value occurring (the PMF). Notice that below is the same exact sum, but it groups the common values together (since X(HT) = X(TH) = 1). That is:
2 · (1/4) + 1 · (1/4 + 1/4) + 0 · (1/4) = 2 · (1/4) + 1 · (2/4) + 0 · (1/4) = Σ_{k∈ΩX} k · pX(k)
This leads us to the definition. The expected value of a discrete random variable X is
E[X] = Σ_{ω∈Ω} X(ω) P(ω)
or equivalently,
E[X] = Σ_{k∈ΩX} k · pX(k)
The interpretation is that we take an average of the possible values, but weighted by their probabilities.
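Since a PMF is just a finite (or countable) table of values and probabilities, both formulas are one-liners in code. Here is a small Python sketch (not part of the original notes) for the two-coin-flip example:

from itertools import product
from collections import Counter

omega = list(product("HT", repeat=2))          # 4 equally likely outcomes
counts = Counter(w.count("H") for w in omega)  # group outcomes by value of X
p_X = {k: c / len(omega) for k, c in counts.items()}
print(p_X)                                     # e.g. {2: 0.25, 1: 0.5, 0: 0.25}
print(sum(k * p for k, p in p_X.items()))      # E[X] = 1.0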
3.1.4 Exercises
1. Let X be the value of a single roll of a fair six-sided die. What is the range ΩX, the PMF pX(k), and the expectation E[X]?
Solution: The range is ΩX = {1, 2, 3, 4, 5, 6}, and each value is equally likely, so pX(k) = 1/6 for k ∈ ΩX. The expectation is
E[X] = Σ_{k∈ΩX} k · pX(k) = (1 + 2 + 3 + 4 + 5 + 6) · (1/6) = 3.5
This kind of makes sense right? You expect the "middle number" between 1 and 6, which is 3.5.
2. Suppose at time t = 0, a frog starts on a 1-dimensional number line at the origin 0. At each step, the frog moves independently: left with probability 1/10, and right with probability 9/10. Let X be the position of the frog after 2 time steps. What is the range ΩX, the PMF pX(k), and the expectation E[X]?
Solution: The range is ΩX = {−2, 0, 2}. To find the PMF, we find the probabilities of X taking on each of those three values.
(a) For X to equal −2, we have to move left both times, which happens with probability (1/10) · (1/10) = 1/100 by independence of the moves.
(b) For X to equal 2, we have to move right both times, which happens with probability (9/10) · (9/10) = 81/100 by independence of the moves.
(c) Finally, for X to equal 0, we have to take opposite moves: either LR or RL, which happens with probability 2 · (1/10) · (9/10) = 18/100. Alternatively, the easier way is to note that these three probabilities sum to 1, so P(X = 0) = 1 − P(X = −2) − P(X = 2) = 1 − 1/100 − 81/100 = 18/100.
The expectation is
E[X] = Σ_{k∈ΩX} k · pX(k) = −2 · (1/100) + 0 · (18/100) + 2 · (81/100) = 1.6
You might have been able to guess this, but how? At each time step you "expect" to move to the right by 9/10 − 1/10 = 0.8. So after two steps, you would expect to be at 1.6. We'll formalize this approach more in the next chapter!
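For intuition, here is a quick simulation of the frog's two steps (a sketch, not part of the original notes); the empirical average lands near 1.6:

import random

def frog_after_two_steps():
    # each step: -1 with probability 1/10, +1 with probability 9/10
    return sum(-1 if random.random() < 0.1 else 1 for _ in range(2))

trials = 1_000_000
print(sum(frog_after_two_steps() for _ in range(trials)) / trials)  # ~1.6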
3. Let X be the number of independent coin flips up to and including our first head, where P (head) = p.
What is the range ⌦X , the PMF pX (k), and the expectation E [X]?
Solution: The range is ΩX = {1, 2, 3, ...}, since it could theoretically take any number of flips. The PMF is
pX(k) = (1 − p)^{k−1} p,   k ∈ ΩX
(a) P(X = 1) is the probability we get heads (for the first time) on our first try, which is just p.
(b) P(X = 2) is the probability we get heads (for the first time) on our second try, which is (1 − p)p, since we had to get a tails first.
(c) P(X = k) is the probability we get heads (for the first time) on our k-th try, which is (1 − p)^{k−1} p, since we had to get all tails on the first k − 1 tries (otherwise, our first head would have been earlier).
The expectation is pretty complicated and uses a calculus trick, so don't worry about it too much. Just understand the first two lines, which are the setup! But before that, what do you think it should be? For example, if p = 1/10, how many flips do you think it would take until our first head? Possibly 10? And if p = 1/7, maybe 7? So it seems like our guess will be E[X] = 1/p. It turns out this intuition is actually correct!
E[X] = Σ_{k∈ΩX} k · pX(k)   [def of expectation]
= Σ_{k=1}^∞ k(1 − p)^{k−1} p
= p Σ_{k=1}^∞ k(1 − p)^{k−1}   [p is a constant with respect to k]
= p Σ_{k=1}^∞ (− d/dp (1 − p)^k)   [since d/dy y^k = k y^{k−1}, we have d/dp (1 − p)^k = −k(1 − p)^{k−1}]
= −p d/dp (Σ_{k=1}^∞ (1 − p)^k)   [swap sum and derivative]
= −p d/dp (1 / (1 − (1 − p)))   [geometric series formula: Σ_{i=0}^∞ r^i = 1/(1 − r); the missing i = 0 term is a constant, which vanishes under the derivative]
= −p d/dp (1/p)
= −p · (−1/p²)
= 1/p
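If the calculus is unconvincing, a simulation (a sketch, not part of the original notes) agrees with E[X] = 1/p:

import random

def flips_until_first_head(p):
    count = 1
    while random.random() >= p:  # tails, so flip again
        count += 1
    return count

p, trials = 0.1, 100_000
print(sum(flips_until_first_head(p) for _ in range(trials)) / trials)  # ~10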
Chapter 3. Discrete Random Variables
3.2: More on Expectation
Slides (Google Drive) Video (YouTube)
Let’s say that you and your friend sell fish for a living. Every day, you catch X fish, with E [X] = 3 and
your friend catches Y fish, with E [Y ] = 7. How many fish do the two of you bring in (Z = X + Y ) on an
average day? You might guess 3 + 7 = 10. This is the formula you just guessed:
E [Z] = E [X + Y ] = E [X] + E [Y ] = 3 + 7 = 10
This property turns out to be true! Furthermore, let’s say that you can sell each fish for $5 at a store, but
you need to pay $20 in rent for the storefront. How much profit do you expect to make? The profit formula
would be 5Z − 20: $5 times the number of total fish, minus $20. You might guess 5 · 10 − 20 = 30, and you
would be right once again! This is the formula you just guessed:
E[5Z − 20] = 5 E[Z] − 20 = 5 · 10 − 20 = 30
These guesses are exactly linearity of expectation. Let X, Y be random variables and a, b, c be scalars. Then:
E[X + Y] = E[X] + E[Y]
and
E[aX + b] = a E[X] + b
Combining these, it also follows that:
E[aX + bY + c] = a E[X] + b E[Y] + c
Proof of Linearity of Expectation. Note that X and Y are functions (since random variables are functions),
so X + Y is a function that is the sum of the outputs of each of the functions. We have the following (in the first equation, (X + Y)(ω) is the function X + Y applied to ω, which is equal to X(ω) + Y(ω); it is not a product):
E[X + Y] = Σ_{ω∈Ω} (X + Y)(ω) · P(ω)   [def of expectation for the rv X + Y]
= Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)   [def of sum of functions]
= Σ_{ω∈Ω} X(ω) · P(ω) + Σ_{ω∈Ω} Y(ω) · P(ω)   [property of summation]
= E[X] + E[Y]   [def of expectation of X and Y]
For the second property, note that aX + b is also a random variable and hence a function (e.g., if f(x) = sin(1/x), then (2f − 5)(x) = 2f(x) − 5 = 2 sin(1/x) − 5).
E[aX + b] = Σ_{ω∈Ω} (aX + b)(ω) · P(ω)   [def of expectation]
= Σ_{ω∈Ω} (aX(ω) + b) · P(ω)   [def of the function aX + b]
= Σ_{ω∈Ω} aX(ω) · P(ω) + Σ_{ω∈Ω} b · P(ω)   [property of summation]
= a Σ_{ω∈Ω} X(ω) · P(ω) + b Σ_{ω∈Ω} P(ω)   [property of summation]
= a E[X] + b   [def of E[X], and Σ_{ω∈Ω} P(ω) = 1]
For the last property, we get to assume the first two that we proved already:
E[aX + bY + c] = E[aX] + E[bY + c]   [first property]
= a E[X] + b E[Y] + c   [second property, applied twice]
Again, you may think a result like this is “trivial” or “obvious”, but we’ll see the true power of linearity
of expectation through examples. It is one of the most important ideas that you will continue to use (and
probably take for granted), even when studying some of the most complex topics in probability theory.
Example(s)
Suppose a frog starts at position 0 on a number line. At each time step, independently of the others, it moves 1 unit left with probability pL, stays put with probability pS, and moves 1 unit right with probability pR (where pL + pS + pR = 1). Let X be the frog's position after 2 time steps. What is E[X]?
Brute Force Solution: When dealing with any random variable, the first thing you should do is identify its range. The frog must end up in one of these positions, since it can move at most 1 to the left and 1 to the right at each step:
ΩX = {−2, −1, 0, +1, +2}
So we need to compute 5 values: the probability of each of these. Let's start with the easier ones. The only way to end up at −2 is if the frog moves left at both steps, which happens with probability pL · pL = pL², so pX(−2) = pL². The only reason we can multiply them is because of our independence assumption. Similarly, pX(2) = pR · pR = pR².
To get to −1, there are two possibilities: first going left and staying (pL · pS), or first staying and then going left (pS · pL). Adding these disjoint cases gives pX(−1) = 2 pL pS. Again, we can only multiply due to independence. Similarly, pX(1) = 2 pR pS.
Finally, to compute pX(0), we have two options. One is considering all the possibilities (there are three: left right, right left, or stay stay) and adding them up, which gives 2 pL pR + pS². Alternatively and equivalently, since you know the probabilities of the other four values (pX(−2), pX(2), pX(−1), pX(1)), the last one pX(0) must be 1 minus the other four, since probabilities have to sum to 1! This is an often useful and clever trick: solving for all but one of the probabilities actually gives you the last one!
In summary, we would write the PMF as:
pX(k) =
  pL²,             k = −2   (left, left)
  2 pL pS,         k = −1   (left and stay, or stay and left)
  2 pL pR + pS²,   k = 0    (right left, or left right, or stay stay)
  2 pR pS,         k = +1   (right and stay, or stay and right)
  pR²,             k = +2   (right, right)
Then to solve for the expectation, we just multiply each value by its probability mass and take the sum:
E[X] = Σ_{k∈ΩX} k · pX(k)   [def of expectation]
= (−2) · pL² + (−1) · 2 pL pS + 0 · (2 pL pR + pS²) + 1 · 2 pR pS + 2 · pR²   [plug in our values]
= 2(pR − pL)   [lots of messy algebra]
The last step of algebra is not important - once you get to more advanced mathematics (like this
text), getting the second-to-last formula is sufficient. Everything else is algebra which you could do,
or use a computer to do, and so we will omit the useless calculations.
This was quite tedious already; what if instead you were to find the expected location after 100 steps?
Then, this method would be completely ridiculous: finding ΩX = {−100, −99, ..., +99, +100} and
their 201 probabilities. Since you know the frog always moves with the same probabilities though,
maybe we can do something more clever!
Linearity Solution:
Let X1 , X2 be the distance the frog travels at time steps 1,2 respectively.
Important Observation: X = X1 + X2 , since your location after 2 time steps is the sum of the
displacement of the first time step and the second time step. Therefore, ⌦X1 = ⌦X2 = { 1, 0, +1}.
They have the same simple PMF of:
8
>
< pL k = 1
pXi (k) = pS k = 0
>
:
pR k = 1
Which method is easier? Maybe in this case it is debatable, but if we change the time steps from 2 to
100 or 1000, the brute force solution is entirely infeasible, and the linearity solution will basically be
the same amount of work! You could say that X1 , . . . , X100 is the displacement at each of 100 time
steps, and hence by linearity:
" 100 # 100 100
X X X
E [X] = E Xi = E [Xi ] = (pR pL ) = 100(pR pL )
i=1 i=1 i=1
Hopefully now you can come to appreciate more how powerful LoE truly is! We’ll see more examples
in the next section as well as at the end of this section.
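As a sanity check on the 100-step claim, here is a short simulation (a sketch, not part of the original notes), with arbitrary illustrative values of pL, pS, pR:

import random

pL, pS, pR = 0.2, 0.3, 0.5  # assumed move probabilities, for illustration

def position_after(steps):
    pos = 0
    for _ in range(steps):
        u = random.random()
        pos += -1 if u < pL else (0 if u < pL + pS else 1)
    return pos

trials = 100_000
avg = sum(position_after(100) for _ in range(trials)) / trials
print(avg, 100 * (pR - pL))  # both ~30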
You might hope that E[g(X)] = g(E[X]) for a function g of a random variable, but this is actually almost never true! Let's see if we can't derive a nice formula for E[g(X)] for any function g, linear or not.
Suppose we are flipping 2 coins again. Let X be the number of heads in two independent flips of a fair coin. Recall the range, PMF, and expectation (again, I'm using the dummy letter d to emphasize that pX is the PMF of X):
ΩX = {0, 1, 2},
pX(d) =
  1/4,   d = 0
  1/2,   d = 1
  1/4,   d = 2
E[X] = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1
Let g be the cubing function; i.e., g(t) = t³. Let Y = g(X) = X³; what does this mean? It literally means the cubed number of heads! Let's try to compute E[Y] = E[X³], the expected cubed number of heads. We first find its range and PMF. Based on the range of X, we can calculate the range of Y to be:
ΩY = {0, 1, 8}
since if we get 0 heads, the cubed number of heads is 0³ = 0; if we get 1 head, the cubed number of heads is 1³ = 1; and if we get 2 heads, the cubed number of heads is 2³ = 8.
Now to find the PMF of Y = X³. (Again, below I use the notation pY to denote the probability mass function of Y = X³; z is a dummy variable which could be any letter.)
pY(z) =
  1/4,   z = 0
  1/2,   z = 1
  1/4,   z = 8
since there is a 1/4 chance of getting 0 cubed heads (the outcome TT), 1/2 chance of getting 1 cubed heads
(the outcomes HT or TH), and a 1/4 chance of getting 8 cubed heads (the outcome HH).
E[X³] = E[Y] = 0 · (1/4) + 1 · (1/2) + 8 · (1/4) = 2.5
Is there an easier way to compute E[X³] = E[Y] without going through the trouble of writing out pY? Yes! Since we know X's PMF already, why should we have to find the PMF of Y = g(X)? Note this formula below is the same formula as above, rewritten so you can observe something:
E[X³] = 0³ · (1/4) + 1³ · (1/2) + 2³ · (1/4) = 2.5
In fact:
E[X³] = Σ_{b∈ΩX} b³ pX(b)
That is, we can apply the function to each value in ΩX, and then take the weighted average! We can generalize such that for any function g : ΩX → ℝ, we have:
E[g(X)] = Σ_{b∈ΩX} g(b) pX(b)
Caveat: It is worth noting that 2.5 = E[X³] ≠ (E[X])³ = 1. You cannot just say E[g(X)] = g(E[X]), as we just showed!
(Law of the Unconscious Statistician (LOTUS)) Let X be a discrete random variable with range ΩX, and let g : D → ℝ be a function defined at least over ΩX (ΩX ⊆ D). Then
E[g(X)] = Σ_{b∈ΩX} g(b) pX(b)
Note that in general, E[g(X)] ≠ g(E[X]). For example, E[X²] ≠ (E[X])², and E[log(X)] ≠ log(E[X]).
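LOTUS is mechanical enough to state in two lines of code. This sketch (not part of the original notes) computes E[X³] directly from pX and contrasts it with (E[X])³:

p_X = {0: 1/4, 1: 1/2, 2: 1/4}  # PMF of the number of heads in two flips

E_X = sum(k * p for k, p in p_X.items())           # 1.0
E_X_cubed = sum(k**3 * p for k, p in p_X.items())  # 2.5, by LOTUS
print(E_X_cubed, E_X**3)                           # 2.5 vs 1.0: not equal!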
Before we formally prove this, it will help if we have some intuition for each step. As an example, let X have range ΩX = {−1, 0, 1} and PMF
pX(k) =
  3/12,   k = −1
  5/12,   k = 0
  4/12,   k = 1
Notice that Y = X² has range ΩY = {g(x) : x ∈ ΩX} = {(−1)², 0², 1²} = {0, 1} and the following PMF:
pY(k) =
  3/12 + 4/12,   k = 1
  5/12,          k = 0
Note that pY(1) = P(X = −1) + P(X = 1), because {−1, 1} = {x : x² = 1}. The crux of the LOTUS proof depends on this fact. We just group things together and sum!
Proof of LOTUS. The proof isn't too complicated, but the notation is pretty tricky and may be an impediment to your understanding, so focus on understanding the setup in the next few lines. Let Y = g(X); then, for any y ∈ ΩY,
pY(y) = Σ_{x∈ΩX : g(x)=y} pX(x)
That is, the total probability that Y = y is the sum of the probabilities over all x ∈ ΩX where g(x) = y (this is like saying P(Y = 1) = P(X = −1) + P(X = 1) because {x ∈ ΩX : x² = 1} = {−1, 1}).
E[g(X)] = E[Y]   [Y = g(X)]
= Σ_{y∈ΩY} y · pY(y)   [def of expectation]
= Σ_{y∈ΩY} y Σ_{x∈ΩX : g(x)=y} pX(x)   [above substitution]
= Σ_{y∈ΩY} Σ_{x∈ΩX : g(x)=y} y · pX(x)   [move y into the inner sum]
= Σ_{y∈ΩY} Σ_{x∈ΩX : g(x)=y} g(x) pX(x)   [y = g(x) in the inner sum]
= Σ_{x∈ΩX} g(x) pX(x)   [the double sum is the same as summing over all x]
3.2.3 Exercises
1. Let S be the sum of three rolls of a fair 6-sided die. What is E[S]?
Solution: Let X, Y, Z be the first, second, and third roll, respectively. Then S = X + Y + Z. We showed in the first exercise of 3.1 that E[X] = E[Y] = E[Z] = 3.5, so by LoE,
E[S] = E[X] + E[Y] + E[Z] = 3 · 3.5 = 10.5
Alternatively, imagine if we didn't have this theorem. We would find the range of S, which is ΩS = {3, 4, ..., 18}, and find its PMF. What a nightmare!
2. Blind LOTUS Practice: This will all seem useless, but I promise we'll need this in the future. Let X have PMF
pX(k) =
  3/12,   k = 5
  5/12,   k = 2
  4/12,   k = 1
(a) Compute E[X²].
(b) Compute E[log(X)].
(c) Compute E[e^{sin(X)}].
Solution: LOTUS says that E[g(X)] = Σ_{k∈ΩX} g(k) pX(k). That is,
(a) E[X²] = Σ_{k∈ΩX} k² pX(k) = 5² · (3/12) + 2² · (5/12) + 1² · (4/12)
(b) E[log X] = Σ_{k∈ΩX} log(k) · pX(k) = log(5) · (3/12) + log(2) · (5/12) + log(1) · (4/12)
(c) E[e^{sin(X)}] = Σ_{k∈ΩX} e^{sin(k)} pX(k) = e^{sin(5)} · (3/12) + e^{sin(2)} · (5/12) + e^{sin(1)} · (4/12)
Chapter 3. Discrete Random Variables
3.3: Variance
Slides (Google Drive) Video (YouTube)
Suppose there are 7 mermaids in the sea, and imagine a table that lists each mermaid and the color of her hair. Each column in the third row of the table is a variable, Xi, that is 1 if the i-th mermaid has red hair and 0 otherwise. We call these sorts of variables indicator variables because they are either 1 or 0, and their values indicate the truth of a boolean (red hair or not).
Let the variable X represent how many of the 7 mermaids have red hair. If I only gave you this third row
(X1 , X2 , . . . , X7 of 1’s and 0’s), how could you compute X?
Well, you would add them all up! X = X1 + X2 + ... + X7 = 3. So, there are 3 mermaids in the sea that have red hair. This might seem like a trivial result, but let's go over a more complicated example to illustrate the usefulness of indicator random variables!
Example(s)
Suppose n people go to a party and leave their hat with the hat-check person. At the end of the
party, she returns hats randomly and uniformly because she does not care about her job. Let X be
the number of people who get their original hat back. What is E [X]?
Solution Your first instinct might be to approach this problem with brute force. Such an approach would involve enumerating the range, ΩX = {0, 1, 2, ..., n − 2, n} (all the integers from 0 to n except n − 1, since if n − 1 people have their own hats, so does the last person), and computing the probability mass function for each of its elements. However, this approach will get very complicated (give it a shot). So, let's use our new friend, linearity of expectation.
Since the hats are returned uniformly at random, any particular person gets their own hat back with probability 1/n.
Let's use linearity with indicator random variables! For i = 1, ..., n, let
Xi = 1 if the i-th person got their hat back, and Xi = 0 otherwise.
Then the total number of people who get their hat back is X = Σ_{i=1}^n Xi. (Why?)
The expected value of each individual indicator random variable can be found as follows, since it can only take on the values 0 and 1:
E[Xi] = 1 · P(Xi = 1) + 0 · P(Xi = 0) = P(Xi = 1) = P(i-th person got their hat back) = 1/n
From here, we will use linearity of expectation:
E[X] = E[Σ_{i=1}^n Xi]
= Σ_{i=1}^n E[Xi]   [linearity of expectation]
= Σ_{i=1}^n 1/n
= n · (1/n)
= 1
So, the expected number of people to get their hats back is 1 (doesn’t even depend on n)! It is worth noting
that these indicator random variables are not “independent” (we’ll define this formally later). One of the
reasons why is because if we know that a particular person did not get their own hat back, then the original
owner of that hat will have a probability of 0 that they get that hat back.
If asked only about the expectation of a random variable X (and not its PMF), then you may be
able to write X as the sum of possibly dependent indicator random variables, and apply linearity
of expectation. This technique is used when X is counting something (the number of people
who get their hat back). Finding the PMF for this random variable is extremely complicated,
and linearity makes computing the expectation easy (or at least easier than directly finding the PMF).
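For the hat-check problem, a simulation (a sketch, not part of the original notes) shows the average number of fixed points of a random permutation hovering around 1, for any n:

import random

def people_with_own_hat(n):
    hats = list(range(n))
    random.shuffle(hats)  # hats returned uniformly at random
    return sum(i == hat for i, hat in enumerate(hats))

n, trials = 10, 100_000
print(sum(people_with_own_hat(n) for _ in range(trials)) / trials)  # ~1.0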
Example(s)
Suppose we flip a coin n = 100 times independently, where the probability of getting a head on each
flip is p = 0.23. What is the expected number of heads we get? Before doing any computation, what
do you think it might be?
Solution You might expect np = 100 · 0.23 = 23 heads, and you would be absolutely correct! But we do need
to prove/show this.
Let X be the number of heads total, so ΩX = {0, 1, 2, ..., 100}. The "normal" approach might be to try to
find this PMF, which could be a bit complicated (we’ll actually see this in the next section)! But let’s try
to use what we just learned instead, and define indicators.
For i = 1, 2, ..., 100, let Xi = 1 if the i-th flip is heads, and Xi = 0 otherwise. Then, X = Σ_{i=1}^{100} Xi is the total number of heads (why?). To use linearity, we need to find E[Xi].
We showed earlier that
E[Xi] = P(Xi = 1) = p = 0.23
and so
E[X] = E[Σ_{i=1}^{100} Xi]   [def of X]
= Σ_{i=1}^{100} E[Xi]   [linearity of expectation]
= Σ_{i=1}^{100} 0.23
= 100 · 0.23 = 23
3.3.2 Variance
We’ve talked about the expectation (average/mean) of a random variable, and some approaches to computing
this quantity. This provides a nice “summarization” of a random variable, as something we often want to
know about it (sometimes even in place of its PMF). But we might want to know another summary quantity:
how “variable” the random variable is, or how much it deviates from its mean. This is called the variance
of a random variable, and we’ll start with a motivating example below!
Consider the following two games. In both games we flip a fair coin. In Game 1, if a heads is flipped you
pay me $1, and if a tails is flipped I pay you $1. In Game 2, if a heads is flipped you pay me $1000, and if a
tails is flipped I pay you $1000.
Both games are fair, in the sense that the expected value of playing either game is 0:
E[G1] = −1 · (1/2) + 1 · (1/2) = 0 = −1000 · (1/2) + 1000 · (1/2) = E[G2]
Which game would you rather play? Maybe the adrenaline junkies among us would be willing to risk it all
on Game 2, but I think most of us would feel better playing Game 1. As shown above, there is no difference in the expected value of playing these two games, so we need another metric to explain why Game 1 feels safer than Game 2.
We can measure this by calculating how far away a random variable is from its mean, on average. The quantity X − E[X] is the difference between an rv and its mean, but we want a distance, a positive value. So we will look at the squared difference (X − E[X])² instead (another option would have been the absolute difference |X − E[X]|, but someone chose the squared one instead). This is still a random variable (a nonnegative one, since it is squared), and so to get a number (the average distance from the mean), we take the expectation of this new rv, E[(X − E[X])²]. This is called the variance of the original random variable. The definition goes as follows:
(Variance) The variance of a random variable X is
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
The variance is always nonnegative, since we take the expectation of a nonnegative random variable (X − E[X])². The first equality is the definition of variance, and the second equality is a more useful identity for doing computation, which follows from linearity (E[X] is just a constant):
E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²] = E[X²] − 2 E[X] E[X] + (E[X])² = E[X²] − (E[X])²
There is one problem though - if X is the height of someone in feet for example, then the average E[X] is also in units of feet, but the variance is in terms of square feet (since we square X). We'd like to say something like: the height of adults is generally 5.5 feet plus or minus 0.3 feet. To correct for this, we define the standard deviation to be the square root of the variance, σX = √Var(X), which "undoes" the squaring and returns our units to those of X.
We had something nice happen for the random variable aX + b when computing its expectation: E[aX + b] = aE[X] + b, called linearity of expectation. Is there a similar nice property for the variance as well? It turns out that Var(aX + b) = a²Var(X).
Before proving this, let's think about and try to understand why a came out squared, and what happened
to the b. The reason a is squared is because variance involved squaring the random variable, so the a had
to come out squared. It might not be a great intuitive reason, but we’ll prove it below algebraically. The
second (b disappearing) has a nice intuition behind it. Which of the two distributions (random variables)
below do you think should have higher variance?
You might agree with me that they have the same variance! Why?
The idea behind variance is that it measures the “spread” of the values that a random variable can take on.
The two graphs of random variables (distributions) above have the same “spread”, but one is shifted slightly
to the right. Since these graphs have the same “spread”, we want their variance to reflect this similarity.
Thus, shifting a random variable by some constant does not change the variance of that random variable.
That is, Var (X + b) = Var (X): that’s why the b got lost!
Proof of Variance Property: Var(aX + b) = a²Var(X).
First, we show variance is unaffected by shifts; that is, Var(X + b) = Var(X) for any scalar b. We use the original definition that Var(Y) = E[(Y − E[Y])²], with Y = X + b.

Var(X + b) = E[((X + b) − E[X + b])²]    [def of variance]
           = E[(X + b − E[X] − b)²]      [linearity of expectation]
           = E[(X − E[X])²]
           = Var(X)                      [def of variance]

The scaling part is similar: Var(aX) = E[(aX − E[aX])²] = E[a²(X − E[X])²] = a²E[(X − E[X])²] = a²Var(X). Combining the two gives Var(aX + b) = Var(aX) = a²Var(X).
Example(s)
Let X be the outcome of a fair 6-sided die roll. Recall that E[X] = 3.5. What is Var(X)?
Let's say you play a casino game, where you must pay $10 to roll this die once, but earn twice the value of the roll. What are the expected value and variance of your earnings?

Solution First, E[X²] = ∑_{k=1}^{6} k² · (1/6) = 91/6, so Var(X) = E[X²] − E[X]² = 91/6 − 3.5² = 35/12 ≈ 2.92. Your earnings are Y = 2X − 10, so E[Y] = 2E[X] − 10 = −3 and, by the property above, Var(Y) = 2²Var(X) = 35/3.
Now you might wonder, what about the variance of a sum Var (X + Y )? You might hope that Var (X + Y ) =
Var (X)+Var (Y ), but this unfortunately is only true when the random variables are independent (we’ll define
this in the next section, but you can kind of guess what it means)! It is so important to remember that we
made no independence assumptions for linearity of expectation - it’s always true!
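If you want to see these two summary quantities in action, here is a minimal simulation sketch of the two games above (assuming Python 3 with only its standard random module; the play function and the seed are ours, purely for illustration):

import random

def play(stakes, trials=100_000):
    # Play `trials` rounds of a fair-coin game where you win or lose
    # `stakes` dollars with probability 1/2 each; return the sample
    # mean and sample variance of your earnings.
    outcomes = [stakes if random.random() < 0.5 else -stakes
                for _ in range(trials)]
    mean = sum(outcomes) / trials
    var = sum((x - mean) ** 2 for x in outcomes) / trials
    return mean, var

random.seed(312)
print(play(1))     # mean near 0, variance near 1       (Game 1)
print(play(1000))  # mean near 0, variance near 10**6   (Game 2)

Both sample means hover near 0, while the sample variances differ by a factor of a million - exactly the distinction the expectation alone couldn't capture.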
3.3.3 Exercises
1. Suppose you studied hard for a 100-question multiple-choice exam (with 4 choices per question) so
that you believe you know the answer to about 80% of the questions, and you guess the answer to the
remaining 20%. What is the expected number of questions you answer correctly?
Solution: For i = 1, ..., 100, let Xi be the indicator rv which is 1 if you got the ith question correct, and 0 otherwise. Then, the total number of questions correct is X = ∑_{i=1}^{100} Xi. To compute E[X], we need E[Xi] for each i = 1, ..., 100.

E[Xi] = 1 · P(Xi = 1) + 0 · P(Xi = 0) = P(Xi = 1) = P(correct on question i) = 1 · 0.8 + 0.25 · 0.2 = 0.85
where the second last step was using the law of total probability, conditioning on whether we know the
answer to a question or not. Hence,
" 100 # 100 100
X X X
E [X] = E Xi = E [Xi ] = 0.85 = 85
i=1 i=1 i=1
This kind of makes sense - I should be guaranteed 80 out of 100, and if I guess on the other 20, I would
get about 5 (a quarter of them) right, for a total of 85.
2. Recall exercise 2 from 3.1, where we had a random variable X with PMF

pX(k) =
    1/100,   k = −2
    18/100,  k = 0
    81/100,  k = 2

There, we computed E[X] = −2 · (1/100) + 0 · (18/100) + 2 · (81/100) = 1.6 and, by LOTUS, E[X²] = (−2)² · (1/100) + 0² · (18/100) + 2² · (81/100) = 3.28. Hence,

Var(X) = E[X²] − E[X]² = 3.28 − 1.6² = 0.72
Chapter 3. Discrete Random Variables
3.4: Zoo of Discrete Random Variables Part I
Slides (Google Drive) Video (YouTube)
In this section, we'll define formally what it means for random variables to be independent. Then, for the rest of the chapter (3.4, 3.5, 3.6), we'll discuss commonly appearing random variables whose properties, like the PMF, mean, and variance, we can just cite without doing any work! These situations are so common that we name them, and can refer to them and related quantities easily!
Random variables X and Y are independent, denoted X ⊥ Y, if for all x ∈ ΩX and all y ∈ ΩY, any of the following three equivalent properties holds:
1. P(X = x | Y = y) = P(X = x)
2. P(Y = y | X = x) = P(Y = y)
3. P(X = x ∩ Y = y) = P(X = x) · P(Y = y)
Note that this is the same as the event definition of independence, but it must hold for all events {X = x} and {Y = y}.
If X ⊥ Y, then

Var(X + Y) = Var(X) + Var(Y)

This will be proved a bit later, but we can start using this fact now! It is important to remember that you cannot use this formula if the random variables are not independent (unlike linearity).
A common misconception is that Var(X − Y) = Var(X) − Var(Y), but this actually isn't true, since otherwise we could get a negative number. In fact, if X ⊥ Y, then

Var(X − Y) = Var(X + (−Y)) = Var(X) + Var(−Y) = Var(X) + (−1)²Var(Y) = Var(X) + Var(Y)
Before diving into the random variables themselves, let’s look at a situation that arises often...
Let's illustrate how this might be useful with an example. Suppose we independently flip 8 coins that land heads with probability p, and record the resulting sequence of coin flips. This series of flips is a Bernoulli process. We call each of these coin flips a Bernoulli random variable (or indicator rv).
A random variable X is Bernoulli (or indicator), denoted X ~ Ber(p), if and only if X has the following PMF:

pX(k) =
    p,      k = 1
    1 − p,  k = 0
Each Xi in the Bernoulli process with parameter p is a Bernoulli/indicator random variable with parameter p. It simply represents a binary outcome, like a coin flip.
Additionally,

E[X] = p and Var(X) = p(1 − p)

since E[X] = 1 · p + 0 · (1 − p) = p, and because X² = X (as 0² = 0 and 1² = 1), E[X²] = p as well, so

Var(X) = E[X²] − E[X]² = p − p² = p(1 − p)
Notice how we found a situation whose general form comes up quite often, and derived a random variable that models that situation well. Now, anytime we need a Bernoulli/indicator random variable, we can denote it as follows: X ~ Ber(p).
We can generalize this as follows to get the PMF of a binomial random variable:

pX(k) = P(X = k) = P(exactly k heads in n Bernoulli trials) = \binom{n}{k} p^k (1 − p)^{n−k},  k ∈ ΩX
This hopefully sheds some light on why \binom{n}{k} is called a binomial coefficient and X a binomial random variable. Before computing its expectation, let's make sure we didn't make a mistake, and check that our probabilities sum to 1. This will finally use the binomial theorem we learned in chapter 1: (x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n−k}.
∑_{k=0}^{n} pX(k) = ∑_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k}    [PMF of Binomial RV]
                  = (p + (1 − p))^n                               [binomial theorem]
                  = 1^n = 1
A random variable X has a Binomial distribution, denoted X ~ Bin(n, p), if and only if X has the following PMF for k ∈ ΩX = {0, 1, 2, ..., n}:

pX(k) = \binom{n}{k} p^k (1 − p)^{n−k}
X is the sum of n independent Ber(p) random variables, and represents the number of heads in n
independent coin flips where P (head) = p.
Additionally,
E[X] = np and Var(X) = np(1 − p)
Proof of Expectation and Variance of Binomial. We can use linearity of expectation to compute the expected value of a particular binomial variable (i.e., the expected number of successes in n Bernoulli trials). Let X ~ Bin(n, p), so that X = ∑_{i=1}^{n} Xi, where the Xi ~ Ber(p) are independent.

E[X] = E[∑_{i=1}^{n} Xi]
     = ∑_{i=1}^{n} E[Xi]    [linearity of expectation]
     = ∑_{i=1}^{n} p        [expectation of Bernoulli]
     = np
This makes sense! If X ~ Bin(100, 0.5) (number of heads in 100 independent flips of a fair coin), you expect 50 heads, which is just np = 100 · 0.5 = 50. Variance can be found in a similar manner:

Var(X) = Var(∑_{i=1}^{n} Xi)
       = ∑_{i=1}^{n} Var(Xi)      [variance adds for independent rvs]
       = ∑_{i=1}^{n} p(1 − p)     [variance of Bernoulli]
       = np(1 − p)
Like Bernoulli rvs, Binomial random variables have a special place in our zoo. Arguably, the Binomial is the most important discrete random variable, so make sure to understand everything above and be ready to use it!
It is important to note for the hat check example in 3.3 that we had the sum of n Bernoulli/indicator rvs BUT that they were NOT independent. This is because if we know one person gets their hat back, someone else is more likely to (since there are n − 1 possibilities instead of n). However, linearity of expectation works regardless of independence, so we were still able to add their expectations like so:

E[X] = ∑_{i=1}^{n} E[Xi] = n · (1/n) = 1

It would be incorrect to say that X ~ Bin(n, 1/n), because the indicator rvs were NOT independent.
Example(s)
A factory produces 100 cars per day, but a car is defective with probability 0.02. What's the probability that the factory produces 2 or more defective cars on a given day?
Solution Let X be the number of defective cars that the factory produces. X ~ Bin(100, 0.02), so

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1)    [complement]
         = 1 − \binom{100}{0}(0.02)^0(1 − 0.02)^{100} − \binom{100}{1}(0.02)^1(1 − 0.02)^{99}    [plug in binomial PMF]
         ≈ 0.5967
So, there is about a 60% chance that 2 or more cars produced on a given day will be defective.
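As a sanity check on the arithmetic, here is a small Python sketch (assuming Python 3.8+ for math.comb; the helper name binom_pmf is ours, purely for illustration):

from math import comb

def binom_pmf(n, p, k):
    # P(X = k) for X ~ Bin(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.02
prob = 1 - binom_pmf(n, p, 0) - binom_pmf(n, p, 1)  # complement rule
print(round(prob, 4))  # 0.5967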
3.4.4 Exercises
1. An elementary school wants to keep track of how many of their 200 students have acceptable attendance.
Each student shows up to school on a particular day with probability 0.85, independently of other days
and students.
(a) A student has acceptable attendance if they show up to class at least 4 out of 5 times in a school
week. What is the probability a student has acceptable attendance?
(b) What is the probability that at least 170 out of the 200 students have acceptable attendance?
Assume students’ attendance are independent since they live separately.
(c) What is the expected number of students with acceptable attendance?
Solution: Actually, this is a great question because it has nested binomials!
(a) Let X be the number of school days a student shows up in a school week. Then, X ~ Bin(n = 5, p = 0.85), since a student's attendance on different days is independent as mentioned earlier. We want X ≥ 4:

P(X ≥ 4) = P(X = 4) + P(X = 5) = \binom{5}{4} 0.85^4 · 0.15^1 + \binom{5}{5} 0.85^5 · 0.15^0 = 0.83521
(b) Let Y be the number of students who have acceptable attendance. Then, Y ~ Bin(n = 200, p = 0.83521), since each student's attendance is independent of the rest. So,

P(Y ≥ 170) = ∑_{k=170}^{200} \binom{200}{k} 0.83521^k (1 − 0.83521)^{200−k} ≈ 0.3258
(c) We have E [Y ] = np = 200 · 0.83521 = 167.04 as the expected number of students! We can just
cite it now that we’ve identified Y as being Binomial!
2. [From Stanford CS109] When sending binary data to satellites (or really over any noisy channel) the
bits can be flipped with high probabilities. In 1947 Richard Hamming developed a system to more
reliably send data. By using Error Correcting Hamming Codes, you can send a stream of 4 bits with 3
(additional) redundant bits. If zero or one of the seven bits are corrupted, using error correcting codes,
a receiver can identify the original 4 bits. Let’s consider the case of sending a signal to a satellite where
each bit is independently flipped with probability p = 0.1. (Hamming codes are super interesting. It's worth looking up if you haven't seen them before! All these problems could be approached using a binomial distribution (or from first principles).)
(a) If you send 4 bits, what is the probability that the correct message was received (i.e. none of the
bits are flipped).
(b) If you send 4 bits, with 3 (additional) Hamming error correcting bits, what is the probability that
a correctable message was received?
(c) Instead of using Hamming codes, you decide to send 100 copies of each of the four bits. If for
every single bit, more than 50 of the copies are not flipped, the signal will be correctable. What
is the probability that a correctable message was received?
Solution:
(a) We have X ~ Bin(n = 4, p = 0.9) to be the number of correct (unflipped) bits. So the binomial PMF says:

P(X = 4) = \binom{4}{4} 0.9^4 (0.1)^{4−4} = 0.9^4 ≈ 0.6561
Note we could have also approached this by letting Y ~ Bin(4, 0.1) be the number of corrupted (flipped) bits, and computing P(Y = 0). This is the same result!
(b) Let Z be the number of corrupted bits, then Z ~ Bin(n = 7, p = 0.1), so we can use its PMF. A message is correctable if Z = 0 or Z = 1 (mentioned above), so

P(Z = 0) + P(Z = 1) = \binom{7}{0} 0.1^0 · 0.9^7 + \binom{7}{1} 0.1^1 · 0.9^6 ≈ 0.850
This is a 30% (relative) improvement compared to above by just using 3 extra bits!
(c) For i = 1, ..., 4, let Xi ~ Bin(n = 100, p = 0.9) be the number of unflipped copies of the ith bit. We need X1 > 50, X2 > 50, X3 > 50, and X4 > 50 for us to get a correctable message. For Xi > 50, we just sum the binomial PMF from 51 to 100:

P(Xi > 50) = ∑_{k=51}^{100} \binom{100}{k} 0.9^k (1 − 0.9)^{100−k}

By independence,

P(X1 > 50, X2 > 50, X3 > 50, X4 > 50) = P(X1 > 50) P(X2 > 50) P(X3 > 50) P(X4 > 50)
= (∑_{k=51}^{100} \binom{100}{k} 0.9^k (1 − 0.9)^{100−k})^4
> 0.999
But this required 400 bits instead of just the 7 required by Hamming codes! This is well worth the tradeoff.
3. Suppose A and B are random, independent (possibly empty) subsets of {1, 2, ..., n}, where each subset is equally likely to be chosen. Consider A ∩ B, i.e., the set containing elements that are in both A and B. Let X be the random variable that is the size of A ∩ B. What is E[X]?
Solution: Choosing a random subset of {1, ..., n} can be thought of as follows: for each element i = 1, ..., n, with probability 1/2 take the element (and with probability 1/2 don't take it), independently of other elements. This is a crucial observation.
For each element i = 1, ..., n, the element is either in A ∩ B or not. So let Xi be the indicator/Bernoulli rv of whether i ∈ A ∩ B or not. Then, P(Xi = 1) = P(i ∈ A, i ∈ B) = P(i ∈ A) P(i ∈ B) = (1/2) · (1/2) = 1/4, because A, B are chosen independently, and each element is in A or B with probability 1/2. Note that these Xi's are independent, because one element being in the set does not affect another element being in the set. Hence, X = ∑_{i=1}^{n} Xi is the number of elements in our intersection, so X ~ Bin(n, 1/4) and E[X] = np = n/4. (A quick simulation sketch checking this appears below.)
Note that it was not necessary that these variables were independent; we could have still applied linearity of expectation anyway to get n/4. We just wouldn't have been able to say X ~ Bin(n, 1/4).
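Here is the quick simulation sketch mentioned above (assuming Python 3 and its standard random module; the names are illustrative):

import random

def intersection_size(n):
    # Build A and B by including each element of {1, ..., n}
    # independently with probability 1/2, then return |A ∩ B|.
    A = {i for i in range(1, n + 1) if random.random() < 0.5}
    B = {i for i in range(1, n + 1) if random.random() < 0.5}
    return len(A & B)

random.seed(312)
n, trials = 20, 100_000
avg = sum(intersection_size(n) for _ in range(trials)) / trials
print(avg, n / 4)  # the sample mean should be close to n/4 = 5.0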
Chapter 3. Discrete Random Variables
3.5: Zoo of Discrete Random Variables Part II
Slides (Google Drive) Video (YouTube)
X is a uniform random variable, denoted X ~ Unif(a, b), where a < b are integers, if and only if X has the following probability mass function:

pX(k) =
    1/(b − a + 1),  k ∈ {a, a + 1, ..., b}
    0,              otherwise

X is equally likely to take on any value in ΩX = {a, a + 1, ..., b}. This set contains b − a + 1 integers, which is why P(X = k) is always 1/(b − a + 1).
Additionally,

E[X] = (a + b)/2  and  Var(X) = (b − a)(b − a + 2)/12
As you might expect, the expected value is just the average of the endpoints that the uniform random
variable is defined over.
E[X²] = ∑_{k=a}^{b} k² · pX(k) = ∑_{k=a}^{b} k² · 1/(b − a + 1) = (1/(b − a + 1)) ∑_{k=a}^{b} k² = ...

Var(X) = E[X²] − E[X]² = (b − a)(b − a + 2)/12
This variable models situations like rolling a fair six-sided die. Let X be the random variable whose value is the number face up on a die roll. Since the die is fair, each outcome is equally likely, which means that X ~ Unif(1, 6), so

pX(k) =
    1/6,  k ∈ {1, 2, ..., 6}
    0,    otherwise
This is fairly intuitive, but it is nice to have these formulas in our zoo so we can make computations quickly, and think about random processes in an organized fashion. Using the equations above we can find that

E[X] = (1 + 6)/2 = 3.5  and  Var(X) = (6 − 1)(6 − 1 + 2)/12 = 35/12
Let X be the random variable that represents the number of independent coin flips up to and including your first head. Let's compute P(X = 4). X = 4 occurs exactly when there are 3 tails followed by a head. So,

P(X = 4) = P(TTTH) = (1 − p)(1 − p)(1 − p)p = (1 − p)³p

In general,

pX(k) = (1 − p)^{k−1} p

This is because there must be k − 1 tails in a row, followed by a head occurring on the kth trial.
Let's also verify that the probabilities sum to 1.

∑_{k=1}^{∞} pX(k) = ∑_{k=1}^{∞} (1 − p)^{k−1} p    [Geometric PMF]
= p ∑_{k=1}^{∞} (1 − p)^{k−1}                      [take out constant]
= p ∑_{k=0}^{∞} (1 − p)^k                          [reindex to 0]
= p · 1/(1 − (1 − p))                              [geometric series formula: ∑_{i=0}^{∞} r^i = 1/(1 − r)]
= p · (1/p) = 1
The second last step used the geometric series formula - this may be why this random variable is called
Geometric!
X is a Geometric random variable, denoted X ~ Geo(p), if and only if X has the following probability mass function (and range ΩX = {1, 2, ...}):

pX(k) = (1 − p)^{k−1} p,  k = 1, 2, 3, ...

Additionally,

E[X] = 1/p  and  Var(X) = (1 − p)/p²
Example(s)
Let’s say you buy lottery tickets every day, and the probability you win on a given day is 0.01,
independently of other days. What is the probability that after a year (365 days), you still haven’t
won? What is the expected number of days until you win your first lottery?
Solution If X is the number of days until the first win, then X ~ Geo(p = 0.01). We still haven't won after a year exactly when X > 365, so the probability we don't win after a year is (using the PMF)

P(X > 365) = 1 − P(X ≤ 365) = 1 − ∑_{k=1}^{365} P(X = k) = 1 − ∑_{k=1}^{365} (1 − 0.01)^{k−1} · 0.01

This is great, but for the geometric, we can actually get a closed-form formula by thinking of what it means that X > 365 in English. X > 365 happens if and only if we lose on each of the first 365 days, which happens with probability 0.99^{365}. If you evaluated that nasty sum above and this quantity, you would find that they are equal!
Finally, we can just cite the expectation of the Geometric RV:

E[X] = 1/p = 1/0.01 = 100
This is the point of the zoo! We do all these generic calculations so we can use them later anytime.
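To see the zoo in action, here is a small Python sketch (standard library only; purely illustrative) that checks the closed form against the PMF sum:

p = 0.01

# Closed form: lose on each of the first 365 days.
closed_form = (1 - p) ** 365

# PMF sum: P(X > 365) = 1 - sum_{k=1}^{365} (1 - p)^(k-1) * p
pmf_sum = 1 - sum((1 - p) ** (k - 1) * p for k in range(1, 366))

print(closed_form, pmf_sum)  # both approximately 0.0255
print(1 / p)                 # expected days until the first win: 100.0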
Example(s)
You gamble by flipping a fair coin independently up to and including the first head. If it takes k tries, you earn $2^k (i.e., if your first head was the third flip, you would earn $2³ = $8). How much would you pay to play this game?
Solution Let X be the number of flips to the first head. Then, X ~ Geo(1/2) because it's a fair coin, and

pX(k) = (1 − 1/2)^{k−1} (1/2) = 1/2^k,  k = 1, 2, 3, ...

It is usually unwise to gamble, especially if your expected earnings are lower than the price to play. So, let Y be your earnings. Note that Y = 2^X, because the amount you win depends on the number of flips it takes to get a head. We will use LOTUS to compute E[Y] = E[2^X]. Recall E[2^X] ≠ 2^{E[X]} = 2² = 4, as we've seen many times now.

E[Y] = E[2^X] = ∑_{k=1}^{∞} 2^k pX(k) = ∑_{k=1}^{∞} 2^k · (1/2^k) = ∑_{k=1}^{∞} 1 = ∞
Some might say they would be willing to pay any finite amount of money to play this game. Think about
why that would be unwise, and what this means regarding the modeling tools we have provided you so far.
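One way to build intuition for why infinite expected earnings are suspicious: simulate the game. A minimal sketch (assuming Python 3; the winnings helper is ours, purely for illustration):

import random

def winnings():
    # Flip a fair coin until the first head; if it took k flips
    # total, the payout is 2**k dollars.
    k = 1
    while random.random() < 0.5:  # tails: keep flipping
        k += 1
    return 2 ** k

random.seed(312)
for trials in (10**3, 10**5, 10**6):
    avg = sum(winnings() for _ in range(trials)) / trials
    print(trials, avg)

The sample mean never settles down as you add trials - a hint that E[Y] = ∞, and that the expectation alone is a poor guide for pricing this game.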
Now suppose we keep flipping our p-coin until we see the rth head, and let X be the total number of flips needed. We can write

X = ∑_{i=1}^{r} Xi

where Xi is a geometric random variable that represents the number of flips it takes to get the ith head after i − 1 heads have already occurred. Since all the flips are independent, so are the rvs X1, ..., Xr. For example, if r = 3 we might observe the sequence of flips T T H H T T T H. In this case, X1 = 3 and represents the number of trials between the 0th and the 1st head; X2 = 1 and represents the number of trials between the 1st and the 2nd head; X3 = 4 and represents the number of trials between the 2nd and the 3rd head. Remember this fact for later!
How do we find P(X = 8) (with r = 3)? There must be exactly 3 heads and 5 tails, so it is reasonable to expect (1 − p)^5 p^3 to come up somewhere in our final formula, but how many ways can we get a valid sequence of flips? Note that the last coin flip must be a heads, otherwise we would've gotten our r heads earlier than our 8th flip. From here, any 2 of the first 7 flips can be heads, and the other 5 must be tails. Thus, there are \binom{7}{2} valid sequences of coin flips.
Each of these 7-flip sub-sequences (of the 8 total flips) occurs with probability (1 − p)^5 p^2, and there is no overlap. However, we need to include the probability that the last coin flip is a heads. So,

pX(8) = P(X = 8) = \binom{7}{2} (1 − p)^5 p^2 · p = \binom{7}{2} (1 − p)^5 p^3
Again, the interpretation is that our rth head must come at the kth trial exactly; so in the first k − 1 flips we can get our r − 1 heads anywhere (hence the binomial coefficient), and overall we have r heads and k − r tails.
If we are interested in finding the expected value of X, we might try the brute force approach directly from the definition of expected value:

E[X] = ∑_{k∈ΩX} k pX(k) = ∑_{k∈ΩX} k \binom{k−1}{r−1} (1 − p)^{k−r} p^r
but this approach is overly complicated, and there is a much simpler way using linearity of expectation! Suppose X1, ..., Xr ~ Geo(p) are independent. As we showed earlier, X = ∑_{i=1}^{r} Xi, and we showed that each E[Xi] = 1/p. Using linearity of expectation, we can derive the following:

E[X] = E[∑_{i=1}^{r} Xi] = ∑_{i=1}^{r} E[Xi] = ∑_{i=1}^{r} 1/p = r/p
Using a similar technique and the (yet unproven) fact that Var(X + Y) = Var(X) + Var(Y) for independent rvs, we can find the variance of X from the sum of the variances of the independent geometric random variables:

Var(X) = Var(∑_{i=1}^{r} Xi) = ∑_{i=1}^{r} Var(Xi) = ∑_{i=1}^{r} (1 − p)/p² = r(1 − p)/p²
This random variable is called the negative binomial random variable. It is quite common so it too deserves
a special place in our zoo.
X is a negative binomial random variable, denoted X ~ NegBin(r, p), if and only if X has the following probability mass function (and range ΩX = {r, r + 1, ...}):

pX(k) = \binom{k−1}{r−1} p^r (1 − p)^{k−r},  k = r, r + 1, ...

Additionally,

E[X] = r/p  and  Var(X) = r(1 − p)/p²

Also, note that Geo(p) ≡ NegBin(1, p), and that if X, Y are independent such that X ~ NegBin(r, p) and Y ~ NegBin(s, p), then X + Y ~ NegBin(r + s, p) (waiting for r + s heads).
3.5.4 Exercises
1. You are a hardworking boxer. Your coach tells you that the probability of your winning a boxing
match is 0.25, independently of every other match.
(a) How many matches do you expect to fight until you win once?
(b) How many matches do you expect to fight until you win ten times?
(c) You only get to play 12 matches every year. To win a spot in the Annual Boxing Championship,
a boxer needs to win at least 10 matches in a year. What is the probability that you will go to
the Championship this year?
(d) Let q be your answer from the previous part. How many times can you expect to go to the
Championship in your 20 year career?
Solution:
(a) Let X be the number of matches you have to fight until you win once. Then, X ~ Geo(p = 0.25), so E[X] = 1/p = 1/0.25 = 4.
(b) Let Y be the number of matches you have to fight until you win ten times. Then, Y ~ NegBin(r = 10, p = 0.25), so E[Y] = r/p = 10/0.25 = 40.
(c) Let Z be the number of matches you win out of 12. Then, Z ~ Bin(n = 12, p = 0.25), and we want

P(Z ≥ 10) = ∑_{k=10}^{12} \binom{12}{k} 0.25^k (1 − 0.25)^{12−k}
(d) Let W be the number of times we make it to the Championship in 20 years. Then, W ~ Bin(n = 20, p = q), and

E[W] = np = 20q
2. You are in music class, and your cruel teacher says you cannot leave until you play the 1000-note
song Fur Elise correctly 5 times. You start playing the song, and if you play an incorrect note,
you immediately start the song over from scratch. You play each note correctly independently with
probability 0.999.
(a) What is the probability you play the 1000-note song Fur Elise correctly immediately? (i.e., the
first 1000 notes are all correct).
(b) What is the probability you take exactly 20 attempts to correctly play the song 5 times?
(c) What is the probability you take at least 20 attempts to correctly play the song 5 times?
(d) (Challenge) What is the expected number of notes you play until you finish playing Fur Elise
correctly 5 times?
Solution:
(a) Let X be the number of correct notes we play in Fur Elise in one attempt, so X ~ Bin(1000, 0.999). We need P(X = 1000) = 0.999^{1000} ≈ 0.3677.
(b) If Y is the number of attempts until we play the song correctly 5 times, then Y ~ NegBin(5, 0.3677), and so

P(Y = 20) = \binom{20−1}{5−1} 0.3677^5 (1 − 0.3677)^{15} ≈ 0.0269
(c) We can actually take two approaches to this. We can either take our Y from earlier, and compute

P(Y ≥ 20) = 1 − P(Y < 20) = 1 − ∑_{k=5}^{19} \binom{k−1}{4} 0.3677^5 (1 − 0.3677)^{k−5} ≈ 0.1161
Notice the sum starts at 5 since that's the lowest possible value of Y. This would be exactly the probability of the statement asked. We could alternatively rephrase the question as: what is the probability we play the song correctly at most 4 times in the first 19 attempts? Check that these questions are equivalent! Then, we can let Z ~ Bin(19, 0.3677) and instead compute

P(Z ≤ 4) = ∑_{k=0}^{4} \binom{19}{k} 0.3677^k (1 − 0.3677)^{19−k} ≈ 0.1161
(d) We will have to revisit this question later in the course! Note that we could have computed the expected number of attempts to finish playing Fur Elise though, as it would follow a NegBin(5, 0.3677) distribution with expectation 5/0.3677 ≈ 13.598.
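A short sketch to verify parts (b) and (c) numerically (assuming Python 3.8+ for math.comb; the negbin_pmf helper name is ours):

from math import comb

p = 0.999 ** 1000   # probability of one perfect run-through, about 0.3677

def negbin_pmf(k, r, p):
    # P(Y = k) for Y ~ NegBin(r, p): the r-th success occurs on trial k.
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

print(negbin_pmf(20, 5, p))                                # (b): about 0.0269
print(1 - sum(negbin_pmf(k, 5, p) for k in range(5, 20)))  # (c): about 0.1161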
Chapter 3. Discrete Random Variables
3.6: Zoo of Discrete RV’s Part III
Slides (Google Drive) Video (YouTube)
We start by breaking one unit of time into 5 parts, and we say in each of the five chunks, either a baby is born or not. That means we'll be using a binomial rv with n = 5. The choice of p that will keep our average at 2 is 2/5, because the expected value of a binomial RV is np = 2.
Similarly, if we break the time into even smaller chunks such as n = 10 or n = 70, we can get the corresponding p to be 2/10 or 2/70 respectively (either a baby is born or not in 1/70 of a second).
And we keep increasing n so that it gets down to the smallest fraction of a second; we have n → ∞ and p → 0 in this fashion, while maintaining the condition that np = 2.
Let λ be the historical average number of events per unit of time. Send n → ∞ and p → 0 in such a way that np = λ is fixed (i.e., p = λ/n).
Let Xn ~ Bin(n, λ/n) and Y ~ lim_{n→∞} Xn be the limit of this sequence of Binomial rvs. Then, we say Y ~ Poi(λ), and Y measures the number of events in a unit of time, where the historical average is λ. We'll derive its PMF by taking the limit of the binomial PMF.
We'll need to recall how we defined the base of the natural logarithm e. There are two equivalent formulations:

e^x = lim_{n→∞} (1 + x/n)^n
e^x = ∑_{k=0}^{∞} x^k/k!
pY(k) = lim_{n→∞} pXn(k) = lim_{n→∞} \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k}    [Binomial PMF with p = λ/n]

= (λ^k/k!) · lim_{n→∞} (n!/((n − k)! n^k)) · (1 − λ/n)^n/(1 − λ/n)^k    [algebra]

= (λ^k/k!) · lim_{n→∞} (n/n · (n − 1)/n · ... · (n − k + 1)/n) · (1 − λ/n)^n/(1 − λ/n)^k    [n!/(n − k)! = n(n − 1)...(n − k + 1)]

= (λ^k/k!) · lim_{n→∞} (1 − λ/n)^n    [lim_{n→∞} n/n · ... · (n − k + 1)/n = 1, and lim_{n→∞} (1 − λ/n)^k = 1 since k is finite]

= (λ^k/k!) · e^{−λ}    [e^x = lim_{n→∞} (1 + x/n)^n]
We'll now verify that the Poisson PMF does sum to 1, and is valid. Recall the Taylor series e^x = ∑_{k=0}^{∞} x^k/k!, so

∑_{k=0}^{∞} pY(k) = ∑_{k=0}^{∞} e^{−λ} λ^k/k! = e^{−λ} ∑_{k=0}^{∞} λ^k/k! = e^{−λ} e^{λ} = 1
X ~ Poi(λ) if and only if X has the following probability mass function (and range ΩX = {0, 1, 2, ...}):

pX(k) = e^{−λ} λ^k/k!,  k = 0, 1, 2, ...

If λ is the historical average number of events per unit of time, then X is the number of events that occur in a unit of time.
Additionally,

E[X] = λ  and  Var(X) = λ
Proof of Expectation and Variance of Poisson. Let Xn ~ Bin(n, λ/n) and Y ~ lim_{n→∞} Xn = Poi(λ). By the properties of the binomial random variable, the mean and variance are as follows for any n (plug in λ = np or, equivalently, p = λ/n):

E[Xn] = np = λ
Var(Xn) = np(1 − p) = λ(1 − λ/n)

Therefore:

E[Y] = E[lim_{n→∞} Xn] = lim_{n→∞} E[Xn] = lim_{n→∞} λ = λ

Var(Y) = Var(lim_{n→∞} Xn) = lim_{n→∞} Var(Xn) = lim_{n→∞} λ(1 − λ/n) = λ
Example(s)
Suppose the average number of babies born in Seattle historically is 2 babies every 15 minutes.
1. What is the probability no babies are born in the next hour in Seattle?
2. What is the expected number of babies born in the next hour?
3. What is the probability no babies are born in the next 5 minutes in Seattle?
Solution
1. Since Poi(λ) is the number of events in a single unit of time (matching units with λ), we must convert our rate to hours (since we are interested in one hour). So the number of babies born in the next hour can be modelled as X ~ Poi(λ = 8/hr), and so the probability no babies are born is

P(X = 0) = e^{−8} · 8^0/0! = e^{−8}

2. The expected number of babies born in the next hour is just E[X] = λ = 8.
3. Now our unit of time is 5 minutes, so the rate is λ = 2/3 babies per 5 minutes. If Y ~ Poi(2/3) is the number of babies born in the next 5 minutes, then P(Y = 0) = e^{−2/3}.
Before doing the next example, let's talk about the sum of two independent Poisson rvs. Almost by definition, if X, Y are independent with X ~ Poi(λ) and Y ~ Poi(µ), then X + Y ~ Poi(λ + µ). (If the average number of babies born per minute in the USA is 5 and in Canada is 2, then the total number of babies born in the next minute combined is Poi(5 + 2), since the average combined rate is 7.) We'll prove this fact, that the sum of independent Poisson rvs is Poisson with the sum of their rates, in a future chapter!
Example(s)
Suppose Lookbook gets on average 120 new users per hour, and Quickgram gets 180 new users per
hour, independently. What is the probability that, combined, less than 2 users sign up in the next
minute?
Solution Convert the λ's to the same unit of interest. For us, it's a minute. We can always change the rate (e.g., 120 per hour is the same as 2 per minute), but we can't change the unit of time we're interested in. The combined number of new users in the next minute is then Z ~ Poi(2 + 3 = 5), so

P(Z < 2) = pZ(0) + pZ(1) = e^{−5} · 5^0/0! + e^{−5} · 5^1/1! = 6e^{−5} ≈ 0.04
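A quick numerical check of this example (assuming Python 3 standard library; poisson_pmf is our helper name):

from math import exp, factorial

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poi(lam)
    return exp(-lam) * lam**k / factorial(k)

lam = 120 / 60 + 180 / 60   # combined rate: 2 + 3 = 5 new users per minute
print(poisson_pmf(lam, 0) + poisson_pmf(lam, 1))   # 6e^{-5}, about 0.0404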
Suppose there is a candy bag of N = 9 total candies, K = 4 of which are lollipops. Our parents allow us to grab n = 3 of them. Let X be the number of lollipops we grab. What is the probability that we get exactly 2 lollipops?
The number of ways to grab three candies is just \binom{9}{3}, and we need to get exactly 2 lollipops out of 4, which is \binom{4}{2}. Out of the other 5 candies, we only need one of them, which yields \binom{5}{1} ways.

pX(2) = P(X = 2) = \binom{4}{2}\binom{5}{1} / \binom{9}{3}
We say the number of successes we draw is X ~ HypGeo(N, K, n), where K out of N items in a bag are successes, and we draw n without replacement.
X is the number of successes when drawing n items without replacement from a bag containing N items, K of which are successes (hence N − K failures). Generalizing the counting argument above, its PMF is

pX(k) = \binom{K}{k}\binom{N−K}{n−k} / \binom{N}{n}

Additionally,

E[X] = n(K/N)  and  Var(X) = n · K(N − K)(N − n)/(N²(N − 1))
Note that if we drew with replacement, then we would model this situation using Bin(n, K/N), as each draw would be an independent trial.
To see where the expectation comes from, write X = ∑_{i=1}^{n} Xi, where Xi indicates whether the ith draw is a success. Then, each Xi is Bernoulli, but with what parameter? The probability of getting a lollipop on the first draw (X1 being equal to 1) is just K/N:

P(X1 = 1) = K/N

What about P(X2 = 1), the probability we get a lollipop on our second draw? Well, it depends on whether or not we got one on the first draw! So we can use the LTP, conditioning on whether we got one (X1 = 1) or we didn't (X1 = 0):

P(X2 = 1) = P(X2 = 1 | X1 = 1)P(X1 = 1) + P(X2 = 1 | X1 = 0)P(X1 = 0) = ((K − 1)/(N − 1)) · (K/N) + (K/(N − 1)) · ((N − K)/N) = K/N
Actually, each Xi ~ Ber(K/N), at every draw i! You could continue the above logic for X3 and so on. This makes sense, because if you just think about the ith draw and you didn't know anything about the first i − 1, the probability you get a lollipop would just be K/N. Hence

E[Xi] = K/N

and so

E[X] = E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi] = ∑_{i=1}^{n} K/N = n(K/N)
Note again it would be wrong to say X ~ Bin(n, K/N), because the trials are NOT independent, but we are still able to use linearity of expectation. If we did this experiment with replacement though (take one and put it back), then the draws would be independent, and X would be modelled as Bin(n, K/N). Note the expectation with or without replacement is the same, because linearity of expectation doesn't care about independence!
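Here's a small sketch checking both the PMF computation and E[X] = n(K/N) for the candy example (assuming Python 3.8+; names are illustrative):

import random
from math import comb

N, K, n = 9, 4, 3   # 9 candies, 4 lollipops, grab 3 without replacement

# Exact PMF from the counting argument above:
print(comb(K, 2) * comb(N - K, 1) / comb(N, n))   # P(X = 2), about 0.357

# Simulation check of E[X] = nK/N = 4/3:
random.seed(312)
bag = ["lollipop"] * K + ["other"] * (N - K)
trials = 100_000
avg = sum(random.sample(bag, n).count("lollipop")
          for _ in range(trials)) / trials
print(avg, n * K / N)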
The variance is a nightmare, and will be proven in 5.4, when we figure out how to compute the variance of the sum of these dependent indicator variables.
The Zoo of Discrete RV’s: Here are all the distributions in our zoo of discrete random variables!
• The Bernoulli RV
• The Binomial RV
• The Uniform (Discrete) RV
• The Geometric RV
• The Negative Binomial RV
• The Poisson RV
• The Hypergeometric RV
Congratulations on making it through this chapter on all these wonderful discrete random variables! There are several practice problems below which require using a lot of these zoo elements. It will definitely take some time to get used to all of these - you'll need to practice! See our handy reference sheet for one place to see all of them while doing problems.
3.6.4 Exercises
1. Suppose that on average, 40 babies are born per hour in Seattle.
(a) What is the probability that over 1000 babies are born in a single day in Seattle?
(b) What is the probability that in a 365-day year, over 1000 babies are born on exactly 200 days?
Solution:
(a) The number of babies born in a single average day is 40 · 24 = 960, so X ~ Poi(λ = 960). Then,

P(X > 1000) = 1 − P(X ≤ 1000) = 1 − ∑_{k=0}^{1000} e^{−960} 960^k/k!
(b) Let q be the answer from part (a). The number of days where over 1000 babies are born is Y ~ Bin(n = 365, p = q), so

P(Y = 200) = \binom{365}{200} q^{200} (1 − q)^{165}
2. Suppose the Senate consists of 53 Republicans and 47 Democrats. Suppose we were to create a
bipartisan committee of 20 senators by randomly choosing from the 100 total.
(a) What is the probability we end up with exactly 9 Republicans and 11 Democrats?
(b) What is the expected number of Democrats on the committee?
Solution:
(a) Let X be the number of Republican senators chosen. Then X ~ HypGeo(N = 100, K = 53, n = 20), and the desired probability is

P(X = 9) = \binom{53}{9}\binom{47}{11} / \binom{100}{20}
since choosing 9 Republicans (out of the 20 committee members) immediately implies we have 11 Democrats. Note we could have flipped the roles of Democrats and Republicans: if Y is the number of Democratic senators chosen, then Y ~ HypGeo(N = 100, K = 47, n = 20), and

P(Y = 11) = \binom{47}{11}\binom{53}{9} / \binom{100}{20}
(b) The number of Democrats as mentioned earlier is Y ~ HypGeo(N = 100, K = 47, n = 20), and so

E[Y] = n(K/N) = 20 · (47/100) = 9.4
3. (Poisson Approximation to Binomial) Suppose the famous chip company “Bayes” produces
n = 10000 bags per day. They need to do a quality check, and they know that 0.1% of their bags
independently have “bad” chips in them.
(a) What is the exact probability that at most 5 bags contain “bad” chips?
(b) Recall the Poisson was derived from the Binomial with n → ∞ and p → 0, so it suggests that a Poisson distribution would be a good approximation to a Binomial with large n and small p. Use a Poisson rv instead to compute the same probability as in part (a). How close are the answers?
Note: The reason we sometimes use a Poisson approximation is because the binomial PMF is hard to compute. Imagine X ~ Bin(10000, 0.256): computing P(X = 2000) = \binom{10000}{2000} 0.256^{2000} (1 − 0.256)^{8000} has at least 10000 multiplication operations for the probabilities. Furthermore, \binom{10000}{2000} = 10000!/(2000! · 8000!) - good luck avoiding overflow on your computer!
Solution:
(a) If X is the number of bags with "bad" chips, then X ~ Bin(n = 10000, p = 0.001), so

P(X ≤ 5) = ∑_{k=0}^{5} \binom{10000}{k} 0.001^k (1 − 0.001)^{10000−k} ≈ 0.06699
(b) Since n is large and p is small, we might approximate X as a Poisson rv, with λ = np = 10000 · 0.001 = 10. Then, since X ≈ Poi(10), we have

P(X ≤ 5) = ∑_{k=0}^{5} e^{−10} 10^k/k! ≈ 0.06709

(A code sketch comparing the two answers appears after these exercises.)
4. Suppose typos in a 250-page book occur according to a Poisson process, at an average rate of one typo per two pages.
(a) What is the probability that a given page has at least one typo?
(b) What is the expected total number of typos in the book?
(c) What is the probability that at most 50 of the 250 pages have at least one typo?
(d) What is the expected number of pages until the first page that contains a typo?
(e) Suppose exactly 50 of the 250 pages contain at least one typo. If we choose 20 pages at random, what is the probability that exactly 5 of them contain a typo?
Solution:
(a) The average rate of typos is one per two pages, or equivalently, 1/2 per one page. Hence, if X is the number of typos on a page, then X ~ Poi(λ = 1/2), and

P(X ≥ 1) = 1 − P(X = 0) = 1 − e^{−1/2} (1/2)^0/0! = 1 − e^{−1/2} ≈ 0.39347
(b) Since we are interested in a 250-page "time period", the average rate of typos is 125 per 250 pages. If Y is the number of typos in total, then Y ~ Poi(λ = 125), and E[Y] = λ = 125.
(c) We can consider each page as Poi(1/2), like in part (a). Let Z be the number of pages with at least one typo. Then, Z ~ Bin(n = 250, p = 0.39347), and

P(Z ≤ 50) = ∑_{k=0}^{50} \binom{250}{k} 0.39347^k (1 − 0.39347)^{250−k}
(d) Let V be the first page that contains (at least) one typo. Then, V ~ Geo(0.39347), so

E[V] = 1/0.39347 ≈ 2.5415
(e) If W is the number of pages out of 20 that have a typo, then W ~ HypGeo(N = 250, K = 50, n = 20), and

P(W = 5) = \binom{50}{5}\binom{200}{15} / \binom{250}{20}
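Here is the code sketch mentioned in exercise 3, comparing the exact Binomial probability with the Poisson approximation (assuming Python 3.8+ for math.comb):

from math import comb, exp, factorial

n, p = 10000, 0.001
lam = n * p   # lambda = 10

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6))
approx = sum(exp(-lam) * lam**k / factorial(k) for k in range(6))
print(exact)    # about 0.06699 (Binomial)
print(approx)   # about 0.06709 (Poisson approximation)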
Chapter 4. Continuous Random Variables
We learned about how to model things like the number of car crashes in a year or the number of lottery tickets I must buy until I win. What about quantities like the time until the next earthquake, or the height of human beings? These latter quantities are a completely different beast, since they can take on
uncountably infinitely many values (infinite decimal precision). Some of the ideas from the previous chapter
will stay, but we’ll have to develop new tools to handle this new challenge. We’ll also learn about the most
important continuous distribution: the Normal distribution.
Up to this point, we have only been talking about discrete random variables - ones that only take values
in a countable (finite or countably infinite) set like the integers or a subset. What if we wanted to model
quantities that were continuous - that could take on uncountably infinitely many values? If you haven’t
studied or seen cardinality (or types of infinities) before, you can think of this as being intervals of the real
line, which take decimal values. Our tools from the previous chapter were not suitable for modelling these situations, and so we need a new type of random variable.
For continuous random variables, we can no longer define a PMF: the probability of X equalling any particular value will turn out to be 0, and there is no way to make the sum of the probabilities equal 1 (assuming we could even sum over uncountably many values; we can't). Instead, we have the idea of a probability density function, where the x-axis has values in the random variable's range (usually an interval), and the y-axis has the probability density (not mass), which is explained below.
The probability density function fX has some characteristic properties (denoted with fX to distinguish from PMFs pX). Notice again I will use different dummy variables inside the function like fX(z) or fX(t) to ensure you get the idea that the density is fX (subscript indicates for rv X) and the dummy variable can be anything.
• fX(z) ≥ 0 for all z ∈ R; i.e., it is always non-negative, just like a probability mass function.
• ∫_{−∞}^{∞} fX(t)dt = 1; i.e., the area under the entire curve is equal to 1, just like the sum of all the probabilities of a discrete random variable equals 1.
• P(a ≤ X ≤ b) = ∫_a^b fX(w)dw; i.e., the probability that X lies in the interval a to b is the area under the curve from a to b. This is key - integrating fX gives us probabilities.
• P(X = y) = P(y ≤ X ≤ y) = ∫_y^y fX(w)dw = 0. The probability of being a particular value is 0, and NOT equal to the density fX(y), which is nonzero. This is particularly confusing at first.
• P(X ≈ q) = P(q − ε/2 ≤ X ≤ q + ε/2) ≈ ε fX(q); i.e., with a small epsilon value, we can obtain a good rectangle approximation of the area under the curve. The width of the rectangle is ε (from the difference between q + ε/2 and q − ε/2). The height of the rectangle is fX(q), the value of the probability density function fX at q. So, the area of the rectangle is ε fX(q). This is similar to the idea of Riemann integration.
• P(X ≈ u)/P(X ≈ v) ≈ ε fX(u)/(ε fX(v)) = fX(u)/fX(v); i.e., the PDF tells us ratios of probabilities of being "near" a point. From the previous point, we know the probabilities of X being approximately u and v, and through algebra, we see their ratios. If the density is twice as high at u as it is at v, it means we are twice as likely to get a point "near" u as we are to get one "near" v.
Let X be a continuous random variable (one whose range is typically an interval or union of intervals). The probability density function (PDF) of X is the function fX : R → R, such that the following properties hold:
• fX(z) ≥ 0 for all z ∈ R
• ∫_{−∞}^{∞} fX(t) dt = 1
• P(a ≤ X ≤ b) = ∫_a^b fX(w) dw
• P(X = y) = 0 for any y ∈ R
• The probability that X is close to q is proportional to its density fX(q):
  P(X ≈ q) = P(q − ε/2 ≤ X ≤ q + ε/2) ≈ ε fX(q)
• Ratios of probabilities of being "near points" are maintained:
  P(X ≈ u)/P(X ≈ v) ≈ fX(u)/fX(v)
For example, consider X ~ Unif(0, 1) (continuous), whose density is fX(x) = 1 for x ∈ [0, 1] (and 0 otherwise). We know this is a valid density, because the area under the curve is the area of a square with side lengths 1, which is 1 · 1 = 1.
We define the cumulative distribution function (CDF) of X to be FX(w) = P(X ≤ w). That is, all the area to the left of w in the density function. Note we also have CDFs for discrete random variables; they are defined exactly the same way (the probability of being less than or equal to a certain value)! They just don't usually have a nice closed form like they do for continuous RVs. Note for continuous random variables, the CDF at w is just the cumulative area to the left of w, which can be found by an integral (the dummy variable of integration should be different than the input variable w):

FX(w) = P(X ≤ w) = ∫_{−∞}^{w} fX(y)dy
Let’s try to compute the CDF of this uniform random variable on [0, 1]. There are three cases to consider
here.
• If w < 0, FX(w) = 0, since ΩX = [0, 1]. For example, if w = −1, then FX(w) = P(X ≤ −1) = 0, since there is no chance that X ≤ −1. Formally, there is also no area to the left of w = −1, as you can see from the PDF above, so the integral evaluates to 0!
• If 0 ≤ w ≤ 1, the area up to w is a rectangle of height 1 and width w, so FX(w) = w. That is, P(X ≤ w) = w. For example, if w = 0.5, then the probability X ≤ 0.5 is actually just 0.5, since X is equally likely to be anywhere in ΩX = [0, 1]! Note here we didn't do an integral since there are nice shapes, and we sometimes don't have to! We just looked at the area to the left of w.
• If w > 1, all the area is to the left of w, so FX(w) = 1. Again, since ΩX = [0, 1], suppose w = 2; then FX(w) = P(X ≤ 2) = 1, since X is always between 0 and 1 (X must be less than or equal to 2). Formally, the cumulative area to the left of w = 2 is 1 (just the area of the square)!

FX(w) =
    0,  w < 0
    w,  0 ≤ w ≤ 1
    1,  w > 1
Let X be a continuous random variable (one whose range is typically an interval or union of intervals). The cumulative distribution function (CDF) of X is the function FX : R → R such that:
• FX(t) = P(X ≤ t) = ∫_{−∞}^{t} fX(w) dw for all t ∈ R
• (d/du) FX(u) = fX(u)
• P(a ≤ X ≤ b) = FX(b) − FX(a)
• FX is monotone increasing, since fX ≥ 0. That is, FX(c) ≤ FX(d) for c ≤ d.
• lim_{v→−∞} FX(v) = P(X ≤ −∞) = 0
• lim_{v→+∞} FX(v) = P(X ≤ +∞) = 1
Example(s)
Suppose the number of hours X that a package gets delivered past noon is modelled by the following PDF:

fX(x) =
    x/10,  0 ≤ x ≤ 2
    c,     2 < x ≤ 6
    0,     otherwise

(The graph of this PDF is a triangular ramp from x = 0 to x = 2, followed by a flat segment at height c from x = 2 to x = 6.)
1. What is the range of X, ΩX?
2. What must c be so that fX is a valid PDF?
3. Find the CDF of X, FX.
4. Compute P(2 ≤ X ≤ 6).
5. Set up (but do not evaluate) an expression for E[X].
Solution
1. The range is all values where the density is nonzero; in our case, that is ΩX = [0, 6] (or (0, 6)), but we don't care about single points or endpoints because the probability of being exactly any particular value is 0.
2. Formally, we need the density function to integrate to 1; that is,

∫_{−∞}^{∞} fX(x)dx = 1

Since the density function is split into three parts, we can split our integral into three. However, anywhere the density is zero, we will get an integral of zero, so we'll only set up the two integrals that are nontrivial:

∫_0^2 (x/10)dx + ∫_2^6 c dx = 1

Solving this equation for c would definitely work. But let's try to use geometry instead, as we do know how to compute the area of a triangle and a rectangle. The left integral is the area of the triangle with base from 0 to 2 and height 2/10 = 1/5, so that area is (2 · 1/5)/2 = 1/5 (the area of a triangle is b · h/2). The area of the rectangle with base from 2 to 6 and height c is 4c. We need the total area 1/5 + 4c = 1, so c = 1/5.
3. Our CDF needs four cases: when x < 0, when 0 ≤ x ≤ 2, when 2 < x ≤ 6, and when x > 6.
(a) The outer cases are usually the easiest ones: if x < 0, then FX(x) = P(X ≤ x) = 0, since X cannot be less than zero.
(b) If x > 6, then FX(x) = P(X ≤ x) = 1, since X is guaranteed to be at most 6.
(c) For 0 ≤ x ≤ 2, we need the cumulative area to the left of x, which happens to be a triangle with base x and height x/10, so the area is x²/20. Alternatively, evaluate the integral

FX(x) = ∫_{−∞}^x fX(t)dt = ∫_0^x (t/10)dt = x²/20
(d) For 2 < x ≤ 6, we have the entire triangle of area 2 · (1/5) · 0.5 = 1/5, but also a rectangle of base x − 2 and height 1/5, for a total area of 1/5 + (1/5)(x − 2) = x/5 − 1/5. Alternatively, the integral would be

FX(x) = ∫_{−∞}^x fX(t)dt = ∫_0^2 (t/10)dt + ∫_2^x (1/5)dt = x/5 − 1/5

Again, I skipped all the integral evaluation steps as they are purely computational, but feel free to verify!
Finally, putting this together gives

FX(x) =
    0,          x < 0
    x²/20,      0 ≤ x ≤ 2
    x/5 − 1/5,  2 < x ≤ 6
    1,          x > 6
4. Using the formula, we find the area between 2 and 6 to get P(2 ≤ X ≤ 6) = ∫_2^6 fX(t)dt = ∫_2^6 (1/5)dt = 4/5. Alternatively, we can just see that the area from 2 to 6 is a rectangle with base 4 and height 1/5, so the probability is just 4/5. We could also compute FX(6) − FX(2): this is just the area to the left of 6, minus the area to the left of 2, which gives us the area between 2 and 6.
5. We'll use the formula for expectation of a continuous RV, but split into three integrals again due to the piecewise definition of our density. However, the integral outside the range [0, 6] will evaluate to zero, so we won't include it.

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^2 x fX(x)dx + ∫_2^6 x fX(x)dx = ∫_0^2 x · (x/10)dx + ∫_2^6 x · (1/5)dx
We won’t do the computation because it’s not important, but hopefully you get the idea of how similar
this is to the discrete version!
Discrete: PMF pX(x) = P(X = x); CDF FX(x) = ∑_{t ≤ x} pX(t); E[X] = ∑_x x pX(x).
Continuous: PDF fX(x), with P(X = x) = 0; CDF FX(x) = ∫_{−∞}^x fX(t)dt; E[X] = ∫_{−∞}^{∞} x fX(x)dx.
4.1.4 Exercises
1. Suppose X is continuous with density

fX(x) =
    cx²,  0 ≤ x ≤ 9
    0,    otherwise
Write an expression for the value of c that makes X a valid pdf, and set up expressions (integrals) for
its mean and variance. Also, find the cdf of X, FX .
Solution: For fX to be a valid PDF, we need ∫_0^9 cx²dx = c · (9³/3) = 243c = 1, so c = 1/243. The mean is

E[X] = ∫_{−∞}^{∞} z fX(z)dz = ∫_0^9 z · (1/243)z² dz = (1/243) ∫_0^9 z³ dz

Similarly, by LOTUS,

E[X²] = ∫_{−∞}^{∞} z² fX(z)dz = ∫_0^9 z² · (1/243)z² dz = (1/243) ∫_0^9 z⁴ dz

and Var(X) = E[X²] − E[X]². Finally, the CDF is FX(t) = 0 for t < 0, FX(t) = ∫_0^t (1/243)s² ds = t³/729 for 0 ≤ t ≤ 9, and FX(t) = 1 for t > 9.
2. Suppose X is continuous with density

fX(x) =
    c/x²,  x ≥ 1
    0,     otherwise

Write an expression for the value of c that makes X a valid pdf, and set up an expression (integral) for its mean. Also, find the cdf of X, FX.
Solution: We need ∫_1^∞ (c/x²)dx = c[−1/x]_1^∞ = c · 1 = 1. Hence, c = 1. The expected value is the weighted average of each point weighted by its density, so

E[X] = ∫_{−∞}^{∞} z fX(z)dz = ∫_1^∞ z · (1/z²)dz = ∫_1^∞ (1/z)dz = [ln(z)]_1^∞ = ∞
For the CDF, we actually have two cases. If t < 1, FX(t) = 0, since there's no way to get a number less than 1 (the range is ΩX = [1, ∞)). For t > 1, we just do a normal integral to get that

FX(t) = P(X ≤ t) = ∫_{−∞}^t fX(s)ds = ∫_1^t (1/s²)ds = [−1/s]_1^t = 1 − 1/t
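Since all of these continuous-RV facts are just integrals, they're easy to sanity-check numerically. A crude sketch for exercise 1 (plain Python; the integrate helper is ours - a simple midpoint rule rather than any library routine):

def integrate(f, a, b, steps=100_000):
    # Crude midpoint-rule numerical integration; plenty accurate here.
    dx = (b - a) / steps
    return sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

c = 1 / 243
f = lambda x: c * x**2                        # density from exercise 1

print(integrate(f, 0, 9))                     # about 1: fX is a valid density
print(integrate(lambda x: x * f(x), 0, 9))    # E[X] = 6.75
print(integrate(f, 0, 3), 3**3 / 729)         # FX(3) numerically vs. t^3/729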
Now that we’ve learned about the properties of continuous random variables, we’ll discover some frequently
used RVs just like we did for discrete RVs! In this section, we’ll learn the continuous Uniform distribution,
the Exponential distribution, and Gamma distribution. In the next section, we’ll finally learn about the
Normal/Gaussian (bell-shaped) distribution which you all may have heard of before!
4.2.1 The (Continuous) Uniform RV
The continuous uniform random variable models a situation where there is no preference for any particular
value over a bounded interval. This is very similar to the discrete uniform random variable (e.g., roll of a fair
die), except extended to include decimal values. The probability of equalling any particular value is again 0
since we are dealing with a continuous RV.
X ~ Unif(a, b) (continuous), where a < b are real numbers, if and only if X has the following pdf:

fX(x) =
    1/(b − a),  x ∈ [a, b]
    0,          otherwise

X is equally likely to take on any value in [a, b]. Note the similarities and differences it has with the discrete uniform! The value of the density function is constant at 1/(b − a) for any input x ∈ [a, b], which makes the graph a rectangle whose area integrates to 1.

E[X] = (a + b)/2,  Var(X) = (b − a)²/12

The cdf is

FX(x) =
    0,                x < a
    (x − a)/(b − a),  a ≤ x ≤ b
    1,                x > b
Proof of Expectation and Variance of Uniform. I'm setting up the integrals but omitting the steps that are not relevant to your understanding of probability theory (computing integrals):

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_a^b x · 1/(b − a) dx = (a + b)/2

E[X²] = ∫_{−∞}^{∞} x² fX(x)dx = ∫_a^b x² · 1/(b − a) dx = (a² + ab + b²)/3

Var(X) = E[X²] − E[X]² = (a² + ab + b²)/3 − ((a + b)/2)² = (b − a)²/12
Example(s)
Suppose we think that a Hollywood movie’s overall rating is equally likely to be any decimal value in
the interval [1, 5] (this may not be realistic). You may be able to do these questions “in your head”,
but I encourage you to formalize the questions and solutions to practice the notation and concepts
we’ve learned. (You probably wouldn’t be able to do them “in your head” if the movie rating wasn’t
uniformly distributed!)
1. A movie is considered average if its overall rating is between 1.5 and 4.5. What is the probability that it is average?
2. A movie is considered a huge success if its overall rating is at least 4.5. What is the probability
that it is a huge success?
3. A movie is considered legendary if its overall rating is at least 4.95. Given that a movie is a
huge success, what is the probability it is legendary?
Solution Before starting, we can write that the overall rating of a movie is X ~ Unif(1, 5). Hence, its density function is fX(x) = 1/(5 − 1) = 1/4 for x ∈ [1, 5] (and 0 otherwise).
1. We know the probability of being in the range [1.5, 4.5] is the area under the density function from 1.5 to 4.5, so

P(1.5 ≤ X ≤ 4.5) = ∫_{1.5}^{4.5} fX(x)dx = ∫_{1.5}^{4.5} (1/4)dx = 3/4

You could have also drawn a picture of this density function (which is flat at 1/4), and exploited geometry to figure that the base of the rectangle is 3 and the height is 1/4.
2. Similarly,

P(X ≥ 4.5) = ∫_{4.5}^{∞} fX(x)dx = ∫_{4.5}^{5} (1/4)dx = 1/8

Note that the density function for values x ≥ 5 is zero, so that's why the integral changed its upper bound from ∞ to 5 when replacing the density!
3. We'll use Bayes Theorem:

P(X ≥ 4.95 | X ≥ 4.5) = P(X ≥ 4.5 | X ≥ 4.95) P(X ≥ 4.95)/P(X ≥ 4.5) = 1 · (0.05/4)/(0.5/4) = 1/10

4.2.2 The Exponential RV
The exponential random variable models the waiting time until an event occurs, which can be any nonnegative decimal value (e.g., 2.13 minutes or 9.9324 seconds). This is like the continuous extension of the Geometric (discrete) RV, which is the number of trials until a success occurs.
Recall the Poisson Process with parameter λ > 0 has events happening at an average rate of λ per unit time, forever. The exponential RV measures the time (e.g., 4.33212 seconds, 9.382 hours, etc.) until the first occurrence of an event, so it is a continuous RV with range [0, ∞) (unlike the Poisson RV, which counts the number of occurrences in a unit of time, with range {0, 1, 2, ...}, and is a discrete RV).
Let Y ~ Exp(λ) be the time until the first event. We'll first compute its CDF FY(t) and then differentiate it to find its PDF fY(t).
Let X(t) ~ Poi(λt) be the number of events in the first t units of time, for t ≥ 0 (if the average is λ per unit of time, then it is λt per t units of time). Then, Y > t (wait longer than t units of time until the first event) if and only if X(t) = 0 (no events happened in the first t units of time). This allows us to relate the Exponential CDF to the Poisson PMF.
P(Y > t) = P(no events in the first t units) = P(X(t) = 0) = e^{−λt}(λt)^0/0! = e^{−λt}

Note that we plugged in the Poi(λt) PMF at 0 in the second last equality. Now, the CDF is just the complement of the probability we computed:

FY(t) = P(Y ≤ t) = 1 − P(Y > t) = 1 − e^{−λt}

Remember, since the CDF was the integral of the PDF, the PDF is the derivative of the CDF by the fundamental theorem of calculus:

fY(t) = (d/dt) FY(t) = λe^{−λt}
X ~ Exp(λ) if and only if X has the following pdf (and range ΩX = [0, ∞)):

fX(x) =
    λe^{−λx},  x ≥ 0
    0,         otherwise

X is the waiting time until the first occurrence of an event in a Poisson Process with parameter λ.

E[X] = 1/λ,  Var(X) = 1/λ²

The cdf is

FX(x) =
    1 − e^{−λx},  x ≥ 0
    0,            otherwise
Proof of Expectation and Variance of Exponential. You can use integration by parts if you want to solve these integrals, or you can use WolframAlpha. Again, I'm omitting the steps that are not relevant to your understanding of probability theory (computing integrals):

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^∞ x · λe^{−λx}dx = 1/λ

E[X²] = ∫_{−∞}^{∞} x² fX(x)dx = ∫_0^∞ x² · λe^{−λx}dx = 2/λ²

Var(X) = E[X²] − E[X]² = 2/λ² − (1/λ)² = 1/λ²
If you usually skip examples, please don’t skip the next two. The first example here highlights the relationship
between the Poisson and Exponential RVs, and the second highlights the memoryless property!
Example(s)
Suppose that, on average, 13 car crashes occur each day on Highway 101. What is the probability
that no car crashes occur in the next hour? Be careful of units of time!
Solution We will solve this problem with three equivalent approaches! Take the time to understand why each of them works.
1. On average there are 13/24 car crashes per hour, so the number of crashes in the next hour is X ~ Poi(λ = 13/24).

P(X = 0) = e^{−13/24}(13/24)^0/0! = e^{−13/24}

2. Similar to above, the time (in hours) until the first car crash is Y ~ Exp(λ = 13/24), since on average 13/24 car crashes happen per hour. Then, the probability no car crashes happen in the next hour is

P(Y > 1 (hour)) = 1 − P(Y ≤ 1) = 1 − FY(1) = 1 − (1 − e^{−(13/24)·1}) = e^{−13/24}

3. If we don't want to change the units, then we can say the waiting time until the next car crash (in days) is Z ~ Exp(λ = 13), since on average 13 car crashes happen per day. Then, the probability no car crashes occur in the next hour (1/24 of a day) is the probability that we wait longer than 1/24 of a day:

P(Z > 1/24) = 1 − P(Z ≤ 1/24) = 1 − FZ(1/24) = 1 − (1 − e^{−13·(1/24)}) = e^{−13/24}
Hopefully the first and second solutions show you the relationship between the Poisson and Exponential RVs
(they both come from the Poisson process), and the second and third solution show you how to be careful
with units and that you’ll get the same answer as long as you are consistent.
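A numerical check of this example (assuming Python 3; random.expovariate(λ) draws Exponential samples; the seed and trial count are arbitrary):

import random
from math import exp

lam = 13 / 24      # crashes per hour

print(exp(-lam))   # the closed form shared by all three approaches

# Simulation: sample the waiting time (in hours) until the first crash.
random.seed(312)
trials = 100_000
frac = sum(random.expovariate(lam) > 1 for _ in range(trials)) / trials
print(frac)        # about e^{-13/24} ≈ 0.5817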
Example(s)
Suppose a laptop battery's lifetime (in hours) is exponentially distributed, with an average lifetime of 50 hours.
1. What is the probability the battery lasts more than 60 hours?
2. What is the probability the battery lasts more than 40 hours?
3. Given that the battery has already lasted at least 60 hours, what is the probability it lasts at least 100 hours total (i.e., at least 40 more)?
Solution Since we want to model battery life, we should use an Exponential distribution. Since we know the average battery life is 50 hours, and that the expected value of an exponential RV is 1/λ (see above), we should say that the battery life is X ~ Exp(λ = 1/50 = 0.02).
1. If we want the probability the battery lasts more than 60 hours, then we want

P(X ≥ 60) = ∫_{60}^{∞} fX(t)dt = ∫_{60}^{∞} 0.02e^{−0.02t}dt = e^{−1.2}

But continuous distributions have a CDF, which we can and should take advantage of! We can look up the CDF above as well:

P(X ≥ 60) = 1 − P(X < 60) = 1 − FX(60) = 1 − (1 − e^{−0.02·60}) = e^{−1.2}

We made a step above that said P(X < 60) = FX(60), but FX(60) = P(X ≤ 60). It turns out they are the same for continuous RVs, since the probability X = 60 exactly is zero!
2. Similarly,

P(X ≥ 40) = 1 − P(X < 40) = 1 − FX(40) = 1 − (1 − e^{−0.02·40}) = e^{−0.8}
3. By Bayes Theorem,

P(X ≥ 100 | X ≥ 60) = P(X ≥ 60 | X ≥ 100) P(X ≥ 100) / P(X ≥ 60) = (1 · e^{−2}) / e^{−1.2} = e^{−0.8}
Note that this is exactly the same as P(X ≥ 40) above, the probability the battery lasts at least 40 hours. This says that the previous 60 hours don't matter: P(X ≥ 40 + 60 | X ≥ 60) = P(X ≥ 40). This property is called memorylessness, since the battery essentially forgets that it was alive for 60 hours! We'll discuss this more formally below and prove it.
4.2.3 Memorylessness
Definition 4.2.3: Memorylessness
A random variable X is memoryless if, for all s, t ≥ 0,

P(X > s + t | X > s) = P(X > t)
We just saw a concrete example above, but let’s see another. Let s = 7, t = 2. So P (X > 9 | X > 7) =
P (X > 2).
This memoryless property says that, given we’ve waited (at least) 7 minutes, the probability we wait (at
least) 2 more minutes, is the same as the probability we waited (at least 2) more from the beginning. That
is, the random variable “forgot” how long we’ve already been waiting.
The only memoryless RVs are the Geometric (discrete) and Exponential (Continuous)! This is because
events happen independently over time/trials, and so the past doesn’t matter.
We've seen it algebraically and intuitively, but let's see it pictorially as well. Here is a picture of the probability that X is greater than 1 for an exponential RV: it is the area to the right of 1 under the density function λe^{−λx} for x ≥ 0 (shaded in blue).
Below is a picture of the probability X > 2.5 given X > 1.5 (shaded in orange and blue). If you hide the
area to the left of 1.5, you can see the ratio of the orange area (right of 2.5) to the entire shaded region (right
of 1.5) is the same as P (X > 1) above. So this exponential density function has memorylessness built in!
Algebraically, the key fact for an Exponential RV is its tail probability:

P(X > x) = 1 − FX(x) = 1 − (1 − e^{−λx}) = e^{−λx}
Then, I'll leave it to you to do the same computation as above (using Bayes' Theorem). You'll see it work out almost exactly the same way!
4.2.4 The Gamma Random Variable
If X ∼ Gamma(r, λ), then X is the waiting time until the rth occurrence of an event in a Poisson process with parameter λ, with E[X] = r/λ and Var(X) = r/λ².
Notice that Gamma(1, λ) ≡ Exp(λ). By definition, if X, Y are independent with X ∼ Gamma(r, λ) and Y ∼ Gamma(s, λ), then X + Y ∼ Gamma(r + s, λ).
Proof of Expectation and Variance of Gamma. The PDF of the Gamma looks very ugly and hard to deal with, so let's use our favorite trick: Linearity of Expectation! As mentioned earlier, if X ∼ Gamma(r, λ), then X = Σ_{i=1}^{r} X_i, where each X_i ∼ Exp(λ) is independent with E[X_i] = 1/λ and Var(X_i) = 1/λ². So by LoE,
E[X] = E[Σ_{i=1}^{r} X_i] = Σ_{i=1}^{r} E[X_i] = Σ_{i=1}^{r} 1/λ = r/λ
Now, we can use the fact that the variance of a sum of independent RVs is the sum of the variances (we have yet to prove this fact):
Var(X) = Var(Σ_{i=1}^{r} X_i) = Σ_{i=1}^{r} Var(X_i) = Σ_{i=1}^{r} 1/λ² = r/λ²
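We can sanity-check this sum-of-exponentials view by simulation. A minimal sketch (not from the text), using r = 5 and λ = 2 as arbitrary choices:

```python
import random

random.seed(0)
r, lam, trials = 5, 2.0, 100_000

# A Gamma(r, lam) sample is the sum of r independent Exp(lam) waiting times
samples = [sum(random.expovariate(lam) for _ in range(r)) for _ in range(trials)]

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, r / lam)        # both ~2.5
print(var, r / lam ** 2)    # both ~1.25
```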
4.2.5 Exercises
1. Suppose that on average, 40 babies are born every hour in Seattle.
(a) What is the probability that no babies are born in the next minute? Try solving this in two different but equivalent ways - using a Poisson and Exponential RV.
(b) What is the probability that it takes more than 20 minutes for the first 10 babies to be born? Again, try solving this in two different but equivalent ways - using a Poisson and Gamma RV.
(c) What is the expected time until the 5th baby is born?
Solution:
(a) The number of babies born in the next minute is X ∼ Poi(40/60), so P(X = 0) = e^{−40/60} ≈ 0.5134. Alternatively, the time in minutes until the next baby is born is Y ∼ Exp(40/60), and we want the probability that no babies are born in the next minute; i.e., it takes at least one minute for the first baby to be born. Hence,
P(Y > 1) = 1 − F_Y(1) = 1 − (1 − e^{−2/3·1}) = e^{−2/3}
(b) The number of babies born in the next 20 minutes is W ∼ Poi(20 · 40/60), and the 10th baby takes more than 20 minutes exactly when fewer than 10 babies are born in those 20 minutes, so the answer is P(W ≤ 9) = Σ_{k=0}^{9} e^{−40/3} (40/3)^k / k!. Alternatively, the time in minutes until the tenth baby is born is Z ∼ Gamma(10, 40/60), and we are asking for the probability this is over 20 minutes:
P(Z > 20) = 1 − F_Z(20) = 1 − ∫_0^{20} ((40/60)^{10} / (10 − 1)!) x^{10−1} e^{−(40/60)x} dx
Unfortunately, there isn't a nice closed form for the Gamma CDF, but this would evaluate to the same result!
(c) The time in minutes until the 5th baby is born is V ∼ Gamma(5, 40/60), so E[V] = r/λ = 5/(40/60) = 7.5 minutes.
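Since the Gamma CDF has no closed form, the Poisson view from part (b) is also the easiest way to compute the number. A short sketch (my own) evaluating the equivalent Poisson sum:

```python
import math

lam = 40 / 60        # births per minute
mu = 20 * lam        # expected births in 20 minutes = 40/3

# P(10th baby takes more than 20 minutes) = P(fewer than 10 births in 20 minutes)
p = sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(10))
print(p)             # equals P(Z > 20) for Z ~ Gamma(10, 2/3)
```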
2. You are waiting for a bus to take you home from CSE. You can either take the E-Line, U-Line, or C-Line. The distribution of the waiting time in minutes for each is the following:
• E-Line: E ∼ Exp(λ = 0.1).
• U-Line: U ∼ Unif(0, 20) (continuous).
• C-Line: Has range (1, ∞) and PDF f_C(x) = 1/x².
Assume the three bus arrival times are independent. You take the first bus that arrives.
(a) Find the CDFs of E, U , and C, FE (t), FU (t) and FC (t). Hint: The first two can be looked up in
our distributions handout!
(b) What is the probability you wait more than 5 minutes for a bus?
(c) What is the probability you wait more than 30 minutes for a bus?
Solution:
(a) The CDF of E for t > 0 is F_E(t) = 1 − e^{−0.1t} (see above).
The CDF of U for 0 < t < 20 is F_U(t) = t/20.
The CDF of C for t > 1 is F_C(t) = ∫_1^t f_C(x) dx = 1 − 1/t.
(b) Let B = min{E, U, C} be the time until the first bus. Then, the probability we wait more than 5 minutes is the probability that all of them take longer than 5 minutes to arrive. We can then multiply the individual probabilities due to independence:
P(B > 5) = (1 − F_E(5))(1 − F_U(5))(1 − F_C(5)) = e^{−0.5} · (15/20) · (1/5) = (3/20) e^{−0.5}
(c) The same exact logic applies here! But be careful of the range of U when plugging in the CDF. It is true that
P(B > 30) = P(E > 30) P(U > 30) P(C > 30)
But when plugging in P(U > 30) = 1 − F_U(30), we have to remember that F_U(30) = 1, because U must be in [0, 20]. That's why it is so important to define the piecewise function! This probability is indeed 0, since bus U will always come within 20 minutes.
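A Monte Carlo check of part (b) is a nice exercise: E and U can be sampled directly, and C can be sampled by inverse-transform sampling from its CDF (solving u = 1 − 1/t gives t = 1/(1 − u)). A sketch (my own, not from the text):

```python
import math
import random

random.seed(42)
trials = 200_000
over_5 = 0
for _ in range(trials):
    e = random.expovariate(0.1)          # E-Line ~ Exp(0.1)
    u = random.uniform(0, 20)            # U-Line ~ Unif(0, 20)
    c = 1 / (1 - random.random())        # C-Line via inverse CDF: F_C(t) = 1 - 1/t
    if min(e, u, c) > 5:
        over_5 += 1

print(over_5 / trials)                   # Monte Carlo estimate of P(B > 5)
print((3 / 20) * math.exp(-0.5))         # exact answer, ~0.0910
```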
Chapter 4. Continuous Random Variables
4.3: The Normal/Gaussian Random Variable
Slides (Google Drive) Video (YouTube)
We begin by standardizing random variables, in order to calculate the number of standard deviations above the mean a random variable's value is. (Note how we are using standard deviation instead of variance here, so the units are the same!)
Recall that in general, if X is any random variable (discrete or continuous) with E[X] = µ and Var(X) = σ², and a, b ∈ R, then
E[aX + b] = aE[X] + b = aµ + b
Var(aX + b) = a²Var(X) = a²σ²
In particular, we call (X − µ)/σ a standardized version of X, as it measures how many standard deviations above the mean a point is. We standardize random variables for fair comparison. Applying linearity of expectation and properties of variance to standardized random variables, we get their expectation and variance:
E[(X − µ)/σ] = (1/σ)(E[X] − µ) = 0
Var((X − µ)/σ) = (1/σ²) Var(X − µ) = (1/σ²) Var(X) = σ²/σ² = 1 ⟹ σ_{(X−µ)/σ} = √1 = 1
It turns out the mean is 0 and the standard deviation (and variance) is 1! This makes sense because on average, someone is average (0 standard deviations above the mean), and the standard deviation is 1.
If X ∼ N(µ, σ²), then X is a Normal (Gaussian) random variable with density
f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, for x ∈ R
where E[X] = µ and Var(X) = σ². Unfortunately, there is no closed-form formula for the CDF (there wasn't one for the Gamma RV either). We'll see how to compute these probabilities anyway soon, using a lookup table!
Normal distributions produce bell-shaped curves. Here are some visualizations of the density function for varying µ and σ².
For instance, a normal distribution with µ = 0 and σ = 1 produces the following bell curve:
If the standard deviation increases, it becomes more likely for the variable to be farther away from the mean, so the distribution becomes flatter. For instance, a curve with the same µ = 0 but higher σ = 2 (σ² = 4) looks like this:
If you change the mean, the distribution will shift left or right. For instance, increasing the mean so µ = 4
shifts the distribution 4 to the right. The shape of the curve remains unchanged:
If you change the mean AND standard deviation, the curve's shape changes and shifts. For instance, changing the mean so µ = 4 and the standard deviation so σ = 2 gives us a flatter, shifted curve:
However, scaling and shifting a random variable often does not keep it in the same family. Continuous uniform RVs are the only ones we've learned so far for which it does: if X ∼ Unif(0, 1), then 3X + 2 ∼ Unif(2, 5); we'll learn how to prove this in the next section! However, this is not true for the others; for example, the range of a Poi(λ) is {0, 1, 2, . . .} as it is the number of events in a unit of time, but 2X has range {0, 2, 4, 6, . . .}, so 2X cannot be Poisson (it can never be odd)! We'll see that Normal random variables do have these closure properties.
Theorem (Closure of the Normal under scale and shift): If X ∼ N(µ, σ²) and a, b ∈ R, then
aX + b ∼ N(aµ + b, a²σ²)
We will prove this theorem later in section 5.6 using Moment Generating Functions! This is really amazing - the mean and variance are no surprise. The fact that scaling and shifting a Normal random variable results in another Normal random variable is very interesting!
Theorem (Closure of the Normal under addition): Let X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) be ANY independent Normal random variables, and a, b, c ∈ R. Then,
aX + bY + c ∼ N(aµ_X + bµ_Y + c, a²σ_X² + b²σ_Y²)
Again, this is really amazing. The mean and variance aren’t a surprise again, but the fact that adding two
independent Normals results in another Normal distribution is not trivial, and we will prove this later as
well!
Example(s)
Suppose you believe temperatures in Vancouver, Canada each day are approximately normally distributed with mean 25 degrees Celsius and standard deviation 5 degrees Celsius. However, your American friend only understands Fahrenheit.
1. What is the distribution of temperatures each day in Vancouver in Fahrenheit? To convert Celsius (C) to Fahrenheit (F), the formula is F = (9/5)C + 32.
2. What is the distribution of the average temperature over a week in Vancouver, in Fahrenheit? That is, if you were to sample a random week's average temperature, what is its distribution? Assume the temperature each day is independent of the rest (this may not be a realistic assumption).
Solution
1. The degrees in Celsius are C ∼ N(µ_C = 25, σ_C² = 5²). Since F = (9/5)C + 32, we know by linearity of expectation and properties of variance:
µ_F = E[F] = E[(9/5)C + 32] = (9/5)E[C] + 32 = (9/5)·25 + 32 = 77
σ_F² = Var(F) = Var((9/5)C + 32) = (9/5)² Var(C) = (9/5)² · 5² = 9² = 81
These values are no surprise, but by closure of the Normal distribution, we can say that F ∼ N(µ_F = 77, σ_F² = 81).
2. Let F_1, F_2, . . . , F_7 be independent temperatures over a week, so each F_i ∼ N(µ_F = 77, σ_F² = 81). Let F̄ = (1/7) Σ_{i=1}^{7} F_i denote the average temperature over this week. Then, by linearity of expectation and properties of variance (requiring independence),
E[(1/7) Σ_{i=1}^{7} F_i] = (1/7) Σ_{i=1}^{7} E[F_i] = (1/7) · 7 · 77 = 77
Var((1/7) Σ_{i=1}^{7} F_i) = (1/7²) Σ_{i=1}^{7} Var(F_i) = (1/49) · 7 · 81 = 81/7
Note that the mean is the same, but the variance is smaller. This might make sense because we expect the average temperature over a week should match that of a single day, but it is more stable (has lower variance). By closure properties of the Normal distribution, since we take a sum of independent Normal RVs and then divide it by 7, F̄ = (1/7) Σ_{i=1}^{7} F_i ∼ N(µ = 77, σ² = 81/7).
If Z ∼ N(0, 1) is the standard normal (the normal RV with mean 0 and variance/standard deviation 1), we denote its CDF by Φ(a) = F_Z(a) = P(Z ≤ a), since it is so commonly used. There is no closed-form formula, so this CDF is stored in a table (called a "Phi Table"). Remember, Φ(a) is just the area to the left of a.
Since the normal distribution curve is symmetric, the area to the left of −a is the same as the area to the right of a. The picture below shows that Φ(−a) = 1 − Φ(a).
To get the CDF Φ(1.09) = P(Z ≤ 1.09) from the table, we look at the row with value 1.0 and the column with value 0.09, as marked here:
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
From this, we see that P(Z ≤ 1.39) = Φ(1.39) ≈ 0.91774. (Look at the gray row 1.3, and the column 0.09.) This table usually only has positive numbers, so if you want to look up negative numbers, it's necessary to use the fact that Φ(−a) = 1 − Φ(a). For example, if we want P(Z ≤ −2.13) = Φ(−2.13), we compute 1 − Φ(2.13) = 1 − 0.9834 = 0.0166 (try to find Φ(2.13) yourself above).
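In code, Φ can be computed exactly (no table rounding) from the error function, via the identity Φ(a) = (1 + erf(a/√2))/2. A small sketch (my own) reproducing the lookups above:

```python
from math import erf, sqrt

def Phi(a: float) -> float:
    """Standard normal CDF: Phi(a) = (1 + erf(a / sqrt(2))) / 2."""
    return (1 + erf(a / sqrt(2))) / 2

print(Phi(1.39))      # ~0.91774, matching the table entry
print(Phi(-2.13))     # ~0.0166
print(1 - Phi(2.13))  # same value, via the symmetry Phi(-a) = 1 - Phi(a)
```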
How does this help though when X is Normal but not the standard normal? In general, for X ∼ N(µ, σ²), we standardize: Z = (X − µ)/σ ∼ N(0, 1) by the closure properties, and so
P(X ≤ a) = P((X − µ)/σ ≤ (a − µ)/σ) = Φ((a − µ)/σ)
See some examples below of how we can use the table to calculate probabilities associated with the Normal
distributions! Again, the table gives the CDF of the standard Normal since it doesn’t have a closed form
like Uniform/Exponential. Also, any Normal RV can be standardized so we can look up probabilities in the
table!
Example(s)
Suppose the age of a random adult in the United States is (approximately) normally distributed with
mean 50 and standard deviation 15.
1. What is the probability that a randomly selected adult in the US is over 70 years old?
2. What is the probability that a randomly selected adult in the US is under 25 years old?
3. What is the probability that a randomly selected adult in the US is between 40 and 45 years
old?
Solution
1. The age of a random adult is X ∼ N(µ = 50, σ² = 15²), so remember we standardize to use the standard Gaussian:
P(X > 70) = P((X − 50)/15 > (70 − 50)/15) [standardize]
= P(Z > 1.33) [Z = (X − µ)/σ ∼ N(0, 1)]
= 1 − P(Z ≤ 1.33) [complement]
= 1 − Φ(1.33) [def of Φ]
= 1 − 0.9082 [look up table from earlier]
= 0.0918
2. We do a similar calculation:
P(X < 25) = P((X − 50)/15 < (25 − 50)/15) [standardize]
= P(Z < −5/3) [Z = (X − µ)/σ ∼ N(0, 1)]
= Φ(−5/3) = 1 − Φ(5/3) ≈ 1 − 0.95254 ≈ 0.0475 [symmetry; table, rounding 5/3 to 1.67]
3. We do a similar calculation:
P(40 < X < 45) = P((40 − 50)/15 < (X − 50)/15 < (45 − 50)/15) [standardize]
= P(−2/3 < Z < −1/3) [Z = (X − µ)/σ ∼ N(0, 1)]
= Φ(−1/3) − Φ(−2/3) = Φ(2/3) − Φ(1/3) ≈ 0.74857 − 0.62930 ≈ 0.1193 [symmetry; table, rounding to 0.67 and 0.33]
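The same three answers can be computed exactly via the erf-based Φ from earlier (small differences from the table come from rounding z to two decimal places). A sketch (my own):

```python
from math import erf, sqrt

mu, sigma = 50, 15

def Phi(a):
    return (1 + erf(a / sqrt(2))) / 2

print(1 - Phi((70 - mu) / sigma))                       # 1. P(X > 70)  ~ 0.0912
print(Phi((25 - mu) / sigma))                           # 2. P(X < 25)  ~ 0.0478
print(Phi((45 - mu) / sigma) - Phi((40 - mu) / sigma))  # 3. P(40 < X < 45) ~ 0.117
```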
4.3.5 Exercises
1. Suppose the time (in hours) it takes for you to finish pset i is approximately X_i ∼ N(µ = 10, σ² = 9) (for i = 1, . . . , 5), and the time (in hours) it takes for you to finish a project is approximately Y ∼ N(µ = 20, σ² = 10), with all of these times independent. Let W = X_1 + X_2 + X_3 + X_4 + X_5 + Y be the time it takes to complete all 5 psets and the project.
(a) What are the mean and variance of W ?
(b) What is the distribution of W and what are its parameter(s)?
(c) What is the probability that you complete all the homework in under 60 hours?
Solution:
(a) The mean by linearity of expectation is E [W ] = E [X1 ] + · · · + E [X5 ] + E [Y ] = 50 + 20 = 70.
Variance adds for independent RVs, so Var (W ) = Var (X1 )+· · ·+Var (X5 )+Var (Y ) = 45+10 = 55.
(b) Since W is the sum of independent Normal random variables, W is also Normal with the parameters we calculated above. So W ∼ N(µ = 70, σ² = 55).
(c)
P(W < 60) = P((W − 70)/√55 < (60 − 70)/√55) ≈ P(Z < −1.35) = Φ(−1.35) = 1 − Φ(1.35) = 1 − 0.9115 = 0.0885
Chapter 4. Continuous Random Variables
4.4: Transforming Continuous RVs
Slides (Google Drive) Video (YouTube)
Suppose the amount of gold a company can mine is X tons per year, and you have some (continuous)
distribution to model this. However, your earning is not simply X - it is actually a function of the amount
of product, some Y = g(X). What is the distribution of Y ?
Since we know the distribution of X, this will help us model the distribution of Y by transforming random
variables.
For discrete random variables, transforming is straightforward. For example, if X takes values in {−1, 0, 1} and Y = X², then p_Y(1) = p_X(−1) + p_X(1). This is because Y = 1 if and only if X ∈ {−1, 1}, so to find P(Y = 1), we sum the probabilities of all values x such that x² = 1. That's all this formula below says (the ":" means "such that"):
p_Y(y) = Σ_{x ∈ Ω_X : g(x) = y} p_X(x)
But for continuous random variables, we have density functions instead of mass functions. That means f_X is not actually a probability, and so we can't use this same technique. We want to work with the CDF F_X(x) = P(X ≤ x) instead, because it actually does represent a probability! It's best to see this idea through an example.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
The CDF of X is derived by taking the integral of the PDF, giving us (can also cite this):
F_X(x) =
  0,    if x < 0
  x/9,  if 0 ≤ x ≤ 9
  1,    if x > 9
Now, we determine the range of Y. The smallest value that Y can take is √0 = 0, and the largest value that Y can take is √9 = 3, from the range of X. Since the square root function is monotone increasing, this gives us
Ω_Y = [0, 3]
But can we assume that, because X has a uniform distribution, Y does too?
This is not the case! Notice that values of X in the range [0, 1] will map to Y values in the range [0, 1]. But,
X values in the range [1, 4] map to Y values in the range [1, 2] and X values in the range [4, 9] map to Y
values in the range [2, 3].
So, there is a much larger range of values of X that map to [2, 3] than to [0, 1] (since [4, 9] is a larger range
than [0, 1]). Therefore, Y ’s distribution shouldn’t be uniform. So, we cannot define the PDF of Y using the
assumption that Y is uniform.
Instead, we will first compute the CDF F_Y and then differentiate it to get the PDF f_Y, for y ∈ [0, 3]:
F_Y(y) = P(Y ≤ y) = P(√X ≤ y) = P(X ≤ y²) = F_X(y²) = y²/9
Be very careful when squaring both sides of an inequality - it may not keep the inequality true. In this case we didn't have to worry, since X and Y were both guaranteed nonnegative. Differentiating,
f_Y(y) = d/dy F_Y(y) = 2y/9
Here is an image of the original and transformed PDFs! Remember that X ∼ Unif(0, 9) and Y = √X.
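A quick empirical check that Y = √X is not uniform, and that its CDF really is y²/9 (a sketch of my own, not from the text):

```python
import random

random.seed(0)
n = 100_000
ys = [random.uniform(0, 9) ** 0.5 for _ in range(n)]   # Y = sqrt(X), X ~ Unif(0, 9)

# The empirical CDF of Y should match F_Y(y) = y^2 / 9 at any test point
for y in [0.5, 1.0, 2.0, 2.9]:
    print(y, sum(v <= y for v in ys) / n, y ** 2 / 9)
```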
This is the general strategy for transforming continuous RVs! We’ll summarize the steps below.
Example(s)
Suppose X has range Ω_X = [−1, +1] and CDF F_X(x) = (2 + 3x − x³)/4 (you can verify this is a valid CDF). What is the PDF of Y = X⁴?
Solution
1. The CDF of X is already given, so we don't need to derive it.
2. The range of Y = X⁴ is Ω_Y = {x⁴ : x ∈ [−1, +1]} = [0, 1], since x⁴ is always nonnegative and between 0 and 1 for x ∈ [−1, +1].
3. Be careful in the third equation below to include both lower and upper bounds (draw the function y = x⁴ to see why). For y ∈ Ω_Y = [0, 1], we will compute the CDF:
F_Y(y) = P(Y ≤ y) [def of CDF]
= P(X⁴ ≤ y) [def of Y]
= P(−y^{1/4} ≤ X ≤ y^{1/4}) [don't forget the negative side]
= P(X ≤ y^{1/4}) − P(X ≤ −y^{1/4})
= F_X(y^{1/4}) − F_X(−y^{1/4}) [def of CDF of X]
= (1/4)(2 + 3y^{1/4} − (y^{1/4})³) − (1/4)(2 + 3(−y^{1/4}) − (−y^{1/4})³) [plug in CDF]
4. The last step is to differentiate the CDF to get the PDF, which is just computational, so I'll skip it!
If g is strictly monotone and invertible with inverse X = g^{−1}(Y) = h(Y), then
f_Y(y) = f_X(h(y)) · |h′(y)|
That is, the PDF of Y at y is the PDF of X evaluated at h(y) (the value of x that maps to y), multiplied by the absolute value of the derivative of h(y).
Note that the formula method is not as general as the previous method (using CDF), since g must satisfy
monotonicity and invertibility. So transforming via CDF always works, but transforming may not work with
this explicit formula all the time.
Proof of Formula to get PDF of Y = g(X) from X.
Suppose Y = g(X) and g is strictly monotone and invertible with inverse X = g^{−1}(Y) = h(Y). We'll assume g is strictly monotone increasing and leave it to you to prove it for the case when g is strictly monotone decreasing (it's very similar). For y ∈ Ω_Y, since h is also increasing and applying it to both sides preserves the inequality,
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ h(y)) = F_X(h(y))
Differentiating with the chain rule gives
f_Y(y) = d/dy F_X(h(y)) = f_X(h(y)) h′(y)
A similar proof would hold if g were monotone decreasing, except in the third line we would flip the sign of the inequality and make the h′(y) become an absolute value: |h′(y)|.
Now let’s try the same example as we did earlier, but using this new method instead.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
Our goal is to use the formula f_Y(y) = f_X(h(y)) · |h′(y)|, after verifying some conditions on g.
Let g(t) = √t. This is strictly monotone increasing on Ω_X = [0, 9]. This means that as t increases, √t also increases - therefore, g(t) is an increasing function.
What is the inverse of this function g? The inverse of the square root function is just the squaring function:
h(y) = g^{−1}(y) = y²
h′(y) = 2y
Hence, for y ∈ [0, 3], since f_X(x) = 1/9 on [0, 9],
f_Y(y) = f_X(y²) · |2y| = (1/9) · 2y = 2y/9
Note that we dropped the absolute value because we already assume y ∈ [0, 3], and hence 2y is always nonnegative. This gives the same formula as earlier, as it should!
Let X = (X_1, ..., X_n), Y = (Y_1, ..., Y_n) be continuous random vectors (each component is a continuous RV) with the same dimension n (so Ω_X, Ω_Y ⊆ R^n), and Y = g(X), where g : Ω_X → Ω_Y is invertible and differentiable, with differentiable inverse X = g^{−1}(y) = h(y). Then,
f_Y(y) = f_X(h(y)) |det(∂h(y)/∂y)|
where ∂h(y)/∂y ∈ R^{n×n} is the Jacobian matrix of partial derivatives of h, with
(∂h(y)/∂y)_{ij} = ∂(h(y))_i / ∂y_j
Hopefully this formula looks very similar to the one for the single-dimensional case! This formula is just for
your information and you’ll never have to use it in this class.
4.4.4 Exercises
1. Suppose X has range Ω_X = (1, ∞) and density function
f_X(x) =
  2/x³,  if x > 1
  0,     otherwise
Let Y = (e^X − 1)/2.
(a) Compute the density function of Y via the CDF transformation method.
(b) Compute the density function of Y using the formula, but explicitly verify the monotonicity and
invertibility conditions.
Solution:
(a) The range of Y is Ω_Y = ((e − 1)/2, ∞). For y ∈ Ω_Y, note that Y ≤ y exactly when X ≤ ln(2y + 1), and F_X(x) = ∫_1^x 2/t³ dt = 1 − 1/x² for x > 1, so
F_Y(y) = P(Y ≤ y) = P(X ≤ ln(2y + 1)) = F_X(ln(2y + 1)) = 1 − 1/[ln(2y + 1)]²
Hence,
f_Y(y) = d/dy F_Y(y) = (2/[ln(2y + 1)]³) · (1/(2y + 1)) · 2 = 4 / ((2y + 1)[ln(2y + 1)]³)
Chapter 5. Multiple Random Variables
5.1: Joint Discrete Distributions
Slides (Google Drive) Video (YouTube)
This chapter, especially Sections 5.1-5.6, is arguably the most difficult in this entire text. It might take more time to fully absorb, but you'll get it, so don't give up!
We are finally going to talk about what happens when we want the probability distribution of more than one random variable. This will be called the joint distribution of two or more random variables. In this section, we'll focus on joint discrete distributions, and in the next, joint continuous distributions. We'll also finally prove that the variance of the sum of independent RVs is the sum of the variances, an important fact that we've been using without proof! But first, we need to review what a Cartesian product of sets is.
Definition: The Cartesian product of sets A and B is
A × B = {(a, b) : a ∈ A, b ∈ B}
Further, if A, B are finite sets, then |A × B| = |A| · |B| by the product rule of counting.
Example(s)
Write each of the following in a notation that does not involve a Cartesian product:
1. {1, 2, 3} × {4, 5}
2. R² = R × R
Solution
1. Here, we have:
{1, 2, 3} × {4, 5} = {(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5)}
We have each of the elements of the first set paired with each of the elements of the second set. Note that |{1, 2, 3}| = 3, |{4, 5}| = 2, and |{1, 2, 3} × {4, 5}| = 6.
2. This is the xy-plane (2D space), which is denoted:
R² = R × R = {(x, y) : x ∈ R, y ∈ R}
Suppose we roll two fair 4-sided dice independently, one blue and one red. Let X be the value of the blue die and Y be the value of the red die. Note:
Ω_X = {1, 2, 3, 4}
Ω_Y = {1, 2, 3, 4}
Then we can also consider Ω_{X,Y}, the joint range of X and Y. The joint range happens to be any combination of {1, 2, 3, 4} for both rolls. This can be written as:
Ω_{X,Y} = Ω_X × Ω_Y
Further, each of these will be equally likely (as shown in the table below):
X \ Y 1 2 3 4
1 1/16 1/16 1/16 1/16
2 1/16 1/16 1/16 1/16
3 1/16 1/16 1/16 1/16
4 1/16 1/16 1/16 1/16
Above is a suitable way to write the joint probability mass function of X and Y, as it enumerates the probability of every pair of values. If we wanted to write it as a formula, p_{X,Y}(x, y) = P(X = x, Y = y) for x, y ∈ Ω_{X,Y}, we have:
p_{X,Y}(x, y) =
  1/16,  if (x, y) ∈ Ω_{X,Y}
  0,     otherwise
Note that either this piecewise function or the table above are valid ways to express the joint PMF.
p_{X,Y}(a, b) = P(X = a, Y = b)
The joint range is the set of pairs (c, d) that have nonzero probability:
Ω_{X,Y} = {(c, d) ∈ Ω_X × Ω_Y : p_{X,Y}(c, d) > 0}
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} g(x, y) p_{X,Y}(x, y)
A lot of things are just the same as what we learned in Chapter 3, but extended! Note that the joint range Ω_{X,Y} above is always a subset of Ω_X × Ω_Y, and they're not necessarily equal. Let's see an example of this.
Back to our example of the blue and red die rolls. Again, let X be the value of the blue die and Y be the value of the red die. Now, let U = min{X, Y} (the smaller of the two die rolls) and V = max{X, Y} (the larger of the two die rolls). Then:
Ω_U = {1, 2, 3, 4}
Ω_V = {1, 2, 3, 4}
because both random variables can take on any of the four values that appear on the dice (e.g., it is possible for the minimum to be 4 if we roll (4, 4), and for the maximum to be 1 if we roll (1, 1)).
However, there is the constraint that the minimum value U is always at most the maximum value V. That is, the joint range would not include the pair (u, v) = (4, 1), for example, since the probability that the minimum is 4 and the maximum is 1 is zero. We can write this formally as the subset of the Cartesian product subject to u ≤ v:
Ω_{U,V} = {(u, v) ∈ Ω_U × Ω_V : u ≤ v} ≠ Ω_U × Ω_V
This will just be all the ordered pairs of the values that can appear as U and V. Now, however, these are not equally likely, as shown in the table below. Notice that any pair (u, v) with u > v has zero probability, as promised. We'll explain how we got the other numbers under the table.
U \ V 1 2 3 4
1 1/16 2/16 2/16 2/16
2 0 1/16 2/16 2/16
3 0 0 1/16 2/16
4 0 0 0 1/16
As discussed earlier, we can't have the case where U > V, so these are all 0. The case U = V occurs when the blue and red die have the same value, each of which occurs with probability 1/16 as shown earlier. For example, p_{U,V}(2, 2) = P(U = 2, V = 2) = 1/16, since only one of the 16 equally likely outcomes, (2, 2), gives this result. The others, in which U < V, each occur with probability 2/16, because it could be the red die with the max and the blue die with the min, or the reverse. For example, p_{U,V}(1, 3) = P(U = 1, V = 3) = 2/16, because two of the 16 outcomes, (1, 3) and (3, 1), would result in the min being 1 and the max being 3.
So for the joint PMF as a formula, p_{U,V}(u, v) = P(U = u, V = v) for u, v ∈ Ω_{U,V}, we have:
p_{U,V}(u, v) =
  2/16,  if (u, v) ∈ Ω_U × Ω_V and v > u
  1/16,  if (u, v) ∈ Ω_U × Ω_V and v = u
  0,     otherwise
Again, the piecewise function and the table are both valid ways to express the joint PMF, and you may choose whichever is easier for you. When the joint range is larger, it might be infeasible to use a table though!
Now, what is P(U = 1)? You might think the answer is 7/16, but how did you get that? Well, P(U = 1) would be the sum of the first row, since that is all the cases where U = 1. You computed
P(U = 1) = P(U = 1, V = 1) + P(U = 1, V = 2) + P(U = 1, V = 3) + P(U = 1, V = 4) = 1/16 + 2/16 + 2/16 + 2/16 = 7/16
Mathematically, we have
P(U = u) = Σ_{v ∈ Ω_V} P(U = u, V = v)
Does this look like anything we learned before? It's just the law of total probability (intersection version) that we derived in 2.2, as the events {V = v}_{v ∈ Ω_V} partition the sample space (V takes on exactly one value)! We can refer to the table above and sum each row (each row corresponds to a value of u) to find the probability of that value of u occurring. That gives us the following:
p_U(u) =
  7/16,  if u = 1
  5/16,  if u = 2
  3/16,  if u = 3
  1/16,  if u = 4
For example,
P(U = 4) = P(U = 4, V = 1) + P(U = 4, V = 2) + P(U = 4, V = 3) + P(U = 4, V = 4) = 0 + 0 + 0 + 1/16 = 1/16
This brings us to the definition of marginal PMFs. The idea of these is: given a joint probability distribution,
what is the distribution of just one of them (or a subset)? We get this by marginalizing (summing) out the
other variables.
(Extension) If Z is also a discrete random variable, then the marginal PMF of Z is:
p_Z(z) = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} p_{X,Y,Z}(x, y, z)
This follows from the law of total probability, and is just like taking the sum of a row in the example above.
Now if asked for E[U], for example, we actually don't need the joint PMF anymore. We've extracted the pertinent information in the form of p_U(u), and compute E[U] = Σ_u u p_U(u) normally.
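The whole pipeline - enumerate outcomes, build the joint PMF, marginalize, compute an expectation - is only a few lines of code for this example. A sketch (my own, not from the text), using exact fractions:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Enumerate the 16 equally likely (blue, red) outcomes of two fair 4-sided dice
joint = defaultdict(Fraction)
for x, y in product(range(1, 5), repeat=2):
    joint[(min(x, y), max(x, y))] += Fraction(1, 16)

print(joint[(1, 3)], joint[(2, 2)], joint[(4, 1)])   # 1/8, 1/16, 0

# Marginalize out V to recover p_U, then compute E[U]
p_U = defaultdict(Fraction)
for (u, v), p in joint.items():
    p_U[u] += p
print(dict(p_U))                            # {1: 7/16, 2: 5/16, 3: 3/16, 4: 1/16}
print(sum(u * p for u, p in p_U.items()))   # E[U] = 15/8
```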
5.1.4 Independence
We'll now redefine independence of RVs in terms of the joint PMF. This is completely the same as the definition we gave earlier, just with the new notation we learned.
Definition: Discrete RVs X and Y are independent, written X ⊥ Y, if for all x ∈ Ω_X and all y ∈ Ω_Y:
p_{X,Y}(x, y) = p_X(x) p_Y(y)
Recall the joint range Ω_{X,Y} = {(x, y) : p_{X,Y}(x, y) > 0} ⊆ Ω_X × Ω_Y is always a subset of the Cartesian product of the individual ranges. A necessary but not sufficient condition for independence is that Ω_{X,Y} = Ω_X × Ω_Y. That is, if Ω_{X,Y} ≠ Ω_X × Ω_Y, then X and Y cannot be independent; but if Ω_{X,Y} = Ω_X × Ω_Y, then we have to check the condition above.
This is because if there is some (a, b) ∈ Ω_X × Ω_Y but not in Ω_{X,Y}, then p_{X,Y}(a, b) = 0 but p_X(a) > 0 and p_Y(b) > 0, violating independence. For example, suppose the joint PMF looks like:
X \Y 8 9 Row Total pX (x)
3 1/3 1/2 5/6
7 1/6 0 1/6
Col Total pY (y) 1/2 1/2 1
Also, side note: the marginal distributions are named what they are since we often write the row and column totals in the margins. Here the joint range Ω_{X,Y} ≠ Ω_X × Ω_Y, since one of the entries is 0: (7, 9) ∉ Ω_{X,Y} but (7, 9) ∈ Ω_X × Ω_Y. This immediately tells us they cannot be independent - p_X(7) > 0 and p_Y(9) > 0, yet p_{X,Y}(7, 9) = 0.
Example(s)
Suppose the joint PMF of X and Y is given by the table below (with the row and column totals not yet filled in):
X \ Y 6 9
0 3/12 5/12
2 1/12 2/12
3 0 1/12
1. Find the marginal PMFs p_X and p_Y.
2. Find E[Y].
3. Are X and Y independent?
Solution
1. Actually, these can be found by filling in the row and column totals, since
p_X(x) = Σ_y p_{X,Y}(x, y),   p_Y(y) = Σ_x p_{X,Y}(x, y)
For example, P(X = 0) = p_X(0) = Σ_y p_{X,Y}(0, y) = p_{X,Y}(0, 6) + p_{X,Y}(0, 9) = 3/12 + 5/12 = 8/12 is the sum of the first row.
X \Y 6 9 Row Total pX (x)
0 3/12 5/12 8/12
2 1/12 2/12 3/12
3 0 1/12 1/12
Col Total pY (y) 4/12 8/12 1
Hence,
p_X(x) =
  8/12,  if x = 0
  3/12,  if x = 2
  1/12,  if x = 3
p_Y(y) =
  4/12,  if y = 6
  8/12,  if y = 9
2. We can actually compute E[Y] just using p_Y, now that we've eliminated/marginalized out X - we don't need the joint PMF anymore. We go back to the definition:
E[Y] = Σ_y y p_Y(y) = 6 · (4/12) + 9 · (8/12) = 8
3. X, Y are independent if, for every table entry (x, y), we have p_{X,Y}(x, y) = p_X(x) p_Y(y). However, notice p_{X,Y}(3, 6) = 0 but p_X(3) > 0 and p_Y(6) > 0. Hence we found an entry where this condition isn't true, so they cannot be independent. This is like the comment mentioned earlier: if Ω_{X,Y} ≠ Ω_X × Ω_Y, they have no chance of being independent.
Lemma: If X ⊥ Y, then E[XY] = E[X] E[Y]. The quantity E[XY] just sums over all the entries in the table (x, y) and takes a weighted average of all values xy, weighted by p_{X,Y}(x, y) = P(X = x, Y = y).
Note this property relies on the fact that they are independent, whereas linearity of expectation always holds, regardless.
Proof of Lemma.
E[XY] = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} xy p_{X,Y}(x, y) [LOTUS]
= Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} xy p_X(x) p_Y(y) [X ⊥ Y, so p_{X,Y}(x, y) = p_X(x) p_Y(y)]
= (Σ_{x ∈ Ω_X} x p_X(x)) (Σ_{y ∈ Ω_Y} y p_Y(y))
= E[X] E[Y]
Proof of Variance Adds for Independent RVs. Now we have the following:
Var(X + Y) = E[(X + Y)²] − (E[X + Y])² [def of variance]
= E[X² + 2XY + Y²] − (E[X] + E[Y])² [linearity of expectation]
= E[X²] + 2E[XY] + E[Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])² [linearity of expectation]
= (E[X²] − (E[X])²) + (E[Y²] − (E[Y])²) + 2(E[XY] − E[X]E[Y]) [rearranging]
= Var(X) + Var(Y) + 2(E[X]E[Y] − E[X]E[Y]) [lemma, since X ⊥ Y]
= Var(X) + Var(Y) + 0 [def of variance]
And here is the proof that linearity of expectation holds even without independence:
E[X + Y] = Σ_x Σ_y (x + y) p_{X,Y}(x, y) [LOTUS]
= Σ_x Σ_y x p_{X,Y}(x, y) + Σ_x Σ_y y p_{X,Y}(x, y) [split sum]
= Σ_x x Σ_y p_{X,Y}(x, y) + Σ_y y Σ_x p_{X,Y}(x, y) [algebra]
= Σ_x x p_X(x) + Σ_y y p_Y(y) [def of marginal PMF]
= E[X] + E[Y]
5.1.7 Exercises
1. Suppose we flip a fair coin three times independently. Let X be the number of heads in the first two
flips, and Y be the number of heads in the last two flips (there is overlap).
(a) What distribution do X and Y have marginally, and what are their ranges?
(b) What is pX,Y (x, y)? Fill in this table below. You may want to fill in the marginal distributions
first!
(c) What is ⌦X,Y , using your answer to (b)?
(d) Write a formula for E [cos(XY )].
(e) Are X, Y independent?
Solution:
(a) Since X counts the number of heads in two independent flips of a fair coin, then X ⇠ Bin(n =
2, p = 0.5). Y also has this distribution! Their ranges are ⌦X = ⌦Y = {0, 1, 2}.
(b) First, fill in the marginal distributions, which should be 1/4, 1/2, 1/4 for the probability that
X = 0, X = 1, and X = 2 respectively (same for Y ).
First let’s start with pX,Y (2, 2) = P (X = 2, Y = 2). If X = 2, that means the first two flips
must’ve been heads. If Y = 2, that means the last two flips must’ve been heads. So the probabil-
ity that X = 2, Y = 2 is the probability of the single outcome HHH, which is 1/8. Apply similar
logic for pX,Y (0, 0) = P (X = 0, Y = 0) which is the probability of TTT.
Then, pX,Y (0, 2) = P (X = 0, Y = 2). If X = 0 then the first two flips are tails. If Y = 2, the last
two flips are heads. This is impossible, so P (X = 0, Y = 2) = 0. Similarly, P (X = 2, Y = 0) = 0
as well. Now use the constraints (the row totals and col totals) to fill in the rest! For example, the first row must sum to 1/4, and we have two of its three entries, p_{X,Y}(0, 0) and p_{X,Y}(0, 2), so p_{X,Y}(0, 1) = 1/4 − 1/8 − 0 = 1/8.
X \Y 0 1 2 Row Total pX (x)
0 1/8 1/8 0 1/4
1 1/8 1/4 1/8 1/2
2 0 1/8 1/8 1/4
Col Total pY (y) 1/4 1/2 1/4 1
(c) From the previous part, we can see that the joint range is everything in the Cartesian product except (0, 2) and (2, 0), so Ω_{X,Y} = (Ω_X × Ω_Y) \ {(0, 2), (2, 0)}.
(d) By LOTUS extended to multiple variables,
E[cos(XY)] = Σ_x Σ_y cos(xy) p_{X,Y}(x, y)
(e) No, the joint range is not equal to the Cartesian product. This immediately makes independence impossible. The intuitive reason is that, since (0, 2) ∉ Ω_{X,Y} for example, if we know X = 0, then Y cannot be 2. Formally, there exists a pair (x, y) ∈ Ω_X × Ω_Y (namely (x, y) = (0, 2)) such that p_{X,Y}(0, 2) = 0 but p_X(0) > 0 and p_Y(2) > 0. Hence, p_{X,Y}(0, 2) ≠ p_X(0) p_Y(2), which violates independence.
2. Suppose radioactive particles at Area 51 are emitted at an average rate of λ per second. You want to
measure how many particles are emitted, but your geiger-counter (device that measures radioactivity)
fails to record each particle independently with some small probability p. Let X be the number of
particles emitted, and Y be the number of particles observed (by your geiger-counter).
(a) Describe the joint range ⌦X,Y using set notation.
(b) Write a formula (not a table) for pX,Y (x, y).
The marginal PMF of Y, found by marginalizing out X, is then:
p_Y(y) = Σ_{x ∈ Ω_X} p_{X,Y}(x, y) = Σ_{x=y}^{∞} e^{−λ} (λ^x / x!) · C(x, y) (1 − p)^y p^{x−y}
3. (Multivariate Hypergeometric.) Suppose a bag contains N marbles, K_i of each color i = 1, . . . , r (so Σ_{i=1}^{r} K_i = N). We draw n marbles without replacement; let X_i be the number of marbles of color i drawn. Write a formula for the joint PMF of (X_1, . . . , X_r).
Solution:
p_{X_1,...,X_r}(k_1, . . . , k_r) = [C(K_1, k_1) · · · C(K_r, k_r)] / C(N, n) = [∏_{i=1}^{r} C(K_i, k_i)] / C(N, n)
Chapter 5. Multiple Random Variables
5.2: Joint Continuous Distributions
Slides (Google Drive) Video (YouTube)
The joint PDF of continuous random variables X and Y is a function f_{X,Y} satisfying
f_{X,Y}(a, b) ≥ 0
The joint range is the set of pairs (c, d) that have nonzero density:
Ω_{X,Y} = {(c, d) ∈ Ω_X × Ω_Y : f_{X,Y}(c, d) > 0}
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(s, t) f_{X,Y}(s, t) ds dt
The joint PDF must satisfy the following (similar to univariate PDFs):
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
Example(s)
Let X and Y be two jointly continuous random variables with the following joint PDF:
f_{X,Y}(x, y) =
  x + cy²,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,        otherwise
(a) Find the joint range Ω_{X,Y}.
(b) Find the value of c that makes this a valid joint PDF.
(c) Find P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2).
Solution
(a)
Ω_{X,Y} = {(x, y) ∈ R² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}
(b) For f_{X,Y} to be a valid joint PDF, it must integrate to 1:
1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy
= ∫_0^1 ∫_0^1 (x + cy²) dx dy
= ∫_0^1 [x²/2 + cy²x]_{x=0}^{x=1} dy
= ∫_0^1 (1/2 + cy²) dy
= [y/2 + cy³/3]_{y=0}^{y=1}
= 1/2 + c/3
Thus, c = 3/2.
(c)
P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2) = ∫_0^{1/2} ∫_0^{1/2} (x + (3/2)y²) dx dy
= ∫_0^{1/2} [x²/2 + (3/2)y²x]_{x=0}^{x=1/2} dy
= ∫_0^{1/2} (1/8 + (3/4)y²) dy
= 3/32
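Double integrals like these are easy to sanity-check numerically with a midpoint-rule Riemann sum. A sketch (my own, not from the text) verifying both the normalization with c = 3/2 and the probability just computed:

```python
# Midpoint-rule check of the normalization and of P(0<=X<=1/2, 0<=Y<=1/2)
n = 1000
h = 1 / n

def f(x, y):
    return x + 1.5 * y * y        # the joint PDF with c = 3/2

total = sum(f((i + 0.5) * h, (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h
corner = sum(f((i + 0.5) * h, (j + 0.5) * h)
             for i in range(n // 2) for j in range(n // 2)) * h * h
print(total)    # ~1.0  (valid density)
print(corner)   # ~0.09375 = 3/32
```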
Example(s)
Let X and Y be two jointly continuous random variables with the following PDF:
f_{X,Y}(x, y) =
  x + y,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,      otherwise
Find E[XY²].
Solution By LOTUS,
E[XY²] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy² f_{X,Y}(x, y) dx dy
= ∫_0^1 ∫_0^1 xy²(x + y) dx dy
= ∫_0^1 ((1/3)y² + (1/2)y³) dy
= 17/72
Suppose that X and Y are jointly distributed continuous random variables with joint PDF f_{X,Y}(x, y). The marginal PDFs of X and Y are respectively given by the following:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
Note this is exactly like for joint discrete random variables, with integrals instead of sums.
(Extension): If Z is also a continuous random variable, then the marginal PDF of Z is:
f_Z(z) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y,Z}(x, y, z) dx dy
Example(s)
Find the marginal PDFs f_X(x) and f_Y(y) given the joint PDF:
f_{X,Y}(x, y) =
  x + (3/2)y²,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,            otherwise
Then, compute E[X]. (This is the same joint density as the first example, plugging in c = 3/2.)
Solution For 0 ≤ x ≤ 1:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
= ∫_0^1 (x + (3/2)y²) dy
= [xy + y³/2]_{y=0}^{y=1}
= x + 1/2
For 0 ≤ y ≤ 1:
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
= ∫_0^1 (x + (3/2)y²) dx
= [x²/2 + (3/2)y²x]_{x=0}^{x=1}
= (3/2)y² + 1/2
Note that to compute E[X], for example, we can either use LOTUS or just the marginal PDF f_X(x). These methods are equivalent. By LOTUS (taking g(X, Y) = X),
E[X] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^1 x (x + (3/2)y²) dx dy
Alternatively, by definition of expectation for a single RV,
E[X] = ∫_{−∞}^{∞} x f_X(x) dx = ∫_0^1 x (x + 1/2) dx
It only takes two lines or so of algebra to show they are equal!
Definition: Continuous RVs X and Y are independent if f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
Recall Ω_{X,Y} = {(x, y) : f_{X,Y}(x, y) > 0} ⊆ Ω_X × Ω_Y. A necessary but not sufficient condition for independence is that Ω_{X,Y} = Ω_X × Ω_Y. That is, if Ω_{X,Y} = Ω_X × Ω_Y, then we have to check the condition above, but if not, then we know they are not independent.
This is because if there is some (a, b) ∈ Ω_X × Ω_Y but not in Ω_{X,Y}, then f_{X,Y}(a, b) = 0 but f_X(a) > 0 and f_Y(b) > 0, which violates independence. (This is very similar to independence for discrete RVs.)
Example(s)
Let's return to our dart example. Suppose (X, Y) is jointly and uniformly distributed on the circle of radius R centered at the origin (e.g., a dart throw).
1. First find and sketch the joint range Ω_{X,Y}.
2. Now, write an expression for the joint PDF f_{X,Y}(x, y) and carefully define it for all x, y ∈ R.
3. Now, solve for the range of X and write an expression we can evaluate to find fX (x), the
marginal PDF for X.
4. Now, let Z be the distance from the center that the dart falls. Find ⌦Z and write an expression
for E [Z].
5. Finally, determine using the definition of independence whether X and Y are independent.
Solution
1. The joint range is Ω_{X,Y} = {(x, y) ∈ R² : x² + y² ≤ R²}, since the values must be within the circle of radius R. We can sketch the range as follows, with the semi-circles below and above the x-axis labeled with their respective equations.
2. The height of the density function is constant, say h, since it is uniform. The double integral over all x and y must equal one (∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1), meaning the volume of this cylinder must be 1. The volume is base times height, which is πR² · h, and setting it equal to 1 gives h = 1/(πR²). This gives:
f_{X,Y}(x, y) =
  1/(πR²),  if x² + y² ≤ R²
  0,        otherwise
3. Well, X can range from −R to R, since there are points on the circle with x values in this range. So the range of X is:
Ω_X = [−R, R]
Setting up this integral will be trickier than in the earlier examples, because when finding f_X(x) and integrating out the y, the limits of integration actually depend on x. Imagine making a tick mark at some x ∈ [−R, R] (on the x-axis) and drawing a vertical line through x: where does y enter and leave (like summing a column in a joint PMF)? Based on the equations we had earlier for y in terms of x (see the sketch above), this gives us:
f_X(x) = ∫_{−√(R²−x²)}^{√(R²−x²)} f_{X,Y}(x, y) dy
Again, this is different from the previous examples, and you MUST sketch/plot the joint range to figure this out. If you learned how to do double integrals, this is exactly the same idea.
4. Well, the distance will be given by Z = √(X² + Y²), which is the definition of distance. We can further see that Z will take on any value from 0 to R, since the point could be at the origin or as far away as R. This gives Ω_Z = [0, R].
Then, to solve for the expected value of Z, we can use LOTUS, and only integrate over the joint range of X and Y (since the joint PDF is 0 elsewhere). We have to be careful in setting up the bounds of our integral. X will range from −R to R as we discussed earlier. But as X ranges across these values, Y will range from −√(R²−x²) to √(R²−x²). We had Z = √(X² + Y²), so for the expected value we have:
E[Z] = E[√(X² + Y²)] = ∫_{−R}^{R} ∫_{−√(R²−x²)}^{√(R²−x²)} √(x² + y²) f_{X,Y}(x, y) dy dx
Note that we could've set up this integral dx dy instead - what would the limits of integration have been? It would've been
E[Z] = E[√(X² + Y²)] = ∫_{−R}^{R} ∫_{−√(R²−y²)}^{√(R²−y²)} √(x² + y²) f_{X,Y}(x, y) dx dy
Your outer limits must be just the range of Y (both constants), and your inner limits may depend on the outer variable of integration.
5. No, they are not independent. We can see this with the test: Ω_{X,Y} ≠ Ω_X × Ω_Y. This is because X and Y both have marginal range from −R to R, but the joint range is not the rectangle [−R, R] × [−R, R] (it is a circle). More explicitly, take a point (0.99R, 0.99R), which is basically the top right corner of the square. We get 0 = f_{X,Y}(0.99R, 0.99R) ≠ f_X(0.99R) f_Y(0.99R) > 0. This is because the joint PDF is defined to be 0 at (0.99R, 0.99R) (not in the circle), but the marginal PDFs of both X and Y are nonzero at 0.99R (since 0.99R is in the marginal range of both).
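Rather than evaluating that double integral by hand, we can estimate E[Z] by rejection sampling: draw uniform points in the bounding square and keep those inside the circle. A sketch (my own, not from the text); the double integral above works out to 2R/3, which the simulation should match:

```python
import random

random.seed(7)
R, n = 1.0, 200_000
total, kept = 0.0, 0

# Rejection sampling: uniform in the square, keep only points inside the circle
while kept < n:
    x, y = random.uniform(-R, R), random.uniform(-R, R)
    if x * x + y * y <= R * R:
        total += (x * x + y * y) ** 0.5    # distance from the center
        kept += 1

print(total / n)    # ~2R/3 ~ 0.667
```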
Example(s)
Now let's consider another example where we have a continuous joint distribution (X, Y), where X ∈ [0, 1] is the proportion of the time until the midterm that you actually spend studying for it, and Y ∈ [0, 1] is your percentage score on the exam.
Suppose the joint PDF is:
f_{X,Y}(x, y) =
  c e^{−(y−x)},  if x, y ∈ [0, 1] and y ≥ x
  0,             otherwise
1. First, consider the joint range and sketch it. Then, interpret it in English in the context of the
problem.
2. Now, write an expression for c in the PDF above.
3. Now, find Ω_Y and write an expression that we could evaluate to find f_Y(y).
4. Now, write an expression that we could evaluate to find P (Y 0.9).
5. Now, write an expression that we can evaluate to find E [Y ], the expected score on the exam.
6. Finally, consider whether X and Y are independent.
Solution
1. X can take any value in [0, 1] without conditions. Then Y is only bounded in that it must be greater than or equal to X. We can first draw the line y = x; then the region above this line for which x, y are at most 1 will be our range. That gives us the following:
In English, this means that your score is at least the percentage of time that you studied, as your score
will be that proportion or more.
2. To solve for c, we should find the volume above this triangle on the x-y plane and invert it, since ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1. To find the volume we can integrate in terms of x or y first, which gives us the following two equivalent expressions:
c = 1 / (∫_0^1 ∫_x^1 e^{−(y−x)} dy dx) = 1 / (∫_0^1 ∫_0^y e^{−(y−x)} dx dy)
We’ll explain the first equality using the dydx ordering. Since dx is the outer integral, the limits must
be just the range of X, which is [0, 1]. For each value of x (draw a vertical line through x on the
x-axis), y goes between x and 1, so those are the inner limits of integration.
Now, for the second equality using dxdy ordering, the outer integral is dy, so the limits are the range
of Y , also [0, 1]. Then, for each value of y (draw a horizontal line through y on the y-axis), x goes
between 0 and y, and so those are the inner limits of integration.
3. Well, Ω_Y = [0, 1], as we can see in our graph above that Y takes on values in this range. For the marginal PDF we have to integrate with respect to x, which will take on values in the range 0 to y based on our graph. So, we have:
f_Y(y) = ∫_0^y c e^{−(y−x)} dx
4. We can integrate from 0.9 to 1 to solve for this, using the marginal PDF that we solved for above. This takes us back to the univariate case essentially, and gives us the following:
P(Y ≥ 0.9) = ∫_{0.9}^1 f_Y(y) dy = ∫_{0.9}^1 ∫_0^y c e^{−(y−x)} dx dy
5. Similarly, using the marginal PDF of Y:
E[Y] = ∫_0^1 y f_Y(y) dy = ∫_0^1 ∫_0^y c y e^{−(y−x)} dx dy
6. Ω_{X,Y} ≠ Ω_X × Ω_Y, since the sketch of the range is not a rectangle: the joint range is not equal to the Cartesian product of the marginal ranges. To be concrete, consider the point (x = 0.99, y = 0.01) (basically the corner (1, 0)). I chose this point because it is in the Cartesian product Ω_X × Ω_Y = [0, 1] × [0, 1], but not in the joint range (see the picture from the first part). Since it's not in the joint range (shaded region), we have f_{X,Y}(0.99, 0.01) = 0, but since 0.99 ∈ Ω_X and 0.01 ∈ Ω_Y, f_X(0.99) > 0 and f_Y(0.01) > 0. Hence, I've found a pair of points (x, y) where the joint density isn't equal to the product of the marginal densities, violating independence.
Chapter 5. Multiple Random Variables
5.3: Conditional Distributions
Slides (Google Drive) Video (YouTube)
Now that we’ve finished talking about joint distributions (whew), we can move on to conditional distribu-
tions and conditional expectation. This is actually just applying the concepts from 2.2 about conditional
probability, generalizing to random variables (instead of events)!
Definition: If X and Y are discrete random variables, the conditional PMF of X given Y is:
p_{X|Y}(a | b) = P(X = a | Y = b) = p_{X,Y}(a, b) / p_Y(b)
Note that this should remind you of Bayes' Theorem (because that's what it is)!
If X, Y are continuous random variables, then the conditional PDF of X given Y is:
f_{X|Y}(a | b) = f_{X,Y}(a, b) / f_Y(b)
Again, this is just a generalization from discrete to continuous, as we've been doing!
It's important to note that, for each fixed value of b, the probabilities that X = a must sum to 1:
Σ_{a ∈ Ω_X} p_{X|Y}(a | b) = 1
If X and Y are mixed (one discrete, one continuous), then a similar extension can be made, where any discrete random variable has a p (a probability mass function) and any continuous random variable has an f (a probability density function).
Example(s)
Back to our example of the blue and red die rolls from 5.1. Suppose we roll a fair blue 4-sided die and a fair red 4-sided die independently. Recall that U = min{X, Y} (the smaller of the two die rolls) and V = max{X, Y} (the larger of the two die rolls), whose joint PMF we derived in 5.1. What is the conditional PMF of U given V = 3?
Solution By definition,
p_{U|V}(u | 3) = p_{U,V}(u, 3) / p_V(3)
We need to compute the denominator, which is the marginal PMF of V (the sum of the third column):
p_V(3) = Σ_{a ∈ Ω_U} p_{U,V}(a, 3) = 2/16 + 2/16 + 1/16 + 0 = 5/16
Hence, p_{U|V}(1 | 3) = (2/16)/(5/16) = 2/5, p_{U|V}(2 | 3) = 2/5, p_{U|V}(3 | 3) = 1/5, and p_{U|V}(4 | 3) = 0. (Notice these sum to 1, as they must!)
Remember that the expectation of a discrete RV X is E[X] = Σ_{x ∈ Ω_X} x P(X = x). So it's only fair that the conditional expectation of X, given knowledge that some other RV Y is equal to y, is the same exact thing, EXCEPT the probabilities should be conditioned on Y = y now:
E[X | Y = y] = Σ_{x ∈ Ω_X} x P(X = x | Y = y) = Σ_{x ∈ Ω_X} x p_{X|Y}(x | y)
Most notably, we are still summing over x and NOT y, since this expression should depend on y, right? Given that Y = y, what is the expectation of X?
If X is continuous (and Y is either discrete or continuous), then we define the conditional expectation of g(X) given (the event that) Y = y as:
E[g(X) | Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x | y) dx
Notice that these sums and integrals are over x (not y), since E [g(X) | Y = y] is a function of y.
These formulas are exactly the same as E [g(X)], except the PMF/PDF of X is replaced with the
conditional PMF/PDF of X | Y = y.
Example(s)
Suppose X ∼ Unif(0, 1) (continuous). We repeatedly draw independent Y_1, Y_2, Y_3, · · · ∼ Unif(0, 1) (continuous) until the first random time T such that Y_T < X. What is E[T]?
The question is basically asking the following: we get some uniformly random decimal number X from [0, 1]. We keep drawing uniform random numbers until we get a value less than our initial value. What is the expected number of draws until this happens?
Solution We'll do this problem in a "bad" way (the only way we know how, for now), and then learn the Law of Total Expectation next to see how this solution could be much simpler!
To find E [T ], since T is discrete with range ⌦T = {1, 2, 3, . . . }, we can find its PMF pT (t) = P (T = t) for
any value t and use the usual formula for expectation. However, T depends on the value of the initial number
X right? If X = 0.1 it would take longer to get a number less than this than if X = 0.99. Let’s try to find
the probability T = t given that X = x first:
P (T = t | X = x) = (1 x)t 1
x
because the probability we get a number smaller than x is just x (Uniform CDF), and so we need to get
t 1 failures first before our first success. Actually, (T |X = x) ⇠ Geo(x) so that’s another way we could’ve
computed this conditional PMF. Then, let’s use the LTP to find P (T = t) (we need to integrate over all
values of t because T is continuous, not discrete):
Z 1 Z 1
1
P (T = t) = P (T = t | X = x) fX (x)dx = (1 x)t 1
x · 1dx = · · · =
0 0 t(t + 1)
after skipping some purely computational steps. Finally, since we have the PMF of T, we can compute the expectation in the normal way:
E[T] = Σ_{t=1}^{∞} t p_T(t) = Σ_{t=1}^{∞} t · 1/(t(t + 1)) = Σ_{t=1}^{∞} 1/(t + 1) = ∞
The reason this is ∞ is because this is like the harmonic series 1 + 1/2 + 1/3 + 1/4 + . . ., which is known to diverge to ∞. This is surprising, right? The expected time until you get a number smaller than your first is infinite!
Law of Total Expectation (LTE): if Y is discrete, then E[g(X)] = Σ_{y ∈ Ω_Y} E[g(X) | Y = y] p_Y(y); if Y is continuous, then E[g(X)] = ∫_{−∞}^{∞} E[g(X) | Y = y] f_Y(y) dy.
This looks exactly like the law of total probability we are used to. Basically, to solve for E[g(X)], we need to take a weighted average of E[g(X) | Y = y] over all possible values of y.
Example(s)
(This is the same example as earlier): Suppose X ⇠ Unif(0, 1) (continuous). We repeatedly draw
independent Y1 , Y2 , Y3 , · · · ⇠ Unif(0, 1) (continuous) until the first random time T such that YT < X.
What is E [T ]?
Solution Using the LTE now, we can solve this in a much simpler fashion. We know that (T | X = x) ∼ Geo(x) as stated earlier. By citing the expectation of a Geometric RV, we know that E[T | X = x] = 1/x. By the LTE, conditioning on X:
E[T] = ∫_0^1 E[T | X = x] f_X(x) dx = ∫_0^1 (1/x) · 1 dx = [ln(x)]_0^1 = ∞
This was a much faster way of getting to the answer than before!
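An infinite expectation has a distinctive empirical signature: the sample mean never settles down as you draw more samples. A small simulation sketch (my own, not from the text):

```python
import random

random.seed(3)

def draw_T():
    x = random.random()              # X ~ Unif(0, 1)
    t = 1
    while random.random() >= x:      # keep drawing Y_t until Y_t < x
        t += 1
    return t

for n in [1_000, 10_000, 100_000]:
    print(n, sum(draw_T() for _ in range(n)) / n)
# The sample mean keeps drifting upward rather than converging: E[T] is infinite.
```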
Example(s)
Let's finally prove that if X ∼ Geo(p), then µ = E[X] = 1/p. Recall that the Geometric random variable is the number of independent Bernoulli trials with parameter p up to and including the first success.
Solution First, let's condition on whether our first flip was heads (H) or tails (T) (these events partition the sample space):
E[X] = E[X | H] P(H) + E[X | T] P(T)
What are those four values on the right though? We know P(H) = p and P(T) = 1 − p, so that's out of the way.
way.
What is E [X | H]? If we got heads on the first try, then E [X | H] = 1 since we are immediately done
(i.e., the number of trials it took to get our first heads, given we got heads on the first trial, is 1).
What is E [X | T ]? This is a bit trickier: because the trials are independent, and we got a tail on the
first try, we basically have to restart (memorylessness), and so our conditional expectation is just E [1 + X],
since we are back to square one except with one additional trial!
Plugging these four values in gives a recursive formula (E[X] appears on both sides):
E[X] = 1 · p + (1 + E[X]) · (1 − p)
Writing µ = E[X] and solving:
µ = p + (1 + µ)(1 − p)
µ = p + 1 − p + µ − µp
µ = 1 + µ − µp
0 = 1 − µp
µp = 1
µ = 1/p
This is a really “cute” proof of the expectation of a Geometric RV! See the notes in 3.5 to see the “ugly”
calculus proof.
5.3.4 Exercises
1. What happens to linearity of expectation when you sum a random number of random variables? We know it holds for fixed values of n, but let's see what happens if we sum a random number N of them. It turns out, you get something very nice!
Let X_1, X_2, X_3, . . . be a sequence of independent and identically distributed (iid) RVs, with common mean E[X_1] = E[X_2] = . . . . Let N be a random variable which has range Ω_N ⊆ {0, 1, 2, . . . } (nonnegative integers), independent of all the X_i's. Show that E[Σ_{i=1}^{N} X_i] = E[X_1] E[N]. That is, the expected sum of a random number of random variables is the expected number of random variables times the expected value of each (which you might think is intuitively true, but we have to prove it!).
Solution: We have the following:
E[Σ_{i=1}^{N} X_i] = Σ_{n ∈ Ω_N} E[Σ_{i=1}^{N} X_i | N = n] p_N(n) [Law of Total Expectation]
= Σ_{n ∈ Ω_N} E[Σ_{i=1}^{n} X_i | N = n] p_N(n) [given N = n: substitute in the upper limit]
= Σ_{n ∈ Ω_N} E[Σ_{i=1}^{n} X_i] p_N(n) [N independent of the X_i's]
= Σ_{n ∈ Ω_N} n E[X_1] p_N(n) [Linearity of Expectation]
= E[X_1] Σ_{n ∈ Ω_N} n p_N(n)
= E[X_1] E[N] [def of E[N]]
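This identity is easy to see in simulation. A sketch (my own, not from the text); the particular distributions of N and the X_i below are arbitrary choices just for illustration:

```python
import random

random.seed(5)
trials = 100_000
total = 0.0
for _ in range(trials):
    n = random.randint(0, 10)        # N uniform on {0, ..., 10}, so E[N] = 5
    total += sum(random.expovariate(0.5) for _ in range(n))  # X_i ~ Exp(0.5), E[X_i] = 2

print(total / trials)                # ~ E[X_1] * E[N] = 2 * 5 = 10
```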
Chapter 5. Multiple Random Variables
5.4: Covariance and Correlation
Slides (Google Drive) Video (YouTube)
In this section, we'll learn about covariance, which, as you might guess, is related to variance. It is a function of two random variables that tells us whether they have a positive or negative linear relationship. It also helps us finally compute the variance of a sum of dependent random variables, which we have not yet been able to do.
6. If X and Y are independent, then Cov(X, Y) = 0, and so Var(X + Y) = Var(X) + Var(Y) (as we discussed earlier).
7. Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{m} Y_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(X_i, Y_j). That is, covariance works like FOIL (first, outer, inner, last) for multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
Proof of Covariance Alternate Formula. We will prove that Cov(X, Y) = E[XY] − E[X] E[Y]:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] [def of covariance]
= E[XY − X E[Y] − E[X] Y + E[X] E[Y]] [expand]
= E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y] [linearity; E[X], E[Y] are constants]
= E[XY] − E[X] E[Y]
We actually proved in 5.1 already that E[XY] = E[X] E[Y] when X, Y are independent. Hence, if X and Y are independent, Cov(X, Y) = E[XY] − E[X] E[Y] = 0.
Example(s)
Suppose X and Y are independent random variables, each with mean 0 and variance 1. Let
Z = 1 + X + XY²
W = 1 + X
Find Cov(Z, W).
Solution First note that E[X²] = Var(X) + (E[X])² = 1 + 0² = 1 (rearrange the variance formula and solve for E[X²]). Similarly, E[Y²] = 1. Then, since constants don't affect covariance and covariance FOILs,
Cov(Z, W) = Cov(1 + X + XY², 1 + X) = Cov(X, X) + Cov(XY², X)
Now Cov(X, X) = Var(X) = 1, and by the alternate formula and the independence of X and Y,
Cov(XY², X) = E[X²Y²] − E[XY²] E[X] = E[X²] E[Y²] − E[X] E[Y²] · E[X] = 1 · 1 − 0 = 1
so Cov(Z, W) = 1 + 1 = 2.
Covariance has a "problem" in measuring linear relationships: Cov(X, Y) will be positive when there is a positive linear relationship and negative when there is a negative linear relationship, but Cov(2X, Y) = 2Cov(X, Y). Scaling one of the random variables should not affect the strength of their relationship, which covariance seems to do. It would be great if we defined some metric that was normalized (had a maximum and minimum) and was invariant to scale. This metric is called correlation!
ρ(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y))
We can prove by the Cauchy-Schwarz inequality (from linear algebra) that −1 ≤ ρ(X, Y) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably, ρ(X, Y) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
In linear regression ("line-fitting") from high school science class, you may have calculated some R², with 0 ≤ R² ≤ 1; this is actually ρ², and it measures how well a linear relationship exists between X and Y. R² is the percentage of variance in Y which can be explained by X.
Let's take a look at some example graphs which show a sample of data and their (Pearson) correlations, to get some intuition.
The 1st (purple) plot has a perfect negative linear relationship, and so the correlation is −1.
The 2nd (green) plot has a positive relationship, but it is not perfect, so the correlation is around +0.9.
The 3rd (orange) plot is a perfectly linear positive relationship, so the correlation is +1.
The 4th (red) plot appears to have data that is independent, so the correlation is 0.
The 5th (blue) plot has a negative trend that isn't strongly linear, so the correlation is around −0.6.
Example(s)
Suppose X and Y are random variables, where Y = −5X + 2. Show that, since there is a perfect negative linear relationship, ρ(X, Y) = −1.
Solution To find the correlation, we need the covariance and the two individual variances. Let's write them in terms of Var(X):
Cov(X, Y) = Cov(X, −5X + 2) = −5 Cov(X, X) = −5 Var(X)
Var(Y) = Var(−5X + 2) = (−5)² Var(X) = 25 Var(X)
Finally,
ρ(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y)) = −5 Var(X) / (√Var(X) · 5 √Var(X)) = −5 Var(X) / (5 Var(X)) = −1
Note that the −5 and 2 did not matter at all (except that −5 was negative, which made the correlation negative)!
Proof of Variance of Sums of RVs. We'll first do something unintuitive - making our expression more complicated. The variance of the sum X_1 + X_2 + · · · + X_n is its covariance with itself! We'll use i to index one of the sums Σ_{i=1}^{n} X_i and j for the other Σ_{j=1}^{n} X_j. Keep in mind these both represent the same quantity; you'll see why we used different dummy variables soon!
Var(Σ_{i=1}^{n} X_i) = Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{n} X_j) [covariance with self = variance]
= Σ_{i=1}^{n} Σ_{j=1}^{n} Cov(X_i, X_j) [by FOIL]
= Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j) [by symmetry (see image below)]
The final step comes from the definition of the covariance of a variable with itself and the symmetry of covariance. It is illustrated below, where the red diagonal is the covariance of a variable with itself (which is its variance), and the green off-diagonal entries are the symmetric pairs of covariances. We used the fact that Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ), so we only need to sum the lower triangle (where i < j), and multiply by 2 to account for the upper triangle.
It is important to remember that if all the RVs were independent, all the Cov(Xᵢ, Xⱼ) terms (for i ≠ j) would be zero, and so we would just be left with the sum of the variances as we showed earlier!
Example(s)
Recall in the hat check problem in 3.3, we had n people who go to a party and leave their hats with a hat check person. At the end of the party, the hats are returned randomly. We let X be the number of people who get their original hat back. We solved for E[X] with indicator random variables X₁, ..., Xₙ for whether the i-th person got their hat back.
We showed that:
$$E[X_i] = P(X_i = 1) = P(i\text{th person gets their hat back}) = \frac{1}{n}$$
So,
$$E[X] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{n} = n \cdot \frac{1}{n} = 1$$
Now find Var(X).
Solution Recall that each $X_i \sim \text{Ber}\left(\frac{1}{n}\right)$ (1 with probability $\frac{1}{n}$, and 0 otherwise). (Remember these were NOT independent RVs, but we still could apply linearity of expectation.) In our previous proof, we showed that
$$\text{Var}(X) = \text{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$$
Recall that Xᵢ, Xⱼ are indicator random variables which are in {0, 1}, so their product XᵢXⱼ ∈ {0, 1} as well. This allows us to calculate:
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = \frac{1}{n} \cdot \frac{1}{n-1}$$
This is because we need both person i and person j to get their hat back: person i gets theirs back with probability $\frac{1}{n}$, and given this is true, person j gets theirs back with probability $\frac{1}{n-1}$.
So, by definition of covariance (recall each $E[X_i] = \frac{1}{n}$):
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}$$
Finally, we have
$$\text{Var}(X) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[formula for variance of sum]}$$
$$= \sum_{i=1}^n \frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\sum_{i<j} \frac{1}{n^2(n-1)} \qquad \text{[plug in]}$$
$$= n \cdot \frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\binom{n}{2}\frac{1}{n^2(n-1)} \qquad \left[\text{there are } \binom{n}{2} \text{ pairs with } i < j\right]$$
$$= \left(1 - \frac{1}{n}\right) + 2 \cdot \frac{n(n-1)}{2} \cdot \frac{1}{n^2(n-1)}$$
$$= \left(1 - \frac{1}{n}\right) + \frac{1}{n} = 1$$
How many pairs are there with i < j? This is just $\binom{n}{2} = \frac{n(n-1)}{2}$, since we just choose two different elements. Another way to see this is that there was an n × n square, and we removed the diagonal of n elements, so we are left with n² - n = n(n - 1). Divide by two to get just the lower half.
This is very surprising and interesting! When returning n hats randomly and uniformly, the expected number of people who get their hat back is 1, and so is the variance! These don't even depend on n at all! It takes practice to get used to these formulas, so let's do one more problem.
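If you'd like to see this surprising result empirically, here is a small Python simulation sketch (my addition, with n = 10 and the trial count chosen arbitrarily): hats are returned according to a uniformly random permutation, and both the sample mean and sample variance of the number of matches should be close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Each permutation is one random return of hats; a "match" at position i
# means person i got their own hat back.
matches = np.array([(rng.permutation(n) == np.arange(n)).sum()
                    for _ in range(trials)])
print(matches.mean(), matches.var())  # both approximately 1
```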
Example(s)
Suppose we throw 12 balls independently and uniformly into 7 bins. What are the mean and variance of the number of empty bins after this process? (Hint: Indicators.)
Solution Let Xᵢ be the indicator that bin i is empty, so that the number of empty bins is $X = \sum_{i=1}^7 X_i$. The probability that a particular bin is empty is $(6/7)^{12}$, since we need to avoid this bin (with probability 6/7) 12 times independently. That is,
$$X_i \sim \text{Ber}\left(p = \left(\frac{6}{7}\right)^{12}\right)$$
Hence, E[Xᵢ] = p ≈ 0.1573 and Var(Xᵢ) = p(1 - p) ≈ 0.1325. These random variables are surely dependent, since knowing one bin is empty means the 12 balls had to go to the other 6 bins, making it less likely that another bin is empty.
However, dependence doesn't bother us for computing the expectation; by linearity of expectation, we get
$$E[X] = E\left[\sum_{i=1}^7 X_i\right] = \sum_{i=1}^7 E[X_i] = \sum_{i=1}^7 \left(\frac{6}{7}\right)^{12} = 7\left(\frac{6}{7}\right)^{12} \approx 1.1009$$
Now for the variance, we need to find Cov(Xᵢ, Xⱼ) = E[XᵢXⱼ] - E[Xᵢ]E[Xⱼ] for i ≠ j. Well, XᵢXⱼ ∈ {0, 1} since both Xᵢ, Xⱼ ∈ {0, 1}, so XᵢXⱼ is indicator/Bernoulli as well, with
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = P(\text{both bin } i \text{ and } j \text{ are empty}) = \left(\frac{5}{7}\right)^{12}$$
since all the balls must go into the other 5 bins during each of the 12 independent throws. Finally,
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \left(\frac{5}{7}\right)^{12} - \left(\frac{6}{7}\right)^{12}\left(\frac{6}{7}\right)^{12} \approx -0.0071$$
Recall that Var(Xᵢ) = p(1 - p) ≈ 0.1325, and so putting this all together gives:
$$\text{Var}(X) = \sum_{i=1}^7 \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[formula for variance of sum]}$$
$$\approx \sum_{i=1}^7 0.1325 + 2\sum_{i<j} (-0.0071) \qquad \text{[plug in approximate decimal values]}$$
$$= 7 \cdot 0.1325 + 2\binom{7}{2}(-0.0071)$$
$$\approx 0.62954$$
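Again, a simulation sketch (my addition, not from the text) can corroborate these numbers; the trial count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
balls, bins, trials = 12, 7, 100_000

throws = rng.integers(0, bins, size=(trials, balls))  # bin of each ball
# Empty bins = total bins minus the number of distinct bins that were hit.
empty = np.array([bins - len(np.unique(row)) for row in throws])
print(empty.mean(), empty.var())  # approximately 1.1009 and 0.6295
```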
Recall the hypergeometric RV X ∼ HypGeo(N, K, n), which was the number of lollipops we get when we draw n candies from a bag of N total candies (K ≤ N of which are lollipops). We stated without proof that
$$\text{Var}(X) = n \cdot \frac{K(N-K)(N-n)}{N^2(N-1)}$$
You have the tools now to prove this if you like, using indicators and covariances, but we'll prove this later in 5.8 as well!
Chapter 5. Multiple Random Variables
5.5: Convolution
Slides (Google Drive) Video (YouTube)
In section 4.4, we explained how to transform random variables (finding the density function of g(X)). In this section, we'll talk about how to find the distribution of the sum of two independent random variables, X + Y, using a technique called convolution. It will allow us to prove some statements we made earlier without proof (like sums of independent Binomials being Binomial, and sums of independent Poissons being Poisson), and also derive the density function of the Gamma distribution, which we just stated.
This should just remind you of the LTP we learned in section 2.2, or the definition of marginal PMF/PDFs from earlier in the chapter! We'll use this LTP to help us derive the formulae for convolution.
5.5.2 Convolution
Convolution is a mathematical operation that allows us to derive the distribution of a sum of two independent random variables. For example, suppose the amount of gold a company can mine is X tons per year in country A, and the amount of gold the company can mine is Y tons per year in country B, independently. You have some distribution to model each. What is the distribution of the total amount of gold you mine, Z = X + Y? Combining this with 4.4, if you know your profit is some function of the total amount of gold, say $g(Z) = \sqrt{X + Y}$, you can now find the density function of your profit!
Example(s)
Let X, Y ∼ Unif(1, 4) be independent rolls of a fair 4-sided die. What is the PMF of Z = X + Y?
Solution We know that for the range of Z we have the following, since it is the sum of two values each in the range {1, 2, 3, 4}:
$$\Omega_Z = \{2, 3, 4, 5, 6, 7, 8\}$$
Should the probabilities be uniform? That is, would you be equally likely to roll a 2 as a 5? No, because there is only one way to get a 2 (rolling (1, 1)), but many ways to get a 5.
If I wanted to compute the probability that Z = 3 for example, I could just sum over all possible values of X in $\Omega_X = \{1, 2, 3, 4\}$ to get:
$$P(Z = 3) = P(X = 1, Y = 2) + P(X = 2, Y = 1) + P(X = 3, Y = 0) + P(X = 4, Y = -1)$$
$$= P(X = 1)P(Y = 2) + P(X = 2)P(Y = 1) + P(X = 3)P(Y = 0) + P(X = 4)P(Y = -1)$$
$$= \frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 0 = \frac{2}{16}$$
where the first line is all ways to get a 3, and the second line uses independence. Note that it is not possible that Y = 0 or Y = -1, but we write this for completeness. More generally, to find p_Z(z) = P(Z = z) for any value of z, we just write
value of z, we just write
pZ (z) = P (Z = z)
X
= P (X = x, Y = z x)
x2⌦X
X
= P (X = x) P (Y = z x)
x2⌦X
X
= pX (x)pY (z x)
x2⌦X
The intuition is that if we want Z = z, we sum over all possibilities of X = x but require that Y = z x so
that we get the desired sum of z. It is very possible that pY (z x) = 0 as we saw above.
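The discrete convolution formula is just a double loop over the two ranges, so it translates directly into code. Below is a minimal Python sketch (my addition; convolve_pmfs is a hypothetical helper name), applied to the two 4-sided dice above.

```python
def convolve_pmfs(p_x: dict, p_y: dict) -> dict:
    """PMF of Z = X + Y for independent X, Y, each given as {value: prob}."""
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0.0) + px * py
    return p_z

die = {k: 1 / 4 for k in [1, 2, 3, 4]}
print(convolve_pmfs(die, die))  # e.g., P(Z = 3) = 2/16, P(Z = 5) = 4/16
```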
It turns out that the formula at the bottom was extremely general, and works for any sum of two independent discrete RVs. Now let's consider the continuous case. What if X and Y are continuous RVs and we define Z = X + Y; how can we solve for the probability density function of Z, f_Z(z)? It turns out the formula is extremely similar, just replacing p with f! If X and Y are independent and Z = X + Y, then:
$$\text{(discrete)} \quad p_Z(z) = \sum_{x \in \Omega_X} p_X(x)p_Y(z - x) \qquad \text{(continuous)} \quad f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx$$
Note: You can swap the roles of X and Y. Note the similarity between the cases!
Proof of Convolution.
• Discrete case: Even though we proved this earlier, we'll do it again a different way (using the LTP/def of marginal):
$$p_Z(z) = P(Z = z)$$
$$= \sum_{x \in \Omega_X} P(X = x, Z = z) \qquad \text{[LTP/marginal]}$$
$$= \sum_{x \in \Omega_X} P(X = x, Y = z - x) \qquad [(X = x, Z = z) \text{ equivalent to } (X = x, Y = z - x)]$$
$$= \sum_{x \in \Omega_X} P(X = x)P(Y = z - x) \qquad [X \text{ and } Y \text{ are independent}]$$
$$= \sum_{x \in \Omega_X} p_X(x)p_Y(z - x)$$
• Continuous case: Since we should never work with densities as probabilities, let's start with the CDF and differentiate:
$$F_Z(z) = P(Z \le z)$$
$$= P(X + Y \le z) \qquad \text{[def of } Z]$$
$$= \int_{x \in \Omega_X} P(X + Y \le z \mid X = x)f_X(x)\,dx \qquad \text{[LTP, conditioning on } X]$$
$$= \int_{x \in \Omega_X} P(x + Y \le z \mid X = x)f_X(x)\,dx \qquad \text{[given } X = x]$$
$$= \int_{x \in \Omega_X} P(Y \le z - x \mid X = x)f_X(x)\,dx \qquad \text{[algebra]}$$
$$= \int_{x \in \Omega_X} P(Y \le z - x)f_X(x)\,dx \qquad [X \text{ and } Y \text{ are independent}]$$
$$= \int_{x \in \Omega_X} F_Y(z - x)f_X(x)\,dx \qquad \text{[def of CDF of } Y]$$
Now we can take the derivative (with respect to z) of the CDF to get the density (F_Y becomes f_Y):
$$f_Z(z) = \frac{d}{dz}F_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx$$
Example(s)
Suppose X and Y are two independent random variables such that X ∼ Poi(λ₁) and Y ∼ Poi(λ₂), and let Z = X + Y. Prove that Z ∼ Poi(λ₁ + λ₂).
Solution The ranges of X and Y are $\Omega_X = \Omega_Y = \{0, 1, 2, \dots\}$, and so $\Omega_Z = \{0, 1, 2, \dots\}$ as well. For n ∈ Ω_Z, the convolution formula says:
$$p_Z(n) = \sum_{k \in \Omega_X} p_X(k)p_Y(n - k) = \sum_{k=0}^{\infty} p_X(k)p_Y(n - k)$$
However, if you blindly plug in the PMFs p_X and p_Y, you will get the wrong answer, and here's why. We only want to sum things that are non-zero (otherwise what's the point?), and if we want p_X(k)p_Y(n - k) > 0, we need BOTH to be nonzero. That means k must be in the range of X, AND n - k must be in the range of Y. Remember the dice example (we had p_Y(-1) at some point, which would be 0 and not 1/4). We are guaranteed p_X(k) > 0 because we are only summing over valid k ∈ Ω_X, but we must have n - k be a nonnegative integer (in the range Ω_Y = {0, 1, 2, ...}), so actually, we must have k ≤ n. Now, we can just plug and chug:
$$p_Z(n) = \sum_{k=0}^{n} p_X(k)p_Y(n - k) \qquad \text{[convolution formula]}$$
$$= \sum_{k=0}^{n} e^{-\lambda_1}\frac{\lambda_1^k}{k!} \cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} \qquad \text{[plug in Poisson PMFs]}$$
$$= e^{-(\lambda_1+\lambda_2)} \sum_{k=0}^{n} \frac{1}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} \qquad \text{[algebra]}$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{1}{n!}\sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} \qquad \text{[multiply and divide by } n!]$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{1}{n!}\sum_{k=0}^{n} \binom{n}{k}\lambda_1^k\lambda_2^{n-k} \qquad \left[\binom{n}{k} = \frac{n!}{k!(n-k)!}\right]$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{(\lambda_1+\lambda_2)^n}{n!} \qquad \text{[binomial theorem]}$$
Thus, Z ∼ Poi(λ₁ + λ₂), as its PMF matches that of a Poisson distribution! Note we wouldn't have been able to do that last step if our sum still ran from k = 0 to ∞. You MUST watch out for this at the beginning, and after that, it's just algebra.
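Here's a tiny numeric check of this result (my addition; the rates λ₁ = 2, λ₂ = 3 and the point n = 4 are arbitrary choices): the convolution sum matches the Poi(λ₁ + λ₂) PMF.

```python
import math

def poi_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2, n = 2.0, 3.0, 4
# Finite sum k = 0..n, exactly as in the derivation above
conv = sum(poi_pmf(lam1, k) * poi_pmf(lam2, n - k) for k in range(n + 1))
print(conv, poi_pmf(lam1 + lam2, n))  # both approximately 0.17547
```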
Example(s)
Suppose X, Y are independent and identically distributed (iid) continuous Unif(0, 1) random variables. Let Z = X + Y. What is f_Z(z)?
Solution We always begin by calculating the range: we have Ω_Z = [0, 2]. Again, we shouldn't expect Z to be uniform, since we should expect a number around 1, but not 0 or 2.
For a U ∼ Unif(0, 1) (continuous) random variable, we know Ω_U = [0, 1], and that
$$f_U(u) = \begin{cases} 1 & 0 \le u \le 1 \\ 0 & \text{otherwise} \end{cases}$$
$$f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx = \int_0^1 f_X(x)f_Y(z - x)\,dx = \int_0^1 f_Y(z - x)\,dx$$
where the last equality holds since f_X(x) = 1 for all 0 ≤ x ≤ 1 as we saw above. Remember, we need to make sure z - x ∈ Ω_Y = [0, 1], otherwise the density will be 0.
For f_Y(z - x) > 0, we need 0 ≤ z - x ≤ 1. We'll split into two cases depending on whether z ∈ [0, 1] or z ∈ [1, 2], which compose its range Ω_Z = [0, 2].
• If z ∈ [0, 1], we already have z - x ≤ 1 since z ≤ 1 (and x ∈ [0, 1]). We also need z - x ≥ 0 for the density to be nonzero: x ≤ z. Hence, our integral becomes:
$$f_Z(z) = \int_0^z f_Y(z - x)\,dx + \int_z^1 f_Y(z - x)\,dx = \int_0^z 1\,dx + 0 = [x]_0^z = z$$
• If z ∈ [1, 2], we already have z - x ≥ 0 since z ≥ 1 (and x ∈ [0, 1]). We now need the other condition, z - x ≤ 1, for the density to be nonzero: x ≥ z - 1. Hence, our integral becomes:
$$f_Z(z) = \int_0^{z-1} f_Y(z - x)\,dx + \int_{z-1}^1 f_Y(z - x)\,dx = 0 + \int_{z-1}^1 1\,dx = [x]_{z-1}^1 = 2 - z$$
This makes sense because there are “more ways” to get a value of 1 for example than any other point.
Whereas to get a value of 2, there’s only one way - we need both X, Y to be equal to 1.
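A histogram of simulated sums makes the triangle visible; here's a sketch (my addition, with arbitrary sample size and bin count). Note that the density we derived is f_Z(z) = min(z, 2 - z) on [0, 2].

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.uniform(0, 1, 500_000) + rng.uniform(0, 1, 500_000)

# density=True normalizes the histogram to estimate the PDF
hist, edges = np.histogram(z, bins=10, range=(0, 2), density=True)
for left, right, h in zip(edges[:-1], edges[1:], hist):
    c = (left + right) / 2
    print(f"z={c:.1f}  empirical={h:.3f}  exact={min(c, 2 - c):.3f}")
```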
Example(s)
Mitchell and Alex are competing together in a 2-mile relay race. The time Mitchell takes to finish his mile (in hours) is X ∼ Exp(2), and the time Alex takes to finish his mile (in hours) is continuous Y ∼ Unif(0, 1). Alex starts immediately after Mitchell finishes his mile, and their performances are independent. What is the distribution of Z = X + Y, the total time they take to finish the race?
Solution First, we know that Ω_X = [0, ∞) and Ω_Y = [0, 1], so Ω_Z = [0, ∞). We know from our distribution chart that
$$f_X(x) = 2e^{-2x},\ x \ge 0 \qquad \text{and} \qquad f_Y(y) = 1,\ 0 \le y \le 1$$
Let z ∈ Ω_Z. We'll use the convolution formula, but this time over the range of Y (you could also do it over X!). We can do this because X + Y = Y + X, and there was no reason why we had to condition on X first.
$$f_Z(z) = \int_{\Omega_Y} f_Y(y)f_X(z - y)\,dy = \int_0^1 f_Y(y)f_X(z - y)\,dy$$
Since we are integrating over y, we don't need to worry about f_Y(y) being 0, but we do need to make sure f_X(z - y) > 0. There are two cases again:
• If z ∈ [0, 1], then since we need z - y ≥ 0, we need y ≤ z:
$$f_Z(z) = \int_0^z f_Y(y)f_X(z - y)\,dy = \int_0^z 1 \cdot 2e^{-2(z-y)}\,dy = 1 - e^{-2z}$$
• If z ∈ (1, ∞), then z - y ≥ 0 always holds (since y ≤ 1 ≤ z), so we integrate over all of [0, 1]:
$$f_Z(z) = \int_0^1 1 \cdot 2e^{-2(z-y)}\,dy = (e^2 - 1)e^{-2z}$$
Note this tiny difference in the upper limit of the integral made a huge difference! Our final result is
$$f_Z(z) = \begin{cases} 1 - e^{-2z} & z \in [0, 1] \\ (e^2 - 1)e^{-2z} & z \in (1, \infty) \\ 0 & \text{otherwise} \end{cases}$$
The moral of the story is: always watch out for the ranges, otherwise you might not get what you expect!
The range of the random variable exists for a reason, so be careful!
Chapter 5. Multiple Random Variables
5.6: Moment Generating Functions
Slides (Google Drive) Video (YouTube)
Last time, we talked about how to find the distribution of the sum of two independent random variables. Some of the most important use cases are to prove the results we've been using for so long: the sum of independent Binomials is Binomial, the sum of independent Poissons is Poisson (we proved this in 5.5 using convolution), etc. We'll now talk about Moment Generating Functions, which allow us to do these in a different (and arguably easier) way. These will also be used to prove the Central Limit Theorem (next section), probably the most important result in all of statistics, and to derive the Chernoff bound (6.2). The point is, these are used to prove a lot of important results. They might not be as directly applicable to problems, though.
5.6.1 Moments
First, we need to define what a moment is: the n-th moment of a random variable X is E[Xⁿ].
The first four moments of a distribution/RV are commonly used, though we have only talked about the first two of them. I'll briefly explain each, but we won't talk about the latter two much.
1. The first moment of X is the mean of the distribution, µ = E[X]. This describes the center or average value.
2. The second moment of X about µ is the variance of the distribution, σ² = Var(X) = E[(X - µ)²]. This describes the spread of a distribution (how much it varies).
3. The third standardized moment is called skewness, $E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$, and typically tells us about the asymmetry of a distribution about its peak. If skewness is positive, then the mean is larger than the median and there are a lot of extreme high values. If skewness is negative, then the median is larger than the mean and there are a lot of extreme low values.
4. The fourth standardized moment is called kurtosis, $E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = \frac{E[(X-\mu)^4]}{\sigma^4}$, which measures how peaked a distribution is. If the kurtosis is positive, then the distribution is thin and pointy, and if the kurtosis is negative, the distribution is flat and wide.
The moment generating function (MGF) of a random variable X is $M_X(t) = E\left[e^{tX}\right]$ (a function of t).
If X is discrete, by LOTUS:
$$M_X(t) = \sum_{x \in \Omega_X} e^{tx}p_X(x)$$
If X is continuous, by LOTUS:
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx}f_X(x)\,dx$$
We say that the MGF of X exists if there is an ε > 0 such that the MGF is finite for all t ∈ (-ε, ε), since it is possible that the sum or integral diverges.
Example(s)
(a) Let X be a discrete random variable with PMF p_X(1) = 1/3 and p_X(2) = 2/3. Find M_X(t).
(b) Let Y ∼ Unif(0, 1) be continuous. Find M_Y(t).
Solution
(a)
$$M_X(t) = E\left[e^{tX}\right] = \sum_x e^{tx}p_X(x) = \frac{1}{3}e^t + \frac{2}{3}e^{2t} \qquad \text{[LOTUS]}$$
(b)
$$M_Y(t) = E\left[e^{tY}\right] = \int_0^1 e^{ty}f_Y(y)\,dy = \int_0^1 e^{ty} \cdot 1\,dy = \frac{e^t - 1}{t} \qquad [\text{LOTUS}; f_Y(y) = 1 \text{ for } 0 \le y \le 1]$$
1. Computing MGFs of Linear Transformations: For scalars a, b:
$$M_{aX+b}(t) = E\left[e^{t(aX+b)}\right] = e^{tb}E\left[e^{(at)X}\right] = e^{tb}M_X(at)$$
2. Computing MGFs of Sums: We can also compute the MGF of the sum of independent RVs X and Y given their individual MGFs (the third step is due to independence):
$$M_{X+Y}(t) = E\left[e^{t(X+Y)}\right] = E\left[e^{tX}e^{tY}\right] = E\left[e^{tX}\right]E\left[e^{tY}\right] = M_X(t)M_Y(t)$$
3. Generating Moments with MGFs: The reason why MGFs are named the way they are is because they generate moments of X. That means they can be used to compute E[X], E[X²], E[X³], and so on. How? Let's take the derivative of an MGF (with respect to t):
$$M_X'(t) = \frac{d}{dt}E\left[e^{tX}\right] = \frac{d}{dt}\sum_{x \in \Omega_X} e^{tx}p_X(x) = \sum_{x \in \Omega_X} \frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} xe^{tx}p_X(x)$$
Note in the last step that x is a constant with respect to t, and so $\frac{d}{dt}e^{tx} = xe^{tx}$.
Note that if we evaluate the derivative at t = 0, we get E[X], since e⁰ = 1:
$$M_X'(0) = \sum_{x \in \Omega_X} xe^{0x}p_X(x) = \sum_{x \in \Omega_X} xp_X(x) = E[X]$$
Taking the derivative again:
$$M_X''(t) = \frac{d}{dt}M_X'(t) = \frac{d}{dt}\sum_{x \in \Omega_X} xe^{tx}p_X(x) = \sum_{x \in \Omega_X} x\frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} x^2e^{tx}p_X(x)$$
If we evaluate the second derivative at t = 0, we get E[X²]:
$$M_X''(0) = \sum_{x \in \Omega_X} x^2e^{0x}p_X(x) = \sum_{x \in \Omega_X} x^2p_X(x) = E[X^2]$$
Seems like there's a pattern - if we take the n-th derivative of M_X(t) and evaluate it at t = 0, then we will generate the n-th moment E[Xⁿ]!
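You can even check this numerically: below is a sketch (my addition) that differentiates the Unif(0, 1) MGF we computed earlier, M_Y(t) = (eᵗ - 1)/t, using finite differences; the step size h is an arbitrary choice.

```python
import math

def M(t):  # MGF of Y ~ Unif(0,1); M(0) = 1 by continuity
    return 1.0 if t == 0 else (math.exp(t) - 1) / t

h = 1e-5
first = (M(h) - M(-h)) / (2 * h)           # ~ M'(0)  = E[Y]   = 1/2
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # ~ M''(0) = E[Y^2] = 1/3
print(first, second)
```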
For a function f : R → R, we will denote f⁽ⁿ⁾(x) to be the n-th derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. M_X'(0) = E[X], M_X''(0) = E[X²], and in general M_X⁽ⁿ⁾(0) = E[Xⁿ]. This is why we call M_X a moment generating function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^{tb}M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t)M_Y(t).
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) f_X(z) = f_Y(z) for all z ∈ R.
(c) F_X(z) = F_Y(z) for all z ∈ R.
(d) There is an ε > 0 such that M_X(t) = M_Y(t) for all t ∈ (-ε, ε) (they match on a small interval around t = 0).
That is, M_X uniquely identifies a distribution, just like PDFs or CDFs do.
We proved the first three properties before stating all the theorems, so all that's left is property 4. This is a very complex proof (out of the scope of this course), but we can prove it for a special case.
Proof of Property 4 for a Special Case. We'll prove that if X, Y are discrete rvs with range Ω = {0, 1, 2, ..., m} and whose MGFs are equal everywhere, then p_X(k) = p_Y(k) for all k ∈ Ω. That is, if two distributions have the same MGF, they have the same distribution (PMF).
Let a_k = p_X(k) - p_Y(k) for k = 0, ..., m, and write e^{tk} as (eᵗ)ᵏ. Then, subtracting the two (equal) MGFs, we get
$$\sum_{k=0}^{m} a_k(e^t)^k = 0 \quad \text{for all } t$$
Note that this is an m-th degree polynomial in eᵗ, and remember that this equation holds for (uncountably) infinitely many t. An m-th degree polynomial can only have m roots, unless all the coefficients are 0. Hence a_k = 0 for all k, and so p_X(k) = p_Y(k) for all k.
Now we’ll see how to use MGFs to prove some results we’ve been using.
Example(s)
If X ∼ Poi(λ), compute M_X(t).
Solution
$$M_X(t) = E\left[e^{tX}\right] = \sum_{k=0}^{\infty} e^{tk}p_X(k) = \sum_{k=0}^{\infty} e^{tk} \cdot e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!}$$
$$= e^{-\lambda}e^{\lambda e^t} \qquad \text{[Taylor series } e^x = \textstyle\sum_{k=0}^{\infty} x^k/k! \text{ with } x = \lambda e^t]$$
$$= e^{\lambda(e^t - 1)}$$
Example(s)
If X ∼ Poi(λ), compute E[X] using its MGF we computed earlier, $M_X(t) = e^{\lambda(e^t - 1)}$.
Solution By the chain rule, $M_X'(t) = \lambda e^te^{\lambda(e^t - 1)}$, so $E[X] = M_X'(0) = \lambda e^0e^{\lambda(e^0 - 1)} = \lambda$, as expected!
Example(s)
If Y ∼ Poi(λ) and Z ∼ Poi(µ) and Y ⊥ Z, show that Y + Z ∼ Poi(λ + µ) using the uniqueness property of MGFs. (Recall we did this exact problem using convolution in 5.5.)
Solution First note that a Poi(λ + µ) RV has MGF $e^{(\lambda+\mu)(e^t - 1)}$ (just plugging in λ + µ as the parameter). Since Y and Z are independent, by property 3,
$$M_{Y+Z}(t) = M_Y(t)M_Z(t) = e^{\lambda(e^t - 1)}e^{\mu(e^t - 1)} = e^{(\lambda+\mu)(e^t - 1)}$$
The MGF of Y + Z which we computed is the same as that of a Poi(λ + µ) distribution. So, by the uniqueness of MGFs (which implies that an MGF uniquely describes a distribution), Y + Z ∼ Poi(λ + µ).
Which way was easier for you - this approach or using convolution? MGFs have a limitation, though, that convolution doesn't (besides independence): we not only need to compute the MGF of Y + Z, but we also need to know the MGF of the distribution we are trying to "get".
Example(s)
Now, use MGFs to prove the closure properties of Gaussian RVs (which we've been using without proof).
• If V ∼ N(µ, σ²) and W ∼ N(ν, τ²) are independent, show that V + W ∼ N(µ + ν, σ² + τ²).
• If a, b ∈ R are constants and X ∼ N(µ, σ²), show that aX + b ∼ N(aµ + b, a²σ²).
You may use the fact that if Y ∼ N(µ, σ²), then
$$M_Y(t) = \int_{-\infty}^{\infty} e^{ty}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy = e^{\mu t + \frac{\sigma^2t^2}{2}}$$
Solution
• If V ∼ N(µ, σ²) and W ∼ N(ν, τ²) are independent, we have the following:
$$M_{V+W}(t) = M_V(t)M_W(t) = e^{\mu t + \frac{\sigma^2t^2}{2}}e^{\nu t + \frac{\tau^2t^2}{2}} = e^{(\mu+\nu)t + \frac{(\sigma^2+\tau^2)t^2}{2}}$$
This is the MGF of a Normal distribution with mean µ + ν and variance σ² + τ². So, by uniqueness of MGFs, V + W ∼ N(µ + ν, σ² + τ²).
• Let us examine the moment generating function of aX + b. (We'll use the notation exp(z) = eᶻ so that we can actually see what's in the exponent clearly):
$$M_{aX+b}(t) = e^{bt}M_X(at) = \exp(bt)\exp\left(\mu(at) + \frac{\sigma^2(at)^2}{2}\right) = \exp\left((a\mu + b)t + \frac{(a^2\sigma^2)t^2}{2}\right)$$
This is the MGF of a Normal distribution with mean aµ + b and variance a²σ², so by uniqueness of MGFs, aX + b ∼ N(aµ + b, a²σ²).
Chapter 5. Multiple Random Variables
5.7: Limit Theorems
Slides (Google Drive) Video (YouTube)
This is definitely one of the most important sections in the entire text! The Central Limit Theorem is used everywhere in statistics (hypothesis testing), and it also has its applications in computing probabilities. We'll see three results here, each getting more powerful and surprising.
If X₁, ..., Xₙ are iid random variables with mean µ and variance σ², then we define the sample mean to be $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. We'll see the following results:
• The expectation of the sample mean E[X̄ₙ] is exactly the true mean µ, and the variance Var(X̄ₙ) = σ²/n goes to 0 as you get more samples.
• (Law of Large Numbers) As n → ∞, the sample mean X̄ₙ converges (in probability) to the true mean µ. That is, as you get more samples, you will be able to get an excellent estimate of µ.
• (Central Limit Theorem) In fact, X̄ₙ follows a Normal distribution as n → ∞ (in practice, n as low as 30 is good enough for this to be true). When we talk about the distribution of X̄ₙ, this means: if we take n samples and take the sample mean, another n samples and take the sample mean, and so on, how will these sample means look in a histogram? This is crazy - regardless of what the distribution of the Xᵢ's was (discrete, continuous), their average will be approximately Normal! We'll see pictures and describe this more soon!
Further:
$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}n\mu = \mu$$
$$\text{Var}(\bar{X}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}$$
Again, none of this is "mind-blowing" to prove: we just used linearity of expectation and properties of variance (the variances add since the Xᵢ's are independent).
What is this saying? Basically, if you wanted to estimate the mean height of the U.S. population by sampling n people uniformly at random:
• In expectation, your sample average will be "on point" at E[X̄ₙ] = µ. This even includes the case n = 1: if you just sample one person, on average, you will be correct. However, the variance is high.
• The variance of your estimate (the sample mean) of the true mean goes down (σ²/n) as your sample size n gets larger. This makes sense, right? If you have more samples, you have more confidence in your estimate because you are more "sure" (less variance).
In fact, as n → ∞, the variance of the sample mean approaches 0. A distribution with mean µ and variance 0 is essentially the degenerate random variable that takes on µ with probability 1. We'll actually see that the Law of Large Numbers argues exactly that!
(Weak Law of Large Numbers) For any ε > 0, $\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0$.
(Strong Law of Large Numbers) $P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$.
The SLLN implies the WLLN, but not vice versa. The difference is subtle, and is basically swapping the limit and probability operations.
The proof of the WLLN will be given in 6.1 when we prove Chebyshev's inequality, but the proof of the SLLN is out of the scope of this class and much harder.
(The Central Limit Theorem) Let X₁, ..., Xₙ be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ². We've seen that the sample mean X̄ₙ has mean µ and variance σ²/n. Then as n → ∞, the following equivalent statements hold:
1. $\bar{X}_n \to N\left(\mu, \frac{\sigma^2}{n}\right)$
2. $\frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} \to N(0, 1)$
3. $\sum_{i=1}^n X_i \sim N(n\mu, n\sigma^2)$. This is not "technically" correct, but is useful for applications.
4. $\frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}} \to N(0, 1)$
The mean and variance are not a surprise (we computed these at the beginning of these notes for any sample mean); the importance of the CLT is that, regardless of the distribution of the Xᵢ's, the sample mean approaches a Normal distribution as n → ∞.
We will prove the Central Limit Theorem in 5.11 using MGFs, but take a second to appreciate this crazy result! The LLN says that as n → ∞, the sample mean of iid variables X̄ₙ converges to µ. The CLT says that, as n → ∞, the sample mean actually converges to a Normal distribution! For any original distribution of the Xᵢ's (discrete or continuous), the average/sum will become approximately normally distributed.
If you're still having trouble with figuring out what "the distribution of the sample mean" means, that's completely normal (double pun!). Let's consider n = 2, so we just take the average of X₁ and X₂, which is (X₁ + X₂)/2. The distribution of X₁ + X₂ means: if we repeatedly sample X₁, X₂ and add them, what might the density look like? For example, if X₁, X₂ ∼ Unif(0, 1) (continuous), we showed the density of X₁ + X₂ looked like a triangle. We figured out how to compute the PMF/PDF of the sum using convolution in 5.5, and the average is just dividing this by 2: (X₁ + X₂)/2, whose PMF/PDF you can find by transforming RVs as in 4.4. On the next page, you'll see exactly the CLT applied to these Uniform distributions. With n = 1, it looks (and is) Uniform. When n = 2, you get the triangular shape. And as n gets larger, it starts looking more and more like a Normal!
You'll see some examples below of how we start with some arbitrary distributions and how the density function of their mean becomes shaped like a Gaussian (you know how to compute the pdf of the mean now, using convolution from 5.5 and transforming RVs from 4.4)!
On the next two pages, we’ll see some visual “proof” of this surprising result!
• The first (n = 1) of the four graphs below shows a discrete $\frac{1}{29} \cdot \text{Unif}(0, 29)$ PMF in the dots (and a blue line with the curve of the normal distribution with the same mean and variance). That is, $P(X = k) = \frac{1}{30}$ for each value in the range $\left\{0, \frac{1}{29}, \frac{2}{29}, \dots, \frac{28}{29}, 1\right\}$.
• The second graph (n = 2) has the average of two of these distributions, again with a blue line with the curve of the normal distribution with the same mean and variance. Remember, we expected this triangular distribution when summing either discrete or continuous Uniforms. (E.g., when summing two fair 6-sided die rolls, you're most likely to get a 7, and the probability goes down linearly as you approach 2 or 12. See the example in 5.5 if you forgot how we got this!)
• The third (n = 3) and fourth (n = 4) have the average of 3 and 4 identically distributed random variables respectively, each with the distribution shown in the first graph. We can see that as we average more, the average approaches a normal distribution.
Again, if you don't believe me, you can compute the PMF yourself using convolution: first add two of these discrete Uniforms, then convolve the result with a third, and a fourth!
Despite this being a discrete random variable, when we take an average of many, there become increasingly many values we can get between 0 and 1. The average of these iid discrete rv's approaches a continuous Normal random variable even after just averaging 4 of them!
Image Credit: Larry Ruzzo (a previous University of Washington CSE 312 instructor).
You might still be skeptical, because the Uniform distribution is “nice” and already looked pretty “Normal”
even with n = 2 samples. We now illustrate the same idea with a strange distribution shown in the first
(n = 1) of the four graphs below, illustrated with the dots (instead of a “nice” uniform distribution). Even
this crazy distribution nearly looks Normal after just averaging 4 of them. This is the power of the CLT!
What we are getting at here is that, regardless of the distribution, as we have more independent and
identically distributed random variables, the average follows a Normal distribution (with the same mean and
variance as the sample mean).
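You can reproduce these pictures yourself with a few lines of Python (a sketch I'm adding, with arbitrary sample sizes): sample many sample means of iid Unif(0, 1) RVs and watch the histogram tighten around 1/2 with variance (1/12)/n.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in [1, 2, 4, 32]:
    means = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    # A histogram of `means` looks increasingly Normal as n grows.
    print(n, means.mean(), means.var(), 1 / 12 / n)  # Var(Unif(0,1)) = 1/12
```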
Now let's see how we can apply the CLT to problems! There were four different equivalent forms (just scaling/shifting) stated, but I find it easier to just look at the problem and decide what's best. Seeing examples is the best way to understand!
Example(s)
Let's consider the example of flipping a fair coin 40 times independently. What's the probability of getting between 15 and 25 heads? First compute this exactly, and then give an approximation using the CLT.
Solution Define X to be the number of heads in the 40 flips. Then we have X ∼ Bin(n = 40, p = 1/2), so we just sum the Binomial PMF:
$$P(15 \le X \le 25) = \sum_{k=15}^{25} \binom{40}{k}\left(\frac{1}{2}\right)^k\left(1 - \frac{1}{2}\right)^{40-k} \approx 0.9193$$
Now, let's use the CLT. Since X can be thought of as the sum of 40 iid Ber(1/2) RVs, we can apply the CLT. We have E[X] = np = 40(1/2) = 20 and Var(X) = np(1 - p) = 40(1/2)(1 - 1/2) = 10. So we can use the approximation X ≈ N(µ = 20, σ² = 10), and standardizing gives
$$P(15 \le X \le 25) \approx P\left(\frac{15 - 20}{\sqrt{10}} \le Z \le \frac{25 - 20}{\sqrt{10}}\right) = 2\Phi(1.58) - 1 \approx 0.886$$
Example(s)
Use the continuity correction to get a better estimate than we did earlier for the coin problem.
Solution We'll apply the exact same steps, except changing the bounds from 15 and 25 to 14.5 and 25.5:
$$P(14.5 \le X \le 25.5) \approx P\left(\frac{14.5 - 20}{\sqrt{10}} \le Z \le \frac{25.5 - 20}{\sqrt{10}}\right) = 2\Phi(1.74) - 1 \approx 0.918$$
Notice that this is much closer to the exact answer from the first part of the prior example (0.9193) than approximating with the central limit theorem without the continuity correction!
Note: If you are applying the CLT to sums/averages of continuous RVs instead, you should not apply the continuity correction.
See the additional exercises below to get more practice with the CLT!
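If you have scipy available (an assumption on my part; this sketch is not from the text), you can compare the exact Binomial answer against the CLT approximations with and without the continuity correction:

```python
from scipy import stats

mu, sd = 20, 10**0.5  # X ~ Bin(40, 1/2) has mean 20, variance 10
exact = sum(stats.binom.pmf(k, 40, 0.5) for k in range(15, 26))
naive = stats.norm.cdf(25, mu, sd) - stats.norm.cdf(15, mu, sd)
corrected = stats.norm.cdf(25.5, mu, sd) - stats.norm.cdf(14.5, mu, sd)
print(exact, naive, corrected)  # ~0.9193, ~0.886, ~0.918
```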
5.7.5 Exercises
1. Each day, the number of customers who come to the CSE 312 probability gift shop is approximately Poi(11). Approximate the probability that, after the quarter ends (9 × 7 = 63 days), we had over 700 customers.
Solution: The total number of customers that come is X = X₁ + ⋯ + X₆₃, where each Xᵢ ∼ Poi(11) has E[Xᵢ] = Var(Xᵢ) = λ = 11 from the chart. By the CLT, X ≈ N(µ = 63 · 11 = 693, σ² = 63 · 11 = 693) (sum of the means and sum of the variances). Hence, using the continuity correction (the Xᵢ's are discrete),
$$P(X > 700) \approx P\left(Z \ge \frac{700.5 - 693}{\sqrt{693}}\right) \approx P(Z \ge 0.285) = 1 - \Phi(0.285) \approx 0.39$$
Note that you could compute this exactly as well, since you know the sum of iid Poissons is Poisson. In fact, X ∼ Poi(693) (the average rate in 63 days is 693 per 63 days), and you could do a sum, which would be very annoying.
2. Suppose I have a flashlight which requires one battery to operate, and I have 18 identical batteries. I want to go camping for a week (24 × 7 = 168 hours). If the lifetime of a single battery is Exp(0.1), what's the probability my flashlight can operate for the entirety of my trip?
Solution: The total lifetime of the batteries is X = X₁ + ⋯ + X₁₈, where each Xᵢ ∼ Exp(0.1) has $E[X_i] = \frac{1}{0.1} = 10$ and $\text{Var}(X_i) = \frac{1}{0.1^2} = 100$. Hence, E[X] = 180 and Var(X) = 1800 by linearity of expectation and since variance adds for independent rvs. In fact, X ∼ Gamma(r = 18, λ = 0.1), but we don't have a closed form for its CDF. By the CLT, X ≈ N(µ = 180, σ² = 1800), so
$$P(X \ge 168) \approx P\left(Z \ge \frac{168 - 180}{\sqrt{1800}}\right) = P(Z \ge -0.28) = \Phi(0.28) \approx 0.61$$
Note that we don't use the continuity correction here because the RVs we are summing are already continuous RVs.
Chapter 5. Multiple Random Variables
5.8: The Multinomial Distribution
Slides (Google Drive) Video (YouTube)
As you’ve seen, the Binomial distribution is extremely commonly used, and probably the most important
discrete distribution. The Normal distribution is certainly the most important continuous distribution. In
this section, we’ll see how to generalize the Binomial, and in the next, the Normal.
Why do we need to generalize the Binomial distribution? Sometimes, we don’t just have two outcomes
(success and failure), but we have r > 2 outcomes. In this case, we need to maintain counts of how many
times each of the r outcomes appeared. A single random variable is no longer sufficient; we need a vector of
counts!
Actually, the example problems at the end could have been solved in Chapter 1. We will just formalize
this situation so that we can use it later!
What about the variance? We cannot just say or compute a single scalar Var(X), because what does that mean for a random vector? Actually, we need to define an n × n covariance matrix, which stores all pairwise covariances. It is often denoted in one of three ways: Σ = Var(X) = Cov(X).
The covariance matrix of a random vector X ∈ Rⁿ with E[X] = µ is the matrix denoted Σ = Var(X) = Cov(X) whose entries Σᵢⱼ = Cov(Xᵢ, Xⱼ). The formula for this is:
$$\Sigma = \text{Var}(\mathbf{X}) = \text{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = E\left[\mathbf{X}\mathbf{X}^T\right] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
$$= \begin{bmatrix} \text{Cov}(X_1, X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Cov}(X_2, X_2) & \dots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Cov}(X_n, X_n) \end{bmatrix} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \dots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Var}(X_n) \end{bmatrix}$$
Notice that the covariance matrix is symmetric (Σᵢⱼ = Σⱼᵢ), and contains variances along the diagonal.
Note: If you know a bit of linear algebra, you might like to know that covariance matrices are always symmetric positive semi-definite.
We will not be doing any linear algebra in this class - think of the covariance matrix as just a place to store all the pairwise covariances. Now let us look at an example of a covariance matrix.
Example(s)
If X₁, X₂, ..., Xₙ are iid with mean µ and variance σ², then find the mean vector and covariance matrix of the random vector X = (X₁, ..., Xₙ).
Solution The mean vector is
$$E[\mathbf{X}] = (E[X_1], \dots, E[X_n]) = (\mu, \dots, \mu) = \mu\mathbf{1}_n$$
where 1ₙ denotes the n-dimensional vector of all 1's. The covariance matrix is (since the diagonal is just the individual variances σ², and the off-diagonals (i ≠ j) are all Cov(Xᵢ, Xⱼ) = 0 due to independence)
$$\Sigma = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix} = \sigma^2I_n$$
An important theorem is that properties of expectation and variance still hold for RVTRs. For a constant matrix A and constant vector b:
$$E[A\mathbf{X} + \mathbf{b}] = AE[\mathbf{X}] + \mathbf{b} \qquad \text{Var}(A\mathbf{X} + \mathbf{b}) = A\,\text{Var}(\mathbf{X})\,A^T$$
Since we aren't expecting any linear algebra background, we won't prove this.
Suppose we have n = 7 independent trials, each resulting in one of r = 3 outcomes (with probabilities p₁, p₂, p₃), and let Yᵢ count how many trials resulted in outcome i. Now, what is the probability of this outcome (two of outcome 1, one of outcome 2, and four of outcome 3) - that is, (Y₁ = 2, Y₂ = 1, Y₃ = 4)? We get the following:
$$p_{Y_1,Y_2,Y_3}(2, 1, 4) = \frac{7!}{2!1!4!} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 = \binom{7}{2, 1, 4} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 \qquad \text{[recall from counting]}$$
This describes the joint distribution of the random vector Y = (Y₁, Y₂, Y₃), and its PMF should remind you of the binomial PMF. We just count the number of ways $\binom{7}{2,1,4}$ to get these counts (multinomial coefficient), and make sure we get each outcome that many times: $p_1^2p_2^1p_3^4$.
We write
$$\mathbf{Y} \sim \text{Mult}_r(n, \mathbf{p})$$
Notice that each Yᵢ is marginally Bin(n, pᵢ). Hence, E[Yᵢ] = npᵢ and Var(Yᵢ) = npᵢ(1 - pᵢ). Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\mathbf{p} = \begin{bmatrix} np_1 \\ \vdots \\ np_r \end{bmatrix} \qquad \text{Var}(Y_i) = np_i(1 - p_i) \qquad \text{Cov}(Y_i, Y_j) = -np_ip_j \ (\text{for } i \ne j)$$
Notice the covariance is negative, which makes sense because as the number of occurrences of outcome i increases, the number of occurrences of outcome j should decrease, since they cannot occur simultaneously on the same trial.
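Here's a quick empirical check of the negative covariance (my addition; n, p, and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 7, np.array([0.2, 0.3, 0.5])
samples = rng.multinomial(n, p, size=200_000)

print(samples.mean(axis=0))           # ~ n * p
print(np.cov(samples, rowvar=False))  # off-diagonals ~ -n * p_i * p_j
print(-n * np.outer(p, p))            # theory (ignore its diagonal;
                                      # the true diagonal is n*p_i*(1-p_i))
```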
Proof of Multinomial Covariance. Recall that marginally, Xᵢ and Xⱼ are binomial random variables; let's decompose them into their Bernoulli trials. We'll use different dummy indices, as we're dealing with covariances.
Let X_{ik} for k = 1, ..., n be indicator/Bernoulli rvs of whether the k-th trial resulted in outcome i, so that $X_i = \sum_{k=1}^n X_{ik}$.
Similarly, let X_{jℓ} for ℓ = 1, ..., n be indicators of whether the ℓ-th trial resulted in outcome j, so that $X_j = \sum_{\ell=1}^n X_{j\ell}$.
Before we begin, we should argue that Cov(X_{ik}, X_{jℓ}) = 0 when k ≠ ℓ, since k and ℓ are different trials and are independent.
Furthermore, E[X_{ik}X_{jk}] = 0, since it's not possible that both outcome i and outcome j occur at trial k.
$$\text{Cov}(X_i, X_j) = \text{Cov}\left(\sum_{k=1}^n X_{ik}, \sum_{\ell=1}^n X_{j\ell}\right) \qquad \text{[indicators]}$$
$$= \sum_{k=1}^n\sum_{\ell=1}^n \text{Cov}(X_{ik}, X_{j\ell}) \qquad \text{[covariance works like FOIL]}$$
$$= \sum_{k=1}^n \text{Cov}(X_{ik}, X_{jk}) \qquad \text{[independent trials, cross terms are 0]}$$
$$= \sum_{k=1}^n \left(E[X_{ik}X_{jk}] - E[X_{ik}]E[X_{jk}]\right) \qquad \text{[def of covariance]}$$
$$= \sum_{k=1}^n (0 - p_ip_j) \qquad \text{[first expectation is 0]}$$
$$= -np_ip_j$$
Note that in the third line, we dropped one of the sums because the indicators across different trials k, ℓ are independent (zero covariance). Hence, we just need to sum the terms where k = ℓ.
Suppose a committee of 10 is formed uniformly at random from a senate of 100 members: 45 Green party members, 20 Democrats, and 35 Republicans. Let Y = (Y₁, Y₂, Y₃) be the number of each party's members in the committee (G, D, R in that order). What is the probability we get 1 Green party member, 6 Democrats, and 3 Republicans? It turns out it is just the following:
$$p_{Y_1,Y_2,Y_3}(1, 6, 3) = \frac{\binom{45}{1}\binom{20}{6}\binom{35}{3}}{\binom{100}{10}}$$
This is very similar to the univariate Hypergeometric distribution! For the denominator, there are $\binom{100}{10}$ ways to choose 10 senators. For the numerator, we need 1 from the 45 Green party members, 6 from the 20 Democrats, and 3 from the 35 Republicans.
Suppose there are r different colors of balls in a bag, having K = (K₁, ..., Kᵣ) balls of each color i, 1 ≤ i ≤ r. Let $N = \sum_{i=1}^r K_i$ be the total number of balls in the bag, and suppose we draw n without replacement. Let Y = (Y₁, ..., Yᵣ) be the rvtr such that Yᵢ is the number of balls of color i we drew. We write that:
$$\mathbf{Y} \sim \text{MVHG}_r(N, \mathbf{K}, n)$$
Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\frac{\mathbf{K}}{N} = \begin{bmatrix} n\frac{K_1}{N} \\ \vdots \\ n\frac{K_r}{N} \end{bmatrix} \qquad \text{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1} \qquad \text{Cov}(Y_i, Y_j) = -n \cdot \frac{K_i}{N} \cdot \frac{K_j}{N} \cdot \frac{N - n}{N - 1}$$
Proof of Hypergeometric Variance. We'll finally prove the variance of a univariate Hypergeometric (the variance of Yᵢ), but leave the covariance matrix to you (you can approach it similarly to the multinomial covariance matrix). Let X ∼ HypGeo(N, K, n), and write $X = \sum_{i=1}^n X_i$, where Xᵢ is the indicator that the i-th draw is a lollipop.
First, we have that since $X_i \sim \text{Ber}\left(\frac{K}{N}\right)$:
$$\text{Var}(X_i) = p(1 - p) = \frac{K}{N}\left(1 - \frac{K}{N}\right)$$
Second, for i ≠ j, $E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1)P(X_j = 1 \mid X_i = 1) = \frac{K}{N} \cdot \frac{K - 1}{N - 1}$, so
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}$$
Finally,
$$\text{Var}(X) = \text{Var}\left(\sum_{i=1}^n X_i\right) \qquad \text{[def of } X]$$
$$= \text{Cov}\left(\sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right) \qquad \text{[covariance with self is variance]}$$
$$= \sum_{i=1}^n\sum_{j=1}^n \text{Cov}(X_i, X_j) \qquad \text{[bilinearity of covariance]}$$
$$= \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[split diagonal]}$$
$$= n\frac{K}{N}\left(1 - \frac{K}{N}\right) + 2\binom{n}{2}\left(\frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}\right) \qquad \text{[plug in]}$$
$$= n \cdot \frac{K}{N} \cdot \frac{N - K}{N} \cdot \frac{N - n}{N - 1} \qquad \text{[algebra]}$$
5.8.4 Exercises
These won't be very interesting, since this could've been done in Chapters 1 and 2!
1. Suppose you are fishing in a pond with 3 red fish, 4 green fish, and 5 blue fish.
(a) You use a net to scoop up 6 of them. What is the probability you scooped up 2 of each?
(b) You "catch and release" until you've caught 6 fish (catch 1, throw it back, catch another, throw it back, etc.). What is the probability you caught 2 of each?
Solution:
(a) Let (X₁, X₂, X₃) be how many red, green, and blue fish I caught, respectively. Then X ∼ MVHG₃(N = 12, K = (3, 4, 5), n = 6), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \frac{\binom{3}{2}\binom{4}{2}\binom{5}{2}}{\binom{12}{6}}$$
(b) Let (X₁, X₂, X₃) be how many red, green, and blue fish I caught, respectively. Then X ∼ Mult₃(n = 6, p = (3/12, 4/12, 5/12)), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \binom{6}{2, 2, 2}\left(\frac{3}{12}\right)^2\left(\frac{4}{12}\right)^2\left(\frac{5}{12}\right)^2$$
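Both answers are easy to evaluate in Python (a sketch I'm adding, using only the standard library):

```python
from math import comb

# (a) scoop 6 at once: multivariate hypergeometric
p_a = comb(3, 2) * comb(4, 2) * comb(5, 2) / comb(12, 6)

# (b) catch and release: multinomial; 6!/(2!2!2!) = comb(6,2)*comb(4,2) = 90
p_b = comb(6, 2) * comb(4, 2) * (3/12)**2 * (4/12)**2 * (5/12)**2
print(p_a, p_b)  # approximately 0.1948 and 0.1085
```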
Chapter 5. Multiple Random Variables
5.9: The Multivariate Normal Distribution
Slides (Google Drive) Video (YouTube)
In this section, we will generalize the Normal random variable, the most important continuous distribution! We were able to find the joint PMF for the Multinomial random vector using a counting argument, but how can we find the Multivariate Normal density function? We'll start with the simplest case, and work from there.
Suppose X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent, and stack them into a vector (X, Y) with mean vector µ = (µ_X, µ_Y) and (diagonal) covariance matrix Σ. Then, we say that (X, Y) has a bivariate Normal distribution, which we will denote:
$$(X, Y) \sim N_2(\boldsymbol{\mu}, \Sigma)$$
This is nice and all, if we have two independent Normals. But what if they aren't independent? Start with two independent standard Normals Z₁, Z₂ ∼ N(0, 1) and a desired correlation ρ:
1. We construct X from Z₁ alone:
$$X = \sigma_XZ_1 + \mu_X$$
2. We construct Y from both Z₁ and Z₂, as shown below:
$$Y = \sigma_Y\left(\rho Z_1 + \sqrt{1 - \rho^2}Z_2\right) + \mu_Y$$
From this transformation, we get that marginally (show this by computing the mean and variance of X, Y and using closure properties of Normal RVs),
$$X \sim N(\mu_X, \sigma_X^2) \qquad Y \sim N(\mu_Y, \sigma_Y^2)$$
Additionally, ρ(X, Y) = ρ. By using the multivariate change-of-variables formula from 4.4, we can turn the "simple" product of standard normal PDFs into the PDF of the bivariate Normal:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}}\exp\left(-\frac{z}{2(1 - \rho^2)}\right), \quad x, y \in \mathbb{R}$$
where
$$z = \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}$$
Finally, we write:
$$(X, Y) \sim N_2(\boldsymbol{\mu}, \Sigma)$$
The visualization below shows the density of a bivariate Normal distribution. On the xy-plane, we have the actual two Normals, and on the z-axis, we have the density. Marginally, both variables are Normals!
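The two-step construction above is exactly how you can sample correlated Normals in code. Here's a Python sketch (my addition; all parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_x, mu_y, sig_x, sig_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7

z1 = rng.standard_normal(200_000)  # independent standard Normals
z2 = rng.standard_normal(200_000)
x = sig_x * z1 + mu_x
y = sig_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y

print(x.mean(), x.std(), y.mean(), y.std())  # ~ the mu's and sigma's
print(np.corrcoef(x, y)[0, 1])               # ~ rho
```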
Now let's take a look at the effect of different covariance matrices Σ on the distribution of a bivariate normal, all with mean vector (0, 0). Each row below modifies one entry in the covariance matrix; see the pictures to explore graphically how the parameters change the shape!
A random vector X = (X₁, ..., Xₙ) has a multivariate Normal distribution with mean vector µ ∈ Rⁿ and (symmetric and positive-definite) covariance matrix Σ ∈ Rⁿˣⁿ, written X ∼ Nₙ(µ, Σ), if it has the following joint PDF:
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right), \quad \mathbf{x} \in \mathbb{R}^n$$
While this PDF may look intimidating, if we recall the PDF of a univariate Normal W ∼ N(µ, σ²):
$$f_W(w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(w - \mu)^2\right)$$
we can note that the two formulae are quite similar; we simply extend scalars to vectors and matrices!
One special fact about the multivariate Normal: within a multivariate Normal random vector, uncorrelated components are actually independent; that is, Cov(Xᵢ, Xⱼ) = 0 → Xᵢ ⊥ Xⱼ.
Unfortunately, we cannot do example problems as they would require a deeper knowledge of linear algebra,
which we do not assume.
Chapter 5. Multiple Random Variables
5.10: Order Statistics
Slides (Google Drive) Video (YouTube)
We’ve talked a lot about the distribution of the sum of random variables, but what about the maximum,
minimum, or median? For example, if there are 4 possible buses you could take, and the time until each
arrives is independent with an exponential distribution, what is the expected time until the first one arrives?
Mathematically, this would be E [min{X1 , X2 , X3 , X4 }] if the arrival times were X1 , X2 , X3 , X4 .
In this section, we'll figure out how to find the density function (and hence expectation/variance) of the minimum, maximum, median, and more!
Let Y₁, ..., Yₙ be iid continuous random variables, and sort them so that Y₍₁₎ < Y₍₂₎ < ⋯ < Y₍ₙ₎. Y₍₁₎ is the smallest value (the minimum), and Y₍ₙ₎ is the largest value (the maximum); since they are so commonly used, they have the special names Y_min and Y_max respectively.
Notice that we can't have equality because, with continuous random variables, the probability that any two are equal is 0. So, we don't have to worry about any of these random variables being "less than or equal to" another.
Notice that each Y₍ᵢ₎ is a random variable as well! We call Y₍ᵢ₎ the i-th order statistic, i.e., the i-th smallest in a sample of size n. For example, if we had n = 9 samples, Y₍₅₎ would be the median value. We are interested in finding the distribution of each order statistic, and properties such as expectation and variance as well.
Why are order statistics important? Usually, we take the min, max, or median of a set of random variables and do computations with them - so, it would be useful if we had a general formula for the PDF and CDF of the min or max.
We start with an example to find the distribution of Y(n) = Ymax , the largest order statistic. We’ll then
extend this to any of the order statistics (not just the max). Again, this means, if we were to repeatedly
take the maximum of n iid RVs, what would the samples look like?
Example(s)
Let Y1 , Y2 , . . . , Yn be iid continuous random variables with the same CDF FY and PDF fY . What is
the distribution of Y(n) = Ymax = max{Y1 , Y2 , . . . , Yn } the largest order statistic?
Solution
We'll employ our typical strategy and work with probabilities instead of densities, so we'll start with the CDF:
$$F_{Y_{max}}(y) = P(Y_{max} \le y) = P(Y_1 \le y, \dots, Y_n \le y) = \prod_{i=1}^n P(Y_i \le y) = [F_Y(y)]^n$$
where we used the fact that the maximum is at most y if and only if every sample is at most y, and then independence. Differentiating with respect to y gives the density:
$$f_{Y_{max}}(y) = \frac{d}{dy}[F_Y(y)]^n = n[F_Y(y)]^{n-1}f_Y(y)$$
Let's take a step back and see what we just did here. We just computed the density function of the maximum of n iid random variables, denoted Y_max = Y₍ₙ₎. We now need to find the density of any arbitrary ranked Y₍ᵢ₎.
Now, using the same intuition as before, we'll use an informal argument to find the density of a general Y₍ᵢ₎, f_{Y₍ᵢ₎}(y). For example, this might help find the distribution of the minimum f_{Y₍₁₎} or the median. The result is:
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y)$$
Proof of Density of Order Statistics. The formula above may remind you of a multinomial distribution, and you would be correct! Let's consider what it means for Y₍ᵢ₎ = y (the i-th smallest value in the sample of n to equal a particular value y):
• One of the values needs to be exactly y.
• i - 1 of the values need to be smaller than y (this happens for each with probability F_Y(y)).
• The other n - i values need to be greater than y (this happens for each with probability 1 - F_Y(y)).
Now, we have 3 distinct types of objects: 1 that is exactly y, i - 1 which are less than y, and n - i which are greater. Using multinomial coefficients and the above, we see that
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y)$$
Note that this isn't a probability; it is a density, so there is something flawed with how we approached this problem. For a more rigorous approach, we just have to make a slight modification, but use the same idea.
Re-Proof (Rigorous). This time, we'll find $P\left(y - \frac{\varepsilon}{2} \le Y_{(i)} \le y + \frac{\varepsilon}{2}\right)$ and use the fact that this is approximately equal to $\varepsilon f_{Y_{(i)}}(y)$ for small ε > 0 (Riemann integral (rectangle) approximation from 4.1).
We have very similar cases:
• One of the values needs to be between y - ε/2 and y + ε/2 (this happens with probability approximately εf_Y(y), again by Riemann approximation).
• i - 1 of the values need to be smaller than y - ε/2 (this happens for each with probability F_Y(y - ε/2)).
• The other n - i values need to be greater than y + ε/2 (this happens for each with probability 1 - F_Y(y + ε/2)).
Now these are actually probabilities (not densities), so we get
$$P\left(y - \frac{\varepsilon}{2} \le Y_{(i)} \le y + \frac{\varepsilon}{2}\right) \approx \varepsilon f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot (\varepsilon f_Y(y))$$
Dividing both sides by ε > 0 gives the same result as earlier!
Let's verify this formula with our maximum that we derived earlier by plugging in n for i:
$$f_{Y_{max}}(y) = f_{Y_{(n)}}(y) = \binom{n}{n-1,\ 1,\ 0} \cdot [F_Y(y)]^{n-1} \cdot [1 - F_Y(y)]^0 \cdot f_Y(y) = nF_Y^{n-1}(y)f_Y(y)$$
Example(s)
If Y₁, ..., Yₙ are iid Unif(0, 1), where do we "expect" the points to end up? That is, find E[Y₍ᵢ₎] for any i. You may find this picture with different values of n useful for intuition.
Solution
Intuitively, from the picture, if n = 1, we expect the single point to end up at 1/2. If n = 2, we expect the two points to end up at 1/3 and 2/3. If n = 4, we expect the four points to end up at 1/5, 2/5, 3/5 and 4/5.
Let's prove this formally. Recall, if Y ∼ Unif(0, 1) (continuous), then f_Y(y) = 1 for y ∈ [0, 1] and F_Y(y) = y for y ∈ [0, 1]. By the order statistics formula,
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot y^{i-1} \cdot (1 - y)^{n-i} \cdot 1$$
This is exactly the density of a Beta(α = i, β = n - i + 1) random variable, whose expectation is $\frac{\alpha}{\alpha + \beta} = \frac{i}{n+1}$ - matching our intuition above!
Here is a picture which may help you figure out what the formulae you just computed mean!
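A short simulation (my addition, with n = 4 and an arbitrary trial count) confirms that the i-th smallest of n iid Unif(0, 1) samples averages to i/(n + 1):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
# Sort each row: column i holds the (i+1)-th order statistic of that trial
sorted_samples = np.sort(rng.uniform(0, 1, size=(200_000, n)), axis=1)
print(sorted_samples.mean(axis=0))             # ~ [0.2, 0.4, 0.6, 0.8]
print([i / (n + 1) for i in range(1, n + 1)])  # exact: i/(n+1)
```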
Example(s)
At 5pm each day, four buses make their way to the HUB bus stop. Each bus would be acceptable to take you home. The time in hours (after 5pm) that each arrives at the stop is independent, with Y₁, Y₂, Y₃, Y₄ ∼ Exp(λ = 6) (on average, it takes 1/6 of an hour (10 minutes) for each bus to arrive).
1. On Mondays, you want to get home ASAP, so you arrive at the bus stop at 5pm sharp. What is the expected time until the first one arrives?
2. On Tuesdays, you have a lab meeting that runs until 5:15 and are worried you may not catch any bus. What is the probability you miss all the buses?
Solution The first question asks about the smallest order statistic Y₍₁₎ = Y_min, since we care about the first bus. The second question asks about the largest order statistic Y₍₄₎, since we care about the last bus. Let's compute the general formula for order statistics first so we can apply it to both parts of the problem.
Recall, if Y ∼ Exp(λ = 6) (continuous), then $f_Y(y) = 6e^{-6y}$ for y ∈ [0, ∞) and $F_Y(y) = 1 - e^{-6y}$ for y ∈ [0, ∞). By the order statistics formula,
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y) = \binom{4}{i-1,\ 1,\ 4-i} \cdot [1 - e^{-6y}]^{i-1} \cdot [e^{-6y}]^{4-i} \cdot 6e^{-6y}$$
1. For the first part, we want E[Y₍₁₎], so we plug in i = 1 (and n = 4) to the above formula to get:
$$f_{Y_{(1)}}(y) = \binom{4}{0,\ 1,\ 3} \cdot [1 - e^{-6y}]^0 \cdot [e^{-6y}]^3 \cdot 6e^{-6y} = 4[e^{-18y}]6e^{-6y} = 24e^{-24y}$$
Now we can use the PDF to find the expectation normally. However, notice that the PDF is that of an Exp(λ = 24) distribution, so it has expectation 1/24. That is, the expected time until the first bus arrives is 1/24 of an hour, or 2.5 minutes.
Let's talk about something amazing here. We found that min{Y₁, Y₂, Y₃, Y₄} ∼ Exp(λ = 4 · 6); the minimum of exponentials is distributed as an exponential with the sum of the rates! Why might this be true? If we have Y₁, Y₂, Y₃, Y₄ ∼ Exp(6), that means on average, 6 buses of each type arrive each hour, for a total of 24. That just means we can model our waiting time in this regime with an average of 24 buses per hour, to get that the time until the first bus has an Exp(6 + 6 + 6 + 6) distribution!
2. For finding the maximum, we just plug in i = n = 4 (and n = 4), to get
$$f_{Y_{(4)}}(y) = \binom{4}{3,\ 1,\ 0} \cdot [1 - e^{-6y}]^3 \cdot [e^{-6y}]^0 \cdot 6e^{-6y} = 4[1 - e^{-6y}]^36e^{-6y}$$
Unfortunately, this is as simplified as it gets, and we don't get the nice result that the maximum of exponentials is exponential. To find the desired quantity, we just need to compute the probability the last bus comes before 5:15 (which is 0.25 hours - be careful of units!):
$$P(Y_{max} \le 0.25) = \int_0^{0.25} f_{Y_{max}}(y)\,dy = \int_0^{0.25} 4[1 - e^{-6y}]^36e^{-6y}\,dy = \left[(1 - e^{-6y})^4\right]_0^{0.25} = (1 - e^{-1.5})^4 \approx 0.364$$
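Both answers can be checked by simulation; here's a sketch (my addition; note numpy parameterizes the Exponential by its scale 1/λ):

```python
import numpy as np

rng = np.random.default_rng(7)
times = rng.exponential(scale=1 / 6, size=(500_000, 4))  # 4 iid Exp(6) buses

print(times.min(axis=1).mean())           # ~ 1/24 hours until the first bus
print((times.max(axis=1) < 0.25).mean())  # ~ 0.364, P(miss all buses)
```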
Chapter 5. Multiple Random Variables
5.11: Proof of the CLT
Slides (Google Drive) Video (YouTube)
In this optional section, we’ll prove the Central Limit Theorem, one of the most fundamental and amazing
results in all of statistics, using MGFs!
For a function f : R → R, we will denote f⁽ⁿ⁾(x) to be the n-th derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. M_X'(0) = E[X], M_X''(0) = E[X²], and in general M_X⁽ⁿ⁾(0) = E[Xⁿ]. This is why we call M_X a moment generating function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^{tb}M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t)M_Y(t).
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) f_X(z) = f_Y(z) for all z ∈ R.
(c) F_X(z) = F_Y(z) for all z ∈ R.
(d) There is an ε > 0 such that M_X(t) = M_Y(t) for all t ∈ (-ε, ε) (they match on a small interval around t = 0).
That is, M_X uniquely identifies a distribution, just like PDFs or CDFs do.
Let X₁, ..., Xₙ be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ². Then, the standardized sample mean approaches the standard Normal distribution:
$$\text{As } n \to \infty, \quad Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to N(0, 1)$$
Proof of The Central Limit Theorem. Our strategy will be to compute the MGF of Zₙ and exploit properties of the MGF (especially uniqueness) to show that it must have a standard Normal distribution! Assume, without loss of generality, that µ = 0 (otherwise, replace each Xᵢ with Xᵢ - µ).
Now, let:
$$Z_n = \frac{\bar{X}_n}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i$$
Note there is no typo above: the $\frac{1}{n}$ from X̄ₙ changes the division by $\sqrt{n}$ to a multiplication.
We will show $M_{Z_n}(t) \to e^{t^2/2}$ (the standard normal MGF), and hence Zₙ → N(0, 1) by uniqueness of the MGF.
1. First, for an arbitrary random variable Y, since the MGF exists in (-ε, ε) under "most" conditions, we can use the 2nd order Taylor series expansion around 0 (quadratic approximation to a function):
$$M_Y(s) \approx M_Y(0) \cdot \frac{s^0}{0!} + M_Y'(0) \cdot \frac{s^1}{1!} + M_Y''(0) \cdot \frac{s^2}{2!}$$
$$= E[Y^0] + E[Y]s + E[Y^2]\frac{s^2}{2} \qquad [\text{Since } M_Y^{(n)}(0) = E[Y^n]]$$
$$= 1 + E[Y]s + E[Y^2]\frac{s^2}{2} \qquad [\text{Since } Y^0 = 1]$$
2. Now, let M_X denote the common MGF of all the Xᵢ's (since they are iid).
$$M_{Z_n}(t) = M_{\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i}(t) \qquad [\text{Definition of } Z_n]$$
$$= M_{\sum_{i=1}^n X_i}\left(\frac{t}{\sigma\sqrt{n}}\right) \qquad \left[\text{By Property 2 of MGFs above, where } a = \frac{1}{\sigma\sqrt{n}},\ b = 0\right]$$
$$= \left[M_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \qquad [\text{By Property 3 of MGFs above}]$$
t
3. Recall Step 1, and now let Y = X and s = p
n
so we get a Taylor approximation of MX . Then:
⇣ ⌘2
✓ ◆ t
p
t t ⇥ ⇤ n
MX p ⇡ 1 + E [X] p + E X 2 [Step 1]
n n 2
t 2 ⇥ ⇤
=1+0+ 2 2 [Since E [X] = 0 and E X 2 = 2
]
2 n
t2 /2
=1+
n
4. Now we combine Steps 2 and 3:
$$M_{Z_n}(t) = \left[M_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \qquad [\text{step 2}]$$
$$\approx \left(1 + \frac{t^2/2}{n}\right)^n \qquad [\text{step 3}]$$
$$\to e^{t^2/2} \qquad \left[\text{Since } \left(1 + \frac{x}{n}\right)^n \to e^x\right]$$
Hence, Zₙ has the same MGF as that of a standard normal, so it must follow that distribution!
Chapter 6. Concentration Inequalities
It seems like we must have learned everything there possibly is - what else could go wrong? Sometimes we know only certain properties about a random variable (its mean and/or variance), but not its entire distribution. For example, the expected running time (number of comparisons) of the randomized QuickSort algorithm can be found using linearity of expectation and indicators. But what are the strongest guarantees we can make about a random variable without full knowledge of its distribution, if any?
6.1: Markov and Chebyshev Inequalities
Slides (Google Drive) Video (YouTube)
When reasoning about some random variable X, it's not always easy or possible to calculate/know its exact PMF/PDF. We might not know much about X (maybe just its mean and variance), but we can still provide concentration inequalities to get a bound on how likely it is for X to be far from its mean µ (of the form P(|X - µ| > α)), or how likely it is for this random variable to be very large (of the form P(X ≥ k)).
You might ask when we would only know the mean/variance but not the PMF/PDF? Some of the distributions that we use (like Exponential for bus waiting time) are just modelling assumptions and are probably incorrect. If we measured how long it took for the bus to arrive over many days, we could estimate its mean and variance! That is, we have no idea the true distribution of daily bus waiting times, but can get good estimates for the mean and variance. We can use these concentration inequalities to bound the probability that we wait too long for a bus, knowing just those two quantities and nothing else!
Example(s)
The score distribution of an exam is modelled by a random variable X with range Ω_X = [0, 110] (with 10 points for extra credit). Give an upper bound on the proportion of students who score at least 100 when the average is 50, and when the average is 25.
Solution What would you guess? If the average is E[X] = 50, an upper bound on the proportion of students who score at least 100 should be 50%, right? If more than 50% of students scored a 100 (or higher), the average would already be 50, since all scores must be nonnegative (≥ 0). Mathematically, we just argued that:
$$P(X \ge 100) \le \frac{E[X]}{100} = \frac{50}{100} = \frac{1}{2}$$
This sounds reasonable - if, say, 70% of the class were to get 100 or higher, the average would already be at least 70, even if everyone else got a zero. The best bound we can get is 50% - and that requires everyone else to get a zero.
If the average is E[X] = 25, an upper bound on the proportion of students who score at least 100 is:
$$P(X \ge 100) \le \frac{E[X]}{100} = \frac{25}{100} = \frac{1}{4}$$
Similarly, if we had more than 30% of students get 100 or higher, the average would already be at least 30, even if everyone else got a zero.
(Markov's Inequality) Let X ≥ 0 be a non-negative random variable (discrete or continuous), and let k > 0. Then:
$$P(X \ge k) \le \frac{E[X]}{k}$$
Equivalently (plugging in kE[X] for k above):
$$P(X \ge kE[X]) \le \frac{1}{k}$$
Proof of Markov’s Inequality. Below is the proof when X is continuous. The proof for discrete RVs is similar
(just change all the integrals into summations).
Z 1
E [X] = xfX (x)dx [because X 0]
0
Z k Z 1
= xfX (x)dx + xfX (x)dx [split integral at some 0 k 1]
0 k
Z "Z #
1 k
xfX (x)dx xfX (x)dx 0 because k 0, x 0 and fX (x) 0
k 0
Z 1
kfX (x)dx [because x k in the integral]
k
Z 1
=k fX (x)dx
k
= kP (X k)
So just knowing that the random variable is non-negative and knowing its expectation, we can bound the probability that it is "very large". We know nothing else about the exam distribution! Note there is no bound we can derive if X could be negative. Always check that X is indeed nonnegative before applying this bound!
The following example demonstrates how to use Markov’s inequality, and how loose it can be in some cases.
Example(s)
A coin is weighted so that its probability of landing on heads is 20%, independently of other flips.
Suppose the coin is flipped 20 times. Use Markov’s inequality to bound the probability it lands on
heads at least 16 times.
Solution We actually do know this distribution; the number of heads is X ∼ Bin(n = 20, p = 0.2). Thus, E[X] = np = 20 · 0.2 = 4. By Markov's inequality:

P(X ≥ 16) ≤ E[X]/16 = 4/16 = 1/4
Let’s compare this to the actual probability that this happens:
X20 ✓ ◆
20
P (X 16) = 0.2k · 0.820 k ⇡ 1.38 · 10 8
k
k=16
This is not a good bound, since we only assumed knowledge of the expected value. Again, we knew the exact distribution, but chose not to use any of that information (the variance, the PMF, etc.).
Example(s)
Suppose the expected runtime of QuickSort is 2n log(n) operations/comparisons (we can show this
using linearity of expectation with dependent indicator variables). Use Markov’s inequality to bound
the probability that QuickSort runs for longer than 20n log(n) time.
Solution Let X be the runtime of QuickSort, with E [X] = 2n log(n). Then, since X is non-negative, we can
use Markov’s inequality:
P(X ≥ 20n log(n)) ≤ E[X]/(20n log(n))   [Markov's inequality]
                  = 2n log(n)/(20n log(n))
                  = 1/10
So we know there’s at most 10% probability that QuickSort takes this long to run. Again, we can get this
bound despite not knowing anything except its expectation!
Theorem (Chebyshev's Inequality): Let X be any random variable with expected value µ = E[X] and finite variance Var(X). Then, for any real number α > 0:

P(|X − µ| ≥ α) ≤ Var(X)/α²

Equivalently (plugging in kσ for α above, where σ = √Var(X)):

P(|X − µ| ≥ kσ) ≤ 1/k²
This is used to bound the probability of being in the tails. Here is a picture of Chebyshev's inequality bounding the probability that a Gaussian X ∼ N(µ, σ²) is more than k = 2 standard deviations from its mean:
While in principle Chebyshev’s inequality asks about distance from the mean in either direction, it can still
be used to give a bound on how often a random variable can take large values, and will usually give much
better bounds than Markov’s inequality. This is expected, since we also assume to know the variance - and
if the variance is small, we know the RV can’t deviate too far from its mean.
Example(s)
Let's revisit the example from the Markov's inequality section earlier, in which we toss a weighted coin independently with probability of landing heads p = 0.2. Upper bound the probability it lands on heads at least 16 times out of 20 flips using Chebyshev's inequality.
Solution Again X ∼ Bin(20, 0.2) is the number of heads, so:

E[X] = np = 20 · 0.2 = 4
Var(X) = np(1 − p) = 20 · 0.2 · (1 − 0.2) = 3.2
Note that since Chebyshev's asks about the difference of the RV from its mean in either direction, we must weaken our statement first to include the probability that X ≤ −8. The reason we chose −8 is because Chebyshev's inequality is symmetric about the mean (difference of 12; 4 ± 12 gives the interval [−8, 16]):

P(X ≥ 16) ≤ P(X ≥ 16) + P(X ≤ −8) = P(|X − 4| ≥ 12) ≤ Var(X)/12² = 3.2/144 ≈ 0.0222

This is a much better bound than the one given by Markov's inequality, but still far from the actual probability. This is because Chebyshev's inequality only takes the mean and variance into account. There is so much more information about a RV than just these two quantities!
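Again, a quick numerical check never hurts; a minimal sketch in Python (same setup as before, standard library only):

    from math import comb

    n, p = 20, 0.2
    mu, var = n * p, n * p * (1 - p)          # 4 and 3.2
    chebyshev = var / (16 - mu) ** 2          # P(|X - 4| >= 12) <= 3.2/144
    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16, n + 1))
    print(chebyshev)  # ~0.0222
    print(exact)      # ~1.38e-08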
We can actually use Chebyshev's inequality to prove an important result from 5.7: the Weak Law of Large Numbers. The proof is so short!

Theorem (Weak Law of Large Numbers): Let X̄_n be the sample mean of n iid random variables, each with mean µ and finite variance σ². Then, for any ε > 0:

lim_{n→∞} P(|X̄_n − µ| > ε) = 0
Proof. By the properties of the expectation and variance of a sample mean of n iid variables: E[X̄_n] = µ and Var(X̄_n) = σ²/n (from 5.7). By Chebyshev's inequality:

lim_{n→∞} P(|X̄_n − µ| > ε) ≤ lim_{n→∞} Var(X̄_n)/ε² = lim_{n→∞} σ²/(nε²) = 0
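The bound σ²/(nε²) also tells you roughly how fast the convergence happens. Here is a minimal simulation sketch (not from the book; it assumes numpy, and the Unif(0, 1) choice is just for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    eps = 0.1
    for n in [10, 100, 1000]:
        # 10,000 sample means, each over n iid Unif(0,1) rvs (true mean 0.5)
        means = rng.random((10_000, n)).mean(axis=1)
        print(n, (np.abs(means - 0.5) > eps).mean())
    # the fraction of sample means farther than eps from 0.5 shrinks toward 0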
Chapter 6. Concentration Inequalities
6.2: The Chernoff Bound
Slides (Google Drive) Video (YouTube)
The more we know about a distribution, the stronger the concentration inequality we can derive. We know that Markov's inequality is weak, since we only use the expectation of a random variable to get the probability bound. Chebyshev's inequality is a bit stronger, because we incorporate the variance into the probability bound. However, as we showed in the example in 6.1, these bounds are still pretty "loose" (they are tight in some cases though).
What if we know even more - in particular, the PMF/PDF and hence the MGF? That will allow us to derive an even stronger bound. The Chernoff bound is derived using a combination of Markov's inequality and moment generating functions.
Let X be any random variable. Then e^{tX} is always a non-negative random variable. Thus, for any t > 0, using Markov's inequality and the definition of MGF:

P(X ≥ k) = P(e^{tX} ≥ e^{tk}) ≤ E[e^{tX}]/e^{tk} = M_X(t)/e^{tk}

(Note that the first step requires t > 0, otherwise it would change to P(e^{tX} ≤ e^{tk}). This is because e^t > 1 for t > 0, so we get something like 2^X, which is monotone increasing. If t < 0, then e^t < 1, so we get something like 0.3^X, which is monotone decreasing.)
Now the right-hand side holds for (uncountably) infinitely many t. For example, if we plugged in t = 0.5 we might get M_X(t)/e^{tk} = 0.53, and if we plugged in t = 3.26 we might get 0.21. Since P(X ≥ k) has to be less than all the possible values we get by plugging in different t > 0, it in particular must be less than the minimum of all the values.
P(X ≥ k) ≤ min_{t>0} M_X(t)/e^{tk}
This is good - if we can minimize the right-hand side, we can get a very tight/strong bound. We'll now focus our attention on deriving the Chernoff bound when X has a Binomial distribution. Everything above applies generally though.
The Chernoff bound will allow us to bound the probability that X is larger than some multiple of its mean, or smaller than some fraction of it. These are the tails of a distribution as you go farther in either direction from the mean. For example, we might want to bound the probability that X ≥ 1.5µ or X ≤ 0.1µ.
I think it's completely acceptable if you'd like not to read the proof, as it is very involved algebraically. You can still use the result regardless!
If X = Σ_{i=1}^n X_i where X₁, X₂, ..., X_n are iid variables, then the MGF of the (independent) sum equals the product of the MGFs. Taking our general result from above and using this fact, we get:

P(X ≥ k) ≤ min_{t>0} M_X(t)/e^{tk} = min_{t>0} (∏_{i=1}^n M_{X_i}(t))/e^{tk}
Let’s derive a Cherno↵ bound for X ⇠ Bin(n, p), which has the form P (X (1 + )µ) for > 0. For example
with = 4, you may want to bound P (X 5E [X]).
Pn
Recall X = i=1 Xi where Xi ⇠ Ber(p) are iid, with µ = E[X] = np.
M_{X_i}(t) = E[e^{tX_i}]   [def of MGF]
           = e^{t·1}·p_{X_i}(1) + e^{t·0}·p_{X_i}(0)   [LOTUS]
           = pe^t + (1 − p)   [X_i ∼ Ber(p)]
           = 1 + p(e^t − 1)
           ≤ e^{p(e^t − 1)}   [1 + x ≤ e^x with x = p(e^t − 1)]
See here for a pictorial proof that 1 + x ≤ e^x for any real number x (just plot the two functions). Alternatively, use the Taylor series for e^x to argue this. We use this bound for algebraic convenience coming up soon.
Now using the result from earlier and plugging in the MGF of the Ber(p) distribution, we get:

P(X ≥ k) ≤ min_{t>0} (∏_{i=1}^n M_{X_i}(t))/e^{tk}   [from earlier]
         ≤ min_{t>0} (e^{p(e^t − 1)})ⁿ/e^{tk}   [MGF bound of Ber(p), n times]
         = min_{t>0} e^{np(e^t − 1)}/e^{tk}   [algebra]
         = min_{t>0} e^{µ(e^t − 1)}/e^{tk}   [µ = np]
For our bound, we want something like P(X ≥ (1 + δ)µ), so our k = (1 + δ)µ. To minimize the RHS and get the tightest bound, the best choice turns out to be t = ln(1 + δ), after some terrible algebra (take the derivative and set it to 0). We simply plug in k and our optimal value of t into the above equation:

P(X ≥ (1 + δ)µ) ≤ e^{µ(e^{ln(1+δ)} − 1)}/e^{(1+δ)µ·ln(1+δ)} = e^{µ((1+δ) − 1)}/(e^{ln(1+δ)})^{(1+δ)µ} = e^{δµ}/(1 + δ)^{(1+δ)µ} = (e^δ/(1 + δ)^{(1+δ)})^µ
Again, we wanted to choose t that minimizes our upper bound for the tail probability. Taking the derivative
with respect to t tells us we should plug in t = ln(1 + ) to minimize that quantity. This would actually be
pretty annoying to plug into a calculator.
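If you'd rather let the computer do the minimization, a crude grid search over t recovers the same optimizer; here is a minimal sketch in Python (not from the book; µ = 4.8 and δ = 2/3 are borrowed from an example later in this section):

    import math

    mu, delta = 4.8, 2/3
    k = (1 + delta) * mu

    def bound(t):
        return math.exp(mu * (math.exp(t) - 1)) / math.exp(t * k)

    ts = [i / 10_000 for i in range(1, 30_000)]   # grid over t in (0, 3)
    best_t = min(ts, key=bound)
    print(best_t, math.log(1 + delta))  # both ~0.5108
    print(bound(best_t))                # the optimized bound, ~0.412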
We actually can show that the final RHS is exp(−δ²µ/(2 + δ)) with some more messy algebra. Additionally, if we restrict 0 < δ < 1, we can simplify this even more to the bound provided earlier:

P(X ≥ (1 + δ)µ) ≤ exp(−δ²µ/3)
The proof of the lower tail is entirely analogous, except optimizing over t < 0, where the inequality flips. It proceeds by taking t = ln(1 − δ). We also get a lower tail bound:

P(X ≤ (1 − δ)µ) ≤ (e^{−δ}/(1 − δ)^{(1−δ)})^µ ≤ (e^{−δ}/e^{−δ+δ²/2})^µ = exp(−δ²µ/2)
You may wonder: why are we bounding P(X ≥ (1 + δ)µ) when we can just sum the PMF of a binomial to get an exact answer? The reason is that it is very computationally expensive to compute the binomial PMF! For example, if X ∼ Bin(n = 20000, p = 0.1), then by plugging into the PMF, we get

P(X = 13333) = C(20000, 13333) · 0.1^{13333} · 0.9^{20000−13333} = (20000!/(13333!·(20000 − 13333)!)) · 0.1^{13333} · 0.9^{20000−13333}

(Actually, n = 20000 isn't even that large.) You have to multiply 20,000 numbers for the last two terms, and they multiply to a number that is infinitesimally small. For the first term (the binomial coefficient), computing 20000! directly is hopeless - in fact, it is so large you can't even imagine. You would have to cleverly interleave multiplying the factorial terms with the probabilities to keep the value in an acceptable range for the computer. Then, sum up a bunch of these....
This is why we have/need the Poisson approximation, the Normal approximation (CLT), and the Chernoff bound for the Binomial!
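For what it's worth, the standard computational trick is to do everything in log space; here is a minimal sketch in Python (not the book's method, just an illustration; math.lgamma(x) computes ln Γ(x), and Γ(n + 1) = n!):

    from math import lgamma, log

    def log_binom_pmf(n, p, k):
        # ln C(n, k) + k ln p + (n - k) ln(1 - p), all safely in log space
        log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_coeff + k * log(p) + (n - k) * log(1 - p)

    # about -18700: the probability itself would underflow any float,
    # but its logarithm is perfectly manageable
    print(log_binom_pmf(20000, 0.1, 13333))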
Example(s)
Suppose X ∼ Bin(500, 0.2). Use Markov's inequality and the Chernoff bound to bound P(X ≥ 150), and compare the results.
Solution We have:

E[X] = np = 500 · 0.2 = 100
Var(X) = np(1 − p) = 500 · 0.2 · 0.8 = 80

By Markov's inequality, P(X ≥ 150) ≤ E[X]/150 = 100/150 = 2/3. For the Chernoff bound, we want 150 = (1 + δ)µ = (1 + δ) · 100, so δ = 0.5, and

P(X ≥ 150) = P(X ≥ (1 + 0.5) · 100) ≤ exp(−(0.5)² · 100/3) ≈ 0.0002

The Chernoff bound is much stronger! It isn't necessarily a fair comparison, because the Chernoff bound required knowing the MGF (and hence the distribution), whereas Markov only required knowing the mean (and that X is non-negative).
These examples give you an overall comparison of all three inequalities we learned so far!
Example(s)
Suppose the number of red lights Alex encounters each day on the way to work is on average 4.8 (according to historical trips to work). Alex will really be late if he encounters 8 or more red lights. Let X be the number of red lights he gets on a given day.
1. Give a bound for P(X ≥ 8) using Markov's inequality.
2. Give a bound for P(X ≥ 8) using Chebyshev's inequality, if we also assume Var(X) = 2.88.
3. Give a bound for P(X ≥ 8) using the Chernoff bound. Assume that X ∼ Bin(12, 0.4) - that there are 12 traffic lights, and each is independently red with probability 0.4.
4. Compute P(X ≥ 8) exactly using the assumption from the previous part.
5. Compare the three bounds and their assumptions.
1. Since X is nonnegative and we know its expectation, we can apply Markov's inequality:

P(X ≥ 8) ≤ E[X]/8 = 4.8/8 = 0.6
2. Since we know X’s variance, we can apply Chebyshevs inequality after some manipulation. We have
to do this to match the form required:
The reason we chose 1.6 is so it looks like P (|X µ| ↵). Now, applying Chebyshev’s gives:
3. Actually, X ∼ Bin(12, 0.4) also has E[X] = np = 4.8 and Var(X) = np(1 − p) = 2.88 (what a coincidence). The Chernoff bound requires something of the form P(X ≥ (1 + δ)µ), so we first need to solve for δ: (1 + δ)4.8 = 8, so δ = 2/3. Now,

P(X ≥ 8) = P(X ≥ (1 + 2/3) · 4.8) ≤ exp(−(2/3)² · 4.8/3) ≈ 0.4911

4. Using the Binomial PMF directly,

P(X ≥ 8) = Σ_{k=8}^{12} C(12, k) · 0.4^k · 0.6^{12−k} ≈ 0.0573
5. Actually it’s usually the case that the bounds are tighter/better as we move down the list Markov,
Chebyshev, Cherno↵. But in this case Chebyshev’s gave us the tightest bound, even after being
weakened by including some additional P (X 1.6). Cherno↵ bounds will typically be better for
farther tails - 8 isn’t considered too far from the mean 4.8.
It’s also important to note that we found out more information progressively - we can’t blindly apply
all these inequalities every time. We need to make sure the conditions for the bound being valid are
satisfied.
Even our best bound of 0.28125 was 5-6x larger than the true probability of 0.0573.
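To tie the whole example together, here is a minimal sketch in Python computing all three bounds and the exact probability (standard library only; not from the book itself):

    from math import comb, exp

    n, p, k = 12, 0.4, 8
    mu, var = n * p, n * p * (1 - p)            # 4.8 and 2.88
    markov = mu / k                             # 0.6
    chebyshev = var / (k - mu) ** 2             # 2.88 / 3.2^2 = 0.28125
    delta = k / mu - 1                          # 2/3
    chernoff = exp(-delta**2 * mu / 3)          # ~0.4911
    exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    print(markov, chebyshev, chernoff, exact)   # 0.6, 0.28125, ~0.4911, ~0.0573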
Chapter 6. Concentration Inequalities
6.3: Even More Inequalities
Slides (Google Drive) Video (YouTube)
In this section, we will talk about a potpourri of remaining concentration bounds. More specifically, the union bound, Jensen's inequality for convex functions, and Hoeffding's inequality.
The intuition for the union bound is fairly simple. Suppose we have two events A and B. Then P(A ∪ B) ≤ P(A) + P(B), since the event spaces of A and B may overlap:
The union bound, though seemingly trivial, can actually be quite useful.
Example(s)
This will relate to the earlier question of bounding the probability of at least one bad event happening.
Suppose the probability Alex is late to teaching class on a given day is at most 0.01. Bound
the probability that Alex is late at least once over a 30-class quarter. Do not make any independence
assumptions.
Solution
Let A_i be the event that Alex is late to class on day i, for i = 1, ..., 30. Then, by the union bound,

P(late at least once) = P(∪_{i=1}^{30} A_i)
                      ≤ Σ_{i=1}^{30} P(A_i)   [union bound]
                      ≤ Σ_{i=1}^{30} 0.01   [P(A_i) ≤ 0.01]
                      = 0.30
Sometimes it may be useless though; imagine I asked instead about a 200-day period. Then the union bound would've given me a bound of 2.0, which is not helpful since probabilities are at most 1 already...
Let’s look at some examples of convex (left) and non-convex (right) sets:
The sets on the left hand side are said to be convex because if you take any two points in the set and
draw the line segment between them, it is always contained in the set. The sets on the right hand side are
non-convex because I found two endpoints in the set, but the line segment connecting them is not completely
contained in the set.
How can we describe this mathematically? Well, for any two points x, y ∈ S, the set of points between them must be entirely contained in S. The set of points making up the line segment between two points x, y can be described as a weighted average (1 − p)x + py for p ∈ [0, 1]. If p = 0, we just get x; if p = 1, we just get y; and if p = 1/2, we get the midpoint (x + y)/2. So p controls the fraction of the way we are from x to y.
Equivalently, for any points x₁, ..., x_m ∈ S, the convex polyhedron formed by the "corners" is contained in S. (This sounds complicated, but if m = 3, it just says the triangle formed by the 3 corners completely lies in the set S. If m = 4, the quadrilateral formed by the 4 corners completely lies in the set S.) The points in the convex polyhedron are described by taking weighted averages of the points, where the weights are non-negative and sum to 1. (This should remind you of a probability distribution!)

{Σ_{i=1}^m p_i·x_i : p₁, ..., p_m ≥ 0 and Σ_{i=1}^m p_i = 1} ⊆ S
Now, onto convex functions. Let’s take a look at some convex (top) and non-convex (bottom) functions:
The functions on the top (convex) have the property that, for any two points on the function curve, the
line segment connecting them lies above the function always. The functions on the bottom don’t have this
property: you can see that some or all of the line segment is below the function.
Let’s try to formalize what this means. For the convex function g(t) = t2 below, we can see that any line
drawn connecting 2 points of the function clearly lies above the function itself and so it is convex. Look at
any two points on the curve g(x) and g(y). Pick a point on the x-axis between x and y, call it (1 p)x + py
where p 2 [0, 1]. The function value at this point is g((1 p)x + py). The corresponding point above it on
the line segment connecting g(x) and g(y) is actually the weighted average (1 p)g(x) + pg(y). Hence, a
function g is convex if it satisfies the following for any x, y and p 2 [0, 1]: g((1 p)x+py) (1 p)g(x)+pg(y)
Let S ⊆ Rⁿ be a convex set (a convex function must have a convex set as its domain). A function g : S → R is a convex function if for any line segment connecting g(x) and g(y), the function g lies entirely below the line. Mathematically, for any p ∈ [0, 1] and x, y ∈ S,

g((1 − p)x + py) ≤ (1 − p)g(x) + pg(y)

Equivalently, for any m points x₁, ..., x_m ∈ S, and p₁, ..., p_m ≥ 0 such that Σ_{i=1}^m p_i = 1,

g(Σ_{i=1}^m p_i·x_i) ≤ Σ_{i=1}^m p_i·g(x_i)
Proof of Jensen’s Inequality. We will only prove it in the case X is a discrete random variable (not a random
vector), and with finite range (not countably infinite). However, this inequality does hold for any random
variable.
The proof follows immediately from the definition of a convex function. Since X has finite range, let
⌦X = {x1 , ..., xn } and pX (xi ) = pi . By definition of a convex function (see above),
n
!
X
g(E [X]) = g pi x i [def of expectation]
i=1
n
X
pi g(xi ) [def of convex function]
i=1
= E [g(X)] [LOTUS]
Example(s)
Show that the variance of any random variable X is always non-negative using Jensen's inequality.

Solution We already know that Var(X) = E[(X − µ)²] ≥ 0 since (X − µ)² is a non-negative RV, but let's prove it a different way. The function g(t) = t² is convex, so by Jensen's inequality, E[X²] = E[g(X)] ≥ g(E[X]) = (E[X])². Hence Var(X) = E[X²] − (E[X])² ≥ 0.
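As a numerical sanity check of Jensen's inequality with g(t) = t², here is a minimal sketch (not from the book; it assumes numpy, and the Exponential choice is arbitrary - any rv and any convex g would do):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    x = rng.exponential(scale=2.0, size=100_000)  # samples with E[X] = 2, E[X^2] = 8
    g = lambda t: t ** 2                          # a convex function
    print(g(x.mean()), g(x).mean())               # g(E[X]) <= E[g(X)]: ~4 vs ~8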
Theorem (Hoeffding's Inequality): Let X₁, ..., X_n be independent random variables, where each X_i is bounded: a_i ≤ X_i ≤ b_i, and let X̄_n be their sample mean. Then,

P(|X̄_n − E[X̄_n]| ≥ t) ≤ 2·exp(−2n²t²/Σ_{i=1}^n (b_i − a_i)²)

where exp(x) = e^x.
In the case that X₁, ..., X_n are iid (so a ≤ X_i ≤ b for all i) with mean µ, then

P(|X̄_n − µ| ≥ t) ≤ 2·exp(−2n²t²/(n(b − a)²)) = 2·exp(−2nt²/(b − a)²)
Example(s)
Suppose an email company ColdMail is responsible for delivering 100 emails per day. ColdMail has
a bad day if it takes longer than 190 seconds to deliver all 100 emails, and a bad week if there is
even one bad day in the week.
The time it takes to send an email is on average 1 second, with a worst-case time of 5 seconds, independently of other emails. (Note we don't know anything else, like its PDF.)
1. Give an upper bound for the probability that ColdMail has a bad day.
2. Give an upper bound for the probability that ColdMail has a bad week.
Solution
1. In this scenario, we may use Hoeffding's inequality, since we have X₁, ..., X₁₀₀, the (independent) times to send each email, bounded in the interval [0, 5] seconds, with E[X̄₁₀₀] = 1. Asking for the total time to be at least 190 seconds is the same as asking for the mean time to be at least 1.9 seconds. Like we did for Chebyshev, we have to massage (and weaken) the statement a little bit to get it in the form required for Hoeffding's:

P(X̄₁₀₀ ≥ 1.9) ≤ P(|X̄₁₀₀ − 1| ≥ 0.9) ≤ 2·exp(−2 · 100 · 0.9²/(5 − 0)²) = 2·exp(−6.48) ≈ 0.00307

2. A bad week means at least one bad day, so by the union bound over the 7 days of the week:

P(bad week) ≤ 7 · 0.00307 ≈ 0.0215
You might be tempted to use the CLT (and you should when you can), as it would probably give a better bound than Hoeffding's. But we don't know the variances here, so we wouldn't know which Normal to use. Hoeffding's gives us a way!
Chapter 7. Statistical Estimation
Now we’ve hit a real turning point in the course. What we’ve been doing so far is “probability”, and
the remaining two chapters of the course will be about “statistics”. In the real world, we’re often not given
the true probability of heads p, or average rate of babies being born per minute . In today’s world, data
is being collected faster than ever! How can we use data to estimate these quantities of interest? We’ll
start with more mundane examples, such as: If I flip a coin (with unknown probability of heads) ten times
independently and I observe seven heads, why is 7/10 the “best” estimate for the probability of heads? We’ll
learn several techniques for estimating quantities, and talk about several properties that allow us to compare
them for “goodness”.
Chapter 7. Statistical Estimation
7.1: Maximum Likelihood Estimation
Slides (Google Drive) Video (YouTube)
What we’re going to focus now is going the opposite way. Given a coin with unknown probability of heads
is, I flip it a few times and I get THHTHH. How can I use this data to predict/estimate this value of p?
7.1.2 Likelihood
Let’s say I give you and your classmates each 5 minutes with a coin with unknown probability of heads p.
Whoever has the closest estimate will get an A+ in the class. What do you do in your precious 5 minutes,
and what do you give as your estimate?
I don’t know about you, but I would flip the coin as many times as I can, and return the total number of
heads over the total number of flips, or
Heads
Heads + Tails
which actually turns out to be a really good estimate.
To make things concrete, let's say you saw 4 heads and 1 tail. You tell me that p̂ = 4/5 (the hat above the p just means it is an estimate). How can you argue, objectively, that this is the "best" estimate?
Is there some objective function that it maximizes? It turns out yes: 4/5 maximizes this blue curve, which is called the likelihood of the data. The x-axis has the different possible values of p, and the y-axis has the probability of seeing the data if the coin had probability of heads p.
You assume a model (Bernoulli in our case) with unknown parameter θ (the probability of heads), and receive iid samples x = (x₁, ..., x_n) ∼ Ber(θ) (in this example, each x_i is either 1 or 0). The likelihood of the data given a parameter θ is defined as the probability of seeing the data, given θ.
A realization/sample x of a random variable X is the value that is actually observed (it will always be in Ω_X).
For example, for a Bernoulli, a realization is either 0 or 1, and for a Geometric, it is some positive integer ≥ 1.
Let x = (x₁, ..., x_n) be iid samples from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the likelihood of x given θ to be the "probability" of observing x if the true parameter is θ.
If X is discrete,

L(x | θ) = ∏_{i=1}^n p_X(x_i | θ)

If X is continuous,

L(x | θ) = ∏_{i=1}^n f_X(x_i | θ)
In the continuous case, we have to multiply densities, because the probability of seeing any particular value of a continuous random variable is always 0. We can do this because the density preserves relative probabilities; i.e., P(X ≈ u)/P(X ≈ v) ≈ f_X(u)/f_X(v). For example, if X ∼ N(µ = 3, σ² = 5), the realization x = 503.22 has much lower density/likelihood than x = 3.12.
Example(s)
Give the likelihoods for each of the samples, and take a guess at which value of θ maximizes the likelihood!
1. Suppose x = (x₁, x₂, x₃) = (1, 0, 1) are iid samples from Ber(θ) (recall θ is the probability of a success).
2. Suppose x = (x₁, x₂, x₃, x₄) = (3, 0, 2, 7) are iid samples from Poi(θ) (recall θ is the historical average number of events in a unit of time).
3. Suppose x = (x₁, x₂, x₃) = (3.22, 1.81, 2.47) are iid samples from Exp(θ) (recall θ is the historical average number of events in a unit of time).
Solution
1. The samples mean we got a success, then a failure, then a success. The likelihood is the "probability" of observing the data.

L(x | θ) = ∏_{i=1}^3 p_X(x_i | θ) = p_X(1 | θ) · p_X(0 | θ) · p_X(1 | θ) = θ(1 − θ)θ = θ²(1 − θ)

Since we observed two successes out of three trials, my guess for the maximum likelihood estimate would be θ̂ = 2/3.
2. The samples mean we observed 3 events in the first unit of time, then 0 in the second, then 2 in the third, then 7 in the fourth. The likelihood is the "probability" of observing the data (just multiplying Poisson PMFs p_X(k | λ) = e^{−λ}·λ^k/k!).

L(x | θ) = ∏_{i=1}^4 p_X(x_i | θ) = p_X(3 | θ) · p_X(0 | θ) · p_X(2 | θ) · p_X(7 | θ)
         = (e^{−θ}θ³/3!) · (e^{−θ}θ⁰/0!) · (e^{−θ}θ²/2!) · (e^{−θ}θ⁷/7!)

Since there were a total of 3 + 0 + 2 + 7 = 12 events over 4 units of time (samples), my guess for the maximum likelihood estimate would be θ̂ = 12/4 = 3 events per unit time.
3. The samples mean we waited until three events happened, and it took 3.22 units of time until the first event, 1.81 until the second, and 2.47 until the third. The likelihood is the "probability" of observing the data (just multiplying Exponential PDFs f_X(y | λ) = λe^{−λy}).

L(x | θ) = ∏_{i=1}^3 f_X(x_i | θ) = f_X(x₁ | θ) · f_X(x₂ | θ) · f_X(x₃ | θ) = θe^{−3.22θ} · θe^{−1.81θ} · θe^{−2.47θ}
In the previous three scenarios, we set up the likelihood of the data. Now, the only thing left to do is find out which value of θ maximizes the likelihood. Everything else in this section is just explaining how to use calculus to optimize this likelihood! There is no more "probability" or "statistics" involved in the remaining pages.
Before we move on, we have to go back and review calculus really quickly. How do we optimize a function? Each of these three points is a local optimum; what do they have in common? Their derivative is 0. We're going to try to set the derivative of our likelihood to 0, so we can solve for the optimal value.
Example(s)
Suppose x = (x₁, x₂, x₃, x₄, x₅) = (1, 1, 1, 1, 0) are iid samples from the Ber(θ) distribution with unknown parameter θ. Find the maximum likelihood estimator θ̂ of θ.
Solution The data (1, 1, 1, 1, 0) can be thought of as the sequence HHHHT, which has likelihood (assuming independent flips):

L(HHHHT | θ) = θ⁴(1 − θ) = θ⁴ − θ⁵
The plot of the likelihood, with θ on the x-axis and L(HHHHT | θ) on the y-axis, is (copied from above): we can actually see that the θ which maximizes the likelihood is θ̂ = 4/5. But sometimes we can't plot the likelihood, so we will now solve for this analytically.
We want to find the θ which maximizes this likelihood, so we take the derivative with respect to θ and set it to 0:

∂/∂θ L(x | θ) = 4θ³ − 5θ⁴ = θ³(4 − 5θ)

Now, when we set the derivative to 0 (remember the optimum points occur when the derivative is 0), we replace θ with θ̂ because we are now estimating θ. After solving for θ̂, we end up with

θ̂³(4 − 5θ̂) = 0 → θ̂ = 4/5 or 0
We switch θ to θ̂ when we set the derivative to 0, as that is when we start estimating. To see which is the maximizer, you can just plug in the candidates (0 and 4/5) and the endpoints (0 and 1: the min and max possible values of θ)! That is, compute the likelihood at 0, 4/5, and 1, and see which is largest.
To summarize, we defined θ̂_MLE = arg max_θ L(x | θ), the argument (input) θ that maximizes the likelihood function. The difference between max and argmax is as follows. Here is a function,

f(x) = 1 − x²

where the maximum value is 1; it's the highest value this function could ever achieve. The argmax, on the other hand, is 0, because argmax just means the argument (input) that maximizes the function. So which x actually achieved f(x) = 1? Well, that was x = 0. And so, in MLE, we're trying to find the θ that maximizes the likelihood, and we don't care what the maximum value of the likelihood is. We didn't even compute it! We just care that the argmax is 4/5.
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the maximum likelihood estimator θ̂_MLE of θ to be the parameter which maximizes the likelihood (or equivalently, the log-likelihood) of the data.
Taking the log of a product (such as the likelihood) results in the sum of logs because of log properties:
log(a · b · c) = log(a) + log(b) + log(c)
We see now why we might want to take the log of the likelihood before differentiating it, but why can we? Below there are two images: the left image is a function, and the right image is the log of that function. The values are different (see the y-axis), but if you look at the x-axis, both functions happen to be maximized at 1 (the argmaxes are the same). Log is a monotone increasing function, so it preserves order; whatever was the maximizer (argmax) of the original function will also be the maximizer of the log function. See below for what happens when you apply the natural log (ln) to a product in our likelihood scenario, and see the next section 7.2 for examples of maximum likelihood estimation in action.
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the log-likelihood of x given θ to be the log of the likelihood.
If X is discrete,

ln L(x | θ) = Σ_{i=1}^n ln p_X(x_i | θ)

If X is continuous,

ln L(x | θ) = Σ_{i=1}^n ln f_X(x_i | θ)
Chapter 7. Statistical Estimation
7.2: Maximum Likelihood Examples
Slides (Google Drive) Video (YouTube)
We spend an entire section just doing examples because maximum likelihood is such a fundamental concept used everywhere (especially in machine learning). I promise that the idea is simple: find the θ that maximizes the likelihood of the data. The computation and notation can be confusing at first though.
Let’s say x1 , x2 , ..., xn are iid samples from Poi(✓). (These values might look like x1 = 13, x2 = 5, x3 =
6, etc...) What is the MLE of ✓?
Solution Remember that we discussed that the sample mean might be a good estimate of ✓. If we observed
20 events over 5 units of time, a good estimate for , the average number of events per unit of time, would
be 20
5 = 4. This turns out to be the maximum likelihood estimate!
Let’s follow the recipe provided in 7.1.
1. Compute the likelihood and log-likelihood of the data. To do this, we take the following product of the Poisson PMFs at each sample x_i, over all the data points:

L(x | θ) = ∏_{i=1}^n p_X(x_i | θ) = ∏_{i=1}^n e^{−θ}·θ^{x_i}/x_i!
Again, this is the probability of seeing x₁, then x₂, and so on. This function is pretty hard to differentiate, so to make it easier, let's compute the log-likelihood instead, using log properties. In most cases, we'll want to optimize the log-likelihood instead of the likelihood (since we don't want to use the product rule of calculus)!
ln L(x | θ) = ln ∏_{i=1}^n e^{−θ}·θ^{x_i}/x_i!   [def of likelihood]
            = Σ_{i=1}^n ln(e^{−θ}·θ^{x_i}/x_i!)   [log of product is sum of logs]
            = Σ_{i=1}^n [ln(e^{−θ}) + ln(θ^{x_i}) − ln(x_i!)]   [log of product is sum of logs]
            = Σ_{i=1}^n [−θ + x_i·ln(θ) − ln(x_i!)]   [other log properties]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).
Now we want to take the derivative of the log-likelihood with respect to θ. The derivative of −θ is just −1, and the derivative of x_i·ln(θ) is just x_i/θ, because remember x_i is a constant with respect to θ.

∂/∂θ ln L(x | θ) = Σ_{i=1}^n [−1 + x_i/θ]

Setting this to 0 (replacing θ with θ̂):

Σ_{i=1}^n [−1 + x_i/θ̂] = 0 → −n + (1/θ̂)·Σ_{i=1}^n x_i = 0 → θ̂ = (1/n)·Σ_{i=1}^n x_i
3. Optionally, verify θ̂_MLE is indeed a (local) maximizer by checking that the second derivative at θ̂_MLE is negative (if θ is a single parameter), or that the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters).
We want to take the second derivative as well, because otherwise we don't know if this is a maximum or a minimum. We differentiate the first derivative Σ_{i=1}^n [−1 + x_i/θ] again with respect to θ, and we notice that because θ² is always positive, the negative of x_i/θ² is always negative, so the second derivative is always less than 0, which means the function is concave down everywhere. This means that anywhere the derivative is zero is a global maximum, so we've successfully found the global maximum of our likelihood equation.

∂²/∂θ² ln L(x | θ) = Σ_{i=1}^n [−x_i/θ²] < 0 → concave down everywhere
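To see the calculus and the intuition agree, here is a minimal sketch in Python (not from the book; the sample values are made up for illustration, and math.lgamma(x + 1) = ln(x!)):

    import math

    xs = [13, 5, 6, 8, 2]   # hypothetical Poisson samples

    def log_likelihood(theta):
        # sum of ln(e^{-theta} theta^x / x!)
        return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in xs)

    thetas = [i / 100 for i in range(1, 2001)]   # grid search over (0, 20]
    theta_hat = max(thetas, key=log_likelihood)
    print(theta_hat, sum(xs) / len(xs))          # both 6.8: the sample mean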
Let’s say x1 , x2 , ..., xn are iid samples from Exp(✓). (These values might look like x1 = 1.354, x2 =
3.198, x3 = 4.312, etc...) What is the MLE of ✓?
Solution Now that we’ve seen one example, we’ll just follow the procedure given in the previous section.
1. Compute the likelihood and log-likelihood of the data.
Since we have a continuous distribution, our likelihood is the product of the PDFs:

L(x | θ) = ∏_{i=1}^n f_X(x_i | θ) = ∏_{i=1}^n θe^{−θx_i}

The log-likelihood is

ln L(x | θ) = Σ_{i=1}^n ln(θe^{−θx_i}) = Σ_{i=1}^n [ln(θ) − θx_i]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).

∂/∂θ ln L(x | θ) = Σ_{i=1}^n [1/θ − x_i]

Now, we set the derivative to 0 and solve (here we replace θ with θ̂):

Σ_{i=1}^n [1/θ̂ − x_i] = 0 → n/θ̂ − Σ_{i=1}^n x_i = 0 → θ̂ = n/Σ_{i=1}^n x_i

This is just the inverse of the sample mean! This makes sense because if the average waiting time was 1/2 an hour, then the average rate per unit of time should be 1/(1/2) = 2 per hour!
3. Optionally, verify θ̂_MLE is indeed a (local) maximizer by checking that the second derivative at θ̂_MLE is negative (if θ is a single parameter), or that the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters). The second derivative of the log-likelihood just requires us to take one more derivative:

∂²/∂θ² ln L(x | θ) = Σ_{i=1}^n [−1/θ²] < 0
Since the second derivative is negative everywhere, the function is concave down, and any critical point
is a global maximum!
Let’s say x1 , x2 , ..., xn are iid samples from (continuous) Unif(0, ✓). (These values might look like
x1 = 2.325, x2 = 1.1242, x3 = 9.262, etc...) What is the MLE of ✓?
Solution It turns out our usual procedure won’t work on this example, unfortunately. We’ll explain why
once we run into the problem!
To compute the likelihood, we first need the individual density functions. Recall

f_X(x | θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise

Let's actually define an indicator function for whether or not some boolean condition A is true or false:

I_A = 1 if A is true, and 0 if A is false

This way, we can rewrite the uniform density in one line (1/θ for 0 ≤ x ≤ θ and 0 otherwise):

f_X(x | θ) = (1/θ)·I_{0≤x≤θ}
First, we take the product over all data points of the density at that data point, and plug in the density of the uniform distribution. How do we simplify this? First, we notice that every term in the product contains a 1/θ, so multiplying it by itself n times gives 1/θⁿ. How do we multiply indicators? For a product of 1's and 0's to be 1, they ALL have to be 1. So,

L(x | θ) = ∏_{i=1}^n (1/θ)·I_{0≤x_i≤θ} = (1/θⁿ)·I_{0≤x₁,...,x_n≤θ}

We could take the log-likelihood before differentiating, but this function isn't too bad-looking, so let's take the derivative of it directly. The indicator I_{0≤x₁,...,x_n≤θ} just says the function is 1/θⁿ when the condition is true and 0 otherwise. So our derivative will just be the derivative of 1/θⁿ when that condition is true, and 0 otherwise.

d/dθ L(x | θ) = (−n/θ^{n+1})·I_{0≤x₁,...,x_n≤θ}

−n/θ^{n+1} = 0 → θ = ???
There seems to be no value of θ that solves this - what's going on? Let's plot the likelihood. First, we plot just 1/θⁿ (not quite the likelihood) with θ on the x-axis: if we wanted to maximize this function alone, we should choose θ as close to 0 as possible. But remember that the likelihood was (1/θⁿ)·I_{0≤x₁,...,x_n≤θ}, which can also be written as (1/θⁿ)·I_{x_max≤θ}, because all the samples are ≤ θ if and only if the maximum is. Below is the graph of the actual likelihood:
Notice that multiplying by the indicator function kept the function as-is when the condition x_max ≤ θ was true, but zeroed it out otherwise. So now we can see that our maximum likelihood estimator should be θ̂_MLE = x_max = max{x₁, x₂, ..., x_n}, since that is where the likelihood achieves its highest value.
Why? Remember x₁, ..., x_n ∼ Unif(0, θ), so θ has to be at least as large as the biggest x_i, because otherwise it would have been impossible for that uniform to produce the largest x_i. For example, if our samples were x₁ = 2.53, x₂ = 8.55, x₃ = 4.12, our θ has to be at least 8.55 (the maximum sample), because if it were 7, for example, then Unif(0, 7) could not possibly generate the sample 8.55.
So our likelihood (remember, 1/θⁿ) would have preferred as small a θ as possible, but subject to θ ≥ x_max. Therefore the "compromise" was reached by making them equal!
I’d like to point out this is a special case because the range of the uniform distribution depends on its
parameter(s) a, b (the range of Unif(a, b) is [a, b]). On the other hand, most of our distributions like Poisson
or Exponential have the same range no matter what value the value of their parameters. For example, the
range of Poi( ) is always {0, 1, 2, . . . } and the range of Exp( ) is always [0, 1), independent of .
Therefore, most MLE problems will be similar to the first two examples rather than this complicated one!
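Here is a minimal sketch of this uniform likelihood in Python (not from the book; it uses the three samples from the discussion above):

    xs = [2.53, 8.55, 4.12]

    def likelihood(theta):
        # (1/theta^n) * indicator(all samples <= theta)
        return theta ** (-len(xs)) if all(x <= theta for x in xs) else 0.0

    for theta in [7.0, 8.55, 10.0]:
        print(theta, likelihood(theta))
    # theta = 7.0 gives likelihood 0 (it couldn't have generated 8.55);
    # theta = 8.55 = max(xs) beats any larger theta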
Chapter 7. Statistical Estimation
7.3: Method of Moments Estimation
Slides (Google Drive) Video (YouTube)
Usually, we are interested in the first moment of X, µ = E[X], and the second moment of X about µ, Var(X) = E[(X − µ)²].
Now since we are in the statistics portion of the class, we will define a sample moment.
Let X be a random variable, and c ∈ R a scalar. Let x₁, ..., x_n be iid realizations (samples) from X. The kth sample moment of X is

(1/n)·Σ_{i=1}^n x_i^k

For example, the first sample moment is just the sample mean, and the second sample moment about the sample mean is the sample variance.
Suppose we only need to estimate one parameter θ (you might have to estimate two, for example θ = (µ, σ²) for the N(µ, σ²) distribution). The idea behind Method of Moments (MoM) estimation is that, to find a good estimator, we should have the true and sample moments match as well as possible. That is, I should choose the parameter θ such that the first true moment E[X] is equal to the first sample moment x̄. Examples always make things clearer!
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Unif(0, θ) (continuous). (These values might look like x₁ = 3.21, x₂ = 5.11, x₃ = 4.33, etc.) What is the MoM estimator of θ?
Solution We set the first true moment equal to the first sample moment (recall that E[Unif(a, b)] = (a + b)/2):

E[X] = θ/2 = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = (2/n)·Σ_{i=1}^n x_i
This estimator makes sense intuitively once you think about it for a bit: if we take the sample mean of a bunch of Unif(0, θ) rvs, we expect to get close to the true mean: (1/n)·Σ_{i=1}^n x_i → θ/2 (by the Law of Large Numbers). Hence, a good estimator for θ would just be twice the sample mean!
Notice that in this case, the MoM estimator disagrees with the MLE we derived in 7.2!

(2/n)·Σ_{i=1}^n x_i = θ̂_MoM ≠ θ̂_MLE = x_max
What if you had two parameters instead of just one? Well, then you would set the first true moment equal
to the first sample moment (as we just did), but also the second true moment equal to the second sample
moment! We’ll see an example of this below. But basically, if we have k parameters to estimate, we need k
equations to solve for these k unknowns!
Let x = (x₁, ..., x_n) be iid realizations (samples) from probability mass function p_X(t; θ) (if X is discrete), or from density f_X(t; θ) (if X is continuous), where θ is a parameter (or vector of k parameters). The method of moments estimator θ̂_MoM of θ is obtained by setting the first k true moments equal to the first k sample moments and solving the resulting system of equations:

E[X^j] = (1/n)·Σ_{i=1}^n x_i^j,  for j = 1, 2, ..., k
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Exp(θ). (These values might look like x₁ = 3.21, x₂ = 5.11, x₃ = 4.33, etc.) What is the MoM estimator of θ?
Solution We have k = 1 (since there is only one parameter). We set the first true moment equal to the first sample moment (recall that E[Exp(λ)] = 1/λ):

E[X] = 1/θ = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = 1/((1/n)·Σ_{i=1}^n x_i)

Notice that in this case, the MoM estimator agrees with the MLE (Maximum Likelihood Estimator), hooray!

θ̂_MoM = θ̂_MLE = 1/((1/n)·Σ_{i=1}^n x_i)
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Poi(θ). (These values might look like x₁ = 13, x₂ = 5, x₃ = 4, etc.) What is the MoM estimator of θ?

Solution We have k = 1 (since there is only one parameter). We set the first true moment equal to the first sample moment (recall that E[Poi(λ)] = λ):

E[X] = θ = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = (1/n)·Σ_{i=1}^n x_i
In this case, again, the MoM estimator agrees with the MLE! Again, much easier than MLE :).
Now, we’ll do an example where there is more than one parameter.
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ N(θ₁, θ₂). (These values might look like x₁ = 2.321, x₂ = 1.112, x₃ = 5.221, etc.) What is the MoM estimator of the vector θ = (θ₁, θ₂) (θ₁ is the mean, and θ₂ is the variance)?
Solution We have k = 2 (since now we have two parameters, θ₁ = µ and θ₂ = σ²). Notice Var(X) = E[X²] − E[X]², so rearranging, we get E[X²] = Var(X) + E[X]². Let's solve for θ₁ first.
Again, we set the first true moment equal to the first sample moment:

E[X] = θ₁ = (1/n)·Σ_{i=1}^n x_i

θ̂₁ = (1/n)·Σ_{i=1}^n x_i
Now let's use our result for θ̂₁ to solve for θ̂₂ (recall that E[X²] = Var(X) + E[X]² = θ₂ + θ₁²):

E[X²] = θ₂ + θ₁² = (1/n)·Σ_{i=1}^n x_i²

θ̂₂ = (1/n)·Σ_{i=1}^n x_i² − ((1/n)·Σ_{i=1}^n x_i)²
If you were to use maximum likelihood to estimate the mean and variance of a Normal distribution, you
would get the same result!
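As a quick sanity check of these two formulas, here is a minimal sketch in Python (not from the book; it assumes numpy, and the true parameters are made up):

    import numpy as np

    rng = np.random.default_rng(seed=42)
    xs = rng.normal(loc=2.0, scale=3.0, size=100_000)   # theta1 = 2, theta2 = 9

    theta1_hat = np.mean(xs)                        # matching the first moment
    theta2_hat = np.mean(xs**2) - theta1_hat**2     # matching the second moment
    print(theta1_hat, theta2_hat)                   # close to 2 and 9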
Chapter 7. Statistical Estimation
7.4: The Beta and Dirichlet Distributions
Slides (Google Drive) Video (YouTube)
We’ll take a quick break after learning two ways (MLE and MoM) to estimate unknown parameters! In the
next section, we’ll learn yet another approach. But that approach requires us to learn at least one other
distribution, the Beta distribution, which will be the focus of this section.
Suppose you want to model your belief on the unknown probability X of heads. You could assign, for
example, a probability distribution as follows:
This figure below shows that you believe that X = P (head) is most likely to be 0.5, somewhat likely to be
0.8, and least likely to be 0.37. That is, X is a discrete random variable with range ⌦X = {0.37, 0.5, 0.8}
and pX (0.37) + pX (0.5) + pX (0.8) = 1. This is a probability distribution on a probability of heads!
Now what if we want P(head) to be open to any value in [0, 1] (which we should want; having it be just one of three values is arbitrary and unrealistic)? The answer is that we need a continuous random variable (with range [0, 1], because probabilities can be any number within this range)! Let's try to see how we might define a new distribution which does a good job modelling this belief. Let's see which of the following shapes might be appropriate (or not).
Example(s)
Suppose you flipped the coin n times and observed k heads. Which of the above density functions
have a “shape” which would be reasonable to model your belief?
It’s important to note that Distributions 2 and 4 are invalid, because there is no possible sequence of flips
that could result in the belief that is ”bi-modal” (have two peaks in the graph of the distribution). Your
belief should have a single peak at your highest belief, and go down on both sides from there.
For instance, if you believe that the probability of (getting heads) is most likely around 0.25, we have Distri-
bution 1 in the figure above. Similarly, if you think that it’s most likely around 0.85, we have Distribution
3. Or, more interestingly, if you have NO idea what the probability might be and you want to make every
probability equally likely, you could use a Uniform distribution like in Distribution 5.
Example(s)
If you flip a coin with unknown probability of heads X, what does your belief distribution look like
if:
• You didn’t observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Match the four distributions below to the four scenarios above. Note the vertical bar in each distribution represents where the mode (the point with highest density) is, as that's probably what we want to estimate as our probability of heads!
Solution
Explanation: Since we haven't observed anything yet, we shouldn't have a preference for any particular value. This is encoded as a continuous Unif(0, 1) distribution.
There is a continuous distribution/rv with range [0, 1] that parametrizes probability distributions over a probability just like this, based on two parameters α and β, which allow you to account for how many heads and tails you've seen!
X ∼ Beta(α, β) if and only if X has the following density function (and range Ω_X = [0, 1]):

f_X(x) = (1/B(α, β))·x^{α−1}·(1 − x)^{β−1} if 0 ≤ x ≤ 1, and 0 otherwise

X is typically the belief distribution about some unknown probability of success, where we pretend we've seen α − 1 successes and β − 1 failures. Hence the mode (the most likely value of the probability / the point with highest density), arg max_{x∈[0,1]} f_X(x), is

mode[X] = (α − 1)/((α − 1) + (β − 1))
If you flip a coin with unknown probability of heads X, identify the parameters of the most appropriate Beta distribution to model your belief:
• You didn't observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Solution
• You didn't observe anything? Beta(0 + 1, 0 + 1) ≡ Beta(1, 1), which is just Unif(0, 1).
• You observed 8 heads and 2 tails? Beta(8 + 1, 2 + 1) ≡ Beta(9, 3) → mode = (9 − 1)/((9 − 1) + (3 − 1)) = 8/10
• You observed 80 heads and 20 tails? Beta(80 + 1, 20 + 1) ≡ Beta(81, 21) → mode = (81 − 1)/((81 − 1) + (21 − 1)) = 80/100
• You observed 2 heads and 3 tails? Beta(2 + 1, 3 + 1) ≡ Beta(3, 4) → mode = (3 − 1)/((3 − 1) + (4 − 1)) = 2/5
X = (X₁, ..., X_r) ∼ Dir(α₁, ..., α_r) if and only if X has the following density function:

f_X(x) = (1/B(α))·∏_{i=1}^r x_i^{α_i − 1} if each x_i ∈ (0, 1) and Σ_{i=1}^r x_i = 1, and 0 otherwise

This is a generalization of the Beta random variable from 2 outcomes to r. The random vector X is typically the belief distribution about the unknown probabilities of the different outcomes, where we pretend we saw α₁ − 1 outcomes of type 1, α₂ − 1 outcomes of type 2, ..., and α_r − 1 outcomes of type r. Hence, the mode of the distribution, arg max f_X(x) (over x ∈ [0, 1]^r with Σ x_i = 1), is the vector

mode[X] = ((α₁ − 1)/Σ_{i=1}^r (α_i − 1), (α₂ − 1)/Σ_{i=1}^r (α_i − 1), ..., (α_r − 1)/Σ_{i=1}^r (α_i − 1))
We’ve seen two ways now to estimate unknown parameters of a distribution. Maximum likelihood estimation
(MLE) says that we should find the parameter ✓ that maximizes the likelihood (“probability”) of seeing the
data, whereas the method of moments (MoM) says that we should match as many moments as possible
(mean, variance, etc.). Now, we learn yet another (and final) technique for estimation that will cover (there
are many more...).
7.5.1.1 Intuition
In Maximum Likelihood Estimation (MLE), we used iid samples x = (x₁, ..., x_n) from some distribution with unknown parameter(s) θ in order to estimate θ:

θ̂_MLE = arg max_θ L(x | θ) = arg max_θ ∏_{i=1}^n f_X(x_i | θ)
Note: Recall the English description of how we found θ̂_MLE: we computed the likelihood, which is the probability of seeing the data given the parameter θ, and we chose the "best" θ, the one that maximized this likelihood.
You might have been thinking: shouldn't we be trying to maximize "P(θ | x)" instead? Well, this doesn't make sense unless Θ is a random variable! And this is where Maximum A Posteriori (MAP) Estimation comes in.
So far, for MLE and MoM estimation, we assumed θ was fixed but unknown. This is called the Frequentist framework, where we estimate our parameter based on data alone, and θ is not a random variable. Now, we move to the Bayesian framework, meaning that our unknown parameter is a random variable Θ. This means we will have some belief distribution π_Θ(θ) (think of this as a density function over all possible values of the parameter), and after observing data x, we will have a new/updated belief distribution π_Θ(θ | x). Let's see a picture of what MAP is going to do first, before getting more into the math and formalism.
Example(s)
We'll see the idea of MAP being applied to our typical coin example. Suppose we are trying to estimate the unknown parameter for the probability of heads on a coin: that is, θ in Ber(θ). We are going to treat the parameter as a random variable (before, in MLE/MoM, we treated it as a fixed unknown quantity), so we'll call it Θ (a capitalized θ).
1. We must have a prior belief distribution π_Θ(θ) over possible values that Θ could take on.
The range of Θ in our case is Ω_Θ = [0, 1], because the probability of heads must be in this interval. Hence, when we plot the density function of Θ, the x-axis will range from 0 to 1. On a piece of paper, please sketch a density function that you might have for this probability of heads without yet seeing any data (coin flips). There are two reasonable shapes for this PDF:
• The Unif(0, 1) = Beta(1, 1) distribution (left picture below).
• Some Beta distribution where α = β, since most coins in this world are fair. Let's say Beta(11, 11), meaning we pretend we've seen 10 heads and 10 tails (right picture below).
2. Then, we observe our iid samples x = (x₁, ..., x_n).
Again, for the Bernoulli distribution, these will be a sequence of n 1's and 0's representing heads or tails. Suppose we observed n = 30 samples, in which Σ_{i=1}^n x_i = 25 were heads and n − Σ_{i=1}^n x_i = 5 were tails.
3. We will combine our prior knowledge and the data to create a posterior belief distribution π_Θ(θ | x).
Sketch two density functions for this posterior: one using the Beta(1, 1) prior above, and one using the Beta(11, 11) prior above. We'll compare these.
• If our prior distribution was Θ ∼ Beta(1, 1) (meaning we pretend we didn't see anything yet), then our posterior distribution should be Θ | x ∼ Beta(26, 6) (meaning we saw 25 heads and 5 tails total).
• If our prior distribution was Θ ∼ Beta(11, 11) (meaning we pretend we saw 10 heads and 10 tails beforehand), then our posterior distribution should be Θ | x ∼ Beta(36, 16) (meaning we saw 35 heads and 15 tails total).
4. We’ll give our MAP estimate as the mode of this posterior distribution. Hence,
the name “Maximum a Posteriori”.
• If we used the ⇥ ⇠ Beta(1, 1) prior, we ended up with the ⇥ | x ⇠ Beta(26, 6) posterior,
and our MAP estimate is defined to be the mode of the distribution, which occurs at
✓ˆM AP = 25
30 ⇡ 0.833 (left picture above). You may notice that this would give the same as
the MLE: we’ll examine this more later!
• If we used the ⇥ ⇠ Beta(11, 11) prior, we ended up with the ⇥ | x ⇠ Beta(36, 16) posterior,
our MAP estimate is defined to be the mode of the distribution, which occurs at ✓ˆM AP =
35
50 = 0.70 (right picture above).
Hopefully you now see the process and idea behind MAP: we have a prior belief on our unknown parameter, and after observing data, we update our belief distribution and take the mode (the most likely value)! Our estimate definitely depends on the prior distribution we choose (which is often arbitrary); the short sketch below makes this concrete.
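Here is a minimal sketch in Python of this update (not from the book; it's just arithmetic on the Beta parameters, nothing outside the standard library):

    heads, tails = 25, 5                       # the observed data from the example
    for alpha, beta in [(1, 1), (11, 11)]:     # the two priors considered above
        post_a, post_b = alpha + heads, beta + tails              # posterior Beta parameters
        theta_map = (post_a - 1) / ((post_a - 1) + (post_b - 1))  # posterior mode
        print(alpha, beta, "->", post_a, post_b, "MAP =", round(theta_map, 3))
    # Beta(1,1)   -> Beta(26,6),  MAP = 0.833
    # Beta(11,11) -> Beta(36,16), MAP = 0.7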
7.5.1.2 Derivation
We chose a Beta prior and ended up with a Beta posterior, which made sense intuitively given our definition of the Beta distribution. But how do we prove this? We'll see the math behind MAP now (it's quite short), and then see the same example again, this time mathematically rigorous.
MAP Idea: Actually, the unknown parameter(s) is a random variable Θ. We have a prior distribution (our belief on Θ before seeing data) π_Θ(θ) and a posterior distribution (our updated belief on Θ after observing some data x) π_Θ(θ | x).
By Bayes’ Theorem,
Recall that ⇡⇥ is just a PDF or PMF over possible values of ⇥. In other words, now we are maximizing
the posterior distribution ⇡⇥ (✓ | x), where ⇥ has a PMF/PDF. That is, we are finding the mode of the
density/mass function. Note that since the denominator P (x) in the expression above does not depend on
✓, we can just maximize the numerator L(x | ✓)⇡⇥ (✓)! Therefore:
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t; Θ = θ) (if X is discrete), or from density f_X(t; Θ = θ) (if X is continuous), where Θ is the random variable representing the parameter (or a vector of parameters). We define the Maximum A Posteriori (MAP) estimator θ̂_MAP of Θ to be the parameter which maximizes the posterior distribution of Θ given the data:

θ̂_MAP = arg max_θ π_Θ(θ | x) = arg max_θ L(x | θ)·π_Θ(θ)

That is, it's exactly the same as maximum likelihood, except instead of just maximizing the likelihood, we maximize the likelihood multiplied by the prior!
Now we’ll see a similar coin-flipping example, but deriving the MAP estimate mathematically and building
even more intuition. I encourage you to try each part out before reading the answers!
7.5.1.3 Example
Example(s)
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
(c) Assume Θ is restricted as in part (b) (but is now a random variable for MAP). Suppose we have a (discrete) prior π_Θ(0.2) = 0.1, π_Θ(0.5) = 0.01, and π_Θ(0.7) = 0.89. What is the MAP for θ?
(d) Show that we can make the MAP whatever we like, by finding a prior over {0.2, 0.5, 0.7} so that the MAP is 0.2, another so that it is 0.5, and another so that it is 0.7.
(e) Typically, for the Bernoulli/Binomial distribution, if we use MAP, we want to be able to get any value in (0, 1), not just ones in a finite set such as {0.2, 0.5, 0.7}. So we need a (continuous) prior distribution with range (0, 1) instead of our discrete one. We assign Θ ∼ Beta(α, β) with parameters α, β > 0 and density π_Θ(θ) = (1/B(α, β))·θ^{α−1}(1 − θ)^{β−1} for θ ∈ (0, 1). Recall the mode of a W ∼ Beta(α, β) random variable is (α − 1)/((α − 1) + (β − 1)) (the mode is the value with highest density, arg max_w f_W(w)).
Suppose x₁, ..., x_n are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is k/n, where k = Σ x_i (the total number of successes). Show that the posterior π_Θ(θ | x) has a Beta(k + α, n − k + β) distribution, and find the MAP estimator.
(f) Recall that Beta(1, 1) ≡ Unif(0, 1) (pretend we saw 1 − 1 = 0 heads and 1 − 1 = 0 tails ahead of time). If we used this as the prior, how would the MLE and MAP compare?
(g) Since the posterior is also a Beta distribution, we call the Beta the conjugate prior to the Bernoulli/Binomial distribution's parameter p. Interpret α, β as to how they affect our estimate. This is a really special property: if the prior distribution multiplied by the likelihood results in a posterior distribution in the same family (with different parameters), then we say that distribution is the conjugate prior to the distribution we are estimating.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP? What does this say about our prior when n is small, or n is large?
(i) Which do you think is "better", MLE or MAP?
Solution
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
• Answer: 2/5. We just find the likelihood of the data, which is the probability of observing 2 heads and 3 tails, and find the θ that maximizes it:

L(x | θ) = θ²(1 − θ)³
θ̂_MLE = arg max_{θ∈[0,1]} θ²(1 − θ)³ = 2/5

(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
• Answer: 0.5. We need to find which of the three acceptable θ values maximizes the likelihood, and since there are only finitely many, we can just plug them all in and compare!
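The plug-and-compare computations for parts (b)-(d) are mechanical; here is a minimal sketch of them in Python (not from the book; standard library only):

    likelihood = lambda th: th**2 * (1 - th)**3        # 2 heads, 3 tails
    prior = {0.2: 0.1, 0.5: 0.01, 0.7: 0.89}           # the prior from part (c)

    mle = max(prior, key=likelihood)                   # ignores the prior
    map_est = max(prior, key=lambda th: likelihood(th) * prior[th])
    print(mle, map_est)                                # 0.5 and 0.7
    # for part (d): putting almost all prior mass on any one value makes it the MAP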
(e) Suppose $x_1, \dots, x_n$ are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is $\frac{k}{n}$, where $k = \sum x_i$ (the total number of successes). Show that the posterior $\pi_\Theta(\theta \mid x)$ has a $\text{Beta}(k+\alpha,\ n-k+\beta)$ distribution, and find the MAP estimator.
\begin{align*}
\pi_\Theta(\theta \mid x) &\propto L(x \mid \theta) \cdot \pi_\Theta(\theta) \\
&= \left(\binom{n}{k} \theta^k (1-\theta)^{n-k}\right) \cdot \left(\frac{1}{B(\alpha,\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}\right) \\
&\propto \theta^{(k+\alpha)-1} (1-\theta)^{(n-k+\beta)-1}
\end{align*}
The first to second line comes from noticing $L(x \mid \theta)$ is just the probability of seeing exactly $k$ successes out of $n$ (binomial PMF), and plugging in our equation for $\pi_\Theta$ (Beta density). The second to third line comes from dropping the normalizing constants (which don't depend on $\theta$), which we can do because we only care to maximize this over $\theta$. If you stare closely at that last equation, it is actually proportional to the PDF of a Beta distribution with different parameters! Our posterior is hence $\text{Beta}(k+\alpha,\ n-k+\beta)$ since PDFs uniquely define a distribution (there is only one normalizing constant that would make it integrate to 1). The MAP estimator is the mode of this posterior Beta distribution, which is given by the formula:
\[ \hat\theta_{MAP} = \frac{(k+\alpha)-1}{((k+\alpha)-1) + ((n-k+\beta)-1)} = \frac{k + (\alpha-1)}{n + (\alpha-1) + (\beta-1)} \]
Try staring at this to see why this might make sense. We'll explain it more in part (g)!
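To sanity-check this result numerically, here is a minimal sketch (my own illustration with hypothetical numbers; assumes numpy and scipy are installed) comparing the closed-form MAP against the numerical mode of the posterior density:

import numpy as np
from scipy import stats

alpha, beta_param, n, k = 7, 3, 12, 11     # hypothetical prior parameters and data
theta = np.linspace(0.001, 0.999, 100000)
posterior = stats.beta.pdf(theta, k + alpha, n - k + beta_param)
print(theta[np.argmax(posterior)])                         # numerical mode of the posterior
print((k + alpha - 1) / (n + alpha - 1 + beta_param - 1))  # closed-form MAP: 17/20 = 0.85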
(f) Recall that $\text{Beta}(1, 1) \equiv \text{Unif}(0, 1)$ (pretend we saw $1-1 = 0$ heads and $1-1 = 0$ tails ahead of time). If we used this as the prior, how would the MLE and MAP compare?
• Answer: They would be the same! From our previous question, if $\alpha = \beta = 1$, then
\[ \hat\theta_{MAP} = \frac{k + (\alpha-1)}{n + (\alpha-1) + (\beta-1)} = \frac{k}{n} = \hat\theta_{MLE} \]
This is because we essentially don't have any prior information: we're saying each value is equally likely!
(g) Since the posterior is also a Beta distribution, we call Beta the conjugate prior to the Bernoulli/Binomial distribution's parameter $p$. Interpret $\alpha, \beta$ as to how they affect our estimate. This is a really special property: if the prior distribution multiplied by the likelihood results in a posterior distribution in the same family (with different parameters), then we say that distribution is the conjugate prior to the distribution we are estimating.
• Answer: The interpretation is: pretend we saw $\alpha - 1$ heads ahead of time, and $\beta - 1$ tails ahead of time. Then our total number of heads is $k + (\alpha - 1)$ (real + fake) and our total number of trials is $n + (\alpha - 1) + (\beta - 1)$ (real + fake), so that's our estimate! That's how prior information was factored in to our estimator, rather than just using what we actually saw in the data.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP? What
does this say about our prior when n is small, or n is large?
• Answer: They become equal! The prior is important if we don’t have much data, but as we get
more, the evidence overwhelms the prior. You can imagine that if we only flipped the coin 5
times, the prior would play a huge role in our estimate. But if we flipped the coin 10,000 times,
any (small) prior wouldn’t really change our estimate.
(i) Which do you think is “better”, MLE or MAP?
• Answer: There is no right answer. There are two main schools in statistics: Bayesians and
Frequentists.
• Frequentists prefer MLE since they don’t believe you should be putting a prior belief on anything,
and you should only make judgment based on what you’ve seen. They believe the parameter
being estimated is a fixed quantity.
• On the other hand, Bayesians prefer MAP, since they can incorporate their prior knowledge into
the estimation. Hence the parameter being estimated is a random variable, and we seek the
mode - the value with the highest probability or density. An example would be estimating the
probability of heads of a coin - is it reasonable to assume it is more likely fair than not? If so,
what distribution should we put on the parameter space?
• Anyway, in the long run, the prior “washes out”, and the only thing that matters is the likelihood;
the observed data. For small sample sizes like this, the prior significantly influences the MAP
estimate. However, as the number of samples goes to infinity, the MAP and MLE are equal.
7.5.2 Exercises
1. Let $x = (x_1, \dots, x_n)$ be iid samples from $\text{Exp}(\Theta)$ where $\Theta$ is a random variable (not fixed). Note that the range of $\Theta$ should be $\Omega_\Theta = [0, \infty)$ (the average rate of events per unit time), so any prior we choose should have this range.
(a) Using the prior $\Theta \sim \text{Gamma}(r, \lambda)$ (for some arbitrary but known parameters $r, \lambda > 0$), show that the posterior distribution $\Theta \mid x$ also follows a Gamma distribution and identify its parameters (by computing $\pi_\Theta(\theta \mid x)$). Then, explain this sentence: "The Gamma distribution is the conjugate prior for the rate parameter of the Exponential distribution". Hint: This can be done in just a few lines!
(b) Now derive the MAP estimate for $\Theta$. The mode of a $\text{Gamma}(s, \nu)$ distribution is $\frac{s-1}{\nu}$. Hint: This should be just one line using your answer to part (a).
(c) Explain how this MAP estimate differs from the MLE estimate (recall for the Exponential distribution it was just the inverse sample mean $\frac{n}{\sum_{i=1}^n x_i}$), and provide an interpretation of $r$ and $\lambda$ as to how they affect the estimate.
Solution:
(a) Remember that the posterior is proportional to likelihood times prior, and the density of $Y \sim \text{Exp}(\theta)$ is $f_Y(y \mid \theta) = \theta e^{-\theta y}$:
\begin{align*}
\pi_\Theta(\theta \mid x) &\propto L(x \mid \theta)\,\pi_\Theta(\theta) = \left(\prod_{i=1}^n \theta e^{-\theta x_i}\right) \cdot \frac{\lambda^r}{\Gamma(r)}\theta^{r-1}e^{-\lambda\theta} \\
&= \frac{\lambda^r}{\Gamma(r)}\,\theta^n e^{-\theta \sum x_i}\,\theta^{r-1}e^{-\lambda\theta} \propto \theta^{(n+r)-1} e^{-(\lambda + \sum x_i)\theta}
\end{align*}
Therefore $\Theta \mid x \sim \text{Gamma}(n + r,\ \lambda + \sum x_i)$, since the final line above is proportional to the PDF of the Gamma distribution (minus normalizing constant).
It is the conjugate prior because, assuming a Gamma prior for the Exponential likelihood, we end up with a Gamma posterior. That is, the prior and posterior are in the same family of distributions (Gamma) with different parameters.
(b) Just citing the mode of a Gamma given above, we get
\[ \hat\theta_{MAP} = \frac{n + r - 1}{\lambda + \sum x_i} \]
(c) We see how the estimate changes from the MLE of $\hat\theta_{MLE} = \frac{n}{\sum x_i}$: pretend we saw $r - 1$ extra events over $\lambda$ units of time. (Instead of waiting for $n$ events, we waited for $n + r - 1$, and instead of $\sum x_i$ as our total time, we now have $\lambda + \sum x_i$ units of time.)
Chapter 7. Statistical Estimation
7.6: Properties of Estimators I
Slides (Google Drive) Video (YouTube)
Now that we have all these techniques to compute estimators, you might be wondering which one is the "best". Actually, a better question would be: how can we determine which estimator is "better" (rather than which technique)? There are even more ways to estimate besides MLE/MoM/MAP, and in different scenarios, different techniques may work better. In these notes, we will consider some properties of estimators that allow us to compare their "goodness".
7.6.1 Bias
The first estimator property we'll cover is bias. The bias of an estimator measures whether or not, in expectation, the estimator will be equal to the true parameter. It is defined as $\text{Bias}(\hat\theta, \theta) = E[\hat\theta] - \theta$.
• If $\text{Bias}(\hat\theta, \theta) = 0$, or equivalently $E[\hat\theta] = \theta$, then we say $\hat\theta$ is an unbiased estimator of $\theta$.
• If $\text{Bias}(\hat\theta, \theta) > 0$, then $\hat\theta$ typically overestimates $\theta$.
• If $\text{Bias}(\hat\theta, \theta) < 0$, then $\hat\theta$ typically underestimates $\theta$.
Example(s)
First, recall that, if $x_1, \dots, x_n$ are iid realizations from $\text{Poi}(\theta)$, then the MLE and MoM were both the sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
What is the bias of this estimator?
Solution
\begin{align*}
E[\hat\theta] &= E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] \\
&= \frac{1}{n}\sum_{i=1}^n E[x_i] & \text{[LoE]} \\
&= \frac{1}{n}\sum_{i=1}^n \theta & [E[\text{Poi}(\theta)] = \theta] \\
&= \frac{1}{n} \cdot n\theta \\
&= \theta
\end{align*}
This makes sense: the average of your samples should be "on-target" for the true average!
Example(s)
Next, recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_{MLE} = x_{max} \qquad \hat\theta_{MoM} = 2 \cdot \frac{1}{n}\sum_{i=1}^n x_i \]
Sure, $\hat\theta_{MLE}$ maximizes the likelihood, so in a way $\hat\theta_{MLE}$ is better than $\hat\theta_{MoM}$. But what are the biases of these estimators? Before doing any computation: do you think $\hat\theta_{MLE}$ and $\hat\theta_{MoM}$ are overestimates, underestimates, or unbiased?
Solution I actually think $\hat\theta_{MoM}$ is spot-on, since the average of the samples should be close to $\theta/2$, and multiplying by 2 would seem to give the true $\theta$. On the other hand, $\hat\theta_{MLE}$ might be a bit of an underestimate, since the largest sample is probably a bit less than $\theta$ (the true $\theta$ is likely a little larger than the max we observed).
\[ E[\hat\theta_{MLE}] = E[X_{max}] = \int_0^\theta y \cdot n\left(\frac{y}{\theta}\right)^{n-1}\frac{1}{\theta}\, dy = \frac{n}{\theta^n}\int_0^\theta y^n \, dy = \frac{n}{\theta^n}\left[\frac{y^{n+1}}{n+1}\right]_0^\theta = \frac{n}{n+1}\theta \]
(Here $n(y/\theta)^{n-1}\frac{1}{\theta}$ is the density of the max, obtained by differentiating its CDF $(y/\theta)^n$.) This makes sense because if I had 3 samples from $\text{Unif}(0, 1)$ for example, I would expect them at $1/4, 2/4, 3/4$, and so my expected max would be $\frac{n}{n+1} = \frac{3}{4}$. Similarly, if I had 4 samples, then I would expect them at $1/5, 2/5, 3/5, 4/5$, and so it would again be $\frac{n}{n+1} = \frac{4}{5}$ as my expected max.
Finally,
\[ \text{Bias}(\hat\theta_{MLE}, \theta) = E[\hat\theta_{MLE}] - \theta = \frac{n}{n+1}\theta - \theta = -\frac{1}{n+1}\theta \]
\[ E[\hat\theta_{MoM}] = E\left[2 \cdot \frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{2}{n}\sum_{i=1}^n E[x_i] = \frac{2}{n} \cdot n \cdot \frac{\theta}{2} = \theta \]
\[ \text{Bias}(\hat\theta_{MoM}, \theta) = E[\hat\theta_{MoM}] - \theta = \theta - \theta = 0 \]
• Analysis of Results
This means that $\hat\theta_{MLE}$ typically underestimates $\theta$ and $\hat\theta_{MoM}$ is an unbiased estimator of $\theta$. But something isn't quite right... Suppose, for example, our sample was $x = (1, 9, 2)$. Then
\[ \hat\theta_{MLE} = \max\{1, 9, 2\} = 9 \qquad \hat\theta_{MoM} = 2 \cdot \frac{1}{3}(1 + 9 + 2) = 8 \]
However, based on our sample, the MoM estimate is impossible. If the actual parameter were 8, then that means that the distribution we pulled the sample from is $\text{Unif}(0, 8)$, in which case the likelihood that we get a 9 is 0. But we did see a 9 in our sample. So, even though $\hat\theta_{MoM}$ is unbiased, it still yields an impossible estimate. This just goes to show that finding the right estimator is actually quite tricky.
A good solution would be to "de-bias" the MLE by scaling it appropriately. If you decided to have a new estimator based on the MLE:
\[ \hat\theta = \frac{n+1}{n}\hat\theta_{MLE} \]
you would now get an unbiased estimator that can't be wrong! But now it does not maximize the likelihood anymore...
Actually, the MLE is what we say to be "asymptotically unbiased", meaning unbiased in the limit. This is because
\[ \text{Bias}(\hat\theta_{MLE}, \theta) = -\frac{1}{n+1}\theta \to 0 \]
as $n \to \infty$. So usually we might just leave it because we can't seem to win...
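A quick simulation sketch (my own, not from the text) makes both biases visible; with $\theta = 10$ and $n = 5$, the MLE's average lands about $-\theta/(n+1) \approx -1.67$ below the target while the MoM is on target:

import numpy as np

rng = np.random.default_rng(0)
theta, n, ntrials = 10.0, 5, 100000
samples = rng.uniform(0, theta, size=(ntrials, n))
print(samples.max(axis=1).mean() - theta)         # bias of MLE: about -theta/(n+1) = -1.67
print((2 * samples.mean(axis=1)).mean() - theta)  # bias of MoM: about 0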
Example(s)
Recall that if $x_1, \dots, x_n \sim \text{Exp}(\theta)$ are iid, our MLE and MoM estimates were both the inverse sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{\bar x} = \frac{n}{\sum_{i=1}^n x_i} \]
What can you say about the bias of this estimator?
Solution
\begin{align*}
E[\hat\theta] &= E\left[\frac{n}{\sum_{i=1}^n x_i}\right] \\
&\ge \frac{n}{\sum_{i=1}^n E[x_i]} & \text{[Jensen's inequality]} \\
&= \frac{n}{n \cdot \frac{1}{\theta}} & \left[E[\text{Exp}(\theta)] = \frac{1}{\theta}\right] \\
&= \theta
\end{align*}
The inequality comes from Jensen's (section 6.3): since $g(x_1, \dots, x_n) = \frac{1}{\sum_{i=1}^n x_i}$ is convex (at least in the positive octant when all $x_i \ge 0$), we have that $E[g(x_1, \dots, x_n)] \ge g(E[x_1], E[x_2], \dots, E[x_n])$. It is convex for a reason similar to why $\frac{1}{x}$ is a convex function (for $x > 0$). So $E[\hat\theta] \ge \theta$ systematically, and we typically have an overestimate.
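A quick simulation sketch (mine) confirms the systematic overestimate; with $\theta = 2$ and $n = 5$, the average of $n/\sum x_i$ over many trials lands noticeably above 2:

import numpy as np

rng = np.random.default_rng(1)
theta, n, ntrials = 2.0, 5, 200000
samples = rng.exponential(scale=1/theta, size=(ntrials, n))  # Exp(theta) has mean 1/theta
print((n / samples.sum(axis=1)).mean())   # noticeably above theta = 2 (about 2.5 here)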
We also care about the variance of an estimator, $\text{Var}(\hat\theta)$. This is just the definition of variance applied to the random variable $\hat\theta$ and isn't actually a new definition. But maybe instead of just computing the variance, we want a slightly different metric, one which measures the expected squared difference of the estimator from the true parameter, and not just from the estimator's own expectation. This is the mean squared error (MSE):
\[ \text{MSE}(\hat\theta, \theta) = E\left[(\hat\theta - \theta)^2\right] \]
This leads to what is known as the "Bias-Variance Tradeoff" in machine learning and statistics. Usually, we want to minimize MSE, and these two quantities (bias and variance) are often inversely related. That is, decreasing one leads to an increase in the other, and finding the balance will minimize the MSE. It's hard to see why that might be the case here, since we aren't working with very complex estimators (we're just learning the basics!).
Proof of Alternate MSE Formula. We will prove that $\text{MSE}(\hat\theta, \theta) = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2$.
\begin{align*}
\text{MSE}(\hat\theta, \theta) &= E\left[(\hat\theta - \theta)^2\right] & \text{[def of MSE]} \\
&= E\left[\left((\hat\theta - E[\hat\theta]) + (E[\hat\theta] - \theta)\right)^2\right] & [\text{add and subtract } E[\hat\theta]] \\
&= E\left[(\hat\theta - E[\hat\theta])^2\right] + 2E\left[(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta)\right] + E\left[(E[\hat\theta] - \theta)^2\right] & [(a+b)^2 = a^2 + 2ab + b^2] \\
&= \text{Var}(\hat\theta) + 0 + \text{Bias}(\hat\theta, \theta)^2 & [\text{def of var, bias, } E[\hat\theta - E[\hat\theta]] = 0] \\
&= \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2
\end{align*}
Example(s)
Once more, recall that, if $x_1, \dots, x_n$ are iid realizations from $\text{Poi}(\theta)$, then the MLE and MoM were both the sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
What is the MSE of this estimator?
Solution To compute the MSE, let's compute the bias and variance separately. Earlier, we showed that
\[ \text{Bias}(\hat\theta, \theta) = E[\hat\theta] - \theta = \theta - \theta = 0 \]
For the variance, using $\text{Var}(\text{Poi}(\theta)) = \theta$ and independence,
\[ \text{Var}\left(\hat\theta\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(x_i) = \frac{\theta}{n} \]
so $\text{MSE}(\hat\theta, \theta) = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2 = \frac{\theta}{n}$.
Chapter 7. Statistical Estimation
7.7: Properties of Estimators II
Slides (Google Drive) Video (YouTube)
We'll discuss even more desirable properties of estimators. Last time we talked about bias, variance, and MSE. Bias measured whether or not, in expectation, our estimator was equal to the true value of $\theta$. MSE measured the expected squared difference between our estimator and the true value of $\theta$. If our estimator was unbiased, then the MSE of our estimator was precisely the variance.
7.7.1 Consistency
Definition 7.7.1: Consistency
An estimator $\hat\theta_n$ of $\theta$ (based on $n$ samples) is consistent if, for every $\varepsilon > 0$,
\[ \lim_{n\to\infty} P\left(|\hat\theta_n - \theta| > \varepsilon\right) = 0 \]
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_n = \hat\theta_{n, MoM} = 2 \cdot \frac{1}{n}\sum_{i=1}^n x_i \]
Show that this estimator is consistent.
Solution
Since $\hat\theta_n$ is unbiased, we have that
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \]
because we can replace $\theta$ with the expected value of the estimator. Now, we can apply Chebyshev's inequality (6.1) to see that
\[ P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \le \frac{\text{Var}\left(\hat\theta_n\right)}{\varepsilon^2} \]
Now, we can take the $2^2 = 4$ out of the variance and are left only with the variance of the sample mean, which is always just $\frac{\sigma^2}{n} = \frac{\text{Var}(x_i)}{n}$:
\[ P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \le \frac{\text{Var}\left(2 \cdot \frac{1}{n}\sum_{i=1}^n x_i\right)}{\varepsilon^2} = \frac{4 \cdot \text{Var}(x_i)/n}{\varepsilon^2} \to 0 \]
as $n \to \infty$, since $\text{Var}(x_i) = \theta^2/12$ is a constant. Hence $\hat\theta_n$ is consistent.
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_n = \hat\theta_{n, MLE} = \max\{x_1, \dots, x_n\} \]
Show that this estimator is also consistent.
Solution
In this case, we unfortunately cannot use Chebyshev's inequality, because the maximum likelihood estimator is not unbiased. The CDF for $\hat\theta_n$ is
\[ F_{\hat\theta_n}(t) = P\left(\hat\theta_n \le t\right) \]
which is the probability that each individual sample is less than $t$, because only in that case will the max be less than $t$; and we have independence, so we can say
\[ P\left(\hat\theta_n \le t\right) = P(X_1 \le t)\,P(X_2 \le t)\cdots P(X_n \le t) \]
This is just the CDF of $X_i$ to the $n$-th power, where the CDF of $\text{Unif}(0, \theta)$ is just $\frac{t}{\theta}$ (see the distribution sheet):
\[ F_{\hat\theta_n}(t) = F_X^n(t) = \begin{cases} 0, & t < 0 \\ (t/\theta)^n, & 0 \le t \le \theta \\ 1, & t > \theta \end{cases} \]
There are two ways we can have the absolute value from before be greater than epsilon:
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = P\left(\hat\theta_n > \theta + \varepsilon\right) + P\left(\hat\theta_n < \theta - \varepsilon\right) \]
The first term is 0, because there's no way our estimator is greater than $\theta + \varepsilon$: it's never going to be greater than $\theta$ by definition (the samples are between 0 and $\theta$, so there's no way the max of the samples is greater than $\theta$). So, now we can just use the CDF on the right term, plugging in $\theta - \varepsilon$ for $t$:
\[ P\left(\hat\theta_n > \theta + \varepsilon\right) + P\left(\hat\theta_n < \theta - \varepsilon\right) = P\left(\hat\theta_n < \theta - \varepsilon\right) = \begin{cases} \left(\frac{\theta - \varepsilon}{\theta}\right)^n, & \varepsilon < \theta \\ 0, & \varepsilon \ge \theta \end{cases} \]
We can assume that $\varepsilon$ is less than $\theta$, because we really only care about when $\varepsilon$ is very very small, so we have that
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = \left(\frac{\theta - \varepsilon}{\theta}\right)^n \]
Thus, when we take the limit as $n$ approaches infinity, we see that in the parentheses we have a number less than 1, and we raise it to the $n$-th power, so it goes to 0:
\[ \lim_{n\to\infty} P\left(|\hat\theta_n - \theta| > \varepsilon\right) = 0 \]
Now we've seen that, even though the MLE and MoM estimators of $\theta$ given iid samples from $\text{Unif}(0, \theta)$ are different, they are both consistent! That means, as $n \to \infty$, they will both converge to the true parameter $\theta$. This is clearly a good property of an estimator.
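Here is a small simulation sketch (mine, not from the text) illustrating consistency: as $n$ grows, both estimators home in on $\theta = 10$.

import numpy as np

rng = np.random.default_rng(3)
theta = 10.0
for n in [10, 100, 10000]:
    x = rng.uniform(0, theta, size=n)
    print(n, x.max(), 2 * x.mean())   # both columns approach theta = 10 as n grows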
Estimators can fall into any combination of (un)biasedness and (in)consistency:
1. For instance, an unbiased and consistent estimator was the MoM for the uniform distribution: $\hat\theta_{n,MoM} = 2\bar x$. We proved it was unbiased in 7.6, meaning it is correct in expectation. It converges to the true parameter (consistent) since the variance goes to 0.
2. However, if you ignore all the samples and just take the first one and multiply it by 2, $\hat\theta = 2X_1$, it is unbiased (as $E[2X_1] = 2 \cdot \frac{\theta}{2} = \theta$), but it's not consistent; our estimator doesn't get better and better with more $n$ because we're not using all $n$ samples. Consistency requires that as we get more samples, we approach the true parameter.
3. Biased but consistent, on the other hand, was the MLE estimator. We showed its expectation was $\frac{n}{n+1}\theta$, which is actually "asymptotically unbiased" since $E[\hat\theta_{n,MLE}] = \frac{n}{n+1}\theta \to \theta$ as $n \to \infty$. It does get better and better as $n \to \infty$.
4. Neither unbiased nor consistent would just be some random expression, such as $\hat\theta = \frac{1}{X_1^2}$.
7.7.3 Efficiency
To take about our last topic, efficiency, we first have to define Fisher Information. Efficiency says that our
estimator has as low variance as possible. This property combined with consistency and unbiasedness mean
that our estimator is on target (unbiased), converges to the true parameter (consistent), and does so as fast
as possible (efficient).
Let $x = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters). The Fisher information of the parameter $\theta$ is defined to be:
\[ I(\theta) = E\left[\left(\frac{\partial \ln L(x \mid \theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial \theta^2}\right] \]
where $L(x \mid \theta)$ denotes the likelihood of the data given parameter $\theta$ (defined in 7.1). From Wikipedia, it "is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends".
That written definition is definitely a mouthful, but if you stop and parse it, you'll see it's not too bad to compute. We always take the second derivative of the log-likelihood anyway to confirm that our MLE was a maximizer; now all you have to do is take the expectation (and negate it) to get the Fisher information. I won't try to give an intuitive interpretation of the negative expected value of the second derivative of the log-likelihood though; it's just too gross and messy.
Theorem (Cramér-Rao Lower Bound): Let $x = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters). If $\hat\theta$ is an unbiased estimator for $\theta$, then
\[ \text{MSE}(\hat\theta, \theta) = \text{Var}\left(\hat\theta\right) \ge \frac{1}{I(\theta)} \]
where $I(\theta)$ is the Fisher information defined earlier. What this is saying is: for any unbiased estimator $\hat\theta$ for $\theta$, the variance (which equals the MSE here) is at least $\frac{1}{I(\theta)}$; this bound is called the Cramér-Rao Lower Bound (CRLB). If we achieve this lower bound, meaning our variance is exactly equal to $\frac{1}{I(\theta)}$, then we have the best variance possible for our estimate. That is, we have the minimum variance unbiased estimator (MVUE) for $\theta$.
Since we want to find the lowest variance possible, we can look at this through the frame of finding the estimator's efficiency:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} \]
This will always be between 0 and 1: if your variance is equal to the CRLB, then it equals 1, and any variance greater than the CRLB will result in a smaller value. A larger variance results in a smaller efficiency, and we want our efficiency to be as high as possible (1).
An unbiased estimator is said to be efficient if it achieves the CRLB, meaning $e(\hat\theta, \theta) = 1$. That is, it could not possibly have a lower variance. Again, the CRLB is not guaranteed for biased estimators.
That was super complicated - let's see how to verify the MLE of $\text{Poi}(\theta)$ is efficient. It looks scary - but it's just messy algebra!
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from $X \sim \text{Poi}(\theta)$ (recall $E[X] = \text{Var}(X) = \theta$), then
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
Is $\hat\theta$ efficient?
Solution
First, you have to check that it's unbiased, as the CRLB only holds for unbiased estimators...
\[ E[\hat\theta] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \theta \]
...which it is! Otherwise, we wouldn't be able to use this bound. We also need to compute the variance. The variance of the sample mean (the estimator) is just $\frac{\sigma^2}{n}$, and the variance of a Poisson is just $\theta$:
\[ \text{Var}\left(\hat\theta\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(x_i) = \frac{\theta}{n} \]
Then, we're going to compute that weird Fisher information, which gives us the CRLB, and see if our variance matches. Remember, we take the second derivative of the log-likelihood, which we did earlier in 7.2, so we're just going to copy over the answer.
\[ \frac{\partial^2}{\partial\theta^2} \ln L(x \mid \theta) = -\sum_{i=1}^n \frac{x_i}{\theta^2} \]
Then, we need to take the expected value of this. It turns out, with some algebra, you get $-\frac{n}{\theta}$:
\[ E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right] = E\left[-\sum_{i=1}^n \frac{x_i}{\theta^2}\right] = -\frac{1}{\theta^2}\sum_{i=1}^n E[x_i] = -\frac{1}{\theta^2} \cdot n\theta = -\frac{n}{\theta} \]
Our Fisher information was the negative expected value of the second derivative of the log-likelihood, so we just flip the sign to get $\frac{n}{\theta}$:
\[ I(\theta) = -E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right] = \frac{n}{\theta} \]
Finally, our efficiency is the inverse of the Fisher information over the variance:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} = \frac{\theta/n}{\theta/n} = 1 \]
Thus, we've shown that, since our efficiency is 1, our estimator is efficient. That is, it has the best possible variance among all unbiased estimators of $\theta$. This, again, is a really good property that we want to have.
To reiterate, this means we cannot possibly do better in terms of mean squared error. Our bias is 0, and our variance is as low as it can possibly go. The sample mean is unequivocally the best estimator for a Poisson distribution, in terms of efficiency, bias, and MSE (it also happens to be consistent, so there are a lot of good things).
As you can see, showing efficiency is just a bunch of tedious calculations!
Chapter 7. Statistical Estimation
7.8: Properties of Estimators III
Slides (Google Drive) Video (YouTube)
The final property of estimators we will discuss is called sufficiency. Just like we want our estimators to be consistent and efficient, we also want them to be sufficient.
7.8.1 Sufficiency
We first must define what a statistic is.
Definition 7.8.1: Statistic
A statistic is any function $T = T(x_1, \dots, x_n)$ of the samples.
All estimators are statistics because they take in our $n$ data points and produce a single number. We'll see an example which intuitively explains what it means for a statistic to be sufficient.
Suppose we have iid samples $x = (x_1, \dots, x_n)$ from a known distribution with unknown parameter $\theta$. Imagine we have two people:
• Statistician A: Knows the entire sample, gets $n$ quantities: $x = (x_1, \dots, x_n)$.
• Statistician B: Knows $T(x_1, \dots, x_n) = t$, a single number which is a function of the samples. For example, the sum or the maximum of the samples.
Heuristically, $T(x_1, \dots, x_n)$ is a sufficient statistic if Statistician B can do just as good a job as Statistician A, given "less information". For example, if the samples are from the Bernoulli distribution, knowing $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ (the number of heads) is just as good as knowing all the individual outcomes, since a good estimate would be the number of heads over the number of total trials! Hence, we don't actually care about the ORDER of the outcomes, just how many heads occurred! The word "sufficient" in English roughly means "enough", and so this terminology was well-chosen.
To motivate the definition, we'll go back to the previous example. Again, Statistician A has all the samples $x_1, \dots, x_n$, but Statistician B only has the single number $t = T(x_1, \dots, x_n)$. The idea is: Statistician B only knows $T = t$, but since $T$ is sufficient, she doesn't need $\theta$ to generate new samples $X_1', \dots, X_n'$ from the distribution. This is because
\[ P(X_1 = x_1, \dots, X_n = x_n \mid T = t, \theta) = P(X_1 = x_1, \dots, X_n = x_n \mid T = t) \]
and since she knows $T = t$, she knows the conditional distribution (and can generate samples)! Now Statistician B has $n$ iid samples from the distribution, just like Statistician A. So using these samples $X_1', \dots, X_n'$, Statistician B can do just as good a job as Statistician A with samples $X_1, \dots, X_n$ (on average). So no one is at any disadvantage. :)
This definition is hard to check, but it turns out that there is a criterion that helps us determine whether a statistic is sufficient:
Theorem 7.8.35: Neyman-Fisher Factorization Criterion
Let $x_1, \dots, x_n$ be iid random samples with likelihood $L(x_1, \dots, x_n \mid \theta)$. A statistic $T = T(x_1, \dots, x_n)$ is sufficient if and only if there exist non-negative functions $g$ and $h$ such that:
\[ L(x_1, \dots, x_n \mid \theta) = g(x_1, \dots, x_n) \cdot h(T(x_1, \dots, x_n), \theta) \]
That is, the likelihood of the data can be split into a product of two terms: the first term $g$ can depend on the entire data, but not $\theta$; and the second term $h$ can depend on $\theta$, but only on the data through the sufficient statistic $T$. (In other words, $T$ is the only thing that allows the data $x_1, \dots, x_n$ and $\theta$ to interact!) That is, inside $h$ we don't have access to the $n$ individual quantities $x_1, \dots, x_n$; just the single number ($T$, the sufficient statistic).
If you are reading this for the first time, you might not think this is any better... You may be very confused right now, but let's see some examples to clear things up!
But basically, you want to split the likelihood into a product of two terms/functions:
1. For the first term $g$, you are allowed to know each individual sample if you want, but NOT $\theta$.
2. For the second term $h$, you can only know the sufficient statistic (single number) $T(x_1, \dots, x_n)$ and $\theta$. You may not know each individual $x_i$.
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Unif}(0, \theta)$ (continuous). Show that the MLE $\hat\theta = T(x_1, \dots, x_n) = \max\{x_1, \dots, x_n\}$ is a sufficient statistic. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know the largest!)
Solution We saw the likelihood of this continuous uniform in 7.2, which we'll just rewrite:
\[ L(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} I_{\{x_i \le \theta\}} = \frac{1}{\theta^n} I_{\{x_1, \dots, x_n \le \theta\}} = \frac{1}{\theta^n} I_{\{\max\{x_1, \dots, x_n\} \le \theta\}} = \frac{1}{\theta^n} I_{\{T(x_1, \dots, x_n) \le \theta\}} \]
Choose
\[ g(x_1, \dots, x_n) = 1 \]
and
\[ h(T(x_1, \dots, x_n), \theta) = \frac{1}{\theta^n} I_{\{T(x_1, \dots, x_n) \le \theta\}} \]
Notice there is no need for a $g$ term (that's why it is $= 1$), because there is no term in the likelihood which just has the data (without $\theta$).
For the $h$ term, notice that we just need to know the max of the samples $T(x_1, \dots, x_n)$ to compute $h$: we don't actually need to know each individual $x_i$.
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the max of all the values).
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Poi}(\theta)$. Show that $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is a sufficient statistic, and hence the MLE $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is sufficient as well. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know how many events happened total!)
Solution We take our Poisson likelihood and split it into smaller terms:
\[ L(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n e^{-\theta}\frac{\theta^{x_i}}{x_i!} = \left(\prod_{i=1}^n e^{-\theta}\right)\left(\prod_{i=1}^n \theta^{x_i}\right)\left(\prod_{i=1}^n \frac{1}{x_i!}\right) = \frac{e^{-n\theta}\,\theta^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!} = \frac{1}{\prod_{i=1}^n x_i!} \cdot e^{-n\theta}\,\theta^{T(x_1, \dots, x_n)} \]
Choose
\[ g(x_1, \dots, x_n) = \frac{1}{\prod_{i=1}^n x_i!} \]
and
\[ h(T(x_1, \dots, x_n), \theta) = e^{-n\theta}\,\theta^{T(x_1, \dots, x_n)} \]
By the Neyman-Fisher Factorization Criterion, $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is sufficient. The mean $\hat\theta_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \frac{T(x_1, \dots, x_n)}{n}$ is as well, since knowing the total number of events and the average number of events is equivalent (since we know $n$)!
Notice here we had the $g$ term handle some function of only $x_1, \dots, x_n$ but not $\theta$.
For the $h$ term though, we do have $\theta$ but don't need the individual samples $x_1, \dots, x_n$ to compute $h$. Imagine being just given $T(x_1, \dots, x_n)$: now you have enough information to compute $h$!
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the sum/mean of all the values). We don't actually need to know each individual $x_i$.
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Ber}(\theta)$. Show that $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is a sufficient statistic, and hence the MLE $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is sufficient as well. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know how many heads happened total!)
Solution The Bernoulli likelihood comes by using the PMF $p_X(k) = \theta^k(1-\theta)^{1-k}$ for $k \in \{0, 1\}$. We get this by observing that $\text{Ber}(\theta) = \text{Bin}(1, \theta)$.
\begin{align*}
L(x_1, \dots, x_n \mid \theta) &= \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \left(\prod_{i=1}^n \theta^{x_i}\right)\left(\prod_{i=1}^n (1-\theta)^{1-x_i}\right) \\
&= \theta^{\sum_{i=1}^n x_i}(1-\theta)^{n - \sum_{i=1}^n x_i} = \theta^{T(x_1, \dots, x_n)}(1-\theta)^{n - T(x_1, \dots, x_n)}
\end{align*}
Choose
\[ g(x_1, \dots, x_n) = 1 \]
and
\[ h(T(x_1, \dots, x_n), \theta) = \theta^{T(x_1, \dots, x_n)}(1-\theta)^{n - T(x_1, \dots, x_n)} \]
By the Neyman-Fisher Factorization Criterion, $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is sufficient. The mean $\hat\theta_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \frac{T(x_1, \dots, x_n)}{n}$ is as well, since knowing the total number of heads and the sample proportion of heads is equivalent (since we know $n$)!
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the sum/mean of all the values). We don't actually need to know each individual $x_i$.
The mean squared error of an estimator $\hat\theta$ of $\theta$ measures the expected squared error from the true value $\theta$, and decomposes into a bias term and a variance term. This decomposition gives rise to the phrase "Bias-Variance Tradeoff": sometimes these are opposing forces, and minimizing MSE is a result of choosing the right balance.
\[ \text{MSE}(\hat\theta, \theta) = E\left[(\hat\theta - \theta)^2\right] = \text{Var}\left(\hat\theta\right) + \text{Bias}^2(\hat\theta, \theta) \]
If $\hat\theta$ is an unbiased estimator of $\theta$, then the MSE reduces to just: $\text{MSE}(\hat\theta, \theta) = \text{Var}\left(\hat\theta\right)$.
An unbiased estimator $\hat\theta$ is efficient if it achieves the Cramér-Rao Lower Bound, meaning it has the lowest variance possible:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} = 1 \iff \text{Var}\left(\hat\theta\right) = \frac{1}{I(\theta)} = \frac{1}{-E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right]} \]
A statistic $T = T(x_1, \dots, x_n)$ is sufficient if and only if the likelihood factors as
\[ L(x_1, \dots, x_n \mid \theta) = g(x_1, \dots, x_n) \cdot h(T(x_1, \dots, x_n), \theta) \]
Chapter 8. Statistical Inference
In this last chapter, we talk about how to draw conclusions about a population using only a subset (hypoth-
esis testing). This is something we commonly want to do to answer questions like: who will win the next
U.S. presidential election? We can’t possibly poll everyone in the U.S. to see who they prefer, but we can
sample a few thousand and get their opinion. We will then make predictions for the election result with
some margin of error. What about drug testing? How can a drug company use clinical trials to “prove”
that their drug increases life expectancy or reduces risk of disease? These types of important questions will
be addressed in this chapter!
8.1: Confidence Intervals
Slides (Google Drive) Video (YouTube)
We've talked about several ways to estimate unknown parameters, and desirable properties. But there is just one problem now: even if our estimator had all the good properties, the probability that our estimator for $\theta$ is exactly correct is 0, since $\theta$ is continuous (a decimal number)! We'll see how we can construct confidence intervals around our estimator, so that we can argue that $\hat\theta$ is close to $\theta$ with high probability.
The confidence interval for $\theta$ can be illustrated in the below picture. We will explain how to interpret a confidence interval at a specific confidence level soon.
Note that we can write this in any of the following three equivalent ways, as they all represent the probability that $\hat\theta$ and $\theta$ differ by no more than some amount $\Delta$:
\[ P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = P\left(|\hat\theta - \theta| \le \Delta\right) = P\left(\hat\theta \in [\theta - \Delta, \theta + \Delta]\right) = 0.95 \]
Note the first and third equivalent statements especially (swapping $\hat\theta$ and $\theta$).
We have learned about the CDF of the normal distribution. If $Z \sim N(0, 1)$, we denote the CDF $\Phi(a) = F_Z(a) = P(Z \le a)$, since it's so commonly used. There is no closed-form formula, so one way to find a z-score associated with a percentage is to look it up in a z-table.
Suppose we want a (centered) interval, where the probability of being in that interval is 95%.
Left bound: the probability of being less than the left bound is 2.5%.
Right bound: the probability of being greater than the right bound is 2.5%. Thus, the probability of being less than the right bound should be 97.5%.
Note the following two equivalent statements that say that $P(Z \le 1.96) = 0.975$ (where $\Phi^{-1}$ is the inverse CDF of the standard normal):
\[ \Phi(1.96) = 0.975 \qquad \Phi^{-1}(0.975) = 1.96 \]
Example(s)
Suppose $x_1, \dots, x_n$ are iid samples from $\text{Poi}(\theta)$ where $\theta$ is unknown. Our MLE and MoM estimates agreed at the sample mean: $\hat\theta = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$. Create an interval centered at $\hat\theta$ which contains $\theta$ with probability 95%.
Solution Recall that if $W \sim \text{Poi}(\theta)$, then $E[W] = \text{Var}(W) = \theta$, and so our estimator (the sample mean) $\hat\theta = \bar x$ has $E[\hat\theta] = \theta$ and $\text{Var}(\hat\theta) = \frac{\text{Var}(x_i)}{n} = \frac{\theta}{n}$. Thus, by the Central Limit Theorem, $\hat\theta$ is approximately Normally distributed:
\[ \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i \approx N\left(\theta, \frac{\theta}{n}\right) \]
If we standardize, we get that
\[ \frac{\hat\theta - \theta}{\sqrt{\theta/n}} \approx N(0, 1) \]
To construct our 95% confidence interval, we want $P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = 0.95$:
\begin{align*}
P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) &= P\left(\hat\theta - \Delta \le \theta \le \hat\theta + \Delta\right) & \text{[one of 3 equivalent statements]} \\
&= P\left(-\Delta \le \hat\theta - \theta \le \Delta\right) \\
&= P\left(-\frac{\Delta}{\sqrt{\theta/n}} \le \frac{\hat\theta - \theta}{\sqrt{\theta/n}} \le \frac{\Delta}{\sqrt{\theta/n}}\right) \\
&= P\left(-\frac{\Delta}{\sqrt{\theta/n}} \le Z \le \frac{\Delta}{\sqrt{\theta/n}}\right) & \text{[CLT]} \\
&= 0.95
\end{align*}
Because $\frac{\Delta}{\sqrt{\theta/n}}$ represents the right bound, and the probability of being less than the right bound is 97.5% for a 95% interval (see the above picture again), we have:
\[ \frac{\Delta}{\sqrt{\theta/n}} = \Phi^{-1}(0.975) = 1.96 \implies \Delta = 1.96\sqrt{\frac{\theta}{n}} \]
Since we don't know $\theta$, we plug in our estimator $\hat\theta$, and get
\[ [\hat\theta - \Delta, \hat\theta + \Delta] = \left[\hat\theta - 1.96\sqrt{\frac{\hat\theta}{n}},\ \hat\theta + 1.96\sqrt{\frac{\hat\theta}{n}}\right] \]
That is, since $\hat\theta$ is normally distributed with mean $\theta$, we just need to find the $\Delta$ so that $\hat\theta \pm \Delta$ contains 95% of the area in a Normal distribution. The way to do so is to find $\Phi^{-1}(0.975) = 1.96$, and go $\pm 1.96$ standard deviations of $\hat\theta$ in each direction!
Definition 8.1.1: Confidence Interval
Suppose you have iid samples $x_1, \dots, x_n$ from some distribution with unknown parameter $\theta$, and you have some estimator $\hat\theta$ for $\theta$.
A $100(1-\alpha)\%$ confidence interval for $\theta$ is an interval (typically but not always) centered at $\hat\theta$, $[\hat\theta - \Delta, \hat\theta + \Delta]$, such that the probability (over the randomness in the samples $x_1, \dots, x_n$) that $\theta$ lies in the interval is $1 - \alpha$:
\[ P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = 1 - \alpha \]
If $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean, then $\hat\theta$ is approximately normal by the CLT, and a $100(1-\alpha)\%$ confidence interval is given by the formula:
\[ \left[\hat\theta - z_{1-\alpha/2}\frac{\sigma}{\sqrt n},\ \hat\theta + z_{1-\alpha/2}\frac{\sigma}{\sqrt n}\right] \]
where $z_{1-\alpha/2} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$ and $\sigma$ is the true standard deviation of a single sample (which may need to be estimated).
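Here is a minimal sketch of this formula in code (my own, with made-up Poisson-like data):

import numpy as np
from scipy import stats

x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])  # hypothetical samples
theta_hat = x.mean()
sigma_hat = np.sqrt(theta_hat)                # Poisson: variance = mean (estimated)
z = stats.norm.ppf(1 - 0.05 / 2)              # z_{1-alpha/2} = Phi^{-1}(0.975) ≈ 1.96
delta = z * sigma_hat / np.sqrt(len(x))
print(theta_hat - delta, theta_hat + delta)   # the 95% confidence interval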
It is important to note that this last formula ONLY works when $\hat\theta$ is the sample mean (otherwise we can't use the CLT); you'll need to find some other strategy if it isn't.
If we wanted a 95% interval, then that corresponds to $\alpha = 0.05$, since $100(1-\alpha) = 95$. We were then looking up the inverse $\Phi$ table at $1 - \alpha/2 = 1 - 0.05/2 = 0.975$ to get our desired number of standard deviations in each direction, 1.96.
If we wanted a 98% interval, then that corresponds to $\alpha = 0.02$, since $100(1-\alpha) = 98$. We then would look up $\Phi^{-1}(0.99)$, since $1 - \alpha/2 = 0.99$, because if there is to be 98% of the area in the middle, there is 1% to the left and right!
Example(s)
Suppose you have $n = 400$ iid samples $x_1, \dots, x_n$ from $\text{Ber}(\theta)$ with $\sum_{i=1}^n x_i = 136$. Construct a 99% confidence interval for $\theta$, and interpret it.
Solution
Recall for the Bernoulli distribution $\text{Ber}(\theta)$, our MLE/MoM estimator was the sample mean:
\[ \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i = \frac{136}{400} = 0.34 \]
Since $\alpha = 0.01$, we look up $z_{1-\alpha/2} = \Phi^{-1}(0.995) \approx 2.576$, estimate the standard deviation of a single sample as $\sqrt{\hat\theta(1-\hat\theta)} \approx 0.474$, and get the interval
\[ \hat\theta \pm 2.576 \cdot \frac{0.474}{\sqrt{400}} \approx [0.279, 0.401] \]
You might be tempted to interpret this as: "there is a 99% probability that $\theta$ is in $[0.279, 0.401]$". This is incorrect because there is no randomness here: $\theta$ is a fixed parameter. $\theta$ is either in the interval or out of it; there's nothing probabilistic about it.
Correct: If we repeat this process several times (getting $n$ samples each time and constructing different confidence intervals), about 99% of the confidence intervals we construct will contain $\theta$.
Notice the subtle difference! Alternatively, before you receive samples, you can say that there is a 99% probability (over the randomness in the samples) that $\theta$ will fall into our to-be-constructed confidence interval $[\hat\theta - \Delta, \hat\theta + \Delta]$. Once you plug in the numbers, though, you cannot say that anymore.
Chapter 8. Statistical Inference
8.2: Credible Intervals
Slides (Google Drive) Video (YouTube)
Example(s)
Construct an 80% credible interval for $\Theta$ (the unknown probability of success) in $\text{Ber}(\Theta)$, given $n = 12$ iid samples $x = (x_1, x_2, \dots, x_{12})$ where $\sum_{i=1}^n x_i = 11$ (observed 11 successes out of 12). Suppose our prior is $\Theta \sim \text{Beta}(\alpha = 7, \beta = 3)$ (i.e., pretend we saw 6 successes and 2 failures ahead of time).
Solution From section 7.5 (MAP), we showed that choosing a Beta prior for $\Theta$ leads to a Beta posterior of $\Theta \mid x \sim \text{Beta}(11 + 7, 1 + 3) = \text{Beta}(18, 4)$, and our MAP was then $\frac{18-1}{(18-1)+(4-1)} = \frac{17}{20}$ (since we saw 17 total successes, and 3 total failures).
We want an interval $[a, b]$ such that $P(a \le \Theta \le b) = 0.8$.
If we look at the Beta PDF, we are looking for an interval such that the probability that we fall in this area is 80%. If the area is centered, then the area to the left of the interval should have probability 10%, and the area to the right of it should also have probability 10%.
This is equivalent to looking for $P(\Theta \le a) = 0.1$ and $P(\Theta \le b) = 0.9$. This information is given by the CDF of the Beta distribution. Note that on the x-axis we have the range of the Beta distribution $[0, 1]$, and on the y-axis we have the cumulative probability of being to the left, obtained by integrating the PDF from above.
Let $F_{\text{Beta}}$ denote the CDF of this $\text{Beta}(18, 4)$ distribution. Then, choose $a = F_{\text{Beta}}^{-1}(0.1) \approx 0.7089$ and $b = F_{\text{Beta}}^{-1}(0.9) \approx 0.9142$, so our credible interval is $[0.7089, 0.9142]$.
In order to compute the inverse CDF, we can use the scipy.stats library as follows:
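# one possible version of this lookup (a sketch): inverse CDF (ppf) of Beta(18, 4)
from scipy import stats

a = stats.beta.ppf(0.1, 18, 4)   # approximately 0.7089
b = stats.beta.ppf(0.9, 18, 4)   # approximately 0.9142
print(a, b)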
That's all there is to it! Just find the PDF/CDF of your posterior distribution (hopefully you chose a conjugate prior), and look up the inverse CDF at two probabilities whose difference is the desired confidence level of your credible interval.
Suppose you have iid samples $x = (x_1, \dots, x_n)$ from some distribution with unknown parameter $\Theta$. You are in the Bayesian setting, so you have chosen a prior distribution for the RV $\Theta$.
A $100(1-\alpha)\%$ credible interval for $\Theta$ is an interval $[a, b]$ such that the probability (over the randomness in $\Theta$) that $\Theta$ lies in the interval is $1 - \alpha$:
\[ P(\Theta \in [a, b]) = 1 - \alpha \]
If we've chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli), the posterior is easy to compute. Say the CDF of the posterior is $F_Y$. Then, a $100(1-\alpha)\%$ credible interval is given by
\[ \left[F_Y^{-1}\left(\frac{\alpha}{2}\right),\ F_Y^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \]
Again, this is the one which has equal area to the left and right of the interval, but there are infinitely many possible credible intervals you can create.
Interpretation: "there is an 80% probability that $\Theta$ lies in $[0.7089, 0.9142]$". This is correct because now $\Theta$ is a random variable, so the statement makes sense to say!
Contrast this with the interpretation of a confidence interval, where $\theta$ is a fixed number.
8.2.3 Exercises
1. Let $x = (x_1, \dots, x_n)$ be iid samples from $\text{Exp}(\Theta)$ where $\Theta$ is a random variable (not fixed). Recall from section 7.5 Exercise 1 that if we choose the prior distribution $\Theta \sim \text{Gamma}(r, \lambda)$, then the posterior distribution is $\Theta \mid x \sim \text{Gamma}(n + r, \lambda + \sum x_i)$.
Suppose $n = 13$, $\bar x = 0.21$, $r = 7$, $\lambda = 12$. Construct a 96% credible interval for $\Theta$. To find the point $t$ such that $F_T(t) = y$ for $T \sim \text{Gamma}(u, v)$, call the following function which gets the inverse CDF:
scipy.stats.gamma.ppf(y, u, 0, 1/v)
Then, verify that the MAP estimate is actually contained in your credible interval.
Solution: Before we call the function, we have to identify what $u$ and $v$ are. Plugging in the numbers above to the general posterior we computed earlier, we find
\[ \Theta \mid x \sim \text{Gamma}\left(13 + 7,\ 12 + 13 \cdot 0.21\right) = \text{Gamma}(20, 14.73) \]
Since we want a 96% interval, we must look up the inverse CDF at 0.02 and 0.98 (why?).
We write a few lines of code, calling the provided function twice:

>>> from scipy.stats import gamma
>>> gamma.ppf(0.02, 20, 0, 1/14.73)  # inverse cdf of Gamma(20, 14.73)
0.809150510196322
>>> gamma.ppf(0.98, 20, 0, 1/14.73)  # inverse cdf of Gamma(20, 14.73)
2.0514641398722735

Our 96% credible interval is hence $[0.809, 2.051]$. The MAP estimate is the mode of the posterior, $\frac{(n+r)-1}{\lambda + \sum x_i} = \frac{19}{14.73} \approx 1.290$, which is indeed contained in the interval.
Chapter 8. Statistical Inference
8.3: Hypothesis Testing
Slides (Google Drive) Video (YouTube)
Hypothesis testing allows us to "statistically prove" claims. For example, if a drug company wants to claim that their new drug reduces the risk of cancer, they might perform a hypothesis test. Or a company might want to argue that their academic prep program leads to a higher SAT score. A lot of business decisions are reliant on this statistical method of hypothesis testing, and we'll see how to conduct tests properly below.
Suppose your friend Mark, a magician, claims his coin is fair, but when he flips it 100 times, it comes up heads 99 times. Let's give Mark the benefit of the doubt: we'll compute the probability that we observed an outcome at least as extreme as this, given that Mark isn't lying.
If Mark isn't lying, then the coin is fair, so the number of heads observed should be $X \sim \text{Bin}(100, 0.5)$, because there are 100 independent trials and a 50% chance of heads since it's fair. So, the probability that we observe at least 99 heads (because we're looking for something at least as extreme) is the sum of the probability of 99 heads and the probability of 100 heads. You just sum the Binomial PMF and you get:
\[ P(X \ge 99) = \binom{100}{99}(0.5)^{99}(1 - 0.5)^1 + \binom{100}{100}(0.5)^{100} = \frac{101}{2^{100}} \approx 7.96 \times 10^{-29} \approx 0 \]
Basically, if the coin were fair, the probability of what we just observed (99 heads or more) is basically 0.
This is strong statistical evidence that the coin is NOT fair. Our assumption was that the coin is fair, but if
this were the case, observing such an extreme outcome would be extremely unlikely. Hence, our assumption
is probably wrong.
So, this is like a "Probabilistic Proof by Contradiction"!
1. Make a claim (like "Airplane food is good", "Pineapples belong on pizza", etc.)
• Our example will be that SuperSAT Prep claims that their program helps students perform better on the SAT. (The average SAT score as of June 2020 was 1059 out of 1600, and the standard deviation of SAT scores was 210.)
2. Set up a null hypothesis $H_0$ and alternative hypothesis $H_A$.
(a) The alternative hypothesis can be one-sided or two-sided.
• Let $\mu$ be the true mean of the SAT scores of students of SuperSAT Prep.
• Our null hypothesis is $H_0: \mu = 1059$, which is our "baseline", "no effect", "benefit of the doubt". We're going to assume that the true mean of our scores is the same as the nationwide scores (for the sake of contradiction).
• Our alternative hypothesis is what we want to show, which is $H_A: \mu > 1059$, or that SuperSAT Prep is good and that their test takers are (strictly) better off. So, our alternative will assert that $\mu > 1059$.
• This is called a one-sided hypothesis. The other one-sided hypothesis would be $\mu < 1059$ (if we wanted to argue that SuperSAT Prep makes students worse off).
• A two-sided hypothesis would be $\mu \ne 1059$, because it covers both sides (less than or greater than). This is if we wanted to argue that SuperSAT Prep makes some difference, for better or worse.
3. Choose a significance level $\alpha$ (usually $\alpha = 0.05$ or $0.01$).
• Let's choose $\alpha = 0.05$ and explain this more later!
4. Collect data.
• We observe 100 students from SuperSAT Prep, $x_1, \dots, x_{100}$. It turns out the sample mean of the scores is $\bar x = 1104$.
5. Compute a p-value, $p = P(\text{observing data at least as extreme as ours} \mid H_0 \text{ is true})$.
• Again, since we're assuming $H_0$ is true (that SuperSAT has no effect), our true mean $\mu$ is 1059 (again, we do this in hopes of reaching a "probabilistic contradiction"). By the CLT, since $n = 100$ is large, the distribution of the sample mean of 100 samples is approximately normal with mean 1059 and variance $\frac{210^2}{100}$ (because the variance of a single test taker was given to be $\sigma^2 = 210^2$, and so the variance of the sample mean is $\frac{\sigma^2}{n}$):
\[ \bar X \approx N\left(\mu = 1059,\ \sigma^2 = \frac{210^2}{100}\right) \]
So, then, the p-value is the probability that an arbitrary sample mean would be at least as extreme as the one we computed, which was 1104. So, we can just standardize and look up a table like always, which is a procedure you know how to do:
\[ p = P\left(\bar X \ge \bar x\right) = P\left(\frac{\bar X - \mu}{\sigma/\sqrt n} \ge \frac{\bar x - \mu}{\sigma/\sqrt n}\right) = P\left(Z \ge \frac{1104 - 1059}{210/\sqrt{100}}\right) = P(Z \ge 2.14) \approx 0.0162 \]
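A quick sketch of this computation in code (my own; scipy's norm.cdf does the z-table lookup):

import numpy as np
from scipy import stats

mu0, sigma, n, xbar = 1059, 210, 100, 1104
z = (xbar - mu0) / (sigma / np.sqrt(n))
print(z, 1 - stats.norm.cdf(z))   # about 2.14 and 0.016 (one-sided p-value)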
6. State your conclusion.
(a) If $p < \alpha$, "reject" the null hypothesis $H_0$ in favor of the alternative $H_A$. (Because, given the null hypothesis is true, the probability of what we saw happening (or something more extreme) is $p$, which is less than some small number $\alpha$.)
(b) Otherwise, "fail to reject" the null hypothesis $H_0$.
• Since $p = 0.0162 < 0.05 = \alpha$, we'll reject the null hypothesis $H_0$ at the $\alpha = 0.05$ significance level. We can say that there is strong statistical evidence to suggest that SuperSAT Prep actually helps students perform better on the SAT.
Notice that if we had chosen $\alpha = 0.01$ earlier instead of 0.05, we would have a different conclusion: since $p = 0.0162 > 0.01 = \alpha$, we fail to reject the null hypothesis at the $\alpha = 0.01$ significance level. There is insufficient evidence to prove that SuperSAT Prep actually helps students perform better.
Note that we'll NEVER say we "accept" the null hypothesis. If you recall the coin example, if we had observed 55 heads instead of 99, that wouldn't have been improbable. We wouldn't have called the magician a liar, but it does NOT imply that $p = 0.5$. It could have been 0.54 or 0.58, for example.
8.3.4 Exercises
1. You want to determine whether or not more than 3/4 of Americans would vote for George Washington for President in 2020 (if he were still alive). In a random poll sampling $n = 137$ Americans, we collected responses $x_1, \dots, x_n$ (each is 1 or 0, according to whether they would vote for him or not). We observe 131 "yes" responses: $\sum_{i=1}^n x_i = 131$. Perform a hypothesis test and state your conclusion.
Solution: We have our claim that "Over 3/4 of Americans would vote for George Washington for President in 2020 (if he were still alive)."
Let $p$ denote the true proportion of Americans that would vote for Washington. Then our null and alternative hypotheses are:
\[ H_0: p = 0.75 \qquad H_A: p > 0.75 \]
Our sample proportion is $\hat p = \frac{131}{137} \approx 0.956$. Under $H_0$, by the CLT, $\hat p \approx N\left(0.75, \frac{0.75 \cdot 0.25}{137}\right)$, so the p-value is
\[ P(\hat p \ge 0.956) = P\left(Z \ge \frac{0.956 - 0.75}{\sqrt{0.75 \cdot 0.25 / 137}}\right) \approx P(Z \ge 5.57) \approx 1.3 \times 10^{-8} \approx 0 \]
With a p-value so close to 0 (and certainly $< \alpha = 0.01$), we reject the null hypothesis that (only) 75% of Americans would vote for Washington. There is strong evidence that this proportion is actually larger.
Note: Again, what we did was: assume $p = 0.75$ (null hypothesis), then note that the probability of observing data so extreme (in fact, very close to 100% of people) was nearly 0. Hence, we reject this null hypothesis because what we observed would've been so unlikely if it were true.
Chapter 9: Applications to Computing
9.1: Intro to Python Programming
Slides (Google Drive) Video (YouTube)
9.1.1 Python
For this section only, I’ll ask you to use the slides linked above. There are a lot of great animations and
visualizations! We assume you know some programming language (such as Java or C++) beforehand, and
are merely teaching you the new syntax and libraries.
Python is the language of choice for anything related to scientific computing, data science, and machine
learning. It is also sometimes used for website development among many other things! It has extremely
powerful syntax and libraries - I came from Java and was adamant on having that be my main language.
But once I saw the elegance of Python, I never went back! I’m not saying that Python is “absolutely better”
than Java, but for our applications involving probability and math, it definitely is!
Chapter 9: Applications to Computing
9.2: Probability via Simulation
Slides (Google Drive) Video (YouTube)
9.2.1 Motivation
Even though we have learned several techniques for computing probabilities, and have more to go, it is still hard sometimes. Imagine I asked the question: "Suppose I randomly shuffle an array of the first 100 integers in order: [1, 2, ..., 100]. What is the probability that exactly 13 end up in their original position?" I'm not even sure I could solve this problem, and if so, it wouldn't be pretty to set up nor actually type into a calculator.
But since you are a computer scientist, you can actually avoid computing hard probabilities! You could even verify that your hand-computed answers are correct using this technique of "Probability via Simulation".
Suppose a weighted coin comes up heads with probability 1/3. How many flips do you think it will
take for the first head to appear? Use code to estimate this average!
Solution You may think it is just 3, and you would be correct! We’ll see how to prove this mathematically
in chapter 3 actually. But for now, since we don’t have the tools to compute it, let’s use our programming
skills!
The first thing we need to do is simulate a single coin flip. Recall that to generate a random number, we use the numpy library in Python:

np.random.rand()  # returns a single float in the range [0, 1)
This might be a bit tricky: since np.random.rand() returns a random float in [0, 1), the expression np.random.rand() < p is True with probability exactly p! For example, if p = 1/2, then np.random.rand() < 1/2 happens with probability 1/2, right? In our case we'll want p = 1/3, so the condition holds with probability 1/3.
This allows us to simulate the event in question: the first "Heads" appears whenever np.random.rand() returns a value < p. And, if the value is >= p, the coin flip turned up "Tails".
The following function allows us to simulate ONCE how long it took to get heads.
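One possible reconstruction of that function (a sketch; the slides' version may differ slightly):

import numpy as np

def sim_one_game(p=1/3) -> int:
    flips = 0
    while True:
        flips += 1
        if np.random.rand() < p:   # this flip came up heads
            return flips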
We start with our number of flips being 0. And we keep incrementing flips until we get a head. So this
should return an integer! We just need to simulate this game many times (call this function many times),
and take the average of our samples! Then, this should give us a good approximation of the true average
time (which happens to be 3)!
The code above is duplicated below, as a helper function. Python is great because you can define functions
inside other functions, only visible to the parent function!
import numpy as np

def coin_flips(p, ntrials=50000) -> float:

    def sim_one_game() -> int:   # internal helper function
        flips = 0
        while True:
            flips += 1
            if np.random.rand() < p:
                return flips

    total_flips = 0
    for i in range(ntrials):
        total_flips += sim_one_game()
    return total_flips / ntrials

print(coin_flips(p=1/3))
Notice the helper function is the exact same as above! All we did was call it ntrials times and return the
average number of flips per trial. This is it! The number 50000 is arbitrary: any large number of trials is
good!
Now to tackle the original problem:
Example(s)
Suppose I randomly shuffle an array of the first 100 integers in order: [1, 2, ..., 100]. What is the probability that exactly 13 end up in their original position? Use code to estimate this probability!
Hint: Use np.random.shuffle to shuffle an array randomly.
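One possible solution sketch (my own version; the course slides contain the original):

import numpy as np

def prob_exactly_k_fixed(n=100, k=13, ntrials=50000) -> float:
    original = np.arange(1, n + 1)
    count = 0
    for _ in range(ntrials):
        shuffled = original.copy()
        np.random.shuffle(shuffled)                       # random permutation, in place
        if np.count_nonzero(shuffled == original) == k:   # exactly k fixed points?
            count += 1
    return count / ntrials

print(prob_exactly_k_fixed())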
Take a look and see how similar this was to the previous example!
Chapter 9: Applications to Computing
9.3: The Naive Bayes Classifier
Slides (Google Drive) Video (YouTube)
9.3.1 Motivation
Have you ever wondered how Gmail knows whether or not an email should be marked as spam? Or how Alexa/Google Home can answer your free-form questions? How self-driving cars actually work? How social media platforms recommend friends and people you should follow? How the computer program Deep Blue beat the chess champion Garry Kasparov? The answer to all of these questions is: machine learning (ML)!
After learning just a tiny bit of probability, we are ready to discover one way to solve one extremely important type of ML task: classification. In particular, we'll learn how to take in an email (a string/text), and predict whether it is "Spam" or "Ham". We will discuss this further shortly!
It's okay if you didn't see the pattern, but we should predict 16; can you figure out why? It seems that the pattern is to take the number and multiply it by the number of sides in the shape! So for our last row, we take 4 and multiply by 4 (the number of sides of the square) to get 16. Sure, there is a possibility that this isn't the right function: this is only the simplest explanation we could give. The function could be some complex polynomial, in which case we would be completely wrong.
This is the idea of (supervised) machine learning (ML): given some training examples, we want to learn the pattern between the input features and output label and be able to have a computer predict the label on new/unseen examples. Above, our input features were number and shape. We want the computer to "learn" just like how we do: with several examples.
Within supervised ML, two of the largest subcategories are regression and classification. Regression refers to predicting a continuous (decimal) value. For example, predicting house price given features of the house, or predicting weight from height. Classification on the other hand refers to predicting one of a finite number of classes. For example, predicting whether an email is spam or ham, or whether an image contains a cat or a dog.
Example(s)
For each of the situations below with a desired output label, identify whether it would be a classifi-
cation or regression task. Then, describe what input features may be useful in making a prediction.
1. Predicting the price of a house.
2. Predicting whether or not a PhD applicant will be admitted.
3. Predicting which of 50 menu items someone will order.
Solution
1. This is a regression task, since we are predicting a continuous number like $310,321.55 or $1,235,998.23.
Some features which would be useful for prediction include: square footage, age, location, number of
bedrooms/bathrooms, number of stories, etc.
2. This is a classification task, since we are predicting one of two outcomes: admitted or not. Features which
may be important are: GPA, SAT score, recommendation letter quality, number of papers published,
number of internships, etc.
3. This is a classification task since we are choosing from one of 50 classes. Important features may
include: past order history, favorite cuisine, dietary restrictions, income, etc.
So how do we write the code to make the decision for us? In the past, people tried writing these classifiers with a set of rules that they came up with themselves. For example: if it is over 1000 words, predict "SPAM". Or if it contains the word 'Viagra', predict that it is "SPAM". This leads to code which looks like a ton of if-else statements, and is also not very accurate. In machine learning, we come up with a model that learns a decision-making rule for us! This may not make sense now, but I promise it will soon.
That is, we will reduce an email into a Set of lowercase words and nothing else! We’ll see a potential
drawback to this later, but despite these strong assumptions, the classifier still does a really good job!
Here are some examples of how we take the input string (email) to a Set of standardized words.
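For instance, a tiny preprocessing sketch (my own; the book's actual helper may differ):

import re

def wordset(email: str) -> set:
    # lowercase the email and keep only the set of distinct words
    return set(re.findall(r"[a-z']+", email.lower()))

print(wordset("Buy VIAGRA now!! You buy."))   # {'buy', 'viagra', 'now', 'you'}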
That all sounds nice, but how do we even begin to compute such a quantity? Let's try Bayes' Theorem with the Law of Total Probability and see where that gets us:
\[ P(\text{spam} \mid \{\text{you, buy, viagra}\}) = \frac{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam})}{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam}) + P(\{\text{you, buy, viagra}\} \mid \text{ham})\,P(\text{ham})} \]
How does this even help?? This looks way worse than before... Let's see if we can't start by figuring out the "easier" terms, like $P(\text{spam})$. Remember, we haven't even touched our data yet. Let's assume we were given five examples of emails with their labels to learn from:
Based on the data only, what would you estimate $P(\text{spam})$ to be? I might guess 3/5, and hope that you matched that! That is,
\[ P(\text{spam}) \approx \frac{\#\text{ of spam emails}}{\#\text{ of total emails}} \]
Similarly, we might estimate
\[ P(\text{ham}) \approx \frac{\#\text{ of ham emails}}{\#\text{ of total emails}} \]
to be 2/5 in our case. Great, so we've figured out two out of the four terms we needed after using Bayes/LTP. Now, we might try to similarly guess that
\[ P(\{\text{you, buy, viagra}\} \mid \text{spam}) \approx \frac{\#\text{ of spam emails containing all three words}}{\#\text{ of spam emails}} \]
because our definition of conditional probability came intuitively with equally likely outcomes in 2.1 as
\[ P(A \mid B) = \frac{|A \cap B|}{|B|} = \frac{P(A \cap B)}{P(B)} \]
But how many spam emails are we going to get that contain all three words? Probably none, or very few. In general, most emails will be much longer, so there's almost no chance that an email you are given to learn from has ALL of the words. This is a problem because it makes this probability 0, which isn't good for our model.
The Naive Bayes name comes from two parts. We've seen the Bayes part because we used Bayes' Theorem to (attempt to) compute our desired probability. We are at a roadblock now, and so we will make the "naive" assumption that words are conditionally independent GIVEN the label; in general,
\[ P(A, B, C \mid D) = P(A \mid D)\,P(B \mid D)\,P(C \mid D) \]
so that, in our case,
\[ P(\{\text{you, buy, viagra}\} \mid \text{spam}) \approx P(\text{you} \mid \text{spam})\,P(\text{buy} \mid \text{spam})\,P(\text{viagra} \mid \text{spam}) \]
which is most likely nonzero if we have a lot of emails! What should $P(\text{you} \mid \text{spam})$ be? It is 1/3: there is just one spam email out of three which contains the word "you". In general,
\[ P(\text{word} \mid \text{spam}) \approx \frac{\#\text{ of spam emails containing the word}}{\#\text{ of spam emails}} \]
Example(s)
Make a prediction as to whether this email is SPAM or HAM, using the Naive Bayes classifier! Do
this by computing P (spam | {you, buy, viagra}) and comparing it to 0.5. Don’t forget to use the
conditional independence assumption!
Solution Combining what we had earlier (Bayes + LTP) with the (naive) conditional independence assumption, we get
\[ P(\text{spam} \mid \{\text{you, buy, viagra}\}) = \frac{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam})}{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam}) + P(\{\text{you, buy, viagra}\} \mid \text{ham})\,P(\text{ham})} \]
\[ = \frac{P(\text{you} \mid \text{spam})P(\text{buy} \mid \text{spam})P(\text{viagra} \mid \text{spam})P(\text{spam})}{P(\text{you} \mid \text{spam})P(\text{buy} \mid \text{spam})P(\text{viagra} \mid \text{spam})P(\text{spam}) + P(\text{you} \mid \text{ham})P(\text{buy} \mid \text{ham})P(\text{viagra} \mid \text{ham})P(\text{ham})} \]
We need to compute a bunch of quantities, but notice the left side of the denominator is the same as the numerator, so we need to compute 8 quantities, 3 of which we did earlier! I'll just skip to the solution:
\[ P(\text{spam}) = \frac{3}{5} \qquad P(\text{ham}) = \frac{2}{5} \]
\[ P(\text{you} \mid \text{spam}) = \frac{1}{3} \qquad P(\text{you} \mid \text{ham}) = \frac{1}{2} \]
\[ P(\text{buy} \mid \text{spam}) = \frac{1}{3} \qquad P(\text{buy} \mid \text{ham}) = \frac{0}{2} \]
\[ P(\text{viagra} \mid \text{spam}) = \frac{3}{3} \qquad P(\text{viagra} \mid \text{ham}) = \frac{1}{2} \]
Once we plug in all these quantities, we end up with a probability of 1, because $P(\text{buy} \mid \text{ham}) = 0$ killed the entire right side of the denominator! It turns out then that we should predict spam because $P(\text{spam} \mid \{\text{you, buy, viagra}\}) = 1 > 0.5$, and this is correct! We still don't ever want zeros though, so we'll see how we can fix that soon!
Notice how the data (example emails) completely dictated our decision rule, along with Bayes Theorem and
Conditional Independence. That is, we learned from our data, and used it to make conclusions on new
data!
One last final thing, to avoid zeros, we will apply the following trick called “Laplace Smoothing”. Before,
we had said that
$$P(\text{word} \mid \text{spam}) \approx \frac{\text{\# of spam emails with word}}{\text{\# of spam emails}}$$
We will now pretend we saw TWO additional spam emails: one which contained the word, and one which
did not. This means instead that we have
$$P(\text{word} \mid \text{spam}) \approx \frac{\text{\# of spam emails with word} + 1}{\text{\# of spam emails} + 2}$$
This will ensure that we don't get any zeros! For example, P(buy | ham) was previously 0/2 (none of the two
ham emails contained the word "buy"), but now it is (0 + 1)/(2 + 2) = 1/4.
We do not usually apply Laplace smoothing to the label probabilities P(spam) and P(ham) since these will
never be zero anyway (and it wouldn't make much difference if we did).
Example(s)
Redo the example from earlier, but now apply Laplace smoothing to ensure no zero probabilities. Do
not apply it to the label probabilities.
Solution Basically, we just take the same numbers from above and add 1 to the numerator and 2 to the
denominator!
$$P(\text{spam}) = \frac{3}{5} \qquad P(\text{ham}) = \frac{2}{5}$$
$$P(\text{you} \mid \text{spam}) = \frac{1+1}{3+2} = \frac{2}{5} \qquad P(\text{you} \mid \text{ham}) = \frac{1+1}{2+2} = \frac{2}{4}$$
$$P(\text{buy} \mid \text{spam}) = \frac{1+1}{3+2} = \frac{2}{5} \qquad P(\text{buy} \mid \text{ham}) = \frac{0+1}{2+2} = \frac{1}{4}$$
$$P(\text{viagra} \mid \text{spam}) = \frac{3+1}{3+2} = \frac{4}{5} \qquad P(\text{viagra} \mid \text{ham}) = \frac{1+1}{2+2} = \frac{2}{4}$$
Plugging these in gives P(spam | {you, buy, viagra}) ≈ 0.7544 > 0.5, so our prediction is unchanged! But it
is better for probabilities never to be exactly one or zero, so this solution is preferred!
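To make all of this concrete, here is a minimal sketch in Python of the estimates above. The five training emails themselves aren't shown here (only their word counts), so the word sets below are hypothetical, chosen only to reproduce the counts we used: 3 spam and 2 ham emails, with one spam containing "you", one containing "buy", and all three containing "viagra".

```python
# A minimal sketch of Naive Bayes with Laplace smoothing on a tiny
# hypothetical dataset matching the counts above. Each email is
# represented as its set of standardized words.
spam_emails = [{"you", "viagra"}, {"buy", "viagra"}, {"viagra"}]  # hypothetical
ham_emails = [{"you"}, {"viagra"}]                                # hypothetical

def smoothed_word_prob(word, emails):
    """P(word | label) with Laplace smoothing: pretend we saw two extra
    emails with this label, one containing the word and one without it."""
    count = sum(1 for email in emails if word in email)
    return (count + 1) / (len(emails) + 2)

p_spam = len(spam_emails) / (len(spam_emails) + len(ham_emails))  # 3/5, unsmoothed
p_ham = 1 - p_spam                                                # 2/5

def predict_spam_prob(words):
    """P(spam | words) via Bayes + LTP + the conditional independence assumption."""
    spam_term, ham_term = p_spam, p_ham
    for w in words:
        spam_term *= smoothed_word_prob(w, spam_emails)
        ham_term *= smoothed_word_prob(w, ham_emails)
    return spam_term / (spam_term + ham_term)

print(predict_spam_prob({"you", "buy", "viagra"}))  # approximately 0.7544
```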
That's it for the main idea! We're almost there now, just some logistics.

Suppose we learned from a dataset of 1000 labeled emails. If we measured the accuracy on those same 1000
emails, surely it will be very good, right? It's like taking a practice test and then using that as your actual test - of course you'll do well! What
we care about is how well the spam filter works on NEW or UNSEEN emails. Emails that the spam filter
was not allowed to see/use when estimating those probabilities. This is fair and more realistic now right?
You get practice exams, as many as you want, but you are only evaluated once on an exam you (hopefully)
haven’t seen before!
Where do we get these new/unseen emails? We actually take our initial 1000 emails and do a train/test
split (usually around 80/20 split). That means, we will use 800 emails to estimate those quantities, and
measure the accuracy on the remaining 200 emails. The 800 emails we learn from are collectively called the
training set, and the 200 emails we test on are collectively called the test set.
This is good because we care how our classifier does on new examples, and so when doing machine learning,
we ALWAYS split our data into separate training/testing sets!
Disclaimer: Accuracy is typically not a good measure of performance for classification. Look into F1-Score
and AUROC instead if you are interested! Since this isn’t a ML class, we will stick with plain accuracy for
simplicity.
9.3.3.5 Summary
Here’s a summary of everything we just learned:
Suppose we are given a set of emails WITH their labels (of spam or ham). We split into a training
set with around 80% of the data, and a test set with the remaining 20%.
Suppose we are given an email with wordset {w1, ..., wk} and want to make a prediction. We
compute, using Bayes' Theorem, the law of total probability, and our naive assumption that words are
conditionally independent given their label:

$$P(\text{spam} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$

and predict spam if this probability is greater than 0.5 (and ham otherwise).
To get a fair measure of performance, make predictions using the above procedure on all the
TEST emails and return the overall test accuracy.
One last computational issue: if an email contains many distinct words (large k), we are multiplying a bunch
of numbers between 0 and 1, and so we will get some very very small number
(close to zero). When numbers get too large on a computer (exceeding 2^63 or so), it is called
overflow, and results in weird and wrong arithmetic. Our problem is the appropriately named underflow, as
we can't handle the precision of numbers so close to zero.
This is the last thing we need to figure out (I promise). Remember that our two probabilities P(spam | {w1, ..., wk})
and P(ham | {w1, ..., wk}) summed to 1, so we only needed to compute one of them. Let's go back to computing both, and just comparing which is larger:
$$P(\text{spam} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$
$$P(\text{ham} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$
Notice the denominators are equal: they are both just P({w1, ..., wk}). So, P(spam | {w1, ..., wk}) >
P(ham | {w1, ..., wk}) if and only if the corresponding numerator is greater:

$$P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) > P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})$$
Each side is still a tiny product though, so the final trick is to take logs of both sides. Since log is monotone
increasing, the comparison above is equivalent to

$$\log P(\text{spam}) + \sum_{i=1}^{k} \log P(w_i \mid \text{spam}) > \log P(\text{ham}) + \sum_{i=1}^{k} \log P(w_i \mid \text{ham})$$

and a sum of logs stays at a very manageable magnitude. And that's it, problem solved! If our initial quantity (after multiplying 50 word probabilities) was something
like P(spam | {w1, ..., wk}) ≈ 10^{−81}, then log P(spam | {w1, ..., wk}) ≈ −186.51. There is no chance of
underflow anymore!
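In code, the log-space comparison might look like the following sketch, reusing the hypothetical `smoothed_word_prob`, `p_spam`, `p_ham`, and email sets from the earlier snippet:

```python
import math

def predict_label(words):
    """Compare log-numerators instead of raw products to avoid underflow."""
    log_spam = math.log(p_spam) + sum(
        math.log(smoothed_word_prob(w, spam_emails)) for w in words)
    log_ham = math.log(p_ham) + sum(
        math.log(smoothed_word_prob(w, ham_emails)) for w in words)
    return "spam" if log_spam > log_ham else "ham"

print(predict_label({"you", "buy", "viagra"}))  # spam
```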
After reading Chapter 7: do you see how MLE/MAP were used here? We used MLE to estimate P(spam)
and P(ham). We also used MAP to estimate all the P(wi | spam) as well, with a Beta(2, 2) prior: pretending
we saw 1 of each success and failure. Naive Bayes actually required us to estimate all these different Bernoulli
parameters, and it's great to come back and see!
Chapter 9: Applications to Computing
9.4: Bloom Filters
Slides (Google Drive) Video (YouTube)
9.4.1 Motivation
Google Chrome has a huge database of malicious URLs, but it takes a long time to do a database lookup
(think of this as a typical Set, but on a different computer than yours). As you may know, Sets have
desirable constant-time lookup, but due to the fact it isn’t on your computer, the time bottleneck comes
from the communication between the database and your computer. They want to have a quick check in the
web browser itself (on your computer), so a space-efficient data structure must be used.
That is, we want to save both time (not in the typical big-Oh sense) and space. But what will we trade
for it? It turns out we will have limited operations (fewer than a Set), and some probability of error which
turns out to be fine.
9.4.2 Definition
A bloom filter is a probabilistic data structure which only supports the following two operations:
I. add(x): Add an element x to the structure.
II. contains(x): Check if an element x is in the structure. It either returns "definitely not in the set" or
"could be in the set".
It does not support the following two operations:
I. Delete an element from the structure.
II. Give a collection of elements that are in the structure.
The idea is that we can check our bloom filter if a URL is in the set. The bloom filter is always correct in
saying a URL definitely isn’t in the set, but may have false positives (it may say a URL is in the set when it
isn’t). So most of the time, we get instant time, and only in these rare cases does Chrome have to perform
an expensive database lookup to know for sure.
Suppose we have k bit arrays t1 , . . . , tk each of length m (all entries are 0 or 1), so the total space required
is only km bits or km/8 bytes (as a byte is 8 bits). See below for one with k = 3 arrays of length m = 5:
So regardless of the number of elements n that we want to store in our bloom filter, we use the same
amount of memory! That being said, the higher n is for a fixed k and m, the higher your error rate will be.
Suppose the universe of URL’s is the set U (think of this as all strings with less than 100 characters),
and we have k independent and uniform hash functions h1, ..., hk : U → {0, 1, ..., m − 1}. That is, for
an element x and hash function hi, pretend hi(x) is a discrete Unif(0, m − 1) random variable. Basically,
when we see a new URL, we will add it to one random entry per row of our bloom filter.
See the image below to see how we add the URL “thisisavirus.com” into our bloom filter.
For each of our k = 3 hash functions (corresponding to each row), we hash our URL x as hi(x) to get a
random integer from {0, 1, ..., 4} (0 to m − 1). It happened that h1(x) = 2, h2(x) = 1, and h3(x) = 4 in this
example: each hash function is independent of the others and chooses a position uniformly at random.
But if we hash the same URL, we will get the same hash. In other words, if I tried to add this URL
one more time, nothing would change because all the entries were already set to 1. Notice we never “unset”
an entry: once a URL sets an entry to 1, it will stay 1 forever.
Now let’s see how the contains function is implemented. When we check whether the URL we just added
is contained in the bloom filter, we should definitely return yes.
We say that a URL x is contained in the bloom filter, if when we apply each hash function hi (x), the
corresponding entries are already set to 1. We added this URL “thisisavirus.com” right before this, so we
are guaranteed that t1 [2] == 1, t2 [1] == 1, and t3 [4] == 1, and so we return TRUE overall! You might now
see how this could lead to false positives: returning TRUE even though the URL was never added! Don’t
worry if not, we’ll see some examples below.
That’s all there is for bloom filters!
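If you imagine coding this up, a sketch might look as follows. The "uniform hash functions" here are a stand-in built by salting a cryptographic hash with the row index i; treat the specific hashing scheme as an assumption, not part of the definition.

```python
import hashlib

class BloomFilter:
    def __init__(self, k, m):
        self.k, self.m = k, m
        self.t = [[0] * m for _ in range(k)]  # k bit arrays, each of length m

    def _hash(self, i, x):
        # Stand-in for the i-th uniform hash function h_i: U -> {0, ..., m-1}.
        digest = hashlib.sha256(f"{i}|{x}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def add(self, x):
        for i in range(self.k):
            self.t[i][self._hash(i, x)] = 1  # set one bit per row; never unset

    def contains(self, x):
        # "Could be in the set" only if every row's corresponding bit is 1.
        return all(self.t[i][self._hash(i, x)] == 1 for i in range(self.k))

bf = BloomFilter(k=3, m=5)
bf.add("thisisavirus.com")
print(bf.contains("thisisavirus.com"))  # always True once added
print(bf.contains("somesafeurl.com"))   # usually False, but could be a false positive
```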
Example(s)
Suppose we add a second URL to the bloom filter above (one whose hashes collide with an entry that
is already set), and then call contains on a third URL that was never added. What can happen?

Solution
Notice that t3[4] was already set to 1 by the previous entry, and that's okay! We just leave it set to 1.
Notice here we got a false positive: that means, saying a URL is in the bloom filter when it wasn't. This
is a tradeoff we make in exchange for using much less space.
9.4.3 Analysis
You might be dying to know, what is the false positive rate (FPR) for a bloom filter, and how should I
choose k and m? These are great questions, and we actually have the tools to figure this out already.
After inserting n distinct URLs into a k × m bloom filter (k hash functions/rows, m columns), suppose
we had a new URL and wanted to check whether it was contained in the bloom filter. The false
positive rate (the probability the bloom filter returns True incorrectly) is

$$\left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)^{k}$$
Proof of Bloom Filter FPR. We get a match for a new URL x if, in each row, the bit chosen by the hash
function hi(x) is set to 1.

For i = 1, ..., k, let Ei be the event that bit hi(x) in row i is already set to 1. Then,

$$P(\text{false positive}) = P(E_1 \cap E_2 \cap \dots \cap E_k) = \prod_{i=1}^{k} P(E_i)$$
where the last equality is because each hash function is assumed to be independent of the others.
Now, let's focus on a single row i (all the rows are the "same"). The probability P(Ei) that the bit is set to 1
is the probability that at least one of the n URLs hashed to that entry. Seeing "at least one" should
tell you: try the complement instead (otherwise, use inclusion-exclusion)!
So the probability a bit remains at 0 after n entries are added (the event $E_i^C$) is

$$P\left(E_i^C\right) = \left(1 - \frac{1}{m}\right)^{n}$$

because the probability of missing this bit for a single URL is 1 − 1/m. Hence,

$$P(E_i) = 1 - P\left(E_i^C\right) = 1 - \left(1 - \frac{1}{m}\right)^{n}$$
Finally, combining this result with the previous gives our final answer, since each row has the same probability:

$$P(\text{false positive}) = \prod_{i=1}^{k} P(E_i) = \left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)^{k}$$
So n, the number of malicious URLs Google Chrome would like to store, should definitely play a
part in how large they choose k and m to be.
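Since the FPR formula is so simple, it's worth a one-line function to play with the parameters (a sketch; the example values below are just illustrative):

```python
def bloom_fpr(n, k, m):
    """False positive rate after inserting n distinct items into a bloom
    filter with k hash functions (rows) and m columns per row."""
    return (1 - (1 - 1 / m) ** n) ** k

print(bloom_fpr(n=5_000_000, k=30, m=900_000))     # rows too crowded: about 0.89
print(bloom_fpr(n=5_000_000, k=30, m=50_000_000))  # roomier rows: essentially 0
```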
Let’s now see (by example) the kind of time and space improvement we can get.
Example(s)
1. Let’s compare this approach to using a typical Set data structure. Google wants to store
5 million URLs, with each URL taking (on average) 40 bytes. How much space (in MB, 1
MB = 1 million bytes) is required if we store all the elements in a set? How much space (in
MB) is required if we store all the elements in a bloom filter with k = 30 hash functions and
m = 900, 000 buckets? Recall that 1 byte = 8 bits.
2. Let’s analyze the time improvement as well. Let’s say an average Chrome user attempts to
visit 102,000 URLs in a year, only 2,000 of which are actually malicious. Suppose it takes
half a second for Chrome to make a call to the database (the Set), and only 1 millisecond for
Chrome to check containment in the bloom filter. Suppose the false positive rate on the bloom
filter is 3%; that is, if a website is not malicious, the bloom filter will incorrectly report
it as malicious with probability 0.03. What is the time (in seconds) taken if we only use the
database, and what is the expected time taken (in seconds) to check all 102,000 strings if we
used the bloom filter + database combination described earlier?
Solution
1. For the set, we would require 5 million times 40 bytes, for a total of 200 MB.
For the bloom filter, we need just km/8 = 27/8 million bytes, or 3.375 MB, wow! Note how this doesn’t
depend (directly) at all on how many URLs, or the size of each one as we just hash it to a few bits.
Of course, k and m should increase with n though :) to keep the FPR low.
2. If we only use the database, it will take 102,000 · 0.5 = 51,000 seconds.
If we use the bloom filter + database combination, we will definitely call the bloom filter 102,000 times
at 0.001 seconds each, for a total of 102 seconds. Then for about 3% of the 100,000 non-malicious URLs (3,000
of them), we'll have to do a database lookup, costing 3,000 · 0.5 = 1,500 seconds. For the 2,000 actually
malicious URLs, we also have to do a database lookup, costing 2,000 · 0.5 = 1,000 seconds. So in total,
102 + 1,500 + 1,000 = 2,602 seconds.
Just take a second to stare at how much memory savings we had (the first part), and the time savings we
had (the second part)!
9.4.4 Summary
Hopefully now you see the pros and cons of bloom filters. We cannot delete from the bloom filter (why?)
nor list out which elements are in it because we never stored the string! Below summarizes the operations
of a bloom filter.
If you imagine coding this up, it’s so short, only a few lines of code! We just saw how probability and
randomness can be used to save space and time, in exchange for accuracy! In our application, we didn’t even
mind the accuracy part because we would just do the lookup in that case just to be certain anyway! We saw
it being used for a data structure, and in our next application, we’ll see it being used for an algorithm.
Randomness just makes our lives (as computer scientists) better, and can lead to elegant and beautiful data
structures and algorithms which often outperform their deterministic counterparts.
Chapter 9: Applications to Computing
9.5: Distinct Elements
Slides (Google Drive) Video (YouTube)
9.5.1 Motivation
YouTube wants to count the number of distinct views for a video, but doesn’t want to store all the user
ID’s. How can they get an accurate count of users without doing so? Note: A user can view their favorite
video several times, but should only be counted as one distinct view.
Before we attempt to solve this problem, you should wonder: why should we even care? For one of the
most popular videos on YouTube, let’s say there are N = 2 billion views, with n = 900 million of them being
distinct views. How much space is required to accurately track this number? Well, let’s assume a user ID is
an 8-byte integer. Then, we need 900,000,000 × 8 bytes total if we use a Set to track the user IDs, which
requires 7.2 gigabytes of memory for ONE video. Granted, not too many videos have this many views, but
imagine now how many videos there are on YouTube: I'm not sure of the exact number, but I wouldn't be
surprised if it was in the tens or hundreds of millions, or even higher!
It would be great if we could get the number of distinct views with constant space O(1) instead of linear
space O(n) required by storing all the IDs (let's say a single 8-byte floating point number instead of 7.2
GB). It turns out we (approximately) can! There is no free lunch of course - we can’t solve this problem
exactly with constant memory. But we can trade this space for some error in accuracy, using the contin-
uous Uniform random variable! That is, we will potentially have huge memory savings, but are okay with
accepting a distinct view count which has some margin of error.
9.5.2 Intuition
This seemingly unrelated calculation will be crucial in tying our algorithm together - I’ll ask for your patience
as we do this. Let U1 , . . . , Um be m iid (independent and identically distributed) RVs from the continuous
Unif(0, 1) distribution. If we take the minimum of these m random variables, what do we “expect” it to be?
That is, if X = min{U1 , . . . , Um }, what is E [X]? Before actually doing the computation, let’s think about
this intuitively and see some pictures.
What these examples are getting at is that the expected value of the smallest of m Unif(0, 1) RVs is

$$E[X] = E[\min\{U_1, \dots, U_m\}] = \frac{1}{m+1}$$
I promise this will be the key observation in making this clever algorithm work. If you believed the intuition
above, that’s great! If not, that’s also fine, so I’ll have to prove it to you formally below. Whether you
believe me or not at this point, you are definitely encouraged to read through the strategy as it may come
up many times in your future.
We can compute the CDF of X = min{U1, ..., Um} for x ∈ [0, 1] as follows:

$$F_X(x) = P(X \le x) = 1 - P(\min\{U_1, \dots, U_m\} > x) = 1 - P(U_1 > x, \dots, U_m > x) = 1 - \prod_{i=1}^{m} P(U_i > x) = 1 - (1-x)^m$$

Some of these steps need more justification. For the second equation, we use the fact that the minimum of
numbers is greater than a value if and only if all of them are (think about this). For the next equation, the
probability that all of the Ui > x is just the product of the m probabilities, by our independence assumption.
And finally, for Ui ~ Unif(0, 1), we know its CDF (look it up in our table) is P(Ui ≤ x) = (x − 0)/(1 − 0) = x, and
so P(Ui > x) = 1 − P(Ui ≤ x) = 1 − x.
I'll leave it to you to compute the density fX(x) by differentiating the CDF we just computed, and then
using our standard expectation formula (the minimum of numbers in [0, 1] is also in [0, 1]):

$$E[X] = \int_0^1 x f_X(x)\, dx$$

and you should get E[X] = 1/(m + 1) after all this work!
If you are thinking of giving up now, I promise this was the hardest part! The rest of the section should be
(generally) smooth sailing.
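If you'd rather convince yourself by simulation than by calculus, a quick sketch with numpy agrees with the 1/(m + 1) formula:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
m = 9
# 100,000 trials: in each trial, take the min of m iid Unif(0,1) samples.
mins = rng.uniform(0, 1, size=(100_000, m)).min(axis=1)
print(mins.mean())  # should be close to 1/(m+1) = 0.1
```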
Suppose the universe of user IDs is the set U (think of this as all 8-byte integers), and we have a single
uniform hash function h : U → [0, 1] (i.e., for a user ID y, pretend h(y) is a continuous Unif(0, 1) random
variable). That is, h(y1 ), h(y2 ), ..., h(yk ) for any k distinct elements are iid continuous Unif(0, 1) random
variables, but since the hash function always gives the same output for some given input, h(y1 ) and h(y1 )
are the “same” Unif(0, 1) random variable.
To parse that mess, let’s see two examples. These will also hopefully give us the lightbulb moment!
Example(s)
This is a stream of user IDs. From this, there are 3 distinct views (13,25,19) out of 6 total views.
The uniform hash function h might give us the following stream of hashes:
Note that all of these numbers are between 0 and 1 as they should be, as they are supposedly
Unif(0, 1). Note also that for the same user ID, we get the same hash! That is, h(19) will always
return 0.79, h(25) is always 0.26, and so on. Now go back and reread the previous paragraph and see
if it makes more sense.
Example(s)
Consider the same stream of N = 6 elements as the previous example, with n = 3 distinct elements.
1. How many independent Unif(0, 1) RVs are there total: N or n?
2. If we only stored the minimum value every time we received a view, we would store the single
floating point number 0.26, as it is the smallest hash of the six. If we didn't know n, how
might we exploit 0.26 to get the value of n = 3? Hint: Use the fact we proved earlier that
E[min{U1, ..., Um}] = 1/(m + 1), where U1, ..., Um are iid.
Solution
1. As you can see, we only have three iid Uniform RVs: 0.26, 0.51, 0.79. So in general, we'll be
taking the minimum of n (and not N) RVs.
2. Actually, remember that the expected minimum of n distinct/independent values is approximately
1/(n + 1), as we showed earlier. Our 0.26 isn't exactly equal to E[X], but it is an estimate
for it! So if we solve

$$0.26 \approx E[X] = \frac{1}{n+1}$$

we would get that n ≈ (1/0.26) − 1 ≈ 2.846. Rounding this to the nearest integer of 3 actually
gives us the correct answer!
So our strategy is: keep a running minimum (a single floating point number which ONLY takes 8 bytes).
As we get a stream of user IDs x1, ..., xN, hash each one and update the running minimum
if necessary. When we want to estimate n, we just reverse-solve n = round(1/E[X] − 1), and
that's it! Take a minute to reread this example if necessary, as this is the entire idea!
This is known as the Distinct Elements algorithm! We start our single floating point minimum (called val
below) at 1, and repeatedly update it. The key observation is that we are only taking the minimum of n iid
Uniform RVs, and NOT N, because h always returns the same value given the same input. Reverse-solving
E[X] = 1/(n + 1) for n gives us an estimate of the number of distinct elements, since val is only an
approximation of E[X]. Note we want to round to the nearest integer because n should be an integer.
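Converting that pseudocode into Python might look like the sketch below; the hash function is a hypothetical stand-in that maps each ID to a reproducible "uniform" value in [0, 1).

```python
import hashlib

def h(user_id):
    # Stand-in uniform hash: the same input always maps to the same
    # pseudo-uniform value in [0, 1).
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

def distinct_elements(stream):
    val = 1.0                       # running minimum of the hashes
    for user_id in stream:
        val = min(val, h(user_id))  # update if we see a smaller hash
    # val approximates E[X] = 1/(n+1), so reverse-solve for n and round.
    return round(1 / val - 1)

print(distinct_elements([13, 25, 19, 25, 19, 19]))  # estimate of n (true n = 3)
```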
This algorithm sounds great right? One pass over the data (which is the best we can do in time complexity),
and one single float (which is the best we can do in space complexity)! But you have to remember the
tradeoff is in the accuracy, which we haven't seen yet.
The reason the previous example was spot-on is because I cheated a little bit. I ensured the three values
0.26, 0.51, 0.79 were close to where they were supposed to be: 0.25, 0.50, 0.75. Actually, it’s most important
that just the minimum is on-target. See the following example for an unfortunate situation.
Example(s)
The uniform hash function h might give us the following stream of N = 7 hashes:
Trace the distinct elements algorithm above by hand and report the value that it will return for our
estimate. Compare it to the true value of n = 4 which is unknown to the algorithm.
Solution
At the end of all the updates, val will be equal to the minimum hash of 0.1. So the estimated number of
distinct elements is
$$\text{round}\left(\frac{1}{0.1} - 1\right) = 9$$

There are only n = 4 distinct elements though! The reason it didn't work out well for us this time is that
the minimum value was supposed to be around 1/5 = 0.2, but was actually 0.1. This is not necessarily a
huge difference until we take its reciprocal...
That’s it! The code for this algorithm is actually pretty short and sweet (imagine converting the pseudocode
above into code). If you take a step back and think about what machinery we needed, we needed continuous
RVs: the idea of PDF/CDF, and the Uniform RV. The mathematical/statistical tools we learn have many
applications to computer science; we have several more to go!
If X1, ..., Xn are iid RVs with mean µ and variance σ², we'll show that the sample mean
X̄n = (1/n) Σ_{i=1}^{n} Xi has the same mean as each Xi, but lower variance.

$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

$$\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$
That is, the sample mean will have the same expectation, but the variance will go down linearly! Why might
this make sense? Well, imagine you wanted to estimate the height of American adults: would you rather
have a sample of 1, 10, or 100 adults? All would be correct in expectation, but the size of 100 gives us more
confidence in our answer!
So if we instead estimate the minimum E[X] = 1/(n + 1) with the average of k minimums instead of just one,
we should get a more accurate estimate for E[X], and hence for n, the number of distinct elements, as well!
So, imagine we had k independent hash functions instead of just one: h1 , . . . , hk , and k minimums val1 , val2 , . . . , valk .
Stream →   13     25     19     25     19     19    | val_i
h1         0.51   0.26   0.79   0.26   0.79   0.79  | 0.26
h2         0.22   0.83   0.53   0.83   0.53   0.53  | 0.22
...        ...    ...    ...    ...    ...    ...   | ...
hk         0.27   0.44   0.72   0.44   0.72   0.72  | 0.27
Each row represents one hash function hi , and the last column in each row is the minimum for that hash
function. Again, we’re only keeping track of the k floating point minimums in the final column. Now, for
improved accuracy, we just take the average of the k minimums first, before reverse-solving. Imagine k = 3
(so there were no rows in . . . above). Then, a good estimate for the true minimum E [X] is
0.26 + 0.22 + 0.27
E [X] ⇡ = 0.25
3
✓ ◆
1
So our estimate for n is round 1 = 3, which is perfect! Note that we basically combined 3 distinct
0.25
elements instances with h1 , h2 , h3 individually from earlier, in a way that reduced the variance! The indi-
vidual estimates 0.26, 0.22, 0.27 were varying around 0.25, but their average was even closer!
Now our memory is just O(k) instead of O(1), but we get a better estimate as a result. It is up to you to
determine how you want to trade off these two opposing quantities.
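A sketch of the improved version is below, reusing the hypothetical stand-in hash `h` from the earlier snippet; salting the input with the row index i simulates k independent hash functions.

```python
def distinct_elements_avg(stream, k=50):
    vals = [1.0] * k  # one running minimum per hash function
    for user_id in stream:
        for i in range(k):
            vals[i] = min(vals[i], h(f"{i}|{user_id}"))  # salted copy of h
    avg_min = sum(vals) / k  # lower-variance estimate of E[X] = 1/(n+1)
    return round(1 / avg_min - 1)
```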
9.5.5 Summary
We just saw today an extremely clever use of continuous RVs, applied to computing! In general, randomness
(the use of a random number generator (RNG)) in algorithms and data structures often can help improve
either the time or space (or both)! We saw earlier with the bloom filter how adding a RNG can save a ton
of space in a data structure. Even if you don’t go on to study machine learning or theoretical CS, you can
see what we’re learning can be applied to algorithms and data structures, arguably the core knowledge of
every computer scientist.
Chapter 9: Applications to Computing
9.6: Markov Chain Monte Carlo (MCMC)
Slides (Google Drive) Video (YouTube)
9.6.1 Motivation
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (among other things). In this section, we’ll design MCMC algorithms to solve the following two
problems, and you will be able to solve many more yourself!
• The Knapsack Problem: Suppose you have a knapsack which has some maximum weight capacity.
There are n items with weights w1 , . . . , wn > 0 and values v1 , . . . , vn > 0, and we want to choose the
subset of them that maximizes the total value subject to the weight constraint of the knapsack. How
can we do this?
• The Travelling Salesman Problem (TSP): Suppose you want to find the best route (minimizing
total distance travelled) between the 50 U.S. state capitals that we want to visit! A valid route starts
and ends in the same state capital, and visits each capital exactly once (this is known as the TSP,
and is known to be NP-Hard). We will design an MCMC algorithm for this as well!
As the name suggests, this technique depends a bit on the idea of Markov Chains. Most of this section
then will actually be building up the foundations of Markov Chains, and MCMC will follow soon after. In
fact, you could definitely understand and code up the algorithm without learning this math, but if you care
to know how and why it works (you should), then it is important to learn first!
A discrete-time stochastic process (DTSP) is a sequence of random variables X0, X1, X2, ..., where Xt
represents some quantity at time t. For example:
• The temperature in Seattle each day. X0 can be the temperature today, X1 tomorrow, and so on.
• The price of Google stock at the end of each year. X0 can be the final price at the end of the year it
IPO’d, X1 the next, and so on.
• The number of people who come to my store each day. X0 is the number of people who came on the
first day, X1 on the second, and so on.
Consider the following random walk on the graph below. You’ll see what that means through an example!
Suppose we start at node 1, and at each time step, independently step to a neighboring node with equal
probability.
For example, X0 = 1 since at time t = 0, we are at node 1. Then, X1 can be either 2 or 3 (but not 4 or 5
since not neighbors of node 1). And so on. So each Xt just tells us the position we are at at time t, and is
always in the set {1, 2, 3, 4, 5} (for our example anyway).
This DTSP actually has a lot of structure, and is an example of a special type of DTSP called a
Markov Chain: can you think about how this particular setup provides a lot of additional constraints over
a normal DTSP?
Here are three key properties of a Markov Chain, which we will formalize immediately after:
1. We only have finitely many states (5 in our example: {1, 2, 3, 4, 5}). (The stock price or temperature
example earlier could be any real number).
2. We don’t care about the past, given the present. That is, the distribution of where we go next
ONLY depends on where we are currently, and not any past history.
3. The transition probabilities are the same at each step (stationary). That is, whether we are at node 1 at
time t = 0 or t = 152, we are always equally likely to go to node 2 or 3.
3. Has stationary transition probabilities. That is, we always transition from state si to sj with
probability independent of the current time. Hence, due to this property and the previous one, the
transitions are governed by n² probabilities: the probability of transitioning from one of n current
states to one of n next states. These are stored in a square n × n transition probability
matrix (TPM) P, where Pij = P(Xt+1 = sj | Xt = si) is the probability of transitioning
from si → sj, for any and every time t.
If you’re a bit confused right now, especially with that last bullet point, this is totally normal and means you
are paying attention! Let’s construct the TPM for the graph example earlier to see what it means exactly.
For example, the second entry of the first row is: given that Xt = 1 (we are in state 1 at some time t), what
is the probability of going to state 2 next Xt+1 = 2? It’s 1/2 because from state 1, we are equally likely to
go to state 2 or 3. It isn’t possible to go to states 1, 4, and 5, and that’s why their respective entries are 0.
From state 2, we can only go to states 1 and 4 as you can see from the graph and the TPM. Try filling out
the remaining three rows yourself! These images may help:
Note that in the last row, from state 5, we MUST go to state 4, and so P54 = 1 and the rest of the row
has zero probability. Also note that each ROW sums to 1, but there is no such constraint on the columns.
That's because each row is secretly a (conditional) PMF, right? Given we are in some state si (Xt = si), the probabilities
of going to the next state Xt+1 must sum to 1.
Example(s)
Now let’s talk about how to compute some probabilities we may be interested in. Nothing here is
“new”: it is all based on your core probability knowledge from the previous chapters! Let’s say we
want to find out the probability we end up at state 5 after two time steps, starting from state 3. That
is, compute P(X2 = 5 | X0 = 3). Try to come up with an "intuitive" answer first, and then show
your work formally.
Solution You might be able to hack your way around to a solution since it is only two time steps: something
like $\frac{1}{2} \cdot \frac{1}{3}$.
Intuitively, we can either go to state 4 or 1 from state 3 with equal probability. If we went to state 1, there’s
no chance we make it to state 5. If we went to state 4, there’s a 1/3 chance we go to state 5. So our answer
is 1/2 · 1/3 = 1/6. This is just the LTP conditioning on possible middle states!
Now we'll write this out more generally. The LTP we need will be a conditional form though: the LTP says that if
the Bi's partition the sample space,

$$P(A) = \sum_i P(A \mid B_i)\, P(B_i)$$

But what if we wanted P(A | C)? We just condition everything on C as well to get:

$$P(A \mid C) = \sum_i P(A \mid B_i, C)\, P(B_i \mid C)$$
Applying this with A = {X2 = 5}, C = {X0 = 3}, and Bi = {X1 = i} for i = 1, ..., 5 gives

$$P(X_2 = 5 \mid X_0 = 3) = \sum_{i=1}^{5} P(X_2 = 5 \mid X_1 = i, X_0 = 3)\, P(X_1 = i \mid X_0 = 3) = \sum_{i=1}^{5} P(X_2 = 5 \mid X_1 = i)\, P(X_1 = i \mid X_0 = 3)$$

The second equation comes because the probability of X2 given both the positions X0 and X1 only depends
on X1, right? Once we know where we are currently, we can forget about the past. But now, we can zero
out several of these terms because P(X1 = i | X0 = 3) = 0 for i = 2, 3, 5. So we are left with just 2 of the 5 terms:

$$= P(X_2 = 5 \mid X_1 = 1)\, P(X_1 = 1 \mid X_0 = 3) + P(X_2 = 5 \mid X_1 = 4)\, P(X_1 = 4 \mid X_0 = 3)$$
If you have the TPM P (we have this above), try looking up the entries to see if you get the same answer!

$$= P_{15} P_{31} + P_{45} P_{34} = 0 \cdot \frac{1}{2} + \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$$
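Here's a quick numerical check of this computation (a sketch; the transition matrix below is reconstructed from the text's description of the graph, so treat its exact entries as an assumption):

```python
import numpy as np

# Assumed TPM for the 5-node random walk: P[i][j] = P(X_{t+1} = j+1 | X_t = i+1).
P = np.array([
    [0,   1/2, 1/2, 0,   0  ],  # from 1: to 2 or 3
    [1/2, 0,   0,   1/2, 0  ],  # from 2: to 1 or 4
    [1/2, 0,   0,   1/2, 0  ],  # from 3: to 1 or 4
    [0,   1/3, 1/3, 0,   1/3],  # from 4: to 2, 3, or 5
    [0,   0,   0,   1,   0  ],  # from 5: must go to 4
])
assert np.allclose(P.sum(axis=1), 1)  # each row is a conditional PMF

two_step = np.linalg.matrix_power(P, 2)  # entry (i, j) is P(X_2 = j+1 | X_0 = i+1)
print(two_step[2, 4])  # P(X_2 = 5 | X_0 = 3) = 1/6
```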
Back to our random walk example: suppose we weren't sure where we started. That is, let the vector v
be such that P(X0 = i) = vi, where vi is the ith element of v (these probabilities sum to 1, because we must
start in one of these 5 positions). Think of this vector v as our belief distribution of where we are at time
t = 0. Let’s compute vP , the matrix-product of v and P , the transition probability matrix. We’ll see what
comes out of it after computing and interpreting it! If you haven’t taken linear algebra yet, don’t worry: vP
is the following 5-dimensional row vector:
$$vP = \left(\sum_{i=1}^{5} v_i P_{i1},\ \sum_{i=1}^{5} v_i P_{i2},\ \sum_{i=1}^{5} v_i P_{i3},\ \sum_{i=1}^{5} v_i P_{i4},\ \sum_{i=1}^{5} v_i P_{i5}\right)$$
What does vP represent? Let's focus on the first entry, and substitute vi = P(X0 = i) and Pi1 =
P(X1 = 1 | X0 = i) (the probability of going from i → 1). We actually get (by LTP over initial states):

$$\sum_{i=1}^{5} P_{i1} v_i = \sum_{i=1}^{5} P(X_1 = 1 \mid X_0 = i)\, P(X_0 = i) = P(X_1 = 1)$$
This is an interesting pattern that holds for the remaining entries as well! In fact, the i-th entry of vP is
just P(X1 = i), so overall, the vector vP represents your belief distribution at the next time step!
That is, right-multiplying by the transition matrix P literally transitions your belief distribution from one
time step to the next.
We can also see that, for example, vP² = (vP)P is your belief of where you are after 2 time steps, and by
induction, vP^n is your belief of where you are after n time steps.
A natural question might then be: does vP^n have a limit as n → ∞? That is, after a long time, is there a belief
distribution (5-dimensional row vector) π such that it never changes again? The answer is unfortunately:
it depends. We won't go into the technical details of when it does and doesn't exist (search "Fundamental
Theorem of Markov Chains" if you are interested), but this leads us to the following definition:
The stationary distribution of a Markov Chain with n states (if one exists) is the n-dimensional
row vector π (representing a probability distribution: entries which are nonnegative and sum to 1)
such that

$$\pi P = \pi$$

Intuitively, it means that the belief distribution at the next time step is the same as the distribution
at the current one. This typically happens after a "long time" (called the mixing time) in the process,
meaning after lots of transitions were taken.
We’re going to see an example of this visually, which will also help us build our final piece of intuition for
MCMC. Consider the Markov Chain we’ve been using throughout this section:
Here is the distribution v that we’ll start with. Our Markov Chain happens to have a stationary distribution,
so we’ll see what happens as we take vP n for n ! 1 visually.
v = (0.25, 0.45, 0.15, 0.05, 0.10)
Here is a heatmap of it visually:
You can see from the key that darker values mean lower probabilities (hence 4 and 5 are very dark), and
that 2 is the lightest value since it has the highest probability.
We’ll then show the distribution after 1 step, 5 steps, 10 steps, and 100 steps. Before we continue, what
do you think the fifth entry will look like after one time step, the probability of being in node 5? Actually,
there is only one way to get to node 5, and that’s from node 4, which we start in with probability only 0.05.
From there, only a 1/3 chance to get to node 5, so node 5 will only have 0.05/3 = 1/60 probability at time
step 1 and hence be super dark.
It turns out that after just n = 100 time steps, we start getting the same distribution over and over again
(see t = 10 and t = 100: there’s already almost no di↵erence)! This limiting value of vP n is the stationary
distribution!
$$\pi = \lim_{n \to \infty} vP^n = (0.12, 0.28, 0.28, 0.18, 0.14)$$

Suppose π = vP^100 above. Once we find π such that πP = π for the first time, that means that if we
transition again, we get

$$\pi P^2 = (\pi P) P = \pi P = \pi$$
(applying the equality ⇡P = ⇡ twice). That means, by just running the Markov Chain for several
time steps, we actually reached our stationary distribution! This is the most crucial observation
for MCMC.
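In code, running the chain on belief distributions is just repeated vector-matrix multiplication (a sketch reusing the assumed transition matrix P from the earlier snippet; whether the beliefs settle down depends on the chain, as noted above):

```python
v = np.array([0.25, 0.45, 0.15, 0.05, 0.10])  # belief distribution at time 0
for t in range(100):
    v = v @ P                                 # belief at the next time step
print(v)       # belief after 100 steps
print(v @ P)   # if stationary, this matches v (i.e., pi P = pi)
```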
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (though generally it is used to sample from a distribution). The general strategy is as
follows:
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) transition
probabilities that result in the stationary distribution ⇡ having higher probabilities on “good”
solutions to our problem. We don’t actually compute ⇡, but we just want to define the Markov
Chain such that the stationary distribution would have higher probabilities on more desirable
solutions.
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a "good" state/solution).
This means: start at some initial state, and transition according to the transition
probability matrix (TPM) for a long time. This will eventually take us to our stationary
distribution which has high probability on "good" solutions!
Again, if this doesn’t make sense yet, that’s totally fine. We will apply this two-step procedure to two
examples below so you can understand better how it works!
Note that our total value is the sum of the values of the items we take: think about why $\sum_i v_i x_i$ is the total
value (remember that xi is either 0 or 1). This problem has 2^n possible solutions (either take each item or
don't), and so is combinatorially hard (exponentially many solutions). If I asked you to write a program to
do this, would you even know where to begin, except by writing the brute-force solution?
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution ⇡ having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with 2n states (that’s huge!). The states will be all possible solutions:
binary vectors x of length n (only having 0/1 entries). We’ll then define our transitions to go to “good”
states (ones that satisfy our weight constraint), while keeping track of the best solution so far. This
way, our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when
we sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
Basically, this algorithm starts with the guess of x being all zeros (no items). Then, for NUM_ITER
steps, we simulate the Markov Chain. Again, what this does is give us a sample from our stationary
distribution. Inside the loop, we literally just choose a random object and flip whether or not we have
it. We keep track of the best solution so far and return it. A sketch in code appears below.
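Here's roughly what that might look like in Python (a sketch; the weights, values, and capacity at the bottom are a made-up instance):

```python
import random

def mcmc_knapsack(weights, values, capacity, num_iter=100_000):
    n = len(weights)
    x = [0] * n                       # start with no items (always feasible)
    best_x, best_value = list(x), 0
    for _ in range(num_iter):
        i = random.randrange(n)       # choose a random item...
        x[i] = 1 - x[i]               # ...and flip whether or not we take it
        if sum(w * xi for w, xi in zip(weights, x)) > capacity:
            x[i] = 1 - x[i]           # infeasible: revert, stay at current state
            continue
        value = sum(v * xi for v, xi in zip(values, x))
        if value > best_value:        # keep track of the best solution so far
            best_x, best_value = list(x), value
    return best_x, best_value

# Hypothetical instance with 5 items.
print(mcmc_knapsack(weights=[3, 5, 7, 4, 2], values=[4, 8, 10, 5, 3], capacity=10))
```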
That’s all there is to it! This is such a “dumb” solution right? We just start somewhere and randomly
transition for a long time and hope our answer is good. So MCMC definitely won’t guarantee us to get the
best solution, but it leads to “dumb” solutions that actually work quite well in practice. We are guaranteed
though (provided we take enough transitions), to sample from the stationary distribution which has higher
probabilities on good solutions. This is because we only transition to solutions that maintain feasibility.
Note: This is just one version of MCMC for the knapsack problem; there are definitely better versions.
It would be better to transition to solutions which have higher value, not just feasible solutions
like we did. The next example does a better job of this!
Given n locations and distances between each pair, we want to find an ordering of them that:
• Starts and ends in the same location.
• Visits each location exactly once (except the starting location twice).
• Minimizes the total distance travelled.
You can imagine an instantiation of this problem for the US Postal Service. A mail delivery person wants to
start and end at the post office, and find the most efficient route which delivers all the mail to the residents.
Again, where would you even begin on trying to solve this, other than brute-force? MCMC to the rescue
again! This time, our algorithm will be more clever than the previous.
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution ⇡ having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with n! states (that’s huge!). The states will be all possible solutions
(state=route): all orderings of the n locations. We’ll then define our transitions to go to “good” states
(ones that go to lower-distance routes), while keeping track of the best solution so far. This way,
our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when we
sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
We will start with a random state (route). At each iteration, propose a new state (route) as follows:
choose a random index from {1, 2, ..., n}, and swap that location with the successive (next) location
in the route, possibly with wraparound if the last index is chosen. If the proposed route has lower total
distance (is better) than the current route, we will always transition to it (exploitation). Otherwise, if
T > 0, with probability e^{−Δ/T}, update the current route to the proposed route, where Δ > 0 is the
increase in total distance. This allows us to transition to a "worse" route occasionally (exploration),
and get out of local optima! Repeat this for NUM_ITER transitions from the initial state (route), and
output the shortest route seen during the entire process (which may not be the last route).
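A sketch of this in Python is below (the 4-location distance matrix at the bottom is a made-up instance; a route is an ordering of location indices, with the return leg to the start included in the distance):

```python
import math
import random

def tour_distance(route, dist):
    # Total distance, including the leg back to the starting location.
    n = len(route)
    return sum(dist[route[i]][route[(i + 1) % n]] for i in range(n))

def mcmc_tsp(dist, num_iter=100_000, T=1.0):
    n = len(dist)
    route = list(range(n))
    random.shuffle(route)                        # random initial state (route)
    cur_dist = tour_distance(route, dist)
    best_route, best_dist = list(route), cur_dist
    for _ in range(num_iter):
        i = random.randrange(n)
        j = (i + 1) % n                          # next location, with wraparound
        route[i], route[j] = route[j], route[i]  # propose the swapped route
        delta = tour_distance(route, dist) - cur_dist  # increase in distance
        if delta <= 0 or random.random() < math.exp(-delta / T):
            cur_dist += delta                    # accept: exploit, or explore worse
            if cur_dist < best_dist:
                best_route, best_dist = list(route), cur_dist
        else:
            route[i], route[j] = route[j], route[i]  # reject: undo the swap
    return best_route, best_dist

dist = [[0, 2, 9, 4],
        [2, 0, 6, 3],
        [9, 6, 0, 5],
        [4, 3, 5, 0]]
print(mcmc_tsp(dist))
```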
Again, this is such a “dumb” solution right? But also very clever! We just start somewhere and randomly
transition for a long time and hope our answer is good. And it should be: after a long time, our route
distance increasingly gets better and better, so we should expect a rather good solution!
9.6.4 Summary
Once again, we’ve used probability to make our lives easier. There are definitely papers and research on
how to solve these problems deterministically, but this is one of the simplest algorithms you can get, and
it uses randomness! Again, the idea of MCMC for optimization is: define the state space to be all possible
solutions, define transitions to go to better states, and just run it and wait!
Chapter 9: Applications to Computing
9.7: Bootstrapping (for Hypothesis Testing)
Slides (Google Drive) Video (YouTube)
9.7.1 Motivation
We've just learned how to perform a generic hypothesis test, where in our examples we were often able
to use the Normal distribution and its CDF due to the CLT. But actually, there are tons of other specialized
hypothesis tests which won't allow this. For example:
• The t-test for equality of means when variance is unknown.
• The χ²-test of independence (testing whether two quantities are independent or not).
• The F-test for equality of variances (testing whether or not the variances of two populations are equal).
There are many more that I haven't even listed, because I probably haven't even heard of them myself! These
three above though involve three distributions we haven't learned yet: the t, χ², and F distributions. But
because you are a computer scientist, we'll actually learn a way now to completely erase the need to learn
each specific procedure, called bootstrapping!
Example(s)
Main Idea: We have some (not enough) data and want more. How can we “get more”?
Imagine: You have 1000 iid coin flip samples, x1 ,...,x1000 which are all 1’s and 0’s. Your boss wants
you to somehow get/generate 500 more (independent) samples.
How can you "get more (iid) data" without actually having access to the coin? There are
two proposed solutions, both of which you could theoretically come up with, but only one of
which I expect most of you to guess. The first: use the 1000 flips to estimate p (the sample
proportion of heads, the MLE), and generate 500 fresh samples from Ber(p̂). The second, which
is the bootstrap: simply resample, with replacement, 500 values from the 1000 samples you
already have. Both treat the data we have as our best stand-in for the true distribution!
Example(s)
A colleague has collected samples of weights of labradoodles that live on two different islands:
CatIsland and DogIsland. The colleague collects 48 samples from CatIsland, and 43 samples from
the DogIsland. The colleague notes ahead of time that she thinks the labradoodles on DogIsland
have a higher spread of weights than CatIsland. You are skeptical. You and your colleague do
however agree to assume that their true means are equal. Here is the data:
CatIsland Labradoodle Weights (48 samples): 13, 12, 7, 16, 9, 11, 7, 10, 9, 8, 9, 7, 16, 7, 9,
8, 13, 10, 11, 9, 13, 13, 10, 10, 9, 7, 7, 6, 7, 8, 12, 13, 9, 6, 9, 11, 10, 8, 12, 10, 9, 10, 8, 14, 13, 13, 10, 11
DogIsland Labradoodle Weights (43 samples): 8, 8, 16, 16, 9, 13, 14, 13, 10, 12, 10, 6, 14, 8,
13, 14, 7, 13, 7, 8, 4, 11, 7, 12, 8, 9, 12, 8, 11, 10, 12, 6, 10, 15, 11, 12, 3, 8, 11, 10, 10, 8, 12
Solution Step 5 is the only part where bootstrapping is involved. Everything else is the same as we learned
in 8.3!
1. Make a claim.
The spread of labradoodle weights on DogIsland is (significantly) larger than that on CatIsland.
2. Set up a null hypothesis H0 and alternative hypothesis HA.

$$H_0: \sigma_C^2 = \sigma_D^2 \qquad H_A: \sigma_C^2 < \sigma_D^2$$
Our null hypothesis is that the spreads are the same, and our alternative is what we want to show.
Here, spread is taken to mean “variance”.
3. Choose a significance level α (usually α = 0.05 or 0.01).
Next, we compute the p-value. Under the null hypothesis, the two populations have the same variance.
Because of this, we can combine the two samples into a single one of size 48 + 43 = 91 (in our
case, we've also assumed the means are the same, so this is okay). Then, we repeatedly bootstrap
this combined sample (let's say 50,000 times): we sample with replacement a sample of size 48, and one of
size 43, and compute the sample variances of these two samples. Then, we compute the sample proportion
of times the difference in variances was at least as extreme as the one we observed, and that's it! See the
pseudocode below, and reread these two paragraphs.
Algorithm 5 Bootstrapping for p-value for H0: σ²_C = σ²_D vs HA: σ²_C < σ²_D
1: Given: Two samples x = [x1, ..., xn] and y = [y1, ..., ym].
2: obs_diff ← s²_y − s²_x (the difference in sample variances).
3: combined ← concat(x, y) = [x1, x2, ..., xn, y1, y2, ..., ym] (of size n + m).
4: count ← 0.
5: for i = 1, 2, ..., 50000 do        ▷ Any large number is fine.
6:     x' ← resample(combined, n) with replacement.        ▷ Sample of size n from combined.
7:     y' ← resample(combined, m) with replacement.        ▷ Sample of size m from combined.
8:     diff ← s²_{y'} − s²_{x'}.        ▷ Compute the difference in sample variances.
9:     if diff ≥ obs_diff then        ▷ This line changes depending on the alternative hypothesis.
10:         count ← count + 1.
11: p-val ← count/50000.
Again, what we're doing is: assuming there was this master island that split into two (same variance),
what is the probability we observed a sample of size 48 and a sample of size 43 with variances at least
as extreme as we did? That is, if we were to repeat this "separation" process many times, how often
would we get a difference so large? We don't have the other labradoodles from the master island, so
we bootstrap (reuse our current samples). It turns out this method leads to a good approximation to
the true p-value!
It's important to note that the alternative hypothesis is EXTREMELY IMPORTANT. If instead we
wanted to assert HA: σ²_C ≠ σ²_D, we would have used absolute values for diff and obs_diff. Also, for
example, if we wanted to make a statement about the means µC and µD instead, we would have
computed and compared the sample means instead of the sample variances.
It turns out we get a p-value of approximately 0.07. (Try coding this up yourself!)
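Here's one way to code it up with numpy, following Algorithm 5 directly (a sketch; `ddof=1` gives the sample variance, and the data is copied from the problem statement above):

```python
import numpy as np

def bootstrap_pvalue(x, y, num_boot=50_000, seed=0):
    """Approximate p-value for H0: var_x = var_y vs HA: var_x < var_y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    obs_diff = y.var(ddof=1) - x.var(ddof=1)  # observed difference in sample variances
    combined = np.concatenate([x, y])
    count = 0
    for _ in range(num_boot):
        x_star = rng.choice(combined, size=len(x), replace=True)
        y_star = rng.choice(combined, size=len(y), replace=True)
        if y_star.var(ddof=1) - x_star.var(ddof=1) >= obs_diff:
            count += 1
    return count / num_boot

cat = [13, 12, 7, 16, 9, 11, 7, 10, 9, 8, 9, 7, 16, 7, 9, 8, 13, 10, 11, 9, 13, 13,
       10, 10, 9, 7, 7, 6, 7, 8, 12, 13, 9, 6, 9, 11, 10, 8, 12, 10, 9, 10, 8, 14,
       13, 13, 10, 11]
dog = [8, 8, 16, 16, 9, 13, 14, 13, 10, 12, 10, 6, 14, 8, 13, 14, 7, 13, 7, 8, 4,
       11, 7, 12, 8, 9, 12, 8, 11, 10, 12, 6, 10, 15, 11, 12, 3, 8, 11, 10, 10, 8, 12]
print(bootstrap_pvalue(cat, dog))  # should land near 0.07
```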
Since our p-value of 0.07 was greater than α = 0.05, we fail to reject the null hypothesis. There
is insufficient evidence to show that the labradoodle spreads are different across the two islands.
Actually, this two-sample test for difference in variances is usually done by an "F-Test of Equality of Variances" (see
Wikipedia). But because we know how to code, we don't need to know that!
You can imagine bootstrapping for other types of hypothesis tests as well! Actually, bootstrapping is a
powerful tool which also has other applications.
Chapter 9: Applications to Computing
9.8: Multi-Armed Bandits
Slides (Google Drive) Video (YouTube)
Actually, for this application of bandits, we will do the problem setup before the motivation. This is because
modelling problems in this bandit framework may be a bit tricky, so we’ll kill two birds with one stone. We’ll
also see how to do “Modern Hypothesis Testing” using bandits!
You bought some credits and can pull any slot machines, but only a total of T = 100 times. At each time
step t = 1, ..., T, you pull arm at ∈ {1, 2, ..., K} and observe a random reward. Your goal is to maximize
your total (expected) reward after T = 100 pulls! The problem is: at each time step (pull), how do I decide
which arm to pull based on the past history of rewards?
We make a simplifying assumption that each arm is independent of the rest, and has some reward
distribution which does NOT change over time.
Here is an example you may be able to do: don’t overthink it!
Example(s)
If the reward distributions are given in the image below for the K = 3 arms, what is the best strategy
to maximize your expected reward?
Solution We can just compute the expectations of each arm from its given distribution. The first machine
has expectation 1.36, the second has expectation np = 4, and the third has expectation µ = 1. So to
maximize our total reward, we should just always pull arm 2 because it has the best expected reward! There
would be no benefit in pulling other arms at all.
So we’re done right? Well actually, we DON’T KNOW the reward distributions at all! We must estimate all
K expectations (one per arm), WHILE simultaneously maximizing reward! This is a hard problem because
we know nothing about the K reward distributions. Which arm should we pull then at each time step? Do
we pull arms we know to be “good” (probably), or try other arms?
• Exploitation: Pulling the arms we currently believe to be "good" based on the rewards observed so far.
• Exploration: Pulling less-frequently pulled arms in the hopes they are also "good" or even better.
In this section, we will only handle the case of Bernoulli bandits. That is, the reward of each arm
a ∈ {1, ..., K} is Ber(pa) (i.e., we either get a reward of 1 or 0 from each machine, with possibly different
probabilities). Observe that the expected reward of arm a is just pa (the expectation of a Bernoulli).
The last thing we need to talk about when talking about bandits is regret. Regret is the difference between
• The best possible expected reward (if you always pulled the best arm), and
• The reward you actually accumulated with your strategy.
Let p* = max_{i∈{1,2,...,K}} pi denote the highest expected reward from one of the K arms. Then, the regret
at time T is

$$\text{Regret}(T) = T p^* - \text{Reward}(T)$$

where Tp* is the reward from the best arm if you pull it T times, and Reward(T) is your actual reward after
T pulls. Sometimes it's easier to think about this in terms of average regret (divide everything by T):

$$\text{Avg-Regret}(T) = p^* - \frac{\text{Reward}(T)}{T}$$
The below summarizes and formalizes everything above into this so-called “Bernoulli Bandit Framework”.
The focus for the rest of the entire section is: “how do we choose which arm”?
9.8.2 Motivation
Before we talk about that though, we’ll discuss the motivation as promised.
As you can see above, we can model a lot of real-life problems as a bandit problem. We will learn two
popular algorithms: Upper Confidence Bound (UCB) and Thompson Sampling. This is after we discuss
some “intuitive” or “naive” strategies you may have yourself!
We’ll actually call on a lot of our knowledge from Chapters 7 and 8! We will discuss maximum likelihood,
maximum a posteriori, confidence intervals, and hypothesis testing, so you may need to brush up on those!
One strategy may be: pull each arm M times in the beginning, and then forever pull the best arm! This is
described formally below:
Actually, this strategy is no good, because if we choose the wrong best arm, we will regret it for the rest
of time! You might then say, why don't we increase M? If you do that, then you are pulling sub-optimal
arms more than you should, which would not help us in maximizing reward... The problem is: we did all of
our exploration FIRST, and then exploited our best arm (possibly incorrect) for the rest of time. Why don't
we try to blend in exploration more? Do you have any ideas on how we might do that?
The following algorithm is called the ε-Greedy algorithm, because it explores with probability ε at each
time step! It has the same initial setup: pull each arm M times to begin. But it does two things better than
the previous algorithm:
1. It continuously updates an arm's estimated expected reward when it is pulled (even after the KM
steps).
2. It explores with some probability ε (you choose). This allows you to choose in some quantitative way
how to balance exploration and exploitation.
See below!
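Here is a sketch of ε-greedy for Bernoulli bandits. The true pi's and the simulated `pull` are of course hypothetical: the algorithm itself only ever sees the rewards.

```python
import random

def epsilon_greedy(true_ps, T=10_000, M=5, eps=0.1, seed=0):
    rng = random.Random(seed)
    K = len(true_ps)
    pull = lambda a: 1 if rng.random() < true_ps[a] else 0  # Ber(p_a) reward
    counts, sums = [0] * K, [0] * K
    total_reward = 0
    for t in range(T):
        if t < K * M:
            a = t % K                  # initial phase: pull each arm M times
        elif rng.random() < eps:
            a = rng.randrange(K)       # explore: uniformly random arm
        else:                          # exploit: arm with best estimated reward
            a = max(range(K), key=lambda i: sums[i] / counts[i])
        r = pull(a)
        counts[a] += 1                 # continuously update the arm's estimate
        sums[a] += r
        total_reward += r
    return total_reward

print(epsilon_greedy([0.5, 0.2, 0.9]))  # hopefully not far below 0.9 * 10,000
```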
However, we can do much much better! Why should we explore each arm uniformly at random, when we
have a past history of rewards? Let's explore more the arms that have the potential to be really good! In an
extreme case, if there is an arm with average reward 0.01 after 100 pulls and an arm with average reward
0.6 after only 5 pulls, should we really explore each equally?
Suppose at some point our three arms have the following estimates and confidence intervals:
• Arm 1: Estimate is p̂1 = 0.75. Confidence interval is [0.75 − 0.10, 0.75 + 0.10] = [0.65, 0.85].
• Arm 2: Estimate is p̂2 = 0.33. Confidence interval is [0.33 − 0.25, 0.33 + 0.25] = [0.08, 0.58].
• Arm 3: Estimate is p̂3 = 0.60. Confidence interval is [0.60 − 0.29, 0.60 + 0.29] = [0.31, 0.89].
Notice all the intervals are centered at the MLE. Remember the intervals may have different widths, because
the width of a confidence interval depends on how many times the arm has been pulled (more pulls means more
confidence and hence a narrower interval). Review 8.1 if you need to recall how we construct them.
The greedy algorithm from earlier at this point in time would choose arm 1 because it has the highest
estimate (0.75 is greater than 0.33 and 0.60). But our new Upper Confidence Bound (UCB) algorithm
will choose arm 3 instead, as it has the highest possibility of being the best (0.89 is greater than 0.85 and
0.58).
See how exploration is “baked in” now? As we pull an arm more and more, the upper confidence bound
decreases. The less frequently pulled arms have a chance to have a higher UCB, despite having a lower
point estimate! After the next algorithm we examine, we will visually compare and contrast the results. But
before we move on, let’s take a look at this visually.
Suppose we have K = 5 arms. The following picture depicts at time t = 10 what the confidence intervals
may look like. The horizontal lines at the top of each arm represent the upper confidence bound, and the red
dots represent the TRUE (unknown) means. The center of each confidence interval are the ESTIMATED
means.
Pretty inaccurate at first, right? Because it's so early on, our estimates are expected to be bad.
Notice how the interval for the best arm (arm 5) keeps shrinking, and is the smallest one because it was pulled
(exploited) so much! Clearly, arm 1 was terrible, and so our estimate isn't perfect; it has the widest interval
since we almost never pulled it. This is the idea of UCB: basically just greedy, but using upper confidence
bounds!
You can go to the slides linked at the top of the section if you would like to see a step-by-step of the first
few iterations of this algorithm (slides 64-86).
Note that if we just deleted the +√(2 ln(t) / N_t(i)) term in the 5th line of the algorithm, it would reduce
to the greedy algorithm!
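A minimal Python sketch of UCB (again assuming the same hypothetical pull(i) returning 0/1 rewards):

    import math

    def ucb(K, T, pull):
        """UCB: pull the arm maximizing (empirical mean) + sqrt(2 ln(t) / N_t(i))."""
        counts = [0] * K
        totals = [0.0] * K
        for i in range(K):                   # pull each arm once so N_t(i) > 0
            counts[i] = 1
            totals[i] = pull(i)
        for t in range(K + 1, T + 1):
            # Dropping the "+ math.sqrt(...)" bonus below recovers plain greedy.
            i = max(range(K), key=lambda j: totals[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
            totals[i] += pull(i)
            counts[i] += 1
        return sum(totals)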
So as I mentioned earlier, each p_i is a RV which starts with a Beta(1, 1) distribution. For each arm i, we
keep track of α_i and β_i, where α_i − 1 is the number of successes (number of times we got a reward of 1),
and β_i − 1 is the number of failures (number of times we got a reward of 0).
For this algorithm, I would highly recommend going to the slides linked at the top of the section to see a
step-by-step of the first few iterations (slides 94-112). If you don't want to, we'll still walk through it below!
Let's again suppose we have K = 3 arms. At time t = 1, we sample once from each arm's Beta distribution.
We suppose the true p_i's are 0.5, 0.2, and 0.9 for arms 1, 2, and 3 respectively (see the table). Each arm
has α_i and β_i, initially 1. We get a sample from each arm's Beta distribution and just pull the arm with
the largest sample! In our first step, each arm has the same distribution Beta(1, 1) = Unif(0, 1), so each
is equally likely to be pulled. Then, because arm 2 has the highest sample (of 0.75), we pull arm 2. The
algorithm doesn't know this, but there is only a 0.2 chance of getting a 1 from arm 2 (see the table), and so
let's say we happen to observe our first reward to be zero: r_1 = 0.
Consistent with our Beta random variable intuition and MAP, we increment our number of failures by 1 for
arm 2 only.
At the next time step, we do the same! Sample from each arm’s Beta and choose the arm with the highest
sample. We’ll see it for a more interesting example below after skipping a few time steps.
Now let's say we're at time step 4, and we see the following chart. It depicts the current Beta densities
for each arm, and the sample we got from each.
We can see from the α_i's and β_i's that we still haven't pulled arm 1 (both parameters are still at 1), we pulled
arm 2 and got a reward of 0 (β_2 = 2), and we pulled arm 3 twice and got one 1 and one 0 (α_3 = β_3 = 2).
See the density functions below: arm 1 is equally likely to be any number in [0, 1], whereas arm 2 is more
likely to give a low number. Arm 3 is more certain of being in the center.
You can see that Thompson Sampling just uses this ingenious idea of sampling rather than just taking the
MAP, and it works great! We'll see some comparisons below between UCB and Thompson sampling.
Note that with a single-line change, instead of sampling in line 3, if we just took the MAP (which equals the
MLE because of our uniform prior), we would revert back to the greedy algorithm! The exploration
comes from the sampling, which works out great for us!
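Here's a minimal Python sketch of Thompson sampling for Bernoulli rewards (pull is the same hypothetical 0/1 reward function as in the earlier sketches):

    import random

    def thompson_sampling(K, T, pull):
        """Thompson sampling for Bernoulli rewards: Beta(1, 1) prior on each p_i,
        sample from each posterior, pull the arm with the largest sample."""
        alpha = [1] * K      # alpha_i - 1 = observed successes for arm i
        beta = [1] * K       # beta_i  - 1 = observed failures for arm i
        reward = 0
        for _ in range(T):
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
            i = max(range(K), key=lambda j: samples[j])
            r = pull(i)          # observe a 0/1 reward
            reward += r
            if r == 1:
                alpha[i] += 1    # conjugate Beta posterior update: one more success
            else:
                beta[i] += 1     # one more failure
        return reward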
It might be a bit hard to see, but notice Thompson sampling's regret got close to 0 a lot faster than UCB's:
UCB's leveled off around time 5000, while Thompson sampling's did around time 2000. The reason why
Thompson sampling might be "better" is unfortunately out of scope.
Below is my favorite visualization of all. On the x-axis we have time, and on the y-axis, we have the
proportion of time each arm was pulled (there were K = 5 arms). Notice how arm 2 (green) has the highest
true expected reward at 0.89, and how quickly Thompson sampling discovered it and started exploiting it.
Here are the benefits and drawbacks of using Traditional A/B Testing vs Multi-Armed Bandits. Each has
its own advantages, and you should carefully consider which approach to take before arbitrarily deciding!
When to use Traditional A/B Testing:
• Need to collect data for critical business decisions.
• Need statistical confidence in all your results and impact. Want to learn even about treatments that
didn’t perform well.
• The reward is not immediate (e.g., in drug testing, you can't wait for each patient's outcome before
treating the next patient).
• Optimize/measure multiple metrics, not just one.
When to use Multi-Armed Bandits:
• No need for interpreting results; just maximize reward (typically revenue/engagement).
• The opportunity cost is high (if advertising a car, losing a conversion is $20,000).
• Can add/remove arms in the middle of an experiment! This cannot be done with A/B tests.
The study of Multi-Armed Bandits can be categorized as:
• Statistics
• Optimization
• “Reinforcement Learning” (subfield of Machine Learning)
Standard Normal CDF Table: entries give Φ(z) = P(Z ≤ z); row gives z to one decimal, column gives the second decimal.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
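If you'd rather compute Φ(z) than look it up: Φ has no closed form, but it can be written via the error function, which Python's math module provides. A quick sketch:

    import math

    def phi(z):
        """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
        return (1 + math.erf(z / math.sqrt(2))) / 2

    print(phi(1.00))   # ~0.84134, matching the table row z = 1.0, column 0.00
    print(phi(-1.00))  # ~0.15866 = 1 - Phi(1.00), by symmetry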
Discrete Distributions

Uniform (discrete): X ∼ Unif(a, b) for a, b ∈ Z and a ≤ b. Equally likely to be any integer in [a, b].
Range Ω_X = {a, . . . , b}. E[X] = (a + b)/2. Var(X) = (b − a)(b − a + 2)/12. PMF: p_X(k) = 1/(b − a + 1).

Bernoulli: X ∼ Ber(p) for p ∈ [0, 1]. Takes value 1 with prob p and 0 with prob 1 − p.
Range Ω_X = {0, 1}. E[X] = p. Var(X) = p(1 − p). PMF: p_X(k) = p^k (1 − p)^(1−k).

Binomial: X ∼ Bin(n, p) for n ∈ N and p ∈ [0, 1]. Sum of n iid Ber(p) rvs; # of heads in n independent coin flips with P(head) = p.
Range Ω_X = {0, 1, . . . , n}. E[X] = np. Var(X) = np(1 − p). PMF: p_X(k) = C(n, k) p^k (1 − p)^(n−k).

Poisson: X ∼ Poi(λ) for λ > 0. # of events that occur in one unit of time, independently with rate λ per unit time.
Range Ω_X = {0, 1, . . .}. E[X] = λ. Var(X) = λ. PMF: p_X(k) = e^(−λ) λ^k / k!.

Geometric: X ∼ Geo(p) for p ∈ [0, 1]. # of independent Bernoulli trials with parameter p up to and including the first success.
Range Ω_X = {1, 2, . . .}. E[X] = 1/p. Var(X) = (1 − p)/p². PMF: p_X(k) = (1 − p)^(k−1) p.

Hypergeometric: X ∼ HypGeo(N, K, n) for n, K ≤ N and n, K, N ∈ N. # of successes in n draws (w/o replacement) from N items that contain K successes in total.
Range Ω_X = {max(0, n + K − N), . . . , min(n, K)}. E[X] = nK/N. Var(X) = n · K(N − K)(N − n)/(N²(N − 1)). PMF: p_X(k) = C(K, k) C(N − K, n − k) / C(N, n).

Negative Binomial: X ∼ NegBin(r, p) for r ∈ N and p ∈ [0, 1]. Sum of r iid Geo(p) rvs; # of independent flips until the rth head with P(head) = p.
Range Ω_X = {r, r + 1, . . .}. E[X] = r/p. Var(X) = r(1 − p)/p². PMF: p_X(k) = C(k − 1, r − 1) p^r (1 − p)^(k−r).

Multinomial: X ∼ Mult_r(n, p) for r, n ∈ N and p = (p_1, p_2, ..., p_r) with Σ_{i=1}^r p_i = 1. Generalization of the Binomial distribution: n trials with r categories, each with probability p_i.
Range: k_i ∈ {0, . . . , n} for i ∈ {1, . . . , r} with Σ k_i = n. E[X] = np = (np_1, . . . , np_r)^T. Var(X_i) = np_i(1 − p_i). Cov(X_i, X_j) = −np_i p_j for i ≠ j. PMF: p_X(k_1, . . . , k_r) = (n choose k_1, . . . , k_r) Π_{i=1}^r p_i^(k_i).

Multivariate Hypergeometric: X ∼ MVHG_r(N, K, n) for r, n ∈ N, K ∈ N^r and N = Σ_{i=1}^r K_i. Generalization of the Hypergeometric distribution: n draws (w/out replacement) from r categories, each with K_i successes.
Range: k_i ∈ {0, . . . , K_i} for i ∈ {1, . . . , r} with Σ k_i = n. E[X] = nK/N = (nK_1/N, . . . , nK_r/N)^T. Var(X_i) = n · (K_i/N) · ((N − K_i)/N) · ((N − n)/(N − 1)). Cov(X_i, X_j) = −n · (K_i/N) · (K_j/N) · ((N − n)/(N − 1)) for i ≠ j. PMF: p_X(k_1, . . . , k_r) = [Π_{i=1}^r C(K_i, k_i)] / C(N, n).
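If you want to sanity-check any row of this table numerically, scipy.stats implements all of these distributions. A quick sketch (the parameters are arbitrary examples):

    from scipy import stats

    X = stats.binom(n=10, p=0.3)          # Binomial(10, 0.3)
    print(X.pmf(4))                        # P(X = 4) = C(10,4) 0.3^4 0.7^6 ~ 0.2001
    print(X.mean(), X.var())               # np = 3.0, np(1-p) = 2.1

    Y = stats.poisson(mu=2)                # Poisson(lambda = 2)
    print(Y.pmf(0))                        # e^{-2} ~ 0.1353

    Z = stats.geom(p=0.5)                  # Geometric(0.5) on {1, 2, ...}
    print(Z.mean(), Z.var())               # 1/p = 2.0, (1-p)/p^2 = 2.0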
Continuous Distributions

Uniform (continuous): X ∼ Unif(a, b) for a < b. Equally likely to be any real number in [a, b].
Range Ω_X = [a, b]. E[X] = (a + b)/2. Var(X) = (b − a)²/12. PDF: f_X(x) = 1/(b − a) for x ∈ [a, b]. CDF: F_X(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x < b; 1 if x ≥ b.

Exponential: X ∼ Exp(λ) for λ > 0. Time until the first event in a Poisson process.
Range Ω_X = [0, ∞). E[X] = 1/λ. Var(X) = 1/λ². PDF: f_X(x) = λe^(−λx). CDF: F_X(x) = 0 if x < 0; 1 − e^(−λx) if x ≥ 0.

Normal: X ∼ N(µ, σ²) for µ ∈ R and σ² > 0. Standard bell curve.
Range Ω_X = (−∞, ∞). E[X] = µ. Var(X) = σ². PDF: f_X(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)). CDF: F_X(x) = Φ((x − µ)/σ).

Gamma: X ∼ Gam(r, λ) for r, λ > 0. Sum of r iid Exp(λ) rvs; time to the rth event in a Poisson process. Conjugate prior for the Exp and Poi parameter λ.
Range Ω_X = [0, ∞). E[X] = r/λ. Var(X) = r/λ². PDF: f_X(x) = (λ^r / Γ(r)) x^(r−1) e^(−λx). Note: Γ(r) = (r − 1)! for integers r.

Beta: X ∼ Beta(α, β) for α, β > 0. Conjugate prior for the Ber, Bin, Geo, NegBin parameter p.
Range Ω_X = (0, 1). E[X] = α/(α + β). Var(X) = αβ/((α + β)²(α + β + 1)). PDF: f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1).

Dirichlet: X ∼ Dir(α_1, α_2, . . . , α_r) for r ∈ N and α_i > 0. Generalization of the Beta distribution; conjugate prior for the Multinomial parameter p.
Range: x_i ∈ (0, 1) with Σ_{i=1}^r x_i = 1. E[X_i] = α_i / Σ_{j=1}^r α_j. PDF: f_X(x) = (1/B(α)) Π_{i=1}^r x_i^(α_i − 1).

Multivariate Normal: X ∼ N_n(µ, Σ) for µ ∈ R^n and Σ ∈ R^(n×n). Generalization of the Normal distribution.
Range: R^n. E[X] = µ. Var(X) = Σ. PDF: f_X(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)).
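The same numerical sanity checks work here with scipy.stats; one caution is that scipy parameterizes the Exponential and Gamma by scale = 1/λ rather than by the rate λ:

    from scipy import stats

    print(stats.norm.cdf(1.0))             # Phi(1.0) ~ 0.84134, as in the z-table
    print(stats.expon(scale=1/2).mean())   # Exp(lambda=2): mean = 1/lambda = 0.5
    print(stats.beta(3, 5).mean())         # Beta(3, 5): alpha/(alpha+beta) = 0.375
    print(stats.gamma(a=3, scale=1/2).var())  # Gam(r=3, lambda=2): r/lambda^2 = 0.75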
Probability & Statistics with Applications to Computing
Key Definitions and Theorems
1 Combinatorial Theory
1.1 So You Think You Can Count?
The Sum Rule: If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is no
overlap), then the total number of possible outcomes is: N + M .
The Product Rule: If an experiment has N_1 outcomes for the first stage, N_2 outcomes for the second stage, . . . , and N_m
outcomes for the mth stage, then the total number of outcomes of the experiment is N_1 × N_2 × · · · × N_m = Π_{i=1}^m N_i.
Complementary Counting: Let U be a (finite) universal set, and S a subset of interest. Then, |S| = |U| − |U \ S|.
k-Permutations: If we want to pick (order matters) only k out of n distinct objects, the number of ways to do so is:
P(n, k) = n · (n − 1) · (n − 2) · . . . · (n − k + 1) = n!/(n − k)!
k-Combinations/Binomial Coefficients: If we want to choose (order doesn't matter) only k out of n distinct objects,
the number of ways to do so is:
C(n, k) = (n choose k) = P(n, k)/k! = n!/(k!(n − k)!)
Multinomial Coefficients: If we have k distinct types of objects (n total), with n_1 of the first type, n_2 of the second, ...,
and n_k of the kth, then the number of arrangements possible is
(n choose n_1, n_2, ..., n_k) = n!/(n_1! n_2! . . . n_k!)
Stars and Bars/Divider Method: The number of ways to distribute n indistinguishable balls into k distinguishable bins
is
(n + (k − 1) choose k − 1) = (n + (k − 1) choose n)
Pigeonhole Principle: If there are n pigeons we want to put into k holes (where n > k), then at least one pigeonhole must
contain at least 2 (or to be precise, ⌈n/k⌉) pigeons.
Combinatorial Proofs: To prove two quantities are equal, you can come up with a combinatorial situation, and show that
both in fact count the same thing, and hence must be equal.
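If you ever want to check these counts numerically, Python's math module (3.8+) has perm and comb built in; the multinomial helper below is my own small utility, not a standard library function:

    import math

    print(math.perm(5, 2))    # P(5, 2) = 5!/(5-2)! = 20
    print(math.comb(5, 2))    # C(5, 2) = 5!/(2! 3!) = 10

    def multinomial(*ks):
        """Multinomial coefficient (n choose k1, ..., kr) where n = k1 + ... + kr."""
        n, out = sum(ks), 1
        for k in ks:
            out *= math.comb(n, k)
            n -= k
        return out

    print(multinomial(2, 1, 1))            # 4!/(2! 1! 1!) = 12
    print(math.comb(5 + (3 - 1), 3 - 1))   # stars and bars: 5 balls, 3 bins -> C(7, 2) = 21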
2 Discrete Probability
2.1 Discrete Probability
Key Probability Definitions: The sample space is the set Ω of all possible outcomes of an experiment. An event is
any subset E ⊆ Ω. Events E and F are mutually exclusive if E ∩ F = ∅.
Axioms of Probability:
1. (Axiom: Nonnegativity) P(E) ≥ 0 for any event E.
2. (Axiom: Normalization) P(Ω) = 1.
3. (Axiom: Countable Additivity) If E and F are mutually exclusive, then P(E ∪ F) = P(E) + P(F).
Equally Likely Outcomes: If Ω is a sample space such that each of the unique outcome elements in Ω is equally likely,
then for any event E ⊆ Ω: P(E) = |E|/|Ω|.
Conditional Probability: P(A | B) = P(A ∩ B) / P(B)
Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
Partition: Non-empty events E_1, . . . , E_n partition the sample space Ω if they are both:
• (Exhaustive) E_1 ∪ E_2 ∪ · · · ∪ E_n = ⋃_{i=1}^n E_i = Ω (they cover the entire sample space).
• (Pairwise Mutually Exclusive) For all i ≠ j, E_i ∩ E_j = ∅ (none of them overlap).
Law of Total Probability (LTP): If events E_1, . . . , E_n partition Ω, then for any event F:
P(F) = Σ_{i=1}^n P(F ∩ E_i) = Σ_{i=1}^n P(F | E_i) P(E_i)
Bayes' Theorem with LTP: Let events E_1, . . . , E_n partition the sample space Ω, and let F be another event. Then:
P(E_1 | F) = P(F | E_1) P(E_1) / Σ_{i=1}^n P(F | E_i) P(E_i)
2.3 Independence
Independence: A and B are independent if any of the following equivalent statements hold:
1. P(A | B) = P(A)
2. P(B | A) = P(B)
3. P(A, B) = P(A) P(B)
Mutual Independence: We say n events A_1, A_2, . . . , A_n are (mutually) independent if, for any subset I ⊆ [n] =
{1, 2, . . . , n}, we have
P(⋂_{i∈I} A_i) = Π_{i∈I} P(A_i)
This equation is actually representing 2^n equations since there are 2^n subsets of [n].
Conditional Independence: A and B are conditionally independent given an event C if any of the following
equivalent statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Random Variable (RV): A random variable (RV) X is a numeric function of the outcome, X : Ω → R. The set of possible
values X can take on is its range/support, denoted Ω_X.
If Ω_X is finite or countably infinite (typically integers or a subset), X is a discrete RV. Else if Ω_X is uncountably large (the
size of the real numbers), X is a continuous RV.
Probability Mass Function (PMF): For a discrete RV X, assigns probabilities to values in its range. That is, p_X : Ω_X →
[0, 1] where: p_X(k) = P(X = k).
Expectation: The expectation of a discrete RV X is: E[X] = Σ_{k∈Ω_X} k · p_X(k).
Linearity of Expectation (LoE): E[aX + bY + c] = aE[X] + bE[Y] + c
Law of the Unconscious Statistician (LOTUS): For a discrete RV X and function g, E[g(X)] = Σ_{b∈Ω_X} g(b) · p_X(b).
3.3 Variance
Linearity of Expectation with Indicators: If asked only about the expectation of a RV X which is some sort of “count”
(and not its PMF), then you may be able to write X as the sum of possibly dependent indicator RVs X1 , . . . , Xn , and apply
LoE, where for an indicator RV Xi , E [Xi ] = 1 · P (Xi = 1) + 0 · P (Xi = 0) = P (Xi = 1).
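As a quick sanity check of this trick (a sketch; the fixed-points-of-a-random-permutation example is my own choice of illustration), the simulation below matches the answer that indicators give, E[X] = n · (1/n) = 1:

    import random

    # X = # fixed points of a random permutation = X_1 + ... + X_n, where X_i
    # indicates position i is fixed. E[X_i] = P(X_i = 1) = 1/n, so E[X] = 1 by
    # LoE, even though the X_i are dependent.
    n, trials, total = 10, 100_000, 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i in range(n) if perm[i] == i)
    print(total / trials)   # ~1.0, regardless of n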
Variance: Var(X) = E[(X − E[X])²] = E[X²] − E[X]².
Standard Deviation (SD): σ_X = √Var(X).
Independence: Random variables X and Y are independent, denoted X ⊥ Y, if for all x ∈ Ω_X and all y ∈ Ω_Y:
P(X = x ∩ Y = y) = P(X = x) · P(Y = y).
Independent and Identically Distributed (iid): We say X1 , . . . , Xn are said to be independent and identically
distributed (iid) if all the Xi ’s are independent of each other, and have the same distribution (PMF for discrete RVs, or
CDF for continuous RVs).
Variance Adds for Independent RVs: If X ⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
Bernoulli Process: A Bernoulli process with parameter p is a sequence of independent coin flips X1 , X2 , X3 , ... where
P (head) = p. If flip i is heads, then we encode Xi = 1; otherwise, Xi = 0.
E[X] = p and Var(X) = p(1 − p). An example of a Bernoulli/indicator RV is one flip of a coin with P(head) = p. By a
clever trick, we can write
p_X(k) = p^k (1 − p)^(1−k), k = 0, 1
Poisson Approximation of the Binomial: As n → ∞ and p → 0, with np = λ fixed, Bin(n, p) → Poi(λ). If X_1, . . . , X_n are independent Binomial RVs, where X_i ∼ Bin(N_i, p), then
X = X_1 + . . . + X_n ∼ Bin(N_1 + . . . + N_n, p).
Uniform Random Variable (Discrete): X ∼ Uniform(a, b) (Unif(a, b) for short), for integers a ≤ b, iff X has PMF:
p_X(k) = 1/(b − a + 1), k ∈ Ω_X = {a, a + 1, . . . , b}
E[X] = (a + b)/2 and Var(X) = (b − a)(b − a + 2)/12. This represents each integer in [a, b] to be equally likely. For example, a single roll
of a fair die is Unif(1, 6).
Geometric Random Variable: X ∼ Geometric(p) (Geo(p) for short) iff X has PMF:
p_X(k) = (1 − p)^(k−1) p, k ∈ Ω_X = {1, 2, . . .}
E[X] = 1/p and Var(X) = (1 − p)/p². An example of a Geometric RV is the number of independent coin flips up to and including
the first head, where P(head) = p.
Negative Binomial Random Variable: X ∼ NegativeBinomial(r, p) (NegBin(r, p) for short) iff X has PMF:
p_X(k) = C(k − 1, r − 1) p^r (1 − p)^(k−r), k ∈ Ω_X = {r, r + 1, r + 2, . . .}
E[X] = r/p and Var(X) = r(1 − p)/p². X is the sum of r iid Geo(p) random variables. An example of a Negative Binomial RV is
the number of independent coin flips up to and including the rth head, where P(head) = p. If X_1, . . . , X_n are independent
Negative Binomial RVs, where X_i ∼ NegBin(r_i, p), then X = X_1 + . . . + X_n ∼ NegBin(r_1 + . . . + r_n, p).
Hypergeometric Random Variable: X ∼ HypGeo(N, K, n) iff X has PMF:
p_X(k) = C(K, k) C(N − K, n − k) / C(N, n), k ∈ Ω_X = {max(0, n + K − N), . . . , min(n, K)}
E[X] = nK/N and Var(X) = n · K(N − K)(N − n)/(N²(N − 1)). This represents the number of successes drawn, when n items are drawn from
a bag with N items (K of which are successes, and N − K failures) without replacement. If we did this with replacement,
then this scenario would be represented as Bin(n, K/N).
Probability Density Function (PDF): The probability density function (PDF) of a continuous RV X is the function
f_X : R → R, such that the following properties hold:
• f_X(z) ≥ 0 for all z ∈ R
• ∫_{−∞}^{∞} f_X(t) dt = 1
• P(a ≤ X ≤ b) = ∫_a^b f_X(w) dw
Cumulative Distribution Function (CDF): The cumulative distribution function (CDF) of ANY random variable
(discrete or continuous) is defined to be the function F_X : R → R with F_X(t) = P(X ≤ t). If X is a continuous RV, we have:
• F_X(t) = P(X ≤ t) = ∫_{−∞}^t f_X(w) dw for all t ∈ R
• d/du F_X(u) = f_X(u)
Uniform Random Variable (Continuous): X ∼ Uniform(a, b) (Unif(a, b) for short) iff X has PDF:
f_X(x) = 1/(b − a) if x ∈ Ω_X = [a, b], and 0 otherwise
E[X] = (a + b)/2 and Var(X) = (b − a)²/12. This represents each real number from [a, b] to be equally likely. Do NOT confuse this
with its discrete counterpart!
Exponential Random Variable: X ∼ Exponential(λ) (Exp(λ) for short) iff X has PDF:
f_X(x) = λe^(−λx), x ∈ Ω_X = [0, ∞)
E[X] = 1/λ and Var(X) = 1/λ². F_X(x) = 1 − e^(−λx) for x ≥ 0. The exponential RV is the continuous analog of the geometric
RV: it represents the waiting time to the next event, where λ > 0 is the average number of events per unit time. Note that
the exponential measures how much time passes until the next event (any real number, continuous), whereas the Poisson
measures how many events occur in a unit of time (nonnegative integer, discrete). The exponential RV is also memoryless:
for any s, t ≥ 0, P(X > s + t | X > s) = P(X > t).
Gamma Random Variable: X ∼ Gamma(r, λ) (Gam(r, λ) for short) iff X has PDF:
f_X(x) = (λ^r / Γ(r)) x^(r−1) e^(−λx), x ∈ Ω_X = [0, ∞)
E[X] = r/λ and Var(X) = r/λ². X is the sum of r iid Exp(λ) random variables. In the above PDF, for positive integers r,
Γ(r) = (r − 1)! (a normalizing constant). An example of a Gamma RV is the waiting time until the rth event in a Poisson
process. If X_1, . . . , X_n are independent Gamma RVs, where X_i ∼ Gam(r_i, λ), then X = X_1 + . . . + X_n ∼ Gam(r_1 + . . . + r_n, λ).
It also serves as a conjugate prior for λ in the Poisson and Exponential distributions.
Normal (Gaussian, "bell curve") Random Variable: X ∼ N(µ, σ²) iff X has PDF:
f_X(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), x ∈ Ω_X = R
E[X] = µ and Var(X) = σ². The "standard normal" random variable is typically denoted Z and has mean 0 and variance 1:
if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1). The CDF has no closed form, but we denote the CDF of the standard normal as
Φ(z) = F_Z(z) = P(Z ≤ z). Note from symmetry of the probability density function about z = 0 that: Φ(−z) = 1 − Φ(z).
Closure of the Normal Under Scale and Shift: If X ∼ N(µ, σ²), then aX + b ∼ N(aµ + b, a²σ²). In particular, we
can always scale/shift to get the standard Normal: (X − µ)/σ ∼ N(0, 1).
Closure of the Normal Under Addition: If X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent, then
aX + bY + c ∼ N(aµ_X + bµ_Y + c, a²σ_X² + b²σ_Y²)
Steps to compute PDF of Y = g(X) from X (via CDF): Suppose X is a continuous RV. Write down the range Ω_Y, compute the CDF F_Y(y) = P(g(X) ≤ y) by reducing it to a probability statement about X, then differentiate: f_Y(y) = d/dy F_Y(y).
Explicit Formula to compute PDF of Y = g(X) from X (Univariate Case): Suppose X is a continuous RV. If Y =
g(X) and g : Ω_X → Ω_Y is strictly monotone and invertible with inverse X = g^(−1)(Y) = h(Y), then
f_Y(y) = f_X(h(y)) · |h′(y)| if y ∈ Ω_Y, and 0 otherwise
Explicit Formula to compute PDF of Y = g(X) from X (Multivariate Case): Let X = (X_1, ..., X_n), Y =
(Y_1, ..., Y_n) be continuous random vectors (each component is a continuous rv) with the same dimension n (so Ω_X, Ω_Y ⊆ R^n),
and Y = g(X) where g : Ω_X → Ω_Y is invertible and differentiable, with differentiable inverse X = g^(−1)(y) = h(y). Then,
f_Y(y) = f_X(h(y)) |det(∂h(y)/∂y)|
where ∂h(y)/∂y ∈ R^(n×n) is the Jacobian matrix of partial derivatives of h, with
(∂h(y)/∂y)_ij = ∂(h(y))_i / ∂y_j
Cartesian Product of Sets: The Cartesian product of sets A and B is denoted: A × B = {(a, b) : a ∈ A, b ∈ B}.
Joint PMFs: Let X, Y be discrete random variables. The joint PMF of X and Y is:
pX,Y (a, b) = P (X = a, Y = b)
The joint range is the set of pairs (c, d) that have nonzero probability: Ω_{X,Y} = {(c, d) : p_{X,Y}(c, d) > 0} ⊆ Ω_X × Ω_Y
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = Σ_{x∈Ω_X} Σ_{y∈Ω_Y} g(x, y) p_{X,Y}(x, y)
Marginal PMFs: Let X, Y be discrete random variables. The marginal PMF of X is: p_X(a) = Σ_{b∈Ω_Y} p_{X,Y}(a, b).
Independence (DRVs): Discrete RVs X, Y are independent, written X ⊥ Y, if for all x ∈ Ω_X and y ∈ Ω_Y: p_{X,Y}(x, y) =
p_X(x) p_Y(y).
Variance Adds for Independent RVs: If X ⊥ Y, then: Var(X + Y) = Var(X) + Var(Y).
Joint PDFs: Let X, Y be continuous random variables. The joint PDF of X and Y is a function f_{X,Y}(a, b) ≥ 0.
The joint range is the set of pairs (c, d) that have nonzero density: Ω_{X,Y} = {(c, d) : f_{X,Y}(c, d) > 0} ⊆ Ω_X × Ω_Y
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(s, t) f_{X,Y}(s, t) ds dt
The joint PDF must satisfy the following (similar to univariate PDFs):
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
Marginal PDFs: Let X, Y be continuous random variables. The marginal PDF of X is: f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy.
Independence of Continuous Random Variables: Continuous RVs X, Y are independent, written X ⊥ Y, if for all
x ∈ Ω_X and y ∈ Ω_Y: f_{X,Y}(x, y) = f_X(x) f_Y(y).
Conditional PMFs and PDFs: If X, Y are discrete, the conditional PMF of X given Y is: p_{X|Y}(x | y) = P(X = x | Y = y) = p_{X,Y}(x, y) / p_Y(y).
Similarly for continuous RVs, but with f's instead of p's (PDFs instead of PMFs).
Conditional Expectation: If X is discrete (and Y is either discrete or continuous), then we define the conditional expectation of g(X) given (the event that) Y = y as:
E[g(X) | Y = y] = Σ_{x∈Ω_X} g(x) p_{X|Y}(x | y)
If X is continuous, replace the sum with an integral and the conditional PMF with the conditional PDF.
Notice that these sums and integrals are over x (not y), since E [g(X) | Y = y] is a function of y.
Law of Total Expectation (LTE): Basically, for E[g(X)], we take a weighted average of E[g(X) | Y = y] over all possible values of y: if Y is discrete, E[g(X)] = Σ_{y∈Ω_Y} E[g(X) | Y = y] p_Y(y), and if Y is continuous, replace the sum with an integral against f_Y(y).
4. Cov(X + c, Y) = Cov(X, Y). (Shifting doesn't and shouldn't affect the covariance.)
5. Cov(aX + bY, Z) = a · Cov(X, Z) + b · Cov(Y, Z). This can be easily remembered like the distributive property of scalars:
(aX + bY)Z = a(XZ) + b(Y Z).
6. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y), and hence if X ⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
7. Cov(Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j) = Σ_{i=1}^n Σ_{j=1}^m Cov(X_i, Y_j). That is, covariance works like FOIL (first, outer, inner, last) for
multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
(Pearson) Correlation: The (Pearson) correlation of X and Y is: ρ(X, Y) = Cov(X, Y) / (√Var(X) · √Var(Y)).
It is always true that −1 ≤ ρ(X, Y) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably,
ρ(X, Y) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
5.5 Convolution
Convolution: If X ⊥ Y are independent, the distribution of Z = X + Y can be found by conditioning on one of them (LTP):
if X, Y are discrete, p_Z(z) = Σ_{x∈Ω_X} p_X(x) p_Y(z − x); if X, Y are continuous, f_Z(z) = ∫_{−∞}^{∞} f_X(x) f_Y(z − x) dx.
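A quick numerical illustration (my own example, two fair dice): numpy's convolve computes exactly this sum for finitely-supported PMFs:

    import numpy as np

    die = np.ones(6) / 6            # PMF of one fair die, supported on {1, ..., 6}
    total = np.convolve(die, die)   # PMF of the sum of two dice, supported on {2, ..., 12}
    print(total[10 - 2])            # P(sum = 10) = 3/36 ~ 0.0833
    print(total.sum())              # 1.0: still a valid PMF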
Moment Generating Functions (MGFs): The moment generating function (MGF) of X is a function of a dummy
variable t (use LOTUS to compute this): M_X(t) = E[e^(tX)].
Properties and Uniqueness of Moment Generating Functions: For a function f : R → R, we will denote f^(n)(x) to
be the nth derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the
following properties:
1. M′_X(0) = E[X], M″_X(0) = E[X²], and in general M_X^(n)(0) = E[X^n]. This is why we call M_X a moment generating
function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^(tb) M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t) M_Y(t).
The Sample Mean + Properties: Let X_1, X_2, . . . , X_n be a sequence of iid RVs with mean µ and variance σ². The
sample mean is: X̄_n = (1/n) Σ_{i=1}^n X_i. Further, E[X̄_n] = µ and Var(X̄_n) = σ²/n.
The Law of Large Numbers (LLN): Let X_1, . . . , X_n be iid RVs with the same mean µ. As n → ∞, the sample mean
X̄_n converges to the true mean µ.
The Central Limit Theorem (CLT): Let X_1, . . . , X_n be a sequence of iid RVs with mean µ and (finite) variance σ².
Then as n → ∞,
X̄_n → N(µ, σ²/n)
The mean and variance are not a surprise; the importance of the CLT is that, regardless of the distribution of the X_i's, the sample
mean approaches a Normal distribution as n → ∞.
The Continuity Correction: When approximating an integer-valued (discrete) random variable X with a continuous one
Y (such as in the CLT), if asked to find P(a ≤ X ≤ b) for integers a ≤ b, you should use P(a − 0.5 ≤ Y ≤ b + 0.5) so that
the width of the interval being integrated is the same as the number of terms summed over (b − a + 1).
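A quick numerical check with scipy (the Bin(100, 0.5) numbers are an arbitrary example): the continuity-corrected Normal approximation is much closer to the exact Binomial probability:

    from scipy import stats

    n, p = 100, 0.5
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5     # CLT: X is roughly N(50, 25)

    exact = stats.binom.cdf(55, n, p) - stats.binom.cdf(44, n, p)       # P(45 <= X <= 55)
    naive = stats.norm.cdf(55, mu, sigma) - stats.norm.cdf(45, mu, sigma)
    corrected = stats.norm.cdf(55.5, mu, sigma) - stats.norm.cdf(44.5, mu, sigma)
    print(exact, naive, corrected)   # ~0.729 vs ~0.683 vs ~0.729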
Random Vectors (RVTRs): Let X1 , ..., Xn be random variables. We say X = (X1 , . . . , Xn )T is a random vector.
Expectation is defined pointwise: E [X] = (E [X1 ] , . . . , E [Xn ])T .
Covariance Matrices: The covariance matrix of a random vector X ∈ R^n with E[X] = µ is the matrix Σ = Var(X) =
Cov(X) whose entries are Σ_ij = Cov(X_i, X_j). The formula for this is:
Σ = Var(X) = Cov(X) = E[(X − µ)(X − µ)^T] = E[XX^T] − µµ^T
so Σ has Var(X_1), . . . , Var(X_n) down the diagonal and Cov(X_i, X_j) in off-diagonal entry (i, j).
Notice that the covariance matrix is symmetric (Σ_ij = Σ_ji), and has variances on the diagonal.
The Multinomial Distribution: Suppose there are r outcomes, with probabilities p = (p_1, p_2, ..., p_r) respectively, such
that Σ_{i=1}^r p_i = 1. Suppose we have n independent trials, and let Y = (Y_1, Y_2, ..., Y_r) be the rvtr of counts of each outcome.
Then, we say Y ∼ Mult_r(n, p).
The joint PMF of Y is:
p_{Y_1,...,Y_r}(k_1, ..., k_r) = (n choose k_1, ..., k_r) Π_{i=1}^r p_i^(k_i),   k_1, ..., k_r ≥ 0 and Σ_{i=1}^r k_i = n
Notice that each Y_i is marginally Bin(n, p_i). Hence, E[Y_i] = np_i and Var(Y_i) = np_i(1 − p_i).
Then, we can specify the entire mean vector E[Y] and covariance matrix:
E[Y] = np = (np_1, . . . , np_r)^T   Var(Y_i) = np_i(1 − p_i)   Cov(Y_i, Y_j) = −np_i p_j for i ≠ j
The Multivariate Hypergeometric (MVHG) Distribution: Suppose there are r different colors of balls in a bag,
having K = (K_1, ..., K_r) balls of each color i, 1 ≤ i ≤ r. Let N = Σ_{i=1}^r K_i be the total number of balls in the bag, and suppose
we draw n without replacement. Let Y = (Y_1, ..., Y_r) be the rvtr such that Y_i is the number of balls of color i we drew. We
write that Y ∼ MVHG_r(N, K, n). The joint PMF of Y is:
p_{Y_1,...,Y_r}(k_1, ..., k_r) = [Π_{i=1}^r C(K_i, k_i)] / C(N, n),   0 ≤ k_i ≤ K_i for all 1 ≤ i ≤ r and Σ_{i=1}^r k_i = n
The mean vector E[Y] and covariance matrix are:
E[Y] = nK/N = (nK_1/N, . . . , nK_r/N)^T
Var(Y_i) = n · (K_i/N) · ((N − K_i)/N) · ((N − n)/(N − 1))
Cov(Y_i, Y_j) = −n · (K_i/N) · (K_j/N) · ((N − n)/(N − 1)) for i ≠ j
Properties of Expectation and Variance Hold for RVTRs: Let X be an n-dimensional RVTR, A ∈ R^(n×n) be a constant matrix, and b ∈ R^n be a constant vector. Then: E[AX + b] = AE[X] + b and Var(AX + b) = A Var(X) A^T.
The Multivariate Normal Distribution: A random vector X = (X_1, ..., X_n) has a multivariate Normal distribution
with mean vector µ ∈ R^n and (symmetric and positive-definite) covariance matrix Σ ∈ R^(n×n), written X ∼ N_n(µ, Σ), if it
has the following joint PDF:
f_X(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)),   x ∈ R^n
Additionally, let us recall that for any RVs X and Y: X ⊥ Y → Cov(X, Y) = 0. If X = (X_1, . . . , X_n) is Multivariate Normal,
the converse also holds: Cov(X_i, X_j) = 0 → X_i ⊥ X_j.
Order Statistics: Suppose Y_1, ..., Y_n are iid continuous random variables with common PDF f_Y and common CDF F_Y.
We sort the Y_i's such that Y_min ≡ Y_(1) < Y_(2) < ... < Y_(n) ≡ Y_max.
Notice that we can't have equality because with continuous random variables, the probability that any two are equal is 0.
Notice that each Y_(i) is a random variable as well! We call Y_(i) the ith order statistic, i.e. the ith smallest in a sample of
size n. The density function of each Y_(i) is
f_{Y_(i)}(y) = (n choose i − 1, 1, n − i) · [F_Y(y)]^(i−1) · [1 − F_Y(y)]^(n−i) · f_Y(y),   y ∈ Ω_Y
6 Concentration Inequalities
6.1 Markov and Chebyshev Inequalities
Markov's Inequality: Let X ≥ 0 be a non-negative RV, and let k > 0. Then: P(X ≥ k) ≤ E[X]/k.
Chebyshev's Inequality: Let X be any RV with expected value µ = E[X] and finite variance Var(X). Then, for any real
number α > 0: P(|X − µ| ≥ α) ≤ Var(X)/α².
Chernoff Bound for Binomial: Let X ∼ Bin(n, p) and let µ = E[X]. For any 0 < δ < 1:
P(X ≥ (1 + δ)µ) ≤ exp(−δ²µ/3)   and   P(X ≤ (1 − δ)µ) ≤ exp(−δ²µ/2)
Convex Functions: Let S ⊆ R^n be a convex set. A function g : S → R is a convex function if for any x_1, ..., x_m ∈ S
and p_1, ..., p_m ≥ 0 such that Σ_{i=1}^m p_i = 1,
g(Σ_{i=1}^m p_i x_i) ≤ Σ_{i=1}^m p_i g(x_i)
Jensen's Inequality: Let X be any RV, and g : R → R be convex. Then, g(E[X]) ≤ E[g(X)].
Hoeffding's Inequality: Let X_1, ..., X_n be independent random variables, where each X_i is bounded: a_i ≤ X_i ≤ b_i, and
let X̄_n be their sample mean. Then,
P(|X̄_n − E[X̄_n]| ≥ t) ≤ 2 exp(−2n²t² / Σ_{i=1}^n (b_i − a_i)²)
In the case X_1, ..., X_n are iid (so a ≤ X_i ≤ b for all i) with mean µ, then
P(|X̄_n − µ| ≥ t) ≤ 2 exp(−2n²t² / (n(b − a)²)) = 2 exp(−2nt² / (b − a)²)
7 Statistical Estimation
7.1 Maximum Likelihood Estimation
Realization / Sample: A realization/sample x of a random variable X is the value that is actually observed (it will always
be in Ω_X).
Likelihood: Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is
continuous), where θ is a parameter (or vector of parameters). We define the likelihood of x given θ to be the "probability"
of observing x if the true parameter is θ. The log-likelihood is just the log of the likelihood, which is typically easier to
optimize.
If X is discrete,
L(x | θ) = Π_{i=1}^n p_X(x_i | θ)   ln L(x | θ) = Σ_{i=1}^n ln p_X(x_i | θ)
If X is continuous,
L(x | θ) = Π_{i=1}^n f_X(x_i | θ)   ln L(x | θ) = Σ_{i=1}^n ln f_X(x_i | θ)
Maximum Likelihood Estimator (MLE): Let x = (x_1, ..., x_n) be iid realizations from probability mass function p_X(t | θ)
(if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define
the maximum likelihood estimator (MLE) θ̂_MLE of θ to be the parameter which maximizes the likelihood/log-likelihood:
θ̂_MLE = argmax_θ L(x | θ) = argmax_θ ln L(x | θ)
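A minimal numerical sketch (made-up data, assumed iid Poisson): minimizing the negative log-likelihood with scipy lands on the sample mean, which is the known closed-form Poisson MLE:

    import numpy as np
    from scipy import optimize, stats

    x = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # made-up samples, assumed iid Poi(lambda)

    def neg_log_likelihood(lam):
        # ln L(x | lambda) = sum_i ln p_X(x_i | lambda); we minimize its negation.
        return -np.sum(stats.poisson.logpmf(x, lam))

    res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
    print(res.x, x.mean())   # both ~3.875: the optimizer recovers the sample mean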
Sample Moments: Let X be a random variable, and c ∈ R a scalar. Let x_1, . . . , x_n be iid realizations (samples) from X.
The kth sample moment of X is: (1/n) Σ_{i=1}^n x_i^k.
The kth sample moment of X (about c) is: (1/n) Σ_{i=1}^n (x_i − c)^k.
Method of Moments Estimation: Let x = (x_1, . . . , x_n) be iid realizations (samples) from PMF p_X(t; θ) (if X is discrete),
or from density f_X(t; θ) (if X is continuous), where θ is a parameter (or vector of parameters).
We then define the Method of Moments (MoM) estimator θ̂_MoM of θ = (θ_1, . . . , θ_k) to be a solution (if it exists)
to the k simultaneous equations where, for j = 1, . . . , k, we set the jth true and sample moments equal:
E[X] = (1/n) Σ_{i=1}^n x_i   · · ·   E[X^k] = (1/n) Σ_{i=1}^n x_i^k
Beta Random Variable: X ∼ Beta(α, β), if and only if X has the following PDF:
f_X(x) = (1/B(α, β)) x^(α−1) (1 − x)^(β−1) if x ∈ Ω_X = [0, 1], and 0 otherwise
X is typically the belief distribution about some unknown probability of success, where we pretend we've seen α − 1 successes
and β − 1 failures. Hence the mode (most likely value of the probability/point with highest density), argmax_{x∈[0,1]} f_X(x), is
mode[X] = (α − 1) / ((α − 1) + (β − 1))
Also note that there is an annoying "off-by-1" issue (α − 1 heads and β − 1 tails), so when choosing these parameters, be
careful! It also serves as a conjugate prior for p in the Bernoulli and Geometric distributions.
Dirichlet RV: X ∼ Dir(α_1, α_2, . . . , α_r), if and only if X has the following density function:
f_X(x) = (1/B(α)) Π_{i=1}^r x_i^(α_i − 1) if x_i ∈ (0, 1) and Σ_{i=1}^r x_i = 1, and 0 otherwise
This is a generalization of the Beta random variable from 2 outcomes to r. The random vector X is typically the belief
distribution about some unknown probabilities of the different outcomes, where we pretend we saw α_1 − 1 outcomes of type
1, α_2 − 1 outcomes of type 2, . . . , and α_r − 1 outcomes of type r. Hence, the mode of the distribution, argmax f_X(x) over
x ∈ (0, 1)^r with Σ x_i = 1, is the vector
mode[X] = ((α_1 − 1)/Σ_{i=1}^r (α_i − 1), (α_2 − 1)/Σ_{i=1}^r (α_i − 1), . . . , (α_r − 1)/Σ_{i=1}^r (α_i − 1))
Maximum A Posteriori (MAP) Estimation: Let x = (x_1, . . . , x_n) be iid realizations from PMF p_X(t ; Θ = θ) (if X is
discrete), or from density f_X(t ; Θ = θ) (if X is continuous), where Θ is the random variable representing the parameter (or
vector of parameters). We define the Maximum A Posteriori (MAP) estimator θ̂_MAP of Θ to be the parameter which
maximizes the posterior distribution of Θ given the data (the mode).
Mean Squared Error (MSE): The mean squared error (MSE) of an estimator θ̂ of θ is MSE(θ̂, θ) = E[(θ̂ − θ)²].
If θ̂ is an unbiased estimator of θ (i.e. E[θ̂] = θ), then you can see that MSE(θ̂, θ) = Var(θ̂). In fact, in general,
MSE(θ̂, θ) = Var(θ̂) + Bias(θ̂, θ)²
Consistency: An estimator θ̂_n (depending on n iid samples) of θ is said to be consistent if it converges (in probability) to
θ. That is, for any ε > 0, lim_{n→∞} P(|θ̂_n − θ| > ε) = 0.
Fisher Information: Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or from density function
f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). The Fisher Information of a parameter θ
is defined to be
I(θ) = E[(∂ ln L(x | θ)/∂θ)²] = −E[∂² ln L(x | θ)/∂θ²]
Cramer-Rao Lower Bound (CRLB): Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or
from density function f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). If θ̂ is an unbiased
estimator for θ, then
MSE(θ̂, θ) = Var(θ̂) ≥ 1/I(θ)
That is, for any unbiased estimator θ̂ for θ, the variance (= MSE) is at least 1/I(θ). If we achieve this lower bound, meaning
our variance is exactly equal to 1/I(θ), then we have the best variance possible for our estimate. Hence, it is the minimum
variance unbiased estimator (MVUE) for θ.
Efficiency: Let θ̂ be an unbiased estimator of θ. The efficiency of θ̂ is e(θ̂, θ) = I(θ)^(−1) / Var(θ̂) ≤ 1.
An estimator is said to be efficient if it achieves the CRLB, meaning e(θ̂, θ) = 1.
Sufficient Statistics: A statistic T = T(X_1, . . . , X_n) is sufficient for θ if the conditional distribution of the data given T
does not depend on θ:
P(X_1 = x_1, . . . , X_n = x_n | T = t, θ) = P(X_1 = x_1, . . . , X_n = x_n | T = t)
Neyman-Fisher Factorization Criterion (NFFC): Let x_1, . . . , x_n be iid random samples with likelihood
L(x_1, . . . , x_n | θ). A statistic T = T(x_1, . . . , x_n) is sufficient if and only if there exist non-negative functions g and h such that:
L(x_1, . . . , x_n | θ) = g(T(x_1, . . . , x_n), θ) · h(x_1, . . . , x_n)
8 Statistical Inference
8.1 Confidence Intervals
Confidence Interval: Suppose you have iid samples x_1, ..., x_n from some distribution with unknown parameter θ, and you
have some estimator θ̂ for θ.
A 100(1 − α)% confidence interval for θ is an interval (typically but not always) centered at θ̂, [θ̂ − Δ, θ̂ + Δ], such that
the probability (over the randomness in the samples x_1, ..., x_n) that θ lies in the interval is 1 − α:
P(θ ∈ [θ̂ − Δ, θ̂ + Δ]) = 1 − α
If θ̂ = (1/n) Σ_{i=1}^n x_i is the sample mean, then θ̂ is approximately normal by the CLT, and a 100(1 − α)% confidence interval is
given by the formula:
[θ̂ − z_{1−α/2} σ/√n, θ̂ + z_{1−α/2} σ/√n]
where z_{1−α/2} = Φ^(−1)(1 − α/2) and σ is the true standard deviation of a single sample (which may need to be estimated).
Credible Intervals: Suppose you have iid samples x = (x_1, ..., x_n) from some distribution with unknown parameter Θ.
You are in the Bayesian setting, so you have chosen a prior distribution for the RV Θ.
A 100(1 − α)% credible interval for Θ is an interval [a, b] such that the probability (over the randomness in Θ) that Θ lies
in the interval is 1 − α:
P(Θ ∈ [a, b]) = 1 − α
If we've chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli), the posterior is easy
to compute. Say the CDF of the posterior is F_Y. Then, a 100(1 − α)% credible interval is given by
[F_Y^(−1)(α/2), F_Y^(−1)(1 − α/2)]
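A minimal sketch with scipy, assuming a Beta(1, 1) prior on a Bernoulli parameter p and made-up counts of 60 successes and 40 failures:

    from scipy import stats

    # Posterior is Beta(1 + 60, 1 + 40) by conjugacy.
    posterior = stats.beta(61, 41)
    alpha = 0.05
    # ppf is the inverse CDF F_Y^{-1}, so these are the credible interval endpoints.
    print(posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2))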
8.2 Hypothesis Testing
1. Make a claim (like "Airplane food is good", "Pineapples belong on pizza", etc.)