Probability & Statistics
with Applications to Computing
Alex Tsun
Acknowledgements
This textbook would not have been possible without the following people:
• Mitchell Estberg (Head TA CSE 312 at UW): For helping organize the effort to put this together,
formatting, and revising a vast majority of this content. The countless hours you put in for the course
and for me are much appreciated. Neither this class nor this book would have been possible without
your contributions!
• Pemi Nguyen (TA CSE 312 at UW): For helping to add examples, motivation, and contributing
to a lot of this content. For constantly going above and beyond to ensure a great experience for the
students, the staff, and me. Your bar for quality and dedication to these notes, the course, and the
students is unrivaled.
• Cooper Chia, William Howard-Snyder, Shreya Jayaraman, Aleks Jovcic, Muxi (Scott) Ni, Luxi Wang (TAs CSE 312 at UW): For each typesetting several sections and adding your own thoughts and intuition. Thank you for being the best teaching staff I could have asked for!
• Joshua Fan (UW/Cornell): You are an amazing co-TA who is extremely dedicated to your students.
Thank you for your help in developing this content, and for recording several videos!
• Matthew Taing (UW): You are an extremely caring TA, dedicated to making learning an enjoyable
experience. Thank you for your help and suggestions throughout the development of this content.
• Martin Tompa (Professor at UW Allen School): Thank you for taking a chance on me to give me my
first TA experience, and for supporting me through my career to graduate school and beyond. Thank
you especially for helping me attain my first teaching position, and for your advice and mentorship.
• Anna Karlin (Professor at UW Allen School): Thank you for making my CSE 312 TA experiences at
UW amazing, and for giving me much freedom and flexibility to create content and lead during those
times. Thank you also for your significant help and guidance during my first teaching position.
• Lisa Yan, David Varodayan, Chris Piech (Instructors at Stanford): I learned a lot from TAing for each of you, especially getting to compare and contrast this course at two different universities. I'd like to think I took the "best of both worlds" of Stanford and the University of Washington. Thank you for your help, guidance, and inspiration!
• My Family: Thank you for your unwavering help and support throughout my journey through college
and beyond. I would not be where I am or the person I am without you.
Notes
Information
This book was written in Summer 2020 during an offering of "CSE 312: Foundations of Computing II", which is essentially probability and statistics for computer scientists. The curriculum was based on this course as well as Stanford University's "CS 109: Probability for Computer Scientists". I strongly believe coding applications (which are included in Chapter 9) are essential to teach: they show why this class is a core CS requirement, and they also help keep students engaged and excited. This textbook is currently being used at the University of Washington (Autumn 2020).
Resources
• Course Videos (YouTube Playlist): Mostly under 5 minutes long, these serve as a quick review of each section.
• Course Slides (Google Drive): Contains Google Slides presentations for each section, used in the videos.
• Course Website (UW CSE 312): Taught at the University of Washington during Summer 2020 and
Autumn 2020 quarters by Alex Tsun and Professor Anna Karlin.
https://courses.cs.washington.edu/courses/cse312/20su/
• This Textbook: Available online free here.
• Key Theorems and Definitions: At the end of this book.
• Distributions (2 pages): At the end of this book.
Assumed Prerequisites
We assume the student has experience in the following topics:
• Multivariable calculus (at least up to partial derivatives and double integrals). We won’t really use
much calculus beyond taking derivatives and integrals, so a surface-level knowledge is fine.
• Discrete mathematics (introduction to logic and proofs). We’ll especially use set theory, but this
will be covered in Chapter 0: Prerequisites of this book.
• Programming experience (at least one or two introductory classes, in any language). We will teach
Python, but assume knowledge of fundamental ideas such as: variables, conditionals, loops, and arrays.
This will be crucial in studying and coding up the CS applications of Chapter 9.
About the Author
Alex Tsun grew up in the Bay Area, with a family of software engineers (parents and older brother). He completed Bachelor's degrees in computer science, statistics, and mathematics at the University of Washington in 2018, before attending Stanford University for his Master's degree in AI and Theoretical CS. During his six years as a student, he served as a TA for this course a total of 13 times. After graduating in June 2020, he returned to UW to be the instructor for CSE 312 during Summer 2020.
Contents
0. Prerequisites
0.1 Intro to Set Theory
0.2 Set Operations
0.3 Sum and Product Notation
1. Combinatorial Theory
1.1 So You Think You Can Count?
1.2 More Counting
1.3 No More Counting Please
2. Discrete Probability
2.1 Intro to Discrete Probability
2.2 Conditional Probability
2.3 Independence
Chapter 0. Prerequisites
This chapter focuses on set theory, which makes up the building blocks of probability. To even define a
probability space, we need this notion of a set. While it is assumed that a discrete mathematics course was
taken, we will focus on reviewing this particular topic. We also cover summation and product notation,
which we will use frequently for compactness and conciseness of notation.
Chapter 0. Prerequisites
0.1: Intro to Set Theory
Slides (Google Drive) Video (YouTube)
There is only one set of cardinality 0 (containing no elements), the empty set, denoted by ∅ = {}.
Example(s)
What is the cardinality of each of the following sets?
1. {apple, orange, watermelon}
2. {1, 1, 1, 1, 1}
3. [0, 1], the set of all real numbers between 0 and 1 (inclusive)
4. {1, 2, 3, · · · }, the set of all positive integers
5. {∅, {1}, {2}, {1, 2}}
6. {∅, {1}, {1, 1}, {1, 1, 1}, · · · }
Solution To calculate the cardinality of a set, we have to determine the number of elements in the set.
1. For the set {apple, orange, watermelon}, we have three distinct elements, so the cardinality is 3. That
is | {apple, orange, watermelon} |= 3
2. For {1, 1, 1, 1, 1}, there are five 1s, but recall that sets don't contain duplicates, so this set actually only contains 1, and is equal to the set {1}. This means that its cardinality is 1; that is, | {1, 1, 1, 1, 1} |= 1.
3. For the set [0, 1], all the values between 0 and 1 (inclusive) we have an infinite number of elements.
This means that the cardinality of this set is infinity, that is | [0, 1] |= ∞
4. For the set {1, 2, 3, · · · }, the set of all positive integers, we have an infinite number of elements. This
means that the cardinality of this set is infinity, that is | {1, 2, 3, · · · } |= ∞.
5. For the set {∅, {1}, {2}, {1, 2}} (a set of sets), there are four distinct elements that are each a different
set. This means that the cardinality is 4, that is | {∅, {1}, {2}, {1, 2}} |= 4.
6. Finally, for the set {∅, {1}, {1, 1}, {1, 1, 1}, · · · }, we do have an infinite number of sets, each of which is an element. But are they distinct? Upon further consideration, all the sets containing various numbers of 1s are equivalent, as duplicates don't matter. So the only distinct elements are the empty set and the set {1}. So the cardinality is 2; that is, | {∅, {1}, {1, 1}, {1, 1, 1}, · · · } |=| {∅, {1}} |= 2.
Example(s)
Let us define A = {1, 3}, B = {3, 1}, C = {1, 2} and D = {∅, {1}, {2}, {1, 2}, 1, 2}.
Determine whether the following are true or false:
• 1 ∈ A
• 1 ⊆ A
• {1} ⊆ A
• {1} ∈ A
• 3 ∉ C
• A ∈ B
• A ⊆ B
• C ∈ D
• C ⊆ D
• ∅ ∈ D
• ∅ ⊆ D
• A = B
• ∅ ⊆ ∅
• ∅ ∈ ∅
Solution
• 1 ∈ A. True, because 1 is an element in A.
• 1 ⊆ A. False, because 1 is a value, not a set, so it cannot be a subset of a set.
• {1} ⊆ A. True, because every element of the set {1} is an element of A.
• {1} ∈ A. False, because {1} is a set, and A contains no sets as elements.
• 3 ∉ C. True, because the value 3 is not one of the elements of C.
• A ∈ B. False, because A is a set, and there are no elements of B which are sets, so A ∉ B.
• A ⊆ B. True, because every element of A is an element of B.
• C ∈ D. True, because C is an element of D.
• C ⊆ D. True, because each of the elements of C are also elements of D.
• ∅ ∈ D. True, because the empty set is an element of D.
• ∅ ⊆ D. True, by definition, the empty set is a subset of any set. This is because if this were not the
case, there would have to be an element of ∅ which was not in D. But there are no elements in ∅, so
the statement is true (vacuously).
• A = B. True: A ⊆ B, as every element of A is an element of B, and B ⊆ A, as every element of B is an element of A. Since this relationship holds in both directions, we have A = B.
• ∅ ⊆ ∅. True, because the empty set is a subset of every set (vacuously).
• ∅ ∈ ∅. False, because the empty set contains no elements, so the empty set cannot be an element of it.
Chapter 0. Prerequisites
0.2: Set Operations
Slides (Google Drive) Video (YouTube)
When working with sets, there is always an implicit universal set U of all possible elements under consideration, which lets us talk about the elements missing from a set of interest.
Example(s)
1. If we were talking about the set of fruits a supermarket might sell S, we might have S = {apple, watermelon, pear, strawberry} and U = {all fruits}. We might want to know which fruits the supermarket doesn't sell, which would be denoted S^C (defined later). This requires a universal set of all fruits that we can check against to see which are missing from S.
2. If we were talking about the set of kinds of cars Bill Gates owns, that might be the set T. There must be a universal set U of possible kinds of cars that exist, if we wanted to list out which ones he was missing, T^C.
The union of A and B is denoted A∪B. It contains elements in A or B, or both (without duplicates).
So x ∈ A ∪ B if and only if x ∈ A or x ∈ B.
The image below shows in red the union of A and B: A ∪ B. The outer rectangle is the universal set U.
The intersection of A and B is denoted A ∩ B. It contains elements in both A and B. So x ∈ A ∩ B if and only if x ∈ A and x ∈ B.
The image below shows in red the intersection of A and B: A ∩ B. The outer rectangle is the universal set U.
The set difference of A with B is denoted A \ B. It contains elements of A which are not in B. So x ∈ A \ B if and only if x ∈ A and x ∉ B.
The image below shows in red the set difference of A with B: A \ B. The outer rectangle is the universal set U.
Example(s)
Let A = {1, 3}, B = {2, 3, 4}, and U = {1, 2, 3, 4, 5}. Solve for: A ∩ B, A ∪ B, B \ A, A \ B, (A ∪ B)^C, A^C, B^C, and A^C ∩ B^C.
Solution
• A ∩ B = {3}, since 3 is the only element in both A and B.
• A ∪ B = {1, 2, 3, 4}, as these are all the elements in either A or B. Note we dropped the duplicate 3,
since sets cannot contain duplicates.
• B \ A = {2, 4}, as these are the elements of B which are not in A.
• A \ B = {1}, as this is the only element of A which is not an element of B.
• (A ∪ B)^C = {5}, as by definition (A ∪ B)^C = U \ (A ∪ B) and 5 is the only element of U which is not an element of A ∪ B.
• A^C = {2, 4, 5}, as by definition A^C = U \ A, and these are the elements of U which are not elements of A.
• B^C = {1, 5}, as by definition B^C = U \ B, and these are the elements of U which are not elements of B.
• A^C ∩ B^C = {5}, because the only element in both A^C and B^C is 5 (see the above).
Chapter 0. Prerequisites
0.3: Sum and Product Notation
Slides (Google Drive) Video (YouTube)
Summation notation lets us write long sums compactly. For example,
1 + 2 + 3 + · · · + 10 = \sum_{i=1}^{10} i
Note that i is just a dummy variable. We could have also used j, k, or any other letter. What if we wanted
to sum numbers that weren’t consecutive integers?
As long as there is some pattern, we can write it compactly! For example, how could we write 16 + 25 +
36 + · · · + 81? In the first equation below (0.3.1), j takes on the values from 4 to 9, and the square of each of
these values will be summed together. Note that this is equivalent to k taking on the values of 1 to 6, and
adding 3 to each of the values before squaring and summing them up (0.3.2).
16 + 25 + 36 + · · · + 81 = \sum_{j=4}^{9} j^2    (0.3.1)

= \sum_{k=1}^{6} (k + 3)^2    (0.3.2)
If you know what a for-loop is (from computer science), this is exactly the following (in Java or C++).
This first loop represents the first sum with dummy variable j.
int sum = 0;
for (int j = 4; j <= 9; j++) {
    sum += (j * j);
}
This second loop represents the second sum with dummy variable k, and is equivalent to the first.
int sum = 0;
for (int k = 1; k <= 6; k++) {
    sum += ((k + 3) * (k + 3));
}
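Since this book's coding applications (Chapter 9) use Python, here is a sketch of the same two sums in Python as well; note that range excludes its upper endpoint, so we go one past each bound.

# Equivalent to the first loop: j runs from 4 to 9 inclusive.
total_j = sum(j * j for j in range(4, 10))
# Equivalent to the second loop: k runs from 1 to 6 inclusive.
total_k = sum((k + 3) ** 2 for k in range(1, 7))
assert total_j == total_k == 16 + 25 + 36 + 49 + 64 + 81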
Furthermore, if S is a set, and f : S → R is a function defined on S, then the following notation sums over all elements x ∈ S of f(x):
\sum_{x \in S} f(x)
Note that the sum over no terms (the empty set) is defined as 0.
Example(s)
Write out the following sums:
• \sum_{k=3}^{7} k^{10}
• \sum_{y \in S} (2^y + 5), for S = {3, 6, 8, 11}
• \sum_{t=6}^{8} 4
• \sum_{z=2}^{1} \sin(z)
• \sum_{x \in T} 13x, for T = {−1, −3, 5}
Solution
• For \sum_{k=3}^{7} k^{10}, we raise each value of k from 3 to 7 to the power of 10 and sum them together. That is:
\sum_{k=3}^{7} k^{10} = 3^{10} + 4^{10} + 5^{10} + 6^{10} + 7^{10}
• Then, if we let S = {3, 6, 8, 11}, for \sum_{y \in S} (2^y + 5), we raise 2 to the power of each value y in S, add 5, and then sum the results together. That is:
\sum_{y \in S} (2^y + 5) = (2^3 + 5) + (2^6 + 5) + (2^8 + 5) + (2^{11} + 5)
• For the sum of a constant, \sum_{t=6}^{8} 4, we add the constant 4 for each value t = 6, 7, 8. This is equivalent to just adding 4 together three times:
\sum_{t=6}^{8} 4 = 4 + 4 + 4
• Then, for a range with no values, the sum is defined as 0. For \sum_{z=2}^{1} \sin(z), because there are no values from 2 to 1, we have:
\sum_{z=2}^{1} \sin(z) = 0
• Finally, if we let T = {−1, −3, 5}, for \sum_{x \in T} 13x, we multiply each value of x in T by 13 and then sum them up:
\sum_{x \in T} 13x = 13(−1) + 13(−3) + 13(5)
= 13(−1 + −3 + 5)
= 13 \sum_{x \in T} x
Notice that we can actually factor out the 13; that is, we could sum all values of x ∈ T first, and then multiply by 13. This is one of a few properties of summations we can see below!
Further, the associative and distributive properties hold for sums. If you squint hard enough, you can kind
of see why they’re true! We’ll also see some examples below too, since the notation can be confusing at first.
We have the associative property (0.3.3) and distributive property (0.3.4, 0.3.5) for sums.
\sum_{x \in A} f(x) + \sum_{x \in A} g(x) = \sum_{x \in A} (f(x) + g(x))    (0.3.3)

\sum_{x \in A} \alpha f(x) = \alpha \sum_{x \in A} f(x)    (0.3.4)

\left( \sum_{x \in A} f(x) \right) \left( \sum_{y \in B} g(y) \right) = \sum_{x \in A} \sum_{y \in B} f(x) g(y)    (0.3.5)

The last property is like FOIL - if you multiply (x + x^2 + x^3)(1/y + 1/y^2) (left-hand side) for example, you would have to sum over every possible combination x/y + x/y^2 + x^2/y + x^2/y^2 + x^3/y + x^3/y^2 (right-hand side).
The proofs of these are left to the reader, but see the examples below for some intuition!
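As a quick numerical sanity check (not a proof!), here is a short Python sketch; the sets A, B, the functions f, g, and the constant alpha are arbitrary choices of mine.

A = {1, 2, 3}
B = {10, 20}
f = lambda x: x * x
g = lambda x: x + 1
alpha = 5

# (0.3.3): the sum of f plus the sum of g equals the sum of (f + g).
assert sum(f(x) for x in A) + sum(g(x) for x in A) == sum(f(x) + g(x) for x in A)
# (0.3.4): a constant factors out of a sum.
assert sum(alpha * f(x) for x in A) == alpha * sum(f(x) for x in A)
# (0.3.5): a product of two sums expands (FOIL-style) into a double sum.
assert (sum(f(x) for x in A) * sum(g(y) for y in B)
        == sum(f(x) * g(y) for x in A for y in B))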
Example(s)
"Prove" the following by writing out the sums:
• \sum_{i=5}^{7} i + \sum_{i=5}^{7} i^2 = \sum_{i=5}^{7} (i + i^2)
• \sum_{j=3}^{5} 2j = 2 \sum_{j=3}^{5} j
• \left( \sum_{i=1}^{2} f(a_i) \right) \left( \sum_{j=1}^{3} g(b_j) \right) = \sum_{i=1}^{2} \sum_{j=1}^{3} f(a_i) g(b_j)
Solution For the first identity:
\sum_{i=5}^{7} i + \sum_{i=5}^{7} i^2 = (5 + 6 + 7) + (5^2 + 6^2 + 7^2) = (5 + 5^2) + (6 + 6^2) + (7 + 7^2) = \sum_{i=5}^{7} (i + i^2)
For the second:
\sum_{j=3}^{5} 2j = 2 · 3 + 2 · 4 + 2 · 5 = 2(3 + 4 + 5) = 2 \sum_{j=3}^{5} j
And for the third:
\left( \sum_{i=1}^{2} f(a_i) \right) \left( \sum_{j=1}^{3} g(b_j) \right) = (f(a_1) + f(a_2))(g(b_1) + g(b_2) + g(b_3))
= f(a_1)g(b_1) + f(a_1)g(b_2) + f(a_1)g(b_3) + f(a_2)g(b_1) + f(a_2)g(b_2) + f(a_2)g(b_3)
= \sum_{i=1}^{2} \sum_{j=1}^{3} f(a_i) g(b_j)
Further, if S is a set, and f : S → R is a function defined on S, then the following notation multiplies over all elements x ∈ S of f(x):
\prod_{x \in S} f(x)
Note that the product over no terms is defined as 1 (not 0 like it was for sums).
Example(s)
Write out the following products:
• \prod_{a=4}^{7} a
• \prod_{x \in S} 8, for S = {3, 6, 8, 11}
• \prod_{z=2}^{1} \sin(z)
• \prod_{b=2}^{5} 9^{1/b}
Solution
• For \prod_{a=4}^{7} a, we multiply each value a in the range 4 to 7 and have:
\prod_{a=4}^{7} a = 4 · 5 · 6 · 7
Q
• Then if, we let S = {3, 6, 8, 11}, for x∈S 8, we multiply 8 for each value in the set, S and have:
Y
8=8·8·8·8
x∈S
• Then for \prod_{z=2}^{1} \sin(z), we have the empty product, because there are no values in the range 2 to 1, so we have:
\prod_{z=2}^{1} \sin(z) = 1
• Finally, for \prod_{b=2}^{5} 9^{1/b}, we multiply the values 9^{1/b} for each b from 2 to 5, to get
\prod_{b=2}^{5} 9^{1/b} = 9^{1/2} · 9^{1/3} · 9^{1/4} · 9^{1/5}
= 9^{1/2 + 1/3 + 1/4 + 1/5}
= 9^{\sum_{b=2}^{5} 1/b}
Also, if you were to do the same examples as we did for sums, replacing \sum with \prod, you would just multiply instead of add! They are almost identical, except the empty sum is 0 and the empty product is 1.
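These conventions are built into Python itself (assuming Python 3.8+ for math.prod): summing over an empty range gives 0, and taking the product over an empty range gives 1.

import math

print(sum(z for z in range(2, 2)))   # empty sum -> 0
print(math.prod(range(2, 2)))        # empty product -> 1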
Chapter 1. Combinatorial Theory
1.1: So You Think You Can Count?
Slides (Google Drive) Video (YouTube)
Before we jump into probability, we must first learn a little bit of combinatorics, or more informally, counting.
You might wonder how this is relevant to probability, and we’ll see how very soon. You might also think
that counting is for kindergarteners, but it is actually a lot harder than you think!
To motivate us, let’s consider how easy or difficult it is for a robber to randomly guess your PIN code. Every
debit card has a PIN code that their owners use to withdraw cash from ATMs or to complete transactions.
How secure are these PINs, and how safe can we feel?
If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is
no overlap), then the number of possible outcomes of the experiment is:
N +M
More formally, if A and B are sets with no overlap (A ∩ B = ∅), then |A ∪ B| = |A| + |B|.
Example(s)
Suppose you must take a natural science class this year to graduate at any of the three UW campuses:
Seattle, Bothell, and Tacoma. Seattle offers 4 different courses, Bothell offers 7, and Tacoma only 2.
How many choices of class do you have in total?
Solution By the sum rule, it is simply 4 + 7 + 2 = 13 different courses (since there is no overlap)!
We'll see some examples of the Sum Rule combined with the Product Rule (next), which can be a bit more complex!
1.1.2 Product Rule
Suppose you are picking an outfit, and you have 3 different tops and 4 different bottoms. How many different outfits (one top and one bottom) can you make?
Well, we can consider this by first picking out a top. Once we have our top, we have 4 choices for our bottom. This means we have 4 choices of bottom for each top, of which we have 3. So, we have a total of 4 + 4 + 4 = 3 · 4 = 12 outfit choices.
We could also do this in reverse and first pick out a bottom. Once we have our bottom, we have 3 choices for our top. This means we have 3 choices of top for each bottom, of which we have 4. So, we still have a total of 3 + 3 + 3 + 3 = 4 · 3 = 12 outfit choices. (This makes sense - the number of outfits should be the same no matter how I count!)
What if we also wanted to add socks to the outfit, and we had 2 different pairs of socks? Then, for each of
the 12 choices outlined above, we now have 2 choices of sock. This brings us to a total of 24 possible outfits.
This could be calculated more directly rather than drawing out each of these unique outfits, by multiplying
our choices: 3 tops · 4 bottoms · 2 socks = 24 outfits.
If an experiment has two parts, where the first part has N possible outcomes and the second part has M possible outcomes (no matter the outcome of the first part), then the total number of outcomes is N · M.
More formally, if A and B are sets, then |A × B| = |A| · |B|, where A × B = {(a, b) : a ∈ A, b ∈ B} is the Cartesian product of sets A and B.
If this still sounds “simple” to you or you just want to practice, see the examples below! There are some
pretty interesting scenarios we can count, and they are more difficult than you might expect.
Example(s)
1. How many outcomes are possible when flipping a coin n times? For example, when n = 2 there are four possibilities: HH, HT, TH, TT.
2. How many subsets of the set [n] = {1, 2, . . . , n} are there?
Solution
1. The answer is 2^n: for the first flip, there are two choices: H or T. Same for the second flip, the third, and so on. Multiply 2 together n times to get 2^n.
2. This may be hard to think about at first. But think of the subset {2, 4, 5} of the set {1, 2, 3, 4, 5, 6, 7} as follows: for each number in the set, either it is in the subset or not. So there are two choices (in or out) for the first element, two for the second, and so on for each of the n elements. This gives 2^n as well! (See the sketch below.)
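Here is a small Python sketch (using itertools, with n = 4 as an arbitrary choice of mine) that enumerates both collections and confirms each has size 2^n:

from itertools import product

n = 4
# All outcomes of n coin flips: every length-n sequence of H's and T's.
flips = list(product("HT", repeat=n))
# All subsets of [n]: an independent in/out decision for each element.
subsets = list(product([True, False], repeat=n))
print(len(flips), len(subsets), 2 ** n)   # 16 16 16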
Example(s)
Flamingos Fanny and Freddy have three offspring: Happy, Glee, and Joy. These five flamingos are
to be distributed to seven different zoos so that no zoo gets both a parent and a child :(. It is not
required that every zoo gets a flamingo. In how many different ways can this be done?
Solution There are two disjoint (mutually exclusive) cases we can consider that cover every possibility. We
can use the sum rule to add them up since they don’t overlap!
1. Case 1: The parents end up in the same zoo. There are 7 choices of zoo they could end up at. Then, the three offspring can go to any of the 6 other zoos, for a total of 7 · 6 · 6 · 6 = 7 · 6^3 possibilities (by the product rule).
2. Case 2: The parents end up in different zoos. There are 7 choices for Fanny and 6 for Freddy. Then, the three offspring can go to any of the 5 other zoos, for a total of 7 · 6 · 5^3 possibilities.
The result, by the sum rule, is 7 · 6^3 + 7 · 6 · 5^3. (Note: This may not be the only way to solve this problem.
Often, counting problems have two or more approaches, and it is instructive to try different methods to get
the same answer. If they differ, at least one of them is wrong, so try to find out which one and why!)
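Small counting problems like this one can also be checked by brute force. Here is a Python sketch (the flamingo-to-index encoding is my own) that enumerates every assignment of the 5 flamingos to the 7 zoos:

from itertools import product

parents = (0, 1)       # Fanny, Freddy
children = (2, 3, 4)   # Happy, Glee, Joy
count = 0
for zoos in product(range(7), repeat=5):   # zoos[i] = zoo of flamingo i
    # Valid iff no zoo receives both a parent and a child.
    if not any(zoos[p] == zoos[c] for p in parents for c in children):
        count += 1
print(count, 7 * 6**3 + 7 * 6 * 5**3)   # both 6762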
1.1.3 Permutations
Back to the example of the debit card. There are 10 possible digits for each of the 4 digits of a PIN. So how many possible 4-digit PINs are there? This can be solved as 10 · 10 · 10 · 10 = 10^4 = 10,000. So, there is a one in ten thousand chance that a robber can guess your PIN code (randomly).
Let’s consider a stronger case where you must use each digit exactly once, so the PIN is exactly 10 digits
long. How many such PINs exist?
Well, we have 10 choices for the first digit, 9 choices for the second digit, and so forth, until we only have 2 choices for the ninth digit, and 1 choice for the tenth digit. This means there are 3,628,800 possible PINs in this scenario, as follows:
10 · 9 · · · · · 2 · 1 = \prod_{i=1}^{10} i = 3,628,800
This formula/pattern seems like it would appear often! Wouldn’t it be great if there were a shorthand for
this?
Definition 1.1.1: Permutation
The number of orderings of N distinct objects, is called a permutation, and mathematically defined
as:
N! = N · (N − 1) · (N − 2) · . . . · 3 · 2 · 1 = \prod_{j=1}^{N} j
N! is read as "N factorial". It is important to note that 0! = 1, since there is one way to arrange 0 objects.
Example(s)
A standard 52-card deck consists of one of each combination of: 13 different ranks (Ace, 2, 3, ..., 10, Jack, Queen, King) and 4 different suits (clubs, diamonds, hearts, spades), since 13 · 4 = 52. In how many ways can a 52-card deck be dealt to thirteen players, four cards to each, so that every player has one card of each suit?
Solution This is a great example where we can try two equivalent approaches. Each person usually has different preferences, and sometimes one way is significantly easier to understand than another. Read them both, understand why they both make sense and are equal, and figure out which approach is more intuitive for you!
Let's assign each player one at a time. The first player has 13 choices for the club, 13 for the heart, 13 for the diamond, and 13 for the spade, for a total of 13^4 ways. The second player has 12^4 choices (since there are only 12 of each suit remaining). And so on, so the answer is 13^4 · 12^4 · 11^4 · ... · 2^4 · 1^4.
Alternatively, we can assign each suit one at a time. For the clubs suit, there are 13! ways to distribute them to the 13 different players. Then, the diamonds suit can be assigned in 13! ways as well, and same for the other two suits. By the product rule, the total number of ways is (13!)^4. Check that this different order of assigning cards gave the same answer as earlier! (Expand the factorials.)
Example(s)
A group of n families, each with m members, are to be lined up for a photograph. In how many ways
can the nm people be arranged if members of a family must stay together?
Solution We first choose the ordering of the families, of which there are n!. Then, in the first family, we have m! ways to arrange them. The second family also has m! ways to be arranged. And so on. By the product rule, the number of orderings is n! · (m!)^n.
1.1.4 Complementary Counting
Sometimes it is easier to count the opposite of what we want and subtract from the total. For example, how many 10-digit PINs do NOT use each digit exactly once? There are 10^10 total 10-digit PINs, and we just counted the 10! PINs that do use each digit exactly once, so the answer is
10^{10} − 10!
Let U be a (finite) universal set, and S a subset of interest. Let S^C = U \ S denote the set difference (complement of S). Then,
|S| = |U| − |S^C|
Informally, to find the number of ways to do something, we could count the number of ways NOT to do that thing, and subtract it from the total. That is, the complement of the subset of interest is also of interest! This technique is called complementary counting.
Think about how this is just the Sum Rule rephrased, using the diagram above!
1.1.5 Exercises
1. Suppose we have 6 people who want to line up in a row for a picture, but two of them, A and B, refuse
to sit next to each other. How many ways can they sit in a row?
Solution: There are two equivalent approaches. The first approach is to solve it directly. However, depending on where A sits, B has a different number of options (whether A sits at the end or the middle). So we have two disjoint (non-overlapping) cases:
(a) Case 1: A sits at one of the two end seats. Then, A has 2 choices for where to sit, and B has 4.
(See this diagram where A sits at the right end: − − − − −A.) Then, there are 4! ways for the
remaining people to sit, for a total of 2 · 4 · 4! ways.
(b) Case 2: A sits in one of the middle 4 seats. Then, A has 4 choices of seat, but B only has three
choices for where to sit. (See this diagram where A sits in a middle seat: −A − − − −.) Again,
there are 4! ways to seat the rest, for a total of 4 · 3 · 4! ways.
Hence our total by the sum rule is 2 · 4 · 4! + 4 · 3 · 4! = 480.
The alternative approach is complementary counting. We can count the total orderings, of which
there are 6!, and subtract the cases where A and B do sit next to each other. There’s a trick we can
do to guarantee this: let’s treat A and B as a single entity. Then, along with the remaining 4 people,
there are only 5 entities. We order the entities in 5! ways, but also multiply by 2! since we could have
the seating AB or BA. Hence, the number of ways they do sit together is 2 · 5! = 240, and the ways they do not sit together is 6! − 240 = 720 − 240 = 480.
Decide which approach you liked better - oftentimes, one method will be easier than another!
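If you'd like to double-check answers like this one, a brute-force Python sketch over all 6! = 720 orderings is quick to write:

from itertools import permutations

# Count orderings of 6 people in which "A" and "B" are not adjacent.
count = sum(1 for p in permutations("ABCDEF")
            if abs(p.index("A") - p.index("B")) > 1)
print(count)   # 480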
2. You love playing the 5 racket sports: tennis, badminton, ping pong, squash, and racquetball. You plan
a week of sports at a time, from Sunday to Saturday. Each day you want to play one of these sports
with a friend, but to avoid getting bored, you don’t ever play the same sport two days in a row. If your
mom is visiting town and wants to play tennis with you on Wednesday, how many possible "sports schedules" for the week can you create?
Solution: If you try to start from Sunday (which is a very natural thing to do since it is the
first day), you will run into some trouble. You could have 5 choices for Sunday, and 4 for the Monday
(since you can’t play the same sport as Sunday). But Tuesday is interesting because you can’t choose
tennis because of Wednesday, and you don’t know what Monday’s choice was...
We should try a different approach. Why don't we start by assigning Wednesday to tennis first (1 way) and work outwards? Then, let's plan Tuesday (4 ways), then Monday (4 ways), and Sunday (4 ways). Then, similarly plan the rest of the week Thurs-Sat. The total number of ways is just 4^6, because you have 4 choices for each of the other 6 days!
The goal of this problem is to show you that you don’t always have to start left to right or right to
left - as long as it works!
3. Suppose that 8 people, including you and a friend, line up for a picture. In how many ways can the
photographer organize the line if she wants to have fewer than 2 people between you and your friend?
Solution: This is hard to tackle directly. A lot of these problems require some interesting modeling, which you'll get used to through practice!
There are two disjoint (non-overlapping) cases for your friend and you, so we can use the sum rule.
(a) Case 1: You are next to your friend. Then, there are 7 sets of positions you and your friend can
occupy (positions 1/2, 2/3, ..., 7/8), and for each set of positions, there are 2! ways to arrange
you and your friend. So there are 7 · 2! ways to pick positions for you and your friend.
(b) Case 2: There is exactly 1 person between you and your friend. Then, there are 6 sets of positions
you and your friend can occupy (positions 1/3, 2/4, ... , 6/8), and for each set of positions, there
are again 2! ways to arrange you and your friend. So there are 6 · 2! ways to pick positions for
you and your friend.
Note that in both cases, there are then 6! ways to arrange the remaining people, so we multiply both
cases by 6! by the product rule. This gives (7 · 2! + 6 · 2!) · 6! ways in total.
Chapter 1. Combinatorial Theory
1.2: More Counting
Slides (Google Drive) Video (YouTube)
1.2.1 k-Permutations
Last time, we learned the foundational techniques for counting (the sum and product rule), and the factorial
notation which arises frequently. Now, we’ll learn even more “shortcuts”/“notations” for common counting
situations, and tackle more complex problems.
We’ll start with a simpler situation than most of the exercises from last time. How many 3-color mini
rainbows can be made out of 7 available colors, with all 3 being different colors?
We choose an outer color, then a middle color and then an inner color. There are 7 possibilities for the outer
layer, 6 for the middle and 5 for the inner (since we cannot have duplicates). Since order matters, we find
that the total number of possibilities is 210, from the following calculation:
7 · 6 · 5 = \frac{7 · 6 · 5}{1} · \frac{4 · 3 · 2 · 1}{4 · 3 · 2 · 1}    [multiply numerator and denominator by 4! = 4 · 3 · 2 · 1]
= \frac{7!}{4!}    [def of factorial]
= \frac{7!}{(7 − 3)!}
Notice that we are “picking” 3 out of 7 available colors - so order matters. This may not seem useful, but
imagine if there were 835 colors and we wanted a rainbow with 135 different colors. You would have to
multiply 135 numbers, rather than just three!
If we want to arrange only k out of n distinct objects, the number of ways to do so is P (n, k) (read
as “n pick k”), where
P(n, k) = n · (n − 1) · (n − 2) · ... · (n − k + 1) = \frac{n!}{(n − k)!}
A permutation of n objects is an arrangement of each object (where order matters), so a k-permutation is an arrangement of k members of a set of n members (where order matters). The number of k-permutations of n objects is just P(n, k).
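In Python (3.8+), math.perm computes P(n, k) directly, so small cases are easy to check; a minimal sketch:

import math

print(math.perm(7, 3))                              # 210, the mini-rainbow count
print(math.factorial(7) // math.factorial(7 - 3))   # same thing: 7!/(7-3)!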
Example(s)
Suppose we have 13 chairs (in a row) with 9 TA’s, and Professors Sunny, Rainy, Windy, and Cloudy
to be seated. What is the number of seatings where every professor has a TA to his/her immediate
left and right?
Solution This is quite a tricky problem if we don’t choose the right setup. Imagine we first just order 9 TA’s
in a line - there are 9! ways to do this. Then, there are 8 spots between them, so that if we place a professor
there, they’re guaranteed to have a TA to their immediate left and right. We can’t place more than one
professor in a spot. Out of the 8 spots, we pick 4 of them for the professors to sit (order matters, since the
professors are different people). So the answer by the product rule is 9! · P (8, 4).
1.2.2 k-Combinations
Recall that there were P(7, 3) = 210 possible mini-rainbows. But what if the order of the 3 colors didn't matter? As we see from these rainbows, each "smeared" (unordered) choice of colors is counted 3! = 6 times. So, to get our answer, we take the 210 mini-rainbows and divide by 6 to account for the overcounting, since in this case order doesn't matter.
The answer is
\frac{210}{6} = \frac{P(7, 3)}{3!} = \frac{7!}{3!(7 − 3)!}
If we want to choose (order doesn't matter) only k out of n distinct objects, the number of ways to do so is C(n, k) = \binom{n}{k} (read as "n choose k"), where
C(n, k) = \binom{n}{k} = \frac{P(n, k)}{k!} = \frac{n!}{k!(n − k)!}
A k-combination is a selection of k objects from a collection of n objects, in which the order does not matter. The number of k-combinations of n objects is just \binom{n}{k}. \binom{n}{k} is also called a binomial coefficient - we'll see why in the next section.
Notice, we can show from this that there is symmetry in the definition of binomial coefficients:
\binom{n}{k} = \frac{n!}{k!(n − k)!} = \frac{n!}{(n − k)!k!} = \binom{n}{n − k}
Looking at these, we can see that the color choices in each row are complementary. Intuitively, choosing
1 color is the same as choosing 4 − 1 = 3 colors that we don’t want - and vice versa. This explains the
symmetry in binomial coefficients!
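Python's math.comb (3.8+) makes this symmetry easy to observe numerically; a small sketch:

import math

# Choosing 3 of 7 colors to include = choosing 4 of 7 to exclude.
print(math.comb(7, 3), math.comb(7, 4))   # 35 35
assert all(math.comb(10, k) == math.comb(10, 10 - k) for k in range(11))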
Example(s)
There are 6 AI professors and 7 theory professors taking part in an escape room. If 4 AI professors and 4 theory professors are to be chosen and divided into 4 pairs (one AI professor with one theory professor per pair), how many pairings are possible?
Solution First, we choose which 4 of the 6 AI professors and which 4 of the 7 theory professors participate: \binom{6}{4} \binom{7}{4} ways. Then, line up the 4 chosen AI professors: the first can be paired with any of the 4 chosen theory professors, the second with any of the remaining 3, and so on, for 4! pairings. By the product rule, the answer is \binom{6}{4} \binom{7}{4} · 4!.
Example(s)
How many ways are there to walk from the intersection of 1st and Spring to 5th and Pine? Assume
we only go North and East. A sample route is highlighted in purple.
Solution We can actually solve this problem as well! It has a rather clever solution.
We have to move North exactly three times and East exactly four times. Let's encode a path as a sequence of 3 N's and 4 E's (the path highlighted is encoded as ENEENEN). Then, let's choose the three positions for the N's, giving us \binom{7}{3} ways (why "choose" and not "pick"?). Then, the E's are actually already determined, right? They have to be in the remaining 4 positions. So the answer is simply \binom{7}{3}. Alternatively, if we wanted to choose the positions for the 4 E's first instead, there would be \binom{7}{4} ways to do this.
Remember that \binom{7}{3} = \binom{7}{7 − 3} = \binom{7}{4}, so these are equivalent!
1.2.3 Multinomial Coefficients
But if we want to rearrange the letters in "POOPOO", we have indistinct letters (two types - P and O). How do we approach this?
One approach is to choose which of the 6 positions the 2 P's go in (\binom{6}{2} ways), and then the O's have to go in the remaining 4 spots (\binom{4}{4} = 1 way). Or, we can choose where the 4 O's go (\binom{6}{4} ways), and then the remaining P's are set (\binom{2}{2} = 1 way).
Another interpretation of this formula is that we are first arranging the 6 letters as if they were distinct: P_1 O_1 O_2 P_2 O_3 O_4. Then, we divide by 4! and 2! to account for the 4 duplicate O's and 2 duplicate P's.
What if we got even more complex, let’s say three different letters? For example, rearranging the word
"BABYYYBAY". There are 3 B's, 2 A's, and 4 Y's, for a total of 9 letters. We can choose where the 3 B's should go out of the 9 spots: \binom{9}{3} (order doesn't matter since all the B's are identical). Then out of the remaining 6 spots, we should choose 2 for the A's: \binom{6}{2}. Finally, out of the 4 remaining spots, we put the 4 Y's there:
\binom{4}{4} = 1. By the product rule, our answer is
\binom{9}{3} \binom{6}{2} \binom{4}{4} = \frac{9!}{3!6!} · \frac{6!}{2!4!} · \frac{4!}{4!0!} = \frac{9!}{3!2!4!}
Note that we could have chosen to assign the Y's first instead: out of 9 positions, we choose 4 to be Y: \binom{9}{4}. Then from the 5 remaining spots, choose where the 2 A's go: \binom{5}{2}, and the last three spots must be B's: \binom{3}{3} = 1. This gives us the equivalent answer
\binom{9}{4} \binom{5}{2} \binom{3}{3} = \frac{9!}{4!5!} · \frac{5!}{2!3!} · \frac{3!}{3!0!} = \frac{9!}{3!2!4!}
This shows once again that there are many correct ways to count something. This type of problem also frequently appears, and so we have a special notation (called a multinomial coefficient):
\binom{9}{3, 2, 4} = \frac{9!}{3!2!4!}
Note the order of the bottom three numbers does not matter (since the multiplication in the denominator is commutative), and that the bottom numbers must add up to the top number.
If we have k types of objects (n total), with n_1 of the first type, n_2 of the second, ..., and n_k of the k-th, then the number of arrangements possible is
\binom{n}{n_1, n_2, ..., n_k} = \frac{n!}{n_1! n_2! ... n_k!}
Above, we had k = 3 types of objects (B, A, Y) with n_1 = 3 (number of B's), n_2 = 2 (number of A's), and n_3 = 4 (number of Y's), for an answer of \binom{n}{n_1, n_2, n_3} = \frac{9!}{3!2!4!}.
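Python has no built-in multinomial coefficient, but it is a few lines with math.factorial; here is a sketch (the helper name multinomial is my own):

from math import factorial

def multinomial(*parts):
    # n! / (n_1! n_2! ... n_k!), where n = n_1 + ... + n_k.
    result = factorial(sum(parts))
    for p in parts:
        result //= factorial(p)
    return result

print(multinomial(3, 2, 4))   # 9!/(3!2!4!) = 1260, the BABYYYBAY count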
Example(s)
How many distinct ways are there to rearrange the letters of the word "GODOGGY"?
Solution There are n = 7 letters. There are only k = 4 distinct letters - {G, O, D, Y}.
n_1 = 3 - there are 3 G's.
n_2 = 2 - there are 2 O's.
n_3 = 1 - there is 1 D.
n_4 = 1 - there is 1 Y.
This gives us the number of possible arrangements:
\binom{7}{3, 2, 1, 1} = \frac{7!}{3!2!1!1!}
It is important to note that even though the 1’s are “useless” since 1! = 1, we still must write every number
on the bottom since they have to add to the top number.
1.2.4 Stars and Bars
Suppose we want to count the number of ways to distribute 5 identical (indistinguishable) candies to 3 distinguishable kids.
Notice that the second and third pictures show different possible distributions, since the kids are distinguishable (different). Any idea on how we can tackle this problem?
The idea here is that we will count something equivalent. Let’s say there are 5 “stars” for the 5 candies
and 2 “bars” for the dividers (dividing 3 kids). For instance, this distribution of candies corresponds to this
arrangement of 5 stars and 2 bars:
Here is another example of the correspondence between a distribution of candies and the arrangement of
stars and bars:
For each candy distribution, there is exactly one corresponding way to arrange the stars and bars. Conversely,
for each arrangement of stars and bars, there is exactly one candy distribution it represents.
Hence, the number of ways to distribute 5 candies to the 3 kids is the number of arrangements of 5 stars
and 2 bars.
This is simply
\binom{7}{2} = \binom{7}{5} = \frac{7!}{2!5!}
Amazing right? We just reduced this candy distribution problem to reordering letters!
The number of ways to distribute n indistinguishable balls into k distinguishable bins is
\binom{n + k − 1}{k − 1} = \binom{n + k − 1}{n}
since we set up n stars for the n balls, and k − 1 bars dividing the k bins.
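As a sanity check on the formula, here is a brute-force Python sketch that counts the candy distributions directly:

from itertools import product
from math import comb

n, k = 5, 3   # 5 identical candies, 3 kids
# Brute force: tuples (c1, c2, c3) of nonnegative counts summing to n.
brute = sum(1 for c in product(range(n + 1), repeat=k) if sum(c) == n)
print(brute, comb(n + k - 1, k - 1))   # both 21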
Example(s)
There are 20 students and 4 professors. Assume the students are indistinguishable to the professors, who only care how many students they have, and not which ones.
1. If there are no restrictions, how many ways can we assign the students to the professors?
Solution This is actually the perfect setup for stars and bars. We have 20 stars (students) and 3 bars (dividing the 4 professors), and so our answer is \binom{23}{3} = \binom{23}{20}.
1.2.5 Exercises
1. There are 40 seats and 40 students in a classroom. Suppose that the front row contains 10 seats, and
there are 5 students who must sit in the front row in order to see the board clearly. How many seating
arrangements are possible with this restriction?
Solution: Again, there may be many correct approaches. We can first choose which 5 of the 10 seats in the front row to give to those students: \binom{10}{5} ways. Then, assign those 5 students to these seats, for which there are 5! ways. Finally, assign the other 35 students in any way, for 35! ways. By the product rule, there are \binom{10}{5} · 5! · 35! ways to do so.
2. Suppose you are to get to take your final exam in pairs. There are 100 students in the class and 8 TAs,
so 8 lucky students will get to pair up with a TA! Each TA must take the exam with some student,
but two TAs cannot take the exam together. How many ways can they pair up?
Solution: First we choose the 8 lucky students and pair them with a TA. There are \binom{100}{8} ways to choose those 8 students and then 8! ways to pair them up, for a total of \binom{100}{8} · 8! ways (note this is the same as P(100, 8)). Then there are 92 students left. The first one has 91 choices of partner. Then there are 90 students left, and so the next one has 89 choices. And so on. So the total number of ways is
\binom{100}{8} · 8! · 91 · 89 · 87 · . . . · 3 · 1
3. If we roll a fair 3-sided die 11 times, what is the number of ways that we can get 4 1’s, 5 2’s, and 2 3’s?
Solution: We can write the outcomes as a sequence of length 11, each digit of which is 1, 2 or 3. Hence, the number of ways to get 4 1's, 5 2's, and 2 3's is the number of orderings of 11112222233, which is
\binom{11}{4, 5, 2} = \frac{11!}{4!5!2!}
4. These two problems are almost identical, but have drastically different approaches to them. These are
both extremely hard/tricky problems, though they may look deceivingly simple. These are probably
the two coolest problems I’ve encountered in counting, as they do have elegant solutions!
(a) How many 7-digit phone numbers are such that the numbers are strictly increasing (digits must
go up)? (e.g., 014-5689, 134-6789, etc.)
(b) How many 7-digit phone numbers are such that the numbers are monotone increasing (digits can
stay the same or go up)? (e.g., 011-5566, 134-6789, etc.) Hint: Reduce this to stars and bars.
Solution:
(a) We choose 7 out of 10 digits, which has \binom{10}{7} possibilities, and then once we do, there is only 1 valid ordering (we must put them in increasing order). Hence, the answer is simply \binom{10}{7}. This question has a deceivingly simple solution, as many students (including myself at one point) would have started by choosing the first digit. But the choices for the next digit depend on the first digit. And so on for the third. This leads to a complicated, nearly unsolvable mess!
(b) This is a very difficult problem to frame in terms of stars and bars. We need to map one phone
number to exactly one ordering of stars and bars, and vice versa. Consider letting each of the 9 bars be an increase from one digit to the next, and the 7 stars be the 7 digits. This is extremely complicated, so we'll give 3 examples of what we mean.
i. The phone number 011-5566 is represented as ∗| ∗ ∗|||| ∗ ∗| ∗ ∗|||. We start a counter at 0, we
see a digit first (a star), so we mark down 0. Then we see a bar, which tells us to increase
our counter to 1. Then, two more digits (stars), which say to mark down 2 1’s. Then, 4 bars
which tell us to increase count from 1 to 5. Then two *’s for the next two 5’s, and a bar to
increase to 6. Then, two stars indicate to put down 2 6’s. Then, we increment count to 9 but
don’t put down any more digits.
ii. The phone number 134-6789 is represented as | ∗ || ∗ | ∗ || ∗ | ∗ | ∗ |∗. We start a counter at 0,
and we see a bar first, so we increase count to 1. Then a star tells us to actually write down
1 as our first digit. The two bars tell us to increase count from 1 to 3. The star says mark a
3 down now. Then, a bar to increase to 4. Then a star to write down 4. Two bars to increase
to 6. And so on.
iii. The stars and bars ordering |||| ∗ | ∗ ∗ ∗ ∗|| ∗ ||∗ represents the phone number 455-5579. We
start a counter at 0. We see 4 bars so we increment to 4. The star says to mark down a 4.
Then increment count by 1 to 5 due to the next bar. Then, mark 5 down 4 times (4 stars).
Then increment count by 2, put down a 7, and repeat to put down a 9.
Hence there is a bijection between these phone numbers and arrangements of 7 stars and 9 bars. So the number of satisfying phone numbers is \binom{16}{7} = \binom{16}{9}.
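A monotone increasing phone number is exactly a multiset of 7 digits, so Python's itertools.combinations_with_replacement can enumerate them directly; a quick check of the answer:

from itertools import combinations_with_replacement
from math import comb

# Each non-decreasing 7-digit sequence is one multiset of digits 0-9.
count = sum(1 for _ in combinations_with_replacement(range(10), 7))
print(count, comb(16, 7))   # both 11440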
Chapter 1. Combinatorial Theory
1.3: No More Counting Please
Slides (Google Drive) Video (YouTube)
In this section, we don’t really have a nice successive ordering where one topic leads to the next as we did
earlier. This section serves as a place to put all the final miscellaneous but useful concepts in counting.
1.3.1 Binomial Theorem
Let's start with something we know how to expand:
(x + y)^2 = (x + y)(x + y)
= xx + xy + yx + yy    [FOIL]
= x^2 + 2xy + y^2
But, let's say that we wanted to do this for a binomial raised to some higher power, say (x + y)^4. There would be a lot more terms, but we could use a similar approach.
But what exactly are the terms included in this expression? And how could we combine the like terms?
Notice that each term will be a mixture of x's and y's. In fact, each term will be of the form x^k y^{n−k} (in this case n = 4). This is because there will be exactly n x's or y's in each term, so if there are k x's, then there must be n − k y's. That is, we will have terms of the form x^4, x^3y, x^2y^2, xy^3, y^4, with most appearing more than once.
For a specific k though, how many times does x^k y^{n−k} appear? For example, in the above case, take k = 1; then note that xyyy = yxyy = yyxy = yyyx = xy^3, so xy^3 will appear with a coefficient of 4 in the final simplified form (just like for (x + y)^2 the term xy appears with a coefficient 2). Does this look familiar? It should remind you yet again of rearranging words with duplicate letters!
Now, we can generalize this, as the number of terms that simplify to x^k y^{n−k} will be equivalent to the number of ways to choose exactly k of the binomials to give us x (and let the remaining n − k give us y). Alternatively, we need to arrange k x's and n − k y's. To think of this in the above example with k = 1 and n = 4, we consider which of the four binomials would give us the single x - the first, second, third, or fourth - for a total of \binom{4}{1} = 4.
Let's consider k = 2 in the above example. We want to know how many terms are equivalent to x^2y^2. Well, we then have xxyy = yxxy = yyxx = xyxy = yxyx = xyyx = x^2y^2, so there are six ways and the coefficient on the simplified term x^2y^2 will be \binom{4}{2} = 6.
Notice that we are essentially choosing which of the binomials gives us an x such that k of the n binomials do. That is, the coefficient for x^k y^{n−k}, where k ranges from 0 to n, is simply \binom{n}{k}. This is why it was also called a binomial coefficient. Putting this together gives us the binomial theorem:
(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n−k}
This can also be proved by induction, but this is left as an exercise for the reader.
Example(s)
Calculate the coefficient of a^{45} b^{14} in the expansion (4a^3 − 5b^2)^{22}.
Solution Let x = 4a^3 and y = −5b^2. Then, we are looking for the coefficient of x^{15} y^7 (because x^{15} gives us a^{45} and y^7 gives us b^{14}), which is \binom{22}{15}. So we have the term
\binom{22}{15} x^{15} y^7 = \binom{22}{15} (4a^3)^{15} (−5b^2)^7 = −\binom{22}{15} 4^{15} 5^7 a^{45} b^{14}
and our answer is −\binom{22}{15} 4^{15} 5^7.
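The binomial theorem (and this coefficient) can be checked numerically with math.comb; a sketch, with x, y chosen arbitrarily for the assert:

from math import comb

x, y, n = 3, 5, 22
assert (x + y) ** n == sum(comb(n, k) * x**k * y**(n - k) for k in range(n + 1))

# The coefficient from the example: -C(22, 15) * 4^15 * 5^7.
print(-comb(22, 15) * 4**15 * 5**7)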
1.3.2 Inclusion-Exclusion
Say we did an anonymous survey where we asked whether students in CSE312 like ice cream, and found
that 43 people liked ice cream. Then we did another anonymous survey where we asked whether students in
CSE312 liked donuts, and found that 20 people liked donuts. With this information can we determine how
many people like ice cream or donuts (or both)?
Let A be the set of people who like ice cream, and B the set of people who like donuts. The sum rule from 1.1 said that, if A, B were mutually exclusive (it wasn't possible to like both donuts and ice cream: A ∩ B = ∅), then we could just add them up: |A ∪ B| = |A| + |B| = 43 + 20 = 63. But this is not the case, since it is possible to like both. We can't quite figure this out yet without knowing how many people overlapped: the size of A ∩ B.
So, we did another anonymous survey in which we asked whether students in CSE312 like both ice cream
and donuts, and found that only 7 people like both. Now, do we have enough information to determine how
many students like ice cream or donuts?
Yes! Knowing that 43 people like ice cream and 7 people like both ice cream and donuts, we can conclude
that 36 people like ice cream but don’t like donuts. Similarly, knowing that 20 people like donuts and 7
people like both ice cream and donuts, we can conclude that 13 people like donuts but don’t like ice cream.
This leaves us with the following picture, where A is the students who like ice cream. B is the students who
like donuts (this implies |A ∩ B| = 7 is the number of students who like both):
|A| = 43
|B| = 20
|A ∩ B| = 7
Now, to go back to the question of how many students like either ice cream or donuts, we can just add up
the 36 people that just like ice cream, the 7 people that like both ice cream and donuts, and the 13 people
that just like donuts, and get 36 + 7 + 13 = 56. Alternatively, we could consider this as adding up the 43
people who like ice cream (including both the 36 those who just like ice cream and the 7 who like both)
and the 20 people who like donuts (including the 13 who just like donuts and the 7 who like both) and then
subtracting the 7 who like both since they were counted twice. That is 43 + 20 − 7 = 56. That leaves us
with:
|A ∪ B| = 36 + 7 + 13 = 56 = 43 + 20 − 7 = |A| + |B| − |A ∩ B|
Recall that |A ∪ B| is the students who like donuts or ice cream (the union of the two sets).
|A ∪ B| = |A| + |B| − |A ∩ B|
Example(s)
How many numbers in the set [360] = {1, 2, . . . , 360} are divisible by:
1. 4, 6, and 9.
2. 4, 6 or 9.
3. neither 4, 6, nor 9.
Solution
1. This is just the multiples of lcm(4, 6, 9) = 36, of which there are 360/36 = 10.
2. Let D_i be the set of numbers in [360] which are divisible by i, for i = 4, 6, 9. Hence, the number of numbers which are divisible by 4, 6, or 9 is |D_4 ∪ D_6 ∪ D_9|. We can apply inclusion-exclusion (add the singles, subtract the doubles, add the triple):
|D_4 ∪ D_6 ∪ D_9| = |D_4| + |D_6| + |D_9| − |D_4 ∩ D_6| − |D_4 ∩ D_9| − |D_6 ∩ D_9| + |D_4 ∩ D_6 ∩ D_9|
A number is divisible by both 4 and 6 exactly when it is divisible by lcm(4, 6) = 12 (and similarly for the other intersections), so this is
360/4 + 360/6 + 360/9 − 360/12 − 360/36 − 360/18 + 360/36 = 90 + 60 + 40 − 30 − 10 − 20 + 10 = 140
3. By complementary counting, the count divisible by neither 4, 6, nor 9 is 360 − 140 = 220.
Many times it may be possible to avoid this ugly mess using complementary counting, but sometimes it isn't.
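Answers like these are easy to double-check by brute force in Python; a sketch:

nums = range(1, 361)
div_all = sum(1 for x in nums if x % 36 == 0)   # divisible by 4, 6, AND 9 (i.e., by 36)
div_any = sum(1 for x in nums if x % 4 == 0 or x % 6 == 0 or x % 9 == 0)
print(div_all, div_any, 360 - div_any)   # 10 140 220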
1.3.3 Pigeonhole Principle
The floor function ⌊x⌋ returns the largest integer ≤ x (i.e., rounds down). The ceiling function ⌈x⌉ returns the smallest integer ≥ x (i.e., rounds up). Note the difference is just whether the bracket is on top (ceiling) or bottom (floor).
If there are n pigeons we want to put into k holes (where n > k), then at least one pigeonhole must contain at least 2 pigeons.
More generally, if there are n pigeons we want to put into k pigeonholes, then at least one pigeonhole must contain at least ⌈n/k⌉ pigeons.
This fact or rule may seem trivial to you, but the hard part of pigeonhole problems is knowing how to apply
it. See the examples below!
Example(s)
First, assume that if Alex is friends with Jun, Jun must also be friends with Alex. In other words,
friendship is mutual.
Show that in a group of n ≥ 2 people (who may be friends with any number of other peo-
ple), two must have the same number of friends.
Solution Suppose there are exactly k people with exactly 0 friends. If k ≥ 2, we are done since (at least) two
people will both have 0 friends.
Otherwise, the remaining n − k people have between 1 and n − k − 1 friends (they can’t be friends with those
k, nor themselves). If we have the n − k people being pigeons and n − k − 1 possible values of friends being
the pigeonholes, then we know by the PHP that (at least) two people have the same number of friends!
Example(s)
Show that there exists a number made up of only 1’s (e.g., 1111 or 11) which is divisible by 333.
Solution Consider the sequence of 334 numbers x_1, x_2, x_3, . . . , x_334, where x_i is the number made of exactly i 1's (e.g., x_2 = 11, x_5 = 11111, etc.). We'll use the notation x_i = 1^i to mean i 1's concatenated together.
The number of possible remainders when dividing by 333 is 333: {0, 1, 2, . . . , 332}, so by the pigeonhole principle, since 334 > 333, two numbers x_i and x_j have the same remainder (suppose i < j without loss of generality) when divided by 333. The number x_j − x_i is of the form 1^{j−i} 0^i; that is, j − i 1's followed by i 0's (e.g., x_5 − x_2 = 11111 − 11 = 11100 = 1^3 0^2). This number must be divisible by 333 because x_i ≡ x_j (mod 333) ⇒ (x_j − x_i) ≡ 0 (mod 333).
Now, keep deleting zeros (by dividing by 10) until there aren't any more left - this doesn't affect whether or not 333 divides it, since neither 2 nor 5 divides 333. Now we're left with a number divisible by 333 made up of all ones (1^{j−i} to be exact)!
Note that 333 was not special - we could have used any number that wasn’t divisible by 2 nor 5.
1.3.4 Combinatorial Proofs
Now that we know how to count, we can actually prove some algebraic identities using counting instead!
Suppose we wanted to show that \binom{n}{k} = \binom{n−1}{k−1} + \binom{n−1}{k} is true for any positive integer n ∈ N and 0 ≤ k ≤ n. We could start with an algebraic approach and try something like:
\binom{n−1}{k−1} + \binom{n−1}{k} = \frac{(n − 1)!}{(k − 1)!(n − k)!} + \frac{(n − 1)!}{k!(n − 1 − k)!}    [def of binomial coef]
· · ·    [lots of algebra]
= \frac{n!}{k!(n − k)!}
= \binom{n}{k}
However, those · · · may be tedious and take a lot of algebra we don’t want to do.
So, let’s consider another approach. A combinatorial proof is one where you prove two quantities are equal
by imagining a situation/something to count. Then, you argue that the left side and right side are two
equivalent ways to count the same thing, and hence must be equal. We’ve seen earlier often how there are
multiple approaches to counting!
In this case, let's consider the set of numbers [n] = {1, 2, . . . , n}. We will argue that the LHS and RHS both count the number of subsets of size k.
1. LHS: \binom{n}{k} is literally the number of subsets of size k, since we just want to choose any k items out of n (order doesn't matter).
2. RHS: We take a slightly more convoluted approach, splitting on cases depending on whether or not
the number 1 was included in the subset.
Case 1: Our subset of size k includes the number 1. Then we need to choose k − 1 of the remaining n − 1 numbers (n numbers excluding 1 is n − 1 numbers) to make a subset of size k which includes 1. There are \binom{n−1}{k−1} ways to do this.
Case 2: Our subset of size k does not include the number 1. Then we need to choose k numbers from the remaining n − 1 numbers. There are \binom{n−1}{k} ways to do this. So, in total we have \binom{n−1}{k−1} + \binom{n−1}{k} possible subsets of size k.
Note: We could have chosen any number to single out (not just 1).
Since the left-hand side (LHS) and right-hand side (RHS) count the same thing, they must be equal! Note
that we dreamed up this situation, and you may wonder how we did - this just comes from practicing many
types of counting problems. You’ll get used to it!
That is, we can take the following three-step process to show n = m:
1. Let S be some set of objects (you decide).
2. Show |S| = n using one method of counting.
3. Show |S| = m using another method of counting.
Example(s)
Prove the following identities using a combinatorial argument:
1. \binom{n}{m} \binom{m}{k} = \binom{n}{k} \binom{n−k}{m−k}, for integers 0 ≤ k ≤ m ≤ n.
2. 2^n = \sum_{k=0}^{n} \binom{n}{k}.
Solution
1. We'll show that both sides count, from a group of n people, the number of ways to form a committee of size m and, within that committee, a subcommittee of size k.
Left-hand side: We first choose m people to be on the committee from n total; there are \binom{n}{m} ways to do so. Then, within those m, we choose k to be on a specialized subcommittee; there are \binom{m}{k} ways to do so. By the product rule, the number of ways to assign these is \binom{n}{m} \binom{m}{k}.
Right-hand side: We first choose which k people are on the subcommittee of size k; there are \binom{n}{k} ways to do so. From the remaining n − k people, we choose m − k to be on the committee (but not the subcommittee). By the product rule, the number of ways to assign these is \binom{n}{k} \binom{n−k}{m−k}.
Since the LHS and RHS both count the same thing, they must be equal.
2. We'll argue that both sides count the number of subsets of the set [n] = {1, 2, . . . , n}.
Left-hand side: Each element we can have in our subset or not. For the first element, we have 2 choices (in or out). For the second element, we also have 2 choices (in or out). And so on. So the number of subsets is 2^n.
Right-hand side: The subset can be of any size ranging from 0 to n, so we have a sum. Now how many subsets are there of size exactly k? There are \binom{n}{k}, because we choose k out of n to have in our set (and order doesn't matter in sets)! Hence, the number of subsets is \sum_{k=0}^{n} \binom{n}{k}.
Since the LHS and RHS both count the same thing, they must be equal. It's cool to note we can also prove this with the binomial theorem, setting x = 1 and y = 1 - try this out! It takes just one line!
1.3.5 Exercises
1. Let [n] = {1, 2, . . . , n}. How many (ordered) pairs of subsets (A, B) are there such that A ⊆ B ⊆ [n]?
For example, if n = 5, then A = {1, 3} and B = {1, 3, 4} is a possible pair!
Solution: There are two ways to do this question, which is always great!
(a) Method 1: We will choose B first. There are no restrictions on the size of B since it just has to be a subset of [n]. B can be of size 0, . . . , n, and if it is of size k, then there are \binom{n}{k} such subsets. Now supposing we have B of size k, A must be a subset of it, so there are 2^k ways to choose A. Hence, we have the sum over all possible ways to choose B (k = 0, 1, . . . , n), and if it is of size k, there are 2^k ways to choose A:
\sum_{k=0}^{n} \binom{n}{k} 2^k = \sum_{k=0}^{n} \binom{n}{k} 2^k 1^{n−k} = 3^n
where the last step is the binomial theorem with x = 2 and y = 1.
(b) Method 2: Alternatively, each of the n elements is in exactly one of three "zones": in both A and B, in B only, or in neither (it can't be in A only, since A ⊆ B). That's 3 independent choices per element, for 3^n pairs directly.
2. Given the 8 (distinct) letters A, B, C, D, E, F, G, H, how many orderings of them are there such that A is not the first letter and H is not the last?
Solution: We'll use complementary counting and inclusion-exclusion. Let SA be the set of orderings where A is at the beginning, and SH the set of orderings where H is at the end. Then, we want |U \ (SA ∪ SH )| = |U| − |SA ∪ SH |, where U is the universal set of possible orderings (we want everything except SA ∪ SH ).
We know |U| = 8! since that is the number of permutations of 8 letters. So now, we compute by
inclusion-exclusion:
|SA ∪ SH | = |SA | + |SH | − |SA ∩ SH |
For |SA |, we need to put A in the first position, so there are 7! orderings total for the remaining letters.
Similarly, |SH | = 7!. Then, |SA ∩ SH | requires us to put A at the beginning AND H at the end, giving
only 6! arrangements for the remaining letters. Hence, our answer is 8! − (7! + 7! − 6!) = 40320 − 9360 = 30960.
3. These problems involve using the pigeonhole principle. How many cards must you draw from a standard
52-card deck (4 suits and 13 cards of each suit) until you are guaranteed to have:
(a) A single pair? (e.g., AA, 99, JJ)
(b) Two (different) pairs? (e.g., AAKK, 9933, 44QQ)
(c) A full house (a triple and a pair)? (e.g., AAAKK, 99922, 555JJ)
(d) A straight (5 in a row, with the lowest being A,2,3,4,5 and the highest being 10,J,Q,K,A)?
(e) A flush (5 cards of the same suit)? (e.g., 5 hearts, 5 diamonds)
(f) A straight flush (5 cards which are both a straight and a flush)?
Solution:
(a) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. So the answer is 14.
(b) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. But then we could draw the other two of that pair as well to get 16 still without two pairs.
So the answer is 17.
(c) The worst that could happen is to draw all pairs (26 cards). Then the next is guaranteed to cause
a triple. So the answer is 27.
(d) The worst that could happen is to draw all the A - 4, 6 - 9, and J - K. After drawing these
11 · 4 = 44 cards, we could still fail to have a straight. Finally, getting a 5 or 10 would give us a
straight. So the answer is 45.
(e) The worst that could happen is to draw 4 of each suit (16 cards), and still not have a flush. So
the answer is 17.
(f) Same as straight, 45.
Application Time!!
Now you’ve learned enough theory and you probably want a break from all
this math... Discover the Python programming language covered in section
9.1 (which will direct you to a Google Slides Presentation). You are highly
encouraged to read that section before moving on, as you’ll need it as soon
as 2.1 and 2.3 hit with some exciting applications!
This book will also try to convince you that probability is useful in your field of computer science, so we need to start by learning a common programming language. I've also provided starter code for all of the applications we will investigate so you can see for yourself!
Chapter 9 will be spread out across the book indicated by this page titled
“Application Time”, to give you a break from math and also get you excited.
Sorry for the weird format!
Chapter 2. Discrete Probability
2.1: Intro to Discrete Probability
Slides (Google Drive) Video (YouTube)
We’re just about to learn about the axioms (rules) of probability, and see how all that counting stuff from
chapter 1 was relevant at all. This should align with your current understanding of probability (I only assume
you might be able to tell me the probability I roll an even number on a fair six-sided die at this point), and
formalize it.
We’ll be using a lot of set theory from here on out, so review that in Chapter 0 if you need to!
2.1.1 Definitions
Definition 2.1.1: Sample Space
The sample space Ω of an experiment is the set of all possible outcomes.
Example(s)
What is the sample space Ω of each of the following experiments: a single coin flip, two coin flips, and the roll of a (6-sided) die?
Solution
1. The sample space of a single coin flip is: Ω = {H, T } (heads or tails).
2. The sample space of two coin flips is: Ω = {HH, HT, T H, T T }.
3. The sample space of the roll of a die is: Ω = {1, 2, 3, 4, 5, 6}.
An event is any subset E ⊆ Ω of the sample space.
Example(s)
Write out the following events as subsets of the relevant sample space: getting at least one head in two coin flips, and rolling an even number on a 6-sided die.
Solution
1. Getting at least one head in two coin flips: E = {HH, HT, T H}
2. Rolling an even number: E = {2, 4, 6}
Events E and F are mutually exclusive if E ∩ F = ∅ (i.e., they can't happen simultaneously).
Example(s)
Say E is the event of rolling an even number: E = {2, 4, 6}, and F is the event of rolling an odd number: F = {1, 3, 5}. Are E and F mutually exclusive?
Solution Yes: no roll is both even and odd, so E ∩ F = ∅, meaning E and F are mutually exclusive.
Example(s)
Let’s consider another example in which our experiment is the rolling of two fair 4-sided dice,
one which is blue D1 and one which is red D2 (so they are distinguishable, or effectively, order
matters). We can represent each element in the sample set as an ordered pair (D1, D2) where
D1, D2 ∈ {1, 2, 3, 4} and represent the respective value rolled by the blue and red die.
The sample space Ω is the set of all possible ordered pairs of values that could be rolled by
the die (|Ω| = 4 · 4 = 16 by the product rule). Let’s consider some events:
1. A = {(1, 1), (1, 2), (1, 3), (1, 4)}, the event that the blue die, D1 is a 1.
2. B = {(2, 4), (3, 3), (4, 2)}, the event that the sum of the two rolls is 6 (D1 + D2 = 6).
3. C = {(2, 1), (4, 2)}, the event that the value on the blue die is twice the value on the red die (D1 = 2 · D2).
All of these events and the sample space are shown below:
Solution Now, let’s consider whether A and B are mutually exclusive. Well, they do not overlap, as we can
see that A ∩ B = ∅, so yes they are mutually exclusive.
B and C are not mutually exclusive, since there is a case in which they can happen at the same time: B ∩ C = {(4, 2)} ≠ ∅, so they are not mutually exclusive.
Again, to summarize, we learned that Ω was the sample space (set of all outcomes of an experiment), and that an event E ⊆ Ω is a subset of those outcomes.
2.1.2 Axioms of Probability
The word "axiom" means: things that we take for granted and assume to be true without proof. The axioms of probability are:
1. (Non-negativity) For any event E, P (E) ≥ 0.
2. (Normalization) P (Ω) = 1.
3. (Countable Additivity) If E and F are mutually exclusive events, then P (E ∪ F ) = P (E) + P (F ) (and similarly for any countable collection of pairwise mutually exclusive events).
Corollaries:
1. (Complementation) P (E^C ) = 1 − P (E).
2. (Monotonicity) If E ⊆ F , then P (E) ≤ P (F ).
3. (Inclusion-Exclusion) P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
Note: The word “corollary” means: results that follow almost immediately from a previous result
(in this case, the axioms).
A pair (Ω, P) of a sample space Ω and a probability measure P on possible events which sat-
isfies the above axioms is called a probability space.
Explanation of Axioms
1. Non-negativity is simply because we cannot consider an event to have a negative probability. It just
wouldn’t make sense. A probability of 1/6 would mean that on average, something would happen 1
out of every 6 trials. What about a probability of −1/4?
2. Normalization is based on the fact that when we run an experiment, there must be some outcome, and
all possible outcomes are in the sample space. So, we say the probability of observing some outcome
from the sample space is 1.
3. Countable additivity is because if two events are mutually exclusive, they don't overlap at all; that is, they don't share any outcomes. This means that the union of them will contain the same outcomes as each together, so the probability of their union is the sum of their individual probabilities. (This is like the sum rule of counting.)
Explanation of Corollaries
1. Complementation is based on the fact that the sample space is all the possible outcomes. This means that E^C = Ω \ E, so P (E^C ) = 1 − P (E). (This is like complementary counting.)
2. Monotonicity is because if E is a subset of F , then all outcomes in the event E are in the event F . This means that all the outcomes that contribute to the probability of E contribute to the probability of F , so its probability is greater than or equal to that of E (since probabilities are non-negative).
3. Inclusion-Exclusion follows because if E and F have some intersection, this would be counted twice
by adding their probabilities, so we have to subtract it once to only count it once and not overcount.
(This is like inclusion-exclusion for counting).
Proof of Corollaries. The proofs of these corollaries only depend on the 3 axioms which we assume to be
true.
2.1.3 Equally Likely Outcomes
If Ω is a sample space such that each of the unique outcome elements in Ω are equally likely, then for any event E ⊆ Ω:

P (E) = |E|/|Ω|
Proof of Equally Likely Outcomes Formula. If outcomes are equally likely, then for any outcome in the sample space ω ∈ Ω, we have P ({ω}) = 1/|Ω| (since there are |Ω| total outcomes). Then, if we list the |E| outcomes that make up event E, we can write

E = {ω1 , ω2 , . . . , ω|E| }
Every set is the union of the (mutually exclusive) singleton sets containing each element (e.g., {1, 2, 3} =
{1} ∪ {2} ∪ {3}), and so by countable additivity, we get
P (E) = P (∪_{i=1}^{|E|} {ωi }) = Σ_{i=1}^{|E|} P ({ωi })    [countable additivity axiom]
      = Σ_{i=1}^{|E|} 1/|Ω|    [equally likely outcomes]
      = |E|/|Ω|    [sum constant |E| times]
The notation in the first line is like summation or product notation: just union all the sets {ω1 } ∪ {ω2 } ∪
· · · ∪ {ω|E| }.
Example(s)
Consider the example of rolling the red and blue fair 4-sided dice again (above), a blue die D1 and a
red die D2. What is the probability that the two die’s rolls sum up to 6?
Solution We called that event B = {(2, 4), (3, 3), (4, 2)}. What is the probability of the event B happening?
Well, the 16 possible outcomes that make up all the elements of Ω are each equally likely because each die
has an equal chance of landing on any of the 4 numbers. So P (B) = |B|/|Ω| = 3/16, so the probability is 3/16.
Example(s)
Let’s say the year 2500 U.S. Presidential Election has two candidates due to scientific advancements
allowing us to revive someone: George Washington and Abraham Lincoln. Each of the 100 citizens
of the U.S. is equally likely to vote for either of the two candidates. What is the probability that
George Washington wins by a vote of 74 to 26?
Solution Let Ω be the set of all possible voting patterns of length 100, each a W (for Washington) or L (for Lincoln). Let E be the event described, meaning we have exactly 74 W's and 26 L's. The outcomes are equally likely, and |Ω| = 2^100 (each of the citizens has two choices). The size of E is just (100 choose 74): we just choose which 74 voters voted for W. Hence, since outcomes are equally likely:

P (E) = |E|/|Ω| = (100 choose 74) / 2^100
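Since this book's applications are in Python, here is how you might evaluate this probability exactly (a sketch, not part of the original text; math.comb is the standard-library binomial coefficient):

from math import comb

# P(E) = (100 choose 74) / 2^100
p = comb(100, 74) / 2**100
print(p)  # a tiny number, on the order of 10^-7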
Example(s)
Suppose we flip a fair coin twice. What is the probability we get at least one head?
Proposed Answer: Since we could either get 0, 1, or 2 heads, we can define our sample space to be Ω = {0, 1, 2}. Then, our event space is E = {1, 2}. So our probability is P (E) = |E|/|Ω| = 2/3.
Explain the flaw in the above reasoning.
Solution First, let me say it is definitely okay to define the sample space like that - that’s not the issue.
However, our formula only works when the outcomes are equally likely, which they ARE NOT in this case.
We actually have
P (0) = 1/4,   P (1) = 2/4,   P (2) = 1/4

because 0 happens when we get TT, 1 happens when we get HT or TH, and 2 happens when we get HH. So indeed, P (E) = P (1) + P (2) = 2/4 + 1/4 = 3/4, but we couldn't just use our formula from earlier. This is a warning: always check that the outcomes are equally likely before using that formula!
2.1.4 Exercises
1. If there are 5 people named A, B, C, D, and E, and they are randomly arranged in a row (with each
ordering equally likely), what is the probability that A and B are placed next to each other?
Solution: The size of the sample space is the number of ways to organize 5 people randomly, which
is |Ω| = 5! = 120. The event space is the number of ways to have A and B sit next to each other. We did a similar problem in 1.1, and so the answer was 2! · 4! = 48 (why?). Hence, since the outcomes are equally likely, P (E) = |E|/|Ω| = 48/120.
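As a sanity check, we can brute-force this answer by enumerating all 5! orderings (a sketch, not part of the original text):

from itertools import permutations

# Count orderings of A,B,C,D,E in which A and B sit next to each other.
count = sum(1 for seating in permutations("ABCDE")
            if abs(seating.index("A") - seating.index("B")) == 1)
print(count)  # 48, matching 2! * 4!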
2. Suppose I draw 4 cards from a standard 52-card deck. What is the probability they are all aces (there
are exactly 4 aces in a deck)?
Solution: There are two ways to define our sample space, one where order matters, and one where
it doesn’t. These two approaches are equivalent.
(a) If order matters, then |Ω| = P (52, 4) = 52 · 51 · 50 · 49, as the number of ways to pick 4 cards out of 52. The event space E is the number of ways to pick all 4 aces (with order mattering), which is P (4, 4) = 4 · 3 · 2 · 1. Hence, since the outcomes are equally likely,

P (E) = |E|/|Ω| = P (4, 4)/P (52, 4) = (4 · 3 · 2 · 1)/(52 · 51 · 50 · 49)
(b) If order does not matter, then |Ω| = (52 choose 4), since we just care which 4 out of 52 cards we get. Then, there is only (4 choose 4) = 1 way to get all 4 aces, and, since the outcomes are equally likely,

P (E) = |E|/|Ω| = (4 choose 4)/(52 choose 4) = (P (4, 4)/4!)/(P (52, 4)/4!) = (4 · 3 · 2 · 1)/(52 · 51 · 50 · 49)
Notice how it did not matter whether order mattered or not, but we had to be consistent! The 4!
accounting for the ordering of the 4 cards gets cancelled out :).
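We can verify that both approaches agree numerically (a sketch, not part of the original text; math.perm and math.comb are the standard-library counting functions):

from math import comb, perm

p_ordered = perm(4, 4) / perm(52, 4)   # order matters
p_unordered = 1 / comb(52, 4)          # order doesn't matter
print(p_ordered, p_unordered)          # both ~3.69e-06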
3. Given 3 different spades (S) and 3 different hearts (H), shuffle them. Compute P (E), where E is the
event that the suits of the shuffled cards are in alternating order (e.g., SHSHSH or HSHSHS)
Solution: The sample space size |Ω| is the number of ways to order the 6 (distinct) cards: 6!. The number of ways to organize the three spades is 3!, and same for the three hearts. Once we do that, we either lead with spades or hearts, so we get 2 · (3!)^2 for the size of our event space E. Hence, since the outcomes are equally likely, P (E) = |E|/|Ω| = 2 · (3!)^2/6! = 72/720 = 1/10.
Note that all of these exercises are just counting two things! We count the size of the sample space, then
the event space and divide them. It is very important to acknowledge that we can only do this when the
outcomes are equally likely.
You can see how we can get even more fun and complicated problems - the three exercises above displayed
counting problems on the “easier side”. The reason we didn’t give “harder” problems is because computing
probability in the case of equally likely outcomes reduces to doing two counting problems (counting |E| and
|Ω|, where computing |Ω| is generally easier than computing |E|). Just use the techniques from Chapter 1
to do this!
Application Time!!
Now you've learned enough theory to discover probability via simulation, covered in section 9.2. You are highly encouraged to read that section before moving on!

Chapter 2. Discrete Probability
2.2: Conditional Probability
Slides (Google Drive) Video (YouTube)
Sometimes we would like to incorporate new information into our probability. For example, you may be
feeling symptoms of some disease, and so you take a test to see whether you have it or not. Let D be the
event you have a disease, and T be the event you test positive (T C is the event you test negative). It could
be that P (D) = 0.01 (1% chance of having the disease without knowing anything). But how can we update
this probability given that we tested positive (or negative)? This will be written as P (D | T ) or P (D | T^C ) respectively. You would think P (D | T ) > P (D) since you're more likely to have the disease once you test positive, and P (D | T^C ) < P (D) since you're less likely to have the disease once you test negative. These
are called conditional probabilities - they are the probability of an event, given that you know some other
event occurred. Is there a formula for updating P (D) given new information? Yes!
Let’s go back to the example of students in CSE312 liking donuts and ice cream. Recall we defined event
A as liking ice cream and event B as liking donuts. Then, remember we had 36 students that only like ice
cream (A ∩ B C ), 7 students that like donuts and ice cream (A ∩ B), and 13 students that only like donuts
(B ∩ AC ). Let’s also say that we have 14 students that don’t like either (AC ∩ B C ). That leaves us with the
following picture, which makes up the whole sample space:
Now, what if we asked the question, what’s the probability that someone likes ice cream, given that we
know they like donuts? We can approach this with the knowledge that 20 of the students like donuts (13
who don’t like ice cream and 7 who do). What this question is getting at, is: given the knowledge that
someone likes donuts, what is the chance that they also like ice cream? Well, 7 of the 20 who like donuts like ice cream, so we are left with the probability 7/20. We write this as P (A | B) (read the "probability of A given B").
P (A | B) = 7/20
          = |A ∩ B| / |B|    [|B| = 20 people like donuts, |A ∩ B| = 7 people like both]
          = (|A ∩ B|/|Ω|) / (|B|/|Ω|)    [divide top and bottom by |Ω|, which is equivalent]
          = P (A ∩ B) / P (B)    [if we have equally likely outcomes]
This intuition (which worked only in the special case of equally likely outcomes) leads us to the definition of conditional probability:
The conditional probability of event A given that event B happened (where P (B) > 0) is:
P (A | B) = P (A ∩ B) / P (B)
An equivalent and useful formula we can derive (by multiplying both sides by the denominator, P (B), and switching the sides of the equation) is:

P (A ∩ B) = P (A | B) P (B)
A common misconception is that P (A | B) = P (B | A); this is false, as we can show with some examples. In the above example with ice cream, we showed already that P (A | B) = 7/20, but P (B | A) = 7/43 (of the 36 + 7 = 43 students who like ice cream, only 7 like donuts), and these are not equal.
Consider another example where W is the event that you are wet and S is the event you are swimming. Then, the probability you are wet given you are swimming is P (W | S) = 1, as if you are swimming you are certainly wet. But the probability you are swimming given you are wet is P (S | W ) ≠ 1, because there are numerous other reasons you could be wet that don't involve swimming (being in the rain, showering, etc.).
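To make the asymmetry concrete, here is a tiny computation with the counts from the donuts/ice cream example (a sketch, not part of the original text):

# 7 like both, 36 like only ice cream (A), 13 like only donuts (B).
both, only_A, only_B = 7, 36, 13
print(both / (both + only_B))  # P(A | B) = 7/20 = 0.35
print(both / (both + only_A))  # P(B | A) = 7/43 ~ 0.163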
This brings us to one of the most important results in probability, Bayes' Theorem: for events A and B with nonzero probability,

P (A | B) = P (B | A) P (A) / P (B)
Note that in the above P (A) is called the prior, which is our belief without knowing anything about
event B. P (A | B) is called the posterior, our belief after learning that event B occurred.
This theorem is important because it allows us to "reverse the conditioning"! Notice that both P (A | B) and P (B | A) appear in this equation on opposite sides. So if we know P (A) and P (B) and can more easily calculate one of P (A | B) or P (B | A), we can use Bayes' Theorem to derive the other.
Proof of Bayes’ Theorem. Recall the (alternate) definition of conditional probability from above:
P (A ∩ B) = P (A | B) P (B) (2.2.6)
P (B ∩ A) = P (B | A) P (A) (2.2.7)
But, because A ∩ B = B ∩ A (since these are the outcomes in both events A and B, and the order of intersection does not matter), P (A ∩ B) = P (B ∩ A), so (2.2.6) and (2.2.7) are equal and we have (by setting the right-hand sides equal):

P (A | B) P (B) = P (B | A) P (A)

Dividing both sides by P (B) gives:

P (A | B) = P (B | A) P (A) / P (B)
Wow, I wish I was alive back then and had this important (and easy to prove) theorem named after me!
Example(s)
We’ll investigate two slightly different questions whose answers don’t seem that they should be differ-
ent, but are. Suppose a family has two children (whom at birth, were each equally likely to be male
or female). Let’s say a telemarketer calls home and one of the two children picks up.
1. If the child who responded was male, and says “Let me get my older sibling”, what is the
probability that both children are male?
2. If the child who responded was male, and says “Let me get my other sibling”, what is the
probability that both children are male?
Solution There are four equally likely outcomes, MM, MF, FM, and FF (where M represents male and F
represents female). Let A be the event both children are male.
1. In this part, we’re given that the younger sibling is male. So we can rule out 2 of the 4 outcomes
above and we’re left with MF and MM. Out of these two, in one of these cases we get MM, and so our
desired probability is 1/2.
More formally, let this event be B, which happens with probability 2/4 (2 out of 4 equally likely outcomes). Then, P (A | B) = P (A ∩ B)/P (B) = (1/4)/(2/4) = 1/2, since P (A ∩ B) is the probability both children
are male, which happens in 1 out of 4 equally likely scenarios. This is because the older sibling’s sex
is independent of the younger sibling’s, so knowing the younger sibling is male doesn’t change the
probability of the older sibling being male (which is what we computed just now).
2. In this part, we’re given that at least one sibling is male. That is, out of the 4 outcomes, we can only
rule out the FF option. Out of the remaining options MM, MF, and FM, only one has both siblings
being male. Hence, the probability desired is 1/3. You can do a similar more formal argument like we
did above!
See how a slight wording change changed the answer?
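If the wording change feels suspicious, a quick simulation confirms both answers (a sketch, not part of the original text):

import random

random.seed(0)
kids = [(random.choice("MF"), random.choice("MF")) for _ in range(10**6)]

# Condition on "the younger child is male" (first coordinate).
younger_male = [k for k in kids if k[0] == "M"]
# Condition on "at least one child is male".
at_least_one_male = [k for k in kids if "M" in k]

print(sum(k == ("M", "M") for k in younger_male) / len(younger_male))            # ~1/2
print(sum(k == ("M", "M") for k in at_least_one_male) / len(at_least_one_male))  # ~1/3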
We’ll see a disease testing example later, which requires the next section first. If you test positive for a
disease, how concerned should you be? The result may surprise you!
Events E1 , . . . , En partition the sample space Ω if they are mutually exclusive (Ei ∩ Ej = ∅ whenever i ≠ j) and exhaustive (E1 ∪ · · · ∪ En = Ω). Note that an event E and its complement E^C always form a partition of Ω.
Example(s)
You can see that partition is a very appropriate word here! In the first image, the four events
E1 , . . . , E4 don’t overlap and cover the sample space. In the second image, the two events E, E C do
the same thing! This is useful when you know exactly one of a few things will happen. For example,
for the chemistry example, there might be only three teachers, and you will be assigned to exactly
one of them: at most one because you can’t have two teachers (mutually exclusive), and at least one
because there aren’t other teachers possible (exhaustive).
Now, suppose we have some event F which intersects with various events that form a partition of Ω. This
is illustrated by the picture below:
Notice that F is composed of its intersection with each of E1 , E2 , and E3 , and so we can split F up into
smaller pieces. This means that we can write the following (green chunk F ∩ E1 , plus pink chunk F ∩ E2
plus yellow chunk F ∩ E3 ):
P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 )
Note that F and E4 do not intersect, so F ∩ E4 = ∅. For completion, we can include E4 in the above
equation, because P (F ∩ E4 ) = 0. So, in all we have:
P (F ) = P (F ∩ E1 ) + P (F ∩ E2 ) + P (F ∩ E3 ) + P (F ∩ E4 )

More generally, if events E1 , . . . , En partition the sample space Ω, then for any event F :

P (F ) = P (F ∩ E1 ) + · · · + P (F ∩ En ) = Σ_{i=1}^{n} P (F ∩ Ei )

Applying the definition of conditional probability to each term (P (F ∩ Ei ) = P (F | Ei ) P (Ei )) gives the Law of Total Probability (LTP):

P (F ) = P (F | E1 ) P (E1 ) + · · · + P (F | En ) P (En ) = Σ_{i=1}^{n} P (F | Ei ) P (Ei )
That is, to compute the probability of an event F overall: suppose we have n disjoint cases E1 , . . . , En for which we can (easily) compute the probability of F in each of these cases (P (F | Ei )). Then, take the weighted average of these probabilities, using the probabilities P (Ei ) as weights (the probability of being in each case).
Proof of Law of Total Probability. We’ll use the picture above for inspiration. Since the Ei ’s are exhaustive,
we have that

Ω = ∪_{i=1}^{n} Ei
Then,

F = F ∩ Ω    [F ⊆ Ω]
  = F ∩ (∪_{i=1}^{n} Ei )    [exhaustive]
  = ∪_{i=1}^{n} (F ∩ Ei )    [distributive property]
The above basically just explains how to decompose F into the smaller chunks (as we saw in the picture).
But all the n events of the form (F ∩ Ei ) are mutually exclusive since the Ei ’s themselves are. Hence, by
Axiom 3 (countable additivity for disjoint events),
P (F ) = P (∪_{i=1}^{n} (F ∩ Ei )) = Σ_{i=1}^{n} P (F ∩ Ei )

Finally, applying the definition of conditional probability to each term, P (F ∩ Ei ) = P (F | Ei ) P (Ei ), gives the second form.
Example(s)
Let’s consider an example in which we are trying to determine the probability that we fail chemistry.
Let’s call the event F failing, and consider the three events E1 for getting the Mean Teacher, E2 for
getting the Nice Teacher, and E3 for getting the Hard Teacher which partition the sample space. The
following table gives the relevant probabilities:

              E1 (Mean)    E2 (Nice)    E3 (Hard)
P (Ei )          6/8          1/8          1/8
P (F | Ei )       1            0           1/2
Solution Before doing anything, how are you liking your chances? There is a high probability (6/8) of getting
the Mean Teacher, and she will certainly fail you. Therefore, you should be pretty sad.
Now let’s do the computation. Notice that the first row sums to 1, as it must, since events E1 , E2 , E3
partition the sample space (you have exactly one of the three teachers). Using the Law of Total Probability
(LTP), we have the following:
P (F ) = Σ_{i=1}^{3} P (F | Ei ) P (Ei ) = P (F | E1 ) P (E1 ) + P (F | E2 ) P (E2 ) + P (F | E3 ) P (E3 )
       = 1 · 6/8 + 0 · 1/8 + 1/2 · 1/8 = 13/16
Notice that to get the probability of failing, what we did was: consider the probability of failing in each of the 3 cases, and take a weighted average using the probability of each case. This is exactly what the law of total probability lets us do!
Example(s)
Misfortune struck us and we ended up failing chemistry class. What is the probability that we had
the Hard Teacher given that we failed?
Solution First, this probability should be low intuitively because if you failed, it was probably due to the
Hard Teacher (because you are more likely to get them, AND because they have a high fail rate of 100%).
Start by writing out in a formula what you want to compute; in our case, it is P (E3 | F ) (getting the hard
teacher given that we failed). We know P (F | E3 ) and we want to solve for P (E3 | F ). This is a hint to use
Bayes’ Theorem since we can reverse the conditioning! Using that with the numbers from the table and the
previous question:
P (E3 | F ) = P (F | E3 ) P (E3 ) / P (F )    [Bayes' theorem]
            = (1/2 · 1/8) / (13/16)
            = 1/13
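Both of the last two computations are easy to express in code (a sketch, not part of the original text; the numbers come from the table above):

# (P(teacher), P(fail | teacher)) for the Mean, Nice, and Hard teachers.
cases = [(6/8, 1.0), (1/8, 0.0), (1/8, 0.5)]

p_fail = sum(p_case * p_fail_case for p_case, p_fail_case in cases)  # LTP
p_hard_given_fail = (1/8) * 0.5 / p_fail                             # Bayes' theorem
print(p_fail)             # 13/16 = 0.8125
print(p_hard_given_fail)  # 1/13 ~ 0.0769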
Let events E1 , . . . , En partition the sample space Ω, and let F be another event. Then:
P (E1 | F ) = P (F | E1 ) P (E1 ) / P (F )    [by Bayes' theorem]
            = P (F | E1 ) P (E1 ) / Σ_{i=1}^{n} P (F | Ei ) P (Ei )    [by the law of total probability]
In particular, in the case of a simple partition of Ω into E and E C , if E is an event with nonzero
probability, then:
P (E | F ) = P (F | E) P (E) / P (F )    [by Bayes' theorem]
           = P (F | E) P (E) / (P (F | E) P (E) + P (F | E^C ) P (E^C ))    [by the law of total probability]
2.2.5 Exercises
1. Suppose the llama flu disease has become increasingly common, and now 0.1% of the population has it (1 in 1000 people). Suppose there is a test for it which is 98% accurate (i.e., 2% of the time it will give the wrong answer). Given that you tested positive, what is the probability you have the disease?
Before any computation, think about what you think the answer might be.
Solution: Let L be the event you have the llama flu, and T be the event you test positive (T^C is the event you test negative). You are asked for P (L | T ). We do know P (T | L) = 0.98 because if you have the llama flu, the probability you test positive is 98%. This gives us the hint to use Bayes' Theorem! We get that

P (L | T ) = P (T | L) P (L) / P (T )
We are given P (T | L) = 0.98 and P (L) = 0.001, but how can we get P (T ), the probability of testing
positive? Well that depends on whether you have the disease or not. When you have two or more
cases (L and LC ), that’s a hint to use the LTP! So we can write
P (T ) = P (T | L) P (L) + P (T | L^C ) P (L^C )

where P (T | L^C ), the probability of testing positive given that you don't have the disease, is 0.02 or 2% (due to the 98% accuracy). Putting this all together, we get:
P (L | T ) = P (T | L) P (L) / P (T )    [Bayes' theorem]
           = P (T | L) P (L) / (P (T | L) P (L) + P (T | L^C ) P (L^C ))    [LTP]
           = (0.98 · 0.001) / (0.98 · 0.001 + 0.02 · 0.999)
           ≈ 0.046756
Not even a 5% chance we have the disease, what a relief! But wait, how can that be? The test
is so accurate, and it said you were positive? This is because the prior probability of having the
disease P (L) was so low at 0.1% (actually this is pretty high for a disease rate). If you think about
it, the posterior probability we computed P (L | T ) is 47× larger than the prior probability P (L)
(P (L | T ) /P (L) ≈ 0.047/0.001 = 47), so the test did make it a lot more likely we had the disease after
all!
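Here is the same computation as a reusable function, so you can experiment with the prior and the test accuracy (a sketch, not part of the original text; the parameter names are mine):

def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    # P(disease | positive test), by Bayes' theorem with the LTP denominator.
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos

print(posterior(0.001, 0.98, 0.02))  # ~0.0468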
2. Suppose we have four fair dice: one with three sides, one with four sides, one with five sides, and one with six sides (the numbering of an n-sided die is 1, 2, ..., n). We pick one of the four dice, each with equal probability, and roll that same die three times. We get all 4's. What is the probability we chose the 5-sided die to begin with?
Solution: Let Di be the event we rolled the i-sided die, for i = 3, 4, 5, 6. Notice that these
D3 , D4 , D5 , D6 partition the sample space.
P (D5 | 444) = P (444 | D5 ) P (D5 ) / P (444)    [by Bayes' theorem]
             = P (444 | D5 ) P (D5 ) / (P (444 | D3 ) P (D3 ) + P (444 | D4 ) P (D4 ) + P (444 | D5 ) P (D5 ) + P (444 | D6 ) P (D6 ))    [by LTP]
             = (1/5^3 · 1/4) / (0 · 1/4 + 1/4^3 · 1/4 + 1/5^3 · 1/4 + 1/6^3 · 1/4)
             = (1/125) / (1/64 + 1/125 + 1/216)
             = 1728/6103 ≈ 0.2831
Note that we compute P (444 | Di ) by noting there's only one outcome where we get (4, 4, 4) out of the i^3 equally likely outcomes. This is true except when i = 3, where it's not possible to roll all 4's.
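The same pattern (prior times likelihood, normalized by the LTP denominator) works for any number of hypotheses (a sketch, not part of the original text):

# Posterior over which die was picked, after observing three 4's.
dice = [3, 4, 5, 6]
prior = 1 / 4
likelihood = {n: (1 / n)**3 if n >= 4 else 0.0 for n in dice}
evidence = sum(likelihood[n] * prior for n in dice)  # LTP
posterior = {n: likelihood[n] * prior / evidence for n in dice}
print(posterior[5])  # 1728/6103 ~ 0.2831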
Chapter 2. Discrete Probability
2.3: Independence
Slides (Google Drive) Video (YouTube)
2.3.1 Chain Rule
Recall the standard 52-card deck (4 suits, 13 cards of each suit). Now, suppose that we shuffle this deck and draw the top three cards. Let's define:
1. A to be the event that we get the Ace of spades as our first card.
2. B to be the event that we get the 10 of clubs as our second card.
3. C to be the event that we get the 4 of diamonds as our third card.
What is the probability that all three of these events happen? We can write this as P (A, B, C) (sometimes
we use commas as an alternative to using the intersection symbol, so this is equivalent to P (A ∩ B ∩ C)).
Note that this is equivalent to P (C, B, A) or P (B, C, A) since order of intersection does not matter.
Intuitively, you might say that this probability is 1/52 · 1/51 · 1/50, and you would be correct.
1. The first factor comes from the fact that there are 52 cards that could be drawn, and only one ace of
spades. That is, we computed P (A).
2. The second factor comes from the fact that there are 51 cards after we draw the first card and only
one 10 of clubs. That is, we computed P (B | A).
3. The final factor comes from the fact that there are 50 cards left after we draw the first two and only
one 4 of diamonds. That is, we computed P (C | A, B).
To summarize, we said that

P (A, B, C) = P (A) · P (B | A) · P (C | A, B) = 1/52 · 1/51 · 1/50
In the case of two events A, B (this is just the alternate form of the definition of conditional probability from 2.2):

P (A, B) = P (A) P (B | A)

More generally, the chain rule says that for any n events A1 , A2 , . . . , An :

P (A1 , A2 , . . . , An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , . . . , An−1 )

An easy way to remember this is: if we want to observe n events, we can observe one event at a
time, and condition on those that we’ve done thus far. And most importantly, since the order of
intersection doesn’t matter, you can actually decompose this into any of n! orderings. Make sure
you “do” one event at a time, conditioning on the intersection of ALL past events like we did above.
Proof of Chain Rule. Remember that the definition of conditional probability says P (A ∩ B) = P (A) P (B | A).
We’ll use this repeatedly to break down our P (A1 , . . . , An ). Sometimes it is easier to use commas, and some-
times it is easier to use the intersection sign ∩: for this proof, we’ll use the intersection sign. We’ll prove
this for four events, and you'll see how it can be easily extended to any number of events!

P (A1 ∩ A2 ∩ A3 ∩ A4 ) = P (A1 ∩ A2 ∩ A3 ) P (A4 | A1 ∩ A2 ∩ A3 )
                       = P (A1 ∩ A2 ) P (A3 | A1 ∩ A2 ) P (A4 | A1 ∩ A2 ∩ A3 )
                       = P (A1 ) P (A2 | A1 ) P (A3 | A1 ∩ A2 ) P (A4 | A1 ∩ A2 ∩ A3 )

Note how we keep "chaining" and applying the definition of conditional probability repeatedly!
Example(s)
Consider the 3-stage process. We roll a 6-sided die (numbered 1-6), call the outcome X. Then, we
roll a X-sided die (numbered 1-X), call the outcome Y . Finally, we roll a Y -sided die (numbered
1-Y ), call the outcome Z. What is P (Z = 5)?
Solution There are only three things that could have happened for the triplet (X, Y, Z) so that Z takes on
the value 5: {(6, 6, 5), (6, 5, 5), (5, 5, 5)}. So
P (Z = 5) = P (X = 6, Y = 6, Z = 5) + P (X = 6, Y = 5, Z = 5) + P (X = 5, Y = 5, Z = 5)    [cases]
          = 1/6 · 1/6 · 1/6 + 1/6 · 1/6 · 1/5 + 1/6 · 1/5 · 1/5    [chain rule 3x]
How did we use the chain rule? Let's see for example the last term:

P (X = 5, Y = 5, Z = 5) = P (X = 5) P (Y = 5 | X = 5) P (Z = 5 | X = 5, Y = 5)

P (X = 5) = 1/6 because we rolled a 6-sided die. P (Y = 5 | X = 5) = 1/5 since we rolled a (X = 5)-sided die. Finally, P (Z = 5 | X = 5, Y = 5) = P (Z = 5 | Y = 5) = 1/5 since we rolled a (Y = 5)-sided die. Note we didn't need to know X once we knew Y = 5!
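This three-stage process is also easy to simulate, which is a good way to double-check chain rule computations (a sketch, not part of the original text):

import random

random.seed(0)

def roll_z():
    x = random.randint(1, 6)     # roll a 6-sided die
    y = random.randint(1, x)     # roll an X-sided die
    return random.randint(1, y)  # roll a Y-sided die

n = 10**6
print(sum(roll_z() == 5 for _ in range(n)) / n)  # ~0.0169 = 1/216 + 1/180 + 1/150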
2.3.2 Independence
Let’s say we flip a fair coin 3 times independently (whatever that means) - what is the probability of getting
all heads? You may be inclined to say (1/2)^3 = 1/8 because the probability of getting heads each time is
just 1/2. This is indeed correct! However, we haven’t formally learned such a rule to compute the joint
probability P (H1 ∩ H2 ∩ H3 ) yet, except for the chain rule.
Using only what we've learned, we could consider equally likely outcomes. There are 2^3 = 8 possible outcomes when flipping a coin three times (by product rule), and only one of those (HHH) makes up the event we care about: H1 ∩ H2 ∩ H3 . Since the outcomes are equally likely,

P (H1 ∩ H2 ∩ H3 ) = |H1 ∩ H2 ∩ H3 |/|Ω| = |{HHH}|/2^3 = 1/8
We’d love a rule to say P (H1 ∩ H2 ∩ H3 ) = P (H1 ) · P (H2 ) · P (H3 ) = 1/2 · 1/2 · 1/2 = 1/8 - and it turns out
this is true when the events are independent!
But first, let’s consider the smaller case: does P (A, B) = P (A) P (B) in general? No! How do we know this
though? Well recall that by the chain rule, we know that:
P (A, B) = P (A) P (B | A)
So, unless P (B | A) = P (B) the equality does not hold. However, when this equality does hold, it is a special
case, which brings us to independence.
Events A and B are independent if any of the following equivalent statements hold:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Intuitively, what it means for P (A | B) = P (A) is that: given that we know B happened, the probability of observing A is the same as if we didn't know anything. So, event B has no influence on whether or not A happens.
What about independence of more than just two events? We call this concept "mutual independence" (but most of the time we don't even say the word "mutual"). You might think that for events A1 , A2 , A3 , A4 to be (mutually) independent, by extension of the definition for two events, we would just need

P (A1 ∩ A2 ∩ A3 ∩ A4 ) = P (A1 ) P (A2 ) P (A3 ) P (A4 )

But it turns out, we need this property to hold for any subset of the 4 events. For example, the following must be true (in addition to others):

P (A1 ∩ A3 ) = P (A1 ) P (A3 ),   P (A2 ∩ A3 ∩ A4 ) = P (A2 ) P (A3 ) P (A4 )

For all 2^n subsets of the 4 events (2^4 = 16 in our case), the probability of the intersection must simply be the product of the individual probabilities.
As you can see, it would be quite annoying to check even if three events were (mutually) independent.
Luckily, most of the time we are told to assume that several events are (mutually) independent and we get
all of those statements to be true for free. We are rarely asked to demonstrate/prove mutual independence.
We say n events A1 , A2 , . . . , An are (mutually) independent if, for any subset I ⊆ [n] = {1, 2, . . . , n}, we have

P (∩_{i∈I} Ai ) = Π_{i∈I} P (Ai )
This is very similar to the last formula P (A, B) = P (A) P (B) in the definition of independence for two events, just extended to multiple events. It must hold for any subset of the n events, and so this equation is actually saying 2^n equations are true!
Example(s)
Suppose we have the following network, in which circles represents a node in the network (A, B, C,
and D) and the links have the probabilities p, q, r and s of successfully working. That is, for example,
the probability of successful communication from A to B is p. Each link is independent of the others
though.
Now, let’s consider the question, what is the probability that A and D can successfully communicate?
Solution There are two ways in which it can communicate: (1) in the top path via B or (2) in the bottom
path via C. Let’s define the event top to be successful communication in the top path and the event bottom
to be successful communication in the bottom path. Let’s first consider the probabilities of each of these
being successful communication. For the top to be a valid path, both links AB and BD must work:

P (top) = P (AB ∩ BD)
        = P (AB) P (BD)    [by independence]
        = pq

Similarly:
P (bottom) = P (AC ∩ CD)
= P (AC) P (CD) [by independence]
= rs
So, to calculate the probability of successful communication between A and D, we can take the union of top
and bottom (we just need at least one of the two to work), and so we have:
P (top ∪ bottom) = P (top) + P (bottom) − P (top ∩ bottom) [by inclusion-exclusion]
= P (top) + P (bottom) − P (top) P (bottom) [by independence]
= pq + rs − pqrs
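We can double-check this formula by brute force, summing the probabilities of all 2^4 link on/off configurations in which A and D connect (a sketch, not part of the original text; the sample values of p, q, r, s are mine):

from itertools import product

def reliability(p, q, r, s):
    total = 0.0
    for ab, bd, ac, cd in product([False, True], repeat=4):
        prob = ((p if ab else 1 - p) * (q if bd else 1 - q)
                * (r if ac else 1 - r) * (s if cd else 1 - s))
        if (ab and bd) or (ac and cd):  # top path works, or bottom path works
            total += prob
    return total

p, q, r, s = 0.9, 0.8, 0.7, 0.6
print(reliability(p, q, r, s), p*q + r*s - p*q*r*s)  # both 0.8376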
2.3.3 Conditional Independence
Recall from the dice example earlier that P (Z = 5 | X = 5, Y = 5) = P (Z = 5 | Y = 5). This is actually another form of independence, called conditional independence! That is, given that Y = 5, the events X = 5 and Z = 5 are independent (the above equation looks exactly like P (Z = 5 | X = 5) = P (Z = 5), except with extra conditioning on Y = 5 on both sides).
Events A and B are conditionally independent given an event C if any of the following equivalent
statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Recall the definition of A and B being (unconditionally) independent below:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Notice that this is very similar to the definition of independence. There is no difference, except we
have just added in conditioning on C to every probability.
Example(s)
Suppose there is a coin C1 with P (head) = 0.3 and a coin C2 with P (head) = 0.9. We pick one
randomly with equal probability and will flip that coin 3 times independently. What is the probability
we get all heads?
Solution Let us call HHH the event of getting three heads, C1 the event of picking the first coin, and C2
the event of getting the second coin. Then we have the following:
P (HHH) = P (HHH | C1 ) P (C1 ) + P (HHH | C2 ) P (C2 )    [by the law of total probability]
        = (P (H | C1 ))^3 P (C1 ) + (P (H | C2 ))^3 P (C2 )    [by conditional independence]
        = (0.3)^3 · 1/2 + (0.9)^3 · 1/2 = 0.378
It is important to note that getting heads on the first and second flip are NOT independent. The probability
of heads on the second, given that we got heads on the first flip, is much higher since we are more likely
to have chosen coin C2 . However, given which coin we are flipping, the flips are conditionally independent. Hence, we can write P (HHH | C1 ) = (P (H | C1 ))^3 .
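A simulation makes both claims in that last paragraph concrete (a sketch, not part of the original text): the all-heads probability comes out to about 0.378, and P (H2 | H1 ) is noticeably larger than P (H2 ).

import random

random.seed(0)

def three_flips():
    p_heads = random.choice([0.3, 0.9])  # pick a coin uniformly at random
    return [random.random() < p_heads for _ in range(3)]

flips = [three_flips() for _ in range(10**6)]
print(sum(all(f) for f in flips) / len(flips))  # ~0.378

first_heads = [f for f in flips if f[0]]
print(sum(f[1] for f in first_heads) / len(first_heads))  # P(H2 | H1) ~ 0.75
print(sum(f[1] for f in flips) / len(flips))              # P(H2)      ~ 0.60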
2.3.4 Exercises
1. Corrupted by their power, the judges running the popular game show America’s Next Top Mathe-
matician have been taking bribes from many of the contestants. During each of two episodes, a given
contestant is either allowed to stay on the show or is kicked off. If the contestant has been bribing
the judges, she will be allowed to stay with probability 1. If the contestant has not been bribing
the judges, she will be allowed to stay with probability 1/3, independent of what happens in earlier
episodes. Suppose that 1/4 of the contestants have been bribing the judges. The same contestants
bribe the judges in both rounds.
(a) If you pick a random contestant, what is the probability that she is allowed to stay during the
first episode?
(b) If you pick a random contestant, what is the probability that she is allowed to stay during both
episodes?
(c) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she gets kicked off during the second episode?
(d) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she was bribing the judge?
Solution:
(a) Let Si be the event a contestant stays in the ith episode, and B be the event a contestant is bribing the judges. Then, by the law of total probability,

P (S1 ) = P (S1 | B) P (B) + P (S1 | B^C ) P (B^C ) = 1 · 1/4 + 1/3 · 3/4 = 1/2
(b) By the law of total probability, and using the fact that, given B or B^C , the two episodes are conditionally independent:

P (S1 ∩ S2 ) = P (S1 ∩ S2 | B) P (B) + P (S1 ∩ S2 | B^C ) P (B^C ) = 1 · 1 · 1/4 + (1/3)^2 · 3/4 = 1/3

Again, it's important to note that staying on the first and second episode are NOT independent. If we know she stayed on the first episode, then it is more likely she stays on the second (since she's more likely to be bribing the judges). However, conditioned on whether or not she is bribing the judges, S1 and S2 are independent.
(c)

P (S2^C | S1 ) = P (S1 ∩ S2^C ) / P (S1 )

The denominator is our answer to (a), and the numerator can be computed in the same way as (b): P (S1 ∩ S2^C ) = 1 · 0 · 1/4 + (1/3 · 2/3) · 3/4 = 1/6, so P (S2^C | S1 ) = (1/6)/(1/2) = 1/3.
(d) By Bayes' theorem,

P (B | S1 ) = P (S1 | B) P (B) / P (S1 ) = (1 · 1/4)/(1/2) = 1/2
2. A parallel system functions whenever at least one of its components works. Consider a parallel system of n components, and suppose that each component works with probability p, independently.
(a) What is the probability that the system functions?
(b) If the system is functioning, what is the probability that component 1 is working?
(c) If the system is functioning and component 2 is working, what is the probability that component
1 is working?
Solution:
(a) Let Ci be the event component i is functioning, for i = 1, . . . , n. Let F be the event the system
functions. Then,
P (F ) = 1 − P (F^C )
       = 1 − P (∩_{i=1}^{n} Ci^C )    [def of parallel system]
       = 1 − Π_{i=1}^{n} P (Ci^C )    [independence]
       = 1 − (1 − p)^n    [prob any component fails is 1 − p]
(b) By Bayes' theorem (note P (F | C1 ) = 1, since if component 1 works, the system functions):

P (C1 | F ) = P (F | C1 ) P (C1 ) / P (F ) = 1 · p / (1 − (1 − p)^n )
(c) Given that component 2 is working, the system is guaranteed to function, so learning F gives no additional information about component 1. By independence of the components:

P (C1 | F, C2 ) = P (C1 | C2 ) = P (C1 ) = p
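These answers are simple enough to wrap in functions (a sketch, not part of the original text; the sample values of n and p are mine):

def p_system_functions(n, p):
    return 1 - (1 - p)**n                # part (a)

def p_c1_given_functioning(n, p):
    return p / p_system_functions(n, p)  # part (b), since P(F | C1) = 1

print(p_system_functions(5, 0.3))        # ~0.832
print(p_c1_given_functioning(5, 0.3))    # ~0.361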
Application Time!!
Now you’ve learned enough theory to discover the Naive Bayes classifier cov-
ered in section 9.3. You are highly encouraged to read that section before
moving on!
Chapter 3. Discrete Random Variables
3.1: Discrete Random Variables Basics
Slides (Google Drive) Video (YouTube)

Suppose we flip a fair coin twice. The sample space of this experiment is:

Ω = {HH, HT, T H, T T }
Sometimes, though, we don’t care about the order (HT vs T H), but just the fact that we got one heads and
one tail. So we can define a random variable as a numeric function of the outcome.
For example, we can define X to be the number of heads in the two independent flips of a fair coin. Then
X is a function, X : Ω → R which takes outcomes ω ∈ Ω and maps them to a number. For example, for the
outcome HH, we have X(HH) = 2 since there are two heads. See the rest below!
Suppose we conduct an experiment with sample space Ω. A random variable (rv) is a numeric
function of the outcome, X : Ω → R. That is, it maps outcomes (ω ∈ Ω) to numbers: ω 7→ X(ω).
The set of possible values X can take on is its range/support, denoted ΩX .
If ΩX is finite or countably infinite (typically integers or a subset), X is a discrete random variable
(drv). Else if ΩX is uncountably large (the size of real numbers), X is a continuous random
variable.
Example(s)
Below are some descriptions of random variables. Find their ranges and classify them as a discrete
random variable (DRV) or continuous random variable (CRV). The first row is filled out for you as
an example!
3.1.2 Probability Mass Functions
Returning to our example, where X is the number of heads in two independent flips of a fair coin, we would like to calculate the probabilities that X takes on each of its possible values. That is, the probability X = k is the sum of the probabilities of the outcomes ω ∈ Ω where X(ω) = k (see below for an explicit example).
In this case we have the following:

pX (k) =
  1/4    k = 0
  1/2    k = 1
  1/4    k = 2
The probability mass function (PMF) of a discrete random variable X assigns probabilities to
the possible values of the random variable. That is pX : ΩX → [0, 1] where:
pX (k) = P (X = k)
Note that the events {X = k} for k ∈ ΩX form a partition of Ω, since each outcome ω ∈ Ω is mapped to exactly one number. Hence,

Σ_{z∈ΩX} pX (z) = 1
Notice here the only thing consistent is pX , as it’s the PMF of X. The value inside is a dummy
variable - just like we can write f (x) = x2 or f (t) = t2 . To reinforce this, I will constantly use
different letters for dummy variables.
Example(s)
Suppose we have an urn containing 20 balls, numbered with the integers 1-20. Reach in and grab three of them (without replacement), and let Y denote the largest number of the three balls. What are the range ΩY and the PMF pY ?
Solution If we draw three balls, the lowest possible value of Y is 3 (if we drew the balls 1,2,3), and the highest
possible value of Y is 20. Hence, ΩY = {3, 4, . . . , 20}.
To compute pY , first note that the size of the sample space Ω is (20 choose 3), since we draw three balls without replacement and order doesn't matter. If we want pY (k) = P (Y = k), we must choose the value k for one of the balls, and the other two balls must be LESS than k (there are (k−1 choose 2) ways). For example, if we want P (Y = 9), we must choose 9 as one of the balls, and the other 2 from 1-8. So we have

pY (k) = (k−1 choose 2) / (20 choose 3),   k ∈ ΩY
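A quick check in Python confirms that this PMF sums to 1, as it must (a sketch, not part of the original text):

from math import comb

# pY(k) = (k-1 choose 2) / (20 choose 3) for k = 3, ..., 20
pmf = {k: comb(k - 1, 2) / comb(20, 3) for k in range(3, 21)}
print(sum(pmf.values()))  # 1.0
print(pmf[20])            # 0.15: the largest ball is 20 with prob C(19,2)/C(20,3)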
In our next example, we will briefly introduce one more concept called the CDF of a random variable. We
will discuss it a lot more in depth in 4.1 when we discuss continuous RVs!
Example(s)
Suppose there are three students, and their hats are returned randomly with each of the 3! permutations equally likely. Let X be the number of hats returned to the correct owner.
1. List out all 3! = 6 elements Ω, the sample space of the experiment, as permutations of the
numbers 1,2, and 3.
2. Find the range ΩX (be careful) and PMF pX .
3. The cumulative distribution function (CDF) of a random variable X is defined to be
FX : R → [0, 1] such that FX (t) = P (X ≤ t) (again, t is a dummy letter and we could have
chosen any). Find the CDF FX .
Solution
1. The sample space is Ω = {123, 132, 213, 231, 312, 321}. For example, 123 means that everyone got their
own hat back, and 321 means only person 2 got their own hat back.
2. We construct the following table with 6 rows: one for each outcome ω.
ω X(ω) P (ω) Explanation
123 3 1/6 All 3 people got their hat back.
132 1 1/6 Only person 1 got their hat back.
213 1 1/6 Only person 3 got their hat back.
231 0 1/6 No one got their hat back.
312 0 1/6 No one got their hat back.
321 1 1/6 Only person 2 got their hat back.
Note that it isn’t possible for X to equal 2: if 2 people out of 3 have their hat back, then the third
person must also have their own hat! So ΩX = {0, 1, 3}. Let’s work on each:
• pX (0) = P (X = 0) = Σ_{ω∈Ω:X(ω)=0} P (ω) = P (231) + P (312) = 1/6 + 1/6 = 2/6.
• pX (1) = P (X = 1) = Σ_{ω∈Ω:X(ω)=1} P (ω) = P (132) + P (213) + P (321) = 1/6 + 1/6 + 1/6 = 3/6.
• pX (3) = P (X = 3) = Σ_{ω∈Ω:X(ω)=3} P (ω) = P (123) = 1/6.
So our final PMF is:
pX (k) =
  2/6    k = 0
  3/6    k = 1
  1/6    k = 3
3. Notice that the CDF is defined for ALL real numbers R = (−∞, +∞), unlike PMF’s. So we’ll have to
specify FX (t) = P (X ≤ t) for t that are not even in the range ΩX , including decimal numbers!
This sounds nearly impossible, but it’s actually not too bad! Let’s start by seeing some example values.
• FX (−3.642) = P (X ≤ −3.642) = 0 because there is no way that X ≤ −3.642. In fact, FX (t) for
any t < 0 is precisely 0 since the lowest possible value of X is 0.
• FX (0.724) = P (X ≤ 0.724) = P (X = 0) = 2/6 because the only way that X ≤ 0.724 is if X = 0,
which happens with probability 2/6. In fact, FX (t) = 2/6 for any 0 ≤ t < 1 for this reason!
• FX (2.999) = P (X ≤ 2.999) = P (X = 0) + P (X = 1) = 2/6 + 3/6 = 5/6 because X ≤ 2.999 only
if X = 0 or X = 1. And again, for any 1 ≤ t < 3, we have FX (t) = 5/6.
See the picture below for a plot of the PMF and CDF!
You’ll notice the CDF is always between 0 and 1 because it is a probability! It is always increasing as
well, since we are only adding more and more cumulative probabilities (which are nonnegative). Notice
at the jumps of the CDF, the vertical distance is just the PMF (why?)! Again, we’ll talk more about
CDFs in 4.1, so treat this as foreshadowing!
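Since the CDF is just an accumulation of the PMF, it is a one-liner to evaluate at any real number (a sketch, not part of the original text):

# PMF of X from the hats example.
pmf = {0: 2/6, 1: 3/6, 3: 1/6}

def cdf(t):
    # F_X(t) = P(X <= t): add up the PMF at all values k <= t.
    return sum(p for k, p in pmf.items() if k <= t)

print(cdf(-3.642), cdf(0.724), cdf(2.999), cdf(3.0))  # 0.0, 2/6, 5/6, 1.0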
3.1.3 Expectation
We have this idea of a random variable, which is actually neither random nor a variable (it's a deterministic function X : Ω → ΩX ). However, the way I like to think about it is: it is a random quantity which we do not know the value of yet. You might want to know what you might expect it to equal on average. For
example, X could be the random variable which represents the number of babies born in Seattle per day. On
average, X might be equal to 250, and we would write that its average/mean/expectation/expected value is
E [X] = 250.
Let’s go back to the coin example though to define expectation. Your intuition might tell you that the
expected number of heads in 2 flips of a fair coin would be 1 (you would be correct).
Since X was the random variable defined to be the number of heads in 2 flips of a fair coin, we denote this
E [X]. Think of this as the average value of X.
More specifically, imagine if we repeated the two coin flip experiment 4 times. Then we would “expect” to
get HH, HT , T H, and T T each once. Then, we can divide by the number of times (4) to get 1.
(2 + 1 + 1 + 0)/4 = 2 · 1/4 + 1 · 1/4 + 1 · 1/4 + 0 · 1/4 = 1
4 4 4 4 4
Notice that:

2 · 1/4 + 1 · 1/4 + 1 · 1/4 + 0 · 1/4 = X(HH)P (HH) + X(HT )P (HT ) + X(T H)P (T H) + X(T T )P (T T )
                                      = Σ_{ω∈Ω} X(ω)P (ω)
This is the sum of the random variable's value for each outcome multiplied by the probability of that
outcome (a weighted average).
Another way of writing this is by multiplying every value that X takes on (in its range) with the probability
of that value occurring (the PMF). Notice that below is the same exact sum, but groups the common values
together (since X(HT ) = X(T H) = 1). That is:
2 · 1/4 + 1 · (1/4 + 1/4) + 0 · 1/4 = 2 · 1/4 + 1 · 2/4 + 0 · 1/4 = Σ_{k∈ΩX} k · pX (k)

So we define the expectation as E [X] = Σ_{ω∈Ω} X(ω)P (ω), or equivalently,

E [X] = Σ_{k∈ΩX} k · pX (k)
The interpretation is that we take an average of the possible values, but weighted by their probabilities.
Example(s)
Recall the example from earlier: “Suppose there are three students, and their hats are returned
randomly with each of the 3! permutations equally likely. Let X be the number of hats returned to
the correct owner.” The range was ΩX = {0, 1, 3} and PMF was
pX (k) =
  2/6    k = 0
  3/6    k = 1
  1/6    k = 3
Find the expected number of people who get their hat back, E [X].
Solution Typically, the second definition of expectation is easier to use since it has less terms to sum over.
We take the sum of each value in ΩX multiplied by its probability.
E [X] = Σ_{k∈{0,1,3}} k · pX (k) = 0 · 2/6 + 1 · 3/6 + 3 · 1/6 = 1
That is, if we return 3 hats randomly to the 3 students, we expect on average that 1 student will get their
own hat back. It turns out that, no matter how many students/hats there are, the answer is always 1; how
amazing! We’ll actually show this amazing fact in section 3.3, so stay tuned!
Example(s)
There are 3 people in Linbo’s family; his mom, dad, and sister. Each family member decides whether
or not they want to come to lunch in his social-distancing home restaurant, independently of the
others.
• Mom wants to come with probability 0.8.
• Dad wants to come with probability 0.6.
• Sister wants to come with probability 0.1.
Unfortunately, if all 3 of them want to come, he must turn one of them away since the restaurant
capacity is 2 guests. Otherwise, he will take everyone that comes. Let X be the number of customers
that Linbo serves at lunch.
1. What is the range ΩX , the PMF pX (k) and expectation E [X]?
2. If he charges everyone who comes $10, but it costs him $50 to make all the food, what is his
expected profit (this could be negative)?
Solution
1. The range is ΩX = {0, 1, 2} since we can have anywhere from 0 to 2 people. Let M, D, S be the events
that his mom, dad, and sister want to come, respectively. By independence, the probability no one
comes is:
pX (0) = P (X = 0) = P (M^C , D^C , S^C ) = P (M^C ) P (D^C ) P (S^C ) = 0.2 · 0.4 · 0.9 = 0.072
The probability that exactly one person comes has three cases: only mom comes, only dad comes, or
only sister comes:
pX (1) = P (X = 1) = P (M, D^C , S^C ) + P (M^C , D, S^C ) + P (M^C , D^C , S)
       = 0.8 · 0.4 · 0.9 + 0.2 · 0.6 · 0.9 + 0.2 · 0.4 · 0.1 = 0.404
Finally, for pX (2), we have some work to do. We can sum over the three cases where exactly 2 of the 3
want to come. But if all 3 want to come (P (M, D, S)), this also counts as X = 2 since we turn one of
them away! So we actually add 4 probabilities to get pX (2). Alternatively, we know that these three
probabilities must sum to 1: pX (0) + pX (1) + pX (2) = 1, and hence using our previous computations:

pX (2) = 1 − pX (0) − pX (1) = 1 − 0.072 − 0.404 = 0.524
The expectation is
E [X] = Σ_{k∈ΩX} k · pX (k) = 0 · 0.072 + 1 · 0.404 + 2 · 0.524 = 1.452
2. We'd intuitively like to say something like: the profit is P = 10X − 50, so E [P ] = E [10X − 50] = 10E [X] − 50 = 10 · 1.452 − 50 = −35.48 dollars.
But is this step valid: E [10X − 50] = 10E [X] − 50? Yes, and it is called linearity of expectation! This
is one of the most important theorems on expectation, and is covered in the next section.
The “proper” way to do this expectation right now is to start over and find the range, PMF, and
expectation of P . That is, ΩP = {−50, −40, −30} since these are the possible profits if 0, 1, or 2
people came. Then,
pP (k) =
  0.072    k = −50
  0.404    k = −40
  0.524    k = −30
You can check now that computing expectation using the usual formula gives the same answer!
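That check is easy to automate by enumerating the 2^3 combinations of who wants to come (a sketch, not part of the original text):

from itertools import product

want_probs = [0.8, 0.6, 0.1]  # mom, dad, sister
pmf = {0: 0.0, 1: 0.0, 2: 0.0}
for wants in product([False, True], repeat=3):
    prob = 1.0
    for wanted, p in zip(wants, want_probs):
        prob *= p if wanted else 1 - p
    served = min(sum(wants), 2)  # capacity 2: turn one away if all 3 come
    pmf[served] += prob

print(pmf)  # {0: 0.072, 1: 0.404, 2: 0.524}
print(sum(k * p for k, p in pmf.items()))  # E[X] = 1.452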
3.1.4 Exercises
1. Let X be the value of a single roll of a fair six-sided die. What is the range ΩX , the PMF pX (k), and
the expectation E [X]?
Solution: The range is ΩX = {1, 2, 3, 4, 5, 6}, and each value is equally likely, so the PMF is

pX (k) = 1/6,   k ∈ ΩX
The expectation is
E [X] = Σ_{k∈ΩX} k · pX (k) = 1 · 1/6 + 2 · 1/6 + · · · + 6 · 1/6 = (1 + 2 + · · · + 6)/6 = 3.5
This kind of makes sense right? You expect the “middle number” between 1 and 6, which is 3.5.
2. Suppose at time t = 0, a frog starts on a 1-dimensional number line at the origin 0. At each step, the frog moves independently: left with probability 1/10, and right with probability 9/10. Let X be the position of the frog after 2 time steps. What is the range ΩX , the PMF pX (k), and the expectation E [X]?
Solution: The range is ΩX = {−2, 0, 2}. To find the PMF, we find the probabilities of X being each of those three values.
(a) For X to equal −2, we have to move left both times, which happens with probability 1/10 · 1/10 = 1/100 by independence of the moves.
(b) For X to equal 2, we have to move right both times, which happens with probability 9/10 · 9/10 = 81/100 by independence of the moves.
(c) Finally, for X to equal 0, we have to take opposite moves. So either LR or RL, which happens with probability 2 · 1/10 · 9/10 = 18/100. Alternatively, the easier way is to note that these three probabilities sum to 1, so P (X = 0) = 1 − P (X = 2) − P (X = −2) = 1 − 81/100 − 1/100 = 18/100.
pX (k) =
  1/100     k = −2
  18/100    k = 0
  81/100    k = 2
The expectation is
E [X] = Σ_{k∈ΩX} k · pX (k) = −2 · 1/100 + 0 · 18/100 + 2 · 81/100 = 1.6
You might have been able to guess this, but how? At each time step you "expect" to move to the right by 9/10 − 1/10 = 0.8. So after two steps, you would expect to be at 1.6. We'll formalize this approach more in the next section!
3. Let X be the number of independent coin flips up to and including our first head, where P (head) = p.
What is the range ΩX , the PMF pX (k), and the expectation E [X]?
Solution: The range is ΩX = {1, 2, 3, ...}, since it could theoretically take any number of flips.
The PMF is
pX (k) = (1 − p)^(k−1) p,   k ∈ ΩX
(a) P (X = 1) is the probability we get heads (for the first time) on our first try, which is just p.
(b) P (X = 2) is the probability we get heads (for the first time) on our second try, which is (1 − p)p
since we had to get a tails first.
(c) P (X = k) is the probability we get heads (for the first time) on our k-th try, which is (1 − p)^(k−1) p, since we had to get all tails on the first k − 1 tries (otherwise, our first head would have been earlier).
The expectation is pretty complicated and uses a calculus trick, so don’t worry about it too much.
Just understand the first two lines, which are the setup! But before that, what do you think it should
be? For example, if p = 1/10, how many flips do you think it would take until our first head? Possibly
10? And if p = 1/7, maybe 7? So it seems like our guess will be E [X] = 1/p. It turns out this intuition is actually correct!
E [X] = Σ_{k∈ΩX} k · pX (k)    [def of expectation]
      = Σ_{k=1}^{∞} k(1 − p)^(k−1) p
      = p Σ_{k=1}^{∞} k(1 − p)^(k−1)    [p is a constant with respect to k]
      = p Σ_{k=1}^{∞} d/dp (−(1 − p)^k)    [since d/dy y^k = k y^(k−1), with the chain rule]
      = −p d/dp (Σ_{k=1}^{∞} (1 − p)^k)    [swap sum and derivative]
      = −p d/dp (1/(1 − (1 − p)) − 1)    [geometric series formula: Σ_{i=0}^{∞} r^i = 1/(1 − r)]
      = −p d/dp (1/p)
      = −p · (−1/p^2)
      = 1/p
Chapter 3. Discrete Random Variables
3.2: More on Expectation
Slides (Google Drive) Video (YouTube)
Let’s say that you and your friend sell fish for a living. Every day, you catch X fish, with E [X] = 3 and
your friend catches Y fish, with E [Y ] = 7. How many fish do the two of you bring in (Z = X + Y ) on an
average day? You might guess 3 + 7 = 10. This is the formula you just guessed:
E [Z] = E [X + Y ] = E [X] + E [Y ] = 3 + 7 = 10
This property turns out to be true! Furthermore, let’s say that you can sell each fish for $5 at a store, but
you need to pay $20 in rent for the storefront. How much profit do you expect to make? The profit formula
would be 5Z − 20: $5 times the number of total fish, minus $20. You might guess 5 · 10 − 20 = 30 and you
would be right once again! This is the formula you just guessed:
E [5Z − 20] = 5E [Z] − 20

This property is called linearity of expectation. For any random variables X, Y and any constants a, b, c:

E [X + Y ] = E [X] + E [Y ]

and

E [aX + b] = aE [X] + b

Combining these gives:

E [aX + bY + c] = aE [X] + bE [Y ] + c
Proof of Linearity of Expectation. Note that X and Y are functions (since random variables are functions),
so X + Y is a function that is the sum of the outputs of each of the functions. We have the following (in the
first equation, (X + Y )(ω) is the function (X + Y ) applied to ω which is equal to X(ω) + Y (ω), it is not a
product):
E[X + Y] = \sum_{\omega \in \Omega} (X + Y)(\omega) \cdot P(\omega) \qquad \text{[def of expectation for the rv } X + Y\text{]}

= \sum_{\omega \in \Omega} (X(\omega) + Y(\omega)) \cdot P(\omega) \qquad \text{[def of sum of functions]}

= \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega) + \sum_{\omega \in \Omega} Y(\omega) \cdot P(\omega) \qquad \text{[property of summation]}

= E[X] + E[Y] \qquad \text{[def of expectation of } X \text{ and } Y\text{]}
For the second property, note that aX + b is also a random variable and hence a function (e.g., if f (x) =
sin (1/x), then (2f − 5)(x) = 2f (x) − 5 = 2 sin (1/x) − 5.)
E[aX + b] = \sum_{\omega \in \Omega} (aX + b)(\omega) \cdot P(\omega) \qquad \text{[def of expectation]}

= \sum_{\omega \in \Omega} (aX(\omega) + b) \cdot P(\omega) \qquad \text{[def of the function } aX + b\text{]}

= \sum_{\omega \in \Omega} aX(\omega) \cdot P(\omega) + \sum_{\omega \in \Omega} b \cdot P(\omega) \qquad \text{[property of summation]}

= a \sum_{\omega \in \Omega} X(\omega) \cdot P(\omega) + b \sum_{\omega \in \Omega} P(\omega) \qquad \text{[property of summation]}

= aE[X] + b \qquad \left[\text{def of } E[X] \text{ and } \textstyle\sum_{\omega} P(\omega) = 1\right]
For the last property, we get to assume the first two that we proved already:

E[aX + bY + c] = E[aX] + E[bY + c] \qquad \text{[first property, with the rvs } aX \text{ and } bY + c\text{]}

= aE[X] + bE[Y] + c \qquad \text{[second property, applied to each term]}
Again, you may think a result like this is “trivial” or “obvious”, but we’ll see the true power of linearity
of expectation through examples. It is one of the most important ideas that you will continue to use (and
probably take for granted), even when studying some of the most complex topics in probability theory.
Example(s)
Suppose a frog starts at position 0 on a number line. At each of 2 independent time steps, it moves left by 1 with probability p_L, stays put with probability p_S, and moves right by 1 with probability p_R (where p_L + p_S + p_R = 1). Let X be its position after the 2 time steps. What are Ω_X, the PMF p_X, and E[X]?

Brute Force Solution: When dealing with any random variable, the first thing you should do is identify its range. The frog must end up in one of these positions, since it can move at most 1 to the left and 1 to the right at each step:
ΩX = {−2, −1, 0, +1, +2}
So we need to compute 5 values: the probability of each of these. Let’s start with the easier ones.
The only way to end up at −2 is if the frog moves left at both steps, which happens with probability p_L \cdot p_L = p_L^2, so p_X(-2) = P(X = -2) = p_L^2. The only reason we can multiply them is because of our independence assumption. Similarly, p_X(2) = p_R \cdot p_R = p_R^2.
To get to −1, there are two possibilities: first going left and staying (pL · pS ), or first staying and then
going left (pS · pL ). Adding these disjoint cases gives pX (−1) = 2pL pS . Again, we can only multiply
due to independence. Similarly, pX (1) = 2pR pS .
Finally, to compute pX (0), we have two options. One is considering all the possibilities (there are
three: left right, right left, or stay stay) and adding them up, and you get 2pL pR + p2S . Alternatively
and equivalently, since you know the probabilities of 4 of the values (pX (−2), pX (2), pX (−1), pX (1)),
the last one p_X(0) must be 1 minus the other four, since probabilities have to sum to 1! This is an often useful and clever trick: solving for all but one of the probabilities actually gives you the last one!
In summary, we would write the PMF as:

p_X(k) = \begin{cases} p_L^2 & k = -2 \text{ (left, left)} \\ 2 p_L p_S & k = -1 \text{ (left and stay, or stay and left)} \\ 2 p_L p_R + p_S^2 & k = 0 \text{ (right left, or left right, or stay stay)} \\ 2 p_R p_S & k = 1 \text{ (right and stay, or stay and right)} \\ p_R^2 & k = 2 \text{ (right, right)} \end{cases}
Then to solve for the expectation we just multiply the value and probability mass function and take
the sum and have the following:
E[X] = \sum_{k \in \Omega_X} k \cdot p_X(k) \qquad \text{[def of expectation]}

= (-2) \cdot p_L^2 + (-1) \cdot 2 p_L p_S + 0 \cdot (2 p_L p_R + p_S^2) + 1 \cdot 2 p_R p_S + 2 \cdot p_R^2 \qquad \text{[plug in our values]}

= 2(p_R - p_L) \qquad \text{[lots of messy algebra]}
The last step of algebra is not important - once you get to more advanced mathematics (like this
text), getting the second-to-last formula is sufficient. Everything else is algebra which you could do,
or use a computer to do, and so we will omit the useless calculations.
This was quite tedious already; what if instead you were to find the expected location after 100 steps?
Then, this method would be completely ridiculous: finding ΩX = {−100, −99, . . . , +99, +100} and
their 201 probabilities. Since you know the frog always moves with the same probabilities though,
maybe we can do something more clever!
Linearity Solution:
Let X1 , X2 be the distance the frog travels at time steps 1,2 respectively.
Important Observation: X = X1 + X2 , since your location after 2 time steps is the sum of the
displacement of the first time step and the second time step. Therefore, ΩX1 = ΩX2 = {−1, 0, +1}.
They have the same simple PMF of:

p_{X_i}(k) = \begin{cases} p_L & k = -1 \\ p_S & k = 0 \\ p_R & k = 1 \end{cases}

Each has expectation E[X_i] = (-1) \cdot p_L + 0 \cdot p_S + 1 \cdot p_R = p_R - p_L, so by linearity of expectation, E[X] = E[X_1] + E[X_2] = 2(p_R - p_L), matching the brute force answer.
Which method is easier? Maybe in this case it is debatable, but if we change the time steps from 2 to
100 or 1000, the brute force solution is entirely infeasible, and the linearity solution will basically be
the same amount of work! You could say that X1 , . . . , X100 is the displacement at each of 100 time
steps, and hence by linearity:

E[X] = E\left[\sum_{i=1}^{100} X_i\right] = \sum_{i=1}^{100} E[X_i] = \sum_{i=1}^{100} (p_R - p_L) = 100(p_R - p_L)
Hopefully now you can come to appreciate more how powerful LoE truly is! We’ll see more examples
in the next section as well as at the end of this section.
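To see linearity in action, here is a small Python simulation of the frog's 100-step walk (an illustrative sketch; the values p_L = 0.2 and p_R = 0.5 are arbitrary choices of ours):

```python
import random

def frog_position(steps: int, p_left: float, p_right: float) -> int:
    """Simulate the frog: each step moves -1 w.p. p_left, +1 w.p. p_right, 0 otherwise."""
    pos = 0
    for _ in range(steps):
        u = random.random()
        if u < p_left:
            pos -= 1
        elif u < p_left + p_right:
            pos += 1
    return pos

random.seed(0)
p_left, p_right, steps, trials = 0.2, 0.5, 100, 20_000
avg = sum(frog_position(steps, p_left, p_right) for _ in range(trials)) / trials
print(avg, steps * (p_right - p_left))  # sample mean vs. 100(p_R - p_L) = 30
```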
Consider flipping 2 coins again. Let X be the number of heads in two independent flips of a fair coin. Recall the range, PMF, and expectation (again, I'm using the dummy letter d to emphasize that p_X is the PMF of X and that the input letter can be anything):

\Omega_X = \{0, 1, 2\},

p_X(d) = \begin{cases} 1/4 & d = 0 \\ 1/2 & d = 1 \\ 1/4 & d = 2 \end{cases}

E[X] = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 1
Let g be the cubing function; i.e., g(t) = t^3. Let Y = g(X) = X^3; what does this mean? It literally means the cubed number of heads! Let's try to compute E[Y] = E[X^3], the expected cubed number of heads. We first find its range and PMF. Based on the range of X, we can calculate the range of Y to be:

\Omega_Y = \{0, 1, 8\}

since if we get 0 heads, the cubed number of heads is 0^3 = 0; if we get 1 head, the cubed number of heads is 1^3 = 1; and if we get 2 heads, the cubed number of heads is 2^3 = 8.
Now to find the PMF of Y = X^3. (Again, below I use the notation p_Y to denote the probability mass function of Y = X^3; z is a dummy variable which could be any letter.)

p_Y(z) = \begin{cases} 1/4 & z = 0 \\ 1/2 & z = 1 \\ 1/4 & z = 8 \end{cases}
since there is a 1/4 chance of getting 0 cubed heads (the outcome TT), 1/2 chance of getting 1 cubed heads
(the outcomes HT or TH), and a 1/4 chance of getting 8 cubed heads (the outcome HH).
E[X^3] = E[Y] = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 8 \cdot \frac{1}{4} = 2.5
Is there an easier way to compute E[X^3] = E[Y] without going through the trouble of writing out p_Y? Yes! Since we know X's PMF already, why should we have to find the PMF of Y = g(X)?
Note this formula below is the same formula as above, rewritten so you can observe something:
E[X^3] = 0^3 \cdot \frac{1}{4} + 1^3 \cdot \frac{1}{2} + 2^3 \cdot \frac{1}{4} = 2.5

In fact:

E[X^3] = \sum_{b \in \Omega_X} b^3\, p_X(b)
That is, we can apply the function to each value in ΩX , and then take the weighted average! We can
generalize such that for any function g : ΩX → R, we have:
E[g(X)] = \sum_{b \in \Omega_X} g(b)\, p_X(b)
Caveat: It is worth noting that 2.5 = E[X^3] ≠ (E[X])^3 = 1. You cannot just say E[g(X)] = g(E[X]), as we just showed!
Let X be a discrete random variable with range Ω_X, and let g : D → R be a function defined at least over Ω_X (Ω_X ⊆ D). Then (this result is often called the Law of the Unconscious Statistician, or LOTUS):

E[g(X)] = \sum_{b \in \Omega_X} g(b)\, p_X(b)

Note that in general, E[g(X)] ≠ g(E[X]). For example, E[X^2] ≠ (E[X])^2, and E[log(X)] ≠ log(E[X]).
Before we formally prove this, it will help if we have some intuition for each step. As an example, let X have
range ΩX = {−1, 0, 1} and PMF
p_X(k) = \begin{cases} 3/12 & k = -1 \\ 5/12 & k = 0 \\ 4/12 & k = 1 \end{cases}
Notice that Y = X^2 has range \Omega_Y = \{g(x) : x \in \Omega_X\} = \{(-1)^2, 0^2, 1^2\} = \{0, 1\} and the following PMF:

p_Y(k) = \begin{cases} \frac{3}{12} + \frac{4}{12} & k = 1 \\ \frac{5}{12} & k = 0 \end{cases}
Note that p_Y(1) = P(X = -1) + P(X = 1) because \{-1, 1\} = \{x : x^2 = 1\}. The crux of the LOTUS proof depends on this fact. We just group things together and sum!
Proof of LOTUS. The proof isn't too complicated, but the notation is pretty tricky and may be an impediment to your understanding, so focus on understanding the setup in the next few lines. The key observation is that

p_Y(y) = \sum_{x \in \Omega_X : g(x) = y} p_X(x)

That is, the total probability that Y = y is the sum of the probabilities over all x ∈ Ω_X where g(x) = y (this is like saying P(Y = 1) = P(X = -1) + P(X = 1) because \{x \in \Omega_X : x^2 = 1\} = \{-1, 1\}).
E[g(X)] = E[Y] \qquad [Y = g(X)]

= \sum_{y \in \Omega_Y} y\, p_Y(y) \qquad \text{[def of expectation]}

= \sum_{y \in \Omega_Y} y \sum_{x \in \Omega_X : g(x) = y} p_X(x) \qquad \text{[above substitution]}

= \sum_{y \in \Omega_Y} \sum_{x \in \Omega_X : g(x) = y} y\, p_X(x) \qquad \text{[move } y \text{ into the inner sum]}

= \sum_{y \in \Omega_Y} \sum_{x \in \Omega_X : g(x) = y} g(x)\, p_X(x) \qquad [y = g(x) \text{ in the inner sum}]

= \sum_{x \in \Omega_X} g(x)\, p_X(x) \qquad \text{[the double sum is the same as summing over all } x\text{]}
The last step here is tricky. All x ∈ ΩX map to exactly one y ∈ ΩY (through g). If I sum over all x (the
last line), I can partition my sum over all x by grouping them by their function value g(x).
Take our example above of Y = X^2 in the second-last line: we sum over y ∈ Ω_Y = \{0, 1\}. For y = 0, we sum over the x ∈ Ω_X where g(x) = x^2 = 0, which is just x ∈ \{0\} for us. For y = 1, we sum over the x ∈ Ω_X where g(x) = x^2 = 1, which is just x ∈ \{-1, 1\}. So we've covered all values x ∈ Ω_X by partitioning
on what g(x) is!
The hardest part of this proof was the notation; the key idea is just we sum in a different way. To compute
E [g(X)], we just group all the possible g(x) values together!
3.2.3 Exercises
1. Let S be the sum of three rolls of a fair 6-sided die. What is E [S]?
Solution: Let X, Y, Z be the first, second, and third roll respectively. Then, S = X + Y + Z.
We showed in the first exercise of 3.1 that E[X] = E[Y] = E[Z] = 3.5, so by LoE,

E[S] = E[X + Y + Z] = E[X] + E[Y] + E[Z] = 3.5 + 3.5 + 3.5 = 10.5

Alternatively, imagine if we didn't have this theorem. We would find the range of S, which is Ω_S = \{3, 4, \ldots, 18\}, and find its PMF. What a nightmare!
2. Blind LOTUS Practice: This will all seem useless, but I promise we’ll need this in the future. Let X
have PMF

p_X(k) = \begin{cases} 3/12 & k = 5 \\ 5/12 & k = 2 \\ 4/12 & k = 1 \end{cases}

(a) Compute E[X^2].
(b) Compute E[log(X)].
(c) Compute E[e^{\sin(X)}].
Solution: LOTUS says that E[g(X)] = \sum_{k \in \Omega_X} g(k)\, p_X(k). That is,

(a) E[X^2] = \sum_{k \in \Omega_X} k^2\, p_X(k) = 5^2 \cdot \frac{3}{12} + 2^2 \cdot \frac{5}{12} + 1^2 \cdot \frac{4}{12}

(b) E[\log X] = \sum_{k \in \Omega_X} \log(k) \cdot p_X(k) = \log(5) \cdot \frac{3}{12} + \log(2) \cdot \frac{5}{12} + \log(1) \cdot \frac{4}{12}

(c) E[e^{\sin(X)}] = \sum_{k \in \Omega_X} e^{\sin(k)}\, p_X(k) = e^{\sin(5)} \cdot \frac{3}{12} + e^{\sin(2)} \cdot \frac{5}{12} + e^{\sin(1)} \cdot \frac{4}{12}
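Since LOTUS is just a weighted sum over the PMF, these answers are easy to evaluate mechanically. Below is a short Python sketch (the helper name expectation_of and the dictionary encoding of the PMF are ours, for illustration):

```python
import math

# The PMF from the exercise: p_X(5) = 3/12, p_X(2) = 5/12, p_X(1) = 4/12.
pmf = {5: 3 / 12, 2: 5 / 12, 1: 4 / 12}

def expectation_of(g, pmf):
    """LOTUS: E[g(X)] = sum of g(k) * p_X(k) over the range of X."""
    return sum(g(k) * p for k, p in pmf.items())

print(expectation_of(lambda k: k ** 2, pmf))                 # E[X^2] = 8.25
print(expectation_of(math.log, pmf))                         # E[log X]
print(expectation_of(lambda k: math.exp(math.sin(k)), pmf))  # E[e^{sin(X)}]
```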
Chapter 3. Discrete Random Variables
3.3: Variance
Slides (Google Drive) Video (YouTube)
Suppose there are 7 mermaids in the sea. Below is a table that represents these mermaids and the colors of
their hair.
Each column in the third row of the table is a variable, Xi , that is 1 if the i-th mermaid has red hair and 0
otherwise. We call these sorts of variables indicator variables because they are either 1 or 0, and their values
indicate the truth of a boolean (red hair or not).
Let the variable X represent how many of the 7 mermaids have red hair. If I only gave you this third row
(X1 , X2 , . . . , X7 of 1’s and 0’s), how could you compute X?
Well, you would add them all up! X = X_1 + X_2 + \ldots + X_7 = 3. So, there are 3 mermaids in the sea that have red hair. This might seem like a trivial result, but let's go over a more complicated example to illustrate the usefulness of indicator random variables!
Example(s)
Suppose n people go to a party and leave their hat with the hat-check person. At the end of the
party, she returns hats randomly and uniformly because she does not care about her job. Let X be
the number of people who get their original hat back. What is E [X]?
Solution Your first instinct might be to approach this problem with brute force. Such an approach would
involve enumerating the range, ΩX = {0, 1, 2, ..., n − 2, n} (all the integers from 0 to n, except n − 1), and
computing the probability mass function for each of its elements. However, this approach will get very
complicated (give it a shot). So, let’s use our new friend, linearity of expectation.
Note that each person gets their own hat back with probability \frac{1}{n}. (Another way to think of this is: the sample space of all ways to give n hats back has size n!. If we want person i to get their hat back, then there are (n-1)! ways to give back the remaining hats, so the probability is \frac{(n-1)!}{n!} = \frac{1}{n}.)
Let’s use linearity with indicator random variables! For i = 1, ..., n, let
(
1 if i-th person got their hat back
Xi = .
0 otherwise
Pn
Then the total number of people who get their hat back is X = i=1 Xi . (Why?)
The expected value of each individual indicator random variable can be found as follows, since it can only
take on the values 0 and 1:
E[X_i] = 1 \cdot P(X_i = 1) + 0 \cdot P(X_i = 0) = P(X_i = 1) = P(i\text{-th person got their hat back}) = \frac{1}{n}
Hence, by linearity of expectation, E[X] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1. So, the expected number of people to get their hats back is 1 (it doesn't even depend on n)! It is worth noting
that these indicator random variables are not “independent” (we’ll define this formally later). One of the
reasons why is because if we know that a particular person did not get their own hat back, then the original
owner of that hat will have a probability of 0 that they get that hat back.
If asked only about the expectation of a random variable X (and not its PMF), then you may be
able to write X as the sum of possibly dependent indicator random variables, and apply linearity
of expectation. This technique is used when X is counting something (the number of people
who get their hat back). Finding the PMF for this random variable is extremely complicated,
and linearity makes computing the expectation easy (or at least easier than directly finding the PMF).
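To make the hat-check answer more believable, here is a quick Python simulation (an illustrative sketch of ours): returning hats uniformly at random is the same as drawing a uniformly random permutation and counting its fixed points.

```python
import random

def hats_returned_correctly(n: int) -> int:
    """Hand back n hats via a uniformly random permutation; count people who get their own."""
    perm = list(range(n))
    random.shuffle(perm)
    return sum(1 for person, hat in enumerate(perm) if person == hat)

random.seed(42)
n, trials = 10, 100_000
avg = sum(hats_returned_correctly(n) for _ in range(trials)) / trials
print(avg)  # should be close to E[X] = 1, regardless of n
```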
Example(s)
Suppose we flip a coin n = 100 times independently, where the probability of getting a head on each
flip is p = 0.23. What is the expected number of heads we get? Before doing any computation, what
do you think it might be?
Solution You might expect np = 100 · 0.23 = 23 heads, and you would be absolutely correct! But we do need
to prove/show this.
Let X be the number of heads total, so ΩX = {0, 1, 2, . . . , 100}. The “normal” approach might be to try to
find this PMF, which could be a bit complicated (we’ll actually see this in the next section)! But let’s try
to use what we just learned instead, and define indicators.
For i = 1, 2, \ldots, 100, let X_i = 1 if the i-th flip is heads, and X_i = 0 otherwise. Then, X = \sum_{i=1}^{100} X_i is the total number of heads (why?). To use linearity, we need to find E[X_i].
We showed earlier that
E [Xi ] = P (Xi = 1) = p = 0.23
and so

E[X] = E\left[\sum_{i=1}^{100} X_i\right] \qquad \text{[def of } X\text{]}

= \sum_{i=1}^{100} E[X_i] \qquad \text{[linearity of expectation]}

= \sum_{i=1}^{100} 0.23

= 100 \cdot 0.23 = 23
3.3.2 Variance
We’ve talked about the expectation (average/mean) of a random variable, and some approaches to computing
this quantity. This provides a nice “summarization” of a random variable, as something we often want to
know about it (sometimes even in place of its PMF). But we might want to know another summary quantity:
how “variable” the random variable is, or how much it deviates from its mean. This is called the variance
of a random variable, and we’ll start with a motivating example below!
Consider the following two games. In both games we flip a fair coin. In Game 1, if a heads is flipped you
pay me $1, and if a tails is flipped I pay you $1. In Game 2, if a heads is flipped you pay me $1000, and if a
tails is flipped I pay you $1000.
Both games are fair, in the sense that the expected value of playing each game is 0.

E[G_1] = -1 \cdot \frac{1}{2} + 1 \cdot \frac{1}{2} = 0 = -1000 \cdot \frac{1}{2} + 1000 \cdot \frac{1}{2} = E[G_2]
Which game would you rather play? Maybe the adrenaline junkies among us would be willing to risk it all
on Game 2, but I think most of us would feel better playing Game 1. As shown above, there is no difference
in the expected value of playing these two games, so we need another metric to explain why Game 1 feels
safer than Game 2.
We can measure this by calculating how far away a random variable is from its mean, on average. The
quantity X − E [X] is the difference between a rv and its mean, but we want a distance, a positive value.
So we will look at the squared difference (X − E[X])^2 instead (another option would have been the absolute difference |X − E[X]|, but someone chose the squared one instead). This is still a random variable (a nonnegative one, since it is squared), and so to get a number (the average distance from the mean), we take the expectation of this new rv, E[(X − E[X])^2]. This is called the variance of the original random variable.
The definition goes as follows:

Var(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2

The variance is always nonnegative since we take the expectation of a nonnegative random variable (X − E[X])^2. The first equality is the definition of variance, and the second equality is a more useful identity for doing computation; it follows from expanding the square and applying linearity (note E[X] is just a constant):

E\left[(X - E[X])^2\right] = E\left[X^2 - 2X E[X] + (E[X])^2\right] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - (E[X])^2
There is one problem though - if X is the height of someone in feet for example, then the average E [X]
is also in units of feet, but the variance is in terms of square feet (since we square X). We’d like to say
something like: the height of adults is generally 5.5 feet plus or minus 0.3 feet. To correct for this, we define
the standard deviation to be the square root of the variance, which “undoes” the squaring:

\sigma_X = \sqrt{Var(X)}

This measure is useful because the units of variance are the square of the units of the original variable X, and taking the square root returns our units to the same as X.
We had something nice happen for the random variable aX +b when computing its expectation: E [aX + b] =
aE [X] + b, called linearity of expectation. Is there a similar nice property for the variance as well?
For the variance of aX + b, it turns out that

Var(aX + b) = a^2\, Var(X)

Before proving this, let's think about and try to understand why a came out squared, and what happened to the b. The reason a is squared is because variance involves squaring the random variable, so the a had to come out squared. It might not be a great intuitive reason, but we'll prove it below algebraically. The
second (b disappearing) has a nice intuition behind it. Which of the two distributions (random variables)
below do you think should have higher variance?
You might agree with me that they have the same variance! Why?
The idea behind variance is that it measures the “spread” of the values that a random variable can take on.
The two graphs of random variables (distributions) above have the same “spread”, but one is shifted slightly
to the right. Since these graphs have the same “spread”, we want their variance to reflect this similarity.
Thus, shifting a random variable by some constant does not change the variance of that random variable.
That is, Var (X + b) = Var (X): that’s why the b got lost!
Var(X + b) = E\left[((X + b) - E[X + b])^2\right] \qquad \text{[def of variance]}

= E\left[(X + b - E[X] - b)^2\right] \qquad \text{[linearity of expectation]}

= E\left[(X - E[X])^2\right]

= Var(X) \qquad \text{[def of variance]}
Example(s)
Let X be the outcome of a fair 6-sided die roll. Recall that E [X] = 3.5. What is Var (X)?
Let’s say you play a casino game, where you must pay $10 to roll this die once, but earn twice the
value of the roll. What are the expected value and variance of your earnings?
Now you might wonder, what about the variance of a sum Var (X + Y )? You might hope that Var (X + Y ) =
Var (X)+Var (Y ), but this unfortunately is only true when the random variables are independent (we’ll define
this in the next section, but you can kind of guess what it means)! It is so important to remember that we
made no independence assumptions for linearity of expectation - it’s always true!
3.3.3 Exercises
1. Suppose you studied hard for a 100-question multiple-choice exam (with 4 choices per question) so
that you believe you know the answer to about 80% of the questions, and you guess the answer to the
remaining 20%. What is the expected number of questions you answer correctly?
Solution: For i = 1, \ldots, 100, let X_i be the indicator rv which is 1 if you got the i-th question correct, and 0 otherwise. Then, the total number of questions correct is X = \sum_{i=1}^{100} X_i. To compute E[X] we need E[X_i] for each i = 1, \ldots, 100.

E[X_i] = 1 \cdot P(X_i = 1) + 0 \cdot P(X_i = 0) = P(X_i = 1) = P(\text{correct on question } i) = 1 \cdot 0.8 + 0.25 \cdot 0.2 = 0.85

where the second-last step uses the law of total probability, conditioning on whether we know the answer to a question or not. Hence,

E[X] = E\left[\sum_{i=1}^{100} X_i\right] = \sum_{i=1}^{100} E[X_i] = \sum_{i=1}^{100} 0.85 = 85
This kind of makes sense - I should be guaranteed 80 out of 100, and if I guess on the other 20, I would
get about 5 (a quarter of them) right, for a total of 85.
2. Recall exercise 2 from 3.1, where we had a random variable X with PMF

p_X(k) = \begin{cases} 1/100 & k = -2 \\ 18/100 & k = 0 \\ 81/100 & k = 2 \end{cases}

and expectation E[X] = 1.6. What is Var(X)?

Solution: By LOTUS,

E[X^2] = (-2)^2 \cdot \frac{1}{100} + 0^2 \cdot \frac{18}{100} + 2^2 \cdot \frac{81}{100} = 3.28

Hence,

Var(X) = E[X^2] - (E[X])^2 = 3.28 - 1.6^2 = 0.72
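This kind of computation is easy to check in a few lines of Python (a sketch of ours, using the same PMF):

```python
pmf = {-2: 1 / 100, 0: 18 / 100, 2: 81 / 100}

mean = sum(k * p for k, p in pmf.items())                # E[X] = 1.6
second_moment = sum(k ** 2 * p for k, p in pmf.items())  # E[X^2] = 3.28 (LOTUS)
variance = second_moment - mean ** 2                     # Var(X) = 0.72
print(mean, second_moment, variance)
```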
Chapter 3. Discrete Random Variables
3.4: Zoo of Discrete RVs Part I
Slides (Google Drive) Video (YouTube)
In this section, we'll define formally what it means for random variables to be independent. Then, for the rest of the chapter (3.4, 3.5, 3.6), we'll discuss commonly appearing random variables whose properties (like the PMF, mean, and variance) we can simply cite, without doing any work! These situations are so common that we name them, and can refer to them and related quantities easily!
Random variables X and Y are independent, denoted X ⊥ Y , if for all x ∈ ΩX and all y ∈ ΩY , any
of the following three equivalent properties holds:
1. P (X = x | Y = y) = P (X = x)
2. P (Y = y | X = x) = P (Y = y)
3. P (X = x ∩ Y = y) = P (X = x) · P (Y = y)
Note that this is the same as the event definition of independence, but it must hold for all events \{X = x\} and \{Y = y\}.
If X ⊥ Y , then
Var (X + Y ) = Var (X) + Var (Y )
This will be proved a bit later, but we can start using this fact now! It is important to remember
that you cannot use this formula if the random variables are not independent (unlike linearity).
A common misconception is that Var (X − Y ) = Var (X) − Var (Y ), but this actually isn’t true, otherwise we
could get a negative number. In fact, if X ⊥ Y , then
Var(X - Y) = Var(X + (-Y)) = Var(X) + Var(-Y) = Var(X) + (-1)^2\, Var(Y) = Var(X) + Var(Y)
Before diving into the random variables themselves, let’s look at a situation that arises often...
Let’s illustrate how this might be useful with an example. Suppose we independently flip 8 coins that land
heads with probability p, and get the following sequence of coin flips
This series of flips is a Bernoulli process. We call each of these coin flips a Bernoulli random variable (or
indicator rv).
A random variable X is Bernoulli (or indicator), denoted X ∼ Ber(p), if and only if X has the following PMF:

p_X(k) = \begin{cases} p, & k = 1 \\ 1 - p, & k = 0 \end{cases}
Each X_i in the Bernoulli process with parameter p is a Bernoulli/indicator random variable with parameter p. It simply represents a binary outcome, like a coin flip.
Additionally,

E[X] = p \quad \text{and} \quad Var(X) = p(1-p)

since E[X] = 1 \cdot p + 0 \cdot (1-p) = p, E[X^2] = 1^2 \cdot p + 0^2 \cdot (1-p) = p, and so

Var(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)
Notice how we found a situation whose general form comes up quite often, and derived a random variable
that models that situation well. Now, anytime we need a Bernoulli/indicator random variable we can denote
it as follows: X ∼ Ber(p).
We can generalize this as follows to get the PMF of a binomial random variable:
p_X(k) = P(X = k) = P(\text{exactly } k \text{ heads in } n \text{ Bernoulli trials}) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in \Omega_X
This hopefully sheds some light on why \binom{n}{k} is called a binomial coefficient and X a binomial random variable. Before computing its expectation, let's make sure we didn't make a mistake, and check that our probabilities sum to 1. This will finally use the binomial theorem we learned in Chapter 1: (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.

\sum_{k=0}^{n} p_X(k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \qquad \text{[PMF of Binomial RV]}

= (p + (1-p))^n \qquad \text{[binomial theorem]}

= 1^n = 1
A random variable X has a Binomial distribution, denoted X ∼ Bin(n, p), if and only if X has the
following PMF for k ∈ ΩX = {0, 1, 2, . . . , n}:
p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}
X is the sum of n independent Ber(p) random variables, and represents the number of heads in n
independent coin flips where P (head) = p.
Additionally,
E [X] = np and Var (X) = np(1 − p)
Proof of Expectation and Variance of Binomial. We can use linearity of expectation to compute the expected value of a binomial random variable (i.e., the expected number of successes in n Bernoulli trials). Let X ∼ Bin(n, p), and write X = \sum_{i=1}^{n} X_i, where X_1, \ldots, X_n ∼ Ber(p) are independent.

E[X] = E\left[\sum_{i=1}^{n} X_i\right]

= \sum_{i=1}^{n} E[X_i] \qquad \text{[linearity of expectation]}

= \sum_{i=1}^{n} p \qquad \text{[expectation of Bernoulli]}

= np
This makes sense! If X ∼ Bin(100, 0.5) (number of heads in 100 independent flips of a fair coin), you expect
50 heads, which is just np = 100 · 0.5 = 50. Variance can be found in a similar manner
Var(X) = Var\left(\sum_{i=1}^{n} X_i\right)

= \sum_{i=1}^{n} Var(X_i) \qquad \text{[variance adds for independent rvs]}

= \sum_{i=1}^{n} p(1-p) \qquad \text{[variance of Bernoulli]}

= np(1-p)
Like Bernoulli rvs, Binomial random variables have a special place in our zoo. Arguably, Binomial rvs are the most important discrete random variable, so make sure to understand everything above and be ready to use it!
It is important to note for the hat check example in 3.3 that we had the sum of n Bernoulli/indicator rvs
BUT that they were NOT independent. This is because if we know one person gets their hat back, someone
else is more likely to (since there are n − 1 possibilities instead of n). However, linearity of expectation works
regardless of independence, so we were able to still add their expectations like so
E[X] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1
It would be incorrect to say that X ∼ Bin(n, 1/n), because the indicator rvs were NOT independent.
Example(s)
A factory produces 100 cars per day, but a car is defective with probability 0.02. What’s the proba-
bility that the factory produces 2 or more defective cars on a given day?
Solution Let X be the number of defective cars that the factory produces. X ∼ Bin(100, 0.02), so
P(X ≥ 2) = 1 - P(X = 0) - P(X = 1) \qquad \text{[complement]}

= 1 - \binom{100}{0}(0.02)^0 (1 - 0.02)^{100} - \binom{100}{1}(0.02)^1 (1 - 0.02)^{99} \qquad \text{[plug in binomial PMF]}

\approx 0.5967
So, there is about a 60% chance that 2 or more cars produced on a given day will be defective.
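Binomial PMF computations like this one are easy to reproduce in Python. Below is a sketch (the helper binom_pmf is ours, built directly from the PMF formula above):

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p), straight from the PMF formula."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 100, 0.02
print(1 - binom_pmf(0, n, p) - binom_pmf(1, n, p))  # ≈ 0.5967
```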
3.4.4 Exercises
1. An elementary school wants to keep track of how many of their 200 students have acceptable attendance.
Each student shows up to school on a particular day with probability 0.85, independently of other days
and students.
(a) A student has acceptable attendance if they show up to class at least 4 out of 5 times in a school
week. What is the probability a student has acceptable attendance?
(b) What is the probability that at least 170 out of the 200 students have acceptable attendance?
Assume students’ attendance are independent since they live separately.
(c) What is the expected number of students with acceptable attendance?
Solution: Actually, this is a great question because it has nested binomials!
(a) Let X be the number of school days a student shows up in a school week. Then, X ∼ Bin(n = 5, p = 0.85), since a student's attendance on different days is independent, as mentioned earlier. We want X ≥ 4:

P(X ≥ 4) = P(X = 4) + P(X = 5) = \binom{5}{4} 0.85^4\, 0.15^1 + \binom{5}{5} 0.85^5\, 0.15^0 \approx 0.83521
(b) Let Y be the number of students who have acceptable attendance. Then, Y ∼ Bin(n = 200, p = 0.83521), since each student's attendance is independent of the rest. So,

P(Y ≥ 170) = \sum_{k=170}^{200} \binom{200}{k} 0.83521^k (1 - 0.83521)^{200-k} \approx 0.3258
(c) We have E [Y ] = np = 200 · 0.83521 = 167.04 as the expected number of students! We can just
cite it now that we’ve identified Y as being Binomial!
2. [From Stanford CS109] When sending binary data to satellites (or really over any noisy channel) the
bits can be flipped with high probabilities. In 1947 Richard Hamming developed a system to more
reliably send data. By using Error Correcting Hamming Codes, you can send a stream of 4 bits with 3
(additional) redundant bits. If zero or one of the seven bits are corrupted, using error correcting codes,
a receiver can identify the original 4 bits. Let's consider the case of sending a signal to a satellite where each bit is independently flipped with probability p = 0.1. (Hamming codes are super interesting; it's worth looking up if you haven't seen them before! All these problems can be approached using a binomial distribution, or from first principles.)
(a) If you send 4 bits, what is the probability that the correct message was received (i.e. none of the
bits are flipped).
(b) If you send 4 bits, with 3 (additional) Hamming error correcting bits, what is the probability that
a correctable message was received?
(c) Instead of using Hamming codes, you decide to send 100 copies of each of the four bits. If for
every single bit, more than 50 of the copies are not flipped, the signal will be correctable. What
is the probability that a correctable message was received?
Solution:
(a) We have X ∼ Bin(n = 4, p = 0.9) as the number of correct (unflipped) bits. So the binomial PMF says:

P(X = 4) = \binom{4}{4} 0.9^4 (0.1)^{4-4} = 0.9^4 \approx 0.656
Note we could have also approached this by letting Y ∼ Bin(4, 0.1) be the number of corrupted
(flipped) bits, and computing P (Y = 0). This is the same result!
(b) Let Z be the number of corrupted bits, then Z ∼ Bin(n = 7, p = 0.1), so we can use its PMF. A
message is correctable if Z = 0 or Z = 1 (mentioned above), so
P(Z = 0) + P(Z = 1) = \binom{7}{0} 0.1^0\, 0.9^7 + \binom{7}{1} 0.1^1\, 0.9^6 \approx 0.850
This is a 30% (relative) improvement compared to above by just using 3 extra bits!
(c) For i = 1, . . . , 4, let Xi ∼ Bin(n = 100, p = 0.9). We need X1 > 50, X2 > 50, X3 > 50, and
X4 > 50 for us to get a correctable message. For Xi > 50, we just sum the binomial PMF from
51 to 100:

P(X_i > 50) = \sum_{k=51}^{100} \binom{100}{k} 0.9^k (0.1)^{100-k}

By independence,

P(X_1 > 50, X_2 > 50, X_3 > 50, X_4 > 50) = P(X_1 > 50)\, P(X_2 > 50)\, P(X_3 > 50)\, P(X_4 > 50)

= \left(\sum_{k=51}^{100} \binom{100}{k} 0.9^k (0.1)^{100-k}\right)^4

> 0.999
But this required 400 bits instead of just the 7 required by Hamming codes! This is well worth
the tradeoff.
3. Suppose A and B are random, independent (possibly empty) subsets of {1, 2, ..., n}, where each subset
is equally likely to be chosen. Consider A ∩ B, i.e., the set containing elements that are in both A and
B. Let X be the random variable that is the size of A ∩ B. What is E [X]?
Solution: X ∼ Bin(n, 1/4), so E[X] = \frac{n}{4} (since we know the expected value of Bin(n, p) is np). How did we do that??
Choosing a random subset of {1, . . . , n} can be thought of as follows: for each element i = 1, . . . , n,
with probability 1/2 take the element (and with probability 1/2 don’t take it), independently of other
elements. This is a crucial observation.
For each element i = 1, \ldots, n, the element is either in A ∩ B or not. So let X_i be the indicator/Bernoulli rv of whether i ∈ A ∩ B or not. Then, P(X_i = 1) = P(i ∈ A, i ∈ B) = P(i ∈ A)\,P(i ∈ B) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}, because A, B are chosen independently, and each element is in A or B with probability 1/2. Note that these X_i's are independent, because one element being in the set does not affect another element being in the set. Hence, X = \sum_{i=1}^{n} X_i is the number of elements in our intersection, so X ∼ Bin(n, 1/4) and E[X] = np = \frac{n}{4}.

Note that it was not necessary that these variables were independent; we could have still applied linearity of expectation anyway to get \frac{n}{4}. We just wouldn't have been able to say X ∼ Bin(n, 1/4).
Chapter 3. Discrete Random Variables
3.5: Zoo of Discrete RVs Part II
Slides (Google Drive) Video (YouTube)
X is a uniform random variable, denoted X ∼ Unif(a, b), where a < b are integers, if and only if X has the following probability mass function:

p_X(k) = \begin{cases} \frac{1}{b-a+1}, & k \in \{a, a+1, \ldots, b\} \\ 0, & \text{otherwise} \end{cases}

X is equally likely to take on any value in Ω_X = \{a, a+1, \ldots, b\}. This set contains b − a + 1 integers, which is why P(X = k) is always \frac{1}{b-a+1}.

Additionally,

E[X] = \frac{a+b}{2} \quad \text{and} \quad Var(X) = \frac{(b-a)(b-a+2)}{12}
As you might expect, the expected value is just the average of the endpoints that the uniform random
variable is defined over.
E[X^2] = \sum_{k=a}^{b} k^2 \cdot p_X(k) = \sum_{k=a}^{b} k^2 \cdot \frac{1}{b-a+1} = \frac{1}{b-a+1} \sum_{k=a}^{b} k^2 = \ldots

Var(X) = E[X^2] - (E[X])^2 = \frac{(b-a)(b-a+2)}{12}
This variable models situations like rolling a fair six sided die. Let X be the random variable whose value
is the number face up on a die roll. Since the die is fair each outcome is equally likely, which means that
X ∼ Unif(1, 6) so
p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, \ldots, 6\} \\ 0, & \text{otherwise} \end{cases}
This is fairly intuitive, but is nice to have these formulas in our zoo so we can make computations quickly,
and think about random processes in an organized fashion. Using the equations above we can find that
E[X] = \frac{1+6}{2} = 3.5 \quad \text{and} \quad Var(X) = \frac{(6-1)(6-1+2)}{12} = \frac{35}{12}
Let X be the random variable that represents the number of independent coin flips up to and including your first head. Let's compute P(X = 4). X = 4 occurs exactly when there are 3 tails followed by a head. So,

P(X = 4) = P(TTTH) = (1-p)(1-p)(1-p)p = (1-p)^3 p

In general,

p_X(k) = (1-p)^{k-1} p

This is because there must be k − 1 tails in a row, followed by a head occurring on the k-th trial.
Let’s also verify that the probabilities sum to 1.
\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p \qquad \text{[Geometric PMF]}

= p \sum_{k=1}^{\infty} (1-p)^{k-1} \qquad \text{[take out constant]}

= p \sum_{k=0}^{\infty} (1-p)^k \qquad \text{[reindex to 0]}

= p \cdot \frac{1}{1-(1-p)} \qquad \left[\text{geometric series formula: } \sum_{i=0}^{\infty} r^i = \frac{1}{1-r} \text{ for } |r| < 1\right]

= p \cdot \frac{1}{p} = 1
The second last step used the geometric series formula - this may be why this random variable is called
Geometric!
X is a Geometric random variable, denoted X ∼ Geo(p), if and only if X has the following probability mass function (and range Ω_X = \{1, 2, \ldots\}):

p_X(k) = (1-p)^{k-1} p

Additionally,

E[X] = \frac{1}{p} \quad \text{and} \quad Var(X) = \frac{1-p}{p^2}
Example(s)
Let’s say you buy lottery tickets every day, and the probability you win on a given day is 0.01,
independently of other days. What is the probability that after a year (365 days), you still haven’t
won? What is the expected number of days until you win your first lottery?
Solution: If X is the number of days until the first win, then X ∼ Geo(p = 0.01). Hence, the probability we still haven't won after a year is (using the PMF)

P(X > 365) = 1 - P(X ≤ 365) = 1 - \sum_{k=1}^{365} P(X = k) = 1 - \sum_{k=1}^{365} (1 - 0.01)^{k-1} \cdot 0.01

This is great, but for the geometric, we can actually get a closed-form formula by thinking about what it means that X > 365 in English. X > 365 happens if and only if we lose on each of the first 365 days, which happens with probability 0.99^{365}. If you evaluated that nasty sum above and this quantity, you would find that they are equal!
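You can check this equality yourself with a couple of lines of Python (an illustrative sketch of ours):

```python
p = 0.01
tail_via_pmf = 1 - sum((1 - p) ** (k - 1) * p for k in range(1, 366))  # 1 - P(X <= 365)
tail_closed_form = (1 - p) ** 365                                      # lose all 365 days
print(tail_via_pmf, tail_closed_form)  # both ≈ 0.0255
```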
Finally, we can just cite the expectation of the Geometric RV:
E[X] = \frac{1}{p} = \frac{1}{0.01} = 100
This is the point of the zoo! We do all these generic calculations so we can use them later anytime.
Example(s)
You gamble by flipping a fair coin independently up to and including the first head. If it takes k tries,
you earn $2^k (i.e., if your first head was on the third flip, you would earn $2^3 = $8). How much would you
pay to play this game?
Solution: Let X be the number of flips up to and including the first head. Then, X ∼ Geo(1/2) because it's a fair coin, and

p_X(k) = \left(1 - \frac{1}{2}\right)^{k-1} \frac{1}{2} = \frac{1}{2^k}, \quad k = 1, 2, 3, \ldots
It is usually unwise to gamble, especially if your expected earnings are lower than the price to play. So, let Y = 2^X be your earnings; the amount you win depends on the number of flips it takes to get a head. We will use LOTUS to compute E[Y] = E[2^X]. Recall E[2^X] ≠ 2^{E[X]} = 2^2 = 4, as we've seen many times now.

E[Y] = E[2^X] = \sum_{k=1}^{\infty} 2^k\, p_X(k) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty
Some might say they would be willing to pay any finite amount of money to play this game. Think about
why that would be unwise, and what this means regarding the modeling tools we have provided you so far.
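(This game is the famous St. Petersburg paradox.) A simulation makes the trouble visible: the sample mean never settles down, because E[2^X] = ∞. Here is a short Python sketch of ours (the seed and trial counts are arbitrary):

```python
import random

def winnings() -> int:
    """Flip a fair coin until the first head; if it takes k flips, earn 2^k dollars."""
    k = 1
    while random.random() < 0.5:  # tails with probability 1/2
        k += 1
    return 2 ** k

random.seed(0)
for trials in (100, 10_000, 1_000_000):
    avg = sum(winnings() for _ in range(trials)) / trials
    print(trials, avg)  # the sample mean keeps drifting upward instead of converging
```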
Now suppose we flip a coin (with probability p of heads) until we get our r-th head, and let X be the total number of flips required. We can write

X = \sum_{i=1}^{r} X_i

where X_i is a geometric random variable that represents the number of flips it takes to get the i-th head after i − 1 heads have already occurred. Since all the flips are independent, so are the rvs X_1, \ldots, X_r. For example, if r = 3 we might observe the sequence of flips TTHHTTTH. In this case, X_1 = 3 and represents the number of trials from the 0th to the 1st head; X_2 = 1 and represents the number of trials from the 1st to the 2nd head; X_3 = 4 and represents the number of trials from the 2nd to the 3rd head. Remember this fact for later!
How do we find P(X = 8)? There must be exactly 3 heads and 5 tails, so it is reasonable to expect (1-p)^5 p^3 to come up somewhere in our final formula, but how many ways can we get a valid sequence of flips? Note that the last coin flip must be a head; otherwise we would've gotten our r = 3 heads earlier than our 8th flip. From here, any 2 of the first 7 flips can be heads, and the other 5 must be tails. Thus, there are \binom{7}{2} valid sequences of coin flips.

Each of these 7-flip sub-sequences (of the 8 total flips) occurs with probability (1-p)^5 p^2, and there is no overlap. However, we still need to include the probability that the last coin flip is a head. So,

p_X(8) = P(X = 8) = \binom{7}{2} (1-p)^5 p^2 \cdot p = \binom{7}{2} (1-p)^5 p^3

Generalizing, our r-th head must come at the k-th trial exactly; so in the first k − 1 flips we can get our r − 1 heads anywhere (hence the binomial coefficient \binom{k-1}{r-1}), and overall we have r heads and k − r tails.
If we are interested in finding the expected value of X, we might try the brute force approach directly from the definition of expected value:

E[X] = \sum_{k \in \Omega_X} k\, p_X(k) = \sum_{k=r}^{\infty} k \binom{k-1}{r-1} (1-p)^{k-r} p^r

but this approach is overly complicated, and there is a much simpler way using linearity of expectation! Suppose X_1, \ldots, X_r ∼ Geo(p) are independent. As we showed earlier, X = \sum_{i=1}^{r} X_i, and each E[X_i] = 1/p. Using linearity of expectation, we can derive the following:

E[X] = E\left[\sum_{i=1}^{r} X_i\right] = \sum_{i=1}^{r} E[X_i] = \sum_{i=1}^{r} \frac{1}{p} = \frac{r}{p}
Using a similar technique and the (yet unproven) fact that Var(X + Y) = Var(X) + Var(Y) for independent rvs, we can find the variance of X from the sum of the variances of the r geometric random variables:

Var(X) = Var\left(\sum_{i=1}^{r} X_i\right) = \sum_{i=1}^{r} Var(X_i) = \sum_{i=1}^{r} \frac{1-p}{p^2} = \frac{r(1-p)}{p^2}
This random variable is called the negative binomial random variable. It is quite common so it too deserves
a special place in our zoo.
X is a negative binomial random variable, denoted X ∼ NegBin(r, p), if and only if X has the following probability mass function (and range Ω_X = \{r, r+1, \ldots\}):

p_X(k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}, \quad k = r, r+1, \ldots

Additionally,

E[X] = \frac{r}{p} \quad \text{and} \quad Var(X) = \frac{r(1-p)}{p^2}
Also, note that Geo(p) ≡ NegBin(1, p), and that if X, Y are independent such that X ∼ NegBin(r, p)
and Y ∼ NegBin(s, p), then X + Y ∼ NegBin(r + s, p) (waiting for r + s heads).
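The decomposition X = X_1 + \ldots + X_r also gives a direct way to sample a negative binomial rv. Below is an illustrative Python sketch of ours that checks the mean r/p by simulation (the parameter choices are arbitrary):

```python
import random

def geometric(p: float) -> int:
    """Sample Geo(p): the number of flips up to and including the first head."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

def neg_binomial(r: int, p: float) -> int:
    """Sample NegBin(r, p) as a sum of r independent Geo(p) random variables."""
    return sum(geometric(p) for _ in range(r))

random.seed(7)
r, p, trials = 3, 0.25, 100_000
avg = sum(neg_binomial(r, p) for _ in range(trials)) / trials
print(avg, r / p)  # sample mean should be close to r/p = 12
```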
3.5.4 Exercises
1. You are a hardworking boxer. Your coach tells you that the probability of your winning a boxing
match is 0.25, independently of every other match.
(a) How many matches do you expect to fight until you win once?
(b) How many matches do you expect to fight until you win ten times?
(c) You only get to play 12 matches every year. To win a spot in the Annual Boxing Championship,
a boxer needs to win at least 10 matches in a year. What is the probability that you will go to
the Championship this year?
(d) Let q be your answer from the previous part. How many times can you expect to go to the
Championship in your 20 year career?
Solution:
(a) Let X be the number of matches you have to fight until you win once. Then, X ∼ Geo(p = 0.25), so E[X] = \frac{1}{p} = \frac{1}{0.25} = 4.
(b) Let Y be the number of matches you have to fight until you win ten times. Then, Y ∼ NegBin(r = 10, p = 0.25), so E[Y] = \frac{r}{p} = \frac{10}{0.25} = 40.
(c) Let Z be the number of matches you win out of 12. Then, Z ∼ Bin(n = 12, p = 0.25), and we want

P(Z ≥ 10) = \sum_{k=10}^{12} \binom{12}{k} 0.25^k (1 - 0.25)^{12-k}
(d) Let W be the number of times we make it to the Championship in 20 years. Then, W ∼ Bin(n =
20, p = q), and
E [W ] = np = 20q
2. You are in music class, and your cruel teacher says you cannot leave until you play the 1000-note
song Fur Elise correctly 5 times. You start playing the song, and if you play an incorrect note,
you immediately start the song over from scratch. You play each note correctly independently with
probability 0.999.
(a) What is the probability you play the 1000-note song Fur Elise correctly immediately? (i.e., the
first 1000 notes are all correct).
(b) What is the probability you take exactly 20 attempts to correctly play the song 5 times?
(c) What is the probability you take at least 20 attempts to correctly play the song 5 times?
(d) (Challenge) What is the expected number of notes you play until you finish playing Fur Elise
correctly 5 times?
Solution:
(a) Let X be the number of correct notes we play in Fur Elise in one attempt, so X ∼ Bin(1000, 0.999). We need P(X = 1000) = 0.999^{1000} ≈ 0.3677.
(b) If Y is the number of attempts until we play the song correctly 5 times, then Y ∼ NegBin(5, 0.3677),
and so
P(Y = 20) = \binom{20-1}{5-1} 0.3677^5 (1 - 0.3677)^{15} \approx 0.0269
(c) We can actually take two approaches to this. We can either take our Y from earlier, and compute
P(Y ≥ 20) = 1 - P(Y < 20) = 1 - \sum_{k=5}^{19} \binom{k-1}{4}\, 0.3677^5 (1 - 0.3677)^{k-5} \approx 0.1161
Notice the sum starts at 5 since that’s the lowest possible value of Y . This would be exactly the
probability of the statement asked. We could alternatively rephrase the question as: what is the probability we play the song correctly at most 4 times in the first 19 attempts? Check that these questions are equivalent! Then, we can let Z ∼ Bin(19, 0.3677) and instead compute

P(Z ≤ 4) = \sum_{k=0}^{4} \binom{19}{k} 0.3677^k (1 - 0.3677)^{19-k} \approx 0.1161
(d) We will have to revisit this question later in the course! Note that we could have computed
the expected number of attempts to finish playing Fur Elise, though, as it would follow a NegBin(5, 0.3677) distribution with expectation \frac{5}{0.3677} ≈ 13.598.
Chapter 3. Discrete Random Variables
3.6: Zoo of Discrete RVs Part III
Slides (Google Drive) Video (YouTube)
Suppose babies are born at a historical average rate of 2 babies per unit of time, and we want to model the number of babies born in the next unit of time. We start by breaking one unit of time into 5 parts, and we say that in each of the five chunks, either a baby is born or not. That means we'll be using a binomial rv with n = 5. The choice of p that will keep our average at 2 is \frac{2}{5}, because the expected value of a binomial rv is np = 2.

Similarly, if we break the time into even smaller chunks such as n = 10 or n = 70, we can get the corresponding p to be \frac{2}{10} or \frac{2}{70} respectively (either a baby is born or not in each tiny chunk of time).

And we keep increasing n so that the chunks get down to the smallest fraction of a second; we have n → ∞ and p → 0 in this fashion, while maintaining the condition that np = 2.
Let λ be the historical average number of events per unit of time. Send n → ∞ and p → 0 in such a way that np = λ is fixed (i.e., p = λ/n).

Let X_n ∼ Bin(n, λ/n) and Y = \lim_{n \to \infty} X_n be the limit of this sequence of Binomial rvs. Then, we say Y ∼ Poi(λ), and Y measures the number of events in a unit of time, where the historical average is λ. We'll derive its PMF by taking the limit of the binomial PMF.
We’ll need to recall how we defined the base of the natural logarithm e. There are two equivalent formulations.
e^x = \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n
e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}
We’ll now verify that the Poisson PMF does sum to 1, and is valid.
P∞ k
Recall the Taylor series for ex = k=0 xk! , so
∞
X ∞
X X λk ∞
λk
pY (k) = e−λ = e−λ = e−λ eλ = 1
k! k!
k=0 k=0 k=0
X ∼ Poi(λ) if and only if X has the following probability mass function (and range Ω_X = \{0, 1, 2, \ldots\}):

p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots

If λ is the historical average number of events per unit of time, then X is the number of events that occur in a unit of time.

Additionally,

E[X] = \lambda \quad \text{and} \quad Var(X) = \lambda
Proof of Expectation and Variance of Poisson. Let X_n ∼ Bin(n, λ/n) and Y = \lim_{n\to\infty} X_n = Poi(λ). By the properties of the binomial random variable, the mean and variance are as follows for any n (plug in λ = np, or equivalently, p = λ/n):

E[X_n] = np = \lambda

Var(X_n) = np(1-p) = \lambda\left(1 - \frac{\lambda}{n}\right)

Therefore:

E[Y] = E\left[\lim_{n \to \infty} X_n\right] = \lim_{n \to \infty} E[X_n] = \lim_{n \to \infty} \lambda = \lambda

Var(Y) = Var\left(\lim_{n \to \infty} X_n\right) = \lim_{n \to \infty} Var(X_n) = \lim_{n \to \infty} \lambda\left(1 - \frac{\lambda}{n}\right) = \lambda
Example(s)
Suppose the average number of babies born in Seattle historically is 2 babies every 15 minutes.
1. What is the probability no babies are born in the next hour in Seattle?
2. What is the expected number of babies born in the next hour?
3. What is the probability no babies are born in the next 5 minutes in Seattle?
Solution
1. Since Poi(λ) counts the number of events in a single unit of time (matching the units of λ), we must convert our rate to hours (since we are interested in one hour): 2 babies per 15 minutes is 8 babies per hour. So the number of babies born in the next hour can be modelled as X ∼ Poi(λ = 8/hr), and the probability no babies are born is

P(X = 0) = e^{-8} \frac{8^0}{0!} = e^{-8}

2. The expected number of babies born in the next hour is just E[X] = λ = 8.

3. Now convert the rate to 5-minute windows: 2 babies per 15 minutes is λ = 2/3 babies per 5 minutes. So if Y ∼ Poi(λ = 2/3) is the number of babies born in the next 5 minutes,

P(Y = 0) = e^{-2/3} \frac{(2/3)^0}{0!} = e^{-2/3}
Before doing the next example, let's talk about the sum of two independent Poisson rvs. Almost by definition, if X, Y are independent with X ∼ Poi(λ) and Y ∼ Poi(µ), then X + Y ∼ Poi(λ + µ). (For example, if the average number of babies born per minute in the USA is 5 and in Canada is 2, then the total number of babies born in the next minute combined is Poi(5 + 2), since the average combined rate is 7.) We'll prove this fact that the sum of independent Poisson rvs is Poisson with the sum of their rates in a future chapter!
Example(s)
Suppose Lookbook gets on average 120 new users per hour, and Quickgram gets 180 new users per
hour, independently. What is the probability that, combined, less than 2 users sign up in the next
minute?
Solution: Convert the λ's to the same unit of interest. For us, it's a minute. We can always change the rate λ (e.g., 120 per hour is the same as 2 per minute), but we can't change the unit of time we're interested in. So Lookbook gets 2 new users per minute and Quickgram gets 3 per minute, and if Z is the combined number of sign-ups in the next minute, then Z ∼ Poi(2 + 3) = Poi(5). Hence,

P(Z < 2) = p_Z(0) + p_Z(1) = e^{-5} \frac{5^0}{0!} + e^{-5} \frac{5^1}{1!} = 6e^{-5} \approx 0.04
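The Poisson PMF is simple to evaluate directly; here is a small Python sketch reproducing this calculation (the helper poisson_pmf is ours):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poi(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 5  # combined rate: 2 + 3 new users per minute
print(poisson_pmf(0, lam) + poisson_pmf(1, lam))  # P(Z < 2) = 6e^{-5} ≈ 0.04
```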
Suppose there is a candy bag of N = 9 total candies, K = 4 of which are lollipops. Our parents allow us to grab n = 3 of them. Let X be the number of lollipops we grab. What is the probability that we get exactly 2 lollipops?
The number of ways to grab three candies is just \binom{9}{3}, and we need to get exactly 2 lollipops out of the 4, which gives \binom{4}{2} ways. Out of the other 5 candies, we only need one of them, which yields \binom{5}{1} ways.

p_X(2) = P(X = 2) = \frac{\binom{4}{2}\binom{5}{1}}{\binom{9}{3}}
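Counting arguments like this translate directly to code. Here is an illustrative Python sketch of the hypergeometric PMF applied to the candy bag (the helper hypergeom_pmf is ours):

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) for X ~ HypGeo(N, K, n): k successes in n draws without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

print(hypergeom_pmf(2, N=9, K=4, n=3))  # C(4,2)·C(5,1)/C(9,3) = 30/84 ≈ 0.357
```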
We say the number of successes we draw is X ∼ HypGeo(N, K, n), where K out of N items in a bag are
successes, and we draw n without replacement.
X is the number of successes when drawing n items without replacement from a bag containing N items, K of which are successes (hence N − K failures). Its PMF is

p_X(k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}

Additionally,

E[X] = n\frac{K}{N} \qquad Var(X) = n\,\frac{K(N-K)(N-n)}{N^2(N-1)}

Note that if we drew with replacement, then we would model this situation using Bin(n, K/N), as each draw would be an independent trial.
To see where the expectation comes from, write X = \sum_{i=1}^{n} X_i, where X_i is the indicator that the i-th draw is a lollipop. Then, each X_i is Bernoulli, but with what parameter? The probability of getting a lollipop on the first draw (X_1 being equal to 1) is just K/N:

P(X_1 = 1) = \frac{K}{N}

What about P(X_2 = 1), the probability we get a lollipop on our second draw? Well, it depends on whether or not we got one on the first draw! So we can use the LTP, conditioning on whether we got one (X_1 = 1) or we didn't (X_1 = 0):

P(X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\,P(X_1 = 1) + P(X_2 = 1 \mid X_1 = 0)\,P(X_1 = 0) = \frac{K-1}{N-1} \cdot \frac{K}{N} + \frac{K}{N-1} \cdot \frac{N-K}{N} = \frac{K(N-1)}{N(N-1)} = \frac{K}{N}
Actually, each Xi ∼ Ber(K/N ), at every draw i! You could continue the above logic for X3 and so on. This
makes sense, because, if you just think about the i-th draw and you didn’t know anything about the first
i − 1, the probability you get a lollipop would just be K/N .
E[X_i] = \frac{K}{N}

E[X] = E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \frac{K}{N} = n\frac{K}{N}
Note again it would be wrong to say X ∼ Bin(n, K/N ) because the trials are NOT independent, but we are
still able to use linearity of expectation. If we did this experiment with replacement though (take one and
put it back), then the draws would be independent, and modelled as Bin(n, K/N ). Note the expectation
with or without replacement is the same because linearity of expectation doesn’t care about independence!
The variance is a nightmare, and will be proven in 5.4, when we figure out how to compute the variance of the sum of these dependent indicator variables.
The Zoo of Discrete RV’s: Here are all the distributions in our zoo of discrete random variables!
• The Bernoulli RV
• The Binomial RV
• The Uniform (Discrete) RV
• The Geometric RV
• The Negative Binomial RV
• The Poisson RV
• The Hypergeometric RV
Congratulations on making it through this chapter on all these wonderful discrete random variables! There are several practice problems below which require using a lot of these zoo elements. It will definitely take some time to get used to all of these - you'll need to practice! See our handy reference sheet at the end of the book for one place to see all of them while doing problems.
3.6.4 Exercises
1. Suppose that on average, 40 babies are born per hour in Seattle.
(a) What is the probability that over 1000 babies are born in a single day in Seattle?
(b) What is the probability that in a 365-day year, over 1000 babies are born on exactly 200 days?
Solution:
(a) The number of babies born in a single average day is 40 · 24 = 960, so X ∼ Poi(λ = 960). Then,

P(X > 1000) = 1 - P(X ≤ 1000) = 1 - \sum_{k=0}^{1000} e^{-960} \frac{960^k}{k!}
(b) Let q be the answer from part (a). The number of days where over 1000 babies are born is Y ∼ Bin(n = 365, p = q), so

P(Y = 200) = \binom{365}{200} q^{200} (1-q)^{165}
2. Suppose the Senate consists of 53 Republicans and 47 Democrats. Suppose we were to create a
bipartisan committee of 20 senators by randomly choosing from the 100 total.
(a) What is the probability we end up with exactly 9 Republicans and 11 Democrats?
(b) What is the expected number of Democrats on the committee?
Solution:
(a) Let X be the number of Republican senators chosen. Then X ∼ HypGeo(N = 100, K = 53, n = 20), and the desired probability is

P(X = 9) = \frac{\binom{53}{9}\binom{47}{11}}{\binom{100}{20}}
since choosing 9 out of 20 Republicans also implies immediately we have 11 out of 20 Democrats.
Note we could have flipped the roles of Democrats and Republicans. If Y is the number of
Democratic senators chosen, then Y ∼ HypGeo(N = 100, K = 47, n = 20), and
P(Y = 11) = \frac{\binom{47}{11}\binom{53}{9}}{\binom{100}{20}}
(b) The number of Democrats, as mentioned earlier, is Y ∼ HypGeo(N = 100, K = 47, n = 20), and so

E[Y] = n\frac{K}{N} = 20 \cdot \frac{47}{100} = 9.4
3. (Poisson Approximation to Binomial) Suppose the famous chip company “Bayes” produces
n = 10000 bags per day. They need to do a quality check, and they know that 0.1% of their bags
independently have “bad” chips in them.
(a) What is the exact probability that at most 5 bags contain “bad” chips?
(b) Recall the Poisson was derived from the Binomial with n → ∞ and p → 0, so it suggests that a
Poisson distribution would be a good approximation to a Binomial with large n and small p. Use
a Poisson rv instead to compute the same probability as in part (a). How close are the answers?
Note: The reason we sometimes use a Poisson approximation is that the binomial PMF can be hard to compute. Imagine X ∼ Bin(10000, 0.256); computing P(X = 2000) = \binom{10000}{2000} 0.256^{2000} (1 - 0.256)^{8000} requires thousands of multiplications of tiny probabilities. Furthermore, \binom{10000}{2000} = \frac{10000!}{2000!\,8000!} - good luck avoiding overflow on your computer!
Solution:
(a) If X is the number of bags with “bad” chips, then X ∼ Bin(n = 10000, p = 0.001), so

P(X ≤ 5) = \sum_{k=0}^{5} \binom{10000}{k} 0.001^k (1 - 0.001)^{10000-k} \approx 0.06699
(b) Since n is large and p is small, we might approximate X as a Poisson rv with λ = np = 10000 · 0.001 = 10. Then, since X ≈ Poi(10), we have

P(X ≤ 5) = \sum_{k=0}^{5} e^{-10} \frac{10^k}{k!} \approx 0.06709
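To see how close the approximation is, here is a short Python comparison of the computations from parts (a) and (b) (an illustrative sketch of ours):

```python
from math import comb, exp, factorial

n, p, lam = 10_000, 0.001, 10.0

binomial = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(6))
poisson = sum(exp(-lam) * lam ** k / factorial(k) for k in range(6))
print(binomial, poisson)  # ≈ 0.06699 vs ≈ 0.06709
```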
4. A 250-page book contains typos at a historical average rate of one typo per two pages.
(a) What is the probability that a given page contains at least one typo?
(b) What is the expected number of typos in the whole book?
(c) What is the probability that at most 50 of the 250 pages contain at least one typo?
(d) What is the expected page number of the first page that contains a typo?
(e) Suppose exactly 50 of the 250 pages contain a typo. If we choose 20 pages at random (without replacement), what is the probability that exactly 5 of them contain a typo?
Solution:
(a) The average rate of typos is one per two pages, or equivalently, 1/2 per one page. Hence, if X is
the number of typos on a page, then X ∼ Poi(λ = 1/2), and
P(X ≥ 1) = 1 - P(X = 0) = 1 - e^{-1/2} \frac{(1/2)^0}{0!} = 1 - e^{-1/2} \approx 0.39347
(b) Since we are interested in a 250 page “time period”, the average rate of typos is 125 per 250 pages.
If Y is the number of typos in total, then Y ∼ Poi(λ = 125), and E [Y ] = λ = 125.
(c) We can consider each page as Poi(1/2) like in part (a). Let Z be the number of pages with at
least one typo. Then, Z ∼ Bin(n = 250, p = 0.39347), and
P(Z ≤ 50) = \sum_{k=0}^{50} \binom{250}{k} 0.39347^k (1 - 0.39347)^{250-k}
(d) Let V be the first page that contains (at least) one typo. Then, V ∼ Geo(0.39347), so
E[V] = \frac{1}{0.39347} \approx 2.5415
(e) If W is the number of pages out of 20 that have a typo, then W ∼ HypGeo(N = 250, K = 50, n =
20), and
P(W = 5) = \frac{\binom{50}{5}\binom{200}{15}}{\binom{250}{20}}
Application Time!!
Now you’ve learned enough theory to discover the Bloom Filter covered in
section 9.4. You are highly encouraged to read that section before moving
on!
Chapter 4. Continuous Random Variables
We learned about how to model things like the number of car crashes in a year or the number of lottery tickets I must buy until I win. What about quantities like the time until the next earthquake, or the
height of human beings? These latter quantities are a completely different beast, since they can take on
uncountably infinitely many values (infinite decimal precision). Some of the ideas from the previous chapter
will stay, but we’ll have to develop new tools to handle this new challenge. We’ll also learn about the most
important continuous distribution: the Normal distribution.
Up to this point, we have only been talking about discrete random variables - ones that only take values
in a countable (finite or countably infinite) set like the integers or a subset. What if we wanted to model
quantities that were continuous - that could take on uncountably infinitely many values? If you haven’t
studied or seen cardinality (or types of infinities) before, you can think of this as being intervals of the real
line, which take decimal values. Our tools from the previous chapter are not suitable for modelling these situations, and so we need a new type of random variable.
For a continuous random variable, we can't define a probability mass function as before: the probability of taking on any single value is 0, and there is no way to ensure that the sum of the probabilities is 1 (assuming we could even sum over uncountably many values; we can't). Instead, we have the idea of a probability density function, where the x-axis has values in the random variable's range (usually an interval), and the y-axis has the probability density (not mass), which is explained below.
The probability density function f_X has some characteristic properties (denoted with f_X to distinguish from PMFs p_X). Notice again I will use different dummy variables inside the function, like f_X(z) or f_X(t), to ensure you get the idea that the density is f_X (the subscript indicates it is for the rv X) and the dummy variable can be anything.

• f_X(z) ≥ 0 for all z ∈ R; i.e., it is always non-negative, just like a probability mass function.

• \int_{-\infty}^{\infty} f_X(t)\,dt = 1; i.e., the area under the entire curve is equal to 1, just like the sum of all the probabilities of a discrete random variable equals 1.

• P(a ≤ X ≤ b) = \int_{a}^{b} f_X(w)\,dw; i.e., the probability that X lies in the interval a to b is the area under the curve from a to b. This is key - integrating f_X gives us probabilities.

• P(X = y) = P(y ≤ X ≤ y) = \int_{y}^{y} f_X(w)\,dw = 0. The probability of being a particular value is 0, and NOT equal to the density f_X(y), which can be nonzero. This is particularly confusing at first.

• P(X ≈ q) = P\left(q - \frac{\varepsilon}{2} ≤ X ≤ q + \frac{\varepsilon}{2}\right) ≈ \varepsilon f_X(q); i.e., with a small epsilon value, we can obtain a good rectangle approximation of the area under the curve. The width of the rectangle is ε (from the difference between q + ε/2 and q − ε/2). The height of the rectangle is f_X(q), the value of the probability density function f_X at q. So, the area of the rectangle is ε f_X(q). This is similar to the idea of Riemann integration.

• \frac{P(X ≈ u)}{P(X ≈ v)} ≈ \frac{\varepsilon f_X(u)}{\varepsilon f_X(v)} = \frac{f_X(u)}{f_X(v)}; i.e., the PDF tells us ratios of probabilities of being “near” a point. From the previous point, we know the probabilities of X being approximately u and v, and through algebra, we see their ratio. For example, if the density is twice as high at u as it is at v, it means we are twice as likely to get a point “near” u as we are to get one “near” v.
Let X be a continuous random variable (one whose range is typically an interval or union of intervals).
The probability density function (PDF) of X is the function fX : R → R, such that the following
properties hold:
• $f_X(z) \ge 0$ for all z ∈ R
• $\int_{-\infty}^{\infty} f_X(t)\,dt = 1$
• $P(a \le X \le b) = \int_a^b f_X(w)\,dw$
• P (X = y) = 0 for any y ∈ R
• The probability that X is close to q is proportional to its density fX (q);
$$P(X \approx q) = P\left(q - \frac{\varepsilon}{2} \le X \le q + \frac{\varepsilon}{2}\right) \approx \varepsilon f_X(q)$$
• Ratios of probabilities of being "near points" are maintained: $\frac{P(X \approx u)}{P(X \approx v)} = \frac{f_X(u)}{f_X(v)}$
As a first example, let X ∼ Unif(0, 1) (continuous), with density fX (x) = 1 for x ∈ [0, 1] (and 0 otherwise). We know this is a valid density, because the area under the curve is the area of a square with side lengths 1, which is 1 · 1 = 1.
We define the cumulative distribution function (CDF) of X to be FX (w) = P (X ≤ w): that is, all the area to the left of w in the density function. Note we also have CDFs for discrete random variables; they are defined exactly the same way (the probability of being less than or equal to a certain value)! They just don't usually have a nice closed form like they do for continuous RVs. For continuous random variables, the CDF at w is just the cumulative area to the left of w, which can be found by an integral (the dummy variable of integration should be different than the input variable w):
$$F_X(w) = P(X \le w) = \int_{-\infty}^{w} f_X(y)\,dy$$
Let’s try to compute the CDF of this uniform random variable on [0, 1]. There are three cases to consider
here.
• If w < 0, FX (w) = 0 since ΩX = [0, 1]. For example, if w = −1, then FX (w) = P (X ≤ −1) = 0 since
there is no chance that X ≤ −1. Formally, there is also no area to the left of w = −1 as you can see
from the PDF above, so the integral evaluates to 0!
• If 0 ≤ w ≤ 1, the area up to w is a rectangle of height 1 and width w (see below), so FX (w) = w.
That is, P (X ≤ w) = w. For example, if w = 0.5, then the probability X ≤ 0.5 is actually just
0.5 since X is just equally likely to be anywhere in ΩX = [0, 1]! Note here we didn’t do an integral
since there are nice shapes, and we sometimes don’t have to! We just looked at the area to the left of w.
• If w > 1, all the area is up to the left of w, so FX (w) = 1. Again, since ΩX = [0, 1] and suppose w = 2,
then FX (w) = P (X ≤ 2) = 1 since X is always between 0 and 1 (X must be less than or equal to 2).
Formally, the cumulative area to the left of w = 2 is 1 (just the area of the square)!
$$F_X(w) = \begin{cases} 0 & w < 0 \\ w & 0 \le w \le 1 \\ 1 & w > 1 \end{cases}$$
Let X be a continuous random variable (one whose range is typically an interval or union of intervals).
The cumulative distribution function (CDF) of X is the function FX : R → R such that:
• $F_X(t) = P(X \le t) = \int_{-\infty}^{t} f_X(w)\,dw$ for all t ∈ R
• $\frac{d}{du}F_X(u) = f_X(u)$
• P (a ≤ X ≤ b) = FX (b) − FX (a)
• FX is monotone increasing, since fX ≥ 0. That is, FX (c) ≤ FX (d) for c ≤ d.
• limv→−∞ FX (v) = P (X ≤ −∞) = 0
• limv→+∞ FX (v) = P (X ≤ +∞) = 1
Example(s)
Suppose the number of hours that a package gets delivered past noon is modelled by the following
PDF:
$$f_X(x) = \begin{cases} x/10 & 0 \le x \le 2 \\ c & 2 < x \le 6 \\ 0 & \text{otherwise} \end{cases}$$
Here is a graph of the PDF as described above:
1. What is the range of X, ΩX?
2. What value of c makes fX a valid PDF?
3. What is the CDF of X, FX?
4. What is P (2 ≤ X ≤ 6)?
5. Set up (but do not evaluate) an integral for E [X].
Solution
1. The range is all values where the density is nonzero; in our case, that is ΩX = [0, 6] (or (0, 6)), but we
don’t care about single points or endpoints because the probability of being exactly that value is 0.
2. Formally, we need the density function to integrate to 1; that is,
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
But since the density function is split into three parts, we can split our integral into three. However, anywhere the density is zero, we will get an integral of zero, so we'll only set up the two integrals that are nontrivial:
$$\int_0^2 \frac{x}{10}\,dx + \int_2^6 c\,dx = 1$$
Solving this equation for c would definitely work. But let’s try to use geometry instead, as we do know
how to compute the area of a triangle and rectangle. So the left integral is the area of the triangle
with base from 0 to 2 and height c, so that area is 2c/2 = c (the area of a triangle is b · h/2). The area of the rectangle with base from 2 to 6 and height c is 4c. We need the total area c + 4c to equal 1, so c = 1/5.
3. Our CDF needs four cases: when x < 0, when 0 ≤ x ≤ 2, when 2 < x ≤ 6, and when x > 6.
(a) The outer cases are usually the easiest ones: if x < 0, then FX (x) = P (X ≤ x) = 0 since X
cannot be less than zero.
(b) If x > 6, then FX (x) = P (X ≤ x) = 1 since X is guaranteed to be at most 6.
(c) For 0 ≤ x ≤ 2, we need the cumulative area to the left of x, which happens to be a triangle with
base x and height x/10, so the area is x2 /20. Alternatively, evaluate the integral
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_0^x \frac{t}{10}\,dt = \frac{x^2}{20}$$
(d) For 2 < x ≤ 6, we have the entire triangle of area 2 · 1/5 · 0.5 = 1/5, but also a rectangle of base
x − 2 and height 1/5, for a total area of 1/5 + 1/5(x − 2) = x/5 − 1/5. Alternatively, the integral
would be
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_0^2 \frac{t}{10}\,dt + \int_2^x \frac{1}{5}\,dt = \frac{x}{5} - \frac{1}{5}$$
Again, I skipped all the integral evaluation steps as they are purely computational, but feel free
to verify!
Finally, putting this together gives
$$F_X(x) = \begin{cases} 0 & x < 0 \\ x^2/20 & 0 \le x \le 2 \\ x/5 - 1/5 & 2 < x \le 6 \\ 1 & x > 6 \end{cases}$$
4. Using the formula, we find the area between 2 and 6 to get $P(2 \le X \le 6) = \int_2^6 f_X(t)\,dt = \int_2^6 \frac{1}{5}\,dt = \frac{4}{5}$. Alternatively, we can just see the area from 2 to 6 is a rectangle with base 4 and height 1/5, so the probability is just 4/5. We could also use the CDF: $F_X(6) - F_X(2) = 1 - 1/5 = 4/5$; this is just the area to the left of 6, minus the area to the left of 2, which gives us the area between 2 and 6.
5. We’ll use the formula for expectation of a continuous RV, but split into three integrals again due to
the piecewise definition of our density. However, the integral outside the range [0, 6] will evaluate to
zero, so we won’t include it.
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^2 x f_X(x)\,dx + \int_2^6 x f_X(x)\,dx = \int_0^2 x\cdot\frac{x}{10}\,dx + \int_2^6 x\cdot\frac{1}{5}\,dx$$
We won’t do the computation because it’s not important, but hopefully you get the idea of how similar
this is to the discrete version!
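If you'd like to double-check an example like this numerically, here is a short sketch (my own addition, using scipy's quad integrator on the piecewise density above) that verifies c = 1/5 makes the density integrate to 1, recomputes P(2 ≤ X ≤ 6), and evaluates the expectation integral (which works out to 52/15):

from scipy import integrate

c = 1 / 5
f = lambda x: x / 10 if 0 <= x <= 2 else (c if 2 < x <= 6 else 0.0)

total, _ = integrate.quad(f, 0, 6, points=[2])                # should be 1.0
p_2_to_6, _ = integrate.quad(f, 2, 6)                         # should be 4/5
ex, _ = integrate.quad(lambda x: x * f(x), 0, 6, points=[2])  # E[X] = 52/15

print(total, p_2_to_6, ex)   # ≈ 1.0, 0.8, 3.4667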
[Summary table: Discrete vs. Continuous random variables]
4.1.4 Exercises
1. Suppose X is continuous with density
$$f_X(x) = \begin{cases} cx^2 & 0 \le x \le 9 \\ 0 & \text{otherwise} \end{cases}$$
Write an expression for the value of c that makes X a valid PDF, and set up expressions (integrals)
for its mean and variance. Also, find the CDF of X, FX .
Solution: For fX to be a valid PDF, we need $\int_0^9 cx^2\,dx = 243c = 1$, so $c = \frac{1}{243}$. Then
$$E[X] = \int_{-\infty}^{\infty} z f_X(z)\,dz = \int_0^9 z\cdot\frac{1}{243}z^2\,dz = \frac{1}{243}\int_0^9 z^3\,dz$$
Similarly, by LOTUS,
$$E[X^2] = \int_{-\infty}^{\infty} z^2 f_X(z)\,dz = \int_0^9 z^2\cdot\frac{1}{243}z^2\,dz = \frac{1}{243}\int_0^9 z^4\,dz$$
so $Var(X) = E[X^2] - E[X]^2$. For the CDF, if $0 \le x \le 9$, $F_X(x) = \int_0^x \frac{t^2}{243}\,dt = \frac{x^3}{729}$ (and FX (x) = 0 for x < 0, FX (x) = 1 for x > 9).
2. Suppose X is continuous with range ΩX = [1, ∞) and density fX (x) = c/x² for x ≥ 1 (and 0 otherwise). Find the value of c that makes fX a valid PDF, its expected value, and its CDF FX.

Solution:
Since $\int_1^\infty \frac{c}{z^2}\,dz = c$, we need c = 1. The expected value is the weighted average of each point weighted by its density, so
$$E[X] = \int_{-\infty}^{\infty} z f_X(z)\,dz = \int_1^\infty z\cdot\frac{1}{z^2}\,dz = \int_1^\infty \frac{1}{z}\,dz = [\ln(z)]_1^\infty = \infty$$
We actually have two cases. If t < 1, FX (t) = 0 since there’s no way to get a number less than 1 (the
range is ΩX = [1, ∞)). For t > 1, we just do a normal integral to get that
$$F_X(t) = P(X \le t) = \int_{-\infty}^{t} f_X(s)\,ds = \int_{-\infty}^{1} f_X(s)\,ds + \int_1^t f_X(s)\,ds = \int_1^t \frac{1}{s^2}\,ds = \left[-\frac{1}{s}\right]_1^t = -\frac{1}{t} - (-1) = 1 - \frac{1}{t}$$
Now that we’ve learned about the properties of continuous random variables, we’ll discover some frequently
used RVs just like we did for discrete RVs! In this section, we’ll learn the continuous Uniform distribution,
the Exponential distribution, and Gamma distribution. In the next section, we’ll finally learn about the
Normal/Gaussian (bell-shaped) distribution which you all may have heard of before!
4.2.1 The (Continuous) Uniform RV
The continuous uniform random variable models a situation where there is no preference for any particular
value over a bounded interval. This is very similar to the discrete uniform random variable (e.g., roll of a fair
die), except extended to include decimal values. The probability of equalling any particular value is again 0
since we are dealing with a continuous RV.
X ∼ Unif(a, b) (continuous) where a < b are real numbers, if and only if X has the following PDF:
$$f_X(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$$
X is equally likely to take on any value in [a, b]. Note the similarities and differences it has with the discrete uniform! The value of the density function is constant at 1/(b − a) for any input x ∈ [a, b], making the graph a rectangle whose area integrates to 1.
$$E[X] = \frac{a+b}{2}, \qquad Var(X) = \frac{(b-a)^2}{12}$$
The CDF is
$$F_X(x) = \begin{cases} 0 & x < a \\ \frac{x-a}{b-a} & a \le x \le b \\ 1 & x > b \end{cases}$$
Proof of Expectation and Variance of Uniform. I’m setting up the integrals but omitting the steps that are
not relevant to your understanding of probability theory (computing integrals):
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_a^b x\cdot\frac{1}{b-a}\,dx = \frac{a+b}{2}$$
$$E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_a^b x^2\cdot\frac{1}{b-a}\,dx = \frac{a^2+ab+b^2}{3}$$
$$Var(X) = E[X^2] - E[X]^2 = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}$$
Example(s)
Suppose we think that a Hollywood movie’s overall rating is equally likely to be any decimal value in
the interval [1, 5] (this may not be realistic). You may be able to do these questions “in your head”,
but I encourage you to formalize the questions and solutions to practice the notation and concepts
we’ve learned. (You probably wouldn’t be able to do them “in your head” if the movie rating wasn’t
uniformly distributed!)
1. A movie is considered average if its overall rating is between 1.5 and 4.5. What is the probability that it is average?
2. A movie is considered a huge success if its overall rating is at least 4.5. What is the probability
that it is a huge success?
3. A movie is considered legendary if its overall rating is at least 4.95. Given that a movie is a
huge success, what is the probability it is legendary?
Solution Before starting, we can write that the overall rating of a movie is X ∼ Unif(1, 5). Hence, its density
function is $f_X(x) = \frac{1}{5-1} = \frac{1}{4}$ for x ∈ [1, 5] (and 0 otherwise).
1. We know the probability of being in the range [1.5, 4.5] is the area under the density function from 1.5
to 4.5, so
$$P(1.5 \le X \le 4.5) = \int_{1.5}^{4.5} f_X(x)\,dx = \int_{1.5}^{4.5} \frac{1}{4}\,dx = \frac{3}{4}$$
You could have also drawn a picture of this density function (which is flat at 1/4), and exploited
geometry to figure that the base of the rectangle is 3 and the height is 1/4.
2. Similarly,
$$P(X \ge 4.5) = \int_{4.5}^{\infty} f_X(x)\,dx = \int_{4.5}^{5} \frac{1}{4}\,dx = \frac{1}{8}$$
Note that the density function for values x ≥ 5 is zero, so that’s why the integral changed its upper
bound from ∞ to 5 when replacing the density!
3. We'll use Bayes' Theorem (or just the definition of conditional probability, since the event {X ≥ 4.95} is contained in {X ≥ 4.5}):
$$P(X \ge 4.95 \mid X \ge 4.5) = \frac{P(X \ge 4.95)}{P(X \ge 4.5)} = \frac{0.05/4}{0.5/4} = \frac{1}{10}$$
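As a sanity check (my addition, not part of the original text), a quick simulation of X ∼ Unif(1, 5) reproduces all three answers:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=2_000_000)

print(np.mean((x >= 1.5) & (x <= 4.5)))   # P(average) ≈ 3/4
print(np.mean(x >= 4.5))                  # P(huge success) ≈ 1/8
huge = x[x >= 4.5]                        # condition on being a huge success
print(np.mean(huge >= 4.95))              # P(legendary | huge success) ≈ 1/10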
4.2.2 The Exponential RV
The Exponential random variable is the continuous analogue of the Geometric (discrete) RV: instead of counting the number of trials until a success occurs, it measures the (continuous) amount of time until an event occurs.
Recall the Poisson Process with parameter λ > 0 has events happening at average rate of λ per unit time
forever. The exponential RV measures the time (e.g., 4.33212 seconds, 9.382 hours, etc.) until the first
occurrence of an event, so is a continuous RV with range [0, ∞) (unlike the Poisson RV, which counts the
number of occurrences in a unit of time, with range {0, 1, 2, ...}, and is a discrete RV).
Let Y ∼ Exp(λ) be the time until the first event. We’ll first compute its CDF FY (t) and then differentiate
it to find its PDF fY (t).
Let X(t) ∼ Poi(λt) be the number of events in the first t units of time, for t ≥ 0 (if the average is λ per
unit of time, then it is λt per t units of time). Then, Y > t (wait longer than t units of time until the first
event) if and only if X(t) = 0 (no events happened in the first t units of time). This allows us to relate the
Exponential CDF to the Poisson PMF.
$$P(Y > t) = P(\text{no events in the first } t \text{ units}) = P(X(t) = 0) = e^{-\lambda t}\frac{(\lambda t)^0}{0!} = e^{-\lambda t}$$
Note that we plugged in the Poi(λt) PMF at 0 in the second-to-last equality. Now, the CDF is just the complement of the probability we computed:
$$F_Y(t) = P(Y \le t) = 1 - P(Y > t) = 1 - e^{-\lambda t}$$
Remember since the CDF was the integral of the PDF, the PDF is the derivative of the CDF by the
fundamental theorem of calculus:
$$f_Y(t) = \frac{d}{dt}F_Y(t) = \lambda e^{-\lambda t}$$
X ∼ Exp(λ), if and only if X has the following PDF (and range ΩX = [0, ∞)):
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
X is the waiting time until the first occurrence of an event in a Poisson Process with parameter λ.
$$E[X] = \frac{1}{\lambda}, \qquad Var(X) = \frac{1}{\lambda^2}$$
The CDF is
$$F_X(x) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
Proof of Expectation and Variance of Exponential. You can use integration by parts if you want to solve
these integrals, or you can use WolframAlpha. Again, I’m omitting the steps that are not relevant to your
understanding of probability theory (computing integrals):
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^\infty x\cdot\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}$$
$$E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_0^\infty x^2\cdot\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}$$
$$Var(X) = E[X^2] - E[X]^2 = \frac{2}{\lambda^2} - \left(\frac{1}{\lambda}\right)^2 = \frac{1}{\lambda^2}$$
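The Poisson/Exponential relationship P (Y > t) = P (X(t) = 0) from the derivation above is also easy to see by simulation. A minimal Python sketch (my addition; the particular λ and t are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
lam, t = 2.0, 0.7
counts = rng.poisson(lam * t, size=1_000_000)     # X(t) ~ Poi(λt)
waits = rng.exponential(1 / lam, size=1_000_000)  # Y ~ Exp(λ); numpy's scale = 1/λ

print(np.mean(counts == 0))    # P(X(t) = 0)
print(np.mean(waits > t))      # P(Y > t)
print(np.exp(-lam * t))        # both should be ≈ e^{-λt} ≈ 0.2466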
If you usually skip examples, please don’t skip the next two. The first example here highlights the relationship
between the Poisson and Exponential RVs, and the second highlights the memoryless property!
Example(s)
Suppose that, on average, 13 car crashes occur each day on Highway 101. What is the probability
that no car crashes occur in the next hour? Be careful of units of time!
Solution We will solve this problem with three equivalent approaches! Take the time to understand why
each of them work.
1. On average there are 13/24 car crashes per hour. So the number of crashes in the next hour is X ∼ Poi(λ = 13/24), and
$$P(X = 0) = e^{-13/24}\frac{(13/24)^0}{0!} = e^{-13/24}$$
2. Similar to above, the time (in hours) until the first car crash is Y ∼ Exp(λ = 13/24), since on average 13/24 car crashes happen per hour. Then, the probability no car crashes happen in the next hour is
$$P(Y > 1) = 1 - F_Y(1) = e^{-(13/24)\cdot 1} = e^{-13/24}$$
3. If we don't want to change the units, then we can say the waiting time until the next car crash (in days) is Z ∼ Exp(λ = 13), since on average 13 car crashes happen per day. Then, the probability no car crashes occur in the next hour (1/24 of a day) is the probability that we wait longer than 1/24 of a day:
$$P(Z > 1/24) = 1 - F_Z(1/24) = e^{-13/24}$$
Hopefully the first and second solutions show you the relationship between the Poisson and Exponential RVs
(they both come from the Poisson process), and the second and third solution show you how to be careful
with units and that you’ll get the same answer as long as you are consistent.
Example(s)
Suppose the lifetime (in hours) of a battery is modelled by an Exponential RV whose average is 50 hours.
1. What is the probability the battery lasts more than 60 hours?
2. What is the probability the battery lasts more than 40 hours?
3. Given that the battery has already lasted at least 60 hours, what is the probability it lasts at least 100 hours total?

Solution Since we want to model battery life, we should use an Exponential distribution. Since we know the average battery life is 50 hours, and the expected value of an exponential RV is 1/λ (see above), we should say that the battery life is X ∼ Exp(λ = 1/50 = 0.02).
1. If we want the probability the battery lasts more than 60 hours, then we want
$$P(X \ge 60) = \int_{60}^{\infty} f_X(t)\,dt = \int_{60}^{\infty} 0.02e^{-0.02t}\,dt = e^{-1.2}$$
But continuous distributions have a CDF which we can and should take advantage of! We can look up the CDF above as well:
$$P(X \ge 60) = 1 - P(X < 60) = 1 - F_X(60) = 1 - (1 - e^{-0.02\cdot 60}) = e^{-1.2}$$
We made a step above that said P (X < 60) = FX (60), but FX (60) = P (X ≤ 60). It turns out they are the same for continuous RVs, since the probability that X = 60 exactly is zero!
2. Similarly,
$$P(X \ge 40) = 1 - P(X < 40) = 1 - F_X(40) = 1 - (1 - e^{-0.02\cdot 40}) = e^{-0.8}$$
3. By Bayes' Theorem,
$$P(X \ge 100 \mid X \ge 60) = \frac{P(X \ge 60 \mid X \ge 100)\,P(X \ge 100)}{P(X \ge 60)}$$
Note that P (X ≥ 60 | X ≥ 100) = 1 (why?) and P (X ≥ 100) = e^{−0.02·100} = e^{−2} (same process as above), so plugging in these numbers gives
$$P(X \ge 100 \mid X \ge 60) = \frac{1\cdot e^{-2}}{e^{-1.2}} = e^{-0.8}$$
Note that this is exactly the same as P (X ≥ 40) above, the probability that the battery lasts at least 40 hours. This says that the previous 60 hours don't matter: P (X ≥ 40 + 60 | X ≥ 60) = P (X ≥ 40).
This property is called memorylessness, since the battery essentially forgets that it was alive for 60
hours! We’ll discuss this more formally below and prove it.
4.2.3 Memorylessness
Definition 4.2.3: Memorylessness
A random variable X is memoryless if, for all s, t ≥ 0,
$$P(X > s + t \mid X > s) = P(X > t)$$
We just saw a concrete example above, but let's see another. Let s = 7, t = 2. So P (X > 9 | X > 7) = P (X > 2).
This memoryless property says that, given we’ve waited (at least) 7 minutes, the probability we wait (at
least) 2 more minutes, is the same as the probability we waited (at least 2) more from the beginning. That
is, the random variable “forgot” how long we’ve already been waiting.
The only memoryless RVs are the Geometric (discrete) and Exponential (Continuous)! This is because
events happen independently over time/trials, and so the past doesn’t matter.
We've seen it algebraically and intuitively, but let's see it pictorially as well. Here is a picture of the probability that X is greater than 1 for an exponential RV. It is the area to the right of 1 under the density function λe^{−λx} for x ≥ 0 (shaded in blue).
Below is a picture of the probability X > 2.5 given X > 1.5 (shaded in orange and blue). If you hide the
area to the left of 1.5, you can see the ratio of the orange area (right of 2.5) to the entire shaded region (right
of 1.5) is the same as P (X > 1) above. So this exponential density function has memorylessness built in!
Then, I’ll leave it to you to do the same computation as above (using Bayes’ Theorem). You’ll see it work
out almost exactly the same way!
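Here is a simulation sketch of the memorylessness computation (my addition; the rate λ = 0.5 is an arbitrary choice): P (X > 9 | X > 7) and P (X > 2) should both come out to about e^{−1} ≈ 0.368.

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 0.5, size=5_000_000)  # X ~ Exp(λ = 0.5)

survived_7 = x[x > 7]              # condition on the event {X > 7}
print(np.mean(survived_7 > 9))     # P(X > 9 | X > 7)
print(np.mean(x > 2))              # P(X > 2); memorylessness says these match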
4.2.4 The Gamma RV
X ∼ Gamma(r, λ) measures the waiting time until the r-th event in a Poisson process with rate λ; equivalently, it is the sum of r independent Exp(λ) RVs, with E [X] = r/λ and Var (X) = r/λ².
Proof of Expectation and Variance of Gamma. The PDF of the Gamma looks very ugly and hard to deal with, so let's use our favorite trick: Linearity of Expectation! As mentioned earlier, if X ∼ Gamma(r, λ), then $X = \sum_{i=1}^r X_i$ where each Xi ∼ Exp(λ) is independent, with E [Xi ] = 1/λ and Var (Xi ) = 1/λ². So by Linearity of Expectation,
" r
# r r
X X X 1 r
E [X] = E Xi = E [Xi ] = =
i=1 i=1 i=1
λ λ
Now, we can use the fact that the variance of a sum of independent rvs is the sum of the variances (we have
yet to prove this fact).
$$Var(X) = Var\left(\sum_{i=1}^r X_i\right) = \sum_{i=1}^r Var(X_i) = \sum_{i=1}^r \frac{1}{\lambda^2} = \frac{r}{\lambda^2}$$
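The Gamma PDF never has to enter the picture to check these formulas by simulation. A sketch I've added (r = 5 and λ = 2 are arbitrary choices): summing r independent Exp(λ) samples should give mean r/λ and variance r/λ².

import numpy as np

rng = np.random.default_rng(3)
r, lam = 5, 2.0
# Each row is one Gamma(r, λ) draw: the sum of r independent Exp(λ) draws.
x = rng.exponential(scale=1 / lam, size=(1_000_000, r)).sum(axis=1)

print(x.mean(), r / lam)      # both ≈ 2.5
print(x.var(), r / lam**2)    # both ≈ 1.25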
Wow, several new distributions added to our arsenal, and also to our handy reference sheet at the end of the
book! Check it again for the second page covering continuous RVs - some more are still to come!
4.2.5 Exercises
1. Suppose that on average, 40 babies are born every hour in Seattle.
(a) What is the probability that no babies are born in the next minute? Try solving this in two
different but equivalent ways - using a Poisson and Exponential RV.
(b) What is the probability that it takes more than 20 minutes for the first 10 babies to be born?
Again, try solving this in two different but equivalent ways - using a Poisson and Gamma RV.
(c) What is the expected time until the 5th baby is born?
Solution:
(a) The number of babies born in the next minute is X ∼ Poi(40/60), so P (X = 0) = e^{−40/60} ≈ 0.5134. Alternatively, the time in minutes until the next baby is born is Y ∼ Exp(40/60), and we want the probability that no babies are born in the next minute; i.e., it takes at least one minute for the first baby to be born. Hence,
$$P(Y > 1) = 1 - F_Y(1) = e^{-40/60} \approx 0.5134$$
(b) The number of babies born in the next 20 minutes is W ∼ Poi(20 · 40/60 = 40/3), and it takes more than 20 minutes for the first 10 babies to be born exactly when at most 9 are born in those 20 minutes:
$$P(W \le 9) = \sum_{k=0}^{9} e^{-40/3}\frac{(40/3)^k}{k!}$$
Alternatively, the time in minutes until the tenth baby is born is Z ∼ Gamma(10, 40/60), and the probability this is over 20 minutes is
$$P(Z > 20) = 1 - F_Z(20) = 1 - \int_0^{20} \frac{(40/60)^{10}}{(10-1)!}\,x^{10-1}e^{-(40/60)x}\,dx$$
Unfortunately, there isn't a nice closed form for the Gamma CDF, but both expressions evaluate to the same result (see the numerical check after part (c))!
(c) The time in minutes until the 5th baby is born is V ∼ Gamma(5, 40/60), so $E[V] = \frac{r}{\lambda} = \frac{5}{40/60} = 7.5$ minutes.
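Numerically (a check I've added, using scipy's distribution objects), the two answers to part (b) agree:

from scipy import stats

lam = 40 / 60                                   # babies per minute
print(stats.poisson.cdf(9, mu=lam * 20))        # P(W <= 9) for W ~ Poi(40/3)
print(stats.gamma.sf(20, a=10, scale=1 / lam))  # P(Z > 20) for Z ~ Gamma(10, λ)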
2. You are waiting for a bus to take you home from CSE. You can either take the E-line, U-line, or C-line.
The distribution of the waiting time in minutes for each is the following:
• E-Line: E ∼ Exp(λ = 0.1).
• U-Line: U ∼ Unif(0, 20) (continuous).
• C-Line: Has range (1, ∞) and PDF fC (x) = 1/x2 .
Assume the three bus arrival times are independent. You take the first bus that arrives.
(a) Find the CDFs of E, U , and C, FE (t), FU (t) and FC (t). Hint: The first two can be looked up in
our distributions handout!
(b) What is the probability you wait more than 5 minutes for a bus?
(c) What is the probability you wait more than 30 minutes for a bus?
Solution:
(a) The CDF of E for t > 0 is FE (t) = 1 − e^{−0.1t} (see above).
The CDF of U for 0 < t < 20 is FU (t) = t/20.
The CDF of C for t > 1 is $F_C(t) = \int_1^t f_C(x)\,dx = 1 - \frac{1}{t}$.
(b) Let B = min{E, U, C} be the time until the first bus. Then, the probability we wait more than 5 minutes is the probability that ALL of them take longer than 5 minutes to arrive, and we can multiply the individual probabilities due to independence:
$$P(B > 5) = P(E > 5)\,P(U > 5)\,P(C > 5) = e^{-0.5}\cdot\frac{15}{20}\cdot\frac{1}{5} \approx 0.091$$
(c) The same exact logic applies here! But be careful of the range of U when plugging in the CDF.
It is true that
P (B > 30) = P (E > 30) P (U > 30) P (C > 30)
But when plugging in P (U > 30) = 1 − FU (30), we have to remember that FU (30) = 1 because U
must be in [0, 20]. That’s why it is so important to define the piecewise function! This probability
is indeed 0 since bus U will always come within 20 minutes.
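A simulation sketch for part (b) (my addition): E and U can be sampled directly, and C can be sampled by inverse-CDF, since F_C(t) = 1 − 1/t inverts to C = 1/(1 − V) for V ∼ Unif(0, 1).

import numpy as np

rng = np.random.default_rng(4)
n = 2_000_000
E = rng.exponential(scale=10, size=n)       # Exp(λ = 0.1) has mean 1/λ = 10
U = rng.uniform(0, 20, size=n)
C = 1 / (1 - rng.uniform(0, 1, size=n))     # inverse-CDF sampling for f_C(x) = 1/x²

B = np.minimum(np.minimum(E, U), C)         # the first bus to arrive
print(np.mean(B > 5))                       # ≈ e^{-0.5}·(15/20)·(1/5) ≈ 0.091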
Application Time!!
Now you’ve learned enough theory to discover the Distinct Elements algo-
rithm covered in section 9.5. You are highly encouraged to read that section
before moving on!
Chapter 4. Continuous Random Variables
4.3: The Normal/Gaussian Random Variable
Slides (Google Drive) Video (YouTube)
The Normal (Gaussian) distribution is probably the most important of our entire Zoo of discrete and contin-
uous variables (with Binomial a close second). You have probably heard of and seen this famous distribution
before (think bell-curve, pictures below), and now we’ll get into the technical details of it and its many use
cases!
Before defining the Normal RV formally, let's see what happens when we standardize any RV X with mean µ and variance σ² (subtract its mean, then divide by its standard deviation):
$$E\left[\frac{X-\mu}{\sigma}\right] = \frac{1}{\sigma}\left(E[X] - \mu\right) = 0$$
$$Var\left(\frac{X-\mu}{\sigma}\right) = \frac{1}{\sigma^2}Var(X-\mu) = \frac{1}{\sigma^2}Var(X) = \frac{\sigma^2}{\sigma^2} = 1 \implies \sigma_{(X-\mu)/\sigma} = \sqrt{1} = 1$$
It turns out the mean is 0 and the standard deviation (and variance) is 1! This makes sense because on
average, someone is average (0 standard deviations above the mean), and the standard deviation is 1.
X ∼ N (µ, σ 2 ) if and only if X has the following PDF (and range ΩX = (−∞, +∞)):
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where exp(y) = e^y. This Normal random variable actually has as parameters its mean and variance, and hence:
$$E[X] = \mu, \qquad Var(X) = \sigma^2$$
Unfortunately, there is no closed-form formula for the CDF (there wasn't one for the Gamma RV either).
We’ll see how to compute these probabilities anyway though soon using a lookup table!
Normal distributions produce bell-shaped curves. Here are some visualizations of the density function for
varying µ and σ 2 .
For instance, a normal distribution with µ = 0 and σ = 1 produces the following bell curve:
If the standard deviation increases, it becomes more likely for the variable to be farther away from the mean,
so the distribution becomes flatter. For instance, a curve with the same µ = 0 but higher σ = 2 (σ 2 = 4)
looks like this:
If you change the mean, the distribution will shift left or right. For instance, increasing the mean so µ = 4
shifts the distribution 4 to the right. The shape of the curve remains unchanged:
If you change the mean AND standard deviation, the curve's shape changes and shifts. For instance, changing
the mean so µ = 4 and standard deviation so σ = 2 gives us a flatter, shifted curve:
However, scaling and shifting a random variable often does not keep it in the same family. Continuous
uniform rvs are the only ones we learned so far that do: if X ∼ Unif(0, 1), then 3X + 2 ∼ Unif(2, 5): we’ll
learn how to prove this in the next section! However, this is not true for the others; for example, the range of
a Poi(λ) is {0, 1, 2, . . .} as it is the number of events in a unit of time, and 2X has range {0, 2, 4, 6, . . .}, so 2X cannot be Poisson (it can never take an odd value)! We'll see that Normal random variables do have these closure properties.
Recall that in general, if X is any random variable (discrete or continuous) with E [X] = µ and Var (X) = σ 2 ,
and a, b ∈ R. Then,
E [aX + b] = aE [X] + b = aµ + b and Var (aX + b) = a²Var (X) = a²σ². The closure theorem says that if additionally X ∼ N (µ, σ²), then aX + b ∼ N (aµ + b, a²σ²).
We will prove this theorem later in section 5.6 using Moment Generating Functions! This is really amazing -
the mean and variance are no surprise. The fact that scaling and shifting a Normal random variable results
in another Normal random variable is very interesting!
Let X, Y be ANY independent random variables (discrete or continuous) with E [X] = µX, E [Y ] = µY, Var (X) = σX², Var (Y ) = σY², and a, b, c ∈ R. Recall E [aX + bY + c] = aµX + bµY + c and (by independence) Var (aX + bY + c) = a²σX² + b²σY². Closure under addition says that if X and Y are additionally Normal, then
$$aX + bY + c \sim N(a\mu_X + b\mu_Y + c,\ a^2\sigma_X^2 + b^2\sigma_Y^2)$$
Again, this is really amazing. The mean and variance aren’t a surprise again, but the fact that adding two
independent Normals results in another Normal distribution is not trivial, and we will prove this later as
well!
Example(s)
Suppose you believe temperatures in Vancouver, Canada each day are approximately normally
distributed with mean 25 degrees Celsius and standard deviation 5 degrees Celsius. However, your
American friend only understands Fahrenheit.
1. What is the distribution of temperatures each day in Vancouver in Fahrenheit? To convert Celsius (C) to Fahrenheit (F), the formula is F = (9/5)C + 32.
2. What is the distribution of the average temperature over a week in Vancouver, in Fahren-
heit? That is, if you were to sample a random week’s average temperature, what is its distribu-
tion? Assume the temperature each day is independent of the rest (this may not be a realistic
assumption).
Solution
1. The degrees in Celsius are C ∼ N (µC = 25, σC² = 5²). Since F = (9/5)C + 32, we know by linearity of expectation and properties of variance:
$$\mu_F = E[F] = E\left[\frac{9}{5}C + 32\right] = \frac{9}{5}E[C] + 32 = \frac{9}{5}\cdot 25 + 32 = 77$$
$$\sigma_F^2 = Var(F) = Var\left(\frac{9}{5}C + 32\right) = \left(\frac{9}{5}\right)^2 Var(C) = \left(\frac{9}{5}\right)^2\cdot 5^2 = 81$$
These values are no surprise, but by closure of the Normal distribution, we can say that F ∼ N (µF = 77, σF² = 81).
2. Let F1, F2, . . . , F7 be independent temperatures over a week, so each Fi ∼ N (µF = 77, σF² = 81). Let $\bar{F} = \frac{1}{7}\sum_{i=1}^7 F_i$ denote the average temperature over this week. Then, by linearity of expectation and properties of variance (requiring independence),
$$E\left[\frac{1}{7}\sum_{i=1}^7 F_i\right] = \frac{1}{7}\sum_{i=1}^7 E[F_i] = \frac{1}{7}\cdot 7\cdot 77 = 77$$
$$Var\left(\frac{1}{7}\sum_{i=1}^7 F_i\right) = \frac{1}{7^2}\sum_{i=1}^7 Var(F_i) = \frac{1}{7^2}\cdot 7\cdot 81 = \frac{81}{7}$$
Note that the mean is the same, but the variance is smaller. This might make sense because we expect the average temperature over a week to match that of a single day, but it is more stable (has lower variance). By closure properties of the Normal distribution, since we take a sum of independent Normal RVs and then divide it by 7, $\bar{F} = \frac{1}{7}\sum_{i=1}^7 F_i \sim N(\mu = 77, \sigma^2 = 81/7)$.
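A quick simulation of this closure property (my addition): averaging 7 independent N(77, 81) days gives a weekly average with mean 77 and variance 81/7 ≈ 11.57 (and its histogram would look Normal).

import numpy as np

rng = np.random.default_rng(5)
# 1,000,000 simulated weeks, each the mean of 7 i.i.d. N(77, 81) days.
weeks = rng.normal(loc=77, scale=9, size=(1_000_000, 7)).mean(axis=1)

print(weeks.mean())   # ≈ 77, same as a single day
print(weeks.var())    # ≈ 81/7, smaller than a single day's variance of 81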
If Z ∼ N (0, 1) is the standard normal (the normal RV with mean 0 and variance/standard deviation
1), we denote the CDF Φ(a) = FZ (a) = P (Z ≤ a), since it is so commonly used. There is no closed-form
formula, so this CDF is stored in a Φ table (read a “Phi Table”). Remember, Φ(a) is just the area to the
left of a.
Since the normal distribution curve is symmetric, the area to the left of a is the same as the area to the
right of −a. This picture below shows that Φ(a) = 1 − Φ(−a).
To get the CDF Φ(1.09) = P (Z ≤ 1.09) from the Φ table, we look at the row with a value of 1.0 and the column with a value of 0.09, which gives Φ(1.09) ≈ 0.86214. Similarly, P (Z ≤ 1.39) = Φ(1.39) ≈ 0.91774 (look at the row 1.3 and the column 0.09). This table usually only has positive numbers, so if you want to look up negative numbers, it's necessary to use the fact that Φ(a) = 1 − Φ(−a). For example, if we want P (Z ≤ −2.13) = Φ(−2.13), we need to do 1 − Φ(2.13) = 1 − 0.9834 = 0.0166 (try to find Φ(2.13) yourself in the table below).
How does this help though when X is Normal but not the standard normal? In general, for X ∼ N (µ, σ²), we can calculate the CDF of X by standardizing it to be standard normal:
$$F_X(x) = P(X \le x) = P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right) = \Phi\left(\frac{x-\mu}{\sigma}\right)$$
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
See some examples below of how we can use the Φ table to calculate probabilities associated with the Normal
distributions! Again, the Φ table gives the CDF of the standard Normal since it doesn’t have a closed form
like Uniform/Exponential. Also, any Normal RV can be standardized so we can look up probabilities in the
Φ table!
Example(s)
Suppose the age of a random adult in the United States is (approximately) normally distributed with
mean 50 and standard deviation 15.
1. What is the probability that a randomly selected adult in the US is over 70 years old?
2. What is the probability that a randomly selected adult in the US is under 25 years old?
3. What is the probability that a randomly selected adult in the US is between 40 and 45 years
old?
Solution
1. The age of a random adult is X ∼ N (µ = 50, σ² = 15²), so remember we standardize to use the standard Gaussian:
$$P(X > 70) = P\left(\frac{X-50}{15} > \frac{70-50}{15}\right) \quad \text{[standardize]}$$
$$= P(Z > 1.33) \quad \left[Z = \frac{X-\mu}{\sigma} \sim N(0,1)\right]$$
$$= 1 - P(Z \le 1.33) \quad \text{[complement]}$$
$$= 1 - \Phi(1.33) \quad \text{[def of } \Phi\text{]}$$
$$= 1 - 0.9082 = 0.0918 \quad \text{[look up } \Phi \text{ table from earlier]}$$
2. We do a similar calculation:
$$P(X < 25) = P\left(\frac{X-50}{15} < \frac{25-50}{15}\right) \quad \text{[standardize]}$$
$$= P(Z < -5/3) \quad \left[Z = \frac{X-\mu}{\sigma} \sim N(0,1)\right]$$
$$= \Phi(-1.67) \quad \text{[since continuous rv, identical to less than or equal]}$$
$$= 1 - \Phi(1.67) \quad \text{[symmetry trick to make positive]}$$
$$= 1 - 0.9525 = 0.0475 \quad \text{[look up } \Phi \text{ table from earlier]}$$
3. We do a similar calculation:
$$P(40 < X < 45) = P\left(\frac{40-50}{15} < \frac{X-50}{15} < \frac{45-50}{15}\right) \quad \text{[standardize]}$$
$$= P(-2/3 < Z < -1/3) \quad \left[Z = \frac{X-\mu}{\sigma} \sim N(0,1)\right]$$
$$= \Phi(-0.33) - \Phi(-0.67) \quad [P(a < X < b) = F_X(b) - F_X(a)]$$
$$= (1 - \Phi(0.33)) - (1 - \Phi(0.67)) = \Phi(0.67) - \Phi(0.33) \quad \text{[symmetry trick to make positive]}$$
$$= 0.7486 - 0.6293 = 0.1193 \quad \text{[look up } \Phi \text{ table from earlier]}$$
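In practice we'd let a library play the role of the Φ table. Here is a sketch (my addition) using scipy's standard normal CDF to redo the three parts; note part 3 comes out slightly different from the table answer above because we no longer round z to two decimal places:

from scipy import stats

mu, sigma = 50, 15
print(1 - stats.norm.cdf((70 - mu) / sigma))   # P(X > 70) ≈ 0.0912
print(stats.norm.cdf((25 - mu) / sigma))       # P(X < 25) ≈ 0.0478
print(stats.norm.cdf((45 - mu) / sigma)
      - stats.norm.cdf((40 - mu) / sigma))     # P(40 < X < 45) ≈ 0.1169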
4.3.5 Exercises
1. Suppose the time (in hours) it takes for you to finish pset i is approximately Xi ∼ N (µ = 10, σ 2 = 9)
(for i = 1, . . . , 5) and the time (in hours) it takes for you to finish a project is approximately Y ∼
N (µ = 20, σ 2 = 10). Let W = X1 + X2 + X3 + X4 + X5 + Y be the time it takes to complete all 5
psets and the project.
(a) What are the mean and variance of W ?
(b) What is the distribution of W and what are its parameter(s)?
(c) What is the probability that you complete all the homework in under 60 hours?
Solution:
(a) The mean by linearity of expectation is E [W ] = E [X1 ] + · · · + E [X5 ] + E [Y ] = 50 + 20 = 70.
Variance adds for independent RVs, so Var (W ) = Var (X1 )+· · ·+Var (X5 )+Var (Y ) = 45+10 = 55.
(b) Since W is the sum of independent Normal random variables, W is also normal with the parameters
we calculated above. So W ∼ N (µ = 70, σ 2 = 55).
(c)
$$P(W < 60) = P\left(\frac{W-70}{\sqrt{55}} < \frac{60-70}{\sqrt{55}}\right) \approx P(Z < -1.35) = \Phi(-1.35) = 1 - \Phi(1.35) = 1 - 0.9115 = 0.0885$$
Chapter 4. Continuous Random Variables
4.4: Transforming Continuous RVs
Slides (Google Drive) Video (YouTube)
Suppose the amount of gold a company can mine is X tons per year, and you have some (continuous)
distribution to model this. However, your earning is not simply X - it is actually a function of the amount
of product, some Y = g(X). What is the distribution of Y ?
Since we know the distribution of X, this will help us model the distribution of Y by transforming random
variables.
For discrete random variables, transforming is easy. Say Y = g(X) = X²: then Y = 1 if and only if X ∈ {−1, 1}, so to find P (Y = 1), we sum the probability of every value x such that x² = 1. That's all this formula says (the ":" means "such that"):
$$p_Y(y) = \sum_{x \in \Omega_X : g(x) = y} p_X(x)$$
But for continuous random variables, we have density functions instead of mass functions. That means fX is not actually a probability, so we can't use this same technique. We want to work with the CDF FX (x) = P (X ≤ x) instead, because it actually does represent a probability! It's best to see this idea through an example.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
The CDF of X is derived by taking the integral of the PDF, giving us (we can also cite this):
$$F_X(x) = \begin{cases} 0 & x < 0 \\ x/9 & 0 \le x \le 9 \\ 1 & x > 9 \end{cases}$$
Now, we determine the range of Y. The smallest value that Y can take is √0 = 0, and the largest value that Y can take is √9 = 3, from the range of X. Since the square root function is monotone increasing, this gives us
$$\Omega_Y = [0, 3]$$
But can we assume that, because X has a uniform distribution, Y does too?
This is not the case! Notice that values of X in the range [0, 1] will map to Y values in the range [0, 1]. But,
X values in the range [1, 4] map to Y values in the range [1, 2] and X values in the range [4, 9] map to Y
values in the range [2, 3].
So, there is a much larger range of values of X that map to [2, 3] than to [0, 1] (since [4, 9] is a larger range
than [0, 1]). Therefore, Y ’s distribution shouldn’t be uniform. So, we cannot define the PDF of Y using the
assumption that Y is uniform.
Instead, we will first compute the CDF FY and then differentiate that to get the PDF fY. For y ∈ [0, 3]:
$$F_Y(y) = P(Y \le y) = P(\sqrt{X} \le y) = P(X \le y^2) = F_X(y^2) = \frac{y^2}{9}$$
Be very careful when squaring both sides of an inequality - it may not keep the inequality true. In this case we didn't have to worry since X and Y were both guaranteed non-negative.
$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{2y}{9}$$
Here is an image of the original and transformed PDFs! Remember that X ∼ Unif(0, 9) and Y = √X.
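You can also see the transformation at work by simulation (a sketch I've added): sample X ∼ Unif(0, 9), take square roots, and the empirical density of Y matches fY (y) = 2y/9.

import numpy as np

rng = np.random.default_rng(6)
y = np.sqrt(rng.uniform(0, 9, size=2_000_000))   # Y = sqrt(X)

eps = 0.01
for pt in [0.5, 1.5, 2.5]:
    # P(Y ≈ pt) / eps ≈ f_Y(pt), exactly the rectangle idea from Section 4.1.
    est = np.mean(np.abs(y - pt) <= eps / 2) / eps
    print(pt, est, 2 * pt / 9)                   # estimate vs. the derived 2y/9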
This is the general strategy for transforming continuous RVs! We’ll summarize the steps below.
Example(s)
Suppose X has range ΩX = [−1, 1] and CDF $F_X(x) = \frac{1}{4}(2 + 3x - x^3)$. What is the PDF of Y = X⁴?

Solution
1. We are given the CDF of X directly, so we can use it below without deriving it.
2. The range of Y = X⁴ is ΩY = {x⁴ : x ∈ [−1, +1]} = [0, 1], since x⁴ is always non-negative and between 0 and 1 for x ∈ [−1, +1].
3. Be careful in the third equation below to include both lower and upper bounds (draw the function y = x⁴ to see why). For y ∈ ΩY = [0, 1], we will compute the CDF:
$$F_Y(y) = P(Y \le y) = P(X^4 \le y) \quad \text{[def of CDF, def of } Y\text{]}$$
$$= P(-\sqrt[4]{y} \le X \le \sqrt[4]{y}) \quad \text{[don't forget the negative side]}$$
$$= P(X \le \sqrt[4]{y}) - P(X \le -\sqrt[4]{y}) = F_X(\sqrt[4]{y}) - F_X(-\sqrt[4]{y}) \quad \text{[def of CDF of } X\text{]}$$
$$= \frac{1}{4}\left(2 + 3\sqrt[4]{y} - (\sqrt[4]{y})^3\right) - \frac{1}{4}\left(2 + 3(-\sqrt[4]{y}) - (-\sqrt[4]{y})^3\right) \quad \text{[plug in CDF]}$$
4. The last step is to differentiate the CDF to get the PDF, which is just computational, so I’ll skip it!
This gives an explicit formula: if g is strictly monotone and invertible with inverse h = g⁻¹, then
$$f_Y(y) = f_X(h(y))\cdot|h'(y)|$$
That is, the PDF of Y at y is the PDF of X evaluated at h(y) (the value of x that maps to y), multiplied by the absolute value of the derivative of h(y).
Note that the formula method is not as general as the previous method (using the CDF), since g must satisfy monotonicity and invertibility. So transforming via the CDF always works, but transforming via this explicit formula may not work all the time.
Proof of Formula to get PDF of Y = g(X) from X.
Suppose Y = g(X) and g is strictly monotone and invertible with inverse X = g −1 (Y ) = h(Y ). We’ll assume
g is strictly monotone increasing and leave it to you to prove it for the case when g is strictly monotone
decreasing (it’s very similar).
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(h(y))$$
where the third equality uses that g is monotone increasing. Differentiating both sides with the chain rule gives
$$f_Y(y) = \frac{d}{dy}F_X(h(y)) = f_X(h(y))\,h'(y)$$
A similar proof would hold if g were monotone decreasing, except we would flip the sign of the inequality in the third step, and h′(y) would end up inside an absolute value: |h′(y)|.
Now let’s try the same example as we did earlier, but using this new method instead.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
Our goal is to use the formula fY (y) = fX (h(y)) · |h′(y)|, after verifying some conditions on g. Let g(t) = √t. This is strictly monotone increasing on ΩX = [0, 9]: as t increases, √t also increases, so g(t) is an increasing function.
What is the inverse of this function g? The inverse of the square root function is just the squaring function:
$$h(y) = g^{-1}(y) = y^2, \qquad h'(y) = 2y$$
Note that we dropped the absolute value because we already assume y ∈ [0, 3], and hence 2y is always non-negative. Putting it together, for y ∈ [0, 3]:
$$f_Y(y) = f_X(h(y))\cdot|h'(y)| = f_X(y^2)\cdot 2y = \frac{1}{9}\cdot 2y = \frac{2y}{9}$$
This gives the same formula as earlier, as it should!
Let X = (X1, ..., Xn), Y = (Y1, ..., Yn) be continuous random vectors (each component is a continuous rv) with the same dimension n (so ΩX, ΩY ⊆ Rⁿ), and Y = g(X) where g : ΩX → ΩY is invertible and differentiable, with differentiable inverse X = g⁻¹(y) = h(y). Then,
$$f_Y(y) = f_X(h(y))\left|\det\left(\frac{\partial h(y)}{\partial y}\right)\right|$$
where $\frac{\partial h(y)}{\partial y} \in \mathbb{R}^{n\times n}$ is the Jacobian matrix of partial derivatives of h, with
$$\left[\frac{\partial h(y)}{\partial y}\right]_{ij} = \frac{\partial (h(y))_i}{\partial y_j}$$
Hopefully this formula looks very similar to the one for the single-dimensional case! This formula is just for
your information and you’ll never have to use it in this class.
4.4.4 Exercises
1. Suppose X has range ΩX = (1, ∞) and density function
$$f_X(x) = \begin{cases} \frac{2}{x^3} & x > 1 \\ 0 & \text{otherwise} \end{cases}$$
Let $Y = \frac{e^X - 1}{2}$.
(a) Compute the density function of Y via the CDF transformation method.
(b) Compute the density function of Y using the formula, but explicitly verify the monotonicity and
invertibility conditions.
Solution:
(a) The range of Y is $\Omega_Y = \left(\frac{e-1}{2}, \infty\right)$. For y ∈ ΩY, since Y ≤ y exactly when X ≤ ln(2y + 1), we have $F_Y(y) = F_X(\ln(2y+1)) = 1 - \frac{1}{[\ln(2y+1)]^2}$, so
$$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{2}{[\ln(2y+1)]^3}\cdot\frac{1}{2y+1}\cdot 2 = \frac{4}{(2y+1)[\ln(2y+1)]^3}$$
This chapter, especially Sections 5.1-5.6, is arguably the most difficult in this entire text. It might take more time to fully absorb, but you'll get it, so don't give up!
We are finally going to talk about what happens when we want the probability distribution of more than one
random variable. This will be called the joint distribution of two or more random variables. In this section,
we’ll focus on joint discrete distributions, and in the next, joint continuous distributions. We’ll also finally
prove that the variance of the sum of independent RVs is the sum of the variances, an important
fact that we’ve been using without proof! But first, we need to review what a Cartesian product of sets is.
The Cartesian product of two sets A and B is A × B = {(a, b) : a ∈ A, b ∈ B}.
Further if A, B are finite sets, then |A × B| = |A| · |B| by the product rule of counting.
Example(s)
Write each of the following in a notation that does not involve a Cartesian product:
1. {1, 2, 3} × {4, 5}
2. R2 = R × R
Solution
1. Here, we have:
{1, 2, 3} × {4, 5} = {(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5)}
We have each of the elements of the first set paired with each of the elements of the second set. Note
that |{1, 2, 3}| = 3, |{4, 5}| = 2, and |{1, 2, 3} × {4, 5}| = 6.
2. This is the xy-plane (2D space), which is denoted:
R2 = R × R = {(x, y) : x ∈ R, y ∈ R}
Suppose we roll two fair 4-sided dice independently, one blue and one red. Let X be the value of the blue die
and Y be the value of the red die. Note:
ΩX = {1, 2, 3, 4}
ΩY = {1, 2, 3, 4}
Then we can also consider ΩX,Y , the joint range of X and Y . The joint range happens to be any combination
of {1, 2, 3, 4} for both rolls. This can be written as:
ΩX,Y = ΩX × ΩY
Further, each of these 16 pairs will be equally likely, with probability 1/16 each (picture a 4 × 4 table whose every entry is 1/16). Such a table is a suitable way to write the joint probability mass function of X and Y, as it enumerates the probability of every pair of values. If we wanted to write it as a formula, pX,Y (x, y) = P (X = x, Y = y) for x, y ∈ ΩX,Y, we have:
$$p_{X,Y}(x, y) = \begin{cases} \frac{1}{16} & (x, y) \in \Omega_{X,Y} \\ 0 & \text{otherwise} \end{cases}$$
Note that either this piecewise function or the table above is a valid way to express the joint PMF.
The joint PMF of two discrete random variables X and Y is the function pX,Y (a, b) = P (X = a, Y = b).
The joint range is the set of pairs (c, d) that have nonzero probability:
$$\Omega_{X,Y} = \{(c, d) : p_{X,Y}(c, d) > 0\} \subseteq \Omega_X \times \Omega_Y$$
Further, note that if g : R2 → R is a function, then LOTUS extends to the multidimensional case:
$$E[g(X, Y)] = \sum_{x \in \Omega_X}\sum_{y \in \Omega_Y} g(x, y)\,p_{X,Y}(x, y)$$
A lot of things are just the same as what we learned in Chapter 3, but extended! Note that the joint range
above ΩX,Y was always a subset of ΩX × ΩY , and they’re not necessarily equal. Let’s see an example of this.
Back to our example of the blue and red die rolls. Again, let X be the value of the blue die and Y be
the value of the red die. Now, let U = min{X, Y } (the smaller of the two die rolls) and V = max{X, Y }
(the larger of the two die rolls). Then:
ΩU = {1, 2, 3, 4}
ΩV = {1, 2, 3, 4}
because both random variables can take on any of the four values that appear on the dice (e.g., it is possible for the minimum to be 4 if we roll (4, 4), and for the maximum to be 1 if we roll (1, 1)).
However, there is the constraint that the minimum value U is always at most the maximum value V . That is,
the joint range would not include the pair (u, v) = (4, 1) for example, since the probability that the minimum
is 4 and the maximum is 1 is zero. We can write this formally as the subset of the Cartesian product subject
to u ≤ v:
$$\Omega_{U,V} = \{(u, v) \in \Omega_U \times \Omega_V : u \le v\} \ne \Omega_U \times \Omega_V$$
This will just be all the ordered pairs of the values that can appear as U and V . Now, however these are
not equally likely, as shown in the table below. Notice that any pair (u, v) with u > v has zero probability,
as promised. We’ll explain how we got the other numbers under the table.
As discussed earlier, we can't have the case where U > V, so these are all 0. The cases where U = V occur when the blue and red die have the same value, each of which occurs with probability 1/16 as shown earlier. For example, pU,V (2, 2) = P (U = 2, V = 2) = 1/16, since only one of the 16 equally likely outcomes, (2, 2), gives this result. The entries in which U < V each occur with probability 2/16, because it could be the red die with the max and the blue die with the min, or the reverse. For example, pU,V (1, 3) = P (U = 1, V = 3) = 2/16, because two of the 16 outcomes, (1, 3) and (3, 1), would result in the min being 1 and the max being 3.
So for the joint PMF as a formula pU,V (u, v) = P (U = u, V = v) for u, v ∈ ΩU,V we have:
$$p_{U,V}(u, v) = \begin{cases} \frac{2}{16} & (u, v) \in \Omega_U \times \Omega_V,\ v > u \\ \frac{1}{16} & (u, v) \in \Omega_U \times \Omega_V,\ v = u \\ 0 & \text{otherwise} \end{cases}$$
Again, the piecewise function and the table are both valid ways to express the joint PMF, and you may
choose whichever is easier for you. When the joint range is larger, it might be infeasible to use a table
though!
Now, suppose we were asked for just P (U = 1). You might think the answer is 7/16, but how did you get that? Well, P (U = 1) would be the sum of the first row, since that is all the cases where U = 1. You computed
$$P(U = 1) = P(U=1, V=1) + P(U=1, V=2) + P(U=1, V=3) + P(U=1, V=4) = \frac{1}{16} + \frac{2}{16} + \frac{2}{16} + \frac{2}{16} = \frac{7}{16}$$
Mathematically, we have
$$P(U = u) = \sum_{v \in \Omega_V} P(U = u, V = v)$$
Does this look like anything we learned before? It’s just the law of total probability (intersection version)
that we derived in 2.2, as the events {V = v}v∈ΩV partition the sample space (V takes on exactly one value)!
We can refer to the table above and sum each row (each row corresponds to a value of u) to find the probability of that value of u occurring. That gives us the following:
$$p_U(u) = \begin{cases} 7/16 & u = 1 \\ 5/16 & u = 2 \\ 3/16 & u = 3 \\ 1/16 & u = 4 \end{cases}$$
For example, the last entry comes from summing the last row:
$$P(U = 4) = P(U=4, V=1) + P(U=4, V=2) + P(U=4, V=3) + P(U=4, V=4) = 0 + 0 + 0 + \frac{1}{16} = \frac{1}{16}$$
This brings us to the definition of marginal PMFs. The idea of these is: given a joint probability distribution,
what is the distribution of just one of them (or a subset)? We get this by marginalizing (summing) out the
other variables.
Definition (Marginal PMF): Let X and Y be discrete random variables with joint PMF pX,Y. The marginal PMF of X is $p_X(x) = \sum_{y \in \Omega_Y} p_{X,Y}(x, y)$ (and symmetrically for Y).
(Extension) If Z is also a discrete random variable, then the marginal PMF of Z is:
$$p_Z(z) = \sum_{x \in \Omega_X}\sum_{y \in \Omega_Y} p_{X,Y,Z}(x, y, z)$$
This follows from the law of total probability, and is just like taking the sum of a row in the example above.
Now if asked for E [U ] for example, we actually don't need the joint PMF anymore. We've extracted the pertinent information in the form of pU (u), and compute $E[U] = \sum_u u\,p_U(u)$ normally.
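Because the sample space in the dice example is tiny, we can build the joint PMF and marginalize by brute-force enumeration. A sketch (my addition) using exact fractions:

from collections import Counter
from fractions import Fraction

# Joint PMF of U = min, V = max over all 16 equally likely rolls of two d4's.
joint = Counter()
for x in range(1, 5):
    for y in range(1, 5):
        joint[(min(x, y), max(x, y))] += Fraction(1, 16)

# Marginalize out V (law of total probability: sum each "row").
p_U = Counter()
for (u, v), p in joint.items():
    p_U[u] += p

print(dict(p_U))   # {1: 7/16, 2: 5/16, 3: 3/16, 4: 1/16}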
5.1.4 Independence
We’ll now redefine independence of RVs in terms of the joint PMF. This is completely the same as the
definition we gave earlier, just with the new notation we learned.
Recall the joint range ΩX,Y = {(x, y) : pX,Y (x, y) > 0} ⊆ ΩX × ΩY is always a subset of the Cartesian product of the individual ranges. A necessary but not sufficient condition for independence is that ΩX,Y = ΩX × ΩY. That is, if ΩX,Y ≠ ΩX × ΩY, then X and Y cannot be independent, but if ΩX,Y = ΩX × ΩY, then we have to check the condition above.
This is because if there is some (a, b) ∈ ΩX × ΩY but not in ΩX,Y , then pX,Y (a, b) = 0 but pX (a) > 0
and pY (b) > 0, violating independence. For example, suppose the joint PMF looks like:
X \Y 8 9 Row Total pX (x)
3 1/3 1/2 5/6
7 1/6 0 1/6
Col Total pY (y) 1/2 1/2 1
Also, side note: the marginal distributions are named what they are since we often write the row and column totals in the margins. The joint range ΩX,Y ≠ ΩX × ΩY since one of the entries is 0: (7, 9) ∉ ΩX,Y but (7, 9) ∈ ΩX × ΩY. This immediately tells us they cannot be independent - pX (7) > 0 and pY (9) > 0, yet pX,Y (7, 9) = 0.
Example(s)
Suppose X and Y have the joint PMF given by the inner entries of the table below.
1. Find the marginal PMFs pX and pY.
2. Find E [Y ].
3. Are X and Y independent?

Solution
1. These can be found by filling in the row and column totals, since
$$p_X(x) = \sum_y p_{X,Y}(x, y), \qquad p_Y(y) = \sum_x p_{X,Y}(x, y)$$
For example, $P(X = 0) = p_X(0) = \sum_y p_{X,Y}(0, y) = p_{X,Y}(0, 6) + p_{X,Y}(0, 9) = 3/12 + 5/12 = 8/12$ is the sum of the first row.
X \Y 6 9 Row Total pX (x)
0 3/12 5/12 8/12
2 1/12 2/12 3/12
3 0 1/12 1/12
Col Total pY (y) 4/12 8/12 1
Hence,
$$p_X(x) = \begin{cases} 8/12 & x = 0 \\ 3/12 & x = 2 \\ 1/12 & x = 3 \end{cases} \qquad p_Y(y) = \begin{cases} 4/12 & y = 6 \\ 8/12 & y = 9 \end{cases}$$
2. We can compute E [Y ] just using pY now that we've eliminated/marginalized out X - we don't need the joint PMF anymore. We go back to the definition:
$$E[Y] = \sum_y y\,p_Y(y) = 6\cdot\frac{4}{12} + 9\cdot\frac{8}{12} = 8$$
3. X, Y are independent if for every table entry (x, y), we have pX,Y (x, y) = pX (x)pY (y). However, notice pX,Y (3, 6) = 0 but pX (3) > 0 and pY (6) > 0. We found an entry where this condition fails, so they cannot be independent. This is like the comment mentioned earlier: if ΩX,Y ≠ ΩX × ΩY, they have no chance of being independent.
Lemma: If X and Y are independent, then E [XY ] = E [X] E [Y ]. By LOTUS, E [XY ] just sums over all the entries in the table (x, y) and takes a weighted average of all values xy, weighted by pX,Y (x, y) = P (X = x, Y = y). Note this property relies on the fact that they are independent, whereas linearity of expectation always holds, regardless.
Proof of Lemma.
$$E[XY] = \sum_{x \in \Omega_X}\sum_{y \in \Omega_Y} xy\,p_{X,Y}(x, y) \quad \text{[LOTUS]}$$
$$= \sum_{x \in \Omega_X}\sum_{y \in \Omega_Y} xy\,p_X(x)p_Y(y) \quad [X \perp Y\text{, so } p_{X,Y}(x, y) = p_X(x)p_Y(y)]$$
$$= \sum_{x \in \Omega_X} x\,p_X(x)\sum_{y \in \Omega_Y} y\,p_Y(y) = E[X]\,E[Y]$$
Proof of Variance Adds for Independent RVs. Now we have the following:
$$Var(X + Y) = E[(X+Y)^2] - (E[X+Y])^2 \quad \text{[def of variance]}$$
$$= E[X^2 + 2XY + Y^2] - (E[X] + E[Y])^2 \quad \text{[linearity of expectation]}$$
$$= E[X^2] + 2E[XY] + E[Y^2] - (E[X])^2 - 2E[X]E[Y] - (E[Y])^2 \quad \text{[linearity of expectation]}$$
$$= \left(E[X^2] - (E[X])^2\right) + \left(E[Y^2] - (E[Y])^2\right) + 2\left(E[XY] - E[X]E[Y]\right) \quad \text{[rearranging]}$$
$$= Var(X) + Var(Y) + 2\left(E[X]E[Y] - E[X]E[Y]\right) \quad \text{[lemma, since } X \perp Y\text{]}$$
$$= Var(X) + Var(Y)$$
$$E[X + Y] = \sum_x\sum_y (x + y)\,p_{X,Y}(x, y) \quad \text{[LOTUS]}$$
$$= \sum_x\sum_y x\,p_{X,Y}(x, y) + \sum_x\sum_y y\,p_{X,Y}(x, y) \quad \text{[split sum]}$$
$$= \sum_x x\sum_y p_{X,Y}(x, y) + \sum_y y\sum_x p_{X,Y}(x, y) \quad \text{[algebra]}$$
$$= \sum_x x\,p_X(x) + \sum_y y\,p_Y(y) \quad \text{[def of marginal PMF]}$$
$$= E[X] + E[Y]$$
5.1.7 Exercises
1. Suppose we flip a fair coin three times independently. Let X be the number of heads in the first two
flips, and Y be the number of heads in the last two flips (there is overlap).
(a) What distribution do X and Y have marginally, and what are their ranges?
(b) What is pX,Y (x, y)? Fill in this table below. You may want to fill in the marginal distributions
first!
(c) What is ΩX,Y , using your answer to (b)?
(d) Write a formula for E [cos(XY )].
(e) Are X, Y independent?
Solution:
(a) Since X counts the number of heads in two independent flips of a fair coin, then X ∼ Bin(n =
2, p = 0.5). Y also has this distribution! Their ranges are ΩX = ΩY = {0, 1, 2}.
(b) First, fill in the marginal distributions, which should be 1/4, 1/2, 1/4 for the probability that
X = 0, X = 1, and X = 2 respectively (same for Y ).
First let’s start with pX,Y (2, 2) = P (X = 2, Y = 2). If X = 2, that means the first two flips
must’ve been heads. If Y = 2, that means the last two flips must’ve been heads. So the probabil-
ity that X = 2, Y = 2 is the probability of the single outcome HHH, which is 1/8. Apply similar
logic for pX,Y (0, 0) = P (X = 0, Y = 0) which is the probability of TTT.
Then, pX,Y (0, 2) = P (X = 0, Y = 2). If X = 0 then the first two flips are tails. If Y = 2, the last
two flips are heads. This is impossible, so P (X = 0, Y = 2) = 0. Similarly, P (X = 2, Y = 0) = 0
as well. Now use the constraints (the row totals and col totals) to fill in the rest! For example, the
first row must sum to 1/4, and we have two out of three of the entries pX,Y (0, 0) and pX,Y (0, 2),
so pX,Y (0, 1) = 1/4 − 1/8 − 0 = 1/8.
X \Y 0 1 2 Row Total pX (x)
0 1/8 1/8 0 1/4
1 1/8 1/4 1/8 1/2
2 0 1/8 1/8 1/4
Col Total pY (y) 1/4 1/2 1/4 1
(c) From the previous part, we can see that the joint range is everything in the Cartesian product
except (0, 2) and (2, 0), so ΩX,Y = (ΩX × ΩY ) \ {(0, 2), (2, 0)}.
(d) By LOTUS extended to multiple variables,
XX
E [cos(XY )] = cos(xy)pX,Y (x, y)
x y
(e) No, the joint range is not equal to the Cartesian product. This immediately makes independence impossible. The intuitive reason is that, since (0, 2) ∉ ΩX,Y for example, if we know X = 0, then Y cannot be 2. Formally, there exists a pair (x, y) ∈ ΩX × ΩY (namely (x, y) = (0, 2)) such that pX,Y (0, 2) = 0 but pX (0) > 0 and pY (2) > 0. Hence, pX,Y (0, 2) ≠ pX (0)pY (2), which violates independence.
2. Suppose radioactive particles at Area 51 are emitted at an average rate of λ per second. You want to
measure how many particles are emitted, but your geiger-counter (device that measures radioactivity)
fails to record each particle independently with some small probability p. Let X be the number of
particles emitted, and Y be the number of particles observed (by your geiger-counter).
(a) Describe the joint range ΩX,Y using set notation.
(b) Write a formula (not a table) for pX,Y (x, y).
Solution (marginalizing out X to get the PMF of Y):
$$p_Y(y) = \sum_{x \in \Omega_X} p_{X,Y}(x, y) = \sum_{x=y}^{\infty} e^{-\lambda}\frac{\lambda^x}{x!}\binom{x}{y}(1-p)^y p^{x-y}$$
Solution:
$$p_{X_1,\ldots,X_r}(k_1, \ldots, k_r) = \frac{\binom{K_1}{k_1}\cdots\binom{K_r}{k_r}}{\binom{N}{n}} = \frac{\prod_{i=1}^r \binom{K_i}{k_i}}{\binom{N}{n}}$$
Chapter 5. Multiple Random Variables
5.2: Joint Continuous Distributions
Slides (Google Drive) Video (YouTube)
The joint PDF of two continuous random variables X and Y satisfies fX,Y (a, b) ≥ 0 for all a, b ∈ R.
The joint range is the set of pairs (c, d) that have nonzero density:
$$\Omega_{X,Y} = \{(c, d) : f_{X,Y}(c, d) > 0\} \subseteq \Omega_X \times \Omega_Y$$
Further, note that if g : R2 → R is a function, then LOTUS extends to the multidimensional case:
$$E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(s, t)\,f_{X,Y}(s, t)\,ds\,dt$$
The joint PDF must satisfy the following (similar to univariate PDFs):
$$P(a \le X \le b,\ c \le Y \le d) = \int_a^b\int_c^d f_{X,Y}(x, y)\,dy\,dx$$
Example(s)
Let X and Y be two jointly continuous random variables with the following joint PDF:
$$f_{X,Y}(x, y) = \begin{cases} x + cy^2 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$
(a) What is the joint range ΩX,Y?
(b) Find the constant c that makes this a valid joint PDF.
(c) Find $P\left(0 \le X \le \frac{1}{2},\ 0 \le Y \le \frac{1}{2}\right)$.
Solution
(a)
$$\Omega_{X,Y} = \{(x, y) \in \mathbb{R}^2 : 0 \le x \le 1,\ 0 \le y \le 1\}$$
(b) We require the joint density to integrate to 1:
$$1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = \int_0^1\int_0^1 (x + cy^2)\,dx\,dy$$
$$= \int_0^1 \left[\frac{1}{2}x^2 + cy^2 x\right]_{x=0}^{x=1} dy = \int_0^1 \left(\frac{1}{2} + cy^2\right) dy = \left[\frac{1}{2}y + \frac{c}{3}y^3\right]_{y=0}^{y=1} = \frac{1}{2} + \frac{c}{3}$$
Thus, c = 3/2.
(c)
$$P\left(0 \le X \le \frac{1}{2},\ 0 \le Y \le \frac{1}{2}\right) = \int_0^{1/2}\int_0^{1/2} \left(x + \frac{3}{2}y^2\right) dx\,dy$$
$$= \int_0^{1/2} \left[\frac{1}{2}x^2 + \frac{3}{2}y^2 x\right]_{x=0}^{x=1/2} dy = \int_0^{1/2} \left(\frac{1}{8} + \frac{3}{4}y^2\right) dy = \frac{3}{32}$$
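Double integrals like these are easy to sanity-check numerically. A sketch (my addition) with scipy's dblquad, which integrates over its first argument innermost:

from scipy import integrate

f = lambda y, x: x + 1.5 * y**2   # dblquad expects func(y, x)

total, _ = integrate.dblquad(f, 0, 1, 0, 1)       # x in [0,1], y in [0,1]
prob, _ = integrate.dblquad(f, 0, 0.5, 0, 0.5)    # the probability from (c)
print(total, prob)                                # ≈ 1.0 and 3/32 = 0.09375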
Example(s)
Let X and Y be two jointly continuous random variables with the following PDF:
$$f_{X,Y}(x, y) = \begin{cases} x + y & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Find $E[XY^2]$.
Solution By LOTUS,
$$E[XY^2] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy^2\,f_{X,Y}(x, y)\,dx\,dy = \int_0^1\int_0^1 xy^2(x + y)\,dx\,dy = \int_0^1 \left(\frac{1}{3}y^2 + \frac{1}{2}y^3\right) dy = \frac{17}{72}$$
Suppose that X and Y are jointly distributed continuous random variables with joint PDF fX,Y (x, y). The marginal PDFs of X and Y are respectively given by the following:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$$
Note this is exactly like for joint discrete random variables, with integrals instead of sums.
(Extension): If Z is also a continuous random variable, then the marginal PDF of Z is:
$$f_Z(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y,Z}(x, y, z)\,dx\,dy$$
Example(s)
Find the marginal PDFs fX (x) and fY (y) given the joint PDF:
$$f_{X,Y}(x, y) = \begin{cases} x + \frac{3}{2}y^2 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Then, compute E [X]. (This is the same joint density as the first example, plugging in c = 3/2).
Solution For 0 ≤ x ≤ 1:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_0^1 \left(x + \frac{3}{2}y^2\right) dy = \left[xy + \frac{1}{2}y^3\right]_{y=0}^{y=1} = x + \frac{1}{2}$$
For 0 ≤ y ≤ 1:
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \int_0^1 \left(x + \frac{3}{2}y^2\right) dx = \left[\frac{1}{2}x^2 + \frac{3}{2}y^2 x\right]_{x=0}^{x=1} = \frac{3}{2}y^2 + \frac{1}{2}$$
Note that to compute E [X] for example, we can either use LOTUS, or just the marginal PDF fX (x). These
methods are equivalent. By LOTUS (taking g(X, Y ) = X),
$$E[X] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,f_{X,Y}(x, y)\,dx\,dy = \int_0^1\int_0^1 x\left(x + \frac{3}{2}y^2\right) dx\,dy$$
Alternatively, by definition of expectation for a single RV,
$$E[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = \int_0^1 x\left(x + \frac{1}{2}\right) dx$$
It only takes two lines or so of algebra to show they are equal!
It only takes two lines or so of algebra to show they are equal!
Recall ΩX,Y = {(x, y) : fX,Y (x, y) > 0} ⊆ ΩX × ΩY. A necessary but not sufficient condition for independence is that ΩX,Y = ΩX × ΩY. That is, if ΩX,Y = ΩX × ΩY, then we have to check the condition, but if not, then we know they are not independent.
This is because if there is some (a, b) ∈ ΩX × ΩY but not in ΩX,Y , then fX,Y (a, b) = 0 but fX (a) > 0
and fY (b) > 0, which violates independence. (This is very similar to independence for discrete RVs).
Example(s)
Let's return to our dart example. Suppose (X, Y) are jointly and uniformly distributed on the circle of radius R centered at the origin (for example, a dart throw).
1. First find and sketch the joint range ΩX,Y .
2. Now, write an expression for the joint PDF fX,Y (x, y) and carefully define it for all x, y ∈ R.
3. Now, solve for the range of X and write an expression we can evaluate to find fX (x), the
marginal PDF for X.
4. Now, let Z be the distance from the center that the dart falls. Find ΩZ and write an expression
for E [Z].
5. Finally, determine using the definition of independence whether X and Y are independent.
Solution
1. The joint range is ΩX,Y = {(x, y) ∈ R2 : x2 + y 2 ≤ R2 } since the values must be within the circle of
radius R. We can sketch the range as follows, with the semi-circles below and above the y-axis labeled
with their respective equations.
2. The height of the density function is constant, say h, since it is uniform. The double integral over all x and y must equal one ($\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$), meaning the volume of this cylinder must be 1. The volume is base times height, which is πR² · h, and setting it equal to 1 gives h = 1/(πR²). This gives the joint PDF:
$$f_{X,Y}(x, y) = \begin{cases} \frac{1}{\pi R^2} & x^2 + y^2 \le R^2 \\ 0 & \text{otherwise} \end{cases}$$
3. Well, X can range from −R to R, since there are points on the circle with x-values in this range. So the range of X is:
$$\Omega_X = [-R, R]$$
Setting up this integral will be trickier than in the earlier examples, because when finding fX (x) and integrating out the y, the limits of integration actually depend on x. Imagine making a tick mark at some x ∈ [−R, R] (on the x-axis) and drawing a vertical line through x: where does y enter and leave (like summing a column in a joint PMF)? Based on the equations we had earlier for y in terms of x (see the sketch above), this gives us:
$$f_X(x) = \int_{-\sqrt{R^2 - x^2}}^{\sqrt{R^2 - x^2}} f_{X,Y}(x,y)\, dy$$
Again, this is different from the previous examples, and you MUST sketch/plot the joint range to figure
this out. If you learned how to do double integrals, this is exactly the same idea.
4. Well, the distance is given by $Z = \sqrt{X^2 + Y^2}$, which is the definition of distance. We can further see that Z will take on any value from 0 to R, since the point could be at the origin or as far away as R. This gives ΩZ = [0, R].
Then, to solve for the expected value of Z, we can use LOTUS, and only integrate over the joint range of X and Y (since the joint PDF is 0 elsewhere). We have to be careful in setting up the bounds of our integral. X will range from −R to R as we discussed earlier. But as X ranges across these values, Y will range from $-\sqrt{R^2 - x^2}$ to $\sqrt{R^2 - x^2}$. We had $Z = \sqrt{X^2 + Y^2}$, so for the expected value we have:
$$E[Z] = E\left[\sqrt{X^2 + Y^2}\right] = \int_{-R}^{R}\int_{-\sqrt{R^2-x^2}}^{\sqrt{R^2-x^2}} \sqrt{x^2+y^2}\, f_{X,Y}(x,y)\, dy\, dx$$
Note that we could’ve set up this integral dxdy instead - what would the limits of integration have
been? It would’ve been
√
hp i Z R Z R2 −y2 p
E [Z] = E X2 + Y 2 = 2 2
√ 2 2 x + y fX,Y (x, y)dxdy
−R − R −y
Your outer limits must be just the range of Y (both constants), and your inner limits may depend on
the outer variable of integration.
5. No, they are not independent. We can see this with the test: ΩX,Y ≠ ΩX × ΩY . This is because X and Y both have marginal range [−R, R], but the joint range is not the square ΩX × ΩY (it is a circle). More explicitly, take the point (0.99R, 0.99R), which is basically the top-right corner (R, R) of the square. We get 0 = fX,Y (0.99R, 0.99R) ≠ fX (0.99R)fY (0.99R) > 0. This is because the joint PDF is defined to be 0 at (0.99R, 0.99R) (not in the circle), but the marginal PDFs of both X and Y are nonzero at 0.99R (since 0.99R is in the marginal range of both).
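If you evaluate the E[Z] integral above (polar coordinates make it easy), you get 2R/3. Here is a quick Monte Carlo sketch (not from the original text, assuming NumPy) that checks this by rejection sampling uniform points in the disk:

    # Sketch: estimate E[Z] for a uniform point on a disk of radius R,
    # and compare to the closed form 2R/3.
    import numpy as np

    rng = np.random.default_rng(312)
    R, n = 2.0, 1_000_000
    x = rng.uniform(-R, R, n)
    y = rng.uniform(-R, R, n)
    inside = x**2 + y**2 <= R**2            # keep only points inside the circle
    z = np.sqrt(x[inside]**2 + y[inside]**2)
    print(z.mean(), 2 * R / 3)              # ~1.333 vs 1.333...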
Example(s)
Now let’s consider another example where we have a continuous joint distribution (X, Y ), where
X ∈ [0, 1] is the proportion of the time until the midterm that you actually spend studying for it and
Y ∈ [0, 1] is your percentage score on the exam.
Suppose the joint PDF is:
$$f_{X,Y}(x,y) = \begin{cases} ce^{-(y-x)} & x, y \in [0,1] \text{ and } y \ge x \\ 0 & \text{otherwise} \end{cases}$$
1. First, consider the joint range and sketch it. Then, interpret it in English in the context of the
problem.
2. Now, write an expression for c in the PDF above.
3. Now, find ΩY and write an expression that we could evaluate to find fY (y).
4. Now, write an expression that we could evaluate to find P (Y ≥ 0.9).
5. Now, write an expression that we can evaluate to find E [Y ], the expected score on the exam.
6. Finally, consider whether X and Y are independent.
Solution
1. X can take any value in [0, 1] without conditions. Then Y is only constrained in that it must be at least X (since the density requires y ≥ x). We can first draw the line y = x, and then the region above this line for which x, y are at most 1 will be our range. That gives us the following:
In English, this means that your score is at least the proportion of the time that you studied: your score will be that proportion or more.
2. To solve for c, we should find the volume above this triangle on the x-y plane and invert it, since $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = 1$. To find this volume we can integrate in terms of x or y first, which will give us the following two equivalent expressions:
$$c = \frac{1}{\int_0^1\int_x^1 e^{-(y-x)}\, dy\, dx} = \frac{1}{\int_0^1\int_0^y e^{-(y-x)}\, dx\, dy}$$
We'll explain the first equality using the dy dx ordering. Since dx is the outer integral, the limits must be just the range of X, which is [0, 1]. For each value of x (draw a vertical line through x on the x-axis), y goes between x and 1, so those are the inner limits of integration.
Now, for the second equality using the dx dy ordering, the outer integral is dy, so the limits are the range of Y, also [0, 1]. Then, for each value of y (draw a horizontal line through y on the y-axis), x goes between 0 and y, and so those are the inner limits of integration.
3. Well, ΩY = [0, 1], as we can see in our graph above that Y takes on values in this range. For the marginal PDF we have to integrate with respect to x, which takes on values in the range 0 to y based on our graph. So, we have:
$$f_Y(y) = \int_0^y ce^{-(y-x)}\, dx$$
4. We can integrate from 0.9 to 1 to solve for this, using the marginal PDF that we solved for above. This takes us back to the univariate case essentially, and gives us the following:
$$P(Y \ge 0.9) = \int_{0.9}^1 f_Y(y)\, dy = \int_{0.9}^1\int_0^y ce^{-(y-x)}\, dx\, dy$$
5. Similarly, by the definition of expectation, using the marginal PDF of Y:
$$E[Y] = \int_0^1 y f_Y(y)\, dy = \int_0^1\int_0^y cye^{-(y-x)}\, dx\, dy$$
6. ΩX,Y ≠ ΩX × ΩY since the sketch of the range is not a rectangle: the joint range is not equal to the Cartesian product of the marginal ranges. To be concrete, consider the point (x = 0.99, y = 0.01) (basically the corner (1, 0)). I chose this point because it is in the Cartesian product ΩX × ΩY = [0, 1] × [0, 1], but not in the joint range (see the picture from the first part). Since it's not in the joint range (shaded region), we have fX,Y (0.99, 0.01) = 0, but since 0.99 ∈ ΩX and 0.01 ∈ ΩY , fX (0.99) > 0 and fY (0.01) > 0. Hence, I've found a point (x, y) where the joint density isn't equal to the product of the marginal densities, violating independence.
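We only set up the integrals above, but if you're curious about the numbers, here is a short numerical sketch (not from the original text; it assumes NumPy and SciPy) that evaluates c, P(Y ≥ 0.9), and E[Y]:

    # Sketch: numerically evaluate the expressions derived above.
    import numpy as np
    from scipy import integrate

    # c = 1 / (integral of e^{-(y-x)} over the triangle {0 <= x <= y <= 1}).
    # dblquad integrates its first argument (here x) innermost: x in [0, y], y in [0, 1].
    total, _ = integrate.dblquad(lambda x, y: np.exp(-(y - x)), 0, 1, 0, lambda y: y)
    c = 1 / total                                       # prints as e ≈ 2.71828

    f_Y = lambda y: c * integrate.quad(lambda x: np.exp(-(y - x)), 0, y)[0]
    p, _ = integrate.quad(f_Y, 0.9, 1)                  # P(Y >= 0.9) ≈ 0.167
    ey, _ = integrate.quad(lambda y: y * f_Y(y), 0, 1)  # E[Y] ≈ 0.641
    print(c, p, ey)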
Chapter 5. Multiple Random Variables
5.3: Conditional Distributions
Slides (Google Drive) Video (YouTube)
Now that we’ve finished talking about joint distributions (whew), we can move on to conditional distribu-
tions and conditional expectation. This is actually just applying the concepts from 2.2 about conditional
probability, generalizing to random variables (instead of events)!
If X, Y are discrete random variables, then the conditional PMF of X given Y is:
$$p_{X|Y}(a \mid b) = P(X = a \mid Y = b) = \frac{p_{X,Y}(a, b)}{p_Y(b)}$$
Note that this should remind you of Bayes' Theorem (because that's what it is)!
If X, Y are continuous random variables, then the conditional PDF of X given Y is:
$$f_{X|Y}(a \mid b) = \frac{f_{X,Y}(a, b)}{f_Y(b)}$$
Again, this is just a generalization from discrete to continuous as we've been doing!
It's important to note that, for each fixed value of b, the probabilities that X = a must sum to 1:
$$\sum_{a \in \Omega_X} p_{X|Y}(a \mid b) = 1$$
If X and Y are mixed (one discrete, one continuous), then a similar extension can be made, where any discrete random variable has a p (a probability mass function) and any continuous random variable has an f (a probability density function).
Example(s)
Back to our example of the blue and red die rolls from 5.1. Suppose we roll a fair blue 4-sided die and a fair red 4-sided die independently. Recall that U = min{X, Y } (the smaller of the two die rolls) and V = max{X, Y } (the larger of the two die rolls). Then, their joint PMF was:

    pU,V(u, v)   v=1     v=2     v=3     v=4
    u=1          1/16    2/16    2/16    2/16
    u=2          0       1/16    2/16    2/16
    u=3          0       0       1/16    2/16
    u=4          0       0       0       1/16

Find the conditional PMF pU|V (u | 3).
Solution By definition of conditional PMF,
$$p_{U|V}(u \mid 3) = \frac{p_{U,V}(u, 3)}{p_V(3)}$$
We need to compute the denominator, which is the marginal PMF of V (the sum of the third column):
$$p_V(3) = \sum_{a \in \Omega_U} p_{U,V}(a, 3) = 2/16 + 2/16 + 1/16 + 0 = 5/16$$
Hence, pU|V (u | 3) is 2/5 for u = 1, 2/5 for u = 2, 1/5 for u = 3, and 0 for u = 4. (Notice these sum to 1, as any conditional PMF must!)
it’s only fair that the conditional expectation of X, given knowledge that some other RV Y is equal to y is
the same exact thing, EXCEPT the probabilities should be conditioned on Y = y now:
X X
E [X | Y = y] = xP (X = x | Y = y) = xpX,Y (x | y)
x∈ΩX x∈ΩX
Most notably, we are still summing over x and NOT y, since this expression should depend on y right? Given
that Y = y, what is the expectation of X?
If X is continuous (and Y is either discrete or continuous), then we define the conditional expectation of g(X) given (the event that) Y = y as:
$$E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x) f_{X|Y}(x \mid y)\, dx$$
Notice that these sums and integrals are over x (not y), since E[g(X) | Y = y] is a function of y.
These formulas are exactly the same as E[g(X)], except the PMF/PDF of X is replaced with the conditional PMF/PDF of X | Y = y.
Example(s)
Suppose X ∼ Unif(0, 1) (continuous). We repeatedly draw independent Y1 , Y2 , Y3 , · · · ∼ Unif(0, 1) (continuous) until the first random time T such that YT < X. What is E[T]?
The question is basically asking the following: we get some uniformly random decimal number X from [0, 1]. We keep drawing uniform random numbers until we get a value less than our initial value. What is the expected number of draws until this happens?
Solution We’ll do this problem in a “bad” way (the only way we know how to know), and then learn the
Law of Total Expectation next to see how this solution could be much simpler!
To find E [T ], since T is discrete with range ΩT = {1, 2, 3, . . . }, we can find its PMF pT (t) = P (T = t) for
any value t and use the usual formula for expectation. However, T depends on the value of the initial number
X right? If X = 0.1 it would take longer to get a number less than this than if X = 0.99. Let’s try to find
the probability T = t given that X = x first:
$$P(T = t \mid X = x) = (1 - x)^{t-1}x$$
because the probability we get a number smaller than x is just x (Uniform CDF), and so we need to get
t − 1 failures first before our first success. Actually, (T | X = x) ∼ Geo(x) so that’s another way we could’ve
computed this conditional PMF. Then, let’s use the LTP to find P (T = t) (we need to integrate over all
values of x because X is continuous, not discrete):
$$P(T = t) = \int_0^1 P(T = t \mid X = x)\, f_X(x)\, dx = \int_0^1 (1-x)^{t-1}x \cdot 1\, dx = \cdots = \frac{1}{t(t+1)}$$
after skipping some purely computational steps. Finally, since we have the PMF of T, we can compute expectation in the normal way:
$$E[T] = \sum_{t=1}^{\infty} t\, p_T(t) = \sum_{t=1}^{\infty} t\frac{1}{t(t+1)} = \sum_{t=1}^{\infty} \frac{1}{t+1} = \infty$$
The reason this is ∞ is because this is like the harmonic series 1 + 1/2 + 1/3 + 1/4 + ... which is known to diverge to ∞. This is surprising, right? The expected time until you get a number smaller than your first is infinite!
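A simulation sketch (not from the original text, assuming NumPy) makes the PMF tangible: since (T | X = x) ∼ Geo(x), we can sample T directly and compare empirical frequencies to 1/(t(t+1)). (The sample mean of T, on the other hand, never settles down, since E[T] = ∞.)

    # Sketch: empirical PMF of T vs. the derived 1/(t(t+1)).
    import numpy as np

    rng = np.random.default_rng(312)
    n = 500_000
    x = rng.uniform(size=n)        # the initial numbers X
    t = rng.geometric(x)           # (T | X = x) ~ Geo(x), as derived above
    for k in range(1, 6):
        print(k, np.mean(t == k), 1 / (k * (k + 1)))  # empirical vs exact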
This brings us to the Law of Total Expectation (LTE): if Y is discrete, then $E[g(X)] = \sum_{y \in \Omega_Y} E[g(X) \mid Y = y]\, p_Y(y)$, and if Y is continuous, then $E[g(X)] = \int_{-\infty}^{\infty} E[g(X) \mid Y = y]\, f_Y(y)\, dy$. This looks exactly like the law of total probability we are used to: basically, to solve for E[g(X)], we take a weighted average of E[g(X) | Y = y] over all possible values of y.
Example(s)
(This is the same example as earlier): Suppose X ∼ Unif(0, 1) (continuous). We repeatedly draw
independent Y1 , Y2 , Y3 , · · · ∼ Unif(0, 1) (continuous) until the first random time T such that YT < X.
What is E [T ]?
Solution Using the LTE now, we can solve this in a much simpler fashion. We know that (T | X = x) ∼ Geo(x) as stated earlier. By citing the expectation of a Geometric RV, we know that E[T | X = x] = 1/x. By the LTE, conditioning on X:
$$E[T] = \int_0^1 E[T \mid X = x]\, f_X(x)\, dx = \int_0^1 \frac{1}{x}\cdot 1\, dx = [\ln(x)]_0^1 = \infty$$
This was a much faster way of getting to the answer than before!
Example(s)
Let's finally prove that if X ∼ Geo(p), then µ = E[X] = 1/p. Recall that the Geometric random variable is the number of independent Bernoulli trials with parameter p up to and including the first success.
Solution First, let’s condition on whether our first flip was heads or tails (these events partition the sample
space):
What are those four values on the right though? We know P (H) = p and P (T ) = 1 − p, so that’s out of the
way.
What is E [X | H]? If we got heads on the first try, then E [X | H] = 1 since we are immediately done
(i.e., the number of trials it took to get our first heads, given we got heads on the first trial, is 1).
What is E [X | T ]? This is a bit trickier: because the trials are independent, and we got a tail on the
first try, we basically have to restart (memorylessness), and so our conditional expectation is just E [1 + X],
since we are back to square one except with one additional trial!
Plugging these four values in gives a recursive formula (E[X] appears on both sides):
\begin{align*}
E[X] &= 1 \cdot p + (1 + E[X]) \cdot (1 - p) \\
\mu &= p + (1 + \mu)(1 - p) \\
\mu &= p + 1 - p + \mu - \mu p \\
\mu &= 1 + \mu - \mu p \\
0 &= 1 - \mu p \\
\mu p &= 1 \\
\mu &= \frac{1}{p}
\end{align*}
This is a really “cute” proof of the expectation of a Geometric RV! See the notes in 3.5 to see the “ugly”
calculus proof.
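As a quick empirical sanity check on this cute proof, here is a simulation sketch (not from the original text, assuming NumPy):

    # Sketch: empirical mean of Geo(p) samples vs. 1/p.
    import numpy as np

    rng = np.random.default_rng(312)
    for p in (0.2, 0.5, 0.8):
        samples = rng.geometric(p, size=1_000_000)  # trials until first success
        print(p, samples.mean(), 1 / p)             # empirical mean vs 1/p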
5.3.4 Exercises
1. What happens to linearity of expectation when you sum a random number of random variables? We
know it holds for fixed values of n, but let’s see what happens if we sum a random number N of them.
It turns out, you get something very nice!
Let X1 , X2 , X3 , . . . be a sequence of independent and identically distributed (iid) RVs, with common mean E[X1] = E[X2] = . . . . Let N be a random variable which has range ΩN ⊆ {0, 1, 2, . . . } (nonnegative integers), independent of all the Xi 's. Show that $E\left[\sum_{i=1}^N X_i\right] = E[X_1]E[N]$. That is, the expected sum of a random number of random variables is the expected number of random variables times the expected value of each (which you might think is intuitively true, but we have to prove it!).
Solution: We have the following:
\begin{align*}
E\left[\sum_{i=1}^N X_i\right] &= \sum_{n \in \Omega_N} E\left[\sum_{i=1}^N X_i \,\Big|\, N = n\right] p_N(n) && \text{[Law of Total Expectation]} \\
&= \sum_{n \in \Omega_N} E\left[\sum_{i=1}^n X_i \,\Big|\, N = n\right] p_N(n) && \text{[given } N = n\text{: substitute in the upper limit]} \\
&= \sum_{n \in \Omega_N} E\left[\sum_{i=1}^n X_i\right] p_N(n) && [N \text{ independent of the } X_i\text{'s}] \\
&= \sum_{n \in \Omega_N} n E[X_1]\, p_N(n) && \text{[linearity of expectation]} \\
&= E[X_1] \sum_{n \in \Omega_N} n\, p_N(n) \\
&= E[X_1] E[N] && \text{[def of } E[N]\text{]}
\end{align*}
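This identity (sometimes called Wald's identity) is easy to check empirically; below is a sketch (not from the original text, assuming NumPy) with Xi ∼ Exp(1/2) (mean 2) and N ∼ Poi(3) (mean 3), N independent of the Xi's:

    # Sketch: E[sum of N iid X's] should be E[X1] * E[N] = 2 * 3 = 6.
    import numpy as np

    rng = np.random.default_rng(312)
    trials = 100_000
    n = rng.poisson(3, size=trials)                         # random number of summands
    totals = [rng.exponential(2.0, size=k).sum() for k in n]
    print(np.mean(totals), 2.0 * 3)                         # both ≈ 6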
Application Time!!
Now you’ve learned enough theory to discover the Markov Chain Monte Carlo
(MCMC) strategy covered in section 9.6. You are highly encouraged to read
that section before moving on!
Chapter 5. Multiple Random Variables
5.4: Covariance and Correlation
Slides (Google Drive) Video (YouTube)
In this section, we’ll learn about covariance; which as you might guess, is related to variance. It is a function
of two random variables, and tells us whether they have a positive or negative linear relationship. It also
helps us finally compute the variance of a sum of dependent random variables, which we have not yet been
able to do.
Covariance is defined as Cov (X, Y ) = E[(X − E[X])(Y − E[Y])]. The last two of its properties are:
6. Var (X + Y ) = Var (X) + Var (Y ) + 2Cov (X, Y ). In particular, if X ⊥ Y , then Var (X + Y ) = Var (X) + Var (Y ) (as we discussed earlier).
7. $\mathrm{Cov}\left(\sum_{i=1}^n X_i, \sum_{j=1}^m Y_j\right) = \sum_{i=1}^n\sum_{j=1}^m \mathrm{Cov}(X_i, Y_j)$. That is, covariance works like FOIL (first, outer, inner, last) for multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
Proof of Covariance Alternate Formula. We will prove that Cov (X, Y ) = E[XY] − E[X]E[Y]:
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY - XE[Y] - YE[X] + E[X]E[Y]] = E[XY] - E[X]E[Y]$$
using linearity of expectation and the fact that E[X], E[Y] are constants. We actually proved in 5.1 already that E[XY] = E[X]E[Y] when X, Y are independent. Hence, if X ⊥ Y , then Cov (X, Y ) = 0.
Example(s)
Let X and Y be independent random variables, each with mean 0 and variance 1, and let
Z = 1 + X + XY²
W = 1 + X
Find Cov(Z, W ).
Solution First note that E[X²] = Var (X) + (E[X])² = 1 + 0² = 1 (rearrange the variance formula and solve for E[X²]). Similarly, E[Y²] = 1.
\begin{align*}
\mathrm{Cov}(Z, W) &= \mathrm{Cov}(1 + X + XY^2, 1 + X) \\
&= \mathrm{Cov}(X + XY^2, X) && \text{[Property 4]} \\
&= \mathrm{Cov}(X, X) + \mathrm{Cov}(XY^2, X) && \text{[Property 7]} \\
&= \mathrm{Var}(X) + E[X^2Y^2] - E[XY^2]E[X] && \text{[Property 2 and def of covariance]} \\
&= 1 + E[X^2]E[Y^2] - E[X]^2E[Y^2] && \text{[because } X \text{ and } Y \text{ are independent]} \\
&= 1 + 1 - 0 = 2
\end{align*}
Covariance has a “problem” in measuring linear relationships, in that Cov (X, Y ) will be positive when there
is a positive linear relationship and negative when there is a negative linear relationship, but Cov (2X, Y ) =
2Cov (X, Y ). Scaling one of the random variables should not affect the strength of their relationship, which
it seems to do. It would be great if we defined some metric that was normalized (had a maximum and
minimum), and was invariant to scale. This metric will be called correlation!
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}$$
We can prove by the Cauchy-Schwarz inequality (from linear algebra) that −1 ≤ ρ(X, Y ) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably, ρ(X, Y ) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
In linear regression ("line-fitting") from high school science class, you may have calculated some R², 0 ≤ R² ≤ 1; this is actually ρ², and it measures how well a linear relationship exists between X and Y. R² is the percentage of variance in Y which can be explained by X.
Let’s take a look at some example graphs which shows a sample of data and their (Pearson) correlations, to
get some intuition.
The 1st (purple) plot has a perfect negative linear relationship and so the correlation is −1.
The 2nd (green) plot has a positive relationship, but it is not perfect, so the correlation is around +0.9.
The 3rd (orange) plot is a perfectly linear positive relationship, so the correlation is +1.
The 4th (red) plot appears to have data that is independent, so the correlation is 0.
The 5th (blue) plot has a negative trend that isn’t strongly linear, so the correlation is around −0.6.
Example(s)
Suppose X and Y are random variables, where Y = −5X + 2. Show that, since there is a perfect
negative linear relationship, ρ(X, Y ) = −1.
Solution To find the correlation, we need the covariance and the two individual variances. Let's write them in terms of Var (X):
$$\mathrm{Var}(Y) = \mathrm{Var}(-5X + 2) = (-5)^2\mathrm{Var}(X) = 25\mathrm{Var}(X)$$
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, -5X + 2) = -5\,\mathrm{Cov}(X, X) = -5\mathrm{Var}(X)$$
Finally,
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}} = \frac{-5\mathrm{Var}(X)}{\sqrt{\mathrm{Var}(X)}\sqrt{25\mathrm{Var}(X)}} = \frac{-5\mathrm{Var}(X)}{5\mathrm{Var}(X)} = -1$$
Note that the −5 and 2 did not matter at all (except that −5 was negative and made the correlation negative)!
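Here is an empirical version of this example (a sketch, not from the original text, assuming NumPy):

    # Sketch: the sample correlation of (X, -5X + 2) should be -1.
    import numpy as np

    rng = np.random.default_rng(312)
    x = rng.normal(size=100_000)     # any X with finite variance works here
    y = -5 * x + 2
    print(np.corrcoef(x, y)[0, 1])   # -1.0 (up to floating-point error)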
Proof of Variance of Sums of RVs. We'll first do something unintuitive - making our expression more complicated. The variance of the sum X1 + X2 + · · · + Xn is the covariance of the sum with itself! We'll use i to index one of the sums $\sum_{i=1}^n X_i$ and j for the other $\sum_{j=1}^n X_j$. Keep in mind these both represent the same quantity; you'll see why we used different dummy variables soon!
\begin{align*}
\mathrm{Var}\left(\sum_{i=1}^n X_i\right) &= \mathrm{Cov}\left(\sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right) && \text{[covariance with self = variance]} \\
&= \sum_{i=1}^n\sum_{j=1}^n \mathrm{Cov}(X_i, X_j) && \text{[by FOIL]} \\
&= \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j) && \text{[by symmetry (see image below)]}
\end{align*}
The final step comes from the definition of covariance of a variable with itself and the symmetry of covariance. It is illustrated below, where the red diagonal is the covariance of a variable with itself (which is its variance), and the green off-diagonal entries are the symmetric pairs of covariances. Since Cov (Xi , Xj ) = Cov (Xj , Xi ), we only need to sum the lower triangle (where i < j) and multiply by 2 to account for the upper triangle.
It is important to remember that if all the RVs were independent, all the Cov (Xi , Xj ) terms (for i ≠ j) would be zero, and so we would just be left with the sum of the variances as we showed earlier!
Example(s)
Recall in the hat check problem in 3.3, we had n people who go to a party and leave their hats with
a hat check person. At the end of the party, the hats are returned randomly though.
We let X be the number of people who get their original hat back. We solved for E [X] with indicator
random variables X1 , . . . Xn for whether the i-th person got their hat back.
We showed that:
$$E[X_i] = P(X_i = 1) = P(i\text{-th person gets their hat back}) = \frac{1}{n}$$
So,
$$E[X] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{n} = n \cdot \frac{1}{n} = 1$$
Now let's compute Var (X).
Solution Recall that each Xi ∼ Ber(1/n) (1 with probability 1/n, and 0 otherwise). (Remember these were NOT independent RVs, but we still could apply linearity of expectation.) In our previous proof, we showed that
$$\mathrm{Var}(X) = \mathrm{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j)$$
Recall that Xi , Xj are indicator random variables which are in {0, 1}, so their product Xi Xj ∈ {0, 1} as well. This allows us to calculate:
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = \frac{1}{n}\cdot\frac{1}{n-1}$$
This is because we need both person i and person j to get their hat back: person i gets theirs back with probability 1/n, and given this is true, person j gets theirs back with probability 1/(n − 1). Hence, for i ≠ j,
$$\mathrm{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}$$
Finally, we have
\begin{align*}
\mathrm{Var}(X) &= \sum_{i=1}^n \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j) && \text{[formula for variance of sum]} \\
&= \sum_{i=1}^n \frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\sum_{i<j} \frac{1}{n^2(n-1)} && \text{[plug in]} \\
&= n\cdot\frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\binom{n}{2}\frac{1}{n^2(n-1)} && \left[\text{there are } \binom{n}{2} \text{ pairs with } i < j\right] \\
&= \left(1 - \frac{1}{n}\right) + 2\cdot\frac{n(n-1)}{2}\cdot\frac{1}{n^2(n-1)} \\
&= \left(1 - \frac{1}{n}\right) + \frac{1}{n} \\
&= 1
\end{align*}
How many pairs are there with i < j? This is just $\binom{n}{2} = \frac{n(n-1)}{2}$ since we just choose two different elements. Another way to see this is that there was an n × n square, and we removed the diagonal of n elements, so we are left with n² − n = n(n − 1). Divide by two to get just the lower half.
This is very surprising and interesting! When returning n hats randomly and uniformly, the expected number of people who get their hat back is 1, and so is the variance! These don't even depend on n at all (the simulation sketch below agrees)! It takes practice to get used to these formulas, so let's do one more problem.
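First, the promised simulation sketch (not from the original text, assuming NumPy): it returns n hats uniformly at random many times and measures the mean and variance of the number of matches.

    # Sketch: hat-check simulation; both mean and variance should be ≈ 1.
    import numpy as np

    rng = np.random.default_rng(312)
    n, trials = 20, 100_000
    matches = np.array([(rng.permutation(n) == np.arange(n)).sum()
                        for _ in range(trials)])
    print(matches.mean(), matches.var())   # both ≈ 1, for any n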
Example(s)
Suppose we throw 12 balls independently and uniformly into 7 bins. What are the mean and variance
of the number of empty bins after this process? (Hint: Indicators).
Solution Let X be the number of empty bins, and for i = 1, . . . , 7 let Xi be the indicator that bin i is empty, so that $X = \sum_{i=1}^7 X_i$. Then P(Xi = 1) = (6/7)¹², since we need to avoid this bin (with probability 6/7) 12 times independently. That is,
$$X_i \sim \mathrm{Ber}\left(p = \left(\frac{6}{7}\right)^{12}\right)$$
Hence, E[Xi] = p ≈ 0.1573 and Var (Xi ) = p(1 − p) ≈ 0.1325. These random variables are surely dependent, since knowing one bin is empty means the 12 balls had to go to the other 6 bins, making it less likely that another bin is empty.
However, dependence doesn’t bother us for computing the expectation; by linearity of expectation, we get
" 7 # 7 7 12 12
X X X 6 6
E [X] = E Xi = E [Xi ] = =7 ≈ 1.1009
i=1 i=1 i=1
7 7
Now for the variance, we need to find Cov (Xi , Xj ) = E[Xi Xj ] − E[Xi ]E[Xj ] for i ≠ j. Well, Xi Xj ∈ {0, 1} since both Xi , Xj ∈ {0, 1}, so Xi Xj is an indicator/Bernoulli as well with
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = P(\text{both bins } i \text{ and } j \text{ are empty}) = \left(\frac{5}{7}\right)^{12}$$
since all the balls must go into the other 5 bins during each of the 12 independent throws. Finally,
$$\mathrm{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \left(\frac{5}{7}\right)^{12} - \left(\frac{6}{7}\right)^{12}\left(\frac{6}{7}\right)^{12} \approx -0.0071$$
Recall that Var (Xi ) = p(1 − p) ≈ 0.1325, and so putting this all together gives:
\begin{align*}
\mathrm{Var}(X) &= \sum_{i=1}^7 \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j) && \text{[formula for variance of sum]} \\
&\approx \sum_{i=1}^7 0.1325 + 2\sum_{i<j} (-0.0071) && \text{[plug in approximate decimal values]} \\
&= 7 \cdot 0.1325 + 2\binom{7}{2}(-0.0071) \\
&\approx 0.62954
\end{align*}
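A simulation sketch (not from the original text, assuming NumPy) agrees with both numbers:

    # Sketch: throw 12 balls into 7 bins; check mean and variance of empty bins.
    import numpy as np

    rng = np.random.default_rng(312)
    trials = 100_000
    throws = rng.integers(0, 7, size=(trials, 12))          # bin index of each ball
    empty = np.array([7 - len(np.unique(row)) for row in throws])
    print(empty.mean(), empty.var())                        # ≈ 1.1009 and ≈ 0.6295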
Recall the hypergeometric RV X ∼ HypGeo(N, K, n), which was the number of lollipops we get when we draw n candies from a bag of N total candies (K ≤ N of which are lollipops). We stated without proof that $\mathrm{Var}(X) = n\frac{K(N-K)(N-n)}{N^2(N-1)}$. You have the tools now to prove this if you like, using indicators and covariances, but we'll prove this later in 5.8 as well!
Chapter 5. Multiple Random Variables
5.5: Convolution
Slides (Google Drive) Video (YouTube)
In section 4.4, we explained how to transform random variables (finding the density function of g(X)). In
this section, we’ll talk about how to find the distribution of the sum of two independent random variables,
X + Y , using a technique called convolution. It will allow us to prove some statements we made earlier without proof (like sums of independent Binomials are Binomial, and sums of independent Poissons are Poisson), and also derive the density function of the Gamma distribution, which we just stated.
This should just remind you of the LTP we learned in section 2.2, or the definition of marginal PMFs/PDFs from earlier in the chapter! We'll use the LTP to help us derive the formulae for convolution.
5.5.2 Convolution
Convolution is a mathematical operation that allows us to derive the distribution of a sum of two independent random variables. For example, suppose the amount of gold a company can mine is X tons per year in country A, and the amount of gold the company can mine is Y tons per year in country B, independently. You have some distribution to model each. What is the distribution of the total amount of gold you mine, Z = X + Y ? Combining this with 4.4, if you know your profit is some function of the total amount of gold, say g(Z) = √Z, you can now find the density function of your profit!
Example(s)
Let X, Y ∼ Unif(1, 4) be independent rolls of a fair 4-sided die. What is the PMF of Z = X + Y ?
Solution We know that for the range of Z we have the following, since it is the sum of two values each in
the range {1, 2, 3, 4}:
ΩZ = {2, 3, 4, 5, 6, 7, 8}
Should the probabilities be uniform? That is, would you be equally likely to roll a 2 as a 5? No, because
there is only one way to get a 2 (rolling (1, 1)), but many ways to get a 5.
If I wanted to compute the probability that Z = 3 for example, I could just sum over all possible val-
ues of X in ΩX = {1, 2, 3, 4} to get:
\begin{align*}
P(Z = 3) &= P(X = 1, Y = 2) + P(X = 2, Y = 1) + P(X = 3, Y = 0) + P(X = 4, Y = -1) \\
&= P(X = 1)P(Y = 2) + P(X = 2)P(Y = 1) + P(X = 3)P(Y = 0) + P(X = 4)P(Y = -1) \\
&= \frac{1}{4}\cdot\frac{1}{4} + \frac{1}{4}\cdot\frac{1}{4} + \frac{1}{4}\cdot 0 + \frac{1}{4}\cdot 0 \\
&= \frac{2}{16}
\end{align*}
where the first line is all the ways to get a 3, and the second line uses independence. Note that it is not possible that Y = 0 or Y = −1, but we write this for completeness. More generally, to find pZ (z) = P(Z = z) for any value of z, we just write
\begin{align*}
p_Z(z) &= P(Z = z) \\
&= \sum_{x \in \Omega_X} P(X = x, Y = z - x) \\
&= \sum_{x \in \Omega_X} P(X = x)P(Y = z - x) \\
&= \sum_{x \in \Omega_X} p_X(x)p_Y(z - x)
\end{align*}
The intuition is that if we want Z = z, we sum over all possibilities of X = x but require that Y = z − x so
that we get the desired sum of z. It is very possible that pY (z − x) = 0 as we saw above.
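In code, this discrete convolution is exactly what numpy.convolve computes; here is a sketch (not from the original text, assuming NumPy) for the two 4-sided dice:

    # Sketch: PMF of the sum of two fair 4-sided dice via discrete convolution.
    import numpy as np

    p_die = np.array([1/4, 1/4, 1/4, 1/4])   # PMF of one die on {1, 2, 3, 4}
    p_sum = np.convolve(p_die, p_die)         # PMF of Z = X + Y on {2, ..., 8}
    for z, p in zip(range(2, 9), p_sum):
        print(z, p)                           # e.g., P(Z = 3) = 2/16 = 0.125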
It turns out that formula at the bottom was extremely general, and works for any sum of two independent
discrete RVs. Now let’s consider the continuous case. What if X and Y are continuous RVs and we define
Z = X + Y ; how can we solve for the probability density function for Z, fZ (z)? It turns out the formula is
extremely similar, just replacing p with f !
(Convolution): Let X, Y be independent random variables and Z = X + Y . If X, Y are discrete, then $p_Z(z) = \sum_{x \in \Omega_X} p_X(x)p_Y(z-x)$; if X, Y are continuous, then $f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z-x)\, dx$.
Note: You can swap the roles of X and Y. Note the similarity between the cases!
Proof of Convolution.
• Discrete case: Even though we proved this earlier, we'll do it again a different way (using the LTP/def of marginal):
\begin{align*}
p_Z(z) &= P(Z = z) \\
&= \sum_{x \in \Omega_X} P(X = x, Z = z) && \text{[LTP/marginal]} \\
&= \sum_{x \in \Omega_X} P(X = x, Y = z - x) && [(X = x, Z = z) \text{ equivalent to } (X = x, Y = z - x)] \\
&= \sum_{x \in \Omega_X} P(X = x)P(Y = z - x) && [X \text{ and } Y \text{ are independent}] \\
&= \sum_{x \in \Omega_X} p_X(x)p_Y(z - x)
\end{align*}
• Continuous case: Since we should never work with densities as probabilities, let's start with the CDF and differentiate:
\begin{align*}
F_Z(z) &= P(Z \le z) \\
&= P(X + Y \le z) && \text{[def of } Z] \\
&= \int_{x \in \Omega_X} P(X + Y \le z \mid X = x) f_X(x)\, dx && \text{[LTP, conditioning on } X] \\
&= \int_{x \in \Omega_X} P(x + Y \le z \mid X = x) f_X(x)\, dx && \text{[given } X = x] \\
&= \int_{x \in \Omega_X} P(Y \le z - x \mid X = x) f_X(x)\, dx && \text{[algebra]} \\
&= \int_{x \in \Omega_X} P(Y \le z - x) f_X(x)\, dx && [X \text{ and } Y \text{ are independent}] \\
&= \int_{x \in \Omega_X} F_Y(z - x) f_X(x)\, dx && \text{[def of CDF of } Y]
\end{align*}
Now we can take the derivative (with respect to z) of the CDF to get the density (FY becomes fY ):
$$f_Z(z) = \frac{d}{dz}F_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\, dx$$
Example(s)
Suppose X and Y are two independent random variables such that X ∼ Poi(λ1 ) and Y ∼ Poi(λ2 ),
and let Z = X + Y . Prove that Z ∼ Poi(λ1 + λ2 ).
Solution The ranges of X and Y are ΩX = ΩY = {0, 1, 2, . . . }, and so ΩZ = {0, 1, 2, . . . } as well. For n ∈ ΩZ , note that the convolution formula says:
$$p_Z(n) = \sum_{k \in \Omega_X} p_X(k)p_Y(n - k) = \sum_{k=0}^{\infty} p_X(k)p_Y(n - k)$$
However, if you blindly plug in the PMFs pX and pY , you will get the wrong answer, and here's why. We only want to sum things that are non-zero (otherwise what's the point?), and if we want pX (k)pY (n − k) > 0, we need BOTH to be nonzero. That means k must be in the range of X AND n − k must be in the range of Y. Remember the dice example (we had pY (−1) at some point, which would be 0 and not 1/4). We are guaranteed pX (k) > 0 because we are only summing over valid k ∈ ΩX , but we must have n − k be a nonnegative integer (in the range ΩY = {0, 1, 2, . . . }), so actually, we must have k ≤ n. Now, we can just plug and chug:
\begin{align*}
p_Z(n) &= \sum_{k=0}^{n} p_X(k)p_Y(n - k) && \text{[convolution formula]} \\
&= \sum_{k=0}^{n} e^{-\lambda_1}\frac{\lambda_1^k}{k!} \cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} && \text{[plug in Poisson PMFs]} \\
&= e^{-(\lambda_1+\lambda_2)}\sum_{k=0}^{n} \frac{1}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} && \text{[algebra]} \\
&= e^{-(\lambda_1+\lambda_2)}\frac{1}{n!}\sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} && \text{[multiply and divide by } n!] \\
&= e^{-(\lambda_1+\lambda_2)}\frac{1}{n!}\sum_{k=0}^{n} \binom{n}{k}\lambda_1^k\lambda_2^{n-k} && \left[\binom{n}{k} = \frac{n!}{k!(n-k)!}\right] \\
&= e^{-(\lambda_1+\lambda_2)}\frac{(\lambda_1+\lambda_2)^n}{n!} && \text{[binomial theorem]}
\end{align*}
Thus, Z ∼ Poi(λ1 + λ2 ), as its PMF matches that of a Poisson distribution! Note we wouldn't have been able to apply the binomial theorem in that last step if our sum had still run from k = 0 to ∞ instead of 0 to n. You MUST watch out for this at the beginning, and after that, it's just algebra.
Example(s)
Suppose X, Y are independent and identically distributed (iid) continuous Unif(0, 1) random vari-
ables. Let Z = X + Y . What is fZ (z)?
Solution We always begin by calculating the range: we have ΩZ = [0, 2]. Again, we shouldn't expect Z to be uniform, since we should expect a number around 1, but not 0 or 2.
For a U ∼ Unif(0, 1) (continuous) random variable, we know ΩU = [0, 1], and that
$$f_U(u) = \begin{cases} 1 & 0 \le u \le 1 \\ 0 & \text{otherwise} \end{cases}$$
By the convolution formula,
$$f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\, dx = \int_0^1 f_X(x)f_Y(z - x)\, dx = \int_0^1 f_Y(z - x)\, dx$$
where the last equality holds since fX (x) = 1 for all 0 ≤ x ≤ 1 as we saw above. Remember, we need to make sure z − x ∈ ΩY = [0, 1], otherwise the density will be 0.
For fY (z − x) > 0, we need 0 ≤ z − x ≤ 1. We’ll split into two cases depending on whether z ∈ [0, 1] or
z ∈ [1, 2], which compose its range ΩZ = [0, 2].
• If z ∈ [0, 1], we already have z − x ≤ 1 since z ≤ 1 (and x ∈ [0, 1]). We also need z − x ≥ 0 for the density to be nonzero: x ≤ z. Hence, our integral becomes:
$$f_Z(z) = \int_0^z f_Y(z - x)\, dx + \int_z^1 f_Y(z - x)\, dx = \int_0^z 1\, dx + 0 = [x]_0^z = z$$
• If z ∈ [1, 2], we already have z − x ≥ 0 since z ≥ 1 (and x ∈ [0, 1]). We now need the other condition z − x ≤ 1 for the density to be nonzero: x ≥ z − 1. Hence, our integral becomes:
$$f_Z(z) = \int_0^{z-1} f_Y(z - x)\, dx + \int_{z-1}^1 f_Y(z - x)\, dx = 0 + \int_{z-1}^1 1\, dx = [x]_{z-1}^1 = 2 - z$$
This makes sense because there are “more ways” to get a value of 1 for example than any other point.
Whereas to get a value of 2, there’s only one way - we need both X, Y to be equal to 1.
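A histogram check (a sketch, not from the original text, assuming NumPy) confirms the triangular shape of this density:

    # Sketch: compare a histogram of X + Y to the derived triangular density.
    import numpy as np

    rng = np.random.default_rng(312)
    z = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)
    hist, edges = np.histogram(z, bins=20, range=(0, 2), density=True)
    mids = (edges[:-1] + edges[1:]) / 2
    f = np.where(mids <= 1, mids, 2 - mids)   # f_Z(z) = z on [0,1], 2 - z on [1,2]
    print(np.abs(hist - f).max())             # small (just sampling noise)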
Example(s)
Mitchell and Alex are competing together in a 2-mile relay race. The time Mitchell takes to finish (in
hours) is X ∼ Exp(2) and the time Alex takes to finish his mile (in hours) is continuous Y ∼ Unif(0, 1).
Alex starts immediately after Mitchell finishes his mile, and their performances are independent.
What is the distribution of Z = X + Y , the total time they take to finish the race?
Solution First, we know that ΩX = [0, ∞) and ΩY = [0, 1], so ΩZ = [0, ∞). We know from our distribution
chart that
$$f_X(x) = \lambda e^{-\lambda x} \text{ for } x \ge 0, \qquad f_Y(y) = 1 \text{ for } 0 \le y \le 1$$
Let z ∈ ΩZ . We'll use the convolution formula, but this time integrating over the range of Y (you could also do it over X!). We can do this because X + Y = Y + X, and there was no reason why we had to condition on X first.
$$f_Z(z) = \int_{\Omega_Y} f_Y(y)f_X(z - y)\, dy = \int_0^1 f_Y(y)f_X(z - y)\, dy$$
Since we are integrating over y, we don't need to worry about fY (y) being 0, but we do need to make sure fX (z − y) > 0. There are two cases again:
• If z ∈ [0, 1], then since we need z − y ≥ 0, we need y ≤ z:
$$f_Z(z) = \int_0^z f_Y(y)f_X(z - y)\, dy = \int_0^z 1 \cdot \lambda e^{-\lambda(z-y)}\, dy = 1 - e^{-\lambda z}$$
• If z ∈ (1, ∞), then z − y ≥ 0 automatically (since y ≤ 1 ≤ z), so the integral runs over all of [0, 1]:
$$f_Z(z) = \int_0^1 1 \cdot \lambda e^{-\lambda(z-y)}\, dy = (e^{\lambda} - 1)e^{-\lambda z}$$
Note this tiny difference in the upper limit of the integral made a huge difference! Our final result is
$$f_Z(z) = \begin{cases} 1 - e^{-\lambda z} & z \in [0, 1] \\ (e^{\lambda} - 1)e^{-\lambda z} & z \in (1, \infty) \\ 0 & \text{otherwise} \end{cases}$$
The moral of the story is: always watch out for the ranges, otherwise you might not get what you expect!
The range of the random variable exists for a reason, so be careful!
Chapter 5. Multiple Random Variables
5.6: Moment Generating Functions
Slides (Google Drive) Video (YouTube)
Last time, we talked about how to find the distribution of the sum of two independent random variables.
Some of the most important use cases are to prove the results we’ve been using for so long: the sum of
independent Binomials is Binomial, the sum of independent Poissons is Poisson (we proved this in 5.5 using
convolution), etc. We’ll now talk about Moment Generating Functions, which allow us to do these in a
different (and arguably easier) way. These will also be used to prove the Central Limit Theorem (next
section), probably the most important result in all of statistics! Also, to derive the Chernoff bound (6.2).
The point is, these are used to prove a lot of important results. They might not be as directly applicable to problems though.
5.6.1 Moments
First, we need to define what a moment is.
The first four moments of a distribution/RV are commonly used, though we have only talked about the first
two of them. I’ll briefly explain each but we won’t talk about the latter two much.
1. The first moment of X is the mean of the distribution, µ = E[X]. This describes the center or average value.
2. The second moment of X about µ is the variance of the distribution, σ² = Var (X) = E[(X − µ)²]. This describes the spread of a distribution (how much it varies).
3. The third standardized moment is called skewness, $E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$, and typically tells us about the asymmetry of a distribution about its peak. If skewness is positive, then the mean is larger than the median and there are a lot of extreme high values. If skewness is negative, then the median is larger than the mean and there are a lot of extreme low values.
4. The fourth standardized moment is called kurtosis, $E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = \frac{E[(X-\mu)^4]}{\sigma^4}$, which measures how peaked a distribution is. If the kurtosis is positive, then the distribution is thin and pointy, and if the kurtosis is negative, the distribution is flat and wide.
The moment generating function (MGF) of a random variable X is the function $M_X(t) = E[e^{tX}]$ of t. If X is discrete, by LOTUS:
$$M_X(t) = \sum_{x \in \Omega_X} e^{tx}p_X(x)$$
If X is continuous, by LOTUS:
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx}f_X(x)\, dx$$
We say that the MGF of X exists if there is an ε > 0 such that the MGF is finite for all t ∈ (−ε, ε), since it is possible that the sum or integral diverges.
Example(s)
(a) Let X be a discrete random variable with PMF pX (1) = 1/3 and pX (2) = 2/3. Find MX (t).
(b) Let Y ∼ Unif(0, 1) (continuous). Find MY (t).
Solution
(a)
\begin{align*}
M_X(t) = E[e^{tX}] &= \sum_x e^{tx}p_X(x) && \text{[LOTUS]} \\
&= \frac{1}{3}e^t + \frac{2}{3}e^{2t}
\end{align*}
(b)
\begin{align*}
M_Y(t) = E[e^{tY}] &= \int_0^1 e^{ty}f_Y(y)\, dy && \text{[LOTUS]} \\
&= \int_0^1 e^{ty} \cdot 1\, dy && [f_Y(y) = 1 \text{ for } 0 \le y \le 1] \\
&= \frac{e^t - 1}{t}
\end{align*}
1. Computing the MGF of a Linear Transformation: For constants a, b ∈ R:
$$M_{aX+b}(t) = E\left[e^{t(aX+b)}\right] = e^{tb}E\left[e^{(at)X}\right] = e^{tb}M_X(at)$$
2. Computing MGFs of Sums: We can also compute the MGF of the sum of independent RVs X and Y given their individual MGFs (the third step is due to independence):
$$M_{X+Y}(t) = E\left[e^{t(X+Y)}\right] = E\left[e^{tX}e^{tY}\right] = E\left[e^{tX}\right]E\left[e^{tY}\right] = M_X(t)M_Y(t)$$
3. Generating Moments with MGFs: The reason why MGFs are named the way they are is because they generate moments of X. That means they can be used to compute E[X], E[X²], E[X³], and so on. How? Let's take the derivative of an MGF (with respect to t):
$$M_X'(t) = \frac{d}{dt}E\left[e^{tX}\right] = \frac{d}{dt}\sum_{x \in \Omega_X} e^{tx}p_X(x) = \sum_{x \in \Omega_X} \frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} xe^{tx}p_X(x)$$
Note in the last step that x is a constant with respect to t and so $\frac{d}{dt}e^{tx} = xe^{tx}$.
Note that if we evaluate the derivative at t = 0, we get E[X] since e⁰ = 1:
$$M_X'(0) = \sum_{x \in \Omega_X} xe^{0x}p_X(x) = \sum_{x \in \Omega_X} xp_X(x) = E[X]$$
Let's differentiate once more:
$$M_X''(t) = \frac{d}{dt}M_X'(t) = \frac{d}{dt}\sum_{x \in \Omega_X} xe^{tx}p_X(x) = \sum_{x \in \Omega_X} x\frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} x^2e^{tx}p_X(x)$$
If we evaluate the second derivative at t = 0, we get E[X²]:
$$M_X''(0) = \sum_{x \in \Omega_X} x^2e^{0x}p_X(x) = \sum_{x \in \Omega_X} x^2p_X(x) = E[X^2]$$
Seems like there’s a pattern - if we take the n-th derivative of MX (t), then we will generate the n-th moment
E [X n ]!
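If you'd like to see the pattern in action without doing the calculus by hand, here is a symbolic sketch (not from the original text, assuming SymPy) that generates the first few moments of Y ∼ Unif(0, 1) from its MGF (e^t − 1)/t. We evaluate with a limit because that formula has a removable singularity at t = 0:

    # Sketch: generate E[Y^n] for Y ~ Unif(0, 1) by differentiating its MGF.
    import sympy as sp

    t = sp.symbols('t')
    M = (sp.exp(t) - 1) / t                          # MGF of Unif(0, 1), from above
    for n in (1, 2, 3):
        print(n, sp.limit(sp.diff(M, t, n), t, 0))   # 1/2, 1/3, 1/4 = E[Y^n]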
For a function f : R → R, we will denote f⁽ⁿ⁾(x) to be the n-th derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. $M_X'(0) = E[X]$, $M_X''(0) = E[X^2]$, and in general $M_X^{(n)}(0) = E[X^n]$. This is why we call MX a moment generating function, as we can use it to generate the moments of X.
2. $M_{aX+b}(t) = e^{tb}M_X(at)$.
3. If X ⊥ Y , then $M_{X+Y}(t) = M_X(t)M_Y(t)$.
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) fX (z) = fY (z) for all z ∈ R.
(c) FX (z) = FY (z) for all z ∈ R.
(d) There is an ε > 0 such that MX (t) = MY (t) for all t ∈ (−ε, ε) (they match on a small interval around t = 0).
That is, MX uniquely identifies a distribution, just like PDFs or CDFs do.
We proved the first three properties before stating all the theorems, so all that’s left is property 4. This is
a very complex proof (out of the scope of this course), but we can prove it for a special case.
Proof of Property 4 for a Special Case. We’ll prove that, if X, Y are discrete rvs with range Ω = {0, 1, 2, ..., m}
and whose MGFs are equal everywhere, that pX (k) = pY (k) for all k ∈ Ω. That is, if two distributions have
the same MGF, they have the same distribution (PMF).
Let ak = pX (k) − pY (k) for k = 0, . . . , m and write $e^{tk}$ as $(e^t)^k$. Then, since MX (t) = MY (t) for all t in an interval, we get
$$0 = M_X(t) - M_Y(t) = \sum_{k=0}^m a_k(e^t)^k$$
Note that this is an m-th degree polynomial in $e^t$, and remember that this equation holds for (uncountably) infinitely many t. An m-th degree polynomial can only have m roots, unless all the coefficients are 0. Hence ak = 0 for all k, and so pX (k) = pY (k) for all k.
Now we’ll see how to use MGFs to prove some results we’ve been using.
Example(s)
Let X ∼ Poi(λ), so that its PMF is
$$p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$
Compute MX (t).
Solution
\begin{align*}
M_X(t) = E[e^{tX}] &= \sum_{k=0}^{\infty} e^{tk}p_X(k) = \sum_{k=0}^{\infty} e^{tk}\cdot e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} \\
&= e^{-\lambda}e^{\lambda e^t} && \text{[Taylor series } e^x = \textstyle\sum_k x^k/k! \text{ with } x = \lambda e^t] \\
&= e^{\lambda(e^t - 1)}
\end{align*}
Example(s)
If X ∼ Poi(λ), compute E[X] using its MGF we computed earlier, $M_X(t) = e^{\lambda(e^t - 1)}$.
Solution By the chain rule, $M_X'(t) = \lambda e^t e^{\lambda(e^t - 1)}$, so $E[X] = M_X'(0) = \lambda e^0 e^{\lambda(e^0 - 1)} = \lambda$, matching the expectation of a Poisson RV that we derived long ago!
Example(s)
If Y ∼ Poi(γ) and Z ∼ Poi(µ) and Y ⊥ Z, show that Y + Z ∼ Poi(γ + µ) using the uniqueness
property of MGFs. (Recall we did this exact problem using convolution in 5.5).
Solution First note that a Poi(γ + µ) RV has MGF $e^{(\gamma + \mu)(e^t - 1)}$ (just plugging in γ + µ as the parameter). Since Y and Z are independent, by property 3,
$$M_{Y+Z}(t) = M_Y(t)M_Z(t) = e^{\gamma(e^t - 1)}e^{\mu(e^t - 1)} = e^{(\gamma + \mu)(e^t - 1)}$$
The MGF of Y + Z which we computed is the same as that of Poi(γ + µ). So, by the uniqueness of MGFs
(which implies that an MGF can uniquely describe a distribution), Y + Z ∼ Poi(γ + µ).
Which way was easier for you - this approach or using convolution? MGF’s have limitations though whereas
convolution doesn’t (besides independence) - we need to compute the MGF of Y, Z but we also need to know
the MGF of what distribution we are trying to “get”.
Example(s)
Now, use MGFs to prove the closure properties of Gaussian RVs (which we've been using without proof).
• If V ∼ N (µ, σ²) and W ∼ N (ν, γ²) are independent, show that V + W ∼ N (µ + ν, σ² + γ²).
• If a, b ∈ R are constants and X ∼ N (µ, σ²), show that aX + b ∼ N (aµ + b, a²σ²).
You may use the fact that if Y ∼ N (µ, σ²), then
$$M_Y(t) = \int_{-\infty}^{\infty} e^{ty}\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy = e^{\mu t + \frac{\sigma^2t^2}{2}}$$
Solution
• If V ∼ N (µ, σ²) and W ∼ N (ν, γ²) are independent, we have the following:
$$M_{V+W}(t) = M_V(t)M_W(t) = e^{\mu t + \frac{\sigma^2t^2}{2}}e^{\nu t + \frac{\gamma^2t^2}{2}} = e^{(\mu + \nu)t + \frac{(\sigma^2 + \gamma^2)t^2}{2}}$$
This is the MGF of a Normal distribution with mean µ + ν and variance σ² + γ². So, by uniqueness of MGFs, V + W ∼ N (µ + ν, σ² + γ²).
• Let us examine the moment generating function of aX + b. (We'll use the notation exp(z) = e^z so that we can actually see what's in the exponent clearly):
$$M_{aX+b}(t) = e^{bt}M_X(at) = \exp(bt)\exp\left(\mu(at) + \frac{\sigma^2(at)^2}{2}\right) = \exp\left((a\mu + b)t + \frac{(a^2\sigma^2)t^2}{2}\right)$$
Since this is the moment generating function of a RV that is N (aµ + b, a²σ²), we have shown by the uniqueness of MGFs that aX + b ∼ N (aµ + b, a²σ²).
Chapter 5. Multiple Random Variables
5.7: Limit Theorems
Slides (Google Drive) Video (YouTube)
This is definitely one of the most important sections in the entire text! The Central Limit Theorem is used
everywhere in statistics (hypothesis testing), and it also has its applications in computing probabilities. We’ll
see three results here, each getting more powerful and surprising.
If X1 , . . . , Xn are iid random variables with mean µ and variance σ², then we define the sample mean to be $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. We'll see the following results:
• The expectation of the sample mean E[X̄n] is exactly the true mean µ, and the variance Var (X̄n) = σ²/n goes to 0 as you get more samples.
• (Law of Large Numbers) As n → ∞, the sample mean X̄n converges (in probability) to the true mean µ. That is, as you get more samples, you will be able to get an excellent estimate of µ.
• (Central Limit Theorem) In fact, X̄n follows a Normal distribution as n → ∞ (in practice, n as low as 30 is good enough for this to be true). When we talk about the distribution of X̄n, this means: if we take n samples and take the sample mean, another n samples and take the sample mean, and so on, how will these sample means look in a histogram? This is crazy - regardless of what the distribution of the Xi 's was (discrete or continuous), their average will be approximately Normal! We'll see pictures and describe this more soon!
Further:
$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}n\mu = \mu$$
$$\mathrm{Var}(\bar{X}_n) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}$$
Again, none of this is "mind-blowing" to prove: we just used linearity of expectation and properties of variance for independent RVs.
What is this saying? Basically, if you wanted to estimate the mean height of the U.S. population by sampling
n people uniformly at random:
• In expectation, your sample average will be “on point” at E X n = µ. This even includes the case
n = 1: if you just sample one person, on average, you will be correct. However, the variance is high.
• The variance of your estimate (the sample mean) for the true mean goes down (σ 2 /n) as your sample
size n gets larger. This makes sense right? If you have more samples, you have more confidence in
your estimate because you are more “sure” (less variance).
In fact, as n → ∞, the variance of the sample mean approaches 0. A distribution with mean µ and variance
0 is essentially the degenerate random variable that takes on µ with probability 1. We’ll actually see that
the Law of Large Numbers argues exactly that!
The SLLN implies the WLLN, but not vice versa. The difference is subtle and is basically swapping
the limit and probability operations.
The proof of the WLLN will be given in 6.1 when we prove Chebyshev's inequality, but the proof of the SLLN is out of the scope of this class and much harder.
(Central Limit Theorem) Let X1 , . . . , Xn be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ². We've seen that the sample mean X̄n has mean µ and variance σ²/n. Then as n → ∞, the following equivalent statements hold:
1. $\bar{X}_n \to N\!\left(\mu, \frac{\sigma^2}{n}\right)$.
2. $\frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} \to N(0, 1)$.
3. $\sum_{i=1}^n X_i \to N(n\mu, n\sigma^2)$. This is not "technically" correct, but is useful for applications.
4. $\frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}} \to N(0, 1)$.
The mean and variance are not a surprise (we computed these at the beginning of these notes for any sample mean); the importance of the CLT is that, regardless of the distribution of the Xi 's, the sample mean approaches a Normal distribution as n → ∞.
We will prove the central limit theorem in 5.11 using MGFs, but take a second to appreciate this crazy result! The LLN says that as n → ∞, the sample mean of iid variables X̄n converges to µ. The CLT says that, as n → ∞, the sample mean actually converges to a Normal distribution! For any original distribution of the Xi 's (discrete or continuous), the average/sum will become approximately normally distributed.
If you’re still having trouble with figuring out what “the distribution of the sample mean” means, that’s
completely normal (double pun!). Let’s consider n = 2, so we just take the average of X1 + X2 , which is
X1 +X2
2 . The distribution of X1 + X2 means: if we repeatedly sample X1 , X2 and add them, what might
the density look like? For example, if X1 , X2 ∼ Unif(0, 1) (continuous), we showed the density of X1 + X2
looked like a triangle. We figured out how to compute the PMF/PDF of the sum using convolution in 5.5,
and the average is just dividing this by 2: X1 +X
2
2
, which you can find the PMF/PDF by transforming RVs
in 4.4. On the next page, you’ll see exactly the CLT applied to these Uniform distributions. With n = 1, it
looks (and is) Uniform. When n = 2, you get the triangular shape. And as n gets larger, it starts looking
more and more like a Normal!
You’ll see some examples below of how we start with some arbitrary distributions and how the density
function of their mean becomes shaped like a Gaussian (you know how to compute the pdf of the mean now
using convolution in 5.5 and transforming RV’s in 4.4)!
On the next two pages, we’ll see some visual “proof” of this surprising result!
• The first (n = 1) of the four graphs below shows a discrete $\frac{1}{29}\cdot\mathrm{Unif}(0, 29)$ PMF in the dots (and a blue line with the curve of the normal distribution with the same mean and variance). That is, $P(X = k) = \frac{1}{30}$ for each value k in the range $\left\{0, \frac{1}{29}, \frac{2}{29}, \ldots, \frac{28}{29}, 1\right\}$.
• The second graph (n = 2) has the average of two of these distributions, again with a blue line with the curve of the normal distribution with the same mean and variance. Remember we expected this triangular distribution when summing either discrete or continuous Uniforms. (E.g., when summing two fair 6-sided die rolls, you're most likely to get a 7, and the probability goes down linearly as you approach 2 or 12. See the example in 5.5 if you forgot how we got this!)
• The third (n = 3) and fourth (n = 4) graphs have the average of 3 and 4 identically distributed random variables respectively, each with the distribution shown in the first graph. We can see that as we average more, the average approaches a normal distribution.
Again, if you don’t believe me, you can compute the PMF yourself using convolution: first add two Unif(0, 1),
then convolve it with a third, and a fourth!
Despite this being a discrete random variable, when we take an average of many, there become increasingly
many values we can get between 0 and 1. The average of these iid discrete rv’s approaches a continuous
Normal random variable even after just averaging 4 of them!
Image Credit: Larry Ruzzo (a previous University of Washington CSE 312 instructor).
You might still be skeptical, because the Uniform distribution is “nice” and already looked pretty “Normal”
even with n = 2 samples. We now illustrate the same idea with a strange distribution shown in the first
(n = 1) of the four graphs below, illustrated with the dots (instead of a “nice” uniform distribution). Even
this crazy distribution nearly looks Normal after just averaging 4 of them. This is the power of the CLT!
What we are getting at here is that, regardless of the distribution, as we have more independent and
identically distributed random variables, the average follows a Normal distribution (with the same mean and
variance as the sample mean).
Now let’s see how we can apply the CLT to problems! There were four different equivalent forms (just
scaling/shifting) stated, but I find it easier to just look at the problem and decide what’s best. Seeing
examples is the best way to understand!
Example(s)
Let’s consider the example of flipping a fair coin 40 times independently. What’s the probability of
getting between 15 to 25 heads? First compute this exactly and then give an approximation using
the CLT.
Solution Define X to be the number of heads in the 40 flips. Then we have X ∼ Bin(n = 40, p = 1/2), so we just sum the Binomial PMF:
$$P(15 \le X \le 25) = \sum_{k=15}^{25}\binom{40}{k}\left(\frac{1}{2}\right)^k\left(1 - \frac{1}{2}\right)^{40-k} \approx 0.9193$$
k=15
Now, let’s use the CLT. Since X can be thought of as the sum of 40 iid Ber( 21 ) RVs, we can apply the
CLT. We have E [X] = np = 40( 12 ) = 20 and Var (X) = np(1 − p) = 40( 21 )(1 − 12 ) = 10. So we can use the
approximation X ≈ N (µ = 20, σ 2 = 10).
Example(s)
Use the continuity correction to get a better estimate than we did earlier for the coin problem.
Solution We’ll apply the exact same steps, except changing the bounds from 15 and 25 to 14.5 and 25.5.
Notice that this is much closer to the exact answer from the first part of the prior example (0.9193) than
approximating with the central limit theorem without the continuity correction!
Note: If you are applying the CLT to sums/averages of continuous RVs instead, you should not
apply the continuity correction.
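Here is a numerical sketch (not from the original text, assuming SciPy) comparing the exact answer with both approximations:

    # Sketch: exact Binomial vs. plain and continuity-corrected CLT approximations.
    from scipy import stats

    exact = stats.binom.cdf(25, 40, 0.5) - stats.binom.cdf(14, 40, 0.5)
    mu, sigma = 20, 10 ** 0.5
    plain = stats.norm.cdf(25, mu, sigma) - stats.norm.cdf(15, mu, sigma)
    corrected = stats.norm.cdf(25.5, mu, sigma) - stats.norm.cdf(14.5, mu, sigma)
    print(exact, plain, corrected)   # ≈ 0.9193, ≈ 0.886, ≈ 0.918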
See the additional exercises below to get more practice with the CLT!
5.7.5 Exercises
1. Each day, the number of customers who come to the CSE 312 probability gift shop is approximately
Poi(11). Approximate the probability that, after the quarter ends (9 × 7 = 63 days), that we had over
700 customers.
Solution: The total number of customers that come is X = X1 + · · · + X63 , where each Xi ∼ Poi(11)
has E [Xi ] = Var (Xi ) = λ = 11 from the chart. By the CLT, X ≈ N (µ = 63 · 11, σ 2 = 63 · 11) (sum of
the means and sum of the variances). Hence, using the continuity correction (the Xi 's are discrete),
$$P(X > 700) = P(X \ge 701) \approx 1 - \Phi\left(\frac{700.5 - 693}{\sqrt{693}}\right) = 1 - \Phi(0.28) \approx 0.39$$
Note that you could compute this exactly as well since you know the sum of iid Poissons is Poisson.
In fact, X ∼ Poi(693) (the average rate in 63 days is 693 per 63 days), and you could do a sum which
would be very annoying.
2. Suppose I have a flashlight which requires one battery to operate, and I have 18 identical batteries. I
want to go camping for a week (24 × 7 = 168) hours. If the lifetime of a single battery is Exp(0.1),
what’s the probability my flashlight can operate for the entirety of my trip?
Solution: The total lifetime of the batteries is X = X1 + · · · + X18, where each Xi ∼ Exp(0.1) has E[Xi] = 1/0.1 = 10 and Var (Xi ) = 1/0.1² = 100. Hence, E[X] = 180 and Var (X) = 1800 by linearity of expectation and since variance adds for independent RVs. In fact, X ∼ Gamma(r = 18, λ = 0.1), but we don't have a closed form for its CDF. By the CLT, X ≈ N (µ = 180, σ² = 1800), so
$$P(X \ge 168) \approx 1 - \Phi\left(\frac{168 - 180}{\sqrt{1800}}\right) = 1 - \Phi(-0.28) = \Phi(0.28) \approx 0.61$$
Note that we don’t use the continuity correction here because the RV’s we are summing are already
continuous RVs.
Chapter 5. Multiple Random Variables
5.8: The Multinomial Distribution
Slides (Google Drive) Video (YouTube)
As you’ve seen, the Binomial distribution is extremely commonly used, and probably the most important
discrete distribution. The Normal distribution is certainly the most important continuous distribution. In
this section, we’ll see how to generalize the Binomial, and in the next, the Normal.
Why do we need to generalize the Binomial distribution? Sometimes, we don’t just have two outcomes
(success and failure), but we have r > 2 outcomes. In this case, we need to maintain counts of how many
times each of the r outcomes appeared. A single random variable is no longer sufficient; we need a vector of
counts!
Actually, the example problems at the end could have been solved in Chapter 1. We will just formalize
this situation so that we can use it later!
What about the variance? We cannot just say or compute a single scalar Var (X) because what does that
mean for a random vector? Actually, we need to define an n × n covariance matrix, which stores all pairwise
covariances. It is often denoted in one of three ways: Σ = Var (X) = Cov(X).
The covariance matrix of a random vector X ∈ Rⁿ with E[X] = µ is the matrix denoted Σ = Var (X) = Cov(X) whose entries are Σij = Cov (Xi , Xj ). The formula for this is:
$$\Sigma = \mathrm{Var}(\mathbf{X}) = \mathrm{Cov}(\mathbf{X}) = E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T] = E[\mathbf{X}\mathbf{X}^T] - \mu\mu^T$$
Since Cov (Xi , Xi ) = Var (Xi ), this matrix is
$$\Sigma = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \ldots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \ldots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \ldots & \mathrm{Var}(X_n) \end{pmatrix}$$
Notice that the covariance matrix is symmetric (Σij = Σji), and contains variances along the diagonal.
Note: If you know a bit of linear algebra, you might like to know that covariance matrices are always symmetric positive semi-definite.
Note: If you know a bit of linear algebra, you might like to know that covariance matrices
are always symmetric positive semi-definite.
We will not be doing any linear algebra in this class - think of it as just a place to store all the pairwise
covariances. Now let us look at an example of a covariance matrix.
Example(s)
If X1 , X2 , ..., Xn are iid with mean µ and variance σ 2 , then find the mean vector and covariance
matrix of the random vector X = (X1 , . . . , Xn ).
Solution The mean vector is
$$E[\mathbf{X}] = (E[X_1], \ldots, E[X_n]) = \mu\mathbf{1}_n$$
where 1n denotes the n-dimensional vector of all 1's. The covariance matrix is (since the diagonal is just the individual variances σ² and the off-diagonal entries (i ≠ j) are all Cov (Xi , Xj ) = 0 due to independence)
$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \ldots & 0 \\ 0 & \sigma^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma^2 \end{pmatrix} = \sigma^2 I_n$$
An important theorem is that properties of expectation and variance still hold for RVTRs. If X ∈ Rⁿ is a random vector, A ∈ R^{m×n} is a constant matrix, and b ∈ R^m is a constant vector, then:
$$E[A\mathbf{X} + \mathbf{b}] = A\,E[\mathbf{X}] + \mathbf{b}, \qquad \mathrm{Var}(A\mathbf{X} + \mathbf{b}) = A\,\mathrm{Var}(\mathbf{X})\,A^T$$
Since we aren't expecting any linear algebra background, we won't prove this.
Suppose we perform n = 7 independent trials, where each trial results in exactly one of r = 3 outcomes, with probabilities p1, p2, p3 respectively (p1 + p2 + p3 = 1), and let Yi count how many trials resulted in outcome i. Now, what is the probability of the outcome "two of outcome 1, one of outcome 2, and four of outcome 3" - that is, (Y1 = 2, Y2 = 1, Y3 = 4)? We get the following:
$$p_{Y_1,Y_2,Y_3}(2, 1, 4) = \frac{7!}{2!\,1!\,4!}\,p_1^2\,p_2^1\,p_3^4 = \binom{7}{2, 1, 4}p_1^2\,p_2^1\,p_3^4 \quad \text{[recall from counting]}$$
This describes the joint distribution of the random vector Y = (Y1 , Y2 , Y3 ), and its PMF should remind you of the binomial PMF. We just count the number of ways $\binom{7}{2,1,4}$ to get these counts (a multinomial coefficient), and make sure we get each outcome that many times: $p_1^2 p_2^1 p_3^4$.
More generally, if Y = (Y1 , . . . , Yr ) counts the occurrences of each of r outcomes over n independent trials, where each trial results in outcome i with probability pi (and $\sum_{i=1}^r p_i = 1$), we write
$$\mathbf{Y} \sim \mathrm{Mult}_r(n, \mathbf{p})$$
with joint PMF $p_{\mathbf{Y}}(k_1, \ldots, k_r) = \binom{n}{k_1, \ldots, k_r}\prod_{i=1}^r p_i^{k_i}$ whenever $k_1 + \cdots + k_r = n$.
Notice that each Yi is marginally Bin(n, pi ). Hence, E[Yi] = npi and Var (Yi ) = npi (1 − pi ). Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\mathbf{p} = \begin{pmatrix} np_1 \\ \vdots \\ np_r \end{pmatrix}, \quad \mathrm{Var}(Y_i) = np_i(1 - p_i), \quad \mathrm{Cov}(Y_i, Y_j) = -np_ip_j \;\; (\text{for } i \ne j)$$
Notice the covariance is negative, which makes sense because as the number of occurrences of outcome i increases, the number of occurrences of outcome j should decrease, since both outcomes cannot occur simultaneously on the same trial.
Proof of Multinomial Covariance. Recall that marginally, Xi and Xj are binomial random variables; let's decompose them into their Bernoulli trials. We'll use different dummy indices as we're dealing with covariances.
Let Xik for k = 1, . . . , n be indicator/Bernoulli RVs of whether the k-th trial resulted in outcome i, so that $X_i = \sum_{k=1}^n X_{ik}$.
Similarly, let Xjℓ for ℓ = 1, . . . , n be indicators of whether the ℓ-th trial resulted in outcome j, so that $X_j = \sum_{\ell=1}^n X_{j\ell}$.
Before we begin, we should argue that Cov (Xik , Xjℓ ) = 0 when k ≠ ℓ, since k and ℓ are different trials and are independent.
Furthermore, E[Xik Xjk ] = 0 since it's not possible that both outcome i and j occur at trial k.
\begin{align*}
\mathrm{Cov}(X_i, X_j) &= \mathrm{Cov}\left(\sum_{k=1}^n X_{ik}, \sum_{\ell=1}^n X_{j\ell}\right) && \text{[indicators]} \\
&= \sum_{k=1}^n\sum_{\ell=1}^n \mathrm{Cov}(X_{ik}, X_{j\ell}) && \text{[covariance works like FOIL]} \\
&= \sum_{k=1}^n \mathrm{Cov}(X_{ik}, X_{jk}) && \text{[independent trials, cross terms are 0]} \\
&= \sum_{k=1}^n \left(E[X_{ik}X_{jk}] - E[X_{ik}]E[X_{jk}]\right) && \text{[def of covariance]} \\
&= \sum_{k=1}^n -p_ip_j && \text{[first expectation is 0]} \\
&= -np_ip_j
\end{align*}
Note that in the third line we dropped one of the sums because the indicators across different trials k, ℓ are independent (zero covariance). Hence, we just need to sum when k = ℓ.
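An empirical sketch of this covariance (not from the original text, assuming NumPy):

    # Sketch: empirical Cov(Y_i, Y_j) of a multinomial vs. the formula -n p_i p_j.
    import numpy as np

    rng = np.random.default_rng(312)
    n, p = 7, [0.2, 0.3, 0.5]
    samples = rng.multinomial(n, p, size=200_000)   # each row is (Y1, Y2, Y3)
    emp = np.cov(samples, rowvar=False)             # empirical covariance matrix
    print(emp[0, 1], -n * p[0] * p[1])              # both ≈ -0.42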
Example(s)
Suppose a committee of 10 senators is chosen uniformly at random (without replacement) from a body of 100 senators: 45 from the Green party, 20 Democrats, and 35 Republicans. Let Y = (Y1 , Y2 , Y3 ) be the number of each party's members in the committee (G, D, R in that order). What is the probability we get 1 Green party member, 6 Democrats, and 3 Republicans? It turns out it is just the following:
$$p_{Y_1,Y_2,Y_3}(1, 6, 3) = \frac{\binom{45}{1}\binom{20}{6}\binom{35}{3}}{\binom{100}{10}}$$
This is very similar to the univariate Hypergeometric distribution! For the denominator, there are $\binom{100}{10}$ ways to choose 10 senators. For the numerator, we need 1 from the 45 Green party members, 6 from the 20 Democrats, and 3 from the 35 Republicans.
Suppose there are r different colors of balls in a bag, having K = (K1 , . . . , Kr ) balls of each color i, 1 ≤ i ≤ r. Let $N = \sum_{i=1}^r K_i$ be the total number of balls in the bag, and suppose we draw n without replacement. Let Y = (Y1 , . . . , Yr ) be the RVTR such that Yi is the number of balls of color i we drew. We write that:
$$\mathbf{Y} \sim \mathrm{MVHG}_r(N, \mathbf{K}, n)$$
Each Yi is marginally HypGeo(N, Ki , n), with $E[Y_i] = n\frac{K_i}{N}$ and $\mathrm{Var}(Y_i) = n\cdot\frac{K_i}{N}\cdot\frac{N - K_i}{N}\cdot\frac{N - n}{N - 1}$. Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\frac{\mathbf{K}}{N} = \begin{pmatrix} n\frac{K_1}{N} \\ \vdots \\ n\frac{K_r}{N} \end{pmatrix}, \quad \mathrm{Var}(Y_i) = n\cdot\frac{K_i}{N}\cdot\frac{N - K_i}{N}\cdot\frac{N - n}{N - 1}, \quad \mathrm{Cov}(Y_i, Y_j) = -n\frac{K_i}{N}\frac{K_j}{N}\frac{N - n}{N - 1}$$
Proof of Hypergeometric Variance. We’ll prove the variance of a univariate Hypergeometric finally (the vari-
ance of Yi ), but leave the covariance matrix to you (can approach it similarly to the multinomial covariance
matrix).
First, letting X_i indicate whether the i-th draw is a success, we have that since X_i ∼ Ber\left(\frac{K}{N}\right):

Var(X_i) = p(1 − p) = \frac{K}{N}\left(1 − \frac{K}{N}\right)
K K −1
Second, for i 6= j, E [Xi Xj ] = P (Xi Xj = 1) = P (Xi = 1) P (Xj = 1 | Xi = 1) = · , so
N N −1
K K − 1 K2
Cov (Xi , Xj ) = E [Xi Xj ] − E [Xi ] E [Xj ] = · −
N N − 1 N2
Finally,

Var(X) = Var\left(\sum_{i=1}^n X_i\right)  [def of X]
= Cov\left(\sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right)  [covariance with self is variance]
= \sum_{i=1}^n \sum_{j=1}^n Cov(X_i, X_j)  [bilinearity of covariance]
= \sum_{i=1}^n Var(X_i) + 2\sum_{i<j} Cov(X_i, X_j)  [split diagonal]
= n\frac{K}{N}\left(1 − \frac{K}{N}\right) + 2\binom{n}{2}\left(\frac{K}{N} \cdot \frac{K − 1}{N − 1} − \frac{K^2}{N^2}\right)  [plug in]
= n \cdot \frac{K}{N} \cdot \frac{N − K}{N} \cdot \frac{N − n}{N − 1}  [algebra]
5.8.4 Exercises
These won't be very interesting, since this could've been done in chapters 1 and 2!
1. Suppose you are fishing in a pond with 3 red fish, 4 green fish, and 5 blue fish.
(a) You use a net to scoop up 6 of them. What is the probability you scooped up 2 of each?
(b) You "catch and release" until you have caught 6 fish (catch 1, throw it back, catch another, throw it back, etc.). What is the probability you caught 2 of each?
Solution:
(a) Let X = (X1 , X2 , X3 ) be how many red, green, and blue fish I caught respectively. Then,
X ∼ MVHG3 (N = 12, K = (3, 4, 5), n = 6), and
P(X_1 = 2, X_2 = 2, X_3 = 2) = \frac{\binom{3}{2}\binom{4}{2}\binom{5}{2}}{\binom{12}{6}}
(b) Let X = (X1 , X2 , X3 ) be how many red, green, and blue fish I caught respectively. Then,
X ∼ Mult3 (n = 6, p = (3/12, 4/12, 5/12)), and
P(X_1 = 2, X_2 = 2, X_3 = 2) = \binom{6}{2, 2, 2}\left(\frac{3}{12}\right)^2\left(\frac{4}{12}\right)^2\left(\frac{5}{12}\right)^2
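As a sanity check, here is a short Python sketch (standard library only; the printed values are approximate) that computes both answers directly from the formulas above:

    from math import comb, factorial

    # (a) Multivariate hypergeometric: scoop 6 of the 12 fish without replacement.
    K = [3, 4, 5]   # red, green, blue fish in the pond
    y = [2, 2, 2]   # desired counts of each color
    N, n = sum(K), sum(y)
    p_mvhg = 1.0
    for Ki, yi in zip(K, y):
        p_mvhg *= comb(Ki, yi)
    p_mvhg /= comb(N, n)
    print(p_mvhg)   # 180/924 ≈ 0.1948

    # (b) Multinomial: catch-and-release is drawing with replacement.
    coef = factorial(n)
    for yi in y:
        coef //= factorial(yi)   # multinomial coefficient 6!/(2!2!2!) = 90
    p_mult = coef
    for Ki, yi in zip(K, y):
        p_mult *= (Ki / N) ** yi
    print(p_mult)   # ≈ 0.1085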
Chapter 5. Multiple Random Variables
5.9: The Multivariate Normal Distribution
Slides (Google Drive) Video (YouTube)
In this section, we will generalize the Normal random variable, the most important continuous distribution!
We were able to find the joint PMF for the Multinomial random vector using a counting argument, but how
can we find the Multivariate Normal density function? We’ll start with the simplest case, and work from
there.
Then, we say that (X, Y ) has a bivariate Normal distribution, which we will denote:
(X, Y ) ∼ N2 (µ, Σ)
This is nice and all, if we have two independent Normals. But what if they aren't independent? Let Z_1, Z_2 ∼ N(0, 1) be iid standard Normals, and let ρ ∈ [−1, 1] be the desired correlation.
1. We construct X from Z_1 alone:

X = σ_X Z_1 + µ_X

2. We construct Y from both Z_1 and Z_2, as shown below:

Y = σ_Y\left(ρZ_1 + \sqrt{1 − ρ^2}\, Z_2\right) + µ_Y
From this transformation, we get that marginally (show this by computing the mean and variance of X, Y and using closure properties of Normal RVs),

X ∼ N(µ_X, σ_X^2) \qquad Y ∼ N(µ_Y, σ_Y^2)

Additionally, Cov(X, Y) = ρσ_Xσ_Y, so ρ is exactly the correlation between X and Y.
By using the multivariate change-of-variables formula from 4.4, we can turn the "simple" product of standard normal PDFs into the PDF of the bivariate Normal:

f_{X,Y}(x, y) = \frac{1}{2πσ_Xσ_Y\sqrt{1 − ρ^2}} \exp\left(−\frac{z}{2(1 − ρ^2)}\right), \quad x, y ∈ R

where

z = \frac{(x − µ_X)^2}{σ_X^2} − \frac{2ρ(x − µ_X)(y − µ_Y)}{σ_Xσ_Y} + \frac{(y − µ_Y)^2}{σ_Y^2}
Finally, we write:
(X, Y ) ∼ N2 (µ, Σ)
The visualization below shows the density of a bivariate Normal distribution. On the xy-plane, we have the two Normal variables, and on the z-axis, we have the density. Marginally, both variables are Normals!
Now let's take a look at the effect of different covariance matrices Σ on the distribution for a bivariate normal, all with mean vector (0, 0). Each row below modifies one entry in the covariance matrix; explore the pictures to see how the parameters change the shape!
A random vector X = (X1 , ..., Xn ) has a multivariate Normal distribution with mean vector µ ∈ Rn
and (symmetric and positive-definite) covariance matrix Σ ∈ Rn×n , written X ∼ Nn (µ, Σ), if it has
the following joint PDF:
f_X(x) = \frac{1}{(2π)^{n/2}|Σ|^{1/2}} \exp\left(−\frac{1}{2}(x − µ)^T Σ^{−1} (x − µ)\right), \quad x ∈ R^n
While this PDF may look intimidating, if we recall the PDF of a univariate Normal W ∼ N(µ, σ^2):

f_W(w) = \frac{1}{\sqrt{2πσ^2}} \exp\left(−\frac{1}{2σ^2}(w − µ)^2\right)
We can note that the two formulae are quite similar; we simply extend scalars to vectors and matrices!
One remarkable property of jointly (multivariate) Normal random variables is that uncorrelated components are actually independent: Cov(X_i, X_j) = 0 → X_i ⊥ X_j. (This implication is generally false for other distributions.)
Unfortunately, we cannot do example problems as they would require a deeper knowledge of linear algebra,
which we do not assume.
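That said, simulating the bivariate construction from earlier in this section requires no extra theory. Below is a minimal numpy sketch (assuming numpy is available; all parameter values are made up for illustration) that checks the covariance empirically:

    import numpy as np

    rng = np.random.default_rng(0)
    mu_x, mu_y = 1.0, -2.0                 # hypothetical means
    sigma_x, sigma_y, rho = 2.0, 0.5, 0.7  # hypothetical std devs and correlation

    # Build (X, Y) from iid standard Normals Z1, Z2 via the transformation above.
    z1 = rng.standard_normal(1_000_000)
    z2 = rng.standard_normal(1_000_000)
    x = sigma_x * z1 + mu_x
    y = sigma_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y

    # Sample covariance matrix should be close to
    # [[sigma_x^2, rho*sigma_x*sigma_y], [rho*sigma_x*sigma_y, sigma_y^2]].
    print(np.cov(x, y))

The off-diagonal entries of the printed matrix should be close to ρσ_Xσ_Y = 0.7, matching the claim that ρ controls the correlation.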
Chapter 5. Multiple Random Variables
5.10: Order Statistics
Slides (Google Drive) Video (YouTube)
We’ve talked a lot about the distribution of the sum of random variables, but what about the maximum,
minimum, or median? For example, if there are 4 possible buses you could take, and the time until each
arrives is independent with an exponential distribution, what is the expected time until the first one arrives?
Mathematically, this would be E [min{X1 , X2 , X3 , X4 }] if the arrival times were X1 , X2 , X3 , X4 .
In this section, we’ll figure out how to find out the density function (and hence expectation/variance) of the
minimum, maximum, median, and more!
Y(1) is the smallest value (the minimum), and Y(n) is the largest value (the maximum), and since
they are so commonly used, they have special names Ymin and Ymax respectively.
Notice that we can’t have equality because with continuous random variables, the probability that
any two are equal is 0. So, we don’t have to worry about any of these random variables being “less
than or equal to” another.
Notice that each Y(i) is a random variable as well! We call Y(i) the i-th order statistic, i.e. the
i-th smallest in a sample of size n. For example, if we had n = 9 samples, Y(5) would be the median
value. We are interested in finding the distribution of each order statistic, and properties such as
expectation and variance as well.
Why are order statistics important? Usually, we take the min, max or median of a set of random variables
and do computations with them - so, it would be useful if we had a general formula for the PDF and CDF
of the min or max.
We start with an example to find the distribution of Y_(n) = Y_max, the largest order statistic. We'll then extend this to any of the order statistics (not just the max). Again, "finding the distribution" means: if we were to repeatedly sample n iid RVs and take their maximum, what would those maxima look like?
Example(s)
Let Y1 , Y2 , . . . , Yn be iid continuous random variables with the same CDF FY and PDF fY . What is
the distribution of Y(n) = Ymax = max{Y1 , Y2 , . . . , Yn } the largest order statistic?
Solution
We'll employ our typical strategy and work with probabilities instead of densities, so we'll start with the CDF:

F_{Y_{max}}(y) = P(Y_{max} ≤ y) = P(Y_1 ≤ y, \ldots, Y_n ≤ y) = \prod_{i=1}^n P(Y_i ≤ y) = [F_Y(y)]^n  [independence]

Differentiating with respect to y gives the density:

f_{Y_{max}}(y) = \frac{d}{dy}[F_Y(y)]^n = n[F_Y(y)]^{n−1} f_Y(y)  [chain rule]
Let’s take a step back and see what we just did here. We just computed the density function of the maximum
of n iid random variables, denoted Ymax = Y(n) . We now need to find the density of any arbitrary ranked
Y(i) .
Now, using the same intuition as before, we’ll use an informal argument to find the density of a general Y(i) ,
fY(i) (y). For example, this might help find the distribution of the minimum fY(1) or the median.
Proof of Density of Order Statistics. The formula above may remind you of a multinomial distribution, and you would be correct! Let's consider what it means for Y_{(i)} = y (for the i-th smallest value in a sample of n to equal a particular value y):
• One of the values needs to be exactly y.
• i − 1 of the values need to be smaller than y (this happens for each with probability F_Y(y)).
• The other n − i values need to be greater than y (this happens for each with probability 1 − F_Y(y)).

Now, we have 3 distinct types of objects: 1 that is exactly y, i − 1 which are less than y, and n − i which are greater. Using multinomial coefficients and the above, we see that

f_{Y_{(i)}}(y) = \binom{n}{i − 1, 1, n − i} \cdot [F_Y(y)]^{i−1} \cdot [1 − F_Y(y)]^{n−i} \cdot f_Y(y)
Note that this isn’t a probability; it is a density, so there is something flawed with how we approached this
problem. For a more rigorous approach, we just have to make a slight modification, but use the same idea.
Re-Proof (Rigorous) This time, we'll find P(y − ε/2 ≤ Y_{(i)} ≤ y + ε/2) and use the fact that this is approximately equal to ε f_{Y_{(i)}}(y) for small ε > 0 (Riemann integral (rectangle) approximation from 4.1).

We have very similar cases:
• One of the values needs to be between y − ε/2 and y + ε/2 (this happens with probability approximately ε f_Y(y), again by Riemann approximation).
• i − 1 of the values need to be smaller than y − ε/2 (this happens for each with probability F_Y(y − ε/2) ≈ F_Y(y) for small ε).
• The other n − i values need to be greater than y + ε/2 (this happens for each with probability 1 − F_Y(y + ε/2) ≈ 1 − F_Y(y)).

Now these are actually probabilities (not densities), so we get

P\left(y − \frac{ε}{2} ≤ Y_{(i)} ≤ y + \frac{ε}{2}\right) ≈ ε f_{Y_{(i)}}(y) = \binom{n}{i − 1, 1, n − i} \cdot [F_Y(y)]^{i−1} \cdot [1 − F_Y(y)]^{n−i} \cdot (ε f_Y(y))
Let’s verify this formula with our maximum that we derived earlier by plugging in n for i:
f_{Y_{max}}(y) = f_{Y_{(n)}}(y) = \binom{n}{n − 1, 1, 0} \cdot [F_Y(y)]^{n−1} \cdot [1 − F_Y(y)]^0 \cdot f_Y(y) = n F_Y^{n−1}(y) f_Y(y)
Example(s)
If Y_1, ..., Y_n are iid Unif(0, 1), where do we "expect" the points to end up? That is, find E[Y_{(i)}] for any i. You may find this picture with different values of n useful for intuition.
Solution
Intuitively, from the picture, if n = 1, we expect the single point to end up at 1/2. If n = 2, we expect the two points to end up at 1/3 and 2/3. If n = 4, we expect the four points to end up at 1/5, 2/5, 3/5 and 4/5.
Let’s prove this formally. Recall, if Y ∼ Unif(0, 1) (continuous), then fY (y) = 1 for y ∈ [0, 1] and FY (y) = y
for y ∈ [0, 1]. By the order statistics formula,
f_{Y_{(i)}}(y) = \binom{n}{i − 1, 1, n − i} \cdot [F_Y(y)]^{i−1} \cdot [1 − F_Y(y)]^{n−i} \cdot f_Y(y)
= \binom{n}{i − 1, 1, n − i} \cdot y^{i−1} \cdot (1 − y)^{n−i} \cdot 1

Integrating y \cdot f_{Y_{(i)}}(y) over [0, 1] (some algebra) gives E[Y_{(i)}] = \frac{i}{n + 1}, exactly matching the intuition above!
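Here is a quick numpy simulation sketch (seed and sizes are arbitrary) to sanity-check that claim empirically:

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 4, 100_000

    # Each row is a sample of n iid Unif(0,1); sorting a row gives its order statistics.
    samples = np.sort(rng.uniform(size=(trials, n)), axis=1)
    print(samples.mean(axis=0))  # ≈ [1/5, 2/5, 3/5, 4/5], i.e., i/(n+1)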
Here is a picture which may help you figure out what the formulae you just computed mean!
Example(s)
At 5pm each day, four buses make their way to the HUB bus stop. Each bus would be acceptable
to take you home. The time in hours (after 5pm) that each arrives at the stop is independent with
Y1 , Y2 , Y3 , Y4 ∼ Exp(λ = 6) (on average, it takes 1/6 of an hour (10 minutes) for each bus to arrive).
1. On Mondays, you want to get home ASAP, so you arrive at the bus stop at 5pm sharp. What
is the expected time until the first one arrives?
2. On Tuesdays, you have a lab meeting that runs until 5:15 and are worried you may not catch
any bus. What is the probability you miss all the buses?
Solution The first question asks about the smallest order statistic Y(1) = Ymin since we care about the first
bus. The second question asks about the largest order statistic Y(4) since we care about the last bus. Let’s
compute the general formula for order statistics first so we can apply it to both parts of the problem.
Recall, if Y ∼ Exp(λ = 6) (continuous), then fY (y) = 6e−6y for y ∈ [0, ∞) and FY (y) = 1 − e−6y for
y ∈ [0, ∞). By the order statistics formula,
f_{Y_{(i)}}(y) = \binom{n}{i − 1, 1, n − i} \cdot [F_Y(y)]^{i−1} \cdot [1 − F_Y(y)]^{n−i} \cdot f_Y(y)
= \binom{n}{i − 1, 1, n − i} \cdot [1 − e^{−6y}]^{i−1} \cdot [e^{−6y}]^{n−i} \cdot 6e^{−6y}
1. For the first part, we want E[Y_{(1)}], so we plug in i = 1 (and n = 4) to the above formula to get:

f_{Y_{(1)}}(y) = \binom{4}{1 − 1, 1, 4 − 1} \cdot [1 − e^{−6y}]^{1−1} \cdot [e^{−6y}]^{4−1} \cdot 6e^{−6y} = 4[e^{−18y}] \cdot 6e^{−6y} = 24e^{−24y}
Now we can use the PDF to find the expectation normally. However, notice that the PDF is that of an Exp(λ = 24) distribution, so it has expectation 1/24. That is, the expected time until the first bus arrives is 1/24 of an hour, or 2.5 minutes.
Let's talk about something amazing here. We found that min{Y_1, Y_2, Y_3, Y_4} ∼ Exp(λ = 4 · 6); the minimum of exponentials is distributed as an exponential with the sum of the rates! Why might this be true? If we have Y_1, Y_2, Y_3, Y_4 ∼ Exp(6), that means on average, 6 buses of each type arrive each hour, for a total of 24. That just means we can model our waiting time in this regime with an average of 24 buses per hour, to get that the time until the first bus has an Exp(6 + 6 + 6 + 6) distribution!
2. For finding the maximum, we just plug in i = n = 4 (and n = 4), to get
f_{Y_{(4)}}(y) = \binom{4}{4 − 1, 1, 4 − 4} \cdot [1 − e^{−6y}]^{4−1} \cdot [e^{−6y}]^{4−4} \cdot 6e^{−6y} = 4[1 − e^{−6y}]^3 \cdot 6e^{−6y} = 24[1 − e^{−6y}]^3 e^{−6y}

(note that the multinomial coefficient \binom{4}{3, 1, 0} = 4)
Unfortunately, this is as simplified as it gets, and we don’t get the nice result that the maximum of
exponentials is exponential. To find the desired quantity, we just need to compute the probability the
last bus comes before 5:15 (which is 0.25 hours - be careful of units!):
P(Y_{max} ≤ 0.25) = \int_0^{0.25} f_{Y_{max}}(y)\,dy = \int_0^{0.25} 24[1 − e^{−6y}]^3 e^{−6y}\,dy

Equivalently, P(Y_{max} ≤ 0.25) = [F_Y(0.25)]^4 = (1 − e^{−1.5})^4 ≈ 0.364.
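If you'd like to verify both answers by simulation, here is a small numpy sketch (arbitrary seed; times are in hours):

    import numpy as np

    rng = np.random.default_rng(2)

    # Four independent bus arrival times, each Exp(lambda=6); numpy's scale = 1/lambda.
    arrivals = rng.exponential(scale=1/6, size=(1_000_000, 4))

    print(arrivals.min(axis=1).mean())            # ≈ 1/24 ≈ 0.0417 hours until the first bus
    print((arrivals.max(axis=1) <= 0.25).mean())  # ≈ P(all buses arrive by 5:15) ≈ 0.364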
Chapter 5. Multiple Random Variables
5.11: Proof of the CLT
Slides (Google Drive) Video (YouTube)
In this optional section, we’ll prove the Central Limit Theorem, one of the most fundamental and amazing
results in all of statistics, using MGFs!
For a function f : R → R, we will denote f (n) (x) to be the nth derivative of f (x). Let X, Y be
independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. M_X'(0) = E[X], M_X''(0) = E[X^2], and in general M_X^{(n)}(0) = E[X^n]. This is why we call M_X a
moment generating function, as we can use it to generate the moments of X.
2. MaX+b (t) = etb MX (at).
3. If X ⊥ Y , then MX+Y (t) = MX (t)MY (t).
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) fX (z) = fY (z) for all z ∈ R.
(c) FX (z) = FY (z) for all z ∈ R.
(d) There is an ε > 0 such that MX (t) = MY (t) for all t ∈ (−ε, ε) (they match on a small
interval around t = 0).
That is MX uniquely identifies a distribution, just like PDFs or CDFs do.
Let X_1, \ldots, X_n be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ^2. Then, the standardized sample mean approaches the standard Normal distribution:

As n → ∞, \quad Z_n = \frac{\bar{X}_n − µ}{σ/\sqrt{n}} → N(0, 1)
Proof of The Central Limit Theorem. Our strategy will be to compute the MGF of Z_n and exploit properties of the MGF (especially uniqueness) to show that it must have a standard Normal distribution!
1. First, assume without loss of generality that µ = 0 (otherwise, replace each X_i with X_i − µ); then E[X] = 0 and E[X^2] = Var(X) = σ^2. By Property 1 above, a second-order Taylor approximation of any MGF M_Y about 0 gives, for small s:

M_Y(s) ≈ M_Y(0) + M_Y'(0)s + M_Y''(0)\frac{s^2}{2} = 1 + E[Y]s + E[Y^2]\frac{s^2}{2}

2. Now, let M_X denote the common MGF of all the X_i's (since they are iid).
M_{Z_n}(t) = M_{\frac{1}{σ\sqrt{n}}\sum_{i=1}^n X_i}(t)  [Definition of Z_n]
= M_{\sum_{i=1}^n X_i}\left(\frac{t}{σ\sqrt{n}}\right)  [By Property 2 of MGFs above, where a = \frac{1}{σ\sqrt{n}}, b = 0]
= \left[M_X\left(\frac{t}{σ\sqrt{n}}\right)\right]^n  [By Property 3 of MGFs above]
3. Recall Step 1, and now let Y = X and s = \frac{t}{σ\sqrt{n}}, so we get a Taylor approximation of M_X. Then:

M_X\left(\frac{t}{σ\sqrt{n}}\right) ≈ 1 + E[X]\frac{t}{σ\sqrt{n}} + E[X^2]\frac{(t/(σ\sqrt{n}))^2}{2}  [Step 1]
= 1 + 0 + σ^2\frac{t^2}{2σ^2 n}  [Since E[X] = 0 and E[X^2] = σ^2]
= 1 + \frac{t^2/2}{n}
4. Now we combine Steps 2 and 3:

M_{Z_n}(t) = \left[M_X\left(\frac{t}{σ\sqrt{n}}\right)\right]^n  [step 2]
≈ \left(1 + \frac{t^2/2}{n}\right)^n  [step 3]
→ e^{t^2/2}  [Since \left(1 + \frac{x}{n}\right)^n → e^x]
Hence, Z_n has the same MGF as that of a standard Normal (whose MGF is e^{t^2/2}), so by the uniqueness property it must follow that distribution!
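To see the theorem in action, here is a quick numpy simulation sketch (the distribution, seed, and sizes are arbitrary choices): we standardize sample means of a skewed distribution and check that they look standard Normal.

    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 100, 50_000
    mu, sigma = 1.0, 1.0   # Exp(1) has mean 1 and standard deviation 1

    # Standardized sample means of n iid Exp(1) random variables.
    xbar = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))

    print(z.mean(), z.std())   # ≈ 0 and 1
    print(np.mean(z <= 1.0))   # ≈ Phi(1) ≈ 0.8413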
Chapter 6. Concentration Inequalities
It seems like we must have learned everything there possibly is - what else could go wrong? Sometimes
we know only certain properties about a random variable (its mean and/or variance), but not its entire
distribution. For example, the expected running time (number of comparisons) of the randomized QuickSort
algorithm can be found using linearity of expectation and indicators. But what are the strongest guarantees
we can make about a random variable without full knowledge of its distribution, if any?
When reasoning about some random variable X, it's not always easy or possible to calculate/know its exact PMF/PDF. We might not know much about X (maybe just its mean and variance), but we can still use concentration inequalities to bound how likely it is for X to be far from its mean µ (of the form P(|X − µ| > α)), or how likely it is for X to be very large (of the form P(X ≥ k)).

You might ask: when would we only know the mean/variance but not the PMF/PDF? Some of the distributions that we use (like the Exponential for bus waiting time) are just modelling assumptions and are probably incorrect. If we measured how long it took for the bus to arrive over many days, we could estimate its mean and variance! That is, we have no idea what the true distribution of daily bus waiting times is, but we can get good estimates for its mean and variance. We can use these concentration inequalities to bound the probability that we wait too long for a bus, knowing just those two quantities and nothing else!
Example(s)
The score distribution of an exam is modelled by a random variable X with range Ω_X = [0, 110] (with 10 points for extra credit). Give an upper bound on the proportion of students who score at least 100 when the average is 50. What about when the average is 25?
Solution What would you guess? If the average is E[X] = 50, an upper bound on the proportion of students who score at least 100 should be 50%, right? If more than 50% of students scored a 100 (or higher), the average would already be above 50, since all scores must be nonnegative (≥ 0). Mathematically, we just argued that:

P(X ≥ 100) ≤ \frac{E[X]}{100} = \frac{50}{100} = \frac{1}{2}
This sounds reasonable - if, say, 70% of the class were to get 100 or higher, the average would already be at least 70, even if everyone else got a zero. The best bound we can get is 50% - and that requires everyone else to get a zero.
If the average is E[X] = 25, an upper bound on the proportion of students who score at least 100 is:

P(X ≥ 100) ≤ \frac{E[X]}{100} = \frac{25}{100} = \frac{1}{4}
Similarly, if we had more than 30% of students get 100 or higher, the average would already be at least 30, even if everyone else got a zero.
Let X ≥ 0 be a non-negative random variable (discrete or continuous), and let k > 0. Then:
P(X ≥ k) ≤ \frac{E[X]}{k}

Equivalently (plugging in kE[X] for k above):

P(X ≥ kE[X]) ≤ \frac{1}{k}
Proof of Markov’s Inequality. Below is the proof when X is continuous. The proof for discrete RVs is similar
(just change all the integrals into summations).
E[X] = \int_0^∞ x f_X(x)\,dx  [because X ≥ 0]
= \int_0^k x f_X(x)\,dx + \int_k^∞ x f_X(x)\,dx  [split integral at some 0 ≤ k ≤ ∞]
≥ \int_k^∞ x f_X(x)\,dx  [\int_0^k x f_X(x)\,dx ≥ 0 because k ≥ 0, x ≥ 0 and f_X(x) ≥ 0]
≥ \int_k^∞ k f_X(x)\,dx  [because x ≥ k in the integral]
= k \int_k^∞ f_X(x)\,dx
= k P(X ≥ k)

Dividing both sides by k > 0 gives Markov's inequality.
So just knowing that the random variable is non-negative and what its expectation is, we can bound the
probability that it is “very large”. We know nothing else about the exam distribution! Note there is no bound
we can derive if X could be negative. Always check that X is indeed nonnegative before applying this bound!
The following example demonstrates how to use Markov’s inequality, and how loose it can be in some cases.
Example(s)
A coin is weighted so that its probability of landing on heads is 20%, independently of other flips.
Suppose the coin is flipped 20 times. Use Markov’s inequality to bound the probability it lands on
heads at least 16 times.
Solution We actually do know this distribution; the number of heads is X ∼ Bin(n = 20, p = 0.2). Thus,
E [X] = np = 20 · 0.2 = 4. By Markov’s inequality:
P(X ≥ 16) ≤ \frac{E[X]}{16} = \frac{4}{16} = \frac{1}{4}
Let’s compare this to the actual probability that this happens:
P(X ≥ 16) = \sum_{k=16}^{20} \binom{20}{k} 0.2^k \cdot 0.8^{20−k} ≈ 1.38 \cdot 10^{−8}
This is not a good bound, since we only assume to know the expected value. Again, we knew the exact
distribution, but chose not to use any of that information (the variance, the PMF, etc.).
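Here is a small Python sketch (standard library only) that reproduces both the Markov bound and the exact tail probability above:

    from math import comb

    n, p, k = 20, 0.2, 16
    exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
    markov = (n * p) / k   # E[X]/k

    print(markov)  # 0.25
    print(exact)   # ≈ 1.38e-8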
Example(s)
Suppose the expected runtime (number of comparisons) of randomized QuickSort on an array of n elements is 2n log(n). Give an upper bound on the probability that it takes at least 20n log(n) comparisons.

Solution Let X be the runtime of QuickSort, with E[X] = 2n log(n). Then, since X is non-negative, we can use Markov's inequality:
P(X ≥ 20n log(n)) ≤ \frac{E[X]}{20n log(n)}  [Markov's inequality]
= \frac{2n log(n)}{20n log(n)}
= \frac{1}{10}
So we know there’s at most 10% probability that QuickSort takes this long to run. Again, we can get this
bound despite not knowing anything except its expectation!
Let X be any random variable with expected value µ = E[X] and finite variance Var (X). Then, for
any real number α > 0:
P(|X − µ| ≥ α) ≤ \frac{Var(X)}{α^2}

Equivalently (plugging in kσ for α above, where σ = \sqrt{Var(X)}):

P(|X − µ| ≥ kσ) ≤ \frac{1}{k^2}
This is used to bound the probability of being in the tails. Here is a picture of Chebyshev’s inequality
bounding the probability that a Gaussian X ∼ N (µ, σ 2 ) is more than k = 2 standard deviations from its
mean:
While in principle Chebyshev’s inequality asks about distance from the mean in either direction, it can still
be used to give a bound on how often a random variable can take large values, and will usually give much
better bounds than Markov’s inequality. This is expected, since we also assume to know the variance - and
if the variance is small, we know the RV can’t deviate too far from its mean.
Example(s)
Let’s revisit the example in Markov’s inequality section earlier in which we toss a weighted coin
independently with probability of landing heads p = 0.2. Upper bound the probability it lands on
heads at least 16 times out of 20 flips using Chebyshev’s inequality.
Solution Again, the number of heads is X ∼ Bin(20, 0.2), so:

E[X] = np = 20 \cdot 0.2 = 4
Var(X) = np(1 − p) = 20 \cdot 0.2 \cdot (1 − 0.2) = 3.2
Note that since Chebyshev's asks about the difference in either direction of the RV from its mean, we must weaken our statement first to include the probability X ≤ −8. The reason we chose −8 is because Chebyshev's inequality is symmetric about the mean (difference of 12; 4 ± 12 gives the interval [−8, 16]):

P(X ≥ 16) ≤ P(X ≥ 16 ∪ X ≤ −8) = P(|X − 4| ≥ 12) ≤ \frac{Var(X)}{12^2} = \frac{3.2}{144} ≈ 0.0222
This is a much better bound than given by Markov’s inequality, but still far from the actual probability.
This is because Chebyshev’s inequality only takes the mean and variance into account. There is so much
more information about a RV than just these two quantities!
We can actually use Chebyshev’s inequality to prove an important result from 5.7: The Weak Law of Large
Numbers. The proof is so short!
Proof. By the properties of the expectation and variance of the sample mean of n iid variables: E[\bar{X}_n] = µ and Var(\bar{X}_n) = \frac{σ^2}{n} (from 5.7). By Chebyshev's inequality, for any ε > 0:

\lim_{n→∞} P(|\bar{X}_n − µ| > ε) ≤ \lim_{n→∞} \frac{Var(\bar{X}_n)}{ε^2} = \lim_{n→∞} \frac{σ^2}{nε^2} = 0
Chapter 6. Concentration Inequalities
6.2: The Chernoff Bound
Slides (Google Drive) Video (YouTube)
The more we know about a distribution, the stronger concentration inequality we can derive. We know that
Markov’s inequality is weak, since we only use the expectation of a random variable to get the probability
bound. Chebyshev’s inequality is a bit stronger, because we incorporate the variance into the probability
bound. However, as we showed in the example in 6.1, these bounds are still pretty “loose”. (They are tight
in some cases though).
What if we know even more? In particular, its PMF/PDF and hence MGF? That will allow us to have
an even stronger bound. The Chernoff bound is derived using a combination of Markov’s inequality and
moment generating functions.
Let X be any random variable. e^{tX} is always a non-negative random variable. Thus, for any t > 0, using Markov's inequality and the definition of MGF (review 5.6 if necessary):

P(X ≥ k) = P(e^{tX} ≥ e^{tk})  [since t > 0; if t < 0, flip the inequality]
≤ \frac{E[e^{tX}]}{e^{tk}}  [Markov's inequality]
= \frac{M_X(t)}{e^{tk}}  [def of MGF]
(Note that the first line requires t > 0, otherwise it would change to P(e^{tX} ≤ e^{tk}). This is because e^t > 1 for t > 0, so we get something like 2^X, which is monotone increasing. If t < 0, then e^t < 1, so we get something like 0.3^X, which is monotone decreasing.)
Now the right hand side holds for (uncountably) infinitely many t. For example, if we plugged in t = 0.5 we might get \frac{M_X(t)}{e^{tk}} = 0.53, and if we plugged in t = 3.26 we might get 0.21. Since P(X ≥ k) has to be less than all the possible values when plugging in different t > 0, it in particular must be less than the minimum of all the values:

P(X ≥ k) ≤ \min_{t>0} \frac{M_X(t)}{e^{tk}}
This is good - if we can minimize the right hand side, we can get a very tight/strong bound.
We’ll now focus our attention to deriving the Chernoff bound when X has a Binomial distribution. Everything
above applies generally though.
The Chernoff bound will allow us to bound the probability that X is larger than some multiple of its mean,
or less than or equal to it. These are the tails of a distribution as you go farther in either direction from the
mean. For example, we might want to bound the probability that X ≥ 1.5µ or X ≤ 0.1µ.
I think it’s completely acceptable if you’d like to not read the proof, as it is very involved algebraically.
You can still use the result regardless!
If X = \sum_{i=1}^n X_i where X_1, X_2, ..., X_n are iid variables, then the MGF of the (independent) sum equals the product of the MGFs. Taking our general result from above and using this fact, we get:

P(X ≥ k) ≤ \min_{t>0} \frac{M_X(t)}{e^{tk}} = \min_{t>0} \frac{\prod_{i=1}^n M_{X_i}(t)}{e^{tk}}
Let’s derive a Chernoff bound for X ∼ Bin(n, p), which has the form P (X ≥ (1 + δ)µ) for δ > 0. For example
with δ = 4, you may want to bound P (X ≥ 5E [X]).
Recall X = \sum_{i=1}^n X_i where X_i ∼ Ber(p) are iid, with µ = E[X] = np.
M_{X_i}(t) = E[e^{tX_i}]  [def of MGF]
= e^{t \cdot 1} p_{X_i}(1) + e^{t \cdot 0} p_{X_i}(0)  [LOTUS]
= pe^t + (1 − p)  [X_i ∼ Ber(p)]
= 1 + p(e^t − 1)
≤ e^{p(e^t − 1)}  [1 + x ≤ e^x with x = p(e^t − 1)]
See here for a pictorial proof that 1 + x ≤ e^x for any real number x (just plot the two functions). Alternatively, use the Taylor series for e^x to argue this. We use this bound for algebraic convenience coming up soon.
Now using the result from earlier and plugging in the (bound on the) MGF of the Ber(p) distribution, we get:

P(X ≥ k) ≤ \min_{t>0} \frac{\prod_{i=1}^n M_{X_i}(t)}{e^{tk}}  [from earlier]
≤ \min_{t>0} \frac{\left(e^{p(e^t − 1)}\right)^n}{e^{tk}}  [bound on MGF of Ber(p), n times]
= \min_{t>0} \frac{e^{np(e^t − 1)}}{e^{tk}}  [algebra]
= \min_{t>0} \frac{e^{µ(e^t − 1)}}{e^{tk}}  [µ = np]
For our bound, we want something like P(X ≥ (1 + δ)µ), so our k = (1 + δ)µ. To minimize the RHS and get the tightest bound, the best bound we get is by choosing t = ln(1 + δ) after some terrible algebra (take the derivative and set to 0). We simply plug in k and our optimal value of t to the above equation:

P(X ≥ (1 + δ)µ) ≤ \frac{e^{µ(e^{\ln(1+δ)} − 1)}}{e^{(1+δ)µ \ln(1+δ)}} = \frac{e^{µ((1+δ) − 1)}}{\left(e^{\ln(1+δ)}\right)^{(1+δ)µ}} = \frac{e^{δµ}}{(1 + δ)^{(1+δ)µ}} = \left(\frac{e^δ}{(1 + δ)^{(1+δ)}}\right)^µ
Again, we wanted to choose t that minimizes our upper bound for the tail probability. Taking the derivative
with respect to t tells us we should plug in t = ln(1 + δ) to minimize that quantity. This would actually be
pretty annoying to plug into a calculator.
We actually can show that the final RHS is ≤ \exp\left(\frac{−δ^2 µ}{2 + δ}\right) with some more messy algebra. Additionally, if we restrict 0 < δ < 1, we can simplify this even more to the bound provided earlier:

P(X ≥ (1 + δ)µ) ≤ \exp\left(\frac{−δ^2 µ}{3}\right)
The proof of the lower tail is entirely analogous, except optimizing over t < 0 when the inequality flips. It
proceeds by taking t = ln(1 − δ).
We also get a lower tail bound:

P(X ≤ (1 − δ)µ) ≤ \left(\frac{e^{−δ}}{(1 − δ)^{1−δ}}\right)^µ ≤ \left(\frac{e^{−δ}}{e^{−δ + δ^2/2}}\right)^µ = \exp\left(\frac{−δ^2 µ}{2}\right)
You may wonder, why are we bounding P (X ≥ (1 + δ)µ), when we can just sum the PMF of a binomial to
get an exact answer? The reason is, it is very computationally expensive to compute the binomial PMF! For
example, if X ∼ Bin(n = 20000, p = 0.1), then by plugging in the PMF, we get
P(X = 13333) = \binom{20000}{13333} 0.1^{13333}\, 0.9^{20000−13333} = \frac{20000!}{13333!(20000 − 13333)!}\, 0.1^{13333}\, 0.9^{20000−13333}
(Actually, n = 20000 isn't even that large.) The last two factors are each products of thousands of numbers, and they multiply to a number that is infinitesimally small. For the first term (the binomial coefficient), computing 20000! directly is impossible - in fact, it is so large you can't even imagine. You would have to cleverly interleave multiplying factorial terms with probability terms to keep the value in an acceptable range for the computer. Then, sum up a bunch of these....
This is why we have/need the Poisson approximation, the Normal approximation (CLT), and the Chernoff
bound for the Binomial!
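One standard workaround for the numerical issues above is to compute the PMF in log-space, where the huge factorials and tiny powers cancel before any exponentiation. Here is a sketch (standard library; lgamma(x + 1) = ln(x!)):

    from math import lgamma, log

    def log_binom_pmf(n: int, k: int, p: float) -> float:
        """ln P(X = k) for X ~ Bin(n, p), computed stably in log-space."""
        log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_coef + k * log(p) + (n - k) * log(1 - p)

    # P(X = 13333) for X ~ Bin(20000, 0.1): far too small for a float directly,
    # but its logarithm is perfectly representable.
    print(log_binom_pmf(20000, 13333, 0.1))  # a hugely negative number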
Example(s)
Suppose X ∼ Bin(500, 0.2). Use Markov’s inequality and the Chernoff bound to bound P (X ≥ 150),
and compare the results.
Solution We have:

E[X] = np = 500 \cdot 0.2 = 100
Var(X) = np(1 − p) = 500 \cdot 0.2 \cdot 0.8 = 80

By Markov's inequality, P(X ≥ 150) ≤ \frac{E[X]}{150} = \frac{100}{150} = \frac{2}{3} ≈ 0.667. For the Chernoff bound, we need 150 = (1 + δ)µ, so δ = 1/2, and

P(X ≥ (1 + 1/2) \cdot 100) ≤ \exp\left(\frac{−(1/2)^2 \cdot 100}{3}\right) = e^{−25/3} ≈ 2.4 \cdot 10^{−4}
The Chernoff bound is much stronger! It isn’t a fair comparison necessarily, because the Chernoff bound
required knowing the MGF (and hence the distribution), whereas Markov only required knowing the mean
(and that it was non-negative).
These examples give you an overall comparison of all three inequalities we learned so far!
Example(s)
Suppose the number of red lights Alex encounters each day to work is on average 4.8 (according to
historical trips to work). Alex really will be late if he encounters 8 or more red lights. Let X be the
number of lights he gets on a given day.
1. Give a bound for P (X ≥ 8) using Markov’s inequality.
2. Give a bound for P (X ≥ 8) using Chebyshev’s inequality, if we also assume Var (X) = 2.88.
3. Give a bound for P (X ≥ 8) using the Chernoff bound. Assume that X ∼ Bin(12, 0.4) - that
there are 12 traffic lights, and each is independently red with probability 0.4.
4. Compute P (X ≥ 8) exactly using the assumption from the previous part.
5. Compare the three bounds and their assumptions.
1. Since X is nonnegative and we know its expectation, we can apply Markov’s inequality:
P(X ≥ 8) ≤ \frac{E[X]}{8} = \frac{4.8}{8} = 0.6
2. Since we know X's variance, we can apply Chebyshev's inequality after some manipulation to match the form required. The reason we choose X ≤ 1.6 below is so it looks like P(|X − µ| ≥ α): Chebyshev's is symmetric about the mean 4.8, and 4.8 ± 3.2 gives the interval [1.6, 8]. Now, applying Chebyshev's gives:

P(X ≥ 8) ≤ P(X ≥ 8 ∪ X ≤ 1.6) = P(|X − 4.8| ≥ 3.2) ≤ \frac{Var(X)}{3.2^2} = \frac{2.88}{10.24} = 0.28125
3. Actually, X ∼ Bin(12, 0.4) also has E [X] = np = 4.8 and Var (X) = np(1 − p) = 2.88 (what a
coincidence). The Chernoff bound requires something of the form P (X ≥ (1 + δ)µ), so we first need
to solve for δ: (1 + δ)4.8 = 8 so that δ = 2/3. Now,
P(X ≥ 8) = P(X ≥ (1 + 2/3) \cdot 4.8) ≤ \exp\left(\frac{−(2/3)^2 \cdot 4.8}{3}\right) ≈ 0.4911

4. Using the Binomial PMF exactly:

P(X ≥ 8) = \sum_{k=8}^{12} \binom{12}{k} 0.4^k\, 0.6^{12−k} ≈ 0.0573
5. Actually it’s usually the case that the bounds are tighter/better as we move down the list Markov,
Chebyshev, Chernoff. But in this case Chebyshev’s gave us the tightest bound, even after being
weakened by including some additional P (X ≤ 1.6). Chernoff bounds will typically be better for
farther tails - 8 isn’t considered too far from the mean 4.8.
It’s also important to note that we found out more information progressively - we can’t blindly apply
all these inequalities every time. We need to make sure the conditions for the bound being valid are
satisfied.
Even our best bound of 0.28125 was 5-6x larger than the true probability of 0.0573.
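Here is a compact Python sketch (standard library only) reproducing all four numbers from this example:

    from math import comb, exp

    n, p, k = 12, 0.4, 8
    mu, var = n * p, n * p * (1 - p)             # 4.8 and 2.88

    markov = mu / k                              # 0.6
    chebyshev = var / (k - mu) ** 2              # P(|X - mu| >= 3.2) <= 0.28125
    delta = k / mu - 1                           # (1 + delta) * mu = 8, so delta = 2/3
    chernoff = exp(-delta**2 * mu / 3)           # ≈ 0.4911
    exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    print(markov, chebyshev, chernoff, exact)    # 0.6, 0.28125, ≈0.4911, ≈0.0573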
Chapter 6. Concentration Inequalities
6.3: Even More Inequalities
Slides (Google Drive) Video (YouTube)
In this section, we will talk about a potpourri of remaining concentration bounds. More specifically, the
union bound, Jensen’s inequality for convex functions, and Hoeffding’s inequality.
The intuition for the union bound is fairly simple. Suppose we have two events A and B. Then P (A ∪ B) ≤
P (A) + P (B) since the event space of A and B may overlap:
The union bound, though seemingly trivial, can actually be quite useful.
Example(s)
This will relate to the earlier question of bounding the probability of at least one bad event happening.
Suppose the probability Alex is late to teaching class on a given day is at most 0.01. Bound
the probability that Alex is late at least once over a 30-class quarter. Do not make any independence
assumptions.
Solution
Let A_i be the event Alex is late to class on day i, for i = 1, \ldots, 30. Then, by the union bound,

P(late at least once) = P\left(\bigcup_{i=1}^{30} A_i\right)
≤ \sum_{i=1}^{30} P(A_i)  [union bound]
≤ \sum_{i=1}^{30} 0.01  [P(A_i) ≤ 0.01]
= 0.30
Sometimes it may be useless though; imagine I asked instead about over a 200-day period. Then the union
bound would’ve given me a bound of 2.0 which is not helpful since probabilities have to be at most 1 already...
Let’s look at some examples of convex (left) and non-convex (right) sets:
The sets on the left hand side are said to be convex because if you take any two points in the set and
draw the line segment between them, it is always contained in the set. The sets on the right hand side are
non-convex because I found two endpoints in the set, but the line segment connecting them is not completely
contained in the set.
How can we describe this mathematically? Well for any two points x, y ∈ S, the set of points between them
must be entirely contained in S. The set of points making up the line segment between two points x, y can
be described as a weighted average (1 − p)x + py for p ∈ [0, 1]. If p = 0, we just get x; if p = 1, we just get
y, and if p = 1/2, we get the midpoint (x + y)/2. So p controls the fraction of the way we are from x to y.
Equivalently, for any points x_1, ..., x_m ∈ S, the convex polyhedron formed by the "corners" is contained in S. (This sounds complicated, but if m = 3, it just says the triangle formed by the 3 corners completely lies in the set S. If m = 4, the quadrilateral formed by the 4 corners completely lies in the set S.) The points in the convex polyhedron are described by taking weighted averages of the points, where the weights are non-negative and sum to 1. (This should remind you of a probability distribution!)

\left\{\sum_{i=1}^m p_i x_i : p_1, ..., p_m ≥ 0 \text{ and } \sum_{i=1}^m p_i = 1\right\} ⊆ S
Now, onto convex functions. Let’s take a look at some convex (top) and non-convex (bottom) functions:
The functions on the top (convex) have the property that, for any two points on the function curve, the
line segment connecting them lies above the function always. The functions on the bottom don’t have this
property: you can see that some or all of the line segment is below the function.
Let’s try to formalize what this means. For the convex function g(t) = t2 below, we can see that any line
drawn connecting 2 points of the function clearly lies above the function itself and so it is convex. Look at
any two points on the curve g(x) and g(y). Pick a point on the x-axis between x and y, call it (1 − p)x + py
where p ∈ [0, 1]. The function value at this point is g((1 − p)x + py). The corresponding point above it on
the line segment connecting g(x) and g(y) is actually the weighted average (1 − p)g(x) + pg(y). Hence, a
function g is convex if it satisfies the following for any x, y and p ∈ [0, 1]: g((1−p)x+py) ≤ (1−p)g(x)+pg(y)
Let S ⊆ R^n be a convex set (a convex function must have a convex set as its domain). A function g : S → R is a convex function if, for any line segment connecting g(x) and g(y), the function g lies entirely below the line. Mathematically, for any p ∈ [0, 1] and x, y ∈ S,

g((1 − p)x + py) ≤ (1 − p)g(x) + pg(y)

Equivalently, for any m points x_1, ..., x_m ∈ S, and p_1, ..., p_m ≥ 0 such that \sum_{i=1}^m p_i = 1,

g\left(\sum_{i=1}^m p_i x_i\right) ≤ \sum_{i=1}^m p_i g(x_i)
(Jensen's inequality says: if g is a convex function and X is a random variable, then g(E[X]) ≤ E[g(X)].)

Proof of Jensen's Inequality. We will only prove it in the case X is a discrete random variable (not a random vector), and with finite range (not countably infinite). However, this inequality does hold for any random variable.
The proof follows immediately from the definition of a convex function. Since X has finite range, let
ΩX = {x1 , ..., xn } and pX (xi ) = pi . By definition of a convex function (see above),
g(E[X]) = g\left(\sum_{i=1}^n p_i x_i\right)  [def of expectation]
≤ \sum_{i=1}^n p_i g(x_i)  [def of convex function]
= E[g(X)]  [LOTUS]
Example(s)
Show that the variance of any random variable X is always non-negative, using Jensen's inequality.

Solution We already know that Var(X) = E[(X − µ)^2] ≥ 0 since (X − µ)^2 is a non-negative RV, but let's prove it a different way. The function g(t) = t^2 is convex, so by Jensen's inequality, E[X^2] = E[g(X)] ≥ g(E[X]) = E[X]^2. Hence, Var(X) = E[X^2] − E[X]^2 ≥ 0.
Let X_1, ..., X_n be independent random variables, where each X_i is bounded: a_i ≤ X_i ≤ b_i, and let \bar{X}_n be their sample mean. Then,

P(|\bar{X}_n − E[\bar{X}_n]| ≥ t) ≤ 2\exp\left(\frac{−2n^2 t^2}{\sum_{i=1}^n (b_i − a_i)^2}\right)

where exp(x) = e^x.
In the case X_1, ..., X_n are iid (so a ≤ X_i ≤ b for all i) with mean µ, then

P(|\bar{X}_n − µ| ≥ t) ≤ 2\exp\left(\frac{−2n^2 t^2}{n(b − a)^2}\right) = 2\exp\left(\frac{−2nt^2}{(b − a)^2}\right)
Example(s)
Suppose an email company ColdMail is responsible for delivering 100 emails per day. ColdMail has
a bad day if it takes longer than 190 seconds to deliver all 100 emails, and a bad week if there is
even one bad day in the week.
The time it takes to send an email on average is 1 second, with a worst-case time of 5 sec-
onds; independently of other emails. (Note we don’t know anything else like its PDF).
1. Give an upper bound for the probability that ColdMail has a bad day.
2. Give an upper bound for the probability that ColdMail has a bad week.
Solution
1. In this scenario, we may use Hoeffding's inequality, since we have X_1, ..., X_100, the (independent) times to send each email, bounded in the interval [0, 5] seconds, with E[\bar{X}_{100}] = 1. Asking for the total time to be at least 190 seconds is the same as asking for the mean time to be at least 1.9 seconds.
Like we did for Chebyshev, we have to massage (and weaken) a little bit to get in the same form as required for Hoeffding's:

P(\bar{X}_{100} ≥ 1.9) ≤ P(\bar{X}_{100} ≥ 1.9 ∪ \bar{X}_{100} ≤ 0.1) = P(|\bar{X}_{100} − 1| ≥ 0.9)

Applying Hoeffding's (since E[\bar{X}_n] = 1):

P(\bar{X}_{100} ≥ 1.9) ≤ P(|\bar{X}_{100} − 1| ≥ 0.9) ≤ 2\exp\left(\frac{−2 \cdot 100 \cdot 0.9^2}{(5 − 0)^2}\right) ≈ 0.0031

2. A bad week means at least one bad day among the 7 days. By the union bound (from earlier in this section), P(bad week) ≤ 7 \cdot 0.0031 ≈ 0.022.
You might be tempted to use the CLT (and you should when you can), as it would probably give a better
bound than Hoeffding’s. But we didn’t know the variances, so we wouldn’t know which Normal to use.
Hoeffding’s gives us a way!
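For reference, here is a tiny Python sketch (standard library only) of the iid Hoeffding bound used above, together with the union-bounded weekly answer:

    from math import exp

    def hoeffding_two_sided(n: int, t: float, a: float, b: float) -> float:
        """Bound on P(|Xbar_n - E[Xbar_n]| >= t) for n iid RVs with a <= X_i <= b."""
        return 2 * exp(-2 * n * t**2 / (b - a) ** 2)

    bad_day = hoeffding_two_sided(100, 0.9, 0, 5)
    print(bad_day)       # ≈ 0.0031
    print(7 * bad_day)   # union bound on a bad week, ≈ 0.022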
Chapter 7. Statistical Estimation
Now we’ve hit a real turning point in the course. What we’ve been doing so far is “probability”, and
the remaining two chapters of the course will be about “statistics”. In the real world, we’re often not given
the true probability of heads p, or average rate of babies being born per minute λ. In today’s world, data
is being collected faster than ever! How can we use data to estimate these quantities of interest? We’ll
start with more mundane examples, such as: If I flip a coin (with unknown probability of heads) ten times
independently and I observe seven heads, why is 7/10 the “best” estimate for the probability of heads? We’ll
learn several techniques for estimating quantities, and talk about several properties that allow us to compare
them for “goodness”.
What we're going to focus on now is going the opposite way: given a coin whose probability of heads p is unknown, I flip it a few times and get THHTHH. How can I use this data to predict/estimate the value of p?
7.1.2 Likelihood
Let’s say I give you and your classmates each 5 minutes with a coin with unknown probability of heads p.
Whoever has the closest estimate will get an A+ in the class. What do you do in your precious 5 minutes,
and what do you give as your estimate?
I don't know about you, but I would flip the coin as many times as I can, and return the total number of heads over the total number of flips, or

\frac{\text{Heads}}{\text{Heads} + \text{Tails}}

which actually turns out to be a really good estimate.
To make things concrete, let's say you saw 4 heads and 1 tail. You tell me that p̂ = 4/5 (the hat above the p just means it is an estimate). How can you argue, objectively, that this is the "best" estimate?
Is there some objective function that it maximizes? It turns out yes: 4/5 maximizes this blue curve, which is called the likelihood of the data. The x-axis has the different possible values of p, and the y-axis has the probability of seeing the data if the coin had probability of heads p.
You assume a model (Bernoulli in our case) with unknown parameter θ (the probability of heads), and
receive iid samples x = (x1 , ..., xn ) ∼ Ber(θ) (in this example, each xi is either 1 or 0). The likelihood of the
data given a parameter θ is defined as the probability of seeing the data, given θ, or:
A realization/sample x of a random variable X is the value that is actually observed (will always be
in ΩX ).
For example, for Bernoulli, a realization is either 0 or 1, and for Geometric, some positive integer ≥ 1.
Let x = (x1 , ..., xn ) be iid samples from probability mass function pX (t | θ) (if X is discrete), or from
density fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define
the likelihood of x given θ to be the “probability” of observing x if the true parameter is θ.
If X is discrete,

L(x | θ) = \prod_{i=1}^n p_X(x_i | θ)

If X is continuous,

L(x | θ) = \prod_{i=1}^n f_X(x_i | θ)
In the continuous case, we have to multiply densities, because the probability of seeing a particular value of a continuous random variable is always 0. We can do this because the density preserves relative probabilities; i.e., \frac{P(X ≈ u)}{P(X ≈ v)} ≈ \frac{f_X(u)}{f_X(v)}. For example, if X ∼ N(µ = 3, σ^2 = 5), the realization x = −503.22 has much lower density/likelihood than x = 3.12.
Example(s)
Give the likelihoods for each of the samples, and take a guess at which value of θ maximizes the
likelihood!
1. Suppose x = (x1 , x2 , x3 ) = (1, 0, 1) are iid samples from Ber(θ) (recall θ is the probability of a
success).
2. Suppose x = (x1 , x2 , x3 , x4 ) = (3, 0, 2, 7) are iid samples from Poi(θ) (recall θ is the historical
average number of events in a unit of time).
3. Suppose x = (x1 , x2 , x3 ) = (3.22, 1.81, 2.47) are iid samples from Exp(θ) (recall θ is the historical
average number of events in a unit of time).
Solution
1. The samples mean we got a success, then a failure, then a success. The likelihood is the “probability”
of observing the data.
L(x | θ) = \prod_{i=1}^3 p_X(x_i | θ) = p_X(1 | θ) \cdot p_X(0 | θ) \cdot p_X(1 | θ) = θ(1 − θ)θ = θ^2(1 − θ)

Since we observed two successes out of three trials, my guess for the maximum likelihood estimate would be θ̂ = \frac{2}{3}.
3
2. The samples mean we observed 3 events in the first unit of time, then 0 in the second, then 2 in the
third, then 7 in the fourth. The likelihood is the “probability” of observing the data (just multiplying
k
Poisson PMFs pX (k | λ) = e−λ λk! ).
4
Y
L(x | θ) = pX (xi | θ) = pX (3 | θ) · pX (0 | θ) · pX (2 | θ) · pX (7 | θ)
i=1
3 0
2
7
−θ θ −θ θ −θ θ −θ θ
= e e e e
3! 0! 2! 7!
Since there were a total of 3 + 0 + 2 + 7 = 12 events over 4 units of time (samples), my guess for the
12
maximum likelihood estimate would be θ̂ = = 3 events per unit time.
4
3. The samples mean we waited until three events happened (x_1, x_2, x_3), and it took 3.22 units of time until the first event, 1.81 until the second, and 2.47 until the third. The likelihood is the "probability" of observing the data (just multiplying Exponential PDFs f_X(y | λ) = λe^{−λy}).

L(x | θ) = \prod_{i=1}^3 f_X(x_i | θ) = f_X(x_1 | θ) \cdot f_X(x_2 | θ) \cdot f_X(x_3 | θ) = θe^{−3.22θ} \cdot θe^{−1.81θ} \cdot θe^{−2.47θ} = θ^3 e^{−7.5θ}

Since we waited a total of 3.22 + 1.81 + 2.47 = 7.5 units of time for 3 events, my guess for the maximum likelihood estimate would be θ̂ = \frac{3}{7.5} = 0.4 events per unit time.
In the previous three scenarios, we set up the likelihood of the data. Now, the only thing left to do is find out
which value of θ maximizes the likelihood. Everything else in this section is just explaining how to use calcu-
lus to optimize this likelihood! There is no more “probability” or “statistics” involved in the remaining pages.
Before we move on, we have to go back and review calculus really quickly. How do we optimize a function? Each of these three points is a local optimum; what do they have in common? Their derivative is 0. We're going to try to set the derivative of our likelihood to 0, so we can solve for the optimum value.
Example(s)
Suppose x = (x1 , x2 , x3 , x4 , x5 ) = (1, 1, 1, 1, 0) are iid samples from the Ber(θ) distribution with
unknown parameter θ. Find the maximum likelihood estimator θ̂ of θ.
Solution The data (1, 1, 1, 1, 0) can be thought of as the sequence HHHHT, which has likelihood (assuming independent flips):

L(HHHHT | θ) = θ^4(1 − θ) = θ^4 − θ^5
The plot of the likelihood with θ on the x-axis and L(HHHHT | θ) on the y-axis is (copied from above):
and we can actually see the θ which maximizes the likelihood is θ̂ = 4/5. But sometimes we can’t plot the
likelihood, so we will solve for this analytically now.
We want to find the θ which maximizes this likelihood, so we take the derivative with respect to θ and
set it to 0:
\frac{∂}{∂θ} L(x | θ) = 4θ^3 − 5θ^4 = θ^3(4 − 5θ)
Now, when we set the derivative to 0 (remember the optimum points occur when the derivative is 0), we
replace θ with θ̂ because we are now estimating θ. After solving for θ, we end up with
θ̂^3(4 − 5θ̂) = 0 \;→\; θ̂ = \frac{4}{5} \text{ or } 0
We switch θ to θ̂ when we set the derivative to 0, as that is when we start estimating. To see which is the
maximizer, you can just plug in the candidates (0 and 4/5) and the endpoints (0 and 1: the min and max
possible values of θ)! That is, compute the likelihood at 0, 4/5, 1 and see which is largest.
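If calculus feels opaque, a brute-force numerical check works too. Here is a numpy sketch (assuming numpy is available) that grid-searches θ for the HHHHT data:

    import numpy as np

    theta = np.linspace(0, 1, 100_001)
    likelihood = theta**4 * (1 - theta)   # L(HHHHT | theta)
    print(theta[np.argmax(likelihood)])   # ≈ 0.8 = 4/5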
To summarize, we defined θ̂_{MLE} = \arg\max_θ L(x | θ), the argument (input) θ that maximizes the likelihood function. The difference between max and argmax is as follows. Here is a function,

f(x) = 1 − x^2

where the maximum value is 1; it's the highest value this function could ever achieve. The argmax, on the other hand, is 0, because argmax just means the argument (input) that maximizes the function. So, which x actually achieved f(x) = 1? Well, that was x = 0. And so, in MLE, we're trying to find the θ that maximizes the likelihood, and we don't care what the maximum value of the likelihood is. We didn't even compute it! We just care that the argmax is 4/5.
Let x = (x1 , ..., xn ) be iid realizations from probability mass function pX (t | θ) (if X is discrete), or
from density fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We
define the maximum likelihood estimator θ̂M LE of θ to be the parameter which maximizes the
likelihood (or equivalently, the log-likelihood) of the data.
Taking the log of a product (such as the likelihood) results in the sum of logs because of log properties:
log(a · b · c) = log(a) + log(b) + log(c)
We see now why we might want to take the log of the likelihood before differentiating it, but why can we?
Below there are two images: the left image is a function, and the right image is the log of that function.
The values are different (see the y-axis), but if you look at the x-axis, it happens that both functions are
maximized at 1 (the argmax's are the same). Log is a monotone increasing function, so it preserves order: whatever was the maximizer (argmax) of the original function will also be the maximizer of the log of the function.
See below to see what happens when you apply the natural log (ln) to a product in our likelihood scenario!
And see the next section 7.2 for examples of maximum likelihood estimation in action.
Let x = (x_1, ..., x_n) be iid realizations from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). The log-likelihood of x given θ is just the natural logarithm of the likelihood:

If X is discrete,

\ln L(x | θ) = \sum_{i=1}^n \ln p_X(x_i | θ)

If X is continuous,

\ln L(x | θ) = \sum_{i=1}^n \ln f_X(x_i | θ)
Chapter 7. Statistical Estimation
7.2: Maximum Likelihood Examples
Slides (Google Drive) Video (YouTube)
We spend an entire section just doing examples because maximum likelihood is such a fundamental concept
used everywhere (especially machine learning). I promise that the idea is simple: find θ that maximizes the
likelihood of the data. The computation and notation can be confusing at first though.
Let’s say x1 , x2 , ..., xn are iid samples from Poi(θ). (These values might look like x1 = 13, x2 = 5, x3 =
6, etc...) What is the MLE of θ?
Solution Remember that we discussed that the sample mean might be a good estimate of θ. If we observed 20 events over 5 units of time, a good estimate for λ, the average number of events per unit of time, would be \frac{20}{5} = 4. This turns out to be the maximum likelihood estimate!
Let’s follow the recipe provided in 7.1.
1. Compute the likelihood and log-likelihood of data. To do this, we take the following product
of the Poisson PMFs at each sample xi , over all the data points:
L(x | θ) = \prod_{i=1}^n p_X(x_i | θ) = \prod_{i=1}^n e^{−θ} \frac{θ^{x_i}}{x_i!}
Again, this is the probability of seeing x_1, then x_2, and so on. This function is pretty hard to differentiate, so to make it easier, let's compute the log-likelihood instead, using log properties. In most cases, we'll want to optimize the log-likelihood instead of the likelihood (since we don't want to use the product rule of calculus)!
\ln L(x | θ) = \ln\left(\prod_{i=1}^n e^{−θ} \frac{θ^{x_i}}{x_i!}\right)  [def of likelihood]
= \sum_{i=1}^n \ln\left(e^{−θ} \frac{θ^{x_i}}{x_i!}\right)  [log of product is sum of logs]
= \sum_{i=1}^n \left[\ln(e^{−θ}) + \ln(θ^{x_i}) − \ln(x_i!)\right]  [log of product is sum of logs]
= \sum_{i=1}^n \left[−θ + x_i \ln θ − \ln(x_i!)\right]  [other log properties]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).
Now we want to take the derivative of the log-likelihood with respect to θ: the derivative of −θ is just −1, and the derivative of x_i \ln θ is just \frac{x_i}{θ}, because remember x_i is a constant with respect to θ.
\frac{∂}{∂θ} \ln L(x | θ) = \sum_{i=1}^n \left[−1 + \frac{x_i}{θ}\right]

Setting this to 0 and solving:

\sum_{i=1}^n \left[−1 + \frac{x_i}{θ̂}\right] = 0 \;→\; −n + \frac{1}{θ̂}\sum_{i=1}^n x_i = 0 \;→\; θ̂ = \frac{1}{n}\sum_{i=1}^n x_i
3. Optionally, verify θ̂M LE is indeed a (local) maximizer by checking that the second deriva-
tive at θ̂M LE is negative (if θ is a single parameter), or the Hessian (matrix of second
partial derivatives) is negative semi-definite (if θ is a vector of parameters).
We want to take the second derivative also, because otherwise we don't know if this is a maximum or a minimum. We differentiate the first derivative \sum_{i=1}^n [−1 + \frac{x_i}{θ}] again with respect to θ, and we notice that because θ^2 is always positive, −\frac{x_i}{θ^2} is always negative, so the second derivative is always less than 0. That means the log-likelihood is concave down everywhere, so anywhere the derivative is zero is a global maximum - we've successfully found the global maximum of our likelihood equation.

\frac{∂^2}{∂θ^2} \ln L(x | θ) = \sum_{i=1}^n \left[−\frac{x_i}{θ^2}\right] < 0 \;→\; \text{concave down everywhere}
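We can also double-check this result numerically by minimizing the negative log-likelihood with scipy (assuming numpy and scipy are available; the sample values below are made up):

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    x = np.array([13, 5, 6, 8, 10])   # hypothetical Poisson samples

    def neg_log_lik(theta: float) -> float:
        # -ln L(x | theta) for iid Poi(theta); gammaln(x + 1) = ln(x!)
        return -np.sum(-theta + x * np.log(theta) - gammaln(x + 1))

    res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
    print(res.x, x.mean())   # both ≈ 8.4: the MLE is the sample mean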
Let’s say x1 , x2 , ..., xn are iid samples from Exp(θ). (These values might look like x1 = 1.354, x2 =
3.198, x3 = 4.312, etc...) What is the MLE of θ?
Solution Now that we’ve seen one example, we’ll just follow the procedure given in the previous section.
1. Compute the likelihood and log-likelihood of data.
Since we have a continuous distribution, our likelihood is the product of the PDFs:
L(x | θ) = \prod_{i=1}^n f_X(x_i | θ) = \prod_{i=1}^n θe^{−θx_i}
The log-likelihood is

\ln L(x | θ) = \sum_{i=1}^n \ln\left(θe^{−θx_i}\right) = \sum_{i=1}^n \left[\ln(θ) − θx_i\right]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).

\frac{∂}{∂θ} \ln L(x | θ) = \sum_{i=1}^n \left[\frac{1}{θ} − x_i\right]

Now, we set the derivative to 0 and solve (here we replace θ with θ̂):

\sum_{i=1}^n \left[\frac{1}{θ̂} − x_i\right] = 0 \;→\; \frac{n}{θ̂} − \sum_{i=1}^n x_i = 0 \;→\; θ̂ = \frac{n}{\sum_{i=1}^n x_i}
This is just the inverse of the sample mean! This makes sense because if the average waiting time was 1/2 hours, then the average rate per unit of time λ should be \frac{1}{1/2} = 2 per hour!
3. Optionally, verify θ̂M LE is indeed a (local) maximizer by checking that the second deriva-
tive at θ̂M LE is negative (if θ is a single parameter), or the Hessian (matrix of second
partial derivatives) is negative semi-definite (if θ is a vector of parameters). The second
derivative of the log-likelihood just requires us to take one more derivative:
\frac{∂^2}{∂θ^2} \ln L(x | θ) = \sum_{i=1}^n \frac{−1}{θ^2} < 0
Since the second derivative is negative everywhere, the function is concave down, and any critical point
is a global maximum!
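As a quick empirical check (numpy sketch; the seed and true rate are arbitrary), the estimator n/\sum x_i should recover the true rate from simulated data:

    import numpy as np

    rng = np.random.default_rng(4)
    theta_true = 2.0

    x = rng.exponential(scale=1/theta_true, size=100_000)   # numpy's scale = 1/rate
    print(len(x) / x.sum())   # MLE n / sum(x_i) ≈ 2.0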
Let’s say x1 , x2 , ..., xn are iid samples from (continuous) Unif(0, θ). (These values might look like
x1 = 2.325, x2 = 1.1242, x3 = 9.262, etc...) What is the MLE of θ?
Solution It turns out our usual procedure won’t work on this example, unfortunately. We’ll explain why
once we run into the problem!
To compute the likelihood, we first need the individual density functions. Recall
f_X(x | θ) = \begin{cases} \frac{1}{θ} & 0 ≤ x ≤ θ \\ 0 & \text{otherwise} \end{cases}
Let’s actually define an indicator function for whether or not some boolean condition A is true or false:
I_A = \begin{cases} 1 & A \text{ is true} \\ 0 & A \text{ is false} \end{cases}
This way, we can rewrite the uniform density in one line as (1/θ for 0 ≤ x ≤ θ and 0 otherwise):
f_X(x | θ) = \frac{1}{θ} I_{\{0 ≤ x ≤ θ\}}
First, we take the product over all data points of the density at that data point, and plug in the density of the uniform distribution. How do we simplify this? First of all, notice that every term in the product contains a \frac{1}{θ}; multiplying it by itself n times gives \frac{1}{θ^n}. How do we multiply indicators? If we want the product of 1's and 0's to be 1, they ALL have to be 1. So,

L(x | θ) = \prod_{i=1}^n \frac{1}{θ} I_{\{0 ≤ x_i ≤ θ\}} = \frac{1}{θ^n} I_{\{0 ≤ x_1, ..., x_n ≤ θ\}}
We could take the log-likelihood before differentiating, but this function isn't too bad-looking, so let's take the derivative of it directly. The indicator I_{\{0 ≤ x_1, ..., x_n ≤ θ\}} just says the function is \frac{1}{θ^n} when the condition is true and 0 otherwise. So our derivative will just be the derivative of \frac{1}{θ^n} when that condition is true and 0 otherwise:

\frac{d}{dθ} L(x | θ) = −\frac{n}{θ^{n+1}} I_{\{0 ≤ x_1, ..., x_n ≤ θ\}}
Setting this to 0:

−\frac{n}{θ^{n+1}} = 0 \;→\; θ̂ = ???

There seems to be no value of θ that solves this; what's going on? Let's plot the likelihood. First, we plot just \frac{1}{θ^n} (not quite the likelihood) where θ is on the x-axis:
Above is a graph of \frac{1}{θ^n}, and so if we wanted to maximize this function, we should choose θ as close to 0 as possible. But remember that the likelihood was \frac{1}{θ^n} I_{\{0 ≤ x_1, ..., x_n ≤ θ\}}, which can also be written as \frac{1}{θ^n} I_{\{x_{max} ≤ θ\}}, because all the samples are ≤ θ if and only if the maximum is. Below is the graph of the actual likelihood:
Notice that multiplying by the indicator function just kept the function as is when the condition was true,
xmax ≤ θ, but zeroed it out otherwise. So now we can see that our maximum likelihood estimator should be
θ̂M LE = xmax = max{x1 , x2 , . . . , xn }, since it achieves the highest value.
Why? Remember x1 , . . . , xn ∼ Unif(0, θ), so θ has to be at least as large as the biggest xi , because if it’s
not as large as the biggest xi , then it would have been impossible for that uniform to produce that largest
xi . For example, if our samples were x1 = 2.53, x2 = 8.55, x3 = 4.12, our θ had to be at least 8.55 (the
maximum sample), because if it were 7 for example, then Unif(0, 7) could not possibly generate the sample
8.55.
So our likelihood \frac{1}{θ^n}, remember, would have preferred as small a θ as possible to maximize it, but subject to θ ≥ x_{max}. Therefore, the "compromise" was reached by making them equal!
I'd like to point out this is a special case because the range of the uniform distribution depends on its parameters a, b (the range of Unif(a, b) is [a, b]). On the other hand, most of our distributions like Poisson or Exponential have the same range no matter the value of their parameters. For example, the range of Poi(λ) is always {0, 1, 2, ...} and the range of Exp(λ) is always [0, ∞), independent of λ.
Therefore, most MLE problems will be similar to the first two examples rather than this complicated one!
Chapter 7. Statistical Estimation
7.3: Method of Moments Estimation
Slides (Google Drive) Video (YouTube)
Usually, we are interested in the first moment of X: µ = E[X], and the second moment of X about µ: Var(X) = E[(X − µ)^2].
Now since we are in the statistics portion of the class, we will define a sample moment.
Let X be a random variable, and c ∈ R a scalar. Let x1 , . . . , xn be iid realizations (samples) from X.
The k-th sample moment of X is

\frac{1}{n}\sum_{i=1}^n x_i^k
For example, the first sample moment is just the sample mean, and the second sample moment
about the sample mean is the sample variance.
Suppose we only need to estimate one parameter θ (you might have to estimate two for example θ = (µ, σ 2 )
for the N (µ, σ 2 ) distribution). The idea behind Method of Moments (MoM) estimation is that: to find a
good estimator, we should have the true and sample moments match as best we can. That is, I should choose
the parameter θ such that the first true moment E [X] is equal to the first sample moment x̄. Examples
always make things clearer!
Example(s)
Let’s say x1 , x2 , . . . , xn are iid samples from X ∼ Unif(0, θ) (continuous). (These values might look
like x1 = 3.21, x2 = 5.11, x3 = 4.33, etc.) What is the MoM estimator of θ?
Solution We then set the first true moment to the first sample moment as follows (recall that E[Unif(a, b)] = \frac{a + b}{2}):

E[X] = \frac{θ}{2} = \frac{1}{n}\sum_{i=1}^n x_i

θ̂_{MoM} = \frac{2}{n}\sum_{i=1}^n x_i
This estimator makes sense intuitively once you think about it for a bit: if we take the sample mean of a bunch of Unif(0, θ) rvs, we expect to get close to the true mean: (1/n) ∑_{i=1}^n x_i → θ/2 (by the Law of Large Numbers). Hence, a good estimator for θ would just be twice the sample mean!
Notice that in this case, the MoM estimator disagrees with the MLE we derived in 7.2!

θ̂MoM = (2/n) ∑_{i=1}^n x_i ≠ θ̂MLE = xmax
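To make this concrete, here is a small simulation sketch in Python with NumPy (the true θ = 10, the sample size n = 100, and the seed are all arbitrary choices for illustration) comparing the two estimators on the same samples:

import numpy as np

rng = np.random.default_rng(seed=312)  # seed chosen arbitrarily for reproducibility

theta_true = 10.0  # hypothetical true parameter (unknown to us in practice)
n = 100            # sample size

x = rng.uniform(0, theta_true, size=n)  # iid samples from Unif(0, theta)

theta_mle = x.max()       # MLE: the sample maximum
theta_mom = 2 * x.mean()  # MoM: twice the sample mean

print(theta_mle, theta_mom)  # both should be near 10, but they rarely agree exactly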
What if you had two parameters instead of just one? Well, then you would set the first true moment equal
to the first sample moment (as we just did), but also the second true moment equal to the second sample
moment! We’ll see an example of this below. But basically, if we have k parameters to estimate, we need k
equations to solve for these k unknowns!
Let x = (x1, . . . , xn) be iid realizations (samples) from probability mass function pX(t; θ) (if X is discrete), or from density fX(t; θ) (if X continuous), where θ is a parameter (or vector of parameters). If θ = (θ1, . . . , θk) has k components, the method of moments estimator θ̂MoM is the solution to the system of k equations setting the first k true moments equal to the first k sample moments.
Example(s)
Let’s say x1 , x2 , . . . , xn are iid samples from X ∼ Exp(θ). (These values might look like x1 =
3.21, x2 = 5.11, x3 = 4.33, etc.) What is the MoM estimator of θ?
Solution We have k = 1 (since only one parameter). We then set the first true moment to the first sample moment as follows (recall that E[Exp(λ)] = 1/λ):

E[X] = 1/θ = (1/n) ∑_{i=1}^n x_i

θ̂MoM = 1/((1/n) ∑_{i=1}^n x_i) = n/∑_{i=1}^n x_i
Notice that in this case, the MoM estimator agrees with the MLE (Maximum Likelihood Estimator),
hooray!
θ̂MoM = θ̂MLE = n/∑_{i=1}^n x_i
Example(s)
Let’s say x1 , x2 , . . . , xn are iid samples from X ∼ Poi(θ). (These values might look like x1 = 13, x2 =
5, x3 = 4, etc.) What is the MoM estimator of θ?
Solution We have k = 1 (since only one parameter). We then set the first true moment to the first
sample moment as follows (recall that E [Poi(λ)] = λ):
E[X] = θ = (1/n) ∑_{i=1}^n x_i

θ̂MoM = (1/n) ∑_{i=1}^n x_i
In this case, again, the MoM estimator agrees with the MLE! Again, much easier than MLE :).
Now, we’ll do an example where there is more than one parameter.
Example(s)
Let’s say x1 , x2 , . . . , xn are iid samples from X ∼ N (θ1 , θ2 ). (These values might look like x1 =
−2.321, x2 = 1.112, x3 = −5.221, etc.) What is the MoM estimator of the vector θ = (θ1 , θ2 ) (θ1 is
the mean, and θ2 is the variance)?
Solution We have k = 2 (since now we have two parameters θ1 = µ and θ2 = σ²). Notice Var(X) = E[X²] − E[X]², so rearranging we get E[X²] = Var(X) + E[X]². Let's solve for θ1 first:
Again, we set the first true moment to the first sample moment:

E[X] = θ1 = (1/n) ∑_{i=1}^n x_i

θ̂1 = (1/n) ∑_{i=1}^n x_i
Now let's use our result for θ̂1 to solve for θ̂2 (recall that E[X²] = Var(X) + E[X]² = θ2 + θ1²):

E[X²] = θ2 + θ1² = (1/n) ∑_{i=1}^n x_i²

θ̂2 = (1/n) ∑_{i=1}^n x_i² − ((1/n) ∑_{i=1}^n x_i)²
If you were to use maximum likelihood to estimate the mean and variance of a Normal distribution, you
would get the same result!
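Here is a quick sketch of this two-parameter computation in Python with NumPy (the true µ = −1.5 and σ² = 4 below are arbitrary choices just to generate data):

import numpy as np

rng = np.random.default_rng(seed=312)
x = rng.normal(loc=-1.5, scale=2.0, size=1000)  # iid N(mu = -1.5, sigma^2 = 4) samples

theta1_hat = x.mean()                       # first sample moment -> estimate of mu
theta2_hat = (x**2).mean() - x.mean() ** 2  # second sample moment minus (first)^2 -> estimate of sigma^2

print(theta1_hat, theta2_hat)  # should be near -1.5 and 4.0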
Chapter 7. Statistical Estimation
7.4: The Beta and Dirichlet Distributions
Slides (Google Drive) Video (YouTube)
We’ll take a quick break after learning two ways (MLE and MoM) to estimate unknown parameters! In the
next section, we’ll learn yet another approach. But that approach requires us to learn at least one other
distribution, the Beta distribution, which will be the focus of this section.
Suppose you want to model your belief on the unknown probability X of heads. You could assign, for
example, a probability distribution as follows:
This figure below shows that you believe that X = P (head) is most likely to be 0.5, somewhat likely to be
0.8, and least likely to be 0.37. That is, X is a discrete random variable with range ΩX = {0.37, 0.5, 0.8}
and pX (0.37) + pX (0.5) + pX (0.8) = 1. This is a probability distribution on a probability of heads!
Now what if we want P (head) to be open to any value in [0, 1] (which we should want; having it be just
one of three values is arbitrary and unrealistic)? The answer is that we need a continuous random variable
(with range [0, 1] because probabilities can be any number within this range)! Let’s try to see how we might
define a new distribution which might do a good job modelling this belief! Let’s see which of the following
shapes might be appropriate (or not).
Example(s)
Suppose you flipped the coin n times and observed k heads. Which of the above density functions
have a “shape” which would be reasonable to model your belief?
It’s important to note that Distributions 2 and 4 are invalid, because there is no possible sequence of flips
that could result in the belief that is ”bi-modal” (have two peaks in the graph of the distribution). Your
belief should have a single peak at your highest belief, and go down on both sides from there.
For instance, if you believe that the probability of (getting heads) is most likely around 0.25, we have Distri-
bution 1 in the figure above. Similarly, if you think that it’s most likely around 0.85, we have Distribution
3. Or, more interestingly, if you have NO idea what the probability might be and you want to make every
probability equally likely, you could use a Uniform distribution like in Distribution 5.
Example(s)
If you flip a coin with unknown probability of heads X, what does your belief distribution look like
if:
• You didn’t observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Match the four distributions below to the four scenarios above. Note the vertical bar in each distribution represents where the mode (the point with highest density) is, as that's probably what we want to estimate as our probability of heads!
Solution
Explanation: Since we haven’t observed anything yet, we shouldn’t have preference over any partic-
ular value. This is encoded as a continuous Unif(0, 1) distribution.
There is a continuous distribution/rv with range [0, 1] that parametrizes probability distributions over a
probability just like this, based on two parameters α and β, which allow you to account for how many heads
and tails you’ve seen!
X ∼ Beta(α, β), if and only if X has the following density function (and range ΩX = [0, 1]):
fX(x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} for 0 ≤ x ≤ 1, and fX(x) = 0 otherwise
X is typically the belief distribution about some unknown probability of success, where we pretend
we’ve seen α−1 successes and β−1 failures. Hence the mode (most likely value of the probability/point
with highest density), arg max_{x∈[0,1]} fX(x), is

mode[X] = (α − 1)/((α − 1) + (β − 1))
If you flip a coin with unknown probability of heads X, identify the parameters of the most appropriate
Beta distribution to model your belief:
• You didn’t observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Solution
• You didn't observe anything? Beta(0 + 1, 0 + 1) ≡ Beta(1, 1) (the uniform prior; every probability equally likely).
• You observed 8 heads and 2 tails? Beta(8 + 1, 2 + 1) ≡ Beta(9, 3) → mode = (9 − 1)/((9 − 1) + (3 − 1)) = 8/10.
• You observed 80 heads and 20 tails? Beta(80 + 1, 20 + 1) ≡ Beta(81, 21) → mode = (81 − 1)/((81 − 1) + (21 − 1)) = 80/100.
• You observed 2 heads and 3 tails? Beta(2 + 1, 3 + 1) ≡ Beta(3, 4) → mode = (3 − 1)/((3 − 1) + (4 − 1)) = 2/5.
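As a quick sanity check of the mode formula, we can compare it against a numerical maximization of the Beta density (a sketch using scipy.stats; the grid search is only for illustration):

import numpy as np
from scipy.stats import beta

a, b = 81, 21  # Beta(81, 21): pretend we saw 80 heads and 20 tails

mode_formula = (a - 1) / ((a - 1) + (b - 1))  # closed-form mode

grid = np.linspace(0, 1, 100001)                      # dense grid over [0, 1]
mode_numeric = grid[np.argmax(beta.pdf(grid, a, b))]  # argmax of the density

print(mode_formula, mode_numeric)  # both approximately 0.80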
fX(x) = (1/B(α)) ∏_{i=1}^r x_i^{α_i − 1} if each x_i ∈ (0, 1) and ∑_{i=1}^r x_i = 1, and fX(x) = 0 otherwise

This is a generalization of the Beta random variable from 2 outcomes to r (the Dirichlet distribution). The random vector X is typically the belief distribution about some unknown probabilities of the different outcomes, where we pretend we saw α1 − 1 outcomes of type 1, α2 − 1 outcomes of type 2, . . . , and αr − 1 outcomes of type r. Hence, the mode of the distribution, arg max fX(x) over x ∈ [0, 1]^r with ∑ x_i = 1, is the vector

mode[X] = ((α1 − 1)/∑_{i=1}^r (α_i − 1), (α2 − 1)/∑_{i=1}^r (α_i − 1), . . . , (αr − 1)/∑_{i=1}^r (α_i − 1))
We’ve seen two ways now to estimate unknown parameters of a distribution. Maximum likelihood estimation
(MLE) says that we should find the parameter θ that maximizes the likelihood (“probability”) of seeing the
data, whereas the method of moments (MoM) says that we should match as many moments as possible
(mean, variance, etc.). Now, we learn yet another (and final) technique for estimation that will cover (there
are many more...).
7.5.1.1 Intuition
In Maximum Likelihood Estimation (MLE), we used iid samples x = (x1 , . . . , xn ) from some distribution
with unknown parameter(s) θ, in order to estimate θ.
θ̂MLE = arg max_θ L(x | θ) = arg max_θ ∏_{i=1}^n fX(x_i | θ)
Note: Recall the English description of how we found θ̂MLE: we computed the likelihood, which is the probability of seeing the data given the parameter θ, and we chose the "best" θ that maximized this likelihood.
You might have been thinking: shouldn’t we be trying to maximize ”P (θ | x)” instead? Well, this doesn’t
make sense unless Θ is a R.V.! And this is where Maximum A Posteriori (MAP) Estimation comes in.
So far, for MLE and MoM estimation, we assumed θ was fixed but unknown. This is called the Frequentist
framework where we only estimate our parameter based on data alone, and θ is not a random variable.
Now, we are in the Bayesian framework, meaning that our unknown parameter is a random variable
Θ. This means, we will have some belief distribution πΘ (θ) (think of this as a density function over all
possible values of the parameter), and after observing data x, we will have a new/updated belief distribution
πΘ (θ | x). Let’s see a picture of what MAP is going to do first, before getting more into the math and
formalism.
Example(s)
We’ll see the idea of MAP being applied to our typical coin example. Suppose we are trying to
estimate the unknown parameter for the probability of heads on a coin: that is, θ in Ber(θ). We
are going to treat the parameter as a random variable (before in MLE/MoM we treated it as a fixed
unknown quantity), so we’ll call it Θ (capitalized θ).
1. We must have a prior belief distribution πΘ (θ) over possible values that Θ could
take on.
The range of Θ in our case is ΩΘ = [0, 1], because the probability of heads must be in
this interval. Hence, when we plot the density function of Θ, the x-axis will range from 0 to 1.
On a piece of paper, please sketch a density function that you might have for this probability of
heads without yet seeing any data (coin flips). There are two reasonable shapes for this PDF:
• The Unif(0, 1) = Beta(1, 1) distribution (left picture below).
• Some Beta distribution where α = β, since most coins in this world are fair. Let’s say
Beta(11, 11); meaning we pretend we’ve seen 10 heads and 10 tails (right picture below).
2. We will collect data x = (x1, . . . , xn). Again, for the Bernoulli distribution, these will be a sequence of n 1's and 0's representing heads or tails. Suppose we observed n = 30 samples, in which ∑_{i=1}^n x_i = 25 were heads and n − ∑_{i=1}^n x_i = 5 were tails.
3. We will combine our prior knowledge and the data to create a posterior belief
distribution πΘ (θ | x).
Sketch two density functions for this posterior: one using the Beta(1, 1) prior above, and one
using the Beta(11, 11) prior above. We’ll compare these.
• If our prior distribution was Θ ∼ Beta(1, 1) (meaning we pretend we didn’t see anything
yet), then our posterior distribution should be Θ | x ∼ Beta(26, 6) (meaning we saw 25
heads and 5 tails total).
• If our prior distribution was Θ ∼ Beta(11, 11) (meaning pretend we saw 10 heads and 10
tails beforehand), then our posterior distribution should be Θ | x ∼ Beta(36, 16) (meaning
we saw 35 heads and 15 tails total).
4. We’ll give our MAP estimate as the mode of this posterior distribution. Hence,
the name “Maximum a Posteriori”.
• If we used the Θ ∼ Beta(1, 1) prior, we ended up with the Θ | x ∼ Beta(26, 6) posterior, and our MAP estimate is defined to be the mode of the distribution, which occurs at θ̂MAP = 25/30 ≈ 0.833 (left picture above). You may notice that this would give the same as the MLE: we'll examine this more later!
• If we used the Θ ∼ Beta(11, 11) prior, we ended up with the Θ | x ∼ Beta(36, 16) posterior, and our MAP estimate is defined to be the mode of the distribution, which occurs at θ̂MAP = 35/50 = 0.70 (right picture above).
Hopefully you now see the process and idea behind MAP: We have a prior belief on our unknown
parameter, and after observing data, we update our belief distribution and take the mode (most likely
value)! Our estimate definitely depends on the prior distribution we choose (which is often arbitrary).
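Here is a minimal sketch of this update in plain Python (the conjugate-update rule Beta(α, β) → Beta(α + heads, β + tails) is derived formally in the next subsection; the function name is ours):

def beta_map(alpha, beta, heads, tails):
    # Posterior Beta parameters and MAP estimate after observing coin flips.
    a_post, b_post = alpha + heads, beta + tails         # conjugate update
    mode = (a_post - 1) / ((a_post - 1) + (b_post - 1))  # mode of Beta(a_post, b_post)
    return a_post, b_post, mode

print(beta_map(1, 1, 25, 5))    # Beta(26, 6) posterior, MAP = 25/30 ~ 0.833
print(beta_map(11, 11, 25, 5))  # Beta(36, 16) posterior, MAP = 35/50 = 0.70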
7.5.1.2 Derivation
We chose a Beta prior, and ended up with a Beta posterior, which made sense intuitively given our definition
of the Beta distribution. But how do we prove this? We’ll see the math behind MAP now (quite short), and
see the same example again but mathematically rigorous now.
MAP Idea: Actually, unknown parameter(s) is a random variable Θ. We have a prior distribution (prior
belief on Θ before seeing data) πΘ (θ) and posterior distribution (given data; updated belief on Θ after
observing some data) πΘ (θ | x).
By Bayes’ Theorem,
Recall that πΘ is just a PDF or PMF over possible values of Θ. In other words, now we are maximizing
the posterior distribution πΘ (θ | x), where Θ has a PMF/PDF. That is, we are finding the mode of the
density/mass function. Note that since the denominator P (x) in the expression above does not depend on
θ, we can just maximize the numerator L(x | θ)πΘ (θ)! Therefore:
Let x = (x1 , . . . , xn ) be iid realizations from probability mass function pX (t ; Θ = θ) (if X discrete),
or from density fX (t ; Θ = θ) (if X continuous), where Θ is the random variable representing the
parameter (or vector of parameters). We define the Maximum A Posteriori (MAP) estimator θ̂M AP
of Θ to be the parameter which maximizes the posterior distribution of Θ given the data:

θ̂MAP = arg max_θ πΘ(θ | x) = arg max_θ L(x | θ) πΘ(θ)

That is, it's exactly the same as maximum likelihood, except instead of just maximizing the likelihood, we are maximizing the likelihood multiplied by the prior!
Now we’ll see a similar coin-flipping example, but deriving the MAP estimate mathematically and building
even more intuition. I encourage you to try each part out before reading the answers!
7.5.1.3 Example
Example(s)
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is
unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
(c) Assume Θ is restricted as in part (b) (but now a random variable for MAP). Suppose we have
a (discrete) prior πΘ (0.2) = 0.1, πΘ (0.5) = 0.01, and πΘ (0.7) = 0.89. What is the MAP for θ?
(d) Show that we can make the MAP whatever we like, by finding a prior over {0.2, 0.5, 0.7} so
that the MAP is 0.2, another so that it is 0.5, and another so that it is 0.7.
(e) Typically, for the Bernoulli/Binomial distribution, if we use MAP, we want to be able to get
any value ∈ (0, 1), not just ones in a finite set such as {0.2, 0.5, 0.7}. So we need a (continuous)
prior distribution with range (0, 1) instead of our discrete one. We assign Θ ∼ Beta(α, β)
with parameters α, β > 0 and density πΘ(θ) = (1/B(α, β)) θ^{α−1} (1 − θ)^{β−1} for θ ∈ (0, 1). Recall the mode of a W ∼ Beta(α, β) random variable is (α − 1)/((α − 1) + (β − 1)) (the mode is the value with highest density, arg max_w fW(w)).
Suppose x1, . . . , xn are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is k/n, where k = ∑_{i=1}^n x_i (the total number of successes). Show that the posterior πΘ(θ | x) has a Beta(k + α, n − k + β) distribution, and find the MAP estimator.
(f) Recall that Beta(1, 1) ≡ Unif(0, 1) (pretend we saw 1 − 1 heads and 1 − 1 tails ahead of time).
If we used this as the prior, how would the MLE and MAP compare?
(g) Since the posterior is also a Beta Distribution, we call Beta the conjugate prior to the Bernoul-
li/Binomial distribution’s parameter p. Interpret α, β as to how they affect our estimate. This
is a really special property: if the prior distribution multiplied by the likelihood results in a pos-
terior distribution in the same family (with different parameters), then we say that distribution
is the conjugate prior to the distribution we are estimating.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP?
What does this say about our prior when n is small, or n is large?
(i) Which do you think is ”better”, MLE or MAP?
Solution
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is unrestricted;
that is, θ ∈ (0, 1). What is the MLE for θ?
• Answer: 2/5. We just find the likelihood of the data, which is the probability of observing 2 heads and 3 tails, and find the θ that maximizes it.

L(x | θ) = θ² (1 − θ)³

θ̂MLE = arg max_{θ∈[0,1]} θ² (1 − θ)³ = 2/5
(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
• Answer: 0.5. We need to find which of the three acceptable θ values maximizes the likelihood,
and since there are only finitely many, we can just plug them all in and compare!
(e) Suppose x1, . . . , xn are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is k/n, where k = ∑_{i=1}^n x_i (the total number of successes). Show that the posterior πΘ(θ | x) has a Beta(k + α, n − k + β) distribution, and find the MAP estimator.

• Answer: θ̂MAP = (k + (α − 1))/(n + (α − 1) + (β − 1)). We first have to write out what the posterior distribution is,
which is proportional to just the prior times the likelihood:
πΘ(θ | x) ∝ L(x | θ) · πΘ(θ)
= (n choose k) θ^k (1 − θ)^{n−k} · (1/B(α, β)) θ^{α−1} (1 − θ)^{β−1}
∝ θ^{(k+α)−1} (1 − θ)^{(n−k+β)−1}
The first to second line comes from noticing L(x | θ) is just the probability of seeing exactly k
successes out of n (binomial PMF), and plugging in our equation for πΘ (beta density). The
second to third line comes from dropping the normalizing constants (that don’t depend on θ),
which we can do because we only care to maximize this over θ. If you stare closely at that last equation, it is actually proportional to the PDF of a Beta distribution with different parameters!
Our posterior is hence Beta(k + α, n − k + β) since PDFs uniquely define a distribution (there
is only one normalizing constant that would make it integrate to 1). The MAP estimator is the
mode of this posterior Beta distribution, which is given by the formula:
θ̂MAP = (k + α − 1)/((k + α − 1) + (n − k + β − 1)) = (k + (α − 1))/(n + (α − 1) + (β − 1))
Try staring at this to see why this might make sense. We’ll explain it more in part (g)!
(f) Recall that Beta(1, 1) ≡ Unif(0, 1) (pretend we saw 1 − 1 heads and 1 − 1 tails ahead of time). If we
used this as the prior, how would the MLE and MAP compare?
• Answer: They would be the same! From our previous question, if α = β = 1, then
θ̂MAP = (k + (α − 1))/(n + (α − 1) + (β − 1)) = k/n = θ̂MLE
This is because we don’t have any prior information essentially, by saying each value is equally
likely!
(g) Since the posterior is also a Beta Distribution, we call Beta the conjugate prior to the Bernoulli/Bi-
nomial distribution’s parameter p. Interpret α, β as to how they affect our estimate. This is a really
special property: if the prior distribution multiplied by the likelihood results in a posterior distribution
in the same family (with different parameters), then we say that distribution is the conjugate prior to
the distribution we are estimating.
• Answer: The interpretation is: pretend we saw α − 1 heads ahead of time, and β − 1 tails ahead
of time. Then our total number of heads is k + (α − 1) (real + fake) and our total number of
trials is n + (α + β − 2) (real + fake), so that’s our estimate! That’s how prior information was
factored in to our estimator, rather than just using what we actually saw in the data.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP? What
does this say about our prior when n is small, or n is large?
• Answer: They become equal! The prior is important if we don’t have much data, but as we get
more, the evidence overwhelms the prior. You can imagine that if we only flipped the coin 5
times, the prior would play a huge role in our estimate. But if we flipped the coin 10,000 times,
any (small) prior wouldn’t really change our estimate.
(i) Which do you think is “better”, MLE or MAP?
• Answer: There is no right answer. There are two main schools in statistics: Bayesians and
Frequentists.
• Frequentists prefer MLE since they don’t believe you should be putting a prior belief on anything,
and you should only make judgment based on what you’ve seen. They believe the parameter
being estimated is a fixed quantity.
• On the other hand, Bayesians prefer MAP, since they can incorporate their prior knowledge into
the estimation. Hence the parameter being estimated is a random variable, and we seek the
mode - the value with the highest probability or density. An example would be estimating the
probability of heads of a coin - is it reasonable to assume it is more likely fair than not? If so,
what distribution should we put on the parameter space?
• Anyway, in the long run, the prior “washes out”, and the only thing that matters is the likelihood;
the observed data. For small sample sizes like this, the prior significantly influences the MAP
estimate. However, as the number of samples goes to infinity, the MAP and MLE are equal.
7.5.2 Exercises
1. Let x = (x1 , . . . , xn ) be iid samples from Exp(Θ) where Θ is a random variable (not fixed). Note that
the range of Θ should be ΩΘ = [0, ∞) (the average rate of events per unit time), so any prior we choose
should have this range.
(a) Using the prior Θ ∼ Gamma(r, λ) (for some arbitrary but known parameters r, λ > 0), show that
the posterior distribution Θ | x also follows a Gamma distribution and identify its parameters (by
computing πΘ (θ | x)). Then, explain this sentence: “The Gamma distribution is the conjugate
prior for the rate parameter of the Exponential distribution”. Hint: This can be done in just a
few lines!
(b) Now derive the MAP estimate for Θ. The mode of a Gamma(s, ν) distribution is (s − 1)/ν. Hint: This should be just one line using your answer to part (a).
(c) Explain how this MAP estimate differs from the MLE estimate (recall for the Exponential distribution it was just the inverse sample mean n/∑_{i=1}^n x_i), and provide an interpretation of r and λ as to how they affect the estimate.
Solution:
(a) Remember that the posterior is proportional to likelihood times prior, and the density of Y ∼ Exp(θ) is fY(y | θ) = θ e^{−θy}:

πΘ(θ | x) ∝ L(x | θ) · πΘ(θ) = (∏_{i=1}^n θ e^{−θ x_i}) · (λ^r/Γ(r)) θ^{r−1} e^{−λθ} ∝ θ^{(n+r)−1} e^{−(λ + ∑ x_i)θ}

Therefore Θ | x ∼ Gamma(n + r, λ + ∑ x_i), since the final line above is proportional to the PDF of the Gamma distribution (minus the normalizing constant).
It is the conjugate prior because, assuming a Gamma prior for the Exponential likelihood, we
end up with a Gamma posterior. That is, the prior and posterior are in the same family of
distributions (Gamma) with different parameters.
(b) Just citing the mode of a Gamma given above, we get

θ̂MAP = (n + r − 1)/(λ + ∑ x_i)
(c) We see how the estimate changes from the MLE of θ̂MLE = n/∑ x_i: pretend we saw r − 1 extra events over λ extra units of time. (Instead of waiting for n events, we waited for n + r − 1, and instead of ∑ x_i as our total time, we now have λ + ∑ x_i units of time.)
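A one-function sketch of this estimator (plain Python; the sample values and prior parameters below are made up for illustration):

def exp_map(xs, r, lam):
    # MAP estimate of the Exponential rate under a Gamma(r, lam) prior:
    # the mode of the Gamma(n + r, lam + sum(xs)) posterior.
    n = len(xs)
    return (n + r - 1) / (lam + sum(xs))

xs = [0.8, 1.3, 0.4, 2.1, 0.9]    # hypothetical iid Exp(theta) observations
print(exp_map(xs, r=3, lam=2.0))  # pretend we saw r - 1 = 2 extra events over lam = 2 extra units of time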
Chapter 7. Statistical Estimation
7.6: Properties of Estimators I
Slides (Google Drive) Video (YouTube)
Now that we have all these techniques to compute estimators, you might be wondering which one is the
“best”. Actually, a better question would be: how can we determine which estimator is “better” (rather
than the technique)? There are even more different ways to estimate besides MLE/MoM/MAP, and in
different scenarios, different techniques may work better. In these notes, we will consider some properties of
estimators that allow us to compare their “goodness”.
7.6.1 Bias
The first estimator property we’ll cover is Bias. The bias of an estimator measures whether or not in
expectation, the estimator will be equal to the true parameter.
The bias of an estimator θ̂ of θ is defined as Bias(θ̂, θ) = E[θ̂] − θ. If
• Bias(θ̂, θ) = 0, or equivalently E[θ̂] = θ, then we say θ̂ is an unbiased estimator of θ.
• Bias(θ̂, θ) > 0, then θ̂ typically overestimates θ.
• Bias(θ̂, θ) < 0, then θ̂ typically underestimates θ.
Example(s)
First, recall that, if x1 , ..., xn are iid realizations from Poi(θ), then the MLE and MoM were both the
sample mean.
θ̂ = θ̂MLE = θ̂MoM = (1/n) ∑_{i=1}^n x_i

What is the bias of θ̂?
Solution

E[θ̂] = E[(1/n) ∑_{i=1}^n x_i]
= (1/n) ∑_{i=1}^n E[x_i]    [LoE]
= (1/n) ∑_{i=1}^n θ    [E[Poi(θ)] = θ]
= (1/n) n θ
= θ
This makes sense: the average of your samples should be “on-target” for the true average!
Example(s)
First, recall that, if x1, ..., xn are iid realizations from (continuous) Unif(0, θ), then

θ̂MLE = xmax,  θ̂MoM = 2 · (1/n) ∑_{i=1}^n x_i
Sure, θ̂M LE maximizes the likelihood, so in a way θ̂M LE is better than θ̂M OM . But, what are the biases
of these estimators? Before doing any computation: do you think θ̂M LE and θ̂M oM are overestimates,
underestimates, or unbiased?
Solution I actually think θ̂M oM is spot-on since the average of the samples should be close to θ/2, and mul-
tiplying by 2 would seem to give the true θ. On the other hand, θ̂M LE might be a bit of an underestimate,
since we probably wouldn’t have θ be exactly the largest (maybe a little larger).
E[θ̂MLE] = E[Xmax] = ∫_0^θ y · (n y^{n−1}/θ^n) dy = (n/θ^n) ∫_0^θ y^n dy = (n/θ^n) [y^{n+1}/(n + 1)]_0^θ = (n/(n + 1)) θ
This makes sense because if I had 3 samples from Unif(0, 1) for example, I would expect them at 1/4, 2/4, 3/4, and so n/(n + 1) would be my expected max. Similarly, if I had 4 samples, then I would expect them at 1/5, 2/5, 3/5, 4/5, and so n/(n + 1) would again be my expected max.
Finally,

Bias(θ̂MLE, θ) = E[θ̂MLE] − θ = (n/(n + 1)) θ − θ = −(1/(n + 1)) θ

E[θ̂MoM] = E[2 · (1/n) ∑_{i=1}^n x_i] = (2/n) ∑_{i=1}^n E[x_i] = (2/n) n (θ/2) = θ

Bias(θ̂MoM, θ) = E[θ̂MoM] − θ = θ − θ = 0
• Analysis of Results
This means that θ̂M LE typically underestimates θ and θ̂M OM is an unbiased estimator of θ. But some-
thing isn’t quite right...
Suppose, for example, our samples were x = (1, 9, 2). Then

θ̂MLE = max{1, 9, 2} = 9,  θ̂MoM = (2/3)(1 + 9 + 2) = 8
However, based on our sample, the MoM estimator is impossible. If the actual parameter were 8, then
that means that the distribution we pulled the sample from is Unif(0, 8), in which case the likelihood
that we get a 9 is 0. But we did see a 9 in our sample. So, even though θ̂M OM is unbiased, it still
yields an impossible estimate. This just goes to show that finding the right estimator is actually quite
tricky.
A good solution would be to “de-bias” the MLE by scaling it appropriately. If you decided to have a
new estimator based on the MLE:
θ̂ = ((n + 1)/n) θ̂MLE
you would now get an unbiased estimator that can’t be wrong! But now it does not maximize the
likelihood anymore...
Actually, the MLE is what we say to be “asymptotically unbiased”, meaning unbiased in the limit.
This is because

Bias(θ̂MLE, θ) = −(1/(n + 1)) θ → 0
as n → ∞. So usually we might just leave it because we can’t seem to win...
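A quick simulation makes these biases visible (a sketch; θ = 10 and n = 5 are arbitrary, and a small n makes the MLE's bias easier to see):

import numpy as np

rng = np.random.default_rng(seed=312)
theta, n, trials = 10.0, 5, 100_000

x = rng.uniform(0, theta, size=(trials, n))  # many independent samples of size n
mle = x.max(axis=1)                          # x_max for each trial
mom = 2 * x.mean(axis=1)                     # twice the sample mean for each trial

print(mle.mean())                  # ~ n/(n+1) * theta = 8.33: underestimates
print(mom.mean())                  # ~ theta = 10: unbiased
print(((n + 1) / n * mle).mean())  # de-biased MLE, also ~ 10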
Example(s)
Recall that if x1 , . . . , xn ∼ Exp(θ) are iid, our MLE and MoM estimates were both the inverse sample
mean:
θ̂ = θ̂MLE = θ̂MoM = 1/x̄ = n/∑_{i=1}^n x_i
What can you say about the bias of this estimator?
Solution
E[θ̂] = E[n/∑_{i=1}^n x_i]
≥ n/∑_{i=1}^n E[x_i]    [Jensen's inequality]
= n/∑_{i=1}^n (1/θ)    [E[Exp(θ)] = 1/θ]
= n/(n (1/θ))
= θ
The inequality comes from Jensen's (section 6.3): since g(x1, . . . , xn) = 1/∑_{i=1}^n x_i is convex (at least in the positive octant when all x_i ≥ 0), we have that E[g(x1, . . . , xn)] ≥ g(E[x1], E[x2], . . . , E[xn]). It is convex for a reason similar to why 1/x is a convex function. So E[θ̂] ≥ θ systematically, and we typically have an overestimate.
This is just the definition of variance applied to the random variable θ̂ and isn’t actually a new definition.
But maybe instead of just computing the variance, we want a slightly different metric, which instead measures the squared difference of our estimator from the true parameter θ and not just from its own expectation:
E[(θ̂ − θ)²]

We call this property the mean squared error (MSE), and it is related to both bias and variance! Look closely at the difference: if θ̂ is unbiased, then E[θ̂] = θ and the MSE and variance are actually equal!
This leads to what is known as the “Bias-Variance Tradeoff” in machine learning and statistics. Usually, we
want to minimize MSE, and these two quantities are often inversely related. That is, decreasing one leads
to an increase in the other, and finding the balance will minimize the MSE. It’s hard to see why that might
be the case since we aren’t working with as complex of estimators (we’re just learning the basics!).
Proof of Alternate MSE Formula. We will prove that MSE(θ̂, θ) = Var(θ̂) + Bias(θ̂, θ)².

MSE(θ̂, θ) = E[(θ̂ − θ)²]    [def of MSE]
= E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²]    [add and subtract E[θ̂]]
= E[(θ̂ − E[θ̂])²] + 2 E[θ̂ − E[θ̂]] (E[θ̂] − θ) + (E[θ̂] − θ)²    [(a + b)² = a² + 2ab + b², and E[θ̂] − θ is a constant]
= Var(θ̂) + 0 + Bias(θ̂, θ)²    [def of var and bias, E[θ̂ − E[θ̂]] = 0]
Example(s)
First, recall that, if x1 , ..., xn are iid realizations from Poi(θ), then the MLE and MoM were both the
sample mean.
θ̂ = θ̂MLE = θ̂MoM = (1/n) ∑_{i=1}^n x_i

What is the MSE of θ̂?
Solution To compute the MSE, let's compute the bias and variance separately. Earlier, we showed that

Bias(θ̂, θ) = E[θ̂] − θ = θ − θ = 0

The variance of the sample mean is Var(θ̂) = Var(x_i)/n = θ/n (since the variance of Poi(θ) is θ), so MSE(θ̂, θ) = Var(θ̂) + Bias(θ̂, θ)² = θ/n.
We’ll discuss even more desirable properties of estimators. Last time we talked about bias, variance, and
MSE. Bias measured whether or not, in expectation, our estimator was equal to the true value of θ. MSE
measured the expected squared difference between our estimator and the true value of θ. If our estimator
was unbiased, then the MSE of our estimator was precisely the variance.
7.7.1 Consistency
Definition 7.7.1: Consistency
An estimator θ̂n of θ (based on n samples) is consistent if, for any ε > 0,

lim_{n→∞} P(|θ̂n − θ| > ε) = 0

That is, as we get more samples, the probability that our estimator is more than ε away from the true parameter goes to 0.
Example(s)
Recall that, if x1 , ..., xn are iid realizations from (continuous) Unif(0, θ), then
θ̂n = θ̂n,MoM = 2 · (1/n) ∑_{i=1}^n x_i

Show that θ̂n,MoM is consistent.
Solution
Since θ̂n is unbiased, we have that
P(|θ̂n − θ| > ε) = P(|θ̂n − E[θ̂n]| > ε)
because we can replace θ with the expected value of the estimator. Now, we can apply Chebyshev’s inequality
(6.1) to see that
P(|θ̂n − E[θ̂n]| > ε) ≤ Var(θ̂n)/ε²
Now, we can take out the 2² from the estimator's expression and are left only with the variance of the sample
mean, which is always just σ²/n = Var(x_i)/n:

P(|θ̂n − E[θ̂n]| > ε) ≤ Var(θ̂n)/ε² = 2² Var((1/n) ∑_{i=1}^n x_i)/ε² = 4 Var(x_i)/(n ε²)

Since Var(x_i) = θ²/12 is finite, this bound goes to 0 as n → ∞, so θ̂n,MoM is consistent.
Example(s)
Recall that, if x1, ..., xn are iid realizations from (continuous) Unif(0, θ), then

θ̂n = θ̂n,MLE = max{x1, . . . , xn}

Show that θ̂n,MLE is consistent.
Solution
In this case, we cannot use Chebyshev’s inequality unfortunately, because the maximum likelihood estimator
is not unbiased. The CDF for θ̂n is

F_θ̂n(t) = P(θ̂n ≤ t)
which is the probability that each individual sample is less than t because only in that case will the max be
less than t, and we have independence so we can say
P(θ̂n ≤ t) = P(X1 ≤ t) P(X2 ≤ t) · · · P(Xn ≤ t)
This is just the CDF of X_i to the n-th power, where the CDF of Unif(0, θ) is just t/θ (see the distribution sheet):

F_θ̂n(t) = F_X(t)^n = 0 if t < 0;  (t/θ)^n if 0 ≤ t ≤ θ;  1 if t > θ
There are two ways we can have the absolute value from before be greater than epsilon
P(|θ̂n − θ| > ε) = P(θ̂n > θ + ε) + P(θ̂n < θ − ε)
The first term is 0, because there’s no way our estimator is greater than θ + ε, as it’s never going to be
greater than θ by definition (the samples are between 0 and θ so there’s no way the max of the samples is
greater than θ). So, now we can just use the CDF on the right term, and just plug in for t:
P(θ̂n > θ + ε) + P(θ̂n < θ − ε) = P(θ̂n < θ − ε) = ((θ − ε)/θ)^n if ε < θ;  0 if ε ≥ θ
We can assume that ε is less than θ because we really only care when ε is very very small, so we have that

P(|θ̂n − θ| > ε) = ((θ − ε)/θ)^n
Thus, when we take the limit as n approaches infinity, we see that in the parentheses, we have a number less than 1, and we raise it to the n-th power, so it goes to 0:

lim_{n→∞} P(|θ̂n − θ| > ε) = 0
Now we’ve seen that, even though the MLE and MoM estimators of θ given iid samples from Unif(0, θ) are
different, they are both consistent! That means, as n → ∞, they will both converge to the true parameter
θ. This is clearly a good property of an estimator.
1. For instance, an unbiased and consistent estimator was the MoM for the uniform distribution: θ̂n,M oM =
2x̄. We proved it was unbiased in 7.6, meaning it is correct in expectation. It converges to the true
parameter (consistent) since the variance goes to 0.
2. However, if you ignore all the samples and just take the first one and multiply it by 2, θ̂ = 2X1 , it is
unbiased (as its expectation is 2 · θ/2 = θ), but it's not consistent; our estimator doesn't get better and better with more n
because we’re not using all n samples. Consistency requires that as we get more samples, we approach
the true parameter.
3. Biased but consistent, on the other hand, was the MLE estimator. We showed its expectation was (n/(n + 1)) θ, which is actually "asymptotically unbiased" since E[θ̂n,MLE] = (n/(n + 1)) θ → θ as n → ∞. It does get better and better as n → ∞.
4. Neither unbiased nor consistent would just be some random expression, such as θ̂ = 1/X1².
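To see consistency (and the lack of it) empirically, compare the MoM estimator to the "first sample only" estimator as n grows (a sketch with an arbitrary true θ = 10):

import numpy as np

rng = np.random.default_rng(seed=312)
theta = 10.0

for n in [10, 1_000, 100_000]:
    x = rng.uniform(0, theta, size=n)
    print(n, 2 * x.mean(), 2 * x[0])  # 2 * x-bar approaches theta; 2 * x_1 never improves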
7.7.3 Efficiency
To talk about our last topic, efficiency, we first have to define Fisher Information. Efficiency says that our estimator has as low variance as possible. This property combined with consistency and unbiasedness means that our estimator is on target (unbiased), converges to the true parameter (consistent), and does so as fast as possible (efficient).
Let x = (x1 , ..., xn ) be iid realizations from probability mass function pX (t | θ) (if X is discrete), or
from density function fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters).
The Fisher Information of the parameter θ is defined to be:

I(θ) = E[(∂/∂θ ln L(x | θ))²] = −E[∂²/∂θ² ln L(x | θ)]
where L(x | θ) denotes the likelihood of the data given parameter θ (defined in 7.1). From Wikipedia,
it "is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends".
That written definition is definitely a mouthful, but if you stop and parse it, you’ll see it’s not too bad
to compute. We always take the second derivative of the log-likelihood to confirm that our MLE was a
maximizer; now all you have to do is take the expectation to get the Fisher Information. There’s no way
though that I can interpret the negative expected value of the second derivative of the log-likelihood, it’s
just too gross and messy.
Let x = (x1 , ..., xn ) be iid realizations from probability mass function pX (t | θ) (if X is discrete), or
from density function fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters).
If θ̂ is an unbiased estimator for θ, then
MSE(θ̂, θ) = Var(θ̂) ≥ 1/I(θ)
where I(θ) is the Fisher information defined earlier. What this is saying is, for any unbiased estimator θ̂ for θ, the variance (= MSE) is at least 1/I(θ). If we achieve this lower bound, meaning our variance is exactly equal to 1/I(θ), then we have the best variance possible for our estimate. That is, we have the minimum variance unbiased estimator (MVUE) for θ.
Since we want to find the lowest variance possible, we can look at this through the frame of finding the
estimator’s efficiency.
The efficiency of an unbiased estimator θ̂ of θ is

e(θ̂, θ) = I(θ)^{−1}/Var(θ̂) ≤ 1
This will always be between 0 and 1 because if your variance is equal to the CRLB, then it equals 1,
and anything greater will result in a smaller value. A larger variance will result in a smaller efficiency,
and we want our efficiency to be as high as possible (1).
An unbiased estimator is said to be efficient if it achieves the CRLB - meaning e(θ̂, θ) = 1. That is,
it could not possibly have a lower variance. Again, the CRLB is not guaranteed for biased estimators.
That was super complicated - let’s see how to verify the MLE of Poi(θ) is efficient. It looks scary - but it’s
just messy algebra!
Example(s)
Recall that, if x1, ..., xn are iid realizations from X ∼ Poi(θ) (recall E[X] = Var(X) = θ), then

θ̂ = θ̂MLE = θ̂MoM = (1/n) ∑_{i=1}^n x_i

Is θ̂ efficient?
Solution
First, you have to check that it's unbiased, as the CRLB only holds for unbiased estimators...

E[θ̂] = E[(1/n) ∑_{i=1}^n x_i] = (1/n) ∑_{i=1}^n E[x_i] = θ
...which it is! Otherwise, we wouldn't be able to use this bound. We also need to compute the variance. The variance of the sample mean (the estimator) is just σ²/n, and the variance of a Poisson is just θ.

Var(θ̂) = Var((1/n) ∑_{i=1}^n x_i) = Var(x_i)/n = θ/n
Then, we’re going to compute that weird Fisher Information, which gives us the CRLB, and see if our
variance matches. Remember, we take the second derivative of the log-likelihood, which we did earlier in 7.2
so we’re just going to copy over the answer.
∂²/∂θ² ln L(x | θ) = −∑_{i=1}^n x_i/θ²
Then, we need to take the expected value of this. It turns out, with some algebra, you get −n/θ.

E[∂²/∂θ² ln L(x | θ)] = E[−∑_{i=1}^n x_i/θ²] = −(1/θ²) ∑_{i=1}^n E[x_i] = −(1/θ²) n θ = −n/θ
Our Fisher Information was the negative expected value of the second derivative of the log-likelihood, so
we just flip the sign to get n/θ.
I(θ) = −E[∂²/∂θ² ln L(x | θ)] = n/θ
Finally, our efficiency is the inverse of the Fisher Information over the variance:
e(θ̂, θ) = I(θ)^{−1}/Var(θ̂) = (n/θ)^{−1}/(θ/n) = (θ/n)/(θ/n) = 1
Thus, we’ve shown that, since our efficiency is 1, our estimator is efficient. That is, it has the best pos-
sible variance among all unbiased estimators of θ. This, again, is a really good property that we want to have.
To reiterate, this means we cannot possibly do better in terms of mean squared error. Our bias is 0, and our variance is as low as it can possibly go. The sample mean is unequivocally the best estimator for the parameter of a Poisson distribution, in terms of efficiency, bias, and MSE (it also happens to be consistent, so there are a lot of good things).
As you can see, showing efficiency is just a bunch of tedious calculations!
Chapter 7. Statistical Estimation
7.8: Properties of Estimators III
Slides (Google Drive) Video (YouTube)
The final property of estimators we will discuss is called sufficiency. Just like we want our estimators to be
consistent and efficient, we also want them to be sufficient.
7.8.1 Sufficiency
We first must define what a statistic is.
Definition 7.8.1: Statistic
A statistic is any function T = T(x1, . . . , xn) of the samples x1, . . . , xn.
All estimators are statistics because they take in our n data points and produce a single number. We’ll see
an example which intuitively explains what it means for a statistic to be sufficient.
Suppose we have iid samples x = (x1 , . . . , xn ) from a known distribution with unknown parameter θ. Imagine
we have two people:
• Statistician A: Knows the entire sample, gets n quantities: x = (x1 , . . . , xn ).
• Statistician B: Knows T (x1 , . . . , xn ) = t, a single number which is a function of the samples. For
example, the sum or the maximum of the samples.
Heuristically, T(x1, . . . , xn) is a sufficient statistic if Statistician B can do just as good a job as Statistician A, given "less information". For example, if the samples are from the Bernoulli distribution, knowing T(x1, . . . , xn) = ∑_{i=1}^n x_i (the number of heads) is just as good as knowing all the individual outcomes, since a good estimate would be the number of heads over the number of total trials! Hence, we don't actually care about the ORDER of the outcomes, just how many heads occurred! The word "sufficient" in English roughly means "enough", and so this terminology was well-chosen.
To motivate the definition, we’ll go back to the previous example. Again, statistician A has all the samples
x1, . . . , xn but statistician B only has the single number t = T(x1, . . . , xn). The idea is, Statistician B only knows T = t, but since T is sufficient, doesn't need θ to generate new samples X1′, . . . , Xn′ from the distribution. This is because P(X1 = x1, . . . , Xn = xn | T = t, θ) = P(X1 = x1, . . . , Xn = xn | T = t), and since she knows T = t, she knows the conditional distribution (can generate samples)! Now Statistician B has n iid samples from the distribution, just like Statistician A. So using these samples X1′, . . . , Xn′, statistician B can do just as good a job as statistician A with samples X1, . . . , Xn (on average). So no one is
at any disadvantage. :)
This definition is hard to check, but it turns out that there is a criterion that helps us determine whether a
statistic is sufficient:
Theorem 7.8.38: Neyman-Fisher Factorization Criterion
Let x1 , . . . , xn be iid random samples with likelihood L(x1 , . . . , xn | θ). A statistic T = T (x1 , . . . , xn )
is sufficient if and only if there exist non-negative functions g and h such that:
L(x1 , . . . , xn | θ) = g(x1 , . . . , xn ) · h(T (x1 , . . . , xn ), θ)
That is, the likelihood of the data can be split into a product of two terms: the first term g can
depend on the entire data, but not θ, and the second term h can depend on θ, but only on the
data through the sufficient statistic T . (In other words, T is the only thing that allows the data
x1 , . . . , xn and θ to interact!) That is, we don’t have access to the n individual quantities x1 , . . . , xn ;
just the single number (T , the sufficient statistic).
If you are reading this for the first time, you might not think this is any better...You may be very confused
right now, but let’s see some examples to clear things up!
But basically, you want to split the likelihood into a product of two terms/functions:
1. For the first term g, you are allowed to know each individual sample if you want, but NOT θ.
2. For the second term h, you can only know the sufficient statistic (single number) T (x1 , . . . , xn ) and θ.
You may not know each individual xi .
Example(s)
Let x1 , . . . , xn be iid random samples from Unif(0, θ) (continuous). Show that the MLE θ̂ =
T (x1 , . . . , xn ) = max{x1 , . . . , xn } is a sufficient statistic. (The reason this is true is because we
don’t need to know each individual sample to have a good estimate for θ; we just need to know the
largest!)
Solution We saw the likelihood of this continuous uniform in 7.2, which we’ll just rewrite:
L(x1, . . . , xn | θ) = ∏_{i=1}^n (1/θ) I{x_i ≤ θ} = (1/θ^n) I{x1,...,xn ≤ θ} = (1/θ^n) I{max{x1,...,xn} ≤ θ} = (1/θ^n) I{T(x1,...,xn) ≤ θ}
Choose
g(x1 , . . . , xn ) = 1
and
h(T(x1, . . . , xn), θ) = (1/θ^n) I{T(x1,...,xn) ≤ θ}
Notice there is no need for a g term (that’s why it is = 1), because there is no term in the likelihood
which just has the data (without θ).
For the h term, notice that we just need to know the max of the samples T (x1 , . . . , xn ) to compute h:
we don’t actually need to know each individual xi .
Notice that here the only interaction between the data and parameter θ happens through the sufficient
statistic (the max of all the values).
Example(s)
Let x1, . . . , xn be iid random samples from Poi(θ). Show that T(x1, . . . , xn) = ∑_{i=1}^n x_i is a sufficient statistic, and hence the MLE θ̂ = (1/n) ∑_{i=1}^n x_i is sufficient as well. (The reason this is true is because
we don’t need to know each individual sample to have a good estimate for θ; we just need to know
how many events happened total!)
Solution We take our Poisson likelihood and split it into smaller terms:
L(x1, . . . , xn | θ) = ∏_{i=1}^n e^{−θ} θ^{x_i}/x_i! = (∏_{i=1}^n e^{−θ}) (∏_{i=1}^n θ^{x_i}) (∏_{i=1}^n 1/x_i!) = e^{−nθ} θ^{∑_{i=1}^n x_i}/∏_{i=1}^n x_i!
= (1/∏_{i=1}^n x_i!) · e^{−nθ} θ^{T(x1,...,xn)}
Choose

g(x1, . . . , xn) = 1/∏_{i=1}^n x_i!
and
h(T(x1, . . . , xn), θ) = e^{−nθ} θ^{T(x1,...,xn)}
By the Neyman-Fisher Factorization Criterion, T(x1, . . . , xn) = ∑_{i=1}^n x_i is sufficient. The mean θ̂MLE = (∑_{i=1}^n x_i)/n = T(x1, . . . , xn)/n is as well, since knowing the total number of events and the average number of events is equivalent (since we know n)!
Notice here we had the g term handle some function of only x1 , . . . , xn but not θ.
For the h term though, we do have θ but don’t need the individual samples x1 , . . . , xn to compute h.
Imagine being just given T (x1 , . . . , xn ): now you have enough information to compute h!
Notice that here the only interaction between the data and parameter θ happens through the sufficient
statistic (the sum/mean of all the values). We don’t actually need to know each individual xi .
Example(s)
Let x1, . . . , xn be iid random samples from Ber(θ). Show that T(x1, . . . , xn) = ∑_{i=1}^n x_i is a sufficient statistic, and hence the MLE θ̂ = (1/n) ∑_{i=1}^n x_i is sufficient as well. (The reason this is true is because
we don’t need to know each individual sample to have a good estimate for θ; we just need to know
how many heads happened total!)
Solution The Bernoulli likelihood comes by using the PMF pX(k) = θ^k (1 − θ)^{1−k} for k ∈ {0, 1}. We get this by observing that Ber(θ) = Bin(1, θ).

L(x1, . . . , xn | θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = (∏_{i=1}^n θ^{x_i}) (∏_{i=1}^n (1 − θ)^{1−x_i})
= θ^{∑_{i=1}^n x_i} (1 − θ)^{n − ∑_{i=1}^n x_i}
= θ^{T(x1,...,xn)} (1 − θ)^{n − T(x1,...,xn)}
Choose
g(x1 , . . . , xn ) = 1
and
h(T(x1, . . . , xn), θ) = θ^{T(x1,...,xn)} (1 − θ)^{n − T(x1,...,xn)}
By the Neyman-Fisher Factorization Criterion, T(x1, . . . , xn) = ∑_{i=1}^n x_i is sufficient. The mean θ̂MLE = (∑_{i=1}^n x_i)/n = T(x1, . . . , xn)/n is as well, since knowing the total number of heads and the sample proportion of heads is equivalent (since we know n)!
Notice that here the only interaction between the data and parameter θ happens through the sufficient
statistic (the sum/mean of all the values). We don’t actually need to know each individual xi .
The mean squared error of an estimator θ̂ of θ measures the expected squared error from the true value θ, and decomposes into a bias term and a variance term. This decomposition results in the phrase "Bias-Variance Tradeoff" - sometimes these are opposing forces and minimizing MSE is a result of choosing the right balance.
MSE(θ̂, θ) = E[(θ̂ − θ)²] = Var(θ̂) + Bias²(θ̂, θ)

If θ̂ is an unbiased estimator of θ, then the MSE reduces to just: MSE(θ̂, θ) = Var(θ̂).
An unbiased estimator θ̂ is efficient if it achieves the Cramer-Rao Lower Bound, meaning it has
the lowest variance possible.
e(θ̂, θ) = I(θ)^{−1}/Var(θ̂) = 1 ⟺ Var(θ̂) = 1/I(θ) = 1/(−E[∂²/∂θ² ln L(x | θ)])
Chapter 8. Statistical Inference
8.1: Confidence Intervals
Slides (Google Drive) Video (YouTube)
We’ve talked about several ways to estimate unknown parameters, and desirable properties. But there is just
one problem now: even if our estimator had all the good properties, the probability that our estimator for θ
is exactly correct is 0, since θ is continuous (a decimal number)! We’ll see how we can construct confidence
intervals around our estimator, so that we can argue that θ̂ is close to θ with high probability.
The confidence interval for θ can be illustrated in the below picture. We will explain how to interpret a
confidence interval at a specific confidence level soon.
Note that we can write this in any of the following three equivalent ways, as they all represent the probability
that θ̂ and θ differ by no more than some amount ∆:
P(θ ∈ [θ̂ − ∆, θ̂ + ∆]) = P(|θ̂ − θ| ≤ ∆) = P(θ̂ ∈ [θ − ∆, θ + ∆]) = 0.95
Note the first and third equivalent statements especially (swapping θ̂ and θ).
We have learned about the CDF of normal distribution. If Z ∼ N (0, 1), we denote the CDF Φ(a) = FZ (a) =
P (Z ≤ a), since it’s so commonly used. There is no closed-form formula, so one way to find a z-score
associated with a percentage is to look up in a z-table.
Suppose we want a (centered) interval, where the probability of being in that interval is 95%.
Left bound: the probability of being less than the left bound is 2.5%.
Right bound: the probability of being greater than the right bound is 2.5%. Thus, the probability of being
less than the right bound should be 97.5%.
Note the following two equivalent statements that say that P (Z ≤ 1.96) = 0.975 (where Φ−1 is the inverse
CDF of the standard normal):
Φ(1.96) = 0.975 and Φ⁻¹(0.975) = 1.96
Example(s)
Suppose x1, ..., xn are iid samples from Poi(θ) where θ is unknown. Our MLE and MoM estimates agreed at the sample mean: θ̂ = x̄ = (1/n) ∑_{i=1}^n x_i. Create an interval centered at θ̂ which contains θ with probability 95%.
Solution Recall that if W ∼ Poi(θ), then E[W] = Var(W) = θ, and so our estimator (the sample mean) θ̂ = x̄ has E[θ̂] = θ and Var(θ̂) = Var(x_i)/n = θ/n. Thus, by the Central Limit Theorem, θ̂ is approximately Normally distributed:

θ̂ = (1/n) ∑_{i=1}^n x_i ≈ N(θ, θ/n)
If we standardize, we get that
(θ̂ − θ)/√(θ/n) ≈ N(0, 1)
To construct our 95% confidence interval, we want P(θ ∈ [θ̂ − ∆, θ̂ + ∆]) = 0.95:

P(θ ∈ [θ̂ − ∆, θ̂ + ∆]) = P(θ − ∆ ≤ θ̂ ≤ θ + ∆)    [one of 3 equivalent statements]
= P(−∆ ≤ θ̂ − θ ≤ ∆)
= P(−∆/√(θ/n) ≤ (θ̂ − θ)/√(θ/n) ≤ ∆/√(θ/n))
= P(−∆/√(θ/n) ≤ Z ≤ ∆/√(θ/n))    [CLT]
= 0.95
Because ∆/√(θ/n) represents the right bound, and the probability of being less than the right bound is 97.5% for a 95% interval (see the above picture again). Thus:

∆/√(θ/n) = Φ⁻¹(0.975) = 1.96  ⟹  ∆ = 1.96 √(θ/n)
That is, since θ̂ is normally distributed with mean θ, we just need to find the ∆ so that θ̂ ± ∆ contains 95%
of the area in a Normal distribution. The way to do so is to find Φ−1 (0.975) = 1.96, and go ± 1.96 standard
deviations of θ̂ in each direction!
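The resulting interval is easy to compute numerically (a sketch; the samples are simulated from an arbitrary true θ = 4, and the unknown θ inside √(θ/n) is replaced by the estimate θ̂, a standard plug-in step):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=312)
x = rng.poisson(lam=4.0, size=500)  # pretend these are our observed samples

theta_hat = x.mean()                     # MLE/MoM estimate (sample mean)
z = norm.ppf(0.975)                      # Phi^{-1}(0.975) = 1.96
delta = z * np.sqrt(theta_hat / len(x))  # plug in theta_hat for the unknown theta

print(theta_hat - delta, theta_hat + delta)  # 95% confidence interval for theta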
Definition 8.1.1: Confidence Interval
Suppose you have iid samples x1 ,...,xn from some distribution with unknown parameter θ, and you
have some estimator θ̂ for θ.
A 100(1 − α)% confidence interval for θ is an interval (typically but not always) centered at θ̂, [θ̂ − ∆, θ̂ + ∆], such that the probability (over the randomness in the samples x1, ..., xn) that θ lies in the interval is 1 − α:

P(θ ∈ [θ̂ − ∆, θ̂ + ∆]) = 1 − α
If θ̂ = (1/n) ∑_{i=1}^n x_i is the sample mean, then θ̂ is approximately normal by the CLT, and a 100(1 − α)% confidence interval is given by the formula:

[θ̂ − z_{1−α/2} σ/√n, θ̂ + z_{1−α/2} σ/√n]

where z_{1−α/2} = Φ⁻¹(1 − α/2) and σ is the true standard deviation of a single sample (which may need to be estimated).
It is important to note that this last formula ONLY works when θ̂ is the sample mean (otherwise we can’t
use the CLT); you’ll need to find some other strategy if it isn’t.
If we wanted a 95% interval, then that corresponds to α = 0.05, since 100(1 − α) = 95. We were then
looking up the inverse Phi table at (1 − α/2) = (1 − 0.05/2) = 0.975 to get our desired number of standard
deviations in each direction of 1.96.
If we wanted a 98% interval, then that corresponds to α = 0.02 since 100(1 − α) = 98. We then would
look up Φ−1 (0.99) since 1 − α/2 = 0.99, because if there is to be 98% of the area in the middle, there is 1%
to the left and right!
Example(s)
Suppose x1, ..., xn are iid samples from Ber(θ) with n = 400, and we observe 136 successes. How should we interpret a 99% confidence interval for θ?
Solution
Recall for the Bernoulli distribution Ber(θ), our MLE/MoM estimator was the sample mean:

θ̂ = (1/n) ∑_{i=1}^n x_i = 136/400 = 0.34
A tempting interpretation of a 99% confidence interval is: "There is a 99% probability that θ lies in the interval we computed." This is incorrect because there is no randomness here: θ is a fixed parameter. θ is either in the interval or out of it; there's nothing probabilistic about it.
Correct: If we repeat this process several times (getting n samples each time and constructing different
confidence intervals), about 99% of the confidence intervals we construct will contain θ.
Notice the subtle difference! Alternatively, before you receive samples, you can say that there is a 99%
probability (over the randomness in the samples) that θ will fall into our to-be-constructed confidence interval [θ̂ − ∆, θ̂ + ∆]. Once you plug in the numbers though, you cannot say that anymore.
Chapter 8. Statistical Inference
8.2: Credible Intervals
Slides (Google Drive) Video (YouTube)
Example(s)
Construct an 80% credible interval for Θ (the unknown probability of success) in Ber(Θ), given iid n = 12 samples x = (x1, x2, ..., x12) where ∑_{i=1}^n x_i = 11 (observed 11 successes out of 12). Suppose our prior is Θ ∼ Beta(α = 7, β = 3) (i.e., pretend we saw 6 successes and 2 failures ahead of time).
Solution From lecture 7.5 (MAP), we showed that choosing a Beta prior for Θ leads to a Beta posterior of Θ | x ∼ Beta(11 + 7, 1 + 3) = Beta(18, 4), and our MAP was then (18 − 1)/((18 − 1) + (4 − 1)) = 17/20 (since we saw 17 total successes, and 3 total failures).
We want an interval [a, b] such that P (a ≤ Θ ≤ b) = 0.8
If we look at the Beta PDF, we are looking for such an interval that the probability that we fall in this area
is 80%. If the area is centered, then the area to the left of that should have probability of 10%, and the area
to the right of that should also have probability 10%.
This is equivalent to looking for P(Θ ≤ a) = 0.1 and P(Θ ≤ b) = 0.9. This information is given by the CDF of the Beta distribution. Note that on the x-axis we have the range of the Beta distribution [0, 1], and on the y-axis, we have the cumulative probability of being to the left, obtained by integrating the PDF from above.
Let F_Beta denote the CDF of this Beta(18, 4) distribution. Then, choose a = F_Beta⁻¹(0.1) ≈ 0.7089 and b = F_Beta⁻¹(0.9) ≈ 0.9142, so our credible interval is [0.7089, 0.9142].
Note that the MAP was 17/20 = 0.85, which is not at the center! We could have chosen any a, b where the area between them is 80%, but we set the areas to the left and right to be equal.
In order to compute the inverse CDF, we can use the scipy.stats library as follows:
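(A sketch; beta.ppf is scipy.stats' inverse CDF for the Beta distribution, and the approximate outputs are the values quoted above.)

from scipy.stats import beta

a = beta.ppf(0.1, 18, 4)  # inverse CDF of Beta(18, 4) at 0.1, ~ 0.7089
b = beta.ppf(0.9, 18, 4)  # inverse CDF of Beta(18, 4) at 0.9, ~ 0.9142
print(a, b)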
That’s all there is to it! Just find the PDF/CDF of your posterior distribution (hopefully you chose a
conjugate prior), and look up the inverse CDF at points a and b such that b − a is your desired confidence
level of your credible interval.
Suppose you have iid samples x = (x1 , ..., xn ) from some distribution with unknown parameter Θ.
You are in the Bayesian setting, so you have chosen a prior distribution for the RV Θ.
A 100(1 − α)% credible interval for Θ is an interval [a, b] such that the probability (over the
randomness in Θ) that Θ lies in the interval is 1 − α:
P (Θ ∈ [a, b]) = 1 − α
If we’ve chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli),
the posterior is easy to compute. Say the CDF of the posterior is F_Y. Then, a 100(1 − α)% credible interval is given by

[F_Y⁻¹(α/2), F_Y⁻¹(1 − α/2)]
Again, this is one which has equal area to the left and right of the interval, but there are infinitely
many possible credible intervals you can create.
This is correct because now Θ is a random variable, so it makes sense to say that Θ lies in the interval with some probability!
Contrast this with the interpretation of a confidence interval, where θ is a fixed number.
8.2.3 Exercises
1. Let x = (x1, . . . , xn) be iid samples from Exp(Θ) where Θ is a random variable (not fixed). Recall from section 7.5 Exercise 1 that if we choose the prior distribution Θ ∼ Gamma(r, λ), then the posterior distribution is Θ | x ∼ Gamma(n + r, λ + ∑ x_i).
Suppose n = 13, x̄ = 0.21, r = 7, λ = 12. Construct a 96% credible interval for Θ. To find the
point t such that FT (t) = y for T ∼ Gamma(u, v), call the following function which gets the inverse
CDF:
scipy.stats.gamma.ppf(y, u, 0, 1/v)
Then, verify that the MAP estimate is actually contained in your credible interval.
Solution: Before we call the function, we have to identify what u and v are. Plugging the numbers above into the general posterior we computed earlier, we find Θ | x ∼ Gamma(n + r, λ + ∑ x_i) = Gamma(13 + 7, 12 + 13 · 0.21) = Gamma(20, 14.73).
Since we want a 96% interval, we must look up the inverse CDF at 0.02 and 0.98 (why?).
We write a few lines of code, calling the provided function twice:
>>> from scipy.stats import gamma
>>> gamma.ppf(0.02, 20, 0, 1/14.73)  # inverse CDF of Gamma(20, 14.73) at 0.02
0.809150510196322
>>> gamma.ppf(0.98, 20, 0, 1/14.73)  # inverse CDF of Gamma(20, 14.73) at 0.98
2.0514641398722735
So our 96% credible interval for Θ is [0.809, 2.051]. The MAP estimate, (20 − 1)/14.73 ≈ 1.29, is indeed contained in this interval.
Chapter 8. Statistical Inference
8.3: Hypothesis Testing
Slides (Google Drive) Video (YouTube)
Hypothesis testing allows us to "statistically prove" claims. For example, if a drug company wants to claim
that their new drug reduces the risk of cancer, they might perform a hypothesis test. Or if a company wanted
to argue that their academic prep program leads to a higher SAT score. A lot of business decisions are reliant
on this statistical method of hypothesis testing, and we’ll see how to conduct them properly below.
Let’s start with an example: suppose a magician named Mark claims his coin is fair, yet flipping it 100
times yields 99 heads. Let’s give Mark the benefit of the doubt. We’ll compute the probability that we
observe an outcome at least as extreme as this, given that Mark isn’t lying.
If Mark isn’t lying, then the coin is fair, so the number of heads observed should be X ∼ Bin(100, 0.5),
because there are 100 independent trials and a 50% chance of heads since it’s fair. So, the probability that we
observe at least 99 heads (because we’re looking for something at least as extreme) is the sum of the
probability of 99 heads and the probability of 100 heads. You just sum the Binomial PMF and you get:
P (X ≥ 99) = C(100, 99) (0.5)^99 (1 − 0.5)^1 + C(100, 100) (0.5)^100 = 101/2^100 ≈ 7.96 × 10^{−29} ≈ 0
Basically, if the coin were fair, the probability of what we just observed (99 heads or more) is basically 0.
This is strong statistical evidence that the coin is NOT fair. Our assumption was that the coin is fair, but if
this were the case, observing such an extreme outcome would be extremely unlikely. Hence, our assumption
is probably wrong.
So, this is like a “Probabilistic Proof by Contradiction”!
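As a quick sanity check of this number, one could compute the Binomial tail probability with scipy; this snippet is purely illustrative:

from scipy.stats import binom

# P(X >= 99) for X ~ Bin(100, 0.5); sf(k) returns P(X > k) = P(X >= k + 1)
print(binom.sf(98, 100, 0.5))  # approximately 7.96e-29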
Let’s now formalize the steps of a hypothesis test:
1. Make a claim (like “Airplane food is good”, “Pineapples belong on pizza”, etc.)
• Our example will be that SuperSAT Prep claims that their program helps students perform better
on the SAT. (The average SAT score as of June 2020 was 1059 out of 1600, and the standard
deviation of SAT scores was 210.)
2. Set up a null hypothesis H0 and alternative hypothesis HA .
(a) Alternative hypothesis can be one-sided or two-sided.
• Let µ be the true mean of the SAT scores of students of SuperSAT Prep.
• Our null hypothesis is that H0 : µ = 1059, which is our “baseline”, “no effect”, “benefit of the
doubt”. We’re going to assume that the true mean of our scores is the same as the nationwide
scores (for the sake of contradiction).
• Our alternative hypothesis is what we want to show, which is HA : µ > 1059, or that SuperSAT
Prep is good and that their test takers are (strictly) better off. So, our alternative will assert that
µ > 1059.
• This is called a one-sided hypothesis. The other one-sided hypothesis would be µ < 1059 (if
we wanted to argue that SuperSAT Prep makes students worse off).
• A two-sided hypothesis would be that µ ≠ 1059, because it covers both sides (less than or greater
than). This is if we wanted to argue that SuperSAT Prep makes some difference, for better or
worse.
3. Choose a significance level α (usually α = 0.05 or 0.01).
• Let’s choose α = 0.05 and explain this more later!
4. Collect data.
• We observe 100 students from SuperSAT Prep, x1 , ..., x100 . It turns out, the sample mean of the
scores, x̄, is x̄ = 1113.
5. Compute a p-value, p = P (observing data at least as extreme as ours | H0 is true).
• Again, since we’re assuming H0 is true (that SuperSAT has no effect), our true mean µ is 1059
(again we do this in hopes of reaching a “probabilistic contradiction”). By the CLT, since n = 100
is large, the distribution of the sample mean of 100 samples is approximately normal with mean
1059 and variance 210²/100 (because the variance of a single test taker was given to be σ² = 210²,
and so the variance of the sample mean is σ²/n):
X̄ ≈ N (µ = 1059, σ² = 210²/100)
So, then, the p-value is the probability that an arbitrary sample mean would be at least as extreme
as the one we computed, which was 1113. So, we can just standardize and look up a Φ table like
always, a procedure you know how to do:
p = P (X̄ ≥ x̄) = P ((X̄ − µ)/(σ/√n) ≥ (x̄ − µ)/(σ/√n)) = P (Z ≥ (1113 − 1059)/(210/√100)) = P (Z ≥ 2.14) ≈ 0.0162
6. State your conclusion.
(a) If p < α, “reject” the null hypothesis H0 in favor of the alternative HA . (Because, given that the
null hypothesis is true, the probability of what we saw happening (or something more extreme) is p,
which is less than some small number α.)
(b) Otherwise, “fail to reject” the null hypothesis H0 .
• Since p = 0.0162 < 0.05 = α, we’ll reject the null hypothesis H0 at the α = 0.05 significance level.
We can say that there is strong statistical evidence to suggest that SuperSAT Prep actually helps
students perform better on the SAT.
Notice that if we had chosen α = 0.01 earlier instead of 0.05, we would have a different conclusion:
Since p = 0.0162 > 0.01 = α, we fail to reject the null hypothesis at the α = 0.01 significance
level. There is insufficient evidence to prove that SuperSAT Prep actually helps students perform
better.
Note that we’ll NEVER say we “accept” the null hypothesis. If you recall the coin example,
if we had observed 55 heads instead of 99, that wouldn’t have been improbable. We wouldn’t
have called the magician a liar, but it does NOT imply that p = 0.5. It could have been 0.54 or
0.58, for example.
8.3.4 Exercises
1. You want to determine whether or not more than 3/4 of Americans would vote for George Washington
for President in 2020 (if he were still alive). In a random poll sampling n = 137 Americans, we collected
responses x1 , . . . , xn (each is 1 or 0, according to whether or not they would vote for him). We observe
131 “yes” responses: Σ_{i=1}^n xi = 131. Perform a hypothesis test and state your conclusion.
Solution: We have our claim that “Over 3/4 of Americans would vote for George Washington for
President in 2020 (if he were still alive).”
Let p denote the true proportion of Americans that would vote for Washington. Then our null and
alternative hypotheses are:
H0 : p = 0.75
HA : p > 0.75
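The p-value computation isn't reproduced here; as one illustrative way to get it, we can compute the exact Binomial tail probability under H0 with scipy:

from scipy.stats import binom

# Under H0, the number of "yes" responses is X ~ Bin(137, 0.75).
# The p-value is P(X >= 131); sf(k) returns P(X > k) = P(X >= k + 1).
p_value = binom.sf(130, 137, 0.75)
print(p_value)  # an extremely small number, essentially 0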
With a p-value so close to 0 (and certainly < α = 0.01), we reject the null hypothesis that (only) 75%
of Americans would vote for Washington. There is strong evidence that this proportion is actually
larger.
Note: Again, what we did was: assume p = 0.75 (null hypothesis), then note that the probability of
observing data so extreme (in fact very close to 100% of people), was nearly 0. Hence, we reject this
null hypothesis because what we observed would’ve been so unlikely if it were true.
Application Time!!
9.1.1 Python
For this section only, I’ll ask you to use the slides linked above. There are a lot of great animations and
visualizations! We assume you know some programming language (such as Java or C++) beforehand, and
are merely teaching you the new syntax and libraries.
Python is the language of choice for anything related to scientific computing, data science, and machine
learning. It is also sometimes used for website development, among many other things! It has extremely
powerful syntax and libraries. I came from Java and was adamant about having that be my main language,
but once I saw the elegance of Python, I never went back! I’m not saying that Python is “absolutely better”
than Java, but for our applications involving probability and math, it definitely is!
First, go to the official Python website and download it! Make sure you are using Python 3 and not Python 2
(which is deprecated).
Chapter 9: Applications to Computing
9.2: Probability via Simulation
Slides (Google Drive) Starter Code (GitHub)
9.2.1 Motivation
Even though we have learned several techniques for computing probabilities, and have more to go, it is still
sometimes hard. Imagine I asked the question: “Suppose I randomly shuffle an array of the first 100 integers,
which starts in order: [1, 2, . . . , 100]. What is the probability that exactly 13 end up in their original position?” I’m
not even sure I could solve this problem, and even if I could, it wouldn’t be pretty to set up nor actually type into a
calculator.
But since you are a computer scientist, you can actually avoid computing hard probabilities! You can
even verify that your hand-computed answers are correct using this technique of “Probability via Simulation”.
Suppose a weighted coin comes up heads with probability 1/3. How many flips do you think it will
take for the first head to appear? Use code to estimate this average!
Solution You may think it is just 3, and you would be correct! We’ll see how to prove this mathematically
in chapter 3 actually. But for now, since we don’t have the tools to compute it, let’s use our programming
skills!
The first thing we need to do is to simulate a single coin flip. Recall that to generate a random number, we
use the numpy library in Python:

import numpy as np
np.random.rand()  # returns a single float in the range [0, 1)
This might be a bit tricky: since np.random.rand() returns a random float in [0, 1), the expression
np.random.rand() < p is True with probability exactly p! For example, if p = 1/2, then np.random.rand() < 1/2
happens with probability 1/2, right? In our case, we’ll want p = 1/3, so the comparison will be True with probability
1/3.
This allows us to simulate the event in question: the first “Heads” appears whenever np.random.rand()
returns a value < p. And, if it is ≥ p, the coin flip turned up “Tails”.
The following function allows us to simulate ONCE how long it took to get heads.
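The code itself appeared as a figure; a minimal sketch of that single-simulation function might look like this, where p is the probability of heads:

import numpy as np

def sim_one_game(p=1/3) -> int:
    # flip until the first head appears; return how many flips it took
    flips = 0
    while True:
        flips += 1
        if np.random.rand() < p:  # this flip came up heads (happens w.p. p)
            return flips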
We start with our number of flips being 0. And we keep incrementing flips until we get a head. So this
should return an integer ! We just need to simulate this game many times (call this function many times),
and take the average of our samples! Then, this should give us a good approximation of the true average
time (which happens to be 3)!
The code above is duplicated below, as a helper function. Python is great because you can define functions
inside other functions, only visible to the parent function!
import numpy as np

def coin_flips(p, ntrials=50000) -> float:

    def sim_one_game() -> int:  # internal helper function
        flips = 0
        while True:
            flips += 1
            if np.random.rand() < p:
                return flips

    total_flips = 0
    for i in range(ntrials):
        total_flips += sim_one_game()
    return total_flips / ntrials

print(coin_flips(p=1/3))
Notice the helper function is the exact same as above! All we did was call it ntrials times and return the
average number of flips per trial. This is it! The number 50000 is arbitrary: any large number of trials is
good!
Now to tackle the original problem:
Example(s)
Suppose I randomly shuffle an array of the first 100 integers in order: [1, 2, . . . , 100]. What is the
probability that exactly 13 end up in their original position? Use code to estimate this probability!
Hint: Use np.random.shuffle to shuffle an array randomly.
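The solution code appeared as a figure; here is one possible sketch in the same style as the previous example (the function name is hypothetical):

import numpy as np

def prob_exactly_13_fixed(ntrials=50000) -> float:
    successes = 0
    for _ in range(ntrials):
        arr = np.arange(1, 101)  # the array [1, 2, ..., 100]
        np.random.shuffle(arr)  # shuffle it uniformly at random, in place
        num_fixed = np.sum(arr == np.arange(1, 101))  # how many stayed put
        if num_fixed == 13:
            successes += 1
    return successes / ntrials

print(prob_exactly_13_fixed())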
Take a look and see how similar this was to the previous example!
2. Let’s learn how to use Python and data to approximate quantities that are hard to compute
exactly! By the end of this, we’ll see how long it actually takes to “catch ’em all”! You are given a
file data/pokemon.txt which contains information about several (fictional) Pokemon, such as their
encounter rate and catch rate.
Write your code for the following parts in the provided file: pokemon.py.
(a) Implement the function part_a.
(b) Implement the function part_b.
(c) Implement the function part_c.
9.3.1 Motivation
Have you ever wondered how Gmail knows whether or not an email should be marked as spam? Or how
Alexa/Google Home can answer your free-form questions? How self-driving cars actually work? How social
media platforms recommend friends and people you should follow? How the computer program Deep Blue beat
the chess champion Garry Kasparov? The answer to all of these questions is: machine learning (ML)!
After learning just a tiny bit of probability, we are ready to discover one way to solve one extremely important
type of ML task: classification. In particular, we’ll learn how to take in an email (a string/text), and predict
whether it is “Spam” or “Ham”. We will discuss this further shortly!
It’s okay if you didn’t see the pattern, but we should predict −16; can you figure out why? It seems that the
pattern is to take the number and multiply it by the number of sides in the shape! So for our last row, we
take −4 and multiply it by 4 (the number of sides of the square) to get −16. Sure, there is a possibility that
this isn’t the right function: it is only the simplest explanation we could give. The function could be
some complex polynomial, in which case we would be completely wrong.
This is the idea of (supervised) machine learning (ML): given some training examples, we want to
learn the pattern between the input features and output label and be able to have a computer predict the
label on new/unseen examples. Above, our input features were number and shape. We want the computer
to “learn” just like how we do: with several examples.
Within supervised ML, two of the largest subcategories are regression and classification. Regression
refers to predicting a continuous (decimal) value: for example, predicting house price given features
of the house, or predicting weight from height. Classification, on the other hand, refers to predicting one of a
finite number of classes: for example, predicting whether an email is spam or ham, or whether an image
contains a particular object.
Example(s)
For each of the situations below with a desired output label, identify whether it would be a classifi-
cation or regression task. Then, describe what input features may be useful in making a prediction.
1. Predicting the price of a house.
2. Predicting whether or not a PhD applicant will be admitted.
3. Predicting which of 50 menu items someone will order.
Solution
1. This is a regression task, since we are predicting a continuous number like $310,321.55 or $1,235,998.23.
Some features which would be useful for prediction include: square footage, age, location, number of
bedrooms/bathrooms, number of stories, etc.
2. This is a classification task, since we are predicting one of two outcomes: admitted or not. Features which
may be important are: GPA, SAT score, recommendation letter quality, number of papers published,
number of internships, etc.
3. This is a classification task since we are choosing from one of 50 classes. Important features may
include: past order history, favorite cuisine, dietary restrictions, income, etc.
So how do we write the code to make the decision for us? In the past, people tried writing these classifiers
with a set of rules that they came up with themselves. For example: if the email is over 1000 words, predict “SPAM”;
or if it contains the word “Viagra”, predict “SPAM”. This leads to code which looks like a ton of
if-else statements, and is also not very accurate. In machine learning, we instead come up with a model that learns
a decision-making rule for us! This may not make sense now, but I promise it will soon.
That is, we will reduce an email into a Set of lowercase words and nothing else! We’ll see a potential
drawback to this later, but despite these strong assumptions, the classifier still does a really good job!
Here are some examples of how we take the input string (email) to a Set of standardized words.
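The examples appeared as a figure; here is a small sketch of what this standardization might look like. The exact cleaning rules, such as which punctuation to strip, are an assumption:

import string

def words_in_email(email: str) -> set:
    # lowercase everything, strip punctuation, then split on whitespace
    cleaned = email.lower().translate(str.maketrans('', '', string.punctuation))
    return set(cleaned.split())

print(words_in_email("Buy VIAGRA now!!"))  # {'buy', 'viagra', 'now'}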
The above sounds nice and all, but how do we even begin to compute such a quantity? Let’s try Bayes
Theorem with the Law of Total Probability and see where that gets us!
How does this even help?? This looks way worse than before... Let’s see if we can’t start by figuring out the
“easier” terms, like P (spam). Remember we haven’t even touched our data yet. Let’s assume we were given
five examples of emails with their labels to learn from:
Based on the data only, what would you estimate P (spam) to be? I might guess 3/5, and hope that you
matched that! That is,
P (spam) ≈ (# of spam emails) / (# of total emails)
Similarly, we might estimate
P (ham) ≈ (# of ham emails) / (# of total emails)
to be 2/5 in our case. Great, so we’ve figured out two out of the four terms we needed after using Bayes/LTP.
Now, we might try to similarly guess that
P ({you, buy, viagra} | spam) ≈ (# of spam emails containing all three words) / (# of spam emails)
because our definition of conditional probability came intuitively from equally likely outcomes in 2.1 as
P (A | B) = |A ∩ B| / |B| = P (A ∩ B) / P (B)
But how many spam emails are we going to get that contain all three words? Probably none, or very few.
In general, most emails will be much longer, so there’s almost no chance that an email you are given to learn
from has ALL of the words. This is a problem because it makes this probability 0, which isn’t good for our
model.
The Naive Bayes name comes from two parts. We’ve seen the Bayes part because we used Bayes Theorem
to (attempt to) compute our desired probability. We are at a roadblock now, and now we will make the
“naive” assumption that words are conditionally independent GIVEN the label. That is, for events A, B, C, D,
P (A, B, C | D) = P (A | D) P (B | D) P (C | D)
so in our case,
P ({you, buy, viagra} | spam) ≈ P (you | spam) P (buy | spam) P (viagra | spam)
Each term on the right is most likely nonzero if we have a lot of emails! What should a quantity like
P (you | spam) be? It is 1/3: there is just one spam email out of the three which contains the word “you”. In general,
P (word | spam) ≈ (# of spam emails containing word) / (# of spam emails)
Example(s)
Make a prediction as to whether the email with wordset {you, buy, viagra} is SPAM or HAM, using the Naive Bayes classifier! Do
this by computing P (spam | {you, buy, viagra}) and comparing it to 0.5. Don’t forget to use the
conditional independence assumption!
Solution Combining what we had earlier (Bayes+LTP) with the (naive) conditional independence assumption,
we get
P (spam | {you, buy, viagra}) = [P ({you, buy, viagra} | spam) P (spam)] / [P ({you, buy, viagra} | spam) P (spam) + P ({you, buy, viagra} | ham) P (ham)]
= [P (you | spam) P (buy | spam) P (viagra | spam) P (spam)] / [P (you | spam) P (buy | spam) P (viagra | spam) P (spam) + P (you | ham) P (buy | ham) P (viagra | ham) P (ham)]
We need to compute a bunch of quantities, but notice the left side of the denominator is the same as the
numerator, so we need to compute 8 quantities, 3 of which we did earlier! I’ll just skip to the solution:
P (spam) = 3/5                P (ham) = 2/5
P (you | spam) = 1/3          P (you | ham) = 1/2
P (buy | spam) = 1/3          P (buy | ham) = 0/2
P (viagra | spam) = 3/3       P (viagra | ham) = 1/2
Once we plug in all these quantities, we end up with a probability of 1, because the P (buy | ham) =
0 killed the entire right side of the denominator! It turns out then we should predict spam because
P (spam | {you, buy, viagra}) = 1 > 0.5, and this is correct! We still don’t ever want zeros though, so
we’ll see how we can fix that soon!
Notice how the data (example emails) completely dictated our decision rule, along with Bayes Theorem and
Conditional Independence. That is, we learned from our data, and used it to make conclusions on new
data!
One final thing: to avoid zeros, we will apply the following trick called “Laplace Smoothing”. Before,
we had said that
P (word | spam) ≈ (# of spam emails containing word) / (# of spam emails)
We will now pretend we saw TWO additional spam emails: one which contained the word, and one which
did not. This means instead that we have
P (word | spam) ≈ (# of spam emails containing word + 1) / (# of spam emails + 2)
This will ensure that we don’t get any zeros! For example, P (buy | ham) was previously 0/2 (none of the two
ham emails contained the word “buy”), but now it is (0 + 1)/(2 + 2) = 1/4.
We do not usually apply Laplace smoothing to the label probabilities P (spam) and P (ham) since these will
never be zero anyway (and it wouldn’t make much difference if we did).
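As an illustration, here is how these smoothed estimates might be computed in code; representing emails as word sets follows the earlier sketch, and the example emails below are made up to match the counts in this section:

def estimate_word_prob(word, emails):
    # emails: a list of word sets, all with the same label (e.g., all spam)
    # Laplace smoothing: pretend we saw 2 extra emails, one with the word, one without
    count = sum(1 for email in emails if word in email)
    return (count + 1) / (len(emails) + 2)

spam_emails = [{'you', 'viagra'}, {'buy', 'viagra'}, {'viagra'}]
print(estimate_word_prob('buy', spam_emails))  # (1 + 1) / (3 + 2) = 0.4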
Example(s)
Redo the example from earlier, but now apply Laplace smoothing to ensure no zero probabilities. Do
not apply it to the label probabilities.
Solution Basically, we just take the same numbers from above and add 1 to the numerator and 2 to the
denominator!
P (spam) = 3/5                               P (ham) = 2/5
P (you | spam) = (1 + 1)/(3 + 2) = 2/5       P (you | ham) = (1 + 1)/(2 + 2) = 2/4
P (buy | spam) = (1 + 1)/(3 + 2) = 2/5       P (buy | ham) = (0 + 1)/(2 + 2) = 1/4
P (viagra | spam) = (3 + 1)/(3 + 2) = 4/5    P (viagra | ham) = (1 + 1)/(2 + 2) = 2/4
Plugging these in gives P (spam | {you, buy, viagra}) ≈ 0.7544 > 0.5, so our prediction is unchanged! But it
is better to not have probabilities ever being exactly one or zero, so this solution is preferred!
That’s it for the main idea! We’re almost there now, just some logistics.
Now, how do we measure how good our spam filter is? Suppose we had 1000 labeled emails. If we
trained/learned from these 1000 emails and measured the accuracy on those same emails, surely it would be very good,
right? It’s like taking a practice test and then using that as your actual test: of course you’ll do well! What
we care about is how well the spam filter works on NEW or UNSEEN emails: emails that the spam filter
was not allowed to see/use when estimating those probabilities. This is fair and more realistic, right?
You get practice exams, as many as you want, but you are only evaluated once, on an exam you (hopefully)
haven’t seen before!
Where do we get these new/unseen emails? We actually take our initial 1000 emails and do a train/test
split (usually around 80/20 split). That means, we will use 800 emails to estimate those quantities, and
measure the accuracy on the remaining 200 emails. The 800 emails we learn from are collectively called the
training set, and the 200 emails we test on are collectively called the test set.
This is good because we care how our classifier does on new examples, and so when doing machine learning,
we ALWAYS split our data into separate training/testing sets!
Disclaimer: Accuracy is typically not a good measure of performance for classification. Look into F1-Score
and AUROC instead if you are interested! Since this isn’t a ML class, we will stick with plain accuracy for
simplicity.
9.3.3.5 Summary
Here’s a summary of everything we just learned:
Suppose we are given a set of emails WITH their labels (of spam or ham). We split them into a training
set with around 80% of the data, and a test set with the remaining 20%. From the training set only, we
estimate P (spam), P (ham), and (with Laplace smoothing) P (word | spam) and P (word | ham) for each word.
Now suppose we are given an email with wordset {w1 , . . . , wk } and want to make a prediction. We
compute using Bayes Theorem, the law of total probability, and our naive assumption that words are
conditionally independent given their label to get:
P (spam | {w1 , . . . , wk }) = [P (spam) ∏_{i=1}^k P (wi | spam)] / [P (spam) ∏_{i=1}^k P (wi | spam) + P (ham) ∏_{i=1}^k P (wi | ham)]
To get a fair measure of performance, make predictions using the above procedure on all the
TEST emails and return the overall test accuracy.
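To tie the summary together, here is a compact sketch of the whole pipeline. It is illustrative only: emails are represented as word sets as before, the function names are hypothetical, and edge cases are handled in the simplest possible way.

def train(emails, labels):
    # emails: list of word sets; labels: list of 'spam'/'ham' (the training set)
    spam = [e for e, y in zip(emails, labels) if y == 'spam']
    ham = [e for e, y in zip(emails, labels) if y == 'ham']
    vocab = set().union(*emails)
    return {
        'p_spam': len(spam) / len(emails),
        'p_ham': len(ham) / len(emails),
        # Laplace-smoothed word probabilities, one per word per label
        'w_spam': {w: (sum(w in e for e in spam) + 1) / (len(spam) + 2) for w in vocab},
        'w_ham': {w: (sum(w in e for e in ham) + 1) / (len(ham) + 2) for w in vocab},
    }

def predict(model, email):
    # compare the two numerators (the denominators are equal); skip unseen words
    p_s, p_h = model['p_spam'], model['p_ham']
    for w in email:
        if w in model['w_spam']:
            p_s *= model['w_spam'][w]
            p_h *= model['w_ham'][w]
    return 'spam' if p_s > p_h else 'ham'

def accuracy(model, emails, labels):
    # emails/labels here should be the TEST set for a fair measure
    return sum(predict(model, e) == y for e, y in zip(emails, labels)) / len(emails)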
There is one last problem: when computing these products of word probabilities, we are multiplying a bunch
of numbers between 0 and 1, and so we will get some very, very small number (close to zero). When numbers
get too large on a computer (exceeding around 2^64 − 1), it is called overflow, and results in weird and wrong
arithmetic. Our problem is appropriately named underflow (when the exponent is < −128), as we can’t handle
the precision. For example, if we tried to represent the number 3.2 × 10^{−133}, this would be an underflow problem.
This is the last thing we need to figure out (I promise). Remember that our two probabilities P (spam | {w1 , . . . , wk })
and P (ham | {w1 , . . . , wk }) summed to 1, so we only needed to compute one of them. Let’s go back to computing
both, and just compare which is larger:
P (spam | {w1 , . . . , wk }) = [P (spam) ∏_{i=1}^k P (wi | spam)] / [P (spam) ∏_{i=1}^k P (wi | spam) + P (ham) ∏_{i=1}^k P (wi | ham)]
P (ham | {w1 , . . . , wk }) = [P (ham) ∏_{i=1}^k P (wi | ham)] / [P (spam) ∏_{i=1}^k P (wi | spam) + P (ham) ∏_{i=1}^k P (wi | ham)]
Notice the denominators are equal: they are both just P ({w1 , . . . , wk }). So, P (spam | {w1 , . . . , wk }) >
P (ham | {w1 , . . . , wk }) if and only if the corresponding numerator is greater:
P (spam) ∏_{i=1}^k P (wi | spam) > P (ham) ∏_{i=1}^k P (wi | ham)
Since log is monotone increasing, taking logs of both sides preserves the comparison, and turns the tiny products into manageable sums:
log P (spam) + Σ_{i=1}^k log P (wi | spam) > log P (ham) + Σ_{i=1}^k log P (wi | ham)
And that’s it, problem solved! If our initial quantity (after multiplying 50 word probabilities) was something
like P (spam | {w1 , . . . , wk }) ≈ 10^{−81}, then log P (spam | {w1 , . . . , wk }) ≈ −186.51. There is no chance of
underflow anymore!
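A sketch of the log-space version of the predict function from the earlier sketch:

import math

def predict_log(model, email):
    # compare log-numerators: sums of logs instead of products of probabilities
    log_s = math.log(model['p_spam'])
    log_h = math.log(model['p_ham'])
    for w in email:
        if w in model['w_spam']:
            log_s += math.log(model['w_spam'][w])
            log_h += math.log(model['w_ham'][w])
    return 'spam' if log_s > log_h else 'ham'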
After reading Chapter 7: do you see how MLE/MAP were used here? We used MLE to estimate P (spam)
and P (ham). We also used MAP to estimate all the P (wi | spam) as well, with a Beta(2, 2) prior: pretending
we saw 1 of each success and failure. Naive Bayes actually required us to estimate all these different Bernoulli
parameters, and it’s great to come back and see!
9.4.1 Motivation
Google Chrome has a huge database of malicious URLs, but it takes a long time to do a database lookup
(think of this as a typical Set, but on a different computer than yours). As you may know, Sets have
desirable constant-time lookup, but since this Set isn’t on your computer, the time bottleneck comes
from the communication between the database and your computer. Google wants a quick check in the
web browser itself (on your computer), so a space-efficient data structure must be used.
That is, we want to save both time (not in the typical big-Oh sense) and space. But what will we trade
for it? It turns out we will have limited operations (fewer than a Set), and some probability of error which
turns out to be fine.
9.4.2 Definition
A bloom filter is a probabilistic data structure which only supports the following two operations:
I. add(x): Add an element x to the structure.
II. contains(x): Check if an element x is in the structure. It either returns “definitely not in the set” or
“could be in the set”.
It does not support the following two operations:
I. Delete an element from the structure.
II. Give a collection of elements that are in the structure.
The idea is that we can check our bloom filter if a URL is in the set. The bloom filter is always correct in
saying a URL definitely isn’t in the set, but may have false positives (it may say a URL is in the set when it
isn’t). So most of the time, we get instant time, and only in these rare cases does Chrome have to perform
an expensive database lookup to know for sure.
Suppose we have k bit arrays t1 , . . . , tk each of length m (all entries are 0 or 1), so the total space required
is only km bits or km/8 bytes (as a byte is 8 bits). See below for one with k = 3 arrays of length m = 5:
So regardless of the number of elements n that we want to store in our bloom filter, we use the same
amount of memory! That being said, the higher n is for a fixed k and m, the higher your error rate will be.
Suppose the universe of URL’s is the set U (think of this as all strings with less than 100 characters),
and we have k independent and uniform hash functions h1 , . . . , hk : U → {0, 1, . . . , m − 1}. That is, for
an element x and hash function hi , pretend hi (x) is a discrete Unif(0, m − 1) random variable. Basically,
when we see a new URL, we will add it to one random entry per row of our bloom filter.
See the image below to see how we add the URL “thisisavirus.com” into our bloom filter.
For each of our k = 3 hash functions (corresponding to each row), we hash our URL x as hi (x) to get a
random integer from {0, 1, . . . , 4} (0 to m − 1). It happened that h1 (x) = 2, h2 (x) = 1 and h3 (x) = 4 in this
example: each hash function is independent of the others and chooses a position uniformly at random.
But if we hash the same URL, we will get the same hash. In other words, if I tried to add this URL
one more time, nothing would change because all the entries were already set to 1. Notice we never “unset”
an entry: once a URL sets an entry to 1, it will stay 1 forever.
Now let’s see how the contains function is implemented. When we check whether the URL we just added
is contained in the bloom filter, we should definitely return yes.
We say that a URL x is contained in the bloom filter, if when we apply each hash function hi (x), the
corresponding entries are already set to 1. We added this URL “thisisavirus.com” right before this, so we
are guaranteed that t1 [2] == 1, t2 [1] == 1, and t3 [4] == 1, and so we return TRUE overall! You might now
see how this could lead to false positives: returning TRUE even though the URL was never added! Don’t
worry if not, we’ll see some examples below.
That’s all there is for bloom filters!
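Here is a sketch of what the data structure could look like in code. The k hash functions are simulated with Python's built-in hash for illustration; a real implementation would use proper independent uniform hash functions.

import numpy as np

class BloomFilter:
    def __init__(self, k: int, m: int):
        self.k, self.m = k, m
        self.tables = np.zeros((k, m), dtype=bool)  # k bit arrays of length m

    def _hash(self, x: str, i: int) -> int:
        # stand-in for the i-th uniform hash function h_i(x) in {0, ..., m-1}
        return hash((i, x)) % self.m

    def add(self, x: str) -> None:
        for i in range(self.k):
            self.tables[i, self._hash(x, i)] = True  # never unset, only set

    def contains(self, x: str) -> bool:
        # True means "could be in the set"; False means "definitely not"
        return all(self.tables[i, self._hash(x, i)] for i in range(self.k))

bf = BloomFilter(k=3, m=5)
bf.add("thisisavirus.com")
print(bf.contains("thisisavirus.com"))  # True, guaranteed once added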
328 Probability & Statistics with Applications to Computing 9.4
Example(s)
Solution
Notice that t3 [4] was already set to 1 by the previous entry, and that’s okay! We just leave it set to 1.
Notice here we got a false positive: that means, saying a URL is in the bloom filter when it wasn’t. This
is a tradeoff we make in exchange for using much less space.
9.4.3 Analysis
You might be dying to know, what is the false positive rate (FPR) for a bloom filter, and how should I
choose k and m? These are great questions, and we actually have the tools to figure this out already.
After inserting n distinct URLs to a k × m bloom filter (k hash functions/rows, m columns), suppose
we had a new URL and wanted to check whether it was contained in the bloom filter. The false
positive rate (probability the bloom filter returns True incorrectly), is
(1 − (1 − 1/m)^n)^k
Proof of Bloom Filter FPR. We get a (false) match for a new URL x if in each row i, the bit ti [hi (x)] is
already set to 1.
For i = 1, . . . , k, let Ei be the event that ti [hi (x)] is set to 1 already. Then,
P (false positive) = P (E1 ∩ E2 ∩ · · · ∩ Ek ) = ∏_{i=1}^k P (Ei )
where the last equality is because each hash function is assumed to be independent of the others.
Now, let’s focus on a single row i (all the rows are the “same”). The probability that the bit is set to 1,
P (Ei ), is the probability that at least one of the n URLs hashed to that entry. Seeing “at least one” should
tell you to try the complement instead (otherwise, use inclusion-exclusion)!
So the probability a bit remains at 0 after n entries are added (Ei^C) is
P (Ei^C) = (1 − 1/m)^n
because the probability of missing this bit for a single URL is 1 − 1/m. Hence,
P (Ei ) = 1 − P (Ei^C) = 1 − (1 − 1/m)^n
Finally, combining this result with the previous gives our final answer, since each row has the same probability:
P (false positive) = ∏_{i=1}^k P (Ei ) = (1 − (1 − 1/m)^n)^k
So n, the number of malicious URLs Google Chrome would like to store, should definitely play a
part in how large they choose k and m to be.
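Evaluating the formula numerically is a one-liner; the parameter values below are made up.

def bloom_fpr(n: int, k: int, m: int) -> float:
    # false positive rate after inserting n elements into a k x m bloom filter
    return (1 - (1 - 1/m) ** n) ** k

print(bloom_fpr(n=100_000, k=30, m=900_000))  # FPR for these made-up parameters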
Let’s now see (by example) the kind of time and space improvement we can get.
Example(s)
1. Let’s compare this approach to using a typical Set data structure. Google wants to store
5 million URLs, with each URL taking (on average) 40 bytes. How much space (in MB, 1
MB = 1 million bytes) is required if we store all the elements in a set? How much space (in
MB) is required if we store all the elements in a bloom filter with k = 30 hash functions and
m = 900, 000 buckets? Recall that 1 byte = 8 bits.
2. Let’s analyze the time improvement as well. Let’s say an average Chrome user attempts to
visit 102,000 URLs in a year, only 2,000 of which are actually malicious. Suppose it takes
half a second for Chrome to make a call to the database (the Set), and only 1 millisecond for
Chrome to check containment in the bloom filter. Suppose the false positive rate on the bloom
filter is 3%; that is, if a website is not malicious, the bloom filter will incorrectly report
it as malicious with probability 0.03. What is the time (in seconds) taken if we only use the
database, and what is the expected time taken (in seconds) to check all 102,000 strings if we
used the bloom filter + database combination described earlier?
Solution
1. For the set, we would require 5 million times 40 bytes, for a total of 200 MB.
For the bloom filter, we need just km/8 = 27/8 million bytes, or 3.375 MB. Wow! Note how this doesn’t
depend (directly) at all on how many URLs there are, or on the size of each one, as we just hash each URL to a few bits.
Of course, k and m should increase with n though :) to keep the FPR low.
2. If we only use the database, it will take 102,000 · (1/2) = 51,000 seconds.
If we use the bloom filter + database combination, we will definitely call the bloom filter 102,000 times
at 0.001 seconds each, for a total of 102 seconds. Then, for about 3% of the 100,000 other URLs (3,000
of them), we’ll have to do a database lookup, costing 3,000 · (1/2) = 1,500 seconds. For the 2,000 actually
malicious URLs, we also have to do a database lookup, costing 2,000 · (1/2) = 1,000 seconds. So in total,
102 + 1,500 + 1,000 = 2,602 seconds.
Just take a second to stare at how much memory savings we had (the first part), and the time savings we
had (the second part)!
9.4.4 Summary
Hopefully now you see the pros and cons of bloom filters. We cannot delete from the bloom filter (why?)
nor list out which elements are in it because we never stored the string! Below summarizes the operations
of a bloom filter.
If you imagine coding this up, it’s so short, only a few lines of code! We just saw how probability and
randomness can be used to save space and time, in exchange for accuracy! In our application, we didn’t even
mind the accuracy part because we would just do the lookup in that case just to be certain anyway! We saw
it being used for a data structure, and in our next application, we’ll see it being used for an algorithm.
Randomness just makes our lives (as computer scientists) better, and can lead to elegant and beautiful data
structures and algorithms which often outperform their deterministic counterparts.
9.5.1 Motivation
YouTube wants to count the number of distinct views for a video, but doesn’t want to store all the user
ID’s. How can they get an accurate count of users without doing so? Note: A user can view their favorite
video several times, but should only be counted as one distinct view.
Before we attempt to solve this problem, you should wonder: why should we even care? For one of the
most popular videos on YouTube, let’s say there are N = 2 billion views, with n = 900 million of them being
distinct views. How much space is required to accurately track this number? Well, let’s assume a user ID is
an 8-byte integer. Then, we need 900, 000, 000 × 8 bytes total if we use a Set to track the user IDs, which
requires 7.2 gigabytes of memory for ONE video. Granted, not too many videos have this many views, but
imagine now how many videos there are on YouTube: I’m not sure of the exact number, but I wouldn’t be
surprised if it were in the tens or hundreds of millions, or even higher!
It would be great if we could get the number of distinct views with constant space O(1) instead of lin-
ear space O(n) required by storing all the IDs (let’s say a single 8-byte floating point number instead of 7.2
GB). It turns out we (approximately) can! There is no free lunch of course - we can’t solve this problem
exactly with constant memory. But we can trade this space for some error in accuracy, using the contin-
uous Uniform random variable! That is, we will potentially have huge memory savings, but are okay with
accepting a distinct view count which has some margin of error.
9.5.2 Intuition
This seemingly unrelated calculation will be crucial in tying our algorithm together - I’ll ask for your patience
as we do this. Let U1 , . . . , Um be m iid (independent and identically distributed) RVs from the continuous
Unif(0, 1) distribution. If we take the minimum of these m random variables, what do we “expect” it to be?
That is, if X = min{U1 , . . . , Um }, what is E [X]? Before actually doing the computation, let’s think about
this intuitively and see some pictures.
What these examples are getting at is that the expected value of the smallest of m Unif(0, 1) RVs is
E [X] = E [min{U1 , . . . , Um }] = 1/(m + 1)
I promise this will be the key observation in making this clever algorithm work. If you believed the intuition
above, that’s great! If not, that’s also fine, so I’ll have to prove it to you formally below. Whether you
believe me or not at this point, you are definitely encouraged to read through the strategy as it may come
up many times in your future.
Formally, for x ∈ [0, 1],
1 − FX (x) = P (X > x) = P (min{U1 , . . . , Um } > x) = P (U1 > x, . . . , Um > x) = ∏_{i=1}^m P (Ui > x) = (1 − x)^m
Some of these steps need more justification. For the second equality, we use the fact that the minimum of
numbers is greater than a value if and only if all of them are (think about this). For the next equality, the
probability that all of the Ui > x is just the product of the m probabilities, by our independence assumption.
And finally, for Ui ∼ Unif(0, 1), we know its CDF (look it up in our table) is P (Ui ≤ x) = (x − 0)/(1 − 0) = x, and
so P (Ui > x) = 1 − P (Ui ≤ x) = 1 − x.
I’ll leave it to you to compute the density fX (x) by differentiating the CDF we just computed, and then
using our standard expectation formula (the minimum of numbers in [0, 1] is also in [0, 1]):
E [X] = ∫_0^1 x fX (x) dx
and you should get E [X] = 1/(m + 1) after all this work!
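If you would like empirical reassurance, here is a two-line numpy check; the choice m = 4 is arbitrary.

import numpy as np

# average the min of m = 4 iid Unif(0,1) RVs over many trials: should be near 1/5
mins = np.random.rand(100000, 4).min(axis=1)
print(mins.mean())  # approximately 0.2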
If you are thinking of giving up now, I promise this was the hardest part! The rest of the section should be
(generally) smooth sailing.
Suppose the universe of user ID’s is the set U (think of this as all 8-byte integers), and we have a single
uniform hash function h : U → [0, 1] (i.e., for an user ID y, pretend h(y) is a continuous Unif(0, 1) random
variable). That is, h(y1 ), h(y2 ), ..., h(yk ) for any k distinct elements are iid continuous Unif(0, 1) random
variables, but since the hash function always gives the same output for some given input, h(y1 ) and h(y1 )
are the “same” Unif(0, 1) random variable.
To parse that mess, let’s see two examples. These will also hopefully give us the lightbulb moment!
Example(s)
Consider the following stream of user IDs: 13, 25, 19, 25, 19, 19. From this, there are 3 distinct views
(13, 25, 19) out of 6 total views. The uniform hash function h might give us the following stream of
hashes: 0.51, 0.26, 0.79, 0.26, 0.79, 0.79.
Note that all of these numbers are between 0 and 1 as they should be, as they are supposedly
Unif(0, 1). Note also that for the same user ID, we get the same hash! That is, h(19) will always
return 0.79, h(25) is always 0.26, and so on. Now go back and reread the previous paragraph and see
if it makes more sense.
Example(s)
Consider the same stream of N = 6 elements as the previous example, with n = 3 distinct elements.
1. How many independent Unif(0, 1) RVs are there total: N or n?
2. If we only stored the minimum value every time we received a view, we would store the single
floating point number 0.26 as it is the smallest hash of the six. If we didn’t know n, how
might we exploit 0.26 to get the value of n = 3? Hint: Use the fact we proved earlier that
E [min{U1 , . . . , Um }] = 1/(m + 1), where U1 , . . . , Um are iid.
Solution
1. As you can see, we only have three iid Uniform RVs: 0.26, 0.51, 0.79. So in general, we’ll be taking
the minimum of n (and not N ) RVs.
2. Actually, remember that the expected minimum of n distinct/independent values is approximately
1/(n + 1), as we showed earlier. Our 0.26 isn’t exactly equal to E [X], but it is an estimate for it!
So if we solve
0.26 ≈ E [X] = 1/(n + 1)
we would get that n ≈ 1/0.26 − 1 ≈ 2.846. Rounding this to the nearest integer of 3 actually
gives us the correct answer!
So our strategy is: keep a running minimum (a single floating point number, which ONLY takes 8 bytes).
As we get a stream of user IDs x1 , . . . , xN , hash each one and update the running minimum
if necessary. When we want to estimate n, we just reverse-solve n = round(1/E [X] − 1), and
that’s it! Take a minute to reread this example if necessary, as this is the entire idea!
This is known as the Distinct Elements algorithm! We start our single floating point minimum (called val
below) at ∞, and repeatedly update it. The key observation is that we are only taking the minimum of n iid
Uniform RVs, and NOT N, because h always returns the same value given the same input. Reverse-solving
E [X] = 1/(n + 1) for n gives us an estimate of the number of distinct elements, since val is only an
approximation of E [X]. Note we want to round to the nearest integer because n should be an integer.
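The pseudocode appeared as a figure; a sketch of the algorithm might look like this, where hash_to_unit is a stand-in for the uniform hash function h:

import math

def distinct_elements(stream, hash_to_unit) -> int:
    # hash_to_unit(x) plays the role of h(x): a deterministic "Unif(0, 1)" value
    val = math.inf
    for x in stream:
        val = min(val, hash_to_unit(x))
    # val estimates E[min of n iid Unif(0,1)] = 1/(n+1), so reverse-solve for n
    return round(1 / val - 1)

# example with the hashes from earlier: h(13)=0.51, h(25)=0.26, h(19)=0.79
h = {13: 0.51, 25: 0.26, 19: 0.79}.get
print(distinct_elements([13, 25, 19, 25, 19, 19], h))  # 3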
This algorithm sounds great right? One pass over the data (which is the best we can do in time complexity),
and one single float (which is the best we can do in space complexity)! But you have to remember the
tradeoff is in the accuracy, which we haven’t seen yet.
The reason the previous example was spot-on is because I cheated a little bit. I ensured the three values
0.26, 0.51, 0.79 were close to where they were supposed to be: 0.25, 0.50, 0.75. Actually, it’s most important
that just the minimum is on-target. See the following example for an unfortunate situation.
Example(s)
The uniform hash function h might give us the following stream of N = 7 hashes:
Trace the distinct elements algorithm above by hand and report the value that it will return for our
estimate. Compare it to the true value of n = 4 which is unknown to the algorithm.
Solution
At the end of all the updates, val will be equal to the minimum hash of 0.1. So the estimated number of
distinct elements is
round(1/0.1 − 1) = 9
There are only n = 4 distinct elements though! The reason this time it didn’t work out well for us is that
the minimum value was supposed to be around 1/5 = 0.2, but was actually 0.1. This is not necessarily a
huge difference until we take its reciprocal...
That’s it! The code for this algorithm is actually pretty short and sweet (imagine converting the pseudocode
above into code). If you take a step back and think about what machinery we needed, we needed continuous
RVs: the idea of PDF/CDF, and the Uniform RV. The mathematical/statistical tools we learn have many
applications to computer science; we have several more to go!
Can we do even better? Recall that if X1 , . . . , Xn are iid with mean µ and variance σ², then the sample
mean X̄n = (1/n) Σ_{i=1}^n Xi satisfies
E [X̄n ] = E [(1/n) Σ_{i=1}^n Xi ] = (1/n) Σ_{i=1}^n E [Xi ] = (1/n) · nµ = µ
Var (X̄n ) = (1/n²) Σ_{i=1}^n Var (Xi ) = (1/n²) · nσ² = σ²/n
That is, the sample mean will have the same expectation, but the variance will go down linearly! Why might
this make sense? Well, imagine you wanted to estimate the height of American adults: would you rather
have a sample of 1, 10, or 100 adults? All would be correct in expectation, but the size of 100 gives us more
confidence in our answer!
So if we instead estimate the minimum E [X] = 1/(n + 1) with the average of k minimums instead of just one,
we should get a more accurate estimate for E [X], and hence n, the number of distinct elements, as well!
So, imagine we had k independent hash functions instead of just one: h1 , . . . , hk , and k minimums val1 , val2 , . . . , valk .
Stream → 13 25 19 25 19 19 vali
h1 0.51 0.26 0.79 0.26 0.79 0.79 0.26
h2 0.22 0.83 0.53 0.84 0.53 0.53 0.22
... ... ... ... ... ... ... ...
hk 0.27 0.44 0.72 0.44 0.72 0.72 0.27
Each row represents one hash function hi , and the last column in each row is the minimum for that hash
function. Again, we’re only keeping track of the k floating point minimums in the final column. Now, for
improved accuracy, we just take the average of the k minimums first, before reverse-solving. Imagine k = 3
(so there were no rows in the “. . .” above). Then, a good estimate for the true minimum E [X] is
E [X] ≈ (0.26 + 0.22 + 0.27)/3 = 0.25
So our estimate for n is round(1/0.25 − 1) = 3, which is perfect! Note that we basically combined 3 distinct
elements instances with h1 , h2 , h3 individually from earlier, in a way that reduced the variance! The
individual estimates 0.26, 0.22, 0.27 were varying around 0.25, but their average was even closer!
Now our memory is just O(k) instead of O(1), but we get a better estimate as a result. It is up to you to
determine how you want to tradeoff these two opposing quantities.
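Extending the earlier sketch to k hash functions might look like the following; we average the k running minimums before reverse-solving.

import math
import numpy as np

def distinct_elements_k(stream, hash_fns) -> int:
    # hash_fns: a list of k independent "Unif(0, 1)" hash functions
    vals = [math.inf] * len(hash_fns)
    for x in stream:
        for i, h in enumerate(hash_fns):
            vals[i] = min(vals[i], h(x))
    avg_min = np.mean(vals)  # lower-variance estimate of 1/(n+1)
    return round(1 / avg_min - 1)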
9.5.5 Summary
We just saw today an extremely clever use of continuous RVs, applied to computing! In general, randomness
(the use of a random number generator (RNG)) in algorithms and data structures often can help improve
either the time or space (or both)! We saw earlier with the bloom filter how adding a RNG can save a ton
of space in a data structure. Even if you don’t go on to study machine learning or theoretical CS, you can
see what we’re learning can be applied to algorithms and data structures, arguably the core knowledge of
every computer scientist.
9.6.1 Motivation
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (among other things). In this section, we’ll design MCMC algorithms to solve the following two
problems, and you will be able to solve many more yourself!
• The Knapsack Problem: Suppose you have a knapsack which has some maximum weight capacity.
There are n items with weights w1 , . . . , wn > 0 and values v1 , . . . , vn > 0, and we want to choose the
subset of them that maximizes the total value subject to the weight constraint of the knapsack. How
can we do this?
• The Travelling Salesman Problem (TSP): Suppose you want to find the best route (minimizing
total distance travelled) between the 50 U.S. state capitals that we want to visit! A valid route starts
and ends in the same state capital, and visits each capital exactly once (this is known as the TSP,
and is known to be NP-Hard). We will design an MCMC algorithm for this as well!
As the name suggests, this technique depends a bit on the idea of Markov Chains. Most of this section
then will actually be building up the foundations of Markov Chains, and MCMC will follow soon after. In
fact, you could definitely understand and code up the algorithm without learning this math, but if you care
to know how and why it works (you should), then it is important to learn this math first!
Here are some examples of discrete-time stochastic processes (DTSPs) X0 , X1 , X2 , . . . :
• The temperature in Seattle each day. X0 can be the temperature today, X1 tomorrow, and so on.
• The price of Google stock at the end of each year. X0 can be the final price at the end of the year it
IPO’d, X1 the next, and so on.
• The number of people who come to my store each day. X0 is the number of people who came on the
first day, X1 on the second, and so on.
Consider the following random walk on the graph below. You’ll see what that means through an example!
Suppose we start at node 1, and at each time step, independently step to a neighboring node with equal
probability.
For example, X0 = 1 since at time t = 0, we are at node 1. Then, X1 can be either 2 or 3 (but not 4 or 5
since not neighbors of node 1). And so on. So each Xt just tells us the position we are at at time t, and is
always in the set {1, 2, 3, 4, 5} (for our example anyway).
This DTSP actually has a lot of structure, and is an example of a special type of DTSP called a
Markov Chain: can you think about how this particular setup provides a lot of additional constraints over
a normal DTSP?
Here are three key properties of a Markov Chain, which we will formalize immediately after:
1. We only have finitely many states (5 in our example: {1, 2, 3, 4, 5}). (The stock price or temperature
example earlier could be any real number).
2. We don’t care about the past, given the present. That is, the distribution of where we go next
ONLY depends on where we are currently, and not any past history.
3. The transition probabilities are the same at each step (stationary). That is, whether we are at node 1 at
time t = 0 or t = 152, we are always equally likely to go to node 2 or 3.
3. Has stationary transition probabilities. That is, we always transition from state si to sj with
probability independent of the current time. Hence, due to this property and the previous, the
transitions are governed by n² probabilities: the probability of transitioning from each of the n current
states to each of the n next states. These are stored in a square n × n transition probability
matrix (TPM) P , where Pij = P (Xt+1 = sj | Xt = si ) is the probability of transitioning
from si → sj for any and every time t.
If you’re a bit confused right now, especially with that last bullet point, this is totally normal and means you
are paying attention! Let’s construct the TPM for the graph example earlier to see what it means exactly.
For example, the second entry of the first row is: given that Xt = 1 (we are in state 1 at some time t), what
is the probability of going to state 2 next Xt+1 = 2? It’s 1/2 because from state 1, we are equally likely to
go to state 2 or 3. It isn’t possible to go to states 1, 4, and 5, and that’s why their respective entries are 0.
From state 2, we can only go to states 1 and 4, as you can see from the graph and the TPM. Try filling out
the remaining three rows yourself!
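Reconstructing the full TPM from the neighbor structure described above (rows are the current state, columns the next state):

P =
  [  0   1/2  1/2   0    0  ]
  [ 1/2   0    0   1/2   0  ]
  [ 1/2   0    0   1/2   0  ]
  [  0   1/3  1/3   0   1/3 ]
  [  0    0    0    1    0  ]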
Note that in the last row, from state 5, we MUST go to state 4, and so P54 = 1 and the rest of the row
has zero probability. Also note that each ROW sums to 1, but there is no such constraint on the columns.
That’s because each row is secretly a conditional PMF, right? Given we are in some state si (Xt = si ), the probabilities
of going to the next state Xt+1 must sum to 1.
Example(s)
Now let’s talk about how to compute some probabilities we may be interested in. Nothing here is
“new”: it is all based on your core probability knowledge from the previous chapters! Let’s say we
want to find out the probability we end up at state 5 after two time steps, starting from state 3. That
is, compute P (X2 = 5 | X0 = 3). Try to come up with an “intuitive” answer first, and then show
your work formally.
Solution You might be able to hack your way around to a solution since it is only two time steps: something
like (1/2) · (1/3).
Intuitively, we can either go to state 4 or 1 from state 3 with equal probability. If we went to state 1, there’s
no chance we make it to state 5. If we went to state 4, there’s a 1/3 chance we go to state 5. So our answer
is 1/2 · 1/3 = 1/6. This is just the LTP conditioning on possible middle states!
Now we’ll write this out more generally. The LTP we need is a conditional form though: the LTP says that if
the Bi ’s partition the sample space,
P (A) = Σ_i P (A | Bi ) P (Bi )
But what if we wanted P (A | C)? We just condition everything on C as well to get:
P (A | C) = Σ_i P (A | Bi , C) P (Bi | C)
Applying this with A = {X2 = 5}, C = {X0 = 3}, and Bi = {X1 = i} for i = 1, . . . , 5:
P (X2 = 5 | X0 = 3) = Σ_{i=1}^5 P (X2 = 5 | X1 = i, X0 = 3) P (X1 = i | X0 = 3) = Σ_{i=1}^5 P (X2 = 5 | X1 = i) P (X1 = i | X0 = 3)
The second equality holds because the probability of X2 given both the positions X0 and X1 only depends
on X1, right? Once we know where we are currently, we can forget about the past. But now, we can zero
out several of these terms because P (X1 = i | X0 = 3) = 0 for i = 2, 3, 5. So we are left with just 2 of the 5 terms:
= P (X2 = 5 | X1 = 1) P (X1 = 1 | X0 = 3) + P (X2 = 5 | X1 = 4) P (X1 = 4 | X0 = 3)
If you have the TPM P (we have this above), try looking up the entries to see if you get the same answer!
= P15 P31 + P45 P34 = 0 · (1/2) + (1/3) · (1/2) = 1/6
Back to our random walk example: suppose we weren’t sure where we started. That is, let the row vector
v = (v1 , . . . , v5 ) be such that P (X0 = i) = vi , where vi is the ith element of v (these probabilities sum to 1, because we must
start in one of these 5 positions). Think of this vector v as our belief distribution of where we are at time t = 0.
t = 0. Let’s compute vP , the matrix-product of v and P , the transition probability matrix. We’ll see what
comes out of it after computing and interpreting it! If you haven’t taken linear algebra yet, don’t worry: vP
is the following 5-dimensional row vector:
vP = (Σ_{i=1}^5 Pi1 vi , Σ_{i=1}^5 Pi2 vi , Σ_{i=1}^5 Pi3 vi , Σ_{i=1}^5 Pi4 vi , Σ_{i=1}^5 Pi5 vi )
What does vP represent? Let’s focus on the first entry, and substitute vi = P (X0 = i) and Pi1 =
P (X1 = 1 | X0 = i) (the probability of going from i → 1). We actually get (by LTP over initial states):
Σ_{i=1}^5 Pi1 vi = Σ_{i=1}^5 P (X1 = 1 | X0 = i) P (X0 = i) = P (X1 = 1)
This is an interesting pattern that holds for the remaining entries as well! In fact, the i-th entry of vP is
just P (X1 = i), so overall, the vector vP represents your belief distribution at the next time step!
That is, right-multiplying by the transition matrix P literally transitions your belief distribution from one
time step to the next.
We can also see that, for example, vP² = (vP)P is your belief of where you are after 2 time steps, and by
induction, vP^n is your belief of where you are after n time steps.
A natural question might be then: does vP^n have a limit as n → ∞? That is, after a long time, is there a belief
distribution (5-dimensional row vector) π such that it never changes again? The answer is unfortunately:
it depends. We won’t go into the technical details of when it does and doesn’t exist (search “Fundamental
Theorem of Markov Chains” if you are interested), but this leads us to the following definition:
The stationary distribution of a Markov Chain with n states (if one exists), is the n-dimensional
row vector π (representing a probability distribution: entries which are nonnegative and sum to 1),
such that
πP = π
Intuitively, it means that the belief distribution at the next time step is the same as the distribution
at the current. This typically happens after a “long time” (called the mixing time) in the process,
meaning after lots of transitions were taken.
We’re going to see an example of this visually, which will also help us build our final piece of intuition for
MCMC. Consider the Markov Chain we’ve been using throughout this section:
Here is the distribution v that we’ll start with. Our Markov Chain happens to have a stationary distribution,
so we’ll see what happens as we take vP n for n → ∞ visually.
v = (0.25, 0.45, 0.15, 0.05, 0.10)
Visualized as a heatmap, darker values mean lower probabilities (hence 4 and 5 would be very dark), and
2 would be the lightest value since it has the highest probability.
We’ll then show the distribution after 1 step, 5 steps, 10 steps, and 100 steps. Before we continue, what
do you think the fifth entry will look like after one time step, the probability of being in node 5? Actually,
there is only one way to get to node 5, and that’s from node 4, which we start in with probability only 0.05.
From there, only a 1/3 chance to get to node 5, so node 5 will only have 0.05/3 = 1/60 probability at time
step 1 and hence be super dark.
It turns out that after just n = 100 time steps, we start getting the same distribution over and over again
(see t = 10 and t = 100: there’s already almost no difference)! This limiting value of vP^n is the stationary
distribution!
π = lim_{n→∞} vP^n = (0.12, 0.28, 0.28, 0.18, 0.14)
Suppose π = vP^100 above. Once we find π such that πP = π for the first time, that means that if we
transition again, we get
πP² = (πP)P = πP = π
(applying the equality πP = π twice). That means, by just running the Markov Chain for several
time steps, we actually reached our stationary distribution! This is the most crucial observation
for MCMC.
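We can sanity-check the convergence numerically; the matrix P below is the TPM reconstructed earlier, and np.linalg.matrix_power computes P^n.

import numpy as np

P = np.array([[0, 1/2, 1/2, 0, 0],
              [1/2, 0, 0, 1/2, 0],
              [1/2, 0, 0, 1/2, 0],
              [0, 1/3, 1/3, 0, 1/3],
              [0, 0, 0, 1, 0]])
v = np.array([0.25, 0.45, 0.15, 0.05, 0.10])

print(v @ np.linalg.matrix_power(P, 100))  # approx (0.12, 0.28, 0.28, 0.18, 0.14)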
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (though generally it is used to sample from a distribution). The general strategy is as
follows:
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) transition
probabilities that result in the stationary distribution π having higher probabilities on “good”
solutions to our problem. We don’t actually compute π, but we just want to define the Markov
Chain such that the stationary distribution would have higher probabilities on more desirable
solutions.
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good” state/solution).
This means: start at some initial state, and transition according to the transition
probability matrix (TPM) for a long time. This will eventually take us to our stationary
distribution, which has high probability on “good” solutions!
Again, if this doesn’t make sense yet, that’s totally fine. We will apply this two-step procedure to two
examples below so you can understand better how it works!
Note that our total value is the sum of the values of the items we take: think about why Σ_i vi xi is the total
value (remember that xi is either 0 or 1). This problem has 2^n possible solutions (either take each item or
don’t), and so is combinatorially hard (exponentially many solutions). If I asked you to write a program to
do this, would you even know where to begin, except by writing the brute-force solution?
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution π having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with 2^n states (that’s huge!). The states will be all possible solutions:
binary vectors x of length n (only having 0/1 entries). We’ll then define our transitions to go to “good”
states (ones that satisfy our weight constraint), while keeping track of the best solution so far. This
way, our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when
we sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
Basically, this algorithm starts with the guess of x being all zeros (no items). Then, for NUM ITER
steps, we simulate the Markov Chain. Again, what this does is give us a sample from our stationary
distribution. Inside the loop, we literally just choose a random object and flip whether or not we have
it. We keep track of the best solution so far and return it.
That’s all there is to it! This is such a “dumb” solution, right? We just start somewhere and randomly
transition for a long time and hope our answer is good. So MCMC definitely won’t guarantee us the
best solution, but it leads to “dumb” solutions that actually work quite well in practice. We are guaranteed,
though (provided we take enough transitions), to sample from the stationary distribution, which has higher
probabilities on good solutions. This is because we only transition to solutions that maintain feasibility.
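To make this concrete, here is a minimal sketch of the feasibility-only version just described (assuming numpy; the function and variable names here are illustrative, not the starter code's):

import numpy as np

def mcmc_knapsack(values, weights, capacity, num_iter=100000):
    # Feasibility-only MCMC for 0/1 knapsack: each step, flip a random
    # item in/out, reject moves that violate the weight constraint, and
    # keep track of the best feasible solution seen so far.
    values, weights = np.asarray(values), np.asarray(weights)
    n = len(values)
    x = np.zeros(n, dtype=int)               # start with no items
    best_x, best_val = x.copy(), 0.0
    for _ in range(num_iter):
        i = np.random.randint(0, n)          # choose a random item...
        x[i] = 1 - x[i]                      # ...and flip whether we have it
        if weights @ x > capacity:           # infeasible: undo the flip
            x[i] = 1 - x[i]
            continue
        val = values @ x
        if val > best_val:                   # remember the best so far
            best_x, best_val = x.copy(), val
    return best_x, best_val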
Note: This is just one version of MCMC for the knapsack problem; there is a “better” version below in
the coding section. It would be better to transition to solutions which have higher value, not just feasible
solutions like we did. The next example does a better job of this!
The next example is the famous Travelling Salesman Problem (TSP): given n locations and distances between each pair, we want to find an ordering of them that:
• Starts and ends in the same location.
• Visits each location exactly once (except the starting location twice).
• Minimizes the total distance travelled.
You can imagine an instantiation of this problem for the US Postal Service. A mail delivery person wants to
start and end at the post office, and find the most efficient route which delivers all the mail to the residents.
Again, where would you even begin on trying to solve this, other than brute-force? MCMC to the rescue
again! This time, our algorithm will be more clever than the previous.
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution π having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with n! states (that’s huge!). The states will be all possible solutions
(state=route): all orderings of the n locations. We’ll then define our transitions to go to “good” states
(ones that go to lower-distance routes), while keeping track of the best solution so far. This way,
our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when we
sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
The MCMC algorithm will have a “temperature” parameter T which controls the trade-off between
exploration and exploitation (described soon). We will start with a random state (route). At each
iteration, propose a new state (route) as follows: choose a random index from {1, 2, . . . , n}, and swap
that location with the successive (next) location in the route, possibly with wraparound if index n is
chosen. If the proposed route has lower total distance (is better) than the current route, we will always
transition to it (exploitation).
Otherwise, if T > 0, with probability e−∆/T , update the current route to the proposed route, where
∆ > 0 is the increase in total distance. This allows us to transition to a “worse” route occasionally
(exploration), and get out of local optima! Repeat this for NUM ITER transitions from the initial state
(route), and output the shortest route during the entire process (which may not be the last route).
Again, this is such a “dumb” solution, right? But also very clever! We just start somewhere and randomly
transition for a long time and hope our answer is good. And it should be: over time, our route
distance gets better and better, so we should expect a rather good solution!
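Here is a minimal sketch of this temperature-based procedure (assuming numpy and a distance matrix D where D[i][j] is the distance from location i to location j; the names are illustrative, not from the course code):

import numpy as np

def route_dist(route, D):
    # Total distance of the cycle route[0] -> route[1] -> ... -> route[0].
    return sum(D[route[i], route[(i + 1) % len(route)]]
               for i in range(len(route)))

def mcmc_tsp(D, T=1.0, num_iter=100000):
    D = np.asarray(D)
    n = len(D)
    route = np.random.permutation(n)            # random initial route
    best, best_d = route.copy(), route_dist(route, D)
    for _ in range(num_iter):
        i = np.random.randint(0, n)             # random index
        j = (i + 1) % n                         # next location (wraparound)
        prop = route.copy()
        prop[i], prop[j] = prop[j], prop[i]     # propose swapping them
        delta = route_dist(prop, D) - route_dist(route, D)
        # Always accept improvements (delta < 0); otherwise accept the
        # worse route with probability e^(-delta/T) (exploration).
        if delta < 0 or (T > 0 and np.random.rand() < np.exp(-delta / T)):
            route = prop
            d = route_dist(route, D)
            if d < best_d:                      # output shortest route seen
                best, best_d = route.copy(), d
    return best, best_d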
9.6.4 Summary
Once again, we’ve used probability to make our lives easier. There are definitely papers and research on
how to solve these problems deterministically, but this is one of the simplest algorithms you can get, and
it uses randomness! Again, the idea of MCMC for optimization is: define the state space to be all possible
solutions, define transitions to go to better states, and just run it and wait!
The MCMC algorithm will have a “temperature” parameter T which controls the trade-off between
exploration and exploitation. The state space S will be the set of all subsets of n items. We will start
with a random state (subset). At each iteration, propose a new state (subset) as follows: choose a
random index i from {1, 2, . . . , n}, and take item i if we don't already have it, or put it back if we do.
• If the proposed subset is infeasible (doesn't fit in our knapsack because of the weight constraint),
we return to the start of the loop and abandon the newly proposed subset.
• Suppose then it is feasible. If it has higher total value (is better) than the current subset, we will
always transition to it (exploitation). Otherwise, if it is worse but T > 0, with probability e^{∆/T},
update the current subset to the proposed subset, where ∆ < 0 is the decrease in total value.
This allows us to transition to a “worse” subset occasionally (exploration), and get out of local
optima! Repeat this for NUM ITER transitions from the initial state (subset), and output the
highest value subset during the entire process (which may not be the final subset).
(a) What is the size of the Markov Chain’s state space S (the number of possible subsets)? As
NUM ITER → ∞, are you guaranteed to eventually see all the subsets (consider the cases of
T = 0 and T > 0 separately)? Briefly justify your answers.
(b) Let’s try to figure out what the temperature parameter T does.
i. Suppose T = 0. Will we ever get to a worse subset than before as we transition?
ii. Suppose T > 0. For a fixed T , does the probability of transitioning to a worse subset increase
or decrease with larger values of ∆? For a fixed ∆, does the probability of transitioning to a
worse subset increase or decrease with larger values of T ? Explain briefly how the temperature
parameter T controls the degree of exploration we do.
(c) Implement the functions value, weight, and mcmc in mcmc knapsack.py.
You must use np.random.rand() to generate a continuous Unif(0, 1) rv, and
np.random.randint(low (inclusive), high (exclusive)) to generate your random index(es).
Make sure to read the documentation and hints provided! You must use this strategy exactly
to get full credit - we will be setting the random seed so that everyone should get the same result
if they follow the pseudocode. For Line 4 in the pseudocode, since Python is 0-indexed, generate
a random integer in {0, 1, . . . , n − 1} instead, otherwise the autograder may fail.
(d) We’ve called the make plot function to make a plot where the x-axis is the iteration number, and
the y-axis is the current knapsack value (not necessarily the current best), for ntrials=10 different
runs of MCMC. You should attach 4 plots which are generated for you (one per temperature),
and each plot should have 10 curves (one per trial). Which value of T tended to most reliably
produce high knapsack values?
Chapter 9: Applications to Computing
9.7: Bootstrapping (for Hypothesis Testing)
Slides (Google Drive) Starter Code (GitHub)
9.7.1 Motivation
We’ve just learned how to perform a generic hypothesis test, where in our examples we were often
able to use the Normal distribution and its CDF thanks to the CLT. But actually, there are tons of other
specialized hypothesis tests which won’t allow this. For example:
• The t-test for equality of means when variance is unknown.
• The χ2 -test of independence (testing whether two quantities are independent or not).
• The F -test for equality of variances (testing whether or not the variances of two populations are
equal).
There are many more that I haven’t even listed because I probably have never heard of them myself! These
three above though involve three distributions we haven’t learned yet: the t, χ2 , and F distributions. But
because you are a computer scientist, we’ll actually learn a way now to completely erase the need to learn
each specific procedure, called bootstrapping!
Example(s)
Main Idea: We have some (not enough) data and want more. How can we “get more”?
Imagine: You have 1000 iid coin flip samples, x1 ,...,x1000 which are all 1’s and 0’s. Your boss wants
you to somehow get/generate 500 more (independent) samples.
How can you “get more (iid) data” without actually having access to the coin? There are
two proposed solutions below: both of which you could theoretically come up with, but only one of
which I expect most of you to guess.
Example(s)
A colleague has collected samples of weights of labradoodles that live on two different islands:
CatIsland and DogIsland. The colleague collects 48 samples from CatIsland, and 43 samples from
DogIsland. The colleague notes ahead of time that she thinks the labradoodles on DogIsland
have a higher spread of weights than CatIsland. You are skeptical. You and your colleague do
however agree to assume that their true means are equal. Here is the data:
CatIsland Labradoodle Weights (48 samples): 13, 12, 7, 16, 9, 11, 7, 10, 9, 8, 9, 7, 16, 7, 9,
8, 13, 10, 11, 9, 13, 13, 10, 10, 9, 7, 7, 6, 7, 8, 12, 13, 9, 6, 9, 11, 10, 8, 12, 10, 9, 10, 8, 14, 13, 13, 10, 11
DogIsland Labradoodle Weights (43 samples): 8, 8, 16, 16, 9, 13, 14, 13, 10, 12, 10, 6, 14, 8,
13, 14, 7, 13, 7, 8, 4, 11, 7, 12, 8, 9, 12, 8, 11, 10, 12, 6, 10, 15, 11, 12, 3, 8, 11, 10, 10, 8, 12
Solution: Step 5 is the only part where bootstrapping is involved. Everything else is the same as we learned
in 8.3!
1. Make a claim.
The spread of labradoodle weights on DogIsland is (significantly) larger than that on CatIsland.
2. Set up a null hypothesis H0 and alternative hypothesis HA .
H0 : σC² = σD²        HA : σC² < σD²
Our null hypothesis is that the spreads are the same, and our alternative is what we want to show.
Here, spread is taken to mean “variance”.
3. Choose a significance level α (usually α = 0.05 or 0.01).
Because the null hypothesis says the two variances are equal, we can combine the two samples into a single one of size 48 + 43 = 91 (in our
case, we've also assumed the means are the same, so this is okay). Then, we repeatedly bootstrap
this combined sample (let's say 50,000 times): we draw (with replacement) a sample of size 48 and a sample
of size 43, and compute the sample variances of these two samples. Then, we compute the sample
proportion of times the difference in variances was at least as extreme as the one we observed, and that's it!
See the pseudocode below, and reread these two paragraphs.
Algorithm 6 Bootstrapping for p-value for H0 : σC² = σD² vs HA : σC² < σD²
1: Given: Two samples x = [x1 , . . . , xn ] and y = [y1 , . . . , ym ].
2: obs diff ← s²y − s²x (the difference in sample variances).
3: combined ← concat(x, y) = [x1 , x2 , . . . , xn , y1 , y2 , . . . , ym ] (of size n + m).
4: count ← 0.
5: for i = 1, 2, . . . , 50000 do    ▷ Any large number is fine.
6:   x′ ← resample(combined, n) with replacement.    ▷ Sample of size n from combined.
7:   y′ ← resample(combined, m) with replacement.    ▷ Sample of size m from combined.
8:   diff ← s²y′ − s²x′ .    ▷ Compute the difference in sample variances.
9:   if diff ≥ obs diff then    ▷ This line changes depending on the alternative hypothesis.
10:    count ← count + 1.
11: p-val ← count/50000.
Again, what we’re doing is: assuming there was this master island that split into two (same variance),
what is the probability we observed a sample of size 48 and a sample of size 43 with variances at least
as extreme as we did? That is, if we were to repeat this “separation” process many times, how often
would we get a difference so large? We don’t have the other labradoodles from the master island, so
we bootstrap (reuse our current samples). It turns out this method leads to a good approximation to
the true p-value!
It’s important to note that the alternative hypothesis is EXTREMELY IMPORTANT. If instead we
wanted to assert HA : σC² ≠ σD², we would have used absolute values for diff and obs diff. Also, for
example, if we wanted to make a statement about the means µC and µD instead, we would have
example, if we wanted to make a statement about the means µC and µD instead, we would have
computed and compared the sample means instead of the sample variances.
It turns out we get a p-value of approximately 0.07. (Try coding this up yourself!)
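Here is one way you might code up Algorithm 6 with numpy (a sketch; the names are mine). Calling it with the CatIsland sample as x and the DogIsland sample as y should give a p-value around the 0.07 reported above:

import numpy as np

def bootstrap_var_pvalue(x, y, num_boot=50000):
    # Bootstrap p-value for H0: equal variances vs HA: var(y) > var(x).
    x, y = np.asarray(x), np.asarray(y)
    obs_diff = np.var(y, ddof=1) - np.var(x, ddof=1)  # sample variances
    combined = np.concatenate([x, y])
    count = 0
    for _ in range(num_boot):
        xb = np.random.choice(combined, size=len(x), replace=True)
        yb = np.random.choice(combined, size=len(y), replace=True)
        diff = np.var(yb, ddof=1) - np.var(xb, ddof=1)
        if diff >= obs_diff:   # at least as extreme as what we observed
            count += 1
    return count / num_boot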
Since our p-value of 0.07 was greater than α = 0.05, we fail to reject the null hypothesis. There
is insufficient evidence to show that the DogIsland labradoodles have a larger spread of weights than those on CatIsland.
Actually, this two-sample test for difference in variances is usually done with an “F-Test of Equality of Variances” (see
Wikipedia). But because we know how to code, we don’t need to know that!
You can imagine bootstrapping for other types of hypothesis tests as well! Actually, bootstrapping is a
powerful tool which also has other applications.
Chapter 9: Applications to Computing
9.8: Multi-Armed Bandits
Actually, for this application of bandits, we will do the problem setup before the motivation. This is because
modelling problems in this bandit framework may be a bit tricky, so we’ll kill two birds with one stone. We’ll
also see how to do “Modern Hypothesis Testing” using bandits!
You bought some credits and can pull any of the slot machines, but only a total of T = 100 times. At each time
step t = 1, . . . , T , you pull arm at ∈ {1, 2, . . . , K} and observe a random reward. Your goal is to maximize
your total (expected) reward after T = 100 pulls! The problem is: at each time step (pull), how do I decide
which arm to pull based on the past history of rewards?
We make a simplifying assumption that each arm is independent of the rest, and has some reward dis-
tribution which does NOT change over time.
Here is an example you may be able to do: don’t overthink it!
Example(s)
If the reward distributions are given in the image below for the K = 3 arms, what is the best strategy
to maximize your expected reward?
Solution We can just compute the expectations of each from the distributions handout. The first machine
has expectation λ = 1.36, the second has expectation np = 4, and the third has expectation µ = −1. So to
maximize our total reward, we should just always pull arm 2 because it has the best expected reward! There
would be no benefit in pulling other arms at all.
So we’re done right? Well actually, we DON’T KNOW the reward distributions at all! We must estimate all
K expectations (one per arm), WHILE simultaneously maximizing reward! This is a hard problem because
we know nothing about the K reward distributions. Which arm should we pull then at each time step? Do
we pull arms we know to be “good” (probably), or try other arms?
• Exploitation: Pulling arms we currently believe to be “good”, based on the rewards observed so far.
• Exploration: Pulling less-frequently pulled arms in the hopes they are also “good” or even better.
In this section, we will only handle the case of Bernoulli-bandits. That is, the reward of each arm
a ∈ {1, . . . , K} is Ber(pa ) (i.e., we either get a reward of 1 or 0 from each machine, with possibly different
probabilities). Observe that the expected reward of arm a is just pa (expectation of Bernoulli).
The last thing we need to talk about when talking about bandits is regret. Regret is the difference between
• The best possible expected reward (if you always pulled the best arm).
• The reward you actually earned (from the arms you actually pulled).
Let p∗ = max_{i∈{1,2,...,K}} pi denote the highest expected reward from one of the K arms. Then, the regret
at time T is
Regret(T ) = T p∗ − Reward(T )
where T p∗ is the expected reward from the best arm if you pull it T times, and Reward(T ) is your actual reward after
T pulls. Sometimes it’s easier to think about this in terms of average regret (divide everything by T ):
Avg-Regret(T ) = p∗ − Reward(T )/T
The below summarizes and formalizes everything above into this so-called “Bernoulli Bandit Framework”.
The focus for the rest of the entire section is: “how do we choose which arm”?
9.8.2 Motivation
Before we talk about that though, we’ll discuss the motivation as promised.
As you can see above, we can model a lot of real-life problems as a bandit problem. We will learn two
popular algorithms: Upper Confidence Bound (UCB) and Thompson Sampling. This is after we discuss
some “intuitive” or “naive” strategies you may have yourself!
We’ll actually call on a lot of our knowledge from Chapters 7 and 8! We will discuss maximum likelihood,
maximum a posteriori, confidence intervals, and hypothesis testing, so you may need to brush up on those!
One strategy may be: pull each arm M times in the beginning, and then forever pull the best arm! This is
described formally below:
Actually, this strategy is no good, because if we choose the wrong best arm, we would regret it for the rest
of time! You might then say: why don't we increase M? If you do that, then you are pulling sub-optimal
arms more than you should, which would not help us in maximizing reward... The problem is: we did all of
our exploration FIRST, and then exploited our best arm (possibly the incorrect one) for the rest of time. Why don't
we try to blend in exploration more? Do you have any ideas on how we might do that?
This following algorithm is called the ε-Greedy algorithm, because it explores with probability ε at each
time step! It has the same initial setup: pull each arm M times to begin. But it does two things better than
the previous algorithm:
1. It continuously updates an arm’s estimated expected reward when it is pulled (even after the KM
steps).
2. It explores with some probability ε (you choose). This allows you to choose in some quantitative way
how to balance exploration and exploitation.
See below!
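Here is a minimal sketch of ε-Greedy in Python (assuming numpy and a function pull(i) that returns a random reward from arm i; the names and defaults are illustrative, not the course's starter code):

import numpy as np

def epsilon_greedy(pull, K, T, M=5, eps=0.1):
    # pull(i) returns a random reward from arm i (assumed given).
    # Pull each arm M times to start; afterwards, exploit the best-looking
    # arm, but explore a uniformly random arm with probability eps.
    counts = np.zeros(K)
    means = np.zeros(K)          # running estimates of expected rewards

    def update(i, r):            # update arm i's running average
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]

    for i in range(K):           # initial setup: M pulls per arm
        for _ in range(M):
            update(i, pull(i))
    for _ in range(T - K * M):   # remaining pulls
        if np.random.rand() < eps:
            i = np.random.randint(0, K)      # explore
        else:
            i = int(np.argmax(means))        # exploit
        update(i, pull(i))
    return means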
However, we can do much much better! Why should we explore each arm uniformly at random, when we
have a past history of rewards? Let's explore more the arms that have the potential to be really good! In an
extreme case, if there is an arm with average reward 0.01 after 100 pulls and an arm with average reward
0.6 after only 5 pulls, should we really explore each equally?
• Arm 1: Estimate is p̂1 = 0.75. Confidence interval is [0.75 − 0.10, 0.75 + 0.10] = [0.65, 0.85].
• Arm 2: Estimate is p̂2 = 0.33. Confidence interval is [0.33 − 0.25, 0.33 + 0.25] = [0.08, 0.58].
• Arm 3: Estimate is p̂3 = 0.60. Confidence interval is [0.60 − 0.29, 0.60 + 0.29] = [0.31, 0.89].
Notice all the intervals are centered at the MLE. Remember the intervals may have different widths, because
the width of a confidence interval depends on how many times the arm has been pulled (more pulls means more
confidence and hence a narrower interval). Review 8.1 if you need to recall how we construct them.
The greedy algorithm from earlier at this point in time would choose arm 1 because it has the highest
estimate (0.75 is greater than 0.33 and 0.60). But our new Upper Confidence Bound (UCB) algorithm
will choose arm 3 instead, as it has the highest possibility of being the best (0.89 is greater than 0.85 and
0.58).
See how exploration is “baked in” now? As we pull an arm more and more, the upper confidence bound
decreases. The less frequently pulled arms have a chance to have a higher UCB, despite having a lower
point estimate! After the next algorithm we examine, we will compare and contrast the results visually. But
before we move on, let's take a look at an example.
Suppose we have K = 5 arms. The following picture depicts at time t = 10 what the confidence intervals
may look like. The horizontal lines at the top of each arm represent the upper confidence bound, and the red
dots represent the TRUE (unknown) means. The centers of the confidence intervals are the ESTIMATED
means.
Pretty inaccurate at first, right? Because it's so early on, our estimates are expected to be bad.
Notice how the interval for the best arm (arm 5) keeps shrinking, and is the smallest one because it was pulled
(exploited) so much! Clearly, arm 1 was terrible and so our estimate isn't perfect; its interval is the widest
since we almost never pulled it. This is the idea of UCB: basically just greedy, but using upper confidence
bounds!
You can go to the slides linked at the top of the section if you would like to see a step-by-step of the first
few iterations of this algorithm (slides 64-86).
Note that if we just deleted the +√(2 ln(t)/Nt(i)) term in the 5th line of the algorithm, it would reduce to the greedy algorithm!
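As a rough Python sketch of UCB (with the same hypothetical pull(i) as above; this follows the p̂i + √(2 ln(t)/Nt(i)) rule just described):

import numpy as np

def ucb(pull, K, T):
    # pull(i) returns a random reward from arm i (assumed given).
    counts = np.zeros(K)   # N_t(i): number of times arm i was pulled
    means = np.zeros(K)    # estimated expected reward of each arm
    for i in range(K):     # pull each arm once so every count is > 0
        means[i], counts[i] = pull(i), 1
    for t in range(K + 1, T + 1):
        bounds = means + np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(bounds))    # arm with highest upper conf. bound
        counts[i] += 1
        means[i] += (pull(i) - means[i]) / counts[i]
    return means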
We will assume a Beta(1, 1) (uniform) prior on each unknown probability of reward. That is, we can treat
our pi 's as continuous probability distributions. Remember that with this uniform prior, the MAP
and the MLE are equivalent (pretend we saw 1 − 1 = 0 successes and 1 − 1 = 0 failures). However, we
will not be using the posterior distribution just to get the mode; we will SAMPLE from it! Here's the idea
visually.
Let’s say we have K = 3 arms, and are at the first time step t = 1. We will start each arm off with a
Beta(αi = 1, βi = 1) prior and update αi , βi based on the rewards we observe. We’ll show the algorithm
first, then use visuals to walk through it.
So as I mentioned earlier, each pi is a RV which starts with a Beta(1, 1) distribution. For each arm i, we
keep track of αi and βi , where αi − 1 is the number of successes (number of times we got a reward of 1),
and βi − 1 is the number of failures (number of times we got a reward of 0).
For this algorithm, I would highly recommend you go to the slides linked at the top of the section if you
would like to see a step-by-step of the first few iterations of this algorithm (slides 94-112). If you don’t want
to, we’ll still walk through it below!
Let’s again suppose we have K = 3 arms. At time t = 1, we sample once from each arm’s Beta distribution.
We suppose the true pi ’s are 0.5, 0.2, and 0.9 for arms 1, 2, and 3 respectively (see the table). Each arm
has αi and βi , initially 1. We get a sample from each arm’s Beta distribution and just pull the arm with
the largest sample! So in our first step, each has the same distribution Beta(1, 1) = Unif(0, 1), so each arm
is equally likely to be pulled. Then, because arm 2 has the highest sample (of 0.75), we pull arm 2. The
algorithm doesn’t know this, but there is only a 0.2 chance of getting a 1 from arm 2 (see the table), and so
let’s say we happen to observe our first reward to be zero: r1 = 0.
Consistent with our Beta random variable intuition and MAP, we increment our number of failures by 1 for
arm 2 only.
At the next time step, we do the same! Sample from each arm’s Beta and choose the arm with the highest
sample. We’ll see it for a more interesting example below after skipping a few time steps.
Now let’s say we’re at time step 4, and we see the following chart. Below depicts the current Beta densities
for each arm, and what sample we got from each.
We can see from the αi ’s and βi ’s that we still haven’t pulled arm 1 (both parameters are still at 1), we pulled
arm 2 and got a reward of 0 (β2 = 2), and we pulled arm 3 twice and got one 1 and one 0 (α3 = β3 = 2).
See the density functions below: arm 1 is equally likely to be any number in [0, 1], whereas arm 2 is more
likely to give a low number. Arm 3 is more certain of being in the center.
You can see that Thompson Sampling just uses this ingenious idea of sampling rather than just tak-
ing the MAP, and it works great! We’ll see some comparisons below between UCB and Thompson sampling.
Note that with a single-line change, instead of sampling in line 3, if we just took the MAP (which equals the
MLE because of our uniform prior), we would again revert back to the greedy algorithm! The exploration
comes from the sampling, which works out great for us!
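A minimal Python sketch of Thompson Sampling for Bernoulli bandits (again with a hypothetical pull(i), now returning 0/1 rewards):

import numpy as np

def thompson(pull, K, T):
    # pull(i) returns a Bernoulli (0/1) reward from arm i (assumed given).
    # Beta(1, 1) prior on each arm's unknown probability of reward.
    alpha = np.ones(K)   # alpha_i - 1 = number of 1-rewards from arm i
    beta = np.ones(K)    # beta_i - 1 = number of 0-rewards from arm i
    for _ in range(T):
        samples = np.random.beta(alpha, beta)  # one sample per arm
        i = int(np.argmax(samples))            # pull arm w/ largest sample
        r = pull(i)                            # observe reward (0 or 1)
        alpha[i] += r
        beta[i] += 1 - r
    return alpha, beta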
It might be a bit hard to see, but notice Thompson sampling's regret got close to 0 a lot faster than UCB's:
UCB's average regret approached 0 around time 5000, while Thompson sampling's did around time 2000.
The reason why Thompson sampling might be “better” is unfortunately out of scope.
Below is my favorite visualization of all. On the x-axis we have time, and on the y-axis, we have the
proportion of time each arm was pulled (there were K = 5 arms). Notice how arm 2 (green) has the highest
true expected reward at 0.89, and how quickly Thompson sampling discovered it and started exploiting it.
Here are the benefits and drawbacks of using Traditional A/B Testing vs Multi-Armed bandits. Each has
their own advantages, and you should carefully consider which approach to take before arbitrarily deciding!
When to use Traditional A/B Testing:
• Need to collect data for critical business decisions.
• Need statistical confidence in all your results and impact. Want to learn even about treatments that
didn’t perform well.
• The reward is not immediate (e.g., in drug testing, you don't have time to wait for each patient to
finish before experimenting with the next patient).
• Optimize/measure multiple metrics, not just one.
When to use Multi-Armed Bandits:
• No need for interpreting results, just maximize reward (typically revenue/engagement).
• The opportunity cost is high (if advertising a car, losing a conversion is ≥$20,000).
• Can add/remove arms in the middle of an experiment! Cannot do with A/B tests.
The study of Multi-Armed Bandits can be categorized as:
• Statistics
• Optimization
Standard Normal CDF table: entries are Φ(z) = P(Z ≤ z) for Z ∼ N (0, 1); the row gives z to one decimal place, and the column adds the second decimal.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
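The entries of this table can also be reproduced programmatically; for example, with scipy (assuming it is available):

from scipy.stats import norm   # the standard normal distribution

print(norm.cdf(1.00))    # Phi(1.00) ~ 0.84134, matching the table
print(norm.ppf(0.975))   # inverse: the z with Phi(z) = 0.975, ~1.96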
Discrete Distributions

Uniform (discrete): X ∼ Unif(a, b) for a, b ∈ Z and a ≤ b. Equally likely to be any integer in [a, b].
Range ΩX = {a, . . . , b}. E[X] = (a + b)/2, Var(X) = (b − a)(b − a + 2)/12. PMF: pX(k) = 1/(b − a + 1).

Bernoulli: X ∼ Ber(p) for p ∈ [0, 1]. Takes value 1 with prob p and 0 with prob 1 − p. Range ΩX = {0, 1}.
E[X] = p, Var(X) = p(1 − p). PMF: pX(k) = p^k (1 − p)^{1−k}.

Binomial: X ∼ Bin(n, p) for n ∈ N and p ∈ [0, 1]. Sum of n iid Ber(p) rvs; # of heads in n independent
coin flips with P(head) = p. Range ΩX = {0, 1, . . . , n}. E[X] = np, Var(X) = np(1 − p).
PMF: pX(k) = C(n, k) p^k (1 − p)^{n−k}.

Poisson: X ∼ Poi(λ) for λ > 0. # of events that occur in one unit of time, independently with rate λ
per unit time. Range ΩX = {0, 1, . . .}. E[X] = λ, Var(X) = λ. PMF: pX(k) = e^{−λ} λ^k / k!.

Geometric: X ∼ Geo(p) for p ∈ [0, 1]. # of independent Bernoulli trials with parameter p up to and
including the first success. Range ΩX = {1, 2, . . .}. E[X] = 1/p, Var(X) = (1 − p)/p².
PMF: pX(k) = (1 − p)^{k−1} p.

Hypergeometric: X ∼ HypGeo(N, K, n) for n, K ≤ N and n, K, N ∈ N. # of successes in n draws (w/o
replacement) from N items that contain K successes in total. Range ΩX = {max(0, n + K − N), . . . , min(n, K)}.
E[X] = nK/N, Var(X) = n · (K/N) · ((N − K)/N) · ((N − n)/(N − 1)).
PMF: pX(k) = C(K, k) C(N − K, n − k) / C(N, n).

Negative Binomial: X ∼ NegBin(r, p) for r ∈ N and p ∈ [0, 1]. Sum of r iid Geo(p) rvs; # of independent
flips until the rth head with P(head) = p. Range ΩX = {r, r + 1, . . .}. E[X] = r/p, Var(X) = r(1 − p)/p².
PMF: pX(k) = C(k − 1, r − 1) p^r (1 − p)^{k−r}.

Multinomial: X ∼ Mult_r(n, p) for r, n ∈ N and p = (p1 , p2 , ..., pr ) with Σ_{i=1}^r pi = 1. Generalization
of the Binomial distribution: n trials with r categories, each with probability pi. Range: ki ∈ {0, . . . , n}
for i ∈ {1, . . . , r}, with Σki = n. E[X] = np = (np1 , . . . , npr ). Var(Xi) = npi(1 − pi),
Cov(Xi , Xj) = −npi pj for i ≠ j. PMF: pX(k) = (n choose k1 , . . . , kr) ∏_{i=1}^r pi^{ki}.

Multivariate Hypergeometric: X ∼ MVHG_r(N, K, n) for r, n ∈ N, K ∈ N^r and N = Σ_{i=1}^r Ki.
Generalization of the Hypergeometric distribution: n draws (w/out replacement) from r categories, each
with Ki successes. Range: ki ∈ {0, . . . , Ki} for i ∈ {1, . . . , r}, with Σki = n. E[X] = nK/N = (nK1/N, . . . , nKr/N).
Var(Xi) = n · (Ki/N) · ((N − Ki)/N) · ((N − n)/(N − 1)), Cov(Xi , Xj) = −n · (Ki/N) · (Kj/N) · ((N − n)/(N − 1)) for i ≠ j.
PMF: pX(k) = ∏_{i=1}^r C(Ki , ki) / C(N, n).
Continuous Distributions

Uniform (continuous): X ∼ Unif(a, b) for a < b. Equally likely to be any real number in [a, b].
Range ΩX = [a, b]. E[X] = (a + b)/2, Var(X) = (b − a)²/12. PDF: fX(x) = 1/(b − a).
CDF: FX(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x < b; 1 if x ≥ b.

Exponential: X ∼ Exp(λ) for λ > 0. Time until the first event in a Poisson process. Range ΩX = [0, ∞).
E[X] = 1/λ, Var(X) = 1/λ². PDF: fX(x) = λe^{−λx}. CDF: FX(x) = 0 if x < 0; 1 − e^{−λx} if x ≥ 0.

Normal: X ∼ N(µ, σ²) for µ ∈ R and σ² > 0. Standard bell curve. Range ΩX = (−∞, ∞). E[X] = µ,
Var(X) = σ². PDF: fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)). CDF: FX(x) = Φ((x − µ)/σ).

Gamma: X ∼ Gam(r, λ) for r, λ > 0. Sum of r iid Exp(λ) rvs; time to the rth event in a Poisson process.
Conjugate prior for the Exp and Poi parameter λ. Range ΩX = [0, ∞). E[X] = r/λ, Var(X) = r/λ².
PDF: fX(x) = (λ^r / Γ(r)) x^{r−1} e^{−λx}. Note: Γ(r) = (r − 1)! for integers r.

Beta: X ∼ Beta(α, β) for α, β > 0. Conjugate prior for the Ber, Bin, Geo, NegBin parameter p.
Range ΩX = (0, 1). E[X] = α/(α + β), Var(X) = αβ/((α + β)²(α + β + 1)).
PDF: fX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}.

Dirichlet: X ∼ Dir(α1 , α2 , . . . , αr ) for r ∈ N and αi > 0. Generalization of the Beta distribution;
conjugate prior for the Multinomial parameter p. Range: xi ∈ (0, 1) with Σ_{i=1}^r xi = 1.
E[Xi] = αi / Σ_{j=1}^r αj. PDF: fX(x) = (1/B(α)) ∏_{i=1}^r xi^{αi−1}.

Multivariate Normal: X ∼ Nn(µ, Σ) for µ ∈ R^n and Σ ∈ R^{n×n}. Generalization of the Normal
distribution. Range R^n. E[X] = µ, Var(X) = Σ.
PDF: fX(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)).
Probability & Statistics with Applications to Computing
Key Definitions and Theorems
1 Combinatorial Theory
1.1 So You Think You Can Count?
The Sum Rule: If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is no
overlap), then the total number of possible outcomes is: N + M .
The Product Rule: If an experiment has N1 outcomes for the first stage, N2 outcomes for the second stage, . . . , and Nm
outcomes for the mth stage, then the total number of outcomes of the experiment is N1 × N2 × · · · × Nm = ∏_{i=1}^m Ni.
Permutation: The number of orderings of N distinct objects is N ! = N · (N − 1) · (N − 2) · . . . 3 · 2 · 1.
Complementary Counting: Let U be a (finite) universal set, and S a subset of interest. Then, | S |=| U | − | U \ S |.
1.2 More Counting
k-Permutations: If we want to pick (order matters) only k out of n distinct objects, the number of ways to do so is:
P(n, k) = n · (n − 1) · (n − 2) · ... · (n − k + 1) = n!/(n − k)!
k-Combinations/Binomial Coefficients: If we want to choose (order doesn't matter) only k out of n distinct objects,
the number of ways to do so is:
C(n, k) = (n choose k) = P(n, k)/k! = n!/(k!(n − k)!)
Multinomial Coefficients: If we have k distinct types of objects (n total), with n1 of the first type, n2 of the second, ...,
and nk of the k-th, then the number of arrangements possible is
(n choose n1 , n2 , ..., nk) = n!/(n1! n2! ... nk!)
Stars and Bars/Divider Method: The number of ways to distribute n indistinguishable balls into k distinguishable bins is
C(n + (k − 1), k − 1) = C(n + (k − 1), n)
3. (Axiom: Countable Additivity) If E and F are mutually exclusive, then P(E ∪ F ) = P (E) + P (F ).
1. (Corollary: Complementation) P(E^C) = 1 − P (E)
2. (Corollary: Monotonicity) If E ⊆ F , then P (E) ≤ P (F )
Equally Likely Outcomes: If Ω is a sample space such that each of the unique outcome elements in Ω are equally likely,
then for any event E ⊆ Ω: P(E) = |E|/|Ω|.
2.2 Conditional Probability
Conditional Probability: P (A | B) = P (A ∩ B) / P (B)
Bayes Theorem: P (A | B) = P (B | A) P (A) / P (B)
Partition: Non-empty events E1 , . . . , En partition the sample space Ω if they are both:
• (Exhaustive) E1 ∪ E2 ∪ · · · ∪ En = ∪_{i=1}^n Ei = Ω (they cover the entire sample space).
• (Pairwise Mutually Exclusive) For all i ≠ j, Ei ∩ Ej = ∅ (none of them overlap).
Bayes Theorem with LTP: Let events E1 , . . . , En partition the sample space Ω, and let F be another event. Then:
P (E1 | F ) = P (F | E1 ) P (E1 ) / Σ_{i=1}^n P (F | Ei ) P (Ei )
2.3 Independence
Chain Rule: Let A1 , . . . , An be events with nonzero probabilities. Then:
P (A1 , . . . , An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , . . . , An−1 )
Independence: A and B are independent if any of the following equivalent statements hold:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Mutual Independence: We say n events A1 , A2 , . . . , An are (mutually) independent if, for any subset I ⊆ [n] =
{1, 2, . . . , n}, we have
P(∩_{i∈I} Ai) = ∏_{i∈I} P(Ai)
This equation is actually representing 2^n equations since there are 2^n subsets of [n].
Conditional Independence: A and B are conditionally independent given an event C if any of the following
equivalent statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Linearity of Expectation (LoE): E [aX + bY + c] = aE [X] + bE [Y ] + c
Law of the Unconscious Statistician (LOTUS): For a discrete RV X and function g, E [g(X)] = Σ_{b∈ΩX} g(b) · pX (b).
3.3 Variance
Linearity of Expectation with Indicators: If asked only about the expectation of a RV X which is some sort of “count”
(and not its PMF), then you may be able to write X as the sum of possibly dependent indicator RVs X1 , . . . , Xn , and apply
LoE, where for an indicator RV Xi , E [Xi ] = 1 · P (Xi = 1) + 0 · P (Xi = 0) = P (Xi = 1).
Variance: Var (X) = E[(X − E [X])²] = E[X²] − E[X]².
Standard Deviation (SD): σX = √Var (X).
Property of Variance: Var (aX + b) = a2 Var (X).
3.4 Zoo of Discrete Random Variables Part I
Independence: Random variables X and Y are independent, denoted X ⊥ Y , if for all x ∈ ΩX and all y ∈ ΩY :
P (X = x ∩ Y = y) = P (X = x) · P (Y = y).
Independent and Identically Distributed (iid): We say X1 , . . . , Xn are said to be independent and identically
distributed (iid) if all the Xi ’s are independent of each other, and have the same distribution (PMF for discrete RVs, or
CDF for continuous RVs).
Variance Adds for Independent RVs: If X ⊥ Y , then Var (X + Y ) = Var (X) + Var (Y ).
Bernoulli Process: A Bernoulli process with parameter p is a sequence of independent coin flips X1 , X2 , X3 , ... where
P (head) = p. If flip i is heads, then we encode Xi = 1; otherwise, Xi = 0.
Bernoulli/Indicator Random Variable: X ∼ Bernoulli(p) (Ber(p) for short) iff X has PMF:
pX (k) = p if k = 1, and pX (k) = 1 − p if k = 0
E [X] = p and Var (X) = p(1 − p). An example of a Bernoulli/indicator RV is one flip of a coin with P (head) = p. By a
clever trick, we can write
pX (k) = p^k (1 − p)^{1−k} , k = 0, 1
Binomial Random Variable: X ∼ Binomial(n, p) (Bin(n, p) for short) iff X has PMF
pX (k) = C(n, k) p^k (1 − p)^{n−k} , k ∈ ΩX = {0, 1, . . . , n}
E [X] = np and Var (X) = np(1 − p). X is the sum of n iid Ber(p) random variables. An example of a Binomial RV is the
number of heads in n independent flips of a coin with P (head) = p. Note that Bin(1, p) ≡ Ber(p). As n → ∞ and p →
0, with np = λ, then Bin (n, p) → Poi(λ). If X1 , . . . , Xn are independent Binomial RV’s, where Xi ∼ Bin(Ni , p), then
X = X1 + . . . + Xn ∼ Bin(N1 + . . . + Nn , p).
3.5 Zoo of Discrete Random Variables Part II
Uniform Random Variable (Discrete): X ∼ Uniform(a, b) (Unif(a, b) for short), for integers a ≤ b, iff X has PMF:
pX (k) = 1/(b − a + 1) , k ∈ ΩX = {a, a + 1, . . . , b}
E [X] = (a + b)/2 and Var (X) = (b − a)(b − a + 2)/12. This represents each integer in [a, b] to be equally likely. For example, a single roll
of a fair die is Unif(1, 6).
Geometric Random Variable: X ∼ Geometric(p) (Geo(p) for short) iff X has PMF:
pX (k) = (1 − p)^{k−1} p , k ∈ ΩX = {1, 2, 3, . . .}
E [X] = 1/p and Var (X) = (1 − p)/p². An example of a Geometric RV is the number of independent coin flips up to and including
the first head, where P (head) = p.
Negative Binomial Random Variable: X ∼ NegativeBinomial(r, p) (NegBin(r, p) for short) iff X has PMF:
pX (k) = C(k − 1, r − 1) p^r (1 − p)^{k−r} , k ∈ ΩX = {r, r + 1, r + 2, . . .}
Poisson Random Variable: X ∼ Poisson(λ) (Poi(λ) for short) iff X has PMF:
pX (k) = e^{−λ} λ^k / k! , k ∈ ΩX = {0, 1, 2, . . .}
E [X] = λ and Var (X) = λ. An example of a Poisson RV is the number of people born during a particular minute,
where λ is the average birth rate per minute. If X1 , . . . , Xn are independent Poisson RV’s, where Xi ∼ Poi(λi ), then
X = X1 + . . . + Xn ∼ Poi(λ1 + . . . + λn ).
Hypergeometric Random Variable: X ∼ HyperGeometric(N, K, n) (HypGeo(N, K, n) for short) iff X has PMF:
pX (k) = C(K, k) C(N − K, n − k) / C(N, n) , k ∈ ΩX = {max{0, n + K − N }, . . . , min{K, n}}
Cumulative Distribution Function (CDF): The cumulative distribution function (CDF) of ANY random variable
(discrete or continuous) is defined to be the function FX : R → R with FX (t) = P (X ≤ t). If X is a continuous RV, we have:
• FX (t) = P (X ≤ t) = ∫_{−∞}^t fX (w) dw for all t ∈ R
• d/du FX (u) = fX (u)
E [X] = 1/λ and Var (X) = 1/λ². FX (x) = 1 − e^{−λx} for x ≥ 0. The exponential RV is the continuous analog of the geometric
RV: it represents the waiting time to the next event, where λ > 0 is the average number of events per unit time. Note that
the exponential measures how much time passes until the next event (any real number, continuous), whereas the Poisson
measures how many events occur in a unit of time (nonnegative integer, discrete). The exponential RV is also memoryless:
P (X > s + t | X > s) = P (X > t) for any s, t ≥ 0.
Gamma Random Variable: X ∼ Gamma(r, λ) (Gam(r, λ) for short) iff X has PDF:
fX (x) = (λ^r / Γ(r)) x^{r−1} e^{−λx} , x ∈ ΩX = [0, ∞)
E [X] = r/λ and Var (X) = r/λ². X is the sum of r iid Exp(λ) random variables. In the above PDF, for positive integers r,
Γ(r) = (r − 1)! (a normalizing constant). An example of a Gamma RV is the waiting time until the r-th event in the Poisson
process. If X1 , . . . , Xn are independent Gamma RV’s, where Xi ∼ Gam(ri , λ), then X = X1 +. . .+Xn ∼ Gam(r1 +. . .+rn , λ).
It also serves as a conjugate prior for λ in the Poisson and Exponential distributions.
4.3 The Normal/Gaussian Random Variable
Normal (Gaussian, “bell curve”) Random Variable: X ∼ N (µ, σ 2 ) iff X has PDF:
fX (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} , x ∈ ΩX = R
E [X] = µ and Var (X) = σ 2 . The “standard normal” random variable is typically denoted Z and has mean 0 and variance 1:
if X ∼ N (µ, σ²), then Z = (X − µ)/σ ∼ N (0, 1). The CDF has no closed form, but we denote the CDF of the standard normal as
Φ (z) = FZ (z) = P (Z ≤ z). Note from symmetry of the probability density function about z = 0 that: Φ (−z) = 1 − Φ(z).
Closure of the Normal Under Scale and Shift: If X ∼ N (µ, σ 2 ), then aX + b ∼ N (aµ + b, a2 σ 2 ). In particular, we
can always scale/shift to get the standard Normal: (X − µ)/σ ∼ N (0, 1).
Closure of the Normal Under Addition: If X ∼ N (µX , σX²) and Y ∼ N (µY , σY²) are independent, then
aX + bY + c ∼ N (aµX + bµY + c, a²σX² + b²σY²)
Explicit Formula to compute PDF of Y = g(X) from X (Univariate Case): Suppose X is a continuous RV. If Y =
g(X) and g : ΩX → ΩY is strictly monotone and invertible with inverse X = g −1 (Y ) = h(Y ), then
fY (y) = fX (h(y)) · |h′(y)| if y ∈ ΩY , and fY (y) = 0 otherwise
Explicit Formula to compute PDF of Y = g(X) from X (Multivariate Case): Let X = (X1 , ..., Xn ), Y =
(Y1 , ..., Yn ) be continuous random vectors (each component is a continuous rv) with the same dimension n (so ΩX , ΩY ⊆ Rn ),
and Y = g(X) where g : ΩX → ΩY is invertible and differentiable, with differentiable inverse X = g −1 (y) = h(y). Then,
fY (y) = fX (h(y)) · |det(∂h(y)/∂y)|
where ∂h(y)/∂y ∈ R^{n×n} is the Jacobian matrix of partial derivatives of h, with (∂h(y)/∂y)_{ij} = ∂(h(y))_i / ∂y_j
pX,Y (a, b) = P (X = a, Y = b)
The joint range is the set of pairs (c, d) that have nonzero probability: ΩX,Y = {(c, d) : pX,Y (c, d) > 0}.
Further, note that if g : R2 → R is a function, then LOTUS extends to the multidimensional case:
E [g(X, Y )] = Σ_{x∈ΩX} Σ_{y∈ΩY} g(x, y) pX,Y (x, y)
Marginal PMFs: Let X, Y be discrete random variables. The marginal PMF of X is: pX (a) = Σ_{b∈ΩY} pX,Y (a, b).
Independence (DRVs): Discrete RVs X, Y are independent, written X ⊥ Y , if for all x ∈ ΩX and y ∈ ΩY : pX,Y (x, y) =
pX (x)pY (y).
Variance Adds for Independent RVs: If X ⊥ Y , then: Var (X + Y ) = Var (X) + Var (Y ).
fX,Y (a, b) ≥ 0
The joint range is the set of pairs (c, d) that have nonzero density: ΩX,Y = {(c, d) : fX,Y (c, d) > 0}.
Further, note that if g : R2 → R is a function, then LOTUS extends to the multidimensional case:
E [g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(s, t) fX,Y (s, t) ds dt
The joint PDF must satisfy the following (similar to univariate PDFs):
P (a ≤ X < b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d fX,Y (x, y) dy dx
Marginal PDFs: Let X, Y be continuous random variables. The marginal PDF of X is: fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy.
Independence of Continuous Random Variables: Continuous RVs X, Y are independent, written X ⊥ Y , if for all
x ∈ ΩX and y ∈ ΩY , fX,Y (x, y) = fX (x)fY (y).
5.3 Conditional Distributions
Conditional PMFs and PDFs: If X, Y are discrete, the conditional PMF of X given Y is: pX|Y (x | y) = pX,Y (x, y)/pY (y).
Similarly for continuous RVs, but with f ’s instead of p’s (PDFs instead of PMFs).
Conditional Expectation: If X is discrete (and Y is either discrete or continuous), then we define the conditional expec-
tation of g(X) given (the event that) Y = y as:
E [g(X) | Y = y] = Σ_{x∈ΩX} g(x) pX|Y (x | y)
Notice that these sums and integrals are over x (not y), since E [g(X) | Y = y] is a function of y.
Law of Total Expectation (LTE): Let X, Y be jointly distributed random variables.
If Y is discrete (and X is either discrete or continuous), then:
E [g(X)] = Σ_{y∈ΩY} E [g(X) | Y = y] pY (y)
Basically, for E [g(X)], we take a weighted average of E [g(X) | Y = y] over all possible values of y.
5. Cov (aX + bY, Z) = a · Cov (X, Z) + b · Cov (Y, Z). This can be easily remembered like the distributive property of scalars:
(aX + bY )Z = a(XZ) + b(Y Z).
6. Var (X + Y ) = Var (X) + Var (Y ) + 2Cov (X, Y ), and hence if X ⊥ Y , then Var (X + Y ) = Var (X) + Var (Y ).
7. Cov(Σ_{i=1}^n Xi , Σ_{j=1}^m Yj) = Σ_{i=1}^n Σ_{j=1}^m Cov (Xi , Yj ). That is, covariance works like FOIL (first, outer, inner, last) for
multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
(Pearson) Correlation: The (Pearson) correlation of X and Y is: ρ(X, Y ) = Cov(X, Y ) / (√Var(X) · √Var(Y )).
It is always true that −1 ≤ ρ(X, Y ) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably,
ρ(X, Y ) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
Variance of Sums of RVs: Let X1 , . . . , Xn be any RVs (independent or not). Then,
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var (Xi ) + 2 Σ_{i<j} Cov (Xi , Xj )
5.5 Convolution
Law of Total Probability for Random Variables:
Discrete version: If X, Y are discrete:
pX (x) = Σ_y pX,Y (x, y) = Σ_y pX|Y (x | y) pY (y)
Cov (X) whose entries are Σij = Cov (Xi , Xj ). The formula for this is:
Σ = Var (X) = Cov (X) = E[(X − µ)(X − µ)^T] = E[XX^T] − µµ^T
    [ Var (X1 )       Cov (X1 , X2 )  · · ·  Cov (X1 , Xn ) ]
  = [ Cov (X2 , X1 )  Var (X2 )       · · ·  Cov (X2 , Xn ) ]
    [ ...             ...             · · ·  ...            ]
    [ Cov (Xn , X1 )  Cov (Xn , X2 )  · · ·  Var (Xn )      ]
Notice that the covariance matrix is symmetric (Σij = Σji ), and has variances on the diagonal.
The Multinomial Distribution: Suppose there are r outcomes, with probabilities p = (p1 , p2 , ..., pr ) respectively, such
that Σ_{i=1}^r pi = 1. Suppose we have n independent trials, and let Y = (Y1 , Y2 , ..., Yr ) be the rvtr of counts of each outcome.
Then, we say Y ∼ Multr (n, p):
The joint PMF of Y is:
pY1 ,...,Yr (k1 , ..., kr ) = (n choose k1 , ..., kr ) ∏_{i=1}^r pi^{ki} , where k1 , ..., kr ≥ 0 and Σ_{i=1}^r ki = n
Notice that each Yi is marginally Bin(n, pi ). Hence, E [Yi ] = npi and Var (Yi ) = npi (1 − pi ).
Then, we can specify the entire mean vector E [Y] and covariance matrix:
np1
E [Y] = np = ... Var (Yi ) = npi (1 − pi ) Cov (Yi , Yj ) = −npi pj
npr
The Multivariate Hypergeometric (MVHG) Distribution: Suppose there are r different colors of balls in a bag,
having K = (K1 , ..., Kr ) balls of each color i, 1 ≤ i ≤ r. Let N = Σ_{i=1}^r Ki be the total number of balls in the bag, and suppose
we draw n without replacement. Let Y = (Y1 , ..., Yr ) be the rvtr such that Yi is the number of balls of color i we drew. We
write that Y ∼ MVHGr (N, K, n). The joint PMF of Y is:
pY1 ,...,Yr (k1 , ..., kr ) = ∏_{i=1}^r C(Ki , ki ) / C(N, n) , where 0 ≤ ki ≤ Ki for all 1 ≤ i ≤ r and Σ_{i=1}^r ki = n
The mean vector E [Y] and covariance matrix are:
E [Y] = n K/N = (n K1 /N, . . . , n Kr /N )    Var (Yi ) = n · (Ki /N ) · ((N − Ki )/N ) · ((N − n)/(N − 1))
Cov (Yi , Yj ) = −n · (Ki /N ) · (Kj /N ) · ((N − n)/(N − 1)) (for i ≠ j)
Notice that we can’t have equality because with continuous random variables, the probability that any two are equal is 0.
Notice that each Y(i) is a random variable as well! We call Y(i) the ith order statistic, i.e. the ith smallest in a sample of
size n. The density function of each Y(i) is
fY(i) (y) = (n choose i − 1, 1, n − i) · [FY (y)]^{i−1} · [1 − FY (y)]^{n−i} · fY (y) , y ∈ ΩY
6 Concentration Inequalities
6.1 Markov and Chebyshev Inequalities
Markov’s Inequality: Let X ≥ 0 be a non-negative RV, and let k > 0. Then: P (X ≥ k) ≤ E [X]/k.
Chebyshev’s Inequality: Let X be any RV with expected value µ = E[X] and finite variance Var (X). Then, for any real
number α > 0: P (|X − µ| ≥ α) ≤ Var (X)/α².
6.2 The Chernoff Bound
Chernoff Bound for Binomial: Let X ∼ Bin(n, p) and let µ = E[X]. For any 0 < δ < 1:
P (X ≥ (1 + δ)µ) ≤ exp(−δ²µ/3)    and    P (X ≤ (1 − δ)µ) ≤ exp(−δ²µ/2)
Convex Functions: Let S ⊆ R^n be a convex set. A function g : S → R is a convex function if for any x1 , ..., xm ∈ S,
and p1 , ..., pm ≥ 0 such that Σ_{i=1}^m pi = 1,
g(Σ_{i=1}^m pi xi ) ≤ Σ_{i=1}^m pi g(xi )
Jensen’s Inequality: Let X be any RV, and g : R → R be convex. Then, g(E [X]) ≤ E [g(X)].
Hoeffding’s Inequality: Let X1 , ..., Xn be independent random variables, where each Xi is bounded: ai ≤ Xi ≤ bi , and let
X̄n be their sample mean. Then,
P (|X̄n − E[X̄n ]| ≥ t) ≤ 2 exp(−2n²t² / Σ_{i=1}^n (bi − ai )²)
In the case X1 , ..., Xn are iid (so a ≤ Xi ≤ b for all i) with mean µ, then
P (|X̄n − µ| ≥ t) ≤ 2 exp(−2n²t²/(n(b − a)²)) = 2 exp(−2nt²/(b − a)²)
7 Statistical Estimation
7.1 Maximum Likelihood Estimation
Realization / Sample: A realization/sample x of a random variable X is the value that is actually observed (will always
be in ΩX ).
Likelihood: Let x = (x1 , ..., xn ) be iid realizations from PMF pX (t | θ) (if X is discrete), or from density fX (t | θ) (if X is
continuous), where θ is a parameter (or vector of parameters). We define the likelihood of x given θ to be the “probability”
of observing x if the true parameter is θ. The log-likelihood is just the log of the likelihood, which is typically easier to
optimize.
If X is discrete,
L(x | θ) = ∏_{i=1}^n pX (xi | θ)    and    ln L(x | θ) = Σ_{i=1}^n ln pX (xi | θ)
If X is continuous,
L(x | θ) = ∏_{i=1}^n fX (xi | θ)    and    ln L(x | θ) = Σ_{i=1}^n ln fX (xi | θ)
Maximum Likelihood Estimator (MLE): Let x = (x1 , ..., xn ) be iid realizations from probability mass function pX (t | θ)
(if X is discrete), or from density fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define
the maximum likelihood estimator (MLE) θ̂MLE of θ to be the parameter which maximizes the likelihood/log-likelihood:
θ̂MLE = arg max_θ L(x | θ) = arg max_θ ln L(x | θ)
We then define the Method of Moments (MoM) estimator θ̂M oM of θ = (θ1 , . . . , θk ) to be a solution (if it ex-
ists) to the k simultaneous equations where, for j = 1, . . . , k, we set the j th true and sample moments equal:
E [X] = (1/n) Σ_{i=1}^n xi    · · ·    E [X^k] = (1/n) Σ_{i=1}^n xi^k
Beta Random Variable: X ∼ Beta(α, β) iff X has density function:
fX (x) = (1/B(α, β)) x^{α−1} (1 − x)^{β−1} for x ∈ ΩX = [0, 1], and fX (x) = 0 otherwise
X is typically the belief distribution about some unknown probability of success, where we pretend we’ve seen α − 1 successes
and β − 1 failures. Hence the mode (most likely value of the probability/point with highest density), arg max_{x∈[0,1]} fX (x), is
mode[X] = (α − 1)/((α − 1) + (β − 1))
Also note that there is an annoying “off-by-1” issue: (α − 1 heads and β − 1 tails), so when choosing these parameters, be
careful! It also serves as a conjugate prior for p in the Bernoulli and Geometric distributions.
Dirichlet RV: X ∼ Dir(α1 , α2 , . . . , αr ), if and only if X has the following density function:
fX (x) = (1/B(α)) ∏_{i=1}^r xi^{αi −1} for xi ∈ (0, 1) with Σ_{i=1}^r xi = 1, and fX (x) = 0 otherwise
This is a generalization of the Beta random variable from 2 outcomes to r. The random vector X is typically the belief
distribution about some unknown probabilities of the different outcomes, where we pretend we saw α1 − 1 outcomes of type
1, α2 − 1 outcomes of type 2, . . . , and αr − 1 outcomes of type r. Hence, the mode of the distribution, arg max fX (x) over
x ∈ [0, 1]^r with Σ xi = 1, is the vector
mode[X] = ((α1 − 1)/Σ_{i=1}^r (αi − 1), (α2 − 1)/Σ_{i=1}^r (αi − 1), . . . , (αr − 1)/Σ_{i=1}^r (αi − 1))
Fisher Information: Let x = (x1 , ..., xn ) be iid realizations from PMF pX (t | θ) (if X is discrete), or from density function
fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). The Fisher Information of a parameter θ
is defined to be
I(θ) = n · E[(∂ ln L(x | θ)/∂θ)²] = −E[∂² ln L(x | θ)/∂θ²]
Cramer-Rao Lower Bound (CRLB): Let x = (x1 , ..., xn ) be iid realizations from PMF pX (t | θ) (if X is discrete), or
from density function fX (t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). If θ̂ is an unbiased
estimator for θ, then
MSE(θ̂, θ) = Var(θ̂) ≥ 1/I(θ)
That is, for any unbiased estimator θ̂ for θ, the variance (=MSE) is at least 1/I(θ). If we achieve this lower bound, meaning
our variance is exactly equal to 1/I(θ), then we have the best variance possible for our estimate. Hence, it is the minimum
variance unbiased estimator (MVUE) for θ.
Efficiency: Let θ̂ be an unbiased estimator of θ. The efficiency of θ̂ is e(θ̂, θ) = I(θ)^{−1} / Var(θ̂) ≤ 1.
An estimator is said to be efficient if it achieves the CRLB - meaning e(θ̂, θ) = 1.
7.8 Properties of Estimators III
Statistic: A statistic is any function T : R^n → R of samples x = (x1 , . . . , xn ). For example, T (x1 , . . . , xn ) = Σ_{i=1}^n xi (the
sum), T (x1 , . . . , xn ) = max{x1 , . . . , xn } (the max/largest value), T (x1 , . . . , xn ) = x1 (just take the first sample).
Sufficiency: A statistic T = T (X1 , . . . , Xn ) is a sufficient statistic if the conditional distribution of X1 , . . . , Xn given
T = t and θ does not depend on θ.
P (X1 = x1 , . . . , Xn = xn | T = t, θ) = P (X1 = x1 , . . . , Xn = xn | T = t)
Neyman-Fisher Factorization Criterion (NFFC): Let x1 , . . . , xn be iid random samples with likelihood
L(x1 , . . . , xn | θ). A statistic T = T (x1 , . . . , xn ) is sufficient if and only if there exist non-negative functions g and h such that:
L(x1 , . . . , xn | θ) = g(T (x1 , . . . , xn ), θ) · h(x1 , . . . , xn )
8 Statistical Inference
8.1 Confidence Intervals
Confidence Interval: Suppose you have iid samples x1 ,...,xn from some distribution with unknown parameter θ, and you
have some estimator θ̂ for θ.
A 100(1 − α)% confidence interval for θ is an interval (typically but not always centered at θ̂) [θ̂ − ∆, θ̂ + ∆], such that
the probability (over the randomness in the samples x1 ,...,xn ) that θ lies in the interval is 1 − α:
P (θ ∈ [θ̂ − ∆, θ̂ + ∆]) = 1 − α
If θ̂ = (1/n) Σ_{i=1}^n xi is the sample mean, then θ̂ is approximately normal by the CLT, and a 100(1 − α)% confidence interval is
given by the formula:
[θ̂ − z_{1−α/2} σ/√n , θ̂ + z_{1−α/2} σ/√n]
where z_{1−α/2} = Φ^{−1}(1 − α/2) and σ is the true standard deviation of a single sample (which may need to be estimated).
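As a quick sketch of this formula in Python (assuming numpy and scipy are available; σ is estimated by the sample standard deviation):

import numpy as np
from scipy.stats import norm

def normal_ci(x, alpha=0.05):
    # CLT-based 100(1 - alpha)% confidence interval for the mean.
    x = np.asarray(x)
    theta_hat = np.mean(x)            # sample mean
    sigma = np.std(x, ddof=1)         # estimate of a single sample's std dev
    z = norm.ppf(1 - alpha / 2)       # z_{1 - alpha/2}
    delta = z * sigma / np.sqrt(len(x))
    return theta_hat - delta, theta_hat + delta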
8.2 Credible Intervals
Credible Intervals: Suppose you have iid samples x = (x1 , ..., xn ) from some distribution with unknown parameter Θ.
You are in the Bayesian setting, so you have chosen a prior distribution for the RV Θ.
A 100(1 − α)% credible interval for Θ is an interval [a, b] such that the probability (over the randomness in Θ) that Θ lies
in the interval is 1 − α:
P (Θ ∈ [a, b]) = 1 − α
If we’ve chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli), the posterior is easy
to compute. Say the CDF of the posterior is FY . Then, a 100(1 − α)% credible interval is given by
[FY^{−1}(α/2) , FY^{−1}(1 − α/2)]
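For example, with a Beta posterior (a sketch assuming scipy; the prior parameters and observed counts below are hypothetical):

from scipy.stats import beta

# Beta(a, b) prior on p; after h successes and t failures, the
# posterior is Beta(a + h, b + t) by conjugacy (Beta for Bernoulli).
a, b, h, t, alpha = 1, 1, 7, 3, 0.05
posterior = beta(a + h, b + t)
print(posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2))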
1. Make a claim (like “Airplane food is good”, “Pineapples belong on pizza”, etc...)