Probability For Computer Scientists: Stanford, California
Contents

Acknowledgments

1 Counting
1.1 Sum Rule
1.2 Product Rule
1.3 The Inclusion-Exclusion Principle
1.4 Double Counting and Constraints
1.5 General Principle of Counting
1.6 Floor and Ceiling
1.7 The Pigeonhole Principle
1.8 Bibliography

2 Combinatorics
2.1 Permutations
2.2 Permutations of Indistinct Objects
2.3 Combinations
2.4 Selecting Multiple Groups of Objects
2.5 Bucketing/Group Assignment

3 Probability
3.1 Event Spaces and Sample Spaces
3.2 Probability
3.3 Axioms of Probability
3.4 Provable Identities of Probability
3.5 Equally Likely Outcomes

4 Conditional Probability
4.1 Conditional Probability
4.2 Law of Total Probability
4.3 Bayes' Theorem
4.4 Conditional Paradigm

5 Independence
5.1 Independence
5.2 Conditional Independence

6 Random Variables
6.1 Probability Mass Function
6.2 Expectation
6.3 Properties of Expectation
6.4 Variance

9 Continuous Distributions
9.1 Probability Density Functions
9.2 Cumulative Distribution Function
9.3 Expectation and Variance
9.4 Uniform Random Variable
9.5 Exponential Random Variable

10 Normal Distribution
10.1 Normal Random Variable
10.2 Linear Transform
10.3 Projection to Standard Normal
10.4 Binomial Approximation

11 Joint Distributions
11.1 Joint Distributions
11.2 Discrete Case
11.3 Multinomial Distribution

14 Conditional Expectation
14.1 Conditional Distributions
14.2 Conditional Expectation
14.3 Properties of Conditional Expectation
14.4 Law of Total Expectation

15 Inference
15.1 Bayesian Networks
15.2 General Inference via Sampling the Joint Distribution
15.3 General Inference when Conditioning on Rare Events

A Review
Quiz 1
References
This work is taken from the lecture notes for the course Probability for Computer Scientists at Stanford University, CS 109 (cs109.stanford.edu). The contributors to the content of this work are Chris Piech, Mehran Sahami, and Lisa Yan; this collection is simply a typesetting of existing lecture notes with minor modifications/additions. We would like to thank the original authors for their contribution. In addition, we wish to thank Mykel Kochenderfer and Tim Wheeler for their contribution to the Tufte-Algorithms LaTeX template, based on Algorithms for Optimization (M. J. Kochenderfer and T. A. Wheeler, MIT Press, 2019).

Robert J. Moss
Stanford, Calif.
July 29, 2020
Example 1.1. Sum Rule example counting server requests.

Problem: You are running an online social networking application which has its distributed servers housed in two different data centers, one in San Francisco and the other in Boston. The San Francisco data center has 100 servers in it and the Boston data center has 50 servers in it. If a server request is sent to the application, how large is the set of servers it may get routed to?

Solution: Since the request can be sent to either of the two data centers and none of the machines in either data center are the same, the Sum Rule of Counting applies. Using this rule, we know that the request could potentially be routed to any of the 100 + 50 = 150 servers.
Product Rule of Counting: If an experiment has two parts, where the first part
can result in one of m outcomes and the second part can result in one of n outcomes
regardless of the outcome of the first part, then the total number of outcomes for
the experiment is mn.
Rewritten using set notation, the Product Rule states that if an experiment with two parts has an outcome from set A in the first part, where |A| = m, and an outcome from set B in the second part (regardless of the outcome of the first part), where |B| = n, then the total number of outcomes of the experiment is |A||B| = mn.
Note that the Product Rule for Counting is very similar to ''the basic principle of counting'' given in the Ross textbook (S. M. Ross et al., A First Course in Probability, 2006, vol. 7).
Example 1.2. Product Rule example with two 6-sided dice.

Problem: Two 6-sided dice, with faces numbered 1 through 6, are rolled. How many possible outcomes of the roll are there?

Solution: Note that we are not concerned with the total value of the two dice, but rather the set of all explicit outcomes of the rolls. Since the first die can come up with 6 possible values and the second die similarly can have 6 possible values (regardless of what appeared on the first die), the total number of potential outcomes is 36 = 6 · 6. These possible outcomes can be listed explicitly as a series of pairs, denoting the values rolled on the pair of dice: (1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6). (Note: ''die'' is the singular form of the word ''dice'', which is the plural form.)
Example 1.3. Product Rule example with string hashing.

Problem: Consider a hash table with 100 buckets. Two arbitrary strings are independently hashed and added to the table. How many possible ways are there for the strings to be stored in the table?

Solution: Each string can be hashed to one of 100 buckets. Since the result of hashing the first string does not impact the hash of the second, there are 100 · 100 = 10,000 ways that the two strings may be stored in the hash table.
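Both product-rule counts above are small enough to confirm by brute-force enumeration. A minimal sketch in Python (the bucket count of 100 comes from the example; everything else is plain enumeration):

```python
from itertools import product

# All ordered outcomes of rolling two 6-sided dice.
dice_outcomes = list(product(range(1, 7), repeat=2))
print(len(dice_outcomes))  # 36 = 6 * 6

# All ways two strings can land in a 100-bucket hash table:
# each string independently occupies one of 100 buckets.
placements = list(product(range(100), repeat=2))
print(len(placements))  # 10000 = 100 * 100
```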
Example 1.4. Inclusion-Exclusion Principle example with bit strings.

Problem: An 8-bit string (one byte) is sent over a network. The valid set of strings recognized by the receiver must either start with 01 or end with 10. How many such strings are there?

Solution: Let A be the set of strings that start with 01; since the last 6 bits are unspecified, |A| = 2^6 = 64. Let B be the set of strings that end with 10; since the first 6 bits are unspecified, |B| = 64 as well. Of course, these two sets overlap: a string that both starts with 01 and ends with 10 has 4 arbitrary middle bits, so |A ∩ B| = 2^4 = 16. By the Inclusion-Exclusion Principle, there are 64 + 64 − 16 = 112 strings that match the specified receiver's criteria.
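The three set sizes and the final count can be verified by enumerating all 2^8 = 256 bit strings; a minimal sketch:

```python
from itertools import product

# Enumerate all 8-bit strings and classify them by prefix/suffix.
strings = ["".join(bits) for bits in product("01", repeat=8)]
start_01 = [s for s in strings if s.startswith("01")]
end_10 = [s for s in strings if s.endswith("10")]
both = [s for s in strings if s.startswith("01") and s.endswith("10")]
valid = [s for s in strings if s.startswith("01") or s.endswith("10")]
print(len(start_01), len(end_10), len(both))  # 64 64 16
print(len(valid))  # 112 = 64 + 64 - 16
```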
There are many reasons you may have counted some elements more than once (that is, ''double counted''). One common case is a constraint in the problem that you must contend with. If you over-counted by including extra elements, then you have to subtract off the number of elements that were double counted. If instead you counted every element the same number of times (some multiple), then you can divide your total count by that multiple to get the correct final answer.
Example 1.5. General Principle of Counting example with California license plates.

Problem: California license plates prior to 1982 had only 6-place license plates, where the first three places were uppercase letters A–Z, and the last three places were numeric 0–9. How many such 6-place license plates were possible pre-1982?

Solution: We can treat each of the positions of the license plate as a separate part of an overall six-part experiment. That is, the first three parts of the experiment each have 26 outcomes, corresponding to the letters A–Z, and the last three parts of the experiment each have 10 outcomes, corresponding to the digits 0–9. By the General Principle of Counting, we have 26 × 26 × 26 × 10 × 10 × 10 = 17,576,000 possible license plates. Interestingly enough, the current population of California is 39.5 million residents as of 2017, so this would not nearly be enough license plates for each person to own one vehicle. Fortunately, in 1982, California changed to 7-place license plates by prepending a numeric digit, resulting in 10 × 26 × 26 × 26 × 10 × 10 × 10 = 175,760,000 possible 7-place license plates. This is enough for each resident in California to own approximately 4.5 vehicles.
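The plate counts are a direct multiplication; a minimal sketch (the population figure of 39.5 million is from the example above):

```python
# Plate counts via the General Principle of Counting:
# multiply the number of outcomes of each independent position.
letters, digits = 26, 10

six_place = letters**3 * digits**3
seven_place = digits * letters**3 * digits**3
print(six_place)    # 17576000
print(seven_place)  # 175760000
print(seven_place / 39_500_000)  # about 4.45 plates per resident
```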
Floor and ceiling are two handy functions that we give below just for reference. Besides, their names sound so much neater than ''rounding down'' and ''rounding up'', and they are well-defined on negative numbers too.

Floor function: The floor function assigns to the real number x the largest integer that is less than or equal to x. The floor function applied to x is denoted ⌊x⌋.

Ceiling function: The ceiling function assigns to the real number x the smallest integer that is greater than or equal to x. The ceiling function applied to x is denoted ⌈x⌉.

Examples of the floor and ceiling functions operating on the same numbers: ⌊2.7⌋ = 2 while ⌈2.7⌉ = 3, and ⌊−2.7⌋ = −3 while ⌈−2.7⌉ = −2.
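These examples are easy to reproduce; a minimal sketch using Python's standard math module:

```python
import math

# Floor rounds toward negative infinity; ceiling rounds toward positive infinity.
print(math.floor(2.7), math.ceil(2.7))    # 2 3
print(math.floor(-2.7), math.ceil(-2.7))  # -3 -2
print(math.floor(5), math.ceil(5))        # 5 5  (integers are fixed points)
```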
Basic Pigeonhole Principle: For positive integers m and n, if m objects are placed in n buckets, where m > n, then at least one bucket must contain at least two objects.

General Pigeonhole Principle: For positive integers m and n, if m objects are placed in n buckets, then at least one bucket must contain at least ⌈m/n⌉ objects.
Example 1.6. General Pigeonhole Principle example with hash tables.

Problem: Consider a hash table with 100 buckets. 950 strings are hashed and added to the table.

a. Is it possible that a bucket in the table has no entries?
b. Is it guaranteed that at least one bucket in the table has at least two entries?
c. Is it guaranteed that at least one bucket in the table has at least 10 entries?
d. Is it guaranteed that at least one bucket in the table has at least 11 entries?
Solution:
a. Yes. As one example, it is possible (albeit very improbable) that all 950
strings get hashed to the same bucket (say bucket 0). In this case bucket 1
would have no entries.
b. Yes. Since 950 objects are placed in 100 buckets and 950 > 100, by the Basic Pigeonhole Principle, it follows that at least one bucket must contain at least two entries.

c. Yes. Since 950 objects are placed in 100 buckets and ⌈950/100⌉ = ⌈9.5⌉ = 10, by the General Pigeonhole Principle, it follows that at least one bucket must contain at least 10 entries.

d. No. As one example, consider the case where the first 50 buckets each contain 10 entries and the second 50 buckets each contain 9 entries. This accounts for all 950 entries (50 · 10 + 50 · 9 = 950), but there is no bucket that contains 11 entries in the hash table.
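Parts (c) and (d) can be checked numerically; a minimal sketch:

```python
import math

# General Pigeonhole: with 950 strings in 100 buckets, some bucket
# must hold at least ceil(950/100) = 10 entries.
guaranteed_min = math.ceil(950 / 100)
print(guaranteed_min)  # 10

# Part (d): an explicit counterexample where no bucket exceeds 10 entries.
buckets = [10] * 50 + [9] * 50
print(sum(buckets), max(buckets))  # 950 10
```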
1.8 Bibliography
For additional information on counting, you can consult a good discrete mathematics or probability textbook. Some of the discussion above is based on the treatment in Discrete Mathematics and Its Applications (K. Rosen, 2007, vol. 6).
Example 2.1. Permutation example for iPhone passcode attempts with n = 4 fingerprint smudges.

Problem, Part A: iPhones used to have 4-digit passcodes. Suppose there are 4 smudges over 4 digits on the screen. How many distinct passcodes are possible?

Solution: Since the order of digits in the code is important, we should use permutations. And since there are exactly four smudges, we know that each number is distinct. Thus, we can plug into the permutation formula: 4! = 24.
Example 2.2. Permutation example for iPhone passcode attempts with n = 3 fingerprint smudges.

Problem, Part B: What if there are 3 smudges over 3 digits on the screen?

Solution: One of the 3 digits is repeated, but we don't know which one. We can solve this by making three cases, one for each digit that could be repeated (each with the same number of permutations). Let A, B, and C represent the 3 digits, with C repeated twice. We can initially pretend the two C's are distinct (A, B, C1, C2), which gives 4! orderings; since the two C's are actually identical, each case has

4! / (2! · 1! · 1!) = 12

permutations. Adding up the three cases for the different repeated digits gives:

3 · 4! / (2! · 1! · 1!) = 3 · 12 = 36
Example 2.3. Permutation example for iPhone passcode attempts with n = 2 fingerprint smudges.

Problem, Part C: What if there are 2 smudges over 2 digits on the screen?

Solution: There are two possibilities: 2 digits used twice each, or 1 digit used 3 times and the other digit used once.

4! / (2! · 2!) + 2 · 4! / (3! · 1!) = 6 + (2 · 4) = 6 + 8 = 14
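All three smudge counts can be checked by brute force: enumerate every 4-digit code over the smudged digits and keep those that use each smudged digit at least once. A minimal sketch (the specific digit choices are arbitrary placeholders):

```python
from itertools import product

def codes_using_all(digits):
    """Count 4-digit codes drawn from `digits` that use every digit at least once."""
    return sum(1 for code in product(digits, repeat=4) if set(code) == set(digits))

print(codes_using_all("1234"))  # 24 = 4!          (Part A)
print(codes_using_all("123"))   # 36               (Part B)
print(codes_using_all("12"))    # 14 = 2^4 - 2     (Part C)
```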
[Figure 2.7. The six insertion orderings (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) and their resulting BST structures.]
Recall the definition of a binary search tree (BST), which is a binary tree that
satisfies the following three properties for every node n in the tree:
1. n’s value is greater than all the values in its left subtree.
2. n’s value is less than all the values in its right subtree.
3. both n’s left and right subtrees are binary search trees.
Example 2.4. Permutation example with degenerate binary search trees.

Problem: How many possible binary search trees are there which contain the three values 1, 2, and 3, and have a degenerate structure (i.e., each node in the BST has at most one child)?

Solution: We start by considering the fact that the three values in the BST (1, 2, and 3) may have been inserted in any of 3! = 6 orderings (permutations). For each of the 3! ways the values could have been ordered when being inserted into the BST, we can determine what the resulting structure would be and determine which of them are degenerate. We consider each possible ordering of the three values; the resulting BST structures are shown in figure 2.7. We see that there are 4 degenerate BSTs here (the first two and last two).
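The count of 4 can be confirmed by simulating the insertions directly. A minimal sketch, assuming the standard BST insertion described above:

```python
from itertools import permutations

class Node:
    """Plain BST node following the insertion rule in the text."""
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

    def insert(self, value):
        side = "left" if value < self.value else "right"
        child = getattr(self, side)
        if child is None:
            setattr(self, side, Node(value))
        else:
            child.insert(value)

def is_degenerate(node):
    """A BST is degenerate if every node has at most one child."""
    if node is None:
        return True
    if node.left is not None and node.right is not None:
        return False
    return is_degenerate(node.left) and is_degenerate(node.right)

degenerate = 0
for order in permutations([1, 2, 3]):
    root = Node(order[0])
    for value in order[1:]:
        root.insert(value)
    degenerate += is_degenerate(root)
print(degenerate)  # 4 of the 3! = 6 insertion orders yield degenerate BSTs
```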
Example 2.5. Permutation problem with bit strings.

Problem: How many distinct bit strings can be formed from three 0's and two 1's?

Solution: Temporarily label the identical digits as distinguishable: 0₁, 0₂, 0₃ and 1₁, 1₂. Then, for example, the following labeled permutations all correspond to the same bit string 01100:

0₁ 1₁ 1₂ 0₂ 0₃
0₁ 1₁ 1₂ 0₃ 0₂
0₂ 1₁ 1₂ 0₁ 0₃
0₂ 1₁ 1₂ 0₃ 0₁
0₃ 1₁ 1₂ 0₁ 0₂
0₃ 1₁ 1₂ 0₂ 0₁

If identical digits are indistinguishable, then all the listed permutations are the same. For any given permutation, there are 3! ways of rearranging the 0's and 2! ways of rearranging the 1's (resulting in indistinguishable strings). We have over-counted. Using the formula for permutations of indistinct objects, we can correct for the over-counting:

total = 5! / (3! · 2!) = 120 / 12 = 10
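The correction for over-counting can be verified by generating all 5! orderings and deduplicating; a minimal sketch:

```python
from itertools import permutations

# Distinct strings formed from three 0s and two 1s: dedupe the 5!
# orderings that differ only by swapping identical digits.
distinct = set(permutations("00011"))
print(len(distinct))  # 10 = 5! / (3! * 2!)
```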
2.3 Combinations

Binomial Coefficient: A combination is an unordered selection of r objects from a set of n objects. (Permutations account for ordering, whereas combinations do not.) If all objects are distinct, then the number of ways of making the selection is:

(n choose r) = n! / (r!(n − r)!)     (2.1)

The term (n choose r) is defined as a binomial coefficient and is often read as ''n choose r''.
Consider this general way to produce combinations: To select r unordered objects from a set of n objects, e.g., ''7 choose 3'':

1. First, permute all n objects. There are n! ways to do that.

2. Then select the first r in the permutation. There is one way to do that.

3. Note that the order of the r selected objects is irrelevant. There are r! ways to permute them. The selection remains unchanged.

4. Likewise, the order of the n − r unselected objects is irrelevant. There are (n − r)! ways to permute them.

Dividing the n! permutations by the r! and (n − r)! redundant orderings yields equation (2.1).
(n choose r) = (n − 1 choose r − 1) + (n − 1 choose r),   0 ≤ r ≤ n     (2.2)
This identity can be proved via a combinatorial argument. When we select a
group of size r from n distinct objects, then any particular object (say, object 1)
will either be part of that group or not part of that group. We can then define
sets A and B, where A is the number of ways of selecting a group that contains
object 1, and B is the number of ways of selecting a group that does not contain
object 1. For set A, if we decide to include object 1, then we must select r − 1 of the remaining n − 1 objects (since the membership of object 1 in our selection is already decided), or 1 × (n − 1 choose r − 1). For set B, if we decide to exclude object 1, then we only have n − 1 possible objects to select from to create a group of size r, or (n − 1 choose r).
These sets are mutually exclusive, and therefore by the Sum Rule of Counting the
total number of possibilities are as above.
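Identity (2.2) is easy to spot-check numerically; a minimal sketch using Python's built-in binomial function:

```python
from math import comb

# Spot-check Pascal's identity C(n, r) = C(n-1, r-1) + C(n-1, r)
# over a range of n and r.
for n in range(1, 12):
    for r in range(1, n + 1):
        assert comb(n, r) == comb(n - 1, r - 1) + comb(n - 1, r)
print(comb(7, 3))  # 35, i.e. "7 choose 3"
```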
Example 2.6. Combination problem involving the Hunger Games.

Problem: In the Hunger Games, how many ways are there of choosing 2 villagers from district 12, which has a population of 8,000?

Solution: This is a straightforward combination: (8,000 choose 2) = (8,000 · 7,999) / 2 = 31,996,000.
Example 2.7. Combination problem choosing r = 3 books from n = 6.

Problem, Part A: How many ways are there to select 3 books from a set of 6?

Solution: If each of the books is distinct, then this is another straightforward combination problem:

(6 choose 3) = 6! / (3!3!) = 20
Example 2.8. Combination problem choosing r = 3 books from n = 6 with restrictions.

Problem, Part B: How many ways are there to select 3 books if there are two books that should not both be chosen together? For example, if you are choosing 3 out of 6 probability books, don't choose both the 8th and 9th edition of the Ross textbook.

Solution: Consider three mutually exclusive cases:

Case 1: Select the 8th Ed. and 2 other non-9th Ed. books: there are (4 choose 2) ways.

Case 2: Select the 9th Ed. and 2 other non-8th Ed. books: there are (4 choose 2) ways.

Case 3: Select 3 from the books that are neither the 8th nor the 9th edition: there are (4 choose 3) ways.

Using our old friend the Sum Rule of Counting, we can add the cases:

total = 2 · (4 choose 2) + (4 choose 3) = 2 · 6 + 4 = 16
Example 2.9. Forbidden City method of solving the combination problem choosing r = 3 books from n = 6 with restrictions.

Alternatively, we could have calculated all the ways of selecting 3 books from 6, and then subtracted the ''forbidden'' ones (i.e., the selections that break the constraint). Chris Piech calls this the Forbidden City method.

Forbidden Case: Select the 8th edition and the 9th edition and 1 other book. There are (4 choose 1) ways of doing so (which equals 4).

total = all possibilities − forbidden = (6 choose 3) − (4 choose 1) = 20 − 4 = 16
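Both approaches agree with a brute-force count; a minimal sketch (the book names are placeholders):

```python
from itertools import combinations

# Choose 3 of 6 books, but never both "ross8" and "ross9" together.
books = ["ross8", "ross9", "b3", "b4", "b5", "b6"]
valid = [c for c in combinations(books, 3)
         if not ("ross8" in c and "ross9" in c)]
print(len(valid))  # 16 = C(6,3) - C(4,1) = 20 - 4
```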
You have probably heard about the dreaded ‘‘balls and urns’’ probability exam-
ples. What are those all about? They are for counting the many different ways
that we can think of stuffing elements into containers. (It turns out that Jacob
Bernoulli was into voting and ancient Rome. And in ancient Rome they used urns
for ballot boxes.) This ‘‘bucketing’’ or ‘‘group assignment’’ process is a useful
metaphor for many counting problems. Note that this bucketing problem is different from the previous combinations problem. In combinations, we have n distinct (distinguishable) objects to put in r distinct groups, and we are fixing the number of distinct objects in group i to be ni (where n1 + n2 + · · · + nr = n) for every outcome that we count. By contrast, in the bucketing problem we still have n objects to put in r distinct groups, but (1) our objects can be distinct or indistinct, and (2) for each outcome we can vary the number of objects in each distinct group i.

Example 2.10. Multinomial coefficient used to solve a server to datacenter allocation problem.

Problem: Company Camazon has 13 new servers that they would like to assign to 3 datacenters, where Datacenter A, B, and C have 6, 4, and 3 empty server racks, respectively. How many different divisions of the servers are possible?

Solution: This is a multinomial coefficient: 13! / (6! · 4! · 3!) = 60,060 possible divisions of the servers.
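The multinomial count in example 2.10 is a one-line computation; a minimal sketch:

```python
from math import factorial

# Multinomial coefficient: split 13 distinct servers into groups
# of sizes 6, 4, and 3.
divisions = factorial(13) // (factorial(6) * factorial(4) * factorial(3))
print(divisions)  # 60060
```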
Example 2.11. Bucketing used to solve a hash table problem.

Problem: Say you want to put n distinguishable balls into r urns. (No! Wait! Don't say that!) Okay, fine. No urns. Say we are going to put n strings into r buckets of a hash table where all outcomes are equally likely. How many possible ways are there of doing this?

Solution: Each of the n strings independently hashes to one of the r buckets, so by the General Principle of Counting there are r · r · · · r = r^n possible ways.
Example 2.12. Divider Method used to solve an investment problem.

Problem, Part A: Say you are a startup incubator and you have $10 million to invest in 4 companies (in $1 million increments). How many ways can you allocate this money?

Solution: This is just like putting 10 balls into 4 urns. Using the Divider Method we get:

total ways = (10 + 4 − 1 choose 10) = (13 choose 10) = 286
Example 2.13. Divider Method used to solve an investment problem with requirements.

Problem, Part B: What if you know you want to invest at least $3 million in Company 1?

Solution: Set aside the guaranteed $3 million for Company 1. The remaining $7 million can then be allocated freely among the 4 companies, which is like putting 7 balls into 4 urns:

total ways = (7 + 4 − 1 choose 7) = (10 choose 7) = 120
Example 2.14. Divider Method used to solve an investment problem with no minimum requirements.

Problem, Part C: What if you don't have to invest all $10 million? (The economy is tight, say, and you might want to save your money.)

Solution: Imagine that you have an extra company: yourself. Now you are investing $10 million in 5 companies. Thus, the answer is the same as putting 10 balls into 5 urns:

total ways = (10 + 5 − 1 choose 10) = (14 choose 10) = 1001
Example 2.15. Binomial coefficient in Julia (julialang.org) to check answers from examples 2.12 to 2.14. As a shorthand, we define the alias ⋮ = binomial, where ⋮ is created by typing \vdots and hitting tab. The normal syntax is binomial(n,k).

We can use Julia to verify our answers in examples 2.12 to 2.14:

julia> (13 ⋮ 10)
286

julia> (10 ⋮ 7)
120

julia> (14 ⋮ 10)
1001
A sample space S is the set of all possible outcomes of an experiment. For example:
• Coin flip: S = {Heads, Tails}
• Flipping two coins: S = {( H, H ), ( H, T ), ( T, H ), ( T, T )}
• Roll of 6-sided die: S = {1, 2, 3, 4, 5, 6}
• Number of emails in a day: S = { x | x ∈ Z, x ≥ 0} (non-negative integers)
• Number of Netflix hours in a day: S = { x | x ∈ R, 0 ≤ x ≤ 24}
An event space E is some subset of S that we ascribe meaning to. In set notation, E ⊆ S.
• Coin flip is heads: E = {Heads}
• ≥ 1 head on 2 coin flips: E = {( H, H ), ( H, T ), ( T, H )}
• Roll of die is 3 or less: E = {1, 2, 3}
• Number of emails in a day ≤ 20: E = { x | x ∈ Z, 0 ≤ x ≤ 20}
• ''Wasted day'' (≥ 5 Netflix hours): E = { x | x ∈ R, 5 ≤ x ≤ 24}
3.2 Probability
In the 20th century, people figured out one way to define what a probability is:
P(E) = lim_{n→∞} n(E)/n     (3.1)
where n is the number of trials performed and n( E) is the number of trials with
an outcome in E. In English this reads: say you perform n trials of an experiment.
The probability of a desired event E is defined as the ratio of the number of trials
that result in an outcome in E to the number of trials performed (in the limit as
your number of trials approaches infinity). You can also give other meanings
to the concept of a probability, however. One common meaning ascribed is that
P( E) is a measure of the chance of E occurring. I often think of a probability
in another way: I don’t know everything about the world. As a result I have to
come up with a way of expressing my belief that E will happen given my limited
knowledge. This interpretation (often referred to as the Bayesian interpretation)
acknowledges that there are two sources of probabilities: natural randomness
and our own uncertainty. Later in the quarter, we will contrast the frequentist
definition we gave you above with this other Bayesian definition of probability.
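The frequentist definition in equation (3.1) can be illustrated by simulation: as the number of trials grows, the empirical ratio n(E)/n settles near the true probability. A minimal sketch using a fair coin (so the true value is 0.5); the seed is an arbitrary choice for reproducibility:

```python
import random

random.seed(109)  # fixed seed so the run is reproducible

# Empirical estimate of P(heads) for a fair coin: n(E)/n for large n.
n = 100_000
n_heads = sum(random.random() < 0.5 for _ in range(n))
ratio = n_heads / n
print(ratio)  # settles near the true probability 0.5
```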
Axiom 1: 0 ≤ P(E) ≤ 1

Axiom 2: P(S) = 1

Axiom 3: If E and F are mutually exclusive (E ∩ F = ∅), then P(E ∪ F) = P(E) + P(F)
You can convince yourself of the first axiom by thinking about the definition of
probability above: when performing some number of trials of an actual experi-
ment, it is not possible to get more occurrences of the event than there are trials (so
probabilities are at most 1), and it is not possible to get less than 0 occurrences of
the event (so probabilities are at least 0). The second axiom makes intuitive sense
as well: if your event space is the same as the sample space, then each trial must
produce an outcome from the event space. Of course, this is just a restatement of
the definition of the sample space; it is sort of like saying that the probability of
you eating cake (event space) if you eat cake (sample space) is 1.
We often refer to these as corollaries that are directly provable from the three
axioms given above.
Identity 1: P(E) = 1 − P(Ec)

Identity 2: If E ⊆ F, then P(E) ≤ P(F)

Identity 3: P(E ∪ F) = P(E) + P(F) − P(EF) (where EF = E ∩ F)
Identity 3 generalizes to any number of events E1, E2, . . . , En:

P(E1 ∪ E2 ∪ · · · ∪ En) = ∑_{r=1}^{n} (−1)^{r+1} ∑_{i1 < i2 < · · · < ir} P(E_{i1} E_{i2} · · · E_{ir})

This last rule is somewhat complicated, but the notation makes it look far worse than it is. What we are trying to find is the probability that any of a number of events happens. The outer sum loops over the possible sizes of event subsets (that is, first we look at all single events, then pairs of events, then subsets of events of size 3, etc.). The (−1)^{r+1} term tells you whether you add or subtract terms with that set size. The inner sum sums over all subsets of that size. The less-than signs ensure that you don't count a subset of events twice, by requiring that the indices i1, . . . , ir are in ascending order.
Here’s how that looks for three events ( E1 , E2 , E3 ):
P( E1 ∪ E2 ∪ E3 ) = P( E1 ) + P( E2 ) + P( E3 )
− P( E1 E2 ) − P( E1 E3 ) − P( E2 E3 )
+ P( E1 E2 E3 )
For example, suppose P(E) = 0.28, P(F) = 0.07, and P(EF) = 0.05. Then the probability that neither E nor F occurs is:

P((E ∪ F)c) = 1 − P(E ∪ F)                      (Identity 1)
            = 1 − [P(E) + P(F) − P(EF)]          (Identity 3)
            = 1 − (0.28 + 0.07 − 0.05) = 0.7
Probability with equally likely outcomes: For a sample space S in which all outcomes are equally likely,

P(each outcome) = 1/|S|

and for any event E ⊆ S,

P(E) = (number of outcomes in E) / (number of outcomes in S) = |E| / |S|.     (3.3)
Example 3.1. Equally likely outcomes for two six-sided dice rolls that sum to 7.

Problem: You roll two six-sided dice. What is the probability that the sum of the two rolls is 7?

Solution: Define the sample space as a space of pairs, where the two elements are the outcomes of the first and second dice rolls, respectively. The event is the subset of this sample space where the sum of the paired elements is 7.
S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)
(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)
(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)
(4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
E = {(6, 1), (5, 2), (4, 3), (3, 4), (2, 5), (1, 6)}
Since all outcomes are equally likely, the probability of this event is:
P(E) = |E| / |S| = 6/36 = 1/6
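The count of favorable outcomes can be confirmed by enumerating the full sample space; a minimal sketch:

```python
from itertools import product

# Count dice-roll pairs summing to 7 over the 36 equally likely outcomes.
S = list(product(range(1, 7), repeat=2))
E = [roll for roll in S if sum(roll) == 7]
print(len(E), len(S))   # 6 36
print(len(E) / len(S))  # 0.1666... = 1/6
```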
Example 3.2. Equally likely outcomes picking fruit out of a bag.

Problem: There are 4 oranges and 3 apples in a bag. You draw out 3. What is the probability that you draw 1 orange and 2 apples?

Solution: Treat the 7 pieces of fruit as distinct, so that all (7 choose 3) = 35 draws of 3 are equally likely. There are (4 choose 1) ways to pick the orange and (3 choose 2) ways to pick the apples, so P = (4 · 3)/35 = 12/35.
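The fruit draw can be brute-forced by labeling the fruit distinctly, so every 3-element draw is equally likely; a minimal sketch (labels are placeholders):

```python
from itertools import combinations

# 4 oranges ("o1".."o4") and 3 apples ("a1".."a3"); count draws of 3
# with exactly 1 orange among all C(7,3) = 35 equally likely draws.
bag = ["o1", "o2", "o3", "o4", "a1", "a2", "a3"]
draws = list(combinations(bag, 3))
favorable = [d for d in draws if sum(f.startswith("o") for f in d) == 1]
print(len(favorable), len(draws))  # 12 35
```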
Example 3.3. Equally likely outcomes flipping the next cards in a deck of 52.

Problem: In a 52-card deck, cards are flipped one at a time. After the first ace (of any suit) appears, consider the next card. Is the next card more likely to be the Ace of Spades than the 2 of Clubs? (This problem is based on Example 5j in Chapter 2.5 of Ross's textbook, 10th Edition.)
Solution: No; the probabilities are equal. The difficulty of this problem
stems from defining an experiment that gives equally likely outcomes while
preserving the specifications of the original problem. An incorrect approach
is to define the experiment as just drawing a pair of cards (first ace, next
card) because then we discard all information about the cards flipped prior
to the pair. Instead, consider the experiment to be shuffling the full 52-card
deck, where |S| = 52!. We can then reconstruct all outcomes of the pairs of
cards that we care about (if we so wish—but we just care about getting an
equally likely outcome sample space).
Define E AS as the event where the next card is the Ace of Spades. To
construct a 52-card order where this event holds, we first take out the Ace
of Spades, then shuffle the remaining 51 cards (51! ways), then insert the
Ace of Spades immediately after the first ace (1 way). By the product rule,
| E AS | = 51! · 1. Then define E2C as the event where the next card is the 2
of Clubs. To construct a 52-card order where this event holds, we perform
exactly the same steps, but with the 2 of Clubs instead. Then | E2C | = 51! · 1.
Therefore, P( E AS ) = 51!/52! = P( E2C ).
For many readers, it may seem apparent that the first ace drawn could
very well be the Ace of Spades, and so it is less likely that the next card is the
Ace of Spades. Yet by a similar train of thought, the 2 of Clubs could very
well have been drawn prior to the first ace drawn, and so we must consider
all of those cases as well. This example serves to highlight the difficulty of
probability: Mathematics often trumps intuition (no pun intended).
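A Monte Carlo simulation makes the equality easier to believe: shuffle a 52-card deck many times and record which card follows the first ace. A minimal sketch (the card encoding and seed are arbitrary choices):

```python
import random

random.seed(109)  # fixed seed for reproducibility

# Encode the deck as 0..51; let 0..3 be the four aces (0 = Ace of Spades)
# and 4 stand for the 2 of Clubs.
ACE_OF_SPADES, TWO_OF_CLUBS = 0, 4
deck = list(range(52))

n_trials = 50_000
count_as = count_2c = 0
for _ in range(n_trials):
    random.shuffle(deck)
    first_ace = min(deck.index(ace) for ace in range(4))
    nxt = deck[first_ace + 1]  # the first ace can never be the last card
    if nxt == ACE_OF_SPADES:
        count_as += 1
    elif nxt == TWO_OF_CLUBS:
        count_2c += 1

p_as, p_2c = count_as / n_trials, count_2c / n_trials
print(p_as, p_2c)  # both near 1/52 ≈ 0.0192
```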
(Here F, which has 14 of the 63 equally likely outcomes, has become our new sample space.) Given that event F has occurred, the conditional probability that event E occurs is determined by the subset of the outcomes of E that are consistent with F. In this case we can visually see that those are the three outcomes in E ∩ F (figure 4.1, probability of event E conditioning on event F). Thus we have the probability:

P(E | F) = P(EF) / P(F) = (3/63) / (14/63) = 3/14 ≈ 0.21
Even though the visual example (with equally likely outcome spaces) is useful for
gaining intuition, the above definition of conditional probability applies regardless
of whether the sample space has equally likely outcomes.
The Chain Rule: The definition of conditional probability can be rewritten as:

P(EF) = P(E | F) P(F)     (4.2)

which we call the Chain Rule. Intuitively, it states that the probability of observing events E and F is the probability of observing F, multiplied by the probability of observing E given that you have observed F. Here is the general form of the Chain Rule:

P(E1 E2 . . . En) = P(E1) P(E2 | E1) . . . P(En | E1 E2 . . . En−1)     (4.3)

As a simple example of the chain rule, let E, F, and G be events with nonzero probabilities. An equivalent expression for P(EFG) would be:

P(EFG) = P(E | FG) P(FG) = P(E | FG) P(F | G) P(G)

4.2 Law of Total Probability
An astute person once observed that in a picture like the one in figure 4.1, event F
can be thought of as having two parts, the part that is in E (that is, E ∩ F = EF),
and the part that isn’t (Ec ∩ F = Ec F). This is true because E and Ec are mutually
exclusive sets of outcomes which together cover the entire sample space. After
further investigation this was proved to be a general mathematical truth, and
there was much rejoicing:
P( F ) = P( EF ) + P( Ec F ) (4.4)
This observation is called the law of total probability; however, it is most commonly
seen in combination with the chain rule.
P ( F ) = P ( F | E ) P ( E ) + P ( F | E c ) P ( E c ). (4.5)
There is a more general version of the rule. If you can divide your sample space into
any number of events E1 , E2 , . . . , En that are mutually exclusive and exhaustive—that
is, every outcome in sample space falls into exactly one of those events—then:
P(F) = ∑_{i=1}^{n} P(F | Ei) P(Ei)     (4.6)
The word ‘‘total’’ refers to the fact that the events in Ei must combine to form the
totality of the sample space.
Bayes’ theorem (or Bayes’ rule) is one of the most ubiquitous results in probability
for computer scientists. Very often we know a conditional probability in one
direction, say P( E | F ), but we would like to know the conditional probability in
the other direction. Bayes’ theorem provides a way to convert from one to the
other. We can derive Bayes’ theorem by starting with the definition of conditional
probability:
P(E | F) = P(FE) / P(F)     (4.7)
We can expand P( FE) using the chain rule, which results in Bayes’ theorem.
P(E | F) = P(F | E) P(E) / P(F)     (4.8)
Each term in the Bayes’ rule formula has its own name. The P( E | F ) term is
often called the posterior; the P( E) term is often called the prior; the P( F | E) term
is called the likelihood (or the update); and P( F ) is often called the normalization
constant.
posterior = (likelihood × prior) / (normalization constant)     (4.9)
If the normalization constant (the probability of the event you were initially
conditioning on) is not known, you can expand it using the law of total probability:
P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | Ec) P(Ec)] = P(F | E) P(E) / ∑i P(F | Ei) P(Ei)     (4.10)
Again, for this to hold, all the events Ei must be mutually exclusive and exhaustive.
A common scenario for applying the Bayes Rule formula is when you want to
know the probability of something ‘‘unobservable’’ given an ‘‘observed’’ event.
For example, you want to know the probability that a student understands a
concept, given that you observed them solving a particular problem. It turns out it
is much easier to first estimate the probability that a student can solve a problem
given that they understand the concept and then to apply Bayes’ theorem.
The ‘‘expanded’’ version of Bayes’ rule in equation (4.10) allows you to work
around not immediately knowing the denominator P( F ). It is worth exploring this
in more depth, because this ‘‘trick’’ comes up often, and in slightly different forms.
Another way to get to the exact same result is to reason that because the posterior of
Bayes Theorem, P( E | F ), is a probability, we know that P( E | F ) + P( Ec | F ) = 1.
If you expand out P( Ec | F ) using Bayes, you get:
P(Ec | F) = P(F | Ec) P(Ec) / P(F)     (4.11)
Now we have:
1 = P(E | F) + P(Ec | F)                                  (since P(E | F) is a probability)
1 = P(F | E) P(E) / P(F) + P(F | Ec) P(Ec) / P(F)         (by Bayes' rule, twice)
1 = [P(F | E) P(E) + P(F | Ec) P(Ec)] / P(F)
P(F) = P(F | E) P(E) + P(F | Ec) P(Ec)
We call P( F ) the normalization constant because it is the term whose value can
be calculated by making sure that the probabilities of all outcomes sum to 1 (they
are ‘‘normalized’’).
As we mentioned above, when you condition on an event you enter the universe where that event has taken place, and all the laws of probability still hold. Thus, as long as you condition consistently on the same event, every one of the tools we have learned still applies. Let's look at a few of our old friends when we condition consistently on an event (in this case G, often read as ‘‘given'' G):
Example 4.1. An application of Bayes' rule for disease testing. Originally from the concept check for lecture 4.
Applying Bayes' Rule: Consider the following (hypothetical) scenario regarding an illness. Suppose that 8% of all people have the illness, and that a test has been developed for the illness with a 95% true positive rate (it correctly says someone has the illness when they do) and a 7% false positive rate (it incorrectly says someone has the illness when they don't). Given that I test positive for the illness, what is the probability that I actually have the disease?
Since the true positive rate is 95%, P(+ | ill) = 0.95, and since the false positive rate is 7%, P(+ | not ill) = 0.07.

P(ill | +) = (0.95)(0.08) / [(0.95)(0.08) + (0.07)(0.92)] ≈ 0.541
Notice that now you have a much higher chance of being ill than you did
before you got tested, but still only about a 1/2 chance!
Table 4.1. Conditional probability rules, conditioning on G.

Name of Rule                 Original Rule                      Conditional Rule
First axiom of probability   0 ≤ P(E) ≤ 1                       0 ≤ P(E | G) ≤ 1
Corollary 1 (complement)     P(E) = 1 − P(E^c)                  P(E | G) = 1 − P(E^c | G)
Chain Rule                   P(EF) = P(E | F) P(F)              P(EF | G) = P(E | FG) P(F | G)
Bayes' Theorem               P(E | F) = P(F | E) P(E)/P(F)      P(E | FG) = P(F | EG) P(E | G)/P(F | G)
Two events E and F are called independent if:

P(EF) = P(E) P(F)     (5.1)

More generally, a collection of events is independent if for every subset of events Ea, Eb, . . . , Er:

P(Ea, Eb, . . . , Er) = P(Ea) P(Eb) . . . P(Er)     (5.2)
The general definition implies that for three events E, F, and G to be independent,
all of the following must be true:
P( EFG ) = P( E) P( F ) P( G ) (5.3)
P( EF ) = P( E) P( F ) (5.4)
P( EG ) = P( E) P( G ) (5.5)
P( FG ) = P( F ) P( G ) (5.6)
Problems with more than two independent events come up frequently. For exam-
ple: the outcomes of n separate flips of a coin are all independent of one another.
Each flip in this case is called a ‘‘trial’’ of the experiment.
In the same way that the mutual exclusion property makes it easier to calculate
the probability of the OR of two events, independence makes it easier to calculate
the AND of two events.
Example 5.1. Probability of getting k heads when flipping a biased coin.
Flipping a Biased Coin: A biased coin is flipped n times. Each flip (independently) comes up heads with probability p, and tails with probability 1 − p. What is the probability of getting exactly k heads?
Solution: Consider all the possible orderings of heads and tails that result in
k heads. There are (nk) such orderings, and all of them are mutually exclusive.
Since all of the flips are independent, to compute the probability of any one
of these orderings, we can multiply the probabilities of each of the heads and
each of the tails. There are k heads and n − k tails, so the probability of each
ordering is p^k (1 − p)^(n−k). Adding up all the different orderings gives us the probability of getting exactly k heads:

(n choose k) p^k (1 − p)^(n−k)     (5.7)
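Equation (5.7) can be checked directly. Here is a minimal sketch in Python (the text's own snippets use Julia); the function name `prob_k_heads` is ours, not from the text:

```python
from math import comb

def prob_k_heads(n: int, k: int, p: float) -> float:
    # (n choose k) mutually exclusive orderings,
    # each with probability p^k (1 - p)^(n - k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A fair coin (p = 0.5) flipped n = 3 times:
print(prob_k_heads(3, 2, 0.5))  # 0.375, i.e. 3/8
```

Summing the function over k = 0, . . . , n returns 1, as it must for a PMF.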
Example 5.2. Independence of individual hashed strings in hash map.
Hash Map: Suppose m strings are hashed (unequally) into a hash table with n buckets. Each string hashed is an independent trial, with probability pi of getting hashed to bucket i. Calculate the probability of these events:
A. E = the first bucket has ≥ 1 string hashed to it.
B. E = at least one of buckets 1 through k has ≥ 1 string hashed to it.
Example 5.3. Solution to Part A of hash map independence example 5.2.
Part A: Let Si be the event that string i is hashed into the first bucket. Note that all Si are independent of one another. The complement, Si^c, is the event that string i is not hashed into the first bucket; by mutual exclusion, P(Si^c) = 1 − p1 = p2 + p3 + · · · + pn.

P(E) = P(S1 ∪ S2 ∪ · · · ∪ Sm)                 (definition of E)
     = 1 − P((S1 ∪ S2 ∪ · · · ∪ Sm)^c)         (complement)
     = 1 − P(S1^c S2^c . . . Sm^c)             (by De Morgan's law)
     = 1 − P(S1^c) P(S2^c) . . . P(Sm^c)       (since the events are independent)
     = 1 − (1 − p1)^m                          (calculating P(Si^c) by mutual exclusion)
Example 5.4. Solution to Part B of hash map independence example 5.2.
Part B: Let Fi be the event that at least one string is hashed into bucket i. Note that the Fi's are neither independent nor mutually exclusive.

P(E) = P(F1 ∪ F2 ∪ · · · ∪ Fk)
     = 1 − P([F1 ∪ F2 ∪ · · · ∪ Fk]^c)         (since P(A) + P(A^c) = 1)
     = 1 − P(F1^c F2^c . . . Fk^c)             (by De Morgan's law)
     = 1 − (1 − p1 − p2 − · · · − pk)^m        (mutual exclusion, independence of strings)

The last step is calculated by realizing that the event F1^c F2^c . . . Fk^c is only satisfied by m independent hashes into buckets other than 1 through k.
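The Part A closed form 1 − (1 − p1)^m can be sanity-checked by simulation. A sketch in Python (the text's snippets use Julia); the bucket probabilities here are hypothetical values chosen for illustration:

```python
import random

random.seed(0)

def sim_first_bucket_nonempty(probs, m, trials=50_000):
    # Empirically estimate P(bucket 0 receives >= 1 of m hashed strings).
    buckets = list(range(len(probs)))
    hits = 0
    for _ in range(trials):
        hashes = random.choices(buckets, weights=probs, k=m)
        if 0 in hashes:
            hits += 1
    return hits / trials

probs = [0.2, 0.3, 0.5]   # hypothetical p1, p2, p3
m = 4
exact = 1 - (1 - probs[0]) ** m   # closed form from Part A
approx = sim_first_bucket_nonempty(probs, m)
print(exact, approx)   # both near 0.5904
```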
Two events E and F are called conditionally independent given a third event G, if

P(EF | G) = P(E | G) P(F | G)     (5.9)

Or, equivalently (when P(F | G) > 0):

P(E | FG) = P(E | G)     (5.10)
Example 5.6. Dice rolls in a game of craps as independent trials.
Simplified Craps: Two 6-sided dice are rolled repeatedly. Consider the sum of the two dice. What is P(E), where E is defined as the event where a sum of 5 is rolled before a sum of 7?
Solution: Define our independent trials to be each paired roll. We can then define Fi as the event that we observe our first sum of 5 on the i-th trial: no sum of 5 or 7 was rolled in the first i − 1 trials, and a sum of 5 was rolled on the i-th trial. Notice that the Fi for i = 1, . . . , ∞ are mutually exclusive, as there is only ever one first occurrence of the sum of 5. The probability of rolling a sum of 5 is 4/36 and of a sum of 7 is 6/36, and therefore P(Fi) = (26/36)^(i−1) (4/36).
P(E) = P(F1 ∪ F2 ∪ · · · ∪ Fi ∪ · · ·) = ∑_{i=1}^∞ P(Fi)     (Fi mutually exclusive)
     = (4/36) ∑_{i=1}^∞ (26/36)^(i−1) = (4/36) ∑_{i=0}^∞ (26/36)^i
     = (4/36) · 1/(1 − 26/36) = 2/5

where the last line comes from the property of infinite geometric series: for |x| < 1,

∑_{i=0}^∞ x^i = 1/(1 − x).     (5.8)
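The geometric-series step can be checked numerically with exact rational arithmetic. A sketch in Python using `fractions.Fraction` (the text's Julia snippets use `Rational` for the same purpose):

```python
from fractions import Fraction

r = Fraction(26, 36)    # probability a trial is neither a 5 nor a 7
p5 = Fraction(4, 36)    # probability a trial is a sum of 5

# Truncated series sum_{i=1..200} p5 * r^(i-1) versus the closed form.
partial = sum(p5 * r**(i - 1) for i in range(1, 201))
closed = p5 / (1 - r)
print(closed)   # 2/5
```

The truncated sum agrees with 2/5 to well beyond floating-point precision, since the omitted tail shrinks geometrically.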
Fevers: Let’s say a person has a fever if they either have malaria or have
an infection. We are going to assume that getting malaria and having an
infection are independent: knowing if a person has malaria does not tell us if
they have an infection. Now, a patient walks into a hospital with a fever. Your
belief that the patient has malaria is high and your belief that the patient has
an infection is high. Both explain why the patient has a fever.
Now, given our knowledge that the patient has a fever, gaining the knowledge that the patient has malaria will change your belief that the patient has an infection. The malaria explains why the patient has a fever, and so the alternate explanation becomes less likely. The two events (which were previously independent) are dependent when conditioned on the patient having a fever.
• P(Y = 0) = 1/8 ( T, T, T )
• P(Y = 1) = 3/8 ( H, T, T ), ( T, H, T ), ( T, T, H )
• P(Y = 2) = 3/8 ( H, H, T ), ( H, T, H ), ( T, H, H )
• P(Y = 3) = 1/8 ( H, H, H )
• P (Y ≥ 4 ) = 0
Even though we use the same notation for random variables and for events (both use capital letters), they are distinct concepts. An event is a situation; a random variable is an object. The situation in which a random variable takes on a particular value (or range of values) is an event. When possible, we will try to use letters E, F, G for events and X, Y, Z for random variables.
For a discrete random variable, the most important thing to know is the probability that the random variable will take on each of its possible values. The probability mass function (PMF) of a random variable is the function that maps each possible outcome x to its probability, p_X(x) = P(X = x).
6.2 Expectation
The expectation of a discrete random variable X, written E[X], is defined as E[X] = ∑_x x · p_X(x). It goes by many other names: mean, expected value, weighted average, center of mass, and first moment.
Example 6.1. Expected value of a six-sided die roll.
The random variable X represents the outcome of one roll of a six-sided die. What is E[X]? This is the same as asking for the average value of a die roll:

E[X] = 1 · (1/6) + 2 · (1/6) + · · · + 6 · (1/6) = 21/6 = 7/2
E[g(X)] = ∑_x g(x) · p_X(x)     (6.3)
This identity has the humorous name of ‘‘the Law of the Unconscious Statistician’’
(LOTUS), for the fact that even statisticians are known—perhaps unfairly—to
ignore the difference between this identity and the basic definition of expectation
(the basic definition doesn’t have a function g).
We can use this to compute, for example, the expectation of the square of a random variable (called the second moment):
E[X^2] = E[g(X)]              (where g(X) = X^2)
       = ∑_x g(x) · p_X(x)    (by LOTUS)
       = ∑_x x^2 · p_X(x)     (definition of g)
Example 6.2. Class size expected value based on the choice of the random variable.
A school has 3 classes with 5, 10, and 150 students. Each student is only in one of the three classes. If we randomly choose a class with equal probability and let X be the size of the chosen class:
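The point of this example is that the answer depends on which random variable you define. A sketch of both choices in Python (the text's snippets use Julia); both values follow directly from the class sizes given above:

```python
sizes = [5, 10, 150]

# X = size of a class chosen uniformly at random
E_random_class = sum(sizes) / len(sizes)

# Y = size of the class of a *student* chosen uniformly at random:
# a student lands in a class of size s with probability s / total
total = sum(sizes)
E_random_student = sum(s * (s / total) for s in sizes)

print(E_random_class)     # 55.0
print(E_random_student)   # about 137.12
```

The second expectation is much larger because large classes contain more of the randomly chosen students.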
Example 6.3. Expected value game resulting in an infinite money paradox.
Consider a game played with a fair coin which comes up heads with p = 0.5. Let n = the number of coin flips before the first tails. In this game you win $2^n. How many dollars do you expect to win? Let X be a random variable which represents your winnings.
E[X] = 2^0 (1/2)^1 + 2^1 (1/2)^2 + 2^2 (1/2)^3 + · · · = ∑_{i=0}^∞ 2^i (1/2)^(i+1)
     = ∑_{i=0}^∞ 1/2 = ∞
Figure 6.3. Different variance in probability mass functions that each have an expected value of E[X] = 3.

6.4 Variance
Expectation is a useful statistic, but it does not give a detailed view of the proba-
bility mass function. Consider the 4 distributions in figure 6.3 (PMFs). All four
have the same expected value E[ X ] = 3 but the ‘‘spread’’ in the distributions is
quite different. Variance is a formal quantification of ‘‘spread’’. There is more than
one way to quantify spread; variance uses the average square distance from the
mean.
The variance of a discrete random variable X with expected value µ is defined as follows (note that variance has squared units relative to X):

Var(X) = E[(X − E[X])^2]     (6.4)
       = E[(X − µ)^2]
When computing the variance, we often use a different form of the same equation:

Var(X) = E[X^2] − E[X]^2     (6.5, Property 1)

Var(aX + b) = a^2 Var(X)     (6.6, Property 2)

Adding a constant doesn't change the ‘‘spread''; multiplying by a constant does. To stay in the units of X, the standard deviation is the square root of variance:

SD(X) = σ = √Var(X)     (6.7)

Property 1 follows from expanding the definition of variance (let µ = E[X]):

Var(X) = E[(X − E[X])^2]
       = E[(X − µ)^2]
       = ∑_x (x − µ)^2 p(x)
       = ∑_x x^2 p(x) − 2µ ∑_x x p(x) + µ^2 ∑_x p(x)
       = E[X^2] − 2µ E[X] + µ^2 · 1
       = E[X^2] − 2µ^2 + µ^2
       = E[X^2] − µ^2 = E[X^2] − E[X]^2
Example 6.4. Variance calculation of a single 6-sided die roll.
Let X be the value on one roll of a 6-sided die. What is Var(X)?

E[X^2] = (1/6)(1^2) + (1/6)(2^2) + (1/6)(3^2) + (1/6)(4^2) + (1/6)(5^2) + (1/6)(6^2) = 91/6

Recall that E[X] = 7/2, and we can use the expectation formula for variance:

Var(X) = E[X^2] − (E[X])^2 = 91/6 − (7/2)^2 = 35/12
Example 6.5. Expected value and variance functions in Julia; recomputing example 6.4. Note the use of // to indicate a Rational type.
Using the 𝔼 function from algorithm 6.1 and the Var function from algorithm 6.2, we can recompute the answer to example 6.4.
julia> X = 1:6;
julia> P = fill(1//6, 6);
julia> 𝔼(X)
7//2
julia> Var(X)
35//12
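The same computation can be sketched in Python for comparison (the 𝔼 and Var functions above are the text's Julia algorithms; the Python versions here are our own, with `fractions.Fraction` playing the role of Julia's `Rational`):

```python
from fractions import Fraction

def expectation(xs, ps):
    # E[X] = sum_x x * p(x)
    return sum(x * p for x, p in zip(xs, ps))

def variance(xs, ps):
    # Var(X) = E[X^2] - E[X]^2
    mu = expectation(xs, ps)
    return expectation([x * x for x in xs], ps) - mu * mu

X = range(1, 7)
P = [Fraction(1, 6)] * 6
print(expectation(X, P))   # 7/2
print(variance(X, P))      # 35/12
```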
A Bernoulli random variable takes on the value 1 in the case of a success and a 0 otherwise.¹ Some example uses include a coin flip, a random binary digit, whether a disk drive crashed, and whether someone likes a Netflix movie. If X is a Bernoulli random variable, denoted² X ∼ Ber(p):

Probability mass function:   P(X = 1) = p            (7.1)
                             P(X = 0) = 1 − p        (7.2)
Expectation:                 E[X] = p                (7.3)
Variance:                    Var(X) = p(1 − p)       (7.4)

1. The Bernoulli random variable is the simplest random variable (i.e. an indicator or boolean random variable).
2. Sampling x from a distribution D can also be written x ∼ D, where ∼ is read as ‘‘is distributed as''.
Bernoulli random variables and indicator variables are two aspects of the same concept. A random variable I is an indicator variable for an event A if I = 1 when A occurs and I = 0 if A does not occur. P(I = 1) = P(A) and E[I] = P(A). Indicator random variables are Bernoulli random variables, with p = P(A).
A binomial random variable counts the number of successes in n independent trials; example uses include the number of heads in n coin flips, the number of disk drives that crashed in a cluster of 1000 computers, and the number of advertisements that are clicked when 40,000 are served. The support of the binomial is {0, 1, . . . , n}.

If X is a Binomial random variable, we denote this X ∼ Bin(n, p), where p is the probability of success in a given trial. A binomial random variable has the following properties:³

Probability mass function:   P(X = k) = (n choose k) p^k (1 − p)^(n−k)   if k ∈ N, 0 ≤ k ≤ n
                             P(X = k) = 0                                otherwise     (7.5)
Expectation:                 E[X] = np                                   (7.6)
Variance:                    Var(X) = np(1 − p)                          (7.7)

3. A binomial random variable is the sum of Bernoulli random variables.
Example 7.1. Binomial random variable for n = 3 coin flips with probability p = 0.5.
Let X = number of heads after a coin is flipped three times. X ∼ Bin(3, 0.5). What is the probability of each of the different values of X?

P(X = 0) = (3 choose 0) p^0 (1 − p)^3 = 1/8
P(X = 1) = (3 choose 1) p^1 (1 − p)^2 = 3/8
P(X = 2) = (3 choose 2) p^2 (1 − p)^1 = 3/8
P(X = 3) = (3 choose 3) p^3 (1 − p)^0 = 1/8
Example 7.2. Binomial random variable for bit encoding using a Hamming code.
When sending messages over a network, there is a chance that the bits will be corrupted. A Hamming code allows for a 4 bit code to be encoded as 7 bits, with the advantage that if 0 or 1 bit(s) are corrupted, then the message can be perfectly reconstructed. You are working on the Voyager space mission and the probability of any bit being lost in space is 0.1. How does reliability change when using a Hamming code?

Imagine we use error correcting codes. Let X ∼ Bin(7, 0.1).

P(X = 0) = (7 choose 0) (0.1)^0 (0.9)^7 ≈ 0.478
P(X = 1) = (7 choose 1) (0.1)^1 (0.9)^6 ≈ 0.372
P(X = 0) + P(X = 1) ≈ 0.850
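The reliability figure is just two terms of the binomial PMF. A quick check in Python (the text's own snippets use Julia):

```python
from math import comb

def bin_pmf(n: int, k: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_corrupt = 0.1
# A (7,4) Hamming code recovers the message when 0 or 1 of the 7 bits flip.
reliability = bin_pmf(7, 0, p_corrupt) + bin_pmf(7, 1, p_corrupt)
print(round(reliability, 3))  # 0.85
```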
Recall the example of sending a bit string over a network. In our last class we used a binomial random variable to represent the number of bits corrupted out of 4 with a high corruption probability (each bit had independent probability of corruption p = 0.1). That example was relevant to sending data to spacecraft, but for earthly applications like HTML data, voice or video, bit streams are much longer (length ≈ 10^4) and the probability of corruption of a particular bit is very small (p ≈ 10^−6). Extreme n and p values arise in many cases: # visitors to a website, # server crashes in a giant data center.
Unfortunately, X ∼ Bin(10^4, 10^−6) is unwieldy to compute. However, when values get that extreme, we can make approximations that are accurate and make computation feasible. Recall that the parameters of the binomial distribution are n = 10^4 and p = 10^−6. First, define λ = np. We can rewrite the binomial PMF as:
P(X = i) = [n!/(i!(n − i)!)] (λ/n)^i (1 − λ/n)^(n−i)     (8.1)
         = [n(n − 1) · · · (n − i + 1)/n^i] · (λ^i/i!) · (1 − λ/n)^n/(1 − λ/n)^i     (8.2)
This equation can be made simpler using some approximations that hold when n is sufficiently large and p is sufficiently small:

n(n − 1) · · · (n − i + 1)/n^i ≈ 1     (8.3)
(1 − λ/n)^n ≈ e^−λ                     (8.4)
(1 − λ/n)^i ≈ 1                        (8.5)

Recall the definition: e^−λ = lim_{n→∞} (1 − λ/n)^n.
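Putting approximations (8.3) through (8.5) together turns the binomial PMF into (λ^i/i!) e^−λ. A numerical sketch in Python of how close this is for the motivating parameters n = 10^4, p = 10^−6:

```python
from math import comb, exp, factorial

n, p = 10_000, 1e-6
lam = n * p   # λ = np = 0.01

def bin_pmf(i: int) -> float:
    # exact Bin(n, p) probability of i successes
    return comb(n, i) * p**i * (1 - p)**(n - i)

def poi_approx(i: int) -> float:
    # the simplified form (λ^i / i!) e^(-λ)
    return lam**i / factorial(i) * exp(-lam)

for i in range(3):
    print(i, bin_pmf(i), poi_approx(i))   # nearly identical
```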
Probability mass function:   P(X = i) = (λ^i/i!) e^−λ     (8.7)
Expectation:                 E[X] = λ                     (8.8)
Variance:                    Var(X) = λ                   (8.9)

Units of λ: # successes per unit of time. Poisson examples: # earthquakes per year, # server hits per second, # emails per day.
P(X = 0) = (λ^0/0!) e^−λ = (0.01^0/0!) e^−0.01 ≈ 0.9900498
julia> X = Poi(0.01);
julia> pdf(X,0)
0.9900498337491681
Example 8.2. A Poisson distribution used to approximate the probability of an earthquake.
The Poisson distribution is often used to model the number of events that occur independently at any time in an interval of time or space, with a constant average rate. Earthquakes are a good example of this. Suppose there
are an average of 2.8 major earthquakes in the world each year. What is the
probability of getting more than one major earthquake next year?
Let X ∼ Poi(2.8) be the number of major earthquakes next year. We
want to know P( X > 1). We can use the complement rule to rewrite this as
1 − P( X = 0) − P( X = 1). Using the PMF for Poisson:
P(X > 1) = 1 − P(X = 0) − P(X = 1)
         = 1 − e^−2.8 (2.8^0/0!) − e^−2.8 (2.8^1/1!) = 1 − e^−2.8 − 2.8 e^−2.8
         ≈ 1 − 0.06 − 0.17 = 0.77
julia> X = Poi(2.8);
julia> 1 - pdf(X,0) - pdf(X,1)
0.7689217620241717
The PMF, P(X = n) = (1 − p)^(n−1) p, can be derived using the independence assumption: the probability that X is exactly n is the probability that the first n − 1 trials fail, and the n-th succeeds. Let Ei represent the event that the i-th trial succeeds. The CDF can be derived the same way:

P(X ≤ n) = 1 − P(X > n)                       (8.16)
         = 1 − P(E1^c E2^c . . . En^c)        (8.17)
         = 1 − P(E1^c) P(E2^c) . . . P(En^c)  (8.18)
         = 1 − (1 − p)^n                      (8.19)
Example 8.3. Geometric random variable of independent Pokémon capturing trials. Use X = Geo(p) for the distribution and 𝔼(X) for expectation.
In the Pokémon games, one captures Pokémon by throwing Poké Balls at them. Suppose each ball independently has probability p = 0.1 of catching the Pokémon. What is the average number of balls required for a successful capture?
Solution: Let X be the number of balls used until (and including) the capture. X ∼ Geo(p), so the average number needed is E[X] = 1/p = 10.
Example 8.4. Geometric random variable of independent Pokémon capturing trials, using the cumulative distribution function cdf(X,n), introduced in section 9.2.
Suppose we want to ensure that the probability of a capture before we run out of Poké Balls is at least 0.99. How many balls do we need to carry?
Solution: We want to know n such that P(X ≤ n) ≥ 0.99.

P(X ≤ n) = 1 − (1 − p)^n ≥ 0.99
(1 − p)^n ≤ 0.01
log[(1 − p)^n] ≤ log 0.01
n log(1 − p) ≤ log 0.01
n ≥ log 0.01 / log(1 − p) = log 0.01 / log 0.9 ≈ 43.7

So we need 44 Poké Balls. (Note that we flipped the inequality on the last line because we divided both sides by log(1 − p). Since 1 − p < 1, we know log(1 − p) < 0, so we're dividing by a negative number!)
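The same calculation, sketched in Python (the text's snippets use Julia), including a check that 44 is really the smallest such n:

```python
from math import ceil, log

p = 0.1
target = 0.99
# smallest n with 1 - (1 - p)^n >= target; dividing by log(1 - p) < 0
# flips the inequality, as noted above
n = ceil(log(1 - target) / log(1 - p))
print(n)  # 44
```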
Example 8.5. Using the negative binomial distribution to determine conference submissions required for acceptance. http://blog.mrtz.org/2014/12/15/the-nips-experiment.html
Problem: A grad student needs 3 published papers to graduate. (Not how it works in real life!) On average, how many papers will the student need to submit to a conference, if the conference accepts each paper randomly and independently with probability p = 0.25? (Also not how it works in real life...though the NIPS Experiment suggests there is a grain of truth in this model!)
P(a ≤ X ≤ b) = ∫_a^b f(x) dx     (9.1)
0 ≤ P(a ≤ X ≤ b) ≤ 1             (9.2)
P(−∞ < X < ∞) = 1                (9.3)
This is very different from the discrete setting, in which we often talked about the probability of a random variable taking on a particular value exactly. (The units of a PDF are probability per unit of X.)
Having a probability density is great, but it means we are going to have to solve
an integral every single time we want to calculate a probability. To save ourselves
some effort, for most of these variables we will also compute a cumulative distribu-
tion function (CDF). The CDF is a function which takes in a number and returns
the probability that a random variable takes on a value less than (or equal to) that
number. If we have a CDF for a random variable, we don’t need to integrate to
answer probability questions!
For a continuous random variable X, the cumulative distribution function is:

F_X(a) = P(X ≤ a) = ∫_{−∞}^a f(x) dx     (9.5)
This can be written F ( a), without the subscript, when it is obvious which random
variable we are using.
Why is the CDF the probability that a random variable takes on a value less
than (or equal to) the input value as opposed to greater than? It is a matter of
convention. But it is a useful convention. Most probability questions can be solved
simply by knowing the CDF (and taking advantage of the fact that the integral
over the range −∞ to ∞ is 1). Here are a few examples of how you can answer
probability questions by just using a CDF:
Example 9.1. Using the properties of the probability density function (PDF) of a continuous random variable (CRV) to solve for an unknown.
Let X be a continuous random variable (CRV) with PDF:

f(x) = C(4x − 2x^2)   if 0 < x < 2
f(x) = 0              otherwise
Example 9.2. Calculating the cumulative distribution function for disk crashes.
Let X be a RV representing the number of days of use before your disk crashes, with PDF:

f(x) = λ e^(−x/100)   if x ≥ 0
f(x) = 0              otherwise
E[aX + b] = a E[X] + b
Var(X) = E[(X − µ)^2]          (with µ = E[X])
       = E[X^2] − E[X]^2
Var(aX + b) = a^2 Var(X)
Notice how the density 1/( β − α) is exactly the same regardless of the value for x.
That makes the density uniform. So why is the PDF 1/( β − α) and not 1? That is
the constant that makes it such that the integral over all possible inputs evaluates
to 1.
The cumulative distribution function (CDF), expectation, and variance of the
uniform random variable are:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx   for α ≤ a ≤ b ≤ β     (9.11)
             = (b − a)/(β − α)                       (9.12)

E[X] = ∫_{−∞}^∞ x · f(x) dx                          (9.13)
     = ∫_α^β x/(β − α) dx = x^2/(2(β − α)) evaluated from x = α to x = β     (9.14)
     = (α + β)/2                                     (9.15)

Var(X) = (β − α)^2/12                                (9.16)
A Poisson random variable counts the number of events that occur in a fixed interval, while an exponential variable measures the amount of time until the next event occurs.¹

1. Example 9.2 sneakily introduced you to the exponential distribution already; now we get to use formulas we've already computed to work with it without integrating anything.

The probability density function (PDF) for an exponential random variable is:

f(x) = λ e^(−λx)   if x ≥ 0
f(x) = 0           otherwise     (9.17)

The expectation and variance are as follows:

E[X] = 1/λ          (9.18)
Var(X) = 1/λ^2      (9.19)

Exponential examples: time until next earthquake; time for a request to reach a web server; time until the end of a cell phone contract.
There is a closed form for the cumulative distribution function (CDF): F(x) = 1 − e^(−λx) for x ≥ 0.
Example 9.3. Using an exponential random variable to determine the duration a user stays on a website.
Let X be a random variable that represents the number of minutes until a visitor leaves your website. You have calculated that on average a visitor leaves your site after 5 minutes, and you decide that an exponential distribution is appropriate to model how long a person stays before leaving the site. What is P(X > 10)?
We can compute λ = 1/5 either using the definition of E[X] or by thinking of how many people leave every minute (answer: ‘‘one-fifth of a person''). Thus X ∼ Exp(1/5).
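Finishing the calculation with the exponential CDF, P(X > 10) = 1 − F(10) = e^(−λ·10). A quick check in Python (the text's snippets use Julia):

```python
from math import exp

lam = 1 / 5   # E[X] = 1/λ = 5 minutes
# For an exponential RV, P(X > x) = 1 - F(x) = e^(-λx)
p_stay_10 = exp(-lam * 10)
print(round(p_stay_10, 4))  # 0.1353
```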
Example 9.4. Using an exponential random variable to determine if your laptop will last all four years of university.
Let X be the number of hours of use until your laptop dies. On average, laptops die after 5000 hours of use. If you use your laptop for 7300 hours during your undergraduate career (assuming usage equals 5 hours/day and four years of university), what is the probability that your laptop lasts all four years?
As above, we can find λ either using E[X] or thinking about laptop deaths per hour: λ = 1/5000. Therefore, X ∼ Exp(1/5000).
The single most important random variable type is the Normal (aka Gaussian)
random variable, parameterized by a mean µ and variance σ2 . If X is a normal
variable we write X ∼ N (µ, σ2 ). The normal is important for many reasons: it is
generated from the summation of independent random variables and as a result
it occurs often in nature. Many things in the world are not distributed normally
but data scientists and computer scientists model them as Normal distributions
anyways. Why? Because it is the most entropic (conservative) distribution that
we can apply to data with a measured mean and variance. The support of the Normal is (−∞, ∞).
The probability density function (PDF) for a Normal is:

f(x) = (1/(σ√(2π))) e^(−(x − µ)^2/(2σ^2))     (10.1)

The factor 1/(σ√(2π)) is the normalizing constant, the exponential manages the tail, the density is symmetric around µ, and the variance σ^2 manages the spread.
If X ∼ N(µ, σ^2) and Y = aX + b is a linear transform of X, then Y is also normal: Y ∼ N(aµ + b, a^2 σ^2).
For any Normal random variable X, we can find a linear transform from X to the Standard Normal N(0, 1). That is, if you subtract the mean µ of the normal and divide by the standard deviation σ, the result is distributed according to the standard normal (also called the unit Normal). We can prove this mathematically. Let Z = (X − µ)/σ:
Z = (X − µ)/σ               (transform X: subtract µ and divide by σ)
  = (1/σ) X − µ/σ           (use algebra to rewrite the equation)
  = aX + b                  (define a = 1/σ, b = −µ/σ)
  ∼ N(aµ + b, a^2 σ^2)      (the linear transform of a normal is another normal)
  ∼ N(µ/σ − µ/σ, σ^2/σ^2)   (substitute values in for a and b)
  ∼ N(0, 1)                 (the Standard Normal)
An extremely common use of this transform is to express F_X(x), the CDF of X, in terms of the CDF of Z, F_Z(x). Since the CDF of the Standard Normal is so common, it gets its own Greek symbol, Φ(x).

F_X(x) = P(X ≤ x)                       (10.3)
       = P((X − µ)/σ ≤ (x − µ)/σ)       (10.4)
       = P(Z ≤ (x − µ)/σ)               (10.5)
       = Φ((x − µ)/σ)                   (10.6)

Why is this useful? Well, in the days when we couldn't call scipy.stats.norm.cdf (or on exams, when one doesn't have a calculator), people would look up values of the CDF in a table. Using the Standard Normal means you only need to build a table of one distribution, rather than an indefinite number of tables for all the different values of µ and σ. We also have an online calculator on the CS 109 website. You should learn how to use the Standard Normal table for the exams, however!
Example 10.1. Normal distribution using the defined CDF.
Let X ∼ N(3, 16), what is P(X > 0)?

P(X > 0) = P((X − 3)/4 > (0 − 3)/4) = P(Z > −3/4) = 1 − P(Z ≤ −3/4)
         = 1 − Φ(−3/4) = 1 − (1 − Φ(3/4)) = Φ(3/4) ≈ 0.7734
An alternative approach uses the idea that if F is the CDF of X ∼ N(µ, σ^2), then F(x) = P(Z < (x − µ)/σ) = Φ((x − µ)/σ). For instance, for P(2 < X < 5):

P(2 < X < 5) = F(5) − F(2) = Φ((5 − 3)/4) − Φ((2 − 3)/4)
             = Φ(1/2) − (1 − Φ(1/4)) ≈ 0.2902
Example 10.2. Standard Normal distribution used as signal noise.
You send voltage of 2 or −2 on a wire to denote 1 or 0. Let X = voltage sent and let R = voltage received. Note R = X + Y, where Y ∼ N(0, 1) is noise. When decoding, if R ≥ 0.5, we interpret the voltage as 1, else 0.
A binomial random variable X ∼ Bin(n, p) can be approximated by a Normal Y ∼ N(E[X], Var(X)). This approximation holds for large n. Since a Normal is continuous and a Binomial is discrete, we have to use a continuity correction to discretize the Normal:

P(X = k) ≈ P(k − 1/2 < Y < k + 1/2) = Φ((k − np + 0.5)/√(np(1 − p))) − Φ((k − np − 0.5)/√(np(1 − p)))
Example 10.3. Approximating a binomial distribution with a Normal distribution for website visit statistics.
100 visitors to your website are given a new design. Let X = # of people who were given the new design and spend more time on your website. Your CEO will endorse the new design if X ≥ 65. What is P(CEO endorses change | it has no effect)?

Continuity correction:
Discrete     Continuous
x = 6        5.5 < x < 6.5
x > 6        x > 6.5
x ≤ 6        x < 6.5
x < 6        x < 5.5
x ≥ 6        x > 5.5

E[X] = np = 50
Var(X) = np(1 − p) = 25
σ = √Var(X) = 5

P(X ≥ 65) ≈ P(Y > 64.5) = P((Y − 50)/5 > (64.5 − 50)/5)
          = 1 − Φ(2.9) ≈ 0.0019
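The tail probability can be reproduced with the Φ helper from before (a Python sketch; `phi` here is our own erf-based Standard Normal CDF, not a function from the text):

```python
from math import erf, sqrt

def phi(x: float) -> float:
    return (1 + erf(x / sqrt(2))) / 2

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # 50 and 5
# Continuity correction: P(X >= 65) ≈ P(Y > 64.5)
approx = 1 - phi((64.5 - mu) / sigma)
print(round(approx, 4))  # 0.0019
```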
Example 10.4. Approximating a binomial distribution with a Normal distribution for Stanford acceptance statistics.
Stanford accepts 2480 students and each student has a 68% chance of attending. Let X = # students who will attend. X ∼ Bin(2480, 0.68). What is P(X > 1745)?

E[X] = np = 1686.4
Var(X) = np(1 − p) = 539.6
σ = √Var(X) = 23.23
Often you will work on problems where there are several random variables (often
interacting with one another). We are going to start to formally look at how those
interactions play out.
For now we will think of joint probabilities with two events X = a and Y = b. For this chapter, we will assume both X and Y are discrete random variables; we will tackle the continuous case in a future chapter.
In the discrete case, a joint probability mass function tells you the probability of
any combination of events X = a and Y = b:
p X,Y ( a, b) = P( X = a, Y = b) (11.1)
This function tells you the probability of all combinations of events (the ‘‘,'' means ‘‘and''). If you want to back calculate the probability of an event for only one variable, you can calculate a marginal from the joint probability mass function: p_X(a) = ∑_b p_X,Y(a, b).
In the continuous case, a joint probability density function tells you the relative probability of any combination of events X = a and Y = b.
In the discrete case, we can define the function p X,Y non-parametrically. Instead
of using a formula for p we simply state the probability of each possible outcome.
Say you perform n independent trials of an experiment where each trial results in one of m outcomes, with respective probabilities p1, p2, . . . , pm (constrained so that ∑_i pi = 1). Define Xi to be the number of trials with outcome i. A multinomial distribution is a closed form function that answers the question: What is the probability that there are ci trials with outcome i? Mathematically:

P(X1 = c1, X2 = c2, . . . , Xm = cm) = (n choose c1, c2, . . . , cm) p1^c1 p2^c2 . . . pm^cm     (11.4)
Example 11.1. Multinomial distribution to calculate the joint probability of outcomes from rolls of a 6-sided die.
A 6-sided die is rolled 7 times. What is the probability that you roll: 1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes (disregarding order)?

P(X1 = 1, X2 = 1, X3 = 0, X4 = 2, X5 = 0, X6 = 3)
    = [7!/(1! 1! 0! 2! 0! 3!)] (1/6)^1 (1/6)^1 (1/6)^0 (1/6)^2 (1/6)^0 (1/6)^3
    = [7!/(2! 3!)] (1/6)^7 = 420 (1/6)^7
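Equation (11.4) is mechanical to evaluate. A sketch in Python (the text's snippets use Julia); `multinomial_pmf` is our own helper name:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    # (n choose c1, ..., cm) * p1^c1 * ... * pm^cm
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    result = float(coef)
    for c, p in zip(counts, probs):
        result *= p**c
    return result

counts = [1, 1, 0, 2, 0, 3]   # 1 one, 1 two, 2 fours, 3 sixes
probs = [1 / 6] * 6
print(multinomial_pmf(counts, probs))   # 420 * (1/6)^7, about 0.0015
```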
Two discrete random variables X and Y are called independent if p_X,Y(x, y) = p_X(x) p_Y(y) for all values x and y. (Same ideas, different notation.)
Independence is symmetric. That means that if random variables X and Y are in-
dependent, X is independent of Y and Y is independent of X. This claim may seem
meaningless but it can be very useful. Imagine a sequence of events X1 , X2 , . . ..
Let Ai be the event that Xi is a ‘‘record value’’ (e.g., it is larger than all previous
values). Is An+1 independent of An ? It is easier to answer that An is independent
of An+1 . By symmetry of independence both claims must be true.
Independent Binomials with Equal p: For any two Binomial random variables with the same ‘‘success'' probability, X ∼ Bin(n1, p) and Y ∼ Bin(n2, p), the sum of those two random variables is another binomial: X + Y ∼ Bin(n1 + n2, p). This does not hold when the two distributions have different parameters p.
This holds in the general case: let Xi ∼ Bin(ni, p) be independent variables for i = 1, . . . , n. Then:

∑_{i=1}^n Xi ∼ Bin(∑_{i=1}^n ni, p)     (12.3)
Example 12.1. Independence of two discrete Binomial random variables.
Let N be the number of requests to a web server/day and suppose N ∼ Poi(λ). Each request comes from a human (probability = p) or from a ‘‘bot'' (probability = 1 − p), independently. Define X to be the number of requests from humans/day and Y to be the number of requests from bots/day. Since requests come in independently, the probability of X conditioned on knowing the number of requests is a Binomial. Specifically, conditioned:

(X | N) ∼ Bin(N, p)
(Y | N) ∼ Bin(N, 1 − p)

Calculate the probability of getting exactly i human requests and j bot requests. Start by expanding using the chain rule:

P(X = i, Y = j) = P(X = i, Y = j | X + Y = i + j) P(X + Y = i + j)
                = ((i + j) choose i) p^i (1 − p)^j · e^−λ λ^(i+j)/(i + j)!
As an exercise you can simplify this expression into two independent Poisson
distributions.
Example 12.2. Independence of two discrete Poisson random variables.
Let's say we have two independent random Poisson variables for requests received at a web server in a day: X = number of requests from humans/day, X ∼ Poi(λ1), and Y = number of requests from bots/day, Y ∼ Poi(λ2). Since the convolution of Poisson random variables is also a Poisson, we know that the total number of requests (X + Y) is also a Poisson: (X + Y) ∼ Poi(λ1 + λ2). What is the probability of having k human requests on a particular day given that there were n total requests?
P( X = k, Y = n − k ) P ( X = k ) P (Y = n − k )
P( X = k | X + Y = n) = =
P( X + Y = n) P( X + Y = n)
e−λ1 λ1k e−λ2 λ2n−k n!
= · ·
k! (n − k)! e−(λ1 +λ2 ) (λ1 + λ2 )n
k n−k
n λ1 λ2
=
k λ1 + λ2 λ1 + λ2
λ1
∴ ( X | X + Y = n) ∼ Bin n,
λ1 + λ2
So far, we have had it easy: if our two independent random variables are both Poisson, or both Binomial with the same probability of success, then their sum has a nice, closed form. In the general case, however, the distribution of the sum of two independent random variables can be calculated as a convolution of probability distributions.
For two independent discrete random variables X and Y, you can calculate the CDF or the PMF of their sum using the following formulas:

F_{X+Y}(n) = P(X + Y ≤ n) = ∑_{k=−∞}^{∞} F_X(n − k) p_Y(k)    (12.5)
p_{X+Y}(n) = P(X + Y = n) = ∑_{k=−∞}^{∞} p_X(k) p_Y(n − k)    (12.6)

Equation 12.6 is the convolution of p_X and p_Y; in different notation, P(X + Y = n) = ∑_k P(X = k) P(Y = n − k). Most importantly, convolution computes the distribution of the sum of the random variables themselves; it is not the process of adding the probability distributions together.
Example 12.3 (Proving that the sum of two independent Poisson random variables is also a Poisson). Let X ∼ Poi(λ1) and Y ∼ Poi(λ2) be two independent random variables, and Z = X + Y. What is P(Z = n)?

P(Z = n) = P(X + Y = n)
         = ∑_{k=−∞}^{∞} P(X = k) P(Y = n − k)    (convolution)
         = ∑_{k=0}^{n} P(X = k) P(Y = n − k)    (range of X and Y)
         = ∑_{k=0}^{n} e^(−λ1) (λ1^k / k!) · e^(−λ2) (λ2^(n−k) / (n − k)!)    (Poisson PMF)
         = e^(−(λ1+λ2)) ∑_{k=0}^{n} λ1^k λ2^(n−k) / (k! (n − k)!)
         = (e^(−(λ1+λ2)) / n!) ∑_{k=0}^{n} (n! / (k! (n − k)!)) λ1^k λ2^(n−k)
         = (e^(−(λ1+λ2)) / n!) (λ1 + λ2)^n    (binomial theorem)

This is exactly the PMF of a Poi(λ1 + λ2) random variable, so Z ∼ Poi(λ1 + λ2). Note that the binomial theorem (often used for expanding polynomials) says that for any two numbers a and b and positive integer n:

(a + b)^n = ∑_{k=0}^{n} (n choose k) a^k b^(n−k)
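The convolution identity can also be checked numerically: summing the discrete convolution and evaluating the Poi(λ1 + λ2) PMF directly give the same number (the λ values and n below are arbitrary):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poi(lam)
    return exp(-lam) * lam ** k / factorial(k)

# P(X + Y = n) via the discrete convolution formula, versus the
# direct Poi(lam1 + lam2) PMF.
lam1, lam2, n = 2.0, 3.0, 4
conv = sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) for k in range(n + 1))
direct = poisson_pmf(n, lam1 + lam2)
print(abs(conv - direct) < 1e-12)  # True
```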
Expectation over a joint distribution is not defined on its own because it is not clear how to combine the multiple variables into a single number. However, expectations over functions of random variables (for example sums or products) are nicely defined: E[g(X, Y)] = ∑_{x,y} g(x, y) p(x, y) for any function g(X, Y). When you expand that result for the function g(X, Y) = X + Y you get a beautiful result:

E[X + Y] = ∑_x ∑_y (x + y) p(x, y)    (13.2)
         = ∑_x x ∑_y p(x, y) + ∑_y y ∑_x p(x, y)    (13.3)
         = ∑_x x p_X(x) + ∑_y y p_Y(y)    (13.4)
         = E[X] + E[Y]    (13.5)
Let’s go back to our old friends—the Binomial and Negative Binomial RVs—and
show how we could have derived expressions for their expectation.
First let's start with some practice with the sum of expectations of indicator variables. Let Y ∼ Bin(n, p); in other words, Y is a Binomial random variable. We can express Y as the sum of n Bernoulli indicator random variables Xi ∼ Ber(p). Since each Xi is a Bernoulli, E[Xi] = p:

Y = X1 + X2 + · · · + Xn = ∑_{i=1}^{n} Xi    (13.7)

By linearity of expectation, E[Y] = ∑_{i=1}^{n} E[Xi] = np.
Example 13.1 (The coupon collector's problem). There are several versions of the coupon collector's problem in probability theory, but the most common formulation is as follows: you would like to collect coupons from cereal boxes, but you must purchase a box of cereal to open it and discover what coupon type you have. More formally, suppose you buy n boxes of cereal, and there are k different types of coupons. For each box you buy, you ''collect'' a coupon of type i. What is the expected number of boxes that you must purchase until you have at least one coupon of each type?
How does this relate to computer science? Suppose you are a big cloud provider, and you have to service n web requests with a limited number of k servers. Each web request is a request to server i. What is the expected number of utilized servers after n requests?
We know that the expectation of the sum of two random variables is equal to
the sum of the expectations of the two variables. However, the expectation of the
product of two random variables only has a nice decomposition in the case where
the random variables are independent of one another.
Here's a proof for independent discrete random variables X and Y. (If you would like to prove this for independent continuous random variables, just interchange the summations with integrals.)

E[g(X) h(Y)] = ∑_y ∑_x g(x) h(y) p_{X,Y}(x, y)
             = ∑_y ∑_x g(x) h(y) p_X(x) p_Y(y)    (independence)
             = ∑_y h(y) p_Y(y) ( ∑_x g(x) p_X(x) )
             = ( ∑_x g(x) p_X(x) ) ( ∑_y h(y) p_Y(y) )
             = E[g(X)] E[h(Y)]
Example 13.2 (Hash tables as a variation of the coupon collector's problem). Problem: Yes, hash table problems can be a variation of the coupon collector's problem! Consider a hash table with k buckets. You hash each string to a bucket i. What is the expected number of strings to hash until each bucket has at least 1 string?
Solution: Define Y as the number of strings to hash until each bucket has at least 1 string. We want to compute E[Y]. Let us also define Yi to be the number of trials (strings) until the next success, after we've seen our i-th success. For example, Y0 is the number of strings hashed until our first hash into an empty bucket (we start with k empty buckets), Y1 is the number of additional strings to hash until we again hash into an empty bucket (we have 1 non-empty bucket and k − 1 empty buckets), etc. In the general case, we have i non-empty buckets and k − i empty buckets after the i-th success, and we are successful if we hash a string to one of the k − i empty buckets. The probability of success is then pi = (k − i)/k. With this definition, Yi ∼ Geo(pi), so E[Yi] = 1/pi = k/(k − i). By linearity of expectation, E[Y] = ∑_{i=0}^{k−1} E[Yi] = ∑_{i=0}^{k−1} k/(k − i).
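A quick empirical check of this expectation, sketched in Python (k = 10 is an arbitrary choice):

```python
import random

random.seed(109)  # for reproducibility

def expected_hashes(k):
    # E[Y] = sum of k / (k - i) for i = 0..k-1, by linearity of expectation.
    return sum(k / (k - i) for i in range(k))

def simulate_hashes(k, trials=20_000):
    # Empirically count strings hashed until every bucket is non-empty.
    total = 0
    for _ in range(trials):
        seen, count = set(), 0
        while len(seen) < k:
            seen.add(random.randrange(k))
            count += 1
        total += count
    return total / trials

print(round(expected_hashes(10), 2), round(simulate_hashes(10), 2))
```

For k = 10 the closed form gives k times the 10th harmonic number, about 29.29, and the simulation lands close to it.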
That is a little hard to wrap your mind around (but worth pushing on a bit). If both variables tend to be above their means at the same time (or below their means at the same time), the term inside the expectation will be positive. If one is above its mean while the other is below, the term is negative. If the weighted sum of terms is positive, the two random variables will have a positive correlation. We can rewrite the above equation to get an equivalent equation:

Cov(X, Y) = E[XY] − E[X]E[Y]    (13.12)

(Figure 13.2. Two normal random variables with the same mean and variance as figure 13.1.)

Using this equation (and the product lemma) it is easy to see that if two random variables are independent, their covariance is 0. The reverse is not true in general.
That last property gives us a third way to calculate variance. You could use this
definition to calculate the variance of the binomial.
For any random variables X and Y, the variance of the sum necessarily includes
covariance (unless the two random variables are independent):
Var( X + Y ) = Cov( X + Y, X + Y )
= Cov( X, X ) + Cov( X, Y ) + Cov(Y, X ) + Cov(Y, Y )
= Var( X ) + 2 Cov( X, Y ) + Var(Y )
More generally:

Var( ∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} Var(Xi) + 2 ∑_{i=1}^{n} ∑_{j=i+1}^{n} Cov(Xi, Xj)
When X and Y are independent, these expressions simplify:

E[XY] = E[X]E[Y]    (13.20)
Var(X + Y) = Var(X) + Var(Y)    (13.21)
13.7 Correlation
Example 13.3 (Zero covariance does not imply independence). Somewhat paradoxically, though independence implies zero covariance, zero covariance does not imply independence. Consider the following example: you have a discrete random variable X with PMF P(X = x) = 1/3 for x ∈ {−1, 0, 1}, and we then define another random variable Y = X². Clearly X and Y are not independent, but do they have zero covariance? Yes: E[X] = 0 and E[XY] = E[X³] = (−1 + 0 + 1)/3 = 0, so Cov(X, Y) = E[XY] − E[X]E[Y] = 0. In practice we tend to report a normalized measure called correlation rather than raw covariance, because the sign of covariance is more informative than its magnitude, which depends on the units of the variables.
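The zero-covariance example above (X uniform on {−1, 0, 1}, Y = X²) can be checked empirically:

```python
import random

random.seed(109)  # for reproducibility

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X,
# yet the empirical covariance is approximately zero.
n = 100_000
xs = [random.choice([-1, 0, 1]) for _ in range(n)]
ys = [x * x for x in xs]
mean_x, mean_y = sum(xs) / n, sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
print(round(cov, 3))  # close to 0
```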
Discrete: The conditional probability mass function (PMF) for the discrete case:

p_{X|Y}(x | y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = p_{X,Y}(x, y) / p_Y(y)    (14.1)

The conditional cumulative distribution function (CDF) for the discrete case:

F_{X|Y}(a | y) = P(X ≤ a | Y = y) = ( ∑_{x≤a} p_{X,Y}(x, y) ) / p_Y(y) = ∑_{x≤a} p_{X|Y}(x | y)    (14.2)
We have gotten to know a kind and gentle soul, conditional probability. And
we know another funky fool, expectation. Let’s get those two crazy kids to play
together.
Let X and Y be jointly discrete random variables. We define the conditional expectation of X given Y = y to be:

E[X | Y = y] = ∑_x x · p_{X|Y}(x | y)    (14.3)
The law of total expectation states:

E[E[X | Y]] = E[X]    (14.4)

Here E[X | Y] is itself a random variable: it is a function g(Y) of Y that takes on the value E[X | Y = y] when Y = y. The proof:

E[E[X | Y]] = E[g(Y)] = ∑_y E[X | Y = y] P(Y = y)
            = ∑_y ∑_x x P(X = x | Y = y) P(Y = y)
            = ∑_y ∑_x x P(X = x, Y = y)
            = ∑_x x ∑_y P(X = x, Y = y)
            = ∑_x x P(X = x)
            = E[X]
Example 14.1 (Conditional expectation of dice rolls). You roll two 6-sided dice D1 and D2. Let S = D1 + D2.

• What is E[S | D2 = 6]?

E[S | D2 = 6] = ∑_x x P(S = x | D2 = 6) = (1/6)(7 + 8 + 9 + 10 + 11 + 12) = 57/6 = 9.5

This makes intuitive sense since 6 + E[value of D1] = 6 + 3.5 = 9.5. More generally:

E[S | D2 = d2] = E[D1 + D2 | D2 = d2] = E[D1 + d2 | D2 = d2]
              = d2 + E[D1 | D2 = d2]    (d2 is a constant with respect to D1)
              = d2 + ∑_{d1} d1 P(D1 = d1 | D2 = d2)
              = d2 + 3.5
Example 14.2 (Expected return value of the recursive function recurse). Consider the following code with random numbers:

function recurse()
    x = rand(1:3)  # equally likely values 1, 2, 3
    if x == 1
        return 3
    elseif x == 2
        return 5 + recurse()
    else
        return 7 + recurse()
    end
end
Let X be the value of the first call to rand and let Y be the value returned by recurse(). By the law of total expectation:

E[Y] = E[Y | X = 1] P(X = 1) + E[Y | X = 2] P(X = 2) + E[Y | X = 3] P(X = 3)

Each conditional expectation is straightforward, using the fact that a recursive call behaves exactly like the original call:

E[Y | X = 1] = 3
E[Y | X = 2] = E[5 + Y] = 5 + E[Y]
E[Y | X = 3] = E[7 + Y] = 7 + E[Y]

Now we can plug those values into the equation. Note that the probability of X taking on 1, 2, or 3 is 1/3:

E[Y] = 3(1/3) + (5 + E[Y])(1/3) + (7 + E[Y])(1/3)
     = 5 + (2/3)E[Y]

Solving gives (1/3)E[Y] = 5, so E[Y] = 15.
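A direct simulation of recurse, ported to Python, agrees with the analysis:

```python
import random

random.seed(109)  # for reproducibility

def recurse():
    # Python port of the recursive function from the example.
    x = random.randint(1, 3)  # equally likely values 1, 2, 3
    if x == 1:
        return 3
    elif x == 2:
        return 5 + recurse()
    else:
        return 7 + recurse()

n = 100_000
avg = sum(recurse() for _ in range(n)) / n
print(round(avg, 2))  # close to the analytic answer of 15
```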
Example 14.3 (How many software engineers do you have to interview before hiring the best?). You are interviewing n software engineer candidates and will hire only 1 candidate. All orderings of candidates are equally likely. Right after each interview you must decide to hire or not hire, and you cannot go back on a decision. At any point in time you know the relative ranking of the candidates you have already interviewed.
The strategy that we propose is to interview the first k candidates and reject them all. Then hire the next candidate who is better than all of the first k candidates. What is the probability that the best of all n candidates is hired, for a particular choice of k? Let's denote that result Pk(best). Let X be the position in the ordering of the best candidate:

Pk(best) = ∑_{i=1}^{n} Pk(best | X = i) P(X = i)
         = (1/n) ∑_{i=1}^{n} Pk(best | X = i)    (since each position is equally likely)
Before we jump into how to solve probability (aka inference) questions, let's take a moment to go over how an expert doctor could specify the relationship between so many random variables. Ideally we could have our expert sit down and specify the entire ''joint distribution'' (see the first lecture on multivariate models). She could do so either by writing a single equation that relates all the variables, or by building a giant table with a probability for every combination of assignments to those variables. For the full disease model that table would need an entry for each of the exponentially many joint assignments, a number which is approaching the number of atoms in the universe. Thankfully, there is a better way. We can simplify our task if we know the generative process that creates a joint assignment. Based on the generative process we can make a data structure known as a Bayesian network. Here are two networks of random variables for diseases:

(Figure 15.1. Full disease model, where the flow of influence is directed: demographic variables (Age, Uni, . . . , Gender) influence conditions (Cold, H1N1, Influenza, . . . , Mono), which in turn influence symptoms (Fever, Tired, Phlegm, . . . , Runny Nose). The simple disease model has just four nodes: Uni → Influenza, Influenza → Fever, and {Uni, Influenza} → Tired.)

For diseases, the flow of influence is directed. The states of ''demographic'' random variables influence whether someone has particular ''conditions'', which influence whether someone shows particular ''symptoms''. On the right is a simple model with only four random variables. Though this is a less interesting model, it is easier to understand when first learning Bayesian networks. Being in university (binary) influences whether or not someone has influenza (binary). Having influenza influences whether or not someone has a fever (binary), and the state of university and influenza influences whether or not someone feels tired (also binary).
In a Bayesian network, an arrow from random variable X to random variable Y articulates our assumption that X directly influences the likelihood of Y. We say that X is a parent of Y. To fully define the Bayesian network we must provide a way to compute the probability of each random variable Xi conditioned on knowing the value of all their parents: P(Xi = k | parents of Xi take on specified values). For the simple disease model, a concrete definition would specify such a conditional probability, for each of the four binary random variables, for every assignment to its parents.
Let's put this in programming terms. All that we need to do in order to code up a Bayesian network is to define a function getProbXi(i, k, parents) which returns the probability that Xi (the random variable with index i) takes on the value k, given a value for each of the parents of Xi encoded by parents:

P(Xi = k | parents of Xi take on specified values)
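In code, getProbXi amounts to a lookup into conditional probability tables. A minimal Python sketch for the simple disease model (every probability value below is invented purely for illustration):

```python
# Conditional probability tables for the simple disease model.
# All probability values are made up for illustration only.
CPT = {
    "Uni":       {(): 0.8},                  # P(Uni = 1)
    "Influenza": {(0,): 0.05, (1,): 0.10},   # P(Flu = 1 | Uni)
    "Fever":     {(0,): 0.02, (1,): 0.90},   # P(Fever = 1 | Flu)
    "Tired":     {(0, 0): 0.1, (0, 1): 0.8,  # P(Tired = 1 | Uni, Flu)
                  (1, 0): 0.3, (1, 1): 0.9},
}

def get_prob_xi(name, k, parent_values=()):
    # Probability that binary variable `name` takes value k (0 or 1),
    # given the values of its parents.
    p_one = CPT[name][tuple(parent_values)]
    return p_one if k == 1 else 1 - p_one

print(get_prob_xi("Tired", 1, (1, 0)))  # P(Tired = 1 | Uni = 1, Flu = 0) = 0.3
```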
Deeper understanding: The reason that a Bayes Net is so useful is that the ‘‘joint’’
probability can be expressed in exponentially less space as the product of the
probabilities of each random variable conditioned on its parents! Without loss of
generality, let Xi refer to the ith random variable (such that if Xi is a parent of X j
then i < j):
P(joint) = P(X1 = k1, . . . , Xn = kn) = ∏_i P(Xi = ki | parents of Xi take on specified values)    (15.1)

Using the chain rule we can decompose the joint probability. To make the following math easier to digest I am going to use ki as shorthand for the event that Xi = ki:

P(k1, . . . , kn) = P(kn | kn−1, . . . , k1) P(kn−1 | kn−2, . . . , k1) · · · P(k2 | k1) P(k1)    (chain rule)
                 = ∏_i P(ki | ki−1, . . . , k1)    (change in notation)
                 = ∏_i P(ki | parents of Xi take on their values)    (implied by the Bayes net)
Thus the sampled particle is: [Uni = 1, Influenza = 0, Fever = 0, Tired = 0]. If
we were to run the process again we would get a new particle (with likelihood
determined by the joint probability).
Let’s take a moment to recognize that this is straight-up fantastic. General infer-
ence based on analytic probability (math without samples) is hard even given a
Bayesian network (if you don’t believe me, try to calculate the probability of flu
conditioning on one demographic and one symptom in the Full Disease Model).
However if we generate enough samples we can calculate any conditional proba-
bility question by reducing our samples to the ones that are consistent with the
condition (Y = b) and then counting how many of those are also consistent with
the query (X = a). Here is the algorithm in code:
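A minimal Python sketch of this rejection-sampling procedure (the model, its probability values, and the helper names here are illustrative assumptions, not the course's exact code):

```python
import random

random.seed(109)  # for reproducibility

def sample_joint():
    # One particle from the simple disease model; the conditional
    # probabilities here are invented purely for illustration.
    uni = int(random.random() < 0.8)
    flu = int(random.random() < (0.10 if uni else 0.05))
    fever = int(random.random() < (0.90 if flu else 0.02))
    tired = int(random.random() < [[0.1, 0.8], [0.3, 0.9]][uni][flu])
    return {"Uni": uni, "Influenza": flu, "Fever": fever, "Tired": tired}

def estimate(query_var, a, cond_var, b, n=100_000):
    # P(query_var = a | cond_var = b): keep only particles consistent
    # with the condition, then count those also consistent with the query.
    particles = [sample_joint() for _ in range(n)]
    consistent = [s for s in particles if s[cond_var] == b]
    return sum(s[query_var] == a for s in consistent) / len(consistent)

p_flu = estimate("Influenza", 1, "Fever", 1)
print(round(p_flu, 2))  # P(Influenza = 1 | Fever = 1) under these numbers
```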
With enough samples, this turns out to be a very good approximation. However, in cases where the event we're conditioning on is rare enough that it doesn't occur even after millions of samples are generated, our algorithm will not work: the last line of our code will result in a divide-by-zero error. See the next section for solutions!
For two random variables X and Y that are jointly distributed, the joint cumulative
distribution function FX,Y can be defined as
FX,Y ( a, b) = P( X ≤ a, Y ≤ b)
The PDF of a bivariate normal with means µ1, µ2, variances σ1², σ2², and correlation ρ is:

f_{X1,X2}(x1, x2) = (1 / (2π σ1 σ2 √(1 − ρ²))) · exp( −(1/(2(1 − ρ²))) [ (x1 − µ1)²/σ1² − 2ρ(x1 − µ1)(x2 − µ2)/(σ1 σ2) + (x2 − µ2)²/σ2² ] )

When ρ = 0, the variables are independent and the density factors into the product of two univariate normal PDFs:

f_{X1,X2}(x1, x2) = (1 / (2π σ1 σ2)) · exp( −(1/2) [ (x1 − µ1)²/σ1² + (x2 − µ2)²/σ2² ] )
                  = [ (1/(σ1 √(2π))) e^(−(x1 − µ1)²/(2σ1²)) ] · [ (1/(σ2 √(2π))) e^(−(x2 − µ2)²/(2σ2²)) ]

where the first factor involves only x1 and the second only x2.
Example 16.2 (Gaussian blur using a bivariate normal weight matrix). Let's make a weight matrix used for Gaussian blur. Each location in the weight matrix will be given a weight based on the probability density of the area covered by that grid square under a bivariate normal of independent X and Y, each zero mean with variance σ². For this example let's blur using σ = 3.
Each pixel is given a weight equal to the probability that X and Y are both within the pixel bounds. The center pixel covers the area where −0.5 ≤ x ≤ 0.5 and −0.5 ≤ y ≤ 0.5. What is the weight of the center pixel?
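Since X and Y are independent, the center-pixel weight is P(−0.5 ≤ X ≤ 0.5)², and each factor can be computed with the error function (a sketch using only the standard library):

```python
from math import erf, sqrt

sigma = 3.0
# For X ~ N(0, sigma^2), P(|X| <= c) = erf(c / (sigma * sqrt(2))).
p_one_axis = erf(0.5 / (sigma * sqrt(2)))
weight = p_one_axis ** 2  # X and Y independent, so probabilities multiply
print(round(weight, 4))  # ≈ 0.0175
```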
Remember how deriving the sum of two independent Poisson random variables
was tricky? When we move into integral land, the concept of convolution still
carries over, and once you get a handle on notation, then computing the sum of
two independent, jointly continuous random variables becomes fun. For some
definition of fun. . .
Let's start with one common case that has a nice form but a difficult derivation: for any two independent normal random variables X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²), the sum of those two random variables is another normal: X + Y ∼ N(µ1 + µ2, σ1² + σ2²). We won't derive the result here, but the derivation involves exponents, integrals, and completing the square (algebra throwback!).
For two general independent random variables (i.e., cases of independent random variables that don't fit the above special situations) you can calculate the CDF or the PDF of the sum of the two random variables using the following convolution formulas:

F_{X+Y}(a) = P(X + Y ≤ a) = ∫_{y=−∞}^{∞} F_X(a − y) f_Y(y) dy

f_{X+Y}(a) = ∫_{y=−∞}^{∞} f_X(a − y) f_Y(y) dy

This is a direct analogue of the discrete case: replace the integrals with sums and change the notation from CDF and PDF to CDF and PMF.
The conditional probability density function might look a bit wonky, but it works!
Continuous: The conditional probability density function (PDF) for the continuous case:

f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)    (17.1)

The conditional cumulative distribution function (CDF) for the continuous case:

F_{X|Y}(a | y) = P(X ≤ a | Y = y) = ∫_{−∞}^{a} f_{X|Y}(x | y) dx    (17.2)
At first glance, the conditional density function seems to violate our usual accounting of the units of probability. Let us make sense of it using our understanding of discrete probability. Recall that for a tiny epsilon ε, we can approximate:

P(|X − x| ≤ ε/2) = P(x − ε/2 ≤ X ≤ x + ε/2) = ∫_{x−ε/2}^{x+ε/2} f_X(a) da ≈ f_X(x) · ε

This extends to the joint variable case: P(|X − x| ≤ εX/2, |Y − y| ≤ εY/2) ≈ f_{X,Y}(x, y) · εX εY. Then:

P(|X − x| ≤ εX/2 | |Y − y| ≤ εY/2) = P(|X − x| ≤ εX/2, |Y − y| ≤ εY/2) / P(|Y − y| ≤ εY/2)    (def. of conditional probability)
                                  ≈ f_{X,Y}(x, y) εX εY / (f_Y(y) εY)
                                  = (f_{X,Y}(x, y) / f_Y(y)) · εX = f_{X|Y}(x | y) · εX
Example 17.1 (Convolution of uniform random variables). What is the PDF of X + Y for independent uniform random variables X ∼ Uni(0, 1) and Y ∼ Uni(0, 1)? First plug into the equation for general convolution of independent random variables:

f_{X+Y}(a) = ∫_{y=0}^{1} f_X(a − y) f_Y(y) dy
           = ∫_{y=0}^{1} f_X(a − y) dy    (because f_Y(y) = 1 on [0, 1])

It turns out that is not the easiest thing to integrate. By trying a few different values of a in the range [0, 2] we can observe that the PDF we are trying to calculate is discontinuous at the point a = 1, and thus it is easier to think about as two cases: a < 1 and a > 1. If we calculate f_{X+Y} for both cases and correctly constrain the bounds of the integral, we get simple closed forms for each case:

f_{X+Y}(a) = a        if 0 < a ≤ 1
           = 2 − a    if 1 < a ≤ 2
           = 0        otherwise
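A quick Monte Carlo check of this triangular shape: integrating the claimed PDF gives P(X + Y ≤ 0.5) = 0.125 and P(X + Y ≤ 1) = 0.5, which simulation reproduces:

```python
import random

random.seed(109)  # for reproducibility

# Integrating the triangular PDF: P(S <= 0.5) = 0.5^2 / 2 = 0.125
# and P(S <= 1) = 1/2. Compare with a simulation of S = X + Y.
n = 200_000
sums = [random.random() + random.random() for _ in range(n)]
p_le_half = sum(s <= 0.5 for s in sums) / n
p_le_one = sum(s <= 1.0 for s in sums) / n
print(round(p_le_half, 2), round(p_le_one, 2))
```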
This combination of normals is called a bivariate distribution. Here is a visualization of the PDF of our prior. The interesting part about tracking an object is the process of updating your belief about its location based on an observation. Let's say that we get an instrument reading from a sonar that is sitting at the origin. The instrument reports that the object is 4 units away. Our instrument is not perfect: if the true distance is t units, then the instrument will give a reading which is normally distributed with mean t and variance 1. Let's visualize the observation:
Based on this model of the noisiness of our instrument, we can compute the conditional probability of seeing a particular distance reading D, given the true location of the object (X, Y). If we knew the object was at location (x, y), we could calculate the true distance to the origin, √(x² + y²), which gives us the mean for the instrument's Gaussian:

f(D = d | X = x, Y = y) = (1/√(2 · 1 · π)) · e^(−(d − √(x² + y²))² / (2 · 1))    (normal PDF where µ = √(x² + y²), σ² = 1)
How about we try this out on actual numbers. How much more likely is an instrument reading of 1 compared to 2, given that the location of the object is at (1, 1)? The true distance is √(1² + 1²) = √2:

f(D = 1 | X = 1, Y = 1) / f(D = 2 | X = 1, Y = 1)
= [K · e^(−(1 − √2)²/2)] / [K · e^(−(2 − √2)²/2)]    (substituting into the conditional PDF of D)
= e^(((2 − √2)² − (1 − √2)²)/2)    (notice how the constants K cancel out)
≈ e^0.086 ≈ 1.09
At this point we have a prior belief and we have an observation. We would like to compute an updated belief, given that observation. This is a classic Bayes' theorem scenario. We are using jointly continuous variables, but that doesn't change the math much; it just means we will be dealing with densities instead of probabilities:

f(X = x, Y = y | D = 4) = f(D = 4 | X = x, Y = y) · f(X = x, Y = y) / f(D = 4)    (Bayes using densities)

= [K1 · e^(−(4 − √(x² + y²))²/2)] · [K2 · e^(−((x − 3)² + (y − 3)²)/8)] / f(D = 4)    (substituting the update and the prior)

= (K1 · K2 / f(D = 4)) · e^(−[(4 − √(x² + y²))²/2 + ((x − 3)² + (y − 3)²)/8])    (f(D = 4) is a constant with respect to (x, y))

= K3 · e^(−[(4 − √(x² + y²))²/2 + ((x − 3)² + (y − 3)²)/8])    (K3 is a new constant)
Wow! That looks like a pretty interesting function! You have successfully computed the updated belief. Let's see what it looks like. Here is a figure with our prior on the left and the posterior on the right: how beautiful is that! It's like a 2D normal distribution merged with a circle. But wait, what about that constant? We do not know the value of K3, and that is not a problem, for two reasons: the first reason is that if we ever want to calculate a relative probability of two locations, K3 will cancel out. The second reason is that if we really wanted to know K3, we could solve for it, since the density must integrate to 1.
This math is used every day in millions of applications. If there are multiple observations the equations can get truly complex (even worse than this one). To represent these complex functions, practitioners often use an algorithm called particle filtering.
The central limit theorem says that the average of samples from any distribution (with finite mean and variance) becomes normally distributed as the number of samples grows. Consider independent and identically distributed (IID)¹ random variables X1, X2, . . . such that E[Xi] = µ and Var(Xi) = σ². Let:

X̄ = (1/n) ∑_{i=1}^{n} Xi    (18.1)

The central limit theorem states:

X̄ ∼ N(µ, σ²/n) as n → ∞    (18.2)

It is sometimes expressed in terms of the standard normal, Z:

Z = ((∑_{i=1}^{n} Xi) − nµ) / (σ√n) ∼ N(0, 1) as n → ∞    (18.3)

¹ Random variables X1, . . . , Xn are IID if they are independent and they all have the same PMF (if discrete) or PDF (if continuous), with E[Xi] = µ and Var(Xi) = σ² for i = 1, . . . , n.
At this point you probably think that the central limit theorem is awesome. But it gets even better. With some algebraic manipulation we can show that if the sample mean of IID random variables is normal, it follows that the sum of equally weighted IID random variables must also be normal. Let's call the sum of IID random variables Ȳ:

Ȳ = ∑_{i=1}^{n} Xi = n · X̄    (defining Ȳ to be the sum of our variables)
  ∼ N(nµ, n² · σ²/n)    (since X̄ is a normal and n is a constant)
  ∼ N(nµ, nσ²)    (simplifying)
In summary, the central limit theorem explains that both the sample mean of IID
variables is normal (regardless of what distribution the IID variables came from)
and that the sum of equally weighted IID random variables is normal (again,
regardless of the underlying distribution).
Example 18.1 (Calculating the probability of winning a dice game using the central limit theorem). You will roll a 6-sided die 10 times. Let X be the total value of all 10 rolls: X = X1 + X2 + · · · + X10. You win the game if X ≤ 25 or X ≥ 45. Use the central limit theorem to approximate the probability that you win.
Recall that E[Xi] = 3.5 and Var(Xi) = 35/12. Applying a continuity correction and standardizing:

P(X ≤ 25 or X ≥ 45) = 1 − P(25.5 ≤ X ≤ 44.5)
= 1 − P( (25.5 − 10(3.5)) / (√(35/12) · √10) ≤ Z ≤ (44.5 − 10(3.5)) / (√(35/12) · √10) )
≈ 1 − (2Φ(1.76) − 1) ≈ 2(1 − 0.9608) = 0.0784
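The same computation carried out with a normal CDF in Python (statistics.NormalDist is in the standard library):

```python
from math import sqrt
from statistics import NormalDist

# CLT approximation: X is approximately N(10 * 3.5, 10 * 35/12).
# Win if X <= 25 or X >= 45, with a continuity correction.
approx = NormalDist(mu=10 * 3.5, sigma=sqrt(10 * 35 / 12))
p_win = 1 - (approx.cdf(44.5) - approx.cdf(25.5))
print(round(p_win, 3))  # close to the hand-computed 0.0784
```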
Example 18.2 (Algorithm runtime using the central limit theorem). You want to test the runtime of a new algorithm. You know the variance of the algorithm's runtime, σ² = 4 sec², but you want to estimate the mean, µ = t sec. You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime is t ± 0.5 with 95% certainty? Let Xi be the runtime of the i-th run (for 1 ≤ i ≤ n). We want:

0.95 = P( −0.5 ≤ (∑_{i=1}^{n} Xi)/n − t ≤ 0.5 )

By the central limit theorem, the standard normal Z must be equal to:

Z = ((∑_{i=1}^{n} Xi) − nµ) / (σ√n) = ((∑_{i=1}^{n} Xi) − nt) / (2√n)

Rearranging the probability statement in terms of Z:

0.95 = P( −0.5 ≤ (∑_{i=1}^{n} Xi)/n − t ≤ 0.5 )
     = P( −0.5n ≤ (∑_{i=1}^{n} Xi) − nt ≤ 0.5n )    (multiplying through by n)
     = P( −0.5n/(2√n) ≤ ((∑_{i=1}^{n} Xi) − nt)/(2√n) ≤ 0.5n/(2√n) )    (dividing by 2√n)
     = P( −√n/4 ≤ Z ≤ √n/4 )

And now we can find the value of n that makes this equation hold:

0.95 = Φ(√n/4) − Φ(−√n/4) = Φ(√n/4) − (1 − Φ(√n/4)) = 2Φ(√n/4) − 1
0.975 = Φ(√n/4)  ∴  √n/4 = Φ⁻¹(0.975) = 1.96  =⇒  n = (4 · 1.96)² ≈ 61.5

Thus it takes 62 runs. If you are interested in how this extends to cases where the variance is unknown, look into variations of Student's t-test.
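The required n can also be computed directly from the inverse normal CDF:

```python
from math import ceil
from statistics import NormalDist

# Solve 2 * Phi(sqrt(n) / 4) - 1 = 0.95 for n:
# sqrt(n) / 4 must equal the 97.5th percentile of the standard normal.
z = NormalDist().inv_cdf(0.975)  # ≈ 1.96
n_runs = ceil((4 * z) ** 2)
print(n_runs)  # 62
```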
toc 2020-07-29 20:12:29-07:00, draft: send comments to [email protected]
19 Samples and the Bootstrap
We assume that the data we look at are IID from the same underlying distribution
(F) with a true mean µ and a true variance σ2 . Since we can’t talk to everyone in
Bhutan, we have to rely on our sample to estimate the mean and variance. From
our sample we can calculate a sample mean X̄ and a sample variance S2 . These
are the best guesses that we can make about the true mean and true variance.
X̄ = (1/n) ∑_{i=1}^{n} Xi        S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²

The first question to ask is: are those unbiased estimates? Yes. Unbiased means that if we were to repeat this sampling process many times, the expected value of our estimates should be equal to the true values we are trying to estimate. We will prove that this is the case for X̄. The proof for S² is in the lecture slides.
E[X̄] = E[ (1/n) ∑_{i=1}^{n} Xi ] = (1/n) E[ ∑_{i=1}^{n} Xi ]
     = (1/n) ∑_{i=1}^{n} E[Xi] = (1/n) ∑_{i=1}^{n} µ = (1/n) · nµ = µ
The equation for the sample mean seems like a reasonable way to estimate the expectation of the underlying distribution. The same could be said about the sample variance, except for the surprising (n − 1) in the denominator of the equation. Why (n − 1)? That denominator is necessary to make sure that E[S²] = σ².
The intuition behind the proof is that sample variance calculates the distance
of each sample to the sample mean, not the true mean. The sample mean itself
varies, and we can show that its variance is also related to the true variance.
Okay, you convinced me that our estimates for mean and variance are not biased.
But now I want to know how much my sample mean might vary relative to the
true mean.
Var(X̄) = Var( (1/n) ∑_{i=1}^{n} Xi ) = (1/n)² Var( ∑_{i=1}^{n} Xi )    (Var(aX) = a² Var(X))
       = (1/n)² ∑_{i=1}^{n} Var(Xi)    (the Xi are independent)
       = (1/n)² · nσ² = σ²/n
       ≈ S²/n    (since S² is an unbiased estimate of σ²)

Std(X̄) ≈ √(S²/n)    (since Std is the square root of Var)
That Std(X̄) term has a special name: it is called the standard error, and it's how you report uncertainty of estimates of means in scientific papers (and how you get error bars). Great! Now we can compute all these wonderful statistics for the Bhutanese people. But wait! You never told me how to calculate Std(S²). True, that is outside the scope of CS109. You can find it on Wikipedia if you want.
Example 19.1 (Sample mean and sample variance of happiness in Bhutan). Let's say we calculate that our sample of happiness has n = 200 people. The sample mean is X̄ = 83 (what is the unit here? happiness score?) and the sample variance is S² = 450. We can now calculate the standard error of our estimate of the mean to be √(450/200) = 1.5. When we report our results we will say that the average happiness score in Bhutan is 83 ± 1.5, with variance 450.
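The standard error computation, spelled out:

```python
from math import sqrt

# Standard error of the sample mean: sqrt(S^2 / n),
# with S^2 = 450 and n = 200 from the example.
n, sample_variance = 200, 450
standard_error = sqrt(sample_variance / n)
print(standard_error)  # 1.5
```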
19.3 Bootstrap
import random

def bootstrap(sample, calc_stat, n_resamples=10_000):
    # Drawing n values uniformly, with replacement, from the sample is
    # equivalent to drawing from the PMF estimated from the sample.
    n = len(sample)
    stats = []
    for _ in range(n_resamples):
        resample = random.choices(sample, k=n)
        stats.append(calc_stat(resample))
    # stats now estimates the distribution of the statistic
    return stats
20.1 Parameters
Before we dive into parameter estimation, first let’s revisit the concept of pa-
rameters. Given a model, the parameters are the numbers that yield the actual
distribution. In the case of a Bernoulli random variable, the single parameter
was the value p. In the case of a Uniform random variable, the parameters are
the a and b values that define the min and max value. As a few examples: Ber(p) has θ = p, Poi(λ) has θ = λ, Uni(a, b) has θ = (a, b), and N(µ, σ²) has θ = (µ, σ²). From now on, we are going to use the notation θ to denote a vector of all the parameters.
In the real world you often don't know the ''true'' parameters, but you get
to observe data. Next up, we will explore how we can use data to estimate the
model parameters.
It turns out there isn’t just one way to estimate the value of parameters. There
are two main approaches: Maximum Likelihood Estimation (MLE) and Maximum A
Posteriori (MAP). Both of these approaches assume that your data are IID samples:
X1 , X2 , . . . , Xn where all Xi are independent and have the same distribution.
Our first algorithm for estimating parameters is called maximum likelihood estima-
tion (MLE). The central idea behind MLE is to select the parameters θ that make
the observed data the most likely.
The data that we are going to use to estimate the parameters are going to be n
independent and identically distributed (IID) samples: X1 , X2 , . . . , Xn .
20.2.1 Likelihood
We made the assumption that our data are identically distributed. This means
that they must have either the same probability mass function (if the data are
discrete) or the same probability density function (if the data are continuous).
To simplify our conversation about parameter estimation, we are going to use
the notation f ( X | θ ) to refer to this shared PMF or PDF. Our new notation is
interesting in two ways. First, we have now included a conditional on θ which
is our way of indicating that the likelihood of different values of X depends on
the values of our parameters. Second, we are going to use the same symbol f for
both discrete and continuous distributions.
What does likelihood mean, and how is ‘‘likelihood’’ different from ‘‘probability’’? In the case of discrete distributions, likelihood is a synonym for the joint probability of your data. In the case of continuous distributions, likelihood refers to the joint probability density of your data.
Since we assumed each data point is independent, the likelihood of all our data
is the product of the likelihood of each data point. Mathematically, the likelihood
of our data given parameters θ is:
L(θ) = ∏_{i=1}^{n} f(X_i | θ)     (20.1)
For different values of parameters, the likelihood of our data will be different.
If we have correct parameters, our data will be much more probable than if we
have incorrect parameters. For that reason we write likelihood as a function of
our parameters (θ).
20.2.2 Maximization
In maximum likelihood estimation (MLE) our goal is to choose values of our parameters (θ) that maximize the likelihood function from the previous section. We are going to use the notation θ̂ to represent the best choice of values for our parameters. Formally, the MLE estimate is:

θ̂ = arg max_θ L(θ)     (20.2)
‘‘Arg max’’ is short for argument of the maximum. The arg max of a function is the
value of the domain at which the function is maximized. It applies for domains
of any dimension.
A cool property of arg max is that since log is a monotonic function, the arg
max of a function is the same as the arg max of the log of the function! That’s nice
because logs make the math simpler.
If we find the arg max of the log of likelihood, it will be equal to the arg max of
the likelihood. Therefore, for MLE, we first write the log likelihood function (LL):
LL(θ) = log L(θ) = log ∏_{i=1}^{n} f(X_i | θ) = ∑_{i=1}^{n} log f(X_i | θ)     (20.3)
To use a maximum likelihood estimator, first write the log likelihood of the data given your parameters. Then choose the values of the parameters that maximize the log likelihood function. The argmax can be computed in many ways. All of the methods that we cover in this class require computing the first derivative of the function.
As an example, consider data drawn from a Poisson distribution with parameter λ, whose PMF is:

f(X | λ) = e^{−λ} λ^X / X!     (20.5)
Let’s write the log-likelihood function first:
L(θ) = ∏_{i=1}^{n} e^{−λ} λ^{X_i} / X_i!     (likelihood function)

LL(θ) = ∑_{i=1}^{n} [ −λ log e + X_i log λ − log(X_i!) ]     (log-likelihood function)

      = −nλ + log λ ∑_{i=1}^{n} X_i − ∑_{i=1}^{n} log(X_i!)     (using log with base e)
Then, we differentiate with respect to our parameter λ and set it equal to 0. Note that ∑_{i=1}^{n} log(X_i!) is a constant with respect to λ:

∂LL(θ)/∂λ = −n + (1/λ) ∑_{i=1}^{n} X_i = 0

Solving for λ gives λ̂ = (1/n) ∑_{i=1}^{n} X_i: the MLE of a Poisson parameter is the sample mean.
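As a quick numeric check (a sketch in Python with made-up sample values), a coarse grid search over λ shows that the log-likelihood peaks at the sample mean:

```python
import math

# Hypothetical IID samples, assumed to come from a Poisson distribution.
data = [3, 5, 2, 4, 6, 3, 4]

def poisson_ll(lam, xs):
    # LL(λ) = −nλ + log(λ) Σ x_i − Σ log(x_i!)
    n = len(xs)
    return (-n * lam + math.log(lam) * sum(xs)
            - sum(math.lgamma(x + 1) for x in xs))

# Coarse grid search for the arg max of the log-likelihood.
grid = [i / 1000 for i in range(1, 10001)]          # λ in (0, 10]
lam_mle = max(grid, key=lambda lam: poisson_ll(lam, data))

print(lam_mle)                  # ≈ the sample mean
print(sum(data) / len(data))    # 27/7 ≈ 3.857
```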
For a normal distribution there are two parameters to estimate: the first is the mean (µ) parameter, and the second is the variance (σ²) parameter. We will write θ = [θ0, θ1], where θ0 = µ and θ1 = σ².
L(θ) = ∏_{i=1}^{n} f(X_i | θ)

     = ∏_{i=1}^{n} (1/√(2πθ_1)) e^{−(X_i − θ_0)² / (2θ_1)}     (likelihood of a continuous variable is the PDF)

LL(θ) = ∑_{i=1}^{n} log [ (1/√(2πθ_1)) e^{−(X_i − θ_0)² / (2θ_1)} ]     (we want to calculate log likelihood)

      = ∑_{i=1}^{n} [ −log √(2πθ_1) − (1/(2θ_1)) (X_i − θ_0)² ]
Again, the last step of MLE is to choose values of θ that maximize the log likelihood
function. In this case, we can calculate the partial derivative of the LL function
with respect to both θ0 and θ1 , set both equations to equal 0, and then solve for the
values of θ. Doing so results in the equations for the values µ̂ = θ̂0 and σ̂2 = θ̂1
that maximize likelihood. The result is: µ̂ = n1 ∑in=1 Xi and σ̂2 = n1 ∑in=1 ( Xi − µ̂)2 .
Note that σ̂² is a biased estimator of the true variance; the unbiased version divides by n − 1 instead of n.
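These estimators are simple to compute directly (a sketch; the sample values below are made up purely for illustration):

```python
# Hypothetical IID samples assumed to come from a normal distribution.
data = [2.1, 1.9, 3.0, 2.5, 1.5, 2.8]

n = len(data)
mu_hat = sum(data) / n                                # MLE of the mean µ̂
var_hat = sum((x - mu_hat) ** 2 for x in data) / n    # MLE of the variance σ̂² (biased)

print(mu_hat)    # ≈ 2.3
print(var_hat)   # ≈ 0.27
```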
f_{X|N}(x | n) = p_{N|X}(n | x) f_X(x) / p_N(n)     and     p_{N|X}(n | x) = f_{X|N}(x | n) p_N(n) / f_X(x)
Imagine we have a coin and we would like to know its probability of coming up heads (p). We flip the coin (n + m) times and it comes up heads n times. One way to calculate the probability is to assume that it is exactly p = n/(n + m). Alternatively, we can treat the unknown probability as a random variable X and compute our belief about it given the observed flips:
f(X = x | N = n) = P(N = n | X = x) f(X = x) / P(N = n)          (Bayes' theorem)

                 = C(n + m, n) x^n (1 − x)^m / P(N = n)          (binomial PMF, uniform PDF)

                 = [ C(n + m, n) / P(N = n) ] · x^n (1 − x)^m     (moving terms around)

                 = (1/c) · x^n (1 − x)^m     (where c = ∫₀¹ x^n (1 − x)^m dx)
This is exactly a Beta distribution: X ∼ Beta(a = n + 1, b = m + 1), with PDF f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1} for 0 < x < 1, where B(a, b) = ∫₀¹ x^{a−1} (1 − x)^{b−1} dx is a normalization constant.
A Beta distribution has the following statistics:

mode(X) = (a − 1) / (a + b − 2)

E[X] = a / (a + b)

Var(X) = ab / ( (a + b)² (a + b + 1) )
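These closed forms can be sanity-checked numerically (a sketch; the midpoint Riemann sum below is just for illustration, not how you would compute these in practice):

```python
import math

a, b = 5, 3
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)   # B(a, b) = Γ(a)Γ(b)/Γ(a+b)

def pdf(x):
    # Beta(a, b) density: x^(a-1) (1-x)^(b-1) / B(a, b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

# Midpoint Riemann sums for E[X] and E[X^2] over (0, 1).
N = 100_000
xs = [(i + 0.5) / N for i in range(N)]
mean_num = sum(x * pdf(x) for x in xs) / N
var_num = sum(x * x * pdf(x) for x in xs) / N - mean_num ** 2

print(mean_num)   # ≈ a/(a+b) = 0.625
print(var_num)    # ≈ ab/((a+b)^2 (a+b+1)) ≈ 0.0260
```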
All modern programming languages have a package for calculating Beta CDFs.
You will not be expected to compute the CDF by hand in CS109.
To model our estimate of the probability of a coin coming up heads as a Beta, set a = n + 1 and b = m + 1. Beta is used as a random variable to represent a belief distribution of probabilities in contexts beyond estimating coin flips. It has many desirable properties: it has a support range that is exactly (0, 1), matching the values that probabilities can take on, and it has the expressive capacity to capture many different forms of belief distributions. Let's imagine that we had observed n = 4 heads and m = 2 tails. The probability density function for X ∼ Beta(5, 3) is shown in figure 21.1.
[Figure 21.1: The PDF of a Beta(5, 3) distribution over θ.]
Notice how the most likely belief for the probability of our coin is when the random variable, which represents the probability of getting a heads, is 4/6, the fraction of heads observed. This distribution shows that we hold a non-zero belief that the probability could be something other than 4/6. It is unlikely that the probability is 0.01 or 0.99, but reasonably likely that it could be 0.5.
It works out that Beta(1, 1) = Uni(0, 1). As a result, the distribution of our belief about p both before (‘‘prior’’) and after (‘‘posterior’’) observing data can be represented using a Beta distribution. When that happens we call Beta a ‘‘conjugate’’ distribution. Practically, conjugacy makes the update easy: we just add the observed counts to the parameters.
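A minimal sketch of the conjugate update, assuming the flat Beta(1, 1) prior and the hypothetical counts from the coin example:

```python
# Prior belief Beta(a, b); observe n heads and m tails.
a, b = 1, 1          # Beta(1, 1) = Uni(0, 1), a flat prior
n, m = 4, 2          # hypothetical flips: 4 heads, 2 tails

a_post, b_post = a + n, b + m                 # posterior is Beta(a + n, b + m)
mode = (a_post - 1) / (a_post + b_post - 2)   # most likely value of p

print(a_post, b_post)   # 5 3
print(mode)             # 4/6 ≈ 0.667
```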
Assignment Example (Example 21.1: Beta distribution to model student grades). In one particular iteration of this course, we talked about reasons why grade distributions might be well suited to be described as a Beta distribution. Let's say that we are given a set of student grades for a single exam and we find that it is best fit by a Beta distribution: X ∼ Beta(a = 8.28, b = 3.16). What is the probability that a student is below the mean (i.e., expectation)?
The answer to this question requires two steps. First calculate the mean
of the distribution, then calculate the probability that the random variable
takes on a value less than the expectation.
E[X] = a / (a + b) = 8.28 / (8.28 + 3.16) ≈ 0.7238
Now we need to calculate P( X < E[ X ]). That is exactly the CDF of X eval-
uated at E[ X ]. We don’t have a formula for the CDF of a Beta distribution
but all modern programming languages will have a Beta CDF function. In
Python using the scipy stats library we can execute stats.beta.cdf which
takes the x parameter first followed by the alpha and beta parameters of your
Beta distribution.
P(X < E[X]) = F_X(0.7238) = stats.beta.cdf(0.7238, 8.28, 3.16) ≈ 0.46
MLE is great, but it is not the only way to estimate parameters! This section
introduces an alternate algorithm, Maximum A Posteriori (MAP). The paradigm
of MAP is that we should choose the value for our parameters that is the most
likely given the data. At first blush this might seem the same as MLE; however,
remember that MLE chooses the value of parameters that makes the data most
likely.
One of the disadvantages of MLE is that it best explains data we have seen and
makes no attempt to generalize to unseen data. In MAP, we incorporate prior belief
about our parameters, and then we update our posterior belief of the parameters
based on the data we have seen.
Formally, for IID random variables X1, . . . , Xn:

θ_MAP = arg max_θ f(θ | X1, X2, . . . , Xn) = arg max_θ [ f(X1, X2, . . . , Xn | θ) g(θ) ] / h(X1, X2, . . . , Xn)
Note that f, g and h are all probability densities. We used different symbols to make it explicit that they may be different functions. Now we are going
to leverage two observations. First, the data is assumed to be IID so we can
decompose the density of the data given θ. Second, the denominator is a constant
with respect to θ. As such, its value does not affect the arg max, and we can drop
that term. Mathematically:
θ_MAP = arg max_θ [ ∏_{i=1}^{n} f(X_i | θ) g(θ) ] / h(X1, X2, . . . , Xn)     (since the samples are IID)

      = arg max_θ ∏_{i=1}^{n} f(X_i | θ) g(θ)     (since h is constant with respect to θ)
As before, it will be more convenient to find the arg max of the log of the MAP function, which gives us the final form for MAP estimation of parameters:

θ_MAP = arg max_θ [ log(g(θ)) + ∑_{i=1}^{n} log(f(X_i | θ)) ]     (22.1)

Compare this to MLE: θ_MLE = arg max_θ ∏_{i=1}^{n} f(X_i | θ), while θ_MAP = arg max_θ ∏_{i=1}^{n} f(X_i | θ) g(θ).
Using Bayesian terminology, the MAP estimate is the mode of the ‘‘posterior’’
distribution for θ. If you look at this equation side by side with the MLE equation
you will notice that MAP is the arg max of the exact same function plus a term
for the log of the prior. Maximum A Posteriori maximizes:
log-likelihood + log-prior
22.1.1 Parameter Priors
In order to get ready for the world of MAP estimation, we are going to need to
brush up on our distributions. We will need reasonable distributions for each of
our different parameters. For example, if you are predicting a Poisson distribution,
what is the right random variable type for the prior of λ?
A desideratum for prior distributions is that the resulting posterior distribution has the same functional form. We call these ‘‘conjugate’’ priors. In the case where you are updating your belief many times, conjugate priors make programming the math equations much easier.
Here is a list of different parameters and the distributions most often used for their priors. We won't cover the inverse gamma distribution in this class. The remaining two, Dirichlet and Gamma, you will not be required to know, but details for them are included below for completeness.
The distributions used to represent your ‘‘prior’’ belief about a random variable
will often have their own parameters. For example, a Beta distribution is defined
using two parameters ( a, b). Do we have to use parameter estimation to evaluate
a and b too? No. Those parameters are called ‘‘hyperparameters’’. That is a term
we reserve for parameters in our model that we fix before running parameter estimation. Before you run MAP you decide on the values of (a, b).
22.1.2 Beta
We’ve covered that Beta is a conjugate distribution for Bernoulli. The MAP of a
Bernoulli distribution with a Beta prior is the mode of the Beta posterior. The mode
of a distribution is the value that maximizes the probability mass function (if
discrete) or probability density function (if continuous).
If X ∼ Beta(a, b), where a and b are integers with a + b > 2, the mode is arg max_x f(x) = (a − 1) / (a + b − 2), where f(x) is the PDF of X.
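A minimal sketch of MAP estimation for a Bernoulli parameter using this mode formula (the prior parameters and observed counts below are made up):

```python
# MAP estimate of a Bernoulli parameter p under a Beta(a, b) prior:
# the posterior after n heads and m tails is Beta(a + n, b + m), and
# the MAP estimate is the mode of that posterior.
def map_bernoulli(n, m, a, b):
    return (a + n - 1) / (a + n + b + m - 2)

# With a Beta(2, 2) (Laplace) prior and 7 heads out of 10 flips:
print(map_bernoulli(7, 3, 2, 2))   # 8/12 ≈ 0.667
# With a flat Beta(1, 1) prior, MAP recovers the MLE n/(n + m):
print(map_bernoulli(7, 3, 1, 1))   # 0.7
```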
22.1.3 Dirichlet
The Dirichlet distribution generalizes beta in the same way multinomial gen-
eralizes Bernoulli. A random variable X that is Dirichlet is parametrized as
X ∼ Dir( a1 , a2 , . . . , am ). The PDF of the distribution is:
f(X1 = x1, X2 = x2, . . . , Xm = xm) = K ∏_{i=1}^{m} x_i^{a_i − 1}

where K is a normalizing constant.
22.1.4 Gamma
The Gamma(k, θ ) distribution is the conjugate prior for the λ parameter of the
Poisson distribution. (It is also the conjugate for the λ in the exponential, but we
won’t cover that here.)
The hyperparameters can be interpreted as: you saw k total imaginary events
during θ imaginary time periods. After observing n events during the next t time
periods the posterior distribution is Gamma(k + n, θ + t).
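A minimal sketch of that update (the prior and observed counts are made up; the mode formula (k − 1)/θ assumes the rate-style reading of (k, θ) used above and k ≥ 1):

```python
# Prior Gamma(k, θ): k imaginary events over θ imaginary time periods.
k, theta = 3, 2       # made-up prior: 3 events in 2 time periods
n, t = 11, 5          # observed: 11 events during the next 5 time periods

k_post, theta_post = k + n, theta + t    # posterior is Gamma(k + n, θ + t)
# Mode of the posterior, i.e. the MAP estimate of λ (rate-style parameterization):
map_lambda = (k_post - 1) / theta_post

print(k_post, theta_post)   # 14 7
print(map_lambda)           # 13/7 ≈ 1.857
```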
In supervised machine learning, your job is to use training data with feature/label pairs (X, Y) in order to estimate a label-predicting function Ŷ = g(X). This function can then be used to make future predictions. A classification task is one where Y takes on one of a discrete number of values. Often in classification, g(X) = arg max_y P̂(Y = y | X = x).
To learn all parameters required to calculate g(X), you are given n different training pairs known as training data: (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(n), y^(n)). x^(j) is a vector of m discrete features for the j-th training example, where x^(j) = (x_1^(j), x_2^(j), . . . , x_m^(j)). y^(j) is the discrete label for the j-th training example. This
symbolic description of classification hides the fact that prediction is applied to
interesting real life problems:
3. Predicting if a user will like a movie Y given whether or not they like a set of
m movies X.
In this section we are going to assume that all random variables are binary.
While this is not a necessary assumption (Naïve Bayes can work for non-binary
data), it makes it much easier to learn the core concepts. Specifically, we assume
that all labels are binary y ∈ {0, 1}, and all features are binary x j ∈ {0, 1}, ∀ j =
1, . . . , m.
Here is the Naïve Bayes algorithm. After presenting the algorithm we are going
to show the theory behind it.
23.2.1 Prediction
For an example with X = [ x1 , x2 , . . . , xm ], we can make a corresponding prediction
for Y. We use hats (e.g., P̂ or Ŷ) to symbolize values which are estimated.
Ŷ = g(x) = arg max_{y∈{0,1}} P̂(Y = y) P̂(X = x | Y = y)     (this is equal to arg max_y P̂(Y = y | X = x))

          = arg max_{y∈{0,1}} P̂(Y = y) ∏_{i=1}^{m} P̂(X_i = x_i | Y = y)     (Naïve Bayes assumption)

          = arg max_{y∈{0,1}} [ log P̂(Y = y) + ∑_{i=1}^{m} log P̂(X_i = x_i | Y = y) ]     (log version for numerical stability)
In order to calculate this expression, we are going to need to learn the estimates
P̂(Y = y) and P̂( Xi = xi | Y = y) from data using a process called training.
23.2.2 Training
The objective in training is to estimate the probabilities P(Y) and P(X_i | Y) for all 0 < i ≤ m features. Using an MLE estimate:

P̂(Y = y) = (number of training examples where Y = y) / n

P̂(X_i = x_i | Y = y) = (number of training examples where X_i = x_i and Y = y) / (number of training examples where Y = y)
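A minimal sketch of training and prediction for binary features (the tiny dataset is made up; Laplace/MAP estimates are used so that no estimated probability is ever zero):

```python
import math

# Made-up training data: pairs of (binary feature vector, binary label).
train = [
    ([1, 1], 1), ([1, 0], 1), ([1, 1], 1),
    ([0, 0], 0), ([0, 1], 0), ([1, 0], 0),
]

def fit(data):
    n, m = len(data), len(data[0][0])
    p_y = {}   # p_y[y] = P̂(Y = y)
    p_x = {}   # p_x[(i, y)][xi] = P̂(X_i = xi | Y = y)
    for y in (0, 1):
        rows = [x for x, label in data if label == y]
        p_y[y] = len(rows) / n
        for i in range(m):
            ones = sum(x[i] for x in rows)
            p1 = (ones + 1) / (len(rows) + 2)   # MAP (Laplace) estimate, k = 2
            p_x[(i, y)] = {1: p1, 0: 1 - p1}
    return p_y, p_x

def predict(x, p_y, p_x):
    # arg max over y of log P̂(Y = y) + Σ_i log P̂(X_i = x_i | Y = y)
    def score(y):
        return math.log(p_y[y]) + sum(
            math.log(p_x[(i, y)][xi]) for i, xi in enumerate(x))
    return max((0, 1), key=score)

p_y, p_x = fit(train)
print(predict([1, 1], p_y, p_x))   # 1
```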
23.3 Theory
Now that you have the algorithm spelled out, let’s go over the theory of how we
got there. To do so, we will first explore an algorithm which doesn’t work, called
‘‘Brute Force Bayes.’’ Then, we introduce the Naïve Bayes Assumption, which
will make our calculations possible.
Ŷ = g(x) = arg max_{y∈{0,1}} P̂(Y = y | X = x)

          = arg max_{y∈{0,1}} P̂(X | Y) P̂(Y) / P̂(X)     (by Bayes' theorem)

          = arg max_{y∈{0,1}} P̂(X | Y) P̂(Y)     (since P̂(X) is constant with respect to Y)
Using our training data, we could interpret the joint distribution of X and Y as
one giant Multinomial with a different parameter for every combination of X = x
and Y = y. If for example, the input vectors are only length one (i.e., |X| = 1)
and the number of values that x and y can take on are small—say, binary—this is
a totally reasonable approach. We could estimate the multinomial using MLE or
MAP estimators and then calculate argmax over a few lookups in our table.
The bad times hit when the number of features becomes large. Recall that
our multinomial needs to estimate a parameter for every unique combination of
assignments to the vector X and the value Y. If there are |X| = m binary features
then this strategy is going to take order O(2m ) space and there will likely be
many parameters that are estimated without any training data that matches the
corresponding assignment.
This algorithm is fast and stable both when training and making predictions.
Let us consider a particular feature, the i-th feature X_i. How should we represent P̂(X_i = x_i | Y = y)? For a particular event Y = y that we condition on, X_i can take on one of k discrete values. Thus for each particular y, we can model the likelihood of X_i taking on values as a Multinomial random variable with k parameters. We can then find MLE and MAP estimators for the parameters of that Multinomial. Recall that the MLE to estimate parameter p_i for a Multinomial is just counting, whereas the MAP estimator (with a Laplace prior) adds one imaginary observation of each outcome:

p̂_{i,MLE} = n_i / n     and     p̂_{i,MAP} = (n_i + 1) / (n + k),

where n is the number of observations, n_i is the number of observations with outcome i, and k is the total number of possible outcomes.
Note that in the version of classification we are using in CS109, Xi is binary
(technically, a Multinomial with 2 parameters) and therefore k = 2. We used the
Multinomial derivation to help you understand how one would handle a feature
Xi that takes on multiple discrete values.
Naïve Bayes is a simple form of a Bayesian Network where the label Y is the
only variable which directly influences the likelihood of each feature variable Xi .
It is a simple model from a field of machine learning called probabilistic graphical
models. In that field you make a graph of how your variables are related to one
another and you come up with conditional independence assumptions that make
it computationally tractable to estimate the joint distribution.
Say we have thirty examples of people’s preferences (like or not) for Star
Wars, Harry Potter and Pokemon. Each training example has X1 , X2 and Y
where X1 is whether or not the user liked Star Wars, X2 is whether or not the
user liked Harry Potter and Y is whether or not the user liked Pokemon. For
the 30 training examples, the MAP and MLE estimates are as follows:
For a new user who likes Star Wars ( X1 = 1) but not Harry Potter ( X2 = 0),
do you predict that they will like Pokemon? Yes! Y = 1 leads to a larger value
in the argmax term:
if Y = 0 :
P̂( X1 = 1 | Y = 0) P̂( X2 = 0 | Y = 0) P̂(Y = 0) = (0.77)(0.38)(0.43) ≈ 0.126
if Y = 1 :
P̂( X1 = 1 | Y = 1) P̂( X2 = 0 | Y = 1) P̂(Y = 1) = (0.76)(0.41)(0.57) ≈ 0.178
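Plugging in the estimates from the text, the two argmax terms can be checked with a quick arithmetic sketch:

```python
# Values taken directly from the example's estimates.
p_y0 = 0.77 * 0.38 * 0.43   # P̂(X1=1|Y=0) · P̂(X2=0|Y=0) · P̂(Y=0)
p_y1 = 0.76 * 0.41 * 0.57   # P̂(X1=1|Y=1) · P̂(X2=0|Y=1) · P̂(Y=1)

print(round(p_y0, 3))            # 0.126
print(round(p_y1, 3))            # 0.178
print(1 if p_y1 > p_y0 else 0)   # predict Y = 1
```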
In many cases we can't solve for the argmax mathematically. Instead we use a computer. To do so we employ an algorithm called gradient ascent (a classic in optimization theory). The idea behind gradient ascent is that if you continuously take small steps in the direction of your gradient, you will eventually make it to a local maximum.
Start with θ at any initial value (often 0). Then take many small steps towards a local maximum. The new θ after each small step can be calculated as:
θ_j^{new} = θ_j^{old} + η · ∂LL(θ^{old}) / ∂θ_j
where ‘‘eta’’ (η) is the magnitude of the step size that we take. If you keep updating θ using the equation above, you will (often) converge on good values of θ. As a general rule of thumb, use a small value of η to start. If you ever find that the function value (for the function you are trying to argmax) is decreasing, your choice of η was too large. Here is the gradient ascent algorithm in pseudocode:

θ = 0 (or any initial value)
repeat many times:
    compute the gradient ∂LL(θ)/∂θ_j for each j
    θ_j += η * ∂LL(θ)/∂θ_j for each j
With our linear prediction model, we determine θ_MSE = (a_MSE, b_MSE) by differentiating the mean squared error with respect to a and b:

∂/∂a E[(Y − aX − b)²] = E[ ∂/∂a (Y − aX − b)² ]
                      = E[ −2(Y − aX − b) X ]
                      = −2E[XY] + 2aE[X²] + 2bE[X]

∂/∂b E[(Y − aX − b)²] = E[ ∂/∂b (Y − aX − b)² ]
                      = E[ −2(Y − aX − b) ]
                      = −2E[Y] + 2aE[X] + 2b

Setting each derivative to 0 and solving:

a_MSE = ( E[XY] − E[X]E[Y] ) / ( E[X²] − (E[X])² ) = Cov(X, Y) / Var(X) = ρ(X, Y) σ_Y / σ_X

b_MSE = E[Y] − a_MSE E[X] = µ_Y − a_MSE µ_X

Y = ρ(X, Y) (σ_Y / σ_X) (X − µ_X) + µ_Y
Wait, those are our best parameters? But we don't know the distributions of X and Y, and therefore we don't know the true statistics of X and Y. We estimate these statistics from our observed training data. Our model is therefore as follows (where X̄ and Ȳ are the sample means computed from the training data):

Ŷ = g(X = x) = ρ̂(X, Y) (σ̂_Y / σ̂_X) (x − X̄) + Ȳ

â_MSE = ∑_{i=1}^{n} (x^(i) − X̄)(y^(i) − Ȳ) / ∑_{i=1}^{n} (x^(i) − X̄)² = ρ̂(X, Y) S_Y / S_X

b̂_MSE = Ȳ − â_MSE X̄
This first derivative can be plugged into gradient ascent to give our final algorithm:
a, b = 0, 0 # initialize θ
repeat many times:
gradient_a, gradient_b = 0, 0
for each training example (x, y):
diff = y - (a * x + b)
gradient_a += 2 * diff * x
gradient_b += 2 * diff
a += η * gradient_a # θ += η * gradient
b += η * gradient_b
If you run gradient ascent for enough training (i.e., update) steps, you will find that for linear regression, the maximum likelihood estimators (assuming zero-mean, normally distributed noise between the predicted Ŷ and actual Y) are equivalent to the mean squared error estimators. Cool!!
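The closed-form estimators above can be computed in a few lines (a sketch with a made-up training set that is roughly y = 2x):

```python
# Made-up training data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# â = Σ (x − x̄)(y − ȳ) / Σ (x − x̄)²,  b̂ = ȳ − â x̄
a_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
b_hat = y_bar - a_hat * x_bar

print(a_hat)   # slope near 2
print(b_hat)   # intercept near 0
```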
Classification is the task of choosing a value of y that maximizes P(Y | X). Naïve
Bayes worked by approximating that probability using the naïve assumption that
each feature was independent given the class label.
For all classification algorithms you are given n IID training datapoints, where each ‘‘feature’’ vector x^(i) has m = |x^(i)| features. The logistic regression classifier predicts:

Ŷ = arg max_{y∈{0,1}} P(Y = y | X = x)
25.1.1 Logistic Regression Assumption
Logistic regression assumes that P(Y = 1 | X = x) = σ(θ⊤x), where σ(z) = 1/(1 + e^{−z}) is the sigmoid function. Since the training datapoints are independent:
L(θ) = ∏_{i=1}^{n} P(Y = y^(i) | X = x^(i))

     = ∏_{i=1}^{n} σ(θ⊤x^(i))^{y^(i)} · [ 1 − σ(θ⊤x^(i)) ]^{1 − y^(i)}
And if you take the log of this function, you get the log-conditional likelihood of
the training dataset for Logistic Regression.
Side Note: While we calculate conditional likelihood here, it is worthwhile not-
ing that maximizing log-likelihood and log-conditional likelihood are equivalent:
LL(θ) = ∑_{i=1}^{n} log f(x^(i), y^(i) | θ) = ∑_{i=1}^{n} log [ f(x^(i) | θ) P(y^(i) | x^(i), θ) ]     (chain rule)

      = ∑_{i=1}^{n} log [ f(x^(i)) P(y^(i) | x^(i), θ) ]     (X, θ independent)

θ_MLE = arg max_θ LL(θ) = arg max_θ ∑_{i=1}^{n} [ log f(x^(i)) + log P(y^(i) | x^(i), θ) ]     (log of products)

      = arg max_θ ∑_{i=1}^{n} log P(y^(i) | x^(i), θ)     (constants w.r.t. θ)
Once we have an equation for the log likelihood, we choose the values for our parameters (θ) that maximize said function. In the case of logistic regression we can't solve for θ mathematically. Instead we use a computer to choose θ.
To do so we employ an algorithm called gradient ascent. The claim is that if you continuously take small steps in the direction of your gradient, you will eventually make it to a local maximum. In the case of logistic regression you can prove that the result will always be a global maximum.
The small step that we continually take given the training dataset can be calculated as follows:

θ_j^{new} = θ_j^{old} + η · ∂LL(θ^{old}) / ∂θ_j

          = θ_j^{old} + η · ∑_{i=1}^{n} [ y^(i) − σ(θ^{old⊤} x^(i)) ] x_j^(i),
where η is the magnitude of the step size that we take. If you keep updating θ
using the equation above you will converge on the best values of θ!
Pro-tip: Don’t forget that in order to learn the value of θ0 , you can simply
define x0 = 1 for all datapoints.
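A minimal sketch of the whole procedure on a made-up, linearly separable dataset (the step size and iteration count are arbitrary choices):

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(−z))
    return 1 / (1 + math.exp(-z))

# Made-up, linearly separable data; x0 = 1 is the intercept trick above.
data = [([1.0, 0.5], 0), ([1.0, 1.0], 0), ([1.0, 2.5], 1), ([1.0, 3.0], 1)]
theta = [0.0, 0.0]
eta = 0.1                      # arbitrary small step size

for _ in range(5000):          # arbitrary number of update steps
    grad = [0.0, 0.0]
    for x, y in data:
        err = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for j in range(len(theta)):
            grad[j] += err * x[j]          # [y − σ(θᵀx)] x_j
    theta = [t + eta * g for t, g in zip(theta, grad)]

# After training, small x should be classified 0 and large x classified 1.
print(sigmoid(theta[0] + theta[1] * 0.5) < 0.5)   # True
print(sigmoid(theta[0] + theta[1] * 3.0) > 0.5)   # True
```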
25.3 Derivations
∂/∂z σ(z) = σ(z)[1 − σ(z)]     (to get the derivative with respect to θ_j, use the chain rule)

Derivative of the gradient for one datapoint (x, y):

∂LL(θ)/∂θ_j = ∂/∂θ_j [ y log σ(θ⊤x) ] + ∂/∂θ_j [ (1 − y) log(1 − σ(θ⊤x)) ]     (derivative of sum of terms)

            = [ y/σ(θ⊤x) − (1 − y)/(1 − σ(θ⊤x)) ] · ∂/∂θ_j σ(θ⊤x)     (derivative of log f(x))

            = [ y/σ(θ⊤x) − (1 − y)/(1 − σ(θ⊤x)) ] · σ(θ⊤x)[1 − σ(θ⊤x)] x_j     (chain rule + derivative of sigmoid)

            = [ (y − σ(θ⊤x)) / (σ(θ⊤x)[1 − σ(θ⊤x)]) ] · σ(θ⊤x)[1 − σ(θ⊤x)] x_j     (algebraic manipulation)

            = [ y − σ(θ⊤x) ] x_j     (cancelling terms)

Because the derivative of a sum is the sum of derivatives, the gradient of θ_j is simply the sum of this term over all training datapoints:

∂LL(θ)/∂θ_j = ∑_{i=1}^{n} [ y^(i) − σ(θ⊤x^(i)) ] x_j^(i)
Counting
Sum Rule: |A| + |B| = m + n
Product Rule: |A||B| = mn
Inclusion-Exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
Floor and Ceiling: ⌊1.9⌋ = 1 and ⌈1.9⌉ = 2
The Pigeonhole Principle: ⌈m/n⌉

Random Variables
PMF: p_X(x) = P(X = x)
PDF: f(x)
CDF: F_X(x)
Combinatorics
Permutations: n!
Combinations (Binomial): n! / (r!(n − r)!) = (n choose r)
Bucketing (Multinomial): n! / (n_1! n_2! · · · n_r!) = (n choose n_1, n_2, . . . , n_r)
Divider Method: (n + r − 1)! / (n!(r − 1)!) = (n + r − 1 choose n)
Probability
PMF: p_X(x) = P(X = x)
Expectation: E[X] = ∑_{x∈X} x · p_X(x)
Variance: Var(X) = E[X²] − E[X]²
# Expectation
𝔼(X::Bernoulli) = X.p
𝔼(X::Binomial) = X.n*X.p
𝔼(X::Poisson) = X.λ
𝔼(X::Geometric) = 1/X.p
𝔼(X::NegativeBinomial) = X.r/X.p
𝔼(X::Uniform) = (X.a + X.b)/2
𝔼(X::Exponential) = 1/X.θ
# Variance
Var(X::Bernoulli) = X.p*(1-X.p)
Var(X::Binomial) = X.n*X.p*(1-X.p)
Var(X::Poisson) = X.λ
Var(X::Geometric) = (1-X.p)/X.p^2
Var(X::NegativeBinomial) = X.r*(1-X.p)/X.p^2
Var(X::Uniform) = (X.b-X.a)^2/12
Var(X::Exponential) = 1/X.θ^2
Arithmetic series:
∑_{i=0}^{n} i = ∑_{i=1}^{n} i = n(n + 1)/2     (B.9)

∑_{i=m}^{n} i = (n − m + 1)(n + m)/2     (B.10)
Geometric series:
∑_{i=0}^{n} x^i = (1 − x^{n+1}) / (1 − x)     (B.13)

∑_{i=m}^{n} x^i = (x^{n+1} − x^m) / (x − 1)     (B.14)

∑_{i=0}^{∞} x^i = 1 / (1 − x), if |x| < 1     (B.15)
Binomial coefficient:
∑_{i=0}^{n} (n choose i) = 2^n     (B.19)
∑_{i=1}^{n} i^c = Θ(n^{c+1}), for c ≥ 0     (B.20)

∑_{i=1}^{n} 1/i = Θ(log n)     (B.21)

∑_{i=1}^{n} c^i = Θ(c^n), for c ≥ 2     (B.22)
Definition of factorial (by definition, 0! = 1):

∏_{i=1}^{n} i = n!     (B.23)
Combining products:
∏_{i=1}^{n} f(i) · ∏_{i=1}^{n} g(i) = ∏_{i=1}^{n} f(i) · g(i)     (B.27)
Turning products into summations (using logarithms, assuming f(i) > 0 for all i):

log ( ∏_{i=1}^{n} f(i) ) = ∑_{i=1}^{n} log f(i)     (B.28)
d(u · v) = du · v + u · dv (B.29)
For your problem set solutions it is fine for your answers to include factorials,
exponentials, or combinations; you don’t need to calculate those all out to get a
single numeric answer. However, if you’d like to work with those in Python or
Julia, here are a few functions you may find useful.
*Names to the left of the dots (.) are modules that need to be imported before being used: import math, scipy.special. See also: http://en.wikipedia.org/wiki/Summation
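For example, from Python's standard math module (scipy.special.binom(n, r) is a floating-point counterpart of math.comb):

```python
import math

print(math.factorial(5))   # 5! = 120
print(math.comb(10, 3))    # (10 choose 3) = 120
print(math.floor(1.9))     # ⌊1.9⌋ = 1
print(math.ceil(1.9))      # ⌈1.9⌉ = 2
```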
Answer.
a.
b.
c.
d.
name: email:
Q2 cs 109: problem set 1, combinatorics July 29, 2020
PSET1 Q2. At the local zoo, a new exhibit consisting of 3 different species of birds and 3 different species of
reptiles is to be formed from a pool of 8 bird species and 6 reptile species. How many exhibits are possible if
a. there are no additional restrictions on which species can be selected?
b. 2 particular bird species cannot be placed together (e.g., they have a predator-prey relationship)?
c. 1 particular bird species and 1 particular reptile species cannot be placed together?
Answer.
a.
b.
c.
PSET1 Q3. A university is offering 3 programming classes: one in Java, one in C++, and one in Python. The classes are open to any of the 100 students at the university. There are: a total of 27 students in the Java class; a total of 28 students in the C++ class; a total of 20 students in the Python class; 12 students in both the Java and C++ classes (note: these students are also counted as being in each class in the numbers above); 5 students in both the Java and Python classes; 8 students in both the C++ and Python classes; and 2 students in all three classes (note: these students are also counted as being in each pair of classes). [A Venn diagram of the three classes (J = 27, C = 28, P = 20) accompanies this problem.]
a. If a student is chosen randomly at the university, what is the probability that the student is not in any of the 3 programming
classes?
b. If a student is chosen randomly at the university, what is the probability that the student is taking exactly one of the three
programming classes?
c. If two different students are chosen randomly at the university, what is the probability that at least one of the chosen
students is taking at least one of the programming classes?
Answer.
a.
b.
c.
PSET1 Q4. Say you have $20 million that must be invested among 4 possible companies. Each investment must
be in integral units of $1 million, and there are minimal investments that need to be made if one is to invest in
these companies. The minimal investments are $1, $2, $3, and $4 million dollars, respectively for company 1, 2,
3, and 4. How many different investment strategies are available if
a. an investment must be made in each company?
b. investments must be made in at least 3 of the 4 companies?
c. Now assume that we do not have a minimal investment in any of the companies. How many different investment strategies are available if we must invest less than or equal to $k million dollars total among the 4 companies, where k is a non-negative integer less than or equal to 20 (i.e., 0 ≤ k ≤ 20)? Note that you can think of k as a constant that can be used in your answer.
Answer.
a.
b.
c.
PSET1 Q5. Consider an array x of integers with k elements (e.g., int x[k]), where each entry in the array has
a distinct integer value between 1 and n, inclusive, and the array is sorted in increasing order. In other words,
1 ≤ x[i] ≤ n, for all i = 0, 1, 2, . . . , k − 1, and the array is sorted, so x[0] < x[1] < . . . < x[k-1]. How many
such sorted arrays are possible?
Answer.
PSET1 Q6. If we assume that all possible poker hands (comprised of 5 cards from a standard 52 card deck) are
equally likely, what is the probability of being dealt:
a. a flush? (A hand is said to be a flush if all 5 cards are of the same suit. Note that this definition means that
straight flushes (five cards of the same suit in numeric sequence) are also considered flushes.)
b. two pairs? (This occurs when the cards have numeric values a, a, b, b, c, where a, b and c are all distinct.)
c. four of a kind? (This occurs when the cards have numeric values a, a, a, a, b, where a and b are distinct.)
Answer.
a.
b.
c.
PSET1 Q7. Imagine you have a robot (Θ) that lives on an n × m grid (it has n rows and m columns). The robot starts in cell (1, 1) and can take steps either to the right or down (no left or up steps). How many distinct paths can the robot take to the destination (?) in cell (n, m) if
a. there are no additional constraints?
c. the robot changes direction exactly 3 times? Moving down two times in a row is not changing directions, but switching from moving down to moving right is. For example, moving [down, right, right, down] would count as having two direction changes.
Answer.
a.
b.
c.
PSET1 Q8. Say we roll a six-sided die six times. What is the probability that
a. we will roll two different numbers thrice (three times) each?
b. we will roll exactly one number exactly three times? Hint: Be careful of overcounting.
Answer.
a.
b.
PSET1 Q9. A binary string containing M 0’s and N 1’s (in arbitrary order, where all orderings are equally
likely) is sent over a network. What is the probability that the first r bits of the received message contain exactly
k 1’s?
Answer.
PSET1 Q10. Say we send out a total of 20 distinguishable emails to 12 distinct users, where each email we send
is equally likely to go to any of the 12 users (note that it is possible that some users may not actually receive any
email from us). What is the probability that the 20 emails are distributed such that there are 4 users who receive
exactly 2 emails each from us and 3 users who receive exactly 4 emails each from us?
Answer.
PSET1 Q11. Say a hacker has a list of n distinct password candidates, only one of which will successfully log
her into a secure system.
a. If she tries passwords from the list at random, deleting those passwords that do not work, what is the
probability that her first successful login will be (exactly) on her k-th try?
b. Now say the hacker tries passwords from the list at random, but does not delete previously tried passwords
from the list. She stops after her first successful login attempt. What is the probability that her first successful
login will be (exactly) on her k-th try?
Answer.
a.
b.
PSET1 Q12. Suppose that m strings are hashed (randomly) into N buckets, assuming that all N^m arrangements are equally likely. Find the probability that exactly k strings are hashed to the first bucket.
Answer.
PSET1 Q13. [Extra credit] To get good performance when working with binary search trees (BST), we must
consider the probability of producing completely degenerate BSTs (where each node in the BST has at most one
child). See Lecture Notes #2, Example 2, for more details on binary search trees.
a. If the integers 1 through n are inserted in arbitrary order into a BST (where each possible order is equally
likely), what is the probability (as an expression in terms of n) that the resulting BST will have completely
degenerate structure?
b. Using your expression from part (a), determine the smallest value of n for which the probability of forming
a completely degenerate BST is less than 0.001 (i.e., 0.1%).
Answer.
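A brute-force simulation can validate whatever closed form you derive in part (a) for small n. The choices n = 6 and the trial count below are arbitrary:

```python
import random

def is_degenerate(perm):
    # Insert keys into a BST in the given order, recording each node's
    # children; the tree is degenerate iff no node has two children.
    left, right = {}, {}
    root = perm[0]
    for key in perm[1:]:
        node = root
        while True:
            if key < node:
                if node in left:
                    node = left[node]
                else:
                    left[node] = key
                    break
            else:
                if node in right:
                    node = right[node]
                else:
                    right[node] = key
                    break
    return all((k in left) + (k in right) <= 1 for k in perm)

random.seed(109)
n, trials = 6, 100_000
hits = 0
for _ in range(trials):
    perm = list(range(1, n + 1))
    random.shuffle(perm)
    hits += is_degenerate(perm)
print(hits / trials)  # ~0.044 for n = 6
```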
PSET1 Q14. [Coding] Consider a game, which uses a generator that produces independent random integers
between 1 and 100, inclusive. The game starts with a sum S = 0. The first player adds random numbers from
the generator to S until S > 100, at which point they record their last random number x. The second player
continues by adding random numbers from the generator to S until S > 200, at which point they record their
last random number y. The player with the highest number wins; e.g., if y > x, the second player wins. Write a
Python 3 program to simulate 100,000 games and output the estimated probability that the second player wins.
Answer.
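The problem asks for a simulation; one possible sketch is below. Note that after the first player stops, the running sum is at most 200, so the second player always draws at least once:

```python
import random

def second_player_wins():
    s = 0
    while s <= 100:
        x = random.randint(1, 100)  # first player's draws
        s += x
    while s <= 200:
        y = random.randint(1, 100)  # second player's draws
        s += y
    return y > x  # second player wins only with a strictly higher number

random.seed(109)
trials = 100_000
p = sum(second_player_wins() for _ in range(trials)) / trials
print(p)
```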
cs 109: problem set 2, probability (July 29, 2020)
PSET2 Q1. Say in Silicon Valley, 35% of engineers program in Java and 28% of the engineers who program in
Java also program in C++. Furthermore, 40% of engineers program in C++.
a. What is the probability that a randomly selected engineer programs in Java and C++?
b. What is the conditional probability that a randomly selected engineer programs in Java given that they
program in C++?
Answer.
a.
b.
PSET2 Q2. A website wants to detect if a visitor is a robot or a human. They give the visitor five CAPTCHA
tests that are hard for robots but easy for humans. If the visitor fails one of the tests, they are flagged as a robot.
The probability that a human succeeds at a single test is 0.95, while a robot only succeeds with probability 0.3.
Assume all tests are independent. The percentage of visitors on this website that are robots is 5%; all other
visitors are human.
a. If a visitor is actually a robot, what is the probability they get flagged (the probability they fail at least one
test)?
b. If a visitor is human, what is the probability they get flagged?
c. Suppose a visitor gets flagged. Using your answers from part (a) and (b), what is the probability that the
visitor is a robot?
d. If a visitor is human, what is the probability that they pass exactly three of the five tests?
e. Building off of your answer from part (d), what is the probability that a visitor with unknown identity
passes exactly three of the five tests?
Answer.
a.
b.
c.
d.
e.
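A numeric sketch of parts (a) through (e), offered only as something to check your derivations against; the formulas below are one possible reading of the problem:

```python
from math import comb

p_robot, s_h, s_r = 0.05, 0.95, 0.30   # prior and per-test success rates

flag_given_robot = 1 - s_r ** 5        # (a) fail at least one of 5 tests
flag_given_human = 1 - s_h ** 5        # (b)
# (c) Bayes' theorem over robot vs. human
robot_given_flag = (p_robot * flag_given_robot) / (
    p_robot * flag_given_robot + (1 - p_robot) * flag_given_human)
# (d) binomial: exactly 3 of 5 tests passed by a human
pass3_given_human = comb(5, 3) * s_h ** 3 * (1 - s_h) ** 2
pass3_given_robot = comb(5, 3) * s_r ** 3 * (1 - s_r) ** 2
# (e) total probability over the unknown identity
pass3 = (1 - p_robot) * pass3_given_human + p_robot * pass3_given_robot
print(flag_given_robot, flag_given_human, robot_given_flag, pass3)
```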
PSET2 Q3. Say all computers either run operating system W or X. A computer running operating system W is
twice as likely to get infected with a virus as a computer running operating system X. If 70% of all computers
are running operating system W, what percentage of computers infected with a virus are running operating
system W?
Answer.
PSET2 Q4. The Superbowl institutes a new way to determine which team receives the kickoff first. The referee
chooses with equal probability one of three coins. Although the coins look identical, they have probability
of heads 0.1, 0.5 and 0.9, respectively. Then the referee tosses the chosen coin 3 times. If more than half the
tosses come up heads, one team will kick off; otherwise, the other team will kick off. If the tosses resulted in the
sequence H, T, H, what is the probability that the fair coin was actually used?
Answer.
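A quick Bayes computation to check against (a sketch, using the stated uniform prior over the three coins):

```python
# Posterior that the fair coin was chosen, given the sequence H, T, H.
priors = [1 / 3, 1 / 3, 1 / 3]
heads_probs = [0.1, 0.5, 0.9]

# Likelihood of observing H, T, H for each coin (independent tosses).
likelihoods = [p * (1 - p) * p for p in heads_probs]
joint = [pr * lk for pr, lk in zip(priors, likelihoods)]
posterior_fair = joint[1] / sum(joint)
print(posterior_fair)
```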
PSET2 Q5. After a long night of programming, you have built a powerful, but slightly buggy, email spam
filter. When you don’t encounter the bug, the filter works very well, always marking a spam email as spam
and always marking a non-spam email as good. Unfortunately, your code contains a bug that is encountered
10% of the time when the filter is run on an email. When the bug is encountered, the filter always marks the
email as good. As a result, emails that are actually spam will be erroneously marked as good when the bug is
encountered. Let p denote the probability that an email is actually non-spam, and let q denote the conditional
probability that an email is non-spam given that it is marked as good by the filter.
a. Determine q in terms of p.
b. Using your answer from part (a), explain mathematically whether q or p is greater. Also, provide an intuitive
justification for your answer.
Answer.
a.
b.
PSET2 Q6. Two cards are randomly chosen without replacement from an ordinary deck of 52 cards. Let E be
the event that both cards are Aces. Let F be the event that the Ace of Spades is one of the chosen cards, and let
G be the event that at least one Ace is chosen.
a. Compute P( E | F ).
b. Are E and F independent? Justify your answer using your response to part (a).
c. Compute P( E | G ).
Answer.
a.
b.
c.
PSET2 Q7. Your colleagues in a comp-bio lab have sequenced DNA from a large population in order to
understand how a gene (G) influences two particular traits (T1 and T2 ). They find that P( G ) = 0.6, P( T1 | G ) =
0.7, and P( T2 | G ) = 0.9. They also observe that if a subject does not have the gene G, they express neither T1
nor T2 . The probability of a patient having both T1 and T2 given that they have the gene G is 0.63.
a. Are T1 and T2 conditionally independent given G?
b. Are T1 and T2 conditionally independent given G^c?
c. What is P( T1 )?
d. What is P( T2 )?
e. Are T1 and T2 independent?
Answer.
a.
b.
c.
d.
e.
PSET2 Q8. The color of a person’s eyes is determined by a pair of eye-color genes, as follows:
• if both of the eye-color genes are blue-eyed genes, then the person will have blue eyes
• if one or more of the genes is a brown-eyed gene, then the person will have brown eyes
A newborn child independently receives one eye-color gene from each of its parents, and the gene it receives
from a parent is equally likely to be either of the two eye-color genes of that parent. Suppose William and both
of his parents have brown eyes, but William’s sister (Claire) has blue eyes. (We assume that blue and brown
are the only eye-color genes.)
Answer.
a.
b.
PSET2 Q9. Consider the following algorithm for betting in roulette. At each round (‘‘spin’’), you bet $1 on a
color (‘‘red’’ or ‘‘black’’). If that color comes up on the wheel, you keep your bet AND win $1; otherwise, you
lose your bet.
i. Bet $1 on ‘‘red’’
ii. If ‘‘red’’ comes up on the wheel (with probability 18/38), then you win $1 (and keep your original $1 bet)
and you immediately quit (i.e., you do not do step (iii) below).
iii. If ‘‘red’’ did not come up on the wheel (with probability 20/38), then you lose your initial $1 bet. But,
then you bet $1 on ‘‘red’’ on each of the next two spins of the wheel. After those two spins, you quit (no
matter what the outcome of the next two spins).
Let X denote your ‘‘winnings’’ when you quit, i.e., the total amount of money won minus any amounts lost
while playing. This value may be negative.
a. Determine P( X > 0).
b. Determine E[ X ]. (Rhetorical question: Would you play this game?)
Answer.
a.
b.
PSET2 Q10.
c. For each gene i, decide whether or not you think it would be reasonable to assume that Gi is independent
of T. Support your argument with numbers. Remember that our probabilities are based on 100,000 bats, not
infinitely many bats, and are therefore only estimates of the true probabilities.
d. Give your best interpretation of the results from (a) to (c).
e. [Extra Credit] Try to find conditional independence relationships between the genes and the trait. Incorporate
this information to improve your hypothesis of how the five genes relate to whether or not a bat can
carry Ebola.
Answer.
c.
d.
e.
PSET2 Q11. [Extra Credit] Suppose we want to write an algorithm fairRandom for randomly generating a 0
or a 1 with equal probability (= 0.5). Unfortunately, all we have available to us is a function:
unknownRandom()::Int
that randomly generates bits, where on each call a 1 is returned with some unknown probability p that need not
be equal to 0.5 (and a 0 is returned with probability 1 − p).
Consider the following algorithms for fairRandom and simpleRandom:
function fairRandom()
    r₁, r₂ = 0, 0
    while true
        r₁ = unknownRandom()
        r₂ = unknownRandom()
        if (r₁ ≠ r₂) break; end
    end
    return r₂
end

function simpleRandom()
    r₁, r₂ = 0, 0
    r₁ = unknownRandom()
    while true
        r₂ = unknownRandom()
        if (r₁ ≠ r₂) break; end
    end
    return r₂
end
a. Show mathematically that fairRandom does indeed return a 0 or a 1 with equal probability.
b. Say we want to simplify the function, so we write the simpleRandom function. Would the simpleRandom
function also generate 0’s and 1’s with equal probability? Explain why or why not. In addition, determine
P(simpleRandom returns 1) in terms of p.
Answer.
a.
b.
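A direct Python port of fairRandom makes part (a) easy to check empirically. The bias P = 0.8 below is an arbitrary assumption standing in for the unknown p:

```python
import random

P = 0.8   # stand-in for the unknown bias p; any value in (0, 1) works

def unknown_random():
    # Plays the role of unknownRandom(): returns 1 w.p. P, else 0.
    return 1 if random.random() < P else 0

def fair_random():
    # Port of fairRandom() above: draw pairs until they differ.
    while True:
        r1, r2 = unknown_random(), unknown_random()
        if r1 != r2:
            return r2

random.seed(109)
est = sum(fair_random() for _ in range(100_000)) / 100_000
print(est)  # ~0.5 even though P = 0.8
```

Changing P should leave the estimate near 0.5, which is the point of the rejection scheme; trying the same experiment with a port of simpleRandom is a good way to build intuition for part (b).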
cs 109: problem set 3, random variables (July 29, 2020)
PSET3 Q2. Lyft line gets 2 requests every 5 minutes, on average, for a particular route. A user requests the
route and Lyft commits a car to take her. All users who request the route in the next five minutes will be added
to the car as long as the car has space. The car can fit up to three users. Lyft will make $7 for each user in the car
(the revenue) minus $9 (the operating cost).
a. How much does Lyft expect to make from this trip?
b. Lyft has one space left in the car and wants to wait to get another user. What is the probability that another
user will make a request in the next 30 seconds?
Answer.
a.
b.
PSET3 Q3. Suppose it takes at least 9 votes from a 12-member jury to convict a defendant. Suppose also that
the probability that a juror votes that an actually guilty person is innocent is 0.25, whereas the probability that
the juror votes that an actually innocent person is guilty is 0.15. If each juror acts independently and if 70% of
defendants are actually guilty, find the probability that the jury renders a correct decision. Also determine the
percentage of defendants found guilty by the jury.
Answer.
PSET3 Q4. To determine whether they have measles, 1000 people have their blood tested. However, rather
than testing each individual separately (1000 tests is quite costly), it is decided to use a group testing procedure:
• Phase 1: First, place people into groups of 5. The blood samples of the 5 people in each group will be pooled
and analyzed together. If the test is positive (at least one person in the pool has measles), continue to Phase 2.
Otherwise send the group home. 200 of these pooled tests are performed.
• Phase 2: Individually test each of the 5 people in the group. 5 of these individual tests are performed per
group in Phase 2.
Suppose that the probability that a person has measles is 5% for all people, independently of others, and that
the test has a 100% true positive rate and 0% false positive rate (note that this is unrealistic). Using this strategy,
compute the expected total number of blood tests (individual and pooled) that we will have to do across Phases
1 and 2.
Answer.
PSET3 Q5. Let X be a continuous random variable with probability density function:
f(x) = c(2 − 2x^2) for −1 < x < 1, and f(x) = 0 otherwise.
Answer.
a.
b.
c.
PSET3 Q6. The number of times a person’s computer crashes in a month is a Poisson random variable with
λ = 7. Suppose that a new operating system patch is released that reduces the Poisson parameter to λ = 2 for
80% of computers, and for the other 20% of computers the patch has no effect on the rate of crashes. If a person
installs the patch, and has their computer crash 4 times in the month thereafter, how likely is it that the patch
has had an effect on the user’s computer (i.e., it is one of the 80% of computers that the patch reduces crashes
on)?
Answer.
PSET3 Q7. Say there are k buckets in a hash table. Each new string added to the table is hashed into bucket
i with probability p_i, where ∑_{i=1}^{k} p_i = 1. If n strings are hashed into the table, find the expected number of
buckets that have at least one string hashed to them. (Hint: Let X_i be a binary variable that has the value 1
when there is at least one string hashed into bucket i after the n strings are added to the table (and 0 otherwise).
Compute E[∑_{i=1}^{k} X_i].)
Answer.
PSET3 Q8. You are testing software and discover that your program has a non-deterministic bug that causes
catastrophic failure (aka a ‘‘hindenbug’’). Your program was tested for 400 hours and the bug occurred twice.
a. Each user uses your program to complete a three hour long task. If the hindenbug manifests they will
immediately stop their work. What is the probability that the bug manifests for a given user?
b. Your program is used by one million users. Use a Normal approximation to estimate the probability that
more than 10,000 users experience the bug. Use your answer from part (a). Provide a numeric answer for
this part.
Answer.
a.
b.
PSET3 Q9. Say the lifetimes of computer chips produced by a certain manufacturer are normally distributed
with parameters µ = 1.5 × 10^6 hours and σ = 9 × 10^5 hours. The lifetime of each chip is independent of the
other chips produced.
a. What is the approximate probability that a batch of 100 chips will contain at least 6 whose lifetimes are
more than 3.0 × 10^6 hours?
b. What is the approximate probability that a batch of 100 chips will contain at least 65 whose lifetimes are
less than 1.9 × 10^6 hours? Provide a numeric answer for this part.
Answer.
a.
b.
PSET3 Q10. A Bloom filter is a probabilistic implementation of the set data structure, an unordered collection
of unique objects. In this problem we are going to look at it theoretically. Our Bloom filter uses 3 different
independent hash functions H1 , H2 , H3 that each take any string as input and each return an index into a
bit-array of length n. Each index is equally likely for each hash function.
To add a string into the set, feed it to each of the 3 hash functions to get 3 array positions. Set the bits at all
these positions to 1. For example, initially all values in the bit-array are zero. In this example n = 10:
Index: 0 1 2 3 4 5 6 7 8 9
Value: 0 0 0 0 0 0 0 0 0 0
Index: 0 1 2 3 4 5 6 7 8 9
Value: 0 0 0 0 1 0 0 1 1 0
Bits are never switched back to 0. Consider a Bloom filter with n = 9,000 buckets. You have added m = 1,000
strings to the Bloom filter. Provide a numerical answer for all questions.
a. What is the (approximated) probability that the first bucket has 0 strings hashed to it?
To check whether a string is in the set, feed it to each of the 3 hash functions to get 3 array positions. If any of the
bits at these positions is 0, the element is not in the set. If all bits at these positions are 1, the string may be in the
set; but it could be that those bits are 1 because some of the other strings hashed to the same values. You may
assume that the value of one bucket is independent of the value of all others.
Answer.
a.
b. What is the probability that a string which has not previously been added to the set will be misidentified
as in the set? That is, what is the probability that the bits at all of its hash positions are already 1? Use
approximations where appropriate.
Answer.
b.
c. Our Bloom filter uses three hash functions. Was that necessary? Repeat your calculation in (b) assuming
that we only use a single hash function (not 3).
Answer.
c.
(Chrome uses a Bloom filter to keep track of malicious URLs. Questions such as this allow us to compute
appropriate sizes for hash tables in order to get good performance with high probability in applications where
we have a ballpark idea of the number of elements that will be hashed into the table.)
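For a ballpark on parts (a) and (b), the quantities can be evaluated directly under the independence assumption stated above (a sketch, not a verified solution):

```python
n, m, k = 9_000, 1_000, 3   # buckets, strings added, hash functions

# (a) A given bucket is missed by all k*m independent hashes.
p_empty = (1 - 1 / n) ** (k * m)
# (b) A new string is a false positive if all k of its bits are set.
p_false_positive = (1 - p_empty) ** k
print(p_empty, p_false_positive)
```

Repeating the calculation with k = 1 gives the comparison asked for in part (c).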
PSET3 Q11. Last summer (May 2019) the concentration of CO2 in the atmosphere was 414 parts per million
(ppm) which is substantially higher than the pre-industrial concentration: 275 ppm. CO2 is a greenhouse gas
and as such increased CO2 corresponds to a warmer planet.
Absent some pretty significant policy changes, we will reach a point within the next 50 years (i.e., well within
your lifetime) where the CO2 in the atmosphere will be double the pre-industrial level. In this problem we
are going to explore the following question: What will happen to the global temperature if atmospheric CO2
doubles?
The measure, in degrees Celsius, of how much the global average surface temperature will change (at the
point of equilibrium) after a doubling of atmospheric CO2 is called ‘‘Climate Sensitivity.’’ Since the earth is a
complicated ecosystem, climate scientists model Climate Sensitivity as a random variable, S. The IPCC Fourth
Assessment Report summarized 10 scientific studies that estimated the PDF of S.
In this problem we are going to treat S as part-discrete and part-continuous. For values of S less than 7.5, we
are going to model sensitivity as a discrete random variable with PMF based on the average of estimates from
the studies in the IPCC report. Here is the PMF for S in the range 0 through 7.5:
Sensitivity, S (degrees C) 0 1 2 3 4 5 6 7
Expert Probability 0.00 0.11 0.26 0.22 0.16 0.09 0.06 0.04
The IPCC fifth assessment report notes that there is a non-negligible chance of S being greater than 7.5
degrees but didn’t go into detail about probabilities. In the paper ‘‘Fat-Tailed Uncertainty in the Economics
of Catastrophic Climate Change,’’ Martin Weitzman discusses how different models for the PDF of Climate
Sensitivity (S) for large values of S have wildly different policy implications.
For values of S greater than or equal to 7.5 degrees Celsius, we are going to model S as a continuous random
variable. Consider two different assumptions for S when it is at least 7.5 degrees Celsius: a fat tailed distribution
( f 1 ) and a thin tailed distribution ( f 2 ):
f1(x) = K/x for 7.5 ≤ x < 30
f2(x) = K/x^3 for 7.5 ≤ x < 30
For this problem assume that the probability that S is greater than 30 degrees Celsius is 0.
a. Compute the probability that Climate Sensitivity is at least 7.5 degrees Celsius.
Answer.
a.
Answer.
b.
c. It is estimated that if temperatures rise more than 10 degrees Celsius, all the ice on Greenland will melt.
Estimate the probability that S is greater than 10 under both the f 1 and f 2 assumptions.
Answer.
c.
Answer.
d.
e. Let R = S^2 be a crude approximation of the cost to society that results from S. Calculate E[R] under both
the f1 and f2 assumptions.
Answer.
e.
Notes: (1) Both f 1 and f 2 are ‘‘power law distributions’’. (2) Calculating expectations for a variable that is part
discrete and part continuous is as simple as: use the discrete formula for the discrete part and the continuous
formula for the continuous part.
PSET3 Q12. [Extra Credit] Say we have a cable of length n. We select a point (chosen uniformly randomly)
along the cable, at which we cut the cable into two pieces. What is the probability that the shorter of the two
pieces of the cable is less than 1/3 the size of the longer of the two pieces? Explain formally how you derived
your answer.
Answer.
cs 109: problem set 4, distributions (July 29, 2020)
PSET4 Q1. The median of a continuous random variable having cumulative distribution function F is the value
m such that F (m) = 0.5. That is, a random variable is just as likely to be larger than its median as it is to be
smaller. Find the median of X (in terms of the respective distribution parameters) in each case below.
a. X ∼ Uni(a, b)
b. X ∼ N(µ, σ^2)
c. X ∼ Exp(λ)
Answer.
a.
b.
c.
PSET4 Q2. Users independently sign up for two online social networking sites, Lookbook and Quickgram. On
average, 7.5 users sign up for Lookbook each minute, while on average 5.5 users sign up for Quickgram each
minute. The number of users signing up for Lookbook and for Quickgram each minute are independent. A new
user is defined as a new account, i.e., the same person signing up for both social networking sites will count as
two new users.
a. What is the probability that more than 10 new users will sign up for the Lookbook social networking site in
the next minute?
b. What is the probability that more than 13 new users will sign up for the Quickgram social networking site
in the next 2 minutes?
c. What is the probability that the company will get a combined total of more than 40 new users across both
websites in the next 2 minutes?
Answer.
a.
b.
c.
PSET4 Q3. Say that of all the students who will attend Stanford, each will buy at most one laptop computer
when they first arrive at school. 40% of students will purchase a PC, 30% will purchase a Mac, 10% will purchase
a Linux machine and the remaining 20% will not buy any laptop at all. If 15 students are asked which, if any,
laptop they purchased, what is the probability that exactly 6 students will have purchased a PC, 4 will have
purchased a Mac, 2 will have purchased a Linux machine, and the remaining 3 students will have not purchased
any laptop?
Answer.
PSET4 Q4. Say we have two independent variables X and Y, such that X ∼ Geo( p) and Y ∼ Geo( p). Mathe-
matically derive an expression for P( X = k | X + Y = n), where k and n are non-negative integers.
Answer.
PSET4 Q5. Choose a number X at random from the set of numbers {1, 2, 3, 4, 5}. Now choose a number at
random from the subset no larger than X, that is from {1, . . . , X }. Let Y denote the second number chosen.
a. Determine the joint probability mass function of X and Y.
b. Determine the conditional mass function of X given Y = i. Do this for i = 1, 2, 3, 4, 5.
c. Are X and Y independent? Justify your answer.
Answer.
a.
b.
c.
PSET4 Q6. Let X_1, X_2, . . . be a series of independent random variables which all have the same mean µ and the
same variance σ^2. Let Y_n = X_n + X_{n+1}. For j = 0, 1, and 2, determine Cov(Y_n, Y_{n+j}). Note that you may have
different cases for your answer depending on the value of j.
Answer.
PSET4 Q7. Our ability to fight contagious diseases depends on our ability to model them. One person is
exposed to llama-flu. The method below models the number of individuals who will get infected.
from scipy import stats

def num_infected():
    """Return number of people infected by one individual."""
    # Most people are immune to llama flu.
    # stats.bernoulli(p).rvs() returns 1 w.p. p (0 otherwise).
    immune = stats.bernoulli(p=0.99).rvs()
    if immune:
        return 0
Answer.
Let Y = the value returned by recurse(). We previously computed E[Y ] = 15. What is Var(Y )?
Answer.
PSET4 Q9. You go on a camping trip with two friends who each have a mobile phone. Since you are out in the
wilderness, mobile phone reception isn’t very good. One friend’s phone will independently drop calls with
20% probability. Your other friend’s phone will independently drop calls with 30% probability. Say you need to
make 6 phone calls, so you randomly choose one of the two phones and you will use that same phone to make
all your calls (but you don’t know which has a 20% versus 30% chance of dropping calls). Of the first 3 (out of
6) calls you make, one of them is dropped. What is the conditional expected number of dropped calls in the 6
total calls you make (conditioned on having already had one of the first three calls dropped)?
Answer.
PSET4 Q11. [Extra Credit] Consider a bit string of length n, where each bit is independently generated and
has probability p of being a 1. We say that a bit switch occurs whenever a bit differs from the one preceding it
in the string (if there is a preceding bit). For example, if n = 5 and we have the bit string 11010, then there
are 3 bit switches. Find the expected number of bit switches in a string of length n. (Hint: You might find it
helpful to use a set of indicator (Bernoulli) variables that are defined in terms of whether a bit switch occurred
in each position of the string. And in case you’re wondering why we care about bit switches, the number of bit
switches in a string can be one indicator of how compressible that string might be—for example, if the bit string
represented a file that we were trying to ZIP.)
Answer.
cs 109: problem set 5, sampling (July 29, 2020)
PSET5 Q1. The joint probability density function of continuous random variables X and Y is given by:
f_{X,Y}(x, y) = c · (y/x) where 0 < y < x < 1
a. What is the value of c in order for f X,Y ( x, y) to be a valid probability density function?
Answer.
a.
Answer.
b.
Answer.
c.
Answer.
d.
e. What is E[ X ]?
Answer.
e.
PSET5 Q2. A robot is located at the center of a square world that is 10 kilometers on each side. A package is
dropped off in the robot’s world at a point ( x, y) that is uniformly (continuously) distributed in the square. If
the robot’s starting location is designated to be (0, 0) and the robot can only move up/down/left/right parallel
to the sides of the square, the distance the robot must travel to get to the package at point ( x, y) is | x | + |y|. Let
D = the distance the robot travels to get to the package. Compute E[ D ].
Answer.
PSET5 Q3. Let X, Y, and Z be independent random variables, where X ∼ N(µ1, σ1^2), Y ∼ N(µ2, σ2^2), and
Z ∼ N(µ3, σ3^2).
a. Let A = X + Y. What is the distribution (along with parameter values) of A?
Answer.
a.
b. Let B = 4X + 3. What is the joint distribution (along with parameter values) of B and Z? (Hint: Bivariate
Normal)
Answer.
b.
c. Let C = aX − b^2 Y + cZ, where a, b, and c are real-valued constants. What is the distribution (along with
parameter values) of C? Show how you derived your answer.
Answer.
c.
PSET5 Q4. A fair 6-sided die is repeatedly rolled until the total sum of all the rolls exceeds 300. Approximate
(using the Central Limit Theorem) the probability that at least 80 rolls are necessary to reach a sum that exceeds
300.
Answer.
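A simulation gives a target to compare the CLT approximation against (a sketch; the trial count is arbitrary). Needing at least 80 rolls is the same event as the first 79 rolls summing to 300 or less:

```python
import random

random.seed(109)
trials = 50_000
need_80_or_more = 0
for _ in range(trials):
    total, rolls = 0, 0
    while total <= 300:
        total += random.randint(1, 6)  # one fair 6-sided die roll
        rolls += 1
    if rolls >= 80:
        need_80_or_more += 1
print(need_80_or_more / trials)  # ~0.94
```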
PSET5 Q5. Program A will run 20 algorithms in sequence, with the running time for each algorithm being
independent random variables with mean = 46 seconds and variance = 100 seconds^2. Program B will run 20
algorithms in sequence, with the running time for each algorithm being independent random variables with
mean = 48 seconds and variance = 200 seconds^2.
a. What is the approximate probability that Program A completes in less than 900 seconds?
Answer.
a.
b. What is the approximate probability that Program B completes in less than 900 seconds?
Answer.
b.
c. What is the approximate probability that Program A completes in less time than Program B?
Answer.
c.
PSET5 Q6.
c. [Written] Use your answers to part (a) and (b) and approximate X and Y as Normal random variables
with mean and variance that match their biometric data. Report both distributions.
d. [Written] Calculate the ratio of the probability that user A wrote the email over the probability that user B
wrote the email. You do not need to submit code, but you should include the formula that you attempted to
calculate and a short description (a few sentences) of how your code works.
Answer.
c.
d.
PSET5 Q7.
d. [Written] Would you use the mean or the median of 5 peer grades to assign scores in the online version of
Stanford’s HCI class? Hint: it might help to visualize the scores. Feel free to write code to help you answer
this question, but for this question we’ll solely evaluate your written answer in the PDF that you upload to
Gradescope.
Answer.
d.
PSET5 Q8.
c. [Written] For each of the three backgrounds, calculate a difference in means in learning outcome between
activity1 and activity2, and the p-value of that difference.
d. [Written] Your manager at Coursera is concerned that you have been ‘‘p-hacking,’’ which is also known
as data dredging: https://en.wikipedia.org/wiki/Data_dredging. In one sentence, explain why your
results in part (c) are not the result of p-hacking.
Answer.
c.
d.
cs 109: quiz 1 (July 29, 2020)
Quiz 1
I acknowledge and accept the letter and spirit of the Honor Code:
Q1. You have just been elected social chair for the student organization PRoB (Probability Revolution or Bust)
for the coming academic year 2020–2021. As new social chair, you would like to hold 10 (indistinct) socials
over the next 3 quarters (Autumn, Winter, and Spring). Each social is equally likely to be assigned to any of the
quarters, and it is possible that a quarter has no socials. Order of socials within a quarter doesn’t matter.
a. (5 points) In how many distinct ways can the 10 (indistinct) socials be allocated to the 3 quarters?
Answer.
a.
b. (5 points) In how many distinct ways can the 10 (indistinct) socials be allocated to the 3 quarters if you must
hold at least 2 socials each quarter?
Answer.
b.
c. (3 points) Let your answer to part (a) be n. Consider the event where you hold 3 socials in Autumn, 3 socials
in Winter, and 4 socials in Spring. Is the probability of this event 1/n? Briefly explain in a few sentences why or
why not.
Answer.
c.
d. (12 points) You are also planning your courseload for next year. You have 10 courses to schedule over 3
quarters (Autumn, Winter, and Spring). All of the courses are distinct and offered every quarter; order of
courses within a quarter doesn’t matter. In how many distinct ways can you allocate 10 (distinct) courses to
the 3 quarters if you can take at most 4 courses in any quarter?
Answer.
d.
Q2. A home robot has two different sensors for motion detection. If there is a moving object, sensor V (video
camera) will detect motion with probability 0.95, and sensor L (laser) will detect motion with probability 0.8. If
there is no moving object, there is a 0.1 probability that sensor V will detect motion (even though there is no
object), and a 0.05 probability that sensor L will detect motion.
Based on empirical evidence, the probability that there is a moving object is 0.7. Note that these sensors use
independent detection algorithms to identify motion, so that conditioned on there being a moving object (or
not), the events of detecting motion (or not) for each sensor are independent.
a. (3 points) Given that there is a moving object and that sensor V does not detect motion, what is the probability
that sensor L detects motion? Give a numerical answer.
Answer.
a.
b. (5 points) Given that there is a moving object, what is the probability that at least one of the two sensors
detects motion? Give a numerical answer.
Note: You can use WolframAlpha as a calculator (it also accepts LaTeX equations). Example: https://www.wolframalpha.com/input/?i=sum+x%5Ek%2Fk%21%2C+k%3D0+to+infinity
Answer.
b.
c. (5 points) Given that there is a moving object, what is the probability that exactly one of the two sensors
detects motion? Give a numerical answer.
Answer.
c.
d. (8 points) What is the probability that there is a moving object given that both sensors detect motion? Give a
numerical answer.
Answer.
d.
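(Not part of the original quiz.) Parts (a)-(d) can be checked numerically straight from the probabilities in the problem statement; the script below is an illustrative sketch that uses the stated conditional-independence assumption.

```python
p_m = 0.7                    # P(moving object)
p_v_m, p_l_m = 0.95, 0.8     # P(V detects | motion), P(L detects | motion)
p_v_nm, p_l_nm = 0.1, 0.05   # P(V detects | no motion), P(L detects | no motion)

# (a) Given motion, the sensors are conditionally independent, so V's miss
# tells us nothing about L.
ans_a = p_l_m

# (b) At least one detects, given motion: complement of both missing.
ans_b = 1 - (1 - p_v_m) * (1 - p_l_m)

# (c) Exactly one detects, given motion.
ans_c = p_v_m * (1 - p_l_m) + (1 - p_v_m) * p_l_m

# (d) Bayes' theorem, given that both sensors detect motion.
joint_m = p_m * p_v_m * p_l_m
joint_nm = (1 - p_m) * p_v_nm * p_l_nm
ans_d = joint_m / (joint_m + joint_nm)

print(ans_a, ans_b, ans_c, round(ans_d, 4))  # ~ 0.8 0.99 0.23 0.9972
```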
e. (4 points) The probabilities that sensor V detects motion given a moving object and that sensor V detects
motion given no moving object do not sum up to 1. Briefly explain in a few sentences why this is okay.
Answer.
e.
Q3. You have 8 pairs of mittens, each a different pattern. Left and right mittens are also distinct. Suppose that
you are fostering kittens, and you leave them alone for a few hours with your mittens. When you return, you
discover that they have hidden 4 mittens! Suppose that your kittens are equally likely to hide any 4 of your 16
distinct mittens. Let X be the number of complete, distinct pairs of mittens that you have left.
a. (15 points) Compute the probability mass function of X, p_X(x). (Hint: Note the support of X is {4, 5, 6}.)
Answer.
a.
b. (5 points) Compute E[X] using the definition of expectation and your answer to (a).
Answer.
b.
c. (10 points) Define the random variable X_i to be 1 if your i-th pair of mittens is complete after the kitten fiasco,
and 0 otherwise. Using this definition of X_i for i = 1, . . . , 8 and the linearity of expectation, compute E[X]
again. Do not use your answer to part (a).
Answer.
c.
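(Not part of the original quiz.) With only C(16, 4) = 1820 ways to hide 4 mittens, the PMF in part (a) and both expectation calculations can be verified by exhaustive enumeration. The pair/side encoding below is just one illustrative choice.

```python
from itertools import combinations
from math import comb

# Encode each mitten as (pair, side): 8 pairs, each with a left and a right.
mittens = [(pair, side) for pair in range(8) for side in range(2)]

counts = {4: 0, 5: 0, 6: 0}
for hidden in combinations(mittens, 4):
    touched_pairs = {pair for pair, _ in hidden}
    counts[8 - len(touched_pairs)] += 1  # X = pairs with neither mitten hidden

total = comb(16, 4)
pmf = {x: c / total for x, c in counts.items()}
ev = sum(x * p for x, p in pmf.items())

# Linearity check: E[X] = 8 * P(one fixed pair survives) = 8 * C(14,4)/C(16,4).
ev_linear = 8 * comb(14, 4) / comb(16, 4)

print(counts)                             # {4: 1120, 5: 672, 6: 28}
print(round(ev, 4), round(ev_linear, 4))  # 4.4 4.4
```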
Q4. A food takeout and delivery company, DashDoor (like DoorDash, but better) wants to understand how
busy their employees are each month. In metropolitan areas, employees receive on average 8 customer requests
each day. Regardless of where they work, each employee spends exactly 0.5 hours making deliveries for each
request (one at a time); for example, an employee who receives exactly 4 requests on a particular day will spend
exactly 2 hours making deliveries.
a. (4 points) What is the variance of the number of hours that a metropolitan employee spends on making
deliveries in a day?
Answer.
a.
b. (6 points) What is the probability that a metropolitan employee has a "very busy day," defined as spending
at least 5 hours making deliveries from requests received that day? Give a numerical answer.
Note: You can use WolframAlpha as a calculator (it also accepts LaTeX equations). Example: https://www.wolframalpha.com/input/?i=sum+x%5Ek%2Fk%21%2C+k%3D0+to+infinity
Answer.
b.
c. (6 points) The company estimates that 0.6 of their employees work in metropolitan areas, and the rest in
suburban areas. Suburbs have different customer demand: in suburban areas, employees receive on average 6
customer requests each day. What is the probability that a randomly chosen employee has a very busy day?
Answer.
c.
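(Not part of the original quiz.) Modeling daily requests as Poisson, a common reading of "on average 8 requests each day," the hours worked are 0.5X, a "very busy day" means X >= 10 requests, and part (c) averages over the two employee populations. The sketch below computes all three quantities.

```python
from math import exp, factorial

def poisson_at_least(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k))

# (a) Hours = 0.5 * X, so Var(hours) = 0.25 * Var(X) = 0.25 * lambda.
var_hours = 0.25 * 8

# (b) At least 5 hours <=> at least 10 requests, with lambda = 8.
p_metro = poisson_at_least(8, 10)

# (c) Mix over metropolitan (lambda = 8) and suburban (lambda = 6) employees.
p_any = 0.6 * p_metro + 0.4 * poisson_at_least(6, 10)

print(var_hours, round(p_metro, 4), round(p_any, 4))  # ~ 2.0 0.2834 0.2036
```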
d. (14 points) Consider a metropolitan employee. The event that they have a very busy day on a particular day
is independent of how busy they are on other days. Let p be your answer from part (b), the probability that they
have a very busy day.
i. (6 points) What is the probability that this employee has more than 10 busy days in the next 20 working
days? Leave your answer in terms of p.
Answer.
d.i.
ii. (4 points) What is the expected number of very busy days that this employee has over the next 20 working
days? Leave your answer in terms of p.
Answer.
d.ii.
iii. (4 points) Would it be reasonable to approximate the probability you computed in part (d)(i) using a
Poisson random variable? Briefly explain in a few sentences why or why not.
Answer.
d.iii.
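(Not part of the original quiz.) The number of very busy days in 20 independent working days is Binomial(20, p), so d(i) is a binomial tail probability and d(ii) is np. The sketch below keeps p as a function argument, matching the "leave your answer in terms of p" instruction, and then plugs in an illustrative numeric value.

```python
from math import comb

def p_more_than(n, k, p):
    """P(X > k) for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1, n + 1))

def expected_busy_days(n, p):
    """E[X] = n * p for X ~ Bin(n, p)."""
    return n * p

# Illustrative numeric check with p ~ 0.2834 (the Poisson value from part (b)).
p = 0.2834
print(round(p_more_than(20, 10, p), 4))        # d(i): small, since E[X] ~ 5.7
print(round(expected_busy_days(20, p), 3))     # d(ii): 5.668
```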