
Probability for Computer Scientists

Stanford, California
Contents

Acknowledgments

1 Counting
1.1 Sum Rule
1.2 Product Rule
1.3 The Inclusion-Exclusion Principle
1.4 Double Counting and Constraints
1.5 General Principle of Counting
1.6 Floor and Ceiling
1.7 The Pigeonhole Principle
1.8 Bibliography

2 Combinatorics
2.1 Permutations
2.2 Permutations of Indistinct Objects
2.3 Combinations
2.4 Selecting Multiple Groups of Objects
2.5 Bucketing/Group Assignment

3 Probability
3.1 Event Spaces and Sample Spaces
3.2 Probability
3.3 Axioms of Probability
3.4 Provable Identities of Probability
3.5 Equally Likely Outcomes

4 Conditional Probability
4.1 Conditional Probability
4.2 Law of Total Probability
4.3 Bayes' Theorem
4.4 Conditional Paradigm

5 Independence
5.1 Independence
5.2 Conditional Independence

6 Random Variables
6.1 Probability Mass Function
6.2 Expectation
6.3 Properties of Expectation
6.4 Variance

7 Bernoulli and Binomial Random Variables
7.1 Bernoulli Random Variable
7.2 Binomial Random Variable

8 Poisson and Other Discrete Distributions
8.1 Binomial in the Limit
8.2 Poisson Random Variable
8.3 Geometric Distribution
8.4 Negative Binomial Distribution

9 Continuous Distributions
9.1 Probability Density Functions
9.2 Cumulative Distribution Function
9.3 Expectation and Variance
9.4 Uniform Random Variable
9.5 Exponential Random Variable

10 Normal Distribution
10.1 Normal Random Variable
10.2 Linear Transform
10.3 Projection to Standard Normal
10.4 Binomial Approximation

11 Joint Distributions
11.1 Joint Distributions
11.2 Discrete Case
11.3 Multinomial Distribution

12 Independent Random Variables
12.1 Independence with Multiple Discrete Random Variables
12.2 Symmetry of Independence
12.3 Sums of Independent Random Variables
12.4 Convolution: Sum of Independent Random Variables

13 Statistics of Multiple Random Variables
13.1 Expectation with Multiple Random Variables
13.2 Expectation of Binomial
13.3 Expectation of Negative Binomial
13.4 Expectations of Products Lemma
13.5 Covariance and Correlation
13.6 Properties of Covariance
13.7 Correlation

14 Conditional Expectation
14.1 Conditional Distributions
14.2 Conditional Expectation
14.3 Properties of Conditional Expectation
14.4 Law of Total Expectation

15 Inference
15.1 Bayesian Networks
15.2 General Inference via Sampling the Joint Distribution
15.3 General Inference when Conditioning on Rare Events

16 Continuous Joint Distributions
16.1 Independence with Multiple RVs (Continuous Case)
16.2 Joint CDFs
16.3 Bivariate Normal Distribution

17 Continuous Joint Distributions II
17.1 Convolution: Sum of Independent Random Variables
17.2 Independent Normals
17.3 General Independent Case
17.4 Conditional Distributions (Continuous Case)
17.5 Conditional Expectation (Continuous Case)

18 Central Limit Theorem
18.1 The Theory

19 Samples and the Bootstrap
19.1 Estimating Mean and Variance from Samples
19.2 Standard Error
19.3 Bootstrap

20 Maximum Likelihood Estimation
20.1 Parameters
20.2 Maximum Likelihood

21 The Beta Distribution
21.1 Mixing Discrete and Continuous
21.2 Estimating Probabilities
21.3 Beta Distribution

22 Maximum A Posteriori
22.1 Maximum A Posteriori Estimation

23 Naïve Bayes
23.1 Machine Learning: Classification
23.2 Naïve Bayes Algorithm
23.3 Theory

24 Linear Regression and Gradient Ascent
24.1 Regression
24.2 Gradient Ascent Optimization
24.3 Linear Regression

25 Logistic Regression
25.1 Logistic Regression Overview
25.2 Parameter Estimation using Gradient Ascent Optimization
25.3 Derivations

A Review

B Calculation Reference
B.1 Summation Identities
B.2 Growth Rates of Summations
B.3 Identities of Products
B.4 Calculus Review
B.5 Computing Permutations and Combinations

Problem Set 1, Combinatorics
Problem Set 2, Probability
Problem Set 3, Random Variables
Problem Set 4, Distributions
Problem Set 5, Sampling

Quiz 1

References


Acknowledgments

This work is taken from the lecture notes for the course Probability for Computer
Scientists at Stanford University, CS 109 (cs109.stanford.edu). The contributors
to the content of this work are Chris Piech, Mehran Sahami, and Lisa Yan; this
collection is simply a typesetting of existing lecture notes with minor
modifications/additions. We would like to thank the original authors for their
contribution. In addition, we wish to thank Mykel Kochenderfer and Tim Wheeler
for their contribution to the Tufte-Algorithms LaTeX template, based on
Algorithms for Optimization.[1]

[1] M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

Robert J. Moss
Stanford, Calif.
July 29, 2020

Ancillary material is available on the template’s webpage:


https://github.com/sisl/textbook_template
1 Counting

From notes by Mehran Sahami, Chris Piech, and Lisa Yan.

1.1 Sum Rule

Sum Rule of Counting: If the outcome of an experiment can either be one of
m outcomes or one of n outcomes, where none of the outcomes in the set of m
outcomes is the same as any of the outcomes in the set of n outcomes, then there
are m + n possible outcomes of the experiment.

Rewritten using set notation, the Sum Rule states that if the outcomes of an
experiment can either be drawn from set A or set B, where |A| = m and |B| = n,
and A ∩ B = ∅, then the number of outcomes of the experiment is |A| + |B| = m + n.

Example 1.1. Sum Rule example counting server requests.

Problem: You are running an online social networking application which
has its distributed servers housed in two different data centers, one in San
Francisco and the other in Boston. The San Francisco data center has 100
servers in it and the Boston data center has 50 servers in it. If a server request
is sent to the application, how large is the set of servers it may get routed to?

Solution: Since the request can be sent to either of the two data centers
and none of the machines in either data center are the same, the Sum Rule of
Counting applies. Using this rule, we know that the request could potentially
be routed to any of the 100 + 50 = 150 servers.

1.2 Product Rule

Product Rule of Counting: If an experiment has two parts, where the first part
can result in one of m outcomes and the second part can result in one of n outcomes
regardless of the outcome of the first part, then the total number of outcomes for
the experiment is mn.

Rewritten using set notation, the Product Rule states that if an experiment with
two parts has an outcome from set A in the first part, where |A| = m, and an
outcome from set B in the second part (regardless of the outcome of the first
part), where |B| = n, then the total number of outcomes of the experiment is
|A||B| = mn.

Note that the Product Rule of Counting is very similar to "the basic principle of
counting" given in the Ross textbook.[1]

[1] S. M. Ross et al., A First Course in Probability. 2006, vol. 7.

Example 1.2. Product Rule example with two 6-sided dice.

Problem: Two 6-sided dice, with faces numbered 1 through 6, are rolled.
How many possible outcomes of the roll are there?

Solution: Note that we are not concerned with the total value of the two
dice, but rather the set of all explicit outcomes of the rolls. Since the first die
can come up with 6 possible values and the second die similarly can have
6 possible values (regardless of what appeared on the first die), the total
number of potential outcomes is 36 = 6 · 6. ("Die" is the singular form of
the word "dice," which is the plural form.) These possible outcomes are
explicitly listed below as a series of pairs, denoting the values rolled on the
pair of dice:

(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)


Example 1.3. Product Rule example with string hashing.

Problem: Consider a hash table with 100 buckets. Two arbitrary strings are
independently hashed and added to the table. How many possible ways are
there for the strings to be stored in the table?

Solution: Each string can be hashed to one of 100 buckets. Since the results
of hashing the first string do not impact the hash of the second, there are
100 · 100 = 10,000 ways that the two strings may be stored in the hash table.

1.3 The Inclusion-Exclusion Principle

Inclusion-Exclusion Principle: If the outcome of an experiment can either be
drawn from set A or set B, and sets A and B may potentially overlap (i.e., it is not
guaranteed that A ∩ B = ∅), then the number of outcomes of the experiment is
|A ∪ B| = |A| + |B| − |A ∩ B|.

Note that the Inclusion-Exclusion Principle generalizes the Sum Rule of Counting
for arbitrary sets A and B. In the case where A ∩ B = ∅, the Inclusion-Exclusion
Principle gives the same result as the Sum Rule of Counting since |∅| = 0.

Example 1.4. Inclusion-Exclusion Principle example with bit strings.

Problem: An 8-bit string (one byte) is sent over a network. The valid set of
strings recognized by the receiver must either start with 01 or end with 10.
How many such strings are there?

Solution: The potential bit strings that match the receiver's criteria are either
the 64 strings that start with 01 (since the last 6 bits are unspecified) or the
64 strings that end with 10 (since the first 6 bits are unspecified). Of course,
these two sets overlap: a string can both start with 01 and end with 10 (the
middle 4 bits can be arbitrary), and there are 16 such strings. Casting this
description into corresponding set notation, we have |A| = 64, |B| = 64, and
|A ∩ B| = 16, so by the Inclusion-Exclusion Principle, there
are 64 + 64 − 16 = 112 strings that match the specified receiver's criteria.
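
Since there are only 256 bytes, we can also verify this count by brute force. Here is a quick Julia sketch (ours, using only built-ins):

julia> bytes = [bitstring(UInt8(x)) for x in 0:255];  # all 8-bit strings

julia> count(s -> startswith(s, "01") || endswith(s, "10"), bytes)
112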

1.4 Double Counting and Constraints

There are many ways to end up counting some elements more than once (that
is, to "double count" them). One common case is that the problem has a
constraint that you must contend with. If you over-count, you have to subtract
off the number of elements that were double counted. If instead you counted
every element the same number of times, you can divide your total count by
that multiple to get the correct final answer.

1.5 General Principle of Counting

General Principle of Counting: If an experiment has r parts such that part i
has n_i outcomes for all i = 1, . . . , r, then the total number of outcomes for the
experiment is:

∏_{i=1}^{r} n_i = n_1 × n_2 × · · · × n_r  (1.1)

In the same way that the Inclusion-Exclusion Principle generalizes the Sum Rule
of Counting, our next rule, the General Principle of Counting (also commonly known
as the Fundamental Principle of Counting), generalizes the Product Rule of Counting.

Example 1.5. General Principle of Counting example with California license plates.

Problem: California license plates prior to 1982 had only 6-place license
plates, where the first three places were uppercase letters A–Z, and the last
three places were numeric 0–9. How many such 6-place license plates were
possible pre-1982?

Solution: We can treat each of the positions of the license plate as a separate
part of an overall six-part experiment. That is, the first three parts of the
experiment each have 26 outcomes, corresponding to the letters A–Z, and
the last three parts of the experiment each have 10 outcomes, corresponding
to the digits 0–9. By the General Principle of Counting, we have 26 × 26 × 26 ×
10 × 10 × 10 = 17,576,000 possible license plates. Interestingly enough, the
current population of California is 39.5 million residents as of 2017, so this
would not nearly be enough license plates such that each person can own one
vehicle. Fortunately, in 1982, California changed to 7-place license plates by
prepending a numeric digit, resulting in 10 × 26 × 26 × 26 × 10 × 10 × 10 =
175,760,000 possible 7-place license plates. This is enough for each resident
in California to own approximately 4.5 vehicles.
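
These products are quick to sanity-check in Julia:

julia> 26^3 * 10^3        # 6-place plates, pre-1982
17576000

julia> 10 * 26^3 * 10^3   # 7-place plates, post-1982
175760000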


1.6 Floor and Ceiling

Floor and ceiling are two handy functions that we give below just for reference.
Besides, their names sound so much neater than "rounding down" and "rounding
up," and they are well-defined on negative numbers too.

Floor function: The floor function assigns to the real number x the largest integer
that is less than or equal to x. The floor function applied to x is denoted ⌊x⌋.

Ceiling function: The ceiling function assigns to the real number x the smallest
integer that is greater than or equal to x. The ceiling function applied to x is
denoted ⌈x⌉.

Examples of the floor and ceiling functions operating on the same numbers:

floor:   ⌊1/2⌋ = 0   ⌊−1/2⌋ = −1   ⌊2.9⌋ = 2   ⌊8.0⌋ = 8
ceiling: ⌈1/2⌉ = 1   ⌈−1/2⌉ = 0    ⌈2.9⌉ = 3   ⌈8.0⌉ = 8
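
Julia exposes these directly as floor and ceil; passing Int as the first argument returns an integer. A quick check of the examples above:

julia> floor(Int, 1/2), floor(Int, -1/2), floor(Int, 2.9), floor(Int, 8.0)
(0, -1, 2, 8)

julia> ceil(Int, 1/2), ceil(Int, -1/2), ceil(Int, 2.9), ceil(Int, 8.0)
(1, 0, 3, 8)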

1.7 The Pigeonhole Principle

Basic Pigeonhole Principle: For positive integers m and n, if m objects are placed
in n buckets, where m > n, then at least one bucket must contain at least two
objects.

General Pigeonhole Principle: In a more general form, this principle can be
stated as follows. For positive integers m and n, if m objects are placed in n
buckets, then at least one bucket must contain at least ⌈m/n⌉ objects.

Note that the generalized form does not require the constraint that m > n,
since in the case where m ≤ n, we have ⌈m/n⌉ = 1, and it trivially holds that at
least one bucket will contain at least one object.


Example 1.6. General Pigeonhole Principle example with hash tables.

Problem: Consider a hash table with 100 buckets. 950 strings are hashed
and added to the table.

a. Is it possible that a bucket in the table has no entries?

b. Is it guaranteed that at least one bucket in the table has at least two entries?

c. Is it guaranteed that at least one bucket in the table has at least 10 entries?

d. Is it guaranteed that at least one bucket in the table has at least 11 entries?

Solution:

a. Yes. As one example, it is possible (albeit very improbable) that all 950
strings get hashed to the same bucket (say bucket 0). In this case bucket 1
would have no entries.

b. Yes. Since 950 objects are placed in 100 buckets and 950 > 100, by the
Basic Pigeonhole Principle, it follows that at least one bucket must contain
at least two entries.

c. Yes. Since 950 objects are placed in 100 buckets and ⌈950/100⌉ = ⌈9.5⌉ =
10, by the General Pigeonhole Principle, it follows that at least one bucket
must contain at least 10 entries.

d. No. As one example, consider the case where the first 50 buckets each
contain 10 entries and the second 50 buckets each contain 9 entries. This
accounts for all 950 entries (50 · 10 + 50 · 9 = 950), but there is no bucket
that contains 11 entries in the hash table.
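
Part (c)'s bound is ceiling division, which Julia provides as cld; part (d)'s counterexample is easy to verify too:

julia> cld(950, 100)   # ⌈950/100⌉: the fullest bucket holds at least this many
10

julia> 50*10 + 50*9    # part (d)'s counterexample accounts for every string
950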

1.8 Bibliography

For additional information on counting, you can consult a good discrete
mathematics or probability textbook. Some of the discussion above is based on
the treatment in Discrete Mathematics and Its Applications.[2]

[2] K. Rosen, Discrete Mathematics and Its Applications. 2007, vol. 6.



2 Combinatorics

From notes by Mehran Sahami, Chris Piech, and Lisa Yan.

2.1 Permutations

Permutation Rule: A permutation is an ordered arrangement of n distinct
objects. Those n objects can be permuted in n · (n − 1) · (n − 2) · · · 2 · 1 = n!
ways.

This changes slightly if you are permuting a subset of distinct objects, or if
some of your objects are indistinct. We will handle those cases.

Example 2.1. Permutation example for iPhone passcode attempts with n = 4 fingerprint smudges.

Problem, Part A: iPhones used to have 4-digit passcodes. Suppose there
are 4 smudges over 4 digits on the screen. How many distinct passcodes are
possible?

Solution: Since the order of digits in the code is important, we should use
permutations. And since there are exactly four smudges we know that each
number is distinct. Thus, we can plug into the permutation formula: 4! = 24.

Example 2.2. Permutation example for iPhone passcode attempts with n = 3 fingerprint smudges.

Problem, Part B: What if there are 3 smudges over 3 digits on the screen?

Solution: One of 3 digits is repeated, but we don't know which one. We can
solve this by making three cases, one for each digit that could be repeated
(each with the same number of permutations). Let A, B, and C represent
the 3 digits, with C repeated twice. We can initially pretend the two C's are
distinct. Then each case will have 4! permutations:

A B C_1 C_2

However, then we need to eliminate the double-counting of the permutations
of the identical digits (one A, one B, and two C's):

4! / (2! · 1! · 1!)

Adding up the three cases for the different repeated digits gives:

3 · 4! / (2! · 1! · 1!) = 3 · 12 = 36

Example 2.3. Permutation example for iPhone passcode attempts with n = 2 fingerprint smudges.

Problem, Part C: What if there are 2 smudges over 2 digits on the screen?

Solution: There are two possibilities: 2 digits used twice each, or 1 digit
used 3 times and the other digit used once.

4! / (2! · 2!) + 2 · 4! / (3! · 1!) = 6 + (2 · 4) = 6 + 8 = 14
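
We can confirm Parts B and C by brute force in Julia, enumerating every 4-digit code over the smudged digits and keeping those that use each smudged digit at least once (a sketch of ours; the codes helper is hypothetical):

julia> codes(r) = Iterators.product(fill(1:r, 4)...);  # all 4-digit codes over r digits

julia> count(c -> length(unique(c)) == 3, codes(3))    # Part B: all 3 digits appear
36

julia> count(c -> length(unique(c)) == 2, codes(2))    # Part C: both digits appear
14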

[Figure 2.7. Permutations of a binary search tree for n = 3 integers: the six insertion orders 1,2,3; 1,3,2; 2,1,3; 2,3,1; 3,1,2; 3,2,1 and their resulting trees, with the degenerate trees marked.]

Recall the definition of a binary search tree (BST), which is a binary tree that
satisfies the following three properties for every node n in the tree:

1. n’s value is greater than all the values in its left subtree.

2. n’s value is less than all the values in its right subtree.

3. both n’s left and right subtrees are binary search trees.

Example 2.4. Permutation example with degenerate binary search trees.

Problem: How many possible binary search trees are there which contain
the three values 1, 2, and 3, and have a degenerate structure (i.e., each node
in the BST has at most one child)?

Solution: We start by considering the fact that the three values in the BST
(1, 2, and 3) may have been inserted in any of 3! = 6 orderings (permuta-
tions). For each of the 3! ways the values could have been ordered when
being inserted into the BST, we can determine what the resulting structure
would be and determine which of them are degenerate. We consider each
possible ordering of the three values; the resulting BST structure is shown
in figure 2.7. We see that there are 4 degenerate BSTs here (the first two and
last two orderings).
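
The same count can be reproduced programmatically. Below is a minimal Julia sketch of ours (assuming the Combinatorics.jl package; the tuple-based node representation and the insert, build, and degenerate helpers are one possible design): build the BST for each insertion order, then check degeneracy.

using Combinatorics: permutations

# Build a BST, represented as nested (value, left, right) tuples, by insertion.
insert(node, v) = node === nothing ? (v, nothing, nothing) :
    v < node[1] ? (node[1], insert(node[2], v), node[3]) :
                  (node[1], node[2], insert(node[3], v))

build(order) = foldl(insert, order; init=nothing)

# A tree is degenerate if every node has at most one child.
degenerate(node) = node === nothing ||
    ((node[2] === nothing || node[3] === nothing) &&
     degenerate(node[2]) && degenerate(node[3]))

count(degenerate ∘ build, permutations([1, 2, 3]))  # returns 4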


2.2 Permutations of Indistinct Objects

Permutations of Indistinct Objects: Generally, when there are n objects of which

n_1 are the same (indistinguishable),
n_2 are the same,
. . .
and n_r are the same,

then there are

n! / (n_1! n_2! · · · n_r!)

distinct permutations of the objects.[1]

[1] When the n objects are all distinct, there are r = n groups, each with n_i = 1, so this formula becomes the n! from section 2.1.

Example 2.5. Permutation problem with bit strings.

Problem: How many distinct bit strings can be formed from three 0's and
two 1's?

Solution: 5 total digits would give 5! permutations. But that is assuming
the 0's and 1's are distinguishable (to make that explicit, let's give each one
a subscript). Here is a subset of the permutations:

0_1 1_1 1_2 0_2 0_3
0_1 1_1 1_2 0_3 0_2
0_2 1_1 1_2 0_1 0_3
0_2 1_1 1_2 0_3 0_1
0_3 1_1 1_2 0_1 0_2
0_3 1_1 1_2 0_2 0_1

If identical digits are indistinguishable, then all the listed permutations are
the same. For any given permutation, there are 3! ways of rearranging the 0's
and 2! ways of rearranging the 1's (resulting in indistinguishable strings). We
have over-counted. Using the formula for permutations of indistinct objects,
we can correct for the over-counting:

total = 5! / (3! · 2!) = 120 / (6 · 2) = 120/12 = 10
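
The same answer falls out of enumerating permutations explicitly in Julia (a check of ours, assuming the Combinatorics.jl package):

julia> using Combinatorics

julia> length(unique(permutations([0, 0, 0, 1, 1])))  # distinct orderings
10

julia> factorial(5) ÷ (factorial(3) * factorial(2))   # 5!/(3!·2!)
10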


2.3 Combinations

Permutations account for ordering, whereas combinations do not.

Binomial Coefficient: A combination is an unordered selection of r objects from
a set of n objects. If all objects are distinct, then the number of ways of making
the selection is:

(n choose r) = n! / (r!(n − r)!)  (2.1)

The term (n choose r) is defined as a binomial coefficient and is often read as
"n choose r."

Consider this general way to produce combinations. To select r unordered
objects from a set of n objects, e.g., "7 choose 3":

1. First consider permutations of all n objects. There are n! ways to do that.

2. Then select the first r in the permutation. There is one way to do that.

3. Note that the order of the r selected objects is irrelevant. There are r! ways to
permute them. The selection remains unchanged.

4. Note that the order of the (n − r) unselected objects is irrelevant. There are
(n − r)! ways to permute them. The selection remains unchanged.

total = n! / (r!(n − r)!) = (n choose r) = (n choose n − r),  e.g., 7!/(3! 4!) = 35

which is the combinations formula.

A useful recursive identity for combinations is as follows:

(n choose r) = (n − 1 choose r − 1) + (n − 1 choose r),  0 ≤ r ≤ n  (2.2)

This identity can be proved via a combinatorial argument. When we select a
group of size r from n distinct objects, then any particular object (say, object 1)
will either be part of that group or not part of that group. We can then define
sets A and B, where A is the number of ways of selecting a group that contains
object 1, and B is the number of ways of selecting a group that does not contain
object 1. For set A, if we decide to include object 1, then we must select r − 1 of
the remaining n − 1 objects (since the membership of object 1 in our selection is
already decided), or 1 × (n − 1 choose r − 1). For set B, if we decide to exclude
object 1, then we only have n − 1 possible objects to select from to create a group
of size r, or (n − 1 choose r). These sets are mutually exclusive, and therefore by
the Sum Rule of Counting the total number of possibilities is their sum, which is
exactly identity (2.2).
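
Julia's built-in binomial makes it easy to spot-check identity (2.2) over a range of values:

julia> all(binomial(n, r) == binomial(n - 1, r - 1) + binomial(n - 1, r)
           for n in 1:20 for r in 1:n)
true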


Example 2.6. Combination problem involving the Hunger Games.

Problem: In the Hunger Games, how many ways are there of choosing 2
villagers from district 12, which has a population of 8,000?

Solution: This is a straightforward combinations problem.

(8000 choose 2) = 31,996,000

Example 2.7. Combination problem choosing r = 3 books from n = 6.

Problem, Part A: How many ways are there to select 3 books from a set of 6?

Solution: If each of the books is distinct, then this is another straightforward
combination problem.

(6 choose 3) = 6!/(3! 3!) = 20

Example 2.8. Combination problem choosing r = 3 books from n = 6 with restrictions.

Problem, Part B: How many ways are there to select 3 books if there are
two books that should not both be chosen together? For example, if you are
choosing 3 out of 6 probability books, don't choose both the 8th and 9th
editions of the Ross textbook.

Solution: This problem is easier to solve if we split it up into cases. Consider
the following three different cases:

Case 1: Select the 8th Ed. and 2 other non-9th Ed. books: there are (4 choose 2) ways.

Case 2: Select the 9th Ed. and 2 other non-8th Ed. books: there are (4 choose 2) ways.

Case 3: Select 3 from the books that are neither the 8th nor the 9th edition:
there are (4 choose 3) ways.

Using our old friend the Sum Rule of Counting, we can add the cases:

total = 2 · (4 choose 2) + (4 choose 3) = 16


Example 2.9. Forbidden City method of solving the combination problem choosing r = 3 books from n = 6 with restrictions.

Alternatively, we could have calculated all the ways of selecting 3 books from
6, and then subtracted the "forbidden" ones (i.e., the selections that break the
constraint). Chris Piech calls this the Forbidden City method.

Forbidden Case: Select the 8th edition and the 9th edition and 1 other book.
There are (4 choose 1) ways of doing so (which equals 4).

total = all possibilities − forbidden = (6 choose 3) − (4 choose 1) = 20 − 4 = 16

Two different ways to get the same right answer!

2.4 Selecting Multiple Groups of Objects

Multinomial Coefficient: If n objects are distinct, then the number of ways to
select r groups of objects, such that group i has size n_i and ∑_{i=1}^{r} n_i = n, is:

(n choose n_1, n_2, . . . , n_r) = n! / (n_1! n_2! · · · n_r!)  (2.3)

where (n choose n_1, n_2, . . . , n_r) is defined as a multinomial coefficient.

This situation is a generalization of the combination, where (n choose r) is defined
as a binomial coefficient. One way to see this is that the task of selecting r unordered
objects from a set of n distinct objects is analogous to the task of separating n
objects into two groups 1 and 2, with respective element counts n_1 = r and
n_2 = n − r. Therefore it is true that the binomial coefficient (n choose r) =
(n choose r, n − r), where the latter is a multinomial coefficient.

2.5 Bucketing/Group Assignment

You have probably heard about the dreaded ‘‘balls and urns’’ probability exam-
ples. What are those all about? They are for counting the many different ways
that we can think of stuffing elements into containers. (It turns out that Jacob
Bernoulli was into voting and ancient Rome. And in ancient Rome they used urns
for ballot boxes.) This ‘‘bucketing’’ or ‘‘group assignment’’ process is a useful
metaphor for many counting problems. Note that this bucketing problem is different
from the previous combinations problem. In combinations, we have n distinct
(distinguishable) objects to put in r distinct groups, and we are fixing the number
of distinct objects in group i to be n_i (where ∑_{i=1}^{r} n_i = n) for every outcome that
we count. By contrast, in the bucketing problem we still have n objects to put in r
distinct groups, but (1) our objects can be distinct or indistinct, and (2) for each
outcome we can vary the number of objects in each distinct group i.

Example 2.10. Multinomial coefficient used to solve a server to datacenter allocation problem.

Problem: Company Camazon has 13 new servers that they would like to
assign to 3 datacenters, where Datacenters A, B, and C have 6, 4, and 3 empty
server racks, respectively. How many different divisions of the servers are
possible?

Solution: This is a straightforward application of our multinomial coefficient
representation. Setting n_1 = 6, n_2 = 4, n_3 = 3, we get (13 choose 6, 4, 3) = 60,060.
Another way to do this problem would be from first principles of combinations as
a multi-part experiment. We first select the 6 servers to be assigned to
Datacenter A, in (13 choose 6) ways. Now out of the 7 servers remaining, we select
the 4 servers to be assigned to Datacenter B, in (7 choose 4) ways. Finally, we select
the 3 servers out of the remaining 3 servers, in (3 choose 3) ways. By the Product
Rule of Counting, the total number of ways to assign all servers would be:

(13 choose 6)(7 choose 4)(3 choose 3) = 13! / (6! 4! 3!) = 60,060
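
We can verify both computations with Julia's built-ins:

julia> binomial(13, 6) * binomial(7, 4) * binomial(3, 3)
60060

julia> factorial(13) ÷ (factorial(6) * factorial(4) * factorial(3))
60060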

Example 2.11. Bucketing used to solve a hash table problem.

Problem: Say you want to put n distinguishable balls into r urns. (No! Wait!
Don't say that!) Okay, fine. No urns. Say we are going to put n strings into
r buckets of a hash table where all outcomes are equally likely. How many
possible ways are there of doing this?

Solution: You can think of this as n independent experiments, each with r
outcomes. Using the General Principle of Counting, this comes out to r^n.


2.5.1 Divider Method

While the previous example allowed us to put n distinguishable objects into r
distinct groups, the more interesting problem is to work with n indistinguishable
objects. This task has a direct analogy to the number of ways to solve the following
integer equation:

x_1 + x_2 + · · · + x_r = n, where x_i ≥ 0 for all i = 1, . . . , r  (2.4)

Divider Method: Suppose you want to place n indistinguishable items into r
containers. The divider method works by imagining that you are going to solve
this problem by sorting two types of objects: your n original elements and r − 1
dividers. Thus, you are permuting n + r − 1 objects, n of which are the same (your
elements) and r − 1 of which are the same (the dividers). Thus the total number
of outcomes is:

(n + r − 1)! / (n!(r − 1)!) = (n + r − 1 choose n) = (n + r − 1 choose r − 1)  (2.5)

Example 2.12. Divider Method used to solve an investment problem.

Problem, Part A: Say you are a startup incubator and you have $10 million
to invest in 4 companies (in $1 million increments). How many ways can you
allocate this money?

Solution: This is just like putting 10 balls into 4 urns. Using the Divider
Method, we get:

total ways = (10 + 4 − 1 choose 10) = (13 choose 10) = 286

This problem is analogous to solving the integer equation x_1 + x_2 + x_3 + x_4 =
10, where x_i represents the investment in company i such that x_i ≥ 0 for all
i = 1, 2, 3, 4.


Example 2.13. Divider Method used to solve an investment problem with requirements.

Problem, Part B: What if you know you want to invest at least $3 million
in Company 1?

Solution: There is one way to give $3 million to Company 1. The number
of ways to invest the remaining money is the same as putting 7 balls into 4
urns.

total ways = (7 + 4 − 1 choose 7) = (10 choose 7) = 120

This problem is analogous to solving the integer equation x_1 + x_2 + x_3 + x_4 =
10, where x_1 ≥ 3 and x_2, x_3, x_4 ≥ 0. To translate this problem into the integer
solution equation that we can solve via the divider method, we adjust the
bounds on x_1 so that the problem becomes x_1 + x_2 + x_3 + x_4 = 7,
where x_i is defined as in Part A.

Example 2.14. Divider Method used to solve an investment problem with no minimum requirements.

Problem, Part C: What if you don't have to invest all $10 million? (The
economy is tight, say, and you might want to save your money.)

Solution: Imagine that you have an extra company: yourself. Now you are
investing $10 million in 5 companies. Thus, the answer is the same as putting
10 balls into 5 urns.

total ways = (10 + 5 − 1 choose 10) = (14 choose 10) = 1001

This problem is analogous to solving the integer equation x_1 + x_2 + x_3 + x_4 +
x_5 = 10, such that x_i ≥ 0 for all i = 1, 2, 3, 4, 5.

We can use Julia to verify our answers in examples 2.12 to 2.14.

Example 2.15. Binomial coefficient in Julia to check the answers from examples 2.12 to 2.14. As a shorthand, we define the alias ⋮ = binomial, where ⋮ is created by typing \vdots and hitting tab; the normal syntax is binomial(n, k). (julialang.org)

julia> (13 ⋮ 10)
286

julia> (10 ⋮ 7)
120

julia> (14 ⋮ 10)
1001


3 Probability

From notes by Chris Piech and Lisa Yan.

3.1 Event Spaces and Sample Spaces

A sample space S is the set of all possible outcomes of an experiment. For example:
• Coin flip: S = {Heads, Tails}
• Flipping two coins: S = {( H, H ), ( H, T ), ( T, H ), ( T, T )}
• Roll of 6-sided die: S = {1, 2, 3, 4, 5, 6}
• Number of emails in a day: S = { x | x ∈ Z, x ≥ 0} (non-negative integers)
• Number of Netflix hours in a day: S = { x | x ∈ R, 0 ≤ x ≤ 24}

An event space E is some subset of S that we ascribe meaning to. In set notation, E ⊆ S.
• Coin flip is heads: E = {Heads}
• ≥ 1 head on 2 coin flips: E = {( H, H ), ( H, T ), ( T, H )}
• Roll of die is 3 or less: E = {1, 2, 3}
• Number of emails in a day ≤ 20: E = { x | x ∈ Z, 0 ≤ x ≤ 20}
• "Wasted day" (≥ 5 Netflix hours): E = { x | x ∈ R, 5 ≤ x ≤ 24}

3.2 Probability

In the 20th century, people figured out one way to define what a probability is:

P(E) = lim_{n→∞} n(E)/n,  (3.1)

where n is the number of trials performed and n( E) is the number of trials with
an outcome in E. In English this reads: say you perform n trials of an experiment.
The probability of a desired event E is defined as the ratio of the number of trials
that result in an outcome in E to the number of trials performed (in the limit as
your number of trials approaches infinity). You can also give other meanings
to the concept of a probability, however. One common meaning ascribed is that
P( E) is a measure of the chance of E occurring. I often think of a probability
in another way: I don’t know everything about the world. As a result I have to
come up with a way of expressing my belief that E will happen given my limited
knowledge. This interpretation (often referred to as the Bayesian interpretation)
acknowledges that there are two sources of probabilities: natural randomness
and our own uncertainty. Later in the quarter, we will contrast the frequentist
definition we gave you above with this other Bayesian definition of probability.

3.3 Axioms of Probability

Here are some basic truths about probabilities:

Axiom 1: 0 ≤ P( E) ≤ 1

Axiom 2: P(S) = 1

Axiom 3: If E and F are mutually exclusive (E ∩ F = ∅), then P(E) + P(F) = P(E ∪ F)

You can convince yourself of the first axiom by thinking about the definition of
probability above: when performing some number of trials of an actual experi-
ment, it is not possible to get more occurrences of the event than there are trials (so
probabilities are at most 1), and it is not possible to get less than 0 occurrences of
the event (so probabilities are at least 0). The second axiom makes intuitive sense
as well: if your event space is the same as the sample space, then each trial must
produce an outcome from the event space. Of course, this is just a restatement of
the definition of the sample space; it is sort of like saying that the probability of
you eating cake (event space) if you eat cake (sample space) is 1.


3.4 Provable Identities of Probability

We often refer to these as corollaries that are directly provable from the three
axioms given above.

Identity 1: P(E^c) = 1 − P(E) (= P(S) − P(E))

Identity 2: If E ⊆ F, then P(E) ≤ P(F)

Identity 3: P(E ∪ F) = P(E) + P(F) − P(EF) (where EF = E ∩ F)

General Inclusion-Exclusion Identity:

P(∪_{i=1}^{n} E_i) = ∑_{r=1}^{n} (−1)^{r+1} ∑_{i_1 < · · · < i_r} P(E_{i_1} E_{i_2} . . . E_{i_r})  (3.2)

This last rule is somewhat complicated, but the notation makes it look far
worse than it is. What we are trying to find is the probability that any of a number
of events happens. The outer sum loops over the possible sizes of event subsets
(that is, first we look at all single events, then pairs of events, then subsets of
events of size 3, etc.). The ‘‘−1’’ term tells you whether you add or subtract terms
with that set size. The inner sum sums over all subsets of that size. The less-than
signs ensure that you don’t count a subset of events twice, by requiring that the
indices i1 , . . . , ir are in ascending order.
Here’s how that looks for three events ( E1 , E2 , E3 ):

P( E1 ∪ E2 ∪ E3 ) = P( E1 ) + P( E2 ) + P( E3 )
− P( E1 E2 ) − P( E1 E3 ) − P( E2 E3 )
+ P( E1 E2 E3 )


Problem: On a university campus, 28% of all students program in Java, 7%
program in Python, and 5% program in both Java and Python. You meet a
random student on campus. What is the probability that they do not program
in Java or Python?

Solution: Let E be the event that a randomly selected student programs in
Java and F be the event that a randomly selected student programs in Python.
We would like to compute P((E ∪ F)^c):

P((E ∪ F)^c) = 1 − P(E ∪ F)                    (Identity 1)
             = 1 − [P(E) + P(F) − P(EF)]       (Identity 3)
             = 1 − (0.28 + 0.07 − 0.05) = 0.7

We can confirm this by drawing a Venn diagram as seen in figure 3.1.

[Figure 3.1. Venn diagram of the probability space for Java and Python programmers on a university campus: 0.23 program in Java only, 0.05 in both, and 0.02 in Python only.]

3.5 Equally Likely Outcomes

Some sample spaces have outcomes that are all equally likely. We like those
sample spaces; they make it simple to compute probabilities. Examples of sample
spaces with equally likely outcomes:

• Coin flip: S = {Heads, Tails}

• Flipping two coins: S = {(H, H), (H, T), (T, H), (T, T)}

• Roll of 6-sided die: S = {1, 2, 3, 4, 5, 6}

Probability with equally likely outcomes: For a sample space S in which all
outcomes are equally likely,

P(each outcome) = 1/|S|

and for any event E ⊆ S,

P(E) = (number of outcomes in E)/(number of outcomes in S) = |E|/|S|.  (3.3)


Example 3.1. Equally likely outcomes for two six-sided dice rolls that sum to 7.

Problem: You roll two six-sided dice. What is the probability that the sum
of the two rolls is 7?

Solution: Define the sample space as a space of pairs, where the two ele-
ments are the outcomes of the first and second dice rolls, respectively. The
event is the subset of this sample space where the sum of the paired elements
is 7.

S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)
     (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6)
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6)
     (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6)
     (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6)
     (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}

E = {(6, 1), (5, 2), (4, 3), (3, 4), (2, 5), (1, 6)}

Since all outcomes are equally likely, the probability of this event is:

P(E) = |E|/|S| = 6/36 = 1/6
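
A one-line brute force in Julia confirms the ratio:

julia> count(r -> sum(r) == 7, Iterators.product(1:6, 1:6)) / 36
0.16666666666666666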

The reason we can choose either an ordered or unordered approach is because
probability is a ratio. As we saw last time, any unordered counting task can be
generated by first creating an ordered list, splitting the list at marked intervals,
then dividing out by the overcounted cases due to ordering. If our sample space
is ordered, then our event (being a subset of the sample space) is also ordered,
and therefore we should account for the overcounted cases. However, probability
being a ratio means that these overcounted cases get cancelled out.

The key to solving many of this section's problems involves (1) deciding
whether to count distinct objects to create an equally likely outcome space, and
then (2) defining the sample space and event space to consistently be ordered or
unordered.


Example 3.2. Equally likely outcomes picking fruit out of a bag.

Problem: There are 4 oranges and 3 apples in a bag. You draw out 3. What
is the probability that you draw 1 orange and 2 apples?

Solution 1: If we treat the oranges and apples as indistinct, we do not have
a space with equally likely outcomes. We therefore treat all objects as distinct.
Suppose we treat each outcome in the sample space as an ordered list of
three distinct items. The size of the sample space S is simply the total number
of ways to order 3 of 7 distinct items: |S| = 7 · 6 · 5 = 210. We can then
decompose the event E into three mutually exclusive events, where we pick
the orange first, second, or third, respectively: |E| = 4 · 3 · 2 + 3 · 4 · 2 + 3 · 2 ·
4 = 72. The probability of our event is therefore P(E) = 72/210 = 12/35.

Solution 2: Another approach is to treat each outcome in the sample space
as an unordered group. The size of the sample space S is the total number of
ways to choose any 3 of 7 distinct items: |S| = (7 choose 3). The event space is
then the number of ways to pick 1 distinct orange (out of 4) and 2 distinct apples
(out of 3), which we combine with the product rule: |E| = (4 choose 1)(3 choose 2).
The probability of our event is therefore P(E) = (4 choose 1)(3 choose 2)/(7 choose 3) = 12/35.
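
Both approaches agree, which we can confirm with exact rational arithmetic in Julia:

julia> (4*3*2 + 3*4*2 + 3*2*4) // (7*6*5)                  # ordered approach
12//35

julia> binomial(4, 1) * binomial(3, 2) // binomial(7, 3)   # unordered approach
12//35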


Example 3.3. Equally likely outcomes flipping the next card in a deck of 52.

Problem: In a 52-card deck, cards are flipped one at a time. After the first
ace (of any suit) appears, consider the next card. Is the next card more likely
to be the Ace of Spades than the 2 of Clubs? (This problem is based on
Example 5j in Chapter 2.5 of Ross's textbook, 10th Edition.)

Solution: No; the probabilities are equal. The difficulty of this problem
stems from defining an experiment that gives equally likely outcomes while
preserving the specifications of the original problem. An incorrect approach
is to define the experiment as just drawing a pair of cards (first ace, next
card), because then we discard all information about the cards flipped prior
to the pair. Instead, consider the experiment to be shuffling the full 52-card
deck, where |S| = 52!. We can then reconstruct all outcomes of the pairs of
cards that we care about (if we so wish, but we just care about getting an
equally likely outcome sample space).

Define E_AS as the event where the next card is the Ace of Spades. To
construct a 52-card order where this event holds, we first take out the Ace
of Spades, then shuffle the remaining 51 cards (51! ways), then insert the
Ace of Spades immediately after the first ace (1 way). By the product rule,
|E_AS| = 51! · 1. Then define E_2C as the event where the next card is the 2
of Clubs. To construct a 52-card order where this event holds, we perform
exactly the same steps, but with the 2 of Clubs instead. Then |E_2C| = 51! · 1.
Therefore, P(E_AS) = 51!/52! = P(E_2C).

For many readers, it may seem apparent that the first ace drawn could
very well be the Ace of Spades, and so it is less likely that the next card is the
Ace of Spades. Yet by a similar train of thought, the 2 of Clubs could very
well have been drawn prior to the first ace drawn, and so we must consider
all of those cases as well. This example serves to highlight the difficulty of
probability: Mathematics often trumps intuition (no pun intended).



4 Conditional Probability

From notes by Chris Piech and Lisa Yan.

4.1 Conditional Probability

In English, a conditional probability answers the question: "What is the chance of
an event E happening, given that I have already observed some other event F?"
Conditional probability quantifies the notion of updating one's beliefs in the face
of new evidence.

When you condition on an event happening you are entering the universe
where that event has taken place. Mathematically, if you condition on F, then F
becomes your new sample space. In the universe where F has taken place, all
rules of probability still hold!

Definition of Conditional Probability: The probability of E given that (i.e.,
conditioned on) event F already happened:[1]

P(E | F) = P(EF)/P(F) = P(E, F)/P(F) = P(E ∩ F)/P(F)  (4.1)

[1] As a reminder, EF means the same thing as E ∩ F, which is read E "and" F.

A visualization might help you understand this definition. Consider events E
and F which have outcomes that are subsets of a sample space with 63 equally
likely outcomes, each one drawn as a hexagon shown in figure 4.1.

[Figure 4.1. Probability of event E conditioning on event F: two overlapping regions E and F, where E ∩ F contains 3 of F's 14 outcomes.]

Conditioning on F means that we have entered the world where F has happened
(and F, which has 14 equally likely outcomes, has become our new sample space).
Given that event F has occurred, the conditional probability that event E occurs
is the subset of the outcomes of E that are consistent with F. In this case we
can visually see that those are the three outcomes in E ∩ F. Thus we have the
probability:

P(E | F) = P(EF)/P(F) = (3/63)/(14/63) = 3/14 ≈ 0.21

Even though the visual example (with equally likely outcome spaces) is useful for
gaining intuition, the above definition of conditional probability applies regardless
of whether the sample space has equally likely outcomes.

The Chain Rule: The definition of conditional probability can be rewritten as:

P(EF) = P(E | F) P(F)  (4.2)

which we call the Chain Rule. Intuitively, it states that the probability of observing
events E and F is the probability of observing F, multiplied by the probability
of observing E, given that you have observed F. Here is the general form of the
Chain Rule:[2]

P(E_1 E_2 . . . E_n) = P(E_1) P(E_2 | E_1) . . . P(E_n | E_1 E_2 . . . E_{n−1})  (4.3)

[2] A simple example of the chain rule: Let E, F, and G be events with nonzero probabilities. An equivalent expression for P(EFG) would be: P(EFG) = P(E | FG) P(FG) = P(E | FG) P(F | G) P(G).

4.2 Law of Total Probability
An astute person once observed that in a picture like the one in figure 4.1, event F
can be thought of as having two parts, the part that is in E (that is, E ∩ F = EF),
and the part that isn't (E^c ∩ F = E^c F). This is true because E and E^c are mutually
exclusive sets of outcomes which together cover the entire sample space. After
further investigation this was proved to be a general mathematical truth, and
there was much rejoicing:

P(F) = P(EF) + P(E^c F)  (4.4)

This observation is called the law of total probability; however, it is most commonly
seen in combination with the chain rule.

The Law of Total Probability: For events E and F,

P(F) = P(F | E) P(E) + P(F | E^c) P(E^c).  (4.5)

There is a more general version of the rule. If you can divide your sample space
into any number of events E_1, E_2, . . . , E_n that are mutually exclusive and
exhaustive (that is, every outcome in the sample space falls into exactly one of
those events), then:

P(F) = ∑_{i=1}^{n} P(F | E_i) P(E_i)  (4.6)


The word "total" refers to the fact that the events E_i must combine to form the
totality of the sample space.

4.3 Bayes' Theorem

Bayes' theorem (or Bayes' rule) is one of the most ubiquitous results in probability
for computer scientists. Very often we know a conditional probability in one
direction, say P(E | F), but we would like to know the conditional probability in
the other direction. Bayes' theorem provides a way to convert from one to the
other. We can derive Bayes' theorem by starting with the definition of conditional
probability:

P(E | F) = P(FE)/P(F)  (4.7)

We can expand P(FE) using the chain rule, which results in Bayes' theorem.

Bayes' theorem: The most common form of Bayes' theorem is:

P(E | F) = P(F | E) P(E) / P(F)  (4.8)

Each term in the Bayes' rule formula has its own name. The P(E | F) term is
often called the posterior; the P(E) term is often called the prior; the P(F | E) term
is called the likelihood (or the update); and P(F) is often called the normalization
constant.

posterior = likelihood × prior / normalization constant  (4.9)

If the normalization constant (the probability of the event you were initially
conditioning on) is not known, you can expand it using the law of total probability:

P(E | F) = P(F | E) P(E) / [P(F | E) P(E) + P(F | E^c) P(E^c)]
         = P(F | E) P(E) / ∑_i P(F | E_i) P(E_i)  (4.10)

Again, for this to hold, all the events E_i must be mutually exclusive and exhaustive.

A common scenario for applying the Bayes' rule formula is when you want to
know the probability of something "unobservable" given an "observed" event.
For example, you want to know the probability that a student understands a
concept, given that you observed them solving a particular problem. It turns out it
is much easier to first estimate the probability that a student can solve a problem
given that they understand the concept and then to apply Bayes' theorem.


The "expanded" version of Bayes' rule in equation (4.10) allows you to work
around not immediately knowing the denominator P(F). It is worth exploring this
in more depth, because this "trick" comes up often, and in slightly different forms.
Another way to get to the exact same result is to reason that because the posterior
of Bayes' theorem, P(E | F), is a probability, we know that P(E | F) + P(E^c | F) = 1.
If you expand out P(E^c | F) using Bayes' rule, you get:

P(E^c | F) = P(F | E^c) P(E^c) / P(F)  (4.11)

Now we have:

1 = P(E | F) + P(E^c | F)                              (since P(E | F) is a probability)
1 = P(F | E) P(E) / P(F) + P(F | E^c) P(E^c) / P(F)    (by Bayes' rule, twice)
1 = [P(F | E) P(E) + P(F | E^c) P(E^c)] / P(F)
P(F) = P(F | E) P(E) + P(F | E^c) P(E^c)

We call P(F) the normalization constant because it is the term whose value can
be calculated by making sure that the probabilities of all outcomes sum to 1 (they
are "normalized").

4.4 Conditional Paradigm

As we mentioned above, when you condition on an event you enter the universe
where that event has taken place, all the laws of probability still hold. Thus, as
long as you condition consistently on the same event, every one of the tools we
have learned still apply. Let’s look at a few of our old friends when we condition
consistently on an event (in this case G, often read as ‘‘given’’ G):


Example 4.1. An application of Bayes' rule for disease testing. Originally from the concept check for lecture 4.

Applying Bayes' Rule: Consider the following (hypothetical) scenario regarding
an illness. Consider that 8% of all people have the illness, and further
that there has been a test developed for the illness with a 95% true positive
rate (it correctly says someone has the illness when they do) and a 7%
false positive rate (it incorrectly says someone has the illness when they don't).
Given that I test positive for the illness, what is the probability that I actually
have the disease?

Solution: We apply Bayes' rule with an expanded denominator (using the
law of total probability to find the probability of testing positive whether you
have the illness or not, shown in equation (4.10)) and using "+" to denote
the event that I test positive.

P(ill | +) = P(+ | ill) P(ill) / P(+)
           = P(+ | ill) P(ill) / [P(+ | ill) P(ill) + P(+ | not ill) P(not ill)]

Since the true positive rate is 95%, P(+ | ill) = 0.95, and since the false positive
rate is 7%, P(+ | not ill) = 0.07:

P(ill | +) = (0.95)(0.08) / [(0.95)(0.08) + (0.07)(0.92)] ≈ 0.541

Notice that now you have a much higher chance of being ill than you did
before you got tested, but still only about a 1/2 chance!
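
The arithmetic is a one-liner in Julia (our own check of the numbers above):

julia> prior = 0.08; tp = 0.95; fp = 0.07;

julia> round(tp * prior / (tp * prior + fp * (1 - prior)), digits=3)
0.541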


Table 4.1. Conditional probability rules, conditioning on G.

Name of Rule                 Original Rule                       Conditional Rule
First axiom of probability   0 ≤ P(E) ≤ 1                        0 ≤ P(E | G) ≤ 1
Corollary 1 (complement)     P(E) = 1 − P(E^c)                   P(E | G) = 1 − P(E^c | G)
Chain Rule                   P(EF) = P(E | F) P(F)               P(EF | G) = P(E | FG) P(F | G)
Bayes' Theorem               P(E | F) = P(F | E) P(E) / P(F)     P(E | FG) = P(F | EG) P(E | G) / P(F | G)



5 Independence

From notes by Chris Piech and Lisa Yan.

5.1 Independence

Independence is a big deal in machine learning and probabilistic modeling.
Knowing the "joint" probability of many events (the probability of the "and" of
the events) requires exponential amounts of data. By making independence and
conditional independence claims, computers can essentially decompose how to
calculate the joint probability, making it faster to compute, and requiring less
data to learn probabilities.

Independence: Two events, E and F, are independent if and only if:

P( EF ) = P( E) P( F ) (5.1)

Otherwise, they are called dependent events.


This property applies regardless of whether or not E and F are from an equally
likely sample space and whether or not the events are mutually exclusive.
The independence principle extends to more than two events. In general, n
events E1 , E2 , . . . , En are independent if for every subset with r elements (where
r ≤ n) it holds that:

P( Ea , Eb , . . . , Er ) = P( Ea ) P( Eb ) . . . P( Er ) (5.2)

The general definition implies that for three events E, F, and G to be independent,
all of the following must be true:

P( EFG ) = P( E) P( F ) P( G ) (5.3)
P( EF ) = P( E) P( F ) (5.4)
P( EG ) = P( E) P( G ) (5.5)
P( FG ) = P( F ) P( G ) (5.6)

Problems with more than two independent events come up frequently. For exam-
ple: the outcomes of n separate flips of a coin are all independent of one another.
Each flip in this case is called a ‘‘trial’’ of the experiment.
In the same way that the mutual exclusion property makes it easier to calculate
the probability of the OR of two events, independence makes it easier to calculate
the AND of two events.

Example 5.1. Probability of getting k heads when flipping a biased coin.

Flipping a Biased Coin: A biased coin is flipped n times. Each flip (inde-
pendently) comes up heads with probability p, and tails with probability
1 − p. What is the probability of getting exactly k heads?

Solution: Consider all the possible orderings of heads and tails that result in
k heads. There are (n choose k) such orderings, and all of them are mutually
exclusive. Since all of the flips are independent, to compute the probability of
any one of these orderings, we can multiply the probabilities of each of the heads
and each of the tails. There are k heads and n − k tails, so the probability of each
ordering is p^k (1 − p)^{n−k}. Adding up all the different orderings gives us the
probability[1] of getting exactly k heads:

(n choose k) p^k (1 − p)^{n−k}  (5.7)

[1] Spoiler alert: This is the probability density of a binomial distribution. Intrigued by that term? Stay tuned for next week!
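
As a quick Julia sketch (the p_heads helper is our own name), equation (5.7) is:

julia> p_heads(n, k, p) = binomial(n, k) * p^k * (1 - p)^(n - k);

julia> round(p_heads(10, 3, 0.3), digits=4)  # P(exactly 3 heads in 10 flips, p = 0.3)
0.2668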


Example 5.2. Independence of individual hashed strings in a hash map.

Hash Map: Suppose m strings are hashed (unequally) into a hash table with n buckets. Each string hashed is an independent trial, with probability p_i of getting hashed to bucket i. Calculate the probability of these three events:

A. E = the first bucket has ≥ 1 string hashed to it.

B. E = at least 1 of buckets 1 to k has ≥ 1 string hashed to it.

C. E = each of buckets 1 to k has ≥ 1 string hashed to it.

Part A (example 5.3): Let S_i be the event that string i is hashed into the first bucket. Note that all S_i are independent of one another. The complement, S_i^c, is the event that string i is not hashed into the first bucket; by mutual exclusion, P(S_i^c) = 1 − p_1 = p_2 + p_3 + · · · + p_n.

P(E) = P(S_1 ∪ S_2 ∪ · · · ∪ S_m)                 (definition of S_i)
     = 1 − P((S_1 ∪ S_2 ∪ · · · ∪ S_m)^c)         (complement)
     = 1 − P(S_1^c S_2^c . . . S_m^c)             (by De Morgan's law)
     = 1 − P(S_1^c) P(S_2^c) . . . P(S_m^c)       (since the events are independent)
     = 1 − (1 − p_1)^m                            (calculating P(S_i^c) by mutual exclusion)

Part B (example 5.4): Let F_i be the event that at least one string is hashed into bucket i. Note that the F_i's are neither independent nor mutually exclusive.

P(E) = P(F_1 ∪ F_2 ∪ · · · ∪ F_k)
     = 1 − P([F_1 ∪ F_2 ∪ · · · ∪ F_k]^c)         (since P(A) + P(A^c) = 1)
     = 1 − P(F_1^c F_2^c . . . F_k^c)             (by De Morgan's law)
     = 1 − (1 − p_1 − p_2 − · · · − p_k)^m        (mutual exclusion, independence of strings)

The last step is calculated by realizing that P(F_1^c F_2^c . . . F_k^c) is only satisfied by m independent hashes into buckets other than 1 through k.


Part C (example 5.5): Let F_i be the same as in Part B.

P(E) = P(F_1 F_2 · · · F_k)
     = 1 − P([F_1 F_2 · · · F_k]^c)               (since P(A) + P(A^c) = 1)
     = 1 − P(F_1^c ∪ F_2^c ∪ · · · ∪ F_k^c)       (by De Morgan's (other) law)
     = 1 − P(⋃_{i=1}^k F_i^c)
     = 1 − ∑_{r=1}^k (−1)^(r+1) ∑_{i1<···<ir} P(F_{i1}^c F_{i2}^c . . . F_{ir}^c)
                                                  (by general inclusion/exclusion, equation (3.2))

where P(F_1^c F_2^c . . . F_k^c) = (1 − p_1 − p_2 − · · · − p_k)^m just like in Part B (example 5.4).
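As a sketch of how one might check the Part A closed form by simulation (the bucket probabilities p and the helper names are assumptions for illustration):

using Statistics

p = [0.2, 0.3, 0.5]; m = 5              # hypothetical bucket probabilities and string count
exact = 1 - (1 - p[1])^m                # Part A closed form: ≈ 0.6723

# Hash one string: inverse-CDF sampling of a bucket index.
hash_bucket(p) = searchsortedfirst(cumsum(p), rand())

# One trial: do any of the m independent strings land in bucket 1?
trial(p, m) = any(hash_bucket(p) == 1 for _ in 1:m)

estimate = mean(trial(p, m) for _ in 1:100_000)   # ≈ exact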

5.2 Conditional Independence

Two events E and F are called conditionally independent given a third event G, if

P( EF | G ) = P( E | G ) P( F | G ) (5.9)

Or, equivalently (when P(F | G) > 0):

P(E | FG) = P(E | G)   (5.10)

5.2.1 Conditioning Breaks Independence


An important caveat about conditional independence is that ordinary independence does not imply conditional independence, nor the other way around. Knowing when exactly conditioning breaks or creates independence is a big part of building complex probabilistic models; the first few weeks of CS 228 (cs228.stanford.edu) are dedicated to some general principles for reasoning about conditional independence. We will talk about this in another lecture; the examples below are included for completeness.


Example 5.6. Dice rolls in a game of craps as independent trials.

Simplified Craps: Two 6-sided dice are rolled repeatedly. Consider the sum of the two dice. What is P(E), where E is defined as the event where a sum of 5 is rolled before a sum of 7?

Solution: Define our independent trials to be each paired roll. We can then define F_i as the event where we observe our first 5 before 7 on the i-th trial. In other words, no 5 or 7 was rolled in the first i − 1 trials, and a sum of 5 was rolled on the i-th trial. Notice that the F_i for i = 1, . . . , ∞ are mutually exclusive, as there is only ever one first occurrence of the sum of 5. The probability of rolling a sum of 5 or a sum of 7 is 4/36 and 6/36, respectively, and therefore P(F_i) = (26/36)^(i−1) (4/36).

P(E) = P(F_1 ∪ F_2 ∪ · · · ∪ F_i ∪ · · ·) = ∑_{i=1}^∞ P(F_i)      (F_i mutually exclusive)
     = (4/36) ∑_{i=1}^∞ (26/36)^(i−1) = (4/36) ∑_{i=0}^∞ (26/36)^i
     = (4/36) · 1/(1 − 26/36) = 2/5

where the last line comes from the property of infinite geometric series: for |x| < 1, ∑_{i=0}^∞ x^i = 1/(1 − x).   (5.8)
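A quick Monte Carlo check of this answer in Julia (a sketch; the function names are ours):

using Statistics

# Roll paired dice until a sum of 5 (win) or 7 (loss) appears.
function five_before_seven()
    while true
        s = rand(1:6) + rand(1:6)
        s == 5 && return true
        s == 7 && return false
    end
end

mean(five_before_seven() for _ in 1:100_000)   # ≈ 2/5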


Fevers: Let’s say a person has a fever if they either have malaria or have
an infection. We are going to assume that getting malaria and having an
infection are independent: knowing if a person has malaria does not tell us if
they have an infection. Now, a patient walks into a hospital with a fever. Your
belief that the patient has malaria is high and your belief that the patient has
an infection is high. Both explain why the patient has a fever.
Now, given our knowledge that the patient has a fever, gaining the knowl-
edge that the patient has malaria will change your belief the patient has an
infection. The malaria explains why the patient has a fever, and so the alter-
nate explanation becomes less likely. The two events (which were previously
independent) are dependent when conditioned on the patient having a fever.

Faculty Night: At faculty night with a CS professor in attendance, you


observe 44 students. Of these, you find out that 30 are straight-A students.
Additionally, 20 of the 44 are CS majors, and of these 20, 6 are straight-A
students. Let A be the event that a student gets straight A’s, C be the event
that a student is a CS major, and F be the event that a student attends faculty
night. In probability notation, P( A | F ) = 30/44 ≈ 0.68, but P( A | C, F ) =
6/20 = 0.30. It would seem that being a CS major decreases your chance of
being a straight-A student!
You decide to investigate further by surveying your whole dorm. There
are 100 students in your dorm; 30 of these are straight-A students, 20 are CS
majors, and 6 are straight-A CS majors. That is, overall, P( A) = 30/100 =
0.30, and P( A | C ) = 6/20 = 0.30. So A and C are independent! What
happened at faculty night?
As it turns out, faculty night attracted two types of people: straight-A
students (who go to all the faculty nights), and CS majors. So the non-
straight-A students at this faculty night are more likely to be CS majors! It’s
not because CS students are slackers, or because CS is harder; it’s because
non-straight-A students with other majors didn’t come to faculty night.
In both of these examples, conditioning on an event E leads to depen-
dence between previously independent events A and B when A and B are
independent causes of E.



6 Random Variables

From notes by Chris Piech and Lisa Yan.

A random variable (RV) is a variable that probabilistically takes on different values.
You can think of an RV as being like a variable in a programming language. They
take on values, have types and have domains over which they are applicable. We
can define events that occur if the random variable takes on values that satisfy a
numerical test (e.g., does the variable equal 5? is the variable less than 8?). We
often need to know the probabilities of such events.
As an example, let’s say we flip three fair coins. We can define a random
variable Y to be the total number of ‘‘heads’’ on the three coins. We can ask about
the probability of Y taking on different values using the following notation:

• P(Y = 0) = 1/8 ( T, T, T )

• P(Y = 1) = 3/8 ( H, T, T ), ( T, H, T ), ( T, T, H )

• P(Y = 2) = 3/8 ( H, H, T ), ( H, T, H ), ( T, H, H )

• P(Y = 3) = 1/8 ( H, H, H )

• P (Y ≥ 4 ) = 0

Even though we use the same notation for random variables and for events (both
use capital letters), they are distinct concepts. An event is a situation; a random
variable is an object. The situation in which a random variable takes on a particular
value (or range of values) is an event. When possible, we will try to use letters
E, F, G for events and X, Y, Z for random variables.
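For instance, the PMF of Y above can be computed by brute-force enumeration of the 8 equally likely outcomes; a small Julia sketch (variable names are ours):

# Enumerate all outcomes of three fair coin flips (1 = heads, 0 = tails)
# and tabulate Y = total number of heads.
outcomes = Iterators.product((0, 1), (0, 1), (0, 1))
counts = Dict{Int,Int}()
for flips in outcomes
    counts[sum(flips)] = get(counts, sum(flips), 0) + 1
end
pmf = Dict(y => c // 8 for (y, c) in counts)
# pmf[0] == 1//8, pmf[1] == 3//8, pmf[2] == 3//8, pmf[3] == 1//8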

Using random variables is a convenient notation that assists in decomposing


problems. There are many different types of random variables (indicator, binary,
choice, Bernoulli, etc). The two main families of random variable types are discrete
and continuous. For now we are going to develop intuition around discrete
random variables.

6.1 Probability Mass Function

For a discrete random variable, the most important thing to know is the probability that the random variable will take on each of its possible values. The probability mass function (PMF) of a random variable is a function that maps possible outcomes of a random variable to the corresponding probabilities. Because it is a function, we can plot PMF graphs where the x-axis contains the values that the random variable can take on and the y-axis contains the probability of the random variable taking on said value.

Figure 6.1. The probability mass function of a single 6-sided die roll.

There are many ways that probability mass functions can be specified. We can draw a graph. We can build a table (or for you CS folks, a map/HashMap/dict) that lists out all the probabilities for all possible events. Or we could write out a mathematical expression.

For example, consider the random variable X which is the sum of two dice rolls. The probability mass function can be defined by the graph in figure 6.2. It can also be defined using the equation:

p_X(x) = P(X = x) = (x − 1)/36    if x ∈ Z, 1 ≤ x ≤ 7
                  = (13 − x)/36   if x ∈ Z, 8 ≤ x ≤ 12
                  = 0             otherwise

The probability mass function, p_X(x), defines the probability of X taking on the value x. The new notation p_X(x) is simply different notation for writing P(X = x). Using this new notation makes it more apparent that we are specifying a function. Try a few values of x, and compare the value of p_X(x) to the graph in figure 6.2. They should be the same.

Figure 6.2. Probability mass function of the sum of two dice rolls.
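A sketch of this piecewise PMF in Julia (pX is a hypothetical name; // keeps probabilities as exact rationals):

# PMF of the sum of two dice rolls, as a piecewise function.
function pX(x::Int)
    1 <= x <= 7  && return (x - 1) // 36
    8 <= x <= 12 && return (13 - x) // 36
    return 0 // 36
end

sum(pX, 2:12)   # 1//1: the probabilities over the support sum to 1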


6.2 Expectation

A useful piece of information about a random variable is the average value of


the random variable over many repetitions of the experiment it represents. This
average is called the expectation. The expectation of a discrete random variable X
is defined as:
E[X] = ∑_{x ∈ X} x · p_X(x),  where p_X(x) > 0   (6.1)

It goes by many other names: mean, expected value, weighted average, center of mass,
and first moment.

Example 6.1. Expected value of a six-sided die roll.

The random variable X represents the outcome of one roll of a six-sided die. What is E[X]? This is the same as asking for the average value of a die roll.

E[X] = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 7/2 = 3.5

6.3 Properties of Expectation

Expectations preserve linearity.¹ Mathematically, this means that:

E[aX + bY + c] = aE[X] + bE[Y] + c   (6.2)

¹ For a comprehensive review of linear algebra, see the textbook for Stanford's Math 51: http://web.stanford.edu/class/math51/stanford/math51book.pdf
So if you have an expectation of a sum of quantities, this is equal to the sum of the
expectations of those quantities. We will return to the implications of this very
useful fact later in the course.
One can also calculate the expected value of a function g( X ) of a random
variable X when one knows the probability distribution of X but one does not
explicitly know the distribution of g( X ):

E[g(X)] = ∑_x g(x) · p_X(x)   (6.3)

This identity has the humorous name of ‘‘the Law of the Unconscious Statistician’’
(LOTUS), for the fact that even statisticians are known—perhaps unfairly—to
ignore the difference between this identity and the basic definition of expectation
(the basic definition doesn’t have a function g).


We can use this to compute, for example, the expectation of the square of a random variable (called the second moment):

E[X²] = E[g(X)]             (where g(X) = X²)
      = ∑_x g(x) · p_X(x)   (by LOTUS)
      = ∑_x x² · p_X(x)     (definition of g)
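For example, a minimal Julia computation of the second moment of a fair die via LOTUS (plain arrays; no special libraries assumed):

# E[X²] = Σ x² · pX(x) for a fair 6-sided die.
X = 1:6
P = fill(1//6, 6)
EX2 = sum((X .^ 2) .* P)   # 91//6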

Example 6.2. Class size expected value based on the choice of the random variable.

A school has 3 classes with 5, 10, and 150 students. Each student is only in one of the three classes. If we randomly choose a class with equal probability and let X be the size of the chosen class:

E[X] = 5(1/3) + 10(1/3) + 150(1/3)
     = 165/3 = 55

However, if instead we randomly choose a student with equal probability (out of the 165 students total) and let Y be the size of the class the student is in:

E[Y] = 5(5/165) + 10(10/165) + 150(150/165)
     = 22625/165 ≈ 137
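The two answers differ because the two sampling schemes weight the classes differently; a short Julia check (variable names are ours):

sizes = [5, 10, 150]
EX = sum(sizes .* (1 // 3))                # pick a class uniformly: 55//1
EY = sum(s * (s // 165) for s in sizes)    # pick a student uniformly: 4525//33 ≈ 137.1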

Example 6.3. Expected value game resulting in an infinite money paradox.

Consider a game played with a fair coin which comes up heads with p = 0.5. Let n = the number of coin flips before the first tails. In this game you win $2^n. How many dollars do you expect to win? Let X be a random variable which represents your winnings.

E[X] = 2^0 (1/2)^1 + 2^1 (1/2)^2 + 2^2 (1/2)^3 + · · · = ∑_{i=0}^∞ 2^i (1/2)^(i+1)
     = ∑_{i=0}^∞ 1/2 = ∞


Figure 6.3. Different variance in probability mass functions that each have an expected value of E[X] = 3. (Four PMFs over x = 1, . . . , 5 with Var(X) = 2, ≈ 1.5, ≈ 1, and ≈ 0.5.)

6.4 Variance

Expectation is a useful statistic, but it does not give a detailed view of the probability mass function. Consider the 4 distributions (PMFs) in figure 6.3. All four have the same expected value E[X] = 3, but the ‘‘spread’’ in the distributions is quite different. Variance is a formal quantification of ‘‘spread’’. There is more than one way to quantify spread; variance uses the average square distance from the mean.

The variance of a discrete random variable X with expected value µ is defined:²

Var(X) = E[(X − E[X])²] = E[(X − µ)²]   (6.4)

² Variance has squared units relative to X.

When computing the variance, we often use a different form of the same equation:

Var(X) = E[X²] − E[X]²   (6.5, Property 1)

This form follows from the definition by expanding the square (let µ = E[X]):

Var(X) = E[(X − µ)²]
       = ∑_x (x − µ)² p(x)
       = ∑_x (x² − 2µx + µ²) p(x)
       = ∑_x x² p(x) − 2µ ∑_x x p(x) + µ² ∑_x p(x)
       = E[X²] − 2µE[X] + µ² · 1
       = E[X²] − 2µ² + µ²
       = E[X²] − E[X]²

A useful identity for variance, making it non-linear, is that:

Var(aX + b) = a² Var(X)   (6.6, Property 2)

Adding a constant doesn't change the ‘‘spread’’; multiplying by a constant does. To stay in the units of X, the standard deviation is the square root of variance:

SD(X) = σ = √Var(X)   (6.7)

Intuitively, standard deviation is a kind of average distance of a sample to the mean.³ Variance is the square of this average distance.

³ Specifically, it is a root-mean-square (RMS) average.

Example 6.4. Variance calculation of a single 6-sided die roll.

Let X be the value on one roll of a 6-sided die. What is Var(X)?

Solution: First, we can calculate E[X²]:

E[X²] = (1/6)(1²) + (1/6)(2²) + (1/6)(3²) + (1/6)(4²) + (1/6)(5²) + (1/6)(6²) = 91/6

Recall that E[X] = 7/2, and we can use the expectation formula for variance:

Var(X) = E[X²] − (E[X])² = 91/6 − (7/2)² = 35/12

Algorithm 6.1. Expected value of random variable X with probabilities P, written in Julia. The symbol 𝔼 can be created by typing \bbE and hitting tab. The .* syntax broadcasts multiplication element-wise.

𝔼(X) = sum(X .* P)

Algorithm 6.2. Variance of random variable X using expectation 𝔼.

Var(X) = 𝔼(X.^2) - 𝔼(X)^2

Example 6.5. Expected value and variance functions in Julia; recomputing example 6.4. Note the use of // to indicate a Rational type.

Using the 𝔼 function from algorithm 6.1 and the Var function from algorithm 6.2, we can recompute the answer to example 6.4.

julia> X = 1:6;
julia> P = fill(1//6, 6);
julia> 𝔼(X)
7//2
julia> Var(X)
35//12
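Note that, as written, 𝔼 (and hence Var) reads the probability vector P from the surrounding scope, so X and P must stay aligned. A self-contained variant (our naming) passes P explicitly:

𝔼(X, P) = sum(X .* P)                   # expectation with explicit probabilities
Var(X, P) = 𝔼(X .^ 2, P) - 𝔼(X, P)^2    # variance via E[X²] − E[X]²

Var(1:6, fill(1//6, 6))   # 35//12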



7 Bernoulli and Binomial Random Variables

From notes by Chris Piech and Lisa Yan.

There are some classic random variable abstractions that show up in many problems. At this point in the class you will learn about several of the most significant discrete distributions. When solving problems, if you are able to recognize that a random variable fits one of these formats, then you can use its precalculated probability mass function (PMF), expectation, variance, and other properties. Random variables of this sort are called parametric random variables. If you can argue that a random variable falls under one of the studied parametric types, you simply need to provide parameters. A good analogy is a class in programming. Creating a parametric random variable is very similar to calling a constructor with input parameters.

7.1 Bernoulli Random Variable

A Bernoulli random variable is the simplest kind of random variable.¹ It can take on two values, 1 and 0: it maps ‘‘success’’ to 1 and ‘‘failure’’ to 0, so its support is {0, 1}. It takes on a 1 if an experiment with probability p resulted in success and a 0 otherwise. Some example uses include a coin flip, a random binary digit, whether a disk drive crashed, and whether someone likes a Netflix movie. If X is a Bernoulli random variable, denoted² X ∼ Ber(p):

Probability mass function:  P(X = 1) = p          (7.1)
                            P(X = 0) = 1 − p      (7.2)
Expectation:                E[X] = p              (7.3)
Variance:                   Var(X) = p(1 − p)     (7.4)

¹ The Bernoulli random variable is the simplest random variable (i.e., an indicator or boolean random variable).
² Sampling x from a distribution D can also be written x ∼ D, where ∼ is read as ‘‘is distributed as’’.

Bernoulli random variables and indicator variables are two aspects of the same concept. A random variable I is an indicator variable for an event A if I = 1 when A occurs and I = 0 if A does not occur. P(I = 1) = P(A) and E[I] = P(A). Indicator random variables are Bernoulli random variables, with p = P(A).

7.2 Binomial Random Variable

A binomial random variable is a random variable that represents the number of successes in n successive independent trials of a Bernoulli experiment; its support is {0, 1, . . . , n}, and note that Ber(p) = Bin(1, p). Some example uses include the number of heads in n coin flips, the number of disk drives that crashed in a cluster of 1000 computers, and the number of advertisements that are clicked when 40,000 are served.

If X is a Binomial random variable, we denote this X ∼ Bin(n, p), where p is the probability of success in a given trial. A binomial random variable has the following properties:³

Probability mass function:  P(X = k) = (n choose k) p^k (1 − p)^(n−k)   if k ∈ N, 0 ≤ k ≤ n
                            P(X = k) = 0                                otherwise   (7.5)
Expectation:                E[X] = np                (7.6)
Variance:                   Var(X) = np(1 − p)       (7.7)

³ A binomial random variable is the sum of Bernoulli random variables.

Example 7.1. Binomial random variable for n = 3 coin flips with probability p = 0.5.

Let X = number of heads after a coin is flipped three times. X ∼ Bin(3, 0.5). What is the probability of each of the different values of X?

P(X = 0) = (3 choose 0) p^0 (1 − p)^3 = 1/8
P(X = 1) = (3 choose 1) p^1 (1 − p)^2 = 3/8
P(X = 2) = (3 choose 2) p^2 (1 − p)^1 = 3/8
P(X = 3) = (3 choose 3) p^3 (1 − p)^0 = 1/8

Figure 7.1. Probability mass function of a binomial random variable; number of heads after three coin flips.


Example 7.2. Binomial random variable for bit encoding using a Hamming code.

When sending messages over a network, there is a chance that the bits will be corrupted. A Hamming code allows for a 4-bit code to be encoded as 7 bits, with the advantage that if 0 or 1 bit(s) are corrupted, then the message can be perfectly reconstructed. You are working on the Voyager space mission and the probability of any bit being lost in space is 0.1. How does reliability change when using a Hamming code?

Imagine we use error correcting codes. Let X ∼ Bin(7, 0.1).

P(X = 0) = (7 choose 0) (0.1)^0 (0.9)^7 ≈ 0.478
P(X = 1) = (7 choose 1) (0.1)^1 (0.9)^6 ≈ 0.372
P(X = 0) + P(X = 1) ≈ 0.850

julia> X = Bin(7, 0.1);
julia> pdf(X,0) + pdf(X,1)
0.8503056000000002

What if we didn't use error correcting codes? Let X ∼ Bin(4, 0.1).

P(X = 0) = (4 choose 0) (0.1)^0 (0.9)^4 ≈ 0.656

julia> X = Bin(4, 0.1);
julia> pdf(X,0)
0.6561

Using Hamming codes improves reliability by about 30%!



8 Poisson and Other Discrete Distributions

From notes by Chris Piech and Lisa Yan.

8.1 Binomial in the Limit

Recall the example of sending a bit string over a network. In our last class we used a binomial random variable to represent the number of bits corrupted out of 4 with a high corruption probability (each bit had independent probability of corruption p = 0.1). That example was relevant to sending data to spacecraft, but for earthly applications like HTML data, voice, or video, bit streams are much longer (length ≈ 10^4) and the probability of corruption of a particular bit is very small (p ≈ 10^−6). Extreme n and p values arise in many cases: # visitors to a website, # server crashes in a giant data center.

Unfortunately, X ∼ Bin(10^4, 10^−6) is unwieldy to compute. However, when values get that extreme, we can make approximations that are accurate and make computation feasible. Recall that the parameters of the binomial distribution are n = 10^4 and p = 10^−6. First, define λ = np. We can rewrite the binomial PMF as:

P(X = i) = [n!/(i!(n − i)!)] (λ/n)^i (1 − λ/n)^(n−i)                                  (8.1)
         = [n(n − 1) · · · (n − i + 1)/n^i] · [λ^i/i!] · [(1 − λ/n)^n/(1 − λ/n)^i]    (8.2)

This equation can be made simpler using some approximations that hold when n is sufficiently large and p is sufficiently small (recall the definition e^−λ = lim_{n→∞} (1 − λ/n)^n):

n(n − 1) · · · (n − i + 1)/n^i ≈ 1   (8.3)
(1 − λ/n)^n ≈ e^−λ                   (8.4)
(1 − λ/n)^i ≈ 1                      (8.5)

Using these reduces our original equation (8.1) to:

P(X = i) = (λ^i/i!) e^−λ   (8.6)

This simplification, derived by assuming extreme values of n and p, turns out to be so useful that it gets its own random variable type: the Poisson random variable.

8.2 Poisson Random Variable

A Poisson random variable models the number of successes in a fixed interval of time; it approximates binomial random variables where n is large, p is small, and λ = np is ‘‘moderate’’. It is also a reasonable approximation of a binomial even when successes in trials are not entirely independent. (See chapter 7 for the binomial random variable definition.) Interestingly, to calculate the things we care about (e.g., PMF, expectation, variance), we no longer need to know n and p. We only need to provide λ, which we call the rate; its units are # successes / time.

There are different interpretations of ‘‘moderate’’. Commonly accepted ranges are n > 20 and p < 0.05, or n > 100 and p < 0.1.

Here are the key formulas you need to know for Poisson. The support of a Poisson is {0, 1, . . .}. If X is a Poisson random variable, denoted X ∼ Poi(λ), then:

Probability mass function:  P(X = i) = (λ^i/i!) e^−λ   (8.7)
Expectation:                E[X] = λ                   (8.8)
Variance:                   Var(X) = λ                 (8.9)

Poisson examples: # earthquakes per year, # server hits per second, # emails per day.

Example 8.1. Using the Poisson distribution to approximate the probability of a corrupt bit; relating it to the binomial distribution.

Let's say you want to send a bit string of length n = 10^4 where each bit is independently corrupted with p = 10^−6. What is the probability that the message will arrive uncorrupted? You can solve this using a Poisson with λ = np = 10^4 · 10^−6 = 0.01. Let X ∼ Poi(0.01) be the number of corrupted bits. Using the PMF for Poisson:

P(X = 0) = (λ^i/i!) e^−λ = (0.01^0/0!) e^−0.01 ≈ 0.9900498

julia> X = Poi(0.01);
julia> pdf(X,0)
0.9900498337491681

We could have also modeled X as a binomial such that X ∼ Bin(10^4, 10^−6). That would have been harder to compute but would have resulted in the same number (to 8 decimal places).

julia> Xₙ = Bin(10^4, 1e-6);
julia> pdf(X,0) - pdf(Xₙ,0)
4.950252541213729e-9


Example 8.2. A Poisson distribution used to approximate the probability of an earthquake.

The Poisson distribution is often used to model the number of events that occur independently at any time in an interval of time or space, with a constant average rate. Earthquakes are a good example of this. Suppose there are an average of 2.8 major earthquakes in the world each year. What is the probability of getting more than one major earthquake next year?

Let X ∼ Poi(2.8) be the number of major earthquakes next year. We want to know P(X > 1). We can use the complement rule to rewrite this as 1 − P(X = 0) − P(X = 1). Using the PMF for Poisson:

P(X > 1) = 1 − P(X = 0) − P(X = 1)
         = 1 − (2.8^0/0!) e^−2.8 − (2.8^1/1!) e^−2.8 = 1 − e^−2.8 − 2.8e^−2.8
         ≈ 1 − 0.06 − 0.17 = 0.77

julia> X = Poi(2.8);
julia> 1 - pdf(X,0) - pdf(X,1)
0.7689217620241717

8.3 Geometric Distribution

The variable X is a geometric random variable, denoted X ∼ Geo(p), if X is the number of independent trials until the first success and p is the probability of success on each trial. The support of a geometric is {1, 2, . . .}. If X ∼ Geo(p):

Probability mass function:  P(X = n) = (1 − p)^(n−1) p   (8.10)
Expectation:                E[X] = 1/p                   (8.11)
Variance:                   Var(X) = (1 − p)/p²          (8.12)

The PMF, P(X = n), can be derived using the independence assumption. Let E_i represent the event that the i-th trial succeeds. Then the probability that X is exactly n is the probability that the first n − 1 trials fail, and the n-th succeeds:

P(X = n) = P(E_1^c E_2^c . . . E_{n−1}^c E_n)            (8.13)
         = P(E_1^c) P(E_2^c) . . . P(E_{n−1}^c) P(E_n)   (8.14)
         = (1 − p)^(n−1) p                               (8.15)


A similar argument can be used to derive the cumulative distribution function (CDF), the probability that X ≤ n. This is equal to 1 − P(X > n), and P(X > n) is the probability that at least the first n trials fail:

P(X ≤ n) = 1 − P(X > n)                          (8.16)
         = 1 − P(E_1^c E_2^c . . . E_n^c)        (8.17)
         = 1 − P(E_1^c) P(E_2^c) . . . P(E_n^c)  (8.18)
         = 1 − (1 − p)^n                         (8.19)
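This closed form is easy to use directly; a one-line Julia sketch (geo_cdf is our name):

# P(X ≤ n) for X ∼ Geo(p), from equation (8.19).
geo_cdf(n, p) = 1 - (1 - p)^n

geo_cdf(3, 0.5)   # 0.875: probability of a success within 3 trials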

Example 8.3. Geometric random variable of independent Pokémon capturing trials. Use X = Geo(p) for the distribution and 𝔼(X) for expectation.

In the Pokémon games, one captures Pokémon by throwing Poké Balls at them. Suppose each ball independently has probability p = 0.1 of catching the Pokémon. What is the average number of balls required for a successful capture?

Solution: Let X be the number of balls used until (and including) the capture. X ∼ Geo(p), so the average number needed is E[X] = 1/p = 10.

Example 8.4. Geometric random variable of independent Pokémon capturing trials, using the cumulative distribution function cdf(X,n), introduced in section 9.2.

Suppose we want to ensure that the probability of a capture before we run out of Poké Balls is at least 0.99. How many balls do we need to carry?

Solution: We want to know n such that P(X ≤ n) ≥ 0.99.

P(X ≤ n) = 1 − (1 − p)^n ≥ 0.99
(1 − p)^n ≤ 0.01
log[(1 − p)^n] ≤ log 0.01
n log(1 − p) ≤ log 0.01
n ≥ log 0.01 / log(1 − p) = log 0.01 / log 0.9 ≈ 43.7

So we need 44 Poké Balls. (Note that we flipped the inequality on the last line because we divided both sides by log(1 − p). Since 1 − p < 1, we know log(1 − p) < 0, so we're dividing by a negative number!)
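The same n can be found numerically; a small Julia sketch of both routes (names are ours):

p = 0.1
n_formula = ceil(Int, log(0.01) / log(1 - p))            # 44
n_scan = findfirst(n -> 1 - (1 - p)^n >= 0.99, 1:100)    # 44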


8.4 Negative Binomial Distribution

A variable X is a negative binomial random variable, denoted X ∼ NegBin(r, p), if X is the number of independent trials until r successes and p is the probability of success on each trial. The support of a negative binomial is {r, r + 1, . . .}. If X ∼ NegBin(r, p):

Probability mass function:  P(X = n) = (n − 1 choose r − 1) p^r (1 − p)^(n−r),  where r ≤ n   (8.20)
Expectation:                E[X] = r/p                  (8.21)
Variance:                   Var(X) = r(1 − p)/p²        (8.22)

Example 8.5. Using the negative binomial distribution to determine conference submissions required for acceptance. On the NIPS Experiment, see http://blog.mrtz.org/2014/12/15/the-nips-experiment.html.

Problem: A grad student needs 3 published papers to graduate. (Not how it works in real life!) On average, how many papers will the student need to submit to a conference, if the conference accepts each paper randomly and independently with probability p = 0.25? (Also not how it works in real life...though the NIPS Experiment suggests there is a grain of truth in this model!)

Solution: Let X be the number of submissions required to get r = 3 acceptances.

X ∼ NegBin(r = 3, p = 0.25)

Therefore we get:

E[X] = r/p = 3/0.25 = 12

julia> X = NegBin(3, 0.25);
julia> 𝔼(X)
12.0



9 Continuous Distributions

From notes by Chris Piech and Lisa Yan.

So far, all random variables we have seen have been discrete. In all the cases we
have seen in CS 109, this meant that our RVs could only take on integer values.
Now it’s time for continuous random variables, which can take on values in the
real number domain R. Continuous random variables can be used to represent
measurements with arbitrary precision (e.g., height, weight, or time).

9.1 Probability Density Functions

In the world of discrete random variables, the most important property of a


random variable was its probability mass function (PMF), which told you the
probability of the random variable taking on a certain value. When we move
to the world of continuous random variables, we are going to need to rethink
this basic concept. If I were to ask you what the probability is of a child being
born with a weight of exactly 3.523112342234 kilograms, you might recognize that
question as ridiculous. No child will have precisely that weight. Real values are
defined with infinite precision; as a result, the probability that a random variable
takes on a specific value is not very meaningful when the random variable is
continuous. The PMF doesn’t apply. We need another idea.
In the continuous world, every random variable has a probability density function
(PDF), which says how likely it is that a random variable takes on a particular
value, relative to other values that it could take on. The PDF has the nice property
that you can integrate over it to find the probability that the random variable
takes on values within a range ( a, b).
The random variable X is a continuous random variable if there is a function f(x) for −∞ ≤ x ≤ ∞, called the probability density function (PDF), such that: (See appendix B.4 for a calculus review.)

P(a ≤ X ≤ b) = ∫_a^b f(x) dx   (9.1)

To preserve the axioms that guarantee P( a ≤ X ≤ b) is a probability, the following


properties must also hold:

0 ≤ P( a ≤ X ≤ b) ≤ 1 (9.2)
P(−∞ < X < ∞) = 1 (9.3)

A common misconception is to think of f ( x ) as a probability. It is instead what


we call a probability density. It represents probability divided by the units of X.
Generally this is only meaningful when we either take an integral over the PDF or
we compare probability densities. As we mentioned when motivating probability
densities, the probability that a continuous random variable takes on a specific
value (to infinite precision) is 0. Integrate f ( x ) to get probabilities.
P(X = a) = ∫_a^a f(x) dx = 0   (9.4)

This is very different from the discrete setting, in which we often talked about the probability of a random variable taking on a particular value exactly. (PDF units: probability per units of X.)

9.2 Cumulative Distribution Function

Having a probability density is great, but it means we are going to have to solve
an integral every single time we want to calculate a probability. To save ourselves
some effort, for most of these variables we will also compute a cumulative distribu-
tion function (CDF). The CDF is a function which takes in a number and returns
the probability that a random variable takes on a value less than (or equal to) that
number. If we have a CDF for a random variable, we don’t need to integrate to
answer probability questions!
For a continuous random variable X, the cumulative distribution function is:
F_X(a) = P(X ≤ a) = ∫_−∞^a f(x) dx   (9.5)

This can be written F ( a), without the subscript, when it is obvious which random
variable we are using.

2020-07-29 20:12:29-07:00, draft: send comments to [email protected] to c


9.3. expectation and variance 57

Why is the CDF the probability that a random variable takes on a value less
than (or equal to) the input value as opposed to greater than? It is a matter of
convention. But it is a useful convention. Most probability questions can be solved
simply by knowing the CDF (and taking advantage of the fact that the integral
over the range −∞ to ∞ is 1). Here are a few examples of how you can answer
probability questions by just using a CDF:

Probability Query Solution Explanation


P( X ≤ a) F ( a) This is the definition of the CDF
P( X < a) F ( a) Note that P( X = a) = 0
P( X > a) 1 − F ( a) P( X ≤ a) + P( X > a) = 1
P( a < X < b) F (b) − F ( a) F ( a) + P( a < X < b) = F (b)

As we mentioned briefly earlier, the cumulative distribution function can also


be defined for discrete random variables, but there is less utility to a CDF in the
discrete world, because with the exception of the geometric random variable,
none of our discrete random variables had ‘‘closed form’’ (that is, without any
summations) functions for the CDF:
F_X(a) = ∑_{i=0}^a P(X = i)   (9.6)

9.3 Expectation and Variance

For a continuous random variable X (similar to the discrete case, but summation is replaced with integration and p(x) is replaced with the PDF f(x)):

E[X] = ∫_−∞^∞ x · f(x) dx         (9.7)
E[g(X)] = ∫_−∞^∞ g(x) · f(x) dx   (9.8)
E[X^n] = ∫_−∞^∞ x^n · f(x) dx     (9.9)


Example 9.1. Using the properties of the probability density function (PDF) of a continuous random variable (CRV) to solve for an unknown.

Let X be a continuous random variable (CRV) with PDF:

f(x) = C(4x − 2x²)   if 0 < x < 2
     = 0             otherwise

In this function, C is a constant. What value is C? Since we know that the PDF must integrate to 1:

∫_0^2 C(4x − 2x²) dx = 1
C [2x² − 2x³/3] from x = 0 to 2 = 1
C [(8 − 16/3) − 0] = 1

Solving this equation for C gives C = 3/8. What is P(X > 1)?

∫_1^∞ f(x) dx = (3/8) ∫_1^2 (4x − 2x²) dx
             = (3/8) [2x² − 2x³/3] from x = 1 to 2
             = (3/8) [(8 − 16/3) − (2 − 2/3)] = 1/2


Example 9.2. Calculating the cumulative distribution function for disk crashes.

Let X be a RV representing the number of days of use before your disk crashes, with PDF:

f(x) = λe^(−x/100)   if x ≥ 0
     = 0             otherwise

First, determine λ. Recall that ∫ A e^(Au) du = e^(Au), from equation (B.32):

∫_0^∞ λe^(−x/100) dx = 1
−100λ ∫_0^∞ (−1/100) e^(−x/100) dx = 1
−100λ [e^(−x/100)] from x = 0 to ∞ = 1
100λ · 1 = 1  ⟹  λ = 1/100

What is P(X < 10)?

F(10) = ∫_0^10 (1/100) e^(−x/100) dx
      = [−e^(−x/100)] from x = 0 to 10
      = −e^(−1/10) + 1 ≈ 0.095
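Since F has a closed form here, no integration is needed in code either; a minimal Julia check (F is our name):

# CDF of the disk-crash model: F(a) = 1 − e^(−a/100).
F(a) = 1 - exp(-a / 100)

F(10)   # ≈ 0.0952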


For both continuous and discrete RVs:

E[aX + b] = aE[X] + b
Var(X) = E[(X − µ)²]   (with µ = E[X])
       = E[X²] − E[X]²
Var(aX + b) = a² Var(X)

9.4 Uniform Random Variable

The most basic of all the continuous random variables is the uniform random variable, which is equally likely to take on any value in its range [α, β] (its support).

The variable X is a uniform random variable, denoted X ∼ Uni(α, β), if it has probability density function (PDF):

f(x) = 1/(β − α)   if α ≤ x ≤ β
     = 0           otherwise   (9.10)

Notice how the density 1/(β − α) is exactly the same regardless of the value for x. That makes the density uniform. So why is the PDF 1/(β − α) and not 1? That is the constant that makes it such that the integral over all possible inputs evaluates to 1.

The cumulative distribution function (CDF), expectation, and variance of the uniform random variable are:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx = (b − a)/(β − α)   for α ≤ a ≤ b ≤ β   (9.11, 9.12)
E[X] = ∫_−∞^∞ x · f(x) dx = ∫_α^β x/(β − α) dx = [x²/(2(β − α))] from x = α to β = (α + β)/2   (9.13–9.15)
Var(X) = (β − α)²/12   (9.16)
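A quick numeric check of these closed forms in Julia (illustrative values of our choosing):

α, β = 2.0, 10.0
E = (α + β) / 2           # 6.0
V = (β - α)^2 / 12        # ≈ 5.333
P = (7 - 3) / (β - α)     # P(3 ≤ X ≤ 7) = 0.5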


9.5 Exponential Random Variable

An exponential random variable, denoted X ∼ Exp(λ), represents the time until an event occurs. It is parametrized by λ > 0, the (constant) rate at which the event occurs; λ has units events/time, and the support of an exponential is [0, ∞). This is the same λ as in the Poisson distribution; a Poisson variable counts the number of events that occur in a fixed interval, while an exponential variable measures the amount of time until the next event occurs.¹

¹ Example 9.2 sneakily introduced you to the exponential distribution already; now we get to use formulas we've already computed to work with it without integrating anything.

The probability density function (PDF) for an exponential random variable is:

f(x) = λe^(−λx)   if x ≥ 0
     = 0          otherwise   (9.17)

The expectation and variance are as follows:

E[X] = 1/λ       (9.18)
Var(X) = 1/λ²    (9.19)

There is a closed form for the cumulative distribution function (CDF):

F(x) = 1 − e^(−λx)   where x ≥ 0   (9.20)

Exponential examples: time until the next earthquake, time for a request to reach a web server, time until the end of a cell phone contract.


Example 9.3. Using an exponential random variable to determine the duration a user stays on a website.

Let X be a random variable that represents the number of minutes until a visitor leaves your website. You have calculated that on average a visitor leaves your site after 5 minutes, and you decide that an exponential distribution is appropriate to model how long a person stays before leaving the site. What is P(X > 10)?

We can compute λ = 1/5 either using the definition of E[X] or by thinking of how many people leave every minute (answer: ‘‘one-fifth of a person’’). Thus X ∼ Exp(1/5).

P(X > 10) = 1 − F(10)
          = 1 − (1 − e^(−λ·10))
          = e^(−2)
          ≈ 0.1353

Example 9.4. Using an exponential random variable to determine if your laptop will last all four years of university.

Let X be the number of hours of use until your laptop dies. On average, laptops die after 5000 hours of use. If you use your laptop for 7300 hours during your undergraduate career (assuming usage equals 5 hours/day and four years of university), what is the probability that your laptop lasts all four years?

As above, we can find λ either using E[X] or thinking about laptop deaths per hour: λ = 1/5000. Therefore, X ∼ Exp(1/5000).

P(X > 7300) = 1 − F(7300)
            = 1 − (1 − e^(−7300/5000))
            = e^(−1.46) ≈ 0.2322



10 Normal Distribution

From notes by Chris Piech and Lisa Yan.

10.1 Normal Random Variable

The single most important random variable type is the Normal (aka Gaussian) random variable, parameterized by a mean µ and variance σ². If X is a normal variable we write X ∼ N(µ, σ²). The normal is important for many reasons: it is generated from the summation of independent random variables and as a result it occurs often in nature. Many things in the world are not distributed normally but data scientists and computer scientists model them as Normal distributions anyways. Why? Because it is the most entropic (conservative) distribution that we can apply to data with a measured mean and variance. The support of a Normal is (−∞, ∞).

The probability density function (PDF) for a Normal is:

f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))   (10.1)

Here 1/(σ√(2π)) is the normalizing constant and the exponential manages the spread of the tails; the density is symmetric around µ and has variance σ². By definition, a Normal has E[X] = µ and Var(X) = σ².

There is no closed form for the integral of the Normal PDF; however, since a linear transform of a Normal produces another Normal, we can always map our distribution to the Standard Normal (mean 0 and variance 1), which has a precomputed cumulative distribution function (CDF). The CDF of an arbitrary normal is:

F(x) = Φ((x − µ)/σ)   (10.2)

where Φ is a precomputed function that represents the CDF of the Standard Normal.

The PDF of a Normal is symmetric about the mean µ: f(µ − x) = f(µ + x). Symmetry of the PDF of a Normal implies (for the standard normal CDF): Φ(−x) = 1 − Φ(x).

10.2 Linear Transform

If X is a Normal such that X ∼ N (µ, σ2 ) and Y is a linear transform of X such


that Y = aX + b, then Y is also a Normal where:

Y ∼ N ( aµ + b, a2 σ2 )

10.3 Projection to Standard Normal

For any Normal random variable X, we can find a linear transform from X to the Standard Normal N(0, 1). That is, if you subtract the mean µ of the normal and divide by the standard deviation σ, the result is distributed according to the standard normal (also called the unit Normal). We can prove this mathematically. Let Z = (X − µ)/σ:

Z = (X − µ)/σ                (Transform X: subtract µ and divide by σ)
  = (1/σ)X − µ/σ             (Use algebra to rewrite the equation)
  = aX + b                   (Define a = 1/σ, b = −µ/σ)
  ∼ N(aµ + b, a²σ²)          (The linear transform of a normal is another normal)
  ∼ N(µ/σ − µ/σ, σ²/σ²)      (Substitute values in for a and b)
  ∼ N(0, 1)                  (The Standard Normal)

An extremely common use of this transform is to express F_X(x), the CDF of X, in terms of the CDF of Z, F_Z(x). Since the CDF of the Standard Normal is so common, it gets its own Greek symbol, Φ(x).

F_X(x) = P(X ≤ x)                      (10.3)
       = P((X − µ)/σ ≤ (x − µ)/σ)      (10.4)
       = P(Z ≤ (x − µ)/σ)              (10.5)
       = Φ((x − µ)/σ)                  (10.6)

Why is this useful? Well, in the days when we couldn't call scipy.stats.norm.cdf (or on exams, when one doesn't have a calculator), people would look up values of the CDF in a table. Using the Standard Normal means you only need to build a table of one distribution, rather than an indefinite number of tables for all the different values of µ and σ. We also have an online calculator on the CS 109 website. You should learn how to use the Standard Normal table for the exams, however!


Example 10.1. Normal distribution using the defined CDF.

Let X ∼ N(3, 16). What is P(X > 0)?

P(X > 0) = P((X − 3)/4 > (0 − 3)/4) = P(Z > −3/4) = 1 − P(Z ≤ −3/4)
         = 1 − Φ(−3/4) = 1 − (1 − Φ(3/4)) = Φ(3/4) ≈ 0.7734

An alternative approach uses the idea that if F is the CDF of X ∼ N(µ, σ²), then F(x) = P(Z < (x − µ)/σ) = Φ((x − µ)/σ):

P(X > 0) = 1 − F(0) = 1 − Φ(−3/4)
         = 1 − (1 − Φ(3/4)) = Φ(3/4) ≈ 0.7734

What is P(2 < X < 5)?

P(2 < X < 5) = P((2 − 3)/4 < (X − 3)/4 < (5 − 3)/4) = P(−1/4 < Z < 2/4)
             = Φ(2/4) − Φ(−1/4) = Φ(1/2) − (1 − Φ(1/4)) ≈ 0.2902

Alternative solution:

P(2 < X < 5) = F(5) − F(2) = Φ((5 − 3)/4) − Φ((2 − 3)/4)
             = Φ(1/2) − (1 − Φ(1/4)) ≈ 0.2902

What is P(|X − 3| > 6)?

P(|X − 3| > 6) = P(X < −3) + P(X > 9) = F(−3) + (1 − F(9))
               = Φ((−3 − 3)/4) + (1 − Φ((9 − 3)/4))
               = Φ(−3/2) + (1 − Φ(3/2)) = 2(1 − Φ(3/2)) ≈ 0.1337
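These lookups can also be done in code. One way to define Φ in Julia is via the error function from the SpecialFunctions package (an assumption; the course's own tools may differ):

using SpecialFunctions   # provides erf

Φ(z) = (1 + erf(z / sqrt(2))) / 2   # standard normal CDF

Φ(3/4)                    # P(X > 0)      ≈ 0.7734
Φ(1/2) - (1 - Φ(1/4))     # P(2 < X < 5)  ≈ 0.2902
2 * (1 - Φ(3/2))          # P(|X−3| > 6)  ≈ 0.1336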


Example 10.2. Standard Normal distribution used as signal noise.

You send a voltage of 2 or −2 on a wire to denote 1 or 0. Let X = voltage sent and let R = voltage received. Note R = X + Y, where Y ∼ N(0, 1) is noise. When decoding, if R ≥ 0.5, we interpret the voltage as 1, else 0.

What is P(error after decoding | original bit = 1)?

Given that we sent a 1, X = 2 and therefore R = 2 + Y. A decoding error occurs if we incorrectly interpret the signal as 0; this occurs if R < 0.5. Note that Y is the Standard Normal and therefore has CDF Φ:

P(R < 0.5 | X = 2) = P(X + Y < 0.5 | X = 2) = P(2 + Y < 0.5)
                   = P(Y < −1.5) = Φ(−1.5) = 1 − Φ(1.5) ≈ 0.0668

What is P(error after decoding | original bit = 0)?

Given that we sent a 0, X = −2 and therefore R = −2 + Y. A decoding error occurs if we incorrectly interpret the signal as 1; this occurs if R ≥ 0.5.

P(R ≥ 0.5 | X = −2) = P(X + Y ≥ 0.5 | X = −2) = P(−2 + Y ≥ 0.5)
                    = P(Y ≥ 2.5) = 1 − Φ(2.5) ≈ 0.0062

This example demonstrates an asymmetric decoding boundary, where there is a lower probability of erroneously decoding a 0 as a 1 than vice versa. In many engineering circumstances, we may suffer stronger consequences if we turn something ‘‘on’’ when it was supposed to stay turned off. By setting the boundary of our decoding process asymmetrically, we can decrease the probability of this undesirable error.


10.4 Binomial Approximation

You can use a Normal distribution to approximate a binomial X ∼ Bin(n, p). To do so, define a normal distribution:

Y ∼ N(E[X], Var(X))

Now, using the binomial formulas for expectation (µ = np) and variance (σ² = np(1 − p)):

Y ∼ N(np, np(1 − p))

This approximation holds for large n. Since a Normal is continuous and a binomial is discrete, we have to use a continuity correction to discretize the Normal:

P(X = k) ≈ P(k − 1/2 < Y < k + 1/2) = Φ((k − np + 0.5)/√(np(1 − p))) − Φ((k − np − 0.5)/√(np(1 − p)))
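A sketch of this approximation in Julia, reusing the Φ helper from example 10.1's code (binom_approx is our name):

# Normal approximation to P(X = k) for X ∼ Bin(n, p), with continuity correction.
function binom_approx(k, n, p)
    μ = n * p
    σ = sqrt(n * p * (1 - p))
    return Φ((k - μ + 0.5) / σ) - Φ((k - μ - 0.5) / σ)
end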

Example 10.3. Approximating a binomial distribution with a Normal distribution for website visit statistics. The continuity correction maps discrete queries to continuous ones:

Discrete    Continuous
x = 6       5.5 < x < 6.5
x > 6       x > 6.5
x ≤ 6       x < 6.5
x < 6       x < 5.5
x ≥ 6       x > 5.5

100 visitors to your website are given a new design. Let X = # of people who were given the new design and spend more time on your website. Your CEO will endorse the new design if X ≥ 65. What is P(CEO endorses change | it has no effect)?

E[X] = np = 50
Var(X) = np(1 − p) = 25
σ = √Var(X) = 5

We can thus use a Normal approximation: Y ∼ N(50, 25).

P(X ≥ 65) ≈ P(Y > 64.5) = P((Y − 50)/5 > (64.5 − 50)/5)
          = 1 − Φ(2.9) ≈ 0.0019


Example 10.4. Approximating a binomial distribution with a Normal distribution for Stanford acceptance statistics.

Stanford accepts 2480 students and each student has a 68% chance of attending. Let X = # students who will attend. X ∼ Bin(2480, 0.68). What is P(X > 1745)?

E[X] = np = 1686.4
Var(X) = np(1 − p) = 539.7
σ = √Var(X) = 23.23

We can thus use a Normal approximation: Y ∼ N(1686.4, 539.7).

P(X > 1745) ≈ P(Y > 1745.5) = P((Y − 1686.4)/23.23 > (1745.5 − 1686.4)/23.23)
            = 1 − Φ(2.54) ≈ 0.0055



11 Joint Distributions

From notes by Chris Piech and Lisa Yan.

11.1 Joint Distributions

Often you will work on problems where there are several random variables (often
interacting with one another). We are going to start to formally look at how those
interactions play out.
For now we will think of joint probabilities with two events X = a and Y = b.
For this chapter, we will assume both X and Y are discrete random variables, and
we will tackle the continuous case in a future chapter.

11.2 Discrete Case

In the discrete case, a joint probability mass function tells you the probability of
any combination of events X = a and Y = b:

p X,Y ( a, b) = P( X = a, Y = b) (11.1)

This function tells you the probability of all combinations of events (the ‘‘,’’ means
‘‘and’’). If you want to back calculate the probability of an event only for one
variable you can calculate a marginal from the joint probability mass function:

p_X(a) = P(X = a) = ∑_y p_X,Y(a, y)   (11.2)
p_Y(b) = P(Y = b) = ∑_x p_X,Y(x, b)   (11.3)

In the continuous case, a joint probability density function tells you the relative probability of any combination of events X = a and Y = b.
In the discrete case, we can define the function p X,Y non-parametrically. Instead
of using a formula for p we simply state the probability of each possible outcome.

11.3 Multinomial Distribution

Say you perform n independent trials of an experiment where each trial results in
one of m outcomes, with respective probabilities: p1 , p2 , . . . , pm (constrained so
that ∑i pi = 1). Define Xi to be the number of trials with outcome i. A multinomial
distribution is a closed-form function that answers the question: what is the probability that there are c_i trials with outcome i? Mathematically:

P(X_1 = c_1, X_2 = c_2, . . . , X_m = c_m) = (n choose c_1, c_2, . . . , c_m) p_1^(c_1) p_2^(c_2) . . . p_m^(c_m)   (11.4)

Example 11.1. Multinomial distribution to calculate the joint probability of outcomes from rolls of a 6-sided die.

A 6-sided die is rolled 7 times. What is the probability that you roll: 1 one, 1 two, 0 threes, 2 fours, 0 fives, 3 sixes (disregarding order)?

P(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 2, X_5 = 0, X_6 = 3)
  = [7!/(1! 1! 0! 2! 0! 3!)] (1/6)^1 (1/6)^1 (1/6)^0 (1/6)^2 (1/6)^0 (1/6)^3
  = 420 (1/6)^7
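The multinomial coefficient and probability are easy to evaluate in Julia (a sketch using Base's factorial):

c = [1, 1, 0, 2, 0, 3]                      # counts of faces 1..6 over 7 rolls
coef = factorial(7) ÷ prod(factorial.(c))   # 420
prob = coef * (1/6)^7                       # ≈ 0.0015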

Example using multinomial distributions to estimate who wrote Federalist Paper 49, highlighting numerical stability tricks with logs.

Federalist Papers: In class, we wrote a program to decide whether James Madison or Alexander Hamilton wrote Federalist Paper 49. Both men have claimed to have written it, and hence the authorship is in dispute. First we used historical essays to estimate p_i, the probability that Hamilton generates the word i (independent of all previous and future choices of words). Similarly we estimated q_i, the probability that Madison generates the word i. For each word i we observe the number of times that word occurs in Federalist Paper 49 (we call that count c_i). We assume that, given no evidence, the paper is equally likely to be written by Madison or Hamilton.

Define three events: H is the event that Hamilton wrote the paper, M is the event that Madison wrote the paper, and D is the event that a paper has the collection of words observed in Federalist Paper 49. We would like to know whether P(H | D) is larger than P(M | D). This is equivalent to trying to decide if P(H | D)/P(M | D) is larger than 1.

The event (D | H) is a multinomial parameterized by the values p. The event (D | M) is also a multinomial, this time parameterized by the values q.


Using Bayes' rule we can simplify the desired probability:

P(H | D)/P(M | D) = [P(D | H) P(H)/P(D)] / [P(D | M) P(M)/P(D)]
                  = [P(D | H) P(H)] / [P(D | M) P(M)]
                  = P(D | H)/P(D | M)                        (since P(H) = P(M))
                  = [(n choose c_1, c_2, . . . , c_m) ∏_i p_i^(c_i)] / [(n choose c_1, c_2, . . . , c_m) ∏_i q_i^(c_i)]
                  = ∏_i p_i^(c_i) / ∏_i q_i^(c_i)

This seems great! We have our desired probability statement expressed in terms of a product of values we have already estimated. However, when we plug this into a computer, both the numerator and denominator come out to be zero. The product of many numbers close to zero is too hard for a computer to represent. To fix this problem, we use a standard trick in computational probability: we apply a log to both sides and apply some basic rules of logs:

log(P(H | D)/P(M | D)) = log(∏_i p_i^(c_i) / ∏_i q_i^(c_i))
                       = log(∏_i p_i^(c_i)) − log(∏_i q_i^(c_i))
                       = ∑_i log(p_i^(c_i)) − ∑_i log(q_i^(c_i))
                       = ∑_i c_i log(p_i) − ∑_i c_i log(q_i)

This expression is ‘‘numerically stable’’ and my computer returned that the answer was a negative number. We can use exponentiation to solve for P(H | D)/P(M | D). Since e raised to a negative number is a number smaller than 1, this implies that P(H | D)/P(M | D) is smaller than 1. As a result, we conclude that Madison was more likely to have written Federalist Paper 49.
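The numerically stable computation is one line in Julia; c, p, and q below are hypothetical stand-ins for the estimated counts and word probabilities:

# log P(H | D) − log P(M | D) = Σᵢ cᵢ (log pᵢ − log qᵢ)
log_ratio(c, p, q) = sum(c .* (log.(p) .- log.(q)))

c = [3, 1, 4]; p = [0.20, 0.50, 0.30]; q = [0.25, 0.45, 0.30]
log_ratio(c, p, q)   # negative ⟹ Madison more likely; positive ⟹ Hamilton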



12 Independent Random Variables

From notes by Chris Piech and Lisa Yan.

12.1 Independence with Multiple Discrete Random Variables

Two discrete random variables X and Y are called independent if (same ideas as for events, different notation):

P(X = x, Y = y) = P(X = x) P(Y = y)   for all x, y   (12.1)
p_X,Y(x, y) = p_X(x) p_Y(y)           for all x, y   (12.2)

Intuitively: knowing the value of X tells us nothing about the distribution of Y. If two variables are not independent, they are called dependent.¹ This is conceptually similar to independent events, but we are dealing with multiple variables. Make sure to keep your events and variables distinct.

¹ To prove dependence, simply find a counterexample.
Make sure to keep your events and variables distinct.

12.2 Symmetry of Independence

Independence is symmetric. That means that if random variables X and Y are in-
dependent, X is independent of Y and Y is independent of X. This claim may seem
meaningless but it can be very useful. Imagine a sequence of events X1 , X2 , . . ..
Let Ai be the event that Xi is a ‘‘record value’’ (e.g., it is larger than all previous
values). Is An+1 independent of An ? It is easier to answer that An is independent
of An+1 . By symmetry of independence both claims must be true.

12.3 Sums of Independent Random Variables

Independent Binomials with Equal p: For any two Binomial random variables
with the same ‘‘success’’ probability: X ∼ Bin(n1 , p) and Y ∼ Bin(n2 , p) the sum
of those two random variables is another binomial: X + Y ∼ Bin(n1 + n2 , p). This
does not hold when the two distributions have different parameters p.
This holds in the general case: let X_i ∼ Bin(n_i, p) be independent variables for i = 1, . . . , n. Then:

∑_{i=1}^n X_i ∼ Bin(∑_{i=1}^n n_i, p)   (12.3)

Example 12.1. Independence of two discrete Binomial random variables.

Let N be the number of requests to a web server per day, and assume N ∼ Poi(λ). Each request comes from a human (probability = p) or from a ‘‘bot’’ (probability = 1 − p), independently. Define X to be the number of requests from humans per day and Y to be the number of requests from bots per day.

Since requests come in independently, the probability of X conditioned on knowing the number of requests is a binomial. Specifically, conditioned:

(X | N) ∼ Bin(N, p)
(Y | N) ∼ Bin(N, 1 − p)

Calculate the probability of getting exactly i human requests and j bot requests. Start by expanding using the chain rule:

P(X = i, Y = j) = P(X = i, Y = j | X + Y = i + j) P(X + Y = i + j)

We can calculate each term in this expression:

P(X = i, Y = j | X + Y = i + j) = (i + j choose i) p^i (1 − p)^j
P(X + Y = i + j) = e^−λ λ^(i+j)/(i + j)!

Now we can put those together and simplify:

P(X = i, Y = j) = (i + j choose i) p^i (1 − p)^j e^−λ λ^(i+j)/(i + j)!

As an exercise you can simplify this expression into two independent Poisson distributions.


Independent Poissons: For any two Poisson random variables X ∼ Poi(λ_1) and Y ∼ Poi(λ_2), the sum of those two random variables is another Poisson: X + Y ∼ Poi(λ_1 + λ_2). This holds even if λ_1 is not the same as λ_2.

This holds in the general case: let X_i ∼ Poi(λ_i) be independent variables for i = 1, . . . , n. Then:

∑_{i=1}^n X_i ∼ Poi(∑_{i=1}^n λ_i)   (12.4)

Example 12.2. Independence of two discrete Poisson random variables.

Let's say we have two independent random Poisson variables for requests received at a web server in a day: X = number of requests from humans/day, X ∼ Poi(λ_1), and Y = number of requests from bots/day, Y ∼ Poi(λ_2). Since the convolution of Poisson random variables is also a Poisson, we know that the total number of requests (X + Y) is also a Poisson: (X + Y) ∼ Poi(λ_1 + λ_2). What is the probability of having k human requests on a particular day given that there were n total requests?

P(X = k | X + Y = n) = P(X = k, Y = n − k)/P(X + Y = n) = P(X = k) P(Y = n − k)/P(X + Y = n)
                     = [e^(−λ_1) λ_1^k/k!] · [e^(−λ_2) λ_2^(n−k)/(n − k)!] · [n!/(e^(−(λ_1+λ_2)) (λ_1 + λ_2)^n)]
                     = (n choose k) (λ_1/(λ_1 + λ_2))^k (λ_2/(λ_1 + λ_2))^(n−k)

∴ (X | X + Y = n) ∼ Bin(n, λ_1/(λ_1 + λ_2))

12.4 Convolution: Sum of Independent Random Variables

So far, we have had it easy: If our two independent random variables are both
Poisson, or both Binomial with the same probability of success, then their sum
has a nice, closed form. In the general case, however, the distribution of the sum of two independent random variables can be calculated as a convolution of their probability distributions.


For two independent discrete random variables, you can calculate the CDF or the PMF of the sum of the two random variables using the following formulas:

    F_{X+Y}(n) = P(X + Y ≤ n) = ∑_{k=−∞}^{∞} F_X(n − k) p_Y(k)    (12.5)
    p_{X+Y}(n) = ∑_{k=−∞}^{∞} p_X(k) p_Y(n − k)    (12.6)

In different notation, the convolution in equation (12.6) is P(X + Y = n) = ∑_k P(X = k) P(Y = n − k).

Most importantly, convolution is the process of finding the distribution of the sum of the random variables themselves; it is not the process of adding together probabilities.
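As a concrete illustration, here is a minimal sketch of equation (12.6) in code for finite PMFs; the two-dice example is ours, not from the text:

# PMFs are represented as Dicts mapping a value to its probability
function convolve_pmfs(pmf_x, pmf_y)
    pmf_sum = Dict{Int,Float64}()
    for (a, pa) in pmf_x, (b, pb) in pmf_y
        # Accumulate P(X = a) * P(Y = b) into the entry for the sum a + b
        pmf_sum[a + b] = get(pmf_sum, a + b, 0.0) + pa * pb
    end
    return pmf_sum
end

die = Dict(v => 1/6 for v in 1:6)   # PMF of one fair die
two_dice = convolve_pmfs(die, die)  # PMF of the sum of two dice
println(two_dice[7])                # 6/36 ≈ 0.167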

Example 12.3. Proving that the sum of two independent Poisson random variables is also a Poisson.
Let's go about proving that the sum of two independent Poisson random variables is also Poisson. Let X ∼ Poi(λ_1) and Y ∼ Poi(λ_2) be two independent random variables, and Z = X + Y. What is P(Z = n)?

    P(Z = n) = P(X + Y = n)
             = ∑_{k=−∞}^{∞} P(X = k) P(Y = n − k)    (convolution)
             = ∑_{k=0}^{n} P(X = k) P(Y = n − k)    (range of X and Y)
             = ∑_{k=0}^{n} e^{−λ_1} (λ_1^k / k!) · e^{−λ_2} (λ_2^{n−k} / (n − k)!)    (Poisson PMF)
             = e^{−(λ_1+λ_2)} ∑_{k=0}^{n} λ_1^k λ_2^{n−k} / (k! (n − k)!)
             = (e^{−(λ_1+λ_2)} / n!) ∑_{k=0}^{n} (n! / (k! (n − k)!)) λ_1^k λ_2^{n−k}
             = (e^{−(λ_1+λ_2)} / n!) (λ_1 + λ_2)^n    (binomial theorem)

This is exactly the PMF of a Poisson with parameter λ_1 + λ_2, so Z ∼ Poi(λ_1 + λ_2).
n!
Note that the binomial theorem (which we did not cover in this class, but is often used in contexts like expanding polynomials) says that for two numbers a and b and positive integer n:

    (a + b)^n = ∑_{k=0}^{n} \binom{n}{k} a^k b^{n−k}

13 Statistics of Multiple Random Variables

From notes by Chris Piech and Lisa Yan.

As you can imagine, reporting probability mass functions or distributions is often
not ideal: We either have to find a common distribution that fits our experiment, or
we have to report a probability table or a bar graph. In the single random variable
case, we often report expectation or variance as statistics that characterize our
randomness. A similar paradigm applies for the multiple random variable case!
In this section, we discuss statistics of two random variables; in particular, (1)
how to easily calculate the expectation of the sum of multiple random variables,
and (2) how to report how two random variables vary with one another.

13.1 Expectation with Multiple Random Variables

Expectation over a joint distribution is not nicely defined on its own, because it is not clear how to compose the multiple variables. However, expectations over functions of random variables (for example, sums or products) are nicely defined: E[g(X, Y)] = ∑_{x,y} g(x, y) p(x, y) for any function g(X, Y). When you expand that result for the function g(X, Y) = X + Y, you get a beautiful result:

    E[X + Y] = E[g(X, Y)] = ∑_{x,y} g(x, y) p(x, y) = ∑_{x,y} (x + y) p(x, y)    (13.1)
             = ∑_{x,y} x p(x, y) + ∑_{x,y} y p(x, y)    (13.2)
             = ∑_x x ∑_y p(x, y) + ∑_y y ∑_x p(x, y)    (13.3)
             = ∑_x x p(x) + ∑_y y p(y)    (13.4)
             = E[X] + E[Y]    (13.5)

This can be generalized to multiple variables:

    E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i]    (13.6)

Let’s go back to our old friends—the Binomial and Negative Binomial RVs—and
show how we could have derived expressions for their expectation.

13.2 Expectation of Binomial

First let's start with some practice with the sum of expectations of indicator variables. Let Y ∼ Bin(n, p); in other words, Y is a Binomial random variable. We can express Y as the sum of n Bernoulli random indicator variables X_i ∼ Ber(p). Since each X_i is a Bernoulli, E[X_i] = p.

    Y = X_1 + X_2 + · · · + X_n = ∑_{i=1}^n X_i    (13.7)

Let's formally calculate the expectation of Y:

    E[Y] = E[∑_{i=1}^n X_i] = ∑_{i=1}^n E[X_i] = E[X_1] + E[X_2] + · · · + E[X_n] = np    (13.8)

13.3 Expectation of Negative Binomial

Recall that a Negative Binomial is a random variable that semantically represents the number of trials until r successes. Let Y ∼ NegBin(r, p).
Let X_i be the number of trials it takes to get the i-th success after the (i − 1)-th success. We can then think of each X_i as a Geometric random variable: X_i ∼ Geo(p). Thus, E[X_i] = 1/p. We can express Y as:

    Y = X_1 + X_2 + · · · + X_r = ∑_{i=1}^r X_i    (13.9)

Let's formally calculate the expectation of Y:

    E[Y] = E[∑_{i=1}^r X_i] = ∑_{i=1}^r E[X_i] = E[X_1] + E[X_2] + · · · + E[X_r] = r/p    (13.10)


Example 13.1. The coupon collector's problem setup.
Coupon Collector's Problem: There are several versions of the coupon collector's problem in probability theory, but the most common formulation is as follows: You would like to collect coupons from cereal boxes, but you must purchase a box of cereal to open it and discover which coupon type you have. More formally, suppose there are k different types of coupons, and each box you buy contains a coupon whose type is chosen uniformly at random. What is the expected number of boxes that you must purchase until you have at least one coupon of each type?
How does this relate to computer science? Suppose you are a big cloud provider, and you have to service n web requests with a limited number of k servers, where each web request is routed to one of the k servers at random. What is the expected number of utilized servers after n requests?

13.4 Expectations of Products Lemma

We know that the expectation of the sum of two random variables is equal to
the sum of the expectations of the two variables. However, the expectation of the
product of two random variables only has a nice decomposition in the case where
the random variables are independent of one another.

E[ g( X )h(Y )] = E[ g( X )]E[h(Y )] if X and Y are independent

Here’s a proof for independent discrete random variables X and Y. If you would
like to prove this for independent continuous random variables, just interchange
the summations with integrals.

    E[g(X) h(Y)] = ∑_y ∑_x g(x) h(y) p_{X,Y}(x, y)
                 = ∑_y ∑_x g(x) h(y) p_X(x) p_Y(y)    (by independence)
                 = ∑_y h(y) p_Y(y) (∑_x g(x) p_X(x))
                 = (∑_x g(x) p_X(x)) (∑_y h(y) p_Y(y))
                 = E[g(X)] E[h(Y)]


Example 13.2. Hash tables as a variation of the coupon collector's problem.
Problem: Yes, hash table problems can be a variation of the coupon collector's problem! Consider a hash table with k buckets, where each string hashes to a bucket chosen uniformly at random. What is the expected number of strings to hash until each bucket has at least 1 string?
Solution: Define Y as the number of strings to hash until each bucket has at least 1 string. We want to compute E[Y]. Let us also define Y_i to be the number of trials (strings) until the next success, after we've seen our i-th success. For example, Y_0 is the number of strings hashed until our first hash into an empty bucket (we start with k empty buckets), and Y_1 is the number of additional strings to hash until we hash into an empty bucket (we have 1 non-empty bucket and k − 1 empty buckets), etc. In the general case, we have i non-empty buckets and k − i empty buckets after the i-th success, and we are successful if we hash a string to one of the k − i empty buckets. The probability of success is then p_i = (k − i)/k. With this definition of Y_i, we have Y_i ∼ Geo(p_i), and E[Y_i] = 1/p_i = k/(k − i).

Note that Y = Y_0 + Y_1 + Y_2 + · · · + Y_{k−1}. We can show the following:

    E[Y] = E[∑_{i=0}^{k−1} Y_i] = ∑_{i=0}^{k−1} E[Y_i]
         = ∑_{i=0}^{k−1} k/(k − i) = k/k + k/(k − 1) + k/(k − 2) + · · · + k/1
         = k (1/k + 1/(k − 1) + · · · + 1) = O(k log k)


13.5 Covariance and Correlation

Consider the two multivariate distributions shown in figures 13.1 and 13.2. In both images I have plotted one thousand samples drawn from the underlying joint distribution. Clearly the two distributions are different. However, the mean and variance are the same in both the x and the y dimension. What is different?

[Figure 13.1. Two independent normal random variables.]
[Figure 13.2. Two normal random variables with the same mean and variance as figure 13.1.]

Covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean. It is a mathematical relationship that is defined as:

    Cov(X, Y) = E[(X − E[X]) · (Y − E[Y])]    (13.11)

That is a little hard to wrap your mind around (but worth pushing on a bit). The outer expectation will be a weighted sum of the inner function evaluated at a particular (x, y), weighted by the probability of (x, y). If x and y are both above their respective means, or if x and y are both below their respective means, that term will be positive. If one is above its mean and the other is below, the term is negative. If the weighted sum of terms is positive, the two random variables will have a positive correlation. We can rewrite the above equation to get an equivalent equation:

    Cov(X, Y) = E[XY] − E[X] E[Y]    (13.12)

Using this equation (and the product lemma) it is easy to see that if two random variables are independent their covariance is 0. The reverse is not true in general. Here is the derivation of equation (13.12):

    Cov(X, Y) = E[(X − E[X]) · (Y − E[Y])]
              = E[XY − X E[Y] − E[X] Y + E[X] E[Y]]
              = E[XY] − E[X E[Y]] − E[E[X] Y] + E[E[X] E[Y]]
              = E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y]
              = E[XY] − E[X] E[Y]
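Equations (13.11) and (13.12) can be checked numerically from samples; a minimal sketch (the linear relationship between x and y below is our own construction):

using Statistics

n = 100_000
x = randn(n)              # standard normal samples
y = x .+ 0.5 .* randn(n)  # y deviates together with x by construction

# Covariance computed both ways; each should be close to Var(X) = 1
cov_def = mean((x .- mean(x)) .* (y .- mean(y)))  # equation (13.11)
cov_alt = mean(x .* y) - mean(x) * mean(y)        # equation (13.12)
println((cov_def, cov_alt))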


13.6 Properties of Covariance

Say that X and Y are arbitrary random variables:

    Cov(X, Y) = Cov(Y, X)    (13.13)
    Cov(X, X) = E[X^2] − E[X] E[X] = Var(X)    (13.14)
    Cov(∑_i X_i, ∑_j Y_j) = ∑_i ∑_j Cov(X_i, Y_j)    (13.15)
    Cov(aX + b, Y) = a Cov(X, Y)    (13.16)

Let X = X_1 + X_2 + · · · + X_n and let Y = Y_1 + Y_2 + · · · + Y_m. The covariance of X and Y is:

    Cov(X, Y) = ∑_{i=1}^n ∑_{j=1}^m Cov(X_i, Y_j)    (13.17)
    Cov(X, X) = Var(X) = ∑_{i=1}^n ∑_{j=1}^n Cov(X_i, X_j)    (13.18)
              = ∑_{i=1}^n Var(X_i) + 2 ∑_{i=1}^n ∑_{j=i+1}^n Cov(X_i, X_j)    (13.19)

That last property gives us a third way to calculate variance. You could use this
definition to calculate the variance of the binomial.
For any random variables X and Y, the variance of the sum necessarily includes
covariance (unless the two random variables are independent):

Var( X + Y ) = Cov( X + Y, X + Y )
= Cov( X, X ) + Cov( X, Y ) + Cov(Y, X ) + Cov(Y, Y )
= Var( X ) + 2 Cov( X, Y ) + Var(Y )

More generally:

    Var(∑_{i=1}^n X_i) = ∑_{i=1}^n Var(X_i) + 2 ∑_{i=1}^n ∑_{j=i+1}^n Cov(X_i, X_j)

But for independent X and Y:

E[ XY ] = E[ X ]E[Y ] (13.20)
Var( X + Y ) = Var( X ) + Var(Y ) (13.21)


13.7 Correlation

Covariance is interesting because it is a quantitative measurement of the relationship between two variables. Correlation between two random variables, ρ(X, Y), is the covariance of the two variables normalized by the standard deviation of each variable. This normalization cancels the units out and normalizes the measure so that it is always in the range [−1, 1]:

    ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = Cov(X, Y) / (σ_X σ_Y)    (13.22)

where σ_X = √Var(X) and σ_Y = √Var(Y).

Correlation measures linearity between X and Y:

    ρ(X, Y) = 1     Y = aX + b, where a = σ_Y/σ_X
    ρ(X, Y) = −1    Y = aX + b, where a = −σ_Y/σ_X
    ρ(X, Y) = 0     absence of linear relationship

Some one-directional implications regarding independence and correlation:

    independence ⟹ no correlation        dependence ⇏ correlation
    correlation ⟹ dependence             no correlation ⇏ independence

But, ‘‘correlation does not imply causation.’’

If ρ(X, Y) = 0 we say that X and Y are uncorrelated. If two variables are independent, then their correlation will be 0; this follows since Cov(X, Y) = E[XY] − E[X]E[Y] = E[X]E[Y] − E[X]E[Y] = 0. However, it doesn't go the other way: a correlation of 0 does not imply independence.
When people use the term correlation, they are actually referring to a specific type of correlation called ‘‘Pearson’’ correlation. It measures the degree to which there is a linear relationship between the two variables. An alternative measure is ‘‘Spearman’’ correlation, which has a formula almost identical to your regular correlation score, with the exception that the underlying random variables are first transformed into their ranks.

Example 13.3. Zero covariance does not imply independence (directionality matters).
Somewhat paradoxically, though independence implies zero covariance, zero covariance does not imply independence. Consider the following example: you have a discrete random variable X with PMF P(X = x) = 1/3 for x ∈ {−1, 0, 1}, and we then define another random variable Y = X^2. Clearly X and Y are not independent (Y is completely determined by X), but they do have zero covariance: since XY = X^3 = X for these values and E[X] = 0,

    Cov(X, Y) = E[XY] − E[X] E[Y] = E[X] − E[X] E[Y] = 0 − 0 = 0

In practice we tend to report correlation rather than covariance, because the units of covariance don't matter so much as the sign.


14 Conditional Expectation

From notes by Chris Piech and Lisa Yan.

14.1 Conditional Distributions

Previously we looked at conditional probabilities for events. Here we formally go over conditional probabilities for random variables. The equations for the discrete case are an intuitive extension of our understanding of conditional probability:

Discrete: The conditional probability mass function (PMF) for the discrete case:

    p_{X|Y}(x | y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = p_{X,Y}(x, y) / p_Y(y)    (14.1)

The conditional cumulative distribution function (CDF) for the discrete case:

    F_{X|Y}(a | y) = P(X ≤ a | Y = y) = (∑_{x≤a} p_{X,Y}(x, y)) / p_Y(y) = ∑_{x≤a} p_{X|Y}(x | y)    (14.2)

14.2 Conditional Expectation

We have gotten to know a kind and gentle soul, conditional probability. And
we know another funky fool, expectation. Let’s get those two crazy kids to play
together.
Let X and Y be jointly discrete random variables. We define the conditional expectation of X given Y = y to be:

    E[X | Y = y] = ∑_x x · p_{X|Y}(x | y)    (14.3)

14.3 Properties of Conditional Expectation

Here are some helpful, intuitive properties of conditional expectation:

    E[g(X) | Y = y] = ∑_x g(x) p_{X|Y}(x | y)    if X and Y are discrete
    E[g(X) | Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x | y) dx    if X and Y are continuous
    E[∑_{i=1}^n X_i | Y = y] = ∑_{i=1}^n E[X_i | Y = y]

14.4 Law of Total Expectation

The law of total expectation states that:

E[E[ X | Y ]] = E[ X ] (14.4)

What?! How is that a thing? Check out this proof:

    E[E[X | Y]] = E[g(Y)] = ∑_y E[X | Y = y] P(Y = y)
                = ∑_y ∑_x x P(X = x | Y = y) P(Y = y)
                = ∑_y ∑_x x P(X = x, Y = y)
                = ∑_x ∑_y x P(X = x, Y = y)
                = ∑_x x ∑_y P(X = x, Y = y)
                = ∑_x x P(X = x)
                = E[X]

If we only have a conditional PMF of X on some discrete variable Y, we can


compute E[ X ] as follows:

1. Compute conditional expectation of X given some value of Y = y

2. Repeat step 1 for all values of Y = y

3. Compute a weighted sum (where weights are P(Y = y))


Example 14.1. Conditional expectation of dice rolls.
You roll two 6-sided dice D_1 and D_2. Let S = D_1 + D_2.

• What is E[S | D_2 = 6]?

    E[S | D_2 = 6] = ∑_x x P(S = x | D_2 = 6)
                   = (1/6)(7 + 8 + 9 + 10 + 11 + 12) = 57/6 = 9.5

  This makes intuitive sense, since 6 + E[value of D_1] = 6 + 3.5.

• What is E[S | D_2 = d_2], where d_2 = 1, . . . , 6? Note that S = D_1 + D_2 and that D_1 and D_2 are independent.

    E[S | D_2 = d_2] = E[D_1 + D_2 | D_2 = d_2] = E[D_1 + d_2 | D_2 = d_2]
                     = d_2 + E[D_1 | D_2 = d_2]    (d_2 is a constant with respect to D_1)
                     = d_2 + ∑_{d_1} d_1 P(D_1 = d_1 | D_2 = d_2)
                     = d_2 + ∑_{d_1} d_1 P(D_1 = d_1)    (D_1, D_2 are independent)
                     = d_2 + 3.5

  Note that E[S | D_2 = d_2] depends on the value d_2. In other words, E[S | D_2] is a function of the random variable D_2.


Example 14.2. Expected return of the recursive function recurse.
Consider the following code with random numbers:

function recurse()
    x = rand(1:3)  # x is 1, 2, or 3, with equal probability
    if x == 1
        return 3
    elseif x == 2
        return 5 + recurse()
    else
        return 7 + recurse()
    end
end

Let Y be the value returned by recurse(). What is E[Y]? In other words, what is the expected return value? Note that this is the exact same approach as calculating the expected run time.

    E[Y] = E[Y | X = 1] P(X = 1) + E[Y | X = 2] P(X = 2) + E[Y | X = 3] P(X = 3)

First let's calculate each of the conditional expectations (note that each recursive call returns an independent draw of Y):

    E[Y | X = 1] = 3
    E[Y | X = 2] = E[5 + Y] = 5 + E[Y]
    E[Y | X = 3] = E[7 + Y] = 7 + E[Y]

Now we can plug those values into the equation. Note that the probability of X taking on 1, 2, or 3 is 1/3:

    E[Y] = 3(1/3) + (5 + E[Y])(1/3) + (7 + E[Y])(1/3)
         = 5 + (2/3) E[Y]

Solving, (1/3) E[Y] = 5, so E[Y] = 15.
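A Monte Carlo sanity check of this answer, reusing the recurse function from above: the sample mean of many runs should be close to 15.

using Statistics

println(mean(recurse() for _ in 1:1_000_000))  # ≈ 15.0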


Example 14.3. How many software engineers do you have to interview before hiring the optimal one?
You are interviewing n software engineer candidates and will hire only 1 candidate. All orderings of candidates are equally likely. Right after each interview you must decide to hire or not hire; you cannot go back on a decision. At any point in time you know the relative ranking of the candidates you have already interviewed.
The strategy that we propose is to interview the first k candidates and reject them all, then hire the next candidate who is better than all of the first k candidates. What is the probability that the best of all the n candidates is hired for a particular choice of k? Let's denote that result P_k(best). Let X be the position in the ordering of the best candidate:

    P_k(best) = ∑_{i=1}^n P_k(best | X = i) P(X = i)
              = (1/n) ∑_{i=1}^n P_k(best | X = i)    (since each position is equally likely)

What is P_k(best | X = i)? If i ≤ k then the probability is 0, because the best candidate will be rejected without consideration. Sad times. Otherwise we will choose the best candidate, who is in position i, only if the best of the first i − 1 candidates is among the first k interviewed. If the best among the first i − 1 is not among the first k, that candidate will be chosen over the true best. Since all orderings are equally likely, the probability that the best among the first i − 1 candidates is in the first k is k/(i − 1), for i > k. Plugging back in:

    P_k(best) = (1/n) ∑_{i=1}^n P_k(best | X = i)
              = (1/n) ∑_{i=k+1}^n k/(i − 1)    (since we know P_k(best | X = i))
              ≈ (1/n) ∫_{k+1}^n k/(i − 1) di    (by Riemann sum approximation)
              = (k/n) [ln(i − 1)]_{k+1}^n = (k/n) ln((n − 1)/k) ≈ (k/n) ln(n/k)

If we think of P_k(best) = (k/n) ln(n/k) as a function of k, we can find the value of k that optimizes it by taking its derivative and setting it equal to 0. The optimal value of k is n/e, where e is Euler's number.
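A simulation sketch of this strategy (our choice n = 100 is illustrative): the probability of hiring the overall best candidate with k = n/e should be close to 1/e ≈ 0.37.

using Random, Statistics

# Return true if the reject-first-k strategy hires the single best candidate
function hires_best(n, k)
    quality = randperm(n)            # quality[i] of candidate i; n is the best
    best_of_first_k = maximum(quality[1:k])
    for i in k+1:n
        if quality[i] > best_of_first_k
            return quality[i] == n   # hire candidate i; did we get the best?
        end
    end
    return false                     # nobody after position k beat the first k
end

n = 100
k = round(Int, n / exp(1))           # k = n/e ≈ 37
println(mean(hires_best(n, k) for _ in 1:100_000))  # ≈ 0.37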



15 Inference

From notes by Chris Piech and Lisa Yan.

At this point in CS109 we have developed tools for analytically solving for probabilities. We can calculate the likelihood of random variables taking on values,
even if they are interacting with other random variables (which we have called
multivariate models, or we say the random variables are jointly distributed). We
have also started to study samples and sampling.
As a capstone for this part of the class I would like to consider the task of general
inference in the world of disease prediction. A website, WebMd Symptom Checker,
exemplifies our task. They have built a probabilistic model with random variables
which roughly fall under three categories: symptoms, risk factors and diseases.
For any combination of observed symptoms and risk factors, they can calculate
the probability of any disease. For example, they can calculate the probability
that I have influenza given that I am a 21-year-old female who has a fever and
who is tired: P( I = 1 | A = 21, G = 1, T = 1, F = 1). Or they could calculate the
probability that I have a cold given that I am a 30-year-old with a runny nose:
P(C = 1 | A = 30, R = 1). At first blush this might not seem difficult. But as
we dig deeper we will realize just how hard it is. There are two challenges: (1)
sufficiently specifying the probabilistic model and (2) calculating any desired
probability.

15.1 Bayesian Networks

Before we jump into how to solve probability (aka inference) questions, let’s
take a moment to go over how an expert doctor could specify the relationship
between so many random variables. Ideally we could have our expert sit down
and specify the entire ‘‘joint distribution’’ (see the first lecture on multivariate
models). She could do so either by writing a single equation that relates all the

variables (which is as impossible as it sounds), or she could come up with a joint distribution table where she specifies the probability of any possible combination of assignments to variables. It turns out that is not feasible either. Why? Imagine there are N = 100 binary random variables in our WebMD model. Our expert doctor would have to specify a probability for each of the 2^N > 10^30 combinations of assignments to those variables, which is an astronomically large number. Thankfully, there is a better way. We can simplify our task if we know the generative process that creates a joint assignment. Based on the generative process we can make a data structure known as a Bayesian network. Here are two networks of random variables for diseases:

[Figure 15.1. Full disease model, where flow of influence is directed: demographics (Age, Uni, . . . , Gender) influence conditions (Cold, H1N1, Influenza, . . . , Mono), which influence symptoms (Fever, Tired, Phlegm, . . . , Runny Nose).]

[Figure 15.2. Bayesian network for the simple disease model: Uni points to Influenza and Tired; Influenza points to Fever and Tired.]

For diseases, the flow of influence is directed. The states of ‘‘demographic’’ random variables influence whether someone has particular ‘‘conditions’’, which influence whether someone shows particular ‘‘symptoms’’. The second network is a simple model with only four random variables. Though this is a less interesting model, it is easier to understand when first learning Bayesian networks. Being in university (binary) influences whether or not someone has influenza (binary). Having influenza influences whether or not someone has a fever (binary), and the state of university and influenza influences whether or not someone feels tired (also binary).
In a Bayesian network, an arrow from random variable X to random variable Y articulates our assumption that X directly influences the likelihood of Y. We say that X is a parent of Y. To fully define the Bayesian network we must provide a way to compute the probability of each random variable X_i conditioned on knowing the value of all of its parents: P(X_i = k | parents of X_i take on specified values). Here is a concrete example of what needs to be defined for the simple disease model. Recall that each of the random variables is binary:

    P(Uni = 1) = 0.8
    P(Influenza = 1 | Uni = 1) = 0.2        P(Fever = 1 | Influenza = 1) = 0.9
    P(Influenza = 1 | Uni = 0) = 0.1        P(Fever = 1 | Influenza = 0) = 0.05
    P(Tired = 1 | Uni = 0, Influenza = 0) = 0.1
    P(Tired = 1 | Uni = 0, Influenza = 1) = 0.9
    P(Tired = 1 | Uni = 1, Influenza = 0) = 0.8
    P(Tired = 1 | Uni = 1, Influenza = 1) = 1.0


Let's put this in programming terms. All that we need to do in order to code up a Bayesian network is to define a function getProbXi(i, k, parents), which returns the probability that X_i (the random variable with index i) takes on the value k given a value for each of the parents of X_i encoded by parents: P(X_i = k_i | parents of X_i take on specified values).
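Here is a minimal sketch of such a function for the simple disease model, using the probabilities defined above (the indexing convention 1 = Uni, 2 = Influenza, 3 = Fever, 4 = Tired and the Dict encoding of parents are our own choices):

# parents maps a parent's index to its assigned value (0 or 1)
function getProbXi(i, k, parents)
    if i == 1      # P(Uni = 1)
        p1 = 0.8
    elseif i == 2  # P(Influenza = 1 | Uni)
        p1 = parents[1] == 1 ? 0.2 : 0.1
    elseif i == 3  # P(Fever = 1 | Influenza)
        p1 = parents[2] == 1 ? 0.9 : 0.05
    else           # P(Tired = 1 | Uni, Influenza), from the table above
        p1 = [0.1 0.9; 0.8 1.0][parents[1] + 1, parents[2] + 1]
    end
    return k == 1 ? p1 : 1 - p1
end

# For example, P(Fever = 0 | Influenza = 1):
println(getProbXi(3, 0, Dict(2 => 1)))  # ≈ 0.1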

Deeper understanding: The reason that a Bayes net is so useful is that the ‘‘joint’’ probability can be expressed in exponentially less space as the product of the probabilities of each random variable conditioned on its parents! Without loss of generality, let X_i refer to the i-th random variable (such that if X_i is a parent of X_j then i < j):

    P(joint) = P(X_1 = k_1, . . . , X_n = k_n)
             = ∏_i P(X_i = k_i | parents of X_i take on specified values)    (15.1)

Using the chain rule we can decompose the joint probability. To make the following math easier to digest, we use k_i as shorthand for the event that X_i = k_i:

    P(k_1, . . . , k_n) = P(k_n | k_{n−1}, . . . , k_1) P(k_{n−1} | k_{n−2}, . . . , k_1) · · · P(k_2 | k_1) P(k_1)    (chain rule)
                       = ∏_i P(k_i | k_{i−1}, . . . , k_1)    (change in notation)
                       = ∏_i P(k_i | parents of X_i take on their values)    (implied by Bayes net)

The central assumption made is:

    P(k_i | k_{i−1}, . . . , k_1) = P(k_i | parents of X_i take on their values)

In other words, each random variable is conditionally independent of its non-


descendants, given its parents.
In the next part of CS109 we are going to talk about how we could learn such probabilities from data. For now let's start with the (reasonable) assumption that an expert can write getProbXi. We haven't talked about continuous or multinomial random variables in Bayes nets. None of the theory changes: the expert will just have to define getProbXi to handle more values of k than 0 or 1.


Great! We have a feasible way to define a large network of random variables. First challenge complete. However, a Bayesian network is not very interesting to us unless we can use it to answer different conditional probability questions.

15.2 General Inference via Sampling the Joint Distribution

Now we have a reasonable way to specify the joint probability of a network of


many random variables. Before we celebrate, realize that we still don’t know how
to use such a network to answer probability questions. There are many techniques
for doing so. I am going to introduce you to one of the great ideas in probability for
computer science: We can use sampling to solve inference questions on Bayesian
networks. Sampling is frequently used in practice because it is relatively easy to
understand and easy to implement.
As a warmup consider what it would take to sample an assignment to each of
the random variables in our Bayes net. Such a sample is often called a particle (as
in a particle of sand). To sample a particle, simply sample a value for each random
variable one at a time based on the value of the random variable’s parents. This
means that if Xi is a parent of X j , you will have to sample a value for Xi before
you sample a value for X j .
Let’s work through an example of sampling a ‘‘particle’’ for the Simple Disease
Model in the previous section:

1. Sample from P(Uni = 1): Ber(0.8). Sampled value for Uni is 1.

2. Sample from P(Influenza = 1 | Uni = 1): Ber(0.2). Sampled value for Influenza is 0.

3. Sample from P(Fever = 1 | Influenza = 0): Ber(0.05). Sampled value for Fever is 0.

4. Sample from P(Tired = 1 | Uni = 1, Influenza = 0): Ber(0.8). Sampled value for Tired is 0.

Thus the sampled particle is: [Uni = 1, Influenza = 0, Fever = 0, Tired = 0]. If
we were to run the process again we would get a new particle (with likelihood
determined by the joint probability).
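Here is what that warmup looks like in code — a sketch of drawing one particle from the simple disease model (the helper bernoulli and the CPT values repeat the numbers from section 15.1):

bernoulli(p) = rand() < p ? 1 : 0  # sample a 0/1 value with P(1) = p

function sample_particle()
    uni   = bernoulli(0.8)
    flu   = bernoulli(uni == 1 ? 0.2 : 0.1)
    fever = bernoulli(flu == 1 ? 0.9 : 0.05)
    tired = bernoulli([0.1 0.9; 0.8 1.0][uni + 1, flu + 1])
    return (uni = uni, flu = flu, fever = fever, tired = tired)
end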


Now our strategy is simple: we are going to generate N samples where N is in


the hundreds of thousands (if not millions). Then we can compute probability
queries by counting. Let N (X = k) be notation for the number of particles where
random variables X take on values k. Recall that the bold notation X means
that X is a vector with one or more elements. By the ‘‘frequentist’’ definition of
probability:
    P(X = k) = N(X = k) / N

Counting for the win! But what about conditional probabilities? Well, using the definition of conditional probability, we can see it's still some pretty straightforward counting:

    P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b) = [N(X = a, Y = b)/N] / [N(Y = b)/N] = N(X = a, Y = b) / N(Y = b)

Let’s take a moment to recognize that this is straight-up fantastic. General infer-
ence based on analytic probability (math without samples) is hard even given a
Bayesian network (if you don’t believe me, try to calculate the probability of flu
conditioning on one demographic and one symptom in the Full Disease Model).
However if we generate enough samples we can calculate any conditional proba-
bility question by reducing our samples to the ones that are consistent with the
condition (Y = b) and then counting how many of those are also consistent with
the query (X = a). Here is the algorithm in code:

Algorithm 15.1. Sampling to get an approximate probability.

N = 10000

# query: the assignment to variables we want probabilities for
# condition: the assignments to variables we will condition on
function get_any_probability(query, condition)
    particles = generate_many_joint_samples(N)
    cond_particles = reject_non_consistent(particles, condition)
    K = count_consistent_samples(cond_particles, query)
    return K / length(cond_particles)
end
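Filling in the pieces for the simple disease model with the sample_particle function from before, a concrete sketch might look like this (the query — influenza given fever and tiredness — and the sample count are illustrative):

# Estimate P(Influenza = 1 | Fever = 1, Tired = 1) by rejection sampling
particles = [sample_particle() for _ in 1:1_000_000]
consistent = filter(p -> p.fever == 1 && p.tired == 1, particles)
println(count(p -> p.flu == 1, consistent) / length(consistent))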

This algorithm is sometimes called rejection sampling because it works by gen-


erating many particles from the joint distribution and rejecting the ones that are
not consistent with the set of assignments we are conditioning on. Of course
this algorithm is an approximation, though with enough samples it often works


out to be a very good approximation. However, in cases where the event we’re
conditioning on is rare enough that it doesn’t occur after millions of samples are
generated, our algorithm will not work. The last line of our code will result in a
divide by 0 error. See the next section for solutions!

15.3 General Inference when Conditioning on Rare Events

Rejection sampling is a powerful technique that takes advantage of computational power. But it doesn't always work. In fact it doesn't work any time the probability of the event we are conditioning on is rare enough that we are unlikely to ever produce samples that exactly match the event. The simplest example is with continuous random variables. Consider the Simple Disease Model. Let's change Fever from being a binary variable to being a continuous variable. To do so the only thing we need to do is re-specify the likelihood of fever given assignments to its parents (influenza). Let's say that the likelihoods come from the normal PDF:

    if Influenza = 0, then Fever ∼ N(µ = 98.3, σ² = 0.7)
    ∴ f(Fever = x) = (1/√(2π · 0.7)) e^{−(x − 98.3)²/(2 · 0.7)}

    if Influenza = 1, then Fever ∼ N(µ = 100.0, σ² = 1.8)
    ∴ f(Fever = x) = (1/√(2π · 1.8)) e^{−(x − 100.0)²/(2 · 1.8)}
Drawing samples (aka particles) is still straightforward. We apply the same
process until we get to the step where we sample a value for the Fever random
variable (in the example from the previous section that was step 3). If we had
sampled a 0 for influenza we draw a value for fever from the normal for healthy
adults (which has µ = 98.3). If we had sampled a 1 for influenza we draw a value
for fever from the normal for adults with the flu (which has µ = 100.0). The
problem comes in the ‘‘rejection’’ stage of sampling the joint distribution.
When we sample values for fever we get numbers with arbitrary precision (e.g., 100.819238). If we condition on someone having a fever equal to 101, we would reject every single particle. Why? No particle will have a fever of exactly 101. There are several ways to deal with this problem. One especially easy solution is to be less strict when rejecting particles: we could round all fevers to whole numbers.


There is an algorithm called likelihood weighting which sometimes helps, but which we don't cover in CS109. Instead, in class we talked about a new algorithm called Markov Chain Monte Carlo (MCMC) that allows us to sample from the ‘‘posterior’’ probability: the distribution of the random variables after (post) fixing the variables in the conditioned event. The version of MCMC we talked about is called Gibbs sampling. While I don't require that students in CS109 know how to implement Gibbs sampling, I wanted everyone to know that it exists and that it isn't beyond your capabilities. If you need to use it, you can learn it given the knowledge you have now.
MCMC does require more math than rejection sampling. For every random variable you will need to specify how to calculate the likelihood of assignments given the variable's parents, children, and parents of its children (a set of variables cozily called a ‘‘Markov blanket’’). Want to learn more? Take CS221, CS228, or CS238!

Thoughts: While there are slightly-more-powerful ‘‘general inference algorithms’’ that you will get to learn in the future, it is worth recognizing that at this point we have reached an important milestone in CS109. You can take very complicated probability models (encoded as Bayesian networks) and answer general inference queries on them. To get there we worked through the concrete example of predicting disease. While the WebMd website is great for home users, similar probability models are being used in thousands of hospitals around the world. As you are reading this, general inference is being used to improve health care (and sometimes even save lives) for real human beings. That's some probability for computer scientists that is worth learning. Now there is just one last question for us to look into in CS109. What if we don't have an expert? Could we learn those probabilities from data?



16 Continuous Joint Distributions

From notes by Chris Piech and Lisa Yan.

Of course joint random variables don't have to be discrete; they can also be continuous.
As an example: consider throwing darts at a dart board. Because a dart board is
two dimensional, it is natural to think about the X location of the dart and the
Y location of the dart as two random variables that are varying together (aka
they are joint). However since x and y positions are continuous we are going
to need new language to think about the likelihood of different places a dart
could land. Just like in the non-joint case continuous is a little tricky because it
isn’t easy to think about the probability that a dart lands at a location defined
to infinite precision. What is the probability that a dart lands at exactly ( X =
456.234231234122355, Y = 532.12344123456)?
Let's build some intuition by first starting with discretized grids. Imagine that where your dart lands is one of 25 different cells in a grid. We could reason about the probabilities now! But we have lost all nuance about how likelihood is changing within a given cell. If we make our cells smaller and smaller, we eventually will get a second derivative of probability: once again, a probability density function. If we integrate under this joint density function in both the x and y dimensions, we will get the probability that x takes on the values in the integrated range and y takes on the values in the integrated range. Random variables X and Y are jointly continuous if there exists a probability density function (PDF) f_{X,Y} such that:

    P(a_1 < X ≤ a_2, b_1 < Y ≤ b_2) = ∫_{a_1}^{a_2} ∫_{b_1}^{b_2} f_{X,Y}(x, y) dy dx    (16.1)

Using the PDF we can compute marginal probability densities:

    f_X(a) = ∫_{−∞}^{∞} f_{X,Y}(a, y) dy    (16.2)
    f_Y(b) = ∫_{−∞}^{∞} f_{X,Y}(x, b) dx    (16.3)

16.1 Independence with Multiple RVs (Continuous Case)

Two continuous random variables X and Y are called independent if:

P( X ≤ a, Y ≤ b) = P( X ≤ a) P(Y ≤ b) for all a, b

This can be stated equivalently as:

    F_{X,Y}(a, b) = F_X(a) F_Y(b)    for all a, b
    f_{X,Y}(a, b) = f_X(a) f_Y(b)    for all a, b

The second form follows from the first by differentiating:

    f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y) = ∂²/∂x∂y F_X(x) F_Y(y) = (∂/∂x F_X(x)) (∂/∂y F_Y(y)) = f_X(x) f_Y(y)

More generally, if you can factor the joint density function, then your continuous random variables are independent:

    f_{X,Y}(x, y) = g(x) h(y)    where −∞ < x, y < ∞

16.2 Joint CDFs

For two random variables X and Y that are jointly distributed, the joint cumulative distribution function F_{X,Y} can be defined as:

    F_{X,Y}(a, b) = P(X ≤ a, Y ≤ b)
    F_{X,Y}(a, b) = ∑_{x≤a} ∑_{y≤b} p_{X,Y}(x, y)    (X, Y discrete)
    F_{X,Y}(a, b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f_{X,Y}(x, y) dy dx    (X, Y continuous)
    f_{X,Y}(a, b) = ∂²/∂a∂b F_{X,Y}(a, b)    (X, Y continuous)

It can be shown via geometry that to calculate probabilities of joint distributions, we can use the CDF as follows, for both jointly discrete and continuous RVs:

    P(a_1 < X ≤ a_2, b_1 < Y ≤ b_2) = F_{X,Y}(a_2, b_2) − F_{X,Y}(a_1, b_2) − F_{X,Y}(a_2, b_1) + F_{X,Y}(a_1, b_1)


[Figure: samples from four bivariate normal distributions, each with µ = [0, 0] and covariance matrices Σ = [1 0; 0 1], Σ = [1 0.75; 0.75 1], Σ = [1 −0.75; −0.75 1], and Σ = [3 0; 0 3], respectively.]

16.3 Bivariate Normal Distribution

Many times, we talk about multiple Normal (Gaussian) random variables, otherwise known as Multivariate Normal (Gaussian) distributions. Here, we talk about the two-dimensional case, called a Bivariate Normal Distribution. Variables X_1 and X_2 follow a bivariate normal distribution if their joint PDF is:

    f_{X_1,X_2}(x_1, x_2) = 1/(2π σ_1 σ_2 √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) [ (x_1 − µ_1)²/σ_1² − 2ρ(x_1 − µ_1)(x_2 − µ_2)/(σ_1 σ_2) + (x_2 − µ_2)²/σ_2² ] }

[Figure: samples from a bivariate normal with µ = [0, 0] and Σ = [4 0; 0 4].]

We often write the distribution of the vector X = (X_1, X_2) as X ∼ N(µ, Σ), where µ = (µ_1, µ_2) is a mean vector and

    Σ = [σ_1², ρσ_1σ_2; ρσ_1σ_2, σ_2²] = [Cov(X_1, X_1), Cov(X_1, X_2); Cov(X_2, X_1), Cov(X_2, X_2)]

is a covariance matrix (recall that Cov(X, X) = Var(X)). Note that ρ is the correlation between X_1 and X_2, and σ_1, σ_2 > 0. We defer to Ross Chapter 6, Example 5d, for the full proof, but it can be shown that the marginal distributions of X_1 and X_2 are X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²), respectively.


Example 16.1. Multivariate Gaussian distribution of uncorrelated random variables.
Let X = (X_1, X_2) ∼ N(µ, Σ), where µ = (µ_1, µ_2) and Σ = [σ_1², 0; 0, σ_2²], a diagonal covariance matrix. Note the correlation between X_1 and X_2 is ρ = 0:

    f_{X_1,X_2}(x_1, x_2) = 1/(2π σ_1 σ_2) · exp{ −(1/2) [ (x_1 − µ_1)²/σ_1² + (x_2 − µ_2)²/σ_2² ] }
                          = [1/(σ_1 √(2π)) e^{−(x_1−µ_1)²/(2σ_1²)}] · [1/(σ_2 √(2π)) e^{−(x_2−µ_2)²/(2σ_2²)}]

where the first factor involves only x_1 and the second involves only x_2. In other words, for Bivariate Normal RVs, if Cov(X_1, X_2) = 0, then X_1 and X_2 are independent. Wild!

Example 16.2. Gaussian blur using a bivariate normal distribution (i.e. multivariate) with a weight matrix.
Let's make a weight matrix used for Gaussian blur. Each location in the weight matrix will be given a weight based on the probability density of the area covered by that grid square in a Bivariate Normal of independent X and Y, each zero mean with variance σ². For this example let's blur using σ = 3.
Each pixel is given a weight equal to the probability that X and Y are both within the pixel bounds. The center pixel covers the area where −0.5 ≤ x ≤ 0.5 and −0.5 ≤ y ≤ 0.5. What is the weight of the center pixel?

    P(−0.5 < X < 0.5, −0.5 < Y < 0.5)
        = P(X < 0.5, Y < 0.5) − P(X < 0.5, Y < −0.5) − P(X < −0.5, Y < 0.5) + P(X < −0.5, Y < −0.5)
        = Φ(0.5/3) · Φ(0.5/3) − 2 Φ(0.5/3) · Φ(−0.5/3) + Φ(−0.5/3) · Φ(−0.5/3)
        = 0.5662² − 2 · 0.5662 · 0.4338 + 0.4338² ≈ 0.0175
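A quick numerical check of this weight — a minimal sketch assuming the third-party Distributions.jl package for the standard normal CDF Φ:

using Distributions

Φ(x) = cdf(Normal(0, 1), x)

# Weight of the center pixel for independent X, Y ~ N(0, 3^2)
w = (Φ(0.5 / 3) - Φ(-0.5 / 3))^2
println(w)  # ≈ 0.0175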



17 Continuous Joint Distributions II

From notes by Chris Piech and Lisa Yan.

17.1 Convolution: Sum of Independent Random Variables

Remember how deriving the sum of two independent Poisson random variables
was tricky? When we move into integral land, the concept of convolution still
carries over, and once you get a handle on notation, then computing the sum of
two independent, jointly continuous random variables becomes fun. For some
definition of fun. . .

17.2 Independent Normals

Let’s start with one common case that has a nice form but a difficult derivation:
For any two normal random variables X ∼ N (µ1 , σ12 ) and Y ∼ N (µ2 , σ22 ) the sum
of those two random variables is another normal: X + Y ∼ N (µ1 + µ2 , σ12 + σ22 ).
We won’t derive the approach here, but it involves exponents, integrals, and
completing the square (algebra throwback!).

17.3 General Independent Case

For two general independent random variables (i.e. cases of independent random
variables that don’t fit the above special situations) you can calculate the CDF
or the PDF of the sum of two random variables using the following convolution
formulas:
    F_{X+Y}(a) = P(X + Y ≤ a) = ∫_{−∞}^{∞} F_X(a − y) f_Y(y) dy
    f_{X+Y}(a) = ∫_{−∞}^{∞} f_X(a − y) f_Y(y) dy

This is a direct analogy to the discrete case: replace the integrals with sums, and the PDFs with PMFs.

17.4 Conditional Distributions (Continuous Case)

The conditional probability density function might look a bit wonky, but it works!

Continuous: The conditional probability density function (PDF) for the continuous case:

    f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)    (17.1)

The conditional cumulative distribution function (CDF) for the continuous case:

    F_{X|Y}(a | y) = P(X ≤ a | Y = y) = ∫_{−∞}^{a} f_{X|Y}(x | y) dx    (17.2)
At first glance, dividing one density by another might seem to produce the wrong ‘‘units’’ of probability. Let us verify that it works with our understanding of discrete probability. Recall that for tiny ε, we can approximate:

    P(|X − x| ≤ ε/2) = P(x − ε/2 ≤ X ≤ x + ε/2) = ∫_{x−ε/2}^{x+ε/2} f_X(a) da ≈ f_X(x) ε

This extends to the joint variable case: P(|X − x| ≤ ε_X/2, |Y − y| ≤ ε_Y/2) ≈ f_{X,Y}(x, y) ε_X ε_Y. Then:

    P(|X − x| ≤ ε_X/2 | |Y − y| ≤ ε_Y/2) = P(|X − x| ≤ ε_X/2, |Y − y| ≤ ε_Y/2) / P(|Y − y| ≤ ε_Y/2)    (def. of cond. prob.)
                                         ≈ f_{X,Y}(x, y) ε_X ε_Y / (f_Y(y) ε_Y) = (f_{X,Y}(x, y) / f_Y(y)) ε_X = f_{X|Y}(x | y) ε_X

17.5 Conditional Expectation (Continuous Case)

Conditional expectation in the continuous case is a direct analogy to what we saw in the discrete case. Let X and Y be jointly continuous random variables. We define the conditional expectation of X given Y = y to be:

    E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x | y) dx    (17.3)


Example 17.1. Convolution of two uniform distributions.
What is the PDF of X + Y for independent uniform random variables X ∼ Uni(0, 1) and Y ∼ Uni(0, 1)? First plug in the equation for general convolution of independent random variables:

    f_{X+Y}(a) = ∫_{0}^{1} f_X(a − y) f_Y(y) dy
               = ∫_{0}^{1} f_X(a − y) dy    (because f_Y(y) = 1 on [0, 1])

It turns out that is not the easiest thing to integrate. By trying a few different values of a in the range [0, 2] we can observe that the PDF we are trying to calculate changes form at the point a = 1, and thus will be easier to think about as two cases: a ≤ 1 and a > 1. If we calculate f_{X+Y} for both cases and correctly constrain the bounds of the integral, we get simple closed forms for each case:

    f_{X+Y}(a) = { a        if 0 < a ≤ 1
                 { 2 − a    if 1 < a ≤ 2
                 { 0        otherwise
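A simulation sketch of this triangle-shaped density: estimate f_{X+Y} near a = 0.5 by counting samples in a narrow window (the window width 0.01 is our choice).

using Statistics

n = 1_000_000
s = rand(n) .+ rand(n)  # samples of X + Y for X, Y ~ Uni(0, 1)

# Fraction of samples within ±0.005 of 0.5, divided by the window width,
# approximates the density f(0.5) = 0.5
println(mean(abs.(s .- 0.5) .< 0.005) / 0.01)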


Multivariate Example: In this example we are going to explore the problem of tracking an object in 2D space. The object exists at some (x, y) location, however we are not sure exactly where! Thus we are going to use random variables X and Y to represent location.
We have a prior belief about where the object is. In this example we model our prior over both X and Y as normals which are independently distributed with mean 3 and variance 4. First let's write the prior belief as a joint probability density function:

[Figure: samples from the prior, with µ = [3, 3] and Σ = [4 0; 0 4].]

    f(X = x, Y = y) = f(X = x) · f(Y = y)    (in the prior, X and Y are independent)
                    = (1/√(2 · 4 · π)) e^{−(x−3)²/(2·4)} · (1/√(2 · 4 · π)) e^{−(y−3)²/(2·4)}    (using the PDF equation for normals)
                    = K_1 · e^{−[(x−3)² + (y−3)²]/8}    (all constants are put into K_1)

This combination of normals is called a bivariate distribution. The interesting part about tracking an object is the process of updating your belief about its location based on an observation. Let's say that we get an instrument reading from a sonar that is sitting on the origin. The instrument reports that the object is 4 units away. Our instrument is not perfect: if the true distance is t units away, then the instrument will give a reading which is normally distributed with mean t and variance 1.
Based on this information about the noisiness of our instrument, we can compute the conditional probability of seeing a particular distance reading D, given the true location of the object X, Y. If we knew the object was at location (x, y), we could calculate the true distance to the origin, √(x² + y²), which would give us the mean for the instrument Gaussian:

    f(D = d | X = x, Y = y) = (1/√(2 · 1 · π)) e^{−(d − √(x²+y²))²/(2·1)}    (normal PDF where µ = √(x² + y²))
                            = K_2 · e^{−(d − √(x²+y²))²/2}    (all constants are put into K_2)


How about we try this out on actual numbers. How much more likely is an instrument reading of 1 compared to 2, given that the location of the object is at (1, 1)? The true distance to the origin is √(1² + 1²) = √2:

    f(D = 1 | X = 1, Y = 1) / f(D = 2 | X = 1, Y = 1)
        = [K_2 · e^{−(1 − √2)²/2}] / [K_2 · e^{−(2 − √2)²/2}]    (substituting into the conditional PDF of D)
        = e^{−0.086} / e^{−0.172} ≈ 1.09    (notice how the K_2 cancel out)
At this point we have a prior belief and we have an observation. We would like to compute an updated belief, given that observation. This is a classic Bayes' formula scenario. We are using joint continuous variables, but that doesn't change the math much; it just means we will be dealing with densities instead of probabilities:

    f(X = x, Y = y | D = 4)
        = f(D = 4 | X = x, Y = y) · f(X = x, Y = y) / f(D = 4)    (Bayes using densities)
        = [K_2 · e^{−(4 − √(x²+y²))²/2}] · [K_1 · e^{−[(x−3)² + (y−3)²]/8}] / f(D = 4)    (substituting the prior and the update)
        = (K_1 K_2 / f(D = 4)) · e^{−[(4 − √(x²+y²))²/2 + ((x−3)² + (y−3)²)/8]}    (f(D = 4) is a constant w.r.t. (x, y))
        = K_3 · e^{−[(4 − √(x²+y²))²/2 + ((x−3)² + (y−3)²)/8]}    (K_3 is a new constant)

Wow! That looks like a pretty interesting function! You have successfully computed the updated belief. Plotting the prior next to the posterior shows how beautiful it is: the posterior looks like a 2D normal distribution merged with a circle. But wait, what about that constant? We do not know the value of K_3, and that is not a problem for two reasons: the first reason is that if we ever want to calculate a relative probability of two locations, K_3 will cancel out. The second reason is that if we really wanted to know what K_3 was, we could solve for it.
This math is used every day in millions of applications. If there are multiple observations, the equations can get truly complex (even worse than this one). To represent these complex functions, practitioners often use an algorithm called particle filtering.



18 Central Limit Theorem
From notes by Chris Piech and Lisa Yan.

18.1 The Theory

The central limit theorem says that the average of samples from any distribution is itself approximately normally distributed. Consider independent and identically distributed (IID)¹ random variables X_1, X_2, . . . such that E[X_i] = µ and Var(X_i) = σ². Let:

    X̄ = (1/n) ∑_{i=1}^n X_i    (18.1)

¹ Random variables X_1, . . . , X_n are IID if X_1, . . . , X_n are independent and they all have the same PMF (if discrete) or PDF (if continuous), with E[X_i] = µ and Var(X_i) = σ² for i = 1, . . . , n.

The central limit theorem states:

    X̄ ∼ N(µ, σ²/n)    as n → ∞    (18.2)

It is sometimes expressed in terms of the standard normal, Z:

    Z = (∑_{i=1}^n X_i − nµ) / (σ√n)    as n → ∞    (18.3)
At this point you probably think that the central limit theorem is awesome. But it gets even better. With some algebraic manipulation we can show that if the sample mean of IID random variables is normal, it follows that the sum of equally weighted IID random variables must also be normal. Let's call the sum of IID random variables Ȳ:

    Ȳ = ∑_{i=1}^n X_i = n · X̄    (if we define Ȳ to be the sum of our variables)
      ∼ N(nµ, n² · σ²/n)    (since X̄ is a normal and n is a constant)
      ∼ N(nµ, nσ²)    (by simplifying)

In summary, the central limit theorem explains that both the sample mean of IID variables is normal (regardless of what distribution the IID variables came from) and that the sum of equally weighted IID random variables is normal (again, regardless of the underlying distribution).

Example 18.1. Calculating the probability of winning a dice game using the central limit theorem.
You will roll a 6-sided die 10 times. Let X be the total value of all 10 dice: X = X_1 + X_2 + · · · + X_10. You win the game if X ≤ 25 or X ≥ 45. Use the central limit theorem to calculate the probability that you win. Recall that E[X_i] = 3.5 and Var(X_i) = 35/12.

    P(X ≤ 25 or X ≥ 45) = 1 − P(25.5 ≤ X ≤ 44.5)    (continuity correction)
        = 1 − P( (25.5 − 10(3.5)) / (√(35/12) √10) ≤ (X − 10(3.5)) / (√(35/12) √10) ≤ (44.5 − 10(3.5)) / (√(35/12) √10) )
        ≈ 1 − (2Φ(1.76) − 1) ≈ 2(1 − 0.9608) = 0.0784
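A simulation sketch to check this approximation against the truth:

using Statistics

roll10() = sum(rand(1:6) for _ in 1:10)      # total of 10 die rolls
iswin()  = (x = roll10(); x <= 25 || x >= 45)

println(mean(iswin() for _ in 1:1_000_000))  # ≈ 0.08, close to the CLT's 0.0784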

18.1.1 Gumbel Distribution

A more obscure theorem, the Fisher–Tippett–Gnedenko theorem, tells us about the max of IID random variables. It says that the max of IID exponential or normal random variables will (after appropriate rescaling) be a Gumbel random variable:

    Y ∼ Gumbel(µ, β)    (the max of IID variables)
    f(Y = y) = (1/β) e^{−(z + e^{−z})}, where z = (y − µ)/β    (the Gumbel PDF)


Example 18.2. Algorithm runtime estimation using the central limit theorem.
You want to test the runtime of a new algorithm. You know the variance of the algorithm's runtime, σ² = 4 sec², but you want to estimate the mean, µ = t sec. You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime is t ± 0.5 with 95% certainty? Let X_i be the runtime of the i-th run (for 1 ≤ i ≤ n).

    0.95 = P(−0.5 ≤ (∑_{i=1}^n X_i)/n − t ≤ 0.5)

By the central limit theorem, the standard normal Z must be equal to:

    Z = (∑_{i=1}^n X_i − nµ) / (σ√n) = (∑_{i=1}^n X_i − nt) / (2√n)
Now we rewrite our probability inequality so that the central term is Z:

∑ n Xi
 
0.95 = P −0.5 ≤ i=1 − t ≤ 0.5
n
√ √ 
−0.5 n ∑n X

0.5 n
=P ≤ i =1 i − t ≤
2 n 2
√ √ n √ √ 
−0.5 n n ∑ i = 1 Xi

n 0.5 n
=P ≤ − t≤
2 2 n 2 2
√ n √ √ √ 
−0.5 n ∑ i = 1 Xi

n nt 0.5 n
=P ≤ √ −√ ≤
2 2 n n 2 2
√ n √
−0.5 n ∑ X − nt
 
0.5 n
=P ≤ i=1 √i ≤
2 2 n 2
√ √ 
−0.5 n

0.5 n
=P ≤Z≤
2 2

And now we can find the value of n that makes this equation hold.

    0.95 = Φ(√n/4) − Φ(−√n/4) = Φ(√n/4) − (1 − Φ(√n/4)) = 2Φ(√n/4) − 1
    0.975 = Φ(√n/4)
    Φ^{−1}(0.975) = 1.96 = √n/4 ⟹ n ≈ 61.5

Thus it takes 62 runs. If you are interested in how this extends to cases where the variance is unknown, look into variations of Student's t-test.
19 Samples and the Bootstrap

From notes by Chris Piech and Lisa Yan.

Let's say you are the king of Bhutan and you want to know the average happiness
of the people in your country. You can’t ask every single person, but you could ask
a random subsample. In this next section we will consider principled claims that
you can make based on a subsample. Assume we randomly sample 200 Bhutanese
and ask them about their happiness. Our data looks like this: 72, 85, . . . , 71. You
can also think of it as a collection of n = 200 IID (independent, identically
distributed) random variables X1 , X2 , . . . , Xn .

19.1 Estimating Mean and Variance from Samples

We assume that the data we look at are IID from the same underlying distribution F with a true mean µ and a true variance σ². Since we can't talk to everyone in Bhutan, we have to rely on our sample to estimate the mean and variance. From our sample we can calculate a sample mean X̄ and a sample variance S². These are the best guesses that we can make about the true mean and true variance.

    X̄ = ∑_{i=1}^n X_i / n        S² = ∑_{i=1}^n (X_i − X̄)² / (n − 1)

The first question to ask is: are those unbiased estimates? Yes. Unbiased means that if we were to repeat this sampling process many times, the expected value of our estimates would be equal to the true values we are trying to estimate. We will prove that this is the case for X̄. The proof for S² is in the lecture slides.
" # " #
n n
Xi 1
E[ X̄ ] = E ∑ = E ∑ Xi
i =1
n n i =1
n n
1 1 1
=
n ∑ E [ Xi ] = n ∑µ= n
nµ = µ
i =1 i =1

The equation for the sample mean seems like a reasonable way to estimate the expectation of the underlying distribution. The same could be said about the sample variance, except for the surprising (n − 1) in the denominator of the equation. Why (n − 1)? That denominator is necessary to make sure that E[S²] = σ².
The intuition behind the proof is that the sample variance calculates the distance of each sample to the sample mean, not the true mean. The sample mean itself varies, and we can show that its variance is also related to the true variance.

19.2 Standard Error

Okay, you convinced me that our estimates for mean and variance are not biased.
But now I want to know how much my sample mean might vary relative to the
true mean.
    Var(X̄) = Var(∑_{i=1}^n X_i / n) = (1/n)² Var(∑_{i=1}^n X_i)
            = (1/n)² ∑_{i=1}^n Var(X_i) = (1/n)² ∑_{i=1}^n σ² = (1/n)² nσ² = σ²/n
            ≈ S²/n    (since S² is an unbiased estimate of σ²)

    Std(X̄) ≈ √(S²/n)    (since Std is the square root of Var)
That Std(X̄) term has a special name. It is called the standard error, and it's how you report uncertainty of estimates of means in scientific papers (and how you get error bars). Great! Now we can compute all these wonderful statistics for the Bhutanese people. But wait! You never told me how to calculate Std(S²). True; that is outside the scope of CS109. You can find it on Wikipedia if you want.


Example 19.1 (Sample mean and sample variance in the Bhutan happiness problem).
Let's say our sample of happiness has n = 200 people. The
sample mean is X̄ = 83 (what is the unit here? happiness score?) and the
sample variance is S² = 450. We can now calculate the standard error of our
estimate of the mean to be 1.5. When we report our results we will say that
the average happiness score in Bhutan is 83 ± 1.5, with variance 450.
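The arithmetic is quick to check in Python, using the numbers assumed in the example:

import math

n, x_bar, s_squared = 200, 83, 450
std_error = math.sqrt(s_squared / n)   # sqrt(2.25) = 1.5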

19.3 Bootstrap

Bootstrap is a relatively recent statistical technique for both understanding distri-
butions of statistics and for calculating p-values.¹ It was invented at Stanford
in 1979, when mathematicians were just starting to understand how computers,
and computer simulations, could be used to better understand probabilities.

¹ Informally, a p-value is the probability that a scientific claim is incorrect.

The first key insight is that if we had access to the underlying distribution
F then answering almost any question we might have as to how accurate our
statistics are becomes straightforward. For example, in the previous section we
gave a formula for how you could calculate the sample variance from a sample
of size n. We know that in expectation our sample variance is equal to the true
variance. But what if we want to know the probability that the true variance is
within a certain range of the number we calculated? That question might sound
dry, but it is critical to evaluating scientific claims! If you knew the underlying
distribution F you could simply repeat the experiment of drawing a sample of
size n from F, calculate the sample variance from our new sample and test what
portion fell within a certain range.
The next insight behind bootstrapping is that the best estimate that we can
get for F is from our sample itself! The simplest way to estimate F (and the one
we will use in this class) is to assume that P( X = k) is simply the fraction of
times that k showed up in the sample. Note that this defines the probability mass
function of our estimate F̂ of F.
To calculate Var(S²) we could calculate \(S_i^2\) for each resample i and, after 10,000
iterations, compute the sample variance of all the \(S_i^2\) values. The bootstrap
has strong theoretical guarantees, and is accepted by the scientific community, for
calculating almost any statistic. It breaks down when the underlying distribution has a
‘‘long tail’’ or if the samples are not IID.


import random

def bootstrap(sample, statistic, iterations=10_000):
    # Drawing from the estimated PMF (the empirical distribution of the
    # sample) is the same as resampling n items, with replacement,
    # from the sample itself.
    n = len(sample)
    stats = []
    for _ in range(iterations):
        resample = random.choices(sample, k=n)
        stats.append(statistic(resample))
    return stats   # estimates the distribution of the statistic
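For instance, to estimate Var(S²) as described above, pass the sample variance in as the statistic. This is a sketch: the sample below is synthetic stand-in data, and bootstrap is the function defined above.

import random
import statistics

random.seed(0)
sample = [random.gauss(83, 21) for _ in range(200)]   # stand-in for survey data

s2_stats = bootstrap(sample, statistics.variance)
var_of_s2 = statistics.variance(s2_stats)   # estimate of Var(S^2)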



20 Maximum Likelihood Estimation

From notes by Chris Piech and Lisa Yan.


We have learned many different distributions for random variables, and all of those
distributions had parameters: the numbers that you provide as input when you
define a random variable. So far when we were working with random variables,
we either were explicitly told the values of the parameters, or we could divine the
values by understanding the process that was generating the random variables.
What if we don’t know the values of the parameters and we can’t estimate
them from our own expert knowledge? What if instead of knowing the random
variables, we have a lot of examples of data generated with the same underlying
distribution? In this chapter we are going to learn formal ways of estimating
parameters from data.
These ideas are critical for artificial intelligence. Almost all modern machine
learning algorithms work like this: (1) Specify a probabilistic model that has
parameters. (2) Learn the value of those parameters from data.

20.1 Parameters

Before we dive into parameter estimation, first let’s revisit the concept of pa-
rameters. Given a model, the parameters are the numbers that yield the actual
distribution. In the case of a Bernoulli random variable, the single parameter
was the value p. In the case of a Uniform random variable, the parameters are
the a and b values that define the min and max value. Here is a list of random
variables and the corresponding parameters. From now on, we are going to use
the notation θ to be a vector of all the parameters:
In the real world often you don’t know the ‘‘true’’ parameters, but you get
to observe data. Next up, we will explore how we can use data to estimate the
model parameters.
Table 20.1. Probability distribution parameters θ.

Distribution      Parameters
Bernoulli(p)      θ = p
Poisson(λ)        θ = λ
Uniform(a, b)     θ = (a, b)
Normal(µ, σ²)     θ = (µ, σ²)
Y = mX + b        θ = (m, b)

It turns out there isn’t just one way to estimate the value of parameters. There
are two main approaches: Maximum Likelihood Estimation (MLE) and Maximum A
Posteriori (MAP). Both of these approaches assume that your data are IID samples:
X1 , X2 , . . . , Xn where all Xi are independent and have the same distribution.

20.2 Maximum Likelihood

Our first algorithm for estimating parameters is called maximum likelihood estima-
tion (MLE). The central idea behind MLE is to select the parameters θ that make
the observed data the most likely.
The data that we are going to use to estimate the parameters are going to be n
independent and identically distributed (IID) samples: X1 , X2 , . . . , Xn .

20.2.1 Likelihood
We made the assumption that our data are identically distributed. This means
that they must have either the same probability mass function (if the data are
discrete) or the same probability density function (if the data are continuous).
To simplify our conversation about parameter estimation, we are going to use
the notation f ( X | θ ) to refer to this shared PMF or PDF. Our new notation is
interesting in two ways. First, we have now included a conditional on θ which
is our way of indicating that the likelihood of different values of X depends on
the values of our parameters. Second, we are going to use the same symbol f for
both discrete and continuous distributions.
What does likelihood mean, and how is ‘‘likelihood’’ different from ‘‘probability’’?
In the case of discrete distributions, likelihood is a synonym for the joint
probability of your data. In the case of continuous distributions, likelihood refers
to the joint probability density of your data.


Since we assumed each data point is independent, the likelihood of all our data
is the product of the likelihood of each data point. Mathematically, the likelihood
of our data given parameters θ is:
\[
L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta) \tag{20.1}
\]

For different values of parameters, the likelihood of our data will be different.
If we have correct parameters, our data will be much more probable than if we
have incorrect parameters. For that reason we write likelihood as a function of
our parameters (θ).

20.2.2 Maximization
In maximum likelihood estimation (MLE) our goal is to choose the values of our
parameters (θ) that maximize the likelihood function from the previous section.
We are going to use the notation θ̂ to represent the best choice of values for our
parameters. Formally, MLE assumes that:

\[
\hat{\theta} = \operatorname*{arg\,max}_{\theta} L(\theta) \tag{20.2}
\]

‘‘Arg max’’ is short for argument of the maximum. The arg max of a function is the
value of the domain at which the function is maximized. It applies for domains
of any dimension.
A cool property of arg max is that since log is a monotonic function, the arg
max of a function is the same as the arg max of the log of the function! That’s nice
because logs make the math simpler.
If we find the arg max of the log of likelihood, it will be equal to the arg max of
the likelihood. Therefore, for MLE, we first write the log likelihood function (LL):
\[
LL(\theta) = \log L(\theta)
= \log \prod_{i=1}^{n} f(X_i \mid \theta)
= \sum_{i=1}^{n} \log f(X_i \mid \theta) \tag{20.3}
\]

To use a maximum likelihood estimator, first write the log likelihood of the
data given your parameters. Then choose the values of the parameters that maximize
the log likelihood function. The argmax can be computed in many ways; all of the
methods that we cover in this class require computing the first derivative of the
function.


20.2.3 Bernoulli MLE Estimation


For our first example, we are going to use MLE to estimate the p parameter of a
Bernoulli distribution. We are going to make our estimate based on n data points
which we will refer to as IID random variables X1 , X2 , . . . , Xn . Every one of these
random variables is assumed to be a sample from the same Bernoulli, with the
same p, namely Xi ∼ Ber ( p). We want to find out what that p is.
Step one of MLE is to write the likelihood of a Bernoulli as a function that we
can maximize. Since a Bernoulli is a discrete distribution, the likelihood is the
probability mass function. You may not have realized before that the probability
mass function of a Bernoulli X can be written as:
\[
f(X) = p^X (1-p)^{1-X} \tag{20.4}
\]
Interesting! Where did that come from? It's an equation that allows us to say that
the probability that X = 1 is p and the probability that X = 0 is 1 − p. Convince
yourself that when X = 0 and when X = 1 the PMF returns the right probabilities. We
write the PMF this way because it is differentiable.
Let’s do some maximum likelihood estimation:
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} p^{X_i} (1-p)^{1-X_i} && \text{(first write the likelihood function)} \\
LL(\theta) &= \sum_{i=1}^{n} \log p^{X_i} (1-p)^{1-X_i} && \text{(then take the log)} \\
&= \sum_{i=1}^{n} \big[ X_i \log p + (1 - X_i)\log(1-p) \big] \\
&= Y \log p + (n - Y)\log(1-p) && \text{(where } Y = \textstyle\sum_{i=1}^{n} X_i\text{)}
\end{align*}

We have a formula for the log likelihood. Now we simply need to choose the value
of p that maximizes it. As your calculus teacher probably taught
you, one way to find the value which maximizes a function is to take the first
derivative of the function and set it equal to 0:

\[
\frac{\partial LL(p)}{\partial p} = Y\frac{1}{p} + (n - Y)\frac{-1}{1-p} = 0
\qquad \Longrightarrow \qquad
\hat{p} = \frac{Y}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
\]

All that work to find out that the maximum likelihood estimate is simply the
sample mean...
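A two-line simulation confirms the result (a sketch; the true p and the seed are arbitrary choices):

import random

random.seed(109)
p_true = 0.3
xs = [1 if random.random() < p_true else 0 for _ in range(10_000)]
p_hat = sum(xs) / len(xs)   # the MLE; comes out close to 0.3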


20.2.4 Poisson MLE Estimation


Practice is key. Let us estimate the best parameter values for a Poisson distribution.
Like before, suppose we have n samples from our Poisson, which we represent
as random variables X1 , X2 , . . . , Xn . We assume that for all i, Xi are IID and
Xi ∼ Poi(λ). Our parameter is therefore θ = λ. The PMF of a Poisson is:

\[
f(X \mid \lambda) = e^{-\lambda}\frac{\lambda^X}{X!} \tag{20.5}
\]
Let’s write the log-likelihood function first:
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{X_i}}{X_i!} && \text{(likelihood function)} \\
LL(\theta) &= \sum_{i=1}^{n} \big[-\lambda \log e + X_i \log \lambda - \log(X_i!)\big] && \text{(log-likelihood function)} \\
&= -n\lambda + \log\lambda \sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!) && \text{(using log base } e\text{)}
\end{align*}

Then, we differentiate with respect to our parameter λ and set the result equal to 0. Note
that \(\sum_{i=1}^{n}\log(X_i!)\) is a constant with respect to λ:

\[
\frac{\partial LL(\theta)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i = 0
\]

Finally, we solve and find that \(\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} X_i\). Yup, it's the sample mean again!
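As a sanity check, we can maximize LL(λ) numerically over a grid and compare against the closed form. This is a sketch using numpy; the simulated rate λ = 4 is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(109)
xs = rng.poisson(lam=4.0, size=10_000)

# LL(lambda) up to the additive constant -sum(log(x_i!)),
# which does not affect the argmax
lams = np.linspace(0.1, 10, 1_000)
ll = -len(xs) * lams + np.log(lams) * xs.sum()
lam_hat_grid = lams[np.argmax(ll)]   # numerical argmax over the grid
lam_hat = xs.mean()                  # closed form: the sample mean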

20.2.5 Normal MLE Estimation


Let’s keep practicing. Next, we will estimate the best parameter values for a
normal distribution. All we have access to are n samples from our normal, which
we represent as IID random variables X1 , X2 , . . . , Xn . We assume that for all i,
Xi ∼ N (µ = θ0 , σ2 = θ1 ). This example seems trickier because a normal has two
parameters that we have to estimate. In this case, θ is a vector with two values. The

toc 2020-07-29 20:12:29-07:00, draft: send comments to [email protected]


122 chapter 20. maximum likelihood estimation

first is the mean (µ) parameter, and the second is the variance (σ2 ) parameter.
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} f(X_i \mid \theta) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta_1}}\, e^{-\frac{(X_i - \theta_0)^2}{2\theta_1}} && \text{(likelihood of a continuous variable is the PDF)} \\
LL(\theta) &= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\theta_1}}\, e^{-\frac{(X_i - \theta_0)^2}{2\theta_1}} && \text{(we want to calculate log likelihood)} \\
&= \sum_{i=1}^{n} \left[-\log\sqrt{2\pi\theta_1} - \frac{1}{2\theta_1}(X_i - \theta_0)^2\right]
\end{align*}

Again, the last step of MLE is to choose values of θ that maximize the log likelihood
function. In this case, we can calculate the partial derivatives of the LL function
with respect to both θ0 and θ1, set both equations equal to 0, and then solve for the
values of θ. Doing so yields the estimates µ̂ = θ̂0 and σ̂² = θ̂1
that maximize likelihood. The result is: \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2\).
Note that σ̂² is biased: it measures deviations from the estimated mean µ̂ rather
than the true mean µ, which is exactly why the unbiased sample variance divides
by n − 1 instead of n.
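The bias is easy to see numerically (a sketch; the true µ and σ below are assumptions of the simulation):

import numpy as np

rng = np.random.default_rng(109)
xs = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu_hat = xs.mean()                      # MLE for mu
var_mle = ((xs - mu_hat) ** 2).mean()   # MLE for sigma^2: divides by n (biased)
var_unbiased = xs.var(ddof=1)           # sample variance: divides by n - 1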



21 The Beta Distribution

From notes by Chris Piech and Lisa Yan.


In this chapter we are going to have a very meta discussion about how we represent
probabilities. Until now probabilities have just been numbers in the range 0 to 1.
However, if we have uncertainty about our probability, it would make sense to
represent our probabilities as random variables (and thus articulate the relative
likelihood of our belief).

21.1 Mixing Discrete and Continuous

In order to characterize probabilities as random variables (recall that according to


Axiom 1, probabilities are real values between 0 and 1) in the context of discrete
experiment outcomes (e.g., number of heads in a certain number of coin flips),
we must introduce one more concept: Bayes’ theorem that mixes discrete PMFs
with continuous PDFs. These equations are straightforward once you have your
head around the notation for probability density functions f X ( x ) and probability
mass functions p X ( x ).
Let X be a continuous random variable and let N be a discrete random variable.
The conditional probabilities of X given N and of N given X, respectively, are:

\[
f_{X \mid N}(x \mid n) = \frac{p_{N \mid X}(n \mid x)\, f_X(x)}{p_N(n)}
\qquad
p_{N \mid X}(n \mid x) = \frac{f_{X \mid N}(x \mid n)\, p_N(n)}{f_X(x)}
\]

21.2 Estimating Probabilities

Imagine we have a coin and we would like to know its probability of coming up
heads (p). We flip the coin (n + m) times and it comes up heads n times. One
way to calculate the probability is to assume that it is exactly p = n/(n + m). That

number, however, is a coarse estimate, especially if n + m is small. Intuitively it


doesn’t capture our uncertainty about the value of p. Just like with other random
variables, it often makes sense to hold a distributed belief about the value of p.
To formalize the idea that we want a distribution for p we are going to use
a random variable X to represent the probability of the coin coming up heads.
Before flipping the coin, we could say that our belief about the coin’s success
probability is uniform: X ∼ Uni(0, 1).
If we let N be the number of heads that came up, given that the coin flips
are independent, ( N | X ) ∼ Bin(n + m, x ). We want to calculate the probability
density function for X | N. We can start by applying Bayes' Theorem:

\begin{align*}
f(X = x \mid N = n) &= \frac{P(N = n \mid X = x)\, f(X = x)}{P(N = n)} && \text{(Bayes' Theorem)} \\
&= \frac{\binom{n+m}{n} x^n (1-x)^m}{P(N = n)} && \text{(Binomial PMF, Uniform PDF)} \\
&= \frac{\binom{n+m}{n}}{P(N = n)}\, x^n (1-x)^m && \text{(moving terms around)} \\
&= \frac{1}{c}\, x^n (1-x)^m && \text{(where } c = \textstyle\int_0^1 x^n (1-x)^m\, dx\text{)}
\end{align*}

21.3 Beta Distribution

The equation that we arrived at when using a Bayesian approach to estimating
our probability defines a probability density function and thus a random variable.
The random variable is called a Beta distribution (its support is (0, 1)), and it is
defined as follows. The probability density function (PDF) for X ∼ Beta(a, b) is:

\[
f(X = x) = \begin{cases}
\dfrac{1}{B(a,b)}\, x^{a-1}(1-x)^{b-1} & \text{if } 0 < x < 1 \\
0 & \text{otherwise}
\end{cases}
\]

where \(B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\, dx\) is a normalization constant.
A Beta distribution has the following statistics:

\[
\operatorname{mode}(X) = \frac{a-1}{a+b-2},
\qquad
E[X] = \frac{a}{a+b},
\qquad
\operatorname{Var}(X) = \frac{ab}{(a+b)^2(a+b+1)}
\]


All modern programming languages have a package for calculating Beta CDFs.
You will not be expected to compute the CDF by hand in CS109.
To model our estimate of the probability of a coin coming up heads as a Beta, set
a = n + 1 and b = m + 1. Beta is used as a random variable to represent a belief
distribution of probabilities in contexts beyond estimating coin flips. It has many
desirable properties: it has a support range that is exactly (0, 1), matching the
values that probabilities can take on, and it has the expressive capacity to capture
many different forms of belief distributions. Let's imagine that we had observed
n = 4 heads and m = 2 tails. The probability density function for X ∼ Beta(5, 3)
is shown in figure 21.1.

[Figure 21.1. The PDF P(θ) of a Beta(5, 3) distribution over θ ∈ (0, 1), peaking near θ = 4/6.]

Notice how the most likely belief for the probability of our coin is when the
random variable, which represents the probability of getting a heads, is 4/6, the
fraction of heads observed. This distribution shows that we hold a non-zero belief
that the probability could be something other than 4/6. It is unlikely that the
probability is 0.01 or 0.09, but reasonably likely that it could be 0.5.

It works out that Beta(1, 1) = Uni(0, 1). As a result, the distribution of our belief
about p before (‘‘prior’’) and after (‘‘posterior’’) observing data can both be represented
using a Beta distribution. When that happens we call Beta a ‘‘conjugate’’ distribution.
Practically, conjugacy means the update is easy.

21.3.1 Beta as a Prior


You can set X ∼ Beta( a, b) as a prior to reflect how biased you think the coin
is apriori to flipping it. This is a subjective judgment that represent a + b − 2
‘‘imaginary’’ trials with a − 1 heads and b − 1 tails. If you then observe n + m
real trials with n heads you can update your belief. Your new belief would be,
X | (n heads in n + m trials) ∼ Beta( a + n, b + m). Using the prior Beta(1, 1) =
Uni(0, 1) is the same as saying we haven’t seen any ‘‘imaginary’’ trials, so apriori
we know nothing about the coin. This form of thinking about probabilities is rep-
resentative of the ‘‘Bayesian’’ field of thought where computer scientists explicitly
represent probabilities as distributions (with prior beliefs). That school of thought
is separate from the ‘‘Frequentist’’ school which tries to calculate probabilities as
single numbers evaluated by the ratio of successes to experiments.
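The prior-to-posterior update is a one-liner. Here is a minimal sketch using scipy, with the n = 4 heads and m = 2 tails observed above:

from scipy import stats

a, b = 1, 1                            # prior Beta(1, 1) = Uni(0, 1)
n, m = 4, 2                            # observed heads and tails
posterior = stats.beta(a + n, b + m)   # posterior is Beta(5, 3)
posterior.mean()                       # E[X] = 5 / (5 + 3) = 0.625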


Assignment Example (Example 21.1: Beta distribution to model student grades). In one
particular iteration of this course, we talked about reasons why grade distributions
might be well suited to be described by a Beta distribution. Let's say that we are
given a set of student grades for a single exam and we find that it is best fit by a
Beta distribution: X ∼ Beta(a = 8.28, b = 3.16). What is the probability that a
student is below the mean (i.e., the expectation)?
The answer to this question requires two steps. First calculate the mean
of the distribution, then calculate the probability that the random variable
takes on a value less than the expectation.

\[
E[X] = \frac{a}{a+b} = \frac{8.28}{8.28 + 3.16} \approx 0.7238
\]

Now we need to calculate P(X < E[X]). That is exactly the CDF of X eval-
uated at E[X]. We don't have a closed-form formula for the CDF of a Beta distribution,
but all modern programming languages have a Beta CDF function. In
Python, using the scipy stats library, we can call stats.beta.cdf, which
takes the x parameter first, followed by the alpha and beta parameters of your
Beta distribution:

\[
P(X < E[X]) = F_X(0.7238) = \texttt{stats.beta.cdf(0.7238, 8.28, 3.16)} \approx 0.46
\]

This can also be done in Julia using the Distributions package:


julia> using Distributions
julia> B = Beta(8.28, 3.16);
julia> cdf(B, 0.7238)
0.4602742456226714



22 Maximum A Posteriori

From notes by Chris Piech and Lisa Yan.

22.1 Maximum A Posteriori Estimation

MLE is great, but it is not the only way to estimate parameters! This section
introduces an alternate algorithm, Maximum A Posteriori (MAP). The paradigm
of MAP is that we should choose the value for our parameters that is the most
likely given the data. At first blush this might seem the same as MLE; however,
remember that MLE chooses the value of parameters that makes the data most
likely.
One of the disadvantages of MLE is that it best explains data we have seen and
makes no attempt to generalize to unseen data. In MAP, we incorporate prior belief
about our parameters, and then we update our posterior belief of the parameters
based on the data we have seen.
Formally, for IID random variables X1, . . . , Xn:

\[
\theta_{\text{MAP}} = \operatorname*{arg\,max}_{\theta} f(\theta \mid X_1, X_2, \ldots, X_n)
\]

In the equation above we are trying to calculate the conditional probability of unob-
served random variables given observed random variables. When that is the case,
think Bayes' Theorem! Expand the function f using the continuous version of
Bayes' Theorem:

\begin{align*}
\theta_{\text{MAP}} &= \operatorname*{arg\,max}_{\theta} f(\theta \mid X_1, X_2, \ldots, X_n) \\
&= \operatorname*{arg\,max}_{\theta} \frac{f(X_1, X_2, \ldots, X_n \mid \theta)\, g(\theta)}{h(X_1, X_2, \ldots, X_n)} && \text{(by Bayes' Theorem)}
\end{align*}

Note that f, g and h are all probability densities; we used different symbols
to make it explicit that they may be different functions. Now we are going
to leverage two observations. First, the data are assumed to be IID, so we can
decompose the density of the data given θ. Second, the denominator is a constant
with respect to θ, so its value does not affect the arg max and we can drop
that term. Mathematically:

\begin{align*}
\theta_{\text{MAP}} &= \operatorname*{arg\,max}_{\theta} \frac{\prod_{i=1}^{n} f(X_i \mid \theta)\, g(\theta)}{h(X_1, X_2, \ldots, X_n)} && \text{(since the samples are IID)} \\
&= \operatorname*{arg\,max}_{\theta} \prod_{i=1}^{n} f(X_i \mid \theta)\, g(\theta) && \text{(since } h \text{ is constant with respect to } \theta\text{)}
\end{align*}

As before, it will be more convenient to find the arg max of the log of the MAP
function, which gives us the final form for MAP estimation of parameters:

\[
\theta_{\text{MAP}} = \operatorname*{arg\,max}_{\theta}\left(\log g(\theta) + \sum_{i=1}^{n} \log f(X_i \mid \theta)\right) \tag{22.1}
\]

Compare this side by side with MLE:

\[
\theta_{\text{MLE}} = \operatorname*{arg\,max}_{\theta} \prod_{i=1}^{n} f(X_i \mid \theta)
\qquad
\theta_{\text{MAP}} = \operatorname*{arg\,max}_{\theta} \prod_{i=1}^{n} f(X_i \mid \theta)\, g(\theta)
\]

Using Bayesian terminology, the MAP estimate is the mode of the ‘‘posterior’’
distribution for θ. If you look at equation (22.1) side by side with the MLE equation,
you will notice that MAP is the arg max of the exact same function plus a term
for the log of the prior. Maximum A Posteriori maximizes:
log-likelihood + log-prior
22.1.1 Parameter Priors
In order to get ready for the world of MAP estimation, we are going to need to
brush up on our distributions. We will need reasonable distributions for each of
our different parameters. For example, if you are predicting a Poisson distribution,
what is the right random variable type for the prior of λ?
A desideratum for a prior distribution is that the resulting posterior distribution
has the same functional form. We call these ‘‘conjugate’’ priors. In the case where
you are updating your belief many times, conjugate priors make programming
the math equations much easier.
Table 22.1 lists different parameters and the distributions most often used for
their priors. We won't cover the inverse gamma distribution in this class. The
remaining two, Dirichlet and Gamma, you will not be required to know, but
details for them are included below for completeness.


Table 22.1. List of distributions often used as priors.

Distribution    Parameter    Prior
Bernoulli       p            Beta
Binomial        p            Beta
Poisson         λ            Gamma
Exponential     λ            Gamma
Multinomial     p_i          Dirichlet
Normal          µ            Normal
Normal          σ²           Inverse Gamma

The distributions used to represent your ‘‘prior’’ belief about a random variable
will often have their own parameters. For example, a Beta distribution is defined
using two parameters (a, b). Do we have to use parameter estimation to evaluate
a and b too? No. Those parameters are called ‘‘hyperparameters’’: a term
we reserve for parameters in our model that we fix before running parameter
estimation. Before you run MAP you decide on the values of (a, b).

22.1.2 Beta
We've covered that Beta is the conjugate distribution for Bernoulli. The MAP
estimate of a Bernoulli parameter with a Beta prior is the mode of the Beta posterior. The mode
of a distribution is the value that maximizes the probability mass function (if
discrete) or probability density function (if continuous).
If X ∼ Beta(a, b), where a, b are integers with a + b > 2, the mode is

\[
\operatorname*{arg\,max}_{x} f(x) = \frac{a-1}{a+b-2},
\]

where f(x) is the PDF of X.

Flip n + m coins and observe n heads. If we assume a prior on p of
Beta(n_imag + 1, m_imag + 1), the posterior on the parameter p is
Beta(n + n_imag + 1, m + m_imag + 1). The MAP estimate is therefore the mode of this
distribution:

\[
\hat{p}_{\text{MAP}} = \frac{n + n_{\text{imag}}}{n + n_{\text{imag}} + m + m_{\text{imag}}}
\]
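A tiny helper makes the role of the imaginary trials concrete (a hypothetical function for illustration; the counts are made up):

def map_coin(n, m, n_imag=0, m_imag=0):
    # Mode of the posterior Beta(n + n_imag + 1, m + m_imag + 1)
    return (n + n_imag) / (n + n_imag + m + m_imag)

map_coin(8, 2)                        # 0.8, matches the MLE (uniform prior)
map_coin(8, 2, n_imag=1, m_imag=1)    # 0.75, pulled toward 1/2 by the prior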


22.1.3 Dirichlet
The Dirichlet distribution generalizes Beta in the same way the multinomial gen-
eralizes Bernoulli. A Dirichlet random variable X is parametrized as
X ∼ Dir(a1, a2, . . . , am). The PDF of the distribution is:

\[
f(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = K \prod_{i=1}^{m} x_i^{a_i - 1}
\]

where K is a normalizing constant.


You can intuitively understand the hyperparameters of a Dirichlet distribution:
imagine you have seen ∑im=1 ai − m imaginary trials. In those trials you had ( ai − 1)
outcomes of value i. As an example, consider estimating the probability of getting
different numbers on a six-sided ‘‘skewed die’’ (where each side is a different
shape). We will estimate the probabilities of rolling each side of this die by
repeatedly rolling the die n times. This will produce n IID samples. For the MAP
paradigm, we are going to need a prior on our belief of each of the parameters
p1 , . . . , p6 . We want to express that we lightly believe that each roll is equally
likely.
Before you roll, let’s imagine you had rolled the die six times and had gotten one
of each possible value. Thus, the ‘‘prior’’ distribution would be Dir(2, 2, 2, 2, 2, 2).
After observing n1 + n2 + · · · + n6 new trials with ni results of outcome i, the
‘‘posterior’’ distribution is Dir(2 + n1 , · · · , 2 + n6 ).
Using a prior which represents one imagined observation of each outcome is
called ‘‘Laplace smoothing’’ and it guarantees that none of your probabilities are
0 or 1. The Laplace estimate for a Multinomial RV is

\[
p_i = \frac{X_i + 1}{n + m} \quad \text{for } i = 1, \ldots, m,
\]

where X_i is the number of times you saw outcome i, and n is the number of
actual trials in your observed experiment.
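For the skewed-die example, the Laplace estimates take a few lines of Python (the roll data are made up for illustration):

from collections import Counter

rolls = [1, 3, 3, 6, 2, 3, 5, 1, 6, 6]   # hypothetical observed rolls
n, m = len(rolls), 6
counts = Counter(rolls)

# p_i = (X_i + 1) / (n + m); outcome 4 was never rolled yet still gets 1/16
p_hat = {i: (counts[i] + 1) / (n + m) for i in range(1, m + 1)}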

22.1.4 Gamma
The Gamma(k, θ ) distribution is the conjugate prior for the λ parameter of the
Poisson distribution. (It is also the conjugate for the λ in the exponential, but we
won’t cover that here.)
The hyperparameters can be interpreted as: you saw k total imaginary events
during θ imaginary time periods. After observing n events during the next t time
periods the posterior distribution is Gamma(k + n, θ + t).


As an example, Gamma(10, 5) would represent having seen 10 imaginary
events in 5 time periods. It is like imagining a rate of 2 with some degree of
confidence. If we start with that Gamma as a prior and then see 11 events in
the next 2 time periods, our posterior is Gamma(21, 7), which is equivalent to an
updated rate of 3.



23 Naïve Bayes

From notes by Chris Piech and Lisa Yan.


Naïve Bayes is a type of machine learning algorithm called a classifier. It is used
to predict the probability of a discrete label random variable Y based on the state
of feature random variables X. We are going to learn all necessary parameters
for the probabilistic relationship between X and Y from data. Naïve Bayes is a
supervised classification Machine Learning algorithm.

23.1 Machine Learning: Classification

In supervised machine learning, your job is to use training data with feature/la-
bel pairs \((\mathbf{x}, y)\) in order to estimate a label-predicting function \(\hat{Y} = g(\mathbf{x})\). This
function can then be used to make future predictions. A classification task is
one where Y takes on one of a discrete number of values. Often in classification,
\(g(\mathbf{x}) = \operatorname*{arg\,max}_y \hat{P}(Y = y \mid X = \mathbf{x})\).
To learn all parameters required to calculate \(g(\mathbf{x})\), you are given n different
training pairs known as training data: \((\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\).
Here \(\mathbf{x}^{(j)} = (x_1^{(j)}, x_2^{(j)}, \ldots, x_m^{(j)})\) is a vector of m discrete features for the j-th
training example, and \(y^{(j)}\) is the discrete label for the j-th training example. This
symbolic description of classification hides the fact that prediction is applied to
interesting real-life problems:

1. Predicting heart disease Y, based on a set of m observations from a heart scan


X.

2. Predicting ancestry Y, based on m DNA states X.

3. Predicting if a user will like a movie Y given whether or not they like a set of
m movies X.

In this section we are going to assume that all random variables are binary.
While this is not a necessary assumption (Naïve Bayes can work for non-binary
data), it makes it much easier to learn the core concepts. Specifically, we assume
that all labels are binary y ∈ {0, 1}, and all features are binary x j ∈ {0, 1}, ∀ j =
1, . . . , m.

23.2 Naïve Bayes Algorithm

Here is the Naïve Bayes algorithm. After presenting the algorithm we are going
to show the theory behind it.

23.2.1 Prediction
For an example with feature vector \(\mathbf{x} = [x_1, x_2, \ldots, x_m]\), we can make a corresponding prediction
for Y. We use hats (e.g., \(\hat{P}\) or \(\hat{Y}\)) to symbolize values which are estimated.

\begin{align*}
\hat{Y} = g(\mathbf{x}) &= \operatorname*{arg\,max}_{y \in \{0,1\}} \hat{P}(Y = y)\, \hat{P}(X = \mathbf{x} \mid Y = y) && \text{(equal to } \operatorname*{arg\,max}_{y} \hat{P}(Y = y \mid X = \mathbf{x})\text{)} \\
&= \operatorname*{arg\,max}_{y \in \{0,1\}} \hat{P}(Y = y) \prod_{i=1}^{m} \hat{P}(X_i = x_i \mid Y = y) && \text{(Naïve Bayes assumption)} \\
&= \operatorname*{arg\,max}_{y \in \{0,1\}} \log \hat{P}(Y = y) + \sum_{i=1}^{m} \log \hat{P}(X_i = x_i \mid Y = y) && \text{(log version for numerical stability)}
\end{align*}

In order to calculate this expression, we are going to need to learn the estimates
P̂(Y = y) and P̂( Xi = xi | Y = y) from data using a process called training.

23.2.2 Training
The objective in training is to estimate the probabilities P(Y) and P(X_i | Y) for all
features 0 < i ≤ m. Using an MLE estimate:

\[
\hat{P}(X_i = x_i \mid Y = y) = \frac{\#\{\text{training examples where } X_i = x_i \text{ and } Y = y\}}{\#\{\text{training examples where } Y = y\}}
\]


Using a Laplace MAP estimate:

\[
\hat{P}(X_i = x_i \mid Y = y) = \frac{\#\{\text{training examples where } X_i = x_i \text{ and } Y = y\} + 1}{\#\{\text{training examples where } Y = y\} + 2}
\]

Estimating P(Y = y) is also straightforward. Using MLE estimation:

\[
\hat{P}(Y = y) = \frac{\#\{\text{training examples where } Y = y\}}{\#\{\text{training examples}\}}
\]
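Putting training and prediction together, here is a minimal sketch for the binary case (the function names and structure are choices of this sketch, not fixed by the notes):

import math

def train_nb(X, y):
    # X: list of binary feature vectors; y: list of binary labels.
    # P(Y) is estimated by MLE; P(X_i = 1 | Y) by the Laplace MAP estimate.
    n, m = len(X), len(X[0])
    p_y = [y.count(0) / n, y.count(1) / n]
    p_x = [[0.0] * m for _ in (0, 1)]        # p_x[c][i] = P(X_i = 1 | Y = c)
    for c in (0, 1):
        n_c = y.count(c)
        for i in range(m):
            ones = sum(x[i] for x, label in zip(X, y) if label == c)
            p_x[c][i] = (ones + 1) / (n_c + 2)   # Laplace: +1 success, +1 failure
    return p_y, p_x

def predict_nb(x, p_y, p_x):
    # arg max over y of log P(Y = y) + sum_i log P(X_i = x_i | Y = y)
    def log_post(c):
        lp = math.log(p_y[c])
        for i, xi in enumerate(x):
            lp += math.log(p_x[c][i] if xi else 1 - p_x[c][i])
        return lp
    return max((0, 1), key=log_post)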

23.3 Theory

Now that you have the algorithm spelled out, let’s go over the theory of how we
got there. To do so, we will first explore an algorithm which doesn’t work, called
‘‘Brute Force Bayes.’’ Then, we introduce the Naïve Bayes Assumption, which
will make our calculations possible.

23.3.1 Brute Force Bayes


We can solve the classification task using a brute force solution. To do so we will
learn the full joint distribution P̂(Y, X).
In the world of classification, when we make a prediction, we want to choose
the value of y that maximizes P(Y = y | X = x). If we can only estimate P̂(Y =
y | X = x), then we want a function \(g(\mathbf{x}) = \operatorname*{arg\,max}_y \hat{P}(Y = y \mid X = \mathbf{x})\).

ŷ = g( x ) = arg max P̂(Y | X) (Our objective)


y∈{0,1}

P̂(X | Y ) P̂(Y )
= arg max (By Bayes’ Theorem)
y∈{0,1} P̂(X)
= arg max P̂(X | Y ) P̂(Y ) (Since P̂(X) is constant with respect to Y)
y∈{0,1}

Using our training data, we could interpret the joint distribution of X and Y as
one giant Multinomial with a different parameter for every combination of X = x
and Y = y. If for example, the input vectors are only length one (i.e., |X| = 1)
and the number of values that x and y can take on are small—say, binary—this is
a totally reasonable approach. We could estimate the multinomial using MLE or
MAP estimators and then calculate argmax over a few lookups in our table.


The bad times hit when the number of features becomes large. Recall that
our multinomial needs to estimate a parameter for every unique combination of
assignments to the vector X and the value Y. If there are \(|\mathbf{X}| = m\) binary features,
then this strategy is going to take order \(O(2^m)\) space, and there will likely be
many parameters that are estimated without any training data that matches the
corresponding assignment.

23.3.2 Naïve Bayes Assumption


The Naïve Bayes Assumption is that each feature of X is conditionally independent
of the others given Y. That assumption is naïve (and often wrong), but useful.
This assumption allows us to make predictions using space and data that are
linear with respect to the number of features: \(O(m)\) if \(|\mathbf{x}| = m\). That allows us to
train and make predictions for huge feature spaces, such as one which has an
indicator for every word on the internet. Using this assumption, the prediction
algorithm can be simplified:

ŷ = g( x ) = arg max P̂(X, Y ) (As we last left off)


y∈{0,1}

= arg max P̂(Y ) P̂(X | Y ) (By chain rule)


y∈{0,1}
m
= arg max P̂(Y ) ∏ P̂( Xi | Y ) (Using the Naïve Bayes assumption)
y∈{0,1} i =1
m
= arg max log P̂(Y ) + ∑ log P̂( Xi | Y )
y∈{0,1} i =1
(Log version for numerical stability)

This algorithm is fast and stable both when training and making predictions.
Let us consider a particular feature, the i-th feature Xi. How should we repre-
sent P̂(Xi = xi | Y = y)? For a particular event Y = y that we condition on, Xi
can take on one of k discrete values. Thus for each particular y, we can model
the likelihood of Xi taking on values as a Multinomial random variable with k
parameters. We can then find MLE and MAP estimators for the parameters of
that Multinomial. Recall that the MLE to estimate parameter p_i for a Multinomial
is just counting, whereas the MAP estimator (with Laplace prior) imagines one
extra example of each outcome:

\[
\hat{p}_{i,\text{MLE}} = \frac{n_i}{n}
\qquad \text{and} \qquad
\hat{p}_{i,\text{MAP}} = \frac{n_i + 1}{n + k},
\]

where n is the number of observations, n_i is the number of observations with
outcome i, and k is the total possible number of outcomes.
Note that in the version of classification we are using in CS109, Xi is binary
(technically, a Multinomial with 2 parameters) and therefore k = 2. We used the
Multinomial derivation to help you understand how one would handle a feature
Xi that takes on multiple discrete values.
Naïve Bayes is a simple form of a Bayesian Network where the label Y is the
only variable which directly influences the likelihood of each feature variable Xi .
It is a simple model from a field of machine learning called probabilistic graphical
models. In that field you make a graph of how your variables are related to one
another and you come up with conditional independence assumptions that make
it computationally tractable to estimate the joint distribution.

Say we have thirty examples of people's preferences (like or not) for Star
Wars, Harry Potter and Pokemon. Each training example has X1, X2 and Y,
where X1 is whether or not the user liked Star Wars, X2 is whether or not the
user liked Harry Potter, and Y is whether or not the user liked Pokemon. For
the 30 training examples, the estimates work out to P̂(Y = 0) ≈ 0.43, P̂(Y = 1) ≈ 0.57,
P̂(X1 = 1 | Y = 0) ≈ 0.77, P̂(X1 = 1 | Y = 1) ≈ 0.76, P̂(X2 = 0 | Y = 0) ≈ 0.38,
and P̂(X2 = 0 | Y = 1) ≈ 0.41.
For a new user who likes Star Wars (X1 = 1) but not Harry Potter (X2 = 0),
do you predict that they will like Pokemon? Yes! Y = 1 leads to a larger value
in the argmax term:

\begin{align*}
\text{if } Y = 0:\quad & \hat{P}(X_1 = 1 \mid Y = 0)\, \hat{P}(X_2 = 0 \mid Y = 0)\, \hat{P}(Y = 0) = (0.77)(0.38)(0.43) \approx 0.126 \\
\text{if } Y = 1:\quad & \hat{P}(X_1 = 1 \mid Y = 1)\, \hat{P}(X_2 = 0 \mid Y = 1)\, \hat{P}(Y = 1) = (0.76)(0.41)(0.57) \approx 0.178
\end{align*}



24 Linear Regression and Gradient Ascent

From notes by Chris Piech and Lisa Yan.

24.1 Regression

Regression is a second category of machine learning prediction algorithms. You
have a prediction function Ŷ = g(X) as before, but you would like to predict a Y
that takes on a continuous value.
We won’t elaborate on the regression task too much, because classification
(with discrete Y) already has a plethora of modern computer science applications—
image recognition, sentiment analysis of text, and text authorship, to name a few.
However, we will explore linear regression (where we model g as a linear func-
tion) and learn a truly valuable iterative optimization algorithm (the ‘‘butter’’ to
machine learning’s ‘‘bread,’’ if you will) called gradient ascent.

24.2 Gradient Ascent Optimization

In many cases we can't solve for the argmax mathematically. Instead we use a com-
puter. To do so we employ an algorithm called gradient ascent (a classic in
optimization theory). The idea behind gradient ascent is that if you continuously
take small steps in the direction of your gradient, you will eventually make it to a
local maximum.
Start with θ at any initial value (often 0). Then take many small steps
towards a local maximum. The new θ after each small step can be calculated as:

∂LL(θ old )
θ new
j = θ old
j +η·
∂θ j

where ‘‘eta’’ (η) is the magnitude of the step size that we take. If you keep
updating θ using the equation above you will (often) converge on good values of
θ. As a general rule of thumb, use a small value of η to start. If you ever find that
the function value (for the function you are trying to argmax) is decreasing, your
choice of η was too large. Here is the gradient ascent algorithm in pseudocode:

Algorithm 24.1. Gradient ascent.

function GradientAscent:
    initialize θ_j = 0 for all 0 ≤ j ≤ m
    for many iterations do:
        gradient[j] = 0 for all 0 ≤ j ≤ m
        calculate each gradient[j] based on the data and the current setting of θ
        θ_j ← θ_j + η · gradient[j] for all 0 ≤ j ≤ m

24.3 Linear Regression

Suppose we are working with 1-dimensional observations, i.e., X = ⟨X1⟩ = X.
Linear Regression assumes the following linear model for prediction, which has
two parameters, a and b:

\[
\hat{Y} = g(X) = aX + b
\]
Using this model, we would like to determine the optimal parameters accord-
ing to some optimization objective. We discuss two approaches: an analytical
approach that minimizes mean squared error, and a computational approach that
maximizes training data likelihood. With one important assumption (which we’ll
get to later), the two approaches are equivalent.

24.3.1 Analytical Solution with Mean Squared Error


For regression tasks, we usually choose a prediction Ŷ = g(X) that minimizes
the mean squared error (MSE) ‘‘loss’’ function:

\begin{align*}
\theta_{\text{MSE}} &= \operatorname*{arg\,min}_{\theta} E[(Y - \hat{Y})^2] \\
&= \operatorname*{arg\,min}_{\theta} E[(Y - g(X))^2] \\
&= \operatorname*{arg\,min}_{\theta} E[(Y - aX - b)^2]
\end{align*}


With our linear prediction model, we determine θ_MSE = (a_MSE, b_MSE) by differ-
entiating the mean squared error with respect to a and b:

\begin{align*}
\frac{\partial}{\partial a} E[(Y - aX - b)^2] &= E\left[\frac{\partial}{\partial a}(Y - aX - b)^2\right] \\
&= E[-2(Y - aX - b)X] \\
&= -2E[XY] + 2aE[X^2] + 2bE[X] \\
\frac{\partial}{\partial b} E[(Y - aX - b)^2] &= E\left[\frac{\partial}{\partial b}(Y - aX - b)^2\right] \\
&= E[-2(Y - aX - b)] \\
&= -2E[Y] + 2aE[X] + 2b
\end{align*}

Setting the derivatives to 0 and solving the simultaneous equations:

\begin{align*}
a_{\text{MSE}} &= \frac{E[XY] - E[X]E[Y]}{E[X^2] - (E[X])^2} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} = \rho(X, Y)\,\frac{\sigma_Y}{\sigma_X} \\
b_{\text{MSE}} &= E[Y] - aE[X] = \mu_Y - a_{\text{MSE}}\,\mu_X \\
\hat{Y} &= \rho(X, Y)\,\frac{\sigma_Y}{\sigma_X}(X - \mu_X) + \mu_Y
\end{align*}

Wait, those are our best parameters? But we don't know the distributions of X
and Y, and therefore we don't know the true statistics of X and Y. We estimate these
statistics based on our observed training data. Our model is therefore as follows
(where X̄ and Ȳ are the sample means computed from the training data):

\begin{align*}
\hat{Y} = g(X = x) &= \hat{\rho}(X, Y)\,\frac{\hat{\sigma}_Y}{\hat{\sigma}_X}(x - \bar{X}) + \bar{Y} \\
\hat{a}_{\text{MSE}} &= \frac{\sum_{i=1}^{n}(x^{(i)} - \bar{X})(y^{(i)} - \bar{Y})}{\sum_{i=1}^{n}(x^{(i)} - \bar{X})^2} = \hat{\rho}(X, Y)\,\frac{S_Y}{S_X} \\
\hat{b}_{\text{MSE}} &= \bar{Y} - \hat{a}_{\text{MSE}}\,\bar{X}
\end{align*}
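These estimators translate directly into code; a minimal numpy sketch (the function name is this sketch's choice):

import numpy as np

def fit_linear_mse(x, y):
    # a_hat = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
    # b_hat = y_bar - a_hat * x_bar
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx = x - x.mean()
    a_hat = (dx * (y - y.mean())).sum() / (dx ** 2).sum()
    b_hat = y.mean() - a_hat * x.mean()
    return a_hat, b_hat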

24.3.2 Computational Solution with Maximum Likelihood


That seemed somewhat anticlimactic: we had this optimal prediction function,
but we had to estimate the parameters of the prediction function from the training
data. Let’s borrow an idea from our parameter estimation unit by maximizing
the likelihood of our training data!


Recall that our training data consists of n datapoints
\((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\),
generated IID according to the joint distribution of X and Y, f(X, Y | θ). We can
model this joint distribution by incorporating our regression model: Y = Ŷ + Z =
aX + b + Z, where Ŷ = g(X) = aX + b is our prediction and Z is our error (i.e.,
noise) between our prediction Ŷ and the actual Y.
We approach the problem of finding a and b that maximize the likelihood of
our training data by first finding a distribution involving Y, X, and θ = (a, b). We
then find the value of θ that maximizes the log-likelihood function.
If we assume Z ∼ N(0, σ²) and X follows some unknown distribution, then the
conditional distribution of Y given X = x and parameter values θ = (a, b) is
simply Y = ax + b + Z. This is just the sum of a Gaussian and a number, thereby
implying that Y | X, θ ∼ N(aX + b, σ²), which has PDF:

\[
f(Y = y \mid X = x, \theta = (a, b)) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - ax - b)^2}{2\sigma^2}}
\]
Now we are ready to write the likelihood function, then take its log to get the log
likelihood function. (Ultimately we optimize the log conditional likelihood:
\(\theta_{\text{MLE}} = \operatorname*{arg\,max}_{\theta} \sum_{i=1}^{n} \log f(y^{(i)} \mid x^{(i)}, \theta)\).)

\begin{align*}
L(\theta) &= \prod_{i=1}^{n} f(y^{(i)}, x^{(i)} \mid \theta) && \text{(let's break up this joint)} \\
&= \prod_{i=1}^{n} f(y^{(i)} \mid x^{(i)}, \theta)\, f(x^{(i)}) && \text{(chain rule; } f_X(x^{(i)}) \text{ is independent of } \theta\text{)} \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y^{(i)} - ax^{(i)} - b)^2/(2\sigma^2)} \cdot f(x^{(i)}) && \text{(substitute the conditional distribution of } Y \mid X, \theta\text{)} \\
LL(\theta) &= \log L(\theta) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y^{(i)} - ax^{(i)} - b)^2/(2\sigma^2)} + \sum_{i=1}^{n} \log f(x^{(i)}) && \text{(log of a product is the sum of logs)} \\
&= n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - ax^{(i)} - b)^2 + \sum_{i=1}^{n} \log f(x^{(i)})
\end{align*}

Our goal is to find parameters a, b that maximize the likelihood. Remember that
the argmax is invariant to logarithmic transformations, positive scalar constants,
and additive constants. Let's remove positive constant multipliers and terms that
don't include θ. We are left with trying to find a value of θ that maximizes:

\[
\hat{\theta} = \operatorname*{arg\,max}_{\theta} \left[-\sum_{i=1}^{n} (y^{(i)} - ax^{(i)} - b)^2\right]
\]

To solve this argmax we are going to use gradient ascent. In order to do so we
first need to find the derivative of the function we want to argmax with respect to
both parameters in θ:

\begin{align*}
\frac{\partial}{\partial a}\left[-\sum_{i=1}^{n}(y^{(i)} - ax^{(i)} - b)^2\right] &= -\sum_{i=1}^{n} \frac{\partial}{\partial a}(y^{(i)} - ax^{(i)} - b)^2 \\
&= -\sum_{i=1}^{n} 2(y^{(i)} - ax^{(i)} - b)(-x^{(i)}) \\
&= 2\sum_{i=1}^{n}(y^{(i)} - ax^{(i)} - b)\, x^{(i)} \\
\frac{\partial}{\partial b}\left[-\sum_{i=1}^{n}(y^{(i)} - ax^{(i)} - b)^2\right] &= 2\sum_{i=1}^{n}(y^{(i)} - ax^{(i)} - b)
\end{align*}

These derivatives can be plugged into gradient ascent to give our final algorithm:

a, b = 0.0, 0.0                      # initialize theta = (a, b)
for _ in range(num_iterations):      # num_iterations chosen by the user
    gradient_a, gradient_b = 0.0, 0.0
    for x, y in training_examples:   # iterable of (x, y) pairs
        diff = y - (a * x + b)
        gradient_a += 2 * diff * x
        gradient_b += 2 * diff
    a += eta * gradient_a            # theta += eta * gradient
    b += eta * gradient_b

If you run gradient ascent for enough training (i.e., update) steps, you will
find that for linear regression, the maximum likelihood estimators (assuming
zero-mean, normally distributed noise between the predicted Ŷ and actual Y) are
equivalent to the mean squared error estimators. Cool!!



25 Logistic Regression

From notes by Chris Piech and Lisa Yan.

Before we get started, I want to familiarize you with some notation:

\begin{align*}
\theta^{\top}\mathbf{x} = \theta \cdot \mathbf{x} &= \sum_{i=1}^{n} \theta_i x_i = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n && \text{(weighted sum)} \\
\sigma(z) &= \frac{1}{1 + e^{-z}} && \text{(sigmoid function)}
\end{align*}

25.1 Logistic Regression Overview

Classification is the task of choosing a value of y that maximizes P(Y | X). Naïve
Bayes worked by approximating that probability using the naïve assumption that
each feature was independent given the class label.
For all classification algorithms you are given n IID training datapoints
\((\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\),
where each ‘‘feature’’ vector \(\mathbf{x}^{(i)}\) has \(m = |\mathbf{x}^{(i)}|\) features. In summary, the
logistic regression classifier is:

\[
\hat{Y} = \operatorname*{arg\,max}_{y \in \{0,1\}} P(Y = y \mid X = \mathbf{x}),
\qquad
P(Y = 1 \mid X = \mathbf{x}) = \sigma\left(\sum_{j=0}^{m} \theta_j x_j\right) = \sigma(\theta^{\top}\mathbf{x})
\]

25.1.1 Logistic Regression Assumption

Logistic Regression is a classification algorithm (I know, terrible name) that works
by trying to learn a function that approximates P(Y | X). It makes the central
assumption that P(Y | X) can be approximated as a sigmoid function applied
to a linear combination of input features. Mathematically, for a single training
datapoint (x, y), Logistic Regression assumes:

\[
P(Y = 1 \mid X = \mathbf{x}) = \sigma(z) \quad \text{where } z = \theta_0 + \sum_{i=1}^{m} \theta_i x_i
\]

This assumption is often written in the equivalent forms:

\begin{align*}
P(Y = 1 \mid X = \mathbf{x}) &= \sigma(\theta^{\top}\mathbf{x}) && \text{(where we always set } x_0 \text{ to be 1)} \\
P(Y = 0 \mid X = \mathbf{x}) &= 1 - \sigma(\theta^{\top}\mathbf{x}) && \text{(by the law of total probability)}
\end{align*}

Using these equations for the probability of Y | X, we can create an algorithm that
selects values of θ that maximize that probability for all data. I am first going to
state the log probability function and partial derivatives with respect to θ. Then
later we will (a) show an algorithm that can choose optimal values of θ and (b)
show how the equations were derived.
An important realization is that, given the best values for the parameters (θ),
logistic regression often does a great job of estimating the probability of different
class labels. However, given bad, or even random, values of θ it does a poor
job. The amount of ‘‘intelligence’’ that your logistic machine learning algorithm
has depends on having good values of θ.

25.1.2 Log Likelihood


In order to choose values for the parameters of logistic regression, we use Max-
imum Likelihood Estimation (MLE). As a result, we will have two steps: (1)
write the log-likelihood function, and (2) find the values of θ that maximize the
log-likelihood function.
The labels that we are predicting are binary, and the output of our logistic
regression function is P(Y = 1 | X = x). This means that we can (and should)
interpret the Logistic Regression model as a Bernoulli random variable: Y | X =
x ∼ Ber(p), where p = σ(θ⊤x).
To start, here is a super slick way of writing the probability of one datapoint
(recall that this is the non-piecewise form of writing the probability mass function
of a Bernoulli random variable):

\[
P(Y = y \mid X = \mathbf{x}) = \sigma(\theta^{\top}\mathbf{x})^{y} \cdot \left[1 - \sigma(\theta^{\top}\mathbf{x})\right]^{(1-y)}
= \begin{cases}
\sigma(\theta^{\top}\mathbf{x}) & \text{if } y = 1 \\
1 - \sigma(\theta^{\top}\mathbf{x}) & \text{if } y = 0
\end{cases}
\]

Now that we know the probability mass function of a single datapoint, we
can write the conditional likelihood of all the data, where each datapoint is
independent:
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} P(Y = y^{(i)} \mid X = \mathbf{x}^{(i)}) \\
&= \prod_{i=1}^{n} \sigma(\theta^{\top}\mathbf{x}^{(i)})^{y^{(i)}} \cdot \left[1 - \sigma(\theta^{\top}\mathbf{x}^{(i)})\right]^{(1 - y^{(i)})}
\end{align*}

And if you take the log of this function, you get the log-conditional likelihood of
the training dataset for Logistic Regression.
Side note: while we calculate the conditional likelihood here, it is worth noting
that maximizing log-likelihood and maximizing log-conditional likelihood are equivalent:

\begin{align*}
LL(\theta) &= \sum_{i=1}^{n} \log f(\mathbf{x}^{(i)}, y^{(i)} \mid \theta)
= \sum_{i=1}^{n} \log\big[f(\mathbf{x}^{(i)} \mid \theta)\, f(y^{(i)} \mid \mathbf{x}^{(i)}, \theta)\big] && \text{(chain rule)} \\
&= \sum_{i=1}^{n} \log\big[f(\mathbf{x}^{(i)})\, f(y^{(i)} \mid \mathbf{x}^{(i)}, \theta)\big] && \text{(\(X\) and \(\theta\) independent)} \\
\theta_{\text{MLE}} &= \operatorname*{arg\,max}_{\theta} LL(\theta)
= \operatorname*{arg\,max}_{\theta} \sum_{i=1}^{n} \big[\log f(\mathbf{x}^{(i)}) + \log f(y^{(i)} \mid \mathbf{x}^{(i)}, \theta)\big] && \text{(log of products)} \\
&= \operatorname*{arg\,max}_{\theta} \sum_{i=1}^{n} \log f(y^{(i)} \mid \mathbf{x}^{(i)}, \theta) && \text{(drop constants w.r.t. \(\theta\))}
\end{align*}

25.1.3 Gradient of Log Likelihood


In MLE, now that we have a function for the log-likelihood, we simply need to choose
the values of θ that maximize it. We can find the best values of θ by using an
optimization algorithm. However, in order to use an optimization algorithm, we
first need to know the partial derivative of the log-likelihood with respect to each
parameter. First I am going to give you the partial derivative (so you can see how
it is used). Then I am going to show you how to derive it.
The partial derivative of the log-likelihood with respect to each parameter θ_j is:

\[
\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[y^{(i)} - \sigma(\theta^{\top}\mathbf{x}^{(i)})\right] x_j^{(i)}
\]


25.2 Parameter Estimation using Gradient Ascent Optimization

Once we have an equation for the log likelihood, we choose the values for our pa-
rameters (θ) that maximize said function. In the case of logistic regression we
can't solve for θ mathematically. Instead we use a computer to choose θ.
To do so we employ an algorithm called gradient ascent. The claim is
that if you continuously take small steps in the direction of your gradient,
you will eventually make it to a local maximum. In the case of Logistic Regression
you can prove that the result will always be a global maximum.
The small step that we continually take given the training dataset can be
calculated as follows:

\begin{align*}
\theta_j^{\text{new}} &= \theta_j^{\text{old}} + \eta \cdot \frac{\partial LL(\theta^{\text{old}})}{\partial \theta_j} \\
&= \theta_j^{\text{old}} + \eta \cdot \sum_{i=1}^{n} \left[y^{(i)} - \sigma(\theta^{\text{old}\top}\mathbf{x}^{(i)})\right] x_j^{(i)},
\end{align*}

where η is the magnitude of the step size that we take. If you keep updating θ
using the equation above you will converge on the best values of θ!
Pro-tip: Don’t forget that in order to learn the value of θ0 , you can simply
define x0 = 1 for all datapoints.
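Translating the update rule into code gives a short training loop (a sketch; the function name, η, and the iteration count are arbitrary choices):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_logistic(X, y, eta=0.01, iterations=1_000):
    # Prepend x_0 = 1 to each datapoint so theta[0] plays the role of theta_0.
    X = [[1.0] + list(x) for x in X]
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = [0.0] * len(theta)
        for x, label in zip(X, y):
            err = label - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j in range(len(theta)):
                grad[j] += err * x[j]        # [y - sigma(theta^T x)] * x_j
        theta = [t + eta * g for t, g in zip(theta, grad)]
    return theta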

25.3 Derivations

In this section we provide the mathematical derivations for the log-likelihood
function and the gradient. The derivations are worth knowing because these ideas
are heavily used in Artificial Neural Networks.
Our goal is to calculate the derivative of the log likelihood with respect to each
theta. To start, here is the derivative of sigma with respect to its input:

\[
\frac{\partial}{\partial z}\sigma(z) = \sigma(z)[1 - \sigma(z)]
\qquad \text{(to differentiate with respect to } \theta\text{, use the chain rule)}
\]

Derivative of the gradient for one datapoint (x, y):


\begin{align*}
\frac{\partial LL(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j}\, y \log \sigma(\theta^{\top}\mathbf{x}) + \frac{\partial}{\partial \theta_j}\, (1-y)\log[1 - \sigma(\theta^{\top}\mathbf{x})] && \text{(derivative of a sum of terms)} \\
&= \left[\frac{y}{\sigma(\theta^{\top}\mathbf{x})} - \frac{1-y}{1 - \sigma(\theta^{\top}\mathbf{x})}\right] \frac{\partial}{\partial \theta_j}\sigma(\theta^{\top}\mathbf{x}) && \text{(derivative of } \log f(x)\text{)} \\
&= \left[\frac{y}{\sigma(\theta^{\top}\mathbf{x})} - \frac{1-y}{1 - \sigma(\theta^{\top}\mathbf{x})}\right] \sigma(\theta^{\top}\mathbf{x})[1 - \sigma(\theta^{\top}\mathbf{x})]\, x_j && \text{(chain rule + derivative of sigma)} \\
&= \left[\frac{y - \sigma(\theta^{\top}\mathbf{x})}{\sigma(\theta^{\top}\mathbf{x})[1 - \sigma(\theta^{\top}\mathbf{x})]}\right] \sigma(\theta^{\top}\mathbf{x})[1 - \sigma(\theta^{\top}\mathbf{x})]\, x_j && \text{(algebraic manipulation)} \\
&= \left[y - \sigma(\theta^{\top}\mathbf{x})\right] x_j && \text{(cancelling terms)}
\end{align*}

Because the derivative of a sum is the sum of the derivatives, the gradient with
respect to θ_j is simply the sum of this term over all training datapoints:

\[
\frac{\partial LL(\theta)}{\partial \theta_j} = \sum_{i=1}^{n} \left[y^{(i)} - \sigma(\theta^{\top}\mathbf{x}^{(i)})\right] x_j^{(i)}
\]



A Review

This appendix organizes the concepts to help you navigate the material.

Counting

Sum Rule: \(|A| + |B| = m + n\)
Product Rule: \(|A| \cdot |B| = mn\)
Inclusion-Exclusion: \(|A \cup B| = |A| + |B| - |A \cap B|\)
Floor and Ceiling: \(\lfloor 1.9 \rfloor = 1\) and \(\lceil 1.9 \rceil = 2\)
The Pigeonhole Principle: \(\lceil m/n \rceil\)

Notation: PMF \(= P(X = x) = p_X(x)\), PDF \(= f(x)\), CDF \(= F_X(x)\).

Combinatorics

Permutations: \(n!\)
Combinations (Binomial): \(\dfrac{n!}{r!\,(n-r)!} = \dbinom{n}{r}\)
Bucketing (Multinomial): \(\dfrac{n!}{n_1!\, n_2! \cdots n_r!} = \dbinom{n}{n_1, n_2, \ldots, n_r}\)
Divider Method: \(\dfrac{(n+r-1)!}{n!\,(r-1)!} = \dbinom{n+r-1}{n}\)

Probability

PMF: \(p_X(x) = P(X = x)\)
Expectation: \(E[X] = \sum_{x \in X} x \cdot p_X(x)\)
Variance: \(\operatorname{Var}(X) = E[X^2] - E[X]^2\)

                      number of successes     time until success
one trial             Ber(p)                  Geo(p) (one success)
several trials        Bin(n, p)               NegBin(r, p) (several successes)
interval of time      Poi(λ)                  Exp(λ) (time to first success)

Algorithm A.1. Aliases for distributions, expectation, and variance using the Distributions package.

using Distributions
# Probability mass functions (PMFs)
Ber = Distributions.Bernoulli
Bin = Distributions.Binomial
Poi = Distributions.Poisson
Geo = Distributions.Geometric
NegBin = Distributions.NegativeBinomial

# Probability density functions (PDFs)
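# Note (an added caution, not part of the original notes): Distributions.jl's
# Geometric counts failures before the first success, and its Exponential is
# parametrized by a scale θ; the 𝔼 and Var definitions below reinterpret the
# stored fields under the course conventions (trials until first success for
# Geometric, and a rate λ for Exponential).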


Uni = Distributions.Uniform
Exp = Distributions.Exponential

# Expectation
𝔼(X::Bernoulli) = X.p
𝔼(X::Binomial) = X.n*X.p
𝔼(X::Poisson) = X.λ
𝔼(X::Geometric) = 1/X.p
𝔼(X::NegativeBinomial) = X.r/X.p
𝔼(X::Uniform) = (X.a + X.b)/2
𝔼(X::Exponential) = 1/X.θ

# Variance
Var(X::Bernoulli) = X.p*(1-X.p)
Var(X::Binomial) = X.n*X.p*(1-X.p)
Var(X::Poisson) = X.λ
Var(X::Geometric) = (1-X.p)/X.p^2
Var(X::NegativeBinomial) = X.r*(1-X.p)/X.p^2
Var(X::Uniform) = (X.b-X.a)^2/12
Var(X::Exponential) = 1/X.θ^2



B Calculation Reference

From notes by Mehran Sahami and Lisa Yan.

B.1 Summation Identities

Here are some useful identities and rules related to working with summations.
In the rules below, f and g are arbitrary real-valued functions.

Pulling a constant out of a summation:

\[
\sum_{n=s}^{t} C \cdot f(n) = C \cdot \sum_{n=s}^{t} f(n), \quad \text{where } C \text{ is a constant.} \tag{B.1}
\]

Eliminating the summation by summing over the elements:

\begin{align*}
\sum_{i=1}^{n} x &= nx \tag{B.2} \\
\sum_{i=m}^{n} x &= (n - m + 1)x \tag{B.3} \\
\sum_{i=s}^{n} f(C) &= (n - s + 1) f(C), \quad \text{where } C \text{ is a constant.} \tag{B.4}
\end{align*}

Combining related summations:

\begin{align*}
\sum_{n=s}^{j} f(n) + \sum_{n=j+1}^{t} f(n) &= \sum_{n=s}^{t} f(n) \tag{B.5} \\
\sum_{n=s}^{t} f(n) + \sum_{n=s}^{t} g(n) &= \sum_{n=s}^{t} [f(n) + g(n)] \tag{B.6}
\end{align*}

Changing the bounds on the summation:

\[
\sum_{n=s}^{t} f(n) = \sum_{n=s+p}^{t+p} f(n - p) \tag{B.7}
\]

‘‘Reversing’’ the order of the summation:

\[
\sum_{n=a}^{b} f(n) = \sum_{n=b}^{a} f(n) \tag{B.8}
\]

Arithmetic series:

\begin{align*}
\sum_{i=0}^{n} i = \sum_{i=1}^{n} i &= \frac{n(n+1)}{2} \tag{B.9} \\
\sum_{i=m}^{n} i &= \frac{(n - m + 1)(n + m)}{2} \tag{B.10}
\end{align*}

Arithmetic series involving higher order polynomials:

\begin{align*}
\sum_{i=1}^{n} i^2 &= \frac{n(n+1)(2n+1)}{6} = \frac{n^3}{3} + \frac{n^2}{2} + \frac{n}{6} \tag{B.11} \\
\sum_{i=1}^{n} i^3 &= \left[\frac{n(n+1)}{2}\right]^2 = \frac{n^4}{4} + \frac{n^3}{2} + \frac{n^2}{4} = \left[\sum_{i=1}^{n} i\right]^2 \tag{B.12}
\end{align*}

Geometric series:

\begin{align*}
\sum_{i=0}^{n} x^i &= \frac{1 - x^{n+1}}{1 - x} \tag{B.13} \\
\sum_{i=m}^{n} x^i &= \frac{x^{n+1} - x^m}{x - 1} \tag{B.14} \\
\sum_{i=0}^{\infty} x^i &= \frac{1}{1 - x} \quad \text{if } |x| < 1 \tag{B.15}
\end{align*}
i =0

More exotic geometric series:


n
∑ i2i = 2 + 2n+1 (n − 1) (B.16)
i =0
n
i 2n +1 − n − 2
∑ 2i =
2n
(B.17)
i =0

Taylor expansion of the exponential function:

\[
e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \tag{B.18}
\]
n =0

Binomial coefficients:

\[
\sum_{i=0}^{n} \binom{n}{i} = 2^n \tag{B.19}
\]


B.2 Growth Rates of Summations

Besides solving a summation explicitly, it is also worthwhile to know some general


growth rates on sums, so you can (tightly) bound a sum if you are trying to prove
something in the big-Oh/Theta world. If you’re not familiar with big-Theta (Θ)
notation, you can think of it like big-Oh notation, but it actually provides a ‘‘tight’’
bound. Namely, big-Theta means that the function grows no more quickly and no
more slowly than the function specified, up to constant factors, so it’s actually more
informative than big-Oh.
Here are some useful bounds:

\begin{align*}
\sum_{i=1}^{n} i^c &= \Theta(n^{c+1}), \quad \text{for } c \geq 0 \tag{B.20} \\
\sum_{i=1}^{n} \frac{1}{i} &= \Theta(\log n) \tag{B.21} \\
\sum_{i=1}^{n} c^i &= \Theta(c^n), \quad \text{for } c \geq 2 \tag{B.22}
\end{align*}

B.3 Identities of Products

Recall that the mathematical symbol Π represents a product of terms (analogous
to Σ representing a sum of terms). Below, we give some useful identities related
to products.

Definition of factorial (by definition, 0! = 1):

\[
\prod_{i=1}^{n} i = n! \tag{B.23}
\]

Stirling's approximation for n! is given below. This approximation is useful when
computing n! for large values of n (particularly when n > 30).

\begin{align*}
n! &\approx \sqrt{2\pi n}\, \left(\frac{n}{e}\right)^n \tag{B.24} \\
&\approx \sqrt{2\pi}\; n^{n + \frac{1}{2}}\, e^{-n} \tag{B.25}
\end{align*}


Eliminating the product by multiplying over the elements:

\[
\prod_{i=1}^{n} C = C^n, \quad \text{where } C \text{ is a constant.} \tag{B.26}
\]

Combining products:

\[
\prod_{i=1}^{n} f(i) \cdot \prod_{i=1}^{n} g(i) = \prod_{i=1}^{n} f(i) \cdot g(i) \tag{B.27}
\]

Turning products into summations (using logarithms, assuming f(i) > 0 for all i):

\[
\log\left(\prod_{i=1}^{n} f(i)\right) = \sum_{i=1}^{n} \log f(i) \tag{B.28}
\]

B.4 Calculus Review

Product rule for derivatives:

\[
d(u \cdot v) = du \cdot v + u \cdot dv \tag{B.29}
\]

Derivative of the exponential function:

\[
\frac{d}{dx} e^u = e^u \frac{du}{dx} \tag{B.30}
\]

Integrals of the exponential function:

\begin{align*}
\int e^u\, du &= e^u \tag{B.31} \\
\int A e^{Au}\, du &= e^{Au} \tag{B.32} \\
\int e^{Cu}\, du &= \frac{1}{C}\, e^{Cu} \tag{B.33}
\end{align*}

Derivative of the natural logarithm:

\[
\frac{d}{dx} \ln(x) = \frac{1}{x} \tag{B.34}
\]

Integral of \(\frac{1}{x}\):

\[
\int \frac{1}{x}\, dx = \ln(x) \tag{B.35}
\]


Integration by parts (everyone’s favorite!). Choose a suitable u and dv to decompose the integral of interest:

    \int u \, dv = u \cdot v - \int v \, du   (B.36)

Here’s the underlying rule that integration by parts is derived from:

    \int d(u \cdot v) = u \cdot v = \int du \cdot v + \int u \, dv   (B.37)
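For example (a worked instance added for illustration), choosing u = x and dv = e^x dx gives

    \int x e^x \, dx = x e^x - \int e^x \, dx = (x - 1) e^x + C,

which is exactly the form of integral that appears when computing expectations involving exponential densities.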

B.5 Computing Permutations and Combinations

For your problem set solutions it is fine for your answers to include factorials,
exponentials, or combinations; you don’t need to calculate those all out to get a
single numeric answer. However, if you’d like to work with those in Python or
Julia, here are a few functions you may find useful.

Computes        Python*                     Julia
n!              math.factorial(n)           factorial(n)
\binom{n}{k}    scipy.special.binom(n, k)   binomial(n, k)
e^n             math.exp(n)                 exp(n)
n^m             n ** m                      n^m

*Names to the left of the dots (.) are modules that need to be imported before being used: import math, scipy.special.
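For example, in Python (assuming math and scipy.special are available):

    import math
    import scipy.special

    print(math.factorial(5))          # 120
    print(scipy.special.binom(5, 2))  # 10.0 (returned as a float)
    print(math.exp(2))                # 7.389056...
    print(5 ** 3)                     # 125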




Problem Set 1, Combinatorics

PSET1 Q1. How many ways can 10 people be seated in a row if


a. there are no restrictions on the seating arrangement?
b. persons A and B must sit next to each other?
c. there are 5 adults and 5 children, and no two adults nor two children can sit next to each other?
d. there are 5 married couples and each couple must sit together?

Answer.
a.
b.
c.
d.


PSET1 Q2. At the local zoo, a new exhibit consisting of 3 different species of birds and 3 different species of
reptiles is to be formed from a pool of 8 bird species and 6 reptile species. How many exhibits are possible if
a. there are no additional restrictions on which species can be selected?
b. 2 particular bird species cannot be placed together (e.g., they have a predator-prey relationship)?
c. 1 particular bird species and 1 particular reptile species cannot be placed together?

Answer.
a.
b.
c.


PSET1 Q3. A university is offering 3 programming classes: one in Java, one in C++, and one in Python. The classes are open to any of the 100 students at the university. There are: a total of 27 students in the Java class; a total of 28 students in the C++ class; a total of 20 students in the Python class; 12 students in both the Java and C++ classes (note: these students are also counted as being in each class in the numbers above); 5 students in both the Java and Python classes; 8 students in both the C++ and Python classes; and 2 students in all three classes (note: these students are also counted as being in each pair of classes).

[Figure: Venn diagram of J = 27, C = 28, P = 20, with exclusive region counts 12 (Java only), 10 (Java and C++ only), 10 (C++ only), 3 (Java and Python only), 2 (all three), 6 (C++ and Python only), and 9 (Python only).]

a. If a student is chosen randomly at the university, what is the probability that the student is not in any of the 3 programming classes?
b. If a student is chosen randomly at the university, what is the probability that the student is taking exactly one of the three programming classes?
c. If two different students are chosen randomly at the university, what is the probability that at least one of the chosen students is taking at least one of the programming classes?

Answer.
a.
b.
c.


PSET1 Q4. Say you have $20 million that must be invested among 4 possible companies. Each investment must
be in integral units of $1 million, and there are minimal investments that need to be made if one is to invest in
these companies. The minimal investments are $1, $2, $3, and $4 million dollars, respectively for company 1, 2,
3, and 4. How many different investment strategies are available if
a. an investment must be made in each company?
b. investments must be made in at least 3 of the 4 companies?
c. Now assume that we do not have a minimal investment in any of the companies. How many different
investment strategies are available if we must invest less than or equal to $k million dollars total among the
4 companies, where k is a non-negative integer less than or equal to 20 (i.e., 0 ≤ k ≤ 20)? Note that you can
think of k as a constant that can be used in your answer.

Answer.
a.
b.
c.


PSET1 Q5. Consider an array x of integers with k elements (e.g., int x[k]), where each entry in the array has
a distinct integer value between 1 and n, inclusive, and the array is sorted in increasing order. In other words,
1 ≤ x[i] ≤ n, for all i = 0, 1, 2, . . . , k − 1, and the array is sorted, so x[0] < x[1] < . . . < x[k-1]. How many
such sorted arrays are possible?

Answer.


PSET1 Q6. If we assume that all possible poker hands (comprised of 5 cards from a standard 52 card deck) are
equally likely, what is the probability of being dealt:
a. a flush? (A hand is said to be a flush if all 5 cards are of the same suit. Note that this definition means that
straight flushes (five cards of the same suit in numeric sequence) are also considered flushes.)
b. two pairs? (This occurs when the cards have numeric values a, a, b, b, c, where a, b and c are all distinct.)
c. four of a kind? (This occurs when the cards have numeric values a, a, a, a, b, where a and b are distinct.)

Answer.
a.
b.
c.


PSET1 Q7. Imagine you have a robot (Θ) that lives on an n × m grid (it has n rows and m columns). The robot starts in cell (1, 1) and can take steps either to the right or down (no left or up steps). How many distinct paths can the robot take to the destination in cell (n, m) if

[Figure: an n × m grid with the robot (Θ) in the top-left cell and the destination in the bottom-right cell.]

a. there are no additional constraints?

b. the robot must start by moving to the right?

c. the robot changes direction exactly 3 times? Moving down two times in a row is not changing directions, but switching from moving down to moving right is. For example, moving [down, right, right, down] would count as having two direction changes.

Answer.
a.
b.
c.


PSET1 Q8. Say we roll a six-sided die six times. What is the probability that
a. we will roll two different numbers thrice (three times) each?
b. we will roll exactly one number exactly three times? Hint: Be careful of overcounting.

Answer.
a.
b.


PSET1 Q9. A binary string containing M 0’s and N 1’s (in arbitrary order, where all orderings are equally
likely) is sent over a network. What is the probability that the first r bits of the received message contain exactly
k 1’s?

Answer.


PSET1 Q10. Say we send out a total of 20 distinguishable emails to 12 distinct users, where each email we send
is equally likely to go to any of the 12 users (note that it is possible that some users may not actually receive any
email from us). What is the probability that the 20 emails are distributed such that there are 4 users who receive
exactly 2 emails each from us and 3 users who receive exactly 4 emails each from us?

Answer.


PSET1 Q11. Say a hacker has a list of n distinct password candidates, only one of which will successfully log
her into a secure system.
a. If she tries passwords from the list at random, deleting those passwords that do not work, what is the
probability that her first successful login will be (exactly) on her k-th try?
b. Now say the hacker tries passwords from the list at random, but does not delete previously tried passwords
from the list. She stops after her first successful login attempt. What is the probability that her first successful
login will be (exactly) on her k-th try?

Answer.
a.
b.


PSET1 Q12. Suppose that m strings are hashed (randomly) into N buckets, assuming that all N^m arrangements are equally likely. Find the probability that exactly k strings are hashed to the first bucket.

Answer.


PSET1 Q13. [Extra credit] To get good performance when working with binary search trees (BST), we must
consider the probability of producing completely degenerate BSTs (where each node in the BST has at most one
child). See Lecture Notes #2, Example 2, for more details on binary search trees.
a. If the integers 1 through n are inserted in arbitrary order into a BST (where each possible order is equally
likely), what is the probability (as an expression in terms of n) that the resulting BST will have completely
degenerate structure?
b. Using your expression from part (a), determine the smallest value of n for which the probability of forming
a completely degenerate BST is less than 0.001 (i.e., 0.1%).

Answer.


PSET1 Q14. [Coding] Consider a game, which uses a generator that produces independent random integers
between 1 and 100, inclusive. The game starts with a sum S = 0. The first player adds random numbers from
the generator to S until S > 100, at which point they record their last random number x. The second player
continues by adding random numbers from the generator to S until S > 200, at which point they record their
last random number y. The player with the highest number wins; e.g., if y > x, the second player wins. Write a
Python 3 program to simulate 100,000 games and output the estimated probability that the second player wins.
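One possible shape for such a simulation is sketched below (a sketch only, not an official solution; the function and variable names are our own):

    import random

    def second_player_wins():
        # Simulate one game; return True if the second player wins.
        s = 0
        # First player adds random numbers until the sum exceeds 100;
        # x holds their last random number.
        while s <= 100:
            x = random.randint(1, 100)
            s += x
        # Second player continues until the sum exceeds 200;
        # y holds their last random number.
        while s <= 200:
            y = random.randint(1, 100)
            s += y
        return y > x

    n_trials = 100_000
    wins = sum(second_player_wins() for _ in range(n_trials))
    print(wins / n_trials)  # estimated probability that the second player wins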

Answer.


Problem Set 2, Probability

PSET2 Q1. Say in Silicon Valley, 35% of engineers program in Java and 28% of the engineers who program in
Java also program in C++. Furthermore, 40% of engineers program in C++.
a. What is the probability that a randomly selected engineer programs in Java and C++?
b. What is the conditional probability that a randomly selected engineer programs in Java given that they
program in C++?

Answer.
a.
b.


PSET2 Q2. A website wants to detect if a visitor is a robot or a human. They give the visitor five CAPTCHA tests that are hard for robots but easy for humans. If the visitor fails one of the tests, they are flagged as a robot.
The probability that a human succeeds at a single test is 0.95, while a robot only succeeds with probability 0.3.
Assume all tests are independent. The percentage of visitors on this website that are robots is 5%; all other
visitors are human.
a. If a visitor is actually a robot, what is the probability they get flagged (the probability they fail at least one
test)?
b. If a visitor is human, what is the probability they get flagged?
c. Suppose a visitor gets flagged. Using your answers from part (a) and (b), what is the probability that the
visitor is a robot?
d. If a visitor is human, what is the probability that they pass exactly three of the five tests?
e. Building off of your answer from part (d), what is the probability that a visitor with unknown identity
passes exactly three of the five tests?

Answer.
a.
b.
c.
d.
e.


PSET2 Q3. Say all computers either run operating system W or X. A computer running operating system W is
twice as likely to get infected with a virus as a computer running operating system X. If 70% of all computers
are running operating system W, what percentage of computers infected with a virus are running operating
system W?

Answer.

name: email:
Q4 cs 109: problem set 2, probability July 29, 2020

PSET2 Q4. The Superbowl institutes a new way to determine which team receives the kickoff first. The referee
chooses with equal probability one of three coins. Although the coins look identical, they have probability
of heads 0.1, 0.5 and 0.9, respectively. Then the referee tosses the chosen coin 3 times. If more than half the
tosses come up heads, one team will kick off; otherwise, the other team will kick off. If the tosses resulted in the
sequence H, T, H, what is the probability that the fair coin was actually used?

Answer.


PSET2 Q5. After a long night of programming, you have built a powerful, but slightly buggy, email spam
filter. When you don’t encounter the bug, the filter works very well, always marking a spam email as spam
and always marking a non-spam email as good. Unfortunately, your code contains a bug that is encountered
10% of the time when the filter is run on an email. When the bug is encountered, the filter always marks the
email as good. As a result, emails that are actually spam will be erroneously marked as good when the bug is
encountered. Let p denote the probability that an email is actually non-spam, and let q denote the conditional
probability that an email is non-spam given that it is marked as good by the filter.

a. Determine q in terms of p.
b. Using your answer from part (a), explain mathematically whether q or p is greater. Also, provide an intuitive
justification for your answer.

Answer.
a.
b.


PSET2 Q6. Two cards are randomly chosen without replacement from an ordinary deck of 52 cards. Let E be
the event that both cards are Aces. Let F be the event that the Ace of Spades is one of the chosen cards, and let
G be the event that at least one Ace is chosen.

a. Compute P( E | F ).
b. Are E and F independent? Justify your answer using your response to part (a).
c. Compute P( E | G ).

Answer.
a.
b.
c.


PSET2 Q7. Your colleagues in a comp-bio lab have sequenced DNA from a large population in order to
understand how a gene (G) influences two particular traits (T1 and T2 ). They find that P( G ) = 0.6, P( T1 | G ) =
0.7, and P( T2 | G ) = 0.9. They also observe that if a subject does not have the gene G, they express neither T1
nor T2 . The probability of a patient having both T1 and T2 given that they have the gene G is 0.63.
a. Are T1 and T2 conditionally independent given G?
b. Are T1 and T2 conditionally independent given G c ?
c. What is P( T1 )?
d. What is P( T2 )?
e. Are T1 and T2 independent?

Answer.
a.
b.
c.
d.
e.


PSET2 Q8. The color of a person’s eyes is determined by a pair of eye-color genes, as follows:
• if both of the eye-color genes are blue-eyed genes, then the person will have blue eyes
• if one or more of the genes is a brown-eyed gene, then the person will have brown eyes
A newborn child independently receives one eye-color gene from each of its parents, and the gene it receives
from a parent is equally likely to be either of the two eye-color genes of that parent. Suppose William and both
of his parents have brown eyes, but William’s sister (Claire) has blue eyes. (We assume that blue and brown
are the only eye-color genes.)

a. What is the probability that William possesses a blue-eyed gene?


b. Suppose that William’s wife has blue eyes. What is the probability that their first child will have blue eyes?

Answer.
a.
b.


PSET2 Q9. Consider the following algorithm for betting in roulette. At each round (‘‘spin’’), you bet $1 on a
color (‘‘red’’ or ‘‘black’’). If that color comes up on the wheel, you keep your bet AND win $1; otherwise, you
lose your bet.
i. Bet $1 on ‘‘red’’
ii. If ‘‘red’’ comes up on the wheel (with probability 18/38), then you win $1 (and keep your original $1 bet)
and you immediately quit (i.e., you do not do step (iii) below).
iii. If ‘‘red’’ did not come up on the wheel (with probability 20/38), then you lose your initial $1 bet. But,
then you bet $1 on ‘‘red’’ on each of the next two spins of the wheel. After those two spins, you quit (no
matter what the outcome of the next two spins).
Let X denote your ‘‘winnings’’ when you quit, i.e., the total amount of money won minus any amounts lost
while playing. This value may be negative.
a. Determine P( X > 0).
b. Determine E[ X ]. (Rhetorical question: Would you play this game?)

Answer.
a.
b.


PSET2 Q10.
c. For each gene i, decide whether or not you think it would be reasonable to assume that Gi is independent
of T. Support your argument with numbers. Remember that our probabilities are based on 100,000 bats, not
infinite bats, and are therefore only estimates of the true probabilities.
d. Give your best interpretation of the results from (a) to (c).
e. [Extra Credit] Try and find conditional independence relationships between the genes and the trait. Incorporate this information to improve your hypothesis of how the five genes relate to whether or not a bat can carry Ebola.
carry Ebola.

Answer.
c.
d.
e.


PSET2 Q11. [Extra Credit] Suppose we want to write an algorithm fairRandom for randomly generating a 0
or a 1 with equal probability (= 0.5). Unfortunately, all we have available to us is a function:
unknownRandom()::Int

that randomly generates bits, where on each call a 1 is returned with some unknown probability p that need not
be equal to 0.5 (and a 0 is returned with probability 1 − p).
Consider the following algorithms for fairRandom and simpleRandom:

function fairRandom()
    r₁, r₂ = 0, 0
    while true
        r₁ = unknownRandom()
        r₂ = unknownRandom()
        if (r₁ ≠ r₂) break; end
    end
    return r₂
end

function simpleRandom()
    r₁, r₂ = 0, 0
    r₁ = unknownRandom()
    while true
        r₂ = unknownRandom()
        if (r₁ ≠ r₂) break; end
    end
    return r₂
end
a. Show mathematically that fairRandom does indeed return a 0 or a 1 with equal probability.
b. Say we want to simplify the function, so we write the simpleRandom function. Would the simpleRandom
function also generate 0’s and 1’s with equal probability? Explain why or why not. In addition, determine
P(simpleRandom returns 1) in terms of p.

Answer.
a.
b.


Problem Set 3, Random Variables

PSET3 Q2. Lyft line gets 2 requests every 5 minutes, on average, for a particular route. A user requests the
route and Lyft commits a car to take her. All users who request the route in the next five minutes will be added
to the car as long as the car has space. The car can fit up to three users. Lyft will make $7 for each user in the car
(the revenue) minus $9 (the operating cost).
a. How much does Lyft expect to make from this trip?
b. Lyft has one space left in the car and wants to wait to get another user. What is the probability that another
user will make a request in the next 30 seconds?

Answer.
a.
b.


PSET3 Q3. Suppose it takes at least 9 votes from a 12-member jury to convict a defendant. Suppose also that
the probability that a juror votes that an actually guilty person is innocent is 0.25, whereas the probability that
the juror votes that an actually innocent person is guilty is 0.15. If each juror acts independently and if 70% of
defendants are actually guilty, find the probability that the jury renders a correct decision. Also determine the
percentage of defendants found guilty by the jury.

Answer.


PSET3 Q4. To determine whether they have measles, 1000 people have their blood tested. However, rather
than testing each individual separately (1000 tests is quite costly), it is decided to use a group testing procedure:
• Phase 1: First, place people into groups of 5. The blood samples of the 5 people in each group will be pooled
and analyzed together. If the test is positive (at least one person in the pool has measles), continue to Phase 2.
Otherwise send the group home. 200 of these pooled tests are performed.
• Phase 2: Individually test each of the 5 people in the group. 5 of these individual tests are performed per
group in Phase 2.
Suppose that the probability that a person has measles is 5% for all people, independently of others, and that
the test has a 100% true positive rate and 0% false positive rate (note that this is unrealistic). Using this strategy,
compute the expected total number of blood tests (individual and pooled) that we will have to do across Phases
1 and 2.

Answer.


PSET3 Q5. Let X be a continuous random variable with probability density function:

    f(x) = \begin{cases} c(2 - 2x^2) & -1 < x < 1 \\ 0 & \text{otherwise} \end{cases}

a. What is the value of c?


b. What is the cumulative distribution function (CDF) of X?
c. What is E[ X ]?

Answer.
a.
b.
c.


PSET3 Q6. The number of times a person’s computer crashes in a month is a Poisson random variable with
λ = 7. Suppose that a new operating system patch is released that reduces the Poisson parameter to λ = 2 for
80% of computers, and for the other 20% of computers the patch has no effect on the rate of crashes. If a person
installs the patch, and has their computer crash 4 times in the month thereafter, how likely is it that the patch
has had an effect on the user’s computer (i.e., it is one of the 80% of computers that the patch reduces crashes
on)?

Answer.


PSET3 Q7. Say there are k buckets in a hash table. Each new string added to the table is hashed into bucket i with probability p_i, where \sum_{i=1}^{k} p_i = 1. If n strings are hashed into the table, find the expected number of buckets that have at least one string hashed to them. (Hint: Let X_i be a binary variable that has the value 1 when there is at least one string hashed into bucket i after the n strings are added to the table (and 0 otherwise). Compute E[\sum_{i=1}^{k} X_i].)

Answer.


PSET3 Q8. You are testing software and discover that your program has a non-deterministic bug that causes
catastrophic failure (aka a ‘‘hindenbug’’). Your program was tested for 400 hours and the bug occurred twice.
a. Each user uses your program to complete a three hour long task. If the hindenbug manifests they will
immediately stop their work. What is the probability that the bug manifests for a given user?
b. Your program is used by one million users. Use a Normal approximation to estimate the probability that
more than 10,000 users experience the bug. Use your answer from part (a). Provide a numeric answer for
this part.

Answer.
a.
b.


PSET3 Q9. Say the lifetimes of computer chips produced by a certain manufacturer are normally distributed with parameters µ = 1.5 × 10^6 hours and σ = 9 × 10^5 hours. The lifetime of each chip is independent of the other chips produced.
a. What is the approximate probability that a batch of 100 chips will contain at least 6 whose lifetimes are more than 3.0 × 10^6 hours?
b. What is the approximate probability that a batch of 100 chips will contain at least 65 whose lifetimes are less than 1.9 × 10^6 hours? Provide a numeric answer for this part.

Answer.
a.
b.


PSET3 Q10. A Bloom filter is a probabilistic implementation of the set data structure, an unordered collection
of unique objects. In this problem we are going to look at it theoretically. Our Bloom filter uses 3 different
independent hash functions H1 , H2 , H3 that each take any string as input and each return an index into a
bit-array of length n. Each index is equally likely for each hash function.
To add a string into the set, feed it to each of the 3 hash functions to get 3 array positions. Set the bits at all
these positions to 1. For example, initially all values in the bit-array are zero. In this example n = 10:

Index: 0 1 2 3 4 5 6 7 8 9
Value: 0 0 0 0 0 0 0 0 0 0

After adding a string ‘‘pie’’, where H1 (‘‘pie’’) = 4, H2 (‘‘pie’’) = 7, and H3 (‘‘pie’’) = 8:

Index: 0 1 2 3 4 5 6 7 8 9
Value: 0 0 0 0 1 0 0 1 1 0

Bits are never switched back to 0. Consider a Bloom filter with n = 9,000 buckets. You have added m = 1,000
strings to the Bloom filter. Provide a numerical answer for all questions.
a. What is the (approximated) probability that the first bucket has 0 strings hashed to it?

To check whether a string is in the set, feed it to each of the 3 hash functions to get 3 array positions. If any of the
bits at these positions is 0, the element is not in the set. If all bits at these positions are 1, the string may be in the
set; but it could be that those bits are 1 because some of the other strings hashed to the same values. You may
assume that the value of one bucket is independent of the value of all others.
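To make the add/check mechanics concrete, here is a small sketch (ours, not part of the problem statement; the salted-SHA-256 positions function is a stand-in for the hash functions H1, H2, H3 above):

    import hashlib

    N = 9_000
    bits = [0] * N

    def positions(s):
        # Three stand-in hash positions for string s (salted SHA-256).
        return [int(hashlib.sha256((salt + s).encode()).hexdigest(), 16) % N
                for salt in ("h1:", "h2:", "h3:")]

    def add(s):
        for i in positions(s):
            bits[i] = 1

    def might_contain(s):
        # False means definitely absent; True means possibly present.
        return all(bits[i] == 1 for i in positions(s))

    add("pie")
    print(might_contain("pie"))   # always True once added
    print(might_contain("cake"))  # False unless its three bits happen to be set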

Answer.
a.

b. What is the probability that a string which has not previously been added to the set will be misidentified
as in the set? That is, what is the probability that the bits at all of its hash positions are already 1? Use
approximations where appropriate.

Answer.
b.

c. Our Bloom filter uses three hash functions. Was that necessary? Repeat your calculation in (b) assuming
that we only use a single hash function (not 3).


Answer.
c.

(Chrome uses a Bloom filter to keep track of malicious URLs. Questions such as this allow us to compute
appropriate sizes for hash tables in order to get good performance with high probability in applications where
we have a ballpark idea of the number of elements that will be hashed into the table.)


PSET3 Q11. Last summer (May 2019) the concentration of CO2 in the atmosphere was 414 parts per million
(ppm) which is substantially higher than the pre-industrial concentration: 275 ppm. CO2 is a greenhouse gas
and as such increased CO2 corresponds to a warmer planet.
Absent some pretty significant policy changes, we will reach a point within the next 50 years (i.e., well within
your lifetime) where the CO2 in the atmosphere will be double the pre-industrial level. In this problem we
are going to explore the following question: What will happen to the global temperature if atmospheric CO2
doubles?
The measure, in degrees Celsius, of how much the global average surface temperature will change (at the
point of equilibrium) after a doubling of atmospheric CO2 is called ‘‘Climate Sensitivity.’’ Since the earth is a
complicated ecosystem, climate scientists model Climate Sensitivity as a random variable, S. The IPCC Fourth Assessment Report had a summary of 10 scientific studies that estimated the PDF of S.
In this problem we are going to treat S as part-discrete and part-continuous. For values of S less than 7.5, we
are going to model sensitivity as a discrete random variable with PMF based on the average of estimates from
the studies in the IPCC report. Here is the PMF for S in the range 0 through 7.5:

Sensitivity, S (degrees C)   0      1      2      3      4      5      6      7
Expert Probability           0.00   0.11   0.26   0.22   0.16   0.09   0.06   0.04

The IPCC fifth assessment report notes that there is a non-negligible chance of S being greater than 7.5
degrees but didn’t go into detail about probabilities. In the paper ‘‘Fat-Tailed Uncertainty in the Economics
of Catastrophic Climate Change” Martin Weitzman discusses how different models for the PDF of Climate
Sensitivity (S) for large values of S have wildly different policy implications.
For values of S greater than or equal to 7.5 degrees Celsius, we are going to model S as a continuous random
variable. Consider two different assumptions for S when it is at least 7.5 degrees Celsius: a fat tailed distribution
( f 1 ) and a thin tailed distribution ( f 2 ):

    f_1(x) = \frac{K}{x}     s.t. 7.5 ≤ x < 30

    f_2(x) = \frac{K}{x^3}   s.t. 7.5 ≤ x < 30
For this problem assume that the probability that S is greater than 30 degrees Celsius is 0.


a. Compute the probability that Climate Sensitivity is at least 7.5 degrees Celsius.

Answer.
a.

b. Calculate the value of K for both f 1 and f 2 .

Answer.
b.

c. It is estimated that if temperatures rise more than 10 degrees Celsius, all the ice on Greenland will melt.
Estimate the probability that S is greater than 10 under both the f 1 and f 2 assumptions.

Answer.
c.

d. Calculate the expectation of S under both the f 1 and f 2 assumptions.

Answer.
d.

e. Let R = S^2 be a crude approximation of the cost to society that results from S. Calculate E[R] under both the f_1 and f_2 assumptions.

Answer.
e.

Notes: (1) Both f 1 and f 2 are ‘‘power law distributions’’. (2) Calculating expectations for a variable that is part
discrete and part continuous is as simple as: use the discrete formula for the discrete part and the continuous
formula for the continuous part.


PSET3 Q12. [Extra Credit] Say we have a cable of length n. We select a point (chosen uniformly randomly)
along the cable, at which we cut the cable into two pieces. What is the probability that the shorter of the two
pieces of the cable is less than 1/3 the size of the longer of the two pieces? Explain formally how you derived
your answer.

Answer.


Problem Set 4, Distributions

PSET4 Q1. The median of a continuous random variable having cumulative distribution function F is the value
m such that F (m) = 0.5. That is, a random variable is just as likely to be larger than its median as it is to be
smaller. Find the median of X (in terms of the respective distribution parameters) in each case below.
a. X ∼ Uni(a, b)
b. X ∼ N(µ, σ^2)
c. X ∼ Exp(λ)

Answer.
a.
b.
c.


PSET4 Q2. Users independently sign up for two online social networking sites, Lookbook and Quickgram. On
average, 7.5 users sign up for Lookbook each minute, while on average 5.5 users sign up for Quickgram each
minute. The number of users signing up for Lookbook and for Quickgram each minute are independent. A new
user is defined as a new account, i.e., the same person signing up for both social networking sites will count as
two new users.
a. What is the probability that more than 10 new users will sign up for the Lookbook social networking site in
the next minute?
b. What is the probability that more than 13 new users will sign up for the Quickgram social networking site
in the next 2 minutes?
c. What is the probability that the company will get a combined total of more than 40 new users across both
websites in the next 2 minutes?

Answer.
a.
b.
c.


PSET4 Q3. Say that of all the students who will attend Stanford, each will buy at most one laptop computer
when they first arrive at school. 40% of students will purchase a PC, 30% will purchase a Mac, 10% will purchase
a Linux machine and the remaining 20% will not buy any laptop at all. If 15 students are asked which, if any,
laptop they purchased, what is the probability that exactly 6 students will have purchased a PC, 4 will have
purchased a Mac, 2 will have purchased a Linux machine, and the remaining 3 students will have not purchased
any laptop?

Answer.


PSET4 Q4. Say we have two independent variables X and Y, such that X ∼ Geo(p) and Y ∼ Geo(p). Mathematically derive an expression for P(X = k | X + Y = n), where k and n are non-negative integers.

Answer.


PSET4 Q5. Choose a number X at random from the set of numbers {1, 2, 3, 4, 5}. Now choose a number at
random from the subset no larger than X, that is from {1, . . . , X }. Let Y denote the second number chosen.
a. Determine the joint probability mass function of X and Y.
b. Determine the conditional mass function of X given Y = i. Do this for i = 1, 2, 3, 4, 5.
c. Are X and Y independent? Justify your answer.

Answer.
a.
b.
c.


PSET4 Q6. Let X_1, X_2, . . . be a series of independent random variables which all have the same mean µ and the same variance σ^2. Let Y_n = X_n + X_{n+1}. For j = 0, 1, and 2, determine Cov(Y_n, Y_{n+j}). Note that you may have different cases for your answer depending on the value of j.

Answer.


PSET4 Q7. Our ability to fight contagious diseases depends on our ability to model them. One person is
exposed to llama-flu. The method below models the number of individuals who will get infected.
from scipy import stats

def num_infected():
    """Return the number of people infected by one individual."""
    # most people are immune to llama flu.
    # stats.bernoulli(p).rvs() returns 1 w.p. p (0 otherwise)
    immune = stats.bernoulli(p=0.99).rvs()
    if immune:
        return 0

    # people who are not immune spread the disease far by
    # making contact with k people (up to 100).
    spread = 0
    # stats.binom(n, p).rvs() returns a random number of successes
    # in n trials, each succeeding w.p. p
    k = stats.binom(n=100, p=0.25).rvs()
    for i in range(k):
        spread += num_infected()

    # total infections will include this individual
    return spread + 1

What is the expected return value of num_infected()?

Answer.


PSET4 Q8. In class, we considered the following recursive function:

import numpy as np

def recurse():
    x = np.random.choice([1, 2, 3])  # equally likely values 1, 2, 3
    if x == 1:
        return 3
    elif x == 2:
        return 5 + recurse()
    else:
        return 7 + recurse()

Let Y = the value returned by recurse(). We previously computed E[Y] = 15. What is Var(Y)?

Answer.


PSET4 Q9. You go on a camping trip with two friends who each have a mobile phone. Since you are out in the
wilderness, mobile phone reception isn’t very good. One friend’s phone will independently drop calls with
20% probability. Your other friend’s phone will independently drop calls with 30% probability. Say you need to
make 6 phone calls, so you randomly choose one of the two phones and you will use that same phone to make
all your calls (but you don’t know which has a 20% versus 30% chance of dropping calls). Of the first 3 (out of
6) calls you make, one of them is dropped. What is the conditional expected number of dropped calls in the 6
total calls you make (conditioned on having already had one of the first three calls dropped)?

Answer.


PSET4 Q11. [Extra Credit] Consider a bit string of length n, where each bit is independently generated and
has probability p of being a 1. We say that a bit switch occurs whenever a bit differs from the one preceding it
in the string (if there is a preceding bit). For example, if n = 5 and we have the bit string 11010, then there
are 3 bit switches. Find the expected number of bit switches in a string of length n. (Hint: You might find it
helpful to use a set of indicator (Bernoulli) variables that are defined in terms of whether a bit switch occurred
in each position of the string. And in case you’re wondering why we care about bit switches, the number of bit
switches in a string can be one indicator of how compressible that string might be—for example, if the bit string
represented a file that we were trying to ZIP.)

Answer.


Problem Set 5, Sampling

PSET5 Q1. The joint probability density function of continuous random variables X and Y is given by:

    f_{X,Y}(x, y) = c \, \frac{y}{x}, where 0 < y < x < 1

a. What is the value of c in order for f_{X,Y}(x, y) to be a valid probability density function?

Answer.
a.


b. Are X and Y independent? Explain why or why not.

Answer.
b.


c. What is the marginal density function of X?

Answer.
c.

d. What is the marginal density function of Y?

Answer.
d.

e. What is E[ X ]?

Answer.
e.


PSET5 Q2. A robot is located at the center of a square world that is 10 kilometers on each side. A package is
dropped off in the robot’s world at a point ( x, y) that is uniformly (continuously) distributed in the square. If
the robot’s starting location is designated to be (0, 0) and the robot can only move up/down/left/right parallel
to the sides of the square, the distance the robot must travel to get to the package at point ( x, y) is | x | + |y|. Let
D = the distance the robot travels to get to the package. Compute E[ D ].

Answer.


PSET5 Q3. Let X, Y, and Z be independent random variables, where X ∼ N(µ_1, σ_1^2), Y ∼ N(µ_2, σ_2^2), and Z ∼ N(µ_3, σ_3^2).
a. Let A = X + Y. What is the distribution (along with parameter values) of A?

Answer.
a.

b. Let B = 4X + 3. What is the joint distribution (along with parameter values) of B and Z? (Hint: Bivariate
Normal)

Answer.
b.


c. Let C = aX − b^2 Y + cZ, where a, b, and c are real-valued constants. What is the distribution (along with parameter values) of C? Show how you derived your answer.

Answer.
c.


PSET5 Q4. A fair 6-sided die is repeatedly rolled until the total sum of all the rolls exceeds 300. Approximate
(using the Central Limit Theorem) the probability that at least 80 rolls are necessary to reach a sum that exceeds
300.

Answer.


PSET5 Q5. Program A will run 20 algorithms in sequence, with the running time for each algorithm being independent random variables with mean = 46 seconds and variance = 100 seconds^2. Program B will run 20 algorithms in sequence, with the running time for each algorithm being independent random variables with mean = 48 seconds and variance = 200 seconds^2.
a. What is the approximate probability that Program A completes in less than 900 seconds?

Answer.
a.


b. What is the approximate probability that Program B completes in less than 900 seconds?

Answer.
b.

c. What is the approximate probability that Program A completes in less time than Program B?

Answer.
c.


PSET5 Q6.
c. [Written] Use your answers to part (a) and (b) and approximate X and Y as Normal random variables
with mean and variance that match their biometric data. Report both distributions.
d. [Written] Calculate the ratio of the probability that user A wrote the email over the probability that user B
wrote the email. You do not need to submit code, but you should include the formula that you attempted to
calculate and a short description (a few sentences) of how your code works.

Answer.
c.
d.


PSET5 Q7.
d. [Written] Would you use the mean or the median of 5 peer grades to assign scores in the online version of
Stanford’s HCI class? Hint: it might help to visualize the scores. Feel free to write code to help you answer
this question, but for this question we’ll solely evaluate your written answer in the PDF that you upload to
Gradescope.

Answer.
d.


PSET5 Q8.
c. [Written] For each of the three backgrounds, calculate a difference in means in learning outcome between
activity1 and activity2, and the p-value of that difference.

d. [Written] Your manager at Coursera is concerned that you have been ‘‘p-hacking,’’ which is also known
as data dredging: https://en.wikipedia.org/wiki/Data_dredging. In one sentence, explain why your
results in part (c) are not the result of p-hacking.

Answer.
c.
d.


Quiz 1
I acknowledge and accept the letter and spirit of the Honor Code:

Q1. You have just been elected social chair for the student organization PRoB (Probability Revolution or Bust)
for the coming academic year 2020–2021. As new social chair, you would like to hold 10 (indistinct) socials
over the next 3 quarters (Autumn, Winter, and Spring). Each social is equally likely to be assigned to any of the
quarters, and it is possible that a quarter has no socials. Order of socials within a quarter doesn’t matter.
a. (5 points) In how many distinct ways can the 10 (indistinct) socials be allocated to the 3 quarters?

Answer.
a.

b. (5 points) In how many distinct ways can the 10 (indistinct) socials be allocated to the 3 quarters if you must
hold at least 2 socials each quarter?

Answer.
b.

c. (3 points) Let your answer to part (a) be n. Consider the event where you hold 3 socials in Autumn, 3 socials in Winter, and 4 socials in Spring. Is the probability of this event 1/n? Briefly explain in a few sentences why or why not.

Answer.
c.

d. (12 points) You are also planning your courseload for next year. You have 10 courses to schedule over 3
quarters (Autumn, Winter, and Spring). All of the courses are distinct and offered every quarter; order of
courses within a quarter doesn’t matter. In how many distinct ways can you allocate 10 (distinct) courses to
the 3 quarters if you can take at most 4 courses in any quarter?

Answer.
d.


Q2. A home robot has two different sensors for motion detection. If there is a moving object, sensor V (video
camera) will detect motion with probability 0.95, and sensor L (laser) will detect motion with probability 0.8. If
there is no moving object, there is a 0.1 probability that sensor V will detect motion (even though there is no
object), and a 0.05 probability that sensor L will detect motion.
Based on empirical evidence, the probability that there is a moving object is 0.7. Note that these sensors use
independent detection algorithms to identify motion, so that conditioned on there being a moving object (or
not), the events of detecting motion (or not) for each sensor is independent.
a. (3 points) Given that there is a moving object and that sensor V does not detect motion, what is the probability
that sensor L detects motion? Give a numerical answer.

Answer.
a.

b. (5 points) Given that there is a moving object, what is the probability that at least one of the two sensors
detects motion? Give a numerical answer.
Note: You can use WolframAlpha as a calculator (it also accepts LaTeX equations). Example: https://www.wolframalpha.com/input/?i=sum+x%5Ek%2Fk%21%2C+k%3D0+to+infinity

Answer.
b.

c. (5 points) Given that there is a moving object, what is the probability that exactly one of the two sensors
detects motion? Give a numerical answer.

Answer.
c.

d. (8 points) What is the probability that there is a moving object given that both sensors detect motion? Give a
numerical answer.

Answer.
d.

e. (4 points) The probabilities that sensor V detects motion given a moving object and that sensor V detects
motion given no moving object do not sum up to 1. Briefly explain in a few sentences why this is okay.


Answer.
e.


Q3. You have 8 pairs of mittens, each a different pattern. Left and right mittens are also distinct. Suppose that
you are fostering kittens, and you leave them alone for a few hours with your mittens. When you return, you
discover that they have hidden 4 mittens! Suppose that your kittens are equally likely to hide any 4 of your 16
distinct mittens. Let X be the number of complete, distinct pairs of mittens that you have left.
a. (15 points) Compute the probability mass function of X, p_X(x). (Hint: Note the support of X is {4, 5, 6}.)

Answer.
a.

b. (5 points) Compute E[ X ] using the definition of expectation and your answer to (a).

Answer.
b.

c. (10 points) Define the random variable Xi to be 1 if your ith pair of mittens is complete after the kitten fiasco,
and 0 otherwise. Using this definition of Xi for i = 1, . . . , 8 and the linearity of expectation, compute E[ X ]
again. Do not use your answer to part (a).

Answer.
c.


Q4. A food takeout and delivery company, DashDoor (like DoorDash, but better) wants to understand how
busy their employees are each month. In metropolitan areas, employees receive on average 8 customer requests
each day. Regardless of where they work, each employee spends exactly 0.5 hours to make deliveries for each
request (one at a time); for example, an employee who receives exactly 4 requests on a particular day will spend
exactly 2 hours making deliveries.
a. (4 points) What is the variance of the number of hours that a metropolitan employee spends on making
deliveries in a day?

Answer.
a.

b. (6 points) What is the probability that a metropolitan employee has a ‘‘very busy day,’’ defined as spending
at least 5 hours making deliveries from requests received that day? Give a numerical answer.
Note: You can use WolframAlpha as a calculator (it also accepts LaTeX equations). Example: https://www.wolframalpha.com/input/?i=sum+x%5Ek%2Fk%21%2C+k%3D0+to+infinity

Answer.
b.

c. (6 points) The company estimates that 0.6 of their employees work in metropolitan areas, and the rest in
suburban areas. Suburbs have different customer demand: in suburban areas, employees receive on average 6
customer requests each day. What is the probability that a randomly chosen employee has a very busy day?

Answer.
c.

d. (14 points) Consider a metropolitan employee. The event that they have a very busy day on a particular day is independent of their busyness on other days. Let p be your answer from part (b), the probability that they have a very busy day.
i. (6 points) What is the probability that this employee has more than 10 busy days in the next 20 working
days? Leave your answer in terms of p.

Answer.
d.i.
ii. (4 points) What is the expected number of very busy days that this employee has over the next 20 working
days? Leave your answer in terms of p.


Answer.
d.ii.

iii. (4 points) Would it be reasonable to approximate the probability you computed in part (d)(i) using a
Poisson random variable? Briefly explain in a few sentences why or why not.

Answer.
d.iii.
