Probability For Computer Scientists
CS109
Department of Computer Science
Stanford University
Oct 2023
V 0.923
Acknowledgements: This book was written by Chris Piech for Stanford's CS109 course, Probability for Computer Scientists. The course was originally designed by Mehran Sahami and followed the Sheldon Ross book Probability Theory, from which we take inspiration. The course has since been taught by Lisa Yan, Jerry Cain and David Varodayan, and their ideas and feedback have improved this reader.
This course reader is open to contributions. Want to make your mark? Keen to fix a typo? Download the GitHub project and publish a pull request. We will credit all contributors. Thank you so much to folks who have contributed to editing the book: GitHub Contributors.
Introduction
Notation Reference

Core Probability

Notation                  Meaning
E^C                       Complement of an event or set
n!                        n factorial
(n choose k)              Binomial coefficient
(n choose r1, r2, r3)     Multinomial coefficient

Random Variables

Notation                  Meaning
E[X]                      Expectation of X
Var(X)                    Variance of X

Parametric Distributions

Notation                  Meaning
X ∼ Bern(p)               X is a Bernoulli random variable with parameter p
X ∼ Bin(n, p)             X is a Binomial random variable with parameters n and p
X ∼ Poi(λ)                X is a Poisson random variable with parameter λ
X ∼ Geo(p)                X is a Geometric random variable with parameter p
X ∼ NegBin(r, p)          X is a Negative Binomial random variable with parameters r and p
X ∼ Uni(α, β)             X is a Uniform random variable with parameters α and β
X ∼ Exp(λ)                X is an Exponential random variable with parameter λ
X ∼ N(μ, σ²)              X is a Normal random variable with parameters μ and σ²
X ∼ Beta(a, b)            X is a Beta random variable with parameters a and b
Definition of Probability:

    P(E) = lim_{n→∞} count(E) / n

Probability of the complement:

    P(E) = 1 − P(E^C)

If S is a sample space with equally likely outcomes, for an event E that is a subset of the outcomes in S:

    P(E) = |E| / |S|

Conditional probability:

    P(E|F) = P(E and F) / P(F)

For n events E_1, E_2, … E_n where each event is mutually exclusive of one another (in other words, no outcome is in more than one event):

    P(E_1 or E_2 or … or E_n) = Σ_{i=1}^{n} P(E_i)

Inclusion-exclusion, for two events:

    P(E or F) = P(E) + P(F) − P(E and F)

For more than three events see the chapter Probability of or.

Chain rule, for n events E_1, E_2, … E_n:

    P(E_1 and E_2 and … and E_n) = P(E_1) ⋅ P(E_2|E_1) ⋯ P(E_n|E_1 … E_{n−1})

Law of total probability:

    P(E) = P(E and F) + P(E and F^C)
         = P(E|F) ⋅ P(F) + P(E|F^C) ⋅ P(F^C)

For mutually exclusive events B_1, B_2, … B_n such that every outcome in the sample space falls into one of those events:

    P(E) = Σ_{i=1}^{n} P(E and B_i) = Σ_{i=1}^{n} P(E|B_i) ⋅ P(B_i)

Bayes' theorem:

    P(B|E) = P(E|B) ⋅ P(B) / P(E)

Bayes' theorem with the law of total probability:

    P(B|E) = P(E|B) ⋅ P(B) / (P(E|B) ⋅ P(B) + P(E|B^C) ⋅ P(B^C))
Bernoulli Random Variable

Notation: X ∼ Bern(p)
PMF (smooth): P(X = x) = p^x (1 − p)^{1−x}
Expectation: E[X] = p
Variance: Var(X) = p(1 − p)
PMF graph: [interactive graph; parameter p = 0.80]
Binomial Random Variable

Notation: X ∼ Bin(n, p)
PMF equation: P(X = x) = (n choose x) p^x (1 − p)^{n−x}
Expectation: E[X] = n ⋅ p
Variance: Var(X) = n ⋅ p ⋅ (1 − p)
PMF graph: [interactive graph; parameters n = 20, p = 0.60]
Poisson Random Variable

Notation: X ∼ Poi(λ)
Description: Number of events in a fixed time frame if (a) the events occur with a constant mean rate and (b) they occur independently of the time since the last event.
Parameters: λ > 0, the constant average rate.
Support: x ∈ {0, 1, …}
PMF equation: P(X = x) = λ^x e^{−λ} / x!
Expectation: E[X] = λ
Variance: Var(X) = λ
PMF graph: [interactive graph; parameter λ = 5]
Geometric Random Variable

Notation: X ∼ Geo(p)
PMF equation: P(X = x) = (1 − p)^{x−1} p
Expectation: E[X] = 1/p
Variance: Var(X) = (1 − p)/p²
PMF graph: [interactive graph; parameter p = 0.20]
Negative Binomial Random Variable

Notation: X ∼ NegBin(r, p)
PMF equation: P(X = x) = (x − 1 choose r − 1) p^r (1 − p)^{x−r}
Expectation: E[X] = r/p
Variance: Var(X) = r ⋅ (1 − p)/p²
PMF graph: [interactive graph; parameters r = 3, p = 0.20]
Continuous Random Variables

Uniform Random Variable

Notation: X ∼ Uni(α, β)
Description: A continuous random variable that takes on values, with equal likelihood, between α and β.
Parameters: α, the minimum value of the variable; β, the maximum value of the variable.
Support: x ∈ [α, β]
PDF equation:
    f(x) = 1/(β − α)    for x ∈ [α, β]
    f(x) = 0            else
CDF equation:
    F(x) = 0                   for x < α
    F(x) = (x − α)/(β − α)     for x ∈ [α, β]
    F(x) = 1                   for x > β
Expectation: E[X] = (α + β)/2
Variance: Var(X) = (β − α)²/12
PDF graph: [interactive graph; parameters α = 0, β = 1]
Exponential Random Variable

Notation: X ∼ Exp(λ)
Description: Time until the next event if (a) the events occur with a constant mean rate and (b) they occur independently of the time since the last event.
Parameters: λ > 0, the constant average rate.
Support: x ∈ R⁺ (non-negative real numbers)
PDF equation: f(x) = λe^{−λx}
CDF equation: F(x) = 1 − e^{−λx}
Expectation: E[X] = 1/λ
Variance: Var(X) = 1/λ²
PDF graph: [interactive graph; parameter λ = 5]
Normal Random Variable

Notation: X ∼ N(μ, σ²)
Support: x ∈ R
PDF equation:
    f(x) = (1/(σ√(2π))) e^{−(1/2)((x − μ)/σ)²}
Expectation: E[X] = μ
Variance: Var(X) = σ²
PDF graph: [interactive graph; parameters μ = 5, σ = 5]
Beta Random Variable

Notation: X ∼ Beta(a, b)
Description: A belief distribution over the value of a probability p from a Binomial distribution after observing a − 1 successes and b − 1 fails.
Parameters: a > 0, the number of successes + 1; b > 0, the number of fails + 1
Support: x ∈ [0, 1]
PDF equation: f(x) = B ⋅ x^{a−1} ⋅ (1 − x)^{b−1}, where B is a normalization constant
Expectation: E[X] = a/(a + b)
Variance: Var(X) = ab / ((a + b)²(a + b + 1))
PDF graph: [interactive graph; parameters a = 2, b = 4]
Python Reference
Factorial
Compute n! as an integer. This example computes 20!:
import math
print(math.factorial(20))
Choose
As of Python 3.8, you can compute the binomial coefficient (n choose m) from the math module. This example computes (10 choose 5):
import math
print(math.comb(10, 5))
Natural Exponent
Calculate e^x. For example, this computes e^3:
import math
print(math.exp(3))
Binomial
Make a Binomial Random variable X and compute its probability mass function (PMF) or cumulative
density function (CDF). We love the scipy stats library because it defines all the functions you would
care about for a random variable, including expectation, variance, and even things we haven't talked
about in CS109, like entropy. This example declares X ∼ Bin(n = 10, p = 0.2). It calculates a few
statistics on X. It then calculates P (X = 3) and P (X ≤ 4). Finally it generates a few random samples
from X:
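A sketch of that example using scipy's stats.binom (the sampled values will differ from run to run):

from scipy import stats

X = stats.binom(10, 0.2)  # Declare X ~ Bin(n = 10, p = 0.2)

print(X.mean())       # E[X] = n * p = 2.0
print(X.var())        # Var(X) = n * p * (1 - p) = 1.6
print(X.pmf(3))       # P(X = 3)
print(X.cdf(4))       # P(X <= 4)
print(X.rvs(size=5))  # Five random samples from X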
From a terminal you can always use the "help" command to see a full list of methods defined on a
variable (or for a package):
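For example:

from scipy import stats

X = stats.binom(10, 0.2)
help(X)      # Lists the methods defined on the frozen random variable X
help(stats)  # Documentation for the whole stats package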
Poisson
Make a Poisson random variable Y. This example declares Y ∼ Poi(λ = 2). It then calculates P(Y = 3):
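A sketch of that example with scipy:

from scipy import stats

Y = stats.poisson(2)  # Declare Y ~ Poi(lambda = 2)
print(Y.pmf(3))       # P(Y = 3)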
Geometric
Make a Geometric random variable X, the number of trials until a success. This example declares X ∼ Geo(p = 0.75):
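A sketch with scipy, which uses the same trials-until-first-success convention:

from scipy import stats

X = stats.geom(0.75)  # Declare X ~ Geo(p = 0.75)
print(X.pmf(2))       # P(X = 2), the first success is on the second trial
print(X.mean())       # E[X] = 1/p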
Normal
Make a Normal random variable A. This example declares A ∼ N(μ = 3, σ² = 16). It then calculates f(4) and F(2). Very Important!!! In class, the second parameter to a normal was the variance (σ²). In the scipy library, the second parameter is the standard deviation (σ):
import math
from scipy import stats
A = stats.norm(3, math.sqrt(16)) # Declare A to be a normal random variable
print(A.pdf(4)) # f(4), the probability density at 4
print(A.cdf(2)) # F(2), which is also P(A < 2)
print(A.rvs()) # Get a random sample from A
Exponential
Make an Exponential random variable B. This example declares B ∼ Exp(λ = 4):
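A sketch with scipy. Caution: much like the Normal caveat above, scipy parameterizes the Exponential by its scale, which is 1/λ, not by λ itself:

from scipy import stats

B = stats.expon(scale=1/4)  # Declare B ~ Exp(lambda = 4); scipy uses scale = 1/lambda
print(B.pdf(1))             # f(1), the probability density at 1
print(B.cdf(0.5))           # F(0.5), which is also P(B < 0.5)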
Beta
Make a Beta random variable X. This example declares X ∼ Beta(α = 1, β = 3):
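A sketch with scipy:

from scipy import stats

X = stats.beta(1, 3)  # Declare X ~ Beta(a = 1, b = 3)
print(X.pdf(0.5))     # f(0.5), the probability density at 0.5
print(X.mean())       # E[X] = a / (a + b)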
Part 1: Core Probability
Counting
Although you may have thought you had a pretty good grasp on the notion of counting at the age of
three, it turns out that you had to wait until now to learn how to really count. Aren’t you glad you took
this class now?! But seriously, counting is like the foundation of a house (where the house is all the great
things we will do later in this book, such as machine learning). Houses are awesome. Foundations, on the
other hand, are pretty much just concrete in a hole. But don’t make a house without a foundation. It won’t
turn out well.
If an experiment has two parts, where the first part can result in one of m outcomes and the second part
can result in one of n outcomes regardless of the outcome of the first part, then the total number of
outcomes for the experiment is m ⋅ n.
Rewritten using set notation, the Step Rule of Counting states that if an experiment with two parts has an
outcome from set A in the first part, where |A| = m, and an outcome from set B in the second part
(where the number of outcomes in B is the same regardless of the outcome of the first part), where
|B| = n, then the total number of outcomes of the experiment is |A||B| = m ⋅ n.
Simple Example: Consider a hash table with 100 buckets. Two arbitrary strings are independently hashed
and added to the table. How many possible ways are there for the strings to be stored in the table? Each
string can be hashed to one of 100 buckets. Since the results of hashing the first string do not impact the
hash of the second, there are 100 * 100 = 10,000 ways that the two strings may be stored in the hash
table.
Peter Norvig, the author of the canonical textbook "Artificial Intelligence", made the following compelling point on why computer scientists need to know how to count. To start, let's set a baseline for a really big number: the number of atoms in the observable universe, often estimated to be around 10 to the 80th power (10^80). There certainly are a lot of atoms in the universe. As a leading expert said,
“Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean,
you may think it’s a long way down the road to the chemist, but that’s just peanuts to space.” -
Douglas Adams
This number is often used to demonstrate tasks that computers will never be able to solve. Problems can
quickly grow to an absurd size, and we can understand why using the Step Rule of Counting.
There is an art project to display every possible picture. Surely that would take a long time, because there must be many possible pictures. But how many? We will assume the color model known as True Color, in which each pixel can be one of 2^24 ≈ 17 million distinct colors.

How many distinct pictures can you generate from (a) a smart phone camera with 12 million pixels, (b) a grid with 300 pixels, and (c) a grid with just 12 pixels?
Answer: We can use the step rule of counting. An image can be created one pixel at a time, step by step. Each time we choose a pixel you can select its color out of 17 million choices. An array of n pixels produces (17 million)^n different pictures. (17 million)^12 ≈ 10^86, so the tiny 12-pixel grid produces a million times more pictures than the number of atoms in the universe! How about the 300-pixel array? It can produce 10^2167 pictures. You may think the number of atoms in the universe is big, but that's just peanuts to the number of pictures in a 300-pixel array. And 12M pixels? 10^86696638 pictures.
For example, a Go board has 19 × 19 points where a user can place a stone. Each of the points can be empty or occupied by a black or white stone. By the Step Rule of Counting, we can compute the number of unique board configurations.

[Figure: a Go board. In Go there are 19 × 19 points. Each point can have a black stone, a white stone, or no stone at all.]

Here we are going to construct the board one point at a time, step by step. Each time we add a point we have a unique choice where we can decide to make the point one of three options: {Black, White, No Stone}. Using this construction we can apply the Step Rule of Counting. If there was only one point, there would be three unique board configurations. If there were four points you would have 3 ⋅ 3 ⋅ 3 ⋅ 3 = 81 unique combinations. In Go there are 3^(19×19) ≈ 10^172 possible board positions. The way we constructed our board didn't take into account which ones were illegal by the rules of Go. It turns out that "only" about 10^170 of those positions are legal. That is about the square of the number of atoms in the universe. In other words: if there was another universe of atoms for every single atom, only then would there be as many atoms in the universe as there are unique configurations of a Go board.
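These astronomical counts are easy to check with Python's arbitrary-precision integers. A quick sketch:

atoms = 10 ** 80                      # rough estimate of atoms in the observable universe

pictures_12px = (2 ** 24) ** 12       # 12 pixels, each one of 2^24 colors
print(pictures_12px > 10**6 * atoms)  # True: over a million times the number of atoms

go_positions = 3 ** (19 * 19)         # every point is empty, black, or white
print(len(str(go_positions)))         # 173 digits, i.e., roughly 10^172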
As a computer scientist this sort of result can be very important. While computers are powerful, an
algorithm which needed to store each configuration of the board would not be a reasonable approach. No
computer can store more information than atoms in the universe squared!
The above argument might leave you feeling like some problems are incredibly hard as a result of the product rule of counting. Let's take a moment to talk about how the product rule of counting can help! Most logarithmic time algorithms leverage this principle.

Imagine you are building a machine learning system that needs to learn from data and you want to synthetically generate 10 million unique data points for it. How many steps would you need to encode to get to 10 million? Assuming that at each step you have a binary choice, the number of unique data points you produce will be 2^n by the Step Rule of Counting. If we choose n such that log₂ 10,000,000 < n, we will have more than 10 million unique data points; n = 24 steps are enough, since 2^24 ≈ 16.8 million.
Example: Rolling two dice. Two 6-sided dice, with faces numbered 1 through 6, are rolled. How many
possible outcomes of the roll are there?
Solution: Note that we are not concerned with the total value of the two dice ("die" is the singular form of "dice"), but rather the set of all explicit outcomes of the rolls. Since the first die can come up with 6 possible values and the second die similarly can have 6 possible values (regardless of what appeared on the first die), the total number of potential outcomes is 36 (= 6 × 6). These possible outcomes are explicitly listed below as a series of pairs, denoting the values rolled on the pair of dice:
(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)
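You can enumerate this outcome grid directly with the built-in itertools library, for example:

import itertools

# All 36 outcomes of rolling two distinct 6-sided dice
outcomes = list(itertools.product(range(1, 7), repeat=2))
print(len(outcomes))  # 36
print(outcomes[:6])   # [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)]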
Counting with or
If you want to consider the total number of unique outcomes, when outcomes can come from source A or
source B, then the equation you use depends on whether or not there are outcomes which are both in A
and B. If not, you can use the simpler "Mutually Exclusive Counting" rule. Otherwise you need to use
the slightly more involved Inclusion Exclusion rule.
If the outcome of an experiment can either be drawn from set A or set B, where none of the outcomes in
set A are the same as any of the outcomes in set B (called mutual exclusion), then there are
|A or B| = |A| + |B| possible outcomes of the experiment.
Example: Sum of Routes. A route finding algorithm needs to find routes from Nairobi to Dar Es Salaam.
It finds routes that either pass through Mt Kilimanjaro or Mombasa. There are 20 routes that pass through
Mt Kilimanjaro, 15 routes that pass through Mombasa and 0 routes which pass through both Mt
Kilimanjaro and Mombasa. How many routes are there total?
Solution: Routes can come from either Mt Kilimanjaro or Mombasa. The two sets of routes are mutually
exclusive as there are zero routes which are in both groups. As such the total number of routes is
addition: 20 + 15 = 35.
If you can show that two groups are mutually exclusive, counting becomes simple addition. Of course not all sets are mutually exclusive. In the example above, imagine there had been a single route which went through both Mt Kilimanjaro and Mombasa. We would have double counted that route because it would be included in both sets. If sets are not mutually exclusive, counting the or is still addition; we simply need to take into account any double counting.
If the outcome of an experiment can either be drawn from set A or set B, and sets A and B may
potentially overlap (i.e., it is not the case that A and B are mutually exclusive), then the number of
outcomes of the experiment is |A or B| = |A| + |B| − |A and B|.
Note that the Inclusion-Exclusion Principle generalizes the Sum Rule of Counting for arbitrary sets A
and B. In the case where A and B = ∅, the Inclusion-Exclusion Principle gives the same result as the
Sum Rule of Counting since |A and B| = 0.
Example: An 8-bit string (one byte) is sent over a network. The valid set of strings recognized by the
receiver must either start with "01" or end with "10". How many such strings are there?
Solution: The potential bit strings that match the receiver's criteria can either be the 64 strings that start with "01" (since the last 6 bits are left unspecified, allowing for 2^6 = 64 possibilities) or the 64 strings that end with "10" (since the first 6 bits are unspecified). Of course, these two sets overlap, since strings that start with "01" and end with "10" are in both sets. There are 2^4 = 16 such strings (since the middle 4 bits can be arbitrary). Casting this description into corresponding set notation, we have: |A| = 64, |B| = 64, and |A and B| = 16, so by the Inclusion-Exclusion Principle, there are 64 + 64 − 16 = 112 strings that match the specified receiver's criteria.
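A quick brute-force check of this count:

import itertools

# Enumerate all 2^8 = 256 bit strings and count those the receiver accepts
count = 0
for bits in itertools.product("01", repeat=8):
    s = "".join(bits)
    if s.startswith("01") or s.endswith("10"):
        count += 1
print(count)  # 112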
As a simple example to demonstrate the point, let's revisit the problem of generating all images, but this time let's just have 4 pixels (2×2) and each pixel can only be blue or white. How many unique images are there? Generating any image is a four step process where you choose each pixel one at a time. Since each pixel has two choices there are 2^4 = 16 unique images (they are not exactly Picasso — but hey, it's 4 pixels):
Now let's say we add a new "constraint": we only accept pictures which have an odd number of pixels turned blue. There are two ways of getting to the answer. You could start out with the original
16 and work out that you need to subtract off 8 images that have either 0, 2 or 4 blue pixels (which is
easier to work out after the next chapter). Or you could have counted up using Mutually Exclusive
Counting: there are 4 ways of making an image with 1 pixel and 4 ways of making an image with 3. Both
approaches lead to the same answer, 8.
Next lets add a much harder constraint: mirror indistinction. If you can flip any image horizontally to
create another, they are no longer considered unique. For example these two both show up in our set of 8
odd-blue pixel images, but they are now considered to be the same (they are indistinct after a horizontal
flip):
How many images have an odd number of pixels taking into account mirror indistinction? The answer is that for each unique image with odd numbers of blue pixels, under this new constraint, you have counted it twice: itself and its horizontal flip. To convince yourself that each image has been counted exactly twice
you can look at all of the examples in the set of 8 images with an odd number of blue pixels. Each image
is next to one which is indistinct after a horizontal flip. Since each image was counted exactly twice in
the set of 8, we can divide by two to get the updated count. If we list them out we can confirm that there
are 8/2=4 images left after this last constraint:
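All three counts (16, then 8, then 4) can be verified by brute force. A sketch, where each image is a tuple of pixels ordered (top-left, top-right, bottom-left, bottom-right) and blue = 1, white = 0:

import itertools

# All 2^4 = 16 two-by-two images
images = list(itertools.product([0, 1], repeat=4))
print(len(images))  # 16

# Keep only images with an odd number of blue pixels
odd = [img for img in images if sum(img) % 2 == 1]
print(len(odd))  # 8

def flip(img):
    # Mirror an image horizontally by swapping the left and right columns
    tl, tr, bl, br = img
    return (tr, tl, br, bl)

# Treat an image and its horizontal flip as the same image
unique = {min(img, flip(img)) for img in odd}
print(len(unique))  # 4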
Applying any math (counting included) to novel contexts can be as much an art as it is a science. In the
next chapter we will build a useful toolset from the basic first principles of counting by steps, and
counting by "or".
Combinatorics
Counting problems can be approached from the basic building blocks described in the first section:
Counting. However some counting problems are so ubiquitous in the world of probability that it is worth
knowing a few higher level counting abstractions. When solving problems, if you can find the analogy
from these canonical examples you can build off of the corresponding combinatorics formulas:
While these are by no means the only common counting paradigms, it is a helpful set.
This changes slightly if you are permuting a subset of distinct objects, or if some of your objects are
indistinct. We will handle those cases shortly! Note that unique is a synonym for distinct.
Example: How many unique orderings of characters are possible for the string "BAYES"? Solution:
Since the order of characters is important, we are considering all permutations of the 5 distinct characters
B, A, Y, E, and S: 5! = 120. Here is the full list:
BAYES, BAYSE, BAEYS, BAESY, BASYE, BASEY, BYAES, BYASE, BYEAS, BYESA, BYSAE,
BYSEA, BEAYS, BEASY, BEYAS, BEYSA, BESAY, BESYA, BSAYE, BSAEY, BSYAE, BSYEA,
BSEAY, BSEYA, ABYES, ABYSE, ABEYS, ABESY, ABSYE, ABSEY, AYBES, AYBSE, AYEBS,
AYESB, AYSBE, AYSEB, AEBYS, AEBSY, AEYBS, AEYSB, AESBY, AESYB, ASBYE, ASBEY,
ASYBE, ASYEB, ASEBY, ASEYB, YBAES, YBASE, YBEAS, YBESA, YBSAE, YBSEA, YABES,
YABSE, YAEBS, YAESB, YASBE, YASEB, YEBAS, YEBSA, YEABS, YEASB, YESBA, YESAB,
YSBAE, YSBEA, YSABE, YSAEB, YSEBA, YSEAB, EBAYS, EBASY, EBYAS, EBYSA, EBSAY,
EBSYA, EABYS, EABSY, EAYBS, EAYSB, EASBY, EASYB, EYBAS, EYBSA, EYABS, EYASB,
EYSBA, EYSAB, ESBAY, ESBYA, ESABY, ESAYB, ESYBA, ESYAB, SBAYE, SBAEY, SBYAE,
SBYEA, SBEAY, SBEYA, SABYE, SABEY, SAYBE, SAYEB, SAEBY, SAEYB, SYBAE, SYBEA,
SYABE, SYAEB, SYEBA, SYEAB, SEBAY, SEBYA, SEABY, SEAYB, SEYBA, SEYAB
Example: a smart-phone has a 4-digit passcode. Suppose there are 4 smudges over 4 digits on the screen.
How many distinct passcodes are possible?
Solution: Since the order of digits in the code is important, we should use permutations. And since there
are exactly four smudges we know that each number in the passcode is distinct. Thus, we can plug in the
permutation formula: 4! = 24.
If you have n objects where n₁ are indistinct from one another, n₂ are indistinct from one another, …, and n_r are indistinct from one another (with n₁ + n₂ + ⋯ + n_r = n), then:

    Number of unique orderings = n! / (n₁! n₂! ⋯ n_r!)
Example: How many distinct bit strings can be formed from three 0’s and two 1’s?
Solution: 5 total digits would give 5! permutations. But that is assuming the 0’s and 1’s are
distinguishable (to make that explicit, let’s give each one a subscript). Here are the 3! ⋅ 2! = 12 different
ways that we could have arrived at the identical string "01100" if we thought of each 0 and 1 as unique.
0₁ 1₁ 1₂ 0₂ 0₃
0₁ 1₁ 1₂ 0₃ 0₂
0₂ 1₁ 1₂ 0₁ 0₃
0₂ 1₁ 1₂ 0₃ 0₁
0₃ 1₁ 1₂ 0₁ 0₂
0₃ 1₁ 1₂ 0₂ 0₁
0₁ 1₂ 1₁ 0₂ 0₃
0₁ 1₂ 1₁ 0₃ 0₂
0₂ 1₂ 1₁ 0₁ 0₃
0₂ 1₂ 1₁ 0₃ 0₁
0₃ 1₂ 1₁ 0₁ 0₂
0₃ 1₂ 1₁ 0₂ 0₁
Since identical digits are indistinguishable, all the listed permutations are the same. For any given
permutation, there are 3! ways of rearranging the 0’s and 2! ways of rearranging the 1’s (resulting in
indistinguishable strings). We have over-counted. Using the formula for permutations of indistinct
objects, we can correct for the over-counting:
    Total = 5! / (3! ⋅ 2!) = 120 / (6 ⋅ 2) = 10
Example: How many distinct orderings of characters are possible for the string "MISSISSIPPI"?
Solution: In the case of the string "MISSISSIPPI", we should separate the characters into four distinct
groups of indistinct characters: one "M", four "I"s, four "S"s, and two "P"s. The number of distinct
orderings are:
    11! / (1! 4! 4! 2!) = 34,650
Example: Consider the 4-digit passcode smart-phone from before. How many distinct passcodes are
possible if there are 3 smudges over 3 digits on the screen?
Solution: One of the 3 digits is repeated, but we don't know which one. We can solve this by making three cases, one for each digit that could be repeated (each with the same number of permutations). Let A, B, C represent the 3 digits, with C repeated twice. We can initially pretend the two C's are distinct: [A, B, C₁, C₂]. Then each case will have 4! permutations. However, we then need to eliminate the double-counting of the permutations of the identical digits (one A, one B, and two C's):

    4! / (2! ⋅ 1! ⋅ 1!) = 12

Adding up the three cases for the different repeated digits gives

    3 ⋅ 4! / (2! ⋅ 1! ⋅ 1!) = 3 ⋅ 12 = 36
What if instead there are 2 smudges over 2 digits on the screen?

Solution: There are two possibilities: 2 digits used twice each, or 1 digit used 3 times and the other digit used once.

    4! / (2! ⋅ 2!) + 2 ⋅ 4! / (3! ⋅ 1!) = 6 + (2 ⋅ 4) = 6 + 8 = 14
You can use the power of computers to enumerate all permutations. Here is sample Python code which uses the built-in itertools library:
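A minimal version of such code:

import itertools

# Enumerate all 5! = 120 orderings of the characters in "BAYES"
orderings = ["".join(p) for p in itertools.permutations("BAYES")]
print(len(orderings))  # 120
print(orderings[:5])   # ['BAYES', 'BAYSE', 'BAEYS', 'BAESY', 'BASYE']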
A combination is an unordered selection of r objects from a set of n objects. If all objects are distinct, and objects are not "replaced" once selected, then the number of ways of making the selection is:

    Number of unique selections = n! / (r!(n − r)!) = (n choose r)
For example, there are (5 choose 3) = 10 ways of choosing three items from a list of 5 unique numbers.
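A sketch that enumerates them with the built-in itertools library:

import itertools

# All (5 choose 3) = 10 unordered selections of 3 items from [1, 2, 3, 4, 5]
for selection in itertools.combinations([1, 2, 3, 4, 5], 3):
    print(selection)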
Notice how order doesn't matter. Since (1, 2, 3) is in the set of combinations, we don't also include (3, 2,
1) as this is considered to be the same selection. Note that this formula does not work if some of the
objects are indistinct from one another.
Where does the formula n!/(r!(n − r)!) come from? Consider this general way to select r unordered objects from a set of n objects, e.g., "7 choose 3": first select the r objects in order, which can be done in n!/(n − r)! ways. Since order doesn't matter in a selection, each group of r objects has been counted r! times, once for each of its orderings. Dividing out the over-counting gives:

    Total = n! / (r! ⋅ (n − r)!) = (n choose r)
Example: In the Hunger Games, how many ways are there of choosing 2 villagers from district 12, which has a population of 8,000?

Solution: This is a straightforward combination: (8000 choose 2) = 31,996,000.
Part A: How many ways are there to select 3 books from a set of 6?
Solution: If each of the books are distinct, then this is another straightforward combination problem. There are (6 choose 3) = 6!/(3!3!) = 20 ways.
Part B: How many ways are there to select 3 books if there are two books that should not both be chosen
together? For example, if you are choosing 3 out of 6 probability books, don't choose both the 8th and 9th
edition of the Ross textbook.
Solution: This problem is easier to solve if we split it up into cases. Consider the following three
different cases:
Case 1: Select the 8th Ed. and 2 other non-9th Ed. books: there are (4 choose 2) ways of doing so.
Case 2: Select the 9th Ed. and 2 other non-8th Ed. books: there are (4 choose 2) ways of doing so.
Case 3: Select 3 books from the 4 remaining books that are neither the 8th nor the 9th edition: there are (4 choose 3) ways of doing so.

Using our old friend the Sum Rule of Counting, we can add the cases:

    Total = 2 ⋅ (4 choose 2) + (4 choose 3) = 16
Alternatively, we could have calculated all the ways of selecting 3 books from 6, and then subtract the
"forbidden'' ones (i.e., the selections that break the constraint).
Forbidden Case: Select the 8th edition and the 9th edition and 1 other book. There are (4 choose 1) ways of doing so (which equals 4). Total = all possibilities − forbidden = 20 − 4 = 16. Two different ways to get the same right answer!
The most common case that we will want to consider is when all of the items you are putting into buckets
are distinct. In that case you can think of bucketing as a series of steps, and employ the step rule of
counting. The first step? You put the first distinct item into a bucket (there are number-of-buckets ways to
do this). Second step? You put the second distinct item into a bucket (again, there are number-of-buckets
ways to do this).
Suppose you want to place n distinguishable items into r containers. The number of ways of doing so is:

    r^n

You have n steps (place each item) and for each item you have r choices.
Problem: Say you want to put 10 distinguishable balls into 5 urns (No! Wait! Don't say that! Not urns!).
Okay, fine. No urns. Say we are going to put 10 different strings into 5 buckets of a hash table. How
many possible ways are there of doing this?
Solution: You can think of this as 10 independent experiments, each with 5 outcomes. Using our rule for bucketing with distinct items, this comes out to 5^10.
Divider Method:
Suppose you want to place n indistinguishable items into r containers. The divider method works by
imagining that you are going to solve this problem by sorting two types of objects, your n original
elements and (r − 1) dividers. Thus, you are permuting n + r − 1 objects, n of which are the same (your
elements) and r − 1 of which are the same (the dividers). Thus the total number of outcomes is:
    (n + r − 1)! / (n!(r − 1)!) = (n + r − 1 choose n) = (n + r − 1 choose r − 1)
The divider method can be derived via the "Stars and Bars" method. This is a creative construction where
we consider permutations of indistinguishable items, represented by stars *, and dividers between our
containers, represented by bars |. Any distinct permutation of these stars and bars represents a unique
assignments of our items to containers.
Imagine we want to separate 5 indistinguishable objects into 3 containers. We can think of the problem
as finding the number of ways to order 5 stars and 2 bars *****||. Any permutation of these symbols
represents a unique assignment. Here are a few examples:
**|*|** represents 2 items in the first bucket, 1 item in the second and 2 items in the third.
****||* represents 4 items in the first bucket, 0 items in the second and 1 item in the third.
||***** represents 0 items in the first bucket, 0 items in the second and 5 items in the third.
Why are there only 2 dividers when there are 3 buckets? This is an example of a "fence-post problem".
With 2 dividers you have created three containers. We already have a method for counting permutations with some indistinct items. For the example above we have seven elements in our permutation (n = 5 stars and r − 1 = 2 bars):

    Number of unique orderings = n! / (n₁! n₂!) = (n + r − 1)! / (n!(r − 1)!) = 7! / (5!2!) = 21
Part A: Say you are a startup incubator and you have $10 million to invest in 4 companies (in $1 million
increments). How many ways can you allocate this money?
Solution: This is just like putting 10 balls into 4 urns. Using the Divider Method we get:
    Total ways = (10 + 4 − 1 choose 10) = (13 choose 10) = 286
Part B: What if you know you want to invest at least $3 million in Company 1?
Solution: There is one way to give $3 million to Company 1. The number of ways of investing the
remaining money is the same as putting 7 balls into 4 urns.
    Total ways = (7 + 4 − 1 choose 7) = (10 choose 7) = 120
This problem is analogous to solving the integer equation x₁ + x₂ + x₃ + x₄ = 10, where x₁ ≥ 3 and x₂, x₃, x₄ ≥ 0. To translate this problem into the integer solution equation that we can solve via the divider method, we need to adjust the bounds on x₁: substituting y₁ = x₁ − 3, the problem becomes y₁ + x₂ + x₃ + x₄ = 7, where all four variables are ≥ 0.
Part C: What if you don't have to invest all $10 M? (The economy is tight, say, and you might want to
save your money.)
Solution: Imagine that you have an extra company: yourself. Now you are investing $10 million in 5
companies. Thus, the answer is the same as putting 10 balls into 5 urns.
    Total = (10 + 5 − 1 choose 10) = (14 choose 10) = 1001
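All three answers can be double-checked with math.comb:

import math

print(math.comb(13, 10))  # Part A: 286
print(math.comb(10, 7))   # Part B: 120
print(math.comb(14, 10))  # Part C: 1001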
If n objects are distinct, then the number of ways of putting them into r groups of objects, such that group i has size n_i and Σ_{i=1}^{r} n_i = n, is:

    n! / (n₁! n₂! ⋯ n_r!) = (n choose n₁, n₂, …, n_r)

where (n choose n₁, n₂, …, n_r) is special notation called the multinomial coefficient.
You may have noticed that this is the exact same formula as "Permutations With Indistinct Objects".
There is a deep parallel. One way to imagine assigning objects into their groups would be to imagine the
groups themselves as objects. You have one object per "slot" in a group. So if there were two slots in
group 1, three slots in group 2, and one slot in group 3 you could have six objects (1, 1, 2, 2, 2, 3). Each
unique permutation can be used to make a unique assignment.
Problem:
Company Camazon has 13 distinct new servers that they would like to assign to 3 datacenters, where
Datacenter A, B, and C have 6, 4, and 3 empty server racks, respectively. How many different divisions
of the servers are possible?
Solution: Using the multinomial coefficient, the total number of divisions is

    (13 choose 6, 4, 3) = 13! / (6!4!3!) = 60,060

Another way to do this problem would be from first principles of combinations as a multipart experiment. We first select the 6 servers to be assigned to Datacenter A, in (13 choose 6) ways. Now out of the 7 servers remaining, we select the 4 servers to be assigned to Datacenter B, in (7 choose 4) ways. Finally, we select the 3 servers out of the remaining 3 servers, in (3 choose 3) ways. By the Product Rule of Counting, the total number of divisions is:

    (13 choose 6) ⋅ (7 choose 4) ⋅ (3 choose 3) = 13! / (6!4!3!) = 60,060
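Either computation is a one-liner to verify:

import math

# Multinomial coefficient: 13! / (6! * 4! * 3!)
total = math.factorial(13) // (math.factorial(6) * math.factorial(4) * math.factorial(3))
print(total)  # 60060

# Equivalent product of binomial coefficients
print(math.comb(13, 6) * math.comb(7, 4) * math.comb(3, 3))  # 60060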
Definition of Probability
What does it mean when someone makes a claim like "the probability that you find a pearl in an oyster is 1 in 5,000" or "the probability that it will rain tomorrow is 52%"?
Definition: Event, E
An Event is some subset of S that we ascribe meaning to. In set notation (E ⊆ S). For example:
Coin flip is heads: E = {Heads}
At least 1 head on 2 coin flips = {(H, H), (H, T), (T, H)}
Roll of die is 3 or less: E = {1, 2, 3}
You receive less than 20 emails in a day: E = {x|x ∈ Z, 0 ≤ x < 20} (non-neg. ints)
Wasted day (≥ 5 YouTube hours): E = {x|x ∈ R, 5 ≤ x ≤ 24}
In the world of probability, events are binary: they either happen or they don't.
Definition of Probability
It wasn't until the 20th century that humans figured out a way to precisely define what the word
probability means:
    P(Event) = lim_{n→∞} count(Event) / n
In English this reads: let's say you perform n trials of an "experiment" which could result in a particular
"Event" occurring. The probability of the event occurring, P(Event), is the ratio of trials that result in
the event, written as count(Event), to the number of trials performed, n. In the limit, as your number of
trials approaches infinity, the ratio will converge to the true probability. People also apply other semantics
to the concept of a probability. One common meaning ascribed is that P(E) is a measure of the chance of
event E occurring.
Here we use the definition of probability to calculate the probability of event E, rolling a "5" or a "6" on a fair six-sided die. Hit the "Run trials" button to start running trials of the experiment "roll dice". Notice how P(E) converges to 2/6, or 0.33 repeating.

[Interactive demo: n = 0, count(E) = 0, P(E) ≈ count(E)/n]
Origins of probabilities: The different interpretations of probability are reflected in the many origins of
probabilities that you will encounter in the wild (and not so wild) world. Some probabilities are
calculated analytically using mathematical proofs. Some probabilities are calculated from data,
experiments or simulations. Some probabilities are just made up to represent a belief. Most probabilities
are generated from a combination of the above. For example, someone will make up a prior belief, that
belief will be mathematically updated using data and evidence. Here is an example of calculating a
probability from data:
Probabilities and simulations: Another way to compute probabilities is via simulation. For some
complex problems where the probabilities are too hard to compute analytically you can run simulations
using your computer. If your simulations generate believable trials from the sample space, then the
probability of an event E is approximately equal to the fraction of simulations that produced an outcome
from E. Again, by the definition of probability, as your number of simulations approaches infinity, the
estimate becomes more accurate.
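For example, here is a sketch of a simulation that estimates the probability of rolling a "5" or a "6" on a fair six-sided die:

import random

trials = 100_000
count = sum(1 for _ in range(trials) if random.randint(1, 6) >= 5)
print(count / trials)  # approximately 1/3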
Probabilities and percentages: You might hear people refer to a probability as a percent, for example that the probability of rain tomorrow is 32%. The proper way to state this would be to say that 0.32 is the probability of rain. Percentages are simply probabilities multiplied by 100. "Percent" is Latin for "out of one hundred".
Problem: Use the definition of probability to approximate the answer to the question: "What is the
probability a new-born elephant child is male?" Contrary to what you might think the gender outcomes of
a newborn elephant are not equally likely between male and female. You have data from a report in
Animal Reproductive Science which states that 3,070 elephants were born in Myanmar of which 2,180
were male [1]. Humans also don't have a 50/50 sex ratio at birth [2].
By the definition of probability, the ratio — of trials that result in the event, to the total number of trials — will tend to our desired probability:

    P(E) = lim_{n→∞} count(E) / n
         ≈ 2,180 / 3,070
         ≈ 0.710
Since 3,000 is quite a bit less than infinity, this is an approximation. It turns out, however, to be a rather
good one. A few important notes: there is no guarantee that our estimate applies to elephants outside
Myanmar. Later in the class we will develop language for "how confident we can be in a number like
0.71 after 3,000 trials?" Using tools from later in class we can say that we have 98% confidence that the
true probability is within 0.02 of 0.710.
Axioms of Probability
Here are some basic truths about probabilities that we accept as axioms:

Axiom 1: 0 ≤ P(E) ≤ 1 — All probabilities are between 0 and 1
Axiom 2: P(S) = 1 — The probability of the sample space is 1
Axiom 3: If E and F are mutually exclusive, then P(E or F) = P(E) + P(F) — The probability of "or" for mutually exclusive events
These three axioms are formally called the Kolmogorov axioms and they are considered to be the
foundation of probability theory. They are also useful identities!
You can convince yourself of the first axiom by thinking about the math definition of probability. As you perform trials of an experiment it is not possible to get more events than trials (thus probabilities are at most 1) and it's not possible to get fewer than 0 occurrences of the event (thus probabilities are at least 0). The second axiom makes sense too. If your event is the sample space, then each trial must produce the event. This is sort of like saying: the probability of you eating cake (event) if you eat cake (sample space that is the same as the event) is 1. The third axiom is more complex and in this textbook we dedicate an entire chapter to understanding it: Probability of or. It applies to events that have a special property called "mutual exclusion": the events do not share any outcomes.
These axioms have great historical significance. In the early 1900s it was not clear if probability was
somehow different than other fields of math -- perhaps the set of techniques and systems of proofs from
other fields of mathematics couldn't apply. Kolmogorov's great success was to show to the world that the
tools of mathematics did in fact apply to probability. From the foundation provided by this set of axioms
mathematicians built the edifice of probability theory.
Provable Identities
We often refer to these as corollaries that are directly provable from the three axioms given above.
Identity 1: P(E^C) = 1 − P(E) — The probability of event E not happening
This first identity is especially useful. For any event E, you can calculate the probability of the event not occurring, which we write in probability notation as E^C, if you know the probability of it occurring -- and vice versa. We can also use this identity to show you what it looks like to prove a theorem in probability.

Proof that P(E^C) = 1 − P(E):

    P(S) = P(E or E^C)       E or E^C covers every outcome in the sample space
    P(S) = P(E) + P(E^C)     Events E and E^C are mutually exclusive
    1 = P(E) + P(E^C)        Axiom 2 of probability
    P(E^C) = 1 − P(E)        By re-arranging
Because every outcome is equally likely, and the probability of the sample space must be 1, we can prove
that each outcome must have probability:
    P(an outcome) = 1 / |S|
Where |S| is the size of the sample space, or, put in other words, the total number of outcomes of the
experiment. Of course this is only true in the special case where every outcome has the same likelihood.
If S is a sample space with equally likely outcomes, for an event E that is a subset of the outcomes in S:

    P(E) = |E| / |S|
There is some art form to setting up a problem to calculate a probability based on the equally likely
outcome rule. (1) The first step is to explicitly define your sample space and to argue that all outcomes in
your sample space are equally likely. (2) Next, you need to count the number of elements in the sample
space and (3) finally you need to count the size of the event space. The event space must be all elements
of the sample space that you defined in part (1). The first step leaves you with a lot of choice! For
example you can decide to make indistinguishable objects distinct, as long as your calculation of the size
of the event space makes the exact same assumptions.
Example: What is the probability that the sum of two dice is equal to 7?

Buggy Solution: You could define your sample space to be all the possible sum values of two dice (2 through 12). However this sample space fails the "equally likely" test. You are not equally likely to have a sum of 2 as you are to have a sum of 7.
Solution: Consider the sample space from the previous chapter where we thought of the dice as distinct and enumerated all of the outcomes in the sample space. The first number is the roll on die 1 and the second number is the roll on die 2. Note that (1, 2) is distinct from (2, 1). Since each outcome is equally likely, and the sample space has exactly 36 outcomes, the likelihood of any one outcome is 1/36.
The event (sum of dice is 7) is the subset of the sample space where the sum of the two dice is 7. Each
outcome in the event is highlighted in blue. There are 6 such outcomes: (1, 6), (2, 5), (3, 4), (4, 3), (5, 2),
(6, 1). Notice that (1, 6) is a different outcome than (6, 1). To make the outcomes equally likely we had to
make the dice distinct.
    P(Sum of two dice is 7) = |E| / |S|       Since outcomes are equally likely
                            = 6/36 = 1/6      There are 6 outcomes in the event
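This equally-likely-outcomes calculation is easy to verify by enumerating the sample space:

import itertools

outcomes = list(itertools.product(range(1, 7), repeat=2))
event = [roll for roll in outcomes if sum(roll) == 7]
print(len(event), len(outcomes))   # 6 36
print(len(event) / len(outcomes))  # 0.1666... = 1/6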
Interestingly, this idea also applies to continuous sample spaces. Consider the sample space of all the
outcomes of the computer function "random" which produces a real valued number between 0 and 1,
where all real valued numbers are equally likely. Now consider the event E that the number generated is
in the range [0.3 to 0.7]. Since the sample space is equally likely, P(E) is the ratio of the size of E to the size of S. In this case, P(E) = 0.4/1 = 0.4.
Probability of or
The equation for calculating the probability of either event E or event F happening, written P(E or F ) or
equivalently as P(E ∪ F ), is deeply analogous to counting the size of two sets. As in counting, the
equation that you can use depends on whether or not the events are "mutually exclusive". If events are
mutually exclusive, it is very straightforward to calculate the probability of either event happening.
Otherwise, you need the more complex "inclusion exclusion" formula.
Mutual exclusion can be visualized. Consider the following visual sample space where each outcome is a
hexagon. The set of all the fifty hexagons is the full sample space:
Both events E and F are subsets of the same sample space. Visually, we can note that the two sets do not
overlap. They are mutually exclusive: there is no outcome that is in both sets.
This property applies regardless of how you calculate the probability of E or F. Moreover, the idea extends to more than two events. Let's say you have n events E₁, E₂, …, E_n where each event is mutually exclusive of one another (in other words, no outcome is in more than one event). Then:

    P(E₁ or E₂ or … or E_n) = Σ_{i=1}^{n} P(E_i)
You may have noticed that this is one of the axioms of probability. Though it might seem intuitive, it is
one of three rules that we accept without proof.
Caution: Mutual exclusion only makes it easier to calculate the probability of E or F , not other ways
of combining events, such as E and F .
At this point we know how to compute the probability of the "or" of events if and only if they have the
mutual exclusion property. What if they don't?
To see what can go wrong, consider the event E: getting heads on a coin flip, where P(E) = 0.5. Now imagine the event S, getting either a heads or a tails on a coin flip, where P(S) = 1. These events are not mutually exclusive (the outcome heads is in both). If you incorrectly assumed they were mutually exclusive and tried to calculate P(E or S) you would get this buggy derivation:

    P(E or S) = P(E) + P(S) = 0.5 + 1.0 = 1.5    Buggy: probabilities can't exceed 1

For another example, calculate the probability of E, getting an even number on a die roll (2, 4 or 6), or F, getting three or less (1, 2, 3) on the same die roll. Simply adding the probabilities gives:

    P(E or F) = P(E) + P(F) = 3/6 + 3/6 = 1.0    uh oh!

The probability can't be one since the outcome 5 is neither three or less nor even. The problem is that we double counted the probability of getting a 2, and the fix is to subtract out the probability of that doubly counted case.
What went wrong? If two events are not mutually exclusive, simply adding their probabilities double counts the probability of any outcome which is in both events. There is a formula for calculating the or of two non-mutually exclusive events: it is called the "inclusion exclusion" principle:

    P(E or F) = P(E) + P(F) − P(E and F)

This formula does have a version for more than two events, but it gets rather complex. See the next two sections for more details.
Note that the inclusion exclusion principle also applies for mutually exclusive events. If two events are
mutually exclusive P(E and F ) = 0 since its not possible for both E and F to occur. As such the
formula P(E) + P(F ) − P(E and F ) reduces to P(E) + P(F ).
Recall that if they are mutually exclusive, we simply add the probabilities. If they are not mutually
exclusive, you need to use the inclusion exclusion formula for three events:
    P(E₁ or E₂ or E₃) =
        + P(E₁) + P(E₂) + P(E₃)
        − P(E₁ and E₂) − P(E₁ and E₃) − P(E₂ and E₃)
        + P(E₁ and E₂ and E₃)
In words, to get the probability of three events, you: (1) add the probability of the events on their own. (2) Then you need to subtract off the probability of every pair of events co-occurring. (3) Finally, you add in the probability of all three events co-occurring.
For four events the same alternating pattern continues:

    P(E₁ or E₂ or E₃ or E₄) =
        + P(E₁) + P(E₂) + P(E₃) + P(E₄)
        − P(E₁ and E₂) − P(E₁ and E₃) − P(E₁ and E₄)
        − P(E₂ and E₃) − P(E₂ and E₄) − P(E₃ and E₄)
        + P(E₁ and E₂ and E₃) + P(E₁ and E₂ and E₄) + P(E₁ and E₃ and E₄) + P(E₂ and E₃ and E₄)
        − P(E₁ and E₂ and E₃ and E₄)
Do you see the pattern? For n events E₁, E₂, …, E_n: add all the probabilities of the events on their own. Then subtract all pairs of events. Then add all subsets of 3 events. Then subtract all subsets of 4 events. Continue this process, up until subsets of size n, adding the subsets if the size of subsets is odd, else subtracting them. The alternating addition and subtraction is where the name inclusion exclusion comes from. This is a complex process and you should first check if there is an easier way to calculate your probability. This can be written up mathematically — but it is a rather hard pattern to express in notation:

    P(E₁ or E₂ or ⋯ or E_n) = Σ_{r=1}^{n} (−1)^{r+1} Y_r,    where    Y_r = Σ_{1 ≤ i₁ < ⋯ < i_r ≤ n} P(E_{i₁} and ⋯ and E_{i_r})

The notation for Y_r is especially hard to parse. Y_r sums over all ways of selecting a subset of r events. For each selection of r events, calculate the probability of the "and" of those events. The term (−1)^{r+1} is saying: add the result when r is odd, and subtract it when r is even.
It is not especially important to follow the math notation here. The main take away is that the general
inclusion exclusion principle gets incredibly complex with multiple events. Often, the way to make
progress in this situation is to find a way to solve your problem using another method.
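To make the pattern concrete, here is a sketch of the general principle applied to events represented as sets of equally likely outcomes, so that the probability of an "and" is the size of an intersection divided by the size of the sample space (the function name and the example events are made up for illustration):

import itertools
from fractions import Fraction

def prob_union(events, sample_space):
    # General inclusion exclusion: alternate adding and subtracting
    # the probabilities of every subset of r events, for r = 1..n
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for subset in itertools.combinations(events, r):
            intersection = set.intersection(*subset)
            total += sign * Fraction(len(intersection), len(sample_space))
    return total

S = set(range(1, 7))            # one roll of a fair die
E1 = {2, 4, 6}                  # even
E2 = {1, 2, 3}                  # three or less
print(prob_union([E1, E2], S))  # 5/6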
The formulas for calculating the or of events that are not mutually exclusive often require calculating the
probability of the and of events. Learn more in the chapter Probability of and.
Conditional Probability
In English, a conditional probability states "what is the chance of an event E happening given that I have
already observed some other event F ". It is a critical idea in machine learning and probability because it
allows us to update our probabilities in the face of new evidence.
When you condition on an event happening you are entering the universe where that event has taken
place. Formally, once you condition on F the only outcomes that are now possible are the ones which are
consistent with F . In other words your sample space will now be reduced to F . As an aside, in the
universe where F has taken place, all rules of probability still hold!
    P(E|F) = P(E and F) / P(F)
Let's use a visualization to get an intuition for why the conditional probability formula is true. Again
consider events E and F which have outcomes that are subsets of a sample space with 50 equally likely
outcomes, each one drawn as a hexagon:
Conditioning on F means that we have entered the world where F has happened (and F, which has 14 equally likely outcomes, has become our new sample space). Given that event F has occurred, the conditional probability that event E occurs is given by the subset of the outcomes of E that are consistent with F. In this case we can visually see that those are the three outcomes in (E and F). Thus we have:

    P(E|F) = |E and F| / |F| = 3/14
Even though the visual example (with equally likely outcome spaces) is useful for gaining intuition,
conditional probability applies regardless of whether the sample space has equally likely outcomes!
Let's work through an example about movie watching: the probability that someone watches the movie Life is Beautiful given that they watched the movie Amélie. To start, let's answer the simpler question, what is the probability that a user watches the movie Life is Beautiful, E? We can solve this problem using the definition of probability and a dataset of movie watching [1]:

    P(E) = 1,234,231 / 50,923,123 ≈ 0.02
[Figure: marginal probabilities of watching each of five movies: P(E) = 0.02, 0.01, 0.05, 0.09, 0.03]
Now for a more interesting question. What is the probability that a user will watch the movie Life is
Beautiful (E), given they watched Amelie (F )? We can use the definition of conditional probability.
    P(E|F) = P(E and F) / P(F)    Def of Cond Prob

If we let F be the event that someone watches the movie Amélie, we can now calculate P(E|F), the conditional probability of watching each movie given that the person watched Amélie:

[Figure: conditional probabilities of watching each of the five movies given F: P(E|F) = 0.09, 0.03, 0.05, 0.02, 1.00]
Why do some probabilities go up, some probabilities go down, and some probabilities are unchanged
after we observe that the person has watched Amelie (F )? If you know someone watched Amelie, they
are more likely to watch Life is Beautiful, and less likely to watch Star Wars. We have new information
on the person!
All the rules of probability hold when you consistently condition on an event G. For example:

    Axiom 3 of probability:    P(E or F) = P(E) + P(F)    becomes    P(E or F|G) = P(E|G) + P(F|G)
    Identity 1:                P(E^C) = 1 − P(E)          becomes    P(E^C|G) = 1 − P(E|G)
The term P(E|F, G) is new notation for conditioning on multiple events. You should read that term as "the probability of E occurring, given that both F and G have occurred". In equation form:

    P(E|F, G) = P(E and F | G) / P(F | G)

This equation states that the definition for conditional probability of E|F still applies in the universe where G has occurred. Do you think that P(E|F, G) should be equal to P(E|F)? The answer is: sometimes yes and sometimes no.
Independence
So far we have talked about mutual exclusion as an important "property" that two or more events can
have. In this chapter we will introduce you to a second property: independence. Independence is perhaps
one of the most important properties to consider! Like for mutual exclusion, if you can establish that this
property applies (either by logic, or by declaring it as an assumption) it will make analytic probability
calculations much easier!
Definition: Independence
Two events are said to be independent if knowing the outcome of one event does not change your belief about whether or not the other event will occur. For example, you might say that two separate dice rolls are independent of one another: the outcome of the first die gives you no information about the outcome of the second -- and vice versa. Formally, E is independent of F if:

    P(E|F) = P(E)
Alternative Definition
Another definition of independence can be derived by using an equation called the chain rule, which we
will learn about later, in the context where two events are independent. Consider two indepedent events
A and B:
Independence is Symmetric

This definition is symmetric. If E is independent of F, then F is independent of E. We can prove that P(F|E) = P(F) implies P(E|F) = P(E), starting with a law called Bayes' Theorem which we will cover shortly:

    P(E|F) = P(F|E) ⋅ P(E) / P(F)    Bayes' Theorem
           = P(F) ⋅ P(E) / P(F)      Since P(F|E) = P(F)
           = P(E)                    Cancel P(F)
Generalized Independence

Events E₁, E₂, …, E_n are independent if for every subset E₁′, E₂′, …, E_r′ with r elements (where r ≤ n):

    P(E₁′ and E₂′ and … and E_r′) = Π_{i=1}^{r} P(E_i′)
As an example, consider the probability of getting 5 heads on 5 coin flips where we assume that each
coin flip is independent of one another.
    P(H₁, H₂, H₃, H₄, H₅) = Π_{i=1}^{5} P(H_i)
                          = Π_{i=1}^{5} 1/2
                          = (1/2)⁵
                          = 0.03125
Independence is a property which is often "assumed" if you think it is reasonable that one event is
unlikely to influence your belief that the other will occur (or if the influence is negligible). Let's work
through an example to better understand.
Example: Consider a network with n independent routers connected in parallel between computers A and B, where router i is functioning with probability p_i (where 1 ≤ i ≤ n). Let E be the event that there is a functional path from A to B. What is P(E)?

Let F_i be the event that router i fails. Note that the problem states that routers are independent, and as such we assume that the events F_i are all independent of one another. A path exists unless every router fails:

    P(E) = 1 − P(F₁ and F₂ and … and F_n)
         = 1 − Π_{i=1}^{n} P(F_i)       Independence of F_i
         = 1 − Π_{i=1}^{n} (1 − p_i)
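A sketch of this calculation in code, with hypothetical values for the p_i:

import math

p = [0.9, 0.8, 0.95, 0.7]  # hypothetical functioning probabilities for n = 4 routers

# P(E) = 1 - P(all routers fail) = 1 - prod(1 - p_i)
prob_path = 1 - math.prod(1 - pi for pi in p)
print(prob_path)  # 0.9997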
If A and B are independent, then A and B^C are also independent. To prove this we need to show that P(A B^C) = P(A) P(B^C). The proof starts with a rule called the Law of Total Probability, which we will cover in a later chapter:

    P(A) = P(A and B) + P(A and B^C)    Law of Total Probability
    P(A B^C) = P(A) − P(A and B)        Rearranging
             = P(A) − P(A) P(B)         Independence of A and B
             = P(A)(1 − P(B))           Factor out P(A)
             = P(A) P(B^C)              Identity 1
Conditional Independence
We saw earlier that the laws of probability still held if you consistently conditioned on an event. As such,
the definition of independence also transfers to the universe of conditioned events. We use the
terminology "conditional independence" to refer to events that are independent when consistently
conditioned. For example, if someone claims that events E₁, E₂, E₃ are conditionally independent given event F, it means that:

    P(E₁, E₂, E₃|F) = P(E₁|F) ⋅ P(E₂|F) ⋅ P(E₃|F) = Π_{i=1}^{3} P(E_i|F)
Warning: While the rules of probability stay the same when conditioning on an event, the independence property between events might change. Events that were dependent can become independent when conditioning on an event. Events that were independent can become dependent. For example, if events E₁, E₂, E₃ are conditionally independent given event F, it is not necessarily true that:

    P(E₁, E₂, E₃) = Π_{i=1}^{3} P(E_i)
Probability of and
The probability of the and of two events, say E and F , written P(E and F ), is the probability of both
events happening. You might see equivalent notations P(EF ), P(E ∩ F ) and P(E, F ) to mean the
probability of and. How you calculate the probability of event E and event F happening depends on
whether or not the events are "independent". In the same way that mutual exclusion makes it easy to calculate the probability of the or of events, independence is a property that makes it easy to calculate the probability of the and of events. If E and F are independent, then:

    P(E and F) = P(E) ⋅ P(F)

This property applies regardless of how the probabilities of E and F were calculated and whether or not the events are mutually exclusive.
The independence principle extends to more than two events. For n events E₁, E₂, …, E_n that are mutually independent of one another:

    P(E₁ and E₂ and … and E_n) = Π_{i=1}^{n} P(E_i)

The independence equation also holds for all subsets of the events.
We can prove this equation by combining the definition of conditional probability and the definition of independence.

    P(E|F) = P(E and F) / P(F)    Definition of conditional probability
    P(E) = P(E and F) / P(F)      Definition of independence: P(E|F) = P(E)
    P(E and F) = P(E) ⋅ P(F)      Rearranging
See the chapter on independence to learn about when you can assume that two events are independent.
Rearranging the definition of conditional probability gives P(E and F) = P(E|F) ⋅ P(F). Of course there is nothing special about E that says it should go first. Equivalently:

    P(E and F) = P(F|E) ⋅ P(E)

We call this formula the "chain rule." Intuitively it states that the probability of observing events E and F is the probability of observing F, multiplied by the probability of observing E given that you have observed F. It generalizes to n events:

    P(E₁ and E₂ and … and E_n) = P(E₁) ⋅ P(E₂|E₁) ⋯ P(E_n|E₁ … E_{n−1})
Observe that event E can be thought of as having two parts: the part that is in F, (E and F), and the part that isn't, (E and F^C). This is true because F and F^C are (a) mutually exclusive sets of outcomes which (b) together cover the entire sample space. After further investigation this proved to be mathematically true, and there was much rejoicing:

    P(E) = P(E and F) + P(E and F^C)
This observation proved to be particularly useful when it was combined with the chain rule and gave rise
to a tool so useful, it was given the big name, law of total probability.
There is a more general version of the rule. If you can divide your sample space into any number of mutually exclusive events B₁, B₂, …, B_n, such that every outcome in the sample space falls into one of those events, then:

    P(E) = Σ_{i=1}^{n} P(E and B_i) = Σ_{i=1}^{n} P(E|B_i) ⋅ P(B_i)
We can build intuition for the general version of the law of total probability in a similar way. If we can
divide a sample space into a set of several mutually exclusive sets (where the or of all the sets covers the
entire sample space) then any event can be solved for by thinking of the likelihood of the event and each
of the mutually exclusive sets.
In the image above, you could compute P(E) to be equal to P[(E and B₁) or (E and B₂) or …]. Of course this is worth mentioning because there are many real world cases where the sample space can be discretized into several mutually exclusive events. As an example, if you were thinking about the
probability of the location of an object on earth, you could discretize the area over which you are tracking
into a grid.
Bayes' Theorem
Bayes' Theorem is one of the most ubiquitous results in probability for computer scientists. In a nutshell,
Bayes' theorem provides a way to convert a conditional probability from one direction, say P(E|F ), to
the other direction, P(F |E).
Bayes' theorem is a mathematical identity which we can derive ourselves. Start with the definition of
conditional probability and then expand the and term using the chain rule:
    P(F|E) = P(F and E) / P(E)        Def of conditional probability
           = P(E|F) ⋅ P(F) / P(E)     Substitute the chain rule for P(F and E)
This theorem makes no assumptions about E or F so it will apply for any two events. Bayes' theorem is
exceptionally useful because it turns out to be the ubiquitous way to answer the question: "how can I
update a belief about something, which is not directly observable, given evidence." This is for good
reason. For many "noisy" measurements it is straightforward to estimate the probability of the noisy
observation given the true state of the world. However, what you would really like to know is the
conditional probability the other way around: what is the probability of the true state of the world given
evidence. There are countless real world situations that fit this situation:
There is a pattern here: in each example we care about knowing some unobservable -- or hard to observe
-- state of the world. This state of the world "causes" some easy-to-observe evidence. For example:
having the flu (something we would like to know) causes a fever (something we can easily observe), not
the other way around. We often call the unobservable state the "belief" and the observable state the
"evidence". For that reason lets rename the events! Lets call the unobservable thing we want to know B
for belief. Lets call the thing we have evidence of E for evidence. This makes it clear that Bayes' theorem
allows us to calculate an updated belief given evidence: P(B|E)
P(B|E) = P(E|B) · P(B) / P(E)
There are names for the different terms in the Bayes' Rule formula. The term P(B|E) is often called the
"posterior": it is your updated belief of B after you take into account evidence E. The term P(B) is often
called the "prior": it was your belief before seeing any evidence. The term P(E|B) is called the update
and P(E) is often called the normalization constant.
There are several techniques for handling the case where the denominator is not known. One technique is
to use the law of total probability to expand out the term, resulting in another formula, called Bayes'
Theorem with Law of Total Probability:
P(B|E) = P(E|B) · P(B) / [P(E|B) · P(B) + P(E|B^C) · P(B^C)]

Recall the law of total probability which is responsible for our new denominator:

P(E) = P(E|B) · P(B) + P(E|B^C) · P(B^C)
A common scenario for applying the Bayes' Rule formula is when you want to know the probability of
something “unobservable” given an “observed” event! For example, you want to know the probability
that a student understands a concept, given that you observed them solving a particular problem. It turns
out it is much easier to first estimate the probability that a student can solve a problem given that they
understand the concept and then to apply Bayes' Theorem. Intuitively, you can think about this as
updating a belief given evidence.
In this problem we are going to calculate the probability that a patient has an illness given a test result for the illness. A positive test result means the test thinks the patient has the illness. You know the following information, which is typical for medical tests:

Probability of a positive result given the patient has the illness: 0.92
Probability of a positive result given the patient does not have the illness: 0.10
Probability that a patient has the illness (the prior): 0.13

The numbers in this example are from the Mammogram test for breast cancer. The seriousness of cancer underscores the potential for Bayesian probability to be applied to important contexts. The natural occurrence of breast cancer is 8%. The mammogram test returns a positive result 95% of the time for patients who have breast cancer. The test returns a positive result 7% of the time for people who do not have breast cancer. In this demo you can enter different input numbers and it will recalculate.
Answer
The probability that the patient has the illness given a positive test result is: 0.5789
Terms:
Let I be the event that the patient has the illness
Let E be the event that the test result is positive
P(I |E) = probability of the illness given a positive test. This is the number we want to calculate.
Bayes' Theorem:

In this problem we know P(E|I) and P(E|I^C) but we want to know P(I|E). We can apply Bayes' Theorem to turn our knowledge of one conditional into knowledge of the reverse.

P(I|E) = P(E|I) · P(I) / [P(E|I) · P(I) + P(E|I^C) · P(I^C)]    Bayes' Theorem with Total Prob.

Now all we need to do is plug values into this formula. The only value we don't explicitly have is P(I^C), which is simply 1 − P(I) = 1 − 0.13 = 0.87.

P(I|E) = (0.92)(0.13) / [(0.92)(0.13) + (0.10)(1 − 0.13)] = 0.5789
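This calculation is easy to mirror in code. Here is a minimal sketch (the function name is our own) that reproduces the answer above:

def bayes_posterior(prior, p_pos_given_ill, p_pos_given_healthy):
    # Bayes' Theorem with the Law of Total Probability:
    # P(I|E) = P(E|I)P(I) / [P(E|I)P(I) + P(E|I^C)P(I^C)]
    numerator = p_pos_given_ill * prior
    denominator = numerator + p_pos_given_healthy * (1 - prior)
    return numerator / denominator

print(bayes_posterior(0.13, 0.92, 0.10))  # approximately 0.5789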
There are many possibilities for how many people have the illness, but one very plausible number is the number of people in our population, 1000, multiplied by the probability of the disease: 1000 × P(Illness).
We are going to color people who have the illness in blue and those without the illness in pink (those
colors do not imply gender!).
A certain number of people with the illness will test positive (which we will draw in Dark Blue) and a
certain number of people without the illness will test positive (which we will draw in Dark Pink):
1000 × P(Illness) × P(Positive|Illness) people have the illness and test positive.
1000 × P(Illness^C) × P(Positive|Illness^C) people do not have the illness and test positive.
The number of people who test positive and have the illness is 76.
The number of people who test positive and don't have the illness is 65.
The total number of people who test positive is 141.
Out of the subset of people who test positive, the fraction that have the illness is 76/141 = 0.5390 which
is a close approximation of the answer. If instead of using 1000 imaginary people, we had used more, the
approximation would have been even closer to the actual answer (which we calculated using Bayes' Theorem).
Sometimes you will want the more expanded version of the law of total probability: P(E) = ∑_i P(E|Bi) P(Bi). Recall that this only works if the events Bi are mutually exclusive and cover the sample space.
For example say we are trying to track a phone which could be in any one of n discrete locations and we have prior beliefs P(B1) … P(Bn) as to whether the phone is in location Bi. Now we gain some evidence (such as a particular signal strength from a particular cell tower) that we call E and we need to update all of our probabilities to be P(Bi|E). We should use Bayes' Theorem!

The probability of the observation, assuming that the phone is in location Bi, P(E|Bi), is something that can be given to you by an expert. In this case the probability of getting a particular signal strength given a location Bi will be determined by the distance between the cell tower and location Bi.
Since we are assuming that the phone must be in exactly one of the locations, we can find the probability of any event Bi given E by first applying Bayes' Theorem and then applying the general version of the law of total probability:

P(Bi|E) = P(E|Bi) · P(Bi) / P(E)                              Bayes' Theorem. What to do about P(E)?
        = P(E|Bi) · P(Bi) / ∑_{j=1}^{n} P(E|Bj) · P(Bj)       Use general law of total probability for P(E)
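In code, the whole update is a few lines. Here is a minimal sketch, with made-up priors and likelihoods for n = 3 locations:

def posterior_over_locations(priors, likelihoods):
    # priors[i] is P(B_i); likelihoods[i] is P(E|B_i)
    # The denominator is P(E) via the general law of total probability
    p_evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / p_evidence for l, p in zip(likelihoods, priors)]

priors = [0.5, 0.3, 0.2]       # P(B_1), P(B_2), P(B_3)
likelihoods = [0.1, 0.6, 0.3]  # P(E|B_1), P(E|B_2), P(E|B_3)
print(posterior_over_locations(priors, likelihoods))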
Here is another strategy for dealing with an unknown P(E). We can make it cancel out by calculating the ratio P(B|E) / P(B^C|E). This fraction tells you how many times more likely it is that B will happen given E than not B:

P(B|E) / P(B^C|E) = [P(E|B) P(B) / P(E)] / [P(E|B^C) P(B^C) / P(E)]    Apply Bayes' Theorem to both terms
                  = P(E|B) P(B) / [P(E|B^C) P(B^C)]                     The term P(E) cancels
Log Probabilities
A log probability log P(E) is simply the log function applied to a probability. For example if
P(E) = 0.00001 then log P(E) = log(0.00001) ≈ −11.51. Note that in this book, the default base is
the natural base e. There are many reasons why log probabilities are an essential tool for digital
probability: (a) computers can be rather limited when representing very small numbers and (b) logs have
the wonderful ability to turn multiplication into addition, and computers are much faster at addition.
You may have noticed that the log in the above example produced a negative number. Recall that log b = c, with the implied natural base e, is the same as the statement e^c = b. It says that c is the exponent of e that produces b. If b is a number between 0 and 1, what power should you raise e to in order to produce b? If you raise e to the power 0 it produces 1. To produce a number less than 1, you must raise e to a power less than 0. That is a long way of saying: if you take the log of a probability, the result will be a negative number.
This is especially convenient because computers are much more efficient when adding than when multiplying. It can also make derivations easier to write. This is especially true when you need to multiply many probabilities together:

log ∏_i P(Ei) = ∑_i log P(Ei)
Why would you care? Often in the digital world, computers are asked to reason about the probability of
data, or a whole dataset. For example, perhaps your data is words and you want to reason about the
probability that a given author would write these specific words. While this probability is very small (we
are talking about an exact document) it might be larger than the probability that a different author would
write a specific document with specific words. For these sort of small probabilities, if you use computers,
you would need to use log probabilities.
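Here is a minimal sketch of that failure mode: multiplying 100 tiny probabilities underflows a float to 0.0, while summing their logs works fine.

import math

probs = [0.00001] * 100  # 100 independent events, each with P(E_i) = 1e-5

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value 1e-500 underflows a float

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -1151.3, perfectly representable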
Say a coin comes up heads with probability p. Most coins are fair and as such come up heads with probability p = 0.5. There are many events for which coin flips are a great analogy that have different values of p, so let's leave p as a variable. You can try simulating coins here. Note that H is short for Heads and T is short for Tails. We think of each coin as distinct:
Simulator results:
H, H, T, H, H, H, T, H, H, T
Warmups
What is the probability that all n flips are heads?
H, H, H, H, H, H, H, H, H, H
Each coin flip is independent so we can use the rule for probability of and with independent events. As such, the probability of n heads is p multiplied by itself n times: p^n. If n = 10 and p = 0.6 then the probability of n heads is around 0.006.
What is the probability that all n flips are tails?

T, T, T, T, T, T, T, T, T, T

Each coin flip is independent. The probability of tails on any coin flip is 1 − p. Again, since the coin flips are independent, the probability of tails n times on n flips is (1 − p) multiplied by itself n times: (1 − p)^n. If n = 10 and p = 0.6 then the probability of n tails is around 0.0001.
What is the probability of first k heads and then n − k tails? Let's say n = 10 and k = 4; this question is asking what is the probability of getting:

H, H, H, H, T, T, T, T, T, T

The coins are still independent! The first k heads occur with probability p^k, and the run of n − k tails occurs with probability (1 − p)^{n−k}. The probability of k heads then n − k tails is the product of those two terms: p^k · (1 − p)^{n−k}.
Exactly k heads
Next let's try to figure out the probability of exactly k heads in the n flips. Importantly we don't care where in the n flips we get the heads, as long as there are k of them. Note that this question is different than the question of first k heads and then n − k tails, which requires that the k heads come first! That particular outcome does produce exactly k heads, but there are others.
There are many others! In fact any permutation of k heads and n − k tails will satisfy this event. Lets ask
the computer to list them all for exactly k = 4 heads within n = 10 coin flips. The output region is
scrollable:
(H, H, H, H, T, T, T, T, T, T)
(H, H, H, T, H, T, T, T, T, T)
(H, H, H, T, T, H, T, T, T, T)
(H, H, H, T, T, T, H, T, T, T)
(H, H, H, T, T, T, T, H, T, T)
(H, H, H, T, T, T, T, T, H, T)
(H, H, H, T, T, T, T, T, T, H)
(H, H, T, H, H, T, T, T, T, T)
(H, H, T, H, T, H, T, T, T, T)
(H, H, T, H, T, T, H, T, T, T)
(H, H, T, H, T, T, T, H, T, T)
(H, H, T, H, T, T, T, T, H, T)
(H, H, T, H, T, T, T, T, T, H)
(H, H, T, T, H, H, T, T, T, T)
(H, H, T, T, H, T, H, T, T, T)
(H, H, T, T, H, T, T, H, T, T)
(H, H, T, T, H, T, T, T, H, T)
(H, H, T, T, H, T, T, T, T, H)
Exactly how many outcomes are there with k = 4 heads in n = 10 flips? 210. The answer can be calculated using permutations of indistinct objects:

N = n! / (k!(n − k)!) = (n choose k)
The probability of exactly k = 4 heads is the probability of the or of each of these outcomes. Because we consider each coin to be unique, each of these outcomes is "mutually exclusive" and as such if Ei is the event of the ith outcome:

P(exactly k heads) = ∑_{i=1}^{N} P(Ei)
The next question is, what is the probability of each of these outcomes?
Here is an arbitrarily chosen outcome which satisfies the event of exactly k = 4 heads in n = 10 coin flips. In fact it is the one on row 128 in the list above:

T, H, T, T, H, T, T, H, H, T

What is the probability of getting the exact sequence of heads and tails in the example above? Each coin flip is still independent, so we multiply p for each heads and 1 − p for each tails. Let E128 be the event of this exact outcome:

P(E128) = (1 − p) · p · (1 − p) · (1 − p) · p · (1 − p) · (1 − p) · p · p · (1 − p)
        = p · p · p · p · (1 − p) · (1 − p) · (1 − p) · (1 − p) · (1 − p) · (1 − p)
        = p^4 · (1 − p)^6
There is nothing too special about row 128. If you chose any row, you would get k independent heads and n − k independent tails. For any row i, P(Ei) = p^k · (1 − p)^{n−k}. Now we are ready to calculate the probability of exactly k heads:
P(exactly k heads) = ∑_{i=1}^{N} P(Ei)
                   = ∑_{i=1}^{N} p^k · (1 − p)^{n−k}       Sub in P(Ei)
                   = N · p^k · (1 − p)^{n−k}               Sum N times
                   = (n choose k) · p^k · (1 − p)^{n−k}    Perm of indistinct objects
More than k heads: The probability of more than k heads is the sum over the mutually exclusive cases of exactly i heads, for each i from k + 1 to n:

P(more than k heads) = ∑_{i=k+1}^{n} P(exactly i heads)
                     = ∑_{i=k+1}^{n} (n choose i) · p^i · (1 − p)^{n−i}    Substitution
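These formulas are easy to check in code. Here is a minimal sketch (the function names are our own) using Python's math.comb:

from math import comb

def prob_exactly_k_heads(n, k, p):
    # (n choose k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_more_than_k_heads(n, k, p):
    # Sum over the mutually exclusive cases i = k+1, ..., n
    return sum(prob_exactly_k_heads(n, i, p) for i in range(k + 1, n + 1))

print(prob_exactly_k_heads(10, 4, 0.6))    # about 0.1115
print(prob_more_than_k_heads(10, 4, 0.6))  # about 0.8338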
Enigma Machine
One of the very first computers was built to break the Nazi “enigma” codes in WW2. It was a hard
problem because the “enigma” machine, used to make secret codes, had so many unique configurations.
Every day the Nazis would choose a new configuration and if the Allies could figure out the daily
configuration, they could read all enemy messages. One solution was to try all configurations until one
produced legible German. This begs the question: How many configurations are there?
The enigma machine has three rotors. Each rotor can be set to one of 26 different positions. How many unique configurations are there of the three rotors? By the product rule there are 26 · 26 · 26 = 26^3 = 17,576.
What's more! The machine has a plug board which could swap the electrical signal for letters. On the plug board, wires can connect any pair of letters to produce a new configuration. A wire can't connect a letter to itself. Wires are indistinct. A wire from 'K' to 'L' is not considered distinct from a wire from 'L' to 'K'.
We are going to work up to considering any number of wires.
The Enigma plugboard. For electrical reasons, each letter has two jacks and each plug has two prongs. Semantically this is equivalent to one plug location per letter.
One wire: How many ways are there to place exactly one wire that connects two letters?

(26 choose 2) = 325
Two wires: How many ways are there to place exactly two wires? Recall that wires are not considered distinct. Each letter can have at most one wire connected to it, thus you couldn't have a wire connect 'K' to 'L' and another one connect 'L' to 'X'.
There are (26 choose 2) ways to place the first wire and (24 choose 2) ways to place the second wire. However, since the wires are indistinct, we have double counted every possibility. Because every possibility is counted twice we should divide by 2:

Total = [(26 choose 2) · (24 choose 2)] / 2 = 44,850
Three wires: How many ways are there to place exactly three wires?

There are (26 choose 2) ways to place the first wire, (24 choose 2) ways to place the second wire, and then (22 choose 2) ways to place the third. However, since the wires are indistinct, and our step counting implicitly treats them as distinct, we have overcounted each possibility. How many times is each pairing of three letters overcounted? It's the number of permutations of three distinct objects: 3!

Total = [(26 choose 2) · (24 choose 2) · (22 choose 2)] / 3! = 3,453,450
There is another way to arrive at the same answer. First we are going to choose the letters to be paired, then we are going to pair them off. There are (26 choose 6) ways to select the letters that are being wired up. We then need to pair off those letters. One way to think about pairing the letters off is to first permute them (6! ways) and then pair up the first two letters, the next two, the next two, and so on. For example, if our letters were {A,B,C,D,E,F} and our permutation was BADCEF, then this would correspond to wiring B to A and D to C and E to F. We are overcounting by a lot. First, we are overcounting by a factor of 3! since the ordering of the pairs doesn't matter. Second, we are overcounting by a factor of 2^3 since the order within each of the 3 pairs doesn't matter:

Total = (26 choose 6) · 6! / (3! · 2^3) = 3,453,450
Arbitrary wires: How many ways are there to place k wires, thus connecting 2 · k letters? During WW2 the Germans always used a fixed number of wires. But one fear was that if they discovered the Enigma machine was cracked, they could simply use an arbitrary number of wires.

The set of ways to use exactly i wires is mutually exclusive from the set of ways to use exactly j wires if i ≠ j (since no way can use both exactly i and j wires). As such Total = ∑_{k=0}^{13} Total_k, where Total_k is the number of ways to use exactly k wires. Continuing our logic for the ways to use an exact number of wires:

Total_k = [∏_{i=1}^{k} (28 − 2i choose 2)] / k!

Total = ∑_{k=0}^{13} Total_k = ∑_{k=0}^{13} [∏_{i=1}^{k} (28 − 2i choose 2)] / k!
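Since these numbers get large, it is worth checking them in code. Here is a minimal sketch of Total_k; it reproduces the 10-wire count quoted in the next paragraph:

from math import comb, factorial

def ways_exactly_k_wires(k):
    # Multiply (28 - 2i choose 2) for i = 1..k, then divide by k!
    # to correct for treating the indistinct wires as distinct
    total = 1
    for i in range(1, k + 1):
        total *= comb(28 - 2 * i, 2)
    return total // factorial(k)

print(ways_exactly_k_wires(10))  # 150738274937250
print(sum(ways_exactly_k_wires(k) for k in range(14)))  # k = 0 through 13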
The actual Enigma used in WW2 had exactly 10 wires connecting 20 letters, allowing for 150,738,274,937,250 unique configurations. The Enigma machine also chose the three rotors from a set of five, adding another factor of 5 · 4 · 3 = 60 (an ordered choice of 3 rotors from 5).

When you combine the number of ways of setting the rotors with the number of ways you could set the plug board, you get the total number of configurations of an enigma machine. Thinking of this as steps we can multiply the numbers we earlier calculated: 17,576 · 150,738,274,937,250 · 60 ≈ 159 · 10^18 unique settings. So, Alan Turing and his team at Bletchley Park went on to build a machine which could help test many configurations -- a predecessor to the first computers.
Serendipity
The word serendipity comes from the Persian fairy tale of the Three Princes of Serendip.

Problem

What is the probability of a serendipitous encounter with a friend? Imagine you live in an area with a large general population (e.g. Stanford with 17,000 students). A small subset of the population are friends. What are the chances that you run into at least one friend if you see a handful of people from the population? Assume that seeing each person from the population is equally likely.
[Interactive demo. Inputs: Total Population = 17,000; Friends = 150; People seen = 100.]

Answer

The probability that you see at least one friend can be computed from these inputs.
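Assuming the three demo inputs are the population size, the number of friends, and the number of people you see, here is a minimal sketch of the calculation:

from math import comb

def p_at_least_one_friend(population, friends, seen):
    # P(no friends among those seen) = (population - friends choose seen)
    #                                  / (population choose seen)
    p_none = comb(population - friends, seen) / comb(population, seen)
    return 1 - p_none

print(p_at_least_one_friend(17000, 150, 100))  # about 0.59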
Random Shuffles
Here is a surprising claim. If you shuffle a standard deck of cards seven times, with almost total certainty you can claim that the exact ordering of cards has never been seen! Wow! Let's explore. We can ask this question formally as: What is the probability that in the n shuffles seen since the start of time, yours is unique?

Orderings of 52 Cards

Our adventure starts with a simple observation: there are very many ways to order 52 cards. But exactly how many unique orderings of a standard deck are there? There are 52! ways to order a standard deck of cards, since each card is unique (each card has a unique suit, value combination) and we can apply the rule for Permutations of Distinct Objects: 52! ≈ 8 × 10^67.
Next let's calculate the probability that none of those n historical shuffles matches your particular ordering of 52 cards. There are two valid approaches: using equally likely outcomes, and using independence.

Equally likely outcomes: The sample space is the set of all ways that the n historical shuffles could have come out, so |S| = (52!)^n, and every outcome in it is equally likely — we can convince ourselves of this by symmetry — no ordering is more likely than any other. Out of that sample space we want to count the number of outcomes where none of the orderings matches yours. There are 52! − 1 ways to order 52 cards that are not yours. We can construct the event space by steps: for each of the n shuffles in history select any one of those 52! − 1 orderings. Thus |E| = (52! − 1)^n.
P(U) = |E| / |S|                                  Equally Likely Outcomes
     = (52! − 1)^n / (52!)^n
     = (52! − 1)^{10^20} / (52!)^{10^20}           n = 10^20
     = ((52! − 1) / 52!)^{10^20}
In theory that is the correct answer, but those numbers are so big, it's not clear how to evaluate it, even when using a computer. One good idea is to first compute the log probability:

log P(U) = log ((52! − 1) / 52!)^{10^20}
         = 10^20 · log ((52! − 1) / 52!)
         = 10^20 · [log(52! − 1) − log(52!)]
         = 10^20 · (−1.24 × 10^{−68})
         = −1.24 × 10^{−48}

Exponentiating gives back the probability:

P(U) = e^{log P(U)} = e^{−1.24 × 10^{−48}} ≈ 1 − 1.24 × 10^{−48}
So the probability that your particular ordering is unique is very close to 1, and the probability that
someone else got the same ordering, 1 − P (U ), is a number with 47 zeros after the decimal point. It is
safe to say your ordering is unique.
In python, you can use a special library called decimal to compute very small probabilities. Here is an example of how to compute log((52! − 1) / 52!):

import math
from decimal import Decimal, getcontext

getcontext().prec = 100  # need far more precision than a float offers

n = math.pow(10, 20)
card_perms = math.factorial(52)
denominator = card_perms
numerator = card_perms - 1

# log(numerator) - log(denominator), using high precision decimals
log_pr = Decimal(numerator).ln() - Decimal(denominator).ln()
# approximately -1.24E-68
print(log_pr)
For small values of x, the value (1 − x)^n is very close to 1 − nx, and this gives us another way to compute P(U):

P(U) = (52! − 1)^n / (52!)^n
     = (1 − 1/52!)^{10^20}          n = 10^20
     ≈ 1 − 10^20 / 52!
     ≈ 1 − 1.24 × 10^{−48}
This agrees with the result we got using python's decimal library.
Independence
Another approach is to define events D that the ith card shuffle is different than yours. Because we
i
think of the sample space of D , it is 52! ways of ordering a deck cards. The event space is the 52! - 1
i
P(U ) = ∏ P(D i )
i=1
i=1
= n ⋅ log P(D i )
52! − 1
20
= 10 ⋅ log
52!
20 −68
≈ 10 ⋅ −1.24 × 10
Which is the same answer we got with the other approach for log P(U )
Count locations for edges: First, let's establish how many locations there are for edges in a random network. Consider a network with 10 nodes. Count the number of possible locations for undirected edges. Recall that an undirected edge from node A to node B is not distinct from an edge from node B to node A. You can assume that an edge does not connect a node to itself.

Each edge connects 2 of the 10 nodes, and the order of the pair doesn't matter, so the answer is (10 choose 2) = 45.
Count ways to place edges: Now let's add random edges to the network! Assume the same network (with 10 nodes) has 12 random edges. If each pair of nodes is equally likely to have an edge, how many unique ways could we choose the location of the 12 edges?

Let k = (10 choose 2) be the number of possible locations for an edge. We need to choose which 12 of those k = 45 locations hold the (undirected) edges, so there are (k choose 12) = (45 choose 12) unique ways.
Probability of node degree: Now that we have a randomly generated graph, let's explore the degree of our nodes! In the same network with 10 nodes and 12 edges, select a node uniformly at random. What is the probability that our node has exactly degree i? Note that 0 ≤ i ≤ 9. Recall that there are only 9 nodes to connect to from our chosen node, since there are 10 nodes in the graph.
Let E be the event that our node has exactly i connections. We will first compute the distribution of
P (E) using |E|/|S|. The sample space is the set of ways to choose 12 edges, and the event space is the
set of ways to do so such that we've chosen exactly i of the edges incident to our current node (which has
9 possible edges incident to it). To construct the event space E we can consider a two step process:
1. Select i edges from the 9 possible edge locations connected to our node.
2. Select the location for the 12 − i remaining edges. The edges can go to any of the k locations for
edges except for the 9 incident to our node.
P(E) = |E| / |S| = [(9 choose i) · (k − 9 choose 12 − i)] / (k choose 12)
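Here is a minimal sketch that evaluates this PMF for every degree i (the function name is our own):

from math import comb

def p_degree(i, nodes=10, edges=12):
    # (9 choose i)(k - 9 choose 12 - i) / (k choose 12)
    k = comb(nodes, 2)       # number of possible edge locations, 45
    incident = nodes - 1     # edge locations incident to our node, 9
    return comb(incident, i) * comb(k - incident, edges - i) / comb(k, edges)

for i in range(10):
    print(i, round(p_degree(i), 4))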
Bacteria Evolution
A wonderful property of modern life is that we have anti-biotics to kill bacterial infections. However, we only have a fixed number of anti-biotic medicines, and bacteria are evolving to become resistant to our anti-biotics. In this example we are going to use probability to understand the evolution of anti-biotic resistance in bacteria.

Imagine you have a population of 1 million infectious bacteria in your gut, 10% of which have a mutation that makes them slightly more resistant to anti-biotics. You take a course of anti-biotics. The probability that a bacterium with the mutation survives is 20%. The probability that a bacterium without the mutation survives is 1%.
What is the probability that a randomly chosen bacterium survives the anti-biotics?
Let E be the event that our bacterium survives. Let M be the event that the bacterium has the mutation. By the Law of Total Probability (LOTP):

P(E) = P(E and M) + P(E and M^C)           LOTP
     = P(E|M) P(M) + P(E|M^C) P(M^C)       Chain Rule
     = 0.20 · 0.10 + 0.01 · 0.90
     = 0.029
What is the probability that a surviving bacterium has the mutation? Using the same events as in the last section, this question is asking for P(M|E). We aren't given the conditional probability in that direction; instead we know P(E|M). Such situations call for Bayes' Theorem:

P(M|E) = P(E|M) P(M) / P(E)        Bayes
       = (0.20 · 0.10) / P(E)      Given
       = (0.20 · 0.10) / 0.029     Calculated above
       ≈ 0.69
After the course of anti-biotics, 69% of bacteria have the mutation, up from 10% before. If this population is allowed to reproduce you will have a much more resistant set of bacteria!
Part 2: Random Variables
Random Variables
A Random Variable (RV) is a variable that probabilistically takes on a value. Random variables are one of the most important constructs in all of probability theory. You can think of an RV as being like a variable in a programming language, and in fact random variables are just as important to probability theory as variables are to programming. Random Variables take on values, have types and have domains over which they are applicable.

Random variables work with all of the foundational theory we have built up to this point. We can define events that occur if the random variable takes on values that satisfy a numerical test (e.g. does the variable equal 5, is the variable less than 8).
Let's look at a first example of a random variable. Say we flip three fair coins. We can define a random variable Y to be the total number of "heads" on the three coins. We can then ask about the probability of Y taking on different values, using notation such as P(Y = 2), the probability that Y takes on the value 2.
Even though we use the same notation for random variables and for events (both use capital letters) they
are distinct concepts. An event is a scenario, a random variable is an object. The scenario where a random
variable takes on a particular value (or range of values) is an event. When possible, I will try and use
letters E,F,G for events and X,Y,Z for random variables.
Using random variables is a convenient notation technique that assists in decomposing problems. There are many different types of random variables (indicator, binary, choice, Bernoulli, etc). The two main families of random variable types are discrete and continuous. Discrete random variables can only take on integer values. Continuous random variables can take on decimal values. We are going to develop our intuitions using discrete random variables and then introduce continuous ones.
Notation

Property                              Example          Description
Support or Range                      {0, 1, …, 3}     The values the random variable can take on
Distribution Function (PMF or PDF)    P(X = x)         A function which maps values the RV can take on to likelihood
You should set a goal of deeply understanding what each of these properties means. There are many more properties than the ones in the table above: properties like entropy, median, skew, kurtosis.

Let's continue our example of the random variable Y which represents the number of heads on three coin flips. Here are some events using the variable Y:
X > Y    The event that X takes on a value greater than the value Y takes on.    P(X > Y)
You will see many examples like this last one, P(Y = y), in this text book as well as in scientific and math research papers. It allows us to talk about the likelihood of Y taking on a value, in general. For example, later in this book we will derive that for three coin flips where Y is the number of heads, the probability of getting exactly y heads is:

P(Y = y) = 0.75 / (y!(3 − y)!)    if 0 ≤ y ≤ 3
This statement above is a function which takes in a parameter y as input and returns the numeric
probability P(Y = y) as output. This particular expression allows us to talk about the probability that the
number of heads is 0, 1, 2 or 3 all in one expression. You can plug in any one of those values for y to get
the corresponding probability. It is customary to use lower-case symbols for non-random values. The use of an equals sign in the "event" can be confusing. For example, what does the expression P(Y = 1) = 0.375 say? It says that the probability that "Y takes on the value 1" is 0.375. For discrete
random variables this function is called the "probability mass function" and it is the topic of our next
chapter.
Formally, the probability mass function is a mapping between the values that the random variable could
take on and the probability of the random variable taking on said value. In mathematics, we call these
associations functions. There are many different ways of representing functions: you can write an
equation, you can make a graph, you can even store many samples in a list. Let's start by looking at
PMFs as graphs where the x-axis is the values that the random variable could take on and the y-axis is
the probability of the random variable taking on said value.
In the following example, on the left we show a PMF, as a graph, for the random variable: X = the value
of a six-sided die roll. On the right we show a contrasting example of a PMF for the random variable X =
value of the sum of two dice rolls:
Left: the PMF of a single six-sided die roll. Right: the PMF of the sum of two dice rolls.
We saw the sum of two dice example in the equally likely probability section. Again, the information that is provided in these graphs is the likelihood of a random variable taking on different values. In the graph on the right, the value "6" on the x-axis is associated with the probability 5/36 on the y-axis. This x-axis value refers to the event "the sum of two dice is 6" or Y = 6. The y-axis tells us that the probability of that event is 5/36. In full: P(Y = 6) = 5/36. The value "2" is associated with "1/36", which tells us that P(Y = 2) = 1/36; the probability that two dice sum to 2 is 1/36. There is no value associated with "1" because the sum of two dice can not be 1. If you find this notation confusing, revisit the random variables section.
Written as equations, the two PMFs are:

P(X = x) = 1/6    if 1 ≤ x ≤ 6

P(Y = y) = { (y − 1)/36     if 2 ≤ y ≤ 7
           { (13 − y)/36    if 8 ≤ y ≤ 12
As a final example, here is the PMF for Y , the sum of two dice, in Python code:
def pmf_sum_two_dice(y):
# Returns the probability that the sum of two dice is y
if y < 2 or y > 12:
return 0
if y <= 7:
return (y-1) / 36
else:
return (13-y) / 36
Notation
You may feel that P(Y = y) is redundant notation. In probability research papers and higher-level work,
mathematicians often use the shorthand P(y) to mean P(Y = y). This shorthand assumes that the
lowercase value (e.g. y) has a capital letter counterpart (e.g. Y ) that represents a random variable even
though it's not written explicitly. In this book, we will often use the full form of the event P(Y = y), but
we will occasionally use the shorthand P(y).
One important property of every PMF: the probabilities over all values of the random variable must sum to 1.

∑_k P(X = k) = 1
For further understanding, let's derive why this is the case. A random variable taking on a value is an
event (for example X = 2). Each of those events is mutually exclusive because a random variable will
take on exactly one value. Those mutually exclusive cases define an entire sample space. Why? Because
X must take on some value.
A normalized histogram (where each value is divided by the length of your data list) is an approximation
of the PMF. For a dataset of discrete numbers, a histogram shows the count of each value (in this case y).
By the definition of probability, if you divide this count by the number of experiments run, you arrive at
an approximation of the probability of the event P(Y = y). In our example, we have 10,000 elements in
our dataset. The count of times that 3 occurs is 552. Note that:
count(Y = 3) / n = 552 / 10000 = 0.0552

which is very close to the true probability

P(Y = 3) = 2/36 ≈ 0.0555
In this case, since we ran 10,000 trials, the histogram is a very good approximation of the PMF. We use
the sum of dice as an example because it is easy to understand. Datasets in the real world often represent
more exciting events.
Expectation
A random variable is fully represented by its probability mass function (PMF), which represents each of
the values the random variable can take on, and the corresponding probabilities. A PMF can be a lot of
information. Sometimes it is useful to summarize the random variable! The most common, and arguably
the most useful, summary of a random variable is its "Expectation".
Definition: Expectation
The expectation of a random variable X, written E[X] is the average of all the values the random
variable can take on, each weighted by the probability that the random variable will take on that value.
E[X] = ∑_x x · P(X = x)
Expectation goes by many other names: Mean, Weighted Average, Center of Mass, 1st Moment. All of
which are calculated using the same formula.
Recall that P(X = x), also written as P(x), is the probability mass function of the random variable X.
Here is code that calculates the expectation of the sum of two dice, based off the probability mass
function:
def expectation_sum_two_dice():
exp_sum_two_dice = 0
# sum of dice can take on the values 2 through 12
for x in range(2, 12 + 1):
pr_x = pmf_sum_two_dice(x) # pmf gives Pr sum is x
exp_sum_two_dice += x * pr_x
return exp_sum_two_dice
def pmf_sum_two_dice(x):
# Return the probability that two dice sum to x
count = 0
# Loop through all possible combinations of two dice
for dice1 in range(1, 6 + 1):
for dice2 in range(1, 6 + 1):
if dice1 + dice2 == x:
count += 1
return count / 36 # There are 36 possible outcomes (6x6)
If we worked it out manually we would get that if X is the sum of two dice, E[X] = 7:

E[X] = ∑_x x · P(X = x) = 2 · (1/36) + 3 · (2/36) + ⋯ + 12 · (1/36) = 7
7 is the "average" number you expect to get if you summed two dice a near-infinite number of times. In this case it also happens to be the same as the mode, the most likely value of the sum of two dice, but this is not always the case!
Properties of Expectation
Property: Linearity of Expectation
E[aX + b] = a E[X] + b
Property: Expectation of Sum

E[X + Y] = E[X] + E[Y]

This is true regardless of the relationship between X and Y. They can be dependent, and they can have different distributions. This also applies with more than two random variables:

E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi]
One can also calculate the expected value of a function g(X) of a random variable X when one knows the probability distribution of X but one does not explicitly know the distribution of g(X). This theorem has the humorous name of "the Law of the Unconscious Statistician" (LOTUS), because it is so useful that you should be able to employ it unconsciously:

E[g(X)] = ∑_x g(x) · P(X = x)
E[a] = a
Sometimes in proofs, you will end up with the expectation of a constant (rather than a random variable). For example, what does E[5] mean? Since 5 is not a random variable, it does not change, and will always be 5, so E[5] = 5.
This very useful result is somewhat surprising. I believe that the best way to understand this result is via a proof. This proof, however, will use some concepts from the next section in the course reader, Probabilistic Models. Notice how the proof never needs to assume that X and Y are independent, or have the same distribution. In this proof X and Y are discrete, but if you made them continuous you would just replace the sum with an integral:

E[X + Y] = ∑_{x,y} (x + y) · P(X = x, Y = y)
         = ∑_{x,y} x · P(X = x, Y = y) + ∑_{x,y} y · P(X = x, Y = y)
         = ∑_x x ∑_y P(X = x, Y = y) + ∑_y y ∑_x P(X = x, Y = y)
         = ∑_x x · P(X = x) + ∑_y y · P(Y = y)
         = E[X] + E[Y]
The Law of the Unconscious Statistician is both useful and hard to understand just by reading the equation. It allows us to calculate the expectation of any function applied to a random variable! One function that will turn out to be very useful when computing Variance is E[X^2]. According to the Law of the Unconscious Statistician:

E[X^2] = ∑_x x^2 · P(X = x)

In this case g is the squaring function. To calculate E[X^2] we calculate expectation in a way similar to E[X] with the exception that we square the value of x before multiplying by the probability mass function. Here is code that calculates E[X^2] for the sum of two dice:
2
def expectation_sum_two_dice_squared():
exp_sum_two_dice_squared = 0
# sum of dice can take on the values 2 through 12
for x in range(2, 12 + 1):
pr_x = pmf_sum_two_dice(x) # pmf gives Pr(x)
exp_sum_two_dice_squared += x**2 * pr_x
return exp_sum_two_dice_squared
Variance

Definition: Variance of a Random Variable

The variance is a measure of the "spread" of a random variable around the mean. Variance for a random variable, X, with expected value E[X] = μ is:

Var(X) = E[(X − μ)^2]

Semantically, this is the average distance of a sample from the distribution to the mean. When computing the variance often we use a different (equivalent) form of the variance equation:

Var(X) = E[X^2] − (E[X])^2
In the last section we showed that Expectation was a useful summary of a random variable (it calculates
the "weighted average" of the random variable). One of the next most important properties of random
variables to understand is variance: the measure of spread.
To start, let's consider probability mass functions for three sets of graders. When each of them grades an assignment, meant to receive a 70/100, they each have a probability distribution of grades that they could give.
Distributions of three types of peer graders. Data is from a massive online course.
The distribution for graders in group C has a different expectation. The average grade that they give
when grading an assignment worth 70 is a 55/100. That is clearly not great! But what is the difference
between graders A and B? Both of them have the same expected value (which is equal to the correct
grade). The graders in group A have a higher "spread". When grading an assignment worth 70, they have
a reasonable chance of giving it a 100, or of giving it a 40. Graders in group B have much less spread.
Most of the probability mass is close to 70. You want graders like those in group B: in expectation they
give the correct grade, and they have low spread. As an aside: scores in group B came from a
probabilistic algorithm over peer grades.
Theorists wanted a number to describe spread. They invented variance to be the average of the distance between values that the random variable could take on and the mean of the random variable. There are many reasonable choices for the distance function; probability theorists chose squared deviation from the mean:

Var(X) = E[(X − μ)^2]
It is much easier to compute variance using E[X^2] − (E[X])^2. You certainly don't need to know why it's an equivalent expression, but in case you were wondering, here is the proof.
Var(X) = E[(X − μ)^2]                                       Note: μ = E[X]
       = ∑_x (x − μ)^2 P(x)                                 Definition of Expectation
       = ∑_x (x^2 − 2μx + μ^2) P(x)                         Expanding the square
       = ∑_x x^2 P(x) − 2μ ∑_x x P(x) + μ^2 ∑_x P(x)        Propagating the sum
       = ∑_x x^2 P(x) − 2μ E[X] + μ^2 ∑_x P(x)              Substitute def of expectation
       = E[X^2] − 2μ E[X] + μ^2 ∑_x P(x)                    LOTUS, g(x) = x^2
       = E[X^2] − 2μ E[X] + μ^2                             Since ∑_x P(x) = 1
       = E[X^2] − 2 E[X]^2 + E[X]^2                         Since μ = E[X]
       = E[X^2] − (E[X])^2                                  Cancellation
Standard Deviation
Variance is especially useful for comparing the "spread" of two distributions and it has the useful
property that it is easy to calculate. In general a larger variance means that there is more deviation around
the mean — more spread. However, if you look at the leading example, the units of variance are the square of points. This makes it hard to interpret the numerical value. What does it mean that the spread is 52 points^2? A more interpretable measure of spread is the square root of variance, which we call the Standard Deviation: Std(X) = √Var(X). The standard deviation of our grader is 7.2 points. In this example folks find it easier to think of spread in points rather than points^2. As an aside, the standard deviation is the average distance of a sample (from the distribution) to the mean, using the Euclidean distance function.
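Reusing the expectation functions defined earlier in this chapter, a minimal sketch of the variance and standard deviation of the sum of two dice looks like this:

import math

def variance_sum_two_dice():
    # Var(X) = E[X^2] - E[X]^2
    mu = expectation_sum_two_dice()
    return expectation_sum_two_dice_squared() - mu**2

print(variance_sum_two_dice())             # about 5.83
print(math.sqrt(variance_sum_two_dice()))  # Std(X), about 2.42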
Bernoulli Distribution
Here is a full description of the key properties of a Bernoulli random variable. If X is declared to be a
Bernoulli random variable with parameter p, denoted X ∼ Bern(p):
Notation: X ∼ Bern(p)
PMF (smooth): P(X = x) = p^x (1 − p)^{1−x}
Expectation: E[X] = p
PMF graph:
Parameter p: 0.80
Because Bernoulli distributed random variables are parametric, as soon as you declare a random variable to be of type Bernoulli you automatically can know all of these pre-derived properties! Some of these properties are straightforward to prove for a Bernoulli. For example, you could have solved for the expectation, E[X] = 0 · (1 − p) + 1 · p = p, and for the variance:

E[X^2] = ∑_x x^2 · P(X = x)          LOTUS
       = 0^2 · (1 − p) + 1^2 · p
       = p

Var(X) = E[X^2] − (E[X])^2           Def of variance
       = p − p^2                     Substitute E[X^2] = p, E[X] = p
       = p(1 − p)
An indicator variable is a Bernoulli random variable which takes on the value 1 if an underlying event
occurs, and 0 otherwise.
Indicator random variables are a convenient way to convert the "true/false" outcome of an event into a number. That number may be easier to incorporate into an equation. See the binomial expectation derivation for an example.
A random variable I is an indicator variable for an event A if I = 1 when A occurs and I = 0 if A does
not occur. Indicator random variables are Bernoulli random variables, with p = P(A). I is a common
choice of name for an indicator random variable.
Here are some properties of indicator random variables: P(I = 1) = P(A) and E[I ] = P(A).
Binomial Distribution
In this section, we will discuss the binomial distribution. To start, imagine the following example.
Consider n independent trials of an experiment where each trial is a "success" with probability p. Let X
be the number of successes in n trials. This situation is truly common in the natural world, and as such,
there has been a lot of research into such phenomena. Random variables like X are called binomial
random variables. If you can identify that a process fits this description, you can inherit many already
proved properties such as the PMF formula, expectation, and variance!
Notation: X ∼ Bin(n, p)
Support: x ∈ {0, 1, … , n}
Expectation: E[X] = n ⋅ p
Variance: Var(X) = n ⋅ p ⋅ (1 − p)
PMF graph:
Parameter n: 20 Parameter p: 0.60
One way to think of the binomial is as the sum of n Bernoulli variables. Say that Yi ∼ Bern(p) is an indicator Bernoulli random variable which is 1 if experiment i is a success. Then if X is the total number of successes in n experiments, X ∼ Bin(n, p):

X = ∑_{i=1}^{n} Yi

Recall that the outcome of Yi will be 1 or 0, so one way to think of X is as the sum of those 1s and 0s.
Binomial PMF

The most important property to know about a binomial is its PMF function:

P(X = k) = (n choose k) · p^k · (1 − p)^{n−k}

Recall, we derived this formula in Part 1. There is a complete example on the probability of k heads in n coin flips, where each flip is heads with probability 0.5: Many Coin Flips. To briefly review, if you think of each experiment as being distinct, then there are (n choose k) ways of permuting k successes from n experiments. For any of the mutually exclusive permutations, the probability of that permutation is p^k · (1 − p)^{n−k}. The term (n choose k) is formally called the binomial coefficient.
Expectation of Binomial

There is an easy way to calculate the expectation of a binomial and a hard way. The easy way is to leverage the fact that a binomial is the sum of Bernoulli indicator random variables, X = ∑_{i=1}^{n} Yi, where Yi is an indicator of whether the ith experiment was a success: Yi ∼ Bern(p). Since the expectation of the sum of random variables is the sum of expectations, we can add the expectation, E[Yi] = p, of each of the Bernoulli's:

E[X] = E[∑_{i=1}^{n} Yi]         Since X = ∑_{i=1}^{n} Yi
     = ∑_{i=1}^{n} E[Yi]         Expectation of sum
     = ∑_{i=1}^{n} p             Expectation of Bernoulli
     = n · p                     Sum n times

The hard way is to use the definition of expectation directly with the binomial PMF:

E[X] = ∑_{i=0}^{n} i · P(X = i)
     = ∑_{i=0}^{n} i · (n choose i) p^i (1 − p)^{n−i}    Sub in PMF
     = n · p

In Python, the scipy.stats package provides a binomial object, scipy.stats.binom, with many pre-implemented methods.
One of the most helpful methods that this package provides is a way to calculate the PMF. For example,
say X ∼ Bin(n = 5, p = 0.6) and you want to find P(X = 2) you could use the following code:
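Here is a minimal sketch using scipy.stats.binom (the scipy package referenced at the end of this section):

from scipy import stats

n, p = 5, 0.6
print(f'P(X = 2) = {stats.binom.pmf(2, n, p):.4f}')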
Console:
P(X = 2) = 0.2304
Another particularly helpful function is the ability to generate a random sample from a binomial. For
example, say X ∼ Bin(n = 10, p = 0.3) represents the number of requests to a website. We can draw
100 samples from this distribution using the following code:
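Again a minimal sketch with scipy.stats.binom, this time drawing random samples (the console output below came from one such run):

from scipy import stats

samples = stats.binom.rvs(n=10, p=0.3, size=100)
print(samples)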
Console:
[4 5 3 1 4 5 3 1 4 6 5 6 1 2 1 1 2 3 2 5 2 2 2 4 4 2 2 3 6 3 1 1 4 2 6 2 4
2 3 3 4 2 4 2 4 5 0 1 4 3 4 3 3 1 3 1 1 2 2 2 2 3 5 3 3 3 2 1 3 2 1 2 3 3
4 5 1 3 7 1 4 1 3 3 4 4 1 2 4 4 0 2 4 3 2 3 3 1 1 4]
You might be wondering what a random sample is! A random sample is a randomly chosen assignment
for our random variable. Above we have 100 such assignments. The probability that value x is chosen is
given by the PMF: P(X = x). You will notice that even though 8 is a possible assignment to the
binomial above, in 100 samples we never saw the value 8. Why? Because P (X = 8) ≈ 0.0014. You
would need to draw 1,000 samples before you would expect to see an 8.
There are also functions for getting the mean, the variance, and more. You can read the scipy.stats.binom
documentation, especially the list of methods.
Poisson Distribution
A Poisson random variable gives the probability of a given number of events in a fixed interval of time
(or space). It makes the Poisson assumption that events occur with a known constant mean rate and
independently of the time since the last event.
Notation: X ∼ Poi(λ)
Description: Number of events in a fixed time frame if (a) the events occur with a constant mean
rate and (b) they occur independently of time since last event.
Parameters: λ ∈ R⁺, the constant average rate.
Support: x ∈ {0, 1, …}
PMF equation:

P(X = x) = λ^x e^{−λ} / x!
Expectation: E[X] = λ
Variance: Var(X) = λ
PMF graph:
Parameter λ: 5
Poisson Intuition
In this section we show the intuition behind the Poisson derivation. It is both a great way to deeply
understand the Poisson, as well as good practice with Binomial distributions.
Let's work on the problem of predicting the chance of a given number of events occurring in a fixed time
interval — the next minute. For example, imagine you are working on a ride sharing application and you
care about the probability of how many requests you get from a particular area. From historical data, you
know that the average requests per minute is λ = 5. What is the probability of getting 1, 2, 3, etc requests
in a minute?
We could approximate a solution to this problem by using a binomial distribution! Let's say we split our minute into 60 seconds, and make each second an indicator Bernoulli variable — you either get a request or you don't. If you get a request in a second, the indicator is 1. Otherwise it is 0. Here is a visualization of our 60 binary-indicators. In this example imagine we have requests at 2.75 and 7.12 seconds. The corresponding indicator variables are blue filled-in boxes:
[Visualization: 1 minute divided into 60 one-second indicator buckets, with filled boxes at 2.75s and 7.12s]
The total number of requests received over the minute can be approximated as the sum of the sixty indicator variables, which conveniently matches the description of a binomial — a sum of Bernoullis. Specifically define X to be the number of requests in a minute. X is a binomial with n = 60 trials. What is the probability, p, of a success on a single trial? To make the expectation of X equal the observed historical average λ = 5 we should choose p so that λ = E[X].

λ = n · p      Expectation of a binomial is n · p
p = λ / n      Solving for p

In this case since λ = 5 and n = 60, we should choose p = 5/60 and state that X ∼ Bin(n = 60, p = 5/60). Now that we have a form for X we can answer probability questions about the number of requests by using the binomial PMF:

P(X = x) = (n choose x) p^x (1 − p)^{n−x}
So for example:

P(X = 1) = (60 choose 1) (5/60)^1 (55/60)^{59} ≈ 0.0295
P(X = 2) = (60 choose 2) (5/60)^2 (55/60)^{58} ≈ 0.0790
P(X = 3) = (60 choose 3) (5/60)^3 (55/60)^{57} ≈ 0.1389
Great! But don't forget that this was an approximation. We didn't account for the fact that there can be more than one event in a single second. One way to assuage this issue is to divide our minute into more fine-grained intervals (the choice to split it into 60 seconds was rather arbitrary). Instead let's divide our minute into 600 deciseconds, again with requests at 2.75 and 7.12 seconds:
[Visualization: 1 minute divided into 600 decisecond indicator buckets]
Now n = 600, p = 5/600 and X ∼ Bin(n = 600, p = 5/600). We can repeat our example calculations using this better approximation:

P(X = 1) = (600 choose 1) (5/600)^1 (595/600)^{599} ≈ 0.0333
P(X = 2) = (600 choose 2) (5/600)^2 (595/600)^{598} ≈ 0.0837
P(X = 3) = (600 choose 3) (5/600)^3 (595/600)^{597} ≈ 0.1402
Choose any value of n, the number of buckets to divide our minute into.
The larger n is, the more accurate the approximation. So what happens when n is infinity? It becomes a
Poisson!
What does the PMF of X look like now that we have infinite divisions of our minute? We can write the equation and think about it as n goes to infinity. Recall that p still equals λ/n:

P(X = x) = lim_{n→∞} (n choose x) (λ/n)^x (1 − λ/n)^{n−x}
While it may look intimidating, this expression simplifies nicely. This proof uses a few special limit rules that we haven't introduced in this book:

P(X = x) = lim_{n→∞} (n choose x) (λ/n)^x (1 − λ/n)^{n−x}                           Start: binomial in the limit
         = lim_{n→∞} (n choose x) · (λ^x / n^x) · (1 − λ/n)^n / (1 − λ/n)^x         Expanding the power terms
         = lim_{n→∞} [n! / ((n − x)! x!)] · (λ^x / n^x) · (1 − λ/n)^n / (1 − λ/n)^x Expanding the binomial term
         = lim_{n→∞} [n! / ((n − x)! x!)] · (λ^x / n^x) · e^{−λ} / (1 − λ/n)^x      Rule: lim_{n→∞} (1 − λ/n)^n = e^{−λ}
         = lim_{n→∞} [n! / ((n − x)! x!)] · (λ^x / n^x) · (e^{−λ} / 1)              Rule: lim_{n→∞} λ/n = 0
         = lim_{n→∞} [n! / (n − x)!] · (1 / x!) · (λ^x / n^x) · (e^{−λ} / 1)        Splitting first term
         = lim_{n→∞} (n^x / 1) · (1 / x!) · (λ^x / n^x) · (e^{−λ} / 1)              lim_{n→∞} n! / (n − x)! = n^x
         = lim_{n→∞} (λ^x / x!) · (e^{−λ} / 1)                                      Cancel n^x
         = λ^x · e^{−λ} / x!                                                        Simplify
That is a beautiful expression! Now we can calculate the real probability of the number of requests in a minute, if the historical average is λ = 5:

P(X = 1) = 5^1 · e^{−5} / 1! = 0.03369
P(X = 2) = 5^2 · e^{−5} / 2! = 0.08422
P(X = 3) = 5^3 · e^{−5} / 3! = 0.14037
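A minimal sketch of the Poisson PMF in code, reproducing the three values above (the function name is our own):

import math

def poisson_pmf(x, lam):
    # P(X = x) = lambda^x * e^(-lambda) / x!
    return lam**x * math.exp(-lam) / math.factorial(x)

for x in [1, 2, 3]:
    print(x, round(poisson_pmf(x, 5), 5))  # 0.03369, 0.08422, 0.14037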
Notation: X ∼ Geo(p)
PMF equation: P(X = x) = (1 − p)^{x−1} p
Expectation: E[X] = 1/p
Variance: Var(X) = (1 − p)/p^2
PMF graph:
Parameter p: 0.20
Notation: X ∼ NegBin(r, p)
PMF equation: P(X = x) = (x − 1 choose r − 1) p^r (1 − p)^{x−r}
Expectation: E[X] = r/p
Variance: Var(X) = r(1 − p)/p^2
PMF graph:
Parameter r: 3    Parameter p: 0.20
Categorical Distributions
The Categorical Distribution is a fancy name for random variables which take on values other than numbers. As an example, imagine a random variable for the weather today. A natural representation for
the weather is one of a few categories: {sunny, cloudy, rainy, snowy}. Unlike in past examples, these
values are not integers or real valued numbers! Are we allowed to continue? Sure! We can represent this
random variable as X where X is a categorical random variable.
There isn't much that you need to know about Categorical distributions. They work the way you might
expect. To provide the Probability Mass Function (PMF) for a categorical random variable, you just need
to provide the probability of each category. For example, if X is the weather today, then the PMF should associate all the values that X could take on with the probability that X takes on those values: P(X = sunny), P(X = cloudy), P(X = rainy), and P(X = snowy).

Notice that the probabilities must sum to 1.0. This is because (in this version) the weather must be one of the four categories. Since the values are not numeric, this random variable will not have an expectation or variance, and its PMF is expressed as a table rather than as a function.
Note to your future self: A categorical distribution is a simplified version of a multinomial distribution
(where the number of outcomes is 1).
Continuous Distribution
So far, all random variables we have seen have been discrete. In all the cases we have seen in CS109 this
meant that our RVs could only take on integer values. Now it's time for continuous random variables
which can take on values in the real number domain (R). Continuous random variables can be used to represent measurements with arbitrary precision (e.g. height, weight, time).
We immediately face a problem. For discrete distributions we would describe the probability that a
random variable takes on exact values. This doesn't make sense for continuous values, like the time the
bus arrives. As an example, what is the probability that the bus arrives at exactly 2:17pm and
12.12333911102389234 seconds? Similarly, if I were to ask you: what is the probability of a child being
born with weight exactly equal to 3.523112342234 kilos, you might recognize that question as ridiculous.
No child will have precisely that weight. Real values can have infinite precision and as such it is a bit
mind boggling to think about the probability that a random variable takes on a specific value.
Instead, let's start by discretizing time, our continuous variable, by breaking it into 5 minute chunks. We
can now think about something like the probability that the bus arrives between 2:00pm and 2:05pm as an event with some probability (see figure below on the left). Five minute chunks seem a bit coarse. You
could imagine that instead, we could have discretized time into 2.5 minute chunks (figure in the center).
In this case the probability that the bus shows up between 15 mins and 20 mins after 2pm is the sum of
two chunks, shown in orange. Why stop there? In the limit we could keep breaking time down into
smaller and smaller pieces. Eventually we will be left with a derivative of probability at each moment of
time, where the probability that P (15 < T < 20) is the integral of that derivative between 15 and 20
(figure on the right).
That derivative is called a Probability Density Function (PDF), written:

f(X = x) or f(x)

where the notation on the right hand side is shorthand, and the lowercase x implies that we are talking about the relative likelihood of a continuous random variable which is the upper case X. Like in the bus example, the PDF is the derivative of probability at all points of the random variable. This means that the
PDF has the important property that you can integrate over it to find the probability that the random variable takes on values within a range (a, b).

X is a Continuous Random Variable if there is a Probability Density Function (PDF) f(x) that takes in real valued numbers x such that:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

The following properties must also hold. These preserve the axiom that P(a ≤ X ≤ b) is a probability:

0 ≤ P(a ≤ X ≤ b) ≤ 1

P(X = a) = ∫_a^a f(x) dx = 0
That is pretty different than in the discrete world where we often talked about the probability of a random
variable taking on a particular value.
For a continuous random variable X the Cumulative Distribution Function, written F(x), is:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(x) dx
Why is the CDF the probability that a random variable takes on a value less than the input value, as opposed to greater than? It is a matter of convention. But it is a useful convention. Most probability questions can be solved simply by knowing the CDF (and taking advantage of the fact that the integral over the range −∞ to ∞ is 1). Here is an example of how you can answer probability questions by just using a CDF: P(a < X < b) = F(b) − F(a), since F(a) + P(a < X < b) = F(b).
The cumulative distribution function also exists for discrete random variables, but there is less utility to a CDF in the discrete world as none of our discrete random variables had "closed form" (eg without any summation) functions for the CDF:

F_X(a) = ∑_{i=1}^{a} P(X = i)
Solving for Constants

Let X be a continuous random variable with PDF:

f(x) = { C(4x − 2x^2)    when 0 < x < 2
       { 0                otherwise

In this function, C is a constant. What value is C? Since we know that the PDF must integrate to 1:

∫_{−∞}^{∞} f(x) dx = 1
∫_0^2 C(4x − 2x^2) dx = 1
C (2x^2 − 2x^3/3) |_0^2 = 1
C ((8 − 16/3) − 0) = 1
C = 3/8

Now we can solve for probabilities, for example P(X > 1):

P(X > 1) = ∫_1^∞ f(x) dx
         = ∫_1^2 (3/8)(4x − 2x^2) dx
         = (3/8) (2x^2 − 2x^3/3) |_1^2
         = (3/8) [(8 − 16/3) − (2 − 2/3)]
         = 1/2

Expectation and variance carry over to continuous random variables, with sums replaced by integrals:

E[X] = ∫_{−∞}^{∞} x f(x) dx
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
E[X^n] = ∫_{−∞}^{∞} x^n f(x) dx

E[aX + b] = a E[X] + b
Var(X) = E[X^2] − (E[X])^2
Var(aX + b) = a^2 Var(X)
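A quick numeric check of both results, using scipy's general-purpose integrator:

from scipy import integrate

C = 3 / 8
f = lambda x: C * (4 * x - 2 * x**2)

total, _ = integrate.quad(f, 0, 2)   # integral of the PDF, should be 1.0
p_gt_1, _ = integrate.quad(f, 1, 2)  # P(X > 1), should be 0.5
print(total, p_gt_1)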
Uniform Distribution

The most basic of all the continuous random variables is the uniform random variable, which is equally likely to take on any value in its range (α, β). X is a uniform random variable (X ∼ Uni(α, β)) if it has PDF:

f(x) = { 1/(β − α)    when α ≤ x ≤ β
       { 0             otherwise

Notice how the density 1/(β − α) is exactly the same regardless of the value for x. That makes the density uniform. So why is the PDF 1/(β − α) and not 1? That is the constant that makes it such that the integral over all possible inputs evaluates to 1.

Notation: X ∼ Uni(α, β)
Description: A continuous random variable that takes on values, with equal likelihood, between α and β
Parameters: α, the minimum value of the variable; β, the maximum value of the variable
Support: x ∈ [α, β]
PDF equation: f(x) = { 1/(β − α)    for x ∈ [α, β]
                     { 0             else
CDF equation: F(x) = { (x − α)/(β − α)    for x ∈ [α, β]
                     { 0                   for x < α
                     { 1                   for x > β
Expectation: E[X] = (α + β)/2
Variance: Var(X) = (β − α)^2 / 12
PDF graph:
Parameter α: 0    Parameter β: 1
Example: You are running to the bus stop. You don't know exactly when the bus arrives. You believe all times between 2pm and 2:30pm are equally likely. You show up at 2:15pm. What is P(wait < 5 minutes)?
Let T be the time, in minutes after 2pm that the bus arrives. Because we think that all times are equally
likely in this range, T ∼ Uni(α = 0, β = 30). The probability that you wait under 5 minutes is equal to the probability that the bus shows up between 2:15 and 2:20. In other words, P(15 < T < 20):
P(Wait under 5 mins) = P(15 < T < 20)
to b, assuming that α ≤ a ≤ b ≤ β:
P(a ≤ X ≤ b) = ∫
= ∫
=
= ∫
= ∫
b − a
β − α
b
b
15
15
30
30
20
30
∣
20
20
∂x
f (x) dx
β − α
1
f T (x)∂x
β − α
20
15
15
30
dx
=
∂x
30
We can come up with a closed form for the probability that a uniform random variable X is in the range a
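For what it's worth, scipy can reproduce the bus example; note that scipy's uniform is parameterized by loc = α and scale = β − α, not by the endpoints directly:

from scipy import stats

T = stats.uniform(loc=0, scale=30)  # T ~ Uni(0, 30)
print(T.cdf(20) - T.cdf(15))        # P(15 < T < 20) = 1/6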
Exponential Distribution

An exponential random variable measures the amount of time until the next event occurs. It assumes that the events occur via a Poisson process. Note that this is different from the Poisson random variable, which measures the number of events in a fixed amount of time.

Notation: X ∼ Exp(λ)
Description: Time until the next event if (a) the events occur with a constant mean rate and (b) they occur independently of the time since the last event.
Parameters: λ > 0, the constant average rate.
Support: x ∈ R⁺
PDF equation: f(x) = λe^(−λx)
CDF equation: F(x) = 1 − e^(−λx)
Expectation: E[X] = 1/λ
Variance: Var(X) = 1/λ²
PDF graph: [interactive plot; parameter λ = 5]

Example: Based on historical data from the USGS, earthquakes of magnitude 8.0+ happen in a certain location at a rate of 0.002 per year. Earthquakes are known to occur via a Poisson process. What is the probability of a major earthquake in the next 4 years?

Let Y be the years until the next major earthquake. Because Y measures time until the next event it fits the description of an exponential random variable: Y ∼ Exp(λ = 0.002). The question is asking: what is P(Y < 4)?

P(Y < 4) = F_Y(4)
  = 1 − e^(−λ·y)        (The CDF of an Exp)
  = 1 − e^(−0.002·4)
  ≈ 0.008

Note that it is possible to answer this question using the PDF, but it would require solving an integral.
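Here is the earthquake calculation in scipy. One caution: scipy's exponential is parameterized by scale = 1/λ rather than by the rate λ itself:

from scipy import stats

lam = 0.002
print(stats.expon.cdf(4, scale=1 / lam))  # P(Y < 4) ≈ 0.008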
Exponential is Memoryless

One way to gain intuition for what is meant by the "Poisson process" is through the proof that the exponential distribution is "memoryless". That means that the occurrence (or lack of occurrence) of events in the past does not change our belief as to how long until the next occurrence. This can be stated formally. If X ∼ Exp(λ), then for an interval of time s that has already passed and a subsequent query interval of time t:

P(X > s + t | X > s)
  = P(X > s + t) / P(X > s)              (Because X > s + t implies X > s)
  = (1 − F_X(s + t)) / (1 − F_X(s))      (Def of CDF)
  = e^(−λ(s+t)) / e^(−λs)                (By CDF of Exp)
  = e^(−λt)                              (Simplify)

Since e^(−λt) = P(X > t), the time s already waited without an event has no effect on the distribution of the remaining wait.
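A quick simulation (a sketch, with arbitrary values of λ and s) makes the memoryless property tangible: among samples that exceed s, the leftover wait X − s behaves like a fresh Exp(λ):

import numpy as np

rng = np.random.default_rng(0)
lam, s = 0.5, 2.0  # arbitrary rate and elapsed time, just for illustration
x = rng.exponential(scale=1 / lam, size=1_000_000)  # numpy also uses scale = 1/lambda

leftover = x[x > s] - s            # remaining waits of samples that "survived" past s
print(x.mean(), leftover.mean())   # both ≈ 1/lambda = 2.0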
Normal Distribution

The single most important random variable type is the Normal (aka Gaussian) random variable, parametrized by a mean (μ) and variance (σ²). If X is a normal variable we write X ∼ N(μ, σ²). The normal is important for many reasons: it is generated from the summation of independent random variables and as a result it occurs often in nature. Many things in the world are not distributed normally but data scientists and computer scientists model them as Normal distributions anyways. Why? Because it is the most entropic (conservative) modelling decision that we can make for a random variable while still matching a particular expectation (average value) and variance (spread).

The PDF of a Normal is:

f_X(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²))

Notice the x in the exponent of the PDF function. When x is equal to the mean (μ), then e is raised to the power of 0 and the PDF is maximized.

There is no closed form for the integral of the Normal PDF, and as such there is no closed form CDF. However we can use a transformation of any normal to a normal with a precomputed CDF. The result of this mathematical gymnastics is that the CDF for a Normal X ∼ N(μ, σ²) is:

F_X(x) = Φ((x − μ)/σ)

Where Φ is a precomputed function that represents the CDF of the Standard Normal.
Notation: X ∼ N(μ, σ²)
Support: x ∈ R
PDF equation: f(x) = (1/(σ√(2π))) · e^(−(1/2)((x − μ)/σ)²)
Expectation: E[X] = μ
Variance: Var(X) = σ²
PDF graph: [interactive plot; parameters μ = 5, σ = 5]
Linear Transform

If X is a Normal such that X ∼ N(μ, σ²) and Y is a linear transform of X such that Y = aX + b, then Y is also a Normal, where:

Y ∼ N(aμ + b, a²σ²)

An especially useful transform: if we subtract the mean (μ) from a normal and divide by the standard deviation (σ), the result is always the standard normal. We can prove this mathematically. Let W = (X − μ)/σ:

W = (X − μ)/σ                       (Transform X: subtract μ and divide by σ)
  = (1/σ)X − μ/σ                    (Use algebra to rewrite the equation)
  = aX + b                          (Linear transform where a = 1/σ, b = −μ/σ)
  ∼ N(aμ + b, a²σ²)                 (The linear transform of a Normal is another Normal)
  ∼ N(μ/σ − μ/σ, σ²/σ²) = N(0, 1)   (Substituting values in for a and b)

Using this transform we can express F_X(x), the CDF of X, in terms of the known CDF of Z, F_Z(x). Since the CDF of Z is so common it gets its own Greek symbol: Φ(x).
F_X(x) = P(X ≤ x)
  = P((X − μ)/σ ≤ (x − μ)/σ)
  = P(Z ≤ (x − μ)/σ)
  = Φ((x − μ)/σ)
The values of Φ(x) can be looked up in a table. Every modern programming language also has the ability to calculate the CDF of a normal random variable!

Example: Let X ∼ N(μ = 3, σ² = 16). What is P(X > 0)?

P(X > 0) = P((X − 3)/4 > (0 − 3)/4) = P(Z > −3/4) = 1 − P(Z ≤ −3/4)
  = 1 − Φ(−3/4) = 1 − (1 − Φ(3/4)) = Φ(3/4) = 0.7734

What is P(2 < X < 5)?

P(2 < X < 5) = P((2 − 3)/4 < (X − 3)/4 < (5 − 3)/4) = P(−1/4 < Z < 2/4)
  = Φ(2/4) − Φ(−1/4) = Φ(1/2) − (1 − Φ(1/4)) = 0.2902
Example: You send a voltage of 2 or −2 on a wire to denote 1 or 0. Let X = voltage sent and let R = voltage received. R = X + Y, where Y ∼ N(0, 1) is noise. When decoding, if R ≥ 0.5 we interpret the voltage as a 1, else 0. What is P(error after decoding | original bit = 1)?

P(error | original bit = 1) = P(X + Y < 0.5 | X = 2)
  = P(2 + Y < 0.5)
  = P(Y < −1.5)
  = Φ(−1.5)
  ≈ 0.0668
Example: The 68% rule of a normal within one standard deviation. What is the probability that a normal variable X ∼ N(μ, σ²) has a value within one standard deviation of its mean?

P(μ − σ < X < μ + σ)
  = Φ(((μ + σ) − μ)/σ) − Φ(((μ − σ) − μ)/σ)    (CDF of Normal)
  = Φ(σ/σ) − Φ(−σ/σ)                           (Cancel μs)
  = Φ(1) − Φ(−1) = Φ(1) − (1 − Φ(1)) ≈ 0.6827

We made no assumption about the value of μ or the value of σ, so this applies to every single normal random variable. Since it uses the Normal CDF, this doesn't apply to other types of random variables.
CDF Calculator

To calculate the Cumulative Distribution Function (CDF) for a normal (aka Gaussian) random variable at a value x, also written as F(x), you can transform your distribution to the "standard normal" and look up the corresponding value in the standard normal CDF. However, most programming libraries provide a normal CDF function. In python you can calculate these values using the scipy library.
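For example, here is a sketch reproducing the P(X > 0) calculation from above:

from scipy import stats

# P(X <= x) for X ~ N(mu, sigma^2) is stats.norm.cdf(x, mu, sigma)
print(stats.norm.cdf(0, 0, 1))                # Φ(0) = 0.5 for the standard normal
print(1 - stats.norm.cdf(0, loc=3, scale=4))  # P(X > 0) for X ~ N(3, 16) ≈ 0.7734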
It is important to note that in the python library, the second parameter for the Normal distribution is the standard deviation, not the variance as it is typically defined in math notation. Recall that standard deviation is the square root of variance.
Binomial Approximation

There are times when it is exceptionally hard to numerically calculate probabilities for a binomial distribution, especially when n is large. For example, say X ∼ Bin(n = 10000, p = 0.5) and you want to calculate P(X > 5500). The correct formula is:

P(X > 5500) = Σ_{i=5501}^{10000} P(X = i)
            = Σ_{i=5501}^{10000} (10000 choose i) p^i (1 − p)^(10000−i)

That is a difficult value to calculate. Luckily there is an easier way. For deep reasons which we will cover in our section on "uncertainty theory", it turns out that a binomial distribution can be very well approximated by both Normal distributions and Poisson distributions if n is large enough.

Use the Poisson approximation when n is large (> 20) and p is small (< 0.05). A slight dependence between the results of each experiment is ok.

Use the Normal approximation when n is large (> 20) and p is mid-ranged. Specifically, it is considered an accurate approximation when the variance is greater than 10, in other words: np(1 − p) > 10. There are situations where either a Poisson or a Normal can be used to approximate a Binomial. In that situation, go with the Normal!
Poisson Approximation

When defining the Poisson we proved that a Binomial in the limit as n → ∞ and p = λ/n is a Poisson. That same logic can be used to show that a Poisson is a great approximation for a Binomial when the Binomial has extreme values of n and p. A Poisson random variable approximates a Binomial where n is large, p is small, and λ = np is "moderate". Interestingly, to calculate the things we care about (PMF, expectation, variance) we no longer need to know n and p. We only need to provide λ, which we call the rate. When approximating a Binomial with a Poisson, always choose λ = n · p.

There are different interpretations of "moderate". The accepted ranges are n > 20 and p < 0.05, or n > 100 and p < 0.1.

Let's say you want to send a bit string of length n = 10⁴ where each bit is independently corrupted with p = 10⁻⁶. What is the probability that the message will arrive uncorrupted? You can solve this using a Poisson with λ = np = 10⁴ · 10⁻⁶ = 0.01. Semantically, λ = 0.01 means that we expect 0.01 corrupt bits per string. Let X ∼ Poi(0.01) be the number of corrupted bits. Using the PMF for the Poisson:

P(X = i) = (λ^i · e^(−λ)) / i!
P(X = 0) = (0.01⁰ · e^(−0.01)) / 0!
         ≈ 0.9900498

We could have also modelled X as a binomial such that X ∼ Bin(10⁴, 10⁻⁶). That would have been nearly impossible to calculate by hand, but it would have resulted in the same number (up to the millionth decimal).
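In fact, modern libraries can evaluate both sides of this approximation directly; a quick check (a sketch) for the corrupted-bits example:

from scipy import stats

n, p = 10 ** 4, 10 ** -6
print(stats.poisson.pmf(0, mu=n * p))  # Poisson approximation, ≈ 0.9900498
print(stats.binom.pmf(0, n, p))        # exact binomial, essentially identical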
Normal Approximation

For a Binomial where n is large and p is mid-ranged, a Normal can be used to approximate the Binomial. Let's take a side by side view of a normal and a binomial. Say our binomial is a random variable X ∼ Bin(100, 0.5) and we want to calculate P(X ≥ 55). We could cheat by using the closest fit normal (in this case Y ∼ N(50, 25)). How did we choose that particular Normal? Simply select one with a mean and variance that match the Binomial's expectation and variance. The Binomial expectation is np = 100 · 0.5 = 50. The Binomial variance is np(1 − p) = 100 · 0.5 · 0.5 = 25.

In general, you can use a Normal distribution to approximate a Binomial X ∼ Bin(n, p). To do so, define a normal Y ∼ N(E[X], Var(X)). Using the Binomial formulas for expectation and variance, Y ∼ N(np, np(1 − p)). This approximation holds for large n and moderate p. That gets you very close. However, since a Normal is continuous and a Binomial is discrete, we have to use a continuity correction to discretize the Normal:

P(X = k) ≈ P(k − 1/2 < Y < k + 1/2) = Φ((k − np + 0.5)/√(np(1 − p))) − Φ((k − np − 0.5)/√(np(1 − p)))

You should get comfortable deciding what continuity correction to use. Here are a few examples of discrete probability questions and their continuity corrections:

P(X ≥ 6) becomes P(Y > 5.5)
P(X ≤ 6) becomes P(Y < 6.5)
Example: 100 visitors to your website are given a new design. Let X = # of people who were given the new design and spend more time on your website. Your CEO will endorse the new design if X ≥ 65. What is P(CEO endorses change | it has no effect)?

If the new design has no effect, each person independently spends more time on the site with probability 0.5, so X ∼ Bin(100, 0.5). E[X] = np = 50, Var(X) = np(1 − p) = 25, and σ = √Var(X) = 5. We can thus use the Normal approximation Y ∼ N(μ = 50, σ² = 25):

P(X ≥ 65) ≈ P(Y > 64.5) = P((Y − 50)/5 > (64.5 − 50)/5) = 1 − Φ(2.9) = 0.0019
Example: Stanford accepts 2480 students and each student has a 68% chance of attending. Let X = # of students who will attend. X ∼ Bin(2480, 0.68). What is P(X > 1745)?

E[X] = np = 1686.4, Var(X) = np(1 − p) = 539.7, and σ = √Var(X) = 23.23. We can thus use the Normal approximation Y ∼ N(μ = 1686.4, σ² = 539.7):

P(X > 1745) ≈ P(Y > 1745.5) = P((Y − 1686.4)/23.23 > (1745.5 − 1686.4)/23.23)
  ≈ 1 − Φ(2.54) = 0.0055
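As a sanity check, scipy can compute both the exact binomial answer and the normal approximation with the continuity correction; a sketch:

from scipy import stats

n, p = 2480, 0.68
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5
exact = 1 - stats.binom.cdf(1745, n, p)         # P(X > 1745), exact
approx = 1 - stats.norm.cdf(1745.5, mu, sigma)  # normal approx, continuity corrected
print(exact, approx)                            # both ≈ 0.0055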
Questions
Question 1: Laura is running a server cluster with 50 computers. The probability of a crash on a given
server is 0.5. What is the standard deviation of crashes?
Answer 1:
Let X be the number of crashes. X ∼ Bin(n = 50, p = 0.5)
Std(X) = √ np(1 − p)
= √ 50 ⋅ 0.5 ⋅ (1 − 0.5)
= 3.54
Question 2: You are showing an online-ad to 30 people. The probability of an ad ignore on each ad
shown is 2/3. What is the expected number of ad clicks?
Answer 2:
Let X be the number of ad clicks. X ∼ Bin(n = 30, p = 1/3)
E[X] = np
= 30 ⋅ 1/3
= 10
Question 3: A machine learning algorithm makes binary predictions. The machine learning algorithm
makes 50 guesses where the probability of an incorrect prediction on a given guess is 19/25. What is the
probability that the number of correct predictions is greater than 0?
Answer 3:
Let X be the number of correct predictions. X ∼ Bin(n = 50, p = 6/25)
P(X > 0) = 1 − P(X = 0)
         = 1 − (n choose 0) p^0 (1 − p)^(n−0)
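All of the answers below can be checked numerically; for instance, a sketch for Question 3 using scipy:

from scipy import stats

# Question 3: P(X > 0) for X ~ Bin(50, 6/25)
print(1 - stats.binom.pmf(0, 50, 6 / 25))
print(stats.binom.sf(0, 50, 6 / 25))  # equivalently, the survival function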
Question 4: Wind blows independently across 50 locations. The probability of no wind at a given
location is 0.5. What is the expected number of locations that have wind?
Answer 4:
Let X be the number of locations that have wind. X ∼ Bin(n = 50, p = 0.5)
E[X] = np
= 50 ⋅ 0.5
= 25.0
Question 5: Wind blows independently across 30 locations. What is the standard deviation of locations
that have wind? the probability of wind at each location is 0.6.
Answer 5:
Let X be the number of locations that have wind. X ∼ Bin(n = 30, p = 0.6)
Std(X) = √ np(1 − p)
= √ 30 ⋅ 0.6 ⋅ (1 − 0.6)
= 2.68
Question 6: You are trying to mine bitcoins. There are 50 independent attempts where the probability of mining a bitcoin on a given attempt is 0.6. What is the expectation of bitcoins mined?
Answer 6:
Let X be the number of bitcoins mined. X ∼ Bin(n = 50, p = 0.6)
E[X] = np
= 50 ⋅ 0.6
= 30.0
Question 7: You are testing a new medicine on 40 patients. What is P(X is exactly 38)? The number of
cured patients can be represented by a random variable X. X ~ Bin(40, 3/10).
Answer 7:
Let X be the number of cured patients. X ∼ Bin(n = 40, p = 3/10)
P(X = 38) = (n choose 38) p^38 (1 − p)^(n−38)
          = (40 choose 38) (3/10)^38 (1 − 3/10)^(40−38)
          < 0.00001
Question 8: You are manufacturing chips and are testing for defects. There are 50 independent tests and
0.5 is the probability of a defect on each test. What is the standard deviation of defects?
Answer 8:
Let X be the number of defects. X ∼ Bin(n = 50, p = 0.5)
Std(X) = √ np(1 − p)
= √ 50 ⋅ 0.5 ⋅ (1 − 0.5)
= 3.54
Question 9: Laura is flipping a coin 12 times. The probability of a tail on a given coin-flip is 5/12. What
is the probability that the number of tails is greater than or equal to 2?
Answer 9:
Let X be the number of tails. X ∼ Bin(n = 12, p = 5/12)
P(X ≥ 2) = 1 − Σ_{i=0}^{1} (n choose i) p^i (1 − p)^(n−i)
Question 10: You are asking a survey question where responses are "like" or "dislike". There are 30
responses. You can assume each response is independent where the probability of a dislike on a given
response is 1/6. What is the probability that the number of likes is greater than 28?
Answer 10:
Let X be the number of likes. X ∼ Bin(n = 30, p = 5/6)
P(X > 28) = Σ_{i=29}^{30} (n choose i) p^i (1 − p)^(n−i)
Question 11: A ball hits a series of 50 pins where it can bounce either right or left. The probability of a
left on a given pin hit is 0.4. What is the standard deviation of rights?
Answer 11:
Let X be the number of rights. X ∼ Bin(n = 50, p = 3/5)
Std(X) = √ np(1 − p)
= √ 50 ⋅ 3/5 ⋅ (1 − 3/5)
= 3.46
Question 12: You are sending a stream of 30 bits to space. The probability of a no corruption on a given
bit is 1/3. What is the probability that the number of corruptions is 10?
Answer 12:
Let X be the number of corruptions. X ∼ Bin(n = 30, p = 2/3)
P(X = 10) = (n choose 10) p^10 (1 − p)^(n−10)
          = (30 choose 10) (2/3)^10 (1 − 2/3)^(30−10)
          = 0.00015
Question 13: Wind blows independently across locations. The probability of wind at a given location is
0.9. The number of independent locations is 20. What is the probability that the number of locations that
have wind is not less than 19?
Answer 13:
Let X be the number of locations that have wind. X ∼ Bin(n = 20, p = 0.9)
P(X ≥ 19) = Σ_{i=19}^{20} (n choose i) p^i (1 − p)^(n−i)
Question 14: You are sending a stream of bits to space. There are 30 independent bits where 5/6 is the
probability of a no corruption on each bit. What is the probability that the number of corruptions is 21?
Answer 14:
Let X be the number of corruptions. X ∼ Bin(n = 30, p = 1/6)
P(X = 21) = (n choose 21) p^21 (1 − p)^(n−21)
          = (30 choose 21) (1/6)^21 (1 − 1/6)^(30−21)
          < 0.00001
Question 15: Cody generates random bit strings. There are 20 independent bits. Each bit has a 1/4
probability of resulting in a 1. What is the probability that the number of 1s is 11?
Answer 15:
Let X be the number of 1s. X ∼ Bin(n = 20, p = 1/4)
P(X = 11) = (n choose 11) p^11 (1 − p)^(n−11)
          = (20 choose 11) (1/4)^11 (1 − 1/4)^(20−11)
          = 0.00301
Question 16: In a restaurant some customers ask for a water with their meal. A random sample of 40
customers is selected where the probability of a water requested by a given customer is 9/20. What is the
probability that the number of waters requested is 16?
Answer 16:
Let X be the number of waters requested. X ∼ Bin(n = 40, p = 9/20)
P(X = 16) = (n choose 16) p^16 (1 − p)^(n−16)
          = (40 choose 16) (9/20)^16 (1 − 9/20)^(40−16)
          = 0.10433
Question 17: A student is guessing randomly on an exam with 12 questions. What is the expected
number of correct answers? the probability of a correct answer on a given question is 5/12.
Answer 17:
Let X be the number of correct answers. X ∼ Bin(n = 12, p = 5/12)
E[X] = np
= 12 ⋅ 5/12
= 5
Question 18: Laura is trying to mine bitcoins. The number of bitcoins mined can be represented by a
random variable X. X ~ Bin(n = 100, p = 1/2). What is P(X is equal to 53)?
Answer 18:
Let X be the number of bitcoins mined. X ∼ Bin(n = 100, p = 1/2)
P(X = 53) = (n choose 53) p^53 (1 − p)^(n−53)
          = (100 choose 53) (1/2)^53 (1 − 1/2)^(100−53)
          = 0.06659
Question 19: You are showing an online-ad to customers. The ad is shown to 100 people. The
probability of an ad ignore on a given ad shown is 1/2. What is the standard deviation of ad clicks?
Answer 19:
Let X be the number of ad clicks. X ∼ Bin(n = 100, p = 0.5)
Std(X) = √ np(1 − p)
= 5.00
Question 20: You are running a server cluster with 40 computers. 5/8 is the probability of a computer
continuing to work on each server. What is the expected number of crashes?
Answer 20:
Let X be the number of crashes. X ∼ Bin(n = 40, p = 3/8)
E[X] = np
= 40 ⋅ 3/8
= 15
Question 21: You are hashing 100 strings into a hashtable. The probability of a hash to the first bucket on
a given string hash is 3/20. What is the probability that the number of hashes to the first bucket is greater
than or equal to 97?
Answer 21:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 100, p = 3/20)
P(X ≥ 97) = Σ_{i=97}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 22: You are running in an election with 50 voters. 6/25 is the probability of a vote for you on
each vote. What is the probability that the number of votes for you is less than 2?
Answer 22:
Let X be the number of votes for you. X ∼ Bin(n = 50, p = 6/25)
P(X < 2) = Σ_{i=0}^{1} (n choose i) p^i (1 − p)^(n−i)
Question 23: Irina is sending a stream of 40 bits to space. The probability of a corruption on each bit is
3/4. What is the probability that the number of corruptions is 22?
Answer 23:
Let X be the number of corruptions. X ∼ Bin(n = 40, p = 3/4)
P(X = 22) = (n choose 22) p^22 (1 − p)^(n−22)
          = (40 choose 22) (3/4)^22 (1 − 3/4)^(40−22)
          = 0.00294
Question 24: You are hashing 100 strings into a hashtable. The probability of a hash to the first bucket on
a given string hash is 9/50. What is the probability that the number of hashes to the first bucket is greater
than 97?
Answer 24:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 100, p = 9/50)
P(X > 97) = Σ_{i=98}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 25: You generate random bit strings. There are 100 independent bits. The probability of a 1 at a
given bit is 3/25. What is the probability that the number of 1s is less than 97?
Answer 25:
Let X be the number of 1s. X ∼ Bin(n = 100, p = 3/25)
P(X < 97) = 1 − Σ_{i=97}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 26: You are manufacturing toys and are testing for defects. What is the probability that the
number of defects is greater than 1? the probability of a non-defect on a given test is 16/25 and you test
50 objects.
Answer 26:
Let X be the number of defects. X ∼ Bin(n = 50, p = 9/25)
P(X > 1) = 1 − Σ_{i=0}^{1} (n choose i) p^i (1 − p)^(n−i)
Question 27: Laura is sending a stream of 40 bits to space. The number of corruptions can be represented
by a random variable X. X is a Binomial with n = 40 and p = 3/4. What is P(X = 25)?
Answer 27:
Let X be the number of corruptions. X ∼ Bin(n = 40, p = 3/4)
P(X = 25) = (n choose 25) p^25 (1 − p)^(n−25)
          = (40 choose 25) (3/4)^25 (1 − 3/4)^(40−25)
          = 0.02819
Question 28: 100 trials are run. What is the probability that the number of successes is 78? 1/2 is the
probability of a success on each trial.
Answer 28:
Let X be the number of successes. X ∼ Bin(n = 100, p = 1/2)
P(X = 78) = (n choose 78) p^78 (1 − p)^(n−78)
          = (100 choose 78) (1/2)^78 (1 − 1/2)^(100−78)
          < 0.00001
Question 29: You are flipping a coin. You flip the coin 20 times. The probability of a tail on a given coin-
flip is 1/10. What is the standard deviation of heads?
Answer 29:
Let X be the number of heads. X ∼ Bin(n = 20, p = 0.9)
Std(X) = √ np(1 − p)
= √ 20 ⋅ 0.9 ⋅ (1 − 0.9)
= 1.34
Question 30: Irina is showing an online-ad to 12 people. 5/12 is the probability of an ad click on each ad
shown. What is the probability that the number of ad clicks is less than or equal to 11?
Answer 30:
Let X be the number of ad clicks. X ∼ Bin(n = 12, p = 5/12)
P(X ≤ 11) = 1 − P(X = 12)
          = 1 − (n choose 12) p^12 (1 − p)^(n−12)
Question 31: You are flipping a coin 50 times. 19/25 is the probability of a head on each coin-flip. What
is the standard deviation of tails?
Answer 31:
Let X be the number of tails. X ∼ Bin(n = 50, p = 6/25)
Std(X) = √ np(1 − p)
= √ 50 ⋅ 6/25 ⋅ (1 − 6/25)
= 3.02
Question 32: You are running in an election with 100 voters. The probability of a vote for you on each
vote is 1/4. What is the probability that the number of votes for you is less than or equal to 97?
Answer 32:
Let X be the number of votes for you. X ∼ Bin(n = 100, p = 1/4)
P(X ≤ 97) = 1 − Σ_{i=98}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 33: You are running a server cluster with 40 computers. What is the probability that the number
of crashes is less than or equal to 39? 3/4 is the probability of a computer continuing to work on each
server.
Answer 33:
Let X be the number of crashes. X ∼ Bin(n = 40, p = 1/4)
P(X ≤ 39) = 1 − P(X = 40)
          = 1 − (n choose 40) p^40 (1 − p)^(n−40)
Question 34: Waddie is sending a stream of bits to space. Waddie sends 100 bits. The probability of a
corruption on each bit is 1/2. What is the standard deviation of corruptions?
Answer 34:
Let X be the number of corruptions. X ∼ Bin(n = 100, p = 1/2)
Std(X) = √ np(1 − p)
= 5.00
Question 35: A student is guessing randomly on an exam with 100 questions. Each question has a 0.5
probability of resulting in an incorrect answer. What is the probability that the number of correct answers
is greater than 97?
Answer 35:
Let X be the number of correct answers. X ∼ Bin(n = 100, p = 1/2)
P(X > 97) = Σ_{i=98}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 36: You are testing a new medicine on patients. 0.5 is the probability of a cured patient on each
trial. There are 10 independent trials. What is the expected number of cured patients?
Answer 36:
Let X be the number of cured patients. X ∼ Bin(n = 10, p = 0.5)
E[X] = np
= 10 ⋅ 0.5
= 5.0
Question 37: A ball hits a series of pins where it can either go right or left. The number of independent
pin hits is 100. The probability of a right on each pin hit is 0.5. What is the standard deviation of rights?
Answer 37:
Let X be the number of rights. X ∼ Bin(n = 100, p = 0.5)
Std(X) = √ np(1 − p)
= 5.00
Question 38: You are flipping a coin 40 times. The probability of a head on a given coin-flip is 1/2. What
is the probability that the number of heads is 38?
Answer 38:
Let X be the number of heads. X ∼ Bin(n = 40, p = 1/2)
P(X = 38) = (n choose 38) p^38 (1 − p)^(n−38)
          = (40 choose 38) (1/2)^38 (1 − 1/2)^(40−38)
          < 0.00001
Question 39: 100 trials are run and the probability of a success on a given trial is 1/2. What is the
standard deviation of successes?
Answer 39:
Let X be the number of successes. X ∼ Bin(n = 100, p = 1/2)
Std(X) = √ np(1 − p)
= 5.00
Question 40: You are trying to mine bitcoins. There are 40 independent attempts. The probability of mining a bitcoin on each attempt is 3/10. What is the probability that the number of bitcoins mined is 19?
Answer 40:
Let X be the number of bitcoins mined. X ∼ Bin(n = 40, p = 3/10)
P(X = 19) = (n choose 19) p^19 (1 − p)^(n−19)
          = (40 choose 19) (3/10)^19 (1 − 3/10)^(40−19)
          = 0.00852
Question 41: 20 trials are run. 0.5 is the probability of a failure on each trial. What is the probability that
the number of successes is 6?
Answer 41:
Let X be the number of successes. X ∼ Bin(n = 20, p = 0.5)
P(X = 6) = (n choose 6) p^6 (1 − p)^(n−6)
         = (20 choose 6) 0.5^6 (1 − 0.5)^(20−6)
         = 0.03696
Question 42: You are flipping a coin. What is the probability that the number of tails is 0? there are 30
independent coin-flips where the probability of a head on a given coin-flip is 5/6.
Answer 42:
Let X be the number of tails. X ∼ Bin(n = 30, p = 1/6)
P(X = 0) = (n choose 0) p^0 (1 − p)^(n−0)
         = (30 choose 0) (1/6)^0 (1 − 1/6)^(30−0)
         = 0.00421
Question 43: In a restaurant some customers ask for a water with their meal. A random sample of 20
customers is selected and each customer has a 1/4 probability of resulting in a water not requested. What
is the probability that the number of waters requested is 14?
Answer 43:
Let X be the number of waters requested. X ∼ Bin(n = 20, p = 3/4)
P(X = 14) = (n choose 14) p^14 (1 − p)^(n−14)
          = (20 choose 14) (3/4)^14 (1 − 3/4)^(20−14)
          = 0.16861
Question 44: A student is guessing randomly on an exam. 3/8 is the probability of an incorrect answer on
each question. The number of independent questions is 40. What is the probability that the number of
correct answers is less than or equal to 37?
Answer 44:
Let X be the number of correct answers. X ∼ Bin(n = 40, p = 5/8)
P(X ≤ 37) = 1 − Σ_{i=38}^{40} (n choose i) p^i (1 − p)^(n−i)
Question 45: You are running in an election with 30 voters. 3/5 is the probability of a vote for you on
each vote. What is the standard deviation of votes for you?
Answer 45:
Let X be the number of votes for you. X ∼ Bin(n = 30, p = 3/5)
Std(X) = √ np(1 − p)
= √ 30 ⋅ 3/5 ⋅ (1 − 3/5)
= 2.68
Question 46: Charlotte is flipping a coin 100 times. The probability of a tail on each coin-flip is 0.5.
What is the probability that the number of tails is greater than 2?
Answer 46:
Let X be the number of tails. X ∼ Bin(n = 100, p = 0.5)
P(X > 2) = 1 − Σ_{i=0}^{2} (n choose i) p^i (1 − p)^(n−i)
Question 47: You are trying to mine bitcoins. You try 50 times. 3/5 is the probability of a not mining a
bitcoin on each attempt. What is the probability that the number of bitcoins mined is 14?
Answer 47:
Let X be the number of bitcoins mined. X ∼ Bin(n = 50, p = 2/5)
P(X = 14) = (n choose 14) p^14 (1 − p)^(n−14)
          = (50 choose 14) (2/5)^14 (1 − 2/5)^(50−14)
          = 0.02597
Question 48: You are testing a new medicine on 100 patients. The probability of a cured patient on a
given trial is 3/25. What is the probability that the number of cured patients is not less than 97?
Answer 48:
Let X be the number of cured patients. X ∼ Bin(n = 100, p = 3/25)
P(X ≥ 97) = Σ_{i=97}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 49: Wind blows independently across 40 locations. What is the probability that the number of
locations that have wind is 40? 11/20 is the probability of no wind at each location.
Answer 49:
Let X be the number of locations that have wind. X ∼ Bin(n = 40, p = 9/20)
P(X = 40) = (n choose 40) p^40 (1 − p)^(n−40)
          = (40 choose 40) (9/20)^40 (1 − 9/20)^(40−40)
          < 0.00001
Question 50: You are showing an online-ad to 30 people. 1/6 is the probability of an ad click on each ad
shown. What is the probability that the number of ad clicks is less than or equal to 28?
Answer 50:
Let X be the number of ad clicks. X ∼ Bin(n = 30, p = 1/6)
P(X ≤ 28) = 1 − Σ_{i=29}^{30} (n choose i) p^i (1 − p)^(n−i)
Question 51: You are flipping a coin. You flip the coin 40 times and 7/8 is the probability of a head on
each coin-flip. What is the standard deviation of tails?
Answer 51:
Let X be the number of tails. X ∼ Bin(n = 40, p = 1/8)
Std(X) = √ np(1 − p)
= √ 40 ⋅ 1/8 ⋅ (1 − 1/8)
= 2.09
Question 52: Cody is sending a stream of bits to space. 2/5 is the probability of a no corruption on each
bit and there are 20 independent bits. What is the expectation of corruptions?
Answer 52:
Let X be the number of corruptions. X ∼ Bin(n = 20, p = 3/5)
E[X] = np
= 20 ⋅ 3/5
= 12
Question 53: You are running in an election. There are 12 independent votes and 5/6 is the probability of
a vote for you on each vote. What is the probability that the number of votes for you is greater than or
equal to 9?
Answer 53:
Let X be the number of votes for you. X ∼ Bin(n = 12, p = 5/6)
P(X ≥ 9) = Σ_{i=9}^{12} (n choose i) p^i (1 − p)^(n−i)
Question 54: You are flipping a coin. The number of tails can be represented by a random variable X. X
is a Bin(n = 30, p = 5/6). What is the probability that X = 1?
Answer 54:
Let X be the number of tails. X ∼ Bin(n = 30, p = 5/6)
P(X = 1) = (n choose 1) p^1 (1 − p)^(n−1)
         = (30 choose 1) (5/6)^1 (1 − 5/6)^(30−1)
         < 0.00001
Question 55: In a restaurant some customers ask for a water with their meal. A random sample of 100
customers is selected where 0.3 is the probability of a water requested by each customer. What is the
expected number of waters requested?
Answer 55:
Let X be the number of waters requested. X ∼ Bin(n = 100, p = 0.3)
E[X] = np
= 100 ⋅ 0.3
= 30.0
Question 56: You are hashing strings into a hashtable. 30 strings are hashed. The probability of a hash to
the first bucket on each string hash is 1/6. What is the expected number of hashes to the first bucket?
Answer 56:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 30, p = 1/6)
E[X] = np
= 30 ⋅ 1/6
= 5
Question 57: You are flipping a coin 100 times. What is the probability that the number of tails is greater
than or equal to 98? 19/20 is the probability of a head on each coin-flip.
Answer 57:
Let X be the number of tails. X ∼ Bin(n = 100, p = 1/20)
P(X ≥ 98) = Σ_{i=98}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 58: Irina is running a server cluster. What is the probability that the number of crashes is less
than 99? the server has 100 computers which crash independently and the probability of a computer
continuing to work on a given server is 22/25.
Answer 58:
Let X be the number of crashes. X ∼ Bin(n = 100, p = 3/25)
P(X < 99) = 1 − Σ_{i=99}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 59: You are manufacturing chairs and are testing for defects. You test 100 objects. 1/2 is the
probability of a non-defect on each test. What is the probability that the number of defects is not greater
than 97?
Answer 59:
Let X be the number of defects. X ∼ Bin(n = 100, p = 1/2)
P(X ≤ 97) = 1 − Σ_{i=98}^{100} (n choose i) p^i (1 − p)^(n−i)
Question 60: In a restaurant some customers ask for a water with their meal. There are 50 customers. You
can assume each customer is independent. 0.2 is the probability of a water requested by each customer.
What is the expected number of waters requested?
Answer 60:
Let X be the number of waters requested. X ∼ Bin(n = 50, p = 0.2)
E[X] = np
= 50 ⋅ 0.2
= 10.0
Question 61: You are showing an online-ad to 40 people. 1/4 is the probability of an ad ignore on each ad
shown. What is the probability that the number of ad clicks is 9?
Answer 61:
Let X be the number of ad clicks. X ∼ Bin(n = 40, p = 3/4)
P(X = 9) = (n choose 9) p^9 (1 − p)^(n−9)
         = (40 choose 9) (3/4)^9 (1 − 3/4)^(40−9)
         < 0.00001
Question 62: 100 trials are run. Each trial has a 22/25 probability of resulting in a failure. What is the
standard deviation of successes?
Answer 62:
Let X be the number of successes. X ∼ Bin(n = 100, p = 3/25)
Std(X) = √ np(1 − p)
= 3.25
Question 63: A machine learning algorithm makes binary predictions. There are 12 independent guesses
where the probability of an incorrect prediction on a given guess is 1/6. What is the expected number of
correct predictions?
Answer 63:
Let X be the number of correct predictions. X ∼ Bin(n = 12, p = 5/6)
E[X] = np
= 12 ⋅ 5/6
= 10
Question 64: Waddie is showing an online-ad to customers. 1/2 is the probability of an ad click on each
ad shown. The ad is shown to 100 people. What is the average number of ad clicks?
Answer 64:
Let X be the number of ad clicks. X ∼ Bin(n = 100, p = 1/2)
E[X] = np
= 100 ⋅ 1/2
= 50
Question 65: Charlotte is testing a new medicine on 50 patients. The probability of a cured patient on a
given trial is 1/5. What is the probability that the number of cured patients is 12?
Answer 65:
Let X be the number of cured patients. X ∼ Bin(n = 50, p = 1/5)
P(X = 12) = (n choose 12) p^12 (1 − p)^(n−12)
          = (50 choose 12) (1/5)^12 (1 − 1/5)^(50−12)
          = 0.10328
Question 66: You are running in an election. The number of votes for you can be represented by a
random variable X. X is a Bin(n = 50, p = 0.4). What is P(X is exactly 8)?
Answer 66:
Let X be the number of votes for you. X ∼ Bin(n = 50, p = 0.4)
P(X = 8) = (n choose 8) p^8 (1 − p)^(n−8)
         = (50 choose 8) 0.4^8 (1 − 0.4)^(50−8)
         = 0.00017
Question 67: Irina is flipping a coin 100 times. The probability of a head on a given coin-flip is 1/2.
What is the probability that the number of tails is less than or equal to 99?
Answer 67:
Let X be the number of tails. X ∼ Bin(n = 100, p = 0.5)
P(X ≤ 99) = 1 − P(X = 100)
          = 1 − (n choose 100) p^100 (1 − p)^(n−100)
Question 68: You are manufacturing airplanes and are testing for defects. You test 30 objects and the
probability of a defect on a given test is 5/6. What is the probability that the number of defects is 14?
Answer 68:
Let X be the number of defects. X ∼ Bin(n = 30, p = 5/6)
P(X = 14) = (n choose 14) p^14 (1 − p)^(n−14)
          = (30 choose 14) (5/6)^14 (1 − 5/6)^(30−14)
          < 0.00001
Question 69: You are flipping a coin 20 times. The number of heads can be represented by a random
variable X. X is a Binomial with 20 trials. Each trial is a success, independently, with probability 1/4.
What is the standard deviation of X?
Answer 69:
Let X be the number of heads. X ∼ Bin(n = 20, p = 1/4)
Std(X) = √ np(1 − p)
= √ 20 ⋅ 1/4 ⋅ (1 − 1/4)
= 1.94
Question 70: You are giving a survey question where responses are "like" or "dislike" to 100 people.
What is the probability that X is equal to 4? The number of likes can be represented by a random variable
X. X is a Bin(100, 0.5).
Answer 70:
Let X be the number of likes. X ∼ Bin(n = 100, p = 0.5)
P(X = 4) = (n choose 4) p^4 (1 − p)^(n−4)
         = (100 choose 4) 0.5^4 (1 − 0.5)^(100−4)
         < 0.00001
Question 71: You are flipping a coin. There are 20 independent coin-flips where the probability of a tail
on a given coin-flip is 0.9. What is the standard deviation of tails?
Answer 71:
Let X be the number of tails. X ∼ Bin(n = 20, p = 0.9)
Std(X) = √ np(1 − p)
= √ 20 ⋅ 0.9 ⋅ (1 − 0.9)
= 1.34
Question 72: You are flipping a coin. There are 50 independent coin-flips. The probability of a tail on a
given coin-flip is 4/5. What is the expectation of heads?
Answer 72:
Let X be the number of heads. X ∼ Bin(n = 50, p = 1/5)
E[X] = np
= 50 ⋅ 1/5
= 10
Question 73: You are giving a survey question where responses are "like" or "dislike" to 100 people.
What is the standard deviation of likes? the probability of a dislike on each response is 41/50.
Answer 73:
Let X be the number of likes. X ∼ Bin(n = 100, p = 9/50)
Std(X) = √ np(1 − p)
= 3.84
Question 74: In a restaurant some customers ask for a water with their meal. 0.6 is the probability of a
water requested by each customer and there are 30 independent customers. What is the expected number
of waters requested?
Answer 74:
Let X be the number of waters requested. X ∼ Bin(n = 30, p = 0.6)
E[X] = np
= 30 ⋅ 0.6
= 18.0
Question 75: There are 40 independent trials and 0.5 is the probability of a failure on each trial. What is
the expectation of successes?
Answer 75:
Let X be the number of successes. X ∼ Bin(n = 40, p = 1/2)
E[X] = np
= 40 ⋅ 1/2
= 20
Question 76: Imran is showing an online-ad to 30 people. 5/6 is the probability of an ad click on each ad
shown. What is the standard deviation of ad clicks?
Answer 76:
Let X be the number of ad clicks. X ∼ Bin(n = 30, p = 5/6)
Std(X) = √ np(1 − p)
= √ 30 ⋅ 5/6 ⋅ (1 − 5/6)
= 2.04
Question 77: You are running a server cluster. What is the probability that the number of crashes is 1? the
server has 30 computers which crash independently and each server has a 1/3 probability of resulting in a
crash.
Answer 77:
Let X be the number of crashes. X ∼ Bin(n = 30, p = 1/3)
P(X = 1) = (n choose 1) p^1 (1 − p)^(n−1)
         = (30 choose 1) (1/3)^1 (1 − 1/3)^(30−1)
         = 0.00008
Question 78: Cody is running a server cluster with 40 computers. What is P(X <= 39)? The number of
crashes can be represented by a random variable X. X is a Bin(n = 40, p = 3/4).
Answer 78:
Let X be the number of crashes. X ∼ Bin(n = 40, p = 3/4)
P(X ≤ 39) = 1 − P(X = 40)
          = 1 − (n choose 40) p^40 (1 − p)^(n−40)
Question 79: You are hashing strings into a hashtable. 5/6 is the probability of a hash to the first bucket
on each string hash. There are 30 independent string hashes. What is the probability that the number of
hashes to the first bucket is greater than or equal to 29?
Answer 79:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 30, p = 5/6)
P(X ≥ 29) = Σ_{i=29}^{30} (n choose i) p^i (1 − p)^(n−i)
Question 80: Irina is flipping a coin. Irina flips the coin 30 times and the probability of a head on each
coin-flip is 0.4. What is the probability that the number of tails is 19?
Answer 80:
Let X be the number of tails. X ∼ Bin(n = 30, p = 0.6)
P(X = 19) = (n choose 19) p^19 (1 − p)^(n−19)
          = (30 choose 19) 0.6^19 (1 − 0.6)^(30−19)
          = 0.13962
Question 81: You are asking a survey question where responses are "like" or "dislike". The probability of
a like on a given response is 1/2. You give the survey to 100 people. What is the probability that the
number of likes is not less than 2?
Answer 81:
Let X be the number of likes. X ∼ Bin(n = 100, p = 1/2)
P(X ≥ 2) = 1 − Σ_{i=0}^{1} (n choose i) p^i (1 − p)^(n−i)
Question 82: Wind blows independently across locations. The number of independent locations is 100.
The probability of wind at a given location is 3/20. What is the probability that the number of locations
that have wind is 93?
Answer 82:
Let X be the number of locations that have wind. X ∼ Bin(n = 100, p = 3/20)
P(X = 93) = (n choose 93) p^93 (1 − p)^(n−93)
          = (100 choose 93) (3/20)^93 (1 − 3/20)^(100−93)
          < 0.00001
Question 83: You are flipping a coin. 0.9 is the probability of a tail on each coin-flip. You flip the coin 50
times. What is the expected number of heads?
Answer 83:
Let X be the number of heads. X ∼ Bin(n = 50, p = 0.1)
E[X] = np
= 50 ⋅ 0.1
= 5.0
Question 84: A machine learning algorithm makes binary predictions. What is the probability that the
number of correct predictions is less than or equal to 0? the probability of an incorrect prediction on a
given guess is 1/4. The number of independent guesses is 40.
Answer 84:
Let X be the number of correct predictions. X ∼ Bin(n = 40, p = 3/4)
P(X ≤ 0) = P(X = 0) = (n choose 0) p^0 (1 − p)^(n−0)
Question 85: Wind blows independently across 20 locations. 1/2 is the probability of wind at each
location. What is the standard deviation of locations that have wind?
Answer 85:
Let X be the number of locations that have wind. X ∼ Bin(n = 20, p = 1/2)
Std(X) = √ np(1 − p)
= √ 20 ⋅ 1/2 ⋅ (1 − 1/2)
= 2.24
Question 86: 7/10 is the probability of a failure on each trial and the number of independent trials is 100.
What is the probability that the number of successes is 7?
Answer 86:
Let X be the number of successes. X ∼ Bin(n = 100, p = 0.3)
P(X = 7) = (n choose 7) p^7 (1 − p)^(n−7)
         = (100 choose 7) 0.3^7 (1 − 0.3)^(100−7)
         < 0.00001
Question 87: You generate random bit strings. What is the expectation of 1s? there are 100 independent
bits and 0.1 is the probability of a 1 at each bit.
Answer 87:
Let X be the number of 1s. X ∼ Bin(n = 100, p = 0.1)
E[X] = np
= 100 ⋅ 0.1
= 10.0
Question 88: You are testing a new medicine on patients. 3/5 is the probability of a cured patient on each
trial. There are 30 independent trials. What is the probability that the number of cured patients is greater
than or equal to 1?
Answer 88:
Let X be the number of cured patients. X ∼ Bin(n = 30, p = 3/5)
P(X ≥ 1) = 1 − P(X = 0) = 1 − (n choose 0) p^0 (1 − p)^(n−0)
Question 89: A student is guessing randomly on an exam. 0.9 is the probability of a correct answer on
each question and the test has 20 questions. What is the standard deviation of correct answers?
Answer 89:
Let X be the number of correct answers. X ∼ Bin(n = 20, p = 0.9)
Std(X) = √ np(1 − p)
= √ 20 ⋅ 0.9 ⋅ (1 − 0.9)
= 1.34
Question 90: A student is guessing randomly on an exam with 40 questions. What is the probability that
the number of correct answers is 32? 0.5 is the probability of a correct answer on each question.
Answer 90:
Let X be the number of correct answers. X ∼ Bin(n = 40, p = 0.5)
P(X = 32) = (n choose 32) p^32 (1 − p)^(n−32)
          = (40 choose 32) 0.5^32 (1 − 0.5)^(40−32)
          = 0.00007
Question 91: In a restaurant some customers ask for a water with their meal. A random sample of 40
customers is selected where the probability of a water not requested by a given customer is 1/4. What is
the standard deviation of waters requested?
Answer 91:
Let X be the number of waters requested. X ∼ Bin(n = 40, p = 3/4)
Std(X) = √ np(1 − p)
= √ 40 ⋅ 3/4 ⋅ (1 − 3/4)
= 2.74
Question 92: A machine learning algorithm makes binary predictions. The number of correct predictions
can be represented by a random variable X. X is a Bin(n = 30, p = 2/5). What is P(X < 27)?
Answer 92:
Let X be the number of correct predictions. X ∼ Bin(n = 30, p = 2/5)
P(X < 27) = 1 − Σ_{i=27}^{30} (n choose i) p^i (1 − p)^(n−i)
Question 93: Irina is flipping a coin. The probability of a tail on each coin-flip is 3/4. The number of
independent coin-flips is 40. What is the probability that the number of tails is greater than 0?
Answer 93:
Let X be the number of tails. X ∼ Bin(n = 40, p = 3/4)
P(X > 0) = 1 − P(X = 0) = 1 − (n choose 0) p^0 (1 − p)^(n−0)
Question 94: Waddie is sending a stream of 50 bits to space. The probability of a no corruption on a
given bit is 1/2. What is the expectation of corruptions?
Answer 94:
Let X be the number of corruptions. X ∼ Bin(n = 50, p = 0.5)
E[X] = np
= 50 ⋅ 0.5
= 25.0
Question 95: You are hashing strings into a hashtable. There are 30 independent string hashes where the
probability of a hash to the first bucket on each string hash is 5/6. What is the probability that the number
of hashes to the first bucket is 24?
Answer 95:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 30, p = 5/6)
P(X = 24) = (n choose 24) p^24 (1 − p)^(n−24)
          = (30 choose 24) (5/6)^24 (1 − 5/6)^(30−24)
          = 0.16009
Question 96: Charlotte is hashing strings into a hashtable. 100 strings are hashed and the probability of a
hash to the first bucket on a given string hash is 1/5. What is the probability that the number of hashes to
the first bucket is greater than or equal to 1?
Answer 96:
Let X be the number of hashes to the first bucket. X ∼ Bin(n = 100, p = 1/5)
P(X ≥ 1) = 1 − P(X = 0) = 1 − (n choose 0) p^0 (1 − p)^(n−0)
Question 97: You are flipping a coin. Each coin-flip has a 3/10 probability of resulting in a head and
there are 100 coin-flips. You can assume each coin-flip is independent. What is the probability that the
number of heads is 0?
Answer 97:
Let X be the number of heads. X ∼ Bin(n = 100, p = 3/10)
P(X = 0) = (n choose 0) p^0 (1 − p)^(n−0)
         = (100 choose 0) (3/10)^0 (1 − 3/10)^(100−0)
         < 0.00001
Question 98: Chris is sending a stream of 50 bits to space. 16/25 is the probability of a no corruption on
each bit. What is the probability that the number of corruptions is greater than or equal to 47?
Answer 98:
Let X be the number of corruptions. X ∼ Bin(n = 50, p = 9/25)
P(X ≥ 47) = Σ_{i=47}^{50} (n choose i) p^i (1 − p)^(n−i)
Question 99: You are flipping a coin 30 times. What is the probability that the number of tails is less than
29? the probability of a tail on a given coin-flip is 2/3.
Answer 99:
Let X be the number of tails. X ∼ Bin(n = 30, p = 2/3)
P(X < 29) = 1 − Σ_{i=29}^{30} (n choose i) p^i (1 − p)^(n−i)
Question 100: You are manufacturing chips and are testing for defects. There are 40 independent tests.
The probability of a non-defect on a given test is 5/8. What is the probability that the number of defects is
10?
Answer 100:
Let X be the number of defects. X ∼ Bin(n = 40, p = 3/8)
P(X = 10) = (n choose 10) p^10 (1 − p)^(n−10)
          = (40 choose 10) (3/8)^10 (1 − 3/8)^(40−10)
          = 0.03507
This problem is equivalent to: flip a biased coin 7 times (with a p = 0.55 probability of getting a heads). What is the probability of at least 4 heads?

Note: without loss of generality you could imagine that the two teams always play all 7 games, regardless of the outcome. Technically they stop playing after one team has achieved 4 wins, because the outcomes of the remaining games no longer impact who wins the series. However, you could imagine that they continue.

What is the probability that the Warriors win the series? Leave your answer to 3 decimal places.

A critical step is to define a random variable and to recognize it is a Binomial. Let X be the number of games won. Since each game is independent, X ∼ Bin(n = 7, p = 0.55). The question is asking: what is P(X ≥ 4)?
P(X ≥ 4) = P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)

This is because the question is asking the probability of the "or" of each of the events on the right hand side of the equals sign. Since each of these events (X = 4, X = 5, etc.) is mutually exclusive, the probability of the "or" is simply the sum of the probabilities:

P(X ≥ 4) = Σ_{i=4}^{7} P(X = i)
         = Σ_{i=4}^{7} (7 choose i) p^i (1 − p)^(7−i)
         = Σ_{i=4}^{7} (7 choose i) 0.55^i · 0.45^(7−i)
Here is that equation graphically. It represents the sum of these columns in the PMF:
At this point we have an equation that we can compute in order to find the answer. But how should we
compute it? We could do it by hand! Or using a calculator. Or, we can use python, and specifically the
scipy package:
from scipy import stats

pr = 0
# calculate the sum
for i in range(4, 8):
    # this for loop gives i in [4, 5, 6, 7]
    pr_i = stats.binom.pmf(i, n=7, p=0.55)
    pr += pr_i
print(pr)  # ≈ 0.608
Buggy Solution

A good reason to study this problem is because of a common misconception for how to compute P(X ≥ 4). It is worth understanding why it is wrong.

The buggy idea: we are going to choose 4 slots where the Warriors win, and we don't care about the rest; they could either be wins or losses. Out of 7 games, select 4 where the Warriors win. There are (7 choose 4) ways to do so. The probability of each particular selection of four games to win is p⁴, because we need them to win those four games, and we don't care about the rest. As such the probability is:

P(X ≥ 4) = (7 choose 4) p⁴

This idea seems good, but it doesn't work. First of all, we can recognize that there is a problem by considering the outcome if we set p = 1.0. In this case P(X ≥ 4) = (7 choose 4) p⁴ = (7 choose 4) · 1⁴ = 35. Clearly 35 is an invalid probability (it is much greater than 1). As such this can't be the right answer.

But what is wrong with this approach? Let's enumerate the 35 different outcomes that it is considering. Let B = we don't know who wins. Let W = the Warriors win. Here each outcome is the assignment to each of the 7 games in the series:
(B, B, B, W, W, W, W)
(B, B, W, B, W, W, W)
(B, B, W, W, B, W, W)
(B, B, W, W, W, B, W)
(B, B, W, W, W, W, B)
(B, W, B, B, W, W, W)
(B, W, B, W, B, W, W)
(B, W, B, W, W, B, W)
(B, W, B, W, W, W, B)
(B, W, W, B, B, W, W)
(B, W, W, B, W, B, W)
(B, W, W, B, W, W, B)
(B, W, W, W, B, B, W)
(B, W, W, W, B, W, B)
(B, W, W, W, W, B, B)
(W, B, B, B, W, W, W)
(W, B, B, W, B, W, W)
(W, B, B, W, W, B, W)
(W, B, B, W, W, W, B)
(W, B, W, B, B, W, W)
(W, B, W, B, W, B, W)
(W, B, W, B, W, W, B)
(W, B, W, W, B, B, W)
(W, B, W, W, B, W, B)
(W, B, W, W, W, B, B)
(W, W, B, B, B, W, W)
(W, W, B, B, W, B, W)
(W, W, B, B, W, W, B)
(W, W, B, W, B, B, W)
(W, W, B, W, B, W, B)
(W, W, B, W, W, B, B)
(W, W, W, B, B, B, W)
(W, W, W, B, B, W, B)
(W, W, W, B, W, B, B)
(W, W, W, W, B, B, B)
It is in fact the case that the probability of any one of these 35 outcomes is p⁴. For example, take (W, W, W, W, B, B, B): the Warriors need to win the first 4 independent games, which has probability p⁴. Then there are three events where either team could win. The probability of "B", that either team could win, is 1. That makes sense: either the Warriors win or the other team wins. As such the probability of any given outcome (aka row in the set of outcomes above) is p⁴ · 1³ = p⁴.

The bug here is that these outcomes are not mutually exclusive, yet the answer treats them as such. In the many coin flips example, we constructed outcomes in the format (T, T, T, H, H, T, H). Those outcomes are in fact mutually exclusive: it is not possible for two distinct lists of outcomes to simultaneously occur. On the other hand, in the version where "B" stands for either team could win, the outcomes do have overlap. For example, consider these two rows from the set of outcomes above:

(B, W, W, W, B, B, W)
(B, W, W, W, B, W, B)

These could both occur (and hence are not mutually exclusive). For example, if the Warriors win all 7 games! Or if the Warriors win all games except for games 1 and 5. Both events are satisfied. Because the events are not mutually exclusive, if we want the probability of the "or" of each of these events we can not just sum the probabilities of each of the events (and that is exactly what P(X ≥ 4) = (7 choose 4) p⁴ implies). Instead you would need to use inclusion exclusion for the "or" of 35 events (yikes!). Alternatively, see the answer we propose above.
Approximate Counting
What if you wanted a counter that could count up to the number of atoms in the universe, but you wanted to
store the counter in 8 bits? You could use the amazing probabilistic algorithm described below! In this
example we are going to show that the expected return value of stochastic_counter(4), where count is
called four times, is in fact equal to four.
import random

def stochastic_counter(true_count):
    n = -1
    for i in range(true_count):
        n += count(n)
    return 2 ** n  # 2^n, aka 2 to the power of n

def count(n):
    # To return 1 you need n heads. Always returns 1 if n is <= 0
    for i in range(n):
        if not coin_flip():
            return 0
    return 1

def coin_flip():
    # returns true 50% of the time
    return random.random() < 0.5
Let X be a random variable for the value of n at the end of stochastic_counter(4). Note that X is not a binomial because the probabilities of each outcome change. Let R be the return value of the function. R = 2^X, which is a function of X. Use the law of the unconscious statistician:

E[R] = Σ_x 2^x · P(X = x)

We can compute each of the probabilities P(X = x) separately. Note that the first two calls to count will always return 1. Let H_i be the event that the ith call to count returns 1. Let T_i be the event that the ith call returns 0. X can't be less than 1 because the first two calls to count always return 1.

P(X = 1) = P(T₃, T₄)
P(X = 2) = P(H₃, T₄) + P(T₃, H₄)
P(X = 3) = P(H₃, H₄)
At the point of the third call to count, n = 1. If H₃ then n = 2 for the fourth call and its loop runs twice (so the fourth call returns 1 with probability 1/4); if T₃ then n stays at 1 for the fourth call.

P(H₃, T₄) = P(H₃) · P(T₄|H₃) = (1/2) · (1/2 + 1/4) = 3/8
P(H₃, H₄) = P(H₃) · P(H₄|H₃) = (1/2) · (1/4) = 1/8
P(T₃, H₄) = P(T₃) · P(H₄|T₃) = (1/2) · (1/2) = 1/4
P(T₃, T₄) = P(T₃) · P(T₄|T₃) = (1/2) · (1/2) = 1/4
Putting it together:

E[R] = Σ_{x=1}^{3} 2^x · P(X = x)
     = 2 · (1/4) + 4 · (5/8) + 8 · (1/8)
     = 4
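A Monte Carlo check (a sketch that re-uses stochastic_counter and its import of random from the code above) agrees with this expectation:

trials = 100_000
total = sum(stochastic_counter(4) for _ in range(trials))
print(total / trials)  # ≈ 4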
Jury Selection
In the Supreme Court case: Berghuis v. Smith, the Supreme Court (of the US) discussed the question: "If
a group is underrepresented in a jury pool, how do you tell?"
Justice Breyer [Stanford Alum] opened the questioning by invoking the binomial theorem. He
hypothesized a scenario involving “an urn with a thousand balls, and sixty are red, and nine hundred
forty are green, and then you select them at random… twelve at a time.” According to Justice Breyer and
the binomial theorem, if the red balls were black jurors then “you would expect… something like a third
to a half of juries would have at least one black person” on them.
Note: What is missing in this conversation is the power of diverse backgrounds when making difficult
decisions.
Technically, since jurors are selected without replacement, you should represent the number of under-represented jurors as a Hypergeometric random variable (a random variable we don't look at explicitly in CS109) such that:

P(X ≥ 1) = 1 − P(X = 0)
         = 1 − [(60 choose 0) · (940 choose 12)] / (1000 choose 12)
         ≈ 0.5261
However, Justice Breyer made his case by citing a Binomial distribution. This isn't a perfect use of the binomial, because the binomial assumes that each experiment has an equal likelihood (p) of success. Because the jurors are selected without replacement, the probability of getting a minority juror changes slightly after each selection (depending on what the previous selections were). However, as we will see, because the probabilities don't change too much, the binomial distribution is not too far off:

P(X ≥ 1) = 1 − P(X = 0)
         = 1 − (12 choose 0) · (1 − 0.06)^12
         ≈ 0.5241
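Both numbers are easy to verify by simulation; here is a sketch of Justice Breyer's urn using sampling without replacement:

import random

urn = [1] * 60 + [0] * 940  # 60 "red" (underrepresented) balls among 1000
trials = 100_000
hits = sum(sum(random.sample(urn, 12)) >= 1 for _ in range(trials))
print(hits / trials)  # ≈ 0.526, matching the hypergeometric answer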
What is the probability that Laura gives birth today (given that she hasn't given birth up until today)?
How likely is delivery, in humans, relative to the due date? There have been millions of births which
gives us a relatively good picture [1]. The length of human pregnancy varies by quite a lot! Have you
heard that it is 9 months? That is a rough, point estimate. The mean duration of pregnancy is 278.6 days,
and pregnancy length has a standard deviation (SD) of 12.5 days. This distribution is not normal, but
roughly matches a "skewed normal". This is a general probability mass function for the first pregnancy
collected from hundreds of thousands of women (this PMF is very similar across demographics, but
changes based on whether the woman has given birth before):
Of course, we have more information. Specifically, we know that Laura hasn't given birth up until today
(we will update this example when that changes). We also know that babies which are over 14 days late
are "induced" on day 14. How likely is delivery given that we haven't delivered up until today? Note that
the y-axis is scaled differently:
Implementation notes: this calculation was performed by storing the PDF as a list of (day, probability)
points. These values are sometimes called weighted samples, or "particles" and are the key component to
a "particle filtering" approach. After we observe no-delivery, we set the probability of every point which
has a day before today to be 0, and then re-normalize the remaining points (aka we "filter" the
"particles"). This is convenient because the "posterior" belief doesn't follow a simple equation -- using
particles means we never have to write that equation down in our code.
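Here is a sketch of that update; the prior PMF values below are made up, since the real ones come from the birth-duration data:

# Hypothetical weighted samples ("particles"): day -> probability
prior = {276: 0.2, 277: 0.3, 278: 0.3, 279: 0.2}  # made-up values for illustration

def filter_particles(pmf, today):
    # Zero out days before today, then re-normalize the rest
    kept = {d: p for d, p in pmf.items() if d >= today}
    total = sum(kept.values())
    return {d: p / total for d, p in kept.items()}

print(filter_particles(prior, today=278))  # posterior given no delivery before day 278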
Three friends have the exact same due date (really! this isn't a hypothetical). What is the probability that all three couples deliver on the exact same day?

How did we get that number? Let p_i be the probability that one baby is delivered on day i; this number can be read off the probability mass function. Let D_i be the event that all three babies are delivered on day i. Note that the event D_i is mutually exclusive with the event that all three babies are born on another day (so, for example, D_1 is mutually exclusive with D_2, D_3, etc.). Let S be the event that all three babies are born on the same day:

P(S) = Σ_i P(D_i)    (the events D_i are mutually exclusive)
     = Σ_i p_i³      (since the three couples are independent)
[1] Predicting delivery date by ultrasound and last menstrual period in early gestation
There is uncertainty in these counts. If the true average number of cells for a given patient's eye is 6, the
doctor could get a different count (say 4, or 5, or 7) just by chance. As of 2021, modern eye medicine
does not have a sense of uncertainty for their inflammation grades! In this problem we are going to
change that. At the same time we are going to learn about Poisson distributions over space.
Why is the number of cells observed in a 1x1 square governed by a Poisson process?
We can approximate a distribution for the count by discretizing the square into a fixed number of equal
sized buckets. Each bucket either has a cell or not. Therefore, the count of cells in the 1x1 square is a sum
of Bernoulli random variables with equal p, and as such can be modeled as a binomial random variable.
This is an approximation because it doesn't allow for two cells in one bucket. Just like with time, if we
make the size of each bucket infinitely small, this limitation goes away and we converge on the true
distribution of counts. The binomial in the limit, i.e. a binomial as n → ∞, is truly represented by a
Poisson random variable. In this context, λ represents the average number of cells per 1×1 sample. See
Figure 2.
For a given patient the true average rate of cells is 5 cells per 1x1 sample. What is the probability that in a
single 1x1 sample the doctor counts 4 cells?
Let X denote the number of cells in the 1x1 sample. We note that X ∼ Poi(5). We want to find P(X = 4).

P(X = 4) = (5⁴ · e⁻⁵) / 4! ≈ 0.175
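The same value via scipy, as a quick check:

from scipy import stats

print(stats.poisson.pmf(4, mu=5))  # P(X = 4) for X ~ Poi(5), ≈ 0.175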
Multiple Observations
Heads up! This section uses concepts from Part 3. Specifically Independence in Variables
For a given patient the true average rate of cells is 5 cells per 1mm by 1mm sample. In an attempt to be
more precise, the doctor counts cells in two different, larger 2mm by 2mm samples. Assume that the
occurrences of cells in one 2mm by 2mm samples are independent of the occurrences in any other 2mm
by 2mm samples. What is the probability that she counts 20 cells in the first samples and 20 cells in the
second?
Let Y_1 and Y_2 denote the number of cells in each of the 2x2 samples. Since there are 5 cells on average in a 1x1 sample, and the area has quadrupled, the average count in a 2x2 sample is 20. Thus Y_1 ∼ Poi(20) and Y_2 ∼ Poi(20). We want to find P(Y_1 = 20 ∧ Y_2 = 20). Since the number of cells in the two samples are independent, this is equivalent to finding P(Y_1 = 20) · P(Y_2 = 20).
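Numerically (a sketch using scipy):

from scipy import stats

p20 = stats.poisson.pmf(20, 20)   # P(Y = 20) for Y ~ Poi(20)
print(p20 * p20)                  # independence lets us multiply the two probabilities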
Estimating Lambda
Heads up! This section uses concepts from Part 5. Specifically Maximum A Posteriori
Inflammation prior: Based on millions of historical patients, doctors have learned that the prior probability density function of the true rate of cells is:

f(λ) = K · λ · e^(−λ/2)

where K is a normalizing constant. A doctor takes a single sample and counts 4 cells. Give an equation for the updated probability density of λ. Use the "Inflammation prior" as the prior probability density over values of λ. Your probability density may be left in terms of constants such as K.
Let θ be the random variable for the true rate. Let X be the random variable for the count.

f(θ = λ|X = 4) = P(X = 4|θ = λ) · f(θ = λ) / P(X = 4)
             = [(λ^4 · e^(−λ) / 4!) · K · λ · e^(−λ/2)] / P(X = 4)
             = K · λ^5 · e^(−3λ/2) / (4! · P(X = 4))
A doctor takes a single sample and counts 4 cells. What is the Maximum A Posteriori estimate of λ?
λ_MAP = argmax_λ [K · λ^5 · e^(−3λ/2) / (4! · P(X = 4))]
      = argmax_λ (5 log λ − (3/2)λ)

Calculate the derivative with respect to the parameter, and set it equal to 0:

0 = ∂/∂λ (5 log λ − (3/2)λ)
0 = 5/λ − 3/2
λ = 10/3
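A quick numeric sanity check of this argmax over a dense grid (a sketch):

import numpy as np

lams = np.linspace(0.01, 20, 200_000)
log_post = 5 * np.log(lams) - 1.5 * lams   # log of the unnormalized posterior
print(lams[np.argmax(log_post)])           # ≈ 10/3 ≈ 3.33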
Explain, in words, the difference between the two estimates of lambda in the two previous parts.
The estimate in the first part is a "distribution" (also called a soft estimate) whereas the estimate in the second part is a single value (also called a point estimate). The former contains information about confidence.

The MLE estimate doesn't use the prior belief. The MLE estimate for a Poisson is simply the average of the observations. In this case the average of our single observation is 4. MLE is not a great tool for estimating our parameter from just one datapoint.
A patient comes in on two separate days. The first day the doctor counts 5 cells, the second day the doctor counts 4 cells. Based only on these observations, and treating the true rates on the two days as independent, what is the probability that the patient's inflammation has gotten better (in other words, that their λ has decreased)?
Let θ_1 be the random variable for lambda on the first day and θ_2 be the random variable for lambda on the second day. By the same derivation as in the previous part:

f(θ_1 = λ|X = 5) = K_1 · λ^6 · e^(−3λ/2)
f(θ_2 = λ|X = 4) = K_2 · λ^5 · e^(−3λ/2)

The question is asking: what is P(θ_1 > θ_2)? There are a few ways to calculate this exactly:

P(θ_1 > θ_2) = ∫_{λ_1=0}^{∞} ∫_{λ_2=0}^{λ_1} f(θ_1 = λ_1, θ_2 = λ_2) dλ_2 dλ_1
            = ∫_{λ_1=0}^{∞} ∫_{λ_2=0}^{λ_1} f(θ_1 = λ_1) · f(θ_2 = λ_2) dλ_2 dλ_1
            = ∫_{λ_1=0}^{∞} f(θ_1 = λ_1) [∫_{λ_2=0}^{λ_1} f(θ_2 = λ_2) dλ_2] dλ_1
            = ∫_{λ_1=0}^{∞} K_1 · λ_1^6 · e^(−3λ_1/2) [∫_{λ_2=0}^{λ_1} K_2 · λ_2^5 · e^(−3λ_2/2) dλ_2] dλ_1
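This double integral is easiest to evaluate numerically. A sketch using scipy, where the normalizing constants K_1 and K_2 are themselves computed numerically rather than in closed form:

import numpy as np
from scipy import integrate

# unnormalized posteriors derived above
g1 = lambda lam: lam**6 * np.exp(-1.5 * lam)   # day 1, after counting 5 cells
g2 = lambda lam: lam**5 * np.exp(-1.5 * lam)   # day 2, after counting 4 cells

K1 = 1 / integrate.quad(g1, 0, np.inf)[0]      # normalizing constants
K2 = 1 / integrate.quad(g2, 0, np.inf)[0]

inner = lambda l1: K2 * integrate.quad(g2, 0, l1)[0]   # P(theta_2 < l1)
prob, _ = integrate.quad(lambda l1: K1 * g1(l1) * inner(l1), 0, np.inf)
print(prob)   # P(theta_1 > theta_2), roughly 0.61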
Logit Normal
The logit normal is the continuous distribution that results from applying a special "squashing" function
to a Normally distributed random variable. The squashing function maps all values the normal could take
on onto the range 0 to 1. If X ∼ LogitNormal(μ, σ²) it has:

PDF: f_X(x) = (1 / (σ√(2π) · x(1 − x))) · e^(−(logit(x) − μ)² / (2σ²))   if 0 < x < 1, and 0 otherwise

CDF: F_X(x) = Φ((logit(x) − μ) / σ)

Where: logit(x) = log(x / (1 − x))
A new theory shows that the Logit Normal better fits exam score distributions than the traditionally used Normal. Let's test it out! We have some set of exam scores for a test with min possible score 0 and max possible score 1, and we are trying to decide between two hypotheses:

H1: Scores are distributed as X ∼ N(μ = 0.7, σ² = 0.2²).
H2: Scores are distributed as X ∼ LogitNormal(μ = 1.0, σ² = 0.9²).
Under the normal assumption, H1, what is P(0.9 < X < 1.0)? Provide a numerical answer to two decimal places.

Under the normal assumption, H1, what is the maximum value that X can take on?
Before observing any test scores, you assume that (a) one of your two hypotheses is correct and (b) that initially, each hypothesis is equally likely to be correct, P(H1) = P(H2) = 1/2. You then observe a single test score, X = 0.9. What is your updated probability that the Logit-Normal hypothesis is correct?
P(H2|X = 0.9) = f(X = 0.9|H2) · P(H2) / [f(X = 0.9|H2) · P(H2) + f(X = 0.9|H1) · P(H1)]
             = f(X = 0.9|H2) / [f(X = 0.9|H2) + f(X = 0.9|H1)]
             = [(1 / (0.9√(2π) · 0.9(1 − 0.9))) · e^(−(logit(0.9) − 1.0)² / (2 · 0.9²))] /
               [(1 / (0.9√(2π) · 0.9(1 − 0.9))) · e^(−(logit(0.9) − 1.0)² / (2 · 0.9²)) + (1 / (0.2√(2π))) · e^(−(0.9 − 0.7)² / (2 · 0.2²))]
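Plugging in the numbers is easiest in code (a sketch; the densities follow the PDF formulas above):

import numpy as np

mu1, sigma1 = 0.7, 0.2    # Normal hypothesis H1
mu2, sigma2 = 1.0, 0.9    # Logit-Normal hypothesis H2
x = 0.9

logit = np.log(x / (1 - x))
f_h1 = np.exp(-(x - mu1)**2 / (2 * sigma1**2)) / (sigma1 * np.sqrt(2 * np.pi))
f_h2 = np.exp(-(logit - mu2)**2 / (2 * sigma2**2)) / (sigma2 * np.sqrt(2 * np.pi) * x * (1 - x))
print(f_h2 / (f_h1 + f_h2))   # updated belief in the Logit-Normal hypothesis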
Curse of Dimensionality
Machine learning, like many fields of computer science, often involves high dimensional points, and high dimensional spaces have some surprising probabilistic properties.

A random point [X_1, X_2, X_3] of dimension 3 is close to an edge if any of its values is close to an edge. This event is the complement of the event that none of the dimensions of the point is close to an edge. If P(E) = 0.02 is the probability that a single dimension is close to an edge, the answer is: 1 − (1 − P(E))^3 = 1 − 0.98^3 ≈ 0.058

A random point [X_1, …, X_100] of dimension 100 is close to an edge if any of its values is close to an edge. What is the probability that a 100 dimensional point is close to an edge? By the same argument, it is 1 − 0.98^100 ≈ 0.867.

There are many other surprising phenomena of high dimensional points: for example, the Euclidean distances between random points start to converge.
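A quick Monte Carlo check of these numbers (a sketch; the 0.01 margin on each side is the assumed definition of "close to an edge" that makes P(E) = 0.02):

import numpy as np

rng = np.random.default_rng(109)
n_points, dim, margin = 100_000, 100, 0.01

points = rng.random((n_points, dim))
near_edge = ((points < margin) | (points > 1 - margin)).any(axis=1)
print(near_edge.mean())            # simulated estimate
print(1 - (1 - 2 * margin)**dim)   # analytic answer, 1 - 0.98^100 ≈ 0.867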
Algorithmic Art
We want to generate probabilistic artwork, efficiently. We are going to use random variables to make a
picture filled with non-overlapping circles:
In our art, the circles are different sizes. Specifically, each circle's radius is drawn from a Pareto distribution (which is described below). The placement algorithm is greedy: we sample 1000 circle sizes, sort them from largest to smallest, and then loop over the circle sizes, placing the circles one by one.

To place a circle on the canvas, we sample the location of the center of the circle. Both the x and y coordinates are uniformly distributed over the dimensions of the canvas. Once we have selected a prospective location we then check if there would be a collision with a circle that has already been placed. If there is a collision we keep trying new locations until we find one that has no collisions.
Notation: X ∼ Pareto(a)
Description: A long tail distribution. Large values are rare and small values are common.
Parameters: a ≥ 1, the shape parameter.
  Note there are other optional params. See Wikipedia.
Support: x ∈ [1, ∞)
PDF equation: f(x) = a / x^(a+1)
CDF equation: F(x) = 1 − (1/x)^a

To sample a radius from a Pareto we can use inverse transform sampling. Let y be a uniform random value between 0 and 1, set y = F(x), and solve for x:

y = 1 − (1/x)^a
(1/x)^a = 1 − y
1/x = (1 − y)^(1/a)
x = 1 / (1 − y)^(1/a)
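In code, that final line is all we need to turn uniform random numbers into Pareto samples (a sketch; the shape a = 2.0 is an arbitrary choice for illustration):

import random

def sample_pareto(a):
    # inverse transform sampling: y ~ Uni(0, 1), then invert the Pareto CDF
    y = random.random()
    return 1 / (1 - y) ** (1 / a)

# 1000 radii for the artwork, sorted largest to smallest
radii = sorted((sample_pareto(2.0) for _ in range(1000)), reverse=True)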
Part 3: Probabilistic Models
Joint Probability
Many interesting problems involve not one random variable, but rather several interacting with one
another. In order to create interesting probabilistic models and to reason in real world situations, we are
going to need to learn how to consider several random variables jointly.
In this section we are going to use disease prediction as a working example to introduce you to the
concepts involved in probabilistic models. The general question is: a person has a set of observed
symptoms. Given the symptoms what is the probability over each possible disease?
We have already considered events that co-occur and covered concepts such as independence and
conditional probability. What is new about this section is (1) we are going to cover how to handle random
variables which co-occur and (2) we are going to talk about how computers can reason under large
probabilistic models.
The joint probability mass function of two discrete random variables X and Y is written P(X = x, Y = y). You should read the comma as an "and": this is the probability that X = x and Y = y. Again, like for single variables, as shorthand we often write just the values, P(x, y), and it implies that we are talking about the probability of the random variables taking on those values. This notation is convenient because it is shorter, and it makes it explicit that the function is operating over two parameters. It requires you to recall that the event is a random variable taking on the given value.

If any of the variables are continuous we use different notation, f(X = x, Y = y), to make it clear that we need a probability density function, something we can integrate over to get a probability. We will cover this in detail in the Continuous Joint section.
The same idea extends to as many variables as you have in your model. For example if you had three
discrete random variables X, Y , and Z , the joint probability function would state the likelihood of an
assignment to all three: P(X = x, Y = y, Z = z).
Let us start with an example. In 2020 the Covid-19 pandemic disrupted lives around the world. Many
people were unable to get tested and had to determine whether or not they were sick based on home
diagnosis. Let's build a very simple probabilistic model to enable us to make a tool which can predict the
probability of having the illness given observed symptoms. To make it clear that this is a pedagogical
example, let's consider a made up illness called Determinitis. The two main symptoms are fever and loss
of smell.
A joint probability table is a brute force way to store the probability mass of a particular assignment of values to our variables. Here is a probabilistic model for our three random variables (aside: the values in this joint are realistic and based on research, but are primarily for teaching. Consult a doctor before making medical decisions).
[Joint probability table over illness D, ability to smell S, and fever F. Columns are grouped first by D ∈ {0, 1} and then by S ∈ {0, 1}; rows are the values of F.]
Each cell in this table represents the probability of one assignment of variables. For example the
probability that someone can't smell, S = 0, has a low fever, F = low, and has the illness, D = 1, can
be directly read off the table: P (D = 1, S = 0, F = low) = 0.005.
These are joint probabilities, not conditional probabilities. The value 0.005 is the probability of illness, no smell and low fever. It is not the probability of no smell and low fever given illness. A table which stores conditional probabilities would be called a conditional probability table; this is a joint probability table.
If you sum over all cells, the total will be 1. Each cell is a mutually exclusive combination of events
and the cells are meant to span the entire space of possible outcomes.
This table is large! We can count the number of cells using the step rule of counting. If n_i is the number of different values that random variable i can take on, the number of cells in the joint table is ∏_i n_i.
Marginalization
An important insight regarding probabilistic models with many random variables is that "the joint
distribution is complete information." From the joint distribution you can compute all probability
questions involving those random variables in the model. This chapter is an example of that insight.
The central question of this chapter is: Given a joint distribution, how can you compute the probability of
random variables on their own?
In other words, we want to compute P(X = x) and P(Y = y) for any values x and y. We already have a technique for computing P(X = x) from the joint. We can use the Law of Total Probability (LOTP)! In this case the events Y = y make up the "background events":

P(X = x) = ∑_y P(X = x, Y = y)

Note that to apply the LOTP it must be the case that the different events Y = y are mutually exclusive and it must be the case that ∑_y P(Y = y) = 1. Both are true.
If we wanted P(Y = y) we could again use the Law of Total Probability, this time with X taking on each of its possible values as the background events:

P(Y = y) = ∑_x P(X = x, Y = y)
[Joint probability table over X, a student's favorite digit, and Y, their class year (Frosh, Soph, Junior, Senior, 5+). For example P(X = 0, Y = 5+) = 0.02 and P(X = 1, Y = 5+) = 0.06.]
What is the probability that a student's favorite digit is 0, P(X = 0)? We can use the LOTP to compute
this probability:
P(X = 0) = ∑_y P(X = 0, Y = y)
         = P(X = 0, Y = Frosh)
         + P(X = 0, Y = Soph)
         + P(X = 0, Y = Junior)
         + P(X = 0, Y = Senior)
         + P(X = 0, Y = 5+)
         = 0.15
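In code, if the joint table is stored as a dictionary mapping assignments to probabilities, marginalization is a one-line sum. The values below are hypothetical placeholders, except the 5+ entry from the table above:

# joint[(x, y)] = P(X = x, Y = y); most values here are made up for illustration
joint = {
    (0, "Frosh"): 0.03, (0, "Soph"): 0.035, (0, "Junior"): 0.03,
    (0, "Senior"): 0.035, (0, "5+"): 0.02,
    # ... entries for other digits omitted
}
p_x0 = sum(pr for (x, y), pr in joint.items() if x == 0)   # = 0.15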
Multinomial
The multinomial is an example of a parametric distribution for multiple random variables. The multinomial is a gentle introduction to joint distributions. It is an extension of the binomial. In both cases, you have n independent experiments. In a binomial each outcome is a "success" or "not success". In a multinomial there can be more than two outcomes (multi). A great analogy for the multinomial is: we are going to roll an m-sided die n times. We care about reporting the number of times each side of the die comes up.
Here is the formal definition of the multinomial. Say you perform n independent trials of an experiment where each trial results in one of m outcomes, with respective probabilities: p_1, p_2, …, p_m (constrained so that ∑_i p_i = 1). Define X_i to be the number of trials with outcome i. A multinomial distribution is a closed form function that answers the question: what is the probability that there are c_i trials with outcome i? Mathematically:

P(X_1 = c_1, X_2 = c_2, …, X_m = c_m) = (n choose c_1, c_2, …, c_m) · p_1^c_1 · p_2^c_2 ⋯ p_m^c_m
                                      = (n choose c_1, c_2, …, c_m) · ∏_i p_i^c_i
This is our first joint random variable model! We can express it in a card, much like we would for single random variables:

PMF equation: P(X_1 = c_1, X_2 = c_2, …, X_m = c_m) = (n choose c_1, c_2, …, c_m) · ∏_i p_i^c_i
Examples

Dice example: A 6-sided die is rolled n = 7 times. What is the probability of rolling: one 1, one 2, zero 3s, two 4s, zero 5s and three 6s?

P(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 2, X_5 = 0, X_6 = 3)
  = (7! / (2!3!)) · (1/6)^1 · (1/6)^1 · (1/6)^0 · (1/6)^2 · (1/6)^0 · (1/6)^3
  = 420 · (1/6)^7
Weather Example: Each day the weather in Bayeslandia can be {Sunny, Cloudy, Rainy} where p_sunny = 0.7, p_cloudy = 0.2 and p_rainy = 0.1. Assume each day is independent of one another. What is the probability that over the next 7 days we have 5 sunny days, 1 cloudy day and 1 rainy day?
P(X_sunny = 5, X_cloudy = 1, X_rainy = 1) = (7! / (5!1!1!)) · (0.7)^5 · (0.2)^1 · (0.1)^1 ≈ 0.14
How does that compare to the probability that every day is sunny?

P(X_sunny = 7, X_cloudy = 0, X_rainy = 0) = (7! / (7!0!0!)) · (0.7)^7 · (0.2)^0 · (0.1)^0 ≈ 0.08
The multinomial is especially popular because of its use as a model of language. For a full example see
the Federalist Paper Authorship example.
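As a quick numeric check of the weather example, the multinomial PMF is available directly in scipy (a sketch, assuming scipy is available):

from scipy import stats

# P(5 sunny, 1 cloudy, 1 rainy) over 7 independent days
p_week = stats.multinomial.pmf([5, 1, 1], n=7, p=[0.7, 0.2, 0.1])
print(round(p_week, 2))   # ≈ 0.14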
Derivation: Let's use our rules of counting to derive the probability that out of the n = 7 days, 5 are sunny, 1 is cloudy and 1 is rainy. Like our derivation for the binomial, we are going to consider all of the possible weeks with 5 sunny days, 1 rainy day and 1 cloudy day.

First, note that the different assignments of weather to days are mutually exclusive outcomes. Then note that the probability of any one such outcome will be (p_S)^5 · p_C · p_R. The number of unique weeks with the chosen count of outcomes can be derived using the rule for Permutations with Indistinct Objects. There are 7 objects (days), of which 5 are indistinct from one another (the sunny days). The number of distinct outcomes is:

(7 choose 5, 1, 1) = 7! / (5!1!1!) = 7 · 6 = 42
Since the outcomes are mutually exclusive, we are going to be adding the probability of a single case, (p_S)^5 · p_C · p_R, to itself 7!/(5!1!1!) times. Putting this all together we get the multinomial joint function for this particular case:

P(X_sunny = 5, X_cloudy = 1, X_rainy = 1) = (7! / (5!1!1!)) · (0.7)^5 · (0.2)^1 · (0.1)^1 ≈ 0.14
Continuous Joint
Random variables X and Y are Jointly Continuous if there exists a joint Probability Density Function
(PDF) f such that:
P(a_1 < X ≤ a_2, b_1 < Y ≤ b_2) = ∫_{a_1}^{a_2} ∫_{b_1}^{b_2} f(X = x, Y = y) dy dx
The marginal densities are computed by integrating out the other variable:

f(X = a) = ∫_{−∞}^{∞} f(X = a, Y = y) dy
f(Y = b) = ∫_{−∞}^{∞} f(X = x, Y = b) dx
P (a 1 < X ≤ a 2 , b 1 < Y ≤ b 2 ) = F (a 2 , b 2 ) − F (a 1 , b 2 ) + F (a 1 , b 1 ) − F (a 2 , b 1 )
On the left is a visualization of the probability mass of this joint distribution, and on the right is a
visualization of how we could answer the question: what is the probability that a dart hits within a certain
distance of the center. For each bucket there is a single number, the probability that a dart will fall into
that particular bucket (these probabilities are mutually exclusive and sum to 1).
Of course this discretization only approximates the joint probability distribution. In order to get a better
approximation we could create more fine-grained discretizations. In the limit we can make our buckets
infinitely small, and the value associated with each bucket becomes a second derivative of probability.
To represent the 2D probability density in a graph, we use the darkness of a value to represent the density
(darker means more density). Another way to visualize this distribution is from an angle. This makes it
easier to realize that this is a function with two inputs and one output. Below is a different visualization of the exact same density function:

Just like in the single random variable case, we are now representing our belief in the continuous random variables as densities rather than probabilities. Recall that a density represents a relative belief. If the density f(X = 1.1, Y = 0.9) is twice as high as the density f(X = 1.1, Y = 1.1), the function is expressing that it is twice as likely to find the particular combination of X = 1.1 and Y = 0.9.
Multivariate Gaussian
The density that is depicted in this example happens to be a particular type of joint continuous distribution called the Multivariate Gaussian. In fact it is a special case where all of the constituent variables are independent.
A multivariate Gaussian stores its random variables in vectors (similar to a list in Python). The notation for the multivariate Gaussian uses vector notation:

X⃗ ∼ N(μ⃗, σ⃗)

In the special case depicted here, where all of the constituent variables are independent, the PDF and the CDF factor into products over the n dimensions:

f(x⃗) = ∏_{i=1}^n f(x_i) = ∏_{i=1}^n (1 / (σ_i √(2π))) · e^(−(x_i − μ_i)² / (2σ_i²))

F(x⃗) = ∏_{i=1}^n F(x_i) = ∏_{i=1}^n Φ((x_i − μ_i) / σ_i)
In image processing, a Gaussian blur is the result of blurring an image by a Gaussian function. It is a
widely used effect in graphics software, typically to reduce image noise. A Gaussian blur works by
convolving an image with a 2D independent multivariate gaussian (with means of 0 and equal valued
standard deviations).
In order to use a Gaussian blur, you need to be able to compute the probability mass of that 2D gaussian
in the space of pixels. Each pixel is given a weight equal to the probability that X and Y are both within
the pixel bounds. The center pixel covers the area where −0.5 ≤ x ≤ 0.5 and −0.5 ≤ y ≤ 0.5. Let's do
one step in computing the Gaussian function discretized over image space. What is the weight of the
center pixel for gaussian blur with a multivariate gaussian which has means of 0 and standard deviation
of 3?
F(x_1, x_2) = ∏_{i=1}^2 Φ((x_i − μ_i) / σ_i)
           = Φ((x_1 − μ_1) / σ_1) · Φ((x_2 − μ_2) / σ_2)
           = Φ(x_1 / 3) · Φ(x_2 / 3)

Computing the probability mass inside the bounds of the center pixel using this CDF gives a weight of

≈ 0.026
How can this 2D gaussian blur the image? Wikipedia explains: "Since the Fourier transform of a
Gaussian is another Gaussian, applying a Gaussian blur has the effect of reducing the image's high-
frequency components; a Gaussian blur is a low pass filter" [2].
Inference
So far we have set the foundation for how we can represent probabilistic models with multiple random
variables. These models are especially useful because they let us perform a task called "inference" where
we update our belief about one random variable in the model, conditioned on new information about
another. Inference in general is hard! In fact, it has been proven that in the worst case, the inference task,
can be NP-Hard where n is the number of random variables [1].
First we are going to practice it with two random variables (in this section). Then, later in this unit we are
going to talk about inference in the general case, with many random variables.
Earlier we looked at conditional probabilities for events. The first task in inference is to understand how
to combine conditional probabilities and random variables. The equations for both the discrete and
continuous case are intuitive extensions of our understanding of conditional probability:
P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y)

P(X = x|Y = y) = P(Y = y|X = x) · P(X = x) / P(Y = y)
In the presence of multiple random variables, it becomes increasingly useful to use shorthand! The above definition is identical to this notation, where a lowercase symbol such as x is shorthand for the event X = x:

P(x|y) = P(x, y) / P(y)
The conditional definition works for any event, and as such we can also write conditionals using cumulative distribution functions (CDFs) for the discrete case:

P(X ≤ a|Y = y) = P(X ≤ a, Y = y) / P(Y = y)
              = [∑_{x≤a} P(X = x, Y = y)] / P(Y = y)
Here is a neat result: this last term can be rewritten, by a clever manipulation. We can make the sum
extend over the whole fraction:
P(X ≤ a|Y = y) = [∑_{x≤a} P(X = x, Y = y)] / P(Y = y)
              = ∑_{x≤a} [P(X = x, Y = y) / P(Y = y)]
              = ∑_{x≤a} P(X = x|Y = y)
In fact it becomes straightforward to translate the rules of probability (such as Bayes' Theorem, the law of total probability, etc) to the language of discrete random variables: we simply need to recall that every relational operator applied to a random variable defines an event.
Let X be a continuous random variable and let N be a discrete random variable. The conditional probabilities of N given X, and of X given N, respectively are:

P(N = n|X = x) = f(X = x|N = n) · P(N = n) / f(X = x)
f(X = x|N = n) = P(N = n|X = x) · f(X = x) / P(N = n)
These equations might seem complicated since they mix probability densities and probabilities. Why should we believe that they are correct? First, observe that anytime the random variable on the left hand side of the conditional is continuous we use a density, and whenever it is discrete we use a probability. This result can be derived by making the observation:

P(X = x) = f(X = x) · ϵ_x

in the limit as ϵ_x → 0. The way to obtain a probability from a density function is to integrate under the function. If you wanted to approximate the probability that X = x you could consider the area created by a rectangle which has height f(X = x) and some very small width. As that width gets smaller, your answer becomes more accurate:
A value of ϵ_x is problematic if it is left in a formula. However, if we can get the ϵ terms to cancel, we can arrive at a working equation. This is the key insight used to derive the rules of probability in the context of one or more continuous random variables. Again, let X be a continuous random variable and let N be a discrete random variable:

P(N = n|X = x) = P(X = x|N = n) · P(N = n) / P(X = x)                    Bayes' Theorem
              = [f(X = x|N = n) · ϵ_x · P(N = n)] / [f(X = x) · ϵ_x]     Since P(X = x) = f(X = x) · ϵ_x
              = f(X = x|N = n) · P(N = n) / f(X = x)                     Cancel ϵ_x
This strategy applies beyond Bayes' Theorem. For example, here is a version of the Law of Total Probability when X is continuous and N is discrete:

f(X = x) = ∑_{n∈N} f(X = x|N = n) · P(N = n)

And here is the definition of conditional probability density when both variables are continuous:

f(X = x|Y = y) = f(X = x, Y = y) / f(Y = y)
Question: At birth, girl elephant weights are distributed as a Gaussian with mean 160kg and standard deviation 7kg. At birth, boy elephant weights are distributed as a Gaussian with mean 165kg and standard deviation of 3kg. All you know about a newborn elephant is that it weighs 163kg. What is the probability that it is a girl?

Answer: Let G be an indicator that the elephant is a girl, G ∼ Bern(p = 0.5). Let X be the weight of the elephant.

X|G = 1 ∼ N(μ = 160, σ² = 7²)
X|G = 0 ∼ N(μ = 165, σ² = 3²)
P(G = 1|X = 163) = f(X = 163|G = 1) · P(G = 1) / f(X = 163)        Bayes

If we can solve this equation we will have our answer. What is f(X = 163|G = 1)? It is the probability density function of a Gaussian with μ = 160 and σ² = 7², evaluated at the point x = 163:
f(X = 163|G = 1) = (1 / (σ√(2π))) · e^(−(1/2)·((x − μ)/σ)²)          PDF of a Gaussian
                = (1 / (7√(2π))) · e^(−(1/2)·((163 − 160)/7)²)       PDF of X at 163
Next we note that P(G = 0) = P(G = 1) = 1/2. Putting this all together, and using the law of total probability to expand the denominator:

P(G = 1|X = 163)
  = f(X = 163|G = 1) · P(G = 1) / f(X = 163)
  = f(X = 163|G = 1) · P(G = 1) / [f(X = 163|G = 1) · P(G = 1) + f(X = 163|G = 0) · P(G = 0)]
  = [(1/(7√(2π))) · e^(−(1/2)(3/7)²) · (1/2)] / [(1/(7√(2π))) · e^(−(1/2)(3/7)²) · (1/2) + (1/(3√(2π))) · e^(−(1/2)(2/3)²) · (1/2)]
  = [(1/7) · e^(−(1/2)(9/49))] / [(1/7) · e^(−(1/2)(9/49)) + (1/3) · e^(−(1/2)(4/9))]
  ≈ 0.328
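The same answer falls out of a few lines of scipy (a sketch):

from scipy import stats

f_girl = stats.norm.pdf(163, loc=160, scale=7)
f_boy = stats.norm.pdf(163, loc=165, scale=3)
print(f_girl * 0.5 / (f_girl * 0.5 + f_boy * 0.5))   # ≈ 0.328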
Bayesian Networks
At this point in the reader we have developed tools for analytically solving for probabilities. We can
calculate the likelihood of random variables taking on values, even if they are interacting with other
random variables (which we have called multi-variate models, or we say the random variables are jointly
distributed). We have also started to study samples and sampling.
Consider the WebMD Symptom Checker. WebMD has built a probabilistic model with random variables
which roughly fall under three categories: symptoms, risk factors and diseases. For any combination of
observed symptoms and risk factors, they can calculate the probability of any disease. For example, they
can calculate the probability that I have influenza given that I am a 21-year-old female who has a fever
and who is tired: P (I = 1|A = 21, G = 1, T = 1, F = 1). Or they could calculate the probability that I
have a cold given that I am a 30-year-old with a runny nose: P (C = 1|A = 30, R = 1). At first blush
this might not seem difficult. But as we dig deeper we will realize just how hard it is. There are two
challenges: (1) Modelling: sufficiently specifying the probabilistic model and (2) Inference: calculating
any desired probability.
Bayesian Networks
Before we jump into how to solve probability (aka inference) questions, let's take a moment to go over
how an expert doctor could specify the relationship between so many random variables. Ideally we could
have our expert sit down and specify the entire "joint distribution" (see the first lecture on multi-variable
models). She could do so either by writing a single equation that relates all the variables (which is as
impossible as it sounds), or she could come up with a joint distribution table where she specifies the
probability of any possible combination of assignments to variables. It turns out that is not feasible either.
Why? Imagine there are N = 100 binary random variables in our WebMD model. Our expert doctor would have to specify a probability for each of the 2^N > 10^30 combinations of assignments to those variables, which is approaching the number of atoms in the universe. Thankfully, there is a better way.
We can simplify our task if we know the "generative" process that creates a joint assignment. Based on
the generative process we can make a data structure known as a Bayesian Network. Here are two
networks of random variables for diseases:
For diseases the flow of influence is directed. The states of "demographic" random variables influence
whether someone has particular "conditions", which influence whether someone shows particular
"symptoms". On the right is a simple model with only four random variables. Though this is a less
interesting model it is easier to understand when first learning Bayesian Networks. Being in university
(binary) influences whether or not someone has influenza (binary). Having influenza influences whether
or not someone has a fever (binary) and the state of university and influenza influences whether or not
someone feels tired (also binary).
In a Bayesian Network an arrow from random variable X to random variable Y articulates our assumption that X directly influences the likelihood of Y. We say that X is a parent of Y. To fully define the Bayesian network we must provide a way to compute the probability of each random variable (X_i) conditioned on the values of its parents. Here are some of those conditional probabilities, defined for the simple disease model. Recall that each of the random variables is binary:

P(Uni = 1) = 0.8
P(Tired = 1|Uni = 0, Influenza = 0) = 0.1    P(Tired = 1|Uni = 0, Influenza = 1) = 0.9
P(Tired = 1|Uni = 1, Influenza = 0) = 0.8    P(Tired = 1|Uni = 1, Influenza = 1) = 1.0
Let's put this in programming terms. All that we need to do in order to code up a Bayesian network is to define a function: getProbXi(i, k, parents) which returns the probability that X_i (the random variable with index i) takes on the value k, given a value for each of the parents of X_i encoded by the argument parents.
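Here is a minimal sketch of what getProbXi could look like for the simple disease model. The probabilities for Influenza and Fever are hypothetical placeholders, since only the Uni and Tired values appear above:

P_TIRED = {  # (uni, influenza) -> P(Tired = 1 | parents)
    (0, 0): 0.1, (0, 1): 0.9,
    (1, 0): 0.8, (1, 1): 1.0,
}

def getProbXi(i, k, parents):
    # returns P(X_i = k | assignment to the parents of X_i)
    if i == "Uni":
        p_one = 0.8
    elif i == "Tired":
        p_one = P_TIRED[(parents["Uni"], parents["Influenza"])]
    else:
        raise NotImplementedError("add entries for Influenza and Fever")
    return p_one if k == 1 else 1 - p_one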
Deeper understanding: The reason that a Bayes Net is so useful is that the "joint" probability can be expressed in exponentially less space as the product of the probabilities of each random variable conditioned on its parents! Without loss of generality, let X_i refer to the ith random variable, ordered such that a parent always comes before its children. The Bayes net computes the joint as:

P(X_1 = x_1, …, X_n = x_n) = ∏_i P(X_i = x_i | values of the parents of X_i)

What assumptions are implicit in a Bayes Net? Using the chain rule we can decompose the exact joint probability for n random variables. To make the following math easier to digest I am going to use x_i as shorthand for the event X_i = x_i:

P(x_1, …, x_n) = P(x_1) · P(x_2|x_1) · P(x_3|x_1, x_2) ⋯ P(x_n|x_1, …, x_{n−1})

By looking at the difference in the two equations, we can see that a Bayes Net is assuming that

P(x_i|x_1, …, x_{i−1}) = P(x_i | values of the parents of X_i)

This is a conditional independence statement. It is saying that once you know the value of the parents of a variable in your network, X_i, any further information about non-descendents will not change your belief in X_i. Formally we say that X_i is conditionally independent of its non-descendents, given its parents. The descendents of X_i are the nodes in the subtree that starts at X_i. Everything else is a non-descendent. Non-descendents include the "ancestor" nodes of X_i as well as nodes which are totally unconnected to X_i. When designing Bayes Nets you don't have to think about this assumption directly. It turns out to be a naturally good assumption if the arrows between your nodes follow a causal path.
As you might have guessed, we can do steps (2) and (3) by hand, or we can have computers try to perform those tasks based on data. The first task is called "structure learning" and the second is an instance of "machine learning." There are fully autonomous solutions to structure learning -- but they only work well if you have a massive amount of data. Alternatively, people will often compute a statistic called correlation between all pairs of random variables to help in the art form of designing a Bayes Net.
In the next part of the reader we are going to talk about how we could learn P(X_i = x_i | values of the parents of X_i) from data. For now let's start with the (reasonable) assumption that an expert can provide these probabilities.
Next Steps
Great! We have a feasible way to define a large network of random variables. First challenge complete.
We haven't talked about continuous or multinomial random variables in Bayes Nets. None of the theory
changes: the expert will just have to define getProbXi to handle more values of k than 0 or 1.
A Bayesian network is not very interesting to us unless we can use it to solve different conditional
probability questions. How can we perform "inference" on a network as complex as a Bayesian network?
Independence in Variables
Discrete
Two discrete random variables X and Y are called independent if:

P(X = x, Y = y) = P(X = x) · P(Y = y)   for all x, y
Intuitively: knowing the value of X tells us nothing about the distribution of Y. If two variables are not independent, they are called dependent. This is conceptually similar to independent events, but we are dealing with multiple variables. Make sure to keep your events and variables distinct.
Continuous
Two continuous random variables X and Y are called independent if:

P(X ≤ a, Y ≤ b) = P(X ≤ a) · P(Y ≤ b)   for all a, b

This can be stated equivalently using either the CDF or the PDF:

F(X = x, Y = y) = F(X = x) · F(Y = y)   for all x, y
f(X = x, Y = y) = f(X = x) · f(Y = y)   for all x, y

More generally, if you can factor the joint density function then your random variables are independent (and similarly for the joint probability function for discrete random variables):

f(X = x, Y = y) = h(x) · g(y)
P(X = x, Y = y) = h(x) · g(y)
Let N be the # of requests to a web server/day and assume N ∼ Poi(λ). Each request comes from a human with probability p or from a "bot" with probability (1 − p). Define X to be the # of requests from humans/day and Y to be the # of requests from bots/day. Show that the number of requests from humans, X, is independent of the number of requests from bots, Y.
Since requests come in independently, the probability of X conditioned on knowing the number of
requests is a Binomial. Specifically:
(X|N ) ∼ Bin(N , p)
(Y |N ) ∼ Bin(N , 1 − p)
To get started we need to first write an expression for the joint probability of X and Y. To do so, we use the chain rule (note that if X = x and Y = y then it must be that N = x + y):

P(X = x, Y = y) = P(X = x, Y = y|N = x + y) · P(N = x + y)

We can calculate each term in this expression. The first term is the PMF of the binomial X|N having x "successes". The second term is the probability that the Poisson N takes on the value x + y:

P(X = x, Y = y|N = x + y) = ((x + y) choose x) · p^x · (1 − p)^y
P(N = x + y) = e^(−λ) · λ^(x+y) / (x + y)!
Now we can put those together and we have an expression for the joint:

P(X = x, Y = y) = ((x + y) choose x) · p^x · (1 − p)^y · e^(−λ) · λ^(x+y) / (x + y)!
At this point we have derived the joint distribution over X and Y. In order to show that these two are independent, we need to be able to factor the joint:

P(X = x, Y = y)
  = ((x + y) choose x) · p^x · (1 − p)^y · e^(−λ) · λ^(x+y) / (x + y)!
  = [(x + y)! / (x! · y!)] · p^x · (1 − p)^y · e^(−λ) · λ^(x+y) / (x + y)!
  = (1 / (x! · y!)) · p^x · (1 − p)^y · e^(−λ) · λ^(x+y)                   Cancel (x + y)!
  = [p^x · λ^x / x!] · [(1 − p)^y · λ^y / y!] · e^(−λ)                     Rearrange

Because the joint can be factored into a term that only has x and a term that only has y, the random variables are independent.
Symmetry of Independence
Independence is symmetric. That means that if random variables X and Y are independent, X is
independent of Y and Y is independent of X. This claim may seem meaningless but it can be very useful.
Imagine a sequence of events X_1, X_2, …. Let A_i be the event that X_i is a "record value" (e.g. it is larger than all of the values that came before it).
Expectation of Products
Lemma: Product of Expectation for Independent Random Variables:
If two random variables X and Y are independent, the expectation of their product is the product of the individual expectations:

E[X · Y] = E[X] · E[Y]   if X and Y are independent
Note that this assumes that X and Y are independent. Contrast this to the sum version of this rule
(expectation of sum of random variables, is the sum of individual expectations) which does not require
the random variables to be independent.
Correlation
Covariance
Covariance is a quantitative measure of the extent to which the deviation of one variable from its mean matches the deviation of the other from its mean. It is a mathematical relationship that is defined as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

That is a little hard to wrap your mind around (but worth pushing on a bit). The outer expectation will be a weighted sum of the inner function evaluated at a particular (x, y), weighted by the probability of (x, y). If x and y are both above their respective means, or if x and y are both below their respective means, that term will be positive. If one is above its mean and the other is below, the term is negative. If the weighted sum of terms is positive, the two random variables will have a positive correlation. We can rewrite the above equation to get an equivalent equation:

Cov(X, Y) = E[XY] − E[X] · E[Y]
Using this equation (and the product lemma) it is easy to see that if two random variables are independent, their covariance is 0:

Cov(X, Y) = E[XY] − E[X]E[Y] = E[X]E[Y] − E[X]E[Y] = 0

Note that the reverse claim is not true in general: a covariance of 0 does not prove independence.
Properties of Covariance
Say that X and Y are arbitrary random variables:

Cov(X, Y) = Cov(Y, X)
Cov(X, X) = E[X²] − E[X]E[X] = Var(X)
Cov(aX + b, Y) = a · Cov(X, Y)

And if X = ∑_{i=1}^n X_i and Y = ∑_{j=1}^m Y_j are each sums of random variables:

Cov(X, Y) = ∑_{i=1}^n ∑_{j=1}^m Cov(X_i, Y_j)
That last property gives us a third way to calculate variance. We can use it to, again, show how to get the
variance of a Binomial.
Correlation
We left off last class talking about covariance. Covariance was interesting because it was a quantitative
measurement of the relationship between two variables. Today we are going to extend that concept to
correlation. Correlation between two random variables, ρ(X, Y ) is the covariance of the two variables
normalized by the variance of each variable. This normalization cancels the units out:
ρ(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

Correlation always lies in the range [−1, 1]. The extremes correspond to perfect linear relationships:
ρ(X, Y) = 1 when Y = aX + b where a = σ_y/σ_x
ρ(X, Y) = −1 when Y = aX + b where a = −σ_y/σ_x
When people use the term correlation, they are actually referring to a specific type of correlation called
"Pearson" correlation. It measures the degree to which there is a linear relationship between the two
variables. An alternative measure is "Spearman" correlation which has a formula almost identical to your
regular correlation score, with the exception that the underlying random variables are first transformed
into their rank. "Spearman" correlation is outside the scope of CS109.
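A small simulation shows the definition in action (a sketch; the linear relationship Y = 2X + noise is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)              # Y is a noisy linear function of X

cov = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X - E[X])(Y - E[Y])]
rho = cov / (x.std() * y.std())
print(round(rho, 2))   # close to 2/sqrt(5) ≈ 0.89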
General Inference
A Bayesian Network gives us a reasonable way to specify the joint probability of a network of many
random variables. Before we celebrate, realize that we still don't know how to use such a network to
answer probability questions. There are many techniques for doing so. I am going to introduce you to one
of the great ideas in probability for computer science: we can use sampling to solve inference questions
on Bayesian networks. Sampling is frequently used in practice because it is relatively easy to understand
and easy to implement.
Rejection Sampling
As a warmup consider what it would take to sample an assignment to each of the random variables in our Bayes net. Such a sample is often called a "joint sample" or a "particle" (as in a particle of sand). To sample a particle, simply sample a value for each random variable one at a time based on the values of the random variable's parents. This means that if X_i is a parent of X_j, you will have to sample a value for X_i before you sample a value for X_j.
Let's work through an example of sampling a "particle" for the Simple Disease Model in the Bayes Net
section:
Thus the sampled particle is: [Uni = 1, Influenza = 0, Fever = 0, Tired = 0]. If we were to run the process
again we would get a new particle (with likelihood determined by the joint probability).
Now our strategy is simple: we are going to generate N samples where N is in the hundreds of thousands (if not millions). Then we can compute probability queries by counting. Let N(X = k) be notation for the number of particles where the random variables X take on the values k. Recall that the bold notation X means that X is a vector with one or more elements. By the "frequentist" definition of probability:

P(X = k) = N(X = k) / N
Counting for the win! But what about conditional probabilities? Well, using the definition of conditional probability, we can see it's still some pretty straightforward counting:

P(X = a|Y = b) = P(X = a, Y = b) / P(Y = b) = [N(X = a, Y = b) / N] / [N(Y = b) / N] = N(X = a, Y = b) / N(Y = b)
Let's take a moment to recognize that this is straight-up fantastic. General inference based on analytic probability (math without samples) is hard even given a Bayesian network (if you don't believe me, try to calculate the probability of flu conditioning on one demographic and one symptom in the Full Disease Model). However if we generate enough samples we can calculate any conditional probability question by reducing our samples to the ones that are consistent with the condition (Y = b) and then counting how many of those are also consistent with the query (X = a). Here is the algorithm in pseudocode:
N = 10000

# "query" is the assignment to variables we want probabilities for
# "condition" is the assignment to variables we will condition on
def get_any_probability(query, condition):
    particles = generate_many_joint_samples(N)
    cond_particles = reject_non_consistent_samples(particles, condition)
    K = count_consistent_samples(cond_particles, query)
    return K / len(cond_particles)
This algorithm is sometimes called "Rejection Sampling" because it works by generating many particles
from the joint distribution and rejecting the ones that are not consistent with the set of assignments we are
conditioning on. Of course this algorithm is an approximation, though with enough samples it often
works out to be a very good approximation. However, in cases where the event we're conditioning on is
rare enough that it doesn't occur after millions of samples are generated, our algorithm will not work. The
last line of our code will result in a divide by 0 error. See the next section for solutions!
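Here is a sketch of what the joint-sampling step could look like for the simple disease model. The conditional probabilities for Influenza and Fever are hypothetical placeholders:

import random

def sample_particle():
    # sample each variable given its parents, in topological order
    uni = int(random.random() < 0.8)
    flu = int(random.random() < (0.2 if uni else 0.05))      # assumed CPT values
    fever = int(random.random() < (0.9 if flu else 0.05))    # assumed CPT values
    p_tired = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 1.0}[(uni, flu)]
    tired = int(random.random() < p_tired)
    return {"Uni": uni, "Influenza": flu, "Fever": fever, "Tired": tired}

particles = [sample_particle() for _ in range(100000)]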
Rejection sampling runs into trouble when the Bayes net contains continuous random variables. Suppose that fever is continuous: for healthy adults it is a Gaussian centered at 98.3 and for adults with the flu it is a Gaussian centered at 100.0:

∴ f(Fever = x|Influenza = 0) = (1 / √(2π · 0.7)) · e^(−(x − 98.3)² / (2 · 0.7))
∴ f(Fever = x|Influenza = 1) = (1 / √(2π · 1.8)) · e^(−(x − 100.0)² / (2 · 1.8))
Drawing samples (aka particles) is still straightforward. We apply the same process until we get to the
step where we sample a value for the Fever random variable (in the example from the previous section
that was step 3). If we had sampled a 0 for influenza we draw a value for fever from the normal for
healthy adults (which has μ = 98.3). If we had sampled a 1 for influenza we draw a value for fever from
the normal for adults with the flu (which has μ = 100.0). The problem comes in the "rejection" stage of
joint sampling.
When we sample values for fever we get numbers with infinite precision (eg 100.819238 etc). If we
condition on someone having a fever equal to 101 we would reject every single particle. Why? No
particle will have exactly a fever of 101.
There are several ways to deal with this problem. One especially easy solution is to be less strict when
rejecting particles. We could round all fevers to whole numbers.
There is an algorithm called "Likelihood Weighting" which sometimes helps, but which we don't cover in
CS109. Instead, in class we talked about a new algorithm called Markov Chain Monte Carlo (MCMC)
that allowed us to sample from the "posterior" probability: the distribution of random variables after
(post) us fixing variables in the conditioned event. The version of MCMC we talked about is called Gibbs
Sampling. While I don't require that students in CS109 know how to implement Gibbs Sampling, I
wanted everyone to know that it exists and that it isn't beyond your capabilities. If you need to use it, you
can learn it given the knowledge you have now.
MCMC does require more math than Joint Sampling. For every random variable you will need to specify
how to calculate the likelihood of assignments given the variable's: parents, children and parents of its
children (a set of variables cozily called a "blanket"). Want to learn more? Take CS221 or CS228!
Thoughts
While there are slightly-more-powerful "general inference algorithms" that you will get to learn in the
future, it is worth recognizing that at this point we have reached an important milestone in CS109. You
can take very complicated probability models (encoded as Bayesian networks) and can answer general
inference queries on them. To get there we worked through the concrete example of predicting disease.
While the WebMD website is great for home users, similar probability models are being used in
thousands of hospitals around the world. As you are reading this general inference is being used to
improve health care (and sometimes even save lives) for real human beings. That's some probability for
computer scientists that is worth learning. What if we don't have an expert? Could we learn those
probabilities from data? Jump to part 5 to answer that question.
These examples have also led to necessary research in the growing field of algorithmic fairness. How can we demonstrate, or prove, that an algorithm is behaving in a way that we think is appropriate? What is fair? Clearly these are complex questions and are deserving of a complete conversation. This example is simple for the purpose of giving an introduction to the topic.
ML stands for Machine Learning. Solon Barocas and Moritz Hardt, "Fairness in Machine Learning",
NeurIPS 2017
What is Fairness?
An artificial intelligence algorithm is going to be used to make a binary prediction (G for guess) for
whether a person will repay a loan. The question has come up: is the algorithm "fair" with respect to a
binary protected demographic (D for demographic)? To answer this question we are going to analyze
predictions the algorithm made on historical data. We are then going to compare the predictions to the
true outcome (T for truth). Consider the following joint probability table from the history of the
algorithm’s predictions:
         D = 0              D = 1
      G = 0   G = 1      G = 0   G = 1
T = 0  0.21    0.32       0.01    0.01
T = 1  0.07    0.28       0.02    0.08

Of course, representing a demographic with a single binary variable could be problematic (some people are mixed ethnicity, etc), which gives you a hint that we are just scratching the surface in our conversation about fairness. Let's use this joint probability to learn about some of the common definitions of fairness.
First, the probability of each demographic:

P(D = 1) = ∑_{j∈{0,1}} ∑_{k∈{0,1}} P(D = 1, G = j, T = k) = 0.12
P(D = 0) = ∑_{j∈{0,1}} ∑_{k∈{0,1}} P(D = 0, G = j, T = k) = 0.88

Note that P(D = 0) + P(D = 1) = 1. That implies that the demographics are mutually exclusive.
Does the algorithm satisfy parity? Parity requires the algorithm to be equally likely to guess 1 for each demographic: P(G = 1|D = 1) = P(G = 1|D = 0).

P(G = 1|D = 1) = P(G = 1, D = 1) / P(D = 1)                                    Cond. Prob.
             = [P(G = 1, D = 1, T = 0) + P(G = 1, D = 1, T = 1)] / P(D = 1)    Prob. of or
             = (0.01 + 0.08) / 0.12 = 0.75                                     From joint
P(G = 1|D = 0) = P(G = 1, D = 0) / P(D = 0)                                    Cond. Prob.
             = [P(G = 1, D = 0, T = 0) + P(G = 1, D = 0, T = 1)] / P(D = 0)    Prob. of or
             = (0.32 + 0.28) / 0.88 ≈ 0.68                                     From joint
No. Since P (G = 1|D = 1) ≠ P (G = 1|D = 0) this algorithm does not satisfy parity. It is more likely
to guess 1 when the demographic indicator is 1.
What about calibration: is the algorithm equally accurate for the two demographics, P(G = T|D = 0) = P(G = T|D = 1)?

P(G = T|D = 0) = P(G = 1, T = 1|D = 0) + P(G = 0, T = 0|D = 0)
             = (0.28 + 0.21) / 0.88 ≈ 0.56

P(G = T|D = 1) = P(G = 1, T = 1|D = 1) + P(G = 0, T = 0|D = 1)
             = (0.08 + 0.01) / 0.12 = 0.75

No: P(G = T|D = 0) ≠ P(G = T|D = 1)
A third definition asks: among the people who truly repay (T = 1), is the algorithm equally likely to guess 1 for each demographic?

P(G = 1|D = 1, T = 1) = P(G = 1, D = 1, T = 1) / P(D = 1, T = 1)
                     = 0.08 / (0.08 + 0.02) = 0.8

P(G = 1|D = 0, T = 1) = P(G = 1, D = 0, T = 1) / P(D = 0, T = 1)
                     = 0.28 / (0.28 + 0.07) = 0.8

Yes: by this definition the algorithm is fair, since both probabilities are 0.8.
Which of these definitions seems right to you? It turns out it can be proven that these three cannot be jointly optimized; this is called the Impossibility Theorem of Machine Fairness. In other words, any AI system we build will necessarily violate some notion of fairness. For a deeper treatment of the subject, here is a useful summary of recent research: Pessach et al., Algorithmic Fairness.
Gender Shades
In 2018, Joy Buolamwini and Timnit Gebru had a breakthrough result called "gender shades" published in the first conference on Fairness, Accountability and Transparency in ML [1]. They showed that facial recognition algorithms, which had been deployed to be used by Facebook, IBM and Microsoft, were substantially better at making predictions (in this case classifying gender) when looking at lighter skinned men than at darker skinned women. Their work exposed several shortcomings in production AI: biased datasets, optimizing for average accuracy (which means that the majority demographic gets most weight), lack of awareness of intersectionality, and more. Let's take a look at some of their results.
Figure by Joy Buolamwini and Timnit Gebru. Facial recognition algorithms perform very differently
depending on who they are looking at. [1]
Timnit and Joy looked at three classifiers trained to predict gender, and computed several statistics. Let's take a look at one statistic, accuracy, for one of the facial recognition classifiers, IBM's.

Using the language of fairness, accuracy measures P(G = T). The accuracy when looking at photos of women is P(G = T|D = Women). It is easy to show that these production level systems were terribly "uncalibrated":
Why should we care about calibration and not the other definitions of fairness? In this case the classifier
was making a prediction of gender where a positive prediction (say predicting women) doesn't have a
directly associated reward as in our above example, where we were predicting if someone should receive
a loan. As such the most salient idea is: is the algorithm just as accurate for different genders
(calibration)?
The lack of calibration between men/women and lighter/darker skinned photos is an issue. What Joy and Timnit showed next was that the problem becomes even worse when you look at intersectional demographics.

If the algorithms were "fair" according to calibration you would expect the accuracy to be the same regardless of demographics. Instead there is a 34.4 percentage point difference:

P(G = T|D = Darker Woman) = 65.3% compared to P(G = T|D = Lighter Man) = 99.7%
Ways Forward?
Wadsworth et al. Achieving Fairness through Adversarial Learning
We estimated p_i, the probability that Hamilton generates the word i (assuming each word is chosen independently of previous and future choices of words). Similarly we estimated q_i, the probability that Madison generates the word i. For each word i we observe the number of times that word occurs in Federalist Paper 49 (we call that count c_i). We assume that, given no evidence, the paper is equally likely to be written by Madison or Hamilton.
Define three events: H is the event that Hamilton wrote the paper, M is the event that Madison wrote the
paper, and D is the event that a paper has the collection of words observed in Federalist Paper 49. We
would like to know whether P (H |D) is larger than P (M |D). This is equivalent to trying to decide if
P (H |D)/P (M |D) is larger than 1.
The event D|H is a multinomial parameterized by the values p. The event D|M is also a multinomial, this time parameterized by the values q.

P(H|D) / P(M|D) = [P(D|H) · P(H)] / [P(D|M) · P(M)]
               = [(n choose c_1, c_2, …, c_m) · ∏_i p_i^c_i] / [(n choose c_1, c_2, …, c_m) · ∏_i q_i^c_i]
               = ∏_i p_i^c_i / ∏_i q_i^c_i
This seems great! We have our desired probability statement expressed in terms of a product of values we
have already estimated. However, when we plug this into a computer, both the numerator and
denominator come out to be zero. The product of many numbers close to zero is too hard for a computer
to represent. To fix this problem, we use a standard trick in computational probability: we apply a log to
both sides and apply some basic rules of logs.
log(P(H|D) / P(M|D)) = log(∏_i p_i^c_i / ∏_i q_i^c_i)
                    = log(∏_i p_i^c_i) − log(∏_i q_i^c_i)
                    = ∑_i log(p_i^c_i) − ∑_i log(q_i^c_i)
                    = ∑_i c_i·log(p_i) − ∑_i c_i·log(q_i)
This expression is "numerically stable" and my computer returned that the answer was a negative number. We can use exponentiation to solve for P(H|D)/P(M|D). Since e raised to a negative number is smaller than 1, this implies that P(H|D)/P(M|D) is smaller than 1. As a result, we conclude that Madison was more likely to have written Federalist Paper 49. That is the standing assumption currently made by historians!
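Here is a sketch of that numerically stable computation, where counts, p and q are parallel lists of the word counts and the two authors' estimated word probabilities:

import math

def log_likelihood_ratio(counts, p, q):
    # sum_i c_i log(p_i) - sum_i c_i log(q_i)
    return sum(c * (math.log(pi) - math.log(qi))
               for c, pi, qi in zip(counts, p, q))

# a negative return value means the Madison hypothesis is more likely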
Name to Age
Because of shifting patterns in name popularity, a person's name is a hint as to their age. The United States publishes a dataset which contains counts of how many US residents were born with a given name in a given year, based on Social Security applications. We can use inference to compute the reverse probability distribution: an updated belief in a person's age, given their name. As a reminder, if I know the year someone was born, I can calculate their age within one year.
This demo is based on real data from US Social Security applications between 1914 and 2014. Thank you
to https://www.kaggle.com/kaggle/us-baby-names for compiling the data. Download Data .
Computation
The US Social Security applications data provides you with a function: count(year, name) which
returns the number of US citizens, born in a given year with a given name. You also have access to a list
names which has each name ever given in the US and years which has all the years. This function is
implicitly giving us the joint probability over names and birth year. The probability of a joint assignment
to name and birth year can be estimated as the count of people with that name, born on that year, over the
total number of people in the dataset. Let B be the year someone is born, and let N be their name:
P(B = b, N = n) ≈ count(b, n) / ∑_{i∈names} ∑_{j∈years} count(i, j)
The question we would really like to answer is: what is your belief that a resident was born in 1950,
given that their name is Gary?
We can get started by applying the definition of conditional probability for random variables:

P(B = 1950|N = Gary) = P(B = 1950, N = Gary) / P(N = Gary)

But this leaves one term to compute, P(N = Gary), which we can compute using marginalization:

P(N = Gary) = ∑_{y∈years} P(B = y, N = Gary)
           ≈ [∑_{y∈years} count(y, Gary)] / [∑_{i∈names} ∑_{j∈years} count(i, j)]
Putting the two pieces together, the big normalization sums cancel:

P(B = 1950|N = Gary) ≈ [count(1950, Gary) / ∑_{i∈names} ∑_{j∈years} count(i, j)] / [∑_{y∈years} count(y, Gary) / ∑_{i∈names} ∑_{j∈years} count(i, j)]
                    = count(1950, Gary) / ∑_{y∈years} count(y, Gary)
More generally, for any name, we can compute the conditional probability mass function over birth year
B:
P(B = b|N = n) ≈ count(b, n) / ∑_{y∈years} count(y, n)
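A sketch of this computation in Python, using the count(year, name) function and the years list described above:

def prob_birth_year_given_name(b, n):
    # P(B = b | N = n) ≈ count(b, n) / sum over years of count(y, n)
    total = sum(count(y, n) for y in years)
    return count(b, n) / total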
Assumptions
This problem makes many assumptions which are worth highlighting. In fact, any time we make
generalizations (especially about demographics) based on sparse information we should tread lightly.
Here are the assumptions that I can think of:
1. This data only is accurate for names of people in the US. The probability of age given names could be
very different in other countries.
2. The Social Security data is not perfect. It does not capture all people who are resident in the US, and there are demographics which are underrepresented. This will also skew our results.
Here are some example names, each with its most likely age and the probability of that age given the name:

Name       Most likely age   Probability
Katina           49             0.245
Marquita         38             0.233
Ashanti          19             0.250
Miley            13             0.250
Aria              7             0.247
Debbie           62             0.104
Whitney          35             0.098
Chelsea          29             0.103
Aidan            18             0.098
Addison          14             0.112
A search for "Katina 1972" brought up this interesting article about a baby named Katina in a 1972 CBS
Soap Opera. Marquita's popularity was likely from a 1983 toothpase add. Ashanti Douglas and Miley
Cirus were popular singers in 2002 and 2008 respectively.
Further Reading
Some names don't seem to have enough data to make good probability estimates. Can we quantify our uncertainty in such probability estimates? For example, if a name has only 10,000 entries in the database, of which only 100 were born in the year 1950, how confident are we that the true probability for 1950 is 100/10000 = 0.01? One way to express our uncertainty would be through a Beta Distribution. In this scenario we could represent our belief in the probability for 1950 as X ∼ Beta(a = 101, b = 9901), reflecting that we have seen 100 people born in 1950 and 9900 people who were not. We can plot that belief, zoomed into the range [0, 0.03]:
We can now ask questions such as: what is the probability that X is within 0.002 of 0.01?

P(0.008 < X < 0.012) = F_X(0.012) − F_X(0.008)
                     = 0.966 − 0.013
                     = 0.953
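The same number can be computed with scipy's Beta CDF (a sketch):

from scipy import stats

belief = stats.beta(a=101, b=9901)
print(belief.cdf(0.012) - belief.cdf(0.008))   # ≈ 0.953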
Semantically this leads to the claim that, after observing 100 births with a name in 1950, out of 10,000
births with that name over the whole dataset, there is a 95% chance that the probability of someone being
born in 1950 is 0.010 ± 0.002.
Note: this demo was created in 2023 and the age reported is relative to that year! This chapter only has
historical C14 rates from 10,000 years ago and as such is not able to estimate age when there are fewer
than 350 molecules of C14 in the sample.
Carbon dating allows us to know the age of things that used to be alive, like dinosaur bones.
Consider a single C14 molecule, and let T be the time until it decays. T is exponentially distributed with mean life 8267 years, T ∼ Exp(λ = 1/8267). What is the probability that it decays within 750 years?

P(T ≤ 750) = 1 − e^(−λ·750)        Exp. CDF
          = 1 − e^(−750/8267)
          = 0.0867
That is only for a single molecule. Since C14 molecules decay independently, it is not much harder to think of how many are left out of a larger initial count of C14. A particular sample started with 1000 molecules. What is the probability that exactly 900 are left after 750 years? This is equivalent to the event that exactly 100 molecules have decayed. Let X ∼ Bin(n = 1000, p = 0.0867) be the number of decayed molecules:

P(X = 100) = (1000 choose 100) · 0.0867^100 · (1 − 0.0867)^900 ≈ 0.0144
Let's generalize. Define M to be a random variable for the number of molecules left, and A to be the age of the sample. The probability P(M = m|A = i) of having m remaining C14 molecules, given that the artifact is i years old, will be equal to P(X = n − m), where n is the starting number of C14 molecules, p = 1 − e^(−i/8267), and X ∼ Bin(n, p) is the count of decayed C14 molecules.
For your prior belief you know that the sample must be between A = 100 and A = 10000 inclusive and
you assume that every year in that range is equally likely.
This is a perfect case for Bayes' theorem. However instead of updating our belief in an event, like we did in Part 1, we are updating the belief over all the values that a random variable can take on, a process called inference. Here is the generalized version of Bayes' theorem for inferring age, A:

P(A = i|M = 900) = P(M = 900|A = i) · P(A = i) / P(M = 900)
                = P(M = 900|A = i) · K

The critical part of the last line was to recognize that P(A = i)/P(M = 900) is a constant, K, with respect to i. The term P(A = i) is constant as our prior over A was uniform. We could compute the value of P(M = 900) explicitly using the law of total probability. In code this is most easily implemented by computing all values of P(M = 900|A = i) and normalizing, as K will be the value that makes all of the values P(A = i|M = 900) sum to 1.
Fluctuating History
The amount of C14 in the atmosphere fluctuates over time; it is not a constant baseline! Here is the delta of C14 (per 1000 molecules) that you would have found if the object died a different number of years ago. To incorporate this information we simply start our binomial with 1000 molecules plus the delta for the year, downloaded from a public dataset [1]:

This offset is archeology theory, not probability theory. We include it in this chapter because otherwise our code would give an incorrect prediction. Also, it gives the posterior a really interesting shape (see the demo).
Python Code
The math, derived above, leads to the following Python code for a function inference(m) which returns the probability mass function for age A, given an observation of m C14 molecules in a sample that should have 1000 molecules were it alive today. Notice the use of normalization to avoid explicitly computing the prior or P(M = m) from Bayes' theorem.
C14_MEAN_LIFE = 8267

def normalize(prob_dict):
    # first compute the sum of the probability
    sum = 0
    for key, pr in prob_dict.items():
        sum += pr
    # then divide each probability by that sum
    for key, pr in prob_dict.items():
        prob_dict[key] = pr / sum
    # now the probabilities sum to 1 (aka are normalized)

def delta_start(age):
    """
    The amount of atmospheric C14 is not the same every
    year. If the sample died "age" years ago, then it would
    have started with slightly more, or slightly less than
    1000 C14 molecules. We can look this value up from the
    IntCal database. See the next section!
    """
    return historical_c14_delta[age]
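Here is a minimal sketch of what inference(m) could look like, built from the pieces above (the 100 to 10,000 year range follows the uniform prior described earlier; the helper likelihood is our own name):

import math

def likelihood(m, age):
    # P(M = m | A = age): of the n starting molecules, n - m have decayed
    n = 1000 + delta_start(age)
    p = 1 - math.exp(-age / C14_MEAN_LIFE)
    return math.comb(n, n - m) * p**(n - m) * (1 - p)**m

def inference(m):
    # posterior P(A = age | M = m) for each age in the uniform prior range
    belief = {}
    for age in range(100, 10001):
        belief[age] = likelihood(m, age)
    normalize(belief)  # normalizing stands in for dividing by P(M = m)
    return belief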
Perhaps the most fascinating extension is modelling "stratigraphic" relationships. Often in archaeological sites, you can know the relative age of artifacts based on their position in sediment. This requires a joint model of the age of each artifact with the constraint that you *know* some are older than others. Inference can then be performed using a general inference technique (often MCMC) and will be much more accurate.
Binomial Approximations?
Could we have used an approximation for the binomial PMF calculation? In Bayesian carbon dating both a normal and a Poisson approximation are appropriate. The decay binomial X ∼ Bin(n, p) is well approximated by either a Poisson with λ = n · p or a Gaussian with μ = n · p and σ² = n · p · (1 − p). This could be used to speed up calculations. Let's rework the example where we had X ∼ Bin(n = 1000, p = 0.0867). We computed that P(X = 100) = 0.0144.
Poisson Approximation:

Y ∼ Poi(λ = 86.7)
P(Y = 100) = scipy.stats.poisson.pmf(100, 86.7) ≈ 0.0151
Normal Approximation:

Y ∼ N(μ = 86.7, σ² = 79.2)
P(Y = 100) ≈ P(99.5 < Y < 100.5) ≈ 0.0146
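As a quick check, both approximations take a couple of lines (a sketch assuming scipy is available; the normal value uses a continuity correction):

from scipy.stats import poisson, norm

print(poisson.pmf(100, 86.7))  # Poisson approximation, ~0.0151
# normal approximation with continuity correction, ~0.0146
print(norm.cdf(100.5, 86.7, 79.2 ** 0.5) - norm.cdf(99.5, 86.7, 79.2 ** 0.5))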
You can find a demo of the Stanford Acuity Test here: https://myeyes.ai/. Look out for the bar icon to
see the belief distribution change as the test progresses.
There are two primary tasks in a digital vision test: (1) based on the patient responses, infer their ability
to see and (2) select the next font size to show to the patient.
The prior probability mass function for A, written as P(A = a), represents our belief that A takes on the
value of a, before we have seen any observations about the patient. This prior belief comes from the
natural distribution of how well people can see. To make our algorithm most accurate, the prior should
best reflect our patient population. Since our eye test is built for doctors in an eye hospital, we used
historical data from eye hospital visits to build our prior. Here is P (A = a) as a graph:
1
This is a header
Here is that exact same probability mass function written as a table. A table representation is possible
because of our choice to discretize A. In code we can access P(A = a) as a dictionary lookup,
belief[a] where belief stores the whole probability mass function:
Observations
Once the patient starts the test, you will begin collecting observations. Consider this first observation, obs_1, where the patient was shown a letter with font size 0.7 and answered the question incorrectly. We can represent this observation as a tuple with font size and correctness. Mathematically this could be written as obs_1 = [0.7, False]. In code this observation could be stored as a dictionary:
obs_1 = {
    "font_size": 0.7,
    "is_correct": False
}
Inferring Ability
Our first major task is to write code which can update our probability mass function for A based on observations. First let us consider how to update our belief in ability to see from a single observation, obs (aside: formally this is the event that the random variable Obs takes on the value obs). We can use Bayes' Theorem for random variables:
P(A = a|obs) = P(obs|A = a) · P(A = a) / P(obs)
This will be computed inside a for loop for each assignment a to ability to see. How can we compute each term in the Bayes' Theorem expression? We already have values for the prior P(A = a), and we can compute the denominator P(obs) using the Law of Total Probability:

P(obs) = Σ_a P(obs|A = a) · P(A = a)

Notice how the terms in this new expression for P(obs) already show up in the numerator of our Bayes' Theorem equation. As such, in code we are going to (1) compute the numerator for every value of a and store it as the value of belief, (2) compute the sum of all of those terms, and (3) divide each value of belief by the sum. The process of doing steps 2 and 3 is also known as normalization:
def normalize(belief):
    # in place normalization of a belief dictionary
    total = belief_sum(belief)
    for key in belief:
        belief[key] /= total

def belief_sum(belief):
    # get the sum of probability mass for a discrete belief
    total = 0
    for key in belief:
        total += belief[key]
    return total
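For concreteness, here is a minimal sketch of the update_belief function referenced below, assuming the calc_likelihood(a, obs) function that we define in the next section:

def update_belief(belief, obs):
    # multiply the prior belief for each ability by the likelihood of obs
    for a in belief:
        belief[a] *= calc_likelihood(a, obs)
    # renormalize so that the posterior sums to 1
    normalize(belief)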
At this point we have an expression, and corresponding code, to update our belief in ability to see given an observation. However we are missing a way to compute P(obs|A = a). In our code this expression is the currently undefined calc_likelihood(a, obs) function. In the next section we will go over how to compute this "likelihood" function. Before we do so, let's take a look at the result of applying update_belief for a patient with the single observation obs_1 defined above.
obs_1 says that this patient got a rather large letter (font size 0.7) incorrect. As such, in our posterior we think they can't see very well, though we have a lot of uncertainty as it has only been one observation. This belief is expressed in our updated probability mass function for A, P(A = a|obs_1), called the posterior. Here is what the posterior looks like for obs_1. Note that the posterior P(A = a|obs_1) is still a probability mass function:

The posterior belief on ability to see given a patient incorrectly identified a letter with font size 0.7. It shows a belief that the patient can't see very well.
Likelihood Function
We are not done yet! We have not yet said how we will compute P(obs|A = a). In Bayes' Theorem this term is called the "likelihood." The likelihood for our eye exam will be a function that returns back probabilities for inputs of a and obs. In Python this will be a function calc_likelihood(a, obs). In this function obs is a single observation such as obs_1 described above. Imagine a concrete call to the likelihood function below. This call will return back the probability that a person who has true ability to see of 0.5 would get a letter of font size 0.7 incorrect.
# get an observation
obs = {
    "font_size": 0.7,
    "is_correct": False
}

# calculate likelihood for obs given a, P(obs | A = a)
calc_likelihood(a=0.5, obs=obs)
Before going any further, let's make two critical notes about the likelihood function:
Note 1: When computing the likelihood term, P(obs|A = a), we do not have to estimate A as it shows up on the right hand side of the conditional. In the likelihood term we are told exactly how well the person can see. Their vision is truly a. Do not be fooled by the fact that a is a (non-random) variable. When computing the likelihood function this variable will have a numeric value.
Note 2: The variable obs represents a single patient interaction. It has two parts: a font size and a boolean for whether the patient got the letter correct. However, we don't think of font size as being a random variable. Instead we think of it as a constant which has been fixed by the computer. As such, P(obs|A = a) can be simplified to P(correct|A = a), where "correct" is shorthand for the event that the random variable for the patient's answer takes on the observed value:

P(obs|A = a) = P(correct|A = a)    (f is a constant)
Defining the likelihood function P(correct|A = a) involves more medical and education theory than
probability theory. You don't need to know either for this course! But it is still neat to learn and without
the likelihood function we won't have complete code. So, let's dive in.
A very practical starting point for the likelihood function for a vision test comes from a classic education
model called "Item Response Theory", also known as IRT. IRT assumes the probability that a student
with ability a gets a question with difficulty d correct is governed by the easy to compute function:
P(Correct = True|a) = 1 / (1 + e^(−(a−d)))
where e is the natural base constant and sigmoid(x) = 1 / (1 + e^(−x)). The sigmoid function is a handy function which takes in any real valued input and returns a corresponding value in the range [0, 1].
This IRT model introduces a new constant: difficulty of a letter d. How difficult is it to correctly respond
to a letter with a given font size? The simplest way to model difficulty, while accounting for the fact that
large font sizes are easier than small ones, is to define the difficulty of a letter with font size f to be
d = 1 − f . Plugging this in:
P(Correct = True|a) = sigmoid(a − [1 − f])
                    = sigmoid(a − 1 + f)
                    = 1 / (1 + e^(−(a−1+f)))
We now have a complete, if simplistic, likelihood function! In code it would look like this:
import math

def sigmoid(x):
    # the classic squashing function. All outputs are [0,1]
    return 1 / (1 + math.exp(-x))
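Here is a minimal sketch of calc_likelihood built on the IRT model above (the dictionary keys match obs_1):

def calc_likelihood(a, obs):
    # IRT probability that the patient answers this letter correctly
    p_correct = sigmoid(a - 1 + obs["font_size"])
    if obs["is_correct"]:
        return p_correct
    else:
        return 1 - p_correct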
Note that Item Response Theory returns the probability that a patient answers a letter correctly. In the code above, notice what we do if the patient instead gets the letter incorrect: we return one minus the probability of a correct answer.
In the published version of the Stanford Acuity Test we extend Item Response Theory in several ways.
We have a term for the probability that a patient gets the answer correct by random guessing as well as a
term that they make a mistake, aka "slip", even though they know the correct answer. We also observed
that a Floored Exponential seems to be a more accurate function than the sigmoid. These extensions are
beyond the scope of this chapter as they are not central to the probability insight. For more details see the
original paper [1].
Multiple Observations
What if you have multiple observations? For multiple observations the only term that will change will be the likelihood term P(Observations|A = a). We assume that each observation is independent, conditioned on ability to see. Formally, for observations obs_1, …, obs_n:

P(obs_1, …, obs_n|A = a) = Π_{i=1}^{n} P(obs_i|A = a)
As such the likelihood of all observations will be the product of the likelihood of each observation on its
own. This is equivalent mathematically to calculating the posterior for one observation and calling the
posterior your new prior.
def main():
    """
    Compute your belief in how well someone can see based
    off an eye exam with 20 questions at different fonts
    """
    belief_a = load_prior_from_file()
    observations = get_observations()
    for obs in observations:
        update_belief(belief_a, obs)
    plot(belief_a)

def sigmoid(x):
    # the classic squashing function. All outputs are [0,1]
    return 1 / (1 + math.exp(-x))

def normalize(belief):
    # in place normalization of a belief dictionary
    total = belief_sum(belief)
    for key in belief:
        belief[key] /= total

def belief_sum(belief):
    # get the sum of probability mass for a discrete belief
    total = 0
    for key in belief:
        total += belief[key]
    return total
One of the neat takeaways from this application is that there are many problems where you could take the knowledge learned from this course and improve on the current state of the art! Often the most creative task is to recognize where computer-based probability could be usefully applied. Even for eye tests this is not the end of the story. The Stanford Eye Test, which started in CS109, is just a step on the journey to a more accurate digital eye test. There is always a better way. Have an idea?
[1] The Stanford Acuity Test: A Precise Vision Test Using Bayesian Techniques and a Discovery in
Human Visual Response. Association for the Advancement of Artificial Intelligence
[3] Eye, robot: Artificial intelligence dramatically improves accuracy of classic eye exam. Science
Magazine.
Special thanks to Ali Malik who co-invented the Stanford Acuity Test.
P(X = x) ≈ count(x) / 100000

Let X_i be the points of the ith card in your hand, which has 13 cards. Then H = Σ_{i=1}^{13} X_i.
First we compute E[X_i], the expectation of points for the ith card in your hand, without considering the other cards. A card can take on four non-zero values, X_i ∈ {1, 2, 3, 4}. For each value there are four cards in the 52 card deck, so P(X_i = x) = 4/52 = 1/13 for each:

E[X_i] = Σ_x x · P(X_i = x)
       = (1/13)(1 + 2 + 3 + 4)
       = 10/13
We can then calculate E[H ] by using the fact that the expectation of the sum of random variables is the
sum of expectations, regardless of independence:
E[H] = Σ_{i=1}^{13} E[X_i]
     = 13 · E[X_i]
     = 13 · (10/13)
     = 10
Saying that H is approximately ∼ Poi(λ = 10) is an interesting claim. It suggests that points in a hand come at a constant rate, and that the next point in your hand is independent of when you got your last point. Of course this second part of the assumption is mildly violated. There is a fixed set of cards, so getting one card changes the probabilities of others. For this reason the Poisson is a close, but not perfect, approximation.
[Histogram: points per hand across 100,000 simulated dealt hands, with point totals ranging from 0 to 30]
From this joint distribution we can compute conditional probabilities. For example we can compute the
conditional distribution of your partner's points given your points using lookups from the joint:
P(Partner = x|YourPoints = y)
   = P(Partner = x, YourPoints = y) / P(YourPoints = y)    (Cond. Prob.)
   = P(Partner = x, YourPoints = y) / Σ_z P(Partner = z, YourPoints = y)    (LOTP)
Your points: 13
Both opponents have equal sized hands with k cards left. Across the two hands there is a known number n of cards of a particular suit (eg spades), and you want to know how many are in one hand and how many are in the other. A split is represented as a tuple. For example (0, 5) would mean 0 cards of the suit in opponent A's hand and 5 in opponent B's. Feel free to choose specific values for k and n:

A few notes: If there are k cards in each of the 2 hands, there are 2k cards total between the two players. At the start of a game of bridge k = 13. It must be the case that n ≤ 2k because you can't have more cards of the suit left than the number of cards! If there are n of a suit, then there are 2k − n of other suits. This problem assumes that the cards are properly shuffled.
Each outcome in the sample space is a chosen set of k distinct cards to be dealt to one player (out of the 2k cards). To create an outcome in the event space we first choose the i cards from the n cards of the given suit. We then choose k − i from the cards of other suits. For k = 13 and n = 5, the PMF over splits can be computed as shown below:
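Here is a minimal sketch of that computation (prob_split is our own name for the function):

from math import comb

def prob_split(i, k, n):
    # probability opponent A holds i of the n outstanding suit cards,
    # when each opponent holds k of the 2k unseen cards
    ways = comb(n, i) * comb(2 * k - n, k - i)
    return ways / comb(2 * k, k)

# PMF over splits for k = 13, n = 5
for i in range(6):
    print((i, 5 - i), round(prob_split(i, 13, 5), 4))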
If we want to think about the probability of a given split, it is sufficient to choose one hand (call it "hand one"). If I tell you how many of a suit are in one hand, you can automatically figure out how many of the suit are in the other hand: recall that the number of the suit sums to n.
P(X ≥ j) = 1 − Σ_{i=n−j+1}^{j−1} P(Y = i)
We are going to build a Bayesian Viral Load Test which updates a belief distribution regarding a patient's
viral load. Though viral load is continuous, in our test we represent it by discretizing the quantity into
whole numbers between 0 and 99, inclusive. The units of viral load are the number of viral instances per
million samples.
If a person has a viral load of 9 (in other words, 9 viruses out of every 1 million samples), what is the probability that a random sample from the person is a virus?

P(sample is viral) = 9 / 1,000,000
We test 100,000 samples from one person for the virus. If the person's true viral load is 9, what is the
probability that exactly 1 of our 100,000 samples is a virus? Use a computationally efficient
approximation to compute your answer. Your approximation should respect that there is 0 probability of
getting negative virus samples.
Let's define a random variable X, the number of samples that are viral given the true viral load is 9. The
question is asking for P (X = 1). We can think about this as a binomial process, where the number of
trials n is the number of samples and the probability p is the probability that a sample is viral.
n = 100,000,   p = 9 / 1,000,000
Notice that n is very large and p is very small, so we can use the Poisson approximation to approximate our answer. We find λ = np = 100,000 · 9/1,000,000 = 0.9, so X ∼ Poi(λ = 0.9). The last step is to use the PMF of the Poisson distribution:

P(X = 1) = (0.9)^1 · e^(−0.9) / 1! ≈ 0.366
Based on what we know about a patient (their symptoms and personal history) we have encoded a prior belief in a list prior where prior[i] is the probability that the viral load equals i. prior has length 100, with indices 0 through 99.
Write an equation for the updated probability that the true viral load is i given that we observe a count of
1 virus sample out of 100,000 tested. Recall that 0 ≤ i ≤ 99. You may use approximations.
We want to find P(viral load = i | observed count of 1 out of 100,000). We can define a random variable X = (observed count out of 100,000 | viral load = i), and we can model X as a Poisson approximation to a binomial with n = 100,000 and p = i/1,000,000, with

λ = np = 100,000 · i/1,000,000 = i/10

So X can be written as

X ∼ Poi(λ = i/10)

We can now use the Poisson PMF and our given prior to get:

P(viral load = i | observed count of 1) = [ (i/10)^1 · e^(−i/10) / 1! · prior[i] ] / P(observed count of 1 out of 100,000)
We now need to expand our denominator. We can use the General Law of Total Probability:

P(observed count of 1) = Σ_{j=0}^{99} P(observed count of 1 | viral load = j) · P(viral load = j)
                       = Σ_{j=0}^{99} (j/10) · e^(−j/10) · prior[j]

Putting it together, the final answer is:

P(viral load = i | observed count of 1) = [ (i/10) · e^(−i/10) · prior[i] ] / [ Σ_{j=0}^{99} (j/10) · e^(−j/10) · prior[j] ]
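In code, the whole update is short. Here is a sketch (viral_load_posterior is our own name; prior is the list described above):

import math

def viral_load_posterior(prior):
    # unnormalized posterior: Poisson(lambda = i/10) likelihood of seeing
    # exactly 1 viral sample out of 100,000, times the prior
    posterior = [(i / 10) * math.exp(-i / 10) * prior[i] for i in range(100)]
    total = sum(posterior)  # this is P(observed count of 1)
    return [p / total for p in posterior]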
CS109 Logo

To generate the CS109 logo, we are going to throw half a million darts at a picture of the Stanford seal. We only keep the pixels that are hit by at least one dart. Each dart has its x-pixel and y-pixel chosen at random from Gaussian distributions. Let X be a random variable which represents the x-pixel, Y be a random variable which represents the y-pixel, and S be a constant that equals the size of the logo (its width is equal to its height). X ∼ N(S/2, S/2) and Y ∼ N(S/3, S/5).
[Plots: the PDFs of the X and Y pixel distributions, probability density against pixel x and pixel y]
Tracking in 2D
Warning: After learning about joint distributions, and about inference, you have all the technical
abilities necessary to follow this example. However this is very very difficult. Primarily because you
need to understand three complex things at the same time: (1) how to represent a continuous joint
distribution, (2) inference in probabilistic models and (3) a rather complex probability of observation
calculation.
In this example we are going to explore the problem of tracking an object in 2D space. The object exists at some (x, y) location, however we are not sure exactly where! Thus we are going to use random variables X and Y to represent location. Our prior belief is that the object is near (3, 3): we model X and Y as independent normals, each with mean 3 and variance 4. This combination of normals is called a bivariate distribution. Here is a visualization of the PDF of our prior.
The interesting part about tracking an object is the process of updating your belief about its location based on an observation. Let's say that we get an instrument reading from a sonar that is sitting on the origin. The instrument reports that the object is 4 units away. Our instrument is not perfect: if the true distance was t units away, then the instrument will give a reading which is normally distributed with mean t and variance 1. Let's visualize the observation:

Based on this information about the noisiness of our instrument, we can compute the conditional probability of seeing a particular distance reading D, given the true location of the object X, Y. If we knew the object was at location (x, y), we could calculate the true distance to the origin, √(x² + y²), which would give us the mean for the instrument Gaussian:
f(D = d|X = x, Y = y) = (1/√(2 · 1 · π)) · e^(−(d − √(x² + y²))² / (2 · 1))    (Normal PDF where μ = √(x² + y²), σ² = 1)
                      = K2 · e^(−(d − √(x² + y²))² / 2)    (All constants are put into K2)
How about we try this out on actual numbers. How much more likely is an instrument reading of 1
compared to 2, given that the location of the object is at (1, 1)?
f(D = 1|X = 1, Y = 1) / f(D = 2|X = 1, Y = 1)
= [K2 · e^(−(1 − √2)² / 2)] / [K2 · e^(−(2 − √2)² / 2)]    (Notice how the K2 cancel out)
= e^((2 − √2)²/2 − (1 − √2)²/2)
≈ 1.09
At this point we have a prior belief and we have an observation. We would like to compute an updated
belief, given that observation. This is a classic Bayes' formula scenario. We are using joint continuous
variables, but that doesn't change the math much, it just means we will be dealing with densities instead
of probabilities:
f(X = x, Y = y|D = 4)
= f(D = 4|X = x, Y = y) · f(X = x, Y = y) / f(D = 4)    (Bayes using densities)
= [K2 · e^(−[4 − √(x² + y²)]² / 2) · K1 · e^(−[(x−3)² + (y−3)²] / 8)] / f(D = 4)    (Substitute)
= [K1 · K2 / f(D = 4)] · e^(−([4 − √(x² + y²)]² / 2 + [(x−3)² + (y−3)²] / 8))    (f(D = 4) is a constant w.r.t. (x, y))
= K3 · e^(−([4 − √(x² + y²)]² / 2 + [(x−3)² + (y−3)²] / 8))    (K3 is a new constant)
Wow! That looks like a pretty interesting function! You have successfully computed the updated belief.
Let's see what it looks like. Here is a figure with our prior on the left and the posterior on the right:
How beautiful is that! It's like a 2D normal distribution merged with a circle. But wait, what about that constant? We do not know the value of K3, and that is not a problem for two reasons: the first reason is that if we ever want to calculate a relative probability of two locations, K3 will cancel out. The second reason is that if we really wanted to know what K3 was, we could solve for it. This math is used every day in millions of applications. If there are multiple observations the equations can get truly complex (even worse than this one). To represent these complex functions, people often use an algorithm called particle filtering.
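To make the first reason concrete, here is a small sketch that compares the unnormalized posterior at two candidate locations; K3 cancels in the ratio, so we never need its value:

import math

def unnormalized_posterior(x, y, d=4):
    # the posterior density, up to the constant K3
    dist = math.sqrt(x**2 + y**2)
    return math.exp(-((d - dist)**2 / 2 + ((x - 3)**2 + (y - 3)**2) / 8))

# relative probability of the object being at (2, 3) versus (1, 1)
print(unnormalized_posterior(2, 3) / unnormalized_posterior(1, 1))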
Part 4: Uncertainty Theory
Beta Distribution
The Beta distribution is the distribution most often used as the distribution of probabilities. In this section
we are going to have a very meta discussion about how we represent probabilities. Until now
probabilities have just been numbers in the range 0 to 1. However, if we have uncertainty about our
probability, it would make sense to represent our probabilities as random variables (and thus articulate
the relative likelihood of our belief).
Notation: X ∼ Beta(a, b)
Description: A belief distribution over the value of a probability p from a Binomial distribution
after observing a − 1 successes and b − 1 fails.
Parameters: a > 0, the number of successes + 1
b > 0, the number of fails + 1
Support: x ∈ [0, 1]
PDF equation: f(x) = K · x^(a−1) · (1 − x)^(b−1), where K is a normalizing constant
Expectation: E[X] = a / (a + b)
Variance: Var(X) = ab / ((a + b)²(a + b + 1))
PDF graph:
Parameter a: 2 Parameter b: 4
An estimate like 9/10 is shaky, since it is only based off 10 coin flips. Moreover the "point-value" 9/10 does not have the ability to express how uncertain we are.
Could we instead have a random variable for the true probability? Formally, let X represent the true
probability of the coin coming up heads. We don't use the symbol P for random variables, so X will have
to do. If X = 0.7 then the probability of heads is 0.7. X must be a continuous random variable with
support [0, 1] since probabilities are continuous values which must be between 0 and 1.
Before flipping the coin, we could say that our belief about the coin's heads probability is uniform: X ∼ Uni(0, 1). Let H be a random variable for the number of heads and let T be a random variable for the number of tails. Suppose we flip the coin 10 times and see 9 heads and 1 tail; we want the probability density f(X = x|H = 9, T = 1). That probability is hard to think about! However it is much easier to reason about the probability with the condition reversed: P(H = 9, T = 1|X = x). This term asks the question: what is the probability of seeing 9 heads and 1 tail in 10 coin flips, given that the true probability of a heads is x? Convince yourself that this probability is just a binomial probability mass function with n = 10 experiments and p = x, evaluated at k = 9 heads:
P(H = 9, T = 1|X = x) = (10 choose 9) · x⁹ · (1 − x)¹
We are presented with a perfect context for Bayes' theorem with random variables. We know a
conditional probability in one direction and we would like to know it in the other:
f(X = x|H = 9, T = 1)
= P(H = 9, T = 1|X = x) · f(X = x) / P(H = 9, T = 1)    (Bayes' Theorem)
= (10 choose 9) x⁹(1 − x)¹ · f(X = x) / P(H = 9, T = 1)    (Binomial PMF)
= (10 choose 9) x⁹(1 − x)¹ · 1 / P(H = 9, T = 1)    (Uniform PDF)
= [(10 choose 9) / P(H = 9, T = 1)] · x⁹(1 − x)¹    (Constants to front)
= K · x⁹(1 − x)¹    (Rename constant)
In this case the constant works out to K = 110. Regardless of K we will get the same shape, just scaled:
What a beautiful image. It tells us the relative likelihood over the probability that is governing our coin flips. Here are a few observations from this chart (observation 2 is checked numerically below):

1. Even after only 10 coin flips we are very confident that the true probability is > 0.5.
2. It is almost 10 times more likely that X = 0.9 as it is that X = 0.6.
3. f(X = 1) = 0, which makes sense. How could we have flipped that one tail if the probability of heads was 1?
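A quick sketch to check observation 2 (the constant K cancels in the ratio, so any constant works):

f = lambda x: x**9 * (1 - x)  # posterior density, up to the constant K
print(f(0.9) / f(0.6))        # ~9.6, almost 10 times more likely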
It may be helpful to juxtapose P(H = 9, T = 1) with P(H = 9, T = 1|X = x). The latter says "what is the probability of 9 heads, given the true probability is x". The former says "what is the probability of 9 heads, under all possible assignments of x". If you wanted to calculate P(H = 9, T = 1) you could use the law of total probability:
P(H = 9, T = 1) = ∫₀¹ P(H = 9, T = 1|X = y) · f(X = y) dy
Beta Derivation
Let's generalize the derivation from the previous section, using h for the number of observed heads and t for the number of observed tails. Let H = h be the event that we saw h heads, and let T = t be the event that we saw t tails in h + t coin flips. We want to calculate the probability density function f(X = x|H = h, T = t). We can use the exact same series of steps, starting with Bayes' Theorem:
f(X = x|H = h, T = t)
= P(H = h, T = t|X = x) · f(X = x) / P(H = h, T = t)    (Bayes' Theorem)
= (h+t choose h) x^h (1 − x)^t / P(H = h, T = t)    (Binomial PMF, Uniform PDF)
= [(h+t choose h) / P(H = h, T = t)] · x^h (1 − x)^t    (Moving terms around)
= (1/c) · x^h (1 − x)^t    where c = ∫₀¹ x^h (1 − x)^t dx
The equation that we arrived at when using a Bayesian approach to estimating our probability defines a probability density function, and thus a random variable. The random variable is called a Beta distribution: if X ∼ Beta(a, b), then

f(X = x) = (1/c) · x^(a−1) · (1 − x)^(b−1)    where c is a normalizing constant

with E[X] = a/(a + b) and Var(X) = ab / ((a + b)²(a + b + 1)). All modern programming languages have a package for calculating Beta CDFs. You will not be expected to compute the CDF by hand in CS109.
To model our estimate of the probability of a coin coming up heads: set a = h + 1 and b = t + 1. Beta is
used as a random variable to represent a belief distribution of probabilities in contexts beyond estimating
coin flips. For example perhaps a drug has been given to 6 patients, 4 of whom have been cured. We
could express our belief in the probability that the drug can cure patients as X ∼ Beta(a = 5, b = 3):
Notice how the most likely belief for the probability of curing a patient is 4/6, the fraction of patients cured. This distribution shows that we hold a non-zero belief that the probability could be something other than 4/6. It is unlikely that the probability is 0.01 or 0.99, but reasonably likely that it could be 0.5.
Beta as a Prior
You can set X ∼ Beta(a, b) as a prior to reflect how biased you think the coin is a priori to flipping it. This is a subjective judgment that represents a + b − 2 "imaginary" trials with a − 1 heads and b − 1 tails. If you then observe h + t real trials with h heads, you can update your belief. Your new belief would be X ∼ Beta(a + h, b + t). Using the prior Beta(1, 1) = Uni(0, 1) is the same as saying we haven't seen any "imaginary" trials, so a priori we know nothing about the coin. Here is the proof for the distribution of X when the prior was a Beta too:
X when the prior was a Beta too:
f (X = x|H = h, T = t)
P (H = h, T = t|X = x)f (X = x)
= Bayes Theorem
P (H = h, T = t)
a+h−1 b+t−1
= K ⋅ x (1 − x) Combine Like Bases
It is pretty convenient that if we have a Beta prior belief, then our posterior belief is also Beta. This makes Betas especially convenient to work with, in code and in proof, if there are many updates that you will make to your belief over time. A prior whose posterior has the same type of distribution is called a conjugate prior.
Quick question: Are you allowed to just make up priors and imaginary trials? Some folks think that is
fine (they are called Bayesians) and some folks think that you shouldn't make up prior beliefs (they are
called frequentists). In general, for small data it can make you much better at making predictions if you
are able to come up with a good prior belief.
Observation: There is a deep connection between the beta-prior and the uniform-prior (which we used
initially). It turns out that Beta(1, 1) = Uni(0, 1). Recall that Beta(1, 1) means 0 imaginary heads and 0
imaginary tails.
P(X + Y = n) = Σ_{i=−∞}^{∞} P(X = i, Y = n − i)
If the random variables are independent you can further decompose the term P(X = i, Y = n − i). Let's expand on some of the mutually exclusive cases where X + Y = n:
i    X    Y        Probability
0    0    n        P(X = 0, Y = n)
1    1    n − 1    P(X = 1, Y = n − 1)
2    2    n − 2    P(X = 2, Y = n − 2)
...
n    n    0        P(X = n, Y = 0)
Consider the sum of two independent dice. Let X and Y be the outcome of each die. Here is the probability mass function for the sum X + Y:
Let's use this context to practice deriving the sum of two variables, in this case P(X + Y = n), starting with the General Rule for the Convolution of Discrete Random Variables. We start by considering values of n between 2 and 7. In this range, P(X = i, Y = n − i) = 1/36 for all values of i between 1 and n − 1: there is exactly one outcome of the two dice where X = i and Y = n − i. For values of i outside this range, n − i is not a valid die outcome and P(X = i, Y = n − i) = 0:
P(X + Y = n) = Σ_{i=−∞}^{∞} P(X = i, Y = n − i)
             = Σ_{i=1}^{n−1} P(X = i, Y = n − i)
             = Σ_{i=1}^{n−1} 1/36
             = (n − 1)/36
For values of n greater than 7 we could use the same approach, though different values of i would make
P(X = i, Y = n − i) non-zero.
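The full convolution is also easy to compute exhaustively in code. Here is a small sketch covering every value of n:

from fractions import Fraction

def pmf_sum_two_dice(n):
    # convolution: P(X + Y = n) for two fair six-sided dice
    total = Fraction(0)
    for i in range(1, 7):
        if 1 <= n - i <= 6:           # n - i must be a valid die outcome
            total += Fraction(1, 36)  # P(X = i, Y = n - i)
    return total

print([pmf_sum_two_dice(n) for n in range(2, 13)])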
f(X + Y = n) = ∫_{i=−∞}^{∞} f(X = n − i, Y = i) di
The sum of two independent Poisson random variables is another Poisson: X + Y ∼ Poi(λ_1 + λ_2). This holds even when λ_1 is not the same as λ_2.
Example derivation:
Let's go about proving that the sum of two independent Poisson random variables is also Poisson. Let X ∼ Poi(λ_1) and Y ∼ Poi(λ_2) be two independent random variables, and Z = X + Y. What is P(Z = n)?
P(Z = n) = P(X + Y = n)
= Σ_{k=−∞}^{∞} P(X = k, Y = n − k)    (Convolution)
= Σ_{k=−∞}^{∞} P(X = k) · P(Y = n − k)    (Independence)
= Σ_{k=0}^{n} P(X = k) · P(Y = n − k)    (Support of X and Y)
= Σ_{k=0}^{n} e^(−λ_1) (λ_1^k / k!) · e^(−λ_2) (λ_2^(n−k) / (n − k)!)    (Poisson PMF)
= e^(−(λ_1+λ_2)) Σ_{k=0}^{n} λ_1^k λ_2^(n−k) / (k!(n − k)!)
= (e^(−(λ_1+λ_2)) / n!) Σ_{k=0}^{n} [n! / (k!(n − k)!)] λ_1^k λ_2^(n−k)
= (e^(−(λ_1+λ_2)) / n!) · (λ_1 + λ_2)^n    (Binomial theorem)
Note that the Binomial Theorem (which we did not cover in this class, but is often used in contexts like expanding polynomials) says that for two numbers a and b and positive integer n,

(a + b)^n = Σ_{k=0}^{n} (n choose k) a^k b^(n−k)
The sum of two independent Binomials with the same success probability is also Binomial: if X ∼ Bin(n_1, p) and Y ∼ Bin(n_2, p), then X + Y ∼ Bin(n_1 + n_2, p).
This result hopefully makes sense. The convolution is the number of successes across X and Y. Since each trial has the same probability of success, and there are now n_1 + n_2 trials, which are all independent, the convolution is simply a new Binomial. This rule does not hold when the two Binomial random variables have different parameters p.
The sum of two independent normals is also normal: if X ∼ N(μ_1, σ_1²) and Y ∼ N(μ_2, σ_2²), then X + Y ∼ N(μ_1 + μ_2, σ_1² + σ_2²). Again this only holds when the two normals are independent.
f(X + Y = n) = { n        if 0 < n ≤ 1
               { 2 − n    if 1 < n ≤ 2
               { 0        else
Example derivation:
Calculate the PDF of X + Y for independent uniform random variables X ∼ Uni(0, 1) and Y ∼ Uni(0, 1). First plug in the equation for general convolution of independent random variables:
f(X + Y = n) = ∫₀¹ f(X = n − i, Y = i) di
             = ∫₀¹ f(X = n − i) di    (Because f(Y = i) = 1)
It turns out that integral is not the easiest thing to evaluate directly. By trying a few different values of n in the range [0, 2] we can observe that the PDF we are trying to calculate is discontinuous at the point n = 1 and thus will be easier to think about as two cases: n < 1 and n > 1. If we calculate f(X + Y = n) for both cases and correctly constrain the bounds of the integral, we get simple closed forms for each case:
f(X + Y = n) = { n        if 0 < n ≤ 1
               { 2 − n    if 1 < n ≤ 2
               { 0        else
Σ_{i=1}^{n} X_i ∼ N(n · μ, n · σ²)

where μ = E[X_i] and σ² = Var(X_i). Note that since each X_i is identically distributed, they share the same mean and variance.
At this point you probably think that the central limit theorem is awesome. But it gets even better. With
some algebraic manipulation we can show that if the sample mean of IID random variables is normal, it
follows that the sum of equally weighted IID random variables must also be normal:
from random import random

def add_100_uniforms():
    total = 0
    for i in range(100):
        # returns a sample from uniform(0, 1)
        x_i = random()
        total += x_i
    return total
The value total returned by this function will be a random variable. Hit the button below to run the function and observe the resulting value of total:
What does total look like as a distribution? Let's calculate total many times and visualize the histogram
of values it produces.
That is interesting! total, which is the sum of 100 independent uniforms, looks normal. Is that a special property of uniforms? No! It turns out to work for almost any type of distribution (as long as the thing you are adding has finite mean and finite variance, which is true of everything we have covered in this reader).

For any distribution, the sum, or average, of n independent equally-weighted samples from that distribution will be approximately normal.
Continuity Correction
Now we can see that the Binomial Approximation using a Normal actually derives from the central limit theorem. Recall that, when computing probabilities for a normal approximation, we had to use a continuity correction. This was because we were approximating a discrete random variable (a binomial) with a continuous one (a normal). You should use a continuity correction any time your normal is approximating a discrete random variable. The rules for a general continuity correction are the same as the rules for the binomial-approximation continuity correction.
In the motivating example above, where we added 100 uniforms, a continuity correction isn't needed
because the sum of uniforms is continuous. In the dice sum example below, a continuity correction is
needed because die outcomes are discrete.
Examples
Example:
You will roll a 6 sided die 10 times. Let X be the total value of all 10 dice: X = X_1 + X_2 + ⋯ + X_10. You win the game if X ≤ 25 or X ≥ 45. Use the central limit theorem to calculate the probability that you win. Recall that E[X_i] = 3.5 and Var(X_i) = 35/12.
Let Y be the approximating normal. By the Central Limit Theorem, Y ∼ N(10 · E[X_i], 10 · Var(X_i)). Substituting in the known values for expectation and variance: Y ∼ N(35, 29.2).
P(X ≤ 25 or X ≥ 45)
≈ Φ((25.5 − 35)/√29.2) + [1 − Φ((44.5 − 35)/√29.2)]    (Normal CDF, with continuity correction)
≈ Φ(−1.76) + [1 − Φ(1.76)]
≈ 0.078
Example:
Say you have a new algorithm and you want to test its running time. You have an idea of the variance of the algorithm's run time, σ² = 4 sec², but you want to estimate the mean, μ = t sec. You can run the algorithm repeatedly (IID trials). How many trials do you have to run so that your estimated runtime is t ± 0.5 with 95% certainty? Let X_i be the run time of the i-th run (for 1 ≤ i ≤ n). We want:

0.95 = P(−0.5 ≤ (Σ_{i=1}^{n} X_i)/n − t ≤ 0.5)
By the central limit theorem, the standard normal Z must be equal to:

Z = [(Σ_{i=1}^{n} X_i) − nμ] / (σ√n)
  = [(Σ_{i=1}^{n} X_i) − nt] / (2√n)

Rewriting our probability in terms of Z:

0.95 = P(−0.5√n/2 ≤ Z ≤ 0.5√n/2) = P(−√n/4 ≤ Z ≤ √n/4)
And now we can find the value of n that makes this equation hold:

0.95 = Φ(√n/4) − Φ(−√n/4) = Φ(√n/4) − (1 − Φ(√n/4)) = 2Φ(√n/4) − 1
0.975 = Φ(√n/4)
Φ^(−1)(0.975) = √n/4
1.96 = √n/4
n ≈ 61.4
Thus it takes 62 runs. If you are interested in how this extends to cases where the variance is unknown, look into variations of Student's t-test.
Sampling
In this section we are going to talk about statistics calculated on samples from a population. We are then
going to talk about probability claims that we can make with respect to the original population -- a central
requirement for most scientific disciplines.
Let's say you are the king of Bhutan and you want to know the average happiness of the people in your
country. You can't ask every single person, but you could ask a random subsample. In this next section we
will consider principled claims that you can make based on a subsample. Assume we randomly sample
200 Bhutanese and ask them about their happiness. Our data looks like this: 72, 85, …, 71. You can also think of it as a collection of n = 200 IID (independent, identically distributed) random variables X_1, X_2, …, X_n.
Understanding Samples
The idea behind sampling is simple, but the details and the mathematical notation can be complicated.
Here is a picture to show you all of the ideas involved:
The theory is that there is some large population (for example the 774,000 people who live in Bhutan).
We collect a sample of n people at random, where each person in the population is equally likely to be in
our sample. From each person we record one number (for example their reported happiness). We are
going to call the number from the ith person we sampled X_i. One way to visualize your samples X_1, X_2, …, X_n is to make a histogram of their values.
We make the assumption that all of our X_i's are identically distributed. That means that we are assuming there is a single underlying distribution F that we drew our samples from. Recall that a distribution for discrete random variables should define a probability mass function.
Given a sample, we would like to estimate the mean and variance of F. From our sample we can calculate a sample mean (X̄) and a sample variance (S²). These are the best guesses that we can make about the true mean and true variance:

X̄ = Σ_{i=1}^{n} X_i / n

S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²
The first question to ask is: are those unbiased estimates? Yes. Unbiased means that if we were to repeat this sampling process many times, the expected value of our estimates should be equal to the true values we are trying to estimate. We will prove that this is the case for X̄. The proof for S² is in the lecture slides.
E[X̄] = E[Σ_{i=1}^{n} X_i / n] = (1/n) E[Σ_{i=1}^{n} X_i]
     = (1/n) Σ_{i=1}^{n} E[X_i] = (1/n) Σ_{i=1}^{n} μ = (1/n) · nμ = μ
The equation for sample mean seems related to our understanding of expectation. The same could be said about sample variance, except for the surprising (n − 1) in the denominator of the equation. Why (n − 1)? That denominator is necessary to make sure that E[S²] = σ². The intuition behind the proof is that sample variance calculates the distance of each sample to the sample mean, not the true mean. The sample mean itself varies, and we can show that its variance is also related to the true variance.
Standard Error
Ok, you convinced me that our estimates for mean and variance are not biased. But now I want to know
how much my sample mean might vary relative to the true mean.
Var(X̄) = Var(Σ_{i=1}^{n} X_i / n) = (1/n)² Var(Σ_{i=1}^{n} X_i)
        = (1/n)² Σ_{i=1}^{n} Var(X_i) = (1/n)² · nσ² = σ²/n
        ≈ S²/n

Std(X̄) ≈ √(S²/n)
That term, Std(X̄), has a special name. It is called the standard error and it's how you report uncertainty of estimates of means in scientific papers (and how you get error bars). Great! Now we can compute all these wonderful statistics for the Bhutanese people. But wait! You never told me how to calculate Std(S²). That is hard because the central limit theorem doesn't apply to the computation of S². Instead we will need a more general technique. See the next chapter: Bootstrapping.
Let's say our sample of happiness has n = 200 people. The sample mean is X̄ = 83 (what is the unit here? happiness score?) and the sample variance is S² = 450. We can now calculate the standard error of our estimate of the mean to be 1.5. When we report our results we will say that our estimate of the average happiness score in Bhutan is 83 ± 1.5. Our estimate of the variance of happiness is 450 ± ?.
Bootstrapping
The bootstrap is a relatively recently invented statistical technique for both understanding distributions of statistics and for calculating p-values (a p-value is the probability that a scientific claim is incorrect). It was invented here at Stanford in 1979, when mathematicians were just starting to understand how computers, and computer simulations, could be used to better understand probabilities.
The first key insight is that: if we had access to the underlying distribution (F ) then answering almost
any question we might have as to how accurate our statistics are becomes straightforward. For example,
in the previous section we gave a formula for how you could calculate the sample variance from a sample
of size n. We know that in expectation our sample variance is equal to the true variance. But what if we
want to know the probability that the true variance is within a certain range of the number we calculated?
That question might sound dry, but it is critical to evaluating scientific claims! If you knew the
underlying distribution, F , you could simply repeat the experiment of drawing a sample of size n from F
, calculate the sample variance from our new sample and test what portion fell within a certain range.
The next insight behind bootstrapping is that the best estimate that we can get for F is from our sample itself! The simplest way to estimate F (and the one we will use in this class) is to assume that P(X = k) is simply the fraction of times that k showed up in the sample. Note that this defines the PMF of our estimate of F. Here is the bootstrap procedure in code:
import random

def bootstrap(sample, calc_stat, n_iterations=10000):
    # drawing from the estimated pmf is equivalent to sampling
    # from the original sample with replacement
    n = len(sample)
    stats = []
    for _ in range(n_iterations):
        resample = [random.choice(sample) for _ in range(n)]
        stats.append(calc_stat(resample))
    # stats can now be used to estimate the distribution of the stat
    return stats
Bootstrapping is a reasonable thing to do because the sample you have is the best and only information
you have about what the underlying population distribution actually looks like. Moreover most samples
will, if they're randomly chosen, look quite like the population they came from.
To calculate Var(S²) we could calculate S_i² for each resample i, and after 10,000 iterations we could calculate the sample variance of all the S_i²'s. You might be wondering why the resample is the same size as the original sample (n). The answer is that the variation of the stat that you are calculating could depend on the size of the sample (or the resample). To accurately estimate the distribution of the stat, we must use resamples of the same size.
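For example, here is a sketch of estimating Var(S²) with the bootstrap function above (the happiness scores are made-up values):

def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

sample = [72, 85, 71, 90, 67, 88, 74, 81, 79, 83]  # hypothetical data
stats = bootstrap(sample, sample_variance)
print(sample_variance(stats))  # estimate of Var(S^2)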
The bootstrap has strong theoretical guarantees and is accepted by the scientific community. It breaks down when the underlying distribution has a "long tail" or if the samples are not IID.
Say we sample n = 200 individuals in Bhutan and n = 300 individuals in Nepal and ask them to rate their happiness on a scale from 1 to 10. We measure the sample means for the two samples and observe that people in Nepal are slightly happier: the difference between the Nepal sample mean and the Bhutan sample mean is 0.5 points on the happiness scale.
If you want to make this claim scientific you should calculate a p-value. A p-value is the probability that, when the null hypothesis is true, the statistic measured would be equal to, or more extreme than, the value you are reporting. The null hypothesis is the hypothesis that there is no relationship between two measured phenomena or no difference between two groups.
In the case of comparing Nepal to Bhutan, the null hypothesis is that there is no difference between the distribution of happiness in Bhutan and Nepal. The null hypothesis argument is: when you drew samples, Nepal had a mean that was 0.5 points larger than Bhutan's by chance.
We can use bootstrapping to calculate the p-value. First, we estimate the underlying distribution of the null hypothesis by making a probability mass function from all of our samples from Nepal and all of our samples from Bhutan combined. A sketch of the full procedure is shown below.
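This sketch pools the two samples, resamples both groups from the pooled data, and counts how often the resampled difference is at least as extreme as the observed 0.5 (function names are ours):

import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_pvalue(sample_a, sample_b, n_iterations=10000):
    # under the null hypothesis both samples come from the same
    # distribution, estimated by pooling all observations
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = sample_a + sample_b
    count = 0
    for _ in range(n_iterations):
        resample_a = [random.choice(pooled) for _ in range(len(sample_a))]
        resample_b = [random.choice(pooled) for _ in range(len(sample_b))]
        # count resamples at least as extreme as the observed difference
        if abs(mean(resample_a) - mean(resample_b)) >= observed:
            count += 1
    return count / n_iterations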
This is particularly nice because nowhere did we have to make an assumption about a parametric distribution that our samples came from (i.e. we never had to claim that happiness is Gaussian). You might have heard of a t-test. That is another way of calculating p-values, but it makes the assumption that both samples are Gaussian and that they both have the same variance. In the modern context where we have reasonable computer power, bootstrapping is a more correct and versatile tool.
Algorithmic Analysis
In this section we are going to use probability to analyze code. Specifically, we are going to be calculating expectations on code: expected run time, expected resulting values, etc. The reason that we are going to focus on expectation is that it has several nice properties. One of the most useful properties that we have seen so far is that the expectation of a sum is the sum of expectations, regardless of whether the random variables are independent of one another. In this section we will see a few more helpful properties, including the Law of Total Expectation, which is also helpful in analyzing code:

E[X] = Σ_y E[X|Y = y] · P(Y = y)
Consider the following recursive function. What is the expected value of what it returns?
import random

def recurse():
    x = random.choice([1, 2, 3])  # equally likely values
    if x == 1:
        return 3
    elif x == 2:
        return 5 + recurse()
    else:
        return 7 + recurse()
In order to solve this problem we are going to need to use the law of total expectation, considering X as your background variable. Let Y be the value returned by recurse():

E[Y] = Σ_{i∈{1,2,3}} E[Y|X = i] · P(X = i)
     = E[Y|X = 1] P(X = 1) + E[Y|X = 2] P(X = 2) + E[Y|X = 3] P(X = 3)
We know the probability P(X = x) = 1/3 for x ∈ {1, 2, 3}. How can we compute a value such as E[Y|X = 2]? Well, that is the expectation of your return in the world where X = 2. In that case you will return 5 + recurse(). The expectation of that is 5 + E[Y]. Plugging in a similar result for each case, we can continue our solution:
E[Y |X = 1] = 3
E[Y |X = 2] = 5 + E[Y ]
E[Y |X = 3] = 7 + E[Y ]
Now we can just plug values into the law of total expectation:

E[Y] = E[Y|X = 1] P(X = 1) + E[Y|X = 2] P(X = 2) + E[Y|X = 3] P(X = 3)
     = 3 · 1/3 + (5 + E[Y]) · 1/3 + (7 + E[Y]) · 1/3
     = 5 + (2/3) · E[Y]

Solving:
(1/3) · E[Y] = 5
E[Y] = 15
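We can sanity-check E[Y] = 15 by simulation, using the recurse function defined above:

# average many runs of recurse(); should be close to 15
samples = [recurse() for _ in range(100000)]
print(sum(samples) / len(samples))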
As promised, here is the proof of the Law of Total Expectation:

Σ_y E[X|Y = y] P(Y = y)
= Σ_y Σ_x x · P(X = x|Y = y) P(Y = y)
= Σ_y Σ_x x · P(X = x, Y = y)
= Σ_x Σ_y x · P(X = x, Y = y)
= Σ_x x · Σ_y P(X = x, Y = y)
= Σ_x x · P(X = x)
= E[X]
Thompson Sampling
Imagine having to make the following series of decisions. You have two drugs you can administer, drug 1 or drug 2. Initially you have no idea which drug is better. You want to know which drug is the most effective, but at the same time, there are costs to exploration: the stakes are high.

Here is an example:

This problem is surprisingly complex. It sometimes goes by the name "the multi-armed bandit problem"! In fact, the perfect answer to this question can be exponentially hard to calculate. There are many approximate solutions and it is an active area of research.

One solution has risen to be a rather popular option: Thompson Sampling. It is easy to implement, elegant to understand, has provable guarantees [1], and in practice does very well [2].
A sophisticated way to represent our belief in the two hidden probabilities behind drug 1 and drug 2 is to use the Beta distribution. Let X_1 be the belief in the probability for drug 1 and let X_2 be the belief in the probability for drug 2. For example, after observing 1 success and 3 failures for drug 1, and 2 successes and 2 failures for drug 2:

X_1 ∼ Beta(a = 2, b = 4)
X_2 ∼ Beta(a = 3, b = 3)
Recall that in the Beta distribution with a uniform prior, the first parameter, a, is the number of observed successes + 1. The second parameter, b, is the number of observed fails + 1. It is helpful to look at these two distributions graphically:
If we had to guess, drug 2 is looking better, but there is still a lot of uncertainty, represented by the high variance in these beliefs. That is a helpful representation. But how can we use this information to make a good decision about the next drug?
Making a Choice
It is hard to know what is the right choice! If you only had one more patient, then it is clear what you should do: calculate the probability that X_2 > X_1, and if that probability is over 0.5 then choose drug 2. However, if you need to continually administer the pills, then it is less clear what the right choice is. If you choose drug 1, you miss out on the chance to learn more about drug 2. What should we do? We need to balance this need for "exploring" and the need to take advantage of what we already know.
The simple idea behind Thompson Sampling is to randomly make your choice according to its probability of being optimal. In this case we should choose drug 1 with the probability that X_1 > X_2. How do people do this in practice? They have a very simple formula: take a random sample from each Beta distribution, and choose the option which has the larger value for its sample.
import random

sample_1 = random.betavariate(2, 4)  # a sample from X1 ~ Beta(2, 4)
sample_2 = random.betavariate(3, 3)  # a sample from X2 ~ Beta(3, 3)
if sample_1 > sample_2:
    choice = "drug 1"
else:
    choice = "drug 2"
What does it mean to take a sample? It means to choose a value according to the probability density (or probability mass) function. So in our example above, we might sample 0.4 for drug 1 and 0.35 for drug 2, in which case we would go with drug 1.
At the start, Thompson Sampling "explores" quite a lot of the time. As it gets more confident that one drug is better than another, it will start to choose that drug most of the time. Eventually it will converge to knowing which drug is best, and it will always choose that drug.
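Here is a minimal sketch of the full loop, keeping running success/failure counts for each drug (the names, the made-up TRUE_P values, and the administer_drug simulation are ours, for illustration):

import random

TRUE_P = [0.3, 0.5]  # hidden cure probabilities, made up for simulation

def administer_drug(d):
    # simulate: the patient is cured with the drug's hidden probability
    return random.random() < TRUE_P[d]

successes = [0, 0]
failures = [0, 0]

def thompson_choose():
    # sample each belief Beta(successes + 1, failures + 1); pick the larger
    samples = [random.betavariate(successes[d] + 1, failures[d] + 1)
               for d in range(2)]
    return 0 if samples[0] > samples[1] else 1

for patient in range(100):
    d = thompson_choose()
    if administer_drug(d):
        successes[d] += 1
    else:
        failures[d] += 1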
Night Sight
In this problem we explore how to use probability theory to take photos in the dark. Digital cameras have a sensor that captures photons over the duration of a shot to produce pictures. However, these sensors are subject to "shot noise": random fluctuations in the amount of photons that hit the sensor. In the scope of this problem, we only consider a single pixel. The arrival of shot noise photons on a surface is independent, with constant rate.
Left: photo captured using a standard photo. Right: the same photo using a shot burst [1].
For shot noise, standard deviation is what matters! Why? Because if the camera can compute the
expected amount of noise, it can simply subtract it out. But the fluctuations around the mean (measured
as standard deviation) lead to changes in measurement that the camera can't simply subtract out.
Noise in a standard photo: As you may have guessed, because photons hit the camera at a constant rate, and independent of one another, the number of shot noise photons hitting any pixel is modelled as a Poisson! For the given rate of noise, let X be the number of shot noise photons that hit the pixel:

X ∼ Poi(λ = 10,000)

Note that 10,000 is the average number of photons that hit in 1000μs (duration in microseconds multiplied by photons per microsecond). The standard deviation of a Poisson is simply equal to the square root of its parameter, √λ. Thus the standard deviation of the shot noise photons captured is 100 (quite high).
Now consider a burst of 15 short shots of 66μs each, and let Y be the average of the per-shot counts X_i, where

X_i ∼ Poi(λ = 66 · 10)

Since Y is the average of IID random variables, the Central Limit Theorem will kick in. Moreover, by the CLT rule, Y will have variance equal to (1/n) · Var(X_i) = (1/15) · 660 = 44. The standard deviation will then be the square root of this variance, Std(Y) = √44, which is approximately 6.6. That is a huge reduction in shot noise!
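The arithmetic comparison in two lines:

import math

print(math.sqrt(10000))     # std of noise in one 1000us shot: 100.0
print(math.sqrt(660 / 15))  # std of the 15-shot burst average: ~6.6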
P-Hacking
It turns out that science has a bug! If you test many hypotheses but only report the one with the lowest p-
value you are more likely to get a spurious result (one resulting from chance, not a real pattern).
Recall p-values: A p-value was meant to represent the probability of a spurious result. It is the chance of
seeing a difference in means (or in whichever statistic you are measuring) at least as large as the one
observed in the dataset if the two populations were actually identical. A p-value < 0.05 is considered
"statistically significant". In class we compared sample means of two populations and calculated p-
values. What if we had 5 populations and searched for pairs with a significant p-value? This is called p-
hacking!
To explore this idea, we are going to look for patterns in a dataset which is totally random – every value
is Uniform(0,1) and independent of every other value. There is clearly no significance in any difference
in means in this toy dataset. However, we might find a result which looks statistically significant just by
chance. Here is an example of a simulated dataset with 5 random populations, each of which has 20
samples:
The numbers in the table above are just for demonstration purposes. You should not base your answer off
of them. We call each population a random population to emphasize that there is no pattern.
With 5 populations there are (5 choose 2) = 10 pairs of populations whose means we could compare.
Let Z ∼ Uni(0, 1).

Var(Z) = (1/12)(β − α)²
       = (1/12)(1 − 0)²
       = 1/12
What is an approximation for the distribution of the mean of 20 samples from Uniform(0,1)? Let X = (1/n) Σ_{i=1}^{n} Z_i.
E[X] = (1/n) Σ_{i=1}^{n} E[Z_i] = (1/n) Σ_{i=1}^{n} 0.5 = (n/n) · 0.5 = 0.5

Var(X) = Var((1/n) Σ_{i=1}^{n} Z_i)
       = (1/n²) Var(Σ_{i=1}^{n} Z_i)
       = (1/n²) Σ_{i=1}^{n} Var(Z_i)
       = (1/n²) Σ_{i=1}^{n} v
       = (n/n²) · v = v/n = (1/12)/20 = 1/240

where v = Var(Z_i) = 1/12.
What is an approximation for the distribution of the mean from one population minus the mean from another population? Note: this value may be negative if the first population has a smaller mean than the second.

X_1 ∼ N(μ = 0.5, σ² = 1/240)
X_2 ∼ N(μ = 0.5, σ² = 1/240)

Let Y = X_1 − X_2. The expectation is simple to calculate because expectation is linear: E[Y] = 0.5 − 0.5 = 0. Since the two sample means are independent, the variances add:

Var(Y) = Var(X_1) + Var(X_2) = 2/240 = 1/120

So Y ∼ N(μ = 0, σ² = 1/120).
(8 points) What is the smallest difference in means, k, that would look statistically significant if there were
only two populations? In other words, the probability of seeing a difference in means of k or greater is <
0.05.
One tricky part of this problem is to recognize the double-sidedness of distance. We would consider it a significant difference if Y < −k or Y > k. By symmetry of Y around 0:

2 − 2F_Y(k) = 0.05
F_Y(k) = 0.975

Standardizing Y ∼ N(0, v/10), where v = 1/12:

0.975 = Φ((k − 0) / √(v/10))
Φ^(−1)(0.975) = k / √(v/10)
k = Φ^(−1)(0.975) · √(v/10)

(5 points) Give an expression for the probability that the smallest sample mean among 5 random populations is less than 0.2.

P(min{X_1, …, X_5} < 0.2) = P(⋃_{i=1}^{5} {X_i < 0.2})
= 1 − P(⋂_{i=1}^{5} {X_i ≥ 0.2})
= 1 − ∏_{i=1}^{5} P(X_i ≥ 0.2)
= 1 − ∏_{i=1}^{5} [1 − Φ((0.2 − 0.5) / √(v/20))]
= 1 − [1 − Φ((0.2 − 0.5) / √(v/20))]^5

(7 points) Use the following functions to write code that estimates the probability that among 5 populations you find a difference of means which would be considered significant (using the bootstrapping method designed to compare 2 populations). Run at least 10,000 simulations to estimate your answer. You may use the following helper functions.

# the smallest difference in means that would look statistically significant
k = calculate_k()
# from the matrix, return the column (as a list) which has the smallest mean
min_mean_col = get_min_mean_col(matrix)
# from the matrix, return the column (as a list) which has the largest mean
max_mean_col = get_max_mean_col(matrix)
# calculate the p-value between two lists using bootstrapping (like in pset5)
p_value = bootstrap(list1, list2)

Write pseudocode:

n_significant = 0
k = calculate_k()
for i in range(N_TRIALS):
    dataset = random_matrix(20, 5)
    col_max = get_max_mean_col(dataset)
    col_min = get_min_mean_col(dataset)
    diff = np.mean(col_max) - np.mean(col_min)
    if diff >= k:
        n_significant += 1
print(n_significant / N_TRIALS)
Differential Privacy
Recently, many organizations have released machine learning models trained on massive datasets (GPT-3, YOLO, etc.). This is a great contribution to science and streamlines modern AI research. However, publicizing these models allows for the potential "reverse engineering" of models to uncover the training data for the model. Specifically, an attacker can download a model, look at the parameter values and then try to reconstruct the original training data. This is particularly bad for models trained on sensitive data like health information. In this section we are going to use randomness as a method to defend against algorithmic "reverse engineering."
Injecting Randomness
One way to combat algorithmic reverse engineering is to add some random element to an already existing
dataset. Let
i.i.d
X1 , … , Xi ∼ Bern(p)
represent a set of real human data. Consider the following snippet of code:
def calculateXi(Xi):
    return Xi
Quite simply, an attacker can call the above for all 100 samples and uncover all 100 data points. Instead, we can inject an element of randomness:

from random import random

def calculateYi(Xi):
    obfuscate = random() < 0.5  # Bernoulli with parameter p = 0.5
    if obfuscate:
        return int(random() < 0.5)  # a random bit, independent of Xi
    else:
        return Xi
An attacker who calls the new function for all 100 samples will, in expectation, get the correct values for 50 of them (but they won't know which 50).
Recovering p
Now consider if we publish the function calculateYi, how could a researcher who is interested in the
mean of the samples get useful data? They can look at:
Z = ∑_{i=1}^{100} Y_i

Each Y_i equals the true X_i with probability 0.5 and a fresh fair bit with probability 0.5, so E[Y_i] = 0.5·p + 0.25 and E[Z] = 50p + 25. Solving for p gives the estimate:

p ≈ (Z − 25) / 50
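A quick simulation (my own sketch; the true p below is an arbitrary choice) shows the estimate recovering p from the obfuscated responses, reusing calculateYi from above:

p_true = 0.7
X = [int(random() < p_true) for _ in range(100)]   # private data
Y = [calculateYi(x) for x in X]                    # published, obfuscated responses
Z = sum(Y)
print((Z - 25) / 50)   # close to 0.7 on average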
Part 5: Machine Learning
Parameter Estimation
We have learned many different distributions for random variables and all of those distributions had
parameters: the numbers that you provide as input when you define a random variable. So far when we
were working with random variables, we either were explicitly told the values of the parameters, or, we
could divine the values by understanding the process that was generating the random variables.
What if we don't know the values of the parameters and we can't estimate them from our own expert
knowledge? What if instead of knowing the random variables, we have a lot of examples of data
generated with the same underlying distribution? In this chapter we are going to learn formal ways of
estimating parameters from data.
These ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work
like this: (1) specify a probabilistic model that has parameters. (2) Learn the value of those parameters
from data.
Parameters
Before we dive into parameter estimation, first let's revisit the concept of parameters. Given a model, the
parameters are the numbers that yield the actual distribution. In the case of a Bernoulli random variable,
the single parameter was the value p. In the case of a Uniform random variable, the parameters are the a
and b values that define the min and max value. Here is a list of random variables and the corresponding
parameters. From now on, we are going to use the notation θ to be a vector of all the parameters:
Distribution     Parameters
Bernoulli(p)     θ = p
Poisson(λ)       θ = λ
Uniform(a, b)    θ = [a, b]
Normal(μ, σ²)    θ = [μ, σ²]
In the real world often you don't know the "true" parameters, but you get to observe data. Next up, we
will explore how we can use data to estimate the model parameters.
It turns out there isn't just one way to estimate the value of parameters. There are two main schools of
thought: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP). Both of these
schools of thought assume that your data are independent and identically distributed (IID) samples:
X_1, X_2, … X_n.
Likelihood
We made the assumption that our data are identically distributed. This means that they must have either
the same probability mass function (if the data are discrete) or the same probability density function (if
the data are continuous). To simplify our conversation about parameter estimation we are going to use the
notation f (X|θ) to refer to this shared PMF or PDF. Our new notation is interesting in two ways. First,
we have now included a conditional on θ which is our way of indicating that the likelihood of different
values of X depends on the values of our parameters. Second, we are going to use the same symbol f for
both discrete and continuous distributions.
What does likelihood mean and how is "likelihood" different than "probability"? In the case of discrete
distributions, likelihood is a synonym for the probability mass, or joint probability mass, of your data. In
the case of continuous distribution, likelihood refers to the probability density of your data.
Since we assumed that each data point is independent, the likelihood of all of our data is the product of the likelihood of each data point. Mathematically, the likelihood of our data given parameters θ is:
L(θ) = ∏_{i=1}^n f(X_i | θ)
For different values of parameters, the likelihood of our data will be different. If we have correct
parameters our data will be much more probable than if we have incorrect parameters. For that reason we
write likelihood as a function of our parameters (θ).
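To see this concretely, here is a small sketch (with made-up coin-flip data) comparing the likelihood of the same data under two candidate values of a Bernoulli parameter p:

import math

data = [1, 0, 1, 1, 1, 0, 1, 1]   # hypothetical observations

def likelihood(p):
    # product of f(x_i | p) over all datapoints
    return math.prod(p if x == 1 else 1 - p for x in data)

print(likelihood(0.75))   # approx 0.0111
print(likelihood(0.20))   # approx 0.000041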
Maximization
In maximum likelihood estimation (MLE) our goal is to choose values of our parameters (θ) that maximize the likelihood function from the previous section. We are going to use the notation θ̂ to represent the best choice of values for our parameters. Formally, MLE assumes that:

θ̂ = argmax_θ L(θ)
Argmax is short for Arguments of the Maxima. The argmax of a function is the value of the domain at
which the function is maximized. It applies for domains of any dimension.
A cool property of argmax is that since log is a monotone function, the argmax of a function is the same as the argmax of the log of the function! That's nice because logs make the math simpler. If we find the argmax of the log of likelihood it will be equal to the argmax of the likelihood. Thus for MLE we first write the Log Likelihood function (LL):

LL(θ) = log L(θ) = log ∏_{i=1}^n f(X_i | θ) = ∑_{i=1}^n log f(X_i | θ)
To use a maximum likelihood estimator, first write the log likelihood of the data given your parameters. Then choose the value of parameters that maximize the log likelihood function. Argmax can be computed in many ways. All of the methods that we cover in this class require computing the first derivative of the function.
Bernoulli MLE Estimation

Consider n IID samples X_1, X_2, … X_n, each drawn from a Bernoulli with the same p, X_i ∼ Ber(p). We want to find out what that p is.
Step one of MLE is to write the likelihood of a Bernoulli as a function that we can maximize. Since a
Bernoulli is a discrete distribution, the likelihood is the probability mass function.
The probability mass function of a Bernoulli X can be written as f(x) = p^x (1 − p)^{1−x}. Wow! What's up with that? It's an equation that allows us to say that the probability that X = 1 is p and the probability that X = 0 is 1 − p. Convince yourself that when X_i = 0 and when X_i = 1 the PMF returns the right probabilities.

L(θ) = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i}              First write the likelihood function
LL(θ) = ∑_{i=1}^n log [ p^{x_i} (1 − p)^{1−x_i} ]     Then write the log likelihood function
      = ∑_{i=1}^n x_i log p + (1 − x_i) log(1 − p)
      = Y log p + (n − Y) log(1 − p)                  where Y = ∑_{i=1}^n x_i
Great Scott! We have the log likelihood equation. Now we simply need to choose the value of p that maximizes our log-likelihood. As your calculus teacher probably taught you, one way to find the value which maximizes a function is to take the first derivative of the function and set it equal to 0:

∂LL(p)/∂p = Y · (1/p) + (n − Y) · (−1/(1 − p)) = 0

p̂ = Y/n = ( ∑_{i=1}^n x_i ) / n
All that work to find out that the MLE estimate is simply the sample mean...
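In code, the estimator is a one-liner. A minimal sketch with hypothetical flips:

data = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical Bernoulli samples
p_hat = sum(data) / len(data)     # MLE estimate of p is the sample mean
print(p_hat)                      # 0.625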
Normal MLE Estimation

Estimating the parameters of a Normal is trickier since a normal has two parameters that we have to estimate. In this case θ is a vector with two values: the first is the mean (μ) parameter, the second is the variance (σ²) parameter.
L(θ) = ∏_{i=1}^n f(X_i | θ)
     = ∏_{i=1}^n (1 / √(2πθ_1)) · e^{ −(x_i − θ_0)² / (2θ_1) }            Likelihood for a continuous variable is the PDF

LL(θ) = ∑_{i=1}^n log [ (1 / √(2πθ_1)) · e^{ −(x_i − θ_0)² / (2θ_1) } ]   We want to calculate log likelihood
      = ∑_{i=1}^n [ −log(√(2πθ_1)) − (1 / (2θ_1)) (x_i − θ_0)² ]
Again, the last step of MLE is to choose values of θ that maximize the log likelihood function. In this case we can calculate the partial derivative of the LL function with respect to both θ_0 and θ_1, set both equations equal to 0 and then solve for the values of θ. Doing so results in the equations for the values μ̂ = θ̂_0 and σ̂² = θ̂_1 that maximize likelihood. The result is:

μ̂ = (1/n) ∑_{i=1}^n x_i     and     σ̂² = (1/n) ∑_{i=1}^n (x_i − μ̂)²
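A minimal sketch of these two estimators on hypothetical data:

data = [6.3, 5.5, 5.4, 7.1, 4.6, 6.7]   # hypothetical samples
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n   # note: divides by n, not n - 1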
Linear Transform Plus Noise

Consider data generated as Y = θX + Z, where Z ∼ N(0, σ²) is independent Gaussian noise. In the case where you are told the value of X, θX is a number and θX + Z is the sum of a Gaussian and a number. This implies that Y|X ∼ N(θX, σ²). Our goal is to choose a value of θ that maximizes the probability of the IID data: (X_1, Y_1), (X_2, Y_2), … (X_n, Y_n).
We approach this problem by first finding a function for the log likelihood of the data given θ. Then we
find the value of θ that maximizes the log likelihood function. To start, use the PDF of a Normal to
express the probability of Y |X, θ:
f(Y_i | X_i, θ) = (1 / (√(2π) σ)) · e^{ −(Y_i − θX_i)² / (2σ²) }
Now we are ready to write the likelihood function, then take its log to get the log likelihood function:

L(θ) = ∏_{i=1}^n f(Y_i, X_i | θ)
     = ∏_{i=1}^n f(Y_i | X_i, θ) f(X_i)                                   f(X_i) is independent of θ
     = ∏_{i=1}^n (1 / (√(2π)σ)) e^{ −(Y_i − θX_i)² / (2σ²) } f(X_i)       Substitute in the definition of f(Y_i | X_i)

LL(θ) = log ∏_{i=1}^n (1 / (√(2π)σ)) e^{ −(Y_i − θX_i)² / (2σ²) } f(X_i)  Substitute in L(θ)
      = ∑_{i=1}^n [ log (1 / (√(2π)σ)) − (Y_i − θX_i)² / (2σ²) ] + ∑_{i=1}^n log f(X_i)
      = n log (1 / (√(2π)σ)) − (1 / (2σ²)) ∑_{i=1}^n (Y_i − θX_i)² + ∑_{i=1}^n log f(X_i)
Remove constant multipliers and terms that don't include θ. We are left with trying to find a value of θ that maximizes:

θ̂ = argmax_θ [ −∑_{i=1}^n (Y_i − θX_i)² ]
  = argmin_θ ∑_{i=1}^n (Y_i − θX_i)²
This result says that the value of θ that makes the data most likely is one that minimizes the squared error
of predictions of Y . We will see in a few days that this is the basis for linear regression.
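For this one-parameter model the argmin even has a closed form: setting the derivative of the squared error to zero gives θ̂ = (∑ X_iY_i) / (∑ X_i²). A sketch with made-up data:

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical inputs
ys = [2.1, 3.9, 6.2, 8.1]   # hypothetical noisy outputs, roughly y = 2x
theta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(theta_hat)            # close to 2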
Maximum A Posteriori
MLE is great, but it is not the only way to estimate parameters! This section introduces an alternate algorithm, Maximum A Posteriori (MAP). The paradigm of MAP is that we should choose the values for our parameters that are most likely given the data. At first blush this might seem the same as MLE; however, notice that MLE chooses the values of parameters that make the data most likely.
Formally, for IID random variables X_1, …, X_n:

θ_MAP = argmax_θ f(θ | X_1, X_2, …, X_n)
In the equation above we are trying to calculate the conditional probability of unobserved random variables given observed random variables. When that is the case, think Bayes' Theorem! Expand the function f using the continuous version of Bayes' Theorem:
θ_MAP = argmax_θ [ f(X_1, X_2, …, X_n | θ) g(θ) / h(X_1, X_2, … X_n) ]     Ahh, much better
Note that f, g and h are all probability densities. I used different symbols to make it explicit that they may be different functions. Now we are going to leverage two observations. First, the data is assumed
to be IID so we can decompose the density of the data given θ. Second, the denominator is a constant
with respect to θ. As such its value does not affect the argmax and we can drop that term.
Mathematically:
θ_MAP = argmax_θ [ ( ∏_{i=1}^n f(X_i | θ) ) g(θ) / h(X_1, X_2, … X_n) ]     Since the samples are IID
      = argmax_θ [ ( ∏_{i=1}^n f(X_i | θ) ) g(θ) ]                          Dropping the constant denominator

As before, it will be more convenient to find the argmax of the log of the MAP function, which gives us the final form for MAP estimation of parameters:

θ_MAP = argmax_θ [ log g(θ) + ∑_{i=1}^n log f(X_i | θ) ]
Using Bayesian terminology, the MAP estimate is the mode of the "posterior" distribution for θ. If you look at this equation side by side with the MLE equation you will notice that MAP is the argmax of the exact same function plus a term for the log of the prior.
Parameter Priors
In order to get ready for the world of MAP estimation, we are going to need to brush up on our
distributions. We will need reasonable distributions for each of our different parameters. For example, if
you are predicting a Poisson distribution, what is the right random variable type for the prior of λ?
A desideratum for prior distributions is that the resulting posterior distribution has the same functional form. We call these "conjugate" priors. In the case where you are updating your belief many times, conjugate priors make programming in the math equations much easier.
Here is a list of different parameters and the distributions most often used for their priors:
Parameter        Distribution
Bernoulli p      Beta
Binomial p       Beta
Poisson λ        Gamma
Exponential λ    Gamma
Multinomial p_i  Dirichlet
Normal μ         Normal
Normal σ²        Inverse Gamma
You are only expected to know the new distributions on a high level. You do not need to know Inverse
Gamma. I included it for completeness.
The distributions used to represent your "prior" belief about a random variable will often have their own
parameters. For example, a Beta distribution is defined using two parameters (a, b). Do we have to use
parameter estimation to evaluate a and b too? No. Those parameters are called "hyperparameters": a term we reserve for parameters in our model that we fix before running parameter estimation. Before you run MAP you decide on the values of (a, b).
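As a concrete sketch (my own example): for Bernoulli data with a Beta(a, b) prior, the posterior is Beta(a + heads, b + tails), whose mode gives a closed-form MAP estimate:

a, b = 2, 2              # hyperparameters, chosen before running MAP
data = [1, 0, 1, 1, 0]   # hypothetical observations
n, heads = len(data), sum(data)
p_map = (heads + a - 1) / (n + a + b - 2)   # mode of Beta(a + heads, b + n - heads)
print(p_map)                                # approx 0.571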
Dirichlet
The Dirichlet distribution generalizes the Beta in the same way the Multinomial generalizes the Bernoulli. A Dirichlet random variable X is parametrized as X ∼ Dirichlet(a_1, a_2, …, a_m). The PDF of the distribution is:

f(X_1 = x_1, X_2 = x_2, …, X_m = x_m) = K ∏_{i=1}^m x_i^{a_i − 1}

where K is a normalizing constant.
You can intuitively understand the hyperparameters of a Dirichlet distribution: imagine you have seen ∑_{i=1}^m a_i − m imaginary trials, and in those trials you had (a_i − 1) outcomes of value i. As an example, consider estimating the probability of getting different numbers on a six-sided Skewed Dice (where each side is a different shape). We will estimate the probabilities of rolling each side of this dice by repeatedly rolling the dice n times. This will produce n IID samples. For the MAP paradigm, we are going to need a prior on our belief of each of the parameters p_1 … p_6. We want to express that we lightly believe that each side is equally likely.

Before you roll, let's imagine you had rolled the dice six times and had gotten one of each possible value. Thus the "prior" distribution would be Dirichlet(2, 2, 2, 2, 2, 2). After observing n_1 + n_2 + ⋯ + n_6 new trials with n_i results of outcome i, the "posterior" distribution is Dirichlet(2 + n_1, …, 2 + n_6). Using a prior which represents one imagined observation of each outcome is called Laplace smoothing.
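A sketch of the resulting estimate for the skewed dice (roll counts are made up): the mode of Dirichlet(2 + n_1, …, 2 + n_6) gives p_i = (n_i + 1) / (n + 6):

counts = [3, 7, 5, 9, 4, 2]   # hypothetical roll counts n_1 ... n_6
n = sum(counts)
p_map = [(c + 1) / (n + 6) for c in counts]   # Laplace-smoothed estimates
print(p_map)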
Gamma
The Gamma(k, θ) distribution is the conjugate prior for the λ parameter of the Poisson distribution (It is
also the conjugate for Exponential, but we won't delve into that).
The hyperparameters can be interpreted as: you saw k total imaginary events during θ imaginary time
periods. After observing n events during the next t time periods the posterior distribution is Gamma(
k + n, θ + t).
For example Gamma(10, 5) would represent having seen 10 imaginary events in 5 time periods. It is like
imagining a rate of 2 with some degree of confidence. If we start with that Gamma as a prior and then see
11 events in the next 2 time periods our posterior is Gamma(21,7) which is equivalent to an updated rate
of 3.
Machine Learning
Machine Learning is the subfield of computer science that gives computers the ability to perform tasks
without being explicitly programmed. There are several different tasks that fall under the domain of
machine learning and several different algorithms for "learning". In this chapter, we are going to focus on
Classification and two classic Classification algorithms: Naive Bayes and Logistic Regression.
Classification
In classification tasks, your job is to use training data with feature/label pairs (x, y) in order to estimate a function ŷ = g(x). This function can then be used to make a prediction. In classification the value of y takes on one of a discrete number of values. As such we often choose:

g(x) = argmax_y P̂(Y = y | X = x)
In the classification task you are given N training pairs: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N)), where x^(i) is a vector of m discrete features for the ith training example and y^(i) is the discrete label for the ith training example.
In our introduction to machine learning, we are going to assume that all values in our training dataset are binary. While this is not a necessary assumption (both naive Bayes and logistic regression can work for non-binary data), it makes it much easier to learn the core concepts. Specifically, we assume that all labels are binary, y^(i) ∈ {0, 1} ∀i, and all features are binary, x_j^(i) ∈ {0, 1} ∀i, j.
Naïve Bayes
Naive Bayes is a machine learning algorithm for the classification task. It makes the substantial assumption (called the Naive Bayes assumption) that all features are independent of one another, given the classification label. This assumption is wrong, but allows for a simple and fast algorithm that is often useful. In order to implement Naive Bayes you will need to learn how to train your model and how to use it to make predictions, once trained.
Prediction
For an example with x = [x_1, x_2, …, x_m], estimate the value of y as:

ŷ = argmax_{y ∈ {0,1}} [ log P̂(Y = y) + ∑_{i=1}^m log P̂(X_i = x_i | Y = y) ]
Note that for small enough datasets you may not need to use the log version of the argmax.
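A sketch of the prediction rule above in Python. The probability estimates here are hypothetical; in practice they come from training (e.g. MLE or MAP counts):

import math

p_y = {0: 0.6, 1: 0.4}           # estimates of P(Y = y)
p_x1_given_y = {                 # p_x1_given_y[y][i] = estimate of P(X_i = 1 | Y = y)
    0: [0.2, 0.7, 0.1],
    1: [0.8, 0.3, 0.5],
}

def predict(x):
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = math.log(p_y[y])
        for i, x_i in enumerate(x):
            p1 = p_x1_given_y[y][i]
            score += math.log(p1 if x_i == 1 else 1 - p1)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict([1, 0, 1]))   # predicts the label with the higher log score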
Theory
In the world of classification, when we make a prediction we want to choose the value of y that maximizes P(Y = y | X = x):

ŷ = argmax_{y ∈ {0,1}} P(Y = y | X = x)                               Our objective
  = argmax_{y ∈ {0,1}} [ P(Y = y) P(X = x | Y = y) / P(X = x) ]       By Bayes' theorem
Using our training data we could interpret the joint distribution of X and Y as one giant multinomial with a different parameter for every combination of X = x and Y = y. If, for example, the input vectors have length one (|x| = 1) and the number of values that x and y can take on is small, say binary, this is a totally reasonable approach. We could estimate the multinomial using MLE or MAP estimators and then calculate the argmax over a few lookups in our table.
The bad times hit when the number of features becomes large. Recall that our multinomial needs to estimate a parameter for every unique combination of assignments to the vector x and the value y. If there are |x| = n binary features then this strategy is going to take order O(2ⁿ) space and there will
likely be many parameters that are estimated without any training data that matches the corresponding
assignment.
The Naïve Bayes Assumption is wrong, but useful. This assumption allows us to make predictions using
space and data which is linear with respect to the size of the features: O(n) if |x| = n. That allows us to
train and make predictions for huge feature spaces such as one which has an indicator for every word on the
internet. Using this assumption the prediction algorithm can be simplified.
ŷ = argmax_{y ∈ {0,1}} P(Y = y) P(X = x | Y = y)                          As we last left off
  = argmax_{y ∈ {0,1}} P(Y = y) ∏_i P(X_i = x_i | Y = y)                  Naïve Bayes assumption
  = argmax_{y ∈ {0,1}} [ log P(Y = y) + ∑_i log P(X_i = x_i | Y = y) ]    Take the log
In the last step we leverage the fact that the argmax of a function is equal to the argmax of the log of a function. This algorithm is fast and stable, both when training and when making predictions.
Logistic Regression
Logistic Regression is a classification algorithm (I know, terrible name. Perhaps Logistic Classification
would have been better) that works by trying to learn a function that approximates P(y|x). It makes the
central assumption that P(y|x) can be approximated as a sigmoid function applied to a linear
combination of input features. It is particularly important to learn because logistic regression is the basic
building block of artificial neural networks.
Mathematically:

P(Y = 1 | X = x) = σ(θᵀx) = σ( ∑_{i=1}^m θ_i x_i )
P(Y = 0 | X = x) = 1 − σ(θᵀx)     by the law of total probability
Using these equations for the probability of Y|X we can create an algorithm that selects values of theta that maximize that probability for all data. I am first going to state the log probability function and partial derivatives with respect to theta. Then later we will (a) show an algorithm that can choose optimal values of theta and (b) show how the equations were derived.
An important thing to realize is that, given the best values for the parameters (θ), logistic regression often does a great job of estimating the probability of different class labels. However, given bad, or even random, values of θ it does a poor job. The amount of "intelligence" that your logistic regression machine learning algorithm has depends on having good values of θ.
Notation
Before we get started I want to make sure that we are all on the same page with respect to notation. In
logistic regression, θ is a vector of parameters of length m and we are going to learn the values of those
parameters based off of n training examples. The number of parameters should be equal to the number of
features of each datapoint.
Two pieces of notation that we use often in logistic regression that you may not be familiar with are:
θᵀx = ∑_{i=1}^m θ_i x_i = θ_1x_1 + θ_2x_2 + ⋯ + θ_mx_m     dot product, aka weighted sum

σ(z) = 1 / (1 + e^{−z})     sigmoid function
Log Likelihood
In order to choose values for the parameters of logistic regression we use Maximum Likelihood Estimation (MLE). As such we are going to have two steps: (1) write the log-likelihood function and (2) find the values of θ that maximize the log-likelihood function.

The labels that we are predicting are binary, and the output of our logistic regression function is supposed to be the probability that the label is one. This means that we can (and should) interpret each label as a Bernoulli random variable: Y ∼ Bern(p) where p = σ(θᵀx).
To start, here is a super slick way of writing the probability of one datapoint (recall this is the equation
form of the probability mass function of a Bernoulli):
P(Y = y | X = x) = σ(θᵀx)^y · [1 − σ(θᵀx)]^{(1−y)}
Now that we know the probability mass function, we can write the likelihood of all the data:
L(θ) = ∏_{i=1}^n P(Y = y^(i) | X = x^(i))                              The likelihood of independent training labels
     = ∏_{i=1}^n σ(θᵀx^(i))^{y^(i)} · [1 − σ(θᵀx^(i))]^{(1−y^(i))}     Substituting the likelihood of a Bernoulli

And if you take the log of this function, you get the Log Likelihood for Logistic Regression. The log likelihood equation is:

LL(θ) = ∑_{i=1}^n y^(i) log σ(θᵀx^(i)) + (1 − y^(i)) log [ 1 − σ(θᵀx^(i)) ]
Recall that in MLE the only remaining step is to choose parameters (θ) that maximize log likelihood. In the case of logistic regression we can't solve for θ mathematically. Instead we use a computer to choose θ. To do so we employ an algorithm called gradient descent (a classic in optimization theory). The idea behind gradient descent is that if you continuously take small steps downhill (in the direction of your negative gradient), you will eventually make it to a local minimum. In our case we want to maximize our likelihood. As you can imagine, minimizing the negative of our likelihood is equivalent to maximizing our likelihood.
The update to our parameters that results in each small step can be calculated as:
θ_j^new = θ_j^old + η · ∂LL(θ^old) / ∂θ_j^old

where the partial derivative (derived in the next section) is:

∂LL(θ) / ∂θ_j = ∑_{i=1}^n [ y^(i) − σ(θᵀx^(i)) ] x_j^(i)
Where η is the magnitude of the step size that we take. If you keep updating θ using the equation above
you will converge on the best values of θ. You now have an intelligent model. Here is the gradient ascent
algorithm for logistic regression in pseudo-code:
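A minimal sketch of that algorithm in Python (my own rendering; the learning rate η, iteration count, and data format are illustrative choices):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def gradient_ascent(xs, ys, eta=0.1, n_iters=1000):
    # xs: list of feature vectors (set x[0] = 1 to learn an intercept theta_0)
    # ys: list of binary labels
    m = len(xs[0])
    theta = [0.0] * m
    for _ in range(n_iters):
        gradient = [0.0] * m
        for x, y in zip(xs, ys):
            error = y - sigmoid(sum(t * x_i for t, x_i in zip(theta, x)))
            for j in range(m):
                gradient[j] += error * x[j]
        theta = [t + eta * g for t, g in zip(theta, gradient)]   # small step uphill
    return theta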
Pro-tip: Don't forget that in order to learn the value of θ_0 you can simply define x_0 to always be 1.
Derivations
In this section we provide the mathematical derivations for the gradient of log-likelihood. The derivations
are worth knowing because these ideas are heavily used in Artificial Neural Networks.
Our goal is to calculate the derivative of the log likelihood with respect to each theta. To start, here is the
definition for the derivative of a sigmoid function with respect to its inputs:
∂σ(z)/∂z = σ(z)[1 − σ(z)]     (to get the derivative with respect to θ, use the chain rule)
Take a moment and appreciate the beauty of the derivative of the sigmoid function. The reason that
sigmoid has such a simple derivative stems from the natural exponent in the sigmoid denominator.
Since the likelihood function is a sum over all of the data, and in calculus the derivative of a sum is the
sum of derivatives, we can focus on computing the derivative of one example. The gradient of theta is
simply the sum of this term for each training datapoint.
First I am going to show you how to compute the derivative the hard way. Then we are going to look at an easier method. The derivative of the log likelihood for one datapoint (x, y):

∂LL(θ)/∂θ_j = ∂/∂θ_j [ y log σ(θᵀx) ] + ∂/∂θ_j [ (1 − y) log(1 − σ(θᵀx)) ]          derivative of sum of terms
            = [ y/σ(θᵀx) − (1 − y)/(1 − σ(θᵀx)) ] · ∂/∂θ_j σ(θᵀx)                   derivative of log f(x)
            = [ y/σ(θᵀx) − (1 − y)/(1 − σ(θᵀx)) ] · σ(θᵀx)[1 − σ(θᵀx)] x_j          chain rule + derivative of sigma
            = [ (y − σ(θᵀx)) / (σ(θᵀx)[1 − σ(θᵀx)]) ] · σ(θᵀx)[1 − σ(θᵀx)] x_j      algebraic manipulation
            = [ y − σ(θᵀx) ] x_j                                                    cancelling terms
∂LL(θ)/∂θ_j = ∂LL(θ)/∂p · ∂p/∂θ_j                 where p = σ(θᵀx)
            = ∂LL(θ)/∂p · ∂p/∂z · ∂z/∂θ_j         where z = θᵀx
Chain rule is the decomposition mechanism of calculus. It allows us to calculate a complicated partial derivative (∂LL(θ)/∂θ_j) by breaking it down into smaller pieces.
LL(θ) = y log p + (1 − y) log(1 − p)     where p = σ(θᵀx)
∂LL(θ)/∂p = y/p − (1 − y)/(1 − p)        by taking the derivative

p = σ(z)                                 where z = θᵀx
∂p/∂z = σ(z)[1 − σ(z)]                   by taking the derivative of the sigmoid

z = θᵀx                                  as previously defined
∂z/∂θ_j = x_j                            only x_j interacts with θ_j
Each of those derivatives was much easier to calculate. Now we simply multiply them together.
∂LL(θ)/∂θ_j = ∂LL(θ)/∂p · ∂p/∂z · ∂z/∂θ_j
            = [ y/p − (1 − y)/(1 − p) ] · σ(z)[1 − σ(z)] · x_j     by substituting in for each term
            = [ y/p − (1 − y)/(1 − p) ] · p(1 − p) · x_j           since p = σ(z)
            = [ y − p ] x_j                                        expanding
            = [ y − σ(θᵀx) ] x_j                                   since p = σ(θᵀx)
Data = [6.3, 5.5, 5.4, 7.1, 4.6, 6.7, 5.3, 4.8, 5.6, 3.4, 5.4, 3.4, 4.8, 7.9, 4.6, 7.0, 2.9, 6.4, 6.0, 4.3]

[Interactive demo: Your Gaussian]
X ∼ Pareto(α)

You would like the alpha in your artwork to match that of sand on your local beach. You go to the beach and collect 100 particles of sand and measure their sizes. Call the measured radii x_1, …, x_100. Derive a formula for the MLE estimate of α based on the data you have collected. The likelihood of the data is:

L(α) = ∏_{i=1}^n α / x_i^{α+1}
Optimization will be much easier if we instead try to optimize the log likelihood:

LL(α) = log L(α) = log ∏_{i=1}^n α / x_i^{α+1}
      = ∑_{i=1}^n log ( α / x_i^{α+1} )
      = ∑_{i=1}^n [ log α − (α + 1) log x_i ]
      = n log α − (α + 1) ∑_{i=1}^n log x_i
Selecting α
We are going to select α to be the value which maximizes the log likelihood. To do so we are going to
need the derivative of LL w.r.t. α
∂LL(α)/∂α = ∂/∂α ( n log α − (α + 1) ∑_{i=1}^n log x_i )
          = n/α − ∑_{i=1}^n log x_i
One way to optimize is to take the derivative and set it equal to zero:

0 = n/α − ∑_{i=1}^n log x_i
∑_{i=1}^n log x_i = n/α
α = n / ∑_{i=1}^n log x_i
import math

def estimate_alpha(observations):
    # computes the MLE estimate of alpha: n / sum of log x_i
    log_sum = 0
    for x_i in observations:
        log_sum += math.log(x_i)
    n = len(observations)
    return n / log_sum
def main():
    observations = [1.677, 3.812, 1.463, 2.641, 1.256, 1.678, 1.157, 1.146,
                    1.323, 1.029, 1.238, 1.018, 1.171, 1.123, 1.074, 1.652, 1.873, 1.314,
                    1.309, 3.325, 1.045, 2.271, 1.305, 1.277, 1.114, 1.391, 3.728, 1.405,
                    1.054, 2.789, 1.019, 1.218, 1.033, 1.362, 1.058, 2.037, 1.171, 1.457,
                    1.518, 1.117, 1.153, 2.257, 1.022, 1.839, 1.706, 1.139, 1.501, 1.238,
                    2.53, 1.414, 1.064, 1.097, 1.261, 1.784, 1.196, 1.169, 2.101, 1.132,
                    1.193, 1.239, 1.518, 2.764, 1.053, 1.267, 1.015, 1.789, 1.099, 1.25,
                    1.253, 1.418, 1.494, 1.015, 1.459, 2.175, 2.044, 1.551, 4.095, 1.396,
                    1.262, 1.351, 1.121, 1.196, 1.391, 1.305, 1.141, 1.157, 1.155, 1.103,
                    1.048, 1.918, 1.889, 1.068, 1.811, 1.198, 1.361, 1.261, 4.093, 2.925,
                    1.133, 1.573]
    alpha = estimate_alpha(observations)
    print(alpha)

if __name__ == '__main__':
    main()
Gaussian Mixtures
Data = [6.47, 5.82, 8.7, 4.76, 7.62, 6.95, 7.44, 6.73, 3.38, 5.89, 7.81, 6.93, 7.23, 6.25, 5.31, 7.71, 7.42,
5.81, 4.03, 7.09, 7.1, 7.62, 7.74, 6.19, 7.3, 7.37, 6.99, 2.97, 3.3, 7.08, 6.23, 3.67, 3.05, 6.67, 6.5, 6.08,
3.7, 6.76, 6.56, 3.61, 7.25, 7.34, 6.27, 6.54, 5.83, 6.44, 5.34, 7.7, 4.19, 7.34]
[Interactive demo: adjust the mixture parameters to maximize the likelihood of the data above; the readout shows Likelihood, Log Likelihood, and Best Seen]
Generative Code

The data above were generated by a mixture of two Gaussians: each sample is drawn from A with probability t, and from B otherwise (see the sketch below).

A ∼ N(μ_a, σ_a²)
B ∼ N(μ_b, σ_b²)
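Here is a sketch of that generative process in Python (the parameter values are illustrative, not the ones behind the data above):

from random import random, gauss

t, mu_a, sigma_a = 0.3, 3.5, 0.5   # illustrative parameter choices
mu_b, sigma_b = 6.8, 1.0

def sample():
    if random() < t:
        return gauss(mu_a, sigma_a)   # draw from A
    return gauss(mu_b, sigma_b)       # draw from B

data = [sample() for _ in range(50)]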
The corresponding mixture density is:

f(x) = t · (1 / (√(2π) σ_a)) e^{ −(1/2)((x − μ_a)/σ_a)² } + (1 − t) · (1 / (√(2π) σ_b)) e^{ −(1/2)((x − μ_b)/σ_b)² }

Let θ⃗ = [t, μ_a, μ_b, σ_a, σ_b] be the parameters. Because the math will get long I will use θ as notation in place of θ⃗. Just keep in mind that it is a vector.
The MLE idea is to choose values of θ which maximize log likelihood. All optimization methods require us to calculate the partial derivatives of the thing we want to optimize (log likelihood) with respect to the values we can change (our parameters).
Likelihood function

L(θ) = ∏_{i=1}^n f(x_i | θ)
     = ∏_{i=1}^n [ t · (1 / (√(2π)σ_a)) e^{ −(1/2)((x_i − μ_a)/σ_a)² } + (1 − t) · (1 / (√(2π)σ_b)) e^{ −(1/2)((x_i − μ_b)/σ_b)² } ]

LL(θ) = log ∏_{i=1}^n f(x_i | θ)
      = ∑_{i=1}^n log f(x_i | θ)
That is sufficient for now, but if you wanted to expand out the term you would get:

LL(θ) = ∑_{i=1}^n log [ t · (1 / (√(2π)σ_a)) e^{ −(1/2)((x_i − μ_a)/σ_a)² } + (1 − t) · (1 / (√(2π)σ_b)) e^{ −(1/2)((x_i − μ_b)/σ_b)² } ]
Caution: When I first wrote this demo I thought it would be a simple derivative. It is not so simple, because the log has a sum in it, so the log term doesn't reduce. The log still serves to turn the outer ∏ into a ∑. As such the LL partial derivatives are solvable, but the proof uses quite a lot of chain rule.
Takeaway: The main takeaway from this section (in case you want to skip the derivative proof) is that the resulting derivative is complex enough that we will want a way to compute the argmax without having to set that derivative equal to zero and solve for μ_a. Enter gradient descent!
A good first step when doing a huge derivative of a log likelihood function is to think of the derivative of the log of likelihood of a single datapoint. This is the inner sum in the log likelihood expression:

d/dμ_a log f(x_i | θ)

Before we start: notice that μ_a does not show up in this term from f(x_i | θ):

(1 − t) · (1 / (√(2π)σ_b)) e^{ −(1/2)((x_i − μ_b)/σ_b)² } = K

In the proof, when we encounter this term, we are going to think of it as a constant which we call K. Ok, let's go for it!
d/dμ_a log f(x_i | θ)
= (1 / f(x_i|θ)) · d/dμ_a f(x_i | θ)                                                               chain rule on log
= (1 / f(x_i|θ)) · d/dμ_a [ t · (1/(√(2π)σ_a)) e^{ −(1/2)((x_i − μ_a)/σ_a)² } + K ]                substitute in f(x_i|θ)
= (1 / f(x_i|θ)) · d/dμ_a [ t · (1/(√(2π)σ_a)) e^{ −(1/2)((x_i − μ_a)/σ_a)² } ]                    d/dμ_a K = 0
= ( t / (f(x_i|θ)√(2π)σ_a) ) · d/dμ_a e^{ −(1/2)((x_i − μ_a)/σ_a)² }                               pull out constants
= ( t / (f(x_i|θ)√(2π)σ_a) ) · e^{ −(1/2)((x_i − μ_a)/σ_a)² } · d/dμ_a [ −(1/2)((x_i − μ_a)/σ_a)² ]  chain rule on e^x
= ( t / (f(x_i|θ)√(2π)σ_a) ) · e^{ −(1/2)((x_i − μ_a)/σ_a)² } · [ −((x_i − μ_a)/σ_a) · d/dμ_a ((x_i − μ_a)/σ_a) ]  chain rule on x²
= ( t / (f(x_i|θ)√(2π)σ_a) ) · e^{ −(1/2)((x_i − μ_a)/σ_a)² } · [ −((x_i − μ_a)/σ_a) · (−1/σ_a) ]  final derivative
= ( t / (f(x_i|θ)√(2π)σ_a³) ) · e^{ −(1/2)((x_i − μ_a)/σ_a)² } · (x_i − μ_a)                       simplify

Summing over all datapoints gives the partial derivative of the log likelihood:

d/dμ_a LL(θ) = ∑_{i=1}^n ( t / (f(x_i|θ)√(2π)σ_a³) ) · e^{ −(1/2)((x_i − μ_a)/σ_a)² } · (x_i − μ_a)
This process should be repeated for all five parameters! Now, how should we find a value of μ_a which, in the presence of the other parameter settings and the data, makes this derivative zero? Setting the derivative equal to 0 and solving for μ_a is not going to work.
∇_θ LL(θ) = [ dLL(θ)/dt ,  dLL(θ)/dμ_a ,  dLL(θ)/dμ_b ,  dLL(θ)/dσ_a ,  dLL(θ)/dσ_b ]ᵀ
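Gradient ascent only needs us to evaluate these derivatives, not invert them. As a sketch (my own code, with illustrative values), here is the derived dLL/dμ_a and a single ascent step on μ_a:

import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def f(x, t, mu_a, sigma_a, mu_b, sigma_b):
    # the mixture density f(x | theta)
    return t * normal_pdf(x, mu_a, sigma_a) + (1 - t) * normal_pdf(x, mu_b, sigma_b)

def dLL_dmu_a(data, t, mu_a, sigma_a, mu_b, sigma_b):
    # the derivative derived above, summed over all datapoints
    total = 0.0
    for x in data:
        num = t * math.exp(-0.5 * ((x - mu_a) / sigma_a) ** 2) * (x - mu_a)
        den = f(x, t, mu_a, sigma_a, mu_b, sigma_b) * math.sqrt(2 * math.pi) * sigma_a ** 3
        total += num / den
    return total

data = [3.1, 2.9, 7.0, 6.8]                                # hypothetical sample
t, mu_a, sigma_a, mu_b, sigma_b = 0.5, 3.0, 0.5, 7.0, 1.0  # illustrative values
eta = 0.01
mu_a = mu_a + eta * dLL_dmu_a(data, t, mu_a, sigma_a, mu_b, sigma_b)   # one step uphill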