109 Notes
ARUN DEBRAY
JUNE 11, 2013
These notes were taken in Stanford’s CS 109 class in Spring 2013, taught by Professor Mehran Sahami. I live-TEXed
them using vim, and as such there may be typos; please send questions, comments, complaints, and corrections to
[email protected]. Thanks to Rebecca Wang, Shivaal Roy, and an anonymous student for catching a few errors.
Contents
1. Counting: 4/1/13
2. Permutations and Combinations: 4/3/13
3. Probability: 4/5/13
4. Conditional Probability: 4/8/13
5. More Conditional Probability: 4/10/13
6. Conditional Independence and Random Variables: 4/12/13
7. To Vegas: 4/15/13
8. The Poisson Distribution: 4/17/13
9. From Discrete to Continuous: 4/19/13
10. The Normal Distribution: 4/22/13
11. The Exponential Distribution: 4/24/13
12. The Multinomial Distribution and Independent Variables: 4/26/13
13. Adding Random Variables: 4/29/13
14. Continuous Conditional Distributions: 5/1/13
15. More Expectation: 5/3/13
16. Covariance: 5/6/13
17. Predictions: 5/8/13
18. Moment-Generating Functions: 5/10/13
19. Laws of Large Numbers and the Central Limit Theorem: 5/13/13
20. Parameters: 5/15/13
21. Likelihood: 5/17/13
22. More Bayesian Probability: 5/20/13
23. Machine Learning: 5/22/13
24. Logistic Regression: 5/24/13
25. Bayesian Networks: 5/29/13
26. Generating Random Numbers: 5/31/13
27. Some Review for the Final: 6/3/13
1. Counting: 4/1/13
“One, ah, ah, ah! Two, ah ah ah! Three, ah ah ah!” – Count von Count, Sesame Street
Though the average CS student might wonder why he is learning probability, it is in fact an incredibly useful
way of understanding principles and techniques in CS. The traditional view of probability (placing balls in urns)
isn’t as helpful, but there are a lot of problems in CS which are just abstractions of the conventional view of
probability.
For example, how does Amazon determine recommendations of similar orders? How does a spam filter classify
messages? Both of these applications rely on probability.
Figure 1. Source: https://xkcd.com/552/.
editions of the same book, one might not want to get both of them. Under this constraint, break the problem into
three subproblems:
(1) Suppose the 8th edition and two other books are chosen. Then, there are $\binom{4}{2} = 6$ options, since the 9th edition cannot also be chosen.
(2) Similarly, if the 9th edition and two other books are chosen, there are again $\binom{4}{2} = 6$ options.
(3) If neither edition is chosen, there are $\binom{4}{3} = 4$ options.
Thus, adding them together gives the total number: 16. Dividing this into subproblems is a good strategy, but
it’s important to make sure they cover the whole space and don’t overlap.
Alternatively, you could count all the options that you don’t want: there are four such options, given by both
editions and another book. Thus, there are 20 − 4 = 16 options, again. (
The formula for $\binom{n}{k}$ has a nice recursive definition which can be illustrated with some code: pick some
special point. Either it is in the combination, or it isn’t, and these are all of the distinct cases. Thus, there are
C(n − 1, k − 1) options if it is included, and C(n − 1, k) if it isn’t included. This is nice, but recursion usually
needs the base cases: C(n, 0) = C(n, n) = 1.
Here’s how this looks in code:
// Number of ways to choose k items from n: the special element is either excluded or included.
int C(int n, int k) {
    if (k == 0 || n == k) return 1;     // base cases
    return C(n-1, k) + C(n-1, k-1);
}
$\binom{n}{k}$ is what is called a binomial coefficient: it is named so because it appears in the formula given below. There
are two proofs in the textbook, but they’re not all that important in the class. However, the equation is very
important.
Theorem 2.3 (Binomial).
\[(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.\]
For an application, consider a set with n elements. Then, what is the size of the power set? If one chooses a
subset of size k, there are $\binom{n}{k}$ such options, so the total number of subsets is
\[\sum_{k=0}^{n} \binom{n}{k} = \sum_{k=0}^{n} \binom{n}{k} 1^k 1^{n-k} = (1 + 1)^n = 2^n.\]
More generally, one has the multinomial coefficient $\binom{n}{n_1,\dots,n_r} = n!/(n_1!\cdots n_r!)$, where $n_1 + \cdots + n_r = n$.
Technically, this is overdetermined, and sometimes the $n_r$-term is dropped (since it can be recovered).
Example 2.4. Suppose Google has 13 machines and 3 data centers and wants to allocate machines to data
centers A, B, and C, which can hold 6, 4, and 3 machines, respectively. Then, there are $13!/(6!\,4!\,3!) = \binom{13}{6,4,3}$
possibilities. (
Returning to the ball-and-urn formalism, how many ways are there to place n distinct balls into r urns?
There are rn possibilities, since any ball can go into any urn.
If they are taken to be indistinct, but the urns aren’t, then the question becomes a bit more nuanced. There
are n balls and r urns, but once everything is placed the indistinctness of the balls means there are n + r − 1
distinct objects: in some sense, the dividers between the urns are what we care about. Thus, the number of
options is $(n + r - 1)!/(n!\,(r - 1)!) = \binom{n+r-1}{r-1}$. The total number of options is divided by the number of ways
the balls can be permuted and the dividers can be permuted. This is because the urns can be represented as
dividers.
In CS, there is a slightly different example: consider a one-indexed array x of length r such that 0 ≤ x[i] ≤ n
for each i, and such that the sum of the elements is n. This is a reformulation of the balls problem: each index
of the array is an urn, and its value represents the number of the balls in the urn. The book calls this 6.2, and
has a related proposition 6.1:
Proposition 2.5. Consider the slightly related problem in which every urn has at least one ball in it.4 This can
be solved by giving each child one piece of candy first, which gives a slightly different spin on the previous problem:
there are n − r remaining candies to put in r baskets. Thus, the total number of options is $\binom{(n-r)+(r-1)}{r-1} = \binom{n-1}{r-1}$.
Consider some venture capital firm which is required to distribute $10 million amongst 4 distinct companies
A, B, C, and D, in $1-million increments. Thus, the goal is to put 10 balls in 4 urns, and some company might
be left out in the cold, giving $\binom{10+4-1}{4-1} = \binom{13}{3} = 286$.
But if you want to keep some of it, the answer is $\binom{14}{4}$, since you can be thought of as adding one more company:
your own bank account. This is much nicer than all the casework that is the obvious approach.
Lastly, suppose that A will only accept payments of at least $3 million. This reduces the search space:
after allocating $3 million to A, there is $7 million left to put in 4 companies, giving $\binom{7+4-1}{4-1} = \binom{10}{3}$ choices.
Add this to the case where A is given no money, which is an exercise in allocating $10 million to 3
companies.
This final question used to be used at company A (i.e. Google). Consider an n × n grid with a robot in the
lower left-hand corner of the grid and a destination in the upper right-hand corner. If the robot can only move up or
right, it will reach the destination, but the path isn’t unique. How many distinct paths are there? The robot
needs to make n − 1 moves up and n − 1 moves right, so the goal is to find the number of permutations of these
moves given that the moves in the same direction are identical: $(2n - 2)!/((n-1)!\,(n-1)!) = \binom{2n-2}{n-1}$.
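To make this concrete, here is a small sanity check (not from the lecture) that reuses the recursive C function from earlier to count these paths for a few values of n; the helper name paths is made up for illustration:

#include <stdio.h>

// Binomial coefficient, as defined earlier in these notes.
int C(int n, int k) {
    if (k == 0 || n == k) return 1;
    return C(n-1, k) + C(n-1, k-1);
}

// Number of monotone lattice paths across an n x n grid of points.
int paths(int n) {
    return C(2*n - 2, n - 1);
}

int main(void) {
    for (int n = 2; n <= 6; n++)
        printf("n = %d: %d paths\n", n, paths(n));   // 2, 6, 20, 70, 252
    return 0;
}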
3. Probability: 4/5/13
“I like to think of the size of a set as how much Stanford spirit it has: its cardinality.”
Today’s lecture will help with the material in the second half of the first problem set. However, in order to
understand probability some of the theory behind it will be necessary.
Definition 3.1. A sample space is the set of all possible outcomes of an experiment.
Example 3.2. Some examples of sample spaces:
• When flipping a coin, the sample space is S = {H, T }.
• When flipping two distinct coins, the sample space is S × S = {(H, H), (H, T ), (T, H), (T, T )}.
• When rolling a six-sided die, the sample space is {1, 2, 3, 4, 5, 6}.
• Sample spaces need not be finite: the number of emails sent in a day is {0, 1, 2, 3, . . . }.
• They can also be dense sets: the number of hours spent watching videos on Youtube in a day instead of
working is S = [0, 24]. (
4Real-world example: trick-or-treating. You wouldn’t want some kid to go home candyless, would you?
Definition 3.3. An event, usually denoted E, is a subset of the sample space.5 This is in some sense the outcome
we care about.
Example 3.4. Here are some events corresponding to the sample spaces in Example 3.2:
• If a coin flip is heads, E = {H}.
• If there is at least one head in two flips, E = {(H, H), (H, T ), (T, H)}. (
Now let’s take some set operations. If E, F ⊆ S are events in the sample space S, then an event that is in E
or F is represented by E ∪ F , and an event that is in E and F is given by E ∩ F .6 For example, if E represents
rolling a 1 or a 2 on a die and F represents rolling an even number, then E ∩ F is rolling a 2.
Definition 3.5. Two events E, F are called mutually exclusive if E ∩ F = ∅.
One can take the complement of an event E, ∼E = S \ E, or everything that isn’t in E. The book uses
E C . This leads to De Morgan’s Laws: ∼(E ∪ F ) = (∼E) ∩ (∼F ) and ∼(E ∩ F ) = (∼E) ∪ (∼F ). This can be
inductively generalized to n events in the unsurprising manner.
Probability is given in several different ways. The first will be given now, and another will be given later. The
frequentist interpretation of probability interprets the probability as the relative frequency of the event. Let
n(E) be the number of times event E happens when the experiment is done n times; then, the probability is
P (E) = limn→∞ n(E)/n. With this definition come some axioms:
(1) Probabilities must lie between 0 and 1.
(2) P (S) = 1.
(3) If E and F are mutually exclusive events, then P (E) + P (F ) = P (E ∪ F ).
From axiom 3 most of the interesting mathematics comes up. For just one example, the multivariate case is a
straightforward generalization: if E1 , E2 , . . . is a sequence of mutually exclusive events, then
\[P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).\]
This can be generalized to the uncountable case, but things get weird, so take care.
There are several other elementary consequences of the axioms. They are easy to prove by doing a bit of set
chasing or drawing Venn diagrams.
• P (∼E) = 1 − P (E).
• If E ⊆ F , then P (E) ≤ P (F ).
• P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
The latter formula can generalize to n events as
\[P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P\left(\bigcap_{j=1}^{r} E_{i_j}\right).\]
This seems pretty scary, but all that it does is take the intersections of subsets of size r for all r in 1 to n,
and adds and subtracts things based on the number intersected (so nothing is double-counted). All of the odd
intersections have a positive coefficient, and the even ones have a negative coefficient. The i1 < · · · < ir means
that every combination is considered, and order is used to ensure that each combination is taken only once.
For example, if n = 3,
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (B ∩ C) − P (A ∩ C) + P (A ∩ B ∩ C).
By drawing a Venn diagram, it should be possible to see why this works.
Some sample spaces have a property called equally likely outcomes. This means that each point in the
sample space has the same probability: if x ∈ S, then P (x) = 1/|S|. Examples of these include fair coins and
fair dice. Thus, if E ⊆ S, then P (E) = |E|/|S|.
Example 3.6. If one rolls two six-sided dice, the sample space is S = {1, . . . , 6}2 , and if one wants to calculate
the probability that the sum is 7, the event is E = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}, so the probability is
|E|/|S| = 6/36 = 1/6. (
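A brute-force enumeration (a quick check, not part of the lecture) confirms this count of equally likely outcomes:

#include <stdio.h>

int main(void) {
    int favorable = 0, total = 0;
    for (int d1 = 1; d1 <= 6; d1++)
        for (int d2 = 1; d2 <= 6; d2++) {
            total++;
            if (d1 + d2 == 7) favorable++;   // outcomes whose sum is 7
        }
    printf("%d/%d\n", favorable, total);     // prints 6/36
    return 0;
}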
5The textbook writes that E ⊂ S, but E = S is allowed too, so be careful.
6More questionable textbook notation: the textbook uses EF to denote E ∩ F .
Example 3.7. Suppose one has four Twinkies and three Ding Dongs.7 What is the probability that one Twinkie
and 2 Ding Dongs are drawn if three are picked? There are two ways to look at this:
• One can consider it to be ordered: the Twinkie is either the first, second, or third item, giving (4)(3)(2),
(3)(4)(2), and (3)(2)(4) combinations respectively. Thus, the probability is ((4)(3)(2) + (3)(4)(2) +
(3)(2)(4))/210 = 12/35.
• Alternatively, it can be unordered: $|S| = \binom{7}{3} = 35$, and $|E| = \binom{4}{1}\binom{3}{2} = 12$ (picking the Twinkie and the Ding Dongs separately), so the probability is again 12/35. (
Example 3.8. Suppose someone is manufacturing chips, and has n chips, one of which is defective. If n is large,
it’s impractical to test them all, so only k will be examined. Thus, the sample space has nk possible choices,
Example 3.9. Let’s play poker! Suppose one has a 5-card poker hand drawn from a standard deck and
wants a straight. This is a set of 5 cards of consecutive rank, e.g. {2, 3, 4, 5, 6}. Again, this is slightly
different from the textbook’s definition. The probability is a little weird, because there are lots of cases:
4 5
{A, 2, 3, 4, 5}, {2, 3, 4, 5, 6}, . . . , {10, J, Q, K, A}. The sample space has size 52
5 , and each of these has 1
5
options. Thus, |E| = 10 41 , giving a probability of just under one percent.
But an “official” straight in poker is slightly different: a straight flush (i.e. all five cards are the same suit) is
5
not considered to be a straight. There are 10 41 possible straight flushes, so |E| = 10 41 − 10 41 now. (
Example 3.10. Consider flipping cards until an ace comes up, and take the next card. Is it more likely that the
ace of spades is this next card, or the two of clubs? Here, |S| = 52!, since the deck is shuffled. In the first case,
in which the ace of spades is removed and then reinserted after the first ace, there are 51! · 1 ways for this to
happen (since the other cards can be anywhere). Thus, the probability is 51!/52!.
The two of clubs has the same thing going on: there’s one place to put it, so the probability is the same. This
is counterintuitive. (
Example 3.11. Suppose 28% of students at a university program in Java, 7% program in C++, and 5% program
in both. Then, what percentage of students program in neither?
When answering a problem, define events very precisely, as this will make errors less likely. So let A be the
event that a randomly chosen student programs in Java and B be the event that a student programs in
C++. Thus, the probability that someone programs in neither is 1−P (A∪B) = 1−(P (A)+P (B)−P (A∩B)) =
1 − (0.28 + 0.07 − 0.05) = 0.7, so the answer is 70%.
What percentage of programmers use C++, but not Java? This is P ((∼A) ∩ B) = P (B) − P (A ∩ B) =
0.07 − 0.05 = 0.02, or 2%.
Notice how similar this looks to a problem on this week’s homework. (
One important consequence is the birthday problem. What is the probability that in a room of n people, no
two share the same birthday? Then, $|S| = 365^n$ and $|E| = (365)(364)\cdots(365 - n + 1)$. Clearly bad things
happen if n ≥ 365. More interestingly, at 23 people, the probability that two people share a birthday is at least
1/2. This is much less than intuition would expect. At 75 people, it's over 99%.
Now consider the probability that, of n other people, none of them shares a birthday with you. Here, $|S| = 365^n$
again, and $|E| = 364^n$. Thus, the probability is $(364/365)^n$, and at n = 23, the probability that nobody's
birthday matches yours is about 94%. Even at 160 people it’s only about 64%. In order to get up to a 50%
chance, you need about 253 people. That’s again quite different from intuition! The interaction of many-to-many
and one-to-many effects is tricky.
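The numbers quoted above can be reproduced with a short computation (a sketch, not from the notes):

#include <stdio.h>

int main(void) {
    double all_distinct = 1.0;     // P(no two of the n people share a birthday)
    double nobody_matches = 1.0;   // P(none of n other people has your birthday)
    for (int n = 1; n <= 253; n++) {
        all_distinct *= (365.0 - (n - 1)) / 365.0;
        nobody_matches *= 364.0 / 365.0;
        if (n == 23 || n == 75 || n == 160 || n == 253)
            printf("n = %3d: P(shared birthday) = %.3f, P(nobody matches you) = %.3f\n",
                   n, 1 - all_distinct, nobody_matches);
    }
    return 0;
}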
7Dr. Sahami actually had these with him. How long has it been since Hostess stopped producing them again?
4. Conditional Probability: 4/8/13
First, it will be helpful to understand the notion of distinctness and indistinctness and ordered and unordered
sets. In many problems, it helps to temporarily view indistinct objects as distinct ones for the purposes of a
problem. However, one must be careful, because a common mistake is to assume that all options are equally
likely in a system.
Example 4.1. Suppose there are n balls to be placed in m urns (e.g. strings in a hash table, server requests to
m machines in a computer cluster, etc.). The counts of the balls in the urns aren’t equally likely. For example, if
there are 2 balls A and B and two urns 1 and 2, then all possibilities are equally likely, since each ball has a
1/2 probability of ending up in each particular urn.
But once A and B are taken to be indistinguishable, then the counts differ: there is one configuration in
which both balls end up in urn 1, but two in which one ball ends up in each urn. Be careful! (
Now, consider two fair, six-sided dice, yielding values D1 and D2 . Let E be the event D1 + D2 = 4. Then,
P (E) = 1/12, since there are 3 possibilities out of 36 total. Then, let F be the probability that D1 = 2. What is
P (E) if F has already been observed? Not all 36 rolls are possible, and S is reduced to {(2, 1), (2, 2), . . . , (2, 6)},
and the event space is {(2, 2)}. Thus, the probability P (E) given F is 1/6 — which is different. This basic idea
is one of the most important concepts of probability.
Definition 4.2. A conditional probability is the probability that an event E occurs given that some other event F
has already occurred. This is written P (E | F ).
In this case, the sample space is reduced to those options consistent with F , or S ∩ F , and the event space is
reduced in the same way to E ∩ F . Thus, in the case of equally likely outcomes, P (E | F ) = |E ∩ F |/|S ∩ F | =
|E ∩ F |/|F |, because F ⊆ S.
This can be generalized to the general case, even when there aren’t equally likely outcomes: P (E | F ) =
P (E ∩ F )/P (F ), where P (F ) > 0. Probabilities are used here because counts aren’t as valid. This implies
something known as the “chain rule,” which shows that P (E ∩ F ) = P (E | F )P (F ). Intuitively, the probability of
both E and F happening is the probability that F happens, multiplied by the probability that E happens given
that F happened. This also happens to be commutative, so this is also equal to P (F | E)P (E).
If P (F ) = 0, then the conditional probability is undefined, because the statement “P (E) given that F happened”
makes no sense when F is impossible.
The chain rule is also known as the multiplication rule. The generalized version is
\[P\left(\bigcap_{i=1}^{n} E_i\right) = \prod_{i=1}^{n} P\left(E_i \,\Big|\, \bigcap_{j=1}^{i-1} E_j\right) = P(E_1)P(E_2 \mid E_1)P(E_3 \mid E_1 \cap E_2) \cdots P(E_n \mid E_1 \cap \cdots \cap E_{n-1}).\]
Note that this doesn’t require the events to be mutually exclusive, which makes it quite powerful.
Example 4.3. Suppose 24 emails are sent to four users, and are evenly distributed. If 10 of the emails are spam,
what is the probability user 1 receives 3 pieces of spam and user 2 receives 6 pieces of spam? Call the event for
user 1 E and for user 2 F . Then,
\[P(E \mid F) = \frac{\binom{4}{3}\binom{14}{3}}{\binom{18}{6}} \approx 0.0784.\]
Here, user 2 can be thought of as a honeypot, which makes it less likely that spam reaches a legitimate user
assuming the spammer is rate-limited.
If the above also has G, which is the event that user 3 receives 5 spam emails, P (G | F ) = 0, since there
aren’t that many spam emails. (
Example 4.4. Suppose a bit string with m 0s and n 1s is sent over a network, so all arrangements are equally
likely. Then, if E is the probability that the first bit received is a 1, and F is the probability that k of the first r
bits is a 1, then
\[P(E \mid F) = \frac{P(E \cap F)}{P(F)} = \frac{P(F \mid E)P(E)}{P(F)}, \qquad P(F \mid E) = \frac{\binom{n-1}{k-1}\binom{m}{r-k}}{\binom{m+n-1}{r-1}}.\]
Notice that the bits are treated as distinct objects, which makes life easier. Then, P (E) = n/(m + n), which is
just the fraction of bits that are 1s, and $P(F) = \binom{n}{k}\binom{m}{r-k}\big/\binom{m+n}{r}$. Thankfully, things cancel out, and
in the end P (E | F ) = k/r.
Another way to think of this is that once we know there are k bits out of the first r are 1, then that’s all that
needs to be worried about: given this much smaller set, there are k possibilities for E among r choices given
F. (
Example 4.5. Suppose a deck of 52 cards is distributed into four piles, with 13 cards per pile. What is the
probability that each pile contains exactly one ace? Let
• E1 be the probability that the ace of spades is in any one pile,
• E2 be the probability that the ace of spades and ace of hearts are in different piles,
• E3 be the probability that the aces of spades, hearts, and clubs are in different piles, and
• E4 the solution: that every ace is in a different pile.
Then, P (E1 ) = 1, and P (E2 | E1 ) = 39/51, since there are 39 cards not in the pile with the ace of
spades. Then, there are 26 cards left in the other piles, so P (E3 | E1 ∩ E2 ) = 26/50, and by the same logic
P (E4 | E1 ∩ E2 ∩ E3 ) = 13/49. Thus, the overall probability is
\[P(E_4) = P(E_1 \cap E_2 \cap E_3 \cap E_4) = \frac{39 \cdot 26 \cdot 13}{51 \cdot 50 \cdot 49} \approx 0.105.\]
Notice that a seemingly-convoluted problem becomes much easier from this viewpoint. (
One of the most influential figures in the history of probability was Thomas Bayes,8 who formulated an
extremely important theorem named after him. It has several formulations: P (F | E) = P (E ∩ F )/P (E) =
(P (E | F )P (F ))/P (E). (Yes, it really is that simple.)
The textbook also expands the denominator using the law of total probability:
P (E) = P (E ∩ F ) + P (E ∩ (∼F )) = P (E | F )P (F ) + P (E | ∼F )P (∼F ).
These can be combined into the most computationally useful one:
\[P(F \mid E) = \frac{P(E \mid F)P(F)}{P(E \mid F)P(F) + P(E \mid \sim F)P(\sim F)}.\]
In the fully general form:
Theorem 4.6. Let F1 , . . . , Fn be a set of mutually exclusive and exhaustive events (i.e. the sample space is the
union of the Fi ). Then,
\[P(F_j \mid E) = \frac{P(E \cap F_j)}{P(E)} = \frac{P(E \mid F_j)P(F_j)}{\sum_{i=1}^{n} P(E \mid F_i)P(F_i)}.\]
This is useful if, given that an event E has occurred, one wishes to know whether one of the Fj also occurred.
Example 4.7. Consider a test for HIV, which is 98% effective and has a false positive rate of 1%. It is known that
about 1 person in 200 has HIV in the United States, so let E be the probability that someone tests positive for
HIV with this test, and F be the probability that person actually has HIV. Then, P (E | F ) is the efficacy of the
test, and the probability that a positive result is true is
\[P(F \mid E) = \frac{P(E \mid F)P(F)}{P(E \mid F)P(F) + P(E \mid \sim F)P(\sim F)} = \frac{(0.98)(0.005)}{(0.98)(0.005) + (0.01)(1 - 0.005)} \approx 0.330.\]
Oddly enough, the fact that the test has such a high effectiveness still doesn't mean it returns the expected results!
This is because only a very small number of people actually have HIV relative to the false positive rate, so
a false positive is more likely than a real positive.9 Consider the four possibilities: Table 1 contains a lot of
information, and can be used to aid in calculations using Bayes’ Theorem. (
8. . . who according to Dr. Sahami looks very much like Charlie Sheen. I don’t see it.
9Relevant xkcd.
Table 1
HIV+ HIV-
Test positive 0.98 = P (E | F ) 0.01 = P (E | ∼F )
Test negative 0.02 = P (∼E | F ) 0.99 = P (∼E | ∼F )
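As a quick numerical check (a sketch, not part of the notes; the helper name posterior is made up), the posterior from Example 4.7 can be computed directly from the entries of Table 1:

#include <stdio.h>

// Posterior P(F | E) via Bayes' theorem, given the test characteristics and prior.
double posterior(double p_pos_given_hiv, double p_pos_given_healthy, double prior) {
    double num = p_pos_given_hiv * prior;
    return num / (num + p_pos_given_healthy * (1 - prior));
}

int main(void) {
    printf("%.3f\n", posterior(0.98, 0.01, 0.005));   // approximately 0.330
    return 0;
}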
Example 4.8. Turning again to spam detection, suppose 60% of all email is spam, and 90% of spam has a
forged header, but 20% of non-spam has a forged header. One can check for a forged header, and let E be the
event that a message has a forged header, and F be the event that it is spam. Then, using Bayes’ Theorem,
P (F | E) = (0.9)(0.6)/((0.9)(0.6) + (0.2)(0.4)) ≈ 0.871, so this allows for a really good characterization of
whether email is spam. (
Example 4.9. Consider the Monty Hall problem, in which there are 3 doors A, B, and C. Behind one door is a
prize, and behind the other two is absolutely nothing. A contestant chooses a door, and then the game show
host opens one of the other doors to show it is empty. Then, the contestant can switch to the other unopened door, but
is this a good idea?
• If the contestant doesn’t switch, then the probability of winning is 1/3, since opening the door doesn’t
make a difference.
• If the contestant does switch, then without loss of generality assume the contestant chose A. If A was the winner,
then the probability of winning after switching is zero. If one of the other two cases occurs (each of
which happens with equal probability 1/3), the contestant wins. Thus, the overall probability of winning
is 2/3, so it is advantageous to switch.
This is generally confusing or hard to understand, so tread carefully. Suppose instead of three doors there were
1000. Then, the host opens 998 empty doors. What is the chance the winner is in the remaining door? Awfully
high, since the chance that you picked correctly initially is very small. (
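A simulation (not from the lecture) makes the 1/3-versus-2/3 split easy to believe; the helper name play is made up for illustration:

#include <stdio.h>
#include <stdlib.h>

// Simulates the Monty Hall game once; returns 1 if the player wins the prize.
int play(int do_switch) {
    int prize = rand() % 3;
    int choice = rand() % 3;
    // The host opens a door that is neither the player's choice nor the prize.
    int opened;
    do { opened = rand() % 3; } while (opened == choice || opened == prize);
    if (do_switch)
        choice = 3 - choice - opened;   // the remaining unopened door
    return choice == prize;
}

int main(void) {
    srand(109);
    int trials = 1000000, stay_wins = 0, switch_wins = 0;
    for (int i = 0; i < trials; i++) {
        stay_wins += play(0);
        switch_wins += play(1);
    }
    printf("stay: %.3f, switch: %.3f\n",
           (double)stay_wins / trials, (double)switch_wins / trials);
    // Expected output: roughly 0.333 and 0.667.
    return 0;
}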
• Commutativity: P (A ∩ B) = P (B ∩ A).
• The chain rule (multiplication rule), as seen above: P (A ∩ B) = P (A | B)P (B) = P (B | A)P (A).
• The intersection rule: P (A ∩ (∼B)) = P (A) − P (A ∩ B).
• Bonferroni’s inequality: P (A ∩ B) ≥ P (A) + P (B) − 1. This will be proved in the next problem set, but
it’s a four-line proof if you do it right.
Another notion is the generality of conditional probability: for any events A, B, and E, these all still hold in the
presence of event E:
• P (A ∩ B | E) = P (B ∩ A | E), and P (E | A ∩ B) = P (E | B ∩ A).
• The chain rule becomes P (A ∩ B | E) = P (A | B ∩ E)P (B | E).
• Bayes’ Theorem: P (A | B ∩ E) = P (B | A ∩ E)P (A | E)/P (B | E). This is the same as before, but
every probability has been conditioned on E.
There's actually some philosophy behind this: E can be seen as a summary of everything you already
know, and in some sense, every probability is a conditional probability; without any such event E, there's
no way to know any probability. How would you know the probabilities of a fair coin without any context?
Formally, this means that P (· | E) always satisfies the axioms of probability.
Example 5.3. For a simple example, take two six-sided dice D1 and D2 . Let E be the probability that D1 rolls a
1, and F be that D2 rolls a 1. Then, P (E) = P (F ) = 1/6, and P (E ∩ F ) = 1/36. Here, P (E ∩ F ) = P (E)P (F ).
Define a new event G, which is the probability that D1 + D2 = 5. Here, P (G) = 1/9, but P (E ∩ G) =
1/36 ≠ P (E)P (G). (
Definition 5.4. If E and F are two events such that P (E ∩ F ) = P (E)P (F ), then E and F are called
independent events; otherwise, they are called dependent events.
Independent events can be thought of as events which aren’t affected by each other. They also have the
property (via the chain rule) that P (E | F ) = P (E). Observing F doesn’t change the probability of E.
Claim. If E and F are independent events, then P (E | F ) = P (E | ∼F ). (
Proof.
P (E ∩ (∼F )) = P (E) − P (E ∩ F )
= P (E) − P (E)P (F )
= P (E)(1 − P (F ))
= P (E)P (∼F ),
so E and ∼F are independent events, so P (E | ∼F ) = P (E) = P (E | F ). The independence of E and F is
necessary in the second line, so that P (E ∩ F ) = P (E)P (F ).
Definition 5.5. More generally, the events E1 , . . . , En are independent if for every subset $E_{i_1}, \dots, E_{i_r}$ (r ≤ n) of
E1 , . . . , En , it holds that
\[P\left(\bigcap_{j=1}^{r} E_{i_j}\right) = \prod_{j=1}^{r} P(E_{i_j}).\]
This just means that the product formula holds for all possible subsets. This sounds hairy, but isn’t all that
bad in practice.
Example 5.6. Roll two dice D1 and D2 , and let E be the event that D1 = 1, and F be that D2 = 6. Then,
let G be the event that D1 + D2 = 7. E and F are clearly independent. P (E) = 1/6 and P (G) = 1/6, and
P (E ∩ G) = 1/36, so E and G have to be independent. This is counterintuitive, but 7 is the exception, since there's
always exactly one possibility for getting a 7 on D2 no matter the roll on D1 . Thus, it is important to be careful
when mathematics and intuition meet. Note also that F and G are independent by the same reasoning.
For more weirdness with intuition, E and F are independent, E and G are independent, and F and G
are independent, but E, F , and G are not independent: P (E ∩ F ∩ G) = 1/36, but P (E)P (F )P (G) =
(1/6)(1/6)(1/6) = 1/216. (
Example 5.7. Suppose a computer produces a sequence of random bits, where p is the probability that a bit is a
1. Then, each bit is generated in an independent trial. Let E be the probability of getting n 1s followed by a 0.
The probability of n 1s is $p^n$, and that of getting a 0 next is 1 − p, so $P(E) = p^n(1 - p)$. (
Example 5.8. Imagine a coin that comes up heads with probability p (really the same thing as the previous
example). Then, the probability of n heads in n flips is $p^n$, and the probability of n tails in n flips is $(1 - p)^n$.
The probability of k heads first, then all tails, is $p^k(1 - p)^{n-k}$. This requires order, so if the order is ignored,
then it's a matter of combinations, so the probability is $\binom{n}{k} p^k (1 - p)^{n-k}$. There are k heads to place in n slots
if order is ignored. One could also place the n − k tails rather than the heads, and this is the same, since
$\binom{n}{k} = \binom{n}{n-k}$. (
Example 5.9. Consider a hash table in which m strings are hashed into n buckets with equal, independent
probability. Let E be the probability that at least one string was hashed into the first bucket. In general, defining
one’s own events makes solving these problems much easier by breaking them into subproblems. Additionally,
the phrase “at least one” suggests solving by taking the complement, which is that no strings are hashed into
the first bucket.
Let $F_i$ be the probability that string i is not hashed into the first bucket, so $P(F_i) = 1 - 1/n = (n-1)/n$ for
every i. Then, $E = \sim\bigcap_{i=1}^{m} F_i$, so
\[P(E) = 1 - P\left(\bigcap_{i=1}^{m} F_i\right) = 1 - \prod_{i=1}^{m} P(F_i) = 1 - \left(\frac{n-1}{n}\right)^{m}.\]
This is very similar to the birthday problem, which attempts to find collisions in a calendar rather than a hash
table.
Now, take m strings placed into a hash table with unequal probability: let pi be the probability that a string
is hashed into bucket i. This is a more accurate model for the real world, and has lots of applications. Then, let
E be the probability that at least one of buckets 1 through k has at least one string hashed into it. Let Fi be
the probability that at least one string is hashed into the ith bucket; then, using De Morgan’s Law,
\[P(E) = P\left(\bigcup_{i=1}^{k} F_i\right) = 1 - P\left(\sim\bigcup_{i=1}^{k} F_i\right) = 1 - P\left(\bigcap_{i=1}^{k} \sim F_i\right).\]
This is a problem, because these events aren’t independent! If only one string is hashed, for example, knowing
where it isn’t adds some information about where else it might be. The strings are hashed independently, but
the events Fi are about buckets, not strings.
It looks like we're stuck, but think about a single string: the probability that it is not hashed into the first bucket
is 1 − p1 , and the probability that it won't be in the first two is 1 − p1 − p2 . This illustrates the technique of
stepping back and figuring out what it all means. Thus, the probability that no strings are hashed into buckets 1
to k is
\[P\left(\bigcap_{i=1}^{k} \sim F_i\right) = \left(1 - \sum_{i=1}^{k} p_i\right)^{m}, \quad\text{so}\quad P(E) = 1 - \left(1 - \sum_{i=1}^{k} p_i\right)^{m}.\]
Considering one final question about this hash table, let E be the probability that each of the buckets 1 to k
has a string hashed into it. Then, let Fi be the probability that at least one string is hashed into the ith bucket.
These events aren't independent, so this becomes
\begin{align*}
P(E) = P(F_1 \cap \cdots \cap F_k) &= 1 - P(\sim(F_1 \cap F_2 \cap \cdots \cap F_k)) \\
&= 1 - P\left(\bigcup_{i=1}^{k} \sim F_i\right) \\
&= 1 - \sum_{r=1}^{k} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(\sim F_{i_1} \cap \cdots \cap \sim F_{i_r}),
\end{align*}
where $P(\sim F_{i_1} \cap \cdots \cap \sim F_{i_r}) = (1 - p_{i_1} - \cdots - p_{i_r})^{m}$ as before, using the union formula. (
Before the next example, recall the geometric series
\[\sum_{i=0}^{n} x^i = \frac{1 - x^{n+1}}{1 - x}.\]
As n → ∞, if |x| < 1, then the infinite sum reduces to 1/(1 − x) in the limit.
Example 5.10. Suppose two six-sided dice are rolled repeatedly. What is the probability that a 5 is rolled
before a 7? This is a simplified version of a game called craps. Define the series of events Fn , in which neither
a 5 nor a 7 is rolled in the first n − 1 trials, but a 5 was rolled on the nth trial. Thus, the probability of rolling a
5 on any trial is 4/36, and of a 7 on any trial is 6/36. Thus, the overall probability is
\[P(E) = P\left(\bigcup_{n=1}^{\infty} F_n\right) = \sum_{n=1}^{\infty} P(F_n) = \sum_{n=1}^{\infty} \left(1 - \frac{10}{36}\right)^{n-1}\frac{4}{36} = \frac{4}{36}\sum_{n=0}^{\infty}\left(\frac{26}{36}\right)^{n} = \frac{2}{5}\]
after using the geometric series. Notice that the events Fi are all mutually exclusive, so the probability of their
union is just the sum of their probabilities. (
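A simulation (a sketch, not from the notes) agrees with the 2/5 computed above:

#include <stdio.h>
#include <stdlib.h>

// Rolls two dice repeatedly; returns 1 if a sum of 5 appears before a sum of 7.
int five_before_seven(void) {
    for (;;) {
        int sum = (rand() % 6 + 1) + (rand() % 6 + 1);
        if (sum == 5) return 1;
        if (sum == 7) return 0;
    }
}

int main(void) {
    srand(109);
    int trials = 1000000, wins = 0;
    for (int i = 0; i < trials; i++)
        wins += five_before_seven();
    printf("%.3f\n", (double)wins / trials);   // should be close to 2/5
    return 0;
}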
This isn't terribly helpful in the discrete case, but will be much more essential in the continuous case, which
will pop up in a couple weeks. That said, the CDF of a discrete random variable is $F(a) = \sum_{x \le a} p(x)$.
Definition 6.12. The expected value of a discrete random variable X is
\[E[X] = \sum_{\{x \mid p(x) > 0\}} x\,p(x).\]
It’s necessary to specify that p(x) > 0 for various tricky analytic reasons. This is also known as the mean, the
center of mass, the average value, etc.
Example 6.13. The expected value of rolling a six-sided die is $E[X] = \sum_{i=1}^{6} i\cdot\frac{1}{6} = 7/2$. Notice that you
can’t actually roll a 7/2, but this is the average over a large number of rolls. (
Expected value is one of the most useful notions in probability.
Definition 6.14. A variable I is called an indicator variable for an event A if its value is 1 if A occurs, and 0 if
∼A occurs.
Then, P (I = 1) = P (A), so E[I] = (1)(P (A)) + (0)(P (∼A)) = P (A). This seems pretty tautological, but
it’s a huge concept: the expected value of the indicator is the probability of the underlying event.
Now, we have the mechanics to lie with statistics. Suppose a school has 3 classes with 5, 10, and 150 students.
If a class is randomly chosen with equal probability, then let X be the number of students in the class. Then,
E[X] = 5(1/3) + 10(1/3) + 150(1/3) = 55. The average value is 55, which seems reasonable.
But why do so many classes seem so large? If a student is randomly chosen and Y is the number of people
in that class, then E[Y ] = 5(5/165) + 10(10/165) + 150(150/165) ≈ 137, which is about twice as much! E[Y ]
is the student’s perception, and E[X] is what is usually reported.
This example illustrates that a lot of statistics involves interpretation, so the upshot is to be careful when
reading statistics.
If g is some real-valued function, then let Y = g(X). Then, the expectation is
\[E[g(X)] = E[Y] = \sum_j y_j\,p(y_j) = \sum_j y_j \sum_{x_i : g(x_i) = y_j} p(x_i).\]
What this means is that one can obtain the probability for Y = k by looking at the cases of X that lead to
Y = k.
One consequence of this is linearity, in which g is a linear function. This means that E[aX + b] = aE[X] + b.
For example, if X is the value of a 6-sided die and Y = 2X − 1, then E[X] = 7/2, so E[Y ] = 6.
Definition 6.15. The nth moment of a random variable X is
\[E[X^n] = \sum_{x : p(x) > 0} x^n p(x).\]
This will be useful later; remember it!
The last concept for today is utility. Suppose one has a choice with consequences c1 , . . . , cn , where consequence
ci happens with probability pi and has a utility U (ci ); the expected utility is then $\sum_i p_i U(c_i)$. For example, the
probability of winning $1 million with a $1 lottery ticket is $10^{-7}$. The utility of winning is $999,999 (since you
have to buy the ticket), and of losing is −$1. If the ticket isn't bought, you can't win, so the utility is zero. Then,
the expected utility of buying a ticket can be calculated to be E ≈ −0.9. The takeaway lesson is that you can't lose if you don't play.
7. To Vegas: 4/15/13
“About the Binomial Theorem I am teeming with a lot o’ news. . . ” – The Major-General
There’s an interesting puzzle of probability called the St. Petersburg Paradox. Suppose one has a fair coin
(i.e. it comes up heads exactly half of the time). Let n be the number of coin flips before the first tails, and the
winning value is $2^n$.
How much would you pay to play? In expectation, the payoff is infinite over a large number of trials, but that
seems like a little much to pay. Formally, if X is the amount of winnings as a random variable,
\[E[X] = \sum_{i=0}^{\infty} 2^i \left(\frac{1}{2}\right)^{i+1} = \sum_{i=0}^{\infty} \frac{1}{2},\]
which is infinite. This seems weird, but mathematically, this is perfectly fine. Of course, for playing exactly once
one wouldn’t want to pay a million dollars. . .
Okay, now on to Vegas. Take some even-money game (e.g. betting red on a roulette table). Here, there is a
probability p = 18/38 that one wins $Y , and (1 − p) that you lose $Y . Make bets according to the following
algorithm:
(1) Let Y = $1.
(2) Bet Y .
(3) If this bet wins, then halt. Otherwise, multiply Y by 2 and go to step 2
Let Z be the amount of winnings upon stopping. Then,
\[E[Z] = \sum_{i=0}^{\infty} \left(\frac{20}{38}\right)^{i}\frac{18}{38}\left(2^i - \sum_{j=0}^{i-1} 2^j\right) = \sum_{i=0}^{\infty} \frac{18}{38}\left(\frac{20}{38}\right)^{i}(1) = 1\]
using the geometric series formula. The expected value of the game is positive, which is interesting. Thus, it can
be used to generate infinite money, right?
Well, maybe not:
• Real games have maximum betting amounts to prevent strategies such as this one (called Martingale
strategies).
• You have finite money, and are psychologically disinclined to spend it.
• Casinos like kicking people out.
Thus, in practice, in order to win infinite money, you need to have infinite money.
There are probability distributions with the same expected value, but in which the probabilities are “spread
out” over a larger area (e.g. a uniform distribution from 0 to 10 versus a large spike at 5). Thus, some techniques
have been introduced:
Definition 7.1. If X is a random variable with mean µ, then the variance of X, denoted Var(X), is Var(X) =
E[(X − µ)2 ].
Notice that the variance is always nonnegative. The variance is also known as the second central moment,
or the square of the standard deviation.
In practice, the variance isn't computed by that formula:
\begin{align*}
\mathrm{Var}(X) &= E[(X - \mu)^2] \\
&= \sum_x (x - \mu)^2 p(x) \\
&= \sum_x (x^2 - 2\mu x + \mu^2)p(x) \\
&= \sum_x x^2 p(x) - 2\mu\sum_x x\,p(x) + \mu^2\sum_x p(x) \\
&= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - E[X]^2, \tag{7.2}
\end{align*}
since E[X] = µ. This last formula is easier to use, and the quantity E[X 2 ] in particular is called the second
moment (different from the second central moment).
Under a linear transformation, variance changes as $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$.
Proof.
\begin{align*}
\mathrm{Var}(aX + b) &= E[(aX + b)^2] - (E[aX + b])^2 \\
&= E[a^2X^2 + 2abX + b^2] - (aE[X] + b)^2 \\
&= a^2E[X^2] + 2abE[X] + b^2 - (a^2E[X]^2 + 2abE[X] + b^2) \\
&= a^2E[X^2] - a^2E[X]^2 = a^2\,\mathrm{Var}(X).
\end{align*}
This depends on noticing that the expected value of a constant is just the constant itself, which makes sense.
Definition 7.3. The standard deviation of X is $\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$.
This is convenient because it has the same units as X does, and is therefore easier to understand.
Definition 7.4. The Bernoulli random variable is an experiment that can result in success (defined as 1) or
failure (defined as 0). The probability of success is denoted p, and of failure 1 − p. The phrase “X is a Bernoulli
random variable with probability p of success” is shortened to X ∼ Ber(p).
The expectation of such a variable is E[X] = p, since the expectation of an indicator variable is just its
probability, and Var(X) = p(1 − p), which is a little harder to show.
Definition 7.5. A binomial random variable X denotes the number of successes in n independent trials of some
Bernoulli random variable Ber(p). Then, one writes X ∼ Bin(n, p), and $P(X = i) = \binom{n}{i} p^i (1 - p)^{n-i}$.
By the Binomial Theorem, these probabilities sum to 1. For example, if one flips three
fair coins and takes X to be the number of heads, then X ∼ Bin(3, 0.5).
Example 7.6. This has a direct application to error-correcting codes (also called Hamming codes) for sending
messages over unreliable networks. Take some 4-bit message, and append three “parity” bits. The goal is to
take the message and choose parity bits in a way such that there are three sets with four bits, and each set has
an even number of 1s.
Then, if some bit is corrupted in transmission, it can be detected, and it can even be fixed, by taking the
intersection of the sets with odd numbers of 1s and the complements of those with an even number (i.e. the
correct ones).
Suppose each bit in the example is flipped with probability 0.1 (which is much more than in the real world),
and if X is the number of bits corrupted, then X ∼ Bin(7, 0.1). Thus, the probability that a correct message is
received is P (X = 0) + P (X = 1). This can be calculated to be 0.8503, but without the error-correcting codes,
X ∼ Bin(4, 0.1) and P (X = 0) ≈ 0.6561. This is interesting because the reliability is significantly improved
despite only being a software update. (
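These reliability numbers can be reproduced with a short computation (a sketch, not from the notes; binom_pmf is a made-up helper name):

#include <stdio.h>
#include <math.h>

// P(X = k) for X ~ Bin(n, p).
double binom_pmf(int n, int k, double p) {
    double c = 1.0;
    for (int i = 0; i < k; i++)
        c *= (double)(n - i) / (i + 1);      // binomial coefficient C(n, k)
    return c * pow(p, k) * pow(1 - p, n - k);
}

int main(void) {
    // With the 3 parity bits: at most one of the 7 bits may flip and still be corrected.
    double with_ecc = binom_pmf(7, 0, 0.1) + binom_pmf(7, 1, 0.1);
    // Without them: all 4 message bits must arrive intact.
    double without_ecc = binom_pmf(4, 0, 0.1);
    printf("with ECC: %.4f, without: %.4f\n", with_ecc, without_ecc);  // ~0.8503, ~0.6561
    return 0;
}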
Example 7.7. Consider a setup in which a person has two genes for eye color, and brown is dominant over blue
(i.e. if the child has a brown gene, then it has brown eyes, and if it has two blue genes, then it has blue eyes).
Suppose each parent has one brown and one blue eye. Then, what is the probability that 3 children out of
four have brown eyes? The probability that a single child has blue eyes is 1/4, and that it has brown eyes is
thus 3/4.
Each child is an independent trial (in more ways than one), so X ∼ Bin(4, 0.75), so $P(X = 3) = \binom{4}{3}(0.75)^3(0.25) \approx 0.42$.
For the binomial distribution, one can show that $E[X^k] = np\,E[(Y + 1)^{k-1}]$,
where Y ∼ Bin(n − 1, p). Then, setting k = 1, E[X] = np, and if k = 2, then $E[X^2] = np\,E[Y + 1] =
np((n - 1)p + 1)$, so the variance is $\mathrm{Var}(X) = E[X^2] - (np)^2 = np(1 - p)$. After all this math, think about what
these mean: the expected value is what intuition would suggest. This also offers a proof for the Bernoulli case,
since Ber(p) = Bin(1, p).
Example 7.8. This can be used to ask how powerful one’s vote is. Is it better to live in a small state, where it
makes it more likely that the vote will change the outcome in that state, or a larger one, where there is a larger
outcome if the state does swing?
Adding some (not quite realistic) numbers: Suppose there are a = 2n voters equally likely to vote for either
candidate11 and the voter in question will be the deciding a + 1st vote. The probability that there is a tie is
\[P(\text{tie}) = \binom{2n}{n}\left(\frac{1}{2}\right)^{n}\left(\frac{1}{2}\right)^{n} = \frac{(2n)!}{n!\,n!\,2^{2n}},\]
which simplifies to $1/\sqrt{n\pi}$ after Stirling's approximation $n! \approx n^{n+1/2}e^{-n}\sqrt{2\pi}$. This is exceptionally accurate for
n ≥ 70, which is reasonable for this situation.
Then, the power of a tie is its probability multiplied by the number of electoral votes: since a is the size of
the state, this becomes $c\sqrt{2a/\pi}$, which would mean that living in a larger state means a more powerful
vote. (
8. The Poisson Distribution: 4/17/13
Definition 8.1. X is a Poisson12 random variable, denoted X ∼ Poi(λ), if X takes on nonnegative integer
values, and for the given parameter λ > 0,13 it has the PMF $P(X = i) = e^{-\lambda}\lambda^i/i!$.
It's necessary to check that this is in fact a probability distribution: using a Taylor series, $e^{\lambda} = \sum_{i=0}^{\infty} \frac{\lambda^i}{i!}$, so
\[\sum_{i=0}^{\infty} P(X = i) = e^{-\lambda}\sum_{i=0}^{\infty} \frac{\lambda^i}{i!} = e^{-\lambda}e^{\lambda} = 1.\]
Example 8.2. Suppose a string of length $10^4$ is sent over a network and the probability of a single bit
being corrupted is $10^{-6}$. Using the Poisson distribution with $\lambda = (10^4)(10^{-6}) = 0.01$, $P(X = 0) = e^{-0.01}(0.01)^0/0! \approx
0.990049834$. Compared to the exact binomial distribution, this is accurate to about nine decimal places, which is
incredibly useful. See Figure 3 for a graph of the accuracy of the Poisson distribution as an approximation. (
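Here is a small comparison (a sketch, not from the notes; the helper names are made up) of the exact binomial probability and the Poisson approximation for these numbers:

#include <stdio.h>
#include <math.h>

// P(X = k) for X ~ Bin(n, p), computed in log space to avoid overflow for large n.
double binom_pmf(int n, int k, double p) {
    double log_c = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
    return exp(log_c + k * log(p) + (n - k) * log(1 - p));
}

// P(X = k) for X ~ Poi(lambda).
double poisson_pmf(double lambda, int k) {
    return exp(-lambda + k * log(lambda) - lgamma(k + 1));
}

int main(void) {
    int n = 10000;           // string length
    double p = 1e-6;         // per-bit corruption probability
    printf("binomial: %.10f\n", binom_pmf(n, 0, p));
    printf("poisson:  %.10f\n", poisson_pmf(n * p, 0));
    return 0;
}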
So what does “moderate” mean for λ, anyways? There are different interpretations, such as n > 20 and
p < 0.05, or n > 100 and p < 0.01. Moderate really just means that the approximation is reasonably accurate.
Figure 3. An illustration of the differences between binomial and Poisson distributions.
If X ∼ Poi(λ), where λ = np, then E[X] = np = λ, and Var(X) = np(1 − p) = λ(1 − 0) = λ as n → ∞ and
p → 0. Thus, λ represents the number of trials that are expected to be true.
Example 8.3. Intuitively, imagine that you are baking an extremely large raisin cake, which is cut into slices of
moderate size (with respect to the number of raisins in the slice of cake). The probability p that any particular
raisin is in any particular slice of cake is vanishingly small; if there are n cake slices, p = 1/n, assuming an
even distribution. If X is the number of raisins in a given cake slice, X ∼ Poi(λ), where λ = R/n, where R is
the total number of raisins. (
There are lots of concrete applications of this to computer science: one could sort strings into buckets in a
hash table, determine the list of crashed machines in some data center, or Facebook login requests distributed
across some set of servers.
Example 8.4. Suppose computer chips are produced such that p = 0.1 is the probability that a chip is defective. In
a sample of n = 10 chips, what is the probability that a sample contains at most one defective chip? n and
p aren’t extreme, so using the binomial distribution, P (Y ≤ 1) ≈ 0.7361, and with the Poisson distribution,
P (Y ≤ 1) ≈ 0.7358. This is still really close. (
Computing P (X ≤ a) when X ∼ Poi(λ) straightforwardly is computationally expensive, but it turns out that
P (X = i + 1)/P (X = i) = λ/(i + 1), so one can compute P (X = 0) once and then obtain the rest by repeated
multiplication, which is pretty fast.
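That recurrence gives a simple way to compute the CDF (a sketch, not from the notes; poisson_cdf is a made-up name):

#include <stdio.h>
#include <math.h>

// P(X <= a) for X ~ Poi(lambda), using P(X = i+1) = P(X = i) * lambda / (i + 1).
double poisson_cdf(double lambda, int a) {
    double pmf = exp(-lambda);    // P(X = 0)
    double cdf = pmf;
    for (int i = 0; i < a; i++) {
        pmf *= lambda / (i + 1);  // step from P(X = i) to P(X = i + 1)
        cdf += pmf;
    }
    return cdf;
}

int main(void) {
    printf("%.4f\n", poisson_cdf(1.0, 1));   // ~0.7358, matching Example 8.4
    return 0;
}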
12Poisson literally means “fish” in French, though this is not the reason that this distribution has its name.
13Relevant xkcd.
A Poisson distribution is an approximation, so it still approximately works for things that aren’t really binomial
distributions. This is known as the Poisson paradigm. If the dependency is mild, then the approximation is
still pretty close. For example, in a large hash table, the number of entries in a bucket are dependent events,
but it grows weaker with the size of the hash table. Additionally, if the probability of success in each trial
varies slightly, the Poisson distribution still works pretty well (e.g. when load on a network varies slightly with
time). In each of these cases, the binomial random variable is also a good approximation, but it comes with the
connotation that the probability is exact, whereas the Poisson distribution doesn’t.
Returning to the birthday problem, take m people and let Ex,y be the event that people x and y have the
same birthday. Thus, P (Ex,y ) = 1/365 = p, but not all of these events are independent (e.g. if Alice and Bob
have the same birthday and Bob and Eve have the same birthday, do Alice and Eve have the same birthday?).
Thus, this isn't a binomial distribution, so let X ∼ Poi(λ), where $\lambda = p\binom{m}{2} = m(m - 1)/730$. Then,
The $\binom{n-1}{r-1}$ term isn't obvious, but the last trial must end in success, so the number of choices is restricted.
Notice additionally that Geo(p) ∼ NegBin(1, p), so the geometric distribution is analogous to the Bernoulli
distribution for the negative binomial distribution.
Definition 8.9. X is a hypergeometric random variable, denoted X ∼ HypG(n, N, m), if it varies as in the
following experiment: consider an urn with N balls, in which N − m are black and m are white. If n balls are
drawn without replacement, then X is the number of white balls drawn.
This is just combinatorics: $P(X = i) = \binom{m}{i}\binom{N-m}{n-i}\big/\binom{N}{n}$, where i ≥ 0. Then, E[X] = n(m/N) and
$\mathrm{Var}(X) = nm(N - n)(N - m)/(N^2(N - 1))$.
Let p = m/N , the probability of drawing a white ball on the first draw. Note that as N → ∞,
HypG(n, N, m) → Bin(n, m/N ). The binomial is the case with replacement, and the hypergeometric is
the case without replacement. As mentioned above, one can be made to approximate the other.
Example 8.10. Suppose N is the number of individuals of some endangered species that remain. If m of them
are tagged, and then allowed to mix randomly, then randomly observe another n of them. Let X be the number
of those n that were tagged. Then, X ∼ HypG(n, N, m). Then, using something called a maximum likelihood
estimate, N̂ = mn/i is the value that maximizes P (X = i). (
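A brute-force check (a sketch, not from the notes, with made-up tagging numbers m, n, and i) that maximizing the hypergeometric PMF over N recovers mn/i:

#include <stdio.h>
#include <math.h>

// log of the binomial coefficient C(n, k), via the log-gamma function.
double log_choose(int n, int k) {
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
}

// log P(X = i) for X ~ HypG(n, N, m).
double log_hypg_pmf(int i, int n, int N, int m) {
    return log_choose(m, i) + log_choose(N - m, n - i) - log_choose(N, n);
}

int main(void) {
    int m = 100, n = 40, i = 7;        // hypothetical tagging numbers
    int best_N = m + n - i;            // smallest feasible population size
    double best = log_hypg_pmf(i, n, best_N, m);
    for (int N = m + n - i; N <= 10000; N++) {
        double ll = log_hypg_pmf(i, n, N, m);
        if (ll > best) { best = ll; best_N = N; }
    }
    printf("N_hat = %d (mn/i = %d)\n", best_N, m * n / i);   // both print 571
    return 0;
}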
14The cause of misestimation may have been that CS 109 didn’t exist when Justice Breyer was at Stanford. . .
Example 9.3. Suppose X is a continuous random variable (CRV) with PDF
\[f(x) = \begin{cases} C(4x - 2x^2), & 0 < x < 2, \\ 0, & \text{otherwise.} \end{cases}\]
First, C can be calculated:
\[\int_0^2 C(4x - 2x^2)\,dx = C\left[2x^2 - \frac{2x^3}{3}\right]_0^2 = 1,\]
so plugging in, C((8 − 16/3) − 0) = 1, so C = 3/8. Then,
\[P(X > 1) = \int_1^{\infty} f(x)\,dx = \int_1^2 \frac{3}{8}(4x - 2x^2)\,dx = \frac{3}{8}\left[2x^2 - \frac{2x^3}{3}\right]_1^2 = \frac{1}{2},\]
which makes sense (since the function is symmetric about x = 1). (
Example 9.4. Suppose X is the amount of time (in days) before a disk crashes. It is given by the PDF
\[f(x) = \begin{cases} \lambda e^{-x/100}, & x \ge 0, \\ 0, & \text{otherwise.} \end{cases}\]
First, λ will be calculated:
\[1 = \int_0^{\infty} \lambda e^{-x/100}\,dx = \left[-100\lambda e^{-x/100}\right]_0^{\infty} = 100\lambda,\]
so λ = 1/100. Technically, an improper integral was taken here and sort of glossed over.
Then, what is P (50 < X < 150)?
\[F(150) - F(50) = \int_{50}^{150} \frac{1}{100}e^{-x/100}\,dx = \left[-e^{-x/100}\right]_{50}^{150} \approx 0.383.\]
Linearity of expectation is still a thing: E[aX + b] = aE[X] + b, and the same proof can be given. Similarly, the
formulae for variance still hold.
Example 9.5. For example, if X has linearly increasing density, given by f (x) = 2x on [0, 1] and f (x) = 0
otherwise, the expectation isn't obvious from the graph, but can be found as
\[E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^1 2x^2\,dx = \frac{2}{3},\]
and the variance is given by
\[E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_0^1 2x^3\,dx = \frac{1}{2},\]
so $\mathrm{Var}(X) = E[X^2] - E[X]^2 = 1/2 - (2/3)^2 = 1/18$. (
These are arguably easier than in the discrete case.
Definition 9.6. X is a uniform random variable, denoted X ∼ Uni(α, β), if its probability density function is
\[f(x) = \begin{cases} \frac{1}{\beta - \alpha}, & \alpha \le x \le \beta, \\ 0, & \text{otherwise.} \end{cases}\]
This is just a horizontal line.
Sometimes, this is defined only on (α, β), but this causes issues later on. The probability is given by
\[P(a \le X \le b) = \int_a^b f(x)\,dx = \frac{b - a}{\beta - \alpha},\]
which is valid when α ≤ a ≤ b ≤ β. The expectation is even easier: it can be formally calculated to be (α + β)/2,
which is no surprise, and the variance is (after a bunch of math) $\mathrm{Var}(X) = (\beta - \alpha)^2/12$.
Example 9.7. Suppose X ∼ Uni(0, 20). Then, f (x) = 1/20 for 0 ≤ x ≤ 20 and f (x) = 0 otherwise. Thus,
$P(X < 6) = \int_0^6 \frac{dx}{20} = 6/20$, and $P(4 < X < 17) = \int_4^{17} \frac{dx}{20} = (17 - 4)/20 = 13/20$. (
Example 9.8. Suppose the Marguerite bus arrives at 15-minute intervals (on the quarter hour), and someone
arrives at a bus stop uniformly sometime between 2 and 2:30. Thus, X ∼ Uni(0, 30). If the passenger waits less
than 5 minutes for the bus, then it must arrive between 2:10 and 2:15, or 2:25 to 2:30. This is 10 minutes out of
the 30, so the probability is 1/3. (Since the distribution is uniform, this sort of calculation can be done.) The
passenger waits for more than 14 minutes if they arrive between 2 and 2:01 or between 2:15 and 2:16. These are 2 out of the
30 minutes, so the probability is 1/15. (
Example 9.9. Suppose a student bikes to class, and leaves t minutes before class starts, which is (usually) a
choice by the student. Let X be the travel time in minutes, given by a PDF f (x). Then, there is a cost to being
early or late to class: C(X, t) = c(t − X) if early, and C(X, t) = k(X − t) if late (which is a different cost).
Thus, the cost function is discontinuous. Then, choose t that minimizes E[C(X, t)]:
\[E[C(X, t)] = \int_0^{\infty} C(x, t)f(x)\,dx = \int_0^t c(t - x)f(x)\,dx + \int_t^{\infty} k(x - t)f(x)\,dx.\]
Now we need to minimize it, which involves some more calculus. Specifically, it requires the Leibniz integral
rule:
\[\frac{d}{dt}\int_{f_1(t)}^{f_2(t)} g(x, t)\,dx = g(f_2(t), t)\,\frac{df_2(t)}{dt} - g(f_1(t), t)\,\frac{df_1(t)}{dt} + \int_{f_1(t)}^{f_2(t)} \frac{\partial g(x, t)}{\partial t}\,dx.\]
Let t∗ be the optimal time. Then,
\[\frac{d}{dt} E[C(X, t)] = c(t - t)f(t) + \int_0^t c f(x)\,dx - k(t - t)f(t) - \int_t^{\infty} k f(x)\,dx = cF(t) - k(1 - F(t)),\]
so setting the derivative to zero at $t^*$ gives $0 = cF(t^*) - k(1 - F(t^*))$, i.e.
\[F(t^*) = \frac{k}{c + k}.\]
The question comes down to whether being late is much worse than being early. If so, this becomes closer to
1, so t should be earlier. Similarly, if being early is a problem, then this becomes close to 0, so t should be
later. (
15Apparently C.F. Gauss, who was responsible for popularizing this distribution, looks like Martin Sheen.
Figure 4. A normal distribution.
Example 10.6. Suppose Stanford admits 2480 students and each student has a 68% chance of attending. If
Stanford has room for 1745 students to matriculate, then let X be the number of students who attend, so that
X ∼ Bin(2480, 0.68). Thus, µ = 1686.4 and σ² ≈ 539.65, so this can be approximated by the normal variable
Y ∼ N(1686.4, 539.65), and
\[P(X > 1745) = P(Y \ge 1745.5) = P\left(\frac{Y - 1686.4}{23.23} \ge \frac{1745.5 - 1686.4}{23.23}\right) = 1 - \Phi(2.54) \approx 0.0055.\]
This is a good approximation, because P (X > 1745) = 0.0053, which is very close. (
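The same number can be computed directly (a sketch, not from the notes), writing Φ in terms of the standard error function:

#include <stdio.h>
#include <math.h>

// Standard normal CDF, Phi(z), written in terms of the error function.
double Phi(double z) {
    return 0.5 * (1 + erf(z / sqrt(2.0)));
}

int main(void) {
    double n = 2480, p = 0.68;
    double mu = n * p, sigma = sqrt(n * p * (1 - p));
    // Continuity correction: P(X > 1745) is approximated by P(Y >= 1745.5).
    double z = (1745.5 - mu) / sigma;
    printf("%.4f\n", 1 - Phi(z));   // approximately 0.0055
    return 0;
}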
Definition 11.4. For two discrete random variables X and Y , define the joint probability mass function of X
and Y to be $p_{X,Y}(a, b) = P(X = a, Y = b)$. Then, the marginal distribution of X is $p_X(a) = P(X = a) =
\sum_y p_{X,Y}(a, y)$, and the marginal distribution of Y can be defined in the same way.
Example 11.5. Suppose a household has C computers, of which X are Macs and Y are PCs. Assume any
given computer is equally likely to be a Mac or a PC. Then, define some probabilities: P (C = 0) = 0.16,
P (C = 1) = 0.24, P (C = 2) = 0.28, and P (C = 3) = 0.32. Thus, one can make a joint distribution table by
taking all the options that correspond to C = c and splitting the probability given that each computer is equally
likely to be of either type. Then, summing along the rows and columns gives the marginal distributions.17 (
This can be generalized to continuous distributions:
Definition 11.6. For two continuous random variables X and Y , their joint cumulative probability distribution is
FX,Y (a, b) = F (a, b) = P (X ≤ a, Y ≤ b) where a, b ∈ R.
Then, the marginal distributions are FX (a) = P (X ≤ a) = FX,Y (a, ∞), and FY is defined in the same way.
This is no longer a table, but rather a two-dimensional graph, as in Figure 5.
Figure 5. A joint probability distribution given by FX,Y (x, y) = 20(y − x)(1 − y) for x, y ≥ 0.
Definition 11.7. Two random variables X and Y are jointly continuous if there exists a PDF $f_{X,Y}(x, y) : \mathbb{R}^2 \to \mathbb{R}$
such that
\[P(a_1 \le X \le a_2, b_1 \le Y \le b_2) = \int_{a_1}^{a_2}\!\int_{b_1}^{b_2} f_{X,Y}(x, y)\,dy\,dx,\]
and the CDF is given by
\[F_{X,Y}(a, b) = \int_{-\infty}^{a}\!\int_{-\infty}^{b} f(x, y)\,dy\,dx.\]
Conversely, one can obtain the PDF from the CDF as $f_{X,Y}(a, b) = \frac{\partial^2 F_{X,Y}(a, b)}{\partial x\,\partial y}$. The marginal density functions
are thus
\[f_X(a) = \int_{-\infty}^{\infty} f_{X,Y}(a, y)\,dy \quad\text{and}\quad f_Y(b) = \int_{-\infty}^{\infty} f_{X,Y}(x, b)\,dx.\]
If these sorts of multiple integrals are foreign, see them as iterated integrals: integrate the innermost integral
as in single-variable integrals, and treat all other variables as constants. This can be repeated, leading to a
solution. For example,
\[\int_0^2\!\int_0^1 xy\,dx\,dy = \int_0^2\left(\int_0^1 xy\,dx\right)dy = \int_0^2\left[\frac{x^2y}{2}\right]_0^1 dy = \int_0^2 \frac{y}{2}\,dy = \left[\frac{y^2}{4}\right]_0^2 = 1.\]
This can be generalized to three or more integrals, but that’s not going to be required in this class. Intuitively,
double integrals should be thought of as integrating over some area.
17This is the source of the name: they’re literally written in the margins of the table.
Additionally,
\begin{align*}
P(X > a, Y > b) &= 1 - P(\sim(X > a, Y > b)) = 1 - P((X \le a) \cup (Y \le b)) \\
&= 1 - (P(X \le a) + P(Y \le b) - P(X \le a, Y \le b)) \\
&= 1 - F_X(a) - F_Y(b) + F_{X,Y}(a, b).
\end{align*}
Similarly, $P(a_1 \le X \le a_2, b_1 \le Y \le b_2) = F_{X,Y}(a_1, b_1) + F_{X,Y}(a_2, b_2) - F_{X,Y}(a_1, b_2) - F_{X,Y}(a_2, b_1)$. This can
be seen by drawing a picture.
If X is a discrete random variable, then $E[X] = \sum_i i\,P(X = i)$, which can be generalized to the continuous
case:
Lemma 11.8. If Y is a non-negative continuous random variable with CDF F (y), then
\[E[Y] = \int_0^{\infty} P(Y > y)\,dy = \int_0^{\infty} (1 - F(y))\,dy.\]
Proof.
\begin{align*}
\int_0^{\infty} P(Y > y)\,dy &= \int_{y=0}^{\infty}\int_{x=y}^{\infty} f_Y(x)\,dx\,dy \\
&= \int_{x=0}^{\infty}\left(\int_{y=0}^{x} dy\right)f_Y(x)\,dx \\
&= \int_0^{\infty} x f_Y(x)\,dx = E[Y].
\end{align*}
Example 12.1. Suppose a six-sided die is rolled 7 times. The probability of one 1, one 2, no threes, two 4s, no
5s, and three 6s is
\[P(X_1 = 1, X_2 = 1, X_3 = 0, X_4 = 2, X_5 = 0, X_6 = 3) = \frac{7!}{1!\,1!\,0!\,2!\,0!\,3!}\left(\frac{1}{6}\right)^{1}\left(\frac{1}{6}\right)^{1}\left(\frac{1}{6}\right)^{0}\left(\frac{1}{6}\right)^{2}\left(\frac{1}{6}\right)^{0}\left(\frac{1}{6}\right)^{3} = \frac{420}{6^7}.\] (
The multinomial distribution can be used in probabilistic text analysis: what is the probability that any given
word appears in a text? In some sense, each word appears with some given probability, which makes this a
giant multinomial distribution. This can be conditioned on the author, which makes sense: different people
use different writing styles. But then, it can be flipped around: using Bayes’ theorem, one can determine the
probability of a writer for a given set of words. This was actually used on the Federalist Papers from early
American history, which allowed for a probabilistic identification of which of Hamilton, Madison, and Jay wrote
each paper. Similarly, one can use this to construct spam filters, as it allows one to guess whether the author of
an email was a spammer.
Independence has already been discussed in the context of events, but it also applies to variables. The
intuition is exactly the same: knowing the value of X indicates nothing about that of Y .
Definition 12.2. Two discrete random variables X and Y are independent if p(x, y) = pX (x)pY (y) for all x, y.
Two variables that aren’t independent are called dependent.
Example 12.3. Take a coin with a probability p of heads and flip it m + n times. Then, let X be the number of
heads in the first n flips and Y be the number of heads in the next m flips. It is not surprising that
\[P(X = x, Y = y) = \binom{n}{x}p^x(1 - p)^{n-x}\binom{m}{y}p^y(1 - p)^{m-y} = P(X = x)P(Y = y).\]
Thus, X and Y are independent.
More interestingly, let Z be the number of total heads in all n + m flips. Then, X and Z aren't independent
(e.g. if Z = 0, then X must be 0 as well), nor are Y and Z. (
Example 12.4. Let N ∼ Poi(λ) be the number of requests to a web server in a day. If each request comes
from a human with probability p and a bot with probability 1 − p, then let X be the number of humans per
day. This means that (X | N ) ∼ Bin(N, p). Similarly, if Y is the number of bot requests per day, then
(Y | N ) ∼ Bin(N, 1 − p). These can be made into a joint distribution:
P(X = i, Y = j) = P(X = i, Y = j | X + Y = i + j)P(X + Y = i + j) + P(X = i, Y = j | X + Y ≠ i + j)P(X + Y ≠ i + j)
                = P(X = i, Y = j | X + Y = i + j)P(X + Y = i + j).
Then, P(X = i, Y = j | X + Y = i + j) = ((i + j) choose i) p^i (1 − p)^j since it's multinomial, and P(X + Y = i + j) =
e^−λ λ^{i+j}/(i + j)!, so the overall probability is
P(X = i, Y = j) = ((i + j) choose i) p^i (1 − p)^j · e^−λ λ^{i+j}/(i + j)!
                = e^−λ (λp)^i (λ(1 − p))^j/(i!·j!)
                = ((λp)^i/i!) e^−λp · ((λ(1 − p))^j/j!) e^−λ(1−p) = P(X = i)P(Y = j),
where X ∼ Poi(λp) and Y ∼ Poi(λ(1 − p)).18 Thus, X and Y are in fact independent, which makes sense. (
Definition 12.5. Two continuous random variables X and Y are independent if P (X ≤ a, Y ≤ b) = P (X ≤
a)P (Y ≤ b) for any a and b.
This is equivalent to any of the following:
• FX,Y (a, b) = FX (a)FY (b) for all a, b.
• fX,Y (a, b) = fX (a)fY (b) for all a, b.
• More generally, fX,Y (x, y) = h(x)g(y) with x, y ∈ R, which has nice consequences for integration. Note
that the constraints have to factor as well (see below).
Example 12.6.
• If fX,Y (x, y) = 6e−3x e−2y , then X and Y are pretty clearly independent: let h(x) = 3e−3x and
g(y) = 2e−2y .
• If fX,Y (x, y) = 4xy for 0 < x, y < 1, then X and Y are still independent, as h(x) = 2x and g(y) = 2y.
• Take fX,Y as above, but with the additional constraint that 0 < x + y < 1. Here, X and Y aren't independent,
both intuitively and because the constraint doesn't factor into separate constraints on x and y. (
Example 12.7. Suppose two people set up a meeting at noon, and each arrives at a time uniformly distributed
between noon and 12:30 p.m. Let X be the minutes past noon that the first person arrives, and Y be the minutes
past noon person 2 arrives. Then, X, Y ∼ Uni(0, 30). What is the probability that the first person to arrive waits
more than ten minutes for the other?
This problem is symmetric in X and Y , so P (X + 10 < Y ) + P (Y + 10 < X) = 2P (X + 10 < Y ). Then,
2P(X + 10 < Y) = 2 ∫∫_{x+10<y} f(x, y) dx dy = 2 ∫∫_{x+10<y} fX(x)fY(y) dx dy
               = 2 ∫_10^30 ∫_0^{y−10} (1/30)² dx dy = 2(1/30)² ∫_10^30 (y − 10) dy = 4/9.
The hardest part of this was probably finding the boundary conditions; the integration is straightforward, and
some of the steps were skipped. (
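A rough Monte Carlo check of the 4/9 answer, assuming nothing beyond the Uni(0, 30) arrival model above (the trial count and seed are arbitrary):

#include <cstdio>
#include <cstdlib>

// Estimate P(|X - Y| > 10) for independent X, Y ~ Uni(0, 30) by simulation.
int main() {
    const int trials = 1000000;
    int waits = 0;
    srand(109);
    for (int i = 0; i < trials; i++) {
        double x = 30.0 * rand() / RAND_MAX;  // arrival time of person 1
        double y = 30.0 * rand() / RAND_MAX;  // arrival time of person 2
        double diff = x - y;
        if (diff > 10.0 || diff < -10.0) waits++;   // someone waited over 10 minutes
    }
    printf("Estimated probability: %f (4/9 = %f)\n",
           (double)waits / trials, 4.0 / 9.0);
    return 0;
}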
Example 12.8. Suppose a disk surface is a disc of radius R and a single point of imperfection is uniformly
distributed on the disk. Then,
fX,Y(x, y) = 1/(πR²) if x² + y² ≤ R², and 0 if x² + y² > R².
Notice that X and Y, which are the random variables that are the coordinates of that point, are not independent:
for example, if X = R, then Y must be 0. More formally,
fX(x) = ∫_−∞^∞ fX,Y(x, y) dy = ∫_{−√(R²−x²)}^{√(R²−x²)} 1/(πR²) dy = 2√(R² − x²)/(πR²),   for −R ≤ x ≤ R.
By symmetry, fY(y) = 2√(R² − y²)/(πR²), where −R ≤ y ≤ R. Thus, fX,Y(x, y) ≠ fX(x)fY(y).
Often, the distance of this imperfection from the center is important: more important data tends to be placed on the outside
so that it can be accessed faster. If D = √(X² + Y²), then P(D ≤ a) = πa²/(πR²) = a²/R², since it can be
modelled as a random dart landing in the circle of radius a.
Additionally, the expected value can be calculated, though it might be counterintuitive: using Lemma 11.8,
E[D] = ∫_0^R P(D > a) da = ∫_0^R (1 − a²/R²) da = [a − a³/(3R²)]_0^R = 2R/3. (
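A quick simulation sketch of E[D], using rejection sampling to generate a uniform point on the disk (R = 1 and the trial count are arbitrary choices):

#include <cstdio>
#include <cstdlib>
#include <cmath>

// Estimate E[D] for a point uniformly distributed on a disk of radius R = 1.
int main() {
    const int trials = 1000000;
    const double R = 1.0;
    double sum = 0.0;
    int accepted = 0;
    srand(109);
    while (accepted < trials) {
        // Sample uniformly from the bounding square, keep only points inside the disk.
        double x = R * (2.0 * rand() / RAND_MAX - 1.0);
        double y = R * (2.0 * rand() / RAND_MAX - 1.0);
        if (x * x + y * y <= R * R) {
            sum += sqrt(x * x + y * y);   // distance from the center
            accepted++;
        }
    }
    printf("Estimated E[D]: %f (2R/3 = %f)\n", sum / trials, 2.0 * R / 3.0);
    return 0;
}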
Independence of n variables for n > 2 is a straightforward generalization of the n = 2 case, since it requires the
probabilities to factor through for any possible subset. This is identical to the definition for events. Independence
is symmetric: if X and Y are independent, then Y and X are. This is obvious, but consider a sequence
X1, X2, . . . of independent and identically distributed (IID)19 random variables. Then, Xn is a record value if Xn > Xi for all
i < n (i.e. Xn = max(X1, . . . , Xn)). Let Ai be the event that Xi is a record value. Then, the probability that
tomorrow is a record value seems to depend on whether today was, but flipping it around, does tomorrow being
a record value affect whether today is?
More mathematically, P(An) = 1/n and P(An+1) = 1/(n + 1). Then, P(An ∩ An+1) = (1/n)(1/(n + 1)) =
P(An)P(An+1), so the two events are in fact independent.
Example 12.9. One useful application is choosing a random subset of size k from a set of n elements. The goal
is to do this such that all (n choose k) possibilities are equally likely, given a uniform() function that simulates
Uni(0, 1). There's a brute-force way to do this by calculating all subsets and then picking a random one, but this is
exponential in time and space. Here's a better building block:

int indicator(double p) {
    if (uniform() < p) return 1;
    else return 0;
}
19. The second condition means that they all have the same distribution, which in particular does not depend on i.
13. Adding Random Variables: 4/29/13
Recall Example 12.9 from the previous lecture. The general idea of the algorithm is to create an array I with
exactly k nonzero entries, where each element is included with a probability that depends on how many elements are left
in the set and how many still need to be picked; a sketch is given below. Also, note that this algorithm is linear, which is a nice speedup from the brute-force approach.
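The lecture's actual rSubset code isn't reproduced in these notes; the following is a sketch of the algorithm just described, assuming only a uniform() primitive that returns a Uni(0, 1) value:

#include <cstdlib>

// Stand-in for the uniform() primitive assumed in Example 12.9:
// returns a pseudo-random value in [0, 1].
double uniform() {
    return (double)rand() / RAND_MAX;
}

// Fill I[0..n-1] with exactly k ones, so that the positions of the ones form
// a uniformly random k-subset of the n elements.
void rSubset(int I[], int n, int k) {
    int remaining = n;   // elements not yet considered
    int needed = k;      // ones still to place
    for (int i = 0; i < n; i++) {
        // Include element i with probability needed/remaining.
        if (uniform() < (double)needed / remaining) {
            I[i] = 1;
            needed--;
        } else {
            I[i] = 0;
        }
        remaining--;
    }
}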
Claim. Any given subset of k elements is equally likely to be returned from this algorithm. (
Proof. Proceed by induction on k + n. In the base case, where k = 1 and n = 1, then S = {a} and rSubset returns
{a} with probability p = 1 = 1/(1 choose 1).
In the general case, suppose the algorithm works whenever k + n ≤ c. Then:
• If k + n ≤ c + 1, then |S| = n. Suppose that I[1] = 1. Then, by the inductive hypothesis, rSubset returns
a subset S′ of size k − 1 from the remaining n − 1 elements with probability 1/((n − 1) choose (k − 1)) (i.e. equally likely). Since P(I[1] = 1) = k/n,
the total probability of any given subset which contains the first element is (k/n) · 1/((n − 1) choose (k − 1)) = 1/(n choose k).
• If k + n ≤ c + 1 and I[1] = 0, then a set of size k is returned out of the remaining n − 1 elements with probability
1/((n − 1) choose k). Then, P(I[1] = 0) = 1 − k/n, so the overall probability is (1 − k/n)(1/((n − 1) choose k)) = 1/(n choose k).
Suppose X and Y are independent random variables with X ∼ Bin(n1, p) and Y ∼ Bin(n2, p). Then, X + Y ∼
Bin(n1 + n2, p). Intuitively, do the trials for Y after those of X, and it's essentially the same variable, since
both X and Y have the same chance of success per trial. This generalizes in the straightforward way: suppose
Xi ∼ Bin(ni, p) for 1 ≤ i ≤ n. Then,
Σ_{i=1}^n Xi ∼ Bin(Σ_{i=1}^n ni, p).
Now take X ∼ Poi(λ1) and Y ∼ Poi(λ2), where X and Y are independent. These can be summed as well:
X + Y ∼ Poi(λ1 + λ2). The key idea in the proof is that P(X + Y = n) = Σ_{k=0}^n P(X = k, Y = n − k). More generally, if Xi ∼ Poi(λi) are all independent, then
Σ_{i=1}^n Xi ∼ Poi(Σ_{i=1}^n λi).
Some things can also be said about expectations of sums. Suppose that g(X, Y) = X + Y. Then, E[g(X, Y)] = E[X + Y] =
E[X] + E[Y]. (The proof will be given in the next lecture.) This can be generalized to many variables X1, . . . , Xn:
E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi].
This holds regardless of the dependencies between the Xi . Thus, it is an incredibly useful result, and comes up
all the time in probability (including the next few problem sets).
Then, consider two independent, continuous random variables X and Y. The CDF of X + Y can be given by
FX+Y(a) = P(X + Y ≤ a) = ∫∫_{x+y≤a} fX(x)fY(y) dx dy = ∫_{y=−∞}^∞ ( ∫_{x=−∞}^{a−y} fX(x) dx ) fY(y) dy = ∫_−∞^∞ FX(a − y)fY(y) dy.
FX+Y is called the convolution of FX and FY as given above. This also works for PDFs:
fX+Y(a) = ∫_−∞^∞ fX(a − y)fY(y) dy.
−∞
Convolutions exist in the discrete case, where integrals are replaced with sums and PDFs are replaced with
probabilities.
Suppose X ∼ Uni(0, 1) and Y ∼ Uni(0, 1) are independent random variables (i.e. f(a) = 1 for 0 ≤ a ≤ 1).
Intuitively, the uniform distribution seems easy to handle, but behaves badly when something more complicated
happens to it: here's what happens to the PDF of X + Y:
fX+Y(a) = ∫_0^1 fX(a − y)fY(y) dy = ∫_0^1 fX(a − y) dy.
Now some casework is necessary: 0 ≤ a ≤ 2, so first consider 0 ≤ a ≤ 1. Then, fX(a − y) = 1 exactly when 0 ≤ y ≤ a,
so fX+Y(a) = ∫_0^a dy = a. However, if 1 ≤ a ≤ 2, then fX(a − y) = 1 exactly when a − 1 ≤ y ≤ 1,
so fX+Y(a) = ∫_{a−1}^1 dy = 2 − a. Thus, the PDF is (as in Figure 6)
fX+Y(a) = a for 0 ≤ a ≤ 1, 2 − a for 1 < a ≤ 2, and 0 otherwise.
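A small simulation sketch that histograms X + Y; the counts should rise roughly linearly up to 1 and then fall linearly, matching the triangular PDF (bin count and trial count are arbitrary):

#include <cstdio>
#include <cstdlib>

// Histogram X + Y for independent X, Y ~ Uni(0, 1).
int main() {
    const int trials = 1000000, bins = 10;
    int hist[bins] = {0};
    srand(109);
    for (int i = 0; i < trials; i++) {
        double s = (double)rand() / RAND_MAX + (double)rand() / RAND_MAX;
        int b = (int)(s / 2.0 * bins);   // map [0, 2) onto bin indices
        if (b == bins) b = bins - 1;     // guard the s == 2 edge case
        hist[b]++;
    }
    for (int b = 0; b < bins; b++)
        printf("[%.1f, %.1f): %d\n", 2.0 * b / bins, 2.0 * (b + 1) / bins, hist[b]);
    return 0;
}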
Another fact (whose proof is a little complicated for now) is that if X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) are
independent, then X + Y ∼ N(µ1 + µ2, σ1² + σ2²), which is well-behaved even though it seems more complicated
than the uniform distribution. This can be once again generalized: if X1, . . . , Xn are independent random
variables, such that Xi ∼ N(µi, σi²), then
Σ_{i=1}^n Xi ∼ N(Σ_{i=1}^n µi, Σ_{i=1}^n σi²).
Example 13.1. Suppose an RCC checks several computers for viruses. There are 50 Macs, each of which is
independently infected with probability p = 0.1, and 100 PCs, each independently infected with probability p = 0.4.
Let A be the number of infected Macs and B be the number of infected PCs, so A ∼ Bin(50, 0.1) ≈ X ∼ N(5, 4.5)
and B ∼ Bin(100, 0.4) ≈ Y ∼ N(40, 24). Thus, P(A + B ≥ 40) ≈ P(X + Y ≥ 39.5). (Recall the continuity correction
and inclusiveness discussed previously.) Since the normal distribution is additive, X + Y ∼ N(45, 28.5), so
P(X + Y ≥ 39.5) = 1 − Φ(−1.03) ≈ 0.8485. (
Conditional probabilities still work for distributions: the conditional PMF of X given Y (where pY (y) > 0)
can be given by
pX|Y(x | y) = P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y) = pX,Y(x, y)/pY(y).
This is not that different from before, but it is helpful to be clear about the difference between random variables
and events. Since these are discrete random variables, sometimes breaking these into sums is helpful.
Example 13.2. Suppose a person buys two computers over time, and let X be the event that the first computer
is a PC, and Y be the event that the second computer is a PC (as indicators, so X, Y ∈ {0, 1}). Then, various
conditional probabilities can be found using the joint PMF table, which looks not too tricky. Interestingly,
P (X = 0 | Y = 1) can be calculated, which is sort of looking back in time. This can be used to make
recommendations for future products, etc. (
Example 13.3. Suppose a web server receives requests, where X ∼ Poi(λ1 ) is the number of requests by
humans per day and Y ∼ Poi(λ2 ) is the number of requests from bots per day. Then, X and Y are independent,
so they sum to Poi(λ1 + λ2). Then, given only the total number of requests, one can get the probability that k of
them were made by humans:
P(X = k | X + Y = n) = P(X = k, Y = n − k)/P(X + Y = n) = P(X = k)P(Y = n − k)/P(X + Y = n)
                     = (e^−λ1 λ1^k/k!) · (e^−λ2 λ2^{n−k}/(n − k)!) · (n!/(e^−(λ1+λ2) (λ1 + λ2)^n))
                     = (n!/(k!(n − k)!)) · λ1^k λ2^{n−k}/(λ1 + λ2)^n
                     = (n choose k) (λ1/(λ1 + λ2))^k (λ2/(λ1 + λ2))^{n−k}.
This looks suspiciously like a binomial distribution: (X | X + Y = n) ∼ Bin(n, λ1/(λ1 + λ2)). Intuitively, for
every request, a coin is flipped that determines whether it came from a human or a bot. (
If X and Y are continuous random variables, then conditional probability looks not much different: the
conditional PDF is fX|Y (x | y) = fX,Y (x, y)/fY (y). Intuitively, this is the limit of probabilities in a small area
around x and y. Then, the CDF is
FX|Y(a | y) = P(X ≤ a | Y = y) = ∫_−∞^a fX|Y(x | y) dx.
Observe that it does make sense to condition on y even though P (Y = a) = 0, because this is actually a limit
that does make sense. This would require chasing epsilons to show rigorously, however.
Example 14.1. Suppose fX,Y(x, y) = (12/5)x(2 − x − y) for 0 < x, y < 1. Then,
fX|Y(x | y) = fX,Y(x, y)/fY(y) = (12/5)x(2 − x − y) / ( ∫_0^1 (12/5)x(2 − x − y) dx ) = 6x(2 − x − y)/(4 − 3y).
Notice that this depends on both x and y, which makes sense. (
If X and Y are independent continuous random variables, then fX|Y(x | y) = fX(x), which is exactly the same as in the
discrete case.
Suppose X is a continuous random variable, but N is a discrete random variable. Then, the conditional PDF
of X given N is fX|N (x | n) = pN |X (n | x)fX (x)/pN (n). Notice the use of PMFs for the discrete variable and
PDFs for the continuous variable. What is slightly weirder is the conditional PMF of N given X, where the
discrete distribution changes based on the continuous one: pN |X (n | x) = fX|N (x | n)pN (n)/fX (x). Bayes’
theorem happens to apply here as well.
One good example of this is a coin flip in which the probability of the coin coming up heads is unknown:
it’s a continuous random variable between 0 and 1. Then, the number of coins coming up heads is a discrete
quantity that depends on a continuous one.
Definition 14.2. X is a beta random variable, denoted X ∼ Beta(a, b), if its PDF is given by
f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b) for 0 < x < 1, and 0 otherwise,
where B(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1} dx is the normalizing constant.
This distribution can do lots of unusual things, as in Figure 7, though when a = b it is symmetric about
x = 0.5.
The value on the bottom is obtained by a sneaky trick: there needs to be a normalizing constant, and P (N = n)
isn’t known, so they have to match up. But this holds only when 0 < x < 1, so X | (N = n, n + m trials) ∼
Beta(n + 1, m + 1). In some sense, the more heads one obtains, the more the beta distribution tends towards
1, and the more tails, the more it skews to zero. As more and more trials are conducted, the beta distribution
narrows in on the likely probability.
Another nice fact is that Beta(1, 1) = Uni(0, 1) (verified by just calculating it out). The beta distribution is also a conjugate
distribution for itself (as well as for the Bernoulli and binomial distributions), because a prior distribution (before
a trial) given by a beta distribution makes calculating the posterior distribution (after the trial)
easy: they're both beta distributions. In fact, one can set X ∼ Beta(a, b) as a prior guess to test whether
a coin has a certain bias. This is called a subjective probability. After n + m further trials of which n come up heads, the distribution is
updated to X | (n heads in n + m trials) ∼ Beta(a + n, b + m). In some sense, the prior acts as an equivalent sample size that is
overwritten in the long run: it represents a + b − 2 imagined trials, of which
a − 1 ended up heads.
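A minimal sketch of this updating scheme, tracking only the hyperparameters; the Beta(1, 1) prior and the observed counts are arbitrary illustrative choices:

#include <cstdio>

// Track a Beta(a, b) belief about a coin's probability of heads and update it
// after observing flips; the posterior mean a/(a+b) is printed as a summary.
struct BetaBelief {
    double a, b;   // hyperparameters of the current beta distribution
};

void observe(BetaBelief &belief, int heads, int tails) {
    // Beta is conjugate to the Bernoulli/binomial: just add the counts.
    belief.a += heads;
    belief.b += tails;
}

int main() {
    BetaBelief belief = {1.0, 1.0};   // Beta(1, 1) = Uni(0, 1) prior
    observe(belief, 8, 2);            // saw 8 heads and 2 tails
    printf("Posterior is Beta(%.0f, %.0f), mean %.3f\n",
           belief.a, belief.b, belief.a / (belief.a + belief.b));
    return 0;
}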
Returning to expectation, we know that if a ≤ X ≤ b, then a ≤ E[X] ≤ b, which is fairly easy to see formally.
However, expectation can be generalized: if g(X, Y) is a real-valued function, then in the discrete case
E[g(X, Y)] = Σ_x Σ_y g(x, y)pX,Y(x, y),
since X and Y might not be independent (so their joint distribution is necessary). In the continuous case,
E[g(X, Y)] = ∫_−∞^∞ ∫_−∞^∞ g(x, y)fX,Y(x, y) dx dy.
For example, if g(X, Y) = X + Y, then
E[X + Y] = ∫_−∞^∞ ∫_−∞^∞ (x + y)fX,Y(x, y) dx dy
         = ∫_−∞^∞ ∫_−∞^∞ x·fX,Y(x, y) dy dx + ∫_−∞^∞ ∫_−∞^∞ y·fX,Y(x, y) dx dy
         = ∫_−∞^∞ x·fX(x) dx + ∫_−∞^∞ y·fY(y) dy = E[X] + E[Y],
without any assumptions about X and Y (especially as to their independence).
Here's another reasonably unsurprising fact: if P(X ≥ a) = 1, then E[X] ≥ a. This is most often useful
when X is nonnegative (take a = 0). However, the converse is untrue: if E[X] ≥ a, it isn't necessarily true that X ≥ a
(e.g. if X is equally likely to take on −1 or 3, then E[X] = 1, but X can be −1). Similarly, if X ≥ Y, then X − Y ≥ 0, so
E[X − Y] = E[X] − E[Y] ≥ 0, so E[X] ≥ E[Y]. Again, the converse is untrue.
Definition 14.3. Let X1, . . . , Xn be independently and identically distributed random variables drawn from some
distribution function F with E[Xi] = µ. Then, the sequence of Xi is called a sample from the distribution F, and the
sample mean is X̄ = Σ_{i=1}^n Xi/n.
The idea is that X1, . . . , Xn are chosen from a distribution (students taking a test), and the mean is just
that of this sample. The sample mean can vary depending on how the sample ends up. In some sense, X̄
represents the average score of a class on a test, the average height of a group of people, etc. X̄ is itself a random variable. Then,
E[X̄] = E[Σ_{i=1}^n Xi/n] = (1/n) Σ_{i=1}^n E[Xi] = (1/n) Σ_{i=1}^n µ = nµ/n = µ.
P(Xi = k) = ((n − i)/n)(i/n)^{k−1}, so Xi ∼ Geo((n − i)/n) and E[Xi] = 1/p = n/(n − i). Then, X = Σ_{i=0}^{n−1} Xi,
so
E[X] = Σ_{i=0}^{n−1} E[Xi] = Σ_{i=0}^{n−1} n/(n − i) = n(1/n + 1/(n − 1) + · · · + 1) = O(n log n). (
Example 15.2. Recall Quicksort, a fast, recursive sorting algorithm. Choosing the pivot index is based on a partition
function; a sketch is given below.
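The partition code shown in lecture isn't reproduced in these notes; here is a minimal sketch of a randomized partition (and the surrounding Quicksort) consistent with the analysis below, where the random pivot choice is the assumption that matters:

#include <cstdlib>
#include <algorithm>  // for std::swap

// Partition arr[lo..hi] around a randomly chosen pivot and return the pivot's
// final index; elements <= pivot end up to its left, larger ones to its right.
int partition(int arr[], int lo, int hi) {
    int pivotIndex = lo + rand() % (hi - lo + 1);  // random pivot choice
    std::swap(arr[pivotIndex], arr[hi]);           // move the pivot to the end
    int pivot = arr[hi];
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (arr[i] <= pivot) {
            std::swap(arr[i], arr[store]);
            store++;
        }
    }
    std::swap(arr[store], arr[hi]);                // put the pivot in its place
    return store;
}

void quicksort(int arr[], int lo, int hi) {
    if (lo >= hi) return;                          // 0 or 1 elements: done
    int p = partition(arr, lo, hi);
    quicksort(arr, lo, p - 1);
    quicksort(arr, p + 1, hi);
}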
The complexity of this algorithm is determined by the number of comparisons made to the pivot. Though Quicksort
is O(n log n) in expectation, in the worst case it is O(n²), which happens when every time the pivot is selected, it is the maximal or
minimal remaining element. Then, the probability that Quicksort has the worst-case behavior is the same as the
probability of creating a degenerate BST (as seen on the first problem set), since there are exactly two bad
pivots on each recursive call. Thus, the probability of the worst case is 2^{n−1}/n!.
This isn’t enough to give a nice description of the running time; the goal is to instead find the expected
running time of the algorithm. Let X be the number of comparisons made when sorting n elements, so that E[X]
is the expected running time of the algorithm.
Let X1 , . . . , Xn be the input to be sorted, and let Y1 , . . . , Yn be the output in sorted order. Let Ia,b = 1 if Ya
and Yb are compared and 0 otherwise. Then, because of the order, each pair of values is only compared once:
Pn−1 Pn
X = a=1 b=a+1 Ia,b .
For any given Ya and Yb , if the pivot chosen is not between them, they aren’t directly compared in that
recursive call. Thus, the only cases we care about are those in which the pivot is in {Ya , . . . , Yb }. Then, in
order to compare them, one of them must be selected as the pivot, so the probability that they are compared is
2/(b − a + 1). This has the interesting consequence that randomly permuting an array can make Quicksort run
faster.
Then, one can make some explicit calculations, starting with an approximation:
Σ_{b=a+1}^n 2/(b − a + 1) ≈ ∫_{a+1}^n 2/(b − a + 1) db = 2 ln(b − a + 1)|_{b=a+1}^n ≈ 2 ln(n − a + 1),
so
E[X] ≈ Σ_{a=1}^{n−1} 2 ln(n − a + 1)
     ≈ 2 ∫_1^{n−1} ln(n − a + 1) da
     = −2 ∫_n^2 ln y dy ≈ 2n ln(n) − 2n
     = O(n log n). (
Let Ii be indicator variables for events Ai . Then, we can speak of pairwise indicators: the variable Ii Ij is the
indicator for the event Ai ∩ Aj . Thus, one has
E[(X choose 2)] = E[Σ_{i<j} IiIj] = Σ_{i<j} E[IiIj] = Σ_{i<j} P(Ai ∩ Aj).
But knowing the variance is also helpful in terms of this real-world model. Since the events Xi are independent,
P(Ai ∩ Aj) = (1 − pi − pj)^n for i ≠ j, so
E[X(X − 1)] = E[X²] − E[X] = 2 Σ_{i<j} P(Ai ∩ Aj) = 2 Σ_{i<j} (1 − pi − pj)^n,
so the variance is
Var(X) = 2 Σ_{i<j} (1 − pi − pj)^n + E[X] − (E[X])²
       = 2 Σ_{i<j} (1 − pi − pj)^n + Σ_{i=1}^k (1 − pi)^n − (Σ_{i=1}^k (1 − pi)^n)² = Var(Y).
Places such as Amazon take their excess capacity and sell it to other people who might need servers, and
knowing exactly how much they can sell is a valuable piece of information. (
This example sounds exactly like coupon collecting or using hash tables, but is ostensibly more interesting.
One can take products of expectations: suppose X and Y are independent random variables and g and h
are real-valued functions. Then, E[g(X)h(Y )] = E[g(X)]E[h(Y )]. If X and Y are dependent, this rule doesn’t
hold, but the sum rule still does. Here’s why the product rule works in the independent case:
Z ∞Z ∞
E[g(X)h(Y )] = g(x)h(y)fX,Y (x, y) dx dy
−∞ −∞
Z ∞Z ∞
= g(x)h(y)fX (x)fY (y) dx dy
−∞ −∞
Z ∞ Z ∞
= g(x)fX (x) dx h(y)fY (y) dy
−∞ −∞
= E[g(X)]E[h(Y )].
Independence was necessary so that the joint PDF factored into the marginal density functions.
Definition 15.4. If X and Y are random variables, the covariance of X and Y is Cov(X, Y ) = E[(X −E[X])(Y −
E[Y ])].
Equivalently, Cov(X, Y ) = E[XY − E[X]Y − XE[Y ] + E[Y ]E[X]] = E[XY ] − E[X]E[Y ] − E[X]E[Y ] +
E[X]E[Y ] = E[XY ] − E[X]E[Y ]. This latter formula (Cov(X, Y ) = E[XY ] − E[X]E[Y ]) is particularly
helpful, and indicates how X and Y vary together linearly. Clearly, if X and Y are independent, then
Cov(X, Y) = 0. However, the converse is not true, and this is a frequently misunderstood aspect of probability:
if X ∈ {−1, 0, 1} with equal likelihood and Y is an indicator for the event X = 0, then XY = 0 always, so E[XY] = 0. Thus,
Cov(X, Y) = E[XY] − E[X]E[Y] = 0, but X and Y aren't independent.
Example 15.5. Imagine rolling a six-sided die and let X be an indicator for returning a 1, 2, 3, or 4, and let
Y be an indicator for 3, 4, 5, or 6. Then, E[X] = E[Y ] = 2/3 and E[XY ] = 1/3. Thus, the covariance is
Cov(X, Y) = 1/3 − (2/3)(2/3) = −1/9, which is negative. This might be surprising at first: a variance can't be negative, but a covariance can, indicating
that X and Y vary in opposite directions. (
This allows us to understand the variance of a sum of variables: it isn’t just the sum of the variances.
Claim. If X1, . . . , Xn are random variables, then
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i=1}^n Σ_{j=i+1}^n Cov(Xi, Xj). (
The last step was done because Cov(Xi, Xj) = Cov(Xj, Xi), so the sum can be simplified.
Note that this formula means that if Xi and Xj are independent whenever i ≠ j (since Xi can't be independent
of itself), then Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi). This is an important difference between the expected value
and the variance.
Example 16.1. Let Y ∼ Bin(n, p) and let Xi be an indicator for the ith trial being successful. Specifically,
Xi ∼ Ber(p) and E[Xi] = p. Notice that E[Xi²] = E[Xi], since Xi is only 0 or 1, each of which squares to itself.
Then, since the Xi are independent,
Var(Y) = Σ_{i=1}^n Var(Xi) = Σ_{i=1}^n (E[Xi²] − E[Xi]²) = Σ_{i=1}^n (E[Xi] − E[Xi]²) = Σ_{i=1}^n (p − p²) = np(1 − p).
This is a fact that you should have already seen before, but this is an interesting derivation. (
Recall the sample mean X̄ = Σ_{i=1}^n Xi/n for independently and identically distributed random variables
X1, . . . , Xn. If E[Xi] = µ and Var(Xi) = σ², then E[X̄] = µ, but the variance is slightly more interesting:
Var(X̄) = Var(Σ_{i=1}^n Xi/n) = (1/n)² Var(Σ_{i=1}^n Xi)
        = (1/n)² Σ_{i=1}^n Var(Xi)
        = (1/n)² Σ_{i=1}^n σ² = σ²/n.
The variance gets smaller as there are more variables in the mean. This makes sense: as more things (e.g.
people’s heights) are added to the sample, then they on average look like the mean, so the variance decreases.
One can also define the sample deviation Xi − X̄ for each i, and then obtain the sample variance, which is
the guess at what the true variance should be:
S² = Σ_{i=1}^n (Xi − X̄)²/(n − 1).
Then, E[S²] = σ². This means that S² is an "unbiased estimate" of σ², which is a bit of jargon. The n − 1 in
the denominator comes out of an intimidating computation showing that E[S²] = σ².
Definition 16.2. If X and Y are two arbitrary random variables, the correlation of X and Y is
ρ(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)).
In some sense, this takes the covariance, which already illustrates something similar, and divides out to
account for possibly different units. The correlation is always within the range [−1, 1]. This measures how much
X and Y vary in a linear relationship: if ρ(X, Y) = 1, then Y = aX + b, where a = σY/σX (the ratio of the standard
deviations of Y and X). If ρ(X, Y) = −1, then Y = aX + b, but a = −σY/σX. If ρ(X, Y) = 0,
then there is no linear relationship between the two. Note that there may still be a relationship, just of a higher
order. However, the linear component is typically the one that people care the most about.
If ρ(X, Y ) = 0, then X and Y are uncorrelated; otherwise, they are correlated. Notice that independence
implies two variables are uncorrelated (since the covariance is zero), but the converse is false!
Suppose IA and IB are indicator variables for events A and B. Then, E[IA ] = P (A) and E[IB ] = P (B), so
E[IA IB ] = P (A ∩ B). Thus,
Cov(IA , IB ) = E[IA IB ] − E[IA ]E[IB ] = P (A ∩ B) − P (A)P (B)
= P (A | B)P (B) − P (A)P (B)
= P (B)(P (A | B) − P (A)).
Thus, if P(A | B) > P(A), then the correlation is positive: when B happens, A is more likely (and, symmetrically, observing A makes B more likely). Similarly, if
P(A | B) = P(A), then ρ(IA, IB) = 0, and if P(A | B) < P(A), then observing B makes A less likely, so they
are negatively correlated.
Example 16.3. Suppose n independent trials of some experiment are performed, and each trial results in one
of m outcomes, each with probabilities p1 , . . . , pm , where the pi sum to 1. Let Xi be the number of trials with
outcome i. For example, one could count the number of times a given number comes up on a die when rolled
repeatedly. Intuitively, two numbers should be negatively correlated, since more of one number implies less
room, so to speak, for another.
Let Ii(k) be an indicator variable for trial k having outcome i. Then, E[Ii(k)] = pi and Xi = Σ_{k=1}^n Ii(k). If a ≠ b,
then trials a and b are independent, so Cov(Ii(b), Ij(a)) = 0, and when a = b, then E[Ii(a)Ij(a)] = 0 for i ≠ j (you can't
roll both a 1 and a 2 in the same trial), so Cov(Ii(a), Ij(a)) = −E[Ii(a)]E[Ij(a)], since the first term drops out.
Thus,
Cov(Xi, Xj) = Σ_{a=1}^n Σ_{b=1}^n Cov(Ii(b), Ij(a)) = Σ_{a=1}^n Cov(Ii(a), Ij(a)) = Σ_{a=1}^n −E[Ii(a)]E[Ij(a)] = Σ_{a=1}^n −pi·pj = −n·pi·pj,
so they are in fact negatively correlated. This is important because multinomial distributions happen in many
cases in applications. For example, if m is large, so that pi is small, and the outcomes are equally likely (such
that pi = 1/m), then Cov(Xi , Xj ) = −n/m2 , so the covariance decreases quadratically as m increases. This is
why the Poisson paradigm works so well: this quantifies how accurate the approximation can be. (
Suppose that X and Y are jointly discrete random variables, so their conditional PMF is pX|Y(x | y) =
P(X = x | Y = y) = pX,Y(x, y)/pY(y). Then, the conditional expectation of X given that Y = y is
E[X | Y = y] = Σ_x x·P(X = x | Y = y) = Σ_x x·pX|Y(x | y).
This is a hypergeometric distribution: (X | X + Y = m) ∼ HypG(m, 2n, n). This actually has an intuitive
explanation: in a total of m draws, there are 2n total balls in an urn, of which n are white. This is exactly what
the definition specifies. Thus, the expectation is already known: E[X | X + Y = m] = nm/(2n) = m/2. Of
course, by symmetry, this makes sense, and you could even write it down ahead of time with some foresight. (
The sum of expectations still holds, no matter what you throw at it: if X1 , . . . , Xn and Y are random variables,
then
E[Σ_{i=1}^n Xi | Y = y] = Σ_{i=1}^n E[Xi | Y = y].
Since E[X | Y = y] sums over all values of X, it can be thought of as a function of y but not of X, so
g(Y) = E[X | Y] is a random variable: for any given Y = y, g(y) = E[X | Y = y]. Thus, we can take
E[g(Y)] = E[E[X | Y]]. Intuitively, this is the expectation of an expectation, which looks like it should simplify easily, but these
are expectations over different variables, so they do not just cancel out. The notation might be confusing, so add
subscripts corresponding to the expected variables if necessary. However, the expectation can be
taken one layer at a time:
X
E[E[X | Y ]] = E[X | Y = y]P (Y = y)
y
!
X X
= xP (X = x | Y = y) P (Y = y)
y x
X X X
= xP (X = x, Y = y) = x P (X = x, Y = y)
x,y x y
X
= xP (X = x) = E[X].
x
By replacing the sums with integrals, the same result can be found for the continuous case.
Example 16.6. Consider the following code, which recurses probabilistically:
int Recurse() {
    int x = randomInt(1, 3);   // x is 1, 2, or 3, each equally likely
    if (x == 1) return 3;
    else if (x == 2) return (5 + Recurse());
    else return (7 + Recurse());
}
Let Y be the value returned by Recurse(). What is E[Y ]? This may seem like a weird example, but it could
indicate the running time of a snippet of code given some different recursive calls.
Notice that E[Y | X = 1] = 3, E[Y | X = 2] = 5 + E[Y ], and E[Y | X = 3] = 7 + E[Y ], where X is
the value of x in the program. Thus, E[Y ] = 3(1/3) + (5 + E[Y ])/3 + (7 + E[Y ])/3, which can be solved for
E[Y ] = 15. (
One strategy is to first interview k candidates, and then hire the next candidate better than all of the first k
candidates. Then, it’s possible to calculate the probability that the best person is hired. Let X be the position
of the best candidate in the line of interviews. Then, Pk (Best | X = i) = 0 if i ≤ k. This is a good argument
for making k small. In general, the best person of the first k sets the bar for hiring, so the best candidate
in position i will be selected if the best of the first i − 1 is in the first k interviewed. Thus, if i > k, then
Pk (Best | X = i) = k/(i − 1). Then,
Pk(Best) = (1/n) Σ_{i=1}^n Pk(Best | X = i) = (1/n) Σ_{i=k+1}^n k/(i − 1) ≈ (k/n) ∫_{k+1}^n di/(i − 1) ≈ (k/n) ln(n/k).
Then, to optimize the value of k, differentiate this and set it equal to zero. This yields k = n/e, for which the
probability of hiring the best person is about 1/e ≈ 37%. This is interesting because it doesn’t depend on the
number of people interviewed.
Nonetheless, this isn’t optimal, which is why most companies don’t use this strategy. (
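A Monte Carlo sketch of the "observe k, then hire the next best" strategy; n = 100, k = 37 ≈ n/e, and the trial count are arbitrary test values:

#include <cstdio>
#include <cstdlib>
#include <algorithm>   // std::swap, std::max

// Estimate the probability of hiring the single best candidate when the first
// k candidates are only observed, and the next candidate better than all of
// them is hired.
int main() {
    const int n = 100, k = 37, trials = 100000;
    int rank[n];                       // rank[i] = quality of candidate i (n - 1 is best)
    int successes = 0;
    srand(109);
    for (int t = 0; t < trials; t++) {
        for (int i = 0; i < n; i++) rank[i] = i;
        for (int i = n - 1; i > 0; i--)                 // random shuffle of the candidates
            std::swap(rank[i], rank[rand() % (i + 1)]);
        int bar = -1;
        for (int i = 0; i < k; i++) bar = std::max(bar, rank[i]);   // best of the first k
        int hired = -1;
        for (int i = k; i < n; i++)
            if (rank[i] > bar) { hired = rank[i]; break; }          // first to beat the bar
        if (hired == n - 1) successes++;                // was the overall best hired?
    }
    printf("Estimated P(best hired) = %f (1/e is about 0.368)\n",
           (double)successes / trials);
    return 0;
}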
Definition 17.4. The moment-generating function (MGF) of a random variable X is M(t) = E[e^{tX}], where
t ∈ R.
When X is discrete, this is just M(t) = Σ_x e^{tx} p(x), and when X is continuous, this is instead M(t) =
∫_−∞^∞ e^{tx} fX(x) dx. Interesting things happen when you calculate M′(0): derivatives commute with taking the
expected value, so M′(t) = E[Xe^{tX}] and M′(0) = E[X]. Taking the derivative again yields M′′(t) = E[X²e^{tX}],
so M′′(0) = E[X²], the second moment. This generalizes: M^(n)(0) = E[X^n].
Example 17.5. If X ∼ Ber(p), then M(t) = E[e^{tX}] = e^0·(1 − p) + e^t·p = 1 − p + pe^t, so M′(t) = pe^t and M′(0) = p, and M′′(t) = pe^t
again, so M′′(0) = p again. This is stuff we already knew, but it's nice to see it derived in a new way. (
Example 17.6. Now let X ∼ Bin(n, p). Then,
M(t) = E[e^{tX}] = Σ_{k=0}^n e^{tk} (n choose k) p^k (1 − p)^{n−k} = Σ_{k=0}^n (n choose k) (pe^t)^k (1 − p)^{n−k} = (pe^t + 1 − p)^n
using the Binomial Theorem. Thus, M′(t) = n(pe^t + 1 − p)^{n−1} pe^t, so M′(0) = E[X] = np.24 Then, in a similar
process, one can show that M′′(0) = n(n − 1)p² + np, so Var(X) = M′′(0) − M′(0)² = np(1 − p). (
If X and Y are independent random variables, then MX+Y(t) = E[e^{t(X+Y)}] = E[e^{tX}]E[e^{tY}] = MX(t)MY(t),
and conversely, if the joint MGF factors (see below), then the two variables are independent. Note that because of the exponential,
the sum becomes a product. Additionally, MX(t) = MY(t) iff X and Y have the same distribution.
One can generalize to joint MGFs: if X1, . . . , Xn are random variables, then their joint MGF is M(t1, . . . , tn) =
E[e^{t1 X1 + ··· + tn Xn}]. Then, MXi(t) = M(0, . . . , 0, t, 0, . . . , 0), where the t is in the ith position. Then, X1, . . . , Xn
are independent iff M(t1, . . . , tn) = Π_{i=1}^n MXi(ti), and the proof is the same as in the n = 2 case.
i=1
If X ∼ Poi(λ), then
M(t) = E[e^{tX}] = Σ_{n=0}^∞ e^{tn} e^{−λ} λ^n/n! = e^{λ(e^t − 1)},
so after a bunch of somewhat ugly math, M′(0) = λ and so on, as we already saw, but derived in a new way. This
is significant because if X ∼ Poi(λ1) and Y ∼ Poi(λ2) are independent, this offers a simple proof that
X + Y ∼ Poi(λ1 + λ2).
Recall that if X ∼ N(µ1, σ1²), then MX(t) = e^{σ1²t²/2 + µ1 t}.
Then, the first derivative is MX′(t) = (µ1 + tσ1²)MX(t), so MX′(0) = µ1, and after a bit more math, MX′′(0) = µ1² + σ1².
This makes sense. As for why it's actually important, suppose Y ∼ N(µ2, σ2²) is independent of X.25 Then, Y
has the same MGF with indices changed, but one can calculate the convolution:
MX(t)MY(t) = e^{σ1²t²/2 + µ1 t} e^{σ2²t²/2 + µ2 t} = e^{(σ1² + σ2²)t²/2 + (µ1 + µ2)t} = MX+Y(t),
so the normal distribution is additive: X + Y ∼ N(µ1 + µ2, σ1² + σ2²). This has other meanings beyond the scope
of this class, such as uses in Fourier analysis.
Here’s another question: if V = X + Y and W = X − Y , where X, Y ∼ N(µ, σ 2 ) are independent, are V
and W independent? Take the joint MGF of V and W :
h i h i
M (t1 , t2 ) = E et1 V et2 W = E et1 (X+Y ) et1 (X−Y ) = E e(t1 +t2 )X e(t1 −t2 )Y
h i h i
= E e(t1 +t2 )X E e(t1 −t2 )Y because X and Y are independent.
σ 2 (t1 +t2 )2 σ 2 (t1 −t2 )2
= eµ(t1 +t2 )+ eµ(t1 −t2 )+
2 2
2 2 2 2
= e2µt1 +2σ t1 /2 e2σ t2 /2 = E et1 V E et2 B ,
where the last step can be seen by writing down the distributions for V and W . Notice that this result doesn’t
hold when X and Y aren’t identically distributed, since the like terms can’t be gathered.
In many cases, the true form of a probability distribution encountered in the real world isn’t obvious. For
example, this class’ midterm didn’t appear to be normally distributed. However, certain information (e.g. mean,
sample variance) can be computed anyways, and certain properties can be seen by looking at the application of
the problem (here, midterm scores are known to be nonnegative).
Proposition 18.1 (Markov’s26 Inequality). Suppose X is a nonnegative random variable. Then, for any a > 0,
P (X ≥ a) ≤ E[X]/a.
Proof. Let I be an indicator for the event that X ≥ a. Since X ≥ 0, then I ≤ X/a: if X < a, then I = 0 ≤ X/a,
and if X ≥ a, then X/a ≥ 1 = I. Then, take expectations: E[I] = P (X ≥ a) ≤ E[X/a] = E[X]/a, since
constants factor out of expectation.
Applying this to the midterm, let X be a score on the midterm. Then, the sample mean is X̄ = 95.4 ≈ E[X],
so P(X ≥ 110) ≤ E[X]/110 = 95.4/110 ≈ 0.8673. This says that at most 86.73% of the class scored at least
110. Since only 27.83% of the class actually did, it's clear that Markov's inequality is a very loose bound, but that
makes sense because the only thing it uses is the mean.
Proposition 18.2 (Chebyshev’s27 Inequality). Suppose X is a random variable such that E[X] = µ and
Var(X) = σ 2 . Then, P (|X − µ| ≥ k) ≤ σ 2 /k 2 for all k > 0.
Proof. Since (X − µ)² is a nonnegative random variable, apply Markov's inequality with a = k², so
P((X − µ)² ≥ k²) ≤ E[(X − µ)²]/k² = σ²/k². However, (X − µ)² ≥ k² iff |X − µ| ≥ k, so the inequality
follows.
This is what is known as a concentration inequality, since it puts bounds on how spread out the data can be
around the mean. Using this for the midterm, the sample mean is X̄ = 95.4 ≈ E[X], and the sample variance is
S² = 400.40 ≈ σ², so P(|X − 95.4| ≥ 22) ≤ σ²/22² ≈ 0.8273, so at most 82.73% of the class was more than 22
points away from the mean. This can also be used to calculate how much of the class is within a distance of
the mean: at least 17.27% of the class is within 22 points of the mean. Once again, this is very loose: about
83% of the class was actually within 22 points of the mean. Chebyshev's inequality is most useful as a theoretical tool.
There’s another formulation called the one-sided Chebyshev inequality. In its simplest form, suppose that X
is a random variable with E[X] = 0 and Var(X) = σ 2 . Then, P (X ≥ a) ≤ σ 2 /(σ 2 + a2 ) for any a > 0. This
19. Laws of Large Numbers and the Central Limit Theorem: 5/13/13
Proposition 19.1 (Weak Law of Large Numbers). Suppose X1, X2, . . . are independently and identically
distributed random variables with a distribution F, such that E[Xi] = µ and Var(Xi) = σ². Let X̄ = Σ_{i=1}^n Xi/n.
Then, for any ε > 0, lim_{n→∞} P(|X̄ − µ| ≥ ε) = 0.
Proof. Since E[X̄] = µ and Var(X̄) = σ²/n, then by Chebyshev's inequality, P(|X̄ − µ| ≥ ε) ≤ σ²/(nε²), so
as n → ∞, this probability goes to zero.
28. Herman Chernoff isn't Russian, and is still a Professor Emeritus at Harvard and MIT. He did discover Chernoff's Bound. Since he's still
around, someone in 109 last quarter actually emailed him and asked him if he was a Charlie Sheen fan. He was.
29. Johan Ludwig William Valdemar Jensen was a Danish mathematician who was responsible for the inequality named after him.
Proposition 19.2 (Strong Law of Large Numbers). With X1, X2, . . . and X̄ as before,
P( lim_{n→∞} (X1 + X2 + · · · + Xn)/n = µ ) = 1.30
Notice that the Strong Law implies the Weak Law, but not vice versa: the Strong Law says that for any ε > 0,
there are only a finite number of values n such that |X − µ| ≥ ε holds. The proof of the Strong Law isn’t all
that easy, and has been omitted.
Consider some set of repeated trials of an experiment, and let E be some outcome. Then, let Xi be an
indicator for trial i resulting in E. Then, the Strong Law implies that the running average X̄ converges to E[Xi] = P(E), which justifies more
rigorously the fact shown in the first week of class, that P(E) = lim_{n→∞} n(E)/n. These laws are
mostly philosophical, but they provide a mathematical grounding for that definition of probability.
One misuse of the laws of large numbers is to justify the gambler's fallacy, in which after multiple losses
of some sort, one expects a win. This is incorrect, because the trials are independent (and the distribution is
memoryless anyways), and the averaging out only happens as n → ∞, which is not something that fits in a
lifetime.
The following result is one of the most useful in probability theory:
Theorem 19.3 (Central Limit Theorem). Let X1, X2, . . . be a set of independently and identically distributed
random variables with E[Xi] = µ and Var(Xi) = σ². Then, as n → ∞, the distribution of
(X1 + X2 + · · · + Xn − nµ)/(σ√n)
converges to N(0, 1).
There is an equivalent formulation: if X̄ = Σ_{i=1}^n Xi/n, then Theorem 19.3 states that X̄ ∼ N(µ, σ²/n) as
n → ∞: if Z = (X̄ − µ)/√(σ²/n), then Z ∼ N(0, 1), but also
Z = (Σ_{i=1}^n Xi/n − µ)/√(σ²/n) = n(Σ_{i=1}^n Xi/n − µ)/(n√(σ²/n)) = (X1 + X2 + · · · + Xn − nµ)/(σ√n).
This means that if one takes a large number of sample means of any distribution, they form a normal distribution.
This is why the normal distribution appears so often in the real world, and allows things such as election
polling (a reasonable sample of people can lead to a probability that the whole group will vote one way). The
distribution can be discontinuous, skewed, or anything else horrible. And this is why this theorem is so good.
However, polling can be misused: lots of election polls end up sampling nonrepresentative populations and
wildly mis-predicting the election. Similarly, McDonalds rolled out the McRib, a flop, after successful testing in
South Carolina, which clearly represents the rest of the nation in terms of barbecue.
Example 19.4. Applying this to the midterm, there are 230 Xi , and E[Xi ] = 95.4 and Var(Xi ) = 400.40. Thus,
one can create ten disjoint samples Yi and their sample means Y i . Then, using the central limit theorem, these
should all be turned into Z ∼ N(0, 1), which ended up being really close. (
Example 19.5. Suppose one has an algorithm which has a mean running time of µ seconds and a variance of
σ 2 = 4. Then, the algorithm can be run repeatedly (which makes for independent, identically distributed trials).
Then, how many trials are needed in order to estimate µ ± 0.5 with 95% certainty? By Theorem 19.3, if Xi is
the running time of the ith run, then
Zn = (Σ_{i=1}^n Xi − nµ)/(σ√n) = (Σ_{i=1}^n Xi − nµ)/(2√n) ∼ N(0, 1).
Thus,
P(−0.5 ≤ Σ_{i=1}^n Xi/n − µ ≤ 0.5) = P(−0.5√n/2 ≤ (Σ_{i=1}^n Xi − nµ)/(2√n) ≤ 0.5√n/2) = P(−√n/4 ≤ Zn ≤ √n/4)
                                  = Φ(√n/4) − Φ(−√n/4) = 2Φ(√n/4) − 1 ≈ 0.95,
which means that (calculating Φ^{−1} with the table of normals) √n/4 ≈ 1.96, or n = ⌈(7.84)²⌉ = 62.
30. Note that one still needs the probabilistic statement; things with probability zero can still happen, in some sense, but that requires a
digression into measure theory.
Chebyshev’s inequality provides a worse bound: µs = µ but
n
! n
nσ 2
2
X Xi X Xi 4
σs = Var = Var = 2 = ,
i=1
n i=1
n n n
so !
n
X Xi 4/n
P − t ≥ 0.5 ≤ ,
n 0.52
i=1
which implies that n ≥ 320. This is a much worse bound. (
Example 19.6. Suppose the number of visitors to a web site in a given minute is X ∼ Poi(100), and the server
crashes if there are at least 120 requests in a given minute. Then, one can use the CLT to break this into lots of
small distributions: Poi(100) is the sum of n independent Poi(100/n) random variables, which allows one
to use a normal distribution to approximate a Poisson one. (
Example 19.7. Suppose someone rolls ten 6-sided dice, giving values X1, . . . , X10. Let X be the total value of all
of the dice. What is the probability that X ≤ 25 or X ≥ 45? Using the Central Limit Theorem, µ = E[Xi] = 3.5
and σ² = Var(Xi) = 35/12. Then, approximating with a normal distribution, one shows that there is about
a 0.08 probability that this will happen. Note that Chebyshev would only say that there is at most a 0.292
probability of it happening. (
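A simulation sketch of the dice question, to compare against the normal approximation (trial count arbitrary):

#include <cstdio>
#include <cstdlib>

// Estimate P(X <= 25 or X >= 45) where X is the sum of ten fair 6-sided dice.
int main() {
    const int trials = 1000000;
    int extreme = 0;
    srand(109);
    for (int t = 0; t < trials; t++) {
        int sum = 0;
        for (int i = 0; i < 10; i++) sum += 1 + rand() % 6;   // one die roll
        if (sum <= 25 || sum >= 45) extreme++;
    }
    printf("Estimated probability: %f\n", (double)extreme / trials);
    return 0;
}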
Since λ = Var(Xi) as well, one has another estimate of λ which isn't always the same:
λ = E[Xi²] − E[Xi]² ≈ m̂2 − m̂1² = (1/n) Σ_{i=1}^n Xi² − X̄² = λ̂′.
In general, the estimator of the lowest order or the one that is easiest to compute is used, and they tend to be
the same. (
Example 20.4. If X1, . . . , Xn are IID and Xi ∼ N(µ, σ²), then µ ≈ µ̂ = Σ Xi/n, but the estimated variance is
not equal to the sample variance: one is biased, and the other isn't. If Xi ∼ Uni(α, β), one can obtain
similar estimates of the mean µ̂ and variance σ̂², which leads to a system of equations in α and β: α̂ = X̄ − σ̂√3 and
β̂ = X̄ + σ̂√3. Notice that there could be data outside of [α̂, β̂], so be careful. (
Here, log denotes the natural log, though the base doesn't matter for maximizing. This works for maximizing over
θ because log is monotone, so x ≤ y iff log x ≤ log y for x, y > 0. Thus, arg max_θ L(θ) = arg max_θ LL(θ) =
arg max_θ c·LL(θ) for any positive constant c, and the maximizer can be found with a bit of calculus:
• Find the formula for LL(θ), differentiate it with respect to each parameter (computing ∂LL(θ)/∂θ), and set the derivatives to zero.
• This gives a set of equations, which can be solved simultaneously.
• Then, it's necessary to check that the critical points aren't minima or saddle points by checking nearby
values.
31. Bias is a loaded word in the outside world, but depending on the application, having some bias in an estimator may lead to more
accurate results.
32. The function max(f(x)) returns the maximum of f(x), but arg max(f(x)) returns the value x such that f(x) is maximized.
Example 21.3. Suppose X1, . . . , Xn are IID random variables such that Xi ∼ Ber(p). Then, the PMF can be
rewritten in the form f(xi | p) = p^{xi}(1 − p)^{1−xi}, where xi = 0 or xi = 1. This has the same meaning, but is
now differentiable. Thus, the likelihood function is
L(θ) = Π_{i=1}^n p^{Xi}(1 − p)^{1−Xi},
and maximizing its logarithm gives pMLE = Σ_{i=1}^n Xi/n. (
Example 21.4. Similarly, if Xi ∼ Poi(λ), the likelihood is L(θ) = Π_{i=1}^n e^{−λ}λ^{Xi}/Xi!.
This looks alarming until you realize we only have to differentiate with respect to λ, showing that λMLE =
Σ_{i=1}^n Xi/n. (
Example 21.5. If Xi ∼ N(µ, σ²), then one obtains the log-likelihood
LL(θ) = Σ_{i=1}^n log( e^{−(Xi − µ)²/(2σ²)}/(σ√(2π)) ) = Σ_{i=1}^n ( −log(σ√(2π)) − (Xi − µ)²/(2σ²) ).
This seems counterintuitive, but there’s no reason that it shouldn’t be the case. (
Example 21.6. Now, suppose Xi ∼ Uni(α, β). Then, the PDF is f(xi | α, β) = 1/(β − α) if α ≤ xi ≤ β and is
zero otherwise, so the overall likelihood is
L(θ) = 1/(β − α)^n if α ≤ x1, . . . , xn ≤ β, and 0 otherwise.
This can be solved with Lagrange multipliers, which is complicated and unfortunate, so turn instead to intuition.
The goal is to make the interval as small as possible while guaranteeing that all of the values are still within
the region, which means that αMLE = min(x1 , . . . , xn ) and βMLE = max(x1 , . . . , xn ). (
These do interesting things in small sample sizes. In many cases, the sample mean is obtained, which is pretty
nice. But the normal MLE for the variance is biased and underestimates for small n, and the uniform MLE is also
biased and behaves badly for small samples (e.g. if n = 1, then αMLE = βMLE, which isn't necessarily right).
The MLE just fits the distribution as closely as possible to what has already been seen, but this doesn't always generalize well, and
the estimated variance might be too small.
However, the MLE is frequently used, because it has lots of nice properties:
• It's consistent: lim_{n→∞} P(|θMLE − θ| < ε) = 1 for any ε > 0.
• Though it is potentially biased, the bias disappears asymptotically as n increases.
• For large samples, the MLE has the smallest variance of all good estimators.
Thus, it is frequently used in practice where the amount of data is large relative to the number of parameters. However,
there is a problem if the variables are dependent: joint distributions over n variables can have on the order of 2^n parameters.
This can cause issues in modeling.
Example 21.7. Suppose Y1, . . . , Yn are independently and identically distributed random variables such that
Yi ∼ Multinomial(p1, . . . , pm), where Σ_{i=1}^m pi = 1, and let Xi be the number of trials with outcome i, so that
Σ_{i=1}^m Xi = n. Thus, the PMF is
f(X1, . . . , Xm | p1, . . . , pm) = n! Π_{i=1}^m (pi^{xi}/xi!).
where g(θ) is the prior distribution. Notice that there is a significant philosophical distinction, but not as much
of a mathematical one. Again, it can be convenient to use logarithms:
θMAP = arg max_θ ( log g(θ) + Σ_{i=1}^n log f(Xi | θ) ).
Recall the beta distribution with parameters a and b. These are called hyperparameters, because they are
parameters that determine the parameters of the distribution. For example, a fair coin is given a prior Beta(x, x)
distribution, with higher values of x correlated with a lower variance.
In order to do this with a multinomial distribution, introduce the Dirichlet distribution: the way that the
beta distribution is conjugate to the binomial (or Bernoulli) distribution is the way the Dirichlet distribution is
conjugate to the multinomial distribution. This distribution is written Dirichlet(a1, . . . , am), with PDF
f(x1, . . . , xm) = (1/B(a1, . . . , am)) Π_{i=1}^m xi^{ai − 1}.
The prior (imaginary trials) represents seeing Σ_{i=1}^m ai − m imaginary trials, such that ai − 1 are of outcome
i. Then, after observing n1 + · · · + nm new trials (such that ni had outcome i), the new distribution is
Dirichlet(a1 + n1, . . . , am + nm). Here is an animation of the logarithmic density of Dirichlet(a, a, a).
The Dirichlet model allows one to work around the multinomial problem presented in the previous lecture:
if one rolls a die but never sees a 3, one could do an MAP estimate in which some number of prior trials did
give a 3. The MAP estimate for Dirichlet(k + 1, . . . , k + 1) is pi = (Xi + k)/(n + mk) for this experiment. Now,
the probability of rolling a 3 is nonzero, which is nice. This method with k = 1 is also known as Laplace’s law.
Notice that this is a biased estimate.
33. Frequentist and Bayesian viewpoints are the closest thing probability has to a religion. The professor, for example, is a Bayesian, and
the author of the textbook is very frequentist.
The conjugate distribution for the Poisson and exponential distributions is Gamma(α, λ), which represents a
prior of α imaginary events in a time period of λ. To update this for a posterior distribution, one observes n
events in k time periods, leading to a posterior distribution of Gamma(α + n, λ + k).
Finally, conjugate to the normal distribution there are several possibilities, depending on which of the mean and
variance are known. For example, if σ² is known but µ isn't, one can place a Normal(µ0, σ0²) prior on µ, i.e. one
believes the true µ ∼ N(µ0, σ0²). After observing n data points x1, . . . , xn, the posterior distribution is
N( (µ0/σ0² + Σ_{i=1}^n xi/σ²) / (1/σ0² + n/σ²), (1/σ0² + n/σ²)^{−1} ).
Generally, one doesn't know the variance without other information, but it could be estimated without knowing
the mean, making this useful.
so the table now has linear space, with only the one probabilistic assumption!
Example 23.3. Let X1 be an indicator variable that indicates that someone likes Star Wars, and X2 is that a
person likes Harry Potter. Then, Y is an indicator variable for whether someone likes the Lord of the Rings.
This is a standard marketing application, for finding recommended products, and previous purchases can be used
to make a model. Thus, we have the joint and marginal probability tables.
Now, suppose a person comes along with X1 = 1 but X2 = 0. Then, under the Naïve Bayes assumption,
Ŷ = arg max_Y P̂(X1, X2 | Y)P̂(Y) = arg max_Y P̂(X1 | Y)P̂(X2 | Y)P̂(Y). The arg max can be evaluated by
plugging in Y = 0 and Y = 1 and picking whichever makes the result larger; a sketch is given below. (
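A sketch of the arg max computation; the probability estimates below are made-up placeholders, standing in for the tables from lecture:

#include <cstdio>

// Naive Bayes prediction for a single observation (X1 = 1, X2 = 0).
// All probability estimates are invented placeholders; in practice they would
// come from the training-data tables described above.
int main() {
    double pY[2]        = {0.6, 0.4};   // P(Y = 0), P(Y = 1)
    double pX1givenY[2] = {0.3, 0.7};   // P(X1 = 1 | Y = 0), P(X1 = 1 | Y = 1)
    double pX2givenY[2] = {0.4, 0.8};   // P(X2 = 1 | Y = 0), P(X2 = 1 | Y = 1)

    int x1 = 1, x2 = 0;                 // the observed feature values
    int bestY = 0;
    double bestScore = -1.0;
    for (int y = 0; y <= 1; y++) {
        // Under the Naive Bayes assumption the features factor given Y.
        double score = pY[y]
                     * (x1 ? pX1givenY[y] : 1.0 - pX1givenY[y])
                     * (x2 ? pX2givenY[y] : 1.0 - pX2givenY[y]);
        if (score > bestScore) { bestScore = score; bestY = y; }
    }
    printf("Predicted Y = %d\n", bestY);
    return 0;
}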
The number of parameters can grow very large: in one model of spam classification, m ≈ 100000, and there
are m variables corresponding to indicators for whether a given word (out of all English words) appeared in an
email. Since m is extremely large, then the Naı̈ve Bayes assumption is necessary. Additionally, since there are
lots of words that could be used in spam but that aren’t in the sample dataset, the Laplace estimate is probably
a good idea. Then, the model is evaluated on separate testing data (you can't test on the training data; that
would be cheating).
There are various criteria for effectiveness, such as precision (of the messages classified as spam, the fraction that
really are spam) or recall (of the messages that really are spam, the fraction that were classified as spam).
This is the MLE formula, and the formula for the Laplacian is similar. In some sense, this classifier models
the joint probability P (X, Y ). However, the goal is to find P (Y | X) (since X was observed, and Y is to be
predicted), so the goal is to find ŷ = arg maxy P (Y | X).
One can use a technique called logistic regression to model P(Y | X), which can be called the conditional
likelihood. This uses a function called the logistic function (also the sigmoid or "squashed S" function): P(Y = 1 |
X) = 1/(1 + e^{−z}),36 where z = α + Σ_{j=1}^m βj Xj. In some sense, one takes this nice linear function and runs it
through the logistic function. The parameters that can change are α and the βj, and often one uses the simplifying trick
of letting X0 = 1 and β0 = α, so that the sum becomes z = Σ_{j=0}^m βj Xj.
Since Y = 0 or Y = 1, then P(Y = 0 | X) = e^{−z}/(1 + e^{−z}), with z as before. This still doesn't illuminate
why the function was chosen, but calculate the log-odds:
log( P(Y = 1 | X)/P(Y = 0 | X) ) = log(1/e^{−z}) = z.
This functional form guarantees that the log-odds is a linear function of the Xj, which is pretty convenient.
Now, the log-conditional likelihood of the data is
LL(θ) = Σ_{i=1}^n ( yi log P(Y = 1 | X) + (1 − yi) log P(Y = 0 | X) ).
This sum represents dividing the data into two cases, those with yi = 0 and yi = 1. Given this, one might want
to know the βj that create the MLE, but, unfortunately, there is no analytic derivation of it.
Figure 8. The standard logistic function.
However, the log-conditional likelihood function is concave, so it has a single global maximum. The maximum
can be approached iteratively by repeatedly computing the gradient, somewhat like Newton's method:
∂LL(θ)/∂βj = xj(y − 1/(1 + e^{−z})) = xj(y − P(Y = 1 | X)).
Thus, the βj can be iteratively updated using the formula
βj^new = βj^old + c·xj(y − 1/(1 + e^{−z})),
where z = Σ_{j=0}^m βj^old Xj and c is a constant representing the step size.37 Then, this is done repeatedly until a
given stopping criterion is met: one may just run it a fixed number of times or until some error bound is met; a sketch is given below.
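A sketch of the iterative update just described, run on a made-up toy dataset; the step size, iteration count, and data are arbitrary choices, not from lecture:

#include <cstdio>
#include <cmath>

// Logistic-regression training by repeated stochastic gradient ascent on the
// log-conditional likelihood; beta[0] plays the role of alpha via the X0 = 1 trick.
int main() {
    const int m = 2;                 // number of real features
    const int n = 4;                 // number of training examples
    // Toy dataset: each row is (X1, X2) with label y; roughly, y = 1 when X1 is large.
    double X[n][m] = {{0.1, 1.0}, {0.4, 0.2}, {0.8, 0.7}, {0.9, 0.1}};
    int    y[n]    = {0, 0, 1, 1};
    double beta[m + 1] = {0.0, 0.0, 0.0};   // beta[0] = alpha
    const double c = 0.5;                   // step size
    for (int iter = 0; iter < 10000; iter++) {
        int i = iter % n;                   // cycle through the examples
        double z = beta[0];
        for (int j = 0; j < m; j++) z += beta[j + 1] * X[i][j];
        double p = 1.0 / (1.0 + exp(-z));   // P(Y = 1 | X) for this example
        double err = y[i] - p;
        beta[0] += c * err;                 // gradient step for alpha (X0 = 1)
        for (int j = 0; j < m; j++) beta[j + 1] += c * X[i][j] * err;
    }
    printf("alpha = %.3f, beta1 = %.3f, beta2 = %.3f\n", beta[0], beta[1], beta[2]);
    return 0;
}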
Here’s the derivation:
LLi (θ) = yi log P (Y = 1 | X) + (1 − yi )P (Y = 0 | X)
P (Y = 1 | X)
= yi log + log P (Y = 0 | X)
P (Y = 0 | X)
m
X
= yi βj Xj + log e−z − log(1 + e−z )
j=0
m
X m
X
= yi βj Xj − βj Xj − log(1 + e−z ).
j=0 j=0
One important idea is that each node is conditionally independent of its non-descendants, given the parents. For
example, given gender G, palm size S and life expectancy L are independent: P (S, L | G) = P (S | G)P (L | G).
With each node one can associate a conditional probability table (CPT) of the probability of it taking on some
value given the parents.
Thus, it is possible to obtain all of the joint probabilities given the CPTs: the conditional independence encoded by the
graph modularizes the calculation, yielding
P(X1, . . . , Xm) = Π_{i=1}^m P(Xi | parents(Xi)).
i=1
The CPT is useful because the number of parameters in a full table is exponential in the number of nodes, but
with this setup, it’s linear, and a lot more of the parameters can be determined locally from each other.
The Naı̈ve Bayes model can be incorporated into this: a parent Y gives some probability to its (conditionally
independent) children X1 , . . . , Xm . Then, the network structure encodes the assumption
m
Y
~ | Y ) = P (X1 ∩ · · · ∩ Xm | Y ) =
P (X P (Xi | Y ).
i=1
Then, things such as the linearity make more sense. Additionally, if the Naı̈ve Bayesian assumption is
violated, then the dependencies between the Xi can be added as arcs, yielding a better model that is still
a Bayesian network. But Bayesian networks are more general than the specific prediction problem: instead
of observing all of the Xi , only some subset E1 , . . . , Ek is observed. Then, the goal is to determine the
probability of some set Y1 , . . . , Yc of unobserved variables, as opposed to a single Y . Formally, the goal is to
find P (Y1 , . . . , Yc | E1 , . . . , Ek ).
Example 25.1. Consider the following Bayesian network representing conditions around doing well on the CS
109 final:
38. Does this mean causality or just direct correlation? This is a philosophical problem, and doesn't have a clear answer, given that many
physicists claim causality can't really exist.
Figure 10. Will the presence of this diagram affect how you study for the final?
Here, the professor can know B, A, and R, but doesn't know E. One can calculate P(A = T | B = T, R = T)
by summing over the unseen variables:39
P(A = T | B = T, R = T) = P(A = T, B = T, R = T)/P(B = T, R = T)
                        = ( Σ_{E=T,F} P(A = T, B = T, R = T, E) ) / ( Σ_{E=T,F} Σ_{A=T,F} P(B = T, R = T, E, A) ).
Notice how many fewer parameters are necessary for this, and the joint probability decomposes as P (A, B, E, R) =
P (E)P (B | E)P (R)P (A | E, R): the product of the probabilities given the parents. (
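To make the summation concrete, here is a sketch of the enumeration for this network; the CPT entries are invented placeholders (Figure 10's numbers aren't reproduced in the text), and only the factorization P(A, B, E, R) = P(E)P(B | E)P(R)P(A | E, R) from above is assumed:

#include <cstdio>

// Invented CPT values (placeholders; the real ones are in Figure 10).
double pE = 0.3;                          // P(E = T)
double pR = 0.5;                          // P(R = T)
double pB_given_E[2]     = {0.2, 0.8};    // P(B = T | E = F), P(B = T | E = T)
double pA_given_ER[2][2] = {{0.1, 0.5},   // P(A = T | E = F, R = F), P(A = T | E = F, R = T)
                            {0.6, 0.9}};  // P(A = T | E = T, R = F), P(A = T | E = T, R = T)

// Joint probability from the network factorization:
// P(A, B, E, R) = P(E) P(B | E) P(R) P(A | E, R).
double joint(int a, int b, int e, int r) {
    double p = (e ? pE : 1 - pE) * (r ? pR : 1 - pR);
    p *= b ? pB_given_E[e] : 1 - pB_given_E[e];
    p *= a ? pA_given_ER[e][r] : 1 - pA_given_ER[e][r];
    return p;
}

int main() {
    // P(A = T | B = T, R = T): sum out E in the numerator, E and A in the denominator.
    double num = 0, den = 0;
    for (int e = 0; e <= 1; e++) {
        num += joint(1, 1, e, 1);
        for (int a = 0; a <= 1; a++) den += joint(a, 1, e, 1);
    }
    printf("P(A = T | B = T, R = T) = %f\n", num / den);
    return 0;
}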
Another useful concept is a probability tree, which is a tree that models the outcome of a probabilistic event.
[Probability tree for drawing one ball from Bag 1 and then one from Bag 2: Bag 1 contains 4 white and 3 black balls.
If the first draw is black (probability 3/7), the second draw comes from a bag with 3 white and 6 black balls;
if the first draw is white (probability 4/7), it comes from a bag with 4 white and 5 black balls. The leaves give
P(B1 ∩ B2) = (3/7)(6/9), P(B1 ∩ W2) = (3/7)(3/9), P(W1 ∩ B2) = (4/7)(5/9), and P(W1 ∩ W2) = (4/7)(4/9).]
Probability trees are traditionally written sideways, and can be thought of as probabilistic flowcharts. This is
useful for modelling decisions.
For example, consider a game where one can receive $10 for not playing, or $20 with probability 0.5 when
playing, and $0 otherwise. If the numbers are multiplied by 1000, people much more strongly prefer not to play
the game. This is an example of utility, as seen before, which means that the leaves of the decision tree are
utilities rather than dollar values. This can be useful for making decisions about life expectancy, etc., and are
often difficult choices.
In a similar game, the expected monetary value is the expected dollar value of winning, and the risk premium
is the amount that one would be willing to give up to eliminate risk: certainty has a utility. This is exactly why
39. This is the simple algorithm, but not the most efficient one. That's a discussion for CS 228.
insurance companies exist: people will pay to mitigate the possibility of a big loss, even when what they pay exceeds the
expected loss.
Utility is nonlinear: suppose one could gain $5 with probability 0.5 and otherwise get nothing, or just receive
$1 for not playing. Most people would play, but if the dollar values were scaled by a million, most people
wouldn't. In some sense, the first $10000000 is life-changing, but the next $40000000 isn't. Thus, the shape of the utility
curve isn't linear, and depends on one's preference for risk. Most people are risk-averse, so their utility functions
are concave down. Some people are risk-preferring, so their utility curves are concave up. A typical utility curve
looks something like U(x) = 1 − e^{−x/R}. Here, R is one's risk tolerance, and is roughly the highest value of Y for
which one would play a game in which one wins $Y with probability 0.5 and loses $Y/2 with probability 0.5, and
receives nothing for not playing.
Then, one can calculate how rational or irrational one is: consider a game in which one receives $1000000
for not playing, and when playing, receives $1000000 with probability 89%, $5000000 with probability 10%,
and nothing with the remaining 1%. Similarly, one could choose to play a game where $1000000 is won with
probability 0.11 (and nothing happens otherwise), or $5000000 is won with probability 0.1 and nothing is won
otherwise.
Often, people take the sure $1000000 in the first game but the $5000000 gamble in the second, which is inconsistent with any utility function. This
is known as the Allais paradox: the two games differ only by an 89% chance of $1000000 that is common to both options of the first game.
A “micromort” is a one-in-a-million chance of death. This leads to an interesting question: how much would
one want to be paid to take on the risk of a micromort? How much would you pay to avoid one micromort?
People answer differently to these questions. Though this sounds abstract, it comes up in everyday risks that
people need to take every day when making decisions. Understanding this makes these decisions a little more
insightful.
Finally, consider an example from sports, which has lots of applications of probability. Two batters have
their batting averages over two years. If player B has a better batting average in each year, he must be better. . .
but what if in the first year A hits 0.250 and B hits 0.253, and in the second year A hits 0.314 and B hits 0.321,
yet their combined averages are 0.310 for A and 0.270 for B? Oops; is A better? This is known as Simpson's paradox,
and actually happened to Derek Jeter and David Justice in 1995 and 1996. This has applications to marketing
(people may prefer different products) or effectiveness of medicine. One might have one medicine favored when
the data is broken down by gender, but the other when the total data is considered. This can happen because
one treatment has significantly fewer trials than the other in one case, but not the other. Unsurprisingly, this
comes up all the time in the real world.
40. A good example of this is the problems on our problem sets. Here Dr. Sahami claims that it was a good thing we didn't do this to
check our problem sets, though I definitely did. . .
Of the 50 cards that remain, there are four cards of each valid rank, and there are i − 1 valid (i.e. winning) ranks.
Then, when everything is plugged in, this ends up as about 0.2761.
There’s another way to solve the problem, which exploits the symmetry of the problem. Clearly, it’s necessary
for all three cards to have different ranks, which happens with probability (48/51)(44/50). Then, it is necessary
to have the third card in the middle of the first two, which is just a computation on the six permutations of
three cards: two of them have the third card in the middle, so the final answer is (48/51)(44/50)(1/3) ≈ 0.2761.
Much simpler.
Another example: in a group of 100 people, let X be the number of days that are the birthdays of exactly
three people in the group. What is E[X]?
When one sees expected value, the first step is to decompose the problem. Let Ai be the number of people
who have a birthday on day i, so that Ai ∼ Bin(100, 1/365). Then, p = P(Ai = 3) = (100 choose 3)(1/365)³(364/365)^97.
62