MIT Overview Basic Probability
Missed Expectations?
In the previous notes, we saw that the average value of a random quantity is captured
by the mathematical concept of the expectation of a random variable, and we calculated
expectations for several kinds of random variables. Now we will see two things that
make expectations so useful. First, they are often very easy to calculate due to the fact
that they obey linearity. Second, once you know what the expectation is, you can also get
some type of bound on the probability that you are far from the expectation —that is, you
can show that really weird things are not that likely to happen. How good a bound you
can get depends on what you know about your distribution, but don’t worry, even if you
know next to nothing, you can still say something relatively interesting.
1 Linearity of Expectation
Proof. Let T ::= R1 + R2 . The proof follows straightforwardly by rearranging terms from
the definition of E [T ].
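\begin{align*}
\mathrm{E}[R_1 + R_2] &::= \mathrm{E}[T] \\
&::= \sum_{s \in S} T(s) \cdot \Pr\{s\} \\
&= \sum_{s \in S} (R_1(s) + R_2(s)) \cdot \Pr\{s\} && \text{(Def. of $T$)} \\
&= \sum_{s \in S} R_1(s) \Pr\{s\} + \sum_{s \in S} R_2(s) \Pr\{s\} && \text{(rearranging terms)} \\
&= \mathrm{E}[R_1] + \mathrm{E}[R_2].
\end{align*}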
Similarly, we have
E [aR] = a E [R] .
The proof follows easily from the definition of expectation, and we omit it.
Combining Theorem 1.1 and Lemma 1.2, we conclude
Theorem 1.3 (Linearity of Expectation). For all random variables R1 , R2 and constants a1 , a2 ∈
R,
E [a1 R1 + a2 R2 ] = a1 E [R1 ] + a2 E [R2 ] .
In other words, expectation is a linear function. The rule and its proof extend directly to
cover more than two random variables:
The great thing about linearity of expectation is that no independence is required. This is
really useful, because dealing with independence is a pain, and we often need to work
with random variables that are not independent.
Notice that we did not have to assume that the two dice were independent. The expected
sum of two dice is 7, even if they are connected together!1
Proving that the expected sum is 7 with a tree diagram would be hard; there are 36 cases.
And if we did not assume that the dice were independent, the job would be a nightmare!
1 But each die must remain fair after the connection.
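As a quick check of the claim about connected dice, here is a minimal sketch that computes the expected sum exactly for three ways of coupling two fair dice. The helper name `expected_sum` and the particular couplings (independent, glued to show the same face, glued so the faces add to 7) are illustrative assumptions, not part of the notes. In every case each die on its own remains fair.

```python
from fractions import Fraction
from itertools import product

def expected_sum(joint):
    """Expected value of d1 + d2 under a joint distribution {(d1, d2): probability}."""
    return sum((d1 + d2) * p for (d1, d2), p in joint.items())

faces = range(1, 7)

# Independent fair dice: all 36 pairs are equally likely.
independent = {(a, b): Fraction(1, 36) for a, b in product(faces, faces)}
# Dice connected so that the second always shows the same face as the first.
same_face = {(a, a): Fraction(1, 6) for a in faces}
# Dice connected so that the two faces always add up to 7.
opposite = {(a, 7 - a): Fraction(1, 6) for a in faces}

for name, joint in [("independent", independent),
                    ("same face", same_face),
                    ("opposite", opposite)]:
    print(name, expected_sum(joint))
```

All three couplings give an expected sum of 7, even though the distributions of the sum are very different.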
Course Notes, Week 14: Missed Expectations? 3
Now we’re in trouble, because evaluating Pr {R = k} is a mess and we then need to substitute
this mess into a summation. Furthermore, to have any hope, we would need to fix
the probability of each permutation of the hats. For example, we might assume that all
permutations of hats are equally likely.
Now let’s try to use linearity of expectation. As before, let the random variable R be the
number of men that get their own hat. The trick is to express R as a sum of indicator
variables. In particular, let Ri be an indicator for the event that the ith man gets his own
hat. That is, Ri = 1 is the event that he gets his own hat, and Ri = 0 is the event that
he gets the wrong hat. The number of men that get their own hat is the sum of these
indicators:
R = R1 + R2 + · · · + Rn
These indicator variables are not mutually independent. For example, if n − 1 men all get
their own hats, then the last man is certain to receive his own hat. But, since we plan to
use linearity of expectation, we don’t have to worry about independence!
Let’s take the expected value of both sides of the equation above and apply linearity of
expectation:
E [R] = E [R1 + R2 + · · · + Rn ]
= E [R1 ] + E [R2 ] + · · · + E [Rn ]
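Since the Ri ’s are indicator variables, E [Ri ] = Pr {Ri = 1}, and since every man is as likely to
get one hat as another, this is just 1/n. Putting all this together, we have:
\[
\mathrm{E}[R] = \mathrm{E}[R_1] + \mathrm{E}[R_2] + \cdots + \mathrm{E}[R_n] = n \cdot \frac{1}{n} = 1.
\]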
Notice that we did not assume that all permutations of hats are equally likely or even that
all permutations are possible. We only needed to know that each man received his own
hat with probability 1/n.
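The following sketch checks this for two different ways of returning hats. Both function names are hypothetical, and the two hat-return models (all n! permutations equally likely, and only the n cyclic rotations) are assumptions chosen to illustrate that the answer does not depend on which permutations are possible, as long as each man gets his own hat with probability 1/n.

```python
from fractions import Fraction
from itertools import permutations

def expected_own_hats_uniform(n):
    """Exact E[R] when all n! permutations of the hats are equally likely."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for man, hat in enumerate(perm) if man == hat)
                for perm in perms)
    return Fraction(total, len(perms))

def expected_own_hats_rotation(n):
    """Exact E[R] when only the n cyclic rotations occur, each with probability 1/n."""
    total = 0
    for shift in range(n):
        # Man i receives hat (i + shift) mod n.
        total += sum(1 for man in range(n) if (man + shift) % n == man)
    return Fraction(total, n)

print(expected_own_hats_uniform(5), expected_own_hats_rotation(7))
```

Under the rotation model most permutations never occur, yet the expected number of men who get their own hat is still 1.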
Suppose that we independently flip n biased coins, each with probability p of coming up
heads. What is the expected number that come up heads?
Let Hn,p be the number of heads after the flips. Then Hn,p has the binomial distribution
with parameters n and p. Now let Ik be the indicator for the kth coin coming up heads.
Since Ik is an indicator variable with probability p of being 1, we know that
E [Ik ] = p.
But
\[
H_{n,p} = \sum_{k=1}^{n} I_k,
\]
so by linearity
\[
\mathrm{E}[H_{n,p}] = \mathrm{E}\Big[\sum_{k=1}^{n} I_k\Big] = \sum_{k=1}^{n} \mathrm{E}[I_k] = \sum_{k=1}^{n} p = pn.
\]
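For comparison, the expectation can also be computed the hard way, directly from the binomial probabilities. The sketch below does this with exact fractions; the function name `binomial_mean_from_pmf` and the parameter values are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binomial_mean_from_pmf(n, p):
    """Sum k * Pr{H = k} over k, using the binomial probability formula."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, Fraction(3, 10)
print(binomial_mean_from_pmf(n, p), n * p)
```

Both expressions agree, but the linearity argument reaches pn without touching the binomial formula at all.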
Every time I purchase a kid’s meal at Taco Bell, I am graciously presented with a miniature
“Racin’ Rocket” car together with a launching device which enables me to project my new
vehicle across any tabletop or smooth floor at high velocity. Truly, my delight knows no
bounds.
There are n different types of Racin’ Rocket car (blue, green, red, gray, etc.). The type of
car awarded to me each day by the kind woman at the Taco Bell register appears to be
selected uniformly and independently at random. What is the expected number of kid’s
meals that I must purchase in order to acquire at least one of each type of Racin’ Rocket
car?
The same mathematical question shows up in many guises: for example, what is the expected
number of people you must poll in order to find at least one person with each
possible birthday? Here, instead of collecting Racin’ Rocket cars, you’re collecting birthdays.
The general question is commonly called the coupon collector problem after yet
another interpretation.
Linearity of expectation is somewhat like induction and the pigeonhole principle; it’s a
simple idea that can be used in all sorts of ingenious ways. For example, we can use
linearity of expectation in a clever way to solve the coupon collector problem. Suppose
there are five different types of Racin’ Rocket, and I receive this sequence:
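\[
\underbrace{\text{blue}}_{X_0}\quad
\underbrace{\text{green}}_{X_1}\quad
\underbrace{\text{green}\ \ \text{red}}_{X_2}\quad
\underbrace{\text{blue}\ \ \text{orange}}_{X_3}\quad
\underbrace{\text{blue}\ \ \text{orange}\ \ \text{gray}}_{X_4}
\]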
The rule is that a segment ends whenever I get a new kind of car. For example, the middle
segment ends when I get a red car for the first time. In this way, we can break the problem
of collecting every type of car into stages. Then we can analyze each stage individually
and assemble the results using linearity of expectation.
Let’s return to the general case where I’m collecting n Racin’ Rockets. Let Xk be the length
of the kth segment. The total number of kid’s meals I must purchase to get all n Racin’
Rockets is the sum of the lengths of all these segments:
T = X0 + X1 + · · · + Xn−1
Now let’s focus our attention on Xk , the length of the kth segment. At the beginning of
segment k, I have k different types of car, and the segment ends when I acquire a new type.
When I own k types, each kid’s meal contains a type that I already have with probability
k/n. Therefore, each meal contains a new type of car with probability 1 − k/n = (n − k)/n.
Thus, the expected number of meals until I get a new kind of car is n/(n−k) by the “mean
time to failure” formula that we worked out last time. So we have:
\[
\mathrm{E}[X_k] = \frac{n}{n-k}
\]
Linearity of expectation, together with this observation, solves the coupon collector
problem:
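\begin{align*}
\mathrm{E}[T] &= \mathrm{E}[X_0 + X_1 + \cdots + X_{n-1}] \\
&= \mathrm{E}[X_0] + \mathrm{E}[X_1] + \cdots + \mathrm{E}[X_{n-1}] \\
&= \frac{n}{n-0} + \frac{n}{n-1} + \cdots + \frac{n}{3} + \frac{n}{2} + \frac{n}{1} \\
&= n\left(\frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{3} + \frac{1}{2} + \frac{1}{1}\right) \\
&= nH_n
\end{align*}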
The summation on the next-to-last line is the nth harmonic sum with the terms in reverse
order. As you may recall, this sum is denoted Hn and is approximately ln n.
Let’s use this general solution to answer some concrete questions. For example, the expected
number of die rolls required to see every number from 1 to 6 is:
6H6 = 14.7 . . .
And the expected number of people you must poll to find at least one person with each
possible birthday is:
365H365 = 2364.6 . . .
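These two values can be reproduced by summing the segment expectations n/(n − k) directly. The sketch below is one way to do that with exact fractions; the function name `expected_purchases` is our own.

```python
from fractions import Fraction

def expected_purchases(n):
    """E[T] = sum over k = 0..n-1 of n/(n-k), which equals n * H_n."""
    return sum(Fraction(n, n - k) for k in range(n))

print(float(expected_purchases(6)))    # die rolls needed to see all six faces
print(float(expected_purchases(365)))  # people polled to see every birthday
```

Using `Fraction` keeps the small case exact: for n = 6 the sum is exactly 147/10.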
3 Conditional Expectation
Just like event probabilities, expectations can be conditioned on some event.
In other words, it is the expected value of the variable R once we skew the distribution of
R to be conditioned on event A.
Example 3.2. Let D be the outcome of a roll of a fair die. What is E [D | D ≥ 4]?
\[
\sum_{i=1}^{6} i \cdot \Pr\{D = i \mid D \ge 4\}
= 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot \tfrac{1}{3} + 5 \cdot \tfrac{1}{3} + 6 \cdot \tfrac{1}{3}
= 5.
\]
It is easy to see that the rules for expectation will extend to conditional expectation. For
example, conditional expectation will also be linear.
Theorem 3.3. For any two random variables R1 , R2 , constants a1 , a2 ∈ R, and event A,
A real benefit of conditional expectation is the way it lets us divide complicated expectation
calculations into simpler cases.
Theorem 3.4 (Law of Total Expectation). If the sample space is the union of the pairwise
disjoint events A1 , A2 , . . . , then
\[
\mathrm{E}[R] = \sum_{i} \mathrm{E}[R \mid A_i] \Pr\{A_i\}.
\]
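Proof.
\begin{align*}
\mathrm{E}[R] &= \sum_{r} r \cdot \Pr\{R = r\} \\
&= \sum_{r} r \cdot \sum_{i} \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(Law of Total Probability)} \\
&= \sum_{r} \sum_{i} r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(distribute constant $r$)} \\
&= \sum_{i} \sum_{r} r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(exchange order of summation)} \\
&= \sum_{i} \Pr\{A_i\} \sum_{r} r \cdot \Pr\{R = r \mid A_i\} && \text{(factor constant $\Pr\{A_i\}$)} \\
&= \sum_{i} \Pr\{A_i\} \, \mathrm{E}[R \mid A_i]. && \text{(Def. 3.1)}
\end{align*}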
Example 3.5. Half the people in the world are male, half female. The expected height of a
randomly chosen male is 5' 11'', while the expected height of a randomly chosen female is
5' 5''. What is the expected height of a randomly chosen individual?
Let H(P ) be the height of the random person P . The events M ::=“P is male” and F ::=“P
is female” are a partition of the sample space. Then
\begin{align*}
\mathrm{E}[H] &= \mathrm{E}[H \mid M] \Pr\{M\} + \mathrm{E}[H \mid F] \Pr\{F\} \\
&= 5'\,11'' \cdot \frac{1}{2} + 5'\,5'' \cdot \frac{1}{2} \\
&= 5'\,8''.
\end{align*}
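The same calculation can be written as a small function that takes one (probability, conditional expectation) pair per case. This is only a sketch: the name `total_expectation` and the choice to measure heights in inches (5' 11'' = 71, 5' 5'' = 65) are our own conventions.

```python
from fractions import Fraction

def total_expectation(cases):
    """E[R] from a list of (Pr{A_i}, E[R | A_i]) pairs over a partition."""
    if sum(p for p, _ in cases) != 1:
        raise ValueError("the events must partition the sample space")
    return sum(p * e for p, e in cases)

half = Fraction(1, 2)
inches = total_expectation([(half, 71), (half, 65)])  # male, female
feet, remainder = divmod(inches, 12)
print(f"{feet}' {remainder}''")
```

The check on the probabilities guards against the most common mistake in applying the law, which is using cases that overlap or that do not cover the whole sample space.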
We will see in the following sections that the Law of Total Expectation has much more
power than one might think.
E [R1 | R2 = a] = E [R1 ] .
The Lemma follows immediately from Definition 3.1 of conditional expectation and the
fact that Pr {R1 = r} = Pr {R1 = r | R2 = a}.
Here the first equality holds by Theorem 4.2 because the dice are independent.
Now suppose that the two dice are not independent; in fact, assume that the second die
is always the same as the first. In this case, the product of expectations will not equal the
expectation of the product.
To verify this, let random variables R1 and R2 be the numbers shown on the two dice. We
can compute the expected value of the product without Theorem 4.2 as follows:
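\begin{align*}
\mathrm{E}[R_1 \cdot R_2] &= \mathrm{E}[R_1^2] && (R_2 = R_1) \\
&= \sum_{i=1}^{6} i^2 \cdot \Pr\{R_1^2 = i^2\} \\
&= \sum_{i=1}^{6} i^2 \cdot \Pr\{R_1 = i\} \\
&= \frac{1^2}{6} + \frac{2^2}{6} + \frac{3^2}{6} + \frac{4^2}{6} + \frac{5^2}{6} + \frac{6^2}{6} \\
&= \frac{91}{6} \\
&\neq 12.25 \\
&= \mathrm{E}[R_1] \cdot \mathrm{E}[R_2]. && \text{from (1)}
\end{align*}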
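Both cases can be verified by direct enumeration. The sketch below computes the expected product for two independent fair dice and for two dice that always show the same face; the variable names are our own.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Expectation of a single fair die: 7/2.
e_single = sum(Fraction(i, 6) for i in faces)

# Independent dice: every pair (a, b) has probability 1/36.
e_prod_independent = sum(Fraction(a * b, 36) for a, b in product(faces, faces))

# Second die always equals the first: only pairs (i, i), each with probability 1/6.
e_prod_identical = sum(Fraction(i * i, 6) for i in faces)

print(e_prod_independent, e_single ** 2, e_prod_identical)
```

The independent case matches the product of expectations, 49/4 = 12.25, while the identical case gives 91/6.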
6 Markov’s Theorem
Markov’s theorem gives a generally rough estimate of the probability that a random variable
takes a value much larger than its mean.
The idea behind Markov’s Theorem can be explained with a simple example of intelligence
quotient, IQ. IQ was devised so that the average IQ measurement would be 100. Now from
this fact alone we can conclude that at most 1/2 the population can have an IQ of 200 or
more, because if more than half had an IQ of 200, then the average would have to be more
than (1/2)200 = 100, contradicting the fact that the average is 100. So the probability that
a randomly chosen person has an IQ of 200 or more is at most 1/2. Of course this is not
a very strong conclusion; in fact no IQ of over 200 has ever been recorded. But by the
same logic, we can also conclude that at most 2/3 of the population can have an IQ of
150 or more. IQ’s of over 150 have certainly been recorded, though again, a much smaller
fraction of the population actually has an IQ that high.
But although these conclusions about IQ are weak, they are actually the strongest possible
general conclusions that can be reached about a nonnegative random variable using only
the fact that its mean is 100. For example, if we choose a random variable equal to 200
with probability 1/2, and 0 with probability 1/2, then its mean is 100, and the probability
of a value of 200 or more is really 1/2. So we can’t hope to get a better upper bound than
1/2 on the probability of a value of 200 or more.
Theorem 6.1 (Markov’s Theorem). If R is a nonnegative random variable, then for all x > 0
\[
\Pr\{R \ge x\} \le \frac{\mathrm{E}[R]}{x}.
\]
Proof. We will show that E [R] ≥ x Pr {R ≥ x}. Dividing both sides by x gives the desired
result.
So let Ix be the indicator variable for the event [R ≥ x], and consider the random variable
xIx . Note that
R ≥ xIx ,
because at any sample point, w,
Therefore,
\[
\Pr\{R \ge c \cdot \mathrm{E}[R]\} \le \frac{1}{c}.
\]
Suppose that n men go to a dinner party and check their hats. At the end of the night,
the hats are randomly permuted and returned, so each man gets his own hat back with
probability 1/n. What is the probability that x or more men get the right hat?
We can compute an upper bound with Markov’s Theorem. Let the random variable, R,
be the number of men that get the right hat. In previous notes, we used linearity of
expectation to show that E [R] = 1. By Markov’s Theorem, the probability that x or more
men get the right hat is:
\[
\Pr\{R \ge x\} \le \frac{\mathrm{E}[R]}{x} = \frac{1}{x}.
\]
For example, there is no better than a 20% chance that 5 men get the right hat, regardless
of the number of people at the dinner party.
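To see how rough the Markov bound is here, we can compare it with the exact probability for a small party. The sketch below assumes all n! permutations of the hats are equally likely (one way to realize "randomly permuted"); the function name `tail_probability` and the choice n = 6 are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def tail_probability(n, x):
    """Exact Pr{R >= x} for n hats under a uniformly random permutation."""
    perms = list(permutations(range(n)))
    hits = sum(1 for perm in perms
               if sum(man == hat for man, hat in enumerate(perm)) >= x)
    return Fraction(hits, len(perms))

n = 6
for x in range(1, n + 1):
    exact = tail_probability(n, x)
    markov = Fraction(1, x)  # E[R] / x with E[R] = 1
    print(x, exact, markov)
```

The exact probabilities never exceed 1/x, and for larger x they fall far below it, which is what we should expect from a bound that uses nothing but the mean.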
The Chinese Appetizer problem is very similar. In this case, n people are eating Chinese
appetizers arranged on a circular, rotating tray. Someone then spins the tray so that each
person receives a random appetizer. What is the probability that everyone gets the same
appetizer as before?
There are n equally likely orientations for the tray after it stops spinning. Everyone gets
the right appetizer in just one of these n orientations. Therefore, the correct answer is 1/n.
But what probability do we get from Markov’s Theorem? Let the random variable, R, be
the number of people that get the right appetizer. You can show that E [R] = 1 (right?).
Applying Markov’s Theorem, we find:
\[
\Pr\{R \ge n\} \le \frac{\mathrm{E}[R]}{n} = \frac{1}{n}.
\]
Suppose we learn that the average IQ among MIT students is 150 (which is not true, by
the way). What can we say about the probability that an MIT student has an IQ of more
than 200? Markov’s theorem immediately tells us that no more than 150/200 or 3/4 of
the students can have such a high IQ. Here we simply applied Markov’s Theorem to the
random variable, R, equal to the IQ of a random MIT student to conclude:
\[
\Pr\{R > 200\} \le \frac{\mathrm{E}[R]}{200} = \frac{150}{200} = \frac{3}{4}.
\]
But let’s observe an additional fact (which may be true): no MIT student has an IQ less
than 100. This means that if we let T ::= R − 100, then T is nonnegative and E [T ] = 50, so
we can apply Markov’s Theorem to T and conclude:
\[
\Pr\{R > 200\} = \Pr\{T > 100\} \le \frac{\mathrm{E}[T]}{100} = \frac{50}{100} = \frac{1}{2}.
\]
So only half, not 3/4, of the students can be as amazing as they think they are. A bit of a
relief!
More generally, we can get better bounds applying Markov’s Theorem to R − l instead of
R for any lower bound l > 0 on R.
Similarly, if we have any upper bound, u, on a random variable, S, then u − S will be a
nonnegative random variable, and applying Markov’s Theorem to u − S will allow us to
bound the probability that S is much less than its expectation.
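The shifting trick is easy to package as a function. In the sketch below, `markov_bound` is a hypothetical helper that applies Markov’s Theorem to R − l for a known lower bound l (with l = 0 giving the plain theorem); it bounds Pr {R ≥ x}, which is at least Pr {R > x}.

```python
from fractions import Fraction

def markov_bound(mean, x, lower=0):
    """Upper bound on Pr{R >= x} when R >= lower always holds and E[R] = mean.

    Applies Markov's Theorem to the nonnegative variable R - lower.
    """
    if x <= lower:
        raise ValueError("x must exceed the lower bound")
    return min(Fraction(1), Fraction(mean - lower, x - lower))

print(markov_bound(150, 200))             # using only R >= 0
print(markov_bound(150, 200, lower=100))  # using R >= 100
```

The `min` with 1 only matters when x is below the mean, where the raw ratio exceeds 1 and says nothing.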
7 Chebyshev’s Theorem
We have versions of Markov’s Theorem for the probability of deviation above the mean,
but often we want bounds that apply to distance from the mean in either direction, that is,
bounds on the probability that |R − E [R]| is large.
It is a bit messy to apply Markov’s Theorem directly to this problem, because it’s generally
not easy to compute E [|R − E [R]|]. However, since |R| and hence $|R|^k$ are nonnegative
variables for any R, Markov’s inequality also applies to the event $[|R|^k \ge x^k]$. But this
event is equivalent to the event [|R| ≥ x], so we have:
Lemma 7.1. For any random variable R, any positive integer k, and any x > 0,
\[
\Pr\{|R| \ge x\} \le \frac{\mathrm{E}\big[|R|^k\big]}{x^k}.
\]
The special case of this Lemma for k = 2 can be applied to bound the random variable,
|R − E [R]|, that measures R’s deviation from its mean. Namely
\[
\Pr\{|R - \mathrm{E}[R]| \ge x\} = \Pr\{(R - \mathrm{E}[R])^2 \ge x^2\} \le \frac{\mathrm{E}[(R - \mathrm{E}[R])^2]}{x^2}, \tag{2}
\]
where the inequality (2) follows by applying Lemma 7.1 to the nonnegative random variable,
$(R - \mathrm{E}[R])^2$. Assuming that the quantity $\mathrm{E}[(R - \mathrm{E}[R])^2]$ above is finite, we can
conclude that the probability that R deviates from its mean by more than x is $O(1/x^2)$.
Theorem 7.3 (Chebyshev). Let R be a random variable, and let x be a positive real number.
Then
\[
\Pr\{|R - \mathrm{E}[R]| \ge x\} \le \frac{\mathrm{Var}[R]}{x^2}.
\]
The expression $\mathrm{E}[(R - \mathrm{E}[R])^2]$ for variance is a bit cryptic; the best approach is to work
through it from the inside out. The innermost expression, R − E [R], is precisely the deviation
of R above its mean. Squaring this, we obtain $(R - \mathrm{E}[R])^2$. This is a random variable
that is near 0 when R is close to the mean and is a large positive number when R deviates
far above or below the mean. So if R is always close to the mean, then the variance will
be small. If R is often far from the mean, then the variance will be large.
The relevance of variance is apparent when we compare the following two gambling
games.
Game A: We win $2 with probability 2/3 and lose $1 with probability 1/3.
Game B: We win $1002 with probability 2/3 and lose $2001 with probability 1/3.
Which game is better financially? We have the same probability, 2/3, of winning each
game, but that does not tell the whole story. What about the expected return for each
game? Let random variables A and B be the payoffs for the two games. For example, A is
2 with probability 2/3 and −1 with probability 1/3. We can compute the expected payoff
for each game as follows:
\begin{align*}
\mathrm{E}[A] &= 2 \cdot \frac{2}{3} + (-1) \cdot \frac{1}{3} = 1, \\
\mathrm{E}[B] &= 1002 \cdot \frac{2}{3} + (-2001) \cdot \frac{1}{3} = 1.
\end{align*}
The expected payoff is the same for both games, but they are obviously very different!
This difference is not apparent in their expected value, but is captured by variance. We
can compute Var [A] by working “from the inside out” as follows:
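\begin{align*}
A - \mathrm{E}[A] &= \begin{cases} 1 & \text{with probability } \tfrac{2}{3} \\ -2 & \text{with probability } \tfrac{1}{3} \end{cases} \\
(A - \mathrm{E}[A])^2 &= \begin{cases} 1 & \text{with probability } \tfrac{2}{3} \\ 4 & \text{with probability } \tfrac{1}{3} \end{cases} \\
\mathrm{E}[(A - \mathrm{E}[A])^2] &= 1 \cdot \frac{2}{3} + 4 \cdot \frac{1}{3} \\
\mathrm{Var}[A] &= 2.
\end{align*}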
The variance of Game A is 2 and the variance of Game B is more than two million! Intuitively,
this means that the payoff in Game A is usually close to the expected value of $1,
but the payoff in Game B can deviate very far from this expected value.
High variance is often associated with high risk. For example, in ten rounds of Game A,
we expect to make $10, but could conceivably lose $10 instead. On the other hand, in ten
rounds of game B, we also expect to make $10, but could actually lose more than $20,000!
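The "inside out" computation translates directly into code. The sketch below represents each game as a list of (payoff, probability) pairs and computes the mean and variance exactly; the helper names `mean` and `variance` are our own.

```python
from fractions import Fraction

def mean(dist):
    """E[R] for a finite distribution given as (value, probability) pairs."""
    return sum(v * p for v, p in dist)

def variance(dist):
    """Var[R] = E[(R - E[R])^2], computed from the inside out."""
    mu = mean(dist)
    return sum((v - mu) ** 2 * p for v, p in dist)

game_a = [(2, Fraction(2, 3)), (-1, Fraction(1, 3))]
game_b = [(1002, Fraction(2, 3)), (-2001, Fraction(1, 3))]

print(mean(game_a), variance(game_a))
print(mean(game_b), variance(game_b))
```

Both games have mean 1, but the variances are 2 and 2,004,002 respectively.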
Because of its definition in terms of the square of a random variable, the variance of a
random variable may be very far from a typical deviation from the mean. For example, in
Game B above, the deviation from the mean is 1001 in one outcome and 2002 in the other.
But the variance is a whopping 2,004,002. From a dimensional analysis viewpoint, the
“units” of variance are wrong: if the random variable is in dollars, then the expectation is
also in dollars, but the variance is in square dollars. For this reason, people often describe
random variables using standard deviation instead of variance.
Definition 7.4. The standard deviation, σR , of a random variable, R, is the square root of
the variance:
\[
\sigma_R ::= \sqrt{\mathrm{Var}[R]} = \sqrt{\mathrm{E}[(R - \mathrm{E}[R])^2]}.
\]
[Figure: distribution graph on an axis from 0 to 100, with the mean and the stdev marked]
Figure 1: The standard deviation of a distribution indicates how wide the “main part” of
it is.
So the standard deviation is the square root of the mean of the square of the deviation,
or the “root mean square” for short. It has the same units—dollars in our example—
as the original random variable and as the mean. Intuitively, it measures the “expected
(average) deviation from the mean,” since we can think of the square root on the outside
as canceling the square on the inside.
Example 7.5. The standard deviation of the payoff in Game B is:
\[
\sigma_B = \sqrt{\mathrm{Var}[B]} = \sqrt{2{,}004{,}002} \approx 1416.
\]
The random variable B actually deviates from the mean by either positive 1001 or negative
2002; therefore, the standard deviation of 1416 describes this situation reasonably
well.
Intuitively, the standard deviation measures the “width” of the “main part” of the distribution
graph, as illustrated in Figure 1.
There is a useful, simple reformulation of Chebyshev’s Theorem in terms of standard
deviation.
Corollary 7.6. Let R be a random variable, and let c be a positive real number.
\[
\Pr\{|R - \mathrm{E}[R]| \ge c\,\sigma_R\} \le \frac{1}{c^2}.
\]
Here we see explicitly how the “likely” values of R are clustered in an O(σR )-sized region
around E [R], confirming that the standard deviation measures how spread out the
distribution of R is around its mean.
8 Properties of Variance
By this definition, every random variable has zero variance. That is not useful! Because
of the square in the conventional definition, both positive and negative deviations from
the mean increase the variance; positive and negative deviations do not cancel.
Of course, we could also prevent positive and negative deviations from canceling by taking
an absolute value. That is, we could define variance to be E [|R − E [R]|]. There is
no logical reason not to use this definition. However, the conventional version of variance
has some valuable mathematical properties which the absolute value version does
not. We describe these properties in the following sections and use them to determine the
variance of some important probability distributions.
There is an equivalent way to define the variance of a random variable that is less intuitive,
but is often easier to use in calculations and proofs:
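Theorem 8.1.
\[
\mathrm{Var}[R] = \mathrm{E}[R^2] - \mathrm{E}^2[R].
\]
Proof. Let µ = E [R]. Then
\begin{align*}
\mathrm{Var}[R] &= \mathrm{E}[(R - \mu)^2] && \text{(def. of variance)} \\
&= \mathrm{E}[R^2 - 2\mu R + \mu^2] \\
&= \mathrm{E}[R^2] - 2\mu\,\mathrm{E}[R] + \mu^2 && \text{(linearity of expectation)} \\
&= \mathrm{E}[R^2] - 2\mu^2 + \mu^2 && \text{(def. of $\mu$)} \\
&= \mathrm{E}[R^2] - \mu^2 \\
&= \mathrm{E}[R^2] - \mathrm{E}^2[R]. && \text{(def. of $\mu$)}
\end{align*}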
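As a sanity check, the two ways of computing variance can be compared on a concrete distribution. The sketch below uses a fair six-sided die as an illustrative example; the function names are our own.

```python
from fractions import Fraction

# A fair six-sided die as (value, probability) pairs (illustrative choice).
die = [(v, Fraction(1, 6)) for v in range(1, 7)]

def expectation(dist, f=lambda v: v):
    """E[f(R)] for a finite distribution given as (value, probability) pairs."""
    return sum(f(v) * p for v, p in dist)

def variance_by_definition(dist):
    """E[(R - E[R])^2]."""
    mu = expectation(dist)
    return expectation(dist, lambda v: (v - mu) ** 2)

def variance_by_theorem(dist):
    """E[R^2] - E^2[R]."""
    return expectation(dist, lambda v: v * v) - expectation(dist) ** 2

print(variance_by_definition(die), variance_by_theorem(die))
```

Both give 35/12. The second form is often handier because it needs only E [R] and the expectation of the square, with no subtraction inside the expectation.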
When does a random variable, R, have zero variance? When the random variable never
deviates from the mean!
Lemma 8.2. The variance of a random variable, R, is zero if and only if Pr {R = E [R]} = 1.
So saying that Var [R] = 0 is almost the same as saying that R is constant. Namely, it takes
the constant value equal to its expectation on all sample points with nonzero probability.
(It can take on any finite values on sample points with zero probability without affecting
the variance.)
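\[
\mathrm{Var}[R] = 0 \quad\text{iff}\quad \mathrm{E}[(R - \mathrm{E}[R])^2] = 0.
\]
The inner expression on the right, $(R - \mathrm{E}[R])^2$, is always nonnegative because of the
square. As a result, $\mathrm{E}[(R - \mathrm{E}[R])^2] = 0$ if and only if $\Pr\{(R - \mathrm{E}[R])^2 \neq 0\}$ is zero, which
is the same as saying that $\Pr\{(R - \mathrm{E}[R])^2 = 0\}$ is one. That is,
\[
\mathrm{Var}[R] = 0 \quad\text{iff}\quad \Pr\{(R - \mathrm{E}[R])^2 = 0\} = 1.
\]
But $(R - \mathrm{E}[R])^2 = 0$ and R = E [R] are different descriptions of the same event.
Therefore,
\[
\mathrm{Var}[R] = 0 \quad\text{iff}\quad \Pr\{R = \mathrm{E}[R]\} = 1.
\]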
The following theorem describes how the variance of a random variable changes when it
is scaled or shifted by a constant.
Theorem 8.3. Let R be a random variable, and let a and b be constants. Then
\[
\mathrm{Var}[aR + b] = a^2\,\mathrm{Var}[R]. \tag{4}
\]
This theorem makes two points. First, adding a constant b to a random variable does
not affect the variance. Second, multiplying a random variable by a constant changes the
variance by a square factor.
Proof. We will transform the left side of (4) into the right side. The first step is to expand
Var [aR + b] using the alternate definition of variance.
We will work on the first term and then the second term. For the first term, note that by
linearity of expectation,
\[
{} = a^2\big(\mathrm{E}[R^2] - \mathrm{E}^2[R]\big)
\]
A similar rule holds for the standard deviation when a random variable is adjusted by a
constant. Recall that standard deviation is the square root of variance. Therefore, adding
a constant b to a random variable does not change the standard deviation. Multiplying a
random variable by a constant a multiplies the standard deviation by |a|. So we have
Corollary 8.4. The standard deviation of aR + b equals |a| times the standard deviation of R.
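The effect of scaling and shifting can be checked numerically. The sketch below applies the transformation aR + b to a fair die with a negative a, which is the case where the absolute value in the standard deviation rule matters; the die, the constants, and the helper names are illustrative assumptions.

```python
from fractions import Fraction
from math import isclose, sqrt

die = [(v, Fraction(1, 6)) for v in range(1, 7)]

def variance(dist):
    """Var[R] for a finite distribution given as (value, probability) pairs."""
    mu = sum(v * p for v, p in dist)
    return sum((v - mu) ** 2 * p for v, p in dist)

def affine(dist, a, b):
    """Distribution of a*R + b."""
    return [(a * v + b, p) for v, p in dist]

a, b = -3, 10
scaled = affine(die, a, b)

print(variance(scaled), a * a * variance(die))              # equal: the shift b drops out
print(sqrt(variance(scaled)), abs(a) * sqrt(variance(die))) # stdev scales by |a|
```

With a = −3 the variance is multiplied by 9 and the standard deviation by 3, not by −3.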
In (7), we use the fact that the expectation of the product of two independent variables is
the product of their expectations.
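Also,
\begin{align*}
\mathrm{E}^2\Big[\sum_{i=1}^{n} R_i\Big] &= \Big(\mathrm{E}\Big[\sum_{i=1}^{n} R_i\Big]\Big)^2 \\
&= \Big(\sum_{i=1}^{n} \mathrm{E}[R_i]\Big)^2 && \text{(linearity)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{E}[R_i]\,\mathrm{E}[R_j] \\
&= \sum_{1 \le i \ne j \le n} \mathrm{E}[R_i]\,\mathrm{E}[R_j] + \sum_{i=1}^{n} \mathrm{E}^2[R_i]. \tag{8}
\end{align*}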
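So,
\begin{align*}
\mathrm{Var}\Big[\sum_{i=1}^{n} R_i\Big]
&= \mathrm{E}\Big[\Big(\sum_{i=1}^{n} R_i\Big)^2\Big] - \mathrm{E}^2\Big[\sum_{i=1}^{n} R_i\Big] && \text{(Theorem 8.1)} \\
&= \sum_{1 \le i \ne j \le n} \mathrm{E}[R_i]\,\mathrm{E}[R_j] + \sum_{i=1}^{n} \mathrm{E}[R_i^2]
 - \Big(\sum_{1 \le i \ne j \le n} \mathrm{E}[R_i]\,\mathrm{E}[R_j] + \sum_{i=1}^{n} \mathrm{E}^2[R_i]\Big) && \text{(by (7) and (8))} \\
&= \sum_{i=1}^{n} \mathrm{E}[R_i^2] - \sum_{i=1}^{n} \mathrm{E}^2[R_i] \\
&= \sum_{i=1}^{n} \big(\mathrm{E}[R_i^2] - \mathrm{E}^2[R_i]\big) && \text{(reordering the sums)} \\
&= \sum_{i=1}^{n} \mathrm{Var}[R_i]. && \text{(Theorem 8.1)}
\end{align*}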
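Note that the derivation only ever multiplies expectations of two variables at a time, so pairwise independence is enough. The sketch below illustrates this with a standard example that is our own choice, not taken from the notes: two fair bits X and Y together with Z = X XOR Y. These three variables are pairwise independent but not mutually independent.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair bits (x, y); the third variable is z = x XOR y.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
prob = Fraction(1, len(outcomes))  # each of the 4 outcomes has probability 1/4

def expect(f):
    """E[f] over the sample space."""
    return sum(f(o) * prob for o in outcomes)

def var(f):
    """Var[f] = E[f^2] - E^2[f]."""
    return expect(lambda o: f(o) ** 2) - expect(f) ** 2

var_of_sum = var(lambda o: sum(o))
sum_of_vars = sum(var(lambda o, i=i: o[i]) for i in range(3))

# Not mutually independent: all three can never be 1 at once.
pr_all_ones = expect(lambda o: int(o == (1, 1, 1)))

print(var_of_sum, sum_of_vars, pr_all_ones)
```

The variance of the sum equals the sum of the variances (3/4), even though the probability that all three are 1 is 0 rather than 1/8.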
Now we have a simple way of computing the variance of a variable Hn,p which has a
binomial distribution with parameters n and p. We know that $H_{n,p} = \sum_{k=1}^{n} I_k$ where the
Ik are mutually independent 0-1-valued variables with Pr {Ik = 1} = p. The variance of
each Ik is p(1 − p) by (3), so by linearity of variance, we have
\[
\mathrm{Var}[H_{n,p}] = \sum_{k=1}^{n} \mathrm{Var}[I_k] = np(1 - p). \tag{9}
\]
In Notes 13, we used bounds on the binomial distribution to determine confidence levels
for a poll of voter preferences of Clinton vs. Giuliani. Now that we know the variance of
the binomial distribution, we can use Chebyshev’s Theorem as an alternative approach
to calculate poll size.
The setup is the same as in Notes 13: we will poll n randomly chosen voters and let Sn
be the total number in our sample who preferred Clinton. We use Sn /n as our estimate
of the actual fraction, p, of all voters who prefer Clinton. We want to choose n so that our
estimate will be within 0.04 of p at least 95% of the time.
Now Sn is binomially distributed, so from (9) we have
\[
\mathrm{Var}[S_n] = n(p(1 - p)) \le n \cdot \frac{1}{4}. \tag{9.1}
\]
The bound of 1/4 follows from the easily verified fact that p(1 − p) is maximized when
p = 1 − p, that is, when p = 1/2.
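Next, we bound the variance of Sn /n:
\begin{align*}
\mathrm{Var}\Big[\frac{S_n}{n}\Big] &= \Big(\frac{1}{n}\Big)^2 \mathrm{Var}[S_n] && \text{(by (4))} \\
&\le \Big(\frac{1}{n}\Big)^2 \frac{n}{4} && \text{(by (9.1))} \\
&= \frac{1}{4n}. \tag{10}
\end{align*}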
Now from Chebyshev and (10) we have:
\[
\Pr\Big\{\Big|\frac{S_n}{n} - p\Big| \ge 0.04\Big\} \le \frac{\mathrm{Var}[S_n/n]}{(0.04)^2} \le \frac{1}{4n(0.04)^2} = \frac{156.25}{n}. \tag{11}
\]
To make our estimate with 95% confidence, we want the right-hand side of (11)
to be at most 1/20. So we choose n so that
\[
\frac{156.25}{n} \le \frac{1}{20},
\]
that is,
\[
n \ge 3{,}125.
\]
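The same calculation works for any margin of error and any confidence level. The sketch below solves 1/(4n · margin²) ≤ failure probability for the smallest integer n; the function name `chebyshev_poll_size` and its parameter names are hypothetical, and exact fractions are used so that the rounding is not affected by floating-point error.

```python
from fractions import Fraction
from math import ceil

def chebyshev_poll_size(margin, failure_prob):
    """Smallest n for which 1 / (4 * n * margin^2) <= failure_prob."""
    return ceil(1 / (4 * margin ** 2 * failure_prob))

# Within 0.04 of p at least 95% of the time, as in the example above.
print(chebyshev_poll_size(Fraction(4, 100), Fraction(1, 20)))
```

Halving the margin quadruples the required sample size, while halving the allowed failure probability only doubles it.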
You may remember that in Notes 13 we calculated that it was actually sufficient to poll
only 664 voters, many fewer than the 3,125 voters we derived using Chebyshev’s Theorem.
So the bound from Chebyshev’s Theorem is not nearly as good as the bound we
got earlier. This should not be surprising. In applying the Chebyshev Theorem, we used
only a bound on the variance of Sn . In Notes 13, on the other hand, we used the fact that
the random variable Sn was binomial (with known parameter, n, and unknown parameter,
p). It makes sense that more detailed information about a distribution leads to better
bounds. But even though the bound was not as good, this example nicely illustrates an
approach to estimation using Chebyshev’s Theorem that is more widely applicable than
binomial estimations.
There are important cases where the relevant distributions are not binomial because the
mutual independence properties of the voter preference example do not hold. In these
cases, estimation methods based on the Chebyshev bound may be the best approach.
Birthday Matching is an example.
We’ve already seen that in a class of one hundred or more, there is a very high probability
that some pair of students have birthdays on the same day of the month. We can also
easily calculate the expected number of pairs of students with matching birthdays. But
is it likely the number of matching pairs in a typical class will actually be close to the
expected number? We can take the same approach to answering this question as we did
in estimating voter preferences.
But notice that the events that different pairs of students have matching birthdays are not
mutually independent. For example, knowing that Alice and Bob have matching birthdays,
and also that Ted and Alice have matching birthdays obviously implies that Bob and Ted
have matching birthdays. On the other hand, knowing that Alice and Bob have matching
birthdays tells us nothing about whether Alice and Carol have matching birthdays;
that is, these two events really are independent. So even though the events that various
pairs of students have matching birthdays are not mutually independent, indeed not
even three-way independent, they are pairwise independent.
This allows us to apply the same reasoning to Birthday Matching as we did for voter
preference. Namely, let B1 , B2 , . . . , Bn be the birthdays of n independently chosen people,
and let Ei,j be the indicator variable for the event that the ith and jth people chosen have
the same birthday, that is, the event [Bi = Bj ]. For simplicity, we’ll assume that for i ≠ j,
the probability that Bi = Bj is 1/365. So the Bi ’s are mutually independent variables,
and hence the Ei,j ’s are pairwise independent variables, which is all we will need.
Let Mn be the number of matching pairs of birthdays among the n choices, that is,
\[
M_n ::= \sum_{1 \le i < j \le n} E_{i,j}. \tag{12}
\]
So by linearity of expectation
\[
\mathrm{E}[M_n] = \mathrm{E}\Big[\sum_{1 \le i < j \le n} E_{i,j}\Big] = \sum_{1 \le i < j \le n} \mathrm{E}[E_{i,j}] = \binom{n}{2} \cdot \frac{1}{365}.
\]
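Also, by Theorem 8.5, the variances of pairwise independent variables are additive, so
\[
\mathrm{Var}[M_n] = \mathrm{Var}\Big[\sum_{1 \le i < j \le n} E_{i,j}\Big] = \sum_{1 \le i < j \le n} \mathrm{Var}[E_{i,j}] = \binom{n}{2} \cdot \frac{1}{365}\Big(1 - \frac{1}{365}\Big).
\]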
Now for a class of 100 students, we have E [M100 ] ≈ 14 and Var [M100 ] < 14(1−1/365) < 14.
So by Chebyshev’s Theorem
\[
\Pr\{|M_{100} - 14| \ge x\} < \frac{14}{x^2}.
\]
Letting x = 6, we conclude that there is a better than 50% chance that in a class of 100
students, the number of pairs of students with the same birthday will be between 8 and
20.
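The numbers for a class of 100 can be reproduced as follows. This is a sketch under the same simplifying assumption of 365 equally likely birthdays; the function name `matching_pairs_stats` is our own, and the code uses the exact mean (about 13.56) rather than the rounded value 14 used in the text.

```python
from fractions import Fraction
from math import comb

def matching_pairs_stats(n, days=365):
    """Return (E[M_n], Var[M_n]) for the number of matching birthday pairs."""
    p = Fraction(1, days)   # probability that a given pair matches
    pairs = comb(n, 2)      # number of pairs i < j
    return pairs * p, pairs * p * (1 - p)

mean, var = matching_pairs_stats(100)
# Chebyshev: Pr{|M - E[M]| >= 6} <= Var[M] / 6^2
bound = var / 6 ** 2

print(float(mean), float(var), float(bound))
```

The Chebyshev bound on deviating by 6 or more comes out below 1/2, consistent with the better-than-50% conclusion above.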
Then
\[
\Pr\Big\{\Big|\frac{S_n}{n} - \mu\Big| \ge x\Big\} \le \frac{1}{n}\Big(\frac{\sigma}{x}\Big)^2.
\]
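The second important property of Sn /n is that its variance is the variance of Gi divided
by n:
\begin{align*}
\mathrm{Var}\Big[\frac{S_n}{n}\Big] &= \Big(\frac{1}{n}\Big)^2 \mathrm{Var}[S_n] && \text{(by (4))} \\
&= \frac{1}{n^2}\,\mathrm{Var}\Big[\sum_{i=1}^{n} G_i\Big] && \text{(def. of $S_n$)} \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}[G_i] && \text{(pairwise independent additivity)} \\
&= \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}. \tag{14}
\end{align*}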
The Pairwise Independent Sampling Theorem provides a precise general statement about
how the average of independent samples of a random variable approaches the mean. In
particular, it shows that by choosing a large enough sample size, n, we can get arbitrarily
accurate estimates of the mean with confidence arbitrarily close to 100%.