Massachusetts Institute of Technology Course Notes, Week 14

6.042J/18.062J, Fall ’05: Mathematics for Computer Science December 2

Prof. Albert R. Meyer and Prof. Ronitt Rubinfeld revised December 8, 2005, 61 minutes

Missed Expectations?
In the previous notes, we saw that the average value of a random quantity is captured
by the mathematical concept of the expectation of a random variable, and we calculated
expectations for several kinds of random variables. Now we will see two things that
make expectations so useful. First, they are often very easy to calculate due to the fact
that they obey linearity. Second, once you know what the expectation is, you can also get
some type of bound on the probability that you are far from the expectation —that is, you
can show that really weird things are not that likely to happen. How good a bound you
can get depends on what you know about your distribution, but don’t worry, even if you
know next to nothing, you can still say something relatively interesting.

1 Linearity of Expectation

1.1 Expectation of a Sum

Expected values obey a simple, very helpful rule called Linearity of Expectation. Its simplest
form says that the expected value of a sum of random variables is the sum of the
expected values of the variables.
Theorem 1.1. For any random variables R1 and R2 ,

E [R1 + R2 ] = E [R1 ] + E [R2 ] .

Proof. Let T ::= R1 + R2 . The proof follows straightforwardly by rearranging terms from
the deﬁnition of E [T ].

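\[
\begin{aligned}
\operatorname{E}[R_1 + R_2] &::= \operatorname{E}[T] \\
&::= \sum_{s \in S} T(s) \cdot \Pr\{s\} \\
&= \sum_{s \in S} (R_1(s) + R_2(s)) \cdot \Pr\{s\} && \text{(Def. of $T$)} \\
&= \sum_{s \in S} R_1(s) \Pr\{s\} + \sum_{s \in S} R_2(s) \Pr\{s\} && \text{(rearranging terms)} \\
&= \operatorname{E}[R_1] + \operatorname{E}[R_2].
\end{aligned}
\]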

Copyright © 2005, Prof. Albert R. Meyer and Prof. Ronitt Rubinfeld.


Similarly, we have

Lemma 1.2. For any random variable R and constant a ∈ R,

E [aR] = a E [R] .

The proof follows easily from the deﬁnition of expectation, and we omit it.
Combining Theorem 1.1 and Lemma 1.2, we conclude

Theorem 1.3 (Linearity of Expectation). For all random variables R1 , R2 and constants a1 , a2 ∈
R,
E [a1 R1 + a2 R2 ] = a1 E [R1 ] + a2 E [R2 ] .

In other words, expectation is a linear function. The rule and its proof extend directly to
cover more than two random variables:

Corollary 1.4. For any random variables R1 , . . . , Rk and constants a1 , . . . , ak ∈ R,

\[
\operatorname{E}\left[\sum_{i=1}^{k} a_i R_i\right] = \sum_{i=1}^{k} a_i \operatorname{E}[R_i].
\]

The great thing about linearity of expectation is that no independence is required. This is
really useful, because dealing with independence is a pain, and we often need to work
with random variables that are not independent.

1.2 Expected Value of Two Dice

What is the expected value of the sum of two fair dice?

Let the random variable R1 be the number on the ﬁrst die, and let R2 be the number on
the second die. We observed earlier that the expected value of one die is 3.5. We can ﬁnd
the expected value of the sum using linearity of expectation:

E [R1 + R2 ] = E [R1 ] + E [R2 ] = 3.5 + 3.5 = 7.

Notice that we did not have to assume that the two dice were independent. The expected
sum of two dice is 7, even if they are connected together! (Each die must remain fair
after the connection, however.)
Proving that the expected sum is 7 with a tree diagram would be hard; there are 36 cases.
And if we did not assume that the dice were independent, the job would be a nightmare!
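The claim that independence is not needed can be checked by brute force. The sketch below is only an illustration: the three joint distributions (`independent`, `glued`, `opposed`) and the helper `expectation` are made-up names, chosen so that each die is fair on its own while the dependence between the dice varies.

```python
from fractions import Fraction
from itertools import product

def expectation(dist, f):
    """Expected value of f(outcome) under dist, a dict mapping outcome -> probability."""
    return sum(p * f(s) for s, p in dist.items())

faces = range(1, 7)
# Independent fair dice: all 36 pairs are equally likely.
independent = {(a, b): Fraction(1, 36) for a, b in product(faces, faces)}
# "Glued" dice: the second die always shows the same face as the first.
glued = {(a, a): Fraction(1, 6) for a in faces}
# "Opposed" dice: the second die always shows 7 minus the first.
opposed = {(a, 7 - a): Fraction(1, 6) for a in faces}

def total(s):
    return s[0] + s[1]

for name, dist in [("independent", independent), ("glued", glued), ("opposed", opposed)]:
    print(name, expectation(dist, total))  # each line ends in 7
```

In all three cases each die is individually fair, so linearity gives 3.5 + 3.5 = 7 no matter how the dice are coupled.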

1.3 The Hat-Check Problem

There is a dinner party where n men check their hats. The hats are mixed up during
dinner, so that afterward each man receives a random hat. In particular, each man gets
his own hat with probability 1/n. What is the expected number of men who get their own
hat?
Without linearity of expectation, this would be a very difﬁcult question to answer. We
might try the following. Let the random variable R be the number of men that get their
own hat. We want to compute E [R]. By the deﬁnition of expectation, we have:
\[
\operatorname{E}[R] = \sum_{k=0}^{\infty} k \cdot \Pr\{R = k\}
\]

Now we’re in trouble, because evaluating Pr {R = k} is a mess and we then need to
substitute this mess into a summation. Furthermore, to have any hope, we would need to fix
the probability of each permutation of the hats. For example, we might assume that all
permutations of hats are equally likely.
Now let’s try to use linearity of expectation. As before, let the random variable R be the
number of men that get their own hat. The trick is to express R as a sum of indicator
variables. In particular, let Ri be an indicator for the event that the ith man gets his own
hat. That is, Ri = 1 is the event that he gets his own hat, and Ri = 0 is the event that
he gets the wrong hat. The number of men that get their own hat is the sum of these
indicators:
R = R1 + R2 + · · · + Rn
These indicator variables are not mutually independent. For example, if n − 1 men all get
their own hats, then the last man is certain to receive his own hat. But, since we plan to
use linearity of expectation, we don’t have to worry about independence!
Let’s take the expected value of both sides of the equation above and apply linearity of
expectation:

\[
\begin{aligned}
\operatorname{E}[R] &= \operatorname{E}[R_1 + R_2 + \cdots + R_n] \\
&= \operatorname{E}[R_1] + \operatorname{E}[R_2] + \cdots + \operatorname{E}[R_n]
\end{aligned}
\]

Since the Ri ’s are indicator variables, E [Ri ] = Pr {Ri = 1}, and since every man is as likely to
get one hat as another, this is just 1/n. Putting all this together, we have:

\[
\begin{aligned}
\operatorname{E}[R] &= \operatorname{E}[R_1] + \operatorname{E}[R_2] + \cdots + \operatorname{E}[R_n] \\
&= \Pr\{R_1 = 1\} + \Pr\{R_2 = 1\} + \cdots + \Pr\{R_n = 1\} \\
&= n \cdot \frac{1}{n} = 1.
\end{aligned}
\]

So we should expect 1 man to get his own hat back on average!

Notice that we did not assume that all permutations of hats are equally likely or even that
all permutations are possible. We only needed to know that each man received his own
hat with probability 1/n.
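For small n the answer can be confirmed by listing every way of handing back the hats. This is a rough check rather than a proof, and it assumes all n! permutations are equally likely, which is stronger than the argument above requires. The function name `expected_fixed_points` is invented for the illustration.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Average number of men who get their own hat, over all n! equally likely permutations."""
    perms = list(permutations(range(n)))
    # A man i gets his own hat when position i of the permutation holds hat i.
    total = sum(sum(1 for i, hat in enumerate(p) if i == hat) for p in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_fixed_points(n))  # the expectation is 1 for every n
```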

1.4 Expectation of a Binomial Distribution

Suppose that we independently ﬂip n biased coins, each with probability p of coming up
heads. What is the expected number that come up heads?
Let Hn,p be the number of heads after the ﬂips. Then Hn,p has the binomial distribution
with parameters n and p. Now let Ik be the indicator for the kth coin coming up heads.
Since Ik is an indicator variable with probability p of being 1, we know that

E [Ik ] = p.

But

\[
H_{n,p} = \sum_{k=1}^{n} I_k,
\]

so by linearity

\[
\operatorname{E}[H_{n,p}] = \operatorname{E}\left[\sum_{k=1}^{n} I_k\right] = \sum_{k=1}^{n} \operatorname{E}[I_k] = \sum_{k=1}^{n} p = pn.
\]

That is, the expectation of an (n, p)-binomially distributed variable is pn.
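To see how much work linearity saves, we can compare pn with the expectation computed directly from the binomial probabilities. The sketch below uses the standard binomial formula Pr {H = k} = C(n, k) p^k (1 − p)^(n−k), which these notes do not restate; the function name `binomial_mean` and the sample values n = 10, p = 3/10 are arbitrary choices.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Expectation computed directly from the definition: sum of k * Pr{H = k}."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, Fraction(3, 10)
print(binomial_mean(n, p), n * p)  # 3 3
```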

2 The Coupon Collector Problem

Every time I purchase a kid’s meal at Taco Bell, I am graciously presented with a miniature
“Racin’ Rocket” car together with a launching device which enables me to project my new
vehicle across any tabletop or smooth ﬂoor at high velocity. Truly, my delight knows no
bounds.
There are n different types of Racin’ Rocket car (blue, green, red, gray, etc.). The type of
car awarded to me each day by the kind woman at the Taco Bell register appears to be
selected uniformly and independently at random. What is the expected number of kids
meals that I must purchase in order to acquire at least one of each type of Racin’ Rocket
car?
The same mathematical question shows up in many guises: for example, what is the expected
number of people you must poll in order to find at least one person with each
possible birthday? Here, instead of collecting Racin’ Rocket cars, you’re collecting birthdays.
The general question is commonly called the coupon collector problem after yet
another interpretation.

2.1 A Solution Using Linearity of Expectation

Linearity of expectation is somewhat like induction and the pigeonhole principle; it’s a
simple idea that can be used in all sorts of ingenious ways. For example, we can use
linearity of expectation in a clever way to solve the coupon collector problem. Suppose
there are five different types of Racin’ Rocket, and I receive this sequence:

blue green green red blue orange blue orange gray

Let’s partition the sequence into 5 segments:

\[
\underbrace{\text{blue}}_{X_0} \quad
\underbrace{\text{green}}_{X_1} \quad
\underbrace{\text{green red}}_{X_2} \quad
\underbrace{\text{blue orange}}_{X_3} \quad
\underbrace{\text{blue orange gray}}_{X_4}
\]

The rule is that a segment ends whenever I get a new kind of car. For example, the middle
segment ends when I get a red car for the ﬁrst time. In this way, we can break the problem
of collecting every type of car into stages. Then we can analyze each stage individually
and assemble the results using linearity of expectation.
Let’s return to the general case where I’m collecting n Racin’ Rockets. Let Xk be the length
of the kth segment. The total number of kid’s meals I must purchase to get all n Racin’
Rockets is the sum of the lengths of all these segments:

T = X0 + X1 + · · · + Xn−1

Now let’s focus our attention on Xk , the length of the kth segment. At the beginning of
segment k, I have k different types of car, and the segment ends when I acquire a new type.
When I own k types, each kid’s meal contains a type that I already have with probability
k/n. Therefore, each meal contains a new type of car with probability 1 − k/n = (n − k)/n.
Thus, the expected number of meals until I get a new kind of car is n/(n − k) by the “mean
time to failure” formula that we worked out last time. So we have:

\[
\operatorname{E}[X_k] = \frac{n}{n-k}
\]

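Linearity of expectation, together with this observation, solves the coupon collector
problem:

\[
\begin{aligned}
\operatorname{E}[T] &= \operatorname{E}[X_0 + X_1 + \cdots + X_{n-1}] \\
&= \operatorname{E}[X_0] + \operatorname{E}[X_1] + \cdots + \operatorname{E}[X_{n-1}] \\
&= \frac{n}{n-0} + \frac{n}{n-1} + \cdots + \frac{n}{3} + \frac{n}{2} + \frac{n}{1} \\
&= n \left( \frac{1}{n} + \frac{1}{n-1} + \cdots + \frac{1}{3} + \frac{1}{2} + \frac{1}{1} \right) \\
&= n H_n
\end{aligned}
\]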

The summation on the next-to-last line is the nth harmonic sum with the terms in reverse
order. As you may recall, this sum is denoted Hn and is approximately ln n.
Let’s use this general solution to answer some concrete questions. For example, the expected
number of die rolls required to see every number from 1 to 6 is:

6H6 = 14.7 . . .

And the expected number of people you must poll to ﬁnd at least one person with each
possible birthday is:
365H365 = 2364.6 . . .
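Both numbers are easy to reproduce. The sketch below evaluates nHn with exact fractions; the function name `expected_purchases` is just a label for this illustration.

```python
from fractions import Fraction

def expected_purchases(n):
    """n * H_n, where H_n = 1 + 1/2 + ... + 1/n is the nth harmonic sum."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

print(float(expected_purchases(6)))    # 14.7
print(float(expected_purchases(365)))  # approximately 2364.6
```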

3 Conditional Expectation
Just like event probabilities, expectations can be conditioned on some event.

Definition 3.1. We define the conditional expectation E [R | A] of a random variable R given
event A:

\[
\operatorname{E}[R \mid A] ::= \sum_{r} r \cdot \Pr\{R = r \mid A\}.
\]

In other words, it is the expected value of the variable R once we skew the distribution of
R to be conditioned on event A.
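Example 3.2. Let D be the outcome of a roll of a fair die. What is E [D | D ≥ 4]?

\[
\operatorname{E}[D \mid D \ge 4] = \sum_{i=1}^{6} i \cdot \Pr\{D = i \mid D \ge 4\}
= 1 \cdot 0 + 2 \cdot 0 + 3 \cdot 0 + 4 \cdot \frac{1}{3} + 5 \cdot \frac{1}{3} + 6 \cdot \frac{1}{3}
= 5.
\]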

It is easy to see that the rules for expectation will extend to conditional expectation. For
example, conditional expectation will also be linear.

Theorem 3.3. For any two random variables R1 , R2 , constants a1 , a2 ∈ R, and event A,

E [a1 R1 + a2 R2 | A] = a1 E [R1 | A] + a2 E [R2 | A] .

A real benefit of conditional expectation is the way it lets us divide complicated expectation
calculations into simpler cases.

Theorem 3.4 (Law of Total Expectation). If the sample space is the union of the pairwise
disjoint events A1 , A2 , . . . , then

\[
\operatorname{E}[R] = \sum_{i} \operatorname{E}[R \mid A_i] \Pr\{A_i\}.
\]

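Proof.

\[
\begin{aligned}
\operatorname{E}[R] &= \sum_{r} r \cdot \Pr\{R = r\} \\
&= \sum_{r} r \cdot \sum_{i} \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(Law of Total Probability)} \\
&= \sum_{r} \sum_{i} r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(distribute constant $r$)} \\
&= \sum_{i} \sum_{r} r \cdot \Pr\{R = r \mid A_i\} \Pr\{A_i\} && \text{(exchange order of summation)} \\
&= \sum_{i} \Pr\{A_i\} \sum_{r} r \cdot \Pr\{R = r \mid A_i\} && \text{(factor constant $\Pr\{A_i\}$)} \\
&= \sum_{i} \Pr\{A_i\} \operatorname{E}[R \mid A_i]. && \text{(Def. 3.1)}
\end{aligned}
\]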

Example 3.5. Half the people in the world are male, half female. The expected height of a
randomly chosen male is 5′ 11″, while the expected height of a randomly chosen female is
5′ 5″. What is the expected height of a randomly chosen individual?
Let H(P ) be the height of the random person P . The events M ::=“P is male” and F ::=“P
is female” are a partition of the sample space. Then

\[
\begin{aligned}
\operatorname{E}[H] &= \operatorname{E}[H \mid M] \Pr\{M\} + \operatorname{E}[H \mid F] \Pr\{F\} \\
&= 5'\,11'' \cdot \frac{1}{2} + 5'\,5'' \cdot \frac{1}{2} \\
&= 5'\,8''.
\end{aligned}
\]

We will see in the following sections that the Law of Total Expectation has much more
power than one might think.
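As a quick sanity check, the Law of Total Expectation can be verified on the fair die of Example 3.2, splitting the sample space into the events D ≥ 4 and D < 4. The helper names below (`probability`, `conditional_expectation`, `high`, `low`) are invented for this sketch.

```python
from fractions import Fraction

# Distribution of a fair die: value -> probability.
die = {i: Fraction(1, 6) for i in range(1, 7)}

def probability(dist, event):
    """Pr{event}, where event is a predicate on the value of the variable."""
    return sum(p for r, p in dist.items() if event(r))

def conditional_expectation(dist, event):
    """E[R | event] = sum over r of r * Pr{R = r | event} (Definition 3.1)."""
    return sum(r * p for r, p in dist.items() if event(r)) / probability(dist, event)

def high(d):
    return d >= 4

def low(d):
    return d < 4

e_high = conditional_expectation(die, high)
e_low = conditional_expectation(die, low)
# Law of Total Expectation over the partition {D >= 4, D < 4}.
total = e_high * probability(die, high) + e_low * probability(die, low)
print(e_high, e_low, total)  # 5 2 7/2
```

The weighted average of the two conditional expectations recovers the familiar 3.5.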

4 The Expected Value of a Product

4.1 The Product of Independent Expectations

We have determined that the expectation of a sum is the sum of the expectations. The
same is not always true for products: in general, the expectation of a product need not
equal the product of the expectations. But it is true in an important special case, namely,
when the random variables are independent.
Lemma 4.1. If R1 and R2 are independent random variables, then

E [R1 | R2 = a] = E [R1 ] .

The Lemma follows immediately from Deﬁnition 3.1 of conditional expectation and the
fact that Pr {R1 = r} = Pr {R1 = r | R2 = a}.

Theorem 4.2. For any two independent random variables R1 , R2 ,

E [R1 · R2 ] = E [R1 ] · E [R2 ] .

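Proof. We apply the Law of Total Expectation by conditioning on the value of R1 .

\[
\begin{aligned}
\operatorname{E}[R_1 \cdot R_2] &= \sum_{r \in \operatorname{range}(R_1)} \operatorname{E}[R_1 \cdot R_2 \mid R_1 = r] \cdot \Pr\{R_1 = r\} && \text{(Thm 3.4)} \\
&= \sum_{r} \operatorname{E}[r \cdot R_2 \mid R_1 = r] \cdot \Pr\{R_1 = r\} \\
&= \sum_{r} r \cdot \operatorname{E}[R_2 \mid R_1 = r] \cdot \Pr\{R_1 = r\} && \text{(Lemma 1.2)} \\
&= \sum_{r} r \cdot \operatorname{E}[R_2] \cdot \Pr\{R_1 = r\} && \text{(Lemma 4.1)} \\
&= \operatorname{E}[R_2] \sum_{r} r \cdot \Pr\{R_1 = r\} && \text{(factor out constant $\operatorname{E}[R_2]$)} \\
&= \operatorname{E}[R_2] \cdot \operatorname{E}[R_1].
\end{aligned}
\]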

Theorem 4.2 extends routinely to a collection of mutually independent variables.

Corollary 4.3. If random variables R1 , R2 , . . . , Rk are mutually independent, then

\[
\operatorname{E}\left[\prod_{i=1}^{k} R_i\right] = \prod_{i=1}^{k} \operatorname{E}[R_i].
\]

4.2 The Product of Two Dice

Suppose we throw two independent, fair dice and multiply the numbers that come up.
What is the expected value of this product?
Let random variables R1 and R2 be the numbers shown on the two dice. We can compute
the expected value of the product as follows:

E [R1 · R2 ] = E [R1 ] · E [R2 ] = 3.5 · 3.5 = 12.25. (1)

Here the ﬁrst equality holds by Theorem 4.2 because the dice are independent.
Now suppose that the two dice are not independent; in fact, assume that the second die
is always the same as the ﬁrst. In this case, the product of expectations will not equal the
expectation of the product.

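To verify this, let random variables R1 and R2 be the numbers shown on the two dice. We
can compute the expected value of the product without Theorem 4.2 as follows:

\[
\begin{aligned}
\operatorname{E}[R_1 \cdot R_2] &= \operatorname{E}[R_1^2] && (R_2 = R_1) \\
&= \sum_{i=1}^{6} i^2 \cdot \Pr\{R_1^2 = i^2\} \\
&= \sum_{i=1}^{6} i^2 \cdot \Pr\{R_1 = i\} \\
&= \frac{1^2}{6} + \frac{2^2}{6} + \frac{3^2}{6} + \frac{4^2}{6} + \frac{5^2}{6} + \frac{6^2}{6} \\
&= \frac{91}{6} \\
&\neq 12.25 \\
&= \operatorname{E}[R_1] \cdot \operatorname{E}[R_2]. && \text{from (1)}
\end{aligned}
\]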
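Both product expectations can be computed exactly by enumerating outcomes. In this small sketch the variable names `independent` and `identical` are illustrative labels for the two scenarios just discussed.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Independent dice: each of the 36 pairs (a, b) has probability 1/36.
independent = sum(Fraction(a * b, 36) for a, b in product(faces, faces))
# Second die always equals the first: each pair (a, a) has probability 1/6.
identical = sum(Fraction(a * a, 6) for a in faces)
print(independent, identical)  # 49/4 91/6
```

Here 49/4 = 12.25 matches the product of the expectations, while 91/6 (about 15.17) does not.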

5 Expect the Mean

We have seen several examples of random variables that never take a value equal to their
mean. But experience suggests that we can expect the values of a variable to be near its
mean – usually – which is why the mean is also called the “expectation.” In other words,
the values of a random variable probably won’t deviate very much from the mean. We will
describe some basic results about this central topic of deviation from the mean, and we will
indicate how these results apply for testing hypotheses and estimating by sampling.
In these notes we develop two results. The ﬁrst is Markov’s Theorem, which gives a
simple, but typically coarse, upper bound on the probability that the value of a random
variable is more than a certain multiple of its mean. Markov’s result holds if we know
nothing about a random variable except what its mean is and that its values are nonnegative.
Accordingly, Markov’s Theorem is very general, but also is much weaker than
results which take into account more information about the distribution of the variable.
In many situations, we not only know the mean, but also another numerical quantity
called the variance of the random variable. Our second basic result is Chebyshev’s Theorem,
which combines Markov’s Theorem and information about the variance to give
more refined bounds. We will also examine properties of variance and ways to calculate
it.

6 Markov’s Theorem
Markov’s theorem gives a generally rough estimate of the probability that a random variable
takes a value much larger than its mean.

The idea behind Markov’s Theorem can be explained with a simple example of intelligence
quotient, IQ. IQ was devised so that the average IQ measurement would be 100. Now from
this fact alone we can conclude that at most 1/2 the population can have an IQ of 200 or
more, because if more than half had an IQ of 200, then the average would have to be more
than (1/2)200 = 100, contradicting the fact that the average is 100. So the probability that
a randomly chosen person has an IQ of 200 or more is at most 1/2. Of course this is not
a very strong conclusion; in fact no IQ of over 200 has ever been recorded. But by the
same logic, we can also conclude that at most 2/3 of the population can have an IQ of
150 or more. IQ’s of over 150 have certainly been recorded, though again, a much smaller
fraction of the population actually has an IQ that high.
But although these conclusions about IQ are weak, they are actually the strongest possible
general conclusions that can be reached about a nonnegative random variable using only
the fact that its mean is 100. For example, if we choose a random variable equal to 200
with probability 1/2, and 0 with probability 1/2, then its mean is 100, and the probability
of a value of 200 or more is really 1/2. So we can’t hope to get a better upper bound than
1/2 on the probability of a value of 200 or more.
Theorem 6.1 (Markov’s Theorem). If R is a nonnegative random variable, then for all x > 0

\[
\Pr\{R \ge x\} \le \frac{\operatorname{E}[R]}{x}.
\]

Proof. We will show that E [R] ≥ x Pr {R ≥ x}. Dividing both sides by x gives the desired
result.
So let Ix be the indicator variable for the event [R ≥ x], and consider the random variable
xIx . Note that
R ≥ xIx ,
because at any sample point, w,

• if R(w) ≥ x then R(w) ≥ x = x · 1 = xIx (w), and

• if R(w) < x then R(w) ≥ 0 = x · 0 = xIx (w).

Therefore,

\[
\begin{aligned}
\operatorname{E}[R] &\ge \operatorname{E}[x I_x] && \text{(since $R \ge x I_x$)} \\
&= x \operatorname{E}[I_x] && \text{(linearity of $\operatorname{E}[\cdot]$)} \\
&= x \Pr\{I_x = 1\} && \text{(because $I_x$ is an indicator variable)} \\
&= x \Pr\{R \ge x\}. && \text{(def. of $I_x$)}
\end{aligned}
\]

Markov’s Theorem is often expressed in an alternative form, stated below as an immediate
corollary.

Corollary 6.2. If R is a nonnegative random variable, then for all c ≥ 1

\[
\Pr\{R \ge c \cdot \operatorname{E}[R]\} \le \frac{1}{c}.
\]

Proof. In Markov’s Theorem, set x = c · E [R].

6.1 Examples of Markov’s Theorem

Suppose that n men go to a dinner party and check their hats. At the end of the night,
the hats are randomly permuted and returned, so each man gets his own hat back with
probability 1/n. What is the probability that x or more men get the right hat?
We can compute an upper bound with Markov’s Theorem. Let the random variable, R,
be the number of men that get the right hat. In Section 1.3, we used linearity of
expectation to show that E [R] = 1. By Markov’s Theorem, the probability that x or more
men get the right hat satisfies:

\[
\Pr\{R \ge x\} \le \frac{\operatorname{E}[R]}{x} = \frac{1}{x}.
\]

For example, there is no better than a 20% chance that 5 or more men get the right hat,
regardless of the number of people at the dinner party.
The Chinese Appetizer problem is very similar. In this case, n people are eating Chinese
appetizers arranged on a circular, rotating tray. Someone then spins the tray so that each
person receives a random appetizer. What is the probability that everyone gets the same
appetizer as before?
There are n equally likely orientations for the tray after it stops spinning. Everyone gets
the right appetizer in just one of these n orientations. Therefore, the correct answer is 1/n.
But what probability do we get from Markov’s Theorem? Let the random variable, R, be
the number of people that get the right appetizer. You can show that E [R] = 1 (right?).
Applying Markov’s Theorem, we ﬁnd:

\[
\Pr\{R \ge n\} \le \frac{\operatorname{E}[R]}{n} = \frac{1}{n}.
\]

So for the Chinese appetizer problem, Markov’s Theorem is tight!

On the other hand, Markov’s Theorem gives the same 1/n bound for the probability that
everyone gets their hat in the hat check problem. But in reality, the probability of this event
is 1/(n!). So for the hat check problem, Markov’s Theorem gives probability bounds
that are way off.
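To get a feel for how loose the bound is in the hat check problem, we can compare it with exact tail probabilities for a small party. The sketch assumes the hats are returned according to a uniformly random permutation; `tail_probability` is a made-up helper name and n = 5 is an arbitrary choice.

```python
from fractions import Fraction
from itertools import permutations

def tail_probability(n, x):
    """Exact Pr{R >= x}, where R counts men who get their own hat back
    under a uniformly random permutation of n hats."""
    perms = list(permutations(range(n)))
    hits = sum(1 for p in perms if sum(i == hat for i, hat in enumerate(p)) >= x)
    return Fraction(hits, len(perms))

n = 5
for x in range(1, n + 1):
    # Columns: x, exact probability, Markov bound 1/x.
    print(x, tail_probability(n, x), Fraction(1, x))
```

For x = n = 5 the exact probability is 1/120, while Markov’s Theorem only promises 1/5.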

6.2 Markov’s Theorem for Bounded Variables

Suppose we learn that the average IQ among MIT students is 150 (which is not true, by
the way). What can we say about the probability that an MIT student has an IQ of more
than 200? Markov’s theorem immediately tells us that no more than 150/200 or 3/4 of
the students can have such a high IQ. Here we simply applied Markov’s Theorem to the
random variable, R, equal to the IQ of a random MIT student to conclude:

\[
\Pr\{R > 200\} \le \frac{\operatorname{E}[R]}{200} = \frac{150}{200} = \frac{3}{4}.
\]

But let’s observe an additional fact (which may be true): no MIT student has an IQ less
than 100. This means that if we let T ::= R − 100, then T is nonnegative and E [T ] = 50, so
we can apply Markov’s Theorem to T and conclude:

\[
\Pr\{R > 200\} = \Pr\{T > 100\} \le \frac{\operatorname{E}[T]}{100} = \frac{50}{100} = \frac{1}{2}.
\]
So only half, not 3/4, of the students can be as amazing as they think they are. A bit of a
relief!
More generally, we can get better bounds applying Markov’s Theorem to R − l instead of
R for any lower bound l > 0 on R.
Similarly, if we have any upper bound, u, on a random variable, S, then u − S will be a
nonnegative random variable, and applying Markov’s Theorem to u − S will allow us to
bound the probability that S is much less than its expectation.
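The shifting trick is mechanical enough to write as a small function. This is only a sketch: the name `markov_bound` and its `lower` parameter are invented here, and the result is capped at 1 simply because no probability exceeds 1.

```python
from fractions import Fraction

def markov_bound(mean, x, lower=0):
    """Markov upper bound on Pr{R >= x}, given E[R] = mean and R >= lower.

    Applies Markov's Theorem to T = R - lower, which is nonnegative.
    """
    if x <= lower:
        raise ValueError("x must exceed the lower bound")
    bound = Fraction(mean - lower) / Fraction(x - lower)
    return min(bound, Fraction(1))  # a probability never exceeds 1

print(markov_bound(150, 200))             # 3/4
print(markov_bound(150, 200, lower=100))  # 1/2
```

Since Pr {R > 200} ≤ Pr {R ≥ 200}, the same numbers bound the probability of an IQ strictly above 200.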

7 Chebyshev’s Theorem

We have versions of Markov’s Theorem for the probability of deviation above the mean,
but often we want bounds that apply to distance from the mean in either direction, that is,
bounds on the probability that |R − E [R]| is large.
It is a bit messy to apply Markov’s Theorem directly to this problem, because it’s generally
not easy to compute E [|R − E [R]|]. However, since |R| and hence |R|^k are nonnegative
variables for any R, Markov’s inequality also applies to the event [|R|^k ≥ x^k]. But this
event is equivalent to the event [|R| ≥ x], so we have:

Lemma 7.1. For any random variable R, any positive integer k, and any x > 0,

\[
\Pr\{|R| \ge x\} \le \frac{\operatorname{E}\left[|R|^k\right]}{x^k}.
\]

The special case of this Lemma for k = 2 can be applied to bound the random variable,
|R − E [R]|, that measures R’s deviation from its mean. Namely

\[
\Pr\{|R - \operatorname{E}[R]| \ge x\} = \Pr\{(R - \operatorname{E}[R])^2 \ge x^2\} \le \frac{\operatorname{E}[(R - \operatorname{E}[R])^2]}{x^2}, \tag{2}
\]

where the inequality (2) follows by applying Lemma 7.1 to the nonnegative random variable,
(R − E [R])^2. Assuming that the quantity E [(R − E [R])^2] above is finite, we can
conclude that the probability that R deviates from its mean by more than x is O(1/x^2).

Definition 7.2. The variance, Var [R], of a random variable, R, is:

\[
\operatorname{Var}[R] ::= \operatorname{E}\left[(R - \operatorname{E}[R])^2\right].
\]

So we can restate (2) as

Theorem 7.3 (Chebyshev). Let R be a random variable, and let x be a positive real number.
Then

\[
\Pr\{|R - \operatorname{E}[R]| \ge x\} \le \frac{\operatorname{Var}[R]}{x^2}.
\]

The expression E [(R − E [R])^2] for variance is a bit cryptic; the best approach is to work
through it from the inside out. The innermost expression, R − E [R], is precisely the deviation
of R above its mean. Squaring this, we obtain (R − E [R])^2. This is a random variable
that is near 0 when R is close to the mean and is a large positive number when R deviates
far above or below the mean. So if R is always close to the mean, then the variance will
be small. If R is often far from the mean, then the variance will be large.

7.1 Variance in Two Gambling Games

The relevance of variance is apparent when we compare the following two gambling
games.
Game A: We win $2 with probability 2/3 and lose $1 with probability 1/3.
Game B: We win $1002 with probability 2/3 and lose $2001 with probability 1/3.
Which game is better ﬁnancially? We have the same probability, 2/3, of winning each
game, but that does not tell the whole story. What about the expected return for each
game? Let random variables A and B be the payoffs for the two games. For example, A is
2 with probability 2/3 and −1 with probability 1/3. We can compute the expected payoff
for each game as follows:

\[
\begin{aligned}
\operatorname{E}[A] &= 2 \cdot \frac{2}{3} + (-1) \cdot \frac{1}{3} = 1, \\
\operatorname{E}[B] &= 1002 \cdot \frac{2}{3} + (-2001) \cdot \frac{1}{3} = 1.
\end{aligned}
\]

The expected payoff is the same for both games, but they are obviously very different!
This difference is not apparent in their expected value, but is captured by variance. We
can compute the Var [A] by working “from the inside out” as follows:

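\[
\begin{aligned}
A - \operatorname{E}[A] &= \begin{cases} 1 & \text{with probability } \frac{2}{3} \\ -2 & \text{with probability } \frac{1}{3} \end{cases} \\
(A - \operatorname{E}[A])^2 &= \begin{cases} 1 & \text{with probability } \frac{2}{3} \\ 4 & \text{with probability } \frac{1}{3} \end{cases} \\
\operatorname{E}\left[(A - \operatorname{E}[A])^2\right] &= 1 \cdot \frac{2}{3} + 4 \cdot \frac{1}{3} \\
\operatorname{Var}[A] &= 2.
\end{aligned}
\]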

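Similarly, we have for Var [B]:

\[
\begin{aligned}
B - \operatorname{E}[B] &= \begin{cases} 1001 & \text{with probability } \frac{2}{3} \\ -2002 & \text{with probability } \frac{1}{3} \end{cases} \\
(B - \operatorname{E}[B])^2 &= \begin{cases} 1{,}002{,}001 & \text{with probability } \frac{2}{3} \\ 4{,}008{,}004 & \text{with probability } \frac{1}{3} \end{cases} \\
\operatorname{E}\left[(B - \operatorname{E}[B])^2\right] &= 1{,}002{,}001 \cdot \frac{2}{3} + 4{,}008{,}004 \cdot \frac{1}{3} \\
\operatorname{Var}[B] &= 2{,}004{,}002.
\end{aligned}
\]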

The variance of Game A is 2 and the variance of Game B is more than two million! Intuitively,
this means that the payoff in Game A is usually close to the expected value of $1,
but the payoff in Game B can deviate very far from this expected value.
High variance is often associated with high risk. For example, in ten rounds of Game A,
we expect to make $10, but could conceivably lose $10 instead. On the other hand, in ten
rounds of game B, we also expect to make $10, but could actually lose more than $20,000!
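The “inside out” calculation is easy to mechanize. The sketch below recomputes the mean and variance of both games with exact fractions; the names `mean`, `variance`, `game_a` and `game_b` are illustrative, and each game is represented as a dict from payoff to probability.

```python
from fractions import Fraction

def mean(dist):
    """E[R] for a distribution given as a dict mapping value -> probability."""
    return sum(p * v for v, p in dist.items())

def variance(dist):
    """Var[R] = E[(R - E[R])^2], computed from the inside out."""
    mu = mean(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist.items())

game_a = {2: Fraction(2, 3), -1: Fraction(1, 3)}
game_b = {1002: Fraction(2, 3), -2001: Fraction(1, 3)}

print(mean(game_a), variance(game_a))  # 1 2
print(mean(game_b), variance(game_b))  # 1 2004002
```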

7.2 Standard Deviation

Because of its deﬁnition in terms of the square of a random variable, the variance of a
random variable may be very far from a typical deviation from the mean. For example, in
Game B above, the deviation from the mean is 1001 in one outcome and 2002 in the other.
But the variance is a whopping 2,004,002. From a dimensional analysis viewpoint, the
“units” of variance are wrong: if the random variable is in dollars, then the expectation is
also in dollars, but the variance is in square dollars. For this reason, people often describe
random variables using standard deviation instead of variance.

Definition 7.4. The standard deviation, σR , of a random variable, R, is the square root of
the variance:

\[
\sigma_R ::= \sqrt{\operatorname{Var}[R]} = \sqrt{\operatorname{E}[(R - \operatorname{E}[R])^2]}.
\]

[Figure 1 omitted; surviving labels: mean, stdev, axis ticks 0 and 100.]

Figure 1: The standard deviation of a distribution indicates how wide the “main part” of
it is.

So the standard deviation is the square root of the mean of the square of the deviation,
or the “root mean square” for short. It has the same units—dollars in our example—
as the original random variable and as the mean. Intuitively, it measures the “expected
(average) deviation from the mean,” since we can think of the square root on the outside
as canceling the square on the inside.
Example 7.5. The standard deviation of the payoff in Game B is:

\[
\sigma_B = \sqrt{\operatorname{Var}[B]} = \sqrt{2{,}004{,}002} \approx 1416.
\]

The random variable B actually deviates from the mean by either positive 1001 or negative
2002; therefore, the standard deviation of 1416 describes this situation reasonably
well.
Intuitively, the standard deviation measures the “width” of the “main part” of the distribution
graph, as illustrated in Figure 1.
There is a useful, simple reformulation of Chebyshev’s Theorem in terms of standard
deviation.
Corollary 7.6. Let R be a random variable, and let c be a positive real number. Then

\[
\Pr\{|R - \operatorname{E}[R]| \ge c\sigma_R\} \le \frac{1}{c^2}.
\]

Here we see explicitly how the “likely” values of R are clustered in an O(σR )-sized region
around E [R], confirming that the standard deviation measures how spread out the
distribution of R is around its mean.

Proof. Substituting x = cσR in Chebyshev’s Theorem gives:

\[
\Pr\{|R - \operatorname{E}[R]| \ge c\sigma_R\} \le \frac{\operatorname{Var}[R]}{(c\sigma_R)^2} = \frac{\sigma_R^2}{(c\sigma_R)^2} = \frac{1}{c^2}.
\]

7.3 The IQ Example

Suppose that, in addition to the national average IQ being 100, we also know the standard
deviation of IQ’s is 10. How rare is an IQ of 200 or more?
Let the random variable, R, be the IQ of a random person. So we are supposing that
E [R] = 100, σR = 10, and R is nonnegative. We want to compute Pr {R ≥ 200}.
We have already seen that Markov’s Theorem 6.1 gives a coarse bound, namely,

\[
\Pr\{R \ge 200\} \le \frac{1}{2}.
\]
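Now we apply Chebyshev’s Theorem to the same problem:

\[
\Pr\{R \ge 200\} \le \Pr\{|R - 100| \ge 100\} \le \frac{\operatorname{Var}[R]}{100^2} = \frac{10^2}{100^2} = \frac{1}{100}.
\]

The purpose of the first step is to express the desired probability in the form required
by Chebyshev’s Theorem: whenever R ≥ 200 we also have |R − 100| ≥ 100, so the first
probability is at most the second (and the two are equal if the nonnegative variable R is
never 0). Chebyshev’s Theorem then yields the second inequality.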
So Chebyshev’s Theorem implies that at most one person in a hundred has an IQ of 200
or more. We have gotten a much tighter bound using the additional information, namely
the variance of R, than we could get knowing only the expectation.
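The two bounds for the IQ example can be put side by side in a few lines. The function names `markov_bound` and `chebyshev_bound` below are invented for this sketch; they simply evaluate the right-hand sides of Theorem 6.1 and Theorem 7.3.

```python
from fractions import Fraction

def markov_bound(mean, x):
    """Upper bound on Pr{R >= x} for a nonnegative R with E[R] = mean."""
    return Fraction(mean, x)

def chebyshev_bound(variance, x):
    """Upper bound on Pr{|R - E[R]| >= x} for an R with Var[R] = variance."""
    return Fraction(variance, x * x)

mean, sigma = 100, 10
print(markov_bound(mean, 200))                # 1/2
print(chebyshev_bound(sigma**2, 200 - mean))  # 1/100
```

Knowing the standard deviation tightens the bound on an IQ of 200 or more by a factor of 50.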

8 Properties of Variance

8.1 Why Variance?

The definition of variance of R as E [(R − E [R])^2] may seem rather arbitrary. The variance
is the average of the square of the deviation from the mean. For this reason, variance is
sometimes called the “mean squared deviation.” But why bother squaring? Why not
simply compute the average deviation from the mean? That is, why not define variance
to be E [R − E [R]]?
The problem with this definition is that the positive and negative deviations from the
mean exactly cancel. By linearity of expectation, we have:

E [R − E [R]] = E [R] − E [E [R]] .

Since E [R] is a constant, its expected value is itself. Therefore

E [R − E [R]] = E [R] − E [R] = 0.

By this deﬁnition, every random variable has zero variance. That is not useful! Because
of the square in the conventional deﬁnition, both positive and negative deviations from
the mean increase the variance; positive and negative deviations do not cancel.

Of course, we could also prevent positive and negative deviations from canceling by taking
an absolute value. That is, we could define variance to be E [|R − E [R]|]. There is
no logical reason not to use this definition. However, the conventional version of variance
has some valuable mathematical properties which the absolute value version does
not. We describe these properties in the following sections and use them to determine the
variance of some important probability distributions.

8.2 An Alternative Deﬁnition of Variance

There is an equivalent way to define the variance of a random variable that is less intuitive,
but is often easier to use in calculations and proofs:

Theorem 8.1.

\[
\operatorname{Var}[R] = \operatorname{E}[R^2] - \operatorname{E}^2[R],
\]

for any random variable, R.

Here we use the notation E^2 [R] as shorthand for (E [R])^2.

Remember that E [R^2] is generally not equal to E^2 [R]. We know the expected value of
a product is the product of the expected values for independent variables, but not in
general. And R is not independent of itself unless it is constant.

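Proof. Let µ = E [R]. Then

\[
\begin{aligned}
\operatorname{Var}[R] &= \operatorname{E}\left[(R - \operatorname{E}[R])^2\right] && \text{(Def. 7.2 of variance)} \\
&= \operatorname{E}\left[(R - \mu)^2\right] && \text{(def. of $\mu$)} \\
&= \operatorname{E}\left[R^2 - 2\mu R + \mu^2\right] \\
&= \operatorname{E}[R^2] - 2\mu \operatorname{E}[R] + \mu^2 && \text{(linearity of expectation)} \\
&= \operatorname{E}[R^2] - 2\mu^2 + \mu^2 && \text{(def. of $\mu$)} \\
&= \operatorname{E}[R^2] - \mu^2 \\
&= \operatorname{E}[R^2] - \operatorname{E}^2[R]. && \text{(def. of $\mu$)}
\end{aligned}
\]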

For example, if B is a Bernoulli variable where p ::= Pr {B = 1}, then

Var [B] = p − p2 = p(1 − p). (3)

Proof. Since B only takes values 0 and 1, we have E [B] = p · 1 + (1 − p) · 0 = p. Since

B = B 2 , we also have E [B 2 ] = p, so (3) follows immediately from (8.1).

8.2.1 Zero Variance

When does a random variable, R, have zero variance? Exactly when the random variable never
deviates from the mean!

Lemma 8.2. The variance of a random variable, R, is zero if and only if Pr {R = E [R]} = 1.

So saying that Var [R] = 0 is almost the same as saying that R is constant. Namely, it takes
the constant value equal to its expectation on all sample points with nonzero probability.
(It can take on any ﬁnite values on sample points with zero probability without affecting
the variance.)

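Proof. By the definition of variance,
\[
\operatorname{Var}[R] = 0 \quad\text{iff}\quad \operatorname{E}[(R - \operatorname{E}[R])^2] = 0.
\]
The inner expression on the right, $(R - \operatorname{E}[R])^2$, is always nonnegative because of the
square. As a result, $\operatorname{E}[(R - \operatorname{E}[R])^2] = 0$ if and only if $\Pr\{(R - \operatorname{E}[R])^2 \neq 0\}$ is zero, which
is the same as saying that $\Pr\{(R - \operatorname{E}[R])^2 = 0\}$ is one. That is,
\[
\operatorname{Var}[R] = 0 \quad\text{iff}\quad \Pr\{(R - \operatorname{E}[R])^2 = 0\} = 1.
\]
But $(R - \operatorname{E}[R])^2 = 0$ and $R = \operatorname{E}[R]$ are different descriptions of the same event.
Therefore,
\[
\operatorname{Var}[R] = 0 \quad\text{iff}\quad \Pr\{R = \operatorname{E}[R]\} = 1.
\]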

8.2.2 Dealing with Constants

The following theorem describes how the variance of a random variable changes when it
is scaled or shifted by a constant.

Theorem 8.3. Let R be a random variable, and let a and b be constants. Then
\[
\operatorname{Var}[aR + b] = a^2 \operatorname{Var}[R]. \tag{4}
\]

This theorem makes two points. First, adding a constant b to a random variable does
not affect the variance. Second, multiplying a random variable by a constant a multiplies
the variance by $a^2$.

Proof. We will transform the left side of (4) into the right side. The first step is to expand
$\operatorname{Var}[aR + b]$ using the alternative definition of variance (Theorem 8.1):
\[
\operatorname{Var}[aR + b] = \operatorname{E}[(aR + b)^2] - \operatorname{E}^2[aR + b].
\]

We will work on the first term and then the second term. For the first term, note that by
linearity of expectation,
\[
\operatorname{E}[(aR + b)^2] = \operatorname{E}[a^2 R^2 + 2abR + b^2] = a^2 \operatorname{E}[R^2] + 2ab \operatorname{E}[R] + b^2. \tag{5}
\]

Similarly for the second term:

\[
\operatorname{E}^2[aR + b] = (a \operatorname{E}[R] + b)^2 = a^2 \operatorname{E}^2[R] + 2ab \operatorname{E}[R] + b^2. \tag{6}
\]

Finally, we subtract the expanded second term from the ﬁrst.

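\begin{align*}
\operatorname{Var}[aR + b] &= \operatorname{E}[(aR + b)^2] - \operatorname{E}^2[aR + b] && \text{(Theorem 8.1)} \\
&= a^2 \operatorname{E}[R^2] + 2ab \operatorname{E}[R] + b^2 - \left(a^2 \operatorname{E}^2[R] + 2ab \operatorname{E}[R] + b^2\right) && \text{(by (5) and (6))} \\
&= a^2 \operatorname{E}[R^2] - a^2 \operatorname{E}^2[R] \\
&= a^2 \left(\operatorname{E}[R^2] - \operatorname{E}^2[R]\right) \\
&= a^2 \operatorname{Var}[R]. && \text{(Theorem 8.1)}
\end{align*}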

A similar rule holds for the standard deviation when a random variable is adjusted by a
constant. Recall that standard deviation is the square root of variance. Therefore, adding
a constant b to a random variable does not change the standard deviation. Multiplying a
random variable by a constant a multiplies the standard deviation by $\sqrt{a^2} = |a|$. So we have
Corollary 8.4. The standard deviation of aR + b equals |a| times the standard deviation of R.
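
The scaling rule can be checked numerically. The sketch below is an illustration under assumed inputs (a fair die, with a = -3 and b = 10 chosen arbitrarily). It builds the distribution of aR + b explicitly and compares its variance with a^2 Var[R]. A negative a is used on purpose, since the variance picks up a factor of a^2 = 9 and the standard deviation therefore scales by |a| = 3, not by a.

```python
from fractions import Fraction


def variance(dist):
    """Var[R] = E[R^2] - E^2[R] for a finite distribution {value: probability}."""
    mean = sum(p * r for r, p in dist.items())
    mean_sq = sum(p * r * r for r, p in dist.items())
    return mean_sq - mean ** 2


def affine(dist, a, b):
    """Distribution of a*R + b; values that coincide (e.g. a == 0) are merged."""
    out = {}
    for r, p in dist.items():
        out[a * r + b] = out.get(a * r + b, 0) + p
    return out


# Assumed example values, not taken from the notes.
die = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = -3, 10

lhs = variance(affine(die, a, b))   # Var[aR + b]
rhs = a * a * variance(die)         # a^2 Var[R]
print(lhs, rhs)  # 105/4 105/4
```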

8.3 Variance of a Sum

Earlier, we claimed that for independent random variables, the variance of a sum is the
sum of the variances.
An independence condition is necessary. If we ignored independence, then we would
conclude that Var [R + R] = Var [R] + Var [R]. However, by Theorem 8.3, the left side is
equal to 4 Var [R], whereas the right side is 2 Var [R]. This implies that Var [R] = 0, which,
by Lemma 8.2, essentially only holds if R is constant.
However, mutual independence is not necessary: pairwise independence will do. This is
useful to know because there are some important situations involving variables that are
pairwise independent but not mutually independent. Matching birthdays is an example
of this kind, as we shall see below.
Theorem 8.5. [Pairwise Independent Additivity of Variance] If $R_1, R_2, \dots, R_n$ are pairwise
independent random variables, then
\[
\operatorname{Var}[R_1 + R_2 + \cdots + R_n] = \operatorname{Var}[R_1] + \operatorname{Var}[R_2] + \cdots + \operatorname{Var}[R_n].
\]


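Proof. By linearity of expectation, we have
\begin{align*}
\operatorname{E}\left[\left(\sum_{i=1}^{n} R_i\right)^2\right]
&= \operatorname{E}\left[\sum_{i=1}^{n} \sum_{j=1}^{n} R_i R_j\right] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{E}[R_i R_j] && \text{(linearity)} \\
&= \sum_{1 \le i \neq j \le n} \operatorname{E}[R_i] \operatorname{E}[R_j] + \sum_{i=1}^{n} \operatorname{E}[R_i^2]. && \text{(pairwise independence)} \tag{7}
\end{align*}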

In (7), we use the fact that the expectation of the product of two independent variables is
the product of their expectations.
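Also,
\begin{align*}
\operatorname{E}^2\left[\sum_{i=1}^{n} R_i\right]
&= \left(\operatorname{E}\left[\sum_{i=1}^{n} R_i\right]\right)^2 \\
&= \left(\sum_{i=1}^{n} \operatorname{E}[R_i]\right)^2 && \text{(linearity)} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{E}[R_i] \operatorname{E}[R_j] \\
&= \sum_{1 \le i \neq j \le n} \operatorname{E}[R_i] \operatorname{E}[R_j] + \sum_{i=1}^{n} \operatorname{E}^2[R_i]. \tag{8}
\end{align*}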

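So,
\begin{align*}
\operatorname{Var}\left[\sum_{i=1}^{n} R_i\right]
&= \operatorname{E}\left[\left(\sum_{i=1}^{n} R_i\right)^2\right] - \operatorname{E}^2\left[\sum_{i=1}^{n} R_i\right] && \text{(Theorem 8.1)} \\
&= \sum_{1 \le i \neq j \le n} \operatorname{E}[R_i] \operatorname{E}[R_j] + \sum_{i=1}^{n} \operatorname{E}[R_i^2]
 - \left(\sum_{1 \le i \neq j \le n} \operatorname{E}[R_i] \operatorname{E}[R_j] + \sum_{i=1}^{n} \operatorname{E}^2[R_i]\right) && \text{(by (7) and (8))} \\
&= \sum_{i=1}^{n} \operatorname{E}[R_i^2] - \sum_{i=1}^{n} \operatorname{E}^2[R_i] \\
&= \sum_{i=1}^{n} \left(\operatorname{E}[R_i^2] - \operatorname{E}^2[R_i]\right) && \text{(reordering the sums)} \\
&= \sum_{i=1}^{n} \operatorname{Var}[R_i]. && \text{(Theorem 8.1)}
\end{align*}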
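
To see that pairwise independence really is enough, here is a small sketch using a standard example that is not from these notes: two fair coin flips X and Y together with Z = X XOR Y. The three variables are pairwise independent but not mutually independent, yet the variance of their sum still equals the sum of their variances. All names in the code are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips; each outcome is (X, Y, Z) with Z = X XOR Y.
outcomes = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
prob = Fraction(1, 4)  # each of the four outcomes is equally likely


def expect(f):
    """E[f] over the four-point sample space."""
    return sum(prob * f(w) for w in outcomes)


def var(f):
    """Var[f] = E[f^2] - E^2[f]."""
    return expect(lambda w: f(w) ** 2) - expect(f) ** 2


def independent(i, j):
    """Check Pr{W_i = a, W_j = b} = Pr{W_i = a} Pr{W_j = b} for all a, b."""
    return all(
        expect(lambda w: int(w[i] == a and w[j] == b))
        == expect(lambda w: int(w[i] == a)) * expect(lambda w: int(w[j] == b))
        for a, b in product((0, 1), repeat=2)
    )


pairwise = all(independent(i, j) for i, j in ((0, 1), (0, 2), (1, 2)))
# Mutual independence would require Pr{X=1, Y=1, Z=1} = 1/8, but it is 0.
mutual = expect(lambda w: int(w == (1, 1, 1))) == Fraction(1, 8)

var_of_sum = var(lambda w: w[0] + w[1] + w[2])
sum_of_vars = sum(var(lambda w, k=k: w[k]) for k in range(3))
print(pairwise, mutual, var_of_sum, sum_of_vars)  # True False 3/4 3/4
```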

Now we have a simple way of computing the variance of a variable $H_{n,p}$ which has a
binomial distribution with parameters n and p. We know that $H_{n,p} = \sum_{k=1}^{n} I_k$ where the
$I_k$ are mutually independent 0-1-valued variables with $\Pr\{I_k = 1\} = p$. The variance of
each $I_k$ is $p(1 - p)$ by (3), so by additivity of variance (Theorem 8.5), we have

Lemma (Variance of the Binomial Distribution).
\[
\operatorname{Var}[H_{n,p}] = n \operatorname{Var}[I_k] = np(1 - p). \tag{9}
\]
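
The formula in (9) can be cross-checked against a direct computation from the binomial probability mass function. The sketch below does this with exact fractions; the parameters n = 10 and p = 1/3 are arbitrary assumed values, and `binomial_variance` is an illustrative helper name.

```python
from fractions import Fraction
from math import comb


def binomial_variance(n, p):
    """Variance computed directly from the binomial pmf via E[H^2] - E^2[H]."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    mean_sq = sum(k * k * q for k, q in enumerate(pmf))
    return mean_sq - mean**2


n, p = 10, Fraction(1, 3)  # assumed example parameters
direct = binomial_variance(n, p)
formula = n * p * (1 - p)
print(direct, formula)  # 20/9 20/9
```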

9 Estimation by Random Sampling

9.1 Estimating Voting Preferences using Chebyshev’s Theorem

In Notes 13, we used bounds on the binomial distribution to determine conﬁdence levels
for a poll of voter preferences of Clinton vs. Giuliani. Now that we know the variance of
the binomial distribution, we can use Chebyshev’s Theorem as an alternative approach
to calculate poll size.
The setup is the same as in Notes 13: we will poll n randomly chosen voters and let Sn
be the total number in our sample who preferred Clinton. We use Sn /n as our estimate
of the actual fraction, p, of all voters who prefer Clinton. We want to choose n so that our
estimate will be within 0.04 of p at least 95% of the time.
Now $S_n$ is binomially distributed, so from (9) we have
\[
\operatorname{Var}[S_n] = n\,p(1 - p) \le n \cdot \frac{1}{4}.
\]
The bound of 1/4 follows from the easily verified fact that $p(1 - p)$ is maximized when
$p = 1 - p$, that is, when $p = 1/2$.
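Next, we bound the variance of $S_n/n$:
\begin{align*}
\operatorname{Var}\left[\frac{S_n}{n}\right] &= \left(\frac{1}{n}\right)^2 \operatorname{Var}[S_n] && \text{(by (4))} \\
&\le \left(\frac{1}{n}\right)^2 \cdot \frac{n}{4} && \text{(by the bound on $\operatorname{Var}[S_n]$ above)} \\
&= \frac{1}{4n}. \tag{10}
\end{align*}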
Now from Chebyshev and (10) we have:
\[
\Pr\left\{\left|\frac{S_n}{n} - p\right| \ge 0.04\right\} \le \frac{\operatorname{Var}[S_n/n]}{(0.04)^2} \le \frac{1}{4n(0.04)^2} = \frac{156.25}{n}. \tag{11}
\]

To make our estimate with 95% confidence, we want the right-hand side of (11)
to be at most 1/20. So we choose n so that
\[
\frac{156.25}{n} \le \frac{1}{20},
\]
that is,
\[
n \ge 3{,}125.
\]
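
The same calculation can be packaged for other margins and confidence levels. The following is a minimal sketch, where the function name `chebyshev_poll_size` and its parameters are illustrative and not from the notes. It solves 1/(4 n margin^2) ≤ failure_prob for the smallest integer n, using exact fractions to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import ceil


def chebyshev_poll_size(margin, failure_prob):
    """Smallest n with 1 / (4 * n * margin^2) <= failure_prob.

    Relies only on the worst-case variance bound p(1 - p) <= 1/4.
    """
    return ceil(1 / (4 * margin**2 * failure_prob))


# The poll from the text: within 0.04 of p at least 95% of the time.
n = chebyshev_poll_size(Fraction(4, 100), Fraction(1, 20))
print(n)  # 3125
```

Halving the margin quadruples the required sample size, since n grows as 1/margin^2.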

You may remember that in Notes 13 we calculated that it was actually sufficient to poll
only 664 voters, many fewer than the 3,125 voters we derived using Chebyshev’s Theorem.
So the bound from Chebyshev’s Theorem is not nearly as good as the bound we
got earlier. This should not be surprising. In applying the Chebyshev Theorem, we used
only a bound on the variance of $S_n$. In Notes 13, on the other hand, we used the fact that
the random variable $S_n$ was binomial (with known parameter, n, and unknown parameter,
p). It makes sense that more detailed information about a distribution leads to better
bounds. But even though the bound was not as good, this example nicely illustrates an
approach to estimation using Chebyshev’s Theorem that is more widely applicable than
binomial estimations.

9.2 Birthdays again

There are important cases where the relevant distributions are not binomial because the
mutual independence properties of the voter preference example do not hold. In these
cases, estimation methods based on the Chebyshev bound may be the best approach.
Birthday Matching is an example.
We’ve already seen that in a class of one hundred or more, there is a very high probability
that some pair of students have birthdays on the same day of the year. We can also
easily calculate the expected number of pairs of students with matching birthdays. But
is it likely the number of matching pairs in a typical class will actually be close to the
expected number? We can take the same approach to answering this question as we did
in estimating voter preferences.
But notice that having matching birthdays for different pairs of students are not mutually
independent events. For example, knowing that Alice and Bob have matching birthdays,
and also that Ted and Alice have matching birthdays obviously implies that Bob and Ted
have matching birthdays. On the other hand, knowing that Alice and Bob have matching
birthdays tells us nothing about whether Alice and Carol have matching birthdays,
namely, these two events really are independent. So even though the events that various
pairs of students have matching birthdays are not mutually independent, indeed not
even three-way independent, they are pairwise independent.
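
This claim can be verified by brute force on a scaled-down model. The sketch below assumes a hypothetical "year" of only 5 equally likely days and three people, enumerates all birthday assignments, and checks that the three matching events are pairwise independent but not three-way independent. The names `DAYS`, `pr`, `ab`, `at`, and `bt` are illustrative.

```python
from fractions import Fraction
from itertools import product

DAYS = 5  # hypothetical tiny "year" so every outcome can be enumerated

# All equally likely birthday assignments for (Alice, Bob, Ted).
space = list(product(range(DAYS), repeat=3))


def pr(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in space if event(w)), len(space))


def ab(w):
    return w[0] == w[1]  # Alice and Bob match


def at(w):
    return w[0] == w[2]  # Alice and Ted match


def bt(w):
    return w[1] == w[2]  # Bob and Ted match


pairs = [(ab, at), (ab, bt), (at, bt)]
pairwise_independent = all(
    pr(lambda w: e(w) and f(w)) == pr(e) * pr(f) for e, f in pairs
)
three_way_independent = (
    pr(lambda w: ab(w) and at(w) and bt(w)) == pr(ab) * pr(at) * pr(bt)
)
print(pairwise_independent, three_way_independent)  # True False
```

With 5 days, any two of the events occur together with probability 1/25 = (1/5)(1/5), but all three occur together with probability 1/25 as well, not 1/125.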
This allows us to apply the same reasoning to Birthday Matching as we did for voter
preference. Namely, let B1 , B2 , . . . , Bn be the birthdays of n independently chosen people,
and let Ei,j be the indicator variable for the event that the ith and jth people chosen have
the same birthdays, that is, the event $[B_i = B_j]$. For simplicity, we’ll assume that for $i \neq j$,
the probability that $B_i = B_j$ is 1/365.[2] So the $B_i$’s are mutually independent variables,
and hence the $E_{i,j}$’s are pairwise independent variables, which is all we will need.
Let $M_n$ be the number of matching pairs of birthdays among the n choices, that is,
\[
M_n ::= \sum_{1 \le i < j \le n} E_{i,j}. \tag{12}
\]

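So by linearity of expectation
\[
\operatorname{E}[M_n] = \operatorname{E}\left[\sum_{1 \le i < j \le n} E_{i,j}\right] = \sum_{1 \le i < j \le n} \operatorname{E}[E_{i,j}] = \binom{n}{2} \cdot \frac{1}{365}.
\]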

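Also, by Theorem 8.5, the variances of pairwise independent variables are additive, so
\[
\operatorname{Var}[M_n] = \operatorname{Var}\left[\sum_{1 \le i < j \le n} E_{i,j}\right] = \sum_{1 \le i < j \le n} \operatorname{Var}[E_{i,j}] = \binom{n}{2} \cdot \frac{1}{365}\left(1 - \frac{1}{365}\right).
\]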

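Now for a class of 100 students, we have $\operatorname{E}[M_{100}] = \binom{100}{2}/365 \approx 13.6$ and
$\operatorname{Var}[M_{100}] = \operatorname{E}[M_{100}](1 - 1/365) < 14$. So by Chebyshev’s Theorem
\[
\Pr\{|M_{100} - \operatorname{E}[M_{100}]| \ge x\} < \frac{14}{x^2}.
\]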

Letting x = 6, we conclude that there is a better than 50% chance that in a class of 100
students, the number of pairs of students with the same birthday will be between 8 and
20.
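
The numbers in this example can be reproduced with a few lines of exact arithmetic. This is a sketch under the same simplifying assumption of 365 equally likely birthdays; the helper name `matching_pairs_stats` is illustrative.

```python
from fractions import Fraction
from math import comb


def matching_pairs_stats(n, days=365):
    """Mean and variance of the number of matching birthday pairs among n people."""
    pairs = comb(n, 2)          # number of indicator variables E_{i,j}
    p = Fraction(1, days)       # Pr{B_i = B_j} for i != j
    mean = pairs * p            # linearity of expectation
    variance = pairs * p * (1 - p)  # pairwise independent additivity
    return mean, variance


mean, variance = matching_pairs_stats(100)
x = 6
bound = variance / x**2  # Chebyshev: Pr{|M - E[M]| >= x} <= Var[M] / x^2
print(float(mean), float(variance), float(bound))
```

The computed bound is a little below 14/36, so the probability of landing within 6 of the mean is comfortably above 50%.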

10 Pairwise Independent Sampling

The reasoning we used above to analyze voter polling and matching birthdays is very
similar. We summarize it in slightly more general form with a basic result we call the
Pairwise Independent Sampling Theorem. In particular, we do not need to restrict ourselves
to sums of zero-one valued variables, or to variables with the same distribution.
For simplicity, we state the Theorem for pairwise independent variables with possibly
different distributions but with the same mean and variance.
Theorem (Pairwise Independent Sampling). Let $G_1, \dots, G_n$ be pairwise independent variables
with the same mean, $\mu$, and standard deviation, $\sigma$. Define
\[
S_n ::= \sum_{i=1}^{n} G_i. \tag{13}
\]
Then
\[
\Pr\left\{\left|\frac{S_n}{n} - \mu\right| \ge x\right\} \le \frac{1}{n}\left(\frac{\sigma}{x}\right)^2.
\]

[2] In the U.S., Fall birthdays are more common than Winter birthdays, so $\Pr\{B_i = B_j\}$ is actually a bit
larger than 1/365.

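Proof. We observe first that the expectation of $S_n/n$ is $\mu$:
\begin{align*}
\operatorname{E}\left[\frac{S_n}{n}\right] &= \operatorname{E}\left[\frac{\sum_{i=1}^{n} G_i}{n}\right] && \text{(def. of $S_n$)} \\
&= \frac{\sum_{i=1}^{n} \operatorname{E}[G_i]}{n} && \text{(linearity of expectation)} \\
&= \frac{\sum_{i=1}^{n} \mu}{n} \\
&= \frac{n\mu}{n} = \mu.
\end{align*}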

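The second important property of $S_n/n$ is that its variance is the variance of $G_i$ divided
by n:
\begin{align*}
\operatorname{Var}\left[\frac{S_n}{n}\right] &= \left(\frac{1}{n}\right)^2 \operatorname{Var}[S_n] && \text{(by (4))} \\
&= \frac{1}{n^2} \operatorname{Var}\left[\sum_{i=1}^{n} G_i\right] && \text{(def. of $S_n$)} \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}[G_i] && \text{(pairwise independent additivity)} \\
&= \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}. \tag{14}
\end{align*}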

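This is enough to apply Chebyshev’s Bound and conclude:
\begin{align*}
\Pr\left\{\left|\frac{S_n}{n} - \mu\right| \ge x\right\} &\le \frac{\operatorname{Var}[S_n/n]}{x^2} && \text{(Chebyshev’s bound)} \\
&= \frac{\sigma^2/n}{x^2} && \text{(by (14))} \\
&= \frac{1}{n}\left(\frac{\sigma}{x}\right)^2.
\end{align*}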
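
As an illustration of how the theorem is used, the sketch below turns the bound into a sample-size calculation. The function names and the fair-die example (variance 35/12) are assumptions made for illustration. The functions take the variance σ² directly instead of σ, so that everything stays in exact rational arithmetic.

```python
from fractions import Fraction
from math import ceil


def sampling_bound(n, variance, x):
    """Upper bound (1/n)(sigma/x)^2 on Pr{|S_n/n - mu| >= x}, with variance = sigma^2."""
    return variance / (n * x**2)


def samples_needed(variance, x, failure_prob):
    """Smallest n for which sampling_bound(n, variance, x) <= failure_prob."""
    return ceil(variance / (x**2 * failure_prob))


# Assumed example: estimate the mean of a fair die (variance 35/12)
# to within 0.1 with at least 95% confidence.
die_variance = Fraction(35, 12)
n = samples_needed(die_variance, Fraction(1, 10), Fraction(1, 20))
print(n)  # 5834
```

Because the bound only needs the mean and variance, the same two functions apply to any pairwise independent samples, not just dice or polls.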

The Pairwise Independent Sampling Theorem provides a precise general statement about
how the average of independent samples of a random variable approaches the mean. In
particular, it shows that by choosing a large enough sample size, n, we can get arbitrarily
accurate estimates of the mean with conﬁdence arbitrarily close to 100%.

Copyright © 2005, Prof. Albert R. Meyer and Prof. Ronitt Rubinfeld.
