Merged Lectures

My office hours: Wednesdays 3 to 5 in 2-249

18.600: Lecture 1
Permutations and combinations, Pascal's triangle, learning to count

Scott Sheffield

MIT

Take a selfie with Norbert Wiener's desk.

Outline

Remark, just for fun

Permutations

Counting tricks

Binomial coefficients

Problems
Politics

- Suppose that, in some election, betting markets place the probability that your favorite candidate will be elected at 58 percent. Price of a contract that pays 100 dollars if your candidate wins is 58 dollars.
- Market seems to say that your candidate will probably win, if "probably" means "with probability greater than .5."
- The price of such a contract may fluctuate in time.
- Let X(t) denote the price at time t.
- Suppose X(t) is known to vary continuously in time. What is the probability p that it reaches 59 before 57?
- If p > .5, we can make money in expectation by buying at 58 and selling when the price hits 57 or 59.
- If p < .5, we can sell at 58 and buy when the price hits 57 or 59.
- Efficient market hypothesis (a.k.a. "no free money just lying around" hypothesis) suggests p = .5 (with some caveats...).
- Natural model for prices: repeatedly toss a coin, adding 1 for heads and subtracting 1 for tails, until the price hits 0 or 100.

Which of these statements is probably true?

1. X(t) will go below 50 at some future point.
2. X(t) will get all the way below 20 at some point.
3. X(t) will reach both 70 and 30, at different future times.
4. X(t) will reach both 65 and 35 at different future times.
5. X(t) will hit 65, then 50, then 60, then 55.

- Answers: 1, 2, 4.
- Full explanations coming toward the end of the course.
- Problem sets in this course explore applications of probability to politics, medicine, finance, economics, science, engineering, philosophy, dating, etc. Stories motivate the math and make it easier to remember.
- Provocative question: what simple advice, that would greatly benefit humanity, are we unaware of? Foods to avoid? Exercises to do? Books to read? How would we know?
- Let's start with easier questions.
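The coin-toss price model above is easy to simulate. A minimal sketch, assuming a fair coin and absorbing barriers at 0 and 100 as on the slide (the function name and trial count are illustrative choices, not from the lecture):

```python
import random

def final_price(start=58, lo=0, hi=100):
    """Toss a fair coin repeatedly, adding 1 for heads and subtracting 1
    for tails, until the price hits lo or hi (the slide's price model)."""
    x = start
    while lo < x < hi:
        x += random.choice((1, -1))
    return x

random.seed(1)
trials = 1000
wins = sum(final_price() == 100 for _ in range(trials))
# Gambler's ruin: starting at 58, P(hit 100 before 0) = 58/100.
print(wins / trials)
```

The simulated fraction should land near 0.58, consistent with the fair-game intuition behind statement 4.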

Outline

Remark, just for fun

Permutations

Counting tricks

Binomial coefficients

Problems
Permutations

- How many ways to order 52 cards?
- Answer: $52 \cdot 51 \cdot 50 \cdots 1 = 52! = 80658175170943878571660636856403766975289505600883277824 \cdot 10^{12}$
- n hats, n people, how many ways to assign each person a hat?
- Answer: n!
- n hats, k < n people, how many ways to assign each person a hat?
- $n(n-1)(n-2)\cdots(n-k+1) = n!/(n-k)!$

Permutation notation

- A permutation is a function from $\{1, 2, \ldots, n\}$ to $\{1, 2, \ldots, n\}$ whose range is the whole set $\{1, 2, \ldots, n\}$. If $\sigma$ is a permutation then for each j between 1 and n, the value $\sigma(j)$ is the number that j gets mapped to.
- For example, if n = 3, then $\sigma$ could be a function such that $\sigma(1) = 3$, $\sigma(2) = 2$, and $\sigma(3) = 1$.
- If you have n cards with labels 1 through n and you shuffle them, then you can let $\sigma(j)$ denote the label of the card in the jth position. Thus orderings of n cards are in one-to-one correspondence with permutations of n elements.
- One way to represent $\sigma$ is to list the values $\sigma(1), \sigma(2), \ldots, \sigma(n)$ in order. The $\sigma$ above is represented as $\{3, 2, 1\}$.
- If $\sigma$ and $\tau$ are both permutations, write $\sigma\tau$ for their composition. That is, $\sigma\tau(j) = \sigma(\tau(j))$.
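The composition rule in the last bullet can be sketched in a few lines, representing a permutation of $\{1, \ldots, n\}$ as a dict (the second permutation $\tau$ here is an illustrative choice):

```python
# sigma maps j -> sigma[j]; this is the n = 3 example from the slide.
sigma = {1: 3, 2: 2, 3: 1}
tau = {1: 2, 2: 3, 3: 1}  # a hypothetical second permutation

def compose(s, t):
    """Return the permutation j -> s(t(j))."""
    return {j: s[t[j]] for j in t}

print(compose(sigma, tau))  # {1: 2, 2: 1, 3: 3}
```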

Cycle decomposition

- Another way to write a permutation is to describe its cycles. For example, taking n = 7, we write (2, 3, 5)(1, 7)(4, 6) for the permutation $\sigma$ such that $\sigma(2) = 3$, $\sigma(3) = 5$, $\sigma(5) = 2$ and $\sigma(1) = 7$, $\sigma(7) = 1$, and $\sigma(4) = 6$, $\sigma(6) = 4$.
- If you pick some j and repeatedly apply $\sigma$ to it, it will cycle through the numbers in its cycle.
- Visualize this by writing down the numbers 1 to n and drawing an arrow from each k to $\sigma(k)$. Trace through a cycle by following arrows.
- Generally, a function f is called an involution if f(f(x)) = x for all x.
- A permutation is an involution if all cycles have length one or two.
- A permutation is fixed point free if there are no cycles of length one.

Outline

Remark, just for fun

Permutations

Counting tricks

Binomial coefficients

Problems
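The cycle-tracing procedure described above translates directly into code; a sketch using the n = 7 example (function names are illustrative):

```python
def cycles(perm):
    """Decompose a permutation (a dict on {1,...,n}) into its cycles by
    repeatedly following j -> perm[j] until we return to the start."""
    seen, out = set(), []
    for start in sorted(perm):
        if start in seen:
            continue
        cyc, j = [], start
        while j not in seen:
            seen.add(j)
            cyc.append(j)
            j = perm[j]
        out.append(tuple(cyc))
    return out

def is_involution(perm):
    return all(perm[perm[j]] == j for j in perm)

# The n = 7 example with cycles (2,3,5), (1,7), (4,6):
sigma = {2: 3, 3: 5, 5: 2, 1: 7, 7: 1, 4: 6, 6: 4}
print(cycles(sigma))        # [(1, 7), (2, 3, 5), (4, 6)]
print(is_involution(sigma)) # False: the cycle (2, 3, 5) has length three
```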
Fundamental counting trick

- n ways to assign a hat to the first person. No matter what choice I make, there will remain n − 1 ways to assign a hat to the second person. No matter what choice I make there, there will remain n − 2 ways to assign a hat to the third person, etc.
- This is a useful trick: break a counting problem into a sequence of stages so that one always has the same number of choices to make at each stage. Then the total count becomes a product of the number of choices available at each stage.
- Easy to make mistakes. For example, maybe in your problem, the number of choices at one stage actually does depend on choices made during earlier stages.

Another trick: overcount by a fixed factor

- If you have 5 indistinguishable black cards, 2 indistinguishable red cards, and 3 indistinguishable green cards, how many distinct shuffle patterns of the ten cards are there?
- Answer: if the cards were distinguishable, we'd have 10!. But we're overcounting by a factor of 5!2!3!, so the answer is 10!/(5!2!3!).

Outline

Remark, just for fun

Permutations

Counting tricks

Binomial coefficients

Problems
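The overcounting argument can be checked numerically; a sketch that computes the slide's 10-card answer and verifies the idea by brute force on a smaller, hypothetical deck:

```python
from math import factorial
from itertools import permutations

# The slide's count: 10 cards, 5 black, 2 red, 3 green.
answer = factorial(10) // (factorial(5) * factorial(2) * factorial(3))
print(answer)  # 2520

# Sanity check of the overcounting idea on a smaller deck (2 black, 1 red, 2 green):
small = set("".join(p) for p in permutations("BBRGG"))
assert len(small) == factorial(5) // (factorial(2) * factorial(1) * factorial(2))
```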
$\binom{n}{k}$ notation

- How many ways to choose an ordered sequence of k elements from a list of n elements, with repeats allowed?
- Answer: $n^k$
- How many ways to choose an ordered sequence of k elements from a list of n elements, with repeats forbidden?
- Answer: $n!/(n-k)!$
- How many ways to choose (unordered) k elements from a list of n without repeats?
- Answer: $\binom{n}{k} := \frac{n!}{k!(n-k)!}$
- What is the coefficient in front of $x^k$ in the expansion of $(x + 1)^n$?
- Answer: $\binom{n}{k}$.

Pascal's triangle

- Arnold principle.
- A simple recursion: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.
- What is the coefficient in front of $x^k$ in the expansion of $(x + 1)^n$?
- Answer: $\binom{n}{k}$.
- $(x+1)^n = \binom{n}{0} \cdot 1 + \binom{n}{1} x^1 + \binom{n}{2} x^2 + \ldots + \binom{n}{n-1} x^{n-1} + \binom{n}{n} x^n$.
- Question: what is $\sum_{k=0}^{n} \binom{n}{k}$?
- Answer: $(1 + 1)^n = 2^n$.

Outline

Remark, just for fun

Permutations

Counting tricks

Binomial coefficients

Problems
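Pascal's recursion and the row-sum identity are quick to verify numerically; a sketch for one illustrative row (n = 10):

```python
from math import comb

n = 10
# Pascal's recursion: C(n, k) = C(n-1, k-1) + C(n-1, k).
for k in range(1, n):
    assert comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)

# Row sum: sum_k C(n, k) = (1 + 1)^n = 2^n.
print(sum(comb(n, k) for k in range(n + 1)))  # 1024
```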
More problems

- How many full house hands in poker?
- $13 \binom{4}{3} \cdot 12 \binom{4}{2}$
- How many 2 pair hands?
- $13 \binom{4}{2} \cdot 12 \binom{4}{2} \cdot 11 \binom{4}{1} / 2$
- How many royal flush hands?
- 4

More problems

- How many hands have four cards of the same suit, one card of another suit?
- $4 \binom{13}{4} \cdot 3 \binom{13}{1}$
- How many 10 digit numbers have no consecutive digits that agree?
- If the initial digit can be zero, there are $10 \cdot 9^9$ ten-digit sequences. If the initial digit is required to be non-zero, there are $9^{10}$.
- How many ways to assign a birthday to each of 23 distinct people? What if no birthday can be repeated?
- $366^{23}$ if repeats allowed. $366!/343!$ if repeats not allowed.
18.600: Lecture 2
Multinomial coefficients and more counting problems

Scott Sheffield

MIT

Outline

Multinomial coefficients

Integer partitions

More problems

Partition problems

- You have eight distinct pieces of food. You want to choose three for breakfast, two for lunch, and three for dinner. How many ways to do that?
- Answer: 8!/(3!2!3!)
- One way to think of this: given any permutation of eight elements (e.g., 12435876 or 87625431) declare the first three as breakfast, the second two as lunch, the last three as dinner. This maps the set of 8! permutations onto the set of food-meal divisions in a many-to-one way: each food-meal division comes from 3!2!3! permutations.
- How many 8-letter sequences with 3 A's, 2 B's, and 3 C's?
- Answer: 8!/(3!2!3!). Same as the other problem. Imagine 8 slots for the letters. Choose 3 to be A's, 2 to be B's, and 3 to be C's.
Partition problems

- In general, if you have n elements you wish to divide into r distinct piles of sizes $n_1, n_2, \ldots, n_r$, how many ways to do that?
- Answer: $\binom{n}{n_1, n_2, \ldots, n_r} := \frac{n!}{n_1! n_2! \cdots n_r!}$.

One way to understand the binomial theorem

- Expand the product $(A_1 + B_1)(A_2 + B_2)(A_3 + B_3)(A_4 + B_4)$.
- 16 terms correspond to 16 length-4 sequences of A's and B's:
  $A_1A_2A_3A_4 + A_1A_2A_3B_4 + A_1A_2B_3A_4 + A_1A_2B_3B_4 + A_1B_2A_3A_4 + A_1B_2A_3B_4 + A_1B_2B_3A_4 + A_1B_2B_3B_4 + B_1A_2A_3A_4 + B_1A_2A_3B_4 + B_1A_2B_3A_4 + B_1A_2B_3B_4 + B_1B_2A_3A_4 + B_1B_2A_3B_4 + B_1B_2B_3A_4 + B_1B_2B_3B_4$
- What happens to this sum if we erase subscripts?
- $(A + B)^4 = B^4 + 4AB^3 + 6A^2B^2 + 4A^3B + A^4$. The coefficient of $A^2B^2$ is 6 because 6 length-4 sequences have 2 A's and 2 B's.
- Generally, $(A + B)^n = \sum_{k=0}^{n} \binom{n}{k} A^k B^{n-k}$, because there are $\binom{n}{k}$ sequences with k A's and (n − k) B's.
How about trinomials?

- Expand $(A_1 + B_1 + C_1)(A_2 + B_2 + C_2)(A_3 + B_3 + C_3)(A_4 + B_4 + C_4)$. How many terms?
- Answer: 81, one for each length-4 sequence of A's, B's and C's.
- We can also compute $(A + B + C)^4 = A^4 + 4A^3B + 6A^2B^2 + 4AB^3 + B^4 + 4A^3C + 12A^2BC + 12AB^2C + 4B^3C + 6A^2C^2 + 12ABC^2 + 6B^2C^2 + 4AC^3 + 4BC^3 + C^4$.
- What is the sum of the coefficients in this expansion? What is the combinatorial interpretation of the coefficient of, say, $ABC^2$?
- Answer: $81 = (1 + 1 + 1)^4$. $ABC^2$ has coefficient 12 because there are 12 length-4 words that have one A, one B, two C's.

Multinomial coefficients

- Is there a higher dimensional analog of the binomial theorem?
- Answer: yes.
- Then what is it?
- $(x_1 + x_2 + \ldots + x_r)^n = \sum_{n_1, \ldots, n_r : n_1 + \ldots + n_r = n} \binom{n}{n_1, \ldots, n_r} x_1^{n_1} x_2^{n_2} \cdots x_r^{n_r}$
- The sum on the right is taken over all collections $(n_1, n_2, \ldots, n_r)$ of r non-negative integers that add up to n.
- Pascal's triangle gives coefficients in binomial expansions. Is there something like a Pascal's pyramid for trinomial expansions?
- Yes (look it up) but it is a bit trickier to draw and visualize than Pascal's triangle.
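The multinomial coefficient and its word-counting interpretation can be cross-checked directly; a sketch using the $ABC^2$ example from the slide (the helper name is an illustrative choice):

```python
from math import factorial
from itertools import product

def multinomial(*ns):
    """n! / (n1! n2! ... nr!) for n = n1 + ... + nr."""
    out = factorial(sum(ns))
    for n in ns:
        out //= factorial(n)
    return out

# Coefficient of A B C^2 in (A + B + C)^4 is 4!/(1!1!2!).
print(multinomial(1, 1, 2))  # 12

# Cross-check by brute force: count length-4 words with one A, one B, two C's.
words = sum(1 for w in product("ABC", repeat=4) if sorted(w) == ["A", "B", "C", "C"])
print(words)  # 12
```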
By the way...

- If n! is the product of all integers in the interval with endpoints 1 and n, then 0! = 0.
- Actually, we say 0! = 1. What are the reasons for that?
- Because there is one map from the empty set to itself.
- Because we want the formula $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ to still make sense when k = 0 and k = n. There is clearly 1 way to choose n elements from a group of n elements. And 1 way to choose 0 elements from a group of n elements, so $\frac{n!}{n!\,0!} = \frac{n!}{0!\,n!} = 1$.
- Because we want the recursion n(n − 1)! = n! to hold for n = 1. (We won't define factorials of negative integers.)
- Because we want $n! = \int_0^\infty t^n e^{-t}\, dt$ to hold for all non-negative integers. (Check for positive integers by integration by parts.) This is one of those formulas you should just know. Can use it to define n! for non-integer n.
- Another common notation: write $\Gamma(z) := \int_0^\infty t^{z-1} e^{-t}\, dt$ and define $n! := \Gamma(n + 1) = \int_0^\infty t^n e^{-t}\, dt$, so that $\Gamma(n) = (n - 1)!$.

Outline

Multinomial coefficients

Integer partitions

More problems

Integer partitions

- How many sequences $a_1, \ldots, a_k$ of non-negative integers satisfy $a_1 + a_2 + \ldots + a_k = n$?
- Answer: $\binom{n+k-1}{n}$. Represent the partition by k − 1 bars and n stars; e.g., with n = 4 and k = 4, the pattern $\ast\ast \mid \ast \mid\mid \ast$ represents (2, 1, 0, 1).
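The stars-and-bars count can be confirmed by brute-force enumeration; a sketch for illustrative small values of n and k:

```python
from math import comb
from itertools import product

def compositions(n, k):
    """Count sequences (a1, ..., ak) of non-negative integers summing to n."""
    return sum(1 for a in product(range(n + 1), repeat=k) if sum(a) == n)

n, k = 5, 3
print(compositions(n, k), comb(n + k - 1, n))  # both 21
```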
Outline

Multinomial coefficients

Integer partitions

More problems

More counting problems

- In 18.821, a class of 27 students needs to be divided into 9 teams of three students each. How many ways are there to do that?
- $\frac{27!}{(3!)^9 \, 9!}$
- You teach a class with 90 students. In a rather severe effort to combat grade inflation, your department chair insists that you assign the students exactly 10 A's, 20 B's, 30 C's, 20 D's, and 10 F's. How many ways to do this?
- $\binom{90}{10, 20, 30, 20, 10} = \frac{90!}{10!\,20!\,30!\,20!\,10!}$
- You have 90 (indistinguishable) pieces of pizza to divide among the 90 (distinguishable) students. How many ways to do that (giving each student a non-negative integer number of slices)?
- $\binom{179}{90} = \binom{179}{89}$

More counting problems

- How many 13-card bridge hands have 4 of one suit, 3 of one suit, 5 of one suit, 1 of one suit?
- $4! \binom{13}{4} \binom{13}{3} \binom{13}{5} \binom{13}{1}$
- How many bridge hands have at most two suits represented?
- $\binom{4}{2} \binom{26}{13} - 8$
- How many hands have either 3 or 4 cards in each suit?
- Need three 3-card suits, one 4-card suit, to make 13 cards total. Answer is $4 \binom{13}{4} \binom{13}{3}^3$.
18.600: Lecture 3
What is probability?

Scott Sheffield

MIT

Outline

Formalizing probability

Sample space

DeMorgan's laws

Axioms of probability

What does "I'd say there's a thirty percent chance it will rain tomorrow" mean?

- Neurological: When I think "it will rain tomorrow" the truth-sensing part of my brain exhibits 30 percent of its maximum electrical activity.
- Frequentist: Of the last 1000 days that meteorological measurements looked this way, rain occurred on the subsequent day 300 times.
- Market preference ("risk neutral probability"): The market price of a contract that pays 100 if it rains tomorrow agrees with the price of a contract that pays 30 tomorrow no matter what.
- Personal belief: If you offered me a choice of these contracts, I'd be indifferent. (If need for money is different in the two scenarios, I can replace dollars with units of utility.)

Even more fundamental question: defining a set of possible outcomes

- Roll a die n times. Define a sample space to be $\{1, 2, 3, 4, 5, 6\}^n$, i.e., the set of $(a_1, \ldots, a_n)$ with each $a_j \in \{1, 2, 3, 4, 5, 6\}$.
- Shuffle a standard deck of cards. Sample space is the set of 52! permutations.
- Will it rain tomorrow? Sample space is {R, N}, which stand for "rain" and "no rain."
- Randomly throw a dart at a board. Sample space is the set of points on the board.

Event: subset of the sample space

- If a set A is comprised of some of the elements of B, say A is a subset of B and write $A \subset B$.
- Similarly, $B \supset A$ means A is a subset of B (or B is a superset of A).
- If S is a finite sample space with n elements, then there are $2^n$ subsets of S.
- Denote by $\emptyset$ the set with no elements.
Intersections, unions, complements

- $A \cup B$ means the union of A and B, the set of elements contained in at least one of A and B.
- $A \cap B$ means the intersection of A and B, the set of elements contained in both A and B.
- $A^c$ means the complement of A, the set of points in the whole sample space S but not in A.
- $A \setminus B$ means "A minus B," the set of points in A but not in B. In symbols, $A \setminus B = A \cap B^c$.
- $\cup$ is associative. So $(A \cup B) \cup C = A \cup (B \cup C)$ and can be written $A \cup B \cup C$.
- $\cap$ is also associative. So $(A \cap B) \cap C = A \cap (B \cap C)$ and can be written $A \cap B \cap C$.

Venn diagrams

[Venn diagram of two overlapping sets A and B.]

Venn diagrams Outline

AB Ac B Formalizing probability

Sample space
A Bc Ac B c
A B DeMorgans laws

Axioms of probability
DeMorgan's laws

- "It will not snow or rain" means "It will not snow and it will not rain."
- If S is the event that it snows, R is the event that it rains, then $(S \cup R)^c = S^c \cap R^c$.
- More generally: $(\cup_{i=1}^n E_i)^c = \cap_{i=1}^n E_i^c$
- "It will not both snow and rain" means "Either it will not snow or it will not rain."
- $(S \cap R)^c = S^c \cup R^c$
- $(\cap_{i=1}^n E_i)^c = \cup_{i=1}^n E_i^c$
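Both laws can be sanity-checked with explicit finite sets; a minimal sketch (the sample space and events are illustrative choices):

```python
# DeMorgan's laws on a small sample space.
S = set(range(10))
E1, E2, E3 = {1, 2, 3}, {3, 4, 5}, {5, 6, 7}

def complement(A):
    return S - A

# (E1 u E2 u E3)^c == E1^c n E2^c n E3^c
lhs = complement(E1 | E2 | E3)
rhs = complement(E1) & complement(E2) & complement(E3)
print(lhs == rhs)  # True

# (E1 n E2 n E3)^c == E1^c u E2^c u E3^c
lhs2 = complement(E1 & E2 & E3)
rhs2 = complement(E1) | complement(E2) | complement(E3)
print(lhs2 == rhs2)  # True
```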

Outline

Formalizing probability

Sample space

DeMorgan's laws

Axioms of probability
Axioms of probability Axioms of probability


Axioms of probability

- $P(A) \in [0, 1]$ for all $A \subset S$.
- $P(S) = 1$.
- Finite additivity: $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
- Countable additivity: $P(\cup_{i=1}^{\infty} E_i) = \sum_{i=1}^{\infty} P(E_i)$ if $E_i \cap E_j = \emptyset$ for each pair i and j.

- Neurological: When I think "it will rain tomorrow" the truth-sensing part of my brain exhibits 30 percent of its maximum electrical activity. Should have $P(A) \in [0, 1]$ and presumably $P(S) = 1$ but not necessarily $P(A \cup B) = P(A) + P(B)$ when $A \cap B = \emptyset$.
- Frequentist: P(A) is the fraction of times A occurred during the previous (large number of) times we ran the experiment. Seems to satisfy the axioms...
- Market preference ("risk neutral probability"): P(A) is the price of a contract paying a dollar if A occurs divided by the price of a contract paying a dollar regardless. Seems to satisfy the axioms, assuming no arbitrage, no bid-ask spread, complete market...
- Personal belief: P(A) is the amount such that I'd be indifferent between a contract paying 1 if A occurs and a contract paying P(A) no matter what. Seems to satisfy the axioms with some notion of utility units, strong assumption of rationality...
18.600: Lecture 4
Axioms of probability and inclusion-exclusion

Scott Sheffield

MIT

Outline

Axioms of probability

Consequences of axioms

Inclusion exclusion

Axioms of probability

- $P(A) \in [0, 1]$ for all $A \subset S$.
- $P(S) = 1$.
- Finite additivity: $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.
- Countable additivity: $P(\cup_{i=1}^{\infty} E_i) = \sum_{i=1}^{\infty} P(E_i)$ if $E_i \cap E_j = \emptyset$ for each pair i and j.
Axiom breakdown

- What if a personal belief function doesn't satisfy the axioms?
- Consider an A-contract (pays 10 if candidate A wins the election), a B-contract (pays 10 dollars if candidate B wins) and an A-or-B contract (pays 10 if either A or B wins).
- Friend: "I'd say the A-contract is worth 1 dollar, the B-contract is worth 1 dollar, the A-or-B contract is worth 7 dollars."
- Amateur response: "Dude, that is, like, so messed up. Haven't you heard of the axioms of probability?"
- Cynical professional response: "I fully understand and respect your opinions. In fact, let's do some business. You sell me an A contract and a B contract for 1.50 each, and I sell you an A-or-B contract for 6.50."
- Friend: "Wow... you've beat my suggested price by 50 cents on each deal. Yes, sure! You're a great friend!"
- Axiom breakdowns are money-making opportunities.
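The professional's trades lock in a riskless profit precisely because the friend's prices violate additivity. A sketch that tallies the cash flows in every scenario (contract payouts and prices as in the slide):

```python
# Buy the A and B contracts at 1.50 each, sell the A-or-B contract at 6.50.
cash_now = -1.50 - 1.50 + 6.50

profits = []
for winner in ("A", "B", "neither"):
    receive = (10 if winner == "A" else 0) + (10 if winner == "B" else 0)
    owe = 10 if winner in ("A", "B") else 0  # payout owed on the sold contract
    profits.append(cash_now + receive - owe)

print(profits)  # [3.5, 3.5, 3.5]: free money no matter who wins
```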

- Neurological: When I think "it will rain tomorrow" the truth-sensing part of my brain exhibits 30 percent of its maximum electrical activity. Should have $P(A) \in [0, 1]$, maybe $P(S) = 1$, not necessarily $P(A \cup B) = P(A) + P(B)$ when $A \cap B = \emptyset$.
- Frequentist: P(A) is the fraction of times A occurred during the previous (large number of) times we ran the experiment. Seems to satisfy the axioms...
- Market preference ("risk neutral probability"): P(A) is the price of a contract paying a dollar if A occurs divided by the price of a contract paying a dollar regardless. Seems to satisfy the axioms, assuming no arbitrage, no bid-ask spread, complete market...
- Personal belief: P(A) is the amount such that I'd be indifferent between a contract paying 1 if A occurs and a contract paying P(A) no matter what. Seems to satisfy the axioms with some notion of utility units, strong assumption of rationality...

Outline

Axioms of probability

Consequences of axioms

Inclusion exclusion
Intersection notation

- We will sometimes write AB to denote the event $A \cap B$.
Consequences of axioms

- Can we show from the axioms that $P(A^c) = 1 - P(A)$?
- Can we show from the axioms that if $A \subset B$ then $P(A) \le P(B)$?
- Can we show from the axioms that $P(A \cup B) = P(A) + P(B) - P(AB)$?
- Can we show from the axioms that $P(AB) \le P(A)$?
- Can we show from the axioms that if S contains finitely many elements $x_1, \ldots, x_k$, then the values $P(\{x_1\}), P(\{x_2\}), \ldots, P(\{x_k\})$ determine the value of P(A) for any $A \subset S$?
- What k-tuples of values are consistent with the axioms?

Famous 1982 Tversky-Kahneman study (see wikipedia)

- People are told "Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations."
- They are asked: Which is more probable?
  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.
- 85 percent chose the second option.
- Could be correct using a neurological/emotional definition. Or a "which story would you believe" interpretation (if witnesses offering more details are considered more credible).
- But the axioms of probability imply that the second option cannot be more likely than the first.
Outline

Axioms of probability

Consequences of axioms

Inclusion exclusion

Inclusion-exclusion identity

- Can we show from the axioms that $P(A \cup B) = P(A) + P(B) - P(AB)$?
- How about $P(E \cup F \cup G) = P(E) + P(F) + P(G) - P(EF) - P(EG) - P(FG) + P(EFG)$?
- Imagine we have n events, $E_1, E_2, \ldots, E_n$.
- How do we go about computing something like $P(E_1 \cup E_2 \cup \ldots \cup E_n)$?
- It may be quite difficult, depending on the application.
- There are some situations in which computing $P(E_1 \cup E_2 \cup \ldots \cup E_n)$ is a priori difficult, but it is relatively easy to compute probabilities of intersections of any collection of $E_i$. That is, we can easily compute quantities like $P(E_1 E_3 E_7)$ or $P(E_2 E_3 E_6 E_7 E_8)$.
- In these situations, the inclusion-exclusion rule helps us compute unions. It gives us a way to express $P(E_1 \cup E_2 \cup \ldots \cup E_n)$ in terms of these intersection probabilities.
- More generally,
  $P(\cup_{i=1}^n E_i) = \sum_{i=1}^n P(E_i) - \sum_{i_1 < i_2} P(E_{i_1} E_{i_2}) + \ldots + (-1)^{r+1} \sum_{i_1 < i_2 < \ldots < i_r} P(E_{i_1} E_{i_2} \cdots E_{i_r}) + \ldots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)$.
- The notation $\sum_{i_1 < i_2 < \ldots < i_r}$ means a sum over all of the $\binom{n}{r}$ subsets of size r of the set $\{1, 2, \ldots, n\}$.
Inclusion-exclusion proof idea

- Consider a region of the Venn diagram contained in exactly m > 0 subsets. For example, if m = 3 and n = 8 we could consider the region $E_1 E_2 E_3^c E_4^c E_5 E_6^c E_7^c E_8^c$.
- This region is contained in three single intersections ($E_1$, $E_2$, and $E_5$). It's contained in 3 double-intersections ($E_1 E_2$, $E_1 E_5$, and $E_2 E_5$). It's contained in only 1 triple-intersection ($E_1 E_2 E_5$).
- It is counted $\binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \ldots \pm \binom{m}{m}$ times in the inclusion-exclusion sum.
- How many is that?
- Answer: 1. (Follows from the binomial expansion of $(1 - 1)^m$.)
- Thus each region in $E_1 \cup \ldots \cup E_n$ is counted exactly once in the inclusion-exclusion sum, which implies the identity.
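The identity can be verified numerically for a concrete uniform example; a sketch with three illustrative events on $\{0, \ldots, 19\}$:

```python
from itertools import combinations

# Uniform measure on S = {0,...,19}; check inclusion-exclusion for 3 events.
S = set(range(20))
E = [set(range(0, 10)), set(range(5, 15)), set(range(8, 18))]

def P(A):
    return len(A) / len(S)

union = P(E[0] | E[1] | E[2])

# Sum over all nonempty index subsets, with sign (-1)^(r+1).
ie = 0.0
for r in range(1, len(E) + 1):
    for idx in combinations(range(len(E)), r):
        inter = set(S)
        for i in idx:
            inter = inter & E[i]
        ie += (-1) ** (r + 1) * P(inter)

print(union, ie)  # both 0.9
```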
18.600: Lecture 5
Problems with all outcomes equally likely, including a famous hat problem

Scott Sheffield

MIT

Outline

Equal likelihood

A few problems

Hat problem

A few more problems

Equal likelihood

- If a sample space S has n elements, and all of them are equally likely, then each one has to have probability 1/n.
- What is P(A) for a general set $A \subset S$?
- Answer: |A|/|S|, where |A| is the number of elements in A.

Outline

Equal likelihood

A few problems

Hat problem

A few more problems

Problems

- Roll two dice. What is the probability that their sum is three?
- 2/36 = 1/18
- Toss eight coins. What is the probability that exactly five of them are heads?
- $\binom{8}{5} / 2^8$
- In a class of 100 people with cell phone numbers, what is the probability that nobody has a number ending in 37?
- $(99/100)^{100} \approx 1/e$
- Roll ten dice. What is the probability that a 6 appears on exactly five of the dice?
- $\binom{10}{5} 5^5 / 6^{10}$
- In a room of 23 people, what is the probability that two of them have a birthday in common?
- $1 - \prod_{i=0}^{22} \frac{365 - i}{365}$
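Two of these answers evaluated numerically, as a sketch:

```python
from math import comb, prod

# Eight coins, exactly five heads:
p5 = comb(8, 5) / 2**8
print(p5)  # 0.21875

# Birthday problem with 23 people: 1 - prod (365 - i)/365.
shared = 1 - prod((365 - i) / 365 for i in range(23))
print(round(shared, 4))  # about 0.5073: better than even odds
```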
Recall the inclusion-exclusion identity

- $P(\cup_{i=1}^n E_i) = \sum_{i=1}^n P(E_i) - \sum_{i_1 < i_2} P(E_{i_1} E_{i_2}) + \ldots + (-1)^{r+1} \sum_{i_1 < i_2 < \ldots < i_r} P(E_{i_1} E_{i_2} \cdots E_{i_r}) + \ldots + (-1)^{n+1} P(E_1 E_2 \cdots E_n)$
- The notation $\sum_{i_1 < i_2 < \ldots < i_r}$ means a sum over all of the $\binom{n}{r}$ subsets of size r of the set $\{1, 2, \ldots, n\}$.
Famous hat problem

- n people toss hats into a bin, randomly shuffle, return one hat to each person. Find the probability nobody gets own hat.
- Inclusion-exclusion. Let $E_i$ be the event that the ith person gets own hat.
- What is $P(E_{i_1} E_{i_2} \cdots E_{i_r})$?
- Answer: $\frac{(n-r)!}{n!}$.
- There are $\binom{n}{r}$ terms like that in the inclusion-exclusion sum. What is $\binom{n}{r} \frac{(n-r)!}{n!}$?
- Answer: $\frac{1}{r!}$.
- $P(\cup_{i=1}^n E_i) = 1 - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \ldots \pm \frac{1}{n!}$
- $1 - P(\cup_{i=1}^n E_i) = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \frac{1}{4!} - \ldots \pm \frac{1}{n!} \approx 1/e \approx .36788$
Problems

- What's the probability of a full house in poker (i.e., in a five card hand, 2 have one value and three have another)?
- Answer 1: (# ordered distinct-five-card sequences giving full house) / (# ordered distinct-five-card sequences)
- That's $\binom{5}{2} \cdot 13 \cdot 12 \cdot (4 \cdot 3 \cdot 2) \cdot (4 \cdot 3) / (52 \cdot 51 \cdot 50 \cdot 49 \cdot 48) = 6/4165$.
- Answer 2: (# unordered distinct-five-card sets giving full house) / (# unordered distinct-five-card sets)
- That's $13 \cdot 12 \cdot \binom{4}{3} \binom{4}{2} / \binom{52}{5} = 6/4165$.
- What is the probability of a two-pair hand in poker?
- Fix suit breakdown, then face values: $\binom{4}{2}^2 \binom{13}{2} \cdot 11 \binom{4}{1} / \binom{52}{5}$
- How about a bridge hand with 3 of one suit, 3 of one suit, 2 of one suit, 5 of another suit?
- $\binom{4}{2} \cdot 2 \cdot \binom{13}{3}^2 \binom{13}{2} \binom{13}{5} / \binom{52}{13}$
18.600: Lecture 6
Conditional probability

Scott Sheffield

MIT

Outline

Definition: probability of A given B

Examples

Multiplication rule
Conditional probability

- Suppose I have a sample space S with n equally likely elements, representing possible outcomes of an experiment.
- The experiment is performed, but I don't know the outcome. For some $F \subset S$, I ask, "Was the outcome in F?" and receive the answer yes.
- I think of F as a "new sample space" with all elements equally likely.
- Definition: $P(E|F) = P(EF)/P(F)$.
- Call P(E|F) the "conditional probability of E given F" or "probability of E conditioned on F."
- The definition makes sense even without the equally likely assumption.
Outline

Definition: probability of A given B

Examples

Multiplication rule

More examples

- Probability have rare disease given positive result to test with 90 percent accuracy.
- Say the probability to have the disease is p.
- S = {disease, no disease} × {positive, negative}.
- P(positive) = .9p + .1(1 − p) and P(disease, positive) = .9p.
- $P(\text{disease}\,|\,\text{positive}) = \frac{.9p}{.9p + .1(1-p)}$. If p is tiny, this is about 9p.
- Probability suspect guilty of murder given a particular suspicious behavior.
- Probability plane will come eventually, given plane not here yet.

Another famous Tversky/Kahneman study (Wikipedia)

- Imagine you are a member of a jury judging a hit-and-run driving case. A taxi hit a pedestrian one night and fled the scene. The entire case against the taxi company rests on the evidence of one witness, an elderly man who saw the accident from his window some distance away. He says that he saw the pedestrian struck by a blue taxi. In trying to establish her case, the lawyer for the injured pedestrian establishes the following facts:
- There are only two taxi companies in town, "Blue Cabs" and "Green Cabs." On the night in question, 85 percent of all taxis on the road were green and 15 percent were blue.
- The witness has undergone an extensive vision test under conditions similar to those on the night in question, and has demonstrated that he can successfully distinguish a blue taxi from a green taxi 80 percent of the time.
- Study participants believe blue taxi at fault, say witness correct with 80 percent probability.
Outline

Definition: probability of A given B

Examples

Multiplication rule

Multiplication rule

- $P(E_1 E_2 E_3 \cdots E_n) = P(E_1) P(E_2|E_1) P(E_3|E_1 E_2) \cdots P(E_n|E_1 \cdots E_{n-1})$
- Useful when we think about multi-step experiments.
- For example, let $E_i$ be the event the ith person gets own hat in the n-hat shuffle problem.
- Another example: roll a die and let $E_i$ be the event that the roll does not lie in $\{1, 2, \ldots, i\}$. Then $P(E_i) = (6 - i)/6$ for $i \in \{1, 2, \ldots, 6\}$.
- What is $P(E_4|E_1 E_2 E_3)$ in this case?

Monty Hall problem

- Prize behind one of three doors, all equally likely.
- You point to door one. Host opens either door two or three and shows you that it doesn't have a prize. (If neither door two nor door three has a prize, host tosses a coin to decide which to open.)
- You then get to open a door and claim what's behind it. Should you stick with door one or choose the other door?
- Sample space is $\{1, 2, 3\} \times \{2, 3\}$ (door containing prize, door host points to).
- We have $P((1, 2)) = P((1, 3)) = 1/6$ and $P((2, 3)) = P((3, 2)) = 1/3$. Given host points to door 2, probability prize behind 3 is 2/3.
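The 2/3 answer is easy to confirm by simulating the game exactly as described (host opens a non-picked, non-prize door, coin flip if he has two choices); trial count is an illustrative choice:

```python
import random

def monty_trial(switch):
    doors = [1, 2, 3]
    prize = random.choice(doors)
    pick = 1  # you point to door one
    # Host opens a door that is neither your pick nor the prize.
    host = random.choice([d for d in doors if d != pick and d != prize])
    if switch:
        pick = next(d for d in doors if d != pick and d != host)
    return pick == prize

random.seed(3)
trials = 30000
stay = sum(monty_trial(False) for _ in range(trials)) / trials
change = sum(monty_trial(True) for _ in range(trials)) / trials
print(round(stay, 2), round(change, 2))  # about 0.33 and 0.67
```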
Another popular puzzle (see Tanya Khovanova's blog)

- Given that your friend has exactly two children, one of whom is a son born on a Tuesday, what is the probability the second child is a son?
- Make the obvious (though not quite correct) assumptions. Every child is either boy or girl, and equally likely to be either one, and all days of week for birth equally likely, etc.
- Make a state space matrix of 196 = 14 · 14 elements.
- Easy to see the answer is 13/27.
18.600: Lecture 7
Bayes' formula and independence

Scott Sheffield

MIT

Outline

Bayes' formula

Independence

Recall definition: conditional probability

- Definition: $P(E|F) = P(EF)/P(F)$.
- Equivalent statement: $P(EF) = P(F) P(E|F)$.
- Call P(E|F) the "conditional probability of E given F" or "probability of E conditioned on F."
Dividing probability into two cases Bayes' theorem
I P(E ) = P(EF ) + P(EF^c ) = P(E |F )P(F ) + P(E |F^c )P(F^c )
I In words: want to know the probability of E . There are two scenarios F and F^c . If I know the probabilities of the two scenarios and the probability of E conditioned on each scenario, I can work out the probability of E .
I Example: D = have disease, T = positive test.
I If P(D) = p, P(T |D) = .9, and P(T |D^c ) = .1, then P(T ) = .9p + .1(1 - p).
I What is P(D|T )?
I Bayes' theorem/law/rule states the following: P(A|B) = P(B|A)P(A)/P(B).
I Follows from definition of conditional probability: P(AB) = P(B)P(A|B) = P(A)P(B|A).
I Tells how to update estimate of probability of A when new evidence restricts your sample space to B.
I So P(A|B) is P(B|A)/P(B) times P(A).
I Ratio P(B|A)/P(B) determines how compelling new evidence is.
I What does it mean if ratio is zero?
I What if ratio is 1/P(A)?
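The disease-test computation is a one-line application of Bayes' rule; a hedged sketch (the .9 and .1 rates are the ones assumed on the slide; the function name is mine):

```python
def p_disease_given_positive(p, t_given_d=0.9, t_given_dc=0.1):
    """Bayes: P(D|T) = P(T|D)P(D) / P(T), where P(T) = .9p + .1(1 - p)."""
    p_t = t_given_d * p + t_given_dc * (1 - p)
    return t_given_d * p / p_t

print(p_disease_given_positive(0.01))  # rare disease: still under 10 percent
```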

Bayes' theorem Bayesian sometimes used to describe philosophical view
I Bayes' formula P(A|B) = P(B|A)P(A)/P(B) is often invoked as tool to guide intuition.
I Example: A is event that suspect stole the $10,000 under my mattress, B is event that suspect deposited several thousand dollars in cash in bank last week.
I Begin with subjective estimates of P(A), P(B|A), and P(B|A^c ). Compute P(B). Check whether B occurred. Update estimate.
I Repeat procedure as new evidence emerges.
I Caution required. Was it my idea to check whether B occurred, or is a lawyer selecting the provable events B1 , B2 , B3 , . . . that maximize P(A|B1 B2 B3 . . .)? Where did my probability estimates come from? What is my state space? What assumptions am I making?
I Philosophical idea: we assign subjective probabilities to questions we can't answer. Will candidate win election? Will Red Sox win world series? Will stock prices go up this year?
I Bayes essentially described probability of event as (value of right to get some thing if event occurs)/(value of thing).
I Philosophical questions: do we have subjective probabilities/hunches for questions we can't base enforceable contracts on? Do there exist other universes? Are there other intelligent beings? Are there beings smart enough to simulate universes like ours? Are we part of such a simulation?...
I Do we use Bayes subconsciously to update hunches?
I Should we think of Bayesian priors and updates as part of the epistemological foundation of science and statistics?
Updated odds P(·|F ) is a probability measure
I Define odds of A to be P(A)/P(A^c ).
I Define conditional odds of A given B to be P(A|B)/P(A^c |B).
I Is there nice way to describe ratio between odds and conditional odds?
I [P(A|B)/P(A^c |B)] / [P(A)/P(A^c )] = ?
I By Bayes' rule, P(A|B)/P(A) = P(B|A)/P(B).
I After some algebra, [P(A|B)/P(A^c |B)] / [P(A)/P(A^c )] = P(B|A)/P(B|A^c ).
I Say I think A is 5 times as likely as A^c , and P(B|A) = 3P(B|A^c ). Given B, I think A is 15 times as likely as A^c .
I Gambling sites (look at oddschecker.com) often list P(A^c )/P(A), which is basically amount house puts up for bet on A^c when you put up one dollar for bet on A.
I We can check the probability axioms: 0 ≤ P(E |F ) ≤ 1, P(S|F ) = 1, and P(∪i Ei |F ) = Σi P(Ei |F ), if i ranges over a countable set and the Ei are disjoint.
I The probability measure P(·|F ) is related to P(·).
I To get former from latter, we set probabilities of elements outside of F to zero and multiply probabilities of events inside of F by 1/P(F ).
I P(·) is the prior probability measure and P(·|F ) is the posterior measure (revised after discovering that F occurs).

Outline Outline

Bayes formula Bayes formula

Independence Independence
Independence Independence of multiple events
I Say E and F are independent if P(EF ) = P(E )P(F ).
I Equivalent statement: P(E |F ) = P(E ). Also equivalent: P(F |E ) = P(F ).
I Example: toss two coins. Sample space contains four equally likely elements (H, H), (H, T ), (T , H), (T , T ).
I Is event that first coin is heads independent of event that second coin is heads?
I Yes: probability of each event is 1/2 and probability of both is 1/4.
I Is event that first coin is heads independent of event that number of heads is odd?
I Yes: probability of each event is 1/2 and probability of both is 1/4...
I despite fact that (in everyday English usage of the word) oddness of the number of heads depends on the first coin.
I Say E1 . . . En are independent if for each {i1 , i2 , . . . , ik } ⊂ {1, 2, . . . n} we have P(Ei1 Ei2 . . . Eik ) = P(Ei1 )P(Ei2 ) . . . P(Eik ).
I In other words, the product rule works.
I Independence implies P(E1 E2 E3 |E4 E5 E6 ) = P(E1 )P(E2 )P(E3 )P(E4 )P(E5 )P(E6 )/(P(E4 )P(E5 )P(E6 )) = P(E1 E2 E3 ), and other similar statements.
I Does pairwise independence imply independence?
I No. Consider these three events: first coin heads, second coin heads, odd number heads. Pairwise independent, not independent.
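The pairwise-but-not-jointly-independent counterexample can be checked exhaustively over the four-outcome sample space; a small sketch (names mine, not from the slides):

```python
from itertools import product
from fractions import Fraction

space = list(product([0, 1], repeat=2))   # two fair coins, 1 = heads

def pr(event):
    """Probability of an event over the four equally likely outcomes."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

A = lambda w: w[0] == 1                   # first coin heads
B = lambda w: w[1] == 1                   # second coin heads
C = lambda w: (w[0] + w[1]) % 2 == 1      # odd number of heads

assert pr(lambda w: A(w) and C(w)) == pr(A) * pr(C)                    # pairwise independent
assert pr(lambda w: A(w) and B(w) and C(w)) != pr(A) * pr(B) * pr(C)   # not independent
print("pairwise independent but not independent")
```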

Independence: another example
I Shuffle 4 cards with labels 1 through 4. Let Ej,k be event that card j comes before card k. Is E1,2 independent of E3,4 ?
I Is E1,2 independent of E1,3 ?
I No. In fact, what is P(E1,2 |E1,3 )?
I 2/3
I Generalize to n > 7 cards. What is P(E1,7 |E1,2 E1,3 E1,4 E1,5 E1,6 )?
I 6/7
Outline

18.600: Lecture 8
Defining random variables
Discrete random variables

Scott Sheffield Probability mass function and distribution function

MIT

Recursions

Outline Random variables

I A random variable X is a function from the state space to the


Defining random variables real numbers.
I Can interpret X as a quantity whose value depends on the
outcome of an experiment.
Probability mass function and distribution function I Example: toss n coins (so state space consists of the set of all
2^n possible coin sequences) and let X be number of heads.
I Question: What is P{X = k} in this case?
Recursions
I Answer: (n choose k)/2^n , if k ∈ {0, 1, 2, . . . , n}.
Independence of multiple events Examples
I In n coin toss example, knowing the values of some coin tosses tells us nothing about the others.
I Say E1 . . . En are independent if for each {i1 , i2 , . . . , ik } ⊂ {1, 2, . . . n} we have P(Ei1 Ei2 . . . Eik ) = P(Ei1 )P(Ei2 ) . . . P(Eik ).
I In other words, the product rule works.
I Independence implies P(E1 E2 E3 |E4 E5 E6 ) = P(E1 )P(E2 )P(E3 )P(E4 )P(E5 )P(E6 )/(P(E4 )P(E5 )P(E6 )) = P(E1 E2 E3 ), and other similar statements.
I Does pairwise independence imply independence?
I No. Consider these three events: first coin heads, second coin heads, odd number heads. Pairwise independent, not independent.
I Shuffle n cards, and let X be the position of the jth card. State space consists of all n! possible orderings. X takes values in {1, 2, . . . , n} depending on the ordering.
I Question: What is P{X = k} in this case?
I Answer: 1/n, if k ∈ {1, 2, . . . , n}.
I Now say we roll three dice and let Y be sum of the values on the dice. What is P{Y = 5}?
I 6/216

Indicators
I Given any event E , can define an indicator random variable, i.e., let X be random variable equal to 1 on the event E and 0 otherwise. Write this as X = 1E .
I The value of 1E (either 1 or 0) indicates whether the event has occurred.
I If E1 , E2 , . . . Ek are events then X = Σ_{i=1}^k 1Ei is the number of these events that occur.
I Example: in n-hat shuffle problem, let Ei be the event ith person gets own hat.
I Then Σ_{i=1}^n 1Ei is total number of people who get own hats.
I Writing random variable as sum of indicators: frequently useful, sometimes confusing.
Outline Probability mass function
I Say X is a discrete random variable if (with probability one) it takes one of a countable set of values.
I For each a in this countable set, write p(a) := P{X = a}. Call p the probability mass function.
I For the cumulative distribution function, write F (a) = P{X ≤ a} = Σ_{x≤a} p(x).
I Example: Let T1 , T2 , T3 , . . . be sequence of independent fair coin tosses (each taking values in {H, T }) and let X be the smallest j for which Tj = H.
I What is p(k) = P{X = k} (for k ∈ Z) in this case?
I p(k) = (1/2)^k
I What about FX (k)?
I 1 - (1/2)^k
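The pmf and cdf formulas for this first-heads time agree numerically; a quick check (function names mine):

```python
def p(k):
    """p(k) = P{X = k} for the first-heads time: (1/2)^k."""
    return 0.5 ** k

def F(k):
    """F(k) = P{X <= k} = p(1) + ... + p(k)."""
    return sum(p(j) for j in range(1, k + 1))

for k in (1, 3, 10):
    assert abs(F(k) - (1 - 0.5 ** k)) < 1e-12   # matches 1 - (1/2)^k
print(F(3))  # 0.875
```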

Another example
I Another example: let X be non-negative integer such that p(k) = P{X = k} = e^{-λ} λ^k /k!.
I Recall Taylor expansion Σ_{k=0}^∞ λ^k /k! = e^λ .
I In this example, X is called a Poisson random variable with intensity λ.
I Question: what is the state space in this example?
I Answer: Didn't specify. One possibility would be to define state space as S = {0, 1, 2, . . .} and define X (as a function on S) by X (j) = j. The probability function would be determined by P({k}) = e^{-λ} λ^k /k!.
I Are there other choices of S and P and other functions X from S to R for which the values of P{X = k} are the same?
I Yes. "X is a Poisson random variable with intensity λ" is a statement only about the probability mass function of X .
Outline Using Bayes rule to set up recursions
I Gambler one has positive integer m dollars, gambler two has
positive integer n dollars. Take turns making one dollar bets
until one runs out of money. What is probability first gambler
runs out of money first?
Defining random variables I n/(m + n)
I Gamblers ruin: what if gambler one has an unlimited
amount of money?
Probability mass function and distribution function I Wins eventually with probability one.
I Problem of points: in sequence of independent fair coin
tosses, what is probability Pn,m to see n heads before seeing m
Recursions tails?
I Observe: Pn,m is equivalent to the probability of having n or more heads in first m + n - 1 trials.
I Probability of exactly n heads in m + n - 1 trials is (m + n - 1 choose n)/2^{m+n-1} .
I Famous correspondence by Fermat and Pascal. Led Pascal to write Le Triangle Arithmétique.
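The n/(m + n) ruin probability quoted above can be checked by simulation; a rough sketch (fair one-dollar bets; function name and trial count are mine):

```python
import random

def ruin_prob(m, n, trials=20_000):
    """Fraction of fair games in which gambler one (starting with m dollars)
    goes broke before gambler two (starting with n dollars)."""
    broke = 0
    for _ in range(trials):
        x = m
        while 0 < x < m + n:
            x += random.choice([-1, 1])   # one fair one-dollar bet
        broke += (x == 0)
    return broke / trials

print(ruin_prob(2, 3))  # should be close to n/(m + n) = 3/5
```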
Outline

18.600: Lecture 9
Defining expectation
Expectations of discrete random variables

Scott Sheffield Functions of random variables

MIT

Motivation

Outline Expectation of a discrete random variable

I Recall: a random variable X is a function from the state space


to the real numbers.
I Can interpret X as a quantity whose value depends on the
Defining expectation outcome of an experiment.
I Say X is a discrete random variable if (with probability one)
it takes one of a countable set of values.
Functions of random variables I For each a in this countable set, write p(a) := P{X = a}.
Call p the probability mass function.
I The expectation of X , written E [X ], is defined by E [X ] = Σ_{x:p(x)>0} xp(x).

I Represents weighted average of possible values X can take,


each value being weighted by its probability.
Simple examples Expectation when state space is countable

I If the state space S is countable, we can give SUM OVER STATE SPACE definition of expectation: E [X ] = Σ_{s∈S} P{s}X (s).
I Compare this to the SUM OVER POSSIBLE X VALUES definition we gave earlier: E [X ] = Σ_{x:p(x)>0} xp(x).
I Example: toss two coins. If X is the number of heads, what is E [X ]?
I State space is {(H, H), (H, T ), (T , H), (T , T )} and summing over state space gives E [X ] = (1/4)2 + (1/4)1 + (1/4)1 + (1/4)0 = 1.
I Suppose that a random variable X satisfies P{X = 1} = .5, P{X = 2} = .25 and P{X = 3} = .25.
I What is E [X ]?
I Answer: (.5)1 + (.25)2 + (.25)3 = 1.75.
I Suppose P{X = 1} = p and P{X = 0} = 1 - p. Then what is E [X ]?
I Answer: p.
I Roll a standard six-sided die. What is the expectation of the number that comes up?
I Answer: (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 21/6 = 3.5.

A technical point Outline

I If the state space S is countable, is it possible that the sum E [X ] = Σ_{s∈S} P({s})X (s) somehow depends on the order in which s ∈ S are enumerated?
I In principle, yes... We only say expectation is defined when Σ_{s∈S} P({s})|X (s)| < ∞, in which case it turns out that the sum does not depend on the order.
Motivation
Outline Expectation of a function of a random variable
I If X is a random variable and g is a function from the real numbers to the real numbers then g (X ) is also a random variable.
I How can we compute E [g (X )]?
I SUM OVER STATE SPACE: E [g (X )] = Σ_{s∈S} P({s})g (X (s)).
I SUM OVER X VALUES: E [g (X )] = Σ_{x:p(x)>0} g (x)p(x).
I Suppose that constants a, b, μ are given and that E [X ] = μ.
I What is E [X + b]?
I How about E [aX ]?
I Generally, E [aX + b] = aE [X ] + b = aμ + b.

More examples Additivity of expectation
I Let X be the number that comes up when you roll a standard six-sided die. What is E [X^2 ]?
I (1/6)(1 + 4 + 9 + 16 + 25 + 36) = 91/6
I Let Xj be 1 if the jth coin toss is heads and 0 otherwise. What is the expectation of X = Σ_{j=1}^n Xj ?
I Can compute this directly as Σ_{k=0}^n P{X = k}k.
I Alternatively, use symmetry. Expected number of heads should be same as expected number of tails.
I This implies E [X ] = E [n - X ]. Applying E [aX + b] = aE [X ] + b formula (with a = -1 and b = n), we obtain E [X ] = n - E [X ] and conclude that E [X ] = n/2.
I If X and Y are distinct random variables, then can one say that E [X + Y ] = E [X ] + E [Y ]?
I Yes. In fact, for real constants a and b, we have E [aX + bY ] = aE [X ] + bE [Y ].
I This is called the linearity of expectation.
I Another way to state this fact: given sample space S and probability measure P, the expectation E [·] is a linear real-valued function on the space of random variables.
I Can extend to more variables: E [X1 + X2 + . . . + Xn ] = E [X1 ] + E [X2 ] + . . . + E [Xn ].
More examples Outline

I Now can we compute expected number of people who get


own hats in n hat shuffle problem? Defining expectation
I Let Xi be 1 if ith person gets own hat and zero otherwise.
I What is E [Xi ], for i {1, 2, . . . , n}?
I Answer: 1/n. Functions of random variables
I Can write total number with own hat as
X = X1 + X2 + . . . + Xn .
I Linearity of expectation gives E [X ] = E [X1 ] + E [X2 ] + . . . + E [Xn ] = n · 1/n = 1.

Outline Why should we care about expectation?

I Laws of large numbers: choose lots of independent random variables with same probability distribution as X ; their average tends to be close to E [X ].
I Example: roll N = 106 dice, let Y be the sum of the numbers
that come up. Then Y /N is probably close to 3.5.
Functions of random variables I Economic theory of decision making: Under rationality
assumptions, each of us has utility function and tries to
optimize its expectation.
Motivation I Financial contract pricing: under no arbitrage/interest
assumption, price of derivative equals its expected value in
so-called risk neutral probability.
I Comes up everywhere probability is applied.
Expected utility when outcome only depends on wealth

I Contract one: I'll toss 10 coins, and if they all come up heads (probability about one in a thousand), I'll give you 20 billion dollars.
I Contract two: I'll just give you ten million dollars.
I What are expectations of the two contracts? Which would
you prefer?
I Can you find a function u(x) such that given two random wealth variables W1 and W2 , you prefer W1 whenever E [u(W1 )] > E [u(W2 )]?
I Lets assume u(0) = 0 and u(1) = 1. Then u(x) = y means
that you are indifferent between getting 1 dollar no matter
what and getting x dollars with probability 1/y .
Outline

18.600: Lecture 10 Defining variance

Variance and standard deviation


Examples
Scott Sheffield

MIT
Properties

Decomposition trick

Outline Recall definitions for expectation


I Recall: a random variable X is a function from the state space
to the real numbers.
I Can interpret X as a quantity whose value depends on the
Defining variance outcome of an experiment.
I Say X is a discrete random variable if (with probability one)
it takes one of a countable set of values.
Examples
I For each a in this countable set, write p(a) := P{X = a}.
Call p the probability mass function.
Properties I The expectation of X , written E [X ], is defined by
X
E [X ] = xp(x).
Decomposition trick x:p(x)>0

I Also, X
E [g (X )] = g (x)p(x).
x:p(x)>0
Defining variance Very important alternative formula
I Let X be a random variable with mean μ.
I The variance of X , denoted Var(X ), is defined by Var(X ) = E [(X - μ)^2 ].
I Taking g (x) = (x - μ)^2 , and recalling that E [g (X )] = Σ_{x:p(x)>0} g (x)p(x), we find that Var[X ] = Σ_{x:p(x)>0} (x - μ)^2 p(x).
I Variance is one way to measure the amount a random variable varies from its mean over successive trials.
I We introduced above the formula Var(X ) = E [(X - μ)^2 ].
I This can be written Var[X ] = E [X^2 - 2μX + μ^2 ].
I By additivity of expectation, this is the same as E [X^2 ] - 2μE [X ] + μ^2 = E [X^2 ] - μ^2 .
I This gives us our very important alternative formula: Var[X ] = E [X^2 ] - (E [X ])^2 .
I Seven words to remember: expectation of square minus square of expectation.
I Original formula gives intuitive idea of what variance is (expected square of difference from mean). But we will often use this alternative formula when we have to actually compute the variance.
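Both variance formulas can be checked exactly for a fair die roll using rational arithmetic; a short sketch (not from the slides):

```python
from fractions import Fraction

vals = range(1, 7)
p = Fraction(1, 6)                                   # fair die

mu = sum(p * x for x in vals)                        # E[X] = 7/2
var_def = sum(p * (x - mu) ** 2 for x in vals)       # E[(X - mu)^2]
var_alt = sum(p * x * x for x in vals) - mu ** 2     # E[X^2] - (E[X])^2

assert var_def == var_alt == Fraction(35, 12)
print(var_def)  # 35/12
```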

Outline Outline

Defining variance Defining variance

Examples Examples

Properties Properties

Decomposition trick Decomposition trick


Variance examples More variance examples
I If X is number on a standard die roll, what is Var[X ]?
I Var[X ] = E [X^2 ] - E [X ]^2 = (1/6)1^2 + (1/6)2^2 + (1/6)3^2 + (1/6)4^2 + (1/6)5^2 + (1/6)6^2 - (7/2)^2 = 91/6 - 49/4 = 35/12.
I Let Y be number of heads in two fair coin tosses. What is Var[Y ]?
I Recall P{Y = 0} = 1/4 and P{Y = 1} = 1/2 and P{Y = 2} = 1/4.
I Then Var[Y ] = E [Y^2 ] - E [Y ]^2 = (1/4)0^2 + (1/2)1^2 + (1/4)2^2 - 1^2 = 1/2.
I You buy a lottery ticket that gives you a one in a million chance to win a million dollars.
I Let X be the amount you win. What's the expectation of X ?
I How about the variance?
I Variance is more sensitive than expectation to rare outlier events.
I At a particular party, there are four five-foot-tall people, five six-foot-tall people, and one seven-foot-tall person. You pick one of these people uniformly at random. What is the expected height of the person you pick?
I E [X ] = .4 · 5 + .5 · 6 + .1 · 7 = 5.7
I Variance?
I .4 · 25 + .5 · 36 + .1 · 49 - (5.7)^2 = 32.9 - 32.49 = .41

Outline Outline

Defining variance Defining variance

Examples Examples

Properties Properties

Decomposition trick Decomposition trick


Identity Standard deviation
I If Y = X + b, where b is constant, then does it follow that Var[Y ] = Var[X ]?
I Yes.
I We showed earlier that E [aX ] = aE [X ]. We claim that Var[aX ] = a^2 Var[X ].
I Proof: Var[aX ] = E [a^2 X^2 ] - E [aX ]^2 = a^2 E [X^2 ] - a^2 E [X ]^2 = a^2 Var[X ].
I Write SD[X ] = √Var[X ].
I Satisfies identity SD[aX ] = aSD[X ].
I Uses the same units as X itself.
I If we switch from feet to inches in our height of randomly chosen person example, then X , E [X ], and SD[X ] each get multiplied by 12, but Var[X ] gets multiplied by 144.

Outline Outline

Defining variance Defining variance

Examples Examples

Properties Properties

Decomposition trick Decomposition trick


Number of aces Number of aces revisited
I Choose five cards from a standard deck of 52 cards. Let A be the number of aces you see.
I Let's compute E [A] and Var[A].
I To start with, how many five card hands total?
I Answer: (52 choose 5).
I How many such hands have k aces?
I Answer: (4 choose k)(48 choose 5 - k).
I So P{A = k} = (4 choose k)(48 choose 5 - k)/(52 choose 5).
I So E [A] = Σ_{k=0}^4 kP{A = k},
I and Var[A] = Σ_{k=0}^4 k^2 P{A = k} - E [A]^2 .
I Choose five cards in order, and let Ai be 1 if the ith card chosen is an ace and zero otherwise.
I Then A = Σ_{i=1}^5 Ai . And E [A] = Σ_{i=1}^5 E [Ai ] = 5/13.
I Now A^2 = (A1 + A2 + . . . + A5 )^2 can be expanded into 25 terms: A^2 = Σ_{i=1}^5 Σ_{j=1}^5 Ai Aj .
I So E [A^2 ] = Σ_{i=1}^5 Σ_{j=1}^5 E [Ai Aj ].
I Five terms of form E [Ai Aj ] with i = j, twenty with i ≠ j. First five contribute 1/13 each. How about other twenty?
I E [Ai Aj ] = (1/13)(3/51) = (1/13)(1/17). So E [A^2 ] = 5/13 + 20/(13 · 17) = 105/(13 · 17).
I Var[A] = E [A^2 ] - E [A]^2 = 105/(13 · 17) - 25/(13 · 13).
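Both routes to E [A] and Var[A] can be confirmed with exact arithmetic, using the hypergeometric pmf from the first computation; a sketch (not from the slides):

```python
from math import comb
from fractions import Fraction

def p(k):
    """P{A = k}: probability of exactly k aces in a five-card hand."""
    return Fraction(comb(4, k) * comb(48, 5 - k), comb(52, 5))

EA = sum(k * p(k) for k in range(5))
VarA = sum(k * k * p(k) for k in range(5)) - EA ** 2

assert EA == Fraction(5, 13)                                  # indicator answer
assert VarA == Fraction(105, 13 * 17) - Fraction(25, 13 * 13) # decomposition answer
print(EA, VarA)
```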

Hat problem variance

I In the n-hat shuffle problem, let X be the number of people who get their own hat. What is Var[X ]?
I We showed earlier that E [X ] = 1. So Var[X ] = E [X^2 ] - 1.
I But how do we compute E [X^2 ]?
I Decomposition trick: write variable as sum of simple variables.
I Let Xi be one if ith person gets own hat and zero otherwise. Then X = X1 + X2 + . . . + Xn = Σ_{i=1}^n Xi .
I We want to compute E [(X1 + X2 + . . . + Xn )^2 ].
I Expand this out and use linearity of expectation:
E [Σ_{i=1}^n Σ_{j=1}^n Xi Xj ] = Σ_{i=1}^n Σ_{j=1}^n E [Xi Xj ] = n · (1/n) + n(n - 1) · 1/(n(n - 1)) = 2.
I So Var[X ] = E [X^2 ] - (E [X ])^2 = 2 - 1 = 1.
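The E [X ] = Var[X ] = 1 conclusion can be checked by simulating random hat shuffles; a sketch (parameters and names mine):

```python
import random

def hat_stats(n=10, trials=40_000):
    """Simulate the n-hat shuffle and return the sample mean and variance
    of the number of people who get their own hat."""
    counts = []
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        counts.append(sum(perm[i] == i for i in range(n)))
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / trials
    return mean, var

print(hat_stats())  # both numbers should be close to 1
```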
Outline

18.600: Lecture 11
Binomial random variables and repeated Bernoulli random variables

trials
Properties: expectation and variance
Scott Sheffield

MIT
More problems

Bernoulli random variables
I Toss fair coin n times. (Tosses are independent.) What is the probability of k heads?
I Answer: (n choose k)/2^n .
I What if coin has p probability to be heads?
I Answer: (n choose k)p^k (1 - p)^{n-k} .
I Writing q = 1 - p, we can write this as (n choose k)p^k q^{n-k} .
I Can use binomial theorem to show probabilities sum to one: 1 = 1^n = (p + q)^n = Σ_{k=0}^n (n choose k)p^k q^{n-k} .
I Number of heads is binomial random variable with parameters (n, p).
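The binomial-theorem check that the probabilities sum to one is easy to reproduce numerically; a quick sketch (parameters mine):

```python
from math import comb

def binom_pmf(n, p):
    """[P{X = k} for k = 0..n] for a binomial (n, p) random variable."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

pmf = binom_pmf(6, 0.5)
assert abs(sum(pmf) - 1) < 1e-12   # (p + q)^n = 1
print(pmf[3])  # P{X = 3} = 20/64 = 0.3125
```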
Examples Other examples
I Toss 6 fair coins. Let X be number of heads you see. Then X is binomial with parameters (n, p) given by (6, 1/2).
I Probability mass function for X can be computed using the 6th row of Pascal's triangle.
I If coin is biased (comes up heads with probability p ≠ 1/2), we can still use the 6th row of Pascal's triangle, but the probability that X = i gets multiplied by p^i (1 - p)^{6-i} .
I Room contains n people. What is the probability that exactly i of them were born on a Tuesday?
I Answer: use binomial formula (n choose i)p^i q^{n-i} with p = 1/7 and q = 1 - p = 6/7.
I Let n = 100. Compute the probability that nobody was born on a Tuesday.
I What is the probability that exactly 15 people were born on a Tuesday?

Outline Outline

Bernoulli random variables Bernoulli random variables

Properties: expectation and variance Properties: expectation and variance

More problems More problems


Expectation Useful Pascal's triangle identity
I Let X be a binomial random variable with parameters (n, p).
I What is E [X ]?
I Direct approach: by definition of expectation, E [X ] = Σ_{i=0}^n P{X = i}i.
I What happens if we modify the nth row of Pascal's triangle by multiplying the i term by i?
I For example, replace the 5th row (1, 5, 10, 10, 5, 1) by (0, 5, 20, 30, 20, 5). Does this remind us of an earlier row in the triangle?
I Perhaps the prior row (1, 4, 6, 4, 1)?
I Recall that (n choose i) = n(n - 1) . . . (n - i + 1)/(i(i - 1) . . . (1)). This implies a simple but important identity: i(n choose i) = n(n - 1 choose i - 1).
I Using this identity (and q = 1 - p), we can write E [X ] = Σ_{i=0}^n i(n choose i)p^i q^{n-i} = n Σ_{i=1}^n (n - 1 choose i - 1)p^i q^{n-i} .
I Rewrite this as E [X ] = np Σ_{i=1}^n (n - 1 choose i - 1)p^{i-1} q^{(n-1)-(i-1)} .
I Substitute j = i - 1 to get E [X ] = np Σ_{j=0}^{n-1} (n - 1 choose j)p^j q^{(n-1)-j} = np(p + q)^{n-1} = np.

Decomposition approach to computing expectation Interesting moment computation
I Let X be a binomial random variable with parameters (n, p).
I Here is another way to compute E [X ].
I Think of X as representing number of heads in n tosses of coin that is heads with probability p.
I Write X = Σ_{j=1}^n Xj , where Xj is 1 if the jth coin is heads, 0 otherwise.
I In other words, Xj is the number of heads (zero or one) on the jth toss.
I Note that E [Xj ] = p · 1 + (1 - p) · 0 = p for each j.
I Conclude by additivity of expectation that E [X ] = Σ_{j=1}^n E [Xj ] = Σ_{j=1}^n p = np.
I Let X be binomial (n, p) and fix k ≥ 1. What is E [X^k ]?
I Recall identity: i(n choose i) = n(n - 1 choose i - 1).
I Generally, E [X^k ] can be written as Σ_{i=0}^n i(n choose i)p^i (1 - p)^{n-i} i^{k-1} .
I Identity gives E [X^k ] = np Σ_{i=1}^n (n - 1 choose i - 1)p^{i-1} (1 - p)^{n-i} i^{k-1} = np Σ_{j=0}^{n-1} (n - 1 choose j)p^j (1 - p)^{n-1-j} (j + 1)^{k-1} .
I Thus E [X^k ] = npE [(Y + 1)^{k-1} ] where Y is binomial with parameters (n - 1, p).
Computing the variance Compute variance with decomposition trick
I Let X be binomial (n, p). What is Var[X ]?
I We know E [X ] = np.
I We computed identity E [X^k ] = npE [(Y + 1)^{k-1} ] where Y is binomial with parameters (n - 1, p).
I In particular E [X^2 ] = npE [Y + 1] = np[(n - 1)p + 1].
I So Var[X ] = E [X^2 ] - E [X ]^2 = np(n - 1)p + np - (np)^2 = np(1 - p) = npq, where q = 1 - p.
I Commit to memory: variance of binomial (n, p) random variable is npq.
I This is n times the variance you'd get with a single coin. Coincidence?
I X = Σ_{j=1}^n Xj , so E [X^2 ] = E [Σ_{i=1}^n Xi Σ_{j=1}^n Xj ] = Σ_{i=1}^n Σ_{j=1}^n E [Xi Xj ].
I E [Xi Xj ] is p if i = j, p^2 otherwise.
I Σ_{i=1}^n Σ_{j=1}^n E [Xi Xj ] has n terms equal to p and (n - 1)n terms equal to p^2 .
I So E [X^2 ] = np + (n - 1)np^2 = np + (np)^2 - np^2 .
I Thus Var[X ] = E [X^2 ] - E [X ]^2 = np - np^2 = np(1 - p) = npq.
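The np and npq formulas can be checked numerically straight from the pmf; a small sketch (parameters mine):

```python
from math import comb

n, p = 10, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum(k * k * q for k, q in enumerate(pmf)) - mean**2

assert abs(mean - n * p) < 1e-9            # np = 3
assert abs(var - n * p * (1 - p)) < 1e-9   # npq = 2.1
print(mean, var)
```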

Outline Outline

Bernoulli random variables Bernoulli random variables

Properties: expectation and variance Properties: expectation and variance

More problems More problems


More examples
I An airplane seats 200, but the airline has sold 205 tickets. Each person, independently, has a .05 chance of not showing up for the flight. What is the probability that more than 200 people will show up for the flight?
I Σ_{j=201}^{205} (205 choose j).95^j .05^{205-j}
I In a 100 person senate, forty people always vote for the Republicans' position, forty people always for the Democrats' position and 20 people just toss a coin to decide which way to vote. What is the probability that a given vote is tied?
I (20 choose 10)/2^20
I You invite 50 friends to a party. Each one, independently, has a 1/3 chance of showing up. What is the probability that more than 25 people will show up?
I Σ_{j=26}^{50} (50 choose j)(1/3)^j (2/3)^{50-j}
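The airplane answer is a finite sum that is easy to evaluate directly; a sketch (function name mine):

```python
from math import comb

def p_overbooked(seats, sold, p_show):
    """P(more than `seats` of the `sold` ticket holders show up),
    each showing up independently with probability p_show."""
    return sum(comb(sold, j) * p_show**j * (1 - p_show)**(sold - j)
               for j in range(seats + 1, sold + 1))

print(p_overbooked(200, 205, 0.95))  # a small but non-negligible probability
```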
Outline

18.600: Lecture 12
Poisson random variable definition
Poisson random variables

Scott Sheffield Poisson random variable properties

MIT

Poisson random variable problems

Outline Poisson random variables: motivating questions

I How many raindrops hit a given square inch of sidewalk


during a ten minute period?
I How many people fall down the stairs in a major city on a
Poisson random variable definition given day?
I How many plane crashes in a given year?
I How many radioactive particles emitted during a time period
Poisson random variable properties in which the expected number emitted is 5?
I How many calls to call center during a given minute?
I How many goals scored during a 90 minute soccer game?
Poisson random variable problems I How many notable gaffes during 90 minute debate?
I Key idea for all these examples: Divide time into large
number of small increments. Assume that during each
increment, there is some small probability of thing happening
(independently of other increments).
Remember what e is? Bernoulli random variable with n large and np = λ
I The number e is defined by e = lim_{n→∞} (1 + 1/n)^n .
I It's the amount of money that one dollar grows to over a year when you have an interest rate of 100 percent, continuously compounded.
I Similarly, e^λ = lim_{n→∞} (1 + λ/n)^n .
I It's the amount of money that one dollar grows to over λ years when you have an interest rate of 100 percent, continuously compounded.
I Can also change sign: e^{-λ} = lim_{n→∞} (1 - λ/n)^n .
I Let λ be some moderate-sized number. Say λ = 2 or λ = 3. Let n be a huge number, say n = 10^6 .
I Suppose I have a coin that comes up heads with probability λ/n and I toss it n times.
I How many heads do I expect to see?
I Answer: np = λ.
I Let k be some moderate sized number (say k = 4). What is the probability that I see exactly k heads?
I Binomial formula: (n choose k)p^k (1 - p)^{n-k} = [n(n - 1)(n - 2) . . . (n - k + 1)/k!]p^k (1 - p)^{n-k} .
I This is approximately (λ^k /k!)(1 - p)^{n-k} ≈ (λ^k /k!)e^{-λ} .
I A Poisson random variable X with parameter λ satisfies P{X = k} = (λ^k /k!)e^{-λ} for integer k ≥ 0.
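The binomial-to-Poisson approximation can be checked numerically (λ = 2 and n = 10^6, as in the slide's setup):

```python
from math import comb, exp, factorial

lam, n = 2.0, 10**6
p = lam / n

for k in range(6):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)     # exact binomial pmf
    poisson = lam**k / factorial(k) * exp(-lam)      # Poisson approximation
    assert abs(binom - poisson) < 1e-6
print("binomial(n, lam/n) pmf matches Poisson(lam) pmf for small k")
```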

Outline Outline

Poisson random variable definition Poisson random variable definition

Poisson random variable properties Poisson random variable properties

Poisson random variable problems Poisson random variable problems


Probabilities sum to one Expectation
I A Poisson random variable X with parameter λ satisfies p(k) = P{X = k} = (λ^k /k!)e^{-λ} for integer k ≥ 0.
I How can we show that Σ_{k=0}^∞ p(k) = 1?
I Use Taylor expansion e^λ = Σ_{k=0}^∞ λ^k /k! .
I What is E [X ]?
I We think of a Poisson random variable as being (roughly) a Bernoulli (n, p) random variable with n very large and p = λ/n.
I This would suggest E [X ] = λ. Can we show this directly from the formula for P{X = k}?
I By definition of expectation, E [X ] = Σ_{k=0}^∞ P{X = k}k = Σ_{k=0}^∞ k(λ^k /k!)e^{-λ} = λ Σ_{k=1}^∞ (λ^{k-1} /(k - 1)!)e^{-λ} .
I Setting j = k - 1, this is λ Σ_{j=0}^∞ (λ^j /j!)e^{-λ} = λ.

Variance
I Given P{X = k} = (λ^k /k!)e^{-λ} for integer k ≥ 0, what is Var[X ]?
I Think of X as (roughly) a Bernoulli (n, p) random variable with n very large and p = λ/n.
I This suggests Var[X ] ≈ npq ≈ λ (since np ≈ λ and q = 1 - p ≈ 1). Can we show directly that Var[X ] = λ?
I Compute E [X^2 ] = Σ_{k=0}^∞ P{X = k}k^2 = Σ_{k=0}^∞ k^2 (λ^k /k!)e^{-λ} = λ Σ_{k=1}^∞ k(λ^{k-1} /(k - 1)!)e^{-λ} .
I Setting j = k - 1, this is λ Σ_{j=0}^∞ (j + 1)(λ^j /j!)e^{-λ} = λE [X + 1] = λ(λ + 1).
I Then Var[X ] = E [X^2 ] - E [X ]^2 = λ(λ + 1) - λ^2 = λ.
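All three Poisson facts (pmf sums to one, mean λ, variance λ) can be checked numerically by truncating the series where the terms are negligible; a sketch (λ and truncation point are mine):

```python
from math import exp, factorial

lam = 3.0
pmf = [lam**k / factorial(k) * exp(-lam) for k in range(60)]  # tail beyond 60 is negligible

total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum(k * k * q for k, q in enumerate(pmf)) - mean**2

assert abs(total - 1) < 1e-12
assert abs(mean - lam) < 1e-9 and abs(var - lam) < 1e-9
print(mean, var)
```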
Poisson random variable problems
I A country has an average of 2 plane crashes per year.
I How reasonable is it to assume the number of crashes is Poisson with parameter 2?
I Assuming this, what is the probability of exactly 2 crashes? Of zero crashes? Of four crashes?
I e^{-λ} λ^k /k! with λ = 2 and k set to 2 or 0 or 4
I A city has an average of five major earthquakes a century. What is the probability that there is at least one major earthquake in a given decade (assuming the number of earthquakes per decade is Poisson)?
I 1 - e^{-λ} λ^k /k! with λ = .5 and k = 0
I A casino deals one million five-card poker hands per year. Approximate the probability that there are exactly 2 royal flush hands during a given year.
I Expected number of royal flushes is λ = 10^6 · 4/(52 choose 5) ≈ 1.54.
I Answer is e^{-λ} λ^k /k! with k = 2.
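The royal-flush estimate is a two-line computation:

```python
from math import comb, exp

lam = 10**6 * 4 / comb(52, 5)     # expected royal flushes per year, about 1.54
p_two = exp(-lam) * lam**2 / 2    # Poisson P{X = 2}
print(lam, p_two)
```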
Outline

18.600: Lecture 13
Lectures 1-12 Review Counting tricks and basic principles of probability

Scott Sheffield

MIT Discrete random variables

Selected counting tricks
I Break "choosing one of the items to be counted" into a sequence of stages so that one always has the same number of choices to make at each stage. Then the total count becomes a product of number of choices available at each stage.
I Overcount by a fixed factor.
I If you have n elements you wish to divide into r distinct piles of sizes n1 , n2 . . . nr , how many ways to do that?
I Answer: (n choose n1 , n2 , . . . , nr ) := n!/(n1 !n2 ! . . . nr !).
I How many sequences a1 , . . . , ak of non-negative integers satisfy a1 + a2 + . . . + ak = n?
I Answer: (n + k - 1 choose n). Represent partition by k - 1 bars and n stars.
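The stars-and-bars count can be checked by brute force for small n and k; a sketch (names mine):

```python
from math import comb
from itertools import product

def count_solutions(n, k):
    """Brute-force count of non-negative (a_1, ..., a_k) with sum n."""
    return sum(1 for a in product(range(n + 1), repeat=k) if sum(a) == n)

n, k = 5, 3
assert count_solutions(n, k) == comb(n + k - 1, n)
print(comb(n + k - 1, n))  # 21
```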
Axioms of probability Consequences of axioms
I Have a set S called sample space.
I P(A) ∈ [0, 1] for all (measurable) A ⊂ S.
I P(S) = 1.
I Finite additivity: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.
I Countable additivity: P(∪_{i=1}^∞ Ei ) = Σ_{i=1}^∞ P(Ei ) if Ei ∩ Ej = ∅ for each pair i and j.
I P(A^c ) = 1 - P(A)
I A ⊂ B implies P(A) ≤ P(B)
I P(A ∪ B) = P(A) + P(B) - P(AB)
I P(AB) ≤ P(A)

Inclusion-exclusion identity Famous hat problem
I Observe P(A ∪ B) = P(A) + P(B) - P(AB).
I Also, P(E ∪ F ∪ G ) = P(E ) + P(F ) + P(G ) - P(EF ) - P(EG ) - P(FG ) + P(EFG ).
I More generally, P(∪_{i=1}^n Ei ) = Σ_{i=1}^n P(Ei ) - Σ_{i1<i2} P(Ei1 Ei2 ) + . . . + (-1)^{r+1} Σ_{i1<i2<...<ir} P(Ei1 Ei2 . . . Eir ) + . . . + (-1)^{n+1} P(E1 E2 . . . En ).
I The notation Σ_{i1<i2<...<ir} means a sum over all of the (n choose r) subsets of size r of the set {1, 2, . . . , n}.
I n people toss hats into a bin, randomly shuffle, return one hat to each person. Find probability nobody gets own hat.
I Inclusion-exclusion. Let Ei be the event that ith person gets own hat.
I What is P(Ei1 Ei2 . . . Eir )?
I Answer: (n - r)!/n!.
I There are (n choose r) terms like that in the inclusion-exclusion sum. What is (n choose r)(n - r)!/n!?
I Answer: 1/r!.
I P(∪_{i=1}^n Ei ) = 1 - 1/2! + 1/3! - 1/4! + . . . ± 1/n!
I 1 - P(∪_{i=1}^n Ei ) = 1 - 1 + 1/2! - 1/3! + 1/4! - . . . ± 1/n! ≈ 1/e ≈ .36788
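The truncated inclusion-exclusion series converges to 1/e quickly; a sketch (function name mine):

```python
from math import factorial
from fractions import Fraction

def p_no_fixed_point(n):
    """1 - P(union of the E_i) = sum_{r=0}^{n} (-1)^r / r!  (inclusion-exclusion)."""
    return sum(Fraction((-1) ** r, factorial(r)) for r in range(n + 1))

for n in (4, 8, 12):
    print(n, float(p_no_fixed_point(n)))  # approaches 1/e = 0.36787...
```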
Conditional probability

I Definition: P(E|F) = P(EF)/P(F).
I Call P(E|F) the conditional probability of E given F or
  probability of E conditioned on F.
I Nice fact: P(E_1E_2E_3...E_n) =
  P(E_1)P(E_2|E_1)P(E_3|E_1E_2)...P(E_n|E_1...E_{n−1})
I Useful when we think about multi-step experiments.
I For example, let E_i be event ith person gets own hat in the
  n-hat shuffle problem.

Dividing probability into two cases

I P(E) = P(EF) + P(EF^c)
       = P(E|F)P(F) + P(E|F^c)P(F^c)
I In words: want to know the probability of E. There are two
  scenarios F and F^c. If I know the probabilities of the two
  scenarios and the probability of E conditioned on each
  scenario, I can work out the probability of E.

Bayes' theorem

I Bayes' theorem/law/rule states the following:
  P(A|B) = P(B|A)P(A)/P(B).
I Follows from definition of conditional probability:
  P(AB) = P(B)P(A|B) = P(A)P(B|A).
I Tells how to update estimate of probability of A when new
  evidence restricts your sample space to B.
I So P(A|B) is P(B|A)/P(B) times P(A).
I Ratio P(B|A)/P(B) determines how compelling new evidence is.

P(·|F) is a probability measure

I We can check the probability axioms: 0 ≤ P(E|F) ≤ 1,
  P(S|F) = 1, and P(∪E_i|F) = Σ P(E_i|F), if i ranges over a
  countable set and the E_i are disjoint.
I The probability measure P(·|F) is related to P(·).
I To get former from latter, we set probabilities of elements
  outside of F to zero and multiply probabilities of events inside
  of F by 1/P(F).
I P(·) is the prior probability measure and P(·|F) is the
  posterior measure (revised after discovering that F occurs).
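A small numeric illustration of the update rule (the specific probabilities below are made up for illustration):

```python
# Bayes' rule with illustrative numbers: a test detects a condition A
# with P(B|A) = 0.99, while P(A) = 0.01 and the false-positive rate
# P(B|A^c) = 0.05.
p_A = 0.01
p_B_given_A = 0.99
p_B_given_not_A = 0.05

# Dividing into two cases: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B  # Bayes' rule
print(round(p_A_given_B, 4))  # -> 0.1667
```

Note how the posterior (about 1/6) is far larger than the prior (1/100) but still far from certainty, because the ratio P(B|A)/P(B) is only moderately large.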
Independence

I Say E and F are independent if P(EF) = P(E)P(F).
I Equivalent statement: P(E|F) = P(E). Also equivalent:
  P(F|E) = P(F).

Independence of multiple events

I Say E_1, ..., E_n are independent if for each
  {i1, i2, ..., ik} ⊆ {1, 2, ..., n} we have
  P(E_{i1}E_{i2}...E_{ik}) = P(E_{i1})P(E_{i2})...P(E_{ik}).
I In other words, the product rule works.
I Independence implies P(E_1E_2E_3|E_4E_5E_6) =
  P(E_1)P(E_2)P(E_3)P(E_4)P(E_5)P(E_6)/(P(E_4)P(E_5)P(E_6)) = P(E_1E_2E_3),
  and other similar statements.
I Does pairwise independence imply independence?
I No. Consider these three events: first coin heads, second coin
  heads, odd number heads. Pairwise independent, not
  independent.
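The two-coin counterexample can be checked exhaustively over its four equally likely outcomes; a short sketch (illustrative only):

```python
from itertools import product

# E1 = first coin heads, E2 = second coin heads, E3 = odd number of
# heads. Each pair is independent, but the triple is not.
outcomes = list(product([0, 1], repeat=2))  # 1 = heads; all equally likely

def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

E1 = lambda w: w[0] == 1
E2 = lambda w: w[1] == 1
E3 = lambda w: (w[0] + w[1]) % 2 == 1

for A, B in [(E1, E2), (E1, E3), (E2, E3)]:
    assert prob(lambda w: A(w) and B(w)) == prob(A) * prob(B)

triple = prob(lambda w: E1(w) and E2(w) and E3(w))  # = 0: HH has even # heads
assert triple != prob(E1) * prob(E2) * prob(E3)     # 0 != 1/8
```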

Outline

Counting tricks and basic principles of probability

Discrete random variables


Random variables

I A random variable X is a function from the state space to the
  real numbers.
I Can interpret X as a quantity whose value depends on the
  outcome of an experiment.
I Say X is a discrete random variable if (with probability one)
  it takes one of a countable set of values.
I For each a in this countable set, write p(a) := P{X = a}.
  Call p the probability mass function.
I Write F(a) = P{X ≤ a} = Σ_{x≤a} p(x). Call F the
  cumulative distribution function.

Indicators

I Given any event E, can define an indicator random variable,
  i.e., let X be random variable equal to 1 on the event E and 0
  otherwise. Write this as X = 1_E.
I The value of 1_E (either 1 or 0) indicates whether the event
  has occurred.
I If E_1, E_2, ..., E_k are events then X = Σ_{i=1}^k 1_{E_i} is the number
  of these events that occur.
I Example: in n-hat shuffle problem, let E_i be the event ith
  person gets own hat.
I Then Σ_{i=1}^n 1_{E_i} is total number of people who get own hats.

Expectation of a discrete random variable

I Say X is a discrete random variable if (with probability one)
  it takes one of a countable set of values.
I For each a in this countable set, write p(a) := P{X = a}.
  Call p the probability mass function.
I The expectation of X, written E[X], is defined by
  E[X] = Σ_{x: p(x)>0} x p(x).
I Represents weighted average of possible values X can take,
  each value being weighted by its probability.

Expectation when state space is countable

I If the state space S is countable, we can give SUM OVER
  STATE SPACE definition of expectation:
  E[X] = Σ_{s∈S} P{s} X(s).
I Agrees with the SUM OVER POSSIBLE X VALUES definition:
  E[X] = Σ_{x: p(x)>0} x p(x).


Expectation of a function of a random variable

I If X is a random variable and g is a function from the real
  numbers to the real numbers then g(X) is also a random
  variable.
I How can we compute E[g(X)]?
I Answer: E[g(X)] = Σ_{x: p(x)>0} g(x) p(x).

Additivity of expectation

I If X and Y are distinct random variables, then
  E[X + Y] = E[X] + E[Y].
I In fact, for real constants a and b, we have
  E[aX + bY] = aE[X] + bE[Y].
I This is called the linearity of expectation.
I Can extend to more variables:
  E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n].

Defining variance in discrete case

I Let X be a random variable with mean μ.
I The variance of X, denoted Var(X), is defined by
  Var(X) = E[(X − μ)^2].
I Taking g(x) = (x − μ)^2, and recalling that
  E[g(X)] = Σ_{x: p(x)>0} g(x) p(x), we find that
  Var[X] = Σ_{x: p(x)>0} (x − μ)^2 p(x).
I Variance is one way to measure the amount a random variable
  varies from its mean over successive trials.
I Very important alternate formula: Var[X] = E[X^2] − (E[X])^2.

Identity

I If Y = X + b, where b is constant, then Var[Y] = Var[X].
I Also, Var[aX] = a^2 Var[X].
I Proof: Var[aX] = E[a^2 X^2] − E[aX]^2 = a^2 E[X^2] − a^2 E[X]^2 =
  a^2 Var[X].
Standard deviation

I Write SD[X] = √(Var[X]).
I Satisfies identity SD[aX] = |a| SD[X].
I Uses the same units as X itself.
I If we switch from feet to inches in our height of randomly
  chosen person example, then X, E[X], and SD[X] each get
  multiplied by 12, but Var[X] gets multiplied by 144.

Bernoulli random variables

I Toss fair coin n times. (Tosses are independent.) What is the
  probability of k heads?
I Answer: (n choose k)/2^n.
I What if coin has p probability to be heads?
I Answer: (n choose k) p^k (1 − p)^{n−k}.
I Writing q = 1 − p, we can write this as (n choose k) p^k q^{n−k}.
I Can use binomial theorem to show probabilities sum to one:
  1 = 1^n = (p + q)^n = Σ_{k=0}^n (n choose k) p^k q^{n−k}.
I Number of heads is binomial random variable with
  parameters (n, p).
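Both the sum-to-one identity and the mean np are easy to sanity-check numerically; a short sketch (illustrative only):

```python
from math import comb

# Check that the binomial probabilities C(n,k) p^k q^(n-k) sum to one
# and that the mean matches np, for a couple of (n, p) pairs.
for n, p in [(10, 0.5), (20, 1 / 3)]:
    q = 1 - p
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    assert abs(sum(pmf) - 1) < 1e-12
    mean = sum(k * pk for k, pk in enumerate(pmf))
    assert abs(mean - n * p) < 1e-9
```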

Decomposition approach to computing expectation

I Let X be a binomial random variable with parameters (n, p).
  Here is one way to compute E[X].
I Think of X as representing number of heads in n tosses of
  coin that is heads with probability p.
I Write X = Σ_{j=1}^n X_j, where X_j is 1 if the jth coin is heads, 0
  otherwise.
I In other words, X_j is the number of heads (zero or one) on the
  jth toss.
I Note that E[X_j] = p·1 + (1 − p)·0 = p for each j.
I Conclude by additivity of expectation that
  E[X] = Σ_{j=1}^n E[X_j] = Σ_{j=1}^n p = np.

Compute variance with decomposition trick

I X = Σ_{j=1}^n X_j, so
  E[X^2] = E[Σ_{i=1}^n X_i Σ_{j=1}^n X_j] = Σ_{i=1}^n Σ_{j=1}^n E[X_i X_j]
I E[X_i X_j] is p if i = j, p^2 otherwise.
I Σ_{i=1}^n Σ_{j=1}^n E[X_i X_j] has n terms equal to p and (n − 1)n
  terms equal to p^2.
I So E[X^2] = np + (n − 1)np^2 = np + (np)^2 − np^2.
I Thus
  Var[X] = E[X^2] − E[X]^2 = np − np^2 = np(1 − p) = npq.
I Can show generally that if X_1, ..., X_n independent then
  Var[Σ_{j=1}^n X_j] = Σ_{j=1}^n Var[X_j]
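The conclusion Var[X] = npq can be verified exactly from the mass function; a quick illustrative sketch:

```python
from math import comb

# Exact check of Var[X] = npq for a binomial (n, p), computed directly
# from the probability mass function via E[X^2] - E[X]^2.
n, p = 12, 0.3
q = 1 - p
pmf = {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}
EX = sum(k * pk for k, pk in pmf.items())
EX2 = sum(k * k * pk for k, pk in pmf.items())
var = EX2 - EX**2
print(var, n * p * q)  # both ~ 2.52
```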
Bernoulli random variable with n large and np = λ

I Let λ be some moderate-sized number. Say λ = 2 or λ = 3.
  Let n be a huge number, say n = 10^6.
I Suppose I have a coin that comes up heads with probability
  λ/n and I toss it n times.
I How many heads do I expect to see?
I Answer: np = λ.
I Let k be some moderate sized number (say k = 4). What is
  the probability that I see exactly k heads?
I Binomial formula:
  (n choose k) p^k (1 − p)^{n−k} = [n(n−1)(n−2)...(n−k+1)/k!] p^k (1 − p)^{n−k}.
I This is approximately (λ^k/k!) (1 − p)^{n−k} ≈ (λ^k/k!) e^{−λ}.
I A Poisson random variable X with parameter λ satisfies
  P{X = k} = (λ^k/k!) e^{−λ} for integer k ≥ 0.

Expectation and variance

I A Poisson random variable X with parameter λ satisfies
  P{X = k} = (λ^k/k!) e^{−λ} for integer k ≥ 0.
I Clever computation tricks yield E[X] = λ and Var[X] = λ.
I We think of a Poisson random variable as being (roughly) a
  Bernoulli (n, p) random variable with n very large and
  p = λ/n.
I This also suggests E[X] = np = λ and Var[X] = npq ≈ λ.
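The approximation can be tested numerically; the sketch below (illustrative only, standard library) compares binomial(n, λ/n) with Poisson(λ) for n = 10^6:

```python
from math import comb, exp, factorial

# Compare binomial(n, lam/n) probabilities with the Poisson(lam) mass
# function for large n, as in the limit argument above.
lam, n = 3, 10**6
p = lam / n
for k in range(5):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k / factorial(k) * exp(-lam)
    assert abs(binom - poisson) < 1e-5
```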

Poisson point process

I A Poisson point process is a random function N(t) called a
  Poisson process of rate λ.
I For each t > s ≥ 0, the value N(t) − N(s) describes the
  number of events occurring in the time interval (s, t) and is
  Poisson with parameter λ(t − s).
I The numbers of events occurring in disjoint intervals are
  independent random variables.
I Probability to see zero events in first t time units is e^{−λt}.
I Let T_k be time elapsed, since the previous event, until the kth
  event occurs. Then the T_k are independent random variables,
  each of which is exponential with parameter λ.

Geometric random variables

I Consider an infinite sequence of independent tosses of a coin
  that comes up heads with probability p.
I Let X be such that the first heads is on the Xth toss.
I Answer: P{X = k} = (1 − p)^{k−1} p = q^{k−1} p, where q = 1 − p
  is tails probability.
I Say X is a geometric random variable with parameter p.
I Some cool calculation tricks show that E[X] = 1/p.
I And Var[X] = q/p^2.
Negative binomial random variables

I Consider an infinite sequence of independent tosses of a coin
  that comes up heads with probability p.
I Let X be such that the rth heads is on the Xth toss.
I Then P{X = k} = (k−1 choose r−1) p^{r−1} (1 − p)^{k−r} p.
I Call X negative binomial random variable with
  parameters (r, p).
I So E[X] = r/p.
I And Var[X] = rq/p^2.
18.600: Lecture 15
Poisson processes

Scott Sheffield

MIT

Outline

Poisson random variables

What should a Poisson point process be?

Poisson point process axioms

Consequences of axioms

Properties from last time...

I A Poisson random variable X with parameter λ satisfies
  P{X = k} = (λ^k/k!) e^{−λ} for integer k ≥ 0.
I The probabilities are approximately those of a binomial with
  parameters (n, λ/n) when n is very large.
I Indeed,
  (n choose k) p^k (1 − p)^{n−k} = [n(n−1)(n−2)...(n−k+1)/k!] p^k (1 − p)^{n−k}
  ≈ (λ^k/k!) (1 − p)^{n−k} ≈ (λ^k/k!) e^{−λ}.
I General idea: if you have a large number of unlikely events
  that are (mostly) independent of each other, and the expected
  number that occur is λ, then the total number that occur
  should be (approximately) a Poisson random variable with
  parameter λ.
Properties from last time...

I Many phenomena (number of phone calls or customers
  arriving in a given period, number of radioactive emissions in
  a given time period, number of major hurricanes in a given
  time period, etc.) can be modeled this way.
I A Poisson random variable X with parameter λ has
  expectation λ and variance λ.
I Special case: if λ = 1, then P{X = k} = 1/(k!e).
I Note how quickly this goes to zero, as a function of k.
I Example: number of royal flushes in a million five-card poker
  hands is approximately Poisson with parameter
  10^6/649739 ≈ 1.54.
I Example: if a country expects 2 plane crashes in a year, then
  the total number might be approximately Poisson with
  parameter λ = 2.

A cautionary tail

I Example: Joe works for a bank and notices that his town sees
  an average of one mortgage foreclosure per month.
I Moreover, looking over five years of data, it seems that the
  number of foreclosures per month follows a rate 1 Poisson
  distribution.
I That is, roughly a 1/e fraction of months has 0 foreclosures, a
  1/e fraction has 1, a 1/(2e) fraction has 2, a 1/(6e) fraction
  has 3, and a 1/(24e) fraction has 4.
I Joe concludes that the probability of seeing 10 foreclosures
  during a given month is only 1/(10!e). Probability to see 10
  or more (an extreme tail event that would destroy the bank) is
  Σ_{k=10}^∞ 1/(k!e), less than one in a million.
I Investors are impressed. Joe receives large bonus.
I But probably shouldn't....



How should we define the Poisson process?

I Whatever his faults, Joe was a good record keeper. He kept
  track of the precise times at which the foreclosures occurred
  over the whole five years (not just the total numbers of
  foreclosures). We could try this for other problems as well.
I Let's encode this information with a function. We'd like a
  random function N(t) that describes the number of events
  that occur during the first t units of time. (This could be a
  model for the number of plane crashes in first t years, or the
  number of royal flushes in first 10^6 t poker hands.)
I So N(t) is a random non-decreasing integer-valued
  function of t with N(0) = 0.
I For each t, N(t) is a random variable, and the N(t) are
  functions on the same sample space.

Poisson process axioms

I Let's back up and give a precise and minimal list of properties
  we want the random function N(t) to satisfy.
I 1. N(0) = 0.
I 2. Independence: Number of events (jumps of N) in disjoint
  time intervals are independent.
I 3. Homogeneity: Prob. distribution of # events in interval
  depends only on length. (Deduce: E[N(h)] = λh for some λ.)
I 4. Non-concurrence: P{N(h) ≥ 2} << P{N(h) = 1} when
  h is small. Precisely:
I P{N(h) = 1} = λh + o(h). (Here f(h) = o(h) means
  lim_{h→0} f(h)/h = 0.)
I P{N(h) ≥ 2} = o(h).
I A random function N(t) with these properties is a Poisson
  process with rate λ.

Consequences of axioms: time till first event

I Can we work out the probability of no events before time t?
I We assumed P{N(h) = 1} = λh + o(h) and
  P{N(h) ≥ 2} = o(h). Taken together, these imply that
  P{N(h) = 0} = 1 − λh + o(h).
I Fix λ and t. Probability of no events in interval of length t/n
  is (1 − λt/n) + o(1/n).
I Probability of no events in first n such intervals is about
  (1 − λt/n + o(1/n))^n ≈ e^{−λt}.
I Taking limit as n → ∞, can show that probability of no event
  in interval of length t is e^{−λt}.
I P{N(t) = 0} = e^{−λt}.
I Let T_1 be the time of the first event. Then
  P{T_1 ≥ t} = e^{−λt}. We say that T_1 is an exponential
  random variable with rate λ.

Consequences of axioms: time till second, third events

I Let T_2 be time between first and second event. Generally, T_k
  is time between (k − 1)th and kth event.
I Then the T_1, T_2, ... are independent of each other (informally
  this means that observing some of the random variables T_k
  gives you no information about the others). Each is an
  exponential random variable with rate λ.
I This finally gives us a way to construct N(t). It is determined
  by the sequence T_j of independent exponential random
  variables.
I Axioms can be readily verified from this description.
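This construction suggests a direct simulation: generate exponential gaps and count arrivals before time t. A short sketch (illustrative only; `random.expovariate` samples an exponential with the given rate):

```python
import math
import random

# Simulate a rate-lam Poisson process by summing independent
# exponential(lam) inter-arrival times, then check that P{N(t) = 0}
# is close to exp(-lam * t).
random.seed(1)
lam, t, trials = 2.0, 1.0, 100_000

def N_t():
    s, count = 0.0, 0
    while True:
        s += random.expovariate(lam)  # next inter-arrival time
        if s > t:
            return count
        count += 1

p_zero = sum(N_t() == 0 for _ in range(trials)) / trials
print(p_zero, math.exp(-lam * t))  # both near 0.1353
```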
Back to Poisson distribution

I Axioms should imply that P{N(t) = k} = e^{−λt} (λt)^k / k!.
I One way to prove this: divide time into n intervals of length
  t/n. In each, probability to see an event is p = λt/n + o(1/n).
I Use binomial theorem to describe probability to see event in
  exactly k intervals.
I Binomial formula:
  (n choose k) p^k (1 − p)^{n−k} = [n(n−1)(n−2)...(n−k+1)/k!] p^k (1 − p)^{n−k}.
I This is approximately ((λt)^k/k!) (1 − p)^{n−k} ≈ ((λt)^k/k!) e^{−λt}.
I Take n to infinity, and use fact that expected number of
  intervals with two or more points tends to zero (thus
  probability to see any intervals with two or more points tends to
  zero).

Summary

I We constructed a random function N(t) called a Poisson
  process of rate λ.
I For each t > s ≥ 0, the value N(t) − N(s) describes the
  number of events occurring in the time interval (s, t) and is
  Poisson with parameter λ(t − s).
I The numbers of events occurring in disjoint intervals are
  independent random variables.
I Let T_k be time elapsed, since the previous event, until the kth
  event occurs. Then the T_k are independent random variables,
  each of which is exponential with parameter λ.
18.600: Lecture 16
More discrete random variables

Scott Sheffield

MIT

Outline

Geometric random variables

Negative binomial random variables

Problems

Geometric random variables

I Consider an infinite sequence of independent tosses of a coin
  that comes up heads with probability p.
I Let X be such that the first heads is on the Xth toss.
I For example, if the coin sequence is T, T, H, T, H, T, ... then
  X = 3.
I Then X is a random variable. What is P{X = k}?
I Answer: P{X = k} = (1 − p)^{k−1} p = q^{k−1} p, where q = 1 − p
  is tails probability.
I Can you prove directly that these probabilities sum to one?
I Say X is a geometric random variable with parameter p.
Geometric random variable expectation

I Let X be a geometric random variable with parameter p.
  Then P{X = k} = (1 − p)^{k−1} p = q^{k−1} p for k ≥ 1.
I What is E[X]?
I By definition E[X] = Σ_{k=1}^∞ q^{k−1} p k.
I There's a trick to computing sums like this.
I Note E[X − 1] = Σ_{k=1}^∞ q^{k−1} p (k − 1). Setting j = k − 1, we
  have E[X − 1] = q Σ_{j=0}^∞ q^{j−1} p j = qE[X].
I Kind of makes sense. X − 1 is number of extra tosses after
  first. Given first coin heads (probability p), X − 1 is 0. Given
  first coin tails (probability q), conditional law of X − 1 is
  geometric with parameter p. In latter case, conditional
  expectation of X − 1 is same as a priori expectation of X.
I Thus E[X] − 1 = E[X − 1] = p·0 + qE[X] = qE[X] and
  solving for E[X] gives E[X] = 1/(1 − q) = 1/p.

Geometric random variable variance

I Let X be a geometric random variable with parameter p, i.e.,
  P{X = k} = q^{k−1} p.
I What is E[X^2]?
I By definition E[X^2] = Σ_{k=1}^∞ q^{k−1} p k^2.
I Let's try to come up with a similar trick.
I Note E[(X − 1)^2] = Σ_{k=1}^∞ q^{k−1} p (k − 1)^2. Setting j = k − 1,
  we have E[(X − 1)^2] = q Σ_{j=0}^∞ q^{j−1} p j^2 = qE[X^2].
I Thus E[(X − 1)^2] = E[X^2 − 2X + 1] = E[X^2] − 2E[X] + 1 =
  E[X^2] − 2/p + 1 = qE[X^2].
I Solving for E[X^2] gives (1 − q)E[X^2] = pE[X^2] = 2/p − 1, so
  E[X^2] = (2 − p)/p^2.
I Var[X] = (2 − p)/p^2 − 1/p^2 = (1 − p)/p^2 = 1/p^2 − 1/p = q/p^2.

Example

I Toss die repeatedly. Say we get 6 for first time on Xth toss.
I What is P{X = k}?
I Answer: (5/6)^{k−1} (1/6).
I What is E[X]?
I Answer: 6.
I What is Var[X]?
I Answer: 1/p^2 − 1/p = 36 − 6 = 30.
I Takes 1/p coin tosses on average to see a heads.
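The die example is easy to simulate; a short sketch (illustrative only) estimates both the mean 6 and the variance 30:

```python
import random

# Simulate X = number of tosses of a fair die until the first 6.
# Geometric with p = 1/6, so E[X] = 6 and Var[X] = 30.
random.seed(2)

def tosses_until_six():
    k = 1
    while random.randrange(6) != 5:  # face index 5 stands in for "rolled a 6"
        k += 1
    return k

samples = [tosses_until_six() for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 1))  # near 6 and 30
```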
Negative binomial random variables

I Consider an infinite sequence of independent tosses of a coin
  that comes up heads with probability p.
I Let X be such that the rth heads is on the Xth toss.
I For example, if r = 3 and the coin sequence is
  T, T, H, H, T, T, H, T, T, ... then X = 7.
I Then X is a random variable. What is P{X = k}?
I Answer: need exactly r − 1 heads among first k − 1 tosses
  and a heads on the kth toss.
I So P{X = k} = (k−1 choose r−1) p^{r−1} (1 − p)^{k−r} p. Can you prove these
  sum to 1?
I Call X negative binomial random variable with
  parameters (r, p).

Expectation of negative binomial random variable

I Consider an infinite sequence of independent tosses of a coin
  that comes up heads with probability p.
I Let X be such that the rth heads is on the Xth toss.
I Then X is a negative binomial random variable with
  parameters (r, p).
I What is E[X]?
I Write X = X_1 + X_2 + ... + X_r where X_k is number of tosses
  (following (k − 1)th head) required to get kth head. Each X_k
  is geometric with parameter p.
I So E[X] = E[X_1 + X_2 + ... + X_r] =
  E[X_1] + E[X_2] + ... + E[X_r] = r/p.
I How about Var[X]?
I Turns out that Var[X] = Var[X_1] + Var[X_2] + ... + Var[X_r].
  So Var[X] = rq/p^2.
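The decomposition into geometric pieces is also a convenient way to simulate X; a quick check of E[X] = r/p (illustrative sketch):

```python
import random

# Check E[X] = r/p for the negative binomial by simulating X as a sum
# of r independent geometric(p) waiting times.
random.seed(3)
r, p = 5, 0.25

def geometric(p):
    k = 1
    while random.random() >= p:  # keep tossing until a heads
        k += 1
    return k

samples = [sum(geometric(p) for _ in range(r)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # near r / p = 20
```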
Problems

I Nate and Natasha have a beautiful new baby. Each minute, with
  .01 probability (independent of all else), baby cries.
I Additivity of expectation: How many times do they expect
  the baby to cry between 9 p.m. and 6 a.m.?
I Geometric random variables: What's the probability baby is
  quiet from midnight to three, then cries at exactly three?
I Geometric random variables: What's the probability baby is
  quiet from midnight to three?
I Negative binomial: Probability fifth cry is at midnight?
I Negative binomial expectation: How many minutes do I
  expect to wait until the fifth cry?
I Poisson approximation: Approximate the probability there
  are exactly five cries during the night.
I Exponential random variable approximation: Approximate
  probability baby quiet all night.

More fun problems

I Suppose two soccer teams play each other. One team's
  number of points is Poisson with parameter λ_1 and the other's is
  independently Poisson with parameter λ_2. (You can google
  "soccer" and "Poisson" to see the academic literature on the
  use of Poisson random variables to model soccer scores.)
  Using Mathematica (or similar software) compute the
  probability that the first team wins if λ_1 = 2 and λ_2 = 1.
  What if λ_1 = 2 and λ_2 = .5?
I Imagine you start with the number 60. Then you toss a fair
  coin to decide whether to add 5 to your number or subtract 5
  from it. Repeat this process with independent coin tosses
  until the number reaches 100 or 0. What is the expected
  number of tosses needed until this occurs?
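The soccer computation works in Python just as well as Mathematica; the sketch below (illustrative; the cutoff of 60 goals is an arbitrary choice whose truncated tail is negligible) sums the joint mass function over outcomes where the first team scores strictly more:

```python
from math import exp, factorial

# P(first team wins) when scores are independent Poisson(lam1) and
# Poisson(lam2): sum the joint pmf over pairs (i, j) with i > j.
def p_first_wins(lam1, lam2, cutoff=60):
    pmf1 = [lam1**k / factorial(k) * exp(-lam1) for k in range(cutoff)]
    pmf2 = [lam2**k / factorial(k) * exp(-lam2) for k in range(cutoff)]
    return sum(pmf1[i] * pmf2[j] for i in range(cutoff) for j in range(i))

print(p_first_wins(2, 1), p_first_wins(2, 0.5))
```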
18.600: Lecture 17
Continuous random variables

Scott Sheffield

MIT

Outline

Continuous random variables

Expectation and variance of continuous random variables

Uniform random variable on [0, 1]

Uniform random variable on [α, β]

Measurable sets and a famous paradox

Continuous random variables

I Say X is a continuous random variable if there exists a
  probability density function f = f_X on R such that
  P{X ∈ B} = ∫_B f(x)dx := ∫ 1_B(x) f(x)dx.
I We may assume ∫_R f(x)dx = ∫_{−∞}^∞ f(x)dx = 1 and f is
  non-negative.
I Probability of interval [a, b] is given by ∫_a^b f(x)dx, the area
  under f between a and b.
I Probability of any single point is zero.
I Define cumulative distribution function
  F(a) = F_X(a) := P{X < a} = P{X ≤ a} = ∫_{−∞}^a f(x)dx.
Simple example

I Suppose f(x) = 1/2 for x ∈ [0, 2], and f(x) = 0 for x ∉ [0, 2].
I What is P{X < 3/2}?
I What is P{X = 3/2}?
I What is P{1/2 < X < 3/2}?
I What is P{X ∈ (0, 1) ∪ (3/2, 5)}?
I What is F?
I We say that X is uniformly distributed on the interval
  [0, 2].

Another example

I Suppose f(x) = x/2 for x ∈ [0, 2], and f(x) = 0 for x ∉ [0, 2].
I What is P{X < 3/2}?
I What is P{X = 3/2}?
I What is P{1/2 < X < 3/2}?
I What is F?

Expectations of continuous random variables

I Recall that when X was a discrete random variable, with
  p(x) = P{X = x}, we wrote
  E[X] = Σ_{x: p(x)>0} p(x) x.
I How should we define E[X] when X is a continuous random
  variable?
I Answer: E[X] = ∫_{−∞}^∞ f(x) x dx.
I Recall that when X was a discrete random variable, with
  p(x) = P{X = x}, we wrote
  E[g(X)] = Σ_{x: p(x)>0} p(x) g(x).
I What is the analog when X is a continuous random variable?
I Answer: we will write E[g(X)] = ∫_{−∞}^∞ f(x) g(x) dx.

Variance of continuous random variables

I Suppose X is a continuous random variable with mean μ.
I We can write Var[X] = E[(X − μ)^2], same as in the discrete
  case.
I Next, if g = g_1 + g_2 then
  E[g(X)] = ∫ g_1(x)f(x)dx + ∫ g_2(x)f(x)dx =
  ∫ (g_1(x) + g_2(x)) f(x)dx = E[g_1(X)] + E[g_2(X)].
I Furthermore, E[ag(X)] = aE[g(X)] when a is a constant.
I Just as in the discrete case, we can expand the variance
  expression as Var[X] = E[X^2 − 2μX + μ^2] and use additivity
  of expectation to say that
  Var[X] = E[X^2] − 2μE[X] + E[μ^2] = E[X^2] − 2μ^2 + μ^2 =
  E[X^2] − E[X]^2.
I This formula is often useful for calculations.

Recall continuous random variable definitions

I Say X is a continuous random variable if there exists a
  probability density function f = f_X on R such that
  P{X ∈ B} = ∫_B f(x)dx := ∫ 1_B(x) f(x)dx.
I We may assume ∫_R f(x)dx = ∫_{−∞}^∞ f(x)dx = 1 and f is
  non-negative.
I Probability of interval [a, b] is given by ∫_a^b f(x)dx, the area
  under f between a and b.
I Probability of any single point is zero.
I Define cumulative distribution function
  F(a) = F_X(a) := P{X < a} = P{X ≤ a} = ∫_{−∞}^a f(x)dx.

Uniform random variables on [0, 1]

I Suppose X is a random variable with probability density
  function f(x) = 1 for x ∈ [0, 1] and f(x) = 0 for x ∉ [0, 1].
I Then for any 0 ≤ a ≤ b ≤ 1 we have P{X ∈ [a, b]} = b − a.
I Intuition: all locations along the interval [0, 1] equally likely.
I Say that X is a uniform random variable on [0, 1] or that X
  is sampled uniformly from [0, 1].

Properties of uniform random variable on [0, 1]

I Suppose X is a random variable with probability density
  function f(x) = 1 for x ∈ [0, 1] and f(x) = 0 for x ∉ [0, 1],
  which implies F_X(a) = 0 for a < 0, F_X(a) = a for a ∈ [0, 1],
  and F_X(a) = 1 for a > 1.
I What is E[X]?
I Guess 1/2 (since 1/2 is, you know, in the middle).
I Indeed, ∫_{−∞}^∞ f(x) x dx = ∫_0^1 x dx = x^2/2 |_0^1 = 1/2.
I What is the general moment E[X^k] for k ≥ 0?
I Answer: 1/(k + 1).
I What would you guess the variance is? Expected square of
  distance from 1/2?
I It's obviously less than 1/4, but how much less?
I Var[X] = E[X^2] − E[X]^2 = 1/3 − 1/4 = 1/12.
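The moment formula E[X^k] = 1/(k+1) and the variance 1/12 are easy to check by Monte Carlo; a short sketch (illustrative only):

```python
import random

# Monte Carlo check that a uniform [0,1] sample has E[X^k] = 1/(k+1),
# and variance E[X^2] - E[X]^2 = 1/3 - 1/4 = 1/12.
random.seed(4)
xs = [random.random() for _ in range(500_000)]

for k in range(1, 5):
    mk = sum(x**k for x in xs) / len(xs)
    assert abs(mk - 1 / (k + 1)) < 0.01

m1 = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - m1 * m1
print(round(var, 4))  # near 1/12 = 0.0833
```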
Uniform random variables on [α, β]

I Fix α < β and suppose X is a random variable with
  probability density function f(x) = 1/(β − α) for x ∈ [α, β]
  and f(x) = 0 for x ∉ [α, β].
I Then for any α ≤ a ≤ b ≤ β we have P{X ∈ [a, b]} = (b − a)/(β − α).
I Intuition: all locations along the interval [α, β] are equally
  likely.
I Say that X is a uniform random variable on [α, β] or that
  X is sampled uniformly from [α, β].

Uniform random variables on [α, β]

I Suppose X is a random variable with probability density
  function f(x) = 1/(β − α) for x ∈ [α, β] and f(x) = 0 otherwise.
I What is E[X]?
I Intuitively, we'd guess the midpoint (α + β)/2.
I What's the cleanest way to prove this?
I One approach: let Y be uniform on [0, 1] and try to show that
  X = (β − α)Y + α is uniform on [α, β].
I Then expectation linearity gives
  E[X] = (β − α)E[Y] + α = (1/2)(β − α) + α = (α + β)/2.
I Using similar logic, what is the variance Var[X]?
I Answer: Var[X] = Var[(β − α)Y + α] = Var[(β − α)Y] =
  (β − α)^2 Var[Y] = (β − α)^2/12.
Uniform measure on [0, 1]

I One of the very simplest probability density functions is
  f(x) = 1 for x ∈ [0, 1] and f(x) = 0 for x ∉ [0, 1].
I If B ⊆ [0, 1] is an interval, then P{X ∈ B} is the length of
  that interval.
I Generally, if B ⊆ [0, 1] then P{X ∈ B} = ∫_B 1 dx = ∫ 1_B(x)dx
  is the total volume or total length of the set B.
I What if B is the set of all rational numbers?
I How do we mathematically define the volume of an arbitrary
  set B?

Idea behind paradox

I What if we could partition [0, 1] into a countably infinite
  collection of disjoint sets that all looked the same (up to a
  translation, say) and thus all had to have the same
  probability?
I Well, if that probability was zero, then (by countable
  additivity) probability of whole interval would be zero, a
  contradiction.
I But if that probability were a number greater than zero the
  probability of whole interval would be infinite, also a
  contradiction...
I Related problem: if you can cut a cake into countably
  infinitely many pieces all of the same weight, how much does
  each piece weigh?

Formulating the paradox precisely

I Uniform probability measure on [0, 1) should satisfy
  translation invariance: If B and a horizontal translation of B
  are both subsets of [0, 1), their probabilities should be equal.
I Consider wrap-around translations θ_r(x) = (x + r) mod 1.
I By translation invariance, θ_r(B) has same probability as B.
I Call x, y equivalent modulo rationals if x − y is rational
  (e.g., x = 3 and y = 9/4). An equivalence class is
  the set of points in [0, 1) equivalent to some given point.
I There are uncountably many of these classes.
I Let A ⊆ [0, 1) contain one point from each class. For each
  x ∈ [0, 1), there is one a ∈ A such that r = x − a is rational.
I Then each x in [0, 1) lies in θ_r(A) for one rational r ∈ [0, 1).
I Thus [0, 1) = ∪ θ_r(A) as r ranges over rationals in [0, 1).
I If P(A) = 0, then P(S) = Σ_r P(θ_r(A)) = 0. If P(A) > 0 then
  P(S) = Σ_r P(θ_r(A)) = ∞. Contradicts P(S) = 1 axiom.
Three ways to get around this

I 1. Re-examine axioms of mathematics: the very existence
  of a set A with one element from each equivalence class is a
  consequence of the so-called axiom of choice. Removing that
  axiom makes the paradox go away, since one can just suppose
  (pretend?) these kinds of sets don't exist.
I 2. Re-examine axioms of probability: Replace countable
  additivity with finite additivity? (Doesn't fully solve problem:
  look up Banach-Tarski.)
I 3. Keep the axiom of choice and countable additivity but
  don't define probabilities of all sets: Instead of defining
  P(B) for every subset B of sample space, restrict attention to
  a family of so-called measurable sets.
I Most mainstream probability and analysis takes the third
  approach.

Perspective

I More advanced courses in probability and analysis (such as
  18.125 and 18.175) spend a significant amount of time
  rigorously constructing a class of so-called measurable sets
  and the so-called Lebesgue measure, which assigns a real
  number (a measure) to each of these sets.
I These courses also replace the Riemann integral with the
  so-called Lebesgue integral.
I We will not treat these topics any further in this course.
I We usually limit our attention to probability density functions
  f and sets B for which the ordinary Riemann integral
  ∫ 1_B(x) f(x)dx is well defined.
I Riemann integration is a mathematically rigorous theory. It's
  just not as robust as Lebesgue integration.
I In practice, sets we care about (e.g., countable unions of
  points and intervals) tend to be measurable.
18.600: Lecture 18
Normal random variables

Scott Sheffield

MIT

Outline

Tossing coins

Normal random variables

Special case of central limit theorem

Tossing coins

I Suppose we toss a million fair coins. How many heads will we
  get?
I About half a million, yes, but how close to that? Will we be
  off by 10 or 1000 or 100,000?
I How can we describe the error?
I Let's try this out.
Tossing coins

I Toss n coins. What is probability to see k heads?
I Answer: 2^{−n} (n choose k).
I Let's plot this for a few values of n.
I Seems to look like it's converging to a curve.
I If we replace fair coin with p coin, what's probability to see k
  heads?
I Answer: p^k (1 − p)^{n−k} (n choose k).
I Let's plot this for p = 2/3 and some values of n.
I What does limit shape seem to be?
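The limit shape can be seen numerically; the sketch below (illustrative only) compares the binomial(400, 1/2) mass function with the Gaussian density centered at np with standard deviation √(npq):

```python
from math import comb, exp, pi, sqrt

# Numerical peek at the limit shape: the binomial(n, 1/2) mass function
# is already close to the Gaussian density for moderately large n.
n, p = 400, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))

for k in [180, 190, 200, 210, 220]:
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    gauss = exp(-((k - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
    assert abs(binom - gauss) / gauss < 0.02  # within 2 percent here
```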

Outline

Tossing coins

Normal random variables

Special case of central limit theorem

Standard normal random variable

I Say X is a (standard) normal random variable if f_X(x) = f(x) = (1/√(2π)) e^{−x²/2}.
I Clearly f is always non-negative for real values of x, but how do we show that ∫ f(x) dx = 1?
I Looks kind of tricky.
I Happens to be a nice trick. Write I = ∫ e^{−x²/2} dx. Then try to compute I² as a two-dimensional integral.
I That is, write

  I² = ∫ e^{−x²/2} dx ∫ e^{−y²/2} dy = ∫∫ e^{−x²/2} e^{−y²/2} dx dy.

I Then switch to polar coordinates.

  I² = ∫₀^∞ ∫₀^{2π} e^{−r²/2} r dθ dr = 2π ∫₀^∞ r e^{−r²/2} dr = 2π (−e^{−r²/2}) |₀^∞ = 2π,

so I = √(2π).
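The conclusion I = √(2π) is easy to sanity-check numerically; a minimal sketch (the truncation at |x| = 10 and the step size are arbitrary choices):

```python
import math

# Riemann-sum approximation of I = integral of exp(-x^2/2) over the real line.
# Truncating at |x| = 10 is harmless: the tail beyond that is astronomically small.
step = 0.001
I = sum(math.exp(-x * x / 2)
        for x in (k * step for k in range(-10000, 10001))) * step
print(I, math.sqrt(2 * math.pi))  # both about 2.5066
```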
Standard normal random variable: mean and variance

I Say X is a (standard) normal random variable if f(x) = (1/√(2π)) e^{−x²/2}.
I Question: what are the mean and variance of X?
I E[X] = ∫ f(x) x dx. Can see by symmetry that this is zero.
I Or can compute directly:

  E[X] = (1/√(2π)) ∫ e^{−x²/2} x dx = (1/√(2π)) (−e^{−x²/2}) |_{−∞}^{∞} = 0.

I How would we compute Var[X] = ∫ f(x) x² dx = (1/√(2π)) ∫ e^{−x²/2} x² dx?
I Try integration by parts with u = x and dv = x e^{−x²/2} dx. Find that Var[X] = (1/√(2π)) (−x e^{−x²/2} |_{−∞}^{∞} + ∫ e^{−x²/2} dx) = 1.

General normal random variables

I Again, X is a (standard) normal random variable if f(x) = (1/√(2π)) e^{−x²/2}.
I What about Y = σX + μ? Can we stretch out and translate the normal distribution (as we did last lecture for the uniform distribution)?
I Say Y is normal with parameters μ and σ² if f(x) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}.
I What are the mean and variance of Y?
I E[Y] = σE[X] + μ = μ and Var[Y] = σ² Var[X] = σ².

Cumulative distribution function

I Again, X is a standard normal random variable if f(x) = (1/√(2π)) e^{−x²/2}.
I What is the cumulative distribution function?
I Write this as F_X(a) = P{X ≤ a} = (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx.
I How can we compute this integral explicitly?
I Can't. Let's just give it a name. Write Φ(a) = (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx.
I Values: Φ(−3) ≈ .0013, Φ(−2) ≈ .023 and Φ(−1) ≈ .159.
I Rough rule of thumb: two thirds of the time within one SD of the mean, 95 percent of the time within 2 SDs of the mean.

Outline

Tossing coins

Normal random variables

Special case of central limit theorem
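Although Φ has no elementary closed form, it is a one-liner in terms of the error function; a sketch reproducing the quoted values:

```python
import math

def Phi(a):
    """Standard normal CDF, expressed via math.erf."""
    return (1 + math.erf(a / math.sqrt(2))) / 2

# The values quoted above:
print(Phi(-3), Phi(-2), Phi(-1))  # about .0013, .023, .159
# The rule of thumb: mass within 1 and within 2 SDs of the mean.
print(Phi(1) - Phi(-1), Phi(2) - Phi(-2))  # about .68 and .95
```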
Outline

Tossing coins

Normal random variables

Special case of central limit theorem

DeMoivre-Laplace Limit Theorem

I Let Sn be the number of heads in n tosses of a p coin.
I What's the standard deviation of Sn?
I Answer: √(npq) (where q = 1 − p).
I The special quantity (Sn − np)/√(npq) describes the number of standard deviations that Sn is above or below its mean.
I What's the mean and variance of this special quantity? Is it roughly normal?
I DeMoivre-Laplace limit theorem (special case of central limit theorem):

  lim_{n→∞} P{a ≤ (Sn − np)/√(npq) ≤ b} = Φ(b) − Φ(a).

I This is Φ(b) − Φ(a) = P{a ≤ X ≤ b} when X is a standard normal random variable.

Problems

I Toss a million fair coins. Approximate the probability that I get more than 501,000 heads.
I Answer: well, √(npq) = √(10⁶ · .5 · .5) = 500. So we're asking for the probability to be over two SDs above the mean. This is approximately 1 − Φ(2) = Φ(−2) ≈ .023.
I Roll 60000 dice. Expect to see 10000 sixes. What's the probability to see more than 9800?
I Here √(npq) = √(60000 · (1/6) · (5/6)) ≈ 91.28.
I And 200/91.28 ≈ 2.19. Answer is about 1 − Φ(−2.19).
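Both answers can be evaluated with the error-function form of Φ; a quick sketch:

```python
import math

def Phi(a):
    return (1 + math.erf(a / math.sqrt(2))) / 2

# Coins: n = 10^6, p = 1/2, so sqrt(npq) = 500; 501,000 heads is 2 SDs up.
print(1 - Phi(2))       # about .023

# Dice: n = 60000, p = 1/6; 9800 is about 2.19 SDs below the mean 10000.
sd = math.sqrt(60000 * (1 / 6) * (5 / 6))
z = (9800 - 10000) / sd
print(sd, 1 - Phi(z))   # about 91.29 and .986
```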
18.600: Lecture 19
Exponential random variables

Scott Sheffield

MIT

Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Exponential random variables

I Say X is an exponential random variable of parameter λ when its probability density function is f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 for x < 0.
I For a > 0 have F_X(a) = ∫₀^a f(x) dx = ∫₀^a λe^{−λx} dx = −e^{−λx} |₀^a = 1 − e^{−λa}.
I Thus P{X < a} = 1 − e^{−λa} and P{X > a} = e^{−λa}.
I Formula P{X > a} = e^{−λa} is very important in practice.
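The tail formula is easy to check by simulation; a sketch using Python's random.expovariate (λ = 2, a = 1, and the seed are arbitrary choices):

```python
import math
import random

random.seed(0)
lam, a, trials = 2.0, 1.0, 100_000
# Fraction of exponential(lam) samples exceeding a, vs. exp(-lam * a).
count = sum(1 for _ in range(trials) if random.expovariate(lam) > a)
print(count / trials, math.exp(-lam * a))  # both about 0.135
```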
Moment formula

I Suppose X is exponential with parameter λ, so f_X(x) = λe^{−λx} when x ≥ 0.
I What is E[X^n]? (Say n ≥ 1.)
I Write E[X^n] = ∫₀^∞ x^n λe^{−λx} dx.
I Integration by parts gives E[X^n] = ∫₀^∞ n x^{n−1} e^{−λx} dx + (−x^n e^{−λx}) |₀^∞.
I We get E[X^n] = (n/λ) E[X^{n−1}].
I E[X⁰] = E[1] = 1, E[X] = 1/λ, E[X²] = 2/λ², E[X^n] = n!/λ^n.
I If λ = 1, then E[X^n] = n!. Could take this as definition of n!. It makes sense for n = 0 and for non-integer n.
I Variance: Var[X] = E[X²] − (E[X])² = 1/λ².

Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Minimum of independent exponentials is exponential

I CLAIM: If X1 and X2 are independent and exponential with parameters λ1 and λ2 then X = min{X1, X2} is exponential with parameter λ = λ1 + λ2.
I How could we prove this?
I Have various ways to describe random variable Y: via density function f_Y(x), or cumulative distribution function F_Y(a) = P{Y ≤ a}, or function P{Y > a} = 1 − F_Y(a).
I Last one has simple form for exponential random variables. We have P{Y > a} = e^{−λa} for a ∈ [0, ∞).
I Note: X > a if and only if X1 > a and X2 > a.
I X1 and X2 are independent, so P{X > a} = P{X1 > a}P{X2 > a} = e^{−λ1 a} e^{−λ2 a} = e^{−λa}.
I If X1, . . . , Xn are independent exponential with λ1, . . . , λn, then min{X1, . . . , Xn} is exponential with λ = λ1 + . . . + λn.
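A simulation sketch of the claim (λ1 = 1, λ2 = 2, so the minimum should be exponential with λ = 3 and mean 1/3; the seed and sample size are arbitrary):

```python
import random

random.seed(1)
trials = 200_000
# Empirical mean of min(Exp(1), Exp(2)); should match 1/(1+2).
samples = [min(random.expovariate(1.0), random.expovariate(2.0))
           for _ in range(trials)]
mean = sum(samples) / trials
print(mean)  # about 1/3
```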
Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Memoryless property

I Suppose X is exponential with parameter λ.
I Memoryless property: If X represents the time until an event occurs, then given that we have seen no event up to time b, the conditional distribution of the remaining time till the event is the same as it originally was.
I To make this precise, we ask what is the probability distribution of Y = X − b conditioned on X > b?
I We can characterize the conditional law of Y, given X > b, by computing P(Y > a | X > b) for each a.
I That is, we compute P(X − b > a | X > b) = P(X > b + a | X > b).
I By definition of conditional probability, this is just P{X > b + a}/P{X > b} = e^{−λ(b+a)}/e^{−λb} = e^{−λa}.
I Thus, conditional law of X − b given that X > b is same as the original law of X.

Memoryless property for geometric random variables

I Similar property holds for geometric random variables.
I If we plan to toss a coin until the first heads comes up, then we have a .5 chance to get a heads in one step, a .25 chance in two steps, etc.
I Given that the first 5 tosses are all tails, there is conditionally a .5 chance we get our first heads on the 6th toss, a .25 chance on the 7th toss, etc.
I Despite our having had five tails in a row, our expectation of the amount of time remaining until we see a heads is the same as it originally was.
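The exponential memorylessness computation can also be seen in a simulation; a sketch with λ = 1 and a = b = 1 (all arbitrary choices):

```python
import math
import random

random.seed(2)
lam, a, b = 1.0, 1.0, 1.0
# Keep only the samples that "survived" past time b.
survivors = [x for x in (random.expovariate(lam) for _ in range(300_000))
             if x > b]
# Conditioned on X > b, the fraction with X - b > a should be exp(-lam*a),
# i.e. the same as the unconditional P{X > a}.
frac = sum(1 for x in survivors if x - b > a) / len(survivors)
print(frac, math.exp(-lam * a))  # both about 0.368
```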
Exchange overheard on Logan airport shuttle

I Bob: There's this really interesting problem in statistics I just learned about. If a coin comes up heads 10 times in a row, how likely is the next toss to be heads?
I Alice: Still fifty fifty.
I Bob: That's a common mistake, but you're wrong because the 10 heads in a row increase the conditional probability that there's something funny going on with the coin.
I Alice: You never said it might be a funny coin.
I Bob: That's the point. You should always suspect that there might be something funny with the coin.
I Alice: It's a math puzzle. You always assume a normal coin.
I Bob: No, that's your mistake. You should never assume that, because maybe somebody tampered with the coin.

Exchange overheard on a Logan airport shuttle

I Alice: Yeah, yeah, I get it. I can't win here.
I Bob: No, I don't think you get it yet. It's a subtle point in statistics. It's very important.
I Exchange continued for duration of shuttle ride (Alice increasingly irritated, Bob increasingly patronizing).
I Raises interesting question about memoryless property.
I Suppose the duration of a couple's relationship is exponential with λ⁻¹ equal to two weeks.
I Given that it has lasted for 10 weeks so far, what is the conditional probability that it will last an additional week?
I How about an additional four weeks? Ten weeks?

Remark on Alice and Bob

I Alice assumes Bob means independent tosses of a fair coin. Under this assumption, all 2^11 outcomes of the eleven-coin-toss sequence are equally likely. Bob considers HHHHHHHHHHH more likely than HHHHHHHHHHT, since the former could result from a faulty coin.
I Alice sees Bob's point but considers it annoying and churlish to ask about a coin toss sequence and criticize the listener for assuming this means independent tosses of a fair coin.
I Without that assumption, Alice has no idea what context Bob has in mind. (An environment where two-headed novelty coins are common? Among coin-tossing cheaters with particular agendas?...)
I Alice: you need assumptions to convert stories into math.
I Bob: good to question assumptions.

Radioactive decay: maximum of independent exponentials

I Suppose you start at time zero with n radioactive particles. Suppose that each one (independently of the others) will decay at a random time, which is an exponential random variable with parameter λ.
I Let T be the amount of time until no particles are left. What are E[T] and Var[T]?
I Let T1 be the amount of time you wait until the first particle decays, T2 the amount of additional time until the second particle decays, etc., so that T = T1 + T2 + . . . + Tn.
I Claim: T1 is exponential with parameter nλ.
I Claim: T2 is exponential with parameter (n − 1)λ.
I And so forth. E[T] = Σ_{i=1}^n E[Ti] = (1/λ) Σ_{j=1}^n 1/j and (by independence) Var[T] = Σ_{i=1}^n Var[Ti] = (1/λ²) Σ_{j=1}^n 1/j².
Outline

Exponential random variables

Minimum of independent exponentials

Memoryless property

Relationship to Poisson random variables

Relationship to Poisson random variables

I Let T1, T2, . . . be independent exponential random variables with parameter λ.
I We can view them as waiting times between events.
I How do you show that the number of events in the first t units of time is Poisson with parameter λt?
I We actually did this already in the lecture on Poisson point processes. You can break the interval [0, t] into n equal pieces (for very large n), let Xk be the number of events in the kth piece, and use the memoryless property to argue that the Xk are independent.
I When n is large enough, it becomes unlikely that any interval has more than one event. Roughly speaking: each interval has one event with probability λt/n, zero otherwise.
I Take the n → ∞ limit. Number of events is Poisson λt.
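A simulation sketch: sum exponential waiting times until they pass t and count the events; the count should look Poisson with mean (and variance) λt. Here λ = 2, t = 3, and the seed are arbitrary:

```python
import random

random.seed(3)
lam, t, trials = 2.0, 3.0, 50_000

def count_events(lam, t):
    # Accumulate exponential inter-arrival times until time t is exceeded.
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)
        if total > t:
            return n
        n += 1

counts = [count_events(lam, t) for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
print(mean, var)  # both about lam * t = 6, as for a Poisson random variable
```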
18.600: Lecture 20
More continuous random variables

Scott Sheffield

MIT

Outline

Gamma distribution

Cauchy distribution

Beta distribution

Outline

Gamma distribution

Cauchy distribution

Beta distribution

Defining gamma function

I Last time we found that if X is exponential with rate 1 and n ≥ 0 then E[X^n] = ∫₀^∞ x^n e^{−x} dx = n!.
I This expectation E[X^n] is actually well defined whenever n > −1. Set α = n + 1. The following quantity is well defined for any α > 0: Γ(α) := E[X^{α−1}] = ∫₀^∞ x^{α−1} e^{−x} dx = (α − 1)!.
I So Γ(α) extends the function (α − 1)! (as defined for strictly positive integers α) to the positive reals.
I Vexing notational issue: why define Γ so that Γ(α) = (α − 1)! instead of Γ(α) = α!?
I At least it's kind of convenient that Γ is defined on (0, ∞) instead of (−1, ∞).
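Python's math.gamma computes exactly this Γ; a quick check of the (α − 1)! convention and of a non-integer value:

```python
import math

# Gamma(n) = (n-1)! for positive integers n.
print(math.gamma(5), math.factorial(4))   # 24.0 and 24
# Gamma also makes sense off the integers, e.g. Gamma(1/2) = sqrt(pi).
print(math.gamma(0.5), math.sqrt(math.pi))
```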
Recall: geometric and negative binomials

I The sum X of n independent geometric random variables of parameter p is negative binomial with parameter (n, p).
I Waiting for the nth heads. What is P{X = k}?
I Answer: (k−1 choose n−1) p^{n−1} (1 − p)^{k−n} p.
I What's the continuous (Poisson point process) version of waiting for the nth event?

Poisson point process limit

I Recall that we can approximate a Poisson process of rate λ by tossing N coins per time unit and taking p = λ/N.
I Let's fix a rational number x and try to figure out the probability that the nth coin toss happens at time x (i.e., on exactly the xNth trial, assuming xN is an integer).
I Write p = λ/N and k = xN. (Note p = λx/k.)
I For large N, (k−1 choose n−1) p^{n−1} (1 − p)^{k−n} p is

  [(k − 1)(k − 2) . . . (k − n + 1)/(n − 1)!] p^{n−1} (1 − p)^{k−n} p ≈ [k^{n−1}/(n − 1)!] p^{n−1} e^{−λx} p = (1/N) [(λx)^{n−1} e^{−λx}/(n − 1)!] λ.

Defining Γ distribution

I The probability from the previous slide, (1/N) [(λx)^{n−1} e^{−λx}/(n − 1)!] λ, suggests the form for a continuum random variable.
I Replace n (generally integer valued) with α (which we will eventually allow to be any real number).
I Say that random variable X has gamma distribution with parameters (α, λ) if f_X(x) = [(λx)^{α−1} e^{−λx}/Γ(α)] λ for x ≥ 0, and f_X(x) = 0 for x < 0.
I Waiting time interpretation makes sense only for integer α, but the distribution is defined for general positive α.

Outline

Gamma distribution

Cauchy distribution

Beta distribution
Outline

Gamma distribution

Cauchy distribution

Beta distribution

Cauchy distribution

I A standard Cauchy random variable is a random real number with probability density f(x) = (1/π) · 1/(1 + x²).
I There is a spinning flashlight interpretation. Put a flashlight at (0, 1), spin it to a uniformly random angle θ in [−π/2, π/2], and consider the point X where the light beam hits the x-axis.
I F_X(x) = P{X ≤ x} = P{tan θ ≤ x} = P{θ ≤ tan⁻¹ x} = 1/2 + (1/π) tan⁻¹ x.
I Find f_X(x) = (d/dx) F_X(x) = (1/π) · 1/(1 + x²).
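The flashlight description gives an immediate simulation; a sketch checking F_X(1) = 1/2 + (1/π)tan⁻¹(1) = 3/4 (seed and sample size arbitrary):

```python
import math
import random

random.seed(4)
trials = 200_000
# Spin to a uniform angle in (-pi/2, pi/2); the beam hits the x-axis at tan(theta).
hits = [math.tan(random.uniform(-math.pi / 2, math.pi / 2))
        for _ in range(trials)]
frac = sum(1 for x in hits if x <= 1) / trials
print(frac, 0.5 + math.atan(1) / math.pi)  # both about 0.75
```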

Cauchy distribution: Brownian motion interpretation

I The light beam travels in a (randomly directed) straight line. There's a windier random path called Brownian motion.
I If you do a simple random walk on a grid and take the grid size to zero, then you get Brownian motion as a limit.
I We will not give a complete mathematical description of Brownian motion here, just one nice fact.
I FACT: start Brownian motion at point (x, y) in the upper half plane. Probability it hits negative x-axis before positive x-axis is 1/2 − (1/π) tan⁻¹(x/y). Linear function of the angle between the positive x-axis and the line through (0, 0) and (x, y).
I Start Brownian motion at (0, 1) and let X be the location of the first point on the x-axis it hits. What's P{X < a}?
I Applying FACT, translation invariance, reflection symmetry: P{X < x} = 1/2 + (1/π) tan⁻¹ x (and P{X < −x} = P{X > x}).
I So X is a standard Cauchy random variable.

Question: what if we start at (0, 2)?

I Start at (0, 2). Let Y be the first point on the x-axis hit by Brownian motion. Again, same probability distribution as the point hit by the flashlight trajectory.
I Flashlight point of view: Y has the same law as 2X where X is standard Cauchy.
I Brownian point of view: Y has the same law as X1 + X2 where X1 and X2 are standard Cauchy.
I But wait a minute. Var(Y) = 4Var(X) and by independence Var(X1 + X2) = Var(X1) + Var(X2) = 2Var(X2). Can this be right?
I Cauchy distribution doesn't have finite variance or mean.
I Some standard facts we'll learn later in the course (central limit theorem, law of large numbers) don't apply to it.
Outline

Gamma distribution

Cauchy distribution

Beta distribution

Beta distribution: Alice and Bob revisited

I Suppose I have a coin with a heads probability p that I don't know much about.
I What do I mean by not knowing anything? Let's say that I think p is equally likely to be any of the numbers {0, .1, .2, .3, .4, . . . , .9, 1}.
I Now imagine a multi-stage experiment where I first choose p and then I toss n coins.
I Given that number h of heads is a − 1, and b − 1 tails, what's conditional probability p was a certain value x?
I P{p = x | h = (a − 1)} = (1/11)(n choose a−1) x^{a−1} (1 − x)^{b−1} / P{h = (a − 1)}, which is x^{a−1}(1 − x)^{b−1} times a constant that doesn't depend on x.

Beta distribution

I Suppose I have a coin with a heads probability p that I really don't know anything about. Let's say p is uniform on [0, 1].
I Now imagine a multi-stage experiment where I first choose p uniformly from [0, 1] and then I toss n coins.
I If I get, say, a − 1 heads and b − 1 tails, then what is the conditional probability density for p?
I Turns out to be a constant (that doesn't depend on x) times x^{a−1}(1 − x)^{b−1}.
I That is, f(x) = (1/B(a, b)) x^{a−1}(1 − x)^{b−1} on [0, 1], where B(a, b) is the constant chosen to make the integral one. Can be shown that B(a, b) = Γ(a)Γ(b)/Γ(a + b).
I What is E[X]?
I Answer: a/(a + b).
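The two-stage experiment can be simulated directly; a sketch with n = 9 tosses and 7 heads (so a = 8, b = 3 and the posterior mean should be 8/11; seed and sample size arbitrary):

```python
import random

random.seed(5)
kept = []
for _ in range(100_000):
    p = random.random()                       # uniform prior on p
    heads = sum(random.random() < p for _ in range(9))
    if heads == 7:                            # condition on 7 heads, 2 tails
        kept.append(p)
mean = sum(kept) / len(kept)
print(mean, 8 / 11)  # posterior mean of a beta (8, 3), about 0.727
```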
18.600: Lecture 21
Joint distribution functions

Scott Sheffield

MIT

Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Distribution of function of random variable

I Suppose P{X ≤ a} = F_X(a) is known for all a. Write Y = X³. What is P{Y ≤ 27}?
I Answer: note that Y ≤ 27 if and only if X ≤ 3. Hence P{Y ≤ 27} = P{X ≤ 3} = F_X(3).
I Generally F_Y(a) = P{Y ≤ a} = P{X ≤ a^{1/3}} = F_X(a^{1/3}).
I This is a general principle. If X is a continuous random variable and g is a strictly increasing function of x and Y = g(X), then F_Y(a) = F_X(g^{−1}(a)).
I How can we use this to compute the probability density function f_Y from f_X?
I If Z = X², then what is P{Z ≤ 16}?
Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Joint probability mass functions: discrete random variables

I If X and Y assume values in {1, 2, . . . , n} then we can view A_{i,j} = P{X = i, Y = j} as the entries of an n × n matrix.
I Let's say I don't care about Y. I just want to know P{X = i}. How do I figure that out from the matrix?
I Answer: P{X = i} = Σ_{j=1}^n A_{i,j}.
I Similarly, P{Y = j} = Σ_{i=1}^n A_{i,j}.
I In other words, the probability mass functions for X and Y are the row and column sums of A_{i,j}.
I Given the joint distribution of X and Y, we sometimes call distribution of X (ignoring Y) and distribution of Y (ignoring X) the marginal distributions.
I In general, when X and Y are jointly defined discrete random variables, we write p(x, y) = p_{X,Y}(x, y) = P{X = x, Y = y}.

Joint distribution functions: continuous random variables

I Given random variables X and Y, define F(a, b) = P{X ≤ a, Y ≤ b}.
I The region {(x, y): x ≤ a, y ≤ b} is the lower left quadrant centered at (a, b).
I Refer to F_X(a) = P{X ≤ a} and F_Y(b) = P{Y ≤ b} as marginal cumulative distribution functions.
I Question: if I tell you the two-parameter function F, can you use it to determine the marginals F_X and F_Y?
I Answer: Yes. F_X(a) = lim_{b→∞} F(a, b) and F_Y(b) = lim_{a→∞} F(a, b).

I Suppose we are given the joint distribution function


F (a, b) = P{X a, Y b}.
Distributions of functions of random variables
I Can we use F to construct a two-dimensional probability
density function? RPrecisely, is there a function f such that
P{(X , Y ) A} = A f (x, y )dxdy for each (measurable) Joint distributions
A R2 ?

I Lets try defining f (x, y ) = x y F (x, y ). Does that work?
I Suppose first that A = {(x, y ) : x a, b}. By definition of Independent random variables
F , fundamental theorem of calculus, fact that F (a, b)
vanishes
R b R a as either a or b tends Rto , we indeed find
b Examples
x y F (x, y )dxdy = y F (a, y )dy = F (a, b).
I From this, we can show that it works for strips, rectangles,
general open sets, etc.

Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Independent random variables

I We say X and Y are independent if for any two (measurable) sets A and B of real numbers we have P{X ∈ A, Y ∈ B} = P{X ∈ A}P{Y ∈ B}.
I Intuition: knowing something about X gives me no information about Y, and vice versa.
I When X and Y are discrete random variables, they are independent if P{X = x, Y = y} = P{X = x}P{Y = y} for all x and y for which P{X = x} and P{Y = y} are non-zero.
I What is the analog of this statement when X and Y are continuous?
I When X and Y are continuous, they are independent if f(x, y) = f_X(x)f_Y(y).
Sample problem: independent normal random variables

I Suppose that X and Y are independent normal random variables with mean zero and variance one.
I What is the probability that (X, Y) lies in the unit circle? That is, what is P{X² + Y² ≤ 1}?
I First, any guesses?
I Probability X is within one standard deviation of its mean is about .68. So (.68)² is an upper bound.
I f(x, y) = f_X(x)f_Y(y) = (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−y²/2} = (1/(2π)) e^{−r²/2}.
I Using polar coordinates, we want ∫₀¹ (2πr) (1/(2π)) e^{−r²/2} dr = (−e^{−r²/2}) |₀¹ = 1 − e^{−1/2} ≈ .39.

Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Outline

Distributions of functions of random variables

Joint distributions

Independent random variables

Examples

Repeated die roll

I Roll a die repeatedly and let X be such that the first even number (the first 2, 4, or 6) appears on the Xth roll.
I Let Y be the number that appears on the Xth roll.
I Are X and Y independent? What is their joint law?
I If j ≥ 1, then

  P{X = j, Y = 2} = P{X = j, Y = 4} = P{X = j, Y = 6} = (1/2)^{j−1}(1/6) = (1/2)^j (1/3).

I Can we get the marginals from that?
Continuous time variant of repeated die roll

I On a certain hiking trail, it is well known that the lion, tiger, and bear attacks are independent Poisson processes with respective λ values of .1/hour, .2/hour, and .3/hour.
I Let T ∈ R be the amount of time until the first animal attacks. Let A ∈ {lion, tiger, bear} be the species of the first attacking animal.
I What is the probability density function for T? How about E[T]?
I Are T and A independent?
I Let T1 be the time until the first attack, T2 the subsequent time until the second attack, etc., and let A1, A2, . . . be the corresponding species.
I Are all of the Ti and Ai independent of each other? What are their probability distributions?

More lions, tigers, bears

I Lion, tiger, and bear attacks are independent Poisson processes with λ values .1/hour, .2/hour, and .3/hour.
I Distribution of time Ttiger till first tiger attack?
I Exponential λtiger = .2/hour. So P{Ttiger > a} = e^{−.2a}.
I How about E[Ttiger] and Var[Ttiger]?
I E[Ttiger] = 1/λtiger = 5 hours, Var[Ttiger] = 1/λtiger² = 25 hours squared.
I Time until 5th attack by any animal?
I Γ distribution with α = 5 and λ = .6.
I X, where Xth attack is 5th bear attack?
I Negative binomial with parameters p = 1/2 and n = 5.
I Can hiker breathe sigh of relief after 5 attack-free hours?

Buffons needle problem

I Drop a needle of length one on a large sheet of paper (with evenly spaced horizontal lines at all integer heights).
I What's the probability the needle crosses a line?
I Need some assumptions. Let's say the vertical position X of the lowermost endpoint of the needle modulo one is uniform in [0, 1] and independent of the angle θ, which is uniform in [0, π]. The needle crosses a line if and only if there is an integer between the numbers X and X + sin θ, i.e., X ≤ 1 ≤ X + sin θ.
I Draw the box [0, 1] × [0, π] on which (X, θ) is uniform. What's the area of the subset where X ≥ 1 − sin θ?
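That area is ∫₀^π sin θ dθ = 2 out of a box of area π, so the crossing probability is 2/π ≈ .6366; a simulation sketch (seed and sample size arbitrary):

```python
import math
import random

random.seed(6)
trials = 200_000
# Cross iff X + sin(theta) >= 1, with X uniform on [0,1], theta uniform on [0,pi].
crossings = sum(1 for _ in range(trials)
                if random.uniform(0, 1) + math.sin(random.uniform(0, math.pi)) >= 1)
print(crossings / trials, 2 / math.pi)  # both about 0.6366
```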
18.600: Lecture 22
Sums of independent random variables

Scott Sheffield

MIT

Summing two random variables

I Say we have independent random variables X and Y and we know their density functions f_X and f_Y.
I Now let's try to find F_{X+Y}(a) = P{X + Y ≤ a}.
I This is the integral over {(x, y): x + y ≤ a} of f(x, y) = f_X(x)f_Y(y). Thus,

  P{X + Y ≤ a} = ∫_{−∞}^{∞} ∫_{−∞}^{a−y} f_X(x)f_Y(y) dx dy = ∫_{−∞}^{∞} F_X(a − y)f_Y(y) dy.

I Differentiating both sides gives f_{X+Y}(a) = (d/da) ∫ F_X(a − y)f_Y(y) dy = ∫ f_X(a − y)f_Y(y) dy.
I Latter formula makes some intuitive sense. We're integrating over the set of (x, y) pairs that add up to a.

Independent identically distributed (i.i.d.)

I The abbreviation i.i.d. means independent identically distributed.
I It is actually one of the most important abbreviations in probability theory.
I Worth memorizing.

Summing i.i.d. uniform random variables

I Suppose that X and Y are i.i.d. and uniform on [0, 1]. So f_X = f_Y = 1 on [0, 1].
I What is the probability density function of X + Y?
I f_{X+Y}(a) = ∫ f_X(a − y)f_Y(y) dy = ∫₀¹ f_X(a − y) dy, which is the length of [0, 1] ∩ [a − 1, a].
I That's a when a ∈ [0, 1] and 2 − a when a ∈ [1, 2] and 0 otherwise.
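This triangular density implies P{X + Y ≤ 1} = 1/2 and P{X + Y ≤ a} = a²/2 for a ∈ [0, 1] (so 1/8 at a = 1/2); a simulation sketch (seed and sample size arbitrary):

```python
import random

random.seed(7)
trials = 200_000
sums = [random.random() + random.random() for _ in range(trials)]
p_half_point = sum(1 for s in sums if s <= 1) / trials    # about 1/2
p_eighth = sum(1 for s in sums if s <= 0.5) / trials      # about 1/8
print(p_half_point, p_eighth)
```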
Review: summing i.i.d. geometric random variables

I A geometric random variable X with parameter p has P{X = k} = (1 − p)^{k−1} p for k ≥ 1.
I Sum Z of n independent copies of X?
I We can interpret Z as time slot where nth head occurs in i.i.d. sequence of p-coin tosses.
I So Z is negative binomial (n, p). So P{Z = k} = (k−1 choose n−1) p^{n−1} (1 − p)^{k−n} p.

Summing i.i.d. exponential random variables

I Suppose X1, . . . , Xn are i.i.d. exponential random variables with parameter λ. So f_{Xi}(x) = λe^{−λx} on [0, ∞) for all 1 ≤ i ≤ n.
I What is the law of Z = Σ_{i=1}^n Xi?
I We claimed in an earlier lecture that this was a gamma distribution with parameters (λ, n).
I So f_Z(y) = λe^{−λy}(λy)^{n−1}/Γ(n).
I We argued this point by taking limits of negative binomial distributions. Can we check it directly?
I By induction, would suffice to show that a gamma (λ, 1) plus an independent gamma (λ, n) is a gamma (λ, n + 1).

Summing independent gamma random variables

I Say X is gamma (λ, s), Y is gamma (λ, t), and X and Y are independent.
I Intuitively, X is amount of time till we see s events, and Y is amount of subsequent time till we see t more events.
I So f_X(x) = λe^{−λx}(λx)^{s−1}/Γ(s) and f_Y(y) = λe^{−λy}(λy)^{t−1}/Γ(t).
I Now f_{X+Y}(a) = ∫ f_X(a − y)f_Y(y) dy.
I Up to an a-independent multiplicative constant, this is

  ∫₀^a e^{−λ(a−y)} (a − y)^{s−1} e^{−λy} y^{t−1} dy = e^{−λa} ∫₀^a (a − y)^{s−1} y^{t−1} dy.

I Letting x = y/a, this becomes e^{−λa} a^{s+t−1} ∫₀¹ (1 − x)^{s−1} x^{t−1} dx.
I This is (up to multiplicative constant) e^{−λa} a^{s+t−1}. Constant must be such that integral from −∞ to ∞ is 1. Conclude that X + Y is gamma (λ, s + t).

Summing two normal variables

I X is normal with mean zero, variance σ1², Y is normal with mean zero, variance σ2².
I f_X(x) = (1/(√(2π)σ1)) e^{−x²/(2σ1²)} and f_Y(y) = (1/(√(2π)σ2)) e^{−y²/(2σ2²)}.
I We just need to compute f_{X+Y}(a) = ∫ f_X(a − y)f_Y(y) dy.
I We could compute this directly.
I Or we could argue with a multi-dimensional bell curve picture that if X and Y have variance 1 then f_{σ1 X + σ2 Y} is the density of a normal random variable (and note that variances and expectations are additive).
I Or use fact that if Ai ∈ {−1, 1} are i.i.d. coin tosses then (1/√N) Σ_{i=1}^{σ²N} Ai is approximately normal with variance σ² when N is large.
I Generally: if independent random variables Xj are normal (μj, σj²) then Σ_{j=1}^n Xj is normal (Σ_{j=1}^n μj, Σ_{j=1}^n σj²).
P
that X + Y is gamma (, s + t).
Other sums

I Sum of an independent binomial (m, p) and binomial (n, p)?


I Yes, binomial (m + n, p). Can be seen from coin toss
interpretation.
I Sum of independent Poisson λ1 and Poisson λ2?
I Yes, Poisson λ1 + λ2. Can be seen from Poisson point process interpretation.
18.600: Lecture 23
Conditional probability, order statistics, expectations of sums

Scott Sheffield

MIT

Outline

Conditional probability densities

Order statistics

Expectations of sums

Outline

Conditional probability densities

Order statistics

Expectations of sums

Conditional distributions

I Let's say X and Y have joint probability density function f(x, y).
I We can define the conditional probability density of X given that Y = y by f_{X|Y=y}(x) = f(x, y)/f_Y(y).
I This amounts to restricting f(x, y) to the line corresponding to the given y value (and dividing by the constant that makes the integral along that line equal to 1).
I This definition assumes that f_Y(y) = ∫ f(x, y) dx < ∞ and f_Y(y) ≠ 0. Is that safe to assume?
I Usually...

I Our standard definition of conditional probability is


P(A|B) = P(AB)/P(B). I Suppose X and Y are chosen uniformly on the semicircle
I Doesnt make sense if P(B) = 0. But previous slide defines {(x, y ) : x 2 + y 2 1, x 0}. What is fX |Y =0 (x)?
probability conditioned on Y = y and P{Y = y } = 0. I Answer: fX |Y =0 (x) = 1 if x [0, 1] (zero otherwise).
I When can we (somehow) make sense of conditioning on I Let (, R) be (X , Y ) in polar coordinates. What is fX |=0 (x)?
probability zero event? I Answer: fX |=0 (x) = 2x if x [0, 1] (zero otherwise).
I Tough question in general. I Both { = 0} and {Y = 0} describe the same probability zero
I Consider conditional law of X given that Y (y , y + ). If event. But our interpretation of what it means to condition
this has a limit as  0, we can call that the law conditioned on this event is different in these two cases.
on Y = y . I Conditioning on (X , Y ) belonging to a (, ) wedge is
I Precisely, define very different from conditioning on (X , Y ) belonging to a
FX |Y =y (a) := lim0 P{X a|Y (y , y + )}. Y (, ) strip.
I Then set fX |Y =y (a) = FX0 |Y =y (a). Consistent with definition
from previous slide.

Outline

Conditional probability densities

Order statistics

Expectations of sums


Maxima: pick five job candidates at random, choose best

I Suppose I choose n random variables X1, X2, . . . , Xn uniformly at random on [0, 1], independently of each other.
I The n-tuple (X1, X2, . . . , Xn) has a constant density function on the n-dimensional cube [0, 1]^n.
I What is the probability that the largest of the Xi is less than a?
I ANSWER: a^n.
I So if X = max{X1, . . . , Xn}, then what is the probability density function of X?
I Answer: F_X(a) = 0 for a < 0, F_X(a) = a^n for a ∈ [0, 1], and F_X(a) = 1 for a > 1. And f_X(a) = F_X′(a) = n a^{n−1} on [0, 1].

General order statistics

I Consider i.i.d. random variables X1, X2, . . . , Xn with continuous probability density f.
I Let Y1 < Y2 < Y3 . . . < Yn be list obtained by sorting the Xj.
I In particular, Y1 = min{X1, . . . , Xn} and Yn = max{X1, . . . , Xn} is the maximum.
I What is the joint probability density of the Yi?
I Answer: f(x1, x2, . . . , xn) = n! ∏_{i=1}^n f(xi) if x1 < x2 < . . . < xn, zero otherwise.
I Let σ: {1, 2, . . . , n} → {1, 2, . . . , n} be the permutation such that Xj = Y_{σ(j)}.
I Are σ and the vector (Y1, . . . , Yn) independent of each other?
I Yes.
Example

I Let X1, . . . , Xn be i.i.d. uniform random variables on [0, 1].
I Example: say n = 10 and condition on X1 being the third largest of the Xj.
I Given this, what is the conditional probability density function for X1?
I Write p = X1. This is kind of like choosing a random p and then conditioning on 7 heads and 2 tails.
I Answer is beta distribution with parameters (a, b) = (8, 3).
I Up to a constant, f(x) = x⁷(1 − x)².
I General beta (a, b) expectation is a/(a + b) = 8/11. Mode is (a − 1)/((a − 1) + (b − 1)) = 7/9.

Outline

Conditional probability densities

Order statistics

Expectations of sums
Outline

Conditional probability densities

Order statistics

Expectations of sums

Properties of expectation

I Several properties we derived for discrete expectations continue to hold in the continuum.
I If X is discrete with mass function p(x) then E[X] = Σ_x p(x)x.
I Similarly, if X is continuous with density function f(x) then E[X] = ∫ f(x)x dx.
I If X is discrete with mass function p(x) then E[g(X)] = Σ_x p(x)g(x).
I Similarly, if X is continuous with density function f(x) then E[g(X)] = ∫ f(x)g(x) dx.
I If X and Y have joint mass function p(x, y) then E[g(X, Y)] = Σ_y Σ_x g(x, y)p(x, y).
I If X and Y have joint probability density function f(x, y) then E[g(X, Y)] = ∫∫ g(x, y)f(x, y) dx dy.

Properties of expectation

I For both discrete and continuous random variables X and Y we have E[X + Y] = E[X] + E[Y].
I In both discrete and continuous settings, E[aX] = aE[X] when a is a constant. And E[Σ ai Xi] = Σ ai E[Xi].
I But what about that delightful area under 1 − F_X formula for the expectation?
I When X is non-negative with probability one, do we always have E[X] = ∫₀^∞ P{X > x} dx, in both discrete and continuous settings?
I Define g(y) so that 1 − F_X(g(y)) = y. (Draw horizontal line at height y and look where it hits graph of 1 − F_X.)
I Choose Y uniformly on [0, 1] and note that g(Y) has the same probability distribution as X.
I So E[X] = E[g(Y)] = ∫₀¹ g(y) dy, which is indeed the area under the graph of 1 − F_X.
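For a concrete check of the area formula, take X exponential with λ = 2: then ∫₀^∞ P{X > x} dx = ∫₀^∞ e^{−2x} dx = 1/2 = E[X]. A numeric sketch (truncation and step size are arbitrary choices):

```python
import math

lam, step = 2.0, 0.0005
# Riemann sum of P{X > x} = exp(-lam*x) over [0, 20]; the tail beyond is negligible.
area = sum(math.exp(-lam * k * step) for k in range(40_000)) * step
print(area, 1 / lam)  # both about 0.5
```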
18.600: Lecture 24
Covariance and some conditional expectation exercises

Scott Sheffield

MIT

Outline

Covariance and correlation

Paradoxes: getting ready to think about conditional expectation

Outline A property of independence

I If X and Y are independent then


Covariance and correlation
E [g (X )h(Y )] = E [g (X )]E [h(Y )].
R R
I Just write E [g (X )h(Y )] = g (x)h(y )f (x, y )dxdy .
RSince f (x, y ) = fXR(x)fY (y ) this factors as
I

Paradoxes: getting ready to think about conditional expectation h(y )fY (y )dy g (x)fX (x)dx = E [h(Y )]E [g (X )].
Defining covariance and correlation

I Now define covariance of X and Y by
Cov(X , Y ) = E [(X - E [X ])(Y - E [Y ])].
I Note: by definition Var(X ) = Cov(X , X ).
I Covariance (like variance) can also be written a different way.
Write μX = E [X ] and μY = E [Y ]. If laws of X and Y are
known, then μX and μY are just constants.
I Then Cov(X , Y ) = E [(X - μX )(Y - μY )] =
E [XY - μX Y - μY X + μX μY ] =
E [XY ] - μX E [Y ] - μY E [X ] + μX μY = E [XY ] - E [X ]E [Y ].
I Covariance formula E [XY ] - E [X ]E [Y ], or "expectation of
product minus product of expectations," is frequently useful.
I Note: if X and Y are independent then Cov(X , Y ) = 0.

Basic covariance facts

I Using Cov(X , Y ) = E [XY ] - E [X ]E [Y ] as a definition,
certain facts are immediate.
I Cov(X , Y ) = Cov(Y , X )
I Cov(X , X ) = Var(X )
I Cov(aX , Y ) = aCov(X , Y ).
I Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).
I General statement of bilinearity of covariance:
Cov(Σ_{i=1}^m ai Xi , Σ_{j=1}^n bj Yj ) = Σ_{i=1}^m Σ_{j=1}^n ai bj Cov(Xi , Yj ).
I Special case:
Var(Σ_{i=1}^n Xi ) = Σ_{i=1}^n Var(Xi ) + 2 Σ_{(i,j):i<j} Cov(Xi , Xj ).

Defining correlation

I Again, by definition Cov(X , Y ) = E [XY ] - E [X ]E [Y ].
I Correlation of X and Y defined by
ρ(X , Y ) := Cov(X , Y ) / sqrt(Var(X )Var(Y )).
I Correlation doesn't care what units you use for X and Y . If
a > 0 and c > 0 then ρ(aX + b, cY + d) = ρ(X , Y ).
I Satisfies -1 ≤ ρ(X , Y ) ≤ 1.
I Why is that? Something to do with E [(X + Y )^2 ] ≥ 0 and
E [(X - Y )^2 ] ≥ 0?
I If a and b are constants and a > 0 then ρ(aX + b, X ) = 1.
I If a and b are constants and a < 0 then ρ(aX + b, X ) = -1.

Important point

I Say X and Y are uncorrelated when ρ(X , Y ) = 0.
I Are independent random variables X and Y always uncorrelated?
I Yes, assuming variances are finite (so that correlation is defined).
I Are uncorrelated random variables always independent?
I No. Uncorrelated just means E [(X - E [X ])(Y - E [Y ])] = 0,
i.e., the outcomes where (X - E [X ])(Y - E [Y ]) is positive
(the upper right and lower left quadrants, if axes are drawn
centered at (E [X ], E [Y ])) balance out the outcomes where
this quantity is negative (upper left and lower right
quadrants). This is a much weaker statement than independence.
Examples

I Suppose that X1 , . . . , Xn are i.i.d. random variables with
variance 1. For example, maybe each Xj takes values ±1
according to a fair coin toss.
I Compute Cov(X1 + X2 + X3 , X2 + X3 + X4 ).
I Compute the correlation coefficient
ρ(X1 + X2 + X3 , X2 + X3 + X4 ).
I Can we generalize this example?
I What is variance of number of people who get their own hat
in the hat problem?
I Define Xi to be 1 if ith person gets own hat, zero otherwise.
I Recall formula
Var(Σ_{i=1}^n Xi ) = Σ_{i=1}^n Var(Xi ) + 2 Σ_{(i,j):i<j} Cov(Xi , Xj ).
I Reduces problem to computing Cov(Xi , Xj ) (for i ≠ j) and
Var(Xi ).
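Both exercises above can be checked by simulation; this sketch uses my own parameter choices (±1 coin tosses, n = 10 hats):

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) X1..X4 i.i.d. fair ±1 tosses (variance 1). The shared terms X2, X3
# each contribute 1 to the covariance, so Cov(A, B) = 2; each sum has
# variance 3, so the correlation is 2/3.
X = rng.choice([-1, 1], size=(4, 1_000_000))
A, B = X[0] + X[1] + X[2], X[1] + X[2] + X[3]
cov_ab = np.cov(A, B)[0, 1]
rho_ab = np.corrcoef(A, B)[0, 1]
print(round(cov_ab, 1), round(rho_ab, 2))  # ≈ 2.0 0.67

# (2) Hat problem: count fixed points of a uniformly random permutation.
# Working through the covariance formula gives mean 1 and variance 1.
n, trials = 10, 200_000
m = np.array([(rng.permutation(n) == np.arange(n)).sum()
              for _ in range(trials)])
print(round(m.mean(), 1), round(m.var(), 1))  # ≈ 1.0 1.0
```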

Outline Famous paradox

I Certain corrupt and amoral banker dies, instructed to spend


some number n (of bankers choosing) days in hell.
I At the end of this period, a (biased) coin will be tossed.
Banker will be assigned to hell forever with probability 1/n
Covariance and correlation and heaven forever with probability 1 1/n.
I After 10 days, banker reasons, If I wait another day I reduce
my odds of being here forever from 1/10 to 1/11. Thats a
Paradoxes: getting ready to think about conditional expectation reduction of 1/110. A 1/110 chance at infinity has infinite
value. Worth waiting one more day.
I Repeats this reasoning every day, stays in hell forever.
I Standard punch line: this is actually what banker deserved.
I Fairly dark as math humor goes (and no offense intended to
anyone...) but dilemma is interesting.
I Paradox: decisions seem sound individually but together yield Money pile paradox
worst possible outcome. Why? Can we demystify this?
I Variant without probability: Instead of tossing (1/n)-coin,
person deterministically spends 1/n fraction of future days I You have an infinite collection of money piles with labeled
(every nth day, say) in hell. 0, 1, 2, . . . from left to right.
I Even simpler variant: infinitely many identical money sacks I Precise details not important, but lets say you have 1/4 in
have labels 1, 2, 3, . . . I have sack 1. You have all others. the 0th pile and 38 5j in the jth pile for each j > 0. Important
I You offer me a deal. I give you sack 1, you give me sacks 2 thing is that pile size is increasing exponentially in j.
and 3. I give you sack 2 and you give me sacks 4 and 5. On I Banker proposes to transfer a fraction (say 2/3) of each pile
the nth stage, I give you sack n and you give me sacks 2n and to the pile on its left and remainder to the pile on its right.
2n + 1. Continue until I say stop. Do this simultaneously for all piles.
I Lets me get arbitrarily rich. But if I go on forever, I return I Every pile is bigger after transfer (and this can be true even if
every sack given to me. If nth sack confers right to spend nth banker takes a portion of each pile as a fee).
day in heaven, leads to hell-forever paradox. I Banker seemed to make you richer (every pile got bigger) but
I I make infinitely many good trades and end up with less than I really just reshuffled your infinite wealth.
started with. Paradox is really just existence of 2-to-1 map
from (smaller set) {2, 3, . . .} to (bigger set) {1, 2, . . .}.

Two envelope paradox

I X is geometric with parameter 1/2. One envelope has 10^X
dollars, one has 10^(X-1) dollars. Envelopes shuffled.
I You choose an envelope and, after seeing contents, are
allowed to choose whether to keep it or switch. (Maybe you
have to pay a dollar to switch.)
I Maximizing conditional expectation, it seems it's always
better to switch. But if you always switch, why not just
choose second-choice envelope first and avoid switching fee?
I Kind of a disguised version of money pile paradox. But more
subtle. One has to replace "jth pile of money" with
"restriction of expectation sum to scenario that first chosen
envelope has 10^j ". Switching indeed makes each pile bigger.
I However, "higher expectation given amount in first envelope"
may not be right notion of "better." If S is payout with
switching, T is payout without switching, then S has same
law as T - 1. In that sense S is worse.

Moral

I Beware infinite expectations.
I Beware unbounded utility functions.
I They can lead to strange conclusions, sometimes related to
"reshuffling" infinite (actual or expected) wealth to create
paradoxes.
I Paradoxes can arise even when total transaction is finite with
probability one (as in envelope problem).
18.600: Lecture 25
Conditional probability distributions
Conditional expectation

Scott Sheffield

MIT

Outline

Conditional probability distributions

Conditional expectation

Interpretation and examples

Recall: conditional probability distributions

I It all starts with the definition of conditional probability:
P(A|B) = P(AB)/P(B).
I If X and Y are jointly discrete random variables, we can use
this to define a probability mass function for X given Y = y .
I That is, we write pX |Y (x|y ) = P{X = x|Y = y } = p(x, y )/pY (y ).
I In words: first restrict sample space to pairs (x, y ) with given
y value. Then divide the original mass function by pY (y ) to
obtain a probability mass function on the restricted space.
I We do something similar when X and Y are continuous random
variables. In that case we write fX |Y (x|y ) = f (x, y )/fY (y ).
I Often useful to think of sampling (X , Y ) as a two-stage
process. First sample Y from its marginal distribution, obtain
Y = y for some particular y . Then sample X from its
probability distribution given Y = y .
I Marginal law of X is weighted average of conditional laws.
Example

I Let X be value on one die roll, Y value on second die roll,
and write Z = X + Y .
I What is the probability distribution for X given that Y = 5?
I Answer: uniform on {1, 2, 3, 4, 5, 6}.
I What is the probability distribution for Z given that Y = 5?
I Answer: uniform on {6, 7, 8, 9, 10, 11}.
I What is the probability distribution for Y given that Z = 5?
I Answer: uniform on {1, 2, 3, 4}.

Conditional expectation

I Now, what do we mean by E [X |Y = y ]? This should just be
the expectation of X in the conditional probability measure
for X given that Y = y .
I Can write this as
E [X |Y = y ] = Σ_x xP{X = x|Y = y } = Σ_x x pX |Y (x|y ).
I Can make sense of this in the continuum setting as well.
I In continuum setting we had fX |Y (x|y ) = f (x, y )/fY (y ). So
E [X |Y = y ] = ∫ x f (x, y )/fY (y ) dx.
Example

I Let X be value on one die roll, Y value on second die roll,
and write Z = X + Y .
I What is E [X |Y = 5]?
I What is E [Z |Y = 5]?
I What is E [Y |Z = 5]?

Conditional expectation as a random variable

I Can think of E [X |Y ] as a function of the random variable Y .
When Y = y it takes the value E [X |Y = y ].
I So E [X |Y ] is itself a random variable. It happens to depend
only on the value of Y .
I Thinking of E [X |Y ] as a random variable, we can ask what its
expectation is. What is E [E [X |Y ]]?
I Very useful fact: E [E [X |Y ]] = E [X ].
I In words: what you expect to expect X to be after learning Y
is same as what you now expect X to be.
I Proof in discrete case:
E [X |Y = y ] = Σ_x xP{X = x|Y = y } = Σ_x x p(x, y )/pY (y ).
I Recall that, in general, E [g (Y )] = Σ_y pY (y )g (y ).
I E [E [X |Y ]] = Σ_y pY (y ) Σ_x x p(x, y )/pY (y ) = Σ_y Σ_x p(x, y )x = E [X ].
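The three dice answers above can be confirmed by exact enumeration of the 36 equally likely outcomes (a small sketch; the helper name is my own):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # (first roll, second roll)

def cond_exp(value, given):
    # Exact conditional expectation of value(o) given the event given(o).
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(value(o) for o in kept), len(kept))

e1 = cond_exp(lambda o: o[0], lambda o: o[1] == 5)    # E[X | Y = 5]
e2 = cond_exp(lambda o: sum(o), lambda o: o[1] == 5)  # E[Z | Y = 5]
e3 = cond_exp(lambda o: o[1], lambda o: sum(o) == 5)  # E[Y | Z = 5]
print(e1, e2, e3)  # 7/2 17/2 5/2
```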

Conditional variance

I Definition:
Var(X |Y ) = E [(X - E [X |Y ])^2 |Y ] = E [X^2 - E [X |Y ]^2 |Y ].
I Var(X |Y ) is a random variable that depends on Y . It is the
variance of X in the conditional distribution for X given Y .
I Note E [Var(X |Y )] = E [E [X^2 |Y ]] - E [E [X |Y ]^2 ] =
E [X^2 ] - E [E [X |Y ]^2 ].
I If we subtract E [X ]^2 from first term and add equivalent value
E [E [X |Y ]]^2 to the second, RHS becomes
Var[X ] - Var[E [X |Y ]], which implies following:
I Useful fact: Var(X ) = Var(E [X |Y ]) + E [Var(X |Y )].
I One can discover X in two stages: first sample Y from
marginal and compute E [X |Y ], then sample X from
distribution given Y value.
I Above fact breaks variance into two parts, corresponding to
these two stages.

Example

I Let X be a random variable of variance σX^2 and Y an
independent random variable of variance σY^2 and write
Z = X + Y . Assume E [X ] = E [Y ] = 0.
I What are the covariances Cov(X , Y ) and Cov(X , Z )?
I How about the correlation coefficients ρ(X , Y ) and ρ(X , Z )?
I What is E [Z |X ]? And how about Var(Z |X )?
I Both of these values are functions of X . Former is just X .
Latter happens to be a constant-valued function of X , i.e.,
happens not to actually depend on X . We have
Var(Z |X ) = σY^2 .
I Can we check the formula
Var(Z ) = Var(E [Z |X ]) + E [Var(Z |X )] in this case?
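The check proposed in the last bullet can be done by simulation; in this sketch (my parameters) X and Y are independent centered normals with variances 4 and 9, so E[Z|X] = X, Var(Z|X) = 9, and the identity predicts Var(Z) = 4 + 9 = 13.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 2.0, 1_000_000)   # Var(X) = 4
Y = rng.normal(0.0, 3.0, 1_000_000)   # Var(Y) = 9
Z = X + Y
# Var(E[Z|X]) + E[Var(Z|X)] = Var(X) + Var(Y) = 13
print(round(Z.var(), 1))  # ≈ 13.0
```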
Outline

Conditional probability distributions

Conditional expectation

Interpretation and examples

Interpretation

I Sometimes think of the expectation E [Y ] as a best guess or
best predictor of the value of Y .
I It is best in the sense that among all constants m, the
expectation E [(Y - m)^2 ] is minimized when m = E [Y ].
I But what if we allow non-constant predictors? What if the
predictor is allowed to depend on the value of a random
variable X that we can observe directly?
I Let g (x) be such a function. Then E [(Y - g (X ))^2 ] is
minimized when g (X ) = E [Y |X ].

Examples

I Toss 100 coins. What's the conditional expectation of the
number of heads given that there are k heads among the first
fifty tosses?
I k + 25
I What's the conditional expectation of the number of aces in a
five-card poker hand given that the first two cards in the hand
are aces?
I 2 + 3 · 2/50
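The first answer can be checked by conditioning a simulation (a sketch; k = 23 is my arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
flips = rng.integers(0, 2, size=(200_000, 100), dtype=np.int8)
first = flips[:, :50].sum(axis=1)   # heads among the first fifty
total = flips.sum(axis=1)           # heads among all hundred
k = 23
est = total[first == k].mean()      # should be close to k + 25 = 48
print(round(est, 1))  # ≈ 48.0
```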
18.600: Lecture 26
Moment generating functions and characteristic functions

Scott Sheffield

MIT

Outline

Moment generating functions

Characteristic functions

Continuity theorems and perspective

Moment generating functions

I Let X be a random variable.
I The moment generating function of X is defined by
M(t) = MX (t) := E [e^{tX} ].
I When X is discrete, can write M(t) = Σ_x e^{tx} pX (x). So M(t)
is a weighted average of countably many exponential
functions.
I When X is continuous, can write M(t) = ∫ e^{tx} f (x) dx. So
M(t) is a weighted average of a continuum of exponential
functions.
I We always have M(0) = 1.
I If b > 0 and t > 0 then
E [e^{tX} ] ≥ E [e^{t min{X ,b}} ] ≥ P{X ≥ b}e^{tb} .
I If X takes both positive and negative values with positive
probability then M(t) grows at least exponentially fast in |t|
as |t| → ∞.
Moment generating functions actually generate moments

I Let X be a random variable and M(t) = E [e^{tX} ].
I Then M'(t) = d/dt E [e^{tX} ] = E [d/dt (e^{tX} )] = E [Xe^{tX} ].
I In particular, M'(0) = E [X ].
I Also M''(t) = d/dt M'(t) = d/dt E [Xe^{tX} ] = E [X^2 e^{tX} ].
I So M''(0) = E [X^2 ]. Same argument gives that nth derivative
of M at zero is E [X^n ].
I Interesting: knowing all of the derivatives of M at a single
point tells you the moments E [X^k ] for all integer k ≥ 0.
I Another way to think of this: write
e^{tX} = 1 + tX + t^2 X^2 /2! + t^3 X^3 /3! + . . . .
I Taking expectations gives
E [e^{tX} ] = 1 + t m1 + t^2 m2 /2! + t^3 m3 /3! + . . . , where mk is the kth
moment. The kth derivative at zero is mk .

Moment generating functions for independent sums

I Let X and Y be independent random variables and
Z = X + Y .
I Write the moment generating functions as MX (t) = E [e^{tX} ]
and MY (t) = E [e^{tY} ] and MZ (t) = E [e^{tZ} ].
I If you knew MX and MY , could you compute MZ ?
I By independence, MZ (t) = E [e^{t(X +Y )} ] = E [e^{tX} e^{tY} ] =
E [e^{tX} ]E [e^{tY} ] = MX (t)MY (t) for all t.
I In other words, adding independent random variables
corresponds to multiplying moment generating functions.
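The derivative facts can be checked numerically; this sketch (not from the slides) uses the exponential(λ) MGF M(t) = λ/(λ - t), with λ = 2 so that E[X] = 1/λ = 0.5 and E[X²] = 2/λ² = 0.5.

```python
# Finite-difference check that M'(0) = E[X] and M''(0) = E[X^2]
# for an exponential(lam) random variable.
lam = 2.0
M = lambda t: lam / (lam - t)          # exponential(lam) MGF, valid for t < lam
h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)          # central difference for M'(0)
M2 = (M(h) - 2 * M(0) + M(-h)) / h**2  # central difference for M''(0)
print(round(M1, 4), round(M2, 4))  # 0.5 0.5
```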

Moment generating functions for sums of i.i.d. random variables

I We showed that if Z = X + Y and X and Y are independent,
then MZ (t) = MX (t)MY (t).
I If X1 , . . . , Xn are i.i.d. copies of X and Z = X1 + . . . + Xn then
what is MZ ?
I Answer: MX^n . Follows by repeatedly applying formula above.
I This is a big reason for studying moment generating functions.
It helps us understand what happens when we sum up a lot of
independent copies of the same random variable.

Other observations

I If Z = aX then can I use MX to determine MZ ?
I Answer: Yes. MZ (t) = E [e^{tZ} ] = E [e^{taX} ] = MX (at).
I If Z = X + b then can I use MX to determine MZ ?
I Answer: Yes. MZ (t) = E [e^{tZ} ] = E [e^{tX +bt} ] = e^{bt} MX (t).
I Latter answer is the special case of MZ (t) = MX (t)MY (t)
where Y is the constant random variable b.
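A Monte Carlo sketch of the product rule (distributions are my choice: X exponential(1), Y uniform on [0, 1], evaluated at t = 0.3 so M_X(t) is finite):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(1.0, 1_000_000)
Y = rng.uniform(0.0, 1.0, 1_000_000)
t = 0.3
lhs = np.exp(t * (X + Y)).mean()                   # estimate of M_{X+Y}(t)
rhs = np.exp(t * X).mean() * np.exp(t * Y).mean()  # M_X(t) * M_Y(t)
exact = (1 / (1 - t)) * (math.exp(t) - 1) / t      # closed forms multiplied
print(round(lhs, 2), round(rhs, 2), round(exact, 2))  # agree up to MC error
```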
Examples

I Let's try some examples. What is MX (t) = E [e^{tX} ] when X is
binomial with parameters (p, n)? Hint: try the n = 1 case
first.
I Answer: if n = 1 then MX (t) = E [e^{tX} ] = pe^t + (1 - p)e^0 . In
general MX (t) = (pe^t + 1 - p)^n .
I What if X is Poisson with parameter λ > 0?
I Answer: MX (t) = E [e^{tX} ] = Σ_{n=0}^∞ e^{tn} e^{-λ} λ^n /n! =
e^{-λ} Σ_{n=0}^∞ (λe^t )^n /n! = e^{-λ} e^{λe^t} = exp[λ(e^t - 1)].
I We know that if you add independent Poisson random
variables with parameters λ1 and λ2 you get a Poisson
random variable of parameter λ1 + λ2 . How is this fact
manifested in the moment generating function?

More examples: normal random variables

I What if X is normal with mean zero, variance one?
I MX (t) = (1/√(2π)) ∫ e^{tx} e^{-x^2 /2} dx =
(1/√(2π)) ∫ exp{-(x - t)^2 /2 + t^2 /2} dx = e^{t^2 /2} .
I What does that tell us about sums of i.i.d. copies of X ?
I If Z is sum of n i.i.d. copies of X then MZ (t) = e^{nt^2 /2} .
I What is MZ if Z is normal with mean μ and variance σ^2 ?
I Answer: Z has same law as σX + μ, so
MZ (t) = MX (σt)e^{μt} = exp{σ^2 t^2 /2 + μt}.
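The Poisson question answers itself once the MGFs are multiplied: exp(λ1(e^t - 1)) · exp(λ2(e^t - 1)) = exp((λ1 + λ2)(e^t - 1)). A one-line numeric sketch (test values are arbitrary):

```python
import math

M = lambda lam, t: math.exp(lam * (math.exp(t) - 1.0))  # Poisson MGF
t, l1, l2 = 0.7, 1.5, 2.5   # arbitrary test values
ok = math.isclose(M(l1, t) * M(l2, t), M(l1 + l2, t))
print(ok)  # True
```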

More examples: exponential random variables

I What if X is exponential with parameter λ > 0?
I MX (t) = ∫_0^∞ e^{tx} λe^{-λx} dx = λ ∫_0^∞ e^{(t-λ)x} dx = λ/(λ - t).
I What if Z is a Γ distribution with parameters λ > 0 and
n > 0?
I Then Z has the law of a sum of n independent copies of X .
So MZ (t) = MX (t)^n = (λ/(λ - t))^n .
I Exponential calculation above works for t < λ. What happens
when t > λ? Or as t approaches λ from below?
I MX (t) = ∫_0^∞ e^{tx} λe^{-λx} dx = λ ∫_0^∞ e^{(t-λ)x} dx = ∞ if t ≥ λ.

More examples: existence issues

I Seems that unless fX (x) decays superexponentially as x tends
to infinity, we won't have MX (t) defined for all t.
I What is MX if X is standard Cauchy, so that fX (x) = 1/(π(1 + x^2 ))?
I Answer: MX (0) = 1 (as is true for any X ) but otherwise
MX (t) is infinite for all t ≠ 0.
I Informal statement: moment generating functions are not
defined for distributions with fat tails.
Outline

Moment generating functions

Characteristic functions

Continuity theorems and perspective

Characteristic functions

I Let X be a random variable.
I The characteristic function of X is defined by
φ(t) = φX (t) := E [e^{itX} ]. Like M(t) except with i thrown in.
I Recall that by definition e^{it} = cos(t) + i sin(t).
I Characteristic functions are similar to moment generating
functions in some ways.
I For example, φX +Y = φX φY , just as MX +Y = MX MY .
I And φaX (t) = φX (at) just as MaX (t) = MX (at).
I And if X has an mth moment then E [X^m ] = i^{-m} φX^{(m)} (0).
I But characteristic functions have a distinct advantage: they
are always well defined for all t even if fX decays slowly.
Perspective

I In later lectures, we will see that one can use moment
generating functions and/or characteristic functions to prove
the so-called weak law of large numbers and central limit
theorem.
I Proofs using characteristic functions apply in more generality,
but they require you to remember how to exponentiate
imaginary numbers.
I Moment generating functions are central to so-called "large
deviation theory" and play a fundamental role in statistical
physics, among other things.
I Characteristic functions are Fourier transforms of the
corresponding distribution density functions and encode
"periodicity" patterns. For example, if X is integer valued,
φX (t) = E [e^{itX} ] will be 1 whenever t is a multiple of 2π.

Continuity theorems

I Let X be a random variable and Xn a sequence of random
variables.
I We say that Xn converge in distribution or converge in law
to X if lim_{n→∞} FXn (x) = FX (x) at all x ∈ R at which FX is
continuous.
I Levy's continuity theorem (see Wikipedia): if
lim_{n→∞} φXn (t) = φX (t) for all t, then Xn converge in law to X .
I Moment generating analog: if moment generating
functions MXn (t) are defined for all t and n and
lim_{n→∞} MXn (t) = MX (t) for all t, then Xn converge in law to X .
18.600: Lecture 27
Lectures 15-27 Review

Scott Sheffield

MIT

Outline

Continuous random variables

Problems motivated by coin tossing

Random variable properties

Continuous random variables

I Say X is a continuous random variable if there exists a
probability density function f = fX on R such that
P{X ∈ B} = ∫_B f (x) dx := ∫ 1B (x)f (x) dx.
I We may assume ∫_R f (x) dx = ∫_{-∞}^∞ f (x) dx = 1 and f is
non-negative.
I Probability of interval [a, b] is given by ∫_a^b f (x) dx, the area
under f between a and b.
I Probability of any single point is zero.
I Define cumulative distribution function
F (a) = FX (a) := P{X < a} = P{X ≤ a} = ∫_{-∞}^a f (x) dx.
Expectations of continuous random variables

I Recall that when X was a discrete random variable, with
p(x) = P{X = x}, we wrote
E [X ] = Σ_{x:p(x)>0} p(x)x.
I How should we define E [X ] when X is a continuous random
variable?
I Answer: E [X ] = ∫ f (x)x dx.
I Recall that when X was a discrete random variable, with
p(x) = P{X = x}, we wrote
E [g (X )] = Σ_{x:p(x)>0} p(x)g (x).
I What is the analog when X is a continuous random variable?
I Answer: we will write E [g (X )] = ∫ f (x)g (x) dx.

Variance of continuous random variables

I Suppose X is a continuous random variable with mean μ.
I We can write Var[X ] = E [(X - μ)^2 ], same as in the discrete
case.
I Next, if g = g1 + g2 then
E [g (X )] = ∫ g1 (x)f (x) dx + ∫ g2 (x)f (x) dx =
∫ (g1 (x) + g2 (x))f (x) dx = E [g1 (X )] + E [g2 (X )].
I Furthermore, E [ag (X )] = aE [g (X )] when a is a constant.
I Just as in the discrete case, we can expand the variance
expression as Var[X ] = E [X^2 - 2μX + μ^2 ] and use additivity
of expectation to say that
Var[X ] = E [X^2 ] - 2μE [X ] + E [μ^2 ] = E [X^2 ] - 2μ^2 + μ^2 =
E [X^2 ] - E [X ]^2 .
I This formula is often useful for calculations.

Outline

Continuous random variables

Problems motivated by coin tossing

Random variable properties


It's the coins, stupid

I Much of what we have done in this course can be motivated
by the i.i.d. sequence Xi where each Xi is 1 with probability p
and 0 otherwise. Write Sn = Σ_{i=1}^n Xi .
I Binomial (Sn number of heads in n tosses), geometric
(steps required to obtain one heads), negative binomial
(steps required to obtain n heads).
I Standard normal approximates law of (Sn - E [Sn ])/SD(Sn ). Here
E [Sn ] = np and SD(Sn ) = √(Var(Sn )) = √(npq) where
q = 1 - p.
I Poisson is limit of binomial as n → ∞ when p = λ/n.
I Poisson point process: toss one λ/n coin during each length
1/n time increment, take n → ∞ limit.
I Exponential: time till first event in Poisson point process.
I Gamma distribution: time till nth event in Poisson point
process.

Discrete random variable properties derivable from coin toss intuition

I Sum of two independent binomial random variables with
parameters (n1 , p) and (n2 , p) is itself binomial (n1 + n2 , p).
I Sum of n independent geometric random variables with
parameter p is negative binomial with parameter (n, p).
I Expectation of geometric random variable with parameter
p is 1/p.
I Expectation of binomial random variable with parameters
(n, p) is np.
I Variance of binomial random variable with parameters
(n, p) is np(1 - p) = npq.

Continuous random variable properties derivable from coin toss intuition

I Sum of n independent exponential random variables each
with parameter λ is gamma with parameters (n, λ).
I Memoryless properties: given that exponential random
variable X is greater than T > 0, the conditional law of
X - T is the same as the original law of X .
I Write p = λ/n. Poisson random variable expectation is
lim_{n→∞} np = lim_{n→∞} n(λ/n) = λ. Variance is
lim_{n→∞} np(1 - p) = lim_{n→∞} n(λ/n)(1 - λ/n) = λ.
I Sum of λ1 Poisson and independent λ2 Poisson is a
λ1 + λ2 Poisson.
I Times between successive events in Poisson process are
independent exponentials with parameter λ.
I Minimum of independent exponentials with parameters λ1
and λ2 is itself exponential with parameter λ1 + λ2 .

DeMoivre-Laplace Limit Theorem

I DeMoivre-Laplace limit theorem (special case of central
limit theorem):
lim_{n→∞} P{a ≤ (Sn - np)/√(npq) ≤ b} = Φ(b) - Φ(a).
I This is Φ(b) - Φ(a) = P{a ≤ X ≤ b} when X is a standard
normal random variable.
Problems

I Toss a million fair coins. Approximate the probability that I
get more than 501,000 heads.
I Answer: well, √(npq) = √(10^6 · .5 · .5) = 500. So we're asking
for probability to be over two SDs above mean. This is
approximately 1 - Φ(2) = Φ(-2).
I Roll 60000 dice. Expect to see 10000 sixes. What's the
probability to see more than 9800?
I Here √(npq) = √(60000 · (1/6) · (5/6)) ≈ 91.28.
I And 200/91.28 ≈ 2.19. Answer is about 1 - Φ(-2.19).

Properties of normal random variables

I Say X is a (standard) normal random variable if
f (x) = (1/√(2π)) e^{-x^2 /2} .
I Mean zero and variance one.
I The random variable Y = σX + μ has variance σ^2 and
expectation μ.
I Y is said to be normal with parameters μ and σ^2 . Its density
function is fY (x) = (1/(√(2π)σ)) e^{-(x-μ)^2 /(2σ^2 )} .
I Function Φ(a) = (1/√(2π)) ∫_{-∞}^a e^{-x^2 /2} dx can't be computed
explicitly.
I Values: Φ(-3) ≈ .0013, Φ(-2) ≈ .023 and Φ(-1) ≈ .159.
I Rule of thumb: "two thirds of time within one SD of mean,
95 percent of time within 2 SDs of mean."
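The first problem's arithmetic, done numerically with Φ expressed via the error function (a sketch, not part of the slides):

```python
import math

Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
sd = math.sqrt(1_000_000 * 0.5 * 0.5)   # sqrt(npq) = 500
z = (501_000 - 500_000) / sd            # 2 SDs above the mean
tail = 1.0 - Phi(z)                     # 1 - Phi(2)
print(sd, round(tail, 4))  # 500.0 0.0228
```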

Properties of exponential random variables

I Say X is an exponential random variable of parameter λ
when its probability density function is f (x) = λe^{-λx} for
x ≥ 0 (and f (x) = 0 if x < 0).
I For a > 0 have
FX (a) = ∫_0^a f (x) dx = ∫_0^a λe^{-λx} dx = -e^{-λx} |_0^a = 1 - e^{-λa} .
I Thus P{X < a} = 1 - e^{-λa} and P{X > a} = e^{-λa} .
I Formula P{X > a} = e^{-λa} is very important in practice.
I Repeated integration by parts gives E [X^n ] = n!/λ^n .
I If λ = 1, then E [X^n ] = n!. Value Γ(n) := E [X^{n-1} ] defined for
real n > 0 and Γ(n) = (n - 1)!.

Defining Γ distribution

I Say that random variable X has gamma distribution with
parameters (α, λ) if fX (x) = (λx)^{α-1} λe^{-λx} /Γ(α) for x ≥ 0
and fX (x) = 0 for x < 0.
I Same as exponential distribution when α = 1. Otherwise,
multiply by (λx)^{α-1} and divide by Γ(α). The fact that Γ(α) is
what you need to divide by to make the total integral one just
follows from the definition of Γ.
I "Waiting time" interpretation makes sense only for integer α,
but distribution is defined for general positive α.
Outline

Continuous random variables

Problems motivated by coin tossing

Random variable properties

Properties of uniform random variables

I Suppose X is a random variable with probability density
function f (x) = 1/(β - α) for x ∈ [α, β] and f (x) = 0 for
x ∉ [α, β].
I Then E [X ] = (α + β)/2 .
I And, writing Y for a uniform random variable on [0, 1],
Var[X ] = Var[(β - α)Y + α] = Var[(β - α)Y ] =
(β - α)^2 Var[Y ] = (β - α)^2 /12.

Distribution of function of random variable

I Suppose P{X ≤ a} = FX (a) is known for all a. Write
Y = X^3 . What is P{Y ≤ 27}?
I Answer: note that Y ≤ 27 if and only if X ≤ 3. Hence
P{Y ≤ 27} = P{X ≤ 3} = FX (3).
I Generally FY (a) = P{Y ≤ a} = P{X ≤ a^{1/3} } = FX (a^{1/3} ).
I This is a general principle. If X is a continuous random
variable and g is a strictly increasing function of x and
Y = g (X ), then FY (a) = FX (g^{-1} (a)).
Joint probability mass functions: discrete random variables

I If X and Y assume values in {1, 2, . . . , n} then we can view
Ai,j = P{X = i, Y = j} as the entries of an n × n matrix.
I Let's say I don't care about Y . I just want to know
P{X = i}. How do I figure that out from the matrix?
I Answer: P{X = i} = Σ_{j=1}^n Ai,j .
I Similarly, P{Y = j} = Σ_{i=1}^n Ai,j .
I In other words, the probability mass functions for X and Y
are the row and column sums of Ai,j .
I Given the joint distribution of X and Y , we sometimes call
distribution of X (ignoring Y ) and distribution of Y (ignoring
X ) the marginal distributions.
I In general, when X and Y are jointly defined discrete random
variables, we write p(x, y ) = pX ,Y (x, y ) = P{X = x, Y = y }.

Joint distribution functions: continuous random variables

I Given random variables X and Y , define
F (a, b) = P{X ≤ a, Y ≤ b}.
I The region {(x, y ) : x ≤ a, y ≤ b} is the lower left quadrant
centered at (a, b).
I Refer to FX (a) = P{X ≤ a} and FY (b) = P{Y ≤ b} as
marginal cumulative distribution functions.
I Question: if I tell you the two parameter function F , can you
use it to determine the marginals FX and FY ?
I Answer: Yes. FX (a) = lim_{b→∞} F (a, b) and
FY (b) = lim_{a→∞} F (a, b).
I Density: f (x, y ) = ∂x ∂y F (x, y ).

Independent random variables

I We say X and Y are independent if for any two (measurable)
sets A and B of real numbers we have
P{X ∈ A, Y ∈ B} = P{X ∈ A}P{Y ∈ B}.
I When X and Y are discrete random variables, they are
independent if P{X = x, Y = y } = P{X = x}P{Y = y } for
all x and y for which P{X = x} and P{Y = y } are non-zero.
I When X and Y are continuous, they are independent if
f (x, y ) = fX (x)fY (y ).

Summing two random variables

I Say we have independent random variables X and Y and we
know their density functions fX and fY .
I Now let's try to find FX +Y (a) = P{X + Y ≤ a}.
I This is the integral over {(x, y ) : x + y ≤ a} of
f (x, y ) = fX (x)fY (y ). Thus,
P{X + Y ≤ a} = ∫_{-∞}^∞ ∫_{-∞}^{a-y} fX (x)fY (y ) dx dy
= ∫_{-∞}^∞ FX (a - y )fY (y ) dy .
I Differentiating both sides gives
fX +Y (a) = d/da ∫ FX (a - y )fY (y ) dy = ∫ fX (a - y )fY (y ) dy .
I Latter formula makes some intuitive sense. We're integrating
over the set of x, y pairs that add up to a.
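The convolution formula can be sanity-checked by simulation; this sketch (my own example) sums two independent Uniform[0, 1] variables, where ∫ fX(a - y)fY(y) dy works out to the "triangle" density, equal to a for 0 ≤ a ≤ 1.

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.uniform(0, 1, 1_000_000) + rng.uniform(0, 1, 1_000_000)
# Estimate the density near a = 0.5 from the fraction landing in a window.
frac = ((0.45 <= s) & (s <= 0.55)).mean()
dens = frac / 0.1
print(round(dens, 2))  # ≈ 0.5, matching f(0.5) = 0.5
```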
Conditional distributions

I Let's say X and Y have joint probability density function
f (x, y ).
I We can define the conditional probability density of X given
that Y = y by fX |Y =y (x) = f (x, y )/fY (y ).
I This amounts to restricting f (x, y ) to the line corresponding
to the given y value (and dividing by the constant that makes
the integral along that line equal to 1).

Maxima: pick five job candidates at random, choose best

I Suppose I choose n random variables X1 , X2 , . . . , Xn uniformly
at random on [0, 1], independently of each other.
I The n-tuple (X1 , X2 , . . . , Xn ) has a constant density function
on the n-dimensional cube [0, 1]^n .
I What is the probability that the largest of the Xi is less than a?
I ANSWER: a^n .
I So if X = max{X1 , . . . , Xn }, then what is the probability
density function of X ?
I Answer: FX (a) = 0 for a < 0, FX (a) = a^n for a ∈ [0, 1], and
FX (a) = 1 for a > 1. And fX (a) = FX '(a) = na^{n-1} on [0, 1].
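An empirical sketch of the FX(a) = a^n formula (parameters are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n, a = 5, 0.8
mx = rng.uniform(0, 1, (1_000_000, n)).max(axis=1)  # max of n uniforms
emp = (mx <= a).mean()
print(round(emp, 3), round(a**n, 3))  # empirical vs exact a^n, both ≈ 0.328
```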

General order statistics

I Consider i.i.d. random variables X1 , X2 , . . . , Xn with continuous
probability density f .
I Let Y1 < Y2 < Y3 . . . < Yn be list obtained by sorting the Xj .
I In particular, Y1 = min{X1 , . . . , Xn } and
Yn = max{X1 , . . . , Xn } is the maximum.
I What is the joint probability density of the Yi ?
I Answer: f (x1 , x2 , . . . , xn ) = n! Π_{i=1}^n f (xi ) if x1 < x2 . . . < xn ,
zero otherwise.
I Let σ : {1, 2, . . . , n} → {1, 2, . . . , n} be the permutation such
that Xj = Yσ(j) .
I Are σ and the vector (Y1 , . . . , Yn ) independent of each other?
I Yes.

Properties of expectation

I Several properties we derived for discrete expectations
continue to hold in the continuum.
I If X is discrete with mass function p(x) then
E [X ] = Σ_x p(x)x.
I Similarly, if X is continuous with density function f (x) then
E [X ] = ∫ f (x)x dx.
I If X is discrete with mass function p(x) then
E [g (X )] = Σ_x p(x)g (x).
I Similarly, if X is continuous with density function f (x) then
E [g (X )] = ∫ f (x)g (x) dx.
I If X and Y have joint mass function p(x, y ) then
E [g (X , Y )] = Σ_y Σ_x g (x, y )p(x, y ).
I If X and Y have joint probability density function f (x, y ) then
E [g (X , Y )] = ∫∫ g (x, y )f (x, y ) dx dy .
Properties of expectation

I For both discrete and continuous random variables X and Y
we have E [X + Y ] = E [X ] + E [Y ].
I In both discrete and continuous settings, E [aX ] = aE [X ]
when a is a constant. And E [Σ ai Xi ] = Σ ai E [Xi ].
I But what about that delightful "area under 1 - FX " formula
for the expectation?
I When X is non-negative with probability one, do we always
have E [X ] = ∫_0^∞ P{X > x} dx, in both discrete and continuous
settings?
I Define g (y ) so that 1 - FX (g (y )) = y . (Draw horizontal line
at height y and look where it hits graph of 1 - FX .)
I Choose Y uniformly on [0, 1] and note that g (Y ) has the
same probability distribution as X .
I So E [X ] = E [g (Y )] = ∫_0^1 g (y ) dy , which is indeed the area
under the graph of 1 - FX .

A property of independence

I If X and Y are independent then
E [g (X )h(Y )] = E [g (X )]E [h(Y )].
I Just write E [g (X )h(Y )] = ∫∫ g (x)h(y )f (x, y ) dx dy .
I Since f (x, y ) = fX (x)fY (y ) this factors as
∫ h(y )fY (y ) dy ∫ g (x)fX (x) dx = E [h(Y )]E [g (X )].

Defining covariance and correlation

I Now define covariance of X and Y by
  Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
I Note: by definition Var(X) = Cov(X, X).
I Covariance formula E[XY] − E[X]E[Y], or "expectation of
  product minus product of expectations", is frequently useful.
I If X and Y are independent then Cov(X, Y) = 0.
I Converse is not true.

Basic covariance facts

I Cov(X, Y) = Cov(Y, X).
I Cov(X, X) = Var(X).
I Cov(aX, Y) = aCov(X, Y).
I Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y).
I General statement of bilinearity of covariance:
  Cov(∑_{i=1}^m ai Xi, ∑_{j=1}^n bj Yj) = ∑_{i=1}^m ∑_{j=1}^n ai bj Cov(Xi, Yj).
I Special case:
  Var(∑_{i=1}^n Xi) = ∑_{i=1}^n Var(Xi) + 2 ∑_{(i,j): i<j} Cov(Xi, Xj).
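These identities can be verified exactly on a small example. The joint pmf below is made up for illustration; the check computes Cov via "expectation of product minus product of expectations" and confirms the Var(X + Y) special case:

```python
# A small (made-up) joint pmf for (X, Y) on {0,1}^2, to check the formulas exactly.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def E(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(pr * g(x, y) for (x, y), pr in p.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
cov = E(lambda x, y: x * y) - EX * EY                  # Cov = E[XY] - E[X]E[Y]
var_x = E(lambda x, y: x * x) - EX ** 2
var_y = E(lambda x, y: y * y) - EY ** 2
var_sum = E(lambda x, y: (x + y) ** 2) - (EX + EY) ** 2
print(cov, var_sum, var_x + var_y + 2 * cov)           # last two values agree
```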
Defining correlation

I Again, by definition Cov(X, Y) = E[XY] − E[X]E[Y].
I Correlation of X and Y defined by
  ρ(X, Y) := Cov(X, Y) / √(Var(X)Var(Y)).
I Correlation doesn't care what units you use for X and Y. If
  a > 0 and c > 0 then ρ(aX + b, cY + d) = ρ(X, Y).
I Satisfies −1 ≤ ρ(X, Y) ≤ 1.
I If a and b are constants and a > 0 then ρ(aX + b, X) = 1.
I If a and b are constants and a < 0 then ρ(aX + b, X) = −1.

Conditional probability distributions

I It all starts with the definition of conditional probability:
  P(A|B) = P(AB)/P(B).
I If X and Y are jointly discrete random variables, we can use
  this to define a probability mass function for X given Y = y.
  That is, we write pX|Y(x|y) = P{X = x | Y = y} = p(x, y)/pY(y).
I In words: first restrict sample space to pairs (x, y) with given
  y value. Then divide the original mass function by pY(y) to
  obtain a probability mass function on the restricted space.
I We do something similar when X and Y are continuous
  random variables. In that case we write fX|Y(x|y) = f(x, y)/fY(y).
I Often useful to think of sampling (X, Y) as a two-stage
  process. First sample Y from its marginal distribution, obtain
  Y = y for some particular y. Then sample X from its
  conditional probability distribution given Y = y.

Conditional expectation

I Now, what do we mean by E[X | Y = y]? This should just be
  the expectation of X in the conditional probability measure
  for X given that Y = y.
I Can write this as
  E[X | Y = y] = ∑_x x P{X = x | Y = y} = ∑_x x pX|Y(x|y).
I Can make sense of this in the continuum setting as well.
I In continuum setting we had fX|Y(x|y) = f(x, y)/fY(y). So
  E[X | Y = y] = ∫ x f(x, y)/fY(y) dx.

Conditional expectation as a random variable

I Can think of E[X|Y] as a function of the random variable Y.
  When Y = y it takes the value E[X | Y = y].
I So E[X|Y] is itself a random variable. It happens to depend
  only on the value of Y.
I Thinking of E[X|Y] as a random variable, we can ask what its
  expectation is. What is E[E[X|Y]]?
I Very useful fact: E[E[X|Y]] = E[X].
I In words: what you expect to expect X to be after learning Y
  is same as what you now expect X to be.
I Proof in discrete case:
  E[X | Y = y] = ∑_x x P{X = x | Y = y} = ∑_x x p(x, y)/pY(y).
I Recall that, in general, E[g(Y)] = ∑_y pY(y)g(y).
I E[E[X|Y]] = ∑_y pY(y) ∑_x x p(x, y)/pY(y) = ∑_y ∑_x x p(x, y) = E[X].
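The tower property is concrete enough to check by hand. The tiny joint pmf below is my own illustrative example; the code computes E[X | Y = y] for each y, averages against the marginal of Y, and compares with E[X]:

```python
# Tiny made-up joint pmf p(x, y), to verify E[E[X|Y]] = E[X] exactly.
p = {(1, 0): 0.2, (2, 0): 0.3, (1, 1): 0.4, (2, 1): 0.1}

p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0.0) + pr                  # marginal mass function of Y

def cond_exp_x(y):
    """E[X | Y = y] = sum_x x * p(x, y) / p_Y(y)."""
    return sum(x * pr for (x, yy), pr in p.items() if yy == y) / p_y[y]

tower = sum(p_y[y] * cond_exp_x(y) for y in p_y)   # E[E[X|Y]]
ex = sum(x * pr for (x, y), pr in p.items())       # E[X]
print(tower, ex)                                   # the two agree
```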
Conditional variance

I Definition:
  Var(X|Y) = E[(X − E[X|Y])² | Y] = E[X² | Y] − E[X|Y]².
I Var(X|Y) is a random variable that depends on Y. It is the
  variance of X in the conditional distribution for X given Y.
I Note E[Var(X|Y)] = E[E[X² | Y]] − E[E[X|Y]²] =
  E[X²] − E[E[X|Y]²].
I If we subtract E[X]² from first term and add equivalent value
  E[E[X|Y]]² to the second, RHS becomes
  Var[X] − Var[E[X|Y]], which implies following:
I Useful fact: Var(X) = Var(E[X|Y]) + E[Var(X|Y)].
I One can discover X in two stages: first sample Y from
  marginal and compute E[X|Y], then sample X from
  distribution given Y value.
I Above fact breaks variance into two parts, corresponding to
  these two stages.

Example

I Let X be a random variable of variance σ_X² and Y an
  independent random variable of variance σ_Y², and write
  Z = X + Y. Assume E[X] = E[Y] = 0.
I What are the covariances Cov(X, Y) and Cov(X, Z)?
I How about the correlation coefficients ρ(X, Y) and ρ(X, Z)?
I What is E[Z|X]? And how about Var(Z|X)?
I Both of these values are functions of X. Former is just X.
  Latter happens to be a constant-valued function of X, i.e.,
  happens not to actually depend on X. We have
  Var(Z|X) = σ_Y².
I Can we check the formula
  Var(Z) = Var(E[Z|X]) + E[Var(Z|X)] in this case?
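For the Z = X + Y example, the decomposition can be checked exactly with a concrete choice of distributions (my choice, for illustration): X uniform on {−1, 1} and Y uniform on {−2, 2}, independent, so σ_X² = 1 and σ_Y² = 4:

```python
from itertools import product

# Check Var(Z) = Var(E[Z|X]) + E[Var(Z|X)] for independent X in {-1,1}, Y in {-2,2}
# (each value with probability 1/2), Z = X + Y; here Var(X) = 1, Var(Y) = 4.
xs, ys = (-1, 1), (-2, 2)

EZ = sum(0.25 * (x + y) for x, y in product(xs, ys))             # = 0
var_z = sum(0.25 * (x + y - EZ) ** 2 for x, y in product(xs, ys))

cond_mean = {x: sum(0.5 * (x + y) for y in ys) for x in xs}      # E[Z|X=x] = x
var_cond_mean = sum(0.5 * (cond_mean[x] - EZ) ** 2 for x in xs)  # Var(E[Z|X]) = Var(X)
exp_cond_var = sum(0.5 * sum(0.5 * (x + y - cond_mean[x]) ** 2 for y in ys)
                   for x in xs)                                  # E[Var(Z|X)] = Var(Y)
print(var_z, var_cond_mean + exp_cond_var)  # both equal 5.0
```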

Moment generating functions

I Let X be a random variable and M(t) = E[e^{tX}].
I Then M′(0) = E[X] and M″(0) = E[X²]. Generally, nth
  derivative of M at zero is E[X^n].
I Let X and Y be independent random variables and
  Z = X + Y.
I Write the moment generating functions as MX(t) = E[e^{tX}]
  and MY(t) = E[e^{tY}] and MZ(t) = E[e^{tZ}].
I If you knew MX and MY, could you compute MZ?
I By independence, MZ(t) = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] =
  E[e^{tX}]E[e^{tY}] = MX(t)MY(t) for all t.
I In other words, adding independent random variables
  corresponds to multiplying moment generating functions.

Moment generating functions for sums of i.i.d. random variables

I We showed that if Z = X + Y and X and Y are independent,
  then MZ(t) = MX(t)MY(t).
I If X1, . . . , Xn are i.i.d. copies of X and Z = X1 + . . . + Xn then
  what is MZ?
I Answer: MX^n. Follows by repeatedly applying formula above.
I This is a big reason for studying moment generating functions.
  It helps us understand what happens when we sum up a lot of
  independent copies of the same random variable.
I If Z = aX then MZ(t) = E[e^{tZ}] = E[e^{taX}] = MX(at).
I If Z = X + b then MZ(t) = E[e^{tZ}] = E[e^{tX+bt}] = e^{bt}MX(t).
Examples

I If X is binomial with parameters (n, p) then
  MX(t) = (pe^t + 1 − p)^n.
I If X is Poisson with parameter λ > 0 then
  MX(t) = exp[λ(e^t − 1)].
I If X is normal with mean 0, variance 1, then MX(t) = e^{t²/2}.
I If X is normal with mean μ, variance σ², then
  MX(t) = e^{σ²t²/2 + μt}.
I If X is exponential with parameter λ > 0 then MX(t) = λ/(λ − t)
  for t < λ.

Cauchy distribution

I A standard Cauchy random variable is a random real
  number with probability density f(x) = (1/π) · 1/(1 + x²).
I There is a spinning flashlight interpretation. Put a flashlight
  at (0, 1), spin it to a uniformly random angle θ in [−π/2, π/2],
  and consider point X where light beam hits the x-axis.
I FX(x) = P{X ≤ x} = P{tan θ ≤ x} = P{θ ≤ tan⁻¹ x} =
  1/2 + (1/π) tan⁻¹ x.
I Find fX(x) = (d/dx)FX(x) = (1/π) · 1/(1 + x²).
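The spinning-flashlight construction is easy to simulate (a sketch, not from the slides): draw θ uniformly, set X = tan θ, and compare the empirical CDF with 1/2 + (1/π) tan⁻¹ x.

```python
import math
import random

random.seed(1)
n = 100_000
# Flashlight at (0, 1): theta uniform on (-pi/2, pi/2); beam hits x-axis at tan(theta)
samples = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

for x in (0.0, 1.0):
    empirical = sum(s <= x for s in samples) / n
    exact = 0.5 + math.atan(x) / math.pi   # F_X(x) = 1/2 + (1/pi) arctan(x)
    print(x, empirical, exact)             # empirical CDF tracks the formula
```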

Beta distribution

I Two part experiment: first let p be a uniform random variable
  on [0, 1], then let X be binomial (n, p) (number of heads when
  we toss n p-coins).
I Given that X = a − 1 and n − X = b − 1, the conditional law
  of p is called the β(a, b) distribution.
I The density function is a constant (that doesn't depend on x)
  times x^{a−1}(1 − x)^{b−1}.
I That is, f(x) = x^{a−1}(1 − x)^{b−1}/B(a, b) on [0, 1], where B(a, b)
  is the constant chosen to make the integral one. Can show
  B(a, b) = Γ(a)Γ(b)/Γ(a + b).
I Turns out that E[X] = a/(a + b) and the mode of X is
  (a − 1)/((a − 1) + (b − 1)).
18.600: Lecture 29
Weak law of large numbers

Scott Sheffield

MIT

Outline

Weak law of large numbers: Markov/Chebyshev approach

Weak law of large numbers: characteristic function approach

Markov's and Chebyshev's inequalities

I Markov's inequality: Let X be a random variable taking only
  non-negative values. Fix a constant a > 0. Then
  P{X ≥ a} ≤ E[X]/a.
I Proof: Consider a random variable Y defined by Y = a if
  X ≥ a, and Y = 0 if X < a. Since X ≥ Y with probability one, it
  follows that E[X] ≥ E[Y] = aP{X ≥ a}. Divide both sides by
  a to get Markov's inequality.
I Chebyshev's inequality: If X has finite mean μ, variance σ²,
  and k > 0 then
  P{|X − μ| ≥ k} ≤ σ²/k².
I Proof: Note that (X − μ)² is a non-negative random variable
  and P{|X − μ| ≥ k} = P{(X − μ)² ≥ k²}. Now apply
  Markov's inequality with a = k².
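Both inequalities can be checked against a distribution whose tails are known exactly. A minimal sketch (my example): for X ~ Exponential(1) we have E[X] = 1, Var[X] = 1, and P{X ≥ a} = e^(−a), so the bounds are directly comparable to the true tail probabilities.

```python
import math

# X ~ Exponential(1): E[X] = 1, Var[X] = 1, P{X >= a} = exp(-a) exactly.
for a in (2.0, 5.0, 10.0):
    assert math.exp(-a) <= 1 / a              # Markov: P{X >= a} <= E[X]/a

for k in (2.0, 3.0, 4.0):
    # Since X >= 0 and k > 1, P{|X - 1| >= k} = P{X >= 1 + k} = exp(-(1+k)).
    tail = math.exp(-(1 + k))
    assert tail <= 1 / k ** 2                 # Chebyshev: <= sigma^2 / k^2
    print(k, tail, 1 / k ** 2)                # true tail vs Chebyshev bound
```

Note how loose the bounds are here: that is typical, since they use only the mean and variance.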
Markov and Chebyshev: rough idea

I Markov's inequality: Let X be a random variable taking only
  non-negative values with finite mean. Fix a constant a > 0.
  Then P{X ≥ a} ≤ E[X]/a.
I Chebyshev's inequality: If X has finite mean μ, variance σ²,
  and k > 0 then P{|X − μ| ≥ k} ≤ σ²/k².
I Inequalities allow us to deduce limited information about a
  distribution when we know only the mean (Markov) or the
  mean and variance (Chebyshev).
I Markov: if E[X] is small, then it is not too likely that X is
  large.
I Chebyshev: if σ² = Var[X] is small, then it is not too likely
  that X is far from its mean.

Statement of weak law of large numbers

I Suppose Xi are i.i.d. random variables with mean μ.
I Then the value An := (X1 + X2 + . . . + Xn)/n is called the
  empirical average of the first n trials.
I We'd guess that when n is large, An is typically close to μ.
I Indeed, the weak law of large numbers states that for all ε > 0
  we have lim_{n→∞} P{|An − μ| > ε} = 0.
I Example: as n tends to infinity, the probability of seeing more
  than .50001n heads in n fair coin tosses tends to zero.

Proof of weak law of large numbers in finite variance case

I As above, let Xi be i.i.d. random variables with mean μ and
  write An := (X1 + X2 + . . . + Xn)/n.
I By additivity of expectation, E[An] = μ.
I Similarly, Var[An] = nσ²/n² = σ²/n.
I By Chebyshev, P{|An − μ| ≥ ε} ≤ Var[An]/ε² = σ²/(nε²).
I No matter how small ε is, RHS will tend to zero as n gets
  large.
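The Chebyshev bound σ²/(nε²) can be compared with simulated frequencies. A rough sketch (my example, fair coin flips, so μ = 1/2 and σ² = 1/4):

```python
import random

random.seed(2)
# Compare observed P{|A_n - mu| >= eps} with Chebyshev's bound sigma^2/(n eps^2).
eps, trials = 0.05, 1000
for n in (100, 400, 1600):
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        bad += abs(heads / n - 0.5) >= eps
    bound = 0.25 / (n * eps ** 2)
    print(n, bad / trials, min(1.0, bound))  # both columns shrink as n grows
```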
Extent of weak law

I Question: does the weak law of large numbers apply no
  matter what the probability distribution for X is?
I Is it always the case that if we define An := (X1 + X2 + . . . + Xn)/n
  then An is typically close to some fixed value when n is large?
I What if X is Cauchy?
I Recall that in this strange case An actually has the same
  probability distribution as X.
I In particular, the An are not tightly concentrated around any
  particular value even when n is very large.
I But in this case E[|X|] was infinite. Does the weak law hold
  as long as E[|X|] is finite, so that μ is well defined?
I Yes. Can prove this using characteristic functions.

Characteristic functions

I Let X be a random variable.
I The characteristic function of X is defined by
  φ(t) = φX(t) := E[e^{itX}]. Like M(t) except with i thrown in.
I Recall that by definition e^{it} = cos(t) + i sin(t).
I Characteristic functions are similar to moment generating
  functions in some ways.
I For example, φ_{X+Y} = φX φY, just as M_{X+Y} = MX MY, if X
  and Y are independent.
I And φ_{aX}(t) = φX(at) just as M_{aX}(t) = MX(at).
I And if X has an mth moment then E[X^m] = i^{−m} φX^{(m)}(0).
I But characteristic functions have an advantage: they are well
  defined at all t for all random variables X.

Continuity theorems

I Let X be a random variable and Xn a sequence of random
  variables.
I Say Xn converge in distribution or converge in law to X if
  lim_{n→∞} FXn(x) = FX(x) at all x ∈ R at which FX is
  continuous.
I The weak law of large numbers can be rephrased as the
  statement that An converges in law to μ (i.e., to the random
  variable that is equal to μ with probability one).
I Lévy's continuity theorem (see Wikipedia): if
  lim_{n→∞} φ_{Xn}(t) = φX(t)
  for all t, then Xn converge in law to X.
I By this theorem, we can prove the weak law of large numbers
  by showing lim_{n→∞} φ_{An}(t) = φ_μ(t) = e^{itμ} for all t. In the
  special case that μ = 0, this amounts to showing
  lim_{n→∞} φ_{An}(t) = 1 for all t.
Proof of weak law of large numbers in finite mean case

I As above, let Xi be i.i.d. instances of random variable X with
  mean zero. Write An := (X1 + X2 + . . . + Xn)/n. Weak law of large
  numbers holds for i.i.d. instances of X if and only if it holds
  for i.i.d. instances of X − μ. Thus it suffices to prove the
  weak law in the mean zero case.
I Consider the characteristic function φX(t) = E[e^{itX}].
I Since E[X] = 0, we have φX′(0) = (∂/∂t)E[e^{itX}]|_{t=0} = iE[X] = 0.
I Write g(t) = log φX(t) so φX(t) = e^{g(t)}. Then g(0) = 0 and
  (by chain rule) g′(0) = lim_{ε→0} (g(ε) − g(0))/ε = lim_{ε→0} g(ε)/ε = 0.
I Now φ_{An}(t) = φX(t/n)^n = e^{ng(t/n)}. Since g(0) = g′(0) = 0
  we have lim_{n→∞} ng(t/n) = lim_{n→∞} t · g(t/n)/(t/n) = 0 if t is fixed.
  Thus lim_{n→∞} e^{ng(t/n)} = 1 for all t.
I By Lévy's continuity theorem, the An converge in law to 0
  (i.e., to the random variable that is 0 with probability one).
18.600: Lecture 30
Central limit theorem

Scott Sheffield

MIT

Outline

Central limit theorem

Proving the central limit theorem

Recall: DeMoivre-Laplace limit theorem

I Let Xi be an i.i.d. sequence of random variables. Write
  Sn = ∑_{i=1}^n Xi.
I Suppose each Xi is 1 with probability p and 0 with probability
  q = 1 − p.
I DeMoivre-Laplace limit theorem:
  lim_{n→∞} P{a ≤ (Sn − np)/√(npq) ≤ b} = Φ(b) − Φ(a).
I Here Φ(b) − Φ(a) = P{a ≤ Z ≤ b} when Z is a standard
  normal random variable.
I (Sn − np)/√(npq) describes number of standard deviations that
  Sn is above or below its mean.
I Question: Does a similar statement hold if the Xi are i.i.d. but
  have some other probability distribution?
I Central limit theorem: Yes, if they have finite variance.
Example

I Say we roll 10⁶ ordinary dice independently of each other.
I Let Xi be the number on the ith die. Let X = ∑_{i=1}^{10⁶} Xi be the
  total of the numbers rolled.
I What is E[X]?
I 10⁶ · 3.5
I What is Var[X]?
I 10⁶ · (35/12)
I How about SD[X]?
I 1000·√(35/12)
I What is the probability that X is less than a standard
  deviations above its mean?
I Central limit theorem: should be about (1/√(2π)) ∫_{−∞}^a e^{−x²/2} dx.

Example

I Suppose earthquakes in some region are a Poisson point
  process with rate equal to 1 per year.
I Let X be the number of earthquakes that occur over a
  ten-thousand year period. Should be a Poisson random
  variable with rate 10000.
I What is E[X]?
I 10000
I What is Var[X]?
I 10000
I How about SD[X]?
I 100
I What is the probability that X is less than a standard
  deviations above its mean?
I Central limit theorem: should be about (1/√(2π)) ∫_{−∞}^a e^{−x²/2} dx.
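A scaled-down version of the dice example can be simulated directly (a sketch, with 300 dice instead of 10⁶ to keep it fast): the fraction of trials where the sum is less than a = 1 standard deviation above its mean should be close to Φ(1) ≈ 0.8413.

```python
import math
import random

random.seed(3)
n, trials, a = 300, 2000, 1.0
mu, var = 3.5, 35 / 12                       # mean and variance of a single die
threshold = n * mu + a * math.sqrt(n * var)  # "a standard deviations above the mean"
hits = sum(
    sum(random.randint(1, 6) for _ in range(n)) <= threshold
    for _ in range(trials)
)
phi_a = 0.5 * (1 + math.erf(a / math.sqrt(2)))  # Phi(1), about 0.8413
print(hits / trials, phi_a)                     # empirical fraction vs Phi(1)
```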

General statement

I Let Xi be an i.i.d. sequence of random variables with finite
  mean μ and variance σ².
I Write Sn = ∑_{i=1}^n Xi. So E[Sn] = nμ and Var[Sn] = nσ² and
  SD[Sn] = σ√n.
I Write Bn = (X1 + X2 + . . . + Xn − nμ)/(σ√n). Then Bn is the
  difference between Sn and its expectation, measured in standard
  deviation units.
I Central limit theorem:
  lim_{n→∞} P{a ≤ Bn ≤ b} = Φ(b) − Φ(a).

Recall: characteristic functions

I Let X be a random variable.
I The characteristic function of X is defined by
  φ(t) = φX(t) := E[e^{itX}]. Like M(t) except with i thrown in.
I Recall that by definition e^{it} = cos(t) + i sin(t).
I Characteristic functions are similar to moment generating
  functions in some ways.
I For example, φ_{X+Y} = φX φY, just as M_{X+Y} = MX MY, if X
  and Y are independent.
I And φ_{aX}(t) = φX(at) just as M_{aX}(t) = MX(at).
I And if X has an mth moment then E[X^m] = i^{−m} φX^{(m)}(0).
I Characteristic functions are well defined at all t for all random
  variables X.

Rephrasing the theorem

I Let X be a random variable and Xn a sequence of random
  variables.
I Say Xn converge in distribution or converge in law to X if
  lim_{n→∞} FXn(x) = FX(x) at all x ∈ R at which FX is
  continuous.
I Recall: the weak law of large numbers can be rephrased as the
  statement that An = (X1 + X2 + . . . + Xn)/n converges in law to μ
  (i.e., to the random variable that is equal to μ with probability
  one) as n → ∞.
I The central limit theorem can be rephrased as the statement
  that Bn = (X1 + X2 + . . . + Xn − nμ)/(σ√n) converges in law to a
  standard normal random variable as n → ∞.

Continuity theorems

I Lévy's continuity theorem (see Wikipedia): if
  lim_{n→∞} φ_{Xn}(t) = φX(t)
  for all t, then Xn converge in law to X.
I By this theorem, we can prove the central limit theorem by
  showing lim_{n→∞} φ_{Bn}(t) = e^{−t²/2} for all t.
I Moment generating function continuity theorem: if
  moment generating functions MXn(t) are defined for all t and
  n and lim_{n→∞} MXn(t) = MX(t) for all t, then Xn converge in
  law to X.
I By this theorem, we can prove the central limit theorem by
  showing lim_{n→∞} MBn(t) = e^{t²/2} for all t.
Proof of central limit theorem with moment generating functions

I Write Y = (X − μ)/σ. Then Y has mean zero and variance 1.
I Write MY(t) = E[e^{tY}] and g(t) = log MY(t). So
  MY(t) = e^{g(t)}.
I We know g(0) = 0. Also MY′(0) = E[Y] = 0 and
  MY″(0) = E[Y²] = Var[Y] = 1.
I Chain rule: MY′(0) = g′(0)e^{g(0)} = g′(0) = 0 and
  MY″(0) = g″(0)e^{g(0)} + g′(0)²e^{g(0)} = g″(0) = 1.
I So g is a nice function with g(0) = g′(0) = 0 and g″(0) = 1.
  Taylor expansion: g(t) = t²/2 + o(t²) for t near zero.
I Now Bn is 1/√n times the sum of n independent copies of Y.
I So MBn(t) = MY(t/√n)^n = e^{ng(t/√n)}.
I But e^{ng(t/√n)} ≈ e^{n(t/√n)²/2} = e^{t²/2}, in sense that LHS tends to
  e^{t²/2} as n tends to infinity.

Proof of central limit theorem with characteristic functions

I Moment generating function proof only applies if the moment
  generating function of X exists.
I But the proof can be repeated almost verbatim using
  characteristic functions instead of moment generating
  functions.
I Then it applies for any X with finite variance.

Almost verbatim: replace MY(t) with φY(t)

I Write φY(t) = E[e^{itY}] and g(t) = log φY(t). So
  φY(t) = e^{g(t)}.
I We know g(0) = 0. Also φY′(0) = iE[Y] = 0 and
  φY″(0) = i²E[Y²] = −Var[Y] = −1.
I Chain rule: φY′(0) = g′(0)e^{g(0)} = g′(0) = 0 and
  φY″(0) = g″(0)e^{g(0)} + g′(0)²e^{g(0)} = g″(0) = −1.
I So g is a nice function with g(0) = g′(0) = 0 and
  g″(0) = −1. Taylor expansion: g(t) = −t²/2 + o(t²) for t
  near zero.
I Now Bn is 1/√n times the sum of n independent copies of Y.
I So φBn(t) = φY(t/√n)^n = e^{ng(t/√n)}.
I But e^{ng(t/√n)} ≈ e^{−n(t/√n)²/2} = e^{−t²/2}, in sense that LHS tends
  to e^{−t²/2} as n tends to infinity.

Perspective

I The central limit theorem is actually fairly robust. Variants of
  the theorem still apply if you allow the Xi not to be identically
  distributed, or not to be completely independent.
I We won't formulate these variants precisely in this course.
I But, roughly speaking, if you have a lot of little random terms
  that are mostly independent, and no single term
  contributes more than a small fraction of the total sum,
  then the total sum should be approximately normal.
I Example: if height is determined by lots of little mostly
  independent factors, then people's heights should be normally
  distributed.
I Not quite true... certain factors by themselves can cause a
  person to be a whole lot shorter or taller. Also, individual
  factors not really independent of each other.
I Kind of true for homogeneous population, ignoring outliers.
18.600: Lecture 31
Strong law of large numbers and Jensen's inequality

Scott Sheffield

MIT

Outline

A story about Pedro

Strong law of large numbers

Jensen's inequality

Pedro's hopes and dreams

I Pedro is considering two ways to invest his life savings.
I One possibility: put the entire sum in a government insured
  interest-bearing savings account. He considers this completely
  risk free. The (post-tax) interest rate equals the inflation rate,
  so the real value of his savings is guaranteed not to change.
I Riskier possibility: put sum in investment where every month
  real value goes up 15 percent with probability .53 and down
  15 percent with probability .47 (independently of everything
  else).
I How much does Pedro make in expectation over 10 years with
  risky approach? 100 years?
I Answer: let Ri be i.i.d. random variables each equal to 1.15
  with probability .53 and .85 with probability .47. Total value
  after n steps is initial investment times Tn := R1 R2 · · · Rn.
I Compute E[R1] = .53 · 1.15 + .47 · .85 = 1.009.
I Then E[T120] = 1.009^120 ≈ 2.93. And
  E[T1200] = 1.009^1200 ≈ 46808.9.

Pedro's financial planning

I How would you advise Pedro to invest over the next 10 years
  if Pedro wants to be completely sure that he doesn't lose
  money?
I What if Pedro is willing to accept substantial risk if it means
  there is a good chance it will enable his grandchildren to retire
  in comfort 100 years from now?
I What if Pedro wants the money for himself in ten years?
I Let's do some simulations.

Logarithmic point of view

I We wrote Tn = R1 · · · Rn. Taking logs, we can write
  Xi = log Ri and Sn = log Tn = ∑_{i=1}^n Xi.
I Now Sn is a sum of i.i.d. random variables.
I E[X1] = E[log R1] = .53(log 1.15) + .47(log .85) ≈ −.0023.
I By the law of large numbers, if we take n extremely large,
  then Sn/n ≈ −.0023 with high probability.
I This means that, when n is large, Sn is usually a very negative
  value, which means Tn is usually very close to zero (even
  though its expectation is very large).
I Bad news for Pedro's grandchildren. After 100 years, the
  portfolio is probably in bad shape. But what if Pedro takes an
  even longer view? Will Tn converge to zero with probability
  one as n gets large? Or will Tn perhaps always eventually
  rebound?
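The gap between the expectation of Tn and its typical size can be computed directly. The sketch below contrasts E[Tn] = E[R]^n with exp(n·E[log R]), which the law of large numbers says is roughly the size of a typical Tn:

```python
import math

p, up, down = 0.53, 1.15, 0.85
ER = p * up + (1 - p) * down                        # E[R] = 1.009
Elog = p * math.log(up) + (1 - p) * math.log(down)  # E[log R], about -0.0023

for n in (120, 1200):
    # E[T_n] grows geometrically, while exp(n E[log R]) -- the typical
    # size of T_n -- shrinks toward zero.
    print(n, ER ** n, math.exp(n * Elog))
```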
Strong law of large numbers

I Suppose Xi are i.i.d. random variables with mean μ.
I Then the value An := (X1 + X2 + . . . + Xn)/n is called the
  empirical average of the first n trials.
I Intuition: when n is large, An is typically close to μ.
I Recall: the weak law of large numbers states that for all ε > 0
  we have lim_{n→∞} P{|An − μ| > ε} = 0.
I The strong law of large numbers states that with
  probability one lim_{n→∞} An = μ.
I It is called "strong" because it implies the weak law of large
  numbers. But it takes a bit of thought to see why this is the
  case.

Strong law implies weak law

I Suppose we know that the strong law holds, i.e., with
  probability 1 we have lim_{n→∞} An = μ.
I Strong law implies that for every ε the random variable
  Y = max{n : |An − μ| > ε} is finite with probability one. It
  has some probability mass function (though we don't know
  what it is).
I Note that if |An − μ| > ε for some n value then Y ≥ n.
I Thus for each n we have P{|An − μ| > ε} ≤ P{Y ≥ n}.
I So lim_{n→∞} P{|An − μ| > ε} ≤ lim_{n→∞} P{Y ≥ n} = 0.
I If the right limit is zero for each ε (strong law) then the left
  limit is zero for each ε (weak law).

Proof of strong law assuming E[X⁴] < ∞

I Assume K := E[X⁴] < ∞. Not necessary, but simplifies proof.
I Note: Var[X²] = E[X⁴] − E[X²]² ≥ 0, so E[X²]² ≤ K.
I The strong law holds for i.i.d. copies of X if and only if it
  holds for i.i.d. copies of X − μ where μ is a constant.
I So we may as well assume E[X] = 0.
I Key to proof is to bound fourth moments of An.
I E[An⁴] = n⁻⁴ E[Sn⁴] = n⁻⁴ E[(X1 + X2 + . . . + Xn)⁴].
I Expand (X1 + . . . + Xn)⁴. Five kinds of terms: Xi Xj Xk Xl and
  Xi Xj Xk² and Xi Xj³ and Xi² Xj² and Xi⁴.
I The first three kinds all have expectation zero. There are
  6·(n choose 2) terms of the fourth kind and n of the last kind, each with
  expectation at most K. So E[An⁴] ≤ n⁻⁴ (6·(n choose 2) + n) K.
I Thus E[∑_{n=1}^∞ An⁴] = ∑_{n=1}^∞ E[An⁴] < ∞. So ∑_{n=1}^∞ An⁴ < ∞
  (and hence An → 0) with probability 1.

Jensen's inequality statement

I Let X be a random variable with finite mean E[X] = μ.
I Let g be a convex function. This means that if you draw a
  straight line connecting two points on the graph of g, then
  the graph of g lies below that line. If g is twice differentiable,
  then convexity is equivalent to the statement that g″(x) ≥ 0
  for all x. For a concrete example, take g(x) = x².
I Jensen's inequality: E[g(X)] ≥ g(E[X]).
I Proof: Let L(x) = ax + b be tangent to graph of g at point
  (E[X], g(E[X])). Then L lies below g. Observe
  E[g(X)] ≥ E[L(X)] = L(E[X]) = g(E[X]).
I Note: if g is concave (which means −g is convex), then
  E[g(X)] ≤ g(E[X]).
I If your utility function is concave, then you always prefer a
  safe investment over a risky investment with the same
  expected return.

More about Pedro

I Disappointed by the strong law of large numbers, Pedro seeks
  a better way to make money.
I Signs up for job as hedge fund manager. Allows him to
  manage C = 10⁹ dollars of somebody else's money. At end of
  each year, he and his staff get two percent of principal plus
  twenty percent of profit.
I Precisely: if X is end-of-year portfolio value, Pedro gets
  g(X) = .02C + .2 max{X − C, 0}.
I Pedro notices that g is a convex function. He can therefore
  increase his expected return by adopting risky strategies.
I Pedro has strategy that increases portfolio value 10 percent
  with probability .9, loses everything with probability .1.
I He repeats this yearly until fund collapses.
I With high probability Pedro is rich by then.
Perspective

I The "two percent of principal plus twenty percent of profit"
  arrangement is common in the hedge fund industry.
I The idea is that fund managers have both guaranteed revenue
  for expenses (two percent of principal) and incentive to make
  money (twenty percent of profit).
I Because of Jensen's inequality, the convexity of the payoff
  function is a genuine concern for hedge fund investors. People
  worry that it encourages fund managers (like Pedro) to take
  risks that are bad for the client.
I This is a special case of the principal-agent problem of
  economics. How do you ensure that the people you hire
  genuinely share your interests?
18.600: Lecture 32
Markov chains

Scott Sheffield

MIT

Outline

Markov chains

Examples

Ergodicity and stationarity

Markov chains

I Consider a sequence of random variables X0, X1, X2, . . . each
  taking values in the same state space, which for now we take
  to be a finite set that we label by {0, 1, . . . , M}.
I Interpret Xn as state of the system at time n.
I Sequence is called a Markov chain if we have a fixed
  collection of numbers Pij (one for each pair
  i, j ∈ {0, 1, . . . , M}) such that whenever the system is in state
  i, there is probability Pij that system will next be in state j.
I Precisely,
  P{Xn+1 = j | Xn = i, Xn−1 = i_{n−1}, . . . , X1 = i1, X0 = i0} = Pij.
I Kind of an "almost memoryless" property. Probability
  distribution for next state depends only on the current state
  (and not on the rest of the state history).
Simple example

I For example, imagine a simple weather model with two states:
  rainy and sunny.
I If it's rainy one day, there's a .5 chance it will be rainy the
  next day, a .5 chance it will be sunny.
I If it's sunny one day, there's a .8 chance it will be sunny the
  next day, a .2 chance it will be rainy.
I In this climate, sun tends to last longer than rain.
I Given that it is rainy today, how many days do I expect to
  have to wait to see a sunny day?
I Given that it is sunny today, how many days do I expect to
  have to wait to see a rainy day?
I Over the long haul, what fraction of days are sunny?

Matrix representation

I To describe a Markov chain, we need to define Pij for any
  i, j ∈ {0, 1, . . . , M}.
I It is convenient to represent the collection of transition
  probabilities Pij as a matrix:

      [ P00 P01 . . . P0M ]
  A = [ P10 P11 . . . P1M ]
      [ . . .             ]
      [ PM0 PM1 . . . PMM ]

I For this to make sense, we require Pij ≥ 0 for all i, j and
  ∑_{j=0}^M Pij = 1 for each i. That is, the rows sum to one.

Transitions via matrices

I Suppose that pi is the probability that system is in state i at
  time zero.
I What does the following product represent?

                        [ P00 P01 . . . P0M ]
  ( p0 p1 . . . pM )  · [ P10 P11 . . . P1M ]
                        [ . . .             ]
                        [ PM0 PM1 . . . PMM ]

I Answer: the probability distribution at time one.
I How about the following product?

  ( p0 p1 . . . pM ) · A^n

I Answer: the probability distribution at time n.

Powers of transition matrix

I We write P^{(n)}_{ij} for the probability to go from state i to state j
  over n steps.
I From the matrix point of view, the matrix of n-step transition
  probabilities is the nth power of the one-step matrix:

  [ P^{(n)}_{ij} ] = [ P_{ij} ]^n

I If A is the one-step transition matrix, then A^n is the n-step
  transition matrix.

Questions

I What does it mean if all of the rows are identical?
I Answer: state sequence Xi consists of i.i.d. random variables.
I What if matrix is the identity?
I Answer: states never change.
I What if each Pij is either one or zero?
I Answer: state evolution is deterministic.

Simple example

I Consider the simple weather example: If it's rainy one day,
  there's a .5 chance it will be rainy the next day, a .5 chance it
  will be sunny. If it's sunny one day, there's a .8 chance it will
  be sunny the next day, a .2 chance it will be rainy.
I Let rainy be state zero, sunny state one, and write the
  transition matrix as

  A = [ .5 .5 ]
      [ .2 .8 ]

I Note that

  A² = [ .35 .65 ]
       [ .26 .74 ]

I Can compute

  A¹⁰ ≈ [ .285719 .714281 ]
        [ .285713 .714287 ]
Does relationship status have the Markov property?

  [Diagram: states "Single", "In a relationship", "It's complicated",
  "Engaged", "Married", with arrows between them]

I Can we assign a probability to each arrow?
I Markov model implies time spent in any state (e.g., a
  marriage) before leaving is a geometric random variable.
I Not true... Can we make a better model with more states?

Ergodic Markov chains

I Say Markov chain is ergodic if some power of the transition
  matrix has all non-zero entries.
I Turns out that if chain has this property, then
  πj := lim_{n→∞} P^{(n)}_{ij} exists and the πj are the unique
  non-negative solutions of πj = ∑_{k=0}^M πk Pkj that sum to one.
I This means that the row vector
  π = ( π0 π1 . . . πM )
  is a left eigenvector of A with eigenvalue 1, i.e., πA = π.
I We call π the stationary distribution of the Markov chain.
I One can solve the system of linear equations
  πj = ∑_{k=0}^M πk Pkj to compute the values πj. Equivalent to
  considering A fixed and solving πA = π. Or solving
  π(A − I) = 0. This determines π up to a multiplicative
  constant, and fact that ∑ πj = 1 determines the constant.
Simple example

I If A = [ .5 .5 ]
        [ .2 .8 ],
  then we know πA = π, i.e.,

  ( π0 π1 ) [ .5 .5 ]  =  ( π0 π1 )  =  π.
            [ .2 .8 ]

I This means that .5π0 + .2π1 = π0 and .5π0 + .8π1 = π1, and
  we also know that π0 + π1 = 1. Solving these equations gives
  π0 = 2/7 and π1 = 5/7, so π = ( 2/7 5/7 ).
I Indeed,

  ( 2/7 5/7 ) [ .5 .5 ]  =  ( 2/7 5/7 )  =  π.
              [ .2 .8 ]

I Recall that

  A¹⁰ ≈ [ .285719 .714281 ]  ≈  [ 2/7 5/7 ]
        [ .285713 .714287 ]     [ 2/7 5/7 ]
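Both claims about the rain/sun chain can be checked with a few lines of plain Python (a sketch, no libraries): raise A to the 10th power and compare its rows with the stationary vector, which for a two-state chain solves to π0 = P10/(P01 + P10).

```python
# Rain/sun chain: state 0 = rainy, state 1 = sunny.
A = [[0.5, 0.5],
     [0.2, 0.8]]

def mat_mul(P, Q):
    """2x2 matrix product."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

An = A
for _ in range(9):
    An = mat_mul(An, A)        # An = A^10

# For a two-state chain, pi*A = pi gives pi0 = P10 / (P01 + P10).
pi0 = A[1][0] / (A[0][1] + A[1][0])
print(An[0], An[1], (pi0, 1 - pi0))  # both rows of A^10 approach (2/7, 5/7)
```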
18.600: Lecture 33
Entropy

Scott Sheffield

MIT

Outline

Entropy

Noiseless coding theory

Conditional entropy

What is entropy?

I Entropy is an important notion in thermodynamics,
  information theory, data compression, cryptography, etc.
I Familiar on some level to everyone who has studied chemistry
  or statistical physics.
I Kind of means amount of "randomness" or "disorder".
I But can we give a mathematical definition? In particular, how
  do we define the entropy of a random variable?
Information

I Suppose we toss a fair coin k times.
I Then the state space S is the set of 2^k possible heads-tails sequences.
I If X is the random sequence (so X is a random variable), then for each x ∈ S we have P{X = x} = 2^{-k}.
I In information theory it's quite common to use log to mean log_2 instead of log_e. We follow that convention in this lecture. In particular, this means that

      log P{X = x} = -k

  for each x ∈ S.
I Since there are 2^k values in S, it takes k bits to describe an element x ∈ S.
I Intuitively, we could say that when we learn that X = x, we have learned k = -log P{X = x} bits of information.

Shannon entropy

I Shannon: famous MIT student/faculty member, wrote The Mathematical Theory of Communication in 1948.
I Goal is to define a notion of "how much we expect to learn" from a random variable, or "how many bits of information a random variable contains," that makes sense for general experiments (which may not have anything to do with coins).
I If a random variable X takes values x1, x2, ..., xn with positive probabilities p1, p2, ..., pn then we define the entropy of X by

      H(X) = Σ_{i=1}^n pi (-log pi) = -Σ_{i=1}^n pi log pi.

I This can be interpreted as the expectation of (-log pi). The value (-log pi) is the "amount of surprise" when we see xi.
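The definition above maps directly onto a few lines of code. Here is a minimal sketch (the helper name `entropy` is mine, not from the lecture) that computes H(X) in bits from a list of probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a distribution given as a list of
    positive probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in probs)

# One fair coin toss has entropy 1 bit; k = 3 independent fair tosses
# (8 equally likely outcomes) have entropy 3 bits.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([1/8] * 8))    # 3.0
```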

Twenty questions with Harry

I Harry always thinks of one of the following animals:

      x          P{X = x}    -log P{X = x}
      Dog        1/4         2
      Cat        1/4         2
      Cow        1/8         3
      Pig        1/16        4
      Squirrel   1/16        4
      Mouse      1/16        4
      Owl        1/16        4
      Sloth      1/32        5
      Hippo      1/32        5
      Yak        1/32        5
      Zebra      1/64        6
      Rhino      1/64        6

I Can learn animal with H(X) = 47/16 questions on average.

Other examples

I Again, if a random variable X takes the values x1, x2, ..., xn with positive probabilities p1, p2, ..., pn then we define the entropy of X by

      H(X) = Σ_{i=1}^n pi (-log pi) = -Σ_{i=1}^n pi log pi.

I If X takes one value with probability 1, what is H(X)?
I If X takes k values with equal probability, what is H(X)?
I What is H(X) if X is a geometric random variable with parameter p = 1/2?

Coding values by bit sequences

I David Huffman (as an MIT student) published "A Method for the Construction of Minimum-Redundancy Codes" in 1952.
I If X takes four values A, B, C, D we can code them by:

      A → 00
      B → 01
      C → 10
      D → 11

I Or by

      A → 0
      B → 10
      C → 110
      D → 111

I No sequence in the code is an extension of another.
I What does 100111110010 spell?
I A coding scheme is equivalent to a twenty questions strategy.

Twenty questions theorem

I Noiseless coding theorem: Expected number of questions you need is always at least the entropy.
I Note: The expected number of questions is exactly the entropy if each question divides the space of possibilities exactly in half (measured by probability).
I In this case, let X take values x1, ..., xN with probabilities p(x1), ..., p(xN). Then if a valid coding of X assigns ni bits to xi, we have

      Σ_{i=1}^N ni p(xi) ≥ H(X) = -Σ_{i=1}^N p(xi) log p(xi).

I Data compression: Let X1, X2, ..., Xn be i.i.d. instances of X. Do there exist encoding schemes such that the expected number of bits required to encode the entire sequence is about H(X)n (assuming n is sufficiently large)?
I Yes. We can cut the space of N^n possibilities close to exactly in half at each stage (up till near the end, maybe).
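Because the second code is prefix-free (no codeword is a prefix of another), a bit string can be decoded greedily, left to right, which also settles the puzzle on the slide. A small sketch (the dictionary names are my own illustration):

```python
# The second code on the slide: A -> 0, B -> 10, C -> 110, D -> 111.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: letter for letter, bits in CODE.items()}

def decode(bitstring):
    out, buf = [], ""
    for bit in bitstring:
        buf += bit
        if buf in DECODE:          # a complete codeword was read
            out.append(DECODE[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits: " + buf)
    return "".join(out)

print(decode("100111110010"))  # BADCAB
```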

Entropy for a pair of random variables

I Consider random variables X, Y with joint mass function p(xi, yj) = P{X = xi, Y = yj}.
I Then we write

      H(X, Y) = -Σ_i Σ_j p(xi, yj) log p(xi, yj).

I H(X, Y) is just the entropy of the pair (X, Y) (viewed as a random variable itself).
I Claim: if X and Y are independent, then

      H(X, Y) = H(X) + H(Y).

  Why is that?

Conditional entropy

I Let's again consider random variables X, Y with joint mass function p(xi, yj) = P{X = xi, Y = yj} and write

      H(X, Y) = -Σ_i Σ_j p(xi, yj) log p(xi, yj).

I But now let's not assume they are independent.
I We can define a conditional entropy of X given Y = yj by

      H_{Y=yj}(X) = -Σ_i p(xi|yj) log p(xi|yj).

I This is just the entropy of the conditional distribution. Recall that p(xi|yj) = P{X = xi | Y = yj}.
I We similarly define H_Y(X) = Σ_j H_{Y=yj}(X) p_Y(yj). This is the expected amount of conditional entropy that there will be in X after we have observed Y.
Properties of conditional entropy

I Definitions: H_{Y=yj}(X) = -Σ_i p(xi|yj) log p(xi|yj) and H_Y(X) = Σ_j H_{Y=yj}(X) p_Y(yj).
I Important property one: H(X, Y) = H(Y) + H_Y(X).
I In words, the expected amount of information we learn when discovering (X, Y) is equal to the expected amount we learn when discovering Y plus the expected amount when we subsequently discover X (given our knowledge of Y).
I To prove this property, recall that p(xi, yj) = p_Y(yj) p(xi|yj).
I Thus,

      H(X, Y) = -Σ_i Σ_j p(xi, yj) log p(xi, yj)
              = -Σ_i Σ_j p_Y(yj) p(xi|yj) [log p_Y(yj) + log p(xi|yj)]
              = -Σ_j p_Y(yj) log p_Y(yj) Σ_i p(xi|yj) - Σ_j p_Y(yj) Σ_i p(xi|yj) log p(xi|yj)
              = H(Y) + H_Y(X).

I Important property two: H_Y(X) ≤ H(X), with equality if and only if X and Y are independent.
I In words, the expected amount of information we learn when discovering X after having discovered Y can't be more than the expected amount of information we would learn when discovering X before knowing anything about Y.
I Proof: note that E(p1, p2, ..., pn) := -Σ pi log pi is concave.
I The vector v = {p_X(x1), p_X(x2), ..., p_X(xn)} is a weighted average of the vectors vj := {p_X(x1|yj), p_X(x2|yj), ..., p_X(xn|yj)} as j ranges over possible values. By the (vector version of) Jensen's inequality,

      H(X) = E(v) = E(Σ_j p_Y(yj) vj) ≥ Σ_j p_Y(yj) E(vj) = H_Y(X).
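Both properties can be checked numerically on a small example. A sketch, using an arbitrary made-up joint mass function:

```python
import math

def H(probs):
    # Shannon entropy (base 2) of a list of probabilities; zeros skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A small joint mass function p(x, y), chosen arbitrarily for illustration.
joint = {(0, 0): 1/8, (0, 1): 3/8, (1, 0): 3/8, (1, 1): 1/8}

pY, pX = {}, {}
for (x, y), p in joint.items():
    pY[y] = pY.get(y, 0) + p
    pX[x] = pX.get(x, 0) + p

H_XY = H(list(joint.values()))
H_Y = H(list(pY.values()))
# H_Y(X) = sum over j of p_Y(y_j) times the entropy of X given Y = y_j.
H_Y_of_X = sum(
    pY[y] * H([joint[(x, yy)] / pY[y] for (x, yy) in joint if yy == y])
    for y in pY
)

print(H_XY, H_Y + H_Y_of_X)          # property one: these agree
print(H(list(pX.values())), H_Y_of_X)  # property two: H(X) >= H_Y(X)
```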
18.600: Lecture 34
Martingales and the optional stopping theorem

Scott Sheffield

MIT

Outline

Martingales and stopping times

Optional stopping theorem

Martingale definition

I Let S be a probability space.
I Let X0, X1, X2, ... be a sequence of random variables. Informally, we will imagine that we are acquiring information about S in a sequence of stages, and each Xj represents a quantity that is known to us at the jth stage.
I If Z is any random variable, we let E[Z|Fn] denote the conditional expectation of Z given all the information that is available to us at the nth stage. If we don't specify otherwise, we assume that this information consists precisely of the values X0, X1, ..., Xn, so that E[Z|Fn] = E[Z|X0, X1, ..., Xn]. (In some applications, one could imagine there are other things known as well at stage n.)
I We say the Xn sequence is a martingale if E[|Xn|] < ∞ for all n and E[Xn+1|Fn] = Xn for all n.
I Taking into account all the information I have at stage n, the expected value at stage n + 1 is the value at stage n.
Martingale definition

I Example: Imagine that Xn is the price of a stock on day n.
I Martingale condition: Expected value of the stock tomorrow, given all I know today, is the value of the stock today.
I Question: If you are given a mathematical description of a process X0, X1, X2, ... then how can you check whether it is a martingale?
I Consider all of the information that you know after having seen X0, X1, ..., Xn. Then try to figure out what additional (not yet known) randomness is involved in determining Xn+1. Use this to figure out the conditional expectation of Xn+1, and check to see whether this is necessarily equal to the known Xn value.

Martingale examples

I Suppose that A1, A2, ... are i.i.d. random variables each equal to 1 with probability .5 and -1 with probability .5. Let X0 = 0 and Xn = Σ_{i=1}^n Ai for n > 0. Is the Xn sequence a martingale?
I Answer: yes. To see this, note that E[Xn+1|Fn] = E[Xn + An+1|Fn] = E[Xn|Fn] + E[An+1|Fn], by additivity of conditional expectation (given Fn).
I Since Xn is known at stage n, we have E[Xn|Fn] = Xn. Since we know nothing more about An+1 at stage n than we originally knew, we have E[An+1|Fn] = 0. Thus E[Xn+1|Fn] = Xn.
I Informally, I'm just tossing a new fair coin at each stage to see if Xn goes up or down one step. If I know the information available up to stage n, and I know Xn = 10, then I see Xn+1 = 11 and Xn+1 = 9 as equally likely, so E[Xn+1|Fn] = 10 = Xn.

Another martingale example

I What if each Ai is 1.01 with probability .5 and .99 with probability .5 and we write X0 = 1 and Xn = Π_{i=1}^n Ai for n > 0? Then is Xn a martingale?
I Answer: yes. Note that E[Xn+1|Fn] = E[An+1 Xn|Fn]. At stage n, the value Xn is known, and hence can be treated as a known constant, which can be factored out of the expectation, i.e., E[An+1 Xn|Fn] = Xn E[An+1|Fn].
I Since I know nothing new about An+1 at stage n, we have E[An+1|Fn] = E[An+1] = 1. Hence E[An+1 Xn|Fn] = Xn.
I Informally, I'm just tossing a new fair coin at each stage to see if Xn goes up or down by a percentage point of its current value. If I know all the information available up to stage n, and I know Xn = 5, then I see Xn+1 = 5.05 and Xn+1 = 4.95 as equally likely, so E[Xn+1|Fn] = 5.
I Two classic martingale examples: sums of independent random variables (each with mean zero) and products of independent random variables (each with mean one).

Another example

I Suppose A is 1 with probability .5 and -1 with probability .5. Let X0 = 0 and write Xn = (-1)^n A for all n > 0.
I What is E[Xn], as a function of n?
I E[Xn] = 0 for all n.
I Does this mean that Xn is a martingale?
I No. If n ≥ 1, then given the information available up to stage n, I can figure out what A must be, and can hence deduce exactly what Xn+1 will be, and it is not the same as Xn. In particular, E[Xn+1|Fn] = -Xn ≠ Xn.
I Informally, Xn alternates between 1 and -1. Each time it hits 1, I know it will go back down to -1 on the next step.
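The two classic examples can be sanity-checked by simulation: for a martingale, the unconditional mean E[Xn] stays at X0 for every n. A minimal sketch (trial counts and seed are arbitrary choices of mine):

```python
import random

random.seed(0)
trials, n = 100_000, 10
sum_total, prod_total = 0.0, 0.0
for _ in range(trials):
    s, p = 0.0, 1.0
    for _ in range(n):
        a = random.choice([1, -1])
        s += a              # X_n = A_1 + ... + A_n (mean-zero increments)
        p *= 1 + 0.01 * a   # X_n = product of 1.01s and 0.99s (mean-one factors)
    sum_total += s
    prod_total += p

print(sum_total / trials)   # close to 0
print(prod_total / trials)  # close to 1
```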
Stopping time definition

I Let T be a non-negative integer valued random variable.
I Think of T as giving the time the asset will be sold if the price sequence is X0, X1, X2, ....
I Say that T is a stopping time if the event that T = n depends only on the values Xi for i ≤ n. In other words, the decision to sell at time n depends only on prices up to time n, not on (as yet unknown) future prices.

Stopping time examples

I Let A1, ... be i.i.d. random variables equal to 1 with probability .5 and -1 with probability .5 and let X0 = 0 and Xn = Σ_{i=1}^n Ai for n ≥ 0.
I Which of the following is a stopping time?
  1. The smallest T for which |XT| = 50
  2. The smallest T for which XT ∈ {-10, 100}
  3. The smallest T for which XT = 0.
  4. The T at which the Xn sequence achieves the value 17 for the 9th time.
  5. The value of T ∈ {0, 1, 2, ..., 100} for which XT is largest.
  6. The largest T ∈ {0, 1, 2, ..., 100} for which XT = 0.
I Answer: first four, not last two.



Optional stopping overview

I Doob's optional stopping time theorem is contained in many basic texts on probability and martingales. (See, for example, Theorem 10.10 of Probability with Martingales, by David Williams, 1991.)
I Essentially says that you can't make money (in expectation) by buying and selling an asset whose price is a martingale.
I Precisely, if you buy the asset at some time and adopt any strategy at all for deciding when to sell it, then the expected price at the time you sell is the price you originally paid.
I If market price is a martingale, you cannot make money in expectation by timing the market.

Doob's Optional Stopping Theorem: statement

I Doob's Optional Stopping Theorem: If the sequence X0, X1, X2, ... is a bounded martingale, and T is a stopping time, then the expected value of XT is X0.
I When we say the martingale is bounded, we mean that for some C, we have that with probability one |Xi| < C for all i.
I Why is this assumption necessary?
I Can we give a counterexample if boundedness is not assumed?
I The theorem can be proved by induction if the stopping time T is bounded. Unbounded T requires a limit argument. (This is where boundedness of the martingale is used.)

Martingales applied to finance

I Many asset prices are believed to behave approximately like martingales, at least in the short term.
I Efficient market hypothesis: new information is instantly absorbed into the stock value, so the expected value of the stock tomorrow should be the value today. (If it were higher, statistical arbitrageurs would bid up today's price until this was not the case.)
I But what about interest, risk premium, etc.?
I According to the fundamental theorem of asset pricing, the discounted price X(n)/A(n), where A is a risk-free asset, is a martingale with respect to risk neutral probability. More on this next lecture.

Martingales as successively revised best guesses

I The two-element sequence E[X], X is a martingale.
I In previous lectures, we interpreted the conditional expectation E[X|Y] as a random variable.
I Depends only on Y. Describes expectation of X given observed Y value.
I We showed E[E[X|Y]] = E[X].
I This means that the three-element sequence E[X], E[X|Y], X is a martingale.
I More generally if the Yi are any random variables, the sequence E[X], E[X|Y1], E[X|Y1, Y2], E[X|Y1, Y2, Y3], ... is a martingale.
Martingales as real-time subjective probability updates

I Ivan sees email from girlfriend with subject "some possibly serious news," thinks there's a 20 percent chance she'll break up with him by email's end. Revises number after each line:
I "Oh Ivan, I've missed you so much!" 12
I "I have something crazy to tell you," 24
I "and so sorry to do this by email. (Where's your phone!?)" 38
I "I've been spending lots of time with a guy named Robert," 52
I "a visiting database consultant on my project" 34
I "who seems very impressed by my work." 23
I "Robert wants me to join his startup in Palo Alto." 38
I "Exciting!!! Of course I said I'd have to talk to you first," 24
I "because you are absolutely my top priority in my life," 8
I "and you're stuck at MIT for at least three more years..." 11
I "but honestly, I'm just so confused on so many levels." 15
I "Call me!!! I love you! Alice" 0

More conditional probability martingale examples

I Example: let C be the amount of oil available for drilling under a particular piece of land. Suppose that ten geological tests are done that will ultimately determine the value of C. Let Cn be the conditional expectation of C given the outcome of the first n of these tests. Then the sequence C0, C1, C2, ..., C10 = C is a martingale.
I Let Ai be my best guess at the probability that a basketball team will win the game, given the outcome of the first i minutes of the game. Then (assuming some rationality of my personal probabilities) Ai is a martingale.
18.600: Lecture 35
Martingales and risk neutral probability

Scott Sheffield

MIT

Outline

Martingales and stopping times

Risk neutral probability and martingales

Recall martingale definition

I Let S be the probability space. Let X0, X1, X2, ... be a sequence of real random variables. Interpret Xi as the price of an asset at the ith time step.
I Say the Xn sequence is a martingale if E[|Xn|] < ∞ for all n and E[Xn+1|Fn] := E[Xn+1|X0, X1, X2, ..., Xn] = Xn for all n.
I Given all I know today, the expected price tomorrow is the price today.
I If you are given a mathematical description of a process X0, X1, X2, ... then how can you check whether it is a martingale?
I Consider all of the information that you know after having seen X0, X1, ..., Xn. Then try to figure out what additional (not yet known) randomness is involved in determining Xn+1. Use this to figure out the conditional expectation of Xn+1, and check to see whether this is always equal to the known Xn value.
Recall stopping time definition

I Let T be a non-negative integer valued random variable.
I Think of T as giving the time the asset will be sold if the price sequence is X0, X1, X2, ....
I Say that T is a stopping time if the event that T = n depends only on the values Xi for i ≤ n. In other words, the decision to sell at time n depends only on prices up to time n, not on (as yet unknown) future prices.

Examples

I Suppose that an asset price is a martingale that starts at 50 and changes by increments of ±1 at each time step. What is the probability that the price goes down to 40 before it goes up to 70?
I What is the probability that it goes down to 45, then up to 55, then down to 45, then up to 55 again, all before reaching either 0 or 100?

Martingales applied to finance

I Many asset prices are believed to behave approximately like martingales, at least in the short term.
I Efficient market hypothesis: new information is instantly absorbed into the stock value, so the expected value of the stock tomorrow should be the value today. (If it were higher, statistical arbitrageurs would bid up today's price until this was not the case.)
I But there are some caveats: interest, risk premium, etc.
I According to the fundamental theorem of asset pricing, the discounted price X(n)/A(n), where A is a risk-free asset, is a martingale with respect to risk neutral probability.

Risk neutral probability

I "Risk neutral probability" is a fancy term for "market probability." (The term "market probability" is arguably more descriptive.)
I That is, it is a probability measure that you can deduce by looking at prices on the market.
I For example, suppose somebody is about to shoot a free throw in basketball. What is the price in the sports betting world of a contract that pays one dollar if the shot is made?
I If the answer is .75 dollars, then we say that the risk neutral probability that the shot will be made is .75.
I Risk neutral probability is the probability determined by the market betting odds.

Risk neutral probability of outcomes known at fixed time T

I Risk neutral probability of event A: P_RN(A) denotes

      Price{Contract paying 1 dollar at time T if A occurs}
      -----------------------------------------------------------
      Price{Contract paying 1 dollar at time T no matter what}

I If the risk-free interest rate is constant and equal to r (compounded continuously), then the denominator is e^{-rT}.
I Assuming no arbitrage (i.e., no risk-free profit with zero upfront investment), P_RN satisfies the axioms of probability. That is, 0 ≤ P_RN(A) ≤ 1, and P_RN(S) = 1, and if events Aj are disjoint then P_RN(A1 ∪ A2 ∪ ...) = P_RN(A1) + P_RN(A2) + ...
I Arbitrage example: if A and B are disjoint and P_RN(A ∪ B) < P_RN(A) + P_RN(B) then we sell contracts paying 1 if A occurs and 1 if B occurs, buy the contract paying 1 if A ∪ B occurs, and pocket the difference.

Risk neutral probability vs. ordinary probability

I At first sight, one might think that P_RN(A) describes the market's best guess at the probability that A will occur.
I But suppose A is the event that the government is dissolved and all dollars become worthless. What is P_RN(A)?
I Should be 0. Even if people think A is likely, a contract paying a dollar when A occurs is worthless.
I Now, suppose there are only 2 outcomes: A is the event that the economy booms and everyone prospers, and B is the event that the economy sags and everyone is needy. Suppose the purchasing power of a dollar is the same in both scenarios. If people think A has a .5 chance to occur, do we expect P_RN(A) > .5 or P_RN(A) < .5?
I Answer: P_RN(A) < .5. People are risk averse. In the second scenario they need the money more.
Non-systemic event

I Suppose that A is the event that the Boston Red Sox win the World Series. Would we expect P_RN(A) to represent (the market's best assessment of) the probability that the Red Sox will win?
I Arguably yes. The amount that people in general need or value dollars does not depend much on whether A occurs (even though the financial needs of specific individuals may depend heavily on A).
I Even if some people bet based on loyalty, emotion, insurance against personal financial exposure to the team's prospects, etc., there will arguably be enough in-it-for-the-money statistical arbitrageurs to keep the price near a reasonable guess of what well-informed experts would consider the true probability.

Extensions of risk neutral probability

I The definition of risk neutral probability depends on the choice of currency (the so-called numeraire).
I In the 2016 presidential election, investors predicted the value of the Mexican peso (in US dollars) would be lower.
I Risk neutral probability can be defined for variable times and variable interest rates: e.g., one can take the numeraire to be the amount one dollar in a variable-interest-rate money market account has grown to when the outcome is known. Can define P_RN(A) to be the price of a contract paying this amount if and when A occurs.
I For simplicity, we focus on fixed time T and fixed interest rate r in this lecture.

Risk neutral probability is objective

I Check out binary prediction contracts at predictwise.com, oddschecker.com, predictit.com, etc. Many financial derivatives are essentially bets of this form.
I Unlike "true probability" (what does that mean?) the risk neutral probability is an objectively measurable price.
I Pundit: The market predictions are ridiculous. I can estimate probabilities much better than they can.
I Listener: Then why not make some bets and get rich? If your estimates are so much better, the law of large numbers says you'll surely come out way ahead eventually.
I Pundit: Well, you know... been busy... scruples about gambling... more to life than money...
I Listener: Yeah, that's what I thought.

Prices as expectations

I By assumption, the price of a contract that pays one dollar at time T if A occurs is P_RN(A)e^{-rT}.
I If A and B are disjoint, what is the price of a contract that pays 2 dollars if A occurs, 3 if B occurs, 0 otherwise?
I Answer: (2P_RN(A) + 3P_RN(B))e^{-rT}.
I Generally, in the absence of arbitrage, the price of a contract that pays X at time T should be E_RN(X)e^{-rT} where E_RN denotes expectation with respect to the risk neutral probability.
I Example: if a non-dividend paying stock will be worth X at time T, then its price today should be E_RN(X)e^{-rT}.
I The so-called fundamental theorem of asset pricing states that (assuming no arbitrage) interest-discounted asset prices are martingales with respect to risk neutral probability. The current price of the stock being E_RN(X)e^{-rT} follows from this.
18.600: Lecture 36
Risk Neutral Probability and Black-Scholes

Scott Sheffield

MIT

Outline

Black-Scholes

Call quotes and risk neutral probability

Overview

I The mathematics of today's lecture will not go far beyond things we know.
I The main mathematical tasks will be to compute expectations of functions of log-normal random variables (to get the Black-Scholes formula) and to differentiate under an integral (to compute risk neutral density functions from option prices).
I Will spend time giving financial interpretations of the math.
I Can interpret this lecture as a sophisticated story problem, illustrating an important application of the probability we have learned in this course (involving probability axioms, expectations, cumulative distribution functions, etc.)
I Brownian motion (as mathematically constructed by MIT professor Norbert Wiener) is a continuous time martingale.
I Black-Scholes theory assumes that the log of an asset price is a process called Brownian motion with drift with respect to risk neutral probability. This implies the option price formula.
Black-Scholes: main assumption and conclusion

I More famous MIT professors: Black, Scholes, Merton.
I 1997 Nobel Prize.
I Assumption: the log of an asset price X at fixed future time T is a normal random variable (call it N) with some known variance (call it Tσ²) and some mean (call it μ) with respect to risk neutral probability.
I Observation: N normal (μ, Tσ²) implies E[e^N] = e^{μ + Tσ²/2}.
I Observation: If X0 is the current price then

      X0 = E_RN[X]e^{-rT} = E_RN[e^N]e^{-rT} = e^{μ + (σ²/2 - r)T}.

I Observation: This implies μ = log X0 + (r - σ²/2)T.
I Conclusion: If g is any function then the price of a contract that pays g(X) at time T is

      E_RN[g(X)]e^{-rT} = E_RN[g(e^N)]e^{-rT}

  where N is normal with mean μ and variance Tσ².

Black-Scholes example: European call option

I A European call option on a stock at maturity date T, strike price K, gives the holder the right (but not obligation) to purchase a share of stock for K dollars at time T.

      "The document gives the bearer the right to purchase one share of MSFT from me on May 31 for 35 dollars. SS"

I If X is the value of the stock at T, then the value of the option at time T is given by g(X) = max{0, X - K}.
I Black-Scholes: the price of a contract paying g(X) at time T is E_RN[g(X)]e^{-rT} = E_RN[g(e^N)]e^{-rT} where N is normal with variance Tσ² and mean μ = log X0 + (r - σ²/2)T.
I Write this as

      e^{-rT} E_RN[max{0, e^N - K}] = e^{-rT} E_RN[(e^N - K) 1_{N ≥ log K}]
                                    = (e^{-rT}/(σ√(2πT))) ∫_{log K}^∞ e^{-(x-μ)²/(2Tσ²)} (e^x - K) dx.

The famous formula

I Let T be time to maturity, X0 the current price of the underlying asset, K the strike price, r the risk-free interest rate, and σ the volatility.
I We need to compute

      e^{-rT} (1/(σ√(2πT))) ∫_{log K}^∞ e^{-(x-μ)²/(2Tσ²)} (e^x - K) dx

  where μ = rT + log X0 - Tσ²/2.
I Can use complete-the-square tricks to compute the two terms explicitly in terms of the standard normal cumulative distribution function Φ.
I Price of European call is Φ(d1)X0 - Φ(d2)Ke^{-rT} where

      d1 = [ln(X0/K) + (r + σ²/2)T] / (σ√T)   and   d2 = [ln(X0/K) + (r - σ²/2)T] / (σ√T).
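The formula is short enough to implement directly. Here is a sketch using `math.erf` for the standard normal CDF, together with a Monte Carlo check that it matches the risk neutral expectation it is supposed to equal (the parameter values are arbitrary illustrations):

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bs_call(X0, K, r, sigma, T):
    """Black-Scholes price of a European call: Phi(d1) X0 - Phi(d2) K e^{-rT}."""
    d1 = (math.log(X0 / K) + (r + sigma**2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return norm_cdf(d1) * X0 - norm_cdf(d2) * K * math.exp(-r * T)

price = bs_call(X0=100, K=100, r=0.05, sigma=0.2, T=1)
print(price)  # about 10.45

# Monte Carlo check: e^{-rT} E_RN[max(0, e^N - K)] with N normal,
# mean mu = log X0 + (r - sigma^2/2) T and variance T sigma^2.
random.seed(2)
mu = math.log(100) + (0.05 - 0.2**2 / 2) * 1
mc = math.exp(-0.05) * sum(
    max(math.exp(random.gauss(mu, 0.2)) - 100, 0) for _ in range(200_000)
) / 200_000
print(mc)  # close to the closed-form price
```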
Determining risk neutral probability from call quotes

I If C(K) is the price of a European call with strike price K and f = f_X is the risk neutral probability density function for X at time T, then

      C(K) = e^{-rT} ∫ f(x) max{0, x - K} dx.

I Differentiating under the integral, we find that

      e^{rT} C'(K) = -∫ f(x) 1_{x>K} dx = -P_RN{X > K} = F_X(K) - 1,

      e^{rT} C''(K) = f(K).

I We can look up C(K) for a given stock symbol (say GOOG) and expiration time T at cboe.com and work out approximately what F_X and hence f_X must be.
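The relation f(K) = e^{rT} C''(K) can be tested numerically: price calls from the Black-Scholes formula, take a finite-difference second derivative in K, and compare with the lognormal density the model assumes. A sketch (parameters are arbitrary):

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bs_call(X0, K, r, sigma, T):
    d1 = (math.log(X0 / K) + (r + sigma**2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return norm_cdf(d1) * X0 - norm_cdf(d2) * K * math.exp(-r * T)

X0, r, sigma, T = 100.0, 0.05, 0.2, 1.0
K, h = 100.0, 0.5

# f(K) = e^{rT} C''(K), estimated by a central second difference in K:
second_diff = (bs_call(X0, K + h, r, sigma, T)
               - 2 * bs_call(X0, K, r, sigma, T)
               + bs_call(X0, K - h, r, sigma, T)) / h**2
density_from_calls = math.exp(r * T) * second_diff

# Compare with the lognormal density of X = e^N that Black-Scholes assumes:
mu = math.log(X0) + (r - sigma**2 / 2) * T
lognormal_pdf = math.exp(-(math.log(K) - mu)**2 / (2 * sigma**2 * T)) / (
    K * sigma * math.sqrt(2 * math.pi * T))

print(density_from_calls, lognormal_pdf)  # the two agree closely
```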

Perspective: implied volatility

I Risk neutral probability densities derived from call quotes are not quite lognormal in practice. Tails are too fat. The main Black-Scholes assumption is only approximately correct.
I Implied volatility is the value of σ that (when plugged into the Black-Scholes formula along with known parameters) predicts the current market price.
I If Black-Scholes were completely correct, then given a stock and an expiration date, the implied volatility would be the same for all strike prices. In practice, when the implied volatility is viewed as a function of strike price (sometimes called the volatility smile), it is not constant.

Perspective: why is Black-Scholes not exactly right?

I Main Black-Scholes assumption: risk neutral probability densities are lognormal.
I Heuristic support for this assumption: If the price goes up 1 percent or down 1 percent each day (with no interest) then the risk neutral probability must be .5 for each (independently of previous days). The central limit theorem gives log normality for large T.
I Replicating portfolio point of view: in the simple binary tree models (or continuum Brownian models), we can transfer money back and forth between the stock and the risk-free asset to ensure our wealth at time T equals the option payout. The option price is the required initial investment, which is the risk neutral expectation of the payout. "True" probabilities are irrelevant.
I Where arguments for the assumption break down: Fluctuation sizes vary from day to day. Prices can have big jumps.
I Fixes: variable volatility, random interest rates, Lévy jumps....
18.600: Lecture 37
Review: practice problems

Scott Sheffield

MIT

Expectation and variance

I Eight athletic teams are ranked 1 through 8 after season one, and ranked 1 through 8 again after season two. Assume that each set of rankings is chosen uniformly from the set of 8! possible rankings and that the two rankings are independent. Let N be the number of teams whose rank does not change from season one to season two. Let N+ be the number of teams whose rank improves by exactly two spots. Let N- be the number whose rank declines by exactly two spots. Compute the following:
  I E[N], E[N+], and E[N-]
  I Var[N]
  I Var[N+]
Expectation and variance answers

I Let Ni be 1 if the team ranked ith in the first season remains ith in the second season. Then E[N] = E[Σ_{i=1}^8 Ni] = 8 · (1/8) = 1. Similarly, E[N+] = E[N-] = 6 · (1/8) = 3/4.
I Var[N] = E[N²] - E[N]² and E[N²] = E[Σ_{i=1}^8 Σ_{j=1}^8 Ni Nj] = 8 · (1/8) + 56 · (1/56) = 2, so Var[N] = 2 - 1 = 1.
I Let N+^i be 1 if the team ranked ith has its rank improve to (i-2)th in the second season. Then E[(N+)²] = E[Σ_{i=3}^8 Σ_{j=3}^8 N+^i N+^j] = 6 · (1/8) + 30 · (1/56) = 9/7, so Var[N+] = 9/7 - (3/4)².

Conditional distributions

I Roll ten dice. Find the conditional probability that there are exactly 4 ones, given that there are exactly 4 sixes.
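These answers are easy to confirm by Monte Carlo with random permutations (trial count and seed are arbitrary choices of mine):

```python
import random

random.seed(3)
trials = 100_000
sum_N = sum_N2 = sum_Nplus = 0
for _ in range(trials):
    s1 = list(range(8))   # s1[i], s2[i]: team i's rank in each season
    s2 = list(range(8))
    random.shuffle(s1)
    random.shuffle(s2)
    N = sum(a == b for a, b in zip(s1, s2))
    Nplus = sum(b == a - 2 for a, b in zip(s1, s2))  # rank improves by 2
    sum_N += N
    sum_N2 += N * N
    sum_Nplus += Nplus

mean_N = sum_N / trials
var_N = sum_N2 / trials - mean_N**2
print(mean_N, var_N, sum_Nplus / trials)  # about 1, 1, 0.75
```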
Conditional distributions answers

I Straightforward approach: P(A|B) = P(AB)/P(B).
I Numerator: P(exactly 4 ones and exactly 4 sixes) = C(10,4) C(6,4) 4² / 6^10. Denominator: P(exactly 4 sixes) = C(10,4) 5^6 / 6^10.
I The ratio is C(6,4) 4²/5^6 = C(6,4) (1/5)^4 (4/5)^2.
I Alternate solution: first condition on the locations of the 6s and then use the binomial theorem.

Poisson point processes

I Suppose that in a certain town earthquakes are a Poisson point process, with an average of one per decade, and volcano eruptions are an independent Poisson point process, with an average of two per decade. Let V be the length of time (in decades) until the first volcano eruption and E the length of time (in decades) until the first earthquake. Compute the following:
  I E[E²] and Cov[E, V].
  I The expected number of calendar years, in the next decade (ten calendar years), that have no earthquakes and no volcano eruptions.
  I The probability density function of min{E, V}.

Poisson point processes answers

I E[E²] = 2 and Cov[E, V] = 0.
I The probability of no earthquake or eruption in the first year is e^{-(2+1)·(1/10)} = e^{-.3} (see next part). The same holds for any year by the memoryless property. The expected number of quake/eruption-free years is 10e^{-.3} ≈ 7.4.
I The probability density function of min{E, V} is 3e^{-3x} for x ≥ 0, and 0 for x < 0.
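The last answer reflects that the minimum of independent exponentials with rates 1 (earthquake) and 2 (volcano) is exponential with rate 3, hence has mean 1/3 of a decade. A quick simulation sketch:

```python
import random

random.seed(4)
trials = 200_000
total = sum(min(random.expovariate(1), random.expovariate(2))
            for _ in range(trials))
print(total / trials)  # close to 1/3, the mean of an Exp(3) variable
```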
18.600: Lecture 38
Review: practice problems

Scott Sheffield

MIT

Order statistics

I Let X be a uniformly distributed random variable on [-1, 1].
I Compute the variance of X².
I If X1, ..., Xn are independent copies of X, what is the probability density function for the smallest of the Xi?

Order statistics answers

I
      Var[X²] = E[X⁴] - (E[X²])²
              = ∫_{-1}^1 (1/2) x⁴ dx - (∫_{-1}^1 (1/2) x² dx)² = 1/5 - 1/9 = 4/45.

I Note that for x ∈ [-1, 1] we have

      P{X > x} = ∫_x^1 (1/2) dy = (1 - x)/2.

  If x ∈ [-1, 1], then

      P{min{X1, ..., Xn} > x} = P{X1 > x, X2 > x, ..., Xn > x} = ((1 - x)/2)^n.

  So the density function is

      -∂/∂x ((1 - x)/2)^n = (n/2) ((1 - x)/2)^{n-1}.

Moment generating functions

I Suppose that the Xi are independent copies of a random variable X. Let M_X(t) be the moment generating function for X. Compute the moment generating function for the average Σ_{i=1}^n Xi / n in terms of M_X(t) and n.
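Both answers can be checked by simulation; for n = 3 the density above gives E[min] = 2/(n+1) - 1 = -1/2. A sketch (sample sizes and seed are arbitrary):

```python
import random

random.seed(5)
trials = 100_000

# Var[X^2] for X uniform on [-1, 1]: should be 4/45.
sq = [random.uniform(-1, 1) ** 2 for _ in range(trials)]
mean_sq = sum(sq) / trials
var_sq = sum(v * v for v in sq) / trials - mean_sq**2

# Mean of the minimum of n = 3 independent uniforms on [-1, 1].
mean_min = sum(min(random.uniform(-1, 1) for _ in range(3))
               for _ in range(trials)) / trials

print(var_sq)    # about 4/45 = 0.0889
print(mean_min)  # about -0.5
```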
Moment generating functions answers

I Write Y = Σ_{i=1}^n Xi / n. Then

      M_Y(t) = E[e^{tY}] = E[e^{(t/n) Σ_{i=1}^n Xi}] = (M_X(t/n))^n.

Entropy

I Suppose X and Y are independent random variables, each equal to 1 with probability 1/3 and equal to 2 with probability 2/3.
I Compute the entropy H(X).
I Compute H(X + Y).
I Which is larger, H(X + Y) or H(X, Y)? Would the answer to this question be the same for any discrete random variables X and Y? Explain.

Entropy answers

I H(X) = (1/3)(-log 1/3) + (2/3)(-log 2/3).
I H(X + Y) = (1/9)(-log 1/9) + (4/9)(-log 4/9) + (4/9)(-log 4/9).
I H(X, Y) is larger, and we have H(X, Y) ≥ H(X + Y) for any X and Y. To see why, write a(x, y) = P{X = x, Y = y} and b(x, y) = P{X + Y = x + y}. Then a(x, y) ≤ b(x, y) for any x and y, so

      H(X, Y) = E[-log a(X, Y)] ≥ E[-log b(X, Y)] = H(X + Y).
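A numerical check of the three quantities (with log = log_2, as in the entropy lecture):

```python
import math
from itertools import product

def H(probs):
    # Shannon entropy (base 2); zero-probability terms are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

pX = {1: 1/3, 2: 2/3}
joint = {(x, y): pX[x] * pX[y] for x, y in product(pX, pX)}
pSum = {}
for (x, y), p in joint.items():
    pSum[x + y] = pSum.get(x + y, 0) + p

print(H(pX.values()))     # H(X)
print(H(pSum.values()))   # H(X+Y): distribution (1/9, 4/9, 4/9)
print(H(joint.values()))  # H(X,Y) = 2 H(X) by independence; the largest
```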
18.600: Lecture 39
Review: practice problems

Scott Sheffield

MIT

Markov chains

I Alice and Bob share a home with a bathroom, a walk-in closet, and 2 towels.
I Each morning a fair coin decides which of the two showers first.
I After Bob showers, if there is at least one towel in the bathroom, Bob uses the towel and leaves it draped over a chair in the walk-in closet. If there is no towel in the bathroom, Bob grumpily goes to the walk-in closet, dries off there, and leaves the towel in the walk-in closet.
I When Alice showers, she first checks to see if at least one towel is present. If a towel is present, she dries off with that towel and returns it to the bathroom towel rack. Otherwise, she cheerfully retrieves both towels from the walk-in closet, then showers, dries off and leaves both towels on the rack.
I Problem: describe the towel-distribution evolution as a Markov chain and determine (over the long term) on what fraction of days Bob emerges from the shower to find no towel.

Markov chains answers

I Let states 0, 1, 2 denote the number of towels in the bathroom.
I Shower state change for Bob: 2 -> 1, 1 -> 0, 0 -> 0.
I Shower state change for Alice: 2 -> 2, 1 -> 1, 0 -> 2.
I Morning state change AB (Alice then Bob): 2 -> 1, 1 -> 0, 0 -> 1.
I Morning state change BA (Bob then Alice): 2 -> 1, 1 -> 2, 0 -> 2.
I Markov chain matrix:

      M = \begin{pmatrix} 0 & .5 & .5 \\ .5 & 0 & .5 \\ 0 & 1 & 0 \end{pmatrix}

I The row vector \pi such that \pi M = \pi (with components of \pi summing to one) is \pi = (\frac{2}{9}, \frac{4}{9}, \frac{1}{3}).
I Bob finds no towel only if the morning starts in state zero and Bob goes first. Over the long term, Bob finds no towel a \frac{2}{9} \cdot \frac{1}{2} = \frac{1}{9} fraction of the time.

Optional stopping, martingales, central limit theorem

Suppose that X_1, X_2, X_3, \ldots is an infinite sequence of independent random variables which are each equal to 1 with probability 1/2 and -1 with probability 1/2. Let Y_n = \sum_{i=1}^n X_i. Answer the following:

I What is the probability that Y_n reaches 25 before the first time that it reaches -5?
I Use the central limit theorem to approximate the probability that Y_{9000000} is greater than 6000.
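The stationary distribution claimed in the Markov chains answers can be cross-checked by power iteration on the transition matrix M given there. This is an illustrative sketch (not from the slides); the chain is aperiodic, so repeated multiplication converges to the stationary row vector:

```python
# Transition matrix from the Markov chains answers (rows/cols = towel states 0, 1, 2)
M = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]

def step(pi, M):
    """One update pi -> pi M for a row vector pi."""
    return [sum(pi[i] * M[i][j] for i in range(3)) for j in range(3)]

pi = [1/3, 1/3, 1/3]             # any starting distribution works
for _ in range(200):             # power iteration converges to the stationary pi
    pi = step(pi, M)

print(pi)                        # approaches (2/9, 4/9, 1/3)
bob_no_towel = pi[0] * 0.5       # state 0 and Bob showers first
print(bob_no_towel)              # approaches 1/9
```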
Optional stopping, martingales, central limit theorem answers

I By optional stopping, p_{25} \cdot 25 + p_{-5} \cdot (-5) = 0 and p_{25} + p_{-5} = 1. Solving, we obtain p_{25} = 1/6 and p_{-5} = 5/6.
I One standard deviation is \sqrt{9000000} = 3000. We want the probability of being 2 standard deviations above the mean. This should be about \int_2^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx.

Martingales

I Let X_i be independent random variables with mean zero. In which of the cases below is the sequence Y_i necessarily a martingale?
I Y_n = \sum_{i=1}^n i X_i
I Y_n = \sum_{i=1}^n X_i^2 - n
I Y_n = \prod_{i=1}^n (1 + X_i)
I Y_n = \prod_{i=1}^n (X_i - 1)
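Both optional-stopping answers above can be cross-checked numerically (an illustrative sketch, not from the slides). The hitting probability h(k) = P{reach 25 before -5 from level k} satisfies h(k) = (h(k-1) + h(k+1))/2 with boundary values h(-5) = 0 and h(25) = 1, which can be solved by simple relaxation; the normal tail comes from the error function:

```python
import math

# Solve the harmonic equations for the hitting probability by relaxation sweeps.
lo, hi = -5, 25
h = {k: 0.0 for k in range(lo, hi + 1)}
h[hi] = 1.0
for _ in range(20000):
    for k in range(lo + 1, hi):
        h[k] = 0.5 * (h[k - 1] + h[k + 1])
print(h[0])                      # approaches 1/6

# CLT tail: P{Y_9000000 > 6000} is roughly P{Z > 2} for Z standard normal
tail = 0.5 * math.erfc(2 / math.sqrt(2))
print(tail)                      # about 0.0228
```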

Martingales answers

I Yes, no, yes, no. (A sum of independent mean-zero increments is a martingale, as is a product of independent mean-one factors; \sum X_i^2 - n would additionally require E[X_i^2] = 1, and E[X_i - 1] = -1, not 1.)

Calculations like those needed for Black-Scholes derivation

I Let X be a normal random variable with mean 0 and variance 1. Compute the following (you may use the function \Phi(a) := \int_{-\infty}^a \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx in your answers):
I E[e^{3X-3}].
I E[e^X 1_{X \in (a,b)}] for fixed constants a < b.
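A constant expectation is a necessary (though not sufficient) condition for a martingale, so one way to see which candidates fail is to compute E[Y_n] exactly for a mean-zero X whose variance is not 1. The enumeration below uses X = -1 with probability 2/3 and X = 2 with probability 1/3 (an illustrative choice, not from the slides):

```python
import itertools, math

# X takes value -1 w.p. 2/3 and value 2 w.p. 1/3: mean zero, but E[X^2] = 2 != 1.
outcomes = [(-1, 2/3), (2, 1/3)]
n = 3

def expectation(f):
    """E[f(X_1, ..., X_n)] by enumerating all outcome sequences."""
    total = 0.0
    for seq in itertools.product(outcomes, repeat=n):
        xs = [x for x, _ in seq]
        p = 1.0
        for _, q in seq:
            p *= q
        total += p * f(xs)
    return total

e_weighted_sum = expectation(lambda xs: sum((i + 1) * x for i, x in enumerate(xs)))
e_sq_minus_n   = expectation(lambda xs: sum(x * x for x in xs) - n)
e_prod_1plus   = expectation(lambda xs: math.prod(1 + x for x in xs))
e_prod_minus1  = expectation(lambda xs: math.prod(x - 1 for x in xs))

# all four start at 0, 0, 1, 1 respectively at n = 0; only the
# first and third keep that expectation, matching "yes, no, yes, no"
print(e_weighted_sum, e_sq_minus_n, e_prod_1plus, e_prod_minus1)
```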
Calculations like those needed for Black-Scholes derivation answers

E[e^{3X-3}] = \int_{-\infty}^\infty e^{3x-3} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx
            = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 - 6x + 6}{2}}\,dx
            = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 - 6x + 9}{2}} e^{3/2}\,dx
            = e^{3/2} \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-3)^2}{2}}\,dx
            = e^{3/2}.

E[e^X 1_{X \in (a,b)}] = \int_a^b e^x \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx
                       = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 - 2x}{2}}\,dx
                       = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2 - 2x + 1 - 1}{2}}\,dx
                       = e^{1/2} \int_a^b \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-1)^2}{2}}\,dx
                       = e^{1/2} \int_{a-1}^{b-1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx
                       = e^{1/2} (\Phi(b-1) - \Phi(a-1)).
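Both closed forms can be verified by direct numerical integration against the standard normal density (an illustrative sketch, not from the slides; Phi is implemented here via the error function, and a midpoint Riemann sum stands in for the integrals):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(a):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def riemann(f, lo, hi, steps=200_000):
    """Composite midpoint rule on [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx

# E[e^{3X-3}] should equal e^{3/2}; the integrand concentrates near x = 3
val1 = riemann(lambda x: math.exp(3 * x - 3) * phi(x), -10, 16)
print(val1, math.exp(1.5))

# E[e^X 1_{X in (a,b)}] should equal e^{1/2} (Phi(b-1) - Phi(a-1))
a, b = -1.0, 2.0
val2 = riemann(lambda x: math.exp(x) * phi(x), a, b)
print(val2, math.exp(0.5) * (Phi(b - 1) - Phi(a - 1)))
```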

If you want more probability and statistics...

I UNDERGRADUATE:
  (a) 18.615 Introduction to Stochastic Processes
  (b) 18.642 Topics in Math with Applications in Finance
  (c) 18.650 Statistics for Applications
I GRADUATE LEVEL PROBABILITY:
  (a) 18.175 Theory of Probability
  (b) 18.176 Stochastic calculus
  (c) 18.177 Topics in stochastic processes (topics vary; repeatable, offered twice next year)
I GRADUATE LEVEL STATISTICS:
  (a) 18.655 Mathematical statistics
  (b) 18.657 Topics in statistics (topics vary; topic this year was machine learning; repeatable)
I OUTSIDE OF MATH DEPARTMENT:
  (a) Look up the new MIT minor in statistics and data sciences.
  (b) Look up the long list of probability/statistics courses (about 78 total) at https://stat.mit.edu/academics/subjects/
  (c) Ask other MIT faculty how they use probability and statistics in their research.

Thanks for taking the course!

I Considering previous generations of mathematically inclined MIT students, and adopting a frequentist point of view...
I You will probably do some important things with your lives.
I I hope your probabilistic shrewdness serves you well.
I Thinking more short term...
I Happy exam day!
I And may the odds be ever in your favor.
