
Introduction

Class 1, 18.05

Jeremy Orloff and Jonathan Bloom

1 Probability vs. Statistics

In this introduction we will preview what we will be studying in 18.05. Don’t worry if many
of the terms are unfamiliar; they will be explained as the course proceeds.
Probability and statistics are deeply connected because all statistical statements are at bottom
statements about probability. Despite this, the two sometimes feel like very different
subjects. Probability is logically self-contained; there are a few rules and answers all follow
logically from the rules, though computations can be tricky. In statistics we apply probability
to draw conclusions from data. This can be messy and usually involves as much art
as science.
Probability example
You have a fair coin (equal probability of heads or tails). You will toss it 100 times. What
is the probability of 60 or more heads? There is only one answer (about 0.028444) and we
will learn how to compute it.
Statistics example
You have a coin of unknown provenance. To investigate whether it is fair you toss it 100
times and count the number of heads. Let’s say you count 60 heads. Your job as a
statistician is to draw a conclusion (inference) from this data. There are many ways to proceed,
both in terms of the form the conclusion takes and the probability computations used to
justify the conclusion. In fact, different statisticians might draw different conclusions.
Note that in the first example the random process is fully known (probability of heads =
.5). The objective is to find the probability of a certain outcome (at least 60 heads) arising
from the random process. In the second example, the outcome is known (60 heads) and the
objective is to illuminate the unknown random process (the probability of heads).

2 Frequentist vs. Bayesian Interpretations

There are two prominent and sometimes conflicting schools of statistics: Bayesian and
frequentist. Their approaches are rooted in differing interpretations of the meaning of
probability.
Frequentists say that probability measures the frequency of various outcomes of an
experiment. For example, saying a fair coin has a 50% probability of heads means that if we
toss it many times then we expect about half the tosses to land heads.
Bayesians say that probability is an abstract concept that measures a state of knowledge
or a degree of belief in a given proposition. In practice Bayesians do not assign a single
value for the probability of a coin coming up heads. Rather they consider a range of values
each with its own probability of being true.
In 18.05 we will study and compare these approaches. The frequentist approach has long


been dominant in fields like biology, medicine, public health and social sciences. The
Bayesian approach has enjoyed a resurgence in the era of powerful computers and big
data. It is especially useful when incorporating new data into an existing statistical model,
for example, when training a speech or face recognition system. Today, statisticians are
creating powerful tools by using both approaches in complementary ways.

3 Applications, Toy Models, and Simulation

Probability and statistics are used widely in the physical sciences, engineering, medicine, the
social sciences, the life sciences, economics and computer science. The list of applications is
essentially endless: tests of one medical treatment against another (or a placebo), measures
of genetic linkage, the search for elementary particles, machine learning for vision or speech,
gambling probabilities and strategies, climate modeling, economic forecasting, epidemiology,
marketing, googling... We will draw on examples from many of these fields during this
course.
Given so many exciting applications, you may wonder why we will spend so much time
thinking about toy models like coins and dice. By understanding these thoroughly we will
develop a good feel for the simple essence inside many complex real-world problems. In
fact, the modest coin is a realistic model for any situation with two possible outcomes:
success or failure of a treatment, an airplane engine, a bet, or even a class.
Sometimes a problem is so complicated that the best way to understand it is through
computer simulation. Here we use software to run virtual experiments many times in order
to estimate probabilities. In this class we will use R for simulation as well as computation
and visualization. Don’t worry if you’re new to R; we will teach you all you need to know.
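
As a small taste of what this looks like, here is a minimal R sketch (R being the software mentioned above) that estimates the answer to the earlier probability example, 60 or more heads in 100 tosses of a fair coin. The seed and the number of virtual experiments are arbitrary choices for illustration.

    # Estimate P(60 or more heads in 100 tosses of a fair coin) by simulation.
    # The exact answer quoted earlier is about 0.028444.
    set.seed(1)                    # arbitrary seed, for reproducibility
    ntrials <- 100000              # number of virtual experiments
    heads <- rbinom(ntrials, size = 100, prob = 0.5)  # heads in each experiment
    mean(heads >= 60)              # fraction of experiments with 60 or more heads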
Counting and Sets

Class 1, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Know the definitions and notation for sets, intersection, union, complement.

2. Be able to visualize set operations using Venn diagrams.

3. Understand how counting is used in computing probabilities.

4. Be able to use the rule of product, inclusion-exclusion principle, permutations and combinations to count the elements in a set.

2 Counting

2.1 Motivating questions

Example 1. A coin is fair if it comes up heads or tails with equal probability.

You flip a fair coin three times. What is the probability that exactly one of the flips results

in a head?

answer: With three flips, we can easily list the eight possible outcomes:

{T T T, T T H, T HT, T HH, HT T, HT H, HHT, HHH}

Three of these outcomes have exactly one head:

{T T H, T HT, HT T }

Since all outcomes are equally probable, we have


P(1 head in 3 flips) = (number of outcomes with 1 head)/(total number of outcomes) = 3/8.

Think: Would listing the outcomes be practical with 10 flips?
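
For comparison, here is a short R sketch (R is the software the course uses for simulation) that lists the eight outcomes and counts the ones with exactly one head; the same approach with 10 flips would already produce 2^10 = 1024 rows.

    # List all 2^3 = 8 outcomes of three flips and count those with exactly one head.
    outcomes <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"), flip3 = c("H", "T"),
                            stringsAsFactors = FALSE)
    nheads <- rowSums(outcomes == "H")       # number of heads in each outcome
    sum(nheads == 1) / nrow(outcomes)        # 3/8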

A deck of 52 cards has 13 ranks (2, 3, . . . , 9, 10, J, Q, K, A) and 4 suits (♥, ♠, ♦, ♣). A
poker hand consists of 5 cards. A one-pair hand consists of two cards having one rank and
three cards having three other ranks, e.g., {2♥, 2♠, 5♥, 8♣, K♦}.

Test your intuition: the probability of a one-pair hand is:


(a) less than 5%
(b) between 5% and 10%
(c) between 10% and 20%


(d) between 20% and 40%


(e) greater than 40%

At this point we can only guess the probability. One of our goals is to learn how to compute
it exactly. To start, we note that since every set of five cards is equally probable, we can
compute the probability of a one-pair hand as
P(one-pair) = (number of one-pair hands)/(total number of hands)
So, to find the exact probability, we need to count the number of elements in each of these
sets. And we have to be clever about it, because there are too many elements to simply
list them all. We will come back to this problem after we have learned some counting
techniques.
Several times already we have noted that all the possible outcomes were equally probable
and used this to find a probability by counting. Let’s state this carefully in the following
principle.
Principle: Suppose there are n possible outcomes for an experiment and each is equally
probable. If there are k desirable outcomes then the probability of a desirable outcome is
k/n. Of course we could replace the word desirable by any other descriptor: undesirable,
funny, interesting, remunerative, . . .
Concept question: Can you think of a scenario where the possible outcomes are not
equally probable?
Here’s one scenario: on an exam you can get any score from 0 to 100. That’s 101 different
possible outcomes. Is the probability you get less than 50 equal to 50/101?

2.2 Sets and notation

Our goal is to learn techniques for counting the number of elements of a set, so we start
with a brief review of sets. (If this is new to you, please come to office hours).

2.2.1 Definitions

A set S is a collection of elements. We use the following notation.

Element: We write x ∈ S to mean the element x is in the set S.

Subset: We say the set A is a subset of S if all of its elements are in S. We write this as

A ⊂ S.
Complement: The complement of A in S is the set of elements of S that are not in A.

We write this as Ac or S − A.

Union: The union of A and B is the set of all elements in A or B (or both). We write this

as A ∪ B.

Intersection: The intersection of A and B is the set of all elements in both A and B. We

write this as A ∩ B.

Empty set: The empty set is the set with no elements. We denote it ∅.


Disjoint: A and B are disjoint if they have no common elements. That is, if A ∩ B = ∅.
Difference: The difference of A and B is the set of elements in A that are not in B. We
write this as A − B.

Let’s illustrate these operations with a simple example.


Example 2. Start with a set of 10 animals
S = {Antelope, Bee, Cat, Dog, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
Consider two subsets:
M = the animal is a mammal = {Antelope, Cat, Dog, Elephant, Hyena, Jaguar}
W = the animal lives in the wild = {Antelope, Bee, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
Our goal here is to look at different set operations.

Intersection: M ∩ W contains all wild mammals: M ∩ W = {Antelope, Elephant, Hyena, Jaguar}.

Union: M ∪ W contains all animals that are mammals or wild (or both).

M ∪ W = {Antelope, Bee, Cat, Dog, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.

Complement: M c means everything that is not in M , i.e. not a mammal. M c =

{Bee, Frog, Gnat, Iguana}.

Difference: M − W means everything that’s in M and not in W .

So, M − W = {Cat, Dog}.

There are often many ways to get the same set, e.g. Mc = S − M, M − W = M ∩ Wc.

The relationship between union, intersection, and complement is given by DeMorgan’s laws:
(A ∪ B)c = Ac ∩ B c
(A ∩ B)c = Ac ∪ B c
In words the first law says everything not in (A or B) is the same set as everything that’s
(not in A) and (not in B). The second law is similar.

2.2.2 Venn Diagrams

Venn diagrams offer an easy way to visualize set operations.


In all the figures S is the region inside the large rectangle, L is the region inside the left
circle and R is the region inside the right circle. The shaded region shows the set indicated
underneath each figure.

[Four Venn diagrams over S with circles L and R, shaded to show L ∪ R, L ∩ R, Lc, and L − R.]



Proof of DeMorgan’s Laws

[Venn diagrams: shading (L ∪ R)c gives the same region as Lc ∩ Rc, and shading (L ∩ R)c gives the same region as Lc ∪ Rc.]

Example 3. Verify DeMorgan’s laws for the subsets A = {1, 2, 3} and B = {3, 4} of the
set S = {1, 2, 3, 4, 5}.
answer: For each law we just work through both sides of the equation and show they are
the same.
1. (A ∪ B)c = Ac ∩ B c :

Left hand side: A ∪ B = {1, 2, 3, 4} ⇒ (A ∪ B)c = {5}.

Right hand side: Ac = {4, 5}, Bc = {1, 2, 5} ⇒ Ac ∩ Bc = {5}.

The two sides are equal. QED


2. (A ∩ B)c = Ac ∪ B c :
Left hand side: A ∩ B = {3} ⇒ (A ∩ B)c = {1, 2, 4, 5}.
Right hand side: Ac = {4, 5}, Bc = {1, 2, 5} ⇒ Ac ∪ Bc = {1, 2, 4, 5}.
The two sides are equal. QED
Think: Draw and label a Venn diagram with A the set of Brain and Cognitive Science
majors and B the set of sophomores. Shade the region illustrating the first law. Can you
express the first law in this case as a non-technical English sentence?
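
A quick check of Example 3 can also be done in R with the base set functions union, intersect and setdiff (a sketch, using Ac = S − A as above):

    # Verify DeMorgan's laws for A = {1,2,3}, B = {3,4} inside S = {1,2,3,4,5}.
    S <- 1:5; A <- c(1, 2, 3); B <- c(3, 4)
    setdiff(S, union(A, B))                          # (A ∪ B)c  -> 5
    intersect(setdiff(S, A), setdiff(S, B))          # Ac ∩ Bc   -> 5
    sort(setdiff(S, intersect(A, B)))                # (A ∩ B)c  -> 1 2 4 5
    sort(union(setdiff(S, A), setdiff(S, B)))        # Ac ∪ Bc   -> 1 2 4 5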

2.2.3 Products of sets

The product of sets S and T is the set of ordered pairs:

S × T = {(s, t) | s ∈ S, t ∈ T }.

In words the right-hand side reads “the set of ordered pairs (s, t) such that s is in S and t
is in T .”
The following diagrams show two examples of the set product.

×    1      2      3      4
1   (1,1)  (1,2)  (1,3)  (1,4)
2   (2,1)  (2,2)  (2,3)  (2,4)
3   (3,1)  (3,2)  (3,3)  (3,4)
{1, 2, 3} × {1, 2, 3, 4}

[Figure: the rectangle [1, 4] × [1, 3] drawn inside the rectangle [0, 5] × [0, 4].]
[1, 4] × [1, 3] ⊂ [0, 5] × [0, 4]

The right-hand figure also illustrates that if A ⊂ S and B ⊂ T then A × B ⊂ S × T .

2.3 Counting

If S is finite, we use |S| or #S to denote the number of elements of S.

Two useful counting principles are the inclusion-exclusion principle and the rule of product.

2.3.1 Inclusion-exclusion principle

The inclusion-exclusion principle says

|A ∪ B| = |A| + |B| − |A ∩ B|.

We can illustrate this with a Venn diagram. S is all the dots, A is the dots in the blue
circle, and B is the dots in the red circle.

[Venn diagram: dots in circles A and B, with the overlap A ∩ B.]

|A| is the number of dots in A and likewise for the other sets. The figure shows that |A|+|B|
double-counts |A ∩ B|, which is why |A ∩ B| is subtracted off in the inclusion-exclusion
formula.

Example 4. In a band of singers and guitarists, seven people sing, four play the guitar,
and two do both. How big is the band?

answer: Let S be the set of singers and G be the set of guitar players. The inclusion-exclusion
principle says
size of band = |S ∪ G| = |S| + |G| − |S ∩ G| = 7 + 4 − 2 = 9.

2.3.2 Rule of Product

The Rule of Product says:

If there are n ways to perform action 1 and then m ways to perform action
2, then there are n · m ways to perform action 1 followed by action 2.

We will also call this the multiplication rule.

Example 5. If you have 3 shirts and 4 pants then you can make 3 · 4 = 12 outfits.
Think: An extremely important point is that the rule of product holds even if the ways to
perform action 2 depend on action 1, as long as the number of ways to perform action 2 is
independent of action 1. To illustrate this:
Example 6. There are 5 competitors in the 100m final at the Olympics. In how many
ways can the gold, silver, and bronze medals be awarded?

answer: There are 5 ways to award the gold. Once that is awarded there are 4 ways to
award the silver and then 3 ways to award the bronze: answer 5 · 4 · 3 = 60 ways.
Note that the choice of gold medalist affects who can win the silver, but the number of
possible silver medalists is always four.

2.4 Permutations and combinations

2.4.1 Permutations

A permutation of a set is a particular ordering of its elements. For example, the set {a, b, c}
has six permutations: abc, acb, bac, bca, cab, cba. We found the number of permutations by
listing them all. We could also have found the number of permutations by using the rule
of product. That is, there are 3 ways to pick the first element, then 2 ways for the second,
and 1 for the third. This gives a total of 3 · 2 · 1 = 6 permutations.
In general, the rule of product tells us that the number of permutations of a set of k elements
is
k! = k · (k − 1) · · · 3 · 2 · 1.

We also talk about the permutations of k things out of a set of n things. We show what
this means with an example.
Example 7. List all the permutations of 3 elements out of the set {a, b, c, d}. answer:
This is a longer list,
abc acb bac bca cab cba
abd adb bad bda dab dba
acd adc cad cda dac dca
bcd bdc cbd cdb dbc dcb

Note that abc and acb count as distinct permutations. That is, for permutations the order
matters.
There are 24 permutations. Note that the rule of product would have told us there are
4 · 3 · 2 = 24 permutations without bothering to list them all.

2.4.2 Combinations

In contrast to permutations, in combinations order does not matter: permutations are lists
and combinations are sets. We show what we mean with an example.
Example 8. List all the combinations of 3 elements out of the set {a, b, c, d}.
answer: Such a combination is a collection of 3 elements without regard to order. So, abc
and cab both represent the same combination. We can list all the combinations by listing
all the subsets of exactly 3 elements.
{a, b, c} {a, b, d} {a, c, d} {b, c, d}
There are only 4 combinations. Contrast this with the 24 permutations in the previous
example. The factor of 6 comes because every combination of 3 things can be written in 6
different orders.

2.4.3 Formulas

We’ll use the following notations.

nPk = number of permutations (lists) of k distinct elements from a set of size n

nCk = (n choose k) = number of combinations (subsets) of k elements from a set of size n

We emphasise that by the number of combinations of k elements we mean the number of subsets of size k.

These have the following notation and formulas:

Permutations: nPk = n!/(n − k)! = n(n − 1) · · · (n − k + 1)

Combinations: nCk = n!/(k!(n − k)!) = nPk/k!
The notation n Ck is read “n choose k”. The formula for n Pk follows from the rule of
product. It also implies the formula for n Ck because a subset of size k can be ordered in k!
ways.
We can illustrate the relation between permutations and combinations by lining up the
results of the previous two examples.
abc acb bac bca cab cba {a, b, c}
abd adb bad bda dab dba {a, b, d}
acd adc cad cda dac dca {a, c, d}
bcd bdc cbd cdb dbc dcb {b, c, d}
Permutations: 4 P3 Combinations: 4 C3
Notice that each row in the permutations list consists of all 3! permutations of the corresponding set in the combinations list.
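
In R these counts can be computed directly: choose() is built in, and a permutation count follows from the factorial formula above. The helper names nPk and nCk below are just for illustration.

    nCk <- function(n, k) choose(n, k)                      # combinations (subsets)
    nPk <- function(n, k) factorial(n) / factorial(n - k)   # permutations (lists)
    nPk(4, 3)   # 24, the permutations of 3 elements out of 4
    nCk(4, 3)   # 4, smaller by the factor 3! = 6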

2.4.4 Examples

Example 9. Count the following:


(i) The number of ways to choose 2 out of 4 things (order does not matter).
(ii) The number of ways to list 2 out of 4 things.
(iii) The number of ways to choose 3 out of 10 things.
answer: (i) This is asking for combinations: (4 choose 2) = 4!/(2! 2!) = 6.

(ii) This is asking for permutations: 4P2 = 4!/2! = 12.

(iii) This is asking for combinations: (10 choose 3) = 10!/(3! 7!) = (10 · 9 · 8)/(3 · 2 · 1) = 120.

Example 10. (i) Count the number of ways to get 3 heads in a sequence of 10 flips of a
coin.
(ii) If the coin is fair, what is the probability of exactly 3 heads in 10 flips?
answer: (i) This asks for the number of sequences of 10 flips (heads or tails) with exactly 3
heads. That is, we have to choose exactly 3 out of 10 flips to be heads. This is the same
question as in the previous example.

(10 choose 3) = 10!/(3! 7!) = (10 · 9 · 8)/(3 · 2 · 1) = 120.

(ii) Each flip has 2 possible outcomes (heads or tails). So the rule of product says there are
2^10 = 1024 sequences of 10 flips. Since the coin is fair each sequence is equally probable.
So the probability of 3 heads is
120/1024 ≈ .117.
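
A short R cross-check of this computation; dbinom is R's built-in binomial pmf, a distribution we meet again later in these notes.

    choose(10, 3) / 2^10              # 120/1024, about 0.117
    dbinom(3, size = 10, prob = 0.5)  # the same value from the binomial pmf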
Probability: Terminology and Examples

Class 2, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Know the definitions of sample space, event and probability function.

2. Be able to organize a scenario with randomness into an experiment and sample space.

3. Be able to make basic computations using a probability function.

2 Terminology

2.1 Probability cast list

• Experiment: a repeatable procedure with well-defined possible outcomes.

• Sample space: the set of all possible outcomes. We usually denote the sample space by
Ω, sometimes by S.

• Event: a subset of the sample space.

• Probability function: a function giving the probability for each outcome.

Later in the course we will learn about

• Probability density: a continuous distribution of probabilities.

• Random variable: a random numerical outcome.

2.2 Simple examples

Example 1. Toss a fair coin.

Experiment: toss the coin, report if it lands heads or tails.

Sample space: Ω = {H, T }.

Probability function: P (H) = .5, P (T ) = .5.

Example 2. Toss a fair coin 3 times.

Experiment: toss the coin 3 times, list the results.

Sample space: Ω = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.

Probability function: Each outcome is equally likely with probability 1/8.

For small sample spaces we can put the set of outcomes and probabilities into a probability
table.
Outcomes HHH HHT HTH HTT THH THT TTH TTT
Probability 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8


Example 3. Measure the mass of a proton

Experiment: follow some defined procedure to measure the mass and report the result.

Sample space: Ω = [0, ∞), i.e. in principle we can get any positive value.

Probability function: since there is a continuum of possible outcomes there is no probability

function. Instead we need to use a probability density, which we will learn about later in

the course.

Example 4. Taxis (An infinite discrete sample space)

Experiment: count the number of taxis that pass 77 Mass. Ave during an 18.05 class.

Sample space: Ω = {0, 1, 2, 3, 4, . . . }.

This is often modeled with the following probability function known as the Poisson
distribution. (Do not worry about mastering the Poisson distribution just yet):

P(k) = e^(−λ) λ^k / k!,
where λ is the average number of taxis. We can put this in a table:
Outcome      0        1          2              3             ...   k              ...
Probability  e^(−λ)   e^(−λ)λ    e^(−λ)λ^2/2!   e^(−λ)λ^3/3!  ...   e^(−λ)λ^k/k!   ...

Question: Accepting that this is a valid probability function, what is Σ_{k=0}^{∞} e^(−λ) λ^k / k! ?

answer: This is the total probability of all possible outcomes, so the sum equals 1. (Note,
this also follows from the Taylor series e^λ = Σ_{n=0}^{∞} λ^n / n!.)
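
A numerical sanity check in R; the value of λ and the truncation point of the infinite sum are arbitrary choices for illustration.

    lambda <- 3                     # an assumed average number of taxis, for illustration
    k <- 0:100                      # truncate the infinite sum; the remaining tail is negligible
    sum(exp(-lambda) * lambda^k / factorial(k))   # essentially 1
    sum(dpois(k, lambda))                         # the same, using R's Poisson pmf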

In a given setup there can be more than one reasonable choice of sample space. Here is a

simple example.

Example 5. Two dice (Choice of sample space)

Suppose you roll one die. Then the sample space and probability function are

Outcome 1 2 3 4 5 6
Probability: 1/6 1/6 1/6 1/6 1/6 1/6
Now suppose you roll two dice. What should be the sample space? Here are two options.
1. Record the pair of numbers showing on the dice (first die, second die).
2. Record the sum of the numbers on the dice. In this case there are 11 outcomes
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. These outcomes are not all equally likely.
As above, we can put this information in tables. For the first case, the sample space is the
product of the sample spaces for each die

{(1, 1), (2, 1), (3, 1), . . . (6, 6)}

Each of the 36 outcomes is equally likely. (Why 36 outcomes?) For the probability function
we will make a two dimensional table with the rows corresponding to the number on the
first die, the columns the number on the second die and the entries the probability.

                      Die 2
            1     2     3     4     5     6
        1  1/36  1/36  1/36  1/36  1/36  1/36
        2  1/36  1/36  1/36  1/36  1/36  1/36
Die 1   3  1/36  1/36  1/36  1/36  1/36  1/36
        4  1/36  1/36  1/36  1/36  1/36  1/36
        5  1/36  1/36  1/36  1/36  1/36  1/36
        6  1/36  1/36  1/36  1/36  1/36  1/36
Two dice in a two dimensional table

In the second case we can present outcomes and probabilities in our usual table.

outcome 2 3 4 5 6 7 8 9 10 11 12
probability 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

The sum of two dice


Think: What is the relationship between the two probability tables above?
We will see that the best choice of sample space depends on the context. For now, simply
note that given the outcome as a pair of numbers it is easy to find the sum.
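
A short R sketch that recovers the second table from the first point of view, by tabulating the 36 equally likely pairs:

    sums <- outer(1:6, 1:6, FUN = "+")   # 6 x 6 matrix of i + j
    table(sums) / 36                     # P(sum = 2), ..., P(sum = 12): 1/36, 2/36, ..., 1/36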
Note. Listing the experiment, sample space and probability function is a good way to start
working systematically with probability. It can help you avoid some of the common pitfalls
in the subject.

Events.
An event is a collection of outcomes, i.e. an event is a subset of the sample space Ω. This
sounds odd, but it actually corresponds to the common meaning of the word.
Example 6. Using the setup in Example 2 we would describe the event that you get
exactly two heads in words by E = ‘exactly 2 heads’. Written as a subset this becomes

E = {HHT, HT H, T HH}.

You should get comfortable moving between describing events in words and as subsets of
the sample space.
The probability of an event E is computed by adding up the probabilities of all of the
outcomes in E. In this example each outcome has probability 1/8, so we have P (E) = 3/8.

2.3 Definition of a discrete sample space

Definition. A discrete sample space is one that is listable; it can be either finite or infinite.
Examples. {H, T}, {1, 2, 3}, {1, 2, 3, 4, . . . }, {2, 3, 5, 7, 11, 13, 17, . . . } are all
discrete sets. The first two are finite and the last two are infinite.
Example. The interval 0 ≤ x ≤ 1 is not discrete, rather it is continuous. We will deal
with continuous sample spaces in a few days.

2.4 The probability function

So far we’ve been using a casual definition of the probability function. Let’s give a more
precise one.
Careful definition of the probability function.
For a discrete sample space S a probability function P assigns to each outcome ω a number
P (ω) called the probability of ω. P must satisfy two rules:

• Rule 1. 0 ≤ P (ω) ≤ 1 (probabilities are between 0 and 1).

• Rule 2. The sum of the probabilities of all possible outcomes is 1 (something must
occur).

In symbols Rule 2 says: if S = {ω1, ω2, . . . , ωn} then P(ω1) + P(ω2) + . . . + P(ωn) = 1. Or,
using summation notation: Σ_{j=1}^{n} P(ωj) = 1.
The probability of an event E is the sum of the probabilities of all the outcomes in E. That
is, P(E) = Σ_{ω∈E} P(ω).

Think: Check Rules 1 and 2 on Examples 1 and 2 above.

Example 7. Flip until heads (A classic example)

Suppose we have a coin with probability p of heads and we have the following scenario.

Experiment: Toss the coin until the first heads. Report the number of tosses.

Sample space: Ω = {1, 2, 3, . . . }.

Probability function: P (n) = (1 − p)^(n−1) p.

Challenge 1: show the sum of all the probabilities equals 1 (hint: geometric series).

Challenge 2: justify the formula for P (n) (we will do this soon).
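
A quick numerical check of Challenge 1 in R; the value of p and the truncation of the infinite sum are arbitrary choices for illustration.

    p <- 0.3                       # any 0 < p <= 1 works; 0.3 is just an example
    n <- 1:1000                    # truncate the infinite sum; the remaining tail is negligible
    sum((1 - p)^(n - 1) * p)       # very close to 1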

Stopping problems. The previous toy example is an uncluttered version of a general

class of problems called stopping rule problems. A stopping rule is a rule that tells you

when to end a certain process. In the toy example above the process was flipping a coin

and we stopped after the first heads. A more practical example is a rule for ending a series

of medical treatments. Such a rule could depend on how well the treatments are working,

how the patient is tolerating them and the probability that the treatments would continue

to be effective. One could ask about the probability of stopping within a certain number of

treatments or the average number of treatments you should expect before stopping.

3 Some rules of probability

For events A, L and R contained in a sample space Ω.


Rule 1. P (Ac ) = 1 − P (A).
Rule 2. If L and R are disjoint then P (L ∪ R) = P (L) + P (R).

Rule 3. If L and R are not disjoint, we have the inclusion-exclusion principle:

P (L ∪ R) = P (L) + P (R) − P (L ∩ R)

We visualize these rules using Venn diagrams.

[Three Venn diagrams: Ω = A ∪ Ac with no overlap; disjoint L and R, so L ∪ R has no overlap; overlapping L and R, with overlap L ∩ R.]

We can also justify them logically.


Rule 1: A and Ac split Ω into two non-overlapping regions. Since the total probability
P (Ω) = 1 this rule says that the probability of A and the probability of ‘not A’ are complementary, i.e. sum to 1.
Rule 2: L and R split L ∪ R into two non-overlapping regions. So the probability of L ∪ R
is split between P (L) and P (R).
Rule 3: In the sum P (L) + P (R) the overlap P (L ∩ R) gets counted twice. So P (L) +
P (R) − P (L ∩ R) counts everything in the union exactly once.
Think: Rule 2 is a special case of Rule 3.
For the following examples suppose we have an experiment that produces a random integer
between 1 and 20. The probabilities are not necessarily uniform, i.e., not necessarily the
same for each outcome.
Example 8. If the probability of an even number is .6 what is the probability of an odd
number?
answer: Since being odd is complementary to being even, the probability of being odd is
1-.6 = .4.
Let’s redo this example a bit more formally, so you see how it’s done. First, so we can refer
to it, let’s name the random integer X. Let’s also name the event ‘X is even’ as A. Then
the event ‘X is odd’ is Ac . We are given that P (A) = .6. Therefore P (Ac ) = 1 − .6 = .4 .
Example 9. Consider the 2 events, A: ‘X is a multiple of 2’; B: ‘X is odd and less than
10’. Suppose P (A) = .6 and P (B) = .25.
(i) What is A ∩ B?
(ii) What is the probability of A ∪ B?
answer: (i) Since all numbers in A are even and all numbers in B are odd, these events are
disjoint. That is, A ∩ B = ∅.
(ii) Since A and B are disjoint P (A ∪ B) = P (A) + P (B) = .85.

Example 10. Let A, B and C be the events X is a multiple of 2, 3 and 6 respectively. If


P (A) = .6, P (B) = .3 and P (C) = .2 what is P (A or B)?
answer: Note two things. First we used the word ‘or’ which means union: ‘A or B’ =
A ∪ B. Second, an integer is divisible by 6 if and only if it is divisible by both 2 and 3.

This translates into C = A ∩ B. So the inclusion-exclusion principle says

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = .6 + .3 − .2 = .7 .
Conditional Probability, Independence and Bayes’ Theorem

Class 3, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals
1. Know the definitions of conditional probability and independence of events.

2. Be able to compute conditional probability directly from the definition.

3. Be able to use the multiplication rule to compute the total probability of an event.

4. Be able to check if two events are independent.

5. Be able to use Bayes’ formula to ‘invert’ conditional probabilities.

6. Be able to organize the computation of conditional probabilities using trees and tables.

7. Understand the base rate fallacy thoroughly.

2 Conditional Probability

Conditional probability answers the question ‘how does the probability of an event change
if we have extra information’. We’ll illustrate with an example.
Example 1. Toss a fair coin 3 times.
(a) What is the probability of 3 heads?
answer: Sample space Ω = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
All outcomes are equally likely, so P (3 heads) = 1/8.

(b) Suppose we are told that the first toss was heads. Given this information how should
we compute the probability of 3 heads?
answer: We have a new (reduced) sample space: Ω' = {HHH, HHT, HT H, HT T }.
All outcomes are equally likely, so

P (3 heads given that the first toss is heads) = 1/4.


This is called conditional probability, since it takes into account additional conditions. To
develop the notation, we rephrase (b) in terms of events.
Rephrased (b) Let A be the event ‘all three tosses are heads’ = {HHH}.
Let B be the event ‘the first toss is heads’ = {HHH, HHT, HT H, HT T }.
The conditional probability of A knowing that B occurred is written

P (A|B)

This is read as
‘the conditional probability of A given B’
or
‘the probability of A conditioned on B’


or simply
‘the probability of A given B’.

We can visualize conditional probability as follows. Think of P (A) as the proportion of the
area of the whole sample space taken up by A. For P (A|B) we restrict our attention to B.
That is, P (A|B) is the proportion of area of B taken up by A, i.e. P (A ∩ B)/P (B).

[Figure: a region B containing the overlap A ∩ B with A, alongside the corresponding picture for the coin example.]

Conditional probability: Abstract visualization and coin example


Note, A ⊂ B in the right-hand figure, so there are only two colors shown.
The formal definition of conditional probability catches the gist of the above example and
visualization.
Formal definition of conditional probability
Let A and B be events. We define the conditional probability of A given B as
P(A|B) = P(A ∩ B)/P(B), provided P(B) ≠ 0.     (1)

Let’s redo the coin tossing example using the definition in Equation (1). Recall A = ‘3 heads’
and B = ‘first toss is heads’. We have P (A) = 1/8 and P (B) = 1/2. Since A ∩ B = A, we
also have P (A ∩ B) = 1/8. Now according to (1), P (A|B) = (1/8)/(1/2) = 1/4, which agrees with
our answer in Example 1b.

3 Multiplication Rule

The following formula is called the multiplication rule.


P (A ∩ B) = P (A|B) · P (B). (2)
This is simply a rewriting of the definition in Equation (1) of conditional probability. We
will see that our use of the multiplication rule is very similar to our use of the rule of
product in counting. In fact, the multiplication rule is just a souped up version of the rule
of product.
We start with a simple example where we can check all the probabilities directly by counting.
Example 2. Draw two cards from a deck. Define the events: S1 = ‘first card is a spade’
and S2 = ‘second card is a spade’. What is P (S2 |S1 )?
answer: We can do this directly by counting: if the first card is a spade then of the 51
cards remaining, 12 are spades.
P (S2 |S1 ) = 12/51.

Now, let’s recompute this using formula (1). We have to compute P (S1 ), P (S2 ) and
P (S1 ∩ S2 ): We know that P (S1 ) = 1/4 because there are 52 equally likely ways to draw
the first card and 13 of them are spades. The same logic says that there are 52 equally
likely ways the second card can be drawn, so P (S2 ) = 1/4.
Aside: The probability P (S2 ) = 1/4 may seem surprising since the value of the first card
certainly affects the probabilities for the second card. However, if we look at all possible
two card sequences we will see that every card in the deck has equal probability of being
the second card. Since 13 of the 52 cards are spades we get P (S2 ) = 13/52 = 1/4. Another
way to say this is: if we are not given the value of the first card then we have to consider all
possibilities for the second card.
Continuing, we see that
P(S1 ∩ S2) = (13 · 12)/(52 · 51) = 3/51.
This was found by counting the number of ways to draw a spade followed by a second spade
and dividing by the number of ways to draw any card followed by any other card. Now,
using (1) we get
P(S2|S1) = P(S2 ∩ S1)/P(S1) = (3/51)/(1/4) = 12/51.
Finally, we verify the multiplication rule by computing both sides of (2).
P(S1 ∩ S2) = (13 · 12)/(52 · 51) = 3/51 and P(S2|S1) · P(S1) = (12/51) · (1/4) = 3/51. QED

Think: For S1 and S2 in the previous example, what is P (S2 |S1c )?
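
A simulation sketch in R that estimates both P(S2|S1) and the Think question's P(S2|S1c); since only suits matter here, the deck is reduced to 'spade' and 'other', and the seed and trial count are arbitrary choices.

    set.seed(2)
    deck <- rep(c("spade", "other"), times = c(13, 39))   # 13 spades, 39 other cards
    ntrials <- 100000
    draws <- replicate(ntrials, sample(deck, 2))          # each column is one two-card draw
    first_spade <- draws[1, ] == "spade"
    mean(draws[2, first_spade] == "spade")                # estimate of P(S2|S1), about 12/51
    mean(draws[2, !first_spade] == "spade")               # estimate of P(S2|S1c), about 13/51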

4 Law of Total Probability

The law of total probability will allow us to use the multiplication rule to find probabilities
in more interesting examples. It involves a lot of notation, but the idea is fairly simple. We
state the law when the sample space is divided into 3 pieces. It is a simple matter to extend
the rule when there are more than 3 pieces.
Law of Total Probability
Suppose the sample space Ω is divided into 3 disjoint events B1 , B2 , B3 (see the figure
below). Then for any event A:

P (A) = P (A ∩ B1 ) + P (A ∩ B2 ) + P (A ∩ B3 )

P (A) = P (A|B1 ) P (B1 ) + P (A|B2 ) P (B2 ) + P (A|B3 ) P (B3 ) (3)

The top equation says ‘if A is divided into 3 pieces then P (A) is the sum of the probabilities
of the pieces’. The bottom equation (3) is called the law of total probability. It is just a
rewriting of the top equation using the multiplication rule.


[Figure: Ω partitioned into B1, B2, B3; the event A is split into the pieces A ∩ B1, A ∩ B2, A ∩ B3.]
The sample space Ω and the event A are each divided into 3 disjoint pieces.
The law holds if we divide Ω into any number of events, so long as they are disjoint and

cover all of Ω. Such a division is often called a partition of Ω.

Our first example will be one where we already know the answer and can verify the law.

Example 3. An urn contains 5 red balls and 2 green balls. Two balls are drawn one after

the other. What is the probability that the second ball is red?

answer: The sample space is Ω = {rr, rg, gr, gg}.

Let R1 be the event ‘the first ball is red’, G1 = ‘first ball is green’, R2 = ‘second ball is

red’, G2 = ‘second ball is green’. We are asked to find P (R2 ).

The fast way to compute this is just like P (S2 ) in the card example above. Every ball is

equally likely to be the second ball. Since 5 out of 7 balls are red, P (R2 ) = 5/7.

Let’s compute this same value using the law of total probability (3). First, we’ll find the

conditional probabilities. This is a simple counting exercise.

P (R2 |R1 ) = 4/6, P (R2 |G1 ) = 5/6.

Since R1 and G1 partition Ω the law of total probability says

P(R2) = P(R2|R1)P(R1) + P(R2|G1)P(G1)     (4)
      = (4/6) · (5/7) + (5/6) · (2/7)
      = 30/42 = 5/7.

Probability urns
The example above used probability urns. Their use goes back to the beginning of the
subject and we would be remiss not to introduce them. This toy model is very useful. We
quote from Wikipedia: http://en.wikipedia.org/wiki/Urn_problem

In probability and statistics, an urn problem is an idealized mental exercise


in which some objects of real interest (such as atoms, people, cars, etc.) are
represented as colored balls in an urn or other container. One pretends to draw
(remove) one or more balls from the urn; the goal is to determine the probability
of drawing one color or another, or some other properties. A key parameter is
whether each ball is returned to the urn after each draw.

It doesn’t take much to make an example where (3) is really the best way to compute the
probability. Here is a game with slightly more complicated rules.
Example 4. An urn contains 5 red balls and 2 green balls. A ball is drawn. If it’s green
a red ball is added to the urn and if it’s red a green ball is added to the urn. (The original
ball is not returned to the urn.) Then a second ball is drawn. What is the probability the
second ball is red?
answer: The law of total probability says that P (R2 ) can be computed using the expression
in Equation (4). Only the values for the probabilities will change. We have

P (R2 |R1 ) = 4/7, P (R2 |G1 ) = 6/7.

Therefore,
P(R2) = P(R2|R1)P(R1) + P(R2|G1)P(G1) = (4/7) · (5/7) + (6/7) · (2/7) = 32/49.
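
A simulation sketch of this urn game in R; the helper one_game and the trial count are illustrative choices.

    set.seed(3)
    one_game <- function() {
      urn <- c(rep("red", 5), rep("green", 2))
      first <- sample(urn, 1)                               # first draw
      urn <- urn[-match(first, urn)]                        # the drawn ball is not returned
      urn <- c(urn, if (first == "red") "green" else "red") # add a ball of the other color
      sample(urn, 1)                                        # second draw
    }
    mean(replicate(100000, one_game()) == "red")            # about 32/49, roughly 0.65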

5 Using Trees to Organize the Computation

Trees are a great way to organize computations with conditional probability and the law of
total probability. The figures and examples will make clear what we mean by a tree. As
with the rule of product, the key is to organize the underlying process into a sequence of
actions.
We start by redoing Example 4. The sequence of actions are: first draw ball 1 (and add the
appropriate ball to the urn) and then draw ball 2.

[Tree: the root branches to R1 (probability 5/7) and G1 (2/7); R1 branches to R2 (4/7) and G2 (3/7); G1 branches to R2 (6/7) and G2 (1/7).]

You interpret this tree as follows. Each dot is called a node. The tree is organized by levels.
The top node (root node) is at level 0. The next layer down is level 1 and so on. Each level
shows the outcomes at one stage of the game. Level 1 shows the possible outcomes of the
first draw. Level 2 shows the possible outcomes of the second draw starting from each node
in level 1.
Probabilities are written along the branches. The probability of R1 (red on the first draw)
is 5/7. It is written along the branch from the root node to the one labeled R1 . At the
next level we put in conditional probabilities. The probability along the branch from R1 to
R2 is P (R2 |R1 ) = 4/7. It represents the probability of going to node R2 given that you are
already at R1 .
The multiplication rule says that the probability of getting to any node is just the product of
the probabilities along the path to get there. For example, the node labeled R2 at the far left
really represents the event R1 ∩ R2 because it comes from the R1 node. The multiplication
rule now says
P(R1 ∩ R2) = P(R1) · P(R2|R1) = (5/7) · (4/7),

which is exactly multiplying along the path to the node.


The law of total probability is just the statement that P (R2 ) is the sum of the probabilities
of all paths leading to R2 (the two circled nodes in the figure). In this case,
P(R2) = (5/7) · (4/7) + (2/7) · (6/7) = 32/49,
exactly as in the previous example.

5.1 Shorthand vs. precise trees

The tree given above involves some shorthand. For example, the node marked R2 at the
far left really represents the event R1 ∩ R2 , since it ends the path from the root through
R1 to R2 . Here is the same tree with everything labeled precisely. As you can see this tree
is more cumbersome to make and use. We usually use the shorthand version of trees. You
should make sure you know how to interpret them precisely.

[Tree with full labels: the root branches to R1 with P(R1) = 5/7 and to G1 with P(G1) = 2/7; R1 branches to R1 ∩ R2 with P(R2|R1) = 4/7 and to R1 ∩ G2 with P(G2|R1) = 3/7; G1 branches to G1 ∩ R2 with P(R2|G1) = 6/7 and to G1 ∩ G2 with P(G2|G1) = 1/7.]

6 Independence

Two events are independent if knowledge that one occurred does not change the probability
that the other occurred. Informally, events are independent if they do not influence one
another.
Example 5. Toss a coin twice. We expect the outcomes of the two tosses to be independent
of one another. In real experiments this always has to be checked. If my coin lands in honey
and I don’t bother to clean it, then the second toss might be affected by the outcome of the
first toss.
More seriously, the independence of experiments can be undermined by the failure to clean or
recalibrate equipment between experiments or to isolate supposedly independent observers
from each other or a common influence. We’ve all experienced hearing the same ‘fact’ from
different people. Hearing it from different sources tends to lend it credence until we learn
that they all heard it from a common source. That is, our sources were not independent.
Translating the verbal description of independence into symbols gives

A is independent of B if P (A|B) = P (A). (5)

That is, knowing that B occurred does not change the probability that A occurred. In
terms of events as subsets, knowing that the realized outcome is in B does not change the
probability that it is in A.

If A and B are independent in the above sense, then the multiplication rule gives P (A ∩
B) = P (A|B) · P (B) = P (A) · P (B). This justifies the following technical definition of
independence.
Formal definition of independence: Two events A and B are independent if

P (A ∩ B) = P (A) · P (B) (6)

This is a nice symmetric definition which makes clear that A is independent of B if and only
if B is independent of A. Unlike the equation with conditional probabilities, this definition
makes sense even when P (B) = 0. In terms of conditional probabilities, we have:
1. If P (B) ≠ 0 then A and B are independent if and only if P (A|B) = P (A).
2. If P (A) ≠ 0 then A and B are independent if and only if P (B|A) = P (B).
Independent events commonly arise as different trials in an experiment, as in the following
example.
Example 6. Toss a fair coin twice. Let H1 = ‘heads on first toss’ and let H2 = ‘heads on
second toss’. Are H1 and H2 independent?

answer: Since H1 ∩ H2 is the event ‘both tosses are heads’ we have

P (H1 ∩ H2 ) = 1/4 = P (H1 )P (H2 ).

Therefore the events are independent.


We can ask about the independence of any two events, as in the following two examples.
Example 7. Toss a fair coin 3 times. Let H1 = ‘heads on first toss’ and A = ‘two heads
total’. Are H1 and A independent?
answer: We know that P (A) = 3/8. Since this is not 0 we can check if the formula in
Equation 5 holds. Now, H1 = {HHH, HHT, HTH, HTT} contains exactly two outcomes
(HHT, HT H) from A, so we have P (A|H1 ) = 2/4. Since P (A|H1 ) ≠ P (A) these events
are not independent.
Example 8. Draw one card from a standard deck of playing cards. Let’s examine the
independence of 3 events ‘the card is an ace’, ‘the card is a heart’ and ‘the card is red’.
Define the events as A = ‘ace’, H = ‘hearts’, R = ‘red’.
(a) We know that P (A) = 4/52 = 1/13, P (A|H) = 1/13. Since P (A) = P (A|H) we have
that A is independent of H.
(b) P (A|R) = 2/26 = 1/13. So A is independent of R. That is, whether the card is an ace
is independent of whether it’s red.
(c) Finally, what about H and R? Since P (H) = 1/4 and P (H|R) = 1/2, H and R are not
independent. We could also see this the other way around: P (R) = 1/2 and P (R|H) = 1,
so H and R are not independent.

6.1 Paradoxes of Independence

An event A with probability 0 is independent of itself, since in this case both sides of
equation (6) are 0. This appears paradoxical because knowledge that A occurred certainly

gives information about whether A occurred. We resolve the paradox by noting that since
P (A) = 0 the statement ‘A occurred’ is vacuous.
Think: For what other value(s) of P (A) is A independent of itself?

7 Bayes’ Theorem

Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of
this course. For two events A and B Bayes’ theorem (also called Bayes’ rule and Bayes’
formula) says
P(B|A) = P(A|B) · P(B) / P(A).     (7)
Comments: 1. Bayes’ rule tells us how to ‘invert’ conditional probabilities, i.e. to find
P (B|A) from P (A|B).
2. In practice, P (A) is often computed using the law of total probability.
Proof of Bayes’ rule
The key point is that A ∩ B is symmetric in A and B. So the multiplication rule says

P (B|A) · P (A) = P (A ∩ B) = P (A|B) · P (B).

Now divide through by P (A) to get Bayes’ rule.

A common mistake is to confuse P (A|B) and P (B|A). They can be very different. This is
illustrated in the next example.
Example 9. Toss a coin 5 times. Let H1 = ‘first toss is heads’ and let HA = ‘all 5 tosses
are heads’. Then P (H1 |HA ) = 1 but P (HA |H1 ) = 1/16.
For practice, let’s use Bayes’ theorem to compute P (H1 |HA ) using P (HA |H1 ). The terms
are P (HA |H1 ) = 1/16, P (H1 ) = 1/2, P (HA ) = 1/32. So,

P(H1|HA) = P(HA|H1)P(H1)/P(HA) = ((1/16) · (1/2))/(1/32) = 1,
which agrees with our previous calculation.

7.1 The Base Rate Fallacy

The base rate fallacy is one of many examples showing that it’s easy to confuse the meaning
of P (B|A) and P (A|B) when a situation is described in words. This is one of the key
examples from probability and it will inform much of our practice and interpretation of
statistics. You should strive to understand it thoroughly.
Example 10. The Base Rate Fallacy
Consider a routine screening test for a disease. Suppose the frequency of the disease in the
population (base rate) is 0.5%. The test is highly accurate with a 5% false positive rate
and a 10% false negative rate.
You take the test and it comes back positive. What is the probability that you have the
disease?

answer: We will do the computation three times: using trees, tables and symbols. We’ll

use the following notation for the relevant events:

D+ = ‘you have the disease’

D− = ‘you do not have the disease’

T + = ‘you tested positive’

T − = ‘you tested negative’.

We are given P (D+) = .005 and therefore P (D−) = .995. The false positive and false

negative rates are (by definition) conditional probabilities.

P (false positive) = P (T + |D−) = .05 and P (false negative) = P (T − |D+) = .1.

The complementary probabilities are known as the true negative and true positive rates:

P (T − |D−) = 1 − P (T + |D−) = .95 and P (T + |D+) = 1 − P (T − |D+) = .9.

Trees: All of these probabilities can be displayed quite nicely in a tree.

[Tree: the root branches to D− (probability .995) and D+ (.005); D− branches to T+ (.05) and T− (.95); D+ branches to T+ (.9) and T− (.1).]

The question asks for the probability that you have the disease given that you tested positive,
i.e. what is the value of P (D+|T +). We aren’t given this value, but we do know P (T +|D+),
so we can use Bayes’ theorem.
P(D+|T+) = P(T+|D+) · P(D+) / P(T+).

The two probabilities in the numerator are given. We compute the denominator P (T +)
using the law of total probability. Using the tree we just have to sum the probabilities for
each of the nodes marked T +

P (T +) = .995 × .05 + .005 × .9 = .05425

Thus,
P(D+|T+) = (.9 × .005)/.05425 = 0.082949 ≈ 8.3%.
Remarks: This is called the base rate fallacy because the base rate of the disease in the
population is so low that the vast majority of the people taking the test are healthy, and
even with an accurate test most of the positives will be healthy people. Ask your doctor
for his/her guess at the odds.
To summarize the base rate fallacy with specific numbers

95% of all tests are accurate does not imply 95% of positive tests are accurate

We will refer back to this example frequently. It and similar examples are at the heart of
many statistical misunderstandings.

Other ways to work Example 10


Tables: Another trick that is useful for computing probabilities is to make a table. Let’s
redo the previous example using a table built with 10000 total people divided according to
the probabilities in this example.
We construct the table as follows. Pick a number, say 10000 people, and place it as the
grand total in the lower right. Using P (D+) = .005 we compute that 50 out of the 10000
people are sick (D+). Likewise 9950 people are healthy (D−). At this point the table looks
like:
D+ D− total
T+
T−
total 50 9950 10000
Using P (T + |D+) = .9 we can compute that the number of sick people who tested positive
as 90% of 50 or 45. The other entries are similar. At this point the table looks like the
table below on the left. Finally we sum the T + and T − rows to get the completed table
on the right.
         D+    D−     total                    D+    D−     total
T+       45    498                       T+    45    498      543
T−        5   9452                       T−     5   9452     9457
total    50   9950    10000           total    50   9950    10000
Using the complete table we can compute

P(D+|T+) = |D+ ∩ T+| / |T+| = 45/543 = 8.3%.

Symbols: For completeness, we show how the solution looks when written out directly in
symbols.

P(D+|T+) = P(T+|D+) · P(D+) / P(T+)
         = P(T+|D+) · P(D+) / (P(T+|D+) · P(D+) + P(T+|D−) · P(D−))
         = (.9 × .005) / (.9 × .005 + .05 × .995)
         = 8.3%
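
The same computation written as a few lines of R (the variable names are just illustrative):

    p_disease  <- 0.005   # base rate P(D+)
    p_pos_sick <- 0.9     # true positive rate, 1 minus the false negative rate
    p_pos_well <- 0.05    # false positive rate
    p_pos <- p_pos_sick * p_disease + p_pos_well * (1 - p_disease)  # law of total probability
    p_pos_sick * p_disease / p_pos                                  # P(D+|T+), about 0.083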

Visualization: The figure below illustrates the base rate fallacy. The large blue area
represents all the healthy people. The much smaller red area represents the sick people.
The shaded rectangle represents the people who test positive. The shaded area covers
most of the red area and only a small part of the blue area. Even so, most of the shaded
area is over the blue. That is, most of the positive tests are of healthy people.
[Figure: the large D− region and the small D+ region, with the shaded rectangle of positive tests covering most of D+ and a small part of D−.]

7.2 Bayes’ rule in 18.05

As we said at the start of this section, Bayes’ rule is a pillar of probability and statistics.
We have seen that Bayes’ rule allows us to ‘invert’ conditional probabilities. When we learn
statistics we will see that the art of statistical inference involves deciding how to proceed
when one (or more) of the terms on the right side of Bayes’ rule is unknown.
Discrete Random Variables

Class 4, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Know the definition of a discrete random variable.

2. Know the Bernoulli, binomial, and geometric distributions and examples of what they
model.

3. Be able to describe the probability mass function and cumulative distribution function
using tables and formulas.

4. Be able to construct new random variables from old ones.

5. Know how to compute expected value (mean).

2 Random Variables

This topic is largely about introducing some useful terminology, building on the notions of
sample space and probability function. The key words are

1. Random variable

2. Probability mass function (pmf)

3. Cumulative distribution function (cdf)

2.1 Recap

A discrete sample space Ω is a finite or listable set of outcomes {ω1 , ω2 . . .}. The probability
of an outcome ω is denoted P (ω).
An event E is a subset of Ω. The probability of an event E is P(E) = Σ_{ω∈E} P(ω).

2.2 Random variables as payoff functions

Example 1. A game with 2 dice.


Roll a die twice and record the outcomes as (i, j), where i is the result of the first roll and
j the result of the second. We can take the sample space to be

Ω = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)} = {(i, j) | i, j = 1, . . . 6}.

The probability function is P (i, j) = 1/36.


In this game, you win $500 if the sum is 7 and lose $100 otherwise. We give this payoff
function the name X and describe it formally by

X(i, j) =  500   if i + j = 7,
          −100   if i + j ≠ 7.

Example 2. We can change the game by using a different payoff function. For example

Y (i, j) = ij − 10.

In this example if you roll (6, 2) then you win $2. If you roll (2, 3) then you win -$4 (i.e.,

lose $4).

Question: Which game is the better bet?

answer: We will come back to this once we learn about expectation.
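
Although expectation comes later, a simulation sketch in R can already compare the two games empirically; the seed and trial count are arbitrary choices.

    set.seed(4)
    ntrials <- 100000
    i <- sample(1:6, ntrials, replace = TRUE)   # first die
    j <- sample(1:6, ntrials, replace = TRUE)   # second die
    X <- ifelse(i + j == 7, 500, -100)          # win $500 on a sum of 7, lose $100 otherwise
    Y <- i * j - 10
    mean(X); mean(Y)                            # average winnings per play of each game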

These payoff functions are examples of random variables. A random variable assigns a
number to each outcome in a sample space. More formally:
Definition: Let Ω be a sample space. A discrete random variable is a function

X: Ω→R

that takes a discrete set of values. (Recall that R stands for the real numbers.)
Why is X called a random variable? It’s ‘random’ because its value depends on a random
outcome of an experiment. And we treat X like we would a usual variable: we can add it
to other random variables, square it, and so on.

2.3 Events and random variables

For any value a we write X = a to mean the event consisting of all outcomes ω with
X(ω) = a.
Example 3. In Example 1 we rolled two dice and X was the random variable

X(i, j) =  500   if i + j = 7,
          −100   if i + j ≠ 7.

The event X = 500 is the set {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}, i.e. the set of all
outcomes that sum to 7. So P (X = 500) = 1/6.
We allow a to be any value, even values that X never takes. In Example 1, we could look
at the event X = 1000. Since X never equals 1000 this is just the empty event (or empty
set)
‘X = 1000’ = {} = ∅ and P (X = 1000) = 0.

2.4 Probability mass function and cumulative distribution function

It gets tiring and hard to read and write P (X = a) for the probability that X = a. When
we know we’re talking about X we will simply write p(a). If we want to make X explicit
we will write pX (a). We spell this out in a definition.
Definition: The probability mass function (pmf) of a discrete random variable is the
function p(a) = P (X = a).
Note:
1. We always have 0 ≤ p(a) ≤ 1.
2. We allow a to be any number. If a is a value that X never takes, then p(a) = 0.

Example 4. Let Ω be our earlier sample space for rolling 2 dice. Define the random
variable M to be the maximum value of the two dice:

M (i, j) = max(i, j).

For example, the roll (3,5) has maximum 5, i.e. M (3, 5) = 5.


We can describe a random variable by listing its possible values and the probabilities asso­
ciated to these values. For the above example we have:
value a: 1 2 3 4 5 6
pmf p(a): 1/36 3/36 5/36 7/36 9/36 11/36
For example, p(2) = 3/36.
Question: What is p(8)? answer: p(8) = 0.
Think: What is the pmf for Z(i, j) = i + j? Does it look familiar?
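
A short R sketch that recovers the pmf of M, and the pmf of Z = i + j from the Think question, by tabulating the 36 equally likely outcomes:

    maxes <- outer(1:6, 1:6, FUN = pmax)     # 6 x 6 matrix of max(i, j)
    table(maxes) / 36                        # 1/36, 3/36, 5/36, 7/36, 9/36, 11/36
    table(outer(1:6, 1:6, FUN = "+")) / 36   # pmf of Z = i + j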

2.5 Events and inequalities

Inequalities with random variables describe events. For example X ≤ a is the set of all
outcomes ω such that X(ω) ≤ a.
Example 5. If our sample space is the set of all pairs of (i, j) coming from rolling two dice
and Z(i, j) = i + j is the sum of the dice then

‘Z ≤ 4’ = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)}

2.6 The cumulative distribution function (cdf )

Definition: The cumulative distribution function (cdf) of a random variable X is the


function F given by F (a) = P (X ≤ a). We will often shorten this to distribution function.
Note well that the definition of F (a) uses the symbol less than or equal. This will be
important for getting your calculations exactly right.

Example. Continuing with the example M , we have


value a: 1 2 3 4 5 6
pmf p(a): 1/36 3/36 5/36 7/36 9/36 11/36
cdf F (a): 1/36 4/36 9/36 16/36 25/36 36/36

F (a) is called the cumulative distribution function because F (a) gives the total probability
that accumulates by adding up the probabilities p(b) as b runs from −∞ to a. For example,
in the table above, the entry 16/36 in column 4 for the cdf is the sum of the values of the
pmf from column 1 to column 4. In notation:

As events: ‘M ≤ 4’ = {1, 2, 3, 4}; F (4) = P (M ≤ 4) = 1/36+3/36+5/36+7/36 = 16/36.

Just like the probability mass function, F (a) is defined for all values a. In the above
example, F (8) = 1, F (−2) = 0, F (2.5) = 4/36, and F (π) = 9/36.
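In R, the cdf row of the table above can be computed from the pmf row with cumsum (a one-line check):

pmf = c(1, 3, 5, 7, 9, 11)/36
cumsum(pmf)      # 1/36 4/36 9/36 16/36 25/36 36/36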

2.7 Graphs of p(a) and F (a)

We can visualize the pmf and cdf with graphs. For example, let X be the number of heads
in 3 tosses of a fair coin:
value a: 0 1 2 3
pmf p(a): 1/8 3/8 3/8 1/8
cdf F (a): 1/8 4/8 7/8 1
The colored graphs show how the cumulative distribution function is built by accumulating
probability as a increases. The black and white graphs are the more standard presentations.

[Figure: probability mass function for X and cumulative distribution function for X, each shown as a colored (accumulating) version and a standard black-and-white version.]


[Figure: pmf and cdf for the sum of two dice, Z = i + j (values 2 through 12).]

Histograms: Later we will see another way to visualize the pmf using histograms. These
require some care to do right, so we will wait until we need them.

2.8 Properties of the cdf F

The cdf F of a random variable satisfies several properties:

1. F is non-decreasing. That is, its graph never goes down, or symbolically if a ≤ b then
F (a) ≤ F (b).

2. 0 ≤ F (a) ≤ 1.

3. lim F (a) = 1 as a → ∞, and lim F (a) = 0 as a → −∞.

In words, (1) says the cumulative probability F (a) increases or remains constant as a
increases, but never decreases; (2) says the accumulated probability is always between 0
and 1; (3) says that as a gets very large, it becomes more and more certain that X ≤ a and
as a gets very negative it becomes more and more certain that X > a.
Think: Why does a cdf satisfy each of these properties?

3 Specific Distributions

3.1 Bernoulli Distributions

Model: The Bernoulli distribution models one trial in an experiment that can result in
either success or failure. This is the most important distribution, and it is also the simplest. A
random variable X has a Bernoulli distribution with parameter p if:

1. X takes the values 0 and 1.

2. P (X = 1) = p and P (X = 0) = 1 − p.

We will write X ∼ Bernoulli(p) or Ber(p), which is read “X follows a Bernoulli distribution


with parameter p” or “X is drawn from a Bernoulli distribution with parameter p”.
A simple model for the Bernoulli distribution is to flip a coin with probability p of heads,
with X = 1 on heads and X = 0 on tails. The general terminology is to say X is 1 on
success and 0 on failure, with success and failure defined by the context.
Many decisions can be modeled as a binary choice, such as votes for or against a proposal.
If p is the proportion of the voting population that favors the proposal, then the vote of a
random individual is modeled by a Bernoulli(p) distribution.
Here are the table and graphs of the pmf and cdf for the Bernoulli(1/2) distribution and
below that for the general Bernoulli(p) distribution.

value a:    0    1
pmf p(a):  1/2  1/2
cdf F(a):  1/2   1

[Figure: pmf and cdf for the Bernoulli(1/2) distribution.]

Table, pmf and cdf for the Bernoulli(1/2) distribution

value a:    0      1
pmf p(a):  1 − p   p
cdf F(a):  1 − p   1

[Figure: pmf and cdf for the Bernoulli(p) distribution.]

Table, pmf and cdf for the Bernoulli(p) distribution

3.2 Binomial Distributions

The binomial distribution Binomial(n,p), or Bin(n,p), models the number of successes in n


independent Bernoulli(p) trials.
There is a hierarchy here. A single Bernoulli trial is, say, one toss of a coin. A single
binomial trial consists of n Bernoulli trials. For coin flips the sample space for a Bernoulli
trial is {H, T }. The sample space for a binomial trial is all sequences of heads and tails of
length n. Likewise a Bernoulli random variable takes values 0 and 1 and a binomial random
variable takes values 0, 1, 2, . . . , n.

Example 6. Binomial(1,p) is the same as Bernoulli(p).



Example 7. The number of heads in n flips of a coin with probability p of heads follows
a Binomial(n, p) distribution.
We describe X ∼ Binomial(n, p) by giving its values and probabilities. For notation we will
use k to mean an arbitrary number between 0 and n.
We remind you that ‘n choose k’ = C(n, k) = nCk is the number of ways to choose k things
out of a collection of n things, and it has the formula

C(n, k) = n! / (k! (n − k)!).     (1)

(It is also called a binomial coefficient.) Here is a table for the pmf of a Binomial(n, p) random
variable. We will explain how the binomial coefficients enter the pmf for the binomial
distribution after a simple example.

values a:   0         1                       2                         ···   k                         ···   n
pmf p(a):  (1−p)^n   C(n,1) p (1−p)^(n−1)   C(n,2) p^2 (1−p)^(n−2)   ···   C(n,k) p^k (1−p)^(n−k)   ···   p^n

Example 8. What is the probability of 3 or more heads in 5 tosses of a fair coin?


answer: The binomial coefficients associated with n = 5 are
C(5,0) = 1,   C(5,1) = 5!/(1! 4!) = 5·4·3·2·1/(4·3·2·1) = 5,   C(5,2) = 5!/(2! 3!) = 5·4·3·2·1/(2·1·3·2·1) = 5·4/2 = 10,
and similarly
C(5,3) = 10,   C(5,4) = 5,   C(5,5) = 1.
Using these values we get the following table for X ∼ Binomial(5,p).

values a:   0          1            2               3               4           5
pmf p(a):  (1−p)^5   5p(1−p)^4   10p^2(1−p)^3   10p^3(1−p)^2   5p^4(1−p)   p^5

We were told p = 1/2, so

P (X ≥ 3) = 10 (1/2)^3 (1/2)^2 + 5 (1/2)^4 (1/2) + (1/2)^5 = 16/32 = 1/2.

Think: Why is the value of 1/2 not surprising?
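R has the binomial pmf and cdf built in as dbinom and pbinom, so this value is easy to verify:

sum(dbinom(3:5, 5, 0.5))      # P(X >= 3) = 0.5
1 - pbinom(2, 5, 0.5)         # same answer using the cdf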

3.3 Explanation of the binomial probabilities

For concreteness, let n = 5 and k = 2 (the argument for arbitrary n and k is identical.) So
X ∼ binomial(5, p) and we want to compute p(2). The long way to compute p(2) is to list
all the ways to get exactly 2 heads in 5 coin flips and add up their probabilities. The list
has 10 entries:
HHTTT, HTHTT, HTTHT, HTTTH, THHTT, THTHT, THTTH, TTHHT, TTHTH,
TTTHH

Each entry has the same probability of occurring, namely

p^2 (1 − p)^3.

This is because each of the two heads has probability p and each of the 3 tails has probability
1 − p. Because the individual tosses are independent we can multiply probabilities.
Therefore, the total probability of exactly 2 heads is the sum of 10 identical probabilities,
i.e. p(2) = 10 p^2 (1 − p)^3, as shown in the table.
This guides us to the shorter way to do the computation. We have to count the number of
sequences with exactly 2 heads. To do this we need to choose 2 of the tosses to be heads
and the remaining 3 to be tails. The number of such sequences is the number of ways to
choose 2 out of 5 things, that is C(5, 2). Since each such sequence has the same probability,
p^2 (1 − p)^3, we get the probability of exactly 2 heads: p(2) = C(5, 2) p^2 (1 − p)^3.

Here are some plots of binomial probability mass functions (in the plots, frequency is the same as
probability).

3.4 Geometric Distributions

A geometric distribution models the number of tails before the first head in a sequence of
coin flips (Bernoulli trials).
Example 9. (a) Flip a coin repeatedly. Let X be the number of tails before the first heads.
So, X can equal 0, i.e. the first flip is heads, 1, 2, . . . . In principle it can take any nonnegative
integer value.
(b) Give a flip of tails the value 0, and heads the value 1. In this case, X is the number of
0’s before the first 1.

(c) Give a flip of tails the value 1, and heads the value 0. In this case, X is the number of
1’s before the first 0.
(d) Call a flip of tails a success and heads a failure. So, X is the number of successes before
the first failure.
(e) Call a flip of tails a failure and heads a success. So, X is the number of failures before
the first success.
You can see this models many different scenarios of this type. The most neutral language
is the number of tails before the first head.
Formal definition. The random variable X follows a geometric distribution with param­
eter p if

• X takes the values 0, 1, 2, 3, . . .

• its pmf is given by p(k) = P (X = k) = (1 − p)k p.

We denote this by X ∼ geometric(p) or geo(p). In table form we have:


value a:   0   1        2           3           ...   k           ...
pmf p(a):  p   (1−p)p   (1−p)^2 p   (1−p)^3 p   ...   (1−p)^k p   ...

Table: X ∼ geometric(p): X = the number of 0s before the first 1.
We will show how this table was computed in an example below.
The geometric distribution is an example of a discrete distribution that takes an infinite
number of possible values. Things can get confusing when we work with successes and
failures, since we might want to model the number of successes before the first failure or we
might want the number of failures before the first success. To keep things straight
you can translate to the neutral language of the number of tails before the first heads.
[Figure: pmf and cdf for the geometric(1/3) distribution.]

Example 10. Computing geometric probabilities. Suppose that the inhabitants of an


island plan their families by having babies until the first girl is born. Assume the probability
of having a girl with each pregnancy is 0.5 independent of other pregnancies, that all babies
survive and there are no multiple births. What is the probability that a family has k boys?

answer: In neutral language we can think of boys as tails and girls as heads. Then the
number of boys in a family is the number of tails before the first heads.
Let’s practice using standard notation to present this. So, let X be the number of boys in
a (randomly-chosen) family. So, X is a geometric random variable. We are asked to find
p(k) = P (X = k). A family has k boys if the sequence of children in the family from oldest
to youngest is
BBB . . . BG
with the first k children being boys. The probability of this sequence is just the product
of the probability for each child, i.e. (1/2)^k · (1/2) = (1/2)^(k+1). (Note: The assumptions of
equal probability and independence are simplifications of reality.)

Think: What is the ratio of boys to girls on the island?


More geometric confusion. Another common definition for the geometric distribution is the
number of tosses until the first heads. In this case X can take the values 1, i.e. the first
flip is heads, 2, 3, . . . . This is just our geometric random variable plus 1. The methods of
computing with it are just like the ones we used above.
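In R, dgeom uses the same convention as these notes (the number of failures before the first success), so we can reproduce the table for, say, p = 1/3:

k = 0:5
dgeom(k, 1/3)           # built-in geometric pmf
(1 - 1/3)^k * (1/3)     # matches (1 - p)^k * p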

3.5 Uniform Distribution

The uniform distribution models any situation where all the outcomes are equally likely.

X ∼ uniform(N ).

X takes values 1, 2, 3, . . . , N , each with probability 1/N . We have already seen this distribution
many times when modeling fair coins (N = 2), dice (N = 6), birthdays (N = 365),
and poker hands (N = C(52, 5)).

3.6 Discrete Distributions Applet

The applet at http://mathlets.org/mathlets/probability-distributions/ gives a dynamic
view of some discrete distributions. The graphs will change smoothly as you move
the various sliders. Try playing with the different distributions and parameters.
This applet is carefully color-coded. Two things with the same color represent the same or
closely related notions. By understanding the color-coding and other details of the applet,
you will acquire a stronger intuition for the distributions shown.

3.7 Other Distributions

There are a million other named distributions arising in various contexts. We don’t expect
you to memorize them (we certainly have not!), but you should be comfortable using a
resource like Wikipedia to look up a pmf. For example, take a look at the info box at the
top right of http://en.wikipedia.org/wiki/Hypergeometric_distribution. The info
box lists many (surely unfamiliar) properties in addition to the pmf.

4 Arithmetic with Random Variables

We can do arithmetic with random variables. For example, we can add, subtract, multiply,
or square them.
There is a simple, but extremely important idea for counting. It says that if we have a
sequence of numbers that are either 0 or 1 then the sum of the sequence is the number of
1s.
Example 11. Consider the sequence with five 1s

1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0.

It is easy to see that the sum of this sequence is 5, the number of 1s.
We illustrate this idea by counting the number of heads in n tosses of a coin.
Example 12. Toss a fair coin n times. Let Xj be 1 if the jth toss is heads and 0 if it’s
tails. So, Xj is a Bernoulli(1/2) random variable. Let X be the total number of heads in
the n tosses. Assuming the tosses are independent, we know X ∼ binomial(n, 1/2). We
can also write
X = X1 + X2 + X3 + . . . + Xn .
Again, this is because the terms in the sum on the right are all either 0 or 1. So, the sum
is exactly the number of Xj that are 1, i.e. the number of heads.
The important thing to see in the example above is that we’ve written the more complicated
binomial random variable X as the sum of extremely simple random variables Xj . This
will allow us to manipulate X algebraically.

Think: Suppose X and Y are independent and X ∼ binomial(n, 1/2) and Y ∼ binomial(m, 1/2).
What kind of distribution does X + Y follow? (Answer: binomial(n + m, 1/2). Why?)

Example 13. Suppose X and Y are independent random variables with the following
tables.
Values of X x: 1 2 3 4
pmf pX (x): 1/10 2/10 3/10 4/10

Values of Y y: 1 2 3 4 5
pmf pY (y): 1/15 2/15 3/15 4/15 5/15
Check that the total probability for each random variable is 1. Make a table for the random
variable X + Y .
answer: The first thing to do is make a two-dimensional table for the product sample space
consisting of pairs (x, y), where x is a possible value of X and y one of Y . To help do the
computation, the probabilities for the X values are put in the far right column and those
for Y are in the bottom row. Because X and Y are independent, the probability for each (x, y)
pair is just the product of the individual probabilities.

Y values

1 2 3 4 5

X values 1 1/150 2/150 3/150 4/150 5/150 1/10

2 2/150 4/150 6/150 8/150 10/150 2/10

3 3/150 6/150 9/150 12/150 15/150 3/10

4 4/150 8/150 12/150 16/150 20/150 4/10

1/15 2/15 3/15 4/15 5/15

The diagonal stripes show sets of squares where X + Y is the same. All we have to do to
compute the probability table for X + Y is sum the probabilities for each stripe.
X + Y values: 2 3 4 5 6 7 8 9

pmf: 1/150 4/150 10/150 20/150 30/150 34/150 31/150 20/150

When the tables are too big to write down we’ll need to use purely algebraic techniques to
compute the probabilities of a sum. We will learn how to do this in due course.
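For tables of moderate size, R can carry out exactly the computation above. Here is a short sketch for Example 13: outer builds the joint table, and the probabilities in each diagonal stripe are then summed.

x = 1:4; px = (1:4)/10
y = 1:5; py = (1:5)/15
joint = outer(px, py)               # P(X = x, Y = y), using independence
sums = outer(x, y, "+")             # value of X + Y for each (x, y) pair
tapply(as.vector(joint), as.vector(sums), sum)   # pmf of X + Y: 1/150, 4/150, ..., 20/150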
Discrete Random Variables: Expected Value

Class 4, 18.05

Jeremy Orloff and Jonathan Bloom

1 Expected Value

In the R reading questions for this lecture, you simulated the average value of rolling a die
many times. You should have gotten a value close to the exact answer of 3.5. To motivate
the formal definition of the average, or expected value, we first consider some examples.
Example 1. Suppose we have a six-sided die marked with five 3’s and one 6. (This was
the red one from our non-transitive dice.) What would you expect the average of 6000 rolls
to be?
answer: If we knew the value of each roll, we could compute the average by summing
the 6000 values and dividing by 6000. Without knowing the values, we can compute the
expected average as follows.
Since there are five 3’s and one six we expect roughly 5/6 of the rolls will give 3 and 1/6 will
give 6. Assuming this to be exactly true, we have the following table of values and counts:
value: 3 6
expected counts: 5000 1000
The average of these 6000 values is then

(5000 · 3 + 1000 · 6)/6000 = (5/6) · 3 + (1/6) · 6 = 3.5
We consider this the expected average in the sense that we ‘expect’ each of the possible
values to occur with the given frequencies.

Example 2. We roll two standard 6-sided dice. You win $1000 if the sum is 2 and lose
$100 otherwise. How much do you expect to win on average per trial?
answer: The probability of a 2 is 1/36. If you play N times, you can ‘expect’ (1/36) · N of the
trials to give a 2 and (35/36) · N of the trials to give something else. Thus your total expected
winnings are

1000 · N/36 − 100 · 35N/36.

To get the expected average per trial we divide the total by N :

expected average = 1000 · (1/36) − 100 · (35/36) = −69.44.

Think: Would you be willing to play this game one time? Multiple times?

Notice that in both examples the sum for the expected average consists of terms which are
a value of the random variable times its probability. This leads to the following definition.
Definition: Suppose X is a discrete random variable that takes values x1 , x2 , . . . , xn with
probabilities p(x1 ), p(x2 ), . . . , p(xn ). The expected value of X is denoted E(X) and defined


by

E(X) = Σ_{j=1}^{n} p(xj) xj = p(x1)x1 + p(x2)x2 + . . . + p(xn)xn .

Notes:

1. The expected value is also called the mean or average of X and often denoted by µ
(“mu”).

2. As seen in the above examples, the expected value need not be a possible value of the
random variable. Rather it is a weighted average of the possible values.

3. Expected value is a summary statistic, providing a measure of the location or central


tendency of a random variable.

4. If all the values are equally probable then the expected value is just the usual average of
the values.

Example 3. Find E(X) for the random variable X with table:


values of X: 1 3 5
pmf: 1/6 1/6 2/3
answer: E(X) = (1/6) · 1 + (1/6) · 3 + (2/3) · 5 = 24/6 = 4
Example 4. Let X be a Bernoulli(p) random variable. Find E(X).
answer: X takes values 1 and 0 with probabilities p and 1 − p, so

E(X) = p · 1 + (1 − p) · 0 = p.

Important: This is an important example. Be sure to remember that the expected value of
a Bernoulli(p) random variable is p.
Think: What is the expected value of the sum of two dice?
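In R the expected value of a finite table is just sum(p*x), and a simulation should come out close to it. For the table in Example 3:

x = c(1, 3, 5)
p = c(1/6, 1/6, 2/3)
sum(p*x)                                           # exact: 4
mean(sample(x, 100000, replace = TRUE, prob = p))  # simulated average, close to 4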

1.1 Mean and center of mass

You may have wondered why we use the name ‘probability mass function’. Here’s the
reason: if we place an object of mass p(xj ) at position xj for each j, then E(X) is the
position of the center of mass. Let’s recall the latter notion via an example.
Example 5. Suppose we have two masses along the x-axis, mass m1 = 500 at position
x1 = 3 and mass m2 = 100 at position x2 = 6. Where is the center of mass?
answer: Intuitively we know that the center of mass is closer to the larger mass.
[Figure: masses m1 and m2 on the x-axis at positions 3 and 6.]

From physics we know the center of mass is

x̄ = (m1 x1 + m2 x2)/(m1 + m2) = (500 · 3 + 100 · 6)/600 = 3.5.
We call this formula a ‘weighted’ average of the x1 and x2 . Here x1 is weighted more heavily
because it has more mass.
Now look at the definition of expected value E(X). It is a weighted average of the values of
X with the weights being probabilities p(xi ) rather than masses! We might say that “The
expected value is the point at which the distribution would balance”. Note the similarity
between the physics example and Example 1.

1.2 Algebraic properties of E(X)

When we add, scale or shift random variables the expected values do the same. The
shorthand mathematical way of saying this is that E(X) is linear.
1. If X and Y are random variables on a sample space Ω then
E(X + Y ) = E(X) + E(Y )
2. If a and b are constants then
E(aX + b) = aE(X) + b.
We will think of aX + b as scaling X by a and shifting it by b.

Before proving these properties, let’s consider a few examples.

Example 6. Roll two dice and let X be the sum. Find E(X).

answer: Let X1 be the value on the first die and let X2 be the value on the second

die. Since X = X1 + X2 we have E(X) = E(X1 ) + E(X2 ). Earlier we computed that

E(X1 ) = E(X2 ) = 3.5, therefore E(X) = 7.

Example 7. Let X ∼ binomial(n, p). Find E(X).


answer: Recall that X models the number of successes in n Bernoulli(p) random variables,
which we’ll call X1 , . . . Xn . The key fact, which we highlighted in the previous reading for
this class, is that
X = Σ_{j=1}^{n} Xj .

Now we can use the Algebraic Property (1) to make the calculation simple.

X = Σ_{j=1}^{n} Xj   ⇒   E(X) = Σ_{j=1}^{n} E(Xj) = Σ_{j=1}^{n} p = np .

We could have computed E(X) directly as


E(X) = Σ_{k=0}^{n} k p(k) = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^(n−k) .

It is possible to show that the sum of this series is indeed np. We think you’ll agree that
the method using Property (1) is much easier.
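A quick numerical check of E(X) = np in R, with n = 10 and p = 0.3 chosen just for illustration:

n = 10; p = 0.3
sum(0:n * dbinom(0:n, n, p))   # equals n*p = 3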

Example 8. (For infinite random variables the mean does not always exist.) Suppose X
has an infinite number of values according to the following table
values x:   2     2^2     2^3    . . .   2^k     . . .
pmf p(x):  1/2   1/2^2   1/2^3   . . .   1/2^k   . . .

Try to compute the mean.
answer: The mean is

E(X) = Σ_{k=1}^{∞} 2^k · (1/2^k) = Σ_{k=1}^{∞} 1 = ∞.
The mean does not exist! This can happen with infinite series.

1.3 Proofs of the algebraic properties of E(X)

The proof of Property (1) is simple, but there is some subtlety in even understanding what
it means to add two random variables. Recall that the value of a random variable is a number
determined by the outcome of an experiment. To add X and Y means to add the values of
X and Y for the same outcome. In table form this looks like:
outcome ω: ω1 ω2 ω3 ... ωn

value of X: x1 x2 x3 ... xn

value of Y : y1 y2 y3 ... yn

value of X + Y : x1 + y1 x2 + y2 x3 + y3 . . . xn + yn

prob. P (ω): P (ω1 ) P (ω2 ) P (ω3 ) . . . P (ωn )

The proof of (1) follows immediately:


E(X + Y ) = Σ_i (xi + yi)P(ωi) = Σ_i xi P(ωi) + Σ_i yi P(ωi) = E(X) + E(Y ).

The proof of Property (2) only takes one line.


E(aX + b) = Σ_i p(xi)(a xi + b) = a Σ_i p(xi) xi + b Σ_i p(xi) = aE(X) + b.

The b term in the last expression follows because Σ_i p(xi) = 1.


Example 9. Mean of a geometric distribution
Let X ∼ geo(p). Recall this means X takes values k = 0, 1, 2, . . . with probabilities
p(k) = (1 − p)^k p. (X models the number of tails before the first heads in a sequence of
Bernoulli trials.) The mean is given by

E(X) = (1 − p)/p .
To see this requires a clever trick. Mathematicians love this sort of thing and we hope you
are able to follow the logic. In this class we will not ask you to come up with something
like this on an exam.
Here’s the trick: to compute E(X) we have to sum the infinite series

E(X) = Σ_{k=0}^{∞} k(1 − p)^k p.


We know the sum of the geometric series:   Σ_{k=0}^{∞} x^k = 1/(1 − x).

Differentiate both sides:   Σ_{k=0}^{∞} k x^(k−1) = 1/(1 − x)^2.

Multiply by x:   Σ_{k=0}^{∞} k x^k = x/(1 − x)^2.

Replace x by 1 − p:   Σ_{k=0}^{∞} k(1 − p)^k = (1 − p)/p^2.

Multiply by p:   Σ_{k=0}^{∞} k(1 − p)^k p = (1 − p)/p.

This last expression is the mean: E(X) = (1 − p)/p.

Example 10. Flip a fair coin until you get heads for the first time. What is the expected
number of times you flipped tails?
answer: The number of tails before the first head is modeled by X ∼ geo(1/2). From the
previous example, E(X) = (1/2)/(1/2) = 1. This is a surprisingly small number.

Example 11. Michael Jordan, the greatest basketball player ever, made 80% of his free
throws. In a game, what is the expected number he would make before his first miss?
answer: Here is an example where we want the number of successes before the first failure.
Using the neutral language of heads and tails: success is tails (probability 1 − p) and failure
is heads (probability p). Therefore p = .2 and the number of tails (made free throws)
before the first heads (missed free throw) is modeled by X ∼ geo(.2). We saw in Example
9 that this is

E(X) = (1 − p)/p = .8/.2 = 4.
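We can sanity-check this with a simulation in R; rgeom counts failures before the first success, matching our convention:

mean(rgeom(100000, 0.2))   # close to (1 - p)/p = 4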

1.4 Expected values of functions of a random variable

(The change of variables formula.)


If X is a discrete random variable taking values x1 , x2 , . . . and h is a function, then h(X) is
a new random variable. Its expected value is

E(h(X)) = Σ_j h(xj) p(xj).

We illustrate this with several examples.

Example 12. Let X be the value of a roll of one die and let Y = X 2 . Find E(Y ).

answer: Since there are a small number of values we can make a table.

X 1 2 3 4 5 6

Y 1 4 9 16 25 36

prob 1/6 1/6 1/6 1/6 1/6 1/6


Notice the probability for each Y value is the same as that of the corresponding X value.
So,
E(Y ) = E(X^2) = 1^2 · (1/6) + 2^2 · (1/6) + . . . + 6^2 · (1/6) = 15.167.
Example 13. Roll two dice and let X be the sum. Suppose the payoff function is given
by Y = X 2 − 6X + 1. Is this a good bet?
answer: We have E(Y ) = Σ_{j=2}^{12} (j^2 − 6j + 1) p(j), where p(j) = P (X = j).
We show the table, but really we’ll use R to do the calculation.
X 2 3 4 5 6 7 8 9 10 11 12

Y -7 -8 -7 -4 1 8 17 28 41 56 73

prob 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36

Here’s the R code I used to compute E(Y ) = 13.833.


x = 2:12
y = x^2 - 6*x + 1
p = c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)/36
ave = sum(p*y)
It gave ave = 13.833.
To answer the question above: since the expected payoff is positive it looks like a bet worth
taking.

Quiz: If Y = h(X) does E(Y ) = h(E(X))? answer: NO!!! This is not true in general!
Think: Is it true in the previous example?

Quiz: If Y = 3X + 77 does E(Y ) = 3E(X) + 77?

answer: Yes. By property (2), scaling and shifting does behave like this.

Variance of Discrete Random Variables

Class 5, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Be able to compute the variance and standard deviation of a random variable.

2. Understand that standard deviation is a measure of scale or spread.

3. Be able to compute variance using the properties of scaling and linearity.

2 Spread

The expected value (mean) of a random variable is a measure of location or central tendency.
If you had to summarize a random variable with a single number, the mean would be a good
choice. Still, the mean leaves out a good deal of information. For example, the random
variables X and Y below both have mean 0, but their probability mass is spread out about
the mean quite differently.
values X -2 -1 0 1 2 values Y -3 3
pmf p(x) 1/10 2/10 4/10 2/10 1/10 pmf p(y) 1/2 1/2

It’s probably a little easier to see the different spreads in plots of the probability mass
functions. We use bars instead of dots to give a better sense of the mass.
[Figure: pmf’s for two different distributions, both with mean 0.]
In the next section, we will learn how to quantify this spread.

3 Variance and standard deviation

Taking the mean as the center of a random variable’s probability distribution, the variance
is a measure of how much the probability mass is spread out around this center. We’ll start
with the formal definition of variance and then unpack its meaning.
Definition: If X is a random variable with mean E(X) = µ, then the variance of X is
defined by
Var(X) = E((X − µ)2 ).


The standard deviation σ of X is defined by

σ = √Var(X) .

If the relevant random variable is clear from context, then the variance and standard devi­
ation are often denoted by σ 2 and σ (‘sigma’), just as the mean is µ (‘mu’).
What does this mean? First, let’s rewrite the definition explicitly as a sum. If X takes
values x1 , x2 , . . . , xn with probability mass function p(xi ) then
Var(X) = E((X − µ)^2) = Σ_{i=1}^{n} p(xi)(xi − µ)^2 .

In words, the formula for Var(X) says to take a weighted average of the squared distance
to the mean. By squaring, we make sure we are averaging only non-negative values, so that
the spread to the right of the mean won’t cancel that to the left. By using expectation,
we are weighting high probability values more than low probability values. (See Example 2
below.)
Note on units:
1. σ has the same units as X.

2. Var(X) has the same units as the square of X. So if X is in meters, then Var(X) is in

meters squared.

Because σ and X have the same units, the standard deviation is a natural measure of

spread.

Let’s work some examples to make the notion of variance clear.


Example 1. Compute the mean, variance and standard deviation of the random variable
X with the following table of values and probabilities.
value x 1 3 5
pmf p(x) 1/4 1/4 1/2
answer: First we compute E(X) = 7/2. Then we extend the table to include (X − 7/2)2 .
value x 1 3 5
p(x) 1/4 1/4 1/2
(x − 7/2)2 25/4 1/4 9/4
Now the computation of the variance is similar to that of expectation:

Var(X) = (25/4) · (1/4) + (1/4) · (1/4) + (9/4) · (1/2) = 11/4.

Taking the square root we have the standard deviation σ = √(11/4).

Example 2. For each random variable X, Y , Z, and W plot the pmf and compute the
mean and variance.

(i) value x 1 2 3 4 5

pmf p(x)
1/5 1/5 1/5 1/5 1/5
(ii) value y 1 2 3 4 5
pmf p(y) 1/10 2/10 4/10 2/10 1/10

(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(iv) value w 1 2 3 4 5
pmf p(w) 0 0 1 0 0
answer: Each random variable has the same mean 3, but the probability is spread out
differently. In the plots below, we order the pmf’s from largest to smallest variance: Z, X,
Y , W.
[Figure: bar plots of the pmf’s for Z, X, Y and W, ordered from largest to smallest variance.]

Next we’ll verify our visual intuition by computing the variance of each of the variables.
All of them have mean µ = 3. Since the variance is defined as an expected value, we can
compute it using the tables.
(i) value x 1 2 3 4 5
pmf p(x) 1/5 1/5 1/5 1/5 1/5
(X − µ)2 4 1 0 1 4
Var(X) = E((X − µ)^2) = 4/5 + 1/5 + 0 + 1/5 + 4/5 = 2 .

(ii) value y 1 2 3 4 5
p(y) 1/10 2/10 4/10 2/10 1/10
(Y − µ)2 4 1 0 1 4
Var(Y ) = E((Y − µ)^2) = 4/10 + 2/10 + 0 + 2/10 + 4/10 = 1.2 .

(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(Z − µ)2 4 1 0 1 4
Var(Z) = E((Z − µ)^2) = 20/10 + 20/10 = 4 .


(iv) value w 1 2 3 4 5
pmf p(w) 0 0 1 0 0
(W − µ)2 4 1 0 1 4

Var(W ) = 0 . Note that W doesn’t vary, so it has variance 0!

3.1 The variance of a Bernoulli(p) random variable.

Bernoulli random variables are fundamental, so we should know their variance.


If X ∼ Bernoulli(p) then
Var(X) = p(1 − p).

Proof: We know that E(X) = p. We compute Var(X) using a table.


values X 0 1
pmf p(x) 1−p p
(X − µ)^2:   (0 − p)^2   (1 − p)^2

Var(X) = (1 − p)p^2 + p(1 − p)^2 = (1 − p)p(p + 1 − p) = (1 − p)p.

As with all things Bernoulli, you should remember this formula.


Think: For what value of p does Bernoulli(p) have the highest variance? Try to answer
this by plotting the PMF for various p.

3.2 A word about independence

So far we have been using the notion of independent random variable without ever carefully
defining it. For example, a binomial distribution is the sum of independent Bernoulli trials.
This may (should?) have bothered you. Of course, we have an intuitive sense of what inde­
pendence means for experimental trials. We also have the probabilistic sense that random
variables X and Y are independent if knowing the value of X gives you no information
about the value of Y .
In a few classes we will work with continuous random variables and joint probability func­
tions. After that we will be ready for a full definition of independence. For now we can use
the following definition, which is exactly what you expect and is valid for discrete random
variables.
Definition: The discrete random variables X and Y are independent if

P (X = a, Y = b) = P (X = a)P (Y = b)

for any values a, b. That is, the probabilities multiply.

3.3 Properties of variance

The three most useful properties for computing variance are:


1. If X and Y are independent then Var(X + Y ) = Var(X) + Var(Y ).

2. For constants a and b, Var(aX + b) = a2 Var(X).


3. Var(X) = E(X 2 ) − E(X)2 .
For Property 1, note carefully the requirement that X and Y are independent. We will
return to the proof of Property 1 in a later class.

Property 3 gives a formula for Var(X) that is often easier to use in hand calculations. The
computer is happy to use the definition! We’ll prove Properties 2 and 3 after some examples.

Example 3. Suppose X and Y are independent and Var(X) = 3 and Var(Y ) = 5. Find:
(i) Var(X + Y ), (ii) Var(3X + 4), (iii) Var(X + X), (iv) Var(X + 3Y ).
answer: To compute these variances we make use of Properties 1 and 2.
(i) Since X and Y are independent, Var(X + Y ) = Var(X) + Var(Y ) = 8.
(ii) Using Property 2, Var(3X + 4) = 9 · Var(X) = 27.
(iii) Don’t be fooled! Property 1 fails since X is certainly not independent of itself. We can
use Property 2: Var(X + X) = Var(2X) = 4 · Var(X) = 12. (Note: if we mistakenly used
Property 1, we would get the wrong answer of 6.)
(iv) We use both Properties 1 and 2.
Var(X + 3Y ) = Var(X) + Var(3Y ) = 3 + 9 · 5 = 48.

Example 4. Use Property 3 to compute the variance of X ∼ Bernoulli(p).


answer: From the table
X 0 1
p(x) 1−p p
X2 0 1
we have E(X 2 ) = p. So Property 3 gives
Var(X) = E(X 2 ) − E(X)2 = p − p2 = p(1 − p).
This agrees with our earlier calculation.
Example 5. Redo Example 1 using Property 3.
answer: From the table
X:      1    3    5
p(x):  1/4  1/4  1/2
X^2:    1    9    25
we have E(X) = 7/2 and

E(X^2) = 1^2 · (1/4) + 3^2 · (1/4) + 5^2 · (1/2) = 60/4 = 15.

So Var(X) = 15 − (7/2)^2 = 11/4, as before in Example 1.
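Both computations are easy to reproduce in R for the table used in Examples 1 and 5:

x = c(1, 3, 5); p = c(1/4, 1/4, 1/2)
mu = sum(p*x)                 # 7/2
sum(p*(x - mu)^2)             # definition of variance: 11/4
sum(p*x^2) - mu^2             # Property 3: also 11/4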

3.4 Variance of binomial(n,p)

Suppose X ∼ binomial(n, p). Since X is the sum of independent Bernoulli(p) variables and
each Bernoulli variable has variance p(1 − p) we have
X ∼ binomial(n, p) ⇒ Var(X) = np(1 − p).
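A simulation check in R, with n = 20 and p = 0.4 chosen just for illustration:

n = 20; p = 0.4
var(rbinom(100000, n, p))   # close to n*p*(1 - p) = 4.8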

3.5 Proof of properties 2 and 3

Proof of Property 2: This follows from the properties of E(X) and some algebra.
Let µ = E(X). Then E(aX + b) = aµ + b and

Var(aX + b) = E((aX + b − (aµ + b))^2) = E((aX − aµ)^2) = E(a^2 (X − µ)^2) = a^2 E((X − µ)^2) = a^2 Var(X).

Proof of Property 3: We use the properties of E(X) and a bit of algebra. Remember
that µ is a constant and that E(X) = µ.

E((X − µ)^2) = E(X^2 − 2µX + µ^2)
             = E(X^2) − 2µE(X) + µ^2
             = E(X^2) − 2µ^2 + µ^2
             = E(X^2) − µ^2
             = E(X^2) − E(X)^2 .   QED

4 Tables of Distributions and Properties

Distribution      range X           pmf p(x)                            mean E(X)    variance Var(X)

Bernoulli(p)      0, 1              p(0) = 1 − p, p(1) = p              p            p(1 − p)

Binomial(n, p)    0, 1, . . . , n   p(k) = C(n, k) p^k (1 − p)^(n−k)    np           np(1 − p)

Uniform(n)        1, 2, . . . , n   p(k) = 1/n                          (n + 1)/2    (n^2 − 1)/12

Geometric(p)      0, 1, 2, . . .    p(k) = p(1 − p)^k                   (1 − p)/p    (1 − p)/p^2

Let X be a discrete random variable with range x1 , x2 , . . . and pmf p(xj ).


Expected Value:
  Synonyms: mean, average
  Notation: E(X), µ
  Definition: E(X) = Σ_j p(xj) xj
  Scale and shift: E(aX + b) = aE(X) + b
  Linearity (for any X, Y): E(X + Y ) = E(X) + E(Y )
  Functions of X: E(h(X)) = Σ_j p(xj) h(xj)

Variance:
  Notation: Var(X), σ^2
  Definition: Var(X) = E((X − µ)^2) = Σ_j p(xj)(xj − µ)^2
  Scale and shift: Var(aX + b) = a^2 Var(X)
  Additivity (for X, Y independent): Var(X + Y ) = Var(X) + Var(Y )
  Alternative formula: Var(X) = E(X^2) − E(X)^2 = E(X^2) − µ^2
Continuous Random Variables

Class 5, 18.05

Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Know the definition of a continuous random variable.

2. Know the definition of the probability density function (pdf) and cumulative distribution
function (cdf).

3. Be able to explain why we use probability density for continuous random variables.

2 Introduction

We now turn to continuous random variables. All random variables assign a number to
each outcome in a sample space. Whereas discrete random variables take on a discrete set
of possible values, continuous random variables have a continuous set of values.
Computationally, to go from discrete to continuous we simply replace sums by integrals. It
will help you to keep in mind that (informally) an integral is just a continuous sum.
Example 1. Since time is continuous, the amount of time Jon is early (or late) for class is
a continuous random variable. Let’s go over this example in some detail.
Suppose you measure how early Jon arrives to class each day (in units of minutes). That
is, the outcome of one trial in our experiment is a time in minutes. We’ll assume there are
random fluctuations in the exact time he shows up. Since in principle Jon could arrive, say,
3.43 minutes early, or 2.7 minutes late (corresponding to the outcome -2.7), or at any other
time, the sample space consists of all real numbers. So the random variable which gives the
outcome itself has a continuous range of possible values.
It is too cumbersome to keep writing ‘the random variable’, so in future examples we might
write: Let T = “time in minutes that Jon is early for class on any given day.”

3 Calculus Warmup

While we will assume you can compute the most familiar forms of derivatives and integrals
by hand, we do not expect you to be calculus whizzes. For tricky expressions, we’ll let the
computer do most of the calculating. Conceptually, you should be comfortable with two
views of a definite integral.

1. ∫_a^b f(x) dx = area under the curve y = f(x).

2. ∫_a^b f(x) dx = ‘sum of f(x) dx’.


The connection between the two is:


area ≈ sum of rectangle areas = f(x1)Δx + f(x2)Δx + . . . + f(xn)Δx = Σ_{i=1}^{n} f(xi)Δx.
As the width Δx of the intervals gets smaller the approximation becomes better.
[Figure: the area under y = f(x) from a to b is approximately the sum of rectangle areas f(xi)Δx.]

Note: In calculus you learned to compute integrals by finding antiderivatives. This is


important for calculations, but don’t confuse this method for the reason we use integrals.
Our interest in integrals comes primarily from its interpretation as a ‘sum’ and to a much
lesser extent its interpretation as area.

4 Continuous Random Variables and Probability Density Functions

A continuous random variable takes a range of values, which may be finite or infinite in
extent. Here are a few examples of ranges: [0, 1], [0, ∞), (−∞, ∞), [a, b].
Definition: A random variable X is continuous if there is a function f (x) such that for
any c ≤ d we have
P (c ≤ X ≤ d) = ∫_c^d f(x) dx.     (1)
The function f (x) is called the probability density function (pdf).
The pdf always satisfies the following properties:

1. f (x) ≥ 0 (f is nonnegative).
2. ∫_{−∞}^{∞} f(x) dx = 1 (This is equivalent to: P (−∞ < X < ∞) = 1).

The probability density function f (x) of a continuous random variable is the analogue of
the probability mass function p(x) of a discrete random variable. Here are two important
differences:

1. Unlike p(x), the pdf f (x) is not a probability. You have to integrate it to get proba­
bility. (See section 4.2 below.)

2. Since f (x) is not a probability, there is no restriction that f (x) be less than or equal
to 1.

Note: In Property 2, we integrated over (−∞, ∞) since we did not know the range of values
taken by X. Formally, this makes sense because we just define f (x) to be 0 outside of the
range of X. In practice, we would integrate between bounds given by the range of X.

4.1 Graphical View of Probability

If you graph the probability density function of a continuous random variable X then
P (c ≤ X ≤ d) = area under the graph between c and d.

[Figure: P(c ≤ X ≤ d) shown as the shaded area under the graph of f(x) between c and d.]

Think: What is the total area under the pdf f (x)?

4.2 The terms ‘probability mass’ and ‘probability density’

Why do we use the terms mass and density to describe the pmf and pdf? What is the
difference between the two? The simple answer is that these terms are completely analogous
to the mass and density you saw in physics and calculus. We’ll review this first for the
probability mass function and then discuss the probability density function.
Mass as a sum:
If masses m1 , m2 , m3 , and m4 are set in a row at positions x1 , x2 , x3 , and x4 , then the
total mass is m1 + m2 + m3 + m4 .
[Figure: masses m1, m2, m3, m4 at positions x1, x2, x3, x4 on the x-axis.]

We can define a ‘mass function’ p(x) with p(xj ) = mj for j = 1, 2, 3, 4, and p(x) = 0
otherwise. In this notation the total mass is p(x1 ) + p(x2 ) + p(x3 ) + p(x4 ).
The probability mass function behaves in exactly the same way, except it has the dimension
of probability instead of mass.
Mass as an integral of density:
Suppose you have a rod of length L meters with varying density f (x) kg/m. (Note the units
are mass/length.)

[Figure: a rod from 0 to L divided into pieces of width Δx; mass of ith piece ≈ f(xi)Δx.]

If the density varies continuously, we must find the total mass of the rod by integration:

total mass = ∫_0^L f(x) dx.

This formula comes from dividing the rod into small pieces and ‘summing’ up the mass of
each piece. That is:

total mass ≈ Σ_{i=1}^{n} f(xi) Δx
In the limit as Δx goes to zero the sum becomes the integral.
The probability density function behaves exactly the same way, except it has units of
probability/(unit x) instead of kg/m. Indeed, equation (1) is exactly analogous to the
above integral for total mass.
While we’re on a physics kick, note that for both discrete and continuous random variables,
the expected value is simply the center of mass or balance point.

Example 2. Suppose X has pdf f (x) = 3 on [0, 1/3] (this means f (x) = 0 outside of
[0, 1/3]). Graph the pdf and compute P (.1 ≤ X ≤ .2) and P (.1 ≤ X ≤ 1).
answer: P (.1 ≤ X ≤ .2) is shown below at left. We can compute the integral:
P (.1 ≤ X ≤ .2) = ∫_{.1}^{.2} f(x) dx = ∫_{.1}^{.2} 3 dx = .3.

Or we can find the area geometrically:

area of rectangle = 3 · .1 = .3.

P (.1 ≤ X ≤ 1) is shown below at right. Since there is only area under f (x) up to 1/3, we
have P (.1 ≤ X ≤ 1) = 3 · (1/3 − .1) = .7.

[Figure: left, P(.1 ≤ X ≤ .2) as an area under f(x); right, P(.1 ≤ X ≤ 1) as an area under f(x).]

Think: In the previous example f (x) takes values greater than 1. Why does this not
violate the rule that probabilities are always between 0 and 1?

Note on notation. We can define a random variable by giving its range and probability
density function. For example we might say, let X be a random variable with range [0,1]

and pdf f (x) = x/2. Implicitly, this means that X has no probability density outside of the
given range. If we wanted to be absolutely rigorous, we would say explicitly that f (x) = 0
outside of [0,1], but in practice this won’t be necessary.
Example 3. Let X be a random variable with range [0,1] and pdf f (x) = Cx2 . What is
the value of C?
answer: Since the total probability must be 1, we have
∫_0^1 f(x) dx = 1   ⇔   ∫_0^1 Cx^2 dx = 1.

By evaluating the integral, the equation at right becomes

C/3 = 1 ⇒ C=3.

Note: We say the constant C above is needed to normalize the density so that the total
probability is 1.

Example 4. Let X be the random variable in Example 3. Find P (X ≤ 1/2).

answer: P (X ≤ 1/2) = ∫_0^{1/2} 3x^2 dx = [x^3]_0^{1/2} = 1/8.

Think: For this X (or any continuous random variable):

• What is P (a ≤ X ≤ a)?

• What is P (X = 0)?

• Does P (X = a) = 0 mean that X can never equal a?

In words the above questions get at the fact that the probability that a random person’s
height is exactly 5’9” (to infinite precision, i.e. no rounding!) is 0. Yet it is still possible
that someone’s height is exactly 5’9”. So the answers to the thinking questions are 0, 0,
and No.
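R’s integrate function will do these integrals numerically, which is a handy check on Examples 3 and 4:

f = function(x) 3*x^2
integrate(f, 0, 1)      # total probability: 1
integrate(f, 0, 0.5)    # P(X <= 1/2) = 0.125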

4.3 Cumulative Distribution Function

The cumulative distribution function (cdf ) of a continuous random variable X is defined


in exactly the same way as the cdf of a discrete random variable.

F (b) = P (X ≤ b).

Note well that the definition is about probability. When using the cdf you should first think
of it as a probability. Then when you go to calculate it you can use
F (b) = P (X ≤ b) = ∫_{−∞}^{b} f(x) dx,   where f(x) is the pdf of X.

Notes:
1. For discrete random variables, we defined the cumulative distribution function but did

not have much occasion to use it. The cdf plays a far more prominent role for continuous
random variables.
2. As before, we started the integral at −∞ because we did not know the precise range of
X. Formally, this still makes sense since f (x) = 0 outside the range of X. In practice, we’ll
know the range and start the integral at the start of the range.
3. In practice we often say ‘X has distribution F (x)’ rather than ‘X has cumulative distri­
bution function F (x).’

Example 5. Find the cumulative distribution function for the density in Example 2.
answer: For a in [0, 1/3] we have F (a) = ∫_0^a f(x) dx = ∫_0^a 3 dx = 3a.
Since f (x) is 0 outside of [0, 1/3] we know F (a) = P (X ≤ a) = 0 for a < 0 and F (a) = 1
for a > 1/3. Putting this all together we have

F (a) = 0 for a < 0,   F (a) = 3a for 0 ≤ a ≤ 1/3,   F (a) = 1 for 1/3 < a.

Here are the graphs of f (x) and F (x).

[Figure: graphs of the pdf f(x) and cdf F(x) for Example 5.]

Note the different scales on the vertical axes. Remember that the vertical axis for the pdf
represents probability density and that of the cdf represents probability.

Example 6. Find the cdf for the pdf in Example 3, f (x) = 3x2 on [0, 1]. Suppose X is a
random variable with this distribution. Find P (X < 1/2).
answer: f(x) = 3x^2 on [0, 1] ⇒ F (a) = ∫_0^a 3x^2 dx = a^3 on [0, 1]. Therefore,

F (a) = 0 for a < 0,   F (a) = a^3 for 0 ≤ a ≤ 1,   F (a) = 1 for 1 < a.

Thus, P (X < 1/2) = F (1/2) = 1/8. Here are the graphs of f (x) and F (x):

[Figure: graphs of the pdf f(x) and cdf F(x) for Example 6.]


4.4 Properties of cumulative distribution functions

Here is a summary of the most important properties of cumulative distribution functions (cdf):

1. (Definition) F (x) = P (X ≤ x)
2. 0 ≤ F (x) ≤ 1
3. F (x) is non-decreasing, i.e. if a ≤ b then F (a) ≤ F (b).
4. lim F (x) = 1 as x → ∞, and lim F (x) = 0 as x → −∞

5. P (a ≤ X ≤ b) = F (b) − F (a)
6. F′(x) = f (x).

Properties 2, 3, 4 are identical to those for discrete distributions. The graphs in the previous
examples illustrate them.
Property 5 can be seen algebraically:
∫_{−∞}^{b} f(x) dx = ∫_{−∞}^{a} f(x) dx + ∫_{a}^{b} f(x) dx
⇔ ∫_{a}^{b} f(x) dx = ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx
⇔ P (a ≤ X ≤ b) = F (b) − F (a).
Property 5 can also be seen geometrically. The orange region below represents F (b) and
the striped region represents F (a). Their difference is P (a ≤ X ≤ b).

[Figure: regions under the pdf illustrating F(b), F(a), and their difference P(a ≤ X ≤ b).]

Property 6 is the fundamental theorem of calculus.

4.5 Probability density as a dartboard

We find it helpful to think of sampling values from a continuous random variable as throw-
ing darts at a funny dartboard. Consider the region underneath the graph of a pdf as a
dartboard. Divide the board into small equal size squares and suppose that when you throw
a dart you are equally likely to land in any of the squares. The probability the dart lands
in a given region is the fraction of the total area under the curve taken up by the region.
Since the total area equals 1, this fraction is just the area of the region. If X represents
the x-coordinate of the dart, then the probability that the dart lands with x-coordinate
between a and b is just
P (a ≤ X ≤ b) = area under f(x) between a and b = ∫_a^b f(x) dx.
Gallery of Continuous Random Variables

Class 5, 18.05, Spring 2014

Jeremy Orloff and Jonathan Bloom

1 Learning Goals
1. Be able to give examples of what uniform, exponential and normal distributions are used
to model.

2. Be able to give the range and pdf’s of uniform, exponential and normal distributions.

2 Introduction

Here we introduce a few fundamental continuous distributions. These will play important
roles in the statistics part of the class. For each distribution, we give the range, the pdf,
the cdf, and a short description of situations that it models. These distributions all depend
on parameters, which we specify.
As you look through each distribution do not try to memorize all the details; you can always
look those up. Rather, focus on the shape of each distribution and what it models.
Although it comes towards the end, we call your attention to the normal distribution. It is
easily the most important distribution defined here.

3 Uniform distribution
1. Parameters: a, b.

2. Range: [a, b].

3. Notation: uniform(a, b) or U(a, b).

4. Density: f (x) = 1/(b − a) for a ≤ x ≤ b.
5. Distribution: F (x) = (x − a)/(b − a) for a ≤ x ≤ b.

6. Models: All outcomes in the range have equal probability (more precisely all out­
comes have the same probability density).

Graphs: [Figure: pdf and cdf for the uniform(a, b) distribution.]


Examples. 1. Suppose we have a tape measure with markings at each millimeter. If we


measure (to the nearest marking) the length of items that are roughly a meter long, the
rounding error will be uniformly distributed between -0.5 and 0.5 millimeters.
2. Many boardgames use spinning arrows (spinners) to introduce randomness. When spun,
the arrow stops at an angle that is uniformly distributed between 0 and 2π radians.
3. In most pseudo-random number generators, the basic generator simulates a uniform
distribution and all other distributions are constructed by transforming the basic generator.
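In R the uniform density and distribution are dunif and punif. For example, with a = 0 and b = 2 (values chosen just for illustration):

dunif(1.5, 0, 2)    # density 1/(b - a) = 0.5
punif(1.5, 0, 2)    # F(1.5) = (1.5 - 0)/(2 - 0) = 0.75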

4 Exponential distribution

1. Parameter: λ.

2. Range: [0, ∞).

3. Notation: exponential(λ) or exp(λ).

4. Density: f (x) = λe−λx for 0 ≤ x.

5. Distribution: (easy integral)

F (x) = 1 − e−λx for x ≥ 0

6. Right tail distribution: P (X > x) = 1 − F (x) = e−λx .

7. Models: The waiting time for a continuous process to change state.

Examples. 1. If I step out to 77 Mass Ave after class and wait for the next taxi, my
waiting time in minutes is exponentially distributed. We will see that in this case λ is given
by one over the average number of taxis that pass per minute (on weekday afternoons).
2. The exponential distribution models the waiting time until an unstable isotope undergoes
nuclear decay. In this case, the value of λ is related to the half-life of the isotope.

Memorylessness: There are other distributions that also model waiting times, but the
exponential distribution has the additional property that it is memoryless. Here’s what
this means in the context of Example 1. Suppose that the probability that a taxi arrives
within the first five minutes is p. If I wait five minutes and in fact no taxi arrives, then the
probability that a taxi arrives within the next five minutes is still p.
By contrast, suppose I were to instead go to Kendall Square subway station and wait for
the next inbound train. Since the trains are coordinated to follow a schedule (e.g., roughly
12 minutes between trains), if I wait five minutes without seeing a train then there is a far
greater probability that a train will arrive in the next five minutes. In particular, waiting
time for the subway is not memoryless, and a better model would be the uniform distribution
on the range [0,12].
The memorylessness of the exponential distribution is analogous to the memorylessness
of the (discrete) geometric distribution, where having flipped 5 tails in a row gives no
information about the next 5 flips. Indeed, the exponential distribution is the precisely the

continuous counterpart of the geometric distribution, which models the waiting time for a
discrete process to change state. More formally, memoryless means that the probability of
waiting t more minutes is unaffected by having already waited s minutes without incident.
In symbols, P (X > s + t | X > s) = P (X > t).
Proof of memorylessness: Since (X > s + t) ∩ (X > s) = (X > s + t) we have

P (X > s + t | X > s) = P (X > s + t)/P (X > s) = e^(−λ(s+t))/e^(−λs) = e^(−λt) = P (X > t).   QED
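We can also see memorylessness numerically in R using pexp; the values of λ, s and t below are arbitrary illustrative choices:

lambda = 0.5; s = 2; t = 3
(1 - pexp(s + t, lambda)) / (1 - pexp(s, lambda))   # P(X > s + t | X > s)
1 - pexp(t, lambda)                                 # P(X > t): the same number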

Graphs:

5 Normal distribution

In 1809, Carl Friedrich Gauss published a monograph introducing several notions that have
become fundamental to statistics: the normal distribution, maximum likelihood estimation,
and the method of least squares (we will cover all three in this course). For this reason,
the normal distribution is also called the Gaussian distribution, and it is the most important
continuous distribution.

1. Parameters: μ, σ.

2. Range: (−∞, ∞).

3. Notation: normal(μ, σ 2 ) or N (μ, σ 2 ).

4. Density: f (x) = (1/(σ√(2π))) e^(−(x−μ)^2/(2σ^2)).
5. Distribution: F (x) has no formula, so use tables or software such as pnorm in R to
compute F (x).

6. Models: Measurement error, intelligence/ability, height, averages of lots of data.



The standard normal distribution N (0, 1) has mean 0 and variance 1. We reserve Z for
a standard normal random variable, φ(z) = (1/√(2π)) e^(−z^2/2) for the standard normal density,
and Φ(z) for the standard normal distribution.
Note: we will define mean and variance for continuous random variables next time. They
have the same interpretations as in the discrete case. As you might guess, the normal
distribution N (μ, σ 2 ) has mean μ, variance σ 2 , and standard deviation σ.
Here are some graphs of normal distributions. Note they are shaped like a bell curve. Note
also that as σ increases they become more spread out.
Graphs (the bell curve): [Figure: bell-curve graphs of normal pdfs; larger σ gives a more spread-out curve.]

5.1 Normal probabilities

To make approximations it is useful to remember the following rule of thumb for three
approximate probabilities
P (−1 ≤ Z ≤ 1) ≈ .68, P (−2 ≤ Z ≤ 2) ≈ .95, P (−3 ≤ Z ≤ 3) ≈ .99
[Figure: normal pdf with the areas within 1σ ≈ 68%, within 2σ ≈ 95%, and within 3σ ≈ 99% marked.]

Symmetry calculations
We can use the symmetry of the standard normal distribution about x = 0 to make some
calculations.
Example 1. The rule of thumb says P (−1 ≤ Z ≤ 1) ≈ .68. Use this to estimate Φ(1).
answer: Φ(1) = P (Z ≤ 1). In the figure, the two tails (in red) have combined area 1-.68 =
.32. By symmetry the left tail has area .16 (half of .32), so P (Z ≤ 1) ≈ .68 + .16 = .84.

[Figure: standard normal pdf split into P(Z ≤ −1) = .16, P(−1 ≤ Z ≤ 1) = .68 (two halves of .34), and P(Z ≥ 1) = .16.]

5.2 Using R to compute Φ(z).

# Use the R function pnorm(x, μ, σ) to compute F (x) for N(μ, σ 2 )


pnorm(1,0,1)
[1] 0.8413447

pnorm(0,0,1)
[1] 0.5
pnorm(1,0,2)
[1] 0.6914625

pnorm(1,0,1) - pnorm(-1,0,1)
[1] 0.6826895

pnorm(5,0,5) - pnorm(-5,0,5)

[1] 0.6826895

# Of course z can be a vector of values


pnorm(c(-3,-2,-1,0,1,2,3),0,1)
[1] 0.001349898 0.022750132 0.158655254 0.500000000 0.841344746 0.977249868 0.998650102

Note: The R function pnorm(x, μ, σ) uses σ whereas our notation for the normal distri­
bution N(μ, σ 2 ) uses σ 2 .
Here’s a table of values with fewer decimal points of accuracy
z: -2 -1 0 .3 .5 1 2 3

Φ(z): 0.0228 0.1587 0.5000 0.6179 0.6915 0.8413 0.9772 0.9987

Example 2. Use R to compute P (−1.5 ≤ Z ≤ 2).


answer: This is Φ(2) − Φ(−1.5) = pnorm(2,0,1) - pnorm(-1.5,0,1) = 0.91044

6 Pareto and other distributions

In 18.05, we only have time to work with a few of the many wonderful distributions that are
used in probability and statistics. We hope that after this course you will feel comfortable
learning about new distributions and their properties when you need them. Wikipedia is
often a great starting point.
The Pareto distribution is one common, beautiful distribution that we will not have time
to cover in depth.

1. Parameters: m > 0 and α > 0.

2. Range: [m, ∞).

3. Notation: Pareto(m, α).

4. Density: f (x) = α m^α / x^(α+1).
5. Distribution: (easy integral)

   F (x) = 1 − (m/x)^α, for x ≥ m.

6. Tail distribution: P (X > x) = m^α/x^α, for x ≥ m.

7. Models: The Pareto distribution models a power law, where the probability that
an event occurs varies as a power of some attribute of the event. Many phenomena
follow a power law, such as the size of meteors, income levels across a population, and
population levels across cities. See Wikipedia for loads of examples:
http://en.wikipedia.org/wiki/Pareto_distribution#Applications
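Base R has no Pareto functions, but the cdf above is easy to code yourself; ppareto below is our own small helper (not a built-in), with m = 1 and α = 2 chosen just for illustration:

# Our own helper based on the cdf above (not part of base R).
ppareto = function(x, m, alpha) ifelse(x < m, 0, 1 - (m/x)^alpha)
ppareto(c(1, 2, 4), m = 1, alpha = 2)   # 0.0000 0.7500 0.9375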
Manipulating Continuous Random Variables

Class 5, 18.05, Spring 2014

Jeremy Orloff and Jonathan Bloom

1 Learning Goals
1. Be able to find the pdf and cdf of a random variable defined in terms of a random variable
with known pdf and cdf.

2 Transformations of Random Variables

If Y = aX +b then the properties of expectation and variance tell us that E(Y ) = aE(X)+b
and Var(Y ) = a2 Var(X). But what is the distribution function of Y ? If Y is continuous,
what is its pdf?
Often, when looking at transforms of discrete random variables we work with tables.
For continuous random variables transforming the pdf is just change of variables (‘u-substitution’)
from calculus. Transforming the cdf makes direct use of the definition of the cdf.
Let’s remind ourselves of the basics:
1. The cdf of X is FX (x) = P (X ≤ x).
2. The pdf of X is related to FX by fX (x) = FX' (x).

Example 1. Let X ∼ U (0, 2), so fX (x) = 1/2 and FX (x) = x/2 on [0,2]. What is the
range, pdf and cdf of Y = X 2 ?

answer: The range is easy: [0, 4].

To find the cdf we work systematically from the definition.

FY (y) = P (Y ≤ y) = P (X² ≤ y) = P (X ≤ √y) = FX (√y) = √y/2.

To find the pdf we can just differentiate the cdf:

fY (y) = d/dy FY (y) = 1/(4√y).

An alternative way to find the pdf directly is by change of variables. The trick here is to
remember that it is fX (x)dx which gives probability (fX (x) by itself is probability density).
Here is how the calculation goes in this example.
y = x² ⇒ dy = 2x dx ⇒ dx = dy/(2√y)

fX (x) dx = dx/2 = dy/(4√y) = fY (y) dy

Therefore fY (y) = 1/(4√y).
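We can sanity-check this answer with a short simulation in R (just an illustration, not part of the derivation): draw many values of X ∼ U(0, 2), square them, and compare the density histogram of Y with fY (y) = 1/(4√y).

x <- runif(100000, 0, 2)
y <- x^2
hist(y, breaks = 50, freq = FALSE)                      # density histogram of Y = X^2
curve(1/(4*sqrt(x)), 0.01, 4, add = TRUE, col = "red")  # the pdf 1/(4*sqrt(y)) in red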


Example 2. Let X ∼ exp(λ), so fX (x) = λe^{−λx} on [0, ∞). What is the density of Y = X²?
answer: Let’s do this using the change of variables.
y = x² ⇒ dy = 2x dx ⇒ dx = dy/(2√y)

fX (x) dx = λe^{−λx} dx = λe^{−λ√y} · dy/(2√y) = fY (y) dy

Therefore fY (y) = (λ/(2√y)) e^{−λ√y} .

Example 3. Assume X ∼ N(5, 3²). Show that Z = (X − 5)/3 is standard normal, i.e., Z ∼ N(0, 1).
answer: Again using the change of variables and the formula for fX (x) we have

z = (x − 5)/3 ⇒ dz = dx/3 ⇒ dx = 3 dz

fX (x) dx = (1/(3√2π)) e^{−(x−5)²/(2·3²)} dx = (1/(3√2π)) e^{−z²/2} · 3 dz = (1/√2π) e^{−z²/2} dz = fZ (z) dz

Therefore fZ (z) = (1/√2π) e^{−z²/2} . Since this is exactly the density for N(0, 1) we have shown that Z is standard normal.

This example shows an important general property of normal random variables which we
give in the next example.
Example 4. Assume X ∼ N(μ, σ²). Show that Z = (X − μ)/σ is standard normal, i.e., Z ∼ N(0, 1).
answer: This is exactly the same computation as the previous example with μ replacing 5
and σ replacing 3. We show the computation without comment.
z = (x − μ)/σ ⇒ dz = dx/σ ⇒ dx = σ dz

fX (x) dx = (1/(σ√2π)) e^{−(x−μ)²/(2σ²)} dx = (1/(σ√2π)) e^{−z²/2} · σ dz = (1/√2π) e^{−z²/2} dz = fZ (z) dz

Therefore fZ (z) = (1/√2π) e^{−z²/2} . This shows Z is standard normal.

Expectation, Variance and Standard Deviation for
Continuous Random Variables
Class 6, 18.05
Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Be able to compute and interpret expectation, variance, and standard deviation for
continuous random variables.

2. Be able to compute and interpret quantiles for discrete and continuous random variables.

2 Introduction

So far we have looked at expected value, standard deviation, and variance for discrete
random variables. These summary statistics have the same meaning for continuous random
variables:

• The expected value µ = E(X) is a measure of location or central tendency.

• The standard deviation σ is a measure of the spread or scale.

• The variance σ 2 = Var(X) is the square of the standard deviation.

To move from discrete to continuous, we will simply replace the sums in the formulas by
integrals. We will do this carefully and go through many examples in the following sections.
In the last section, we will introduce another type of summary statistic, quantiles. You may
already be familiar with the .5 quantile of a distribution, otherwise known as the median
or 50th percentile.

3 Expected value of a continuous random variable

Definition: Let X be a continuous random variable with range [a, b] and probability
density function f (x). The expected value of X is defined by
E(X) = ∫_a^b x f (x) dx.

Let’s see how this compares with the formula for a discrete random variable:
E(X) = Σ_{i=1}^n xi p(xi ).

The discrete formula says to take a weighted sum of the values xi of X, where the weights are
the probabilities p(xi ). Recall that f (x) is a probability density. Its units are prob/(unit of X).


So f (x) dx represents the probability that X is in an infinitesimal range of width dx around


x. Thus we can interpret the formula for E(X) as a weighted integral of the values x of X,
where the weights are the probabilities f (x) dx.
As before, the expected value is also called the mean or average.

3.1 Examples

Let’s go through several example computations. Where the solution requires an integration
technique, we push the computation of the integral to the appendix.
Example 1. Let X ∼ uniform(0, 1). Find E(X).
answer: X has range [0, 1] and density f (x) = 1. Therefore,
E(X) = ∫_0^1 x dx = x²/2 |_0^1 = 1/2.
Not surprisingly the mean is at the midpoint of the range.

Example 2. Let X have range [0, 2] and density f (x) = (3/8)x². Find E(X).


answer:
E(X) = ∫_0^2 x f (x) dx = ∫_0^2 (3/8)x³ dx = 3x⁴/32 |_0^2 = 3/2.
Does it make sense that this X has its mean in the right half of its range?
answer: Yes. Since the probability density increases as x increases over the range, the
average value of x should be in the right half of the range.
The graph of f (x) = (3/8)x² on [0, 2], with the mean µ = 1.5 marked.

µ is “pulled” to the right of the midpoint 1 because there is more mass to the right.
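If you want a numerical check of an expectation, R's integrate function will do the integral. For instance, for this density (a quick check, not a replacement for the computation):

f <- function(x) (3/8)*x^2                # the pdf on [0, 2]
integrate(function(x) x*f(x), 0, 2)       # E(X); returns 1.5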

Example 3. Let X ∼ exp(λ). Find E(X).


answer: The range of X is [0, ∞) and its pdf is f (x) = λe−λx . So (details in appendix)
E(X) = ∫_0^∞ x λe^{−λx} dx = [−xe^{−λx} − e^{−λx}/λ]_0^∞ = 1/λ.

The pdf f (x) = λe^{−λx}; the mean of an exponential random variable is µ = 1/λ.

Example 4. Let Z ∼ N(0, 1). Find E(Z).


answer: The range of Z is (−∞, ∞) and its pdf is φ(z) = (1/√2π) e^{−z²/2}. So (details in the appendix)

E(Z) = ∫_{−∞}^∞ (1/√2π) z e^{−z²/2} dz = −(1/√2π) e^{−z²/2} |_{−∞}^∞ = 0.


The standard normal distribution is symmetric and has mean 0.

3.2 Properties of E(X)

The properties of E(X) for continuous random variables are the same as for discrete ones:
1. If X and Y are random variables on a sample space Ω then
E(X + Y ) = E(X) + E(Y ). (linearity I)
2. If a and b are constants then
E(aX + b) = aE(X) + b. (linearity II)

Example 5. In this example we verify that for X ∼ N(µ, σ²) we have E(X) = µ.


answer: Example (4) showed that for standard normal Z, E(Z) = 0. We could mimic
the calculation there to show that E(X) = µ. Instead we will use the linearity properties
of E(X). In the class 5 notes on manipulating random variables we showed that if X ∼
N(µ, σ 2 ) is a normal random variable we can standardize it:
Z = (X − µ)/σ ∼ N(0, 1).

Inverting this formula we have X = σZ + µ. The linearity of expected value now gives

E(X) = E(σZ + µ) = σE(Z) + µ = µ.

3.3 Expectation of Functions of X

This works exactly the same as in the discrete case. If h(x) is a function, then Y = h(X) is a random variable and

E(Y ) = E(h(X)) = ∫_{−∞}^∞ h(x)fX (x) dx.

Example 6. Let X ∼ exp(λ). Find E(X 2 ).


answer: Using integration by parts we have
E(X²) = ∫_0^∞ x² λe^{−λx} dx = [−x²e^{−λx} − (2x/λ)e^{−λx} − (2/λ²)e^{−λx}]_0^∞ = 2/λ².

4 Variance

Now that we’ve defined expectation for continuous random variables, the definition of vari-
ance is identical to that of discrete random variables.
Definition: Let X be a continuous random variable with mean µ. The variance of X is

Var(X) = E((X − µ)2 ).

4.1 Properties of Variance

These are exactly the same as in the discrete case.


1. If X and Y are independent then Var(X + Y ) = Var(X) + Var(Y ).
2. For constants a and b, Var(aX + b) = a2 Var(X).
3. Theorem: Var(X) = E(X 2 ) − E(X)2 = E(X 2 ) − µ2 .
For Property 1, note carefully the requirement that X and Y are independent.
Property 3 gives a formula for Var(X) that is often easier to use in hand calculations. The
proofs of properties 2 and 3 are essentially identical to those in the discrete case. We will
not give them here.

Example 7. Let X ∼ uniform(0, 1). Find Var(X) and σX .


answer: In Example 1 we found µ = 1/2. Next we compute
Var(X) = E((X − µ)²) = ∫_0^1 (x − 1/2)² dx = 1/12.

Example 8. Let X ∼ exp(λ). Find Var(X) and σX .


answer: In Examples 3 and 6 we computed
E(X) = ∫_0^∞ xλe^{−λx} dx = 1/λ and E(X²) = ∫_0^∞ x²λe^{−λx} dx = 2/λ².

So by Property 3,

Var(X) = E(X²) − E(X)² = 2/λ² − 1/λ² = 1/λ², and σX = 1/λ.
We could have skipped Property 3 and computed this directly from Var(X) = ∫_0^∞ (x − 1/λ)² λe^{−λx} dx.
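As a numerical sanity check we can ask R to do that last integral for a specific λ, say λ = 2 (the answer should be 1/λ² = 0.25):

lambda <- 2
integrate(function(x) (x - 1/lambda)^2 * lambda*exp(-lambda*x), 0, Inf)  # Var(X) = 0.25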

Example 9. Let Z ∼ N(0, 1). Show Var(Z) = 1.


Note: The notation for normal variables is X ∼ N(µ, σ 2 ). This is certainly suggestive, but
as mathematicians we need to prove that E(X) = µ and Var(X) = σ 2 . Above we showed
E(X) = µ. This example shows that Var(Z) = 1, just as the notation suggests. In the next
example we’ll show Var(X) = σ 2 .
answer: Since E(Z) = 0, we have

Var(Z) = E(Z²) = (1/√2π) ∫_{−∞}^∞ z² e^{−z²/2} dz.
(using integration by parts with u = z, v′ = ze^{−z²/2} ⇒ u′ = 1, v = −e^{−z²/2})

= (1/√2π) [−ze^{−z²/2}]_{−∞}^∞ + (1/√2π) ∫_{−∞}^∞ e^{−z²/2} dz.

The first term equals 0 because the exponential goes to zero much faster than z grows at both ±∞. The second term equals 1 because it is exactly the total probability integral of the pdf φ(z) for N(0, 1). So Var(Z) = 1.

Example 10. Let X ∼ N(µ, σ 2 ). Show Var(X) = σ 2 .


answer: This is an exercise in change of variables. Letting z = (x − µ)/σ, we have
Var(X) = E((X − µ)²) = (1/(σ√2π)) ∫_{−∞}^∞ (x − µ)² e^{−(x−µ)²/(2σ²)} dx

= (σ²/√2π) ∫_{−∞}^∞ z² e^{−z²/2} dz = σ².

The integral in the last line is the same one we computed for Var(Z).

5 Quantiles

Definition: The median of X is the value x for which P (X ≤ x) = 0.5, i.e. the value
of x such that P (X ≤ x) = P (X ≥ x). In other words, X has equal probability of
being above or below the median, and each probability is therefore 1/2. In terms of the
cdf F (x) = P (X ≤ x), we can equivalently define the median as the value x satisfying
F (x) = 0.5.
Think: What is the median of Z?
answer: By symmetry, the median is 0.

Example 11. Find the median of X ∼ exp(λ).


answer: The cdf of X is F (x) = 1 − e−λx . So the median is the value of x for which
F (x) = 1 − e^{−λx} = 0.5. Solving for x we find x = (ln 2)/λ.
Think: In this case the median does not equal the mean µ = 1/λ. Based on the graph of the pdf of X, can you argue why the median is to the left of the mean?
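R's quantile functions make this easy to check numerically, e.g. with λ = 2:

qexp(0.5, rate = 2)   # median of exp(2): 0.3465736
log(2)/2              # (ln 2)/lambda -- the same value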

Definition: The pth quantile of X is the value qp such that P (X ≤ qp ) = p.


Notes. 1. In this notation the median is q0.5 .
2. We will usually write this in terms of the cdf: F (qp ) = p.
With respect to the pdf f (x), the quantile qp is the value such that there is an area of p to
the left of qp and an area of 1 − p to the right of qp . In the examples below, note how we
can represent the quantile graphically using either area of the pdf or height of the cdf.

Example 12. Find the 0.6 quantile for X ∼ U (0, 1).



answer: The cdf for X is F (x) = x on the range [0,1]. So q0.6 = 0.6.
On the left, the pdf f (x) with left tail area 0.6 at q0.6 = 0.6; on the right, the cdf F (x) with F (q0.6 ) = 0.6.


Example 13. Find the 0.6 quantile of the standard normal distribution.
answer: We don’t have a formula for the cdf, so we use the R ‘quantile function’ qnorm.

q0.6 = qnorm(0.6, 0, 1) = 0.25335

On the left, the standard normal pdf φ(z) with left tail area 0.6 at q0.6 ≈ 0.253; on the right, the cdf Φ(z) with Φ(q0.6 ) = 0.6.

Quantiles give a useful measure of location for a random variable. We will use them more
in coming lectures.

5.1 Percentiles, deciles, quartiles

For convenience, quantiles are often described in terms of percentiles, deciles or quartiles.
The 60th percentile is the same as the 0.6 quantile. For example you are in the 60th percentile
for height if you are taller than 60 percent of the population, i.e. the probability that you
are taller than a randomly chosen person is 60 percent.
Likewise, deciles represent steps of 1/10. The third decile is the 0.3 quantile. Quartiles are
in steps of 1/4. The third quartile is the 0.75 quantile and the 75th percentile.

6 Appendix: Integral Computation Details

Example 3: Let X ∼ exp(λ). Find E(X).


The range of X is [0, ∞) and its pdf is f (x) = λe−λx . Therefore
E(X) = ∫_0^∞ x f (x) dx = ∫_0^∞ λxe^{−λx} dx

(using integration by parts with u = x, v′ = λe^{−λx} ⇒ u′ = 1, v = −e^{−λx})

= −xe^{−λx} |_0^∞ + ∫_0^∞ e^{−λx} dx

= 0 − e^{−λx}/λ |_0^∞ = 1/λ.

We used the fact that xe−λx and e−λx go to 0 as x → ∞.

Example 4: Let Z ∼ N(0, 1). Find E(Z).


The range of Z is (−∞, ∞) and its pdf is φ(z) = (1/√2π) e^{−z²/2}. By symmetry the mean must be 0. The only mathematically tricky part is to show that the integral converges, i.e. that the mean exists at all (some random variables do not have means, but we will not encounter this very often). For completeness we include the argument, though this is not something we will ask you to do. We first compute the integral from 0 to ∞:

∫_0^∞ zφ(z) dz = (1/√2π) ∫_0^∞ ze^{−z²/2} dz.

The u-substitution u = z²/2 gives du = z dz. So the integral becomes

(1/√2π) ∫_0^∞ ze^{−z²/2} dz = (1/√2π) ∫_0^∞ e^{−u} du = (1/√2π) [−e^{−u}]_0^∞ = 1/√2π.

Similarly, ∫_{−∞}^0 zφ(z) dz = −1/√2π. Adding the two pieces together gives E(Z) = 0.

Example 6: Let X ∼ exp(λ). Find E(X 2 ).


E(X²) = ∫_0^∞ x² f (x) dx = ∫_0^∞ λx²e^{−λx} dx

(using integration by parts with u = x², v′ = λe^{−λx} ⇒ u′ = 2x, v = −e^{−λx})

= −x²e^{−λx} |_0^∞ + ∫_0^∞ 2xe^{−λx} dx

(the first term is 0; for the second term use integration by parts with u = 2x, v′ = e^{−λx} ⇒ u′ = 2, v = −e^{−λx}/λ)

= −(2x/λ)e^{−λx} |_0^∞ + ∫_0^∞ (2/λ)e^{−λx} dx

= 0 − (2/λ²)e^{−λx} |_0^∞ = 2/λ².
Central Limit Theorem and the Law of Large Numbers
Class 6, 18.05
Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Understand the statement of the law of large numbers.

2. Understand the statement of the central limit theorem.

3. Be able to use the central limit theorem to approximate probabilities of averages and
sums of independent identically-distributed random variables.

2 Introduction

We all understand intuitively that the average of many measurements of the same unknown
quantity tends to give a better estimate than a single measurement. Intuitively, this is
because the random error of each measurement cancels out in the average. In these notes
we will make this intuition precise in two ways: the law of large numbers (LoLN) and the
central limit theorem (CLT).
Briefly, both the law of large numbers and central limit theorem are about many independent
samples from the same distribution. The LoLN tells us two things:

1. The average of many independent samples is (with high probability) close to the mean
of the underlying distribution.

2. The density histogram of many independent samples is (with high probability) close
to the graph of the density of the underlying distribution.

To be absolutely correct mathematically we need to make these statements more precise,


but as stated they are a good way to think about the law of large numbers.
The central limit theorem says that the sum or average of many independent copies of a
random variable is approximately a normal random variable. The CLT goes on to give
precise values for the mean and standard deviation of the normal variable.
These are both remarkable facts. Perhaps just as remarkable is the fact that often in practice
n does not have to be all that large. Values of n > 30 often suffice.

2.1 There is more to experimentation than mathematics

The mathematics of the LoLN says that the average of a lot of independent samples from a
random variable will almost certainly approach the mean of the variable. The mathematics
cannot tell us if the tool or experiment is producing data worth averaging. For example,
if the measuring device is defective or poorly calibrated then the average of many mea-
surements will be a highly accurate estimate of the wrong thing! This is an example of


systematic error or sampling bias, as opposed to the random error controlled by the law of
large numbers.

3 The law of large numbers

Suppose X1 , X2 , . . . , Xn are independent random variables with the same underlying


distribution. In this case, we say that the Xi are independent and identically-distributed,
or i.i.d. In particular, the Xi all have the same mean µ and standard deviation σ.
Let X n be the average of X1 , . . . , Xn :
X̄n = (X1 + X2 + · · · + Xn )/n = (1/n) Σ_{i=1}^n Xi .

Note that X n is itself a random variable. The law of large numbers and central limit
theorem tell us about the value and distribution of X n , respectively.
LoLN: As n grows, the probability that X n is close to µ goes to 1.
CLT: As n grows, the distribution of X n converges to the normal distribution N (µ, σ 2 /n).
Before giving a more formal statement of the LoLN, let’s unpack its meaning through a
concrete example (we’ll return to the CLT later on).

Example 1. Averages of Bernoulli random variables


Suppose each Xi is an independent flip of a fair coin, so Xi ∼ Bernoulli(0.5) and µ = 0.5.
Then X n is the proportion of heads in n flips, and we expect that this proportion is close to
0.5 for large n. Randomness being what it is, this is not guaranteed; for example we could
get 1000 heads in 1000 flips, though the probability of this occurring is very small.
So our intuition translates to: with high probability the sample average X n is close to the
mean 0.5 for large n. We’ll demonstrate by doing some calculations in R. You can find the
code used for ‘class 6 prep’ in the usual place on our site.
To start we’ll look at the probability of being within 0.1 of the mean. We can express this
probability as

P (|X n − 0.5| < 0.1) or equivalently P (0.4 ≤ X n ≤ 0.6)

The law of large numbers says that this probability goes to 1 as the number of flips n gets
large. Our R code produces the following values for P (0.4 ≤ X n ≤ 0.6).
n = 10: pbinom(6, 10, 0.5) - pbinom(3, 10, 0.5) = 0.65625
n = 50: pbinom(30, 50, 0.5) - pbinom(19, 50, 0.5) = 0.8810795
n = 100: pbinom(60, 100, 0.5) - pbinom(39, 100, 0.5) = 0.9647998
n = 500: pbinom(300, 500, 0.5) - pbinom(199, 500, 0.5) = 0.9999941
n = 1000: pbinom(600, 1000, 0.5) - pbinom(399, 1000, 0.5) = 1
As predicted by the LoLN the probability goes to 1 as n grows.
We redo the computations to see the probability of being within 0.01 of the mean. Our R
code produces the following values for P (0.49 ≤ X n ≤ 0.51).
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 3

n = 10: pbinom(5, 10, 0.5) - pbinom(4, 10, 0.5) = 0.2460937


n = 100: pbinom(51, 100, 0.5) - pbinom(48, 100, 0.5) = 0.2356466
n = 1000: pbinom(510, 1000, 0.5) - pbinom(489, 1000, 0.5) = 0.49334
n = 10000: pbinom(5100, 10000, 0.5) - pbinom(4899, 10000, 0.5) = 0.9555742
Again we see the probability of being close to the mean going to 1 as n grows. Since 0.01
is smaller than 0.1 it takes larger values of n to raise the probability to near 1.
This convergence of the probability to 1 is the LoLN in action! Whenever you’re confused,
it will help you to keep this example in mind. So we see that the LoLN says that with high
probability the average of a large number of independent trials from the same distribution
will be very close to the underlying mean of the distribution. Now we’re ready for the
formal statement.

3.1 Formal statement of the law of large numbers

Theorem (Law of Large Numbers): Suppose X1 , X2 , . . . , Xn , . . . are i.i.d. random


variables with mean µ and variance σ 2 . For each n, let X n be the average of the first n
variables. Then for any a > 0, we have

lim_{n→∞} P (|X̄n − µ| < a) = 1.

This says precisely that as n increases the probability of being within a of the mean goes
to 1. Think of a as a small tolerance of error from the true mean µ. In our example, if we
want the probability to be at least p = 0.99999 that the proportion of heads X̄n is within
a = 0.1 of µ = 0.5, then n > N = 500 is large enough. If we decrease the tolerance a and/or
increase the probability p, then N will need to be larger.

4 Histograms

We can summarize multiple samples x1 , . . . , xn of a random variable in a histogram. Here


we want to carefully construct histograms so that they resemble the area under the pdf.
We will then see how the LoLN applies to histograms.
The step-by-step instructions for constructing a density histogram are as follows.

1. Pick an interval of the real line and divide it into m intervals, with endpoints b0 , b1 , . . . ,
bm . Usually these are equally sized, so let’s assume this to start.
A number line divided into equally-sized bins with endpoints b0 , b1 , . . . , b6 .
Each of the intervals is called a bin. For example, in the figure above the first bin is
[b0 , b1 ] and the last bin is [b5 , b6 ]. Each bin has a bin width, e.g. b1 − b0 is the first bin
width. Usually the bins all have the same width, called the bin width of the histogram.

2. Place each xi into the bin that contains its value. If xi lies on the boundary of two bins,
we’ll put it in the left bin (this is the R default, though it can be changed).
18.05 class 6, Central Limit Theorem and the Law of Large Numbers, Spring 2014 4

3. To draw a frequency histogram: put a vertical bar above each bin. The height of the
bar should equal the number of xi in the bin.

4. To draw a density histogram: put a vertical bar above each bin. The area of the bar
should equal the fraction of all data points that lie in the bin.

Notes:
1. When all the bins have the same width, the frequency histogram bars have area proportional to the count. So the density histogram results from the frequency histogram simply by dividing the height of each bar by the total area of the frequency histogram. Ignoring the vertical scale, the two
histograms look identical.
2. Caution: if the bin widths differ, the frequency and density histograms may look very
different. There is an example below. Don’t let anyone fool you by manipulating bin widths
to produce a histogram that suits their mischievous purposes!
In 18.05, we’ll stick with equally-sized bins. In general, we prefer the density histogram
since its vertical scale is the same as that of the pdf.
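In R the function hist carries out these steps; freq = FALSE asks for a density histogram and freq = TRUE (the default) for a frequency histogram. A minimal sketch using the data of the examples below:

x <- c(0.5, 1, 1, 1.5, 1.5, 1.5, 2, 2, 2, 2)
bins <- seq(0.25, 2.25, by = 0.5)        # bin boundaries 0.25, 0.75, ..., 2.25
hist(x, breaks = bins, freq = FALSE)     # density histogram
hist(x, breaks = bins, freq = TRUE)      # frequency histogram, same bins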
Examples. Here are some examples of histograms, all with the data [0.5,1,1,1.5,1.5,1.5,2,2,2,2].
The R code that drew them is in the R file ’class6-prep.r’. You can find the file in the usual
place on our site.
1. Here the frequency and density plots look the same but have different vertical scales.
Frequency histogram (left) and density histogram (right) of the data; the shapes are identical but the vertical scales differ.

Bins centered at 0.5, 1, 1.5, 2, i.e. width 0.5, bounds at 0.25, 0.75, 1.25, 1.75, 2.25.

2. Here each value is on a bin boundary. Note the values are all on the bin boundaries
and are put into the left-hand bin. That is, the bins are right-closed, e.g the first bin is for
values in the right-closed interval (0, 0.5].
Frequency histogram (left) and density histogram (right) with each data value falling on a bin boundary.

Bin bounds at 0, 0.5, 1, 1.5, 2.

3. Here we show density histograms based on different bin widths. Note that the scale
keeps the total area equal to 1. The gaps are bins with zero counts.

Density histograms of the same data with two different bin widths.

Left: wide bins; Right: narrow bins.

4. Here we use unequal bin widths, so the frequency and density histograms look different.

Frequency histogram (left) and density histogram (right) of the same data with unequal bin widths.

Don’t be fooled! These are based on the same data.


The density histogram is the better choice with unequal bin widths. In fact, R will complain

if you try to make a frequency histogram with unequal bin widths. Compare the frequency
histogram with unequal bin widths with all the other histograms we drew for this data. It
clearly looks different. What happened is that by combining the data in bins (0.5, 1] and
(1, 1.5] into one bin (0.5, 1.5] we effectively made the height of both smaller bins greater.
The reason the density histogram is nice is discussed in the next section.

4.1 The law of large numbers and histograms

The law of large number has an important consequence for density histograms.
LoLN for histograms: With high probability the density histogram of a large number
of samples from a distribution is a good approximation of the graph of the underlying pdf
f (x).
Let’s illustrate this by generating a density histogram with bin width 0.1 from 100000 draws
from a standard normal distribution. As you can see, the density histogram very closely
tracks the graph of the standard normal pdf φ(z).

Density histogram of 100000 draws from a standard normal distribution, with φ(z) in red.
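A few lines of R are enough to produce a figure like this one (a sketch of the idea, not necessarily the exact code used for the plot):

z <- rnorm(100000)                        # 100000 draws from N(0,1)
hist(z, breaks = 100, freq = FALSE, col = "gray")
curve(dnorm(x), add = TRUE, col = "red")  # overlay the pdf phi(z)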

5 The Central Limit Theorem

We now prepare for the statement of the CLT.

5.1 Standardization

Given a random variable X with mean µ and standard deviation σ, we define the standardization of X as the new random variable

Z = (X − µ)/σ.
Note that Z has mean 0 and standard deviation 1. Note also that if X has a normal
distribution, then the standardization of X is the standard normal distribution Z with
mean 0 and variance 1. This explains the term ‘standardization’ and the notation of Z
above.

5.2 Statement of the Central Limit Theorem

Suppose X1 , X2 , . . . , Xn , . . . are i.i.d. random variables each having mean µ and standard
deviation σ. For each n let Sn denote the sum and let X n be the average of X1 , . . . , Xn .
Sn = X1 + X2 + . . . + Xn = Σ_{i=1}^n Xi

X̄n = (X1 + X2 + . . . + Xn )/n = Sn /n.

The properties of mean and variance show

E(Sn ) = nµ, Var(Sn ) = nσ², σ_{Sn} = √n σ

E(X̄n ) = µ, Var(X̄n ) = σ²/n, σ_{X̄n} = σ/√n.

Since they are multiples of each other, Sn and X̄n have the same standardization

Zn = (Sn − nµ)/(σ√n) = (X̄n − µ)/(σ/√n).

Central Limit Theorem: For large n,

X̄n ≈ N(µ, σ²/n), Sn ≈ N(nµ, nσ²), Zn ≈ N(0, 1).

Notes: 1. In words: X n is approximately a normal distribution with the same mean as X


but a smaller variance.
2. Sn is approximately normal.
3. Standardized X n and Sn are approximately standard normal.
The central limit theorem allows us to approximate a sum or average of i.i.d random vari-
ables by a normal random variable. This is extremely useful because it is usually easy to
do computations with the normal distribution.

A precise statement of the CLT is that the cdf’s of Zn converge to Φ(z):

lim_{n→∞} F_{Zn} (z) = Φ(z).

The proof of the Central Limit Theorem is more technical than we want to get in 18.05. It
is accessible to anyone with a decent calculus background.
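Even without the proof, a simulation shows the CLT at work. The sketch below standardizes the average of n = 64 exponential(1) random variables (which have µ = 1 and σ = 1) and compares the histogram to the standard normal pdf; the particular choices of n and distribution are just for illustration.

n <- 64; trials <- 10000
xbar <- replicate(trials, mean(rexp(n, 1)))   # 10000 sample averages
z <- (xbar - 1)/(1/sqrt(n))                   # standardize: subtract mu, divide by sigma/sqrt(n)
hist(z, breaks = 50, freq = FALSE)
curve(dnorm(x), add = TRUE, col = "red")      # standard normal pdf for comparison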

5.3 Standard Normal Probabilities

To apply the CLT, we will want to have some normal probabilities at our fingertips. The
following probabilities appeared in Class 5. Let Z ∼ N (0, 1), a standard normal random
variable. Then with rounding we have:

1. P (|Z| < 1) = 0.68


2. P (|Z| < 2) = 0.95; more precisely P (|Z| < 1.96) ≈ .95.
3. P (|Z| < 3) = 0.997
These numbers are easily computed in R using pnorm. However, they are well worth remembering as rules of thumb. You should think of them as:

1. The probability that a normal random variable is within 1 standard deviation of its
mean is 0.68.
2. The probability that a normal random variable is within 2 standard deviations of its
mean is 0.95.
3. The probability that a normal random variable is within 3 standard deviations of its
mean is 0.997.

This is shown graphically in the following figure.


The normal pdf, with the probability within 1σ, 2σ, and 3σ of the mean marked as approximately 68%, 95%, and 99%.

Claim: From these numbers we can derive:


1. P (Z < 1) ≈ 0.84
2. P (Z < 2) ≈ 0.977
3. P (Z < 3) ≈ 0.999
Proof: We know P (|Z| < 1) = 0.68. The remaining probability of 0.32 is in the two regions
Z > 1 and Z < −1. These regions are referred to as the right-hand tail and the left-hand
tail respectively. By symmetry each tail has area 0.16. Thus,
P (Z < 1) = P (|Z| < 1) + P (left-hand tail) = 0.84
The other two cases are handled similarly.

5.4 Applications of the CLT

Example 2. Flip a fair coin 100 times. Estimate the probability of more than 55 heads.
answer: Let Xj be the result of the j th flip, so Xj = 1 for heads and Xj = 0 for tails. The
total number of heads is
S = X1 + X2 + . . . + X100 .
We know E(Xj ) = 0.5 and Var(Xj ) = 1/4. Since n = 100, we have
E(S) = 50, Var(S) = 25 and σS = 5.

The central limit theorem says that the standardization of S is approximately N(0, 1). The
question asks for P (S > 55). Standardizing and using the CLT we get

P (S > 55) = P ((S − 50)/5 > (55 − 50)/5) ≈ P (Z > 1) = 0.16.

Here Z is a standard normal random variable and P (Z > 1) = 1 − P (Z < 1) ≈ 0.16.
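For comparison, R gives both the exact binomial answer and the CLT approximation. (The 55.5 in the last line is the standard 'continuity correction', a refinement we will not need in 18.05.)

1 - pbinom(55, 100, 0.5)   # exact: P(S > 55), about 0.136
1 - pnorm(1)               # CLT approximation: P(Z > 1), about 0.159
1 - pnorm(55.5, 50, 5)     # CLT with continuity correction, about 0.136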

Example 3. Estimate the probability of more than 220 heads in 400 flips.
answer: This is nearly identical to the previous example. Now µS = 200 and σS = 10 and
we want P (S > 220). Standardizing and using the CLT we get:

P (S > 220) = P ((S − µS )/σS > (220 − 200)/10) ≈ P (Z > 2) = 0.025.

Again, Z ∼ N(0, 1) and the rules of thumb show P (Z > 2) = .025.


Note: Even though 55/100 = 220/400, the probability of more than 55 heads in 100 flips
is larger than the probability of more than 220 heads in 400 flips. This is due to the LoLN
and the larger value of n in the latter case.
Example 4. Estimate the probability of between 40 and 60 heads in 100 flips.
answer: As in the first example, E(S) = 50, Var(S) = 25 and σS = 5. So

P (40 ≤ S ≤ 60) = P ((40 − 50)/5 ≤ (S − 50)/5 ≤ (60 − 50)/5) ≈ P (−2 ≤ Z ≤ 2)

We can compute the right-hand side using our rule of thumb. For a more accurate answer
we use R:
pnorm(2) - pnorm(-2) = 0.954 . . .

Recall that in Section 3 we used the binomial distribution to compute an answer of 0.965. . . .
So our approximate answer using CLT is off by about 1%.
Think: Would you expect the CLT method to give a better or worse approximation of
P (200 < S < 300) with n = 500?
We encourage you to check your answer using R.
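One way to do the check (note that 200 < S < 300 means 201 ≤ S ≤ 299, and here σS = √(500 · 1/4) = √125):

# exact binomial probability for n = 500 flips
pbinom(299, 500, 0.5) - pbinom(200, 500, 0.5)
# CLT approximation with mean 250 and sd sqrt(125)
pnorm(300, 250, sqrt(125)) - pnorm(200, 250, sqrt(125))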

Example 5. Polling. When taking a political poll the results are often reported as a
number with a margin of error. For example 52% ± 3% favor candidate A. The rule of

thumb is that if you poll n people then the margin of error is ±1/√n. We will now see
exactly what this means and that it is an application of the central limit theorem.
Suppose there are 2 candidates A and B. Suppose further that the fraction of the population
who prefer A is p0 . That is, if you ask a random person who they prefer then the probability they'll answer A is p0 .
To run the poll a pollster selects n people at random and asks 'Do you support candidate A or candidate B?' Thus we can view the poll as a sequence of n independent Bernoulli(p0 )
trials, X1 , X2 , . . . , Xn , where Xi is 1 if the ith person prefers A and 0 if they prefer B. The
fraction of people polled that prefer A is just the average X̄n .

We know that each Xi ∼ Bernoulli(p0 ) so,


E(Xi ) = p0 and σ_{Xi} = √(p0 (1 − p0 )).

Therefore, the central limit theorem tells us that


X̄n ≈ N(p0 , σ²/n), where σ = √(p0 (1 − p0 )).

In a normal distribution 95% of the probability is within 2 standard deviations of the mean.

This means that in 95% of polls of n people the sample mean X̄n will be within 2σ/√n of the true mean p0 . The final step is to note that for any value of p0 we have σ ≤ 1/2. (It is an easy calculus exercise to see that 1/4 is the maximum value of σ² = p0 (1 − p0 ).) This means that we can conservatively say that in 95% of polls of n people the sample mean X̄n is within 1/√n of the true mean. The frequentist statistician then takes the interval X̄n ± 1/√n and calls it the 95% confidence interval for p0 .
A word of caution: it is tempting and common, but wrong, to think that there is a 95%
probability the true fraction p0 is in the confidence interval. This is subtle, but the error
is the same one as thinking you have a disease if a 95% accurate test comes back positive.
It’s true that 95% of people taking the test get the correct result. It’s not necessarily true
that 95% of positive tests are correct.
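A simulation makes the 95% claim concrete. The sketch below runs many polls of n people with a chosen true value p0 and records how often the sample proportion lands within 1/√n of p0 ; the particular values of p0 and n are just for illustration, and the answer should be roughly 0.95 or a bit higher.

p0 <- 0.52; n <- 1000; polls <- 10000
xbar <- rbinom(polls, n, p0)/n        # sample proportion in each simulated poll
mean(abs(xbar - p0) < 1/sqrt(n))      # fraction of polls within the margin of error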

5.5 Why use the CLT

Since the probabilities in the above examples can be computed exactly using the binomial
distribution, you may be wondering what is the point of finding an approximate answer
using the CLT. In fact, we were only able to compute these probabilities exactly because
the Xi were Bernoulli and so the sum S was binomial. In general, the distribution of S will not be familiar, so you will not be able to compute the probabilities for S exactly; it can also happen that the exact computation is possible in theory but too computationally intensive in practice, even for a computer. The power of the CLT is that it applies when Xi has almost any distribution. (Though we will see in the next section that some distributions may require larger n for the approximation to be a good one.)

5.6 How big does n have to be to apply the CLT?

Short answer: often, not that big.


The following sequences of pictures show the convergence of averages to a normal distribu-
tion.
First we show the standardized average of n i.i.d. uniform random variables with n =
1, 2, 4, 8, 12. The pdf of the average is in blue and the standard normal pdf is in red. By
the time n = 12 the fit between the standardized average and the true normal looks very
good.
Pdfs of the standardized average of n i.i.d. uniform random variables (blue) versus the standard normal pdf (red), for n = 1, 2, 4, 8, 12.

Next we show the standardized average of n i.i.d. exponential random variables with
n = 1, 2, 4, 8, 16, 64. Notice that this asymmetric density takes more terms to converge to
the normal density.
Pdfs of the standardized average of n i.i.d. exponential random variables (blue) versus the standard normal pdf (red), for n = 1, 2, 4, 8, 16, 64.

Next we show the (non-standardized) average of n exponential random variables with


n = 1, 2, 4, 16, 64. Notice how the standard deviation shrinks as n grows, resulting in a
spikier (more peaked) density.
Pdfs of the (non-standardized) average of n i.i.d. exponential random variables for n = 1, 2, 4, 16, 64; the density becomes more peaked as n grows.

The central limit theorem works for discrete variables also. Here is the standardized average
of n i.i.d. Bernoulli(.5) random variables with n = 1, 2, 12, 64. Notice that as n grows, the
average can take more values, which allows the discrete distribution to ’fill in’ the normal
density.
The standardized average of n i.i.d. Bernoulli(0.5) random variables versus the standard normal pdf, for n = 1, 2, 12, 64.

Finally we show the (non-standardized) average of n Bernoulli(.5) random variables, with


n = 4, 12, 64. Notice how the standard deviation gets smaller resulting in a spikier (more
peaked) density.
The (non-standardized) average of n Bernoulli(0.5) random variables for n = 4, 12, 64; the density becomes more peaked as n grows.
Appendix
Class 6, 18.05
Jeremy Orloff and Jonathan Bloom

1 Introduction

In this appendix we give more formal mathematical material that is not strictly a part of
18.05. This will not be on homework or tests. We give this material to emphasize that in
doing mathematics we should be careful to specify our hypotheses completely and give clear
deductive arguments to prove our claims. We hope you find it interesting and illuminating.

2 With high probability the density histogram resembles the


graph of the probability density function:

We stated that one consequence of the law of large numbers is that as the number of samples
increases the density histogram of the samples has an increasing probability of matching the
graph of the underlying pdf or pmf. This is a good rule of thumb, but it is rather imprecise.
It is possible to make more precise statements, but it will take some care to formulate something sensible and precise, which will not be quite so sweeping.
Suppose we have an experiment that produces data according to the random variable X
and suppose we generate n independent samples from X. Call them

x1 , x 2 , . . . , x n .

By a bin we mean a range of values, i.e. [xk , xk+1 ). To make a density histogram of the
data we divide the range of X into m bins and calculate the fraction of the data in each
bin.
Now, let pk be the probability a random data point is in the kth bin. This is the probability of success for an indicator (Bernoulli) random variable Bk,j which is 1 if the jth data point is in the bin and 0 otherwise.
Statement 1. Let p̄k be the fraction of the data in bin k. As the number n of data points gets large the probability that p̄k is close to pk approaches 1. Said differently, given any small number, call it a, the probability P (|p̄k − pk | < a) depends on n, and as n goes to infinity this probability goes to 1.
Proof. Let B̄k be the average of the Bk,j . Since E(Bk,j ) = pk , the law of large numbers says exactly that
P (|B̄k − pk | < a) approaches 1 as n goes to infinity.
But, since the Bk,j are indicator variables, their average is exactly p̄k , the fraction of the data in bin k. Replacing B̄k by p̄k in the above equation gives

P (|p̄k − pk | < a) approaches 1 as n goes to infinity.

This is exactly what statement 1 claimed.


Statement 2. The same statement holds for a finite number of bins simultaneously. That
is, for bins 1 to m we have
P ( |B̄1 − p1 | < a, |B̄2 − p2 | < a, . . . , |B̄m − pm | < a ) approaches 1 as n goes to infinity.

Proof. First we note the following probability rule, which is a consequence of the inclusion
exclusion principle: If two events A and B have P (A) = 1 − α1 and P (B) = 1 − α2 then
P (A ∩ B) ≥ 1 − (α1 + α2 ).
Now, Statement 1 says that for any α we can find n large enough that P (|B̄k − pk | < a) > 1 − α/m for each bin separately. By the probability rule, the probability of the intersection
of all these events is at least 1 − α. Since we can let α be as small as we want by letting n
go to infinity, in the limit we get probability 1 as claimed.
Statement 3. If f (x) is a continuous probability density with range [a, b] then by taking
enough data and having a small enough bin width we can insure that with high probability
the density histogram is as close as we want to the graph of f (x).
Proof. We will only sketch the argument. Assume the bin around x has width ∆x. If
∆x is small enough then the probability a data point is in the bin is approximately f (x)∆x.
Statement 2 guarantees that if n is large enough then with high probability the fraction
of data in the bin is also approximately f (x)∆x. Since this is the area of the bin we see
that its height will be approximately f (x). That is, with high probability the height of the
histogram over any point x is close to f (x). This is what Statement 3 claimed.
Note. If the range is infinite or the density goes to infinity at some point we need to be
more careful. There are statements we could make for these cases.

3 The Chebyshev inequality

One proof of the LoLN follows from the following key inequality.
The Chebyshev inequality. Suppose Y is a random variable with mean µ and variance σ 2 .
Then for any positive value a, we have

P (|Y − µ| ≥ a) ≤ Var(Y )/a².

In words, the Chebyshev inequality says that the probability that Y differs from the mean
by more than a is bounded by Var(Y )/a2 . Morally, the smaller the variance of Y , the
smaller the probability that Y is far from its mean.
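The bound is usually far from tight. For example, take Y ∼ exp(1), so µ = 1 and Var(Y ) = 1, and a = 2: Chebyshev gives 1/4, while the true probability is e⁻³ ≈ 0.05. A two-line check in R:

pexp(3, 1, lower.tail = FALSE)   # P(|Y - 1| >= 2) = P(Y >= 3), since Y >= 0; about 0.0498
1/2^2                            # the Chebyshev bound Var(Y)/a^2 = 0.25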

Proof of the LoLN: Since Var(X̄n ) = Var(X)/n, the variance of the average X̄n goes to zero as n goes to infinity. So the Chebyshev inequality for Y = X̄n and fixed a implies that as n grows, the probability that X̄n is farther than a from µ goes to 0. Hence the probability that X̄n is within a of µ goes to 1, which is the LoLN.
Proof of the Chebyshev inequality: The proof is essentially the same for discrete and
continuous Y . We’ll assume Y is continuous and also that µ = 0, since replacing Y by

Y − µ does not change the variance. So


P (|Y | ≥ a) = ∫_{−∞}^{−a} f (y) dy + ∫_a^∞ f (y) dy ≤ ∫_{−∞}^{−a} (y²/a²) f (y) dy + ∫_a^∞ (y²/a²) f (y) dy

≤ ∫_{−∞}^∞ (y²/a²) f (y) dy = Var(Y )/a².

The first inequality uses that y 2 /a2 ≥ 1 on the intervals of integration. The second inequality
follows because including the range [−a, a] only makes the integral larger, since the integrand
is positive.

4 The need for variance

We didn’t lie to you, but we did gloss over one technical fact. Throughout we assumed
that the underlying distributions had a variance. For example, the proof of the law of
large numbers made use of the variance by way of the Chebyshev inequality. But there are
distributions which do not have a variance because the sum or integral for the variance does
not converge to a finite number. For such distributions the law of large numbers may not
be true. In 18.05 we won’t have to worry about this, but if you go deeper into statistics
this may become important.
Joint Distributions, Independence
Class 7, 18.05
Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Understand what is meant by a joint pmf, pdf and cdf of two random variables.

2. Be able to compute probabilities and marginals from a joint pmf or pdf.

3. Be able to test whether two random variables are independent.

2 Introduction

In science and in real life, we are often interested in two (or more) random variables at the
same time. For example, we might measure the height and weight of giraffes, or the IQ
and birthweight of children, or the frequency of exercise and the rate of heart disease in
adults, or the level of air pollution and rate of respiratory illness in cities, or the number of
Facebook friends and the age of Facebook members.
Think: What relationship would you expect in each of the five examples above? Why?
In such situations the random variables have a joint distribution that allows us to compute
probabilities of events involving both variables and understand the relationship between the
variables. This is simplest when the variables are independent. When they are not, we use
covariance and correlation as measures of the nature of the dependence between them.

3 Joint Distribution

3.1 Discrete case

Suppose X and Y are two discrete random variables and that X takes values {x1 , x2 , . . . , xn }
and Y takes values {y1 , y2 , . . . , ym }. The ordered pair (X, Y ) take values in the product
{(x1 , y1 ), (x1 , y2 ), . . . (xn , ym )}. The joint probability mass function (joint pmf) of X and Y
is the function p(xi , yj ) giving the probability of the joint outcome X = xi , Y = yj .
We organize this in a joint probability table as shown:


X\Y y1 y2 ... yj ... ym


x1 p(x1 , y1 ) p(x1 , y2 ) · · · p(x1 , yj ) · · · p(x1 , ym )
x2 p(x2 , y1 ) p(x2 , y2 ) · · · p(x2 , yj ) · · · p(x2 , ym )
··· ··· ··· ··· ··· ··· ···
··· ··· ··· ··· ··· ··· ···
xi p(xi , y1 ) p(xi , y2 ) ··· p(xi , yj ) ··· p(xi , ym )
··· ··· ··· ··· ··· ···
xn p(xn , y1 ) p(xn , y2 ) · · · p(xn , yj ) · · · p(xn , ym )

Example 1. Roll two dice. Let X be the value on the first die and let Y be the value on
the second die. Then both X and Y take values 1 to 6 and the joint pmf is p(i, j) = 1/36
for all i and j between 1 and 6. Here is the joint probability table:

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

Example 2. Roll two dice. Let X be the value on the first die and let T be the total on
both dice. Here is the joint probability table:

X\T 2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36

A joint probability mass function must satisfy two properties:


1. 0 ≤ p(xi , yj ) ≤ 1
2. The total probability is 1. We can express this as a double sum:
Σ_{i=1}^n Σ_{j=1}^m p(xi , yj ) = 1

3.2 Continuous case

The continuous case is essentially the same as the discrete case: we just replace discrete sets
of values by continuous intervals, the joint probability mass function by a joint probability
density function, and the sums by integrals.
If X takes values in [a, b] and Y takes values in [c, d] then the pair (X, Y ) takes values in
the product [a, b] × [c, d]. The joint probability density function (joint pdf) of X and Y
is a function f (x, y) giving the probability density at (x, y). That is, the probability that
(X, Y ) is in a small rectangle of width dx and height dy around (x, y) is f (x, y) dx dy.
The rectangle [a, b] × [c, d] of values of (X, Y ), with a small dx by dy rectangle at (x, y) carrying probability f (x, y) dx dy.

A joint probability density function must satisfy two properties:


1. 0 ≤ f (x, y)
2. The total probability is 1. We now express this as a double integral:
∫_c^d ∫_a^b f (x, y) dx dy = 1

Note: as with the pdf of a single random variable, the joint pdf f (x, y) can take values
greater than 1; it is a probability density, not a probability.
In 18.05 we won’t expect you to be experts at double integration. Here’s what we will
expect.

• You should understand double integrals conceptually as double sums.

• You should be able to compute double integrals over rectangles.

• For a non-rectangular region, when f (x, y) = c is constant, you should know that the double integral is the same as c × (the area of the region).

3.3 Events

Random variables are useful for describing events. Recall that an event is a set of outcomes
and that random variables assign numbers to outcomes. For example, the event ‘X > 1’
is the set of all outcomes for which X is greater than 1. These concepts readily extend to
pairs of random variables and joint outcomes.

Example 3. In Example 1, describe the event B = ‘Y − X ≥ 2’ and find its probability.


answer: We can describe B as a set of (X, Y ) pairs:

B = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6)}.

We can also describe it visually

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

The event B consists of the outcomes in the shaded squares.

The probability of B is the sum of the probabilities in the orange shaded squares, so
P (B) = 10/36.

Example 4. Suppose X and Y both take values in [0,1] with uniform density f (x, y) = 1.
Visualize the event ‘X > Y ’ and find its probability.
answer: Jointly X and Y take values in the unit square. The event ‘X > Y ’ corresponds
to the shaded lower-right triangle below. Since the density is constant, the probability is
just the fraction of the total area taken up by the event. In this case, it is clearly 0.5.
The event 'X > Y ' is the shaded lower-right triangle of the unit square.
Example 5. Suppose X and Y both take values in [0,1] with density f (x, y) = 4xy. Show
f (x, y) is a valid joint pdf, visualize the event A = ‘X < 0.5 and Y > 0.5’ and find its
probability.
answer: Jointly X and Y take values in the unit square.
The event A = 'X < 0.5 and Y > 0.5' is the upper-left quadrant of the unit square.
To show f (x, y) is a valid joint pdf we must check that it is positive (which it clearly is)
and that the total probability is 1.
Total probability = ∫_0^1 ∫_0^1 4xy dx dy = ∫_0^1 [2x²y]_0^1 dy = ∫_0^1 2y dy = 1. QED

The event A is just the upper-left-hand quadrant. Because the density is not constant we
must compute an integral to find the probability.
P (A) = ∫_0^{0.5} ∫_{0.5}^1 4xy dy dx = ∫_0^{0.5} [2xy²]_{0.5}^1 dx = ∫_0^{0.5} (3x/2) dx = 3/16.
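Since f (x, y) = 4xy = (2x)(2y) factors, X and Y are independent, each with density 2x and cdf x² on [0, 1] (more on this in the section on independence). That gives a quick simulation check of P (A): sample each variable as √U with U uniform, which is the inverse of the cdf x². A sketch:

N <- 100000
x <- sqrt(runif(N))       # X with cdf x^2 on [0,1]
y <- sqrt(runif(N))       # Y with the same distribution, independent of X
mean(x < 0.5 & y > 0.5)   # should be close to 3/16 = 0.1875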

3.4 Joint cumulative distribution function

Suppose X and Y are jointly-distributed random variables. We will use the notation ‘X ≤
x, Y ≤ y’ to mean the event ‘X ≤ x and Y ≤ y’. The joint cumulative distribution function
(joint cdf) is defined as
F (x, y) = P (X ≤ x, Y ≤ y)

Continuous case: If X and Y are continuous random variables with joint density f (x, y)
over the range [a, b] × [c, d] then the joint cdf is given by the double integral
F (x, y) = ∫_c^y ∫_a^x f (u, v) du dv.

To recover the joint pdf, we differentiate the joint cdf. Because there are two variables we
need to use partial derivatives:

f (x, y) = ∂²F/∂x∂y (x, y).

Discrete case: If X and Y are discrete random variables with joint pmf p(xi , yj ) then the joint cdf is given by the double sum

F (x, y) = Σ_{xi ≤ x} Σ_{yj ≤ y} p(xi , yj ).

3.5 Properties of the joint cdf

The joint cdf F (x, y) of X and Y must satisfy several properties:

1. F (x, y) is non-decreasing: i.e. if x or y increase then F (x, y) must stay constant or


increase.
2. F (x, y) = 0 at the lower-left of the joint range.
If the lower left is (−∞, −∞) then this means lim_{(x,y)→(−∞,−∞)} F (x, y) = 0.

3. F (x, y) = 1 at the upper-right of the joint range.


If the upper-right is (∞, ∞) then this means lim_{(x,y)→(∞,∞)} F (x, y) = 1.

Example 6. Find the joint cdf for the random variables in Example 5.
answer: The event ‘X ≤ x and Y ≤ y’ is a rectangle in the unit square.
The event 'X ≤ x & Y ≤ y' is the rectangle with upper-right corner (x, y) inside the unit square.

To find the cdf F (x, y) we compute a double integral:


F (x, y) = ∫_0^y ∫_0^x 4uv du dv = x²y².

Example 7. In Example 1, compute F (3.5, 4).


answer: We redraw the joint probability table. Notice how similar the picture is to the one
in the previous example.
F (3.5, 4) is the probability of the event ‘X ≤ 3.5 and Y ≤ 4’. We can visualize this event
as the shaded rectangles in the table:

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

The event ‘X ≤ 3.5 and Y ≤ 4’.

Adding up the probability in the shaded squares we get F (3.5, 4) = 12/36 = 1/3.
Note. One unfortunate difference between the continuous and discrete visualizations is that
for continuous variables the value increases as we go up in the vertical direction while the
opposite is true for the discrete case. We have experimented with changing the discrete
tables to match the continuous graphs, but it causes too much confusion. We will just have
to live with the difference!

3.6 Marginal distributions

When X and Y are jointly-distributed random variables, we may want to consider only one
of them, say X. In that case we need to find the pmf (or pdf or cdf) of X without Y . This
is called a marginal pmf (or pdf or cdf). The next example illustrates the way to compute
this and the reason for the term ‘marginal’.

3.7 Marginal pmf

Example 8. In Example 2 we rolled two dice and let X be the value on the first die and
T be the total on both dice. Compute the marginal pmf of X and of T .
answer: In the table each row represents a single value of X. So the event ‘X = 3’ is the
third row of the table. To find P (X = 3) we simply have to sum up the probabilities in this
row. We put the sum in the right-hand margin of the table. Likewise P (T = 5) is just the
sum of the column with T = 5. We put the sum in the bottom margin of the table.

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(tj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1

Computing the marginal probabilities P (X = 3) = 1/6 and P (T = 5) = 4/36.

Note: Of course in this case we already knew the pmf of X and of T . It is good to see that
our computation here is in agreement!
As motivated by this example, marginal pmf’s are obtained from the joint pmf by summing:
pX (xi ) = Σ_j p(xi , yj ), pY (yj ) = Σ_i p(xi , yj ).

The term marginal refers to the fact that the values are written in the margins of the table.

3.8 Marginal pdf

For a continuous joint density f (x, y) with range [a, b] × [c, d], the marginal pdf's are:

fX (x) = ∫_c^d f (x, y) dy, fY (y) = ∫_a^b f (x, y) dx.

Compare these with the marginal pmf’s above; as usual the sums are replaced by integrals.
We say that to obtain the marginal for X, we integrate out Y from the joint pdf and vice
versa.
Example 9. Suppose (X, Y ) takes values on the square [0, 1] × [1, 2] with joint pdf f (x, y) = (8/3)x³y. Find the marginal pdf's fX (x) and fY (y).
answer: To find fX (x) we integrate out y and to find fY (y) we integrate out x.

4 3 2 2
Z 2  
8 3
fX (x) = x y dy = x y = 4x3
1 3 3 1
2 4 1 1
Z 1  
8 3 2
fY (y) = x y dx = x y = y.
0 3 3 0 3

Example 10. Suppose (X, Y ) takes values on the unit square [0, 1] × [0, 1] with joint pdf f (x, y) = (3/2)(x² + y²). Find the marginal pdf fX (x) and use it to find P (X < 0.5).
answer:
fX (x) = ∫_0^1 (3/2)(x² + y²) dy = [(3/2)(x²y + y³/3)]_0^1 = (3/2)x² + 1/2.

P (X < 0.5) = ∫_0^{0.5} fX (x) dx = ∫_0^{0.5} ((3/2)x² + 1/2) dx = [x³/2 + x/2]_0^{0.5} = 5/16.

3.9 Marginal cdf

Finding the marginal cdf from the joint cdf is easy. If X and Y jointly take values on
[a, b] × [c, d] then
FX (x) = F (x, d), FY (y) = F (b, y).
If d is ∞ then this becomes a limit FX (x) = lim_{y→∞} F (x, y). Likewise for FY (y).

Example 11. The joint cdf of the random variables in the last example is F (x, y) = (1/2)(x³y + xy³) on [0, 1] × [0, 1]. Find the marginal cdf's and use FX (x) to compute P (X < 0.5).
answer: We have FX (x) = F (x, 1) = (1/2)(x³ + x) and FY (y) = F (1, y) = (1/2)(y + y³). So
P (X < 0.5) = FX (0.5) = (1/2)(0.5³ + 0.5) = 5/16: exactly the same as before.

3.10 3D visualization

We visualized P (a < X < b) as the area under the pdf f(x) over the interval [a, b]. Since
the range of values of (X, Y ) is already a two dimensional region in the plane, the graph of

f (x, y) is a surface over that region. We can then visualize probability as volume under the
surface.
Think: Summoning your inner artist, sketch the graph of the joint pdf f (x, y) = 4xy and
visualize the probability P (A) as a volume for Example 5.

4 Independence

We are now ready to give a careful mathematical definition of independence. Of course, it


will simply capture the notion of independence we have been using up to now. But, it is nice
to finally have a solid definition that can support complicated probabilistic and statistical
investigations.
Recall that events A and B are independent if

P (A ∩ B) = P (A)P (B).

Random variables X and Y define events like ‘X ≤ 2’ and ‘Y > 5’. So, X and Y are
independent if any event defined by X is independent of any event defined by Y . The
formal definition that guarantees this is the following.
Definition: Jointly-distributed random variables X and Y are independent if their joint
cdf is the product of the marginal cdf’s:

F (x, y) = FX (x)FY (y).

For discrete variables this is equivalent to the joint pmf being the product of the marginal pmf's:
p(xi , yj ) = pX (xi )pY (yj ).
For continuous variables this is equivalent to the joint pdf being the product of the marginal pdf's:
f (x, y) = fX (x)fY (y).

Once you have the joint distribution, checking for independence is usually straightforward
although it can be tedious.
Example 12. For discrete variables independence means the probability in a cell must be
the product of the marginal probabilities of its row and column. In the first table below
this is true: every marginal probability is 1/6 and every cell contains 1/36, i.e. the product
of the marginals. Therefore X and Y are independent.
In the second table below most of the cell probabilities are not the product of the marginal
probabilities. For example, none of the marginal probabilities are 0, so none of the cells with 0
probability can be the product of the marginals.

X\Y 1 2 3 4 5 6 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
3 1/36 1/36 1/36 1/36 1/36 1/36 1/6
4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
5 1/36 1/36 1/36 1/36 1/36 1/36 1/6
6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/6 1/6 1/6 1/6 1/6 1/6 1

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1
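To see the "each cell equals the product of its row and column marginals" criterion in code, here is a short R sketch (our own illustration, not from the notes) for the first table:

```r
# First table: X and Y each uniform on 1..6, every cell 1/36.
joint <- matrix(1/36, nrow = 6, ncol = 6)
pX <- rowSums(joint)   # marginal pmf of X: all 1/6
pY <- colSums(joint)   # marginal pmf of Y: all 1/6

# Independent iff every cell equals the product of its row and column marginals.
all(abs(joint - outer(pX, pY)) < 1e-12)   # TRUE
```

The same check applied to the second table would return FALSE, since its zero cells cannot be products of nonzero marginals.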

Example 13. For continuous variables independence means you can factor the joint pdf
or cdf as the product of a function of x and a function of y.
(i) Suppose X has range [0, 1/2], Y has range [0, 1] and f(x, y) = 96 x^2 y^3. Then X and Y are independent. The marginal densities are fX(x) = 24x^2 and fY(y) = 4y^3.
(ii) If f(x, y) = 1.5(x^2 + y^2) over the unit square then X and Y are not independent because there is no way to factor f(x, y) into a product fX(x)fY(y).
(iii) If F(x, y) = \frac{1}{2}(x^3 y + x y^3) over the unit square then X and Y are not independent because the cdf does not factor into a product FX(x)FY(y).
Covariance and Correlation
Class 7, 18.05
Jeremy Orloff and Jonathan Bloom

1 Learning Goals

1. Understand the meaning of covariance and correlation.

2. Be able to compute the covariance and correlation of two random variables.

2 Covariance

Covariance is a measure of how much two random variables vary together. For example,
height and weight of giraffes have positive covariance because when one is big the other
tends also to be big.
Definition: Suppose X and Y are random variables with means µX and µY . The
covariance of X and Y is defined as

Cov(X, Y ) = E((X − µX )(Y − µY )).

2.1 Properties of covariance

1. Cov(aX + b, cY + d) = acCov(X, Y ) for constants a, b, c, d.

2. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).

3. Cov(X, X) = Var(X)

4. Cov(X, Y ) = E(XY ) − µX µY .

5. Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ) for any X and Y .

6. If X and Y are independent then Cov(X, Y ) = 0.


Warning: The converse is false: zero covariance does not always imply independence.

Notes. 1. Property 4 is like the similar property for variance. Indeed, if X = Y it is exactly that property: Var(X) = E(X^2) − µX^2.
2. By Property 6, the formula in Property 5 reduces to the earlier formula Var(X + Y) = Var(X) + Var(Y) when X and Y are independent.
We give the proofs below. However, understanding and using these properties is more
important than memorizing their proofs.


2.2 Sums and integrals for computing covariance

Since covariance is defined as an expected value we compute it in the usual way as a sum
or integral.
Discrete case: If X and Y have joint pmf p(xi, yj) then

Cov(X, Y) = \sum_{i=1}^n \sum_{j=1}^m p(xi, yj)(xi − µX)(yj − µY) = \left( \sum_{i=1}^n \sum_{j=1}^m p(xi, yj) xi yj \right) − µX µY.

Continuous case: If X and Y have joint pdf f(x, y) over range [a, b] × [c, d] then

Cov(X, Y) = \int_c^d \int_a^b (x − µX)(y − µY) f(x, y) dx dy = \left( \int_c^d \int_a^b xy f(x, y) dx dy \right) − µX µY.

2.3 Examples

Example 1. Flip a fair coin 3 times. Let X be the number of heads in the first 2 flips
and let Y be the number of heads on the last 2 flips (so there is overlap on the middle flip).
Compute Cov(X, Y ).
answer: We’ll do this twice, first using the joint probability table and the definition of
covariance, and then using the properties of covariance.
With 3 tosses there are only 8 outcomes {HHH, HHT,...}, so we can create the joint prob-
ability table directly.

X\Y 0 1 2 p(xi )
0 1/8 1/8 0 1/4
1 1/8 2/8 1/8 1/2
2 0 1/8 1/8 1/4
p(yj ) 1/4 1/2 1/4 1

From the marginals we compute E(X) = 1 = E(Y). Now we use the definition:

Cov(X, Y) = E((X − µX)(Y − µY)) = \sum_{i,j} p(xi, yj)(xi − 1)(yj − 1)

We write out the sum leaving out all the terms that are 0, i.e. all the terms where xi = 1 or yj = 1 or the probability is 0:

Cov(X, Y) = \frac{1}{8}(0 − 1)(0 − 1) + \frac{1}{8}(2 − 1)(2 − 1) = \frac{1}{4}.
We could also have used Property 4 to do the computation: From the full table we compute

E(XY) = 1 \cdot \frac{2}{8} + 2 \cdot \frac{1}{8} + 2 \cdot \frac{1}{8} + 4 \cdot \frac{1}{8} = \frac{5}{4}.

So Cov(X, Y) = E(XY) − µX µY = \frac{5}{4} − 1 = \frac{1}{4}.
Next we redo the computation of Cov(X, Y ) using the properties of covariance. As usual,
let Xi be the result of the ith flip, so Xi ∼ Bernoulli(0.5). We have
X = X1 + X2 and Y = X2 + X3 .
We know E(Xi ) = 1/2 and Var(Xi ) = 1/4. Therefore using Property 2 of covariance, we
have
Cov(X, Y ) = Cov(X1 +X2 , X2 +X3 ) = Cov(X1 , X2 )+Cov(X1 , X3 )+Cov(X2 , X2 )+Cov(X2 , X3 ).
Since the different tosses are independent we know
Cov(X1 , X2 ) = Cov(X1 , X3 ) = Cov(X2 , X3 ) = 0.
Looking at the expression for Cov(X, Y) there is only one non-zero term:

Cov(X, Y) = Cov(X2, X2) = Var(X2) = \frac{1}{4}.
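As a check (our addition, not part of the notes), the same 1/4 comes out of a few lines of R applied to the joint probability table above, using Property 4:

```r
# Joint pmf from the table; rows are X = 0, 1, 2 and columns are Y = 0, 1, 2.
p <- matrix(c(1/8, 1/8, 0,
              1/8, 2/8, 1/8,
              0,   1/8, 1/8), nrow = 3, byrow = TRUE)
x <- 0:2
y <- 0:2

EX  <- sum(rowSums(p) * x)    # 1
EY  <- sum(colSums(p) * y)    # 1
EXY <- sum(outer(x, y) * p)   # 5/4
EXY - EX * EY                 # Cov(X, Y) = 0.25
```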

Example 2. (Zero covariance does not imply independence.) Let X be a random variable
that takes values −2, −1, 0, 1, 2; each with probability 1/5. Let Y = X 2 . Show that
Cov(X, Y ) = 0 but X and Y are not independent.
answer: We make a joint probability table:

Y \X -2 -1 0 1 2 p(yj )
0 0 0 1/5 0 0 1/5
1 0 1/5 0 1/5 0 2/5
4 1/5 0 0 0 1/5 2/5
p(xi ) 1/5 1/5 1/5 1/5 1/5 1

Using the marginals we compute means E(X) = 0 and E(Y ) = 2.


Next we show that X and Y are not independent. To do this all we have to do is find one place where the product rule fails, i.e. where p(xi, yj) ≠ pX(xi)pY(yj):
P(X = −2, Y = 0) = 0 but P(X = −2) · P(Y = 0) = 1/25.
Since these are not equal X and Y are not independent. Finally we compute covariance
using Property 4:
Cov(X, Y) = \frac{1}{5}(−8 − 1 + 1 + 8) − µX µY = 0.
Discussion: This example shows that Cov(X, Y ) = 0 does not imply that X and Y are
independent. In fact, X and X 2 are as dependent as random variables can be: if you know
the value of X then you know the value of X 2 with 100% certainty.
The key point is that Cov(X, Y ) measures the linear relationship between X and Y . In
the above example X and X 2 have a quadratic relationship that is completely missed by
Cov(X, Y ).
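A quick simulation (our own, not in the notes) makes the same point empirically: the sample covariance of X and X^2 is near 0 even though Y = X^2 is completely determined by X.

```r
# X uniform on {-2, -1, 0, 1, 2}; Y = X^2 is a deterministic function of X,
# yet covariance and correlation are 0 because the relationship is not linear.
set.seed(1)
x <- sample(c(-2, -1, 0, 1, 2), size = 1e5, replace = TRUE)
y <- x^2
cov(x, y)   # close to 0
cor(x, y)   # close to 0
```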

2.4 Proofs of the properties of covariance

1 and 2 follow from similar properties for expected value.


3. This is the definition of variance:

Cov(X, X) = E((X − µX )(X − µX )) = E((X − µX )2 ) = Var(X).

4. Recall that E(X − µx ) = 0. So

Cov(X, Y ) = E((X − µX )(Y − µY ))


= E(XY − µX Y − µY X + µX µY )
= E(XY ) − µX E(Y ) − µY E(X) + µX µY
= E(XY ) − µX µY − µX µY + µX µY
= E(XY ) − µX µY .

5. Using properties 3 and 2 we get

Var(X+Y ) = Cov(X+Y, X+Y ) = Cov(X, X)+2Cov(X, Y )+Cov(Y, Y ) = Var(X)+Var(Y )+2Cov(X, Y ).

6. If X and Y are independent then f(x, y) = fX(x)fY(y). Therefore

Cov(X, Y) = \int \int (x − µX)(y − µY) fX(x) fY(y) dx dy
          = \int (x − µX) fX(x) dx \cdot \int (y − µY) fY(y) dy
          = E(X − µX) E(Y − µY)
          = 0.

3 Correlation

The units of covariance Cov(X, Y ) are ‘units of X times units of Y ’. This makes it hard to
compare covariances: if we change scales then the covariance changes as well. Correlation
is a way to remove the scale from the covariance.
Definition: The correlation coefficient between X and Y is defined by

Cor(X, Y) = ρ = \frac{Cov(X, Y)}{σX σY}.

3.1 Properties of correlation

1. ρ is the covariance of the standardizations of X and Y .

2. ρ is dimensionless (it’s a ratio!).

3. −1 ≤ ρ ≤ 1. Furthermore,
ρ = +1 if and only if Y = aX + b with a > 0,
ρ = −1 if and only if Y = aX + b with a < 0.

Property 3 shows that ρ measures the linear relationship between variables. If the correlation is positive then when X is large, Y will tend to be large as well. If the correlation is negative then when X is large, Y will tend to be small.

Example 2 above shows that correlation can completely miss higher order relationships.

3.2 Proof of Property 3 of correlation

(This is for the mathematically interested.)


       
0 ≤ Var\left( \frac{X}{σX} − \frac{Y}{σY} \right) = Var\left( \frac{X}{σX} \right) + Var\left( \frac{Y}{σY} \right) − 2 Cov\left( \frac{X}{σX}, \frac{Y}{σY} \right) = 2 − 2ρ  ⇒  ρ ≤ 1.

Likewise 0 ≤ Var\left( \frac{X}{σX} + \frac{Y}{σY} \right)  ⇒  −1 ≤ ρ.

If ρ = 1 then 0 = Var\left( \frac{X}{σX} − \frac{Y}{σY} \right)  ⇒  \frac{X}{σX} − \frac{Y}{σY} = c.

Example. We continue Example 1. To compute the correlation we divide the covariance by the standard deviations. In Example 1 we found Cov(X, Y) = 1/4 and Var(X) = 2Var(Xj) = 1/2. So σX = 1/\sqrt{2}. Likewise σY = 1/\sqrt{2}. Thus

Cor(X, Y) = \frac{Cov(X, Y)}{σX σY} = \frac{1/4}{1/2} = \frac{1}{2}.

We see a positive correlation, which means that larger X tend to go with larger Y and
smaller X with smaller Y . In Example 1 this happens because toss 2 is included in both X
and Y , so it contributes to the size of both.
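A simulation check of this value (our addition, not part of the notes):

```r
# Simulate Example 1: X = heads in tosses 1-2, Y = heads in tosses 2-3.
set.seed(2)
n    <- 1e5
toss <- matrix(rbinom(3 * n, size = 1, prob = 0.5), ncol = 3)
X    <- toss[, 1] + toss[, 2]
Y    <- toss[, 2] + toss[, 3]
cov(X, Y)   # about 0.25
cor(X, Y)   # about 0.5
```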

3.3 Bivariate normal distributions

The bivariate normal distribution has density

f(x, y) = \frac{1}{2π σX σY \sqrt{1 − ρ^2}} \exp\left( \frac{−1}{2(1 − ρ^2)} \left[ \frac{(x − µX)^2}{σX^2} + \frac{(y − µY)^2}{σY^2} − \frac{2ρ(x − µX)(y − µY)}{σX σY} \right] \right)

For this distribution, the marginal distributions for X and Y are normal and the correlation
between X and Y is ρ.
In the figures below we used R to simulate the distribution for various values of ρ. Individ-
ually X and Y are standard normal, i.e. µX = µY = 0 and σX = σY = 1. The figures show
scatter plots of the results.
These plots and the next set show an important feature of correlation. We divide the data
into quadrants by drawing a horizontal and a vertical line at the means of the y data and
x data respectively. A positive correlation corresponds to the data tending to lie in the 1st
and 3rd quadrants. A negative correlation corresponds to data tending to lie in the 2nd
and 4th quadrants. You can see the data gathering about a line as ρ becomes closer to ±1.
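The exact plotting code is not reproduced in the notes; a minimal sketch of the kind of simulation behind the scatter plots (our own, assuming the MASS package is available) is:

```r
# Simulate a standard bivariate normal with correlation rho and plot it.
library(MASS)
set.seed(3)
rho   <- 0.7
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
xy    <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)
plot(xy[, 1], xy[, 2], xlab = "x", ylab = "y", main = paste("rho =", rho))
cor(xy[, 1], xy[, 2])   # sample correlation near 0.7
```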

[Scatter plots of the simulated samples for rho = 0.00, 0.30, 0.70, 1.00, −0.50 and −0.90. Axes are x and y; each panel is titled by its value of rho.]

3.4 Overlapping uniform distributions

We ran simulations in R of the following scenario. X1, X2, . . . , X20 are i.i.d. and follow a
U (0, 1) distribution. X and Y are both sums of the same number of Xi . We call the number
of Xi common to both X and Y the overlap. The notation in the figures below indicates
the number of Xi being summed and the number which overlap. For example, 5,3 indicates
that X and Y were each the sum of 5 of the Xi and that 3 of the Xi were common to both
sums. (The data was generated using rand(1,1000);)
Using the linearity of covariance it is easy to compute the theoretical correlation. For
each plot we give both the theoretical correlation and the correlation of the data from the
simulated sample.
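By the linearity of covariance, Cov(X, Y) is (overlap) · Var(Xi) while Var(X) = Var(Y) = (number summed) · Var(Xi), so the theoretical correlation is overlap/(number summed), which matches the cor values in the plot titles below. A sketch of such a simulation in R (our own version; the rand(1,1000) call quoted above is Matlab-style, here we use runif):

```r
# X sums the first n_sum of twenty U(0,1) variables; Y sums n_sum of them,
# chosen so that exactly n_overlap are shared with X.
overlap_cor <- function(n_sum, n_overlap, n_trials = 1000) {
  X <- Y <- numeric(n_trials)
  for (i in 1:n_trials) {
    u    <- runif(20)
    X[i] <- sum(u[1:n_sum])
    Y[i] <- sum(u[(n_sum - n_overlap + 1):(2 * n_sum - n_overlap)])
  }
  cor(X, Y)
}

set.seed(4)
overlap_cor(5, 3)    # theoretical correlation 3/5 = 0.6
overlap_cor(10, 8)   # theoretical correlation 8/10 = 0.8
```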

[Scatter plots: (1, 0) cor=0.00, sample_cor=−0.07; (2, 1) cor=0.50, sample_cor=0.48; (5, 1) cor=0.20, sample_cor=0.21; (5, 3) cor=0.60, sample_cor=0.63. Axes are x and y.]

[Scatter plots: (10, 5) cor=0.50, sample_cor=0.53; (10, 8) cor=0.80, sample_cor=0.81. Axes are x and y.]
Class 8 Review Problems
18.05, Spring 2014

1 Counting and Probability

1. (a) How many ways can you arrange the letters in the word STATISTICS? (e.g.
SSSTTTIIAC counts as one arrangement.)
(b) If all arrangements are equally likely, what is the probability that the two I's are next to each other?

2 Conditional Probability and Bayes’ Theorem

2. Corrupted by their power, the judges running the popular game show America’s Next
Top Mathematician have been taking bribes from many of the contestants. Each episode,
a given contestant is either allowed to stay on the show or is kicked off.
If the contestant has been bribing the judges she will be allowed to stay with probability 1.
If the contestant has not been bribing the judges, she will be allowed to stay with probability
1/3.
Over two rounds, suppose that 1/4 of the contestants have been bribing the judges. The
same contestants bribe the judges in both rounds, i.e., if a contestant bribes them in the
first round, she bribes them in the second round too (and vice versa).
(a) If you pick a random contestant who was allowed to stay during the first episode, what
is the probability that she was bribing the judges?
(b) If you pick a random contestant, what is the probability that she is allowed to stay
during both of the first two episodes?
(c) If you pick a random contestant who was allowed to stay during the first episode, what
is the probability that she gets kicked off during the second episode?

3 Independence

3. You roll a twenty-sided die. Determine whether the following pairs of events are
independent.
(a) ‘You roll an even number’ and ‘You roll a number less than or equal to 10’.
(b) ‘You roll an even number’ and ‘You roll a prime number’.

4 Expectation and Variance

4. The random variable X takes values -1, 0, 1 with probabilities 1/8, 2/8, 5/8 respectively.
(a) Compute E(X).


(b) Give the pmf of Y = X 2 and use it to compute E(Y ).


(c) Instead, compute E(X 2 ) directly from an extended table.
(d) Compute Var(X).

5. Compute the expectation and variance of a Bernoulli(p) random variable.

6. Suppose 100 people all toss a hat into a box and then proceed to randomly pick out a
hat. What is the expected number of people who get their own hat back?
Hint: express the number of people who get their own hat as a sum of random variables
whose expected value is easy to compute.

5 Probability Mass Functions, Probability Density Functions


and Cumulative Distribution Functions

7. (a) Suppose that X has probability density function fX(x) = λe^{−λx} for x ≥ 0. Compute the cdf, FX(x).
(b) If Y = X 2 , compute the pdf and cdf of Y.

8. Suppose you roll a fair 6-sided die 100 times (independently), and you get $3 every
time you roll a 6. Let X1 be the number of dollars you win on rolls 1 through 25.
Let X2 be the number of dollars you win on rolls 26 through 50.
Let X3 be the number of dollars you win on rolls 51 through 75.
Let X4 be the number of dollars you win on rolls 76 through 100.
Let X = X1 + X2 + X3 + X4 be the total number of dollars you win over all 100 rolls.
(a) What is the probability mass function of X?
(b) What is the expectation and variance of X?
(c) Let Y = 4X1 . (So instead of rolling 100 times, you just roll 25 times and multiply your
winnings by 4.) (i) What are the expectation and variance of Y ?
(ii) How do the expectation and variance of Y compare to those of X? (I.e., are they bigger,
smaller, or equal?) Explain (briefly) why this makes sense.

6 Joint Probability, Covariance, Correlation

9. (Arithmetic Puzzle) The joint and marginal pmf’s of X and Y are partly given in
the following table.

X\Y 1 2 3
1 1/6 0 ... 1/3
2 ... 1/4 ... 1/3
3 ... ... 1/4 ...
  1/6 1/3 ... 1
(a) Complete the table.

(b) Are X and Y independent?

10. Covariance and Independence


Let X be a random variable that takes values -2, -1, 0, 1, 2; each with probability 1/5.
Let Y = X 2 .
(a) Fill out the following table giving the joint frequency function for X and Y . Be sure
to include the marginal probabilities.
Y\X -2 -1 0 1 2 total
0
1
4
total
(b) Find E(X) and E(Y ).
(c) Show X and Y are not independent.
(d) Show Cov(X, Y ) = 0.
This is an example of uncorrelated but non-independent random variables. The reason
this can happen is that correlation only measures the linear dependence between the two
variables. In this case, X and Y are not at all linearly related.

11. Continuous Joint Distributions


Suppose X and Y are continuous random variables with joint density function f (x, y) = x+y
on the unit square [0, 1] × [0, 1].
(a) Let F (x, y) be the joint CDF. Compute F (1, 1). Compute F (x, y).
(b) Compute the marginal densities for X and Y .
(c) Are X and Y independent?
(d) Compute E(X), E(Y), E(X^2 + Y^2), Cov(X, Y).

7 Law of Large Numbers, Central Limit Theorem

12. Suppose X1, . . . , X100 are i.i.d. with mean 1/5 and variance 1/9. Use the central limit theorem to estimate P(\sum Xi < 30).

13. (More Central Limit Theorem)


The average IQ in a population is 100 with standard deviation 15 (by definition, IQ is
normalized so this is the case). What is the probability that a randomly selected group of
100 people has an average IQ above 115?

Standard normal table of left tail probabilities: Φ(z) = P(Z ≤ z) for N(0, 1).
(Use interpolation to estimate z values to a 3rd decimal place.)

z Φ(z) z Φ(z) z Φ(z) z Φ(z)
-4.00 0.0000 -2.00 0.0228 0.00 0.5000 2.00 0.9772
-3.95 0.0000 -1.95 0.0256 0.05 0.5199 2.05 0.9798
-3.90 0.0000 -1.90 0.0287 0.10 0.5398 2.10 0.9821
-3.85 0.0001 -1.85 0.0322 0.15 0.5596 2.15 0.9842
-3.80 0.0001 -1.80 0.0359 0.20 0.5793 2.20 0.9861
-3.75 0.0001 -1.75 0.0401 0.25 0.5987 2.25 0.9878
-3.70 0.0001 -1.70 0.0446 0.30 0.6179 2.30 0.9893
-3.65 0.0001 -1.65 0.0495 0.35 0.6368 2.35 0.9906
-3.60 0.0002 -1.60 0.0548 0.40 0.6554 2.40 0.9918
-3.55 0.0002 -1.55 0.0606 0.45 0.6736 2.45 0.9929
-3.50 0.0002 -1.50 0.0668 0.50 0.6915 2.50 0.9938
-3.45 0.0003 -1.45 0.0735 0.55 0.7088 2.55 0.9946
-3.40 0.0003 -1.40 0.0808 0.60 0.7257 2.60 0.9953
-3.35 0.0004 -1.35 0.0885 0.65 0.7422 2.65 0.9960
-3.30 0.0005 -1.30 0.0968 0.70 0.7580 2.70 0.9965
-3.25 0.0006 -1.25 0.1056 0.75 0.7734 2.75 0.9970
-3.20 0.0007 -1.20 0.1151 0.80 0.7881 2.80 0.9974
-3.15 0.0008 -1.15 0.1251 0.85 0.8023 2.85 0.9978
-3.10 0.0010 -1.10 0.1357 0.90 0.8159 2.90 0.9981
-3.05 0.0011 -1.05 0.1469 0.95 0.8289 2.95 0.9984
-3.00 0.0013 -1.00 0.1587 1.00 0.8413 3.00 0.9987
-2.95 0.0016 -0.95 0.1711 1.05 0.8531 3.05 0.9989
-2.90 0.0019 -0.90 0.1841 1.10 0.8643 3.10 0.9990
-2.85 0.0022 -0.85 0.1977 1.15 0.8749 3.15 0.9992
-2.80 0.0026 -0.80 0.2119 1.20 0.8849 3.20 0.9993
-2.75 0.0030 -0.75 0.2266 1.25 0.8944 3.25 0.9994
-2.70 0.0035 -0.70 0.2420 1.30 0.9032 3.30 0.9995
-2.65 0.0040 -0.65 0.2578 1.35 0.9115 3.35 0.9996
-2.60 0.0047 -0.60 0.2743 1.40 0.9192 3.40 0.9997
-2.55 0.0054 -0.55 0.2912 1.45 0.9265 3.45 0.9997
-2.50 0.0062 -0.50 0.3085 1.50 0.9332 3.50 0.9998
-2.45 0.0071 -0.45 0.3264 1.55 0.9394 3.55 0.9998
-2.40 0.0082 -0.40 0.3446 1.60 0.9452 3.60 0.9998
-2.35 0.0094 -0.35 0.3632 1.65 0.9505 3.65 0.9999
-2.30 0.0107 -0.30 0.3821 1.70 0.9554 3.70 0.9999
-2.25 0.0122 -0.25 0.4013 1.75 0.9599 3.75 0.9999
-2.20 0.0139 -0.20 0.4207 1.80 0.9641 3.80 0.9999
-2.15 0.0158 -0.15 0.4404 1.85 0.9678 3.85 0.9999
-2.10 0.0179 -0.10 0.4602 1.90 0.9713 3.90 1.0000
-2.05 0.0202 -0.05 0.4801 1.95 0.9744 3.95 1.0000
Class 8 Review Problems – Solutions, 18.05, Spring 2014

1 Counting and Probability

1. (a) Create an arrangement in stages and count the number of possibilities at each stage:
Stage 1: Choose three of the 10 slots to put the S's: \binom{10}{3}
Stage 2: Choose three of the remaining 7 slots to put the T's: \binom{7}{3}
Stage 3: Choose two of the remaining 4 slots to put the I's: \binom{4}{2}
Stage 4: Choose one of the remaining 2 slots to put the A: \binom{2}{1}
Stage 5: Use the last slot for the C: \binom{1}{1}
Number of arrangements:

\binom{10}{3} \binom{7}{3} \binom{4}{2} \binom{2}{1} \binom{1}{1} = 50400.
 
(b) There are \binom{10}{2} = 45 equally likely ways to place the two I's.
There are 9 ways to place them next to each other, i.e. in slots 1 and 2, slots 2 and 3, . . . ,
slots 9 and 10.
So the probability the I’s are adjacent is 9/45 = 0.2.
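A quick R check of both answers (our addition, not part of the solutions):

```r
# (a) arrangements of STATISTICS
choose(10, 3) * choose(7, 3) * choose(4, 2) * choose(2, 1) * choose(1, 1)   # 50400

# (b) probability the two I's are adjacent
9 / choose(10, 2)   # 0.2
```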

2 Conditional Probability and Bayes’ Theorem

2. The following tree shows the setting. Stay1 means the contestant was allowed to stay
during the first episode and stay2 means they were allowed to stay during the second.

Bribe (prob. 1/4):
    Stay1 (prob. 1), Leave1 (prob. 0)
    given Stay1: Stay2 (prob. 1), Leave2 (prob. 0)
Honest (prob. 3/4):
    Stay1 (prob. 1/3), Leave1 (prob. 2/3)
    given Stay1: Stay2 (prob. 1/3), Leave2 (prob. 2/3)

Let’s name the relevant events:


B = the contestant is bribing the judges
H = the contestant is honest (not bribing the judges)
S1 = the contestant was allowed to stay during the first episode
S2 = the contestant was allowed to stay during the second episode


L1 = the contestant was asked to leave during the first episode


L2 = the contestant was asked to leave during the second episode

(a) We first compute P(S1) using the law of total probability.

P(S1) = P(S1|B)P(B) + P(S1|H)P(H) = 1 \cdot \frac{1}{4} + \frac{1}{3} \cdot \frac{3}{4} = \frac{1}{2}.

We therefore have (by Bayes' rule) P(B|S1) = P(S1|B) \frac{P(B)}{P(S1)} = 1 \cdot \frac{1/4}{1/2} = \frac{1}{2}.

(b) Using the tree, the total probability of S2 is

P(S2) = \frac{1}{4} + \frac{3}{4} \cdot \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{3}.

(c) We want to compute P(L2|S1) = \frac{P(L2 ∩ S1)}{P(S1)}.

From the calculation we did in part (a), P(S1) = 1/2. For the numerator, we have (see the tree)

P(L2 ∩ S1) = P(L2 ∩ S1|B)P(B) + P(L2 ∩ S1|H)P(H) = 0 \cdot \frac{1}{4} + \frac{2}{9} \cdot \frac{3}{4} = \frac{1}{6}.

Therefore P(L2|S1) = \frac{1/6}{1/2} = \frac{1}{3}.
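These three answers can be checked by simulation (a sketch we added, not part of the original solutions):

```r
# 1/4 of contestants bribe; bribers always stay, honest contestants stay w.p. 1/3 each episode.
set.seed(5)
n     <- 1e6
bribe <- runif(n) < 1/4
stay1 <- ifelse(bribe, TRUE, runif(n) < 1/3)
stay2 <- ifelse(bribe, TRUE, runif(n) < 1/3)

mean(bribe[stay1])    # (a) P(B | S1)    ~ 1/2
mean(stay1 & stay2)   # (b) P(S1 and S2) ~ 1/3
mean(!stay2[stay1])   # (c) P(L2 | S1)   ~ 1/3
```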

3 Independence

3. E = even numbered = {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}.


L = roll ≤ 10 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
B = roll is prime = {2, 3, 5, 7, 11, 13, 17, 19} (We use B because P is not a good choice.)
(a) P (E) = 10/20, P (E|L) = 5/10. These are the same, so the events are independent.
(b) P (E) = 10/20. P (E|B) = 1/8. These are not the same so the events are not indepen-
dent.

4 Expectation and Variance

4. (a) We have

X values: -1 0 1
prob: 1/8 2/8 5/8
X2 1 0 1
So, E(X) = −1/8 + 5/8 = 1/2.
(b)
Y values: 0 1
prob: 2/8 6/8
⇒ E(Y) = 6/8 = 3/4.

(c) The change of variables formula just says to use the bottom row of the table in part
(a): E(X 2 ) = 1 · (1/8) + 0 · (2/8) + 1 · (5/8) = 3/4 (same as part (b)).
(d) Var(X) = E(X 2 ) − E(X)2 = 3/4 − 1/4 = 1/2.

5. Make a table
X: 0 1
prob: (1-p) p
X2 0 1.
From the table, E(X) = 0 · (1 − p) + 1 · p = p.
Since X and X 2 have the same table E(X 2 ) = E(X) = p.
Therefore, Var(X) = p − p2 = p(1 − p).

6. Let X be the number of people who get their own hat.


Following the hint: let Xj represent whether person j gets their own hat. That is, Xj = 1
if person j gets their hat and 0 if not.
We have X = \sum_{j=1}^{100} Xj, so E(X) = \sum_{j=1}^{100} E(Xj).
Since person j is equally likely to get any hat, we have P (Xj = 1) = 1/100. Thus, Xj ∼
Bernoulli(1/100) ⇒ E(Xj ) = 1/100 ⇒ E(X) = 1.
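A simulation (our addition) agrees with E(X) = 1:

```r
# Each trial: a random permutation assigns the 100 hats; count people who get their own hat.
set.seed(6)
fixed_points <- replicate(10000, sum(sample(100) == 1:100))
mean(fixed_points)   # close to 1
```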

5 Probability Mass Functions, Probability Density Functions


and Cumulative Distribution Functions

7. (a) We have the cdf of X:

FX(x) = \int_0^x λ e^{−λ x} dx = 1 − e^{−λ x}.

(b) Now for y ≥ 0, we have

FY(y) = P(Y ≤ y) = P(X^2 ≤ y) = P(X ≤ \sqrt{y}) = 1 − e^{−λ \sqrt{y}}.

Differentiating FY(y) with respect to y, we have

fY(y) = \frac{λ}{2} y^{−1/2} e^{−λ \sqrt{y}}.

8. (a) There are a number of ways to present this.


Let T be the total number of times you roll a 6 in the 100 rolls. We know T ∼ Binomial(100, 1/6).
Since you win $3 every time you roll a 6, we have X = 3T. So, we can write

P(X = 3k) = \binom{100}{k} \left( \frac{1}{6} \right)^k \left( \frac{5}{6} \right)^{100−k}, for k = 0, 1, 2, . . . , 100.

Alternatively we could write

P(X = x) = \binom{100}{x/3} \left( \frac{1}{6} \right)^{x/3} \left( \frac{5}{6} \right)^{100−x/3}, for x = 0, 3, 6, . . . , 300.

(b) E(X) = E(3T) = 3E(T) = 3 · 100 · \frac{1}{6} = 50,

Var(X) = Var(3T) = 9Var(T) = 9 · 100 · \frac{1}{6} · \frac{5}{6} = 125.
(c) (i) Let T1 be the total number of times you roll a 6 in the first 25 rolls. So, X1 = 3T1 and Y = 12T1.
Now, T1 ∼ Binomial(25, 1/6), so

E(Y) = 12E(T1) = 12 · 25 · \frac{1}{6} = 50

and

Var(Y) = 144Var(T1) = 144 · 25 · \frac{1}{6} · \frac{5}{6} = 500.
(ii) The expectations are the same: by linearity, E(X) = 3 · 100 · \frac{1}{6} = 50 and E(Y) = 12 · 25 · \frac{1}{6} = 50.
For the variance, Var(X) = 4Var(X1 ) because X is the sum of 4 independent variables all
identical to X1 . However Var(Y ) = Var(4X1 ) = 16Var(X1 ). So, the variance of Y is 4
times that of X. This should make some intuitive sense because X is built out of more
independent trials than X1 .
Another way of thinking about it is that the difference between Y and its expectation is
four times the difference between X1 and its expectation. However, the difference between
X and its expectation is the sum of such a difference for X1, X2, X3, and X4. It's probably
the case that some of these deviations are positive and some are negative, so the absolute
value of this difference for the sum is probably less than four times the absolute value of this
difference for one of the variables. (I.e., the deviations are likely to cancel to some extent.)
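A quick numerical check of parts (b) and (c) (our own sketch, not part of the solutions):

```r
# X = 3 * Binomial(100, 1/6), Y = 12 * Binomial(25, 1/6).
set.seed(7)
X <- 3  * rbinom(1e6, size = 100, prob = 1/6)
Y <- 12 * rbinom(1e6, size = 25,  prob = 1/6)
c(mean(X), var(X))   # about 50 and 125
c(mean(Y), var(Y))   # about 50 and 500
```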

6 Joint Probability, Covariance, Correlation

9. (Arithmetic Puzzle) (a) The marginal probabilities have to add up to 1, so the two
missing marginal probabilities can be computed: P (X = 3) = 1/3, P (Y = 3) = 1/2. Now
each row and column has to add up to its respective margin. For example, 1/6 + 0 + P (X =
1, Y = 3) = 1/3, so P (X = 1, Y = 3) = 1/6. Here is the completed table.
X\Y 1 2 3
1 1/6 0 1/6 1/3
2 0 1/4 1/12 1/3
3 0 1/12 1/4 1/3
1/6 1/3 1/2 1

(b) No, X and Y are not independent.


For example, P(X = 2, Y = 1) = 0 ≠ P(X = 2) · P(Y = 1).

10. Covariance and Independence


(a)
Y\X -2 -1 0 1 2
0 0 0 1/5 0 0 1/5
1 0 1/5 0 1/5 0 2/5
4 1/5 0 0 0 1/5 2/5
1/5 1/5 1/5 1/5 1/5 1
Each column has only one nonzero value. For example, when X = −2 then Y = 4, so in
the X = −2 column, only P (X = −2, Y = 4) is not 0.
(b) Using the marginal distributions: E(X) = \frac{1}{5}(−2 − 1 + 0 + 1 + 2) = 0.

E(Y) = 0 · \frac{1}{5} + 1 · \frac{2}{5} + 4 · \frac{2}{5} = 2.
(c) We show the probabilities don’t multiply:
P(X = −2, Y = 0) = 0 ≠ P(X = −2) · P(Y = 0) = 1/25.
Since these are not equal X and Y are not independent. (It is obvious that X 2 is not
independent of X.)
(d) Using the table from part (a) and the means computed in part (b) we get:

Cov(X, Y) = E(XY) − E(X)E(Y)
          = \frac{1}{5}\left[ (−2)(4) + (−1)(1) + (0)(0) + (1)(1) + (2)(4) \right] − 0 · 2
          = 0.

11. Continuous Joint Distributions

(a) F(a, b) = P(X ≤ a, Y ≤ b) = \int_0^a \int_0^b (x + y) dy dx.

Inner integral: \left[ xy + \frac{y^2}{2} \right]_0^b = xb + \frac{b^2}{2}.  Outer integral: \left[ \frac{x^2}{2} b + \frac{b^2}{2} x \right]_0^a = \frac{a^2 b + ab^2}{2}.

So F(x, y) = \frac{x^2 y + xy^2}{2} and F(1, 1) = 1.

(b) fX(x) = \int_0^1 f(x, y) dy = \int_0^1 (x + y) dy = \left[ xy + \frac{y^2}{2} \right]_0^1 = x + \frac{1}{2}.

By symmetry, fY(y) = y + 1/2.
(c) To see if they are independent we check if the joint density is the product of the
marginal densities.
f (x, y) = x + y, fX (x) · fY (y) = (x + 1/2)(y + 1/2).
Since these are not equal, X and Y are not independent.
(d) E(X) = \int_0^1 \int_0^1 x(x + y) dy dx = \int_0^1 \left[ x^2 y + \frac{x y^2}{2} \right]_0^1 dx = \int_0^1 \left( x^2 + \frac{x}{2} \right) dx = \frac{7}{12}.

(Or, using (b), E(X) = \int_0^1 x fX(x) dx = \int_0^1 x(x + 1/2) dx = 7/12.)

By symmetry E(Y) = 7/12.

E(X^2 + Y^2) = \int_0^1 \int_0^1 (x^2 + y^2)(x + y) dy dx = \frac{5}{6}.

E(XY) = \int_0^1 \int_0^1 xy(x + y) dy dx = \frac{1}{3}.

Cov(X, Y) = E(XY) − E(X)E(Y) = \frac{1}{3} − \frac{49}{144} = −\frac{1}{144}.
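The moments in part (d) can be checked by Monte Carlo, sampling from f(x, y) = x + y by rejection (a sketch we added; f ≤ 2 on the unit square, so a uniform point is accepted with probability (x + y)/2):

```r
set.seed(8)
n <- 2e6
x <- runif(n); y <- runif(n); u <- runif(n)
keep <- u < (x + y) / 2        # rejection step
x <- x[keep]; y <- y[keep]

mean(x)                            # E(X) = 7/12 ~ 0.583
mean(x^2 + y^2)                    # E(X^2 + Y^2) = 5/6 ~ 0.833
mean(x * y) - mean(x) * mean(y)    # Cov(X, Y) = -1/144 ~ -0.0069
```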

7 Law of Large Numbers, Central Limit Theorem

12. Standardize:

P\left( \sum_i Xi < 30 \right) = P\left( \frac{\frac{1}{n}\sum Xi − µ}{σ/\sqrt{n}} < \frac{30/n − µ}{σ/\sqrt{n}} \right)
≈ P\left( Z < \frac{30/100 − 1/5}{1/30} \right)   (by the central limit theorem)
= P(Z < 3)
= 0.9987   (from the table of normal probabilities)

13. (More Central Limit Theorem)


Let Xj be the IQ of a randomly selected person. We are given E(Xj ) = 100 and σXj = 15.
Let X be the average of the IQ’s of 100 randomly selected people. Then we know

E(X) = 100 and σX = 15/\sqrt{100} = 1.5.

The problem asks for P (X > 115). Standardizing we get P (X > 115) ≈ P (Z > 10).
This is effectively 0.
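Both tail probabilities can be confirmed in R (our addition):

```r
pnorm(3)                               # Problem 12: P(Z < 3) = 0.9987
1 - pnorm(115, mean = 100, sd = 1.5)   # Problem 13: P(average IQ > 115), essentially 0
```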
